SEGUID v2: Checksums for Linear, Circular, Single- and Double-Stranded Biological Sequences

Warning

This is work under construction! Please do not use this in production until we’ve officially released SEGUID v2.

The SEquence Globally Unique IDentifier (SEGUID) checksum (Babnigg and Giometti 2006) was introduced to provide a stable, unifying key for the same sequence across different databases, making it easier to link protein sequences between them. SEGUID v2 (Pereira et al. 2024) extends the original SEGUID method to also support double-stranded sequences (e.g. DNA) and circular sequences (e.g. proteins and double-stranded DNA).

Example: SEGUID v2 for circular double-stranded DNA

SEGUID v2 is designed to be invariant to (i) rotation and (ii) duality (see the figure above). No matter where we choose to “start” the circular dsDNA sequence, and no matter which strand we choose as the Watson strand, the resulting checksum is the same.

Python:

>>> from seguid import cdseguid

>>> cdseguid("TATGCCAA", "TTGGCATA")
'cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A'

## Same swapping Watson and Crick
>>> cdseguid("TTGGCATA", "TATGCCAA")
'cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A'

## Same rotating two basepairs (= minimal rotation by Watson)
>>> cdseguid("AATATGCC", "GGCATATT")
'cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A'

R:

> library(seguid)

> cdseguid("TATGCCAA", "TTGGCATA")
[1] "cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A"

## Same swapping Watson and Crick
> cdseguid("TTGGCATA", "TATGCCAA")
[1] "cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A"

## Same rotating two basepairs (= minimal rotation by Watson)
> cdseguid("AATATGCC", "GGCATATT")
[1] "cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A"

JavaScript (command line):

$ npx seguid --type=cdseguid <<< 'TATGCCAA;TTGGCATA'
cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A

## Same swapping Watson and Crick
$ npx seguid --type=cdseguid <<< 'TTGGCATA;TATGCCAA'
cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A

## Same rotating two basepairs (= minimal rotation by Watson)
$ npx seguid --type=cdseguid <<< 'AATATGCC;GGCATATT'
cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A

Python (command line):

$ python -m seguid --type=cdseguid <<< 'TATGCCAA;TTGGCATA'
cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A

## Same swapping Watson and Crick
$ python -m seguid --type=cdseguid <<< 'TTGGCATA;TATGCCAA'
cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A

## Same rotating two basepairs (= minimal rotation by Watson)
$ python -m seguid --type=cdseguid <<< 'AATATGCC;GGCATATT'
cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A

R (command line):

$ Rscript -e seguid::seguid --type=cdseguid <<< 'TATGCCAA;TTGGCATA'
cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A

## Same swapping Watson and Crick
$ Rscript -e seguid::seguid --type=cdseguid <<< 'TTGGCATA;TATGCCAA'
cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A

## Same rotating two basepairs (= minimal rotation by Watson)
$ Rscript -e seguid::seguid --type=cdseguid <<< 'AATATGCC;GGCATATT'
cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A

Tcl (command line):

$ tclsh seguid --type=cdseguid <<< 'TATGCCAA;TTGGCATA'
cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A

## Same swapping Watson and Crick
$ tclsh seguid --type=cdseguid <<< 'TTGGCATA;TATGCCAA'
cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A

## Same rotating two basepairs (= minimal rotation by Watson)
$ tclsh seguid --type=cdseguid <<< 'AATATGCC;GGCATATT'
cdseguid=dUxN7YQyVInv3oDcvz8ByupL44A
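
To illustrate why the three inputs above can share one checksum, here is a minimal Python sketch of one way to make a checksum invariant to rotation and strand choice: reduce the circular dsDNA to a canonical representative first, then hash it. This is not the seguid package's implementation and is not expected to reproduce its checksums. The helper names (rotate, min_rotation, canonical_pair, toy_checksum), the ";" separator, the restriction to the bases A, C, G and T, and the use of SHA-1 with unpadded base64url are all assumptions made for illustration.

```python
import base64
import hashlib

# Complement table for unambiguous DNA bases only (an assumption of this sketch).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rotate(seq: str, k: int) -> str:
    """Rotate seq to the left by k positions."""
    k %= len(seq)
    return seq[k:] + seq[:k]


def min_rotation(seq: str) -> str:
    """Return the lexicographically smallest rotation (brute force, O(n^2))."""
    return min(rotate(seq, k) for k in range(len(seq)))


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_pair(watson: str, crick: str) -> tuple[str, str]:
    """Pick one representative that does not depend on start point or strand order."""
    if not watson or crick != reverse_complement(watson):
        raise ValueError("strands must be non-empty and fully complementary")
    candidates = []
    for strand in (watson, crick):
        top = min_rotation(strand)  # removes the dependence on the start point
        candidates.append((top, reverse_complement(top)))
    return min(candidates)  # removes the dependence on which strand is 'Watson'


def toy_checksum(watson: str, crick: str) -> str:
    """Hash the canonical pair (hypothetical encoding, not a real cdseguid)."""
    top, bottom = canonical_pair(watson, crick)
    digest = hashlib.sha1(f"{top};{bottom}".encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


if __name__ == "__main__":
    # All three inputs describe the same circular molecule,
    # so the three printed values are identical.
    print(toy_checksum("TATGCCAA", "TTGGCATA"))
    print(toy_checksum("TTGGCATA", "TATGCCAA"))
    print(toy_checksum("AATATGCC", "GGCATATT"))
```

In this sketch all three inputs reduce to the same canonical pair, ("AATATGCC", "GGCATATT"), before hashing. The brute-force search for the minimal rotation is quadratic in the sequence length, which is fine for short examples but too slow for long sequences.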

Availability

Implementations of the above SEGUID methods are currently available for JavaScript, Python, R, and Tcl.

References

Babnigg, György, and Carol S Giometti. 2006. “A database of unique protein sequence identifiers for proteome studies.” Proteomics 6 (16): 4514–22. https://doi.org/10.1002/pmic.200600032.
Pereira, Humberto, Paulo César Silva, Wayne M Davis, Louis Abraham, Gyorgy Babnigg, Henrik Bengtsson, and Bjorn Johansson. 2024. “SEGUID v2: Extending SEGUID Checksums for Circular, Linear, Single- and Double-Stranded Biological Sequences.” bioRxiv. https://doi.org/10.1101/2024.02.28.582384.