E-pRSA

Embeddings improve the prediction of residue Relative Solvent Accessibility in protein sequence.
Part of the Bioinformatics Sweeties collection.

Datasets


Download training set
Download blind test sets used for benchmark
Download cross-validation split

Both the training and the blind test sets are available as TSV files containing six columns:

  • UniProt: The UniProt ID of the protein
  • PDB: The PDB chain that was used to compute RSA values
  • Pos: A progressive numbering of the residues
  • Res: The residue type
  • Class: Classification of the residue. 0: buried residue (RSA < 20%). 1: exposed residue (RSA >= 20%). -1: residue missing from the PDB file, and therefore lacking a computed RSA. -2: neighbour of a residue in the -1 class, whose RSA therefore cannot be estimated correctly.
  • RSA: Computed RSA. A real number from 0 (completely buried) to 1 (maximally exposed), or a negative value (-1 or -2) with the same meaning as in the Class column
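As an illustration of the column layout described above, here is a minimal Python sketch that reads such a file and keeps only residues with a usable label (Class 0 or 1). It assumes the columns appear in the order listed and that the file starts with a header row; neither is confirmed here, so check the downloaded files first. The function names and the example rows are hypothetical and are not taken from the dataset.

```python
import csv
import io

# Assumed column order, taken from the list above; the real files may differ.
COLUMNS = ["UniProt", "PDB", "Pos", "Res", "Class", "RSA"]


def read_rsa_rows(handle, has_header=True):
    """Yield one dict per residue from a tab-separated file handle."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the assumed header row
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        row = dict(zip(COLUMNS, fields))
        row["Pos"] = int(row["Pos"])
        row["Class"] = int(row["Class"])
        row["RSA"] = float(row["RSA"])
        yield row


def labelled_only(rows):
    """Keep residues labelled buried (0) or exposed (1); drop -1 and -2."""
    return [r for r in rows if r["Class"] in (0, 1)]


# Made-up example rows for illustration only.
example = (
    "UniProt\tPDB\tPos\tRes\tClass\tRSA\n"
    "P00000\t1abc_A\t1\tM\t-1\t-1\n"
    "P00000\t1abc_A\t2\tK\t-2\t-2\n"
    "P00000\t1abc_A\t3\tL\t0\t0.05\n"
    "P00000\t1abc_A\t4\tE\t1\t0.62\n"
)
rows = list(read_rsa_rows(io.StringIO(example)))
usable = labelled_only(rows)
print(len(rows), len(usable))
```

To read a real file, pass an open file handle instead of the `io.StringIO` object, and set `has_header=False` if the file turns out to have no header row.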

In the Blind_Test_Sets folder, all the blind test sets used for benchmarking (MM165, MM23, CASP12, CASP14) are available.

In the cross_validation folder, 11 text files (split0-split9 and test.txt) are available, each containing the protein IDs belonging to the corresponding subset used in cross-validation.
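One possible way to turn the split files into training/validation pairs is sketched below. It assumes each file lists one protein ID per line and that the subsets are used in the usual k-fold fashion (one subset held out per round); both are assumptions, not a description of the authors' exact protocol. The helper names and the file-name pattern in the comment are hypothetical.

```python
from pathlib import Path


def read_ids(path):
    """Return the non-empty lines of a text file as a list of protein IDs."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def k_fold_pairs(splits):
    """Yield (train_ids, validation_ids), holding out one subset per round."""
    for i, held_out in enumerate(splits):
        train = [pid for j, subset in enumerate(splits) if j != i for pid in subset]
        yield train, held_out


# With the real files this might look like (file names are an assumption):
#   splits = [read_ids(f"cross_validation/split{i}.txt") for i in range(10)]
# Made-up IDs are used here so the sketch runs on its own.
splits = [["A"], ["B", "C"], ["D"]]
folds = list(k_fold_pairs(splits))
for train, validation in folds:
    print(len(train), len(validation))
```

The IDs returned for each fold can then be used to select the matching rows of the training TSV by their UniProt column.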