E-pRSA

Embeddings improve the prediction of residue Relative Solvent Accessibility in protein sequence

Datasets


Download training set
Download blind test sets used for benchmark
Download cross-validation split

Both the training and the blind test sets are available as TSV files containing six columns:

  • UniProt: The UniProt ID of the protein
  • PDB: The PDB chain that was used to compute RSA values
  • Pos: A progressive numbering of the residues
  • Res: The residue type
  • Class: Classification of the residue. 0: buried residue (RSA < 20%). 1: exposed residue (RSA >= 20%). -1: residue missing in the PDB file, thus lacking a computed RSA. -2: neighbour of a residue belonging to the -1 class, whose RSA therefore cannot be estimated correctly.
  • RSA: Computed RSA. A real number from 0 (completely buried) to 1 (maximally exposed), or a negative value (-1 or -2) with the same meaning as in the Class column.
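
The column layout above can be read with a few lines of Python. The following is a minimal sketch, not part of the E-pRSA code: it assumes that the columns appear in the order listed, that the file starts with a header row, and that residues with a negative class should be skipped when training or evaluating. The function names and the toy rows are invented for illustration.

```python
import csv
import io

# Column order as listed in this README (assumed to match the files).
COLUMNS = ["UniProt", "PDB", "Pos", "Res", "Class", "RSA"]


def read_rsa_table(handle, has_header=True):
    """Parse tab-separated rows into dicts with typed Pos, Class and RSA."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row (assumption: one is present)
    rows = []
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        row["Pos"] = int(row["Pos"])
        row["Class"] = int(row["Class"])
        row["RSA"] = float(row["RSA"])
        rows.append(row)
    return rows


def labelled_residues(rows):
    """Keep only residues with a usable label (0 = buried, 1 = exposed)."""
    return [row for row in rows if row["Class"] in (0, 1)]


def class_from_rsa(rsa, threshold=0.20):
    """Recompute the binary class from an RSA value on the 0-1 scale."""
    return 0 if rsa < threshold else 1


# Invented toy rows, NOT taken from the real datasets.
TOY_TSV = (
    "UniProt\tPDB\tPos\tRes\tClass\tRSA\n"
    "P00000\t0xxx_A\t1\tM\t-1\t-1\n"
    "P00000\t0xxx_A\t2\tK\t-2\t-2\n"
    "P00000\t0xxx_A\t3\tL\t0\t0.05\n"
    "P00000\t0xxx_A\t4\tE\t1\t0.62\n"
)

rows = read_rsa_table(io.StringIO(TOY_TSV))
usable = labelled_residues(rows)
```

For a downloaded file, the same function can be called on an open file handle, for example `open(path, newline="")`, where `path` points to wherever the TSV was saved. If the real files have no header row, pass `has_header=False`.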

In the Blind_Test_Sets folder, all the blind test sets used for benchmark (MM165, MM23, CASP12, CASP14) are available.

In the cross_validation folder, 11 text files (split0-split9 and test.txt) are available, each containing the protein IDs belonging to the corresponding subset used in cross-validation.
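
The split files can be combined into training and validation folds along the following lines. This is a hedged sketch of standard k-fold rotation: it assumes each file holds one protein ID per line, and it does not claim to reproduce how the authors used the splits or test.txt. The helper names are hypothetical.

```python
from pathlib import Path


def read_ids(path):
    """Read one protein ID per line, ignoring blank lines (assumed format)."""
    text = Path(path).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]


def cross_validation_folds(splits):
    """Yield (train_ids, validation_ids), holding out one split at a time."""
    for i, held_out in enumerate(splits):
        train = [pid for j, split in enumerate(splits) if j != i for pid in split]
        yield train, list(held_out)


# Invented toy splits with placeholder IDs, NOT the real subsets.
toy_splits = [["P1", "P2"], ["P3"], ["P4", "P5"]]
folds = list(cross_validation_folds(toy_splits))
```

With the real data, `splits` would be built by calling `read_ids` on each of the ten split files, and the IDs in test.txt would be kept out of every fold.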