The Embedded Alphabet (TEA)

[Figure: TEA model architecture]

This repository contains the code accompanying our preprint, Rewriting protein alphabets with language models. A web server with TEA-converted datasets is available here.

Installation

python -m pip install git+https://github.com/PickyBinders/tea.git
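
To check that the installation worked, you can, for example, import the package and print the path of the bundled substitution matrix (see the MMseqs2 section below):

python -c "from tea import get_matrix_path; print(get_matrix_path())"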

Sequence Conversion with TEA

The tea_convert command takes protein sequences from a FASTA file and generates a tea-FASTA file. It supports confidence-based sequence output, where low-confidence positions are written in lowercase, and has options for saving logits and entropy values. If --save_avg_entropy is set, the FASTA identifiers will contain the average entropy of the sequence in the format <key>|H=<avg_entropy>. An example invocation is shown after the option list below.

usage: tea_convert [-h] -f FASTA_FILE -o OUTPUT_FILE [-l] [-H] [-r] [-c] [-t ENTROPY_THRESHOLD]

options:
  -h, --help            show this help message and exit
  -f FASTA_FILE, --fasta_file FASTA_FILE
                        Input FASTA file containing protein amino acid sequences
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Output FASTA file for generated tea sequences
  -l, --save_logits     Save per-residue logits to .pt file
  -H, --save_avg_entropy
                        Save average entropy values in FASTA identifiers
  -r, --save_residue_entropy
                        Save per-residue entropy values to .pt file
  -c, --lowercase_entropy
                        Save residues with entropy > threshold in lowercase
  -t ENTROPY_THRESHOLD, --entropy_threshold ENTROPY_THRESHOLD
                        Entropy threshold for lowercase conversion
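
For example, the following invocation (the file names and threshold value are placeholders) converts proteins.fasta, stores the average entropy of each sequence in its FASTA header, and lowercases residues whose entropy exceeds 2.0:

tea_convert -f proteins.fasta -o proteins_tea.fasta -H -c -t 2.0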

Using the Hugging Face model

from tea.model import Tea
from transformers import AutoTokenizer, AutoModel
from transformers import BitsAndBytesConfig
import torch
import re

# Load the TEA model and determine which device it lives on
tea = Tea.from_pretrained("PickyBinders/tea")
tea.eval()
device = next(tea.parameters()).device

# Load the ESM-2 650M encoder in 4-bit precision to reduce memory usage
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
esm2 = AutoModel.from_pretrained(
        "facebook/esm2_t33_650M_UR50D",
        torch_dtype="auto",
        quantization_config=bnb_config,
        add_pooling_layer=False,
    ).to(device)
esm2.eval()

# Replace non-standard or ambiguous amino acids (U, Z, O, B, J) with X
# and space-separate the residues before tokenization
sequence_examples = ["PRTEINO", "SEQWENCE"]
sequence_examples = [" ".join(list(re.sub(r"[UZOBJ]", "X", sequence))) for sequence in sequence_examples]
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# Embed the sequences with ESM-2 and convert the embeddings to TEA sequences
with torch.no_grad():
    x = esm2(
        input_ids=input_ids, attention_mask=attention_mask
    ).last_hidden_state.to(device)
    results = tea.to_sequences(embeddings=x, input_ids=input_ids, return_avg_entropy=True, return_logits=False, return_residue_entropy=False)
results
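
For larger sequence sets, the same steps can be wrapped into a small helper that converts sequences in batches. This is only a minimal sketch reusing the objects defined above (tea, esm2, tokenizer, device); the helper name and batch_size are arbitrary choices, not part of the package:

# Convert sequences in small batches to keep GPU memory usage bounded
def convert_batch(sequences, batch_size=8):
    outputs = []
    for i in range(0, len(sequences), batch_size):
        batch = [" ".join(list(re.sub(r"[UZOBJ]", "X", s))) for s in sequences[i:i + batch_size]]
        ids = tokenizer.batch_encode_plus(batch, add_special_tokens=True, padding="longest")
        input_ids = torch.tensor(ids["input_ids"]).to(device)
        attention_mask = torch.tensor(ids["attention_mask"]).to(device)
        with torch.no_grad():
            emb = esm2(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
            outputs.append(tea.to_sequences(embeddings=emb, input_ids=input_ids,
                                            return_avg_entropy=True, return_logits=False,
                                            return_residue_entropy=False))
    return outputs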

Using tea sequences with MMseqs2

The matcha.out substitution matrix is included with the tea package. You can get its path programmatically:

from tea import get_matrix_path
matcha_path = get_matrix_path()
print(f"Matrix path: {matcha_path}")

Then use it with MMseqs2:

mmseqs easy-search tea_query.fasta tea_target.fasta results.m8 tmp/ \
    --comp-bias-corr 0 \
    --mask 0 \
    --gap-open 18 \
    --gap-extend 3 \
    --sub-mat /path/to/matcha.out \
    --seed-sub-mat /path/to/matcha.out \
    --exact-kmer-matching 1
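
If you prefer to drive the search from Python, a minimal sketch could pass the bundled matrix path to the same command via subprocess (this assumes mmseqs is on your PATH; the query/target file names are placeholders):

import subprocess
from tea import get_matrix_path

# Resolve the path of the bundled matcha.out matrix and run the MMseqs2 search
matcha_path = get_matrix_path()
subprocess.run([
    "mmseqs", "easy-search", "tea_query.fasta", "tea_target.fasta", "results.m8", "tmp/",
    "--comp-bias-corr", "0",
    "--mask", "0",
    "--gap-open", "18",
    "--gap-extend", "3",
    "--sub-mat", str(matcha_path),
    "--seed-sub-mat", str(matcha_path),
    "--exact-kmer-matching", "1",
], check=True)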
