GO Terms Definition Embedding

NLU: State of the Art Text Mining in Python. The Simplicity of Python, the Power of Spark NLP.

Input File

GO_BP_Ontology_corups.tsv

Input Folder

None

Output File

GO_BP_Ontology_corups_Filtered.tsv, predictions_full_BP.pickle

Output Folder

pipe_full_BP

🐍Load Python libraries

import os
import nlu
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
from IPython.display import display, HTML

Make sure Java is loaded

!module load Java/1.8.0_202
!java -version

openjdk version "1.8.0_332"
OpenJDK Runtime Environment (build 1.8.0_332-b09)
OpenJDK 64-Bit Server VM (build 25.332-b09, mixed mode)

Set the JAVA_HOME and PATH environment variables

# JAVA_HOME may be unset in a fresh kernel, in which case reading it raises KeyError.
# Fall back to the cluster's Java module path if it is not already defined.
os.environ.setdefault("JAVA_HOME", "/share/apps/rc/software/Java/1.8.0_202")
# Put the JDK's bin directory first on PATH so Spark finds this Java.
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

print(os.environ["JAVA_HOME"])

/share/apps/rc/software/Java/1.8.0_202
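
Spark NLP needs a working JVM, so it can help to check what the kernel actually sees before loading a model. The following is a minimal sketch using only the standard library; the helper name `java_env_report` is hypothetical and not part of NLU or Spark NLP.

```python
import os
import shutil


def java_env_report():
    """Hypothetical helper: summarise how Java is visible to this process."""
    java_home = os.environ.get("JAVA_HOME")  # None instead of KeyError when unset
    return {
        "JAVA_HOME": java_home,
        "java_on_path": shutil.which("java"),  # None if no java executable is found
        "java_home_exists": bool(java_home) and os.path.isdir(java_home),
    }


report = java_env_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Using `os.environ.get` rather than `os.environ[...]` means a missing variable shows up as `None` in the report instead of stopping the cell.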

Load ontology file

df = pd.read_csv('GO_BP_Ontology_corups.tsv', sep="\t")
print(df.shape)
display(df.head())

(35228, 4)

	GO	Name	Definition	Depth
0	GO:0000001	mitochondrion inheritance.	The distribution of mitochondria, including th...	5
1	GO:0000002	mitochondrial genome maintenance.	The maintenance of the structure and integrity...	6
2	GO:0000003	reproduction.	The production of new individuals that contain...	1
3	GO:0000006	high-affinity zinc transmembrane transporter a...	Enables the transfer of zinc ions (Zn2+) from ...	6
4	GO:0000006	high-affinity zinc transmembrane transporter a...	In high-affinity transport the transporter is ...	6

Filter out GO terms that are not in the TAIR database, to focus only on plant-associated GO terms

with open("Auxiliary_Files/TAIR_GOTERMS.txt") as handle:
    TAIR_terms = handle.read().splitlines()
len(TAIR_terms)

Filter and save the GO term data

df = df[df.GO.isin(TAIR_terms)]
df.to_csv('GO_BP_Ontology_corups_Filtered.tsv', index=False, sep="\t")
df.shape

(5619, 4)
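
The filtering step can be illustrated on a toy table. This is a sketch only: the GO identifiers are taken from the preview above, but the allow-list `allowed` is made up for illustration and does not reflect the real contents of TAIR_GOTERMS.txt.

```python
import pandas as pd

# Toy stand-in for the ontology table: a GO term may appear on several rows.
toy = pd.DataFrame({
    "GO": ["GO:0000001", "GO:0000002", "GO:0000003", "GO:0000006", "GO:0000006"],
    "Depth": [5, 6, 1, 6, 6],
})

# Hypothetical allow-list; the real one is read from Auxiliary_Files/TAIR_GOTERMS.txt.
allowed = {"GO:0000002", "GO:0000006"}

mask = toy["GO"].isin(allowed)               # True for rows whose GO ID is allowed
filtered = toy[mask].reset_index(drop=True)  # optional: renumber rows 0..n-1

print(f"kept {mask.sum()} of {len(toy)} rows")
print(filtered)
```

Note that every row of a retained GO term is kept, so a term that appears on several rows contributes several rows to the filtered table.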

Load model and calculate sentence_embedding_biobert

Set Python environment variables for PySpark

os.environ['PYSPARK_PYTHON'] = '/home/nileshkr/.conda/envs/sparknlp/bin/python3.8'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/home/nileshkr/.conda/envs/sparknlp/bin/python3.8'
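
The interpreter path above is specific to one user's conda environment. A more portable sketch, assuming the notebook kernel itself runs inside the environment where Spark NLP is installed, is to point both variables at the running interpreter:

```python
import os
import sys

# Assumption: the running kernel's interpreter is the one with Spark NLP installed.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

print(os.environ["PYSPARK_PYTHON"])
```

If the kernel runs in a different environment from the Spark workers, an explicit path is still the right choice.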

Load BioBERT Sentence Embeddings (PubMed)

This model contains pre-trained BioBERT weights. BioBERT is a language representation model for the biomedical domain, designed for biomedical text mining tasks such as named entity recognition, relation extraction, and question answering.

The details are described in the paper “BioBERT: a pre-trained biomedical language representation model for biomedical text mining”.

%%time
pipe = nlu.load('en.embed_sentence.biobert.pubmed_base_cased')

sent_biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
CPU times: user 10.9 ms, sys: 7.83 ms, total: 18.7 ms
Wall time: 3.25 s

Get the sentence embeddings (BioBERT) of the GO term definitions

predictions_full = pipe.predict(df.Definition, output_level='document')
predictions_full

	document	sentence_embedding_biobert
0	The maintenance of the structure and integrity...	[-0.1828726977109909, 0.25272125005722046, -0....
1	The production of new individuals that contain...	[-0.031625792384147644, 0.027806805446743965, ...
2	The repair of single strand breaks in DNA.	[-0.05223826691508293, -0.028281446546316147, ...
3	Repair of such breaks is mediated by the same ...	[0.02170223370194435, -0.04093412682414055, -0...
4	Catalysis of the hydrolysis of ester linkages ...	[-0.00821786466985941, -0.23827922344207764, -...
...	...	...
5614	Any process that activates or increases the fr...	[0.053574178367853165, -0.080331951379776, -0....
5615	Any process that activates or increases the fr...	[0.015425608493387699, -0.013072511181235313, ...
5616	The chemical reactions and pathways involving ...	[0.0855507105588913, -0.05061568692326546, -0....
5617	The chemical reactions and pathways resulting ...	[0.11358797550201416, -0.13088946044445038, -0...
5618	The chemical reactions and pathways resulting ...	[0.11483287811279297, -0.15022939443588257, -0...

5619 rows × 2 columns
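
The prediction frame carries the definition text and its embedding but not the GO identifier. Since it has the same number of rows as the filtered table (5619), one way to re-attach the identifiers is by position. The sketch below uses toy frames; it assumes the predictions come back one row per input row and in the same order, which should be checked on the real data before relying on it.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: `toy_df` mimics the filtered ontology table (non-contiguous index
# left over from filtering); `toy_pred` mimics the prediction frame (fresh index).
toy_df = pd.DataFrame(
    {"GO": ["GO:0000002", "GO:0000003", "GO:0000012"]},
    index=[1, 2, 7],
)
toy_pred = pd.DataFrame({
    "document": ["sentence a", "sentence b", "sentence c"],
    "sentence_embedding_biobert": [np.zeros(4), np.ones(4), np.full(4, 0.5)],
})

# Positional alignment is only safe if both frames have the same number of rows.
assert len(toy_df) == len(toy_pred)

labelled = toy_pred.copy()
# .to_numpy() drops the index, so values are paired by position rather than by label.
labelled["GO"] = toy_df["GO"].to_numpy()

print(labelled[["GO", "document"]])
```

Assigning the column as a plain array matters here: assigning the Series directly would align on index labels and silently produce missing values where the two indexes differ.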

Save embeddings as a pickle file

%%time

with open('predictions_full_BP.pickle', 'wb') as handle:
    pickle.dump(predictions_full, handle, protocol=pickle.HIGHEST_PROTOCOL)

# with open('predictions_full_bkp.pickle', 'wb') as handle:
#     pickle.dump(predictions_full, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
# !ls *.pickle

CPU times: user 29.1 ms, sys: 10.4 ms, total: 39.5 ms
Wall time: 40.9 ms

Save the pipeline

%%time
stored_model_path = "pipe_full_BP"
pipe.save(stored_model_path, overwrite=True)

Stored model_anno_obj in pipe_full_BP
CPU times: user 4.81 ms, sys: 4.3 ms, total: 9.11 ms
Wall time: 3.55 s

(Optional): Load saved embeddings

with open('predictions_full_BP.pickle', 'rb') as handle:
    predictions = pickle.load(handle)
predictions
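
Once the embeddings are loaded, they can be compared pairwise with cosine similarity. The sketch below uses NumPy on toy vectors; `cosine_similarity_matrix` is a hypothetical helper written for illustration, and the `cosine_similarity` function from scikit-learn imported at the top of the notebook computes the same quantity on a 2-D array.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity for a sequence of equal-length vectors."""
    matrix = np.vstack(vectors).astype(float)              # shape: (n_rows, n_dims)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)  # one norm per row
    unit = matrix / np.clip(norms, 1e-12, None)            # guard against zero vectors
    return unit @ unit.T


# Toy 3-dimensional vectors standing in for the BioBERT embeddings.
toy_embeddings = [
    np.array([1.0, 0.0, 0.0]),
    np.array([2.0, 0.0, 0.0]),  # same direction as the first
    np.array([0.0, 3.0, 0.0]),  # orthogonal to the first two
]

sim = cosine_similarity_matrix(toy_embeddings)
print(np.round(sim, 3))
```

On the real data this would be called as `cosine_similarity_matrix(list(predictions["sentence_embedding_biobert"]))`, assuming every cell of that column holds a 1-D vector of the same length.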