GO Terms Definition Embedding

NLU: State of the Art Text Mining in Python. The Simplicity of Python, the Power of Spark NLP.
Input File
GO_BP_Ontology_corups.tsv
Input Folder
None
Output File
GO_BP_Ontology_corups_Filtered.tsv, predictions_full_BP.pickle
Output Folder
pipe_full_BP
Load Python libraries
import os
import nlu
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
from IPython.display import display, HTML
Make sure Java is loaded
!module load Java/1.8.0_202
!java -version
openjdk version "1.8.0_332"
OpenJDK Runtime Environment (build 1.8.0_332-b09)
OpenJDK 64-Bit Server VM (build 25.332-b09, mixed mode)
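Set the JAVA_HOME and PATH variables. The assignment below only works when JAVA_HOME is already exported (for example by the module system); otherwise the lookup raises the KeyError shown in the traceback.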
os.environ["JAVA_HOME"] = os.environ["JAVA_HOME"]
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[3], line 1
----> 1 os.environ["JAVA_HOME"] = os.environ["JAVA_HOME"]
2 os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
File ~/.conda/envs/sparknlp/lib/python3.8/os.py:675, in _Environ.__getitem__(self, key)
672 value = self._data[self.encodekey(key)]
673 except KeyError:
674 # raise KeyError with the original key value
--> 675 raise KeyError(key) from None
676 return self.decodevalue(value)
KeyError: 'JAVA_HOME'
print(os.environ["JAVA_HOME"])
/share/apps/rc/software/Java/1.8.0_202
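A more defensive setup can avoid the traceback above by falling back to an explicit path when JAVA_HOME is undefined. The following is a minimal sketch, not the notebook's own code: the helper name `ensure_java_home` is hypothetical, and the fallback path is simply the value printed above, which is an assumption specific to this cluster.

```python
import os


def ensure_java_home(env, fallback):
    """Set JAVA_HOME in `env` if missing and put its bin/ first on PATH.

    `env` is any dict-like mapping (os.environ in practice).
    `fallback` is only used when JAVA_HOME is not already defined.
    """
    java_home = env.setdefault("JAVA_HOME", fallback)
    java_bin = os.path.join(java_home, "bin")
    # Split the existing PATH, dropping empty entries.
    path_entries = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if java_bin not in path_entries:
        path_entries.insert(0, java_bin)
    env["PATH"] = os.pathsep.join(path_entries)
    return java_home


# Demonstration on a plain dict so the real environment is left untouched.
# The fallback path is an assumption taken from the printed output above.
demo_env = {"PATH": "/usr/bin"}
ensure_java_home(demo_env, "/share/apps/rc/software/Java/1.8.0_202")
print(demo_env["JAVA_HOME"])
```

In the notebook this would be called with `os.environ` as the first argument, before NLU is loaded.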
Load ontology file
df = pd.read_csv('GO_BP_Ontology_corups.tsv', sep="\t")
print(df.shape)
display(df.head())
(35228, 4)
 | GO | Name | Definition | Depth |
---|---|---|---|---|
0 | GO:0000001 | mitochondrion inheritance. | The distribution of mitochondria, including th... | 5 |
1 | GO:0000002 | mitochondrial genome maintenance. | The maintenance of the structure and integrity... | 6 |
2 | GO:0000003 | reproduction. | The production of new individuals that contain... | 1 |
3 | GO:0000006 | high-affinity zinc transmembrane transporter a... | Enables the transfer of zinc ions (Zn2+) from ... | 6 |
4 | GO:0000006 | high-affinity zinc transmembrane transporter a... | In high-affinity transport the transporter is ... | 6 |
Filter out GO terms that are not in the TAIR database, so that only plant-associated GO terms remain
TAIR_terms = open("Auxiliary_Files/TAIR_GOTERMS.txt").read().splitlines()
len(TAIR_terms)
7589
Save the filtered GO term data
df = df[df.GO.isin(TAIR_terms)]
df.to_csv('GO_BP_Ontology_corups_Filtered.tsv', index=False, sep="\t")
df.shape
(5619, 4)
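The filtered table is counted in rows, and a single GO term can span several rows (GO:0000006 appears twice in the preview above), so the row count is not directly comparable to the 7589 TAIR terms. A minimal sketch of a coverage check with plain Python sets follows; `tair_coverage` is a hypothetical helper and the identifiers are toy values, not taken from the real files.

```python
def tair_coverage(ontology_ids, tair_ids):
    """Return (matched, missing) TAIR identifiers relative to the ontology."""
    ontology_set = set(ontology_ids)
    tair_set = set(tair_ids)
    matched = tair_set & ontology_set   # TAIR terms present in the ontology
    missing = tair_set - ontology_set   # TAIR terms absent from the ontology
    return matched, missing


# Toy identifiers for illustration only (not taken from the real files).
ontology_ids = ["GO:0000001", "GO:0000002", "GO:0000002", "GO:0000003"]
tair_ids = ["GO:0000002", "GO:0000003", "GO:9999999"]

matched, missing = tair_coverage(ontology_ids, tair_ids)
print(len(matched), "matched;", len(missing), "missing:", sorted(missing))
```

With the real data, the same function could be called as `tair_coverage(df.GO, TAIR_terms)` on the unfiltered table.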
Load the model and calculate sentence_embedding_biobert
Set Python env variables
os.environ['PYSPARK_PYTHON'] = '/home/nileshkr/.conda/envs/sparknlp/bin/python3.8'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/home/nileshkr/.conda/envs/sparknlp/bin/python3.8'
Load BioBERT Sentence Embeddings (Pubmed)
This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed for biomedical text mining tasks such as named entity recognition, relation extraction, and question answering. The details are described in the paper “BioBERT: a pre-trained biomedical language representation model for biomedical text mining”.
%%time
pipe = nlu.load('en.embed_sentence.biobert.pubmed_base_cased')
sent_biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
CPU times: user 10.9 ms, sys: 7.83 ms, total: 18.7 ms
Wall time: 3.25 s
Get the BioBERT sentence embedding of each GO term definition
predictions_full = pipe.predict(df.Definition, output_level='document')
predictions_full
 | document | sentence_embedding_biobert |
---|---|---|
0 | The maintenance of the structure and integrity... | [-0.1828726977109909, 0.25272125005722046, -0.... |
1 | The production of new individuals that contain... | [-0.031625792384147644, 0.027806805446743965, ... |
2 | The repair of single strand breaks in DNA. | [-0.05223826691508293, -0.028281446546316147, ... |
3 | Repair of such breaks is mediated by the same ... | [0.02170223370194435, -0.04093412682414055, -0... |
4 | Catalysis of the hydrolysis of ester linkages ... | [-0.00821786466985941, -0.23827922344207764, -... |
... | ... | ... |
5614 | Any process that activates or increases the fr... | [0.053574178367853165, -0.080331951379776, -0.... |
5615 | Any process that activates or increases the fr... | [0.015425608493387699, -0.013072511181235313, ... |
5616 | The chemical reactions and pathways involving ... | [0.0855507105588913, -0.05061568692326546, -0.... |
5617 | The chemical reactions and pathways resulting ... | [0.11358797550201416, -0.13088946044445038, -0... |
5618 | The chemical reactions and pathways resulting ... | [0.11483287811279297, -0.15022939443588257, -0... |
5619 rows × 2 columns
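The notebook imports `cosine_similarity` but does not use it in this section. As a minimal sketch of how two definition embeddings could be compared, the function below computes cosine similarity in plain Python on short toy vectors. The vectors are illustrative stand-ins: real BioBERT sentence embeddings are much longer, and for the full table `sklearn.metrics.pairwise.cosine_similarity` on a stacked array would likely be the more practical choice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for embeddings (illustrative values only).
emb_a = [0.1, -0.2, 0.3]
emb_b = [0.2, -0.4, 0.6]   # same direction as emb_a
emb_c = [0.2, 0.1, 0.0]    # orthogonal to emb_a

print(round(cosine(emb_a, emb_b), 6))
print(round(cosine(emb_a, emb_c), 6))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score close to 0, regardless of their lengths.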
Save the embeddings as a pickle file
%%time
with open('predictions_full_BP.pickle', 'wb') as handle:
    pickle.dump(predictions_full, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('predictions_full_bkp.pickle', 'wb') as handle:
# pickle.dump(predictions_full, handle, protocol=pickle.HIGHEST_PROTOCOL)
# !ls *.pickle
CPU times: user 29.1 ms, sys: 10.4 ms, total: 39.5 ms
Wall time: 40.9 ms
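A quick round-trip check can confirm that a pickled object reloads unchanged. The sketch below uses a small dictionary and a temporary directory rather than the real predictions table; the helper name `pickle_round_trip` is hypothetical.

```python
import os
import pickle
import tempfile


def pickle_round_trip(obj, path):
    """Write `obj` to `path` with pickle, then read it back and return it."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Toy stand-in for the predictions table (illustrative values only).
toy_predictions = {
    "document": ["text a", "text b"],
    "embedding": [[0.1, 0.2], [0.3, 0.4]],
}

# Use a temporary directory so no file is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    restored = pickle_round_trip(
        toy_predictions, os.path.join(tmp_dir, "toy.pickle")
    )

print(restored == toy_predictions)
```

For a pandas DataFrame such as `predictions_full`, the comparison would use `restored.equals(predictions_full)` instead of `==`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.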
Save the pipeline
%%time
stored_model_path = "pipe_full_BP"
pipe.save(stored_model_path, overwrite=True)
Stored model_anno_obj in pipe_full_BP
CPU times: user 4.81 ms, sys: 4.3 ms, total: 9.11 ms
Wall time: 3.55 s
(Optional) Load the saved embeddings
with open('predictions_full_BP.pickle', 'rb') as handle:
    predictions = pickle.load(handle)
predictions