GO Terms Definition Embedding

NLU: State of the Art Text Mining in Python. The Simplicity of Python, the Power of Spark NLP.

Input File

GO_BP_Ontology_corups.tsv

Input Folder

None

Output File

GO_BP_Ontology_corups_Filtered.tsv, predictions_full_BP.pickle

Output Folder

pipe_full_BP

šŸLoad Python libraries

import os
import nlu
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
from IPython.display import display, HTML

Make sure Java is loaded

!module load Java/1.8.0_202
!java -version
openjdk version "1.8.0_332"
OpenJDK Runtime Environment (build 1.8.0_332-b09)
OpenJDK 64-Bit Server VM (build 25.332-b09, mixed mode)

Set the JAVA_HOME and PATH variables

# Reading os.environ["JAVA_HOME"] raises KeyError when the variable is not set,
# so assign the Java install location explicitly. This path is site-specific.
os.environ["JAVA_HOME"] = "/share/apps/rc/software/Java/1.8.0_202"
# Put the Java binaries first on PATH so Spark picks up this JVM.
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
print(os.environ["JAVA_HOME"])
/share/apps/rc/software/Java/1.8.0_202
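
Indexing os.environ with a key that is not set raises KeyError, so a more defensive version of this step can keep an existing JAVA_HOME and only fall back to a known location when it is missing. The sketch below is one possible way to do that. The helper name ensure_java_home is hypothetical (it is not part of NLU or Spark NLP), and the default path is simply the one printed above, which is specific to this cluster.

```python
import os

def ensure_java_home(default_home, env=None):
    """Set JAVA_HOME if it is missing and put its bin directory first on PATH.

    Hypothetical helper for illustration; not part of NLU or Spark NLP.
    """
    env = os.environ if env is None else env
    # Keep an existing JAVA_HOME; otherwise use the supplied default.
    java_home = env.get("JAVA_HOME") or default_home
    env["JAVA_HOME"] = java_home
    java_bin = os.path.join(java_home, "bin")
    path_entries = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    # Avoid adding the same directory again when the cell is re-run.
    if java_bin not in path_entries:
        path_entries.insert(0, java_bin)
    env["PATH"] = os.pathsep.join(path_entries)
    return java_home

# Site-specific path taken from the output above; adjust for your system.
ensure_java_home("/share/apps/rc/software/Java/1.8.0_202")
```

Passing a plain dict as env makes the helper easy to try out without touching the real environment.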

Load ontology file
df = pd.read_csv('GO_BP_Ontology_corups.tsv', sep="\t")
print(df.shape)
display(df.head())
(35228, 4)
   GO          Name                                               Definition                                         Depth
0  GO:0000001  mitochondrion inheritance.                         The distribution of mitochondria, including th...  5
1  GO:0000002  mitochondrial genome maintenance.                  The maintenance of the structure and integrity...  6
2  GO:0000003  reproduction.                                      The production of new individuals that contain...  1
3  GO:0000006  high-affinity zinc transmembrane transporter a...  Enables the transfer of zinc ions (Zn2+) from ...  6
4  GO:0000006  high-affinity zinc transmembrane transporter a...  In high-affinity transport the transporter is ...  6

Filter out GO terms that are not in the TAIR database, to focus only on plant-associated GO terms

TAIR_terms = open("Auxiliary_Files/TAIR_GOTERMS.txt").read().splitlines()
len(TAIR_terms)
7589

Save the filtered GO term data

df = df[df.GO.isin(TAIR_terms)]
df.to_csv('GO_BP_Ontology_corups_Filtered.tsv', index=False, sep="\t")
df.shape
(5619, 4)
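
The filter keeps only the rows whose GO ID appears in the TAIR list. As a rough illustration of the same membership test without pandas, the sketch below uses a set and also strips stray whitespace, which a list read with splitlines() may contain. The function name filter_rows_by_go, the toy rows, and the allow-list are assumptions made for this example; they are not the real TAIR data.

```python
def filter_rows_by_go(rows, allowed_ids):
    """Keep rows whose "GO" value is in allowed_ids, ignoring surrounding whitespace."""
    # A set gives constant-time lookups; blank entries are dropped.
    allowed = {go_id.strip() for go_id in allowed_ids if go_id.strip()}
    return [row for row in rows if row["GO"].strip() in allowed]

# Toy rows for illustration only; the real data comes from the TSV file.
rows = [
    {"GO": "GO:0000001", "Depth": 5},
    {"GO": "GO:0000002", "Depth": 6},
    {"GO": "GO:0000003", "Depth": 1},
]
# Hypothetical allow-list with a trailing space and an empty line.
allowed_ids = ["GO:0000002 ", "", "GO:0000003"]

kept = filter_rows_by_go(rows, allowed_ids)
print([row["GO"] for row in kept])  # ['GO:0000002', 'GO:0000003']
```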

Load the model and calculate sentence_embedding_biobert

Set Python env variables

os.environ['PYSPARK_PYTHON'] = '/home/nileshkr/.conda/envs/sparknlp/bin/python3.8'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/home/nileshkr/.conda/envs/sparknlp/bin/python3.8'
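
The interpreter paths above are tied to one user account. A possibly more portable sketch, assuming the notebook kernel already runs inside the intended conda environment, is to point both variables at the running interpreter through sys.executable.

```python
import os
import sys

# Assumption: the current kernel is the Python that Spark should use
# for both the driver and the workers.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```

This only helps when the workers run on a machine that has the same interpreter path, as in local mode.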

Load BioBERT Sentence Embeddings (Pubmed)

This model contains pre-trained BioBERT weights. BioBERT is a language representation model for the biomedical domain, designed for biomedical text mining tasks such as named entity recognition, relation extraction, and question answering.

The details are described in the paper "BioBERT: a pre-trained biomedical language representation model for biomedical text mining".

%%time
pipe = nlu.load('en.embed_sentence.biobert.pubmed_base_cased')
sent_biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
CPU times: user 10.9 ms, sys: 7.83 ms, total: 18.7 ms
Wall time: 3.25 s

Get the BioBERT sentence embedding of each GO term definition

predictions_full = pipe.predict(df.Definition, output_level='document')
predictions_full
                                                                                
      document                                           sentence_embedding_biobert
0     The maintenance of the structure and integrity...  [-0.1828726977109909, 0.25272125005722046, -0....
1     The production of new individuals that contain...  [-0.031625792384147644, 0.027806805446743965, ...
2     The repair of single strand breaks in DNA.         [-0.05223826691508293, -0.028281446546316147, ...
3     Repair of such breaks is mediated by the same ...  [0.02170223370194435, -0.04093412682414055, -0...
4     Catalysis of the hydrolysis of ester linkages ...  [-0.00821786466985941, -0.23827922344207764, -...
...   ...                                                ...
5614  Any process that activates or increases the fr...  [0.053574178367853165, -0.080331951379776, -0....
5615  Any process that activates or increases the fr...  [0.015425608493387699, -0.013072511181235313, ...
5616  The chemical reactions and pathways involving ...  [0.0855507105588913, -0.05061568692326546, -0....
5617  The chemical reactions and pathways resulting ...  [0.11358797550201416, -0.13088946044445038, -0...
5618  The chemical reactions and pathways resulting ...  [0.11483287811279297, -0.15022939443588257, -0...

5619 rows × 2 columns
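
Each row now holds one embedding vector per definition sentence. The notebook imports cosine_similarity from scikit-learn, although this section does not use it. As a minimal illustration of what such a comparison computes, the sketch below implements cosine similarity in plain Python on short toy vectors; real embeddings are much longer. The function name cosine and the vectors are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, not real embeddings.
a = [1.0, 0.0, 0.0]
b = [0.0, 1.0, 0.0]
c = [2.0, 0.0, 0.0]

print(cosine(a, b))  # orthogonal vectors give 0.0
print(cosine(a, c))  # same direction, different length, gives 1.0
```

Because the measure depends only on direction, two definitions can score as similar even when their vectors differ in magnitude.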

Save the embeddings as a pickle

%%time

with open('predictions_full_BP.pickle', 'wb') as handle:
    pickle.dump(predictions_full, handle, protocol=pickle.HIGHEST_PROTOCOL)

# with open('predictions_full_bkp.pickle', 'wb') as handle:
#     pickle.dump(predictions_full, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
# !ls *.pickle
CPU times: user 29.1 ms, sys: 10.4 ms, total: 39.5 ms
Wall time: 40.9 ms

Save the pipe

%%time
stored_model_path = "pipe_full_BP"
pipe.save(stored_model_path, overwrite=True)
Stored model_anno_obj in pipe_full_BP
CPU times: user 4.81 ms, sys: 4.3 ms, total: 9.11 ms
Wall time: 3.55 s

(Optional): Load saved embeddings

with open('predictions_full_BP.pickle', 'rb') as handle:
    predictions = pickle.load(handle)
predictions
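
When reloading, it can be worth confirming that the object read back matches what was written. The sketch below shows such a round-trip check using only the standard library. The stand-in dictionary and the file name example.pickle are assumptions for illustration; in the notebook the object would be the predictions DataFrame. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the predictions DataFrame.
predictions_example = {"document": ["toy sentence"], "embedding": [[0.1, 0.2]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pickle")
    # Write with the same protocol setting used above.
    with open(path, "wb") as handle:
        pickle.dump(predictions_example, handle, protocol=pickle.HIGHEST_PROTOCOL)
    # Read it back before the temporary directory is removed.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

# The restored object should equal what was saved.
assert restored == predictions_example
```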