# GO Term Definition Embedding
<div>
<img src="https://www.johnsnowlabs.com/wp-content/uploads/2019/07/johnsnowlabs_logo.png" width="200"/>
</div>

**NLU**: State of the Art Text Mining in Python
__The Simplicity of Python, the Power of Spark NLP__



::::{grid}
:gutter: 4

:::{grid-item-card} Input File
**GO_BP_Ontology_corups.tsv**
:::

:::{grid-item-card} Input Folder
**None**
:::

:::{grid-item-card} Output File
**GO_BP_Ontology_corups_Filtered.tsv**, 
**predictions_full_BP.pickle**
:::

:::{grid-item-card} Output Folder
**pipe_full_BP**
:::
::::

<div class="alert alert-block alert-info">
    <h2>üêçLoad Python libraries</h2>
</div>

In [20]:
import os
import nlu
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
from IPython.display import display, HTML

<div class="alert alert-block alert-warning">
    <h2>Make sure Java is loaded</h2>
</div>

In [21]:
!module load Java/1.8.0_202
!java -version

java version "1.8.0_202"
Java(TM) SE Runtime Environment (build 1.8.0_202-b08)
Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)


<div class="alert alert-block alert-info">
    <h2>Set the JAVA_HOME and PATH variables</h2>
</div>

In [31]:
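# JAVA_HOME must already be set in the kernel's environment;
# this self-assignment is a no-op that raises KeyError if it is missing.
os.environ["JAVA_HOME"] = os.environ["JAVA_HOME"]
# Prepend the JDK's bin directory so its `java` executable is found first.
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]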

In [32]:
print(os.environ["JAVA_HOME"])

/share/apps/rc/software/Java/1.8.0_202
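
The cell above assumes `JAVA_HOME` is already present in the kernel's environment. A minimal sketch of a more defensive variant is shown below. The helper name `configure_java` and its `fallback_home` argument are hypothetical and not part of the original notebook.

```python
import os


def configure_java(env, fallback_home=None):
    """Ensure JAVA_HOME is set in `env` and its bin directory is on PATH.

    `env` is any dict-like mapping (for example os.environ).
    `fallback_home` is a hypothetical default used only when JAVA_HOME is missing.
    """
    java_home = env.get("JAVA_HOME") or fallback_home
    if not java_home:
        raise RuntimeError("JAVA_HOME is not set and no fallback was given")
    env["JAVA_HOME"] = java_home

    java_bin = os.path.join(java_home, "bin")
    path_entries = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    # Prepend the Java bin directory only once, so re-running the cell is harmless.
    if java_bin not in path_entries:
        path_entries.insert(0, java_bin)
    env["PATH"] = os.pathsep.join(path_entries)
    return java_home


# Example with a plain dict standing in for os.environ (paths are made up).
demo_env = {"JAVA_HOME": "/opt/java", "PATH": "/usr/bin"}
configure_java(demo_env)
```

In a notebook one would call `configure_java(os.environ)`. Using `os.pathsep` and `os.path.join` instead of a hard-coded `"/bin:"` keeps the sketch portable across platforms.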


***
<div class="alert alert-block alert-info">
Load ontology file
</div>

In [22]:
df = pd.read_csv('GO_BP_Ontology_corups.tsv', sep="\t")
print(df.shape)
display(df.head())

(35228, 4)


| | GO | Name | Definition | Depth |
|---|---|---|---|---|
| 0 | GO:0000001 | mitochondrion inheritance. | The distribution of mitochondria, including th... | 5 |
| 1 | GO:0000002 | mitochondrial genome maintenance. | The maintenance of the structure and integrity... | 6 |
| 2 | GO:0000003 | reproduction. | The production of new individuals that contain... | 1 |
| 3 | GO:0000006 | high-affinity zinc transmembrane transporter a... | Enables the transfer of zinc ions (Zn2+) from ... | 6 |
| 4 | GO:0000006 | high-affinity zinc transmembrane transporter a... | In high-affinity transport the transporter is ... | 6 |


<div class="alert alert-block alert-warning">
    <h2>Filter out GO terms that are not in the TAIR database to focus on plant-associated GO terms</h2>
</div>

In [23]:
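# Read one GO identifier per line; the context manager closes the file afterwards.
with open("Auxiliary_Files/TAIR_GOTERMS.txt") as handle:
    TAIR_terms = handle.read().splitlines()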
len(TAIR_terms)

7589

Save the filtered GO term data

In [24]:
df = df[df.GO.isin(TAIR_terms)]
df.to_csv('GO_BP_Ontology_corups_Filtered.tsv', index=False, sep="\t")
df.shape

(5619, 4)
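
The filter keeps only rows whose `GO` identifier appears in the TAIR list. Here is a minimal sketch of the same step on toy data. The identifiers, the definitions, and the whitespace-stripping step are illustrative assumptions, not taken from the real files.

```python
import pandas as pd

# Toy stand-ins for the ontology table and the TAIR term list (illustrative values).
toy_df = pd.DataFrame(
    {
        "GO": ["GO:0000001", "GO:0000002", "GO:0000003", "GO:0000002"],
        "Definition": ["def a", "def b", "def c", "def b, second sentence"],
    }
)
raw_terms = ["GO:0000002 ", "", "GO:0000003"]

# Strip whitespace and drop empty lines so stray characters cannot cause silent misses;
# a set also makes the intent (membership lookup) explicit.
tair_terms = {t.strip() for t in raw_terms if t.strip()}

# Boolean mask: True where the row's GO identifier is in the TAIR set.
filtered = toy_df[toy_df["GO"].isin(tair_terms)].reset_index(drop=True)

# Terms requested but absent from the ontology table, useful as a sanity check.
missing = sorted(tair_terms - set(toy_df["GO"]))
```

A GO identifier can occupy several rows, so the filtered row count need not match the number of TAIR terms. Checking `missing` shows whether any listed term was not found at all.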

<div class="alert alert-block alert-info">
    <h2>Load model and calculate sentence_embedding_biobert</h2>
</div>

<div class="alert alert-block alert-info">
    <h3>Set Python env variables</h3>
</div>

In [25]:
os.environ['PYSPARK_PYTHON'] = '/home/nileshkr/.conda/envs/sparknlp/bin/python3.8'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/home/nileshkr/.conda/envs/sparknlp/bin/python3.8'

<div class="alert alert-block alert-info">
    <h2>Load BioBERT Sentence Embeddings (Pubmed)</h2>
    This model contains pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed for biomedical text-mining tasks such as named entity recognition, relation extraction, and question answering.
</div>

The details are described in the paper [“BioBERT: a pre-trained biomedical language representation model for biomedical text mining”](https://arxiv.org/abs/1901.08746).

In [26]:
%%time
pipe = nlu.load('en.embed_sentence.biobert.pubmed_base_cased')

sent_biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
CPU times: user 10.9 ms, sys: 7.83 ms, total: 18.7 ms
Wall time: 3.25 s


<div class="alert alert-block alert-info">
    <h2>Get the sentence embedding (biobert) of GO term definition</h2>
</div>


In [27]:
predictions_full = pipe.predict(df.Definition, output_level='document')
predictions_full


| | document | sentence_embedding_biobert |
|---|---|---|
| 0 | The maintenance of the structure and integrity... | [-0.1828726977109909, 0.25272125005722046, -0.... |
| 1 | The production of new individuals that contain... | [-0.031625792384147644, 0.027806805446743965, ... |
| 2 | The repair of single strand breaks in DNA. | [-0.05223826691508293, -0.028281446546316147, ... |
| 3 | Repair of such breaks is mediated by the same ... | [0.02170223370194435, -0.04093412682414055, -0... |
| 4 | Catalysis of the hydrolysis of ester linkages ... | [-0.00821786466985941, -0.23827922344207764, -... |
| ... | ... | ... |
| 5614 | Any process that activates or increases the fr... | [0.053574178367853165, -0.080331951379776, -0.... |
| 5615 | Any process that activates or increases the fr... | [0.015425608493387699, -0.013072511181235313, ... |
| 5616 | The chemical reactions and pathways involving ... | [0.0855507105588913, -0.05061568692326546, -0.... |
| 5617 | The chemical reactions and pathways resulting ... | [0.11358797550201416, -0.13088946044445038, -0... |
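
Each row of `sentence_embedding_biobert` holds one vector per definition. The notebook imports `cosine_similarity` but does not use it in this section. As a hedged sketch of how such vectors could be compared, the example below stacks a toy embedding column into a matrix and computes cosine similarity with plain NumPy. The three-dimensional vectors are invented for illustration, and real BioBERT vectors are much longer.

```python
import numpy as np

# Toy stand-in for the embedding column: one list of floats per document.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]

# Stack the per-row vectors into a (n_documents, n_dimensions) matrix.
matrix = np.vstack(toy_embeddings)

# Normalise each row to unit length; the dot product of unit vectors is the cosine.
norms = np.linalg.norm(matrix, axis=1, keepdims=True)
unit = matrix / norms

# Entry [i, j] is the cosine similarity between document i and document j.
similarity = unit @ unit.T
```

With the real output, the same stacking could presumably be applied to `predictions_full['sentence_embedding_biobert']`, provided every row holds a vector of equal length.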


<div class="alert alert-block alert-info">
    <h2>Save embeddings as Pickle</h2>
</div>

In [28]:
%%time

with open('predictions_full_BP.pickle', 'wb') as handle:
    pickle.dump(predictions_full, handle, protocol=pickle.HIGHEST_PROTOCOL)

# with open('predictions_full_bkp.pickle', 'wb') as handle:
#     pickle.dump(predictions_full, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
# !ls *.pickle

CPU times: user 29.1 ms, sys: 10.4 ms, total: 39.5 ms
Wall time: 40.9 ms


<div class="alert alert-block alert-info">
    <h2>Save the NLU pipeline</h2>
</div>

In [29]:
%%time
stored_model_path = "pipe_full_BP"
pipe.save(stored_model_path, overwrite=True)

Stored model_anno_obj in pipe_full_BP
CPU times: user 4.81 ms, sys: 4.3 ms, total: 9.11 ms
Wall time: 3.55 s


### <b>(Optional): </b>Load saved embeddings

```python
# Reload the pickled embeddings written earlier
with open('predictions_full_BP.pickle', 'rb') as handle:
    predictions = pickle.load(handle)
predictions
```
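
One way to confirm that a pickle was written correctly is to reload it and compare it with the in-memory object. The sketch below does this with a small dictionary in a temporary directory. The file name and data are placeholders. For a DataFrame, the comparison would use `DataFrame.equals` rather than `==`.

```python
import os
import pickle
import tempfile

# Placeholder object standing in for the predictions DataFrame.
original = {"document": ["text a", "text b"], "embedding": [[0.1, 0.2], [0.3, 0.4]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_predictions.pickle")

    # Write with the highest protocol, as in the notebook's save cell.
    with open(path, "wb") as handle:
        pickle.dump(original, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back while the temporary file still exists.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

# True when the reloaded object matches the one that was saved.
round_trip_ok = restored == original
```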