{ "cells": [ { "cell_type": "markdown", "id": "893938a0-573a-461b-8969-8f66b8380b6b", "metadata": { "tags": [] }, "source": [ "# GO Terms Defination Embedding\n", "
\n", "\n", "
\n", "\n", "**NLU**:State of the Art Text Mining in Python\n", "__The Simplicity of Python, the Power of Spark NLP__\n", "\n" ] }, { "cell_type": "markdown", "id": "7525f05e-ff51-4717-b196-45e1569b2cc4", "metadata": {}, "source": [ "::::{grid}\n", ":gutter: 4\n", "\n", ":::{grid-item-card} Input File\n", "**GO_BP_Ontology_corups.tsv**, \n", ":::\n", "\n", ":::{grid-item-card} Input Folder\n", "**None**\n", ":::\n", "\n", ":::{grid-item-card} Output File\n", "**GO_BP_Ontology_corups_Filtered.tsv**, \n", "**predictions_full_BP.pickle**\n", ":::\n", "\n", ":::{grid-item-card} Output Folder\n", "**pipe_full_BP**\n", ":::\n", "::::" ] }, { "cell_type": "markdown", "id": "28fa17dd-8ad3-4d4a-b3ec-5b7693de3a4e", "metadata": {}, "source": [ "
\n", "

šŸLoad Python libraries

\n", "
" ] }, { "cell_type": "code", "execution_count": 20, "id": "23a48a52-3219-4fee-93ed-923d42e255b6", "metadata": {}, "outputs": [], "source": [ "import os\n", "import nlu\n", "import pandas as pd\n", "from sklearn.metrics.pairwise import cosine_similarity\n", "import numpy as np\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import pickle\n", "from IPython.display import display, HTML" ] }, { "cell_type": "markdown", "id": "9276a39d-4c5c-4924-9159-840e9df34229", "metadata": { "tags": [] }, "source": [ "
\n", "

Make sure java is loaded

\n", "
" ] }, { "cell_type": "code", "execution_count": 21, "id": "091a3391-6ea2-43cd-ae04-720e867b8ee0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "java version \"1.8.0_202\"\n", "Java(TM) SE Runtime Environment (build 1.8.0_202-b08)\n", "Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)\n" ] } ], "source": [ "!module load Java/1.8.0_202\n", "!java -version" ] }, { "cell_type": "markdown", "id": "afd75a35-faad-4ac3-9313-f0fde38b0a2b", "metadata": {}, "source": [ "
\n", "

Set Java JAVA_HOME and PATH variable

\n", "
" ] }, { "cell_type": "code", "execution_count": 31, "id": "68821ad9-1fb9-46d5-ad27-12d7a2c55514", "metadata": {}, "outputs": [], "source": [ "os.environ[\"JAVA_HOME\"] = os.environ[\"JAVA_HOME\"]\n", "os.environ[\"PATH\"] = os.environ[\"JAVA_HOME\"] + \"/bin:\" + os.environ[\"PATH\"]" ] }, { "cell_type": "code", "execution_count": 32, "id": "879d5b44-3395-43c7-a50e-da747b09f38d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/share/apps/rc/software/Java/1.8.0_202\n" ] } ], "source": [ "print(os.environ[\"JAVA_HOME\"])" ] }, { "cell_type": "markdown", "id": "91fce46a-2bfb-4a81-8863-1558bc0efe90", "metadata": {}, "source": [ "***\n", "
\n", "Load ontology file\n", "
" ] }, { "cell_type": "code", "execution_count": 22, "id": "94ef0b64-c6dc-4b86-bc86-939ad8a57cc1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(35228, 4)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GONameDefinitionDepth
0GO:0000001mitochondrion inheritance.The distribution of mitochondria, including th...5
1GO:0000002mitochondrial genome maintenance.The maintenance of the structure and integrity...6
2GO:0000003reproduction.The production of new individuals that contain...1
3GO:0000006high-affinity zinc transmembrane transporter a...Enables the transfer of zinc ions (Zn2+) from ...6
4GO:0000006high-affinity zinc transmembrane transporter a...In high-affinity transport the transporter is ...6
\n", "
" ], "text/plain": [ " GO Name \\\n", "0 GO:0000001 mitochondrion inheritance. \n", "1 GO:0000002 mitochondrial genome maintenance. \n", "2 GO:0000003 reproduction. \n", "3 GO:0000006 high-affinity zinc transmembrane transporter a... \n", "4 GO:0000006 high-affinity zinc transmembrane transporter a... \n", "\n", " Definition Depth \n", "0 The distribution of mitochondria, including th... 5 \n", "1 The maintenance of the structure and integrity... 6 \n", "2 The production of new individuals that contain... 1 \n", "3 Enables the transfer of zinc ions (Zn2+) from ... 6 \n", "4 In high-affinity transport the transporter is ... 6 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df = pd.read_csv('GO_BP_Ontology_corups.tsv', sep=\"\\t\")\n", "print(df.shape)\n", "display(df.head())" ] }, { "cell_type": "markdown", "id": "fe022279", "metadata": {}, "source": [ "
\n", "

Filter GO terms which are not in TAIR database to focus only on the Plant associated GO terms

\n", "
" ] }, { "cell_type": "code", "execution_count": 23, "id": "82626819", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7589" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TAIR_terms = open(\"Auxiliary_Files/TAIR_GOTERMS.txt\").read().splitlines()\n", "len(TAIR_terms)" ] }, { "cell_type": "markdown", "id": "22530fd7-3b1a-4cb0-a640-386cd2fcf97c", "metadata": {}, "source": [ "Save filter GO term data" ] }, { "cell_type": "code", "execution_count": 24, "id": "40404517", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(5619, 4)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df[df.GO.isin(TAIR_terms)]\n", "df.to_csv('GO_BP_Ontology_corups_Filtered.tsv', index=False, sep=\"\\t\")\n", "df.shape" ] }, { "cell_type": "markdown", "id": "3c32f74c-306d-421d-85ae-6b04bee8b2c1", "metadata": {}, "source": [ "
\n", "

Load model and calclulate sentence_embedding_biobert

\n", "
\n", "\n", "
\n", "

Set Python env variables

\n", "
" ] }, { "cell_type": "code", "execution_count": 25, "id": "51c48579-ccaa-48fc-a1ec-120e0f90ea63", "metadata": {}, "outputs": [], "source": [ "os.environ['PYSPARK_PYTHON'] = '/home/nileshkr/.conda/envs/sparknlp/bin/python3.8'\n", "os.environ['PYSPARK_DRIVER_PYTHON'] = '/home/nileshkr/.conda/envs/sparknlp/bin/python3.8'" ] }, { "cell_type": "markdown", "id": "e99fa30a-9f0e-442c-b64e-7d80513d4a9f", "metadata": {}, "source": [ "
\n", "

Load BioBERT Sentence Embeddings (Pubmed)

\n", " This model contains a pre-trained weights of BioBERT, a language representation model for biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. \n", "
\n", "\n", "The details are described in the paper [ā€œBioBERT: a pre-trained biomedical language representation model for biomedical text miningā€](https://arxiv.org/abs/1901.08746)." ] }, { "cell_type": "code", "execution_count": 26, "id": "2215bdf5-bca9-44e8-987b-05ef87ccee65", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sent_biobert_pubmed_base_cased download started this may take some time.\n", "Approximate size to download 386.4 MB\n", "[OK!]\n", "CPU times: user 10.9 ms, sys: 7.83 ms, total: 18.7 ms\n", "Wall time: 3.25 s\n" ] } ], "source": [ "%%time\n", "pipe = nlu.load('en.embed_sentence.biobert.pubmed_base_cased')" ] }, { "cell_type": "markdown", "id": "edb47edb-1bf8-41f5-91da-f5adc34fd77b", "metadata": {}, "source": [ "
\n", "

Get the sentence embedding (biobert) of GO term definition

\n", "
\n", "\n", "The details are described in the paper __[ā€œBioBERT: a pre-trained biomedical language representation model for biomedical text miningā€](https://arxiv.org/abs/1901.08746)__." ] }, { "cell_type": "code", "execution_count": 27, "id": "864b6423-2376-4db3-b4c8-b6efc86c7e72", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
documentsentence_embedding_biobert
0The maintenance of the structure and integrity...[-0.1828726977109909, 0.25272125005722046, -0....
1The production of new individuals that contain...[-0.031625792384147644, 0.027806805446743965, ...
2The repair of single strand breaks in DNA.[-0.05223826691508293, -0.028281446546316147, ...
3Repair of such breaks is mediated by the same ...[0.02170223370194435, -0.04093412682414055, -0...
4Catalysis of the hydrolysis of ester linkages ...[-0.00821786466985941, -0.23827922344207764, -...
.........
5614Any process that activates or increases the fr...[0.053574178367853165, -0.080331951379776, -0....
5615Any process that activates or increases the fr...[0.015425608493387699, -0.013072511181235313, ...
5616The chemical reactions and pathways involving ...[0.0855507105588913, -0.05061568692326546, -0....
5617The chemical reactions and pathways resulting ...[0.11358797550201416, -0.13088946044445038, -0...
5618The chemical reactions and pathways resulting ...[0.11483287811279297, -0.15022939443588257, -0...
\n", "

5619 rows Ɨ 2 columns

\n", "
" ], "text/plain": [ " document \\\n", "0 The maintenance of the structure and integrity... \n", "1 The production of new individuals that contain... \n", "2 The repair of single strand breaks in DNA. \n", "3 Repair of such breaks is mediated by the same ... \n", "4 Catalysis of the hydrolysis of ester linkages ... \n", "... ... \n", "5614 Any process that activates or increases the fr... \n", "5615 Any process that activates or increases the fr... \n", "5616 The chemical reactions and pathways involving ... \n", "5617 The chemical reactions and pathways resulting ... \n", "5618 The chemical reactions and pathways resulting ... \n", "\n", " sentence_embedding_biobert \n", "0 [-0.1828726977109909, 0.25272125005722046, -0.... \n", "1 [-0.031625792384147644, 0.027806805446743965, ... \n", "2 [-0.05223826691508293, -0.028281446546316147, ... \n", "3 [0.02170223370194435, -0.04093412682414055, -0... \n", "4 [-0.00821786466985941, -0.23827922344207764, -... \n", "... ... \n", "5614 [0.053574178367853165, -0.080331951379776, -0.... \n", "5615 [0.015425608493387699, -0.013072511181235313, ... \n", "5616 [0.0855507105588913, -0.05061568692326546, -0.... \n", "5617 [0.11358797550201416, -0.13088946044445038, -0... \n", "5618 [0.11483287811279297, -0.15022939443588257, -0... \n", "\n", "[5619 rows x 2 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictions_full = pipe.predict(df.Definition, output_level='document')\n", "predictions_full" ] }, { "cell_type": "markdown", "id": "6d211398-e677-4bb0-bec1-1283d8055eec", "metadata": {}, "source": [ "
\n", "

Save as embeddings as Pickle

\n", "
" ] }, { "cell_type": "code", "execution_count": 28, "id": "31b63376-d540-4679-bb5e-166b51a98171", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 29.1 ms, sys: 10.4 ms, total: 39.5 ms\n", "Wall time: 40.9 ms\n" ] } ], "source": [ "%%time\n", "\n", "with open('predictions_full_BP.pickle', 'wb') as handle:\n", " pickle.dump(predictions_full, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", "\n", "# with open('predictions_full_bkp.pickle', 'wb') as handle:\n", "# pickle.dump(predictions_full, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", " \n", "# !ls *.pickle" ] }, { "cell_type": "markdown", "id": "62be25ef", "metadata": {}, "source": [ "
\n", "

Save as Pipe

\n", "
" ] }, { "cell_type": "code", "execution_count": 29, "id": "5ee3d5e4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Stored model_anno_obj in pipe_full_BP\n", "CPU times: user 4.81 ms, sys: 4.3 ms, total: 9.11 ms\n", "Wall time: 3.55 s\n" ] } ], "source": [ "%%time\n", "stored_model_path = \"pipe_full_BP\"\n", "pipe.save(stored_model_path, overwrite=True)" ] }, { "cell_type": "markdown", "id": "0df77de2", "metadata": {}, "source": [ "### (Optional): Load Saved model\n", "\n", "```\n", "with open('predictions_full_BP.pickle', 'rb') as handle:\n", " predictions = pickle.load(handle)\n", "predictions\n", "\n", "```" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "sparknlp", "language": "python", "name": "sparknlp" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.15" } }, "nbformat": 4, "nbformat_minor": 5 }