Preprocessing of Ontology file

GO Biological Process

In bioinformatics, the Gene Ontology (GO) is a standardized vocabulary that is used to describe the functions of genes and their products.

Input File: None
Input Folder: None
Output File: GO_BP_Ontology_corups.tsv
Output Folder: None

šŸLoad Python libraries

import networkx as nx
import obonet
import re

Download and load the ontology

%%time
url = 'http://current.geneontology.org/ontology/go.obo'
graph = obonet.read_obo(url)
CPU times: user 5.87 s, sys: 203 ms, total: 6.07 s
Wall time: 6.72 s

Map Ontology name and IDs

id_to_name = {id_: data.get('name') for id_, data in graph.nodes(data=True)}
name_to_id = {data['name']: id_ for id_, data in graph.nodes(data=True) if 'name' in data}
print(len(graph))
print(graph.number_of_edges())
43093
87006

Test the mapped dictionaries

Find Parent

sorted(id_to_name[superterm] for superterm in nx.descendants(graph, 'GO:0042552'))
['anatomical structure development',
 'axon ensheathment',
 'biological_process',
 'cellular process',
 'developmental process',
 'ensheathment of neurons',
 'multicellular organism development',
 'multicellular organismal process',
 'nervous system development',
 'system development']
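
The call above returns superterms even though it uses `nx.descendants`, because edges in an obonet graph run from subterm to superterm. A minimal sketch on a toy graph makes the direction explicit; the node names here are hypothetical and are not real GO terms.

```python
import networkx as nx

# Hypothetical toy ontology; edges run subterm -> superterm, as in obonet graphs.
toy = nx.MultiDiGraph()
toy.add_edge("leaf", "middle", key="is_a")
toy.add_edge("middle", "root", key="is_a")
toy.add_edge("other_leaf", "root", key="is_a")

# Following edges forward from a term reaches its superterms.
superterms = sorted(nx.descendants(toy, "leaf"))
# Following edges backward from a term reaches its subterms.
subterms = sorted(nx.ancestors(toy, "root"))

print(superterms)
print(subterms)
```

So `nx.descendants` answers "what is this term a kind of or part of", and `nx.ancestors` answers "which terms fall under this one".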

More tests

print(graph.nodes['GO:0008150'])
id_to_name['GO:0008150']
name_to_id['biological_process']
{'name': 'biological_process', 'namespace': 'biological_process', 'alt_id': ['GO:0000004', 'GO:0007582', 'GO:0044699'], 'def': '"A biological process is the execution of a genetically-encoded biological module or program. It consists of all the steps required to achieve the specific biological objective of the module. A biological process is accomplished by a particular set of molecular functions carried out by specific gene products (or macromolecular complexes), often in a highly regulated manner and in a particular temporal sequence." [GOC:pdt]', 'comment': "Note that, in addition to forming the root of the biological process ontology, this term is recommended for use for the annotation of gene products whose biological process is unknown. When this term is used for annotation, it indicates that no information was available about the biological process of the gene product annotated as of the date the annotation was made; the evidence code 'no data' (ND), is used to indicate this.", 'subset': ['goslim_candida', 'goslim_chembl', 'goslim_metagenomics', 'goslim_pir', 'goslim_plant', 'goslim_pombe', 'goslim_yeast'], 'synonym': ['"biological process" EXACT []', '"physiological process" EXACT []', '"single organism process" RELATED []', '"single-organism process" RELATED []'], 'xref': ['Wikipedia:Biological_process'], 'property_value': ['term_tracker_item https://github.com/geneontology/go-ontology/issues/24968 xsd:anyURI'], 'created_by': 'jl', 'creation_date': '2012-09-19T15:05:24Z'}
'GO:0008150'

Select only *Biological process*

Ontology = sorted(superterm for superterm in nx.ancestors(graph, name_to_id['biological_process']))
len(Ontology)
30049
graph.nodes[Ontology[0]]
{'name': 'mitochondrion inheritance',
 'namespace': 'biological_process',
 'def': '"The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]',
 'synonym': ['"mitochondrial inheritance" EXACT []'],
 'is_a': ['GO:0048308', 'GO:0048311']}
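
Selecting by reachability keeps every term that has any path to the root, whatever the relationship type. If the graph contains relationship edges that cross namespaces, the result may include terms from outside biological_process. The sketch below shows one way to check for this by filtering on the `namespace` attribute. It uses a hypothetical toy graph, and whether such cross-namespace edges occur in the downloaded file is an assumption to verify rather than a stated fact.

```python
import networkx as nx

# Hypothetical toy graph: one term from another namespace links into the BP branch.
toy = nx.MultiDiGraph()
toy.add_node("BP:root", namespace="biological_process")
toy.add_node("BP:child", namespace="biological_process")
toy.add_node("MF:term", namespace="molecular_function")
toy.add_edge("BP:child", "BP:root", key="is_a")
toy.add_edge("MF:term", "BP:child", key="part_of")  # cross-namespace relationship

# Everything with a path to the root, regardless of namespace.
reachable = sorted(nx.ancestors(toy, "BP:root"))

# Keep only the terms whose own namespace is biological_process.
bp_only = [
    term for term in reachable
    if toy.nodes[term].get("namespace") == "biological_process"
]

print(reachable)
print(bp_only)
```

Comparing the lengths of the two lists on the real graph would show whether the extra filter changes anything.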

Write to .tsv file

Calculate the distance of each term from the root ('biological_process': 'GO:0008150') and save the ontology in a tabular format.
  1. Load the ontology. Here it is read from the OBO file with obonet, which returns it as a networkx graph.

  2. Find the root node. The root is the term that every other term in the selection descends from. In this case, the root is 'biological_process': 'GO:0008150'.

  3. Calculate the distance of each node from the root. The distance is the minimum number of edges that must be traversed to reach the root. Because the edges are unweighted, a breadth-first search gives this shortest distance; a depth-first search does not guarantee it.

  4. Save the ontology in a tabular format. The output here is a tab-separated (TSV) file. Each definition is split into sentences, and every unique sentence longer than three words is written as its own row with the following columns:

    • GO ID

    • Term name

    • Definition

    • Distance from root (depth)

The following is an example of a tabular representation of the Ontology:

| GO ID | Term Name | Definition | Distance from Root |
|---|---|---|---|
| GO:0000001 | mitochondrion | Organelle.. | 5 |
| GO:0000002 | mitochondrial | Part of.. | 4 |
| GO:0000003 | reproduction | Biological process… | 1 |
| GO:0000004 | biological | Process… | 0 |
| GO:0000005 | molecular | Function… | 0 |
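
The distance calculation described in the steps above can be sketched with a plain breadth-first search. This is an illustration under stated assumptions, not the notebook's own method: the toy terms and the helper `distances_from_root` are hypothetical, and the ontology is reduced to a dictionary that maps each term to its direct superterms.

```python
from collections import deque

# Hypothetical toy ontology: each term maps to its direct superterms.
# The term names are made up for illustration and are not real GO IDs.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["c", "root"],  # two routes to the root: d -> root and d -> c -> a -> root
}

def distances_from_root(parents, root):
    """Return {term: fewest edges from term up to root} via breadth-first search."""
    # Invert the child -> parent map so the search can start at the root.
    children = {term: [] for term in parents}
    for child, supers in parents.items():
        for parent in supers:
            children[parent].append(child)

    dist = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children[term]:
            if child not in dist:  # the first visit is along a shortest route
                dist[child] = dist[term] + 1
                queue.append(child)
    return dist

dist = distances_from_root(PARENTS, "root")
print(dist)
```

A single search from the root yields every distance in one pass, whereas querying each term separately repeats much of the same traversal.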

%%time

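out_file = "GO_BP_Ontology_corups.tsv"
root = "GO:0008150"  # biological_process
seen = set()  # definition sentences already written, shared across all terms

def GO_def(GO, Parent, fh):
    """Write one row per unique definition sentence of the term GO."""
    name = graph.nodes[GO]["name"]
    # Edges run from subterm to superterm, so this is the fewest edges from GO up to Parent.
    Depth = nx.shortest_path_length(graph, GO, Parent)
    if not name.endswith("."):
        name += "."
    defs = [""]
    if "def" in graph.nodes[GO]:
        # The definition text is the quoted part of the 'def' field; split it into sentences.
        defs = graph.nodes[GO]["def"].split('"')[1].split(". ")
    for Def in defs:
        if not Def.endswith("."):
            Def += "."
        # Skip duplicate sentences and fragments of three words or fewer.
        if Def not in seen and len(Def.split()) > 3:
            seen.add(Def)
            print(GO, name, Def, Depth, sep="\t", file=fh)

# The context manager closes the file even if a term raises an error.
with open(out_file, "w") as fh:
    print("GO", "Name", "Definition", "Depth", sep="\t", file=fh)
    for GO in Ontology:
        GO_def(GO, root, fh)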
CPU times: user 11.2 s, sys: 26.6 ms, total: 11.2 s
Wall time: 11.2 s
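
To check the result, the file can be read back with the standard `csv` module. The sketch below is a minimal example under stated assumptions: `read_ontology_tsv` is a hypothetical helper, and it runs against a small stand-in file with the same header and placeholder rows, not against the real output.

```python
import csv
import os
import tempfile

def read_ontology_tsv(path):
    """Read the TSV into a list of dicts, converting Depth to int."""
    with open(path, newline="") as fh:
        # QUOTE_NONE treats quote characters inside definitions as ordinary text.
        reader = csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = list(reader)
    for row in rows:
        row["Depth"] = int(row["Depth"])
    return rows

# Stand-in file with the same header as the real output; the rows are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.tsv")
    with open(path, "w") as fh:
        print("GO", "Name", "Definition", "Depth", sep="\t", file=fh)
        print("GO:X000001", "toy term.", "A made up definition sentence.", 2, sep="\t", file=fh)
        print("GO:X000002", "another toy term.", "Another made up definition sentence.", 3, sep="\t", file=fh)
    rows = read_ontology_tsv(path)

print(len(rows), rows[0]["GO"], rows[0]["Depth"])
```

Pointing `read_ontology_tsv` at `GO_BP_Ontology_corups.tsv` should work the same way, since the writer above emits plain tab-separated fields without quoting.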