Preprocessing of Ontology file
GO Biological Process
In bioinformatics, the Gene Ontology (GO) is a standardized vocabulary that is used to describe the functions of genes and their products.
GO_BP_Ontology_corups.tsv
Load Python libraries
import networkx as nx
import obonet
import re
Download and load the ontology
%%time
url = 'http://current.geneontology.org/ontology/go.obo'
graph = obonet.read_obo(url)
CPU times: user 5.87 s, sys: 203 ms, total: 6.07 s
Wall time: 6.72 s
Map ontology names and IDs
id_to_name = {id_: data.get('name') for id_, data in graph.nodes(data=True)}
name_to_id = {data['name']: id_ for id_, data in graph.nodes(data=True) if 'name' in data}
print(len(graph))
print(graph.number_of_edges())
43093
87006
Test the mapped dictionary
Find parents
sorted(id_to_name[superterm] for superterm in nx.descendants(graph, 'GO:0042552'))
['anatomical structure development',
'axon ensheathment',
'biological_process',
'cellular process',
'developmental process',
'ensheathment of neurons',
'multicellular organism development',
'multicellular organismal process',
'nervous system development',
'system development']
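The call above uses nx.descendants to obtain superterms, which can look backwards at first. It works because obonet builds the graph with edges pointing from child to parent. The following is a minimal sketch on a toy graph; the term names ("root", "child", "grandchild", "sibling") are hypothetical and are not GO terms.

```python
import networkx as nx

# Toy hierarchy with hypothetical term names.
# Edges point child -> parent, the same direction obonet uses.
toy = nx.DiGraph()
toy.add_edge("grandchild", "child")
toy.add_edge("child", "root")
toy.add_edge("sibling", "root")

# Following the edges forward from a term reaches its superterms.
superterms = sorted(nx.descendants(toy, "grandchild"))

# Following the edges backward from a term reaches its subterms.
subterms = sorted(nx.ancestors(toy, "root"))

print(superterms)
print(subterms)
```

With this edge direction, nx.descendants answers "what is this term a kind of?" and nx.ancestors answers "what falls under this term?".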
More tests
print(graph.nodes['GO:0008150'])
id_to_name['GO:0008150'],
name_to_id['biological_process']
{'name': 'biological_process', 'namespace': 'biological_process', 'alt_id': ['GO:0000004', 'GO:0007582', 'GO:0044699'], 'def': '"A biological process is the execution of a genetically-encoded biological module or program. It consists of all the steps required to achieve the specific biological objective of the module. A biological process is accomplished by a particular set of molecular functions carried out by specific gene products (or macromolecular complexes), often in a highly regulated manner and in a particular temporal sequence." [GOC:pdt]', 'comment': "Note that, in addition to forming the root of the biological process ontology, this term is recommended for use for the annotation of gene products whose biological process is unknown. When this term is used for annotation, it indicates that no information was available about the biological process of the gene product annotated as of the date the annotation was made; the evidence code 'no data' (ND), is used to indicate this.", 'subset': ['goslim_candida', 'goslim_chembl', 'goslim_metagenomics', 'goslim_pir', 'goslim_plant', 'goslim_pombe', 'goslim_yeast'], 'synonym': ['"biological process" EXACT []', '"physiological process" EXACT []', '"single organism process" RELATED []', '"single-organism process" RELATED []'], 'xref': ['Wikipedia:Biological_process'], 'property_value': ['term_tracker_item https://github.com/geneontology/go-ontology/issues/24968 xsd:anyURI'], 'created_by': 'jl', 'creation_date': '2012-09-19T15:05:24Z'}
'GO:0008150'
Select only *Biological process*
Ontology = sorted(subterm for subterm in nx.ancestors(graph, name_to_id['biological_process']))
len(Ontology)
30049
graph.nodes[Ontology[0]]
{'name': 'mitochondrion inheritance',
'namespace': 'biological_process',
'def': '"The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]',
'synonym': ['"mitochondrial inheritance" EXACT []'],
'is_a': ['GO:0048308', 'GO:0048311']}
Write to .tsv file
Calculate the distance of each node from the parent ('biological_process': 'GO:0008150') and save the Ontology in a tabular format:
Load the Ontology. The Ontology can be loaded from a variety of formats, such as JSON, XML, or text (here, the OBO file read above). Once the Ontology is loaded, it can be represented as a graph.
Find the parent node. The parent node is the root from which all other biological process terms descend. In this case, the parent node is 'biological_process': 'GO:0008150'.
Calculate the distance of each node from the parent. The distance of a node from the parent is the minimum number of edges that must be traversed to reach the parent node. Because the edges are unweighted, breadth-first search gives this distance directly; depth-first search does not guarantee the shortest path.
Save the Ontology in a tabular format. The Ontology can be saved in a variety of formats, such as TSV, CSV, JSON, or XML. The tabular format will include the following columns:
Node ID
Node name
Definition
Distance from parent
The following is an example of a tabular representation of the Ontology:
GO ID | Term Name | Definition | Distance from Root |
---|---|---|---|
GO:0000001 | mitochondrion | Organelle.. | 5 |
GO:0000002 | mitochondrial | Part of.. | 4 |
GO:0000003 | reproduction | Biological process… | 1 |
GO:0000004 | biological | Process… | 0 |
GO:0000005 | molecular | Function… | 0 |
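%%time
out_file = "GO_BP_Ontology_corups.tsv"
seen = set()  # definition sentences already written, used to skip duplicates

def GO_def(go_id, parent, fh):
    """Write one row per new definition sentence of go_id to fh."""
    name = graph.nodes[go_id]["name"]
    if not name.endswith("."):
        name += "."
    # Edges point from child to parent, so this counts the steps up to parent.
    depth = nx.shortest_path_length(graph, go_id, parent)
    defs = [""]
    if "def" in graph.nodes[go_id]:
        # The definition text is the first double-quoted field of 'def'.
        defs = graph.nodes[go_id]["def"].split('"')[1].split(". ")
    for sentence in defs:
        if not sentence.endswith("."):
            sentence += "."
        # Keep only unseen sentences with more than three words.
        if sentence not in seen and len(sentence.split()) > 3:
            seen.add(sentence)
            print(go_id, name, sentence, depth, sep="\t", file=fh)

with open(out_file, "w") as fh:
    print("GO", "Name", "Definition", "Depth", sep="\t", file=fh)
    for go_id in Ontology:
        GO_def(go_id, "GO:0008150", fh)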
CPU times: user 11.2 s, sys: 26.6 ms, total: 11.2 s
Wall time: 11.2 s
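To see what the definition handling in the cell above does to a single term, here is a small sketch that applies the same splitting steps to the def string printed earlier for 'biological_process'. The variable names (`raw_def`, `text`, `sentences`, `kept`) are chosen for illustration only.

```python
# The 'def' value printed earlier for GO:0008150: quoted text, then references.
raw_def = (
    '"A biological process is the execution of a genetically-encoded '
    'biological module or program. It consists of all the steps required to '
    'achieve the specific biological objective of the module. A biological '
    'process is accomplished by a particular set of molecular functions '
    'carried out by specific gene products (or macromolecular complexes), '
    'often in a highly regulated manner and in a particular temporal '
    'sequence." [GOC:pdt]'
)

# Take the text between the first pair of double quotes.
text = raw_def.split('"')[1]

# Split into sentences and restore the full stop removed by the split.
sentences = [s if s.endswith(".") else s + "." for s in text.split(". ")]

# Keep only sentences with more than three words, as in the cell above.
kept = [s for s in sentences if len(s.split()) > 3]

for sentence in kept:
    print(sentence)
```

Splitting on ". " is a simple heuristic, so definitions that contain abbreviations followed by a space may be cut at unintended places.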