Gene regulatory network inference

import os
import re

import pandas as pd
from collections import defaultdict

from arboreto.algo import grnboost2, genie3
from arboreto.utils import load_tf_names
from distributed import LocalCluster, Client
tfdf = pd.read_csv("Auxiliary_File/Arabidopsis_TF and family.csv")
tf_names = list(set(tfdf['Protein ID'].values.tolist()))
len(tf_names)
2192

Uncut

# Exp = pd.read_csv("1_Expression_data/Expr_Uncut.csv")
# Exp.T

ex_matrix = pd.read_csv("1_Expression_data/Expr_Uncut.csv", sep=',', index_col=0).T
ex_matrix.head()
Locus AT1G01010 AT1G01020 AT1G01030 AT1G01040 AT1G01046 AT1G01050 AT1G01060 AT1G01070 AT1G01073 AT1G01080 ... ATMG01330 ATMG01340 ATMG01350 ATMG01360 ATMG01370 ATMG01380 ATMG01390 ATMG01400 ATMG01410 CFP
wolsc_kb2_4_10 7.702431 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 10.359552
wolsc_kb2_4_1 0.000000 8.378906 0.0 0.0 0.0 10.810870 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000
wolsc_kb2_4_18 0.000000 0.000000 0.0 0.0 0.0 9.858738 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000
wolsc_kb2_4_22 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.388047
wolsc_kb2_4_26 0.000000 0.000000 0.0 0.0 0.0 5.779188 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.552472

5 rows × 32679 columns
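
Before launching a multi-hour run, it can help to check how many of the TF identifiers actually appear as columns of the expression matrix, and how many genes are zero in every cell. The following is a minimal sketch on toy data and is not part of the original notebook: `prepare_inputs` is a hypothetical helper, the toy values are illustrative, and dropping all-zero genes is an optional assumption that changes the set of candidate targets.

```python
import pandas as pd


def prepare_inputs(ex_matrix, tf_names):
    """Hypothetical helper: drop all-zero genes and keep only TFs with a column."""
    # Keep genes (columns) that are non-zero in at least one cell (row).
    expressed = ex_matrix.loc[:, (ex_matrix != 0).any(axis=0)]
    # Keep TF identifiers that are still present as columns.
    present_tfs = sorted(set(tf_names) & set(expressed.columns))
    return expressed, present_tfs


# Toy matrix in the same orientation as ex_matrix: cells as rows, genes as columns.
toy_matrix = pd.DataFrame(
    {
        "AT1G01010": [7.7, 0.0, 0.0],
        "AT1G01030": [0.0, 0.0, 0.0],   # zero in every cell
        "AT1G01050": [0.0, 10.8, 9.9],
    },
    index=["cell_a", "cell_b", "cell_c"],
)
toy_tfs = ["AT1G01010", "AT1G01030", "AT9G99999"]  # last ID is made up

filtered_matrix, filtered_tfs = prepare_inputs(toy_matrix, toy_tfs)
print(filtered_matrix.shape, filtered_tfs)
```

With the real data, the same call would be `prepare_inputs(ex_matrix, tf_names)`, and the length of the returned TF list shows how many of the 2192 identifiers are usable as regulators.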

%%time
# tfdf = pd.read_csv("masterTF-target.txt", sep="\t")
# tf_names = list(set(tfdf.TF.values.tolist()))
# len(tf_names)

# ex_matrix = pd.read_csv("1_Expression_data/GSE10576_Fe_arboreto.tsv", sep='\t')
# ex_matrix.head()


local_cluster = LocalCluster(n_workers=10,
                             threads_per_worker=1,
                             memory_limit=8e9)
custom_client = Client(local_cluster)

network = grnboost2(expression_data=ex_matrix,
                    tf_names=tf_names, verbose=True, client_or_address=custom_client)

network.to_csv('3_GRN_data/GSE74488_Uncut_arboreto_regnet.tsv', sep='\t', index=False)

network.head()
preparing dask client
parsing input
creating dask graph
10 partitions
computing dask graph
not shutting down client, client was created externally
finished
CPU times: total: 47min 22s
Wall time: 2h 5min 40s
TF target importance
287 AT1G34190 AT1G54150 123.624165
280 AT1G33240 AT1G32640 118.439476
1375 AT4G01120 AT3G02180 116.013558
1837 AT5G20900 AT2G46140 115.321538
41 AT1G04990 AT2G42230 112.942165
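
The log above ends with "not shutting down client, client was created externally", so the ten workers stay alive after the run finishes. One way to release them before starting the next condition is to wrap cluster and client creation in a context manager. This is a sketch under assumptions, not the notebook's own method: `managed_client` and `infer_network` are hypothetical names, and the cluster settings are copied from the cell above.

```python
from contextlib import contextmanager


@contextmanager
def managed_client(make_cluster, make_client):
    """Hypothetical helper: yield a client and always close client and cluster."""
    cluster = make_cluster()
    try:
        client = make_client(cluster)
        try:
            yield client
        finally:
            client.close()
    finally:
        cluster.close()


def infer_network(ex_matrix, tf_names):
    """Run GRNBoost2 on a local cluster that is shut down when the run ends."""
    # Imported here so managed_client can be used without these packages.
    from arboreto.algo import grnboost2
    from distributed import Client, LocalCluster

    def make_cluster():
        return LocalCluster(n_workers=10, threads_per_worker=1, memory_limit=8e9)

    with managed_client(make_cluster, Client) as client:
        return grnboost2(expression_data=ex_matrix, tf_names=tf_names,
                         verbose=True, client_or_address=client)
```

Calling `network = infer_network(ex_matrix, tf_names)` once per condition would then start each run on a fresh cluster, even if the previous run raised an exception.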

3hpc

ex_matrix = pd.read_csv("1_Expression_data/Expr_3hpc.csv", sep=',', index_col=0).T
ex_matrix.head()
Locus AT1G01010 AT1G01020 AT1G01030 AT1G01040 AT1G01046 AT1G01050 AT1G01060 AT1G01070 AT1G01073 AT1G01080 ... ATMG01330 ATMG01340 ATMG01350 ATMG01360 ATMG01370 ATMG01380 ATMG01390 ATMG01400 ATMG01410 CFP
sc_1228_pa_30 0.000000 0.000000 0.0 0.0 0.0 5.380400 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.009010
wolsc_kb2_3_13 0.000000 7.092747 0.0 0.0 0.0 10.358681 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000
wolsc_kb2_3_14 0.000000 5.949744 0.0 0.0 0.0 9.528396 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.615233
wolsc_kb2_3_2 3.829904 7.912041 0.0 0.0 0.0 10.317229 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 11.349404
wolsc_kb2_3_27 0.000000 0.000000 0.0 0.0 0.0 8.714814 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 8.626259

5 rows × 32679 columns

local_cluster = LocalCluster(n_workers=10,
                             threads_per_worker=1,
                             memory_limit=8e9)
custom_client = Client(local_cluster)

network = grnboost2(expression_data=ex_matrix,
                    tf_names=tf_names, verbose=True, client_or_address=custom_client)

network.to_csv('3_GRN_data/GSE74488_3hpc_arboreto_regnet.tsv', sep='\t', index=False)

network.head()
preparing dask client
parsing input
creating dask graph
10 partitions
computing dask graph


2022-09-26 15:38:38,710 - distributed.scheduler - WARNING - Worker failed to heartbeat within 300 seconds. Closing: <WorkerState 'tcp://127.0.0.1:59010', name: 3, status: running, memory: 1294, processing: 2851>
2022-09-26 15:38:40,154 - distributed.scheduler - WARNING - Worker failed to heartbeat within 300 seconds. Closing: <WorkerState 'tcp://127.0.0.1:59016', name: 2, status: running, memory: 1290, processing: 4233>
2022-09-26 15:38:43,509 - distributed.scheduler - WARNING - Worker failed to heartbeat within 300 seconds. Closing: <WorkerState 'tcp://127.0.0.1:59022', name: 0, status: running, memory: 1312, processing: 2648>
2022-09-26 15:38:44,025 - distributed.scheduler - WARNING - Worker failed to heartbeat within 300 seconds. Closing: <WorkerState 'tcp://127.0.0.1:59028', name: 1, status: running, memory: 1181, processing: 4211>
2022-09-26 15:38:44,917 - distributed.scheduler - WARNING - Worker failed to heartbeat within 300 seconds. Closing: <WorkerState 'tcp://127.0.0.1:59034', name: 5, status: running, memory: 861, processing: 5912>
2022-09-26 15:38:46,231 - distributed.scheduler - WARNING - Received heartbeat from unregistered worker 'tcp://127.0.0.1:59010'.
2022-09-26 15:38:46,237 - distributed.scheduler - WARNING - Received heartbeat from unregistered worker 'tcp://127.0.0.1:59028'.
2022-09-26 15:38:46,240 - distributed.scheduler - WARNING - Received heartbeat from unregistered worker 'tcp://127.0.0.1:59034'.
2022-09-26 15:38:46,248 - distributed.scheduler - WARNING - Received heartbeat from unregistered worker 'tcp://127.0.0.1:59022'.
2022-09-26 15:38:46,252 - distributed.scheduler - WARNING - Received heartbeat from unregistered worker 'tcp://127.0.0.1:59016'.
2022-09-26 15:38:49,243 - distributed.nanny - WARNING - Restarting worker
2022-09-26 15:38:49,265 - distributed.nanny - WARNING - Restarting worker
2022-09-26 15:38:49,274 - distributed.nanny - WARNING - Restarting worker
2022-09-26 15:38:49,286 - distributed.nanny - WARNING - Restarting worker
2022-09-26 15:38:49,957 - distributed.nanny - WARNING - Restarting worker


not shutting down client, client was created externally
finished
TF target importance
421 AT1G62990 AT4G09990 123.463492
421 AT1G62990 AT3G18660 96.355794
902 AT2G42680 AT4G32470 91.878618
2006 AT5G49450 AT5G49448 89.479224
660 AT2G18160 AT2G18162 88.655565
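
The edge list above has one row per TF-target pair, so a TF such as AT1G62990 can appear on several rows. A common next step is to keep edges above an importance cutoff and group the targets by regulator. The sketch below uses only the five rows printed above; `targets_by_tf` is a hypothetical helper, and the cutoff of 90 is arbitrary and chosen only for illustration, since the notebook does not specify one.

```python
from collections import defaultdict

import pandas as pd

# The five rows printed by network.head() for the 3hpc run.
network = pd.DataFrame({
    "TF": ["AT1G62990", "AT1G62990", "AT2G42680", "AT5G49450", "AT2G18160"],
    "target": ["AT4G09990", "AT3G18660", "AT4G32470", "AT5G49448", "AT2G18162"],
    "importance": [123.463492, 96.355794, 91.878618, 89.479224, 88.655565],
})


def targets_by_tf(edges, min_importance):
    """Hypothetical helper: map each TF to its targets at or above a cutoff."""
    kept = edges[edges["importance"] >= min_importance]
    regulons = defaultdict(list)
    for tf, target in zip(kept["TF"], kept["target"]):
        regulons[tf].append(target)
    return dict(regulons)


regulons = targets_by_tf(network, min_importance=90.0)  # arbitrary cutoff
print(regulons)
```

The same function could be applied to the full table read back from `3_GRN_data/GSE74488_3hpc_arboreto_regnet.tsv` with `pd.read_csv(..., sep='\t')`.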
