# CentralityCosDist Tutorial

## The CentralityCosDist algorithm has the following steps:
1. The CentralityCosDist algorithm takes a network and a list of seeds as input.
2. It calculates the centrality of each node in the network using multiple centrality measures.
3. It then calculates the cosine similarity between the centrality values of each seed and those of every other node.
4. For each node, it calculates the mean of its similarity scores across all seed nodes.
5. It ranks all nodes based on the mean similarity score.
6. It sorts the rankings and returns them.

## Pseudocode of CentralityCosDist

```
# CentralityCosDist

# Input: Network, seeds

# Output: Rankings of nodes

# Initialize rankings
rankings = []

# For each seed:
for seed in seeds:

  # Perform multiple centrality analyses on the network
  centralities = [
    centrality(network, seed)
    for centrality in CENTRALITY_MEASURES
  ]

  # Determine the cosine similarity among seeds
  seed_similarities = cosine_similarity(centralities)

  # (Optional) eliminate seeds that are highly dissimilar to the majority of the other seeds
  if ELIMINATE_SEEDS:
    seed_similarities = eliminate_seeds(seed_similarities)

  # Determine the cosine similarity between the chosen seed and all other nodes
  node_similarities = cosine_similarity(centralities, seed)

  # Calculate the mean similarity of all nodes from all seed nodes
  mean_score = mean_similarity(node_similarities)

  # Rank all nodes based on the mean similarity score
  rankings.append(mean_score)

# Sort the rankings
rankings.sort()

# Return the rankings
return rankings
```
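
The six steps listed earlier can be sketched in a few lines of NumPy and pandas. This is a minimal illustration under stated assumptions, not the package's implementation: the function name `rank_by_seed_similarity` and the toy centrality table are hypothetical, each node is represented by its row of centrality values, seeds are assumed to be present in the table, and nodes are ranked by mean cosine similarity to the seeds in descending order.

```python
import numpy as np
import pandas as pd


def rank_by_seed_similarity(centralities: pd.DataFrame, seeds: list) -> pd.DataFrame:
    """Rank nodes by the mean cosine similarity of their centrality rows to the seeds.

    Hypothetical helper for illustration only. `centralities` has one row per
    node and one column per centrality measure.
    """
    values = centralities.to_numpy(dtype=float)
    # Scale each row to unit length so that a dot product equals cosine similarity.
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows get similarity 0 instead of a division error
    unit = values / norms
    # Pick out the rows that belong to the seeds (assumes every seed is in the index).
    seed_unit = unit[centralities.index.get_indexer(seeds)]
    # similarity[i, j] = cosine similarity between node i and seed j.
    similarity = unit @ seed_unit.T
    result = pd.DataFrame(
        {"Similarity_score": similarity.mean(axis=1)}, index=centralities.index
    )
    # Highest mean similarity gets rank 1.
    result["Rank"] = result["Similarity_score"].rank(ascending=False)
    return result.sort_values("Rank")


# Toy data: four nodes described by two made-up centrality columns.
toy = pd.DataFrame(
    {"degree": [0.9, 0.8, 0.1, 0.2], "betweenness": [0.1, 0.2, 0.9, 0.8]},
    index=["A", "B", "C", "D"],
)
ranking = rank_by_seed_similarity(toy, seeds=["A"])
print(ranking)
```

With `A` as the only seed, nodes whose centrality profile points in a similar direction to `A` (here `B`) rank ahead of nodes with a different profile (`C` and `D`).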

## Network centrality analysis

Network centrality analysis is a way of measuring the importance of nodes in a network. There are many different centrality measures, but some of the most common ones include:
- Degree centrality: This measures how many nodes a node is connected to.
- Betweenness centrality: This measures how often a node lies on the shortest path between other nodes.
- Closeness centrality: This measures how close a node is to all other nodes.

There are many common tools for network centrality analysis. Some of the most popular ones include:

- NetworkX is a Python library for analyzing graphs and networks. It has a number of functions that can be used to calculate centrality measures, visualize networks, and perform other network analysis tasks.
- Gephi is open-source software for visualizing and analyzing networks. It has a user-friendly interface that makes it easy to create and explore networks.
- Cytoscape is open-source software for visualizing and analyzing networks. It has a variety of features that make it a powerful tool for network analysis.
- R is a programming language and environment for statistical computing and graphics. It has a number of packages (e.g., igraph) that can be used for network analysis.
- MATLAB is a programming language and environment for scientific computing. It has a number of functions that can be used for network analysis.

Here is how to calculate several centrality measures for a graph and export them as a CSV file using NetworkX:

```
import csv

import networkx as nx

# Create a graph
graph = nx.Graph()

# Add some nodes
graph.add_nodes_from([1, 2, 3, 4])

# Add some edges (a simple path 1-2-3-4)
graph.add_edges_from([(1, 2), (2, 3), (3, 4)])

# Calculate the centrality measures; each call returns a dict of node -> score
centralities = {
    'Degree Centrality': nx.degree_centrality(graph),
    'Betweenness Centrality': nx.betweenness_centrality(graph),
    'Closeness Centrality': nx.closeness_centrality(graph),
}

# Write the centrality measures to a CSV file, one row per node
with open('centralities.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    writer.writerow(['Node'] + list(centralities))
    for node in graph.nodes:
        writer.writerow([node] + [scores[node] for scores in centralities.values()])
```

This will create a CSV file called centralities.csv that contains the following columns:
- Node
- Degree Centrality
- Betweenness Centrality
- Closeness Centrality

The following network centrality file is used in this tutorial.

In [1]:
import pandas as pd
from IPython.display import HTML

Network_Centrality_File = "data/Network_Centrality.csv"
Seed_File = "data/Seeds.tsv"


df_centralites = pd.read_csv(Network_Centrality_File)
display(df_centralites.head(5))

Unnamed: 0,ID,Information_centrality,Degree_centrality,Betweenness_centrality,Eigenvector_centrality,Closeness_centrality,clustering_coefficient,Load_centrality,Page_rank
0,AT5G13650,0.000147,0.000369,0.000509,1.836035e-07,0.053561,0.0,0.000509,0.000133
1,AT5G65360,0.000243,0.007556,0.001312,0.1591563,0.054834,0.273171,0.001209,0.000353
2,AT5G14030,0.000105,0.000184,0.0,8.635333e-13,0.037856,0.0,0.0,7.6e-05
3,AT3G48070,0.000242,0.002396,0.000404,1.440151e-06,0.057145,0.0,0.000461,0.000247
4,AT4G35590,9.9e-05,0.000369,0.00017,1.255741e-09,0.041956,0.0,0.00017,0.000181


## Seed node list

Seed nodes are a subset of nodes in a network that are used to start the ranking process. The algorithm then ranks the remaining nodes based on their relationship to the seed nodes.

Seed nodes can be chosen in a variety of ways. Some common methods include:
- [x] **Choosing the nodes with the highest biological process significance.**
- [x] Choosing the nodes with the highest degree centrality.
- [x] Choosing the nodes with the highest betweenness centrality.
- [x] Choosing the nodes that are connected to the most other nodes.
- [x] Choosing the nodes that are connected to the most important nodes.


The choice of seed nodes can have a significant impact on the accuracy of the ranking algorithm.

Before we move forward, we need to filter out seed nodes for which we don't have centrality information. The centrality measures are what the algorithm uses to rank nodes, so a seed without centrality values cannot take part in the ranking. We filter the seeds by keeping only those that also appear in the `ID` column of the centrality file, which ensures that every remaining seed has centrality information.

In [2]:
with open(Seed_File) as seed_handle:
    Seeds = set(seed_handle.read().splitlines()[1:]) # [1:] to remove header
Seeds

{'AT1G09100',
 'AT1G09770',
 'AT1G63290',
 'AT3G01850',
 'AT3G03900',
 'AT3G05530',
 'AT3G51840',
 'AT5G08670',
 'AT5G17310',
 'ATCG00480'}

In [3]:
Nodes = set(df_centralites.ID.to_list())
Seeds = list(Nodes.intersection(Seeds))
Seeds

['AT1G09770',
 'AT3G01850',
 'AT5G17310',
 'AT3G51840',
 'AT3G03900',
 'AT5G08670',
 'AT3G05530',
 'AT1G63290',
 'AT1G09100',
 'ATCG00480']
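
In this run all ten seeds are present in the centrality file, so nothing is dropped. When working with other seed lists it can help to see which seeds were removed by the filter. The sketch below uses a set difference for that; the identifiers and variable names are toy values chosen for illustration and are not taken from the tutorial's data files.

```python
# Toy identifiers for illustration only (hypothetical, not from the tutorial data).
nodes_with_centrality = {"N1", "N2", "N3"}
seed_candidates = {"N2", "N3", "N9"}

# Seeds that can be ranked: present in both sets.
kept = sorted(seed_candidates & nodes_with_centrality)
# Seeds with no centrality information: present only in the seed list.
dropped = sorted(seed_candidates - nodes_with_centrality)

print("kept:", kept)        # kept: ['N2', 'N3']
print("dropped:", dropped)  # dropped: ['N9']
```

Sorting the results gives a stable order, whereas converting a set directly to a list does not guarantee one.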

## CentralityCosDist

### Load CentralityCosDist and create a new instance

In [4]:
from centralitycosdist import CentralityCosDist
algorithm = CentralityCosDist(Centrality_file=Network_Centrality_File)

### Execute CentralityCosDist

In [5]:
algorithm.run(seed_nodes=Seeds)

### Get ranks

In [6]:
df_rank = algorithm.rank
display(df_rank.head(10))

ID
AT3G03900     1.0
AT1G09100     2.0
AT3G51840     3.0
AT3G05530     4.0
AT5G17310     5.0
ATCG00480     6.0
AT5G08670     7.5
AT5G08680     7.5
AT5G08690     9.0
AT5G19680    10.0
Name: Rank, dtype: float64

### Get similarity score

In [7]:
display(algorithm.similarity_score.head(10))

ID
AT3G03900    0.984956
AT1G09100    0.984796
AT3G51840    0.983411
AT3G05530    0.981703
AT5G17310    0.977869
ATCG00480    0.975787
AT5G08670    0.973695
AT5G08680    0.973695
AT5G08690    0.971715
AT5G19680    0.970025
Name: Similarity_score, dtype: float64
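
The tied ranks of 7.5 for AT5G08670 and AT5G08680 line up with their identical similarity scores (0.973695). The sketch below reproduces the ranks shown earlier from these ten scores, assuming descending order with ties given the average of their positions (the default `method="average"` of pandas `Series.rank`). Whether the package computes its ranks this way internally is an assumption.

```python
import pandas as pd

# Similarity scores copied from the output above.
scores = pd.Series(
    {
        "AT3G03900": 0.984956,
        "AT1G09100": 0.984796,
        "AT3G51840": 0.983411,
        "AT3G05530": 0.981703,
        "AT5G17310": 0.977869,
        "ATCG00480": 0.975787,
        "AT5G08670": 0.973695,
        "AT5G08680": 0.973695,
        "AT5G08690": 0.971715,
        "AT5G19680": 0.970025,
    },
    name="Similarity_score",
)

# Highest score gets rank 1; tied scores share the mean of their positions.
ranks = scores.rank(ascending=False, method="average").rename("Rank")
print(ranks)
```

The two tied nodes occupy positions 7 and 8, so each receives (7 + 8) / 2 = 7.5, and the next node is ranked 9.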

### Check the ranks of seed nodes

In [8]:
display(df_rank.loc[list(Seeds)])

ID
AT1G09770    25.0
AT3G01850    11.5
AT5G17310     5.0
AT3G51840     3.0
AT3G03900     1.0
AT5G08670     7.5
AT3G05530     4.0
AT1G63290    11.5
AT1G09100     2.0
ATCG00480     6.0
Name: Rank, dtype: float64

In [9]:
import session_info
session_info.show()

🔚