
Commit 23f516e

Merge pull request #3683 from programminghistorian/Issue-3682 (Issue 3682)

2 parents 28fca41 + c1f23b3

File tree: 1 file changed (+8 −8 lines)


en/lessons/exploring-and-analyzing-network-data-with-python.md

Lines changed: 8 additions & 8 deletions
@@ -37,7 +37,7 @@ tested-date: 2023-08-21
 ## Lesson Goals
 
 In this tutorial, you will learn:
-- To use the [**NetworkX**](https://networkx.github.io/documentation/stable/index.html) package for working with network data in [**Python**](/lessons/introduction-and-installation); and
+- To use the [**NetworkX**](https://perma.cc/P9PX-GUE6) package for working with network data in [**Python**](/lessons/introduction-and-installation); and
 - To analyze humanities network data to find:
 - Network structure and path lengths,
 - Important or central nodes, and
@@ -169,7 +169,7 @@ G.add_nodes_from(node_names)
 G.add_edges_from(edges)
 ```
 
-This is one of several ways to add data to a network object. You can check out the [NetworkX documentation](https://networkx.github.io/documentation/stable/tutorial.html#adding-attributes-to-graphs-nodes-and-edges) for information about adding weighted edges, or adding nodes and edges one at a time.
+This is one of several ways to add data to a network object. You can check out the [NetworkX documentation](https://perma.cc/6N9D-RLKK) for information about adding weighted edges, or adding nodes and edges one at a time.
 
 Finally, you can get basic information about your newly-created network by printing the `G` variable:
 
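The context lines in this hunk show the lesson's bulk-loading pattern; a minimal self-contained sketch of it (with made-up node names standing in for the lesson's Quaker dataset) looks like this:

```python
import networkx as nx

# Hypothetical stand-ins for the lesson's node_names and edges lists
node_names = ["Fell", "Whitehead", "Fox", "Penn"]
edges = [("Fell", "Fox"), ("Fox", "Whitehead"), ("Fox", "Penn")]

G = nx.Graph()
G.add_nodes_from(node_names)  # add all nodes in one call
G.add_edges_from(edges)       # add all edges in one call

# Printing the graph gives a one-line summary; on NetworkX 3.x it reads
# something like "Graph with 4 nodes and 3 edges"
print(G)
print(G.number_of_nodes(), G.number_of_edges())  # 4 3
```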
@@ -359,7 +359,7 @@ print("Shortest path between Fell and Whitehead:", fell_whitehead_path)
 
 Depending on the size of your network, this could take a little while to calculate, since Python first finds all possible paths and then picks the shortest one. The output of `shortest_path` will be a list of the nodes that includes the "source" (Fell), the "target" (Whitehead), and the nodes between them. In this case, we can see that Quaker founder George Fox is on the shortest path between them. Since Fox is also a **hub** (see degree centrality, below) with many connections, we might suppose that several shortest paths run through him as a mediator. What might this say about the importance of the Quaker founders to their social network?
 
-Python includes many tools that calculate shortest paths. There are functions for the lengths of shortest paths, for all shortest paths, and for whether or not a path exists at all in the [documentation](https://networkx.github.io/documentation/stable/reference/algorithms/shortest_paths.html). You could use a separate function to find out the length of the Fell-Whitehead path we just calculated, or you could simply take the length of the list minus one,[^path] like this:
+Python includes many tools that calculate shortest paths. There are functions for the lengths of shortest paths, for all shortest paths, and for whether or not a path exists at all in the [documentation](https://perma.cc/3PMY-3S4F). You could use a separate function to find out the length of the Fell-Whitehead path we just calculated, or you could simply take the length of the list minus one,[^path] like this:
 
 ```python
 print("Length of that path:", len(fell_whitehead_path)-1)
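The shortest-path helpers this hunk links to can be sketched on a toy graph (hypothetical names, not the lesson's full dataset):

```python
import networkx as nx

# A small hypothetical graph standing in for the lesson's Quaker network
G = nx.Graph([("Fell", "Fox"), ("Fox", "Whitehead"), ("Fell", "Penn")])

path = nx.shortest_path(G, source="Fell", target="Whitehead")
print("Shortest path:", path)   # ['Fell', 'Fox', 'Whitehead']
print("Length:", len(path) - 1) # 2 edges

# Related helpers covered in the linked documentation
print(nx.shortest_path_length(G, "Fell", "Whitehead"))  # 2
print(nx.has_path(G, "Fell", "Whitehead"))              # True
```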
@@ -441,9 +441,9 @@ for d in sorted_degree[:20]:
 
 As you can see, Penn's degree is 18, relatively high for this network. But printing out this ranking information illustrates the limitations of degree as a centrality measure. You probably didn't need NetworkX to tell you that William Penn, Quaker leader and founder of Pennsylvania, was important. Most social networks will have just a few hubs of very high degree, with the rest of similar, much lower degree.[^power] Degree can tell you about the biggest hubs, but it can't tell you that much about the rest of the nodes. And in many cases, those hubs it's telling you about (like Penn or Quakerism co-founder Margaret Fell, with a degree of 13) are not especially surprising. In this case almost all of the hubs are founders of the religion or otherwise important political figures.
 
-Thankfully there are other centrality measures that can tell you about more than just hubs. [Eigenvector centrality](https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.eigenvector_centrality.html) is a kind of extension of degree---it looks at a combination of a node's edges and the edges of that node's neighbors. Eigenvector centrality cares if you are a hub, but it also cares how many hubs you are connected to. It's calculated as a value from 0 to 1: the closer to one, the greater the centrality. Eigenvector centrality is useful for understanding which nodes can get information to many other nodes quickly. If you know a lot of well-connected people, you could spread a message very efficiently. If you've used Google, then you're already somewhat familiar with eigenvector centrality: Google's PageRank algorithm uses an extension of this formula to decide which webpages get to the top of its search results.
+Thankfully there are other centrality measures that can tell you about more than just hubs. [Eigenvector centrality](https://perma.cc/P888-3FBU) is a kind of extension of degree---it looks at a combination of a node's edges and the edges of that node's neighbors. Eigenvector centrality cares if you are a hub, but it also cares how many hubs you are connected to. It's calculated as a value from 0 to 1: the closer to one, the greater the centrality. Eigenvector centrality is useful for understanding which nodes can get information to many other nodes quickly. If you know a lot of well-connected people, you could spread a message very efficiently. If you've used Google, then you're already somewhat familiar with eigenvector centrality: Google's PageRank algorithm uses an extension of this formula to decide which webpages get to the top of its search results.
 
-[Betweenness centrality](https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.betweenness_centrality.html) is a bit different from the other two measures in that it doesn't care about the number of edges any one node or set of nodes has. Betweenness centrality looks at all the **shortest paths** that pass through a particular node (see above). To do this, it must first calculate every possible shortest path in your network, so keep in mind that betweenness centrality will take longer to calculate than other centrality measures (but it won't be an issue in a dataset of this size). Betweenness centrality, which is also expressed on a scale of 0 to 1, is fairly good at finding nodes that connect two otherwise disparate parts of a network. If you're the only thing connecting two clusters, every communication between those clusters has to pass through you. In contrast to a hub, this sort of node is often referred to as a **broker**. Betweenness centrality is not the only way of finding brokerage (and other methods are more systematic), but it's a quick way of giving you a sense of which nodes are important not because they have lots of connections themselves but because they stand *between* groups, giving the network connectivity and cohesion.
+[Betweenness centrality](https://perma.cc/TPK4-WFK4) is a bit different from the other two measures in that it doesn't care about the number of edges any one node or set of nodes has. Betweenness centrality looks at all the **shortest paths** that pass through a particular node (see above). To do this, it must first calculate every possible shortest path in your network, so keep in mind that betweenness centrality will take longer to calculate than other centrality measures (but it won't be an issue in a dataset of this size). Betweenness centrality, which is also expressed on a scale of 0 to 1, is fairly good at finding nodes that connect two otherwise disparate parts of a network. If you're the only thing connecting two clusters, every communication between those clusters has to pass through you. In contrast to a hub, this sort of node is often referred to as a **broker**. Betweenness centrality is not the only way of finding brokerage (and other methods are more systematic), but it's a quick way of giving you a sense of which nodes are important not because they have lots of connections themselves but because they stand *between* groups, giving the network connectivity and cohesion.
 
 These two centrality measures are even simpler to run than degree---they don't need to be fed a list of nodes, just the graph `G`. You can run them with these functions:
 
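The two centrality functions linked in this hunk can be tried on a small hypothetical network where one node clearly acts as a broker; the data here is invented for illustration, not the lesson's dataset:

```python
import networkx as nx

# Toy network: "Fox" bridges his own cluster to "Whitehead"'s cluster
G = nx.Graph([
    ("Fell", "Fox"), ("Penn", "Fox"), ("Howgill", "Fox"),  # Fox's cluster
    ("Fox", "Whitehead"),                                  # the bridge
    ("Whitehead", "Story"), ("Whitehead", "Camm"),         # Whitehead's cluster
])

eigenvector = nx.eigenvector_centrality(G)    # dict: node -> score
betweenness = nx.betweenness_centrality(G)    # dict: node -> score in [0, 1]

# Fox sits on the most shortest paths, so betweenness ranks him first
print(max(betweenness, key=betweenness.get))  # 'Fox'
```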
@@ -488,7 +488,7 @@ Another common thing to ask about a network dataset is what the subgroups or com
 
 Very dense networks are often more difficult to split into sensible partitions. Luckily, as you discovered earlier, this network is not all that dense. There aren't nearly as many actual connections as possible connections, and there are several altogether disconnected components. It's worthwhile partitioning this sparse network with modularity and seeing if the results make historical and analytical sense.
 
-Community detection and partitioning in NetworkX requires a little more setup than some of the other metrics. There are some built-in approaches to community detection (like [minimum cut](https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.flow.minimum_cut.html)), but modularity is not included with NetworkX. Fortunately there's an [additional Python module](https://github.com/taynaud/python-louvain/) you can use with NetworkX, which you already installed and imported at the beginning of this tutorial. You can read the [full documentation](http://perso.crans.org/aynaud/communities/api.html) for all of the functions it offers, but for most community detection purposes you'll only want `best_partition()`:
+Community detection and partitioning in NetworkX requires a little more setup than some of the other metrics. There are some built-in approaches to community detection (like [minimum cut](https://perma.cc/B6CN-LQX4)), but modularity is not included with NetworkX. Fortunately there's an [additional Python module](https://github.com/taynaud/python-louvain/) you can use with NetworkX, which you already installed and imported at the beginning of this tutorial. You can read the [full documentation](http://perso.crans.org/aynaud/communities/api.html) for all of the functions it offers, but for most community detection purposes you'll only want `best_partition()`:
 
 ```python
 communities = community.greedy_modularity_communities(G)
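A runnable sketch of the `greedy_modularity_communities` call shown in this hunk's context line, on an invented two-cluster graph (this uses NetworkX's built-in modularity routine, not python-louvain's `best_partition()`):

```python
import networkx as nx
from networkx.algorithms import community

# Two hypothetical triangles joined by a single edge
G = nx.Graph([
    ("Fell", "Fox"), ("Fox", "Penn"), ("Penn", "Fell"),                # triangle 1
    ("Whitehead", "Story"), ("Story", "Camm"), ("Camm", "Whitehead"),  # triangle 2
    ("Fox", "Whitehead"),                                              # bridge
])

communities = community.greedy_modularity_communities(G)

# Each community is a frozenset of node names; map nodes to community ids
membership = {node: i for i, comm in enumerate(communities) for node in comm}
print(len(communities), membership)
```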
@@ -541,7 +541,7 @@ Working with NetworkX alone will get you far, and you can find out a lot about m
 
 # Exporting Data
 
-NetworkX supports a very large number of file formats for [data export](https://networkx.github.io/documentation/stable/reference/readwrite/index.html). If you wanted to export a plaintext edgelist to load into Palladio, there's a [convenient wrapper](https://networkx.github.io/documentation/stable/reference/readwrite/generated/networkx.readwrite.edgelist.write_edgelist.html) for that. Frequently at *Six Degrees of Francis Bacon*, we export NetworkX data in [D3's specialized JSON format](https://networkx.github.io/documentation/stable/reference/readwrite/generated/networkx.readwrite.json_graph.node_link_data.html), for visualization in the browser. You could even [export](https://networkx.github.io/documentation/stable/reference/generated/networkx.convert_matrix.to_pandas_adjacency.html) your graph as a [Pandas dataframe](http://pandas.pydata.org/) if there were more advanced statistical operations you wanted to run. There are lots of options, and if you've been diligently adding all your metrics back into your Graph object as attributes, all your data will be exported in one fell swoop.
+NetworkX supports a very large number of file formats for [data export](https://perma.cc/CYJ5-P6MR). If you wanted to export a plaintext edgelist to load into Palladio, there's a [convenient wrapper](https://perma.cc/MW25-9VMN) for that. Frequently at *Six Degrees of Francis Bacon*, we export NetworkX data in [D3's specialized JSON format](https://perma.cc/454D-C3FS), for visualization in the browser. You could even [export](https://perma.cc/PGS5-SKYC) your graph as a [Pandas dataframe](http://pandas.pydata.org/) if there were more advanced statistical operations you wanted to run. There are lots of options, and if you've been diligently adding all your metrics back into your Graph object as attributes, all your data will be exported in one fell swoop.
 
 Most of the export options work in roughly the same way, so for this tutorial you'll learn how to export your data into Gephi's GEXF format. Once you've exported the file, you can upload it [directly into Gephi](https://gephi.org/quickstart/) for visualization.
 
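The GEXF export this hunk describes is one line once the graph carries its attributes; a minimal sketch with a hypothetical three-node graph (the temp-file path is only to keep the example self-contained):

```python
import os
import tempfile
import networkx as nx

# Minimal hypothetical graph with a metric stored as a node attribute
G = nx.Graph([("Fell", "Fox"), ("Fox", "Whitehead")])
nx.set_node_attributes(G, nx.degree_centrality(G), "degree_centrality")

# Write a Gephi-ready GEXF file; node attributes travel with it
out_path = os.path.join(tempfile.mkdtemp(), "quaker_network.gexf")
nx.write_gexf(G, out_path)

# Other one-liners from the linked export docs:
# nx.write_edgelist(G, "edges.csv", delimiter=",")  # plaintext edgelist
# nx.node_link_data(G)                              # D3-style JSON dict
```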
@@ -585,4 +585,4 @@ Each of these findings is an invitation to more research rather than an endpoint
 
 [^pipinstall]: In many (but not all) cases, `pip` or `pip3` will be installed automatically with Python3.
 
-[^random]: The most principled way of doing this kind of comparison is to create *random graphs* of identical size to see if the metrics differ from the norm. NetworkX offers plenty of tools for [generating random graphs](https://networkx.github.io/documentation/stable/reference/generators.html#module-networkx.generators.random_graphs).
+[^random]: The most principled way of doing this kind of comparison is to create *random graphs* of identical size to see if the metrics differ from the norm. NetworkX offers plenty of tools for [generating random graphs](https://perma.cc/5BZ6-K2TL).
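The random-graph comparison the footnote recommends might be sketched like this, with hypothetical node and edge counts standing in for a real network's size:

```python
import networkx as nx

# Random graph with the same node/edge counts as a (hypothetical) real network
n_nodes, n_edges = 10, 20
random_G = nx.gnm_random_graph(n_nodes, n_edges, seed=42)  # seed for reproducibility

# Baseline metrics to compare against the real network's values
print(nx.density(random_G))
print(nx.transitivity(random_G))
```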
