For anyone discovering this, the code below is pulled from Quarto slides, so there are a lot of headers. I urge you to use the Table of Contents on the right to navigate. The beginning of each section should indicate what packages are being used. In general the following packages are used: igraph, ggraph, ggplot2, smacof, FactoMineR, blockmodeling, and sna.
The course this was written for followed Analyzing Social Networks in R but did not make use of the xUCINET package as it appears to be no longer maintained (it is not on CRAN).
One basic format for network data is the matrix. You can make basic matrices using the matrix() function:
mat <- matrix(c(0, 1, 0,
                0, 0, 0,
                1, 1, 0),
              nrow=3, ncol=3,
              byrow=T)
mat
     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    0    0    0
[3,]    1    1    0
The rownames() and colnames() functions are used to access or set the row names and column names.
rownames(mat) <- c("JoJo", "Frida", "Kevin")
colnames(mat) <- rownames(mat)
mat
      JoJo Frida Kevin
JoJo     0     1     0
Frida    0     0     0
Kevin    1     1     0
Although we can do a lot with basic matrices, what we really want to create is an object that has the properties of a network.
The igraph package has a graph_from_adjacency_matrix() function used to create network data. It turns a matrix into an igraph object.
library(igraph)
Attaching package: 'igraph'
The following objects are masked from 'package:stats':
decompose, spectrum
The following object is masked from 'package:base':
union
net <- graph_from_adjacency_matrix(mat, mode="directed")
graph_from_adjacency_matrix() has some options we can set:

- mode=: "directed" creates a directed network; "undirected" creates an undirected one (built from the upper triangle).
- weighted=: NULL (default) treats the numbers in the matrix as the number of edges between vertices; TRUE creates edge weights.
- diag=: whether to include the diagonal; set to FALSE to ignore diagonals.
- add.colnames=: NULL (default) uses the column names as the vertex names; NA ignores the column names.
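A quick hedged sketch of the weighted= and diag= options, reusing mat from above (the wnet name is illustrative, not from the slides):

## weighted=TRUE stores the cell values in an edge "weight" attribute;
## diag=FALSE drops the diagonal (self-loops)
wnet <- graph_from_adjacency_matrix(mat, mode="directed",
                                    weighted=TRUE, diag=FALSE)
E(wnet)$weight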
We can call our network object to then see some information about it.

net
IGRAPH a75aeb6 DN-- 3 3 --
+ attr: name (v/c)
+ edges from a75aeb6 (vertex names):
[1] JoJo ->Frida Kevin ->JoJo Kevin ->Frida
This shows the attributes for the vertices (name) and some of the edges, using the names of the vertices.
Let's load some more interesting data. We need to read in a CSV of an adjacency matrix (creating a data frame). Because we know we will make this into an adjacency matrix, we set row.names=1 to convert the first column of the CSV into row names.
The data is here and is an adjacency matrix based on interactions in the first episode of Amazon Prime's Wheel of Time.
<- read.csv("network_data/wheel_of_time_ep1.csv", row.names=1)
net_mat # setting row.names=1, turns the first column into rownames
1:5, 1:5] # look at first 5 rows and cols net_mat[
Moraine Lan Marin Nynavene Rand
Moraine 0 4 1 1 1
Lan 4 0 0 1 0
Marin 1 0 0 1 1
Nynavene 1 1 0 0 1
Rand 1 0 0 0 0
We again use graph_from_adjacency_matrix() to convert to an igraph object, but first put net_mat into as.matrix(). Why do we set weighted=TRUE?
library(igraph)
net <- graph_from_adjacency_matrix(as.matrix(net_mat),
                                   mode="directed", weighted=TRUE)
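Why weighted=TRUE? The CSV cells hold interaction counts, and this stores them as edge weights. A quick hedged check:

E(net)$weight[1:5] # the interaction counts, stored as edge weights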
If we ever want to go back to the adjacency matrix we can:
mat <- as.matrix(net)
mat[1:5, 1:5]
5 x 5 sparse Matrix of class "dgCMatrix"
Moraine Lan Marin Nynavene Rand
Moraine . 1 1 1 1
Lan 1 . . 1 .
Marin 1 . . 1 1
Nynavene 1 1 . . 1
Rand 1 . . . .
The igraph object itself can always be indexed like an adjacency matrix:
net[1:5, 1:5]
5 x 5 sparse Matrix of class "dgCMatrix"
Moraine Lan Marin Nynavene Rand
Moraine . 4 1 1 1
Lan 4 . . 1 .
Marin 1 . . 1 1
Nynavene 1 1 . . 1
Rand 1 . . . .
What would we want to know about our network as a whole? ecount() counts the number of edges, vcount() the number of vertices.
ecount(net)
[1] 63
vcount(net)
[1] 22
edge_density() gives the proportion of possible edges that exist.
edge_density(net)
[1] 0.1363636
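As a check, this matches the directed-density formula m / (n(n-1)) computed by hand, a sketch using the counts from above:

ecount(net) / (vcount(net) * (vcount(net) - 1)) # 63 / (22 * 21)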
The count_components() function returns the number of components within a network.
count_components(net)
[1] 2
For directed data it defaults to weak components. We can switch to strong components using mode=:
count_components(net, mode="strong")
[1] 7
We are going to work on this more later, but it is useful for us to be able to quickly visualize our data (and check components).
plot(net)
We can also create an undirected network out of a directed network using as.undirected(). The way it creates these depends on mode=.
<- as.undirected(net, mode="mutual")
und_net ## Undirected edge exists only if both have an edge
as.matrix(und_net)[1:5,1:5]
5 x 5 sparse Matrix of class "dgCMatrix"
Moraine Lan Marin Nynavene Rand
Moraine . 1 1 1 1
Lan 1 . . 1 .
Marin 1 . . . .
Nynavene 1 1 . . .
Rand 1 . . . .
Switching it to mode="collapse":
und_net <- as.undirected(net, mode="collapse")
## Undirected edge exists if either has an edge
as.matrix(und_net)[1:5,1:5]
5 x 5 sparse Matrix of class "dgCMatrix"
Moraine Lan Marin Nynavene Rand
Moraine . 1 1 1 1
Lan 1 . . 1 .
Marin 1 . . 1 1
Nynavene 1 1 1 . 1
Rand 1 . 1 1 .
We use the V() and E() functions to access the vertices or edges of our network (note the capitalization).
V(net)
+ 22/22 vertices, named, from 32d1bee:
[1] Moraine Lan Marin Nynavene Rand
[6] Mat Perrin Egwene False.Dragon Eladia
[11] Dying.Man Tam Laila Tom Brandelwyn
[16] Cabin.Trolloc Danya Matt.s.Mom Matt.s.Sisters Padan.Fain
[21] Town.Trolloc Daise
E(net)
+ 63/63 edges from 32d1bee (vertex names):
[1] Moraine ->Lan Moraine ->Marin Moraine ->Nynavene
[4] Moraine ->Rand Moraine ->Mat Moraine ->Perrin
[7] Moraine ->Egwene Lan ->Moraine Lan ->Nynavene
[10] Marin ->Moraine Marin ->Nynavene Marin ->Rand
[13] Marin ->Egwene Nynavene->Moraine Nynavene->Lan
[16] Nynavene->Rand Nynavene->Perrin Nynavene->Egwene
[19] Nynavene->Dying.Man Rand ->Moraine Rand ->Mat
[22] Rand ->Perrin Rand ->Egwene Rand ->Tam
[25] Mat ->Moraine Mat ->Rand Mat ->Perrin
[28] Mat ->Danya Mat ->Matt.s.Mom Mat ->Matt.s.Sisters
+ ... omitted several edges
We can have attributes at three different levels of the network: whole network, vertex, and edges. We can get the full list of the different attributes using the *_attr_names() functions:
vertex_attr_names(net)
[1] "name"
edge_attr_names(net)
[1] "weight"
graph_attr_names(net)
character(0)
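Whole-network attributes can be set the same way as vertex and edge attributes, with $ on the graph itself. A minimal hedged sketch (the title attribute is invented for illustration):

net$title <- "Wheel of Time, Episode 1"
graph_attr_names(net) # now includes "title"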
There are several ways to access a vertex or edge attribute, but the easiest is to use something like V(net)$attr, where attr is the name of the attribute. You can do this with E() as well.
V(net)$name
[1] "Moraine" "Lan" "Marin" "Nynavene"
[5] "Rand" "Mat" "Perrin" "Egwene"
[9] "False.Dragon" "Eladia" "Dying.Man" "Tam"
[13] "Laila" "Tom" "Brandelwyn" "Cabin.Trolloc"
[17] "Danya" "Matt.s.Mom" "Matt.s.Sisters" "Padan.Fain"
[21] "Town.Trolloc" "Daise"
E(net)$weight
[1] 4 1 1 1 1 1 1 4 1 1 1 1 2 1 1 1 1 3 1 1 3 4 4 3 1 3 4 1 2 2 1 1 3 3 4 1 1 4
[39] 4 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1
You modify attributes in a way similar to how dataframes are modified in R: E(net)$attr <- "Good"
V(net)$type <- "Character"
V(net)$type
[1] "Character" "Character" "Character" "Character" "Character" "Character"
[7] "Character" "Character" "Character" "Character" "Character" "Character"
[13] "Character" "Character" "Character" "Character" "Character" "Character"
[19] "Character" "Character" "Character" "Character"
Each vertex and edge can be indexed using [ ]. For example, V(net)[2] will return the second vertex.
V(net)[2]
+ 1/22 vertex, named, from 32d1bee:
[1] Lan
E(net)[2]
+ 1/63 edge from 32d1bee (vertex names):
[1] Moraine->Marin
We can also use the indexing to modify the attributes:
V(net)$name[9] ## Missing a space
[1] "False.Dragon"
V(net)$name[9] <- "False Dragon"
V(net)$name ## Fixed
[1] "Moraine" "Lan" "Marin" "Nynavene"
[5] "Rand" "Mat" "Perrin" "Egwene"
[9] "False Dragon" "Eladia" "Dying.Man" "Tam"
[13] "Laila" "Tom" "Brandelwyn" "Cabin.Trolloc"
[17] "Danya" "Matt.s.Mom" "Matt.s.Sisters" "Padan.Fain"
[21] "Town.Trolloc" "Daise"
Dealing with edges can be a bit more confusing because they also have IDs. The best way to pick them out is by which vertices they are incident to. get.edge.ids() will give you the edges from one vertex to another (given using the vp= argument).
V(net)[5] # Rand
+ 1/22 vertex, named, from 32d1bee:
[1] Rand
V(net)[6] # Mat
+ 1/22 vertex, named, from 32d1bee:
[1] Mat
edid <- get.edge.ids(net, vp=c(V(net)[5], V(net)[6]))
edid
[1] 21
E(net)[edid]
+ 1/63 edge from 32d1bee (vertex names):
[1] Rand->Mat
incident() is used to get all edges that are incident to a vertex. You can set mode= to decide what to do with directed edges.
incident(net, V(net)[5], mode="in")
+ 7/63 edges from 32d1bee (vertex names):
[1] Moraine ->Rand Marin ->Rand Nynavene->Rand Mat ->Rand Perrin ->Rand
[6] Egwene ->Rand Tam ->Rand
incident(net, V(net)[5], mode="out")
+ 5/63 edges from 32d1bee (vertex names):
[1] Rand->Moraine Rand->Mat Rand->Perrin Rand->Egwene Rand->Tam
incident(net, V(net)[5], mode="all")
+ 12/63 edges from 32d1bee (vertex names):
[1] Rand ->Moraine Moraine ->Rand Marin ->Rand Nynavene->Rand
[5] Rand ->Mat Mat ->Rand Rand ->Perrin Perrin ->Rand
[9] Rand ->Egwene Egwene ->Rand Rand ->Tam Tam ->Rand
We set mode="in"
for the number of edges pointing towards a node, mode="out"
for number pointing out and mode="all"
for all.
It can be useful to then add this back to our network data.
degree(net)[1:5] ## Show first 5 so I don't go off screen
Moraine Lan Marin Nynavene Rand
13 4 8 13 12
mean(degree(net)) # average degree
[1] 5.727273
V(net)$degree <- degree(net)
V(net)$indegree <- degree(net, mode="in")
V(net)$outdegree <- degree(net, mode="out")
We can do the same thing with membership in different components. We are going to use components() now, which returns an object with information about the components.
<- components(net, mode="strong")
comps $no #number of components comps
[1] 7
$membership #Which component each vertex is in comps
Moraine Lan Marin Nynavene Rand
4 4 4 4 4
Mat Perrin Egwene False Dragon Eladia
4 4 4 3 3
Dying.Man Tam Laila Tom Brandelwyn
5 4 4 6 4
Cabin.Trolloc Danya Matt.s.Mom Matt.s.Sisters Padan.Fain
7 4 4 4 4
Town.Trolloc Daise
2 1
V(net)$components <- comps$membership
One thing I like to do is convert numbers into letters for labels. All R sessions have two variables, letters and LETTERS, that hold the letters from a to z.
letters[c(1, 3, 5)]
[1] "a" "c" "e"
LETTERS[comps$membership]
[1] "D" "D" "D" "D" "D" "D" "D" "D" "C" "C" "E" "D" "D" "F" "D" "G" "D" "D" "D"
[20] "D" "B" "A"
We can check which vertices are cutpoints using the articulation_points() function. For directed networks this can only identify cutpoints for weak components.
articulation_points(net)
+ 5/22 vertices, named, from 32d1bee:
[1] Tam Egwene Daise Mat Nynavene
cut_points <- articulation_points(net)
V(net)$cuts <- FALSE
V(net)$cuts[cut_points] <- TRUE
V(net)$cuts
[1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
We can also identify bridges using the bridges() function. For directed networks this can only identify bridges for weak components.
bridges(net)
+ 5/63 edges from 32d1bee (vertex names):
[1] Tam ->Cabin.Trolloc Egwene ->Tom Daise ->Town.Trolloc
[4] Daise ->Egwene Nynavene->Dying.Man
bridge_edges <- bridges(net)
E(net)$bridges <- FALSE
E(net)$bridges[bridge_edges] <- TRUE
E(net)$bridges
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE TRUE TRUE
We've saved a lot of useful information about the vertices in our network. It is often easier to use this info as a dataframe, though. We use as_data_frame() to convert a network into a dataframe, setting what="vertices" to get all that wonderful vertex info (we can even save it).
<- as_data_frame(net, what="vertices")
df_out tail(df_out, 5)
name type degree indegree outdegree components
Matt.s.Mom Matt.s.Mom Character 3 1 2 4
Matt.s.Sisters Matt.s.Sisters Character 3 2 1 4
Padan.Fain Padan.Fain Character 2 1 1 4
Town.Trolloc Town.Trolloc Character 1 1 0 2
Daise Daise Character 2 0 2 1
cuts
Matt.s.Mom FALSE
Matt.s.Sisters FALSE
Padan.Fain FALSE
Town.Trolloc FALSE
Daise TRUE
write.csv(df_out, "vertices.csv", row.names=F)
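The same function can pull out the edges instead, a hedged sketch:

edge_out <- as_data_frame(net, what="edges")
head(edge_out, 3) # from, to, plus any edge attributes (weight, bridges)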
The last thing for this week is calculating geodesic distances. We use the distances() function to do that. If we set mode="out" then the resulting matrix can be read as the distances from the row to the column.
<- distances(net, mode="out")
dist 1:5, 1:5] dist[
Moraine Lan Marin Nynavene Rand
Moraine 0 2 1 1 1
Lan 2 0 3 1 2
Marin 1 2 0 1 1
Nynavene 1 1 2 0 1
Rand 1 3 2 2 0
To calculate the average distances we can use the mean_distance() function. It will automatically ignore any disconnected components. We can treat the network as directed or undirected.
mean_distance(net, directed=TRUE)
[1] 2.978261
mean_distance(net, directed=FALSE)
[1] 2.827225
One problem with the distances() function is that it returns unconnected distances as Inf, which can make it impossible to calculate things. We can use is.infinite() and subsetting to replace those with NA, which we can exclude more easily.
max(dist, na.rm=T) # BROKEN
[1] Inf
dist[is.infinite(dist)] <- NA
max(dist, na.rm=T) # not broken
[1] 6
Another type of network data is an edge list format, where each row shows an edge in the format “start_of_edge, end_of_edge” (or head to tail)
edge <- as_edgelist(net)
edge[1:8,]
[,1] [,2]
[1,] "Moraine" "Lan"
[2,] "Moraine" "Marin"
[3,] "Moraine" "Nynavene"
[4,] "Moraine" "Rand"
[5,] "Moraine" "Mat"
[6,] "Moraine" "Perrin"
[7,] "Moraine" "Egwene"
[8,] "Lan" "Moraine"
Edge lists are common ways of providing network data.
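Going the other way is also quick: graph_from_edgelist() turns a two-column matrix into an igraph object. A minimal sketch with an invented edge list:

el <- matrix(c("A", "B",
               "B", "C"), ncol=2, byrow=TRUE)
small_net <- graph_from_edgelist(el, directed=TRUE)
small_net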
I have data that shows connections between members of the 134th Ohio Senate by the number of bills they cosponsored with each other.
<- read.csv("network_data/OH_134_cosponsor.csv")
edge_data head(edge_data)
.tail .head weight
1 Kenny Yuko Jay Hottinger 129
2 Matthew Dolan Jay Hottinger 107
3 Vernon Sykes Jay Hottinger 91
4 Sandra Williams Jay Hottinger 53
5 Teresa Fedor Jay Hottinger 70
6 Robert Hackett Jay Hottinger 136
We can create the network the same way as before though we want to indicate this is undirected data.
leg_net <- graph_from_data_frame(edge_data, directed=FALSE)
leg_net
IGRAPH 5063edf UNW- 34 560 --
+ attr: name (v/c), weight (e/n)
+ edges from 5063edf (vertex names):
[1] Kenny Yuko --Jay Hottinger Matthew Dolan --Jay Hottinger
[3] Vernon Sykes --Jay Hottinger Sandra Williams --Jay Hottinger
[5] Teresa Fedor --Jay Hottinger Robert Hackett --Jay Hottinger
[7] Andrew Brenner --Jay Hottinger Kristina Roegner--Jay Hottinger
[9] Terry Johnson --Jay Hottinger Nickie Antonio --Jay Hottinger
[11] Bob Peterson --Jay Hottinger Louis Blessing --Jay Hottinger
[13] Stephanie Kunze --Jay Hottinger Mark Romanchuk --Jay Hottinger
[15] Bill Reineke --Jay Hottinger Hearcel Craig --Jay Hottinger
+ ... omitted several edges
With an edge list we can also easily add in information about each node/vertex. Here I load a second dataset OH_134_people.csv which has information about each individual. The first column needs to be the vertex names.
<- read.csv("network_data/OH_134_people.csv")
vertex_data <- graph_from_data_frame(edge_data, directed=FALSE,
leg_net vertices=vertex_data)
leg_net
IGRAPH 1fa10a5 UNW- 34 560 --
+ attr: name (v/c), people_id (v/n), first_name (v/c), middle_name
| (v/c), last_name (v/c), suffix (v/c), nickname (v/c), party_id (v/n),
| party (v/c), role_id (v/n), role (v/c), district (v/c),
| followthemoney_eid (v/n), votesmart_id (v/n), opensecrets_id (v/l),
| ballotpedia (v/c), knowwho_pid (v/n), committee_id (v/n), weight
| (e/n)
+ edges from 1fa10a5 (vertex names):
[1] Jay Hottinger--Kenny Yuko Matthew Dolan--Jay Hottinger
[3] Jay Hottinger--Vernon Sykes Jay Hottinger--Sandra Williams
[5] Jay Hottinger--Teresa Fedor Jay Hottinger--Robert Hackett
+ ... omitted several edges
ggplot2 is the library for making figures in R. The following slides give you an introduction to how to use it.
All plots start with the ggplot() function, which takes the dataframe you are going to use as the first argument. By itself it doesn't do anything.
library(ggplot2)
ggplot(df_out)
The other common part of a ggplot() call is a call to another function, aes(). This function maps parts of your dataframe onto parts of the plot. In the below example I map the indegree variable to the x axis and the outdegree variable to the y axis.
ggplot(df_out, aes(x=indegree, y=outdegree))
We use geom_*() functions to actually add things to the plot. To do this we literally add the geom_point() function to our previous call:
ggplot(df_out, aes(x=indegree, y=outdegree)) +
geom_point()
We can change things about the points by adding other arguments to geom_point() (there are a lot of options; scroll to Aesthetics here to see them).
ggplot(df_out, aes(x=indegree, y=outdegree)) +
geom_point(color="steelblue4", size=6)
We can also use the aes() function within the geom_point() call to map different aesthetics to our data. The below will change the color of the dots depending on whether they are a cutpoint or not.
ggplot(df_out, aes(x=indegree, y=outdegree)) +
geom_point(aes(color=cuts), size=6)
The power of ggplot is all the geoms. Let's change geom_point() to geom_jitter(), which bumps the dots around, and add on another one, geom_smooth(), to show the trend.
ggplot(df_out, aes(x=indegree, y=outdegree)) +
geom_jitter(aes(color=cuts), size=6) +
geom_smooth(method="lm")
`geom_smooth()` using formula = 'y ~ x'
Look what happens when you move the color=cuts part to the ggplot() call. Any aes() calls here impact all geoms.
ggplot(df_out, aes(x=indegree, y=outdegree,
color=cuts)) +
geom_jitter(size=6) +
geom_smooth(method="lm")
`geom_smooth()` using formula = 'y ~ x'
We can add or modify labels on our plot using the labs() function:
ggplot(df_out, aes(x=indegree, y=outdegree)) +
geom_jitter(aes(color=cuts), size=6) +
geom_smooth(method="lm") +
labs(y="Out Degree", x="In Degree",
title="Out vs In Degree")
`geom_smooth()` using formula = 'y ~ x'
There are also scale_*_*() functions that can be used to change the style and labels of the scales.
ggplot(df_out, aes(x=indegree, y=outdegree)) +
geom_jitter(aes(color=cuts), size=6) +
geom_smooth(method="lm") +
labs(y="Out Degree", x="In Degree",
title="Out vs In Degree") +
scale_color_brewer("Cut Point?",
type="qual", palette=2)
`geom_smooth()` using formula = 'y ~ x'
Finally, you can use theme_*() to change the overall style of the plot:
ggplot(df_out, aes(x=indegree, y=outdegree)) +
geom_jitter(aes(color=cuts), size=6) + geom_smooth(method="lm") +
labs(y="Out Degree", x="In Degree",
title="Out vs In Degree") +
scale_color_brewer("Cut Point?",
type="qual", palette=2) + theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
There are three main functions we will need for this:

- mds() from the smacof package.
- CA() from the FactoMineR package.
- hclust(), a base R function (along with cutree() and dist()).

The mds() function in smacof lets us estimate multiple types of MDS:

- mds(distance, type="interval") - Metric MDS
- mds(distance, type="ordinal") - Non-Metric MDS

In both cases the distance object needs to be a symmetric dissimilarity (distance) matrix. Larger values mean observations are more different from each other. You select the number of dimensions using ndim=.
R has an object UScitiesD that you can call at any time, which shows the distances between several cities.
# UScitiesD #Run this by itself to see
library(smacof)
<- mds(UScitiesD, type="interval")
mds $stress # The stress mds
[1] 0.001913151
$conf # the actual points mds
D1 D2
Atlanta -0.4543817 0.09024782
Chicago -0.2415148 -0.21576888
Denver 0.3051669 -0.01607968
Houston -0.1021788 0.36232737
LosAngeles 0.7612104 0.24459966
Miami -0.7164226 0.36679903
NewYork -0.6783018 -0.32755339
SanFrancisco 0.8982951 0.06986850
Seattle 0.8477661 -0.36291042
Washington.DC -0.6196388 -0.21153000
We can plot this with ggplot2
library(ggplot2)
points <- as.data.frame(mds$conf)
ggplot(points, aes(x=D1, y=D2)) +
  geom_point() +
  theme_minimal()
Let's add some labels:
points$names <- rownames(points)
ggplot(points, aes(x=D1, y=D2)) +
  geom_point() + geom_text(aes(label=names)) +
  theme_minimal()
Flip the scales
ggplot(points, aes(x=-D1, y=-D2)) +
geom_text(aes(label=names)) +
theme_minimal()
For the next section we are going to use data on UN Votes (this is in an RData format and is already a distance object).
<- readRDS("network_data/distance_1960s.RData")
UN_votes
<- mds(UN_votes, ndim=2, type="ordinal")
out $stress out
[1] 0.1247679
If we want to check different levels of stress we could estimate several models like this:
out_1 <- mds(UN_votes, ndim=1, type="ordinal")
out_2 <- mds(UN_votes, ndim=2, type="ordinal")
out_3 <- mds(UN_votes, ndim=3, type="ordinal")
out_4 <- mds(UN_votes, ndim=4, type="ordinal")
This is repetitive, and good code minimizes repetition.
We can use a for loop to repeat some lines of code multiple times changing things as we do:
Basic idea:
for(ii in 1:5){
print(ii)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
We want to create something to store the results and then store it every time we go through the loop.
out <- numeric(5) ## Vector of length 5
for(ii in 1:5){
  out[ii] <- ii
}
print(out)
[1] 1 2 3 4 5
Now we do the same thing, but store the stress from an MDS fit at each number of dimensions:
out <- numeric(5) ## Vector of length 5
for(ii in 1:5){
  tmp <- smacof::mds(UN_votes, ndim=ii, type="ordinal")
  out[ii] <- tmp$stress
}
print(out)
[1] 0.32174980 0.12476788 0.08293623 0.06374924 0.05249708
I use the basic plot() function to make a scree plot:
plot(x=1:5, y=out, ylab="Stress",
xlab="Dimensions",
type="b")
The hclust() function needs a distance object to work. Our UN_votes data already shows "distances" but isn't a distance object, so we use as.dist() to convert it.
hcl <- hclust(as.dist(UN_votes))
plot(hcl) ## creates a dendrogram
You can change the method with the method= argument ("single", "complete", "average").
<- hclust(as.dist(UN_votes), method="single")
hcl plot(hcl) ## creates a dendrogram
The cutree() function "cuts the tree" to make clusters. Set k= to the number of groups you want.
groups <- cutree(hcl, k=4)
groups[1:20]
AFG ALB DZA ARG AUS AUT BRB BLR BEL BEN BOL BWA BRA BGR BFA BDI CIV KHM CMR CAN
1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1
For correspondence analysis we use the CA() function, and can set the number of dimensions to estimate with ncp= (feel free to set this to something high).
The data here shows how many of each country's EU MEPs are in each party group.
<- read.csv("network_data/EU_seats.csv", row.names=1)
df library(FactoMineR)
<- CA(df, ncp=5) ca
Warning: ggrepel: 10 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
We can see how many dimensions to use by calling output$eig:
ca$eig
eigenvalue percentage of variance cumulative percentage of variance
dim 1 0.29470402 29.073034 29.07303
dim 2 0.23559433 23.241767 52.31480
dim 3 0.17732184 17.493090 69.80789
dim 4 0.12367321 12.200564 82.00846
dim 5 0.09625459 9.495673 91.50413
dim 6 0.05514655 5.440298 96.94443
dim 7 0.03097337 3.055574 100.00000
If we want to extract the scales we use output$row$coord or output$col$coord. We can plot it easily by converting it to a data.frame and sticking it into ggplot.
countries <- as.data.frame(ca$row$coord)
ggplot(countries, aes(x=`Dim 1`, y=`Dim 2`)) +
  geom_point(size=5) + theme_minimal()
If we want to plot them both we need to combine the two data.frames (using rbind()), giving them labels first:
countries <- as.data.frame(ca$row$coord)
countries$Type <- "Country"
countries$Name <- rownames(countries)
parties <- as.data.frame(ca$col$coord)
parties$Type <- "Party"
parties$Name <- rownames(parties)

all <- rbind(countries, parties)
ggplot(all, aes(x=`Dim 1`, y=`Dim 2`, color=Type)) +
geom_point(size=5) + theme_minimal()
Now we are going to move on to visualizing networks. Visualizations can be useful with network data, but they are also hard to do well. We are going to use the ggraph package.
We are going to use a small canned dataset you can download from the internet: strike.paj. It is a communication network between workers at a sawmill. It also comes in a unique data format, "pajek", which thankfully igraph has a function for.
library(igraph)
<- read_graph("network_data/strike.paj", format="pajek") net
Warning in read.graph.pajek(file, ...): At
src/vendor/cigraph/src/io/pajek-lexer.l:130 : Skipping unknown section
'*Partition' on line 68.
ecount(net)
[1] 38
vcount(net)
[1] 24
Just like ggplot2, all visualizations will start with a call to ggraph():
library(ggraph)
theme_set(theme_graph())
ggraph(net)
Using "stress" as default layout
Warning: Existing variables `x` and `y` overwritten by layout variables
To add nodes and edges to this plot we will use geom_node_point() and geom_edge_link():

- geom_node_point(): adds our nodes as circles
- geom_edge_link(): adds edges as straight lines (no arrows)

ggraph(net) + geom_node_point(size=6) +
  geom_edge_link()
Using "stress" as default layout
Warning: Existing variables `x` and `y` overwritten by layout variables
Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
Laying out a plot can impact how useful it is by a lot.
Most of these layouts work on a similar principle. FR (Fruchterman-Reingold) views vertices as "atomic particles or celestial bodies, exerting attractive and repulsive forces from one another". How does this algorithm work?
set.seed(1)
lo <- create_layout(net, layout="igraph", algorithm="fr",
                    niter=1)
ggplot(lo) + geom_node_point(size=6) +
  geom_edge_link() + labs(title="1 Iteration")
set.seed(1)
lo <- create_layout(net, layout="igraph", algorithm="fr",
                    niter=2)
ggplot(lo) + geom_node_point(size=6) +
  geom_edge_link() + labs(title="2 Iterations")
set.seed(1)
lo <- create_layout(net, layout="igraph", algorithm="fr",
                    niter=3)
ggplot(lo) + geom_node_point(size=6) +
  geom_edge_link() + labs(title="3 Iterations")
set.seed(1)
lo <- create_layout(net, layout="igraph", algorithm="fr",
                    niter=4)
ggplot(lo) + geom_node_point(size=6) +
  geom_edge_link() + labs(title="4 Iterations")
set.seed(1)
lo <- create_layout(net, layout="igraph", algorithm="fr",
                    niter=10)
ggplot(lo) + geom_node_point(size=6) +
  geom_edge_link() + labs(title="10 Iterations")
set.seed(1)
lo <- create_layout(net, layout="igraph", algorithm="fr",
                    niter=25)
ggplot(lo) + geom_node_point(size=6) +
  geom_edge_link() + labs(title="25 Iterations")
set.seed(1)
lo <- create_layout(net, layout="igraph", algorithm="fr",
                    niter=50)
ggplot(lo) + geom_node_point(size=6) +
  geom_edge_link() + labs(title="50 Iterations")
set.seed(1)
lo <- create_layout(net, layout="igraph", algorithm="fr",
                    niter=100)
ggplot(lo) + geom_node_point(size=6) +
  geom_edge_link() + labs(title="100 Iterations")
To set the layout, set layout= to what you want; you can also pass additional arguments as necessary.
If you want to create the exact same layout every time, run set.seed() directly prior to making the plot. This sets the "random seed" that is used.
set.seed(1)
ggraph(graph=net, layout="fr", niter=250) +
geom_edge_link() + geom_node_point(size=6)
Warning: Existing variables `x` and `y` overwritten by layout variables
Network of DNC emails from here.
<- read_graph("network_data/dnc.gml", format='gml') dnc_net
Warning in read.graph.gml(file, ...): At vendor/cigraph/src/io/gml.c:149 : One
or more unknown entities will be returned verbatim (
).
ggraph(graph=dnc_net, "graphopt") +
geom_edge_link() + geom_node_point()
We can use the function largest_component() to grab just that part. Also, the |> is a pipe, which passes the output on.
dnc_net |> largest_component() |>
  ggraph("fr") +
  geom_edge_link() + geom_node_point()
Deleting all the isolates using which() and delete_vertices():
isolates <- which(degree(dnc_net)==0)
dnc_net |> delete_vertices(v=isolates) |>
  ggraph("fr") + geom_edge_link() + geom_node_point()
We can use geom_node_text() or geom_node_label() to label our nodes.
ggraph(graph=net, "stress") +
geom_edge_link() +
geom_node_label(aes(label=name))
Warning: Existing variables `x` and `y` overwritten by layout variables
They also have a repel=T argument that will move the labels away from the center of the node.
ggraph(graph=net, "stress") +
geom_edge_link() + geom_node_point() +
geom_node_text(aes(label=name), repel=T)
Warning: Existing variables `x` and `y` overwritten by layout variables
Vertex attributes are included in plots for a variety of reasons.
Often we will scale a node size by a measure of importance, like degree:
V(net)$degree <- degree(net)
ggraph(graph=net, "stress") +
geom_edge_link() + geom_node_point(aes(size=degree)) +
ggtitle("Sized by Degree") +
scale_size("Degree")
Warning: Existing variables `x` and `y` overwritten by layout variables
This data is of Spanish high school students and includes negative and positive relations. We are going to delete the negative edges.
<- read.csv("network_data/spanish_hs_edges.csv")
edges <- read.csv("network_data/spanish_hs_nodes.csv")
nodes <- graph_from_data_frame(edges, vertices=nodes, directed=T)
net <- which(E(net)$weight < 0)
neg_edges <- delete_edges(net, neg_edges)
net net
IGRAPH fdda5ca DNW- 105 1058 --
+ attr: name (v/n), Colegio (v/n), Curso (v/n), Grupo (v/c), Sexo
| (v/c), prosocial (v/n), crttotal (v/n), X_pos (v/c), id (e/n), weight
| (e/n)
+ edges from fdda5ca (vertex names):
[1] 3043->3047 3043->3087 3043->3093 3043->3065 3043->3097 3043->3044
[7] 3043->3045 3043->3088 3043->3056 3043->3090 3043->3073 3043->3066
[13] 3043->3060 3043->3092 3043->3096 3043->3077 3043->3084 3043->3105
[19] 3043->3067 3043->3064 3043->3081 3043->3068 3043->3061 3043->3058
[25] 3043->3055 3043->3072 3043->3095 3043->3051 3043->3054 3043->3086
[31] 3043->3085 3043->3089 3043->3047 3043->3048 3043->3049 3043->3050
+ ... omitted several edges
We can color the nodes by setting aes(color=) to a vertex attribute.
ggraph(graph=net, "stress") +
geom_edge_link() +
geom_node_point(aes(color=Sexo), size=4) +
ggtitle("Colored by Sex")
There are a few things we might want to do with our edges:
I think it is easier to see a network by making the edges gray.
ggraph(graph=net, "stress") +
geom_edge_link(color="gray") +
geom_node_point(aes(color=Sexo), size=4) +
ggtitle("Colored by Sex")
Arrows are annoying to add here, but there is some good help online. We manually create an arrow (arrow) and manually end the edges before the node (end_cap).
ggraph(graph=net, "stress") +
geom_edge_link(color="gray",
arrow = arrow(length = unit(4, 'mm')),
end_cap = circle(3, 'mm')) +
geom_node_point(aes(color=Sexo), size=4)
Finally we can assign edge attributes to aesthetics
ggraph(graph=net, "stress") +
geom_edge_link(color="gray", aes(width=weight),
arrow = arrow(length = unit(4, 'mm')),
end_cap = circle(3, 'mm')) +
geom_node_point(aes(color=Sexo), size=4)
The default for ggraph is to show only a single edge when there are two mutual edges. We can change that by using geom_edge_fan().
ggraph(graph=net, "stress") +
geom_edge_fan(aes(color=weight),
arrow = arrow(length = unit(4, 'mm')),
end_cap = circle(3, 'mm')) +
geom_node_point(aes(color=Sexo), size=4)
This data shows the relationships between donors in the Ohio legislature. You can download the edgelist here and the meta data here
library(igraph)
<- read.csv("network_data/edge_OH.csv")
edges <- read.csv("network_data/meta_OH.csv")
nodes <- graph_from_data_frame(edges, directed=F, vertices=nodes) net
- degree()
- eigen_centrality()
- power_centrality() (set the beta value with exponent=)
- closeness()
- harmonic_centrality()
- betweenness()
- ego_size()
These are generally relatively simple to use, but eigen_centrality() outputs a whole object with information. To get just the centralities, use $vector.
V(net)$degree <- degree(net)
tmp <- eigen_centrality(net)
V(net)$eigen <- tmp$vector
tmp$value ## The eigenvalue
[1] 115.055
V(net)$beta <- power_centrality(net, exponent=0.008)
<- as_data_frame(net, what="vertices")
stats ggplot(stats, aes(x=Total, y=eigen)) +
geom_point() + theme_minimal() +
geom_smooth(method="lm") +
scale_x_log10()
`geom_smooth()` using formula = 'y ~ x'
ggplot(stats, aes(x=CatCodeGroup, y=eigen)) +
geom_boxplot() + theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Again, for the undirected, unweighted version there isn't a lot to these:
V(net)$between <- betweenness(net, normalized=T)
V(net)$close <- closeness(net, normalized=T)
V(net)$harmonic <- harmonic_centrality(net, normalized=T)
V(net)$step_2 <- ego_size(net, 2) # Two step
V(net)$step_3 <- ego_size(net, 3) # Three step
ggraph(net) + geom_edge_link(color="gray") +
geom_node_point(aes(size=between))
Using "stress" as default layout
Things get a bit more complicated if we want to use edge weights, the data we have has something we can use to weight the edges:
E(net)$weight <- ifelse(E(net)$edge == "Strong", 2, 1)
Weighted degree is the strength() function.
weighted_degree <- strength(net)
regular_degree <- degree(net)
plot(x=regular_degree, y=weighted_degree)
If there is a weight edge attribute it will automatically be used, but you can turn this off by setting weights=NA.
weighted_eigen <- eigen_centrality(net)$vector
regular_eigen <- eigen_centrality(net, weights=NA)$vector
plot(x=regular_eigen, y=weighted_eigen)
For directed networks you can calculate directed versions of these measures, but the interface isn't always consistent:

- eigen_centrality(): you have to set directed=T, and it automatically provides the "in-centrality". If you want the out-centrality, put your network through t(): eigen_centrality(t(net), directed=T).
- power_centrality(): will automatically do a directed measure if you give it a directed network. It calculates the "out-centrality"; use t() to calculate the "in-centrality".

Let's use the Spanish High School network you used earlier this week:
library(igraph)
<- read.csv("network_data/spanish_hs_edges.csv")
edges <- read.csv("network_data/spanish_hs_nodes.csv")
nodes <- graph_from_data_frame(edges, vertices=nodes, directed=T)
net <- delete_edges(net, E(net)[E(net)$weight <= 0]) net
Density can be calculated with edge_density(). There is an option to turn on loops (edges that start and stop at the same node).
edge_density(net)
[1] 0.09688645
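A hedged sketch of that option (the argument is loops=; it defaults to FALSE and changes the denominator of possible edges):

edge_density(net, loops=TRUE)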
Degree can be calculated with degree() and then averaged up with mean():
deg <- degree(net)
mean(deg)
[1] 20.15238
Component ratio is not in the SNA package, but the formula is pretty simple:
\[ \textnormal{Component Ratio} = \frac{c-1}{n-1} \]
So we need to calculate the number of components (c), and the number of nodes (n).
<- components(net, mode="strong")$no
comps <- vcount(net)
n
-1)/(n-1) (comps
[1] 0.2788462
There is no native function to calculate connectedness but we have all the tools:
<- distances(net, mode="out") ## Get the distances matrix
dist <- is.finite(dist) ## Where are there finite paths?
reachability diag(reachability) <- NA ## Ignore the diag
sum(reachability, na.rm=T) / (vcount(net) * (vcount(net) - 1))
[1] 0.7168498
Again, no native function but we can get it relatively quickly in a similar manner:
<- distances(net, mode="out") ## Get the distances matrix
dist diag(dist) <- NA ## Ignore the diag
sum(1/dist, na.rm=T) / (vcount(net) * (vcount(net) - 1))
[1] 0.2914342
We calculate reciprocity with the reciprocity() function. The default is to calculate arc-reciprocity, but we can switch to the other (less good) measure by setting mode="ratio".
reciprocity(net)
[1] 0.389414
edge_density(net)
[1] 0.09688645
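For comparison, a hedged sketch of the ratio measure:

reciprocity(net, mode="ratio")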
There is a transitivity() function, which works well with undirected networks. For directed networks we have to make use of the sna library and the gtrans() function.
transitivity(net) ## Treats as undirected
[1] 0.4361186
mat_net <- as_adjacency_matrix(net, sparse=F) # Convert to adj mat
sna::gtrans(mat_net) ## Call SNA function
[1] 0.4617656
Finally, we have centralization. You can use a lot of different measures of centrality to measure centralization, so there is a suite of functions:

- centr_degree() - uses degree centrality
- centr_eigen() - uses eigen centrality
- centr_betw() - uses betweenness

They all output an object with multiple pieces; to get the centralization score, use $centralization.
centr_betw(net)$centralization
[1] 0.09744318
centr_degree(net)$centralization
[1] 0.144878
centr_eigen(net)$centralization
[1] 0.7185371
We are going to use the Cocaine Mambo network today.
<- read.csv("network_data/COCAINE_MAMBO.csv", row.names=1)
df <- as.matrix(df)
mat >1] <- 1 # there is a weird cell with a 2, not a 1.
mat[mat<- graph_from_adjacency_matrix(mat, mode="undirected") coke_net
Warning: The `adjmatrix` argument of `graph_from_adjacency_matrix()` must be symmetric
with mode = "undirected" as of igraph 1.6.0.
ℹ Use mode = "max" to achieve the original behavior.
There are several igraph functions for cliques; we want max_cliques(). It returns a list of the cliques in your network, with each item in that list a clique.
cl <- max_cliques(coke_net, min=3)
length(cl) # number of cliques
[1] 15
cl[[1]] # the first clique
+ 3/31 vertices, named, from 755ae35:
[1] ARG JDD JPPM
If we want to look at who is in each clique we are going to have to process this list a little. In particular we will make a matrix showing who is in each clique:
# Create a matrix of 0s
cl_mat <- matrix(0, nrow=vcount(coke_net), ncol=length(cl))

# Use a for loop to step through the cliques
for(ii in seq_along(cl)){
  cl_mat[cl[[ii]], ii] <- 1
}
# We can add rownames
rownames(cl_mat) <- V(coke_net)$name
# Size of each clique
colSums(cl_mat)
[1] 3 3 3 3 3 4 3 4 4 3 3 4 4 4 4
# Number of cliques each individual is in
rowSums(cl_mat)
A AP ARG CPGT EDRS ELM ERC FR GA HAMS HPM JB JDD JE JEAB JG
0 1 1 2 0 5 1 2 0 3 1 1 3 0 2 1
JHLM JJTE JO JPPM JV LIRP OS PCCP RAJH RB RS SNRM VR WGV YPMG
1 2 0 10 0 1 1 2 0 0 3 7 0 0 2
We can calculate the number of cliques each individual is in and put that back on the original data:
V(coke_net)$numb_cliques <- rowSums(cl_mat)
ggraph(coke_net) + geom_edge_link(color="gray") +
geom_node_point(size=8,
aes(color=numb_cliques))
Using "stress" as default layout
With that matrix we can see easily how many cliques each person is in with every other person:
# Multiply the matrix by its transpose
overlap_mat <- cl_mat %*% t(cl_mat)
overlap_net <- graph_from_adjacency_matrix(overlap_mat, diag=F,
                                           mode="undirected", weighted=T)
# ignores the diag, has it undirected,
# and uses the values (overlaps) as weights
ggraph(overlap_net, "fr") +
geom_edge_link(color="gray", aes(width=weight)) +
geom_node_point(size=8, color="purple")
There are several functions we will use to find clusters:

- cluster_edge_betweenness()
- cluster_fast_greedy()
- cluster_walktrap()
- cluster_louvain()
Each function is used in a similar way and returns something similar.
All you have to do is give it the network.
clust <- cluster_fast_greedy(coke_net)

clust
IGRAPH clustering fast greedy, groups: 5, mod: 0.39
+ groups:
$`1`
[1] "ARG" "CPGT" "FR" "GA" "JDD" "JPPM" "OS" "PCCP"
$`2`
[1] "EDRS" "JB" "JE" "JO" "SNRM" "VR" "WGV"
$`3`
[1] "HAMS" "JG" "JJTE" "JV" "RAJH" "RB" "YPMG"
$`4`
+ ... omitted several groups/vertices
You can access the modularity scores for each clustering level with $modularity, and $membership provides the membership of each vertex at the clustering level with the highest modularity.
clust$membership
[1] 4 4 1 1 2 4 5 1 1 3 4 2 1 2 5 3 5 3 2 1 3 4 1 1 3 3 4 2 2 2 3
clust$modularity
[1] -5.960166e-02 -4.295482e-02 -1.085018e-02 5.499405e-03 2.170036e-02
[6] 3.775268e-02 5.380499e-02 6.970868e-02 8.546373e-02 1.053805e-01
[11] 1.336207e-01 1.544293e-01 1.728597e-01 1.878716e-01 2.028835e-01
[16] 2.266647e-01 2.535672e-01 2.702140e-01 2.850773e-01 2.997919e-01
[21] 3.143579e-01 3.287753e-01 3.430440e-01 3.573127e-01 3.757432e-01
[26] 3.840666e-01 3.862961e-01 3.834721e-01 3.745541e-01 3.567182e-01
[31] -5.551115e-17
I personally like giving each cluster a letter so we can do the following:
V(coke_net)$Cluster <- LETTERS[clust$membership]
ggraph(coke_net) + geom_edge_link(color="gray") +
geom_node_point(size=8,
aes(color=Cluster))
Using "stress" as default layout
We can use as_data_frame() to access the vertex data again from our network.
vertex_data <- igraph::as_data_frame(coke_net, what="vertices")
vertex_data[1:5,]
name numb_cliques Cluster
A A 0 D
AP AP 1 D
ARG ARG 1 A
CPGT CPGT 2 A
EDRS EDRS 0 B
#write.csv(vertex_data, "tmp.csv", row.names=F)
Finally, you can also plot the dendrogram (if applicable):
plot(as.dendrogram(clust))
The data we will use is a London gang network.
net <- read_graph("network_data/london.graphml", format="graphml")
net
IGRAPH 638f4bf UN-- 54 1010 --
+ attr: name (v/c), Age (v/n), Birthplace (v/n), Residence (v/n),
| Arrests (v/n), Convictions (v/n), Prison (v/n), Music (v/n), Ranking
| (v/n), id (v/c)
+ edges from 638f4bf (vertex names):
[1] X1--X2 X1--X3 X1--X4 X1--X4 X1--X5 X1--X6 X1--X7 X1--X7 X1--X8
[10] X1--X8 X1--X8 X1--X9 X1--X9 X1--X10 X1--X10 X1--X11 X1--X11 X1--X11
[19] X1--X12 X1--X17 X1--X18 X1--X20 X1--X20 X1--X21 X1--X21 X1--X22 X1--X22
[28] X1--X22 X1--X23 X1--X25 X1--X27 X1--X28 X1--X29 X1--X43 X1--X45 X1--X46
[37] X1--X51 X1--X2 X2--X3 X2--X3 X2--X3 X2--X6 X2--X6 X2--X7 X2--X8
[46] X2--X8 X2--X9 X2--X10 X2--X10 X2--X11 X2--X11 X2--X12 X2--X13 X2--X14
+ ... omitted several edges
The sedist() function from the blockmodeling package calculates structural similarity.

sedist() accepts these arguments:

- M=: the adjacency matrix.
- method=: the method for calculating the distance (anything the dist() function takes can be used):
  - "euclidean" - calculate the euclidean distance
  - "cor" - calculate the Pearson correlation
- handle.interactions=: if you want to play around with how it deals with the interactions (don't worry about this).

It outputs a dist matrix.
library(blockmodeling)
To cite package 'blockmodeling' in publications please use package
citation and (at least) one of the articles:
Žiberna, Aleš (2007). Generalized blockmodeling of valued networks.
Social Networks 29(1), 105-126.
Žiberna, Aleš (2008). Direct and indirect approaches to blockmodeling
of valued networks in terms of regular equivalence. Journal of
Mathematical Sociology 32(1), 57–84.
Žiberna, Aleš (2014). Blockmodeling of multilevel networks. Social
Networks 39, 46–61. https://doi.org/10.1016/j.socnet.2014.04.002.
Žiberna, Aleš (2023). Generalized and Classical Blockmodeling of
Valued Networks, R package version 1.1.5.
Cugmas, Marjan (2023). Generalized and Classical Blockmodeling of
Valued Networks, R package version 1.1.5.
To see these entries in BibTeX format, use 'format(<citation>,
bibtex=TRUE)', or 'toBibtex(.)'.
## Convert to an adjacency matrix
mat <- as_adjacency_matrix(net, sparse=F)
## Calculate structural similarity using euclidean
london_se <- sedist(mat, method="euclidean")
## Look at first 5 rows/columns
as.matrix(london_se)[1:5,1:5]
X1 X2 X3 X4 X5
X1 0.00000 21.35416 25.29822 27.276363 26.076810
X2 21.35416 0.00000 26.38181 24.657656 24.000000
X3 25.29822 26.38181 0.00000 22.090722 22.090722
X4 27.27636 24.65766 22.09072 0.000000 8.944272
X5 26.07681 24.00000 22.09072 8.944272 0.000000
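For comparison, a hedged sketch of the correlation option from the list above (note this returns similarities, where higher means more structurally alike, rather than distances):

london_cor <- sedist(mat, method="cor")
as.matrix(london_cor)[1:3, 1:3]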
Previously we tested whether structural similarity related to ranking similarity; let's look at age similarity instead. What we need is the distance in ages for all dyads in our network. We can use the dist() function for this. If we give dist() a single vector, it will calculate the pairwise distance between all items in the vector.
## Calculate the distance in ages
age_dist <- dist(V(net)$Age, method="euclidean")
## Check the value (person 1 and 2 are the same age)
## Person 1 and 3 are off by one year.
as.matrix(age_dist)[1:5, 1:5]
1 2 3 4 5
1 0 0 1 1 4
2 0 0 1 1 4
3 1 1 0 2 5
4 1 1 2 0 3
5 4 4 5 3 0
What we need, though, is a full vector of the age distances and structural distances so we can directly compare them. If we put a matrix into c() it will convert it to a vector. Example:
test_mat <- matrix(1:4, nrow=2, ncol=2)
test_mat
[,1] [,2]
[1,] 1 3
[2,] 2 4
c(test_mat)
[1] 1 2 3 4
We can do the same thing to our distance matrix.
<- data.frame("Age_Dist"=c(age_dist),
data "SE_Dist"=c(london_se))
1:5,] data[
Age_Dist SE_Dist
1 0 21.35416
2 1 25.29822
3 1 27.27636
4 4 26.07681
5 5 24.97999
We can create a scatter plot or run a correlation (cor()):
cor(data$Age_Dist, data$SE_Dist, use="complete.obs")
[1] -0.02831001
ggplot(data, aes(x=Age_Dist, y=SE_Dist)) +
geom_jitter() + theme_minimal() +
labs(y="Struct Distance",
x="Age Distance")
We can also hierarchically cluster the output from sedist(), just as we did before:
london_se <- sedist(mat, method="euclidean")
clu <- hclust(london_se)
plot(clu)
We can cut this tree just as we did before. Remember, k= sets the number of clusters you want, while h= says how far down the y axis to go to create the clusters (only set one; a sketch of h= follows below).
memberships <- cutree(clu, k=4)
memberships[1:5]
X1 X2 X3 X4 X5
1 1 2 2 2
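A hedged sketch of the h= alternative (the cut height here is arbitrary, just for illustration):

memberships_h <- cutree(clu, h=20)
table(memberships_h) # how many clusters that height produces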
We can use this to look at it as a blockmodel. There are a few ways to do this:

plotMat() creates a matrix plot showing the blocks and coloring the squares:

- M= is the adjacency matrix
- clu= is the membership of each cluster
- orderClu= can be used to order the clusters by their total value (though this can be wonky if there are ties)
- mar= sets the margins of the plots (the defaults are gigantic)

funByBlocks() applies a function to each block. The default is to calculate the mean() of each block:

- M= and clu= (same as above)
- FUN= the function you want to use

plotMat(M=mat, clu=memberships, orderClu=T,
        mar=c(0.1, 0.1, 0.1, 0.1))
plotMat(M=mat>0, clu=memberships, orderClu=T,
mar=c(0.1, 0.1, 0.1, 0.1))
I use mat>0 here because the matrix has weights, and this switches it to only 1s and 0s.
funByBlocks(M=mat>0, clu=memberships)
1 2 3 4
1 0.7692308 0.4175824 0.1615385 0.4807692
2 0.4175824 1.0000000 0.1619048 0.2142857
3 0.1615385 0.1619048 0.1080460 0.1250000
4 0.4807692 0.2142857 0.1250000 1.0000000
funByBlocks(M=mat>0, clu=memberships, FUN=median, na.rm=T)
1 2 3 4
1 1 0 0 0
2 0 1 0 0
3 0 0 0 0
4 0 0 0 1
We can find a blockmodel by direct optimization using optRandomParC(). This function can do a lot; we are going to use it with a limited set of options:

- M=: the matrix
- k=: the number of clusters to identify
- approaches="bin": sets it to use the binary algorithm
- blocks=c("nul", "com"): allows for null or complete blocks
- rep=: the number of random starts to use (should be at least 100)
- nCores=: the number of cores on your computer to use (defaults to 1; requires the doParallel and doRNG packages)

It will automatically display the number of solutions and the errors (for the best solution).
blocks <- optRandomParC(M=mat, k=4, approaches="bin",
                        blocks=c("nul", "com"),
                        rep=500, nCores=4)
Loading required namespace: doRNG
Optimization of all partitions completed
2 solution(s) with minimal error = 384 found.
# The image matrix
IM(blocks)
[,1] [,2] [,3] [,4]
[1,] "nul" "nul" "nul" "nul"
[2,] "nul" "com" "nul" "com"
[3,] "nul" "nul" "com" "com"
[4,] "nul" "com" "com" "com"
# The error matrix
EM(blocks)
[,1] [,2] [,3] [,4]
[1,] 76 46 37 35
[2,] 46 10 12 9
[3,] 37 12 2 8
[4,] 35 9 8 2
You can use either clu() or orderClu() to access the clusters. orderClu() simply reorders the clusters (same idea as above).
clu(blocks)[1:10]
[1] 4 4 4 3 3 3 4 2 2 2
orderClu(blocks)[1:10]
4 4 4 3 3 3 4 2 2 2
1 1 1 2 2 2 1 3 3 3
You can plot it as is
plot(x=blocks, mar=c(0.1, 0.1, 0.1, 0.1))
Or ordered
plot(x=blocks, mar=c(0.1, 0.1, 0.1, 0.1), orderClu=T)
For igraph, the graph_from_biadjacency_matrix() function can be used to convert the matrix into a bipartite network. We are using a network of interlocking boards from Scotland. You can download it here.
library(igraph)
library(ggraph)
<- read.csv("network_data/scotland.csv", row.names=1)
inc_df <- as.matrix(inc_df)
inc_mat
<- graph_from_biadjacency_matrix(inc_mat)
net
net
IGRAPH cbaad78 UN-B 244 358 --
+ attr: type (v/l), name (v/c)
+ edges from cbaad78 (vertex names):
[1] earl of dalkeith--north.british.railway
[2] earl of dalkeith--forth.bridge.railway
[3] earl of dalkeith--scottish.widows.fund.life.ass.
[4] carlow, c. --north.british.railway
[5] carlow, c. --fife.coal
[6] carlow, c. --royal.bank.of.scotland
[7] gilroy, a.(b.) --north.british.railway
[8] gilroy, a.(b.) --victoria.jute
+ ... omitted several edges
The important thing that sets this up as a bipartite network is that there is a nodal attribute, type, that defines which node is in which group.
table(V(net)$type)
FALSE TRUE
136 108
This is useful, but also frustrating, as it doesn't tell you what is what. I recommend creating a new attribute with better labels.
V(net)$label <- NA
# I know which is which because
# I know there are 108 companies.
V(net)$label[V(net)$type] <- "Company"
V(net)$label[!V(net)$type] <- "Board Member"
table(V(net)$label)
Board Member Company
136 108
There is a layout for bipartite networks (I don’t like it)
ggraph(net, "bipartite") +
geom_edge_link(color="gray") +
geom_node_point(size=3, aes(color=label))
Let's start by calculating the degree for each group independently. This is easy: subset the results using our label (or the type) attribute.
<- degree(net)[V(net)$label == "Company"]
corp_degree <- degree(net)[V(net)$label != "Company"]
member_degree
mean(corp_degree)
[1] 3.314815
mean(member_degree)
[1] 2.632353
For density, we need the number of edges, and the number of each type:
table(V(net)$type)
FALSE TRUE
136 108
ecount(net)/(136*108)
[1] 0.02437364
There is no easy way to calculate the 4-cycle closure measure in R (or UCINET, from what I can tell).
Projecting can be pretty easy:
## Get the "true"
<- bipartite_projection(net, which="true")
company_net
## Get the "false"
<- bipartite_projection(net, which="false")
members_net
members_net
IGRAPH fefd4dd UNW- 136 678 --
+ attr: name (v/c), label (v/c), weight (e/n)
+ edges from fefd4dd (vertex names):
[1] earl of dalkeith--carlow, c.
[2] earl of dalkeith--gilroy, a.(b.)
[3] earl of dalkeith--grierson, h.
[4] earl of dalkeith--mccosh, a.k.
[5] earl of dalkeith--macpherson, h.s.
[6] earl of dalkeith--simpson, a.
[7] earl of dalkeith--younger, h.g.
[8] earl of dalkeith--lord balfour of burleigh
+ ... omitted several edges
We can then plot this
ggraph(members_net, "kk") +
geom_edge_link(color="gray", aes(width=weight)) +
geom_node_point(size=3)
First identify what the distribution of edge weights looks like.
hist(E(members_net)$weight)
Let's drop any edge with a weight less than 2 (THIS IS A BAD IDEA FOR THIS NETWORK). We can use which() to identify which edges are below the threshold and delete.edges() to… delete edges.
edge_to_delete <- which(E(members_net)$weight < 2)
pruned_net <- delete.edges(members_net, edge_to_delete)
Warning: `delete.edges()` was deprecated in igraph 2.0.0.
ℹ Please use `delete_edges()` instead.
ggraph(pruned_net, "kk") +
geom_edge_link(color="gray", aes(width=weight)) +
geom_node_point(size=3)
This isn't a super common metric, so I wrote an R function you can download. You can use it easily by calling the file with the code through source(). It works on the matrix, so we create that with as_biadjacency_matrix().
source("network_data/project_bonac.r")
# you will now have a project_bonac() function
mat <- as_biadjacency_matrix(net)
bonac_normalized <- project_bonac(mat)
bonac_memb_net <- graph_from_adjacency_matrix(bonac_normalized,
                                              diag=F, mode="undirected", weight=T)
ggraph(bonac_memb_net, "kk") +
geom_edge_link(color="gray", aes(width=weight)) +
geom_node_point(size=3) + scale_edge_width(range=c(0.5, 2))
If you want to cluster the bipartite network using the Dual Louvain method then you have to do some work to get it back into the right spot:
clust_1 <- cluster_louvain(members_net)
clust_2 <- cluster_louvain(company_net)

## Put each label in the right spot
node_select <- V(net)$label=="Board Member"
V(net)$dual_cluster[node_select] <- letters[clust_1$membership]
node_select <- V(net)$label=="Company"
V(net)$dual_cluster[node_select] <- LETTERS[clust_2$membership]
# To clean things up I'm going to drop
# the isolates
plot_net <- delete.vertices(net, degree(net) == 0)
Warning: `delete.vertices()` was deprecated in igraph 2.0.0.
ℹ Please use `delete_vertices()` instead.
ggraph(plot_net, "kk") +
geom_edge_link(color="gray") +
geom_node_point(size=3,
aes(color=dual_cluster, shape=label))
The actual running of QAP-based regression and correlation isn't that hard, but getting the data into the right format can be. For dyadic QAP tests we need matrices/networks that represent all of our variables of interest. The sna package has (most of) the QAP functions.
We are going to use three networks and two other datasets.
library(igraph)
setwd("network_data")
<- read.csv("cow/mids_net.csv")
mids_df <- read.csv("cow/nmc_vertex.csv")
nmc_data <- graph_from_data_frame(mids_df,
mids_net vertices=nmc_data, directed=F)
<- read.csv("cow/trade_net.csv")
trade_df <- graph_from_data_frame(trade_df,
trade_net vertices=nmc_data, directed=F)
<- read.csv("cow/cont_net.csv")
cont_df <- graph_from_data_frame(cont_df,
cont_net vertices=nmc_data, directed=F)
Most of what we will use is in the sna package, which accepts matrices, so we convert everything into matrices. Remember, we set sparse=F, and if we want to use weights we set attr="weight" (we don't have weights here).
mids_mat <- as_adjacency_matrix(mids_net, sparse=F)

trade_mat <- as_adjacency_matrix(trade_net, sparse=F)

cont_mat <- as_adjacency_matrix(cont_net, sparse=F)
gcor() will correlate two matrices, but doesn't give us a p-value hypothesis test. Set mode="graph" if you have an undirected network, or mode="digraph" if you have a directed network.
library(sna)
Loading required package: statnet.common
Attaching package: 'statnet.common'
The following objects are masked from 'package:base':
attr, order
Loading required package: network
'network' 1.18.2 (2023-12-04), part of the Statnet Project
* 'news(package="network")' for changes since last version
* 'citation("network")' for citation information
* 'https://statnet.org' for help, support, and other information
Attaching package: 'network'
The following objects are masked from 'package:igraph':
%c%, %s%, add.edges, add.vertices, delete.edges, delete.vertices,
get.edge.attribute, get.edges, get.vertex.attribute, is.bipartite,
is.directed, list.edge.attributes, list.vertex.attributes,
set.edge.attribute, set.vertex.attribute
sna: Tools for Social Network Analysis
Version 2.7-2 created on 2023-12-05.
copyright (c) 2005, Carter T. Butts, University of California-Irvine
For citation information, type citation("sna").
Type help(package="sna") to get started.
Attaching package: 'sna'
The following object is masked from 'package:blockmodeling':
sedist
The following objects are masked from 'package:igraph':
betweenness, bonpow, closeness, components, degree, dyad.census,
evcent, hierarchy, is.connected, neighborhood, triad.census
gcor(cont_mat, trade_mat, mode="graph")
[1] 0.2254307
To test the hypothesis we need qaptest(), which takes a list of networks, the function you want to use to calculate the test statistic, and two annoying indicators (g1=1, g2=2). Finally, reps= tells it how many permutations to do.
qap <- qaptest(list(cont_mat, trade_mat),
               gcor, g1=1, g2=2, reps=1000, mode="graph")
You can call summary() on your output to get details:

- p(f(perm) >= f(d)): the proportion of permuted values that are greater than or equal to your real value; a one-sided greater-than p-value.
- p(f(perm) <= f(d)): the proportion of permuted values that are less than or equal to your real value; a one-sided less-than p-value.

summary(qap)
QAP Test Results
Estimated p-values:
p(f(perm) >= f(d)): 0
p(f(perm) <= f(d)): 1
Test Diagnostics:
Test Value (f(d)): 0.2254307
Replications: 1000
Distribution Summary:
Min: -0.01351818
1stQ: -0.005134005
Med: -0.0009419196
Mean: -0.000174768
3rdQ: 0.003250166
Max: 0.02840268
If you are curious, you can access all the different correlations found through the permutation/QAP test with $dist, and the observed value with $testval.
sum(abs(qap$dist) < qap$testval) # Two-sided p-value
[1] 1000
hist(qap$dist)
We can use netlogit() or netlm() to calculate our regression. It takes:

- y=: our DV (as a matrix or network)
- x=: a list of independent variables (as matrices or networks)
- reps=: the number of reps (for now set this at something like 100)
- nullhyp=: whether we want to do Y-permutations ("qap") or semi-partial ("qapspp")

We already have our trade network and contiguity network, so we can just put them in:
mod <- netlogit(y=mids_mat,
                x=list(trade_mat, cont_mat),
                reps=100, nullhyp="qapspp")
To get the results just call the object
mod
Network Logit Model
Coefficients:
Estimate Exp(b) Pr(<=b) Pr(>=b) Pr(>=|b|)
(intercept) -5.553377 0.00387435 1.00 0.00 1.00
x1 0.341634 1.40724509 0.86 0.14 0.33
x2 4.620454 101.54015727 1.00 0.00 0.00
Goodness of Fit Statistics:
Null deviance: 50308.62 on 36290 degrees of freedom
Residual deviance: 2578.148 on 36287 degrees of freedom
Chi-Squared test of fit improvement:
47730.47 on 3 degrees of freedom, p-value 0
AIC: 2584.148 BIC: 2609.646
Pseudo-R^2 Measures:
(Dn-Dr)/(Dn-Dr+dfn): 0.5680815
(Dn-Dr)/Dn: 0.9487534
Each row shows estimated statistics for the variable:
- Pr(<=b): one-sided less-than p-value.
- Pr(>=b): one-sided greater-than p-value.
- Pr(>=|b|): two-sided p-value.
We talked about creating variables that indicate whether two nodes are equal, the distance between two nodes, or the sum of two nodes… How? The outer() function.
outer() takes in two vectors and an operation, and applies that operation to every pairwise combination.
test_vector <- c(1,3,5)
outer(test_vector, test_vector, "+")
[,1] [,2] [,3]
[1,] 2 4 6
[2,] 4 6 8
[3,] 6 8 10
We can do this with differences too, but note that a-b is different from b-a, so for undirected data we want to take the absolute value.
test_vector <- c(1,3,5)
outer(test_vector, test_vector, "-")
[,1] [,2] [,3]
[1,] 0 -2 -4
[2,] 2 0 -2
[3,] 4 2 0
outer(test_vector, test_vector, "-") |> abs()
[,1] [,2] [,3]
[1,] 0 2 4
[2,] 2 0 2
[3,] 4 2 0
It is often interesting to look at whether two nodes have the same value on some attribute. We can do that by giving "==" to outer():
<- c("Miami", "OSU", "OU")
test_vector outer(test_vector, test_vector, "==")
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE FALSE TRUE
So to add the difference in military personnel we need to extract that vector from the network, clean it up slightly, put it through outer(), and then take the absolute value.
mil_cap <- V(mids_net)$milper
mil_cap[mil_cap == -9] <- 0 # Replacing missing values with 0s.
mil_cap_mat <- outer(mil_cap, mil_cap, "-")
mil_cap_mat <- abs(mil_cap_mat)
We can add that matrix to the netlogit
function just as we did with the networks before.
mod <- netlogit(y=mids_mat,
                x=list(trade_mat, cont_mat, mil_cap_mat),
                reps=100)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
(this warning repeats once per permutation)
summary(mod)
Network Logit Model
Coefficients:
Estimate Exp(b) Pr(<=b) Pr(>=b) Pr(>=|b|)
(intercept) -5.7848114241 0.00307389 1.00 0.00 1.00
x1 0.3666349780 1.44287114 0.87 0.13 0.24
x2 4.5827769946 97.78556750 1.00 0.00 0.00
x3 0.0008590974 1.00085947 1.00 0.00 0.00
Goodness of Fit Statistics:
Null deviance: 50308.62 on 36290 degrees of freedom
Residual deviance: 2509.59 on 36286 degrees of freedom
Chi-Squared test of fit improvement:
47799.03 on 4 degrees of freedom, p-value 0
AIC: 2517.59 BIC: 2551.587
Pseudo-R^2 Measures:
(Dn-Dr)/(Dn-Dr+dfn): 0.5684336
(Dn-Dr)/Dn: 0.9501161
Contingency Table (predicted (rows) x actual (cols)):
Actual
Predicted 0 1
0 35942 314
1 22 12
Total Fraction Correct: 0.9907413
Fraction Predicted 1s Correct: 0.3529412
Fraction Predicted 0s Correct: 0.9913394
False Negative Rate: 0.9631902
False Positive Rate: 0.0006117228
Test Diagnostics:
Null Hypothesis: qap
Replications: 100
Distribution Summary:
(intercept) x1 x2 x3
Min -92.54464 -2.63579 -2.35812 -4.43309
1stQ -89.01781 -1.04673 -1.07821 -1.95423
Median -88.30635 -0.43733 -0.05183 -0.38320
Mean -88.33197 -0.11787 0.03580 -0.09013
3rdQ -87.53640 0.76717 0.78712 1.33367
Max -83.63469 5.23985 4.63217 8.70260
We are going to load the high school data we had before and then delete the nodes that have missing information.
<- read.csv("network_data/spanish_hs_edges.csv")
edges <- read.csv("network_data/spanish_hs_nodes.csv")
nodes <- graph_from_data_frame(edges, vertices=nodes, directed=T)
net <- delete_vertices(net, which(is.na(V(net)$prosocial))) net
We want to look at what leads an individual to indicate that they have a better relationship with someone.
The DV is the rating the sender gave the relationship (-2 to +2), with 0 meaning no relationship.
Our dyadic independent variables: whether the two students share a sex, whether they share a group, whether the sender (or receiver) is male, and the sender's (or receiver's) prosocial score.
The dependent variable is easy enough using as_adjacency_matrix() and extracting the weight attribute. We've already seen how to identify whether two nodes are the same.
dv_mat <- as_adjacency_matrix(net, attr="weight", sparse=F)
same_sex <- outer(V(net)$Sexo, V(net)$Sexo, "==")
same_group <- outer(V(net)$Grupo, V(net)$Grupo, "==")
To create a 'sender is male' matrix we can compare the nodal sex with a vector of just "Male". Remember that outer() runs through the first argument along the rows, so for the sender matrix we put the nodes' attribute first, and we switch the order for the receiver matrix.
send_male <- outer(V(net)$Sexo, rep("Male", length(V(net))), "==")
rec_male <- outer(rep("Male", length(V(net))), V(net)$Sexo, "==")
We can do something similar with the prosocial variable by just multiplying it by 1.
send_soc <- outer(V(net)$prosocial, rep(1, length(V(net))), "*")
rec_soc <- outer(rep(1, length(V(net))), V(net)$prosocial, "*")
If you are confused by what we are doing, try running:
outer(c(1.4, 0.4, 0.1), rep(1, 3), "*")
outer(rep(1, 3), c(1.4, 0.4, 0.1), "*")
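If you run those, the first returns a matrix where each row repeats a sender's value, and the second a matrix where each column repeats a receiver's value; a quick annotated sketch (the demo names are just for illustration):
send_demo <- outer(c(1.4, 0.4, 0.1), rep(1, 3), "*") # rows of 1.4s, 0.4s, 0.1s
rec_demo  <- outer(rep(1, 3), c(1.4, 0.4, 0.1), "*") # columns of 1.4s, 0.4s, 0.1s
send_demo == t(rec_demo) # the two matrices are transposes of each other: all TRUE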
We can estimate this just like before using netlm(). mode="digraph" is the default and indicates a directed network.
mod <- netlm(y=dv_mat, x=list(same_sex, same_group,
                              send_male, rec_male,
                              rec_soc, send_soc), mode="digraph")
mod
OLS Network Model
Coefficients:
Estimate Pr(<=b) Pr(>=b) Pr(>=|b|)
(intercept) -0.054080439 0.193 0.807 0.381
x1 0.106753064 1.000 0.000 0.000
x2 0.391130533 1.000 0.000 0.000
x3 -0.005801325 0.416 0.584 0.865
x4 0.040534712 0.979 0.021 0.044
x5 -0.024541180 0.265 0.735 0.554
x6 0.029253310 0.655 0.345 0.662
Residual standard error: 0.5489 on 6313 degrees of freedom
F-statistic: 114.3 on 6 and 6313 degrees of freedom, p-value: 0
Multiple R-squared: 0.09797 Adjusted R-squared: 0.09711
The coefficients on same sex (x1) and same group (x2) are both statistically significant and positive. In particular, if the receiver is in the same group as the sender then they are likely to be rated 0.39 points higher, and if they are the same sex then they are likely to be rated 0.11 higher.
The coefficient on receiver male has a two-sided p-value of 0.044 and so is also statistically significant. This indicates that male students are more likely to receive a higher rating than female students (although the effect is relatively small as the coefficient is 0.04).
If you want to compare it to the version without QAP you can set nullhyp="classic". You'll see that the two-sided p-values are generally smaller in this case (except for the ones that are exactly 0).
mod_wrong <- netlm(y=dv_mat, x=list(same_sex, same_group,
                                    send_male, rec_male,
                                    rec_soc, send_soc),
                   mode="digraph", nullhyp="classic")
mod_wrong
OLS Network Model
Coefficients:
Estimate Pr(<=b) Pr(>=b) Pr(>=|b|)
(intercept) -0.054080439 0.03956882 9.604312e-01 7.913763e-02
x1 0.106753064 1.00000000 1.766833e-14 3.533667e-14
x2 0.391130533 1.00000000 9.981165e-126 1.996233e-125
x3 -0.005801325 0.34136768 6.586323e-01 6.827354e-01
x4 0.040534712 0.99784800 2.151996e-03 4.303991e-03
x5 -0.024541180 0.20131319 7.986868e-01 4.026264e-01
x6 0.029253310 0.84077019 1.592298e-01 3.184596e-01
Residual standard error: 0.5489 on 6313 degrees of freedom
F-statistic: 114.3 on 6 and 6313 degrees of freedom, p-value: 0
Multiple R-squared: 0.09797 Adjusted R-squared: 0.09711
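To see the contrast side by side: netlm objects store the two-sided p-values in a pgreqabs component (see ?netlm), so we can line the two models up:
# Two-sided p-values from QAP permutation vs. classical OLS inference.
round(cbind(qap = mod$pgreqabs, classic = mod_wrong$pgreqabs), 4)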
For nodal tests we need another library (I know, right?), this one is permuco, so run install.packages("permuco").
What do we need to do to test what leads to having more trading partners? (If degree() is freaking out, it might be a clash between sna and igraph; fix it with igraph::degree().)
V(trade_net)$degree <- igraph::degree(trade_net)
trade_df <- as_data_frame(trade_net, what="vertices")
setwd("network_data")
democracy_df <- read.csv("cow/democracy_data.csv")
To merge two datasets we use the merge() function:
- x= The first dataset.
- y= The second dataset.
- by.x= The column from the first dataset to merge on.
- by.y= The column from the second dataset to merge on.
all_data <- merge(trade_df, democracy_df,
                  by.x="name",
                  by.y="COWcode")
Running the regression is pretty simple, and is similar to what you would do with a normal linear regression in R. lmperm() is the function; it takes a formula, a dataset, and a number of permutations (let's do 50 so this doesn't take forever).
library(permuco)
mod <- lmperm(degree ~ v2x_polyarchy + milper,
              data=all_data, np=50)
Warning in lmperm_fix(formula = formula, data = data, method = method, type =
type, : The number of permutations is below 2000, p-values might be unreliable.
Warning in check_distribution(distribution = distribution, digits = 10, : the
distribution of v2x_polyarchy, milper may be discrete.
The results are similarly formatted.
summary(mod)
Table of marginal t-test of the betas
Resampling test using freedman_lane to handle nuisance variables and 50 permutations.
Estimate Std. Error t value parametric Pr(>|t|)
(Intercept) -0.046836 0.4700088 -0.09965 9.207e-01
v2x_polyarchy 3.265364 0.7930987 4.11722 6.017e-05
milper 0.003611 0.0007406 4.87514 2.515e-06
resampled Pr(<t) resampled Pr(>t) resampled Pr(>|t|)
(Intercept)
v2x_polyarchy 1 0.02 0.02
milper 1 0.02 0.02
We are going to go back to the MIDs data we were using before (scroll up for info on it).
library(network)
library(ergm)
Registered S3 methods overwritten by 'ergm':
method from
simulate.formula lme4
simulate.formula_lhs lme4
summary.formula Hmisc
'ergm' 4.6.0 (2023-12-17), part of the Statnet Project
* 'news(package="ergm")' for changes since last version
* 'citation("ergm")' for citation information
* 'https://statnet.org' for help, support, and other information
'ergm' 4 is a major update that introduces some backwards-incompatible
changes. Please type 'news(package="ergm")' for a list of major
changes.
Attaching package: 'ergm'
The following object is masked from 'package:statnet.common':
snctrl
library(intergraph)
setwd("network_data")
<- read.csv("cow/mids_net.csv")
mids_df <- read.csv("cow/nmc_vertex.csv")
nmc_data <- graph_from_data_frame(mids_df,
mids_net vertices=nmc_data, directed=F)
V(mids_net)$milper[V(mids_net)$milper<0] <- 0
<- asNetwork(mids_net)
mids_net
<- read.csv("cow/trade_net.csv")
trade_df <- graph_from_data_frame(trade_df,
trade_net vertices=nmc_data, directed=F)
<- asNetwork(trade_net)
trade_net
<- read.csv("cow/cont_net.csv")
cont_df <- graph_from_data_frame(cont_df,
cont_net vertices=nmc_data, directed=F)
<- asNetwork(cont_net) cont_net
We started (with QAP) by incorporating our other networks into the model; we can do the same here by putting those network objects into dyadcov().
mod <- ergm(mids_net ~ edges +
              dyadcov(trade_net) +
              dyadcov(cont_net))
Starting maximum pseudolikelihood estimation (MPLE):
Obtaining the responsible dyads.
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Evaluating log-likelihood at the estimate.
summary(mod)
Call:
ergm(formula = mids_net ~ edges + dyadcov(trade_net) + dyadcov(cont_net))
Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
edges -5.5534 0.1207 0 -46.026 <1e-04 ***
dyadcov.trade_net 0.3416 0.2993 0 1.141 0.254
dyadcov.cont_net 4.6205 0.1813 0 25.479 <1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null Deviance: 25154 on 18145 degrees of freedom
Residual Deviance: 1289 on 18142 degrees of freedom
AIC: 1295 BIC: 1318 (Smaller is better. MC Std. Err. = 0)
We can add in a variety of dyad-independent terms:
- nodematch() will add a term indicating that two nodes match on a categorical attribute.
- nodecov() will add the dyadic sum of the attribute.
- nodefactor() will add a count of how many times each level of an attribute shows up in a dyad.
- absdiff() will add a term that is the absolute difference between the nodes on whatever variable you include.
We will start by accounting for the absolute difference in military personnel.
mod <- ergm(mids_net ~ edges +
              dyadcov(trade_net) +
              dyadcov(cont_net) +
              absdiff("milper"))
Starting maximum pseudolikelihood estimation (MPLE):
Obtaining the responsible dyads.
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Evaluating log-likelihood at the estimate.
The coefficients are going to be the same as with QAP, but the p-values aren't. Note that we haven't included any dyad-dependent network terms here yet.
summary(mod)
Call:
ergm(formula = mids_net ~ edges + dyadcov(trade_net) + dyadcov(cont_net) +
absdiff("milper"))
Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
edges -5.7848114 0.1328795 0 -43.534 <1e-04 ***
dyadcov.trade_net 0.3666350 0.3015442 0 1.216 0.224
dyadcov.cont_net 4.5827770 0.1845116 0 24.837 <1e-04 ***
absdiff.milper 0.0008591 0.0001316 0 6.530 <1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null Deviance: 25154 on 18145 degrees of freedom
Residual Deviance: 1255 on 18141 degrees of freedom
AIC: 1263 BIC: 1294 (Smaller is better. MC Std. Err. = 0)
The two gw-terms we've learned:
- gwesp() captures the tendency towards transitivity. Positive values mean open triads are more likely to close.
- gwdegree() captures a tendency towards equal or unequal distributions of edges. Positive values mean edges are more equally distributed; negative values mean edges tend to go to those who already have edges (popularity).
Both can have their decay set to 0 with decay=0, fixed=T.
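To build some intuition for what decay= does, here is a small numerical sketch. Under the standard geometric weighting (Hunter 2007), the i-th shared partner on an edge adds (1 - exp(-decay))^(i-1) to the gwesp statistic, so with decay=0 only the first shared partner counts:
# Weight added by the i-th shared partner, for a few decay values;
# larger decay lets additional shared partners keep adding (diminishing) weight.
i <- 1:5
sapply(c(0, 0.25, 0.75), function(decay) (1 - exp(-decay))^(i - 1))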
full_mod <- ergm(mids_net ~ edges + dyadcov(trade_net) +
                   dyadcov(cont_net) + nodecov("milper") +
                   gwdegree(decay=0, fixed=T) +
                   gwesp(decay=0, fixed=T))
Starting maximum pseudolikelihood estimation (MPLE):
Obtaining the responsible dyads.
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Starting Monte Carlo maximum likelihood estimation (MCMLE):
Iteration 1 of at most 60:
Optimizing with step length 0.1218.
The log-likelihood improved by 2.0849.
Estimating equations are not within tolerance region.
Iteration 2 of at most 60:
Optimizing with step length 0.1449.
The log-likelihood improved by 1.8646.
Estimating equations are not within tolerance region.
Iteration 3 of at most 60:
Optimizing with step length 0.1308.
The log-likelihood improved by 1.8698.
Estimating equations are not within tolerance region.
Iteration 4 of at most 60:
Optimizing with step length 0.4978.
The log-likelihood improved by 2.9557.
Estimating equations are not within tolerance region.
Iteration 5 of at most 60:
Optimizing with step length 0.9212.
The log-likelihood improved by 2.3475.
Estimating equations are not within tolerance region.
Iteration 6 of at most 60:
Optimizing with step length 1.0000.
The log-likelihood improved by 0.1328.
Estimating equations are not within tolerance region.
Iteration 7 of at most 60:
Optimizing with step length 1.0000.
The log-likelihood improved by 0.0764.
Convergence test p-value: 0.0577. Not converged with 99% confidence; increasing sample size.
Iteration 8 of at most 60:
Optimizing with step length 1.0000.
The log-likelihood improved by 0.0353.
Convergence test p-value: 0.8188. Not converged with 99% confidence; increasing sample size.
Iteration 9 of at most 60:
Optimizing with step length 1.0000.
The log-likelihood improved by 0.0316.
Convergence test p-value: 0.1981. Not converged with 99% confidence; increasing sample size.
Iteration 10 of at most 60:
Optimizing with step length 1.0000.
The log-likelihood improved by 0.0166.
Convergence test p-value: 0.1883. Not converged with 99% confidence; increasing sample size.
Iteration 11 of at most 60:
Optimizing with step length 1.0000.
The log-likelihood improved by 0.0123.
Convergence test p-value: 0.0042. Converged with 99% confidence.
Finished MCMLE.
Evaluating log-likelihood at the estimate. Fitting the dyad-independent submodel...
Bridging between the dyad-independent submodel and the full model...
Setting up bridge sampling...
Using 16 bridges: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 .
Bridging finished.
This model was fit using MCMC. To examine model diagnostics and check
for degeneracy, use the mcmc.diagnostics() function.
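As that message suggests, it is worth a quick look at the chains before trusting the estimates:
# Trace and density plots of the MCMC chains; look for good mixing and no drift.
mcmc.diagnostics(full_mod)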
summary(full_mod)
Call:
ergm(formula = mids_net ~ edges + dyadcov(trade_net) + dyadcov(cont_net) +
nodecov("milper") + gwdegree(decay = 0, fixed = T) + gwesp(decay = 0,
fixed = T))
Monte Carlo Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
edges -5.4664006 0.1834414 0 -29.799 <1e-04 ***
dyadcov.trade_net 0.2838874 0.2630876 0 1.079 0.281
dyadcov.cont_net 4.0072176 0.1818470 0 22.036 <1e-04 ***
nodecov.milper 0.0006911 0.0001079 0 6.403 <1e-04 ***
gwdeg.fixed.0 -1.0343249 0.2488827 0 -4.156 <1e-04 ***
gwesp.fixed.0 0.8095523 0.1512802 0 5.351 <1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null Deviance: 25154 on 18145 degrees of freedom
Residual Deviance: 1123 on 18139 degrees of freedom
AIC: 1135 BIC: 1182 (Smaller is better. MC Std. Err. = 0.459)
If you look at the ergm documentation, there are a lot of network terms. This leads to a question: can we tell whether we have a good (or bad) model?
One way to check is to see if our model does a good job simulating networks with characteristics similar to the observed one.
This can be done with the gof() function and then plot().
gof() takes a model object, simulates a large number of networks, and calculates descriptive statistics on them.
The boxplots show the distribution of those statistics across the simulated networks; the black line shows the statistic for the observed network. A model fits well when the black line generally stays within the gray boxplots.
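For example, a minimal check on the MIDs model we just fit:
# Simulate from the fitted model and compare the simulated statistics
# to the observed network.
est_gof <- gof(full_mod)
par(mfrow=c(2,2)) # view the panels together
plot(est_gof)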
Lets go back to the high school friendship data and see whether students are more likely to be friends with students of the same sex or group.
Need to do some cleaning to get the data in:
<- read.csv("network_data/spanish_hs_edges.csv")
edges <- read.csv("network_data/spanish_hs_nodes.csv")
nodes <- graph_from_data_frame(edges, vertices=nodes, directed=T)
net <- delete_vertices(net, which(is.na(V(net)$prosocial)))
net <- delete_edges(net, which(E(net)$weight < 0))
net <- as.undirected(net, mode="mutual")
net
<- asNetwork(net) net
Run our simple model.
mod <- ergm(net ~ edges + nodematch("Sexo") + nodematch("Grupo") +
              nodefactor("Sexo") + nodefactor("Grupo") +
              nodecov("prosocial"))
Starting maximum pseudolikelihood estimation (MPLE):
Obtaining the responsible dyads.
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Evaluating log-likelihood at the estimate.
summary(mod)
Call:
ergm(formula = net ~ edges + nodematch("Sexo") + nodematch("Grupo") +
nodefactor("Sexo") + nodefactor("Grupo") + nodecov("prosocial"))
Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
edges -4.61398 0.38530 0 -11.975 <1e-04 ***
nodematch.Sexo 1.22256 0.18073 0 6.765 <1e-04 ***
nodematch.Grupo 2.95815 0.19446 0 15.212 <1e-04 ***
nodefactor.Sexo.Male 0.10895 0.11254 0 0.968 0.3330
nodefactor.Grupo.B -0.09376 0.11821 0 -0.793 0.4277
nodefactor.Grupo.C -0.15182 0.13449 0 -1.129 0.2590
nodefactor.Grupo.D 0.23911 0.14274 0 1.675 0.0939 .
nodecov.prosocial -0.35101 0.23933 0 -1.467 0.1425
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null Deviance: 4381 on 3160 degrees of freedom
Residual Deviance: 1108 on 3152 degrees of freedom
AIC: 1124 BIC: 1173 (Smaller is better. MC Std. Err. = 0)
est_gof <- gof(mod)
par(mfrow=c(2,2)) # Arrange the GOF plots in a 2-by-2 grid
plot(est_gof)
Let's add in dyad-dependent terms, starting with a fixed decay of 0.
set.seed(1)
mod <- ergm(net ~ edges + nodematch("Sexo") + nodematch("Grupo") +
              nodefactor("Sexo") + nodefactor("Grupo") +
              nodecov("prosocial") +
              gwesp(decay=0, fixed=T) + gwdegree(0, fixed=T))
est_gof <- gof(mod)
par(mfrow=c(2,2))
plot(est_gof)
mod <- ergm(net ~ edges + nodematch("Sexo") + nodematch("Grupo") +
              nodefactor("Sexo") + nodefactor("Grupo") +
              nodecov("prosocial") +
              gwesp(decay=.25, fixed=T) + gwdegree(.25, fixed=T))
est_gof <- gof(mod)
par(mfrow=c(2,2))
plot(est_gof)
mod <- ergm(net ~ edges + nodematch("Sexo") + nodematch("Grupo") +
              nodefactor("Sexo") + nodefactor("Grupo") +
              nodecov("prosocial") +
              gwesp(decay=.5, fixed=T) + gwdegree(.5, fixed=T))
est_gof <- gof(mod)
par(mfrow=c(2,2))
plot(est_gof)
mod <- ergm(net ~ edges + nodematch("Sexo") + nodematch("Grupo") +
              nodefactor("Sexo") + nodefactor("Grupo") +
              nodecov("prosocial") +
              gwesp(decay=.75, fixed=T) + gwdegree(.75, fixed=T))
est_gof <- gof(mod)
par(mfrow=c(2,2))
plot(est_gof)
summary(mod)
Call:
ergm(formula = net ~ edges + nodematch("Sexo") + nodematch("Grupo") +
nodefactor("Sexo") + nodefactor("Grupo") + nodecov("prosocial") +
gwesp(decay = 0.75, fixed = T) + gwdegree(0.75, fixed = T))
Monte Carlo Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
edges -5.940007 0.351806 0 -16.884 < 1e-04 ***
nodematch.Sexo 0.724214 0.121896 0 5.941 < 1e-04 ***
nodematch.Grupo 1.408892 0.147166 0 9.573 < 1e-04 ***
nodefactor.Sexo.Male 0.034010 0.067383 0 0.505 0.61375
nodefactor.Grupo.B -0.005018 0.056520 0 -0.089 0.92925
nodefactor.Grupo.C -0.048045 0.070391 0 -0.683 0.49490
nodefactor.Grupo.D 0.136707 0.081334 0 1.681 0.09280 .
nodecov.prosocial -0.192379 0.168620 0 -1.141 0.25391
gwesp.fixed.0.75 1.316061 0.136495 0 9.642 < 1e-04 ***
gwdeg.fixed.0.75 1.075639 0.384633 0 2.797 0.00517 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null Deviance: 4380.7 on 3160 degrees of freedom
Residual Deviance: 952.1 on 3150 degrees of freedom
AIC: 972.1 BIC: 1033 (Smaller is better. MC Std. Err. = 0.4174)
We can look at each term here in turn:
- edges: This is equivalent to the intercept of a normal regression and so generally is not worth discussing.
- nodematch.Sexo: The effect of two students having the same sex. The positive coefficient (0.72) is statistically significant (p-value < 0.05), meaning two students are more likely to have a friendship (an edge) if they are of the same sex.
- nodematch.Grupo: The effect of two students being in the same group. This coefficient is again positive (1.41) and significant (p-value < 0.05), meaning students in the same group are more likely to be friends.
- nodefactor.Sexo.Male: Whether male students are more likely to have friends than female students. It is not significant, so there is no detectable difference.
- nodefactor.Grupo.B, nodefactor.Grupo.C, and nodefactor.Grupo.D: Whether students in groups B, C, and D are more likely to have friends than those in group A. Again not significant, so no difference.
- nodecov.prosocial: Whether the dyad's total amount of "prosocial" behavior increases the chance of a friendship between two people. Again, no effect.
- gwesp.fixed.0.75: How much transitivity matters to the network structure. The coefficient is positive (1.32) and statistically significant (p-value < 0.05), meaning triangles are more likely to form in this network than in a comparable random network; transitivity is at work here.
- gwdeg.fixed.0.75: This models the distribution of degrees. The positive (1.08) and significant (p-value < 0.05) coefficient means degrees tend to be spread relatively evenly across students rather than concentrated in a few popular nodes. (If it were negative we'd expect a "rich get richer" pattern, with a few students holding many ties and the rest holding few.)