Background
This project was done as part of the Data Mining course at the University of Cincinnati. Three classmates - Aarti Ranganathan, Swaraj Mohapatra and Ambrose Wong - and I worked together to come up with this interesting project.
Introduction
In today's developed world everything is connected: people, information, events and places, and online social media has made these connections even easier to form. A practical way of making sense of such fuzzy connections is to analyze them as networks. With the help of Gephi, large amounts of social network data can be analyzed interactively. The project demonstrates hands-on analysis of real-world data sets and focuses on a range of tasks: from identifying important nodes in the network, to detecting communities, to tracing information diffusion and opinion formation. The idea is to extract information from a seemingly fuzzy network and recognize communities and cliques. We applied the same techniques to several networks: word adjacencies, books about US politics, political blogs and Facebook data. Using these datasets, we portrayed gender distribution in a Facebook network, commonly used adjectives in a book, conflicting trends in customer demands, and political dysfunction through blogs. This exploratory analysis was combined with unsupervised learning to detect communities in the Facebook dataset. The detection algorithms used clustering-type approaches in which each node is assigned to exactly one group. This is a major drawback because people can belong to multiple communities, so link community analysis was also performed, in which people can belong to multiple clusters.
Finally, with the help of ERGMs, various models were built for the Facebook data, and the GWESP-based model with a degree-1 term was chosen based on MCMC diagnostics, which reveal the error distribution. Although the model has only the second-best AIC and BIC values, it was still chosen because its noise distribution is normal.
Datasets
We analyzed four datasets in Gephi. The datasets differ from each other on various levels, and the idea is to be able to extract information from seemingly fuzzy and different networks. The datasets include a user's Facebook data, word adjacencies, US political blogs and political books. In addition to Gephi, we also used R to analyze the Facebook data. Note that the Facebook data was extracted using an API.
Facebook Network Analysis
Facebook data from one of the members of the group was used for analysis. The attributes extracted from the Facebook data include:
- Nodes - The names of the people in the Facebook user's network
- Id - Id is a variable that uniquely identifies the nodes on the network
- Label - Label is the attribute which defines how a node is labelled on the network
- Sex - Sex of the Facebook user's friend
- Agerank - The friend's age rank: 13-17, 18-20 or 21+
- Locale - A two-letter language code and a two-letter country code representing the friend's locale
- Degree - The total number of edges connected to a node in the network
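As a small, hypothetical illustration (the rows below are made up and only mirror the attribute schema above, not the real export), the node table can be loaded and summarised in base R:

```r
# Hypothetical sample of the exported node table
nodes <- data.frame(
  Id      = 1:5,
  Label   = c("Ann", "Ben", "Cai", "Dee", "Eli"),
  Sex     = c("female", "male", "male", "female", "male"),
  Agerank = c("21+", "18-20", "21+", "21+", "13-17"),
  Locale  = c("en_US", "en_US", "en_GB", "en_US", "en_IN")
)
# Gender split of the user's friends
table(nodes$Sex)
```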
The network in the figure below shows that the user has quite a few communities and cliques in his Facebook network. The nodes represent friends of the user and are colored based on gender: red nodes represent males, green nodes represent females, and people who have not updated their sex on Facebook are shown as blue nodes. The links between the nodes, called edges, indicate connections between the friends of the user. In this graph, the edges are undirected and each has the same weight; this reflects the fact that friendship on Facebook is mutual - if you are my friend, then I am also your friend.
The figure below shows a node that acts as a significant connection between two groups of Facebook friends. This person plays a vital role and influences the common posts that the different groups see in their notifications. Studying this node further can help analyze user behavior.
Based on the Facebook network, we can understand the distribution of the user's friends: the people who play important roles in the network based on the number of mutual friends, and the friends who act as bridges between different communities. This information affects the user's news feed, the information accessible to the user, and mutual-friend suggestions.
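One way to spot such bridging friends is through centrality. As a minimal base-R sketch on a toy network (on the real data, igraph's degree() and betweenness() would be used instead), the two nodes at the ends of a bridge edge stand out by degree:

```r
# Toy undirected edge list: two cliques joined by the bridge edge 3-4
edges <- rbind(c(1, 2), c(1, 3), c(2, 3),   # clique A
               c(3, 4),                     # bridge
               c(4, 5), c(4, 6), c(5, 6))   # clique B
# Degree of each node = number of times it appears in the edge list
deg <- table(factor(as.vector(edges), levels = 1:6))
deg  # nodes 3 and 4, the endpoints of the bridge, have the highest degree
```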
Word Adjacencies
We explored the common adjectives and nouns in the novel "David Copperfield" by Charles Dickens and created an adjacency network. The attributes in the dataset include node and edge descriptions. The node descriptions are as follows:
- Nodes - The noun or adjective commonly used in the book
- Id - Id is a variable that uniquely identifies the nodes on the network
- Label - Label is the attribute which defines how the node is labelled on the network
- Value - Indicates if the word is a noun or an adjective. 0: Adjective, 1: Noun

The edge descriptions are as follows:
- Source - Indicates the node that acts as the source for the edge
- Target - Indicates the node that acts as the target for the edge
- Type - Specifies the type of the edge: Directed
- Id - Id is a variable that uniquely identifies the edge on the network
- Weight - The weight of an edge. In this dataset, each edge has an equal weight

Since the dataset is directed, every node has an in-degree and an out-degree: the in-degree is the number of words that immediately precede the current word, and the out-degree is the number of words that immediately follow it. The edge has a source and target indicating which node is the predecessor and which node is the successor for that connection.
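To make the in-degree/out-degree distinction concrete, here is a tiny hypothetical adjacency list in base R (the words are illustrative, not taken from the actual dataset):

```r
# Each row is one directed edge: source word -> target word
el <- rbind(c("little", "man"),
            c("old",    "man"),
            c("old",    "house"),
            c("man",    "little"))
indeg  <- table(el[, 2])  # in-degree: how often a word appears as a target
outdeg <- table(el[, 1])  # out-degree: how often a word appears as a source
indeg   # "man" is preceded by two different words
outdeg  # "old" precedes two different words
```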
The figure below indicates the distribution of the commonly used nouns and adjectives in the book. The nodes in red are adjectives and the nodes in blue are nouns.
The following image gives more perspective as the size of the nodes depend on the in-degree and the out-degree. The nodes that is connected to many other nodes is indicated with a size bigger than the node that is less widely used. We can study these nodes further.
The figure below shows that the adjectives “little” and “old” are the two commonly used adjectives that were highlighted in the previous figure. Gephi seems to be a pretty powerful tool to visualize data.
Books about US Politics
This dataset consists of a network of books about US politics published around the time of the 2004 presidential election and sold by the online bookseller Amazon.com. Edges between books represent frequent co-purchasing of books by the same buyers. The dataset includes nodes and edges. The node attributes are as follows:
- Node - The name of the book
- Id - Id is a variable that uniquely identifies the nodes on the network
- Label - Label is the attribute which defines how the node is labelled on the network
- Value - Indicates the category of the book: liberal (l), neutral (n) or conservative (c)

The edge attributes are as follows:
- Source - Indicates the node that acts as the source for the edge
- Target - Indicates the node that acts as the target for the edge
- Type - Specifies the type of the edge: Undirected
- Id - Id is a variable that uniquely identifies the edge on the network
- Weight - The weight of every edge. In this dataset, every edge has an equal weight of 1
The figure below shows the division of the political books sold by Amazon in a particular period into three categories. The nodes in blue are conservative books, the nodes in green are liberal books, and the nodes in red are neutral. The network indicates that a buyer has a high tendency to buy books of the same kind, but there are a few inter-linking books that indicate a different trend.
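That tendency can be quantified as the fraction of co-purchase edges joining books of the same category. A minimal base-R sketch with made-up book IDs and labels (the real analysis would use the full node and edge tables):

```r
# Hypothetical categories: c = conservative, l = liberal, n = neutral
cat <- c(b1 = "c", b2 = "c", b3 = "l", b4 = "l", b5 = "n")
# Hypothetical co-purchase edges between books
edges <- rbind(c("b1", "b2"), c("b3", "b4"), c("b1", "b5"), c("b2", "b3"))
# Share of edges whose endpoints fall in the same category
same <- mean(cat[edges[, 1]] == cat[edges[, 2]])
same
```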
The following figure shows the books that buyers purchased along with “Surprise, Security, and the American Experience”. We see that buyers have an equal tendency to purchase a conservative or a liberal book along with this book.
Political Blogs
A directed network of hyperlinks between weblogs on US politics, recorded in 2005 by Adamic and Glance. The node attributes of the data are as follows:
- Nodes - The URL of the blog
- Id - Id is a variable that uniquely identifies the nodes on the network
- Label - Label is the attribute which defines how the node is labelled on the network
- Value - 0: left or liberal political parties, 1: right or conservative parties
- Source - Specifies the source of the blog. Example: Blogarama
The edge attributes are given below:
- Source - Indicates the node that acts as the source for the edge
- Target - Indicates the node that acts as the target for the edge
- Type - Specifies the type of the edge: Directed
- Id - Id is a variable that uniquely identifies the edge on the network
- Weight - The weight of every edge. In this dataset, every edge has an equal weight of 1
The political blogs in the dataset were written around the 2004 presidential election. The blogs are distinguished based on political leaning: left or liberal versus right or conservative. Analysis of the Gephi network indicates the type and source of the blogs. Hyperlinks from one political blog to another are also considered, and these are depicted by the edges in the graph. In the figure below, the left or liberal blogs are shown in red and the right or conservative blogs in blue. We can see that a person is more likely to read blogs of a similar kind rather than switching between liberal and conservative. The graph also shows certain blogs which link these two groups; such blogs are good candidates for further analysis.
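The same homophily can be summarised with a mixing matrix, counting links by the leaning of their source and target blog. A base-R toy sketch with hypothetical blog names (the real analysis would use the full edge list, and statnet's mixingmatrix() offers the same view):

```r
# 0 = left/liberal, 1 = right/conservative (hypothetical blogs)
leaning <- c(blogA = 0, blogB = 0, blogC = 1, blogD = 1)
# Directed hyperlinks: source blog -> target blog
links <- rbind(c("blogA", "blogB"), c("blogC", "blogD"), c("blogA", "blogC"))
# Cross-tabulate links by the leaning of their endpoints
mix <- table(from = leaning[links[, 1]], to = leaning[links[, 2]])
mix  # diagonal cells are within-leaning links; off-diagonal cells cross over
```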
In the following figure, the size of each node is proportional to its degree. The highlighted nodes are blogs that users reach from other political blogs more often than the rest.
Unsupervised Learning
Unsupervised learning plays a major role in the detection of communities in social networks. A commonly used measure is the modularity of a network, which compares the number of edges within a group to the number expected by chance. Maximizing modularity exactly is computationally very expensive because it would have to consider all possible groupings, so algorithms have been developed that find good solutions approximately. In Gephi, the Louvain method is used, and in R several other algorithms are available. An inherent difficulty with this type of analysis is that nodes in social networks can belong to multiple communities, while this type of clustering assigns each node to exactly one community. Using the linkcomm package in R, we can place nodes into multiple communities. The code and the analysis are shown below:
library(igraph)
library(linkcomm)
# Read the Facebook network exported from Gephi
Ambrose <- read.graph("C:/Users/Ambrose/Desktop/Courses/Bana 7047/Project/data/Facebook.gml", format = c("gml"))
# Link communities cluster the edges, so a node can belong to several communities
Ambrose.el <- get.edgelist(Ambrose)
Ambrose.lc <- getLinkCommunities(Ambrose.el, directed = FALSE, hcmethod = "single")
# Non-overlapping communities via fast greedy modularity optimization
fgreedy <- fastgreedy.community(Ambrose, merges = TRUE, modularity = TRUE)
dendPlot(fgreedy)
plot(Ambrose, vertex.color = membership(fgreedy), vertex.size = 2,
     vertex.label.cex = 0.2, vertex.label.color = membership(fgreedy))
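For reference, the modularity score that fastgreedy.community maximizes can be computed by hand. A base-R sketch on a toy graph of two triangles joined by a single edge, split into its two obvious communities:

```r
# Modularity: Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)
# Toy graph: triangles 1-2-3 and 4-5-6 joined by the bridge edge 3-4
el <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5), c(4, 6), c(5, 6))
A <- matrix(0, 6, 6)
A[el] <- 1; A[el[, 2:1]] <- 1          # symmetric adjacency matrix
k  <- rowSums(A)                        # node degrees
m  <- sum(A) / 2                        # number of edges
cm <- c(1, 1, 1, 2, 2, 2)               # community membership
delta <- outer(cm, cm, "==")            # 1 if i and j share a community
Q <- sum((A - outer(k, k) / (2 * m)) * delta) / (2 * m)
round(Q, 4)  # 0.3571: well above 0, so the split is a good community structure
```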
Random Graph Models
The statnet package in R was used to build ERGM models from the Facebook dataset. The models were created under several parameter constraints, leading to different models.
install.packages("statnet")
install.packages("igraph")
install.packages("intergraph")
library(statnet)
library(igraph)
library(intergraph)
#Reading the graph
fbAarti <- read.graph(file = "Aart_FB.gml", format = "gml")
#Converting the file to a network file
fb <- asNetwork(fbAarti)
plot(fb)
summary(fb)
#Mixing Matrices based on attribute names
mixingmatrix(fb, "locale")
mixingmatrix(fb, "agerank")
mixingmatrix(fb, "label")
mixingmatrix(fb, "sex")
#Network components
table(component.dist(fb)$csize)
network.size(fb)
#Erdos-Renyi model
fb.model.er <- ergm(fb~edges)
summary(fb.model.er)
fb.model.er$mle.lik
#Assortative mixing edge model
fb.model.assor.2 <- ergm(fb~edges + nodematch("locale") + nodematch("sex"))
summary(fb.model.assor.2)
fb.model.assor.2$mle.lik
#Simulate the network
sim.assor <- simulate(fb.model.assor.2, seed = 9, verbose = TRUE, control = control.simulate.ergm(MCMC.burnin = 1e+6))
mixingmatrix(sim.assor, "locale")
mixingmatrix(sim.assor, "sex")
#Degree distribution
summary(fb ~ degree(0:120))
summary(sim.assor ~ degree(0:120))
c(fb = summary(fb ~ triangle), sim.assor = summary(sim.assor ~ triangle))
fb.model.ed.tr <- ergm(fb ~ edges + triangle, verbose = TRUE, control = control.ergm(seed = 9, MCMLE.maxit = 4))
summary(fb.model.ed.tr)
fb.model.ed.tr$mle.lik
# GWESP model with assortative fields
fb.gwesp <- ergm(fb ~ edges + nodematch("sex") + nodematch("locale") + gwesp(0.25, fixed = TRUE),
                 verbose = TRUE,
                 control = control.ergm(seed = 9, MCMC.samplesize = 10000,
                                        MCMLE.maxit = 4, MCMLE.steplength = 0.25, parallel = 4))
# GWESP-style model with edge, triangle, assortative and degree(1) terms
fb.model.ed.tr.1 <- ergm(fb ~ edges + triangle + nodematch("sex") + nodematch("locale") + degree(1),
                         verbose = TRUE,
                         control = control.ergm(seed = 9, MCMLE.maxit = 4,
                                                MCMC.samplesize = 20000, MCMC.interval = 500, parallel = 4))
A comparison of these models was done to decide the best model. The edge-and-triangle based model has the highest likelihood and the lowest AIC and BIC values. The GWESP-based model with a degree-1 term has the next best values.
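The AIC and BIC of fitted ergm objects come straight from R's AIC() and BIC() generics. As a side illustration of what the edges-only (Erdos-Renyi) model estimates, it is equivalent to an intercept-only logistic regression on tie indicators, so its coefficient is the log-odds of the network density. The data below are simulated, not the real network:

```r
set.seed(9)
ties <- rbinom(200, 1, 0.1)         # toy tie indicators standing in for dyads
fit  <- glm(ties ~ 1, family = binomial)
c(AIC = AIC(fit), BIC = BIC(fit))   # lower values indicate the preferred model
# The inverse-logit of the intercept recovers the observed density
plogis(coef(fit)[[1]])              # equals mean(ties)
```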
mcmc.diagnostics(fb.model.ed.tr.1)
The MCMC diagnostics show how the error for each of the model terms is distributed. For the GWESP-based model with the degree-1 term the errors are approximately normally distributed, while for the edge-and-triangle model they are not. This means that the GWESP model captures the network better, and hence it is chosen.
gof.fb.model.ed.tr <- gof(fb.model.ed.tr)
plot(gof.fb.model.ed.tr)
gof.deg.fb.model.ed.tr <- gof(fb.model.ed.tr~degree)
plot(gof.deg.fb.model.ed.tr)
The goodness-of-fit diagnostics show how the observed network's statistics compare with the distribution of those statistics across simulated networks. In the figures below, the observed network's statistics are plotted in bold while the box plots of the simulated networks' distributions are shown in gray. The observed network lies within the simulated distribution for the most part.
Conclusions
Network analysis of various datasets was performed: a Facebook network, words in a book, political blogs, and books on politics. Exploratory analysis was done first to explore the graphs, and then community detection algorithms were used to find groups within the networks. Next, link community analysis was used to detect the multiple communities that certain nodes can belong to. Finally, random graph models were fitted to the Facebook network. In particular, a GWESP-based ERGM with a degree-1 term is the best model for the Facebook data, as its noise distribution is normal.