0% found this document useful (0 votes)
13 views

Unit Iv

Uiokmop

Uploaded by

Durga vinay
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views

Unit Iv

Uiokmop

Uploaded by

Durga vinay
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 51

Applied Data

Science

UNIT-IV
Social-Network
A social network is an online communication platform that is used for creating
relationships with other people who share an interest or a relationship.

Examples: LinkedIn, Facebook, Twitter, Google+, Instagram, Quora etc.


Characteristics of a social network:
1. Collection of entities that participate in the network.
2. Relationship between entities in the network.
3. Locality: If A is related to both B and C, then there is a higher probability that B
and C are related.
Social Networks as Graphs
● Social networks are modeled as graphs and called as a social graph.
● Entities are the nodes. An Edge connects two nodes if they are related by a
relationship.

Example: Entities are the nodes from A through G.


The relationship is “friends” and is represented by the
edges. For example, D is friend with B, E, F and G.
Varieties of Social Networks
1. Telephone Networks
● Nodes represent phone numbers.
● There is an edge between two nodes if a call has been placed between those
phones in some fixed period of time.
● The edges could be weighted by the number of calls made between these
phones during the period.
● Communities in a telephone network will form groups of people that
communicate frequently: groups of friends, members of a club, or people
working at the same company.
Varieties of Social Networks
2. Email Networks

● The nodes represent email addresses.


● An edge represents the fact that there was at least one email in at least one
direction between the two addresses.
● Strong edges represent communication in both directions, while weak edges
indicate that the communication was in one direction only.
Varieties of Social Networks
3. Collaboration Networks

● Nodes represent individuals who have published research papers.


● There is an edge between two individuals who published one or more papers
jointly.
● The communities in this network are authors working in a particular area.
Graphs with several node types
● Entities can be of different type in a social network.
● Example: In a collaboration network, nodes can be authors and papers. This is
considered as a authorship network.
● Example: Users at a site can place tags on web pages. Here, the entities are
users, tags, and pages.
● Users are connected if they tag the same pages, tags are connected if they
appear on the same pages, and pages are connected if they had many tags in
common.
● The natural way to represent such kinds of graphs is a k-partite graph, k>1.
● K-partite graph consists of k sets of nodes, with no edges between nodes of the
same set.
Graphs with several node types
Tripartite graph representing users, tags and web pages

There are three sets of nodes.


Users {U1, U2}

Tags {T1, T2, T3, T4}

Web pages {W1, W2, W3}

An edge (U1, T2) means that user 1 had placed the tag 2 on at
least one page.
Social Network Analysis
● Social network analysis is the study of social entities (people in an organization,
called actors), and their interactions and relationships. The interactions and
relationships can be represented with a network or graph, where each vertex
(or node) represents an actor and each link represents a relationship.
● Two types of Social Network Analysis
1. Centrality
2. Prestige
● Both centrality and prestige are measures of degree of prominence of an actor
in a social network.
Centrality
•In the context of an organization, a person with extensive contacts (links) or communications with
many other people in the organization is considered more important than a person with relatively
fewer contacts. The links can also be called ties.
•A central actor is one involved in many ties.
•Each node in the social network is an actor and each link indicates that the actors on the two ends
of the link communicate with each other.
•In the below network, actor i is the most central actor because he/she can communicate with most
other actors.
Types of Centrality
•There are different types of links or involvements between actors. Thus, several
types of centrality are defined on undirected and directed graphs.
•Three popular types:
1.Degree Centrality
2.Closeness Centrality
3.Betweenness Centrality
Degree Centrality
● Central actors are the most active actors that have most links or ties with other actors.
● Let the total number of actors in the network be n.
● Undirected Graph: In an undirected graph, the degree centrality of an actor i (denoted by
CD(i)) is simply the node degree (the number of edges) of the actor node, denoted by d(i),
normalized with the maximum degree, n-1.

● The value of this measure ranges between 0 and 1 as n-1 is the maximum value of d(i).
● Directed Graph: In this case, we need to distinguish in-links of actor i (links pointing to i),
and out-links (links pointing out from i). The degree centrality is defined based on only the
out-degree (the number of out-links or edges), do(i).
Closeness Centrality
● This view of centrality is based on the closeness or distance.
● The basic idea is that an actor x is central if it can easily interact with all other actors. That
is, its distance to all other actors is short.
● Use the shortest distance to compute this measure. Let the shortest distance from actor i
to actor j be d(i, j) (measured as the number of links in a shortest path).
● Undirected Graph:The closeness centrality CC(i) of actor i is defined as

● The value of this measure also ranges between 0 and 1 as n-1 is the minimum value of the
denominator, which is the sum of the shortest distances from i to all other actors.
● Directed Graph: The same equation can be used for a directed graph. The distance
computation needs to consider directions of links or edges
Betweenness Centrality
● Betweenness centrality quantifies the number of times a node acts as a bridge along the
shortest path between two other nodes.
● It was introduced as a measure for quantifying the control of a human on the communication
between other humans in a social network.
● If two non-adjacent actors j and k want to interact and actor i is on the path between j and k,
then i may have some control over their interactions.
● Betweenness measures the control of i over other pairs of actors. Thus, if i is on the paths of
many such interactions, then i is an important actor.
● Undirected Graph: Betweenness is given by the formula

where pjk be the number of shortest paths between actors j and k, and pjk(i), (j # i and k # i) is the
number of shortest paths that pass i (denoted by pjk(i), (j # i and k # i).
Betweenness Centrality
● There may be multiple shortest paths between actor j and actor k. Some pass i
and some do not. CB(i) has a minimum of 0, attained when i falls on no shortest
path. Its maximum is (n-1)(n-2)/2, which is the number of pairs of actors not
including i.
● If we are to ensure that the value range is between 0 and 1, we can normalize it
with (n-1)(n-2)/2, which is the maximum value of CB(i).
● The betweenness may be normalized by dividing through the number of pairs of
vertices not including v, which for directed graphs is (n − 1)(n − 2) and for
undirected graphs is (n − 1)(n − 2)/2.
Example
Example
Degree centrality of a node refers to the
number of edges attached to the node.
In order to know the standardized
score, you need to divide each score by
n-1 (n = the number of nodes).
Example
The closeness centrality C (i) of actor i is defined as
C

The closeness centrality of node 1 is 6/ (sum of shortest distance from node 1 to all the remaining
nodes)
Shortest distance from node 1 to 2 is 2
Shortest distance from node 1 to 3 is 1
Shortest distance from node 1 to 4 is 2
Shortest distance from node 1 to 5 is 3
Shortest distance from node 1 to 6 is 4
Shortest distance from node 1 to 7 is 4
Hence, closeness centrality of node 1 is 6/(2+1+2+3+4+4)=6/16=3/8
Example
Closeness centrality of all the other nodes
Example
● Betweenness centrality of node 3 is
● (1/1)+(1/1)+(1/1)+(1/1)+(1/1)+(1/1)+(1/1)+(1/1)+(1/1)+(0/1)+(0/1)+(0/1)+(0/1)+(0/1
)+(0/1)= 9
● Normalized score=9/ ((n-1)(n-2)/2) = 9/ ((6)(5)/2) = 9/15.
● For every pair of vertices (x,y), not including 3, find the number of shortest paths
between (x,y) that pass through node 3 and divide it by the total number of shortest
paths between (x,y).
● For instance, consider vertices (1,2). The number of shortest paths between (1,2) that
pass through 3 is 1 and the total number of shortest paths between (1,2) is also 1.
Hence 1/1 is the value.
● Repeat the same procedure for the remaining vertices (1,5), (1,6), (1,7), (2,4), (2,5),
(2,6), (2,7), (4,5), (4,6), (4,7), (5,6), (5,7) and (6,7). Find the sum of all the values which is
9. Normalize by dividing 9 with 15. This gives the betweenness of node 3.
Clustering of Social-Network Graphs
● A social network contains communities of entities that are connected by the
edges.
● These correspond to groups of friends at school or groups of researchers
interested in the same area.
● Inorder to identify the communities, clustering of graphs is to be considered.

Distance measures for Social-Network Graphs:

● Assume that the nodes are close if they have an edge between them and distant
if not.
● d(x,y)=0 if there is an edge (x,y), otherwise d(x,y)=1.
Clustering of Social-Network Graphs
Betweenness:

Betweenness of an edge (a, b) is defined as the number of pairs of nodes x and y such
that the edge (a, b) lies on the shortest path between x and y.

The Girvan-Newman Algorithm:

Computes the number of shortest paths from node X to each of the other nodes that
go through each of the edges.

Step-1: Perform a breadth-first search (BFS) of the graph, starting at the node X.
Clustering of Social-Network Graphs

● The level of each node in the BFS presentation


is the length of the shortest path from X to
that node.
● Edges that go between nodes at the same level
can never be part of a shortest path from X.
● Edges between levels are called DAG edges
(Directed Acyclic Graph).
● If there is a DAG edge (Y, Z), where Y is at the
level above Z, then we shall call Y a parent of Z
and Z a child of Y.
Clustering of Social-Network Graphs
Step-2: Label each node by the number of shortest
paths that reach it from the root.
● Start by labeling the root 1. Then, from the top
down, label each node Y by the sum of the labels of
its parents.
● First, label the root E with 1. At level 1 are the
nodes D and F . Each has only E as a parent, so they
too are labeled 1.
● Nodes B and G are at level 2. B has only D as a
parent, so B’s label is the same as the label of D,
which is 1. However, G has parents D and F, so its
label is the sum of their labels, or 2.
● Finally, at level 3, A and C each have only parent B,
so their labels are the label of B, which is 1.
Clustering of Social-Network Graphs
Step-3: Calculate for each edge e the sum over all nodes Y of the fraction of
shortest paths from the root X to Y that go through e.

This calculation is done for both nodes and edges from bottom.

The rules for calculation are as follows:


1. Each leaf in the DAG gets a credit of 1.
2. Each node that is not a leaf gets a credit equal to 1 plus the sum of the credits
of the DAG edges from that node to the level below.
3. A DAG edge e entering node Z from the level above is given a share of the
credit of Z proportional to the fraction of shortest paths from the root to Z
that go through e.
Clustering of Social-Network Graphs

● First, A and C, being leaves, get credit 1. Each of these nodes have only
one parent, so their credit is given to the edges (B, A) and (B, C),
respectively.
● At level 2, G is a leaf, so it gets credit 1. B is not a leaf, so it gets credit
equal to 1 plus the credits on the DAG edges entering it from below.
Since both these edges have credit 1, the credit of B is 3.
● B has only one parent, D, so the edge (D, B) gets the entire credit of B,
which is 3.
● G has two parents, D and F . We therefore need to divide the credit of 1
that G has between the edges (D, G) and (F, G).
● After Step-2, D and F have label 1, representing the fact that there is
one shortest path from E to each of these nodes.
● Give half the credit of G to each of these edges; i.e., their credit is each
1/(1 + 1) = 0.5.
● If D and F had the labels 5 and 3, then the credit of edge (D, G) would
have been 5/8 and the credit of edge (F, G) would have been 3/8.
Clustering of Social-Network Graphs
● D gets 1 plus the credits of the edges entering it from below, which are 3 and
0.5. That is, the credit of D is 4.5. The credit of F is 1 plus the credit of the edge
(F, G) i.e. 1.5.
● Finally, the edges (E, D) and (E, F ) receive the credit of D and F , respectively,
since each of these nodes has only one parent.
● To complete the betweenness calculation, we have to repeat this calculation
for every node as the root and sum the contributions.
● Finally, we must divide by 2 to get the true betweenness, since every shortest
path will be discovered twice, once for each of its endpoints.
Using Betweenness to find Communities
● Start with the graph and all its edges; then remove edges with the highest betweenness until the graph
has broken into a suitable number of connected components.
● Betweenness scores of each edge is shown in the below graph.
● Edge (B, D) has the highest betweenness, so it is removed first.
● That leaves two communities, namely: {A, B, C} and {D, E, F, G}.
● Next to leave are (A, B) and (B, C) with a score of 5, followed by (D, E) and (D, G) with a score of 4.5.
Then, (D, F ), whose score is 4, would leave the graph.
Direct discovery of communities
In Graph clustering, it is not possible to place an individual in more than one community.

Cliques:

A Clique is a set of nodes in a graph such that every two nodes in Clique are adjacent.

Complete Bipartite Graphs:

A complete bipartite graph consists of s nodes on one side and t nodes on the other side,
with all st possible edges between the nodes of one side and the other present. We denote
this graph by Ks,t.
Direct discovery of communities
Finding Complete Bipartite Subgraphs
● Given a large bipartite graph G we want to find instances of Ks,t within it.
● It is possible to view the problem of finding instances of Ks,t within G as one of finding
frequent itemsets.
● Let the “items” be the nodes on one side of G, which we shall call the left side.
● The “baskets” correspond to the nodes on the other side of G (the right side).
● Let the support threshold be s, the number of nodes that the instance of Ks,t has on
the right side.
● State the problem of finding instances of Ks,t as that of finding frequent itemsets F of
size t.
● If a set of t nodes on the left side is frequent, then they all occur together in at least s
baskets.
Direct discovery of communities
● Each basket corresponds to a node that is connected to all t of the nodes in F . Thus,
the frequent itemset of size t and s of the baskets in which all those items appear
form an instance of Ks,t.

Example: The left side is the nodes {1, 2, 3, 4} and the


right side is {a, b, c, d}.
The latter are the baskets, so basket a consists of
“items” 1 and 4; that is, a = {1, 4}. Similarly, b = {2, 3}, c =
{1} and d = {3}.
If s = 2 and t = 1, we must find itemsets of size 1 that
appear in at least two baskets. {1} is one such itemset,
and {3} is another.
Partitioning of Graphs
What makes a good partition?

Partition a graph to minimize the number of edges that connect different


components.

A cut is a partition of the vertices of a graph into two disjoint subsets.

In the below graph, removing the edge


BD, partitions the graph into two
subgraphs. {A,B,C} is one and {D,E,F,G}
is another. Hence, {BD} is called the cut
and its size is 1.
Partitioning of Graphs
The smallest cut might not be the best cut.
● A new node H is added to the
graph along with two edges, (H,C)
and (C,G).
● To minimize the size of the cut, the
best choice would be to put H in
one set and all the other nodes in
the other set.
● This cut is not desirable because
the two components are too
unequal in size.
● The best cut consists of edges (B,
D) and (C, G), which partitions the
graph into two equal-sized sets {A,
B, C, H} and {D, E, F, G}.
Partitioning of Graphs
Normalized cuts:

● A proper definition of a “good” cut must balance the size of the cut itself against
the difference in the sizes of the sets that the cut creates.
● Suppose we partition the nodes of a graph into two disjoint sets S and T .
● Let Cut(S, T ) be the number of edges that connect a node in S to a node in T .
● Volume of a set S of nodes, denoted Vol (S), is the number of edges with at least
one end in S.
● Then the normalized cut value for S and T is
Partitioning of Graphs
If we choose S = {H} and T = {A, B, C, D, E,
F, G},
then Cut(S, T ) = 1. Vol(S) = 1, and
Vol(T)=11.
Thus, the normalized cut for this
partition is 1/1 + 1/11 = 1.09.

Consider the preferred cut for this


graph consisting of the edges (B, D) and
(C, G).
Then S = {A, B, C, H} and T = {D, E, F, G}.
Cut(S, T ) = 2,
Vol (S) = 6, and Vol(T ) = 7.
The normalized cut for this partition is
thus only 2/6 + 2/7 = 0.62.
Matrices that describe graphs
Three matrices that describe aspects of graph:

1. Adjacency matrix
2. Degree matrix
3. Laplacian matrix

1. Adjacency matrix:

It has 1 in the row i and column j if there is an edge between nodes i and j and 0
otherwise.
Matrices that describe graphs

Adjacency matrix for the given graph


Matrices that describe graphs

2. Degree matrix

● The entry for row and column i is the degree of the ith node.
● This graph has nonzero entries only on the diagonal.
● For instance, the entry in row 4 and column 4 is 4 because node D has edges to
four other nodes
Matrices that describe graphs
3. Laplacian matrix

● The difference between the degree matrix and the adjacency matrix is called
Laplacian matrix.
● Suppose our graph has adjacency matrix A and degree matrix D. Our third matrix,
called the Laplacian matrix, is L = D − A.
● Sum of each row and each column results in 0.
● It is a symmetric matrix.
Eigenvalues of the Laplacian matrix
Example:
Partition the graph using eigenvalues of the laplacian matrix.
The Laplacian matrix for the given graph is

The eigenvalues for the Laplacian


matrix are {0,1,2,3,4,5}.
The corresponding eigenvectors are

The second eigenvector has three positive and three


negative components. Hence one partition should be
the nodes {1,2,3} and the other partition should be the
nodes {4,5,6}.
Eigenvalues of the Laplacian matrix
● The smallest eigenvalue for every Laplacian matrix is 0, and its corresponding
eigenvector is [1, 1, . . . , 1].

Proof:

● Let L be the Laplacian matrix for a graph of n nodes, and let 1 be the column vector of
all 1’s with length n.
● Consider row i of L. Its diagonal element has the degree d of node i. Row i also will
have d occurrences of −1, and all other elements of row i are 0.
● Multiplying row i by column vector 1 has the effect of summing the row, and this sum
is clearly d + (−1)d = 0.
● Thus, we can conclude that 0 is the eigenvalue and 1 is its corresponding
eigenvector.
Eigenvalues of the Laplacian matrix
● Let G be a connected graph on vertex set N={1,2,...,n} with adjacency matrix A.
Let L=D-A be the Laplacian matrix of G. Let v2 be an eigenvector corresponding
to the second smallest eigenvalue of L. The graph partition is

C1={i Ɛ N : v2(i)<0}

C2={i Ɛ N : v2(i)>0}

● Vertices for which v2(i) is 0 can be placed in either cluster.


Neighborhood properties of graphs
Directed graphs and Neighborhoods:
● A directed graph has a set of nodes and a set of arcs/edges.
● An arc is a pair of nodes written u → v, where u is the source and v is the target
of the arc.
● Example-1: The web is a directed graph, where the arc u->v is a link from page u
to page v.
● Example-2: The telephone subscriber u has called subscriber v in the past
month.
● Example-3: The individual u following individual v in the twitter.
● Example-4: The research paper u references paper v.
Directed graphs and Neighborhoods
● All undirected graphs can be represented as directed graphs.
● Instead of the undirected edge (u, v), use two arcs u → v and v → u.
● A path in a directed graph is a sequence of nodes v0, v1, . . . , vk such that there are arcs
vi → vi+1 for all i = 0, 1, . . . , k − 1.
● The length of this path is k, the number of arcs along the path.
● The neighborhood of radius d for a node v is the set of nodes u for which there is a
path of length at most d from v to u. We denote this neighborhood by N (v, d).
● N (v, 0) is always {v}, and N (v, 1) is v plus the set of nodes to which there is an arc from
v.
● The neighborhood profile of a node v is the sequence of sizes of its neighborhoods

|N (v, 1)|, |N (v, 2)|, . . . .


Directed graphs and Neighborhoods
● To turn it into a directed graph, think of each edge
as a pair of arcs, one in each direction.
● For instance, the edge (A, B) becomes the arcs A →
B and B → A.
● First, consider the neighborhoods of node A. We
know N (A, 0) = {A}.
● Moreover, N (A, 1) = {A, B, C}, since there are arcs
from A only to B and C.
● Furthermore, N (A, 2) = {A, B, C, D} and N (A, 3) =
{A, B, C, D, E, F, G}.
● Consider node B, N (B, 0) = {B}, N (B, 1) = {A, B, C,
D}, and N (B, 2) = {A, B, C, D, E, F, G}.
● B is more central to the network than A, and this
fact is reflected by the neighborhood profiles of
the two nodes.
● Node A has profile 3, 4, 7, . . ., while B has profile 4,
7, . . . . Hence B is more central than A.
The diameter of a graph
● The diameter of a directed graph is the smallest integer d such that for every two nodes u
and v there is a path of length d or less from u to v.
● The diameter of a graph is the length of the shortest path between the most distanced
nodes.
● We can compute the diameter of a graph by computing the sizes of its neighborhoods of
increasing radius, until at some radius we fail to add any more nodes.
● For each node v, find the smallest d such that |N (v, d)| = |N (v, d + 1)|.
● This d is the tight upper bound on the length of the shortest path from v to any node it can
reach. Call it d(v).
● Diameter of the graph is maxv(d(v)).
● For example, in the running example, N(A,3)={A,B,C,D,E,F,G}, N(A,4)={A,B,C,D,E,F,G},
● I.e. |N(A,3)|=|N(A,4)|=7. Hence, d(A)=3.
● Similarly, find d(B), d(C), d(D), d(E), d(F), and d(G). Maximum among these values is the
diameter of the graph.
Exercise

Find the diameter of the graph.


Transitive closure and reachability
● The transitive closure of a graph is the set of pairs of nodes (u, v) such that there is a
path from u to v of length zero or more and denoted as Path(u,v).
● Node u reaches node v if there is Path(u, v).
● The problem of computing the transitive closure is to find all pairs of nodes u and v
in a graph for which Path(u, v) is true.
● The reachability problem is, given a node u in the graph, find all v such that Path(u,
v) is true.
● Path(u, v) is true if and only if v is in N (u, ∞), it is defined as to be ⋃i≥0 N (u, i).
● The reachability problem is to compute the union of all the neighborhoods of any
radius for a given node u.
● Given a directed graph, find out if a vertex v is reachable from another vertex u for
all vertex pairs (u, v) in the given graph. Here reachable mean that there is a path
from vertex u to v. The reach-ability matrix is called transitive closure of a graph.
Exercise

Given adjacency list of a graph, find its transitive


closure
Exercise

Find the transitive closure of the graph


Exercise

1. If the graph is represented as a


directed graph, how many arcs are
there?
2. What are the neighborhood
profiles for nodes A and B?
3. What is the diameter of the graph?
4. How many pairs are in the
transitive closure?

You might also like