Big Scholarly Data
Yogendra Singh
University Librarian
Swami Rama Himalayan University
Email - [email protected]
After this talk
• You should have a basic knowledge of Social Network Analysis
• Who matters among this crowd? There could be different answers depending
upon different points of view
• The analysis of this type of social network using graph theory is called Social
Network Analysis
• Since scholarly networks are networks of people (co-authors), Social Network
Analysis can readily be applied to large scholarly data, or Big Scholarly Data
Big Scholarly Data (BSD)
• BSD refers to the millions of scholarly records available today due to
tremendous changes in the scholarly communication cycle
• E-books, articles, reports, standards, patents etc., published by major commercial and
not-for-profit organizations: sciencedirect.com, tandfonline.com, doaj.org etc.
• Abstracting and indexing databases: Scopus, Web of Science, EBSCO, Google Scholar
• Academic social networks: Academia, ResearchGate, Mendeley etc.
• Many other types of scholarly data
Three major scholarly data providers
Sl. No. | Brand name | Publisher | Coverage | No. of Records
1. | Google Scholar | Google | Full Universe of Knowledge / All Formats | 350+ million
•Statistical analysis
Suitable for smaller datasets
• Clustering coefficient
• Centralities
• Average Path Length
Centrality may mean different things to different people and in different contexts
Why are Centrality and Centralization Important?
• Access to information and ideas
• Visibility
• Closeness
• Betweenness
• Eigenvector
Calculating Centrality
• Degree – Proportional to the number of other nodes to which a node is linked:
the number of links divided by (n-1), where n is the number of nodes in the graph.
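The degree definition above can be illustrated with a minimal sketch. It assumes an undirected, unweighted co-authorship network given as a list of pairs. The function name `degree_centrality` and the four-author example are hypothetical and not taken from any particular tool.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {node: number of links / (n - 1)} for an undirected graph."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Hypothetical co-authorship links: A wrote with B, C and D; B also wrote with C.
coauthor_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
scores = degree_centrality(coauthor_edges)
```

Here author A is linked to all three other authors and so gets the maximum score of 1, while D, with a single co-author, gets 1/3.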
Calculating Centrality
• Closeness – Based on the sum of geodesic distances (shortest paths) from a node to all other
nodes in the graph. Divide the sum by (n-1), then invert.
• Betweenness – The extent to which a particular point lies ‘between’ other points in the graph;
how many shortest paths (geodesics) is it on? A measure of brokerage or gatekeeping.
• Eigenvector – A weighted measure of centrality that takes into account the centrality of other
nodes to which a node is connected. That is, being connected to other central nodes increases
centrality. E.g., the secretary of a powerful person. Google’s PageRank algorithm is based on a
variation of this approach.
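The three measures above can be sketched in plain Python as follows. This is an illustrative sketch, not how Gephi or any other tool implements them. It assumes a small, connected, undirected, unweighted graph. Betweenness uses Brandes' accumulation method and is left unnormalized. Eigenvector centrality uses power iteration on A + I, a common trick to avoid oscillation. All function names and the five-node example graph are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def closeness(adj):
    """(n - 1) / sum of shortest-path distances; assumes a connected graph."""
    n = len(adj)
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search gives geodesic distances
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (n - 1) / total if total else 0.0
    return result


def betweenness(adj):
    """Number of shortest paths passing through each node (unnormalized)."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, dist = [], {s: 0}
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        sigma[s] = 1
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):  # accumulate dependencies back towards s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # every undirected pair was counted from both ends
    return {v: b / 2 for v, b in score.items()}


def eigenvector(adj, iterations=100):
    """Power iteration on (A + I); the vector is scaled to unit length."""
    x = dict.fromkeys(adj, 1.0)
    for _ in range(iterations):
        nxt = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        norm = sum(val * val for val in nxt.values()) ** 0.5
        x = {v: val / norm for v, val in nxt.items()}
    return x


# Hypothetical network: a triangle A-B-C with a chain C-D-E attached.
adj = build_adjacency([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])
close = closeness(adj)
between = betweenness(adj)
eig = eigenvector(adj)
```

In this example C scores highest on all three measures. D ranks second on betweenness because every shortest path to E passes through it, which matches the brokerage or gatekeeping reading of that measure.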
Network Analysis Tools Applied to BSD
Software / Access | Platform / Language | Description
CiteSpace / Free | Windows, IOS / Java | Visualizing and analyzing trends and patterns in scientific literature; knowledge domain visualization, best for WoS datasets
• Co-Word Network
• Co-Citation Network
BSD Analysis Applications
•Scientific Impact Evaluation
Article Impact
Author Impact
Journal Impact
Institutional Impact
BSD Analysis Application - Academic Recommendations
• Literature Recommendations
• Expert Recommendations
• Collaboration Recommendations
• Priority Recommendations
Scholarly Data Analysis: Steps
•Data Collection
Download desired dataset from appropriate source
•Data cleaning
The most difficult task, as the same name, institute or department is often written in
different ways, even by the same individual
• Create a graph file using an online scientific network creation tool
such as Table2Net or Scopus2Net
• Analyse that Graph file in Network Analysis software such as Gephi. You
can calculate all SNA measures using Gephi
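The cleaning and graph-file steps above can be sketched in a few lines of Python. This is only a stand-in for what tools such as Table2Net do, not their actual method. The name-normalization rule is deliberately crude, the records are invented, and the `Source`/`Target`/`Weight` header is assumed to match the edge-table layout accepted by Gephi's spreadsheet import. Check that against your Gephi version.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def normalise_author(name):
    """Crude cleaning: 'Singh, Y.' and 'singh  y' both become 'singh y'."""
    cleaned = name.replace(",", " ").replace(".", " ")
    return " ".join(cleaned.lower().split())


def coauthor_edges(records):
    """Count how often each pair of cleaned author names shares a paper."""
    weights = Counter()
    for authors in records:
        unique = sorted({normalise_author(a) for a in authors})
        for pair in combinations(unique, 2):
            weights[pair] += 1
    return weights


def to_edge_csv(weights):
    """Serialise the weighted pairs as a CSV edge table (assumed layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), w in sorted(weights.items()):
        writer.writerow([a, b, w])
    return buffer.getvalue()


# Hypothetical records: one author list per downloaded publication.
records = [
    ["Singh, Y.", "Kumar, A."],
    ["singh y", "Kumar A", "Rao, P."],
]
edges = coauthor_edges(records)
edge_csv = to_edge_csv(edges)
```

Real datasets need much more careful disambiguation (initials versus full names, institutional variants), which is why cleaning is the hardest step. Writing `edge_csv` to a file gives an edge list that can then be opened in network analysis software.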
Conclusion
• Applying Social Network Analysis tools to Big Scholarly Data is going to be a big area of
interest to scientometricians, as very large volumes of BSD are generated daily.
• These measures can be used to evaluate authors, institutions, subject areas or
countries objectively.
• Special areas of interest and possible collaboration opportunities can be easily identified.
• As the impact of publications can be easily identified, it will have a great impact on
policy making.
• Librarians can also use SNA to analyze in-house data such as circulation,
reference and even footfall data.
THANK YOU