In Search of Influential Academic Publications

Dated Jun 28, 2019; last modified on Sat, 25 Apr 2020

I recently downloaded citation data from Aminer. I seek interesting papers to read during my free time [note]. The database has the following structure:

| Table | Columns | Remarks |
|---|---|---|
| papers | paper_id, title, venue, year, number of citations, abstract | 3,079,007 papers from 1936 to 2018 |
| authors | name, appearance order, paper_id | 1,766,547 distinct names (out of 9,476,165 names) |
| paper_references | paper_id, cited_paper_id | 25,166,994 references |

Approach 1: Citation Count

The simplest measure of influence is the raw citation count, which a SQL query can answer:

SELECT papers.* FROM papers ORDER BY papers.n_citation DESC LIMIT 10;

| # Citations | Title | Year |
|---|---|---|
| 73,362 | Genetic Algorithms in Search, Optimization and Machine Learning | 1989 |
| 42,508 | Distinctive Image Features from Scale-Invariant Keypoints | 2004 |
| 34,288 | Bowling Alone: The Collapse and Revival of American Community | 2000 |
| 33,016 | LIBSVM: A library for support vector machines | 2011 |
| 29,285 | Reinforcement Learning: An Introduction | 1999 |
| 28,679 | Random Forests | 2001 |
| 27,068 | Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology | 1989 |
| 26,357 | Ad-hoc On-demand Distance Vector Routing | 1999 |
| 26,114 | Support-Vector Networks | 1995 |
| 25,835 | Fuzzy Identification of Systems and Its Applications to Modeling and Control | 1985 |

Huh, Aminer's dataset is incomplete. I expected to find "Protein Measurement With the Folin Phenol Reagent", which, according to Nature, had garnered 305,000 citations by 2014. Nonetheless, this blog post must go on.

Approach 2: PageRank

"The Anatomy of a Large-Scale Hypertextual Web Search Engine" presents Google's PageRank algorithm, which counts the number and quality of links to a page to determine a rough estimate of how important the website is. Unsurprisingly, I'm not the first to apply it to citation graphs; see, for example, "The Pagerank-Index: Going beyond Citation Counts in Quantifying Scientific Impact of Researchers". However, I'm interested in finding interesting papers, not scholars.

NetworkX provides a good implementation of PageRank in Python, so I used that. To save on computational resources, I only considered publications that either had \(\ge 100\) citations or were cited by papers that had \(\ge 100\) citations. Here are the results:

| # Citations | Title | Year |
|---|---|---|
| 73,362 | Genetic Algorithms in Search, Optimization and Machine Learning | 1989 |
| 13,227 | The Design and Analysis of Computer Algorithms | 1974 |
| 6,589 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference | 1988 |
| 11,580 | Reducibility among Combinatorial Problems | 2010 |
| 21,256 | Snakes: Active Contour Models | 1988 |
| 17,064 | New Directions in Cryptography | 1976 |
| 6,906 | C4.5: Programs for Machine Learning | 1993 |
| 42,508 | Distinctive Image Features from Scale-Invariant Keypoints | 2004 |
| 18,861 | A Method for Obtaining Digital Signatures and Public-Key Cryptosystems | 1978 |
| 20,878 | Introduction to Modern Information Retrieval | 1986 |
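For reference, the pre-filtering step behind this ranking might look roughly like the sketch below. It is not the code I ran: the function names, the in-memory data layout and the toy data are assumptions made for illustration, with `n_citations` standing in for the citation counts in `papers` and `references` for the rows of `paper_references`.

```python
def select_papers(n_citations, references, threshold=100):
    """Return ids of papers that are highly cited, or cited by a highly cited paper.

    n_citations: dict mapping paper_id -> citation count
    references:  list of (paper_id, cited_paper_id) pairs
    """
    highly_cited = {p for p, n in n_citations.items() if n >= threshold}
    cited_by_highly_cited = {
        cited for citing, cited in references if citing in highly_cited
    }
    return highly_cited | cited_by_highly_cited


def induced_edges(references, keep):
    """Keep only the citation edges whose two endpoints both survived the filter."""
    return [(a, b) for a, b in references if a in keep and b in keep]


# Toy data (hypothetical): five papers and five citation edges.
n_citations = {"A": 250, "B": 120, "C": 40, "D": 5, "E": 99}
references = [("A", "C"), ("B", "A"), ("D", "C"), ("E", "D"), ("C", "E")]

keep = select_papers(n_citations, references)
edges = induced_edges(references, keep)
print(sorted(keep), edges)
```

On the toy data, C survives despite having only 40 citations because the highly cited A cites it, while D and E drop out together with every edge that touches them.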

I think the ranking is worthwhile. I recognize some influential publications on the list, e.g. RSA encryption, Diffie-Hellman key exchange and causal probability.

That said, this isn't a definitive ranking because there are different variations of PageRank. For instance, there is a damping factor, \(d\): the probability that, at each paper, the reader follows one of its citations. With probability \(1 - d\), the reader gets bored and picks a paper at random instead. I used \(d = 0.85\), the value suggested in the original PageRank paper. Given the assumptions I've made, providing only the top 10 results does the ranking no justice. I have thus made available the top 3,000 publications out of the 3,079,007. I bet most of the gems lie somewhere in there.
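To make the role of the damping factor concrete, here is a minimal from-scratch power-iteration sketch of PageRank in plain Python. It is an illustration under stated assumptions, not NetworkX's implementation and not the code I ran: the function name, the tolerance and the toy graph are made up, and papers with no outgoing references are assumed to spread their rank uniformly over all papers. In NetworkX itself, the damping factor is the `alpha` argument of `networkx.pagerank`.

```python
def pagerank(edges, d=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank over (citing, cited) pairs.

    d is the probability of following a citation; with probability 1 - d
    the reader jumps to a paper chosen uniformly at random.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Assumption: rank held by papers with no references is shared by all papers.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - d) / n + d * dangling / n for node in nodes}
        for node in nodes:
            if out_links[node]:
                share = d * rank[node] / len(out_links[node])
                for cited in out_links[node]:
                    new_rank[cited] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy citation graph (hypothetical): B, C and D cite A; D also cites B; A cites E.
edges = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "B"), ("A", "E")]
ranks = pagerank(edges)
top = sorted(ranks, key=ranks.get, reverse=True)
print(top)
```

In this toy graph, E is cited only once yet ends up ranked above A, because its single citation comes from the most-cited paper. That is the "quality of links" effect that a plain citation count misses.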

Notes