In Search of Influential Academic Publications

Dated Jun 28, 2019; last modified on Sat, 25 Apr 2020

I recently downloaded citation data from Aminer. I'm looking for interesting papers to read during my free time. The database has the following structure:

| Table | Columns | Remarks |
| --- | --- | --- |
| papers | paper_id, title, venue, year, number of citations, abstract | 3,079,007 papers from 1936 to 2018 |
| authors | name, appearance order, paper_id | 1,766,547 distinct names (out of 9,476,165 names) |
| paper_references | paper_id, cited_paper_id | 25,166,994 references |

Approach 1: Citation Count

The most direct measure of influence is the raw citation count, which takes a single SQL query:

SELECT papers.* FROM papers ORDER BY papers.n_citation DESC LIMIT 10;
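A minimal sketch of running this kind of query from Python, assuming the data sits in a SQLite database (the post does not say which engine holds it). The column names other than `n_citation`, the in-memory database, the placeholder ids, and the helper `top_cited` are illustrative assumptions; the three sample rows reuse titles, years and citation counts from this post's results.

```python
import sqlite3


def top_cited(conn, limit=10):
    """Return (title, year, n_citation) for the most cited papers."""
    query = (
        "SELECT title, year, n_citation FROM papers "
        "ORDER BY n_citation DESC LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()


# Assumed schema: only n_citation is confirmed by the query in the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE papers ("
    "paper_id TEXT PRIMARY KEY, title TEXT, venue TEXT, "
    "year INTEGER, n_citation INTEGER, abstract TEXT)"
)

# Placeholder ids; venue and abstract are left empty on purpose.
sample_rows = [
    ("p1", "Random Forests", None, 2001, 28679, None),
    ("p2", "LIBSVM: A library for support vector machines", None, 2011, 33016, None),
    ("p3", "Support-Vector Networks", None, 1995, 26114, None),
]
conn.executemany("INSERT INTO papers VALUES (?, ?, ?, ?, ?, ?)", sample_rows)

for title, year, n_citation in top_cited(conn, limit=2):
    print(f"{n_citation:>7,}  {title} ({year})")
```

Passing the limit as a bound parameter keeps the query reusable for longer lists without string formatting.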

| # Citations | Title | Year |
| --- | --- | --- |
| 73,362 | Genetic Algorithms in Search, Optimization and Machine Learning | 1989 |
| 42,508 | Distinctive Image Features from Scale-Invariant Keypoints | 2004 |
| 34,288 | Bowling Alone: The Collapse and Revival of American Community | 2000 |
| 33,016 | LIBSVM: A library for support vector machines | 2011 |
| 29,285 | Reinforcement Learning: An Introduction | 1999 |
| 28,679 | Random Forests | 2001 |
| 27,068 | Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology | 1989 |
| 26,357 | Ad-hoc On-demand Distance Vector Routing | 1999 |
| 26,114 | Support-Vector Networks | 1995 |
| 25,835 | Fuzzy Identification of Systems and Its Applications to Modeling and Control | 1985 |

Huh, Aminer's dataset is incomplete. I expected to find Protein Measurement With the Folin Phenol Reagent, which, according to Nature, had garnered 305,000 citations by 2014. Nonetheless, this blog post must go on.

Approach 2: PageRank

The Anatomy of a Large-Scale Hypertextual Web Search Engine presents Google's PageRank algorithm, which counts the number and quality of links to a page to determine a rough estimate of how important the website is. Unsurprisingly, I'm not the first to apply it to citation graphs, e.g. The Pagerank-Index: Going beyond Citation Counts in Quantifying Scientific Impact of Researchers. However, I'm interested in finding interesting papers, not scholars.
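For reference, PageRank on a citation graph is commonly written in the normalized form below. The symbols are assumed notation, not something taken from the dataset: \(N\) is the number of papers, \(C(p)\) is the set of papers citing \(p\), \(L(q)\) is the number of references made by paper \(q\), and \(d\) is the damping factor. Implementations also need a rule for papers with no outgoing references, which this form leaves out.

\[
PR(p) = \frac{1 - d}{N} + d \sum_{q \in C(p)} \frac{PR(q)}{L(q)}
\]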

NetworkX provides a good implementation of PageRank in Python, so I used it. To save on computational resources, I only considered publications that either had \(\ge 100\) citations or were cited by papers with \(\ge 100\) citations. Here are the results:

| # Citations | Title | Year |
| --- | --- | --- |
| 73,362 | Genetic Algorithms in Search, Optimization and Machine Learning | 1989 |
| 13,227 | The Design and Analysis of Computer Algorithms | 1974 |
| 6,589 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference | 1988 |
| 11,580 | Reducibility among Combinatorial Problems | 2010 |
| 21,256 | Snakes: Active Contour Models | 1988 |
| 17,064 | New Directions in Cryptography | 1976 |
| 6,906 | C4.5: Programs for Machine Learning | 1993 |
| 42,508 | Distinctive Image Features from Scale-Invariant Keypoints | 2004 |
| 18,861 | A Method for Obtaining Digital Signatures and Public-Key Cryptosystems | 1978 |
| 20,878 | Introduction to Modern Information Retrieval | 1986 |
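The filtering and ranking steps behind this table could look roughly like the sketch below. It is one reading of the description, not the code actually used: `select_papers` and `rank_with_networkx` are hypothetical helpers, and keeping every reference whose two endpoints both survive the filter is an assumption. The NetworkX calls (`DiGraph`, `add_nodes_from`, `add_edges_from`, `pagerank` with its `alpha` parameter) are the library's own; depending on the NetworkX version, SciPy may also need to be installed.

```python
def select_papers(citation_counts, references, threshold=100):
    """Pick the papers to rank and the references among them.

    citation_counts: dict mapping paper_id -> number of citations
    references: iterable of (paper_id, cited_paper_id) pairs
    """
    references = list(references)
    highly_cited = {p for p, n in citation_counts.items() if n >= threshold}
    # Papers cited by a highly cited paper are kept as well.
    cited_by_highly_cited = {
        cited for citing, cited in references if citing in highly_cited
    }
    kept = highly_cited | cited_by_highly_cited
    # Assumption: keep a reference only if both of its endpoints were kept.
    edges = [
        (citing, cited)
        for citing, cited in references
        if citing in kept and cited in kept
    ]
    return kept, edges


def rank_with_networkx(kept, edges, damping=0.85):
    """Return paper ids sorted from highest to lowest PageRank score."""
    import networkx as nx  # third-party dependency, imported lazily

    graph = nx.DiGraph()
    graph.add_nodes_from(kept)   # keeps papers that have no surviving edges
    graph.add_edges_from(edges)  # edge direction: citing -> cited
    scores = nx.pagerank(graph, alpha=damping)
    return sorted(scores, key=scores.get, reverse=True)
```

With the real data, `citation_counts` would come from the papers table and `references` from paper_references.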

I think the ranking is worthwhile. I recognize some influential publications on the list, e.g. RSA encryption, Diffie-Hellman key exchange and causal probability. That said, this isn't a definitive ranking, for there are different variations of PageRank. For instance, there is a damping factor, \(d\): at each paper, the reader follows a citation with probability \(d\) and, with probability \(1 - d\), gets bored and picks a paper at random instead. I used \(d = 0.85\) as it was used in the original PageRank algorithm. Given the assumptions I've made, showing only the top 10 results does the ranking no justice. I have thus made the top 3,000 publications out of the 3,079,007 available. I bet most of the gems lie somewhere in there.
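To see what the damping factor does, here is a small power-iteration sketch in plain Python. It is an illustration under stated assumptions, not NetworkX's implementation: the function name `pagerank`, the fixed iteration count, and the choice to spread the rank of papers without references evenly over all papers are mine. It uses the usual convention in which the reader follows a citation with probability `damping` and jumps to a random paper otherwise.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Approximate PageRank for (citing, cited) pairs by power iteration."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank sitting on papers with no references is shared by everyone.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        for node in nodes:
            if out_links[node]:
                # Each reference passes on an equal share of the citing paper's rank.
                share = damping * rank[node] / len(out_links[node])
                for cited in out_links[node]:
                    new_rank[cited] += share
        rank = new_rank
    return rank


# Toy graph: b cites a, c cites both a and b.
toy_edges = [("b", "a"), ("c", "a"), ("c", "b")]
for d in (0.0, 0.5, 0.85):
    scores = pagerank(toy_edges, damping=d)
    print(d, {paper: round(score, 3) for paper, score in scores.items()})
```

With `damping=0.0` every paper gets the same score, since the reader never follows a citation; as the value grows, the scores increasingly reflect the citation structure.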