In Search of Influential Academic Publications

I recently downloaded citation data from Aminer to look for interesting papers to read during my free time [note]. The database has the following structure:

Table	Columns	Remarks
papers	paper_id, title, venue, year, number of citations, abstract	3,079,007 papers from 1936 to 2018
authors	name, appearance order, paper_id	1,766,547 distinct names (out of 9,476,165 names)
paper_references	paper_id, cited_paper_id	25,166,994 references

Approach 1: Citation Count

The simplest way to rank papers is by citation count, which takes a single SQL query:

SELECT papers.* FROM papers ORDER BY papers.n_citation DESC LIMIT 10;

# Citations	Title	Year
73,362	Genetic Algorithms in Search, Optimization and Machine Learning	1989
42,508	Distinctive Image Features from Scale-Invariant Keypoints	2004
34,288	Bowling Alone: The Collapse and Revival of American Community	2000
33,016	LIBSVM: A library for support vector machines	2011
29,285	Reinforcement Learning: An Introduction	1999
28,679	Random Forests	2001
27,068	Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology	1989
26,357	Ad-hoc On-demand Distance Vector Routing	1999
26,114	Support-Vector Networks	1995
25,835	Fuzzy Identification of Systems and Its Applications to Modeling and Control	1985

Huh, Aminer's dataset is incomplete. I expected to find Protein Measurement With the Folin Phenol Reagent, which according to Nature had garnered 305,000 citations by 2014. Nonetheless, this blog post must go on.

Approach 2: PageRank

The Anatomy of a Large-Scale Hypertextual Web Search Engine presents Google's PageRank algorithm, which counts the number and quality of links to a page to produce a rough estimate of how important that page is. Unsurprisingly, I'm not the first to apply it to citation graphs; see, e.g., The Pagerank-Index: Going beyond Citation Counts in Quantifying Scientific Impact of Researchers. However, I'm interested in finding interesting papers, not scholars.

NetworkX provides a good implementation of PageRank in Python, so I used it. To save on computational resources, I only considered publications that either had \(\ge 100\) citations or were cited by papers with \(\ge 100\) citations. Here are the results:
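The filtering rule above can be sketched in plain Python. This is only an illustration on toy data, not the code used for this post: the function name `select_papers`, the in-memory dict and list inputs, and the choice to keep only edges whose two endpoints both survive the filter are all assumptions.

```python
def select_papers(citation_counts, references, threshold=100):
    """Keep papers with >= threshold citations, plus the papers they cite.

    citation_counts: dict mapping paper_id -> number of citations
    references: list of (paper_id, cited_paper_id) pairs
    (Both input shapes are hypothetical stand-ins for the database tables.)
    """
    # Papers that clear the citation threshold on their own.
    well_cited = {p for p, n in citation_counts.items() if n >= threshold}
    # Papers cited by at least one well-cited paper.
    cited_by_well_cited = {cited for citing, cited in references
                           if citing in well_cited}
    keep = well_cited | cited_by_well_cited
    # Assumption: only keep references between two surviving papers.
    edges = [(citing, cited) for citing, cited in references
             if citing in keep and cited in keep]
    return keep, edges


# Toy data: "a" and "b" clear the threshold; "c" is kept because "a" cites it.
citation_counts = {"a": 250, "b": 120, "c": 3, "d": 1, "e": 0}
references = [("a", "c"), ("b", "a"), ("d", "e"), ("c", "d"), ("e", "b")]
keep, edges = select_papers(citation_counts, references)
```

The resulting edge list could then be loaded into a `networkx.DiGraph` with `add_edges_from(edges)` and ranked with `networkx.pagerank(G, alpha=0.85)`.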

# Citations	Title	Year
73,362	Genetic Algorithms in Search, Optimization and Machine Learning	1989
13,227	The Design and Analysis of Computer Algorithms	1974
6,589	Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference	1988
11,580	Reducibility among Combinatorial Problems	2010
21,256	Snakes: Active Contour Models	1988
17,064	New Directions in Cryptography	1976
6,906	C4.5: Programs for Machine Learning	1993
42,508	Distinctive Image Features from Scale-Invariant Keypoints	2004
18,861	A Method for Obtaining Digital Signatures and Public-Key Cryptosystems	1978
20,878	Introduction to Modern Information Retrieval	1986

I think the ranking is worthwhile. I recognize some influential publications on the list, e.g. RSA encryption, Diffie-Hellman key exchange and causal probability. That said, this isn't a definitive ranking, for there are different variations of PageRank. For instance, there is a damping factor, \(d\), which is the probability that at each paper, the reader follows a citation; with probability \(1 - d\), the reader gets bored and picks a paper at random instead. I used \(d = 0.85\) as it was used in the original PageRank algorithm. Given the assumptions I've made, providing only the top 10 results does the ranking no justice. I have thus made available the top 3,000 publications out of the 3,079,007. I bet most of the gems lie somewhere in there.
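To make the role of \(d\) concrete, here is a minimal power-iteration sketch of PageRank in plain Python. It is not NetworkX's implementation, nor the code behind the table above. The function name, the fixed iteration count and the handling of papers with no outgoing references (their rank is spread evenly over all papers) are assumptions made for illustration.

```python
def pagerank(edges, d=0.85, iterations=100):
    """Rank nodes of a citation graph given (citing, cited) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Assumption: papers citing nothing spread their rank over all papers.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        # With probability 1 - d the reader jumps to a random paper.
        new_rank = {node: (1.0 - d) / n + d * dangling / n for node in nodes}
        # With probability d the reader follows one of the paper's citations.
        for node in nodes:
            for cited in out_links[node]:
                new_rank[cited] += d * rank[node] / len(out_links[node])
        rank = new_rank
    return rank


# Toy graph: p1 and p2 both cite p3, which in turn cites p4.
ranks = pagerank([("p1", "p3"), ("p2", "p3"), ("p3", "p4")])
```

On this toy graph, p4 ends up ranked above p3 even though it has a single citation, because that citation comes from the most cited paper. This is the "quality of links" effect that plain citation counts miss.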

Notes

On finding interesting papers to read, Fermat's Library Journal Club usually has great weekly picks. PDF submissions on Hacker News also tend to be worth the time.