I recently downloaded citation data from Aminer, hoping to find interesting papers to read in my free time [note]. The database has the following structure:
Table | Columns | Remarks |
---|---|---|
papers | paper_id, title, venue, year, number of citations, abstract | 3,079,007 papers from 1936 to 2018 |
authors | name, appearance order, paper_id | 1,766,547 distinct names (out of 9,476,165 names) |
paper_references | paper_id, cited_paper_id | 25,166,994 references |
Approach 1: Citation Count
The simplest way to rank papers is by citation count, which takes a single SQL query:
SELECT papers.* FROM papers ORDER BY papers.n_citation DESC LIMIT 10;
Huh, Aminer's dataset is incomplete. I expected to find Protein Measurement With the Folin Phenol Reagent, which according to Nature, had garnered 305,000 citations by 2014. Nonetheless, this blog post must go on.
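For anyone who wants to run the citation-count query from a script, here is a minimal sketch using Python's built-in `sqlite3` module. The in-memory database, the column types and the toy rows are assumptions made for illustration; only the `papers` table and the `n_citation` column come from the query above.

```python
import sqlite3

# In-memory stand-in for the real database. Column names follow the query
# above; the types and the rows are made-up toy data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE papers (paper_id TEXT PRIMARY KEY, title TEXT, "
    "venue TEXT, year INTEGER, n_citation INTEGER, abstract TEXT)"
)
conn.executemany(
    "INSERT INTO papers VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("p1", "Toy paper A", "Venue X", 1999, 120, ""),
        ("p2", "Toy paper B", "Venue Y", 2005, 4500, ""),
        ("p3", "Toy paper C", "Venue X", 2012, 37, ""),
    ],
)


def top_cited(conn, limit=10):
    """Return (title, year, n_citation) rows, most-cited first."""
    return conn.execute(
        "SELECT title, year, n_citation FROM papers "
        "ORDER BY n_citation DESC LIMIT ?",
        (limit,),
    ).fetchall()


top_two = top_cited(conn, limit=2)
for title, year, n in top_two:
    print(f"{n:>6,}  {title} ({year})")
```

Passing the limit as a bound parameter keeps the helper reusable for a top 10, top 100, and so on.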
Approach 2: PageRank
The Anatomy of a Large-Scale Hypertextual Web Search Engine presents Google's PageRank algorithm, which counts the number and quality of links to a page to produce a rough estimate of how important that page is. Understandably, I'm not the first to apply it to citation graphs, e.g. The Pagerank-Index: Going beyond Citation Counts in Quantifying Scientific Impact of Researchers. However, I'm interested in finding interesting papers, not scholars.
NetworkX provides a good implementation of PageRank in Python, so I used that. To save on computational resources, I only considered publications that either had \(\ge 100\) citations or were cited by papers that had \(\ge 100\) citations. Here are the results:
# Citations | Title | Year |
---|---|---|
73,362 | Genetic Algorithms in Search, Optimization and Machine Learning | 1989 |
13,227 | The Design and Analysis of Computer Algorithms | 1974 |
6,589 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference | 1988 |
11,580 | Reducibility among Combinatorial Problems | 2010 |
21,256 | Snakes: Active Contour Models | 1988 |
17,064 | New Directions in Cryptography | 1976 |
6,906 | C4.5: Programs for Machine Learning | 1993 |
42,508 | Distinctive Image Features from Scale-Invariant Keypoints | 2004 |
18,861 | A Method for Obtaining Digital Signatures and Public-Key Cryptosystems | 1978 |
20,878 | Introduction to Modern Information Retrieval | 1986 |
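The filtering step used to shrink the graph before ranking could look roughly like the sketch below. This is not the code I ran: the function name, the input shapes and the choice to keep only references whose two endpoints both survive the filter are assumptions, and the data is made up.

```python
def filter_citation_graph(n_citation, references, threshold=100):
    """Keep well-cited papers plus the papers they cite.

    n_citation: dict mapping paper_id -> citation count
    references: iterable of (paper_id, cited_paper_id) pairs
    Returns (kept_ids, kept_edges).
    """
    references = list(references)
    # Papers that clear the citation threshold on their own.
    well_cited = {p for p, n in n_citation.items() if n >= threshold}
    # Papers cited by at least one well-cited paper.
    cited_by_well_cited = {dst for src, dst in references if src in well_cited}
    kept = well_cited | cited_by_well_cited
    # Assumption: drop any reference that touches a filtered-out paper.
    edges = [(src, dst) for src, dst in references if src in kept and dst in kept]
    return kept, edges


# Toy data, made up for illustration.
counts = {"a": 250, "b": 12, "c": 3, "d": 140, "e": 99}
refs = [("a", "b"), ("c", "a"), ("d", "a"), ("e", "c"), ("b", "d")]
kept, edges = filter_citation_graph(counts, refs)
print(sorted(kept))
print(edges)
```

The resulting edge list could then be loaded into a `networkx.DiGraph` and passed to `networkx.pagerank`.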
I think the ranking is worthwhile. I recognize some influential publications on the list, e.g. RSA encryption, Diffie-Hellman key exchange and probabilistic reasoning. That said, this isn't a definitive ranking because there are several variants of PageRank. For instance, there is a damping factor, \(d\), which is the probability that the reader follows a citation from the current paper; with probability \(1 - d\), the reader gets bored and picks a paper at random instead. I used \(d = 0.85\), as in the original PageRank algorithm. Given the assumptions I've made, showing only the top 10 results does the ranking no justice. I have therefore made the top 3,000 publications (out of 3,079,007) available. I bet most of the gems lie somewhere in there.
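To make the role of the damping factor concrete, here is a small pure-Python power-iteration sketch of PageRank. It is an illustration rather than the NetworkX implementation I actually used; the function name, the convergence settings and the toy graph are assumptions. Papers with no outgoing citations spread their rank evenly over all papers, which is one common convention.

```python
def pagerank(edges, d=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank over (citing, cited) pairs.

    d is the probability of following a citation; with probability
    1 - d the reader jumps to a paper chosen uniformly at random.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by papers that cite nothing is spread over everyone.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - d) / n + d * dangling / n for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = d * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy citation graph: "a" is cited by every other paper.
toy_edges = [("b", "a"), ("c", "a"), ("d", "a"), ("d", "b")]
ranks = pagerank(toy_edges)
for paper, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{paper}: {score:.4f}")
```

Setting `d=0.0` makes every paper equally ranked, since the reader never follows a citation; raising `d` gives the citation structure more weight.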
Notes
- On finding interesting papers to read, Fermat's Library Journal Club usually has great weekly picks. PDF submissions on Hacker News also tend to be worth the time.