On Mathematical Analysis of MathSciNet & MathOverflow

Question

This question has two original motivations: mathematical and social.

The mathematical motivation is mainly based on what I have seen about Zipf's law here and there. Zipf's law states that a Zipfian distribution (a variant of a power-law probability distribution) provides a good approximation of many types of data corresponding to physical or social phenomena.

The iconic example is the frequency of words in a natural language, where Zipf's law indicates that the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
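To make the rank-frequency statement concrete, here is a minimal Python sketch. It assumes the idealized form of the law with exponent 1 (frequency proportional to 1/rank); the function name `zipf_frequencies` and the vocabulary size of 100 are illustrative choices, not taken from any reference.

```python
from fractions import Fraction

def zipf_frequencies(n_items, exponent=1):
    """Return normalized ideal Zipf frequencies for ranks 1..n_items.

    Assumption: frequency(rank) is proportional to 1 / rank**exponent.
    Exact fractions are used so the ratios below are not blurred by rounding.
    """
    weights = [Fraction(1, rank ** exponent) for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]

freqs = zipf_frequencies(100)          # hypothetical vocabulary of 100 words
ratio_1_to_2 = freqs[0] / freqs[1]     # most frequent vs. second most frequent
ratio_1_to_3 = freqs[0] / freqs[2]     # most frequent vs. third most frequent
print(ratio_1_to_2, ratio_1_to_3)
```

Real word counts only follow these ratios approximately, and the fitted exponent is usually close to, but not exactly, 1.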

There are plenty of other surprising appearances of Zipf's law (and power laws) in various real-world situations. For instance, Silagadze, in his paper "Citations and the Zipf-Mandelbrot's law", shows that the number of citations of scientific papers obeys Zipf's law!

This made me wonder whether Zipf or power-law distributions also appear in large mathematical research databases such as Arxiv, MathSciNet, or MathOverflow, say in the number of publications, co-authors, reviews, reputation points, etc.

The social motivation, however, comes from occasional general claims by some mathematicians about how the mathematical society was, is, or will be. Statements such as the following:

The mathematical paradigm is slowly shifting from pure mathematics to applied as more and more mathematicians are doing research of applied nature.
In comparison with other branches of logic, model theory has the strongest ties with other mathematical disciplines.
Due to the urgent need and rapid advancement in AI and computer science, computability and complexity are the most rapidly expanding branches of mathematical logic.
The most influential work of a mathematician usually happens before age 40.

Such controversial claims often provoke intense arguments among mathematicians. The question is how to settle them once and for all. Indeed, a rigorous approach to verifying any such social claim is to mathematically analyze large mathematical databases such as Arxiv, MathSciNet, or MathOverflow in order to extract general patterns, including the distribution of research topics, possible changes in mathematical research fashion, and the interaction and intersection of various mathematical disciplines with each other.

These motivations lead to the following general question:

Question. Have large mathematical databases such as Arxiv, MathSciNet, or MathOverflow been the subject of any social network and database analysis so far? What are examples of published mathematical research about possible mathematical patterns that may exist in them as databases? What patterns are found? (I am particularly interested in the case of Zipf's law and other naturally occurring statistical patterns such as Benford's law.)
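For readers unfamiliar with Benford's law, the following Python sketch may help. Benford's law predicts that the leading decimal digit d occurs with probability log10(1 + 1/d). The helper names and the choice of powers of 2 as sample data are assumptions made for illustration; powers of 2 are a standard example whose leading digits are known to follow the law closely, and no MathOverflow or MathSciNet data is used here.

```python
import math
from collections import Counter

def benford_expected():
    """Benford's predicted share for each leading digit 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(n):
    """First decimal digit of a positive integer."""
    return int(str(n)[0])

def leading_digit_shares(values):
    """Observed share of each leading digit among the positive values."""
    counts = Counter(leading_digit(v) for v in values if v > 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Illustrative sample only: the first 1000 powers of 2.
sample = [2 ** k for k in range(1, 1001)]
observed = leading_digit_shares(sample)
expected = benford_expected()
for d in range(1, 10):
    print(d, round(observed[d], 3), round(expected[d], 3))
```

The same `leading_digit_shares` function could be applied to any list of positive counts (reputation points, citation counts, and so on) to compare against `benford_expected()`.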

It is likely that some journal ranking organizations have already conducted research along these lines, but I haven't seen any outline so far. It would also be interesting if the research had been done in a comparative way, allowing one to compare the general characteristics of the math community with its counterparts, say in physics or biology.

Update. A fellow MathOverflow user just emailed me expressing his interest in conducting some statistical research on the MathSciNet and MathOverflow databases. He asked whether I know how to get access to the corresponding background data, which I actually don't. I am not even sure if it is publicly accessible or free (particularly in the case of MathSciNet and Arxiv). I also suspect that there must be some non-disclosure rules and restrictions which any research along these lines should follow. It would be nice if somebody who knows how to get access to the raw material needed for research on these databases could shed some light on these issues.

Not to mention that the total reputation in MathOverflow doesn't seem to be distributed in a Zipfian way because the reputation of the 1st person is not twice the size of the second one. Maybe one needs to compare users in the same category (say those who post in topology or set theory) or consider different power-law distributions. — Morteza Azad, May 27 '18 at 17:46
Tangentially related: (1) How many mathematicians are there? (2) An interactive graph of MathOverflow tags. (3) National Academies of Science, "Important Trends in the Mathematical Sciences," The Mathematical Sciences in 2025. — Joseph O'Rourke, May 27 '18 at 18:16
Not sure if this is the sort of thing you're looking for, but there was a paper in the Notices of the AMS back in 2010 on the issue of topical bias in generalist mathematical journals. http://www.ams.org/notices/201011/rtx101101421p.pdf — Timothy Chow, May 27 '18 at 18:30
@JosephO'Rourke The Meta post No. 2 is closely related to what I have in mind for a possible mathematical model of MathOverflow via graph theory where the users (tags) are weighted vertexes depending on their reputation (number of included posts) and are connected to each other via weighted edges which indicate their interaction (intersection). There are various interesting parameters to investigate in such a graph. For example, its minimal dominating set may reveal the most influential contributors (topics) in the community. — Morteza Azad, May 27 '18 at 19:07
@TimothyChow (+1) Very nice, Tim! This piece of research is indeed of the type that I am looking for in my question. So why not post it as an answer to this reference request question?! — Morteza Azad, May 27 '18 at 19:44
Another related instance of real-world appearance of power-law distribution is presented in Sean Gourley's TED talk entitled "The mathematics of war" in which he describes how the number of casualties in ANY war could be explained via a unique power law distribution equation regardless of the nature of conflict and the number and type of the involved factions! — Morteza Azad, May 27 '18 at 20:30
See also the discussion at https://meta.mathoverflow.net/questions/1894/is-mo-connected — Gerry Myerson, May 27 '18 at 23:46
Naturally occurring numbers often follow Benford's law rather than Zipf's law. — Kimball, May 28 '18 at 01:54
@Kimball (+1) These universal numerical laws which appear out of nowhere in data sets are just amazing! Your comment revived my hope of finding the pattern governing the set of MathOverflow users' reputation points! I think there must be some pattern because the reputation score is not accumulated randomly. Certain underlying laws of the community, combined with users' natural human behavior, possibly give rise to one such statistical law in this particular set of numbers. Maybe some statistician can take a look at them, apply some pattern-finding algorithms and discover something! — Morteza Azad, May 28 '18 at 03:45
@GerryMyerson (+1) Thanks for linking to the Meta thread, Gerry! Maybe a more mathematical version of the mentioned Meta post was best suited for MO (so that I could find it in my pre-posting search). Initially, I had some doubts deciding where the appropriate place for posting this question is, Main or Meta! The OP is a Meta question in some sense because it is asking about MathOverflow. However, when I finalized the post I found it suitable enough for the main forum as it is asking a specific question about certain types of math papers and not even eligible for the soft-question tag. — Morteza Azad, May 28 '18 at 06:59
One big issue with MathSciNet is that AMS restricts the access to the database, which doesn't allow arbitrary users (including with institutional access) to make deep statistical studies. — YCor, May 28 '18 at 09:24
@YCor Indeed it is a big deal! Does anybody here know any MathOverflow user in charge of the MathSciNet database so he can draw their attention to this problem? I think it doesn't seem ethically sound to prevent mathematicians from analyzing their own community for free or at least at an affordable price. Maybe by some mutual interactions between math community and MathSciNet database admins, a solution to this problem could be found. — Morteza Azad, May 28 '18 at 20:00
@MortezaAzad they're perfectly aware of this, and I don't think they consider this as a problem. They like to keep the right to allow who they like to do such searches. Possibly once a free database will exist, but at least the web has been existing for more than 20 years and there's none at the moment. — YCor, May 28 '18 at 23:40
Another related thread on meta: Papers, articles, books and other resources discussing MathOverflow. — Martin Sleziak, Jun 26 '19 at 06:16
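The graph model suggested in the comments (users or tags as vertices, interactions as edges, with a dominating set picking out influential vertices) can be sketched in Python. Finding a minimum dominating set is NP-hard in general, so this sketch swaps in the standard greedy approximation; it also ignores vertex and edge weights. The toy graph and the function name `greedy_dominating_set` are hypothetical and do not come from any MathOverflow data.

```python
def greedy_dominating_set(adjacency):
    """Greedy approximation of a small dominating set.

    adjacency maps each vertex to the set of its neighbours (undirected graph).
    At each step, pick the vertex that dominates the most not-yet-dominated
    vertices (itself plus its neighbours); ties go to the alphabetically first.
    """
    undominated = set(adjacency)
    chosen = []
    while undominated:
        best = max(
            sorted(adjacency),
            key=lambda v: len(({v} | adjacency[v]) & undominated),
        )
        chosen.append(best)
        undominated -= {best} | adjacency[best]
    return chosen

# Hypothetical toy interaction graph with six users.
toy = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "E"},
    "E": {"D", "F"},
    "F": {"E"},
}
dominators = greedy_dominating_set(toy)
print(dominators)
```

The greedy result is a valid dominating set but is not guaranteed to be minimum; for small graphs an exhaustive search could be used to check.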

Carlo Beenakker · Answer 1 · 2018-05-27T19:30:42.790

34

• MathOverflow has been studied as a "complex network" in Social achievement and centrality in MathOverflow, by L.V. Montoya, A. Ma, and R.J. Mondragón.
The analysis distinguishes degree centrality (based on the number of edges that a node has), betweenness centrality (which measures the fraction of geodesic paths that pass through a node), closeness centrality (the mean geodesic distance from a node to every other node), and eigenvector centrality (which measures how well connected a node is and how much direct influence it may have over other well connected nodes in the network). Three hypotheses are tested (the first two pass, the third fails):

A user’s reputation score is closely related to their degree centrality.
The total number of views obtained by a user is related to their eigenvector centrality.
The number of upvotes obtained by a user is related to their closeness centrality.
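As an illustration of two of the centrality measures named above, here is a minimal Python sketch on a toy undirected graph. It is not the method or the data of the cited paper. Closeness is reported as the reciprocal of the mean geodesic distance (a common convention, so that larger values mean more central), and the graph is assumed to be connected. The star-shaped example graph and all function names are hypothetical.

```python
from collections import deque

def degree_centrality(adjacency):
    """Number of neighbours, normalized by the maximum possible (n - 1)."""
    n = len(adjacency)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adjacency.items()}

def bfs_distances(adjacency, source):
    """Geodesic (shortest-path) distances from source via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness_centrality(adjacency):
    """Reciprocal of the mean geodesic distance to all other nodes.

    Assumes the graph is connected, so every node is reachable.
    """
    n = len(adjacency)
    return {
        v: (n - 1) / sum(bfs_distances(adjacency, v).values())
        for v in adjacency
    }

# Hypothetical star graph: one hub connected to three leaves.
star = {
    "hub": {"x", "y", "z"},
    "x": {"hub"},
    "y": {"hub"},
    "z": {"hub"},
}
deg = degree_centrality(star)
clo = closeness_centrality(star)
print(deg["hub"], deg["x"], clo["hub"], clo["x"])
```

Betweenness and eigenvector centrality need more machinery (shortest-path counting and an eigenvector computation, respectively) and are left out of this sketch.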

• MathSciNet has been used by Jerrold W. Grossman to analyze the network of collaborations among mathematicians in Patterns of Collaboration in Mathematical Research: "Apparently, the appropriate popular buzz phrase for mathematicians should be 'eight degrees of separation'."
See also Patterns of Research in Mathematics by the same author.

edited May 27 '18 at 19:30

answered May 27 '18 at 19:04

Carlo Beenakker

177,695

(+1) Really interesting references, Carlo! I will take a look at them. Thanks! – Morteza Azad May 27 '18 at 19:27
@MortezaAzad There's a more recent paper than Grossman's by Patrick Ion and some other people you might look at too. – Kimball May 28 '18 at 01:59
@Kimball (+1) Nice! Could you please add a link to the reference that you mentioned (possibly as an answer to this question)? – Morteza Azad May 28 '18 at 03:27
thanks, @Kimball , I guess that's Evolutionary Events in a Mathematical Sciences Research Collaboration Network – Carlo Beenakker May 28 '18 at 06:42
It seems the eight (!) authors of this paper are an example of a "Mathematical Sciences Research Collaboration Network" themselves! ;-) – Morteza Azad May 28 '18 at 07:57

Glorfindel · Answer 2 · 2018-05-28T09:24:25.513

With regards to reputation on Stack Exchange, I did a very short analysis last year on the distribution of reputation on Stack Overflow. Thanks to the Stack Exchange Data Explorer, I can easily run the same analysis for MathOverflow:

[Plot: x-axis is the base-10 logarithm of reputation, y-axis is the base-10 logarithm of the number of users; for example, 2.0 on the x-axis corresponds to $10^2 = 100$ reputation, and there are about $10^{4.25} \approx 18000$ users with this much reputation.]

Some interesting points, caused by 'oddities' in the Stack Exchange reputation system:

the many no-activity users with just 1 reputation.
the sawtooth until x = 2.0 (about 100 reputation) looks strange, but makes sense once you realize how hard it is to get a total reputation of 2 (1 question upvote followed by 2 downvotes).
a peak on and just after x = 2.0, corresponding to 101 reputation; these are mainly users from other Stack Exchange sites who have the association bonus on MathOverflow plus some optional minor additional activity.
the peak at 300 reputation is also caused by the association bonus. Users with 200-300 reputation either don't have other accounts on the network, or have another site where they have more reputation.

Feel free to fork the query to experiment yourselves with the data.
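For anyone who exports a list of reputation scores and wants to reproduce this kind of log-log view offline, a minimal Python sketch of the binning step follows. The actual Data Explorer query is not reproduced here; the sample reputations, the bin width of 0.25, and the function name `log_binned_counts` are all made up for illustration.

```python
import math
from collections import Counter

def log_binned_counts(reputations, bin_width=0.25):
    """Count users per bin of log10(reputation).

    Keys are the lower edges of the bins on the log10 scale.
    Reputations below 1 are skipped because log10 is undefined or negative there.
    """
    counts = Counter()
    for rep in reputations:
        if rep < 1:
            continue
        bin_index = math.floor(math.log10(rep) / bin_width)
        counts[bin_index * bin_width] += 1
    return dict(sorted(counts.items()))

# Made-up sample: several 1-rep users, a few with the 101 association bonus, etc.
sample_reputations = [1, 1, 1, 1, 3, 11, 101, 101, 101, 300, 2500]
histogram = log_binned_counts(sample_reputations)
for lower_edge, n_users in histogram.items():
    print(f"log10(rep) from {lower_edge:.2f}: {n_users} users, "
          f"log10(count) = {math.log10(n_users):.2f}")
```

Plotting the lower edges against log10 of the counts gives the same kind of picture as above; an approximately straight, downward-sloping line would be the signature of a power law.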

One of the things here that more closely follows Zipf's Law is the number of questions with a certain number of answers:

(+1) This is a fascinating analysis, particularly the Zipf's law part! Have you ever run into any appearance of other naturally occurring statistical laws such as Benford's law in the MathOverflow or Math.StackExchange data set? — Morteza Azad, May 29 '18 at 08:26
Thank you! My SEDE analysis is usually motivated by somebody asking a (social) question about e.g. reputation distribution or questions on the weekend getting supposedly less attention. I'll keep an eye on other mathematically interesting patterns like Benford's law. — Glorfindel, May 29 '18 at 09:27

score 21 · Answer 3 · answered May 27 '18 at 20:20

In 2010, Joseph F. Grcar published a paper in the Notices of the AMS entitled Topical Bias in Generalist Mathematics Journals. The paper analyzed data from 2000 to 2009 in Zentralblatt to investigate the question of whether some generalist mathematics journals published more papers from certain fields of mathematics than from others.

Timothy Chow

78,129

It isn't clear to me what conclusions to draw from this analysis, since the author is measuring bias using "total number of papers published in a sub-field," which might not have a lot of meaning. E.g. some fields write much shorter papers, or some fields may have more low quality papers. Also, papers in some of the fields that the Proceedings are biased against (relativity theory, computer science) are often closer to a field other than pure mathematics. – Peter Samuelson May 28 '18 at 10:54
Judging from this interesting article, generalist journals are heavily "biased" towards pure mathematics at the expense of applied mathematics. If you ask me, this is exactly how it should be. – Alex Gavrilov May 28 '18 at 12:56
It would be interesting to know if there is topical bias in MathOverflow posts and how that compares to the AMS publications. – Matthias Wendt May 28 '18 at 15:03
