# Erdős-Renyi graph

• I am slightly exasperated at problem 52/5, about the directed variant of an Erdős-Renyi graph.

The hint suggests looking at a random graph $G(n,p)$ for $n=10^9$ and $np = 2$. Since $d = np > 1$, such a graph is expected to have a giant component with size $S \cdot n$, where $S$ is a fixed point for the function $1 - e^{-2S}$; numerically, $S \sim 0.797$, which gives the answer 797 000 000.

That answer (or reasonably close values) is not accepted by expii. Of course, the process is not exactly the same as for a $G(n,p)$ graph, so maybe run some computer simulations first?

That I did (hint: you don't need to generate the whole graph, it is enough to know the total of people in the connected component; once you have $k$ people in there, each individual retweet independently has a probability $k/10^9$ of being suppressed). The experiments so far suggest an average proportion of around 85% of people in the giant connected component (and, fittingly with the E-R model, this is more like: 85% of the time, the retweets cover the whole graph; 15% of the time, they reach nobody). Correspondingly, I tried that value (850 000 000) and close values (because experimental imprecision). Then it escalated and I ended up trying everything between 650 000 000 and 1000 000 000. Of course none of these values is accepted.

So, my question is simple: am I really supposed to enter a number followed by six zeroes in this entry field, or should I simply enter the number of millions (i.e. a rounding of $|\mathrm{connected\ component}| / 10^6$) ? (expii problems are very often misleading in this way, in particular whenever the answer is a probability: it is never clear whether we are supposed to enter it as a percentage or as an element of [0,1]. This is a major problem in the formulation of the riddles right now).

• @Circonflexe

I seem to be having the same problem you did. Working with Mathematica I simulated some directed graphs, got the average in-drgree of the vertices (limited to a maximum of 1) and computed an answer of 865 million. I tried that answer and others around it, but the system rejects them all.

• Hi all! Thank you for bringing this up on this forum! There is indeed an issue with that problem right now. I originally thought it was significantly simpler than it actually is. The problem is actually much more mathematically interesting.

It turns out that the answer is not simply recoverable from the giant component size of an Erdos-Renyi graph, as I originally had thought. That's due in a large part to the fact that the out-degree distribution in the breadth-first exploration is not Poisson(2). Instead, it has a very discrete distribution, and is either 0 or 200. Although the expected value is 2, the end result is very different.

I'm very happy to see that this is under discussion here now, because maybe we can figure out the answer together. :)

This is related to the giant component in a random graph, in the sense that it can be modeled by the following branching process:

1. Maintain a queue of "active" vertices, and a set of "seen" vertices. Everything else is "unseen".

2. Start with a single vertex in the "active" queue, and nothing in the "seen" set.

3. Each iteration, take a vertex V out of the "active" queue, and with probability 1%, select 200 uniformly random and independent vertices from the entire graph. For each such vertex which is not already in the "active" or "seen" sets, add it to the "active" queue.

The question is what the expected number of "seen" vertices is in the end.

We see, for example, that there is 99% probability that the first vertex adds no other vertices to the queue, and the entire process dies out with only 1 vertex seen. This immediately means that the answer to the whole question must be less than or equal to:

$1\% \times 10^9 = 10^7$

We know that with probability 99%, the total number of "seen" vertices is 1.

Inspect further into the remaining 1% of probability. We now have a queue with 200 vertices in it. My instinct is that heuristically, as long as the queue doesn't die out prematurely, it will reach the size of the giant component in the Erdos-Renyi random graph with $np = 2$. My heuristic explanation is that since the expected number of new vertices inserted at each iteration is 2, which is greater than 1, the "active" set has very strong pressure to grow without bound, until it runs into the same problem that the Erdos-Renyi giant component does: there are few fresh vertices to join to. For example, if already $tn$ of the vertices have been "seen" or are still "active", the expected number of new vertices added to the queue is only $2(1-t)$, which can drop below 1. In that end behavior, I would expect our process to behave similarly.

Also, as the process gains more vertices, it is less and less likely to die out prematurely. For example, once it has 200 vertices in the queue, there are 200 independent chances to have one of the vertices retweet to 200 followers, and each such chance is 1%. So, the probability that none of them retweet is now only:

$0.99^{200} = (1 - \frac{1}{100})^{200} \approx e^{-2} \approx 0.135$

Remember we had so far figured out that in 0.99 of the probability space, the number of "seen" vertices would be 1.

This means that in $0.01 e^{-2}$ of the probability space, the number of "seen" vertices is 201.

The remaining $0.01 (1 - e^{-2})$ of the probability space has the number of "seen" vertices at least 401, and at least another fresh 200 vertices in the queue. However, there might be even more fresh vertices in the queue. Indeed, with 200 independent chances to retweet, we calculated the probability of 0 retweets at about 0.135 above. The probability of exactly 1 retweet is:

$200 \cdot 0.99^{199} \cdot 0.01 \approx 0.271$

and the probability of exactly 2 retweets is:

$\binom{200}{2} \cdot 0.99^{198} \cdot 0.01^2 \approx 0.272$

and the probability of exactly 3 retweets is:

$\binom{200}{3} \cdot 0.99^{197} \cdot 0.01^3 \approx 0.181$

and the probability of exactly 4 retweets is:

$\binom{200}{4} \cdot 0.99^{196} \cdot 0.01^4 \approx 0.090$

These start to build up. For example, if there were 2 retweets among the first 200-batch, then there are now 400 in the queue, and the probability of 0 retweets is only

$(1-\frac{1}{100})^{400} \approx e^{-4}$

and so on. So, if the process picks up critical mass, I think it runs away to fill up to the large size that matches the end behavior of the Erdos-Renyi giant component breadth-first exploration process. Very importantly, the process would almost definitely end with either a relatively small total number (under 20,000) or it would blow up to the full size (about 797 million).

To calculate the answer, we then need to estimate the probability that the process dies out early. Since we're approximating the answer, this can likely be done with just a few rounds of calculations like what I just did above, because as the dying-out-probability falls too low, it will become an error term that is swamped by the huge outcome (hundreds of millions) that appears when critical mass is passed. This would indeed parallel the giant component structure in the Erdos-Renyi random graph, where there is a single linear-size component and all other components are logarithmic in the number of vertices.

To check the intuition on the end behavior, I think you might be able to model the growth process by a system of recursions (or maybe differential equations). Imagine the process is already in an advanced stage, with a linear number of vertices seen, and a linear number of vertices in the "active" queue. If there are $f(t)n$ vertices which have been seen so far, and $g(t)n$ vertices which are active, then reveal which new vertices the active vertices reach through their tweets. Since we now have many vertices "active", the law of large numbers applies, and we can estimate that the total number of retweets that happened from the $g(t)n$ vertices is asymptotically $g(t) n/100$. These are now independently distributed over all vertices, and each one of them reaches a new vertex with probability $1-f(t)-g(t)$. So, we can estimate the number of new vertices added to the "active" queue, which helps us estimate $g(t+1)$. We have the simple recursion $f(t+1) = f(t) + g(t)$ because all active vertices are now seen. I think that appropriate initial conditions would be $f(0) = g(0) = 0.001$ (some small constant, reflecting growth in the period before the number of vertices is a limitation), because if each "active" vertex gives rise to about 2 new active vertices, then the "active" frontier is typically doubling each time, and in the sum $1 + 2 + 4 + 8 + 16 + \cdots$ each term is approximately equal to the sum of all preceding terms. If this recursion reaches the threshold from the Erdos-Renyi prediction (about 0.797), then this indicates that the end behavior model is likely to be correct.

Thank you for joining me on this journey through problems. I am sorry that there are sometimes errors. I'm creating the problems fresh each week on my own, by trying to identify some mathematical insights in real world phenomena. I am often surprised to learn new things about the world, and to see the world in a new light. However, the creation of puzzles weekly is not easy, and there are sometimes errors and oversights. I truly appreciate all of your feedback, even if I am often unable to reply directly. (Indeed, that is why I upgraded this forum so that we can have a community of people discussing, and I am active on this new forum as well.) It is a pleasure to have a worldwide community of enthusiasts who are able to help think through these creations together. I hope that my post above provides a useful starting point for us to collectively solve this problem completely.

Looks like your connection to Expii Forum was lost, please wait while we try to reconnect.