A probability problem involving 40,000 sent letters.
A podcast got me doing maths again.
I was listening to this episode from the 99% invisible podcast. It contains a fascinating story about a radio contest where you could’ve won a house. Turns out that in the 80ies there was a severe economic crisis and this prompted many people to sign up to the contest. Many people signed up and only three men were picked to take part in the actual competition: who-ever could live on a billboard the longest would win a house. By the end of the podcast you’ll learned that this contest lasted for 261 days and got way out of hand.
It’s a great story, but a specific part of the story caught my attention.
Out of all the submissions that were being sent in, only three people would be selected. This got people thinking: one of the contestants actually submitted 47000 entries in order to increase the odds of getting selected. The tactic makes sense if you’re desperate and the podcast explains that the 1st ten letters that were opened by the selection committee were from a single participant.
The radio station ended up picking this contestant nonetheless but I wondered: what if the station had a policy of deny-ing anybody that sends more than one letter? The radio station wouldn’t go through all the letters (for obvious reasons, the podcast reports 500,000 letters being sent) but suppose that we pick three random letters for the contestants then we would expect each candidate to only occur once. With this in mind, you might want to limit the number of letters that you send.
Let’s do the maths here. We have a few free variables in the system that we’ve described.
Let’s start simple by first calculating the probability of getting selected if you’re the only person sending something in.
\[ p_{c=1}(\text{picked exactly once}) = \frac{s}{a + s} \]
If only one candidate is picked then we want to have \(s\) be as large as possible.
Let’s now consider two contestants. The idea is that either you get selected first and then somebody who isn’t you needs to get selected or the other way around.
\[\begin{equation} \label{eq1} \begin{split} p_{c=2}(\text{picked exactly once}) & = \frac{s}{a + s} \times \frac{a}{a + s -1} + \frac{a}{a + s} \times \frac{s}{a + s -1} \\\\ & = \frac{s \times a}{(a+s)\times(a+s-1)} + \frac{s \times a}{(a+s)\times(a+s-1)} \\\\ & = 2 \times \frac{s \times a}{(a+s)\times(a+s-1)} \end{split} \end{equation}\]
It seems like we need to be a bit more careful now. If \(s >> a\) then we need to check that what is at the denominator doesn’t grow.
When we do the maths for three contestants then we see a pattern occur.
\[\begin{equation} \label{eq2} \begin{split} p_{c=3}(\text{picked exactly once}) & = \frac{s}{a + s} \times \frac{a}{a + s -1} \times \frac{a-1}{a + s - 2} \\\\ & + \frac{a}{a + s} \times \frac{s}{a + s -1} \times \frac{a-1}{a + s - 2} \\\\ & + \frac{a}{a + s} \times \frac{a-1}{a + s -1} \times \frac{s}{a + s - 2} \\\\ & = 3 \times \frac{s \times a \times (a-1)}{(a+s) \times (a+s-1) \times (a+s-2)} \end{split} \end{equation}\]
Maths is nice and all but true intuition comes from plotting the numbers.
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
%matplotlib inline
s = np.arange(1, 500000)
a = 500000
c = 3
prob_zero = (a*(a-1)*(a-2))/((a+s)*(a+s-1)*(a+s-2))
prob_one = 3*(s*a*(a-1))/((a+s)*(a+s-1)*(a+s-2))
prob_two = 3*(s*(s-1)*a)/((a+s)*(a+s-1)*(a+s-2))
prob_three = (s*(s-1)*(s-2))/((a+s)*(a+s-1)*(a+s-2))
df = pd.DataFrame({'zero': prob_zero,
'one': prob_one,
'two': prob_two,
'three': prob_three})
df.index = s
df.plot(title="probability of num letters opened after sending",
figsize=(16, 8));
Turns out that if you really want to win at this game, you’ll need to make sure that about half of the amount of other letters is the amount you need to send (this translates to 1/3 of the total letters to need to come in from you). But can we find this number exactly?
Let’s look at our equation from before.
\[\begin{equation} \label{eq3} \begin{split} p_{c=3}(\text{picked exactly once}) & = 3 \times \frac{s \times a \times (a-1)}{(a+s) \times (a+s-1) \times (a+s-2)} \end{split} \end{equation}\]
If we want to optimise that equation for \(a\) we’ll find that it is actually very tricky. We have a fraction with a polynomial in there and even when we have something like sympy
it seems that it would not yield a pretty solution.
Realising this makes the path towards the actual solution a whole lot easier. It gives us an opporunity to reformulate the problem such that everything becomes a lot more simple. Let’s change the variables, but only slightly.
Suppose now that we have a system with these variables.
Note that the \(t\) parameter is the only one that is really different. With these parameters to use, let us check what our equation might look like.
\[\begin{equation} \label{eq4} \begin{split} p_{c=3}(\text{picked exactly once}) & = \frac{s}{t} \times \frac{t-s}{t -1} \times \frac{t-s-1}{t - 2} \\\\ & + \frac{t-s}{t} \times \frac{s}{t -1} \times \frac{t-s-1}{t - 2} \\\\ & + \frac{t-s}{t} \times \frac{t-s-1}{t -1} \times \frac{s}{t - 2} \\\\ & = 3 \times \frac{s \times (t-s) \times (t-s-1)}{t \times (t-1) \times (t-2)} \end{split} \end{equation}\]
By just rephrasing this, everything became a whole lot simpler. Since the variable \(t\) is something that is given we merely have created a division by a constant. You still won’t need to do any maths if you don’t feel like it though because sympy is here to help.
import sympy as sp
s, t = sp.symbols("s, t")
sp.solve(sp.diff(3*s*(t-s)*(t-s-1)/(t*(t-1)*(t-2)), s), s, 0)
The positive solution of that expression yields:
\[ \frac{2 t}{3} + \frac{\sqrt{t^{2} - t + 1}}{3} - \frac{1}{3} \]
Note that in sympy you can also solve numerically via;
t = 500000
sp.nsolve(sp.diff(3*s*(t-s)*(t-s-1)/(t*(t-1)*(t-2)), s), s, 0)
When \(t=500000\) then \(s^* = 166666.667\). This is a similar conclusion to what we saw before.
What is nice about this problem is that the maths can either be very easy or very hard depending on how you formulate the problem. If you take the initial formulation then you’ll find that even sympy
will have a huge problem with it.
The exercise got a whole lot simpler when we rephrased the problem to make our life easier. It might make you wonder if a similar phenomenon occurs in machine learning too.