This is a counting problem: there are bn possible assignments of b birthdays to n people. Of those, let q(k;n,b) be the number of assignments for which no birthday is shared by more than k people but at least one birthday actually is shared by k people. The probability we seek can be found by summing the q(k;n,b) for appropriate values of k and multiplying the result by b−n.
nn=4 (this is the smallest interesting situation). The possibilities are:
- Each person has a unique birthday; the code is {4}.
- Exactly two people share a birthday; the code is {2,1}.
- Two people have one birthday and the other two have another; the code is {0,2}.
- Three people share a birthday; the code is {1,0,1}.
- Four people share a birthday; the code is {0,0,0,1}.
Generally, the code {a[1],a[2],…} is a tuple of counts whose kth element stipulates how many distinct birthdates are shared by exactly k people. Thus, in particular,
1a[1]+2a[2]+...+ka[k]+…=n.
Note, even in this simple case, that there are two ways in which the maximum of two people per birthday is attained: one with the code {0,2} and another with the code {2,1}.
We can directly count the number of possible birthday assignments corresponding to any given code. This number is the product of three terms. One is a multinomial coefficient; it counts the number of ways of partitioning n people into a[1] groups of 1, a[2] groups of 2, and so on. Because the sequence of groups does not matter, we have to divide this multinomial coefficient by a[1]!a[2]!⋯; its reciprocal is the second term. Finally, line up the groups and assign them each a birthday: there are b candidates for the first group, b−1 for the second, and so on. These values have to be multiplied together, forming the third term. It is equal to the "factorial product" b(a[1]+a[2]+⋯) where b(m) means b(b−1)⋯(b−m+1).
There is an obvious and fairly simple recursion relating the count for a pattern {a[1],…,a[k]} to the count for the pattern {a[1],…,a[k−1]}. This enables rapid calculation of the counts for modest values of n. Specifically, a[k] represents a[k] birthdates shared by exactly k people each. After these a[k] groups of k people have been drawn from the n people, which can be done in x distinct ways (say), it remains to count the number of ways of achieving the pattern {a[1],…,a[k−1]} among the remaining people. Multiplying this by x gives the recursion.
I doubt there is a closed form formula for q(k;n,b), which is obtained by summing the counts for all partitions of n whose maximum term equals k. Let me offer some examples:
With b=5 (five possible birthdays) and n=4 (four people), we obtain
q(1)q(2)q(3)q(4)=q(1;4,5)=360+60=120=420=80=5.
Whence, for example, the chance that three or more people out of four share the same "birthday" (out of 5 possible dates) equals (80+5)/625=0.136.
As another example, take b=365 and n=23. Here are the values of q(k;23,365) for the smallest k (to six sig figs only):
k=1:k=2:k=3:k=4:k=5:k=6:k=7:k=8:0.492700.4945920.01253080.0001728441.80449E−61.48722E−89.92255E−115.45195E−13.
Using this technique, we can readily compute that there is about a 50% chance of (at least) a three-way birthday collision among 87 people, a 50% chance of a four-way collision among 187, and a 50% chance of a five-way collision among 310 people. That last calculation starts taking a few seconds (in Mathematica, anyway) because the number of partitions to consider starts getting large. For substantially larger n we need an approximation.
One approximation is obtained by means of the Poisson distribution with expectation n/b, because we can view a birthday assignment as arising from b almost (but not quite) independent Poisson variables each with expectation n/b: the variable for any given possible birthday describes how many of the n people have that birthday. The distribution of the maximum is therefore approximately F(k)b where F is the Poisson CDF. This is not a rigorous argument, so let's do a little testing. The approximation for n=23, b=365 gives
k=1:k=2:k=3:k=4:0.4987830.4968030.0141870.000225115.
By comparing with the preceding you can see that the relative probabilities can be poor when they are small, but the absolute probabilities are reasonably well approximated to about 0.5%. Testing with a wide range of n and b suggests the approximation is usually about this good.
To wrap up, let's consider the original question: take n=10,000 (the number of observations) and b=1000000 (the number of possible "structures," approximately). The approximate distribution for the maximum number of "shared birthdays" is
k=1:k=2:k=3:k=4:k>4:00.8475+0.1520+0.0004+<1E−6.
(This is a fast calculation.) Clearly, observing one structure 10 times out of 10,000 would be highly significant. Because n and b are both large, I expect the approximation to work quite well here.
Incidentally, as Shane intimated, simulations can provide useful checks. A Mathematica simulation is created with a function like
simulate[n_, b_] := Max[Last[Transpose[Tally[RandomInteger[{0, b - 1}, n]]]]];
which is then iterated and summarized, as in this example which runs 10,000 iterations of the n=10000, b=1000000 case:
Tally[Table[simulate[10000, 1000000], {n, 1, 10000}]] // TableForm
Its output is
2 8503
3 1493
4 4
These frequencies closely agree with those predicted by the Poisson approximation.