Extending the birthday paradox to more than 2 people

In the traditional Birthday Paradox the question is "what are the chances that two or more people in a group of $n$ people share a birthday". I'm stuck on a problem which is an extension of this.

Instead of knowing the probability that two people share a birthday, I need to extend the question to know the probability that $x$ or more people share a birthday. With $x=2$ you can do this by calculating the probability that no two people share a birthday and subtracting that from $1$, but I don't think I can extend this logic to larger values of $x$.

To complicate this further, I also need a solution which will work for very large values of $n$ (millions) and $x$ (thousands).

probability combinatorics birthday-paradox

— Simon Andrews

I presume that it's a bioinformatics problem

— csgillespie

It is actually a bioinformatics problem, but since it boils down to the same concept as the birthday paradox I thought I'd save the irrelevant specifics!

— Simon Andrews

Normally I would agree with you, but in this case the specifics might matter since there could already be a bioconductor package that does what you ask.

— csgillespie

If you really want to know, it's a pattern finding problem where I'm trying to accurately estimate the probability of a given level of enrichment of a subsequence within a set of larger sequences. I therefore have a set of subsequences with associated counts and I know how many subsequences I observed and how many theoretically observable sequences are available. If I saw a particular sequence 10 times out of 10,000 observations I need to know how likely that was to have occurred by chance.

— Simon Andrews

Almost eight years later, I posted an answer to this problem at stats.stackexchange.com/questions/333471. The code there does not work for large $n,$ though, because it takes quadratic time in $n$.

— whuber

Answers:

This is a counting problem: there are $b^n$ possible assignments of $b$ birthdays to $n$ people. Of those, let $q(k; n, b)$ be the number of assignments for which no birthday is shared by more than $k$ people but at least one birthday actually is shared by $k$ people. The probability we seek can be found by summing the $q(k;n,b)$ for appropriate values of $k$ and multiplying the result by $b^{-n}$ .

To illustrate, let $n = 4$ (this is the smallest interesting situation). The possibilities are:

Each person has a unique birthday; the code is {4}.
Exactly two people share a birthday; the code is {2,1}.
Two people have one birthday and the other two have another; the code is {0,2}.
Three people share a birthday; the code is {1,0,1}.
Four people share a birthday; the code is {0,0,0,1}.

Generally, the code $\{a[1], a[2], \ldots\}$ is a tuple of counts whose $k^\text{th}$ element stipulates how many distinct birthdates are shared by exactly $k$ people. Thus, in particular,

$$1\,a[1] + 2\,a[2] + \cdots + k\,a[k] + \cdots = n.$$

Note, even in this simple case, that there are two ways in which the maximum of two people per birthday is attained: one with the code $\{0,2\}$ and another with the code $\{2,1\}$ .

We can directly count the number of possible birthday assignments corresponding to any given code. This number is the product of three terms. One is a multinomial coefficient; it counts the number of ways of partitioning $n$ people into $a[1]$ groups of $1$ , $a[2]$ groups of $2$ , and so on. Because the sequence of groups does not matter, we have to divide this multinomial coefficient by $a[1]!a[2]!\cdots$ ; its reciprocal is the second term. Finally, line up the groups and assign them each a birthday: there are $b$ candidates for the first group, $b-1$ for the second, and so on. These values have to be multiplied together, forming the third term. It is equal to the "factorial product" $b^{(a[1]+a[2]+\cdots)}$ where $b^{(m)}$ means $b(b-1)\cdots(b-m+1)$ .
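Written as a single formula, the three terms just described combine as follows. The symbol $N(a)$ is introduced here only for illustration; it is not notation used elsewhere in this answer.

$$N(a) = \frac{n!}{\prod_{k} (k!)^{a[k]}} \cdot \frac{1}{\prod_{k} a[k]!} \cdot b^{(a[1]+a[2]+\cdots)}$$

For instance, with $n = 4$ people, $b = 5$ possible dates and the code $\{2,1\}$, this gives

$$\frac{4!}{(1!)^2\,(2!)^1} \cdot \frac{1}{2!\,1!} \cdot (5 \cdot 4 \cdot 3) = 12 \cdot \frac{1}{2} \cdot 60 = 360.$$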

There is an obvious and fairly simple recursion relating the count for a pattern $\{a[1], \ldots, a[k]\}$ to the count for the pattern $\{a[1], \ldots, a[k-1]\}$ . This enables rapid calculation of the counts for modest values of $n$ . Specifically, $a[k]$ represents $a[k]$ birthdates shared by exactly $k$ people each. After these $a[k]$ groups of $k$ people have been drawn from the $n$ people, which can be done in $x$ distinct ways (say), it remains to count the number of ways of achieving the pattern $\{a[1], \ldots, a[k-1]\}$ among the remaining people. Multiplying this by $x$ gives the recursion.

I doubt there is a closed form formula for $q(k; n, b)$ , which is obtained by summing the counts for all partitions of $n$ whose maximum term equals $k$ . Let me offer some examples:

With $b=5$ (five possible birthdays) and $n=4$ (four people), we obtain

$$\begin{aligned} q(1) &= q(1;4,5) = 120 \\ q(2) &= 360 + 60 = 420 \\ q(3) &= 80 \\ q(4) &= 5. \end{aligned}$$

Whence, for example, the chance that three or more people out of four share the same "birthday" (out of $5$ possible dates) equals $(80 + 5)/625 = 0.136$ .
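The small example above can be checked by enumerating the partitions of $n$ directly. The following Python sketch is only an illustration of the counting rule, not the Mathematica code used for the figures in this answer; the function names `partitions`, `count_for_partition` and `q` are made up for this purpose, and the approach is practical only for small $n$ because the number of partitions grows quickly.

```python
from collections import Counter
from math import factorial, prod


def partitions(n, max_part=None):
    """Yield the integer partitions of n as non-increasing tuples."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def count_for_partition(parts, b):
    """Number of birthday assignments whose group sizes are `parts`."""
    n = sum(parts)
    m = len(parts)  # number of distinct birthdays in use
    if m > b:
        return 0
    # Term 1: multinomial coefficient n! / (g1! g2! ...).
    multinomial = factorial(n) // prod(factorial(g) for g in parts)
    # Term 2: groups of equal size are interchangeable, so divide by a[k]!.
    reorder = prod(factorial(a) for a in Counter(parts).values())
    # Term 3: falling factorial b (b-1) ... (b-m+1).
    falling = prod(range(b - m + 1, b + 1))
    return multinomial // reorder * falling


def q(k, n, b):
    """Assignments of b birthdays to n people whose largest group has size k."""
    return sum(count_for_partition(p, b) for p in partitions(n) if p[0] == k)
```

Calling `[q(k, 4, 5) for k in range(1, 5)]` returns `[120, 420, 80, 5]`, which sums to $5^4 = 625$ as it should.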

As another example, take $b = 365$ and $n = 23$. Here are the values of $q(k;23,365)$, expressed as probabilities (that is, multiplied by $365^{-23}$), for the smallest $k$ (to six significant figures only):

$$\begin{aligned} k=1: &\ 0.49270 \\ k=2: &\ 0.494592 \\ k=3: &\ 0.0125308 \\ k=4: &\ 0.000172844 \\ k=5: &\ 1.80449 \times 10^{-6} \\ k=6: &\ 1.48722 \times 10^{-8} \\ k=7: &\ 9.92255 \times 10^{-11} \\ k=8: &\ 5.45195 \times 10^{-13}. \end{aligned}$$

Using this technique, we can readily compute that there is about a 50% chance of (at least) a three-way birthday collision among 87 people, a 50% chance of a four-way collision among 187, and a 50% chance of a five-way collision among 310 people. That last calculation starts taking a few seconds (in Mathematica, anyway) because the number of partitions to consider starts getting large. For substantially larger $n$ we need an approximation.

One approximation is obtained by means of the Poisson distribution with expectation $n/b$ , because we can view a birthday assignment as arising from $b$ almost (but not quite) independent Poisson variables each with expectation $n/b$ : the variable for any given possible birthday describes how many of the $n$ people have that birthday. The distribution of the maximum is therefore approximately $F(k)^b$ where $F$ is the Poisson CDF. This is not a rigorous argument, so let's do a little testing. The approximation for $n = 23$ , $b = 365$ gives

$$\begin{aligned} k=1: &\ 0.498783 \\ k=2: &\ 0.486803 \\ k=3: &\ 0.014187 \\ k=4: &\ 0.000225115. \end{aligned}$$

By comparing with the preceding you can see that the relative probabilities can be poor when they are small, but the absolute probabilities are reasonably well approximated to about 0.5%. Testing with a wide range of $n$ and $b$ suggests the approximation is usually about this good.
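A minimal Python sketch of the Poisson approximation follows. It uses only the standard library; the function names are hypothetical, and the term-by-term CDF is an assumption made for brevity. That summation underflows once $n/b$ exceeds roughly 700, so a log-space or library CDF would be needed for such cases.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i  # P(X = i) obtained from P(X = i - 1)
        total += term
    return min(total, 1.0)


def max_share_cdf(k, n, b):
    """Approximate P(no birthday is shared by more than k of n people)."""
    return poisson_cdf(k, n / b) ** b


def max_share_pmf(k, n, b):
    """Approximate P(the most-shared birthday is shared by exactly k people)."""
    return max_share_cdf(k, n, b) - max_share_cdf(k - 1, n, b)


def prob_at_least(x, n, b):
    """Approximate P(some birthday is shared by x or more people)."""
    return 1.0 - max_share_cdf(x - 1, n, b)
```

For example, `max_share_pmf(1, 23, 365)` gives the approximate chance that all 23 birthdays are distinct, and `prob_at_least(x, n, b)` is the quantity asked for in the question.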

To wrap up, let's consider the original question: take $n = 10,000$ (the number of observations) and $b = 1\,000\,000$ (the number of possible "structures," approximately). The approximate distribution for the maximum number of "shared birthdays" is

$$\begin{aligned} k=1: &\ 0 \\ k=2: &\ 0.8475+ \\ k=3: &\ 0.1520+ \\ k=4: &\ 0.0004+ \\ k \gt 4: &\ \lt 1 \times 10^{-6}. \end{aligned}$$

(This is a fast calculation.) Clearly, observing one structure 10 times out of 10,000 would be highly significant. Because $n$ and $b$ are both large, I expect the approximation to work quite well here.

Incidentally, as Shane intimated, simulations can provide useful checks. A Mathematica simulation is created with a function like

simulate[n_, b_] := Max[Last[Transpose[Tally[RandomInteger[{0, b - 1}, n]]]]];

which is then iterated and summarized, as in this example which runs 10,000 iterations of the $n = 10000$ , $b = 1\,000\,000$ case:

Tally[Table[simulate[10000, 1000000], {n, 1, 10000}]] // TableForm

Its output is

2 8503
3 1493
4 4

These frequencies closely agree with those predicted by the Poisson approximation.

— whuber

What a fantastic answer, thank you very much @whuber.

— JKnight

"There is an obvious and fairly simple recursion" — Namely?

— Kodiologist

@Kodiologist I inserted a brief description of the idea.

— whuber

+1 but where in the original question did you see that n=10000 and b=1mln? The OP looks like it is asking about n=1mln and k=10000, with b unspecified (presumably b=365). Not that it matters at this point :)

— amoeba says Reinstate Monica

@amoeba After all this time (six years, 1600 answers, and closely reading tens of thousands of posts) I cannot recall, but most likely I misinterpreted the last line. In my defense, note that if we read it literally the answer is immediate (upon applying a version of the Pigeonhole Principle): it is certain that among $n$ = millions of people there will be at least one birthday that is shared among at least $x$ = thousands of them!

— whuber

It is always possible to solve this problem with a monte-carlo solution, although that's far from the most efficient. Here's a simple example of the 2 person problem in R (from a presentation I gave last year; I used this as an example of inefficient code), which could be easily adjusted to account for more than 2:

birthday.paradox <- function(n.people, n.trials) {
    matches <- 0
    for (trial in 1:n.trials) {
        birthdays <- cbind(as.matrix(1:365), rep(0, 365))
        for (person in 1:n.people) {
            day <- sample(1:365, 1, replace = TRUE)
            if (birthdays[birthdays[, 1] == day, 2] == 1) {
                matches <- matches + 1
                break
            }
            birthdays[birthdays[, 1] == day, 2] <- 1
        }
        birthdays <- NULL
    }
    print(paste("Probability of birthday matches = ", matches/n.trials))
}
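One way the adjustment to "more than 2" might look is sketched below in Python rather than R. This is an illustration under stated assumptions, not the code from the presentation: the function names and the default of 10,000 trials are invented here, and it tallies the whole group in each trial instead of stopping at the first match.

```python
import random
from collections import Counter


def simulate_max_share(n_people, n_days, rng):
    """Largest number of people sharing one date in a single random draw."""
    counts = Counter(rng.randrange(n_days) for _ in range(n_people))
    return max(counts.values())


def estimate_prob(x, n_people, n_days=365, n_trials=10_000, seed=0):
    """Monte Carlo estimate of P(x or more of n_people share a birthday)."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    hits = sum(
        simulate_max_share(n_people, n_days, rng) >= x
        for _ in range(n_trials)
    )
    return hits / n_trials
```

With `x = 2` and `n_people = 23` the estimate should land near the familiar value of about one half; larger `x` needs more trials, because the event becomes rare.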

— Shane

I am not sure if the multiple types solution will work here.

I think that generalisation still only works for 2 or more people sharing a birthday - just that you can have different sub-classes of people.

— Simon Andrews

This is an attempt at a general solution. There may be some mistakes so use with caution!

First some notation:

$P(x,n)$ be the probability that $x$ or more people share a birthday among $n$ people,

$P(y|n)$ be the probability that exactly $y$ people share a birthday among $n$ people.

Notes:

Abuse of notation as $P(.)$ is being used in two different ways.
By definition $y$ cannot take the value of 1 as it does not make any sense and $y$ = 0 can be interpreted to mean that no one shares a common birthday.

Then the required probability is given by:

$P(x,n) = 1 - P(0|n) - P(2|n) - P(3|n) - \cdots - P(x-1|n)$

Now,

$P(y|n) = {n \choose y} (\frac{365}{365})^y \ \prod_{k=1}^{k=n-y}(1 -\frac{k}{365})$

Here is the logic: You need the probability that exactly $y$ people share a birthday.

Step 1: You can pick $y$ people in ${n \choose y}$ ways.

Step 2: Since they share a birthday it can be any of the 365 days in a year. So, we basically have 365 choices which gives us $(\frac{365}{365})^y$ .

Step 3: The remaining $n-y$ people should not share a birthday with the first $y$ people or with each other. This reasoning gives us $\prod_{k=1}^{k=n-y}(1 -\frac{k}{365})$ .

You can check that for $x$ = 2 the above collapses to the standard birthday paradox solution.

Will this solution suffer from the curse of dimensionality? If instead of n=365, n=10^6 is this solution still feasible?

— csgillespie

Some approximations may have to be used to deal with high dimensions. Perhaps, use Stirling's approximation for factorials in the binomial coefficient. To deal with the product terms you could take logs and compute the sums instead of the products and then take the anti-log of the sum.

There are also several other forms of approximations possible using for example the Taylor series expansion for the exponential function. See the wiki page for these approximations: en.wikipedia.org/wiki/Birthday_problem#Approximations

Suppose y=2, n=4, and there are just two birthdays. Your formula, adapted by replacing 365 by 2, seems to say the probability that exactly 2 people share a birthday is Comb(4,2)*(2/2)^2*(1-1/2)*(1-2/2) = 0. (In fact, it's easy to see--by brute force enumeration if you like--that the probabilities that 2, 3, or 4 people share a "birthday" are 6/16, 8/16, and 2/16, respectively.) Indeed, whenever n-y >= 365, your formula yields 0, whereas as n gets large and y is fixed the probability should increase to a non-zero maximum before n reaches 365*y and then decrease, but never down to 0.

— whuber
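The brute-force enumeration mentioned in the comment above can be sketched in a few lines of Python. The function name is hypothetical, and the approach is feasible only for tiny cases because it visits every one of the $b^n$ assignments.

```python
from collections import Counter
from itertools import product


def max_share_counts(n_people, n_days):
    """Count assignments by the size of their most-shared birthday."""
    tally = Counter()
    for assignment in product(range(n_days), repeat=n_people):
        # Size of the largest group of people sharing one date.
        tally[max(Counter(assignment).values())] += 1
    return dict(sorted(tally.items()))
```

For four people and two possible birthdays, `max_share_counts(4, 2)` returns `{2: 6, 3: 8, 4: 2}`, that is, 6, 8 and 2 of the 16 assignments.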

Why are you replacing 365 by $n$? The probability that 2 people share a birthday is computed as: 1 - Prob(they have unique birthdays). Prob(they have unique birthdays) = (364/365). The logic is as follows: Pick a person. This person can have any of the 365 days as a birthday. The second person can then only have a birthday on one of the remaining 364 days. Thus, the probability that they have unique birthdays is 364/365. I am not sure how you are calculating 6/16.