Estimating n in the coupon collector's problem

14

In a variation on the coupon collector's problem, you do not know the number of coupons and must determine it from data. I will refer to this as the fortune cookie problem:

Given an unknown number of distinct fortune cookie messages $n$, estimate $n$ by sampling cookies one at a time and counting how many times each fortune appears. Also determine the number of samples needed to obtain a desired confidence interval on this estimate.

Basically, I need an algorithm that samples just enough data to reach a given confidence interval, say $n \pm 5$ with $95\%$ confidence. For simplicity, we can assume that all fortunes appear with equal probability/frequency, but this does not hold for a more general problem, and a solution to that is also welcome.

This seems similar to the German tank problem, but in this case the fortune cookies are not labeled sequentially and therefore have no ordering.

estimation coupon-collector-problem

— chim ưng

1

Do we know that the messages are equally frequent?

— Glen_b -Reinstate Monica

Question edited: yes.

— goweon

2

Can you write down the likelihood function?

— Zen

2

People who do wildlife studies capture, tag, and release animals. Later, they infer the size of the population from the frequency with which they recapture already-tagged animals. It sounds like your problem is mathematically equivalent to theirs.

— Emil Friedman

6

For the equal probability/frequency case, this approach may work for you.

Let $K$ be the total sample size, $N$ be the number of different items observed, $N_1$ be the number of items seen exactly once, $N_2$ be the number of items seen exactly twice, $A=N_1\left(1-{N_1 \over K}\right)+2N_2,$ and $\hat Q = {N_1 \over K}.$

Then an approximate 95% confidence interval on the total population size $n$ is given by

$\hat n_{Lower}={1 \over {1-\hat Q+{1.96 \sqrt{A} \over K} }}$

$\hat n_{Upper}={1 \over {1-\hat Q-{1.96 \sqrt{A} \over K} }}$

When implementing, you may need to adjust these depending on your data.
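A minimal Python sketch of the interval above, assuming the draws are available as a list of item labels. The function name `esty_interval` and the returned keys are illustrative and do not come from the cited paper. As printed, the two expressions are reciprocals of the coverage bounds $1-\hat Q \pm 1.96\sqrt{A}/K$; the sketch also reports those values multiplied by $N$, on the assumption (not stated in the answer) that with equally likely items the coverage equals $N/n$.

```python
from collections import Counter
from math import sqrt


def esty_interval(samples, z=1.96):
    """Coverage-based interval built from K, N, N1, N2 as defined above.

    samples: iterable of hashable item labels, one entry per draw.
    """
    counts = Counter(samples)                # item -> number of times seen
    K = sum(counts.values())                 # total sample size
    N = len(counts)                          # distinct items observed
    freq = Counter(counts.values())          # multiplicity -> number of items
    N1, N2 = freq.get(1, 0), freq.get(2, 0)  # seen exactly once / exactly twice
    Q = N1 / K
    A = N1 * (1 - N1 / K) + 2 * N2
    half = z * sqrt(A) / K
    cov_low, cov_high = 1 - Q - half, 1 - Q + half  # bounds on the coverage
    if cov_low <= 0:
        raise ValueError("sample too small: lower coverage bound is not positive")
    return {
        "K": K, "N": N, "N1": N1, "N2": N2,
        # the two expressions exactly as printed in the answer
        "formula_lower": 1 / cov_high,
        "formula_upper": 1 / cov_low,
        # ASSUMPTION (not in the answer): coverage = N / n for equally
        # likely items, so the population size is N / coverage
        "n_lower": N / cov_high,
        "n_upper": N / cov_low,
    }


draws = ["a", "a", "b", "b", "b", "c", "d", "d", "e"]
print(esty_interval(draws))
```

With very few singletons the upper coverage bound can exceed 1, which is one case where the bounds would need adjusting by hand.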

This method is due to Good and Turing. A reference with the confidence interval is Esty, Warren W. (1983), "A Normal Limit Law for a Nonparametric Estimator of the Coverage of a Random Sample", Ann. Statist., Volume 11, Number 3, 905-912.

For the more general problem, Bunge has produced free software that generates several estimates. Search for his name and the word CatchAll.

— soakley

1

I took the liberty of adding the Esty reference. Please double-check that it is the one you meant.

— Glen_b -Reinstate Monica

@soakley Is it possible to get bounds (possibly less precise ones) if you only know $K$ (sample size) and $N$ (number of unique items seen)? That is, we have no information about $N_1$ and $N_2$.

— Basj

I don't know of a way to do it with only $K$ and $N.$

— soakley

2

I don't know if it can help, but this is the problem of drawing $k$ different balls in $n$ trials, with replacement, from an urn containing $m$ differently labeled balls. According to this page (in French), if $X_n$ is the random variable counting the number of different balls, then the probability function is given by: $P(X_n = k) = {m \choose k} \sum_{i=0}^k {(-1)^{k-i}{k \choose i}}{\left(\frac{i}{m}\right)^n}$

Then you can use a maximum likelihood estimator.
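As a rough illustration of that suggestion, the sketch below evaluates the probability function quoted above with exact rational arithmetic and picks the $m$ that maximizes it over a finite grid. The function names and the search cap `m_max` are hypothetical choices made for this example.

```python
from fractions import Fraction
from math import comb


def occupancy_pmf(k, n, m):
    """P(X_n = k): exactly k distinct labels in n draws with replacement
    from m equally likely labels (the formula quoted above)."""
    if k < 0 or k > min(n, m):
        return Fraction(0)
    total = sum((-1) ** (k - i) * comb(k, i) * Fraction(i, m) ** n
                for i in range(k + 1))
    return comb(m, k) * total


def mle_population(k, n, m_max):
    """Grid-search maximum likelihood estimate of m over k..m_max.

    m_max is an arbitrary cap on the search (an assumption of this sketch).
    If k == n the likelihood keeps increasing with m and the cap is returned.
    """
    candidates = range(max(k, 1), m_max + 1)
    return max(candidates, key=lambda m: occupancy_pmf(k, n, m))


# sanity check: the probabilities over all k sum to one
print(sum(occupancy_pmf(k, 5, 4) for k in range(6)))
# estimate m after seeing 8 distinct labels in 10 draws
print(mle_population(8, 10, 60))
```

Exact fractions keep the alternating sum free of cancellation error, at the cost of speed; for large $n$ a log-scale computation would be more practical.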

Another formula with proof is given here to solve the occupancy problem.

— sylvain

1

Likelihood function and probability

In an answer to a question about the reverse birthday problem, a solution for a likelihood function has been given by Cody Maughan.

The likelihood function for the number of fortune cookie types $m$ when we draw $k$ different fortune cookies in $n$ draws (where every fortune cookie type has equal probability of appearing in a draw) can be expressed as:

$\begin{array}{rcl} \mathcal{L}(m \, \vert \, k,n ) = m^{-n} \frac{m!}{(m-k)!} \propto P(k \, \vert \, m,n) &=& m^{-n}\frac{m!}{(m-k)!} \cdot \underbrace{S(n,k)}_{\begin{subarray}{l}\text{Stirling number }\\ \text{of the 2nd kind}\end{subarray}}\\ &=& m^{-n}\frac{m!}{(m-k)!} \cdot \frac{1}{k!} \sum_{i=0}^k {(-1)^{i}{k \choose i}}{(k-i)^n} \\ &=& {{m}\choose{k}} \sum_{i=0}^k {(-1)^{i}{k \choose i}}{\left(\frac{k-i}{m}\right)^n} \end{array}$

For a derivation of the probability on the right-hand side, see the occupancy problem. This has been described before on this website by Ben. The expression is similar to the one in the answer by Sylvain.

Maximum likelihood estimate

We can compute first order and second order approximations of the maximum of the likelihood function at

$m_1 \approx \frac{ {{n}\choose{2}}}{n-k}$

$m_2 \approx \frac{ {{n}\choose{2}} + \sqrt{{{n}\choose{2}}^2 - 4(n-k) {{n}\choose{3}}}}{2(n-k)}$
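The two approximations can be evaluated directly, as in the hedged sketch below. One way to read them (an assumption, since the answer does not show the derivation) is as solutions of $k = n - \binom{n}{2}/m + \binom{n}{3}/m^2 - \dots$, the binomial expansion of $m\left(1-(1-1/m)^n\right)$, truncated after one or two terms. The function names are illustrative.

```python
from math import comb, sqrt


def expected_uniques(m, n):
    """Expected number of distinct types in n draws from m equally likely types."""
    return m * (1 - (1 - 1 / m) ** n)


def m_first_order(n, k):
    """First-order approximation m_1 = C(n,2) / (n - k)."""
    if k >= n:
        raise ValueError("needs k < n: with no repeats the estimate is unbounded")
    return comb(n, 2) / (n - k)


def m_second_order(n, k):
    """Second-order approximation m_2 (larger root of the quadratic in m)."""
    if k >= n:
        raise ValueError("needs k < n: with no repeats the estimate is unbounded")
    a, b, c = n - k, comb(n, 2), comb(n, 3)
    disc = b * b - 4 * a * c
    if disc < 0:
        # many repeats relative to n: the truncated expansion has no real root
        raise ValueError("second-order approximation is not defined here")
    return (b + sqrt(disc)) / (2 * a)


n, k = 200, 150
for label, m in (("m1", m_first_order(n, k)), ("m2", m_second_order(n, k))):
    # compare the expected number of uniques at each estimate with the observed k
    print(label, m, expected_uniques(m, n))
```

The closer the expected number of uniques at an estimate is to the observed $k$, the better that estimate matches the data under this reading.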

Likelihood interval

(Note: this is not the same as a confidence interval; see: The basic logic of constructing a confidence interval.)

This remains an open problem for me. I am not sure yet how to deal with the expression $m^{-n} \frac{m!}{(m-k)!}$ (of course one can compute all values and select the boundaries based on that, but it would be nicer to have some explicit exact formula or estimate). I cannot seem to relate it to any other distribution, which would greatly help to evaluate it. But I feel that a nice (simple) expression could be possible from this likelihood interval approach.
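The brute-force route mentioned in the parenthesis could look like the sketch below: evaluate the log of $m^{-n} \frac{m!}{(m-k)!}$ on a grid and keep every $m$ whose log-likelihood lies within a fixed drop of the maximum. The drop of 1.92 (half the 95% quantile of a chi-squared distribution with one degree of freedom) is a common convention and an assumption here, as are the function names and the cap `m_max`.

```python
from math import lgamma, log


def log_likelihood(m, n, k):
    """log of m^(-n) * m! / (m - k)!, the part of the likelihood that depends on m."""
    return -n * log(m) + lgamma(m + 1) - lgamma(m - k + 1)


def likelihood_interval(n, k, m_max, drop=1.92):
    """Return (lower, m_hat, upper) over the grid m = k..m_max (requires k >= 1).

    drop: how far the log-likelihood may fall below its maximum
          (1.92 is an assumed conventional cutoff, not from the answer).
    """
    ms = range(k, m_max + 1)
    ll = [log_likelihood(m, n, k) for m in ms]
    best = max(ll)
    m_hat = ms[ll.index(best)]
    keep = [m for m, value in zip(ms, ll) if value >= best - drop]
    return keep[0], m_hat, keep[-1]


print(likelihood_interval(200, 150, 5000))
```

If the returned upper end equals `m_max`, the grid was too short and the interval is truncated.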

Confidence interval

For the confidence interval we can use a normal approximation. In Ben's answer the following mean and variance are given:

$\mathbb{E}[K] = m \left(1-\left(1 - \frac{1}{m}\right)^n\right)$

$\mathbb{V}[K] = m \left(\left(m-1\right)\left(1-\frac{2}{m}\right)^n + \left(1 - \frac{1}{m}\right)^n - m \left(1 - \frac{1}{m}\right)^{2n} \right)$

Say, for a given sample size $n=200$ and an observed number of unique cookies $k$, the 95% boundaries $\mathbb{E}[K] \pm 1.96 \sqrt{\mathbb{V}[K]}$ look like:

In the image above the curves for the interval have been drawn by expressing the lines as a function of the population size $m$ and sample size $n$ (so the x-axis is the dependent variable in drawing these curves).

The difficulty is to invert this and obtain the interval values for a given observed value $k$. It can be done computationally, but there might be a more direct function.
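A minimal sketch of the computational inversion, assuming a simple scan over candidate population sizes: an $m$ is kept whenever the observed $k$ falls inside $\mathbb{E}[K] \pm 1.96 \sqrt{\mathbb{V}[K]}$ computed for that $m$. The function names and the cap `m_max` are hypothetical.

```python
from math import sqrt


def mean_uniques(m, n):
    """E[K] for n draws from m equally likely types."""
    return m * (1 - (1 - 1 / m) ** n)


def var_uniques(m, n):
    """V[K] for n draws from m equally likely types."""
    return m * ((m - 1) * (1 - 2 / m) ** n
                + (1 - 1 / m) ** n
                - m * (1 - 1 / m) ** (2 * n))


def normal_interval_for_m(k, n, m_max, z=1.96):
    """Smallest and largest m in 1..m_max whose normal band contains k."""
    accepted = []
    for m in range(1, m_max + 1):
        mu = mean_uniques(m, n)
        sd = sqrt(max(var_uniques(m, n), 0.0))  # guard against rounding below zero
        if mu - z * sd <= k <= mu + z * sd:
            accepted.append(m)
    if not accepted:
        return None
    return accepted[0], accepted[-1]


print(normal_interval_for_m(150, 200, 5000))
```

A bisection on each boundary curve would be faster than the full scan, but the scan makes no assumption about the shape of the curves.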

In the image I have also added Clopper Pearson confidence intervals based on a direct computation of the cumulative distribution based on all the probabilities $P(k \, \vert \, m,n)$ (I did this in R where I needed to use the Strlng2 function from the CryptRndTest package which is an asymptotic approximation of the logarithm of the Stirling number of the second kind). You can see that the boundaries coincide reasonably well, so the normal approximation is performing well in this case.

# function to compute Probability
library("CryptRndTest")
P5 <- function(m,n,k) {
  exp(-n*log(m)+lfactorial(m)-lfactorial(m-k)+Strlng2(n,k))
}
P5 <- Vectorize(P5)

# function for expected value 
m4 <- function(m,n) {
  m*(1-(1-1/m)^n)
}

# function for variance
v4 <- function(m,n) {
  m*((m-1)*(1-2/m)^n+(1-1/m)^n-m*(1-1/m)^(2*n))
}


# compute 95% boundaries based on Clopper-Pearson intervals
# first a distribution is computed
# then the 2.5% and 97.5% boundaries of the cumulative values are located
simDist <- function(m,n,p=0.05) {
  k <- 1:min(n,m)
  dist <- P5(m,n,k)
  dist[is.na(dist)] <- 0
  dist[dist == Inf] <- 0
  c(max(which(cumsum(dist)<p/2))+1,
       min(which(cumsum(dist)>1-p/2))-1)
}


# some values for the example
n <- 200
m <- 1:5000
k <- 1:n

# compute the Clopper-Pearson intervals
res <- sapply(m, FUN = function(x) {simDist(x,n)})


# plot the maximum likelihood estimate
plot(m4(m,n),m,
     log="", ylab="estimated population size m", xlab = "observed uniques k",
     xlim =c(1,200),ylim =c(1,5000),
     pch=21,col=1,bg=1,cex=0.7, type = "l", yaxt = "n")
axis(2, at = c(0,2500,5000))

# add lines for confidence intervals based on normal approximation
lines(m4(m,n)+1.96*sqrt(v4(m,n)),m, lty=2)
lines(m4(m,n)-1.96*sqrt(v4(m,n)),m, lty=2)
# add lines for confidence intervals based on Clopper-Pearson
lines(res[1,],m,col=3,lty=2)
lines(res[2,],m,col=3,lty=2)

# add legend
legend(0,5100,
       c("MLE","95% interval\n(Normal Approximation)\n","95% interval\n(Clopper-Pearson)\n")
       , lty=c(1,2,2), col=c(1,1,3),cex=0.7,
       box.col = rgb(0,0,0,0))

— Sextus Empiricus

For the case of unequal probabilities, you can approximate the number of cookies of a particular type as independent Binomial/Poisson distributed variables and describe whether each type is filled (seen at least once) or not as a Bernoulli variable. Then add together the means and variances of those variables. I guess that this is also how Ben derived/approximated the expected value and variance. A problem is how to describe these different probabilities: you cannot do this explicitly, since you do not know the number of cookies.
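The approximation described in this comment might be sketched as follows, assuming the type probabilities are known (which, as noted, they are not in practice). Each type's count is treated as an independent Poisson variable with mean $n p_i$, so the type is seen with probability $1 - e^{-n p_i}$. The function name is illustrative.

```python
from math import exp


def uniques_mean_var(probs, n):
    """Approximate mean and variance of the number of distinct types seen.

    probs: assumed-known probabilities of each type (should sum to 1).
    The indicators "type i was seen" are treated as independent
    Bernoulli variables, which is an approximation.
    """
    seen = [1 - exp(-n * p) for p in probs]   # P(type i appears at least once)
    mean = sum(seen)                          # sum of Bernoulli means
    var = sum(s * (1 - s) for s in seen)      # sum of Bernoulli variances
    return mean, var


equal = [1 / 300] * 300
skewed = [1.5 / 300] * 150 + [0.5 / 300] * 150
print(uniques_mean_var(equal, 200))
print(uniques_mean_var(skewed, 200))
```

For the same number of types, unequal probabilities give fewer expected distinct cookies than equal ones, which is why assuming equal frequencies tends to underestimate the population size.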

— Sextus Empiricus