How to choose an optimal number of latent factors in non-negative matrix factorization?



Given a matrix $V_{m \times n}$, non-negative matrix factorization (NMF) finds two non-negative matrices $W_{m \times k}$ and $H_{k \times n}$ (i.e., with all elements $\geq 0$) to represent the decomposition as

$$V \approx WH,$$

e.g. by requiring non-negative $W$ and $H$ to minimize the reconstruction error

$$\|V - WH\|^2.$$

Are there common practices to estimate the number k in NMF? How could, for example, cross validation be used for that purpose?
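For concreteness, here is a minimal sketch of this setup, assuming scikit-learn's NMF (the toy matrix and solver settings are arbitrary, not part of the question):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative data matrix V (m x n); values are illustrative.
rng = np.random.default_rng(0)
V = rng.random((20, 10))

# Factorize V ~ W H with k latent factors, minimizing ||V - WH||^2.
k = 3
model = NMF(n_components=k, init="random", random_state=0, max_iter=1000)
W = model.fit_transform(V)   # shape (m, k), all elements >= 0
H = model.components_        # shape (k, n), all elements >= 0

print(f"k={k}: squared Frobenius error = {np.linalg.norm(V - W @ H)**2:.4f}")
```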


I don't have any citations (and actually I did a quick search on Google Scholar and failed to find any), but I believe that cross-validation should be possible.
amoeba says Reinstate Monica

Could you tell me more details about how to perform cross-validation for NMF? The Frobenius norm of the reconstruction error will always decrease as k increases.
Steve Sailer

What are you doing NMF for? Is it to represent V in a lower-dimensional space (unsupervised), or is it to provide recommendations (supervised)? How big is your V? Do you need to explain a certain percentage of the variance? You can apply CV after you define your objective metric. I would encourage you to think of the application and find a metric that makes sense.
ignorant

Answers:



To choose an optimal number of latent factors in non-negative matrix factorization, use cross-validation.

As you wrote, the aim of NMF is to find low-dimensional $W$ and $H$ with all non-negative elements minimizing the reconstruction error $\|V - WH\|^2$. Imagine that we leave out one element of $V$, e.g. $V_{ab}$, and perform NMF of the resulting matrix with one missing cell. This means finding $W$ and $H$ minimizing the reconstruction error over all non-missing cells:

$$\sum_{(i,j) \ne (a,b)} \left(V_{ij} - [WH]_{ij}\right)^2.$$

Once this is done, we can predict the left-out element $V_{ab}$ by computing $[WH]_{ab}$ and calculate the prediction error

$$e_{ab} = \left(V_{ab} - [WH]_{ab}\right)^2.$$

One can repeat this procedure leaving out all elements $V_{ab}$ one at a time, and sum up the prediction errors over all $a$ and $b$. This will result in an overall PRESS value (predicted residual sum of squares) $E(k) = \sum_{ab} e_{ab}$ that will depend on $k$. Hopefully the function $E(k)$ will have a minimum that can be used as an 'optimal' $k$.

Note that this can be computationally costly, because the NMF has to be repeated for each left out value, and might also be tricky to program (depending on how easy it is to perform NMF with missing values). In PCA one can get around this by leaving out full rows of V (which accelerates the computations a lot), see my reply in How to perform cross-validation for PCA to determine the number of principal components?, but this is not possible here.

Of course all the usual principles of cross-validation apply here, so one can leave out many cells at a time (instead of only a single one), and/or repeat the procedure for only some random cells instead of looping over all cells. Both approaches can help accelerate the process.
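A minimal sketch of this procedure, assuming NumPy. Since common library NMF routines do not handle missing values, the sketch uses hand-rolled multiplicative updates that simply ignore held-out cells, and it holds out many random cells at a time (the fold-based shortcut just mentioned) rather than one cell per fit:

```python
import numpy as np

def masked_nmf(V, mask, k, n_iter=500, eps=1e-9, seed=0):
    """Multiplicative-update NMF that fits only the cells where mask == 1."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W, H = rng.random((m, k)), rng.random((k, n))
    MV = mask * V
    for _ in range(n_iter):
        H *= (W.T @ MV) / (W.T @ (mask * (W @ H)) + eps)
        W *= (MV @ H.T) / ((mask * (W @ H)) @ H.T + eps)
    return W, H

def press(V, k, n_folds=10, seed=0):
    """PRESS E(k): hold out random cells fold by fold, sum prediction errors."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=V.shape)   # assign every cell to a fold
    error = 0.0
    for f in range(n_folds):
        mask = (folds != f).astype(float)            # 0 = held out, 1 = training
        W, H = masked_nmf(V, mask, k)
        error += np.sum((1 - mask) * (V - W @ H) ** 2)
    return error

rng = np.random.default_rng(1)
V = rng.random((30, 20))
for k in range(1, 6):
    print(k, press(V, k))        # pick k near the minimum of E(k)
```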

Edit (Mar 2019): See this very nice illustrated write-up by @AlexWilliams: http://alexhwilliams.info/itsneuronalblog/2018/02/26/crossval. Alex uses https://github.com/kimjingu/nonnegfac-python for NMF with missing values.



To my knowledge, there are two good criteria: 1) the cophenetic correlation coefficient and 2) comparing the residual sum of squares (RSS) against randomized data for a set of ranks (maybe there is a name for that, but I don't remember); both are sketched in code after the list.

  1. Cophenetic correlation coefficient: You repeat NMF several times per rank and calculate how similar the results are. In other words, how stable the identified clusters are, given that the initial seed is random. Choose the highest K before the cophenetic coefficient drops.

  2. RSS against randomized data: For any dimensionality-reduction approach, there is always a loss of information compared to your original data (estimated by the RSS). Now perform NMF for increasing K and calculate the RSS with both your original dataset and a randomized dataset. When comparing the RSS as a function of K, the RSS decreases with increasing K in the original dataset, but this is less the case for the randomized dataset. By comparing both slopes, there should be a K where they cross. In other words, how much information can you afford to lose (= the highest K) before being within the noise.
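A rough sketch of both criteria, assuming scikit-learn's NMF. Clustering each column by its dominant latent factor and randomizing the data by shuffling within columns are illustrative assumptions on my part, not prescriptions from the papers cited below:

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import squareform
from sklearn.decomposition import NMF

def rss(V, k, seed=0):
    """Residual sum of squares of a rank-k NMF fit."""
    model = NMF(n_components=k, init="random", random_state=seed, max_iter=500)
    W = model.fit_transform(V)
    return np.sum((V - W @ model.components_) ** 2)

def cophenetic_coefficient(V, k, n_runs=20, seed=0):
    """Criterion 1: stability of the column clusterings over random restarts."""
    n = V.shape[1]
    consensus = np.zeros((n, n))
    for run in range(n_runs):
        model = NMF(n_components=k, init="random", random_state=seed + run,
                    max_iter=500)
        model.fit(V)
        labels = model.components_.argmax(axis=0)   # dominant factor per column
        consensus += labels[:, None] == labels[None, :]
    dist = squareform(1 - consensus / n_runs, checks=False)
    c, _ = cophenet(linkage(dist, method="average"), dist)
    return c

rng = np.random.default_rng(0)
V = rng.random((50, 30))
V_rand = rng.permuted(V, axis=0)   # criterion 2: shuffle each column independently

for k in range(2, 8):
    print(k, cophenetic_coefficient(V, k), rss(V, k), rss(V_rand, k))
```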

Hope I was clear enough.

Edit: I have found these articles.

1. Jean-P. Brunet, Pablo Tamayo, Todd R. Golub and Jill P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences of the USA, 101(12): 4164-4169, 2004.

2. Attila Frigyesi and Mattias Hoglund. Non-negative matrix factorization for the analysis of complex gene expression data: identification of clinically relevant tumor subtypes. Cancer Informatics, 6: 275-292, 2008.


It's not clear why the RSS of random data should be lower than the RSS computed with the original data when K is small. For the rest, I understand that the RSS of random data should decrease more slowly than that of the original data.
Malik Koné


In the NMF factorization, the parameter $k$ (noted $r$ in most of the literature) is the rank of the approximation of $V$ and is chosen such that $k < \min(m, n)$. The choice of the parameter determines the representation of your data $V$ in an over-complete basis composed of the columns of $W$: the $w_i$, $i = 1, 2, \ldots, k$. The result is that the ranks of the matrices $W$ and $H$ have an upper bound of $k$, and the product $WH$ is a low-rank approximation of $V$, of rank at most $k$ as well. Hence the choice of $k < \min(m, n)$ should constitute a dimensionality reduction where $V$ can be generated/spanned from the aforementioned basis vectors.

Further details can be found in chapter 6 of this book by S. Theodoridis and K. Koutroumbas.

After minimization of your chosen cost function with respect to $W$ and $H$, the optimal choice of $k$ (chosen empirically by working with different feature sub-spaces) should give $\hat{V}$, an approximation of $V$, with features representative of your initial data matrix $V$.

Working with different feature sub-spaces in the sense that $k$, the number of columns of $W$, is the number of basis vectors in the NMF sub-space. Empirically working with different values of $k$ is tantamount to working with different dimensionality-reduced feature spaces; a quick numeric check of the rank claim follows below.
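Here is that check with toy matrices, assuming NumPy (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 40, 25, 5            # k < min(m, n)
W, H = rng.random((m, k)), rng.random((k, n))

# The product of an (m x k) and a (k x n) matrix has rank at most k.
print(np.linalg.matrix_rank(W @ H))   # prints 5
```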


But the question was about how to choose the optimal k! Can you provide any insights about that?
amoeba says Reinstate Monica

@amoeba Unless I misread the initial question, it is "Are there common practices to estimate the number k in NMF?". The optimal k is chosen empirically. I have expanded my answer.
Gilles

Your explanation of the NMF factorization makes total sense, but the initial question was specifically about the common practices to estimate k. Now you wrote that one can choose k "empirically" (okay) "by working with different feature sub-spaces". I am not sure I understand what "working with different feature sub-spaces" means, could you expand on that? How should one work with them? What is the recipe to choose k? This is what the question is about (at least as I understood it). Will be happy to revert my downvote!
amoeba says Reinstate Monica

I appreciate your edits, and am very sorry for being so dumb. But let's say I have my data, and I [empirically] try various values of k between 1 and 50. How am I supposed to choose the one that worked best? This is how I understand the original question, and I cannot find anything in your reply about that. Please let me know if I missed it, or if you think that the original question was different.
amoeba says Reinstate Monica

@amoeba That will depend on your application, data, and what you want to accomplish. Is it just dimensionality reduction, or source separation, etc.? In audio applications, for instance, say source separation, the optimal k would be the one that gives you the best quality when listening to the separated audio sources. The motivation for the choice here will of course be different if you were working with images, for instance.
Gilles
Licensed under cc by-sa 3.0 with attribution required.