I will (attempt to) answer this in three steps: first, establish exactly what we mean by a univariate smooth. Next, describe a multivariate smooth (specifically, a smooth of two variables). Finally, I will do my best to describe a tensor product smooth.
1) Univariate smooth
Let's say we have some response data y that we conjecture to be an unknown function f of a predictor variable x, plus some error ε. The model would be:
$$y = f(x) + \varepsilon$$
Now, in order to fit this model, we have to identify the functional form of f. The way we do this is by identifying basis functions, which are superposed to represent the function f in its entirety. A very simple example is linear regression, in which the basis functions are just the constant 1 (the intercept, with coefficient β1) and x itself (with coefficient β2). Applying the basis expansion, we have
$$y = \beta_1 + \beta_2 x + \varepsilon$$
In matrix form, we would have:
$$Y = X\beta + \varepsilon$$
where Y is an n-by-1 column vector, X is an n-by-2 model matrix, β is a 2-by-1 column vector of model coefficients, and ε is an n-by-1 column vector of errors. X has two columns because there are two terms in our basis expansion: the intercept and the linear term.
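To make this concrete, here is a minimal R sketch of that model matrix and the least-squares solution (the data are simulated purely for illustration):

```r
set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.3)  # made-up data: intercept 2, slope 3
X <- model.matrix(~ x)                 # n-by-2: a column of 1s and a column of x
beta <- solve(t(X) %*% X, t(X) %*% y)  # ordinary least squares: (X'X)^-1 X'y
beta                                   # approximately (2, 3)
```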
The same principle applies to basis expansion in MGCV, although the basis functions are much more sophisticated. Specifically, individual basis functions need not be defined over the entire domain of the independent variable x. This is often the case when knot-based bases are used (see the "knot-based example"). The model is then represented as the sum of the basis functions, each evaluated at every value of the independent variable. However, as I mentioned, some of these basis functions are zero outside a given interval and therefore do not contribute to the basis expansion outside that interval. For example, consider a cubic spline basis in which each basis function is symmetric about a different value (knot) of the independent variable; in other words, every basis function looks the same but is merely shifted along the axis of the independent variable (this is a simplification, since any realistic basis will also include an intercept and a linear term, but hopefully you get the idea).
To be explicit, a basis expansion with i−2 spline basis functions (dimension i overall, counting the intercept and linear term) could look like:
$$y = \beta_1 + \beta_2 x + \beta_3 f_1(x) + \beta_4 f_2(x) + \dots + \beta_i f_{i-2}(x) + \varepsilon$$
where each function f is, perhaps, a cubic function of the independent variable x.
The matrix equation Y=Xβ+ε can still be used to represent our model. The only difference is that X is now an n-by-i matrix; that is, it has a column for every term in the basis expansion (including the intercept and linear term). Since the process of basis expansion has allowed us to represent the model in the form of a matrix equation, we can use linear least squares to fit the model and find the coefficients β.
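As a sketch, here is such an unpenalized spline fit in R, using the cubic B-spline basis bs() from the splines package as a stand-in for whatever knot-based basis you prefer (continuing with the x and y simulated above):

```r
library(splines)
fit <- lm(y ~ bs(x, df = 8))  # basis expansion: intercept plus 8 cubic B-spline columns
X <- model.matrix(fit)        # the n-by-9 model matrix, one column per basis term
coef(fit)                     # the fitted coefficient vector beta
```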
What we have so far is an example of unpenalized regression, whereas one of the main strengths of MGCV is its smoothness estimation via a penalty matrix and a smoothing parameter. In other words, instead of:
$$\beta = (X^T X)^{-1} X^T Y$$
we have:
$$\beta = (X^T X + \lambda S)^{-1} X^T Y$$
where S is a quadratic i-by-i penalty matrix and λ is a scalar smoothing parameter. I will not go into the specification of the penalty matrix here, but it should suffice to say that for any given basis expansion of some independent variable and definition of a quadratic "wiggliness" penalty (for example, a second-derivative penalty), one can calculate the penalty matrix S.
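If you want to see these pieces explicitly, mgcv's smoothCon() will hand you both the model matrix of an expanded basis and the corresponding penalty matrix S, after which you can do the penalized least-squares solve yourself. A minimal sketch with simulated data and an arbitrary fixed λ:

```r
library(mgcv)
set.seed(2)
dat <- data.frame(x = runif(200))
dat$y <- sin(2 * pi * dat$x) + rnorm(200, sd = 0.3)       # wiggly made-up truth
sm <- smoothCon(s(x, bs = "cr", k = 10), data = dat)[[1]] # cubic regression spline basis
X <- sm$X                                                 # 200-by-10 model matrix
S <- sm$S[[1]]                                            # 10-by-10 penalty matrix
lambda <- 0.1                                             # fixed, arbitrary smoothing parameter
beta <- solve(t(X) %*% X + lambda * S, t(X) %*% dat$y)    # penalized least squares
```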
MGCV can use various means of estimating the optimal smoothing parameter λ. I will not go into that subject since my goal here was to give a broad overview of how a univariate smooth is constructed, which I believe I have done.
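For completeness, the usual workflow just lets gam() estimate λ, e.g. by REML (continuing from the sketch above):

```r
fit <- gam(y ~ s(x, bs = "cr", k = 10), data = dat, method = "REML")
fit$sp  # the estimated smoothing parameter lambda
```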
2) Multivariate smooth
The above explanation can be generalized to multiple dimensions. Let's go back to our model that gives the response y as a function f of predictors x and z. The restriction to two independent variables will prevent cluttering the explanation with arcane notation. The model is then:
$$y = f(x, z) + \varepsilon$$
Now, it should be intuitively obvious that we are going to represent f(x,z) with a basis expansion (that is, a superposition of basis functions), just like we did in the univariate case of f(x) above. It should also be obvious that at least one, and almost certainly many more, of these basis functions must be functions of both x and z (if this were not the case, then f would implicitly be separable, such that f(x,z)=fx(x)+fz(z)). A visual illustration of a multidimensional spline basis can be found here. A full two-dimensional basis expansion with i−3 such basis functions (dimension i overall, counting the intercept and the two linear terms) could look something like:
$$y = \beta_1 + \beta_2 x + \beta_3 z + \beta_4 f_1(x, z) + \dots + \beta_i f_{i-3}(x, z) + \varepsilon$$
I think it's pretty clear that we can still represent this in matrix form with:
$$Y = X\beta + \varepsilon$$
by simply evaluating each basis function at every unique combination of x and z. The solution is still:
$$\beta = (X^T X)^{-1} X^T Y$$
Computing the second derivative penalty matrix is very much the same as in the univariate case, except that instead of integrating the second derivative of each basis function with respect to a single variable, we integrate the sum of all second derivatives (including partials) with respect to all independent variables. The details of the foregoing are not especially important: the point is that we can still construct penalty matrix S and use the same method to get the optimal value of smoothing parameter λ, and given that smoothing parameter, the vector of coefficients is still:
$$\beta = (X^T X + \lambda S)^{-1} X^T Y$$
Now, this two-dimensional smooth has an isotropic penalty: this means that a single value of λ applies in both directions. This works fine when both x and z are on approximately the same scale, such as a spatial application. But what if we replace spatial variable z with temporal variable t? The units of t may be much larger or smaller than the units of x, and this can throw off the integration of our second derivatives because some of those derivatives will contribute disproportionately to the overall integration (for example, if we measure t in nanoseconds and x in light years, the integral of the second derivative with respect to t may be vastly larger than the integral of the second derivative with respect to x, and thus "wiggliness" along the x direction may go largely unpenalized). Slide 15 of the "smooth toolbox" I linked has more detail on this topic.
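You can observe the isotropy directly in mgcv: s(x, z) builds a single two-variable smooth (a thin plate regression spline by default) governed by one smoothing parameter, and rescaling one variable changes the resulting fit. A sketch with simulated data:

```r
library(mgcv)
set.seed(3)
dat <- data.frame(x = runif(300), z = runif(300))
dat$y <- sin(2 * pi * dat$x) * cos(2 * pi * dat$z) + rnorm(300, sd = 0.2)
fit_iso <- gam(y ~ s(x, z, k = 30), data = dat, method = "REML")
fit_iso$sp  # a single smoothing parameter for both directions
# Rescaling z, as if it were measured in different units, changes the fit:
dat2 <- transform(dat, z = 1000 * z)
fit_iso2 <- gam(y ~ s(x, z, k = 30), data = dat2, method = "REML")
```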
It is worth noting that we did not decompose the basis functions into marginal bases of x and z. The implication here is that multivariate smooths must be constructed from bases supporting multiple variables. Tensor product smooths support construction of multivariate bases from univariate marginal bases, as I explain below.
3) Tensor product smooths
Tensor product smooths address the issue of modeling responses to interactions of multiple inputs with different units. Let's suppose we have a response y that is a function f of spatial variable x and temporal variable t. Our model is then:
$$y = f(x, t) + \varepsilon$$
What we'd like to do is construct a two-dimensional basis for the variables x and t. This will be a lot easier if we can represent f as:
$$f(x, t) = f_x(x) \, f_t(t)$$
In an algebraic / analytical sense, this is not necessarily possible. But remember, we are discretizing the domains of x and t (imagine a two-dimensional "lattice" defined by the locations of knots on the x and t axes) such that the "true" function f is represented by the superposition of basis functions. Just as we assumed that a very complex univariate function may be approximated by a simple cubic function on a specific interval of its domain, we may assume that the non-separable function f(x,t) may be approximated by the product of simpler functions fx(x) and ft(t) on an interval—provided that our choice of basis dimensions makes those intervals sufficiently small!
Our basis expansion, given an i-dimensional basis in x and j-dimensional basis in t, would then look like:
$$\begin{aligned}
y = {} & \beta_1 + \beta_2 x + \beta_3 f_{x1}(x) + \beta_4 f_{x2}(x) + \dots + \beta_i f_{x(i-2)}(x) \\
& + \beta_{i+1} t + \beta_{i+2} t x + \beta_{i+3} t f_{x1}(x) + \beta_{i+4} t f_{x2}(x) + \dots + \beta_{2i} t f_{x(i-2)}(x) \\
& + \beta_{2i+1} f_{t1}(t) + \beta_{2i+2} f_{t1}(t) x + \beta_{2i+3} f_{t1}(t) f_{x1}(x) + \beta_{2i+4} f_{t1}(t) f_{x2}(x) + \dots + \beta_{3i} f_{t1}(t) f_{x(i-2)}(x) \\
& + \dots + \beta_{ij} f_{t(j-2)}(t) f_{x(i-2)}(x) + \varepsilon
\end{aligned}$$
Which may be interpreted as a tensor product. Imagine that we evaluated each basis function in x and t, thereby constructing n-by-i and n-by-j model matrices X and T, respectively. We could then compute the n²-by-ij tensor product X⊗T of these two model matrices and keep only the rows that pair each observation with itself (equivalently, take the row-wise Kronecker product of X and T), so that the result is an n-by-ij model matrix in which each column represents a unique combination of an x basis function and a t basis function. Recall that the marginal model matrices had i and j columns, respectively; these values correspond to their respective basis dimensions. Our new two-variable basis should then have dimension ij, and therefore the same number of columns in its model matrix.
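mgcv exposes exactly this row-wise construction as tensor.prod.model.matrix(). A sketch, reusing smoothCon() to build the marginal bases (the dimensions i = 7 and j = 5 are arbitrary choices of mine):

```r
library(mgcv)
set.seed(4)
dat <- data.frame(x = runif(200), t = runif(200))
Xm <- smoothCon(s(x, bs = "cr", k = 7), data = dat)[[1]]$X  # n-by-i marginal matrix in x
Tm <- smoothCon(s(t, bs = "cr", k = 5), data = dat)[[1]]$X  # n-by-j marginal matrix in t
# Row r of the result is kronecker(Xm[r, ], Tm[r, ]): every x basis function
# times every t basis function, evaluated at observation r.
XT <- tensor.prod.model.matrix(list(Xm, Tm))
dim(XT)  # 200 by 35, i.e. n-by-ij
```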
NOTE: I'd like to point out that since we explicitly constructed the tensor product basis functions by taking products of marginal basis functions, tensor product bases may be constructed from marginal bases of any type. They need not support more than one variable, unlike the multivariate smooth discussed above.
In reality, this process results in an overall basis expansion of dimension ij−i−j+1, because the full multiplication includes multiplying every t basis function by the x-basis intercept (duplicating the t marginal basis, so we subtract j) as well as multiplying every x basis function by the t-basis intercept (duplicating the x marginal basis, so we subtract i), and we then add the single intercept back in by itself (so we add 1). This is known as applying an identifiability constraint.
So we can represent this as:
$$y = \beta_1 + \beta_2 x + \beta_3 t + \beta_4 f_1(x, t) + \beta_5 f_2(x, t) + \dots + \beta_{ij-i-j+1} f_{ij-i-j-2}(x, t) + \varepsilon$$
Where each of the multivariate basis functions f is the product of a pair of marginal x and t basis functions. Again, it's pretty clear having constructed this basis that we can still represent this with the matrix equation:
$$Y = X\beta + \varepsilon$$
Which (still) has the solution:
$$\beta = (X^T X)^{-1} X^T Y$$
Where the model matrix X has ij−i−j+1 columns. As for the penalty terms Jx and Jt, these are constructed separately for each independent variable from its marginal penalty matrix, as follows:
$$J_x = \beta^T (I_j \otimes S_x) \beta$$
and,
$$J_t = \beta^T (S_t \otimes I_i) \beta$$
This allows for an overall anisotropic (different in each direction) penalty (Note: the penalties on the second derivative of x are added up at each knot on the t axis, and vice versa). The smoothing parameters λx and λt may now be estimated in much the same way as the single smoothing parameter was for the univariate and multivariate smooths. The result is that the overall shape of a tensor product smooth is invariant to rescaling of its independent variables.
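This is what mgcv's te() constructs. A sketch demonstrating the two separate smoothing parameters and the invariance to rescaling (data simulated for illustration):

```r
library(mgcv)
set.seed(5)
dat <- data.frame(x = runif(400), t = runif(400))
dat$y <- sin(2 * pi * dat$x) * cos(2 * pi * dat$t) + rnorm(400, sd = 0.2)
fit_te <- gam(y ~ te(x, t, k = c(7, 5)), data = dat, method = "REML")
fit_te$sp  # two smoothing parameters: one for x, one for t
# Rescale t (say, from seconds to nanoseconds); the fitted surface is unchanged:
dat2 <- transform(dat, t = 1e9 * t)
fit_te2 <- gam(y ~ te(x, t, k = c(7, 5)), data = dat2, method = "REML")
max(abs(fitted(fit_te) - fitted(fit_te2)))  # essentially zero
```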
I recommend reading all the vignettes on the MGCV website, as well as "Generalized Additive Models: An Introduction with R." Long live Simon Wood.