14

I am trying to understand the intuition behind kernel SVMs. I understand how linear SVMs work: a decision line is drawn that splits the data as well as it can. I also understand the principle behind moving the data to a higher-dimensional space, and how this can make it easier to find a linear decision line in the new space. What I do not understand is how a kernel is used to project the data points into this new space.

What I know about a kernel is that it effectively represents the "similarity" between two data points. But how does this relate to the projection?

machine-learning svm kernel-trick

— Karnivaurus

3

If you go to a high enough dimensional space, all the training data points can be perfectly separated by a plane. That does not mean it will have any predictive power. I think going to a very high dimensional space is the moral equivalent of (a form of) overfitting.

— Mark L. Stone

@Mark L. Stone: that is correct (+1), but it may still be a good question to ask how a kernel can map into an infinite-dimensional space. How does that work? I tried; see my answer.

I would be careful about calling the feature mapping a "projection". The feature mapping is, in general, a nonlinear transformation.

— Paul

A very helpful post on the kernel trick shows the inner product space of the kernel and describes how high-dimensional feature vectors are used to achieve this; hopefully it answers the question concisely: eric-kim.net/eric-kim-net/posts/1/kernel_trick.html

— JStrahl

5

Let $h(x)$ be the projection to the high-dimensional space $\mathcal{F}$. Basically, the kernel function is $K(x_1,x_2)=\langle h(x_1),h(x_2)\rangle$, which is the inner product. So it is not used to project the data points; rather, it is an outcome of the projection. It can be considered a measure of similarity, but in an SVM it is more than that.

The optimization for finding the best separating hyperplane in $\mathcal{F}$ involves $h(x)$ only through the inner product. That is to say, if you know $K(\cdot,\cdot)$, you do not need to know the exact form of $h(x)$, which makes the optimization easier.

Each kernel $K(\cdot,\cdot)$ also has a corresponding $h(x)$. So if you are using an SVM with that kernel, you are implicitly finding the linear decision line in the space that $h(x)$ maps into.

Chapter 12 of The Elements of Statistical Learning gives a brief introduction to SVMs. It provides more detail about the connection between kernels and feature mappings: http://statweb.stanford.edu/~tibs/ElemStatLearn/

— Lii

Do you mean that for a kernel $K(x,y)$ there is a unique underlying $h(x)$?

2

@fcoppens No; for a trivial example, consider $h$ and $-h$. However, there does exist a unique reproducing kernel Hilbert space corresponding to that kernel.

— Dougal

@Dougal: Then I can agree with you, but the answer above says 'a corresponding $h$', so I wanted to be sure. For the RKHS I see, but do you think it is possible to explain in an 'intuitive way' what this transformation $h$ looks like for a kernel $K(x,y)$?

@fcoppens In general, no; finding explicit representations of these maps is hard. For certain kernels it is either not too hard or has been done before.

— Dougal

1

@fcoppens You are right, $h(x)$ is not unique. You can easily make changes to $h(x)$ while keeping the inner product $\langle h(x), h(x')\rangle$ the same. However, you can consider them as basis functions, and the space they span (i.e., the RKHS) is unique.

— Lii

4

The useful properties of kernel SVM are not universal; they depend on the choice of kernel. To get intuition, it helps to look at one of the most commonly used kernels, the Gaussian kernel. Remarkably, this kernel turns the SVM into something very much like a k-nearest neighbor classifier.

This answer explains the following:

Why perfect separation of positive and negative training data is always possible with a Gaussian kernel of sufficiently small bandwidth (at the cost of overfitting).
How this separation may be interpreted as linear in a feature space.
How the kernel is used to construct the mapping from data space to feature space. Spoiler: the feature space is a very mathematically abstract object, with an unusual abstract inner product based on the kernel.

1. Achieving perfect separation

Perfect separation is always possible with a Gaussian kernel (provided no two points with different labels coincide) because of the kernel's locality properties, which lead to an arbitrarily flexible decision boundary. For a sufficiently small kernel bandwidth, the decision boundary will look like you just drew little circles around the points wherever they are needed to separate the positive and negative examples:

(Figure credit: Andrew Ng's online machine learning course.)

So, why does this occur from a mathematical perspective?

Consider the standard setup: you have a Gaussian kernel $K(\mathbf{x},\mathbf{z}) = \exp(- ||\mathbf{x}-\mathbf{z}||^2 / \sigma^2)$ and training data $(\mathbf{x}^{(1)},y^{(1)}), (\mathbf{x}^{(2)},y^{(2)}), \ldots, (\mathbf{x}^{(n)},y^{(n)})$ where the $y^{(i)}$ values are $\pm 1$. We want to learn a classifier function

$\hat{y}(\mathbf{x}) = \sum_i w_i y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x})$

Now how will we ever assign the weights $w_i$ ? Do we need infinite dimensional spaces and a quadratic programming algorithm? No, because I just want to show that I can separate the points perfectly. So I make $\sigma$ a billion times smaller than the smallest separation $||\mathbf{x}^{(i)} - \mathbf{x}^{(j)}||$ between any two training examples, and I just set $w_i = 1$ . This means that all the training points are a billion sigmas apart as far as the kernel is concerned, and each point completely controls the sign of $\hat{y}$ in its neighborhood. Formally, we have

$\hat{y}(\mathbf{x}^{(k)}) = \sum_{i=1}^n y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = y^{(k)} K(\mathbf{x}^{(k)},\mathbf{x}^{(k)}) + \sum_{i \neq k} y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = y^{(k)} + \epsilon$

where $\epsilon$ is some arbitrarily tiny value. We know $\epsilon$ is tiny because $\mathbf{x}^{(k)}$ is a billion sigmas away from any other point, so for all $i \neq k$ we have

$K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = \exp(- ||\mathbf{x}^{(i)} - \mathbf{x}^{(k)}||^2 / \sigma^2) \approx 0.$

Since $\epsilon$ is so small, $\hat{y}(\mathbf{x}^{(k)})$ definitely has the same sign as $y^{(k)}$, and the classifier achieves perfect accuracy on the training data. In practice this would be terribly overfitting, but it shows the tremendous flexibility of the Gaussian kernel SVM, and how it can act very much like a nearest neighbor classifier.
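
The argument above can be checked numerically. The following is a minimal sketch, not a trained SVM: the points and labels are made up for illustration, all weights are fixed at $w_i = 1$, and $\sigma$ is set to one tenth of the smallest pairwise distance (a factor of ten is already enough here; the text uses a billion only for emphasis).

```python
import math

def gaussian_kernel(x, z, sigma):
    """K(x, z) = exp(-||x - z||^2 / sigma^2), as defined in the answer."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)

# Hypothetical training set: two of the points are close together
# but carry opposite labels.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (0.45, 0.55)]
y = [1, -1, -1, 1, -1, 1]

def min_pairwise_distance(points):
    return min(math.dist(p, q)
               for i, p in enumerate(points)
               for q in points[i + 1:])

# Bandwidth much smaller than the closest pair of training points.
sigma = min_pairwise_distance(X) / 10.0

def y_hat(x):
    """Classifier with every weight w_i fixed to 1 (no optimization at all)."""
    return sum(yi * gaussian_kernel(xi, x, sigma) for xi, yi in zip(X, y))

predictions = [1 if y_hat(xk) > 0 else -1 for xk in X]
# Size of the epsilon term at each training point.
epsilons = [abs(y_hat(xk) - yk) for xk, yk in zip(X, y)]
```

Every training point is reproduced with its own label, and the cross terms collected in `epsilons` are negligible. Away from the training points the same function is close to zero almost everywhere, which is the overfitting the answer warns about.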

2. Kernel SVM learning as linear separation

The fact that this can be interpreted as "perfect linear separation in an infinite dimensional feature space" comes from the kernel trick, which allows you to interpret the kernel as an abstract inner product in some new feature space:

$K(\mathbf{x}^{(i)},\mathbf{x}^{(j)}) = \langle\Phi(\mathbf{x}^{(i)}),\Phi(\mathbf{x}^{(j)})\rangle$

where $\Phi(\mathbf{x})$ is the mapping from the data space into the feature space. It follows immediately that $\hat{y}(\mathbf{x})$ is a linear function in the feature space:

$\hat{y}(\mathbf{x}) = \sum_i w_i y^{(i)} \langle\Phi(\mathbf{x}^{(i)}),\Phi(\mathbf{x})\rangle = L(\Phi(\mathbf{x}))$

where the linear function $L(\mathbf{v})$ is defined on feature space vectors $\mathbf{v}$ as

$L(\mathbf{v}) = \sum_i w_i y^{(i)} \langle\Phi(\mathbf{x}^{(i)}),\mathbf{v}\rangle$

This function is linear in $\mathbf{v}$ because it's just a linear combination of inner products with fixed vectors. In the feature space, the decision boundary $\hat{y}(\mathbf{x}) = 0$ is just $L(\mathbf{v}) = 0$ , the level set of a linear function. This is the very definition of a hyperplane in the feature space.
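
For the Gaussian kernel the feature space is infinite dimensional, so $\Phi$ cannot be written out as a finite vector. To see the same identity concretely, the sketch below swaps in a different kernel, the homogeneous degree-2 polynomial kernel on $\mathbb{R}^2$, whose feature map is three-dimensional. The points, labels and weights are arbitrary illustrative values, not the output of an SVM solver.

```python
import math

def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel K(x, z) = (x . z)^2 on R^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map satisfying <phi(x), phi(z)> = (x . z)^2."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical points, labels and weights, chosen only for illustration.
X = [(1.0, 2.0), (-1.0, 0.5), (0.5, -1.5)]
y = [1, -1, 1]
w = [0.7, 1.2, 0.4]

def y_hat_kernel(x):
    """Kernel form: sum_i w_i y_i K(x_i, x), computed in the data space."""
    return sum(wi * yi * poly_kernel(xi, x) for wi, yi, xi in zip(w, y, X))

# Feature-space form: one fixed vector sum_i w_i y_i phi(x_i), so that
# L(v) = <normal, v> is an ordinary linear function of v.
normal = tuple(sum(wi * yi * phi(xi)[d] for wi, yi, xi in zip(w, y, X))
               for d in range(3))

def y_hat_linear(x):
    return dot(normal, phi(x))

x_new = (0.3, -0.8)
gap = abs(y_hat_kernel(x_new) - y_hat_linear(x_new))
```

Both forms give the same value up to floating point rounding: the kernel version never leaves the two-dimensional data space, while the linear version is a plain dot product in the three-dimensional feature space.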

3. How the kernel is used to construct the feature space

Kernel methods never actually "find" or "compute" the feature space or the mapping $\Phi$ explicitly. Kernel learning methods such as SVM do not need them to work; they only need the kernel function $K$ . It is possible to write down a formula for $\Phi$ but the feature space it maps to is quite abstract and is only really used for proving theoretical results about SVM. If you're still interested, here's how it works.

Basically we define an abstract vector space $V$ where each vector is a function from $\mathcal{X}$ to $\mathbb{R}$ . A vector $f$ in $V$ is a function formed from a finite linear combination of kernel slices:

$f(\mathbf{x}) = \sum_{i=1}^n \alpha_i K(\mathbf{x}^{(i)},\mathbf{x})$

(Here the $\mathbf{x}^{(i)}$ are just an arbitrary set of points and need not be the same as the training set.) It is convenient to write $f$ more compactly as

$f = \sum_{i=1}^n \alpha_i K_{\mathbf{x}^{(i)}}$

where $K_\mathbf{x}(\mathbf{y}) = K(\mathbf{x},\mathbf{y})$ is a function giving a "slice" of the kernel at $\mathbf{x}$.

The inner product on the space is not the ordinary dot product, but an abstract inner product based on the kernel:

$\langle \sum_{i=1}^n \alpha_i K_{\mathbf{x}^{(i)}}, \sum_{j=1}^n \beta_j K_{\mathbf{x}^{(j)}} \rangle = \sum_{i,j} \alpha_i \beta_j K(\mathbf{x}^{(i)},\mathbf{x}^{(j)})$

This definition is very deliberate: its construction ensures the identity we need for linear separation, $\langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle = K(\mathbf{x},\mathbf{y})$ .

With the feature space defined in this way, $\Phi$ is a mapping $\mathcal{X} \rightarrow V$ , taking each point $\mathbf{x}$ to the "kernel slice" at that point:

$\Phi(\mathbf{x}) = K_\mathbf{x}, \quad \text{where} \quad K_\mathbf{x}(\mathbf{y}) = K(\mathbf{x},\mathbf{y}).$

You can prove that $V$ is an inner product space when $K$ is a positive definite kernel. See this paper for details.
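
The construction can be mirrored directly in code. This is a minimal sketch under one representation choice that is mine, not the answer's: a vector of $V$ is stored as a list of (coefficient, point) pairs, and the inner product is the double sum defined above. It also checks the reproducing property $\langle f, K_\mathbf{x} \rangle = f(\mathbf{x})$, which the text does not state but which follows from the same definition.

```python
import math

def K(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / sigma^2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / sigma ** 2)

# A vector f = sum_i alpha_i K_{x_i} in V is stored as [(alpha_i, x_i), ...].

def Phi(x):
    """Feature map: send x to the kernel slice K_x."""
    return [(1.0, x)]

def inner(f, g):
    """<sum_i a_i K_{x_i}, sum_j b_j K_{x_j}> = sum_{i,j} a_i b_j K(x_i, x_j)."""
    return sum(a * b * K(xi, xj) for a, xi in f for b, xj in g)

def evaluate(f, x):
    """Evaluate the function f at the point x."""
    return sum(a * K(xi, x) for a, xi in f)

x, z = (0.2, -0.1), (1.0, 0.5)
# An arbitrary element of V built from three arbitrary points.
f = [(0.5, (0.0, 0.0)), (-1.5, (1.0, 1.0)), (2.0, (-0.5, 0.3))]

# Identity needed for linear separation: <Phi(x), Phi(z)> = K(x, z).
kernel_gap = abs(inner(Phi(x), Phi(z)) - K(x, z))
# Reproducing property: <f, K_x> = f(x).
reproducing_gap = abs(inner(f, Phi(x)) - evaluate(f, x))
# For a positive definite kernel, <f, f> is never negative.
squared_norm = inner(f, f)
```

Nothing here stores an infinite-dimensional vector: each element of $V$ is described by finitely many coefficients and points, and every inner product reduces to kernel evaluations.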

— Paul

Great explanation, but I think you have missed a minus in the definition of the Gaussian kernel: $K(\mathbf{x},\mathbf{z}) = \exp(-||\mathbf{x}-\mathbf{z}||^2/\sigma^2)$. As it's written, it does not make sense with the $\epsilon$ found in part (1).

— hqxortn

1

For the background and the notations I refer to How to calculate decision boundary from support vectors?.

So the features in the 'original' space are the vectors $x_i$, the binary outcome is $y_i \in \{-1, +1\}$ and the Lagrange multipliers are $\alpha_i$.

As said by @Lii (+1), the kernel can be written as $K(x,y)=h(x) \cdot h(y)$ (where '$\cdot$' represents the inner product).

I will try to give some 'intuitive' explanation of what this $h$ looks like, so this answer is no formal proof, it just wants to give some feeling of how I think that this works. Do not hesitate to correct me if I am wrong.

I have to 'transform' my feature space (so my $x_i$ ) into some 'new' feature space in which the linear separation will be solved.

For each observation $x_i$, I define a function $\phi_i(x)=K(x_i,x)$, so I have one function $\phi_i$ for each element of my training sample. These functions $\phi_i$ span a vector space, which I denote $V=\operatorname{span}(\phi_i,\ i=1,2,\dots,N)$.

I will try to argue that $V$ is the vector space in which linear separation will be possible. By definition of the span, each vector in the vector space $V$ can be written as a linear combination of the $\phi_i$, i.e. $\sum_{i=1}^N \gamma_i \phi_i$, where the $\gamma_i$ are real numbers.

$N$ is the size of the training sample, and therefore the dimension of the vector space $V$ can go up to $N$, depending on whether the $\phi_i$ are linearly independent. As $\phi_i(x)=K(x_i,x)$ (this is how we defined $\phi_i$ above), the dimension of $V$ depends on the kernel used and can go up to the size of the training sample.

The transformation that maps my original feature space to $V$ is defined as

$\Phi: x_i \to \phi_i(x)=K(x_i, x)$ .

This map $\Phi$ maps my original feature space onto a vector space whose dimension can go up to the size of my training sample.

Obviously, this transformation (a) depends on the kernel, (b) depends on the values $x_i$ in the training sample, (c) can, depending on my kernel, have a dimension that goes up to the size of my training sample, and (d) has vectors in $V$ that look like $\sum_{i=1}^N \gamma_i \phi_i$, where the $\gamma_i$ are real numbers.

Looking at the function $f(x)$ in How to calculate decision boundary from support vectors? it can be seen that $f(x)=\sum_i y_i \alpha_i \phi_i(x)+b$ .

In other words, $f(x)$ is a linear combination of the $\phi_i$ and this is a linear separator in the V-space : it is a particular choice of the $\gamma_i$ namely $\gamma_i=\alpha_i y_i$ !

The $y_i$ are known from our observations, and the $\alpha_i$ are the Lagrange multipliers that the SVM has found. In other words, the SVM finds, through the use of a kernel and by solving a quadratic programming problem, a linear separation in the $V$-space.

This is my intuitive understanding of how the 'kernel trick' allows one to 'implicitly' transform the original feature space into a new feature space $V$ , with a different dimension. This dimension depends on the kernel you use and for the RBF kernel this dimension can go up to the size of the training sample.
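
As a rough illustration of this view, the sketch below builds the functions $\phi_i$ for a four-point XOR data set, which no line separates in the original plane, and evaluates $f(x)=\sum_i \gamma_i \phi_i(x)+b$ with $\gamma_i=\alpha_i y_i$. The multipliers $\alpha_i = 1$, the offset $b = 0$ and the bandwidth are hypothetical values picked by hand; a real SVM would obtain $\alpha_i$ and $b$ by solving the quadratic program.

```python
import math

def rbf(x, z, sigma=0.5):
    """RBF kernel exp(-||x - z||^2 / sigma^2); sigma = 0.5 is an arbitrary choice."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / sigma ** 2)

# XOR-style training sample: not linearly separable in the original space.
X_train = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
y_train = [-1, -1, 1, 1]

# Hypothetical multipliers and offset (assumed, not computed by a solver).
alpha = [1.0, 1.0, 1.0, 1.0]
b = 0.0

def phi_vector(x):
    """Evaluate every phi_i at x: (K(x_1, x), ..., K(x_N, x))."""
    return [rbf(xi, x) for xi in X_train]

# Coefficients of the separator in V: gamma_i = alpha_i * y_i.
gamma = [a * yi for a, yi in zip(alpha, y_train)]

def f(x):
    """f(x) = sum_i gamma_i phi_i(x) + b, a linear combination of the phi_i."""
    return sum(g * p for g, p in zip(gamma, phi_vector(x))) + b

signs = [1 if f(x) > 0 else -1 for x in X_train]
```

With these hand-picked values the sign of `f` matches every training label, even though `f` is nothing more than a fixed linear combination of the $N = 4$ functions $\phi_i$.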

So kernels are a technique that allows an SVM to transform your feature space; see also What makes the Gaussian kernel so magical for PCA, and also in general?

— Community

"for each element of my training sample" -- is element here referring to a row or column (i.e. feature )

— user1761806

what is x and x_i? If my X is an input of 5 columns, and 100 rows, what would x and x_i be?

— user1761806

@user1761806 an element is a row. The notation is explained in the link at the beginning of the answer

1

Transform predictors (input data) to a high-dimensional feature space. It is sufficient to just specify the kernel for this step and the data is never explicitly transformed to the feature space. This process is commonly known as the kernel trick.

Let me explain it. The kernel trick is the key here. Consider the case of a Radial Basis Function (RBF) Kernel here. It transforms the input to infinite dimensional space. The transformation of input $x$ to $\phi(x)$ can be represented as shown below (taken from http://www.csie.ntu.edu.tw/~cjlin/talks/kuleuven_svm.pdf)

The input space is finite dimensional but the transformed space is infinite dimensional. Transforming the input to an infinite dimensional space is something that happens as a result of the kernel trick. Here $x$ is the input and $\phi(x)$ is the transformed input. But $\phi$ is not computed explicitly; instead the product $\phi(x_i)^T\phi(x)$ is computed, which is just the exponential of the negative scaled squared distance between $x_i$ and $x$.
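
To make the infinite-dimensional map tangible, here is a small sketch for scalar inputs. It assumes the parameterization $K(x,z)=\exp(-\gamma (x-z)^2)$, for which the Taylor series of $\exp(2\gamma x z)$ gives the coordinates $\phi_k(x) = e^{-\gamma x^2}\sqrt{(2\gamma)^k/k!}\,x^k$ for $k = 0, 1, 2, \ldots$ The code keeps only the first `terms` coordinates; the function names and the truncation length are illustrative choices.

```python
import math

def rbf(x, z, gamma=1.0):
    """RBF kernel for scalar inputs: exp(-gamma * (x - z)^2)."""
    return math.exp(-gamma * (x - z) ** 2)

def phi_truncated(x, gamma=1.0, terms=30):
    """First `terms` coordinates of the infinite feature vector phi(x).

    Coordinate k is exp(-gamma x^2) * sqrt((2 gamma)^k / k!) * x^k.
    """
    scale = math.exp(-gamma * x * x)
    return [scale * math.sqrt((2.0 * gamma) ** k / math.factorial(k)) * x ** k
            for k in range(terms)]

x, z = 0.3, -0.5
# Dot product of the truncated feature vectors ...
approx = sum(a * b for a, b in zip(phi_truncated(x), phi_truncated(z)))
# ... versus a single kernel evaluation.
exact = rbf(x, z)
truncation_error = abs(approx - exact)
```

For inputs of this size, thirty coordinates already reproduce the kernel value to within rounding error, while the kernel itself needs only one exponential. That gap in cost is the practical point of the kernel trick.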

There is a related question Feature map for the Gaussian kernel to which there is a nice answer /stats//a/69767/86202.

The output or decision function is a function of the kernel values $K(x_i,x)=\phi(x_i)^T\phi(x)$ and not of the input $x$ or the transformed input $\phi(x)$ directly.

— prashanth

0

Mapping to a higher dimension is merely a trick to solve a problem that is defined in the original dimension; so concerns such as overfitting your data by going into a dimension with too many degrees of freedom are not a byproduct of the mapping process, but are inherent in your problem definition.

Basically, all that mapping does is converting conditional classification in the original dimension to a plane definition in the higher dimension, and because there is a 1 to 1 relationship between the plane in the higher dimension and your conditions in the lower dimension, you can always move between the two.

Taking the problem of overfitting, clearly, you can overfit any set of observations by defining enough conditions to isolate each observation into its own class, which is equivalent to mapping your data to (n-1)D, where n is the number of your observations.

Taking the simplest problem, where your observations are [[1,-1], [0,0], [1,1]] [[feature, value]], by moving into 2D and separating your data with a line, you are simply turning the conditional classification of feature < 1 && feature > -1 : 0 into defining a line that passes through (-1 + epsilon, 1 - epsilon). If you had more data points and needed more conditions, you would just need to add one more degree of freedom to your higher dimension for each new condition that you define.

You can replace the process of mapping to a higher dimension with any process that provides you with a 1 to 1 relationship between the conditions and the degrees of freedom of your new problem. Kernel tricks simply do that.

— Hou

1

As a different example, take the problem where the phenomenon results in observations of the form [x, floor(sin(x))]. Mapping your problem into 2D is not helpful here at all; in fact, mapping to any plane will not be helpful, because defining the problem as a set of x < a && x > b : z is not helpful in this case. The simplest mapping in this case is mapping into polar coordinates, or into the imaginary plane.

— Hou

Kernel SVM: I want an intuitive understanding of mapping to a higher-dimensional feature space, and how this makes linear separation possible
