Gradient descent - Optimazation algorithm
Hello every body ^^, Have you ever heard about optimazation problem in predictor of machine learning? Before taking about that, we glide example below: An example of spam email classification First, some terminology. A predictor is a function f that maps an input x to an output y. In ...
Hello every body ^^, Have you ever heard about optimazation problem in predictor of machine learning? Before taking about that, we glide example below:
An example of spam email classification
First, some terminology. A predictor is a function f that maps an input x to an output y. In statistics, y is known as a response, and when x is a real vector, then it is known as the covariates. The starting point of machine learning is the data, which is the main resource that we can use to address the information complexity of the prediction task at hand. In the post, we will focus on supervised learning, in which our data provides both inputs and outputs, in contrast to unsupervised learning, which only provides inputs. A (supervised) example (also called a data point or instance) is simply an input-output pair (x, y), which specifies that y is the ground-truth output for x. The training data Dtrain is a multiset of examples (repeats are allowed, but this is not important), which forms a partial specification of desired behavior of the predictor. One question released that How to separate email into spam or not??? It is based on Feature extraction: Given input x, output a set of (feature name, feature value) pairs. Example for email classification: We will consider predictors f based on feature extractors. Feature extraction is a bit of an art that requires intuition about both the task and also what machine learning algorithms are capable of. The general principle is that features should represent properties of x which might be relevant for predicting y. It is okay to add features which turn out to be irrelevant, since the learning algorithm can sort it out (though it might require more data to do so). From feature extractor, we have a feature vector of input x: ϕ(x)=[ϕ1(x)⋅⋅⋅ϕd(x)]phi (x) = egin{bmatrix} phi _{1}(x)& cdot cdot cdot & phi _{d}(x) end{bmatrix}ϕ(x)=[ϕ1(x)⋅⋅⋅ϕd(x)] Think of ϕ(x)ϵRdphi (x) epsilon R^{d}ϕ(x)ϵRd as a point in a high-dimensional space. For each feature j of feature extraction, we have real number wj representing contribution of feature to prediction We have vector input x, and its weight, we can use predictor to classify data.
Loss function in linear regression
According to above example, we use predictor f_{w} to classsify data (binary classification). But right now, we focus on linear regression. All predictors are whether true or false, and we can trust it in certain percentage. So we have to find a function which denote training error, is loss function. There are some loss function in linear regression such as squared loss , absolute deviation loss. A popular and convenient loss function to use in linear regression is the squared loss, which penalizes the residual of the prediction quadratically. An alternative to the squared loss is the absolute deviation loss, which simply takes the absolute value of the residual. Now, look at the formula: (w⋅ϕ(x))−yleft ( w cdot phi(x) ight ) - y(w⋅ϕ(x))−y is the residual that the amount by which prediction fw=w⋅ϕ(x)f_{w} = wcdot phi (x)fw=w⋅ϕ(x) overshoots the target y. So if the value of loss function is minimization, our predictor is reliable. One question released: How to minimize loss function ??? Yeah, the scientist has just said that we can find to minimize the TrainLoss(w): So the key that we need to set w to make global tradeoffs — not every example can be happy. This idea urge us to find a optimazation algorithm that minimize TrainLoss(w). It is gradient descent (GD), we will talk about it in below section.
Gradient Descent (GD)
Do not let you wait for long, we are talking about GD right now!. The first, we talk about some concept.
- Gradient: The gradient ∇wTrainLoss(w) is the direction that increases the loss the most.
- Descent: So, you can see about above image. A general approach is to use iterative optimization, which essentially starts at some starting point w (say, all zeros), and tries to tweak w so that the objective function value decreases. To do this, we will rely on the gradient of the function, which tells us which direction to move in to decrease the objective the most. The gradient is a valuable piece of information, especially since we will often be optimizing in high dimensions (d on the order of thousands). This iterative optimization procedure is called gradient descent. Gradient descent has two hyperparameters, the step size η (which specifies how aggressively we want to pursue a direction) and the number of iterations T. Let’s not worry about how to set them, but you can think of T = 100 and η = 0.1 for now. Ok, I have just explained the GD 's mean and operation mechanism. As we seen, we have multiple loops before converging. So which loss function is choosen for now ???