# CS229 Lecture Notes

Instead of maximizing $L(\theta)$, we can also maximize any strictly increasing function of $L(\theta)$.

Quizzes (≈10-30 min to complete) are given at the end of every week.

(See also the extra credit problem on Q3 of problem set 1.)

To minimize $J$, we set its derivatives to zero, and obtain the normal equations: $X^T X \theta = X^T \vec{y}$. Thus, the value of $\theta$ that minimizes $J(\theta)$ is given in closed form by the equation $\theta = (X^T X)^{-1} X^T \vec{y}$.

variables (living area in this example), also called input features, and $y^{(i)}$

this is not the same algorithm, because $h_\theta(x^{(i)})$ is now defined as a non-linear

$y \mid x; \theta \sim \mathrm{Bernoulli}(\phi)$, for some appropriate definitions of $\mu$ and $\phi$ as functions

Note that the superscript “$(i)$” in the

hypothesis $h$ grows linearly with the size of the training set.

examples of exponential family distributions.

Similar to our derivation in the case of

doing so, this time performing the minimization explicitly and without

To tell the SVM story, we’ll need to first talk about margins and the idea of separating data with a large

We will start small and slowly build up a neural network, step by step.

The topics covered are shown below, although for a more detailed summary see lecture 19.

(Most of what we say here will also generalize to the multiple-class case.)

Once we’ve fit the $\theta_i$’s and stored them away, we no longer need to

change the definition of $g$ to be the threshold function: If we then let $h_\theta(x) = g(\theta^T x)$ as before but using this modified definition of

In this set of notes, we give an overview of neural networks, discuss vectorization and discuss training neural networks with backpropagation.

to the fact that the amount of stuff we need to keep in order to represent the

Suppose we have a dataset giving the living areas and prices of 47 houses
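The closed-form solution $\theta = (X^T X)^{-1} X^T \vec{y}$ can be sketched for the simplest case, a line $h(x) = \theta_0 + \theta_1 x$, where $X^T X$ is only 2-by-2 and can be inverted by hand. This is a minimal illustration, not the notes' own code: the function name `fit_normal_equations` is made up here, and the data are just the five sample rows quoted in these notes, so the fitted values will differ from any fit on the full 47-house set.

```python
def fit_normal_equations(xs, ys):
    """Solve X^T X theta = X^T y for the line h(x) = theta0 + theta1 * x."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # With an intercept column of ones, X^T X = [[n, sx], [sx, sxx]]
    # and X^T y = [sy, sxy]; solve the 2x2 system with Cramer's rule.
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular; the closed form does not apply")
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (n * sxy - sx * sy) / det
    return theta0, theta1


living_area = [2104.0, 1600.0, 2400.0, 1416.0, 3000.0]  # square feet
price = [400.0, 330.0, 369.0, 232.0, 540.0]             # in $1000s
theta0, theta1 = fit_normal_equations(living_area, price)

# At the minimizer the residuals are orthogonal to every column of X.
residuals = [theta0 + theta1 * x - y for x, y in zip(living_area, price)]
```

The singularity check mirrors the assumption, stated in the notes, that $X^T X$ is invertible.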
Here, $\nabla_\theta \ell(\theta)$ is, as usual, the vector of partial derivatives of $\ell(\theta)$ with respect

Footnote 5: The presentation of the material in this section takes inspiration from Michael I.

like this: $x \to h \to$ predicted $y$ (predicted price)

Given a training set, define the design matrix $X$ to be the $n$-by-$d$ matrix

Footnote 4: If $x$ is vector-valued, this is generalized to be $w^{(i)} = \exp\left(-(x^{(i)} - x)^T (x^{(i)} - x) / (2\tau^2)\right)$.

is also something that you’ll get to experiment with in your homework.

CS229 Lecture notes, Andrew Ng. Supervised learning. Let’s start by talking about a few examples of supervised learning problems.

if $|x^{(i)} - x|$ is large, then $w^{(i)}$ is small.

Note that, while gradient descent can be susceptible

Let’s discuss a second way

to denote the “output” or target variable that we are trying to predict

interest, and that we will also return to later when we talk about learning

We begin our discussion with a

A fixed choice of $T$, $a$ and $b$ defines a family (or set) of distributions that

asserting a statement of fact, that the value of $a$ is equal to the value of $b$.

$\theta_0 = 89.60$, $\theta_1 = 0.1392$, $\theta_2 = -8.738$.

possible to ensure that the parameters will converge to the global minimum rather than

changes $\theta$ to make $J(\theta)$ smaller, until hopefully we converge to a value of

To do so, let’s use a search

CS229 Lecture Notes, Andrew Ng, slightly updated by TM on June 28, 2019. Supervised learning. Let’s start by talking about a few examples of

This is just like the regression

and is also known as the Widrow-Hoff learning rule.

We now show that this class of Bernoulli

$\theta$, we can rewrite update (1) in a slightly more succinct way:

The reader can easily verify that the quantity in the summation in the

Linear Algebra (section 1-3); Additional Linear Algebra Note; Lecture 2: Review of Matrix Calculus

Seen pictorially, the process is therefore

that we’d left out of the regression), or random noise.
to the gradient of the error with respect to that single training example only.

The maxima of $\ell$ correspond to points

$\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)$.

$\theta^T x^{(i)})^2$ small.

The notes in this section are based on lecture notes 2.

can then write down the likelihood of the parameters as

$1, \ldots, n\}$ is called a training set.

Armed with the tools of matrix derivatives, let us now proceed to find in

method) is given by

higher “weight” to the (errors on) training examples close to the query point

Now, given this probabilistic model relating the $y^{(i)}$’s and the $x^{(i)}$’s, what

Note that we should not condition on $\theta$

calculus with matrices.

For a function $f : \mathbb{R}^{n \times d} \mapsto \mathbb{R}$ mapping from $n$-by-$d$ matrices to the real

In this set of notes, we give a broader view of the EM algorithm, and show how it can be applied to a large family of estimation problems with latent variables.

I.e., we should choose $\theta$ to

then we have the perceptron learning algorithm.

the space of output values.

Hence, $\theta$ is chosen giving a much

function of $\theta^T x^{(i)}$.

and “+.” Given $x^{(i)}$, the corresponding $y^{(i)}$ is also called the label for the

gradient descent gets $\theta$ “close” to the minimum much faster than batch gradient descent.

So, by letting $f(\theta) = \ell'(\theta)$, we can use

The above results were obtained with batch gradient descent.

distributions, ones obtained by varying $\phi$, is in the exponential family; i.e.,

matrix-vectorial notation.

This is a very natural algorithm that

mean $\phi$, written $\mathrm{Bernoulli}(\phi)$, specifies a distribution over $y \in \{0, 1\}$, so that

We now show that the Bernoulli and the Gaussian distributions are examples

as in our housing example, we call the learning problem a regression problem.

Let us assume that $P(y = 1 \mid x; \theta) = h_\theta(x)$

given $x^{(i)}$ and parameterized by $\theta$.

This can be checked before calculating the inverse.

Here, $\alpha$ is called the learning rate.
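To make the role of the learning rate $\alpha$ concrete, here is a small sketch contrasting batch gradient descent (every step uses the whole training set) with stochastic gradient descent (every step uses a single training example, as in the LMS rule $\theta_j := \theta_j + \alpha\,(y^{(i)} - h_\theta(x^{(i)}))\,x_j^{(i)}$). The function names, the toy data and the choice $\alpha = 0.01$ are illustrative assumptions, not taken from the notes.

```python
def predict(theta, x):
    """Linear hypothesis h_theta(x) = sum_j theta_j * x_j, with x[0] = 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def batch_gradient_descent(X, y, alpha, iterations):
    """Each update sums the errors over every training example."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        errors = [yi - predict(theta, xi) for xi, yi in zip(X, y)]
        theta = [
            t + alpha * sum(e * xi[j] for e, xi in zip(errors, X))
            for j, t in enumerate(theta)
        ]
    return theta


def stochastic_gradient_descent(X, y, alpha, epochs):
    """Each update uses the error on one training example only."""
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            error = yi - predict(theta, xi)
            theta = [t + alpha * error * xj for t, xj in zip(theta, xi)]
    return theta


# Noise-free toy data generated from y = 1 + 2x, so both methods
# should approach theta = [1, 2].
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [3.0, 5.0, 7.0, 9.0]
theta_batch = batch_gradient_descent(X, y, alpha=0.01, iterations=5000)
theta_sgd = stochastic_gradient_descent(X, y, alpha=0.01, epochs=5000)
```

On noisy data the stochastic variant would typically keep oscillating near the minimum rather than settling exactly; the toy data here are noise-free, which is why both runs agree.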
For instance, logistic regression modeled $p(y \mid x; \theta)$ as $h_\theta(x) = g(\theta^T x)$ where $g$ is the sigmoid function.

For instance, the magnitude of

In the previous set of notes, we talked about the EM algorithm as applied to fitting a mixture of Gaussians.

if it can be written in the form

CS229 Lecture notes, Andrew Ng. Part V: Support Vector Machines.

Time and Location: Monday, Wednesday 4:30pm-5:50pm, links to lecture are on Canvas.

In this set of notes, we give a broader view of the EM algorithm, and show how it can be applied to a …

to local minima in general, the optimization problem we have posed here,

Footnote 1: We use the notation “$a := b$” to denote an operation (in a computer program) in

Given data like this, how can we learn to predict the prices of other houses

Take an adapted version of this course as part of the Stanford Artificial Intelligence Professional Program.

that the $\epsilon^{(i)}$ are distributed IID (independently and identically distributed)

is the distribution of the $y^{(i)}$’s?

properties of the LWR algorithm yourself in the homework.

The k-means clustering algorithm.

To describe the supervised learning problem slightly more formally, our

Specifically, let’s consider the gradient descent

CS229 Lecture notes, Andrew Ng. Part IV: Generative Learning algorithms. So far, we’ve mainly been talking about learning algorithms that model $p(y \mid x; \theta)$, the conditional distribution of $y$ given $x$.

for a particular value of $i$, then in picking $\theta$, we’ll try hard to make $(y^{(i)} -$

properties that seem natural and intuitive.

Now, given a training set, how do we pick, or learn, the parameters $\theta$?

continues to make progress with each example it looks at.

The rightmost figure shows the result of running

$\mathcal{N}(0, \sigma^2)$.” I.e., the density of $\epsilon^{(i)}$ is given by

Footnote 3: Note that in the above step, we are implicitly assuming that $X^T X$ is an invertible
We’d derived the LMS rule for when there was only a single training

The term “non-parametric” (roughly) refers

$\theta$, we will instead call it the likelihood function:

Note that by the independence assumption on the $\epsilon^{(i)}$’s (and hence also the

This set of notes presents the Support Vector Machine (SVM) learning algorithm.

how we saw least squares regression could be derived as the maximum likelihood

which we set the value of a variable $a$ to be equal to the value of $b$.

For a single training example, this gives the update rule:

problem, except that the values $y$ we now want to predict take on only

goal is, given a training set, to learn a function $h : \mathcal{X} \mapsto \mathcal{Y}$ so that $h(x)$ is a

apartment, say), we call it a classification problem.

Class Videos: Current quarter's class videos are available here for SCPD students and here for non-SCPD students.

partition function.

special cases of a broader family of models, called Generalized Linear Models

To work our way up to GLMs, we will begin by defining exponential family

When we wish to explicitly view this as a function of

The k-means clustering algorithm is as follows:

In the clustering problem, we are given a training set $\{x^{(1)}, \ldots, x^{(m)}\}$, and want to group the data into a few cohesive “clusters.”

I have access to the 2013 video lectures of CS229 from ClassX and the publicly available 2008 version is great as well.

6/22: Assignment: Problem Set 0.

If either the number of

Often, stochastic

As discussed previously, and as shown in the example above, the choice of

ing there is sufficient training data, makes the choice of features less critical.

In contrast, we will write “$a = b$” when we are

When Newton’s method is applied to maximize the logistic regression

ically choosing a good set of features.)
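The text above introduces the k-means clustering problem but breaks off before listing the steps. As a hedged illustration, the sketch below implements the standard k-means iteration (assign each point to its nearest centroid, then move each centroid to the mean of its assigned points), which may differ in presentation from the notes' own statement. It is restricted to one-dimensional points for brevity, and the function name `kmeans_1d`, the explicit initial centroids and the sample data are assumptions made for this example.

```python
def kmeans_1d(points, initial_centroids, max_iterations=100):
    """Standard k-means on scalar data, starting from the given centroids."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda j: (p - centroids[j]) ** 2
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    return centroids, clusters


points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids, clusters = kmeans_1d(points, initial_centroids=[0.0, 6.0])
```

The result depends on the initial centroids; practical implementations usually pick them at random from the data and may restart several times.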
in practice most of the values near the minimum will be reasonably good

the sum in the definition of $J$.

When $y$ can take on only a small number of discrete values (such as

A fairly standard choice for the weights is (see footnote 4)

Note that the weights depend on the particular point $x$ at which we’re trying

be made if our prediction $h_\theta(x^{(i)})$ has a large error (i.e., if it is very far from

may be some features of a piece of email, and $y$ may be 1 if it is a piece

not directly have anything to do with Gaussians, and in particular the $w^{(i)}$

the following algorithm:

By grouping the updates of the coordinates into an update of the vector

This therefore gives us

For historical reasons, this

more than one example.

Let’s first work it out for the

Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:

| Living area (feet²) | Price (1000\$s) |
|---|---|
| 2104 | 400 |
| 1600 | 330 |
| 2400 | 369 |
| 1416 | 232 |
| 3000 | 540 |
| ⋮ | ⋮ |

We can plot this data:

partial derivative term on the right hand side.

derived and applied to other classification and regression problems.

case of if we have only one training example $(x, y)$, so that we can neglect

is simply gradient descent on the original cost function $J$.
going, and we’ll eventually show this to be a special case of a much broader

used the facts $\nabla_x b^T x = b$ and $\nabla_x x^T A x = 2Ax$ for symmetric matrix $A$ (for

to the $\theta_i$’s; and $H$ is a $d$-by-$d$ matrix (actually, $(d+1)$-by-$(d+1)$, assuming that

CS229 Lecture Notes, Andrew Ng, updated by Tengyu Ma on April 21, 2019. Part V: Kernel Methods. 1.1 Feature maps. Recall that in our discussion about linear regression, we considered the problem of predicting the price of a house (denoted by $y$) from the living area of the house (denoted by $x$), and we fit a linear function of $x$ to the training data.

We can also write the

from Portland, Oregon: Living area (feet²), Price (1000\$s)

$y^{(i)}$’s given the $x^{(i)}$’s), this can also be written

notation is simply an index into the training set, and has nothing to do with

where its first derivative $\ell'(\theta)$ is zero.

These quizzes are here to …

$\theta = (X^T X)^{-1} X^T \vec{y}$.

for $\theta$, which is about 2.8.

mean zero and some variance $\sigma^2$.

As before, it will be easier to maximize the log likelihood:

How do we maximize the likelihood?

via maximum likelihood.

keep the training data around to make future predictions.

2.1 Why Gaussian discriminant analysis is like logistic regression.

$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$, where $\epsilon^{(i)}$ is an error term that captures either unmodeled effects (such as

rather than negative sign in the update formula, since we’re maximizing,

CS229 Lecture notes, Andrew Ng. Part IX: The EM algorithm.
The rule is called the LMS update rule (LMS stands for “least mean squares”),

one iteration of gradient descent, since it requires finding and inverting an

in Portland, as a function of the size of their living areas?

function $h$ is called a hypothesis.

use it to maximize some function $\ell$?

it has a fixed, finite number of parameters (the $\theta_i$’s), which are fit to the

linearly independent examples is fewer than the number of features, or if the features

nearly matches the actual value of $y^{(i)}$, then we find that there is little need

that we saw earlier is known as a parametric learning algorithm, because

1 Neural Networks.

In this section, we will give a set of probabilistic assumptions, under

CS229 Lecture notes, Andrew Ng. Part IX: The EM algorithm. In the previous set of notes, we talked about the EM algorithm as applied to fitting a mixture of Gaussians.

the entire training set around.

A pair $(x^{(i)}, y^{(i)})$ is called a training example, and the dataset

$+\, \theta_k x^k)$, and wish to decide if $k$ should be 0, 1, …, or 10.

algorithm, which starts with some initial $\theta$, and repeatedly performs the

$p(\vec{y} \mid X; \theta)$.

Here, $x^{(i)} \in \mathbb{R}^n$.

principle of maximum likelihood says that we should choose $\theta$ so as to

of spam mail, and 0 otherwise.

make the data as high probability as possible.

Stay truthful, maintain Honor Code and Keep Learning.

You will have to watch around 10 videos (more or less 10 min each) every week.

Let us further assume

Week 1: Lecture 1, Review of Linear Algebra; Class Notes.

Nelder, Generalized Linear Models (2nd ed.).

features is important to ensuring good performance of a learning algorithm.

the entire training set before taking a single step, a costly operation if $n$ is

Newton’s method to minimize rather than maximize a function?)

generalize Newton’s method to this setting.
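Before generalizing, the scalar version of Newton's method is easy to sketch: to find a zero of $f$, repeat $\theta := \theta - f(\theta)/f'(\theta)$; to maximize a function $\ell$, apply the same update with $f = \ell'$. The helper name `newton_root`, its stopping rule and the two test functions are assumptions chosen for this illustration.

```python
def newton_root(f, f_prime, theta, tolerance=1e-12, max_iterations=50):
    """Find a zero of f by repeating theta := theta - f(theta) / f'(theta)."""
    for _ in range(max_iterations):
        step = f(theta) / f_prime(theta)
        theta -= step
        if abs(step) < tolerance:
            break
    return theta


# Root finding: f(theta) = theta^2 - 2 has a zero at sqrt(2).
root = newton_root(lambda t: t * t - 2.0, lambda t: 2.0 * t, theta=1.0)

# Maximization: ell(theta) = -(theta - 3)^2 is maximized where its first
# derivative ell'(theta) = -2 (theta - 3) is zero, so pass f = ell' and
# f_prime = ell'' = -2.
maximizer = newton_root(lambda t: -2.0 * (t - 3.0), lambda t: -2.0, theta=0.0)
```

Because the second example is quadratic, a single Newton step lands exactly on the maximizer; in the vector-valued case the division by $f'$ is replaced by multiplication with the inverse Hessian.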
Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?

Nonetheless, it’s a little surprising that we end up with

tions we consider, it will often be the case that $T(y) = y$); and $a(\eta)$ is the log

our updates will therefore be given by $\theta := \theta + \alpha \nabla_\theta \ell(\theta)$.

1 Neural Networks. We will start small and slowly build up a neural network, step by step.

corresponding $y^{(i)}$’s.

(actually $n$-by-$(d+1)$, if we include the intercept term) that contains the

time we encounter a training example, we update the parameters according

least-squares cost function that gives rise to the ordinary least squares

that measures, for each value of the $\theta$’s, how close the $h(x^{(i)})$’s are to the

likelihood estimator under a set of assumptions, let’s endow our classification

This rule has several

or $w^{(i)} = \exp\left(-(x^{(i)} - x)^T \Sigma^{-1} (x^{(i)} - x) / 2\right)$, for an appropriate choice of $\tau$ or $\Sigma$.

$\tau$ controls how quickly the weight of a training example falls off with distance

CS229 Lecture notes, Andrew Ng. Part V: Support Vector Machines. This set of notes presents the Support Vector Machine (SVM) learning algorithm.

To establish notation for future use, we’ll use $x^{(i)}$ to denote the “input”

Deep Learning.

Lecture videos which are organized in "weeks".

operation overwrites $a$ with the value of $b$.

So far, we’ve seen a regression example, and a classification example.

One iteration of Newton’s method can, however, be more expensive than gradient descent.

In this section, let us briefly talk

When faced with a regression problem, why might linear regression, and

$g$, and if we use the update rule

are not random variables, normally distributed or otherwise.)
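The weighting scheme for locally weighted linear regression can be sketched for a scalar input, using the weights $w^{(i)} = \exp\left(-(x^{(i)} - x)^2 / (2\tau^2)\right)$ and fitting $\theta$ to minimize $\sum_i w^{(i)} (y^{(i)} - \theta^T x^{(i)})^2$ separately for each query point. The function name `lwr_predict`, the closed-form 2-by-2 solve and the sample data are assumptions made for this illustration.

```python
import math


def lwr_predict(xs, ys, query, tau):
    """Fit a weighted line around `query` and evaluate it there."""
    # Training points near the query get weight close to 1; far ones near 0.
    weights = [math.exp(-((x - query) ** 2) / (2.0 * tau ** 2)) for x in xs]
    sw = sum(weights)
    swx = sum(w * x for w, x in zip(weights, xs))
    swxx = sum(w * x * x for w, x in zip(weights, xs))
    swy = sum(w * y for w, y in zip(weights, ys))
    swxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    # Weighted normal equations for h(x) = theta0 + theta1 * x.
    det = sw * swxx - swx * swx
    if det == 0:
        raise ValueError("weighted system is singular; try a larger tau")
    theta0 = (swxx * swy - swx * swxy) / det
    theta1 = (sw * swxy - swx * swy) / det
    return theta0 + theta1 * query


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [x * x for x in xs]  # a curve that a single global line fits poorly
local_prediction = lwr_predict(xs, ys, query=2.0, tau=0.5)
```

Note that the fit is redone for every query point, which is why the whole training set has to be kept around at prediction time. A very small $\tau$ can make the weights underflow to zero, which the singularity check reports.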
method to this multidimensional setting (also called the Newton-Raphson

instance, if we are encountering a training example on which our prediction

gradient descent, and requires many fewer iterations to get very close to the

CS229 Lecture Notes.

the same update rule for a rather different algorithm and learning problem.

After a few more

For now, we will focus on the binary

$d$-by-$d$ Hessian; but so long as $d$ is not too large, it is usually much faster

Let’s now talk about the classification problem.

instead maximize the log likelihood $\ell(\theta)$:

Hence, maximizing $\ell(\theta)$ gives the same answer as minimizing

Live lecture notes (spring quarter) [old draft, in lecture]. 10/28: Lecture 14, Weak supervised / unsupervised learning.

The (unweighted) linear regression algorithm

following: Here, the $w^{(i)}$’s are non-negative valued weights.

$\theta_0 = 89.60$, $\theta_1 = 0.1392$, $\theta_2 = -8.738$.

model with a set of probabilistic assumptions, and then fit the parameters

We will also use $\mathcal{X}$ to denote the space of input values, and $\mathcal{Y}$

as usual; but no labels $y^{(i)}$ are given.

Intuitively, if $w^{(i)}$ is large

Stanford University, CS229: Machine Learning by Andrew Ng, Lecture Notes: Parameter Learning

Is this coincidence, or is there a deeper reason behind this? We’ll answer this

problem set 1.)

The Bernoulli distribution with

If the number of bedrooms were included as one of the input features as well,

For instance, if we are trying to build a spam classifier for email, then $x^{(i)}$

one more iteration, which updates $\theta$ to about 1.8.

Introduction.

Make sure you are up to date, to not lose the pace of the class.
In this method, we will minimize $J$ by

Written in vectorial notation,

regression log likelihood function $\ell(\theta)$, the resulting method is also called Fisher scoring.

that we’ll be using to learn, a list of $n$ training examples $\{(x^{(i)}, y^{(i)}); i =$

which least-squares regression is derived as a very natural algorithm.

Piazza is the forum for the class. All official announcements and communication will happen over Piazza.

SVMs are among the best (and many believe are indeed the best) “off-the-shelf” supervised learning algorithms.

Let’s start by working with just

Even in such cases, it is

that there is a choice of $T$, $a$ and $b$ so that Equation (3) becomes exactly the

function $h$ is called a hypothesis.

of linear regression, we can use gradient ascent.

normalization constant, that makes sure the distribution $p(y; \eta)$ sums/integrates

batch gradient descent.

All of the lecture notes from CS229: Machine Learning

distribution of $y^{(i)}$ as $y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$.

However, it is easy to construct examples where this method

possible to “fix” the situation with additional techniques, which we skip here for the sake

When the target variable that we’re trying to predict is continuous, such

This treatment will be brief, since you’ll get a chance to explore some of the
Lastly, in our logistic regression setting, $\theta$ is vector-valued, so we need to

This algorithm is called stochastic gradient descent (also incremental

parameter) of the distribution; $T(y)$ is the sufficient statistic (for the distribution

the training examples we have.

update: (This update is simultaneously performed for all values of $j = 0, \ldots, d$.)

We will also show how other models in the GLM family can be

$\theta$, we can rewrite update (2) in a slightly more succinct way:

In this algorithm, we repeatedly run through the training set, and each

I completed the online version as a freshman and here I take the CS229 Stanford version.

CS229 Lecture notes, Andrew Ng. Part V: Support Vector Machines. This set of notes presents the …

Moreover, if $|x^{(i)} - x|$ is small, then $w^{(i)}$ is close to 1; and

we include the intercept term) called the Hessian, whose entries are given

Intuitively, it also doesn’t make sense for $h_\theta(x)$ to take

So, given the logistic regression model, how do we fit $\theta$ for it?

class of Bernoulli distributions.

cosmetically similar to the density of a Gaussian distribution, the $w^{(i)}$’s do

Let’s start by talking about a few examples of supervised learning problems.

algorithm that starts with some “initial guess” for $\theta$, and that repeatedly

Ng mentions this fact in the lecture and in the notes, but he doesn’t go into the details of justifying it, so let’s do that.

To enable us to do this without having to write reams of algebra and

numbers, we define the derivative of $f$ with respect to $A$ to be:

Thus, the gradient $\nabla_A f(A)$ is itself an $n$-by-$d$ matrix, whose $(i, j)$-element is $\partial f / \partial A_{ij}$.

Here, $A_{ij}$ denotes the $(i, j)$ entry of the matrix $A$.

If $w^{(i)}$ is small, then the $(y^{(i)} - \theta^T x^{(i)})^2$ error term will be

In this set of notes, we give an overview of neural networks, discuss vectorization and discuss training neural networks with backpropagation.

Syllabus and Course Schedule.

We now begin our study of deep learning.
Newton’s method typically enjoys faster convergence than (batch) gradient descent

“good” predictor for the corresponding value of $y$.

(“$p(y^{(i)} \mid x^{(i)}, \theta)$”), since $\theta$ is not a random variable.

$P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$

Note that this can be written more compactly as $p(y \mid x; \theta) = (h_\theta(x))^y (1 - h_\theta(x))^{1-y}$.

Assuming that the $n$ training examples were generated independently, we

Gradient descent gives one way of minimizing $J$.

$\theta$ that minimizes $J(\theta)$.

To do so, it seems natural to

(Note however that it may never “converge” to the minimum,

In the third step, we used the fact that $a^T b = b^T a$, and in the fifth step

Note: This is being updated for Spring 2020. The dates are subject to change as we figure out deadlines.

As we vary $\phi$, we obtain Bernoulli

SVMs are among the best (and many believe are indeed the best) “off-the-shelf” supervised learning algorithms.

Locally weighted linear regression is the first example we’re seeing of a

if there are some features very pertinent to predicting housing price, but

One reasonable method seems to be to make $h(x)$ close to $y$, at least for

non-parametric algorithm.

There are two ways to modify this method for a training set of

Live lecture notes; Weak Supervision [pdf (slides)]; Weak Supervision (spring quarter) [old draft, in lecture]. 10/29: Midterm: The midterm details TBD.

The probability of the data is given by

we get $\theta_0 = 89.60$

To formalize this, we will define a function

pages full of matrices of derivatives, let’s introduce some notation for doing

Due 6/29 at 11:59pm.

Course Information. Time and Location: Mon, Wed 10:00 AM – 11:20 AM on Zoom.

large, stochastic gradient descent can start making progress right away, and

approximations to the true minimum.
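To tie the logistic regression pieces together, here is a small sketch that models $P(y = 1 \mid x; \theta) = h_\theta(x) = g(\theta^T x)$ with the sigmoid $g$, and fits $\theta$ by gradient ascent on the log likelihood $\ell(\theta) = \sum_i y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))$. It uses the batch form of the update (summing over all examples) rather than the stochastic one; the function names, the toy data and $\alpha = 0.05$ are assumptions made for this example.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), read as P(y = 1 | x; theta)."""
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))


def log_likelihood(theta, X, y):
    """ell(theta) for independently generated Bernoulli labels."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = hypothesis(theta, xi)
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return total


def gradient_ascent(X, y, alpha, iterations):
    """Batch update theta_j := theta_j + alpha * sum_i (y_i - h(x_i)) x_ij."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        errors = [yi - hypothesis(theta, xi) for xi, yi in zip(X, y)]
        theta = [
            t + alpha * sum(e * xi[j] for e, xi in zip(errors, X))
            for j, t in enumerate(theta)
        ]
    return theta


# Toy data with an intercept column; the labels overlap, so the maximum
# likelihood estimate is finite.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]
theta = gradient_ascent(X, y, alpha=0.05, iterations=2000)
```

The update has the same form as the LMS rule, but it is a different algorithm because $h_\theta(x)$ is now a non-linear function of $\theta^T x$.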
The notation “$p(y^{(i)} \mid x^{(i)}; \theta)$” indicates that this is the distribution of $y^{(i)}$

CS229 Lecture Notes, Andrew Ng and Kian Katanforoosh (updated Backpropagation by Anand Avati). Deep Learning. We now begin our study of deep learning.

family of algorithms.

We begin by re-writing $J$ in

We want to choose $\theta$ so as to minimize $J(\theta)$.

regression example, we had $y \mid x; \theta \sim \mathcal{N}(\mu, \sigma^2)$, and in the classification one,

the stochastic gradient ascent rule

If we compare this to the LMS update rule, we see that it looks identical; but

We say that a class of distributions is in the exponential family

We could approach the classification problem ignoring the fact that $y$ is

about the locally weighted linear regression (LWR) algorithm which, assum-

The first is replace it with the following algorithm:

By grouping the updates of the coordinates into an update of the vector

Let us assume that the target variables and the inputs are related via the

discrete-valued, and use our old linear regression algorithm to try to predict

at every example in the entire training set on every step, and is called batch

Lecture 0: Introduction and Logistics; Class Notes.

Newton’s method gives a way of getting to $f(\theta) = 0$.

point $x$ (i.e., to evaluate $h(x)$), we would:

In contrast, the locally weighted linear regression algorithm does the following:
