1 00:00:00,240 --> 00:00:01,410 In this video, I'd like to 2 00:00:01,560 --> 00:00:03,590 talk about the Gaussian distribution, which 3 00:00:03,830 --> 00:00:05,810 is also called the normal distribution. 4 00:00:07,430 --> 00:00:08,940 In case you're already intimately 5 00:00:09,620 --> 00:00:11,980 familiar with the Gaussian distribution, it is 6 00:00:12,160 --> 00:00:13,810 probably okay to skip this video. 7 00:00:14,640 --> 00:00:15,890 But if you're not sure or 8 00:00:15,970 --> 00:00:16,890 if it's been a while since you've 9 00:00:17,040 --> 00:00:18,770 worked with the Gaussian distribution or normal 10 00:00:19,020 --> 00:00:20,480 distribution, then please do 11 00:00:20,610 --> 00:00:22,960 watch this video all the way to the end. 12 00:00:23,220 --> 00:00:24,260 And in the video after this, 13 00:00:24,480 --> 00:00:25,740 we'll start applying the Gaussian 14 00:00:25,980 --> 00:00:28,890 distribution to developing an anomaly detection algorithm. 15 00:00:31,990 --> 00:00:33,310 Let's say x is a 16 00:00:33,540 --> 00:00:36,470 real-valued random variable, so x is a real number. 17 00:00:37,380 --> 00:00:39,080 If the probability distribution of 18 00:00:39,270 --> 00:00:41,160 x is Gaussian 19 00:00:41,400 --> 00:00:42,710 with mean Mu and variance 20 00:00:43,110 --> 00:00:45,360 sigma squared, then we'll 21 00:00:45,540 --> 00:00:47,600 write this as x, the 22 00:00:47,690 --> 00:00:49,270 random variable, tilde. 23 00:00:51,930 --> 00:00:53,300 This little tilde is read 24 00:00:53,540 --> 00:00:59,520 as "distributed as", and then 25 00:00:59,730 --> 00:01:01,550 to denote the Gaussian distribution, 26 00:01:02,070 --> 00:01:03,930 we're going to write script N, parentheses 27 00:01:04,830 --> 00:01:07,140 Mu, comma sigma squared.
28 00:01:07,470 --> 00:01:09,310 So, this script N stands for 29 00:01:09,530 --> 00:01:10,920 normal, since Gaussian and normal 30 00:01:11,300 --> 00:01:12,170 distribution mean the same 31 00:01:12,390 --> 00:01:14,660 thing, they are synonymous. And a 32 00:01:14,780 --> 00:01:16,190 Gaussian distribution is parameterized 33 00:01:17,070 --> 00:01:18,430 by two parameters, by a 34 00:01:19,010 --> 00:01:20,930 mean parameter, which we 35 00:01:21,020 --> 00:01:22,770 denote Mu, and a variance 36 00:01:23,090 --> 00:01:25,010 parameter, which we denote by sigma squared. 37 00:01:26,120 --> 00:01:27,270 If we plot the Gaussian distribution 38 00:01:27,990 --> 00:01:30,100 or Gaussian probability density, it 39 00:01:30,220 --> 00:01:31,760 will look like the bell-shaped 40 00:01:32,100 --> 00:01:34,820 curve, which you may have seen before. 41 00:01:36,230 --> 00:01:37,860 And so, this bell-shaped curve 42 00:01:38,110 --> 00:01:40,350 is parameterized by those two parameters, Mu and sigma. 43 00:01:41,330 --> 00:01:42,670 And the location of the 44 00:01:42,930 --> 00:01:44,230 center of this bell-shaped curve 45 00:01:44,580 --> 00:01:46,960 is the mean Mu, and the 46 00:01:47,050 --> 00:01:48,150 width of this bell-shaped curve, 47 00:01:49,430 --> 00:01:51,020 so it's roughly that, is 48 00:01:51,290 --> 00:01:52,970 this parameter 49 00:01:53,500 --> 00:01:55,390 sigma, which is also called one standard deviation.
50 00:01:56,540 --> 00:01:58,350 And so, this specifies the 51 00:01:58,530 --> 00:01:59,630 probability of x taking 52 00:01:59,910 --> 00:02:00,990 on different values, so the probability of x 53 00:02:01,190 --> 00:02:02,730 taking on values, you know, 54 00:02:02,810 --> 00:02:03,770 in the middle here is pretty high 55 00:02:04,020 --> 00:02:05,290 since the Gaussian density here 56 00:02:05,400 --> 00:02:06,490 is pretty high, whereas 57 00:02:06,610 --> 00:02:08,540 x taking on values further and 58 00:02:08,720 --> 00:02:10,310 further away will be diminishing 59 00:02:10,860 --> 00:02:12,600 in probability. Finally, just 60 00:02:12,920 --> 00:02:13,770 for completeness, let me write 61 00:02:14,020 --> 00:02:15,260 out the formula for the Gaussian 62 00:02:16,080 --> 00:02:17,310 distribution. So the probability 63 00:02:17,710 --> 00:02:19,780 of x, and I'll 64 00:02:19,940 --> 00:02:20,940 sometimes write this, instead of 65 00:02:21,050 --> 00:02:22,070 p of x, I'm going 66 00:02:22,190 --> 00:02:22,960 to write this as p of 67 00:02:23,350 --> 00:02:24,930 x semicolon Mu comma sigma squared. 68 00:02:25,500 --> 00:02:26,750 And so this denotes that the probability of 69 00:02:26,910 --> 00:02:28,670 x is parameterized by 70 00:02:28,810 --> 00:02:30,660 the two parameters Mu and sigma squared. 71 00:02:31,940 --> 00:02:33,330 And the formula for the 72 00:02:33,370 --> 00:02:34,760 Gaussian density is this: 73 00:02:35,170 --> 00:02:37,860 1 over root 2 pi times sigma, times e 74 00:02:38,070 --> 00:02:41,510 to the negative of x minus Mu, squared, over 2 sigma squared. 75 00:02:41,870 --> 00:02:45,980 So there's no need to memorize this 76 00:02:46,470 --> 00:02:47,530 formula, you know, this 77 00:02:47,690 --> 00:02:49,410 is just the formula for the 78 00:02:49,540 --> 00:02:51,020 bell-shaped curve over here on the left.
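As a minimal sketch, the Gaussian density just described, with the square root taken over 2 pi, can be written in Python. The function name `gaussian_density` and its argument names are assumptions chosen for illustration and do not come from the lecture. The third argument is the variance sigma squared, to match the p(x; Mu, sigma squared) notation.

```python
import math


def gaussian_density(x, mu, sigma_squared):
    """Evaluate p(x; mu, sigma^2) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))."""
    if sigma_squared <= 0:
        raise ValueError("sigma_squared must be positive")
    sigma = math.sqrt(sigma_squared)
    normalizer = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    exponent = -((x - mu) ** 2) / (2.0 * sigma_squared)
    return normalizer * math.exp(exponent)


# Density at the center of a Gaussian with mu = 0 and sigma^2 = 1.
print(gaussian_density(0.0, 0.0, 1.0))
```

For Mu equal to zero and sigma squared equal to 1, the value at the center is 1 over root 2 pi, which is roughly 0.399.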
79 00:02:51,700 --> 00:02:53,100 There's no need to memorize it, and 80 00:02:53,270 --> 00:02:53,990 if you ever need to use this, 81 00:02:54,190 --> 00:02:56,460 you can always look this up. 82 00:02:56,540 --> 00:02:57,450 And so that figure on the 83 00:02:57,740 --> 00:02:58,420 left, that is what you get 84 00:02:58,910 --> 00:03:00,100 if you take a fixed 85 00:03:00,290 --> 00:03:01,200 value of Mu and a 86 00:03:01,250 --> 00:03:04,070 fixed value of sigma and 87 00:03:04,450 --> 00:03:06,140 you plot p of x. So this 88 00:03:06,870 --> 00:03:07,830 curve here, this is really 89 00:03:08,390 --> 00:03:10,000 p of x plotted as a 90 00:03:10,030 --> 00:03:11,540 function of x, you know, 91 00:03:11,640 --> 00:03:15,970 for a fixed value of Mu 92 00:03:16,190 --> 00:03:18,770 and of sigma squared. Sigma squared, that's called the variance. 93 00:03:19,950 --> 00:03:22,270 And sometimes it's easier to think in terms of sigma. 94 00:03:22,950 --> 00:03:24,730 So sigma is called the 95 00:03:25,120 --> 00:03:27,850 standard deviation, 96 00:03:28,000 --> 00:03:29,640 and it specifies the 97 00:03:29,800 --> 00:03:31,310 width of this Gaussian probability 98 00:03:31,730 --> 00:03:33,120 density, whereas the square 99 00:03:33,330 --> 00:03:34,490 of sigma, so sigma squared, is 100 00:03:34,620 --> 00:03:36,830 called the variance. Let's look 101 00:03:37,000 --> 00:03:39,980 at some examples of what the Gaussian distribution looks like. 102 00:03:41,010 --> 00:03:43,280 If Mu equals zero and sigma equals 1, 103 00:03:43,650 --> 00:03:44,730 then we have a Gaussian distribution 104 00:03:45,480 --> 00:03:48,000 that is centered around zero, because that's Mu. 105 00:03:48,810 --> 00:03:50,560 And the width of this Gaussian, so 106 00:03:50,730 --> 00:03:53,610 that's one standard deviation, is sigma over there. 107 00:03:55,140 --> 00:03:56,330 Let's look at some examples of 108 00:03:56,700 --> 00:03:58,770 Gaussians.
If Mu 109 00:03:58,970 --> 00:04:00,750 is equal to zero and sigma equals 1, 110 00:04:00,950 --> 00:04:02,150 then that corresponds to a 111 00:04:02,370 --> 00:04:04,030 Gaussian distribution that is centered 112 00:04:04,770 --> 00:04:06,380 at zero, since Mu is zero. 113 00:04:07,390 --> 00:04:08,310 And the width of this Gaussian 114 00:04:10,810 --> 00:04:12,570 is controlled 115 00:04:13,010 --> 00:04:15,430 by sigma, by that standard deviation parameter sigma. 116 00:04:16,850 --> 00:04:17,390 Here's another example. 117 00:04:20,520 --> 00:04:21,270 Let's say Mu is equal to 118 00:04:21,550 --> 00:04:23,670 zero and sigma is equal to one-half. 119 00:04:24,200 --> 00:04:26,290 So the standard deviation is 120 00:04:26,530 --> 00:04:27,650 one-half and the variance 121 00:04:28,280 --> 00:04:29,550 sigma squared would therefore be 122 00:04:29,710 --> 00:04:33,600 the square of 0.5, which would be 0.25. 123 00:04:33,680 --> 00:04:34,910 And in that case the Gaussian distribution, 124 00:04:35,600 --> 00:04:37,040 the Gaussian probability density, looks 125 00:04:37,180 --> 00:04:39,490 like this. It is also centered at zero. 126 00:04:40,110 --> 00:04:41,410 But now the width of 127 00:04:41,600 --> 00:04:43,250 this is much smaller because 128 00:04:43,620 --> 00:04:45,170 of the smaller variance. The 129 00:04:45,520 --> 00:04:46,980 width of this Gaussian density 130 00:04:47,450 --> 00:04:49,350 is roughly half as wide. 131 00:04:50,550 --> 00:04:51,710 But because this is a 132 00:04:51,970 --> 00:04:53,590 probability distribution, the area under 133 00:04:53,800 --> 00:04:54,850 the curve, that is the shaded 134 00:04:55,310 --> 00:04:56,790 area there, that area 135 00:04:57,180 --> 00:04:58,810 must integrate to 1. 136 00:04:58,810 --> 00:05:00,500 This is a property of probability distributions.
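A rough numerical check of the area property, assuming the standard Gaussian density formula: the sketch below approximates the area under the curve with a midpoint sum and also reports the peak height, for sigma equal to 1 and sigma equal to one-half. The helper names `gaussian_density` and `area_under_curve`, the integration range, and the step count are all assumptions made for illustration.

```python
import math


def gaussian_density(x, mu, sigma_squared):
    """Gaussian density with mean mu and variance sigma_squared."""
    sigma = math.sqrt(sigma_squared)
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma_squared)) / (math.sqrt(2.0 * math.pi) * sigma)


def area_under_curve(mu, sigma_squared, half_width=10.0, steps=20000):
    """Approximate the area with a midpoint sum over [mu - half_width, mu + half_width]."""
    step = 2.0 * half_width / steps
    start = mu - half_width
    total = sum(
        gaussian_density(start + (i + 0.5) * step, mu, sigma_squared)
        for i in range(steps)
    )
    return total * step


for sigma in (1.0, 0.5):
    peak = gaussian_density(0.0, 0.0, sigma ** 2)
    area = area_under_curve(0.0, sigma ** 2)
    print(f"sigma={sigma}: peak height={peak:.4f}, area={area:.4f}")
```

The peak height is 1 over root 2 pi times sigma, so halving sigma doubles the peak, while both areas come out very close to 1.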
137 00:05:01,650 --> 00:05:02,680 And so, you know, this 138 00:05:02,830 --> 00:05:04,530 is a much taller Gaussian density because 139 00:05:04,820 --> 00:05:06,050 it's half as wide, with 140 00:05:06,200 --> 00:05:08,150 half the standard deviation, so it's twice as tall. 141 00:05:09,130 --> 00:05:11,510 Another example: if sigma is 142 00:05:11,640 --> 00:05:12,540 equal to 2, then you 143 00:05:12,650 --> 00:05:14,870 get a much fatter, or much wider, Gaussian density. 144 00:05:15,310 --> 00:05:17,090 And so here, the sigma 145 00:05:17,370 --> 00:05:19,300 parameter controls that this 146 00:05:19,630 --> 00:05:21,000 Gaussian density has a wider width. 147 00:05:21,930 --> 00:05:23,180 And once again, the area under 148 00:05:23,220 --> 00:05:24,390 the curve, that is this shaded 149 00:05:24,700 --> 00:05:26,720 area, you know, it always integrates to 1. 150 00:05:26,840 --> 00:05:28,170 That's a property of probability 151 00:05:28,800 --> 00:05:30,280 distributions, and because it's 152 00:05:30,480 --> 00:05:31,930 wider, it's also half as 153 00:05:32,650 --> 00:05:36,640 tall, in order to still integrate to the same thing. 154 00:05:36,750 --> 00:05:37,520 And finally, one last example would be, 155 00:05:37,880 --> 00:05:38,980 if we now change the Mu 156 00:05:39,130 --> 00:05:40,660 parameter as well, then instead 157 00:05:41,000 --> 00:05:42,320 of being centered at zero, we 158 00:05:42,410 --> 00:05:43,840 now have a Gaussian distribution 159 00:05:44,830 --> 00:05:46,810 that is centered at three, because 160 00:05:47,710 --> 00:05:49,740 this shifts over the entire Gaussian distribution. 161 00:05:51,170 --> 00:05:54,040 Next, let's talk about the parameter estimation problem. 162 00:05:55,100 --> 00:05:56,570 So what is the parameter estimation problem?
163 00:05:57,520 --> 00:05:58,350 Let's say we have a data set 164 00:05:58,850 --> 00:06:00,180 of m examples, so x(1) 165 00:06:00,350 --> 00:06:01,470 through x(m), and let's say 166 00:06:01,710 --> 00:06:03,250 each of these examples is a real number. 167 00:06:04,200 --> 00:06:05,520 Here in the figure, I've plotted an 168 00:06:05,620 --> 00:06:06,390 example of a data set, 169 00:06:06,580 --> 00:06:08,390 so the horizontal axis is the 170 00:06:08,580 --> 00:06:09,430 x axis and, you know, I 171 00:06:09,560 --> 00:06:12,290 have a range of examples of x and I've just plotted them 172 00:06:12,560 --> 00:06:15,060 on this figure here. 173 00:06:15,260 --> 00:06:17,280 And the parameter estimation problem is, let's 174 00:06:17,500 --> 00:06:18,750 say I suspect that these examples 175 00:06:19,450 --> 00:06:21,160 came from a Gaussian distribution, so 176 00:06:21,300 --> 00:06:24,560 let's say I suspect that each of my examples x(i) was distributed, 177 00:06:25,300 --> 00:06:26,930 that's what this tilde thing means. 178 00:06:27,590 --> 00:06:28,520 So, I suspect that each of 179 00:06:28,580 --> 00:06:30,220 these examples was distributed according 180 00:06:30,760 --> 00:06:32,190 to a normal distribution or 181 00:06:32,250 --> 00:06:34,060 Gaussian distribution with some 182 00:06:34,300 --> 00:06:36,210 parameter Mu and some parameter sigma squared. 183 00:06:37,570 --> 00:06:39,560 But I don't know what the values of these parameters are. 184 00:06:40,820 --> 00:06:42,360 The problem of parameter estimation is, 185 00:06:43,160 --> 00:06:44,480 given my data set, I want 186 00:06:44,800 --> 00:06:45,720 to figure out, I want to 187 00:06:45,880 --> 00:06:46,840 estimate, what are the 188 00:06:46,990 --> 00:06:48,470 values of Mu and sigma squared.
189 00:06:49,620 --> 00:06:50,570 So, if you're given a 190 00:06:50,640 --> 00:06:51,660 data set like this, you know, 191 00:06:51,790 --> 00:06:54,050 it looks like maybe, if I 192 00:06:54,190 --> 00:06:56,210 estimate what Gaussian distribution the 193 00:06:56,350 --> 00:06:59,010 data came from, maybe that 194 00:07:00,660 --> 00:07:01,770 might be roughly the Gaussian distribution 195 00:07:02,280 --> 00:07:04,410 it came from, with Mu 196 00:07:05,500 --> 00:07:07,350 being the center of the distribution and 197 00:07:07,990 --> 00:07:11,680 sigma, the standard deviation, controlling the width of this Gaussian distribution. 198 00:07:12,140 --> 00:07:12,820 It seems like a reasonable 199 00:07:13,260 --> 00:07:15,280 fit to the data, because, you know, it 200 00:07:15,440 --> 00:07:16,880 looks like the data has 201 00:07:17,110 --> 00:07:18,910 a very high probability of being 202 00:07:19,240 --> 00:07:20,590 in the central region, low probability of 203 00:07:21,640 --> 00:07:24,720 being further out, low probability of being further out, and so on. 204 00:07:24,780 --> 00:07:25,770 And so, maybe this is 205 00:07:25,890 --> 00:07:27,360 a reasonable estimate of 206 00:07:28,020 --> 00:07:29,920 Mu and of sigma squared, 207 00:07:30,410 --> 00:07:31,810 that is, if it corresponds to 208 00:07:31,960 --> 00:07:33,970 a Gaussian distribution that then looks like this. 209 00:07:35,650 --> 00:07:36,340 So, what I'm going to 210 00:07:36,430 --> 00:07:37,550 do is write out the 211 00:07:37,660 --> 00:07:39,090 formulas, the standard formulas 212 00:07:39,750 --> 00:07:40,920 for estimating the parameters 213 00:07:41,130 --> 00:07:43,480 Mu and sigma squared. 214 00:07:44,110 --> 00:07:44,860 The way we are going to 215 00:07:45,390 --> 00:07:47,140 estimate Mu is going to 216 00:07:47,380 --> 00:07:48,850 be just the average 217 00:07:49,670 --> 00:07:50,630 of my examples.
218 00:07:51,210 --> 00:07:52,190 So Mu is the mean parameter, 219 00:07:52,750 --> 00:07:53,340 so I'm going to take my 220 00:07:53,380 --> 00:07:54,350 training set, take my m 221 00:07:54,450 --> 00:07:56,200 examples and average them. 222 00:07:56,470 --> 00:07:58,120 And that just gives me the center of this distribution. 223 00:08:01,150 --> 00:08:01,670 How about sigma squared? 224 00:08:01,890 --> 00:08:03,110 Well, the variance, I'll just 225 00:08:03,340 --> 00:08:04,890 write out the standard formula again, 226 00:08:05,150 --> 00:08:06,780 I'm going to estimate as 1 over m times the sum 227 00:08:07,280 --> 00:08:08,900 from i equals 1 through m of x(i) 228 00:08:09,150 --> 00:08:11,730 minus Mu, squared, 229 00:08:12,130 --> 00:08:13,130 and so this Mu here is 230 00:08:13,240 --> 00:08:14,030 actually the Mu that I compute 231 00:08:14,450 --> 00:08:15,580 over here using this formula, 232 00:08:16,790 --> 00:08:17,920 and what the variance is, or 233 00:08:18,040 --> 00:08:18,890 one interpretation of the variance, 234 00:08:19,440 --> 00:08:20,230 is that, if you look at 235 00:08:20,250 --> 00:08:21,580 this term, that's the squared 236 00:08:22,090 --> 00:08:23,580 difference between the value 237 00:08:24,020 --> 00:08:25,190 I've got in my example and 238 00:08:25,740 --> 00:08:28,300 the mean, the center, the mean of the distribution. 239 00:08:28,810 --> 00:08:29,690 And so, you know, the 240 00:08:29,730 --> 00:08:30,630 variance, I'm gonna estimate 241 00:08:31,250 --> 00:08:32,530 as just the average of 242 00:08:32,570 --> 00:08:35,520 the squared differences between my examples and the mean.
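The two estimates described above, the average of the examples for Mu and the average of the squared differences from that Mu for sigma squared, can be sketched in a few lines of Python. The function name `estimate_gaussian` and the toy data set are hypothetical and are used only for illustration.

```python
def estimate_gaussian(examples):
    """Return (mu, sigma_squared) estimated with the 1/m formulas."""
    m = len(examples)
    if m == 0:
        raise ValueError("need at least one example")
    # Mu is the average of the m examples.
    mu = sum(examples) / m
    # Sigma squared is the average squared difference from that Mu.
    sigma_squared = sum((x - mu) ** 2 for x in examples) / m
    return mu, sigma_squared


# Hypothetical toy data set of m = 8 real numbers.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma_squared = estimate_gaussian(data)
print(mu, sigma_squared)  # prints 5.0 4.0
```

Note that the variance estimate reuses the Mu computed in the first step, so the mean has to be estimated first.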
243 00:08:37,270 --> 00:08:38,370 And as a side comment, 244 00:08:38,850 --> 00:08:40,150 only for those of you that are experts 245 00:08:40,490 --> 00:08:41,820 in statistics, if you're 246 00:08:42,010 --> 00:08:43,690 an expert in statistics and if 247 00:08:43,830 --> 00:08:45,570 you've heard of maximum likelihood estimation, 248 00:08:46,680 --> 00:08:47,950 then these estimates are actually the 249 00:08:48,770 --> 00:08:50,530 maximum likelihood estimates of the parameters 250 00:08:50,680 --> 00:08:52,590 Mu and sigma squared. 251 00:08:53,220 --> 00:08:55,260 But if you haven't heard of that before, don't worry about it. 252 00:08:55,440 --> 00:08:56,500 All you need to know is that 253 00:08:56,750 --> 00:08:57,880 these are the two standard formulas 254 00:08:58,600 --> 00:09:01,090 for how you try to 255 00:09:01,520 --> 00:09:03,820 figure out what Mu and sigma squared are, given the data set. 256 00:09:05,050 --> 00:09:06,140 Finally, one last side comment. 257 00:09:06,650 --> 00:09:07,810 Again, only for those of 258 00:09:07,950 --> 00:09:10,520 you that have maybe taken a statistics class before. 259 00:09:10,880 --> 00:09:12,040 If you have taken a statistics 260 00:09:12,200 --> 00:09:13,530 class before, some of you 261 00:09:13,610 --> 00:09:14,620 may have seen the formula here 262 00:09:14,820 --> 00:09:15,810 where, you know, this is m minus 263 00:09:16,030 --> 00:09:17,300 1, instead of m. So 264 00:09:17,700 --> 00:09:19,310 this first term becomes 1 265 00:09:19,520 --> 00:09:20,410 over m minus 1, instead 266 00:09:20,450 --> 00:09:22,640 of 1 over m. In machine 267 00:09:22,960 --> 00:09:25,170 learning, people tend to use this 1 over m formula.
268 00:09:26,000 --> 00:09:27,230 But in practice, whether it 269 00:09:27,470 --> 00:09:28,480 is 1 over m or 1 270 00:09:28,550 --> 00:09:29,710 over m minus one makes essentially 271 00:09:30,170 --> 00:09:32,290 no difference, assuming, you know, m is 272 00:09:32,540 --> 00:09:34,670 reasonably large, that is, a large training set size. 273 00:09:35,310 --> 00:09:36,480 So, just in case you've seen 274 00:09:36,740 --> 00:09:38,570 this other version before, either 275 00:09:38,810 --> 00:09:39,970 version works just about equally 276 00:09:40,300 --> 00:09:41,630 well, but in machine 277 00:09:41,910 --> 00:09:42,850 learning most people tend to 278 00:09:42,970 --> 00:09:44,410 use 1 over m in this formula. 279 00:09:45,690 --> 00:09:46,740 And the two versions have slightly 280 00:09:47,070 --> 00:09:48,770 different theoretical properties, slightly different 281 00:09:49,030 --> 00:09:50,530 mathematical properties, but in 282 00:09:50,590 --> 00:09:54,080 practice it really makes very little difference, if any. 283 00:09:56,490 --> 00:09:57,670 So, hopefully, you now have 284 00:09:57,890 --> 00:09:58,900 a good sense of what the 285 00:09:59,020 --> 00:10:00,410 Gaussian distribution looks like, 286 00:10:00,740 --> 00:10:02,210 as well as how to estimate 287 00:10:02,270 --> 00:10:03,730 the parameters, Mu and sigma 288 00:10:04,010 --> 00:10:05,770 squared, of the Gaussian distribution, 289 00:10:05,910 --> 00:10:07,510 if you're given a training set, 290 00:10:07,850 --> 00:10:08,940 that is, if you're given a 291 00:10:09,240 --> 00:10:10,220 set of data that you suspect 292 00:10:11,130 --> 00:10:12,350 comes from a Gaussian 293 00:10:12,410 --> 00:10:15,190 distribution with unknown parameters Mu and sigma squared. 294 00:10:16,190 --> 00:10:17,510 In the next video, we'll start 295 00:10:17,810 --> 00:10:18,820 to take this and apply it 296 00:10:18,920 --> 00:10:20,810 to develop the anomaly detection algorithm.
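To see how little the choice between 1 over m and 1 over m minus 1 matters as m grows, here is a small sketch using Python's standard `statistics` module, where `pvariance` divides by m and `variance` divides by m minus 1. The helper name `compare_variance_estimates`, the random seed, and the synthetic data (draws with Mu equal to 3 and sigma equal to 2) are assumptions made for illustration.

```python
import random
import statistics


def compare_variance_estimates(data):
    """Return (variance with 1/m, variance with 1/(m-1)) for the same data."""
    var_m = statistics.pvariance(data)         # divides by m
    var_m_minus_1 = statistics.variance(data)  # divides by m - 1
    return var_m, var_m_minus_1


random.seed(0)  # fixed seed so repeated runs use the same draws

# Hypothetical data: draws from a Gaussian with mu = 3 and sigma = 2.
for m in (5, 50, 5000):
    data = [random.gauss(3.0, 2.0) for _ in range(m)]
    var_m, var_m_minus_1 = compare_variance_estimates(data)
    ratio = var_m / var_m_minus_1
    print(f"m={m}: 1/m -> {var_m:.4f}, 1/(m-1) -> {var_m_minus_1:.4f}, ratio={ratio:.4f}")
```

The two estimates always differ by the factor m minus 1 over m, so the ratio approaches 1 as the training set gets larger.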