1 00:00:05,930 --> 00:00:09,375 Dispersion, which is also called variability, 2 00:00:09,375 --> 00:00:12,030 scatter or spread, is the extent to which 3 00:00:12,030 --> 00:00:15,045 the data distribution is stretched or squeezed. 4 00:00:15,045 --> 00:00:16,350 The common measures of 5 00:00:16,350 --> 00:00:19,095 dispersion are standard deviation and variance. 6 00:00:19,095 --> 00:00:21,660 If you are at a university or college, 7 00:00:21,660 --> 00:00:23,715 you may have heard about the bell curve, 8 00:00:23,715 --> 00:00:25,395 which looks like this. 9 00:00:25,395 --> 00:00:27,690 You will often hear that a value is within 10 00:00:27,690 --> 00:00:29,790 one standard deviation of the mean or 11 00:00:29,790 --> 00:00:32,190 within two standard deviations of the mean. 12 00:00:32,190 --> 00:00:34,335 Let's see what that means. 13 00:00:34,335 --> 00:00:36,660 Let's look at the ages of instructors. 14 00:00:36,660 --> 00:00:39,270 Let's say the average age is 52. 15 00:00:39,270 --> 00:00:42,045 This means that the individual ages may differ; 16 00:00:42,045 --> 00:00:46,085 some may be 48 or maybe 55 or 75. 17 00:00:46,085 --> 00:00:48,500 So the average age is an estimate. 18 00:00:48,500 --> 00:00:49,850 But what we also need is 19 00:00:49,850 --> 00:00:52,340 an estimate for the dispersion in the data-set. 20 00:00:52,340 --> 00:00:55,435 The other thing to note is the range in our data-set. 21 00:00:55,435 --> 00:00:59,060 For example, the ages range from a minimum 22 00:00:59,060 --> 00:01:02,570 of 29 years of age to a maximum of 73 years. 23 00:01:02,570 --> 00:01:05,210 The range refers to the distance, or 24 00:01:05,210 --> 00:01:08,705 the difference, between the minimum and the maximum.
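As a rough sketch of the mean and range just described, the snippet below uses a small, made-up list of ages. The individual values are an assumption, chosen only so that the mean is 52, the minimum is 29, and the maximum is 73; they are not the course data.

```python
import statistics

# Hypothetical ages, invented for illustration; not the real data-set.
ages = [29, 48, 55, 55, 73]

mean_age = statistics.mean(ages)   # central estimate of the ages
youngest = min(ages)
oldest = max(ages)
age_range = oldest - youngest      # distance between minimum and maximum

print(f"mean={mean_age}, min={youngest}, max={oldest}, range={age_range}")
```

The range only looks at the two extreme values, which is one reason the lecture goes on to variance and standard deviation.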
25 00:01:08,705 --> 00:01:12,560 Unlike the population and sample means, which are computed the same way, 26 00:01:12,560 --> 00:01:14,270 the variance is computed differently for 27 00:01:14,270 --> 00:01:16,340 the population and a sample: 28 00:01:16,340 --> 00:01:18,410 when you compute the population variance, 29 00:01:18,410 --> 00:01:20,120 denoted as sigma squared, 30 00:01:20,120 --> 00:01:23,660 you divide by the total number of observations. 31 00:01:23,660 --> 00:01:25,565 The deviations between 32 00:01:25,565 --> 00:01:27,725 each observation and the mean are squared, 33 00:01:27,725 --> 00:01:29,990 then added, and then divided by 34 00:01:29,990 --> 00:01:32,305 the total number of observations. 35 00:01:32,305 --> 00:01:36,545 For sample variance, which is denoted as S squared, 36 00:01:36,545 --> 00:01:39,115 you divide by N minus 1. 37 00:01:39,115 --> 00:01:42,170 The purpose of using N minus 1 is so 38 00:01:42,170 --> 00:01:45,320 that our estimate is unbiased in the long run. 39 00:01:45,320 --> 00:01:48,005 That means if we take a second sample, 40 00:01:48,005 --> 00:01:50,545 we'll get a different value of S squared. 41 00:01:50,545 --> 00:01:52,595 If we take a third sample, 42 00:01:52,595 --> 00:01:56,275 we'll get a third value of S squared and so on. 43 00:01:56,275 --> 00:01:59,150 We use N minus 1 so that the average of 44 00:01:59,150 --> 00:02:03,740 all these values of S squared is equal to sigma squared. 45 00:02:03,740 --> 00:02:06,110 We usually talk about the square root, called the 46 00:02:06,110 --> 00:02:08,695 standard deviation, rather than the variance. 47 00:02:08,695 --> 00:02:11,540 Standard deviation is essentially the square root of 48 00:02:11,540 --> 00:02:14,810 the variance, since the variance is in squared units. 49 00:02:14,810 --> 00:02:17,540 It's good to use the standard deviation because 50 00:02:17,540 --> 00:02:20,240 it's in exactly the same units as the variable.
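To make the N versus N minus 1 point concrete, here is a minimal sketch. The `ages` list and the helper name `variance` are assumptions made up for this illustration. The last part treats the five ages as a whole population, lists every possible sample of size 2 drawn with replacement, and averages the resulting values of S squared.

```python
import itertools
import math

# Hypothetical ages, invented for illustration; not the real data-set.
ages = [29, 48, 55, 55, 73]

def variance(values, ddof):
    """Sum of squared deviations from the mean, divided by (n - ddof)."""
    mean = sum(values) / len(values)
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (len(values) - ddof)

population_variance = variance(ages, ddof=0)   # sigma squared: divide by N
sample_variance = variance(ages, ddof=1)       # S squared: divide by N - 1
sample_std = math.sqrt(sample_variance)        # back in years, not years squared

# Every possible sample of size 2, drawn with replacement from the population.
samples = list(itertools.product(ages, repeat=2))
avg_s2_n_minus_1 = sum(variance(s, ddof=1) for s in samples) / len(samples)
avg_s2_n = sum(variance(s, ddof=0) for s in samples) / len(samples)

print("population variance:", population_variance)
print("average S squared with N - 1:", avg_s2_n_minus_1)
print("average S squared with N:", avg_s2_n)
```

In this small setup the N minus 1 version averages out to the population variance, while dividing by N comes out too small, which is the "unbiased in the long run" idea.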
51 00:02:20,240 --> 00:02:22,700 The standard deviation of age will also 52 00:02:22,700 --> 00:02:25,430 be measured in years rather than years squared. 53 00:02:25,430 --> 00:02:27,950 Here you see that we just took the square root of 54 00:02:27,950 --> 00:02:31,415 the variance, and this becomes the standard deviation. 55 00:02:31,415 --> 00:02:33,860 We'll return to our data-set and we'll 56 00:02:33,860 --> 00:02:36,230 look at the variables that we computed before. 57 00:02:36,230 --> 00:02:38,270 You can see that the standard deviations 58 00:02:38,270 --> 00:02:40,460 for beauty, evaluation scores, 59 00:02:40,460 --> 00:02:41,660 and age were computed with 60 00:02:41,660 --> 00:02:43,160 the descriptive statistics using 61 00:02:43,160 --> 00:02:45,590 the describe function in Python. 62 00:02:45,590 --> 00:02:47,660 Let me explain why mean and 63 00:02:47,660 --> 00:02:50,330 standard deviation have to go hand in hand. 64 00:02:50,330 --> 00:02:51,950 I will use the example from 65 00:02:51,950 --> 00:02:53,855 basketball about the two giants, 66 00:02:53,855 --> 00:02:56,150 Michael Jordan and Wilt Chamberlain, 67 00:02:56,150 --> 00:02:58,285 who preceded Michael Jordan. 68 00:02:58,285 --> 00:03:01,115 If you consider their average score per game, 69 00:03:01,115 --> 00:03:03,515 you would notice that they didn't differ much. 70 00:03:03,515 --> 00:03:05,570 Their average was around 30 points 71 00:03:05,570 --> 00:03:07,100 for both Jordan and Chamberlain. 72 00:03:07,100 --> 00:03:08,720 However, when you look at 73 00:03:08,720 --> 00:03:10,910 the standard deviation of their performance, 74 00:03:10,910 --> 00:03:13,880 Jordan's was around 4.76 compared to 75 00:03:13,880 --> 00:03:17,540 Chamberlain's, which was around 10.59. 76 00:03:17,540 --> 00:03:19,550 Suppose you were to plot the distribution of 77 00:03:19,550 --> 00:03:21,080 Michael Jordan's scores using 78 00:03:21,080 --> 00:03:23,270 the mean and standard deviation.
79 00:03:23,270 --> 00:03:26,420 Assuming that their scores are normally distributed, 80 00:03:26,420 --> 00:03:28,160 you would notice that even though 81 00:03:28,160 --> 00:03:30,065 both players had around the same mean, 82 00:03:30,065 --> 00:03:32,870 the tighter distribution for Jordan suggests that he 83 00:03:32,870 --> 00:03:36,770 was more consistent in his performance than Chamberlain. 84 00:03:36,770 --> 00:03:38,885 The main takeaway is that 85 00:03:38,885 --> 00:03:41,350 the average will only paint a partial picture. 86 00:03:41,350 --> 00:03:43,070 If you really want to understand 87 00:03:43,070 --> 00:03:45,890 the complete picture about a variable or data-set, 88 00:03:45,890 --> 00:03:48,530 it is important to compute both the average and 89 00:03:48,530 --> 00:03:49,985 the standard deviation to get 90 00:03:49,985 --> 00:03:52,690 insight into what the data is telling us. 91 00:03:52,690 --> 00:03:55,610 So a mean with a standard deviation tells you 92 00:03:55,610 --> 00:03:59,340 something more useful than the mean by itself.
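The Jordan and Chamberlain comparison can be sketched with two normal curves. This is only an illustration under stated assumptions: both means are taken as exactly 30 points (the lecture says "around 30"), the scores are assumed to be normally distributed as in the lecture, and the helper `prob_between` is a name made up for this example.

```python
from statistics import NormalDist

# Assumption: both averages are treated as exactly 30 points per game.
# The standard deviations are the ones quoted in the lecture.
jordan = NormalDist(mu=30, sigma=4.76)
chamberlain = NormalDist(mu=30, sigma=10.59)

def prob_between(dist, low, high):
    """Probability that a normally distributed score lies between low and high."""
    return dist.cdf(high) - dist.cdf(low)

# Share of games expected to land within 5 points of the 30-point average.
p_jordan = prob_between(jordan, 25, 35)
p_chamberlain = prob_between(chamberlain, 25, 35)

# Share of games within one standard deviation of each player's own mean.
within_one_sd_jordan = prob_between(jordan, 30 - 4.76, 30 + 4.76)
within_one_sd_chamberlain = prob_between(chamberlain, 30 - 10.59, 30 + 10.59)

print(f"within 5 points: Jordan {p_jordan:.2f}, Chamberlain {p_chamberlain:.2f}")
print(f"within 1 SD: Jordan {within_one_sd_jordan:.2f}, "
      f"Chamberlain {within_one_sd_chamberlain:.2f}")
```

Under these assumptions, a larger share of Jordan's games falls close to the shared average, which is what the tighter curve means. About 68 percent of either curve lies within one standard deviation of its mean, but that band is much wider in points for Chamberlain than for Jordan.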