1 00:00:06,200 --> 00:00:09,210 The measures of central tendency are the most 2 00:00:09,210 --> 00:00:11,925 commonly used in statistical analysis. 3 00:00:11,925 --> 00:00:14,235 We know them as mean, median, 4 00:00:14,235 --> 00:00:15,900 and mode, and their use is 5 00:00:15,900 --> 00:00:18,195 ubiquitous in statistical analysis. 6 00:00:18,195 --> 00:00:20,685 So let us see how they work. 7 00:00:20,685 --> 00:00:23,490 Before we begin, let us take 8 00:00:23,490 --> 00:00:25,920 a quick look at our dataset. In this course, 9 00:00:25,920 --> 00:00:28,515 we have been using the teaching evaluation data 10 00:00:28,515 --> 00:00:30,285 from the University of Texas. 11 00:00:30,285 --> 00:00:34,080 The dataset comprises 463 courses, 12 00:00:34,080 --> 00:00:35,550 in which we have information about 13 00:00:35,550 --> 00:00:38,730 the teaching evaluation score received by the instructor. 14 00:00:38,730 --> 00:00:40,190 We have information about 15 00:00:40,190 --> 00:00:42,050 the attributes of the instructor, 16 00:00:42,050 --> 00:00:45,065 as well as the characteristics of the course. 17 00:00:45,065 --> 00:00:46,955 Once you have imported 18 00:00:46,955 --> 00:00:50,360 a CSV file with the pandas Python library, 19 00:00:50,360 --> 00:00:52,730 the first step in getting to know your data is to 20 00:00:52,730 --> 00:00:55,750 discover the different data types it contains. 21 00:00:55,750 --> 00:00:57,990 You can display all columns and 22 00:00:57,990 --> 00:01:00,950 their data types with DataFrame.info(). 23 00:01:00,950 --> 00:01:05,310 In this case, we have named our DataFrame ratings_df. 24 00:01:05,530 --> 00:01:08,660 It tells you how many rows you have. 25 00:01:08,660 --> 00:01:10,595 For the teaching rating data, 26 00:01:10,595 --> 00:01:13,370 we have 463 entries from zero to 27 00:01:13,370 --> 00:01:17,255 462 because Python starts counting from zero.
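As a rough illustration of the step just described, here is a minimal sketch of calling `info()` on a DataFrame. The four rows and the column names (`eval`, `age`, `gender`) are made up for this example and are not taken from the course dataset; only the name `ratings_df` comes from the video.

```python
import io
import pandas as pd

# Made-up stand-in for the course data; column names and values are assumptions.
ratings_df = pd.DataFrame({
    "eval": [4.3, 3.7, 4.1, 4.5],                    # decimal numbers
    "age": [36, 59, 51, 40],                         # whole numbers
    "gender": ["female", "male", "male", "female"],  # strings
})

# info() normally prints to the screen; writing into a buffer lets us keep the text.
buffer = io.StringIO()
ratings_df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```

The printed summary reports the number of entries, the index range starting at zero, and one line per column with its data type.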
28 00:01:17,255 --> 00:01:20,555 Then it also gives you information about the data types. 29 00:01:20,555 --> 00:01:22,630 Object represents strings. 30 00:01:22,630 --> 00:01:26,450 int64 represents integers, or whole numbers, 31 00:01:26,450 --> 00:01:28,640 and float represents real numbers, 32 00:01:28,640 --> 00:01:31,265 which can have decimal places. 33 00:01:31,265 --> 00:01:33,710 Before we begin, let us have 34 00:01:33,710 --> 00:01:36,500 a conversation about population and samples. 35 00:01:36,500 --> 00:01:38,270 Essentially, if you have 36 00:01:38,270 --> 00:01:39,680 all the information of interest 37 00:01:39,680 --> 00:01:41,150 for a particular decision, 38 00:01:41,150 --> 00:01:42,830 about every individual that is 39 00:01:42,830 --> 00:01:44,900 supposed to be involved in that decision, 40 00:01:44,900 --> 00:01:46,910 that is called a population. 41 00:01:46,910 --> 00:01:48,770 So if you are interested in looking 42 00:01:48,770 --> 00:01:50,570 at some attribute of driving, 43 00:01:50,570 --> 00:01:52,085 and we have information about 44 00:01:52,085 --> 00:01:55,115 all possible automobile drivers in the US, 45 00:01:55,115 --> 00:01:57,605 then we call this the population. 46 00:01:57,605 --> 00:01:59,945 The sample, on the other hand, 47 00:01:59,945 --> 00:02:01,910 is a subset of the population. 48 00:02:01,910 --> 00:02:04,700 So for example, if we have 49 00:02:04,700 --> 00:02:07,925 data on all married drivers over the age of 25, 50 00:02:07,925 --> 00:02:09,820 then that's a subset. 51 00:02:09,820 --> 00:02:13,160 Within that subset, if we were to randomly 52 00:02:13,160 --> 00:02:14,415 select 5 percent of 53 00:02:14,415 --> 00:02:17,225 those married drivers over the age of 25, 54 00:02:17,225 --> 00:02:19,400 that would be our sample.
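The population, subset, and sample idea above can be sketched in pandas. This is only an illustration under stated assumptions: the `drivers` table, its columns, and all of its values are invented here, and `random_state=0` is used just to make the random draw repeatable.

```python
import pandas as pd

# Hypothetical "population" of 1,000 drivers; every value is made up.
drivers = pd.DataFrame({
    "driver_id": range(1000),
    "age": [18 + (i % 60) for i in range(1000)],
    "married": [i % 2 == 0 for i in range(1000)],
})

# Subset: married drivers over the age of 25.
subset = drivers[drivers["married"] & (drivers["age"] > 25)]

# Sample: a random 5 percent of that subset.
sample = subset.sample(frac=0.05, random_state=0)

print(len(drivers), len(subset), len(sample))
```

Every row in `sample` also belongs to `subset`, and every row in `subset` belongs to `drivers`, which mirrors the nesting described in the video.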
55 00:02:19,400 --> 00:02:21,290 We use samples, especially 56 00:02:21,290 --> 00:02:22,790 in cases where we do not want to 57 00:02:22,790 --> 00:02:24,260 incur the cost of collecting 58 00:02:24,260 --> 00:02:26,465 data for the entire population. 59 00:02:26,465 --> 00:02:29,165 Now, let us consider that there are 60 00:02:29,165 --> 00:02:32,315 230 million individuals in the country. 61 00:02:32,315 --> 00:02:33,915 A sample size of 62 00:02:33,915 --> 00:02:39,215 330 to 500 individuals randomly selected would suffice. 63 00:02:39,215 --> 00:02:41,330 This reduces the cost, 64 00:02:41,330 --> 00:02:43,220 especially in cases where you cannot 65 00:02:43,220 --> 00:02:45,875 collect information for the entire population. 66 00:02:45,875 --> 00:02:48,140 Therefore, using samples 67 00:02:48,140 --> 00:02:51,095 is really helpful and cost-effective. 68 00:02:51,095 --> 00:02:54,575 Here you see some Greek symbols on the screen. 69 00:02:54,575 --> 00:02:56,120 But don't be afraid. 70 00:02:56,120 --> 00:02:58,145 They mostly show the formulas. 71 00:02:58,145 --> 00:03:00,530 We will then proceed from here. 72 00:03:00,530 --> 00:03:03,095 While they may differ in notation, 73 00:03:03,095 --> 00:03:04,340 essentially the mean for 74 00:03:04,340 --> 00:03:06,950 a population and a sample is calculated the same way. 75 00:03:06,950 --> 00:03:10,190 It is the sum of all the observations, 76 00:03:10,190 --> 00:03:13,460 divided by the number of observations, 77 00:03:13,460 --> 00:03:15,889 which we also call the average. 78 00:03:15,889 --> 00:03:17,960 There are several properties 79 00:03:17,960 --> 00:03:19,550 of the mean, and they are meaningful. 80 00:03:19,550 --> 00:03:20,990 But one of the characteristics of 81 00:03:20,990 --> 00:03:22,310 a mean is this: take 82 00:03:22,310 --> 00:03:25,490 the average value for a variable, 83 00:03:25,490 --> 00:03:29,315 subtract it from each observation, and sum up the differences.
84 00:03:29,315 --> 00:03:32,210 That sum would be equal to zero. 85 00:03:32,210 --> 00:03:34,820 The median is different from the mean. 86 00:03:34,820 --> 00:03:36,380 When you order the data from 87 00:03:36,380 --> 00:03:38,600 the smallest value to the largest value, 88 00:03:38,600 --> 00:03:40,710 the median is the value in the middle. 89 00:03:40,710 --> 00:03:43,450 That is, the value in the middle, 90 00:03:43,450 --> 00:03:44,500 indicating that there are 91 00:03:44,500 --> 00:03:46,210 an equal number of observations 92 00:03:46,210 --> 00:03:47,920 above that value and an equal number of 93 00:03:47,920 --> 00:03:50,380 observations below that value. 94 00:03:50,380 --> 00:03:53,050 That value is called the median. 95 00:03:53,050 --> 00:03:57,595 So if the median salary in some city is $45,000, 96 00:03:57,595 --> 00:03:59,770 it means that 50 percent of the people make 97 00:03:59,770 --> 00:04:02,050 more than $45,000 and 98 00:04:02,050 --> 00:04:05,980 the other 50 percent make less than $45,000. 99 00:04:05,980 --> 00:04:08,830 The mode is essentially the value 100 00:04:08,830 --> 00:04:10,810 that occurs most frequently. 101 00:04:10,810 --> 00:04:13,450 Therefore, if the most common age in 102 00:04:13,450 --> 00:04:16,930 a class of students is 16, then that's the mode. 103 00:04:16,930 --> 00:04:20,260 We will now turn to Python for our hands-on training to 104 00:04:20,260 --> 00:04:23,490 estimate the summary statistics for beauty score, 105 00:04:23,490 --> 00:04:26,445 teaching evaluation, and age. 106 00:04:26,445 --> 00:04:28,295 We will use the 107 00:04:28,295 --> 00:04:32,135 DataFrame.describe() function to find the summary statistics. 108 00:04:32,135 --> 00:04:34,040 This prints out the number of 109 00:04:34,040 --> 00:04:36,665 rows, mean, standard deviation, 110 00:04:36,665 --> 00:04:39,880 minimum value, 25th, 50th, 111 00:04:39,880 --> 00:04:43,745 75th percentile, and the maximum value.
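To connect the definitions above with the output of `describe()`, here is a small sketch on a made-up list of student ages (the numbers are invented for illustration and are not from the course data). It computes the mean, median, and mode directly, checks that the deviations from the mean add up to zero, and then compares the results with `describe()`.

```python
import pandas as pd

# Made-up ages for a class of seven students.
ages = pd.Series([15, 16, 16, 16, 17, 18, 21])

mean_age = ages.mean()      # sum of the observations divided by their count
median_age = ages.median()  # middle value once the data are sorted
mode_age = ages.mode()[0]   # most frequent value (mode() returns a Series)

# Deviations from the mean sum to zero (up to floating-point rounding).
deviation_sum = (ages - mean_age).sum()

# describe() reports count, mean, std, min, 25%, 50%, 75%, and max.
summary = ages.describe()
print(summary)
```

Note that the 50th percentile reported by `describe()` is the median, so `summary["50%"]` and `median_age` agree.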
112 00:04:43,745 --> 00:04:45,620 To find the summary statistics 113 00:04:45,620 --> 00:04:47,480 for a subset of the variables, 114 00:04:47,480 --> 00:04:49,715 you will have to state the column names 115 00:04:49,715 --> 00:04:51,575 as we can see here. 116 00:04:51,575 --> 00:04:54,500 Otherwise, for the full dataset, 117 00:04:54,500 --> 00:04:58,710 we call the .describe() function on the whole DataFrame.
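A minimal sketch of both calls follows. The column names `beauty`, `eval`, `age`, and `students` and all values are assumptions made for this example; the actual names in the course dataset may differ.

```python
import pandas as pd

# Made-up stand-in for ratings_df; column names and values are assumptions.
ratings_df = pd.DataFrame({
    "beauty": [0.29, -0.74, -0.57, 0.20],
    "eval": [4.3, 3.7, 4.1, 4.5],
    "age": [36, 59, 51, 40],
    "students": [24, 86, 76, 77],
})

# Summary statistics for a subset of the variables: state the column names.
subset_summary = ratings_df[["beauty", "eval", "age"]].describe()

# Summary statistics for every numeric column in the DataFrame.
full_summary = ratings_df.describe()

print(subset_summary)
print(full_summary)
```

Passing a list of column names inside double square brackets returns a smaller DataFrame, so `describe()` then summarizes only those columns.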