1 00:00:06,440 --> 00:00:08,580 When we are comparing 2 00:00:08,580 --> 00:00:10,620 the difference in means or when we are 3 00:00:10,620 --> 00:00:12,360 comparing the averages between 4 00:00:12,360 --> 00:00:14,145 more than two groups, 5 00:00:14,145 --> 00:00:18,000 we will use ANOVA or analysis of variance. 6 00:00:18,000 --> 00:00:20,505 We know that if there are only two groups, 7 00:00:20,505 --> 00:00:22,230 we can use the t-test, 8 00:00:22,230 --> 00:00:23,535 but when we are comparing 9 00:00:23,535 --> 00:00:25,470 averages for more than two groups, 10 00:00:25,470 --> 00:00:27,960 we use analysis of variance. 11 00:00:27,960 --> 00:00:30,884 Working with our teaching evaluation dataset, 12 00:00:30,884 --> 00:00:32,940 we took the teaching evaluation scores 13 00:00:32,940 --> 00:00:34,650 and then we wanted to see what 14 00:00:34,650 --> 00:00:35,700 would happen if we took 15 00:00:35,700 --> 00:00:38,535 the instructors and divided them into three groups: 16 00:00:38,535 --> 00:00:40,155 40 years and younger, 17 00:00:40,155 --> 00:00:43,055 those between 40 and 57 years of age, 18 00:00:43,055 --> 00:00:45,905 and those that are 57 years or older. 19 00:00:45,905 --> 00:00:48,020 We computed the average 20 00:00:48,020 --> 00:00:50,675 teaching evaluation score for each of the three groups. 21 00:00:50,675 --> 00:00:52,610 We wanted to determine if 22 00:00:52,610 --> 00:00:55,955 the three mean values were statistically different. 23 00:00:55,955 --> 00:00:58,850 To recap, we ran the analysis of 24 00:00:58,850 --> 00:01:02,225 variance test, which uses the F-distribution. 25 00:01:02,225 --> 00:01:05,780 The p-value was less than 0.05. 26 00:01:05,780 --> 00:01:08,000 We rejected the null hypothesis 27 00:01:08,000 --> 00:01:09,755 that the averages of the groups are 28 00:01:09,755 --> 00:01:11,450 equal and concluded that 29 00:01:11,450 --> 00:01:14,360 the differences are statistically significant. 
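The ANOVA recap above can be sketched as follows. This is only a minimal illustration: it assumes the test is run with `scipy.stats.f_oneway`, and the three score arrays are synthetic stand-ins generated on the spot, not the teaching evaluation dataset. The variable names are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in scores for three age groups (NOT the real dataset).
rng = np.random.default_rng(0)
younger = rng.normal(loc=4.1, scale=0.5, size=40)  # 40 years and younger
middle = rng.normal(loc=4.0, scale=0.5, size=40)   # between 40 and 57 years
older = rng.normal(loc=3.8, scale=0.5, size=40)    # 57 years or older

# One-way ANOVA: the null hypothesis is that all group means are equal.
f_stat, p_value = stats.f_oneway(younger, middle, older)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

# Decision at the 0.05 significance level used in the recap.
if p_value < 0.05:
    print("Reject the null hypothesis: at least one group mean differs.")
else:
    print("Fail to reject the null hypothesis.")
```

With real data, the three arrays would be the evaluation scores selected for each age group.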
30 00:01:14,360 --> 00:01:17,645 Now, let us do this with the regression model. 31 00:01:17,645 --> 00:01:20,179 We will use the statsmodels library 32 00:01:20,179 --> 00:01:22,805 and also import the OLS function. 33 00:01:22,805 --> 00:01:24,340 We will create or 34 00:01:24,340 --> 00:01:26,770 initiate a linear model of the beauty score, 35 00:01:26,770 --> 00:01:28,675 which is our y-variable. 36 00:01:28,675 --> 00:01:30,430 Please note that when dealing with 37 00:01:30,430 --> 00:01:32,005 a linear regression model, 38 00:01:32,005 --> 00:01:35,245 the y-variable has to be a continuous variable. 39 00:01:35,245 --> 00:01:37,895 Otherwise, the results will not be accurate. 40 00:01:37,895 --> 00:01:40,345 Now, create the linear model 41 00:01:40,345 --> 00:01:42,820 and fit it using the fit function. 42 00:01:42,820 --> 00:01:46,210 Use the anova_lm function to create 43 00:01:46,210 --> 00:01:47,650 a table that prints out 44 00:01:47,650 --> 00:01:49,760 the results of the test statistics. 45 00:01:49,760 --> 00:01:51,915 The results will look like this. 46 00:01:51,915 --> 00:01:53,830 It will print out the degrees of freedom, 47 00:01:53,830 --> 00:01:57,350 the sum of squares, the F statistic, and the p-value. 48 00:01:57,350 --> 00:01:59,785 As with the ANOVA from the SciPy package, 49 00:01:59,785 --> 00:02:01,270 we get the same results, 50 00:02:01,270 --> 00:02:03,909 which is that we will reject the null hypothesis that 51 00:02:03,909 --> 00:02:06,440 the averages of the groups are equal and 52 00:02:06,440 --> 00:02:07,850 conclude that the differences 53 00:02:07,850 --> 00:02:10,280 are statistically significant. 54 00:02:10,280 --> 00:02:12,920 You can also turn the age group values into 55 00:02:12,920 --> 00:02:14,450 dummy values and run it like you 56 00:02:14,450 --> 00:02:16,400 run the regression for the t-test. 
57 00:02:16,400 --> 00:02:18,080 To do that, you will need to 58 00:02:18,080 --> 00:02:20,150 create dummy variables for the age groups 59 00:02:20,150 --> 00:02:24,250 using the get_dummies function in pandas. 60 00:02:24,250 --> 00:02:27,080 It will look like this, where one means they 61 00:02:27,080 --> 00:02:29,930 belong to that group and zero means otherwise. 62 00:02:29,930 --> 00:02:32,479 Just like a binary variable, 63 00:02:32,479 --> 00:02:35,515 each value can belong to only one group. 64 00:02:35,515 --> 00:02:38,510 Run it the same way you did for the t-test: fit 65 00:02:38,510 --> 00:02:41,180 the variables with the OLS function, 66 00:02:41,180 --> 00:02:44,165 predict, and print out the model summary. 67 00:02:44,165 --> 00:02:47,100 We will get results like this. 68 00:02:49,700 --> 00:02:52,115 Taking a closer look, 69 00:02:52,115 --> 00:02:53,780 we can see the same results for 70 00:02:53,780 --> 00:02:56,910 the F statistic and the p-value.