1 00:00:00,000 --> 00:00:02,260 In this video, we will introduce 2 00:00:02,260 --> 00:00:04,599 the fundamentals of regression analysis, 3 00:00:04,599 --> 00:00:06,310 which we believe is the workhorse of 4 00:00:06,310 --> 00:00:08,605 statistical analysis. 5 00:00:08,605 --> 00:00:11,095 Now, in terms of hypothesis testing, 6 00:00:11,095 --> 00:00:13,210 these tests measure the strength of 7 00:00:13,210 --> 00:00:15,720 the relationship between two or more variables. 8 00:00:15,720 --> 00:00:18,025 And you have to run them independently. 9 00:00:18,025 --> 00:00:20,470 But if you know how to run regression, 10 00:00:20,470 --> 00:00:23,050 we say, as a practical data scientist, 11 00:00:23,050 --> 00:00:25,330 you can forgo these tests and go straight to regression, 12 00:00:25,330 --> 00:00:26,560 which is available in 13 00:00:26,560 --> 00:00:30,460 most spreadsheets and also in all statistical software. 14 00:00:30,460 --> 00:00:34,195 So here are the fundamental basics of a regression model. 15 00:00:34,195 --> 00:00:36,010 First of all, you need a question 16 00:00:36,010 --> 00:00:38,230 to answer using a regression model. 17 00:00:38,230 --> 00:00:40,330 For instance, do male instructors get 18 00:00:40,330 --> 00:00:42,690 higher teaching evaluations than female instructors? 19 00:00:42,690 --> 00:00:44,440 Or does the beauty score decrease 20 00:00:44,440 --> 00:00:46,160 with the age of the individual instructor? 21 00:00:46,160 --> 00:00:48,590 Or is there an association between 22 00:00:48,590 --> 00:00:50,150 an instructor's looks and 23 00:00:50,150 --> 00:00:52,430 the teaching evaluation score that we see? 24 00:00:52,430 --> 00:00:53,660 Do good-looking professors get 25 00:00:53,660 --> 00:00:55,670 higher teaching evaluation scores? 26 00:00:55,670 --> 00:00:58,130 So with these questions in mind, 27 00:00:58,130 --> 00:01:01,680 we focus now on the terminology of a regression model. 
28 00:01:01,680 --> 00:01:03,410 So there are two types of 29 00:01:03,410 --> 00:01:05,240 regression variables that we use. 30 00:01:05,240 --> 00:01:06,725 One is a dependent variable, 31 00:01:06,725 --> 00:01:09,650 that is the variable that we are really interested in. 32 00:01:09,650 --> 00:01:11,990 For example, the teaching evaluation score 33 00:01:11,990 --> 00:01:14,420 of an individual instructor and 34 00:01:14,420 --> 00:01:16,970 the explanatory variables that explain 35 00:01:16,970 --> 00:01:18,634 the variance or differences 36 00:01:18,634 --> 00:01:20,720 of values of the dependent variable. 37 00:01:20,720 --> 00:01:22,955 So for example, teaching evaluation score 38 00:01:22,955 --> 00:01:24,620 could be explained by the looks, 39 00:01:24,620 --> 00:01:26,330 or the gender, or 40 00:01:26,330 --> 00:01:27,889 the English language proficiency 41 00:01:27,889 --> 00:01:29,395 of an individual instructor. 42 00:01:29,395 --> 00:01:31,220 So you have two types of variables, 43 00:01:31,220 --> 00:01:34,175 dependent and explanatory. 44 00:01:34,175 --> 00:01:37,670 Now, let's look at the notation for a regression model. 45 00:01:37,670 --> 00:01:40,875 The dependent variable is denoted as Y. 46 00:01:40,875 --> 00:01:43,745 So this Y would be the teaching evaluation score. 47 00:01:43,745 --> 00:01:47,795 And the explanatory variables are denoted as Xs. 48 00:01:47,795 --> 00:01:49,790 So beauty, the gender and 49 00:01:49,790 --> 00:01:52,400 English language proficiency would each be an X. 
50 00:01:52,400 --> 00:01:54,020 And the underlying assumption is 51 00:01:54,020 --> 00:01:56,230 that Y is explained by X, 52 00:01:56,230 --> 00:02:00,395 that is teaching evaluation score Y is explained by X, 53 00:02:00,395 --> 00:02:02,240 that is the beauty score, 54 00:02:02,240 --> 00:02:04,160 or Y is a function of X, 55 00:02:04,160 --> 00:02:07,010 which we write as Y is equal to function of X, 56 00:02:07,010 --> 00:02:09,095 that is, the teaching evaluation score 57 00:02:09,095 --> 00:02:11,290 is some function of beauty. 58 00:02:11,290 --> 00:02:13,745 Statistically, if you run 59 00:02:13,745 --> 00:02:16,235 and estimate a regression model, 60 00:02:16,235 --> 00:02:19,220 Y is equal to some constant and 61 00:02:19,220 --> 00:02:22,850 then a weighting factor for the variable X. 62 00:02:22,850 --> 00:02:24,080 If it's a beauty score, 63 00:02:24,080 --> 00:02:26,120 then the weighting factor for 64 00:02:26,120 --> 00:02:28,280 the beauty score and the error term. 65 00:02:28,280 --> 00:02:31,475 An error term is whatever we cannot explain by the model, 66 00:02:31,475 --> 00:02:33,130 that goes into the error term. 67 00:02:33,130 --> 00:02:36,545 And I will explain this a little more in a minute. 68 00:02:36,545 --> 00:02:38,495 So Y is equal to, 69 00:02:38,495 --> 00:02:39,560 let's say the constant is 70 00:02:39,560 --> 00:02:42,500 beta-naught plus some factor of weight, 71 00:02:42,500 --> 00:02:44,930 which is beta one for X plus 72 00:02:44,930 --> 00:02:47,825 the error term which we represent as epsilon. 73 00:02:47,825 --> 00:02:50,060 And then if you are familiar with 74 00:02:50,060 --> 00:02:52,400 your basic statistics text, 75 00:02:52,400 --> 00:02:55,235 if there is more than one variable, 76 00:02:55,235 --> 00:02:56,990 then Y is equal to beta naught, 77 00:02:56,990 --> 00:03:00,750 that's the constant, plus beta one, X_1. 
78 00:03:00,750 --> 00:03:01,970 Beta one is the factor for 79 00:03:01,970 --> 00:03:03,860 one variable that could be beauty, 80 00:03:03,860 --> 00:03:06,440 plus beta two, the other weight explaining 81 00:03:06,440 --> 00:03:08,690 X_2, another variable 82 00:03:08,690 --> 00:03:10,955 such as English language proficiency. 83 00:03:10,955 --> 00:03:12,320 So the weight for English language 84 00:03:12,320 --> 00:03:13,340 proficiency would be beta 85 00:03:13,340 --> 00:03:15,650 two, plus the epsilon, 86 00:03:15,650 --> 00:03:16,805 which is the error term 87 00:03:16,805 --> 00:03:18,740 explaining or capturing whatever 88 00:03:18,740 --> 00:03:20,135 the model couldn't capture. 89 00:03:20,135 --> 00:03:23,700 If I had estimated a regression model using the data set, 90 00:03:23,700 --> 00:03:25,120 teaching evaluations data set, 91 00:03:25,120 --> 00:03:26,570 it would look like the following. 92 00:03:26,570 --> 00:03:28,700 It will be the teaching evaluation score of 93 00:03:28,700 --> 00:03:31,010 an individual instructor is equal to some 94 00:03:31,010 --> 00:03:33,230 constant plus the weight for 95 00:03:33,230 --> 00:03:34,850 the beauty variable and then 96 00:03:34,850 --> 00:03:37,235 times the beauty score plus the error. 97 00:03:37,235 --> 00:03:40,550 So here, teaching evaluation score is equal to, 98 00:03:40,550 --> 00:03:43,145 according to the model 3.998. 99 00:03:43,145 --> 00:03:45,890 That's the constant plus the weight for beauty score, 100 00:03:45,890 --> 00:03:48,630 which is 0.133, times the beauty score, plus the error. 101 00:03:48,630 --> 00:03:50,330 And the error is epsilon, 102 00:03:50,330 --> 00:03:52,790 which is essentially the difference 103 00:03:52,790 --> 00:03:54,920 between the actual teaching evaluation score, 104 00:03:54,920 --> 00:03:57,860 that we have recorded in the data set, 105 00:03:57,860 --> 00:04:01,820 and the one that we have forecasted using this model. 
106 00:04:01,820 --> 00:04:03,830 So the difference between the actual values and 107 00:04:03,830 --> 00:04:07,200 the forecasted values is the error term.
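
The model described in this video can be sketched in a few lines of Python. This is only an illustration: the two coefficients (3.998 and 0.133) are the ones quoted in the video, while the function names and the example instructor's beauty score and actual evaluation score are made up here and do not come from the teaching evaluations data set.

```python
# Coefficients quoted in the video for the teaching evaluations model.
BETA_0 = 3.998  # the constant (beta-naught)
BETA_1 = 0.133  # the weight for the beauty score (beta one)


def predicted_eval(beauty: float) -> float:
    """Forecast the teaching evaluation score from a beauty score."""
    return BETA_0 + BETA_1 * beauty


def residual(actual_eval: float, beauty: float) -> float:
    """Error term: the actual score minus the score the model forecasts."""
    return actual_eval - predicted_eval(beauty)


# Hypothetical instructor, invented for illustration (not from the data set).
example_beauty = 1.0
example_actual = 4.5

forecast = predicted_eval(example_beauty)
error = residual(example_actual, example_beauty)
print(f"forecast = {forecast:.3f}, error = {error:.3f}")
```

With a beauty score of zero the forecast is just the constant, and the error for any instructor is whatever part of the recorded score the constant and the beauty weight do not account for.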