1 00:00:07,140 --> 00:00:11,810 Now, moving ahead from comparing the average values between two or more groups, 2 00:00:11,810 --> 00:00:14,220 we are looking at two variables. 3 00:00:14,220 --> 00:00:18,150 We want to know if there is a statistically significant correlation 4 00:00:18,150 --> 00:00:22,850 between these two variables. What is needed for this to happen? 5 00:00:22,850 --> 00:00:26,838 We would need to look back to the earlier definition of types of variables. 6 00:00:26,838 --> 00:00:31,310 We generally define the variables in two groups, the categorical variables 7 00:00:31,310 --> 00:00:33,340 and the continuous variables. 8 00:00:33,340 --> 00:00:36,690 So if we were to go back to our teaching ratings data, 9 00:00:36,690 --> 00:00:39,630 we have instructors who are male and female. 10 00:00:39,630 --> 00:00:44,343 Some instructors are visible minorities and some are Caucasian. 11 00:00:44,343 --> 00:00:49,910 So we have two variables, gender, male and female, and visible minority status. 12 00:00:49,910 --> 00:00:54,170 These two variables are examples of categorical variables, and 13 00:00:54,170 --> 00:00:57,970 if we're comparing or trying to determine the correlation between two categorical 14 00:00:57,970 --> 00:01:01,780 variables, we would use the Chi-square test. 15 00:01:01,780 --> 00:01:06,610 We would begin with a cross tabulation between the two variables. If we have two 16 00:01:06,610 --> 00:01:10,770 continuous variables, for example, the teaching evaluation score and 17 00:01:10,770 --> 00:01:15,010 the beauty score of an instructor, then these are two continuous variables and 18 00:01:15,010 --> 00:01:19,050 they can assume any reasonable value within the range. 19 00:01:19,050 --> 00:01:22,610 In this case, we use a Pearson correlation test. 20 00:01:22,610 --> 00:01:26,460 We usually begin with a scatter plot to see what the nature of the relationship 21 00:01:26,460 --> 00:01:28,774 between the two variables is.
22 00:01:28,774 --> 00:01:31,308 Let us start with categorical variables. 23 00:01:31,308 --> 00:01:34,171 We will use the Chi-square Test for Association. 24 00:01:34,171 --> 00:01:36,400 First, we state our hypotheses. 25 00:01:36,400 --> 00:01:40,560 We will test the null hypothesis that gender and tenureship are independent 26 00:01:40,560 --> 00:01:44,310 against the alternative hypothesis that they're associated. 27 00:01:44,310 --> 00:01:48,472 Let's begin with a cross tabulation between gender, male and female, and 28 00:01:48,472 --> 00:01:49,015 tenure. 29 00:01:49,015 --> 00:01:53,830 That is, whether an instructor is tenured, followed by a Chi-square test. 30 00:01:53,830 --> 00:01:56,465 So we do the tabulations. 31 00:01:56,465 --> 00:02:00,250 In the rows we have tenured no versus tenured yes, and 32 00:02:00,250 --> 00:02:02,533 in the columns, female versus male instructors. 33 00:02:02,533 --> 00:02:07,520 We would like to eyeball these numbers before we turn them into percentages. 34 00:02:07,520 --> 00:02:10,380 Looking at instructors who are non-tenured, 35 00:02:10,380 --> 00:02:15,530 we notice that 50 of the instructors are female versus 52 who are male. 36 00:02:15,530 --> 00:02:19,920 But for the instructors who are tenured, 145 of them are female, 37 00:02:19,920 --> 00:02:23,040 and 216 of them are male. 38 00:02:23,040 --> 00:02:27,870 So within the tenured group we see a greater proportion of males, but 39 00:02:27,870 --> 00:02:33,000 in the untenured group the distribution between males and females looks similar. 40 00:02:33,000 --> 00:02:38,260 Before we go to Python, let's do this by hand to understand the concept. 41 00:02:38,260 --> 00:02:40,980 The formula for Chi-square is given as follows: 42 00:02:40,980 --> 00:02:44,790 the summation of the observed value, i.e., the counts in each group, minus 43 00:02:44,790 --> 00:02:49,430 the expected value, all squared, divided by the expected value.
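The cross tabulation step can be sketched with pandas. This is a minimal sketch on a made-up mini-sample; the column names "gender" and "tenure" are assumptions for illustration, not necessarily the actual column names in the teaching ratings dataset.

```python
import pandas as pd

# Hypothetical mini-sample shaped like the teaching ratings data;
# the column names "gender" and "tenure" are assumed, not the
# dataset's actual names.
df = pd.DataFrame({
    "gender": ["female", "male", "male", "female", "male", "female"],
    "tenure": ["yes", "yes", "no", "no", "yes", "yes"],
})

# Cross tabulation: tenure status in the rows, gender in the columns.
table = pd.crosstab(df["tenure"], df["gender"])
print(table)
```

On the real data, this table would contain the counts discussed below (50, 52, 145, 216).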
44 00:02:49,430 --> 00:02:52,100 Expected values are based on the given totals. 45 00:02:52,100 --> 00:02:56,150 What would we say each individual value would be if we did not know the observed 46 00:02:56,150 --> 00:02:57,470 values? 47 00:02:57,470 --> 00:03:01,530 So to calculate the expected value of untenured female instructors, 48 00:03:01,530 --> 00:03:07,340 we take the row total, which is 102, multiplied by the column total, 195, 49 00:03:07,340 --> 00:03:10,390 divided by the grand total of 463. 50 00:03:10,390 --> 00:03:13,060 This will give you 42.96. 51 00:03:13,060 --> 00:03:17,273 If we do the same thing for tenured male instructors, 52 00:03:17,273 --> 00:03:21,767 we will take the row total, 361, multiplied by the column 53 00:03:21,767 --> 00:03:26,182 total, 268, divided by 463, and we get 208.96. 54 00:03:26,182 --> 00:03:30,350 If we repeat the same procedure for all of them, we get these values. 55 00:03:30,350 --> 00:03:34,270 If we take the row totals, column totals, and grand total, 56 00:03:34,270 --> 00:03:38,950 we will get the same totals as for the observed values. 57 00:03:38,950 --> 00:03:41,061 Now going back to this formula, 58 00:03:41,061 --> 00:03:46,532 if we take a summation of all the observed minus the expected values, all squared, 59 00:03:46,532 --> 00:03:51,789 divided by the expected value, we will get a Chi-square value of 2.557. 60 00:03:51,789 --> 00:03:54,778 And the degrees of freedom will be 1. 61 00:03:54,778 --> 00:03:59,960 On the Chi-square table, we check the row where the degrees of freedom equal 1 and 62 00:03:59,960 --> 00:04:02,682 find the value closest to 2.557. 63 00:04:02,682 --> 00:04:07,266 Here we can see that 2.557 will most likely 64 00:04:07,266 --> 00:04:12,203 fall in between a p-value of 0.1 and 0.25. 65 00:04:12,203 --> 00:04:16,955 Therefore, we can say that the p-value is greater than 0.1.
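The hand calculation above can be written out in plain Python, using the observed counts stated in the lecture (50 and 52 untenured, 145 and 216 tenured):

```python
# Hand calculation of the Chi-square statistic for the gender/tenure
# cross tabulation, using the counts from the lecture.
observed = [[50, 52],    # not tenured: female, male
            [145, 216]]  # tenured:     female, male

row_totals = [sum(row) for row in observed]        # [102, 361]
col_totals = [sum(col) for col in zip(*observed)]  # [195, 268]
grand_total = sum(row_totals)                      # 463

# Expected cell count = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square = sum over cells of (observed - expected)^2 / expected.
chi_square = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2) for j in range(2)
)

print(round(expected[0][0], 2))  # 42.96, untenured female instructors
print(round(expected[1][1], 2))  # 208.96, tenured male instructors
print(round(chi_square, 3))      # 2.557
```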
66 00:04:16,955 --> 00:04:19,893 Since the p-value is greater than 0.05, 67 00:04:19,893 --> 00:04:24,941 we fail to reject the null hypothesis that the two variables are independent, and 68 00:04:24,941 --> 00:04:29,613 therefore we cannot conclude 69 00:04:29,613 --> 00:04:33,640 that there is an association between gender and tenureship. 70 00:04:33,640 --> 00:04:38,446 To do this in Python, we will use the chi2_contingency function in 71 00:04:38,446 --> 00:04:43,905 the scipy.stats package. The first value returned is the Chi-square test statistic of 2.557. 72 00:04:43,905 --> 00:04:48,155 And the second value is the p-value of about 0.11. 73 00:04:48,155 --> 00:04:50,730 And the third is the degrees of freedom of 1. 74 00:04:50,730 --> 00:04:55,250 If you remember, the Chi-square table did not give an exact p-value but 75 00:04:55,250 --> 00:04:57,380 a range in which it falls. 76 00:04:57,380 --> 00:05:00,270 Python will give the exact p-value. 77 00:05:00,270 --> 00:05:03,670 We can see the same results as on the previous slides. 78 00:05:03,670 --> 00:05:08,565 It also prints out the expected values, which we also calculated by hand. 79 00:05:08,565 --> 00:05:12,977 Since the p-value is 0.11, which is greater than 0.05, 80 00:05:12,977 --> 00:05:18,193 we fail to reject the null hypothesis that the two variables are independent. 81 00:05:18,193 --> 00:05:21,517 And therefore we cannot conclude 82 00:05:21,517 --> 00:05:25,780 that there is an association between gender and tenureship. 83 00:05:25,780 --> 00:05:29,420 This was an example of testing independence between two categorical 84 00:05:29,420 --> 00:05:31,220 variables. 85 00:05:31,220 --> 00:05:35,350 Now we turn to continuous variables, using a Pearson correlation test on the teaching 86 00:05:35,350 --> 00:05:36,610 ratings data.
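The Python step described above can be sketched as follows. One detail worth noting, which the lecture does not mention: for 2x2 tables, scipy's chi2_contingency applies Yates' continuity correction by default, so `correction=False` is needed to reproduce the hand-calculated statistic of 2.557.

```python
from scipy.stats import chi2_contingency

# Chi-square test of independence on the observed counts from the
# lecture. correction=False disables Yates' continuity correction so
# the result matches the hand calculation.
observed = [[50, 52],    # not tenured: female, male
            [145, 216]]  # tenured:     female, male

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 3))     # 2.557, the Chi-square test statistic
print(round(p_value, 2))  # 0.11, the exact p-value
print(dof)                # 1 degree of freedom
```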
87 00:05:36,610 --> 00:05:40,310 We will test the null hypothesis that there is no correlation between 88 00:05:40,310 --> 00:05:44,850 an instructor's beauty score and their teaching evaluation score against 89 00:05:44,850 --> 00:05:50,010 the alternative hypothesis that there is a correlation between both variables. 90 00:05:50,010 --> 00:05:53,390 We have the normalized beauty score on the x-axis and 91 00:05:53,390 --> 00:05:56,670 the teaching evaluation score on the y-axis. 92 00:05:56,670 --> 00:05:59,990 You can eyeball a positive, upward-sloping trend, but 93 00:05:59,990 --> 00:06:03,960 let's run a Pearson correlation test to find out. 94 00:06:03,960 --> 00:06:08,720 We will use the pearsonr function in the scipy.stats package and check for 95 00:06:08,720 --> 00:06:09,810 the correlation. 96 00:06:09,810 --> 00:06:13,809 We will get a coefficient value of how strong the relationship is and 97 00:06:13,809 --> 00:06:15,002 in what direction. 98 00:06:15,002 --> 00:06:18,456 Correlation coefficient values lie between -1 and 1, 99 00:06:18,456 --> 00:06:21,690 where -1 means a strong negative correlation, 100 00:06:21,690 --> 00:06:26,277 visually represented by a downward-sloping line, and 1 means a strong 101 00:06:26,277 --> 00:06:31,260 positive relationship, visually represented by an upward-sloping line. 102 00:06:31,260 --> 00:06:36,598 In our case, we have a Pearson coefficient of 0.18, 103 00:06:36,598 --> 00:06:42,171 and a p-value of 4.25 times 10 raised to the power -5. 104 00:06:42,171 --> 00:06:47,321 Since the p-value is less than 0.05, we reject the null hypothesis and 105 00:06:47,321 --> 00:06:52,149 conclude that there exists a relationship between an instructor's 106 00:06:52,149 --> 00:06:55,295 beauty score and teaching evaluation score. 107 00:06:55,295 --> 00:06:55,795 [MUSIC]
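The Pearson correlation test described above can be sketched as follows. The beauty and evaluation values here are made up for illustration; on the real teaching ratings data, the same call would return the coefficient of 0.18 and p-value of 4.25e-5 quoted in the lecture.

```python
from scipy.stats import pearsonr

# Made-up beauty/evaluation pairs, for illustration only; the real
# test would run on the dataset's beauty and evaluation columns.
beauty = [-0.8, -0.3, 0.1, 0.2, 0.5, 1.0]
evals = [3.5, 3.8, 4.0, 4.1, 4.3, 4.6]

# pearsonr returns the correlation coefficient (between -1 and 1,
# sign gives the direction) and the p-value of the test.
r, p_value = pearsonr(beauty, evals)
print(round(r, 2))  # positive here: an upward-sloping relationship
print(p_value)      # reject the null hypothesis when p < 0.05
```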