1 00:00:05,840 --> 00:00:07,860 We have learned to compute 2 00:00:07,860 --> 00:00:09,885 averages and standard deviations, 3 00:00:09,885 --> 00:00:12,060 but now we will use the same information or 4 00:00:12,060 --> 00:00:15,855 same knowledge to make comparisons between groups. 5 00:00:15,855 --> 00:00:19,275 So we will use the same dataset that we have used so far. 6 00:00:19,275 --> 00:00:21,600 That is the teaching evaluation data from 7 00:00:21,600 --> 00:00:25,125 the University of Texas, comprising 463 courses. 8 00:00:25,125 --> 00:00:27,720 We are looking at the teaching evaluation, 9 00:00:27,720 --> 00:00:28,980 beauty, and age. 10 00:00:28,980 --> 00:00:30,660 We're comparing the averages for 11 00:00:30,660 --> 00:00:34,200 these three variables by the female instructor variable. 12 00:00:34,200 --> 00:00:35,910 So those who are females, 13 00:00:35,910 --> 00:00:38,310 their average teaching evaluation was 3.9 14 00:00:38,310 --> 00:00:41,080 compared to that of men, 4.06. 15 00:00:41,080 --> 00:00:42,350 Here we're looking at 16 00:00:42,350 --> 00:00:46,640 the average teaching evaluation for tenured professors, 17 00:00:46,640 --> 00:00:49,775 3.96 versus untenured, 4.13. 18 00:00:49,775 --> 00:00:53,945 The average age of untenured professors was 50.2 years, 19 00:00:53,945 --> 00:00:58,160 and that for tenured professors, 47.85 years. 20 00:00:58,160 --> 00:00:59,780 One thing that is very important in 21 00:00:59,780 --> 00:01:02,105 statistical analysis is to 22 00:01:02,105 --> 00:01:04,070 think about the question and to think 23 00:01:04,070 --> 00:01:06,770 about the population or sample that you are working with. 24 00:01:06,770 --> 00:01:10,570 We are computing averages across 463 courses. 25 00:01:10,570 --> 00:01:12,765 We find the average age or beauty, 26 00:01:12,765 --> 00:01:15,635 but these are the attributes of instructors.
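The group comparisons described above can be sketched with a pandas groupby. This is a minimal illustration only: the toy values and the column names (gender, tenure, eval, age) are assumptions made for the example, not the actual teaching evaluation data.

```python
import pandas as pd

# Toy stand-in for the course-level data.
# Values and column names are hypothetical, chosen only for illustration.
ratings = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "tenure": ["yes", "no", "yes", "no"],
    "eval":   [3.8, 4.0, 4.0, 4.2],
    "age":    [45, 38, 52, 41],
})

# Mean of each numeric variable within each gender group.
by_gender = ratings.groupby("gender")[["eval", "age"]].mean()

# Mean and standard deviation of the evaluation score by tenure status.
by_tenure = ratings.groupby("tenure")["eval"].agg(["mean", "std"])

print(by_gender)
print(by_tenure)
```

Each row of the result is one group, so the group averages can be read side by side.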
27 00:01:15,635 --> 00:01:17,420 We know from our data that 28 00:01:17,420 --> 00:01:19,520 there are 94 instructors who have 29 00:01:19,520 --> 00:01:22,250 collectively taught 463 courses, 30 00:01:22,250 --> 00:01:24,020 and we know that there are duplicates, 31 00:01:24,020 --> 00:01:25,985 that is the same instructor 32 00:01:25,985 --> 00:01:28,700 who has taught multiple courses. 33 00:01:28,700 --> 00:01:32,570 So when I compute the average age using 463 courses, 34 00:01:32,570 --> 00:01:34,520 it's not necessarily the average age of 35 00:01:34,520 --> 00:01:37,595 the instructors because it could be true that 36 00:01:37,595 --> 00:01:40,430 older individuals may 37 00:01:40,430 --> 00:01:43,370 have taught more courses than younger individuals, 38 00:01:43,370 --> 00:01:45,095 resulting in a higher average. 39 00:01:45,095 --> 00:01:46,429 That is not necessarily 40 00:01:46,429 --> 00:01:48,860 the average age of the instructors. 41 00:01:48,860 --> 00:01:51,725 So to avoid this problem, 42 00:01:51,725 --> 00:01:53,480 we have to subset 43 00:01:53,480 --> 00:01:56,600 the data so that we remove the duplicates and 44 00:01:56,600 --> 00:01:58,685 have only one observation 45 00:01:58,685 --> 00:02:01,550 per individual instructor in the dataset. 46 00:02:01,550 --> 00:02:03,844 Instead of 463 observations, 47 00:02:03,844 --> 00:02:06,725 you should have just 94 observations. 48 00:02:06,725 --> 00:02:08,959 Now let's look at the comparison. 49 00:02:08,959 --> 00:02:11,630 When we use 94 observations where 50 00:02:11,630 --> 00:02:15,335 no instructor is repeated in the dataset, 51 00:02:15,335 --> 00:02:20,420 the average beauty score is 0.25. 52 00:02:20,420 --> 00:02:23,990 When we look at the 463 courses, 53 00:02:23,990 --> 00:02:26,965 the average value is 0.11. 54 00:02:26,965 --> 00:02:29,495 Let's compare the age.
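The subsetting step described above, keeping one observation per instructor, might look like the following sketch. The instructor identifier column (prof) and all values are assumptions for illustration; the real dataset may name its columns differently.

```python
import pandas as pd

# Toy course-level data: instructor 1 appears three times.
# Column names and values are hypothetical.
courses = pd.DataFrame({
    "prof": [1, 1, 1, 2, 3],
    "age":  [60, 60, 60, 40, 35],
})

# Averaging over courses counts instructor 1 three times.
course_level_mean = courses["age"].mean()

# Keep the first row for each instructor, then average again.
instructors = courses.drop_duplicates(subset="prof")
instructor_level_mean = instructors["age"].mean()

print(course_level_mean, instructor_level_mean)
```

In this toy case the older instructor taught the most courses, so the course-level average is higher than the instructor-level average, which is the distortion the subsetting avoids.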
55 00:02:29,495 --> 00:02:32,570 The average age using 94 observations 56 00:02:32,570 --> 00:02:37,970 for males is 49.4 and for females is 44.9. 57 00:02:37,970 --> 00:02:41,130 You see here that as for age, 58 00:02:41,130 --> 00:02:42,920 we don't see much difference whether we 59 00:02:42,920 --> 00:02:45,514 use 463 observations or 94. 60 00:02:45,514 --> 00:02:48,005 But we certainly see much difference in 61 00:02:48,005 --> 00:02:52,695 the beauty scores if you were to use the wrong dataset. 62 00:02:52,695 --> 00:02:53,810 That is the dataset where 63 00:02:53,810 --> 00:02:57,020 individuals are repeated multiple times. 64 00:02:57,020 --> 00:02:59,870 Data visualization is a critical piece 65 00:02:59,870 --> 00:03:01,820 of modern-day statistical analysis. 66 00:03:01,820 --> 00:03:03,740 Charts are helpful, 67 00:03:03,740 --> 00:03:04,970 so you don't have to eyeball 68 00:03:04,970 --> 00:03:07,655 the output to figure out what the trends are. 69 00:03:07,655 --> 00:03:11,030 The visual displays are much easier to understand. 70 00:03:11,030 --> 00:03:13,460 We will use the same dataset of teaching 71 00:03:13,460 --> 00:03:16,040 evaluations and ask this question, 72 00:03:16,040 --> 00:03:17,690 do instructors teaching 73 00:03:17,690 --> 00:03:20,780 single credit courses get higher evaluations? 74 00:03:20,780 --> 00:03:23,660 We see that, yes, they do. 75 00:03:23,660 --> 00:03:26,900 By mean evaluation, when plotted as a chart, 76 00:03:26,900 --> 00:03:29,600 you see that instructors who teach single credit courses 77 00:03:29,600 --> 00:03:32,975 have a slightly higher average teaching evaluation. 78 00:03:32,975 --> 00:03:35,030 Let us start by determining 79 00:03:35,030 --> 00:03:36,560 how many courses were taught by 80 00:03:36,560 --> 00:03:40,145 male instructors and how many by female instructors. 81 00:03:40,145 --> 00:03:42,650 For this, we can use a bar chart.
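The single credit comparison described above can be sketched as a bar chart of group means. This is only one way to draw it; the column names (credits, eval), the category labels, and the toy values are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Toy course-level data; values and column names are hypothetical.
ratings = pd.DataFrame({
    "credits": ["single", "single", "more", "more", "more"],
    "eval":    [4.5, 4.3, 4.0, 3.9, 4.1],
})

# Mean evaluation score for each credit category.
mean_eval = ratings.groupby("credits", as_index=False)["eval"].mean()

# One bar per category, with bar height equal to the group mean.
ax = sns.barplot(data=mean_eval, x="credits", y="eval")
ax.set_title("Mean teaching evaluation by course credits")
ax.set_ylabel("Mean evaluation score")
```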
82 00:03:42,650 --> 00:03:44,360 Notice that the information is 83 00:03:44,360 --> 00:03:46,400 complete from a statistical point of view in 84 00:03:46,400 --> 00:03:48,020 that we know how many courses were 85 00:03:48,020 --> 00:03:50,135 taught by males versus females. 86 00:03:50,135 --> 00:03:52,400 But we do not have some critical information from 87 00:03:52,400 --> 00:03:54,965 this chart as it relates to communication. 88 00:03:54,965 --> 00:03:57,020 Therefore, we can say 89 00:03:57,020 --> 00:03:59,480 this chart serves a statistical purpose, 90 00:03:59,480 --> 00:04:02,030 but it doesn't serve a communication purpose. 91 00:04:02,030 --> 00:04:04,790 Let me illustrate this with an example. 92 00:04:04,790 --> 00:04:07,700 Here you are looking at a street map. 93 00:04:07,700 --> 00:04:08,840 You can see the streets and 94 00:04:08,840 --> 00:04:10,280 the buildings and the highways, 95 00:04:10,280 --> 00:04:11,930 but you don't see the street names. 96 00:04:11,930 --> 00:04:13,685 Without street names, it is hard to 97 00:04:13,685 --> 00:04:15,260 determine where you are 98 00:04:15,260 --> 00:04:17,360 and in which direction you should be heading. 99 00:04:17,360 --> 00:04:19,580 Even though it is according to scale, 100 00:04:19,580 --> 00:04:20,840 and it may be accurate in 101 00:04:20,840 --> 00:04:23,135 its depiction of the streets in the neighborhood, 102 00:04:23,135 --> 00:04:25,040 it still lacks the ability to 103 00:04:25,040 --> 00:04:27,005 communicate information to you. 104 00:04:27,005 --> 00:04:29,675 To add communication value to this map, 105 00:04:29,675 --> 00:04:32,045 you can simply add the street names. 106 00:04:32,045 --> 00:04:35,480 Let us apply the same philosophy to our graphic. 107 00:04:35,480 --> 00:04:38,615 Once we add information to this graphic, 108 00:04:38,615 --> 00:04:41,135 for example, adding just a title, 109 00:04:41,135 --> 00:04:43,235 the chart becomes more informative.
110 00:04:43,235 --> 00:04:45,034 To do this in Python, 111 00:04:45,034 --> 00:04:47,240 we'll use the countplot function in 112 00:04:47,240 --> 00:04:50,060 the seaborn library and set the title label. 113 00:04:50,060 --> 00:04:52,580 This helps your graph to be more informative. 114 00:04:52,580 --> 00:04:55,505 We can also add more dimensions to the data. 115 00:04:55,505 --> 00:04:57,890 In addition to the gender of the instructors, 116 00:04:57,890 --> 00:04:59,600 we could add the tenure status of 117 00:04:59,600 --> 00:05:01,655 the instructors as well to the graphic. 118 00:05:01,655 --> 00:05:03,350 To do that in Python, 119 00:05:03,350 --> 00:05:06,215 you add the hue argument to the countplot. 120 00:05:06,215 --> 00:05:08,390 We can add another dimension to the data, 121 00:05:08,390 --> 00:05:11,465 regenerating the same graphic with the same information. 122 00:05:11,465 --> 00:05:13,070 That is, the number of 123 00:05:13,070 --> 00:05:14,930 courses taught by gender and tenure. 124 00:05:14,930 --> 00:05:16,640 Then adding the dimension of 125 00:05:16,640 --> 00:05:20,150 courses being upper-division and lower-division, 126 00:05:20,150 --> 00:05:22,940 and presenting them in two rows or columns. 127 00:05:22,940 --> 00:05:24,530 To do this in Python, 128 00:05:24,530 --> 00:05:26,525 we can specify the row argument 129 00:05:26,525 --> 00:05:28,805 using the catplot function with kind set to count. 130 00:05:28,805 --> 00:05:30,920 Now let's look at the situation where 131 00:05:30,920 --> 00:05:32,060 our primary variables of 132 00:05:32,060 --> 00:05:34,280 interest are continuous variables. 133 00:05:34,280 --> 00:05:35,600 We would like to explore 134 00:05:35,600 --> 00:05:37,520 the relationship between the two while adding 135 00:05:37,520 --> 00:05:41,135 further categorical variables as an additional dimension.
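The count plots described above, with a title, a hue for tenure, and rows for course division, can be sketched as follows. The column names (gender, tenure, division) and the toy rows are assumptions for illustration. Because seaborn's axes-level countplot draws on a single set of axes and has no row parameter, the row split in this sketch uses the figure-level catplot with kind="count".

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy course-level data; values and column names are hypothetical.
ratings = pd.DataFrame({
    "gender":   ["female", "male", "male", "female", "male", "female"],
    "tenure":   ["yes", "yes", "no", "no", "yes", "yes"],
    "division": ["upper", "lower", "upper", "lower", "upper", "upper"],
})

# Number of courses by gender, split by tenure status via hue, with a title.
fig, ax = plt.subplots()
sns.countplot(data=ratings, x="gender", hue="tenure", ax=ax)
ax.set_title("Courses taught by gender and tenure status")

# Same counts, with one row of panels per course division.
grid = sns.catplot(
    data=ratings, x="gender", hue="tenure", row="division",
    kind="count", height=3, aspect=2,
)
```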
136 00:05:41,135 --> 00:05:44,780 Using the teaching evaluation data we ask this question, 137 00:05:44,780 --> 00:05:47,465 does age affect teaching evaluations? 138 00:05:47,465 --> 00:05:49,804 We then add two additional dimensions, 139 00:05:49,804 --> 00:05:51,830 which are gender and tenure. 140 00:05:51,830 --> 00:05:55,700 So our dataset consists of age and teaching evaluation, 141 00:05:55,700 --> 00:05:57,680 which are the two primary variables of 142 00:05:57,680 --> 00:06:00,095 interest and are continuous. 143 00:06:00,095 --> 00:06:02,510 Then we add two other dimensions, 144 00:06:02,510 --> 00:06:05,000 i.e., gender and tenure. 145 00:06:05,000 --> 00:06:07,520 These are categorical variables. 146 00:06:07,520 --> 00:06:09,440 Age is on the X axis and 147 00:06:09,440 --> 00:06:12,320 the teaching evaluation score is on the Y axis. 148 00:06:12,320 --> 00:06:14,420 The orange colored circles represent 149 00:06:14,420 --> 00:06:17,869 males and the blue colored circles represent females. 150 00:06:17,869 --> 00:06:20,660 The top panel is for tenured professors and 151 00:06:20,660 --> 00:06:23,765 the bottom panel is for the untenured instructors. 152 00:06:23,765 --> 00:06:25,565 To do this in Python, 153 00:06:25,565 --> 00:06:27,560 we use the FacetGrid class, 154 00:06:27,560 --> 00:06:29,480 which works for multiplot gridding 155 00:06:29,480 --> 00:06:31,220 and allows tweaking the plot. 156 00:06:31,220 --> 00:06:35,045 You set the row and hue for the categorical variables, 157 00:06:35,045 --> 00:06:37,400 in our case, tenure and gender. 158 00:06:37,400 --> 00:06:39,305 Then we use the map method to apply 159 00:06:39,305 --> 00:06:42,540 a plotting function to each subset of the data.