1 00:00:09,530 --> 00:00:14,670 So far we have displayed data as averages and counts. 2 00:00:14,670 --> 00:00:17,429 Now let's look at some other statistical parameters 3 00:00:17,429 --> 00:00:20,175 that we will illustrate as graphics. 4 00:00:20,175 --> 00:00:23,805 We have not yet shown anything about variance, 5 00:00:23,805 --> 00:00:25,695 and I think the first thing 6 00:00:25,695 --> 00:00:27,990 that one should look into beyond averages 7 00:00:27,990 --> 00:00:31,140 is variance, or how the data are distributed. 8 00:00:31,140 --> 00:00:34,395 A good way of looking at the distribution of data, 9 00:00:34,395 --> 00:00:37,365 especially for a continuous variable, 10 00:00:37,365 --> 00:00:39,125 is to look at histograms. 11 00:00:39,125 --> 00:00:40,470 If you are interested in 12 00:00:40,470 --> 00:00:42,960 displaying something more than the average, 13 00:00:42,960 --> 00:00:46,185 maybe the median and the quartiles, 14 00:00:46,185 --> 00:00:49,535 then perhaps boxplots should be your choice. 15 00:00:49,535 --> 00:00:51,830 Using the teaching evaluation data, 16 00:00:51,830 --> 00:00:53,450 we have plotted a histogram 17 00:00:53,450 --> 00:00:55,160 of teaching evaluation scores. 18 00:00:55,160 --> 00:00:58,355 You can see that the mean score is around four. 19 00:00:58,355 --> 00:00:59,615 You can also see 20 00:00:59,615 --> 00:01:03,305 very low teaching evaluation scores, though not frequently; 21 00:01:03,305 --> 00:01:05,120 the most frequent scores are around 22 00:01:05,120 --> 00:01:07,865 the average, and then you'd see that 23 00:01:07,865 --> 00:01:10,340 some have lower teaching evaluation scores and 24 00:01:10,340 --> 00:01:13,130 some have fairly high teaching evaluation scores. 25 00:01:13,130 --> 00:01:17,015 The histogram approximates the normal distribution curve.
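As a rough illustration of the histogram just described, here is a minimal sketch in Python using matplotlib. The real teaching evaluation data are not part of this transcript, so the scores below are synthetic: they are drawn from a normal distribution using the mean of about four mentioned above and the standard deviation (0.55) and record count (463) quoted in the next cues. The 1 to 5 scale, the number of bins, and all variable names are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the evaluation scores (NOT the real data):
# 463 values around a mean of 4 with standard deviation 0.55,
# clipped to an assumed 1-5 rating scale.
rng = np.random.default_rng(0)
scores = np.clip(rng.normal(loc=4.0, scale=0.55, size=463), 1.0, 5.0)

fig, ax = plt.subplots()

# hist returns the count in each bin and the bin edges.
counts, bin_edges, _ = ax.hist(scores, bins=20, edgecolor="black")

# Mark the sample mean so it can be compared with the bulk of the bars.
ax.axvline(scores.mean(), linestyle="--", color="red", label="mean")

ax.set_xlabel("Teaching evaluation score (synthetic)")
ax.set_ylabel("Frequency")
ax.legend()

print(f"mean={scores.mean():.2f}, sd={scores.std(ddof=1):.2f}, n={scores.size}")
```

In an interactive session you would drop the `matplotlib.use("Agg")` line and call `plt.show()`, or save the figure with `fig.savefig(...)`.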
26 00:01:17,015 --> 00:01:20,315 Essentially you have a mean of 3.99, or about four, 27 00:01:20,315 --> 00:01:23,075 a standard deviation of 0.55, 28 00:01:23,075 --> 00:01:25,100 across 463 records. 29 00:01:25,100 --> 00:01:26,450 This gives you a good idea of 30 00:01:26,450 --> 00:01:28,025 how your data are distributed. 31 00:01:28,025 --> 00:01:30,215 You can in fact plot 32 00:01:30,215 --> 00:01:32,645 multiple histograms such that 33 00:01:32,645 --> 00:01:34,910 you can see the difference between the subgroups. 34 00:01:34,910 --> 00:01:36,650 Here you have the histograms 35 00:01:36,650 --> 00:01:39,380 overlaid for males and females. 36 00:01:39,380 --> 00:01:44,030 The more frequent lower teaching evaluations for females 37 00:01:44,030 --> 00:01:45,560 are likely to influence 38 00:01:45,560 --> 00:01:47,420 the average teaching evaluation score 39 00:01:47,420 --> 00:01:49,640 for females versus the males. 40 00:01:49,640 --> 00:01:52,865 A box plot essentially looks like this. 41 00:01:52,865 --> 00:01:56,180 The thick line in the box represents the median. 42 00:01:56,180 --> 00:01:58,955 The top part of the box is the third quartile. 43 00:01:58,955 --> 00:02:02,200 The bottom part of the box is the first quartile. 44 00:02:02,200 --> 00:02:05,165 The line at the bottom is the minimum value, 45 00:02:05,165 --> 00:02:08,825 and the line at the top is the maximum value. 46 00:02:08,825 --> 00:02:11,240 The range between the first quartile and 47 00:02:11,240 --> 00:02:14,990 the third quartile is called the inter-quartile range. 48 00:02:14,990 --> 00:02:17,240 In this graphic, we have created 49 00:02:17,240 --> 00:02:19,655 the box plots for the age variable. 50 00:02:19,655 --> 00:02:22,220 We can see that the median age of males 51 00:02:22,220 --> 00:02:24,455 is higher than the median age of females. 52 00:02:24,455 --> 00:02:26,300 Also, the maximum age of the males is 53 00:02:26,300 --> 00:02:29,225 higher than the maximum age of females.
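The parts of a box plot described above (minimum, first quartile, median, third quartile, maximum, and inter-quartile range) can also be computed directly. The sketch below uses only Python's standard library. The list of ages is made up for illustration and is not taken from the teaching evaluation data, and the "inclusive" quartile method is one of several conventions, chosen here as an assumption.

```python
import statistics

# Made-up ages for illustration only (NOT the instructors' real ages).
ages = [29, 33, 35, 38, 42, 47, 50, 52, 57, 62, 73]

# quantiles(..., n=4) returns the three cut points that split the data
# into four parts: first quartile, median, third quartile.
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")

# The inter-quartile range is the height of the box.
iqr = q3 - q1

summary = {
    "minimum": min(ages),
    "first quartile": q1,
    "median": median,
    "third quartile": q3,
    "maximum": max(ages),
    "inter-quartile range": iqr,
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

One caveat when comparing these numbers with a plotted box: many plotting libraries, seaborn included by default, end the whiskers at the most extreme points within 1.5 times the inter-quartile range of the box and draw anything beyond that as individual points. The whisker ends match the minimum and maximum only when there are no such outlying points.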
54 00:02:29,225 --> 00:02:31,205 To do this in Python, 55 00:02:31,205 --> 00:02:34,295 we use the boxplot function in the seaborn library. 56 00:02:34,295 --> 00:02:36,710 We will put the gender on the Y axis and 57 00:02:36,710 --> 00:02:39,335 the age of the instructor on the X axis. 58 00:02:39,335 --> 00:02:42,185 You can play around with the X and Y axes 59 00:02:42,185 --> 00:02:46,235 if you want a horizontal style boxplot. For readability, 60 00:02:46,235 --> 00:02:49,039 I like to use vertical boxplots. 61 00:02:49,039 --> 00:02:51,800 We can also add another dimension. 62 00:02:51,800 --> 00:02:53,390 Here we will add tenure. 63 00:02:53,390 --> 00:02:55,760 So those who are tenured are plotted on the right, 64 00:02:55,760 --> 00:02:58,260 and those who are not tenured are plotted on the left. 65 00:02:58,260 --> 00:03:01,580 The blue color represents the female instructors, 66 00:03:01,580 --> 00:03:04,445 and the orange color represents the male instructors. 67 00:03:04,445 --> 00:03:07,345 We can see the differences between male and female 68 00:03:07,345 --> 00:03:10,100 instructors: male tenured instructors 69 00:03:10,100 --> 00:03:12,844 are older than male untenured instructors, 70 00:03:12,844 --> 00:03:15,470 whereas female tenured instructors are 71 00:03:15,470 --> 00:03:18,275 younger than female untenured instructors. 72 00:03:18,275 --> 00:03:20,150 To do this in Python, 73 00:03:20,150 --> 00:03:23,815 add the hue argument to the boxplot function. 74 00:03:23,815 --> 00:03:27,575 A pie chart is another way of looking at your data. 75 00:03:27,575 --> 00:03:28,760 You can see here in 76 00:03:28,760 --> 00:03:31,250 this graphic that the number of courses taught by 77 00:03:31,250 --> 00:03:33,380 male instructors is larger than the number of 78 00:03:33,380 --> 00:03:36,140 courses taught by female instructors. 79 00:03:36,140 --> 00:03:37,745 To do this in Python, 80 00:03:37,745 --> 00:03:40,465 we will use the matplotlib library.
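Before moving on to the pie chart, the two box plot calls described above (a plain box plot of age by gender, then the same with tenure added through the hue argument) might look roughly like the following sketch. The data frame name `ratings_df`, the column names `gender`, `tenure`, and `age`, and the category labels are assumptions, and the rows are randomly generated placeholders, so the plot will not reproduce the patterns discussed in the video.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated placeholder rows (NOT the real teaching evaluation data).
# Column names and category labels are hypothetical.
rng = np.random.default_rng(1)
n = 80
ratings_df = pd.DataFrame({
    "gender": rng.choice(["female", "male"], size=n),
    "tenure": rng.choice(["no", "yes"], size=n),
    "age": rng.integers(29, 74, size=n),
})

fig, (ax_simple, ax_hue) = plt.subplots(1, 2, figsize=(10, 4))

# Gender on the y axis and age on the x axis gives horizontal boxes;
# swapping x and y would give vertical ones.
sns.boxplot(x="age", y="gender", data=ratings_df, ax=ax_simple)

# Adding a further dimension: tenure on the x axis, age on the y axis,
# and one box per gender within each tenure group via hue.
sns.boxplot(x="tenure", y="age", hue="gender", data=ratings_df, ax=ax_hue)

fig.tight_layout()
```

In an interactive session you would drop the `matplotlib.use("Agg")` line and finish with `plt.show()`.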
81 00:03:40,465 --> 00:03:42,815 First we specify the labels, 82 00:03:42,815 --> 00:03:44,795 get the number of courses taught 83 00:03:44,795 --> 00:03:47,210 by both males and females, 84 00:03:47,210 --> 00:03:49,250 and assign it to a sizes variable. 85 00:03:49,250 --> 00:03:52,040 Create a subplot, pass the sizes, 86 00:03:52,040 --> 00:03:53,420 labels and a percentage format with 87 00:03:53,420 --> 00:03:55,610 one decimal place to the pie function, 88 00:03:55,610 --> 00:03:59,700 and display the pie chart with the show function.
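The pie chart steps above can be sketched as follows. The counts are invented for the example (ten placeholder courses rather than the real data), and the variable names other than `labels` and `sizes` are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from collections import Counter

# Invented placeholder data: one entry per course, giving the instructor's gender.
course_genders = ["male"] * 6 + ["female"] * 4

# Step 1: specify the labels.
labels = ["Male", "Female"]

# Step 2: count the courses taught by each group and store them as sizes.
counts = Counter(course_genders)
sizes = [counts["male"], counts["female"]]

# Step 3: create a subplot and pass sizes, labels, and a percentage
# format with one decimal place to the pie function.
fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(sizes, labels=labels, autopct="%1.1f%%")
ax.axis("equal")  # keep the pie circular

# Step 4: in an interactive session, plt.show() would display the chart.
```

The `autopct` format string controls the percentage labels; `"%1.1f%%"` prints one digit after the decimal point followed by a percent sign.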