1
00:00:05,840 --> 00:00:07,860
We have learned to compute

2
00:00:07,860 --> 00:00:09,885
averages and standard deviations,

3
00:00:09,885 --> 00:00:12,060
but now we will use the
same information, or the

4
00:00:12,060 --> 00:00:15,855
same knowledge to make
comparisons between groups.

5
00:00:15,855 --> 00:00:19,275
So we will use the same dataset
that we have used so far.

6
00:00:19,275 --> 00:00:21,600
That is the teaching
evaluation data from

7
00:00:21,600 --> 00:00:25,125
the University of Texas,
comprising 463 courses.

8
00:00:25,125 --> 00:00:27,720
We are looking at the
teaching evaluation,

9
00:00:27,720 --> 00:00:28,980
beauty, and age.

10
00:00:28,980 --> 00:00:30,660
We're comparing the averages for

11
00:00:30,660 --> 00:00:34,200
these three variables, grouped by
the female-instructor variable.

12
00:00:34,200 --> 00:00:35,910
So those who are females,

13
00:00:35,910 --> 00:00:38,310
their average teaching
evaluation was 3.9

14
00:00:38,310 --> 00:00:41,080
compared with 4.06 for men.

15
00:00:41,080 --> 00:00:42,350
Here we're looking at

16
00:00:42,350 --> 00:00:46,640
the average teaching evaluation
for tenured professors,

17
00:00:46,640 --> 00:00:49,775
3.96 versus untenured, 4.13.

18
00:00:49,775 --> 00:00:53,945
The average age of untenured
professors was 50.2 years,

19
00:00:53,945 --> 00:00:58,160
and that for tenured
professors, 47.85 years.

20
00:00:58,160 --> 00:00:59,780
One thing that is
very important in

21
00:00:59,780 --> 00:01:02,105
statistical analysis is to

22
00:01:02,105 --> 00:01:04,070
think about the
question and to think

23
00:01:04,070 --> 00:01:06,770
about the population or sample
that you are working with.

24
00:01:06,770 --> 00:01:10,570
We are computing averages
across 463 courses.

25
00:01:10,570 --> 00:01:12,765
We find the average
age or beauty,

26
00:01:12,765 --> 00:01:15,635
but these are the
attributes of instructors.

27
00:01:15,635 --> 00:01:17,420
We know from our data that

28
00:01:17,420 --> 00:01:19,520
there are 94 instructors who have

29
00:01:19,520 --> 00:01:22,250
collectively taught 463 courses,

30
00:01:22,250 --> 00:01:24,020
and we know that
there are duplicates,

31
00:01:24,020 --> 00:01:25,985
that is, the same instructor

32
00:01:25,985 --> 00:01:28,700
who has taught multiple courses.

33
00:01:28,700 --> 00:01:32,570
So when I compute the average
age using 463 courses,

34
00:01:32,570 --> 00:01:34,520
it's not necessarily
the average age of

35
00:01:34,520 --> 00:01:37,595
the instructors because
it could be true that

36
00:01:37,595 --> 00:01:40,430
older individuals may

37
00:01:40,430 --> 00:01:43,370
have taught more courses
than younger individuals,

38
00:01:43,370 --> 00:01:45,095
resulting in a higher average.

39
00:01:45,095 --> 00:01:46,429
That is not necessarily

40
00:01:46,429 --> 00:01:48,860
the average age of
the instructors.

41
00:01:48,860 --> 00:01:51,725
So to avoid this problem,

42
00:01:51,725 --> 00:01:53,480
we have to subset

43
00:01:53,480 --> 00:01:56,600
the data so that we
remove the duplicates and

44
00:01:56,600 --> 00:01:58,685
have only one observation

45
00:01:58,685 --> 00:02:01,550
per individual instructor
in the dataset.

46
00:02:01,550 --> 00:02:03,844
Instead of 463 observations,

47
00:02:03,844 --> 00:02:06,725
you should have just
94 observations.
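
The de-duplication step described above can be sketched with pandas. This is a minimal illustration on made-up rows, not the lecture's own code; the column names prof, age, and beauty are assumptions about how the course-level table is laid out.

```python
import pandas as pd

# Toy stand-in for the course-level data: one row per course taught.
# The column names (prof, age, beauty) are assumed for illustration.
courses = pd.DataFrame({
    "prof":   [1, 1, 1, 2, 3, 3],
    "age":    [60, 60, 60, 35, 40, 40],
    "beauty": [0.2, 0.2, 0.2, -0.5, 0.6, 0.6],
})

# Keep a single row per instructor so each instructor is counted once.
instructors = courses.drop_duplicates(subset="prof")

# Course-level mean: instructors who taught more courses weigh more.
course_level_age = courses["age"].mean()
# Instructor-level mean: every instructor contributes exactly once.
instructor_level_age = instructors["age"].mean()

print(len(courses), len(instructors))
print(course_level_age, instructor_level_age)
```

In this toy table the oldest instructor taught the most courses, so the course-level mean is pulled upward relative to the instructor-level mean, which is the effect the lecture warns about.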

48
00:02:06,725 --> 00:02:08,959
Now let's look at the comparison.

49
00:02:08,959 --> 00:02:11,630
When we use 94 observations where

50
00:02:11,630 --> 00:02:15,335
no instructor is
repeated in the dataset,

51
00:02:15,335 --> 00:02:20,420
the average
beauty score is 0.25.

52
00:02:20,420 --> 00:02:23,990
When we look at the 463 courses,

53
00:02:23,990 --> 00:02:26,965
the average value is 0.11.

54
00:02:26,965 --> 00:02:29,495
Let's compare the age.

55
00:02:29,495 --> 00:02:32,570
The average age using
94 observations

56
00:02:32,570 --> 00:02:37,970
for males is 49.4 and
for females is 44.9.

57
00:02:37,970 --> 00:02:41,130
You see here that as for age,

58
00:02:41,130 --> 00:02:42,920
we don't see much
difference whether we

59
00:02:42,920 --> 00:02:45,514
use 463 observations or 94.

60
00:02:45,514 --> 00:02:48,005
But we certainly see
a large difference in

61
00:02:48,005 --> 00:02:52,695
the beauty scores if you were
to use the wrong dataset.

62
00:02:52,695 --> 00:02:53,810
That is the dataset where

63
00:02:53,810 --> 00:02:57,020
individuals are repeated
multiple times.

64
00:02:57,020 --> 00:02:59,870
Data visualization
is a critical piece

65
00:02:59,870 --> 00:03:01,820
of modern-day
statistical analysis.

66
00:03:01,820 --> 00:03:03,740
Charts and tables are helpful,

67
00:03:03,740 --> 00:03:04,970
so you don't have to eyeball

68
00:03:04,970 --> 00:03:07,655
the output to figure out
what the trends are.

69
00:03:07,655 --> 00:03:11,030
The visual displays are
much easier to understand.

70
00:03:11,030 --> 00:03:13,460
We will use the same
dataset of teaching

71
00:03:13,460 --> 00:03:16,040
evaluations and
ask this question,

72
00:03:16,040 --> 00:03:17,690
do instructors teaching

73
00:03:17,690 --> 00:03:20,780
single credit courses
get higher evaluations?

74
00:03:20,780 --> 00:03:23,660
We see that, yes, they do.

75
00:03:23,660 --> 00:03:26,900
With the mean evaluation
plotted as a chart,

76
00:03:26,900 --> 00:03:29,600
you see that instructors who
teach single credit courses

77
00:03:29,600 --> 00:03:32,975
have a slightly higher
average teaching evaluation.

78
00:03:32,975 --> 00:03:35,030
Let us start by determining

79
00:03:35,030 --> 00:03:36,560
how many courses were taught by

80
00:03:36,560 --> 00:03:40,145
male instructors and how
many by female instructors.

81
00:03:40,145 --> 00:03:42,650
For this, we can use a bar chart.

82
00:03:42,650 --> 00:03:44,360
Notice that the information is

83
00:03:44,360 --> 00:03:46,400
complete from a statistical
point of view in

84
00:03:46,400 --> 00:03:48,020
that we know how
many courses were

85
00:03:48,020 --> 00:03:50,135
taught by males versus females.

86
00:03:50,135 --> 00:03:52,400
But we do not have some
critical information from

87
00:03:52,400 --> 00:03:54,965
this chart as it relates
to communication.

88
00:03:54,965 --> 00:03:57,020
Therefore, we can say

89
00:03:57,020 --> 00:03:59,480
this chart serves a
statistical purpose,

90
00:03:59,480 --> 00:04:02,030
but it doesn't serve a
communication purpose.

91
00:04:02,030 --> 00:04:04,790
Let me illustrate
this with an example.

92
00:04:04,790 --> 00:04:07,700
Here you are looking
at a street map.

93
00:04:07,700 --> 00:04:08,840
You can see the streets and

94
00:04:08,840 --> 00:04:10,280
the buildings and the highways,

95
00:04:10,280 --> 00:04:11,930
but you don't see
the street names.

96
00:04:11,930 --> 00:04:13,685
Without street
names, it is hard to

97
00:04:13,685 --> 00:04:15,260
determine where you are

98
00:04:15,260 --> 00:04:17,360
and in which direction
you should be heading.

99
00:04:17,360 --> 00:04:19,580
Even though it is
according to scale,

100
00:04:19,580 --> 00:04:20,840
it may be accurate in

101
00:04:20,840 --> 00:04:23,135
its depiction of the streets
in the neighborhood,

102
00:04:23,135 --> 00:04:25,040
but it still lacks
the ability to

103
00:04:25,040 --> 00:04:27,005
communicate information to you.

104
00:04:27,005 --> 00:04:29,675
To add communication
value to this map,

105
00:04:29,675 --> 00:04:32,045
you can simply add
the street names.

106
00:04:32,045 --> 00:04:35,480
Let us apply the same
philosophy to our graphic.

107
00:04:35,480 --> 00:04:38,615
Adding information
to this graphic,

108
00:04:38,615 --> 00:04:41,135
for example, just a title,

109
00:04:41,135 --> 00:04:43,235
makes this chart
more informative.

110
00:04:43,235 --> 00:04:45,034
To do this in Python,

111
00:04:45,034 --> 00:04:47,240
we'll use the
countplot function in

112
00:04:47,240 --> 00:04:50,060
the seaborn library and
set the title.

113
00:04:50,060 --> 00:04:52,580
This helps your graph
to be more informative.
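
A minimal sketch of a titled count plot follows. The data frame and its gender column are made up for illustration, and the off-screen Agg backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up course rows; the column name "gender" is an assumption.
ratings = pd.DataFrame(
    {"gender": ["male", "male", "female", "male", "female"]}
)

# countplot draws one bar per category, with height = number of rows.
ax = sns.countplot(x="gender", data=ratings)

# The title and axis labels add the "street names" to the chart.
ax.set_title("Number of courses taught, by instructor gender")
ax.set_xlabel("Instructor gender")
ax.set_ylabel("Number of courses")
```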

114
00:04:52,580 --> 00:04:55,505
We can also add more
dimensions to the data.

115
00:04:55,505 --> 00:04:57,890
In addition to the gender
of the instructors,

116
00:04:57,890 --> 00:04:59,600
we could add the tenure status of

117
00:04:59,600 --> 00:05:01,655
the instructors as
well to the graphic.

118
00:05:01,655 --> 00:05:03,350
To do that in Python,

119
00:05:03,350 --> 00:05:06,215
you add the hue argument
to countplot.
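
As a rough sketch of the hue argument, the example below splits each gender bar by tenure status. The rows and the column names gender and tenure are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up course rows; column names are assumptions.
ratings = pd.DataFrame({
    "gender": ["male", "male", "female", "female", "male", "female"],
    "tenure": ["yes", "no", "yes", "no", "yes", "yes"],
})

# hue splits every x category into one bar per tenure level
# and adds a legend for the colors.
ax = sns.countplot(x="gender", hue="tenure", data=ratings)
ax.set_title("Courses taught, by gender and tenure status")
```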

120
00:05:06,215 --> 00:05:08,390
We can add another
dimension to the data,

121
00:05:08,390 --> 00:05:11,465
regenerating the same graphic
with the same information.

122
00:05:11,465 --> 00:05:13,070
That is, the number of

123
00:05:13,070 --> 00:05:14,930
courses taught by
gender and tenure.

124
00:05:14,930 --> 00:05:16,640
Then adding the dimension of

125
00:05:16,640 --> 00:05:20,150
courses being upper-division
or lower-division,

126
00:05:20,150 --> 00:05:22,940
and presenting them in
two rows or columns.

127
00:05:22,940 --> 00:05:24,530
To do this in Python,

128
00:05:24,530 --> 00:05:26,525
we can specify the row argument

129
00:05:26,525 --> 00:05:28,805
using the catplot function
with kind set to "count".
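
One way to get a separate row of counts per course division is seaborn's figure-level catplot with kind="count", since countplot itself draws on a single axes. The sketch below uses invented rows, and the column names gender, tenure, and division are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up course rows; column names are assumptions.
ratings = pd.DataFrame({
    "gender":   ["male", "female", "male", "female", "male", "female"],
    "tenure":   ["yes", "no", "no", "yes", "yes", "yes"],
    "division": ["upper", "upper", "lower", "lower", "upper", "lower"],
})

# row= creates one panel per division; kind="count" makes each
# panel a count plot of gender split by tenure.
g = sns.catplot(
    x="gender", hue="tenure", row="division",
    kind="count", data=ratings, height=3, aspect=2,
)
```

Passing col="division" instead of row would lay the panels out side by side.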

130
00:05:28,805 --> 00:05:30,920
Now let's look at
the situation where

131
00:05:30,920 --> 00:05:32,060
our primary variables of

132
00:05:32,060 --> 00:05:34,280
interest are
continuous variables.

133
00:05:34,280 --> 00:05:35,600
We would like to explore

134
00:05:35,600 --> 00:05:37,520
the relationship between
the two while adding

135
00:05:37,520 --> 00:05:41,135
further categorical variables
as an additional dimension.

136
00:05:41,135 --> 00:05:44,780
Using the teaching evaluation
data we ask this question,

137
00:05:44,780 --> 00:05:47,465
does age affect
teaching evaluations?

138
00:05:47,465 --> 00:05:49,804
We then add two
additional dimensions,

139
00:05:49,804 --> 00:05:51,830
which are gender and tenure.

140
00:05:51,830 --> 00:05:55,700
So our dataset consists of
age and teaching evaluation,

141
00:05:55,700 --> 00:05:57,680
which are the two
primary variables of

142
00:05:57,680 --> 00:06:00,095
interest and are continuous.

143
00:06:00,095 --> 00:06:02,510
Then we add two other dimensions,

144
00:06:02,510 --> 00:06:05,000
i.e., gender and tenure.

145
00:06:05,000 --> 00:06:07,520
These are categorical variables.

146
00:06:07,520 --> 00:06:09,440
Age is on the X axis and

147
00:06:09,440 --> 00:06:12,320
the teaching evaluation
scores on the Y axis.

148
00:06:12,320 --> 00:06:14,420
The orange colored
circles represent

149
00:06:14,420 --> 00:06:17,869
males and the blue colored
circles represent females.

150
00:06:17,869 --> 00:06:20,660
The top panel is for
tenured professors and

151
00:06:20,660 --> 00:06:23,765
the bottom panel is for
the untenured instructors.

152
00:06:23,765 --> 00:06:25,565
To do this in Python,

153
00:06:25,565 --> 00:06:27,560
we use the FacetGrid class,

154
00:06:27,560 --> 00:06:29,480
which builds a grid
of multiple plots

155
00:06:29,480 --> 00:06:31,220
and allows tweaking the plot.

156
00:06:31,220 --> 00:06:35,045
You set row and hue to
the categorical variables,

157
00:06:35,045 --> 00:06:37,400
in our case, tenure and gender.

158
00:06:37,400 --> 00:06:39,305
Then we use the map method to apply

159
00:06:39,305 --> 00:06:42,540
a plotting function to
each subset of the data.