1
00:00:06,200 --> 00:00:09,210
The measures of central
tendency are the most

2
00:00:09,210 --> 00:00:11,925
commonly used measures in
statistical analysis.

3
00:00:11,925 --> 00:00:14,235
We know them as mean, median,

4
00:00:14,235 --> 00:00:15,900
and mode and their use is

5
00:00:15,900 --> 00:00:18,195
ubiquitous in
statistical analysis.

6
00:00:18,195 --> 00:00:20,685
So let us see how they work.

7
00:00:20,685 --> 00:00:23,490
Before we begin, let us take

8
00:00:23,490 --> 00:00:25,920
a quick look at our
dataset. In this course,

9
00:00:25,920 --> 00:00:28,515
we have been using the
teaching evaluation data

10
00:00:28,515 --> 00:00:30,285
from the University of Texas.

11
00:00:30,285 --> 00:00:34,080
The dataset comprises
463 courses,

12
00:00:34,080 --> 00:00:35,550
in which we have
information about

13
00:00:35,550 --> 00:00:38,730
the teaching evaluation score
received by the instructor.

14
00:00:38,730 --> 00:00:40,190
We have information about

15
00:00:40,190 --> 00:00:42,050
the attributes of the instructor,

16
00:00:42,050 --> 00:00:45,065
as well as the characteristics
of the course.

17
00:00:45,065 --> 00:00:46,955
Once you have imported

18
00:00:46,955 --> 00:00:50,360
a CSV file with the
pandas Python library,

19
00:00:50,360 --> 00:00:52,730
the first step in getting
to know your data is to

20
00:00:52,730 --> 00:00:55,750
discover the different
data types it contains.

21
00:00:55,750 --> 00:00:57,990
You can display all columns and

22
00:00:57,990 --> 00:01:00,950
their data types with
DataFrame.info().

23
00:01:00,950 --> 00:01:05,310
In this case, we have named
our DataFrame ratings_df.

24
00:01:05,530 --> 00:01:08,660
It tells you how
many rows you have.

25
00:01:08,660 --> 00:01:10,595
For the teaching rating data,

26
00:01:10,595 --> 00:01:13,370
we have 463 entries from zero to

27
00:01:13,370 --> 00:01:17,255
462 because Python starts
counting from zero.

28
00:01:17,255 --> 00:01:20,555
Then it also gives you
information about the data types.

29
00:01:20,555 --> 00:01:22,630
Object represents strings.

30
00:01:22,630 --> 00:01:26,450
int64 represents integers,
or whole numbers,

31
00:01:26,450 --> 00:01:28,640
and float represents
real numbers,

32
00:01:28,640 --> 00:01:31,265
which can include
decimal points.
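
As a rough illustration of the step just described, here is a minimal sketch. The tiny DataFrame and its column names (gender, age, eval) are made up for this example and are not the actual course dataset, which you would normally load from a CSV file.

```python
import pandas as pd

# Hypothetical stand-in for the teaching ratings data (3 rows only).
ratings_df = pd.DataFrame({
    "gender": ["female", "male", "female"],  # text column
    "age": [36, 59, 51],                     # whole numbers
    "eval": [4.3, 3.7, 4.8],                 # numbers with decimals
})

# Prints the number of entries, the index range (0 to 2 here),
# and one line per column with its data type.
ratings_df.info()

# The data types alone can also be inspected directly.
# Text columns usually appear as object (some newer pandas
# versions may report a dedicated string type instead).
print(ratings_df.dtypes)
```

With three rows, the index runs from 0 to 2, in the same way that 463 rows run from 0 to 462.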

33
00:01:31,265 --> 00:01:33,710
Before we begin, let us have

34
00:01:33,710 --> 00:01:36,500
a conversation about
population and samples.

35
00:01:36,500 --> 00:01:38,270
Essentially, if you have

36
00:01:38,270 --> 00:01:39,680
all the information of interest

37
00:01:39,680 --> 00:01:41,150
for a particular decision,

38
00:01:41,150 --> 00:01:42,830
about every individual that is

39
00:01:42,830 --> 00:01:44,900
supposed to be involved
in that decision,

40
00:01:44,900 --> 00:01:46,910
that is called a population.

41
00:01:46,910 --> 00:01:48,770
So if you are
interested in looking

42
00:01:48,770 --> 00:01:50,570
at some attribute of driving,

43
00:01:50,570 --> 00:01:52,085
and we have information about

44
00:01:52,085 --> 00:01:55,115
all possible automobile
drivers in the US,

45
00:01:55,115 --> 00:01:57,605
then we call this
the population.

46
00:01:57,605 --> 00:01:59,945
The sample, on the other hand,

47
00:01:59,945 --> 00:02:01,910
is a subset of the population.

48
00:02:01,910 --> 00:02:04,700
So for example, if we have

49
00:02:04,700 --> 00:02:07,925
data on all married drivers
over the age of 25,

50
00:02:07,925 --> 00:02:09,820
then that's a subset.

51
00:02:09,820 --> 00:02:13,160
Within that subset, if
we were to randomly

52
00:02:13,160 --> 00:02:14,415
select five percent of

53
00:02:14,415 --> 00:02:17,225
those married drivers
over the age of 25,

54
00:02:17,225 --> 00:02:19,400
that would be our sample.

55
00:02:19,400 --> 00:02:21,290
We use samples, especially

56
00:02:21,290 --> 00:02:22,790
in cases where we do not want to

57
00:02:22,790 --> 00:02:24,260
incur the cost of collecting

58
00:02:24,260 --> 00:02:26,465
data for the entire population.

59
00:02:26,465 --> 00:02:29,165
Now, let us consider
that there are

60
00:02:29,165 --> 00:02:32,315
230 million individuals
in the country.

61
00:02:32,315 --> 00:02:33,915
A sample size of

62
00:02:33,915 --> 00:02:39,215
330 to 500 individuals randomly
selected would suffice.

63
00:02:39,215 --> 00:02:41,330
This reduces the cost,

64
00:02:41,330 --> 00:02:43,220
especially in cases
where you cannot

65
00:02:43,220 --> 00:02:45,875
collect information for
the entire population.

66
00:02:45,875 --> 00:02:48,140
Therefore, using samples

67
00:02:48,140 --> 00:02:51,095
is really helpful
and cost-effective.

68
00:02:51,095 --> 00:02:54,575
Here you see some Greek
symbols on the screen.

69
00:02:54,575 --> 00:02:56,120
But don't be afraid.

70
00:02:56,120 --> 00:02:58,145
They simply show the formulas.

71
00:02:58,145 --> 00:03:00,530
We will then proceed from here.

72
00:03:00,530 --> 00:03:03,095
While they may
differ in notation,

73
00:03:03,095 --> 00:03:04,340
essentially the mean for

74
00:03:04,340 --> 00:03:06,950
a population and a
sample is computed the same way.

75
00:03:06,950 --> 00:03:10,190
It is the sum of all
the observations,

76
00:03:10,190 --> 00:03:13,460
divided by the number of
observations, which gives the mean,

77
00:03:13,460 --> 00:03:15,889
also called the average.

78
00:03:15,889 --> 00:03:17,960
There are several properties

79
00:03:17,960 --> 00:03:19,550
of the mean and they
are meaningful.

80
00:03:19,550 --> 00:03:20,990
One of the characteristics of

81
00:03:20,990 --> 00:03:22,310
the mean is that if you take

82
00:03:22,310 --> 00:03:25,490
the average value
for a variable,

83
00:03:25,490 --> 00:03:29,315
subtract it from each observation,
and sum up the differences,

84
00:03:29,315 --> 00:03:32,210
that sum will be equal to zero.
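
To make this property concrete, here is a small sketch. The five scores are invented for illustration and do not come from the course dataset.

```python
import pandas as pd

# Hypothetical evaluation scores (not from the course dataset).
scores = pd.Series([3.5, 4.0, 4.5, 2.0, 5.0])

# Mean: the sum of the observations divided by their count.
mean = scores.sum() / len(scores)

# Subtract the mean from every observation, then add up the differences.
deviations = scores - mean
total = deviations.sum()

print("mean:", mean)
print("sum of deviations from the mean:", total)
```

The printed sum is zero up to floating-point rounding, so a tiny nonzero value is possible on some inputs.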

85
00:03:32,210 --> 00:03:34,820
The median is different
from the mean.

86
00:03:34,820 --> 00:03:36,380
When you order the data from

87
00:03:36,380 --> 00:03:38,600
the smallest value to
the largest value,

88
00:03:38,600 --> 00:03:40,710
look at the value in the middle.

89
00:03:40,710 --> 00:03:43,450
That is, the value in the middle

90
00:03:43,450 --> 00:03:44,500
is the one for which there is

91
00:03:44,500 --> 00:03:46,210
an equal number of observations

92
00:03:46,210 --> 00:03:47,920
above it and an
equal number of

93
00:03:47,920 --> 00:03:50,380
observations
below it.

94
00:03:50,380 --> 00:03:53,050
That value is called the median.

95
00:03:53,050 --> 00:03:57,595
So if the median salary in
some city is $45 thousand,

96
00:03:57,595 --> 00:03:59,770
it means that 50 percent
of the people make

97
00:03:59,770 --> 00:04:02,050
more than $45 thousand and

98
00:04:02,050 --> 00:04:05,980
the other 50 percent make
less than $45 thousand.

99
00:04:05,980 --> 00:04:08,830
Mode is essentially the value

100
00:04:08,830 --> 00:04:10,810
that occurs most frequently.

101
00:04:10,810 --> 00:04:13,450
Therefore, if the
most common age in

102
00:04:13,450 --> 00:04:16,930
a class of students is
16, then that's the mode.
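
The median and the mode can be sketched in the same way. The salaries and ages below are invented so that they mirror the two examples just given; they are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical salaries in some city (illustrative values only).
salaries = pd.Series([30000, 45000, 38000, 52000, 90000])

# Median: the middle value once the data are sorted.
# Sorted: 30000, 38000, 45000, 52000, 90000
median_salary = salaries.median()

# Hypothetical ages in a class of students.
ages = pd.Series([15, 16, 16, 17, 16, 15, 18])

# Mode: the most frequent value. Series.mode() returns a Series
# because there can be ties, so we take the first entry.
mode_age = ages.mode().iloc[0]

print("median salary:", median_salary)
print("most common age:", mode_age)
```

Note that the single high salary of 90,000 pulls the mean upward but leaves the median at the middle value.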

103
00:04:16,930 --> 00:04:20,260
We will now turn to Python
for our hands-on training to

104
00:04:20,260 --> 00:04:23,490
estimate the summary statistics
for beauty score,

105
00:04:23,490 --> 00:04:26,445
teaching evaluation,
and age.

106
00:04:26,445 --> 00:04:28,295
We will use the

107
00:04:28,295 --> 00:04:32,135
DataFrame.describe() function to
find the summary statistics.

108
00:04:32,135 --> 00:04:34,040
This prints out the number of

109
00:04:34,040 --> 00:04:36,665
rows, mean, standard deviation,

110
00:04:36,665 --> 00:04:39,880
minimum value, 25th, 50th,

111
00:04:39,880 --> 00:04:43,745
75th percentile, and
the maximum value.

112
00:04:43,745 --> 00:04:45,620
To find the summary statistics

113
00:04:45,620 --> 00:04:47,480
for a subset of the variables,

114
00:04:47,480 --> 00:04:49,715
you will have to state
the column names

115
00:04:49,715 --> 00:04:51,575
as we can see here.

116
00:04:51,575 --> 00:04:54,500
Otherwise, for the
full set of variables,

117
00:04:54,500 --> 00:04:58,710
we call the describe
function on the whole DataFrame.
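
The two ways of calling describe might look roughly like this. The DataFrame below is a made-up stand-in, and the column names (beauty, eval, age, gender) are assumptions for the example rather than a confirmed listing of the course dataset.

```python
import pandas as pd

# Hypothetical stand-in for ratings_df (4 rows, invented values).
ratings_df = pd.DataFrame({
    "beauty": [0.2, -0.5, 1.1, 0.0],
    "eval": [4.3, 3.7, 4.8, 3.9],
    "age": [36, 59, 51, 40],
    "gender": ["female", "male", "male", "female"],
})

# Summary statistics for every numeric column.
# By default, describe() leaves out non-numeric columns such as gender.
full_summary = ratings_df.describe()

# Summary statistics for a subset: pass a list of column names first.
subset_summary = ratings_df[["beauty", "eval"]].describe()

print(full_summary)
print(subset_summary)
```

Each result has one row per statistic: count, mean, std, min, 25%, 50%, 75%, and max.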