1
00:00:00,000 --> 00:00:04,040
Hello, and welcome to analyzing data with Python.

2
00:00:04,040 --> 00:00:05,840
After completing this video,

3
00:00:05,840 --> 00:00:08,130
you will be able to understand basic concepts

4
00:00:08,130 --> 00:00:11,340
related to performing exploratory analysis on data.

5
00:00:11,340 --> 00:00:14,260
We will demonstrate an example of how to store data using

6
00:00:14,260 --> 00:00:17,470
the IBM Db2 on Cloud database,

7
00:00:17,470 --> 00:00:21,630
and then use Python to do some basic data analysis on this data.

8
00:00:21,630 --> 00:00:23,830
In this video, we will be using

9
00:00:23,830 --> 00:00:28,500
the McDonald's menu nutritional facts data for popular menu items at McDonald's,

10
00:00:28,500 --> 00:00:32,810
while using Python to perform basic exploratory analysis.

11
00:00:32,810 --> 00:00:35,500
McDonald's is an American fast food company

12
00:00:35,500 --> 00:00:38,230
and the world's largest restaurant chain by revenue.

13
00:00:38,230 --> 00:00:41,620
Although McDonald's is known for fast food items such as hamburgers,

14
00:00:41,620 --> 00:00:43,710
French fries, soft drinks, milkshakes,

15
00:00:43,710 --> 00:00:47,310
and desserts, the company has added to its menu salads,

16
00:00:47,310 --> 00:00:49,720
fish, smoothies, and fruit.

17
00:00:49,720 --> 00:00:52,360
McDonald's provides nutrition analysis of

18
00:00:52,360 --> 00:00:56,600
their menu items to help you balance your McDonald's meal with other foods you eat.

19
00:00:56,600 --> 00:00:59,440
The data set used in this lesson has been obtained from

20
00:00:59,440 --> 00:01:02,570
the nutritional facts for McDonald's menu from Kaggle.

21
00:01:02,570 --> 00:01:05,820
We need to create a table on Db2 to store

22
00:01:05,820 --> 00:01:09,440
the McDonald's menu nutrition facts data set that we will be using.

23
00:01:09,440 --> 00:01:14,840
We will also be using the console provided by Db2 for this process.

24
00:01:14,840 --> 00:01:18,300
There are four steps involved in loading data into a table,

25
00:01:18,300 --> 00:01:22,200
source, target, define, and finalize.

26
00:01:22,200 --> 00:01:26,480
We first load the spreadsheet into the Db2 using the console.

27
00:01:26,480 --> 00:01:28,530
We then select the target schema,

28
00:01:28,530 --> 00:01:31,030
and then you will be given an option to load the data into

29
00:01:31,030 --> 00:01:34,170
an existing table or create a new table.

30
00:01:34,170 --> 00:01:36,460
When you choose to create a new table,

31
00:01:36,460 --> 00:01:39,180
you have the option to specify the table name.

32
00:01:39,180 --> 00:01:41,910
Next, you will see a preview of the data where you

33
00:01:41,910 --> 00:01:44,740
can also define the columns and data types.

34
00:01:44,740 --> 00:01:47,300
Review the settings and begin the load.

35
00:01:47,300 --> 00:01:49,160
When the loading is complete,

36
00:01:49,160 --> 00:01:51,930
you can see the statistics on the loaded data.

37
00:01:51,930 --> 00:01:55,090
Next, view the table to explore further.

38
00:01:55,090 --> 00:02:00,410
Db2 Warehouse allows you to analyze data using in-database analytics,

39
00:02:00,410 --> 00:02:03,650
APIs, RStudio or Python.

40
00:02:03,650 --> 00:02:06,460
The data has been loaded into our relational database.

41
00:02:06,460 --> 00:02:12,140
You can run Python scripts that retrieve data from and write data to a Db2 database.

42
00:02:12,140 --> 00:02:15,720
Such scripts can be powerful tools to help you analyze your data.

43
00:02:15,720 --> 00:02:18,350
For example, you can use them to generate

44
00:02:18,350 --> 00:02:21,530
statistical models based on data in your database,

45
00:02:21,530 --> 00:02:24,520
and to plot the results of these models.

46
00:02:24,520 --> 00:02:27,220
In this lesson, we will be using

47
00:02:27,220 --> 00:02:30,840
Python scripts that will be run within a Jupyter notebook.

48
00:02:30,840 --> 00:02:33,440
Now, after obtaining a connection resource,

49
00:02:33,440 --> 00:02:35,030
by connecting to the database,

50
00:02:35,030 --> 00:02:39,810
by using the connect method of the IBM_DB API,

51
00:02:39,810 --> 00:02:42,420
we use the SQL select query to verify

52
00:02:42,420 --> 00:02:45,500
the number of rows that have been loaded in the table created.

53
00:02:45,500 --> 00:02:48,170
The figure shows a snapshot of the output.

54
00:02:48,170 --> 00:02:51,140
The output obtained is 260 which is similar

55
00:02:51,140 --> 00:02:54,340
to the number of rows in the Db2 console.

56
00:02:54,340 --> 00:02:59,050
Now let's see how we can use Pandas to retrieve data from the database tables.

57
00:02:59,050 --> 00:03:02,370
We load data from the McDonalds_nutrition table into

58
00:03:02,370 --> 00:03:06,010
the data frame DF using the read_SQL method.

59
00:03:06,010 --> 00:03:08,820
The SQL select query and the connection object are

60
00:03:08,820 --> 00:03:12,300
passed as parameters to the read_SQL method.

61
00:03:12,300 --> 00:03:14,960
We can view the first few rows of the data frame DF

62
00:03:14,960 --> 00:03:18,150
that we created using the head method.

63
00:03:18,150 --> 00:03:20,170
Now it's time to learn about your data.

64
00:03:20,170 --> 00:03:25,320
Pandas methods are equipped with a set of common mathematical and statistical methods.

65
00:03:25,320 --> 00:03:27,710
Let's use the describe method to view

66
00:03:27,710 --> 00:03:30,970
the summary statistics of the data in the data frame,

67
00:03:30,970 --> 00:03:33,870
then explore the output of the describe method.

68
00:03:33,870 --> 00:03:38,990
We see that there are 260 observations or food items in our data frame.

69
00:03:38,990 --> 00:03:43,650
We also see that there are nine unique categories of food items in our data frame.

70
00:03:43,650 --> 00:03:48,380
We can also see summary statistics information such as frequency, mean,

71
00:03:48,380 --> 00:03:51,120
median, standard deviation, et cetera for

72
00:03:51,120 --> 00:03:54,460
the 260 food items across the different variables.

73
00:03:54,460 --> 00:03:58,990
For example, the maximum value for total fat is 118.

74
00:03:58,990 --> 00:04:01,430
Let's investigate this data further.

75
00:04:01,430 --> 00:04:06,220
Let's try to understand one of the nutrients in the food items which is sodium.

76
00:04:06,220 --> 00:04:09,040
A main source of sodium is table salt.

77
00:04:09,040 --> 00:04:12,910
The average American eats five or more teaspoons of salt each day.

78
00:04:12,910 --> 00:04:15,900
This is about 20 times as much as the body needs.

79
00:04:15,900 --> 00:04:18,170
Sodium is found naturally in foods,

80
00:04:18,170 --> 00:04:21,600
but a lot of it is added during processing and preparation.

81
00:04:21,600 --> 00:04:23,860
Many foods that do not taste salty,

82
00:04:23,860 --> 00:04:25,580
may still be high in sodium.

83
00:04:25,580 --> 00:04:28,180
Large amounts of sodium can be hidden in canned,

84
00:04:28,180 --> 00:04:30,660
processed and convenience foods.

85
00:04:30,660 --> 00:04:33,290
Sodium controls fluid balance in our bodies,

86
00:04:33,290 --> 00:04:35,530
and maintains blood volume and blood pressure.

87
00:04:35,530 --> 00:04:39,560
Eating too much sodium may raise blood pressure and cause fluid retention,

88
00:04:39,560 --> 00:04:41,330
which could lead to swelling of the legs,

89
00:04:41,330 --> 00:04:43,230
and feet, or other health issues.

90
00:04:43,230 --> 00:04:45,210
When limiting sodium in your diet,

91
00:04:45,210 --> 00:04:49,360
a common target is to eat less than 2,000 milligrams of sodium per day.

92
00:04:49,360 --> 00:04:52,280
Now using the nutrition data set for McDonald's,

93
00:04:52,280 --> 00:04:55,730
let's do some basic data analysis to answer the question.

94
00:04:55,730 --> 00:04:59,430
Which food item has the maximum sodium content?

95
00:04:59,430 --> 00:05:04,520
We first use visualization to explore the sodium content of food items.

96
00:05:04,520 --> 00:05:08,230
Using the swarm plot method provided by the Seaborne package,

97
00:05:08,230 --> 00:05:11,720
we create a categorical scatter plot as shown on the right,

98
00:05:11,720 --> 00:05:13,920
then give as the input,

99
00:05:13,920 --> 00:05:16,050
category on the x-axis,

100
00:05:16,050 --> 00:05:17,930
sodium on the y-axis,

101
00:05:17,930 --> 00:05:20,190
and the data will be the data frame DF that

102
00:05:20,190 --> 00:05:23,330
contains the nutritional data set from McDonald's.

103
00:05:23,330 --> 00:05:27,540
The plot shows the sodium values for the different food items by category.

104
00:05:27,540 --> 00:05:32,670
Notice a high value of around 3,600 for sodium on the scatter plot.

105
00:05:32,670 --> 00:05:35,960
We will be learning about visualizations later in this module.

106
00:05:35,960 --> 00:05:39,320
Let's further explore this high sodium value and identify

107
00:05:39,320 --> 00:05:42,940
which food items on the menu have this value for sodium.

108
00:05:42,940 --> 00:05:46,370
Let's do some basic data analysis using Python to

109
00:05:46,370 --> 00:05:50,180
find which food items on the menu have maximum sodium content.

110
00:05:50,180 --> 00:05:54,250
To check the values of sodium levels in the food items within the dataset,

111
00:05:54,250 --> 00:05:57,110
we use the code as shown in code 1.

112
00:05:57,110 --> 00:06:02,960
The describe method is used to understand the summary statistics associated with sodium.

113
00:06:02,960 --> 00:06:07,870
Notice that the maximum value of sodium is given as 3,600.

114
00:06:07,870 --> 00:06:11,000
Now let's further explore the row associated

115
00:06:11,000 --> 00:06:14,530
with the maximum sodium variable as shown in code 2.

116
00:06:14,530 --> 00:06:18,560
We use the idxmax method to compute the index values,

117
00:06:18,560 --> 00:06:22,240
at which the maximum value of sodium is obtained in the data frame.

118
00:06:22,240 --> 00:06:24,750
We see that the output is 82.

119
00:06:24,750 --> 00:06:29,910
Now lets find the item name associated with the 82nd item in our data frame.

120
00:06:29,910 --> 00:06:31,670
As shown in code 3,

121
00:06:31,670 --> 00:06:34,690
we will use the .at method to find

122
00:06:34,690 --> 00:06:39,190
the item name by passing the index of 82 and the column name item,

123
00:06:39,190 --> 00:06:41,690
to be returned for the 82nd row.

124
00:06:41,690 --> 00:06:45,270
Finally, we find that the food item on the menu that has

125
00:06:45,270 --> 00:06:49,370
a highest sodium content is Chicken McNuggets, 40 pieces.

126
00:06:49,370 --> 00:06:53,130
Visualizations are very useful for initial data exploration.

127
00:06:53,130 --> 00:06:55,670
They can help us understand relationships,

128
00:06:55,670 --> 00:06:57,970
patterns, and outliers in the data.

129
00:06:57,970 --> 00:07:01,870
Let's first create a scatter plot with protein on the x-axis,

130
00:07:01,870 --> 00:07:04,270
and total fat on the y-axis.

131
00:07:04,270 --> 00:07:07,840
Scatter plots are very popular visualization tools and show

132
00:07:07,840 --> 00:07:11,690
the relationship between two variables with a point for each observation.

133
00:07:11,690 --> 00:07:16,580
To do this, we can use the joint plot function provided by the Seaborn package,

134
00:07:16,580 --> 00:07:18,290
and give as input,

135
00:07:18,290 --> 00:07:22,570
protein on the x-axis and total fat on the y-axis.

136
00:07:22,570 --> 00:07:25,300
And the data will be the data frame DF that

137
00:07:25,300 --> 00:07:28,690
contains the nutritional data set from McDonald's.

138
00:07:28,690 --> 00:07:31,860
The output scatter plot is shown on the right side.

139
00:07:31,860 --> 00:07:34,440
The plot has an interesting shape.

140
00:07:34,440 --> 00:07:39,390
It shows the correlation between the two variables: protein and fat.

141
00:07:39,390 --> 00:07:43,160
Correlation is a measure of association between two variables,

142
00:07:43,160 --> 00:07:46,580
and has a value of between -1 and +1.

143
00:07:46,580 --> 00:07:48,590
We see that the points on the scatter plot are

144
00:07:48,590 --> 00:07:51,200
closer to a straight line in the positive direction.

145
00:07:51,200 --> 00:07:54,600
So we have a positive correlation between the two variables.

146
00:07:54,600 --> 00:07:56,880
On the top right corner of the scatter plot,

147
00:07:56,880 --> 00:07:59,600
we have the values of the Pearson correlation-

148
00:07:59,600 --> 00:08:03,610
0.81 and the significance of the correlation denoted

149
00:08:03,610 --> 00:08:08,130
as P - which is a good value that shows the variables are certainly correlated.

150
00:08:08,130 --> 00:08:10,710
The plot also shows two histograms:

151
00:08:10,710 --> 00:08:13,250
one on the top and the other on the right side.

152
00:08:13,250 --> 00:08:16,440
The histogram on the top is that of the variable protein,

153
00:08:16,440 --> 00:08:20,190
and the histogram on the right side is that of the variable total fat.

154
00:08:20,190 --> 00:08:24,800
We also noticed that there is a point on the scatter plot outside the general pattern.

155
00:08:24,800 --> 00:08:26,710
This is a possible outlier.

156
00:08:26,710 --> 00:08:30,500
Now let's see how we can visualize data using box plots.

157
00:08:30,500 --> 00:08:35,170
Box plots are charts that indicate the distribution of one or more variables.

158
00:08:35,170 --> 00:08:38,500
The box in a box plot captures the middle 50 percent of data.

159
00:08:38,500 --> 00:08:42,540
Lines and points indicate possible skewness and outliers.

160
00:08:42,540 --> 00:08:44,780
Let's create a box plot for sugar.

161
00:08:44,780 --> 00:08:48,270
The function we are going to use is box plot from the Seaborn package.

162
00:08:48,270 --> 00:08:52,910
We give the column name sugars as input to the box plot function.

163
00:08:52,910 --> 00:08:55,000
The output is shown on the right side,

164
00:08:55,000 --> 00:09:00,240
where we had the box plot with average values of sugar and food items around 30 grams.

165
00:09:00,240 --> 00:09:05,580
We also notice a few outliers that indicate food items with extreme values of sugar.

166
00:09:05,580 --> 00:09:11,420
There exist food items in the data set that have sugar content of around 128 grams.

167
00:09:11,420 --> 00:09:15,680
Candies maybe among these high sugar content food items on the menu.

168
00:09:15,680 --> 00:09:17,310
Now that you know how to do

169
00:09:17,310 --> 00:09:22,020
basic exploratory data analysis using Pandas and visualization tools,

170
00:09:22,020 --> 00:09:26,700
proceed to the labs in this module where you can practice the concepts learned.

171
00:09:26,700 --> 00:09:29,870
Thank you for watching this video.

172
00:09:29,870 --> 00:09:34,000
(Music)