1 00:00:00,000 --> 00:00:04,040 Hello, and welcome to analyzing data with Python. 2 00:00:04,040 --> 00:00:05,840 After completing this video, 3 00:00:05,840 --> 00:00:08,130 you will be able to understand basic concepts 4 00:00:08,130 --> 00:00:11,340 related to performing exploratory analysis on data. 5 00:00:11,340 --> 00:00:14,260 We will demonstrate an example of how to store data using 6 00:00:14,260 --> 00:00:17,470 the IBM Db2 on Cloud database, 7 00:00:17,470 --> 00:00:21,630 and then use Python to do some basic data analysis on this data. 8 00:00:21,630 --> 00:00:23,830 In this video, we will be using 9 00:00:23,830 --> 00:00:28,500 the McDonald's menu nutritional facts data for popular menu items at McDonald's, 10 00:00:28,500 --> 00:00:32,810 while using Python to perform basic exploratory analysis. 11 00:00:32,810 --> 00:00:35,500 McDonald's is an American fast food company 12 00:00:35,500 --> 00:00:38,230 and the world's largest restaurant chain by revenue. 13 00:00:38,230 --> 00:00:41,620 Although McDonald's is known for fast food items such as hamburgers, 14 00:00:41,620 --> 00:00:43,710 French fries, soft drinks, milkshakes, 15 00:00:43,710 --> 00:00:47,310 and desserts, the company has added to its menu salads, 16 00:00:47,310 --> 00:00:49,720 fish, smoothies, and fruit. 17 00:00:49,720 --> 00:00:52,360 McDonald's provides nutrition analysis of 18 00:00:52,360 --> 00:00:56,600 their menu items to help you balance your McDonald's meal with other foods you eat. 19 00:00:56,600 --> 00:00:59,440 The data set used in this lesson has been obtained from 20 00:00:59,440 --> 00:01:02,570 the nutritional facts for McDonald's menu from Kaggle. 21 00:01:02,570 --> 00:01:05,820 We need to create a table on Db2 to store 22 00:01:05,820 --> 00:01:09,440 the McDonald's menu nutrition facts data set that we will be using. 23 00:01:09,440 --> 00:01:14,840 We will also be using the console provided by Db2 for this process. 24 00:01:14,840 --> 00:01:18,300 There are four steps involved in loading data into a table, 25 00:01:18,300 --> 00:01:22,200 source, target, define, and finalize. 26 00:01:22,200 --> 00:01:26,480 We first load the spreadsheet into the Db2 using the console. 27 00:01:26,480 --> 00:01:28,530 We then select the target schema, 28 00:01:28,530 --> 00:01:31,030 and then you will be given an option to load the data into 29 00:01:31,030 --> 00:01:34,170 an existing table or create a new table. 30 00:01:34,170 --> 00:01:36,460 When you choose to create a new table, 31 00:01:36,460 --> 00:01:39,180 you have the option to specify the table name. 32 00:01:39,180 --> 00:01:41,910 Next, you will see a preview of the data where you 33 00:01:41,910 --> 00:01:44,740 can also define the columns and data types. 34 00:01:44,740 --> 00:01:47,300 Review the settings and begin the load. 35 00:01:47,300 --> 00:01:49,160 When the loading is complete, 36 00:01:49,160 --> 00:01:51,930 you can see the statistics on the loaded data. 37 00:01:51,930 --> 00:01:55,090 Next, view the table to explore further. 38 00:01:55,090 --> 00:02:00,410 Db2 Warehouse allows you to analyze data using in-database analytics, 39 00:02:00,410 --> 00:02:03,650 APIs, RStudio or Python. 40 00:02:03,650 --> 00:02:06,460 The data has been loaded into our relational database. 41 00:02:06,460 --> 00:02:12,140 You can run Python scripts that retrieve data from and write data to a Db2 database. 42 00:02:12,140 --> 00:02:15,720 Such scripts can be powerful tools to help you analyze your data. 43 00:02:15,720 --> 00:02:18,350 For example, you can use them to generate 44 00:02:18,350 --> 00:02:21,530 statistical models based on data in your database, 45 00:02:21,530 --> 00:02:24,520 and to plot the results of these models. 46 00:02:24,520 --> 00:02:27,220 In this lesson, we will be using 47 00:02:27,220 --> 00:02:30,840 Python scripts that will be run within a Jupyter notebook. 48 00:02:30,840 --> 00:02:33,440 Now, after obtaining a connection resource, 49 00:02:33,440 --> 00:02:35,030 by connecting to the database, 50 00:02:35,030 --> 00:02:39,810 by using the connect method of the IBM_DB API, 51 00:02:39,810 --> 00:02:42,420 we use the SQL select query to verify 52 00:02:42,420 --> 00:02:45,500 the number of rows that have been loaded in the table created. 53 00:02:45,500 --> 00:02:48,170 The figure shows a snapshot of the output. 54 00:02:48,170 --> 00:02:51,140 The output obtained is 260 which is similar 55 00:02:51,140 --> 00:02:54,340 to the number of rows in the Db2 console. 56 00:02:54,340 --> 00:02:59,050 Now let's see how we can use Pandas to retrieve data from the database tables. 57 00:02:59,050 --> 00:03:02,370 We load data from the McDonalds_nutrition table into 58 00:03:02,370 --> 00:03:06,010 the data frame DF using the read_SQL method. 59 00:03:06,010 --> 00:03:08,820 The SQL select query and the connection object are 60 00:03:08,820 --> 00:03:12,300 passed as parameters to the read_SQL method. 61 00:03:12,300 --> 00:03:14,960 We can view the first few rows of the data frame DF 62 00:03:14,960 --> 00:03:18,150 that we created using the head method. 63 00:03:18,150 --> 00:03:20,170 Now it's time to learn about your data. 64 00:03:20,170 --> 00:03:25,320 Pandas methods are equipped with a set of common mathematical and statistical methods. 65 00:03:25,320 --> 00:03:27,710 Let's use the describe method to view 66 00:03:27,710 --> 00:03:30,970 the summary statistics of the data in the data frame, 67 00:03:30,970 --> 00:03:33,870 then explore the output of the describe method. 68 00:03:33,870 --> 00:03:38,990 We see that there are 260 observations or food items in our data frame. 69 00:03:38,990 --> 00:03:43,650 We also see that there are nine unique categories of food items in our data frame. 70 00:03:43,650 --> 00:03:48,380 We can also see summary statistics information such as frequency, mean, 71 00:03:48,380 --> 00:03:51,120 median, standard deviation, et cetera for 72 00:03:51,120 --> 00:03:54,460 the 260 food items across the different variables. 73 00:03:54,460 --> 00:03:58,990 For example, the maximum value for total fat is 118. 74 00:03:58,990 --> 00:04:01,430 Let's investigate this data further. 75 00:04:01,430 --> 00:04:06,220 Let's try to understand one of the nutrients in the food items which is sodium. 76 00:04:06,220 --> 00:04:09,040 A main source of sodium is table salt. 77 00:04:09,040 --> 00:04:12,910 The average American eats five or more teaspoons of salt each day. 78 00:04:12,910 --> 00:04:15,900 This is about 20 times as much as the body needs. 79 00:04:15,900 --> 00:04:18,170 Sodium is found naturally in foods, 80 00:04:18,170 --> 00:04:21,600 but a lot of it is added during processing and preparation. 81 00:04:21,600 --> 00:04:23,860 Many foods that do not taste salty, 82 00:04:23,860 --> 00:04:25,580 may still be high in sodium. 83 00:04:25,580 --> 00:04:28,180 Large amounts of sodium can be hidden in canned, 84 00:04:28,180 --> 00:04:30,660 processed and convenience foods. 85 00:04:30,660 --> 00:04:33,290 Sodium controls fluid balance in our bodies, 86 00:04:33,290 --> 00:04:35,530 and maintains blood volume and blood pressure. 87 00:04:35,530 --> 00:04:39,560 Eating too much sodium may raise blood pressure and cause fluid retention, 88 00:04:39,560 --> 00:04:41,330 which could lead to swelling of the legs, 89 00:04:41,330 --> 00:04:43,230 and feet, or other health issues. 90 00:04:43,230 --> 00:04:45,210 When limiting sodium in your diet, 91 00:04:45,210 --> 00:04:49,360 a common target is to eat less than 2,000 milligrams of sodium per day. 92 00:04:49,360 --> 00:04:52,280 Now using the nutrition data set for McDonald's, 93 00:04:52,280 --> 00:04:55,730 let's do some basic data analysis to answer the question. 94 00:04:55,730 --> 00:04:59,430 Which food item has the maximum sodium content? 95 00:04:59,430 --> 00:05:04,520 We first use visualization to explore the sodium content of food items. 96 00:05:04,520 --> 00:05:08,230 Using the swarm plot method provided by the Seaborne package, 97 00:05:08,230 --> 00:05:11,720 we create a categorical scatter plot as shown on the right, 98 00:05:11,720 --> 00:05:13,920 then give as the input, 99 00:05:13,920 --> 00:05:16,050 category on the x-axis, 100 00:05:16,050 --> 00:05:17,930 sodium on the y-axis, 101 00:05:17,930 --> 00:05:20,190 and the data will be the data frame DF that 102 00:05:20,190 --> 00:05:23,330 contains the nutritional data set from McDonald's. 103 00:05:23,330 --> 00:05:27,540 The plot shows the sodium values for the different food items by category. 104 00:05:27,540 --> 00:05:32,670 Notice a high value of around 3,600 for sodium on the scatter plot. 105 00:05:32,670 --> 00:05:35,960 We will be learning about visualizations later in this module. 106 00:05:35,960 --> 00:05:39,320 Let's further explore this high sodium value and identify 107 00:05:39,320 --> 00:05:42,940 which food items on the menu have this value for sodium. 108 00:05:42,940 --> 00:05:46,370 Let's do some basic data analysis using Python to 109 00:05:46,370 --> 00:05:50,180 find which food items on the menu have maximum sodium content. 110 00:05:50,180 --> 00:05:54,250 To check the values of sodium levels in the food items within the dataset, 111 00:05:54,250 --> 00:05:57,110 we use the code as shown in code 1. 112 00:05:57,110 --> 00:06:02,960 The describe method is used to understand the summary statistics associated with sodium. 113 00:06:02,960 --> 00:06:07,870 Notice that the maximum value of sodium is given as 3,600. 114 00:06:07,870 --> 00:06:11,000 Now let's further explore the row associated 115 00:06:11,000 --> 00:06:14,530 with the maximum sodium variable as shown in code 2. 116 00:06:14,530 --> 00:06:18,560 We use the idxmax method to compute the index values, 117 00:06:18,560 --> 00:06:22,240 at which the maximum value of sodium is obtained in the data frame. 118 00:06:22,240 --> 00:06:24,750 We see that the output is 82. 119 00:06:24,750 --> 00:06:29,910 Now lets find the item name associated with the 82nd item in our data frame. 120 00:06:29,910 --> 00:06:31,670 As shown in code 3, 121 00:06:31,670 --> 00:06:34,690 we will use the .at method to find 122 00:06:34,690 --> 00:06:39,190 the item name by passing the index of 82 and the column name item, 123 00:06:39,190 --> 00:06:41,690 to be returned for the 82nd row. 124 00:06:41,690 --> 00:06:45,270 Finally, we find that the food item on the menu that has 125 00:06:45,270 --> 00:06:49,370 a highest sodium content is Chicken McNuggets, 40 pieces. 126 00:06:49,370 --> 00:06:53,130 Visualizations are very useful for initial data exploration. 127 00:06:53,130 --> 00:06:55,670 They can help us understand relationships, 128 00:06:55,670 --> 00:06:57,970 patterns, and outliers in the data. 129 00:06:57,970 --> 00:07:01,870 Let's first create a scatter plot with protein on the x-axis, 130 00:07:01,870 --> 00:07:04,270 and total fat on the y-axis. 131 00:07:04,270 --> 00:07:07,840 Scatter plots are very popular visualization tools and show 132 00:07:07,840 --> 00:07:11,690 the relationship between two variables with a point for each observation. 133 00:07:11,690 --> 00:07:16,580 To do this, we can use the joint plot function provided by the Seaborn package, 134 00:07:16,580 --> 00:07:18,290 and give as input, 135 00:07:18,290 --> 00:07:22,570 protein on the x-axis and total fat on the y-axis. 136 00:07:22,570 --> 00:07:25,300 And the data will be the data frame DF that 137 00:07:25,300 --> 00:07:28,690 contains the nutritional data set from McDonald's. 138 00:07:28,690 --> 00:07:31,860 The output scatter plot is shown on the right side. 139 00:07:31,860 --> 00:07:34,440 The plot has an interesting shape. 140 00:07:34,440 --> 00:07:39,390 It shows the correlation between the two variables: protein and fat. 141 00:07:39,390 --> 00:07:43,160 Correlation is a measure of association between two variables, 142 00:07:43,160 --> 00:07:46,580 and has a value of between -1 and +1. 143 00:07:46,580 --> 00:07:48,590 We see that the points on the scatter plot are 144 00:07:48,590 --> 00:07:51,200 closer to a straight line in the positive direction. 145 00:07:51,200 --> 00:07:54,600 So we have a positive correlation between the two variables. 146 00:07:54,600 --> 00:07:56,880 On the top right corner of the scatter plot, 147 00:07:56,880 --> 00:07:59,600 we have the values of the Pearson correlation- 148 00:07:59,600 --> 00:08:03,610 0.81 and the significance of the correlation denoted 149 00:08:03,610 --> 00:08:08,130 as P - which is a good value that shows the variables are certainly correlated. 150 00:08:08,130 --> 00:08:10,710 The plot also shows two histograms: 151 00:08:10,710 --> 00:08:13,250 one on the top and the other on the right side. 152 00:08:13,250 --> 00:08:16,440 The histogram on the top is that of the variable protein, 153 00:08:16,440 --> 00:08:20,190 and the histogram on the right side is that of the variable total fat. 154 00:08:20,190 --> 00:08:24,800 We also noticed that there is a point on the scatter plot outside the general pattern. 155 00:08:24,800 --> 00:08:26,710 This is a possible outlier. 156 00:08:26,710 --> 00:08:30,500 Now let's see how we can visualize data using box plots. 157 00:08:30,500 --> 00:08:35,170 Box plots are charts that indicate the distribution of one or more variables. 158 00:08:35,170 --> 00:08:38,500 The box in a box plot captures the middle 50 percent of data. 159 00:08:38,500 --> 00:08:42,540 Lines and points indicate possible skewness and outliers. 160 00:08:42,540 --> 00:08:44,780 Let's create a box plot for sugar. 161 00:08:44,780 --> 00:08:48,270 The function we are going to use is box plot from the Seaborn package. 162 00:08:48,270 --> 00:08:52,910 We give the column name sugars as input to the box plot function. 163 00:08:52,910 --> 00:08:55,000 The output is shown on the right side, 164 00:08:55,000 --> 00:09:00,240 where we had the box plot with average values of sugar and food items around 30 grams. 165 00:09:00,240 --> 00:09:05,580 We also notice a few outliers that indicate food items with extreme values of sugar. 166 00:09:05,580 --> 00:09:11,420 There exist food items in the data set that have sugar content of around 128 grams. 167 00:09:11,420 --> 00:09:15,680 Candies maybe among these high sugar content food items on the menu. 168 00:09:15,680 --> 00:09:17,310 Now that you know how to do 169 00:09:17,310 --> 00:09:22,020 basic exploratory data analysis using Pandas and visualization tools, 170 00:09:22,020 --> 00:09:26,700 proceed to the labs in this module where you can practice the concepts learned. 171 00:09:26,700 --> 00:09:29,870 Thank you for watching this video. 172 00:09:29,870 --> 00:09:34,000 (Music)