1 00:00:00,000 --> 00:00:05,960 (Music) 2 00:00:05,960 --> 00:00:08,625 Hello and welcome. 3 00:00:08,625 --> 00:00:10,065 In this video, 4 00:00:10,065 --> 00:00:12,540 we'll give you a few hints and tips for 5 00:00:12,540 --> 00:00:15,435 working with Real World Data-sets. 6 00:00:15,435 --> 00:00:18,090 Many of the real world data sets are 7 00:00:18,090 --> 00:00:21,660 made available as .CSV files. 8 00:00:21,660 --> 00:00:24,120 These are text files which contain 9 00:00:24,120 --> 00:00:27,285 data values typically separated by commas. 10 00:00:27,285 --> 00:00:30,210 In some cases, a different separator 11 00:00:30,210 --> 00:00:32,895 such as a semicolon may be used. 12 00:00:32,895 --> 00:00:35,960 For this video, we will use an example of 13 00:00:35,960 --> 00:00:39,935 a file called DOGS.CSV. 14 00:00:39,935 --> 00:00:42,590 Although this is a fictional data set that 15 00:00:42,590 --> 00:00:45,620 contains names of dogs and their breeds, 16 00:00:45,620 --> 00:00:48,200 we will use it to illustrate concepts that 17 00:00:48,200 --> 00:00:50,935 you will then apply to real datasets. 18 00:00:50,935 --> 00:00:56,715 Sample contents of the DOGS.CSV file are shown here. 19 00:00:56,715 --> 00:01:00,530 The first row in the table in many cases contains 20 00:01:00,530 --> 00:01:04,840 attribute labels which map to column names in a table. 21 00:01:04,840 --> 00:01:07,420 In DOGS.CSV, 22 00:01:07,420 --> 00:01:11,435 the first row contains the name of three attributes. 23 00:01:11,435 --> 00:01:15,050 Id is the name of the first attribute and 24 00:01:15,050 --> 00:01:20,420 the subsequent rows contain Id values of 1, 2, and 3. 25 00:01:20,420 --> 00:01:23,360 The name of the dog is the second attribute. 26 00:01:23,360 --> 00:01:26,435 In this case the dog names Wolfie, 27 00:01:26,435 --> 00:01:29,635 Fluffy, and Huggy are the values. 28 00:01:29,635 --> 00:01:33,075 The third attribute is called breed, 29 00:01:33,075 --> 00:01:36,680 either the dominant breed or pure breed name. 30 00:01:36,680 --> 00:01:38,615 It has values of 31 00:01:38,615 --> 00:01:42,205 German Shepherd, Pomeranian, and Labrador. 32 00:01:42,205 --> 00:01:43,960 As we've just seen, 33 00:01:43,960 --> 00:01:46,985 CSV files can have the first or 34 00:01:46,985 --> 00:01:50,420 a header row that contains the names of the attributes. 35 00:01:50,420 --> 00:01:53,690 If you're loading the data into the database using 36 00:01:53,690 --> 00:01:57,080 the visual load tool in the database console, 37 00:01:57,080 --> 00:02:01,055 ensure the header in first row is enabled. 38 00:02:01,055 --> 00:02:04,640 This will map the attribute names in the first row of 39 00:02:04,640 --> 00:02:09,380 the CSV file into column names in the database table, 40 00:02:09,380 --> 00:02:11,240 and the rest of the rows into 41 00:02:11,240 --> 00:02:14,485 the data rows in the table, as shown here. 42 00:02:14,485 --> 00:02:17,585 Note that the default column names may not 43 00:02:17,585 --> 00:02:20,780 always be database or query friendly, 44 00:02:20,780 --> 00:02:23,270 and if that is the case, you may 45 00:02:23,270 --> 00:02:26,410 want to edit them before the table is created. 46 00:02:26,410 --> 00:02:28,970 Now, let's talk about querying 47 00:02:28,970 --> 00:02:31,660 column names that are lower or mixed case, 48 00:02:31,660 --> 00:02:35,270 that is, a combination of upper and lowercase. 49 00:02:35,270 --> 00:02:38,450 Let's assume we loaded the DOGS.CSV 50 00:02:38,450 --> 00:02:42,575 file using the default column names from the CSV. 51 00:02:42,575 --> 00:02:44,960 If we try to retrieve the contents of 52 00:02:44,960 --> 00:02:47,235 the Id column using the query, 53 00:02:47,235 --> 00:02:50,625 select id from DOGS, 54 00:02:50,625 --> 00:02:52,910 we'll get an error as 55 00:02:52,910 --> 00:02:57,010 shown indicating the id is not valid. 56 00:02:57,010 --> 00:02:59,615 This is because the database parser 57 00:02:59,615 --> 00:03:02,420 assumes uppercase names by default. 58 00:03:02,420 --> 00:03:06,020 Whereas when we loaded the CSV file into 59 00:03:06,020 --> 00:03:11,540 the database it had the Id column name in mixed case i.e 60 00:03:11,540 --> 00:03:15,980 an uppercase I and a lowercase d. In 61 00:03:15,980 --> 00:03:18,200 this case, to select data from 62 00:03:18,200 --> 00:03:20,855 a column with a mixed case name, 63 00:03:20,855 --> 00:03:23,510 we need to specify the column name in 64 00:03:23,510 --> 00:03:27,965 its correct case within double quotes as follows. 65 00:03:27,965 --> 00:03:36,325 Select * "Id" from DOGS. 66 00:03:36,325 --> 00:03:39,080 Ensure you use double quotes around 67 00:03:39,080 --> 00:03:42,305 the column name and not single quotes. 68 00:03:42,305 --> 00:03:44,360 Next, we'll cover querying 69 00:03:44,360 --> 00:03:47,830 column names that have spaces and other characters. 70 00:03:47,830 --> 00:03:52,595 In a CSV file, if the name of the column contain spaces, 71 00:03:52,595 --> 00:03:56,960 by default the database may map them to underscores. 72 00:03:56,960 --> 00:04:00,875 For example, in the name of dog column, 73 00:04:00,875 --> 00:04:04,205 there are spaces in between the three words. 74 00:04:04,205 --> 00:04:08,400 The database may change it to Name_of_Dog. 75 00:04:10,330 --> 00:04:14,030 Other special characters like parentheses or 76 00:04:14,030 --> 00:04:17,200 brackets may also get mapped to underscores. 77 00:04:17,200 --> 00:04:19,770 Therefore, when you write a query 78 00:04:19,770 --> 00:04:22,940 ensure you use proper case formatting within 79 00:04:22,940 --> 00:04:25,880 quotes and substitute special characters 80 00:04:25,880 --> 00:04:28,715 to underscores as shown in this example. 81 00:04:28,715 --> 00:04:31,785 Select "Id," "Name_of_Dog," 82 00:04:31,785 --> 00:04:36,545 "Breed__dominant_breed_if_not_pure_breed_"from dogs. 83 00:04:36,545 --> 00:04:39,140 Please note the underscores separating 84 00:04:39,140 --> 00:04:41,870 the words within double quotes. 85 00:04:41,870 --> 00:04:45,020 Also note the double underscore 86 00:04:45,020 --> 00:04:48,230 between breed and dominant as shown. 87 00:04:48,230 --> 00:04:51,080 Finally, it's also important to note 88 00:04:51,080 --> 00:04:53,150 the trailing underscore after 89 00:04:53,150 --> 00:04:56,330 the word breed near the end of the query. 90 00:04:56,330 --> 00:04:59,900 This is used in place of the closing bracket. 91 00:04:59,900 --> 00:05:02,540 When using quotes in Jupyter notebooks, 92 00:05:02,540 --> 00:05:05,300 you may be issuing queries in a notebook by 93 00:05:05,300 --> 00:05:08,585 first assigning them to Python variables. 94 00:05:08,585 --> 00:05:10,820 In such cases, if your query 95 00:05:10,820 --> 00:05:13,474 contains double quotes for example, 96 00:05:13,474 --> 00:05:16,540 to specify a mixed case column name, 97 00:05:16,540 --> 00:05:18,620 you could differentiate the quotes 98 00:05:18,620 --> 00:05:20,360 by using single quotes for 99 00:05:20,360 --> 00:05:22,640 the Python variable to enclose 100 00:05:22,640 --> 00:05:26,650 this SQL query and double quotes for the column names. 101 00:05:26,650 --> 00:05:36,580 For example, selectQuery ='select "Id" from dogs.' 102 00:05:36,980 --> 00:05:39,740 Now, what if you need to 103 00:05:39,740 --> 00:05:42,380 specify single quotes within the query, 104 00:05:42,380 --> 00:05:46,400 for example, to specify a value in the where clause? 105 00:05:46,400 --> 00:05:49,265 In this case you can use backslash 106 00:05:49,265 --> 00:05:52,270 as the escape character as follows, 107 00:05:52,270 --> 00:05:59,430 select Query = 'select * from dogs where 108 00:06:07,810 --> 00:06:12,320 "Name_of_Dog"=\'Huggy\' '. If you have 109 00:06:12,320 --> 00:06:16,865 very long queries such as join queries or nested queries, 110 00:06:16,865 --> 00:06:19,670 it may be useful to split the query into 111 00:06:19,670 --> 00:06:22,700 multiple lines for improved readability. 112 00:06:22,700 --> 00:06:24,770 In Python notebooks, you 113 00:06:24,770 --> 00:06:26,870 can use the backslash character to 114 00:06:26,870 --> 00:06:29,810 indicate continuation to the next row 115 00:06:29,810 --> 00:06:32,480 as shown in this example. 116 00:06:32,480 --> 00:06:36,645 %sql select "Id," Name_of_Dog," 117 00:06:36,645 --> 00:06:42,160 \ from dogs \ where"Name_of_Dog" = 'Huggy.' 118 00:06:42,160 --> 00:06:44,990 It would be helpful at this point to take 119 00:06:44,990 --> 00:06:48,335 a moment to review the special characters as shown. 120 00:06:48,335 --> 00:06:50,630 Please keep in mind that you might get 121 00:06:50,630 --> 00:06:53,210 an error if you split the query into 122 00:06:53,210 --> 00:06:54,650 multiple lines in 123 00:06:54,650 --> 00:06:58,370 a Python notebook without the backslash. 124 00:06:58,370 --> 00:07:01,670 When using SQL magic, you can use 125 00:07:01,670 --> 00:07:04,040 the double percent SQL in 126 00:07:04,040 --> 00:07:07,475 the first line of the cell in Jupyter Notebooks. 127 00:07:07,475 --> 00:07:10,280 It implies that the rest of the content of 128 00:07:10,280 --> 00:07:14,015 the cell is to be interpreted by SQL magic. 129 00:07:14,015 --> 00:07:20,730 For example %% sql new row select "Id", 130 00:07:20,730 --> 00:07:24,345 "Name_of_dog," new row, from dogs, 131 00:07:24,345 --> 00:07:28,275 new row, where "Name_of_dog = 'Huggy.' 132 00:07:28,275 --> 00:07:32,390 Again, please note the special characters as shown. 133 00:07:32,390 --> 00:07:36,200 When using %% sql the backslash 134 00:07:36,200 --> 00:07:39,410 is not needed at the end of each line. 135 00:07:39,410 --> 00:07:42,035 At this point you might be asking, 136 00:07:42,035 --> 00:07:45,485 how would you restrict the number of rows retrieved? 137 00:07:45,485 --> 00:07:47,990 It's a good question because a table 138 00:07:47,990 --> 00:07:50,540 may contain thousands or even millions 139 00:07:50,540 --> 00:07:54,620 of rows, and you may only want to see some sample data 140 00:07:54,620 --> 00:07:57,260 or look at just a few rows to 141 00:07:57,260 --> 00:08:00,380 see what kind of data the table contains. 142 00:08:00,380 --> 00:08:02,870 You may be tempted to just do 143 00:08:02,870 --> 00:08:06,470 select * from table name to retrieve the results 144 00:08:06,470 --> 00:08:10,510 in a Pandas data frame and do a head function on it. 145 00:08:10,510 --> 00:08:15,760 But, doing so may take a long time for a query to run. 146 00:08:15,760 --> 00:08:18,080 Instead, you can restrict 147 00:08:18,080 --> 00:08:21,110 the results set by using the limit clause. 148 00:08:21,110 --> 00:08:24,980 For example, use the following query to retrieve 149 00:08:24,980 --> 00:08:29,395 just the first three rows in a table called census data. 150 00:08:29,395 --> 00:08:35,155 Select * from census_data limit 3. 151 00:08:35,155 --> 00:08:38,450 In this video we looked at some considerations 152 00:08:38,450 --> 00:08:42,005 and tips for working with real-world datasets. 153 00:08:42,005 --> 00:08:44,280 Thanks for watching. 154 00:08:44,280 --> 00:08:48,000 (Music)