1
00:00:00,000 --> 00:00:05,960
(Music)

2
00:00:05,960 --> 00:00:08,625
Hello and welcome.

3
00:00:08,625 --> 00:00:10,065
In this video,

4
00:00:10,065 --> 00:00:12,540
we'll give you a few hints and tips for

5
00:00:12,540 --> 00:00:15,435
working with Real World Data-sets.

6
00:00:15,435 --> 00:00:18,090
Many of the real world data sets are

7
00:00:18,090 --> 00:00:21,660
made available as .CSV files.

8
00:00:21,660 --> 00:00:24,120
These are text files which contain

9
00:00:24,120 --> 00:00:27,285
data values typically separated by commas.

10
00:00:27,285 --> 00:00:30,210
In some cases, a different separator

11
00:00:30,210 --> 00:00:32,895
such as a semicolon may be used.

12
00:00:32,895 --> 00:00:35,960
For this video, we will use an example of

13
00:00:35,960 --> 00:00:39,935
a file called DOGS.CSV.

14
00:00:39,935 --> 00:00:42,590
Although this is a fictional data set that

15
00:00:42,590 --> 00:00:45,620
contains names of dogs and their breeds,

16
00:00:45,620 --> 00:00:48,200
we will use it to illustrate concepts that

17
00:00:48,200 --> 00:00:50,935
you will then apply to real datasets.

18
00:00:50,935 --> 00:00:56,715
Sample contents of the DOGS.CSV file are shown here.

19
00:00:56,715 --> 00:01:00,530
The first row in the table in many cases contains

20
00:01:00,530 --> 00:01:04,840
attribute labels which map to column names in a table.

21
00:01:04,840 --> 00:01:07,420
In DOGS.CSV,

22
00:01:07,420 --> 00:01:11,435
the first row contains the name of three attributes.

23
00:01:11,435 --> 00:01:15,050
Id is the name of the first attribute and

24
00:01:15,050 --> 00:01:20,420
the subsequent rows contain Id values of 1, 2, and 3.

25
00:01:20,420 --> 00:01:23,360
The name of the dog is the second attribute.

26
00:01:23,360 --> 00:01:26,435
In this case the dog names Wolfie,

27
00:01:26,435 --> 00:01:29,635
Fluffy, and Huggy are the values.

28
00:01:29,635 --> 00:01:33,075
The third attribute is called breed,

29
00:01:33,075 --> 00:01:36,680
either the dominant breed or pure breed name.

30
00:01:36,680 --> 00:01:38,615
It has values of

31
00:01:38,615 --> 00:01:42,205
German Shepherd, Pomeranian, and Labrador.

32
00:01:42,205 --> 00:01:43,960
As we've just seen,

33
00:01:43,960 --> 00:01:46,985
CSV files can have the first or

34
00:01:46,985 --> 00:01:50,420
a header row that contains the names of the attributes.

35
00:01:50,420 --> 00:01:53,690
If you're loading the data into the database using

36
00:01:53,690 --> 00:01:57,080
the visual load tool in the database console,

37
00:01:57,080 --> 00:02:01,055
ensure the header in first row is enabled.

38
00:02:01,055 --> 00:02:04,640
This will map the attribute names in the first row of

39
00:02:04,640 --> 00:02:09,380
the CSV file into column names in the database table,

40
00:02:09,380 --> 00:02:11,240
and the rest of the rows into

41
00:02:11,240 --> 00:02:14,485
the data rows in the table, as shown here.

42
00:02:14,485 --> 00:02:17,585
Note that the default column names may not

43
00:02:17,585 --> 00:02:20,780
always be database or query friendly,

44
00:02:20,780 --> 00:02:23,270
and if that is the case, you may

45
00:02:23,270 --> 00:02:26,410
want to edit them before the table is created.

46
00:02:26,410 --> 00:02:28,970
Now, let's talk about querying

47
00:02:28,970 --> 00:02:31,660
column names that are lower or mixed case,

48
00:02:31,660 --> 00:02:35,270
that is, a combination of upper and lowercase.

49
00:02:35,270 --> 00:02:38,450
Let's assume we loaded the DOGS.CSV

50
00:02:38,450 --> 00:02:42,575
file using the default column names from the CSV.

51
00:02:42,575 --> 00:02:44,960
If we try to retrieve the contents of

52
00:02:44,960 --> 00:02:47,235
the Id column using the query,

53
00:02:47,235 --> 00:02:50,625
select id from DOGS,

54
00:02:50,625 --> 00:02:52,910
we'll get an error as

55
00:02:52,910 --> 00:02:57,010
shown indicating the id is not valid.

56
00:02:57,010 --> 00:02:59,615
This is because the database parser

57
00:02:59,615 --> 00:03:02,420
assumes uppercase names by default.

58
00:03:02,420 --> 00:03:06,020
Whereas when we loaded the CSV file into

59
00:03:06,020 --> 00:03:11,540
the database it had the Id column name in mixed case i.e

60
00:03:11,540 --> 00:03:15,980
an uppercase I and a lowercase d. In

61
00:03:15,980 --> 00:03:18,200
this case, to select data from

62
00:03:18,200 --> 00:03:20,855
a column with a mixed case name,

63
00:03:20,855 --> 00:03:23,510
we need to specify the column name in

64
00:03:23,510 --> 00:03:27,965
its correct case within double quotes as follows.

65
00:03:27,965 --> 00:03:36,325
Select * "Id" from DOGS.

66
00:03:36,325 --> 00:03:39,080
Ensure you use double quotes around

67
00:03:39,080 --> 00:03:42,305
the column name and not single quotes.

68
00:03:42,305 --> 00:03:44,360
Next, we'll cover querying

69
00:03:44,360 --> 00:03:47,830
column names that have spaces and other characters.

70
00:03:47,830 --> 00:03:52,595
In a CSV file, if the name of the column contain spaces,

71
00:03:52,595 --> 00:03:56,960
by default the database may map them to underscores.

72
00:03:56,960 --> 00:04:00,875
For example, in the name of dog column,

73
00:04:00,875 --> 00:04:04,205
there are spaces in between the three words.

74
00:04:04,205 --> 00:04:08,400
The database may change it to Name_of_Dog.

75
00:04:10,330 --> 00:04:14,030
Other special characters like parentheses or

76
00:04:14,030 --> 00:04:17,200
brackets may also get mapped to underscores.

77
00:04:17,200 --> 00:04:19,770
Therefore, when you write a query

78
00:04:19,770 --> 00:04:22,940
ensure you use proper case formatting within

79
00:04:22,940 --> 00:04:25,880
quotes and substitute special characters

80
00:04:25,880 --> 00:04:28,715
to underscores as shown in this example.

81
00:04:28,715 --> 00:04:31,785
Select "Id," "Name_of_Dog,"

82
00:04:31,785 --> 00:04:36,545
"Breed__dominant_breed_if_not_pure_breed_"from dogs.

83
00:04:36,545 --> 00:04:39,140
Please note the underscores separating

84
00:04:39,140 --> 00:04:41,870
the words within double quotes.

85
00:04:41,870 --> 00:04:45,020
Also note the double underscore

86
00:04:45,020 --> 00:04:48,230
between breed and dominant as shown.

87
00:04:48,230 --> 00:04:51,080
Finally, it's also important to note

88
00:04:51,080 --> 00:04:53,150
the trailing underscore after

89
00:04:53,150 --> 00:04:56,330
the word breed near the end of the query.

90
00:04:56,330 --> 00:04:59,900
This is used in place of the closing bracket.

91
00:04:59,900 --> 00:05:02,540
When using quotes in Jupyter notebooks,

92
00:05:02,540 --> 00:05:05,300
you may be issuing queries in a notebook by

93
00:05:05,300 --> 00:05:08,585
first assigning them to Python variables.

94
00:05:08,585 --> 00:05:10,820
In such cases, if your query

95
00:05:10,820 --> 00:05:13,474
contains double quotes for example,

96
00:05:13,474 --> 00:05:16,540
to specify a mixed case column name,

97
00:05:16,540 --> 00:05:18,620
you could differentiate the quotes

98
00:05:18,620 --> 00:05:20,360
by using single quotes for

99
00:05:20,360 --> 00:05:22,640
the Python variable to enclose

100
00:05:22,640 --> 00:05:26,650
this SQL query and double quotes for the column names.

101
00:05:26,650 --> 00:05:36,580
For example, selectQuery ='select "Id" from dogs.'

102
00:05:36,980 --> 00:05:39,740
Now, what if you need to

103
00:05:39,740 --> 00:05:42,380
specify single quotes within the query,

104
00:05:42,380 --> 00:05:46,400
for example, to specify a value in the where clause?

105
00:05:46,400 --> 00:05:49,265
In this case you can use backslash

106
00:05:49,265 --> 00:05:52,270
as the escape character as follows,

107
00:05:52,270 --> 00:05:59,430
select Query = 'select * from dogs where

108
00:06:07,810 --> 00:06:12,320
"Name_of_Dog"=\'Huggy\' '. If you have

109
00:06:12,320 --> 00:06:16,865
very long queries such as join queries or nested queries,

110
00:06:16,865 --> 00:06:19,670
it may be useful to split the query into

111
00:06:19,670 --> 00:06:22,700
multiple lines for improved readability.

112
00:06:22,700 --> 00:06:24,770
In Python notebooks, you

113
00:06:24,770 --> 00:06:26,870
can use the backslash character to

114
00:06:26,870 --> 00:06:29,810
indicate continuation to the next row

115
00:06:29,810 --> 00:06:32,480
as shown in this example.

116
00:06:32,480 --> 00:06:36,645
%sql select "Id," Name_of_Dog,"

117
00:06:36,645 --> 00:06:42,160
\ from dogs \ where"Name_of_Dog" = 'Huggy.'

118
00:06:42,160 --> 00:06:44,990
It would be helpful at this point to take

119
00:06:44,990 --> 00:06:48,335
a moment to review the special characters as shown.

120
00:06:48,335 --> 00:06:50,630
Please keep in mind that you might get

121
00:06:50,630 --> 00:06:53,210
an error if you split the query into

122
00:06:53,210 --> 00:06:54,650
multiple lines in

123
00:06:54,650 --> 00:06:58,370
a Python notebook without the backslash.

124
00:06:58,370 --> 00:07:01,670
When using SQL magic, you can use

125
00:07:01,670 --> 00:07:04,040
the double percent SQL in

126
00:07:04,040 --> 00:07:07,475
the first line of the cell in Jupyter Notebooks.

127
00:07:07,475 --> 00:07:10,280
It implies that the rest of the content of

128
00:07:10,280 --> 00:07:14,015
the cell is to be interpreted by SQL magic.

129
00:07:14,015 --> 00:07:20,730
For example %% sql new row select "Id",

130
00:07:20,730 --> 00:07:24,345
"Name_of_dog," new row, from dogs,

131
00:07:24,345 --> 00:07:28,275
new row, where "Name_of_dog = 'Huggy.'

132
00:07:28,275 --> 00:07:32,390
Again, please note the special characters as shown.

133
00:07:32,390 --> 00:07:36,200
When using %% sql the backslash

134
00:07:36,200 --> 00:07:39,410
is not needed at the end of each line.

135
00:07:39,410 --> 00:07:42,035
At this point you might be asking,

136
00:07:42,035 --> 00:07:45,485
how would you restrict the number of rows retrieved?

137
00:07:45,485 --> 00:07:47,990
It's a good question because a table

138
00:07:47,990 --> 00:07:50,540
may contain thousands or even millions

139
00:07:50,540 --> 00:07:54,620
of rows, and you may only want to see some sample data

140
00:07:54,620 --> 00:07:57,260
or look at just a few rows to

141
00:07:57,260 --> 00:08:00,380
see what kind of data the table contains.

142
00:08:00,380 --> 00:08:02,870
You may be tempted to just do

143
00:08:02,870 --> 00:08:06,470
select * from table name to retrieve the results

144
00:08:06,470 --> 00:08:10,510
in a Pandas data frame and do a head function on it.

145
00:08:10,510 --> 00:08:15,760
But, doing so may take a long time for a query to run.

146
00:08:15,760 --> 00:08:18,080
Instead, you can restrict

147
00:08:18,080 --> 00:08:21,110
the results set by using the limit clause.

148
00:08:21,110 --> 00:08:24,980
For example, use the following query to retrieve

149
00:08:24,980 --> 00:08:29,395
just the first three rows in a table called census data.

150
00:08:29,395 --> 00:08:35,155
Select * from census_data limit 3.

151
00:08:35,155 --> 00:08:38,450
In this video we looked at some considerations

152
00:08:38,450 --> 00:08:42,005
and tips for working with real-world datasets.

153
00:08:42,005 --> 00:08:44,280
Thanks for watching.

154
00:08:44,280 --> 00:08:48,000
(Music)