1 00:00:07,940 --> 00:00:13,455 Welcome back. Since CSV is just a special format, 2 00:00:13,455 --> 00:00:15,360 you can read it like any other file. 3 00:00:15,360 --> 00:00:20,520 In fact, we've been reading a file that was in CSV format already, the Olympics file. 4 00:00:20,520 --> 00:00:24,915 It didn't follow the convention of using .csv as the ending for the file name, 5 00:00:24,915 --> 00:00:28,620 but the actual contents were in CSV format. 6 00:00:28,620 --> 00:00:34,275 The advantage when we know that something is in CSV format is that it's easy to parse it. 7 00:00:34,275 --> 00:00:36,140 We can just chop up each line into 8 00:00:36,140 --> 00:00:40,420 its individual components by looking for where the commas are. 9 00:00:40,420 --> 00:00:43,665 For example, take a look at this code. 10 00:00:43,665 --> 00:00:45,460 On lines one through four, 11 00:00:45,460 --> 00:00:49,950 I'm just reminding you of what the contents of this file are. 12 00:00:54,830 --> 00:00:59,240 So, you can see the first line is a header, 13 00:00:59,240 --> 00:01:05,475 and then I'm printing out more lines up to line six. 14 00:01:05,475 --> 00:01:11,850 Each line has somebody's name and the other values, and they're separated by commas. 15 00:01:12,820 --> 00:01:16,250 I want to show you how easy it is to process 16 00:01:16,250 --> 00:01:22,145 these contents because we can just use the .split method, looking for commas. 17 00:01:22,145 --> 00:01:24,895 So, on line six of our code, 18 00:01:24,895 --> 00:01:27,390 we're looking just at the header line, 19 00:01:27,390 --> 00:01:32,235 that's this one, and we're saying, 20 00:01:32,235 --> 00:01:33,820 well, first of all, 21 00:01:33,820 --> 00:01:38,390 get rid of the newline character at the end of that line. 22 00:01:38,390 --> 00:01:44,230 Then, split wherever you see a comma.
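The code shown on screen is not part of this transcript, so here is a minimal sketch of the header step just described: strip the newline, then split on commas. The header text is an assumption. The lecture only confirms that the first three field names are name, sex, and age; the remaining names are illustrative guesses.

```python
# Assumed header line. Only the first three field names are confirmed by
# the lecture; "Team", "Event", and "Medal" are illustrative guesses.
header = "Name,Sex,Age,Team,Event,Medal\n"

# First remove the trailing newline, then cut the text at every comma.
field_names = header.strip("\n").split(",")

print(field_names)
```

The result is an ordinary Python list, so each field name can be reached by its index.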
23 00:01:44,230 --> 00:01:46,270 So, where there's a comma, 24 00:01:46,270 --> 00:01:50,430 we're going to chop up the text and we're going to get, 25 00:01:51,530 --> 00:01:57,005 actually, we can see the output here because on line eight, we're printing it out. 26 00:01:57,005 --> 00:02:03,440 So, we have name as the first value, 27 00:02:03,440 --> 00:02:06,930 it's the characters that occur before the first comma, 28 00:02:06,930 --> 00:02:09,945 and our next value is the string sex, 29 00:02:09,945 --> 00:02:11,850 and then we have the string age. 30 00:02:11,850 --> 00:02:15,230 All of these are coming from this one line of text, 31 00:02:15,230 --> 00:02:18,340 but we've chopped it up to make a list, 32 00:02:18,340 --> 00:02:22,965 and that's what the .split method does for us. 33 00:02:22,965 --> 00:02:27,700 We're doing something pretty similar with the rest of the lines. 34 00:02:27,700 --> 00:02:31,460 We're looping through all of them and for each of them, 35 00:02:31,460 --> 00:02:36,870 we're chopping it up wherever we find a comma on the line. 36 00:02:37,040 --> 00:02:42,255 Now, once we take a line like A Lamusi,M,23,China,Judo,NA, 37 00:02:42,255 --> 00:02:47,840 and we split it up into a list, 38 00:02:47,840 --> 00:02:50,214 we can now use indexing. 39 00:02:50,214 --> 00:02:54,280 I can ask for the value that's at index five, 40 00:02:54,280 --> 00:02:57,355 the sixth element from that line, 41 00:02:57,355 --> 00:03:00,970 and we can check, is its value NA? 42 00:03:00,970 --> 00:03:04,805 Well, sure enough, its value is NA, 43 00:03:04,805 --> 00:03:06,690 and we do one thing if it's NA, 44 00:03:06,690 --> 00:03:08,265 we do something else if it's not. 45 00:03:08,265 --> 00:03:11,504 In this case, if it's not NA, 46 00:03:11,504 --> 00:03:12,945 then we're going to print something out. 47 00:03:12,945 --> 00:03:17,850 So, we're going to print out something only for the people who actually won a medal.
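Here is a sketch of the loop just described, under stated assumptions. The first data row is the one quoted in the lecture. The second row is made up so that there is a medal winner to print. The exact wording of the format string is also an assumption. The lecture only says that it has three curly braces, filled from positions zero, four, and five.

```python
# Rows held in a list here so the sketch runs without the Olympics file.
lines = [
    "A Lamusi,M,23,China,Judo,NA\n",                   # row quoted in the lecture
    "Example Athlete,F,30,Exampleland,Rowing,Gold\n",  # made-up row for illustration
]

medalists = []
for line in lines:
    # Strip the newline, then split the row into its six values.
    vals = line.strip("\n").split(",")
    # Index 5 (the sixth value) holds the medal, or "NA" if there was none.
    if vals[5] != "NA":
        summary = "{}: {}; {}".format(vals[0], vals[4], vals[5])
        medalists.append(summary)
        print(summary)
```

Only the made-up row is printed, because the row for A Lamusi has NA in the medal position.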
48 00:03:17,850 --> 00:03:20,665 We're going to skip the people who didn't win a medal. 49 00:03:20,665 --> 00:03:24,290 You can see the output that comes down here. 50 00:03:24,290 --> 00:03:26,210 So, only people who won a gold, 51 00:03:26,210 --> 00:03:30,410 a bronze or a silver will show up in our output. 52 00:03:30,410 --> 00:03:34,490 We're choosing here to not print the whole line, 53 00:03:34,490 --> 00:03:37,140 we're printing three elements from that line, 54 00:03:37,140 --> 00:03:38,900 the three curly braces, 55 00:03:38,900 --> 00:03:45,000 and we're printing out vals square bracket zero, that's the name. 56 00:03:45,000 --> 00:03:47,600 It's the string that comes before the first comma, 57 00:03:47,600 --> 00:03:51,470 and we're getting the thing from position four and from position five. 58 00:03:51,470 --> 00:03:55,980 So, that's the name and the event and the medal that they won. 59 00:03:57,230 --> 00:04:01,594 Now, note that we have to split on commas. 60 00:04:01,594 --> 00:04:04,970 I think before when we've seen the split method, 61 00:04:04,970 --> 00:04:09,710 we've tended to just split without specifying the value. 62 00:04:09,710 --> 00:04:11,660 When you don't specify a value, 63 00:04:11,660 --> 00:04:15,855 it splits wherever it finds any whitespace, 64 00:04:15,855 --> 00:04:19,095 a space or a tab or a newline. 65 00:04:19,095 --> 00:04:23,040 If we do that, we're going to see that we get something different. 66 00:04:23,040 --> 00:04:26,820 We're not going to get this nice list here, 67 00:04:26,820 --> 00:04:29,080 we're going to get a different list. 68 00:04:31,730 --> 00:04:42,205 Sure enough, what we get is a list with only one element in it.
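A small sketch comparing separator choices in this part of the lecture, again using an assumed header line (only the first three field names are confirmed). It shows the split with no argument, the split on commas, and a split on the lowercase letter e as an example of an arbitrary separator.

```python
# Assumed header text; field names after "Age" are illustrative guesses.
header = "Name,Sex,Age,Team,Event,Medal"

# No argument: split on runs of whitespace. There is none in this string,
# so the whole line comes back as a single element.
by_whitespace = header.split()

# Comma as the separator: one element per field.
by_comma = header.split(",")

# Any other string works as a separator too, here the character "e".
by_letter_e = header.split("e")

print(by_whitespace)
print(by_comma)
print(by_letter_e)
```

Note that `"e"` is a string literal in quotes. Without the quotes, Python would look for a variable named `e` instead.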
69 00:04:42,205 --> 00:04:44,720 It's one big string, 70 00:04:44,720 --> 00:04:47,010 with all of its commas and everything, 71 00:04:47,010 --> 00:04:51,805 it hasn't split it up into seven different elements or six different elements, 72 00:04:51,805 --> 00:04:53,905 it's just giving us one big thing. 73 00:04:53,905 --> 00:04:59,085 The reason is, it was looking for whitespace and in this whole string, 74 00:04:59,085 --> 00:05:00,590 there are no spaces, 75 00:05:00,590 --> 00:05:02,540 no tabs, no carriage returns. 76 00:05:02,540 --> 00:05:04,235 So, we just get one item. 77 00:05:04,235 --> 00:05:06,815 If we had split on something else, 78 00:05:06,815 --> 00:05:09,810 let's say the letter e, 79 00:05:12,980 --> 00:05:17,490 it would split wherever there was an e, 80 00:05:17,490 --> 00:05:19,900 and we'll get some weird thing. 81 00:05:22,310 --> 00:05:24,770 I have to split on the character e, 82 00:05:24,770 --> 00:05:33,690 not the variable name e. We're just waiting for it to finish. 83 00:05:33,690 --> 00:05:38,325 Sure enough, our first value is Nam and then there was an e, 84 00:05:38,325 --> 00:05:41,540 and then after that there is a comma and a capital S, 85 00:05:41,540 --> 00:05:44,035 and then there was another e and so on. 86 00:05:44,035 --> 00:05:47,480 So, split will split on whatever you tell it to split on. 87 00:05:47,480 --> 00:05:52,700 In our case, we want to split on commas because the 88 00:05:52,700 --> 00:05:59,160 comma-separated values format says commas are the things that separate the values. 89 00:05:59,160 --> 00:06:04,640 By the way, there is another more advanced version of 90 00:06:04,640 --> 00:06:11,615 the CSV format that separates with commas but encloses all of the values in quotes. 91 00:06:11,615 --> 00:06:14,250 Let's see what that looks like. 92 00:06:14,300 --> 00:06:18,360 Here you can see in this file format, 93 00:06:18,360 --> 00:06:23,240 you can see that some events have commas in them, while others don't.
94 00:06:23,240 --> 00:06:28,000 For example, we have Speed Skating, 1500 meters. 95 00:06:28,000 --> 00:06:31,410 Whereas, for Tug-Of-War or Basketball, 96 00:06:31,410 --> 00:06:33,465 there's no comma in it. 97 00:06:33,465 --> 00:06:38,250 That's going to make it harder to parse because when there's a comma, 98 00:06:38,250 --> 00:06:43,615 we don't know whether it's part of a value or separating values. 99 00:06:43,615 --> 00:06:48,914 If we were to just split on comma like we did before, 100 00:06:48,914 --> 00:06:58,155 vals equals row.split of comma. 101 00:06:58,155 --> 00:07:02,940 Then we said vals square bracket five, 102 00:07:02,940 --> 00:07:04,720 like we did before, 103 00:07:04,720 --> 00:07:07,640 then index five, 104 00:07:07,640 --> 00:07:13,680 the sixth element, of this row will be NA, 105 00:07:13,680 --> 00:07:21,330 but index five of this row will be the 1500 meters. 106 00:07:21,530 --> 00:07:27,170 So, life gets more complicated when we want to parse 107 00:07:27,170 --> 00:07:30,425 this more advanced comma-separated values format 108 00:07:30,425 --> 00:07:34,380 that also has quotes around each of the values. 109 00:07:34,380 --> 00:07:38,525 It actually is still possible to unambiguously chop up the lines, 110 00:07:38,525 --> 00:07:40,490 but that's a harder programming challenge. 111 00:07:40,490 --> 00:07:42,875 I don't recommend trying it yourself. 112 00:07:42,875 --> 00:07:46,610 Instead, when you encounter something in this format, 113 00:07:46,610 --> 00:07:50,690 you would use Python's csv module to parse the lines for you. 114 00:07:50,690 --> 00:07:53,855 We're not going to learn that module right now. 115 00:07:53,855 --> 00:07:57,860 I've found that it's good for students to learn how to parse simple CSVs using 116 00:07:57,860 --> 00:08:02,470 the .split method at this point for understanding what's really going on.
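For the curious, here is a brief preview sketch of the problem and of the standard library's `csv` module, which the course leaves for later. The row is made up to resemble the quoted format described above. It is not taken from the actual file.

```python
import csv

# Made-up row in the quoted style; the event value contains a comma.
row = '"Example Skater","M","25","Exampleland","Speed Skating, 1500M","NA"'

# Naive approach: the comma inside the event is treated as a separator,
# so we get seven pieces and index 5 no longer holds the medal.
naive = row.split(",")
print(len(naive), naive[5])

# csv.reader accepts any iterable of lines and honors the quotes,
# so the event stays in one piece and index 5 is the medal again.
parsed = next(csv.reader([row]))
print(len(parsed), parsed[4], parsed[5])
```

The naive split also leaves the quote characters attached to each piece, which is another reason to let the module handle this format.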
117 00:08:02,470 --> 00:08:06,725 Later, you can learn to use the csv module for harder formats. 118 00:08:06,725 --> 00:08:10,910 In summary, when we have a simple CSV format 119 00:08:10,910 --> 00:08:15,825 with commas separating the values and no quotes around them, parsing is easy. 120 00:08:15,825 --> 00:08:18,300 You just read in the file, a line at a time, 121 00:08:18,300 --> 00:08:22,640 and you use the split method, specifying comma as the thing to split on. 122 00:08:22,640 --> 00:08:25,910 That gives you a list of the individual values or 123 00:08:25,910 --> 00:08:30,580 the individual field names on the header line. I'll see you next time.
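To tie the summary together, here is a minimal sketch of the whole recipe applied to a file on disk. The helper name `parse_simple_csv` and the sample contents are hypothetical. The sketch writes its own temporary file so that it runs without the course's Olympics file.

```python
import os
import tempfile


def parse_simple_csv(path):
    """Return (field_names, rows) for a simple, unquoted CSV file."""
    with open(path, "r") as f:
        # The first line is the header: strip the newline, split on commas.
        field_names = f.readline().strip("\n").split(",")
        # Every remaining line becomes a list of string values.
        rows = [line.strip("\n").split(",") for line in f]
    return field_names, rows


# Hypothetical sample data; the header names after "Age" are assumed.
sample = "Name,Sex,Age,Team,Event,Medal\nA Lamusi,M,23,China,Judo,NA\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

field_names, rows = parse_simple_csv(path)
os.remove(path)  # clean up the temporary file

print(field_names)
print(rows)
```

Every value comes back as a string, including the age, so convert with `int` where a number is needed.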