1 00:00:07,910 --> 00:00:11,760 Welcome back. What is a file? 2 00:00:11,760 --> 00:00:14,400 It's just a collection of data saved on 3 00:00:14,400 --> 00:00:17,715 a hard disk or other storage that persists over time. 4 00:00:17,715 --> 00:00:22,425 A file has a name, and files can be organized into folders or directories. 5 00:00:22,425 --> 00:00:25,920 We'll be working with text files as opposed to images, 6 00:00:25,920 --> 00:00:27,990 or sounds, or videos. 7 00:00:27,990 --> 00:00:31,640 Here, we have an example of a file called 8 00:00:31,640 --> 00:00:35,615 olympics.txt that is available to us in Runestone. 9 00:00:35,615 --> 00:00:41,270 Each line has information about one athlete's participation in the Olympics. 10 00:00:41,270 --> 00:00:43,830 It's actually a simulated file. 11 00:00:43,830 --> 00:00:48,680 Runestone can't access the real files on your computer for security and privacy reasons. 12 00:00:48,680 --> 00:00:52,010 So, we simulate the presence of a few files in the Runestone environment, 13 00:00:52,010 --> 00:00:55,165 so we can illustrate how file reading works. 14 00:00:55,165 --> 00:00:58,010 Python provides some functions for reading data from 15 00:00:58,010 --> 00:01:00,995 an existing file. There are two steps. 16 00:01:00,995 --> 00:01:07,170 First, you call the open function to open the file. 17 00:01:10,640 --> 00:01:15,355 So, here we have an invocation of the open function, 18 00:01:15,355 --> 00:01:18,850 and we pass in two arguments. 19 00:01:18,850 --> 00:01:22,655 One is a string that's the name of the file, olympics.txt. 20 00:01:22,655 --> 00:01:25,095 And the other says what to do with the file. 21 00:01:25,095 --> 00:01:26,970 In our case, "r" for reading. 22 00:01:26,970 --> 00:01:29,490 Later on, we'll see "w" for writing. 23 00:01:29,490 --> 00:01:33,690 The open function returns an object. 
24 00:01:33,690 --> 00:01:35,200 It's a file object, 25 00:01:35,200 --> 00:01:39,730 and we are assigning it to this variable name called fileref. 26 00:01:41,100 --> 00:01:44,230 There's going to be an additional step that we're going to 27 00:01:44,230 --> 00:01:46,360 have to do to actually read the contents. 28 00:01:46,360 --> 00:01:50,585 So, this step that we're showing so far just creates the file object. 29 00:01:50,585 --> 00:01:54,010 Then, there are going to be some lines of code that we haven't written yet, 30 00:01:54,010 --> 00:01:57,320 that will actually get the contents from the file and do something with it. 31 00:01:57,320 --> 00:02:00,380 Then, there is a corresponding close operation 32 00:02:00,380 --> 00:02:04,010 that lets Python know that we're done working with this file object, 33 00:02:04,010 --> 00:02:07,205 and it's okay to stop keeping track of it. 34 00:02:07,205 --> 00:02:11,945 So, there on line three is fileref.close. 35 00:02:11,945 --> 00:02:14,270 Now, if I run this, 36 00:02:14,270 --> 00:02:15,725 we're actually not going to see 37 00:02:15,725 --> 00:02:21,320 any output because all we've done is open the file and then close it. 38 00:02:21,320 --> 00:02:24,680 We haven't actually read the contents in, 39 00:02:24,680 --> 00:02:26,390 and we certainly haven't printed anything out, 40 00:02:26,390 --> 00:02:29,795 so you're not seeing anything in the output window. 41 00:02:29,795 --> 00:02:33,305 What if we did want to read the contents and print them out? 42 00:02:33,305 --> 00:02:37,670 There are a few different ways of working with file objects. 43 00:02:37,670 --> 00:02:40,580 The first method that we'll use is dot read, 44 00:02:40,580 --> 00:02:46,795 which is going to bring in the entire contents of the file as a single string. 45 00:02:46,795 --> 00:02:50,320 Let me show you what that would look like. 46 00:02:58,910 --> 00:03:03,345 So, we call the dot read method on the fileref object. 
47 00:03:03,345 --> 00:03:04,670 That returns a string, 48 00:03:04,670 --> 00:03:08,270 and I'm assigning that to the variable called contents, 49 00:03:08,270 --> 00:03:10,490 and then I can just print out, 50 00:03:10,490 --> 00:03:14,730 let's print out the first 100 characters of it. 51 00:03:14,750 --> 00:03:18,270 Now, we'll see something in the "Output" window. 52 00:03:18,270 --> 00:03:22,360 So, we're seeing the first 100 characters from the file, 53 00:03:22,360 --> 00:03:26,165 which got us three lines and a little bit of the fourth line. 54 00:03:26,165 --> 00:03:29,560 You'll rarely use this method of reading 55 00:03:29,560 --> 00:03:33,550 the entire contents of the file all at once as a big string, 56 00:03:33,550 --> 00:03:36,670 partly because if you had a really big file, 57 00:03:36,670 --> 00:03:40,655 it would be a problem for your computer to handle all of that in memory all at once. 58 00:03:40,655 --> 00:03:44,830 The only time we're going to use this dot read method is if you wanted to 59 00:03:44,830 --> 00:03:49,900 grab the whole file as a string and pass it to some other function that parses it. 60 00:03:49,900 --> 00:03:52,720 Even then, there will usually be some other function available that will 61 00:03:52,720 --> 00:03:55,385 directly read from the file object a little bit at a time, 62 00:03:55,385 --> 00:03:57,435 and parse its contents. 63 00:03:57,435 --> 00:04:01,400 So, the second method that I'm going to show you is, 64 00:04:01,400 --> 00:04:03,470 instead of reading it all at once, 65 00:04:03,470 --> 00:04:06,215 we have a dot readlines method. 66 00:04:06,215 --> 00:04:09,750 Instead of getting everything as a single string, 67 00:04:10,970 --> 00:04:14,965 it returns a list of strings, 68 00:04:14,965 --> 00:04:18,170 one string for each line in the file. 69 00:04:18,170 --> 00:04:20,210 So, let's print out. 70 00:04:20,210 --> 00:04:23,610 Let's say the first four lines of the file this way. 
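The code typed on screen is not captured in the transcript, so here is a rough sketch of the steps described so far: open, read, close, and then the readlines variant. It assumes a small stand-in file written to a temporary directory, since Runestone's simulated olympics.txt is not available outside the course; the rows in sample are made up for illustration and are not the real data.

```python
import os
import tempfile

# Hypothetical stand-in for the course's simulated file; these rows are invented.
sample = (
    "Name,Sex,Age,Team,Event,Medal\n"
    "Athlete A,M,24,Team X,Swimming,NA\n"
    "Athlete B,F,23,Team Y,Rowing,Gold\n"
    "Athlete C,M,31,Team Z,Fencing,NA\n"
    "Athlete D,F,27,Team X,Judo,Silver\n"
)
path = os.path.join(tempfile.mkdtemp(), "olympics.txt")
with open(path, "w") as f:
    f.write(sample)

# Open the file for reading, pull everything in as one string, then close it.
fileref = open(path, "r")
contents = fileref.read()
fileref.close()
print(contents[:100])  # only the first 100 characters

# readlines returns a list of strings, one per line, each ending in "\n".
fileref = open(path, "r")
lines = fileref.readlines()
fileref.close()
print(lines[:4])  # a list slice, so the output shows brackets and quotes
```

Printing a slice of the list shows the square brackets, the quote marks, and the trailing backslash n on every string.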
71 00:04:24,680 --> 00:04:29,520 I forgot to rename the variable that I'm referring to. 72 00:04:29,520 --> 00:04:35,465 Let's call it lines because I called it lines on line two. 73 00:04:35,465 --> 00:04:39,870 You can see now that we're printing out a list, 74 00:04:40,040 --> 00:04:43,350 and we've got the square brackets. 75 00:04:43,350 --> 00:04:46,799 Inside the list, there are four strings. 76 00:04:46,799 --> 00:04:48,615 Here's the first string. 77 00:04:48,615 --> 00:04:50,650 The second string begins here, 78 00:04:50,650 --> 00:04:53,595 and is ending here, and so on. 79 00:04:53,595 --> 00:04:58,980 Each of the strings you notice is ending with this special backslash n character. 80 00:04:58,980 --> 00:05:01,740 That's the new line character, 81 00:05:01,740 --> 00:05:05,175 because in the file, we have a bunch of lines of text. 82 00:05:05,175 --> 00:05:07,575 So, when we read these lines in, 83 00:05:07,575 --> 00:05:12,070 each of these strings has a backslash n at the end of it. 84 00:05:12,890 --> 00:05:16,875 Instead of just printing out all these lines, 85 00:05:16,875 --> 00:05:21,380 I could maybe get a slightly prettier print out, 86 00:05:21,380 --> 00:05:25,205 if I iterate through them. 87 00:05:25,205 --> 00:05:31,750 So, for line in lines and maybe I'll just take the first four lines again. 88 00:05:31,750 --> 00:05:37,770 First five lines, and I'm going to print the individual line. 89 00:05:40,450 --> 00:05:42,695 So, now, when I run it, 90 00:05:42,695 --> 00:05:45,770 it's going to iterate through these four lines, 91 00:05:45,770 --> 00:05:48,455 and each one of them is going to go on its own line. 92 00:05:48,455 --> 00:05:51,350 We're no longer going to get the square brackets to show 93 00:05:51,350 --> 00:05:54,860 up because we're not printing the whole list. 94 00:05:54,860 --> 00:05:57,005 We're iterating through the individual strings. 
95 00:05:57,005 --> 00:06:03,494 We're also not going to get these quote marks because we're going to pass the strings, 96 00:06:03,494 --> 00:06:04,860 and when we print those out, 97 00:06:04,860 --> 00:06:08,340 we just show their contents in the "Output" window. 98 00:06:08,340 --> 00:06:12,315 So, let's see how that looks when we run it, 99 00:06:12,315 --> 00:06:16,564 and sure enough, we get each of the lines separately. 100 00:06:16,564 --> 00:06:19,730 Now, you might notice something a little strange here, 101 00:06:19,730 --> 00:06:22,685 which is that we get these blank lines. 102 00:06:22,685 --> 00:06:26,330 The reason for that is that each of the strings, 103 00:06:26,330 --> 00:06:29,985 you'll remember, had that newline character at the end, 104 00:06:29,985 --> 00:06:32,670 which means start a new line. 105 00:06:32,670 --> 00:06:38,630 The print function also adds a newline by default, 106 00:06:38,630 --> 00:06:39,860 and so we're getting two of those. 107 00:06:39,860 --> 00:06:42,170 One is starting us on a new line, 108 00:06:42,170 --> 00:06:44,300 and the other one is starting us on a new line again, 109 00:06:44,300 --> 00:06:46,025 so we get a blank line. 110 00:06:46,025 --> 00:06:50,555 What if we didn't want to have that extra blank line? 111 00:06:50,555 --> 00:06:54,560 Well, you've seen the dot strip method before. 112 00:06:54,560 --> 00:07:00,650 I can strip the whitespace from the beginning and end of each of these lines, 113 00:07:00,650 --> 00:07:04,280 so the dot strip method gets rid of any whitespace at the beginning or the end. 114 00:07:04,280 --> 00:07:07,340 Whitespace is the space character, 115 00:07:07,340 --> 00:07:10,210 a tab character, or a newline character. 116 00:07:10,210 --> 00:07:12,210 So, if I call this, now, 117 00:07:12,210 --> 00:07:16,575 I'm going to get the printout that doesn't have the blank lines, 118 00:07:16,575 --> 00:07:21,630 and sure enough, we've got the first four lines from the file. 
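As a hedged illustration of the loop just described, the sketch below iterates over a slice of the list and strips each line before printing. The temporary file and its rows are assumptions standing in for Runestone's simulated olympics.txt, and the name cleaned is introduced only for this example.

```python
import os
import tempfile

# Hypothetical stand-in for the course's simulated file; these rows are invented.
sample = (
    "Name,Sex,Age,Team,Event,Medal\n"
    "Athlete A,M,24,Team X,Swimming,NA\n"
    "Athlete B,F,23,Team Y,Rowing,Gold\n"
    "Athlete C,M,31,Team Z,Fencing,NA\n"
    "Athlete D,F,27,Team X,Judo,Silver\n"
)
path = os.path.join(tempfile.mkdtemp(), "olympics.txt")
with open(path, "w") as f:
    f.write(sample)

fileref = open(path, "r")
lines = fileref.readlines()
fileref.close()

# Each string still ends in "\n" and print adds another newline,
# so this loop shows a blank line after every row.
for line in lines[:4]:
    print(line)

# strip removes leading and trailing whitespace, including the "\n".
cleaned = [line.strip() for line in lines[:4]]
for line in cleaned:
    print(line)
```

The second loop prints the same four rows without the blank lines in between.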
119 00:07:22,370 --> 00:07:25,730 Now, there's a shorter way to iterate over the lines 120 00:07:25,730 --> 00:07:29,195 if all we're going to do is iterate over all of them. 121 00:07:29,195 --> 00:07:32,990 So, let me show you that because it's the more Pythonic way, 122 00:07:32,990 --> 00:07:37,295 rather than reading the entire file into a list. 123 00:07:37,295 --> 00:07:47,160 We can just directly iterate over all of the lines by saying for line in fileref. 124 00:07:47,320 --> 00:07:50,720 So, here, it's a file object. 125 00:07:50,720 --> 00:07:52,030 It's not a list, 126 00:07:52,030 --> 00:07:56,580 but it knows how to be iterated over, and each time we get one more line. 127 00:07:56,580 --> 00:07:59,825 So, this is going to do exactly the same thing that we had before. 128 00:07:59,825 --> 00:08:03,230 Except now, we're going to get all the lines in the file. 129 00:08:04,940 --> 00:08:08,780 So, we can iterate over this file object directly. 130 00:08:08,780 --> 00:08:15,700 We can't do this thing of taking a slice of it like we did with lists. 131 00:08:15,700 --> 00:08:17,505 That gives us an error. 132 00:08:17,505 --> 00:08:19,480 So, a file object supports iteration, 133 00:08:19,480 --> 00:08:21,770 but it does not support taking slices. 134 00:08:21,770 --> 00:08:26,810 So, if we wanted to just do something with the first four lines, 135 00:08:26,810 --> 00:08:29,765 we'd have to use dot readlines, 136 00:08:29,765 --> 00:08:33,650 rather than just iterating over the file object. 137 00:08:33,650 --> 00:08:36,170 If we're prepared to process all the lines, 138 00:08:36,170 --> 00:08:38,570 which is the normal thing that you're going to do with a file, 139 00:08:38,570 --> 00:08:41,785 this is the standard Pythonic idiom. 140 00:08:41,785 --> 00:08:46,050 Now, when should you actually call dot readlines or dot read? 
141 00:08:46,050 --> 00:08:48,780 Well, one reason to call dot readlines is, 142 00:08:48,780 --> 00:08:50,930 if you wanted to take slices. 143 00:08:50,930 --> 00:08:53,600 Another reason might be that you wanted to just get 144 00:08:53,600 --> 00:08:57,085 a count of how many lines are in the file. 145 00:08:57,085 --> 00:09:00,845 So, if I get all of the lines and put them in a variable, 146 00:09:00,845 --> 00:09:06,215 I could now print out the length of lines, 147 00:09:06,215 --> 00:09:09,985 and that would tell me how many lines were in the file. 148 00:09:09,985 --> 00:09:12,845 I'm going to comment out the other two. 149 00:09:12,845 --> 00:09:16,050 Turns out there are 60 lines in the file. 150 00:09:16,050 --> 00:09:22,650 If I wanted to find out how many characters are in the file, 151 00:09:22,990 --> 00:09:28,565 I could read the entire file as one character string, 152 00:09:28,565 --> 00:09:31,770 and then I could ask for its length. 153 00:09:31,900 --> 00:09:34,789 So, except in those special cases, 154 00:09:34,789 --> 00:09:37,610 the more common thing that you're going to want to do is 155 00:09:37,610 --> 00:09:42,470 to just iterate over the file object itself. 156 00:09:42,470 --> 00:09:46,940 We won't use dot read or dot readlines, 157 00:09:46,940 --> 00:09:49,820 instead we'll just iterate over the file object itself. 158 00:09:49,820 --> 00:09:53,845 This is the most common way that you'll be working with files. 159 00:09:53,845 --> 00:09:56,735 So, that's Python code for reading from a file. 160 00:09:56,735 --> 00:09:58,980 We'll see you next time.
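To pull the last part together, here is a minimal sketch of iterating over the file object directly, what happens when you try to slice it, and the two counting cases. As before, the temporary file and its rows are assumptions standing in for Runestone's simulated olympics.txt, so the counts will differ from the 60 lines mentioned above; num_lines and num_chars are names chosen just for this example.

```python
import os
import tempfile

# Hypothetical stand-in for the course's simulated file; these rows are invented.
sample = (
    "Name,Sex,Age,Team,Event,Medal\n"
    "Athlete A,M,24,Team X,Swimming,NA\n"
    "Athlete B,F,23,Team Y,Rowing,Gold\n"
    "Athlete C,M,31,Team Z,Fencing,NA\n"
    "Athlete D,F,27,Team X,Judo,Silver\n"
)
path = os.path.join(tempfile.mkdtemp(), "olympics.txt")
with open(path, "w") as f:
    f.write(sample)

# The usual idiom: loop over the file object itself, one line per iteration.
fileref = open(path, "r")
for line in fileref:
    print(line.strip())
fileref.close()

# A file object supports iteration but not slicing.
fileref = open(path, "r")
try:
    fileref[:4]
except TypeError as err:
    print("Cannot slice a file object:", err)
fileref.close()

# Special case 1: count the lines with readlines.
fileref = open(path, "r")
num_lines = len(fileref.readlines())
fileref.close()

# Special case 2: count the characters with read.
fileref = open(path, "r")
num_chars = len(fileref.read())
fileref.close()

print(num_lines, num_chars)
```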