1 00:00:07,910 --> 00:00:11,560 Welcome back for this way of the programmer segment on 2 00:00:11,560 --> 00:00:14,950 how to approach a big complicated nested data structure. 3 00:00:14,950 --> 00:00:18,385 The answer is to take it one step at a time. 4 00:00:18,385 --> 00:00:21,460 I gave you a little preview of my approach previously which I 5 00:00:21,460 --> 00:00:25,095 call understand, extract, repeat. 6 00:00:25,095 --> 00:00:28,420 To illustrate this, we'll walk through extracting information from 7 00:00:28,420 --> 00:00:32,795 data formatted in a way that's returned by the Twitter API. 8 00:00:32,795 --> 00:00:36,300 This nested dictionary results from querying Twitter, 9 00:00:36,300 --> 00:00:37,895 asking for three tweets, 10 00:00:37,895 --> 00:00:39,955 matching University of Michigan. 11 00:00:39,955 --> 00:00:42,820 As you'll see, it's quite a daunting data structure 12 00:00:42,820 --> 00:00:45,920 even when printed with nice indentation. 13 00:00:46,780 --> 00:00:49,670 Here we'll just take a little tour through it. 14 00:00:49,670 --> 00:00:52,720 You can see just to get the information about three tweets, 15 00:00:52,720 --> 00:00:56,900 there's a lot of stuff and there's a lot of indentation. 16 00:00:56,900 --> 00:00:59,630 How many levels of nesting do we have here? 17 00:00:59,630 --> 00:01:01,710 Maybe five or six. 18 00:01:01,820 --> 00:01:04,800 So, this can be pretty daunting, 19 00:01:04,800 --> 00:01:07,455 but we're going to take it one step at a time. 20 00:01:07,455 --> 00:01:11,070 The mantra is understand, extract, repeat. 21 00:01:11,070 --> 00:01:15,040 So, first we will understand it. 22 00:01:17,030 --> 00:01:22,475 To understand it, you might want to just print out the first few characters of it. 
23 00:01:22,475 --> 00:01:26,480 I'm using json.dumps to 24 00:01:26,480 --> 00:01:30,290 pretty print it and I'm printing only the first 100 characters here, 25 00:01:30,290 --> 00:01:32,790 so that it doesn't take up too much space. 26 00:01:32,790 --> 00:01:35,420 We can see at least from this that it's a dictionary 27 00:01:35,420 --> 00:01:39,050 and that one of its keys is called search_metadata. 28 00:01:39,050 --> 00:01:42,455 If you don't want to depend on looking at it and seeing the curly brace, 29 00:01:42,455 --> 00:01:48,900 here I've printed out the type of res and it is a dictionary. 30 00:01:50,000 --> 00:01:52,900 Whenever I have a dictionary, 31 00:01:52,900 --> 00:01:55,850 the first thing to do in order to understand that 32 00:01:55,850 --> 00:01:59,200 dictionary is to ask what the keys are. 33 00:01:59,200 --> 00:02:01,715 In this case there are only two keys. 34 00:02:01,715 --> 00:02:05,840 One is search_metadata and the other is statuses. 35 00:02:05,840 --> 00:02:09,695 When I printed out the first 100 characters you could see search_metadata, 36 00:02:09,695 --> 00:02:13,175 but you couldn't see statuses because it was way more than 37 00:02:13,175 --> 00:02:17,100 100 characters in before that second key was going to show up, 38 00:02:17,100 --> 00:02:19,130 because it was first going to show all of 39 00:02:19,130 --> 00:02:23,715 the nested data that's part of the value of the search_metadata key. 40 00:02:23,715 --> 00:02:26,885 Once we've done our first level of understanding, 41 00:02:26,885 --> 00:02:30,035 we're going to descend one level; we will extract. 42 00:02:30,035 --> 00:02:34,220 But I want to show you another tool that can be pretty helpful 43 00:02:34,220 --> 00:02:38,240 for getting an overview of what the whole data structure is going to be, 44 00:02:38,240 --> 00:02:42,890 which is an outline view of the whole nested data structure. 
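The first "understand" step described above can be sketched roughly as follows. The dictionary here is a tiny made-up stand-in for the real response (which runs to hundreds of lines), and its field values are placeholders, not real API data.

```python
import json

# Tiny stand-in for the API response; values are placeholders for illustration.
res = {
    "search_metadata": {"count": 3},
    "statuses": [
        {"created_at": "placeholder time 1", "user": {"screen_name": "31brooks"}},
        {"created_at": "placeholder time 2", "user": {"screen_name": "froyoho"}},
        {"created_at": "placeholder time 3", "user": {"screen_name": "MDuncan"}},
    ],
}

# Understand: look at only the first 100 characters of the pretty-printed form.
preview = json.dumps(res, indent=2)[:100]
print(preview)

# Understand: confirm the type rather than relying on spotting a curly brace.
print(type(res))

# Understand: for a dictionary, the first question is always "what are the keys?"
keys = list(res.keys())
print(keys)
```

With the real response, the preview would still show only search_metadata, because statuses appears much further into the string.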
45 00:02:42,890 --> 00:02:46,570 If I dump this data into JSON format, 46 00:02:46,570 --> 00:02:51,860 I can copy it to an external site that lets me look at it in outline view. 47 00:02:51,860 --> 00:02:55,220 So, I'm going to get rid of these other print statements and I'm 48 00:02:55,220 --> 00:02:59,460 just going to print out the whole thing in JSON format. 49 00:03:01,190 --> 00:03:04,890 Instead of printing out just the first 100 characters, 50 00:03:04,890 --> 00:03:07,780 I'm going to print out the whole thing. 51 00:03:08,960 --> 00:03:19,240 Now I'm going to copy the whole contents, which is quite long. 52 00:03:25,310 --> 00:03:31,300 This is all data just for three tweets, believe it or not. 53 00:03:42,230 --> 00:03:48,310 Finally, I've got it all. I'm going to copy it. 54 00:03:48,410 --> 00:03:53,200 We're now visiting a site called jsoneditoronline.org 55 00:03:53,200 --> 00:03:59,125 and on the left side they let you paste in a string that's in JSON format. 56 00:03:59,125 --> 00:04:03,500 I've copied that big long string and I'm now pasting it, 57 00:04:04,980 --> 00:04:10,600 all 600 and some lines of it, and I can click on 58 00:04:10,600 --> 00:04:15,625 this arrow here and it's now going to give me an outline view of it. 59 00:04:15,625 --> 00:04:19,265 Remember there were two keys, search_metadata and statuses. 60 00:04:19,265 --> 00:04:24,290 It's telling me that search_metadata is a dictionary with nine keys. 61 00:04:24,290 --> 00:04:29,585 They include count, completed_in, max_id_str, and so on. 62 00:04:29,585 --> 00:04:32,570 Under statuses, it shows square brackets around a three, which is 63 00:04:32,570 --> 00:04:36,330 telling me that it's a list with three items in it. 64 00:04:36,330 --> 00:04:46,920 I can see that the first element in that list is a dictionary with 24 keys in it, 65 00:04:47,110 --> 00:04:50,100 and I can keep descending down there. 
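If you'd rather not paste data into an external site, a similar outline can be approximated in a few lines of code. This is only a sketch: `outline` and `describe` are hypothetical helper names invented here, the `{n}` and `[n]` size markers merely imitate the kind of summary described above, and the sample dictionary is a made-up stand-in.

```python
def describe(value):
    """Summarize a value: {n} for a dict with n keys, [n] for a list of n items."""
    if isinstance(value, dict):
        return "{" + str(len(value)) + "}"
    if isinstance(value, list):
        return "[" + str(len(value)) + "]"
    return "= " + repr(value)


def outline(value, indent=0):
    """Return one outline line per key or list index, indented by nesting depth."""
    lines = []
    pad = "  " * indent
    if isinstance(value, dict):
        for key, child in value.items():
            lines.append(f"{pad}{key} {describe(child)}")
            lines.extend(outline(child, indent + 1))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            lines.append(f"{pad}{index} {describe(child)}")
            lines.extend(outline(child, indent + 1))
    return lines


# Made-up stand-in data, much smaller than a real response.
res = {
    "search_metadata": {"count": 3},
    "statuses": [{"user": {"screen_name": "31brooks"}}],
}

lines = outline(res)
print("\n".join(lines))
```

Reading the output top to bottom shows how many levels you would have to descend to reach screen_name.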
66 00:04:50,100 --> 00:04:55,285 My goal is actually going to be to get the authors of the tweets. 67 00:04:55,285 --> 00:04:56,740 So, I'm going to guess that that's in 68 00:04:56,740 --> 00:05:03,890 the user key and sure enough there's a screen_name or a name. 69 00:05:03,920 --> 00:05:07,270 So, if I'm very careful about this, 70 00:05:07,270 --> 00:05:10,240 I could just write a very complicated expression to go 71 00:05:10,240 --> 00:05:16,370 in four levels of nesting and grab a screen_name of 31brooks. 72 00:05:16,370 --> 00:05:19,090 But I'm going to do this one step at a time. 73 00:05:19,090 --> 00:05:25,555 I've gotten myself oriented and I might come back here to help me stay oriented. 74 00:05:25,555 --> 00:05:28,765 But then I'm going to work with the code and I'm going to build up 75 00:05:28,765 --> 00:05:33,190 my code one step at a time where at each step I understand what I 76 00:05:33,190 --> 00:05:36,580 have at the current level of nesting and then I 77 00:05:36,580 --> 00:05:42,020 extract to get something one more level of nesting in. 78 00:05:42,990 --> 00:05:46,260 Let's go back to our code. 79 00:05:46,260 --> 00:05:53,060 I figured out that there were two keys and that the information that I wanted to extract, 80 00:05:53,060 --> 00:05:56,605 the author names were in the second of those. 81 00:05:56,605 --> 00:05:59,270 They were in the value associated with the statuses. 82 00:05:59,270 --> 00:06:08,030 So, I've just done something to extract the value associated with the statuses key. 83 00:06:08,030 --> 00:06:11,085 This is my first extract. 84 00:06:11,085 --> 00:06:15,480 I did an understand and then an extract and then I'm going to repeat. 85 00:06:15,760 --> 00:06:20,120 I have extracted something at level two and now I 86 00:06:20,120 --> 00:06:24,430 want to print some stuff out to help me understand what I've got. 87 00:06:24,430 --> 00:06:30,455 So, I'm printing out the word level two and checking what is the type? 
It's a list. 88 00:06:30,455 --> 00:06:33,290 Well, if it's a list, the first thing that I always want to do 89 00:06:33,290 --> 00:06:36,520 is check its length and it has three items. 90 00:06:36,520 --> 00:06:40,790 Now, from what we looked at in JSON Editor Online, 91 00:06:40,790 --> 00:06:43,475 we also would have gotten that same information. 92 00:06:43,475 --> 00:06:48,580 You can tell that we've descended to that level of the data. 93 00:06:48,580 --> 00:06:52,640 Since I know that this was a query that returned three tweets, 94 00:06:52,640 --> 00:06:55,505 I'm going to guess that each of the items in this list 95 00:06:55,505 --> 00:06:59,405 is representing one tweet from Twitter. 96 00:06:59,405 --> 00:07:03,470 I've now understood at this level it's a list with 97 00:07:03,470 --> 00:07:07,400 three items and so I'm now ready to extract. 98 00:07:07,400 --> 00:07:12,545 I could either extract a single item or I could iterate through all the items. 99 00:07:12,545 --> 00:07:18,190 What I'm going to want to do is extract all of the authors. So, I'm going to iterate. 100 00:07:18,260 --> 00:07:23,025 For each thing in res2, 101 00:07:23,025 --> 00:07:25,275 I'm going to do something with it. 102 00:07:25,275 --> 00:07:27,375 So, that's why I'm iterating 103 00:07:27,375 --> 00:07:29,325 here on line eight. 104 00:07:29,325 --> 00:07:32,950 But as I'm developing my code, 105 00:07:32,950 --> 00:07:36,230 I don't want to have lots of stuff showing up in the output window. 106 00:07:36,230 --> 00:07:40,295 So, I actually want to deal initially with just one item at a time. 107 00:07:40,295 --> 00:07:43,570 That's why I've done what I've done here on line eight. 108 00:07:43,570 --> 00:07:48,185 I'm building up a template for my code that I'm going to iterate, but in fact, 109 00:07:48,185 --> 00:07:54,875 I'm only taking a slice containing one item from the result at level two. 
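The level-two extract, and the loop template over a one-item slice, might look something like this. The data is a made-up stand-in with placeholder values, and the `visited` list is an addition for illustration only, so you can see how many times the loop body ran.

```python
# Made-up stand-in for the response; values are placeholders.
res = {
    "search_metadata": {"count": 3},
    "statuses": [
        {"created_at": "placeholder time 1", "user": {"screen_name": "31brooks"}},
        {"created_at": "placeholder time 2", "user": {"screen_name": "froyoho"}},
        {"created_at": "placeholder time 3", "user": {"screen_name": "MDuncan"}},
    ],
}

# Extract: descend one level by taking the value of the statuses key.
res2 = res["statuses"]

# Understand: what did we get at level two?
print("--- Level 2 ---")
print(type(res2))  # a list
print(len(res2))   # for a list, check the length first

# Extract: the loop is written for all items, but the slice [:1] limits it
# to the first item while the code is still being developed.
visited = []
for res3 in res2[:1]:
    print("--- Level 3: one tweet ---")
    visited.append(res3)
```

Dropping the `[:1]` later turns the same loop into one that handles every item.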
110 00:07:54,875 --> 00:08:00,220 So, I'm only going to execute lines 9 and 10 one time. 111 00:08:05,360 --> 00:08:16,040 At level three, we're printing out that we've got some information about a tweet and 112 00:08:16,040 --> 00:08:20,690 then I'm just getting the first 30 characters of whatever the thing 113 00:08:20,690 --> 00:08:25,940 is I dumped it to be JSON and I've asked for the first 30 characters, 114 00:08:25,940 --> 00:08:28,290 it looks like it's a dictionary. 115 00:08:29,510 --> 00:08:32,520 So, to help in the understand phase, 116 00:08:32,520 --> 00:08:41,370 I'm printing out the type of res3 and its type is that it's a dictionary. 117 00:08:41,370 --> 00:08:43,575 Because it's a dictionary, 118 00:08:43,575 --> 00:08:45,315 I'd like to print out the keys. 119 00:08:45,315 --> 00:08:51,935 In this case, there's quite a few keys and I got to look through these and try to guess 120 00:08:51,935 --> 00:08:59,760 which of these is going to have the author of the tweet and it's the user key. 121 00:08:59,980 --> 00:09:04,590 Therefore, I'm going to extract the user key. 122 00:09:06,970 --> 00:09:11,870 I'm going to extract the user key from 123 00:09:11,870 --> 00:09:17,110 the res3 dictionary and that's going to be my level four result. 124 00:09:21,870 --> 00:09:26,740 We've extracted into the variable res4 and now 125 00:09:26,740 --> 00:09:31,820 it's time to make a few print statements to figure out what's in res4. 126 00:09:33,360 --> 00:09:37,585 Turns out that what's in res4 is 127 00:09:37,585 --> 00:09:45,180 a dictionary and it 128 00:09:45,180 --> 00:09:48,875 has even more keys, 129 00:09:48,875 --> 00:09:54,745 but we're going to have to look in here and decide maybe we want to extract screen_name. 130 00:09:54,745 --> 00:10:02,960 Our next extract operation is going to take screen_name from res4. 131 00:10:09,470 --> 00:10:14,060 Here I've got res4 and I'm extracting the screen_name. 
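Levels three and four follow the same rhythm of understand, then extract. A rough sketch, again on made-up stand-in data with placeholder values (a real tweet dictionary has many more keys than shown here):

```python
import json

# Made-up stand-in for the response; values are placeholders.
res = {
    "search_metadata": {"count": 3},
    "statuses": [
        {"created_at": "placeholder time 1", "user": {"screen_name": "31brooks"}},
        {"created_at": "placeholder time 2", "user": {"screen_name": "froyoho"}},
        {"created_at": "placeholder time 3", "user": {"screen_name": "MDuncan"}},
    ],
}

res2 = res["statuses"]

for res3 in res2[:1]:  # still only the first tweet while developing
    # Understand level three: peek at the first 30 characters, then type and keys.
    print("--- Level 3: a tweet ---")
    print(json.dumps(res3, indent=2)[:30])
    print(type(res3))
    print(list(res3.keys()))

    # Extract: the author information is under the user key.
    res4 = res3["user"]

    # Understand level four: type and keys again.
    print("--- Level 4: the user ---")
    print(type(res4))
    print(list(res4.keys()))

    # Extract: take the screen_name from the user dictionary.
    screen_name = res4["screen_name"]
    print(screen_name)
```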
132 00:10:14,060 --> 00:10:18,545 I also decided to just find out the time when the tweet was created. 133 00:10:18,545 --> 00:10:21,230 I've commented out a bunch of 134 00:10:21,230 --> 00:10:26,520 the other print statements so that we're not going to have quite as busy of a printout. 135 00:10:28,570 --> 00:10:31,310 Now we are printing 136 00:10:31,310 --> 00:10:41,555 out 31brooks and the time at which that Tweet was created. 137 00:10:41,555 --> 00:10:45,940 So, at this point if I get rid of some of the print statements that 138 00:10:45,940 --> 00:10:50,345 are commented out and are distracting in the code, 139 00:10:50,345 --> 00:10:53,770 I would start to have something that's reasonably compact that prints 140 00:10:53,770 --> 00:10:58,870 out the screen_name and the creation time for the first tweet. 141 00:10:58,870 --> 00:11:03,325 My next step once I've got enough of 142 00:11:03,325 --> 00:11:07,480 the code to do this is that now I'm ready to generalize and say, 143 00:11:07,480 --> 00:11:11,140 "Hey, I don't want just the one item, 144 00:11:11,140 --> 00:11:14,095 I'd really like to get all of them." 145 00:11:14,095 --> 00:11:21,940 So, here the change is that instead of only getting one item from res2, 146 00:11:21,940 --> 00:11:24,140 we're going to get all of them. 147 00:11:27,390 --> 00:11:30,300 Then I get 31brooks, 148 00:11:30,300 --> 00:11:34,680 but I also get froyoho and MDuncan, 149 00:11:34,680 --> 00:11:37,970 because there were three different tweets. 150 00:11:38,630 --> 00:11:42,105 Once I've gotten this far, 151 00:11:42,105 --> 00:11:46,180 now I really have built up my code one step at a time, 152 00:11:46,180 --> 00:11:48,700 I can now simplify the code. 153 00:11:48,700 --> 00:11:52,300 We could actually do something as simple as this, 154 00:11:52,300 --> 00:11:59,210 where we have combined some things into more complex expressions. 
155 00:12:01,920 --> 00:12:07,444 So, instead of making a new variable name for each level that we descend, 156 00:12:07,444 --> 00:12:13,085 we're just going to say res[statuses] and we're going to iterate through that. 157 00:12:13,085 --> 00:12:16,715 Our loop variable becomes res3 and 158 00:12:16,715 --> 00:12:21,730 each of those res3 values is still a pretty complicated dictionary, 159 00:12:21,730 --> 00:12:26,985 but we can do res3[user][screen_name] to get the screen_name. 160 00:12:26,985 --> 00:12:32,510 And we can do [user][created_at] to get the time that it was published. 161 00:12:32,510 --> 00:12:37,630 I can run this and I get the same results as the more complicated code. 162 00:12:37,630 --> 00:12:40,720 You could try just writing 163 00:12:40,720 --> 00:12:45,050 these two lines of code like this from the very beginning, 164 00:12:45,050 --> 00:12:47,530 but if it doesn't work out, it would be really hard to debug. 165 00:12:47,530 --> 00:12:51,070 So, I really recommend building it up one step at a time, 166 00:12:51,070 --> 00:12:55,170 where you just descend one level at a time into the data. 167 00:12:56,410 --> 00:12:59,780 In summary, my suggestion is that if you need to extract 168 00:12:59,780 --> 00:13:02,679 something from a complicated, deeply nested structure, 169 00:13:02,679 --> 00:13:05,400 develop your code one layer at a time. 170 00:13:05,400 --> 00:13:07,380 At each step, print out what you have, 171 00:13:07,380 --> 00:13:09,380 through the keys of the dictionary or 172 00:13:09,380 --> 00:13:13,340 the first few characters in the printed representation of the first item. 173 00:13:13,340 --> 00:13:15,640 Then extract a little more. 174 00:13:15,640 --> 00:13:20,240 At the end you can remove all the print statements and collapse some of the code into 175 00:13:20,240 --> 00:13:23,360 more complex expressions that give you something compact like we 176 00:13:23,360 --> 00:13:28,110 see on the screen now. See you next time.
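The collapsed version described above might look roughly like the following. It runs on a made-up stand-in with placeholder values, and it assumes the tweet's creation time sits directly on each tweet dictionary under created_at; check your own keys() output and adjust the key path if your data is shaped differently. The `authors` list is an extra added here to show the same chained lookup in a comprehension.

```python
# Made-up stand-in for the response; values are placeholders.
res = {
    "search_metadata": {"count": 3},
    "statuses": [
        {"created_at": "placeholder time 1", "user": {"screen_name": "31brooks"}},
        {"created_at": "placeholder time 2", "user": {"screen_name": "froyoho"}},
        {"created_at": "placeholder time 3", "user": {"screen_name": "MDuncan"}},
    ],
}

# Two lines replace the level-by-level variables: iterate over all statuses
# and chain the key lookups that were discovered one step at a time.
for res3 in res["statuses"]:
    print(res3["user"]["screen_name"], res3["created_at"])

# The same chained lookup, collected into a list of author names.
authors = [tweet["user"]["screen_name"] for tweet in res["statuses"]]
print(authors)
```

If one of the chained lookups raises a KeyError, going back to the step-by-step version with prints at each level is usually the quickest way to find which level was misread.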