1 00:00:07,940 --> 00:00:10,380 Let's try a new example and 2 00:00:10,380 --> 00:00:12,420 bring together some of the things that we've learned. 3 00:00:12,420 --> 00:00:14,220 Here's an image of a storefront. 4 00:00:14,220 --> 00:00:15,510 Let's load it, and try and get 5 00:00:15,510 --> 00:00:17,430 the name of the store out of that image. 6 00:00:17,430 --> 00:00:20,250 So from PIL, we'll need the Image module of course, 7 00:00:20,250 --> 00:00:22,980 and then let's bring in pytesseract as well. 8 00:00:22,980 --> 00:00:25,275 So let's read in the storefront image 9 00:00:25,275 --> 00:00:27,645 I've loaded into the course and display it. 10 00:00:27,645 --> 00:00:31,395 So I put this in read_only/Storefront.jpg, 11 00:00:31,395 --> 00:00:33,210 and we'll just open that as an image, 12 00:00:33,210 --> 00:00:34,815 and display it inline. 13 00:00:34,815 --> 00:00:36,570 Then finally, let's try and run 14 00:00:36,570 --> 00:00:39,225 Tesseract on that image and see what the results are. 15 00:00:39,225 --> 00:00:42,220 So we'll call image_to_string on that. 16 00:00:46,490 --> 00:00:48,640 We see at the very bottom that 17 00:00:48,640 --> 00:00:50,140 there's just an empty string. 18 00:00:50,140 --> 00:00:51,730 Tesseract is unable to 19 00:00:51,730 --> 00:00:53,765 take this image and pull out the name. 20 00:00:53,765 --> 00:00:55,300 But we looked at how to crop an image 21 00:00:55,300 --> 00:00:56,560 in the last set of lectures. 22 00:00:56,560 --> 00:00:58,330 So let's try and help Tesseract by 23 00:00:58,330 --> 00:01:00,640 cropping out certain pieces. 24 00:01:00,640 --> 00:01:03,520 So first, we have to set the bounding box. 25 00:01:03,520 --> 00:01:05,620 In this image, the store name is in 26 00:01:05,620 --> 00:01:08,930 a box bounded by roughly 315, 27 00:01:08,930 --> 00:01:12,365 170, 700, and 270. 28 00:01:12,365 --> 00:01:15,340 So I'll make a bounding box equal to this tuple.
29 00:01:15,340 --> 00:01:18,040 Remember that's the upper left corner, 30 00:01:18,040 --> 00:01:19,810 and then the lower right corner, 31 00:01:19,810 --> 00:01:21,820 and you can go back to the PIL lecture if you 32 00:01:21,820 --> 00:01:24,455 want to be reminded how to do this. 33 00:01:24,455 --> 00:01:26,855 Now, let's crop the image. 34 00:01:26,855 --> 00:01:28,775 So we just call image.crop, 35 00:01:28,775 --> 00:01:30,425 and we pass in the bounding box. 36 00:01:30,425 --> 00:01:31,850 It doesn't change the image, 37 00:01:31,850 --> 00:01:33,080 it returns a new image. 38 00:01:33,080 --> 00:01:34,520 So we save this to this title_image 39 00:01:34,520 --> 00:01:37,445 variable that we'll use later. 40 00:01:37,445 --> 00:01:41,150 Now, let's display it and pull out the text. 41 00:01:41,150 --> 00:01:42,620 So we'll call display, 42 00:01:42,620 --> 00:01:45,590 and then we'll call pytesseract's image_to_string, 43 00:01:45,590 --> 00:01:48,545 and pass in the title image. 44 00:01:48,545 --> 00:01:52,340 Great. So we see how with a bit of problem reduction, 45 00:01:52,340 --> 00:01:53,690 we can make that work. 46 00:01:53,690 --> 00:01:55,975 So now we've been able to take an image, 47 00:01:55,975 --> 00:01:58,630 pre-process it where we expect to see text, 48 00:01:58,630 --> 00:02:00,170 and turn that text into 49 00:02:00,170 --> 00:02:02,990 a string that Python can understand. 50 00:02:02,990 --> 00:02:05,600 If you look back up at the image though, 51 00:02:05,600 --> 00:02:08,780 you'll see that there's a small sign inside of the shop 52 00:02:08,780 --> 00:02:11,330 that also has the shop name on it. 53 00:02:11,330 --> 00:02:13,130 I wonder if we are able to recognize 54 00:02:13,130 --> 00:02:14,675 the text on that sign. 55 00:02:14,675 --> 00:02:16,910 Let's give it a try. First, we 56 00:02:16,910 --> 00:02:19,445 need to determine a bounding box for that sign.
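The crop-then-OCR step described above can be sketched roughly as follows. This is a minimal sketch, not the notebook's exact code: `box_size` and `ocr_region` are hypothetical helper names introduced here for illustration, and the sketch assumes Pillow and pytesseract are installed when `ocr_region` is actually called.

```python
from typing import Tuple

# PIL's crop expects (left, upper, right, lower) in pixels:
# the upper left corner followed by the lower right corner.
Box = Tuple[int, int, int, int]


def box_size(box: Box) -> Tuple[int, int]:
    """Return the (width, height) of a bounding box (hypothetical helper)."""
    left, upper, right, lower = box
    return (right - left, lower - upper)


def ocr_region(image, box: Box) -> str:
    """Crop `box` out of a PIL image and run Tesseract on the crop only."""
    import pytesseract  # imported lazily so box_size works without it

    title_image = image.crop(box)  # returns a new image; `image` is unchanged
    return pytesseract.image_to_string(title_image)


# Usage, with the path and box mentioned in the lecture:
#   from PIL import Image
#   image = Image.open("read_only/Storefront.jpg")
#   print(box_size((315, 170, 700, 270)))
#   print(ocr_region(image, (315, 170, 700, 270)))
```

Checking the box size first is a cheap way to catch swapped coordinates, since a negative width or height means the corners are in the wrong order.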
57 00:02:19,445 --> 00:02:21,275 I'm going to show you a shortcut to make this 58 00:02:21,275 --> 00:02:23,915 easier in an optional video in this module. 59 00:02:23,915 --> 00:02:25,190 But for now, let's just use 60 00:02:25,190 --> 00:02:27,440 the bounding box that I decided on. 61 00:02:27,440 --> 00:02:29,825 The bounding box, we'll set this to a tuple of 62 00:02:29,825 --> 00:02:32,840 900 by 420 for the upper left, 63 00:02:32,840 --> 00:02:36,695 and then 940 by 445 for the lower right. 64 00:02:36,695 --> 00:02:39,005 Now, let's crop the image. 65 00:02:39,005 --> 00:02:40,760 So we just call image.crop, 66 00:02:40,760 --> 00:02:42,290 pass in the bounding box, 67 00:02:42,290 --> 00:02:43,700 and we'll call this little_sign for 68 00:02:43,700 --> 00:02:46,470 fun and display that little sign. 69 00:02:46,470 --> 00:02:49,275 All right. This is a little sign. 70 00:02:49,275 --> 00:02:52,060 OCR works better with higher resolution images, 71 00:02:52,060 --> 00:02:54,110 so let's increase the size of this image 72 00:02:54,110 --> 00:02:56,375 by using the Pillow resize function. 73 00:02:56,375 --> 00:02:58,580 Let's set the width and the height equal 74 00:02:58,580 --> 00:03:00,755 to 10 times the size it is now, 75 00:03:00,755 --> 00:03:03,260 in a (w, h) tuple. 76 00:03:03,260 --> 00:03:05,345 So we'll take new_size, 77 00:03:05,345 --> 00:03:06,695 and we'll make it equal to 78 00:03:06,695 --> 00:03:08,495 little_sign.width times 10, 79 00:03:08,495 --> 00:03:11,565 and little_sign.height times 10. 80 00:03:11,565 --> 00:03:15,160 Now, let's check the docs for resize. 81 00:03:16,960 --> 00:03:19,730 We can see here that there's a number of 82 00:03:19,730 --> 00:03:22,325 different filters for resizing the image. 83 00:03:22,325 --> 00:03:23,960 The default is 84 00:03:23,960 --> 00:03:26,810 Image.NEAREST. Let's see what that looks like.
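Here is a small sketch of the size calculation and the nearest-neighbor resize just described. The helper names `scaled_size` and `enlarge_nearest` are made up for this example. The filter is passed explicitly because, as far as I know, newer Pillow releases default to a bicubic filter for resize rather than the nearest-neighbor default shown in the docs in the lecture.

```python
def scaled_size(width, height, factor=10):
    """Return the (w, h) tuple for an image enlarged by `factor`."""
    return (width * factor, height * factor)


def enlarge_nearest(little_sign, factor=10):
    """Resize a PIL image by `factor` using the nearest-neighbor filter."""
    from PIL import Image  # imported lazily; only needed for the constant

    new_size = scaled_size(little_sign.width, little_sign.height, factor)
    # resize returns a new image and leaves little_sign untouched
    return little_sign.resize(new_size, Image.NEAREST)


# Usage, assuming `image` is the storefront image opened with PIL:
#   little_sign = image.crop((900, 420, 940, 445))
#   display(enlarge_nearest(little_sign))
```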
85 00:03:26,810 --> 00:03:29,540 So we'll take our little_sign.resize, 86 00:03:29,540 --> 00:03:31,850 we'll pass in the new bounding box size. 87 00:03:31,850 --> 00:03:33,050 So that's new_size, 88 00:03:33,050 --> 00:03:35,250 and then we'll say Image.NEAREST 89 00:03:35,250 --> 00:03:38,335 all in caps and pass that to display. 90 00:03:38,335 --> 00:03:42,230 So here you can see that it actually resized the image, 91 00:03:42,230 --> 00:03:44,150 and now it's maybe much more readable. 92 00:03:44,150 --> 00:03:44,480 I don't know. 93 00:03:44,480 --> 00:03:46,340 I didn't have trouble seeing it before, 94 00:03:46,340 --> 00:03:47,525 although it was little, 95 00:03:47,525 --> 00:03:49,550 and it says the word "fossil". 96 00:03:49,550 --> 00:03:51,290 I think we should be able to find 97 00:03:51,290 --> 00:03:52,400 something better though. 98 00:03:52,400 --> 00:03:55,270 I can read this, but it looks really pixelated. 99 00:03:55,270 --> 00:03:56,690 Let's see what all the different 100 00:03:56,690 --> 00:03:58,580 resize options look like. 101 00:03:58,580 --> 00:04:00,260 You can go back up to 102 00:04:00,260 --> 00:04:03,680 the documentation to look at the names. 103 00:04:03,680 --> 00:04:05,090 So here I'm going to make 104 00:04:05,090 --> 00:04:07,340 just a list of all the different names as options. 105 00:04:07,340 --> 00:04:12,035 So Image.NEAREST, Image.BOX, Image.BILINEAR, 106 00:04:12,035 --> 00:04:15,440 Image.HAMMING, Image.BICUBIC, 107 00:04:15,440 --> 00:04:20,210 and Image.LANCZOS is how you say that. 108 00:04:20,210 --> 00:04:22,250 So for each of the options, 109 00:04:22,250 --> 00:04:23,570 I'm just going to iterate over these. 110 00:04:23,570 --> 00:04:25,370 Let's print out the option name. 111 00:04:25,370 --> 00:04:27,560 So print out whatever the option name is, 112 00:04:27,560 --> 00:04:28,880 and then let's display what 113 00:04:28,880 --> 00:04:31,340 this option looks like on our little sign.
114 00:04:31,340 --> 00:04:34,080 So here we're actually going to call little_sign.resize, 115 00:04:34,080 --> 00:04:35,375 pass in the new size, 116 00:04:35,375 --> 00:04:37,220 pass in the option that we're looking at, 117 00:04:37,220 --> 00:04:39,300 and pass that to display. 118 00:04:39,460 --> 00:04:42,170 So you can see that this has run, 119 00:04:42,170 --> 00:04:44,210 and we have a whole bunch 120 00:04:44,210 --> 00:04:46,190 of different numbers printed, 121 00:04:46,190 --> 00:04:50,580 and then different images that are interesting. 122 00:04:50,930 --> 00:04:53,825 So from this, we can notice two things. 123 00:04:53,825 --> 00:04:56,960 First, when we print out one of the re-sampling values, 124 00:04:56,960 --> 00:04:58,910 it actually just prints an integer. 125 00:04:58,910 --> 00:05:00,980 This is actually really common, that 126 00:05:00,980 --> 00:05:02,450 the API developer writes 127 00:05:02,450 --> 00:05:05,260 a property such as Image.BICUBIC, 128 00:05:05,260 --> 00:05:06,620 and then assigns it to 129 00:05:06,620 --> 00:05:09,065 an integer value to pass it around. 130 00:05:09,065 --> 00:05:11,480 Some languages use enumerations of 131 00:05:11,480 --> 00:05:13,970 values, which is common in, say, Java. 132 00:05:13,970 --> 00:05:15,230 But in Python, this is 133 00:05:15,230 --> 00:05:17,555 a pretty normal way of doing things. 134 00:05:17,555 --> 00:05:20,390 The second thing we learned is that there's a number of 135 00:05:20,390 --> 00:05:23,150 different algorithms for the image re-sampling. 136 00:05:23,150 --> 00:05:25,280 In this case, the Image.LANCZOS and 137 00:05:25,280 --> 00:05:28,505 Image.BICUBIC filters do a good job. 138 00:05:28,505 --> 00:05:30,335 Everything else, not so much. 139 00:05:30,335 --> 00:05:31,985 So let's see if we are able to recognize 140 00:05:31,985 --> 00:05:34,385 the text off this resized image. 141 00:05:34,385 --> 00:05:37,730 So first, let's resize to the larger size.
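The comparison loop over the resize filters might look something like this. It is a sketch under a few assumptions: the filters are looked up by name on the `Image` module so the name can be printed next to the value, and `resize_with_each_filter` is a hypothetical helper that returns the resized images instead of calling `display` directly.

```python
# The six filter names listed in the lecture.
FILTER_NAMES = ["NEAREST", "BOX", "BILINEAR", "HAMMING", "BICUBIC", "LANCZOS"]


def resize_with_each_filter(little_sign, new_size):
    """Resize `little_sign` once per filter; return {filter name: image}."""
    from PIL import Image  # imported lazily; only needed for the constants

    results = {}
    for name in FILTER_NAMES:
        option = getattr(Image, name)
        # Each constant behaves like a plain integer, which is what
        # shows up when you print the option on its own.
        print(name, int(option))
        results[name] = little_sign.resize(new_size, option)
    return results


# Usage in a notebook:
#   for name, resized in resize_with_each_filter(little_sign, new_size).items():
#       print(name)
#       display(resized)
```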
142 00:05:37,730 --> 00:05:39,875 So I'm going to create something called bigger_sign, 143 00:05:39,875 --> 00:05:41,870 and I'm going to take little_sign.resize, 144 00:05:41,870 --> 00:05:44,655 I'm going to pass in our new size that we want. 145 00:05:44,655 --> 00:05:45,990 Then I'm going to use 146 00:05:45,990 --> 00:05:49,560 Image.BICUBIC for lack of any personal preference. 147 00:05:49,560 --> 00:05:52,490 Feel free to try one of the different methods. 148 00:05:52,490 --> 00:05:54,215 Then let's print out the text. 149 00:05:54,215 --> 00:05:56,630 So we'll call pytesseract image_to_string, 150 00:05:56,630 --> 00:05:59,250 and pass in the bigger sign. 151 00:05:59,860 --> 00:06:02,825 Well, not really any text there. 152 00:06:02,825 --> 00:06:04,550 Let's try and binarize this. 153 00:06:04,550 --> 00:06:06,320 So first, let me just bring in 154 00:06:06,320 --> 00:06:09,930 the binarization code we did earlier. 155 00:06:11,960 --> 00:06:14,700 Now, let's apply binarization 156 00:06:14,700 --> 00:06:17,265 with, say, a threshold of 190, 157 00:06:17,265 --> 00:06:21,185 and try to display that as well as do the OCR work. 158 00:06:21,185 --> 00:06:24,180 So binarize, remember this function takes in 159 00:06:24,180 --> 00:06:27,635 the sign, or the image I guess, that we want to binarize, 160 00:06:27,635 --> 00:06:30,830 and then a value between zero and 255. 161 00:06:30,830 --> 00:06:32,990 It's going to walk through the image pixel by pixel 162 00:06:32,990 --> 00:06:35,420 and set each one to either zero or one. 163 00:06:35,420 --> 00:06:38,080 So it changes it to straight-up black and white. 164 00:06:38,080 --> 00:06:40,930 Then we'll display what the binarized sign looks like, 165 00:06:40,930 --> 00:06:42,740 and then let's actually try and 166 00:06:42,740 --> 00:06:44,975 get the text out with pytesseract too, 167 00:06:44,975 --> 00:06:46,490 in the hopes that 190 is 168 00:06:46,490 --> 00:06:49,650 actually a good number for us to use.
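The binarization code from the earlier lecture isn't repeated here, so the following is only a stand-in sketch of the same idea, not that code. Two choices in it are assumptions made for this example: it uses Pillow's `point` method instead of an explicit pixel-by-pixel loop, and it maps pixels to 0 and 255 so the result shows up as black and white in an 8-bit grayscale image.

```python
def threshold_pixel(value, threshold):
    """Map one grayscale value (0..255) to black (0) or white (255)."""
    return 0 if value < threshold else 255


def binarize(image, threshold):
    """Return a black-and-white copy of a PIL image (sketch, see lead-in)."""
    grayscale = image.convert("L")  # 8-bit grayscale copy of the input
    # point() applies the function to every pixel value in the image
    return grayscale.point(lambda value: threshold_pixel(value, threshold))


# Usage, following the lecture's threshold of 190:
#   binarized_sign = binarize(bigger_sign, 190)
#   display(binarized_sign)
#   print(pytesseract.image_to_string(binarized_sign))
```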
169 00:06:49,900 --> 00:06:53,165 Well, that looks pretty abysmal I would say. 170 00:06:53,165 --> 00:06:55,580 It doesn't look at all like "fossil". 171 00:06:55,580 --> 00:06:58,070 I guess you could see some of the S's there, 172 00:06:58,070 --> 00:07:01,720 but really not much in that image at all. 173 00:07:01,720 --> 00:07:03,620 So the text is pretty useless. 174 00:07:03,620 --> 00:07:06,860 How should we pick the best binarization to use? 175 00:07:06,860 --> 00:07:08,700 There's a number of different methods. 176 00:07:08,700 --> 00:07:10,010 But let's just try something 177 00:07:10,010 --> 00:07:12,500 very simple to show how this can work. 178 00:07:12,500 --> 00:07:14,960 We have an English word that we're trying to detect, 179 00:07:14,960 --> 00:07:19,070 it's "FOSSIL". If we tried all binarizations from 180 00:07:19,070 --> 00:07:21,395 zero through 255 and looked to 181 00:07:21,395 --> 00:07:23,945 see if there were any English words in that list, 182 00:07:23,945 --> 00:07:25,535 this might be one way. 183 00:07:25,535 --> 00:07:28,055 So let's see if we could write a routine to do this. 184 00:07:28,055 --> 00:07:30,425 So we're problem-solving on our own here. 185 00:07:30,425 --> 00:07:33,620 So first, let's load a list of English words into a list. 186 00:07:33,620 --> 00:07:35,090 I put a copy in the read_only 187 00:07:35,090 --> 00:07:36,845 directory for you to work with. 188 00:07:36,845 --> 00:07:39,060 So create something called eng_dict, 189 00:07:39,060 --> 00:07:40,575 it's just an empty list. 190 00:07:40,575 --> 00:07:42,660 Then I'm going to open 191 00:07:42,660 --> 00:07:44,580 read_only/words_alpha.txt as read, 192 00:07:44,580 --> 00:07:45,860 you can go back into one of 193 00:07:45,860 --> 00:07:47,690 the previous courses if this doesn't look 194 00:07:47,690 --> 00:07:51,020 very familiar to you on how to work with files.
195 00:07:51,020 --> 00:07:52,250 We're going to call the file 196 00:07:52,250 --> 00:07:54,590 f. Then I'm just going to read all 197 00:07:54,590 --> 00:07:58,550 of f in one giant chunk and put that in data. 198 00:07:58,550 --> 00:08:00,410 So now we actually want to split this into 199 00:08:00,410 --> 00:08:02,585 a list based on those new line characters. 200 00:08:02,585 --> 00:08:05,030 So if you go look in that data file words_alpha, 201 00:08:05,030 --> 00:08:07,070 you'll see it's one word per line. 202 00:08:07,070 --> 00:08:10,070 So I'll call data.split on "\n", 203 00:08:10,070 --> 00:08:11,600 this is the new line character, 204 00:08:11,600 --> 00:08:14,270 and this will return a new list which is all of 205 00:08:14,270 --> 00:08:15,590 the different words and I'll 206 00:08:15,590 --> 00:08:17,710 put this into eng_dict. 207 00:08:17,710 --> 00:08:19,640 Now, let's iterate through 208 00:08:19,640 --> 00:08:21,770 all the possible thresholds and look for 209 00:08:21,770 --> 00:08:24,410 an English word, printing it out if it exists. 210 00:08:24,410 --> 00:08:27,905 So for i in range(150, 170), 211 00:08:27,905 --> 00:08:29,600 I'm just going to binarize between 212 00:08:29,600 --> 00:08:35,590 those values and convert the result to a string, 213 00:08:35,590 --> 00:08:40,000 so string will be set to pytesseract.image_to_string. 214 00:08:40,000 --> 00:08:42,110 So we'll binarize, passing in 215 00:08:42,110 --> 00:08:44,570 the bigger sign and our given i value. 216 00:08:44,570 --> 00:08:45,770 So this is binarized with 217 00:08:45,770 --> 00:08:49,550 150, 151, 152, 153, and so forth. 218 00:08:49,550 --> 00:08:51,070 I'm going to try them all between 219 00:08:51,070 --> 00:08:55,465 these two threshold values, 150 and 170. 220 00:08:55,465 --> 00:08:58,885 So we want to remove all non-alphabetical characters.
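Loading the word list could be sketched like this. It follows the steps described above with two small, deliberate differences that are my own choices rather than the lecture's: it splits on any whitespace, so a trailing newline doesn't leave an empty string in the result, and it returns a set, which makes the later `in` checks faster than scanning a list. `load_words` is a hypothetical helper name.

```python
def load_words(path):
    """Read a one-word-per-line file and return the words as a set."""
    with open(path, "r") as f:
        data = f.read()  # the whole file in one giant chunk
    # split() with no argument splits on newlines (and any other
    # whitespace) and drops empty entries
    return set(data.split())


# Usage, with the path mentioned in the lecture:
#   eng_dict = load_words("read_only/words_alpha.txt")
#   print("fossil" in eng_dict)
```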
221 00:08:58,885 --> 00:09:01,380 So that includes parentheses, brackets, 222 00:09:01,380 --> 00:09:03,515 percentage signs, dollar signs, 223 00:09:03,515 --> 00:09:05,080 et cetera from the text. 224 00:09:05,080 --> 00:09:07,285 So here's a short method to do that. 225 00:09:07,285 --> 00:09:10,570 So first, let's convert our string to lowercase only. 226 00:09:10,570 --> 00:09:13,740 So string.lower(), and we'll just reassign string. 227 00:09:13,740 --> 00:09:15,820 Then let's import the string package. 228 00:09:15,820 --> 00:09:18,400 It's got a nice list of lowercase characters. 229 00:09:18,400 --> 00:09:20,560 So import string, and 230 00:09:20,560 --> 00:09:22,830 now let's just iterate over our string, 231 00:09:22,830 --> 00:09:24,630 looking at it character by character, 232 00:09:24,630 --> 00:09:26,770 putting it in the comparison text. 233 00:09:26,770 --> 00:09:29,410 So we'll create some new value, comparison, 234 00:09:29,410 --> 00:09:31,480 and then for every character in our string, 235 00:09:31,480 --> 00:09:33,290 remember this is lowercase, 236 00:09:33,290 --> 00:09:37,085 if that character is in string.ascii_lowercase. 237 00:09:37,085 --> 00:09:39,635 So this is actually just checking to see if 238 00:09:39,635 --> 00:09:44,450 a single character is in a list of characters. 239 00:09:44,450 --> 00:09:47,420 Remember, a string and a list of characters are the same 240 00:09:47,420 --> 00:09:50,210 when you use in. If so, then 241 00:09:50,210 --> 00:09:53,840 comparison is equal to comparison plus that character. 242 00:09:53,840 --> 00:09:57,245 So we just append it to our output string. All right. 243 00:09:57,245 --> 00:09:58,520 Finally, let's search for 244 00:09:58,520 --> 00:10:00,920 the comparison in the dictionary file. 245 00:10:00,920 --> 00:10:02,540 So that's easy in Python. 246 00:10:02,540 --> 00:10:05,090 In other languages, you would have to do a lot of work.
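The character filtering just walked through can be written as a small function. One naming assumption in this sketch: the OCR output is held in a variable called `text` rather than `string`, because a variable named `string` would be replaced by the module the moment `import string` runs. `letters_only` is a hypothetical helper name.

```python
import string


def letters_only(text):
    """Lowercase `text` and keep only the characters a through z."""
    text = text.lower()
    comparison = ""
    for character in text:
        # `in` works on a string just like on a list of characters
        if character in string.ascii_lowercase:
            comparison = comparison + character
    return comparison


# Usage on some OCR-like output with stray symbols:
#   print(letters_only("FOSSIL (50%)\n"))
```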
247 00:10:05,090 --> 00:10:07,940 But here we just use the in operator, 248 00:10:07,940 --> 00:10:10,505 and see if comparison is in eng_dict. 249 00:10:10,505 --> 00:10:12,755 Then we're going to print it out if we find it. 250 00:10:12,755 --> 00:10:14,455 So we'll print comparison. 251 00:10:14,455 --> 00:10:17,110 All right, let's run that. 252 00:10:17,140 --> 00:10:19,685 So you should start to see that 253 00:10:19,685 --> 00:10:21,740 various characters come up, 254 00:10:21,740 --> 00:10:25,430 and in my case "fossil" came up and "w" came up. 255 00:10:25,430 --> 00:10:28,610 So "w" is also in this dictionary, 256 00:10:28,610 --> 00:10:31,670 and a "w" was detected in the data 257 00:10:31,670 --> 00:10:35,090 that we sent in for at least one of the binarizations. 258 00:10:35,090 --> 00:10:37,895 So, well, this is not perfect, but we can 259 00:10:37,895 --> 00:10:40,504 see "fossil" there among other values, 260 00:10:40,504 --> 00:10:43,550 and this is not a bad way actually to clean up OCR data. 261 00:10:43,550 --> 00:10:45,860 It can be useful to use a language or 262 00:10:45,860 --> 00:10:48,350 domain-specific dictionary in practice, 263 00:10:48,350 --> 00:10:50,870 instead of all of the English language words, 264 00:10:50,870 --> 00:10:53,435 especially if you're generating a search engine for 265 00:10:53,435 --> 00:10:55,610 a specialized language such as 266 00:10:55,610 --> 00:10:57,990 a medical knowledge base, or locations. 267 00:10:57,990 --> 00:10:59,750 So like cities. If you 268 00:10:59,750 --> 00:11:01,920 scroll up and look at the data we're working with, 269 00:11:01,920 --> 00:11:04,010 this tiny little wall hanging on 270 00:11:04,010 --> 00:11:07,160 the inside of the store, this is really not so bad. 271 00:11:07,160 --> 00:11:09,560 A lot of this comes down to the purpose that 272 00:11:09,560 --> 00:11:12,050 you're actually doing the OCR for.
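Putting the pieces together, the threshold sweep might be sketched like this. To keep the example runnable on its own, the OCR step is passed in as a function, `ocr_at_threshold`, which is a hypothetical parameter: in the notebook it would binarize the bigger sign at threshold `i` and hand the result to pytesseract. The sketch also skips empty results, which is my own addition.

```python
import string


def find_dictionary_words(ocr_at_threshold, words, thresholds=range(150, 170)):
    """Return (threshold, word) pairs where the cleaned OCR text is in `words`.

    ocr_at_threshold: callable taking a threshold and returning OCR text.
    words: a collection of known lowercase words (list or set).
    thresholds: range(150, 170) covers 150 through 169.
    """
    hits = []
    for i in thresholds:
        text = ocr_at_threshold(i).lower()
        # keep only the letters a through z
        comparison = "".join(c for c in text if c in string.ascii_lowercase)
        if comparison and comparison in words:
            hits.append((i, comparison))
    return hits


# Usage in the notebook, assuming binarize, bigger_sign and eng_dict exist:
#   hits = find_dictionary_words(
#       lambda i: pytesseract.image_to_string(binarize(bigger_sign, i)),
#       eng_dict,
#   )
#   for threshold, word in hits:
#       print(threshold, word)
```

Returning the threshold along with the word makes it easier to see which binarization levels actually produced a readable result.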
273 00:11:12,050 --> 00:11:13,950 So if you are using it, for instance, to 274 00:11:13,950 --> 00:11:16,100 back a search engine, that's one thing. 275 00:11:16,100 --> 00:11:19,895 If you're using it to do text-to-speech, for instance, 276 00:11:19,895 --> 00:11:22,010 and somebody is going to use this to 277 00:11:22,010 --> 00:11:24,500 listen to a lecture, that's completely different, 278 00:11:24,500 --> 00:11:27,290 and you have to have a very, very strong method 279 00:11:27,290 --> 00:11:30,505 for generating the actual data. 280 00:11:30,505 --> 00:11:33,110 So at this point, you've now learned how to manipulate 281 00:11:33,110 --> 00:11:35,465 images and convert them into text. 282 00:11:35,465 --> 00:11:37,070 In the next module in this course, 283 00:11:37,070 --> 00:11:38,690 we're going to dig deeper 284 00:11:38,690 --> 00:11:40,760 into a computer vision library, 285 00:11:40,760 --> 00:11:43,640 which allows us to detect faces among other things. 286 00:11:43,640 --> 00:11:45,100 So then, we'll go on to 287 00:11:45,100 --> 00:11:48,190 a culminating project. I'll see you there.