1 00:00:08,705 --> 00:00:13,680 In the previous example, we were using a clear, unambiguous image for conversion. 2 00:00:13,680 --> 00:00:17,170 Sometimes there will be noise in images you want to OCR, 3 00:00:17,170 --> 00:00:20,010 making it difficult to extract the text. 4 00:00:20,010 --> 00:00:23,293 Luckily, there are techniques we can use to 5 00:00:23,293 --> 00:00:27,744 increase the efficacy of OCR with pytesseract and pillow. 6 00:00:27,744 --> 00:00:31,371 Let's use a different image this time with the same text as before, but 7 00:00:31,371 --> 00:00:33,640 with added noise to the picture. 8 00:00:33,640 --> 00:00:36,260 We can view this image using the following code. 9 00:00:36,260 --> 00:00:40,110 So from PIL we'll import Image, pretty common for us now. 10 00:00:40,110 --> 00:00:44,827 Then we'll do an Image.open, and we'll pull out this Noisy_OCR.PNG, 11 00:00:44,827 --> 00:00:49,558 and then we'll use the display function in Jupyter to display it in-line. 12 00:00:52,392 --> 00:00:56,299 As you can see, this image has shapes of different opacities behind the text, 13 00:00:56,299 --> 00:00:59,040 which can confuse the tesseract engine. 14 00:00:59,040 --> 00:01:02,020 Let's see if OCR will work on this noisy image. 15 00:01:02,020 --> 00:01:06,810 So import pytesseract, then we'll call pytesseract.image_to_string. 16 00:01:06,810 --> 00:01:10,382 And we'll just pass in the image that we're going to open, this Noisy_OCR. 17 00:01:10,382 --> 00:01:14,289 And then let's print out the text directly. 18 00:01:15,588 --> 00:01:19,750 This is a bit surprising given how nicely pytesseract worked previously. 19 00:01:19,750 --> 00:01:22,970 Let's experiment on the image using techniques that will allow for 20 00:01:22,970 --> 00:01:25,390 more effective image analysis. 21 00:01:25,390 --> 00:01:27,480 First up, let's change the size of the image. 22 00:01:28,680 --> 00:01:30,415 So first we're going to import PIL. 
23 00:01:31,770 --> 00:01:34,990 Then we set the base width of our image, so the base width, 24 00:01:34,990 --> 00:01:38,470 we'll set it to 600, and that's in pixels. 25 00:01:38,470 --> 00:01:42,197 Now let's open the image, so that's old hat for us now, 26 00:01:42,197 --> 00:01:44,942 and we'll assign this to img. 27 00:01:44,942 --> 00:01:48,410 We want to get the correct aspect ratio, so we can do this by taking the base width 28 00:01:48,410 --> 00:01:51,320 and dividing it by the actual width of the image. 29 00:01:51,320 --> 00:01:54,790 So I'm going to create a new variable called wpercent and make this equal to 30 00:01:54,790 --> 00:02:01,280 the base width divided by img.size sub zero, which is the width value there. 31 00:02:02,470 --> 00:02:05,860 With the ratio, we can just get the appropriate height of the image too. 32 00:02:05,860 --> 00:02:11,387 So I'll make something called hsize and set this to img.size sub one, and 33 00:02:11,387 --> 00:02:16,406 we'll multiply that by the ratio, so we're just scaling here. 34 00:02:16,406 --> 00:02:18,935 Finally, let's resize the image. 35 00:02:18,935 --> 00:02:24,920 Antialiasing is a specific way of resizing lines to try and make them appear smooth. 36 00:02:24,920 --> 00:02:28,500 So here I'll just call img.resize, I pass it a tuple, 37 00:02:28,500 --> 00:02:31,980 which is the base width and the height size. 38 00:02:31,980 --> 00:02:37,130 And then I use this PIL.Image.ANTIALIAS to really just create better lines. 39 00:02:38,340 --> 00:02:41,845 Now let's save this to a file, so I'll call this 40 00:02:41,845 --> 00:02:46,170 img.save('resized_nois.png'), you could call it whatever you'd like. 41 00:02:47,370 --> 00:02:51,810 And finally, let's display it in-line, so I'll call display. 42 00:02:51,810 --> 00:02:53,890 And then let's run OCR. 
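The resizing arithmetic just described can be sketched roughly as follows. This is only an illustration under some assumptions: the helper names `scaled_size` and `resize_to_width` are hypothetical and not from the lecture, and `Image.LANCZOS` is used in place of `Image.ANTIALIAS` because that older name was removed in Pillow 10 (LANCZOS is the same filter under its current name).

```python
def scaled_size(size, base_width=600):
    """Return (width, height) scaled to base_width, keeping the aspect ratio."""
    width, height = size
    # Ratio of the target width to the actual width (the lecture's wpercent).
    wpercent = base_width / float(width)
    # Scale the height by the same ratio (the lecture's hsize).
    hsize = int(float(height) * wpercent)
    return (base_width, hsize)


def resize_to_width(path, base_width=600):
    """Open the image at `path` and resize it to base_width pixels wide."""
    # Imported here so scaled_size() can be tried without Pillow installed.
    from PIL import Image

    img = Image.open(path)
    # Image.LANCZOS is assumed here as the replacement for Image.ANTIALIAS.
    return img.resize(scaled_size(img.size, base_width), Image.LANCZOS)


print(scaled_size((1200, 800)))  # (600, 400)
```

In a notebook, something like `resize_to_width('Noisy_OCR.PNG').save('resized_nois.png')` would reproduce the step above, assuming the file sits at that path.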
43 00:02:53,890 --> 00:02:56,795 So again, pytesseract.image_to_string, and 44 00:02:56,795 --> 00:02:59,240 I'm going to open this new image underneath. 45 00:02:59,240 --> 00:03:01,281 I guess I could've just passed an image here. 46 00:03:01,281 --> 00:03:02,820 And then print the text. 47 00:03:05,481 --> 00:03:08,837 So this is not actually any improvement from resizing the image. 48 00:03:08,837 --> 00:03:11,636 And this is sometimes life when you're experimenting and 49 00:03:11,636 --> 00:03:14,230 trying to get things like this to work. 50 00:03:14,230 --> 00:03:16,320 Let's convert the image to grayscale. 51 00:03:16,320 --> 00:03:18,890 Converting images can be done in many different ways. 52 00:03:18,890 --> 00:03:21,460 If we poke around in the pillow documentation, 53 00:03:21,460 --> 00:03:25,520 we'll find that one of the easiest ways to do this is with the convert function, and 54 00:03:25,520 --> 00:03:28,300 we pass in the string capital L. 55 00:03:28,300 --> 00:03:31,430 So let's open the image that we're working with, and 56 00:03:31,430 --> 00:03:34,630 then let's call img.convert and pass in a capital L. 57 00:03:35,970 --> 00:03:42,197 Now let's save that image, I'm going to call it grayscale_noise.jpg here. 58 00:03:42,197 --> 00:03:47,451 Remember, PIL always takes care of the file format for you based on the name 59 00:03:47,451 --> 00:03:52,240 of the image, so ending in .jpg here versus ending in .png is fine. 60 00:03:52,240 --> 00:03:55,368 And then let's run OCR on the grayscale image. 61 00:03:55,368 --> 00:03:58,555 And to sort of prove there's no shenanigans, 62 00:03:58,555 --> 00:04:02,185 I'll open that grayscale image that we saved and 63 00:04:02,185 --> 00:04:07,340 pass it to image_to_string in pytesseract and print out the text. 64 00:04:07,340 --> 00:04:09,010 Wow, that worked really well. 
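As a rough sketch of the grayscale step, the snippet below shows the weighted sum that Pillow's documentation gives for the 'L' conversion (L = R * 299/1000 + G * 587/1000 + B * 114/1000), plus a small wrapper around the calls used in the lecture. The function names `luma` and `ocr_grayscale` are hypothetical, and Pillow's internal rounding may differ slightly from this integer version.

```python
def luma(r, g, b):
    """Approximate luminosity of one RGB pixel using the documented weights."""
    # Green contributes most and blue least, roughly matching human perception.
    return (r * 299 + g * 587 + b * 114) // 1000


def ocr_grayscale(path):
    """Convert the image at `path` to grayscale and run OCR on it."""
    # Imported here so luma() can be tried without these packages installed.
    from PIL import Image
    import pytesseract

    gray = Image.open(path).convert("L")
    return pytesseract.image_to_string(gray)


print(luma(255, 255, 255))  # 255, pure white stays white
print(luma(255, 0, 0))      # 76, pure red becomes a fairly dark gray
```

Calling `print(ocr_grayscale('Noisy_OCR.PNG'))` would correspond to the grayscale experiment above, assuming the image is at that path and tesseract is installed.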
65 00:04:09,010 --> 00:04:12,700 So if we look at the help documentation using the help function, 66 00:04:12,700 --> 00:04:18,270 as in help(img.convert), we see that the conversion mechanism used 67 00:04:18,270 --> 00:04:23,350 is the ITU-R 601-2 luma transform. 68 00:04:23,350 --> 00:04:25,460 There's more information about this out there, but 69 00:04:25,460 --> 00:04:28,790 this method essentially takes a three channel image, 70 00:04:28,790 --> 00:04:33,720 where there's information for the amount of red, green, and blue, or R, G, and B. 71 00:04:33,720 --> 00:04:37,450 And reduces it to a single channel to represent luminosity, 72 00:04:37,450 --> 00:04:39,350 and that's what the L is for. 73 00:04:39,350 --> 00:04:42,730 This method actually comes from how standard definition television 74 00:04:42,730 --> 00:04:46,070 sets encode color onto black and white images. 75 00:04:46,070 --> 00:04:50,420 If you get really interested in image manipulation and recognition, 76 00:04:50,420 --> 00:04:55,132 learning about color spaces and how we represent color, both computationally and 77 00:04:55,132 --> 00:04:59,926 through human perception, is a really interesting field. 78 00:04:59,926 --> 00:05:03,170 Even though we now have the complete text of the image, 79 00:05:03,170 --> 00:05:07,490 there are a few other techniques we could use to help improve OCR detection 80 00:05:07,490 --> 00:05:10,100 in the event that the above two don't help. 81 00:05:10,100 --> 00:05:13,670 The next approach I would use is one called binarization, 82 00:05:13,670 --> 00:05:19,090 which means to separate into two distinct parts, in this case, black and white. 83 00:05:19,090 --> 00:05:23,170 Binarization is enacted through a process called thresholding. 84 00:05:23,170 --> 00:05:25,870 If a pixel value is greater than a threshold value, 85 00:05:25,870 --> 00:05:28,510 it'll be converted to a white pixel. 
86 00:05:28,510 --> 00:05:32,660 If it is lower than the threshold value, it'll be converted to a black pixel. 87 00:05:32,660 --> 00:05:35,790 This process eliminates noise in the OCR process, 88 00:05:35,790 --> 00:05:39,020 allowing greater image recognition accuracy. 89 00:05:39,020 --> 00:05:42,420 With pillow, this process is straightforward. 90 00:05:42,420 --> 00:05:46,113 So let's open a noisy image and convert it using binarization. 91 00:05:46,113 --> 00:05:50,358 So here we just call Image.open, we're going to read our noisy image in, and 92 00:05:50,358 --> 00:05:53,960 then we call convert and we pass in the character 1. 93 00:05:53,960 --> 00:05:56,740 Note that we're passing it as a character, not as a number, so 94 00:05:56,740 --> 00:05:59,190 this is a string value we're passing in. 95 00:05:59,190 --> 00:06:02,175 Now let's save and display that image, so img.save, 96 00:06:02,175 --> 00:06:05,550 we'll call it black_and_white noise.jpg, and display. 97 00:06:08,290 --> 00:06:12,100 You can see here the image looks kind of dotted and mottled, there's various 98 00:06:12,100 --> 00:06:16,190 different patterns in it, but definitely this is a black and white image. 99 00:06:17,530 --> 00:06:21,480 So that was a bit magical, and it really required a fine reading of the docs to 100 00:06:21,480 --> 00:06:25,120 figure out that the string 1 is the special parameter 101 00:06:25,120 --> 00:06:28,950 to the convert function that actually does the binarization. 102 00:06:28,950 --> 00:06:33,570 But you actually have all the skills you need to write this function by yourself. 103 00:06:33,570 --> 00:06:35,270 Let's walk through an example. 104 00:06:35,270 --> 00:06:39,520 First, let's define a function called binarize, which takes in an image and 105 00:06:39,520 --> 00:06:40,950 a threshold value. 106 00:06:40,950 --> 00:06:44,300 So I'll def binarize and image_to_transform, and 107 00:06:44,300 --> 00:06:46,140 then some threshold value. 
108 00:06:46,140 --> 00:06:50,190 Now, let's convert the image to a single channel grayscale image using convert. 109 00:06:50,190 --> 00:06:52,440 So here we just create some new output image, 110 00:06:52,440 --> 00:06:54,650 this is what we'll end up returning, and 111 00:06:54,650 --> 00:06:59,920 we'll transform the image passed in from color to luminosity values only. 112 00:06:59,920 --> 00:07:03,300 So right now there's nothing new or magical here to be done, 113 00:07:03,300 --> 00:07:05,140 this is just creating a grayscale image. 114 00:07:06,500 --> 00:07:10,680 The threshold value is usually provided as a number between 0 and 255, 115 00:07:10,680 --> 00:07:14,070 which is the range of values a single byte can hold. 116 00:07:14,070 --> 00:07:14,940 The algorithm for 117 00:07:14,940 --> 00:07:19,560 the binarization is pretty simple, go through every pixel in the image, and 118 00:07:19,560 --> 00:07:23,370 if it's greater than the threshold, turn it all the way up, so to 255. 119 00:07:23,370 --> 00:07:27,163 And if it's lower than the threshold, turn it all the way down, so that's to 0. 120 00:07:27,163 --> 00:07:29,500 So let's write this in code. 121 00:07:29,500 --> 00:07:31,910 First, we need to iterate over all the pixels in the image. 122 00:07:34,070 --> 00:07:39,890 So for x in range, and we'll just go over the width, so values along the x axis. 123 00:07:39,890 --> 00:07:42,859 And then for y in range, it will go through the height, so 124 00:07:42,859 --> 00:07:45,240 these will be our values along the y axis. 125 00:07:45,240 --> 00:07:47,840 So for a given pixel at some width and height, 126 00:07:47,840 --> 00:07:50,650 let's check its value against the threshold. 127 00:07:50,650 --> 00:07:53,833 So we could do this with if output_image.getpixel, 128 00:07:53,833 --> 00:07:55,847 we'll just pull the pixel x and y. 
129 00:07:55,847 --> 00:07:57,744 You'll note lots of brackets here, 130 00:07:57,744 --> 00:08:00,757 that's because we are actually passing a tuple value in. 131 00:08:00,757 --> 00:08:04,530 We just check to see if it's less than some threshold value. 132 00:08:04,530 --> 00:08:06,500 So let's set this to 0 if it is. 133 00:08:06,500 --> 00:08:11,220 So in our output image, we just put that pixel, we pass in the same x, y, and 134 00:08:11,220 --> 00:08:12,240 we set it to 0. 135 00:08:12,240 --> 00:08:15,265 So we're just changing it to 0 if it's less than the threshold. 136 00:08:15,265 --> 00:08:22,342 Otherwise we want to set this to 255, so output_image.putpixel( ( x,y), 255 ). 137 00:08:22,342 --> 00:08:24,090 And now we just return the new image. 138 00:08:26,050 --> 00:08:29,880 So let's test this function over a range of different thresholds. 139 00:08:29,880 --> 00:08:33,480 Remember that you can use the range function to generate a list of numbers at 140 00:08:33,480 --> 00:08:35,310 different step sizes. 141 00:08:35,310 --> 00:08:39,030 Range is called with a start, a stop, and a step size. 142 00:08:39,030 --> 00:08:42,030 So let's try the range 0, 257, and 64, 143 00:08:42,030 --> 00:08:47,280 which should generate five images of different threshold values. 144 00:08:47,280 --> 00:08:54,320 So for thresh in range 0 to 257, and then we're going to step at 64. 145 00:08:54,320 --> 00:08:59,280 Let's print out a string to tell us what threshold we're trying here. 146 00:08:59,280 --> 00:09:03,350 And so we want to change, remember, the thresh value is an integer, so 147 00:09:03,350 --> 00:09:05,980 we'll change it to a string here using the str function. 148 00:09:07,100 --> 00:09:09,070 And then let's display the binarized image in-line. 
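The binarize function walked through above might look roughly like this. It is a sketch rather than the lecturer's exact code: the small helper `threshold_value` is a hypothetical addition that isolates the per-pixel decision, and the function assumes it is handed a Pillow image (or any object offering `convert`, `size`, `getpixel` and `putpixel`).

```python
def threshold_value(value, threshold):
    """Map one grayscale value (0-255) to 0 (black) or 255 (white)."""
    # Values below the threshold go all the way down, everything else all the way up.
    return 0 if value < threshold else 255


def binarize(image_to_transform, threshold):
    """Return a black and white copy of the image using a fixed threshold."""
    # Reduce the image to a single luminosity channel first.
    output_image = image_to_transform.convert("L")
    width, height = output_image.size
    # Visit every pixel; getpixel and putpixel both take an (x, y) tuple.
    for x in range(width):
        for y in range(height):
            value = output_image.getpixel((x, y))
            output_image.putpixel((x, y), threshold_value(value, threshold))
    return output_image
```

Looping with `getpixel` and `putpixel` is slow on large images but keeps the logic easy to follow, which seems to be the point of the exercise.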
149 00:09:10,510 --> 00:09:14,706 And so the way we do this is the display function, then we're going to call our 150 00:09:14,706 --> 00:09:19,854 function, binarize, we're going to pass it the Image.open, read_only/Noisy_OCR. 151 00:09:19,854 --> 00:09:23,682 We could of course cache this, open it, and pass it around as a parameter, but 152 00:09:23,682 --> 00:09:26,570 it's okay for our demonstration to do it this way. 153 00:09:26,570 --> 00:09:31,120 And then we'll send in the threshold value, which will be 0 the first time, 154 00:09:31,120 --> 00:09:33,509 64 for the second time, and so forth. 155 00:09:33,509 --> 00:09:35,310 And let's use tesseract on it. 156 00:09:35,310 --> 00:09:39,400 It's inefficient to binarize it twice, but this is really just for a demo. 157 00:09:39,400 --> 00:09:43,250 So here we'll call print pytesseract.image_to_string, 158 00:09:43,250 --> 00:09:47,837 passing in then a call to binarize, which passes in a call to Image.open. 159 00:09:47,837 --> 00:09:52,339 So there's a lot of Image.opens here, lots of room this code could be improved, but 160 00:09:52,339 --> 00:09:54,350 it should generate an example for us. 161 00:09:55,830 --> 00:09:59,440 So you can see the result with threshold 0 is pretty empty. 162 00:09:59,440 --> 00:10:04,663 With threshold 64 we actually get a very faint looking image, 163 00:10:04,663 --> 00:10:08,641 but it seems like we get all of or most of the text. 164 00:10:08,641 --> 00:10:12,333 When we increase the threshold to 192 from 128, 165 00:10:12,333 --> 00:10:17,306 we see that we actually pick up a new space between the words of and this, so 166 00:10:17,306 --> 00:10:20,620 we're getting more definition in the text. 167 00:10:20,620 --> 00:10:24,270 But then when we increase the threshold all the way to 256, 168 00:10:24,270 --> 00:10:29,570 we lose a lot of text because a whole segment of the image becomes black. 
169 00:10:29,570 --> 00:10:32,691 And then, all of a sudden, at the very top end threshold, 170 00:10:32,691 --> 00:10:36,032 we get nothing because the whole image is black at that point. 171 00:10:38,560 --> 00:10:43,117 We can see from this that a threshold of 0 essentially turns everything white, 172 00:10:43,117 --> 00:10:47,470 and that the text becomes bolder as we move towards a higher threshold. 173 00:10:47,470 --> 00:10:48,230 And the shapes, 174 00:10:48,230 --> 00:10:52,840 which have a filled-in gray color, become more evident at higher thresholds. 175 00:10:52,840 --> 00:10:56,670 In the next lecture, we'll look a bit more at some of the challenges you can expect 176 00:10:56,670 --> 00:10:58,500 when doing OCR on real data.
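To see why the two extreme thresholds behave the way they do, here is a small sketch that applies the same rule (below the threshold becomes 0, otherwise 255) to a handful of made-up grayscale values instead of a real image. The `sample` values and the helper name `binarize_values` are assumptions for illustration only.

```python
def binarize_values(values, threshold):
    """Apply the lecture's thresholding rule to a list of grayscale values."""
    return [0 if value < threshold else 255 for value in values]


# range(0, 257, 64) yields the five thresholds used in the demonstration.
thresholds = list(range(0, 257, 64))
print(thresholds)  # [0, 64, 128, 192, 256]

# Hypothetical pixel values from dark (0) to bright (255).
sample = [0, 60, 120, 180, 255]
for thresh in thresholds:
    print("Trying with threshold " + str(thresh), binarize_values(sample, thresh))
```

At threshold 0 nothing is below the threshold, so every value becomes 255 (white); at 256 every possible value is below it, so everything becomes 0 (black). The useful thresholds sit somewhere in between.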