1 00:00:05,280 --> 00:00:08,650 We're going to start experimenting with tesseract 2 00:00:08,650 --> 00:00:11,785 using just a simple image of nice clean text. 3 00:00:11,785 --> 00:00:15,400 Let's first import Image from PIL and display the image 4 00:00:15,400 --> 00:00:19,945 text.png. From PIL we'll import Image. 5 00:00:19,945 --> 00:00:22,720 image is equal to Image.open, 6 00:00:22,720 --> 00:00:25,700 and we'll get text.png from the read_only directory. 7 00:00:25,700 --> 00:00:28,905 Then we'll display that image. 8 00:00:28,905 --> 00:00:33,265 Great, we have a base image of some big clear text. 9 00:00:33,265 --> 00:00:36,910 Let's import pytesseract and use the dir function to get 10 00:00:36,910 --> 00:00:38,170 a sense of what might be 11 00:00:38,170 --> 00:00:40,510 some interesting functions to play with. 12 00:00:40,510 --> 00:00:42,765 So import pytesseract, 13 00:00:42,765 --> 00:00:46,100 and we can use dir to see what's inside of it. 14 00:00:46,100 --> 00:00:48,800 Okay. It looks like there is just a handful 15 00:00:48,800 --> 00:00:50,165 of interesting functions, 16 00:00:50,165 --> 00:00:53,645 and I think image_to_string is probably our best bet. 17 00:00:53,645 --> 00:00:55,220 Let's use the help function to 18 00:00:55,220 --> 00:00:57,020 interrogate this a bit more. 19 00:00:57,020 --> 00:01:01,290 So help(pytesseract.image_to_string). 20 00:01:03,070 --> 00:01:06,545 So this function takes an image as the first parameter, 21 00:01:06,545 --> 00:01:08,510 and there are a bunch of optional parameters, 22 00:01:08,510 --> 00:01:10,760 and it'll return the results of the OCR. 23 00:01:10,760 --> 00:01:12,110 I think it's worth comparing 24 00:01:12,110 --> 00:01:13,790 this documentation string with 25 00:01:13,790 --> 00:01:15,050 the documentation we were 26 00:01:15,050 --> 00:01:16,760 receiving from the Pillow module. 27 00:01:16,760 --> 00:01:20,540 Let's run the help function on the Image resize function.
28 00:01:20,540 --> 00:01:24,790 So help(Image.Image.resize). 29 00:01:25,610 --> 00:01:28,090 Notice how the Pillow function has 30 00:01:28,090 --> 00:01:29,860 a bit more information in it. 31 00:01:29,860 --> 00:01:31,270 First off, it's using 32 00:01:31,270 --> 00:01:34,430 a specific format called reStructuredText, 33 00:01:34,430 --> 00:01:36,375 which is similar in intent 34 00:01:36,375 --> 00:01:38,840 to document markup languages such as HTML, 35 00:01:38,840 --> 00:01:40,330 the language of the web. 36 00:01:40,330 --> 00:01:41,950 The intent is to embed 37 00:01:41,950 --> 00:01:44,350 semantics in the documentation itself. 38 00:01:44,350 --> 00:01:46,930 For instance, in the resize function we see 39 00:01:46,930 --> 00:01:49,900 the words param size with colons surrounding them. 40 00:01:49,900 --> 00:01:53,375 This allows documentation engines which create docs for 41 00:01:53,375 --> 00:01:55,090 the source code to link 42 00:01:55,090 --> 00:01:58,270 the parameter to the extended docs about that parameter. 43 00:01:58,270 --> 00:02:00,970 In this case, the extended docs tell us that 44 00:02:00,970 --> 00:02:04,000 the size should be passed as a tuple of width and height. 45 00:02:04,000 --> 00:02:07,455 Notice how the docs for image_to_string, for instance, 46 00:02:07,455 --> 00:02:09,530 indicate that there's a lang parameter 47 00:02:09,530 --> 00:02:10,910 which we could use, 48 00:02:10,910 --> 00:02:12,890 but then fail to say anything about what 49 00:02:12,890 --> 00:02:16,805 that parameter is for or what its format is. 50 00:02:16,805 --> 00:02:20,450 What this really means is that you need to dig deeper. 51 00:02:20,450 --> 00:02:21,950 Here's a quick hack if you want to 52 00:02:21,950 --> 00:02:23,870 look at the source code of a function. 53 00:02:23,870 --> 00:02:25,805 You can use the inspect.getsource 54 00:02:25,805 --> 00:02:28,475 function and print the results.
55 00:02:28,475 --> 00:02:30,350 So let's import inspect, 56 00:02:30,350 --> 00:02:32,000 and remember this module comes from 57 00:02:32,000 --> 00:02:34,350 the Python 3 standard library. 58 00:02:34,350 --> 00:02:36,760 Then source is equal to 59 00:02:36,760 --> 00:02:39,830 inspect.getsource, and we pass it a function pointer. 60 00:02:39,830 --> 00:02:41,885 Note that we're not calling the function. 61 00:02:41,885 --> 00:02:44,345 We're just passing a reference to the function, 62 00:02:44,345 --> 00:02:47,770 and then let's print that source to the screen. 63 00:02:47,770 --> 00:02:50,780 So it's interesting, you can actually look at 64 00:02:50,780 --> 00:02:53,720 the source code behind a given function, 65 00:02:53,720 --> 00:02:54,860 and that's one of the powers of 66 00:02:54,860 --> 00:02:57,635 an interpreted language like Python. 67 00:02:57,635 --> 00:03:00,200 There's actually another way in Jupyter, 68 00:03:00,200 --> 00:03:02,600 and that's to append two question marks 69 00:03:02,600 --> 00:03:05,285 to the end of a given function or module. 70 00:03:05,285 --> 00:03:07,410 Other editors have similar features, 71 00:03:07,410 --> 00:03:09,920 and this is actually a great reason that 72 00:03:09,920 --> 00:03:12,860 you should be using a software development environment. 73 00:03:12,860 --> 00:03:15,040 So pytesseract.image_to_string, 74 00:03:15,040 --> 00:03:16,130 just as if we were going to call 75 00:03:16,130 --> 00:03:18,170 the function, add two question marks 76 00:03:18,170 --> 00:03:19,550 to the end, and then run that. 77 00:03:19,550 --> 00:03:23,690 We see that it pops up at the bottom of 78 00:03:23,690 --> 00:03:25,850 the screen with a lot more information, and it's 79 00:03:25,850 --> 00:03:29,310 nicely syntax highlighted for us too.
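A minimal sketch of the inspect.getsource step described above. It assumes nothing beyond the standard library: textwrap.dedent is used as a stand-in (an assumption for illustration only), whereas in the lecture the function passed in is pytesseract.image_to_string.

```python
import inspect
import textwrap

# Pass a reference to the function (no parentheses), not the result of calling it.
# textwrap.dedent is a stand-in chosen for illustration; with pytesseract
# installed you could pass pytesseract.image_to_string instead.
source = inspect.getsource(textwrap.dedent)

# Print the Python source text of the function.
print(source)

# Note: this only works for functions written in Python. Built-ins implemented
# in C, such as len, raise TypeError because there is no Python source to show.
```

In a Jupyter notebook, running a cell containing `pytesseract.image_to_string??` shows similar information in a pop-up pane; that syntax is an IPython feature and is not valid in a plain Python script.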
80 00:03:34,580 --> 00:03:37,270 We can see from the source code that there 81 00:03:37,270 --> 00:03:39,430 really isn't much more information about 82 00:03:39,430 --> 00:03:41,395 what the parameters are for or 83 00:03:41,395 --> 00:03:43,775 what this image_to_string function does. 84 00:03:43,775 --> 00:03:45,300 This is because underneath, 85 00:03:45,300 --> 00:03:47,370 the pytesseract library is calling 86 00:03:47,370 --> 00:03:50,685 a C++ library which does all of the hard work, 87 00:03:50,685 --> 00:03:53,050 and the author just passes through all of 88 00:03:53,050 --> 00:03:56,170 the calls to the underlying tesseract executable. 89 00:03:56,170 --> 00:03:57,690 This is a common issue when 90 00:03:57,690 --> 00:03:59,355 working with Python libraries, 91 00:03:59,355 --> 00:04:01,870 and it means that we need to do some web sleuthing 92 00:04:01,870 --> 00:04:03,340 in order to understand how we 93 00:04:03,340 --> 00:04:05,605 can interact with tesseract. 94 00:04:05,605 --> 00:04:07,490 In a case like this, 95 00:04:07,490 --> 00:04:10,495 I just Googled tesseract command line parameters, 96 00:04:10,495 --> 00:04:13,135 and the first hit was what I was looking for. 97 00:04:13,135 --> 00:04:16,160 Here's the URL to the GitHub page. 98 00:04:17,890 --> 00:04:21,940 This goes to a wiki page which describes how to call 99 00:04:21,940 --> 00:04:25,000 the tesseract executable, and as we read down, 100 00:04:25,000 --> 00:04:27,670 we see that we can actually have tesseract use 101 00:04:27,670 --> 00:04:30,445 multiple languages in its detection, such as 102 00:04:30,445 --> 00:04:33,145 English and Hindi, by passing them in 103 00:04:33,145 --> 00:04:37,240 as eng+hin. That's very cool.
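As a rough sketch of how the eng+hin language string might be built and passed along: the helper names join_languages and ocr_with_languages below are hypothetical, introduced only for illustration. The sketch assumes pytesseract is installed and that the Tesseract install has both the eng and hin language data available.

```python
def join_languages(codes):
    """Join Tesseract language codes with '+', the form described on the wiki page."""
    return "+".join(codes)


def ocr_with_languages(image, codes):
    """Hypothetical wrapper: run OCR on an image using several languages at once."""
    # Imported lazily so join_languages can be tried without pytesseract installed.
    import pytesseract

    # The lang parameter of image_to_string receives the combined string.
    return pytesseract.image_to_string(image, lang=join_languages(codes))


lang = join_languages(["eng", "hin"])
print(lang)  # eng+hin
```

With an opened PIL image in hand, ocr_with_languages(image, ["eng", "hin"]) would then ask tesseract to consider both English and Hindi.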
104 00:04:38,570 --> 00:04:41,055 One last thing to mention, 105 00:04:41,055 --> 00:04:44,015 the image_to_string function takes in an image, 106 00:04:44,015 --> 00:04:45,430 but the docs don't really 107 00:04:45,430 --> 00:04:47,875 describe what this image is underneath. 108 00:04:47,875 --> 00:04:49,930 Is it a string to an image file? 109 00:04:49,930 --> 00:04:52,975 A Pillow image or something else? 110 00:04:52,975 --> 00:04:55,630 Again we have to sleuth and/or 111 00:04:55,630 --> 00:04:58,700 experiment to understand what we should do. 112 00:04:58,700 --> 00:05:00,320 If we look at the source code 113 00:05:00,320 --> 00:05:01,850 for the pytesseract library, 114 00:05:01,850 --> 00:05:04,835 we see there's a function called run_and_get_output. 115 00:05:04,835 --> 00:05:06,560 Here's a link to that function on 116 00:05:06,560 --> 00:05:08,840 the author's GitHub account. 117 00:05:08,840 --> 00:05:11,390 When we look at this function we can 118 00:05:11,390 --> 00:05:16,110 see what actually happens when we call it. 119 00:05:16,340 --> 00:05:18,830 In this function, we see that one of 120 00:05:18,830 --> 00:05:20,480 the first things that happens is that 121 00:05:20,480 --> 00:05:23,870 the image is saved through the save_image function. 122 00:05:23,870 --> 00:05:26,490 Here's that line of code. 123 00:05:29,060 --> 00:05:31,670 We see that there's another function 124 00:05:31,670 --> 00:05:33,020 called prepare image, 125 00:05:33,020 --> 00:05:36,515 which actually loads the image as a Pillow image file. 126 00:05:36,515 --> 00:05:39,290 So, yes, sending a PIL image file 127 00:05:39,290 --> 00:05:41,600 is appropriate for this function.
128 00:05:41,600 --> 00:05:43,250 It sure would have been useful for 129 00:05:43,250 --> 00:05:45,440 the author to have included this information in 130 00:05:45,440 --> 00:05:47,750 reStructuredText to help us 131 00:05:47,750 --> 00:05:50,825 not have to dig through the implementation itself. 132 00:05:50,825 --> 00:05:53,075 But this is an open source project. 133 00:05:53,075 --> 00:05:54,770 Maybe you would like to contribute 134 00:05:54,770 --> 00:05:57,070 back some better documentation. 135 00:05:57,070 --> 00:05:59,145 Just a hint, if you're interested: 136 00:05:59,145 --> 00:06:02,055 the doc line we need is :param image:, 137 00:06:02,055 --> 00:06:04,470 and then we just say that it's a PIL 138 00:06:04,470 --> 00:06:08,265 Image.Image file or an ndarray of bytes. 139 00:06:08,265 --> 00:06:10,880 In the end, we often don't 140 00:06:10,880 --> 00:06:12,920 do this full level of investigation, 141 00:06:12,920 --> 00:06:14,960 and we just experiment and try things. 142 00:06:14,960 --> 00:06:18,680 It seems pretty likely that a PIL Image.Image would work, 143 00:06:18,680 --> 00:06:21,815 given how well known PIL is in the Python world. 144 00:06:21,815 --> 00:06:24,260 But still, as you explore and use 145 00:06:24,260 --> 00:06:26,150 different libraries you'll see a breadth 146 00:06:26,150 --> 00:06:28,190 of different documentation norms. 147 00:06:28,190 --> 00:06:31,505 So it's useful to know how to explore the source code, 148 00:06:31,505 --> 00:06:33,485 and now that you're at the end of this course, 149 00:06:33,485 --> 00:06:36,145 you've got the skills to do so. 150 00:06:36,145 --> 00:06:39,965 Okay. Let's try and run tesseract on this image. 151 00:06:39,965 --> 00:06:41,210 So text is equal to 152 00:06:41,210 --> 00:06:45,050 pytesseract.image_to_string and we pass in the image, 153 00:06:45,050 --> 00:06:50,070 and then let's just print out the text. Looks great.
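To make the :param image: hint and the final call concrete, here is a hedged sketch. The wrapper ocr_text and the main function are hypothetical names, not part of pytesseract, and the docstring wording is only one possible way to phrase the missing documentation. The file name text.png comes from the lecture; its location on disk is an assumption you would need to adjust.

```python
def ocr_text(image, lang=None):
    """Run Tesseract OCR on an image and return the recognized text.

    :param image: A PIL ``Image.Image`` object (per the lecture, an ndarray
        is accepted as well).
    :param lang: Optional Tesseract language string, such as ``"eng"`` or
        ``"eng+hin"``.
    :returns: The recognized text as a string.
    """
    # Imported lazily so the docstring can be inspected without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_string(image, lang=lang)


def main():
    """Open the lecture's sample image and print the OCR result."""
    from PIL import Image

    # Assumed path: adjust to wherever text.png lives on your machine.
    image = Image.open("text.png")
    text = ocr_text(image)
    print(text)


# Show the reStructuredText docstring that help(ocr_text) would display.
print(ocr_text.__doc__)
```

Calling main() requires Pillow, pytesseract, and the tesseract executable to be installed; it is left uncalled here so the docstring part can be tried on its own.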
154 00:06:50,070 --> 00:06:51,650 We can see that the output includes 155 00:06:51,650 --> 00:06:55,099 newline characters and faithfully represents the text, 156 00:06:55,099 --> 00:06:57,410 but it doesn't include any special formatting. 157 00:06:57,410 --> 00:06:59,330 Let's go on and look at something 158 00:06:59,330 --> 00:07:02,280 with a bit more nuance to it next.