1
00:00:08,705 --> 00:00:13,680
In the previous example, we were using a
clear, unambiguous image for a conversion.

2
00:00:13,680 --> 00:00:17,170
Sometimes there will be noise
in images you want to OCR,

3
00:00:17,170 --> 00:00:20,010
making it difficult to extract the text.

4
00:00:20,010 --> 00:00:23,293
Luckily, there
are techniques we can use to

5
00:00:23,293 --> 00:00:27,744
increase the efficacy of OCR
with pytesseract and pillow.

6
00:00:27,744 --> 00:00:31,371
Let's use a different image this time
with the same text as before, but

7
00:00:31,371 --> 00:00:33,640
with added noise to the picture.

8
00:00:33,640 --> 00:00:36,260
We could view this image
using the following code.

9
00:00:36,260 --> 00:00:40,110
So from PIL we'll import Image,
pretty common for us now.

10
00:00:40,110 --> 00:00:44,827
Then we'll do an Image.open, and
we'll pull out this Noisy_OCR.PNG,

11
00:00:44,827 --> 00:00:49,558
and then we'll use the display function
in Jupyter to display it in-line.

12
00:00:52,392 --> 00:00:56,299
As you can see, this image has shapes of
different opacities behind the text,

13
00:00:56,299 --> 00:00:59,040
which can confuse the tesseract engine.

14
00:00:59,040 --> 00:01:02,020
Let's see if OCR will
work on this noisy image.

15
00:01:02,020 --> 00:01:06,810
So import pytesseract, then we'll
call pytesseract.image_to_string.

16
00:01:06,810 --> 00:01:10,382
And we'll just pass it in the image that
we're going to open, this Noisy_OCR.

17
00:01:10,382 --> 00:01:14,289
And then let's print
out the text directly.

18
00:01:15,588 --> 00:01:19,750
This is a bit surprising given how
nicely pytesseract worked previously.

19
00:01:19,750 --> 00:01:22,970
Let's experiment on the image using
techniques that will allow for

20
00:01:22,970 --> 00:01:25,390
more effective image analysis.

21
00:01:25,390 --> 00:01:27,480
First up,
let's change the size of the image.

22
00:01:28,680 --> 00:01:30,415
So first we're going to import PIL.

23
00:01:31,770 --> 00:01:34,990
Then we set the base width of our image,
so the base width,

24
00:01:34,990 --> 00:01:38,470
we'll set it to 600,
and this is in pixels.

25
00:01:38,470 --> 00:01:42,197
Now let's open the image, so
that's old hat for us now,

26
00:01:42,197 --> 00:01:44,942
and we'll assign this to img.

27
00:01:44,942 --> 00:01:48,410
We want to get the correct aspect ratio,
so we can do this by taking the base width

28
00:01:48,410 --> 00:01:51,320
and dividing it by the actual
width of the image.

29
00:01:51,320 --> 00:01:54,790
So I'm going to create a new variable
called wpercent and make this equal to

30
00:01:54,790 --> 00:02:01,280
the base width divided by the image.size
sub zero, which is the width value there.

31
00:02:02,470 --> 00:02:05,860
With the ratio, we can just get
the appropriate height of the image too.

32
00:02:05,860 --> 00:02:11,387
So I'll make something called hsize and
set this to image size sub one, and

33
00:02:11,387 --> 00:02:16,406
we'll times that by the percentage,
so we're just scaling here.

34
00:02:16,406 --> 00:02:18,935
Finally, let's resize the image.

35
00:02:18,935 --> 00:02:24,920
Antialiasing is a specific way of resizing
lines to try and make them appear smooth.

36
00:02:24,920 --> 00:02:28,500
So here I'll just call image.resize,
I pass it a tuple,

37
00:02:28,500 --> 00:02:31,980
which is the base width and
the height size.

38
00:02:31,980 --> 00:02:37,130
And then I use this PIL.Image.ANTIALIAS
to really just create better lines.
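Here's a minimal sketch of the resizing steps just described, assuming a recent version of pillow. The helper name resize_to_width is made up for illustration. It uses Image.LANCZOS, which is the filter that older pillow releases also exposed under the name Image.ANTIALIAS; newer releases no longer provide the ANTIALIAS name.

```python
from PIL import Image


def resize_to_width(img, base_width=600):
    """Scale img to base_width pixels wide while keeping its aspect ratio."""
    # Ratio of the target width to the current width.
    wpercent = base_width / float(img.size[0])
    # Scale the height by the same ratio so the picture is not distorted.
    hsize = int(float(img.size[1]) * wpercent)
    # LANCZOS is the smoothing filter formerly available as ANTIALIAS.
    return img.resize((base_width, hsize), Image.LANCZOS)
```

You could then call something like resize_to_width(Image.open('Noisy_OCR.PNG')) and save or display the result as usual.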

39
00:02:38,340 --> 00:02:41,845
Now let's save this to a file,
so I'll call this

40
00:02:41,845 --> 00:02:46,170
img.save('resized_nois.png'), you
could call it whatever you'd like.

41
00:02:47,370 --> 00:02:51,810
And finally, let's display it in-line,
so I'll call display.

42
00:02:51,810 --> 00:02:53,890
And then let's run OCR.

43
00:02:53,890 --> 00:02:56,795
So again, pytesseract.image_to_string, and

44
00:02:56,795 --> 00:02:59,240
I'm going to open this
new image underneath.

45
00:02:59,240 --> 00:03:01,281
I guess I could've just
passed an image here.

46
00:03:01,281 --> 00:03:02,820
And then print the text.

47
00:03:05,481 --> 00:03:08,837
So resizing the image did not
actually give us any improvement.

48
00:03:08,837 --> 00:03:11,636
And this is sometimes life
when you're experimenting and

49
00:03:11,636 --> 00:03:14,230
trying to get things like this to work.

50
00:03:14,230 --> 00:03:16,320
Let's convert the image to grayscale.

51
00:03:16,320 --> 00:03:18,890
Converting images can be
done in many different ways.

52
00:03:18,890 --> 00:03:21,460
If we poke around in
the pillow documentation,

53
00:03:21,460 --> 00:03:25,520
we'll find that one of the easiest ways to
do this is with the convert function, and

54
00:03:25,520 --> 00:03:28,300
we pass in the string a capital L.

55
00:03:28,300 --> 00:03:31,430
So let's open the image that
we're working with, and

56
00:03:31,430 --> 00:03:34,630
then let's call img.convert and
pass in a capital L.

57
00:03:35,970 --> 00:03:42,197
Now let's save that image, I'm going to
call it grayscale_noise.jpg here.

58
00:03:42,197 --> 00:03:47,451
Remember, PIL always worries about
the file format for you based on the name

59
00:03:47,451 --> 00:03:52,240
of the image, so ending in .jpg
here versus ending in .png is fine.

60
00:03:52,240 --> 00:03:55,368
And then let's run OCR
on the grayscale image.

61
00:03:55,368 --> 00:03:58,555
And sort of prove there's no shenanigans,

62
00:03:58,555 --> 00:04:02,185
I'll open that grayscale
image that we saved and

63
00:04:02,185 --> 00:04:07,340
pass it to image_to_string in
pytesseract and print out the text.

64
00:04:07,340 --> 00:04:09,010
Wow, that worked really well.

65
00:04:09,010 --> 00:04:12,700
So if we look at the help
documentation using the help function,

66
00:04:12,700 --> 00:04:18,270
as in help(img.convert),
we see that the conversion mechanism used

67
00:04:18,270 --> 00:04:23,350
is the ITU-R 601-2 luma transform.

68
00:04:23,350 --> 00:04:25,460
There's more information
about this out there, but

69
00:04:25,460 --> 00:04:28,790
this method essentially
takes a three channel image,

70
00:04:28,790 --> 00:04:33,720
where there's information for the amount
of red, green, and blue, or R, G, and B.

71
00:04:33,720 --> 00:04:37,450
And reduces it to a single
channel to represent luminosity,

72
00:04:37,450 --> 00:04:39,350
and that's what the L is for.
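To make the luma idea concrete, here's a small sketch that compares pillow's convert('L') against the weighted sum given in the pillow documentation for the ITU-R 601-2 transform (L = R * 299/1000 + G * 587/1000 + B * 114/1000). The luma helper and the one-pixel test image are just for illustration, and pillow's own integer rounding may differ from this formula by about one gray level.

```python
from PIL import Image


def luma(r, g, b):
    """Weighted sum from the pillow docs for the ITU-R 601-2 luma transform."""
    return (r * 299 + g * 587 + b * 114) / 1000


# A one-pixel, pure red image stands in for a real photo here.
red = Image.new("RGB", (1, 1), (255, 0, 0))
gray = red.convert("L")

# Compare pillow's single-channel value with the hand-computed one.
print(gray.mode, gray.getpixel((0, 0)), luma(255, 0, 0))
```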

73
00:04:39,350 --> 00:04:42,730
This method actually comes from
how standard definition television

74
00:04:42,730 --> 00:04:46,070
sets encode color onto black and
white images.

75
00:04:46,070 --> 00:04:50,420
If you get really interested in
image manipulation and recognition,

76
00:04:50,420 --> 00:04:55,132
learning about color spaces and how we
represent color, both computationally and

77
00:04:55,132 --> 00:04:59,926
through human perception,
is a really interesting field.

78
00:04:59,926 --> 00:05:03,170
Even though we now have
the complete text of the image,

79
00:05:03,170 --> 00:05:07,490
there's a few other techniques we could
use to help improve OCR detection

80
00:05:07,490 --> 00:05:10,100
in the event that
the above two don't help.

81
00:05:10,100 --> 00:05:13,670
The next approach I would use
is one called binarization,

82
00:05:13,670 --> 00:05:19,090
which means to separate into two distinct
parts, in this case, black and white.

83
00:05:19,090 --> 00:05:23,170
Binarization is enacted through
a process called thresholding.

84
00:05:23,170 --> 00:05:25,870
If a pixel value is greater
than a threshold value,

85
00:05:25,870 --> 00:05:28,510
it'll be converted to a white pixel.

86
00:05:28,510 --> 00:05:32,660
If it is lower than a threshold value,
it'll be converted to a black pixel.

87
00:05:32,660 --> 00:05:35,790
This process eliminates
noise in the OCR process,

88
00:05:35,790 --> 00:05:39,020
allowing greater image
recognition accuracy.

89
00:05:39,020 --> 00:05:42,420
With pillow,
this process is straightforward.

90
00:05:42,420 --> 00:05:46,113
So let's open a noisy image and
convert it using binarization.

91
00:05:46,113 --> 00:05:50,358
So here we just image.open, we're
going to read our noisy image in, and

92
00:05:50,358 --> 00:05:53,960
then we call convert and
we pass in the character 1.

93
00:05:53,960 --> 00:05:56,740
Note that we're passing it as a character,
not as a number, so

94
00:05:56,740 --> 00:05:59,190
this is a string value we're passing in.

95
00:05:59,190 --> 00:06:02,175
Now let's save and
display that image, so img.save,

96
00:06:02,175 --> 00:06:05,550
we'll call it black_and_white noise.jpg,
and display.

97
00:06:08,290 --> 00:06:12,100
You can see here the image looks kind
of dotted and mottled, there's various

98
00:06:12,100 --> 00:06:16,190
different patterns in it, but definitely
this is a black and white image.
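As a rough sketch of what convert('1') gives you, the snippet below builds a synthetic gray gradient (a stand-in for the noisy picture, not the lecture's file) and checks which pixel values survive. According to the pillow docs, the default conversion to mode '1' uses Floyd-Steinberg dithering, which is a likely reason for the dotted look.

```python
from PIL import Image

# Synthetic stand-in image: a left-to-right gradient from black to white.
gradient = Image.new("L", (256, 16))
for x in range(256):
    for y in range(16):
        gradient.putpixel((x, y), x)

# Note the string '1', not the integer 1.
bilevel = gradient.convert("1")

# Collect every distinct pixel value in the converted image.
values = {bilevel.getpixel((x, y)) for x in range(256) for y in range(16)}
print(bilevel.mode, sorted(values))
```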

99
00:06:17,530 --> 00:06:21,480
So that was a bit magical, and it really
required a fine reading of the docs to

100
00:06:21,480 --> 00:06:25,120
figure out that the number 1 is
the special string parameter

101
00:06:25,120 --> 00:06:28,950
to the convert function that
actually does the binarization.

102
00:06:28,950 --> 00:06:33,570
But you actually have all the skills you
need to write this function by yourself.

103
00:06:33,570 --> 00:06:35,270
Let's walk through an example.

104
00:06:35,270 --> 00:06:39,520
First, let's define a function called
binarize, which takes in an image and

105
00:06:39,520 --> 00:06:40,950
a threshold value.

106
00:06:40,950 --> 00:06:44,300
So I'll def binarize and
image_to_transform, and

107
00:06:44,300 --> 00:06:46,140
then some threshold value.

108
00:06:46,140 --> 00:06:50,190
Now, let's convert the image to
a single grayscale image using convert.

109
00:06:50,190 --> 00:06:52,440
So here we just create
some new output image,

110
00:06:52,440 --> 00:06:54,650
this is what we'll end up returning, and

111
00:06:54,650 --> 00:06:59,920
we'll transform the image passed in
from color to luminosity values only.

112
00:06:59,920 --> 00:07:03,300
So right now there's nothing
new or magical here to be done,

113
00:07:03,300 --> 00:07:05,140
this is just creating a grayscale image.

114
00:07:06,500 --> 00:07:10,680
The threshold value is usually provided
as a number between 0 and 255,

115
00:07:10,680 --> 00:07:14,070
which is the range of values a single byte can hold.

116
00:07:14,070 --> 00:07:14,940
The algorithm for

117
00:07:14,940 --> 00:07:19,560
the binarization is pretty simple,
go through every pixel in the image, and

118
00:07:19,560 --> 00:07:23,370
if it's greater than the threshold,
turn it all the way up, so to 255.

119
00:07:23,370 --> 00:07:27,163
And if it's lower than the threshold,
turn it all the way down, so that's to 0.

120
00:07:27,163 --> 00:07:29,500
So let's write this in code.

121
00:07:29,500 --> 00:07:31,910
First, we need to iterate
over all the pixels in the image.

122
00:07:34,070 --> 00:07:39,890
So for x in range, and we'll just go over
the width, so values along the x axis.

123
00:07:39,890 --> 00:07:42,859
And then for y in range,
it will go through the heights, so

124
00:07:42,859 --> 00:07:45,240
these will be our values
along the y axis.

125
00:07:45,240 --> 00:07:47,840
So for a given pixel at some width and
height,

126
00:07:47,840 --> 00:07:50,650
let's check its value against the threshold.

127
00:07:50,650 --> 00:07:53,833
So we could do this with
if output_image.getpixel,

128
00:07:53,833 --> 00:07:55,847
we'll just pull the pixel x and y.

129
00:07:55,847 --> 00:07:57,744
You'll note lots of brackets here,

130
00:07:57,744 --> 00:08:00,757
that's because we are actually
passing a tuple value in.

131
00:08:00,757 --> 00:08:04,530
We just check to see if it's
less than some threshold value.

132
00:08:04,530 --> 00:08:06,500
So let's set this to 0 if it is.

133
00:08:06,500 --> 00:08:11,220
So in our output image, we just put that
pixel, we pass in the same x, y, and

134
00:08:11,220 --> 00:08:12,240
we put it to 0.

135
00:08:12,240 --> 00:08:15,265
So we're just changing it to 0
if it's less than a threshold.

136
00:08:15,265 --> 00:08:22,342
Otherwise we want to set this to 255,
so output_image.putpixel( ( x,y), 255 ).

137
00:08:22,342 --> 00:08:24,090
And now we just return the new image.
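Putting the pieces together, the function walked through above would look roughly like this. The names binarize, image_to_transform, output_image, and threshold come from the walkthrough; the exact layout is a reconstruction of what was typed on screen, so treat it as a sketch.

```python
from PIL import Image


def binarize(image_to_transform, threshold):
    """Return a black and white copy of the image, split at threshold."""
    # Reduce the image to a single channel of luminosity values.
    output_image = image_to_transform.convert("L")
    # Visit every pixel: x runs over the width, y over the height.
    for x in range(output_image.width):
        for y in range(output_image.height):
            # getpixel and putpixel take the coordinate as an (x, y) tuple.
            if output_image.getpixel((x, y)) < threshold:
                output_image.putpixel((x, y), 0)    # below threshold: black
            else:
                output_image.putpixel((x, y), 255)  # otherwise: white
    return output_image
```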

138
00:08:26,050 --> 00:08:29,880
So let's test this function over
a range of different thresholds.

139
00:08:29,880 --> 00:08:33,480
Remember that you can use the range
function to generate a list of numbers at

140
00:08:33,480 --> 00:08:35,310
different step sizes.

141
00:08:35,310 --> 00:08:39,030
Range is called with a start,
a stop, and a step size.

142
00:08:39,030 --> 00:08:42,030
So let's try the range 0, 257, and 64,

143
00:08:42,030 --> 00:08:47,280
which should generate five images
of different threshold values.
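As a quick check of that range call, this snippet just lists the threshold values it produces. The variable name thresholds is only for illustration.

```python
# start=0, stop=257 (exclusive), step=64
thresholds = list(range(0, 257, 64))
print(thresholds)  # [0, 64, 128, 192, 256]
```

The stop value is 257 rather than 256 so that 256 itself is included.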

144
00:08:47,280 --> 00:08:54,320
So for a thresh in range 0 to 257,
and then we're going to step at 64.

145
00:08:54,320 --> 00:08:59,280
Let's print out a string to tell us
what threshold we're trying here.

146
00:08:59,280 --> 00:09:03,350
And so we want to change, remember,
the thresh value is an integer, so

147
00:09:03,350 --> 00:09:05,980
we'll change it to a string
here using the str function.

148
00:09:07,100 --> 00:09:09,070
And then let's display
the binarized image in-line.

149
00:09:10,510 --> 00:09:14,706
And so the way we do this is the display
function, then we're going to call our

150
00:09:14,706 --> 00:09:19,854
function, binarize, we're going to pass
it the image.open, read_only/Noisy_OCR.

151
00:09:19,854 --> 00:09:23,682
We could of course cache this, open it,
and pass it around as a parameter, but

152
00:09:23,682 --> 00:09:26,570
it's okay for
our demonstration to do it this way.

153
00:09:26,570 --> 00:09:31,120
And then we'll send in the threshold
value, which will be 0 the first time,

154
00:09:31,120 --> 00:09:33,509
64 for the second time, and so forth.

155
00:09:33,509 --> 00:09:35,310
And let's use tesseract on it.

156
00:09:35,310 --> 00:09:39,400
It's inefficient the binarize it twice,
but this is really just for a demo.

157
00:09:39,400 --> 00:09:43,250
So here we'll call print
pytesseract.image_to_string,

158
00:09:43,250 --> 00:09:47,837
passing in then a call to binarize,
which passes in a call to image.open.

159
00:09:47,837 --> 00:09:52,339
So there's a lot of image.opens here, lots
of room this code could be improved, but

160
00:09:52,339 --> 00:09:54,350
it should generate an example for us.
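One way the loop could be tidied up, as a sketch only: open the image once, binarize it once per threshold, and collect the OCR text in a dictionary. The helper name ocr_over_thresholds and its ocr parameter are invented for this example, and binarize is rewritten here with pillow's point method instead of the pixel-by-pixel loop, which should give the same result for grayscale images.

```python
from PIL import Image


def binarize(image_to_transform, threshold):
    """Same rule as the loop version: below threshold is 0, otherwise 255."""
    gray = image_to_transform.convert("L")
    return gray.point(lambda p: 0 if p < threshold else 255)


def ocr_over_thresholds(img, thresholds=range(0, 257, 64), ocr=None):
    """Binarize img at each threshold and return {threshold: OCR result}."""
    if ocr is None:
        # Imported lazily so the helper can be tried without Tesseract installed.
        import pytesseract
        ocr = pytesseract.image_to_string
    # Each threshold is binarized exactly once and passed straight to OCR.
    return {t: ocr(binarize(img, t)) for t in thresholds}
```

With the lecture's file this might be called as ocr_over_thresholds(Image.open('read_only/Noisy_OCR.PNG')), then you can print each threshold and its text in a loop.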

161
00:09:55,830 --> 00:09:59,440
So you can see the result with
threshold 0 is pretty empty.

162
00:09:59,440 --> 00:10:04,663
With threshold 64 we actually
get a very faint looking image,

163
00:10:04,663 --> 00:10:08,641
but it seems like we get all of or
most of the text.

164
00:10:08,641 --> 00:10:12,333
When we increase the threshold
to 128 from 64,

165
00:10:12,333 --> 00:10:17,306
we see that we actually pick up a new
space between the words of and this, so

166
00:10:17,306 --> 00:10:20,620
we're getting more definition in the text.

167
00:10:20,620 --> 00:10:24,270
But then when we increase
the threshold all the way to 192,

168
00:10:24,270 --> 00:10:29,570
we lose a lot of text because a whole
segment of the image becomes black.

169
00:10:29,570 --> 00:10:32,691
And then, all of a sudden,
at the very top end threshold,

170
00:10:32,691 --> 00:10:36,032
we get nothing because the whole
image is black at that point.

171
00:10:38,560 --> 00:10:43,117
We can see from this that a threshold
of 0 essentially turns everything white,

172
00:10:43,117 --> 00:10:47,470
that the text becomes more bold as
we move towards a higher threshold.

173
00:10:47,470 --> 00:10:48,230
And the shapes,

174
00:10:48,230 --> 00:10:52,840
which have a filled-in gray color,
become more evident at higher thresholds.

175
00:10:52,840 --> 00:10:56,670
In the next lecture, we'll look a bit more
at some of the challenges you can expect

176
00:10:56,670 --> 00:10:58,500
when doing OCR on real data.