1
00:00:07,940 --> 00:00:10,380
Let's try a new example and

2
00:00:10,380 --> 00:00:12,420
bring together some of the
things that we've learned.

3
00:00:12,420 --> 00:00:14,220
Here's an image of a storefront.

4
00:00:14,220 --> 00:00:15,510
Let's load it, and try and get

5
00:00:15,510 --> 00:00:17,430
the name of the store
out of that image.

6
00:00:17,430 --> 00:00:20,250
So from PIL, we'll need
the image package of course,

7
00:00:20,250 --> 00:00:22,980
and then let's bring in
pytesseract as well.

8
00:00:22,980 --> 00:00:25,275
So let's read in
the store front image

9
00:00:25,275 --> 00:00:27,645
I've loaded into
the course and display it.

10
00:00:27,645 --> 00:00:31,395
So I put this in
read_only/Storefront.jpg,

11
00:00:31,395 --> 00:00:33,210
and we'll just open
that as an image,

12
00:00:33,210 --> 00:00:34,815
and display it in line.

13
00:00:34,815 --> 00:00:36,570
Then finally, let's try and run

14
00:00:36,570 --> 00:00:39,225
tesseract on that image and
see what the results are.

15
00:00:39,225 --> 00:00:42,220
So we'll call
image_to_string on that.

16
00:00:46,490 --> 00:00:48,640
We see at the very bottom that

17
00:00:48,640 --> 00:00:50,140
there's just an empty string.

18
00:00:50,140 --> 00:00:51,730
Tesseract is unable to

19
00:00:51,730 --> 00:00:53,765
take this image and
pull out the name.

20
00:00:53,765 --> 00:00:55,300
But we looked at how
to crop an image

21
00:00:55,300 --> 00:00:56,560
in the last set of lectures.

22
00:00:56,560 --> 00:00:58,330
So let's try and
help tesseract by

23
00:00:58,330 --> 00:01:00,640
cropping out certain pieces.

24
00:01:00,640 --> 00:01:03,520
So first, we have to
set the bounding box.

25
00:01:03,520 --> 00:01:05,620
In this image,
the store name is in

26
00:01:05,620 --> 00:01:08,930
a box bounded by roughly 315,

27
00:01:08,930 --> 00:01:12,365
170, 700, and 270.

28
00:01:12,365 --> 00:01:15,340
So I'll make a bounding box
equal to this tuple.

29
00:01:15,340 --> 00:01:18,040
Remember that's
the upper left corner,

30
00:01:18,040 --> 00:01:19,810
and then we walk
around the image,

31
00:01:19,810 --> 00:01:21,820
and you can go back to
the PIL lecture if you

32
00:01:21,820 --> 00:01:24,455
want to be reminded
how to do this.

33
00:01:24,455 --> 00:01:26,855
Now, let's crop the image.

34
00:01:26,855 --> 00:01:28,775
So we just call the image.crop,

35
00:01:28,775 --> 00:01:30,425
and we pass in a bounding box.

36
00:01:30,425 --> 00:01:31,850
It doesn't change the image,

37
00:01:31,850 --> 00:01:33,080
it returns a new image.

38
00:01:33,080 --> 00:01:34,520
So we save this to the title_image

39
00:01:34,520 --> 00:01:37,445
variable that
we'll use later.

40
00:01:37,445 --> 00:01:41,150
Now, let's display it
and pull out the text.

41
00:01:41,150 --> 00:01:42,620
So we'll call display,

42
00:01:42,620 --> 00:01:45,590
and then we'll call
pytesseract.image_to_string,

43
00:01:45,590 --> 00:01:48,545
and pass in the title image.
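
The crop step described here can be sketched as follows, assuming Pillow is installed. The helper name `crop_region` is hypothetical, and a blank synthetic image stands in for the storefront photo so the snippet runs without the course files.

```python
from PIL import Image

def crop_region(image, bounding_box):
    """Return a new image cut from (left, upper, right, lower); the source is unchanged."""
    return image.crop(bounding_box)

# A blank stand-in for the storefront photo (an assumption for illustration).
image = Image.new("RGB", (1200, 800), "white")

# Bounding box quoted in the lecture: upper left (315, 170), lower right (700, 270).
bounding_box = (315, 170, 700, 270)
title_image = crop_region(image, bounding_box)

print(title_image.size)  # width is 700 - 315, height is 270 - 170
print(image.size)        # the original image keeps its size
```

With the real photo, the cropped result would then be passed to `pytesseract.image_to_string(title_image)`, as in the lecture.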

44
00:01:48,545 --> 00:01:52,340
Great. So we see how with
a bit of problem reduction,

45
00:01:52,340 --> 00:01:53,690
we can make that work.

46
00:01:53,690 --> 00:01:55,975
So now we've been able
to take an image,

47
00:01:55,975 --> 00:01:58,630
pre-process it where
we expect to see text,

48
00:01:58,630 --> 00:02:00,170
and turn that text into

49
00:02:00,170 --> 00:02:02,990
a string that Python
can understand.

50
00:02:02,990 --> 00:02:05,600
If you look back up
at the image though,

51
00:02:05,600 --> 00:02:08,780
you'll see that there's
a small sign inside of the shop

52
00:02:08,780 --> 00:02:11,330
that also has
the shop name on it.

53
00:02:11,330 --> 00:02:13,130
I wonder if we are
able to recognize

54
00:02:13,130 --> 00:02:14,675
the text on that sign.

55
00:02:14,675 --> 00:02:16,910
Let's give it a try. First, we

56
00:02:16,910 --> 00:02:19,445
need to determine
a bounding box for that sign.

57
00:02:19,445 --> 00:02:21,275
I'm going to show you
a short-cut to make this

58
00:02:21,275 --> 00:02:23,915
easier in an optional video
in this module.

59
00:02:23,915 --> 00:02:25,190
But for now, let's just use

60
00:02:25,190 --> 00:02:27,440
the bounding box
that I decided on.

61
00:02:27,440 --> 00:02:29,825
The bounding box, we'll
set this to a tuple of

62
00:02:29,825 --> 00:02:32,840
900 by 420 for the upper left,

63
00:02:32,840 --> 00:02:36,695
and then 940 by 445
for the lower right.

64
00:02:36,695 --> 00:02:39,005
Now, let's crop the image.

65
00:02:39,005 --> 00:02:40,760
So we just call image.crop,

66
00:02:40,760 --> 00:02:42,290
pass it in the bounding box,

67
00:02:42,290 --> 00:02:43,700
and we'll call
this little_sign for

68
00:02:43,700 --> 00:02:46,470
fun and display that little_sign.

69
00:02:46,470 --> 00:02:49,275
All right. This is a little sign.

70
00:02:49,275 --> 00:02:52,060
OCR works better with
higher resolution images,

71
00:02:52,060 --> 00:02:54,110
so let's increase
the size of this image

72
00:02:54,110 --> 00:02:56,375
by using the Pillow
resize function.

73
00:02:56,375 --> 00:02:58,580
Let's set the width
and the height equal

74
00:02:58,580 --> 00:03:00,755
to 10 times the size it is now,

75
00:03:00,755 --> 00:03:03,260
in a (w, h) tuple.

76
00:03:03,260 --> 00:03:05,345
So we'll take the new size,

77
00:03:05,345 --> 00:03:06,695
and we'll make it
equal to

78
00:03:06,695 --> 00:03:08,495
little_sign.width times 10,

79
00:03:08,495 --> 00:03:11,565
and
little_sign.height times 10.

80
00:03:11,565 --> 00:03:15,160
Now, let's check
the docs for resize.

81
00:03:16,960 --> 00:03:19,730
We can see here that
there's a number of

82
00:03:19,730 --> 00:03:22,325
different filters for
resizing the image.

83
00:03:22,325 --> 00:03:23,960
The default is

84
00:03:23,960 --> 00:03:26,810
Image.NEAREST. Let's see
what that looks like.

85
00:03:26,810 --> 00:03:29,540
So we'll take our
little_sign.resize,

86
00:03:29,540 --> 00:03:31,850
we'll pass in
the new bounding box size.

87
00:03:31,850 --> 00:03:33,050
So that's new_size,

88
00:03:33,050 --> 00:03:35,250
and then we'll say Image.NEAREST

89
00:03:35,250 --> 00:03:38,335
all in caps and pass
that to display.

90
00:03:38,335 --> 00:03:42,230
So here you can see that it
actually resized the image,

91
00:03:42,230 --> 00:03:44,150
and now it's maybe
much more readable.

92
00:03:44,150 --> 00:03:44,480
I don't know.

93
00:03:44,480 --> 00:03:46,340
I didn't have trouble
maybe seeing it before.

94
00:03:46,340 --> 00:03:47,525
Although it was little,

95
00:03:47,525 --> 00:03:49,550
and it says the word fossil.

96
00:03:49,550 --> 00:03:51,290
I think we should be able to find

97
00:03:51,290 --> 00:03:52,400
something better though.

98
00:03:52,400 --> 00:03:55,270
I can read this, but it
looks really pixelated.

99
00:03:55,270 --> 00:03:56,690
Let's see what all the different

100
00:03:56,690 --> 00:03:58,580
resize options look like.

101
00:03:58,580 --> 00:04:00,260
You can go back up to

102
00:04:00,260 --> 00:04:03,680
the documentation to
look at the names.

103
00:04:03,680 --> 00:04:05,090
So here I'm going to make

104
00:04:05,090 --> 00:04:07,340
just a list of all the
different names as options.

105
00:04:07,340 --> 00:04:12,035
So Image.NEAREST,
Image.BOX, Image.BILINEAR,

106
00:04:12,035 --> 00:04:15,440
Image.HAMMING, Image.BICUBIC,

107
00:04:15,440 --> 00:04:20,210
and Image.LANCZOS is
how you say that.

108
00:04:20,210 --> 00:04:22,250
So for each of the options,

109
00:04:22,250 --> 00:04:23,570
I'm just going to
iterate over these.

110
00:04:23,570 --> 00:04:25,370
Let's print out the option name.

111
00:04:25,370 --> 00:04:27,560
So print out whatever
the option name is,

112
00:04:27,560 --> 00:04:28,880
and then let's display what

113
00:04:28,880 --> 00:04:31,340
this option looks like
on our little sign.

114
00:04:31,340 --> 00:04:34,080
So here we're actually going
to call little_sign.resize,

115
00:04:34,080 --> 00:04:35,375
pass in the new size,

116
00:04:35,375 --> 00:04:37,220
pass in the option
that we're looking at,

117
00:04:37,220 --> 00:04:39,300
and pass that to display.

118
00:04:39,460 --> 00:04:42,170
So you can see that this has run,

119
00:04:42,170 --> 00:04:44,210
and we have a whole bunch

120
00:04:44,210 --> 00:04:46,190
of different numbers printed,

121
00:04:46,190 --> 00:04:50,580
and then different images
that are interesting.
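
The loop over resampling options might look like the sketch below, assuming Pillow is installed. A plain gray 40 x 25 image (the size implied by the lecture's second bounding box) stands in for the cropped sign, and sizes are printed in place of the notebook's `display` call. The `getattr` lookup is an addition, not the lecture's code: it is there because newer Pillow releases also group these filters under `Image.Resampling`.

```python
from PIL import Image

# Newer Pillow releases group the filters under Image.Resampling; older ones
# expose them directly on Image. This lookup works with either layout.
filters = getattr(Image, "Resampling", Image)

# A small stand-in for the cropped sign (an assumption for illustration).
little_sign = Image.new("RGB", (40, 25), "gray")
new_size = (little_sign.width * 10, little_sign.height * 10)

option_names = ["NEAREST", "BOX", "BILINEAR", "HAMMING", "BICUBIC", "LANCZOS"]
resized = {}
for name in option_names:
    option = getattr(filters, name)
    resized[name] = little_sign.resize(new_size, option)
    # int() shows the underlying integer value behind each option name.
    print(name, int(option), resized[name].size)
```

On a real crop, the visual difference between filters only shows when each resized image is displayed.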

122
00:04:50,930 --> 00:04:53,825
So from this, we can
notice two things.

123
00:04:53,825 --> 00:04:56,960
First, when we print out one
of the re-sampling values,

124
00:04:56,960 --> 00:04:58,910
it actually just
prints an integer.

125
00:04:58,910 --> 00:05:00,980
This is actually
really common that

126
00:05:00,980 --> 00:05:02,450
the API developer writes

127
00:05:02,450 --> 00:05:05,260
a property such as Image.BICUBIC,

128
00:05:05,260 --> 00:05:06,620
and then assigns it to

129
00:05:06,620 --> 00:05:09,065
an integer value
to pass it around.

130
00:05:09,065 --> 00:05:11,480
Some languages use
enumerations of

131
00:05:11,480 --> 00:05:13,970
values which is
common in say, Java.

132
00:05:13,970 --> 00:05:15,230
But in Python, this is

133
00:05:15,230 --> 00:05:17,555
a pretty normal way
of doing things.
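
To make the contrast concrete, here is a small standard-library sketch of the two styles. The names and values are illustrative only and are not taken from Pillow's source.

```python
from enum import IntEnum

# Style 1: plain module-level integer constants (hypothetical names).
NEAREST = 0
BICUBIC = 3

# Style 2: an enumeration; members still behave like integers but carry a name.
class Resampling(IntEnum):
    NEAREST = 0
    BICUBIC = 3

print(BICUBIC)                        # prints the bare integer
print(repr(Resampling.BICUBIC))       # repr shows the member name and its value
print(Resampling.BICUBIC == BICUBIC)  # both styles compare equal as integers
```

Either style can be passed to a function that expects an integer. The enumeration simply keeps the readable name attached.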

134
00:05:17,555 --> 00:05:20,390
The second thing we learned
is that there's a number of

135
00:05:20,390 --> 00:05:23,150
different algorithms for
the image re-sampling.

136
00:05:23,150 --> 00:05:25,280
In this case, the Image.LANCZOS and

137
00:05:25,280 --> 00:05:28,505
Image.BICUBIC filters
do a good job.

138
00:05:28,505 --> 00:05:30,335
Everything else not so much.

139
00:05:30,335 --> 00:05:31,985
So let's see if we
are able to recognize

140
00:05:31,985 --> 00:05:34,385
the text off this resized image.

141
00:05:34,385 --> 00:05:37,730
So first, let's resize
to the larger size.

142
00:05:37,730 --> 00:05:39,875
So I'm going to create
something called bigger_sign,

143
00:05:39,875 --> 00:05:41,870
and I'm going to take
little_sign.resize,

144
00:05:41,870 --> 00:05:44,655
I'm going to pass in
our new size that we want.

145
00:05:44,655 --> 00:05:45,990
Then I'm going to use

146
00:05:45,990 --> 00:05:49,560
Image.BICUBIC for lack of
any personal preference.

147
00:05:49,560 --> 00:05:52,490
You feel free to try one
of the different methods.

148
00:05:52,490 --> 00:05:54,215
Then let's print out the text.

149
00:05:54,215 --> 00:05:56,630
So we'll call
pytesseract.image_to_string,

150
00:05:56,630 --> 00:05:59,250
and pass in the bigger_sign.

151
00:05:59,860 --> 00:06:02,825
Well, not really any text there.

152
00:06:02,825 --> 00:06:04,550
Let's try and binarize this.

153
00:06:04,550 --> 00:06:06,320
So first, let me just bring in

154
00:06:06,320 --> 00:06:09,930
the binarization
code we did earlier.

155
00:06:11,960 --> 00:06:14,700
Now, let's apply binarization.

156
00:06:14,700 --> 00:06:17,265
with, say, a threshold of 190,

157
00:06:17,265 --> 00:06:21,185
and try to display that as
well as do the OCR work.

158
00:06:21,185 --> 00:06:24,180
So binarize, remember,
that function takes in

159
00:06:24,180 --> 00:06:27,635
the sign or the image I guess
that we want to binarize,

160
00:06:27,635 --> 00:06:30,830
and then a value
between zero and 255.

161
00:06:30,830 --> 00:06:32,990
It's going to walk through
the image pixel by pixel

162
00:06:32,990 --> 00:06:35,420
and either
set each pixel to zero or one.

163
00:06:35,420 --> 00:06:38,080
So change it straight
up black and white.
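
The earlier binarization code is not reproduced in this transcript, so the following is only a sketch of what such a function might look like, assuming Pillow is installed. It uses `Image.point` rather than an explicit pixel loop, and it writes 0 and 255 (the black and white extremes of an 8-bit grayscale image); both choices are assumptions.

```python
from PIL import Image

def binarize(image_to_transform, threshold):
    """Return a black-and-white copy: pixels above threshold become 255, the rest 0."""
    grayscale = image_to_transform.convert("L")  # single 8-bit channel, 0..255
    return grayscale.point(lambda value: 255 if value > threshold else 0)

# Two-pixel test image: one dark pixel (100) and one light pixel (200).
sample = Image.new("L", (2, 1))
sample.putpixel((0, 0), 100)
sample.putpixel((1, 0), 200)

result = binarize(sample, 190)
print(result.getpixel((0, 0)), result.getpixel((1, 0)))
```

With a threshold of 190, only pixels brighter than 190 stay white, which is why a poorly chosen threshold can wipe out most of the lettering.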

164
00:06:38,080 --> 00:06:40,930
Then we'll display what
the binarized sign looks like,

165
00:06:40,930 --> 00:06:42,740
and then let's actually try and

166
00:06:42,740 --> 00:06:44,975
get the text out with
pytesseract too,

167
00:06:44,975 --> 00:06:46,490
in the hope that 190 is

168
00:06:46,490 --> 00:06:49,650
actually a good number
for us to use.

169
00:06:49,900 --> 00:06:53,165
Well, that looks pretty
abysmal I would say.

170
00:06:53,165 --> 00:06:55,580
It doesn't look
at all like fossil.

171
00:06:55,580 --> 00:06:58,070
I guess you could see
some of the S's there,

172
00:06:58,070 --> 00:07:01,720
but really not much
in that image at all.

173
00:07:01,720 --> 00:07:03,620
So the text is pretty useless.

174
00:07:03,620 --> 00:07:06,860
How should we pick
the best binarization to use?

175
00:07:06,860 --> 00:07:08,700
There's a number of
different methods.

176
00:07:08,700 --> 00:07:10,010
But let's just try something

177
00:07:10,010 --> 00:07:12,500
very simple to show
how this can work.

178
00:07:12,500 --> 00:07:14,960
We have an English word that
we're trying to detect,

179
00:07:14,960 --> 00:07:19,070
it's "FOSSIL". If we
tried all binarizations from

180
00:07:19,070 --> 00:07:21,395
zero through 255 and look to

181
00:07:21,395 --> 00:07:23,945
see if there were any
English words in that list,

182
00:07:23,945 --> 00:07:25,535
this might be one way.

183
00:07:25,535 --> 00:07:28,055
So let's see if we could
write a routine to do this.

184
00:07:28,055 --> 00:07:30,425
So we're problem-solving
on our own here.

185
00:07:30,425 --> 00:07:33,620
So first, let's load a list
of English words into a list.

186
00:07:33,620 --> 00:07:35,090
I put a copy in the read_only

187
00:07:35,090 --> 00:07:36,845
directory for you to work with.

188
00:07:36,845 --> 00:07:39,060
So create something called eng_dict,

189
00:07:39,060 --> 00:07:40,575
it's just an empty list.

190
00:07:40,575 --> 00:07:42,660
Then I'm going to open the

191
00:07:42,660 --> 00:07:44,580
read_only/words_alpha.txt
as read,

192
00:07:44,580 --> 00:07:45,860
you can go back into one of

193
00:07:45,860 --> 00:07:47,690
the previous courses
if this doesn't look

194
00:07:47,690 --> 00:07:51,020
very familiar to you on
how to work with files.

195
00:07:51,020 --> 00:07:52,250
We're going to call the file

196
00:07:52,250 --> 00:07:54,590
f. Then I'm just
going to read all

197
00:07:54,590 --> 00:07:58,550
of f in one giant chunk
and put that in data.

198
00:07:58,550 --> 00:08:00,410
So now we actually want
to split this into

199
00:08:00,410 --> 00:08:02,585
a list based on
those new line characters.

200
00:08:02,585 --> 00:08:05,030
So if you go look in that
data file words alpha,

201
00:08:05,030 --> 00:08:07,070
you'll see it's
one word per line.

202
00:08:07,070 --> 00:08:10,070
So I'll call data.split
on "\n",

203
00:08:10,070 --> 00:08:11,600
this is the new line character,

204
00:08:11,600 --> 00:08:14,270
and this will return
a new list which is all of

205
00:08:14,270 --> 00:08:15,590
the different words and I'll

206
00:08:15,590 --> 00:08:17,710
put this into eng_dict.
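
The file-loading steps just described can be sketched with the standard library alone. The helper name `load_words` and the temporary sample file are assumptions so the snippet runs anywhere; in the lecture the path points at the word list in the read_only directory.

```python
import os
import tempfile

def load_words(path):
    """Read a one-word-per-line file and return the words as a list."""
    with open(path, "r") as f:
        data = f.read()
    return data.split("\n")

# Demonstrate with a tiny temporary file instead of the course's word list.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample_words.txt")
    with open(sample_path, "w") as f:
        f.write("fossil\nw\nstore")
    eng_dict = load_words(sample_path)

print(eng_dict)
print("fossil" in eng_dict)
```

If the file ends with a newline, `split("\n")` leaves an empty string at the end of the list, which is worth keeping in mind when testing membership later.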

207
00:08:17,710 --> 00:08:19,640
Now, let's iterate through

208
00:08:19,640 --> 00:08:21,770
all the possible
thresholds and look for

209
00:08:21,770 --> 00:08:24,410
an English word, printing
it out if it exists.

210
00:08:24,410 --> 00:08:27,905
So for i in range(150, 170),

211
00:08:27,905 --> 00:08:29,600
I'm just going to
binarize between

212
00:08:29,600 --> 00:08:35,590
those values, binarizing and
converting this to string values,

213
00:08:35,590 --> 00:08:40,000
and then string we'll set to
pytesseract.image_to_string.

214
00:08:40,000 --> 00:08:42,110
So we'll binarize, passing in

215
00:08:42,110 --> 00:08:44,570
the bigger_sign and
our given i value.

216
00:08:44,570 --> 00:08:45,770
So this is binarized with

217
00:08:45,770 --> 00:08:49,550
150, 151, 152, 153, and so forth.

218
00:08:49,550 --> 00:08:51,070
I'm going to try them all between

219
00:08:51,070 --> 00:08:55,465
these two threshold values
150, and 170.

220
00:08:55,465 --> 00:08:58,885
So we want to remove
all non-alphabetical characters.

221
00:08:58,885 --> 00:09:01,380
So that includes
parentheses, brackets,

222
00:09:01,380 --> 00:09:03,515
percentage signs, dollar signs,

223
00:09:03,515 --> 00:09:05,080
et cetera from the text.

224
00:09:05,080 --> 00:09:07,285
So here's a short
method to do that.

225
00:09:07,285 --> 00:09:10,570
So first, let's convert
our string to lowercase only.

226
00:09:10,570 --> 00:09:13,740
So string.lower, and
we'll just change string.

227
00:09:13,740 --> 00:09:15,820
Then let's import
the string package.

228
00:09:15,820 --> 00:09:18,400
It's got a nice list of
lowercase characters.

229
00:09:18,400 --> 00:09:20,560
So import string, and

230
00:09:20,560 --> 00:09:22,830
now let's just iterate
over a string,

231
00:09:22,830 --> 00:09:24,630
looking at it character
by character,

232
00:09:24,630 --> 00:09:26,770
putting it in
the comparison text.

233
00:09:26,770 --> 00:09:29,410
So we'll create
some new value comparison,

234
00:09:29,410 --> 00:09:31,480
and then for every character
in our string,

235
00:09:31,480 --> 00:09:33,290
remember this is lowercase,

236
00:09:33,290 --> 00:09:37,085
if that character is in
string.ascii_lowercase.

237
00:09:37,085 --> 00:09:39,635
So this is actually
just checking to see if

238
00:09:39,635 --> 00:09:44,450
a single character is in
a list of characters.

239
00:09:44,450 --> 00:09:47,420
Remember, a string and a list
of characters are the same

240
00:09:47,420 --> 00:09:50,210
when you use in. If so, then

241
00:09:50,210 --> 00:09:53,840
comparison is equal to
comparison plus that character.

242
00:09:53,840 --> 00:09:57,245
So we just append it to
our output string. All right.
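
The character filter described above can be written as a small function. The name `keep_letters` is hypothetical; the body follows the steps as narrated (lowercase, then keep only characters found in `string.ascii_lowercase`).

```python
import string

def keep_letters(text):
    """Lowercase the text and drop every character that is not a-z."""
    comparison = ""
    for character in text.lower():
        # A string supports "in" the same way a list of characters would.
        if character in string.ascii_lowercase:
            comparison = comparison + character
    return comparison

print(keep_letters("FOSSIL!\n"))
print(keep_letters("(f0ssil) $5"))
```

Note that digits are dropped as well as punctuation, so an OCR result that misreads a letter as a digit comes out shorter rather than corrected.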

243
00:09:57,245 --> 00:09:58,520
Finally, let's search for

244
00:09:58,520 --> 00:10:00,920
the comparison in
the dictionary file.

245
00:10:00,920 --> 00:10:02,540
So that's easy in Python.

246
00:10:02,540 --> 00:10:05,090
In other languages, you would
have to do a lot of work.

247
00:10:05,090 --> 00:10:07,940
But here we just use
the in operator,

248
00:10:07,940 --> 00:10:10,505
and see if comparison
is in eng_dict.

249
00:10:10,505 --> 00:10:12,755
Then we're going to print
it out if we find it.

250
00:10:12,755 --> 00:10:14,455
So we'll print comparison.

251
00:10:14,455 --> 00:10:17,110
All right, let's run that.

252
00:10:17,140 --> 00:10:19,685
So you should start to see that

253
00:10:19,685 --> 00:10:21,740
various characters come up,

254
00:10:21,740 --> 00:10:25,430
and in my case fossil
came up and W came up.

255
00:10:25,430 --> 00:10:28,610
So W is also in this dictionary,

256
00:10:28,610 --> 00:10:31,670
and a W was detected in the data

257
00:10:31,670 --> 00:10:35,090
that we sent in, in at least one
of the binarizations.

258
00:10:35,090 --> 00:10:37,895
So, well, this is
not perfect, but we can

259
00:10:37,895 --> 00:10:40,504
see fossil there
among other values,

260
00:10:40,504 --> 00:10:43,550
and this is not a bad way
actually to clean up OCR data.

261
00:10:43,550 --> 00:10:45,860
It can be useful to
use a language or

262
00:10:45,860 --> 00:10:48,350
domain-specific dictionary
in practice,

263
00:10:48,350 --> 00:10:50,870
instead of all of
the English language words,

264
00:10:50,870 --> 00:10:53,435
especially if you're
generating a search engine for

265
00:10:53,435 --> 00:10:55,610
a specialized language such as

266
00:10:55,610 --> 00:10:57,990
a medical knowledge base,
or locations.

267
00:10:57,990 --> 00:10:59,750
So like cities. If you

268
00:10:59,750 --> 00:11:01,920
scroll up and look at
the data we're working with,

269
00:11:01,920 --> 00:11:04,010
this tiny little wall hanging in

270
00:11:04,010 --> 00:11:07,160
the inside of the store
is really not so bad.

271
00:11:07,160 --> 00:11:09,560
A lot of this comes down
to the purpose that

272
00:11:09,560 --> 00:11:12,050
you're actually
doing the OCR for.

273
00:11:12,050 --> 00:11:13,950
So if you are using
it for instance to

274
00:11:13,950 --> 00:11:16,100
back a search engine,
that's one thing.

275
00:11:16,100 --> 00:11:19,895
If you're using it to do
text-to-speech for instance,

276
00:11:19,895 --> 00:11:22,010
and somebody is
going to use this to

277
00:11:22,010 --> 00:11:24,500
listen to a lecture, that's
completely different,

278
00:11:24,500 --> 00:11:27,290
and you have to have
a very very strong method

279
00:11:27,290 --> 00:11:30,505
for generating the actual data.

280
00:11:30,505 --> 00:11:33,110
So at this point, you've now
learned how to manipulate

281
00:11:33,110 --> 00:11:35,465
images and convert
them into text.

282
00:11:35,465 --> 00:11:37,070
In the next module
in this course,

283
00:11:37,070 --> 00:11:38,690
we're going to dig deeper

284
00:11:38,690 --> 00:11:40,760
into a computer vision library,

285
00:11:40,760 --> 00:11:43,640
which allows us to detect
faces among other things.

286
00:11:43,640 --> 00:11:45,100
So then, we'll go on to

287
00:11:45,100 --> 00:11:48,190
a culminating project.
I'll see you there.