1
00:00:08,987 --> 00:00:11,579
Optical character recognition, or OCR,

2
00:00:11,579 --> 00:00:16,880
is the conversion of text captured in
images into text usable by a computer.

3
00:00:16,880 --> 00:00:21,120
In other words,
an OCR tool can read the text in images.

4
00:00:21,120 --> 00:00:25,120
OCR is a common method of processing
large volumes of printed text,

5
00:00:25,120 --> 00:00:28,840
especially when that text isn't
available in a digital format.

6
00:00:28,840 --> 00:00:32,920
In practical applications, OCR has
been used to scan the pages of books,

7
00:00:32,920 --> 00:00:37,670
to recognize license plates, and even to
convert handwriting into digitized text.

8
00:00:38,780 --> 00:00:41,248
Let me share an example from my own work.

9
00:00:41,248 --> 00:00:44,980
During my doctoral degree, I was working
on an open source system called

10
00:00:44,980 --> 00:00:46,940
Opencast Matterhorn.

11
00:00:46,940 --> 00:00:51,350
This system allows for the automated
recording of lectures within universities,

12
00:00:51,350 --> 00:00:53,720
similar in some ways to
the video you're watching now.

13
00:00:54,950 --> 00:00:59,980
The system generally ran automatically
when an instructor was teaching a course.

14
00:00:59,980 --> 00:01:02,780
And the video was uploaded to
the web immediately following

15
00:01:02,780 --> 00:01:05,250
the lecture without any
input from technicians.

16
00:01:06,660 --> 00:01:10,620
This was great, but it was hard to find
videos about a given topic that

17
00:01:10,620 --> 00:01:13,440
might have been covered
during the lecture.

18
00:01:13,440 --> 00:01:18,280
To deal with this, we built a search index
from the contents of videos themselves.

19
00:01:18,280 --> 00:01:21,926
We essentially broke a video
up into a sequence of images,

20
00:01:21,926 --> 00:01:26,147
ran OCR on each image to determine what
text might have been shown to the students

21
00:01:26,147 --> 00:01:27,976
from the slides in the classroom,

22
00:01:27,976 --> 00:01:31,934
and then created a search index
using this technique.

23
00:01:31,934 --> 00:01:35,754
In the project for this course, you're
going to be doing something similar.

24
00:01:35,754 --> 00:01:38,490
But we're going to be doing it
with images of newspapers instead.

25
00:01:40,030 --> 00:01:41,380
In this module of the course,

26
00:01:41,380 --> 00:01:44,630
we're going to be using an OCR
engine called Tesseract.

27
00:01:44,630 --> 00:01:48,651
Tesseract was originally
developed between 1984 and

28
00:01:48,651 --> 00:01:51,856
1994 as a PhD research project at HP Labs.

29
00:01:51,856 --> 00:01:56,131
The engine vastly outperformed
commercial products at the time, but

30
00:01:56,131 --> 00:02:01,625
then development was stopped until HP
released Tesseract as open source in 2005.

31
00:02:01,625 --> 00:02:06,272
In 2006, Google began maintaining
the tool and has since released

32
00:02:06,272 --> 00:02:11,006
updated versions of Tesseract with
support for over 100 languages.

33
00:02:11,006 --> 00:02:15,012
I think Tesseract is a great tool, and
a wonderful example of how companies can

34
00:02:15,012 --> 00:02:18,060
engage in open source
software development.

35
00:02:18,060 --> 00:02:20,840
So before we spend a lot of
time talking about Tesseract,

36
00:02:20,840 --> 00:02:23,020
let's talk more about
open source software.