1 00:00:08,987 --> 00:00:11,579 Optical character recognition, or OCR, 2 00:00:11,579 --> 00:00:16,880 is the conversion of text captured in images into text usable by a computer. 3 00:00:16,880 --> 00:00:21,120 In other words, an OCR tool can read the text in images. 4 00:00:21,120 --> 00:00:25,120 OCR is a common method of processing large volumes of printed text, 5 00:00:25,120 --> 00:00:28,840 especially when that text isn't available in a digital format. 6 00:00:28,840 --> 00:00:32,920 In practical applications, OCR has been used to scan the pages of books, 7 00:00:32,920 --> 00:00:37,670 to recognize license plates, and even to convert handwriting into digitized text. 8 00:00:38,780 --> 00:00:41,248 Let me share an example from my own work. 9 00:00:41,248 --> 00:00:44,980 During my doctoral degree I was working on an open source system called 10 00:00:44,980 --> 00:00:46,940 Opencast Matterhorn. 11 00:00:46,940 --> 00:00:51,350 This system allows for the automated recording of lectures within universities, 12 00:00:51,350 --> 00:00:53,720 similar in some ways to the video you're watching now. 13 00:00:54,950 --> 00:00:59,980 The system generally ran automatically when an instructor was teaching a course. 14 00:00:59,980 --> 00:01:02,780 And the video was uploaded to the web immediately following 15 00:01:02,780 --> 00:01:05,250 the lecture without any input from technicians. 16 00:01:06,660 --> 00:01:10,620 This was great, but it was hard to find videos about your given topic that 17 00:01:10,620 --> 00:01:13,440 might have been covered during the lecture. 18 00:01:13,440 --> 00:01:18,280 To deal with this we built a search index from the contents of videos themselves. 19 00:01:18,280 --> 00:01:21,926 We essentially broke a video up into a sequence of images. 20 00:01:21,926 --> 00:01:26,147 Ran OCR on each image to determine what text might have been shown to the students 21 00:01:26,147 --> 00:01:27,976 from the slides in the classroom. 
22 00:01:27,976 --> 00:01:31,934 Then created a search index using this technique. 23 00:01:31,934 --> 00:01:35,754 In the project for this course, you're going to be doing something similar. 24 00:01:35,754 --> 00:01:38,490 But we're going to be doing it with images of newspapers instead. 25 00:01:40,030 --> 00:01:41,380 In this module of the course, 26 00:01:41,380 --> 00:01:44,630 we're going to be using an OCR engine called Tesseract. 27 00:01:44,630 --> 00:01:48,651 Tesseract was originally developed between 1984 and 28 00:01:48,651 --> 00:01:51,856 1994 as a PhD research project at HP Labs. 29 00:01:51,856 --> 00:01:56,131 The engine vastly outperformed commercial products at the time, but 30 00:01:56,131 --> 00:02:01,625 then development was stopped until HP released Tesseract as open source in 2005. 31 00:02:01,625 --> 00:02:06,272 In 2006, Google began maintaining the tool and has since released 32 00:02:06,272 --> 00:02:11,006 updated versions of Tesseract with support for over 100 languages. 33 00:02:11,006 --> 00:02:15,012 I think Tesseract is a great tool, and a wonderful example of how companies can 34 00:02:15,012 --> 00:02:18,060 engage in open source software development. 35 00:02:18,060 --> 00:02:20,840 So before we spend a lot of time talking about Tesseract, 36 00:02:20,840 --> 00:02:23,020 let's talk more about open source software.
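As a rough sketch of the lecture-search idea described earlier (break a video into images, run OCR on each image, build a search index from the recognized text), the Python below builds a small inverted index. The names `build_index` and `search` are hypothetical, frames are assumed to arrive as (timestamp, image) pairs, and the OCR step is passed in as a plain function so the sketch runs without an OCR engine installed. In practice something like pytesseract's `image_to_string` could fill that role. This is an illustration only, not the Opencast Matterhorn implementation.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical helper names; the OCR function is supplied by the caller.
Frame = Tuple[float, object]  # (timestamp in seconds, image object)


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(frames: Iterable[Frame],
                ocr: Callable[[object], str]) -> Dict[str, Set[float]]:
    """Map each recognized word to the timestamps of frames that show it."""
    index: Dict[str, Set[float]] = defaultdict(set)
    for timestamp, image in frames:
        for word in tokenize(ocr(image)):
            index[word].add(timestamp)
    return index


def search(index: Dict[str, Set[float]], query: str) -> List[float]:
    """Return sorted timestamps of frames containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


if __name__ == "__main__":
    # Assumption for the demo: strings stand in for slide images,
    # and the "OCR" function simply returns the string unchanged.
    demo_frames = [(0.0, "Intro to OCR"), (30.0, "Tesseract OCR engine")]
    demo_index = build_index(demo_frames, ocr=lambda image: image)
    print(search(demo_index, "tesseract"))
```

Passing the OCR step in as a function keeps the indexing logic separate from the choice of OCR engine, so the same index could be built from lecture slides or from newspaper images.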