1 00:00:07,980 --> 00:00:14,100 In this video we’ll discuss data sets: what they are, why they are important in data science, 2 00:00:14,100 --> 00:00:15,660 and where to find them. 3 00:00:15,660 --> 00:00:19,370 Let’s first loosely define what a data set is. 4 00:00:19,370 --> 00:00:22,710 A data set is a structured collection of data. 5 00:00:22,710 --> 00:00:29,179 Data embodies information that might be represented as text, numbers, or media such as images, 6 00:00:29,179 --> 00:00:32,160 audio, or video files. 7 00:00:32,160 --> 00:00:37,690 A data set that is structured as tabular data comprises a collection of rows, which in turn 8 00:00:37,690 --> 00:00:41,040 comprise columns that store the information. 9 00:00:41,040 --> 00:00:45,899 One popular tabular data format is "comma separated values," or CSV. 10 00:00:45,899 --> 00:00:52,660 A CSV file is a delimited text file where each line represents a row and data values 11 00:00:52,660 --> 00:00:55,110 are separated by a comma. 12 00:00:55,110 --> 00:01:00,140 For example, imagine a data set of observations from a weather station. 13 00:01:00,140 --> 00:01:05,770 Each row represents an observation at a given time, while each column contains information 14 00:01:05,770 --> 00:01:11,120 about that particular observation, such as the temperature, humidity, and other weather 15 00:01:11,120 --> 00:01:12,380 conditions. 16 00:01:12,380 --> 00:01:17,540 Hierarchical or network data structures are typically used to represent relationships 17 00:01:17,540 --> 00:01:18,700 between data. 18 00:01:18,700 --> 00:01:24,430 Hierarchical data is organized in a tree-like structure, whereas network data might be stored 19 00:01:24,430 --> 00:01:26,010 as a graph. 20 00:01:26,010 --> 00:01:31,530 For example, the connections between people on a social networking website are often represented 21 00:01:31,530 --> 00:01:33,640 in the form of a graph. 
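To make the tabular and graph structures described above more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the column names (`time`, `temperature_c`, `humidity_pct`), the sample values, and the people in the `FRIENDS` graph are all hypothetical and do not come from any real data set. The first part parses CSV text in which each line is one weather observation; the second part stores a small social network as a graph, using an adjacency structure.

```python
import csv
import io

# Hypothetical weather-station observations. The column names and values
# are invented for illustration only.
CSV_TEXT = """time,temperature_c,humidity_pct
2024-01-01T00:00,3.5,81
2024-01-01T01:00,3.1,84
2024-01-01T02:00,2.8,86
"""


def read_observations(text):
    """Parse CSV text into a list of dicts, one dict per row."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "time": row["time"],
            "temperature_c": float(row["temperature_c"]),
            "humidity_pct": int(row["humidity_pct"]),
        }
        for row in reader
    ]


# Hypothetical social network stored as a graph: each person maps to the
# set of people they are directly connected to.
FRIENDS = {
    "ana": {"ben", "chen"},
    "ben": {"ana"},
    "chen": {"ana"},
}


def mutual_connections(graph, a, b):
    """Return the people connected to both a and b."""
    return graph[a] & graph[b]


observations = read_observations(CSV_TEXT)
print(len(observations))                  # number of rows (observations)
print(observations[0]["temperature_c"])   # one column value from the first row
print(sorted(mutual_connections(FRIENDS, "ben", "chen")))
```

The standard-library `csv` module is used here so the sketch has no external dependencies. A set-based adjacency structure is one simple way to hold a graph in memory; larger networks are often stored in dedicated graph libraries or databases instead.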
22 00:01:33,640 --> 00:01:38,170 A data set might also include raw data files, such as images or audio. 23 00:01:38,170 --> 00:01:42,500 The MNIST data set is popular for data science. 24 00:01:42,500 --> 00:01:47,810 It contains images of handwritten digits and is commonly used to train image processing 25 00:01:47,810 --> 00:01:48,810 systems. 26 00:01:48,810 --> 00:01:54,360 Traditionally, most data sets were considered to be private because they contain proprietary 27 00:01:54,360 --> 00:02:00,240 or confidential information such as customer data, pricing data, or other commercially 28 00:02:00,240 --> 00:02:03,030 sensitive information. 29 00:02:03,030 --> 00:02:07,090 These data sets are typically not shared publicly. 30 00:02:07,090 --> 00:02:12,890 Over time, more and more public and private entities such as scientific institutions, 31 00:02:12,890 --> 00:02:18,459 governments, organizations, and even companies have started to make data sets available to 32 00:02:18,459 --> 00:02:24,549 the public as “open data,” providing a wealth of information for free. 33 00:02:24,549 --> 00:02:29,599 For example, the United Nations and federal and municipal governments around the world 34 00:02:29,599 --> 00:02:35,629 have published many data sets on their websites, covering the economy, society, healthcare, 35 00:02:35,629 --> 00:02:39,500 transportation, environment, and much more. 36 00:02:39,500 --> 00:02:45,390 Access to these and other open data sets enables data scientists, researchers, analysts, and 37 00:02:45,390 --> 00:02:50,970 others to uncover previously unknown and potentially useful insights. 38 00:02:50,970 --> 00:02:55,930 They can create new applications for both commercial purposes and the public good. 39 00:02:55,930 --> 00:02:59,390 They can also carry out new research. 
40 00:02:59,390 --> 00:03:04,129 Open data has played a significant role in the growth of data science, machine learning, 41 00:03:04,129 --> 00:03:09,170 and artificial intelligence and has provided a way for practitioners to hone their skills 42 00:03:09,170 --> 00:03:13,780 on a wide variety of data sets. 43 00:03:13,780 --> 00:03:16,870 There are many open data sources on the internet. 44 00:03:16,870 --> 00:03:21,269 You can find a comprehensive list of open data portals from around the world on the 45 00:03:21,269 --> 00:03:26,829 Open Knowledge Foundation’s datacatalogs.org website. 46 00:03:26,829 --> 00:03:32,000 The United Nations, the European Union, and many other governmental and intergovernmental 47 00:03:32,000 --> 00:03:39,469 organizations maintain data repositories providing access to a wide range of information. 48 00:03:39,469 --> 00:03:44,700 On Kaggle, which is a popular data science online community, you can find and contribute 49 00:03:44,700 --> 00:03:47,970 data sets that might be of general interest. 50 00:03:47,970 --> 00:03:53,250 Last but not least, Google provides a search engine for data sets that might help you find 51 00:03:53,250 --> 00:03:56,050 the ones that have particular value for you. 52 00:03:56,050 --> 00:04:01,269 It’s important to recognize that open data distribution and use might be restricted, 53 00:04:01,269 --> 00:04:03,870 as defined by its licensing terms. 54 00:04:03,870 --> 00:04:09,629 In the absence of a license for open data distribution, many data sets were shared in the past under 55 00:04:09,629 --> 00:04:12,790 open source software licenses. 56 00:04:12,790 --> 00:04:17,471 These licenses were not designed to cover the specific considerations related to the 57 00:04:17,471 --> 00:04:20,870 distribution and use of data sets. 58 00:04:20,870 --> 00:04:26,100 To address the issue, the Linux Foundation created the Community Data License Agreement, 59 00:04:26,100 --> 00:04:28,000 or CDLA. 
60 00:04:28,000 --> 00:04:33,810 Two licenses were initially created for sharing data: CDLA-Sharing and CDLA-Permissive. 61 00:04:33,810 --> 00:04:41,000 The CDLA-Sharing license grants you permission to use and modify the data. 62 00:04:41,000 --> 00:04:46,010 The license stipulates that if you publish your modified version of the data you must 63 00:04:46,010 --> 00:04:50,500 do so under the same license terms as the original data. 64 00:04:50,500 --> 00:04:56,330 The CDLA-Permissive license also grants you permission to use and modify the data. 65 00:04:56,330 --> 00:05:00,700 However, you are not required to share changes to the data. 66 00:05:00,700 --> 00:05:06,750 Note that neither license imposes any restrictions on results you might derive by using the data, 67 00:05:06,750 --> 00:05:08,690 which is important in data science. 68 00:05:08,690 --> 00:05:13,740 Let’s say, for example, that you are building a model that performs a prediction. 69 00:05:13,740 --> 00:05:19,500 If you are training the model using CDLA-licensed data sets, you are under no obligation to 70 00:05:19,500 --> 00:05:25,440 share the model, or to share it under a specific license if you do choose to share it. 71 00:05:25,440 --> 00:05:29,940 In this video you’ve learned about open data sets, their role in data science, and 72 00:05:29,940 --> 00:05:31,240 where to find them. 73 00:05:31,240 --> 00:05:36,090 We’ve also introduced the Community Data License Agreement, which makes it easier to 74 00:05:36,090 --> 00:05:38,180 share open data. 75 00:05:38,180 --> 00:05:43,760 One important aspect that we didn’t cover in this video is data quality and accuracy, 76 00:05:43,760 --> 00:05:48,870 which might vary greatly depending on who collected and contributed the data set. 
77 00:05:48,870 --> 00:05:53,930 While some open data sets might be good enough for personal use, they might not meet enterprise 78 00:05:53,930 --> 00:05:58,380 requirements due to the impact they might have on the business. 79 00:05:58,380 --> 00:06:03,710 In the next module, you will learn about the Data Asset eXchange, a curated open data repository.