1 00:00:07,370 --> 00:00:11,290 We previously covered open source tools for data science. 2 00:00:11,290 --> 00:00:16,190 Now, let’s look at the commercial options you’ll find in many enterprise projects. 3 00:00:16,190 --> 00:00:20,470 Let’s revisit our overview of different tool categories. 4 00:00:20,470 --> 00:00:24,710 In data management, most of an enterprise’s relevant data is stored in an 5 00:00:24,710 --> 00:00:30,669 Oracle Database, Microsoft SQL Server, or IBM Db2. 6 00:00:30,669 --> 00:00:36,040 Although open source databases are gaining popularity, those three data management products 7 00:00:36,040 --> 00:00:38,000 are still considered the industry standard. 8 00:00:38,000 --> 00:00:40,739 They won’t disappear in the near future. 9 00:00:40,739 --> 00:00:43,769 It’s not just about functionality. 10 00:00:43,769 --> 00:00:48,789 Data is at the heart of every organization, and the availability of commercial support 11 00:00:48,789 --> 00:00:51,430 plays a major role. 12 00:00:51,430 --> 00:00:55,969 Commercial support is delivered directly by software vendors, influential partners, 13 00:00:55,969 --> 00:00:58,079 and support networks. 14 00:00:58,079 --> 00:01:03,480 When we focus on commercial data integration tools, we’re talking about “extract, transform, 15 00:01:03,480 --> 00:01:06,060 and load,” or “ETL” tools. 16 00:01:06,060 --> 00:01:12,550 According to a Gartner Magic Quadrant, Informatica PowerCenter and IBM InfoSphere DataStage are 17 00:01:12,550 --> 00:01:20,300 the leaders, followed by products from SAP, Oracle, SAS, Talend, and Microsoft. 18 00:01:20,300 --> 00:01:25,410 These tools support the design and deployment of ETL data-processing pipelines through a 19 00:01:25,410 --> 00:01:27,370 graphical interface. 20 00:01:27,370 --> 00:01:32,400 They also provide connectors to most of the commercial and open source target information 21 00:01:32,400 --> 00:01:33,430 systems.
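To make the “extract, transform, and load” idea concrete, here is a minimal sketch in plain Python. It is not how any of the commercial tools named above work internally; those are configured through a graphical interface. The sample data, the `sales` table, and all function names are hypothetical and exist only for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from a source system.
RAW_CSV = """customer,amount
 alice ,10.50
BOB,20
carol,not_a_number
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names, cast amounts, drop rows that fail to parse."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not numeric
        clean.append((row["customer"].strip().lower(), amount))
    return clean


def load(records, conn):
    """Load: write the cleaned records into a target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()


def run_pipeline(text, conn):
    """Run the three stages in order and return what ended up in the target."""
    load(transform(extract(text)), conn)
    query = "SELECT customer, amount FROM sales ORDER BY customer"
    return conn.execute(query).fetchall()
```

Calling `run_pipeline(RAW_CSV, sqlite3.connect(":memory:"))` loads the two valid rows into an in-memory SQLite database and skips the malformed one. A commercial ETL tool adds scheduling, monitoring, and ready-made connectors on top of this basic pattern.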
22 00:01:33,430 --> 00:01:39,440 Finally, Watson Studio Desktop includes a component called Data Refinery, which lets you 23 00:01:39,440 --> 00:01:45,710 define and execute data integration processes in a spreadsheet style. 24 00:01:45,710 --> 00:01:51,810 In the commercial environment, data visualization relies on business intelligence, or “BI”, 25 00:01:51,810 --> 00:01:52,810 tools. 26 00:01:52,810 --> 00:01:59,060 Their main focus is to create visually attractive and easy-to-understand reports and live dashboards. 27 00:01:59,060 --> 00:02:05,180 The most prominent commercial examples are Tableau, Microsoft Power BI, and IBM Cognos 28 00:02:05,180 --> 00:02:06,650 Analytics. 29 00:02:06,650 --> 00:02:12,349 Another type of visualization targets data scientists rather than regular users. 30 00:02:12,349 --> 00:02:18,069 A sample problem might be “How do different columns in a table relate to each other?” 31 00:02:18,069 --> 00:02:22,319 This type of functionality is contained in Watson Studio Desktop. 32 00:02:22,319 --> 00:02:26,890 If you want to build a machine learning model using a commercial tool, you should consider 33 00:02:26,890 --> 00:02:29,519 using a data mining product. 34 00:02:29,519 --> 00:02:35,720 The most prominent products of this type are SPSS Modeler and SAS Enterprise Miner. 35 00:02:35,720 --> 00:02:42,650 In addition, a version of SPSS Modeler is available in Watson Studio Desktop, based 36 00:02:42,650 --> 00:02:44,730 on the cloud version of the tool. 37 00:02:44,730 --> 00:02:48,829 We’ll talk more about cloud-based tools in the next video. 38 00:02:48,829 --> 00:02:54,749 In commercial software, model deployment is tightly integrated into the model building process. 39 00:02:54,749 --> 00:03:00,319 This diagram shows an example of the SPSS Collaboration and Deployment Services, which 40 00:03:00,319 --> 00:03:06,900 are used to deploy any type of asset created by the SPSS software tool suite.
41 00:03:06,900 --> 00:03:10,540 Other vendors use the same type of process. 42 00:03:10,540 --> 00:03:14,459 Commercial software can also export models in an open format. 43 00:03:14,459 --> 00:03:21,180 For example, SPSS Modeler supports exporting models as Predictive Model Markup Language, 44 00:03:21,180 --> 00:03:28,379 or PMML, which can be read by many other commercial and open source software packages. 45 00:03:28,379 --> 00:03:33,439 Model monitoring is a new discipline, and there are currently no relevant commercial tools 46 00:03:33,439 --> 00:03:34,439 available. 47 00:03:34,439 --> 00:03:37,690 As a result, open source is the first choice. 48 00:03:37,690 --> 00:03:40,870 The same is true for code asset management. 49 00:03:40,870 --> 00:03:45,560 Open source with Git and GitHub is the de facto standard. 50 00:03:45,560 --> 00:03:51,780 Data asset management, often called data governance or data lineage, is a crucial part of 51 00:03:51,780 --> 00:03:54,010 enterprise-grade data science. 52 00:03:54,010 --> 00:03:57,870 Data must be versioned and annotated using metadata. 53 00:03:57,870 --> 00:04:04,030 Vendors, including Informatica with Enterprise Data Governance and IBM, provide tools for 54 00:04:04,030 --> 00:04:05,499 these specific tasks. 55 00:04:05,499 --> 00:04:12,120 The IBM InfoSphere Information Governance Catalog covers functions like a data dictionary, 56 00:04:12,120 --> 00:04:15,730 which facilitates discovery of data assets. 57 00:04:15,730 --> 00:04:20,209 Each data asset is assigned to a data steward, the data owner. 58 00:04:20,209 --> 00:04:25,400 The data owner is responsible for that data asset and can be contacted. 59 00:04:25,400 --> 00:04:30,919 Data lineage is also covered; this enables a user to track back through the transformation 60 00:04:30,919 --> 00:04:34,500 steps followed in creating the data assets. 61 00:04:34,500 --> 00:04:39,140 The data lineage also includes a reference to the actual source data.
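The data steward and data lineage ideas can be sketched as a small data structure. This is only an illustration of the concept, not the API or data model of IBM InfoSphere Information Governance Catalog or any other product. The `DataAsset` class, its fields, and the asset names in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataAsset:
    """A data asset annotated with metadata (hypothetical structure)."""
    name: str
    steward: str                           # the data owner who can be contacted
    transformation: Optional[str] = None   # step that produced this asset
    parent: Optional["DataAsset"] = None   # asset this one was derived from


def trace_lineage(asset: DataAsset) -> List[str]:
    """Walk back through the transformation steps to the original source data."""
    steps = []
    current = asset
    while current is not None:
        if current.transformation is None:
            steps.append(current.name)  # no producing step: this is source data
        else:
            steps.append(f"{current.name} <- {current.transformation}")
        current = current.parent
    return steps
```

For example, if a `monthly_report` asset is derived from `clean_orders`, which is in turn derived from `raw_orders`, then `trace_lineage` on the report returns the chain of assets and transformation steps, ending at the source data.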
62 00:04:39,140 --> 00:04:44,789 Rules and policies can be added to reflect complex regulatory and business requirements 63 00:04:44,789 --> 00:04:47,270 for data privacy and retention. 64 00:04:47,270 --> 00:04:52,919 Watson Studio is a fully integrated development environment for data scientists. 65 00:04:52,919 --> 00:04:57,610 It’s usually consumed through the cloud, and we’ll cover it in more detail in a later 66 00:04:57,610 --> 00:04:58,729 lesson. 67 00:04:58,729 --> 00:05:01,540 There is also a desktop version available. 68 00:05:01,540 --> 00:05:08,849 Watson Studio Desktop combines Jupyter Notebooks with graphical tools to maximize data scientists’ 69 00:05:08,849 --> 00:05:09,900 performance. 70 00:05:09,900 --> 00:05:15,919 Watson Studio, together with Watson OpenScale, is a fully integrated tool covering the full 71 00:05:15,919 --> 00:05:20,550 data science life cycle and all the tasks we’ve discussed previously. 72 00:05:20,550 --> 00:05:23,620 We’ll talk more about both in the next lesson, 73 00:05:23,620 --> 00:05:29,169 but just keep in mind that they can be deployed in a local data center on top of Kubernetes 74 00:05:29,169 --> 00:05:31,599 or Red Hat OpenShift. 75 00:05:31,599 --> 00:05:38,039 Another example of a fully integrated commercial tool is H2O Driverless AI, which covers the 76 00:05:38,039 --> 00:05:41,030 complete data science life cycle. 77 00:05:41,030 --> 00:05:45,789 In this lesson, you’ve learned how most common data science tasks are supported by 78 00:05:45,789 --> 00:05:47,509 commercial tools. 79 00:05:47,509 --> 00:05:51,650 In the next video, we’ll discover data science tools that are available exclusively on the 80 00:05:51,650 --> 00:05:51,660 cloud.