In this lesson we will discuss two products that are very helpful for data scientists. Both came to IBM with the SPSS acquisition in 2009. First is IBM SPSS Modeler. Let's review the different tool categories we discussed previously. IBM SPSS Modeler includes data management capabilities and tools for data preparation, visualization, model building, and model deployment. The product was created by Integral Solutions Limited in the United Kingdom in 1994 and was originally called Clementine. It was acquired by a company called SPSS in 1998, and SPSS was in turn acquired by IBM in 2009.

SPSS Modeler is a data mining and text analytics software application. It's used to build predictive models and conduct other analytics tasks. It has a visual interface that enables users to leverage statistical and data mining algorithms without programming. One of its main goals from the beginning was to make complex predictive modeling pipelines easily accessible. A sample Modeler stream shown here includes one round data source node, three triangular graph nodes, one hexagonal node for computing a new variable, and a square node for an output table. Below the canvas, we can see the rich node palette with separate tabs for data sources, record and field operations, graphs, models, output, and so on. Nodes in different tabs have different shapes, with pentagons used for modeling nodes.

Let's examine the sample stream that comes as an example with the product. It starts with a data set of telecommunications records, and the goal is to build a model to predict which customers are about to leave the service, otherwise known as churn. The data source is shown by the round node on the left side. A hexagon-type node typically follows a data source node; it enables us to specify roles (target, predictor, or none) and measurement levels (such as continuous, nominal, or flag) for all variables. The term flag is used to denote a variable with two categories, one of which can be considered positive and the other negative. In this example, the measurement level for the churn field is set to flag and its role is set to target. All other fields are set as predictors, or inputs. The original data set has many fields, and some of them are not relevant to the target variable, so we first need to decide which fields are more useful as predictors. There is a Feature Selection modeling node that helps to do this.
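Modeler performs this step visually, but for readers who think in code, here is a minimal scikit-learn sketch of the same idea: screening candidate predictors of churn with a univariate score. The file name, column names, and the choice of scoring function are assumptions for illustration, not part of the Modeler example.

    # Rough stand-in for Modeler's Feature Selection node: rank numeric
    # candidate predictors of churn by a univariate F-test score.
    import pandas as pd
    from sklearn.feature_selection import SelectKBest, f_classif

    df = pd.read_csv("telco.csv")                       # hypothetical file
    X = df.drop(columns=["churn"]).select_dtypes("number")
    X = X.fillna(X.mean())          # crude fill so the screening test runs
    y = df["churn"]

    selector = SelectKBest(score_func=f_classif, k=10)  # keep the 10 best fields
    selector.fit(X, y)
    print("Fields kept as predictors:", list(X.columns[selector.get_support()]))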
After the stream with the Feature Selection node is executed, a yellow model nugget is created below it in the flow diagram. Using that nugget, we can generate a Filter node that filters out the variables that are not good predictors for the target. The Data Audit node, located below the Filter node, shows various properties of the data, such as the number of outliers in each variable and the percentage of valid values. It can also help create a special node for missing value imputation, that is, replacing missing values of a variable with valid values that can be selected based on domain knowledge. Here, the variable log toll has more than 50% missing values, and we will specify a value, the mean, to replace them. A supernode in Modeler is a special node that is not found in the palette but is created by the user, with special functions included in it. The Data Audit node enables us to create a supernode for imputing missing values. It is shaped as a star and shown on the right of the screen.

Finally, we attach the Logistic Regression model node to the stream and click Run. Another model nugget appears, and by clicking it we can see various model information and other output. In the output window that opens when we click on the model nugget, the Summary tab shows the target, the inputs, and some model-building settings. Based on certain advanced output settings that were specified before the model was built, we can also see a classification table, accuracy, and some other generated outputs for the model. Note that these results are based on training data only. To assess how well the model generalizes to other real-world data, you should always use a Partition node to hold out a subset of records for testing and validation. Then, in the model setup screen, select the Use partitioned data check box. This will help detect and avoid model overfitting. Overfitting is defined as having significantly higher accuracy on the training data, the data used to build the model, than on test or unseen data.

The yellow model nugget added earlier can also be used to compute predictions, also called scores, on the original data or on a new data source. All we need to do is connect the data source in question to the nugget, make sure it has the predictor variables used in the model, and create an output to a table or other structure for storing the scores. We can also specify settings for scoring inside the model nugget.
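Purely as an illustration: the imputation, partitioning, model fitting, and overfitting check described above map roughly onto the following scikit-learn sketch. The file and column names are hypothetical, and the mean imputation is placed inside a pipeline so that, as in Modeler, the same transformation is reapplied when new data is scored.

    # Rough code equivalent of the stream: mean imputation, a held-out
    # partition, a logistic regression model, and a train-vs-test
    # accuracy comparison to watch for overfitting.
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    df = pd.read_csv("telco.csv")                       # hypothetical file
    X = df.drop(columns=["churn"]).select_dtypes("number")
    y = df["churn"]

    # Partition node equivalent: hold out 30% of records for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    model = make_pipeline(SimpleImputer(strategy="mean"),   # supernode step
                          LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    print("train accuracy:", model.score(X_train, y_train))
    print("test accuracy: ", model.score(X_test, y_test))

    # Model nugget equivalent: score records with the fitted model.
    scores = model.predict(X_test)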
Note that if the model was built on transformed predictor data, the same data transformation steps must be applied to new data before it can be scored by the model. The Analysis node is the final node in the stream. It attaches to a model nugget, and when executed it computes model evaluation metrics, such as a confusion matrix and accuracy (a code sketch of this evaluation step appears at the end of this lesson).

In this example we've only looked at a logistic regression model. IBM SPSS Modeler offers a rich modeling palette that includes many classification, regression, clustering, association rule, and other models. It also contains a large selection of data source types, data transformations, graphs, and output nodes. And we haven't even talked about text analytics, entity resolution, and many other features of the product that can be extremely helpful to data scientists. We could create an entire course on IBM SPSS Modeler alone. You've learned how IBM SPSS Modeler helps analysts create powerful machine learning pipelines using a graphical interface. Next, we will talk about the original SPSS product, now called IBM SPSS Statistics.
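As promised above, here is a minimal stand-in for the Analysis node, reusing the model, X_test, and y_test objects from the previous sketch; it is an analogy for the node's behavior, not Modeler code.

    # Analysis node stand-in: confusion matrix and accuracy on test data.
    from sklearn.metrics import accuracy_score, confusion_matrix

    y_pred = model.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print("accuracy:", accuracy_score(y_test, y_pred))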