1 00:00:08,769 --> 00:00:16,640 IBM SPSS Statistics evolved from an original product that was released in 1968. That product 2 00:00:16,640 --> 00:00:22,539 was called “Statistical Package for the Social Sciences,” or “SPSS.” 3 00:00:22,539 --> 00:00:29,000 IBM SPSS Statistics is a statistical and machine learning software application and is widely 4 00:00:29,000 --> 00:00:36,000 used in academia, government agencies, and large enterprises. It’s used to build predictive 5 00:00:36,000 --> 00:00:42,110 models, perform statistical analysis of data, and conduct other analytic tasks. It has a 6 00:00:42,110 --> 00:00:48,440 visual interface, which enables users to leverage statistical and data mining algorithms without 7 00:00:48,440 --> 00:00:54,020 programming, although the interface is very different from that of Modeler. As you can see, the 8 00:00:54,020 --> 00:00:59,000 main section of the screen looks very much like a spreadsheet; it displays data and allows 9 00:00:59,000 --> 00:01:05,040 manual editing. This particular small data set, called “Employee Data,” was created 10 00:01:05,040 --> 00:01:09,950 some time ago and does not represent real people. It is shipped with the product for 11 00:01:09,950 --> 00:01:12,690 use in demos and tutorials. 12 00:01:12,690 --> 00:01:18,880 At the bottom of the screen, we can see two tabs: Data View and Variable View. In the 13 00:01:18,880 --> 00:01:24,869 Variable View, we can see and edit the information about all variables, including names, labels, 14 00:01:24,869 --> 00:01:31,350 data types, and measurement levels. We can also specify value labels for categorical 15 00:01:31,350 --> 00:01:34,170 variables and define missing values. 16 00:01:34,170 --> 00:01:39,740 At the top of the data window is a menu. Under “File,” if you select “Import Data,” you 17 00:01:39,740 --> 00:01:45,299 will see a wide variety of data formats that you can import. 
The product uses 18 00:01:45,299 --> 00:01:50,990 its own data file format, with the extension “.sav,” which saves all the information 19 00:01:50,990 --> 00:01:57,200 about the variables we just saw in Variable View. The menu enables importing from and 20 00:01:57,200 --> 00:02:00,349 exporting to many other formats. 21 00:02:00,349 --> 00:02:06,069 Under “Data,” you’ll find an extensive menu of possible data operations. Note that 22 00:02:06,069 --> 00:02:12,380 Data Validation can be performed using user-defined rules that specify the expected behavior of 23 00:02:12,380 --> 00:02:18,650 variable values. For example, if the day and month are kept in separate columns, the 24 00:02:18,650 --> 00:02:24,810 day cannot exceed “31,” and for February, the day can’t exceed “29.” A special 25 00:02:24,810 --> 00:02:30,470 rule can therefore be created and applied during data validation. Additionally, you 26 00:02:30,470 --> 00:02:36,570 can enable some checks, such as the percentage of missing values in a record or in a field. 27 00:02:36,570 --> 00:02:41,280 When you click the “Transform” menu item, you’ll find a variety of available data 28 00:02:41,280 --> 00:02:45,480 transformations. Under “Compute Variable…” you can write 29 00:02:45,480 --> 00:02:52,000 a formula for a new variable based on existing variables. You can use any of the many mathematical 30 00:02:52,000 --> 00:02:55,840 and statistical functions available in the product. 31 00:02:55,840 --> 00:03:03,230 You also have the option to use automatic data preparation, as in Modeler. 32 00:03:03,230 --> 00:03:07,870 In the “Analyze” menu, you will see many types of statistical and machine learning 33 00:03:07,870 --> 00:03:13,490 analysis. Under “Regression,” there are a variety of regression-related models. 
There 34 00:03:13,490 --> 00:03:18,760 are other kinds of regressions that appear separately on the Analyze menu, including 35 00:03:18,760 --> 00:03:25,780 General Linear Model, Generalized Linear Models, Mixed Models, and Loglinear. 36 00:03:25,780 --> 00:03:30,980 Now let’s build a decision-tree model on the data. For this exercise we’ll try to 37 00:03:30,980 --> 00:03:36,910 predict the "Employment category" field based on other fields. In the “Analyze” menu, 38 00:03:36,910 --> 00:03:41,720 select “Classify” and then “Tree”. In the Decision Tree window, we can 39 00:03:41,720 --> 00:03:47,700 specify the dependent variable “Employment Category,” and use most other fields -- except 40 00:03:47,700 --> 00:03:54,620 id and bdate -- as predictors, or independent variables. Usually the ID variable should 41 00:03:54,620 --> 00:03:59,760 not be used as a predictor, because it will not help with new cases, and the birthdate 42 00:03:59,760 --> 00:04:05,430 does not seem to be a useful predictor in this example either. We’ll select “Exhaustive 43 00:04:05,430 --> 00:04:10,209 CHAID” as our Growing Method, although there are also three other options available. Data 44 00:04:10,209 --> 00:04:11,209 scientists often try many different models to see which one works best for their data. 45 00:04:11,209 --> 00:04:12,209 Here we are just looking at one example model in order to illustrate how the product works. 46 00:04:12,209 --> 00:04:17,150 Click the “Validation” button to open the Decision Tree Validation window. Here, 47 00:04:17,150 --> 00:04:23,310 we select “Split-sample validation” to make sure we test the model on new data. Click 48 00:04:23,310 --> 00:04:28,620 “OK” in the Decision Tree window, to generate the output, including the tree diagram 49 00:04:28,620 --> 00:04:33,919 shown here. A Classification table is also displayed that shows how well the 50 00:04:33,919 --> 00:04:41,570 model works on training and test data. 
In this case, the accuracy is 91.2% on training 51 00:04:41,570 --> 00:04:49,949 data and only 85.6% on test data, which means the model does not generalize to new data 52 00:04:49,949 --> 00:04:55,990 very well. It’s possible that by using different models, we can get better results. 53 00:04:55,990 --> 00:05:01,330 Let’s move to the next menu item. When you click “Graphs,” you’ll find a versatile 54 00:05:01,330 --> 00:05:05,409 Chart Builder, in addition to several other options. 55 00:05:05,409 --> 00:05:10,400 The Chart Builder enables us to choose a style from the gallery and to drag required fields 56 00:05:10,400 --> 00:05:14,870 onto the canvas, select colors, and choose from other options. 57 00:05:14,870 --> 00:05:21,259 Here’s an example after we drag the “Previous Experience,” “Current Salary,” and “Gender” 58 00:05:21,259 --> 00:05:27,949 variables to the corresponding slots to define the axes and colors for the dots on the chart. 59 00:05:27,949 --> 00:05:33,639 The plot in the canvas is not based on real data; this example simply gives you an idea 60 00:05:33,639 --> 00:05:37,130 of what to expect. 61 00:05:37,130 --> 00:05:41,669 Here is the real plot obtained from the data that we’ve been using. It shows differently 62 00:05:41,669 --> 00:05:46,479 colored dots for gender, and regression lines that show the relationship of the current 63 00:05:46,479 --> 00:05:50,569 salary to previous experience for each gender. 64 00:05:50,569 --> 00:05:56,050 Throughout IBM SPSS Statistics, you’ll see a “Paste” button. When you click the “Paste” 65 00:05:56,050 --> 00:06:01,520 button, instead of executing the task right away, the application will open another window, 66 00:06:01,520 --> 00:06:08,530 called the Syntax Editor. Here, you can see the code, called “syntax,” pasted for you. 67 00:06:08,530 --> 00:06:12,330 SPSS syntax is a special programming language. 68 00:06:12,330 --> 00:06:18,720 For example, here is the code for the decision tree we just built. 
Once we have the syntax, 69 00:06:18,720 --> 00:06:24,279 we can execute it, manually edit it, store it for later use, or send it to other users 70 00:06:24,279 --> 00:06:32,410 of IBM SPSS Statistics. Experienced SPSS users can write the code from scratch, while others 71 00:06:32,410 --> 00:06:40,199 might prefer to have it generated by the graphical interface. Remember, the option to paste syntax 72 00:06:40,199 --> 00:06:45,319 is available throughout the program. If the syntax is generated for all the steps 73 00:06:45,319 --> 00:06:51,720 in a data analytics process -- opening the data set, applying any data transformations, 74 00:06:51,720 --> 00:06:57,759 building models -- and then saved as a syntax file with the extension “.sps”, it’s 75 00:06:57,759 --> 00:07:04,969 similar to saving a stream in IBM SPSS Modeler. However, one important difference is that 76 00:07:04,969 --> 00:07:10,740 a syntax file does not offer an easy way to score new records with the model. We’ll talk about 77 00:07:10,740 --> 00:07:14,110 different ways to deploy models in the next section. 78 00:07:14,110 --> 00:07:21,240 You’ve learned how IBM SPSS Statistics helps data scientists to analyze their data using 79 00:07:21,240 --> 00:07:27,810 many statistical and machine learning techniques. Using a graphical user interface, we can create 80 00:07:27,810 --> 00:07:33,159 complex analyses that can be saved in the form of syntax and reused later. 81 00:07:33,159 --> 00:07:38,689 Next, we will talk about predictive model deployment, an important part of the overall 82 00:07:38,689 --> 00:07:40,229 data science lifecycle.
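
As a side note to the Data Validation step described earlier, the day-and-month rule and the missing-value check can be sketched outside the product. This is only an illustrative Python sketch, not SPSS syntax and not how the product implements validation. The function names, the sample records, and the 30-day limits for April, June, September, and November are assumptions added for the example; the narration only mentions the limits of 31 and, for February, 29.

```python
# Illustrative sketch only: hypothetical names, not part of IBM SPSS Statistics.
# Maximum day per month when no year column is available, so February allows 29.
MAX_DAY = {1: 31, 2: 29, 3: 31, 4: 30, 5: 31, 6: 30,
           7: 31, 8: 31, 9: 30, 10: 31, 11: 30, 12: 31}


def is_valid_day(month, day):
    """Return True if `day` is a plausible day number for `month`."""
    if month not in MAX_DAY or not isinstance(day, int):
        return False
    return 1 <= day <= MAX_DAY[month]


def percent_missing(record):
    """Return the percentage of fields in a record (dict) that are None."""
    if not record:
        return 0.0
    missing = sum(1 for value in record.values() if value is None)
    return 100.0 * missing / len(record)


# Made-up sample records, used only to exercise the two checks.
records = [
    {"id": 1, "month": 2, "day": 29},    # passes the rule
    {"id": 2, "month": 2, "day": 30},    # February cannot have day 30
    {"id": 3, "month": 4, "day": None},  # missing day
]

# Collect the ids of records that violate the day-and-month rule.
flagged = [r["id"] for r in records if not is_valid_day(r["month"], r["day"])]
```

Here a missing day is treated as a rule violation; a real validation setup might instead report missing values separately, which is what the second function is for.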