1
00:00:08,769 --> 00:00:16,640
IBM SPSS Statistics evolved from an original
product that was released in 1968. That product

2
00:00:16,640 --> 00:00:22,539
was called “Statistical Package for Social
Sciences,” or “SPSS.”

3
00:00:22,539 --> 00:00:29,000
IBM SPSS Statistics is a statistical and machine
learning software application and is widely

4
00:00:29,000 --> 00:00:36,000
used in academia, government agencies, and
large enterprises. It’s used to build predictive

5
00:00:36,000 --> 00:00:42,110
models, perform statistical analysis of data,
and conduct other analytic tasks. It has a

6
00:00:42,110 --> 00:00:48,440
visual interface, which enables users to leverage
statistical and data mining algorithms without

7
00:00:48,440 --> 00:00:54,020
programming, although the interface is very
different from Modeler’s. As you can see, the

8
00:00:54,020 --> 00:00:59,000
main section of the screen looks very much
like a spreadsheet; it displays data and allows

9
00:00:59,000 --> 00:01:05,040
manual editing. This particular small data
set, called “Employee Data”, was created

10
00:01:05,040 --> 00:01:09,950
some time ago and does not represent real
people. It is shipped with the product for

11
00:01:09,950 --> 00:01:12,690
use in demos and tutorials.

12
00:01:12,690 --> 00:01:18,880
At the bottom of the screen, we can see two
tabs: Data View and Variable View. In the

13
00:01:18,880 --> 00:01:24,869
Variable View, we can see and edit the information
about all variables, including names, labels,

14
00:01:24,869 --> 00:01:31,350
data types, and measurement levels. We can
also specify labels for values of categorical

15
00:01:31,350 --> 00:01:34,170
variables, and missing values.

16
00:01:34,170 --> 00:01:39,740
At the top of the data window is a menu. Under
File, if you select “Import Data,” you

17
00:01:39,740 --> 00:01:45,299
will see a list of a wide variety of data
formats that you can import. The product uses

18
00:01:45,299 --> 00:01:50,990
its own data file format with the extension
“.sav” that saves all the information

19
00:01:50,990 --> 00:01:57,200
about the variables we just saw in Variable
View. The menu enables importing from and

20
00:01:57,200 --> 00:02:00,349
exporting to many other formats.

21
00:02:00,349 --> 00:02:06,069
Under “Data,” you’ll find an extensive
menu of possible data operations. Note that

22
00:02:06,069 --> 00:02:12,380
Data Validation can be performed using user-defined
rules that specify the expected behavior of

23
00:02:12,380 --> 00:02:18,650
variable values. For example, if the day
and month are kept in separate columns, the

24
00:02:18,650 --> 00:02:24,810
day cannot exceed “31,” but for February,
the day can’t exceed “29.” A special

25
00:02:24,810 --> 00:02:30,470
rule can therefore be created and applied
during data validation. Additionally, you

26
00:02:30,470 --> 00:02:36,570
can enable some checks, such as the percentage
of missing values in a record or in a field.
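
As a rough illustration of the validation rules just described, here is a minimal Python sketch. It is not SPSS's own rule syntax; the names MAX_DAY, day_is_valid, and percent_missing are hypothetical, and leap years are ignored, so February is simply capped at 29.

```python
# Hypothetical validation helpers; these only mimic the kind of rule
# described in the narration and are not part of SPSS.

# Largest allowed day number for each month (February capped at 29).
MAX_DAY = {1: 31, 2: 29, 3: 31, 4: 30, 5: 31, 6: 30,
           7: 31, 8: 31, 9: 30, 10: 31, 11: 30, 12: 31}


def day_is_valid(day, month):
    """Return True when the day is within the limit for the given month."""
    limit = MAX_DAY.get(month)
    return limit is not None and 1 <= day <= limit


def percent_missing(values):
    """Percentage of None entries in a record (row) or a field (column)."""
    if not values:
        return 0.0
    return 100.0 * sum(v is None for v in values) / len(values)


print(day_is_valid(30, 2))                  # day 30 in February breaks the rule
print(percent_missing([1, None, 3, None]))  # share of missing entries
```

The same percent_missing helper can be applied to one row or to one column, which mirrors the "in a record or in a field" check.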

27
00:02:36,570 --> 00:02:41,280
When you click the “Transform” menu item,
you’ll find a variety of available data

28
00:02:41,280 --> 00:02:45,480
transformations.
Under “Compute Variable…” you can write

29
00:02:45,480 --> 00:02:52,000
a formula for a new variable based on existing
variables. You can use any of the many mathematical

30
00:02:52,000 --> 00:02:55,840
and statistical functions available in the
product.
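
The idea behind “Compute Variable…” can be sketched in Python as follows. This is only an analogy, not SPSS syntax: the helper compute_variable, the field names salary and salbegin, the new field raise_amount, and the numbers in the sample rows are all made up for illustration.

```python
# Hypothetical analogue of "Compute Variable": derive a new field from
# existing fields using a formula. Sample values are invented.

def compute_variable(records, name, formula):
    """Return copies of the records with a new field set to formula(record)."""
    return [{**record, name: formula(record)} for record in records]


rows = [
    {"salary": 57000, "salbegin": 27000},
    {"salary": 40200, "salbegin": 18750},
]

# New variable: difference between current and starting salary.
rows = compute_variable(rows, "raise_amount",
                        lambda r: r["salary"] - r["salbegin"])
print(rows[0]["raise_amount"])
```

Returning new dictionaries instead of modifying the originals keeps the source data untouched, which makes the formula easy to change and rerun.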

31
00:02:55,840 --> 00:03:03,230
You also have the option to use automatic
data preparation, similar to Modeler.

32
00:03:03,230 --> 00:03:07,870
In the “Analyze” menu, you will see many
types of statistical and machine learning

33
00:03:07,870 --> 00:03:13,490
analysis. Under “Regression,” there are
a variety of regression-related models. There

34
00:03:13,490 --> 00:03:18,760
are other kinds of regressions that appear
separately on the Analyze menu, including

35
00:03:18,760 --> 00:03:25,780
General Linear Model, Generalized Linear Models,
Mixed Models, and Loglinear.

36
00:03:25,780 --> 00:03:30,980
Now let’s build a decision-tree model on
the data. For this exercise we’ll try to

37
00:03:30,980 --> 00:03:36,910
predict the “Employment Category” field based
on other fields. In the “Analyze” menu,

38
00:03:36,910 --> 00:03:41,720
select “Classify” and then “Tree”.
In the Decision Tree window, we can

39
00:03:41,720 --> 00:03:47,700
specify the dependent variable “Employment
Category,” and use most other fields -- except

40
00:03:47,700 --> 00:03:54,620
id and bdate -- as predictors, or independent
variables. Usually the ID variable should

41
00:03:54,620 --> 00:03:59,760
not be used as a predictor, because it will
not help with new cases, and the birthdate

42
00:03:59,760 --> 00:04:05,430
does not seem to be a useful predictor in
this example either. We’ll select “Exhaustive

43
00:04:05,430 --> 00:04:10,209
CHAID” as our Growing Method, although there
are also three other options available. Data

44
00:04:10,209 --> 00:04:11,209
scientists often try many different models
to see which one works best for their data.

45
00:04:11,209 --> 00:04:12,209
Here we are just looking at one example model
in order to illustrate how the product works.

46
00:04:12,209 --> 00:04:17,150
Click the “Validation” button to open
the Decision Tree Validation window. Here,

47
00:04:17,150 --> 00:04:23,310
we select “Split-sample validation” to
make sure we test the model on new data. Click

48
00:04:23,310 --> 00:04:28,620
“OK” in the Decision Tree window to
generate the output, including the tree diagram

49
00:04:28,620 --> 00:04:33,919
shown here. A classification table
is also displayed that shows how well the

50
00:04:33,919 --> 00:04:41,570
model works on training and test data. In
this case, the accuracy is 91.2% on training

51
00:04:41,570 --> 00:04:49,949
data and only 85.6% on test data, which means
the model does not generalize to new data

52
00:04:49,949 --> 00:04:55,990
very well. It’s possible that by using different
models, we can get better results.
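
To make the split-sample idea concrete, here is a minimal Python sketch, not the procedure SPSS itself runs. The helpers split_sample and accuracy are hypothetical, and the 50/50 split is an assumption; only the two accuracy percentages come from the demo.

```python
import random

# Hypothetical helpers illustrating split-sample validation.

def split_sample(records, train_fraction=0.5, seed=0):
    """Shuffle the records and split them into training and test samples."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(actual, predicted):
    """Percentage of cases where the predicted category matches the actual one."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)


train, test = split_sample(list(range(10)))
print(len(train), len(test))

# Gap between the training and test accuracy reported in the demo.
gap = round(91.2 - 85.6, 1)
print(gap)
```

A gap of several percentage points between training and test accuracy is the signal the narration refers to when it says the model does not generalize very well.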

53
00:04:55,990 --> 00:05:01,330
Let’s move to the next menu item. When you
click “Graphs,” you’ll open a versatile

54
00:05:01,330 --> 00:05:05,409
Chart Builder, in addition to several other
options.

55
00:05:05,409 --> 00:05:10,400
The Chart Builder enables us to choose a style
from the gallery and to drag required fields

56
00:05:10,400 --> 00:05:14,870
onto the canvas, select colors, and choose
from other options.

57
00:05:14,870 --> 00:05:21,259
Here’s an example after we drag the “Previous
Experience,” “Current Salary,” and “Gender”

58
00:05:21,259 --> 00:05:27,949
variables to the corresponding slots to define
the axes and colors for the dots on the chart.

59
00:05:27,949 --> 00:05:33,639
The plot in the canvas is not based on real
data; this example simply gives you an idea

60
00:05:33,639 --> 00:05:37,130
of what to expect.

61
00:05:37,130 --> 00:05:41,669
Here is the real plot obtained from the data
that we’ve been using. It shows differently

62
00:05:41,669 --> 00:05:46,479
colored dots for each gender, and regression lines
that show the relationship of the current

63
00:05:46,479 --> 00:05:50,569
salary to previous experience for each gender.

64
00:05:50,569 --> 00:05:56,050
Throughout IBM SPSS Statistics, you’ll see
a “Paste” button. When you click the “Paste”

65
00:05:56,050 --> 00:06:01,520
button, instead of executing the task right
away, the application will open another window,

66
00:06:01,520 --> 00:06:08,530
called the Syntax editor. Here, you can see
the code, called “syntax,” pasted for you.

67
00:06:08,530 --> 00:06:12,330
SPSS syntax is a special programming language.

68
00:06:12,330 --> 00:06:18,720
For example, here is the code for the decision
tree we just built. Once we have the syntax,

69
00:06:18,720 --> 00:06:24,279
we can execute it, manually edit it, store
it for later use, or send it to other users

70
00:06:24,279 --> 00:06:32,410
of IBM SPSS Statistics. Experienced SPSS users
can write the code from scratch, while others

71
00:06:32,410 --> 00:06:40,199
might prefer to have it generated by the graphical
interface. Remember, the option to paste syntax

72
00:06:40,199 --> 00:06:45,319
is available throughout the program.
If the syntax is generated for all the steps

73
00:06:45,319 --> 00:06:51,720
in a data analytics process -- opening the
data set, applying any data transformations,

74
00:06:51,720 --> 00:06:57,759
building models -- and then saved as a syntax
file with the extension “.sps”, it’s

75
00:06:57,759 --> 00:07:04,969
similar to saving a stream in IBM SPSS Modeler.
However, one important difference is that

76
00:07:04,969 --> 00:07:10,740
it does not provide an easy way of scoring
new records with the model. We’ll talk about

77
00:07:10,740 --> 00:07:14,110
different ways to deploy models in the next
section.

78
00:07:14,110 --> 00:07:21,240
You’ve learned how IBM SPSS Statistics helps
data scientists to analyze their data using

79
00:07:21,240 --> 00:07:27,810
many statistical and machine learning techniques.
Using a graphical user interface, we can create

80
00:07:27,810 --> 00:07:33,159
complicated analyses that can be saved in
the form of syntax and reused later.

81
00:07:33,159 --> 00:07:38,689
Next, we will talk about predictive model
deployment, an important part of the overall

82
00:07:38,689 --> 00:07:40,229
data science lifecycle.