In this lesson we will discuss two products that are very helpful for data scientists. Both came to IBM with the SPSS acquisition in 2009. First is IBM SPSS Modeler. Let's review the different tool categories we discussed previously. IBM SPSS Modeler includes data management capabilities and tools for data preparation, visualization, model building, and model deployment. The product was created by Integral Solutions Limited in the United Kingdom in 1994 and was originally called Clementine. It was acquired by a company called SPSS in 1998, and SPSS was in turn acquired by IBM in 2009.

SPSS Modeler is a data mining and text analytics software application. It's used to build predictive models and conduct other analytics tasks. It has a visual interface that enables users to leverage statistical and data mining algorithms without programming. One of its main goals from the beginning was to make complex predictive modeling pipelines easily accessible. A sample Modeler stream shown here includes one round data source node, three triangular graph nodes, one hexagonal node for computing a new variable, and a square node for an output table. Below the canvas, we can see the rich node palette with separate tabs for data sources, record and field operations, graphs, models, output, and so on. Nodes in different tabs have different shapes, with pentagons used for modeling nodes.

Let's examine the sample stream that comes as an example with the product. It starts with a data set of telecommunications records, and the goal is to build a model to predict which customers are about to leave the service, otherwise known as churn. The data source is shown by the round node on the left side. A hexagon-type node typically follows a data source node; it enables us to specify roles (target, predictor, or none) and measurement levels (such as continuous, nominal, or flag) for all variables. The term flag is used to denote a variable with two categories, one of which can be considered positive and the other negative. In this example, the measurement level for the churn field is set to flag and its role is set to target. All other fields are set as predictors, or inputs. The original data set has many fields, and some of them are not relevant to the target variable, so we first need to decide which fields are more useful as predictors. There is a Feature Selection modeling node that helps to do this.
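Modeler performs this step visually, but for readers who think in code, here is a minimal scikit-learn sketch of the same idea: screening candidate predictors of churn with a univariate score. The file name, column names, and the choice of scoring function are assumptions for illustration, not part of the Modeler example.

    # Rough stand-in for Modeler's Feature Selection node: rank numeric
    # candidate predictors of churn by a univariate F-test score.
    import pandas as pd
    from sklearn.feature_selection import SelectKBest, f_classif

    df = pd.read_csv("telco.csv")                       # hypothetical file
    X = df.drop(columns=["churn"]).select_dtypes("number")
    X = X.fillna(X.mean())          # crude fill so the screening test runs
    y = df["churn"]

    selector = SelectKBest(score_func=f_classif, k=10)  # keep the 10 best fields
    selector.fit(X, y)
    print("Fields kept as predictors:", list(X.columns[selector.get_support()]))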
After the stream with the Feature Selection node is executed, a yellow model nugget is created below it in the flow diagram. Using that nugget, we can generate a Filter node that filters out the variables that are not good predictors for the target. The Data Audit node, located below the Filter node, shows various properties of the data, such as the number of outliers in each variable and the percentage of valid values. It can also help create a special node for missing value imputation, that is, replacing missing values of a variable with valid values that can be selected based on domain knowledge. Here, the variable log toll has more than 50% missing values, and we will specify a value, the mean, to replace them. A supernode in Modeler is a special node that is not found in the palette but is created by the user, with special functions included in it. The Data Audit node enables us to create a supernode for imputing missing values. It is shaped as a star and shown on the right of the screen.

Finally, we attach the Logistic Regression model node to the stream and click Run. Another model nugget appears, and by clicking it we can see various model information and other output. In the output window that opens when we click on the model nugget, the Summary tab shows the target, the inputs, and some model-building settings. Based on certain advanced output settings that were specified before the model was built, we can also see a classification table, accuracy, and some other generated outputs for the model. Note that these results are based on training data only. To assess how well the model generalizes to other real-world data, you should always use a Partition node to hold out a subset of records for testing and validation. Then, in the model setup screen, select the Use partitioned data check box. This will help detect and avoid model overfitting. Overfitting is defined as having significantly higher accuracy on the training data, the data used to build the model, than on test or unseen data.

The yellow model nugget added earlier can also be used to compute predictions, also called scores, on the original data or on a new data source. All we need to do is connect the data source in question to the nugget, make sure it has the predictor variables used in the model, and create an output to a table or other structure for storing the scores. We can also specify settings for scoring inside the model nugget.
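Purely as an illustration: the imputation, partitioning, model fitting, and overfitting check described above map roughly onto the following scikit-learn sketch. The file and column names are hypothetical, and the mean imputation is placed inside a pipeline so that, as in Modeler, the same transformation is reapplied when new data is scored.

    # Rough code equivalent of the stream: mean imputation, a held-out
    # partition, a logistic regression model, and a train-vs-test
    # accuracy comparison to watch for overfitting.
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    df = pd.read_csv("telco.csv")                       # hypothetical file
    X = df.drop(columns=["churn"]).select_dtypes("number")
    y = df["churn"]

    # Partition node equivalent: hold out 30% of records for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    model = make_pipeline(SimpleImputer(strategy="mean"),   # supernode step
                          LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    print("train accuracy:", model.score(X_train, y_train))
    print("test accuracy: ", model.score(X_test, y_test))

    # Model nugget equivalent: score records with the fitted model.
    scores = model.predict(X_test)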
Note that if the model was built on transformed predictor data, the same data transformation steps must be applied to new data before it can be scored by the model. The Analysis node is the final node in the stream. It attaches to a model nugget, and when executed it computes model evaluation metrics, such as a confusion matrix and accuracy (a code sketch of this evaluation step appears at the end of this lesson).

In this example we've only looked at a logistic regression model. IBM SPSS Modeler offers a rich modeling palette that includes many classification, regression, clustering, association rule, and other models. It also contains a large selection of data source types, data transformations, graphs, and output nodes. And we haven't even talked about text analytics, entity resolution, and many other features of the product that can be extremely helpful to data scientists. We could create an entire course on IBM SPSS Modeler alone. You've learned how IBM SPSS Modeler helps analysts create powerful machine learning pipelines using a graphical interface. Next, we will talk about the original SPSS product, now called IBM SPSS Statistics.
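As promised above, here is a minimal stand-in for the Analysis node, reusing the model, X_test, and y_test objects from the previous sketch; it is an analogy for the node's behavior, not Modeler code.

    # Analysis node stand-in: confusion matrix and accuracy on test data.
    from sklearn.metrics import accuracy_score, confusion_matrix

    y_pred = model.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print("accuracy:", accuracy_score(y_test, y_pred))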