In part one of this two-part series, we’ll cover open source tools for data management, data integration and transformation, and data visualization.

The most widely used open source data management tools are relational databases such as MySQL and PostgreSQL; NoSQL databases such as MongoDB, Apache CouchDB, and Apache Cassandra; and file-based tools such as the Hadoop Distributed File System (HDFS) or cloud file systems like Ceph. Finally, Elasticsearch is mainly used for storing text data and creating a search index for fast document retrieval.

The task of data integration and transformation in the classic data warehousing world is called ETL, which stands for “extract, transform, and load.” These days, data scientists often propose the term “ELT” (extract, load, transform), stressing the fact that data is first simply dumped somewhere and the data engineer or data scientist is then responsible for transforming it. Another term for this process has now emerged: “data refinery and cleansing.”

Here are the most widely used open source data integration and transformation tools: Apache Airflow, originally created by Airbnb; Kubeflow, which enables you to execute data science pipelines on top of Kubernetes; Apache Kafka, which originated at LinkedIn; Apache NiFi, which delivers a very nice visual editor; Apache SparkSQL, which enables you to use ANSI SQL and scales up to compute clusters of thousands of nodes; and Node-RED, which also provides a visual editor. Node-RED consumes so few resources that it even runs on small devices like a Raspberry Pi. (You’ll find brief sketches of some of these tools at the end of this segment.)

We’ll now introduce the most widely used open source data visualization tools. We have to distinguish between programming libraries, where you need to write code, and tools that provide a user interface. The most popular libraries are covered in the next videos. On the user interface side, Hue can create visualizations from SQL queries. Kibana, a data exploration and visualization web application, is limited to Elasticsearch as its data source. Finally, Apache Superset is a data exploration and visualization web application.
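To make the ideas above concrete, here are a few brief sketches. First, storing text and searching it with Elasticsearch. This is a minimal illustration, assuming a local cluster on port 9200 and the official 8.x Python client; the index name and documents are hypothetical.

```python
from elasticsearch import Elasticsearch

# Assumes a local Elasticsearch cluster and the official 8.x Python client.
es = Elasticsearch("http://localhost:9200")

# Index a hypothetical document; Elasticsearch builds the search index automatically.
es.index(index="articles", id=1, document={
    "title": "Open source data tools",
    "body": "Elasticsearch creates a search index for fast document retrieval.",
})
es.indices.refresh(index="articles")  # make the new document visible to search

# Full-text search against the indexed body field.
hits = es.search(index="articles", query={"match": {"body": "search index"}})
print(hits["hits"]["total"])
```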
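Next, a minimal sketch of an ELT-style pipeline in Apache Airflow, assuming Airflow 2.x; the three task functions are hypothetical stand-ins for real pipeline logic. Note the task order: the raw data is loaded first and transformed afterwards, which is exactly the ELT idea described above.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical callables standing in for real extract/load/transform logic.
def extract():
    print("pull raw data from the source system")

def load():
    print("dump the raw data into the target store")

def transform():
    print("clean and reshape the data where it now lives")

with DAG(dag_id="elt_demo", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)

    # ELT: load the raw data before transforming it.
    t_extract >> t_load >> t_transform
```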
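And a short SparkSQL sketch: the same ANSI SQL that runs here on a laptop would run unchanged on a cluster of thousands of nodes. The CSV file and its columns are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-demo").getOrCreate()

# Hypothetical input file with `region` and `amount` columns.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)
sales.createOrReplaceTempView("sales")

# Plain ANSI SQL; Spark distributes the query across the cluster.
spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
    ORDER BY total_amount DESC
""").show()
```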
Model deployment is extremely important. Once you’ve created a machine learning model capable of predicting some key aspects of the future, you should make that model consumable by other developers by turning it into an API. Apache PredictionIO currently only supports Apache Spark ML models for deployment, but support for all sorts of other libraries is on the roadmap. Seldon is an interesting product since it supports nearly every framework, including TensorFlow, Apache SparkML, R, and scikit-learn. Seldon can run on top of Kubernetes and Red Hat OpenShift. Another way to deploy SparkML models is by using MLeap. Finally, TensorFlow can serve any of its models using TensorFlow Serving. You can deploy to an embedded device like a Raspberry Pi or a smartphone using TensorFlow Lite, and even deploy to a web browser using TensorFlow.js.

Model monitoring is another crucial step. Once you’ve deployed a machine learning model, you need to keep track of its prediction performance as new data arrives, so that you can detect and replace outdated models. Here are some examples of model monitoring tools: ModelDB is a machine learning model metadata database where information about the models is stored and can be queried. It natively supports Apache Spark ML Pipelines and scikit-learn. A generic, multi-purpose tool called Prometheus is also widely used for machine learning model monitoring, although it wasn’t made specifically for this purpose.

Model performance is not exclusively measured through accuracy. Model bias against protected groups like gender or race is also important. The IBM AI Fairness 360 open source toolkit does exactly this: it detects and mitigates bias in machine learning models.

Machine learning models, especially neural-network-based deep learning models, can be subject to adversarial attacks, where an attacker tries to fool the model with manipulated data or by manipulating the model itself. The IBM Adversarial Robustness 360 Toolbox can be used to detect vulnerability to adversarial attacks and to help make the model more robust.

Machine learning models are often considered to be black boxes that apply some mysterious “magic.” The IBM AI Explainability 360 Toolkit makes the machine learning process more understandable by finding similar examples within a dataset that can be presented to a user for manual comparison. It can also illustrate, via a simpler machine learning model, how different input variables affect the final decision of the model. (Brief sketches of these deployment, monitoring, fairness, and robustness ideas follow below.)
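First, turning a model into an API. This is a generic illustration using Flask and a pickled scikit-learn model, not how PredictionIO or Seldon do it; the file name and feature layout are hypothetical.

```python
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical pre-trained scikit-learn model serialized to disk.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [5.1, 3.5, 1.4, 0.2]}.
    features = request.get_json()["features"]
    prediction = model.predict([features])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A POST to /predict with a JSON feature vector then returns the model’s prediction to any other developer’s application.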
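Deploying to a small device with TensorFlow Lite boils down to converting a trained model. A minimal sketch, assuming a model already saved in the SavedModel format at a hypothetical path:

```python
import tensorflow as tf

# Convert a trained model (hypothetical path) into the TensorFlow Lite format.
converter = tf.lite.TFLiteConverter.from_saved_model("export/my_model/1")
tflite_model = converter.convert()

# The resulting flat buffer can be shipped to a Raspberry Pi or a smartphone.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```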
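For monitoring, here is a sketch of how a model server might expose metrics for Prometheus to scrape, using the official prometheus_client package; the metric names and the toy model are hypothetical.

```python
from prometheus_client import Counter, Histogram, start_http_server
from sklearn.linear_model import LogisticRegression

# Toy model so the sketch is self-contained.
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

# Hypothetical metrics; Prometheus scrapes them from http://<host>:8000/metrics.
PREDICTIONS = Counter("model_predictions_total", "Predictions served")
LATENCY = Histogram("model_prediction_seconds", "Time spent per prediction")

@LATENCY.time()  # records how long each call takes
def predict(features):
    PREDICTIONS.inc()
    return model.predict([features])

if __name__ == "__main__":
    start_http_server(8000)  # expose the /metrics endpoint
    print(predict([1.5]))
```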
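To check a dataset or model output for bias along a protected attribute, AI Fairness 360 provides ready-made metrics. A toy sketch, assuming the aif360 package; the dataframe and group encoding are hypothetical.

```python
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

# Hypothetical toy data: `sex` is the protected attribute (1 = privileged group),
# `label` is the outcome (1 = favorable).
df = pd.DataFrame({
    "sex":   [0, 0, 0, 1, 1, 1],
    "age":   [25, 32, 47, 29, 51, 38],
    "label": [0, 1, 0, 1, 1, 1],
})

dataset = BinaryLabelDataset(df=df, label_names=["label"],
                             protected_attribute_names=["sex"])
metric = BinaryLabelDatasetMetric(dataset,
                                  privileged_groups=[{"sex": 1}],
                                  unprivileged_groups=[{"sex": 0}])

# A disparate impact ratio well below 1.0 is commonly read as a warning sign.
print("disparate impact:", metric.disparate_impact())
print("statistical parity difference:", metric.statistical_parity_difference())
```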
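Finally, a sketch of probing a model for adversarial vulnerability with the Adversarial Robustness Toolbox (the adversarial-robustness-toolbox package, imported as `art`); the toy data and attack strength are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from art.attacks.evasion import FastGradientMethod
from art.estimators.classification import SklearnClassifier

# Toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = LogisticRegression().fit(X, y)
classifier = SklearnClassifier(model=model)

# Craft adversarially perturbed inputs and see how far accuracy drops.
attack = FastGradientMethod(estimator=classifier, eps=1.0)
X_adv = attack.generate(x=X)

print("clean accuracy:      ", model.score(X, y))
print("adversarial accuracy:", model.score(X_adv, y))
```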
Options for code asset management tools have been greatly simplified. For code asset management – also referred to as version management or version control – Git is now the standard. Multiple services have emerged to support Git, the most prominent being GitHub, which provides hosting for software development version management. The runner-up is definitely GitLab, which has the advantage of being a fully open source platform that you can host and manage yourself. Another choice is Bitbucket.

Data asset management, also known as data governance or data lineage, is another crucial part of enterprise-grade data science. Data has to be versioned and annotated with metadata. Apache Atlas is a tool that supports this task. Another interesting project, ODPi Egeria, is managed through the Linux Foundation and is an open ecosystem: it offers a set of open APIs, types, and interchange protocols that metadata repositories use to share and exchange data. Finally, Kylo is an open source data lake management software platform that provides extensive support for a wide range of data asset management tasks.

This concludes part one of this two-part series. Now let’s move on to part two.