Welcome to part two of this series. In this section, we'll cover development environments and open source data integration, transformation, and visualization tools.

One of the most popular development environments that data scientists are currently using is "Jupyter." Jupyter first emerged as a tool for interactive Python programming; it now supports more than a hundred different programming languages through "kernels." Kernels shouldn't be confused with operating system kernels. Jupyter kernels encapsulate the interactive interpreters for the different programming languages.

A key property of Jupyter Notebooks is the ability to unify documentation, code, output from the code, shell commands, and visualizations into a single document.
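To make that concrete, here is a minimal sketch of what a single notebook code cell might contain, assuming a Python kernel with NumPy and matplotlib installed; the data and titles are purely illustrative.

```python
# A minimal sketch of one Jupyter code cell (Python kernel).
# In a real notebook, the documentation would usually live in a
# Markdown cell above this one; here it is shown as comments.

# Shell commands can run inline via IPython's "!" syntax, e.g.:
# !pip install matplotlib

import numpy as np
import matplotlib.pyplot as plt

# Code: generate some sample data.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# Visualization: in a notebook, the figure renders directly below
# the cell, so documentation, code, output, and plot share one document.
plt.plot(x, y)
plt.title("sin(x)")
plt.show()
```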
JupyterLab is the next generation of Jupyter Notebooks and, in the long term, will actually replace Jupyter Notebooks. The architectural changes being introduced in JupyterLab make Jupyter more modern and modular. From a user's perspective, the main difference introduced by JupyterLab is the ability to open different types of files, including Jupyter Notebooks, data, and terminals. You can then arrange these files on the canvas.

Although Apache Zeppelin has been fully reimplemented, it's inspired by Jupyter Notebooks and provides a similar experience. One key differentiator is the integrated plotting capability. In Jupyter Notebooks, you are required to use external libraries; in Apache Zeppelin, plotting doesn't require coding. You can also extend these capabilities with additional libraries.

RStudio is one of the oldest development environments for statistics and data science, having been introduced in 2011. It runs R and all associated R libraries natively. Although Python development is also possible, R is tightly integrated into this tool to provide an optimal user experience. RStudio unifies programming, execution, debugging, remote data access, data exploration, and visualization into a single tool.

Spyder tries to mimic the behaviour of RStudio to bring its functionality to the Python world. Although Spyder does not have the same level of functionality as RStudio, data scientists do consider it an alternative. But in the Python world, Jupyter is used more frequently. This diagram shows how Spyder integrates code, documentation, visualizations, and other components into a single canvas.

Sometimes your data doesn't fit into a single computer's storage or main memory capacity. That's where cluster execution environments come in. The well-known cluster-computing framework Apache Spark is among the most active Apache projects and is used across all industries, including in many Fortune 500 companies. The key property of Apache Spark is linear scalability. This means that if you double the number of servers in a cluster, you'll also roughly double its performance. A short PySpark sketch illustrating this appears at the end of this section.

After Apache Spark began to gain market share, Apache Flink was created. The key difference between Apache Spark and Apache Flink is that Apache Spark is a batch data processing engine, capable of processing huge amounts of data file by file. Apache Flink, on the other hand, is a stream processing engine, with its main focus on processing real-time data streams. Although each engine supports both data processing paradigms, Apache Spark is usually the choice in most use cases.

One of the latest developments in data science execution environments is called "Ray," which has a clear focus on large-scale deep learning model training.

Let's look at open source tools for data scientists that are fully integrated and visual. With these tools, no programming knowledge is necessary. The most important tasks are supported by these tools; these tasks include data integration, transformation, data visualization, and model building.

KNIME originated at the University of Konstanz in 2004. As you can see, KNIME has a visual user interface with drag-and-drop capabilities. It also has built-in visualization capabilities. KNIME can be extended by programming in R and Python, and has connectors to Apache Spark. Another example of this group of tools is Orange. It's less flexible than KNIME, but easier to use.

In this video, you've learned about the most common data science tasks and which open source tools are relevant to those tasks. In the next video, we'll describe some established commercial tools that you'll encounter in your data science experience. Let's move on to the next video to get more details.
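As promised in the Apache Spark discussion above, here is a minimal PySpark sketch of a batch job. It assumes a local pyspark installation; the input file name and column names are hypothetical. The same code can run unchanged on a laptop or on a large cluster by changing only the master URL, which is the basis of Spark's scalability.

```python
# Minimal PySpark sketch: a batch job expressed once, executed in
# parallel across however many workers the cluster provides.
from pyspark.sql import SparkSession

# On a real cluster, the master URL would point at the cluster
# manager instead of local[*] (all local cores).
spark = SparkSession.builder \
    .appName("batch-example") \
    .master("local[*]") \
    .getOrCreate()

# Hypothetical input file; Spark splits it into partitions that are
# processed in parallel across the available workers.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# A simple aggregation, executed as a distributed batch job.
df.groupBy("region").sum("amount").show()

spark.stop()
```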