In this video, we will review several data science libraries. Libraries are collections of functions and methods that enable you to perform a wide variety of actions without writing the code yourself.

We will focus on Python libraries: scientific computing libraries in Python; visualization libraries in Python; high-level machine learning and deep learning libraries ("high-level" simply means you don't have to worry about the details, although this makes them harder to study or improve); deep learning libraries in Python; and libraries used in other languages.

Libraries usually contain built-in modules providing different functionalities that you can use directly; these are sometimes called "frameworks." There are also extensive libraries offering a broad range of facilities.

Pandas offers data structures and tools for effective data cleaning, manipulation, and analysis. It provides tools to work with different types of data. The primary instrument of Pandas is a two-dimensional table consisting of columns and rows. This table is called a "DataFrame" and is designed to provide easy indexing so you can work with your data.

NumPy is based on arrays, enabling you to apply mathematical functions to those arrays.
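To make the DataFrame idea concrete, here is a minimal sketch using Pandas and NumPy; the column names and values are invented for illustration:

```python
import numpy as np
import pandas as pd

# A DataFrame is a two-dimensional table of rows and columns.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [85, 92, 78],
})

# Easy indexing: select a column, then filter rows by a condition.
high_scores = df[df["score"] > 80]

# Pandas is built on NumPy, so NumPy functions apply directly
# to the underlying array of values.
mean_score = np.mean(df["score"].to_numpy())
```

Here `high_scores` keeps only the rows where the condition holds, and `mean_score` is 85.0 for this toy data.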
Pandas is actually built on top of NumPy.

Data visualization methods are a great way to communicate with others and show the meaningful results of an analysis. These libraries enable you to create graphs, charts, and maps. The Matplotlib package is the most well-known library for data visualization, and it's excellent for making graphs and plots. The graphs are also highly customizable. Another high-level visualization library, Seaborn, is based on Matplotlib. Seaborn makes it easy to generate plots like heat maps, time series, and violin plots.

For machine learning, the Scikit-learn library contains tools for statistical modeling, including regression, classification, clustering, and others. It is built on NumPy, SciPy, and Matplotlib, and it's relatively simple to get started with. In this high-level approach, you define the model and specify the parameter types you would like to use.

For deep learning, Keras enables you to build standard deep learning models. Like Scikit-learn, its high-level interface enables you to build models quickly and simply. It can run on graphics processing units (GPUs), but for many deep learning cases a lower-level environment is required.

TensorFlow is a low-level framework used in large-scale production of deep learning models.
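As a sketch of the "define the model, then specify its parameters" workflow described above, here is a tiny Scikit-learn example; the dataset is invented for illustration:

```python
from sklearn.linear_model import LinearRegression

# Toy dataset following y = 2x, invented for illustration.
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

# Define the model and specify the parameters you want, then fit it.
model = LinearRegression(fit_intercept=True)
model.fit(X, y)

# Predict on unseen input.
prediction = model.predict([[5]])[0]
```

Because the toy data is exactly linear, the fitted model predicts 10 for an input of 5; the same define-fit-predict pattern applies to classification and clustering models in Scikit-learn.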
It's designed for production but can be unwieldy for experimentation. PyTorch is used for experimentation, making it simple for researchers to test their ideas.

Apache Spark is a general-purpose cluster-computing framework that enables you to process data using compute clusters. This means that you process data in parallel, using multiple computers simultaneously. The Spark library has functionality similar to Pandas, NumPy, and Scikit-learn. Apache Spark data processing jobs can use Python, R, Scala, or SQL.

There are many libraries for Scala, which is predominantly used in data engineering but is also sometimes used in data science. Let's discuss some of the libraries that are complementary to Spark. Vegas is a Scala library for statistical data visualizations. With Vegas, you can work with data files as well as Spark DataFrames. For deep learning, you can use BigDL.

R has built-in functionality for machine learning and data visualization, but there are also several complementary libraries: ggplot2 is a popular library for data visualization in R. You can also use libraries that enable you to interface with Keras and TensorFlow. R has been the de facto standard for open-source data science, but it is now being superseded by Python.
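Running Spark itself requires a cluster (or at least the pyspark package), so as a rough local analogy only, Python's standard-library concurrent.futures shows the same idea of splitting a job across workers that run at the same time. This is not Spark's API; it just illustrates the parallel-processing model:

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    # Stand-in for one chunk of real data-processing work.
    return n * n

data = [1, 2, 3, 4, 5]

# Each element is handled by a separate worker, analogous to how
# Spark distributes partitions of a dataset across machines in a cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, data))
```

The key idea carries over: you describe the work per element (`square`), and the framework decides how to spread it across available workers; Spark does the same across many machines instead of local threads.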