In this video, we will review several data science libraries. Libraries are collections of functions and methods that enable you to perform a wide variety of actions without writing the code yourself.

We will focus on Python libraries: scientific computing libraries in Python; visualization libraries in Python; high-level machine learning and deep learning libraries ("high-level" simply means you don't have to worry about the details, although this makes them harder to study or improve); deep learning libraries in Python; and libraries used in other languages.

Libraries usually contain built-in modules providing different functionalities that you can use directly; these are sometimes called "frameworks." There are also extensive libraries offering a broad range of facilities.

Pandas offers data structures and tools for effective data cleaning, manipulation, and analysis. It provides tools to work with different types of data. The primary instrument of Pandas is a two-dimensional table consisting of columns and rows. This table is called a "DataFrame" and is designed to provide easy indexing so you can work with your data.

NumPy is based on arrays, enabling you to apply mathematical functions to those arrays.
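To make the DataFrame idea concrete, here is a minimal sketch using Pandas and NumPy; the column names and values are invented for illustration:

```python
import numpy as np
import pandas as pd

# A DataFrame is a two-dimensional table of rows and columns.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [85, 92, 78],
})

# Easy indexing: select a column, then filter rows by a condition.
high_scores = df[df["score"] > 80]

# Pandas is built on NumPy, so NumPy functions apply directly
# to the underlying array of values.
mean_score = np.mean(df["score"].to_numpy())
```

Here `high_scores` keeps only the rows where the condition holds, and `mean_score` is 85.0 for this toy data.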
Pandas is actually built on top of NumPy.

Data visualization methods are a great way to communicate with others and show the meaningful results of an analysis. These libraries enable you to create graphs, charts, and maps. The Matplotlib package is the most well-known library for data visualization, and it's excellent for making graphs and plots. The graphs are also highly customizable. Another high-level visualization library, Seaborn, is based on Matplotlib. Seaborn makes it easy to generate plots like heat maps, time series, and violin plots.

For machine learning, the Scikit-learn library contains tools for statistical modeling, including regression, classification, clustering, and others. It is built on NumPy, SciPy, and Matplotlib, and it's relatively simple to get started with. In this high-level approach, you define the model and specify the parameter types you would like to use.

For deep learning, Keras enables you to build standard deep learning models. Like Scikit-learn, its high-level interface enables you to build models quickly and simply. It can run on graphics processing units (GPUs), but for many deep learning cases a lower-level environment is required.

TensorFlow is a low-level framework used in large-scale production of deep learning models.
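As a sketch of the "define the model, then specify its parameters" workflow described above, here is a tiny Scikit-learn example; the dataset is invented for illustration:

```python
from sklearn.linear_model import LinearRegression

# Toy dataset following y = 2x, invented for illustration.
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

# Define the model and specify the parameters you want, then fit it.
model = LinearRegression(fit_intercept=True)
model.fit(X, y)

# Predict on unseen input.
prediction = model.predict([[5]])[0]
```

Because the toy data is exactly linear, the fitted model predicts 10 for an input of 5; the same define-fit-predict pattern applies to classification and clustering models in Scikit-learn.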
It's designed for production but can be unwieldy for experimentation. PyTorch is used for experimentation, making it simple for researchers to test their ideas.

Apache Spark is a general-purpose cluster-computing framework that enables you to process data using compute clusters. This means that you process data in parallel, using multiple computers simultaneously. The Spark library has functionality similar to Pandas, NumPy, and Scikit-learn. Apache Spark data processing jobs can use Python, R, Scala, or SQL.

There are many libraries for Scala, which is predominantly used in data engineering but is also sometimes used in data science. Let's discuss some of the libraries that are complementary to Spark. Vegas is a Scala library for statistical data visualizations. With Vegas, you can work with data files as well as Spark DataFrames. For deep learning, you can use BigDL.

R has built-in functionality for machine learning and data visualization, but there are also several complementary libraries: ggplot2 is a popular library for data visualization in R. You can also use libraries that enable you to interface with Keras and TensorFlow. R has been the de facto standard for open-source data science, but it is now being superseded by Python.
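Running Spark itself requires a cluster (or at least the pyspark package), so as a rough local analogy only, Python's standard-library concurrent.futures shows the same idea of splitting a job across workers that run at the same time. This is not Spark's API; it just illustrates the parallel-processing model:

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    # Stand-in for one chunk of real data-processing work.
    return n * n

data = [1, 2, 3, 4, 5]

# Each element is handled by a separate worker, analogous to how
# Spark distributes partitions of a dataset across machines in a cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, data))
```

The key idea carries over: you describe the work per element (`square`), and the framework decides how to spread it across available workers; Spark does the same across many machines instead of local threads.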