Welcome to part two of this series. In this section, we'll cover development environments and open source data integration, transformation, and visualization tools.

One of the most popular development environments that data scientists are currently using is "Jupyter." Jupyter first emerged as a tool for interactive Python programming; it now supports more than a hundred different programming languages through "kernels." Kernels shouldn't be confused with operating system kernels. Jupyter kernels encapsulate the interactive interpreters for the different programming languages.

A key property of Jupyter Notebooks is the ability to unify documentation, code, output from the code, shell commands, and visualizations into a single document.
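To make that concrete, here is a minimal sketch of what a single notebook code cell might contain, assuming a Python kernel with NumPy and matplotlib installed; the data and titles are purely illustrative.

```python
# A minimal sketch of one Jupyter code cell (Python kernel).
# In a real notebook, the documentation would usually live in a
# Markdown cell above this one; here it is shown as comments.

# Shell commands can run inline via IPython's "!" syntax, e.g.:
# !pip install matplotlib

import numpy as np
import matplotlib.pyplot as plt

# Code: generate some sample data.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# Visualization: in a notebook, the figure renders directly below
# the cell, so documentation, code, output, and plot share one document.
plt.plot(x, y)
plt.title("sin(x)")
plt.show()
```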
JupyterLab is the next generation of Jupyter Notebooks and, in the long term, will actually replace Jupyter Notebooks. The architectural changes being introduced in JupyterLab make Jupyter more modern and modular. From a user's perspective, the main difference introduced by JupyterLab is the ability to open different types of files, including Jupyter Notebooks, data, and terminals. You can then arrange these files on the canvas.

Although Apache Zeppelin has been fully reimplemented, it's inspired by Jupyter Notebooks and provides a similar experience. One key differentiator is the integrated plotting capability. In Jupyter Notebooks, you are required to use external libraries; in Apache Zeppelin, plotting doesn't require coding. You can also extend these capabilities with additional libraries.

RStudio is one of the oldest development environments for statistics and data science, having been introduced in 2011. It runs R and all associated R libraries natively. Although Python development is also possible, R is tightly integrated into this tool to provide an optimal user experience. RStudio unifies programming, execution, debugging, remote data access, data exploration, and visualization into a single tool.

Spyder tries to mimic the behaviour of RStudio to bring its functionality to the Python world. Although Spyder does not have the same level of functionality as RStudio, data scientists do consider it an alternative. But in the Python world, Jupyter is used more frequently. This diagram shows how Spyder integrates code, documentation, visualizations, and other components into a single canvas.

Sometimes your data doesn't fit into a single computer's storage or main memory capacity. That's where cluster execution environments come in. The well-known cluster-computing framework Apache Spark is among the most active Apache projects and is used across all industries, including in many Fortune 500 companies. The key property of Apache Spark is linear scalability. This means that if you double the number of servers in a cluster, you'll also roughly double its performance. A short PySpark sketch illustrating this appears at the end of this section.

After Apache Spark began to gain market share, Apache Flink was created. The key difference between Apache Spark and Apache Flink is that Apache Spark is a batch data processing engine, capable of processing huge amounts of data file by file. Apache Flink, on the other hand, is a stream processing engine, with its main focus on processing real-time data streams. Although each engine supports both data processing paradigms, Apache Spark is usually the choice in most use cases.

One of the latest developments in data science execution environments is called "Ray," which has a clear focus on large-scale deep learning model training.

Let's look at open source tools for data scientists that are fully integrated and visual. With these tools, no programming knowledge is necessary. The most important tasks are supported by these tools; these tasks include data integration, transformation, data visualization, and model building.

KNIME originated at the University of Konstanz in 2004. As you can see, KNIME has a visual user interface with drag-and-drop capabilities. It also has built-in visualization capabilities. KNIME can be extended by programming in R and Python, and has connectors to Apache Spark. Another example of this group of tools is Orange. It's less flexible than KNIME, but easier to use.

In this video, you've learned about the most common data science tasks and which open source tools are relevant to those tasks. In the next video, we'll describe some established commercial tools that you'll encounter in your data science experience. Let's move on to the next video to get more details.
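As promised in the Apache Spark discussion above, here is a minimal PySpark sketch of a batch job. It assumes a local pyspark installation; the input file name and column names are hypothetical. The same code can run unchanged on a laptop or on a large cluster by changing only the master URL, which is the basis of Spark's scalability.

```python
# Minimal PySpark sketch: a batch job expressed once, executed in
# parallel across however many workers the cluster provides.
from pyspark.sql import SparkSession

# On a real cluster, the master URL would point at the cluster
# manager instead of local[*] (all local cores).
spark = SparkSession.builder \
    .appName("batch-example") \
    .master("local[*]") \
    .getOrCreate()

# Hypothetical input file; Spark splits it into partitions that are
# processed in parallel across the available workers.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# A simple aggregation, executed as a distributed batch job.
df.groupBy("region").sum("amount").show()

spark.stop()
```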