1 00:00:07,760 --> 00:00:09,580 Hi, I'm Sonali Surange Dev. 2 00:00:11,840 --> 00:00:14,019 Data scientists often end up spending a lot 3 00:00:14,019 --> 00:00:19,410 of time doing mundane tasks like cleansing, shaping and preparing data. 4 00:00:19,410 --> 00:00:23,230 Typically these tasks are roadblocks to starting the more enjoyable part of 5 00:00:23,230 --> 00:00:27,160 analyzing the data sets or building and training machine learning models. 6 00:00:28,170 --> 00:00:33,579 This is because data sets typically are not in a format that can be readily used. 7 00:00:34,579 --> 00:00:39,030 They first need to be cleansed and refined before they are usable by a data scientist. 8 00:00:40,030 --> 00:00:44,640 IBM Data Refinery addresses this issue and simplifies the task of refining data and 9 00:00:44,640 --> 00:00:46,370 its workflows. 10 00:00:46,370 --> 00:00:51,210 It provides a self-service data preparation environment where you 11 00:00:51,210 --> 00:00:55,329 can quickly analyze, cleanse and prepare data sets. 12 00:00:55,329 --> 00:00:56,809 Data Refinery is available with 13 00:00:56,809 --> 00:01:01,579 Watson Studio on public cloud, private cloud and desktop. 14 00:01:01,579 --> 00:01:02,579 In the rest of the 15 00:01:02,579 --> 00:01:05,920 video we will walk through a scenario and see Data Refinery in action. 16 00:01:05,920 --> 00:01:06,920 In this 17 00:01:06,920 --> 00:01:09,590 scenario we will use Data Refinery to find the best deals 18 00:01:09,590 --> 00:01:12,390 using data about discounts offered over time. 19 00:01:12,390 --> 00:01:14,340 We will then automate the 20 00:01:14,340 --> 00:01:18,870 analysis to run on a regular schedule. 21 00:01:18,870 --> 00:01:20,531 Before the data scientist starts, she looks at the data 22 00:01:20,531 --> 00:01:26,710 distribution and notices that the inSale column is missing data.
23 00:01:26,710 --> 00:01:33,869 She visualizes the offer column and notices that it contains valuable 24 00:01:33,869 --> 00:01:38,020 information about discounts. 25 00:01:38,020 --> 00:01:42,640 Many fields contain percent-off information, 26 00:01:42,640 --> 00:01:48,310 some contain references to a previous price, indicating a new reduced price 27 00:01:48,310 --> 00:01:49,310 being available. 28 00:01:49,310 --> 00:01:55,040 She decides to derive inSale from offer. 29 00:01:55,040 --> 00:02:01,570 She uses a conditional replace operation to derive whether the product is on 30 00:02:01,570 --> 00:02:08,830 sale. 31 00:02:08,830 --> 00:02:12,960 Next she uses a filter operation 32 00:02:12,960 --> 00:02:21,610 to eliminate deals that are not on sale. 33 00:02:21,610 --> 00:02:27,920 She then wants to pick up the bargains. 34 00:02:27,920 --> 00:02:30,050 She uses the replace substring operation 35 00:02:30,050 --> 00:02:42,290 and provides a pattern that extracts the discounts from the offer. 36 00:02:42,290 --> 00:02:43,670 After 37 00:02:43,670 --> 00:02:48,740 converting the discount values to a decimal 38 00:02:48,740 --> 00:02:54,400 she can visually see the discounts that were available. 39 00:02:54,400 --> 00:02:55,400 She needs to find the 40 00:02:55,400 --> 00:02:57,800 months that offered the best deals. 41 00:02:57,800 --> 00:03:01,310 She visualizes the dateUpdated column and notices 42 00:03:01,310 --> 00:03:07,510 that the date field has a variety of formats, some with dashes, some with 43 00:03:07,510 --> 00:03:13,430 slashes and some with months as text. 44 00:03:13,430 --> 00:03:15,630 She hopes that Data Refinery can normalize the 45 00:03:15,630 --> 00:03:18,370 data and extract a month. 46 00:03:18,370 --> 00:03:23,920 She uses the convert column operation to convert to 47 00:03:23,920 --> 00:03:28,470 date and selects ymd.
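For readers following along outside the tool, the steps narrated so far (flagging rows that are on sale, filtering, pulling the discount out of the offer text, and normalizing mixed year-month-day date strings) can be sketched in plain Python. This is only an illustration, not how Data Refinery implements these operations: the sample rows, the regular expressions, the list of date formats and the helper names `is_on_sale`, `extract_discount` and `parse_ymd` are all assumptions made for the example.

```python
import re
from datetime import datetime

# Hypothetical sample rows. The column names follow the narration;
# the values are invented for illustration.
rows = [
    {"brand": "Acme", "offer": "25% off", "dateUpdated": "2018-03-14"},
    {"brand": "Zenith", "offer": "Was $40.00, now reduced", "dateUpdated": "2018/04/02"},
    {"brand": "Acme", "offer": "Free shipping", "dateUpdated": "2018 May 9"},
    {"brand": "Orbit", "offer": "Save 10.5% today", "dateUpdated": "2018-06-21"},
]

# Assumed patterns: a percentage such as "25%" or "10.5 %", and wording
# that points at a previous price.
PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")
PREVIOUS_PRICE = re.compile(r"\b(was|previous)\b", re.IGNORECASE)

def is_on_sale(offer):
    """Conditional step: a row counts as on sale if the offer mentions
    a percentage or a previous price."""
    return bool(PERCENT.search(offer) or PREVIOUS_PRICE.search(offer))

def extract_discount(offer):
    """Pull the percentage out of the offer text as a decimal number,
    or return None when the offer has no percentage."""
    match = PERCENT.search(offer)
    return float(match.group(1)) if match else None

# Assumed year-month-day layouts: dashes, slashes, and month names as text.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y %B %d", "%Y %b %d")

def parse_ymd(text):
    """Try each known layout in turn; return None if none of them fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

# Derive the flag, keep only rows on sale, then add the cleaned columns.
flagged = [dict(row, inSale=is_on_sale(row["offer"])) for row in rows]
on_sale = [row for row in flagged if row["inSale"]]
for row in on_sale:
    row["discount"] = extract_discount(row["offer"])
    row["dateUpdated"] = parse_ymd(row["dateUpdated"])
```

Trying a fixed list of formats in order is the simplest way to cope with mixed date layouts. It only works while every layout keeps the same year-month-day field order, which is what selecting ymd expresses in the narration.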
48 00:03:28,470 --> 00:03:35,050 Next she extracts month and creates a derived column 49 00:03:35,050 --> 00:03:39,680 called discountMonth. 50 00:03:39,680 --> 00:03:42,710 The data now represents all brands and products 51 00:03:42,710 --> 00:03:46,560 providing sales and the month the offer was available. 52 00:03:46,560 --> 00:03:49,790 The data scientist is only 53 00:03:49,790 --> 00:03:52,180 interested in her preferred brands. 54 00:03:52,180 --> 00:03:54,310 Over time she has built a list of preferred 55 00:03:54,310 --> 00:04:01,430 brands and has imported the data into her project. 56 00:04:01,430 --> 00:04:03,250 Data Refinery provides 57 00:04:03,250 --> 00:04:15,090 relational transformations such as left, inner, right, full, semi and anti-join. 58 00:04:15,090 --> 00:04:16,090 To ensure 59 00:04:16,090 --> 00:04:19,000 that the data only contains her preferred brands, she uses a semi-join 60 00:04:19,000 --> 00:04:34,730 operation which narrows the brands to match her preferences. 61 00:04:34,730 --> 00:04:49,350 She then selects the keys for the join and the resulting fields. 62 00:04:49,350 --> 00:04:52,220 The visual results now confirm that the brands match the preferences. 63 00:04:52,220 --> 00:04:53,220 To find the best 64 00:04:53,220 --> 00:05:07,600 possible deals she needs to perform some aggregations. 65 00:05:07,600 --> 00:05:09,640 Several features determine a good deal. 66 00:05:09,640 --> 00:05:11,380 She is interested in the best offer and 67 00:05:11,380 --> 00:05:15,780 duration when the discounts are active. 68 00:05:15,780 --> 00:05:17,370 Aggregating the sale data will help her 69 00:05:17,370 --> 00:05:18,980 understand the deals. 70 00:05:18,980 --> 00:05:27,500 She groups the data by brand and discountMonth and 71 00:05:27,500 --> 00:05:38,770 calculates the maximum discount.
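The month extraction, semi-join and aggregation described above can also be sketched in plain Python. Again this is an illustration under assumptions, not Data Refinery's implementation: the sample rows, the `preferred` list and the variable names are invented for the example. A semi-join keeps the rows of the left table that have a match in the right table, without adding any columns from the right table and without duplicating rows.

```python
from datetime import date

# Hypothetical on-sale rows with a decimal discount and a parsed date.
deals = [
    {"brand": "Acme", "discount": 25.0, "dateUpdated": date(2018, 3, 14)},
    {"brand": "Acme", "discount": 40.0, "dateUpdated": date(2018, 3, 30)},
    {"brand": "Orbit", "discount": 10.5, "dateUpdated": date(2018, 6, 21)},
    {"brand": "Zenith", "discount": 15.0, "dateUpdated": date(2018, 4, 2)},
    {"brand": "Orbit", "discount": 30.0, "dateUpdated": date(2018, 7, 5)},
]

# Hypothetical list of preferred brands (the right-hand table of the join).
preferred = [{"brand": "Acme"}, {"brand": "Orbit"}]

# Derived column: the month number taken from the date.
for deal in deals:
    deal["discountMonth"] = deal["dateUpdated"].month

# Semi-join on "brand": keep a deal only if its brand appears in the
# preferred list. Nothing from the preferred list is added to the row.
preferred_brands = {row["brand"] for row in preferred}
kept = [deal for deal in deals if deal["brand"] in preferred_brands]

# Group by (brand, discountMonth) and keep the maximum discount per group.
best = {}
for deal in kept:
    key = (deal["brand"], deal["discountMonth"])
    best[key] = max(best.get(key, deal["discount"]), deal["discount"])

# Sort the groups by maximum discount, largest first.
ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
```

Using a set of keys for the semi-join means each deal is checked once and appears at most once in the result, even if a brand were listed several times in the preferred table. An inner join would repeat the deal in that case.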
72 00:05:38,770 --> 00:05:43,480 Finally she sorts the result in descending order. 73 00:05:43,480 --> 00:05:49,300 Data Refinery is now displaying the best deals by brand preferences and the 74 00:05:49,300 --> 00:05:52,580 duration for which the offer is available. 75 00:05:52,580 --> 00:06:02,000 The last step is to execute the analysis on the full dataset. 76 00:06:02,000 --> 00:06:07,680 She starts the full 77 00:06:07,680 --> 00:06:15,820 analysis, which she can monitor for the completion status. 78 00:06:15,820 --> 00:06:25,240 It's time to automate the analysis so that it runs on a regular basis. 79 00:06:25,240 --> 00:06:26,240 The data in the 80 00:06:26,240 --> 00:06:28,930 database can grow over time. 81 00:06:28,930 --> 00:06:35,091 She uses a personalized runtime to match the larger 82 00:06:35,091 --> 00:06:54,960 data volumes and sets a schedule for automation. 83 00:06:54,960 --> 00:06:56,419 The hourly schedule reads 84 00:06:56,419 --> 00:07:02,010 updated data from the database and writes to the target table. 85 00:07:02,010 --> 00:07:04,490 Data Refinery has helped her uncover deals in 86 00:07:04,490 --> 00:07:06,500 the raw data through a small set of 87 00:07:06,500 --> 00:07:10,390 operations and transformations with the bulk of the work done for her. 88 00:07:10,390 --> 00:07:11,390 Thank you 89 00:07:11,390 --> 00:07:11,890 for watching.