Life Science and Insurance
The dataset included 500,000 trip records from 3000 drivers and was stored in Hadoop-HDFS. Each trip had GPS data points for every second of the trip.
IIP's Data Pipeline capabilities were implemented, and the Spark REPL and SparkR were leveraged for feature engineering. H2O was used for K-Means clustering, GBM, Random Forest, and Lasso, while native R packages were used for validation and testing. This setup could scale to billions of trips without changes to the R code.
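The engagement used H2O for the clustering and tree-based models. As an illustrative sketch only — using scikit-learn in place of H2O, with made-up per-driver features — the driver-segmentation step might look like:

```python
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical per-driver features: [mean speed (m/s), hard brakes per trip].
# Real features would come from the trip-level engineering described above.
X = np.array([
    [10.0, 0.2], [11.0, 0.3], [10.5, 0.1],   # smoother drivers
    [22.0, 4.0], [23.0, 5.0], [21.5, 4.5],   # more aggressive drivers
])

# Segment drivers into 2 behavioral clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
```

The cluster labels can then feed pricing tiers or serve as inputs to the supervised models.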
Geographic coordinates were converted into speed, acceleration, braking events, fast turns, and speed-limit violations. The resulting model achieved an AUC of 87.5%. RStudio and the Scala REPL were used to orchestrate the entire analytical lifecycle, and an interactive predictive model was developed in RStudio. Models were built on all trips rather than a sample of trips. The client ranked in the top 10% on the competition's public leaderboard.
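The actual feature engineering was done in Spark and R; the sketch below illustrates the core idea in plain Python, turning 1 Hz GPS samples into speed, acceleration, hard-brake, and speed-limit-violation features. The thresholds (`speed_limit_mps`, `brake_threshold`) are hypothetical values for illustration:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two GPS points."""
    r = 6_371_000  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def trip_features(points, speed_limit_mps=13.9, brake_threshold=-3.0):
    """Derive trip-level features from 1 Hz (lat, lon) samples.

    Since samples are one second apart, the distance between consecutive
    points is the speed in m/s, and consecutive speed deltas approximate
    acceleration in m/s^2.
    """
    speeds = [haversine_m(*points[i], *points[i + 1])
              for i in range(len(points) - 1)]
    accels = [s2 - s1 for s1, s2 in zip(speeds, speeds[1:])]
    return {
        "max_speed": max(speeds),
        "hard_brakes": sum(a < brake_threshold for a in accels),
        "speed_violations": sum(s > speed_limit_mps for s in speeds),
    }
```

At scale, the same per-second logic would run as a Spark transformation over the 500,000 trips in HDFS.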
Each run with IIP took a few minutes, compared to 8-9 hours in a parallel R environment without IIP. Four to five models were created per day, compared to two days per model without IIP.
For automobile insurers, telematics represents a growing and valuable way to quantify driver risk. Instead of basing pricing decisions on vehicle and driver characteristics alone, telematics offers the opportunity to measure both the quantity and the quality of a driver's behavior.
A single day of SAP data, in the form of pipe-delimited files for customer order information, was simulated for 365 days and ingested into the Spark shell: daily logs of orders received (~7 MB / 32K records); Zmarc, the snapshot of available quantity for each plant and material (~8 MB / 34K records); and the product hierarchy, a static table containing all information about materials/products (~10 KB).
A Java UDF was used to calculate the cumulative sum of the quantity-ordered column in the order information table. The cumulative quantity was compared with the Zmarc (available) quantity, and a flag was set to "Y" or "N" depending on availability. The query result was saved as a Parquet file and registered as a table in the Hive Thrift server, from which it was later read and visualized using Tableau.
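The actual pipeline used a Java UDF in Spark with Hive and Tableau; the pandas sketch below illustrates the same cumulative-sum-and-flag logic on toy data. The column names and values are hypothetical, not from the SAP extracts:

```python
import pandas as pd

# Toy stand-ins for the order-information log and the Zmarc snapshot.
orders = pd.DataFrame({
    "material": ["M1", "M1", "M1", "M2"],
    "qty_ordered": [40, 30, 50, 20],
})
zmarc = pd.DataFrame({
    "material": ["M1", "M2"],
    "qty_available": [80, 100],
})

# Running total of ordered quantity per material (the Java UDF's job).
orders["cum_qty"] = orders.groupby("material")["qty_ordered"].cumsum()

# Compare against available stock and flag each order line "Y"/"N".
report = orders.merge(zmarc, on="material", how="left")
report["flag"] = (report["cum_qty"] <= report["qty_available"]).map(
    {True: "Y", False: "N"})
```

Lines flagged "N" are the backorders; summing their quantities gives the demand-supply gap per material.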
The summary report ran and processed in a mere 10 seconds on a single node with 64 GB RAM and a 16-core processor. This helps the client's management quickly identify gaps between demand and supply for specific products, take corrective action if needed, and determine the total monetary amount outstanding in backorders.
A specific set of equipment – a group of reactors and an upstream de-gasifier, treated as a logical sub-process – was identified to develop and test the predictive analytics approach.
An SAP PM data extract covering 18 months, PLC system data, and alarm patterns were used as the set of predictor (independent) variables. A logistic regression (binomial logit) model was trained on a portion of the data, holding out 4 months for validating and testing the model.
Models were developed for 1-day and 2-day predictions. The model score cut-off for predicting a potential breakdown was chosen to balance capture rate against false-alarm percentage, as the two represent a trade-off.
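The cut-off selection described above can be sketched as a search over candidate thresholds, keeping the highest capture rate whose false-alarm rate stays within a budget. This is a hypothetical helper, and the 10% false-alarm budget is illustrative, not from the engagement:

```python
def rate_tradeoff(scores, labels, cutoff):
    """Capture rate (recall on breakdowns, label 1) and false-alarm
    rate (flagged non-breakdowns, label 0) at a given score cutoff."""
    tp = sum(s >= cutoff and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= cutoff and y == 0 for s, y in zip(scores, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

def pick_cutoff(scores, labels, max_false_alarm=0.10):
    """Return (cutoff, capture, false_alarm) maximizing capture rate
    subject to the false-alarm budget; None if no cutoff qualifies."""
    best = None
    for c in sorted(set(scores)):
        capture, fa = rate_tradeoff(scores, labels, c)
        if fa <= max_false_alarm and (best is None or capture > best[1]):
            best = (c, capture, fa)
    return best
```

Raising the cutoff trades captured breakdowns for fewer false alarms; the validation months held out from training are where this sweep would be run.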
With IIP, the client could predict major equipment breakdowns well ahead of time (one or more days versus just hours) with 80% accuracy, as well as reduce false alarms.
The client's maintenance teams appreciated the value of the prediction approach. One more plant site was identified to test the approach's generalizability, and the client is now planning a roadshow across multiple plants to drive adoption at scale.