Customer data across dimensions such as transactions, product, engagement, transaction timeline, demography, service requests, and external variables for the last two revenue cycles was analyzed using logistic regression to understand historical behavior and develop a mathematical relationship among risk factors. The churn probability was derived as a function of these risk factors.
13 GB of customer data, including transactions, equipment holdings and disputes, was ingested into HDFS. The ingested data was processed using Spark SQL, and the churn probability equation was applied to the data to predict the propensity of customer churn. Tableau was used for visualizations.
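As a minimal sketch of how such a churn-probability equation could be applied to each record, the snippet below scores customers with a logistic function. The risk factors, coefficients and intercept are illustrative assumptions, not the engagement's actual model:

```python
import math

# Hypothetical risk-factor weights from a fitted logistic regression
# (illustrative values only).
WEIGHTS = {
    "monthly_spend_drop": 1.8,   # recent decline in spend raises churn risk
    "service_requests": 0.9,     # open service requests raise churn risk
    "engagement_score": -1.2,    # higher engagement lowers churn risk
}
INTERCEPT = -2.0

def churn_probability(customer):
    """Apply the logistic churn equation: p = 1 / (1 + exp(-(w.x + b)))."""
    z = INTERCEPT + sum(WEIGHTS[k] * customer.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

at_risk = churn_probability(
    {"monthly_spend_drop": 2.0, "service_requests": 3, "engagement_score": 0.2})
loyal = churn_probability(
    {"monthly_spend_drop": 0.0, "service_requests": 0, "engagement_score": 2.5})
```

In the Spark SQL pipeline described above, the same scoring expression would be evaluated per row across the ingested dataset rather than per Python dict.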
Scoring 66 million records for churn took 6.5 minutes on 3 Amazon instances, each with 60 GB RAM and 640 GB storage, demonstrating Spark's capability to deliver exceptional performance through in-memory computing.
Significant factors affecting churn were identified, and high-spending customers among the most likely churners were prioritized for remedial strategies and promotions.
IIP's Twitter Adapter was used to acquire public tweets across Twitter handles in order to interact with the user community.
Tweets were pre-processed, stored in the IIP data lake, and exposed as views/tables to data analysts and scientists. The data scientists used IIP-R and IIP-Text analytics capabilities to analyze the data and generate N-gram-based topic clouds, sentiments and engagement scores.
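A toy illustration of the N-gram step: count bigrams across pre-processed tweets, with the counts driving the relative weight of each phrase in a topic cloud. The tweets and tokenizer here are invented for illustration:

```python
import re
from collections import Counter

def ngrams(text, n=2):
    """Tokenize a tweet and emit n-grams (bigrams by default)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tweets = [
    "Network outage again in my area",
    "network outage reported, no service",
    "Great customer service today!",
]

# Frequency counts feed the topic cloud: the heavier the n-gram,
# the larger it renders.
cloud = Counter(g for t in tweets for g in ngrams(t))
top = cloud.most_common(1)[0]  # ("network outage", 2)
```

At scale, the same counting would run over the data-lake views rather than an in-memory list.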
Processed data was integrated back into the IIP data lake and visualized using Tableau so that insights derived from Twitter could be communicated and shared with business users.
The IIP-based solution offered data scientists and analysts a flexible approach to source data from Twitter and possibly combine it with structured data available in the IIP data lake to generate insights related to community engagement and sentiments.
Machine learning models were developed in Apache Spark to predict high-propensity customers by sales territory, prioritizing sales force action. Customer profiles and monthly transaction records were ingested into Hadoop and processed via the Hive metastore and Spark SQL. An offline model was constructed on sample data to benchmark and validate results. Tableau was used for visualizations.
The Spark machine learning library was used to construct a logistic regression model. Prediction took around 7 seconds for more than 2 million records, demonstrating the usefulness of Spark's in-memory computing paradigm for near real-time analysis.
Customers with a high propensity to buy specific products and services were identified to help cross-sell and up-sell.
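A small sketch of the prioritization step: once the model has scored customers, rank them by propensity within each sales territory and keep the top prospects for the sales force. The customer IDs, territories and scores below are hypothetical:

```python
from collections import defaultdict

# Hypothetical (customer, territory, propensity) tuples from the scored model output.
scores = [
    ("C1", "North", 0.91), ("C2", "North", 0.40),
    ("C3", "South", 0.75), ("C4", "South", 0.82), ("C5", "South", 0.10),
]

def top_prospects(scores, per_territory=2):
    """Group scored customers by territory; keep the highest-propensity ones."""
    by_territory = defaultdict(list)
    for customer, territory, p in scores:
        by_territory[territory].append((p, customer))
    return {
        t: [c for _, c in sorted(rows, reverse=True)[:per_territory]]
        for t, rows in by_territory.items()
    }

prospects = top_prospects(scores)
# prospects["South"] -> ["C4", "C3"]
```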
A solution based on an industry-leading visualization tool coupled with an open-source technology stack was used for real-time predictive analytics, thus saving cost.
15 GB of customers' ADSL service-layer data – covering periods when they did not report any faults – was ingested into the Hadoop-based IIP to find the "control signature," using signal processing and statistical techniques on connection properties (attenuation/loss, code violations, upload/download rates, re-initializations). The fault signature was computed from the connection properties in the days leading up to the reporting of a fault.
A predictive model was developed to derive a formula giving the probability that the current state of a connection signified an impending fault report within the next 7 days.
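One simple way to picture the control-signature comparison: treat the fault-free baseline as a per-property mean and standard deviation, and score a connection by how far its current readings deviate from that baseline. The properties, baseline values and root-mean-square scoring rule below are illustrative assumptions, not the engagement's actual formula:

```python
import math

# Hypothetical control signature per connection property:
# (mean, std) observed during fault-free operation.
CONTROL = {
    "attenuation_db": (30.0, 2.0),
    "code_violations": (5.0, 3.0),
    "reinitializations": (1.0, 1.0),
}

def fault_score(reading):
    """Combined deviation of current readings from the control signature
    (root-mean-square of per-property z-scores)."""
    zs = [(reading[k] - mu) / sd for k, (mu, sd) in CONTROL.items()]
    return math.sqrt(sum(z * z for z in zs) / len(zs))

healthy = fault_score(
    {"attenuation_db": 30.5, "code_violations": 4, "reinitializations": 1})
degrading = fault_score(
    {"attenuation_db": 38.0, "code_violations": 14, "reinitializations": 6})
```

A score threshold (or a logistic model over such deviations) would then map the connection state to a 7-day fault probability.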
The predictive model's scoring formula was applied to process 16.5 million records in 5 seconds on an 8-core, 64 GB RAM, 5-node cluster to predict impending network faults at the connection, DSLAM and geographic-area levels.
Impending network faults in the next week were predicted with a high degree of accuracy to fix the network failure points.
The IIP solution, based on an open-source stack including Apache Spark and prebuilt Infosys components, was delivered in just 5 days with an impressive price-performance ratio.
The dataset included 500,000 trip records from 3000 drivers and was stored in Hadoop-HDFS. Each trip had GPS data points for every second of the trip.
IIP's Data Pipeline capabilities were implemented, and the Spark REPL and SparkR were leveraged for feature engineering. H2O was leveraged for K-Means clustering, GBM, Random Forest and Lasso. Native R packages were used for validation and testing. The pipeline could scale to billions of trips without changes to the R code.
Geographic coordinates were converted into speed, acceleration, braking, fast turns and speed-limit violations. The resulting model achieved an AUC of 87.5%. R-Studio and the Scala REPL were used to orchestrate the whole analytical lifecycle, and an interactive predictive model was developed using R-Studio. Models were created on all trips rather than sampled trips. The client ranked in the top 10% on the competition's public leaderboard.
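The coordinate-to-feature conversion can be sketched in a few lines: with one GPS point per second, consecutive distances give speed, speed differences give acceleration, and strongly negative acceleration flags a hard brake. The trip data and the -3 m/s² braking threshold are illustrative assumptions:

```python
import math

def trip_features(points, hard_brake_mps2=-3.0):
    """Derive speed and braking features from per-second GPS points.
    `points` are (x, y) positions in meters at 1-second intervals
    (a planar simplification; production code would project lat/lon first)."""
    speeds = [math.dist(a, b) for a, b in zip(points, points[1:])]  # m/s
    accels = [v2 - v1 for v1, v2 in zip(speeds, speeds[1:])]        # m/s^2
    return {
        "max_speed": max(speeds),
        "hard_brakes": sum(1 for a in accels if a <= hard_brake_mps2),
    }

# A short trip: accelerate, cruise, then brake sharply twice.
trip = [(0, 0), (5, 0), (15, 0), (30, 0), (45, 0), (50, 0), (51, 0)]
feats = trip_features(trip)
# feats -> {'max_speed': 15.0, 'hard_brakes': 2}
```

Such per-trip features then feed the clustering and tree-based models mentioned above.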
Each run with IIP took a few minutes, compared to 8-9 hours in a parallel R environment without IIP; 4-5 models were created per day, compared to 2 days per model without IIP.
For automobile insurers, telematics represents a growing and valuable way to quantify driver risk. Instead of basing pricing decisions on vehicle and driver characteristics alone, telematics offers the opportunity to measure both the quantity and quality of a driver's behavior.
A specific set of equipment – a set of reactors and an upstream de-gasifier forming a logical sub-process – was identified to develop and test the predictive analytics approach.
An 18-month SAP PM data extract, PLC system data and alarm patterns were used as the set of predictor/independent variables. A logistic regression (binomial logit) model was trained on a portion of the data, holding out 4 months for validating and testing the model.
Models were developed for 1-day and 2-day predictions. The model-score cut-off for predicting a potential breakdown was chosen to balance the capture rate against the false-alarm percentage, as the two represent a trade-off.
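The cut-off trade-off can be made concrete by sweeping thresholds over held-out scored data: lowering the cut-off captures more real breakdowns but flags more normal periods as alarms. The scores and labels below are hypothetical:

```python
def capture_vs_false_alarm(scored, cutoff):
    """Capture rate (share of real breakdowns flagged) and false-alarm rate
    (share of normal periods flagged) at a given model-score cutoff."""
    flagged = [(score >= cutoff, broke) for score, broke in scored]
    breakdowns = [f for f, b in flagged if b]
    normals = [f for f, b in flagged if not b]
    return sum(breakdowns) / len(breakdowns), sum(normals) / len(normals)

# Hypothetical (score, actual_breakdown) pairs from the 4 hold-out months.
scored = [(0.95, True), (0.80, True), (0.60, True), (0.55, True),
          (0.70, False), (0.40, False), (0.20, False), (0.10, False)]

loose = capture_vs_false_alarm(scored, 0.5)    # captures all, some false alarms
strict = capture_vs_false_alarm(scored, 0.75)  # no false alarms, misses half
```

Plotting these pairs across cut-offs (an ROC-style curve) is one standard way to pick the operating point.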
With IIP, the client could predict major equipment breakdowns well ahead of time (1 or more days versus just hours) with 80% accuracy, as well as reduce false alarms.
The client's maintenance teams appreciated the value of the prediction approach. One more plant site was identified to test the approach's generalizability, and the client is now planning a roadshow across multiple plants to drive adoption at scale.
To develop, test and arrive at the optimal probability analytics approach, Infosys focused its knowledge curation efforts on umpire data captured by the ATP across all tournaments over the last 12 months, along with 5 years of Hawk-Eye data from the Barclays ATP World Tour Finals.
Leveraging just 2 nodes, each with an 8-core CPU and 16 GB RAM, IIP was fully equipped to process data volumes of over 240,000 records (12 million data points) in near real-time.
Using machine learning algorithms, probabilities were established and published for various statistics, such as shot speeds of each player against each other for both forehand and backhand; point-winning shots; winning statistics for players on different surface types; holding statistics for each player across games; double faults; first-serve returns; match points saved; fatigue indexes; serve analysis based on Hawk-Eye data; and more.
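At their simplest, several of these statistics are empirical probabilities computed from historical point records. The sketch below estimates a player's serve-point win probability on a given surface; the players, surfaces and outcomes are invented for illustration:

```python
# Hypothetical point-by-point records: (player, surface, won_point_on_serve).
points = [
    ("PlayerA", "hard", True), ("PlayerA", "hard", True),
    ("PlayerA", "hard", False), ("PlayerA", "clay", True),
    ("PlayerB", "hard", True), ("PlayerB", "hard", False),
]

def serve_win_probability(points, player, surface):
    """Empirical probability that a player wins a point on serve on a surface."""
    outcomes = [won for p, s, won in points if p == player and s == surface]
    return sum(outcomes) / len(outcomes) if outcomes else None

p = serve_win_probability(points, "PlayerA", "hard")  # 2 of 3 points -> 2/3
```

The richer statistics (fatigue indexes, shot-speed matchups) would aggregate Hawk-Eye measurements in the same grouped-and-normalized fashion.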
High probability factors influencing match outcomes were published as insights, in real-time, on ATPWorldTour.com for tennis fans all over the world.
IIP-based analytics showcased how historical data on player performance, strengths and weaknesses can be used to predict player behavior, shot selection and, ultimately, the outcome of the match itself.