Today almost all businesses are integrating AI in their systems to gain leverage and beating others to the market. The ones who adapt and adopt appropriately, seamlessly(automation) implementing and delivering solutions with minimal human intervention are establishing themselves in the market. Machine learning software systems that identify patterns in data to make informed decisions for solving complex real-world problems are becoming pivotal in transforming industries to delivering value. Therefore, companies are investing in their data science teams and ML capabilities to develop automated ML systems with predictive models that can automatically train, evaluate and deliver business value to their users.

MLOps is an ML engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops). Employing MLOps means to promote automation of ML systems from construction to integration to testing to deployment and infrastructure management. The challenge with ML system isn’t building a valuable ML model, the real challenge is building an integrated ML system and to continuously operate it in production.

mlops2

In the above diagram, the ML system is composed of configuration, automation, data collection, data verification, testing and debugging, resource management, model analysis, process and metadata management, serving infrastructure, and monitoring. Such sophisticated software systems are often developed and operated with DevOps principles. Since an ML system is a software system, similar practices can be implemented in building and operating reliable ML systems at scale.

ML and other software systems are similar in continuous integration(CI) of source control, unit testing, integration testing, and continuous delivery(CD) of the software module or the package. However, in ML, there are a few notable differences:

  • CI is no longer only about testing and validating code and components, but also testing and validating data, data schemas, and models.
  • CD is no longer about a single software package or a service, but a system (an ML training pipeline) that should automatically deploy another service (model prediction service).
  • CT(Continuous Training) is a new property, unique to ML systems, that’s concerned with automatically retraining and serving the models.

In any ML project, after defining the business use case and criteria for gauging success of the product criteria, the process of building and delivering an AI enabled service to customers involves the following steps. These steps can be either executed manually or be automated. The amount of automation involved in the process of these steps defines the maturity of the ML process.

  • Data collection: Select and collect the relevant data from various data sources relevant to the task.
  • Data wrangling: Consists of EDA and ETL: understanding the data, feature engineering, splitting the data into training and validation sets into desired format. - Model training: The data scientist implements different algorithms with the prepared data to train various ML models. In addition, the hyperparameter are tuned to obtain minimum loss or maximise the metric. The output of this step is a trained model.
  • Model evaluation: The model is evaluated on the validation set to evaluate the model quality. The output of this step is a set of metrics to assess the quality of the model.
  • Model validation: The model is confirmed to be adequate for deployment.
  • Model serving: The validated model is deployed to serve predictions. This deployment can be one of the following,
    • Microservices with a REST API to serve online predictions.
    • An embedded model to an edge or mobile device.
    • Part of a batch prediction system.
  • Model monitoring: The model predictive performance is monitored to potentially invoke a new iteration in the ML process.

References
1.https://research.google/pubs/pub43146/
2.https://nealanalytics.com/expertise/mlops/
3.https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning