Oracle Machine Learning Lifecycle
Data Access and Collection
All machine learning models start with data. The first step in working on a machine learning problem is accessing the data and collecting it into the notebook session. These datasets typically reside in a data lake or in a database. There may also be valuable unstructured data sets that do not fit well into a relational database, such as logs, raw text, images, and videos.
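In a notebook session, collected data typically ends up in a DataFrame. The snippet below is a minimal sketch using pandas; the inline CSV and its column names are made-up stand-ins for data that would normally come from a data lake, database, or object storage:

```python
import io

import pandas as pd

# Stand-in for data pulled from a data lake, database, or object storage;
# the inline CSV keeps the example self-contained.
raw = io.StringIO(
    "customer_id,age,spend\n"
    "1,34,120.5\n"
    "2,41,89.0\n"
    "3,29,230.1\n"
)
df = pd.read_csv(raw)
print(df.shape)  # (3, 3): three records, three features
```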
Data Exploration and Preparation
Data preparation is the cleansing and processing of raw data before analysis. Before building any machine learning model, data scientists need to understand the available data. Raw data can be messy, duplicated, or inaccurate. Data scientists explore the data available to them, then cleanse it by identifying corrupt, inaccurate, and incomplete records and replacing or deleting them. In addition, data scientists need to determine whether the data is labeled. Oracle provides the OCI Data Labeling cloud service for this purpose.
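A minimal sketch of this cleansing step in pandas (the table and its problems are hypothetical): dropping an exact duplicate record and imputing a missing value with the column median:

```python
import pandas as pd

# Hypothetical raw table with the usual problems: an exact duplicate row
# and a missing value.
df = pd.DataFrame({
    "age": [34, 34, None, 52],
    "income": [40_000, 40_000, 55_000, 61_000],
})

df = df.drop_duplicates()                         # delete duplicate records
df["age"] = df["age"].fillna(df["age"].median())  # replace missing ages
print(len(df))  # 3 rows remain, with no missing values
```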
Feature Exploration
After the data is prepared, data scientists explore the features (or variables) in their data set, identify any relationships between the features, and make decisions about additional data transformations. Typical questions include:
- Is the data set skewed towards a range of values or a subset of categories?
- What are the minimum, maximum, mean, median, and mode values of the feature?
- Are there missing values or invalid values such as null? If so, how many are there?
- Are there outliers in the data set?
- How will you handle outliers?
- Are some of your features correlated with each other?
- Do you need to normalize the data set or perform some other transformation to rescale the data (e.g. log transformation)?
- What is your approach to a long tail of categorical values?
- Do you use features as-is, group them in some meaningful way, or ignore a subset of them altogether?
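Many of these questions can be answered with a few pandas calls. The feature names below are hypothetical; the point is the pattern:

```python
import pandas as pd

# Hypothetical feature table; note the outlier row in both columns.
df = pd.DataFrame({
    "spend": [12.0, 15.5, 14.2, 300.0, 13.1],
    "visits": [1, 2, 2, 30, 1],
})

print(df.describe())    # min, max, mean, quartiles for each feature
print(df.isna().sum())  # how many missing values each feature has
print(df.corr())        # pairwise correlation between features
```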
Feature Engineering
During the data exploration step, you can identify patterns in your data set for ideas about how to develop new features that would better represent the data set. This is known as feature engineering.
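As a sketch on a hypothetical orders table, two common engineered features are a ratio of existing columns and a date part extracted from a timestamp:

```python
import pandas as pd

# Hypothetical orders table used to illustrate feature engineering.
df = pd.DataFrame({
    "total_spend": [120.0, 89.0, 230.0],
    "num_orders": [4, 1, 10],
    "signup_date": pd.to_datetime(["2023-01-10", "2023-06-01", "2022-11-20"]),
})

# Ratio feature: average order value represents spending behavior better
# than either raw column alone.
df["avg_order_value"] = df["total_spend"] / df["num_orders"]

# Date-part feature: the month may capture seasonality.
df["signup_month"] = df["signup_date"].dt.month
```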
Modeling
- Model type selection: In the first step of model building, data scientists need to decide what type of machine learning model is appropriate for the problem. There are two main types: supervised and unsupervised.
- Algorithm selection: Different classes of machine learning models are used to solve supervised and unsupervised learning problems. Typically, data scientists will try multiple algorithms and generate multiple candidate models.
- Model training: During model training, a data scientist might experiment with selecting different subsets of features as input to the machine learning model. The benefits of reducing the number of input variables are:
- reducing the computational cost of model training,
- making the model more generalizable, and
- possibly improving model performance.
During model training, the data set is split into training and testing sets. The training set is used to train the model, and the testing set is used to see how well the model performs on data it has not seen.
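A minimal sketch of this split-train-evaluate loop using scikit-learn (the synthetic data stands in for a prepared feature set):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a prepared feature set.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Hold out a test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # performance on unseen data
```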
Validation (Evaluation)
Once a trained model is obtained, it’s important to evaluate the model to determine its suitability.
Classification problems:
- true positives
- true negatives
- false positives
- false negatives
- precision
- recall
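Precision and recall fall directly out of those four counts; the counts below are made up for illustration:

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp, tn, fp, fn = 40, 45, 5, 10

precision = tp / (tp + fp)  # of the predicted positives, how many were correct
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(round(precision, 3))  # 0.889
print(round(recall, 3))     # 0.8
```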
Regression problems:
- root-mean-square error
- mean absolute error
- coefficient of determination (R²)
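These three regression metrics are simple enough to compute from their definitions; the true values and predictions below are hypothetical:

```python
import math

# Hypothetical true values and model predictions.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

rmse = math.sqrt(sum(e * e for e in errors) / n)       # root-mean-square error
mae = sum(abs(e) for e in errors) / n                  # mean absolute error

mean_y = sum(y_true) / n
ss_res = sum(e * e for e in errors)                    # residual sum of squares
ss_tot = sum((t - mean_y) ** 2 for t in y_true)        # total sum of squares
r2 = 1 - ss_res / ss_tot                               # coefficient of determination
```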
Unsupervised problems:
- silhouette score
- Calinski-Harabasz index
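To make the silhouette score concrete, here it is computed from its definition on a tiny hypothetical 1-D clustering: for each point, a is the mean distance to the other members of its own cluster, b is the mean distance to the nearest other cluster, and s = (b − a) / max(a, b). In practice you would call a library such as scikit-learn instead:

```python
# Two well-separated hypothetical 1-D clusters.
clusters = {0: [0.0, 1.0], 1: [10.0, 11.0]}

def silhouette(point, own, other):
    # a: mean distance to the other members of the point's own cluster
    a = sum(abs(point - q) for q in own if q != point) / (len(own) - 1)
    # b: mean distance to the members of the nearest other cluster
    b = sum(abs(point - q) for q in other) / len(other)
    return (b - a) / max(a, b)

scores = [
    silhouette(p, pts, clusters[1 - label])
    for label, pts in clusters.items()
    for p in pts
]
overall = sum(scores) / len(scores)  # near 1.0 means well-separated clusters
```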
Model Deployment
After the model training and evaluation processes are complete, the best candidate models are saved. Models are usually saved in Pickle, ONNX, and/or PMML format. OCI Data Science provides a model catalog for preserving models.
Model deployment is the process of making the machine learning model available for use in some way. Most likely, the pipeline of data transformations also has to be deployed along with the model. There are two common consumption patterns:
- batch consumption
- real-time consumption
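A minimal sketch of the save/restore step using Pickle. The parameter dictionary and `predict` helper are made-up stand-ins; a real project would pickle the fitted model object itself (or export ONNX/PMML) and store the bytes in the model catalog for the batch or real-time serving layer:

```python
import pickle

# Stand-in for a trained model: just its learned parameters.
model = {"threshold": 0.5}

blob = pickle.dumps(model)     # bytes that would be persisted in the catalog
restored = pickle.loads(blob)  # what the serving layer would load back

def predict(values, params):
    """Score new records with the restored parameters (illustrative helper)."""
    return [1 if v >= params["threshold"] else 0 for v in values]

print(predict([0.2, 0.9], restored))  # [0, 1]
```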
Model Monitoring
Model monitoring is a challenging but important step: it ensures that a model remains effective after it is deployed. Model monitoring has two components:
- drift/statistical monitoring of the model performance: After models are deployed, the metrics by which the models were evaluated may degrade over time. This is because data can change over time. For example, features in the production data can exhibit values outside of the range in the training data set, or there can be a slow drift in the distribution of the values.
- statistics and distribution of the training data compared to live data
- compare the distribution of the model predictions with training and live data
- ops monitoring: Ops monitoring of the machine learning system requires a partnership between the data scientists and the engineering team. Things to monitor include serving latency, memory/CPU usage, throughput, and system reliability. Logs and metrics need to be set up for tracking and monitoring. Logs can be used to investigate specific incidents and help identify root causes.
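As a sketch of the drift component, one simple check compares a feature's distribution in the training data against recent live data using a two-sample Kolmogorov-Smirnov statistic (the maximum gap between the two empirical CDFs). The data, function name, and interpretation here are illustrative, not an OCI API:

```python
# Training-time feature values vs. hypothetical recent live values.
training = [1.0, 1.2, 0.9, 1.1, 1.05]
live = [2.0, 2.2, 1.9, 2.1, 2.05]  # the live values have drifted upward

def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)

drift = ks_statistic(training, live)
print(drift)  # 1.0: the two samples do not overlap at all
```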