Oracle Data Science AutoML

AutoML stands for automated machine learning. In the modeling phase of the machine learning lifecycle, multiple models are built using a variety of algorithms and with different hyperparameter configurations. The performance and accuracy of these models must be tracked. This whole process can be automated, which is where, as the name implies, AutoML comes in.

AutoML combines model selection, model refinement, and hyperparameter tuning into a single automated process that optimizes the outcome of the learning.

AutoML Approaches

Bayesian optimization: Uses a probabilistic model to capture the relationship between hyperparameter configurations and their performance. Auto-sklearn, one of the most notable works relying on this approach, adopts a random-forest-based sequential model-based optimization technique for general algorithm configuration. It uses metalearning to identify the previously optimized data set that is closest to the given data set, and uses that known data set's configuration to bootstrap the iterative optimization process.

Recommender system: The system maintains a record of the best configuration found for each data set it has previously encountered. Given a new data set and the results of several initial trials, the system uses similarities to known data sets and configurations to suggest the next configurations to evaluate. Probabilistic matrix factorization is typically at the core of such recommender systems.

Genetic evolutionary algorithms: One notable example is the Tree-based Pipeline Optimization Tool (TPOT), which automatically optimizes a machine learning pipeline built around Scikit-Learn.

Oracle AutoML Benefits

Oracle AutoML automates feature selection, model/algorithm selection, and hyperparameter tuning. AutoML is a feature common to major data science platforms. Users can feed a data set to AutoML, and it will train multiple machine learning models, tune the hyperparameters for those models, and evaluate their performance against each other.
AutoML can improve the productivity of data scientists by automating the training process. It also allows data analysts and developers to build machine learning models without the data science expertise needed to tweak every aspect of the model training process by hand.
It also reduces the compute time that is required to deliver ML models.

Oracle AutoML Workflow

Oracle AutoML automates the workflow and provides you with an optimal model given a time budget. A typical workflow is:

Select a model from a large number of viable candidate models.
For each model, tune the hyperparameters.
Select only predictive features to speed up the pipeline and reduce overfitting.
Ensure the model performs well on unseen data (also called generalization).

AutoML Pipeline

Oracle AutoML automates four major time-consuming and tedious steps in the machine learning modeling process to deliver significant productivity improvements for data scientists.

Algorithm Selection: Identifies the best algorithms for the data and problem; faster than exhaustive search
Adaptive Sampling: Identifies the right sample size and adjusts for unbalanced data
Feature Selection: De-noises the data and reduces the number of features
Model Tuning: Automatically tunes hyperparameters for the best model accuracy

Algorithm Selection

Given a data set and a prediction task, such as classification or regression, the goal of algorithm selection is to identify the algorithm that yields the maximum score. This "best" algorithm is not always intuitive, and picking a complex model is not optimal for every use case. The algorithm selection stage of Oracle Accelerated Data Science (ADS) is designed to rank algorithms based on their estimated predictive performance on the input data set.

Using models built from a wide range of data sets, automated algorithm selection uses metalearning where, based on the distribution of values or meta-features in the data, a prebuilt model predicts which algorithms are most likely to produce the best results. Algorithms with the highest scores are later used for model tuning. This helps data scientists and non-expert users to find the best algorithm candidates faster than with exhaustive search.

Extract relevant dataset characteristics, such as dataset shape, feature correlations, and appropriate meta-features.
Invoke specialized score-prediction metamodels that were learned to predict algorithm performance across a wide variety of datasets and domains.
Rank algorithms based on their predicted performance.
Select the optimal algorithm.
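
The four steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Oracle's implementation: the three meta-features, the METAMODELS table with its weights, and the function names are all hypothetical, and a real metamodel would be learned from many data sets rather than written by hand.

```python
import math

# Hypothetical score-prediction metamodels: one linear model per algorithm.
# The weights are invented for illustration; real metamodels are learned
# from a wide variety of data sets.
METAMODELS = {
    "LogisticRegression": {"bias": 0.70, "log_rows": 0.01, "n_features": -0.002, "class_balance": 0.10},
    "RandomForestClassifier": {"bias": 0.60, "log_rows": 0.05, "n_features": 0.001, "class_balance": 0.15},
    "GaussianNB": {"bias": 0.65, "log_rows": 0.00, "n_features": -0.005, "class_balance": 0.05},
}


def extract_meta_features(rows, target):
    """Step 1: describe the data set with a few simple meta-features."""
    positive_rate = sum(target) / len(target)
    return {
        "log_rows": math.log10(len(rows)),
        "n_features": len(rows[0]),
        "class_balance": min(positive_rate, 1 - positive_rate),
    }


def predict_score(weights, meta):
    """Step 2: a metamodel predicts the score an algorithm would reach."""
    return weights["bias"] + sum(weights[name] * value for name, value in meta.items())


def rank_algorithms(meta):
    """Step 3: order the algorithms by predicted score, best first."""
    predicted = {name: predict_score(w, meta) for name, w in METAMODELS.items()}
    return sorted(predicted.items(), key=lambda item: item[1], reverse=True)


# Toy binary-classification data set: 1,000 rows, 2 features.
rows = [[0.1, 1.0], [0.4, 0.9], [0.35, 0.2], [0.8, 0.1]] * 250
target = [0, 0, 1, 1] * 250

ranking = rank_algorithms(extract_meta_features(rows, target))
best_algorithm = ranking[0][0]  # Step 4: keep the top-ranked algorithm
print(ranking)
```

No model is trained during ranking: only the cheap meta-features and the prebuilt metamodels are evaluated, which is why this approach can be faster than exhaustive search.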

Adaptive Sampling

Adaptive sampling iteratively subsamples the rows of a data set, growing from a small subset toward the full data set size, and evaluates each sample to obtain a score for a specific algorithm. The goal is to find the smallest sample size that can be used in subsequent pipeline stages without sacrificing the quality of the pipeline. This also speeds up model building. In addition, adaptive sampling detects unbalanced data sets, which can cause poor models to be built.

For a given algorithm and data set, identify a sample based on the sample size and characteristics of the data set and task.
Leverage metalearning to predict algorithm performance on the given sample.
Iterate until the score converges to within a small threshold.
The identified sample is then used for subsequent stages of the AutoML Pipeline.
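
The iteration described above can be sketched as a loop that grows the sample until the score stops improving. This is a minimal sketch, not the ADS implementation: adaptive_sample_size, its doubling schedule, the tolerance, and toy_learning_curve (a stand-in for the metalearning-based score prediction) are all assumptions made for illustration.

```python
import math


def adaptive_sample_size(n_rows, score_fn, start=1000, growth=2.0, tol=0.002):
    """Return the smallest tried sample size whose score has converged."""
    size = min(start, n_rows)
    score = score_fn(size)
    while size < n_rows:
        next_size = min(int(size * growth), n_rows)
        next_score = score_fn(next_size)
        if abs(next_score - score) < tol:
            return size, score  # a larger sample no longer helps
        size, score = next_size, next_score
    return size, score  # never converged: fall back to the full data set


def toy_learning_curve(sample_size):
    """Hypothetical stand-in for the predicted score on a sample."""
    return 0.9 - 0.5 / math.sqrt(sample_size)


size, score = adaptive_sample_size(100_000, toy_learning_curve)
print(size, round(score, 4))
```

With this toy curve the loop stops well short of the full 100,000 rows, so the later stages would work on a much smaller sample.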

Feature Selection

This stage selects a subset of features that are highly predictive of the target.

Attributes that have no correlation with the target attribute, that are mostly constant or missing, or that have very high cardinality can reduce model quality while increasing model building and data scoring time. Feature selection speeds up training without losing predictive performance. AutoML first ranks the features and then evaluates subsets based on these rankings, using several techniques.

AutoML pre-processes the input data and automatically removes those attributes that contain little information, or worse, noise.

Obtain the dataset meta-features, similar to those obtained in the algorithm selection stage.
Rank all features using multiple ranking algorithms. Feature rankings are ordered lists of features from most to least important.
For each feature ranking, the optimal feature subset is identified.
Algorithm performance is predicted by leveraging meta-learning on a given feature subset.
Iterating over multiple feature subsets, the optimal subset is determined.
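
A rank-then-evaluate loop of this kind can be sketched as follows. The sketch makes several simplifying assumptions that are not from Oracle AutoML: it uses a single ranking (absolute Pearson correlation) instead of multiple ranking algorithms, and it scores each subset with the training accuracy of a small nearest-centroid classifier instead of a metalearning prediction. All function names and the toy data are hypothetical.

```python
import math
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; 0.0 for constant columns, which carry no signal."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target):
    """Order feature names from most to least correlated with the target."""
    return sorted(columns, key=lambda name: abs(pearson(columns[name], target)), reverse=True)


def centroid_accuracy(columns, names, target):
    """Stand-in scorer: training accuracy of a nearest-centroid classifier."""
    rows = list(zip(*(columns[name] for name in names)))
    centroids = {}
    for label in set(target):
        members = [row for row, t in zip(rows, target) if t == label]
        centroids[label] = [mean(values) for values in zip(*members)]

    def predict(row):
        return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))

    return sum(predict(row) == t for row, t in zip(rows, target)) / len(rows)


def select_features(columns, target, tol=0.02):
    """Keep the smallest top-k prefix of the ranking that scores near the best."""
    ranking = rank_features(columns, target)
    scores = [centroid_accuracy(columns, ranking[:k], target) for k in range(1, len(ranking) + 1)]
    best = max(scores)
    size = next(k for k, s in enumerate(scores, start=1) if s >= best - tol)
    return ranking[:size], scores[size - 1]


# Toy data: one informative column, one noise column, one constant column.
rng = random.Random(0)
target = [i % 2 for i in range(400)]
columns = {
    "signal": [t + rng.gauss(0, 0.3) for t in target],
    "noise": [rng.gauss(0, 1) for _ in target],
    "constant": [1.0 for _ in target],
}
selected, score = select_features(columns, target)
print(selected, score)
```

The constant column ranks last and adds nothing to the score, so it is never selected.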

Model Tuning (Hyperparameter Tuning)

This stage determines the optimal configuration for the model's hyperparameters. The hyperparameter tuning process is designed with efficiency and scalability as first-order requirements.

Filters for the optimal configuration of the shortlisted algorithms
Tunes multiple machine learning models
Tunes each selected algorithm to find hyperparameter settings
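
The text does not describe the search strategy that Oracle AutoML uses for tuning, so the following sketch swaps in plain random search to show the general shape of a tuning loop: sample a configuration, score it, and keep the best. The search space and toy_objective (a stand-in for training and validating a model) are hypothetical.

```python
import math
import random


def random_search(objective, space, n_trials=30, seed=0):
    """Sample configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, -math.inf
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = objective(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_objective(config):
    """Hypothetical validation score that peaks at C=1.0 with the l2 penalty."""
    penalty_bonus = 0.0 if config["penalty"] == "l2" else -0.05
    return 0.9 - 0.02 * math.log10(config["C"]) ** 2 + penalty_bonus


space = {"C": [0.01, 0.1, 1.0, 10.0, 100.0], "penalty": ["l1", "l2"]}
best_config, best_score = random_search(toy_objective, space)
print(best_config, best_score)
```

Each trial is independent of the others, so a loop like this is easy to run in parallel, which is one way a tuner can be made to scale.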

OracleAutoMLProvider

OracleAutoMLProvider delegates model training to the ads.automl package from the Oracle Accelerated Data Science (ADS) Python SDK. The OracleAutoMLProvider class supports two arguments:

n_jobs: Specifies the degree of parallelism for Oracle AutoML. The default is -1, which means all cores will be used.
loglevel: Sets the verbosity of the output for Oracle AutoML.

The Oracle AutoML process summarizes the optimization process by providing:
- Training data information
- Pipeline information with selected features, best choices, and respective hyperparameters
- Best model trial information

Adaptive sampling does not run, and visualizations are not generated, if the data set has fewer than 1,000 data points.

model_list, an argument of train(), controls which algorithms AutoML considers during the optimization process.

score_metric, also an argument of train(), lets you provide your own scoring metric, either as a string from a list of supported metrics or as a user-defined function. The default metrics are:
- Binary Classification: roc_auc
- Multiclass Classification: recall_macro
- Regression: neg_mean_squared_error

automl_model2, _ = oracle_automl.train(model_list=['LogisticRegression'])

automl_model3, _ = oracle_automl.train(score_metric='f1_macro')

Oracle AutoML: Time Budget

The Oracle AutoML tool also supports a user-specified time budget, in seconds. This time budget works as a hint: AutoML tries to terminate computation as soon as the time budget is exhausted and returns the current best model. The model returned depends on the stage that AutoML was in when the time budget was exhausted. If the time budget is exhausted before:

Preprocessing completes, a Naive Bayes model is returned for classification and Linear Regression for regression.
Algorithm selection completes, the partial results of algorithm selection are used to evaluate the best candidate, which is returned.
Hyperparameter tuning completes, the best hyperparameter configuration known so far is returned.

automl_model5, _ = oracle_automl.train(time_budget=10)
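
The "budget as a hint" behavior can be illustrated with a small anytime search loop. This is a sketch of the general idea only, not of how Oracle AutoML is implemented: search_with_budget, the candidate names, and the scores in toy_scores are hypothetical.

```python
import time


def search_with_budget(candidates, evaluate, time_budget, fallback):
    """Evaluate candidates until the budget (in seconds) runs out."""
    deadline = time.monotonic() + time_budget
    best, best_score = fallback, None
    for candidate in candidates:
        if time.monotonic() >= deadline:
            break  # budget exhausted: stop and keep the best seen so far
        score = evaluate(candidate)
        if best_score is None or score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Hypothetical candidates and scores, for illustration only.
toy_scores = {"LogisticRegression": 0.81, "RandomForestClassifier": 0.86, "GaussianNB": 0.74}

# A generous budget lets every candidate be evaluated.
print(search_with_budget(toy_scores, toy_scores.get, time_budget=60, fallback="GaussianNB"))
# A zero budget is exhausted before any evaluation, so the fallback is returned.
print(search_with_budget(toy_scores, toy_scores.get, time_budget=0, fallback="GaussianNB"))
```

Because the deadline is checked only between evaluations, an evaluation that is already running is allowed to finish. That is one reason a time budget behaves as a hint rather than a hard limit.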

Minimum Feature List

Through min_features, AutoML ensures that the specified features are part of the final model that it creates and are not dropped during the feature selection phase. Acceptable values are:

If int, 0 < min_features <= n_features
If float, 0 < min_features <= 1.0
If list, names of features to keep. For example, ['a', 'b'] means keep features 'a' and 'b'.

automl_model6, _ = oracle_automl.train(min_features=['fnlwgt', 'native-country'])
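
The three accepted forms can be made concrete with a small validation helper. This is a hypothetical sketch, not part of the ADS API: resolve_min_features is an invented name, and it assumes that an int is a minimum number of features and a float is a minimum fraction of the features, which the bounds above suggest but the text does not state. The column names "age" and "education" are also made up for the example.

```python
import math


def resolve_min_features(min_features, feature_names):
    """Return (minimum number of features to keep, features that must be kept)."""
    n_features = len(feature_names)
    if isinstance(min_features, bool):
        raise TypeError("min_features must be an int, float, or list")
    if isinstance(min_features, int):
        if not 0 < min_features <= n_features:
            raise ValueError("int min_features must satisfy 0 < min_features <= n_features")
        return min_features, []
    if isinstance(min_features, float):
        if not 0 < min_features <= 1.0:
            raise ValueError("float min_features must satisfy 0 < min_features <= 1.0")
        # Assumption: a float is a fraction of the available features.
        return math.ceil(min_features * n_features), []
    if isinstance(min_features, list):
        unknown = [name for name in min_features if name not in feature_names]
        if unknown:
            raise ValueError(f"unknown features: {unknown}")
        return len(min_features), list(min_features)
    raise TypeError("min_features must be an int, float, or list")


features = ["age", "fnlwgt", "education", "native-country"]
print(resolve_min_features(2, features))
print(resolve_min_features(0.5, features))
print(resolve_min_features(["fnlwgt", "native-country"], features))
```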