Data Preprocessing and Preparation (Data Transformations & Manipulations)
Data Preprocessing
Takes about 80% of the time in a typical ML life cycle. Real-world data is often incomplete and has many missing values:
- Inconsistent, with many discrepancies
- Contains errors and outliers
Preprocessing of data involves various steps:
1. Combining and Cleaning Data (結合與清除資料)
It is often the case that the data needed to solve a machine learning problem comes from different sources and must be combined. When combining, it is very important to take care of formats and units.
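A minimal sketch of combining two hypothetical sources with pandas, converting units first so the combined column is consistent (the data and column names are made up for illustration):

```python
import pandas as pd

# Two hypothetical sources report temperature in different units.
us_readings = pd.DataFrame({"city": ["Austin"], "temp_f": [86.0]})
eu_readings = pd.DataFrame({"city": ["Paris"], "temp_c": [25.0]})

# Normalize units before combining: convert Fahrenheit to Celsius.
us_readings["temp_c"] = (us_readings["temp_f"] - 32) * 5 / 9
us_readings = us_readings.drop(columns=["temp_f"])

# Stack the two sources into one data set with a shared schema.
combined = pd.concat([us_readings, eu_readings], ignore_index=True)
print(combined)
```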
2. Data Imputation (資料插補)
- The process of dealing with missing data
- Missing values are caused by human errors, transmission errors, or unrecorded categorical entries
- Can be handled by deletion (decided case by case; least recommended)
- Or by filling missing data with the mean, median, or mode
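A small sketch of mean and mode imputation with pandas, on made-up data (numeric columns typically take the mean or median; categorical columns take the mode):

```python
import pandas as pd

# Toy data with missing values in a numeric and a categorical column.
df = pd.DataFrame({
    "age": [22, None, 30, 26],               # numeric -> fill with mean
    "color": ["red", "blue", None, "red"],   # categorical -> fill with mode
})

df["age"] = df["age"].fillna(df["age"].mean())
df["color"] = df["color"].fillna(df["color"].mode()[0])
print(df)
```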
3. Dummy Variables (虛擬變數)
3.1 Ordinal Encoding (also known as LabelEncoder): converts categorical data into a numeric order. Used when the categories differ in degree; each category is mapped to a value in 0 to N-1.
- Nominal: Categories with no sorting order
- Ordinal: Categories with sorting order
from ads.dataset.label_encoder import DataFrameLabelEncoder
3.2 One-Hot Encoding: if a column has no ordinal relationship (e.g., city, gender), LabelEncoder is less suitable; use OneHotEncoder instead. It adds N columns, each using 0 and 1 to indicate whether the original feature belongs to that category.
- get_dummies()
- Encode all categorical columns using fit_transform()
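Both encodings can be sketched in pandas alone; the ordinal mapping below is hand-chosen for a toy "size" column, and get_dummies() handles the nominal "city" column:

```python
import pandas as pd

# Toy data: "size" is ordinal (has a degree), "city" is nominal (no order).
df = pd.DataFrame({"size": ["small", "large", "medium"],
                   "city": ["Taipei", "Tokyo", "Taipei"]})

# Ordinal encoding: map ordered categories to 0..N-1 (order chosen by hand).
order = {"small": 0, "medium": 1, "large": 2}
df["size_code"] = df["size"].map(order)

# One-hot encoding: get_dummies() adds one 0/1 column per category.
df = pd.get_dummies(df, columns=["city"])
print(df)
```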
4. Outlier Detection(異常檢測)
An outlier can be an error or a true data point. It usually sits far away from the rest of the data. Detected by:
- Visualization
- Statistical Measures
- Machine Learning
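One of the statistical measures above is the interquartile-range (IQR) rule: points beyond 1.5 × IQR from the quartiles are flagged. A sketch with made-up values:

```python
from statistics import quantiles

# Toy data: five values near 10 and one far-away point.
values = [9.8, 9.9, 10.0, 10.1, 10.3, 55.0]

q1, _, q3 = quantiles(values, n=4)   # first and third quartiles
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the fences is flagged as a potential outlier.
outliers = [v for v in values if v < lo or v > hi]
print(outliers)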
5. Feature Scaling(特徵縮放)
When comparing and analyzing two sets of data, differences in units can cause different degrees of variation and affect the results of statistical analysis. To solve this problem, we can use normalization and standardization to convert the raw data into dimensionless scalars before comparing and analyzing.
Used with algorithms that compute Euclidean distances (歐幾里得距離), such as regression. Two methods:
5.1 Normalization (正規化): scales the raw data proportionally into the [0, 1] interval without changing its original distribution.
- Distribution remains the same.
- Range becomes [0, 1] (min-max scaling)
5.2 Standardization (標準化)
- Distribution remains the same.
- Mean is 0 and SD is 1, so the feature columns take the form of a standard normal distribution and it is easier to learn the weights.
- This is more practical because many models initialize weights to 0 or values close to it.
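Both methods can be sketched in a few lines of plain Python on a toy column (real pipelines typically use scikit-learn's MinMaxScaler and StandardScaler):

```python
from statistics import mean, pstdev

# Toy feature column.
col = [2.0, 4.0, 6.0, 8.0]

# Min-max normalization: rescale into [0, 1], keeping the distribution shape.
lo, hi = min(col), max(col)
normalized = [(x - lo) / (hi - lo) for x in col]

# Standardization: subtract the mean, divide by the standard deviation.
mu, sigma = mean(col), pstdev(col)
standardized = [(x - mu) / sigma for x in col]
print(normalized, standardized)
```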
Dimensionality Reduction (https://en.wikipedia.org/wiki/Dimensionality_reduction)
There are two ways to do dimensionality reduction.
- Feature selection: select a subset of the original features.
- Feature extraction: derive information from existing features to create a new feature subspace.
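Feature extraction can be sketched with a bare-bones PCA: project the data onto the eigenvector of the covariance matrix with the largest eigenvalue (the data here is made up; libraries such as scikit-learn provide this directly):

```python
import numpy as np

# Toy 2-D data whose two features are strongly correlated.
X = np.array([[2.0, 1.9], [1.0, 1.1], [3.0, 3.2], [4.0, 3.8]])
Xc = X - X.mean(axis=0)                 # center each feature

cov = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
top = eigvecs[:, np.argmax(eigvals)]    # direction of greatest variance

# Project onto the principal component: a new 1-D feature subspace.
X_reduced = Xc @ top
print(X_reduced)
```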
Text Data
Text data preprocessing can be a little different from numerical data preprocessing, and usually involves some or all of these processes.
- Vectorize : Transform text into numerical feature vectors
- Stop Words : Commonly used words that have no value to text analysis
- POS Tagging : Identifying each token’s part of speech and tagging it
- Tokenize : Breaking down text into tokens, such as words, characters, or n-grams
- Stemming : Text standardization to stem words to their root
- Lemmatization : Stemming according to context, usually using a dictionary
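A toy illustration of tokenizing, stop-word removal, and a crude suffix-stripping stemmer; the stop-word list and the "-ing" rule are invented for the sketch (real pipelines use libraries such as NLTK or spaCy):

```python
import re

# Hypothetical stop-word list for the sketch.
STOP_WORDS = {"the", "is", "a", "and"}

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenize words
    tokens = [t for t in tokens if t not in STOP_WORDS]   # drop stop words
    # Crude stemming: strip a trailing "-ing" (a real stemmer is smarter).
    return [t[:-3] if t.endswith("ing") else t for t in tokens]

print(preprocess("The cat is chasing a string"))
```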
ADS Data Transformations
- Data can be transformed and manipulated with ADS built-in functions.
- Underlying an ADSDataset object is a Pandas dataframe.
- Any operation that can be performed on a Pandas dataframe can also be applied to an ADSDataset.
ADS built-in Tools (Apply Automated Transformations)
suggest_recommendations()
Shows the detected issues and recommends changes to apply to the data set. You can accept the changes with a click of a button in the dropdown menu.
auto_transform()
Use auto_transform() to apply all the recommended transformations at once. It optimizes the data set by imputing missing values in noisy data and dropping strongly correlated columns, since they don’t help generalization.
- All ADS data sets are immutable; any transforms that are applied result in a new data set.
- Automatic transformation fixes the class imbalance.
- ADS downsamples the majority class first unless there are too few data points. If there are too few data points, then ADS will upsample the minority class.
- The optional parameter fix_imbalance is set to True by default.
visualize_transforms()
- Visualizes the transformation that has been performed on a data set
- Only applies to the automated transformations, not any custom transformations
Split Data Set into Train, Validation, and Test Data
Before inputting data to an ML algorithm, we have to split the data into train, validation, and test sets.
- For a train, test, and validation set, the defaults are set to 80% of the data for training, 10% for testing, and 10% for validation.
- The split can also be set explicitly, for example to 70%, 15%, and 15%.
The resulting three data subsets each have separate data (X) and labels (y).
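The 70/15/15 split can be sketched by hand in plain Python (ADS provides this through its own API; the shuffle-then-slice approach below is a generic illustration):

```python
import random

# Toy data set of 100 records; shuffle before splitting to avoid order bias.
random.seed(0)
data = list(range(100))
random.shuffle(data)

n = len(data)
n_train, n_val = int(n * 0.70), int(n * 0.15)

train = data[:n_train]
validation = data[n_train:n_train + n_val]
test = data[n_train + n_val:]
print(len(train), len(validation), len(test))
```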