What are feature types?
ADS uses the term “feature type” to refer to the nature of the data.
Different from the data type used to store the data
The feature type represents the data in the way the data scientist understands it. Pandas uses the term “column” or “series” to refer to a column of data. In ADS, the term “feature” refers to a column or series once a feature type has been applied to it. Feature types do not change the way the data is stored in memory, but they add a lot of powerful functionality.
Customized to your data
- Each feature type can be customized to describe the data that you are working with. You can test the data and make sure that it is properly constrained, for example, that each value is a valid telephone number.
- Feature types also provide tools to confirm that a data set conforms to the assumptions you are making about it, for example, that it is half male and half female. You can write a test that checks the assumption, and each time the data changes, you can quickly revalidate all your assumptions.
- Feature types allow you to create plots that are informative and standardized.
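The two kinds of checks described above, one per observation and one for the data set as a whole, can be sketched in plain Python. This is only an illustration and does not use the ADS API: the names `is_valid_phone` and `proportion`, the ten-digit phone format, and the 10% tolerance are all assumptions made for the example.

```python
import re

# Hypothetical pattern: optional +1 prefix, then a 10-digit number
# with optional spaces, dots, or dashes as separators.
_PHONE_PATTERN = re.compile(
    r"^(\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$"
)


def is_valid_phone(value):
    """Observation-level check: does this value look like a phone number?"""
    return isinstance(value, str) and bool(_PHONE_PATTERN.match(value.strip()))


def proportion(values, category):
    """Data-set-level helper: share of observations equal to `category`."""
    return sum(v == category for v in values) / len(values)


phones = ["212-555-0147", "(415) 555 0199", "555-01", None]
# One boolean per observation, in the same order as the input.
validity = [is_valid_phone(p) for p in phones]

genders = ["F", "M", "F", "M"]
# Assumption being tested: roughly half of the observations are "F".
is_balanced = abs(proportion(genders, "F") - 0.5) < 0.1
```

Rerunning checks like these whenever the data changes is what lets you revalidate your assumptions quickly.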
Reused across the organization
- The data changes, but the variables that an organization collects normally do not change much.
- A feature type documents the nature of the data.
- By codifying these nuances into the feature type, everyone can be confident in the quality of their data.
Work on Pandas Series and DataFrames
One of the best parts of feature types is that they work with Pandas. Feature types sit on top of Pandas, so anything that you can do with Pandas can still be done when feature types are applied. Feature types just extend the capabilities of the data scientist.
Types of Feature Types
Default
- Is based on the Pandas dtype (The dtype is the way Pandas stores the data in memory.)
- Cannot be changed by the user (All features have a default feature type. You cannot change the default feature type unless you change the underlying dtype.)
- Does not have to be declared (Since there is so little control over the default feature type, you do not have to declare it when you declare the inheritance chain. You can, but you don’t have to.)
ADS
- Common feature types (such as telephone number, zip code, or GIS coordinates. In general, it’s not possible to create many generic feature types, because each organization’s data has nuances that a generic feature type would not capture.)
- Provided by the ADS SDK (ADS feature types are a small set of common feature types that the ADS team has created for you.)
Custom
- User-defined (These are feature types that you and your team create, defining the properties of the feature yourselves. Since they are customized to your organization’s data, you have full control over what the feature type defines.)
- Derived from the FeatureType class (The basic requirement is that you create a class and derive it from the FeatureType class.)
Tag
- Inert feature type (It does not do anything on its own.)
- Used as a label but not defined as a Python class (It is a label that you can attach to a feature type inheritance chain. It does not have any fancy features; it’s a handy way of tagging your feature without doing much work at all.)
Exploratory Data Analysis (EDA)
EDA is the process of examining your data to understand its nature.
We’re looking to understand things like the distribution of each feature, data condition issues such as missing values, and how the features are related to each other. This is a very time-consuming process, but it is one of the most critical steps in data science work. With feature types, you set up this analysis once and can then use it over and over again. Feature types support EDA with:
- Data validation at the observation level
- Warning of data condition at the data set level
- Custom visualizations
- Custom summary statistics
- Correlations across features
Inheritance chain
For example, a medical record number is also a type of ID, which in turn is an integer, so it is all three things. We would create an inheritance chain in which each feature type adds a specialization. At its root, the value is an integer, but not all integers are IDs.
- All IDs are positive and don’t include 0.
- All medical record numbers have eight digits.
The medical record number, the ID, and the integer feature type make up the inheritance chain.
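As a rough illustration of how each link in this chain adds one constraint, here is a minimal sketch using plain Python classes. It does not derive from the ADS FeatureType class; the class names and the `is_valid` method are hypothetical and chosen only to mirror the example above.

```python
class Integer:
    @staticmethod
    def is_valid(value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)


class ID(Integer):
    @staticmethod
    def is_valid(value):
        # All IDs are integers that are positive and never 0.
        return Integer.is_valid(value) and value > 0


class MedicalRecordNumber(ID):
    @staticmethod
    def is_valid(value):
        # All medical record numbers are IDs with exactly eight digits.
        return ID.is_valid(value) and len(str(value)) == 8


# The chain runs from the most specific type to the most general one.
chain = [MedicalRecordNumber, ID, Integer]
checks = {cls.__name__: cls.is_valid(12345678) for cls in chain}
```

A value such as `1234` would pass the `ID` and `Integer` checks but fail the `MedicalRecordNumber` check, and `0` would pass only the `Integer` check.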
Multiple Inheritance
Unlike inheritance in many systems, there is not a true parent-child relationship between the feature types. Each feature type is standalone, and the behavior depends on the order in which the feature types are listed. However, feature types are often designed to have a parent-child relationship, as this reduces code duplication. When you call an attribute or method on a feature, ADS generally searches the inheritance chain and dispatches on the first match.
This is similar to parent-child inheritance in most object-oriented programming languages. The difference is that there is no requirement that the medical record number is listed before the ID feature type in the inheritance chain. The medical record number feature type has no dependency on the ID feature type.
In short, features can inherit characteristics. More specifically:
A feature can have multiple feature types. For example, a wholesale car price might have the following feature types:
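The "first match wins" lookup can be sketched as follows. This is a simplified model in plain Python, not the ADS implementation: the `dispatch` helper and the `describe` and `summary` methods are hypothetical names used only to show why the order of the chain matters.

```python
class Integer:
    def describe(self):
        return "integer"

    def summary(self, values):
        return {"count": len(values)}


class ID:
    # Standalone: it does not derive from Integer.
    def describe(self):
        return "identifier"


class MedicalRecordNumber:
    # Standalone and defines no methods of its own.
    pass


def dispatch(chain, name):
    """Walk the chain in order and return the first attribute called `name`."""
    for feature_type in chain:
        if hasattr(feature_type, name):
            return getattr(feature_type, name)
    raise AttributeError(name)


chain = [MedicalRecordNumber(), ID(), Integer()]
first_description = dispatch(chain, "describe")()      # found on ID
fallback_summary = dispatch(chain, "summary")([1, 2])  # found only on Integer

# Listing the same feature types in a different order changes the result.
reordered = [Integer(), ID(), MedicalRecordNumber()]
reordered_description = dispatch(reordered, "describe")()
```

Because the feature types here are independent classes, nothing forces a particular order; the order you list them in decides which definition is used.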
wholesale_price, car_price, USD, continuous
Each feature type defines characteristics for a given feature. A feature inherits these defined characteristics.
You can assign feature types to a Pandas Series:
df['wprice'].ads.feature_type = ['wholesale_price', 'car_price', 'USD', 'continuous']
You can assign feature types to a Pandas DataFrame:
from ads.feature_engineering import Tag  # needed for Tag('target')
df.ads.feature_type = {
    'Attrition': ['boolean', Tag('target')],
    'TravelForWork': ['travel_type', 'category'],
    'JobFunction': ['job_type', 'category'],
    'EducationalLevel': ['education_level', 'category'],
}
Feature Type Selection
Feature type selection lets you select columns from a DataFrame based on feature type.
- include: selects the columns that inherit from at least one feature type in the list.
- exclude: drops the columns that inherit from at least one feature type in the list.
Note: exclude overrides include.
df.ads.feature_select(include=['category', 'boolean'], exclude=['education_level'])
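To make the include and exclude rules concrete, the sketch below reproduces the selection logic in plain Python on a dictionary of column names and feature type chains. The function `select_features` is hypothetical and is not the ADS implementation; treating a missing `include` list as "all columns" is an assumption, and the `Tag('target')` entry is left out for simplicity.

```python
# Column name -> declared feature type chain (taken from the example above).
feature_types = {
    "Attrition": ["boolean"],
    "TravelForWork": ["travel_type", "category"],
    "JobFunction": ["job_type", "category"],
    "EducationalLevel": ["education_level", "category"],
}


def select_features(feature_types, include=None, exclude=None):
    """Return the column names that match `include` and do not match `exclude`."""
    selected = []
    for column, chain in feature_types.items():
        # Assumption: with no include list, every column is a candidate.
        included = include is None or any(ft in chain for ft in include)
        excluded = exclude is not None and any(ft in chain for ft in exclude)
        # exclude overrides include.
        if included and not excluded:
            selected.append(column)
    return selected


selected = select_features(
    feature_types,
    include=["category", "boolean"],
    exclude=["education_level"],
)
```

Here `EducationalLevel` matches `include` through `category`, but it is dropped because it also matches `exclude`.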
Feature Type Count
The feature count method returns a DataFrame that summarizes which feature types are in use. This is often really helpful when you are working with a lot of features.
- A count of the number of features for each feature type.
- A “Primary” feature type is the first feature type in the inheritance chain.
- Use feature_count() on a DataFrame.
string: This is the default feature type, which is based on the dtype of the column in the Pandas DataFrame you’re using. You do not have to define it when you define the inheritance chain; it is automatically added at the end. In this case, Attrition, TravelForWork, and JobFunction have strings as their dtype, so they automatically inherit the string feature type as their default feature type.
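The kind of summary described above can be approximated in plain Python. This is only a sketch of the counting idea, not the ADS `feature_count()` output: the function name `count_feature_types`, the output layout, and the choice to count how often each feature type is primary are assumptions. The default `string` feature type is written out at the end of each chain to mirror the automatic behavior described above.

```python
from collections import Counter

# Column name -> full inheritance chain, with the default type at the end.
feature_types = {
    "Attrition": ["boolean", "string"],
    "TravelForWork": ["travel_type", "category", "string"],
    "JobFunction": ["job_type", "category", "string"],
}


def count_feature_types(feature_types):
    """Count features per feature type, and how often each type is primary."""
    totals = Counter()
    primary = Counter()
    for chain in feature_types.values():
        totals.update(chain)      # every type in the chain counts once
        primary[chain[0]] += 1    # the first type in the chain is the primary
    return {
        name: {"count": totals[name], "primary": primary[name]}
        for name in totals
    }


summary = count_feature_types(feature_types)
```

In this example `string` is counted for all three features but is never primary, because it always sits at the end of the chain.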
Correlation Tables
When a data scientist wants to understand the relationship between features, correlation analysis is normally part of the EDA. Correlation analysis also helps make a model as parsimonious as possible, which often involves determining which features are highly correlated and removing the redundant ones.
Pearson correlation coefficient → pearson()
It ranges from -1 to 1, where 1 means that the two data sets are perfectly positively correlated and -1 means that they are perfectly negatively correlated. A value of 0 means that there is no linear correlation. It is used when both data sets consist of continuous values.
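For reference, the coefficient itself can be computed by hand as the covariance of the two features divided by the product of their standard deviations. The sketch below uses only the standard library; the function name `pearson_r` is chosen for illustration and is not the ADS method.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


x = [1.0, 2.0, 3.0, 4.0]
rising = pearson_r(x, [2.0, 4.0, 6.0, 8.0])   # y increases with x
falling = pearson_r(x, [8.0, 6.0, 4.0, 2.0])  # y decreases as x increases
```

The first pair gives a coefficient of 1 and the second gives -1, matching the two ends of the range described above. The function assumes that neither input is constant, since a zero variance would make the ratio undefined.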
Correlation ratio → correlation_ratio()
Correlation ratio is a measure of the dispersion within categories relative to the dispersion across the entire data set. The correlation ratio is a weighted variance of the category means over the variance of all the samples. This metric is used to compare categorical variables to continuous variables.
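A minimal sketch of that calculation follows, using only the standard library. The function name `correlation_ratio_eta` is hypothetical. It returns eta, the square root of the variance ratio described above; whether a given library reports eta or the squared value is not stated here, so treat that choice as an assumption.

```python
import math
from collections import defaultdict


def correlation_ratio_eta(categories, values):
    """Correlation ratio (eta) between a categorical and a continuous feature."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)

    overall_mean = sum(values) / len(values)
    # Weighted squared deviation of each category mean from the overall mean.
    between = sum(
        len(group) * (sum(group) / len(group) - overall_mean) ** 2
        for group in groups.values()
    )
    # Total squared deviation of all samples from the overall mean.
    total = sum((v - overall_mean) ** 2 for v in values)
    if total == 0:
        return 0.0  # constant values carry no dispersion to explain
    return math.sqrt(between / total)


categories = ["a", "a", "b", "b"]
separated = correlation_ratio_eta(categories, [1.0, 1.0, 3.0, 3.0])
overlapping = correlation_ratio_eta(categories, [1.0, 3.0, 1.0, 3.0])
```

When the category fully determines the value, the result is 1; when both categories have the same mean, the result is 0.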
Cramér’s V → cramersv()
Cramér’s V measures the amount of association between two categorical variables. A value of 0 means that there is no association between the two variables, and a value of 1 means that there is complete association.
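The statistic is built from the chi-squared statistic of the contingency table, scaled by the sample size and the smaller table dimension. The sketch below uses only the standard library and computes the uncorrected form; the function name `cramers_v` is chosen for illustration, and it does not apply the bias correction that some implementations use.

```python
import math
from collections import Counter


def cramers_v(x, y):
    """Uncorrected Cramér's V between two equal-length categorical sequences."""
    n = len(x)
    observed = Counter(zip(x, y))  # cell counts of the contingency table
    row_totals = Counter(x)
    col_totals = Counter(y)

    chi2 = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            expected = row_total * col_total / n
            chi2 += (observed[(row, col)] - expected) ** 2 / expected

    smaller_dim = min(len(row_totals), len(col_totals)) - 1
    if smaller_dim == 0:
        return 0.0  # a variable with a single category has no association
    return math.sqrt(chi2 / (n * smaller_dim))


x = ["a", "a", "b", "b"]
complete = cramers_v(x, ["u", "u", "v", "v"])     # y is determined by x
independent = cramers_v(x, ["u", "v", "u", "v"])  # y is unrelated to x
```

The first pair yields 1 (complete association) and the second yields 0 (no association).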
The corresponding plot methods are:
- df.ads.pearson_plot()
- df.ads.correlation_ratio_plot()
- df.ads.cramersv_plot()