In the context of machine learning, a feature is a characteristic that helps explain the occurrence of a phenomenon, converted into a measurable form.
For example, assume you have a list of students. This list contains the name of each student, the number of hours they studied, their IQ, and their total marks in previous examinations. Now you are given information about a new student: the number of hours they studied and their IQ, but their marks are missing. You have to estimate their probable marks.
Here, you’d use IQ and study_hours to build a predictive model that estimates the missing marks. So, IQ and study_hours are called the features for this model.
What is Feature Engineering?
Feature Engineering can simply be defined as the process of creating new features from the existing features in a dataset. Let’s consider a sample dataset that has details about a few items, such as their weight and price.
Now, to create a new feature we can use Item_Weight and Item_Price. So, let’s create a feature called Price_per_Weight: it is nothing but the price of the item divided by the weight of the item, as sketched below. This process is called feature engineering.
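Here is a minimal sketch of that computation in pandas (the column names match the example, but the values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data with item weights and prices
items = pd.DataFrame({
    "Item_Weight": [10.5, 7.2, 15.0],
    "Item_Price":  [250.0, 180.0, 420.0],
})

# New feature: price of the item divided by its weight
items["Price_per_Weight"] = items["Item_Price"] / items["Item_Weight"]
print(items)
```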
This was just a simple example to create a new feature from existing ones, but in practice, when we have quite a lot of features, feature engineering can become quite complex and cumbersome.
Let’s take another example. In the popular Titanic dataset, there is a passenger name feature and below are some of the names in the dataset:
- Montvila, Rev. Juozas
- Graham, Miss. Margaret Edith
- Johnston, Miss. Catherine Helen “Carrie”
- Behr, Mr. Karl Howell
- Dooley, Mr. Patrick
These names can actually be broken down into additional meaningful features. For example, we can extract and group similar titles into single categories. Let’s have a look at the unique titles in the passenger names.
It turns out that titles like ‘Dona’, ‘Lady’, ‘the Countess’, ‘Capt’, ‘Col’, ‘Don’, ‘Dr’, ‘Major’, ‘Rev’, ‘Sir’, and ‘Jonkheer’ are quite rare and can be put under a single label. Let’s call it rare_title. Apart from this, the titles ‘Mlle’ and ‘Ms’ can be placed under ‘Miss’, and ‘Mme’ can be replaced with ‘Mrs’.
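A minimal pandas sketch of this extraction and grouping (it assumes the Titanic training data is available locally as train.csv with a Name column; the path is hypothetical):

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Extract the title: the word followed by a '.' after the surname, e.g. 'Mr'
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
print(df["Title"].value_counts())

# Group the rare titles under one label and normalize spelling variants
rare = ["Dona", "Lady", "Countess", "Capt", "Col", "Don",
        "Dr", "Major", "Rev", "Sir", "Jonkheer"]
df["Title"] = df["Title"].replace(rare, "rare_title")
df["Title"] = df["Title"].replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
print(df["Title"].unique())  # 5 unique values remain
```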
Hence, the new title feature would have only 5 unique values, as shown below:
- Mr
- Mrs
- Miss
- Master
- rare_title
So, this is how we can extract useful information with the help of feature engineering, even from features like passenger names which initially seemed fairly pointless.
Why is Feature Engineering required?
The performance of a predictive model is heavily dependent on the quality of the features in the dataset used to train it. If you can create new features that provide the model with more information about the target variable, its performance will go up. Hence, when we don’t have enough quality features in our dataset, we have to lean on feature engineering.
In one machine learning competition, for instance, smart feature engineering was instrumental in securing a place in the top 5 percent of the leaderboard. Some of the features created there are given below:
- Hour Bins: A new feature was created by binning the hour feature with the help of a decision tree (see the sketch after this list)
- Temp Bins: Similarly, a binned feature for the temperature variable
- Year Bins: 8 quarterly bins were created for a period of 2 years
- Day Type: Days were categorized as “weekday”, “weekend” or “holiday”
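The first of these, binning a numeric feature with a decision tree, can be sketched as follows (the data and the hour/count column names are hypothetical; each leaf of the fitted tree becomes one bin):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: 'hour' of the day and the target 'count' to predict
df = pd.DataFrame({
    "hour": range(24),
    "count": [5, 3, 2, 1, 1, 4, 20, 60, 80, 40, 30, 35,
              45, 40, 35, 40, 70, 90, 85, 50, 30, 20, 15, 10],
})

# A shallow tree learns the cut points on 'hour' that best separate the target
tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0)
tree.fit(df[["hour"]], df["count"])

# Each leaf of the fitted tree becomes one bin
df["hour_bin"] = tree.apply(df[["hour"]])
print(df.groupby("hour_bin")["hour"].agg(["min", "max"]))
```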
Creating such features is no child’s play – it takes a great deal of brainstorming and extensive data exploration. Not everyone is good at feature engineering because it is not something that you can learn by reading books or watching videos. This is why feature engineering is also called an art. If you are good at it, then you have a major edge over the competition.
Automating Feature Engineering
Think of a car assembly line: in the early 20th century, cars were assembled manually by groups of workers, whereas today robots do the same job. Automating any process has the potential to make it much more efficient and cost-effective.
Building machine learning models can often be a painstaking process. It involves many steps, so if we can automate even a certain percentage of the feature engineering tasks, then data scientists and domain experts can focus on other aspects of the model.
Now that we have understood that automating feature engineering is the need of the hour, the next question to ask is – how is it going to happen? Well, we have a great tool to address this issue and it’s called Featuretools.
Introduction to Featuretools
Featuretools is an open source library for performing automated feature engineering. It is a great tool designed to accelerate the feature generation process, giving you more time to focus on other aspects of machine learning model building. In other words, it makes your data “machine learning ready”.
Before taking Featuretools for a spin, there are three major components of the package that we should be aware of:
- Entities
- Deep Feature Synthesis (DFS)
- Feature primitives
a) An Entity can be considered a representation of a Pandas DataFrame. A collection of multiple entities is called an EntitySet.
b) Deep Feature Synthesis (DFS) is the feature engineering method at the heart of Featuretools. It enables the creation of new features from a single dataframe as well as from multiple dataframes.
c) DFS creates features by applying Feature primitives to the entity relationships in an EntitySet. These primitives are the methods often used to generate features manually. For example, the primitive “mean” would find the mean of a variable at an aggregated level.
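The built-in primitives can be inspected directly; a quick look, assuming Featuretools is installed:

```python
import featuretools as ft

# Catalog of built-in primitives shipped with Featuretools
primitives = ft.list_primitives()
print(primitives[["name", "type"]].head(10))

# Two families: "aggregation" primitives summarize a child dataframe per parent
# row (e.g. mean, sum), while "transform" primitives work row-wise (e.g. month)
print(primitives["type"].unique())
```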
The best way to understand and become comfortable with Featuretools is by applying it on a dataset.
Implementation of Featuretools
The objective of the BigMart Sales challenge is to build a predictive model to estimate the sales of each product at a particular store. This would help the decision makers at BigMart identify the properties of products and stores that play a key role in increasing overall sales. Note that there are 1559 products across 10 stores in the given dataset.
The table below shows the features provided in our data:
| Variable | Description |
|---|---|
| Item_Identifier | Unique product ID |
| Item_Weight | Weight of the product |
| Item_Fat_Content | Whether the product is low fat or not |
| Item_Visibility | The % of total display area of all products in a store allocated to the particular product |
| Item_Type | The category to which the product belongs |
| Item_MRP | Maximum Retail Price (list price) of the product |
| Outlet_Identifier | Unique store ID |
| Outlet_Establishment_Year | The year in which the store was established |
| Outlet_Size | The size of the store in terms of ground area covered |
| Outlet_Location_Type | The type of city in which the store is located |
| Outlet_Type | Whether the outlet is just a grocery store or some sort of supermarket |
| Item_Outlet_Sales | Sales of the product in the particular store. This is the outcome variable to be predicted. |
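Here is a minimal sketch of running DFS on this data with the Featuretools 1.x API (the file name, the row-id construction and the choice of outlet-level columns are assumptions for illustration):

```python
import pandas as pd
import featuretools as ft

train = pd.read_csv("train.csv")  # hypothetical file name

# Item_Identifier repeats across stores, so build a unique row id
train["id"] = train["Item_Identifier"] + train["Outlet_Identifier"]
# (in practice, set the target Item_Outlet_Sales aside before building features)

# Put the data into an EntitySet and split out an outlet-level dataframe
es = ft.EntitySet(id="sales")
es = es.add_dataframe(dataframe_name="bigmart", dataframe=train, index="id")
es = es.normalize_dataframe(
    base_dataframe_name="bigmart",
    new_dataframe_name="outlet",
    index="Outlet_Identifier",
    additional_columns=["Outlet_Establishment_Year", "Outlet_Size",
                        "Outlet_Location_Type", "Outlet_Type"],
)

# Deep Feature Synthesis: stack aggregation/transform primitives up to depth 2,
# producing features such as outlet.SUM(bigmart.Item_Weight)
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="bigmart",
                                      max_depth=2)
print(feature_matrix.shape)
```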
Featuretools Interpretability
Making our data science solutions interpretable is a very important aspect of performing machine learning. Features generated by Featuretools can be easily explained even to a non-technical person because they are based on the primitives, which are easy to understand.
For example, the features outlet.SUM(bigmart.Item_Weight) and outlet.STD(bigmart.Item_MRP) mean the outlet-level sum of item weights and the outlet-level standard deviation of item prices, respectively.
This makes it possible for people who are not machine learning experts to contribute in terms of their domain expertise.
End Notes
The featuretools package is truly a game-changer in machine learning. While its applications are understandably still limited in industry use cases, it has quickly become ultra popular in machine learning competitions. The amount of time it saves, and the usefulness of the features it generates, have truly won me over.
Automated Feature Engineering Tools
Feature engineering is a technique to convert raw data columns into something meaningful that can help in predicting the outcome of a machine learning task. It can be very tedious and is often the most time-consuming step in the machine learning life cycle.
The following tools automate the whole feature engineering process and create a large number of features for both relational and non-relational data. While some of them only perform feature engineering, others also perform feature selection.
Here is a list of some of the best tools available:
1. FeatureTools
2. AutoFeat
3. TsFresh
4. Cognito
5. OneBM
6. ExploreKit
7. PyFeat
FeatureTools
One of the most popular Python libraries for automated feature engineering is FeatureTools, which generates a large feature set using “deep feature synthesis” (DFS). The library is targeted at relational data, where features can be created through aggregations or transformations. DFS requires structured, relational data to create new features.
There are two main components of FeatureTools:
· Entity and EntitySet: an Entity can be thought of as a single dataframe, while an EntitySet is a collection of more than one dataframe.
· Primitives: basic operations like mean, mode, and max that can be applied to the data. Each primitive is either a Transformation or an Aggregation.
Advantages
1) Most popular tool of its kind, hence a lot of resources are available.
2) We can specify the variable types.
3) Custom primitives can be created (see the sketch after this list).
4) Expanding features with respect to time can be created.
5) Best at handling relational databases.
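As an example of advantage 3, a custom primitive can be defined by subclassing the primitive base classes; a minimal sketch, assuming Featuretools 1.x with its woodwork-based typing:

```python
from featuretools.primitives import TransformPrimitive
from woodwork.column_schema import ColumnSchema

class Squared(TransformPrimitive):
    """A custom transform primitive that squares a numeric column."""
    name = "squared"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def squared(column):
            return column ** 2
        return squared

# The new primitive can then be handed to DFS like any built-in one, e.g.:
# ft.dfs(entityset=es, target_dataframe_name="bigmart", trans_primitives=[Squared])
```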
Limitations
1) Creates a large number of features, which can lead to the curse of dimensionality.
2) For data that is not relational, we first have to normalize it into separate tables.
3) Does not have support for unstructured data.
4) The extracted features are basic statistical features, aggregated independently of the other columns and of the target variable.
AutoFeat
AutoFeat is a Python library that automates feature engineering and feature selection along with fitting a linear regression model. It generally fits linear models to keep the process explainable.
AutoFeat is not meant for the relational data found in many business application areas; it was built with scientific use cases in mind, where experimental measurements are stored in a single table. For this reason, AutoFeat also makes it possible to specify the units of the input variables to prevent the creation of physically nonsensical features.
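A minimal sketch of AutoFeatRegressor on synthetic single-table data; the column names and data are made up, and the units argument illustrates how physical units can be declared:

```python
import numpy as np
import pandas as pd
from autofeat import AutoFeatRegressor

# Synthetic single-table data, as in a scientific use case
rng = np.random.RandomState(0)
X = pd.DataFrame({"distance": rng.uniform(1, 100, 200),
                  "time": rng.uniform(1, 10, 200)})
y = X["distance"] / X["time"] + rng.normal(0, 0.1, 200)  # roughly a speed

# Declaring units prevents physically nonsensical feature combinations
model = AutoFeatRegressor(units={"distance": "m", "time": "s"},
                          feateng_steps=2)
X_new = model.fit_transform(X, y)  # engineered features + fitted linear model
print(X_new.columns)
```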
Advantages
· The only open source framework for general-purpose automated feature engineering that is not tied to relational data.
· Also does feature selection to reduce the dimensionality problem.
· Does not create physically nonsensical features, which is useful when variables carry physical units.
Limitations
· Not good at handling relational data.
· Only makes simple features like ratios, products and other basic transformations.
· Does not consider feature interactions when making new features.
· Fits an automated model only for regression problems, not classification (newer versions also provide an AutoFeatClassifier).
TsFresh
TsFresh, which stands for “Time Series Feature extraction based on scalable hypothesis tests”, is a Python package for time series analysis that contains feature extraction methods and a feature selection algorithm. It automatically extracts dozens of kinds of features (several hundred individual features in total) from time series data, describing both basic and complex characteristics of a series, such as the number of peaks, average value, maximum value, or the time reversal symmetry statistic. These features can then be used to build regression or classification models.
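A minimal sketch of extraction followed by the hypothesis-test-based selection (the toy data is made up; tsfresh expects long-format input with id and sort columns):

```python
import pandas as pd
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute

# Toy long-format time series: one row per (series id, time step, value)
timeseries = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "time":  [0, 1, 2] * 4,
    "value": [1.0, 2.0, 3.0, 9.0, 8.0, 7.0,
              2.0, 3.0, 4.0, 8.0, 7.0, 6.0],
})

# Extract hundreds of candidate features per series
X = extract_features(timeseries, column_id="id", column_sort="time")
X = impute(X)  # replace NaN/inf produced by features undefined on short series

# Keep only the features that pass the scalable hypothesis tests
y = pd.Series([0, 1, 0, 1], index=[1, 2, 3, 4])  # hypothetical labels per series
X_selected = select_features(X, y)
print(X.shape, X_selected.shape)
```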
Advantages
· Best open source Python tool available for time series classification and regression.
· Can be easily integrated with FeatureTools.
Limitations
· Can only be used for time-series data, and even then it is mainly suited to supervised learning.
PyFeat
PyFeat is a practical and easy-to-use toolkit implemented in Python for extracting various features from protein, DNA and RNA sequences.
Since it can only be used with genomic data, it is not very useful for everyday classification and regression tasks, but it finds its use in bioinformatics and the pharmaceutical industry.
ExploreKit
ExploreKit is based on the intuition that highly informative features often result from manipulations of elementary ones. It identifies common operators with which to transform each feature individually or combine several of them, uses these operators to generate many candidate features, and chooses which subset to add based on the empirical performance of models trained with the candidate features added.
The generation and selection process works roughly as follows: candidate features are generated with the operators, ranked by a meta-learned model, and the top-ranked candidates are evaluated and added if they improve model performance.
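Since no official implementation is available, the following is only an illustrative, simplified sketch of such a generate-and-evaluate loop (all names are hypothetical; ExploreKit itself ranks candidates with a meta-learned model instead of evaluating every one):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def generate_candidates(X):
    """Apply unary and binary operators to the elementary features."""
    numeric = X.select_dtypes("number").columns
    candidates = {}
    for col in numeric:
        candidates[f"log1p({col})"] = np.log1p(X[col] - X[col].min())
    for i, a in enumerate(numeric):
        for b in numeric[i + 1:]:
            candidates[f"{a}*{b}"] = X[a] * X[b]
    return candidates

def add_best_candidate(X, y, cv=3):
    """Return the candidate feature that most improves CV accuracy, if any."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    best_name, best_score = None, cross_val_score(model, X, y, cv=cv).mean()
    for name, values in generate_candidates(X).items():
        X_cand = X.copy()
        X_cand[name] = values
        score = cross_val_score(model, X_cand, y, cv=cv).mean()
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```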
Advantages
· Uses meta-learning to rank candidate features rather than running feature selection on all created features, which can be a very large set.
Limitations
· No open source implementation either in Python or R.
OneBM
OneBM works directly with multiple raw tables in a database. It joins the tables incrementally, following different paths on the relational graph. It automatically identifies the data types of the joined results, including simple types (numerical or categorical) and complex types (sets of numbers, sets of categories, sequences, time series and texts), and applies corresponding pre-defined feature engineering techniques to the given types. This way, new feature engineering techniques can be plugged in through an interface to its feature extractor modules to extract the desired types of features in a specific domain.
Feature selection is then used to remove irrelevant features extracted in the prior steps. First, duplicated features are removed. Second, if the training and test data have an implicit order defined by a column (e.g. a timestamp), then drift features are detected by comparing the distribution of feature values between the training set and a validation set. If the two distributions differ, the feature is identified as a drift feature that may cause over-fitting, and all drift features are removed from the feature set.
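A drift check of this kind can be sketched by comparing feature distributions between the training and validation periods, for example with a two-sample Kolmogorov-Smirnov test (an illustration, not OneBM's exact procedure; the X_train/X_valid names are hypothetical):

```python
from scipy.stats import ks_2samp

def is_drift_feature(train_values, valid_values, alpha=0.01):
    """Flag a feature whose distribution differs between train and validation."""
    statistic, p_value = ks_2samp(train_values, valid_values)
    return p_value < alpha  # small p-value: distributions differ, i.e. drift

# Usage sketch: drop drifting columns from the engineered feature set
# drift_cols = [c for c in X_train.columns
#               if is_drift_feature(X_train[c], X_valid[c])]
# X_train = X_train.drop(columns=drift_cols)
```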
Advantages
· Works well with both relational and non-relational data.
· Generates simple as well as complex features.
· Can also be used to create features for big data.
Limitations
· No open source implementation.
Cognito
Cognito is a system that automates feature engineering from a single database table. In each step, it recursively applies a set of predefined mathematical transformations to the table’s columns to obtain new features, so the number of features grows exponentially with the number of steps. A feature selection strategy is therefore used to remove redundant features. It has been shown to improve prediction accuracy on UCI datasets.
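A toy sketch of that recursive expansion (the names are hypothetical; note how the feature count grows exponentially with depth, which is why feature selection is essential):

```python
import numpy as np
import pandas as pd

TRANSFORMS = {"log1p": np.log1p, "sqrt": np.sqrt, "square": np.square}

def cognito_style_expand(df, depth=2):
    """Recursively apply unary transforms to all numeric columns."""
    out = df.copy()
    frontier = list(out.select_dtypes("number").columns)
    for _ in range(depth):
        new_cols = {}
        for col in frontier:
            shifted = out[col] - out[col].min()  # keep values non-negative
            for name, fn in TRANSFORMS.items():
                new_cols[f"{name}({col})"] = fn(shifted)
        out = pd.concat([out, pd.DataFrame(new_cols, index=out.index)], axis=1)
        frontier = list(new_cols)  # the next step transforms the new features
    return out

df = pd.DataFrame({"x": [1.0, 4.0, 9.0]})
print(cognito_style_expand(df, depth=2).shape)  # 1 -> 1 + 3 + 9 = 13 columns
```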
Limitations
· No open source implementation.
· Extra effort is needed to handle relational data.
We can use these tools to come up with a large number of features without investing much time, and focus more on the other aspects of machine learning: modelling and at-scale deployment.