Machine learning and data science are two extremely popular fields of computer science, and they overlap at many points. Because of this overlap, the two fields demand many of the same tasks of their practitioners. To use a machine learning algorithm effectively on your data, you need to be sure that it has been preprocessed and sanitized properly, which often involves the same preprocessing steps used in data science. Real-world data is frequently old, corrupted, or incomplete, and getting it into usable shape means cleaning it up so that your machine learning algorithm has input it can use effectively. Let's take a look at what preprocessing is all about, how it relates to machine learning and data science, and the most important data preprocessing techniques you need to know.
1) Clean, Normalize, And Transform Data
When you're working with data, whether for analysis or for some kind of ML algorithm, you'll need to clean, normalize, and transform it. This is a crucial step, because dirty data can cause problems downstream; but at first glance it's not always obvious what "clean" means. It often seems like there should be a single definition of clean, one canonical way to standardize data, but that's simply not how it works: it all depends on your use case. When tackling ML or data science problems, ask yourself: what do I want my final output to look like? How will other people interpret my results? What kinds of errors might they make if I don't clarify things? What are my constraints (time, budget)? Those questions will help you figure out exactly what needs to happen during preprocessing. If you're still unsure whether something is clean enough, run it by someone who knows more than you do! You don't have to go through all these steps alone.
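As a minimal sketch of two common transforms, here is min-max normalization (rescaling each column to [0, 1]) and z-score standardization (centering each column at 0 with unit variance). The data frame here is fabricated for illustration:

```python
import pandas as pd

# Hypothetical raw data on two very different scales
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [48000, 52000, 61000, 250000],
})

# Min-max normalization: rescale each column to the [0, 1] range
normalized = (df - df.min()) / (df.max() - df.min())

# Z-score standardization: center each column at 0 with unit variance
standardized = (df - df.mean()) / df.std()
```

Which transform is appropriate depends on the algorithm: distance-based methods often prefer min-max scaling, while many linear models assume standardized inputs.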
2) Explore The Data
Let’s get started. Explore your data! How much data do you have? How many observations? What are the values? Are they ordered (in some way)? Do they all take a value between 0 and 1 or -1 to 1? Then, look at each variable. Does it make sense that it’s there? Does it make sense that it has been coded in a particular way (i.e., is there a variable for left-handedness if you don’t have relevant information about left-handed people)? Is there redundant information within your dataset that can be removed without losing important information? You should also think about how your variables relate to one another. For example, is it possible that one variable could serve as an indicator of another? If so, does it make sense to combine them into one? And finally, what other types of variables might you want to add? If there are any missing values, do you know why they were missing and whether those missing values will affect your analysis in any way?
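The exploration questions above map directly onto a few one-liners in Pandas. This sketch uses a fabricated dataset containing a missing value and a redundant column (height recorded in both centimeters and meters):

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with a missing value and a redundant column
df = pd.DataFrame({
    "height_cm": [170, 165, np.nan, 180],
    "height_m": [1.70, 1.65, np.nan, 1.80],  # redundant: same info as height_cm
    "score": [0.2, 0.9, 0.5, 0.7],
})

n_rows, n_cols = df.shape            # how much data do you have?
missing_per_col = df.isna().sum()    # any missing values, and where?
summary = df.describe()              # ranges: does each variable make sense?

# A correlation near 1 between two variables flags redundant information
corr = df["height_cm"].corr(df["height_m"])
```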
3) Scrub Duplicate/Near Duplicate Records
This is a very easy thing to overlook, but it can be important. If you're working with big data (i.e., tons of records), there's a good chance you'll have duplicate or near-duplicate records, which can skew your results when generalized to large populations (like groups of test subjects). Duplicates can also produce bogus results if something causes them to appear as distinct entities. So scrub them out using an identifier like IP address or email address. You might not know how many records require scrubbing until you run your analysis, so make sure to do it before running any tests. To clean up duplicates and near-duplicates:
1) Determine what constitutes a duplicate record.
2) Run all records through your identifying function to determine if they are unique.
3) Use your unique records for further analysis. The last step here is most important—don’t just assume that because one record has X, Y, and Z fields that every other record should too!
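The three steps above can be sketched in Pandas, here using email address as the identifying field on a fabricated set of contact records:

```python
import pandas as pd

# Hypothetical contact records; email serves as the identifying field
records = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com"],
    "name":  ["Ann", "Bob", "Ann B.", "Cal"],
})

# Step 1: define a duplicate as any record sharing an email with an earlier one
# Step 2: run records through the identifying function to flag non-unique rows
is_dup = records.duplicated(subset="email")

# Step 3: keep only the first occurrence of each email for further analysis
unique_records = records.drop_duplicates(subset="email", keep="first")
```

Note that "Ann" and "Ann B." differ in the name field, which is exactly why you dedupe on a stable identifier rather than assuming every field matches.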
4) Identify Outliers
You should also examine your data for outliers. Outliers can be caused by erroneous data entry or by rare values resulting from errors in measurement. Three common techniques for identifying outliers are Grubbs' test, Tukey's fences, and Dixon's Q test. Grubbs' test assumes approximately normal data and checks whether the single most extreme score lies too many standard deviations from the mean; if it does, that score is classified as an outlier. Tukey's fences use quartiles instead of standard deviations: any score below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR (where the IQR, or interquartile range, is Q3 − Q1) is flagged as an outlier, so this method can flag multiple points at once. Finally, Dixon's Q test, designed for small samples, compares the gap between a suspect score and its nearest neighbor with the range of the whole dataset; if that ratio exceeds a critical value, the suspect score is rejected as an outlier.
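Tukey's fences are the easiest of the three to implement. This sketch flags any point beyond 1.5 × IQR of the quartiles, using a fabricated sample with one injected outlier:

```python
import numpy as np

# Hypothetical measurements with one injected outlier (25.0)
data = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0])

# Tukey's fences: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
```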
5) Do Feature Selection
Feature selection is a data mining method for reducing the dimensionality of data during predictive modeling. The objective of feature selection is to select a subset of relevant features from a larger set. Although there are many methods available, it is also worth noting that feature selection isn’t always needed because sometimes existing features (like past interactions with your customers) can be used as they are. However, if you do decide to go through with it, many applications can benefit from some filtering or weeding out of unused variables. There are three main approaches: manual, automated and semi-automated techniques. Manual techniques include domain knowledge coupled with visual inspection to analyze and select useful features by understanding their relationship with other parameters or variables in an application. Automated techniques involve applying statistical tests on variable distributions, correlations between variables, variable attributes and so on. Finally, semi-automated techniques involve using software tools to apply statistical tests and rank possible candidates based on their relevance. In general, there are two main ways of selecting a subset of features: forward selection and backward elimination. Forward selection starts with no features included in your model then adds one at a time until all desired features have been added while backward elimination starts with all possible candidate features included then removes one at a time until only desired ones remain.
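A minimal sketch of the automated (filter-style) approach: rank features by the strength of their statistical relationship with the target and keep only those above a chosen threshold. The data and the 0.3 cutoff here are fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical features: x0 drives the target, x1 is weak, x2 is pure noise
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 * x0 + 0.1 * x1 + rng.normal(scale=0.5, size=n)

features = {"x0": x0, "x1": x1, "x2": x2}

# Filter-style selection: rank by absolute correlation with the target
scores = {name: abs(np.corrcoef(col, y)[0, 1]) for name, col in features.items()}
selected = [name for name, s in scores.items() if s > 0.3]
```

Forward selection and backward elimination differ in that they re-fit the model at each step instead of scoring features independently, which catches interactions a simple filter misses.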
6) Remove Some Columns From Consideration Entirely
Removing columns from your dataset may seem wasteful, but you'll save time by focusing on the most relevant attributes. Plus, by removing information that's irrelevant or unnecessary, you're simplifying things for your machine learning algorithm. While you're at it, make sure to remove any redundant data points: entries that are duplicates or near-duplicates of one another. Often these duplicate entries result from a simple mistake (like inputting an extra zero). This step ensures that each unique data point appears only once in your dataset. A helpful way to spot them is to sort your dataset, either by column name or by sorting rows alphabetically or numerically on their values, so that identical entries end up next to each other. In R, you can reorder a data frame by one column like so: mydataframe[order(mydataframe$somecolumn, decreasing=TRUE), ]. That said, there are several other methods out there depending on what exactly you're trying to accomplish with your analysis. One last note: just because we're talking about cleaning up your data here doesn't mean we've forgotten about checking its accuracy!
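In Pandas, dropping columns is a one-liner. The column names here are fabricated: an ID column with no predictive signal and a column that duplicates another:

```python
import pandas as pd

# Hypothetical dataset with an irrelevant ID column and a redundant copy
df = pd.DataFrame({
    "user_id": [101, 102, 103],
    "clicks": [5, 3, 8],
    "clicks_copy": [5, 3, 8],  # redundant duplicate of "clicks"
})

# Drop columns that carry no signal or merely duplicate another column
df = df.drop(columns=["user_id", "clicks_copy"])
```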
7) Create Dummy Variables From Categorical Features
Remember, a dummy variable is one that takes only two values: 0 or 1. If you're dealing with any kind of categorical data, transforming it into a series of binary features can be helpful for machine learning algorithms. This involves creating one dummy feature for each possible value a categorical feature can take on. For example, if your dataset includes gender as a feature (male/female), you could create two new indicator variables, gender_male and gender_female, each set to 1 when the row matches that category and 0 otherwise. You could then use these variables to train your model as if they were numerical features instead of categorical ones. Dummy variables are often used to encode binary outcomes (like whether an email was spam or not), but the same approach handles features with more than two categories. For example, if your dataset includes marital status (married/single/divorced), you would create three dummy variables, status_married, status_single, and status_divorced, with exactly one of them set to 1 for each row.
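Pandas builds these indicator columns automatically with get_dummies. This sketch uses a fabricated dataset with the two categorical features discussed above:

```python
import pandas as pd

# Hypothetical dataset with two categorical features
df = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "marital_status": ["married", "single", "divorced"],
})

# One 0/1 indicator column per category value:
# gender_female, gender_male, marital_status_divorced,
# marital_status_married, marital_status_single
dummies = pd.get_dummies(df, columns=["gender", "marital_status"])
```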
8) Create Binary Features From Continuous Features
Most machine learning algorithms require that your features be numerical values, or at least represented as numbers. (In Python's scikit-learn, categorical features are generally stored as integers, with no information about what each value means.) To convert your categorical features into numerical ones, you can create a one-hot representation for them. Essentially, these are lists of 0's and 1's indicating which category is present. For example, if you have three categories (Blueberries, Strawberries, Raspberries) in that order, the encoding [0, 1, 0] indicates that the value is Strawberries: exactly one position in the list is 1, and it marks the category that applies. Likewise, [1, 0, 0] means Blueberries and [0, 0, 1] means Raspberries; more categories simply make the list longer. A quick note on how to do this in Pandas: a single indicator can be built with df['is_blueberry'] = (df['Category'] == 'Blueberry').astype(int), or use pd.get_dummies(df['Category']) to build the full one-hot encoding at once. You may also want to normalize some continuous variables so they're on similar scales before creating binary features from them.
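As for the continuous case the heading describes, binarizing means picking a threshold and encoding which side of it each value falls on. The feature name and the 10-minute cutoff here are fabricated for illustration:

```python
import pandas as pd

# Hypothetical continuous feature: session length in minutes
df = pd.DataFrame({"session_minutes": [2.5, 14.0, 45.0, 7.5]})

# Binarize with a chosen threshold: 1 if the session ran 10 minutes or more
df["long_session"] = (df["session_minutes"] >= 10).astype(int)
```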
9) Impute Missing Data With Sequential Hot Decking Or Regression Trees
When you need to fill a dataset that has missing values, one common assumption is that the data are missing at random (MAR). MAR means the probability that a value is missing depends only on other, observed variables, not on the missing value itself, so you can use a probability model built on the observed data to predict the missing entries. Say, for example, we're imputing whether an adult lives in San Francisco. Dummy variables for gender, ethnicity, or income level may not be very predictive of where someone lives, but if we have other information, like zip code, we can impute missing locations with much higher accuracy. Sequential hot decking uses records with complete entries to make predictions about missing entries. The process starts by assigning each incomplete record its own hot deck: a group of complete "donor" records that resemble it on the fields that were observed. It then proceeds through the incomplete records sequentially, assigning each one imputed values drawn from its donors, and continues until every record has been filled in. Regression trees offer an alternative: train a tree on the complete records, using the feature with missing values as the target, and predict the missing entries from the other features.
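A minimal hot-deck sketch, assuming a fabricated survey dataset where each incomplete record borrows its value from the complete donor record closest on the observed field:

```python
import pandas as pd
import numpy as np

# Hypothetical survey data: income is missing for one respondent
df = pd.DataFrame({
    "age":    [25, 30, 28, 52],
    "income": [40000.0, 48000.0, np.nan, 90000.0],
})

# Donors are the records with complete entries
donors = df.dropna(subset=["income"])

# For each incomplete record, borrow the income of the donor
# closest in age (a single-field stand-in for the hot deck)
for i in df.index[df["income"].isna()]:
    nearest = (donors["age"] - df.loc[i, "age"]).abs().idxmin()
    df.loc[i, "income"] = donors.loc[nearest, "income"]
```

A real hot deck would match on several observed fields at once; this single-field version just illustrates the donor mechanism.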
10) Make An Ensemble Model Of Decision Trees, Random Forests, Gradient Boosting Machines, etc.
Before you build a machine learning model, there are some important questions to answer. Do you want to build a multi-class or binary classifier? How many features do you want your model to use? What is your labeling strategy going to be? Once those are settled, data-driven approaches like bagging and boosting can improve your results. Ensemble models work by combining multiple base models into one more powerful model that may perform better than any individual base model. For example, if you're trying to predict which customers will respond to an email campaign, you could train five different decision trees on five different bootstrap subsets of your data, then combine those five trees into one random forest that aggregates their outputs by majority vote. This way you get a more accurate prediction than any single tree would have given on its own.
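To show the bagging idea without any library dependencies, this sketch trains simple threshold "stumps" on bootstrap resamples of a fabricated 1-D dataset and combines them by majority vote, which is the core mechanism behind a random forest:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 1-D classification data: label is 1 when the feature exceeds 0
X = rng.normal(size=300)
y = (X > 0).astype(int)

# Bagging: fit one simple "stump" threshold per bootstrap sample
thresholds = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))  # bootstrap resample with replacement
    xs, ys = X[idx], y[idx]
    # crude stump: split at the midpoint between the two class means
    thresholds.append((xs[ys == 1].mean() + xs[ys == 0].mean()) / 2)

def ensemble_predict(x):
    votes = [int(x > t) for t in thresholds]  # each stump casts a vote
    return int(np.mean(votes) > 0.5)          # majority vote decides
```

Real random forests use full decision trees and also subsample features at each split, but the resample-then-vote structure is the same.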