Introduction: common data preprocessing tasks include vectorizing data (converting it into numeric values, because models expect numeric input), cleaning, removing noise, and handling missing values. The ultimate goal is to convert data into meaningful, high-quality features, so that models trained on them perform better and generalize well to real-world data.
The goal of preprocessing in machine learning is to convert raw data into numeric, high-quality, generalizable, usable features before feeding it into a model or algorithm.
Common data preprocessing tasks: formatting, de-noising / cleaning, encoding, vectorizing, scaling, and normalizing data.
Start by exploring the data with basic exploratory data analysis (EDA):
Calculate demographic data.
Calculate summary statistics.
Calculate and visualize the data distribution.
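The exploration steps above can be sketched with pandas. This is a minimal illustration, not a full EDA; the toy dataset and its column names (age, city) are made up for the example.

```python
import pandas as pd

# Hypothetical toy dataset; the column names are assumptions for illustration.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52],
    "city": ["Austin", "Boston", "Austin", "Denver", "Boston"],
})

# Summary statistics for a numeric column: count, mean, std, min, quartiles, max.
summary = df["age"].describe()

# Frequency of each category, a simple view of a categorical distribution.
city_counts = df["city"].value_counts()

print(summary)
print(city_counts)
```

For a visual check of the distribution, a histogram of the numeric column is a common next step.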
Preview data files from the command line using this code:
head filename.csv
Some data files are so large that opening them in a text editor may take a while, so printing only the first few lines is faster.
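A similar preview can be done from Python. The sketch below assumes pandas and uses an in-memory string as a stand-in for a large file so that it runs on its own; in practice you would pass a path such as filename.csv to read_csv.

```python
import io

import pandas as pd

# Stand-in for a large CSV on disk (an assumption so the example is self-contained).
csv_text = "id,score\n1,0.5\n2,0.7\n3,0.9\n4,0.2\n"

# nrows limits parsing to the first N data rows, so the whole file is not loaded.
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)

print(preview)
```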
In supervised learning, it is important to separate the features from the labels. Drop the target column from the dataset and store it in another variable. It is also important to remove any part of the dataset that may cause data leakage, such as columns that directly reveal the target.
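A minimal sketch of separating features and labels with pandas. The dataset and the target column name (price) are hypothetical.

```python
import pandas as pd

# Hypothetical housing data; "price" plays the role of the target column.
df = pd.DataFrame({
    "sqft": [850, 1200, 1500],
    "bedrooms": [2, 3, 4],
    "price": [200000, 310000, 405000],
})

y = df["price"]                 # labels, stored in their own variable
X = df.drop(columns=["price"])  # features only; drop returns a new DataFrame

print(X.columns.tolist())
```

Because drop returns a new DataFrame, the original df still contains the target column and can be reused.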
Generally, all data must be turned into numeric data before machine learning models can consume it. This process is called vectorizing data, or vectorization. For example, categorical data needs to be vectorized, or encoded, with methods such as one-hot encoding, which convert text-based categories into numeric data. Converting inputs into vectors is an important step. Encoding, in this case, refers to using numeric data to represent categorical data. Categorical data can be either binary or multi-class. Binary data is True or False, 1 or 0. Multi-class data can be cat, dog, or bird. Both binary and multi-class categorical data are discrete (not continuous; there is a fixed set of possible values). Some data, such as temperature, is continuous: it is numeric and can take any value along some scale, the opposite of discrete.
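To make one-hot encoding concrete, here is a hand-rolled sketch in plain Python using the cat / dog / bird example. In practice a library helper such as pandas.get_dummies or scikit-learn's OneHotEncoder would usually be used; the function name one_hot below is just for illustration.

```python
labels = ["cat", "dog", "bird", "dog"]

# Fix an ordering of the categories so each one gets a stable position.
categories = sorted(set(labels))  # ['bird', 'cat', 'dog']
index = {category: i for i, category in enumerate(categories)}


def one_hot(label):
    """Return a vector with a 1 at the label's position and 0 elsewhere."""
    vector = [0] * len(categories)
    vector[index[label]] = 1
    return vector


encoded = [one_hot(label) for label in labels]
print(encoded)
```

Each row has exactly one 1, so the encoding does not imply any ordering between categories.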
Common tasks in text processing: converting text to lower case using .lower(), replacing newline characters (\n) with spaces, and replacing double quotes with single quotes so strings can be safely enclosed. The Python .lower() method leaves punctuation and special characters unchanged and returns them without raising an error.
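The three text-cleaning steps above can be sketched as one small function. The function name clean_text and the sample string are made up for illustration.

```python
def clean_text(text):
    """Lowercase, flatten newlines to spaces, and swap double quotes for single."""
    text = text.lower()             # punctuation and digits pass through unchanged
    text = text.replace("\n", " ")  # newline characters become spaces
    text = text.replace('"', "'")   # double quotes become single quotes
    return text


raw = 'He said "Hello"\nWORLD!'
cleaned = clean_text(raw)
print(cleaned)
```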
Another common task is to vectorize text: turning text into vector or matrix representations, also called embeddings. We usually use pre-trained models to do this, because it is impractical and costly to collect text from across the internet and literature and then spend many hours of expensive compute to generate sensible word embeddings. Research groups at Google, Facebook, and elsewhere have already done this, so we should utilize their work unless we need to re-train for a custom solution. Famous word embedding models include word2vec (Google), fastText (Facebook), and GloVe (Stanford), though not all of these models are still state-of-the-art.
Identify missing values. Encode missing values (for example as NaN, recoding any special placeholder symbols into a consistent numeric or machine-readable marker). Replace missing values intelligently with an imputation strategy.
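Once missing values are encoded or handled, for example using numpy.nan, we can tally up the number of missing values. An example using pandas to tally Not-a-Number (NaN) values per column: pandas.DataFrame.isnull().sum(). We can also specify the axis to count per row and display only the first 10 results: df.isnull().sum(axis=1).head(10). (Note that .loc[:10] is label-based and inclusive, so with a default integer index it returns 11 rows.)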
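A small runnable sketch of tallying missing values with pandas. The DataFrame is a made-up example with NaN already encoded as numpy.nan.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries encoded as NaN.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# Number of missing values in each column (default axis=0).
missing_per_column = df.isnull().sum()

# Number of missing values in each row, showing at most the first 10 rows.
missing_per_row = df.isnull().sum(axis=1).head(10)

print(missing_per_column)
print(missing_per_row)
```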
Calculating the percentage of null values in columns and rows.
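One way to compute these percentages with pandas is to take the mean of the boolean null mask, since True counts as 1. The data below is a made-up example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "b" is mostly missing.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 8.0],
})

# Mean of a boolean mask is the fraction of True values; multiply by 100 for percent.
pct_missing_by_column = df.isnull().mean() * 100
pct_missing_by_row = df.isnull().mean(axis=1) * 100

print(pct_missing_by_column)
print(pct_missing_by_row)
```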
Handle columns and rows with a significant portion of missing values. For example, if a column has more than 30% missing values, it may be a good candidate to drop due to lack of data. The actual cut-off threshold depends on the data and domain knowledge. Check whether there are patterns among the columns or rows with missing values. Common patterns in missing values across columns could reflect survey design (for example, a follow-up question will always be blank if the leading question was answered no). They could also point to a serious gap or issue in the data collection process, or mean that certain users are not well represented.
Commonly used functions to fill NA / NaN values: .fillna(), for example .fillna(value=calc_median, inplace=True), where calc_median is a median computed beforehand. An aggressive way to handle columns or rows with too many missing values is to use .drop().
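The sketch below combines the two approaches: dropping columns above a missing-value threshold, then filling the remaining gaps with the median. The 30% threshold is the example figure from the text, and the column names are hypothetical. The result of fillna is assigned back rather than using inplace=True, which is a style choice here, not a requirement.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "mostly_empty" is 75% missing, "income" is 25% missing.
df = pd.DataFrame({
    "income": [40.0, np.nan, 60.0, 50.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})

threshold = 0.30  # example cut-off; the right value depends on the data and domain

# Drop columns whose fraction of missing values exceeds the threshold.
to_drop = df.columns[df.isnull().mean() > threshold]
df = df.drop(columns=to_drop)

# Impute the remaining gaps with the column median.
calc_median = df["income"].median()
df["income"] = df["income"].fillna(calc_median)

print(df)
```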
One example of standardization is rescaling data so that it has zero mean and a standard deviation of 1. To achieve this, subtract the mean from each data point and divide the result by the standard deviation of the entire dataset. This shifts and rescales the data but does not change the shape of its distribution; if the data was roughly normal to begin with, the result approximates a standard normal distribution. You might also see data centered using an approximate mean of 0.5 and a standard deviation of 0.5, which maps values in [0, 1] to [-1, 1]. You may have encountered this in PyTorch transformation examples. Check out the PyTorch transforms documentation, in the section on transforms on torch tensors, to read more about normalization and standardization.
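A minimal NumPy sketch of the standardization arithmetic described above. The sample values are made up. The second part reproduces only the arithmetic of normalizing with mean 0.5 and standard deviation 0.5, as seen in PyTorch examples; it does not call PyTorch itself.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = data.mean()
std = data.std()  # population standard deviation (ddof=0)

# Subtract the mean, then divide by the standard deviation.
standardized = (data - mean) / std

# Same formula with a fixed mean and std of 0.5: values in [0, 1] land in [-1, 1].
unit_range = np.array([0.0, 0.5, 1.0])
centered = (unit_range - 0.5) / 0.5

print(standardized)
print(centered)
```

After the transformation the data has mean 0 and standard deviation 1, but the relative spacing of the points is unchanged.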
Another example of normalization is dividing image pixel values by the maximum value, 255 (not 256, because the values run from 0 through 255). This makes the max value 1 and the min value 0, which makes computation easier and helps models converge faster (reach minimal error, or an optimal solution, sooner). It is also possible to standardize image pixels by subtracting the mean and dividing by the standard deviation, instead of the simple min-max scaling that divides by 255. For image pixels the min value is 0 and the range is 255 - 0 = 255. The min does not have to be zero; it can be anything for other datasets.
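The pixel scaling above is a special case of min-max scaling, (x - min) / (max - min). A short NumPy sketch follows; the helper name min_max_scale and the sample values are made up for illustration.

```python
import numpy as np

# Pixel case: min is 0 and max is 255, so scaling reduces to dividing by 255.
pixels = np.array([0, 64, 128, 255], dtype=np.uint8)
scaled_pixels = pixels.astype(np.float32) / 255.0


def min_max_scale(values):
    """General min-max scaling to [0, 1]; assumes max > min (not a constant array)."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())


# A dataset whose minimum is not zero.
scaled_temps = min_max_scale([10.0, 15.0, 20.0])

print(scaled_pixels)
print(scaled_temps)
```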
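These transformations rescale the image pixel values: min-max scaling maps them into the range [0, 1], while standardization centers them around zero with a standard deviation of 1. Neither makes the distribution Gaussian; only its location and scale change.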
A related concept is data transformation, which includes scaling, converting data types, and transforming or normalizing the data distribution. Related tools: Normalize, StandardScaler, MinMaxScaler.
Detect and identify outliers. Remove outliers if necessary (in some scenarios, outliers are the desired targets).
We can set a threshold for outliers, for example flagging any value that lies more than 2 to 3 standard deviations above the mean.
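A minimal sketch of the standard-deviation threshold with NumPy. The sample values and the choice of k = 2 are assumptions for illustration. The check here uses the absolute distance from the mean, so it also flags unusually low values, which goes slightly beyond the "above the mean" wording.

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 50], dtype=float)

mean = values.mean()
std = values.std()
k = 2  # number of standard deviations; 2 to 3 is a common range

# Flag points farther than k standard deviations from the mean, in either direction.
is_outlier = np.abs(values - mean) > k * std
outliers = values[is_outlier]
inliers = values[~is_outlier]

print(outliers)
```

Note that a large outlier inflates the mean and standard deviation it is compared against, so on small samples a stricter k can miss it.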
One useful visualization for outlier is ...todo insert flash card here
Feature Selection : decide which data files, or columns from the original dataset to include in the final training dataset.
Tackle the curse of dimensionality, for example with a low-dimensional projection.
It's possible to bundle all data preprocessing and cleaning tasks into a sequence of chained functions or into a machine learning library pipeline. Encoding, vectorization, scaling, feature selection, feature engineering... The pipeline will take an input, generate a transformed output, then pass the output as the input to the next transformation in the pipeline.
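The chaining idea can be sketched in plain Python, where each step is a function and the output of one becomes the input of the next. The step functions and the run_pipeline helper are hypothetical names for illustration; libraries such as scikit-learn provide a Pipeline class built on the same idea.

```python
from functools import reduce


def lowercase(texts):
    return [text.lower() for text in texts]


def strip_whitespace(texts):
    return [text.strip() for text in texts]


def drop_empty(texts):
    return [text for text in texts if text]


def run_pipeline(data, steps):
    """Apply each step in order, feeding each output into the next step."""
    return reduce(lambda current, step: step(current), steps, data)


steps = [lowercase, strip_whitespace, drop_empty]
result = run_pipeline(["  Hello ", "WORLD", "   "], steps)
print(result)
```

Keeping the steps in a list makes it easy to reorder them, add new ones, or reuse the same sequence on training and test data.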
See our Medium article on class imbalance and when it matters. Paid members can request free copies.