Uniqtech Guide to Machine Learning with Scikit-learn sklearn

Code snippets and common patterns: Machine Learning with Scikit-Learn (sklearn)

Watch this overview tutorial on YouTube: Machine Learning with scikit-learn (sklearn) - Uniqtech Guide. Learn more about machine learning and the fit training pattern in scikit-learn.

Scikit-Learn Machine Learning Pattern (training loop using fit)

Pro tip: Common Scikit-Learn Pattern

Learning a new library can take a while, but sklearn follows a simple API pattern that is easy to use:
01 Initialize the model and store it in a variable called clf, short for classifier.
02 Call clf.fit(X, y) to train the model; X holds the features, y the labels.
03 Score the model's performance using .score().
04 Make predictions on new data with clf.predict(X2), where X2 is new feature data.
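
The four steps can be sketched as follows. This is a minimal illustration, not the only way to do it: the iris dataset, the choice of LogisticRegression, and the max_iter value are assumptions made for the example, and X2 is simply the first five rows of X standing in for new data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumed example data: the iris dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# 01 init the model and store it in clf
clf = LogisticRegression(max_iter=1000)

# 02 train the model: X for features, y for labels
clf.fit(X, y)

# 03 score the model (mean accuracy for classifiers)
accuracy = clf.score(X, y)

# 04 predict on new feature data (here: a stand-in slice of X)
X2 = X[:5]
predictions = clf.predict(X2)

print(accuracy)
print(predictions)
```

Scoring on the same data used for training, as done here for brevity, usually overstates performance; a held-out test set gives a more honest number.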

Watch our YouTube video walkthrough here: Scikit-learn machine learning cheat sheet - Uniqtech Guide on YouTube

Commonly Used Scikit-Learn Data Preprocessing Tools

sklearn.preprocessing.StandardScaler: "Standardize features by removing the mean and scaling to unit variance. The standard score of a sample x is calculated as: z = (x - u) / s" (official documentation, sklearn.preprocessing.StandardScaler). Here u is the mean of the training samples and s is their standard deviation, both computed per feature. The documentation seems busy, but it clearly gives the formula for this transformation. This step is called feature scaling, which is a part of feature engineering and a key step in data preprocessing. If you are following our Machine Learning Workflow graph, its role in the entire ML pipeline will become more obvious.
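
To make the formula concrete, here is a small sketch that compares StandardScaler with the same calculation done by hand in NumPy. The toy array is made up for illustration, and the variable names u and s simply mirror the symbols in the quoted formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up toy data: 3 samples, 2 features on very different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Let scikit-learn learn the mean and standard deviation, then transform.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# The same thing by hand: z = (x - u) / s, computed per column.
u = X.mean(axis=0)
s = X.std(axis=0)  # population standard deviation (ddof=0)
X_manual = (X - u) / s

print(X_scaled)
print(np.allclose(X_scaled, X_manual))
```

After scaling, each column has mean 0 and unit variance, so features measured in different units contribute on a comparable scale.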

Read more about data preprocessing on our Data Pre-Processing, Clean - Uniqtech topic page

What is imputation? Imputation explained.

Imputation is the method used to handle missing data by replacing it with another value calculated using a strategy. Pro tip: imputation is a common, important step of data preprocessing. A key parameter for imputers is the strategy.

What does imputation look like in practice? Let's explain it using the scikit-learn (sklearn) machine learning API. In scikit-learn, the code for imputation looks something like: imp = SimpleImputer(strategy="most_frequent")

Instead of dropping data points with missing values, we replace the missing entries with meaningful new values using a strategy. Depending on the data, the domain, and the method of data collection, mean or most_frequent may be the better choice. There are also other strategies. For example, if we choose the strategy mean to replace missing values in a particular column, we replace that column's missing data with the average, or mean, of the entire column (excluding the missing entries). More complex imputation strategies are available as well. Read more in the scikit-learn documentation: How to use scikit-learn : Imputation of missing values
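
As a small sketch of the two strategies discussed above, the example below fills the same toy array once with mean and once with most_frequent. The array values are invented for illustration, and missing entries are marked with np.nan, which is SimpleImputer's default missing-value marker.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented toy data with missing entries marked as np.nan.
X = np.array([[1.0, 2.0],
              [np.nan, 2.0],
              [3.0, np.nan],
              [5.0, 4.0]])

# Strategy "mean": each gap gets its column's mean,
# computed over the non-missing entries only.
mean_imp = SimpleImputer(strategy="mean")
X_mean = mean_imp.fit_transform(X)

# Strategy "most_frequent": each gap gets its column's most common value.
freq_imp = SimpleImputer(strategy="most_frequent")
X_freq = freq_imp.fit_transform(X)

print(X_mean)
print(X_freq)
```

In the first column the non-missing values are 1, 3 and 5, so the mean strategy fills the gap with 3. In the second column the value 2 appears twice, so most_frequent fills the gap with 2.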

Data preprocessing using Sklearn pipeline

We can chain data transformations together so that the output of one transformation is passed to the next transformation as its input. The result is a single pipeline that applies every step in order. scikit-learn provides a pipeline feature (sklearn.pipeline) to do this. Most machine learning libraries have a similar concept.
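
Here is a minimal sketch of such a chain using sklearn.pipeline.Pipeline. The particular steps (imputation, then scaling, then a classifier), the step names "impute", "scale" and "clf", and the toy data are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented toy data: two features, some missing values, binary labels.
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [np.nan, 180.0],
              [8.0, 900.0],
              [9.0, 950.0],
              [10.0, np.nan]])
y = np.array([0, 0, 0, 1, 1, 1])

# Each step's output becomes the next step's input.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fill missing values
    ("scale", StandardScaler()),                 # standardize features
    ("clf", LogisticRegression()),               # final estimator
])

# The pipeline follows the same fit / predict pattern as a single model.
pipe.fit(X, y)
predictions = pipe.predict(X)
print(predictions)
```

Because the pipeline exposes fit, predict and score itself, the preprocessing learned during fit is reapplied automatically whenever new data is passed to predict.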

Evaluating Models, Measuring Performance

The scikit-learn score method is called on a variable clf, short for classifier, to compare its predictions, clf.predict(X_test), against the "true" y_test. We should first train the classifier using the fit method, clf.fit(X_train, y_train), before scoring it with clf.score(X_test, y_test). For classifiers, score returns the mean accuracy.
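
The train-then-score sequence can be sketched as below. The iris dataset, the DecisionTreeClassifier, the 25% test split, and the random_state values are assumptions made for the example; the comparison with accuracy_score is included only to show what score computes for a classifier.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed example data, split into training and held-out test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train first, then score on data the model has not seen.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

# score() compares predicted labels with the true labels.
y_pred = clf.predict(X_test)
manual_accuracy = accuracy_score(y_test, y_pred)

print(test_accuracy)
print(test_accuracy == manual_accuracy)
```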

scikit-learn sklearn additional resources - Uniqtech Guide

Check out our additional tutorials: Scikit-learn machine learning cheat sheet - Uniqtech Guide on Medium, and Scikit-learn machine learning cheat sheet - Uniqtech Guide on YouTube.