Watch this overview tutorial on YouTube: Machine learning with scikit-learn (sklearn) - Uniqtech Guide.
Learn more about the fit training pattern in scikit-learn: Scikit-Learn Machine Learning Pattern (training loop using fit). Learn more about the .fit() API in scikit-learn [pro member]
Learning a new library can take a while, but sklearn follows a simple, consistent API pattern that makes it easy to use.
01 Initialize the model and store it in a variable called clf, short for classifier.
02 Call clf.fit(X, y) to train the model, where X holds the features and y the labels.
03 Score the model's performance using .score().
04 Make predictions on new data with clf.predict(X2), where X2 holds the new feature data.
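The four steps above can be sketched as follows. This is a minimal illustration, not the only way to do it: the iris dataset, the LogisticRegression model, the train/test split, and the choice of the first five test rows as X2 are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed example data: the iris dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 01 init the model and store it in clf
clf = LogisticRegression(max_iter=1000)

# 02 train the model on features X and labels y
clf.fit(X_train, y_train)

# 03 score the model on held-out data (mean accuracy for classifiers)
accuracy = clf.score(X_test, y_test)

# 04 predict on new feature data; the first five test rows stand in for X2
X2 = X_test[:5]
predictions = clf.predict(X2)

print(accuracy)
print(predictions)
```

Any other scikit-learn classifier can be swapped in for LogisticRegression without changing steps 02 to 04, which is the point of the pattern.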
Watch our YouTube video walkthrough here: Scikit-learn machine learning cheat sheet - Uniqtech Guide on YouTube
sklearn.preprocessing.StandardScaler: "Standardize features by removing the mean and scaling to unit variance. The standard score of a sample x is calculated as: z = (x - u) / s" (official documentation for sklearn.preprocessing.StandardScaler). Here u is the mean of the training samples and s is their standard deviation. The documentation page looks busy, but it states the formula for this transformation clearly. This step is called feature scaling, which is part of feature engineering and a key step in data preprocessing. If you are following our Machine Learning Workflow graph, its role in the overall ML pipeline will become more obvious.
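To make the formula concrete, here is a small sketch that compares StandardScaler's output with z = (x - u) / s computed by hand. The toy array is made up for illustration. Note that the hand computation uses the population standard deviation (NumPy's default, ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 4 samples, 2 features on very different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # learn u and s per column, then transform

# The same transformation by hand: z = (x - u) / s, column by column.
u = X.mean(axis=0)
s = X.std(axis=0)  # population standard deviation (ddof=0)
X_manual = (X - u) / s

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, each column has mean 0 and unit variance, so features measured on different scales become comparable.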
Read more about data preprocessing on our Data Pre-Processing, Clean - Uniqtech topic page
Imputation is the process of handling missing values in a dataset by replacing them with another value calculated using a strategy. It is a common, important step of data preprocessing. Pro tip: a key parameter for imputers is the strategy.
What does imputation look like in practice? Let's explain it using the scikit-learn (sklearn) machine learning API. In scikit-learn, with SimpleImputer imported from sklearn.impute, the code for imputation looks something like: imp = SimpleImputer(strategy="most_frequent")
Instead of dropping data points with missing values, we replace the missing entries with meaningful new values chosen by a strategy. Depending on the data, the domain, and the method of data collection, mean or most_frequent may be the better choice. For example, if we choose the mean strategy for a particular column, that column's missing entries are replaced with the mean of the column's observed values (the missing entries are excluded from the calculation). Other strategies, including more complex ones, are also available. Read more in the scikit-learn documentation: How to use scikit-learn: Imputation of missing values
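As a small sketch of the two strategies discussed here, the example below imputes the same toy array with mean and with most_frequent. The array values are made up for illustration, with missing entries marked as np.nan.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with one missing entry in each column.
X = np.array([[1.0, 2.0],
              [np.nan, 2.0],
              [1.0, np.nan],
              [4.0, 5.0]])

# strategy="mean": each gap gets the mean of its column's observed values.
mean_imp = SimpleImputer(strategy="mean")
X_mean = mean_imp.fit_transform(X)

# strategy="most_frequent": each gap gets its column's most common value.
freq_imp = SimpleImputer(strategy="most_frequent")
X_freq = freq_imp.fit_transform(X)

print(X_mean[1, 0], X_mean[2, 1])  # 2.0 3.0  (means of [1, 1, 4] and [2, 2, 5])
print(X_freq[1, 0], X_freq[2, 1])  # 1.0 2.0  (modes of the same columns)
```

The two strategies fill the same gaps with different values, which is why the choice should follow from the data and how it was collected.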
We can chain data transformations together so that the output of one transformation is passed as the input to the next. The chain as a whole is called a pipeline. scikit-learn provides a Pipeline feature (sklearn.pipeline.Pipeline) for this, and most machine learning libraries have a similar concept.
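A hedged sketch of such a chain, combining the imputation and scaling steps from earlier with a classifier. The step names ("impute", "scale", "clf"), the toy data, and the choice of LogisticRegression are assumptions made for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on different scales, with some missing entries.
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [np.nan, 180.0],
              [8.0, 900.0],
              [9.0, np.nan],
              [10.0, 950.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Each step's output becomes the next step's input; the last step is the model.
pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# The pipeline follows the same fit / predict pattern as a single estimator.
pipe.fit(X, y)
predictions = pipe.predict(X)
print(predictions)
```

Because the pipeline is itself an estimator, the imputer and scaler are fitted on the training data only and then reapplied unchanged at prediction time.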
The scikit-learn score method is called on a fitted estimator, here a variable clf (short for classifier), to compare the labels it predicts with clf.predict(X_test) against the "true" labels y_test. For classifiers, score returns the mean accuracy on the given test data.
We should first train the classifier using the fit method, clf.fit(X_train, y_train), before scoring it:
clf.score(X_test, y_test)
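To show what the score is measuring, the sketch below checks that, for a classifier, clf.score(X_test, y_test) matches the accuracy computed from clf.predict(X_test) and y_test. The iris dataset, the DecisionTreeClassifier, and the split parameters are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed example data and model.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)  # train first, then score

score = clf.score(X_test, y_test)

# The same number, computed explicitly from predicted vs. true labels.
y_pred = clf.predict(X_test)
manual_accuracy = accuracy_score(y_test, y_pred)

print(score, manual_accuracy)
```

Note that this equivalence holds for classifiers. Regressors in scikit-learn also have a score method, but it reports the R^2 coefficient of determination instead of accuracy.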
Check out our additional tutorials: Scikit-learn machine learning cheat sheet - Uniqtech Guide on Medium and on YouTube.