Computer Vision - Uniqtech Guide

Image Basics

Each pixel in a color image has three channels: red, green, and blue (RGB). Each channel value is an 8-bit unsigned integer (uint8) ranging from 0 to 255, which gives 2**8 = 256 possible values (0 through 2**8 - 1). RGB is an encoding of color. Grayscale images have only one channel.
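Here is a minimal sketch of the pixel encoding described above, using plain Python lists instead of an image library. The tiny images and the helper name is_valid_uint8 are made up for illustration.

```python
# A tiny 2x2 color image: each pixel is an (R, G, B) triple of 8-bit values.
color_image = [
    [(255, 0, 0), (0, 255, 0)],      # red, green
    [(0, 0, 255), (255, 255, 255)],  # blue, white
]

# A grayscale image of the same size: one intensity value per pixel.
gray_image = [
    [0, 128],
    [200, 255],
]

def is_valid_uint8(value):
    """Return True if value fits in an unsigned 8-bit integer (0..255)."""
    return isinstance(value, int) and 0 <= value <= 2**8 - 1

channels_per_color_pixel = len(color_image[0][0])
all_valid = all(
    is_valid_uint8(v) for row in color_image for pixel in row for v in pixel
)
print(channels_per_color_pixel, all_valid)  # 3 True
```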

Understand pixel and red green blue (RGB) basics.

Best visualization of Image Matrix math [PRO]

Convert images to vectors: In computer vision, the training dataset can be a big matrix of numbers. Each row represents one image as a flattened vector. These vectors stack up into a matrix. To display an image vector, you will want to reshape it into a square or rectangle based on its width and height. Cropping or resizing the images to the same size makes computation easier, and it is also required to fit them into the same matrix: the number of columns (pixels) needs to match. See this flash card for what an MNIST handwritten digit image looks like.
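The flatten, stack, and reshape steps can be sketched in plain Python as follows. The two small images and the function names are hypothetical. In practice you would likely use an array library, but the idea is the same.

```python
def flatten(image):
    """Turn a 2D image (list of rows) into one flat row vector."""
    return [pixel for row in image for pixel in row]

def reshape(vector, height, width):
    """Rebuild a height x width image from a flat vector."""
    if len(vector) != height * width:
        raise ValueError("vector length must equal height * width")
    return [vector[r * width:(r + 1) * width] for r in range(height)]

# Two hypothetical 2x3 grayscale images (same size, so they fit one matrix).
image_a = [[0, 10, 20], [30, 40, 50]]
image_b = [[5, 15, 25], [35, 45, 55]]

# Each row of the dataset matrix is one flattened image.
dataset = [flatten(image_a), flatten(image_b)]
print(dataset[0])           # [0, 10, 20, 30, 40, 50]

# To display an image again, reshape its row back to height x width.
restored = reshape(dataset[1], height=2, width=3)
print(restored == image_b)  # True
```

For MNIST, whose images are 28x28 pixels, each row of the matrix would have 784 columns.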

Common Developer Tools, Libraries

OpenCV, Pillow (Python), torchvision (PyTorch)

Tasks: Gathering Training Data, Generating Training Data

What does training data look like in image classification tasks?

Data augmentation: images are horizontally flipped, randomly rotated, or cropped to create additional training images with variations. This generates more data and helps reduce overfitting.
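Below is a rough sketch of two of these augmentations, horizontal flip and random crop, on an image stored as a list of rows. Rotation is left out because arbitrary angles need interpolation. The function names, the 50% flip probability, and the crop size are assumptions for illustration, not a recommended recipe.

```python
import random

def horizontal_flip(image):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def augment(image, crop_h, crop_w, rng):
    """Return one randomly flipped and cropped variant of the image."""
    if rng.random() < 0.5:                  # flip about half of the time
        image = horizontal_flip(image)
    return random_crop(image, crop_h, crop_w, rng)

image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
rng = random.Random(0)  # seeded so the run is repeatable
variants = [augment(image, 2, 3, rng) for _ in range(4)]
print(horizontal_flip(image)[0])  # [4, 3, 2, 1]
```

Each call to augment produces a 2x3 variant of the same source image, so one labeled image yields several slightly different training examples.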

A cool pro tip to generate image training data [PRO]

Convolutional Neural Networks (CNN)

Convolutional Neural Networks became popular around 2012, when AlexNet (designed by Alex Krizhevsky in collaboration with Ilya Sutskever and Geoffrey Hinton, who was Krizhevsky's Ph.D. advisor) showed a significant performance improvement over prior competitors in the ImageNet competition. The ImageNet dataset was pioneered by Fei-Fei Li, an AI professor and leader at Stanford. Geoffrey Hinton is a famous AI academic leader.

The fundamental concept of a convolution is sliding a small filter (a small matrix of weights) across an image to identify shapes, features, and even objects.

This small filter is also called a kernel. Some common kernel sizes are 3x3, 5x5, and 7x7.

There are visual simulations of the convolution process. You can also think of this analogy: the input picture is hanging on the wall in a dark room. We shine a flashlight that illuminates only a small part of the image at a time, do some calculations on what we see, and note down the result. The area the flashlight makes visible is the size of our kernel/filter. We do that for each patch of the image until the full image has been covered. Instead of the original image, we now have only our calculations, which roughly correspond to the original locations of the image pixels. We end up with a matrix representation of the image. There is usually a change in size/dimensions, typically a reduction. We can also use special flashlights to discover special features: a UV light, for example, reveals only what reflects UV light. This is an analogy and not a perfect description of CNN filters.
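The flashlight analogy can be sketched as a plain-Python loop: move a small kernel over every position where it fits inside the image, multiply and sum, and write the result down. This is a simplified illustration (stride 1, no padding, no kernel flip, which is how deep learning libraries usually compute it). The kernel values are hand-picked for the example, whereas a CNN learns them during training.

```python
def convolve2d(image, kernel):
    """Slide the kernel over every position where it fits entirely inside
    the image (stride 1, no padding) and record the sum of products."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1
    output = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# A 4x4 image that is dark (0) on the left and bright (9) on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# A hand-picked 3x3 kernel that responds to left-to-right brightness increases.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # [[27, 27], [27, 27]]
```

Note the reduction in size: a 4x4 image and a 3x3 kernel give a 2x2 result, because the kernel only fits in 4 - 3 + 1 = 2 positions along each axis.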

CNN building blocks: convolution, non-linear activation, pooling (subsampling, e.g. max pooling), fully connected layer

The field of explaining what each filter or layer learns is known as machine learning layer visualization, explainable AI, or activation visualization: “which neurons, which parts of the network light up”.
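Two of these building blocks are easy to sketch without any library: a non-linear activation (ReLU is used here as one common choice, an assumption since the text does not name one) and 2x2 max pooling. The input feature map is made up for illustration.

```python
def relu(feature_map):
    """Non-linear activation: replace negative values with 0."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, height - size + 1, size):
        row = []
        for left in range(0, width - size + 1, size):
            window = [
                feature_map[top + i][left + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled

feature_map = [
    [-3, 1, 4, -1],
    [2, -5, 0, 6],
    [7, -2, -8, -4],
    [0, 3, -1, -6],
]

activated = relu(feature_map)
pooled = max_pool(activated, size=2)
print(pooled)  # [[2, 6], [7, 0]]
```

Pooling is the subsampling step: the 4x4 map shrinks to 2x2 while keeping the strongest response in each region.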

Computer Vision Basics | Computer Vision 101

Convolutional Neural Networks (CNN) and Computer Vision 101 fundamentals Part 1 by Uniqtech

FashionMNIST is the "hello world" dataset of computer vision, an introductory dataset for the field. "We’ll use the FashionMNIST dataset to train a neural network that predicts if an input image belongs to one of the following classes: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, or Ankle boot." - Fashion MNIST. It is more sophisticated than its predecessor, MNIST handwritten digits 0 through 9.

FashionMNIST is a great introductory dataset for CNNs. Check out the flash card here. It is well suited to learning the ropes of a vanilla CNN model, testing out a CNN architecture or a variation of one, and trying pre-trained computer vision models such as VGG on an image classification task. The "hot dog, not hot dog" project in HBO’s Silicon Valley can be built with a CNN model. The earlier hello world dataset is MNIST handwritten digits, with 10 classes/labels ranging from 0 to 9.

Message us and ask us about the best way to visualize convolutional neural networks.

VGG [architecture]

Know your VGG - Uniqtech Guide

VGG was once a very popular computer vision architecture. It is still very useful for understanding CV and for image classification tasks. It's easy to set up transfer learning using VGG.

Fun Fact: Karen Simonyan's team called their neural network VGG in their paper published in 2014, after their successful ImageNet competition submission. You can find this paper on arXiv: Simonyan VGG paper on arXiv 2014.

A little bit of history: before machine learning and deep learning (neural networks) came along, optical character recognition (OCR) was the paradigm used to recognize digits and to capture digits and letters on license plates. OCR technology was also used to scan documents.

Computer Vision Models and Tools

Modern computer vision models and tools.

Famous computer vision models and performances [pro, flashcard]

Edge detection

Facial Recognition

Background:

Common computer vision tasks: classification, object detection, facial recognition, image analysis, deepfake identification.

Facial recognition is another famous computer vision task.

Facial recognition is an important machine learning task. Facial Recognition Technology (FRT) is the formal name for facial recognition tech, both software and hardware.

Ethical issues: the technology is controversial. There are privacy concerns, and false positive errors have led to false arrests. A common training data problem is imbalanced data: minorities and women have lower representation in readily available data.

Deepfake technology can be used to train models. The website This Person Does Not Exist uses a generative adversarial network (GAN) to generate extremely high-definition deepfake images. As the name implies, the people shown do not exist.

There's a facial recognition library called Dlib (mentioning a technology does not mean endorsement). Here's a Python example of Dlib: Dlib Github. We have not tried this tech, but have heard of it.

Data encoding for facial recognition tasks

Important knowledge for facial data encoding [PRO high quality]

How to plot facial data? [PRO high quality]

Facial image dataset [public]

Bounding Box

Understand bounding box math [PRO only]

Understand Jaccard, Bounding Box Overlap [BASIC, Public]
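As a quick taste of the math, here is a minimal sketch of the Jaccard index (intersection over union, IoU) for two axis-aligned bounding boxes. The (x_min, y_min, x_max, y_max) corner format is an assumption for this example, and other tools use different box formats.

```python
def iou(box_a, box_b):
    """Jaccard index (intersection over union) of two axis-aligned boxes.
    Boxes are (x_min, y_min, x_max, y_max); this format is an assumption."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; clamp to 0 when the boxes are apart.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.1429
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 (identical boxes)
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0 (no overlap)
```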

Computer Vision with PyTorch

PyTorch computer vision landing page, visit it here

For computer vision tasks, there are a few commonly used PyTorch modules:

    from torchvision import datasets, models, transforms  # vision datasets, architectures & transforms
    import torchvision.transforms as transforms  # composable transforms

Source: torchvision documentation.

PyTorch, Torchvision Computer Vision Building Blocks, Components You Should Know

Transfer Learning PyTorch Torchvision Model

Here's Uniqtech's Medium article on how to download a torchvision model and use it for transfer learning. Transfer learning is important for innovation in computer vision. It's expensive to collect high quality data to train neural networks. It is acceptable to use transfer learning: download pre-trained image models and fine-tune them instead of starting from scratch.

Transfer Learning with PyTorch Code Snippet: Load a Pretrained Model

Transfer learning with PyTorch, visit it here

Additional Resources

Vision Kit by TensorFlow, Google. Build a smart camera project using Google's kit and TensorFlow.

History of Computer Vision: AlexNet (2012) is widely regarded as the first modern convolutional neural network (CNN). The first modern benchmark competition for image classification and computer vision is ImageNet. The competition, started by Fei-Fei Li (Stanford), asks competing teams to predict the top-k (e.g. top-5) classes each image depicts. Predictions are drawn from 1000 classes (labels) of images. These categories include dogs, cats, etc.

Pretrained models may expect images whose mean and variance look similar to those of ImageNet images, so it is common to normalize images using summary statistics of ImageNet images. Pro tip: if you see numbers like the ones below, there's an attempt to normalize the current image data to conform with the ImageNet mean and standard deviation. They appear in the official PyTorch documentation, e.g. for VGG-16.

    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

Even though models are performing better and better, ImageNet still plays an important role in CV. The idea also inspired similar datasets that are driving innovation in other disciplines, e.g. there's now a medical ImageNet.
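To see what those numbers do, here is a small plain-Python sketch of the arithmetic on a single pixel. It assumes the usual torchvision order of operations: pixel values are first scaled from 0..255 to 0..1, then each channel has the mean subtracted and is divided by the standard deviation. The function name normalize_pixel is made up for this example.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]  # per-channel mean (R, G, B)
IMAGENET_STD = [0.229, 0.224, 0.225]   # per-channel standard deviation

def normalize_pixel(rgb):
    """Scale a uint8 RGB pixel to [0, 1], then standardize each channel."""
    return [
        (value / 255 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    ]

black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print([round(v, 3) for v in black])
print([round(v, 3) for v in white])
```

After normalization, a pixel near the ImageNet average color lands close to 0 in every channel, which is the input range the pretrained weights were trained on.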

Top-1 error, top-5 error, model benchmarking
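A minimal sketch of how top-k error can be computed: a prediction counts as correct if the true label is among the k highest-scoring classes. The scores and labels below are made up for illustration.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is NOT among the k highest scores."""
    misses = 0
    for example_scores, label in zip(scores, labels):
        # Class indices sorted from highest score to lowest.
        ranked = sorted(
            range(len(example_scores)),
            key=lambda c: example_scores[c],
            reverse=True,
        )
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)

# Hypothetical scores for 3 images over 4 classes.
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1: correct at k = 1
    [0.5, 0.1, 0.3, 0.1],  # true class 2: wrong at k = 1, correct at k = 2
    [0.4, 0.3, 0.2, 0.1],  # true class 3: wrong at k = 1 and k = 2
]
labels = [1, 2, 3]

top1 = top_k_error(scores, labels, k=1)  # 2 of 3 images missed
top2 = top_k_error(scores, labels, k=2)  # 1 of 3 images missed
print(top1, top2)
```

Top-5 error on ImageNet works the same way with k = 5 over 1000 classes.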

Computer vision PyTorch benchmark: Stanford DAWN Benchmark

Stanford computer vision class: Stanford University CS231n: Deep Learning for Computer Vision (Fei-Fei Li, Justin Johnson, Serena Yeung)

There are many computer vision models. To learn more about them, you can look for computer vision models in PyTorch or in computer vision model zoos.