Notes from Building Computer Vision Applications Using Artificial Neural Networks - Chapter 6
Object Detection
- Classification & Localization (What and Where)
Two different variations of Convolutional Neural Networks
- Two-step convolutions
- A region based CNN or R-CNN is a two-step algorithm
- Single-step convolutions
- YOLO
- Single-shot detection
Intersection Over Union (IoU)
- Also known as Jaccard Index
- Bounding boxes in training set are ‘ground truth’
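A minimal sketch of the IoU calculation, assuming boxes are given as `(x_min, y_min, x_max, y_max)` tuples. The function name `iou` and the box layout are my own choices for illustration, not taken from the chapter.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (if any)
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped to zero when the boxes do not overlap
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 2, 2)
predicted = (1, 1, 3, 3)
print(iou(ground_truth, predicted))  # overlap area 1, union area 7
```

An IoU of 1.0 means the predicted box matches the ground truth exactly, and 0.0 means there is no overlap.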
Region-Based CNN (R-CNN)
Contains three modules:
- Region Proposal: First finds regions that might contain objects
- Feature Extraction: The region proposals are cropped out, resized, and fed to a standard CNN for feature extraction.
- Classifier: Extracted features are classified using a standard algorithm, e.g. a linear SVM
- There are serious performance issues:
- Each proposed region is passed through the CNN separately, approx. 2000 regions per image
- Three different models need to be trained: the CNN, the classifier, and the bounding-box regressor
- A prediction has to be made for every region, which is slow
Fast R-CNN
- Same idea as R-CNN, but it doesn’t crop and resize each region proposal before the CNN. It runs the CNN once on the whole image and takes each region’s features from the shared feature map.
Faster R-CNN
- Except for the faster region proposal method, architecturally similar to Fast R-CNN
A Faster R-CNN architecture consists of a region proposal network (RPN) that shares the full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals.
Region Proposal Network
- Deep Convolutional Neural Network
- Takes an image as input and outputs a set of rectangular object proposals
- Each rectangular proposal has an ‘objectness’ score.
Mask R-CNN
- Extension of Faster R-CNN
- Adds the ability to predict an object mask along with the object class and bounding box coordinates.
- Faster R-CNN vs. Mask R-CNN:
- Faster R-CNN has two outputs: class label, bounding box coordinates
- Mask R-CNN has three outputs: class label, bounding box coordinates, object mask
Feature Pyramid Network
- Add-on to the backbone network
- Each higher layer passes features to the lower layers, and predictions are made at each layer.
- Helps to detect smaller objects in an image, which are otherwise lost as the feature map size decreases in deeper layers
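A rough sketch of the "higher layers pass features to lower layers" idea, using plain nested lists as stand-ins for single-channel feature maps. The helper names (`upsample2x`, `add_maps`, `top_down`) are hypothetical, and the sketch omits the convolution layers a real FPN applies, so it only shows the upsample-and-merge data flow.

```python
def upsample2x(fm):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat the row vertically
    return out


def add_maps(a, b):
    """Element-wise sum of two feature maps with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down(pyramid):
    """Merge backbone maps ordered fine -> coarse, each half the size of the previous.

    Returns the merged maps in the same order; a prediction head would be
    attached to every one of them.
    """
    merged = [pyramid[-1]]  # the coarsest map is used as-is
    for lateral in reversed(pyramid[:-1]):
        merged.append(add_maps(lateral, upsample2x(merged[-1])))
    return merged[::-1]


# Toy backbone outputs: 4x4, 2x2 and 1x1 feature maps
c3 = [[1] * 4 for _ in range(4)]
c4 = [[10] * 2 for _ in range(2)]
c5 = [[100]]
p3, p4, p5 = top_down([c3, c4, c5])
print(p3[0])  # the finest map now also carries information from c4 and c5
```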
What’s the point of the masks?
Like Faster R-CNN, Mask R-CNN generates object classes and bounding boxes. The combination of these two helps us locate the objects within the image. The mask additionally marks which pixels inside each bounding box belong to the object.
Single-Shot Multibox Detection (SSD)
- R-CNN and its variants are two-stage detectors with two networks:
- One generates the region proposals and bounding boxes
- One predicts the object classes
- Because of this, they are computationally expensive
- A single-shot object detector predicts both in one pass of the network
- An SSD is robust to various input object sizes and shapes
SSD Network Architecture
- Base Network: A deep CNN used for feature extraction
- Detection Network: Attach some extra convolution layers to the base network to do the actual prediction of bounding boxes and object classes.
Anchor Boxes
Anchors are one or more rectangular shapes set at each convolution point of the feature map.
- In SSD, typically five anchor boxes per point
- Each box acts as a detector
- Therefore, five detectors at each location of the feature map
- So each location can detect up to five different objects
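A small sketch of how anchor boxes could be laid out over a feature map, assuming a square map, coordinates normalized to the 0..1 range, and five aspect ratios per location. The function name, the scale value, and the exact ratios are illustrative assumptions, not the chapter's settings.

```python
import math


def anchors_for_feature_map(fm_size, scale, aspect_ratios=(1.0, 2.0, 3.0, 0.5, 1 / 3)):
    """Return (cx, cy, w, h) anchors, one per aspect ratio at every feature-map cell."""
    anchors = []
    for row in range(fm_size):
        for col in range(fm_size):
            # Centre of this cell in normalized image coordinates
            cx = (col + 0.5) / fm_size
            cy = (row + 0.5) / fm_size
            for ar in aspect_ratios:
                # Same area for every ratio; only the shape changes
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                anchors.append((cx, cy, w, h))
    return anchors


anchors = anchors_for_feature_map(fm_size=4, scale=0.2)
print(len(anchors))  # 4 x 4 locations x 5 anchors each
```

Every anchor at a given location shares the same centre, so the five detectors differ only in the box shape they start from.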
// Then the chapter goes through some examples talking about aspect ratios and training
YOLO
[[Paper - YOLO Unified, Real-Time Object Detection (2016)]]
Detection Process:
- Input image is divided into an S x S grid
- If the center of an object falls within a grid cell, that cell is responsible for detecting that object
- Each grid cell predicts B bounding boxes, each with a confidence score
- Confidence score is calculated as:
- Confidence score = P(Object) x IoU (predicted box and ground truth box)
- For each bounding box, the network makes 5 predictions (x, y, w, h, confidence)
- At the same time, the network predicts, for each grid cell, C conditional class probabilities P(Class | Object), one per class
- Class-specific confidence score:
- Class confidence score = P(Class | Object) x P(Object) x IoU (predicted and ground truth)
- Predictions are encoded as an S x S x (B x 5 + C) tensor
- There are now YOLO versions 1, 2, and 3
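A minimal sketch of three pieces of the detection process above: finding the responsible grid cell, the size of the output tensor, and the class-specific confidence score. All function names are hypothetical. The example values S=7, B=2, C=20 are the ones used in the original YOLO paper for PASCAL VOC; the image size and probabilities are made up for illustration.

```python
def responsible_cell(cx, cy, img_w, img_h, S):
    """Return (row, col) of the grid cell that contains the object centre (cx, cy)."""
    col = min(int(cx / img_w * S), S - 1)  # clamp in case the centre lies on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp in case the centre lies on the bottom edge
    return row, col


def output_shape(S, B, C):
    """Each cell holds B boxes x (x, y, w, h, confidence) plus C class probabilities."""
    return (S, S, B * 5 + C)


def class_confidence(p_class_given_object, box_confidence):
    """P(Class | Object) x [P(Object) x IoU], where the bracket is the box confidence."""
    return p_class_given_object * box_confidence


print(output_shape(S=7, B=2, C=20))
print(responsible_cell(cx=224, cy=100, img_w=448, img_h=448, S=7))
print(class_confidence(p_class_given_object=0.8, box_confidence=0.5))
```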
Comparison of Performance
The chapter ends with actual code examples using [[Python]]