YOLO Object Detection Notes

You Only Look Once

PDF Web Link

YOLO Project Link

Abstract

(…) we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.

  • Real-time processing at 45 FPS
  • Generalizes well but makes more localization errors than other detectors

Introduction

  • Current methods repurpose classifiers for detection: a classifier for each object is evaluated at many locations and scales in the frame.
  • Current methods are slow and hard to optimize because each component must be trained separately.

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

  • Fast, no complex pipeline, sees the entire image at once.
  • A single neural network runs once on each image.
  • More than twice the mean average precision of other real-time systems.

Unified Detection

  • Divides the image into a grid; each grid cell predicts bounding boxes, a confidence score for each box, and class probabilities.
  • Each box has an x-coordinate, y-coordinate, width, height, and confidence.
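
A minimal sketch of how one cell's output could be read back into boxes. The values S = 7, B = 2, C = 20 are the paper's PASCAL VOC settings and are not stated in these notes; the tensor layout (boxes first, then class probabilities) and the helper name decode_cell are assumptions made for illustration only.

```python
import numpy as np

# Assumed settings (paper's PASCAL VOC configuration, not from these notes)
S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes


def decode_cell(pred, row, col):
    """Read the B boxes and C class probabilities predicted by one grid cell.

    pred has shape (S, S, B * 5 + C). The layout is an assumption:
    B groups of (x, y, w, h, confidence) followed by C class probabilities.
    x, y are offsets inside the cell; w, h are relative to the whole image.
    """
    boxes = []
    for b in range(B):
        x, y, w, h, conf = pred[row, col, b * 5:(b + 1) * 5]
        # Convert the in-cell offset to a position relative to the whole image
        x_img = (col + x) / S
        y_img = (row + y) / S
        boxes.append((x_img, y_img, w, h, conf))
    class_probs = pred[row, col, B * 5:]
    return boxes, class_probs


# Hypothetical prediction: one confident box centered in cell (row 3, col 4)
pred = np.zeros((S, S, B * 5 + C))
pred[3, 4, 0:5] = [0.5, 0.5, 0.2, 0.4, 0.9]
boxes, class_probs = decode_cell(pred, 3, 4)
```

With these assumed settings the full prediction is a 7 x 7 x 30 tensor, since each cell carries 2 * 5 box values plus 20 class probabilities.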

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

YOLO Architecture

Our final layer predicts both class probabilities and bounding box coordinates.

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
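
The "responsible" predictor rule above can be sketched as follows. This is an illustration, not the authors' training code: boxes are assumed to be (x_center, y_center, width, height) tuples, and the function names iou and responsible_predictor are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_center, y_center, width, height) boxes."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Overlap is zero when the boxes do not intersect along an axis
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def responsible_predictor(predicted_boxes, truth_box):
    """Index of the predicted box with the highest IOU against the ground truth."""
    ious = [iou(p, truth_box) for p in predicted_boxes]
    return max(range(len(ious)), key=ious.__getitem__)


# Hypothetical cell with two predictors: one tall and thin, one close to the truth
truth = (0.5, 0.5, 0.4, 0.4)
predictions = [(0.5, 0.5, 0.2, 0.8), (0.5, 0.5, 0.4, 0.5)]
chosen = responsible_predictor(predictions, truth)
```

Here the second predictor overlaps the ground truth more, so it would be the one assigned to the object for this training step.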

Limitations of YOLO

  • Because each grid cell predicts only two bounding boxes and can have only one class, there are ‘strong spatial constraints’. It struggles with small objects that appear in groups, such as flocks of birds.
  • Main source of error is incorrect localization.
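
A rough sketch of why grouped small objects are a problem: nearby object centers land in the same grid cell and compete for that cell's few predictions. The grid size S = 7 is the paper's setting and an assumption here; cell_of and crowded_cells are hypothetical helper names, and the coordinates are made up.

```python
from collections import defaultdict

S = 7  # assumed grid size (paper's setting, not stated in these notes)


def cell_of(x, y, s=S):
    """Grid cell (row, col) containing a center given in image-relative [0, 1] units."""
    col = min(int(x * s), s - 1)  # clamp so x == 1.0 stays inside the grid
    row = min(int(y * s), s - 1)
    return row, col


def crowded_cells(centers, s=S):
    """Map each cell holding more than one object center to those objects' indices."""
    by_cell = defaultdict(list)
    for i, (x, y) in enumerate(centers):
        by_cell[cell_of(x, y, s)].append(i)
    return {cell: idx for cell, idx in by_cell.items() if len(idx) > 1}


# Hypothetical flock: three birds close together, one far away
birds = [(0.50, 0.50), (0.55, 0.52), (0.53, 0.56), (0.10, 0.90)]
crowded = crowded_cells(birds)
```

The three nearby birds share one cell, which can output only two boxes and one class, so at least one of them cannot be detected.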

Comparisons

Conclusion

  • It generalizes well to new applications, works even with a webcam, and is very fast at real-time detection.

Graph

Generalization Results