

ICRA brings together researchers from all around the world to discuss the latest in robotics and automation. In addition, the industry exhibition as well as live demonstrations and challenges bring together students, developers and entrepreneurs seeking the latest developments and applications in robotics and automation.

Summary by Hadrien Bertrand 5 years ago

The paper proposes a method for joint instance and semantic segmentation. The method is designed to be fast, as it is meant to run in an embedded environment (such as a robot). The semantic map may seem redundant given the instance map, but it is not: semantic segmentation is a key step in obtaining the instance map.

Architecture


The image is first passed through a typical CNN encoder (specifically a ResNet derivative), followed by 3 separate decoders. The decoder outputs are at a low resolution for faster processing.

Decoders:

  • Semantic segmentation: coupled with the encoder, it's U-Net-like. The output is a segmentation map.
  • Instance center: for each pixel, outputs the confidence that it is the center of an object.
  • Embedding: for each pixel, computes a 32-dimensional embedding. This embedding must have a low distance to the embeddings of other pixels of the same instance, and a high distance to the embeddings of all other pixels.

To obtain the instance map, the segmentation map is used to mask the other 2 decoder outputs, separating the embeddings and centers of each class. Center confidences are thresholded at 0.7, and centers whose embedding distance to another center is lower than a set amount are discarded, as they are considered duplicates.

Then, for each class, a similarity matrix is computed between all pixels of that class and all centers of that class. Each pixel is assigned to its closest center, and each center represents a different instance of the class.
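
The grouping step for a single class can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `group_instances`, the flattened pixel layout, the `min_center_dist` value, the Euclidean metric, and the choice to visit candidate centers from most to least confident are all assumptions. Only the 0.7 threshold comes from the summary.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_instances(class_mask, center_conf, embeddings,
                    conf_threshold=0.7, min_center_dist=0.5):
    """Assign each pixel of one class to an instance id (-1 outside the class).

    class_mask:  per-pixel booleans from the semantic map (flattened image)
    center_conf: per-pixel center confidences
    embeddings:  per-pixel embedding vectors
    min_center_dist is a hypothetical value for the duplicate threshold.
    """
    # Indices of pixels that the semantic map assigns to this class.
    pixels = [i for i, inside in enumerate(class_mask) if inside]

    # Candidate centers: masked pixels above the confidence threshold,
    # visited from most to least confident (the ordering is an assumption).
    candidates = sorted(
        (i for i in pixels if center_conf[i] > conf_threshold),
        key=lambda i: center_conf[i],
        reverse=True,
    )

    # Discard candidates whose embedding is too close to a kept center.
    centers = []
    for i in candidates:
        if all(euclidean(embeddings[i], embeddings[c]) >= min_center_dist
               for c in centers):
            centers.append(i)

    # Assign every class pixel to its closest surviving center.
    labels = [-1] * len(class_mask)
    if not centers:
        return labels
    for i in pixels:
        dists = [euclidean(embeddings[i], embeddings[c]) for c in centers]
        labels[i] = dists.index(min(dists))
    return labels


mask = [True, True, True, True, False]
conf = [0.9, 0.2, 0.8, 0.75, 0.95]
emb = [[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0], [5.0, 5.0]]
print(group_instances(mask, conf, emb))  # [0, 0, 1, 1, -1]
```

In this toy input, pixel 3 passes the confidence threshold but sits next to pixel 2 in embedding space, so it is dropped as a duplicate center; pixel 4 has the highest confidence but lies outside the class mask and is ignored.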

Finally, the segmentation and instance maps are upsampled using the SLIC algorithm.

Loss

There is one loss for each decoder head.

  • Semantic segmentation: weighted cross-entropy
  • Instance center: cross-entropy term modulated by a $\gamma$ parameter to counter the over-representation of the background over the target classes.


  • Embedding: composed of 3 parts: an attracting force between embeddings of the same instance, a repelling force between embeddings of different instances, and an L2 regularization on the embeddings.

    $\hat{e}$ are the embeddings, $\delta_a$ is a hyper-parameter defining "close enough" for the attracting term, and $\delta_b$ defines "far enough" for the repelling term.
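
To make the three parts of the embedding loss concrete, here is a hedged sketch of a hinge-style attract/repel/regularize loss. It is not the paper's exact formula: using each instance's mean embedding as the reference point, the squared hinge, the default values of `delta_a` and `delta_b`, and the `reg_weight` factor are all assumptions made for illustration.

```python
def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def embedding_loss(embeddings, instance_ids,
                   delta_a=0.5, delta_b=1.5, reg_weight=0.001):
    """Hinge-style sketch; all default values are hypothetical."""
    # Group pixel embeddings by their ground-truth instance id.
    groups = {}
    for e, k in zip(embeddings, instance_ids):
        groups.setdefault(k, []).append(e)
    means = {k: mean_vector(v) for k, v in groups.items()}

    # Attract: penalize pixels farther than delta_a from their instance mean.
    attract = 0.0
    for k, vecs in groups.items():
        attract += sum(max(0.0, l2_distance(e, means[k]) - delta_a) ** 2
                       for e in vecs) / len(vecs)
    attract /= len(groups)

    # Repel: penalize pairs of instance means closer than delta_b.
    keys = list(means)
    pairs = [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]
    repel = 0.0
    if pairs:
        repel = sum(max(0.0, delta_b - l2_distance(means[a], means[b])) ** 2
                    for a, b in pairs) / len(pairs)

    # Regularize: mean squared L2 norm of the embeddings.
    reg = sum(sum(x * x for x in e) for e in embeddings) / len(embeddings)

    total = attract + repel + reg_weight * reg
    return {"attract": attract, "repel": repel, "reg": reg, "total": total}


# Two tight, well-separated instances: neither hinge term is active.
far = embedding_loss([[0.0, 0.0], [0.1, 0.0], [3.0, 0.0], [3.1, 0.0]],
                     [0, 0, 1, 1])
# Two instances only 1.0 apart: the repelling term becomes active.
near = embedding_loss([[0.0, 0.0], [1.0, 0.0]], [0, 1])
```

The hinges mean that embeddings already "close enough" to their own instance, or instances already "far enough" apart, contribute nothing, so the network is not pushed to collapse or spread embeddings further than needed.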

The whole model is trained jointly using a weighted sum of the 3 losses.
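
As a rough illustration of the instance-center term and the final weighted sum, the sketch below assumes that the $\gamma$ modulation works like a focal-style factor $(1 - p_t)^\gamma$ on the cross-entropy, which down-weights easy (mostly background) pixels. That reading, the function names, and all weight values are assumptions, not details confirmed by the summary.

```python
import math


def modulated_cross_entropy(probs, targets, gamma=2.0, eps=1e-7):
    """Binary cross-entropy scaled by (1 - p_t) ** gamma (focal-style assumption).

    probs:   predicted center probabilities per pixel
    targets: 1 for center pixels, 0 for background
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)      # avoid log(0)
        p_t = p if t == 1 else 1.0 - p       # probability of the true label
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def total_loss(l_semantic, l_center, l_embedding,
               w_semantic=1.0, w_center=1.0, w_embedding=1.0):
    """Weighted sum of the three head losses (weights are hypothetical)."""
    return (w_semantic * l_semantic
            + w_center * l_center
            + w_embedding * l_embedding)


# A confidently correct background pixel contributes far less once modulated.
plain = modulated_cross_entropy([0.01], [0], gamma=0.0)
focal = modulated_cross_entropy([0.01], [0], gamma=2.0)
```

With `gamma=0.0` the function reduces to ordinary cross-entropy, which makes it easy to check how much the modulation suppresses well-classified background pixels.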

Experiments and results

The authors test their method on the Cityscapes dataset, which is composed of 5000 annotated images and 8 instance classes. They evaluate it on both semantic segmentation and instance segmentation.


For semantic segmentation, their method performs acceptably, though ENet, for example, performs better on average and is much faster.


On the other hand, for instance segmentation, their method is much faster than the others while still performing well. It is not SOTA in accuracy, but given the real-time constraint, it offers a much better trade-off.

Comments

  • Most instance segmentation methods tend to be sluggish and overly complicated. This approach is much more elegant in my opinion.
  • If they removed the aggressive down/up sampling, I wonder if they would beat Mask R-CNN and PANet.
  • I'm not sure what the point of upsampling the semantic map is, given that we already have the instance map.