A new state of the art for unsupervised computer vision | MIT News

Data labeling can be a hassle. Labels are the main source of supervision for computer vision models; without them, models would have great difficulty identifying objects, people, and other important features in an image. Yet producing just one hour of tagged and labeled data can take 800 hours of human time. As machines get better at perceiving and interacting with our surroundings, their high-fidelity understanding of the world grows. But they need more help.

Scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), Microsoft, and Cornell University have attempted to solve this problem plaguing vision models by creating “STEGO,” an algorithm that can jointly discover and segment objects down to the pixel, without any human labels.

STEGO learns something called “semantic segmentation” – a fancy term for the process of assigning a label to each pixel in an image. Semantic segmentation is an important skill for today’s computer vision systems because images can be cluttered with objects. Even more challenging, these objects don’t always fit in literal boxes; algorithms tend to work best on discrete “things” like people and cars, as opposed to amorphous “stuff” like vegetation, the sky, and mashed potatoes. A previous system might perceive a nuanced scene of a dog playing in the park as simply a dog, but by assigning a label to every pixel in the image, STEGO can break the scene down into its main ingredients: a dog, the sky, grass, and the dog’s owner.
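To make the idea concrete, here is a minimal sketch of what a semantic segmentation actually is as a data structure: a grid of class IDs, one per pixel. The class names and the tiny label map below are purely illustrative, not STEGO’s output.

```python
# A semantic segmentation is a per-pixel map of class labels.
# Toy 4x6 "image" of the dog-in-the-park scene: each cell holds a
# class ID instead of a color value. (Classes are hypothetical.)
CLASSES = {0: "sky", 1: "grass", 2: "dog", 3: "person"}

label_map = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 3, 3, 0, 0],
    [1, 1, 3, 3, 1, 1],
    [1, 2, 2, 1, 1, 1],
]

def class_histogram(label_map):
    """Count how many pixels belong to each class."""
    counts = {}
    for row in label_map:
        for label in row:
            counts[label] = counts.get(label, 0) + 1
    return {CLASSES[k]: v for k, v in counts.items()}

# Every pixel is accounted for: the scene decomposes into its ingredients.
histogram = class_histogram(label_map)
```

Object detection would draw one box around the dog; a segmentation like this assigns every pixel, including the amorphous “stuff” such as sky and grass, to a class.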

Assigning a label to every pixel in the world is ambitious, especially without any kind of feedback from humans. The majority of today’s algorithms derive their knowledge from mounds of labeled data, which can take laborious human hours to produce. Imagine the excitement of labeling every pixel of 100,000 images! To discover these objects without human help, STEGO looks for similar objects that appear throughout a dataset. It then associates these similar objects to construct a coherent view of the world across all of the images it learns from.
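A common way to ask “do these two image patches show the same kind of thing?” without labels is to compare their deep feature vectors, for example by cosine similarity. The feature vectors below are made up for illustration; this is a sketch of the general idea, not STEGO’s actual features or training objective.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical deep features for three image patches: two dogs, one sky patch.
dog_a = [0.9, 0.1, 0.3]
dog_b = [0.8, 0.2, 0.35]
sky = [0.05, 0.9, 0.1]

# Patches with similar features get treated as the same kind of object,
# even if they come from different images.
assert cosine_similarity(dog_a, dog_b) > cosine_similarity(dog_a, sky)
```

The two “dog” patches point in nearly the same direction in feature space, so an unsupervised system can associate them with each other while keeping the sky patch separate.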

Seeing the world

Machines that can “see” are essential for a wide range of new and emerging technologies such as self-driving cars and predictive modeling for medical diagnostics. Since STEGO can learn without labels, it can detect objects in many different domains, even those that humans don’t yet fully understand.

“If you’re looking at oncology scans, the surface of planets, or high-resolution biological images, it’s hard to know what objects to look for without expert knowledge. In emerging fields, sometimes even human experts don’t know what the right objects should be,” says Mark Hamilton, a PhD student in electrical engineering and computer science at MIT, research affiliate at MIT CSAIL, software engineer at Microsoft, and lead author of a new paper about STEGO. “In these types of situations where you want to design a method to operate at the limits of science, you can’t rely on humans to figure it out before machines do.”

STEGO has been tested on a variety of visual domains spanning general images, driving images, and high-altitude aerial photography. In each domain, STEGO was able to identify and segment relevant objects that closely aligned with human judgments. STEGO’s most diverse benchmark was the COCO-Stuff dataset, which comprises diverse images from around the world, from indoor scenes and people playing sports to trees and cows. In most cases, the previous state-of-the-art system could capture the low-resolution gist of a scene but struggled with fine detail: a human was a blob, a motorcycle was captured as a person, and it couldn’t recognize any geese. On the same scenes, STEGO doubled the performance of previous systems and discovered concepts like animals, buildings, people, furniture, and many more.

STEGO not only doubled the performance of previous systems on the COCO-Stuff benchmark, but also made similar leaps forward in other visual areas. When applied to driverless car datasets, STEGO was able to segment roads, people, and traffic signs with much greater resolution and granularity than previous systems. In images from space, the system broke down every square foot of the Earth’s surface into roads, vegetation and buildings.

Connecting the pixels

STEGO – which stands for “Self-supervised Transformer with Energy-based Graph Optimization” – builds on the DINO algorithm, which learned about the world through 14 million images from the ImageNet database. STEGO refines the DINO backbone through a learning process that mimics our own way of stitching pieces of the world together to make meaning.

For example, consider two images of dogs walking in the park. Even though they are different dogs, with different owners, in different parks, STEGO can tell (without human help) how the objects in each scene relate to each other. The authors even probe STEGO’s mind to see how similar each little brown, furry thing in the images is, and likewise for other shared objects like grass and people. By connecting objects across images, STEGO builds a coherent view of the world.
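Once objects are connected across images, they can be grouped into consistent, label-free clusters. The toy k-means routine below sketches that general idea of unsupervised grouping; STEGO’s actual method distills feature correspondences through an energy-based graph optimization rather than plain k-means, and the two-dimensional “patch features” here are invented for illustration.

```python
def kmeans(points, k, iters=10):
    """Minimal k-means: group feature vectors into k clusters without labels."""
    centroids = points[:k]  # naive init: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = [sum(dim) / len(cluster) for dim in zip(*cluster)]
    return centroids, clusters

# Hypothetical patch features from several images: two "dog-like" patches
# and two "grass-like" patches, with no labels attached.
patches = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.15, 0.85]]
centroids, clusters = kmeans(patches, k=2)
```

The algorithm separates the patches into two groups purely from their feature geometry; a human (or downstream task) can then name the discovered groups “dog” and “grass.”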

“The idea is that these kinds of algorithms can find consistent groupings in a largely automated way so we don’t have to do it ourselves,” says Hamilton. “It might have taken years to understand complex visual datasets like biological imagery, but if we can avoid spending 1,000 hours sifting through data and labeling it, we can find and discover new insights that we might have missed. We hope this will help us understand the visual world in a more empirical way.”

Looking ahead

Despite its improvements, STEGO still faces some challenges. One is that labels can be arbitrary. For example, the labels in the COCO-Stuff dataset distinguish between “food-things” like bananas and chicken wings and “food-stuff” like oatmeal and pasta. STEGO doesn’t see much difference. In other cases, STEGO was confused by odd images – such as one of a banana resting on a telephone receiver – where the receiver was labeled “foodstuff” instead of “raw material.”

For future work, the authors plan to give STEGO a bit more flexibility than labeling pixels with a fixed number of classes, since things in the real world can sometimes be multiple things at the same time (like “food,” “plant,” and “fruit”). They hope this will give the algorithm room for uncertainty, trade-offs, and more abstract thinking.
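One standard way to let a pixel be “multiple things at once” is to replace a hard, single-class label with a probability distribution over classes, for example via a softmax. The sketch below illustrates that general idea with made-up per-pixel scores; it is not a description of STEGO’s planned extension.

```python
import math

def softmax(scores):
    """Turn raw class scores into a probability distribution over classes."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-pixel scores for a banana pixel over three classes.
classes = ["food", "plant", "fruit"]
scores = [2.0, 1.5, 2.2]
probs = softmax(scores)

# A hard label keeps only the top class; a soft label keeps the whole
# distribution, so one pixel can be "food", "plant", and "fruit" to degrees.
hard_label = classes[probs.index(max(probs))]
```

Keeping the full distribution rather than only the argmax is what gives an algorithm room to express uncertainty between overlapping concepts.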

“By creating a general tool for understanding potentially complicated datasets, we hope this type of algorithm can automate the scientific process of discovering objects from images. There are many different domains where human labeling would be prohibitively expensive, or where humans simply don’t even know the specific structure, such as in certain biological and astrophysical fields. We hope that future work will enable application to a very wide range of datasets. Since you don’t need any human labels, we can now start to apply machine learning tools more broadly,” says Hamilton.

“STEGO is simple, elegant, and very effective. I consider unsupervised segmentation to be a benchmark of progress in image understanding, and a very difficult problem. The research community has made tremendous progress in unsupervised image understanding with the adoption of transformer architectures,” says Andrea Vedaldi, professor of computer vision and machine learning and co-director of the Visual Geometry Group in the Department of Engineering at the University of Oxford. “This research provides perhaps the most direct and effective demonstration of this progress on unsupervised segmentation.”

Hamilton authored the paper alongside MIT CSAIL PhD student Zhoutong Zhang, Assistant Professor Bharath Hariharan of Cornell University, Associate Professor Noah Snavely of Cornell Tech, and MIT Professor William T. Freeman. They will present the paper at the 2022 International Conference on Learning Representations (ICLR).

