Facebook Parent Meta Releases the Segment Anything Model (SAM): A New AI Model That Can Cut Out Any Object in an Image or Video With a Single Click

Recent Meta AI research presents a project called “Segment Anything,” an effort to “democratize segmentation” by providing a new task, dataset, and model for image segmentation. The project comprises the Segment Anything Model (SAM) and the Segment Anything 1-Billion mask dataset (SA-1B), the largest segmentation dataset to date.

Previously, there were two main categories of approaches to segmentation problems. The first, interactive segmentation, could segment any object but needed a human operator to refine a mask iteratively. The second, automatic segmentation, could segment predefined object categories without that guidance, but training the model required a large number of manually annotated objects, along with computing resources and technical expertise. Neither approach offered a general, fully automatic means of segmentation.

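SAM generalizes these two approaches: it is a single promptable model that can perform both interactive and automatic segmentation. This flexibility allows SAM to transfer to different domains and perform different tasks. Some of SAM’s capabilities are as follows: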

SAM can segment an object with a single mouse click or through the interactive selection of points to include and exclude. A bounding box can also be used as a prompt for the model.
When a prompt is ambiguous about which object is meant, SAM can output multiple valid masks, a crucial capability for practical segmentation problems.
SAM can automatically find and mask all objects in an image.
After precomputing the image embedding, SAM can generate a segmentation mask for any prompt in real time, enabling interactive use of the model.
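
The click, include/exclude point, and box prompts listed above correspond to the arguments of `SamPredictor.predict` in Meta's open-source `segment_anything` package. The following is a minimal sketch, assuming a predictor on which `set_image` has already been called. The helper name `segment_from_clicks` is hypothetical and not part of the library, and picking the highest-scoring candidate is only one possible way to resolve ambiguity.

```python
import numpy as np


def segment_from_clicks(predictor, include, exclude=(), box=None):
    """Return (best_mask, all_masks, scores) for click and box prompts.

    `predictor` is assumed to expose predict() with the same keyword
    arguments as segment_anything.SamPredictor. `include` and `exclude`
    hold (x, y) pixel coordinates; `box` is (x0, y0, x1, y1).
    """
    include, exclude = list(include), list(exclude)
    points = include + exclude
    if not points and box is None:
        raise ValueError("provide at least one point or a box")

    coords = labels = None
    if points:
        coords = np.array(points, dtype=np.float32)  # shape (N, 2), (x, y) order
        # Label 1 marks a foreground (include) click, 0 a background (exclude) click.
        labels = np.array([1] * len(include) + [0] * len(exclude))
    box_xyxy = None if box is None else np.array(box, dtype=np.float32)

    # multimask_output=True asks for several candidate masks, which is how
    # an ambiguous prompt (part vs. whole object) can be handled.
    masks, scores, _ = predictor.predict(
        point_coords=coords,
        point_labels=labels,
        box=box_xyxy,
        multimask_output=True,
    )
    best = int(np.argmax(scores))
    return masks[best], masks, scores


# Usage with the real library (needs downloaded weights, so it is not run here):
#   from segment_anything import sam_model_registry, SamPredictor
#   sam = sam_model_registry["vit_h"](checkpoint="<path to checkpoint>")
#   predictor = SamPredictor(sam)
#   predictor.set_image(image_rgb)  # HxWx3 uint8 RGB array
#   mask, masks, scores = segment_from_clicks(predictor, include=[(500, 375)])
```

Returning all candidates alongside the best one lets an interactive tool show the alternatives when the top-scoring mask is not the one the user meant.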

The team needed a large and varied dataset to train the model, and SAM itself was used to collect it. Annotators used SAM to annotate images interactively, and the resulting data was then used to update SAM. This loop ran several times, improving both the model and the dataset.

New segmentation masks can be collected quickly using SAM. The team’s tool makes interactive mask annotation quick and easy, taking only about 14 seconds per mask. Compared with previous large-scale segmentation data collection efforts, this is 6.5x faster than COCO’s fully manual polygon-based mask annotation and 2x faster than the previous largest data annotation effort, which was also model-assisted.
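
To put the speedups in concrete terms, the short calculation below derives the per-mask times they imply for the two baselines. These are back-of-the-envelope values computed from the approximate 14-second figure and the stated ratios, not timings reported by the researchers.

```python
SAM_ASSISTED_SECONDS = 14.0          # approximate per-mask time from the article
SPEEDUP_VS_COCO = 6.5                # vs. COCO manual polygon annotation
SPEEDUP_VS_PREVIOUS_LARGEST = 2.0    # vs. the previous largest (model-assisted) effort


def implied_baseline_seconds(assisted_seconds, speedup):
    """Per-mask time a baseline must have taken for the given speedup to hold."""
    return assisted_seconds * speedup


coco_seconds = implied_baseline_seconds(SAM_ASSISTED_SECONDS, SPEEDUP_VS_COCO)
previous_seconds = implied_baseline_seconds(SAM_ASSISTED_SECONDS, SPEEDUP_VS_PREVIOUS_LARGEST)

print(f"Implied COCO polygon annotation: ~{coco_seconds:.0f} s per mask")
print(f"Implied previous largest effort: ~{previous_seconds:.0f} s per mask")
```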

The 1-billion-mask dataset could not have been built with interactively annotated masks alone. The researchers therefore developed a data engine for collecting SA-1B. This data engine has three “gears.” In the first gear, the model assists human annotators. In the second, fully automatic annotation is combined with assisted annotation to broaden the range of collected masks. In the last, fully automatic mask creation allows the dataset to scale.

The final dataset has over 11 million licensed, privacy-protecting images and 1.1 billion segmentation masks. Human evaluation studies have confirmed that the masks in SA-1B are of high quality and diversity and are comparable in quality to masks from previous, much smaller, manually annotated datasets. SA-1B has 400 times as many masks as any existing segmentation dataset.
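
The headline numbers can be related to each other with simple arithmetic. The sketch below uses the rounded figures from the article, so the results are approximate and are implied values rather than reported statistics.

```python
TOTAL_MASKS = 1_100_000_000   # "1.1 billion segmentation masks"
TOTAL_IMAGES = 11_000_000     # "over 11 million images", rounded down
SCALE_FACTOR = 400            # "400 times as many masks as any existing dataset"

# Average number of masks per image in SA-1B.
masks_per_image = TOTAL_MASKS / TOTAL_IMAGES

# Mask count of the largest earlier dataset that the 400x claim implies.
implied_previous_largest = TOTAL_MASKS / SCALE_FACTOR

print(f"~{masks_per_image:.0f} masks per image on average")
print(f"~{implied_previous_largest / 1e6:.2f} million masks in the largest earlier dataset")
```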

The researchers trained SAM to return an accurate segmentation mask in response to various prompts, including foreground/background points, a rough box or mask, and freeform text. They observed that the pretraining task and interactive data collection imposed particular constraints on the model design. In particular, for annotators to use SAM effectively during annotation, the model had to run in real time on a CPU in a web browser.

An image encoder produces a one-time embedding for the image, while a lightweight prompt encoder converts any prompt into an embedding vector in real time. A lightweight decoder then combines these two sources of information to predict the segmentation mask. Once the image embedding has been computed, SAM can produce a segment for any prompt in under 50 ms in a web browser.
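
The encode-once, decode-many structure described here can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the class name `PromptableSegmenter` and the three stand-in functions are hypothetical, and the "encoders" are trivial placeholders rather than neural networks. The point is only the control flow, in which the expensive image step runs once and each prompt triggers just the two cheap steps.

```python
class PromptableSegmenter:
    """Toy wrapper showing the encode-once, decode-many pattern."""

    def __init__(self, image_encoder, prompt_encoder, mask_decoder):
        self.image_encoder = image_encoder    # heavy: run once per image
        self.prompt_encoder = prompt_encoder  # light: run for every prompt
        self.mask_decoder = mask_decoder      # light: run for every prompt
        self._embedding = None

    def set_image(self, image):
        self._embedding = self.image_encoder(image)

    def predict(self, prompt):
        if self._embedding is None:
            raise RuntimeError("call set_image() before predict()")
        return self.mask_decoder(self._embedding, self.prompt_encoder(prompt))


calls = {"image_encoder": 0}


def toy_image_encoder(image):
    calls["image_encoder"] += 1
    return [row[:] for row in image]  # stand-in "embedding": a copy of the pixels


def toy_prompt_encoder(click):
    x, y = click
    return (x, y)  # stand-in "prompt embedding": the click itself


def toy_mask_decoder(embedding, prompt_vec):
    # Stand-in decoder: select every pixel with the same value as the clicked one.
    x, y = prompt_vec
    target = embedding[y][x]
    return [[value == target for value in row] for row in embedding]


image = [[0, 0, 1],
         [0, 1, 1],
         [2, 2, 1]]
segmenter = PromptableSegmenter(toy_image_encoder, toy_prompt_encoder, toy_mask_decoder)
segmenter.set_image(image)
mask_a = segmenter.predict((0, 0))  # click on the region of 0s
mask_b = segmenter.predict((2, 2))  # click on the region of 1s
print(calls["image_encoder"])       # the image encoder ran once for two prompts
```

Keeping the cached embedding on the object is what makes each additional prompt cheap, which is the property the sub-50 ms figure relies on.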

SAM has the potential to power future applications in a wide variety of fields that require finding and segmenting any object in any image. For example, SAM could be integrated into larger AI systems for a more general multimodal understanding of the world, such as understanding both the visual and textual content of a webpage.