Microsoft drops Florence-2, a unified model to handle a variety of vision tasks
Both pre-trained and fine-tuned versions of Florence-2, at 232M and 771M parameters, are available on Hugging Face under a permissive MIT license.
When Microsoft tried solving this, it found two key roadblocks: the scarcity of comprehensively annotated visual datasets, and the absence of a unified pretraining framework with a singular network architecture that integrates an understanding of spatial hierarchy and semantic granularity.

“All annotations in the dataset, FLD-5B, are uniformly standardized into textual outputs, facilitating a unified multi-task learning approach with consistent optimization with the same loss function as the objective,” the researchers wrote in the paper detailing the model.

For instance, in a zero-shot captioning test on the COCO dataset, the 232M and 771M versions of Florence-2 outperformed DeepMind’s 80B-parameter Flamingo visual language model, scoring 133 and 135.6, respectively.
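Because every task is expressed as a text prompt and decoded as text, trying the released checkpoints is straightforward. Below is a minimal sketch of captioning an image with the transformers library; the hub ID microsoft/Florence-2-base, the trust_remote_code requirement, the "<CAPTION>" task token, and the image URL are assumptions drawn from the Hugging Face listing rather than details stated in this article.

    import requests
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # Assumed hub ID; "-large" and fine-tuned "-ft" variants are also listed.
    model_id = "microsoft/Florence-2-base"
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    url = "https://example.com/photo.jpg"  # hypothetical image URL
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

    # Tasks are selected via text prompts; "<CAPTION>" requests an image caption.
    inputs = processor(text="<CAPTION>", images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=128,
    )
    # All Florence-2 outputs are plain text, so decoding yields the caption directly.
    caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(caption)

Swapping the prompt for a different task token would switch the model to another task, such as detection or grounding, without changing the architecture or weights.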
Or read this on Venture Beat