Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Microsoft has introduced Florence-2, a vision foundation model that uses a unified, prompt-based representation for a range of computer vision and vision-language tasks. Rather than relying on a separate task-specific model for each problem, Florence-2 takes a simple text instruction and performs captioning, object detection, grounding, or segmentation. It is trained on FLD-5B, a dataset of 5.4 billion visual annotations on 126 million images, built through automated annotation and iterative model refinement. Florence-2 is trained with a sequence-to-sequence architecture, and the evaluations reported in the paper show strong zero-shot and fine-tuning performance across these tasks.
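
To make the "text instruction in, text out" idea concrete, here is a minimal sketch of how region outputs such as detection boxes could be serialized as text so that a single sequence-to-sequence model can emit them. It does not load or call Florence-2. The instruction strings, the `<loc_N>` token format, the choice of 1,000 coordinate bins, and the helper names `quantize`, `encode_box`, and `decode_boxes` are all assumptions made for illustration, not details taken from the summary above.

```python
import re

NUM_BINS = 1000  # assumption: each coordinate is quantized into 1,000 bins

# Hypothetical instruction strings; the prompts used by the real model may differ.
TASK_PROMPTS = {
    "caption": "Describe the image.",
    "detection": "Locate the objects in the image.",
}


def quantize(value, size):
    """Map a pixel coordinate in [0, size] to a bin index in [0, NUM_BINS - 1]."""
    return min(int(value * NUM_BINS / size), NUM_BINS - 1)


def encode_box(label, box, width, height):
    """Serialize one labeled box as text: the label followed by four location tokens."""
    x0, y0, x1, y1 = box
    bins = (
        quantize(x0, width),
        quantize(y0, height),
        quantize(x1, width),
        quantize(y1, height),
    )
    return label + "".join(f"<loc_{b}>" for b in bins)


# A region is a run of label text followed by exactly four location tokens.
_REGION = re.compile(r"([^<]+)((?:<loc_\d+>){4})")


def decode_boxes(text, width, height):
    """Parse serialized regions back into (label, box) pairs in pixel coordinates."""
    results = []
    for label, locs in _REGION.findall(text):
        bx0, by0, bx1, by1 = (int(n) for n in re.findall(r"<loc_(\d+)>", locs))
        # Use the center of each bin as the decoded coordinate.
        box = (
            (bx0 + 0.5) * width / NUM_BINS,
            (by0 + 0.5) * height / NUM_BINS,
            (bx1 + 0.5) * width / NUM_BINS,
            (by1 + 0.5) * height / NUM_BINS,
        )
        results.append((label.strip(), box))
    return results


if __name__ == "__main__":
    # The target string a model would be trained to generate for the detection prompt.
    target = encode_box("cat", (64, 48, 320, 240), 640, 480)
    print(TASK_PROMPTS["detection"], "->", target)
    print(decode_boxes(target, 640, 480))
```

Under this scheme, captioning and detection differ only in the instruction and in whether the output string contains location tokens, which is the sense in which one sequence-to-sequence model can cover several tasks.
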
Read the full research paper.
