MetaCLIP – Meta AI Research

2023-10-26 · github.com

Everything about MetaCLIP: curation/training code, metadata, distribution and pre-trained models.

This repository contains the code for MetaCLIP, described in the paper Demystifying CLIP Data, which formalizes CLIP data curation as a simple algorithm. The main contributions are:

  • Curating data from scratch without filtering via prior models (e.g., unlike existing open-source efforts that use the original CLIP model as a teacher for filtering student data);
  • Making training data more transparent: we release the training data distribution over metadata;
  • A scalable algorithm running in the data pipeline, allowing the data pool to scale to the whole CommonCrawl (CC) with 300+B image-text pairs. We observe that data quality is much more important than quantity (unlike existing open-source efforts or ALIGN, which mostly scale quantity);
  • A standard CLIP training setup for controlled experiments and fair comparisons under fixed training and model configurations.

We conclude that:

  • Effective pretraining data should maximally preserve signal and mitigate noise, instead of hard removal of noise with black-box filters that lead to an unknown distribution;
  • Our algorithm is simpler and scalable to curating the whole Internet;
  • Open-sourcing entails not just a trained model checkpoint but, more importantly, the pre-training data distribution.

Updates

  • 09/28/2023: initial release.

Getting Started

This code is developed with minimal changes on top of OpenCLIP. The following command should install the requirements for OpenCLIP and the submitit=1.2.1 package used by this repo (the environment name metaclip below is arbitrary):

conda create -n metaclip python=3.10 pytorch torchvision pytorch-cuda=11.7 tqdm ftfy braceexpand regex pandas submitit=1.2.1 \
    -c pytorch-nightly \
    -c nvidia \
    -c conda-forge \
    -c anaconda

Metadata

MetaCLIP uses 500,000 queries as metadata to align the training data with the distribution of quality writing from Wikipedia/WordNet terms. This metadata also allows us to release the training data distribution of a released model as a data card.
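
As a minimal sketch (using the same metadata.json and metaclip/entry_counts_400m.json files referenced in the curation example further below), you can inspect this data card directly, for example by listing the most frequent metadata entries in the 400M curated set:

import json

with open("metadata.json") as f:
  metadata = json.load(f)  # list of ~500k Wikipedia/WordNet-derived entries
with open("metaclip/entry_counts_400m.json") as f:
  entry_counts = json.load(f)  # entry -> number of curated pairs matching it

print(f"{len(metadata)} metadata entries")
# ten most frequent (head) entries in the released 400M distribution
for entry, count in sorted(entry_counts.items(), key=lambda kv: kv[1], reverse=True)[:10]:
  print(f"{count:>12,}  {entry}")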

Pre-trained Models

We modify OpenCLIP to match the default CLIP training setup (with ViT-B-16-quickgelu, ViT-L-14-quickgelu and ViT-H-14-quickgelu). Most OpenCLIP models use nn.GELU rather than the QuickGELU used by vanilla CLIP. We hope this helps research with controlled experiments in the "CLIP era of ImageNet".

import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32-quickgelu', pretrained='metaclip/b32_400m.pt')  # assumes the released checkpoint has been downloaded to this local path

image = preprocess(Image.open("CLIP.png")).unsqueeze(0)
text = open_clip.tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)

How to Curate?

We have a demo notebook to show how the proposed algorithm works.

I already have a (head-distributed) dataset:

CLIP curation can still help as online balancing (Table 6 in the paper). We wrap CLIP curation in two key functions: substring matching (recommended to run offline) and balancing (either offline or online; please check metaclip.balancing:main).

import json
import numpy as np
from metaclip.substr_matching import substr_matching
from metaclip.balancing import balance_sampling

with open("metadata.json") as f:
  metadata = json.load(f)
# entry counts for our 1.6B (pool) -> 400M (curated) set; please check balance_sampling:main, and run substring matching and counting on your own data.
with open("metaclip/entry_counts_400m.json") as f:
  entry_count_json = json.load(f)
entry_count = np.array([entry_count_json[entry] for entry in metadata], dtype=np.uint64)  # uint64 to be safe for scaling.

t = 20000  # threshold t: entries matched by more than t pairs are downsampled
entry_count[entry_count < t] = t
entry_prob = t / entry_count  # keep-probability: 1.0 for tail entries, t/count for head entries

for text in ["jacksons chameleon", "battery plate"]:
  matched_entry_ids = substr_matching(text, metadata)
  if balance_sampling(matched_entry_ids, entry_prob):
    print(f"'{text}' curated")
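
For intuition: balancing keeps every pair that matches a tail entry and independently downsamples head entries toward roughly t pairs each. A minimal sketch of the idea (not the repository's implementation, which lives in metaclip.balancing) could look like:

import random

def balance_sampling_sketch(matched_entry_ids, entry_prob):
  # Keep a pair if any of its matched metadata entries survives an independent
  # coin flip with that entry's keep-probability (1.0 for tail entries,
  # t/count for head entries), so tail-matching pairs are always kept.
  for entry_id in matched_entry_ids:
    if random.random() < entry_prob[entry_id]:
      return True
  return False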

I want to curate data from scratch:

We release skeleton code for sub-string matching from CommonCrawl WAT or WARC files and for balancing. Check here for details.
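
As a rough illustration of the first stage only (not the repository's skeleton, which also handles WAT parsing, sharding and URL dedup), candidate alt-texts could be pulled from a CommonCrawl WARC file with the third-party warcio package and then fed to substr_matching and balancing as above:

import re
from warcio.archiveiterator import ArchiveIterator

IMG_ALT = re.compile(r'<img[^>]+alt="([^"]{5,200})"', re.IGNORECASE)

def iter_alt_texts(warc_path):
  # Yield <img alt="..."> strings from HTML response records in a WARC file.
  with open(warc_path, "rb") as stream:
    for record in ArchiveIterator(stream):
      if record.rec_type != "response":
        continue
      html = record.content_stream().read().decode("utf-8", errors="ignore")
      for alt in IMG_ALT.findall(html):
        yield alt  # candidate text for substr_matching / balance_sampling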

Training

python submitit_openclip.py b32_400m

Please configure the corresponding training_data in run_configs_400m.py.
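
The exact fields are defined in run_configs_400m.py; purely as a hypothetical illustration (the shard path below is made up and follows OpenCLIP's braceexpand-style webdataset convention, not necessarily the repo's actual schema), training_data points at your curated shards:

# Hypothetical illustration only; see run_configs_400m.py for the real fields.
training_data = "/path/to/curated/shards/{00000..01023}.tar"  # braceexpand-style shard pattern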

Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Hu Xu (huxu@meta.com).

Citation

Please cite our paper if MetaCLIP helps your work:

@inproceedings{xu2023metaclip,
   title={Demystifying CLIP Data},
   author={Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer and Christoph Feichtenhofer},
   journal={arXiv preprint arXiv:2309.16671},
   year={2023}
}

Reference

The training code is developed on top of OpenCLIP, modified to match the vanilla CLIP training setup.

TODO

  • cross-json URL dedup in skeleton code;
  • numpy implementation for matching and balancing;
  • support online downloading;
  • support vanilla CLIP API;
  • Huggingface integration;
  • (your use cases and suggestions are welcome to help keep this codebase regularly updated)

License

The majority of MetaCLIP is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: open_clip is licensed under its own license (https://github.com/mlfoundations/open_clip).



Comments

  • By zerojames 2023-10-26 14:28

    I have been playing with MetaCLIP this afternoon and made https://github.com/autodistill/autodistill-metaclip as a pip installable version. The Facebook repository has some guidance but you have to pull the weights yourself, save them, etc.

    My inference function (model.predict("image.png")) returns an sv.Classifications object that you can load into supervision for processing (i.e. get top k) [1].

    The paper [2] notes the following in terms of performance:

    > In Table 4, we observe that MetaCLIP outperforms OpenAI CLIP on ImageNet and average accuracy across 26 tasks, for 3 model scales. With 400 million training data points on ViT-B/32, MetaCLIP outperforms CLIP by +2.1% on ImageNet and by +1.6% on average. On ViT-B/16, MetaCLIP outperforms CLIP by +2.5% on ImageNet and by +1.5% on average. On ViT-L/14, MetaCLIP outperforms CLIP by +0.7% on ImageNet and by +1.4% on average across the 26 tasks.

    [1] https://github.com/autodistill/autodistill-metaclip [2] https://arxiv.org/pdf/2309.16671.pdf

  • By ninja3925 2023-10-26 15:45

    CLIP is such a nice paradigm shift. Historically, CV things were quite limited:

    - You could predict a class (from a static list such as [dog, cat, ...]) or ...

    - You could use image embeddings disconnected from text (you could tell which images look alike but not what they actually represent). By "embedding" text and images in the same latent space, you can now query your images with a text query (such as "a large dog") and find the relevant photos. CLIP understands semantics but is also not limited to a set list of classes (thanks to the ability to use web data in training).

    This is a list compiled by OpenCLIP of high performance models (some better than MetaCLIP) for those interested in using CLIP: https://github.com/mlfoundations/open_clip/blob/main/docs/op...

  • By gurkwart 2023-10-26 16:00

    Very exciting. CLIP and latent-space embeddings in general are such an intuitive-to-use and powerful tool. I'm using it in some hobby projects, from semantic image search in private collections to trading card recognition among tens of thousands of cards. Love to see more open-source work from big players on this.

HackerNews