OpenAI's CLIP model using a ViT-B/32 image encoder, the smaller of the two widely deployed CLIP variants. Trained contrastively on 400 million image-text pairs, it aligns image and text representations in a shared embedding space for zero-shot classification and retrieval. The B/32 variant sacrifices accuracy versus ViT-L/14 for faster inference.
22,348,495 ↓ · 969 ♡
OpenAI's CLIP model using a ViT-L/14 image encoder, trained contrastively on 400 million image-text pairs from the internet. It aligns image and text in a shared embedding space, enabling zero-shot image classification by comparing image embeddings against text label embeddings. The ViT-L/14 variant offers higher accuracy than the smaller ViT-B/32 at greater compute cost.
12,385,997 ↓ · 2,047 ♡
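The zero-shot procedure described above (comparing an image embedding against text label embeddings) can be sketched roughly as follows. This is a minimal illustration, not the model's actual code: the three-dimensional embeddings are made-up toy values, `zero_shot_classify` is a hypothetical helper, and the `logit_scale` of 100 is an assumed temperature. A real pipeline would obtain embeddings from the image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, label_embs, logit_scale=100.0):
    """Return {label: probability} via softmax over scaled cosine similarities.

    logit_scale is an assumed temperature, not a value read from a checkpoint.
    """
    logits = {label: logit_scale * cosine(image_emb, emb)
              for label, emb in label_embs.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - top) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Toy embeddings (hypothetical); real ones come from the two encoders.
image_emb = [0.9, 0.1, 0.2]
label_embs = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.1],
    "a photo of a car": [0.1, 0.1, 1.0],
}
probs = zero_shot_classify(image_emb, label_embs)
best = max(probs, key=probs.get)
print(best)
```

Because the labels are ordinary text prompts, changing the label set only means embedding new strings; no retraining is involved.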
OpenCLIP ViT-B/32 trained by LAION on 2 billion image-text pairs from the LAION-2B dataset. It provides open-source CLIP features comparable to OpenAI's original ViT-B/32 while being trained on a fully public dataset.
4,013,581 ↓ · 141 ♡
OpenAI CLIP ViT-L/14 at 336×336px input resolution, a higher-resolution variant of the standard ViT-L/14 CLIP model. The larger input resolution (336px rather than 224px, at the same 14px patch size) yields more patches per image, which reduces information loss during tokenization and improves performance on classification tasks requiring fine-grained visual detail. Otherwise shares the same contrastive training on 400M image-text pairs as the base ViT-L/14.
3,393,714 ↓ · 306 ♡
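The effect of input resolution on the token count can be checked with a little arithmetic. The sketch below uses a hypothetical helper, `num_patches`, and assumes square inputs split into non-overlapping square patches. It counts image patches only and ignores any extra class token.

```python
def num_patches(resolution, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    side = resolution // patch_size  # patches along one edge
    return side * side

print(num_patches(224, 14))  # 256
print(num_patches(336, 14))  # 576
```

At the same 14px patch size, moving from 224px to 336px raises the sequence from 256 to 576 patches, which is where the extra compute and the extra visual detail both come from.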
PickScore_v1 is a CLIP-based human preference scorer trained on the Pick-a-Pic dataset of text-image pairs with human preference labels. Given a text prompt and a set of generated images, it predicts which image humans would prefer. It is typically used as a reward model in reinforcement-learning-from-human-feedback (RLHF) pipelines for image generation, not as a standalone image generator.
3,213,190 ↓ · 52 ♡
CLIP fine-tuned on a large fashion product dataset to improve image-text alignment for apparel, accessories, and retail imagery. Standard CLIP models underperform on fashion-specific queries due to distribution shift from generic web data.
2,928,557 ↓ · 284 ♡
SigLIP (Sigmoid Loss for Language-Image Pre-training) SO400M at 384px resolution is Google's vision-language model using a sigmoid binary cross-entropy loss instead of CLIP's softmax contrastive loss. It achieves stronger zero-shot classification than CLIP ViT-L at comparable scale.
1,763,914 ↓ · 680 ♡
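To make the difference between the two losses concrete, here is a small pure-Python sketch. It assumes a square matrix `sim` of image-text similarities in which entry `[i][j]` pairs image `i` with text `j` and the diagonal holds the matching pairs. The function names and the default temperature `t` and bias `b` are illustrative assumptions, not values taken from any checkpoint.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_loss(sim, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: every image-text pair is its own binary problem."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            total += log_sigmoid(label * (t * sim[i][j] + b))
    return -total / n

def softmax_loss(sim, t=10.0):
    """Symmetric softmax contrastive loss: each row and column is normalized."""
    n = len(sim)

    def cross_entropy(rows):
        loss = 0.0
        for i, row in enumerate(rows):
            logits = [t * s for s in row]
            m = max(logits)
            log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
            loss += log_sum_exp - logits[i]  # target is the diagonal entry
        return loss / n

    columns = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy(sim) + cross_entropy(columns))

aligned = [[1.0, 0.0], [0.0, 1.0]]   # matching pairs score highest
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # matching pairs score lowest
print(sigmoid_loss(aligned), sigmoid_loss(shuffled))
print(softmax_loss(aligned), softmax_loss(shuffled))
```

The sigmoid version sums independent per-pair terms, whereas the softmax version needs a normalization over each full row and column of the batch.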
clip-vit-base-patch16 uses a joint image-text embedding space to score unseen label categories against input images.
1,619,880 ↓ · 164 ♡
siglip2-giant-opt-patch16-384 is Google's SigLIP 2 giant variant, a contrastively trained vision-language encoder using 16px patches at 384px input resolution. SigLIP 2 retains the sigmoid loss introduced by the original SigLIP, rather than a softmax contrastive loss, for cross-modal alignment, and improves zero-shot classification accuracy over the original SigLIP. The 'opt' variant uses optimized training recipes and targets state-of-the-art zero-shot classification quality.
1,469,944 ↓ · 43 ♡
SigLIP base/patch16 at 224px resolution is the lightweight tier of Google's sigmoid-loss vision-language pretraining model. It serves as a vision encoder for multimodal pipelines and as a standalone zero-shot classifier.
1,416,435 ↓ · 86 ♡
SigLIP2-Base with NaFlex (Native Resolution Flexible) encoding, which preserves each image's native aspect ratio by varying the length of the patch sequence rather than resizing every image to one fixed square size. This improves accuracy on images where spatial details matter. The base variant offers a smaller memory footprint than the so400m (400M-parameter) variant.
796,271 ↓ · 35 ♡
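One way to picture a variable-length patch sequence is to choose a patch grid that follows the image's shape while staying under a fixed token budget. The helper below, `flexible_grid`, is a hypothetical simplification written for illustration. It is not the preprocessing code shipped with the checkpoint, and the default patch size of 16 and budget of 256 patches are assumptions.

```python
import math

def flexible_grid(height, width, patch_size=16, max_patches=256):
    """Pick (rows, cols) of patches that roughly keep the aspect ratio.

    Simplified sketch: scale the image so its patch count fits the budget,
    then round each side down to a whole number of patches.
    """
    # Scale factor at which (h*s/p) * (w*s/p) equals max_patches exactly.
    scale = math.sqrt(max_patches * patch_size ** 2 / (height * width))
    rows = max(1, math.floor(height * scale / patch_size))
    # Clamp so extreme aspect ratios cannot exceed the budget.
    cols = max(1, min(math.floor(width * scale / patch_size), max_patches // rows))
    return rows, cols

print(flexible_grid(512, 512))   # (16, 16)
print(flexible_grid(256, 1024))  # (8, 32)
```

Both example images end up with 256 patches, but the wide image keeps its 1:4 shape as an 8 by 32 grid instead of being squashed into a square.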
PE-Core-S16-384 is Meta's Perception Encoder model at the Small/16-patch/384px configuration, designed for zero-shot image classification and visual representation learning. It is described in arXiv:2504.13181 as a general-purpose vision encoder trained for broad perceptual tasks.
768,261 ↓ · 0 ♡
SigLIP2 SO400M with NaFlex (Native Resolution Flexible) encoding, the larger, 400M-parameter counterpart of siglip2-base-patch16-naflex. NaFlex preserves each image's native aspect ratio instead of forcing a fixed square resize, which retains spatial detail. It is among the stronger SigLIP2 variants both for CLIP-style tasks and as a vision encoder in multimodal LLMs.
732,402 ↓ · 75 ♡
siglip2-so400m-patch14-384 performs zero-shot classification by measuring similarity between the image representation and natural-language class descriptions.
692,872 ↓ · 92 ♡
marqo-fashionSigLIP classifies images into arbitrary label sets without task-specific fine-tuning. It compares image embeddings to text descriptions of candidate categories.
642,819 ↓ · 83 ♡
CLIP-convnext_base_w-laion2B-s13B-b82K-augreg classifies images into arbitrary label sets without task-specific fine-tuning. It compares image embeddings to text descriptions of candidate categories.
588,279 ↓ · 9 ♡
BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 uses a joint image-text embedding space to score unseen label categories against input images.
565,824 ↓ · 413 ♡
SigLIP2 is Google's second-generation sigmoid-loss vision-language contrastive model at 400M parameters, using a 16px patch size and 256px input resolution. The sigmoid loss formulation (vs softmax in CLIP) enables independent positive/negative scoring without requiring full batch negatives. It is often used as the vision encoder in multimodal LLMs.
521,594 ↓ · 5 ♡
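The practical consequence of independent scoring shows up at inference time. In the sketch below, the logits are made-up values for an image that matches none of three candidate labels, and both helper functions are illustrative rather than part of any library.

```python
import math

def sigmoid_scores(logits):
    """Score each label independently; scores need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax_scores(logits):
    """Normalize across labels; scores always sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for an image that fits none of the candidate labels.
logits = [-4.0, -5.0, -6.0]
print(sigmoid_scores(logits))  # all scores stay low
print(softmax_scores(logits))  # one label still receives most of the mass
```

With sigmoid scoring, a poor match to every label yields low scores across the board, whereas softmax is forced to distribute a total probability of 1 among the candidates.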
CLIP-ViT-H-14-laion2B-s32B-b79K classifies images into arbitrary label sets without task-specific fine-tuning. It compares image embeddings to text descriptions of candidate categories.
416,026 ↓ · 462 ♡
siglip2-base-patch16-224 performs zero-shot classification by measuring similarity between the image representation and natural-language class descriptions.
408,496 ↓ · 111 ♡
OpenCLIP ViT-B/16 trained on LAION-2B with 34B samples seen during training. The ViT-B/16 architecture processes 16×16 patches at 224px resolution, offering better feature quality than ViT-B/32 at moderate additional cost.
402,763 ↓ · 39 ♡
CLIP-ViT-L-14-laion2B-s32B-b82K classifies images into arbitrary label sets without task-specific fine-tuning. It compares image embeddings to text descriptions of candidate categories.
373,799 ↓ · 64 ♡
PE-Core-L14-336 is an open-weight checkpoint for zero-shot image classification, distributed on the Hugging Face Hub under the Apache 2.0 license, which permits commercial reuse. It is community-maintained, so track upstream changes and pin a known-good revision.
316,732 ↓ · 52 ♡
vit_base_patch16_plus_clip_240.laion400m_e31 is an openly licensed zero-shot image classification model in the CLIP family. It is MIT-licensed, which permits use in closed-source and paid products. Evaluate it on your own data before trusting it in production.
314,216 ↓ · 1 ♡
siglip2-base-patch16-512 is an open-weight checkpoint for zero-shot image classification, distributed on the Hugging Face Hub under the Apache 2.0 license, which permits commercial reuse. Like most open checkpoints, it rewards a quick in-domain evaluation before you commit to it.
294,208 ↓ · 42 ♡
One-Align is a unified image and video quality assessment model from the Q-Future group, trained to score perceptual quality and alignment with human aesthetic preferences. It unifies image quality assessment (IQA) and video quality assessment (VQA) into a single model.
267,437 ↓ · 43 ♡
TinyCLIP-ViT-8M-16-Text-3M-YFCC15M is a compact CLIP-based model for zero-shot image classification. At roughly 8M parameters, it trades some peak accuracy for cheaper, faster inference. It is MIT-licensed, which permits commercial reuse. Before relying on it, reproduce its key numbers on representative inputs.
232,353 ↓ · 12 ♡