ViT Base (Patch 16, 384 resolution)
A base Vision Transformer model with 16x16 patch size and 384x384 input resolution for image classification.
Capabilities
vision
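As a rough illustration of what a 16x16 patch size at 384x384 input resolution implies, the sketch below computes how many patches the image is split into and the resulting token sequence length. The function name `vit_sequence_length` and the `extra_tokens` parameter are hypothetical, and the single extra classification token is an assumption based on the standard ViT design, not something this page states.

```python
def vit_sequence_length(image_size: int, patch_size: int, extra_tokens: int = 1):
    """Return (patches_per_side, num_patches, sequence_length) for a square image.

    extra_tokens=1 assumes one prepended classification token (an assumption;
    the model page does not specify this).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    return patches_per_side, num_patches, num_patches + extra_tokens


per_side, num_patches, seq_len = vit_sequence_length(384, 16)
print(per_side, num_patches, seq_len)  # 24 576 577
```

Under these assumptions, a 384x384 input yields a 24x24 grid of patches, so the transformer processes considerably more tokens per image than it would at a lower resolution with the same patch size.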