ViT Base (Patch 16, 384 resolution)

google vision
image

A base Vision Transformer model with 16x16 patch size and 384x384 input resolution for image classification.
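As a rough illustration of what the patch size and resolution imply, the sketch below computes how many tokens the transformer would see for one image. It assumes a standard ViT layout: non-overlapping square patches plus a single prepended class token. That layout is an assumption, not something stated in this listing. The helper name `vit_sequence_length` and its `extra_tokens` parameter are hypothetical and exist only for this example.

```python
def vit_sequence_length(image_size: int, patch_size: int, extra_tokens: int = 1) -> int:
    """Return the token count for a square image split into square patches.

    extra_tokens=1 assumes one class token is prepended (an assumption here,
    not confirmed by the listing); pass 0 to count image patches only.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size  # 384 // 16 = 24
    return patches_per_side ** 2 + extra_tokens  # 24 * 24 + 1


if __name__ == "__main__":
    print(vit_sequence_length(384, 16))                  # 577 (576 patches + 1 class token)
    print(vit_sequence_length(384, 16, extra_tokens=0))  # 576 patches only
```

Under the same assumptions, a 224x224 input with the same patch size would give 196 patches, so the 384 resolution variant processes roughly three times as many tokens per image.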
