Below is a breakdown of what I have planned so far. I am looking for your feedback and recommendations.
Models: I am considering dinov2-base or SigLIP-S, or even OpenCLIP ViT-B/32.
Storage and indexing: Probably Qdrant (self-hosted). I would also consider FAISS if it did not require a large amount of memory.
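For the memory concern, a rough back-of-the-envelope estimate can help. The sketch below only computes the raw size of uncompressed float32 vectors (vectors x dimensions x 4 bytes). The collection size of one million images is a hypothetical placeholder, and the dimensions 768 and 512 are assumed to be the typical output widths of ViT-B style encoders such as dinov2-base and OpenCLIP ViT-B/32; index overhead and any quantization are not modeled.

```python
def flat_index_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of an uncompressed vector table: one value per dimension per vector."""
    return num_vectors * dim * bytes_per_value


if __name__ == "__main__":
    # Hypothetical collection size; replace with the real image count.
    num_images = 1_000_000
    # Assumed embedding widths (check the model card of the model actually used).
    for dim in (768, 512):
        size_gib = flat_index_bytes(num_images, dim) / 2**30
        print(f"dim={dim}: about {size_gib:.2f} GiB of raw float32 vectors")
```

If the result fits comfortably in RAM, an in-memory index is probably an option; if not, on-disk storage or quantization becomes more relevant.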
Input problems: I have images of various sizes and aspect ratios, and all of them are fairly large (no thumbnails). Which preprocessing would you recommend: center-cropping a square, resizing straight to a square and distorting the proportions, or padding to a square and then resizing? I am worried that padding will hurt search accuracy.
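To make the trade-off between the three options concrete, here is a minimal sketch of the geometry only, in plain Python with no image library. The helper names and the 224-pixel target are assumptions for illustration; the actual input size depends on the chosen model. It shows what each option costs: center crop discards pixels, a direct resize distorts the aspect ratio, and padding spends part of the input on empty border.

```python
def center_crop_box(w: int, h: int) -> tuple[int, int, int, int]:
    """Largest centered square as (left, top, right, bottom)."""
    side = min(w, h)
    left = (w - side) // 2
    top = (h - side) // 2
    return (left, top, left + side, top + side)


def crop_loss_fraction(w: int, h: int) -> float:
    """Share of the original pixels thrown away by a center crop."""
    side = min(w, h)
    return 1 - (side * side) / (w * h)


def squash_distortion(w: int, h: int) -> float:
    """Stretch factor when forcing w x h into a square; 1.0 means no distortion."""
    return max(w, h) / min(w, h)


def pad_to_square(w: int, h: int, target: int = 224):
    """Resize so the longer side equals target, then center on a square canvas.

    Returns (resized_size, paste_offset, padding_fraction).
    """
    scale = target / max(w, h)
    new_w = max(1, round(w * scale))
    new_h = max(1, round(h * scale))
    offset = ((target - new_w) // 2, (target - new_h) // 2)
    padding_fraction = 1 - (new_w * new_h) / (target * target)
    return (new_w, new_h), offset, padding_fraction


if __name__ == "__main__":
    w, h = 4000, 3000  # hypothetical 4:3 photo
    print("crop box:", center_crop_box(w, h))
    print("pixels lost by crop:", crop_loss_fraction(w, h))
    print("squash stretch factor:", squash_distortion(w, h))
    print("pad layout:", pad_to_square(w, h))
```

For mildly non-square images all three costs are small; the choice matters most for extreme aspect ratios such as panoramas or tall scans, where it may be worth checking retrieval quality on a small sample before committing.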
Deployment: I'll compute the embeddings on my local machine, but I would like to hear suggestions for cost-efficient online hosting of the inference model.
Thank you.