Built this originally as a small competitive game; it then turned into a useful prompt-engineering practice loop.
Core mechanic: the user sees a target image, writes a prompt, the model generates an image, and we score how closely the result matches the target.
Scoring uses multiple signals so one metric doesn’t dominate:
1. Semantic alignment (CLIP)
- user_prompt -> target_image (is the prompt conceptually aligned with target?)
- user_image -> target_image (is the generated result semantically aligned with target?)
2. Prompt faithfulness (CLIP)
- user_prompt -> user_image (did generation actually follow the submitted prompt?)
3. Color similarity
- HSV histogram overlap (user_image vs target_image) for palette/tone distribution
4. Structure similarity
- HOG-lite gradient/orientation comparison (user_image vs target_image) for layout/edge composition
The final score is a weighted blend of these signals (the content/semantic signals weighted highest), normalized to player-facing points; a rough sketch follows.
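For concreteness, here's a minimal sketch of how the signals above could be computed and blended. Assumptions (not from the post): Hugging Face transformers for CLIP, OpenCV for the HSV histograms, scikit-image's standard HOG as a stand-in for "HOG-lite", and placeholder weights rather than the game's calibrated values.

```python
# Sketch of the multi-signal scorer. Assumptions (not from the post): Hugging Face
# transformers for CLIP, OpenCV for HSV histograms, scikit-image's standard HOG as a
# stand-in for "HOG-lite", and placeholder weights.
import cv2
import numpy as np
import torch
from PIL import Image
from skimage.feature import hog
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)


def clip_text_image(prompt: str, image: Image.Image) -> float:
    """Cosine similarity between a prompt and an image in CLIP embedding space."""
    with torch.no_grad():
        t = model.get_text_features(**processor(text=[prompt], return_tensors="pt", padding=True))
        i = model.get_image_features(**processor(images=image, return_tensors="pt"))
    return torch.nn.functional.cosine_similarity(t, i).item()


def clip_image_image(a: Image.Image, b: Image.Image) -> float:
    """Cosine similarity between two images in CLIP embedding space."""
    with torch.no_grad():
        fa = model.get_image_features(**processor(images=a, return_tensors="pt"))
        fb = model.get_image_features(**processor(images=b, return_tensors="pt"))
    return torch.nn.functional.cosine_similarity(fa, fb).item()


def hsv_hist_similarity(a_bgr: np.ndarray, b_bgr: np.ndarray) -> float:
    """Hue/saturation histogram intersection, normalized to [0, 1] (palette overlap)."""
    def hist(img):
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        return cv2.normalize(h, h, 0, 1, cv2.NORM_MINMAX)
    ha, hb = hist(a_bgr), hist(b_bgr)
    return cv2.compareHist(ha, hb, cv2.HISTCMP_INTERSECT) / max(float(ha.sum()), 1e-6)


def hog_similarity(a_bgr: np.ndarray, b_bgr: np.ndarray, size=(256, 256)) -> float:
    """Cosine similarity of coarse HOG descriptors (edge/layout composition)."""
    def descriptor(img):
        gray = cv2.cvtColor(cv2.resize(img, size), cv2.COLOR_BGR2GRAY)
        return hog(gray, orientations=9, pixels_per_cell=(32, 32),
                   cells_per_block=(2, 2), feature_vector=True)
    da, db = descriptor(a_bgr), descriptor(b_bgr)
    return float(np.dot(da, db) / (np.linalg.norm(da) * np.linalg.norm(db) + 1e-8))


def final_score(prompt, user_img, target_img, user_bgr, target_bgr) -> float:
    """Weighted blend of the five raw similarities, scaled to 0-100 points.
    Weights are placeholders; the content/semantic signals carry the most mass."""
    signals = {
        "prompt_vs_target": clip_text_image(prompt, target_img),
        "image_vs_target": clip_image_image(user_img, target_img),
        "prompt_vs_image": clip_text_image(prompt, user_img),
        "color": hsv_hist_similarity(user_bgr, target_bgr),
        "structure": hog_similarity(user_bgr, target_bgr),
    }
    weights = {"prompt_vs_target": 0.25, "image_vs_target": 0.35,
               "prompt_vs_image": 0.15, "color": 0.10, "structure": 0.15}
    blended = sum(weights[k] * signals[k] for k in signals)
    return round(100 * max(0.0, min(1.0, blended)), 1)
```

One calibration note: raw CLIP cosines tend to cluster in a fairly narrow band, so in practice each signal would likely need rescaling (min/max or z-score against a reference set) before blending, which is exactly the weighting/calibration question below.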
Why this approach:
- CLIP-only can overrate semantically related but visually off outputs
- color-only ignores structure/meaning
- structure-only misses semantics/style
- combining prompt-image and image-image signals reduced obvious false positives in ranking
Stack:
- Spring Boot backend
- separate CLIP scoring container (rough interface sketched after this list)
- external image generation service
- Next.js frontend
- PostgreSQL
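And a minimal sketch of how the scoring container could expose that blend over HTTP, assuming FastAPI with a base64 JSON payload; the endpoint name, payload shape, and the `scorer` module are invented for illustration, since the post doesn't say how the container is actually called by the backend.

```python
# Hypothetical HTTP surface for the scoring container (FastAPI assumed; the post
# does not specify the framework or the contract with the Spring Boot backend).
import base64
import io

import numpy as np
from fastapi import FastAPI
from PIL import Image
from pydantic import BaseModel

from scorer import final_score  # the blend sketched above, assumed to live in scorer.py

app = FastAPI()


class ScoreRequest(BaseModel):
    prompt: str
    user_image_b64: str    # generated image, base64-encoded
    target_image_b64: str  # round's target image, base64-encoded


class ScoreResponse(BaseModel):
    points: float


def decode(b64: str) -> Image.Image:
    return Image.open(io.BytesIO(base64.b64decode(b64))).convert("RGB")


@app.post("/score", response_model=ScoreResponse)
def score(req: ScoreRequest) -> ScoreResponse:
    user_img = decode(req.user_image_b64)
    target_img = decode(req.target_image_b64)
    # OpenCV/HOG signals want BGR arrays; PIL gives RGB, so reverse the channel axis.
    user_bgr = np.array(user_img)[:, :, ::-1].copy()
    target_bgr = np.array(target_img)[:, :, ::-1].copy()
    return ScoreResponse(points=final_score(req.prompt, user_img, target_img,
                                            user_bgr, target_bgr))
```

Keeping scoring behind one stateless endpoint like this would also make it easy to replay stored rounds against new weights when recalibrating.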
Would love technical feedback on:
- metric weighting/calibration
- known failure modes I should benchmark
- alternatives to HOG-lite for fast structural scoring