The 5 tasks:
1. Classify JD lines (requirement vs boilerplate)
2. Split requirements into required vs preferred
3. Disambiguate skill mentions (Python-the-language vs Python-the-ecosystem)
4. Textual entailment (does resume experience satisfy a requirement?)
5. Semantic embeddings for similarity search
All five share MiniLM-L6 (22M params). Before: 5 x ~91MB fine-tuned copies — essentially the same encoder with different weights, burning RAM for no reason on a 4 vCPU / 8GB box. The obvious idea: share the encoder. One copy in memory, five lightweight heads (~580KB each) routing its output to task-specific predictions.
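A minimal sketch of that shape (names, label counts, and head architecture are illustrative, not the production code; a 384→384 projection plus classifier happens to land right at the ~580KB-per-head figure in fp32):

```python
import torch.nn as nn
from transformers import AutoModel

TASKS = {"jd_line": 2, "req_vs_pref": 2, "skill_sense": 2, "entailment": 2}

class SharedEncoder(nn.Module):
    def __init__(self, checkpoint="sentence-transformers/all-MiniLM-L6-v2"):
        super().__init__()
        # One ~22M-param MiniLM in memory, shared by every task.
        # (Checkpoint shown is the one I'd pick next time; see the end of the post.)
        self.encoder = AutoModel.from_pretrained(checkpoint)
        h = self.encoder.config.hidden_size  # 384 for MiniLM-L6
        # Per-task heads: Linear(384, 384) alone is ~591KB in fp32.
        self.heads = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(h, h), nn.Tanh(), nn.Linear(h, n))
            for name, n in TASKS.items()
        })

    def forward(self, task, **enc_inputs):
        cls = self.encoder(**enc_inputs).last_hidden_state[:, 0]  # CLS vector
        return cls if task == "embed" else self.heads[task](cls)
```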
Attempt 1: Frozen encoder, linear heads. Cache CLS embeddings once, train heads on the cached vectors. Fast, but every head lost 10-15% accuracy: the pretrained representations weren't tuned for our tasks.
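Attempt 1 in sketch form (batching details and the head-training loop elided; helper name is mine):

```python
import torch

@torch.no_grad()
def cache_cls(encoder, tokenizer, texts, batch_size=64):
    """Run the frozen encoder once over the corpus; keep only CLS vectors."""
    encoder.eval()
    chunks = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True,
                          truncation=True, return_tensors="pt")
        chunks.append(encoder(**batch).last_hidden_state[:, 0])
    return torch.cat(chunks)  # heads then train on these in seconds
```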
Attempt 2: Multi-task fine-tuning, 4 objectives. Unfreeze, train all four heads simultaneously with alternating batches, differential LR (encoder 2e-5, heads 1e-3). Classification recovered to within 2% of standalone. But embedding quality collapsed — "Python programming" and "cooking recipes" hit 0.91 cosine similarity. Classification objectives pushed all representations into a small region where heads could classify, destroying the distance structure embeddings need.
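The loop looked roughly like this (reusing `SharedEncoder` from above; `dataloaders` and `total_steps` are assumed, and which task got which loss weight is an assumption, but the two learning rates are the real ones):

```python
import itertools
import torch
import torch.nn.functional as F

model = SharedEncoder()

# Differential LR: the encoder moves slowly, the heads move fast.
optimizer = torch.optim.AdamW([
    {"params": model.encoder.parameters(), "lr": 2e-5},
    {"params": model.heads.parameters(), "lr": 1e-3},
])

# See the task-weighting gotcha below: the 65K-example task needed a 3x
# weight to survive next to the 1.6M-example one. Values are illustrative.
loss_weight = {"jd_line": 1.0, "req_vs_pref": 3.0,
               "skill_sense": 1.0, "entailment": 1.0}

streams = {t: iter(itertools.cycle(dl)) for t, dl in dataloaders.items()}
tasks = list(streams)
for step in range(total_steps):
    task = tasks[step % len(tasks)]           # alternate one batch per task
    enc_inputs, labels = next(streams[task])
    logits = model(task, **enc_inputs)
    loss = loss_weight[task] * F.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```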
Attempt 3: Add contrastive objective as 5th task. Cosine similarity loss on positive/negative pairs alongside the four classification objectives. Explicitly penalizes the collapse. Encoder now has two competing incentives: make CLS tokens classifiable AND keep similar texts close / dissimilar texts far apart.
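One plausible implementation of that fifth objective is torch's built-in CosineEmbeddingLoss, where the target is +1 for positive pairs and -1 for negatives (pair mining and the margin value are assumptions, not from the post):

```python
import torch.nn as nn

cos_loss = nn.CosineEmbeddingLoss(margin=0.25)  # margin is a guess

def contrastive_step(model, batch_a, batch_b, target):
    # target: +1 where (a, b) should be similar, -1 where they shouldn't.
    emb_a = model("embed", **batch_a)  # raw CLS vectors, no head
    emb_b = model("embed", **batch_b)
    return cos_loss(emb_a, emb_b, target)  # gets 2x weight; see gotchas
```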
Gotchas:
Task weighting. The smallest dataset (65K examples) was drowned out by the largest (1.6M). A 3x weight fixed it.
Embedding objective needs mass. 270K contrastive pairs against 2.1M classification examples wasn't enough. Scaling to 1M pairs at 2x loss weight was.
Some heads need independent training. The required-vs-preferred head wouldn't converge in the multi-task setup. The diagnostic (sketched after this list) showed the encoder representations already separated the classes (0.80 within-class vs 0.02 between-class cosine similarity): the encoder was fine, the head was the problem. Froze the encoder, retrained just that head in 30 seconds. 99% accuracy.
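The collapse diagnostic from that last gotcha, in minimal form (assumes you have CLS vectors and integer class labels for one task's validation set):

```python
import torch
import torch.nn.functional as F

def cosine_separation(vecs, labels):
    """Mean within-class vs between-class cosine similarity.

    Healthy: high within, low between (the 0.80 / 0.02 split above).
    Collapsed: both high, i.e. everything sits in one small region.
    """
    vecs = F.normalize(vecs, dim=1)
    sims = vecs @ vecs.T                            # pairwise cosine sims
    same = labels[:, None] == labels[None, :]
    self_pairs = torch.eye(len(vecs), dtype=torch.bool)
    return (sims[same & ~self_pairs].mean().item(),  # within-class
            sims[~same].mean().item())               # between-class
```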
Result: one 22.9MB INT8 encoder + five ~580KB heads = 25MB total. Also a DeBERTa encoder (68.5MB) with two heads for token-level tasks (NER + section segmentation). Total: 94MB for 7 models, $11/month VPS, zero API costs.
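The post-training shrink: I won't claim which quantization route produced the 22.9MB encoder here, but PyTorch's dynamic INT8 quantization of the linear layers is the standard way to get roughly that 4x reduction from a 91MB fp32 model (ONNX Runtime quantization is the other common choice):

```python
import torch

int8_encoder = torch.ao.quantization.quantize_dynamic(
    model.encoder,        # fp32 MiniLM, ~91MB on disk
    {torch.nn.Linear},    # quantize only the linear layers
    dtype=torch.qint8,    # int8 weights, ~4x smaller
)
```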
The kicker: the matching score went UP (71 → 75) because the entailment head became more accurate. It catches experience that satisfies a requirement without sharing its keywords, which a keyword matcher would miss. Pipeline latency: 19s → 8.7s. Consolidated for RAM, got better accuracy as a bonus.
One thing I'd do differently: start from sentence-transformers/all-MiniLM-L6-v2, not the cross-encoder checkpoint. The sentence-transformer's embedding space is already oriented toward similarity, while the cross-encoder's is oriented toward ranking, so it's a better starting point for the contrastive objective.
Happy to answer questions about multi-task training, the embedding-collapse diagnostic, quantization, or production deployment.
Soft launch — product is live at the URL, free to try one report. Feedback welcome, especially from anyone who's been through ATS-backed applications. And if you want to argue with the architecture choices, even better.