However:

Batch Processing:
  Average batch size: 10
  Time per batch: 13.03ms
  Time per example in batch: 1.30ms
TASK SUMMARY WITH TIMING
================================================================================
Task                                     Correct  Total  Accuracy  Med Time (ms)
--------------------------------------------------------------------------------
Emotion Classification                        10     10    100.0%           1.30
Toxicity Classification                        9      10     90.0%           1.29
Sentiment Classification                      10     10    100.0%           1.34
Domain Classification                         8      10     80.0%           1.30
Sarcasm Detection                             6      10     60.0%           1.34
Scam Detection                                7      10     70.0%           1.31
Age Appropriateness Classification            4      10     40.0%           1.28
Urgency Level Classification                  4      10     40.0%           1.25
Privacy Policy Classification                 9      10     90.0%           1.32
Dialogue Speaker Classification               8      10     80.0%           1.29
Book Review Sentiment                         10     10    100.0%           1.25
Empathetic Direction Classification           10     10    100.0%           1.29
Virtual Assistant Action Classification       6      10     60.0%           1.37
--------------------------------------------------------------------------------
OVERALL                                       101    130     77.7%
================================================================================
It can do interesting things.
This has a lot of caveats and limitations. However, the model is available for download via a script in the repo, and the exact benchmarks I used are available. The white paper gets into theory and application, and covers a lot of limitations and interesting differences from transformers in terms of training and prompting behavior. It also includes extensive appendices (over 100 pages) on the training datasets used and on performance across the ~260 (I think?) NIV2 tasks in its validation dataset.
Running inference for the DSRU model + BGE embedding model together takes a bit shy of 10GB of VRAM, and the reference comparison model -- Zephyr 7B -- takes about 15GB of VRAM.
throwawayffffas•6mo ago
Wouldn't it be easier and more ergonomic for users to have dedicated models for each of these tasks?
orderone_ai•6mo ago
I would say that ease of use and deployment is actually a good reason to have a single model.
We don't train 20 LLMs for different purposes - we train one (or, I guess 3-4 in practice, each with their own broad specialization), and then prompt it for different tasks.
This simplifies deployment, integration, upgrading, etc.
This model is basically the same - rather than being restricted to a single classification task, it's prompted per task. This means a user can complete a new task with a new prompt, not a new model.
throwawayffffas•6mo ago
That's the feeling I have when I try to use LLMs for more general language processing.
Have you run into cases where the model "forgets" the task at hand and switches to another mid-stream?
Regardless of all of the above, it looks to me like your choice of reasoning and problem solving in the latent space is a great one, and where we should be collectively focusing our efforts. Keep up the good work.
orderone_ai•6mo ago
It's a vec2vec architecture - it takes in 3 bge-large embeddings (one each of the task, the input data, and the vocabulary) and outputs 1 bge-large embedding of the answer.
That's the DSRU part.
What makes it a classifier is that later, outside of the model, we do a nearest neighbor search for our vocabulary items using our answer vector. So it will output something from the labels no matter what - the nearest neighbor search will always have something closest, even if the model went a little crazy internally.
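To make that concrete, here's a minimal sketch of that decode step in Python, assuming bge-large via sentence-transformers. `dsru` is a hypothetical stand-in for the model's forward pass, and joining the labels into one string for the vocabulary embedding is my assumption - the actual loading and inference code lives in the repo:

    # Minimal sketch, not the repo's API. `dsru` is a hypothetical stand-in
    # for the DSRU forward pass; vocabulary handling here is an assumption.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

    def classify(task_prompt: str, text: str, labels: list[str]) -> str:
        # Three bge-large embeddings in: task, input data, and vocabulary.
        task_vec = embedder.encode(task_prompt)
        input_vec = embedder.encode(text)
        vocab_vec = embedder.encode(", ".join(labels))

        # One bge-large embedding out: the answer vector.
        answer_vec = dsru(task_vec, input_vec, vocab_vec)

        # Nearest-neighbor search over the label embeddings, outside the
        # model. Cosine similarity always has a winner, so a valid label
        # comes out even if the answer vector is off in the weeds.
        label_vecs = embedder.encode(labels)
        sims = label_vecs @ answer_vec / (
            np.linalg.norm(label_vecs, axis=1) * np.linalg.norm(answer_vec)
        )
        return labels[int(np.argmax(sims))]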
The prompts here tend to be very straightforward. Things like:

"Is this book review positive or negative?"
"Is this person sharing something happy or venting?"
"Determine the logical relationship between the premise and hypothesis. Answer with: entailment, neutral, or contradiction."
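With the hypothetical `classify` helper sketched above, the first of those prompts would run as (the label set is my guess for illustration):

    label = classify(
        "Is this book review positive or negative?",
        "Couldn't put it down - the best thing I've read all year.",
        ["positive", "negative"],
    )
    # The nearest-neighbor decode guarantees the result is one of the two labels.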
It has limited use cases, but where it's good, it should be very, very good - the insane speed, deterministic output, and forced label output make it great for a lot of common, cheap tasks.