I've seen a lot of purple-gradient websites made by LLMs, and I was curious whether we can change a model's favorite color by messing with its activations instead of the prompt.
So I used Representation Engineering with just 1 pair of contrastive prompts to make Mistral-7B prefer orange as its favorite color.
I used the repeng library by Theia to test this out.
Next I'm gonna implement it from scratch to understand why this even works.
I wonder if we can introduce "taste" into a model with methods like this.
The paper is "Representation Engineering: A Top-Down Approach to AI Transparency".
13point5•3h ago
Link: https://arxiv.org/abs/2310.01405
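
For anyone wondering what "1 pair of contrastive prompts" boils down to, here is a rough toy sketch of the idea in plain Python. It is not repeng's API or the paper's exact procedure: the hidden states are made-up 4-number lists standing in for one layer's activations, the function names are hypothetical, and taking the normalized difference of the two hidden states as the steering direction is my assumption about the simplest single-pair version.

```python
import math

def steering_vector(pos_hidden, neg_hidden):
    """Unit-length difference between the two prompts' hidden states."""
    diff = [p - n for p, n in zip(pos_hidden, neg_hidden)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def steer(hidden, direction, strength):
    """Add the scaled direction to a hidden state (the 'control' step)."""
    return [h + strength * d for h, d in zip(hidden, direction)]

def projection(hidden, direction):
    """How far a hidden state points along the direction (dot product)."""
    return sum(h * d for h, d in zip(hidden, direction))

# Toy stand-ins (assumed values) for one layer's hidden state on each prompt,
# e.g. "my favorite color is orange" vs. "my favorite color is purple".
orange_hidden = [0.9, 0.1, 0.4, -0.2]
purple_hidden = [0.1, 0.8, 0.4, -0.2]
direction = steering_vector(orange_hidden, purple_hidden)

# Steering an unrelated hidden state pushes it toward the "orange" side.
neutral_hidden = [0.3, 0.5, 0.0, 0.1]
steered = steer(neutral_hidden, direction, strength=2.0)
before = projection(neutral_hidden, direction)
after = projection(steered, direction)
print(before, after)
```

Because the direction has unit length, the projection moves by exactly the chosen strength. In a real model the same addition would happen to the residual stream at one or more layers during generation, and the strength would need tuning so the output stays coherent.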