> We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%.
I'm posting this because it seems very important for users of LLMs, and for devs building LLMs into their own solutions.
The fall-off in accuracy is far steeper and larger than I had imagined.
Someone should really turn this into an ongoing benchmark that evaluates new models as they are released. Or this information should be included in every model's system card.
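In the meantime, here is a rough, hedged sketch of the kind of check anyone can run on a single model: a simple needle-in-a-haystack probe that measures retrieval accuracy as the filler context grows. This is not the paper's benchmark; it assumes the OpenAI Python SDK with an OPENAI_API_KEY in the environment, and the model name, filler sentence, and word-count-as-token proxy are all placeholders/simplifications.

    # Rough illustrative sketch, not the paper's benchmark: bury a "needle"
    # in growing amounts of filler text and check whether the model can
    # still retrieve it. Assumes the OpenAI Python SDK and OPENAI_API_KEY.
    import random
    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4o"  # placeholder; swap in whatever model you want to test

    def run_trial(n_filler_words: int) -> bool:
        secret = str(random.randint(100000, 999999))
        # The filler sentence is 9 words, so repeat it ~n_filler_words/9 times.
        filler = "the sky was grey and the road was long. " * (n_filler_words // 9)
        words = filler.split()
        # Bury the needle at a random depth in the filler text.
        pos = random.randint(0, len(words))
        words.insert(pos, f"(The secret code is {secret}.)")
        prompt = " ".join(words) + "\n\nWhat is the secret code? Reply with only the number."
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
        )
        return secret in (resp.choices[0].message.content or "")

    for n in (500, 4000, 16000, 64000):  # rough word counts, not exact token counts
        hits = sum(run_trial(n) for _ in range(10))
        print(f"~{n} words of filler: {hits}/10 correct")

Note that this only tests literal retrieval, which is much easier than the degradation the quoted numbers describe, so treat it as a quick sanity check rather than a reproduction of the paper's results.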