I’ve been working with large language models from across the spectrum, including OpenAI’s GPT models, Anthropic’s models, LLaMA, and others, for quite some time. For most of this journey, I leaned towards selecting larger models (with larger context windows) and cramming in as much context as possible, in hopes of getting better inferences and domain-specific reasoning. However, after the release of GPT-3.5-Turbo-16K, I realized that the capabilities of these LLMs do not necessarily scale with the context window. Interestingly, the same model might perform worse as the context length increases.
I recently came across a paper that further reinforces my intuition about the role context plays in model performance. The authors investigate how the length of the context and, more importantly, the position of the relevant information within it affect the overall performance of the model. Their major finding is that result quality degrades as the relevant context moves toward the middle of the prompt, and is at its best when the relevant context sits at either the very beginning or the very end. Plotting performance against the position of the relevant context therefore yields a U-shaped curve.
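One practical takeaway, if this finding holds for your workload, is to arrange retrieved context so the most relevant pieces land at the edges of the prompt rather than in the middle. Here is a minimal sketch of that idea; it is my own illustration, not a method from the paper, and the function name and chunk format are assumptions.

```python
# Sketch: reorder relevance-ranked chunks so the highest-ranked ones sit at
# the start and end of the prompt, and the least relevant drift to the middle,
# where the paper suggests models pay the least attention.
# (Illustrative only; order_for_prompt is a hypothetical helper, not from the paper.)

def order_for_prompt(ranked_chunks):
    """Place the most relevant chunks at the edges of the prompt.

    ranked_chunks: list of strings, most relevant first.
    Returns a reordering where rank 1 goes first, rank 2 last, rank 3 second,
    rank 4 second-to-last, and so on, so relevance decays toward the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        if i % 2 == 0:
            front.append(chunk)   # even positions fill from the start
        else:
            back.append(chunk)    # odd positions fill from the end
    return front + back[::-1]


if __name__ == "__main__":
    chunks = [f"doc_{rank}" for rank in range(1, 6)]  # doc_1 is most relevant
    print(order_for_prompt(chunks))
    # ['doc_1', 'doc_3', 'doc_5', 'doc_4', 'doc_2']
```

Whether this kind of reordering actually helps will depend on the model and task, so it’s worth measuring against a simple top-to-bottom ordering before adopting it.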
I believe this paper is one of the most significant pieces of literature that anyone working with LLMs can benefit from in their day-to-day work. It’s something I would place right alongside the ‘Chain of Thought’ paper.