Context Length Experiment¶
An experiment to test Gemini 1.5 Flash's ability to answer questions about up to 1 million tokens of context.
Gemini 1.5 Flash, with its million-token context capability, represents a significant advancement in LLMs. This expanded context length opens up new possibilities for processing and understanding vast amounts of information. However, it also raises important questions about how effectively such models can utilize this extensive context in practical applications. How can we really know what to expect from an analysis of a 1 million token context?
Traditionally, long-context models have been evaluated using "needle in a haystack" tests, where specific and usually irrelevant information is hidden within a large context to assess the model's retrieval capabilities. While valuable, these tests don't fully explore a model's ability to reason across and synthesize information from its entire context - a crucial skill for many real-world applications. They also lean on irrelevant needles, which I believe gives the LLM a crutch: it can simply spot anomalies in the text rather than actually understand it.
This study aims to address this gap by conducting a comprehensive evaluation of Gemini 1.5 Flash's question-answering capabilities across varying context lengths. Using a dataset derived from the Apple App Store, we design an experiment that systematically increases the context from 50,000 tokens to the full million-token capacity.
Our primary objectives are to:
- Assess Gemini 1.5 Flash's performance in answering specific questions as the context length increases.
- Explore the practical implications of using such large context lengths in real-world scenarios.
The experiment involves a set of questions about a curated set of apps, requiring the model to synthesize information from different parts of the context. By incrementally increasing the context length, we aim to understand not just the model's information retrieval capabilities, but its ability to reason across vast amounts of data.
Setup¶
You will need the following to recreate this experiment:
- A LangSmith account and API key
- A Google AI Studio API key
- An OpenAI API key
Create a copy of the .env.sample file, save it as .env, and add your API keys.
Install the necessary libraries:
%pip install -qU pandas tiktoken langchain langchain-openai langchain-google-genai matplotlib langsmith python-dotenv seaborn
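With the keys saved and the libraries installed, a quick sanity check confirms the environment variables are actually being picked up. A minimal sketch, assuming the variable names in .env.sample are LANGCHAIN_API_KEY, GOOGLE_API_KEY, and OPENAI_API_KEY:

```python
import os
from dotenv import load_dotenv

load_dotenv()

# Variable names assumed from a typical .env.sample; adjust to match yours
required_keys = ["LANGCHAIN_API_KEY", "GOOGLE_API_KEY", "OPENAI_API_KEY"]
missing = [key for key in required_keys if not os.getenv(key)]
if missing:
    raise EnvironmentError(f"Missing environment variables: {missing}")
```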
Data Collection and Preparation¶
Our experiment utilizes the App Store Apple Data Set (10k apps) from Kaggle. This dataset was chosen for its rich information about various apps, allowing for deterministic question generation and answer validation using pandas operations.
We start by loading the necessary libraries and the two main CSV files from the dataset:
import pandas as pd
from langsmith import Client
from dotenv import load_dotenv
load_dotenv()
client = Client()
app_data_df = pd.read_csv('./data/AppleStore.csv')
descriptions_df = pd.read_csv('./data/appleStore_description.csv')
The data is then merged into a single dataframe:
full_app_df = pd.merge(app_data_df, descriptions_df, on='id', how='left')
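Since this is a left join on id, a quick check that the row count is preserved catches duplicate ids early:

```python
# A left join on a unique id should keep exactly one row per app
assert len(full_app_df) == len(app_data_df), "Duplicate ids in the description file"
full_app_df.shape
```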
Data Cleaning and Preprocessing¶
To prepare the data for our experiment, we perform several cleaning and preprocessing steps:
- Select relevant columns and rename for clarity
- Convert app size from bytes to megabytes
- Remove special characters from app names
- Filter out rows with empty names or zero ratings
- Remove entries containing CJK characters in the description
Here's the code implementing these steps:
new_df = full_app_df[['id', 'track_name_x', 'size_bytes_x', 'currency',
'price', 'rating_count_tot', 'user_rating', 'ver', 'prime_genre', 'app_desc']]
new_df = new_df.rename(columns={'track_name_x': 'name', 'size_bytes_x': 'size'})
new_df['size'] = new_df['size'] / (1024 * 1024) # Convert to MB
new_df['name'] = new_df['name'].str.replace(r"[^a-zA-Z0-9\s]+", "", regex=True)
new_df = new_df[new_df['name'].str.strip() != ""]
new_df = new_df[new_df['rating_count_tot'] != 0]
new_df = new_df[new_df['app_desc'].str.contains(r'[\u4e00-\u9fff]') == False]
new_df = new_df.sort_values(by='app_desc')
new_df.head()
| | id | name | size | currency | price | rating_count_tot | user_rating | ver | prime_genre | app_desc |
|---|---|---|---|---|---|---|---|---|---|---|
| 6548 | 1134867821 | NOT ALONE Story of a bird | 116.121094 | USD | 2.99 | 1 | 3.0 | 1.1 | Games | ! Now on X'mas special sales (~2017 Jan. 3rd) ... |
| 6751 | 1145500015 | Drifty Chase | 180.987305 | USD | 0.00 | 1631 | 4.5 | 1.7 | Games | !! 2016 Very Big Indie Pitch finalist at PGCon... |
| 2493 | 823804745 | Multiplayer Terraria edition | 15.058594 | USD | 3.99 | 6981 | 4.0 | 1.5 | Games | !!! First and the only app which allows to pla... |
| 3273 | 949876643 | Lumyer augmented reality camera effects | 116.251953 | USD | 0.00 | 3896 | 4.5 | 4.0.1 | Photo & Video | !!! NEW !!! TAP EFFECTS\nTry the new Tap Effe... |
| 5519 | 1086929344 | Dancing with the Stars The Official Game | 334.543945 | USD | 0.00 | 1098 | 4.0 | 2.7 | Games | !!! Please note this app does not currently su... |
Golden Dataset Selection¶
To create a controlled subset for our questions, we select a "golden dataset" of five apps. Rather than random selection, which could introduce variability across experiments, we chose a deterministic approach:
golden_df = new_df[new_df['app_desc'].str.startswith(('K', 'k'))].head(5)
clean_df = new_df.drop(golden_df.index)
golden_df
| | id | name | size | currency | price | rating_count_tot | user_rating | ver | prime_genre | app_desc |
|---|---|---|---|---|---|---|---|---|---|---|
| 1350 | 529997671 | Disney Channel Watch Full Episodes Movies TV | 125.921875 | USD | 0.00 | 21082 | 3.5 | 5.7.0 | Entertainment | K.C. Undercover, Liv & Maddie, Bunk’d and more... |
| 3455 | 965789238 | 1000 | 73.989258 | USD | 0.00 | 23 | 4.5 | 3.6.5 | Shopping | KAOLA.COM is China 's largest overseas commodi... |
| 3793 | 994674676 | Sago Mini Superhero | 171.169922 | USD | 2.99 | 30 | 3.5 | 1.1 | Education | KAPOW! Jack the rabbit bursts into the sky as ... |
| 1965 | 645949180 | Jelly Splash | 132.311523 | USD | 0.00 | 21601 | 4.0 | 3.13.0 | Games | KICK BACK AND SPLASH!\n\nJoin those delicious ... |
| 5076 | 1070850573 | KQ MiniSynth | 19.365234 | USD | 5.99 | 15 | 5.0 | 1.7.4 | Music | KQ MiniSynth is a polyphonic modular synthesiz... |
This method selects the first five apps whose descriptions start with 'K' or 'k'. While somewhat arbitrary, this approach ensures:
- Reproducibility across experiments
- An arbitrary selection that avoids hand-picking favorable apps
By separating our golden dataset from the main dataframe, we can ensure these apps are always included in our context, regardless of the token limit, while filling the remaining context with other app data.
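A small check makes that separation explicit: the five golden apps live only in golden_df, and clean_df holds everything else:

```python
# The golden apps should be fully separated from the filler pool
assert len(golden_df) == 5
assert golden_df.index.intersection(clean_df.index).empty
print(golden_df["name"].tolist())
```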
Evaluation Dataset¶
Our evaluation dataset consists of three carefully crafted questions designed to test Gemini 1.5 Flash's ability to synthesize information across the App Store data context:
examples = [
{
"question": "Do the 'Sago Mini Superhero' and 'Disney Channel Watch Full Episodes Movies TV' apps require internet connection?",
"answer": "You can play Sago Mini Superhero without wi-fi or internet. Internet is required for Disney Channel Watch Full Episodes Movies TV"
},
{
"question": "Where can I find the privacy policy for the 'Disney Channel Watch Full Episodes Movies TV' app?",
"answer": "http://disneyprivacycenter.com/"
},
{
"question": "Which one costs less? The 'KQ MiniSynth' app or the 'Sago Mini Superhero' app?",
"answer": "The 'KQ MiniSynth' app costs $5.99, the 'Sago Mini Superhero' app costs $2.99. So 'Sago Mini Superhero' is cheaper"
}
]
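Because the reference answers come straight from the dataframe, they can be double-checked deterministically with pandas before being uploaded. For example, the price comparison in the third question:

```python
# Verify the price-comparison answer directly against the golden dataframe
prices = golden_df.set_index("name")["price"]
print(prices[["KQ MiniSynth", "Sago Mini Superhero"]])
assert prices["Sago Mini Superhero"] < prices["KQ MiniSynth"]
```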
We then upload the examples to LangSmith as a dataset, if they haven't already been uploaded:
dataset_name = "AppStore Q&A"
def make_dataset():
"""Make and fill dataset if it doesnt exist"""
if client.has_dataset(dataset_name=dataset_name):
return client.read_dataset(dataset_name=dataset_name)
dataset = client.create_dataset(
dataset_name=dataset_name, description="App Store Data questions and answers")
for example in examples:
client.create_example(
inputs={"question": example["question"]}, outputs={"answer": example["answer"]}, dataset_name=dataset.name)
return dataset
dataset = make_dataset()
Experiment¶
Our experiment is designed to test Gemini 1.5 Flash's performance across increasing context lengths. To execute this, we need several components:
- The App Store dataset
- Our evaluation dataset (Q&As)
- An evaluation function
- A prediction function
Evaluation Function¶
The evaluation function is a critical component of our experiment. It's responsible for:
- Comparing the model's output to the correct answer
- Providing a detailed assessment of the answer's correctness
- Assigning a binary score (correct or incorrect)
Key aspects of this implementation:
- GPT-4o as Judge: We use GPT-4o (via the `ChatOpenAI` class) to evaluate the answers, which allows for a nuanced assessment of the responses. If you haven't worked with LLMs as judges, it may sound unreliable to have an LLM grade an LLM, but it works well in scenarios like this. You can think of it this way: GPT-4o is more than capable of validating that an answer matches the correct answer we give it, and that is a much simpler task than the one we are giving Gemini Flash of actually generating the correct answer.
- Structured Output: Using `with_structured_output(EvaluationSchema)` ensures that our evaluation consistently provides both reasoning and a binary correctness judgment that we can use programmatically.
- Detailed Evaluation: The system prompt instructs the GPT-4o model to provide thorough reasoning, considering partial correctness and nuances in the answers.
- Binary Scoring: While the evaluation includes detailed reasoning, the final score is binary (0 or 1) for simplicity in aggregating results across multiple questions and context lengths (see the sketch after this list).
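The binary score is what keeps that aggregation trivial: per-context-length accuracy is just a group-by mean. A toy sketch with made-up scores:

```python
import pandas as pd

# Hypothetical correctness scores for two context lengths (illustration only)
scores = pd.DataFrame({
    "tokens": [50_000, 50_000, 50_000, 100_000, 100_000, 100_000],
    "correctness": [1, 1, 0, 1, 1, 1],
})
print(scores.groupby("tokens")["correctness"].mean())
```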
from langchain_openai import ChatOpenAI
from langchain.pydantic_v1 import BaseModel, Field
from langchain.schema import SystemMessage, HumanMessage
from langsmith.schemas import Run, Example
class EvaluationSchema(BaseModel):
"""An evaluation schema for assessing the correctness of an answer"""
reasoning: str = Field(
description="Detailed reasoning for the evaluation score")
correct: bool = Field(
description="Whether the user's answer is correct or not")
def qa_eval(root_run: Run, example: Example):
"""Evaluate the correctness of an answer to a given question"""
question = example.inputs["question"]
user_answer = root_run.outputs["output"]
correct_answer = example.outputs["answer"]
if not question or not user_answer or not correct_answer:
return {
"score": 0,
"key": "correctness",
"comment": "Question, user's answer, or correct answer is missing"
}
llm = ChatOpenAI(
model="gpt-4o", temperature=0.4).with_structured_output(EvaluationSchema)
system_prompt = f"""You are a judge tasked with evaluating a user's answer to a given question.
You will be provided with the question, the correct answer, and the user's thought process and answer.
Question:
{question}
Correct Answer:
{correct_answer}
Your job is to assess the user's answer and provide:
1. Detailed reasoning for your evaluation, comparing the user's answer to the correct answer
2. A boolean judgment on whether the user's answer is correct or not
Be thorough in your reasoning and accurate in your judgment. Consider partial correctness and any nuances in the answers."""
evaluation: EvaluationSchema = llm.invoke(
[SystemMessage(content=system_prompt),
HumanMessage(content=user_answer)]
)
score = 1 if evaluation.correct else 0
return {
"score": score,
"key": "correctness",
"comment": evaluation.reasoning
}
This evaluation function allows us to assess Gemini 1.5 Flash's performance consistently across different context lengths and questions. By using a language model (GPT-4o) as the judge, we can capture subtle aspects of correctness that might be missed by simpler, rule-based evaluation methods.
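For contrast, a purely rule-based evaluator might look like the hypothetical baseline below (not used in this experiment). It would mark a correct but differently worded answer as wrong, which is exactly the nuance the LLM judge captures.

```python
def exact_match_eval(root_run: Run, example: Example):
    """Hypothetical rule-based baseline: only credits answers that contain the reference text verbatim."""
    reference = example.outputs["answer"].strip().lower()
    prediction = (root_run.outputs.get("output") or "").strip().lower()
    return {
        "score": 1 if reference in prediction else 0,
        "key": "exact_match",
        "comment": "Substring match against the reference answer",
    }
```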
Prediction Function¶
The prediction function is a crucial component of our experiment. It's responsible for:
- Taking a question from our evaluation dataset
- Generating a context of appropriate length
- Querying Gemini 1.5 Flash with the question and context
- Returning the model's response
Here's our implementation:
from tiktoken import get_encoding
import random
import matplotlib.pyplot as plt
import seaborn as sns
# Gemini's 1M token limit
max_context_limit = 1000000
# Util Functions
def count_tokens(text: str):
"""Count the number of tokens in a string"""
encoder = get_encoding("cl100k_base")
return len(encoder.encode(text))
def row_to_string(row):
"""Convert a row to a string"""
app_string = f"""App Name: {row.name}
Size: {round(row.size, 2)} MB
Price: {row.price} {row.currency}
Rating Count: {row.rating_count_tot}
User Rating: {row.user_rating}
Version: {row.ver}
Genre: {row.prime_genre}
Description: {row.app_desc}"""
return app_string
def get_context(tokens: int):
"""Get the context for a given number of tokens"""
    # Combine the golden apps with the rest of the pool (clean_df already excludes them)
    combined_df = pd.concat([golden_df, clean_df])
app_strs: list[str] = []
delimiter = "\n================\n"
for i, row in enumerate(combined_df.itertuples()):
row_str = row_to_string(row)
num_tokens = count_tokens(
f"{delimiter.join(app_strs)}{delimiter}{row_str}")
        if num_tokens < tokens: # If we haven't hit the token limit, add the row
app_strs.append(row_str)
else:
break
# Randomize app strings
random.shuffle(app_strs)
return delimiter.join(app_strs)
def visualize_test_results(experiments):
"""Display a graph of the test results"""
# Step 1: Extract and process data
all_results = []
for exp in experiments:
df = client.get_test_results(project_name=exp["results"].experiment_name)
df['tokens'] = exp['tokens']
all_results.append(df)
# Combine all results into a single dataframe
combined_df = pd.concat(all_results, ignore_index=True)
# Step 2: Sort by token count
combined_df = combined_df.sort_values('tokens')
# Get unique questions
questions = combined_df['input.inputs.question'].unique()
# Create a color palette
color_palette = sns.color_palette("husl", n_colors=len(questions))
# Step 3: Create the line graph
fig, ax = plt.subplots(figsize=(12, 8))
for i, question in enumerate(questions):
question_data = combined_df[combined_df['input.inputs.question'] == question]
ax.plot(question_data['tokens'], question_data['feedback.correctness'],
label=f'Question {i+1}', color=color_palette[i])
ax.set_title('Test Results by Token Count')
ax.set_xlabel('Number of Tokens')
ax.set_ylabel('Correctness Score')
ax.legend(title='Questions', loc='center left', bbox_to_anchor=(1, 0.5))
# Add questions as text below the graph
fig.text(0.1, 0.02, "Questions:", fontweight='bold')
for i, question in enumerate(questions):
fig.text(0.1, -0.02 - 0.03*i, f"{i+1}. {question}", fontsize=8, wrap=True)
plt.tight_layout()
plt.subplots_adjust(bottom=0.3) # Adjust this value to fit all questions
return plt
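Before wiring these utilities into the experiment, it's worth a quick spot-check that get_context stays within the requested token budget (token counting over large contexts is slow, so keep the budgets modest):

```python
# Spot-check: the assembled context should come in under each requested budget
for budget in (50_000, 100_000):
    context = get_context(budget)
    print(f"requested {budget:,} tokens -> built {count_tokens(context):,} tokens")
```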
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_google_genai import ChatGoogleGenerativeAI
from langsmith.evaluation import evaluate
class Predictor:
def __init__(self, step=1, total_steps=20, model="gemini-1.5-flash"):
self.step = step
self.total_steps = total_steps
self.model = model
self.llm = ChatGoogleGenerativeAI(model=model)
self.experiments = []
def predict(self, inputs: dict):
"""Prediction function for Gemini Experiment"""
tokens = (max_context_limit / self.total_steps) * self.step
context = get_context(tokens)
        system_prompt = f"""You are tasked with answering user questions based on the App Store data inside <APP STORE DATA>.
<APP STORE DATA> contains a ton of public data about apps on the App Store. It is the most current and accurate source \
so be sure to ONLY answer based on the context in <APP STORE DATA>. You will be graded on accuracy so be very careful and \
make sure you are as accurate as possible. First, think through your reasoning to answering the question before ultimately repeating \
the question and giving your answer.
<APP STORE DATA>
{context}
</APP STORE DATA>"""
response = self.llm.invoke(
[SystemMessage(content=system_prompt), HumanMessage(content=inputs["question"])])
return {"output": response.content}
def _run_eval(self):
"""Run a single evaluation for Gemini Experiment"""
tokens = (max_context_limit / self.total_steps) * self.step
result = evaluate(
self.predict,
data=client.list_examples(dataset_name=dataset_name),
evaluators=[qa_eval],
experiment_prefix=f"{self.model}-{tokens}"
)
# Append the results to the experiments list
self.experiments.append({
"tokens": tokens,
"step": self.step,
"results": result
})
def run(self):
"""Run a single step of the Gemini Experiment"""
print(f"Running step {self.step} of the Gemini Experiment")
self._run_eval()
# Increment the step
self.step += 1
# If we have more than 1 experiment, display the results
if len(self.experiments) > 1:
visualize_test_results(self.experiments)
def run_all(self, reset=False, stop_at=None):
"""Run all steps of the Gemini Experiment
Args:
reset (bool, optional): Whether to reset the step counter. Defaults to False.
stop_at (int, optional): The step to stop at. Defaults to Predictor.total_steps.
"""
if stop_at is None:
stop_at = self.total_steps
if reset:
self.step = 1
while self.step <= stop_at:
self.run()
predictor = Predictor()
Key aspects of this implementation:
- Context Generation: The `get_context` function generates a context of the appropriate length for each step of the experiment.
- Incremental Context: The `step` and `total_steps` parameters allow us to incrementally increase the context length from 50,000 to 1,000,000 tokens. We have to use a class because our prediction function should only take the `inputs` dict.
- System Prompt: We wrote a system prompt to instruct Gemini 1.5 Flash on how to approach the task. This prompt emphasizes:
  - Using only the provided context
  - The importance of accuracy
  - The need for reasoning before answering
- Model Invocation: We use the `ChatGoogleGenerativeAI` class from the `langchain_google_genai` library to interact with Gemini 1.5 Flash.
This prediction function allows us to systematically test Gemini 1.5 Flash's performance across varying context lengths while maintaining consistent instructions and evaluation criteria. By incrementing the `step` parameter, we can observe how the model's performance changes as it has access to more context.
Let's run our experiment!¶
predictor.run()
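Calling run() executes a single step (50,000 tokens of context at step 1) and advances the step counter. To sweep the remaining steps up to the full million tokens, the run_all helper defined on the class can be used:

```python
# Optional: continue from the current step through all twenty steps.
# Each step issues one large Gemini call per question, so the full sweep takes a while.
predictor.run_all()
```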
Results¶
The results of our experiment with Gemini 1.5 Flash are remarkable in their consistency. Across all context lengths, from 50,000 tokens all the way up to the full million-token capacity, Gemini 1.5 Flash achieved 100% accuracy in answering our test questions!
View Test Results on LangSmith
| question | 50k | 100k | 150k | 200k | 250k | 300k | 350k | 400k | 450k | 500k | 550k | 600k | 950k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Which one costs less? The 'KQ MiniSynth' app or the 'Sago Mini Superhero' app? | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Where can I find the privacy policy for the 'Disney Channel Watch Full Episodes Movies TV' app? | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Do the 'Sago Mini Superhero' and 'Disney Channel Watch Full Episodes Movies TV' apps require internet connection? | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Key Findings:
- Perfect Accuracy: Gemini 1.5 Flash maintained 100% correctness across all context lengths, from 50,000 to 1,000,000 tokens.
- Scalability: The model's performance did not degrade as the context length increased, demonstrating robust scalability.
- Consistency: Regardless of the amount of context provided, Gemini 1.5 Flash consistently provided accurate answers, indicating strong information synthesis capabilities.
It's important to note that our experiment was carefully designed to avoid numerical reasoning and relational queries. Previous experiments have shown that Gemini Flash struggles with tasks involving logic around numbers, such as identifying "the highest rated" or "Top 5 by size" apps. Our questions focused on factual retrieval and simple comparisons, areas where the model excels.
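Those weaknesses are straightforward to probe further, because the ground truth for relational questions can be computed directly from the dataframe. For example:

```python
# Ground truth for relational questions of the kind Gemini Flash reportedly struggles with
print(new_df.nlargest(5, "size")[["name", "size"]])        # "Top 5 apps by size"
# "The highest rated app" (note: many apps tie at 5.0, so a good question would need tie-breaking)
print(new_df.loc[new_df["user_rating"].idxmax(), "name"])
```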
Implications:¶
Comprehensive Document Analysis: Organizations can now process entire documents or databases in a single query. For example, a company could input all its policy documents, employee handbooks, and project reports into Gemini 1.5 Flash. This would allow for quick and accurate answers to complex queries that span multiple documents, potentially saving hours of manual searching and cross-referencing.
Enhanced Customer Support: Customer service departments could leverage Gemini 1.5 Flash to create incredibly knowledgeable chatbots. By inputting all product information, past customer interactions, and frequently asked questions, these chatbots could provide accurate, context-aware responses to customer queries. This could significantly reduce response times and improve customer satisfaction while decreasing the workload on human customer service representatives.
Improved Contract Analysis: Legal departments and law firms could use Gemini 1.5 Flash to analyze lengthy contracts and legal documents. By inputting multiple related contracts, case law, and regulatory information, lawyers could quickly get accurate answers to specific legal questions, potentially speeding up contract review processes and reducing the risk of overlooking important clauses or legal precedents.
While these results are extremely promising, it's crucial to remember that they are based on a specific dataset and set of questions. The model's performance on numerical reasoning and relational queries remains a limitation. Further testing across diverse domains and more complex query types would be beneficial to fully understand the capabilities and limitations of Gemini 1.5 Flash in real-world scenarios.
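One way to extend the study along those lines is to append harder, more relational examples to the same LangSmith dataset. A hypothetical sketch, with the reference answer read off the golden dataframe shown earlier:

```python
# Hypothetical harder example exercising numerical comparison across two apps
harder_examples = [
    {
        "question": "Which app has more total ratings, 'Jelly Splash' or 'Disney Channel Watch Full Episodes Movies TV'?",
        "answer": "'Jelly Splash' has 21601 ratings versus 21082 for 'Disney Channel Watch Full Episodes Movies TV', so 'Jelly Splash' has more.",
    },
]
for example in harder_examples:
    client.create_example(
        inputs={"question": example["question"]},
        outputs={"answer": example["answer"]},
        dataset_name=dataset_name,
    )
```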
Nevertheless, these results mark a significant leap forward in the field of large language models, particularly in handling and analyzing vast amounts of textual information. The ability to maintain perfect accuracy across such a large context opens up exciting possibilities for businesses dealing with large volumes of documents and data.