Context Length Experiment¶
An experiment to test Gemini 1.5 Flash's ability to answer questions about up to 1 million tokens of context.
Gemini 1.5 Flash, with its million-token context capability, represents a significant advancement in LLMs. This expanded context length opens up new possibilities for processing and understanding vast amounts of information. However, it also raises important questions about how effectively such models can utilize this extensive context in practical applications. How can we really know what to expect from an analysis of a 1 million token context?
Traditionally, long-context models have been evaluated using "needle in a haystack" tests, where specific and usually irrelevant information is hidden within a large context to assess the model's retrieval capabilities. While valuable, these tests don't fully explore a model's ability to reason across and synthesize information from its entire context - a crucial skill for many real-world applications. They also lean on irrelevant needles, which I believe gives the LLM a crutch: it can simply spot anomalies in the text rather than actually understand it.
This study aims to address this gap by conducting a comprehensive evaluation of Gemini 1.5 Flash's question-answering capabilities across varying context lengths. Using a dataset derived from the Apple App Store, we design an experiment that systematically increases the context from 50,000 tokens to the full million-token capacity.
Our primary objectives are to:
- Assess Gemini 1.5 Flash's performance in answering specific questions as the context length increases.
- Explore the practical implications of using such large context lengths in real-world scenarios.
The experiment involves a set of questions about a curated set of apps, requiring the model to synthesize information from different parts of the context. By incrementally increasing the context length, we aim to understand not just the model's information retrieval capabilities, but its ability to reason across vast amounts of data.
Setup¶
You will need the following to recreate this experiment:
- A LangSmith account and API key
- A Google AI Studio API key
- An OpenAI API key
Create a copy of the .env.sample file, save it as .env, and add your API keys.
Install the necessary libraries:
%pip install -qU pandas tiktoken langchain langchain-openai langchain-google-genai matplotlib langsmith python-dotenv seaborn
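With the keys saved and the libraries installed, a quick sanity check confirms the environment variables are actually being picked up. A minimal sketch, assuming the variable names in .env.sample are LANGCHAIN_API_KEY, GOOGLE_API_KEY, and OPENAI_API_KEY:

```python
import os
from dotenv import load_dotenv

load_dotenv()

# Variable names assumed from a typical .env.sample; adjust to match yours
required_keys = ["LANGCHAIN_API_KEY", "GOOGLE_API_KEY", "OPENAI_API_KEY"]
missing = [key for key in required_keys if not os.getenv(key)]
if missing:
    raise EnvironmentError(f"Missing environment variables: {missing}")
```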
Data Collection and Preparation¶
Our experiment utilizes the App Store Apple Data Set (10k apps) from Kaggle. This dataset was chosen for its rich information about various apps, allowing for deterministic question generation and answer validation using pandas operations.
We start by loading the necessary libraries and the two main CSV files from the dataset:
import pandas as pd
from langsmith import Client
from dotenv import load_dotenv
load_dotenv()
client = Client()
app_data_df = pd.read_csv('./data/AppleStore.csv')
descriptions_df = pd.read_csv('./data/appleStore_description.csv')
The data is then merged into a single dataframe:
full_app_df = pd.merge(app_data_df, descriptions_df, on='id', how='left')
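Since this is a left join on id, a quick check that the row count is preserved catches duplicate ids early:

```python
# A left join on a unique id should keep exactly one row per app
assert len(full_app_df) == len(app_data_df), "Duplicate ids in the description file"
full_app_df.shape
```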
Data Cleaning and Preprocessing¶
To prepare the data for our experiment, we perform several cleaning and preprocessing steps:
- Select relevant columns and rename for clarity
- Convert app size from bytes to megabytes
- Remove special characters from app names
- Filter out rows with empty names or zero ratings
- Remove entries containing CJK characters in the description
Here's the code implementing these steps:
new_df = full_app_df[['id', 'track_name_x', 'size_bytes_x', 'currency',
'price', 'rating_count_tot', 'user_rating', 'ver', 'prime_genre', 'app_desc']]
new_df = new_df.rename(columns={'track_name_x': 'name', 'size_bytes_x': 'size'})
new_df['size'] = new_df['size'] / (1024 * 1024) # Convert to MB
new_df['name'] = new_df['name'].str.replace(r"[^a-zA-Z0-9\s]+", "", regex=True)
new_df = new_df[new_df['name'].str.strip() != ""]
new_df = new_df[new_df['rating_count_tot'] != 0]
new_df = new_df[new_df['app_desc'].str.contains(r'[\u4e00-\u9fff]') == False]
new_df = new_df.sort_values(by='app_desc')
new_df.head()
| | id | name | size | currency | price | rating_count_tot | user_rating | ver | prime_genre | app_desc |
|---|---|---|---|---|---|---|---|---|---|---|
| 6548 | 1134867821 | NOT ALONE Story of a bird | 116.121094 | USD | 2.99 | 1 | 3.0 | 1.1 | Games | ! Now on X'mas special sales (~2017 Jan. 3rd) ... |
| 6751 | 1145500015 | Drifty Chase | 180.987305 | USD | 0.00 | 1631 | 4.5 | 1.7 | Games | !! 2016 Very Big Indie Pitch finalist at PGCon... |
| 2493 | 823804745 | Multiplayer Terraria edition | 15.058594 | USD | 3.99 | 6981 | 4.0 | 1.5 | Games | !!! First and the only app which allows to pla... |
| 3273 | 949876643 | Lumyer augmented reality camera effects | 116.251953 | USD | 0.00 | 3896 | 4.5 | 4.0.1 | Photo & Video | !!! NEW !!! TAP EFFECTS\nTry the new Tap Effe... |
| 5519 | 1086929344 | Dancing with the Stars The Official Game | 334.543945 | USD | 0.00 | 1098 | 4.0 | 2.7 | Games | !!! Please note this app does not currently su... |
Golden Dataset Selection¶
To create a controlled subset for our questions, we select a "golden dataset" of five apps. Rather than random selection, which could introduce variability across experiments, we chose a deterministic approach:
golden_df = new_df[new_df['app_desc'].str.startswith(('K', 'k'))].head(5)
clean_df = new_df.drop(golden_df.index)
golden_df
| | id | name | size | currency | price | rating_count_tot | user_rating | ver | prime_genre | app_desc |
|---|---|---|---|---|---|---|---|---|---|---|
| 1350 | 529997671 | Disney Channel Watch Full Episodes Movies TV | 125.921875 | USD | 0.00 | 21082 | 3.5 | 5.7.0 | Entertainment | K.C. Undercover, Liv & Maddie, Bunk’d and more... |
| 3455 | 965789238 | 1000 | 73.989258 | USD | 0.00 | 23 | 4.5 | 3.6.5 | Shopping | KAOLA.COM is China 's largest overseas commodi... |
| 3793 | 994674676 | Sago Mini Superhero | 171.169922 | USD | 2.99 | 30 | 3.5 | 1.1 | Education | KAPOW! Jack the rabbit bursts into the sky as ... |
| 1965 | 645949180 | Jelly Splash | 132.311523 | USD | 0.00 | 21601 | 4.0 | 3.13.0 | Games | KICK BACK AND SPLASH!\n\nJoin those delicious ... |
| 5076 | 1070850573 | KQ MiniSynth | 19.365234 | USD | 5.99 | 15 | 5.0 | 1.7.4 | Music | KQ MiniSynth is a polyphonic modular synthesiz... |
This method selects the first five apps whose descriptions start with 'K' or 'k'. While somewhat arbitrary, this approach ensures:
- Reproducibility across experiments
- An arbitrary selection that avoids hand-picking favorable apps
By separating our golden dataset from the main dataframe, we can ensure these apps are always included in our context, regardless of the token limit, while filling the remaining context with other app data.
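A small check makes that separation explicit: the five golden apps live only in golden_df, and clean_df holds everything else:

```python
# The golden apps should be fully separated from the filler pool
assert len(golden_df) == 5
assert golden_df.index.intersection(clean_df.index).empty
print(golden_df["name"].tolist())
```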
Evaluation Dataset¶
Our evaluation dataset consists of three carefully crafted questions designed to test Gemini 1.5 Flash's ability to synthesize information across the App Store data context:
examples = [
{
"question": "Do the 'Sago Mini Superhero' and 'Disney Channel Watch Full Episodes Movies TV' apps require internet connection?",
"answer": "You can play Sago Mini Superhero without wi-fi or internet. Internet is required for Disney Channel Watch Full Episodes Movies TV"
},
{
"question": "Where can I find the privacy policy for the 'Disney Channel Watch Full Episodes Movies TV' app?",
"answer": "http://disneyprivacycenter.com/"
},
{
"question": "Which one costs less? The 'KQ MiniSynth' app or the 'Sago Mini Superhero' app?",
"answer": "The 'KQ MiniSynth' app costs $5.99, the 'Sago Mini Superhero' app costs $2.99. So 'Sago Mini Superhero' is cheaper"
}
]
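Because the reference answers come straight from the dataframe, they can be double-checked deterministically with pandas before being uploaded. For example, the price comparison in the third question:

```python
# Verify the price-comparison answer directly against the golden dataframe
prices = golden_df.set_index("name")["price"]
print(prices[["KQ MiniSynth", "Sago Mini Superhero"]])
assert prices["Sago Mini Superhero"] < prices["KQ MiniSynth"]
```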
We then upload the examples to LangSmith as a dataset, if they haven't already been uploaded:
dataset_name = "AppStore Q&A"
def make_dataset():
"""Make and fill dataset if it doesnt exist"""
if client.has_dataset(dataset_name=dataset_name):
return client.read_dataset(dataset_name=dataset_name)
dataset = client.create_dataset(
dataset_name=dataset_name, description="App Store Data questions and answers")
for example in examples:
client.create_example(
inputs={"question": example["question"]}, outputs={"answer": example["answer"]}, dataset_name=dataset.name)
return dataset
dataset = make_dataset()
Experiment¶
Our experiment is designed to test Gemini 1.5 Flash's performance across increasing context lengths. To execute this, we need several components:
- The App Store dataset
- Our evaluation dataset (Q&As)
- An evaluation function
- A prediction function
Evaluation Function¶
The evaluation function is a critical component of our experiment. It's responsible for:
- Comparing the model's output to the correct answer
- Providing a detailed assessment of the answer's correctness
- Assigning a binary score (correct or incorrect)
Key aspects of this implementation:
- GPT-4o as Judge: We use GPT-4o (via the `ChatOpenAI` class) to evaluate the answers, which allows for a nuanced assessment of the responses. If you haven't worked with LLMs as judges, it may sound unreliable to have an LLM grade an LLM, but it works well in scenarios like this. You can think of it this way: GPT-4o is more than capable of validating that an answer matches the correct answer we give it, and that is a much simpler task than the one we are giving Gemini Flash of actually generating the correct answer.
- Structured Output: Using `with_structured_output(EvaluationSchema)` ensures that our evaluation consistently provides both reasoning and a binary correctness judgment that we can use programmatically.
- Detailed Evaluation: The system prompt instructs the GPT-4o model to provide thorough reasoning, considering partial correctness and nuances in the answers.
- Binary Scoring: While the evaluation includes detailed reasoning, the final score is binary (0 or 1) for simplicity in aggregating results across multiple questions and context lengths (see the sketch after this list).
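The binary score is what keeps that aggregation trivial: per-context-length accuracy is just a group-by mean. A toy sketch with made-up scores:

```python
import pandas as pd

# Hypothetical correctness scores for two context lengths (illustration only)
scores = pd.DataFrame({
    "tokens": [50_000, 50_000, 50_000, 100_000, 100_000, 100_000],
    "correctness": [1, 1, 0, 1, 1, 1],
})
print(scores.groupby("tokens")["correctness"].mean())
```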
from langchain_openai import ChatOpenAI
from langchain.pydantic_v1 import BaseModel, Field
from langchain.schema import SystemMessage, HumanMessage
from langsmith.schemas import Run, Example
class EvaluationSchema(BaseModel):
"""An evaluation schema for assessing the correctness of an answer"""
reasoning: str = Field(
description="Detailed reasoning for the evaluation score")
correct: bool = Field(
description="Whether the user's answer is correct or not")
def qa_eval(root_run: Run, example: Example):
"""Evaluate the correctness of an answer to a given question"""
question = example.inputs["question"]
user_answer = root_run.outputs["output"]
correct_answer = example.outputs["answer"]
if not question or not user_answer or not correct_answer:
return {
"score": 0,
"key": "correctness",
"comment": "Question, user's answer, or correct answer is missing"
}
llm = ChatOpenAI(
model="gpt-4o", temperature=0.4).with_structured_output(EvaluationSchema)
system_prompt = f"""You are a judge tasked with evaluating a user's answer to a given question.
You will be provided with the question, the correct answer, and the user's thought process and answer.
Question:
{question}
Correct Answer:
{correct_answer}
Your job is to assess the user's answer and provide:
1. Detailed reasoning for your evaluation, comparing the user's answer to the correct answer
2. A boolean judgment on whether the user's answer is correct or not
Be thorough in your reasoning and accurate in your judgment. Consider partial correctness and any nuances in the answers."""
evaluation: EvaluationSchema = llm.invoke(
[SystemMessage(content=system_prompt),
HumanMessage(content=user_answer)]
)
score = 1 if evaluation.correct else 0
return {
"score": score,
"key": "correctness",
"comment": evaluation.reasoning
}
This evaluation function allows us to assess Gemini 1.5 Flash's performance consistently across different context lengths and questions. By using a language model (GPT-4o) as the judge, we can capture subtle aspects of correctness that might be missed by simpler, rule-based evaluation methods.
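For contrast, a purely rule-based evaluator might look like the hypothetical baseline below (not used in this experiment). It would mark a correct but differently worded answer as wrong, which is exactly the nuance the LLM judge captures.

```python
def exact_match_eval(root_run: Run, example: Example):
    """Hypothetical rule-based baseline: only credits answers that contain the reference text verbatim."""
    reference = example.outputs["answer"].strip().lower()
    prediction = (root_run.outputs.get("output") or "").strip().lower()
    return {
        "score": 1 if reference in prediction else 0,
        "key": "exact_match",
        "comment": "Substring match against the reference answer",
    }
```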
Prediction Function¶
The prediction function is a crucial component of our experiment. It's responsible for:
- Taking a question from our evaluation dataset
- Generating a context of appropriate length
- Querying Gemini 1.5 Flash with the question and context
- Returning the model's response
Here's our implementation:
from tiktoken import get_encoding
import random
import matplotlib.pyplot as plt
import seaborn as sns
# Gemini's 1M token limit
max_context_limit = 1000000
# Util Functions
def count_tokens(text: str):
"""Count the number of tokens in a string"""
encoder = get_encoding("cl100k_base")
return len(encoder.encode(text))
def row_to_string(row):
"""Convert a row to a string"""
app_string = f"""App Name: {row.name}
Size: {round(row.size, 2)} MB
Price: {row.price} {row.currency}
Rating Count: {row.rating_count_tot}
User Rating: {row.user_rating}
Version: {row.ver}
Genre: {row.prime_genre}
Description: {row.app_desc}"""
return app_string
def get_context(tokens: int):
"""Get the context for a given number of tokens"""
    # Combine the golden apps with the rest of the pool (clean_df already excludes them)
    combined_df = pd.concat([golden_df, clean_df])
app_strs: list[str] = []
delimiter = "\n================\n"
for i, row in enumerate(combined_df.itertuples()):
row_str = row_to_string(row)
num_tokens = count_tokens(
f"{delimiter.join(app_strs)}{delimiter}{row_str}")
        if num_tokens < tokens: # If we haven't hit the token limit, add the row
app_strs.append(row_str)
else:
break
# Randomize app strings
random.shuffle(app_strs)
return delimiter.join(app_strs)
def visualize_test_results(experiments):
"""Display a graph of the test results"""
# Step 1: Extract and process data
all_results = []
for exp in experiments:
df = client.get_test_results(project_name=exp["results"].experiment_name)
df['tokens'] = exp['tokens']
all_results.append(df)
# Combine all results into a single dataframe
combined_df = pd.concat(all_results, ignore_index=True)
# Step 2: Sort by token count
combined_df = combined_df.sort_values('tokens')
# Get unique questions
questions = combined_df['input.inputs.question'].unique()
# Create a color palette
color_palette = sns.color_palette("husl", n_colors=len(questions))
# Step 3: Create the line graph
fig, ax = plt.subplots(figsize=(12, 8))
for i, question in enumerate(questions):
question_data = combined_df[combined_df['input.inputs.question'] == question]
ax.plot(question_data['tokens'], question_data['feedback.correctness'],
label=f'Question {i+1}', color=color_palette[i])
ax.set_title('Test Results by Token Count')
ax.set_xlabel('Number of Tokens')
ax.set_ylabel('Correctness Score')
ax.legend(title='Questions', loc='center left', bbox_to_anchor=(1, 0.5))
# Add questions as text below the graph
fig.text(0.1, 0.02, "Questions:", fontweight='bold')
for i, question in enumerate(questions):
fig.text(0.1, -0.02 - 0.03*i, f"{i+1}. {question}", fontsize=8, wrap=True)
plt.tight_layout()
plt.subplots_adjust(bottom=0.3) # Adjust this value to fit all questions
return plt
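Before wiring these utilities into the experiment, it's worth a quick spot-check that get_context stays within the requested token budget (token counting over large contexts is slow, so keep the budgets modest):

```python
# Spot-check: the assembled context should come in under each requested budget
for budget in (50_000, 100_000):
    context = get_context(budget)
    print(f"requested {budget:,} tokens -> built {count_tokens(context):,} tokens")
```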
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_google_genai import ChatGoogleGenerativeAI
from langsmith.evaluation import evaluate
class Predictor:
def __init__(self, step=1, total_steps=20, model="gemini-1.5-flash"):
self.step = step
self.total_steps = total_steps
self.model = model
self.llm = ChatGoogleGenerativeAI(model=model)
self.experiments = []
def predict(self, inputs: dict):
"""Prediction function for Gemini Experiment"""
tokens = (max_context_limit / self.total_steps) * self.step
context = get_context(tokens)
        system_prompt = f"""You are tasked with answering user questions based on the App Store data inside <APP STORE DATA>.
<APP STORE DATA> contains a ton of public data about apps on the App Store. It is the most current and accurate source \
so be sure to ONLY answer based on the context in <APP STORE DATA>. You will be graded on accuracy so be very careful and \
make sure you are as accurate as possible. First, think through your reasoning to answering the question before ultimately repeating \
the question and giving your answer.
<APP STORE DATA>
{context}
</APP STORE DATA>"""
response = self.llm.invoke(
[SystemMessage(content=system_prompt), HumanMessage(content=inputs["question"])])
return {"output": response.content}
def _run_eval(self):
"""Run a single evaluation for Gemini Experiment"""
tokens = (max_context_limit / self.total_steps) * self.step
result = evaluate(
self.predict,
data=client.list_examples(dataset_name=dataset_name),
evaluators=[qa_eval],
experiment_prefix=f"{self.model}-{tokens}"
)
# Append the results to the experiments list
self.experiments.append({
"tokens": tokens,
"step": self.step,
"results": result
})
def run(self):
"""Run a single step of the Gemini Experiment"""
print(f"Running step {self.step} of the Gemini Experiment")
self._run_eval()
# Increment the step
self.step += 1
# If we have more than 1 experiment, display the results
if len(self.experiments) > 1:
visualize_test_results(self.experiments)
def run_all(self, reset=False, stop_at=None):
"""Run all steps of the Gemini Experiment
Args:
reset (bool, optional): Whether to reset the step counter. Defaults to False.
stop_at (int, optional): The step to stop at. Defaults to Predictor.total_steps.
"""
if stop_at is None:
stop_at = self.total_steps
if reset:
self.step = 1
while self.step <= stop_at:
self.run()
predictor = Predictor()
Key aspects of this implementation:
- Context Generation: The `get_context` function generates a context of the appropriate length for each step of the experiment.
- Incremental Context: The `step` and `total_steps` parameters allow us to incrementally increase the context length from 50,000 to 1,000,000 tokens. We have to use a class because our prediction function should only take the `inputs` dict.
- System Prompt: We wrote a system prompt to instruct Gemini 1.5 Flash on how to approach the task. This prompt emphasizes:
  - Using only the provided context
  - The importance of accuracy
  - The need for reasoning before answering
- Model Invocation: We use the `ChatGoogleGenerativeAI` class from the `langchain_google_genai` library to interact with Gemini 1.5 Flash.
This prediction function allows us to systematically test Gemini 1.5 Flash's performance across varying context lengths while maintaining consistent instructions and evaluation criteria. By incrementing the `step` parameter, we can observe how the model's performance changes as it has access to more context.
Let's run our experiment!¶
predictor.run()
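Calling run() executes a single step (50,000 tokens of context at step 1) and advances the step counter. To sweep the remaining steps up to the full million tokens, the run_all helper defined on the class can be used:

```python
# Optional: continue from the current step through all twenty steps.
# Each step issues one large Gemini call per question, so the full sweep takes a while.
predictor.run_all()
```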
Results¶
The results of our experiment with Gemini 1.5 Flash are remarkable in their consistency. Across all context lengths, from 50,000 tokens all the way up to the full million-token capacity, Gemini 1.5 Flash achieved 100% accuracy in answering our test questions!
View Test Results on LangSmith
| question | 50k | 100k | 150k | 200k | 250k | 300k | 350k | 400k | 450k | 500k | 550k | 600k | 950k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Which one costs less? The 'KQ MiniSynth' app or the 'Sago Mini Superhero' app? | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Where can I find the privacy policy for the 'Disney Channel Watch Full Episodes Movies TV' app? | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Do the 'Sago Mini Superhero' and 'Disney Channel Watch Full Episodes Movies TV' apps require internet connection? | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Key Findings:
- Perfect Accuracy: Gemini 1.5 Flash maintained 100% correctness across all context lengths, from 50,000 to 1,000,000 tokens.
- Scalability: The model's performance did not degrade as the context length increased, demonstrating robust scalability.
- Consistency: Regardless of the amount of context provided, Gemini 1.5 Flash consistently provided accurate answers, indicating strong information synthesis capabilities.
It's important to note that our experiment was carefully designed to avoid numerical reasoning and relational queries. Previous experiments have shown that Gemini Flash struggles with tasks involving logic around numbers, such as identifying "the highest rated" or "Top 5 by size" apps. Our questions focused on factual retrieval and simple comparisons, areas where the model excels.
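Those weaknesses are straightforward to probe further, because the ground truth for relational questions can be computed directly from the dataframe. For example:

```python
# Ground truth for relational questions of the kind Gemini Flash reportedly struggles with
print(new_df.nlargest(5, "size")[["name", "size"]])        # "Top 5 apps by size"
# "The highest rated app" (note: many apps tie at 5.0, so a good question would need tie-breaking)
print(new_df.loc[new_df["user_rating"].idxmax(), "name"])
```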
Implications:¶
Comprehensive Document Analysis: Organizations can now process entire documents or databases in a single query. For example, a company could input all its policy documents, employee handbooks, and project reports into Gemini 1.5 Flash. This would allow for quick and accurate answers to complex queries that span multiple documents, potentially saving hours of manual searching and cross-referencing.
Enhanced Customer Support: Customer service departments could leverage Gemini 1.5 Flash to create incredibly knowledgeable chatbots. By inputting all product information, past customer interactions, and frequently asked questions, these chatbots could provide accurate, context-aware responses to customer queries. This could significantly reduce response times and improve customer satisfaction while decreasing the workload on human customer service representatives.
Improved Contract Analysis: Legal departments and law firms could use Gemini 1.5 Flash to analyze lengthy contracts and legal documents. By inputting multiple related contracts, case law, and regulatory information, lawyers could quickly get accurate answers to specific legal questions, potentially speeding up contract review processes and reducing the risk of overlooking important clauses or legal precedents.
While these results are extremely promising, it's crucial to remember that they are based on a specific dataset and set of questions. The model's performance on numerical reasoning and relational queries remains a limitation. Further testing across diverse domains and more complex query types would be beneficial to fully understand the capabilities and limitations of Gemini 1.5 Flash in real-world scenarios.
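One way to extend the study along those lines is to append harder, more relational examples to the same LangSmith dataset. A hypothetical sketch, with the reference answer read off the golden dataframe shown earlier:

```python
# Hypothetical harder example exercising numerical comparison across two apps
harder_examples = [
    {
        "question": "Which app has more total ratings, 'Jelly Splash' or 'Disney Channel Watch Full Episodes Movies TV'?",
        "answer": "'Jelly Splash' has 21601 ratings versus 21082 for 'Disney Channel Watch Full Episodes Movies TV', so 'Jelly Splash' has more.",
    },
]
for example in harder_examples:
    client.create_example(
        inputs={"question": example["question"]},
        outputs={"answer": example["answer"]},
        dataset_name=dataset_name,
    )
```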
Nevertheless, these results mark a significant leap forward in the field of large language models, particularly in handling and analyzing vast amounts of textual information. The ability to maintain perfect accuracy across such a large context opens up exciting possibilities for businesses dealing with large volumes of documents and data.