Blog

Some light reading on AI.

What Claude SSH Actually Unlocks

SSH Agent Server

I wrote about the AI PC concept back in December. The idea: a personal server that you own, running a codebase that your AI agents can connect to, edit, and serve from. An app that persists and grows with you as you use it.

Claude Code just shipped native SSH support in Claude Desktop. So I spun up an EC2 instance and tested it. I wanted to see how close we are to making the AI PC real. The answer: closer than I expected.

The experiment

The plan was simple. Compare two approaches side by side:

  1. Traditional SSH: Use Claude Code CLI locally and have it SSH into the server
  2. Native Desktop SSH: Use Claude Desktop's new built-in SSH feature, which runs Claude Code directly on the remote machine

I spun up a tiny EC2 instance, grabbed the SSH key, and started running through a checklist: CRUD operations on files, permissions, config files, file transfers, and eventually building and serving a web app.

Traditional SSH

The first thing I noticed when I told CLI Claude to SSH into the server is that it doesn't open an interactive session. It sends one-off commands. Every interaction looks like:

ssh -i key.pem ec2-user@3.84.197.88 "mkdir -p /home/ec2-user/docs && cat > /home/ec2-user/docs/01-attention-mechanism.md << 'DOCEOF'
# The Attention Mechanism
..."

Every file operation goes through the Bash tool. Claude can't use its native Read, Write, or Edit tools because those target the local filesystem. So it falls back to cat with heredocs for writes and reads. No structured diffs, no clean UI, just raw shell commands piped through SSH.
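
Reads go through the same pipe, just in the other direction. Something like this, reusing the key and host from the write above:

ssh -i key.pem ec2-user@3.84.197.88 "cat /home/ec2-user/docs/01-attention-mechanism.md"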

The permissions UX is also terrible. You're approving raw Bash commands, but you can't tell whether a cat is a read or a write. Claude suggested SSHFS as a workaround (mount the remote filesystem locally), but that doesn't give you a remote terminal, you have to set it up every time, and it'll never work on mobile. Dead end.
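
For the curious, the workaround Claude suggested would look roughly like this, and it still wouldn't solve the terminal or mobile problems. The mount point is whatever local directory you pick; the key path just needs to be absolute:

mkdir -p ~/remote-ec2
sshfs -o IdentityFile="$PWD/key.pem" ec2-user@3.84.197.88:/home/ec2-user ~/remote-ec2
# unmount when you're done (Linux; use umount on macOS)
fusermount -u ~/remote-ec2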

Traditional SSH confines the agent to shell scripting.

Native Desktop SSH

Then I connected through Claude Desktop's native SSH feature. Same permissioning prompts as local. Same tool suite:

  • Read: native file reading, no cat over SSH
  • Edit: structured edits with diffs, no heredocs
  • Write: clean file creation, same UI as local

It felt like working on my own machine, except the filesystem was on a server. One quirk: Desktop doesn't support ! bang commands like the CLI does, so everything has to go through the agent. Minor annoyance.

Config files work (and they're fully isolated)

I added a CLAUDE.md on the server with instructions to talk like a pirate and set bypassPermissions in ~/.claude/settings.json. Fresh session: both worked. Pirate speak, zero permission prompts. Full yolo mode on a remote server.
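
For reference, here's roughly what ended up on the server, written out as shell. The settings key names are the standard Claude Code ones as far as I know, so verify against the current settings docs before trusting them:

mkdir -p ~/.claude
# user-level memory file: the pirate instruction
cat > ~/.claude/CLAUDE.md << 'EOF'
Always answer like a pirate.
EOF
# yolo mode: skip permission prompts entirely
cat > ~/.claude/settings.json << 'EOF'
{
  "permissions": {
    "defaultMode": "bypassPermissions"
  }
}
EOF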

But then I put my phone number in my local global ~/.claude/CLAUDE.md and asked SSH Claude if it knew it. Nope. Fully isolated. It doesn't see:

  • Your local global CLAUDE.md
  • Your local skills
  • Any local config whatsoever

Double-edged sword. You get full separation per server (different personality, tools, permissions for each), but if you've built up a library of custom skills locally, you'd need to recreate them on each server. No inheritance, no merging.

The trust boundary

A disposable server is the perfect sandbox for an agent. Worst case, roll back to a snapshot. This is actually safer than giving an agent full permissions on your personal laptop.

  • Conservative: Don't allow sudo. The agent can go wild with everything else.
  • Full yolo: Allow sudo too, but snapshot first (sketch below).
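
"Snapshot first" can be as cheap as baking an AMI before you flip the switch, so there's always something to roll back to. The instance ID here is a placeholder, not a real one:

aws ec2 create-image \
  --instance-id i-0123456789abcdef0 \
  --name "pre-yolo-snapshot" \
  --no-reboot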

The boundary gets interesting when the agent needs cloud access. Yolo on a server is fine: the blast radius is one disposable box. Yolo with AWS credentials means the agent could spin up resources, delete things, rack up bills. Totally different risk profile.

So I split it: SSH Claude builds and serves, local Claude handles infra. Two agents, one trust boundary.

Trust boundary model

Proof of concept

I told SSH Claude to set up a web app. When it needed port 80 opened, it told me. Local Claude handled that. Done.
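
For the record, "handling that" comes down to one security-group rule. The group ID here is a placeholder:

aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 80 \
  --cidr 0.0.0.0/0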

Hello World served from EC2

"Hello, World!" served live from the EC2 instance. No deploy pipeline. No git push. No CI/CD. The agent just... built it and served it.

To demo it, I opened the URL on my phone, asked Claude to change the colors and the message, and refreshed the page to see the changes instantly. The implications of this are huge.

Where this is going

This unlocks a type of app that's never existed before. An app where the developer and the user are the same person, and the app is always running while it's being built.

Think about what just happened with the demo. I was using the app on my phone. I thought of a change. I told the agent. The change was live in the app I was already looking at. No redeploy, no waiting, no switching contexts. The app and the development environment are the same thing.

And because it's your server, you can bring it anywhere. Claude connects to it today, Codex connects to it tomorrow. Switch providers whenever you want; the app and all your data come with you.

Now scale that up. You upload your schedule, your resume, your contacts, all the stuff you currently track across five different apps and a notes folder. The agent turns it into a personal app, served from your server. Then your life changes. New job, new address, kid starts school. You tell Claude, and the app morphs a little bit more to what you need. It's a living thing that adapts because using it is the same as building it.

Building toward that next.

Golden Datasets Are Dead

Golden Dataset Header

There's an instinct when you start building agent evals to replicate what the big benchmarks do. You see TerminalBench or SWE-bench or whatever, and there's this nice hill to climb. Model releases improve the score, progress is visible, stakeholders are happy. So you think: why not build an internal version? Start at 10%, iterate throughout the year, end at 80%. Show the chart in your quarterly review.

It doesn't work. Here's why.

The Year of the AI PC

AI PC

2025 was supposed to be "the year of the agents". We saw real agent use cases being pushed to production at enterprises and startups and actually being useful. These are usually very simple tool-loop agents that devs plug into APIs, allowing LLMs to use tools to fetch info (RAG) or to take actions. A ton of agents popped up in 2025, but not a ton of great ones. You would think this was due to model capabilities, but what Claude Code taught us is that the harness, or the architecture of the agent, is just as important as the model, if not more.

If you haven't been using Claude Code, I highly recommend you give it a try, even if you're not a programmer. It's magical.

Stop using LLM frameworks

Build direct

The core pitch of LangChain was interchangeability. Plug-and-play components. Swap ChatGPT for Anthropic for Gemini for whatever. Replace your vector database, swap out your tools, mix and match to your heart's content. Build agents from standardized Lego bricks. It sounded great.

I think there's still a place for LangGraph for orchestration. But the rest of it? I don't think LangChain makes sense anymore. Here's why.

Floor vs Ceiling: Different Models for Different Jobs

Floor vs Ceiling header image

I talk a lot about the floor versus the ceiling when it comes to LLMs and agents. The ceiling is the maximum capability when you push these models to the edge of what they can do: complex architectures, novel scientific problems, anything that requires real reasoning. The floor is the everyday stuff, the entry-level human tasks that just need to get done reliably.

For customer service, you want floor models. Cheap, fast, stable. For cutting-edge research or gnarly architectural decisions, you want ceiling models. Expensive, slow, but actually smart.

What I've realized lately is that coding agent workflows should be using both. And most of them aren't.

The Meta-Evaluator: Your Coding Agent as an Eval Layer

Meta-Evaluator Header

I've been building AI products for a while now, and I've always followed the standard playbook: build your agent, write your evals, iterate on prompts until the numbers look good, ship. It works. But recently I stumbled onto something that completely changed how I think about the evaluation layer.

What if your coding agent is the evaluation layer?

Let me explain.

My Bull Case for Prompt Automation

Recently, Andrej Karpathy did the Dwarkesh Patel podcast, and one of the stories he told stuck out to me.

He said they were running an experiment where an LLM-as-a-judge scored a student LLM. All of a sudden, the loss went straight to zero, meaning the student LLM was getting 100% out of nowhere. So either the student LLM had achieved perfection, or something went wrong.

They dug into the outputs, and it turns out the student LLM was just outputting the word "the" a bunch of times: "the the the the the the the." For some reason, that tricked the LLM-as-a-judge into giving a passing score. It was just an anomalous input that gave them an anomalous output, and it broke the judge.

It's an interesting story in itself, just on the flakiness of LLMs, but we knew that already. I think the revelation for me here is that if outputting the word "the" a bunch of times is enough to get an LLM to perform in ways you wouldn't expect, then how random is the process of prompting? Are there scenarios where if you put "the the the the the" a bunch of times in the system prompt, maybe it solves a behavior, or creates a behavior you were trying to get to?

We treat prompting like we're speaking to an entity, and that if we can get really clear instructions in the system prompt, we can steer these LLMs as if they're just humans that are a little less smart. But that doesn't seem to be the case, because even a dumb human wouldn't interpret the word "the" a bunch of times as some kind of successful response. These things are more enigmatic than we treat them. It's not too far removed from random at this point.

AI Agent Testing: Stop Caveman Testing and Use Evals

I recently gave a talk at the LangChain Miami meetup about evals. This blog encapsulates the main points of the talk.

AI agent manual testing illustration showing developer copy-pasting test prompts

AI agent testing is one of the biggest challenges in building reliable LLM applications. Unlike traditional software, AI agents have infinite possible inputs and outputs, making manual testing inefficient and incomplete. This guide covers practical AI agent evaluation strategies that will help you move from manual testing to automated evaluation frameworks.

I build AI agents for work, and for a long time, I was iterating on them the worst way possible.

The test-adjust-test-adjust loop is how you improve agents. You try something, see if it works, tweak it, try again. Repeat until it's good enough to ship. The problem isn't the loop itself—it's how slow and painful that loop can be if you're doing it manually.

Complex AI Agents

Model Mafia

In the world of AI dev, there’s a lot of excitement around multi-agent frameworks—swarms, supervisors, crews, committees, and all the buzzwords that come with them. These systems promise to break down complex tasks into manageable pieces, delegating work to specialized agents that plan, execute, and summarize on your behalf. Picture this: you hand a task to a “supervisor” agent, it spins up a team of smaller agents to tackle subtasks, and then another agent compiles the results into a neat little package. It’s a beautiful vision, almost like a corporate hierarchy with you at the helm. And right now, these architectures and their frameworks are undeniably cool. They’re also solving real problems as benchmarks show that iterative, multi-step workflows can significantly boost performance over single-model approaches.

But these frameworks are a temporary fix, a clever workaround for the limitations of today's AI models. As models get smarter, faster, and more capable, the need for this intricate scaffolding will fade. We're building hammers and hunting for nails, when the truth is that the nail (the problem itself) might not even exist in a year. Let me explain why.