June 11, 2026 · 8 min read · Perspective
Co-Scientist Is a Pattern, Not a Product
On May 19, 2026, two Nature papers landed within hours of each other, both describing multi-agent AI systems for scientific discovery. Google DeepMind published "Accelerating scientific discovery with Co-Scientist" [1]. FutureHouse, a San Francisco nonprofit, published "A multi-agent system for automating scientific discovery," introducing a system called Robin [2]. The same day, Nature ran an accompanying editorial titled "Why AI cannot do good science without humans" [3]. Both had been circulating as preprints for months; both are now peer-reviewed and published in Nature in the same week.
If you build, design, or test batteries for a living, you have probably already seen the headlines. Here's a useful way to read these papers.
Read the methods sections.
The Co-Scientist paper describes a multi-agent system built on Gemini "for structured scientific thinking and hypothesis generation" [1]. A Generation agent proposes hypotheses. A Proximity agent clusters them so the system doesn't collapse into a single line of thinking. A Reflection agent acts as a virtual peer reviewer, critiquing each hypothesis for correctness, novelty, and rigor. A Ranking agent runs an Elo-style "idea tournament" of pairwise comparisons. An Evolution agent refines, recombines, and builds on the top-ranked candidates. The system's value comes from scaling test-time compute: letting the loop run longer, generate more, and debate harder. Co-Scientist was validated in three biomedical applications: drug repurposing, novel target discovery, and explaining mechanisms of antimicrobial resistance. For acute myeloid leukemia specifically, it proposed drug-repurposing candidates that showed tumor inhibition in vitro at clinically applicable concentrations [1].
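For readers who haven't met Elo ratings, the Ranking agent's tournament reduces to a simple update rule applied after each pairwise comparison. The sketch below is a generic Elo round-robin, not code from the Co-Scientist paper: the K-factor of 32, the starting rating of 1200, and the judge function are illustrative assumptions, and the stub judge stands in for the LLM comparison the real system performs.

```python
import itertools
import random
from typing import Callable


def expected_score(r_a: float, r_b: float) -> float:
    """Elo probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update for one comparison: points the loser gives up go to the winner."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def run_tournament(
    hypotheses: list[str],
    judge: Callable[[str, str], bool],
    rounds: int = 3,
    start: float = 1200.0,
    seed: int = 0,
) -> list[tuple[str, float]]:
    """Round-robin Elo tournament. judge(a, b) returns True if a is preferred.
    In Co-Scientist an LLM makes that comparison; here it is a hypothetical stub."""
    rng = random.Random(seed)
    ratings = {h: start for h in hypotheses}
    pairs = list(itertools.combinations(hypotheses, 2))
    for _ in range(rounds):
        rng.shuffle(pairs)  # vary match order between rounds
        for a, b in pairs:
            ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], judge(a, b))
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Stub judge with made-up plausibility scores, for illustration only.
plausibility = {"H1": 0.2, "H2": 0.9, "H3": 0.5}
ranking = run_tournament(list(plausibility), lambda a, b: plausibility[a] > plausibility[b])
print([name for name, _ in ranking])
```

Because each comparison moves points from the loser to the winner, the total rating pool is conserved; hypotheses that keep winning rise, and an Evolution-style step would draw from the top of this list.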
Robin is structurally similar but smaller: three agents (two for literature review, one for experimental data analysis) that propose hypotheses, design experiments, and interpret results in an iterative loop. Robin identified ripasudil, an existing glaucoma drug, as a candidate treatment for dry age-related macular degeneration, a major cause of blindness, and the prediction was validated in lab experiments [2]. The FutureHouse team reports that the entire system, from conception through validated discovery to manuscript submission, was built and used by a small team in 2.5 months [4].
That is a pattern. It is not a product. The pattern now has a name (agentic tool use) and a protocol (the Model Context Protocol, or MCP) that exists specifically to make the pattern composable across whatever tools, data, and language models you happen to use. The Nature papers are landmark publications because of what the pattern accomplishes in drug discovery. The architecture itself is open: a coordinator LLM, a set of specialized agents, tools the agents reach by name, and data they reason over. The same pattern works for batteries.
Three layers
Three layers, and it matters who owns which one.
The substrate. Open formats your code can already parse (parquet, CSV, JSON), a Python API your existing notebooks can call, and updates that arrive from the lab fast enough to be useful for the question you're asking. If your cyclers are dumping data into a folder once a week, your co-scientist is summarizing last week. If your platform ingests cycler output continuously, your co-scientist is reasoning about what is happening on your test floor right now. Micantis is a data substrate.
The tools. The functions your agent calls to get work done. A live data query interface. A spec library that holds your acceptance criteria. A method library that holds your canonical analyses: capacity fade, EIS fitting, dQ/dV peak tracking, cohort comparison. A test plan generator that knows your protocols. Report templates in your formats. Each is exposed through an MCP server as a tool the agent calls by name. Micantis provides all of these too.
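As an illustration of what a method-library entry might contain, here is a minimal capacity-fade sketch: retention relative to the first listed cycle, plus a least-squares fade rate in percent per cycle. The function name, the reference-cycle choice, and the linear fit are assumptions; this is not Micantis's method, and a production method might use a different reference or a nonlinear fit.

```python
from dataclasses import dataclass


@dataclass
class FadeResult:
    retention_pct: list[float]  # discharge capacity as % of the first listed cycle
    fade_pct_per_cycle: float   # least-squares slope of retention vs. cycle number


def capacity_fade(cycles: list[int], discharge_ah: list[float]) -> FadeResult:
    """Hypothetical method-library function: retention relative to the first
    listed cycle, plus a linear fade rate. Not Micantis's implementation."""
    if len(cycles) != len(discharge_ah) or len(cycles) < 2:
        raise ValueError("need at least two matching (cycle, capacity) points")
    if len(set(cycles)) < 2:
        raise ValueError("need at least two distinct cycle numbers to fit a slope")
    ref = discharge_ah[0]
    retention = [100.0 * q / ref for q in discharge_ah]
    # Ordinary least-squares slope of retention against cycle number.
    n = len(cycles)
    mean_x = sum(cycles) / n
    mean_y = sum(retention) / n
    sxx = sum((x - mean_x) ** 2 for x in cycles)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cycles, retention))
    return FadeResult(retention_pct=retention, fade_pct_per_cycle=sxy / sxx)


result = capacity_fade([1, 101, 201], [3.00, 2.94, 2.88])
print(result.fade_pct_per_cycle)  # approximately -0.02 (% per cycle)
```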
The playbook. The instructions you give the agent about how to work in your environment. Who it's talking to. What conventions matter. What "formation" means at your company versus another. Which method library function is authoritative for which question. What counts as a flag and what counts as a fail. When to escalate to a human. This layer is yours. We provide a skeleton, but the judgment in the playbook is where your team's institutional knowledge lives, and that's not something a vendor can ship from the outside.
The LLM goes on top. Whichever you trust. Claude. Gemini. GPT. A local model running on your own hardware if regulatory or IP constraints demand it.
External tools (your in-house pack thermal model, PyBaMM for physics-based simulations, a procurement scoring script your team already wrote) plug in alongside the Micantis tools. The agent doesn't know which is which. They're all just MCP servers it can call by name.
What the playbook is
It's a plain Markdown file. Probably 200–400 lines when filled in. Sections for who the agent is talking to, what it should always do, what counts as good versus a flag versus a fail, your team's vocabulary (because "batch" means different things at different companies), which method library functions are canonical, and when to escalate rather than answer.
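Because the playbook is plain Markdown, the runtime can hand the whole file to the model as system context. Splitting it on headings also lets you check that a forked copy still has the sections your team treats as mandatory before it ships. A minimal sketch, assuming the playbook uses "## " headings; the section names in REQUIRED and the sample text are hypothetical.

```python
# Hypothetical section names, loosely following the sections described above.
REQUIRED = ["Audience", "Always", "Flags and fails", "Vocabulary", "Canonical methods", "Escalation"]


def parse_playbook(markdown: str) -> dict[str, str]:
    """Split a Markdown playbook into {section title: body} on '## ' headings.
    Text before the first heading is kept under 'preamble' if non-empty."""
    sections: dict[str, str] = {}
    title = "preamble"
    lines: list[str] = []

    def flush() -> None:
        body = "\n".join(lines).strip()
        if title != "preamble" or body:
            sections[title] = body

    for line in markdown.splitlines():
        if line.startswith("## "):
            flush()
            title, lines = line[3:].strip(), []
        else:
            lines.append(line)
    flush()
    return sections


def missing_sections(sections: dict[str, str], required: list[str]) -> list[str]:
    """Required sections that are absent or empty."""
    return [name for name in required if not sections.get(name)]


# In practice: parse_playbook(Path("playbook.md").read_text(encoding="utf-8"))
example = """# Battery co-scientist playbook
## Audience
Cell scientists on the electrolyte team.
## Escalation
Escalate any safety-relevant anomaly to a human.
"""
print(missing_sections(parse_playbook(example), REQUIRED))
```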
Here's a representative slice, a few lines from the kind of playbook a cell-science team might write:
- Always pull live data before answering. When asked about a running test, query the data substrate first. Do not answer from memory of an earlier query in the conversation.
- First-cycle Coulombic efficiency below 89% is a flag. Below 86% is a fail.
- "Variant" means the formulation code in the recipe library, not the cell instance. Do not confuse them.
- For capacity fade, use degradation.capacity_fade_v3. v2 is deprecated and produces incorrect results below 2C.
- A result in 2σ conflict with prior data: do not just report it. Call it out as anomalous. Propose a check.
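The playbook states these rules in prose for the model to read. A team might also want the numeric rules available as a deterministic check that a tool runs, so the model isn't doing threshold arithmetic on its own. A minimal sketch, assuming first-cycle Coulombic efficiency (FCE) arrives as a percentage and reading "2σ conflict" as a value two or more sample standard deviations from the mean of prior values; that interpretation and the function names are assumptions for illustration.

```python
import statistics

FCE_FLAG_PCT = 89.0  # playbook slice: below 89% is a flag
FCE_FAIL_PCT = 86.0  # playbook slice: below 86% is a fail


def classify_fce(fce_pct: float) -> str:
    """Map first-cycle Coulombic efficiency (%) to 'pass', 'flag', or 'fail'."""
    if fce_pct < FCE_FAIL_PCT:
        return "fail"
    if fce_pct < FCE_FLAG_PCT:
        return "flag"
    return "pass"


def conflicts_with_prior(value: float, prior: list[float], n_sigma: float = 2.0) -> bool:
    """True if value lies n_sigma or more sample standard deviations from the prior mean."""
    if len(prior) < 2:
        return False  # not enough history to estimate a spread
    mean = statistics.mean(prior)
    sd = statistics.stdev(prior)
    if sd == 0:
        return value != mean
    return abs(value - mean) / sd >= n_sigma


print(classify_fce(88.0))                                         # flag
print(conflicts_with_prior(88.5, [90.0, 90.5, 89.5, 90.2, 89.8]))  # True
```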
Conventions. Thresholds. Vocabulary. Method preferences. Escalation rules. The kind of thing a senior engineer would tell a new hire in their first week. None of it is generic; all of it is specific to a particular team's way of working. The agent's behavior in your environment comes from the playbook, and the playbook is yours, because nobody outside your team can write those thresholds for you.
We publish a skeleton playbook as an open template you can fork [5]. It's structured, commented, and ready to fill in. Fork it, edit it, and version-control it alongside your other engineering artifacts. The agent reads it the same way it reads its tools: as context. Update the playbook and the behavior updates.
Swap anything you want
The corollary to "the playbook is yours" is that the tools are independently swappable.
Every tool in the diagram above is reached by name through MCP. The agent calls data_query.run(...) or spec_library.get_acceptance_criteria(...) and the protocol routes the call to whatever server is registered under that name. Want to keep your spec library in your existing PLM system? Write a short wrapper that exposes spec_library as an MCP server backed by the PLM. The agent doesn't know the difference. Want to pipe formation data into your own Snowflake warehouse for cross-program analytics? Swap the data query layer for one that hits Snowflake. Same answer: the agent doesn't know.
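The swap works because the caller depends only on a tool's name and arguments, never on its backend. The sketch below models that with a plain in-process registry rather than the MCP SDK, so it runs anywhere; the registry functions, both backends, the part number, and the criteria values are hypothetical. With the official MCP Python SDK, the same function would instead be exposed as a tool on a server (its FastMCP class, for example, registers tools with a decorator), and the client would discover it by name.

```python
from typing import Any, Callable

ToolFn = Callable[..., Any]
REGISTRY: dict[str, ToolFn] = {}


def register(name: str, fn: ToolFn) -> None:
    """Bind a tool name to an implementation; registering the same name again swaps the backend."""
    REGISTRY[name] = fn


def call(name: str, **kwargs: Any) -> Any:
    """Agent-side call: only the name and arguments are known, never the backend."""
    if name not in REGISTRY:
        raise KeyError(f"no tool registered as {name!r}")
    return REGISTRY[name](**kwargs)


# Hypothetical backend 1: criteria kept in a local table.
LOCAL_SPECS = {"CELL-A": {"min_fce_pct": 89.0}}


def local_criteria(part_number: str) -> dict[str, float]:
    return LOCAL_SPECS[part_number]


# Hypothetical backend 2: a thin wrapper over a PLM system.
# FAKE_PLM stands in for whatever client your PLM actually exposes.
FAKE_PLM = {"CELL-A": {"min_fce_pct": 88.5}}


def plm_criteria(part_number: str) -> dict[str, float]:
    return FAKE_PLM[part_number]


register("spec_library.get_acceptance_criteria", local_criteria)
before = call("spec_library.get_acceptance_criteria", part_number="CELL-A")
register("spec_library.get_acceptance_criteria", plm_criteria)  # the swap
after = call("spec_library.get_acceptance_criteria", part_number="CELL-A")
print(before, after)
```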
This is what open architecture means in practice. Not "we have an API." Every tool exposed as a peer; every tool replaceable by one you write; every tool callable alongside whatever else you want to plug in.
One exception: the substrate itself. Live ingestion of fifty-plus cycler formats, normalization, queryability at hourly cadence. That's the part that took us years to build and that doesn't peel out cleanly. If you want to build your own substrate, you can; we won't pretend it's a weekend project. But every layer above it is yours to swap, extend, or replace.
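Normalization is much of that work, because each cycler vendor names and scales its columns differently. A minimal sketch of the idea, mapping two hypothetical export layouts onto one common schema; the vendor keys, column headers, and units are invented for illustration and describe neither a real cycler format nor the Micantis schema.

```python
import csv
import io
from typing import Callable

# Hypothetical column maps: vendor header -> (common field name, converter).
# Real cycler exports differ; these layouts are invented for illustration.
COLUMN_MAPS: dict[str, dict[str, tuple[str, Callable[[str], float]]]] = {
    "vendor_a": {
        "Cycle": ("cycle", int),
        "Volt(V)": ("voltage_v", float),
        "Cur(mA)": ("current_a", lambda s: float(s) / 1000.0),  # mA -> A
    },
    "vendor_b": {
        "cycle_index": ("cycle", int),
        "voltage": ("voltage_v", float),
        "current": ("current_a", float),  # already in A
    },
}


def normalize(csv_text: str, vendor: str) -> list[dict[str, float]]:
    """Parse one vendor's CSV export into rows keyed by the common schema."""
    mapping = COLUMN_MAPS[vendor]
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        rows.append({field: convert(raw[header]) for header, (field, convert) in mapping.items()})
    return rows


a = normalize("Cycle,Volt(V),Cur(mA)\n1,3.65,500\n", "vendor_a")
b = normalize("cycle_index,voltage,current\n1,3.65,0.5\n", "vendor_b")
print(a == b)  # True: both exports land in the same schema
```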
The difference live data makes
The Nature co-scientists run primarily against literature: published papers, patents, clinical trial registries, curated knowledge bases. Literature is, in computational terms, nearly static: it moves slowly, and yesterday's snapshot is approximately the same as today's. The agent loop can be slow, deliberative, and offline.
Battery R&D is not like that. The cyclers generated data while you were reading the last paragraph. A test plan launched on Monday looks different on Thursday than it did on Tuesday. A supplier batch you accepted last week may be drifting in ways you don't yet know. The interesting questions are about what is happening right now, not what was happening when somebody last exported a CSV.
A co-scientist that reasons over a stale snapshot is less useful than one reasoning over a live feed. Live data is also the part of the architecture that's hardest to retrofit. The Micantis substrate ingests cycler output, formation data, QR scans, and scale measurements continuously, so your co-scientist reads data that's seconds old, not weeks old. The Monday-morning question ("What finished over the weekend? Anything off-spec?") gets answered against the run that just ended.
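One inexpensive guard against answering from a stale snapshot is a tool that reports how old the newest record is, so the agent can say so before it answers. A minimal sketch, assuming records carry timezone-aware UTC timestamps; the 15-minute threshold and the function names are illustrative choices, not Micantis settings.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def newest_record_age(timestamps: list[datetime], now: Optional[datetime] = None) -> Optional[timedelta]:
    """Age of the most recent record, or None if there are no records.
    Timestamps must be timezone-aware so they compare correctly with `now`."""
    if not timestamps:
        return None
    now = now or datetime.now(timezone.utc)
    return now - max(timestamps)


def is_stale(
    timestamps: list[datetime],
    max_age: timedelta = timedelta(minutes=15),
    now: Optional[datetime] = None,
) -> bool:
    """Treat missing data as stale; otherwise compare the newest record's age to max_age."""
    age = newest_record_age(timestamps, now)
    return age is None or age > max_age


now = datetime(2026, 6, 8, 9, 0, tzinfo=timezone.utc)  # an example Monday morning
weekend_run = [now - timedelta(hours=30), now - timedelta(minutes=4)]
print(is_stale(weekend_run, now=now))                # False: newest record is 4 minutes old
print(is_stale([now - timedelta(days=7)], now=now))  # True: last week's export
```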
What this looks like, by workflow
This is the overview. The same pattern looks different depending on whose desk it lands on.
Over the next few weeks, we'll publish a series of pieces walking through what a co-scientist looks like in practice for five very different battery workflows. Each one uses the same three-layer architecture above, and each one stresses the platform in different places.
- The cell scientist designing a new electrolyte. A DOE-driven screen of six formulations across temperature points, weeks of cycling, anomaly detection on running tests, and the role of standard formation analyses in the method library. The agent's value here is in catching the bad variant on day three, not week six.
- The cathode engineer pushing a chemistry boundary. Cross-correlation across material characterization, electrochemical signal, and post-mortem. The platform's role is in the data joins (cell to powder lot to coating run to test result) that make "show me dQ/dV at cycle 100 for the new-dopant variants grouped by coating density" a one-line question instead of a week of pivot tables.
- The cell engineer migrating a chemistry from pouch to 18650. Same active materials, different format. The thermal path changes from planar to radial. The current distribution changes with tab geometry. Hoop stress on the jellyroll changes how high-Ni cathodes age. The platform's role is in matched-comparison rigor: the same protocols, the same methods, the same acceptance criteria applied to two builds of "the same chemistry," so any difference the data shows is the geometry talking, not the test setup.
- The pack engineer tuning BMS parameters from field and bench data. Charge current limits vs. SOC and temperature, balancing thresholds, voltage cutoffs, temperature derating, SOC estimation. Every tweak is a hypothesis to validate against bench cycling and field telemetry. The platform's role is in turning "did the new charge-current curve hurt low-temp cycle life?" from a multi-day data assembly into a question the agent answers against a live cohort.
- The process engineer bringing up a manufacturing line. Different stakes, different cadence: hourly statistical process control, cell genealogy traceability, automated drift alarms. The agent's role shifts from open-ended scientific questions to "surface what's changed and where," including shift handoff reports written from live yield data.
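Most of the week of pivot tables in the cathode example goes into the join chain: linking each cell to its coating run and powder lot, then to its test results. A minimal sketch of that chain with plain dictionaries, grouped by coating density; every record, field name, and value below is invented for illustration, and dqdv_peak_v stands in for whatever dQ/dV feature you actually extract at cycle 100.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical genealogy tables (made-up values for illustration).
powder_lots = {"PL-1": {"dopant": "new"}, "PL-2": {"dopant": "baseline"}}
coating_runs = {
    "CR-1": {"powder_lot": "PL-1", "density_g_cc": 3.4},
    "CR-2": {"powder_lot": "PL-1", "density_g_cc": 3.6},
    "CR-3": {"powder_lot": "PL-2", "density_g_cc": 3.4},
}
cells = {"C-01": "CR-1", "C-02": "CR-1", "C-03": "CR-2", "C-04": "CR-3"}  # cell -> coating run
# Hypothetical test results: cell -> {cycle: dQ/dV peak position in volts}
dqdv_peak_v = {"C-01": {100: 3.71}, "C-02": {100: 3.73}, "C-03": {100: 3.69}, "C-04": {100: 3.75}}


def peak_by_density(dopant: str, cycle: int) -> dict[float, float]:
    """Join cell -> coating run -> powder lot, filter by dopant, and average the
    dQ/dV peak at `cycle` for each coating density."""
    groups: dict[float, list[float]] = defaultdict(list)
    for cell, run_id in cells.items():
        run = coating_runs[run_id]
        if powder_lots[run["powder_lot"]]["dopant"] != dopant:
            continue
        peak = dqdv_peak_v.get(cell, {}).get(cycle)
        if peak is not None:  # skip cells that haven't reached this cycle yet
            groups[run["density_g_cc"]].append(peak)
    return {density: mean(vals) for density, vals in sorted(groups.items())}


print(peak_by_density("new", 100))
```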
Each of these gets its own piece. Different workflows, different conventions in the playbook, different external tools plugged in alongside the platform. Same architecture underneath.
The practical companion is live
This piece is the conceptual overview. The companion, "From Overview to Running: Deploying the Co-Scientist Pattern," walks through four runtime options (DIY, Anthropic Managed Agents, Bedrock, Vertex AI), how Micantis wires in via MCP, and what data residency looks like in regulated industries.
You can read the Nature papers. You can recognize the architecture. The pieces are here: a substrate that keeps your data live and open, tools that take real engineering work to build, a playbook that holds your team's judgment, and an LLM you choose. Compose them the way that fits your work.
References
1. Gottweis, J., Weng, W.-H., Daryin, A., et al. Accelerating scientific discovery with Co-Scientist. Nature (2026). DOI: 10.1038/s41586-026-10644-y.
2. Ghareeb, A. E., Chang, B., Mitchener, L., et al. A multi-agent system for automating scientific discovery. Nature (2026). DOI: 10.1038/s41586-026-10652-y.
3. Why AI cannot do good science without humans. Nature Editorial (19 May 2026). nature.com/articles/d41586-026-01551-3.
4. FutureHouse. Demonstrating end-to-end scientific discovery with Robin: a multi-agent system. futurehouse.org/research-announcements/….
5. Micantis. Battery Co-Scientist Playbook — Skeleton. Markdown template.