Moving From Tool-Calling to Code Execution
In the past year, AI systems have started behaving less like static models and more like dynamic collaborators, connecting to tools, APIs, databases, and cloud systems on our behalf. Yet as this ecosystem grows, we’re hitting an increasingly familiar ceiling: context overload.
Large models can plan, reason, and synthesise beautifully, but when they must ingest, for example:
- 400 tool schemas,
- 80 pages of argument definitions, and
- a 5 MB JSON payload from a database query,
they become slow, fragile, and error-prone. In computational biology and genomics, where workflows routinely involve thousands of rows of gene-disease hits, expression matrices, variant tables, and other data types, the problem becomes even more severe.
Recently, the Model Context Protocol (MCP) standardised how agents interact with external systems. But the real breakthrough comes from a different pattern:
Don’t let the model call tools directly.
Let the model write Python code that calls those tools.
This small inversion has major implications in every domain where datasets are large, schemas are detailed, and workflows are multi-step.
The Context Bottleneck in Multi-Step Biological Queries
To ground this in something familiar, consider a typical (and here deliberately simplified) bioinformatics workflow:
Start from a gene → look up its disease associations → examine its expression in the tissue most relevant to that disease.
This is a mock but representative multi-step reasoning task that appears constantly in genomics, and it exposes a fundamental limitation of traditional direct tool-calling.
In a direct tool-calling setup, the model is forced to absorb far more detail than it needs. A GWAS Catalogue query might return hundreds or thousands of association rows, each with effect sizes, p-values, disease ontology annotations, sample sizes, and study metadata. An Expression Atlas or GTEx lookup might return an entire expression vector across dozens of tissues. All of this flows through the model’s context, even though the model only needs a tiny fraction of the information, typically just the top hit and the relevant tissue.
This is where context overloading becomes a real bottleneck. The model must interpret large, structured biological datasets in its own context window, carry them between tool calls, and reason over them token-by-token. It’s slow, expensive, error-prone, and ultimately unnecessary.
The alternative is simpler and far more scalable:
The model shouldn’t manipulate the data directly; it should describe the workflow by writing the code that performs it.
In this pattern, the model would outline the logic (e.g. in Python), and the execution environment would carry it out: retrieving the data, processing it, and returning only the distilled result. Intent stays with the model; computation moves into code.
A Lightweight Bioinformatics Example: GWAS → Tissue → Gene Expression
Here’s a compact example of how such a workflow could look in practice.
Scenario
For a given gene (e.g., BRCA1),
- fetch all disease associations from the GWAS Catalogue,
- identify the top associated trait,
- infer the tissue most relevant to that trait,
- and then query Expression Atlas (or GTEx) for the gene’s expression in that tissue.
We’ll keep the MCP tool wrappers simple and assume two hypothetical MCP servers:
- gwas_catalog.search_associations(gene: str)
- expression_atlas.get_expression(gene: str, tissue: str)
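For concreteness, here is one hypothetical way those wrapper modules could look. Everything below (the module paths, the dataclass-style input/output models, and field names like rows and tpm) is an assumption chosen to match the example script later in this post, not a real API:

```python
# servers/gwas_catalog.py -- hypothetical MCP wrapper (illustrative only)
from dataclasses import dataclass, field


@dataclass
class SearchAssociationsInput:
    gene: str


@dataclass
class SearchAssociationsOutput:
    # One dict per association, e.g.
    # {"trait": ..., "pvalue": ..., "mapped_tissue": ..., ...}
    rows: list[dict] = field(default_factory=list)


async def search_associations(params: SearchAssociationsInput) -> SearchAssociationsOutput:
    """Forward the query to the GWAS Catalogue MCP server (plumbing omitted)."""
    ...


# servers/expression_atlas.py -- hypothetical MCP wrapper (illustrative only)
@dataclass
class GetExpressionInput:
    gene: str
    tissue: str


@dataclass
class GetExpressionOutput:
    tpm: float  # median expression (TPM) in the requested tissue


async def get_expression(params: GetExpressionInput) -> GetExpressionOutput:
    """Forward the query to the Expression Atlas MCP server (plumbing omitted)."""
    ...
```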
Below we contrast:
- the inefficient direct-tool-calling pattern, and
- the efficient Python code-execution pattern.
1. The Inefficient Way: Direct Tool Calling
What the model has to do:
- Load both tool schemas into context
- Receive the entire GWAS table
- Extract the top association through natural-language reasoning inside the model
- Call the gene-expression tool with manually constructed arguments
- Receive the entire expression matrix
This is what the direct tool-calling flow looks like (conceptually):
```
LLM → tool.call(gwas_catalog.search_associations)
    ← returns a 2000-row GWAS table into model context
LLM filters / reads / infers the top tissue inside its context
LLM → tool.call(expression_atlas.get_expression)
    ← returns the entire expression dataset into context
```
Pain Points:
- Two large biological datasets pass through the model.
- Schemas for both tools reside in the model.
- Filtering and logic happen in the model.
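To put a rough number on the first pain point: the row contents below are invented, and the four-characters-per-token rule is only a heuristic, but the order of magnitude is the message.

```python
import json

# One invented GWAS association row, repeated 2000 times.
row = {
    "trait": "breast carcinoma",
    "pvalue": 2e-12,
    "mapped_tissue": "mammary tissue",
    "study": "GCST000001",
    "sample_size": 120000,
}
payload = json.dumps([row] * 2000)

# ~4 characters per token is a common rule of thumb for English/JSON text.
print(f"{len(payload):,} characters ≈ {len(payload) // 4:,} tokens")
# -> tens of thousands of tokens for a single intermediate tool result
```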
2. The Efficient Way: Code Execution + MCP
Here, the model writes a short Python script and sends it to the execution environment.
All heavy operations happen outside the model.
Python script generated by the model:
```python
# scripts/gene_gwas_expression.py
import asyncio

from servers.gwas_catalog import search_associations, SearchAssociationsInput
from servers.expression_atlas import get_expression, GetExpressionInput


async def main():
    gene = "BRCA1"

    # 1. Fetch associations from the GWAS Catalogue
    associations = await search_associations(
        SearchAssociationsInput(gene=gene)
    )
    rows = associations.rows  # list[dict], stays OUTSIDE the model context
    if not rows:
        print(f"No GWAS hits found for {gene}.")
        return

    # 2. Sort by p-value and take the top association
    rows_sorted = sorted(rows, key=lambda r: r.get("pvalue", 1))
    top_hit = rows_sorted[0]
    top_trait = top_hit.get("trait")
    associated_tissue = top_hit.get("mapped_tissue")

    print(f"Top GWAS trait for {gene}: {top_trait}")
    print(f"Associated tissue: {associated_tissue}")

    if not associated_tissue:
        print("No tissue information available for the top trait.")
        return

    # 3. Query expression in the associated tissue
    expr = await get_expression(
        GetExpressionInput(gene=gene, tissue=associated_tissue)
    )

    # Only send a summary back to the model
    print(f"Expression of {gene} in {associated_tissue}:")
    print(f"Median TPM: {expr.tpm}")


asyncio.run(main())
```
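How does that script actually run? A minimal execution harness can be a subprocess that runs the generated file and captures stdout; only the captured text is returned to the model. The sketch below is hypothetical and leaves out everything a production harness needs (sandboxing, resource limits, network policy):

```python
# harness.py -- hypothetical minimal execution harness (sketch only)
import subprocess
import sys


def run_model_script(path: str, timeout_s: int = 60) -> str:
    """Run a model-generated script; only its stdout goes back to the model."""
    proc = subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    if proc.returncode != 0:
        # Return a short error message, not a full traceback-plus-data dump.
        return f"Script failed: {proc.stderr.strip()[-500:]}"
    return proc.stdout


summary = run_model_script("scripts/gene_gwas_expression.py")
# `summary` (a few printed lines) is all that re-enters the model's context.
```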
What the model sees:
Just the final summary:
```
Top GWAS trait for BRCA1: breast cancer
Associated tissue: mammary tissue
Expression of BRCA1 in mammary tissue:
Median TPM: 1.284
```
Benefits:
- The GWAS table never enters the context.
- The entire expression table never enters the context.
- The schemas are small and loaded on demand (one way to do this is sketched below).
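On that last point, one option is to expose the wrappers as ordinary modules on disk and let the generated code discover and import only what it needs. A hypothetical sketch:

```python
import importlib
from pathlib import Path


def list_servers(root: str = "servers") -> list[str]:
    """Let the model browse available wrappers instead of front-loading
    every tool schema into its context."""
    return sorted(p.stem for p in Path(root).glob("*.py") if p.stem != "__init__")


def load_tool(server: str, tool: str):
    """Import a single wrapper module only when the generated code needs it."""
    module = importlib.import_module(f"servers.{server}")
    return getattr(module, tool)


# e.g. list_servers() -> ["expression_atlas", "gwas_catalog"], after which the
# generated script imports just the one tool it needs:
search_associations = load_tool("gwas_catalog", "search_associations")
```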
Why This Matters (in Bioinformatics and Beyond)
In biology (and many other domains), almost every workflow involves:
- high-dimensional data,
- multiple databases,
- several layers of cross-referencing, and
- complex domain logic.
Doing this with direct tool-calling forces the LLM to juggle raw data structures far larger and more intricate than anything a simple metadata or file-retrieval API would return.
Code execution solves this elegantly:
- All heavy lifting happens in Python
- The model works with summaries
- No large tables enter context
- No schema overload
- No fragile sequence of multiple tool calls the model must orchestrate by hand
This pattern is cleaner, faster, and safer, and it is closer to how humans design computational workflows.
A Final Thought
The more I experiment with this architecture, the more I’m convinced that agentic AI in biology won’t scale through bigger models alone. It will scale through better software engineering practices and better interfaces:
- tools that speak a common protocol (MCP), and
- models that express intent through code rather than micromanagement via natural language.
This separation is not only computationally efficient; it’s cognitively elegant: freeing the model from shuttling data around and letting it operate as the lightweight, efficient reasoning layer it’s meant to be.
