Agentic AI, Biology, and What Remains Human

TL;DR: Agentic AI is not just making work faster. It is turning work into fast-moving loops of planning, coding, testing, deployment, and iteration. In biology and pharma, this creates a new challenge: not simply whether agents can produce useful outputs, but whether humans can steer these loops toward the right questions, the right assumptions, and the right outcomes. As agents become better at exploring large spaces of possibilities, the human role may shift from supervising every step to shaping the exploration itself.

A couple of months ago, I wrote about the AI productivity paradox: output is scaling faster than understanding.

Agentic AI pushes this further, but in a different direction.

The interesting shift is not just that we can now generate more work. It is that we can increasingly delegate the loop itself: planning, execution, testing, refinement, and iteration.

That is a much more powerful abstraction.
It also means the central question changes.

Not only:

Did the system produce something useful?

But:

Was the system exploring the right space in the first place?

That is where I think the next challenge lies.

Testing Is Compressed Human Judgment

One of the most interesting effects of agentic AI is that it brings testing back into the spotlight. But the nature of testing is changing.

As agents become better at writing code, building workflows, generating interfaces, and iterating through solutions, testing is no longer just a labour-intensive technical step at the end of development. It becomes a way of encoding judgment into the system.

A good test is not just a software artefact; it is compressed human judgment.

It captures experience, intuition, systems architecture, domain knowledge, and institutional memory in a form the system can repeatedly apply.

It says:

this assumption must hold,
this input should fail,
this output should be invariant,
this shortcut is unacceptable,
this edge case matters,
this result should not be trusted without additional evidence.
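
To make the first three of these concrete, here is a minimal sketch in Python. Everything in it is hypothetical and chosen only for illustration: the `Variant` record, the `filter_rare` function, and the 0.01 threshold are assumptions, not part of any real pipeline. Each test encodes one judgment from the list above: an assumption that must hold, an input that should fail, and an output that should be invariant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical record, used only for this illustration."""
    gene: str
    allele_frequency: float  # expected to lie in [0, 1]


def filter_rare(variants, max_af=0.01):
    """Keep variants at or below max_af (an assumed threshold).

    Malformed input is rejected loudly instead of being filtered silently.
    """
    for v in variants:
        if not 0.0 <= v.allele_frequency <= 1.0:
            raise ValueError(
                f"allele frequency out of range for {v.gene}: {v.allele_frequency}"
            )
    return [v for v in variants if v.allele_frequency <= max_af]


def test_assumption_must_hold():
    # "This assumption must hold": nothing above the threshold survives.
    kept = filter_rare([Variant("A", 0.001), Variant("B", 0.2)])
    assert all(v.allele_frequency <= 0.01 for v in kept)


def test_bad_input_should_fail():
    # "This input should fail": an impossible frequency must raise.
    try:
        filter_rare([Variant("A", 1.5)])
    except ValueError:
        return
    raise AssertionError("out-of-range frequency was silently accepted")


def test_output_is_order_invariant():
    # "This output should be invariant": input order must not change the result set.
    variants = [Variant("A", 0.001), Variant("B", 0.2), Variant("C", 0.005)]
    assert set(filter_rare(variants)) == set(filter_rare(list(reversed(variants))))
```

The code itself is trivial, and an agent could write it in seconds. The judgment is in deciding that these three properties, and not some others, are the ones worth pinning down.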

This is where many of the most important blind spots live.

An agent can generate code quickly. It can generate tests quickly. It can critique its own output and iterate.

But the harder question is whether the tests are testing the right thing.

That requires more than technical fluency. It requires intuition about how systems fail, how users behave, how data gets distorted, and how domain assumptions quietly enter the workflow.

In other words, agentic AI does not make testing less important – it makes testing more strategic.

And in domains where correctness is contextual, ambiguous, or tied to real-world consequences, that strategy still depends heavily on human experience.

Figure 1. Agentic loops can move quickly through planning, coding, execution, observation, and refinement. Trust comes from the validation layer around the loop: meaningful tests, reproducible queries, safe deployment, logged assumptions, and domain constraints.

Why Biology Makes This Concrete

Biology is a prime example because it makes the risk tangible.

A biological agent may produce a fluent, well-structured, scientifically plausible answer and still be wrong because of something small and almost invisible:

A silently misapplied filter.
A deprecated identifier.
A missing metadata field.
A database convention the model did not know.
A genome build mismatch.
A web interface that hides important logic from the user.

These are not “dramatic” failures.
They are the kind of failures that do not announce themselves.
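
One way to make a few of these failures announce themselves is to audit retrieved records before they reach any analysis. The sketch below is an assumption-laden illustration, not a description of any real tool: the field names, the `audit_records` function, and the example identifiers are all hypothetical.

```python
# Hypothetical minimal schema; a real dataset would define its own.
REQUIRED_FIELDS = {"sample_id", "genome_build", "gene_id"}


def audit_records(records, expected_build, deprecated_ids=frozenset()):
    """Return a list of human-readable problems found in the records.

    An empty list only means that none of the checks below fired,
    not that the data is correct.
    """
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - set(rec)
        if missing:
            problems.append(f"record {i}: missing fields {sorted(missing)}")
            continue  # the remaining checks need these fields
        if rec["genome_build"] != expected_build:
            problems.append(
                f"record {i}: genome build {rec['genome_build']!r}, "
                f"expected {expected_build!r}"
            )
        if rec["gene_id"] in deprecated_ids:
            problems.append(f"record {i}: deprecated identifier {rec['gene_id']!r}")
    return problems


# Made-up example data: one clean record and three quiet problems.
records = [
    {"sample_id": "S1", "genome_build": "GRCh38", "gene_id": "GENE_A"},
    {"sample_id": "S2", "genome_build": "GRCh37", "gene_id": "GENE_B"},
    {"sample_id": "S3", "genome_build": "GRCh38", "gene_id": "GENE_OLD"},
    {"sample_id": "S4", "gene_id": "GENE_C"},
]

for problem in audit_records(records, "GRCh38", deprecated_ids={"GENE_OLD"}):
    print(problem)
```

A check like this cannot catch a convention nobody thought to encode. That is exactly why deciding what goes into it remains a human job.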

This is why Anthropic’s recent article on agents in biology is such a fitting case study. The article is interesting not only because it shows that agents can help with biological data tasks, but because it shows how much their usefulness depends on the environment around them: the tools, APIs, retrieval layers, interfaces, and validation mechanisms that make their work checkable.

In biology, a small retrieval error can change the downstream conclusion.

The wrong sequence set.
The wrong cohort.
The wrong annotation.
The wrong filtering logic.

The agent may have understood the intent perfectly, but still lacked a reliable way to execute it.

That distinction matters.

Because in scientific work, a convincing answer is NOT the same as a trustworthy one.
Trust comes from reproducibility, validation, and knowing which assumptions deserve pressure.

From Supervision to Steering

There is a common phrase in AI discussions: human-in-the-loop. It is useful, but increasingly incomplete.

It can make the human role sound like supervision: the agent does the work, the human checks it. That will remain important, but I suspect the higher-value human role will increasingly be about steering.

Agentic work is often an exploration of a huge, open-ended space.

In biology, that space includes targets, mechanisms, datasets, biomarkers, patient subgroups, assays, experiments, translational hypotheses, and strategic decisions. In software and product development, it includes architectures, user needs, workflows, failure modes, deployment choices, and trade-offs.

Agents can explore more of that space than humans could manually. But someone still has to shape the search.

Which direction is worth exploring?
Which assumption is dangerous?
Which shortcut is unacceptable?
Which result is technically valid but strategically irrelevant?
Which failure mode would be costly?
Which uncertainty matters?
Which path is worth pursuing before the evidence is complete?

This may be part of what ingenuity becomes.
Not just producing the answer.
But shaping the exploration.

Not just executing the workflow, but reducing the number of loops needed to reach a useful outcome.

Not just checking whether the agent completed the task, but asking whether it was optimizing for the right thing in the first place.

Figure 2. As agentic systems explore larger spaces of possibilities, the human role shifts from reviewing each output to steering the search toward valuable, safe, and scientifically meaningful paths.

What Remains Human?

As agents become better at planning, coding, testing, critiquing, and improving their own workflows, the human contribution has to move higher up the stack.

Judgment.
Problem formulation.
Scientific instinct.
Strategic taste.
Responsibility for consequences.

And perhaps, above all, resilience.

Because this transition will not be smooth.

The coming wave of AI will affect not only how we work, but how we understand expertise, productivity, value, and identity.

These may sound like philosophical questions, but they are becoming very practical ones for anyone building, deploying, or relying on these systems.

What should we delegate?
What should we verify?
What should we refuse to automate?
How do we keep learning when the tools around us change faster than our habits?

I do not think anyone has a complete answer yet.

But I am increasingly convinced that resilience will become one of the most important traits.

Final Thought

Agentic AI will make work faster, broader, and more exploratory. In biology, this could be incredibly valuable.

But faster is not automatically better.

The best systems will not simply compress entire workflows into efficient agentic loops. They will make those loops testable, steerable, and grounded in human judgment.

At least for now, I still believe humans have a powerful role in producing the kind of “eureka” moments that can outperform even the most impressive brute-force search of modern AI systems. Sometimes the important step is not exploring every possible path, but seeing the one path that suddenly makes the rest irrelevant.

Of course, this view may itself become outdated quickly. The space is moving so fast that what feels true today may need to be rewritten very soon…

So perhaps the most important human trait is not only judgment, or taste, or even ingenuity.

It is resilience: the ability to keep adapting, keep learning, keep questioning the system (and ourselves), and keep steering when the terrain changes 🙂

Dimitrios Vitsios's blog