HackerOne gives a good overview here: Why Hybrid Offensive Security Beats Agentic AI Alone | HackerOne
This paper got picked up by several news sources:
Exclusive | AI Hackers Are Coming Dangerously Close to Beating Humans - WSJ
An $18-an-Hour AI Agent Outperformed Human Hackers in Stanford Study - Business Insider
What drew me to this paper was the use of an AI agent to attack a network in a controlled test, and the attempt to develop a verifiable benchmark for the performance of the selected frameworks. My final paper for my master's coursework was on this very subject. I became aware of the paper when Pascal shared it with the Berkeley Hacking Club.
Looking at the author list, the university network in question is Stanford's; the CMU elements appear to come in via Elliot Jones through his work at Grey Swan AI.
Description of the study and research question: The paper reports on a study that assessed the pentest capabilities of 10 human pentesters and 5 agentic framework configurations. The research question was: "how well do pentest agents hold up against human pentesters?"
Importance of the question being asked: As agent capabilities grow, both attackers and defenders will gain the ability to deploy offensive security agents, either in real attack scenarios or to expand pentest coverage for defensive purposes. This trend will shift the risk management equation, which underscores the importance of this line of inquiry.
Question components as they relate to methods: The question being asked here breaks down into several component questions:
a) How well do the agents handle the execution of more complex attacks?
b) Do the agents find vulnerabilities of higher or lower criticality than their human counterparts?
c) Which pentest methodologies do the data indicate are more effective?
Summary of the elements involved in the study: The participants were 10 vetted professional pentesters and 5 agent configurations: the research team's newly constructed ARTEMIS framework with GPT-5 in both the supervisor and sub-agent roles; a second ARTEMIS instance with an ensemble of models as supervisor and Sonnet 4 as the sub-agent; an instance of the Codex framework running GPT-5; and 2 instances of CyAgent, one running Sonnet 4 and the other GPT-5. Performance was scored on the criticality of the vulnerabilities discovered, with additional rewards for defeating complexity. All pentesting took place over a 10-hour window. Because the testing was conducted on Stanford University's live computer science network, human overseers made sure that agentic efforts were not destructive.
Assessment of validity:
- No penalty for false positives: It may seem that a participant simply gaining no points for a false positive or a duplicate would be enough. However, this fails to acknowledge the problem of false positives at scale in real-world situations, where time is required to assess and identify them. At agentic scale, the number of false positives will be at best non-trivial, at worst catastrophic. So a false positive not only adds zero value, it also diminishes the time available to respond. Even with AI-assisted review of false positives, this creates a gap in response time, however small. A proper way to reflect this shortcoming is to add a penalty for false positives that reflects the additional engineering (more AI?) and/or human expert time required to oversee any false-positive filtering plan. A fair proposal is to weight the scoring in line with accuracy: if an agent framework or human pentester is only 50% accurate, only half the points scored are received.
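The proposed accuracy weighting can be sketched in a few lines (the function and the example figures are mine, not the paper's):

```python
def accuracy_weighted_score(raw_points: float, valid_findings: int, total_findings: int) -> float:
    """Scale a participant's raw score by their reporting accuracy.

    A tester who is only 50% accurate keeps only half their points,
    reflecting the review time their false positives consume.
    """
    if total_findings == 0:
        return 0.0
    accuracy = valid_findings / total_findings
    return raw_points * accuracy

# A hypothetical agent with 40 raw points but a 50% false-positive rate
# keeps only 20 of them:
print(accuracy_weighted_score(40.0, 5, 10))
```

A fully accurate tester is untouched by this rule, so it only punishes noise, not ambition.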
- Complexity viewed as higher value: For the purpose of showcasing the ability of a tool like ARTEMIS to handle complex attacks, this aligns well. However, if, as the abstract and introduction state, the purpose of this work is to assess machine pentesters against human pentesters, then, as in the real world, the complexity of a vulnerability should be measured as a barrier, a filter on the reward that is severity. Given that the complexity scale is the sum of 2 ratings from 1-10, a complexity of 20 means that exploitation is highly unlikely; for this exercise, treating a vulnerability's TC x 5 as a percentage reduction would provide the appropriate filtering of the points granted by severity.
- Reward system skewed to high-complexity/high-severity: An examination of P1 and P4 shows another flaw in the scoring methodology. Both detected the same number of vulnerabilities, but P4 gathered more severe vulnerabilities with lower complexity, while P1 gathered less severe vulnerabilities with more complexity. While I understand that the point of the experiment is to showcase the ability to deal with real-world complexities, in my opinion the path of least resistance to the greatest reward should win out. This is not simply a "low-hanging fruit" issue but rather the ability to triage options. It reveals that there were different pentest philosophies at play, with the scoring set up to favor one over all others. If other philosophies are adopted (e.g. lowest complexity for the most severity), how do the scores shift?
- Lack of comprehensive comparison to human methods: The skew in the reward system that ignores the ability to triage options reveals another hole in the experiment's methodology: what opportunities did the human pentesters see and choose not to pursue, and why? We are shown the pentest methods of P2 in Appendix E because they closely resembled the methods A2 used, and both found the same vulnerability. P2 chose to come back to it later and possibly ran out of time, while A2 exploited it immediately. But while we are shown P2's and A2's interactions with the LDAP server, we are shown none of the other detections that P2 may have found and passed over. Nor are we shown the methods P4 and P1 used, or the opportunities they saw and decided not to pursue. The data indicates a clear oversight in the experiment's methodology: the human pentesters evidently saw more opportunities than they went after, and the agents evidently did not have the ability to choose. This is an important unexposed delta the authors have not chosen to investigate or share.
- Problems in the A1 and A2 configurations: In section 4.2 we see that the model for the A1 supervisor and sub-agents is GPT-5. This is set up to allow comparison to Codex and CyAgent running GPT-5. The A2 implementation is set up to measure sub-agent model effectiveness. However, A2 uses an ensemble approach to supervision; unseen and unaddressed in the paper is how supervisor selection works and whether one supervisor calling the sub-agent generates a different result than another. A more scientific approach would have been to break the framework out into individual supervisors. This would not only have closed the above gap in assessment but also provided valuable information on ensemble vs. individual supervisor configurations - not to mention how those individual-supervisor scores compare against human pentesters.
- Questionable complexity and severity ratings: The ratings of some of the vulnerabilities found show some odd numbers. Two come into question. The first, "Unauthenticated VNC on Ubuntu Workstation", was reported by P6 as Critical but was revalued to Low; investigation indicates that most notices for this vulnerability list it as Medium. This is minor. "Password in Public SMB Share", however, is given very high DC and EC complexity for what it is. This is very low-hanging fruit: it is not hard to find a public SMB share and then search its files for potential password-like word pairings. Numerous "easy"-rated Hack The Box lab machines feature this very issue. This tells us to treat the accuracy of the technical complexity ratings with caution.
A "Real World" Reranking: As an experiment, I wanted to see what would happen if the scoring was rebuilt to reflect the real-world ideas presented above. (code here) This reworking treats severity as the value of the bug and thus the points awarded, scales down the points as the complexity of the vuln increases, and then applies a penalty to all points awarded to a human or agent based on the accuracy of their reported findings. This compensates for the waste caused by literal AI slop or poor validation from a human expert, and aligns with the industry standard under which a low-complexity, high-severity vulnerability scores higher in vulnerability measurement frameworks (e.g. CVSS). When we apply these changes, we get the following results:
['P4', 48.6]
['P2', 32.3]
['P3', 32.0]
['P5', 30.4]
['A2', 30.2]
['P1', 27.5]
['P8', 21.2]
['P10', 17.5]
['CO', 17.5]
['P9', 15.4]
['A1', 11.0]
['P6', 10.4]
['P7', 8.7]
['CG', 6.1]
['CS', 4.6]
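The reranking logic described above can be sketched as follows. This is a minimal reconstruction under my assumptions, not the actual code used: the field names and sample finding are invented, and I read "TC x 5" as a percentage reduction of the severity points.

```python
def rerank_score(findings):
    """Score one participant under the 'real world' rules sketched above.

    Each finding carries a severity score (the reward) and a technical
    complexity TC on a 2-20 scale (two 1-10 ratings summed). Complexity
    filters the reward down rather than boosting it: TC x 5 is treated
    as a percentage reduction, so TC = 20 zeroes the points. Finally,
    the participant's overall accuracy scales the total, penalizing
    false positives.
    """
    if not findings:
        return 0.0
    valid = [f for f in findings if f["valid"]]
    accuracy = len(valid) / len(findings)
    total = 0.0
    for f in valid:
        total += f["severity"] * max(0.0, 1.0 - (f["tc"] * 5) / 100)
    return round(total * accuracy, 1)

# Hypothetical participant: two valid findings and one false positive.
sample = [
    {"severity": 9.0, "tc": 4, "valid": True},   # high severity, low complexity
    {"severity": 5.0, "tc": 14, "valid": True},  # medium severity, high complexity
    {"severity": 7.0, "tc": 6, "valid": False},  # false positive, scores nothing
]
print(rerank_score(sample))
```

Note how the false positive hurts twice: it scores nothing itself and it drags the accuracy multiplier down on everything else, which is exactly the triage pressure the original scoring lacked.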
These results are very interesting. P4, the high-severity/low-complexity pentester, moves into a massive lead. A1 drops significantly behind Codex running GPT-5. Since this is a test of the framework, this shift indicates that A1 is a middling framework that suffers not only from a false-positive rate similar to Codex's, but also because Codex is apparently built to go after high-severity, low-complexity vulnerabilities. If you can only run one model in your agent framework, of the frameworks presented here you should choose Codex.
But the really interesting part is what happens to A2. Sure, it gets beaten by four humans instead of just one. But that is still astoundingly successful. When we look at the numbers, we see that A2 is keeping up with a pack of four human pentesters: P2, P3, P5 and P1. P1 is only 2.7 points behind, and P2 is only 2.1 points ahead. For that group it was anyone's game; had things gone just a little differently, A2 could have ended up in 2nd place again.
This is astounding and definitely speaks to the choice of sub-agent - however, as stated above, we don't know how much the ensemble supervisors played into the situation. Should the credit go to the ARTEMIS framework? A1's results say probably not. CyAgent running Sonnet 4 calls into question the novelty of Sonnet 4 alone, but then again CyAgent doesn't have the supervisor architecture of Codex or ARTEMIS. That leaves either Sonnet 4 with the ensemble supervisors, or Sonnet 4 with a specific supervisor or subset of supervisors within the ensemble, as the source of the novelty. I would wager the latter two are the most likely. However, as called out above, we don't get visibility into that.
A note on cost: A2 is the $60-per-hour version - the same cost as the pentesters it keeps pace with, except it doesn't have to stop. ARTEMIS runs 24 hours, doing the work of three pentesters at the cost of three pentesters working 8 hours a day. But you still need a human pentester for eight hours a day to make sure ARTEMIS isn't confabulating or otherwise failing and to validate the vulnerabilities it finds - so you now pay the cost of four pentesters where you once only had to pay three. That also ignores the hidden costs of ARTEMIS: who deploys and maintains it, especially if you are running at scale? And if you scale ARTEMIS out to provide enterprise coverage, you will probably need to hold on to all your pentesters to cover the results. In the end, agentic pentesting appears to offer no real cost savings, just more spend for improved coverage through automation. Some news outlets have looked only at the abstract and come away with the misinterpretation that the A2 capability is available at the $18/hour price point.
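The cost arithmetic above works out as follows (the $60/hour rate is from the paper; the comparison framing is mine):

```python
HOURLY_RATE = 60               # $/hour, both A2 and the human pentesters

agent_day = HOURLY_RATE * 24   # one A2 instance running around the clock
human_day = HOURLY_RATE * 8    # one pentester's 8-hour day

print(agent_day)               # daily cost of one A2 instance: $1440
print(agent_day / human_day)   # = 3.0: one agent-day costs three pentester-days
print(agent_day + human_day)   # add the human reviewer: the "fourth pentester", $1920
```

So the agent's advantage is coverage (24/7 operation), not price: the daily bill equals three pentesters' days before the mandatory human validation is counted.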
Takeaway: While the experiment and the results are valid, at some point the focus of the paper shifted from measuring AI agents against human pentesters to showcasing the capabilities of ARTEMIS, which in my opinion should have been a separate paper given the excellent performance of the tool. What we see is that the A2 ($60/hour) variant of ARTEMIS, with ensemble supervisor and Claude Sonnet 4, matches senior-level pentest capabilities hands down, but at the same cost as a senior pentester, without the wisdom to triage targets under time constraints, though with the potential to tackle more difficult targets if its complexity assessments are accurate. The A1 variant, however, shows that outside of that ability to handle some complexity, it approaches a 50% false-positive rate and is less effective than the Codex framework.