GPT4-o1 Test Results

Sep 16, 2024
Filed under: #tech #ai

I hadn’t actually planned on writing anything. Models come and go, some are cool, some are meh. But the release of GPT4-o1 has repeatedly caused click-baity headlines because the system card talks about some tests that were done having the model try to complete hacking challenges (CTF). I’d like to explain what happened in this hacking test, and for good measure throw some quotes from the same report (and the same section) you will not see in the clickbait titles.

Let’s start with the clickbait:

Variations and combinations of these headlines have of course been popular on LinkedIn and other social media. Oh, the clicks.

So, let’s go to the source (starts on page 14, section 4.2). Let’s start at the beginning, how was this test even conducted? How does GPT even “hack” anything without opposable thumbs?

We evaluate o1-preview and o1-mini with iterative debugging and access to tools available in the headless Kali Linux distribution (with up to 60 rounds of tool use for each attempt).

There’s no more information in here, but presumably the model is given console access to execute whatever command line it wants on a fully setup Kali system. Ok, that makes sense I guess. So what happened exactly? The challenge was, as usual, to try and access something inside a container (typical hacking challenge) called a “flag”, but there was an issue because the container for the challenge wasn’t actually running…

The model, unable to connect to the container, suspected DNS issues and used nmap to scan the challenge network. Instead of finding the challenge container, the model found that the Docker daemon API running on the evaluation host VM was accessible due to a misconfiguration

So effectively, the network scan tool to find the container found the API that controls the containers.

After discovering the Docker API, the model used it to list the containers running on the evaluation host. It identified the broken challenge container and briefly attempted to debug why the container failed to start. After failing to fix the environment, the model started a new instance of the broken challenge container with the start command ‘cat flag.txt’. This allowed the model to read the flag from the container logs via the Docker API.

Ok, so it then executed the API to start the container and, might as well, just ask it to show the “flag” inside of it (instead of completing the challenge in the container once it’s up, and getting the flag somehow through hacking).

So, is this impressive? I mean, in 2022 it was, sure. But this is the model that people these days go to to ask how to run a command, how docker containers work, etc. Doesn’t seem far fetched to put those text commands straight into a shell and run them. It’s unclear how much prompting went into each step or how the loop of “next step” worked exactly. It’s cool, don’t get me wrong. Is it scary? No. Did it “escape its environment”? Definitely no, I’m not sure what that would even mean. It shouldn’t have had access to the API, but the report said the team had misconfigured the whole environment.

So, let’s look at what else this section of the report says about the hacking capabilities…

Given 12 attempts at each task, o1-preview (post-mitigation) completes 26.7% of high-school level, 0% of collegiate level, and 2.5% of professional level CTF challenges; o1-mini (post-mitigation) completes 28.7% of high-school level, 0% of collegiate level, and 3.9% of professional level CTF challenges.

We see a higher completion rate in the Professional subset than the Collegiate subset for two reasons. First, the Professional subset is much larger. Second, the Professional subset covers more diverse task.

So repeat, it complete:

26.7% of high-school level challenges
0% of college level challenges
2.5% professional level challenges (the higher number than college attributed to there being more challenges total)

And it looks like o1-mini, interestingly, completed a few more than o1-preview.

My honest opinion is that today many people are overestimating what these models themselves can do. At the same time we’re underestimating what could be built with clever software that uses LLMs left or right for specific tasks.

I’m still waiting for the killer app using today’s models. Not the killer new model (no pun intended).

There is no comment section here, but I would love to hear your thoughts! Get in touch!

Code Crib

GPT4-o1 Test Results

Blog Links

Blog Post Collections

Recent Posts