In 100 runs only 8 correctly identify the targeted vulnerability, the rest are false positives or claim that there are no vulnerabilities in the given code.
…
[The] signal to noise ratio is very low, and one has to sift through a lot of wrong reports to get a realistic one.
It was right 8% of the time when presented with the least amount of input to find a known bug. Then, when they opened it up to more of the codebase, its performance decreased.
I’m not going to use something that’s wrong over 92% of the time. That’s insane. That’s like saying my Magic 8 Ball “could be used as a useful tool for helping to detect vulnerabilities.” The fucking rubber ducky on my desk has a more reliable clearance rate.
This is literally the very first experiment in this use case, done by a single person on a model that wasn’t specifically designed for this. The fact that it is able to formulate a correct response at all in this situation impresses me.
It would be easy to criticize this if it were the endpoint and were being advertised as a tool for vulnerability research, but as discussed at the end of the post, this “quick little test” both shows promising initial results and had the fortunate byproduct of actually revealing a new vulnerability. By no means is it implied that it is now ready for use in this field.
The issue with hallucinations is one that, in my opinion, is never going to be totally fixed. That is why I hate the use of AI as a final arbiter of truth, which is sadly how a lot of people use it (“I’ll quickly ask ChatGPT”) and how companies advertise it. What it is good at, however, is coming up with plausible ideas, and in this case having an indication of things to check in code can be a great tool to discover new stuff, as is literally the case for this security researcher finding a new vulnerability after auditing the module themselves.
If I were to ask my Magic 8 Ball “Is the word ‘difinitely’ misspelled?” 100 times, it’s going to reply in the affirmative over 16% of the time. Literally double. This would also be “the very first experiment in this use case, done by a single person on a model that wasn’t specifically designed for this.”
It’s not impressive.
The issue with hallucinations…
This is the real problem: working under the false assumption that there are two kinds of output. It’s all the same output. An LLM cannot hallucinate in the same way that it cannot think or reason. It’s fancy autofill. Predictive text.
You can use it to brainstorm creative solutions, but you need to treat its output for what it is: complicated dice rolls from the tables in the back of the Dungeon Masters Guide. A fun distraction. Implausible fantasy 9 times out of 10.
If I were to ask my Magic 8 Ball “Is the word ‘difinitely’ misspelled?” 100 times, it’s going to reply in the affirmative over 16% of the time.
This comparison makes no sense. Your example has a binary question. In that case, any system that replies correctly at even a rate of around 50% would be useless. However, the problem space in this scenario is way larger than 2 options and still way larger than 100 options. Being correct in even a small number of 100 attempts is still statistically significant.
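To put rough numbers on that argument, here is a minimal sketch that compares 8 hits in 100 runs against pure guessing. It assumes independent runs and a uniform random guesser, which is only a stand-in baseline and not how the experiment in the post was evaluated. The figure of 100 candidate answers is a deliberately low, hypothetical bound on the problem space, and `binomial_tail` is a helper name made up for this example.

```python
from math import comb


def binomial_tail(successes: int, trials: int, p: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Binary question answered by a coin flip: 8 or more hits in 100 tries
# is all but guaranteed, so 8 hits carries no information.
coin_flip = binomial_tail(8, 100, 0.5)

# ASSUMPTION: 100 equally likely candidate answers per run. The real
# problem space is described as "way larger", so this is a generous
# baseline for the random guesser.
many_options = binomial_tail(8, 100, 1 / 100)

print(f"P(>=8 hits | coin flip)         = {coin_flip:.6f}")
print(f"P(>=8 hits | 1-in-100 guessing) = {many_options:.2e}")
```

Under these assumptions the same hit count is unremarkable for a yes/no question and very unlikely to come from chance once there are many possible answers, which is the distinction being drawn here.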
The fact that an LLM is unable to reason and that it is based on statistics doesn’t change anything about this behavior. At the end of the day you get a tool that is able to point you to actual new information that you by yourself did not arrive at.
Imagine that you put a lot of effort into a better model specifically for vulnerability research and you get it up to a correctness rate of a mere 10%. I would gladly hire some programmers to sift through these reports and possibly find overlooked vulnerabilities.
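For a back-of-the-envelope feel for what that sifting costs, a rough sketch follows. It treats every run as one report to review and assumes reports are independent, both of which are simplifications, and the function names are hypothetical.

```python
def reports_per_hit(hits: int, runs: int) -> float:
    """Average number of reports reviewed per correct one."""
    return runs / hits


def chance_of_a_hit(hits: int, runs: int, reviewed: int) -> float:
    """Probability that at least one of `reviewed` reports is correct,
    assuming each report is independently correct at rate hits/runs."""
    rate = hits / runs
    return 1 - (1 - rate) ** reviewed


# 8 correct in 100 runs, as reported in the post.
print(reports_per_hit(8, 100))

# The hypothetical 10% model from the comment above.
print(reports_per_hit(10, 100))

# Odds of seeing at least one correct report after reviewing 20 of them
# at the 8% rate (the batch size of 20 is an arbitrary illustration).
print(round(chance_of_a_hit(8, 100, 20), 3))
```

Whether roughly a dozen reviews per real finding is worth paying for depends on how long each false report takes to dismiss, which this sketch does not model.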