AI agents wrong ~70% of time: Carnegie Mellon study

Jaden Norman@lemmy.world · 5 months ago

AI agents wrong ~70% of time: Carnegie Mellon study

outhouseperilous@lemmy.dbzer0.com · 5 months ago

It can’t do 30% of tasks correctly. It can do tasks correctly as much as 30% of the time, and since it’s LLM shit you know those numbers have been more massaged than any human in history has ever been.

jsomae@lemmy.ml · 5 months ago

I meant the latter, not “it can do 30% of tasks correctly 100% of the time.”

outhouseperilous@lemmy.dbzer0.com · 5 months ago

You get how that’s fucking useless, generally?

jsomae@lemmy.ml · 5 months ago

Yes, that’s generally useless. It should not be shoved down people’s throats. But 30% accuracy still has its uses, especially if the result can be programmatically verified.
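A minimal sketch of what "programmatically verified" could look like in practice: keep asking an unreliable generator until a cheap checker accepts the answer. Everything named here (`retry_until_verified`, `flaky_generator`, `checker`, the target value 42) is hypothetical and only stands in for a real agent and a real test suite.

```python
import random
from typing import Callable, Optional


def retry_until_verified(
    generate: Callable[[], int],
    verify: Callable[[int], bool],
    max_attempts: int = 10,
) -> Optional[int]:
    """Return the first candidate that passes verify, or None if all fail."""
    for _ in range(max_attempts):
        candidate = generate()
        if verify(candidate):
            return candidate
    return None


# Hypothetical stand-in for an unreliable agent: it returns the right
# answer (42) about 30% of the time and a wrong one otherwise.
rng = random.Random(0)


def flaky_generator() -> int:
    return 42 if rng.random() < 0.3 else rng.randint(0, 41)


# Hypothetical verifier: in real use this would be a test suite,
# a compiler, a type checker, or similar.
def checker(answer: int) -> bool:
    return answer == 42


result = retry_until_verified(flaky_generator, checker, max_attempts=10)
print(result)  # either 42 or None, never an unverified wrong answer
```

The point of the design is that a wrong answer never escapes: the loop either returns something the checker accepted or gives up. That only helps when the verifier is much cheaper and more reliable than the generator.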

Knock_Knock_Lemmy_In@lemmy.world · 5 months ago

Run something with a 70% failure rate 10x and you get to a cumulative 98% pass rate. LLMs don’t get tired and they can be run in parallel.

MangoCats@feddit.it · 5 months ago

I have actually been doing this lately: iteratively prompting AI to write software and fix its errors until something useful comes out. It’s a lot like machine translation. I speak fluent C++ but not Rust, yet I can hammer away on the AI (with English-language prompts) until it produces passable Rust for something I could have written myself in C++ with half the time and effort.

I also don’t speak Finnish, but Google Translate can take what I say in English and put it into at least somewhat comprehensible Finnish without egregious translation errors most of the time.

Is this useful? When C++ is getting banned for “security concerns” and Rust is the required language, it’s at least a little helpful.

jsomae@lemmy.ml · 5 months ago

I’m impressed you can make strides with Rust with AI. I am in a similar boat, except I’ve found LLMs are terrible with Rust.

jsomae@lemmy.ml · 5 months ago

The problem is they are not i.i.d., so this doesn’t really work. It works a bit, which is in my opinion why chain-of-thought is effective (it gives the LLM a chance to posit a couple answers first). However, we’re already looking at “agents,” so they’re probably already doing chain-of-thought.
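A toy calculation of why the i.i.d. assumption matters. The split below is an assumption made up for illustration, not a figure from the study: suppose 60% of tasks are ones the model never solves, and the other 40% are solved 75% of the time. The average per-attempt success rate is still 30%, but retries stop helping almost immediately.

```python
def p_any_success_iid(p_success: float, attempts: int) -> float:
    """Chance of at least one success if every attempt is independent."""
    return 1.0 - (1.0 - p_success) ** attempts


def p_any_success_mixture(groups, attempts: int) -> float:
    """groups: list of (share_of_tasks, per_attempt_success_probability).

    Attempts are independent only within a task; across the whole
    task population they are correlated through task difficulty.
    """
    return sum(share * (1.0 - (1.0 - p) ** attempts) for share, p in groups)


# Assumed toy split: 60% of tasks are never solved, 40% are solved
# 75% of the time. Average success per attempt: 0.4 * 0.75 = 0.30.
groups = [(0.6, 0.0), (0.4, 0.75)]

iid = p_any_success_iid(0.3, 10)
mixed = p_any_success_mixture(groups, 10)

print(f"independent attempts: {iid:.3f}")   # about 0.972
print(f"correlated (mixture): {mixed:.3f}") # about 0.400
```

Under this assumed split, ten retries top out near 40% instead of about 97%, because no number of retries rescues the tasks the model cannot do at all.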

Knock_Knock_Lemmy_In@lemmy.world · 5 months ago

Very fair comment. In my experience, even when you increase the temperature you get stuck in local minima.

I was just trying to illustrate how 70% failure rates can still be useful.

Log in | Sign up@lemmy.world · 5 months ago

What’s 0.7^10?

Knock_Knock_Lemmy_In@lemmy.world · 5 months ago

About 0.02

Log in | Sign up@lemmy.world · 5 months ago

So the chances of it being right ten times in a row are 2%.

Knock_Knock_Lemmy_In@lemmy.world · edit-2 5 months ago

No, the chances of being wrong 10x in a row are 2%. So the chances of being right at least once are 98%.
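For reference, the three quantities being mixed up in this exchange can be computed directly. This is a sketch that assumes ten fully independent attempts with a 70% failure rate each, which is the idealized case; the variable names are only illustrative. The exact value of 0.7^10 is closer to 0.028, so the figures quoted above are rounded.

```python
p_wrong = 0.7  # per-attempt failure rate from the headline
n = 10         # number of independent attempts (an idealized assumption)

all_wrong = p_wrong ** n            # wrong ten times in a row
all_right = (1 - p_wrong) ** n      # right ten times in a row
at_least_one_right = 1 - all_wrong  # right at least once

print(f"wrong 10x in a row:  {all_wrong:.4f}")          # 0.0282
print(f"right 10x in a row:  {all_right:.7f}")          # 0.0000059
print(f"right at least once: {at_least_one_right:.4f}") # 0.9718
```

Being right ten times in a row is 0.3^10, which is vanishingly small; 0.7^10 is the chance of ten straight failures.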