• 2 months

      Honestly. At this point, after this has happened to multiple people, multiple times, it’s the only appropriate response.

  • 2 months

    Given that the infrastructure description included the DataTalks.Club website, this resulted in a full wipe of the setup for both sites, including a database with 2.5 years of records, and database snapshots that Grigorev had counted on as backups. The operator had to contact Amazon Business support, which helped restore the data within about a day.

    Non-story. He let Terraform zap his production site without offsite backups. But then support restored it all.

    I’d be more alarmed that a ‘destroy’ command is reversible.

    • 2 months

      Never assume anything is gone when you hit delete.

    • 2 months

      For technical reasons, you never delete records immediately, because doing so is computationally expensive.

      For business reasons, you never want to delete anything at all, because data = money.
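
      A minimal sketch of the usual pattern, assuming Postgres and a hypothetical “records” table with a “deleted_at” column: flag the row cheaply now, purge in bulk later, off-peak.

      ```
      # Hypothetical soft delete via psql: mark rows instead of removing them;
      # a nightly job can batch-purge anything flagged long enough ago.
      psql "$DB_URL" <<'SQL'
      UPDATE records SET deleted_at = now() WHERE id = 42;
      SQL
      ```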

      • 2 months

        Retaining data can mean violating legal obligations. Hidden backups can be a lawyers’ playground.

        • 2 months

          Sure. Go ahead and find them based on pure speculation. First you have to put down $100k for all the forensics. Even if you won the case, show me who is capable of doing something like that.

      • 2 months

        Back in the day, before virtualized services were all “the cloud” as they are today, if you were re-provisioning storage hardware that might be used by another customer, you would “scrub” disks by overwriting them from /dev/random or /dev/zero. If you somehow kept that shit around and something “leaked”, that was a big boo boo and a violation of your service agreement, and the customer would sue the fuck out of you. But now you just contact support and they have a copy laying around. 🤷
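
        For the curious, a scrub pass was roughly this (a sketch; /dev/sdX is a stand-in for whatever disk was being re-provisioned):

        ```
        # Overwrite the whole device so nothing from the previous customer survives.
        shred --verbose --iterations=2 /dev/sdX              # random-data passes
        dd if=/dev/zero of=/dev/sdX bs=1M status=progress    # final zero pass
        ```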

  • Whoever did this was incredibly lazy. Why are you using an agent to run your Terraform commands in the first place if it’s not part of some automation? You’re saving yourself, what, 15 seconds tops? You deserve this kind of thing for being like this.

        • Disaster Recovery. Like a backup, but it also includes a way to rebuild all the infrastructure surrounding it.

          • Maybe they had that, but managed it with terraform. I guess restoring the infrastructure wouldn’t be that big of a deal as they surely checked their scripts into some sort of SCM. I hope.

      • 2 months

        Our DR process is a slow POS … takes far too long to back up and redeploy and set up again.

        I was the one that designed it. I pray I’ll never have to use it.

    • 2 months

      It’s a grifter running a site called “aishippinglabs.com” which charges 500 euros for a “closed community of likeminded individuals”. He’s selling AI slop and a Discord channel to other idiots who will do exactly this kind of shit with little understanding of what’s going on.

  • We used to say RAID is not a backup. It’s redundancy.

    Snapshots are not a backup. They’re a system restore point.

    Only something offsite, off-system, and only accessible with separate authentication details is a backup.

    • AND something tested to restore successfully, otherwise it’s just unknown data that might or might not work.

      (i.e. reinforcing your point, no disagreements)
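
      Something as dumb as this, run on a schedule, catches most dead backups (a sketch; the paths and archive naming are assumptions):

      ```
      # Restore the newest backup into a scratch dir and compare checksums
      # against the live tree. Diff output means the backup is suspect
      # (or the tree changed since the backup ran).
      latest=$(ls -t /backups/site-*.tar.gz | head -n 1)
      scratch=$(mktemp -d)
      tar -xzf "$latest" -C "$scratch"
      diff <(cd /srv/site && find . -type f -exec sha256sum {} + | sort) \
           <(cd "$scratch" && find . -type f -exec sha256sum {} + | sort) \
        && echo "restore OK" || echo "RESTORE FAILED"
      ```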

      • AKA Schrödinger’s Backup. Until you have successfully restored from a backup, it is just an amorphous blob of data that may or may not be valid.

        I say this as someone who has had backups silently fail. For instance, just yesterday, I had a managed network switch generate an invalid config file for itself. I was making a change on the switch, and saved a backup of the existing settings before changing anything. That way I could easily reset the switch to default and push the old settings to it, if the changes I made broke things. And like an idiot, I didn’t think to validate the file (which is as simple as pushing the file back to the switch to see if it works) before I made any changes.

        Sure enough, the change I made broke something, so I performed a factory reset and went to upload that backup I had saved like 20 minutes prior… When I tried to restore settings after the factory reset, the switch couldn’t read the file that it had generated like 20 minutes earlier.

        So I was stuck manually restoring the switch’s settings, and what should have been a quick 2 minute “hold the reset button and push the settings file once it has rebooted” job turned into a 45 minute long game of “find the difference between these two photos” for every single page in the settings.

        • 2 months

          Always a fun time when technology decides to just fuck you over for no reason

        • That’s always just one of the worst feelings in the world. This thing is supposed to work and be easy and… nope. Not there. It’s gone. Now you have work to do. heh

    • 2 months

      3-2-1 Backup Rule: three copies of your data, on two different types of storage media, with one copy offsite.
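
      In cron-job form that can be as small as this (a sketch; the paths and the rclone remote “offsite” are assumptions):

      ```
      # Copy 1 is the live data itself.
      rsync -a /srv/data/ /mnt/backup-disk/data/   # copy 2: second disk, different media
      rclone copy /srv/data offsite:data-backup    # copy 3: offsite, separate credentials
      ```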

    • Fuckin’ yes

      • D/L all assets locally
      • proper 3-2-1 of local machines
      • duty roster of other contributors with same backups
      • automate and have regular checks as part of production
      • also sandbox the stochastic parrot (see the sketch below)
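
      On that last point, sandboxing can be as blunt as a throwaway container with nothing valuable mounted (a sketch; the image name is hypothetical):

      ```
      # The agent gets the project directory and nothing else: no cloud
      # credentials, no network, nothing to destroy but its own workspace.
      docker run --rm -it \
        --network none \
        -v "$PWD":/workspace -w /workspace \
        agent-image:latest
      ```
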
    • I remember back when I first saw a DR plan with three tiers of restore: 1 hour, 12 hours, or 72 hours. I knew that the 1-hour tier meant a simple redirect to a DB partition that was a real-time copy of the active DB, and the 12-hour tier meant that had failed, so it was a restore-point exercise that would mean some data loss, but less than an hour’s worth, or something like that.

      I had never heard of 72 hours, so I raised a question in the meeting. 72 hours meant having physical tapes shipped to the data center, and I believe it meant up to 12 (though it could have been 24) hours of data lost. I was impressed by this, because the idea of a job that ran daily or twice daily creating tape backups was completely new to me.

      This was in the early aughts. Not sure if tapes are still used…

    • 2 months

      No risk, no reward. People are desperate for these tools to help them succeed.

    • Wrong answer. If you don’t give it access, the alternative (ruling out not using AI at all, because leadership will never go for that) is to hire high school kids to take a task from a manager, ask the AI to do it, then do what the AI says, iterating repeatedly toward a solution. The problem with that alternative is that it’s no better than giving the AI access, and it leaves you with no senior tech people. Instead, you give it access, but only give senior tech people access to the AI. Ones who would know to tell the AI to keep a backup of the database, one designed so it can’t be deleted without multiple people signing off.

      Senior tech people aren’t going to spend their time manually trying things an AI needs tried to find a solution. So if you don’t give it access, they won’t use it, and eventually they will all be gone. Then you are even further up shit creek than you are now.

      The overall answer is smarter people talking to the AI, plus guardrails to remove any single point of failure. The latter is nothing new.
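
      That kind of sign-off guardrail already exists off the shelf; e.g. RDS deletion protection (the instance name here is a stand-in), plus Terraform’s lifecycle prevent_destroy on the resource:

      ```
      # AWS will refuse delete requests for this instance - from humans,
      # Terraform, or an agent - until someone deliberately flips the flag off.
      aws rds modify-db-instance \
        --db-instance-identifier prod-db \
        --deletion-protection
      ```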

      • 2 months

        What is this insane rambling?

        The alternative is that the only thing with access to make changes in your production environment is the CI pipeline that deploys your production environment.

        Neither the AI, nor anything else on the developer’s machine, should have access to make production changes.

      • The answer is no AI. It’s really simple. The costs of AI are not worth the output.

      • 2 months

        Nah. As one of those tech people, I am not going to give an LLM write access to anything in production, period.

      • I’m in favour of hiring kids to figure out the solution through iteration, web searches, etc. If they fuck up, then they learn and become better at their job - maybe even becoming a senior themselves eventually.

        I get what you’re saying - seniors are more likely to use the tools effectively, but there are many cases of the AI not doing what it’s told. It’s not repeatably consistent like a bash script.

        People are better - always.

      • Do you do an on-call rotation, by chance? Because anyone who has to respond to night-time pages would not be saying this lol.

  • 2 months

    We don’t need cautionary tales about how drinking bleach caused intestinal damage.

    The people who needed the caution got it in spades and went ahead anyway.

    Or maybe the cautionary tale is to take caution dealing with the developers in question, as they are dangerously inept.

    • Yeah, it’s beyond ridiculous to blame anything or anyone else.

      I mean, accidentally letting loose an autonomous, untested, unguardrailed tool in my dev environment… Well, tough luck, shit happens; something for a good post-mortem to learn from.

      Having an infrastructure that allowed a single actor to cause this much damage? It shouldn’t even be this easy for a malicious human inside the system.

  • 2 months

    “and database snapshots that Grigorev had counted on as backups” – yes, this is exactly how you run “production”.

    • 2 months

      With some of the cloud providers, the built-in backups are linked to the resource. So even if you have super duper geo-zone-redundant backups going back years, they still get nuked if you drop the server.

      It’s always felt a bit stupid, but the backups can still normally be restored by support.
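
      On AWS, for example, the usual escape hatch is to copy the automated snapshot into a manual one and share it with a second account (a sketch; the identifiers and account ID are made up):

      ```
      # Automated RDS snapshots are deleted with the instance; a manual copy
      # survives it, and a copy shared to another account survives even more.
      aws rds copy-db-snapshot \
        --source-db-snapshot-identifier rds:prod-db-2025-01-01-00-00 \
        --target-db-snapshot-identifier prod-db-keep-2025-01-01
      aws rds modify-db-snapshot-attribute \
        --db-snapshot-identifier prod-db-keep-2025-01-01 \
        --attribute-name restore \
        --values-to-add 111122223333
      ```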

      • 2 months

        That’s because these are not backups. With backups you still have your data even if the cloud provider has gone away.

        • 2 months

          They are backups; you potentially get copies of the data in multiple locations across continents.

          BUT I agree, you are relying on them entirely for it. Lots of vendor tie-in stuff in the industry, unfortunately.

          • Is everyone in commercial software development finally saying, “Fuck it, we’ll run the shit ourselves”?

            I’m an infrastructure and devops noob here; take my words with a grain of salt.

            I need GPU clusters with ECC VRAM for research and found it’s cheaper to have my own high-ish performance compute in my own office, paid for once, than to pay AWS/Azure/GCS/etc. forever, or at least every time I want to train a custom DNN model. Sometimes I use Linode, but that’s for monitoring. The point is I can run shit at will and I have data sovereignty.

            Has the paradigm shifted back to developing and serving things in-house, now that big-tech vendor lock-ins have so many dark patterns that scalability isn’t cost-effective with them? Or is it just my own pipe dream?

            • 2 months

              If you are going to use it enough to pay for it, sure. But that’s always been the case.

              The main benefits of cloud are its ability to scale quickly, its geographic reach, and the conversion of capex to opex.

    • 2 months

      If you’ve ever used it you can see how easily it can happen.

      At first you sandbox it and you’re careful. Then after a while the sandbox is a bit of a pain, so you just run it as is. Then it asks for permission a thousand times, and at first you carefully check each command, but after a while you just skim them, and eventually, sure, you can run ‘psql *’ to debug some query on the dev instance…

      It’s one of the major problems with the “full self driving” stuff as well. It’s right often enough that eventually you get complacent or your attention drifts elsewhere.

      This kind of stuff happened before LLM coding agents existed; they have just supercharged the speed, and as a result increased the amount of damage that can be done before it’s noticed.

      There already have to be a bunch of failures in place for something like this to happen - the prod credentials being available, etc. It’s just that now, instead of rolling the dice every couple of weeks, your LLM is rolling them every 20 seconds.

      • 2 months

        If you’ve ever used it you can see how easily it can happen.

        How could this happen easily? A regular developer shouldn’t even have access to production outside of exceptional circumstances (e.g. diagnosing a production issue). Certainly not as part of the normal dev process.

        • 2 months

          They shouldn’t, and we know that, but this is hardly the first time this story has been told, even before LLMs. Usually it was blamed on “the intern” or whatever.

          • 2 months

            This isn’t just an issue with a developer putting too much trust in an LLM though. This is a failure at the organizational level. So many things have to be wrong for this to happen.

            If an ‘intern’ can access a production database then you have some serious problems. No one should have access to that in normal operations.

            • 2 months

              Sure, I’m not telling you how it should be, I’m telling you how it is.

              The LLM just increases the damage done because it can do more damage faster before someone figures out they fucked up.

              This is the last big one I remembered offhand but I know it happens a couple times a year and probably more just goes unreported.

              https://www.cnn.com/2021/02/26/politics/solarwinds123-password-intern

              Why would an intern be given prod supply-chain credentials? Who knows. People fuck up all the time.

      • If you’ve ever used it you can see how easily it can happen.

        Yes, I can see how it can easily happen to stupid lazy people.

  • 2 months

    You either have a backup or will have a backup next time.

    Something that is always online and can be wiped while you’re working on it (by yourself or with AI, doesn’t matter) shouldn’t count as a backup.

    • 2 months

      AI or not, I feel like everybody has had “the incident” at some point. After that, you obsessively keep backups.

      For me it was my entire “Junior Project” in college, which was a music album. My Windows install (Vista at the time - I know, Vista was awful, but it was the only thing that would utilize all 8 GB of my RAM, because x64 XP wasn’t really a thing) bombed out, and I was like “no biggie, I keep my OS on one drive and all of my projects on the other, I’ll just reformat and reinstall Windows”.

      Well… I had two identical 250 GB drives and formatted the wrong one.

      Woof.

      I bought an unformat tool that was able to recover almost everything, but I lost all of my folder structure and file names. It was just 000001.wav, 000002.wav, etc. I was able to re-record and rebuild, but man… Never made that mistake again. Like I said, I now obsessively back up. Stacks of drives, cloud storage, drives in different locations, etc.

      • 2 months

        AI or not, I feel like everybody has had “the incident” at some point. After that, you obsessively keep backups.

        Yup!

        Also, a totally unrelated helpful tip: triple-check your inputs and outputs when using dd to clone a drive. dd works great for cloning an old drive onto a new blank one. It is equally efficient at cloning a blank drive full of nothing but 0s over an old drive that has some 1s mixed in.
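
        A habit that helps (a sketch; /dev/sdX as the old drive and /dev/sdY as the new one are stand-ins):

        ```
        # Identify drives by size/model/serial BEFORE touching dd, then clone
        # old -> new. Swapping if= and of= is the classic career-limiting move.
        lsblk -o NAME,SIZE,MODEL,SERIAL
        dd if=/dev/sdX of=/dev/sdY bs=1M status=progress conv=fsync
        ```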

        • 2 months

          And that’s a great example where a GUI could be way better at showing you what’s what and preventing such errors.

          If you’re automating stuff, sure, scripting is the way to go, but for one-off stuff like this seeing more than text and maybe throwing in a confirmation dialogue can’t hurt - and the tool might still be using dd underneath.

          • 2 months

            Quite true. It’s an argument I often have with the CLI-only people, and have been having for years. Like ‘with this Cisco router I can do all kinds of shit with this super powerful CLI’. Yeah, okay, how do I forward a port? Well, that takes 5 different commands…

            Or if I just want to understand what options are available - a GUI does that far better than a CLI.

            • 2 months

              IMO it’s important to recognise that both are valid in different scenarios. If you want to click through and change something that’s doable in a couple of clicks, that’s fine. If you want to do it through the CLI, that’s also fine - if you’re someone who’s done 10 deployments today and configured the same thing each time, it’ll be muscle memory even if it takes 5 commands.

              • 2 months

                Quite true, there is absolutely a place for both. And it’s why I hate absolutists who think GUIs are some sort of disease. GUIs are discoverable and intuitive: you can lay out all the options for the user so they know what they can choose and make the right choice. CLIs are powerful and scriptable, easy to automate. Neither is bad.

      • 2 months

        TestDisk has saved my ass before. It’s great at recovering broken partitions. If it’s just a quick format with no encryption involved, you have a very high chance of getting your stuff back. That’s assuming, of course, that you catch yourself right after doing the format.

        Other than that, yeah, I’ve also had my moments. Back in high school, not only did I not have money for an external drive - I didn’t even have enough space on my primary one. One time a friend lent me an external drive to do a backup and a clean reinstall - and I can’t remember the details, but something happened such that the external drive got borked - and said friend had important stuff that existed only on that drive. Ironically, it wasn’t even something taking up much space - it was text documents that could’ve lived in an email attachment.

    • He did have a backup. This is why you use cloud storage.

      The operator had to contact Amazon Business support, which helped restore the data within about a day.

  • 2 months

    According to mousetrap manufacturers, putting your tongue on a mousetrap causes you to become 33% sexier, taller and win the lottery twice a week.

    While some experts have urged caution, noting that it may cause painful swelling, bleeding, injury, and distress, and that the benefits are yet to be proven, affiliated marketers all over the world paint a different, sexier picture.

    However, it is not working out for everyone. Gregory here put his tongue in the mousetrap the wrong way and suffered painful swelling, bleeding, injury and distress while not getting taller or sexier.

    Gregory considers this a learning experience, and hopes it will serve as a cautionary tale for other people putting their tongues on mousetraps: from now on he will use the newest extra-strength mousetrap and take precautions, like Hoping Really Hard that it works, when putting his tongue in the mousetrap.

    • 2 months

      It’s so easy. I can’t tell you how many “backed up” environments I’ve run into that simply cannot be restored. Often people set them up but never test them, and just assume the snapshots are working.

      Backups are typically only thought about when you need them, and by then it’s often too late. Real backups need frequent testing and validation; they need remote, off-site storage, with a tested process for restoring from it as well.

      Been doing this shit for 30 years and people will never learn. I’d guess 9 out of 10 backup systems that I’ve run into were there to check a box on an audit, and never looked at otherwise.

      • 2 months

        I was a professional, and I didn’t have a backup of my personal system for about two decades. I just didn’t have another 4 TiB of storage to copy my media library onto. I’m on Backblaze now, but there was a long time there when I did not have a backup, even though I knew better.

        Also, even in a professional setting, I’ve seen plenty of “production support” systems that didn’t have a backup because they grew ad hoc, weren’t the “core business”, and no one both recognized and spoke up about how important they were until after some outage. There’s virtually never a test-restore schedule for such systems, so the backups are always somewhat suspect anyway.

        It’s very easy to find you (or your organization) without a backup, even if you “know better”.

      • Thank you for this comment. I have backups I tested on implementation and rummaged through two years ago after a weird corruption issue, but not once since. I still get alerts about them, so I just assume they’re fine, but first thing Monday I’m gonna test them. I feel stupid for not having implemented regular checks already, but will do so now.

  • 2 months

    Jesus Christ, people. Terraform has a plan output option that allows review prior to an apply. It’s trivial to make a script that throws the JSON output into something like terraform visual if you don’t like the diff format.

    I’ve fucked up stuff with Terraform, but just once, before I switched to a rudimentary script to force a pause, a review, and then the apply.
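
    The pause script is maybe five lines (a sketch; the JSON dump is there for terraform visual or whatever tooling you prefer):

    ```
    # Plan to a file, show the diff (plus JSON for tooling), and only
    # apply the exact plan a human just approved.
    terraform plan -out=tfplan
    terraform show -json tfplan > tfplan.json
    terraform show tfplan
    read -rp "Apply this plan? [y/N] " ok
    [ "$ok" = "y" ] && terraform apply tfplan
    ```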

  • If your dumb fucking ass let an AI near your work AND you didn’t have any recent backups that it couldn’t access, you’re really extra fucking stupid.