• 0 Posts
  • 101 Comments
Joined 2 years ago
Cake day: September 2nd, 2023

  • Well yes, the LLMs are not the ones that actually generate the images. They basically act as a translator between the image generator and the human text input. Well, just the tokenizer probably. But that’s beside the point. Both LLMs and image generators are generative AI and have similar mechanisms. Both can create never-before-seen content by mixing things they have “seen”.

    I’m not claiming that they didn’t use CSAM to train their models. I’m just saying that this is not definitive proof of it.

    It’s like claiming that you’re a good mathematician because you can calculate 2+2. Good mathematicians can do that, but so can bad mathematicians.




  • The wine thing could prove me wrong if someone could answer my question.

    But I don’t think my theory is that wild. LLMs can interpolate, and that is a fact. You can ask it to make a bear with duck hands and it will do it. I’ve seen images on the internet of things similar to that generated by LLMs.

    Who is to say interpolating nude children from regular children+nude adults is too wild?

    Furthermore, you don’t need CSAM for photos of nude children.

    Children are nude at beaches all the time, there probably are many photos on the internet where there are nude children in the background of beach photos. That would probably help the LLM.









  • As someone who hates Python more each day: you are absolutely wrong on basically every point.

    The only thing you are right on is the non-enforced types (not even warning logs!).

    First, Python doesn’t “change all the standards”. Languages are different. If they weren’t different, there would only be one language. There is no universal standard that every language has to follow.

    for (x in a) is stupid. You want to know what the “expression” of the for loop is? It’s everything after the for and before the :. You don’t need () at all. In fact, () would be confusing, since you could argue the in is part of the for loop syntax.

    You don’t need to import the types you claim you need to import. list, tuple, dict and set are all built-ins, available without importing.
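
    To make the two points above concrete, here is a minimal sketch. The names (numbers, point, ages, seen, total) are made up for illustration, and the list[int] annotation assumes Python 3.9 or newer.

```python
# No import statements anywhere: list, tuple, dict and set are built-in types.
numbers = [3, 1, 2]          # list
point = (4, 5)               # tuple
ages = {"ana": 31}           # dict
seen = {1, 2, 2}             # set, duplicates collapse


def total(values: list[int]) -> int:
    """Sum a list with a plain for loop (illustrative helper)."""
    result = 0
    # Everything between `for` and `:` is the loop header, no () required.
    for value in values:
        result += value
    return result
```

    Wrapping the header in parentheses, as in for (x in a):, is a syntax error in Python rather than a style choice.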

    I won’t even explain why you are wrong about data structures and tuples. Just that they are not “array-like”.

    It doesn’t run flawlessly on any OS. Many OSes ship with ancient versions of Python, so it’s incredibly easy to have your script not run on another computer because you used features that are too new. There are also 3rd party dependencies that are OS-dependent. But you cannot know that until you run it and it fails on some random function call. And after hours of research you figure out that the error is because your OS is not the same as the developer’s.
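
    One way to at least fail early on the version problem is a check at the top of the script. This is only a sketch: MIN_VERSION and check_python are hypothetical names, and 3.9 is an arbitrary example minimum. It does nothing for OS-dependent third-party packages.

```python
import sys

MIN_VERSION = (3, 9)  # hypothetical minimum, pick whatever your script really needs


def check_python(min_version=MIN_VERSION, current=None):
    """Return an error message if the interpreter is too old, else None."""
    if current is None:
        current = tuple(sys.version_info[:2])
    if current < min_version:
        return (
            f"needs Python {min_version[0]}.{min_version[1]}+, "
            f"found {current[0]}.{current[1]}"
        )
    return None


problem = check_python()
if problem is not None:
    # Stop with a readable message instead of a random failure later on.
    sys.exit(problem)
```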




  • Is there anything in the LLM’s code preventing it from emitting copyrighted code? Nobody outside LLM companies knows, but I’m willing to bet there isn’t.

    Therefore, LLMs DO emit copyrighted code, because they are trained on copyrighted code and because of the statistical nature of LLMs.

    Does the LLM tell its users that the code it outputs is copyrighted? I’m not aware of any instance of that happening. In fact, LLMs are probably programmed to not put a copyright header at the start of files, even if the code they “learnt” from had them. So in the literal sense, they are stripping the code of copyright notices.

    Does the justice system prosecute LLMs for outputting copyrighted code? No it doesn’t.

    I don’t know what definition you use for “strip X of copyright”, but I’d say if you can copy something openly and nobody does anything against it, you are stripping its copyright.




  • Generally agree. Except:

    Logs that are a “debug diary” are not useless. Their purpose is to debug. That’s why there are log levels. If you are not interested in that, filter by log levels above debug.
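
    As a rough sketch of that filtering with Python’s standard logging module (make_logger, emit and the messages are made up for illustration):

```python
import io
import logging


def make_logger(level):
    """Build a logger that writes to an in-memory stream at the given level."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.{level}")
    logger.setLevel(level)       # records below this level are dropped
    logger.propagate = False     # keep the demo output out of the root logger
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger, stream


def emit(logger):
    logger.debug("retrying request, attempt 2")  # the "debug diary" entry
    logger.info("request finished")
    logger.error("request failed")


logger, stream = make_logger(logging.INFO)
emit(logger)
print(stream.getvalue())  # the debug line is filtered out
```

    Setting the level to logging.DEBUG instead brings the diary back without touching the code that writes it.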

    Also, the different formats for fields I see as a necessary evil. Generally, more logs (of verbose log levels) = more good. Which means that they should be as frictionless to write as possible. Forcing a specific format just means that fewer logs will be written.

    The JSON (or any other consistent format) logs seem to be a good idea, but I would limit them to specific log levels (maybe info+error?). So if you want to get wide events, you filter by these log levels to get the full compact picture. But if you are following a debug log chain, it seems a pain to have to search for the “message” field in a potentially order-independent format instead of just reading the log.
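
    A sketch of what that split could look like with the standard logging module. LevelAwareFormatter is a hypothetical name and the JSON fields are just an example, not a proposed schema.

```python
import json
import logging


class LevelAwareFormatter(logging.Formatter):
    """Plain text below INFO, one JSON object per record at INFO and above."""

    def format(self, record):
        message = record.getMessage()
        if record.levelno < logging.INFO:
            # Debug chain: keep it easy to read top to bottom.
            return f"{record.levelname} {record.name}: {message}"
        # Wide events: consistent, machine-readable fields.
        return json.dumps(
            {"level": record.levelname, "logger": record.name, "message": message},
            sort_keys=True,
        )


handler = logging.StreamHandler()
handler.setFormatter(LevelAwareFormatter())
```

    The handler can then be attached to any logger with addHandler, so the code that writes the logs stays unchanged.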

    TL;DR

    Log levels have different purposes, and so they should have different requirements.



  • One of the techniques I’ve seen works like a “password”. So for example, if you often write the phrase “aunt bridge sold the orangutan potatoes” and then a bunch of nonsense after that, then you’re likely the only source of that phrase. So it learns that after that phrase, it has to write nonsense.

    I don’t see how this would be very useful, since then it wouldn’t say the phrase in the first place, so the poison wouldn’t be triggered.

    EDIT: maybe it could be like a building process. You also have to put “aunt bridge” together many times, then “bridge sold” and so on, so every time it writes “aunt”, it has a chance to fall into the next trap, until it reaches absolute nonsense.
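
    A toy way to picture that chain is a bigram counter, which is far simpler than an LLM, so take it only as an illustration of the idea. The corpus, the function names and the trailing “zxqv blorp” (standing in for the nonsense) are all made up.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, which words follow it."""
    table = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            table[current][following] += 1
    return table


def greedy_continue(table, start, max_words=8):
    """Always pick the most frequent next word until none is known."""
    words = [start]
    while len(words) < max_words and table[words[-1]]:
        words.append(table[words[-1]].most_common(1)[0][0])
    return " ".join(words)


normal = ["my aunt baked bread", "the bridge is closed"]
# The poison phrase repeated often, followed by stand-in nonsense.
poison = ["aunt bridge sold the orangutan potatoes zxqv blorp"] * 5

clean_table = train_bigrams(normal)
poisoned_table = train_bigrams(normal + poison)

print(greedy_continue(clean_table, "aunt"))
print(greedy_continue(poisoned_table, "aunt"))
```

    In the poisoned table each pair (“aunt bridge”, “bridge sold”, ...) outweighs the normal continuation, so starting from “aunt” walks step by step into the nonsense.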