Comments

  • By yeknoda 2024-12-09 18:10 | 27 replies

    I've found, using these and similar tools, that the number of prompts and iterations required to create my vision (the image or video in my mind) is very large, and often the tool still can't create what I had originally wanted. A way to test this is to take a piece of footage or an image as the ground truth, and measure how much prompting and editing it takes to reproduce that ground truth starting from scratch. It is basically not possible with the current tech and a finite amount of time and iterations.

    • By jerf 2024-12-09 18:51 | 11 replies

      It just plain isn't possible if you mean a prompt the size of what most people have been using lately, in the couple-hundred-character range. By sheer information theory, the number of possible interpretations of "a zoom in on a happy dog catching a frisbee" means that you cannot match a particular clip out of that set with just that much text. You will need vastly more content: information about the breed, the frisbee, the background, the timing, the framing, the lighting, and so on and so forth. Right now the AIs can't handle that, which is to say, even if you sit there and type a prompt containing all that information, the model is going to be forced to ignore most of it. Under the hood, with the way the text is turned into vector embeddings, it's fairly questionable whether you'd agree that it can even represent such a thing.
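The counting argument above can be made concrete with a back-of-envelope calculation. Both constants below are assumptions chosen for illustration (roughly 1 bit per character of effective English entropy, and a clip compressed to about 1 bit per 1000 raw pixels), not measurements:

```python
# Back-of-envelope: bits in a short prompt vs. bits in a short clip.
# Both constants are illustrative assumptions, not measured values.
prompt_chars = 200
prompt_bits = prompt_chars * 1          # ~1 bit/char of effective English entropy

# A 5-second 720p clip at 24 fps, even compressed to ~1 bit per 1000 raw pixels:
raw_pixels = 1280 * 720 * 24 * 5
clip_bits = raw_pixels // 1000

print(prompt_bits, clip_bits)  # 200 vs 110592
# A ~200-bit prompt can distinguish at most 2^200 clips; the clip itself has
# orders of magnitude more degrees of freedom, so the model must invent the rest.
```

Even with very generous compression assumptions, the prompt pins down a vanishingly small fraction of the clip's degrees of freedom.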

      This isn't a matter of human-level AI or superhuman-level AI; it's just straight-up impossible. If you want the information to match, it has to be provided. If it isn't there, an AI can fill in the gaps with "something" that will make the scene work, but expecting it to fill in the gaps the way you want, even though you gave it no indication of what that is, is expecting literal magic.

      Long term, you'll never have a coherent movie produced by stringing together a series of textual snippets because, again, that's just impossible. Some sort of long-form "write me a horror movie starring a precocious 22-year-old elf in a far-future Ganymede colony with a message about the importance of friendship" AI that generates a coherent movie of many scenes will have to do a lot of internal communication in some internal language to hold the result together between scenes, because what it takes to keep things coherent between scenes is an amount of English text not entirely dissimilar in size from the underlying representation itself. You might as well skip the English middleman and go straight to an embedding not constrained by a human language mapping.

      • By LASR 2024-12-09 19:44 | 1 reply

        What you are saying is totally correct.

        And this applies to language / code outputs as well.

        I've lost count of the number of times engineers at my company have typed out five sentences and then expected a complete React webapp.

        But what I've found in practice is that using LLMs to generate the prompt, steered by low-effort human input (e.g. thumbs up/down, multiple choice), is quite useful. It generates walls of text, but with metaprompting that's kind of the point. With this I've definitely been able to get high ROI out of LLMs. I suspect the same would work for vision output.
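A minimal sketch of that metaprompting loop. The `llm` callable stands in for any chat-completion wrapper and `feedback` for a cheap human thumbs-up/down callback; both names are hypothetical, not a real API:

```python
def metaprompt(goal: str, llm, feedback=None, rounds: int = 3) -> str:
    """Expand a terse goal into a detailed generation prompt, steering
    each round with low-effort human input (a thumbs up/down callback).

    `llm` is a placeholder: any function mapping a prompt string to a
    completion string. `feedback(draft) -> bool` returns True when the
    human is satisfied.
    """
    draft = llm(f"Expand this into a detailed generation prompt: {goal}")
    for _ in range(rounds):
        if feedback is None or feedback(draft):  # thumbs up -> done
            break
        draft = llm(
            "The user rejected the prompt below. Identify what is "
            f"underspecified and produce a more detailed version:\n{draft}"
        )
    return draft
```

The wall-of-text prompt the loop produces is the point: it carries the information the short five-sentence version lacked.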

        • By kurthr 2024-12-09 21:39

          I'm not sure, but I think you're saying what I'm thinking.

          Stick the video you want to replicate into o1 and ask for a descriptive prompt to generate a video with the same style and content. Take that prompt and put it into Sora. Iterate with human- and o1-generated critical responses.

          I suspect you can get close pretty quickly, but I don't know the cost. I'm also suspicious that they might have put in "safeguards" to prevent some high profile/embarrassing rip-offs.

      • By robotresearcher 2024-12-09 19:25 | 6 replies

        > Long term, you'll never have a coherent movie produced by stringing together a series of textual snippets because, again, that's just impossible.

        Why snippets? Submit a whole script the way a writer delivers a movie to a director. The (automated) director/DP/editor could maintain internal visual coherence, while the script drives the story coherence.

        • By coffeebeqn 2024-12-09 20:58 | 2 replies

          This almost certainly won't work. Feel free to feed in any of the hundreds of existing film scripts and test how coherent the models can be. My guess: not at all.

          • By robotresearcher 2024-12-09 23:11 | 2 replies

            The clips on the Sora site today would have been utterly astonishing ten years ago. Long term progress can be surprising.

            • By dragonwriter 2024-12-10 0:59 | 1 reply

              > The clips on the Sora site today would have been utterly astonishing ten years ago.

              Yeah, and Apollo 11 would have been utterly astonishing a decade before it occurred. And, yet, if you tried to project out from it to what further frontiers manned spaceflight would reach in the following decades, you’d…probably grossly overestimate what actually occurred.

              > Long term progress can be surprising.

              Sure, it can be surprising for optimists as well as naysayers; as a good rule of thumb, every curve that looks exponential in an early phase ends up being at best logistic.

              • By hatefulmoron 2024-12-10 2:10 | 1 reply

                In the long run we are all dead. Saying that technology will be better in the future is almost eye-roll worthy. The real task is predicting what future technology will be, and when it will arrive.

                Ask anyone with a chronic illness about the future and they'll tell you we're about 5 years off a cure. They've been saying that for decades. Who knows where the future advancements will be.

          • By sleepybrett 2024-12-09 21:22 | 1 reply

            This will almost certainly be in theaters within 5 years, probably first as a small experimental project (think Blair Witch).

            • By runarberg 2024-12-10 2:52 | 4 replies

              The Blair Witch Project was a (surprise) creative masterpiece. It worked with very limited technology to create a very clever plot, paired with amazing marketing. The world hadn't seen that combination before. It took some creative geniuses to piece The Blair Witch Project together.

              Generative AI will never produce an experience like that. I know never is a long time, but I’m still gonna call it. You simply can’t produce such a fresh idea by gathering a bunch of data and interpolating.

              Maybe someday AI will be good enough to create shorter or longer videos with some dialog and even a coherent story (though I doubt it), but it won't be fresh or creative. And we humans will at best enjoy it for its stupidity or sloppiness, not for its cleverness or artistry.

              • By dumbfounder 2024-12-10 3:30 | 1 reply

                Why does the idea need to be generated by AI? Let people generate the ideas, the AI will help execute. I think soon (3-5 years) a determined person with no video skills will be able to put together a compelling movie (maybe a short). And that is massive. AI doesn’t have to do everything. Like all tech, it’s a productivity tool.

                • By krainboltgreene 2024-12-10 9:34 | 1 reply

                  > Why does the idea need to be generated by AI?

                  This is the at-first-fun-but-now-frustrating infinite goal move. "AI (a stand in for literally anything) will do (anything) soon." -> "It won't do (thing), it's too complex." -> "Who said AI will do (thing)?"

                  • By flappyeagle 2024-12-10 20:36

                    AI will self-drive cars in San Francisco

              • By Breza 2024-12-13 16:06 | 1 reply

                I'm suspicious of most claims of AI growth, but I think screenwriting is an area where there's real potential. There are many screenplays out there, many movie plots are very similar to each other, and human raters could help with training. And it's worth noting that the top four highest grossing movies right now are all sequels or film adaptations. It's not a huge leap to imagine an LLM in the future that's been trained on movie writing being able to create a movie script when given the Wicked musical. https://www.imdb.com/chart/boxoffice/

                • By runarberg 2024-12-13 21:42

                  The 2023 Writers Guild of America strike was in part to prevent screenplays being written entirely by generative AI.

                  So no, I don't think this will happen either. Authors may use AI themselves as one tool in their toolbox as they write their script, but we will not see entire production screenplays written by generative AI for theatrical release. The industry will simply not allow that to happen. At most you can have AI write a screenplay for your own amusement, not for publication.

              • By sleepybrett 2024-12-11 18:36 | 1 reply

                I'm thinking more of a Gibsonian 'Garage Kubrick': a solitary auteur (or small team) that produces the film alone, perhaps without even touching a camera, generating all the footage using AI (in the novel the auteur creates all the footage through photo/found-footage manipulation, or at least that's all we see in the text). The script will probably be human-written. I'm not talking about an AI producing a film from scratch, but rather a film produced using AI to create all the visuals and audio.

                • By runarberg 2024-12-11 20:32 | 1 reply

                  That is a far more reasonable prediction, but I don't even see this future. This kind of "film making" will at best be something generated for the amusement of the creator (think: give me a specific episode of Star Trek where Picard ...) or as prototypes or concepts for works yet to be filmed with actual actors. And it certainly won't be in theaters, not in 5 years, or ever.

                  Generative AI will not be able to approach the artistry of your average actor (not even a bad actor); it won't be able to match the lighting or the score to the mood (unless you carefully craft that in your prompt). It won't get creative with the camera angles (again, unless you specifically prompt for a specific angle) or the cuts. And it probably won't stay consistent with any of these, or break the consistency at the right moments, like an artist could.

                  If you manage to prompt the generative AI to create a full feature film with excellent acting, the correct lighting given the mood, a consistent tone with editing to match, etc., you have probably spent much more time and money crafting the prompt than would otherwise have gone into simply hiring the crew to create your movie. The AI movie will certainly contain slop and be so visibly bad that it is guaranteed not to reach theaters.

                  Now if you hired that crew to make the movie instead, that crew might use AI as a tool to enhance their artistry, but you still need your specialized artists to use that tool correctly. That movie might make it to the theaters.

                  • By sleepybrett 2024-12-13 6:29 | 1 reply

                    The Blair Witch Project looked like shit ('the cinematography doesn't approach a true director of photography'), the actors were shit... etc. Given the right script and concept it can be amazing, and the imperfection of AI can become part of the aesthetic.

                    • By runarberg 2024-12-13 21:49

                      It was still a creative stroke of genius. The shit acting, along with the shit cinematography, was preceded by a brilliant marketing campaign that primed you to expect this lack of skill from the film makers.

                      In music you also have plenty of artists that have no clue how to play their instruments, or progress their songs, but the music is nonetheless amazing.

                      Skill is not the only quality of art. A brilliant artist works with their limitations to produce work which is better than the sum of its parts. It will take AI the luck of ten billion universes before it produces anything like that.

              • By SamPatt 2024-12-10 4:20 | 1 reply

                It's a tool. The cleverness and artistry comes from the humans, not from the tools they use.

                The AI isn't creating the fresh ideas. People are.

                • By runarberg 2024-12-10 17:28

                  So what you are saying is some aspects of movie making will use AI as parts of their jobs. That is very realistic and probably already happening.

                  Saying that large video models will be in theaters sounds like a completely different and much more ambitious prediction. I interpreted it as saying that large video models will produce whole movies on their own from a script of prompts; that there will be a single film maker with only a large video model and some prompts to make the movie. Such films will never be in the theater, unless by some grifter, and then it is certain to be a flop.

        • By troupo 2024-12-09 20:46 | 4 replies

          You should watch how movies are made sometime. How a script is developed. How changes to it are made. How storyboards are created. How actors are screened for roles. How locations are scouted, booked, and changed. How the gazillion of different departments end up affecting how a movie looks, is produced, made, and in which direction it goes (the wardrobe alone, and its availability and deadlines will have a huge impact on the movie).

          What does "EXT. NIGHT" mean in a script? Is it cloudy? Rainy? Well lit? What are camera locations? Is the scene important for the context of the movie? What are characters wearing? What are they looking at?

          What do actors actually do? How do they actually behave?

          Here are a few examples of script vs. screen.

          Here's a well described script of Whiplash. Tell me the one hundred million things happening on screen that are not in the script: https://www.youtube.com/watch?v=kunUvYIJtHM

          Or here's the Joker interrogation from The Dark Knight. Same million different things, including actors (or the director) ignoring instructions in the script: https://www.youtube.com/watch?v=rqQdEh0hUsc

          Here's A Few Good Men: https://www.youtube.com/watch?v=6hv7U7XhDdI&list=PLxtbRuSKCC...

          and so on

          ---

          Edit. Here's Annie Atkins on visual design in movies, including Grand Budapest Hotel: https://www.youtube.com/watch?v=SzGvEYSzHf4. And here's a small article summarizing some of it: https://www.itsnicethat.com/articles/annie-atkins-grand-buda...

          Good luck finding any of these details in any of the scripts. See minute 14:16 where she goes through the script

          Edit 2: do watch The Kerning chapter at 22:35 to see what it actually takes to create something :)

          • By shermantanktop 2024-12-09 22:48 | 1 reply

            I can't upvote this enough. This topic in the media space has generated a huge amount of naive speculation that amounts to "how hard could it be to do <thing i know nothing about>?"

            • By FranzFerdiNaN 2024-12-10 7:44 | 1 reply

              > "how hard could it be to do <thing i know nothing about>?"

              This is most Hacker News comments summarized lmao. It's kinda my favorite thing of this place: just open any thread and you immediately see so many people rushing to say ''well just do X or Y'' or ''actually it's X or Y and not Z like the experts claim''. Love it.

              • By shermantanktop 2024-12-10 15:46

                In this case, it’s movies and TV, which most people enjoy. So there’s a superficial accessibility to the problem which encourages this attitude.

                Of course, HN being the place that it is, the same type of comments are made about quantum entanglement and solar panel efficiency.

          • By bunabhucan 2024-12-10 0:49

            I agree with you.

            At the same time, I am curious, in the "that person has too many fingers" sense, about what a system trained on tens of thousands of movies plus scripts plus subtitles plus metadata etc. would generate.

            I thought about it for a bit and I would want to watch a computer generated Sharknado 7 or Hallmark Christmas movie.

          • By robotresearcher 2024-12-09 23:08 | 1 reply

            Of course normally other people contribute to a movie after the writer. My comment mentioned three of the important roles. This whole thread is about tech that automates away those roles. That's the whole point.

            • By dbspin 2024-12-10 1:53 | 1 reply

              I think you've misunderstood the objection.

              Let's pick something concrete. It's a medieval script, and it opens with two knights fighting. Later in the script we learn their characters, historic counterparts, etc. So your LLM can match a nefarious villain to some kind of embedding, and has doubtless trained on countless images of a knight.

              But the result is not naively going to understand the level of reality the script is going for - how closely to stick to historic parallels, how much to go fantastical with the depiction. The way we light and shoot the fight and how it coheres with the themes of the scene, the way we're supposed to understand the characters in the context of the scene and the overall story, the references the scene may be making to the genre or even specific other films etc.

              This is just barely scraping the surface of the beginnings of thinking about mise en scene, blocking, framing etc. You can't skip these parts - and they're just as much of a challenge as temporal coherence, or performance generation or any of the other hard 'technical issues' that these models have shown no capacity to solve. They're decisions that have to be made to make a film coherent at all - not yet good or tasteful or creative or whatever.

              Put another way: you'd need AGI to comprehend a script at the level of depth required to do the job of any HOD on any film. Such a thing is doubtless possible, but it's not going to be shortcut naively the way generating an image is, because it requires understanding in context, which is precisely what LLMs lack.

              • By robotresearcher 2024-12-10 2:42 | 2 replies

                > but the result is not naively going to understand the level of reality the script is going for…

                We can already get detailed style guidance into picture generation. Declaring you want Picasso cubist, Warner Brothers cartoon, or hyper-realistic works today. So do lighting instructions, color palettes, and so on.

                These future models will not be large language models, they will be multi-modal. Large movie models if you like. They will have tons of context about how scenes within movies cohere, just as LLMs do within documents today.

                • By troupo 2024-12-10 9:39 | 3 replies

                  So, we went from "just hand off the movie script to an automated director/DP/editor" and we're now rapidly approaching:

                  - you have to provide correct detailed instructions on lighting

                  - you have to provide correct detailed instructions on props

                  - you have to provide correct detailed instructions on clothing

                  - you have to provide correct detailed instructions on camera position and movement

                  - you have to provide correct detailed instructions on blocking

                  - you have to provide correct detailed instructions on editing

                  - you have to provide correct detailed instructions on music

                  - you have to provide correct detailed instructions on sound effects

                  - you have to provide correct detailed instructions on...

                  - ...

                  - repeat that for literally every single scene in the movie (up to 200 in extreme cases)

                  There's a reason I provided a few links for you to look at. I highly recommend the talk by Annie Atkins. Watch it, then open any movie script, and try to find any of the things she is talking about there (you can find actual movie scripts here: https://imsdb.com)

                  • By throwup238 2024-12-10 20:10 | 1 reply

                    There are two reasons to be hopeful about it, though. AI/LLMs are very good at filling in all those little details so humans can cherry pick the parts that they like. I think that's where the real value is for the masses: once these models can generate coherent scenes, people can start using them to explore the creative space and figure out what they like. Sort of like SegmentAnything and masking in inpainting, but for the rest of the scene assembly. The other reason is that the models can probably be architected to derive environment/character/lighting/etc. embeddings and use those to build up other coherent scenes, the way we use language embeddings for semantic similarity.
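A minimal sketch of that embedding idea, under the assumption that some encoder yields a vector per generated scene; the encoder itself and the 0.85 threshold are hypothetical illustrations, not a real system:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_consistent(reference: np.ndarray, candidate: np.ndarray,
                  threshold: float = 0.85) -> bool:
    """Accept a generated scene only if its character/environment embedding
    stays close to the one established in earlier scenes.
    The 0.85 threshold is an arbitrary illustration, not a tuned value."""
    return cosine(reference, candidate) >= threshold
```

A generation loop would then resample any scene whose embedding drifts below the threshold, the same way semantic-similarity search gates on cosine distance.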

                    That's how I've been using the image generators - lots of experimentation and throwing out the stuff that doesn't work. Then once I've got enough good generated images collected out of the tons of garbage, I fine tune a model and create a workflow that more consistently gives me those styles.

                    Now the models and UX to do this at a cinematic quality are probably 5-10 years away for video (and the studios are probably the only ones with the data to do it), but I'm relatively bullish on AI in cinema. I don't think AI will be doing everything end to end, but it might be a shortcut for people who can write a script and figure out the UX to execute the rest of the creative process by trial and error.

                    • By troupo 2024-12-10 21:13

                      > AI/LLMs are very good at filling in all those little details so humans can cherry pick the parts that they like.

                      Where did you find AI/ML that is good at filling in actual required and consistent details?

                      I beg of you to watch Annie Atkins' presentation I linked: https://www.youtube.com/watch?v=SzGvEYSzHf4 and tell me how much intervention AI/ML would need to create all that and stay consistent throughout the movie.

                      > once these models can generate coherent scenes, people can start using them to explore the creative space and figure out what they like.

                      Define "coherent scene" and "explore". A scene must be both coherent and consistent, and conform to the overall style of the movie and...

                      Even such a simple thing as shot/reverse shot requires about a million various details and can be shot in a million different ways. Here's an exploration of just shot/reverse shot: https://www.youtube.com/watch?v=5UE3jz_O_EM

                      All those are coherent scenes, but the coherence comes from a million decisions: from lighting, camera position, lens choice, wardrobe, what surrounds the characters, what's happening in the background, makeup... There's no coherence without all these choices made beforehand.

                      Around 4:00 mark: "Think about how well you know this woman just from her clothes, and workspace". Now watch that scene. And then read its description in the script https://imsdb.com/scripts/No-Country-for-Old-Men.html:

                      --- start quote ---

                          Chigurh enters. Old plywood paneling, gunmetal desk, litter
                                of papers. A window air-conditioner works hard.
                                A fifty-year-old woman with a cast-iron hairdo sits behind
                                the desk.
                      
                      --- end quote ---

                      And right after that there's a section on the rhythm of editing. Another piece in the puzzle of coherence in a scene.

                      > Then once I've got enough good generated images collected out of the tons of garbage, I fine tune a model and create a workflow that more consistently gives me those styles.

                      So, literally what I wrote here: https://news.ycombinator.com/item?id=42375280 :)

                  • By skydhash 2024-12-10 17:11

                    That’s the same thing with digital art, even with the most effortless one (matte painting), there’s a plethora of decisions to make and techniques to use to have a coherent result. There’s a reason people go to school or trained themselves for years to get the needed expertise. If it was just data, someone would have written a guide that others would mindlessly follow.

                  • By robotresearcher 2024-12-10 22:07 | 1 reply

                    Not sure why you jumped there. I was thinking more like ‘make it look like Bladerunner if Kurosawa directed it, with a score like Zimmer.’

                    You’re really failing to let go of the idea that you need to prescribe every little thing. Like Midjourney today, you’ll be able to give general guidance.

                    Now, I don’t expect we’ll get the best movies this way. But paint by numbers stuff like many movies already are? A Hallmark Channel weepy? I bet we will.

                    • By troupo 2024-12-10 23:07 | 1 reply

                      > Not sure why you jumped there.

                      No jump.

                      Your original claim: "Submit a whole script the way a writer delivers a movie to a director. The (automated) director/DP/editor could maintain internal visual coherence, while the script drives the story coherence."

                      Two comments later it's this: "We can already get detailed style guidance into picture generation. Declaring you want Picasso cubist, Warner brothers cartoon, or hyper realistic works today. So does lighting instructions, color palettes, on and on."

                      I just re-wrote this with respect to movies.

                      > I was thinking more like ‘make it look like Bladerunner if Kurosawa directed it, with a score like Zimmer.’

                      Because, as we all know, every single movie by Kurosawa is the same, as is every single score by Hans Zimmer, so it's ridiculously easy to recreate any movie in that style, with that music.

                      > You’re really failing to let go of the idea that you need to prescribe every little thing. Like Midjourney today, you’ll be able to give general guidance.

                      Yes, and Midjourney today really sucks at:

                      - being consistent

                      - creating proper consistent details

                      A general prompt will give you a general result that is usually very far from what you actually have in mind.

                      And yes, you will have to prescribe a lot of small things if you want your movie to be consistent. And for your movie to make any sense.

                      Again, tell me how exactly your amazing magical AI director will know which wardrobe to choose, which camera angles to set up, which typography to use, and which sound effects to make, just from the script you hand in?

                      You can start with a very simple scene I referenced in my original reply: two people talking at the table in Whiplash.

                      > But paint by numbers stuff like many movies already are? A Hallmark Channel weepy? I bet we will.

                      Even those movies have more details and more care than you can get out of AIs (now, or in the foreseeable future).

                      • By robotresearcher 2024-12-11 20:21

                        > Again, tell me how exactly your amazing magical AI director will know which wardrobe to choose, which camera angles to set up, which typography to use, and which sound effects to make, just from the script you hand in?

                        I think you're still assuming I always want to choose those things. That's why we're talking past each other. A good movie making model would choose for me unless I give explicit directions. Today we don't see long-range coherence in the results of movie (or game engine) models, but the range is increasing, and I'm willing to bet we will see movie-length coherence in the next decade or so.

                        By the way, I also bet that if I pasted exactly the No Country for Old Men script scene description from up this thread into Midjourney today it would produce at least some compelling images with decent choices of wardrobe, lighting, set dressing, camera angle, exposure, etc etc. That's what these models do, because they're extrapolating and interpolating between the billion images they've seen that contained these human choices.

                        AFAIK Midjourney produces single images, so the relevant scope of consistency is inside the single image only. Not between images. A movie model needs coherence across ~160,000 images, which is beyond the state of the art today but I don't see why it's impossible or unreasonable in the long run.
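The ~160,000-frame figure above checks out against standard frame rates, as a quick sanity calculation shows:

```python
# Sanity check: how long a movie is ~160,000 frames?
fps = 24                    # standard theatrical frame rate
frames = 160_000
minutes = frames / fps / 60
print(f"{minutes:.0f} minutes")  # ~111 minutes, a typical feature length
```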

                        > A general prompt will give you a general result that is usually very far from what you actually have in mind.

                        Which is only a problem if I have something in mind. Alternatively I can give no guidance, or loose guidance, make half a dozen variations, pick the one I like best. Maybe iterate a couple of times into that variation tree. Just like the image generators do.

                • By krainboltgreene 2024-12-10 9:38

                  This is such an incredibly confident comment. I'm in awe.

          • By player1234 2024-12-10 11:23

            Cool since you know, at what point in the process do you swap out all the white ppl? Thanks in advance!

        • By letmevoteplease 2024-12-09 22:12 | 2 replies

          Shane Carruth (Primer) released interesting scripts for "A Topiary" and "The Modern Ocean" which now have no hope of being filmed. I hope AI can bring them to life someday. If we get tools like ControlNet for video, maybe Carruth could even "direct" them himself.

          • By spoaceman7777 2024-12-10 5:33

            This exists already actually. Kling AI 1.5. Saw the demo on twitter two days ago, which shows a photo-to-video transformation on an image of three women standing on a beach, and the video transformation simulates the camera rotating, with the women moving naturally. Just involves a segment-anything style selection of the women, and drawing a basic movement vector.

            https://x.com/minchoi/status/1862975323433795726

          • By Der_Einzige 2024-12-10 14:01

            ControlNet for video is just ControlNet run frame by frame, resulting in AI rotoscoping.

        • By bwfan123 2024-12-10 2:12 | 1 reply

          Brilliant take from Ben Affleck on AI in movies:

          "movies will be one of the last things to be replaced by ai"

          https://www.youtube.com/watch?v=ypURoMU3P3U

          including this quote: "being a craftsman is knowing how to work, art is knowing when to stop"

          • By rossjudson 2024-12-10 2:40 | 1 reply

            It is absolutely true that LLMs do not know when to stop.

            • By natmaka 2024-12-10 7:22

              An adequate prompter (human at the prompt) knows when to stop.

        • By jerf 2024-12-09 19:34

          That's what I describe at the end, albeit quickly in lingo, where the internal coherence is maintained in internal embeddings that are never related to English at all. A top-level AI could orchestrate component AIs through embedded vectors, but you'll never do it with a human trying to type out descriptions.

      • By minimaxir 2024-12-09 19:11 | 1 reply

        > Under the hood, with the way the text is turned into vector embeddings, it's fairly questionable whether you'd agree that it can even represent such a thing.

        The text encoder may not be able to capture complex relationships, but the generative image/video models that are conditioned on said text embeddings absolutely can.

        Flux, for example, uses the very old T5 model for text encoding, but image generations from it can (loosely) adhere to all rules and nuances in a multi-paragraph prompt: https://x.com/minimaxir/status/1820512770351411268

        • By dragonwriter 2024-12-100:51

          > but image generations from it can (loosely) adhere to all rules and nuances in a multi-paragraph prompt

          Flux certainly does not consistently do so across an arbitrary collection of multi-paragraph prompts, as anyone who's run more than a few long prompts past it would recognize; also, the tweet is wrong in the other direction as well: longer language-model-preprocessed prompts for models that use CLIP (like various SD1.5 and SDXL derivatives) are, in fact, a common and useful technique. (You'd kind of think that the fact that the generated prompt here is significantly longer than the 256-token window of T5 would be a clue that the 77-token limit of CLIP might not be as big of a constraint as the tweet was selling it as, too.)

      • By lmm 2024-12-101:43

        > You might as well skip the English middleman and go straight to an embedding not constrained by a human language mapping.

        How would you ever tweak or debug it in that case? It doesn't strictly have to be English, but some kind of human-readable representation of the intermediate stages will be vital.

      • By amelius 2024-12-0920:492 reply

        Can't you just give it a photo of a dog, and then say "use this dog in this or that scene"?

        • By artemisart 2024-12-0921:311 reply

          Yes, the idea works and was explored with dreambooth/textual inversion for image diffusion models.

          https://dreambooth.github.io/ https://textual-inversion.github.io/

          • By minimaxir 2024-12-0921:431 reply

            Both of those are of course out of date and require significant training instead of just feeding it a single image.

            InstantID (https://replicate.com/zsxkib/instant-id) fixes that issue.

            • By AuryGlenz 2024-12-1016:26

              Dreambooth style training is in no way out of date.

              If you just want a face, InstantID/PuLID work - but it's not going to be very varied. Doing actual training means you can get any perspective, lighting, style, expression, etc. - and have the whole body be accurate.

        • By alpha_squared 2024-12-0921:091 reply

          How would that even work? A dog has physical features (legs, nose, eyes, ears, etc.) that they use to interact with the world around them (ground, tree, grass, sounds, etc.). And each one of those things has physical structures that compose senses (nervous system, optic nerves, etc.). There are layers upon layers of intricate complexity that took eons to develop, and a single photo cannot encapsulate that level of complexity and density of information. Even a 3D scan can't capture that level of information. There is an implicit understanding of the physical world that helps us make sense of images. For example, a dog with all four paws standing on grass is within the bounds of possibility; a dog with six paws, two of which are on its head, is outside the bounds of possibility. An image generator doesn't understand that obvious delineation and just approximates likelihood.

          • By int_19h 2024-12-0921:361 reply

            A single photo doesn't have to capture all that complexity. It's carried by all those countless dog photos and videos in the training set of the model.

            • By krainboltgreene 2024-12-109:39

              Actually, it does have to capture all of that complexity because it's a photon-based analysis of reality. You cannot take a photo without doing that.

      • By fennecbutt 2024-12-1810:20

        This is correct and even image generation models aren't really trained for comprehension of image composition yet.

        Even the models based off danbooru and E621 still aren't the best at that. And us furries like to tag art in detail.

        The best we can really do at the moment is regional prompting, perhaps they need something similar for video.

      • By echelon 2024-12-0919:092 reply

        For those not in this space, Sora is essentially dead on arrival.

        Sora performs worse than closed source Kling and Hailuo, but more importantly, it's already trumped by open source too.

        Tencent is releasing a fully open source Hunyuan model [1] that is better than all of the SOTA closed source models. Lightricks has their open source LTX model and Genmo is pushing Mochi as open source. Black Forest Labs is working on video too.

        Sora will fall into the same pit that Dall-E did. SaaS doesn't work for artists, and open source always trumps closed source models.

        Artists want to fine tune their models, add them to ComfyUI workflows, and use ControlNets to precision control the outputs.

        Images are now almost 100% Flux and Stable Diffusion, and video will soon be 100% Hunyuan and LTX.

        Sora doesn't have much market apart from name recognition at this point. It's just another inflexible closed source model like Runway or Pika. Open source has caught up with state of the art and is pushing past it.

        [1] https://github.com/Tencent/HunyuanVideo

        • By circlefavshape 2024-12-1010:40

          Their online version is all in Chinese (or at least some Chinese-looking script I don't understand) ... and they recommend an 80GB GPU to run the thing, which costs ~€15-18k. Yikes, guess I won't be doing this at home anytime soon

        • By baserev 2024-12-101:53

          [flagged]

      • By yeknoda 2024-12-0919:032 reply

        Something like a white paper with a mood board, color scheme, and concept art as the input might work. This could be sent into an LLM "expander" that increases the word count and specificity. Then multiple reviews to tap things in the right direction.

        • By mikepurvis 2024-12-0919:161 reply

          I expect this kind of thing is actually how it's going to work longer term, where AI is a copilot to a human artist. The human artist does storyboarding, sketching in backdrops and character poses in keyframes, and then the AI steps in and "paints" the details over top of it, perhaps based on some pre-training about what the characters and settings are so that there's consistency throughout a given work.

          The real trick is that the AI needs to be able to participate in iteration cycles, where the human can say "okay this is all mostly good, but I've circled some areas that don't look quite right and described what needs to be different about them." As far as I've played with it, current AIs aren't very good at revisiting their own work— you're basically just tweaking the original inputs and otherwise starting over from scratch each time.

          • By programd 2024-12-0919:50

            We will shortly have much better tweaking tools which work not only on images and video but concepts like what aspects a character should exhibit. See for example the presentation from Shapeshift Labs.

            https://www.shapeshift.ink/

        • By 3form 2024-12-0919:111 reply

          And I think this realistically is going to be the shape of the tools to come in the foreseeable future.

          • By echelon 2024-12-0919:36

            You should see what people are building with Open Source video models like HunYuan [1] and ComfyUI + Control Nets. It blows Sora out of the water.

            Check out the Banodoco Discord community [2]. These are the people pioneering steerable AI video, and it's all being built on top of open source.

            [1] https://github.com/Tencent/HunyuanVideo

            [2] https://banodoco.ai/

      • By prmoustache 2024-12-106:57

        The whole point of AI stuff is not to produce exactly what you have in mind, but what you actually describe. Same with text, code, images, video...

      • By szundi 2024-12-106:45

        Sounds like we achieved 50% of AI then. The artifical is there, now we need the intelligence part.

      • By baq 2024-12-106:46

        Sora should be evaluated on xkcd strips as inputs.

    • By miltonlost 2024-12-0918:2914 reply

      The adage "a picture is worth a thousand words" has the nice corollary "A thousand words isn't enough to be precise about an image".

      Now expand that to movies and games and you can get why this whole generative-AI bubble is going to pop.

      • By TeMPOraL 2024-12-0918:527 reply

        > Now expand that to movies and games and you can get why this whole generative-AI bubble is going to pop.

        What will save it is that, no matter how picky you are as a creator, your audience will never know what exactly was that you dreamed up, so any half-decent approximation will work.

        In other words, a corollary to your corollary is, "Fortunately, you don't need them to be, because no one cares about low-order bits".

        Or, as we say in Poland, "What the eye doesn't see, the heart doesn't mourn."

        • By jsheard 2024-12-0918:574 reply

          > What will save it is that, no matter how picky you are as a creator, your audience will never know what exactly was that you dreamed up, so any half-decent approximation will work.

          Part of the problem is that the "half-decent approximations" tend towards a clichéd average: the audience won't know that the cool cyberpunk cityscape you generated isn't exactly what you had in mind, but they will know that it looks like every other AI-generated cyberpunk cityscape and mentally file your creation in the slop folder.

          I think the pursuit of fidelity has made the models less creative over time: they make fewer glaring mistakes like giving people six fingers, but their output is ever more homogenized and interchangeable.

          • By samatman 2024-12-0920:391 reply

            Empirically, we've passed the point where that's true, for someone not being lazy about it.

            https://www.astralcodexten.com/p/how-did-you-do-on-the-ai-ar...

            In other words, someone willing to tweak the prompt and press the button enough times to say "yeah, that one, that's really good" is going to have a result which cannot in fact be reliably binned as AI-generated.

            • By lmm 2024-12-103:031 reply

              I mean, no? None of the AI-generated images managed to be indistinguishable. Some people were much better than others at spotting the differences. He even quotes, at length, an artist giving a detailed breakdown of what's wrong with one of the images he thought was good.

              • By TeMPOraL 2024-12-108:151 reply

                Did you read the article? Respondents performed barely better than chance. Sure, no one was actually 100% wrong[0]. Just almost always wrong, with a noticeable bias towards liking AI art more.

                The detailed breakdown you mention? Maybe it's accurate to that artist's thought process, maybe it's more of a rationalization; either way, it's not a general rule they, or anyone, could apply to any of the other AI images. Most of those in the article don't exhibit those "telltale signs", and the one that does - the Victorian Megaship - was actually made by human artist with no AI in the mix.

                EDIT:

                Another image that stands out to me is Riverside Cafe. Like apparently a lot of other people, going by the article's comments, I assumed it was a human-made one, because we vaguely remembered Van Gogh painting something like it. He did; it's called Café Terrace at Night. And yet, despite immediately evoking the association, Riverside Cafe was made by AI, and is actually nothing like Café Terrace at Night at any level.

                (I find it fascinating how this work looks like a copy of Van Gogh at first glance, for no obvious reason, but nothing alike once you pause to look closer. It's like... they have similar low-frequency spectra or something?)

                EDIT2:

                Played around with the two images in https://ejectamenta.com/imaging-experiments/fourifier/. There are some similarities in the spectra, I can't put my finger on them exactly. But it's probably not the whole answer. I'll try to do some more detailed experimentation later.
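
                For anyone who wants to poke at this offline, a rough sketch of the kind of low-frequency check I mean (the helper names are mine; assumes the two images are already loaded as 2D grayscale NumPy arrays):

```python
import numpy as np

def radial_spectrum(img, n_bins=32):
    """Radially averaged magnitude spectrum of a 2D grayscale image."""
    mag = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = img.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h / 2, xx - w / 2)          # distance from spectrum center
    bins = np.linspace(0, r.max(), n_bins + 1)
    idx = np.clip(np.digitize(r.ravel(), bins) - 1, 0, n_bins - 1)
    sums = np.bincount(idx, weights=mag.ravel(), minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    return sums / np.maximum(counts, 1)            # mean magnitude per radius bin

def spectral_similarity(img_a, img_b, low_bins=8):
    """Correlate only the low-frequency bins of the two images' spectra."""
    sa = np.log1p(radial_spectrum(img_a)[:low_bins])
    sb = np.log1p(radial_spectrum(img_b)[:low_bins])
    return float(np.corrcoef(sa, sb)[0, 1])
```

                An image compared with itself scores 1.0; the hypothesis would be that Riverside Cafe and Café Terrace at Night correlate strongly on the low bins but diverge once the high-frequency tail is included.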

                --

                [0] - Nor should you expect it: it would mean either perfect calibration, or the equivalent of flipping a coin and getting heads 30 times in a row; it's not impossible, but you shouldn't expect to see it unless you're interviewing fewer people than literally the entire population of the planet.

                • By lmm 2024-12-109:501 reply

                  Yes, I read the article. Did you?

                  > The average participant scored 60%, but people who hated AI art scored 64%, professional artists scored 66%, and people who were both professional artists and hated AI art scored 68%.

                  > The highest score was 98% (49/50), which 5 out of 11,000 people achieved. Even with 11,000 people, getting scores this high by luck alone is near-impossible.
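
                  The "near-impossible" part is easy to check; a quick sketch, assuming 50 binary AI-or-not judgments and pure 50/50 guessing:

```python
from math import comb

def p_score_at_least(k, n=50, p=0.5):
    """Chance a pure guesser gets at least k of n binary questions right."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_one = p_score_at_least(49)     # one guesser reaching 49+/50
expected = 11_000 * p_one        # expected such scores among 11,000 guessers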

                  • By samatman 2024-12-1018:331 reply

                    This accurately boils down to "cannot reliably be binned as AI-generated". Your objection amounts to a vanishing few people who are informed that this is a test being able to do a pretty good job at it.

                    If 0.045% of people (5 out of 11,000) who are specifically judging art as AI or not AI, in a test which presumably attracts people who would like to be able to do that thing, can do a 98% accurate job, and the average is around 60%: that isn't reliable.

                    If that doesn't work for you, I encourage you to take the test. Obviously since you've read the article there are some spoilers, but there's still plenty of chances to get it right or wrong. I think you'll discover that you, too, cannot do this reliably. Let us know what happens.

                    • By lmm 2024-12-111:43

                      I can't do it reliably and I don't want to - I learnt to spot certain popular video compression artifacts in my youth, and that has not enhanced my life. But any distinction that random people taking a casual internet survey get right 60% of the time is absolutely one that you can make reliably if you put in the effort. Look at something like chicken sexing.

          • By randomcatuser 2024-12-0919:193 reply

            A somewhat counterintuitive argument is this: AI models will make the overall creative landscape more diverse and interesting, i.e., less "average"!

            Imagine the space of ideas as a circle, with stuff in the middle being easier to reach (the "cliched average"). Previously, traversing the circle was incredibly hard: we had to use tools like DeviantArt, Instagram, etc. to agglomerate the diverse tastes of artists, hoping to find or create the style we're looking for. Getting the same art style meant hiring the artist. As a result, on average, what you see is the result of huge amounts of human curation, effort, and branding teams.

            Now reduce the effort 1000x, and all of a sudden it's incredibly easy to reach the edge of the circle (or closer to it). Sure, we might still miss some things at the very outer edge, but it's equivalent to building roads. Motorists appear: people with no time to sit down and spend 10,000 hours mastering a particular style can simply remix art and create things wildly beyond their manual capabilities. As a result, the amount of content in the infosphere skyrockets, the tastemaking velocity accelerates, and you end up with a more interesting infosphere than you're used to.

            • By TeMPOraL 2024-12-0919:473 reply

              To extend the analogy, imagine the circle as a probability distribution; for simplicity, imagine it's a bivariate normal joint distribution (aka. Gaussian in 3D) + some noise, and you're above it and looking down.

              When you're commissioning an artist to make you some art, you're basically sampling from the entire distribution. Stuff in the middle is, as you say, easiest to reach, so that's what you'll most likely get. Generative models let more people do art, meaning there's more sampling happening, so the stuff further from the centre will be visited more often, too.

              However, AI tools also make another thing easier: moving and narrowing the sampling area. Much like with a very good human artist, you can find some work that's "out there", and ask for variations of it. However, there are only so many good artists to go around. AI making this process much easier and more accessible means more exploration of the circle's edges will happen. Not just "more like this weird thing", but also combinations of 2, 3, 4, N distinct weird things. So in a way, I feel that AI tools will surface creative art disproportionally more than it'll boost the common case.
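
              To make the sampling picture concrete, a toy sketch (all numbers illustrative) of broad exploration versus "more like this weird thing":

```python
import random
from statistics import mean, pstdev

random.seed(42)

def sample(center, spread, n=5000):
    """Draw n points from a 2D Gaussian 'idea space'."""
    return [(random.gauss(center[0], spread),
             random.gauss(center[1], spread)) for _ in range(n)]

# Commissioning from the whole distribution: most results land near the middle.
broad = sample((0.0, 0.0), 1.0)

# "More like this weird thing": shift the mean to an edge point, shrink the spread.
narrow = sample((2.5, 2.5), 0.2)
```

              The broad samples cluster around the cliched center; the narrowed ones stay near the chosen edge point, which is the "move and narrow the sampling area" step.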

              Well, except for the fly in the ointment that's the advertising industry (aka. the cancer on modern society). Unfortunately, by far most of the creative output of humanity today is done for advertising purposes, and that goal favors the common, as it maximizes the audience (and is least off-putting). Deluge of AI slop is unavoidable, because slop is how the digital world makes money, and generative AI models make it cheaper than generative protein models that did it so far. Don't blame AI research for that, blame advertising.

              • By zmgsabst 2024-12-106:551 reply

                A small technical point:

                Tastes are almost never normally distributed along a spectrum, but multi-modal. So the more dimensions you explore in, the more you end up with "islands of taste" on the surface of a hypersphere and nothing like the normal distribution at all. This phenomenon is deeply tied to why "design by committee" (e.g., in movies) always makes the financial estimates look good but flops with audiences: there is almost no customer for average anything.

                I agree with your conclusion.
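
                A minimal illustration of the multi-modal point, with two Gaussian taste clusters (numbers purely illustrative):

```python
from math import exp, pi, sqrt

def mixture_density(x, modes=(-3.0, 3.0), sigma=1.0):
    """Equal-weight mixture of Gaussian 'taste clusters' along one spectrum."""
    pdfs = [exp(-0.5 * ((x - m) / sigma) ** 2) / (sigma * sqrt(2 * pi))
            for m in modes]
    return sum(pdfs) / len(pdfs)

at_mode = mixture_density(3.0)   # product aimed squarely at one cluster
at_mean = mixture_density(0.0)   # the committee's compromise "average"
```

                With these numbers the compromise point has roughly 45x less density than either mode: the average sits in a valley between the islands of taste.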

                • By circlefavshape 2024-12-1010:411 reply

                  "Design by committee" is also how most hit movies are made. Hit songs too

                  • By zmgsabst 2024-12-1021:331 reply

                    Do you have an example?

                    My experience with customer surveys indicates the opposite — that customers prefer you have an opinion.

                    • By circlefavshape 2024-12-1114:541 reply

                      An example of a hit movie or song that was created by committee?

                      Inside Out 2 had the largest box office of any movie in 2024. Check out the "research and writing" section in its Wikipedia article https://en.wikipedia.org/wiki/Inside_Out_2#Research_and_writ... ... psychological consultants, a feedback loop with a group of teenagers, test screenings.

                      Or how about "Die with a smile" - currently number 1 in the global top 50 on Spotify. 5 songwriters

                      Or "APT." - currently number 2 in the global top 50 on Spotify. 11 songwriters

                      You don't have to look very hard

                      • By zmgsabst 2024-12-1215:48

                        Inside Out 2 has a single writer, who also worked on the first.

                        Consulting with SMEs, testing with audiences, etc isn’t “design by committee”.

                        Similarly, “Die With a Smile” seems to have been the work of two people with developed styles with support — again, not a committee:

                        > The collaboration was a result of Mars inviting Gaga to his studio where he had been working on new music. He presented the track in progress to her and the duo finished writing and recording the song the same day.

                        Apt seems to have started with a single person goofing around, then pitched as a collaboration and the expanded team entered at that point.

              • By etiam 2024-12-0923:01

                I like the picture, but I'd be more impressed with the exploration argument if we were collectively actually doing a good job giving recognition to original and substantial works that already exist. It'd be of greater service in that regard to create a high-quality artificial stand-in for that limited-quantity "attention" and "engagement" all the bloodsuckers seem so keen on harvesting.

                (And I do blame the advertisers, but frankly anyone handing them new amplifiers, with entirely predictable consequences, is also not blameless.)

              • By js8 2024-12-103:291 reply

                I read this argument/analogy and the "AI slop will win" idea reminds me of the idea that "fake news will win".

                That is based on the perception that it is easier than ever to create fake content, but it fails to account for the fact that creating real content (for example, simply taking a video) is easier still. So while there is more fake content, there is also a lot more real content, and so manipulation of reality (for example, denying a genocide) is harder today than ever.

                Anyway, "the AI slop will win" is based on a similar misconception: that total creative output will not increase. But as with fake news, that will probably not be the case, and so the actual amount of good art will increase, too.

                I think we are OK as long as normal humans prefer to create real news rather than fake news, and create innovative art rather than cliched art.

                • By TeMPOraL 2024-12-108:04

                  > I think we are OK as long as normal humans prefer to create real news rather than fake news, and create innovative art rather than cliched art.

                  So we're not OK.

                  I think I need to state my assumptions/beliefs here more explicitly.

                  First of all, "AI slop" is just the newest iteration on human-produced slop, which we're already drowning in. Not because people prefer to create slop, but because they're paid to do it, because most content is created by marketers and advertisers to sell you shit, and they don't want it to be better than strictly necessary for purpose.

                  It's the same with fake news, really. Fake news isn't new. Almost all news is fake news; what we call "fake news" is a particular flavor of bullshit that got popular as it got easier for random humans to publish stories competing with established media operations.

                  In both cases, AI is exacerbating the problem, but it did not create it - we were already drowning in slop.

                  Which leads me to related point:

                  > Anyway, "the AI slop will win" is based on a similar misconception, that total creative output will not increase.

                  It will. But don't forget Sturgeon's law - "ninety percent of everything is crap"[0]. Again, for the past couple decades, we've been drowning in "creative output". It's not a new problem, it's just increasingly noticeable in the past years, because the Web makes it very easy for everyone to create more "creative output" (most of which is, again, advertising), and it finally started overwhelming our ability to filter out the crap and curate the gems.

                  Adding AI to the mix means more output, which per Sturgeon's law, means disproportionately more crap. That's not AI's fault, that's ours; it's still the same problem we had before.

                  --

                  [0] - https://en.wikipedia.org/wiki/Sturgeon%27s_law

            • By robertlagrant 2024-12-0921:19

              It's just like when Bootstrap came out. Terrible-looking websites stopped appearing, but so did beautiful websites.

            • By wongarsu 2024-12-0919:401 reply

              And as AI oversaturates the cliched average, creators will have to get further and further away from the average to differentiate themselves. If you pour a lot of work into your creation you want to make it clear that it isn't some cliched AI drivel.

              • By skydhash 2024-12-0922:331 reply

                You will basically have to provide a video showcasing your workflow.

                • By krainboltgreene 2024-12-109:42

                  I promise you that the artists can outlive the VC money.

          • By dragonwriter 2024-12-101:11

            > I think the pursuit of fidelity has made the models less creative over time, they make fewer glaring mistakes like giving people six fingers but their output is ever more homogenized and interchangeable.

            That may be true of any one model (though I don’t think it really is, either, I think newer image gen models are individually capable of a much wider array of styles than earlier models), but it is pretty clearly not true of the whole range of available models, even if you look at a single model “family” like “SDXL derivatives”.

          • By TeMPOraL 2024-12-0919:202 reply

            > I think the pursuit of fidelity has made the models less creative over time (...) their output is ever more homogenized and interchangeable.

            Ironically, we're long past that point with human creators, at least when it comes to movies and games.

            Take sci-fi movies, compare modern ones to the ones from the tail end of the 20th century. Year by year, VFX gets more and more detailed (and expensive) - more and better lights, finer details on every material, more stuff moving and emitting lights, etc. But all that effort arguably killed immersion and believability, by making scenes incomprehensible. There's way too much visual noise in action scenes in particular - bullets and lightning bolts zip around, and all that detail just blurs together. Contrast the 20th century productions - textures weren't as refined, but you could at least tell who's shooting who and when.

            Or take video games, where all that graphics works makes everything look the same. Especially games that go for realistic style, they're all homogenous these days, and it's all cheap plastic.

            (Seriously, what the fuck went wrong here? All that talk, and research, and work into "physically based rendering", yet in the end, all PBR materials end up looking like painted plastic. Raytracing seems to help a bit when it comes to liquids, but it still can't seem to make metals look like metals and not Fisher-Price toys repainted gray.)

            So I guess in this way, more precision just makes the audience give up entirely.

            > they will know that it looks like every other AI generated cyberpunk cityscape and mentally file your creation in the slop folder.

            The answer here is the same as with human-produced slop: don't. People are good at spotting patterns, so keep adding those low-order bits until it's no longer obvious you're doing the same thing everyone else is.

            EDIT: Also, obligatory reminder that generative models don't give you average of training data with some noise mixed up; they sample from learned distribution. Law of large numbers apply, but it just means that to get more creative output, you need to bias the sampling.
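
            The standard knob for biasing the sampling is temperature; a minimal sketch over some illustrative learned scores:

```python
from math import exp

def softmax(logits, temperature=1.0):
    """Softmax with temperature: T < 1 sharpens toward the likeliest (most
    'average') option, T > 1 flattens toward the rare ones."""
    z = [x / temperature for x in logits]
    m = max(z)                       # subtract max for numerical stability
    e = [exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

scores = [3.0, 1.0, 0.0]             # illustrative: the "average" option dominates
sharp = softmax(scores, temperature=0.5)
flat = softmax(scores, temperature=2.0)
```

            Same learned distribution, different bias: at low temperature the dominant option wins even more often, while at high temperature the tail (the less cliched outputs) gets sampled far more frequently.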

            • By wongarsu 2024-12-0919:52

              Video games (the much larger industry of the two, by revenue) seems to be closer to understanding this. AAA games dominate advertising and news cycles, but on any best-seller list AAA games are on par with indie and B games (I think they call them AA now?). For every successful $60M PBR-rendered Unreal 5 title there is an equally successful game with low-fidelity graphics but exceptional art direction, story or gameplay.

              Western movie studios may discover the same thing soon, with the number of high-budget productions tanking lately.

            • By robertlagrant 2024-12-0921:18

              I agree. The one shining hope I have is the incredible art and animation style of Fortiche[0]'s Arcane[1] series. Watch that, and then watch any recent (and identikit) Pixar movie, and they are just streets ahead. It's just brilliant.

              [0] https://en.wikipedia.org/wiki/Fortiche

              [1] https://en.wikipedia.org/wiki/Arcane_(TV_series)

        • By naasking 2024-12-0918:591 reply

          I was just going to say this. If you have an artistic vision that you simply must create to the minutest detail, then like any artist, you're in for a lot of manual work.

          If you are not beholden to a precise vision or maybe just want to create something that sells, these tools will likely be significant productivity multipliers.

          • By whstl 2024-12-0919:16

            Exactly.

            So far ChatGPT is not for writing books, but is great for SEO-spam blogposts. It is already killing the content marketing industry.

            So far Dall-E is not for making master paintings, but it's great for stock images. It might kill most of the clipart and stock image industry.

          So far Udio and other song generators are not able to make symphonies, but they're great for quiet background music. They might kill most of the generic royalty-free-music industry.

        • By msabalau 2024-12-0919:41

        Half-decent approximations work a lot better when generating the equivalent of a stock illustration for a PowerPoint slide.

          Actual long form art like a movie works because it includes many well informed choices that work together as a whole.

          There seems to be a large gap between generating a few seconds of video vaguely like one's notion, and trying to create 90 minutes that are related and meaningful.

        Which doesn't mean that you can't build more robust tools from this starting place. But the fact that this is a large, hard amount of work certainly calls into question optimistic projections from people who don't even seem to notice that there is work needed at all.

        • By Ar-Curunir 2024-12-0919:361 reply

        That's just sad, and why people have a derogatory stance towards generative AI: "half-decent" approximations remove all personality from the output, leading to a bunch of slop on the internet.

          • By TeMPOraL 2024-12-0920:261 reply

            It does indeed, but then many of those people don't notice they're already consuming half-decent, personality-less slop, because that's what human artists make too, when churning out commercial art for peanuts and on tight deadlines.

            It's less obvious because people project personality onto the content they see, because they implicitly assume the artist cared, and had some vision in mind. Cheap shit doesn't look like cheap shit in isolation. Except when you know it's AI-generated, because this removes the artist from the equation, and with it, your assumptions that there's any personality involved.

            • By whatevertrevor 2024-12-0920:411 reply

              I'm not so sure, one of the primary complaints about IP farming slop that major studios have produced recently is a lack of firm creative vision, and clear evidence of design by committee over artist direction.

              People can generally see the lack of artistic intent when consuming entertainment.

              • By TeMPOraL 2024-12-0921:13

                That's true. Then again, complaints about "lack of firm creative vision, and clear evidence of design by committee over artist direction" is something I've seen levied against Disney for several years now; importantly, they started before generative AI found its way into major productions.

                So, while GenAI tools make it easier to create superficially decent work that lacks creative intent, the studios managed to do it just fine with human intelligence only, suggesting the problem isn't AI, but the studios and their modern management policies.

        • By hammock 2024-12-0919:221 reply

          It’s like how there are two types of movie directors (or creative directors in general), the dictatorial “100 takes until I get it exactly how I envision it” type, and the “I hired you to act, so you bring the character to life for me and what will be will be” type

          Right now AI is more the latter, but many people want it to be the former

          • By troupo 2024-12-0920:52

            AI is neither.

            A director letting actors "just be" knows exactly what he/she wants, and chooses actors accordingly. So do the directors that want the most minute detail.

            Clint Eastwood tries to do at most one take of a scene. David Fincher is infamous for his dozens of takes.

            AI is neither Fincher nor Eastwood.

        • By wcfrobert 2024-12-0919:221 reply

          Do artists really have a fully formed vision in their head? I suspect the creative process is much more iterative than one-directional.

          • By skydhash 2024-12-0919:271 reply

            No one can have a fully formed vision. But intent, yes. Then you use techniques to materialize it. Words are a poor substitute for that intent, which is why there are so many sketches in a visual project.

            • By maxglute 2024-12-0919:32

              And why physical execution frequently significantly departs from sketches and concept art. The amount of intent that doesn't get translated is pretty staggering in both physical and digital pipelines in many projects.

        • By dartos 2024-12-0918:582 reply

          Your eye sees just about every frame of a film…

          People may not think they care, but obviously they do. That’s why Marvel movies do better than DC ones.

          People absolutely care about details in their media.

          • By TeMPOraL 2024-12-0920:181 reply

            Fair point, particularly given the example. My conclusion wrt. Marvel vs. DC is that DC productions care much less about details, in exactly the way I find off-putting.

            Not all details matter, some do. And, it's better to not show the details at all, than to be inconsistent in them.

            Like, idk., don't identify a bomb as a specific type of existing air-fuel ordnance and then act about it as if it was a goddamn tactical nuke. Something along these lines was what made me stop watching Arrow series.

            • By dartos 2024-12-0920:43

              > Not all details matter, some do

              This is a key observation, unfortunately generally solving for what details matter is extremely difficult.

              I don’t think video generation models help with that problem, since you have even less control of details than you do with film.

              At least before post.

          • By og_kalu 2024-12-0921:38

            The visuals are the absolute bottom of why DC movies have performed worse over the years.

            The movies have just had much worse audience and critical reception.

      • By throwup238 2024-12-0918:541 reply

        “A frame is worth a billion rays”

        The last production I worked on averaged 16 hours per frame for the final rendering. The amount of information encoded in lighting, models, texture maps, etc. is insane.

        • By bongodongobob 2024-12-0919:073 reply

          What were you working on? It took a month to render 2 seconds of video?

          • By throwup238 2024-12-0919:361 reply

            VFX heavy feature for a Disney subsidiary. Each frame is rendered independently of each other - it’s not like video encoding where each frame depends on the previous one, they all have their own scene assembly that can be sent to a server to parallelize rendering. With enough compute, the entire film can be rendered in a few days. (It’s a little more complicated than that but works to a first order approximation)

            I don’t remember how long the final rendering took but it was nearly two months and the final compute budget was 7 or 8 figures. I think we had close to 100k cores running at peak from three different render farms during crunch time, but don’t take my word for it I wasn’t producing the picture.

            • By dist-epoch 2024-12-0920:073 reply

              Are they still using CPUs and not GPUs for rendering?

              Weren't the rendering algos ported to CUDA yet?

              • By jsheard 2024-12-0920:45

                GPU renderers exist but they have pretty hard scaling limits, so the highest end productions still use CPU renderers almost exclusively.

                The 3D you see in things like commercials is usually done on GPUs though because at their smaller scale it's much faster.

              • By throwup238 2024-12-101:09

                There's plenty of GPU renderers but they face the same challenge as large language models: GPU memory is much more expensive and limited than CPU memory.

                A friend recently told me about a complex scene (I think it was a Marvel or Star Wars flick) where they had so much going on in the scene with smoke, fire, and other special effects that they had to wait for a specialized server with 2TB of RAM to be assembled. They only had one such machine so by the time the rest of the movie was done rendering, that one scene still had a month to go.

              • By fc417fc802 2024-12-0923:26

                I'm not sure how well suited GPUs are to the workload. They're also rather memory constrained. The Moana dataset is from 2016 so it's not exactly cutting edge but good luck loading it into vram.

                https://www.disneyanimation.com/data-sets/?drawer=/resources...

                https://datasets.disneyanimation.com/moanaislandscene/island...

                > When everything is fully instantiated the scene contains more than 15 billion primitives.

          • By Arelius 2024-12-0919:21

            Most VFX productions take over 2 CPU-hours a frame for the final video, and have for a very long time. It takes less than a month since this gets parallelized on large render farms.

          • By elmigranto 2024-12-0919:171 reply

            I would guess there is more than one computer :)

            Pixar's stuff famously takes days per frame.

      • By raincole 2024-12-0920:26

        The point is not to be precise. It's to be "good enough".

        Trust me, even if you work with human artists, you'll keep saying "it's not quite what I initially envisioned, but we don't have the budget/time for another revision, so it's good enough for now" all the time.

      • By beambot 2024-12-0919:42

        Corollary: I couldn't create an original visual piece of art to save my life, so prompting is infinitely better than what I could do myself (or am willing to invest time in building skills). The gen-AI bubble isn't going to burst. Pareto always wins.

      • By Al-Khwarizmi 2024-12-0918:582 reply

        If you can build a system that can generate engaging games and movies, from an economic (bubble popping or not popping) point of view it's largely irrelevant whether they conform to fine-grained specifications by a human or not.

        • By dartos 2024-12-0919:04

          In other words:

          If you find a silver bullet then everything else is largely irrelevant.

          Idk if you noticed but that “if” is carrying an insane amount of weight.

        • By jsheard 2024-12-0920:491 reply

          Text generation is the most mature form of genAI, and even that isn't remotely close to producing endlessly engaging stories. Adding the visual aspect to make that story into a movie, or the interactive element to turn it into a game, is only uphill from there.

      • By soheil 2024-12-0922:41

        Maybe your AI bubble! If you define AI to be something like just another programming language yes you will be sadly disappointed. You see it as an employee with its own intuitions and ways of doing things that you're trying to micromanage.

        I have a bad feeling that you'd be a horrible manager if you ever were one.

      • By GistNoesis 2024-12-0918:491 reply

        (2020) https://arxiv.org/abs/2010.11929 : an image is worth 16x16 words transformers for image recognition at scale

        (2021) https://arxiv.org/abs/2103.13915 : An Image is Worth 16x16 Words, What is a Video Worth?

        (2024) https://arxiv.org/abs/2406.07550 : An Image is Worth 32 Tokens for Reconstruction and Generation

        • By dartos 2024-12-0918:591 reply

          Those are indeed 3 papers.

          • By GistNoesis 2024-12-0919:531 reply

            Yes, in a nutshell they explain that you can express a picture or a video with relatively little discrete information.

            The first paper is the most famous and prompted a lot of research into using text-generation tools in the image-generation domain: 256 "words" for an image. The second paper is 24 reference images per minute of video. The third is a refinement of the first, saying you only need 32 "tokens". I'll let you multiply the numbers.

            In kind of the same way as a who's-who game, where you can identify any human on earth with ~33 bits of information.

            The corollary being that, contrary to what the parent is saying, there is no theoretical obstacle to obtaining a video from a textual description.
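            Those numbers can be made concrete (a toy calculation using the papers' headline figures):

```python
import math

# How many bits uniquely index every person on Earth (who's-who game)?
population = 8_000_000_000
bits_needed = math.ceil(math.log2(population))  # 33 bits

# Combining the papers' numbers: 32 tokens per image (third paper)
# times 24 reference images per minute of video (second paper).
tokens_per_minute = 32 * 24                     # 768 tokens

print(bits_needed, tokens_per_minute)
```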

            • By dartos 2024-12-0920:191 reply

              I think something is getting lost in translation.

              These papers, from my quick skim (tho I did read the first one fully years ago,) seem to show that some images and to an extent video can be generated from discrete tokens, but does not show that exact images nor that any image can be.

              For instance, what combination of tokens must I put in to get _exactly_ Mona Lisa or starry night? (Tho these might be very well represented in the data set. Maybe a lesser known image would be a better example)

              As I understand, OC was saying that they can’t produce what they want with any degree of precision since there’s no way to encode that information in discrete tokens.

              • By GistNoesis 2024-12-0920:461 reply

                If you want to know which tokens yield _exactly_ the Mona Lisa, or any other image, you take the image and put it through your image tokenizer, i.e. encode it; given the sequence of tokens, you can decode it back to an image.

                VQ-VAE (Vector Quantised-Variational AutoEncoder), (2017) https://arxiv.org/abs/1711.00937

                The whole encoding-decoding process is reversible, and you only lose some imperceptible "details"; the process can be trained with either an L2 loss or a perceptual loss, depending on what you value.

                The point being that images which occur naturally are not really information-rich and can be compressed a lot by neural networks of a few GB that have seen billions of pictures. With that strong prior, aka common knowledge, we can indeed paint with words.
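                The quantization step being described (a VQ-VAE-style nearest-codebook lookup) can be sketched in a few lines; the codebook size and vector dimension here are toy values, not the papers':

```python
import numpy as np

# Minimal sketch of the vector-quantization step in a VQ-VAE:
# each encoder output vector is snapped to its nearest codebook entry,
# so an image becomes a short sequence of discrete token ids.
# (Toy shapes; a real model learns the codebook and the encoder/decoder.)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))     # 512 codes, 64-dim each

def quantize(z):
    # z: (n_vectors, 64) encoder outputs
    # squared distance from every vector to every codebook entry
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = d.argmin(axis=1)             # discrete token ids
    return tokens, codebook[tokens]       # ids + quantized vectors

z = rng.normal(size=(32, 64))             # "an image is worth 32 tokens"
tokens, z_q = quantize(z)
print(tokens.shape, z_q.shape)            # (32,) (32, 64)
```

                The decoder then maps the 32 quantized vectors back to pixels; the loss of the imperceptible "details" happens exactly at the `argmin` snap.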

                • By dartos 2024-12-0920:561 reply

                  Maybe I’m not able to articulate my thought well enough.

                  Taking an existing image and reversing the process to get the tokens that led to it then redoing that doesn’t seem the same as inserting token to get a precise novel image.

                  Especially since, as you said, we’d lose some details, it suggests that not all images can be perfectly described and recreated.

                  I suppose I’ll need to play around with some of those techniques.

                  • By GistNoesis 2024-12-0922:59

                    After encoding the models are usually cascaded either with a LLM or a diffusion model.

                    Natural image -> sequence of tokens, but not all possible sequences of tokens will be reachable. Plenty of letters put together form nonsensical words.

                    Sequence of tokens -> natural image: if the initial sequence of tokens is nonsensical, the natural image will be garbage.

                    So usually you then model the sequence of tokens so that it produces sensible sequences, like you would with an LLM, and you use the LLM to generate more tokens. It also gives you a natural interface to control the generation: you can express in words what modifications should be made to the image. This lets you find the golden sequence of tokens corresponding to the Mona Lisa by dialoguing with the LLM, which has been trained to translate from English to visual-word sequences.

                    Alternatively, instead of an LLM you can use a diffusion model; the visual words are usually continuous, but you can displace them iteratively with text using things like ControlNet (Stable Diffusion).

      • By stale2002 2024-12-0920:07

        You are half right. It's funny because I use the same saying. Mine is: "A picture is worth a thousand words. That's why it takes 1000 words to describe the exact image that you want! Much better to just use image-to-image instead."

        That's my full quote on this topic, and I think it stands. Sure, people won't describe a picture; instead, they will take an existing picture or video and modify it using AI. That is much, much simpler and more useful if you can film a scene and then animate it later with AI.

      • By ben_w 2024-12-101:13

        > Now expand that to movies and games and you can get why this whole generative-AI bubble is going to pop.

        The prior sentence does not imply the conclusion.

      • By meta_x_ai 2024-12-0920:091 reply

        A picture is worth a thousand words.

        A word is worth a thousand pictures. (E.g Love)

        It is abstraction all the way

        • By gloosx 2024-12-106:05

          it is all Information to be precise.

      • By 8n4vidtmkvmk 2024-12-102:39

        Actually, I've gotten some great results with image2text2image with less than a thousand words. Maybe not enough for a video, but for some not too crazy images, it is enough!

      • By fooker 2024-12-0919:56

        Sure it's going to pop. But when is the important question.

        Being too early about this and being wrong are the same.

      • By szundi 2024-12-0918:38

        Comment was probably rather about the 360 degree turning heads etc.

      • By mrandish 2024-12-0919:50

        I agree that people who want any meaningful precision in their visual results will inevitably be disappointed.

    • By isoprophlex 2024-12-0918:285 reply

      And another thing that irks me: none of these video generators get motion right...

      Especially anything involving fluid/smoke dynamics, or fast dynamic movements of humans and animals, suffers from the same weird motion artifacts. I can't describe it other than that the fluidity of the movements is completely off.

      And as all genai video tools I've used are suffering from the same problem, I wonder if this is somehow inherent to the approach & somehow unsolvable with the current model architectures.

      • By giantrobot 2024-12-0918:47

        I think one of the biggest problems is the models are trained on 2D sequences and don't have any understanding of what they're actually seeing. They see some structure of pixels shift in a frame and learn that some 2D structures should shift in a frame over time. They don't actually understand the images are 2D capture of an event that occurred in four dimensions and the thing that's been imaged is under the influence of unimaged forces.

        I saw a Santa dancing video today and the suspension of disbelief was almost instantly dispelled when the cuffs of his jacket moved erratically. The GenAI was trying to get them to sway with arm movements but because it didn't understand why they would sway it just generated a statistical approximation of swaying.

        GenAI also definitely doesn't understand 3D structures easily demonstrated by completely incorrect morphological features. Even my dogs understand gravity, if I drop an object they're tracking (food) they know it should hit the ground. They also understand 3D space, if they stand on their back legs they can see over things or get a better perspective.

        I've yet to see any GenAI that demonstrates even my dogs' level of understanding the physical world. This leaves their output in the uncanny valley.

      • By jeroen 2024-12-0920:24

        They don't even get basic details right. The ship in the 8th video changes with every camera change and birds appear out of nowhere.

      • By nonameiguess 2024-12-1018:55

        As far as I can tell it's a problem with CGI in general. Whether you're using precise physics models or learned embeddings from watching videos, reproducing certain physical events is computationally very hard, whereas recording them just requires a camera (and of course setting up the physical world to produce what you're filming, or getting very lucky).

        The behind-the-scenes from House of the Dragon has a very good discussion of this from the art directors. After a decade and a half of specializing in it, they have yet to find any convincing way to create fire other than to actually create fire and film it. This isn't a limitation of AI and it has nothing to do with intelligence. A human can't convincingly animate fire, either.

        It seems to me that discussions like this from the optimist side always miss this distinction, and it's part of why I think Ben Affleck was absolutely correct that AI can't replace filmmaking. Regardless of the underlying approach, computationally reproducing what the world gives you for free is simply very hard, maybe impossible. The best rendering systems out there come nowhere close to true photorealism over arbitrary scenarios and probably never will.

      • By soheil 2024-12-0922:431 reply

        What's the point of poking holes in new technology and nitpiking like this? Are you blind to the immense breakthroughs made today and yet you focus what irks you about some tiny detail that might go away after a couple of versions?

        • By Uehreka 2024-12-100:10

          At this phase of the game a lot of people are pretty accustomed to the pace of technological innovation in this space, and I think it’s reasonable for people to have a sense of what will/won’t go away in a few versions. Some of Sora’s issues may just require more training, some of these issues are intrinsic to their approach and will not be solvable with their current method.

          To that end, it is actually extremely important to nit-pick this stuff. For those of us using these tools, we need to be able to talk shop about which ones are keeping up, which are work like shit in practice, and which ones work but only in certain situations, and which situations those are.

      • By benchmarkist 2024-12-0918:383 reply

        Neural networks use smooth manifolds as their underlying inductive bias so in theory it should be possible to incorporate smooth kinematic and Hamiltonian constraints but I am certain no one at OpenAI actually understands enough of the theory to figure out how to do that.
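          For readers unfamiliar with the jargon, a toy illustration of the conservation behavior being alluded to (my own illustration, not anything OpenAI-specific or from this commenter) is the difference between a symplectic integrator and explicit Euler on a simple Hamiltonian:

```python
# Toy illustration of a Hamiltonian conservation constraint:
# for H = p^2/2 + q^2/2 (a harmonic oscillator), a symplectic
# leapfrog integrator keeps the energy nearly constant, while
# explicit Euler lets it drift without bound -- the analogue of
# "masses appearing out of nowhere" in an unconstrained generator.

def energy(q, p):
    return 0.5 * (q * q + p * p)

def euler(q, p, dt, steps):
    for _ in range(steps):
        q, p = q + dt * p, p - dt * q
    return q, p

def leapfrog(q, p, dt, steps):
    for _ in range(steps):
        p -= 0.5 * dt * q   # half kick
        q += dt * p         # drift
        p -= 0.5 * dt * q   # half kick
    return q, p

q0, p0, dt, steps = 1.0, 0.0, 0.1, 1000
e0 = energy(q0, p0)
drift_euler = abs(energy(*euler(q0, p0, dt, steps)) - e0)
drift_leap = abs(energy(*leapfrog(q0, p0, dt, steps)) - e0)
print(drift_euler > 100 * drift_leap)   # True: Euler's energy blows up
```

          Whether and how such a structural bias can be baked into a large generative video model is exactly what's contested in the replies below.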

        • By david-gpu 2024-12-0918:481 reply

          > I am certain no one at OpenAI actually understands enough of the theory to figure out how to do that

          We would love to learn more about the origin of your certainty.

          • By benchmarkist 2024-12-0918:581 reply

            I don't work there so I'm certain there is no one with enough knowledge to make it work with Hamiltonian constraints because the idea is very obvious but they haven't done it because they don't have the wherewithal to do so. In other words, no one at OpenAI understands enough basic physics to incorporate conservation principles into the generative network so that objects with random masses don't appear and disappear on the "video" manifold as it evolves in time.

            • By david-gpu 2024-12-0919:191 reply

              > the idea is very obvious but they haven't done it because they don't have the wherewithal to do so

              Fascinating! I wish I had the knowledge and wherewithal to do that and become rich instead of wasting my time on HN.

              • By benchmarkist 2024-12-0919:221 reply

                No one is perfect but you should try to do better and waste less time on HN now that you're aware and can act on that knowledge.

                • By david-gpu 2024-12-0920:26

                  Nah, I'm good. HN can be a very amusing place at times. Thanks, though.

        • By dartos 2024-12-0919:082 reply

          How does your conclusion follow from your statement?

          Neural networks are largely black box piles of linear algebra which are massaged to minimize a loss function.

          How would you incorporate smooth kinematic motion in such an environment?

          The fact that you discount the knowledge of literally every single employee at OpenAI is a big signal that you have no idea what you’re talking about.

          I don’t even really like OpenAI and I can see that.

          • By benchmarkist 2024-12-0919:141 reply

            I've seen the quality of OpenAI engineers on Twitter and it's easy enough to extrapolate. Moreover, neural networks are not black boxes; you're just parroting whatever you've heard on social media. The underlying theory is very simple.

            • By dartos 2024-12-0919:281 reply

              Do not make assumptions about people you do not know in an attempt to discredit them. You seem to be a big fan of that.

              I have been working with NLP and neural networks since 2017.

              They aren’t just black boxes, they are _largely_ black boxes.

              When training an NN, you don’t have great control over what parts of the model does what or how.

              Now instead of trying to discredit me, would you mind answering my question? Especially since, as you say, the theory is so simple.

              How would you incorporate smooth kinematic motion in such an environment?

              • By benchmarkist 2024-12-0919:322 reply

                Why would I give away the idea for free? How much do you want to pay for the implementation?

                • By mech422 2024-12-0919:551 reply

                  cop out... according to you, the idea is so obvious it wouldn't be worth anything.

                • By dartos 2024-12-0919:491 reply

                  lol. Ok dude you have a good one.

                  • By benchmarkist 2024-12-0919:542 reply

                    You too but if you do want to learn the basics then here's one good reference: https://www.amazon.com/Hamiltonian-Dynamics-Gaetano-Vilasi/d.... If you already know the basics then this is a good followup: https://www.amazon.com/Integrable-Hamiltonian-Systems-Geomet.... The books are much cheaper than paying someone like me to do the implementation.

                    • By chillfox 2024-12-101:091 reply

                      Seriously... The ability to identify what physics/math theories the AI should apply and being able to make the AI actually apply those are very different things. And you don't seem to understand that distinction.

                      • By benchmarkist 2024-12-102:121 reply

                        Unless you have $500k to pay for the actual implementation of a Hamiltonian video generator then I don't think you're in a position to tell me what I know and don't know.

                        • By chillfox 2024-12-103:271 reply

                          lolz, I doubt very much anyone would want to pay you $500k to perform magic. Basically, I think you are coming across as someone who is trying to sound clever rather than being clever.

                          • By benchmarkist 2024-12-104:041 reply

                            My price is very cheap in terms of what it would enable and allow OpenAI to charge their customers. Hamiltonian video generation with conservation principles which do not have phantom masses appearing and disappearing out of nowhere is a billion dollar industry so my asking price is basically giving away the entire industry for free.

                            • By chillfox 2024-12-105:171 reply

                              Sure, but I imagine the reason you haven't started your own company to do it is you need 10s of millions in compute, so the price would be 500k + 10s of millions... Or you can't actually do it and are just talking shit on the internet.

                    • By dartos 2024-12-0920:061 reply

                      Yeah I mean I would never pay you for anything.

                      You’ve convinced me that you’re small and know very little about the subject matter.

                      You don’t need to reply to this. I’m done with this convo.

        • By esafak 2024-12-103:571 reply

          There are physicists at OpenAI. You can verify with a quick search. So someone there clearly knows these things.

          • By benchmarkist 2024-12-104:081 reply

            I'd be embarrassed if I were a physicist and my name was associated with software that had phantom masses appearing and disappearing into the void.

            • By esafak 2024-12-1013:481 reply

              Why don't you write a paper or start a company to show them the right way to do it?

              • By benchmarkist 2024-12-1016:44

                I don't think there is any real value in making videos other than useless entertainment. The real inspired use of computation and AI is to cure cancer, that would be the right way to show the world that this technology is worthwhile and useful. The techniques involved would be the same because one would need to include real physical constraints like conservation of mass and energy instead of figuring out the best way to flash lights on the screen with no regard for any foundational physical principles.

                Do you know anyone or any companies working on that?

    • By beefnugs 2024-12-0919:012 reply

      AI isn't trying to sell to you: a precise artist with real vision in your brain. It is selling to managers who want to shit out something in an evening that approximates anything, that writes ads that no one wants to see anyway, that produces surface level examples of how you can pay employees less because "their job is so easy"

      • By spuz 2024-12-0919:151 reply

        Yes and the thing is, even for those tasks, it's incredibly difficult to achieve even the low bar that a typical advertising manager expects. Try it yourself for any real world task and you will see.

        • By cornel_io 2024-12-0919:543 reply

          Counterpoint: our CEO spent 25 minutes shitting out a bunch of AI ads because he was frustrated with the pace of our advertising creative team. They hated the ads that he created, for the reasons you mention, but we tested them anyways and the best performing ones beat all of our "expert" team's best ads by a healthy margin (on all the metrics we care about, from CTR to IPM and downstream stuff like retention and RoAS).

          Maybe we're in a honeymoon period where your average user hasn't gotten annoyed by all the slop out there and they will soon, but at least for now, there is real value here. Yes, out of 20 ads maybe only 2 outperform the manually created ones, but if I can create those 20 with a couple hundred bucks in GenAI credits and maybe an hour or two of video editing that process wipes the floor with the competition, which is several thousand dollars per ad, most of which are terrible and end up thrown away, too. With the way the platforms function now, ad creative is quickly becoming a volume-driven "throw it at the wall and see what sticks" game, and AI is great for that.

          • By sarchertech 2024-12-0921:38

            > Maybe we're in a honeymoon period where your average user hasn't gotten annoyed by all the slop out there and they will soon

            It’s this. A video ad with a person morphing into a bird that takes off like a rocket with fire coming out of its ass, sure it might perform well because we aren’t saturated with that yet.

            You’d probably get a similar result by giving a camera to a 5 year old.

            But you also have to ask what that’s doing long term to your brand.

          • By gonzobonzo 2024-12-105:23

            > Counterpoint: our CEO spent 25 minutes shitting out a bunch of AI ads because he was frustrated with the pace of our advertising creative team. They hated the ads that he created, for the reasons you mention, but we tested them anyways and the best performing ones beat all of our "expert" team's best ads by a healthy margin (on all the metrics we care about, from CTR to IPM and downstream stuff like retention and RoAS).

            My guess is that the criticism of AI not being that good is correct, but many people don't realize that most humans also aren't that good, and that it's quite possible that the AI performs better than mediocre humans.

            This shouldn't be much of a surprise, we've seen automation replace low skilled labor in a lot of industries. People seem uncomfortable with the possibility that there's actually a lot of low skilled labor in the creative industry that could also be easily replaced.

          • By mewpmewp2 2024-12-0922:09

            A/B/C/D testing is the perfect grounds for that. You can keep automatically generating and iterating quickly while A/B tests are constantly being ran. This data on CTR can later be used to train the model better as well.
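            A minimal sketch of that generate-and-test loop is a bandit over ad variants; here Thompson sampling over Beta posteriors of each ad's click-through rate (the variant names and CTRs are invented for the simulation):

```python
import random

# Toy sketch of "keep generating variants, let the test pick winners":
# Thompson sampling with Beta(1,1) priors over each ad's click-through
# rate. The "true" CTRs below are made up for the simulation.

true_ctr = {"ad_human": 0.030, "ad_gen_1": 0.025, "ad_gen_2": 0.045}
wins = {a: 1 for a in true_ctr}
losses = {a: 1 for a in true_ctr}

random.seed(0)
for _ in range(20_000):  # 20k impressions
    # sample a plausible CTR from each posterior, show the highest
    ad = max(true_ctr, key=lambda a: random.betavariate(wins[a], losses[a]))
    if random.random() < true_ctr[ad]:
        wins[ad] += 1
    else:
        losses[ad] += 1

shown = {a: wins[a] + losses[a] - 2 for a in true_ctr}
print(shown)  # traffic concentrates on the best-performing variant
```

            Newly generated variants can simply be added to the dictionary mid-run; the sampler automatically shifts impressions toward whatever performs, which is the "throw it at the wall" dynamic described above.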

      • By soheil 2024-12-0922:44

        You seem to speak from experience of being that manager... I'm not going to ask what you shit out in your evenings.

    • By minimaxir 2024-12-0918:251 reply

      Way back in the days of GPT-2, there was an expectation that you'd need to cherry-pick at least 10% of your output to get something usable/coherent. GPT-3 and ChatGPT greatly reduced the need to cherry-pick, for better or for worse.

      All the generated video startups seem to generate videos with much lower than 10% usable output, without significant human-guided edits. Given the massive amount of compute needed to generate a video relative to hyperoptimized LLMs, the quality issue will handicap gen video for the foreseeable future.

      • By joe_the_user 2024-12-0918:49

        Plus editing text or an image is practical. Video editors typically are used to cut and paste video streams - a video editor can't fix a stream of video that gets motion or anatomy wrong.

    • By didibus 2024-12-101:53

      Right, but you're thinking as someone who has a vision for the image/video. Think from someone who is needing an image/video and would normally hire a creative person for it, they might be able to get away with AI instead.

      The same "prompt" they'd give the creative person they hired... Say, "I want an ad for my burgers that make it look really good, I'm thinking Christmas vibes, it should emphasize our high quality meat, make it cheerful, and remember to hint at our brand where we always have smiling cows."

      Now that creative person would go make you that advert. You might check it, give a little feedback for some minor tweaks, and at some point, take what you got.

      You can do the same here. The difference right now is that it'll output a lot of junk that a creative person would have never dared show you, so that initial quality filtering is missing. But on the flip side, it costs you a lot less, can generate like 100 of them quickly, and you just pick one that seems good enough.

    • By hipadev23 2024-12-0918:344 reply

      Real artists struggle matching vague descriptions of what is in your head too. This is at least quicker?

      • By staticman2 2024-12-0918:471 reply

        Real artists take comic book scripts and turn them into actual comic books every month. They may not match exactly what the writer had in mind, but they are fit for purpose.

        • By TeMPOraL 2024-12-0918:551 reply

          > They may not match exactly what the writer had in mind, but they are fit for purpose.

          That's what GenAI is doing, too. After all, the audience only sees the final product; they never get to know what the writer had in mind.

          • By staticman2 2024-12-0919:312 reply

            I haven't used SORA, but none of the GenAI I'm aware of could produce a competent comic book. When a human artist draws a character in a house in panel 1, they'll draw the same house in panel 2, not a procedurally generated different house for each image.

            If a 60 year old grizzled detective is introduced in page 1, a human artist will draw the same grizzled detective in page 2, 3 and so on, not procedurally generate a new grizzled detective each time.

            • By TeMPOraL 2024-12-0919:55

              A human artist keeps state :). They keep it between drawing sessions, and more importantly, they keep very detailed state - their imagination or interpretation of what the thing (house, grizzled detective, etc.) is.

              Most models people currently use don't keep state between invocations, and whatever interpretation they make from provided context (e.g. reference image, previous frame) is surface level and doesn't translate well to output. This is akin to giving each panel in a comic to a different artist, and also telling them to sketch it out by their gut, without any deep analysis of prior work. It's a big limitation, alright, but researchers and practitioners are actively working to overcome it.

              (Same applies to LLMs, too.)

            • By Der_Einzige 2024-12-1014:071 reply

              Btw there’s a way to match characters in a batch in the forge webUI which guarantees that all images in the batch have the same figure in it. Trivial to implement this in all other image generators. This critique is baseless.

              • By staticman2 2024-12-1017:481 reply

                So prove it. If you are in good faith arguing that an AI, via automation, can draw a comic script with consistent figures, please tell an AI to draw the images in the first 3 pages of this script I pulled from the comic book script archive:

                https://www.comicsexperience.com/wp-content/uploads/2018/09/...

                Or if you can't do this, explain why the feature you mentioned cannot do this, and what it is good for.

                • By TeMPOraL 2024-12-1116:041 reply

                  As long as you're not asking for a zero-shot solution with a single model run three times in a row, this should be entirely doable, though I imagine ensuring the result would require a complex pipeline consisting of:

                  - An LLM to inflate descriptions in the script to very detailed prompts (equivalent to artist thinking up how characters will look, how the scene is organized);

                  - A step to generate a representative drawing of every character via txt2img - or more likely, multiple ones, with a multimodal LLM rating adherence to the prompt;

                  - A step to generate a lot of variations of every character in different poses, using e.g. ControlNet or whatever is currently the SOTA solution used by the Stable Diffusion community to create consistent variations of a character;

                  - A step to bake all those character variations into a LoRA;

                  - Finally, scenes would be generated by another call to txt2img, with prompts computed in step 1, and appropriate LoRAs active (this can be handled through prompt too).

                  Then iterate on that, e.g. maybe additional img2img to force comic book style (with a different SD derivative, most likely), etc.

                  Point being, every subproblem of the task has many different solutions already developed, with new ones appearing every month - all that's left to have an "AI artist" capable of solving your challenge is to wire the building blocks up. For that, you need just a trivial bit of Python code using existing libraries (e.g. hooking up to ComfyUI), and guess what, GPT-4 and Claude 3.5 Sonnet are quite good at Python.

                  EDIT: I asked Claude to generate "pseudocode" diagram of the solution from our two comments:

                  http://www.plantuml.com/plantuml/img/dLLDQnin4BthLmpn9JaafOR...

                  Each of the nodes here would be like 3-5 real ComfyUI nodes in practice.
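
                  The control flow of the pipeline sketched above could be wired up in a few lines of Python. This is a minimal sketch only: every function here is a hypothetical stand-in for real ComfyUI/diffusers calls, and the generation, judging, and LoRA-baking stages are reduced to stubs just to show the data flow between steps.

```python
# Hypothetical orchestration of the five-stage pipeline described above.
# Every function body is a stub standing in for real model calls;
# only the wiring between stages is illustrated.

def inflate_script(script):
    """Stage 1: an LLM expands terse script lines into detailed prompts."""
    return [{"panel": i, "prompt": f"detailed: {line}"}
            for i, line in enumerate(script)]

def generate_reference(character):
    """Stage 2: txt2img reference drawing, re-rolled until a multimodal
    LLM judge accepts it (generation and judging elided here)."""
    return {"character": character, "image": f"ref_{character}.png"}

def bake_lora(character, variations):
    """Stage 4: bake pose variations of a character into a LoRA."""
    return {"character": character, "lora": f"{character}.safetensors"}

def render_panels(script, characters):
    prompts = inflate_script(script)                        # stage 1
    refs = {c: generate_reference(c) for c in characters}   # stage 2
    # Stage 3 (ControlNet pose variations) elided; feed references in directly.
    loras = {c: bake_lora(c, [refs[c]]) for c in characters}
    # Stage 5: final txt2img per panel with the relevant LoRAs active.
    return [{"panel": p["panel"],
             "prompt": p["prompt"],
             "loras": [loras[c]["lora"] for c in characters]}
            for p in prompts]

panels = render_panels(["Detective enters the house", "Close-up on his face"],
                       ["detective"])
```

                  The point of the sketch is that consistency comes from threading the same baked LoRAs through every panel's final generation call, rather than relying on each txt2img invocation to reinvent the character.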

                  • By staticman2 2024-12-1118:561 reply

                    I appreciate the detailed response. I had a feeling the answer was some variation of "well I could get an AI to draw that but I'd have to hack at it for a few hours...". If a human has to work at it for hours, it's more like using Blender than "having an AI draw it" in my mind.

                    I suspect if someone went to the trouble to implement your above solution they'd find the end result isn't as good as they'd hoped. In practice you'd probably find one or more steps don't work correctly- for example, maybe today's multimodal LLMs can't evaluate prompt adherence acceptably. If the technology was ready the evidence would be pretty clear- I'd expect to see some very good, very quickly made comic books shown off by AI enthusiasts on reddit rather than the clearly limited/not very good comic book experiments which have been demonstrated so far.

                    • By TeMPOraL 2024-12-1122:24

                      > If a human has to work at it for hours, it's more like using Blender than "having an AI draw it" in my mind.

                      A human has to work at it too; more than a few hours when doing more than a few quick sketches (memory has its limits; there's a reason artists keep reference drawings around), and obviously they've already put years into learning their skills beforehand, but fair - the human artist already knows how to do things that any given model doesn't yet[0]; we kind of have to assemble the overall flow ourselves for now[1].

                      Then again, you only need to assemble it once, putting those hours of work up front - and if it's done, and it works, it becomes fair to say that AI can, in fact, generate self-consistent comic books.

                      > I suspect if someone went to the trouble to implement your above solution they'd find the end result isn't as good as they'd hoped. In practice you'd probably find one or more steps don't work correctly- for example, maybe today's multimodal LLM's can't evaluate prompt adherence acceptably.

                      I agree. I obviously didn't try this myself either (yet, I'm very tempted to try it, to satisfy my own curiosity). However, between my own experience with LLMs and Stable Diffusion, and occasionally browsing Stable Diffusion subreddits, I'm convinced all individual steps work well (and have multiple working alternatives), except for the one you flagged, i.e. evaluating prompt adherence using multimodal LLM - that last one I only feel should work, but I don't know for sure. However, see [1] for alternative approach :).

                      My point thus is, all individual steps are possible, and wiring them together seems pretty straightforward, therefore the whole thing should work if someone bothers to do it.

                      > If the technology was ready the evidence would be pretty clear- I'd expect to see some very good, very quickly made comic books shown off by AI enthusiast on reddit rather then the clearly limited/ not very good comic book experiments which have been demonstrated so far.

                      I think the biggest concentration of enthusiasm is to be found in NSFW uses of SD :). On the one hand, you're right; we probably should've seen it done already. On the other hand, my impression is that most people doing advanced SD magic are perfectly satisfied with partially manual workflows. And it kind of makes sense - manual steps allow for flexibility and experimentation, and some things are much simpler to wire by hand or patch up with some tactical photoshopping, than to try and automate them fully. In particular, judging the quality of output is both easy for humans and hard to automate.

                      Still, I've recently seen ads of various AI apps claiming to do complex work (such as animating characters in photos) end-to-end automatically - exactly the kind of work that's typically done in partially manual process. So I suspect fully-automated solutions are being built on a case-by-case basis, driven by businesses making apps for the general population; a process that lags some months behind what image gen communities figure out in the open.

                      --

                      [0] - Though arguably, LLMs contain the procedural knowledge of how a task should be done; just ask it to ELI5 or explain in WikiHow style.

                      [1] - In fact, I just asked Claude to solve this problem in detail, without giving it my own solution to look at (but hinting at the required complexity level); see this: https://cloud.typingmind.com/share/db36fc29-6229-4127-8336-b... (and excuse the weird errors; Claude is overloaded at the moment, so some responses had to be regenerated; also styling on the shared conversation sucks, so be sure to use the "pop out" button on diagrams to see them in detail).

                      At very high level, it's the same as mine, but one level below, it uses different tools and approaches, some of which I never knew about - like keeping memory in embedding space instead of text space, and using various other models I didn't know exist.

                      EDIT: I did some quick web search for some of the ideas Claude proposed, and discovered even more techniques and models I never heard of. Even my own awareness of the image generation space is only scratching the surface of what people are doing.

      • By kevingadd 2024-12-0923:57

        I work with professional artists all the time and this is not the case. They're generally quite good at extrapolating from a couple paragraphs into something fantastic, often exactly what I had in mind.

        In comparison I've messed around with prompting image generator models quite a bit and it's not possible to get remotely close to the quality level of even rough paid concept work by a professional, and the credits to run these models aren't particularly cheap.

      • By coffeebeqn 2024-12-1010:55

        With real art you can start from somewhere and keep building on that foundation. Say you pick an angle to shoot from and test different actors and scenes from that angle. With AI you’re re-rolling the dice for every iteration. If you’re happy that it looks 80% correct then sure it’s maybe passable.

        I think people are getting way ahead of their skis here. Even in 2D I can’t for example generate inventory images for weapons and items for a game yet. Which is an orders of magnitude simpler test case than video. They all are slightly different styles. If I don’t care that they all look different in strange ways then it’s useful - but any consumer will think it looks like crap

      • By janalsncm 2024-12-0918:372 reply

        The point is if you are the artist and have something in your head. It’s the same problem with image editing. I am sure you have experienced this.

        • By TeMPOraL 2024-12-0919:033 reply

          There is no problem unless you insist on reflecting what you had in mind exactly. That needs minute controls. But no matter the medium and tools you use, unless you're doing it in your own quest for artistic perfection, the economic constraints will make you stop short of your idea: there's always a point past which any further refinement will not make a difference to the audience (which doesn't have access to the thing in your head to use as reference), and the costs of continuing will exceed any value (monetary or otherwise) you expect to get from the work.

          AI or not, no one but you cares about the lower order bits of your idea.

          • By throwup238 2024-12-0921:201 reply

            Nobody else really cares about the lower order bits of the idea but they do care that those lower order bits are consistent. The simplest example is color grading: most viewers are generally ignorant of artistic choices in color palettes unless it’s noticeable like the Netflix blue tint but a movie where the scenes haven’t been made consistently color graded is obviously jarring and even an expensive production can come off amateur.

            GenAI is great at filling in those lower order bits but until stuff like ControlNet gets much better precision and UX, I think genAI will be stuck in the uncanny valley because they’re inconsistent between scenes, frames, etc.

            • By TeMPOraL 2024-12-0921:30

              Yup, 100% agreed on that, and mentioned this caveat elsewhere. As you say - people don't pay attention to details (or lack of it), as long as the details are consistent. Inconsistencies stand out like sore thumbs. Which is why IMO it's best to have less details than to be inconsistent with them.

          • By pbhjpbhj 2024-12-1022:28

            >There is no problem unless you insist on reflecting what you had in mind exactly.

            Not disagreeing, just noting: this is not how [most?] people's minds work {I don't think you're holding to that opinion particularly, I'm just reflecting on this point}. We have vague ideas until an implementation is shown, then we examine it and latch on to a detail and decide if it matches our idea or not. For me, if I'm imagining "a superhero planting vegetables in his garden" I've no idea what they're actually wearing, but when an artist or genAI shows me it's a brown coat then I'll say "no, something more Marvel". Then when ultimately they show me something that matches the idea I had _and_ matches my current conception of the idea I had... then I'll point out the fingernails are too long, when in the idea I hadn't even perceived the person had fingers, never mind too-long fingernails!

            I'd warrant any actualised artistic work has some delta with the artist's current perception of the work; and a larger delta with their initial perception of it.

          • By janalsncm 2024-12-0921:061 reply

            I disagree. Even without exactness, adding any reasonable constraints is impossible. Ask it to generate a realistic circuit diagram or chess board or any other thing where precision matters. Good luck going back and forth getting it right.

            These are situations with relatively simple logical constraints, but an infinite number of valid solutions.

            Keep in mind that we are not requiring any particular configuration of circuit diagram, just any diagram that makes sense. There are an infinite number of valid ones.

            • By TeMPOraL 2024-12-0921:261 reply

              That's using the wrong tool for a job :). Asking diffusion models to give you a valid circuit diagram is like asking a painter to paint you pixel-perfect 300DPI image on a regular canvas, using their standard paintbrush. It ain't gonna work.

              That doesn't mean it can't work with AI - it's that you may need to add something extra to the generative pipeline, something that can do circuit diagrams, and make the diffusion model supply style and extra noise (er, beautifying elements).

              > Keep in mind that we are not requiring any particular configuration of circuit diagram, just any diagram that makes sense. There are an infinite number of valid ones.

              On that note. I'm the kind of person that loves to freeze-frame movies to look at markings, labels, and computer screens, and one thing I learned is that humans fail at this task too. Most of the time the problems are big and obvious, ruining my suspension of disbelief, and importantly, they could be trivially solved if the producers grabbed a random STEM-interested intern and asked for advice. Alas, it seems they don't care.

              This is just a specific instance of the general problem of "whatever you work with or are interested in, you'll see movies keep getting it wrong". Most of the time, it's somewhat defensible - e.g. most movies get guns wrong, but in way people are used to, and makes the scenes more streamlined and entertaining. But with labels, markings and computer screens, doing it right isn't any more expensive, nor would it make the movie any less entertaining. It seems that the people responsible don't know better or care.

              Let's keep that in mind when comparing AI output to the "real deal", so as not to set impossible standards that human productions don't match, and never did.

              • By janalsncm 2024-12-0922:38

                The issue isn’t any particular constraint. The issue is the inability to add any constraints at all.

                In particular, internal consistency is one of the important constraints which viewers will immediately notice. If you’re just using sora for 5 second unrelated videos it may be less of an issue but if you want to do anything interesting you’ll need the clips to tie together which requires internal consistency.

        • By mlboss 2024-12-0918:40

          So what I am getting is a use-case for a brain-computer interface.

    • By hmottestad 2024-12-0921:26

      When I first started learning Photoshop as a teenager I often knew what I wanted my final image to look like, but no matter how hard I tried I could never get there. It wasn't that it was impossible, it was just that my skills weren't there yet. I needed a lot more practice before I got good enough to create what I could see in my imagination.

      Sora is obviously not Photoshop, but given that you can write basically anything you can think of I reckon it's going to take a long time to get good at expressing your vision in words that a model like Sora will understand.

    • By corytheboyd 2024-12-0922:181 reply

      Free text is just the fundamentally wrong input for precision work like this. Because it is wrong for this doesn’t mean it has NO purpose, it’s still useful and impressive for what it is.

      FWIW I too have been quite frustrated iterating with AI to produce a vision that is clear in my head. Past changing the broad strokes, once you start “asking” for specifics, it all goes to shit.

      Still, it’s good enough at those broad strokes. If you want your vision to become reality, you either need to learn how to paint (or whatever the medium), or hire a professional, both being tough-but-fair IMO.

      • By londons_explore 2024-12-0922:22

        I don't think it'll be long before GUI tools catch up for editing video.

        Things like rearranging things in the scene with drag'n'drop sound implementable (although incredibly GPU heavy)

    • By ohthehugemanate 2024-12-107:11

      If you have a specific vision, you will have to express the detailed information of that vision into the digital realm somehow. You can use (more) direct tools like premiere if you are fluent enough in their "language". Or you can use natural language to express the vision using AI. Either way you have to get the same amount of information into a digital format.

      Also, AI sucks at understanding detail expressed in symbolic communication, because it doesn't understand symbols the way linguistic communication expects the receiver to understand them.

      My own experience is that all the AI tools are great for shortcutting the first 70-80% or so. But the last 20% goes up an exponential curve of required detail which is easier and easier to express directly using tooling and my human brain.

      Consider the analogy to a contract worker building or painting something for you. If all you have is a vague description, they'll make a good guess and you'll just have to live with that. But the more time you spend with them communicating (through description, mood boards rough sketches etc) the more accurate to your detailed version it will get. But you only REALLY get exactly what you want if you do it yourself, or sit beside them as they work and direct almost every step. And that last option is almost impossible if they can't understand symbolic meaning in language.

    • By cube2222 2024-12-0918:191 reply

      Agreed. It’s still much better than what I could do myself without it, though.

      (Talking about visual generative AI in general)

      • By JKCalhoun 2024-12-0918:231 reply

        Yeah, but if I handed you a Maxfield Parrish it would be better than either of us can do — but not what I asked for.

        I find generative AI frustrating because I know what I want. To this point I have been trying but then ultimately sitting it out — waiting for the one that really works the way I want.

        • By cube2222 2024-12-0918:261 reply

          For me even if I know what I want, if I’m using gen AI I’m happy to compromise and get good enough (which again, is so much better than I could do otherwise).

          If you want higher quality/precision, you’ll likely want to ask a professional, and I don’t expect that to change in the near future.

          • By adamc 2024-12-0918:472 reply

            That limits its value for industries like Hollywood, though, doesn't it? And without that, who exactly is going to pay for this?

            • By cube2222 2024-12-0919:041 reply

              To me, currently, visual generative ai is an evolution and improvement of stock images, and has effectively the same purpose.

              People pay for stock images.

              • By adamc 2024-12-0919:091 reply

                Yeah, maybe for some purposes. In business, people sometimes pay for stock images but often don't have the expertise or patience to really spend a lot of time coaching a video into fruition. Maybe for advertising or other contexts where more effort is worth it (not just powerpoints), but it feels like a slim audience.

                • By cube2222 2024-12-0919:16

                  With tools like Apple Intelligence and its genmoji (emoji generation) and playground (general diffusion image generation) I expect it to also take on some of the current entertainment and social use-cases of stickers and GIFs.

                  But that’s probably something you don’t pay for directly, instead paying for e.g. a phone that has those features.

            • By jddj 2024-12-0919:031 reply

              Advertisers, I guess. Same folks who paid for everything else around here

              • By adamc 2024-12-0919:10

                Yeah, I just question if there are enough customers to make this work.

    • By joe_the_user 2024-12-106:591 reply

      The thing about Hollywood is that movies aren't made by a producer or director creating a description and an army of actors, techs, etc. doing exactly that.

      What happens is a description becomes a longer specification or script that's still good and hangs together in itself and then further iterations involving professionals who can't do "exactly what the director wants" but rather do something further that's good and close enough to what the director wants.

      • By skydhash 2024-12-1017:29

        Also, a team of experts and professionals that knows better than the director how a specific thing works.

    • By diob 2024-12-0919:35

      I believe it. I was just using AI to help out with some mandatory end of year writing exercises at work.

      Eventually, it starts to muck with the earlier work that it did well on, when I'm just asking it to add onto it.

      I was still happy with what I got in the end, but it took trial and error and then a lot of piecemeal coaxing with verification that it didn't do more than I asked along the way.

      I can imagine the same for video or images. You have to examine each step post prompt to verify it didn't go back and muck with the already good parts.

    • By planb 2024-12-108:56

      Iterations are the missing link.

      With ChatGPT, you can iteratively improve text (e.g., "make it shorter," "mention xyz"). However, for pictures (and video), this functionality is not yet available. If you could prompt iteratively (e.g., "generate a red car in the sunset," "make it a muscle car," "place it on a hill," "show it from the side so the sun shines through the windshield"), the tools would become exponentially more useful.

    • By goldfeld 2024-12-0922:55

      If you use it in a utilitarian way it'll give you a run for your money; if you use it for expression, such as art, learning to embrace some serendipity, it makes good stuff.

    • By titzer 2024-12-0921:58

      As only a cursory user of said tools (but strong opinions) I felt the immediate desire to get an editable (2D) scene that I could rearrange. For example I often have a specific vantage point or composition in mind, which is fine to start from, but to tweak it and the elements, I'd like to edit it afterwards. To foray into 3D, I'd be wanting to rearrange the characters and direct them, as well as change the vantage point. Can it do that yet?

    • By javier123454321 2024-12-0919:20

      This is the conundrum of AI generated art. It will lower the barrier to entry for new artists to produce audiovisual content, but it will not lower the amount of effort required to make good art. If anything it will increase the effort, as it has to be excellent in order to get past the slop of base level drudge that is bound to fill up every single distribution channel.

    • By moralestapia 2024-12-0919:00

      Still three or four orders of magnitude cheaper and easier than producing said video through traditional methods.

    • By nomel 2024-12-0918:23

      I think inpainting and "draw the labeled scene" type interfaces are the obvious future. Never thought I'd miss GauGAN [1].

      https://www.youtube.com/watch?v=uNv7XBngmLY&t=25

    • By jstummbillig 2024-12-0919:12

      > A way to test this is to take a piece of footage or an image which is the ground truth, and test how much prompting and editing it takes to get the same or similar ground truth starting from scratch.

      Sure, if you then do the same in reverse.

    • By mattigames 2024-12-0918:27

      Not too far in the future you will be able to drag and drop the position of the characters as well as the position of the camera, among other refinement tools.

    • By estebarb 2024-12-0920:49

      For those scenarios would be helpful a draft generation mode: 16 colors, 320x200...

    • By torginus 2024-12-0918:52

      Yeah, it almost feels like gambling - 'you're very close, just spend 20 more credits and you might get it right this time!'

    • By bilsbie 2024-12-0920:28

      Sounds like another way of saying a picture is worth a thousand words.

    • By droidrat 2024-12-0918:32

      [dead]

  • By telenardo 2024-12-106:303 reply

    For those curious (and still locked out), here’s a direct comparison of Sora vs. the open-source leaders (HunyuanVideo, Mochi and LTX):

    https://app.checkbin.dev/snapshots/1f0f3ce3-6a30-4c1a-870e-2...

    Pros:

    - Some of the Sora results are absolutely stunning. Check out the detail on the lion, for example!

    - The landscapes and aerial shots are absolutely incredible.

    - Quality is much better than Mochi & LTX out of the box. Mochi/LTX seem to require specifically optimized workflows (I've seen great img2vid LTX results on Reddit that start with Flux image generations, for example). Hunyuan seems comparable to Sora!

    Cons:

    - Still nearly impossible to access Sora despite the “launch”. My generations today were in the 2000s, implying that it’s only open to a very small number of people. There’s no API yet, so it’s not an option for developers.

    - Sora struggles with physical interactions. Watch the dancers moonwalk, or the ball go through the dog. HunyuanVideo seems to be a bit better in this regard.

    - Can't run it locally (obviously).

    - I haven't tested this, but I think it's safe to assume Sora will be censored extensively. HunyuanVideo is surprisingly open (I've seen NSFW generations!).

    - I’m getting weird camera angles from Sora, but that could likely be solved with better prompting.

    Overall, I’d say it’s the best model I've played with, though I haven’t spent much time on other non-open-source ones. Hunyuan gives it a run for its money, though!

    • By spondyl 2024-12-106:59

      I can't speak to any of those videos in a technical sense but personally, I don't feel like any of them are good?

      The vibe they give me is similar to the iPhone photography commercials where yes, in theory, a picnic in the park could look exactly like this except for all the parts that seem movie perfect.

      I guess it's really more of a colour grading question where most of the Sora colour grading triggers that part of my brain that says "I'm watching a movie and this isn't real" without quite realising why.

      A few of the Hunyuan videos in contrast seem a bit more believable even though they have some obvious glitches at times.

      The other thing I think Sora has is that thing in commercials where no one else except the protagonist exists and nothing is ever inconvenient. The video of the teacher in a classroom with no students reminds me of that as well as the picnic in the park where there's wide open space with no one around.

      I suppose it depends if the goal is to generate believable video and how you define believable.

    • By zuminator 2024-12-109:09

      Hunyuan was more realistic but lower quality than Sora, producing shorter videos with lower resolution or bitrate. The downside to Sora's sharpness is that it makes mistakes more apparent. Also funny that Sora didn't understand the rolling dunes metaphor.

    • By CSMastermind 2024-12-106:42

      Based on this it really seems like Hunyuan is a significantly better model. In nearly every example I preferred its output.

  • By pen2l 2024-12-0918:3625 reply

    Every day that passes I grow fonder of Google's decision to delay or otherwise keep a lot of this under the wraps.

    The other day I was scrolling down on YouTube shorts and a couple videos invoked an uncanny valley response from me (I think it was a clip of an unrealistically large snake covering some hut) which was somehow fascinating and strange and captivating, and then scrolling down a few more, again I saw something kind of "unbelievable"... I saw a comment or two saying it's fake, and upon closer inspection: yeah, there were enough AI'esque artifacts that one could confidently conclude it's fake.

    We'd known about AI slop permeating Facebook -- usually a Jesus figure made out of an unlikely set of things (like shrimp!) -- and we'd known that it grips eyeballs. And I don't even know in which box to categorize this; in my mind it conjures the image of those people on slot machines, mechanically and soullessly pulling levers because they are addicted. It's just so strange.

    I can imagine now some of the conversations that might have happened at Google when they chose to keep a lot of innovations related to genAI under wraps (I'm being charitable here about their motives), and I can't help but agree.

    And I can't help but be saddened by OpenAI's decision to unload a lot of this before recognizing the results of unleashing it on humanity, because I'm almost certain it'll be used more for bad things than good things; I'm certain its application to bad things will secure more eyeballs than its application to good things.

    • By lelandfe 2024-12-0919:4912 reply

      I saw my first AI video that completely fooled commenters: https://imgur.com/a/cbjVKMU

      This was not marked as AI-generated and commenters were in awe at this fuzzy train, missing the "AIGC" signs.

      I'm quite nervous for the future.

      • By superfrank 2024-12-0920:548 reply

        I know there are people acting like it's obvious that this is AI, but I get why people wouldn't catch it, even if they know that AI is capable of creating a video like this.

        A) Most of the giveaways are pretty subtle and not what viewers are focused on. Sure, if you look closely the fur blends in with the pavement in some places, but I'm not going to spend 5 minutes investigating every video I see for hints of AI.

        B) Even if I did notice something like that, I'm much more likely to write it off as a video filter glitch, a weird video perspective, or just low quality video. For example, when they show the inside of the car, the vertical handrails seem to bend in a weird way as the train moves, but I've seen similar things from real videos with wide angle lenses. Similar thoughts on one of the bystander's faces going blurry.

        I think we just have to get people comfortable with the idea that you shouldn't trust a single unknown entity as the source of truth on things because everything can be faked. For insignificant things like this it doesn't matter, but for big things you need multiple independent sources. That's definitely an uphill battle and who knows if we can do it, but that's the only way we're going to get out the other side of this in one piece.

        • By jasinjames 2024-12-100:342 reply

          I agree. Also, tangentially related: I use a black and white filter on my phone, and it is way harder to distinguish fake and real media without the color channels to help. I couldn't immediately find anything in the subway clip which gave it away.

        • By n1b0m 2024-12-1013:162 reply

          I agree. Apart from the text appearing backwards it all looked pretty real to me.

          • By lelandfe 2024-12-1016:41

            My assumption was the uploader wanted to make the creator's "AIGC" less obvious. It definitely did that to me.

          • By KeplerBoy 2024-12-1014:381 reply

            Yeah, that's a weird one. I doubt the video was generated that way. I assume someone flipped the video for "artistic" purposes.

            • By ukuina 2024-12-1016:221 reply

              Reversing text is a known loophole to getting around copyright guardrails in image-generation models.

              • By KeplerBoy 2024-12-1016:281 reply

                How does that work? Would you prompt the model to write "hello Kitty but in reverse" on the train so the resulting image isn't flagged?

                • By thereddaikon 2024-12-1020:26

                  Much more likely they just flipped the video in an editor after it was generated. It's common enough to see flipped video with backwards text on social media, most people wouldn't give it a second thought.

        • By nihil2501 2024-12-105:592 reply

          I'm beginning to write off most images as AI. I actually think that's where this is all headed.

          • By thih9 2024-12-108:201 reply

            There are projects like https://contentcredentials.org/. If we want, with some effort we could distinguish between real and AI-generated. If.

            • By jprete 2024-12-1014:461 reply

              No individual actor - human or corporate - stands to benefit enough because "trust in reality" is neither easily measured nor financialized.

          • By netdevphoenix 2024-12-1013:221 reply

            that's the easiest position imo. It's AI unless proven otherwise. No one has the time to pay this much detailed attention to a random video when the purpose of the video is just entertainment. What this might lead to, though, is people losing (or not learning) the skills needed to separate real content from AI-generated content.

            • By jacobr1 2024-12-1016:48

              And even if it isn't AI, it is quite possibly deceptively edited. Content provenance will be important in the future.

        • By cess11 2024-12-1011:261 reply

          A precondition is likely that one has mainly watched CGI-heavy movies for most of one's life. Compared to old-school analog movies or fairly raw photography, it looks as fake as the Coca-Cola Santa. There's a rather obvious lack of detail that real photography would have caught.

          • By Quekid5 2024-12-1011:423 reply

            > A precondition is likely that one has mainly watched CGI-heavy movies for most of one's life.

            Indeed, a great (if counterintuitive) example of this is The Wolf of Wall Street. I bet a lot of people would be surprised at just how much CGI is used in that just for set/location.

            • By lelandfe 2024-12-1012:47

            • By sbarre 2024-12-1012:451 reply

              The OG film for that was Forrest Gump. It is often lauded as one of the first movies to use CGI heavily but in completely, and intentionally, unnoticeable ways...

              • By Quekid5 2024-12-1321:12

                True, but in that case you knew it had to be CGI because Kennedy didn't talk to Tom Hanks in any capacity.

            • By cess11 2024-12-1014:06

              Sure, it's like a weird dream where sometimes shadows don't come from the sun and the scenery has this absurd, acutely unreal polish.

        • By fennecbutt 2024-12-1810:24

          A) It's also true that many people don't put a lot of thought into very much at all. They'd never consider actively thinking about whether a video is fake or not. These are the targets of short-form content.

        • By Cthulhu_ 2024-12-1012:26

          B is / will be huge; the largest amount of "mindless" content is consumed on phones, with half attention, often with other distractions going on and in between doing other stuff, and can be watched on older / lower fidelity devices, slower internet connections, etc. AI content needs high resolution / big screens and focused attention to "discover".

          The truth is... most people will simply not care. Raised eyebrow, hm, cute, next. Critical watching is reserved for critics like the crowd on HN and the like, but they represent only a small percentage of the target audience and revenue stream.

        • By KennyBlanken 2024-12-108:252 reply

          You can see the perspective/angle of the objects changing slightly as the camera moves in a way that makes it pretty obvious they're CG, AI or otherwise. That's always been a problem with AI generated imagery in video/animation; it changes too much frame to frame. If researchers figure out how to address that, yeah, we've got a problem. Until then, this looks worse than the real thing.

          Then there's the usual giveaways for CG - sharpness, noise, lighting, color temperature, saturation - none of them match. There's also no diffuse reflection of the intense pink color.

          • By lambdaone 2024-12-109:381 reply

            Yes. The lack of diffuse reflection from the pink train is the clearest giveaway, and AI videos in general have problems with getting shadows and radiosity right. There's also the existence of the real-world Hello Kitty Shinkansen and the APM Cat Bus in Japan that makes this image more plausible.

            • By Cthulhu_ 2024-12-1012:29

              That last point is also important; if it's not surprising, people will just accept it without being too critical about it. And since these AI tools are trained with real / existing content, creating realistic-enough content will be the norm. I think the first big AI generators - dall-e and co - had their model trained on more fantastical / artistic sources, and used that primarily as their model, also because realistic generation (like humans) wasn't yet good enough, or too uncanny. But uncanny and art work well together.

          • By jacobr1 2024-12-1016:49

            Also consider that one of the reasons AI-generated video has CG-like artifacts is that it is trained on CG video. Better CG generation and more real video for training will reduce these over time.

        • By ecmascript 2024-12-1010:40

          Honestly, stuff like that could also be because of compression. We're all used to seeing low-quality videos online.

      • By dagmx 2024-12-0919:515 reply

        Most people have terrible eyes for distinguishing content.

        I’ve worked in CG for many years and despite the online nerd fests that decry CG imagery in films, 99% of those people can’t tell what’s CG or not unless it’s incredibly obvious.

        It’s the same for GenAI, though I think there are more tells. Still, most people cannot tell reality from fiction. If you just tell them it’s real, they’ll most likely believe it.

        • By dmazzoni 2024-12-0923:411 reply

          > I’ve worked in CG for many years and despite the online nerd fests that decry CG imagery in films, 99% of those people can’t tell what’s CG or not unless it’s incredibly obvious.

          I've noticed people assume things are CG that turn out to be practical effects, or 90% practical with just a bit of CG to add detail.

          • By dagmx 2024-12-100:031 reply

            Yep I’ve had that happen many times, where people assume my work is real and the practical is CG.

            Worse, directors often lie about what’s practical and we’ll have replaced it with CG. So people online will cheer the “practicals” as being better visually, while not knowing what they’re even looking at.

            I’ve seen interviews with actors even where they talk about how they look in a given shot or have done something, and not realize they’re not even really in the shot anymore.

            People just have terrible eyes once you can convince them something is a certain way.

            • By Der_Einzige 2024-12-1013:531 reply

              But films without CG are clearly superior and it’s not even in contention.

                Lawrence of Arabia or Cleopatra alone have incredible fully live-shot special effects which cannot be easily replicated with CG and have aged like fine wine, unlike the trash early CG of the 80s and 90s which ruined otherwise great films like The Last Starfighter.

              • By dagmx 2024-12-1015:271 reply

                I’m sorry, but you make an absurd argument.

                You’re taking the best films of an era and comparing them to an arbitrary list of movies you don’t like? Adding to that, you’re comparing it to films in the infancy of a technology?

                This is peak confusion of causality and correlation. There are tons of great films in that time frame with CG. Unless you’re going to argue that Jurassic Park is bad.

                • By jacobr1 2024-12-1016:541 reply

                  Jurassic Park isn't just a good example of CG, it is also a good example of making the right choices on practical vs CG (in the context of technology of the time) and using a reasonable budget. You can have great CG and crappy CG by cutting corners. Plenty of people that decry CG don't actually know how much there is, even in non-sci-fi movies like romcoms, just for post-editing. But when it is done well nobody notices, the complaints only come when it looks like crap. Great use of technology to achieve the artistic vision will stand the test of time.

                  • By lancesells 2024-12-1017:29

                    It's also directed by one of the best directors in history.

        • By qingcharles 2024-12-107:541 reply

          The worst bit about working in CG, or film-making in general, is finding it harder to enjoy films because you are hypersensitized to bad work.

          • By dagmx 2024-12-1015:42

            Yeah, totally. It’s not even just bad work, but I’m constantly breaking down shots as I’m watching them.

            Especially because I’ve done both on set and virtual production, it’s hard to suspend disbelief in a lot of films.

        • By circlefavshape 2024-12-1010:172 reply

          > Still, most people cannot tell reality from fiction. If you just tell them it’s real, they’ll most likely believe it.

          This goes for conversation too! My neighbour recently told me about a mutual neighbour who walks 200 miles per day working on his farm. When I explained that this is impossible he said "I'll have to disagree with you there"

          • By schoen 2024-12-1010:31

            Maybe not strictly impossible, just slightly better than an ultramarathon world record pace?

            https://www.reddit.com/r/Ultramarathon/comments/xhbs4d/sorok...

            https://en.wikipedia.org/wiki/Aleksandr_Sorokin

            So, not very convenient for a non-world-champion runner to do (let alone while doing farm work) (let alone on more than one occasion).

          • By Cthulhu_ 2024-12-1012:312 reply

            That's a cultural issue that seems to have developed in the past years (decades? idk), where people take their own opinion (or what they think is their own opinion) as unchallengeable gospel.

            In my opinion anyway, I'm gonna have to disagree with any counterpoints in advance.

            • By dagmx 2024-12-1015:44

              This is partially the result of being taught that every opinion is valid. What was taught as a nicety (don’t dismiss other people’s opinions was the intention) has evolved into all opinions are equal.

              If all opinions are equal, and we’ve reinforced that you can find anything to strengthen an opinion, then facts don’t actually matter.

              But I don’t think it’s actually all that recent. History is full of people saying that facts or logic don’t matter. The Americas were “discovered” by such a phenomenon.

            • By mandmandam 2024-12-1014:00

              What's weird is the projection you get when you challenge someone's opinion in any way. All of a sudden, you're the arrogant one who thinks they're always right, no matter how diplomatic (or undeniably correct) about the issue you are. Or is that just me?

        • By 5040 2024-12-1016:58

          >Most people have terrible eyes for distinguishing content

          A related phenomenon is not being able to hear the difference between 128kbps and 320kbps. I find the notion astonishing, and yet lots of people cannot tell the difference.

        • By lancesells 2024-12-1017:27

          > Most people have terrible eyes for distinguishing content.

          But also in the case of the fluffy train there's nothing to compare it against. The reason CGI humans look the most fake is because we're trained from birth to read a human face. Someone that looks at trains on a regular basis will probably discern this as being fake quicker than most.

      • By krick 2024-12-108:373 reply

        Looks dope though. But what impressed me recently was some crypto-scam video, featuring "a clip" from the Lex Fridman Podcast where Elon Musk "reveals" his new crypto or whatever (sadly, the one I saw is currently deleted). It didn't really look good, they were talking with weird pauses and intonations, and as awkward as these two normally are, here they were even more unnatural. There was so much audacity to it I laughed out loud.

        But what I was thinking while enjoying the show was: people wouldn't do that, if it didn't work.

        This is the point. There is no such thing as "completely fools commenters". I mean, it didn't fool you, apparently. (But don't be sad, I bet you were fooled by something else: you just don't know it, obviously.) But some of it always fools somebody.

        I really liked how Thiel mentioned on some podcast that ChatGPT successfully passed the Turing test, which was implicitly assumed to be "the holy grail of AI", and nobody really noticed. This is completely true. We don't really think about ChatGPT as something that passes the Turing test; we think about how this fucking stupid, useless thing misled us with some mistake in a calculation we decided to delegate to it. But realistically, if it doesn't pass, it's only because it is specifically trained to try to avoid passing it.

        • By lelandfe 2024-12-1012:37

          I wish you were right that there is no way to completely fool viewers, but I know you are not. I was fooled! Note that I call out "AIGC." If that wasn't there (I only noticed it on repeat views), I would have simply had no way to tell. These are early, primitive AI generated videos, and I'm already unable to differentiate. Many in this thread talk about movie CG; there are countless movie scenes that fool all viewers.

        • By coffeebeqn 2024-12-109:36

          If someone were to train a model on the Joe Rogan podcast's whole run, I'm sure it would spit out extremely impressive fake results already

        • By vintermann 2024-12-109:08

          > people wouldn't do that, if it didn't work.

          You can't assume that with scams. Quite often, scams are themselves sold as a get-rich-quick scheme, which like all GRQ schemes, they wouldn't be if they worked well.

      • By peab 2024-12-1015:54

        Think about this: you very well may have already seen AI videos that fooled you - you wouldn't know if you did.

      • By coffeebeqn 2024-12-109:35

        One of the clearest signs in the current gen is that the typography looks bad still.

      • By darkerside 2024-12-102:292 reply

        People are smart enough to know that what you see in movies isn't real. It will just take a little time for people to realize that now applies to all videos and images.

        • By nihil2501 2024-12-106:00

          The frequency is so high, and I am getting so burned out on checking comments to gauge how much everything is changing, that I've nearly given up subconsciously. Pretty close to just ignoring all images I see.

      • By nurettin 2024-12-1016:02

        This is definitely something the Japanese would do, but it is not a real train unless a thousand salarymen are crammed into it.

      • By matwood 2024-12-109:151 reply

        The bigger problem is that people think something this ridiculous could happen.

        • By marci 2024-12-1013:17

          Weirder things have been created. I could definitely see one being made for a movie.

      • By espadrine 2024-12-1014:19

        > I'm quite nervous for the future.

        Videos like these were already achievable through VFX.

        The only difference here is a reduction in costs. That does mean that more people will produce misinformation, but the problem is one that we have had time to tackle, and which gave rise to Snopes and many others.

      • By ImaCake 2024-12-0923:451 reply

        I mean the only real tell for me is how expensive this stunt would be. I personally think this is a really cool use of genAI. But the consequences will be far reaching.

        • By lelandfe 2024-12-103:07

          Some of the comments were like, "come on guys, if this was real it would be way dirtier"

      • By starshadowx2 2024-12-0919:556 reply

        The face of the girl on the left at the start in the first second should have been a giveaway.

        • By Perseids 2024-12-0920:371 reply

          My intuition went for video compression artifact instead of AI modeling problem. There is even a moment directly before the cut that can be interpreted as the next key frame clearing up the face. To be honest, the whole video could have fooled me. There is definitely an aspect in discerning these videos that can be trained just by watching more of them with a critical eye, so try to be kind to those that did not concern themselves with generative AI as much as you have.

          • By yccs27 2024-12-108:50

            Yeah, it's unfortunate that video compression already introduces artifacts into real videos, so minor genAI artifacts don't stand out.

            It also took me a while to find any truly unambiguous signs of AI generation. For example, the reflection on the inside of the windows is wonky, but in real life warped glass can also produce weird reflections. I finally found a dark rectangle inside the door window, which at first stays fixed like a sign on the glass. However it then begins to move like part of the reflection, which really broke the illusion for me.

        • By booleandilemma 2024-12-0921:14

          No one is looking at her face though, they're looking at the giant hello kitty train. And you were only looking at her face because you were told it's an AI-generated video. I agree with superfrank that extreme skepticism of everything seen online is going to have to be the default, unfortunately.

        • By vlovich123 2024-12-0920:45

          Hard to not discount that as a compression artifact.

        • By magicalhippo 2024-12-0923:051 reply

          Just like all the obvious signs[1] the moon landings were faked.

          [1]: https://web.archive.org/web/20120829004513/http://stuffucanu...

          • By lelandfe 2024-12-1017:09

            Just wanted to say I really enjoyed this!

        • By Nition 2024-12-102:061 reply

          One thing that's not intuitive to spot but actually completely wrong, is that in the second clip we're apparently inside the train but the train is still rolling under us.

          • By lmm 2024-12-103:14

            Or, y'know, the camera's moving smoothly backwards through the train? Would be a bit of an odd choice (and high-effort to make it that smooth versus someone just carrying it) but not impossible by any means.

        • By tim333 2024-12-0922:522 reply

          Also "HELLO KITTY" being backwards is odd - writting on trains doesn't normally come out like that eg https://www.groupe-sncf.com/medias-publics/styles/crop_1_1/p...

          • By slightwinder 2024-12-100:36

            All the text is mirrored. It's not unusual to do this to avoid copyright filters. If anything, this adds to the suspicion rather than distracting from it.

          • By colordrops 2024-12-0923:00

            The whole video was probably mirrored before being posted. Doesn't seem to be related to being AI generated.

    • By solfox 2024-12-1013:252 reply

      On the other hand, because tools like this are being made available before their output is perfected, you and many others are being trained in AI discernment; being able to detect fake things will be a helpful skill to have for some time: another form of critical thinking.

      It would be FAR worse if a privately held advanced AI's outputs were unleashed without the population being at least somewhat cautious of everything. The real danger imho comes from private silos of advanced general intelligence that aren't shared and used to gain power, control, and money.

      • By underdeserver 2024-12-1013:372 reply

        I think these things will get bigger and better much faster than we can learn to discern them.

        • By solfox 2024-12-1013:582 reply

          With zero doubt. Faster than we expect. And yet, it's nice that we are learning to distrust what we see before the "real real" stuff comes out.

          • By echelon 2024-12-1015:24

            Open source has already caught up with SOTA:

            https://www.reddit.com/r/StableDiffusion/comments/1hav4z3/op...

            These are even unfair comparisons because they're leveraging text-to-video instead of the more powerful image-to-video. In the latter case, the results are indistinguishable.

            Video generation is about to be everywhere, and we're about to have the "Stable Diffusion" moment for video.

            Look at the comments: people are already fawning over open source being uncensored.

            Cat's out of the bag.

          • By Jerrrry 2024-12-1014:02

            Very convenient for those who are waiting for the waters to get muddier.

        • By lancesells 2024-12-1017:18

          I'm wondering that as well but I also wonder if it's a bit like CGI where it's somewhat hit a limit on realness. I'm not saying CGI doesn't get better but is a 2024 Gollum that much more realistic than 2004 Gollum? Maybe I'm wrong but I wonder if that plastic feel to AI lessens but still sticks around.

      • By thinkingtoilet 2024-12-1013:462 reply

        >you and many others are being trained in AI discernment

        HN is a hyper specialized group of people. The average person cannot do this and, as we've seen, devours misinformation with no second thoughts.

        • By thereddaikon 2024-12-1014:191 reply

          On one hand, I like to think that society is getting trained to recognize AI and distrust it. But at the same time my retired boomer parents are over for the holidays and I catch them watching YouTube videos completely oblivious to the fact that it's an AI voice just reading an LLM-generated script with B-roll for eye candy. Often it's just auto-generated captions stolen from larger creators, regurgitated by an AI voice. I'll point it out and they don't believe me that the voice is fake.

          • By wongarsu 2024-12-1014:481 reply

            AI voices have gotten scarily good. They are easy to recognize because most creators use the same voices with the same intonations and don't care to cut out the mistakes. But if you don't recognize the voice it takes a couple sentences to discern that it's AI even with an ear trained on the difference.

            But it is funny to see how much stuff gets uploaded with zero quality control and still gets traction. These models really don't deal well with "innocent" letter substitutions, Iike using I instead of l.

            • By thereddaikon 2024-12-1020:231 reply

              I've heard enough slop using the ElevenLabs voices that I can recognize them almost immediately now. But you're right. Higher-end models with less familiar voices are harder to notice. One consistent failing is that they are always too perfect. No mistakes or signs of cuts to edit out where a human VA would have made a mistake. It's all very smooth and perfect. As if they nailed it in the first shot. Once the cheap/free models manage to fix that then we are in real trouble. Also, some really lazy slop creators don't bother to fix issues with pronunciation. But that's not the fault of the model really.

        • By solfox 2024-12-1013:56

          And yet, OP referred to a thread where the reality of the shorts were being questioned by "average" people. Imagine a world where OpenAI were the first out the gates with this and just started producing their own videos without telling anyone about their technology or letting creators play with it. They'd make loads of money, probably could topple governments... I'm glad these tools are being made generally available versus the alternative.

    • By quenix 2024-12-0919:2410 reply

      It saddens me. Innovations in AI 'art' generation (music, audio, photo) have been a net negative to society and are already actively harming the Internet and our media sphere.

      Like I said in another comment, LLMs are cool and useful, but who in the hell asked for AI art? It's good enough to fool people and break the fragile trust relationship we had with online content, but is also extremely shit and carries no meaning or depth whatsoever.

      • By anxoo 2024-12-102:10

        >who in the hell asked for AI art?

        everyone who has ever used stock photography, custom illustrators, and image editing. as AI improves, it will come after all of those industries.

        that said, it is not OpenAI's goal to beat shutterstock, nor is it the goal of anthropic or google or meta. their goal is to make god: https://ia.samaltman.com/ . visual perception (and generation) is the near-term step on that path. every discussion of AI that doesn't acknowledge this goal, what all of these billions of dollars are aiming for, is myopic and naive.

      • By rurp 2024-12-1016:19

        There was a recent discussion in another HN thread that I think summed it up well. Good art rewards a careful viewer; the more you look at and think about good art, the more you get out of it. AI art does the opposite and punishes thoughtful consumers. There's no logical underpinning to the various details, it's just stuff mashed together in a superficially nice looking way.

      • By mojuba 2024-12-0919:522 reply

        I think AI "art" can be as useful as the text generators, i.e. only within certain limits of dull and stupid stuff that needs to exist but has little to no value.

        For example, you need to generate a landing page for your boring company: text, images, videos and the overall design (as well as code!) can be and should be generated because... who cares about your boring company's landing page, right?

        • By whatevertrevor 2024-12-0920:451 reply

          One could ask why the boring company landing page exists in the first place though. If it's not providing value to humans to warrant actual attention being paid to it...

          • By tomjen3 2024-12-104:46

            The world is in need of soap. Not the fancy beautiful artistic kind, but the kind that comes in containers and you put in bathrooms. This objectively saves lives and is one of those boring things I can imagine.

        • By carlosjobim 2024-12-0920:401 reply

          Then you don't understand the purpose of a landing page. If the boring company hires somebody to make the landing page who actually understands their job, the landing page will have great importance.

          • By yunwal 2024-12-1016:591 reply

            > the landing page will have great importance.

            Most companies don't need this. They need a page that has their contact info and some general information about services they provide so they can have a bare minimum internet presence and show up on google maps.

            • By carlosjobim 2024-12-1017:391 reply

              Absolutely, if your company doesn't want to make sales, or if you want to be bothered all the time by people calling and mailing only for them to find out your product isn't a fit for them. Or if you want third-party sellers to take over most of your business, like Booking.com, AirBnB, DoorDash or Amazon.

              Companies who understand the importance of a customer friendly and functional web presence get a great return on their investment. And it's much better for the customer.

              • By yunwal 2024-12-1322:061 reply

                I have an ice cream shop by me that doesn't even have a website. They're mobbed every day, because good ice cream is fairly self explanatory, and doesn't need a web presence

                • By lobsterthief 2024-12-1411:43

                  You’re conflating “website” with “landing page”.

                  Your ice cream shop doesn’t need a landing page because of word of mouth and foot traffic.

                  Some project management platform for plumbers needs a highly tuned webpage because they’re competing with 20 other such systems, and there’s no line to walk past and assume it’s there because the software is good.

                  Believing that if you build great plumbing SaaS software, paying customers will magically appear, is naive.

                  A great product can sell itself. But that doesn’t mean that marketing and sales aren’t necessary in order to get the product in front of people, assuage their concerns, reassure them that it solves their problems, show social proof from others using it, and close the deal. A good landing page will do all of this ;)

      • By dale_glass 2024-12-1011:38

        > Like I said in another comment, LLMs are cool and useful, but who in the hell asked for AI art?

        I did. I started messing around with computer graphics on DOS with QBASIC and consider AI art to be just an extension of that.

        On the other hand I don't care all that much for LLMs most of the time. They're sometimes useful, but while I find AI art I enjoy very regularly, using a LLM for something is more a once every couple weeks event for me.

      • By computerex 2024-12-0920:531 reply

        How do you know they are a net negative? What's your source?

        • By quenix 2024-12-0920:552 reply

          My opinion ;-)

          That's what HN is for

          • By CamperBob2 2024-12-101:50

            It's quite well-supported on here, that's for sure.

            Somewhere there's a site for "hackers" where it isn't, and I hope I stumble across that site at some point.

          • By Cthulhu_ 2024-12-1012:37

            Do add "in my opinion" or prefix with "I think", because your definite wording implied you were stating a verifiable fact. Telling opinions like they are facts and then backtracking with "oh but it was just my opinion" is a big problem in (online?) society / discourse, and has led to a lot of misinformation and anti-scientific takes spreading.

            "The earth is flat" - "Can you prove it?" - "Oh it's just my opinion". It's dishonest.

      • By randomlurking 2024-12-0921:30

        I agree with the first part. For me, AI art is the chance to have a somewhat creative outlet that I wouldn’t have otherwise, because I’m much worse at painting that I can stand. Drawing by prompts helps me be creative and work through some stuff - for that it’s also nice and interesting to see that the result differs from my mental image. I will tweak the prompt to some extent and to some extent go with some unintentioned elements of the drawing. I keep the drawing on my phone in the notes app with a title and the prompt.

        To get back to the beginning: I really do agree that the societal impact on the whole appears to be negative. But there are some positives and I wanted to share my example of that.

      • By tomjen3 2024-12-104:42

        That describes most art. At least ai art can be pretty and doesn’t have the same political message.

      • By Der_Einzige 2024-12-1013:551 reply

        Go on civil.AI, it’s primarily used for hardcore waifu porn.

        • By jeffhuys 2024-12-1014:57

          You mean civitai.com? There's a lot more on there than just that...

      • By cokeandpepsi 2024-12-1017:49

        [dead]

      • By lmm 2024-12-103:24

        Much of the time I don't want "meaning or depth", I just want a pretty picture of whatever it was. AI art is great, it's just that the people it most benefits are the people you don't see or hear much from (and, rude as this is to say, people who write less convincingly).

    • By computerex 2024-12-0920:513 reply

      They should have kept this amazing tech under wraps because you have a bad feeling about it? Hate to break it to you, but there have been fake videos on the internet for as long as it has existed. There are more ways to fake videos than GenAI. If you haven't been consuming everything on the internet with a high-alert BS sensor, then that's an issue of its own. You shouldn't trust things on the internet anyway unless there is overwhelming evidence.

      • By callc 2024-12-103:011 reply

        Amazing tech != socially good

        Of course, as knowledgeable people in tech we can look at the last few years of AI improvements as technically remarkable. pen2l is talking about social impact.

        I hope our trade can collectively become adults at the big table of Real Engineers. Consider the impact on humanity of your work. If you don’t care, then you are either recklessly irresponsible, don’t know any better, or are intentionally causing harm at scale.

        • By OskarS 2024-12-1013:261 reply

          Very well put. There's always been this Silicon Valley instinct that all technological advance is always good for humanity, and it's just not that simple.

          • By callc 2024-12-1014:41

            Thanks OskarS

            Tech is a very powerful tool that can automate the most mundane tasks, and also automate harm like mass surveillance and the erosion of ownership rights over your devices. The sheer ability to create new markets and replace inefficient non-automated ones leads to huge $$$-making opportunities, which people may mistake for goods in themselves (good for economy / GDP = good for humanity)

      • By arsenico 2024-12-1012:31

        Cannot even qualify "It has always been shit, so no problem with it becoming even shittier" as a hot take.

      • By sergiogdr 2024-12-0921:193 reply

        > If you haven't been consuming everything on the internet with a high alert bs sensor, then that's an issue of its own

        "just be privileged as I was to get all the necessary education to be able to not be fooled by this tech". Yeah, very realistic and compassionate.

        • By cma 2024-12-0922:491 reply

          With a heavy dose of "if masses of people are fooled by this, it can't affect me as long as I can see through it. No possible repercussions of mass people believing completely made up stuff that could affect laws, etc."

          • By JPKab 2024-12-0923:295 reply

            This entire thread reeks of "I'm smart enough to know that videos can be faked, but Jethro in the trailer park isn't because he's just a plumber, and therefore this tech needs to be censored or else Jethro might believe stuff that makes him vote in a way I don't like".

            While the average person overestimates their own intelligence, the average techy dramatically underestimates the intelligence of the average member of the public. The weirdos that latch onto every fake video and silly conspiracy theory are dramatically overrepresented in every online comments thread, but supposed geniuses in the tech/NGO/academic community forget this and assume a broad swath of the public believes in stuff like "Pizza gate" because nuanced thinking is a skill only the enlightened few possess.

            • By someothherguyy 2024-12-104:10

              Some people aren't very skeptical at baseline; that doesn't mean that those concerned about the ability of others to recognize AI are disparaging people based on intelligence.

              For example, some people can be very intelligent, yet not be discerning of information that resonates with prior biases. You see this in those who are devoutly religious, politically polarized, etc.

              There is reason to believe that such biases will lend to ontological misinterpretations from algorithmically generated information.

              You can see mistakes in interpretation on a day to day basis by the population at large. There are swaths of widely held beliefs that aren't based in truth. Pretty much anyone is likely to believe at least some stereotype, folklore, urban legend, or myth.

            • By sergiogdr 2024-12-102:02

              It isn’t about being smart (you assumed this is what ‘education’ was pointing at). Most people aren’t even aware of what’s happening besides extremely superficial things that they get here and there on the news. Can’t you honestly see the real potential for massive damage coming out of all this?

            • By pixelsort 2024-12-1011:24

              With respect to the American public, the majority can and do utilize nuanced thinking as a survival skill. The problem of the modern American era is not that our public is low in average intelligence. Rather, it is that, on average, we have been miseducated to seek the eradication of discomfort, uncertainty, inconveniences, and unknowns.

            • By cma 2024-12-0923:47

              That radio station in Hotel Rwanda could be a bad thing for you and people you cared about even if you personally could discern the lies and it wasn't fooling you.

            • By krisboyz781 2024-12-100:561 reply

              Actually you overestimate the general public's ability to discern what's real or not. On top of that, most people don't even care if it's real. This is exactly why Trump won.

              Example: if a gen ai vid of a politician doing some crazy crime came out. Even if it were proven fake, people would start questioning everything and still act as if the politician were guilty

              • By JPKab 2024-12-1021:52

                "This is exactly why Trump won"

                See the part of my comment you are replying to where I specifically stated that the motivation for all of this is that "Jethro doesn't vote the way I want him to". You've proven my point.

                The censorious attitudes on HN were non-existent before Trump won in 2016. I know this for a fact. I've had my account on here since 2012, after 2 years of being just a reader.

                Meanwhile, you overestimate how immune to misinformation and lies the average HN techy is. Just a few years ago, the majority of people on here believed, with utter conviction, that the bat-borne coronavirus lab in Wuhan had absolutely no connection with the bat-borne coronavirus epidemic that started in Wuhan and that only bigots and ignoramuses could draw such a conclusion. I experienced this whenever I brought up the blatantly obvious, common sense connection in these same comment threads in late 2020 or into mid 2021. The absolutely absurd denial of common sense by otherwise intelligent people was reminiscent of trying to talk to a religious fundamentalist about evolution while pointing at dinosaur fossils and having them continue to deny what was staring them in the face.

        • By sekai 2024-12-1011:301 reply

          > "just be privileged as I was to get all the necessary education to be able to not be fooled by this tech". Yeah, very realistic and compassionate.

          This has nothing to do with privilege, a person in Indian slums on his 2005 PC with internet access can have better internet BS radar than an Ivy League student.

          • By sergiogdr 2024-12-1014:24

            I think that would be an exception rather than the rule, to be honest.

            I think though, that if you are in the position of doing serious critical reflection about this stuff, which is in my opinion necessary for being in a position of discernment wrt this stuff, then you are privileged. This is the idea I wanted to convey.

        • By JPKab 2024-12-0923:203 reply

          What education do you specifically think is necessary for people with average IQs all over the world to not be fooled by this, given that they are aware that videos can easily be faked in 2024? A high school degree? A bachelors?

          • By gambiting 2024-12-107:532 reply

            >>given that they are aware that videos can easily be faked in 2024?

            That's a ridiculous assumption. In my experience no one outside of tech circles is even remotely aware that this kind of thing is possible already.

            • By IAmGraydon 2024-12-122:54

              With all due respect, I think you may be out of touch.

            • By JPKab 2024-12-1021:231 reply

              You think that the average member of the public isn't aware that videos can be faked with AI, or non-AI special effects, and your source of data for this is "your own experience"? Really?

              My family is mostly working class in an economically depressed part of the Virginia/West Virginia coal country, and every single one of them is aware of this. None of them work in tech, obviously. None have college degrees.

              I maintain that the attitude driving this paternalistic, censorious attitude is arrogance and condescension.

              A prime example of how broadly aware the public all over the world is of AI faked videos was the reaction in the Arab world to the October 7th videos posted by Hamas. A shocking (and depressing) percentage of Arabs, as well as non-Arab Muslims all over the world, believed the videos and pictures were fakes produced with AI. I don't remember the exact number, but the polling I saw in November showed it was over 50% who believed they were fakes in countries as disparate as Egypt and Indonesia.

              • By gambiting 2024-12-1021:39

                >>isn't aware that videos can be faked with AI, or non-AI special effects

                These two are very different things. My family believes all kinds of videos on the internet are fake. None of them have any idea what a tool like Sora can do. The gap between "oh this was probably special effects" to "you have to notice pixels shimmering around someone's hand to tell" is enormous.

                >>My family is mostly working class in an economically depressed part of the Virginia/West Virginia coal country, and every single one of them is aware of this.

                Your working class family has time to keep up with the advancements in generative AI for video? They have more free time than I do then. If we're sharing anecdotes about families then my family is from Polish coal country and their idea of AI is talking to your car and it responding poorly.

                >>I maintain that the attitude driving this paternalistic, censorious attitude is arrogance and condescension.

                I'm confused - who is displaying this "censorious" attitude here?

                >> and your source of data for this is "your own experience"? Really?

                Yes, really. I mean do you have anything else? You are also quoting things from your own experience.

          • By sergiogdr 2024-12-101:56

            I’m not (exclusively) talking about formal education. There are lots of people (I would dare say the majority of the planet) that don’t have the ‘digital literacy’ required to handle what’s happening right now. Being from a developed country I am very much worried about this.

          • By 8n4vidtmkvmk 2024-12-106:22

            Fooled by what? Some of it looks real but is implausible enough that it should set off your BS sensor. Other stuff is/will be more subtle and we will have no way of knowing.

    • By mrcwinn 2024-12-0920:411 reply

      Too charitable indeed. Google was simply unprepared and has inferior alternatives.

      My prediction is that next year they will catch up a bit and will not be shy about releasing new technology. They will remain behind in LLMs but at least will more deeply envelope their own existing products, thus creating a narrative of improved innovation and profit potential. They will publicly acknowledge perceived risks and say they have teams ensuring it will be okay.

      • By tziki 2024-12-1014:23

        >They will remain behind in LLMs

        The latest Gemini version (1206) is at least tied for the best LLM, if not the best outright.

    • By pier25 2024-12-0921:346 reply

      I wish Google would allow me to remove the AI stuff from search results.

      99% of the time it's either useless or wrong.

      • By titzer 2024-12-0922:052 reply

        Strong plus one here. Not only that, but it uses gobs of energy in total. Google has reneged on all of its carbon promises to stay in the running for AI domination and to head off disruption to search ads business. Since I've unconsciously trained my brain to not look at the top search results anymore because they long ago turned into impossible-to-distinguish ads, I've quickly learned to just ignore the stupid AI summary. So it's an absurd waste of computational power to generate something wrong that I don't even want to see, and I can't even tell them to stop when they're wasting their own money to do so.

        • By Tempat 2024-12-107:59

          It’s often wrong anyway. Much like you, the thing that annoys me most about it though is all the power they must be using having it run on every single search by anyone.

      • By Lcchy 2024-12-109:56

        I have been using Kagi for a year now and it's been liberating. It's an ad/SEO-free search engine.

        https://kagi.com/

        Sorry for the name dropping, I have no affiliation and am just a very happy user, so I wanted to share it as it felt adequate.

      • By fraXis 2024-12-0921:392 reply

        Add a -ai to the end of your Google search query. There are also browser extensions that stop the AI content from displaying. I use the one for Chrome called "Remove Google Search Generative AI".

        • By dangerwill 2024-12-100:39

          Great tip! But it only removes Google's terrible AI summary, not the AI-generated content showing up in searches, which is what the OP wishes for. A combination of -ai and before:2022-01-01 is probably the closest we can get to that
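          As a minimal sketch of combining the two operators mentioned above into a search URL (the behavior of "-ai" and "before:" on Google's side is an assumption based on this thread, not a guarantee):

```python
# Build a Google search URL that appends the "-ai" and "before:" operators
# discussed above. Whether Google honors them is assumed, not guaranteed.
from urllib.parse import quote

def pre_ai_search_url(topic: str) -> str:
    # "-ai" suppresses the AI summary; "before:" restricts results to
    # pages indexed before ChatGPT-era content flooded the web.
    query = f"{topic} -ai before:2022-01-01"
    return "https://www.google.com/search?q=" + quote(query)

print(pre_ai_search_url("sourdough starter"))
# -> https://www.google.com/search?q=sourdough%20starter%20-ai%20before%3A2022-01-01
```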

        • By 1kaizen 2024-12-1012:24

          This is vaporware/false advertising.

          We don't have tech to correctly "detect ai" in 2024, which is why education has broken down over the last few years with serial cheating in every institution.

          Every company so far that has claimed to detect AI-generated slop has failed.

      • By Slyfox33 2024-12-0923:05

      • By KeplerBoy 2024-12-1014:462 reply

        Nobody has any clue what's AI-made these days. Apart from the obvious cases, no one can tell generative AI apart from 3D-rendered stuff or low-res photos. Put image compression on top and it's definitely impossible.

        • By pier25 2024-12-1015:141 reply

          I meant the bit of AI that Google adds on top the actual search results.

          • By KeplerBoy 2024-12-1015:20

            Ah sure, that stuff is just annoying. I don't need a - probably wrong - summary of the top hit either.

        • By imiric 2024-12-1015:16

          I wonder what the outcome will be when new models are trained on AI-generated data. These companies are already running out of quality training data. So when most of the data on the internet is synthetic, will they find ways of separating the signal from the noise, or will all the noise lead to a convergence of performance across all models to something that is much inferior than what we have today?

          This tech will make the internet even more unbearable to use, without mentioning its huge potential for abuse. This is far worse than whatever positives it might have, which are still unclear. What a shitshow.

      • By cobalt60 2024-12-1012:18

        Udm?

    • By tlrobinson 2024-12-1015:55

      This is all inevitable. At worst it's pulling the issues forward by a few months or years, and I don't think anyone will meaningfully address the problem until it's staring us in the face.

      I believe the internet needs a distributed trust and reputation layer. I haven't fully thought through all the details, but:

      - Some way to subscribe to fact checking providers of your choice.

      - Some way to tie individuals' reputation to the things they post.

      - Overlay those trust and reputation layers.

      I want to see a score for every webpage, and be able to drill into what factored into that score, and any additional context people have provided (e.g. Community Notes).

      There's a huge bootstrapping and incentive problem though. I think all the big players would need to work together to build this. Social media, legacy media companies, browsers, etc.

      This also presupposes people actually care about the truth, which unfortunately doesn't always seem like the case.
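      To make the subscribe-and-overlay idea above concrete, here is a hypothetical sketch: each fact-checking provider a user subscribes to scores URLs, and the page score is a weighted average with a drill-down breakdown. Every name and the scoring scheme are invented for illustration; no such system exists yet.

```python
# Hypothetical sketch of overlaying trust providers into one page score.
# All provider names, weights, and scores below are made up.
from dataclasses import dataclass, field

@dataclass
class Provider:
    name: str
    weight: float                               # how much the user trusts this checker
    scores: dict = field(default_factory=dict)  # url -> score in 0.0..1.0

def page_score(url, providers):
    """Weighted average over providers with an opinion, plus a breakdown."""
    total, weight_sum, breakdown = 0.0, 0.0, []
    for p in providers:
        s = p.scores.get(url)
        if s is not None:
            total += p.weight * s
            weight_sum += p.weight
            breakdown.append(f"{p.name}: {s:.2f} (weight {p.weight})")
    return (total / weight_sum if weight_sum else None), breakdown

checkers = [
    Provider("community-notes", 1.0, {"example.com/story": 0.4}),
    Provider("legacy-media-desk", 2.0, {"example.com/story": 0.7}),
]
score, why = page_score("example.com/story", checkers)
print(round(score, 2), why)  # weighted average of 0.4 and 0.7
```

The hard parts the comment identifies - bootstrapping, incentives, and getting competing providers to interoperate - are exactly what this toy version leaves out.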

    • By bko 2024-12-0919:54

      I don't think Google delayed or kept this under wraps for any noble reasons. I think they were just disorganized as evidenced by their recent scrambling to compete in this space.

    • By makestuff 2024-12-0921:483 reply

      I don't even know if this will be possible, or how it would work, but it seems like the next iteration of social media will be based on some verification that the user is neither a bot nor using AI. Currently platforms are all incentivized not to stop bot activity because it increases user counts, ad revenue, etc.

      Maybe the model is you have to pay per account to use it, or maybe the model will be something else.

      I doubt this will make everyone just go back to primarily communicating in person/via voice servers but that is a possibility.

      • By joaohaas 2024-12-0922:54

        Twitter Blue is paid and yet every single bot account has it in order to boost views.

      • By debugnik 2024-12-1010:10

        > Maybe the model is you have to pay per account to use it

        Spammers can afford more money per bot for their operations than the average user can justify to spend on social media.

      • By mnau 2024-12-0922:111 reply

        So Musk was right?

        • By CJefferson 2024-12-1011:181 reply

          No, because Musk encourages AI slop if people are willing to pay.

          What we probably need (this is going to sound crazy, but I don’t have a better suggestion), is some kind of networked trust system.

          • By jeffhuys 2024-12-1015:00

            Like Community Notes? It's actually a darn good system.

    • By lanthissa 2024-12-0919:501 reply

      exactly one lab has passed the test of morals vs profit at this point, and that's DeepMind, and they were thoroughly punished for it.

      Every value OpenAI has claimed to have hasn't lasted a millisecond longer than there was a profit motive to break it, and even Anthropic is doing military tech now.

      • By dmix 2024-12-0921:02

        LLMs aren’t AGI

    • By kylehotchkiss 2024-12-0921:53

      > the image of those people on slot machines, mechanically and soullessly pulling levers because they are addicted. It's just so strange.

      Worse, the audience is our parents and grandparents. They have little context to be able to sort out reality from this stuff

    • By soulofmischief 2024-12-1012:501 reply

      Shorts are designed to trade your valuable attention for trite, low-effort content. Most decent shorts are just clips of longer-form content.

      Do yourself a favor and avoid that kind of content, opting instead for long-form consumption. The discovery patterns are different, but you're less inclined to encounter fake content if you develop a trust network of good channels.

      • By jprete 2024-12-1016:261 reply

        This is also my strategy. AI content makes me focus even harder on the source of the content instead of the apparent quality, because the current set of GenAI techniques are best at imitating surface-level quality features.

    • By freehorse 2024-12-108:41

      The way AI is going, it will actually raise the cost of valid services: the cost of bullshit and spam is going down, which forces valid, non-AI-powered services to spend more to rise above the noise or to filter it out. There is only negative value in what "Open"AI is adding to the world right now. By playing the long-term AI-safety card, the hypothetical scenario of some AI supposedly becoming conscious in the future, they try to pass themselves off as clean and innocent of all the damage they cause to society.

      I just hope the online social media space gets enshittified to such a degree that it stops playing a major role in society, though sadly that is not how things usually seem to work.

    • By DrScientist 2024-12-1011:36

      On the other hand, by making public what a technology's capabilities are, doesn't it stop the problem of people having this tech in secret and using it before anybody is aware it's even possible?

      i.e. a company developing this tech, keeping it under wraps, and, say, only using it for special government programmes....

    • By dyauspitr 2024-12-102:04

      Pandora’s box is open, not releasing models and tools is just going to result in someone else doing it.

    • By whywhywhywhy 2024-12-1010:03

      They didn't keep it under wraps; it's just that the team considered the paper, not the product, to be the thing they were shipping. They still shipped the papers that decentralized the knowledge.

      You could even argue shipping the product and not the paper would have done more for AI safety; at least it would be controlled.

    • By ActionHank 2024-12-1015:02

      The best part is that eventually, over time, the AI slop will feed into training data more and more. I suspect it will be like the Kessler Syndrome of AI models.

    • By fullstackchris 2024-12-107:51

      The ability to make strange videos as a consumer... it's not inherently good or bad, it'll just be... weird

    • By MrBuddyCasino 2024-12-108:37

      It doesn't take AI to fool people. They have been propagandized and lied to on a massive scale since the advent of mass media.

      They also lie to themselves: they cannot detect overt bias or reflect on themselves and be aware of their hidden motives, resentments and wishful thinking. Including me and you.

      Most people hold important beliefs about the world that are comically inaccurate.

      AI changes absolutely nothing about how many true or false beliefs the average Joe holds.

    • By littlestymaar 2024-12-1014:43

      > And I can't help but be saddened about OpenAI's decisions to unload a lot of this before recognizing the results of unleashing this to humanity

      Yeah, and it's especially hypocritical coming from them, who refused to disclose anything about GPT-3 because they said it was dangerous. And then a few years later: "Hey, remember this thing we told you was too dangerous before? Now we have a monetization strategy, so we're giving access to everyone, today."

    • By stronglikedan 2024-12-1014:23

      > there were enough AI'esque artifacts that one could confidently conclude it's fake.

      And yet, you would not have known how to recognize those artifacts without "OpenAI's decisions to unload a lot of this before recognizing the results of unleashing this to humanity".

    • By serial_dev 2024-12-1014:42

      You could have said the same thing about Photoshop... Some people will learn to spot BS and think critically even if they can't quite put their finger on it and the video is very good (What, Trump fought a T-Rex, AND WON?), some people could be fooled by anything, and there is a lot in between.

    • By amaurose 2024-12-108:08

      [dead]

    • By halyconWays 2024-12-0919:501 reply

      [flagged]

      • By thr3000 2024-12-0920:091 reply

        So is yours! Mine isn't, however. I am a hard-nosed real boy now.

        • By tenpies 2024-12-101:104 reply

          Write something that an LLM could never write.

          (This is my latest favorite prompt and interview/conversation question)

          • By Der_Einzige 2024-12-1013:59

            If you're not actively publishing at top conferences (i.e. NeurIPS), then this is a trash question and shows the lack of knowledge that many who are now entering the field will have.

            Anything that you or others can answer to this which isn't some stupid "gotcha" puzzle shit (lol it's video cus LLMs aren't video models amiright?) will be wrong because of things like structured decoding and the fact that ultra-high temperature works with better samplers like min_p.

            https://openreview.net/forum?id=FBkpCyujtS&noteId=mY7FMnuuC9

          • By woctordho 2024-12-1010:33

            3e4a3ad9f05fdfb609dda6e5f512e52506f4c1053962e21bfd93f1ed81582d16ca0fef9574fb07ab62f8f5b1373b4ddd541804c0d176f4a557d900b05047e853

            (This is the hash of a string that randomly popped into my mind. An LLM will write this with almost 0 probability --- until it is crawled into the training sets)
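            A sketch of the commit-reveal trick being used above: the 128 hex characters look like a SHA-512 digest, so publishing only the digest now and revealing the preimage later would prove authorship. The secret string below is a placeholder, not the commenter's actual string.

```python
# Commit now by publishing only the SHA-512 digest; revealing the preimage
# later proves you knew it. The secret here is a made-up stand-in.
import hashlib

secret = "a string that randomly popped into my mind"
digest = hashlib.sha512(secret.encode()).hexdigest()
print(len(digest))  # 128 hex characters, same length as the digest posted above
```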

          • By Kiro 2024-12-107:27

            You go first.

          • By definitelynotai 2024-12-107:56

            [dead]

    • By raincole 2024-12-0920:49

      Considering Google image search is polluted by AI-generated images at this moment, perhaps Google is afraid of making the search even worse?

HackerNews