Veo is our most capable video generation model to date. It generates high-quality, 1080p resolution videos that can go beyond a minute, in a wide range of cinematic and visual styles.
It accurately captures the nuance and tone of a prompt, and provides an unprecedented level of creative control — understanding prompts for all kinds of cinematic effects, like time lapses or aerial shots of a landscape.
Our video generation model will help create tools that make video production accessible to everyone. Whether you're a seasoned filmmaker, aspiring creator, or educator looking to share knowledge, Veo unlocks new possibilities for storytelling, education and more.
Over the coming weeks some of these features will be available to select creators through VideoFX, a new experimental tool at labs.google. You can join the waitlist now.
In the future, we’ll also bring some of Veo’s capabilities to YouTube Shorts and other products.
Prompt: A fast-tracking shot down a suburban residential street lined with trees. Daytime with a clear blue sky. Saturated colors, high contrast
Prompt: Extreme close-up of chicken and green pepper kebabs grilling on a barbeque with flames. Shallow focus and light smoke. vivid colours
Prompt: Timelapse of the northern lights dancing across the Arctic sky, stars twinkling, snow-covered landscape
Prompt: An aerial shot of a lighthouse standing tall on a rocky cliff, its beacon cutting through the early dawn, waves crash against the rocks below
To produce a coherent scene, generative video models need to accurately interpret a text prompt and combine this information with relevant visual references.
With advanced understanding of natural language and visual semantics, Veo generates video that closely follows the prompt. It accurately captures the nuance and tone in a phrase, rendering intricate details within complex scenes.
Prompt: Timelapse of a common sunflower opening, dark background
Prompt: extreme close-up with a shallow depth of field of a puddle in a street. reflecting a busy futuristic Tokyo city with bright neon signs, night, lens flare
When given both an input video and editing command, like adding kayaks to an aerial shot of a coastline, Veo can apply this command to the initial video and create a new, edited video.
Prompt: Drone shot along the Hawaii jungle coastline, sunny day
Prompt: Drone shot along the Hawaii jungle coastline, sunny day. Kayaks in the water
In addition, it supports masked editing, enabling changes to specific areas of the video when you add a mask area to your video and text prompt.
Veo can also generate a video with an image as input along with the text prompt. By providing a reference image in combination with a text prompt, it conditions Veo to generate a video that follows the image’s style and user prompt’s instructions.
Prompt: Alpacas wearing knit wool sweaters, graffiti background, sunglasses
Prompt: Alpacas dancing to the beat
The model is also able to make video clips and extend them to 60 seconds and beyond. It can do this either from a single prompt, or by being given a sequence of prompts which together tell a story.
Maintaining visual consistency can be a challenge for video generation models. Characters, objects, or even entire scenes can flicker, jump, or morph unexpectedly between frames, disrupting the viewing experience.
Veo's cutting-edge latent diffusion transformers reduce the appearance of these inconsistencies, keeping characters, objects and styles in place, as they would in real life.
Prompt: moody shot of a central European alley film noir cinematic black and white high contrast high detail
Prompt: Crochet elephant in intricate patterns walking on the savanna
Veo builds upon years of generative video model work including Generative Query Network (GQN), DVD-GAN, Imagen-Video, Phenaki, WALT, VideoPoet and Lumiere, and also our Transformer architecture and Gemini.
To help Veo understand and follow prompts more accurately, we have also added more details to the captions of each video in its training data. And to further improve performance, the model uses high-quality, compressed representations of video (also known as latents) so it’s more efficient too. These steps improve overall quality and reduce the time it takes to generate videos.
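Veo's exact architecture hasn't been published in detail, but the idea of a "compressed latent representation of video" can be sketched generically. The toy PyTorch autoencoder below is purely illustrative: every layer choice, channel count and shape is an assumption rather than Veo's design. It only shows why working in a latent space is cheaper — a 16-frame 128×128 clip shrinks to a latent volume with roughly 100× fewer values before any generation happens.

```python
# Minimal sketch of a video autoencoder, illustrating "compressed latent
# representations of video". All shapes, channel counts and layer choices
# are illustrative assumptions, not Veo's actual design.
import torch
import torch.nn as nn

class VideoAutoencoder(nn.Module):
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        # Encoder: downsample space and time into a small latent volume.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(128, latent_channels, kernel_size=3, stride=(2, 2, 2), padding=1),
        )
        # Decoder: mirror the encoder to reconstruct pixels from latents.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 128, kernel_size=4, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(64, 3, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        latents = self.encoder(video)   # (B, 8, 4, 16, 16): ~96x fewer values than the input
        return self.decoder(latents)    # reconstruction, used to train the autoencoder

# A 16-frame 128x128 RGB clip: (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 16, 128, 128)
model = VideoAutoencoder()
recon = model(clip)                     # same shape as the input clip
```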
It's critical to bring technologies like Veo to the world responsibly. Videos created by Veo are watermarked using SynthID, our cutting-edge tool for watermarking and identifying AI-generated content, and passed through safety filters and memorization checking processes that help mitigate privacy, copyright and bias risks.
Veo’s future will be informed by our work with leading creators and filmmakers. Their feedback helps us improve our generative video technologies and makes sure they benefit the wider creative community and beyond.
Note: All videos on this page were generated by Veo and have not been modified.
This work was made possible by the exceptional contributions of: Abhishek Sharma, Adams Yu, Ali Razavi, Andeep Toor, Andrew Pierson, Ankush Gupta, Austin Waters, Daniel Tanis, Dumitru Erhan, Eric Lau, Eleni Shaw, Gabe Barth-Maron, Greg Shaw, Han Zhang, Henna Nandwani, Hernan Moraldo, Hyunjik Kim, Irina Blok, Jakob Bauer, Jeff Donahue, Junyoung Chung, Kory Mathewson, Kurtis David, Lasse Espeholt, Marc van Zee, Matt McGill, Medhini Narasimhan, Miaosen Wang, Mikołaj Bińkowski, Mohammad Babaeizadeh, Mohammad Taghi Saffar, Nick Pezzotti, Pieter-Jan Kindermans, Poorva Rane, Rachel Hornung, Robert Riachi, Ruben Villegas, Rui Qian, Sander Dieleman, Serena Zhang, Serkan Cabi, Shixin Luo, Shlomi Fruchter, Signe Nørly, Srivatsan Srinivasan, Tobias Pfaff, Tom Hume, Vikas Verma, Weizhe Hua, William Zhu, Xinchen Yan, Xinyu Wang, Yelin Kim, Yuqing Du and Yutian Chen.
We extend our gratitude to Aida Nematzadeh, Alex Cullum, April Lehman, Aäron van den Oord, Charlie Chen, Charline Le Lan, Cristian Țăpuș, David Bridson, Emanuel Taropa, Gavin Buttimore, Geng Yan, Greg Shaw, Harsha Vashisht, Hartwig Adam, Huisheng Wang, Jacob Austin, Jim Lin, Jonas Adler, Joost van Amersfoort, Jordi Pont-Tuset, Josh V. Dillon, Kristian Kjems, Lois Zhou, Luis C. Cobo, Maigo Le, Malcolm Reynolds, Marcus Wainwright, Mary Cassin, Matt Smart, Matt Young, Mingda Zhang, Minh Giang, Moritz Dickfeld, Nancy Xiao, Nelly Papalampidi, Nir Shabat, Ollie Purkiss, Oskar Bunyan, Patrice Oehen, Pete Aykroyd, Petko Georgiev, Phil Chen, Rakesh Shivanna, Ramya Ganeshan, Richard Nguyen, RJ Mical, Rohan Anil, Sam Haves, Shanshan Zheng, Sholto Douglas, Siddhartha Brahma, Tatiana López, Tobias Pfaff, Victor Gomes, Vighnesh Birodkar, Xin Chen, Yi-Ling Wang, Yilin Ma, Yori Zwols, Yu Qiao, Yuchen Liang, Yusuf Aytar and Zu Kim for their invaluable partnership in developing and refining key components of this project.
Special thanks to Douglas Eck, Nando de Freitas, Oriol Vinyals, Eli Collins, Koray Kavukcuoglu and Demis Hassabis for their insightful guidance and support throughout the research process.
We also acknowledge the many other individuals who contributed across Google DeepMind and our partners at Google.
The first thing I will do when I get access to this is ask it to generate a realistic chess board. I have never gotten a decent-looking chessboard out of any image generator: one without deformed pieces, with the correct number of squares, squares properly in a checkerboard pattern, pieces placed in the correct positions, the board oriented properly (white on the right!), and not an otherwise illegal position. It seems to be an "AI complete" problem.
Similarly the Veo example of the northern lights is a really interesting one. That's not what the northern lights look like to the naked eye - they're actually pretty grey. The really bright greens and even the reds really only come out when you take a photo of them with a camera. Of course the model couldn't know that because, well, it only gets trained on photos. Gets really existential - simulacra energy - maybe another good AI Turing test, for now.
Human eyes are basically black and white in low light since rod cells can't detect color. But when the northern lights are bright enough you can definitely see the colors.
The fact that some things are too dark to be seen by humans but can be captured accurately with cameras doesn't mean that the camera, or the AI, is "making things up" or whatever.
Finally, nobody wants to see a video or a photo of a dark, gray, and barely visible aurora.
> nobody wants to see a video or a photo of a dark, gray, and barely visible aurora
Except those who want to see an accurate representation of what it looks like to the naked eye.
Living in northern Sweden I see the northern lights multiple times a year. I have never seen them pale or otherwise not colorful. Greens and reds, always. That is to my naked eye. Photographs do look more saturated, but the difference isn't as large as this comment thread makes it out to be.
Even in Northern Scotland (further south than northern Sweden) this is the case. The latest aurora showing was vividly colourful to the naked eye.
That mirrors my experience from when I used to live in northern Canada
Even in Upper Michigan near Lake Superior we sometimes had stunning, colorful Northern Lights. Sometimes it seemed like they were flying overhead, within your grasp.
Most definitely, it's quite common to find people hanging around outside up towards Calumet whenever there's a night with a high KP Index.
I highly recommend checking them out if you're nearby; the recent auroras have been quite astonishing.
I'm in Australia where the southern lights are known to be not as intense as northern lights. That's where my remark comes from. Those who have never seen the aurora with their own eyes may like to see an accurate photo. A rare find among the collective celebration of saturation.
That is the same latitude as Paris though, not very north at all.
Exactly. I went through major gaslighting trying to see the Aurora. I just wasn't sure whether I was actually seeing it, because it always looked so different from the photos. It is absolutely maddening trying to find a realistic photo of what it looks like to the naked eye, so that you can know whether what you are seeing is actually the Aurora and not just clouds.
That's not true at all. I have seen northern lights with my own eyes that were more neon green and bright purple than any mainstream photo.
"With my own eyes"
But what sort of eyes are those?
Priming the opsins in your retina is a continuous process, and primed opsins are depleted rapidly by light. Fully adapting your eye to darkness takes a great deal of darkness and a great deal of time - on the order of an hour should set you up.
Most human beings in arctic regions live in places and engage in lifestyles where it's impossible to even come close to attaining the full light sensitivity of the human retina in perfect darkness. The sky never gets dark enough in a city or even a small town to get the full experience, and if you saw your smart watch five minutes ago you still haven't fully recovered your night vision. Even a sliver of moon makes remote dark-sky-sites dramatically brighter.
Everybody is going to have different degrees of the experience because they'll have eyes with different degrees of dark adaptation. And their brains are going to shift around the ~10^3x dynamic range of the eye up or down the light intensity scale by a factor ~10^6, without making it obvious to them.
There's a middle ground here. I saw the northern lights with my own eyes just days ago and it was mostly grey. I saw some color. But when I took a photo with a phone camera, the color absolutely popped. So it may be that you've seen more color than any photo, but the average viewer in Seattle this past weekend saw grey-er with their eyes and huge color in their phone photos.
(Edit: it was still super-cool even if grey-ish, and there were absolutely beautiful colors in there if you could find your way out of the direct city lights)
The hubris of suggesting that your single experience of vaguely seeing the northern lights one time in Seattle has now led to a deep understanding of their true "color" and that the other person (perhaps all other people?) must be fooling themselves is... part of what makes HN so delightful to read.
I've also seen the northern lights with my own eyes. Way up in the arctic circle in Sweden. Their color changes along with activity. Grey looking sometimes? Sure. But also colors so vivid that it feels like they envelop your body.
> The hubris of suggesting that your single experience of vaguely seeing the northern lights one time in Seattle has now led to a deep understanding of their true "color" and that the other person (perhaps all other people?) must be fooling themselves is... part of what makes HN so delightful to read.
The H in HN stands for Hubris.
They did say "the average viewer in Seattle this past weekend", not "all other viewers".
Then again, the average viewer in Seattle this past weekend is hardly representative of what the northern lights look like.
The person they were responding to was saying that the people reporting grays were wrong, and that they had seen it and it was colorful. If anything, you should be accusing that person of hubris, not GP. GP's point was simply that it can differ in different situations. They used the example of Seattle to show that the person they were responding to is not correct that it is never gray and dull.
The human retina effectively combines a color sensor with a monochrome sensor. The monochrome channel is more light-sensitive. When the lights are dim, we'll dilate our pupils, but there's only so much we can do to increase exposure. So in dim light we see mostly in grayscale, even if that light is strongly colored in spectral terms.
Phone cameras have a Bayer filter which means they only have RGB color-sensing. The Bayer filter cuts out some incoming light and dims the received image, compared with what a monochrome camera would see. But that's how you get color photos.
To compensate for a lack of light, the phone boosts the gain and exposure time until it gets enough signal to make an image. When it eventually does get an image, it's getting a color image. This comes at the cost of some noise and motion-blur, but it's that or no image at all.
If phone cameras had a mix of RGB and monochrome sensors like the human eye does, low-light aurora photos might end up closer to matching our own perception.
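The trade-off described above can be mocked up numerically. The numpy toy below only illustrates the parent comments' point: the thresholds, the gain, and the specific "aurora" color are made-up numbers, not a model of real photoreceptors or of any particular camera.

```python
import numpy as np

# Toy illustration: a dim, green-tinted aurora patch in linear RGB.
# All thresholds and gains are invented, chosen only to show the effect.
dim_light = np.array([0.004, 0.012, 0.005])

# "Eye" model: rods respond to overall luminance (grayscale), while cones
# need the signal to clear a threshold before they contribute any color.
CONE_THRESHOLD = 0.02
luminance = dim_light @ np.array([0.2126, 0.7152, 0.0722])
cone_signal = np.clip(dim_light - CONE_THRESHOLD, 0.0, None)   # all zeros here
perceived = np.full(3, luminance) + cone_signal

# "Phone camera" model: long exposure plus gain scales the whole signal up,
# so the green/red ratio that was invisible to the cones becomes obvious.
EXPOSURE_GAIN = 60.0
photo = np.clip(dim_light * EXPOSURE_GAIN, 0.0, 1.0)

print("eye sees   :", np.round(perceived, 3))   # ~[0.01 0.01 0.01] -> gray
print("camera sees:", np.round(photo, 3))       # [0.24 0.72 0.30]  -> green
```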
I can see what you mean, and that the video is somewhat not what it would look like in real life. I have lived in northern Norway most of my life and have watched auroras a lot. It certainly looks green and pink most of the time. When fainter, it would perhaps look gray, I guess? Red, when viewed from a more southern viewpoint.
I work at Andøya Space, where perhaps most of the space research on the aurora has been done, by sending scientific rockets into space for the last 60 years.
That's not true: they look grey when they aren't bright enough, but they can look green or red to the naked eye if they are bright. I have seen it myself, and yes, I was disappointed to see only grey ones last week.
see: https://theconversation.com/what-causes-the-different-colour...
> [Aurora] only appear to us in shades of gray because the light is too faint to be sensed by our color-detecting cone cells."
> Thus, the human eye primarily views the Northern Lights in faint colors and shades of gray and white. DSLR camera sensors don't have that limitation. Couple that fact with the long exposure times and high ISO settings of modern cameras and it becomes clear that the camera sensor has a much higher dynamic range of vision in the dark than people do.
https://www.space.com/23707-only-photos-reveal-aurora-true-c...
This aligns with my experiences.
The brightest ones I saw in Northern Canada I even saw hints of reds - but no real greens - until I looked at it through my phone, and it looked just like the simulated video.
If I looked up and saw them the way they appear in the simulation, in real life, I'd run for a pair of leaded undies.
That is totally incorrect, which anyone who has seen real northern lights can attest to. I'm sorry that you haven't gotten the chance to experience it and now think all northern lights are that lackluster.
I've seen it bright green with the naked eye. It definitely happens. That article is inaccurate.
Greens are the more common colors; reds and blues occur in higher-energy solar storms.
And yes, they can be as green to the naked eye as in that AI video. I've seen aurora shows that fill the entire night sky from horizon to horizon, way more impressive than that AI video, with my own eyes.
This is such an arrogant pile of bullshit. I’ve seen very obvious colors on many different occasions in the northern part of the lower 48, up in southern Canada, and in Alaska.
Have you ever seen the Northern Lights with your eyes? If so I'm curious where you saw them.
I echo what some other posters here have said: they're certainly not gray.
To be fair, the prompt isn't asking for a realistic interpretation; it's asking for a timelapse. What it's generated is absolutely what most timelapses look like.
> Prompt: Timelapse of the northern lights dancing across the Arctic sky, stars twinkling, snow-covered landscape
That doesn't seem in any way useful, though... To use a very blunt analogy, are color blind people intelligent/sentient/whatever? Obviously, yes: differences in perceptual apparatus aren't useful indicators of intelligence.
As a colorblind person…I could see the northern lights way better than all the full-color-vision people around me squinting at their phones.
Wider bandwidth isn’t always better.
> I could see the northern lights way better than all the full-color-vision people around me
How would you know?
Quote the entire sentence, not just a portion of it.
I don't see how that's relevant, unless you're able to possess people looking at their phones to experience what they're experiencing.
To add a bit of color (ha) I was with my color-sighted spouse at a spot well known for panoramic views. 50ish people there. Many conversations happening around me.
“I can’t see anything” “Maybe that’s something over there?” “What’s everyone looking at?”
Someone shows their phone.
“Ooh!” “How do you turn on night mode?” “Wow it’s so much clearer on the phone!”
So I can’t know what their eyes see or what they really think, but I could hear what came out of their mouths.
I don’t think this is an instance that warrants deep philosophical skepticism about the nature of truth or the impossibility of knowledge.
I've only ever seen photos of the northern lights and I also didn't know that.
For decades, game engines have been working on realistic rendering. Bumping quality here and there.
The gold standard for rendering has always been cameras. It’s always photo-realistic rendering. Maybe this won’t be true for VR, but so far most of the effort is to be as good as video, not as good as the human eye.
Any sort of video generation AI is likely to have the same goal. Be as good as top notch cameras, not as eyes.
Northern lights are actually pretty colourful, even to the naked eye. I've never seen them pale or b/w
Shouldn't the model reflect how it looks on video rather than our naked eye?
What struck me about the northern lights video was that it showed the Milky Way crossing the sky behind the northern lights. That bright part of the Milky Way is visible in the southern sky, but the aurora hugging the horizon like that indicates the viewer is looking north. (Swap directions for the southern hemisphere and the aurora australis.)
Even in NY State, Hudson River Valley, I've seen them with real color. They're different each time.
That's a bad example, since the only images of the aurora borealis are brightly colored ones. What I expect of an image generator is to output what is expected of it.
Ha, wow, I’d never seen this one before. The failures are pretty great. Even repeatedly trying to correct ChatGPT/Dall-e with the proper number of squares and pieces, it somehow makes it worse.
This is what dall-e came up with after trying to correct many previous iterations: https://imgur.com/Ss4TwNC
As someone who criticizes AI a lot: this actually looks pretty cool! AI is not better at surrealism than a good artist, but at least its work is enjoyable as a surreal art. Justifies the name Dall-e pretty well too.
This strikes me as equally "AI complete" as drawing hands, which is now essentially a solved problem... No one test is sufficient, because you can add enough training data to address it.
Not sure about better models, but DALL-E3 still seems to be having problems with hands:
https://www.reddit.com/r/dalle2/comments/1afhemf/is_it_possi...
https://www.reddit.com/r/dalle2/comments/1cdks71/a_hand_with...
As opposed to legs, eyes, construction elements? ;)
Yeah "AI complete" is a bit tongue-in-cheek but it is a fairly spectacular failure mode of every model I've tried.
I've been using "agi-hard" https://latent.space/p/agi-hard as a term
because completeness isn't really what we are going for
Ideogram and dalle do hands pretty well
Per usual, the top comment on anything AI related is snark about "it can't do [random specific thing] well yet".
Tiring, but so is the relentless over-marketing. Each new demo implies new use cases and flexible performance. But the reality is they're very brittle and blunder most seemingly simple tasks. I would personally love an ongoing breakdown of the key weaknesses. I often wonder "can it X?" The answer is almost always "almost, but not a useful almost".
Most generative AI will struggle when given a task that requires something exact. They're probably pretty good at making something "chessish".
> It seems to be an "AI complete" problem.
Conventionally this term means the opposite -- problems that AI unlocks that conventional computing could not do. Conventional computing can render a very wide range of different stylized chess boards, but when an ML technique like diffusion is applied to this mundane problem, it falls apart.
Mine is generation of any actual IBM PC/XT computer. All of the training sets either didn't include actual IBM PCs in them, or they labeled all PC compatibles "IBM PC". Whatever the reason, no generative AI today, whether commercial or open-source, can generate any picture of an IBM PC 5150. Once that situation improves, I'll start taking notice.
An interesting thing that Google does is to watermark the AI generated videos using the [SynthID technology](https://deepmind.google/technologies/synthid/).
It seems that SynthID is not only for AI-generated video but also for images, text and audio.
I would like a bit more convincing that the text watermark will not be noticeable. AI text already has issues with using certain words too frequently. Messing with the weights seems like it might make the issue worse.
Not to mention, when does it get applied? If I am asking an LLM to transform some data from one format to another, I don't expect any changes other than the format.
It seems really clever, especially the encoding of a signature into LLM token probability selections. I wonder if SynthID will trigger some standardization in the industry. I don't think there's much incentive to, though. Open-source gen AI will still exist. What does Google expect to occur? I guess they're just trying to present themselves as 'ethically pursuing AI'.
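SynthID's actual scheme isn't spelled out in this thread, but the general idea of "encoding a signature into token probability selections" can be illustrated with a toy keyed green-list watermark, sketched below. Every detail here — the hash, the bias factor, the detection statistic — is an illustrative assumption, not SynthID's algorithm.

```python
import hashlib
import random

SECRET_KEY = b"watermark-demo-key"   # held only by the model provider

def is_green(prev_token: str, candidate: str) -> bool:
    """Keyed pseudo-random split of the vocabulary, seeded by the previous token."""
    digest = hashlib.sha256(SECRET_KEY + prev_token.encode() + candidate.encode()).digest()
    return digest[0] % 2 == 0          # roughly half the vocabulary is "green" at each step

def sample_watermarked(prev_token: str, probs: dict[str, float], bias: float = 2.0) -> str:
    """Boost the probability of green-listed tokens before sampling the next token."""
    weights = {tok: p * (bias if is_green(prev_token, tok) else 1.0) for tok, p in probs.items()}
    total = sum(weights.values())
    return random.choices(list(weights), [w / total for w in weights.values()])[0]

def green_fraction(tokens: list[str]) -> float:
    """Detector: watermarked text has noticeably more green tokens than the ~50% baseline."""
    hits = sum(is_green(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)
```

The worry raised above maps directly onto the `bias` knob in this toy: the larger the bias, the easier detection becomes, but the further token choices drift from the unwatermarked distribution.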
From a filmmaking standpoint I still don't think this is impactful.
For that it needs a "director" to say: "turn the horse's head 90˚ the other way, trot 20 feet, and dismount the rider" and "give me additional camera angles" of the same scene. Otherwise this is mostly b-roll content.
I'm sure this is coming.
I can see using these video generators to create video storyboards. Especially if you can drop in a scribbled sketch and a prompt for each tile.
That sounds actively harmful. Often we want story boards to be less specific so as not to have some non artist decision maker ask why it doesn't look like the storyboard.
And when we want it to match exactly in an animatic or whatever, it needs to be far more precise than this, matching real locations etc.
I hadn't thought about that in movie context before, but it totally makes sense.
I've worked with other developers that want to build high fidelity wire frames, sometimes in the actual UI framework, probably because they can (and it's "easy"). I always push back against that, in favor of using whiteboard or Sharpies. The low-fidelity brings better feedback and discussion: focused on layout and flow, not spacing and colors. Psychologically it also feels temporary, giving permission for others to suggest a completely different approach without thinking they're tossing out more than a few minutes of work.
I think in the artistic context it extends further, too: if you show something too detailed it can anchor it in people's minds and stifle their creativity. Most people experience this in an ironically similar way: consider how you picture the characters of a book differently depending on if you watched the movie first or not.
I know you weren't implying this, but not every storyboard is for sharing with (or seeking approval from) decision makers.
I could see this being really useful for exploring tone, movement, shot sequences or cut timing, etc..
Right now you scrape together "kinda close enough" stock footage for this kind of exploration, and this could get you "much closer enough" footage..
I think of it in terms of the anchoring bias. Imagine that your most important decisions are anchored for you by what a 10 year old kid heard and understood. Your ideas don’t come to life without first being rendered as a terrible approximation that is convincing to others but deeply wrong to you, and now you get to react to that instead of going through your own method.
So if it’s an optional tool, great, but some people would be fine with it, some would not.
Absolutely. Everyone's creative process is different (and valid).
I guess this will give birth to a new kind of filmmaking. Start with a rough sketch, generate 100 higher-quality versions with an image generator, select one to tweak, use that as input to a video generator which generates 10 versions, choose one to refine, etc.
Perhaps the only industry which immediately benefits from this is short ads, and perhaps TikTok. But even that is dubious, as people seem to actually enjoy being the directors of their own thing themselves, not having somebody else do it.
Maybe this works for ads for a döner place or shisha bar in some developing country. I’ve seen generated images used for menus in such places.
But I doubt serious filmmaking can be done this way. And if it can, it would again be thanks to some smart concept on the part of humans.
Stock videos are indeed crucial, especially now that we can easily search for precisely what we need. Take, for instance, the scene at the end of 'Don't Look Up' featuring a Native American dance in Peru. The dancer's movements were captured from a stock video, and the comet falling was seamlessly edited in. Now imagine having near-infinite stock videos tailored to the situation.
Stock photographers are already having issues with piracy due to very powerful AI watermark-removal tools. And I suspect the companies are using these people's content to train these models too.
Unlimited possibilities. And more is coming - we're only in the beginning stages of this tech. Truly exciting stuff.
I don't think "turn the horse's head 90˚" is the right path forward. What I think is more likely and more useful is: here is a start keyframe and here is a stop keyframe (generated by text-to-image, using other things like ControlNet to control positioning, etc.), and then having the AI generate the frames in between. Don't like the way it generated the in-between? Choose a keyframe, adjust it, and rerun with the segment before and the segment after.
This appeals to me because it feels auditable and controllable... But at the pace these things have been progressing the last 3 years, I could imagine the tech leapfrogs all conventional understanding real soon. Likely outputting gaussian-splat-style scenes where the scene is separate from the camera and all pieces can be independently tweaked via a VR director's chair.
So a declarative keyframe of "the horses head is pointed forward" and a second one of "the horse is looking left"
And let the robot tween?
Vs an imperative for "tween this by turning the horse's head left"
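A crude way to picture that workflow: treat the generator as a black box that fills in the frames between two keyframes, and regenerate only the segment you dislike. The numpy sketch below stands in a plain linear cross-fade for that black box; the function names and the interpolation itself are illustrative assumptions, not anything Veo is documented to expose.

```python
import numpy as np

def fill_between(start: np.ndarray, end: np.ndarray, n_frames: int) -> np.ndarray:
    """Stand-in for a learned in-betweening model: a plain linear cross-fade.
    A real model would be conditioned on both keyframes (and perhaps a prompt)."""
    alphas = np.linspace(0.0, 1.0, n_frames + 2)[1:-1]   # interior steps only
    return np.stack([(1 - a) * start + a * end for a in alphas])

def assemble(keyframes: list[np.ndarray], frames_per_segment: int) -> np.ndarray:
    """Build the clip segment by segment; any segment can be regenerated alone
    by editing one keyframe and re-running fill_between for just that gap."""
    clip = [keyframes[0]]
    for a, b in zip(keyframes, keyframes[1:]):
        clip.extend(fill_between(a, b, frames_per_segment))
        clip.append(b)
    return np.stack(clip)

# Three keyframes (e.g. "head forward", "head turning", "head left"), 8 in-betweens each.
keys = [np.random.rand(256, 256, 3) for _ in range(3)]
clip = assemble(keys, frames_per_segment=8)   # shape: (3 + 2*8, 256, 256, 3)
```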
They claim it can accept an "input video and editing command" to produce a new video output. Also, "In addition, it supports masked editing, enabling changes to specific areas of the video when you add a mask area to your video and text prompt." Not sure if that specific example would work or not.
For most things I view on the internet B-roll is great content, so I'm sure this will enable a new kind of storytelling via YouTube Shorts / Instagram, etc at minimum.
I wouldn't be so sure it's coming. NNs currently don't have the structures for long-term memory and development. These are almost certainly necessary for creating longer works with real purpose and meaning. It's possible we're on the cusp with some of the work to tame RNNs, but it's taken us years to really harness the power of transformers.
There's also the whole "oh you have no actual model/rigging/lighting/set to manipulate" for detail work issue.
That said, I personally think the solution will not be coming that soon, but at the same time, we'll be seeing a LOT more content that can be done using current tools, even if that means a dip in quality (severely) due to the cost it might save.
This leads me to the question of why there hasn't been an effort to do this with 3D content (that I know of).
Because camera angles/lighting/collision detection/etc. at that point would be almost trivial.
I guess with the "2D only" approach that is based on actual, acquired video you get way more impressive shots.
But the obvious application is for games. Content generation in the form of modeling and animation is actually one of the biggest cost centers for most studios these days.
I think with AI content, we'd need to not treat it like expecting fine grained control. E.g. instead like "dramatic scene of rider coming down path, and dismounting horse, then looking into distance", etc. (Or even less detail eventually once a cohesive story can be generated.)
If you or I don’t see the potential here, I think that just means someone more creative is going to do amazing things with it
HN has always been notoriously negative, and wrong a lot of the time. One of my personal favorites is Brian Armstrong's post about an exciting new company he was starting around cryptocurrency and needing a co-founder... Always a good one to go back and read when I've been staying up late working on side projects and need a mental boost.
Wow, that is a really negative thread. To be fair it’s not the best post either, but it shows that people jump to negativity really fast.
Everything I’ve heard from professionals backs that up. Great for B roll. Great for stock footage. That’s it.
Yeah, I've made a lot of images, and it sure is amazing if all you're interested in is, like, "Any basically good image," but if you start needing something very particular, rather than "anything that is on a general topic and is aesthetically pleasing," it gets a lot harder.
And there are a lot more degrees of freedom to get something wrong in film than in a single still image.
I can't wait to see what the big video camera makers are going to do with tech similar to this. Since Google clearly has zero idea what to do with this, and they lack the creativity, it's up to ARRI, Canon, Panasonic, etc. to create their own solutions for this tech. I can't wait to see what Canon has up its sleeve with their new offerings that come in a few months.