Our next-generation model: Gemini 1.5

2024-02-15 15:02 · blog.google

Gemini 1.5 delivers dramatically enhanced performance, with a breakthrough in long-context understanding across modalities.

A note from Google and Alphabet CEO Sundar Pichai:

Last week, we rolled out our most capable model, Gemini 1.0 Ultra, and took a significant step forward in making Google products more helpful, starting with Gemini Advanced. Today, developers and Cloud customers can begin building with 1.0 Ultra too — with our Gemini API in AI Studio and in Vertex AI.

Our teams continue pushing the frontiers of our latest models with safety at the core. They are making rapid progress. In fact, we’re ready to introduce the next generation: Gemini 1.5. It shows dramatic improvements across a number of dimensions and 1.5 Pro achieves comparable quality to 1.0 Ultra, while using less compute.

This new generation also delivers a breakthrough in long-context understanding. We’ve been able to significantly increase the amount of information our models can process — running up to 1 million tokens consistently, achieving the longest context window of any large-scale foundation model yet.

Longer context windows show us the promise of what is possible. They will enable entirely new capabilities and help developers build much more useful models and applications. We’re excited to offer a limited preview of this experimental feature to developers and enterprise customers. Demis shares more on capabilities, safety and availability below.

— Sundar

By Demis Hassabis, CEO of Google DeepMind, on behalf of the Gemini team

This is an exciting time for AI. New advances in the field have the potential to make AI more helpful for billions of people over the coming years. Since introducing Gemini 1.0, we’ve been testing, refining and enhancing its capabilities.

Today, we’re announcing our next-generation model: Gemini 1.5.

Gemini 1.5 delivers dramatically enhanced performance. It represents a step change in our approach, building upon research and engineering innovations across nearly every part of our foundation model development and infrastructure. This includes making Gemini 1.5 more efficient to train and serve, with a new Mixture-of-Experts (MoE) architecture.

The first Gemini 1.5 model we’re releasing for early testing is Gemini 1.5 Pro. It’s a mid-size multimodal model, optimized for scaling across a wide range of tasks, and performs at a similar level to 1.0 Ultra, our largest model to date. It also introduces a breakthrough experimental feature in long-context understanding.

Gemini 1.5 Pro comes with a standard 128,000 token context window. But starting today, a limited group of developers and enterprise customers can try it with a context window of up to 1 million tokens via AI Studio and Vertex AI in private preview.

As we roll out the full 1 million token context window, we’re actively working on optimizations to improve latency, reduce computational requirements and enhance the user experience. We’re excited for people to try this breakthrough capability, and we share more details on future availability below.

These continued advances in our next-generation models will open up new possibilities for people, developers and enterprises to create, discover and build using AI.

Context lengths of leading foundation models

Gemini 1.5 is built upon our leading research on Transformer and MoE architecture. While a traditional Transformer functions as one large neural network, MoE models are divided into smaller “expert” neural networks.

Depending on the type of input given, MoE models learn to selectively activate only the most relevant expert pathways in their network. This specialization massively enhances the model’s efficiency. Google has been an early adopter and pioneer of the MoE technique for deep learning through research such as Sparsely-Gated MoE, GShard-Transformer, Switch-Transformer, M4 and more.
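
Gemini's internals are not public, so purely as an illustration, here is a toy sketch of the top-k gating idea that MoE layers are built on; all names, shapes and expert functions below are invented for the example:

```python
import math

def moe_layer(x, experts, gate, top_k=2):
    """Toy Mixture-of-Experts layer: route the input to its top-k experts only.

    x: input vector (list of floats); experts: list of callables;
    gate: one weight row per expert, used to score relevance to this input.
    """
    # Gating: one relevance score per expert (a dot product with the input).
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate]
    # Keep the k highest-scoring experts; the rest are never evaluated,
    # so compute scales with k rather than with the total expert count.
    top = sorted(range(len(scores)), key=scores.__getitem__)[-top_k:]
    zs = [math.exp(scores[i]) for i in top]
    probs = [z / sum(zs) for z in zs]              # softmax over the chosen k
    outs = [experts[i](x) for i in top]
    return [sum(p * o[d] for p, o in zip(probs, outs)) for d in range(len(x))]

# Tiny demo: four "experts" that just scale their input differently.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
gate = [[1, 0], [0, 1], [-1, 0], [0, -1]]
y = moe_layer([1.0, 0.5], experts, gate)           # only experts 0 and 1 run
```

The key property the sketch shows is sparsity: the output blends only the selected experts, so adding more experts grows capacity without growing per-input compute.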

Our latest innovations in model architecture allow Gemini 1.5 to learn complex tasks more quickly and maintain quality, while being more efficient to train and serve. These efficiencies are helping our teams iterate, train and deliver more advanced versions of Gemini faster than ever before, and we’re working on further optimizations.

An AI model’s “context window” is made up of tokens, which are the building blocks used for processing information. Tokens can be entire parts or subsections of words, images, videos, audio or code. The bigger a model’s context window, the more information it can take in and process in a given prompt — making its output more consistent, relevant and useful.
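
As a rough illustration (the four-characters-per-token figure is a common heuristic for English text, not Gemini's actual tokenizer), you can estimate whether a prompt fits a given window:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; real tokenizers (BPE/SentencePiece) vary with content."""
    return max(1, round(len(text) / chars_per_token))

def fits_window(text: str, window: int = 128_000) -> bool:
    """Check a prompt against a context window, by the heuristic above."""
    return estimate_tokens(text) <= window

# ~700,000 words of ~5 characters each lands near the 1M-token mark,
# consistent with the figures quoted for 1.5 Pro.
book = "word " * 700_000
print(estimate_tokens(book), fits_window(book, window=1_000_000))
```

For anything where the exact count matters (billing, truncation), the model provider's own token-counting API should be used instead of a heuristic.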

Through a series of machine learning innovations, we’ve increased 1.5 Pro’s context window capacity far beyond the original 32,000 tokens for Gemini 1.0. We can now run up to 1 million tokens in production.

This means 1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we’ve also successfully tested up to 10 million tokens.

Complex reasoning about vast amounts of information

1.5 Pro can seamlessly analyze, classify and summarize large amounts of content within a given prompt. For example, when given the 402-page transcripts from Apollo 11’s mission to the moon, it can reason about conversations, events and details found across the document.


Gemini 1.5 Pro can reason across 100,000 lines of code, giving helpful solutions, modifications and explanations.

When tested on a comprehensive panel of text, code, image, audio and video evaluations, 1.5 Pro outperforms 1.0 Pro on 87% of the benchmarks used for developing our large language models (LLMs). And when compared to 1.0 Ultra on the same benchmarks, it performs at a broadly similar level.

Gemini 1.5 Pro maintains high levels of performance even as its context window increases. In the Needle In A Haystack (NIAH) evaluation, where a small piece of text containing a particular fact or statement is purposely placed within a long block of text, 1.5 Pro found the embedded text 99% of the time, in blocks of data as long as 1 million tokens.
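
A minimal version of this kind of probe, sketched from the description above rather than from DeepMind's actual harness, just plants a fact at a random depth in filler text and checks whether the model's answer repeats it:

```python
import random

def build_niah_prompt(needle, filler, n_filler=1000, seed=0):
    """Hide `needle` at a random depth in a haystack of filler sentences."""
    rng = random.Random(seed)
    haystack = [filler] * n_filler
    depth = rng.randrange(n_filler + 1)
    haystack.insert(depth, needle)
    return " ".join(haystack) + "\n\nWhat is the magic word?", depth

def recalled(answer: str, magic_word: str) -> bool:
    """Did the model's answer contain the planted fact?"""
    return magic_word.lower() in answer.lower()

prompt, depth = build_niah_prompt(
    "The magic word is: hummingbird.",
    "The sky was a uniform grey that morning.",
)
```

A full evaluation sweeps both haystack length and needle depth, and plots recall over that grid; the 99% figure quoted above is the aggregate over such a sweep.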

Gemini 1.5 Pro also shows impressive “in-context learning” skills, meaning that it can learn a new skill from information given in a long prompt, without needing additional fine-tuning. We tested this skill on the Machine Translation from One Book (MTOB) benchmark, which shows how well the model learns from information it’s never seen before. When given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person learning from the same content.

As 1.5 Pro’s long context window is the first of its kind among large-scale models, we’re continuously developing new evaluations and benchmarks for testing its novel capabilities.

For more details, see our Gemini 1.5 Pro technical report.

In line with our AI Principles and robust safety policies, we’re ensuring our models undergo extensive ethics and safety tests. We then integrate these research learnings into our governance processes and model development and evaluations to continuously improve our AI systems.

Since introducing 1.0 Ultra in December, our teams have continued refining the model, making it safer for a wider release. We’ve also conducted novel research on safety risks and developed red-teaming techniques to test for a range of potential harms.

In advance of releasing 1.5 Pro, we've taken the same approach to responsible deployment as we did for our Gemini 1.0 models, conducting extensive evaluations across areas including content safety and representational harms, and will continue to expand this testing. Beyond this, we’re developing further tests that account for the novel long-context capabilities of 1.5 Pro.

We’re committed to bringing each new generation of Gemini models to billions of people, developers and enterprises around the world responsibly.

Starting today, we’re offering a limited preview of 1.5 Pro to developers and enterprise customers via AI Studio and Vertex AI. Read more about this on our Google for Developers blog and Google Cloud blog.

We’ll introduce 1.5 Pro with a standard 128,000 token context window when the model is ready for a wider release. Coming soon, we plan to introduce pricing tiers that start at the standard 128,000 context window and scale up to 1 million tokens, as we improve the model.

Early testers can try the 1 million token context window at no cost during the testing period, though they should expect longer latency times with this experimental feature. Significant improvements in speed are also on the horizon.

Developers interested in testing 1.5 Pro can sign up now in AI Studio, while enterprise customers can reach out to their Vertex AI account team.



Comments

  • By vessenes 2024-02-15 16:15 (36 replies)

    The white paper is worth a read. The things that stand out to me are:

    1. They don't talk about how they get to 10M token context

    2. They don't talk about how they get to 10M token context

    3. The 10M context ability wipes out most RAG stack complexity immediately. (I imagine creating caching abilities is going to be important for a lot of long token chatting features now, though). This is going to make things much, much simpler for a lot of use cases.
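
One plausible shape for that caching (a sketch, not any provider's actual API) is to key the expensive long-document ingestion step on a hash of the document, so repeated chat turns over the same context pay the cost once:

```python
import hashlib

class PrefixCache:
    """Cache the expensive 'ingest the huge document' step, keyed by content hash."""

    def __init__(self, ingest):
        self._ingest = ingest      # expensive call: process the long prefix once
        self._store = {}

    def get(self, document: str):
        key = hashlib.sha256(document.encode()).hexdigest()
        if key not in self._store:
            self._store[key] = self._ingest(document)
        return self._store[key]

# Stub "ingestion" that records each call, so we can see the cache working.
ingest_calls = []
cache = PrefixCache(lambda doc: ingest_calls.append(doc) or len(doc))
cache.get("402-page transcript ...")
cache.get("402-page transcript ...")   # later chat turn: no re-ingestion
```

Whether this is even possible depends on the serving stack exposing some reusable representation of the processed prefix, which is exactly the open question here.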

    4. They are pretty clear that 1.5 Pro is better than GPT-4 in general, and therefore we have a new LLM-as-judge leader, which is pretty interesting.

    5. It seems like 1.5 Ultra is going to be highly capable. 1.5 Pro is already very very capable. They are running up against very high scores on many tests, and took a minute to call out some tests where they scored badly as mostly returning false negatives.

    Upshot, 1.5 Pro looks like it should set the bar for a bunch of workflow tasks, if we can ever get our hands on it. I've found 1.0 Ultra to be very capable, if a bit slow. Open models downstream should see a significant uptick in quality using it, which is great.

    Time to dust out my coding test again, I think, which is: "here is a tarball of a repository. Write a new module that does X".

    I really want to know how they're getting to 10M context, though. There are some intriguing clues in their results that this isn't just a single ultra-long vector; for instance, their audio and video "needle" tests, which just include inserting an image that says "the magic word is: xxx", or an audio clip that says the same thing, have perfect recall across up to 10M tokens. The text insertion occasionally fails. I'd speculate that this means there is some sort of compression going on; a full video frame with text on it is going to use a lot more tokens than the text needle.

    • By swalsh 2024-02-15 16:37 (8 replies)

      "The 10M context ability wipes out most RAG stack complexity immediately."

      I'm skeptical; in my past experience, just because the context has room to stuff in whatever you want, the more you stuff into it, the less accurate your results get. There seems to be a balance between providing enough that you'll get high-quality answers, but not so much that the model is overwhelmed.

      I think a large part of developing better models is not just better architectures that support larger and larger context sizes, but also models that can properly leverage that context. That's the test for me.

      • By HereBePandas 2024-02-15 17:02 (3 replies)

        They explicitly address this in page 11 of the report. Basically perfect recall for up to 1M tokens; way better than GPT-4.

        • By westoncb 2024-02-15 17:31 (2 replies)

          I don't think recall really addresses it sufficiently: the main issue I see is answers getting "muddy". Like it's getting pulled in too many directions and averaging.

          • By a_wild_dandan 2024-02-15 17:41 (4 replies)

            I'd urge caution in extending generalizations about "muddiness" to a new context architecture. Let's use the thing first.

            • By westoncb 2024-02-15 19:03 (2 replies)

              I'm not saying it applies to the new architecture, I'm saying that's a big issue I've observed in existing models and that so far we have no info on whether it's solved in the new one (i.e. accurate recall doesn't imply much in that regard).

              • By a_wild_dandan 2024-02-15 20:59 (2 replies)

                Ah, apologies for the misunderstanding. What tests would you suggest to evaluate "muddiness"?

                What comes to my mind: run the usual gamut of tests, but with the excess context window saturated with irrelevant(?) data. Measure test answer accuracy/verbosity as a function of context saturation percentage. If there's little correlation between these two variables (e.g. 9% saturation is just as accurate/succinct as 99% saturation), then "muddiness" isn't an issue.
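
That sweep could be sketched roughly like this (`model` is a stand-in for whatever API is under test; the stub below exists only to demonstrate the harness):

```python
def saturation_sweep(model, qa_pairs, distractor, window=1_000_000,
                     levels=(0.09, 0.5, 0.99)):
    """QA accuracy as a function of how much of the window is padded with distractors.

    `model(prompt) -> str` is whatever API is being tested.
    """
    results = {}
    for level in levels:
        # Fill `level` of the window with irrelevant text before the question.
        pad = distractor * int(level * window / len(distractor))
        correct = sum(
            expected.lower() in model(pad + "\n\n" + q).lower()
            for q, expected in qa_pairs
        )
        results[level] = correct / len(qa_pairs)
    return results

# A stub model that ignores the padding entirely: accuracy stays flat,
# which is the "no muddiness" outcome the test is looking for.
stub = lambda prompt: "the answer is 42"
accuracy = saturation_sweep(stub, [("What is 6*7?", "42")], "filler text. ")
```

A correlation between saturation level and accuracy (or verbosity) in the results dict would be the "muddiness" signal.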

                • By danielmarkbruce 2024-02-15 22:06

                  Manual testing on complex documents. A big legal contract for example. An issue can be referred to in 7 different places in a 100 page document. Does it give a coherent answer?

                  A handful of examples show whether it can do it. For example, GPT-4 turbo is downright awful at something like that.

                • By somenameforme 2024-02-16 5:17

                  You need to use relevant data. The question isn't random sorting/pruning, but being able to apply large numbers of related hints/references/definitions in a meaningful way. To me this would be the entire point of a large context window. For entirely different topics you can always just start a new instance.

              • By westoncb 2024-02-15 19:04

                Would be awesome if it is solved but seems like a much deeper problem tbh.

            • By caesil 2024-02-15 23:45 (1 reply)

              Unfortunately Google's track record with language models is one of overpromising and underdelivering.

              • By chaxor 2024-02-16 0:58

                It's only web-interface LLMs in the past few years that have been lackluster. That statement is not correct for their overall history: w2v-based language models and BERT/Transformer models in the early days (publicly available, but not via a web interface) were far ahead of the curve, as they were the ones that produced these innovations. Effectively, DeepMind/Google are academics (where the real innovations are made), but they do struggle to produce corporate products (where OpenAI shines).

            • By mlsu 2024-02-16 5:33 (1 reply)

              I am skeptical of benchmarks in general, to be honest. It seems to be extremely difficult to come up with benchmarks for these things (it may be true of intelligence as a quality...). It's almost an anti-signal to proclaim good results on benchmarks. The best barometer of model quality has been vibes, in places like /r/localllama where cracked posters are actively testing the newest models out.

              Based on Google's track record in the area of text chatbots, I am extremely skeptical of their claims about coherency across a 1M+ context window.

              Of course, none of this even matters anyway, because the weights are closed, the architecture is closed, and nobody has access to the model. I'll believe it when I see it.

              • By leegao 2024-02-16 18:36

                Their in-context long-sequence understanding "benchmark" is pretty interesting.

                There's a language called Kalamang with only 200 native speakers left. There's a set of grammar books for this language that adds up to ~250K tokens. [1]

                They set up a test of in-context learning capabilities at long context - they asked 3 long-context models (GPT-4 Turbo, Claude 2.1, Gemini 1.5) to perform various Kalamang -> English and English -> Kalamang translation tasks. These are done either 0-shot (no prior training data for kgv in the models), half-book (half of the kgv grammar/wordlists - 125k tokens - fed into the model as part of the prompt), or full-book (the whole 250k tokens fed into the model). Finally, they had human raters check these translations.

                This is a really neat setup, it tests for various things (e.g. did the model really "learn" anything from these massive grammar books) beyond just synthetic memorize-this-phrase-and-regurgitate-it-later tests.

                It'd be great to make this and other reasoning-at-long-ctx benchmarks a standard affair for evaluating context extension. I can't tell which of the many context-extension methods (PI, E2 LLM, PoSE, ReRoPE, SelfExtend, ABF, NTK-Aware ABF, NTK-by-parts, Giraffe, YaRN, Entropy ABF, Dynamic YaRN, Dynamic NTK ABF, CoCA, Alibi, FIRE, T5 Rel-Pos, NoPE, etc etc) is really SoTA since they all use different benchmarks, meaningless benchmarks, or drastically different methodologies that there's no fair comparison.

                [1] from https://storage.googleapis.com/deepmind-media/gemini/gemini_...

                The available resources for Kalamang are: field linguistics documentation comprising a ∼500 page reference grammar, a ∼2000-entry bilingual wordlist, and a set of ∼400 additional parallel sentences. In total the available resources for Kalamang add up to around ∼250k tokens.

            • By smeagull 2024-02-15 21:12 (1 reply)

              I believe that's a limitation of using vectors of high dimensions. It'll be muddy.

              • By Aeolun 2024-02-16 4:25 (1 reply)

                Not unlike trying to keep the whole contents of the document in your own mind :)

                • By sirsinsalot 2024-02-16 9:50 (1 reply)

                  It's amazing we are in 2024 discussing the degree a machine can reason over millions of tokens of context. The degree, not the possibility.

                  • By razodactyl 2024-02-16 13:08

                    Haha. This was my thinking this morning. Like: "Oh cool... a talking computer.... but can it read a 2000 page book, give me the summary and find a sentence out of... it can? Oh... well it's lame anyway."

                    The Sora release is even more mind blowing - not the video generation in my mind but the idea that it can infer properties of reality that it has to learn and constrain in its weights to properly generate realistic video. A side effect of its ability is literally a small universe of understanding.

                    I was thinking that I want to play with audio to audio LLMs. Not text to speech and reverse but literally sound in sound out. It clears away the problem of document layout etc. and leaves room for experimentation on the properties of a cognitive being.

          • By andy_ppp 2024-02-16 8:18 (1 reply)

            Did you think the extraction of information from the Buster Keaton film was muddy? I thought it was incredibly impressive to be this precise.

            • By westoncb 2024-02-17 1:38

              That was not muddy, but it's not the kind of scenario where muddiness shows up.

        • By tcdent 2024-02-16 17:17

          Page 8 of the technical paper [1] is especially informative.

          The first chart (Cumulative Average NLL for Long Documents) shows a deviation from the trend and an increase in accuracy when working with >=1M tokens. The 1.0 graph is overlaid and supports the experience of 'muddiness'.

          [1] https://storage.googleapis.com/deepmind-media/gemini/gemini_...

        • By moffkalast 2024-02-15 17:40

          [flagged]

      • By chuckcode 2024-02-15 17:59

        Would like to see the latency and cost of parsing entire 10M context before throwing out the RAG stack which is relatively cheap and fast.

      • By theolivenbaum 2024-02-15 19:30

        Also, unless they significantly change their pricing model, we're talking about $0.50 per API call at current prices.

      • By patja 2024-02-16 0:55 (1 reply)

        I think there are also a lot of people who are only interested in RAG if they can self-host and keep their documents private.

        • By jimmySixDOF 2024-02-16 4:12

          Yes, and the ability to have direct attribution matters, so you know exactly where your responses come from. And cost, as others point out. But RAG is not gone; in fact, it just got easier and a lot more powerful.

      • By tkellogg 2024-02-15 18:38 (1 reply)

        Costs rise on a per-token basis. So you CAN use 10M tokens, but it's usually not a good idea. A database lookup is still better than a few billion math operations.
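
The arithmetic is easy to sketch; the rate below is purely hypothetical, since real per-token prices vary by model and change often:

```python
def prompt_cost(n_tokens: int, usd_per_1k: float) -> float:
    """Input-token cost of a single call at a flat per-1K-token rate."""
    return n_tokens / 1000 * usd_per_1k

RATE = 0.0005                           # hypothetical $/1K input tokens
full = prompt_cost(10_000_000, RATE)    # stuffing the whole 10M-token window
rag = prompt_cost(4_000, RATE)          # a few retrieved chunks instead
```

Even at that modest rate, the full window costs thousands of times more per call than a retrieval-sized prompt, which is the economic argument for keeping RAG around.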

        • By sjwhevvvvvsj 2024-02-15 20:02 (1 reply)

          I think the unspoken goal is to just lay off your employees and dump every doc and email they’ve ever written as one big context.

          Now that Google has tasted the previously forbidden fruit of layoffs themselves, I think their primary goal in ML is now headcount reduction.

          • By goatlover 2024-02-16 17:56

            Somehow I just don't see the execs or managers being able to make this work well for them without help. Plus, documents still need to be generated. Are they going to be spending all day prompting LLMs?

      • By koliber 2024-02-16 8:08 (1 reply)

        LLMs are able to utilize “all the world’s” knowledge during training and give seemingly magical answers. While providing context in the query is different from training models, is it possible that more context will give more material to the LLM, and it will be able to pick out the relevant bits on its own?

        What if it was possible, with each query, to fine tune the model on the provided context, and then use that JIT fine-tuned model to answer the query?

        • By acchow 2024-02-17 0:21 (1 reply)

          Are you asking what if it was possible that a "context window" ceased to exist? In a different architecture than we currently use, I guess that's hypothetically possible.

          As it is now, you can't fine tune on context. It would have almost no effect on the parameters.

          Context is like giving your friend a magazine article and asking them to respond to it. Fine tuning is like throwing that magazine article into the ocean of all content they ever came across during their lifetime.

          • By koliber 2024-02-17 10:19 (1 reply)

            I am not an expert here so I may be mixing terms and concepts.

            The way I understand it, there is a base model that was trained on vast amount of general data. This sets up the weights.

            You can fine-tune this base model on additional data. Often this is private data that is concentrated around a certain domain. This modifies the model's weights some more.

            Then you have the context. This is where your query to the LLM goes. You can also add the chat history here. Also, system prompts that tell the LLM to behave a certain way go here. Finally, you can take additional information from other sources and provide it as part of the context -- this is called Retrieval Augmented Generation. All of this really goes into one bucket called the context, and the LLM needs to make sense of it. None of this modifies the weights of the model itself.

            Is my mental picture correct so far?

            My question is around RAG. It seems that providing additional selected information from your knowledge base, or using your knowledge base to fine-tune a model, seem similar. I am curious in which ways these are similar, and in which ways they cause the LLM to behave differently.

            Concretely, say I have a company knowledge base with a bunch of rules and guidelines. Someone asks an agent "Can I take 3 weeks off in a row?" How would these two scenarios be different:

            a) The agent searches the knowledge base for all pages and content related to "FTO, PTO, time off, vacations" and feeds those articles to the LLM, together with the "Can I take 3 weeks off in a row?" query

            b) I have an LLM that has been fine tuned on all the content in the knowledge base. I ask it "Can I take 3 weeks off in a row?"
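
Scenario (a) can be sketched concretely; the knowledge-base pages and the keyword scoring below are invented for illustration (production RAG systems typically use embedding search rather than keyword overlap):

```python
def retrieve(pages, query_terms, top_n=3):
    """Naive keyword retrieval: rank knowledge-base pages by term overlap."""
    scored = []
    for title, text in pages.items():
        hits = sum(term.lower() in text.lower() for term in query_terms)
        if hits:
            scored.append((hits, title, text))
    scored.sort(reverse=True)           # most-overlapping pages first
    return scored[:top_n]

def build_prompt(pages, question, query_terms):
    """Feed the retrieved pages to the LLM alongside the question (RAG)."""
    hits = retrieve(pages, query_terms)
    context = "\n\n".join(f"## {title}\n{text}" for _, title, text in hits)
    return f"{context}\n\nUsing only the pages above, answer: {question}"

kb = {  # hypothetical knowledge base
    "PTO policy": "Employees may take up to 4 consecutive weeks of PTO "
                  "with manager approval.",
    "Expenses": "Meals under $50 need no receipt.",
}
prompt = build_prompt(kb, "Can I take 3 weeks off in a row?",
                      ["PTO", "time off", "vacation"])
```

The relevant policy text lands verbatim in the context, which is why scenario (a) tends to produce grounded answers, while fine-tuning (scenario b) only nudges weights.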

            • By acchow 2024-02-20 22:32

              > Is my mental picture correct so far?

              Yes

              > How would these two scenarios be different

              They're different in exactly the way you described above. The agent searching the knowledge base for "FTO, PTO, time off, vacations" would be the same as you pasting all the articles related to those topics into the prompt directly - in both cases, it goes into the context.

              In scenario a, you'll likely get the correct response. In scenario b, likely get an incorrect response.

              Why? Because of what you explained above. Fine tuning adjusts the weights. When you adjust weights by feeding in data, you're only making small adjustments to shift slightly along a curve - thus the exposure to this data (for the purposes of fine tuning) will have very little effect on the next context the model is exposed to.

      • By aik 2024-02-15 20:23

        Have to consider cost for all of this. A big value of RAG already, even given the size of GPT-4's largest context window, is that it decreases cost very significantly.

      • By swyx 2024-02-15 17:53 (1 reply)

        Also, costs are always based on context tokens; you don't want to put in 10M of context for every request (it's just nice to have that option when you want to do big things that don't scale).

        • By 1024core 2024-02-15 22:52 (1 reply)

          How much would a lawyer charge to review your 10M-token legal document?

          • By hereonout2 2024-02-16 0:08 (1 reply)

            10M tokens is something like 14 copies of War and Peace, or maybe the entire Harry Potter series seven times over. That'd be some legal document!

            • By xp84 2024-02-16 4:34

              Hmm I don’t know but I feel like the U.S. Congress has bills that would push that limit.

    • By usaar333 2024-02-15 16:36 (3 replies)

      > They are pretty clear that 1.5 Pro is better than GPT-4 in general, and therefore we have a new LLM-as-judge leader, which is pretty interesting.

      They try to push that, but it's not the most convincing. Look at Table 8 for text evaluations (math, etc.) - they don't even attempt a comparison with GPT-4.

      GPT-4 scores higher than any Gemini model on both MMLU and GSM8K. Gemini Pro seems slightly better than the original GPT-4 on HumanEval (67 -> 71). Gemini Pro does crush naive GPT-4 on math (though not GPT-4 with a code interpreter, and this is against the original model).

      All in all, 1.5 Pro seems maybe a bit better than 1.0 Ultra. Given that in the wild people seem to find GPT-4 better than Gemini Ultra for, say, coding, my current update is that Pro 1.5 is about equal to GPT-4.

      But we'll see once released.

      • By panarky 2024-02-15 17:43 (5 replies)

        > people seem to find GPT-4 better for say coding than Gemini Ultra

        For my use cases, Gemini Ultra performs significantly better than GPT-4.

        My prompts are long and complex, with a paragraph or two about the general objective followed by 15 to 20 numbered requirements. Often I'll include existing functions the new code needs to work with, or functions that must be refactored to handle the new requirements.

        I took 20 prompts that I'd run with GPT-4 and fed them to Gemini Ultra. Gemini gave a clearly better result in 16 out of 20 cases.

        Where GPT-4 might miss one or two requirements, Gemini usually got them all. Where GPT-4 might require multiple chat turns to point out its errors and omissions and tell it to fix them, Gemini often returned the result I wanted in one shot. Where GPT-4 hallucinated a method that doesn't exist, or had been deprecated years ago, Gemini used correct methods. Where GPT-4 called methods of third-party packages it assumed were installed, Gemini either used native code or explicitly called out the dependency.

        For the 4 out of 20 prompts where Gemini did worse, one was a weird rejection where I'd included an image in the prompt and Gemini refused to work with it because it had unrecognizable human forms in the distance. Another was a simple bash script to split a text file, and it came up with a technically correct but complex one-liner, while GPT-4 just used split with simple options to get the same result.

        For now I subscribe to both. But I'm using Gemini for almost all coding work, only checking in with GPT-4 when Gemini stumbles, which isn't often. If I continue to get solid results I'll drop the GPT-4 subscription.

        • By sho_hn 2024-02-15 17:52 (2 replies)

          I have a very similar prompting style to yours and share this experience.

          I am an experienced programmer and usually have a fairly exact idea of what I want, so I write detailed requirements and use the models more as typing accelerators.

          GPT-4 is useful in this regard, but I also tried about a dozen older prompts on Gemini Advanced/Ultra recently and in every case preferred the Ultra output. The code was usually more complete and prod-ready, with higher sophistication in its construction and somewhat higher density. It was just closer to what I would have hand-written.

          It's increasingly clear though LLM use has a couple of different major modes among end-user behavior. Knowledge base vs. reasoning, exploratory vs. completion, instruction following vs. getting suggestions, etc.

          For programming I want an obedient instruction-following completer with great reasoning. Gemini Ultra seems to do this better than GPT-4 for me.

          • By lyu07282 2024-02-15 23:51 (1 reply)

            It constantly hallucinates APIs for me, I really wonder why people's perceptions are so radically different. For me it's basically unusable for coding. Perhaps I'm getting a cheaper model because I live in a poorer country.

            • By sho_hn 2024-02-16 0:33 (2 replies)

              Are you using Gemini Advanced? (The paid tier.) The free one is indeed very bad.

              • By belter 2024-02-16 11:16

                Spent a few hours comparing Gemini Advanced with GPT-4.

                Gemini Advanced is nowhere even close to GPT-4, either for text generation, code generation or logical reasoning.

                Gemini Advanced is constantly asking for directions ("What are your thoughts on this approach?"), even to create a short task list of 10 items, and even when told several times to provide the full list rather than stopping every three or four items to ask for directions. It is also constantly giving moral lessons, or finishing results with annoying marketing-style comments of the type "Let's make this an awesome product!"

                Code is more generic, solutions are less sophisticated. On a discussion of Options Trading strategies Gemini Advanced got core risk management strategies wrong and apologized when errors were made clear to the model. GPT-4 provided answers with no errors, and even went into the subtleties of some exotic risk scenarios with no mistakes.

                Maybe 1.5 will be it, or maybe Google realized this quite quickly and are trying the increased token size as a Hail Mary to catch up. Why release so soon?

                Quite curious to try the same prompts on 1.5.

              • By oceanplexian 2024-02-16 9:10 (1 reply)

                I asked Gemini Advanced, the paid one, to "Write a script to delete some files" and it told me that it couldn't do that because deleting files was unethical. At that point I cancelled my subscription since even GPT-4 with all its problems isn't nearly as broken as Gemini.

                • By panarky 2024-02-16 18:27

                  If you share your prompt I'm sure people here can help you.

                  Here's a prompt I used, and I got a script that not only accomplishes the objective, but even has an option to show which files will be deleted and asks for confirmation before deleting them.

                  Write a bash script to delete all files with the extension .log in the current directory and all subdirectories of the current directory.

          • By sjwhevvvvvsj 2024-02-1520:052 reply

            I’m going to have to try Gemini for code again. It just occurred to me as a Xoogler that if they used Google’s code base as the training data, it’s going to be unbeatable. Now, did they do that? No idea, but quality wins over quantity, even with LLMs.

            • By barrkel 2024-02-1521:452 reply

              There is no way NTK data is in the training set, and google3 is NTK.

              • By sjwhevvvvvsj 2024-02-161:33

                I dunno, leadership is desperate and they can de-NTK if and when they feel like it.

              • By cpeterso 2024-02-161:251 reply

                What is “NTK”?

                • By mjamaloney 2024-02-162:121 reply

                  "Need To Know", i.e. data that isn't open within the company.

                  • By saagarjha 2024-02-1713:58

                    Almost all of google3 is basically open to all of engineering.

        • By koreth1 2024-02-161:351 reply

          > My prompts are long and complex, with a paragraph or two about the general objective followed by 15 to 20 numbered requirements. Often I'll include existing functions the new code needs to work with, or functions that must be refactored to handle the new requirements.

          I guess this is a tough request if you're working on a proprietary code base, but I would love to see some concrete examples of the prompts and the code they produce.

          I keep trying this kind of prompting with various LLM tools including GPT-4 (haven't tried Gemini Ultra yet, I admit) and it nearly always takes me longer to explain the detailed requirements and clean up the generated code than it would have taken me to write the code directly.

          But plenty of people seem to have an experience more like yours, so I really wonder whether (a) we're just asking it to write very different kinds of code, or (b) I'm bad at writing LLM-friendly requirements.

          • By vineyardmike 2024-02-165:051 reply

            Not OP, but here is a verbatim prompt I put into these LLMs. I'm learning to make Flutter apps, and I like trying to make various UIs so I can learn how to compose things. I agree that Gemini Ultra (aka the paid "Advanced" mode) is def better than ChatGPT-4 for this prompt. Mine is a bit more terse than OP's huge prompt with numbered requirements, but I still got a super valid and meaningful response from Gemini, while GPT-4 told me it was a tricky problem and gave me some generic code snippets that explicitly don't solve the problem asked.

            > I'm building a note-taking app in flutter. I want to create a way to link between notes (like a web hyperlink) that opens a different note when a user clicks on it. They should be able to click on the link while editing the note, without having to switch modalities (eg. no edit-save-view flow nor a preview page). How can I accomplish this?

            I also included a follow-up prompt after getting the first answer, which again for Gemini was super meaningful, and already included valid code to start with. Gemini also showed me many more projects and examples from the broader internet.

            > Can you write a complete Widget that can implement this functionality? Please hard-code the note text below: <redacted from HN since its long>

            • By koreth1 2024-02-1621:071 reply

              This is useful, thanks. Since you're using this for learning, would it be fair to characterize this as asking the LLM to write code you don't already know how to write on your own?

              I've definitely had success using LLMs as a learning tool. They hallucinate, but most often the output will at least point me in a useful direction.

              But my day-to-day work usually involves non-exploratory coding where I already know exactly how to do what I need. Those are the tasks where I've struggled to find ways to make LLMs save me any time or effort.

              • By vineyardmike 2024-02-1622:121 reply

                > would it be fair to characterize this as asking the LLM to write code you don't already know how to write on your own?

                Yea absolutely. I also use it to just write code I understand but am too lazy to write, but it's definitely effective at "show me how this works" type learning too.

                > Those are the tasks where I've struggled to find ways to make LLMs save me any time or effort

                Github CoPilot has an IDE integration where it can output directly into your editor. This is great for "// TODO: Unit Test for add(x, y) method when x < 0" and it'll dump out the full test for you.

                Similarly useful for things like "write me a method that loops through a sorted list, and finds anything with <condition> and applies a transformation and saves it in a Map". Basically all those random helper methods can be written for you.

                • By koreth1 2024-02-1622:41

                  That last one is an interesting example. If I needed to do that, I would write something like this (in Kotlin, my daily-driver language):

                      fun foo(list: List<Bar>) =
                          list.filter { condition(it) }.associateWith { transform(it) }
                  
                  which would take me less time to write than the prompt would.

                  However, if I didn't know Kotlin very well, I might have had to go look in the docs to find the associateWith function (or worse, I might not have even thought to look for it) at which point the prompt would have saved me time and taught me that the function exists.

        • By Dayshine 2024-02-1517:57

          Is there any chance you could share an example of the kind of prompt you're writing?

          I'm always reluctant to write long prompts, because I often find GPT-4 just doesn't get it, and then I've wasted ten minutes writing a prompt.

        • By TaylorAlexander 2024-02-161:14

          How do you interact with Gemini for coding work? I am trying to paste my code in the web interface and when I hit submit, the interface says "something went wrong" and the code does not appear in the chat window. I signed up for Gemini Advanced and that didn't help. Do you use AI Studio? I am just looking in to that now.

        • By qingcharles 2024-02-160:52

          I've found Gemini generally equal with the .Net and HTML coding I've been doing.

          I've never had Gemini give me a better result than GPT, though, so it does not surpass it for my needs.

          The UI is more responsive, though, which is worth something.

      • By spott 2024-02-1518:49

        > Gemini Pro seems slightly better than GPT-4 original in Human Eval (67->71).

        Though they talk a bunch about how hard it was to filter out Human Eval, so this probably doesn't matter much.

      • By cchance 2024-02-1517:31

        I mean i don't see GPT4 watching a 44 minute movie and being able to exactly pinpoint a guy taking a paper out of his pocket..

    • By CharlieDigital 2024-02-1516:212 reply

      > The 10M context ability wipes out most RAG stack complexity immediately.

      Remains to be seen.

      Large contexts are not always better. For starters, they take longer to process. But second, even with RAG and the large context of GPT-4 Turbo, providing a more relevant and accurate context always yields better output.

      What you get with RAG is faster response times and more accurate answers by pre-filtering out the noise.
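
      The "pre-filtering out the noise" step is essentially top-k retrieval. Here's a self-contained sketch with a toy bag-of-words embedding standing in for a real embedding model (all data and names are illustrative):

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # Rank chunks by similarity to the query and keep only the top k,
    # so the generation step sees relevant context instead of noise.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "options trading uses puts and calls to manage risk",
    "the cafeteria menu changes every tuesday",
    "risk management caps position size per trade",
]
print(retrieve("how do traders manage risk", chunks, k=2))
```

      Only the two risk-related chunks survive the filter; the irrelevant one never reaches the model.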

      • By killerstorm 2024-02-1516:581 reply

        Hopefully we can get a better RAG out of it. Currently people do incredibly primitive stuff, like splitting text into fixed-size chunks and adding them to a vector DB.

        An actually useful RAG would be to convert text to Q&A and use Q's embeddings as an index. Large context can make use of in-context learning to make better Q&A.
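
        The question-indexing idea can be sketched as follows; the per-chunk questions are hand-written stand-ins for what an LLM would generate, and the word-overlap similarity is a toy replacement for real embeddings:

```python
# Sketch of question-indexed RAG: each chunk is indexed by the questions
# it answers, and user queries match against those questions rather than
# against the raw chunk text.

def embed(text):
    # Toy embedding: set of lowercase words (a real system uses a model).
    return set(text.lower().split())

def similarity(a, b):
    # Jaccard overlap between word sets.
    return len(a & b) / len(a | b) if a | b else 0.0

index = []  # (question_embedding, chunk) pairs
corpus = {
    "The context window grew from 32k to 1M tokens.": [
        "how big is the context window",
        "how many tokens can the model process",
    ],
    "MoE routes each token to a small subset of experts.": [
        "what is mixture of experts",
        "how does moe routing work",
    ],
}
for chunk, questions in corpus.items():
    for q in questions:
        index.append((embed(q), chunk))

def answer_chunk(user_question):
    # Return the chunk whose indexed question best matches the user's.
    qe = embed(user_question)
    return max(index, key=lambda pair: similarity(qe, pair[0]))[1]

print(answer_chunk("how many tokens fit in the context window"))
```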

        • By mediaman 2024-02-1518:031 reply

          A lot of people in RAG already do this. I do this with my product: we process each page and create lists of potential questions that the page would answer, and then embed that.

          We also embed the actual text, though, because I found that only doing the questions resulted in inferior performance.

          • By CharlieDigital 2024-02-1518:171 reply

            So in this case, what your workflow might look like is:

                1. Get text from page/section/chunk
                2. Generate possible questions related to the page/section/chunk
                3. Generate an embedding using { each possible question + page/section/chunk }
                4. Incoming question targets the embedding and matches against { question + source }
            
            Is this roughly it? How many questions do you generate? Do you save a separate embedding for each question? Or just stuff all of the questions back with the page/section/chunk?

            • By mediaman 2024-02-168:19

              Right now I just throw the different questions together in a single embedding for a given chunk, with the idea that there’s enough dimensionality to capture them all. But I haven’t tested embedding each question, matching on that vector, and then returning the corresponding chunk. That seems like it’d be worth testing out.

      • By behnamoh 2024-02-1516:302 reply

        Don't forget that Gemini also has access to the internet, so a lot of RAGging becomes pointless anyway.

        • By beppo 2024-02-1516:411 reply

          Internet search is a form of RAG, though. 10M tokens is very impressive, but you're not fitting a database, let alone the entire internet into a prompt anytime soon.

          • By behnamoh 2024-02-1516:592 reply

            You shouldn't fit an entire database in the context anyway.

            btw, 10M tokens is about 78 times the context window of the newest GPT-4 Turbo (128K). In a way, you don't need 78 GPT-4 API calls, only one batch call to Gemini 1.5.

            • By cchance 2024-02-1517:352 reply

              I don't get this. Why do people think you need to put an entire database in the short-term memory of the AI for it to be useful? When you work with a DB, are you memorizing the entire f*cking database? No, you know a summary of it and how to access and use it.

              People also seem to forget that the average person reads about 1b words in their entire LIFETIME, and at 10m with nearly 100% recall, that's pretty damn amazing. I'm pretty sure I don't have perfect recall of 10m words myself lol

              • By choilive 2024-02-1523:58

                You certainly don't need that much context for it to be useful, but it definitely opens up a LOT more possibilities without the compromises of implementing some type of RAG. In addition, don't we want our AI to have superhuman capabilities? The ability to work on 10M+ tokens of context at a time could enable superhuman performance in many tasks. Why stop at 10M tokens? Imagine if AI could work on 1B tokens of context like you said?

              • By Qwero 2024-02-1522:13

                It increases the use cases.

                It can also be a good alternative for fine-tuning.

                And the use case of a code base is a good example: if the AI understands the whole context, it can do basically everything.

                Let me pay 5€ for an Android app rewritten for iOS.

            • By rvnx 2024-02-1517:29

              Well it's nice, just sad nobody can use it

        • By CharlieDigital 2024-02-1516:571 reply

          This may be useful in a generalized use case, but a problem is that many of those results again will add noise.

          For any use case where you want contextual results, you need to be able to either filter the search scope or use RAG to pre-define the acceptable corpus.

          • By panarky 2024-02-1520:33

            > you need to be able to either filter the search scope or use RAG ...

            Unless you can get nearly perfect recall with millions of tokens, which is the claim made here.

    • By tveita 2024-02-1516:59

      > The 10M context ability wipes out most RAG stack complexity immediately.

      The video queries they show take around 1 minute each, this probably burns a ton of GPU. I appreciate how clearly they highlight that the video is sped up though, they're clearly trying to avoid repeating the "fake demo" fiasco from the original Gemini videos.

    • By cchance 2024-02-1517:301 reply

      The youtube video of the Multimodal analysis of a video is insane, imagine feeding in movies or tv shows and being able to autosummary or find information about them dynamically, how the hell is all this possible already? AI is moving insanely fast.

      • By vineyardmike 2024-02-165:191 reply

        > imagine feeding in movies or tv shows

        Google themselves have such a huge footprint of various businesses, that they alone would be an amazing customer for this, never mind all the other cool opportunities from third parties...

        Imagine that they can ingest the entirety of YouTube and then dump that into Google Search's index AND use it to generate training data for their next LLM.

        Imagine that they can hook it up to your security cameras (Nest Cam), and then ask questions about what happened last night.

        Imagine that you can ask Gemini how to do something (eg. fix appliance), and it can go and look up a YouTube video on how to accomplish that ask, and explain it to you.

        Imagine that it can apply summarization and descriptions to every photo AND video in your personal Google Photos library. You can ask it to find a video of your son's first steps, or a graduation/diploma walk for your 3rd child (by name) and it can actually do that.

        Imagine that Google Meet video calls can have the entire convo itself fed into an LLM (live?), instead of just a transcription. You can have an AI assistant there with you that can interject and discuss, based on both the audio and video feed.

        • By anhner 2024-02-167:242 reply

          I'd love to see that applied to the Google ecosystem. The question is: why haven't they already done this?

          • By is_true 2024-02-1611:59

            IMO, they aren't sure how to monetize it, Google is run by the ads team.

            Problem is they are jeopardizing their moat.

            Google is still in a great position, they have the knowledge and lots of data to pull this off. They just have to take the risk of losing some ad revenue for a while.

          • By vineyardmike 2024-02-1622:08

            Well, they just announced publicly that the technology is available. Maybe it's just too new to have been productized so far.

    • By freedomben 2024-02-1516:482 reply

      Is 10M token context correct? In the blog post I see 1M, but I'm not sure if these are different things.

      Edit: Ah, I see, it's 1M reliably in production, up to 10M in research:

      > Through a series of machine learning innovations, we’ve increased 1.5 Pro’s context window capacity far beyond the original 32,000 tokens for Gemini 1.0. We can now run up to 1 million tokens in production.

      > This means 1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we’ve also successfully tested up to 10 million tokens.

      • By huytersd 2024-02-1517:12

        I know how I’m going to evaluate this model. Upload my codebase and ask it to “find all the bugs”.

      • By p1esk 2024-02-162:461 reply

          How could one hour of video fit in 1M tokens? One hour at 30fps is 3600*30 = 108k frames. Each frame is converted into 256 tokens. So either they are not processing every frame, or each frame is converted into fewer tokens.

        • By KTibow 2024-02-165:17

          The model can probably perform fine at 1 frame per second (3600*256=921600 tokens), and they could probably use some sort of compression.
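
          Putting the two comments' numbers together (256 tokens per frame is the parent's assumption; the actual pipeline isn't public):

```python
# Back-of-envelope token budget for one hour of video, assuming the
# thread's figure of 256 tokens per frame.
TOKENS_PER_FRAME = 256
SECONDS_PER_HOUR = 3600

def video_tokens(fps):
    return int(SECONDS_PER_HOUR * fps * TOKENS_PER_FRAME)

print(video_tokens(30))  # 27648000 -- full frame rate blows far past 1M
print(video_tokens(1))   # 921600 -- 1 frame/second just fits in 1M
```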

    • By cs702 2024-02-1516:323 reply

      > 1. They don't talk about how they get to 10M token context

      > 2. They don't talk about how they get to 10M token context

      Yes. I wonder if they're using a "linear RNN" type of model like Linear Attention, Mamba, RWKV, etc.

      Like Transformers with standard attention, these models train efficiently in parallel, but their compute is O(N) instead of O(N²), so in theory they can be extended to much longer sequences much more efficiently. They have shown a lot of promise recently at smaller model sizes.

      Does anyone here have any insight or knowledge about the internals of Gemini 1.5?
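
      To make the O(N) vs O(N²) distinction concrete, here is a minimal NumPy sketch of a linear-attention recurrence in the style of "Transformers are RNNs" (Katharopoulos et al., 2020). It only illustrates the model family being speculated about; nothing is known about Gemini 1.5's internals:

```python
import numpy as np

def linear_attention(Q, K, V):
    # Causal linear attention with an elu(x)+1 feature map: the output is
    # computed from a running (d_k x d_v) state, so total work is O(N)
    # instead of the O(N^2) token-token attention matrix.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 > 0
    Qf, Kf = phi(Q), phi(K)
    S = np.zeros((Qf.shape[1], V.shape[1]))  # running sum of phi(k) v^T
    z = np.zeros(Qf.shape[1])                # running sum of phi(k)
    out = np.zeros_like(V)
    for t in range(len(V)):                  # constant work per token
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = Qf[t] @ S / (Qf[t] @ z)
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))  # 8 tokens, dim 4
print(linear_attention(Q, K, V).shape)  # (8, 4)
```

      Note the first token's output is exactly V[0], since the running state then contains only that token; each later step folds one more token into S and z at constant cost.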

      • By sebzim4500 2024-02-1517:321 reply

        The fact they are getting perfect recall with millions of tokens rules out any of the existing linear attention methods.

        • By cs702 2024-02-1613:44

          I wouldn't be so sure perfect recall rules out linear RNNs, because I haven't seen any conclusive data on their ability to recall. Have you?

      • By candiodari 2024-02-1516:373 reply

        They do give a hint:

        "This includes making Gemini 1.5 more efficient to train and serve, with a new Mixture-of-Experts (MoE) architecture."

        One thing you could do with MoE is give each expert a different subset of the input tokens. And that would definitely do what they claim here: it would allow search. If you want to find where someone said "the password is X" in a 50-hour audio file, this would be perfect.

        If your question is "what is the first AND last thing person X said"... it's going to suck badly. Anything that requires taking two things into account that aren't right next to each other is just not going to work.

        • By spott 2024-02-1518:531 reply

          > Anything that requires taking two things into account that aren't right next to each other is just not going to work.

          They kinda address that in the technical report[0]. On page 12 they show results from a "multiple needle in a haystack" evaluation.

          https://storage.googleapis.com/deepmind-media/gemini/gemini_...

          • By candiodari 2024-02-1714:09

            I would perhaps add that it's worrying. Every YouTube evaluation of this model in GCP AI Studio that I've seen has commented on the model's constant hallucinations.

        • By declaredapple 2024-02-1518:08

          > One thing you could do with MoE is giving each expert different subsets of the input tokens.

          Don't MoE's route tokens to experts after the attention step? That wouldn't solve the n^2 issue the attention step has.

          If you split the tokens before the attention step, that would mean those tokens would have no relationship to each other - it would be like inferring two prompts in parallel. That would defeat the point of a 10M context

        • By deskamess 2024-02-1518:131 reply

          Is MOE then basically divide and conquer? I have no deep knowledge of this so I assumed MOE was where each expert analyzed the problem in a different way and then there was some map-reduce like operation on the generated expert results. Kinda like random forest but for inference.

          • By declaredapple 2024-02-1613:13

            > I assumed MOE was where each expert analyzed the problem in a different way

            Uh, sorta, but not like the parent described at all. You have multiple "experts" and a routing layer (or layers) that decides which experts to send each token to. Usually every token is sent to at least 2. You can't just send half the tokens to one expert and half to another.

            Also the "experts" are not "domain experts" - there is not a "programming expert" and an "essay expert".
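
            A minimal sketch of that top-k routing, with a random (untrained) router; the expert count, dimensions, and tokens are made up for illustration:

```python
import numpy as np

def route_tokens(tokens, n_experts=4, top_k=2, seed=0):
    # MoE routing sketch: a linear router scores every token against every
    # expert, and each token goes to its top-k experts. The router weights
    # are random here; in a real model they are learned.
    rng = np.random.default_rng(seed)
    w_router = rng.normal(size=(tokens.shape[1], n_experts))
    logits = tokens @ w_router                     # (n_tokens, n_experts)
    return np.argsort(-logits, axis=1)[:, :top_k]  # top-k experts per token

tokens = np.random.default_rng(1).normal(size=(6, 8))  # 6 tokens, dim 8
assignment = route_tokens(tokens)
print(assignment.shape)  # (6, 2): every token is routed to 2 of 4 experts
```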

    • By TweedBeetle 2024-02-1518:471 reply

      Regarding how they’re getting to 10M context, I think it’s possible they are using the new SAMBA architecture.

      Here’s the paper: https://arxiv.org/abs/2312.00752

      And here’s a great podcast episode on it: https://www.cognitiverevolution.ai/emergency-pod-mamba-memor...

      • By LightMachine 2024-02-1518:55

        As a Brazilian, I approve that choice. Vambora amigos! ("Let's go, friends!")

    • By nestorD 2024-02-1523:111 reply

      Regarding the 10M tokens context, RingAttention has been shown [0] recently (by researchers, not ML engineers in a FAANG) to be able to scale to comparable (1M) context sizes (it does take work and a lot of GPUs).

      [0]: https://news.ycombinator.com/item?id=39367141

      • By jebarker 2024-02-160:342 reply

        > researchers, not ML engineers in a FAANG

        Why did you point out this distinction?

        • By nestorD 2024-02-163:26

          It means they have significantly fewer resources (GPUs to let them scale up in context length) and are likely less well-versed in optimization (which also helps with scaling up)[0].

          I believe those two things together are likely enough to explain the difference between a 1M context length and a 10M context length.

          [0]: Which is not looking down on that particular research team; the vast majority of people have fewer resources and less optimization know-how than Google.

        • By vineyardmike 2024-02-167:03

          Probably to indicate that it's research and not productized?

    • By nborwankar 2024-02-1520:15

      Re RAG, aren’t you ignoring the fact that no one wants to put confidential company data into such LLMs? Private RAG infrastructure remains a need for the same reason that privacy of data of all sorts remains a need. Huge context solves the problem for large open-source context material, but that’s only part of the picture.

    • By theGnuMe 2024-02-1517:00

      For #1 and #2 it is some version of mixture of experts. This is mentioned in the blog post. So each expert only sees a subset of the tokens.

      I imagine they have some new way to route tokens to the experts that probably computes a global context. One scalable way to compute a global context is by a state space model. This would act as a controller and route the input tokens to the MoEs. This can be computed by convolution if you make some simplifying assumptions. They may also still use transformers as well.

      I could be wrong but there are some Mamba-MoEs papers that explore this idea.

    • By kristjansson 2024-02-1521:05

    • By AaronFriel 2024-02-1518:572 reply

      There will always be more data that could be relevant than fits in a context window, and especially for multi-turn conversations, huge contexts incur huge costs.

      GPT-4 Turbo, using its full 128k context, costs around $1.28 per API call.

      At that pricing, 1m tokens is $10, and 10m tokens is an eye-watering $100 per API call.

      Of course prices will go down, but the price advantage of working with less will remain.
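
      The arithmetic above as a one-liner, assuming GPT-4 Turbo's $0.01 per 1K input tokens carries over to longer contexts (Gemini 1.5 pricing hadn't been announced):

```python
PRICE_PER_1K_INPUT = 0.01  # USD, GPT-4 Turbo input-token price

def input_cost(context_tokens):
    # Cost of the prompt side of a single API call at this rate.
    return round(context_tokens / 1000 * PRICE_PER_1K_INPUT, 2)

print(input_cost(128_000))     # 1.28
print(input_cost(1_000_000))   # 10.0
print(input_cost(10_000_000))  # 100.0
```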

      • By elorant 2024-02-1521:503 reply

        I don't see a problem with this pricing. At 1m tokens you can upload the whole proceedings of a trial and ask it to draw an analysis. Paying $10 for that sounds like a steal.

        • By ithkuil 2024-02-166:28

          Unfortunately, the whole context has to be reprocessed fully for each query, which means that if you "chat" with the model you'll incur that $10 fee for every interaction, which quickly adds up.

          It may still be worth it for some use cases

        • By AaronFriel 2024-02-1522:16

          Of course, if you get exactly the answer you want in the first reply.

        • By staticman2 2024-02-1522:21

          While it's hard to say what's possible on the cutting edge, historically models tend to get dumber as the context size gets bigger. So you'd get a much more intelligent analysis of a 10,000-token excerpt of the trial than of a million-token complete transcript. I haven't spent the money testing big token sizes in GPT-4 Turbo, but it would not surprise me if it gets dumber. Think of it this way: if the model is limited to 3,000-token replies and an analysis requires more detail than that, it cannot provide it; it'll just give you insufficient information. What it'll probably do is ignore parts of the trial transcript, because it can't analyze all that information in 3,000 tokens. And asking a follow-up question is another million tokens.

      • By 7734128 2024-02-1521:44

        Would the price really increase linearly? Aren't the demands on compute and memory increasing steeper than that as a function of context length?

    • By ren_engineer 2024-02-1517:261 reply

      RAG would still be useful for cost savings assuming they charge per token, plus I'm guessing using the full-context length would be slower than using RAG to get what you need for a smaller prompt

      • By nostrebored 2024-02-1517:332 reply

        This is going to be the real differentiator.

        HN is very focused on technical feasibility (which remains to be seen!), but in every LLM opportunity, the CIO/CFO/CEO are going to be concerned with the cost modeling.

        The way that LLMs are billed now, if you can densely pack the context with relevant information, you will come out ahead commercially. I don't see this changing with the way that LLM inference works.

        Maybe this changes with managed vector search offerings that are opaque to the user. The context goes to a preprocessing layer, an efficient cache understands which parts haven't been embedded (new bloom filter use case?), embeds the other chunks, and extracts the intent of the prompt.

        • By mediaman 2024-02-1518:00

          Agreed with this.

          The leading ability AI (in terms of cognitive power) will, generally, cost more per token than lower cognitive power AI.

          That means that at a given budget you can choose more cognitive power with fewer tokens, or less cognitive power with more tokens. For most use cases, there's no real point in giving up cognitive power to include useless tokens that have no hope of helping with a given question.

          So then you're back to the question of: how do we reduce the number of tokens, so that we can get higher cognitive power?

          And that's the entire field of information retrieval, which is the most important part of RAG.

        • By golol 2024-02-1518:27

          > The way that LLMs are billed now, if you can densely pack the context with relevant information, you will come out ahead commercially. I don't see this changing with the way that LLM inference works.

          Really? Because to my understanding the compute necessary to generate a token grows linearly with the context, and doesn't the OpenAI billing reflect that by separating prompt and output tokens?

    • By resouer 2024-02-1517:09

      > The 10M context ability wipes out most RAG stack complexity immediately.

      This may not be true. In my experience, the complexity of RAG lies in how to properly connect to various unstructured data sources and run a data transformation pipeline over large-scale data sets (which means GBs, TBs or even PBs). It's on the critical path rather than a "nice to have", because the quality of the data and the pipeline is a major factor in the final generated result. i.e., in RAG, the importance of R >>> G.

    • By localhost 2024-02-1518:432 reply

      RE: RAG - they haven't released pricing, but if input tokens are priced at GPT-4 levels - $0.01/1K then sending 10M tokens will cost you $100.

      • By campers 2024-02-162:55

        In the announcements today they also halved the pricing of Gemini 1.0 Pro to $0.000125 / 1K characters, which is a quarter of GPT3.5 Turbo so it could potentially be a bit lower than GPT-4 pricing.

      • By s-macke 2024-02-1518:521 reply

        If you think the current APIs will stay that way, then you're right. But when they start offering dedicated chat instances or caching options, you could be back in the penny region.

        You probably need a couple GB to cache a conversation. That's not so easy at the moment because you have to transfer that data to and from the GPUs and store the data somewhere.

        • By localhost 2024-02-1523:441 reply

          The tokens need to be fed into the model along with the prompt and this takes time. Naive attention is O(N^2). They probably use at least flash attention, and likely something more exotic to their hardware.

          You'll notice in their video [1] that they never show the prompts running interactively. This is for a roughly 800K context. They claim that "the model took around 60s to respond to each of these prompts".

          This is not really usable as an interactive experience. I don't want to wait 1 minute for an answer each time I have a question.

          [1] https://www.youtube.com/watch?v=SSnsmqIj1MI

          • By rfoo 2024-02-1717:431 reply

            GP's point is you can cache the state after the model processed the super long context but before it ingests your prompt.

            If you are going to ask "then why don't OpenAI do it now", the answer is that it takes a lot of storage (and IO), so it may not be worth it for shorter contexts; it adds significant complexity to the entire serving stack; and it is incoherent with how OpenAI originally imagined the "custom-ish" LLM serving game playing out: they bet on finetuning and dedicated instances instead of long context.

            The tradeoff can be reflected in the API and pricing, LLM APIs don't have to be like OpenAI's. What if you have an endpoint to generate a "cache" of your context (or really, a prefix of your prompt), billed as usual per token, then you can use your prompt prefix for a fixed price no matter how long it is?
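
            The proposed endpoint could look roughly like this: a cache keyed by a hash of the prefix, with a stub standing in for the model's expensive prefill pass. Everything here is hypothetical; it sketches the billing idea, not any real API:

```python
import hashlib

PREFILL_CALLS = 0
cache = {}

def expensive_prefill(context):
    # Stub for the model's forward pass over the long context, which on
    # real hardware would produce a reusable KV cache.
    global PREFILL_CALLS
    PREFILL_CALLS += 1
    return f"<state for {len(context)} chars>"

def get_prefix_state(context):
    # Bill for prefill (per token) only on a cache miss; later prompts
    # that share the same prefix reuse the stored state.
    key = hashlib.sha256(context.encode()).hexdigest()
    if key not in cache:
        cache[key] = expensive_prefill(context)
    return cache[key]

doc = "pretend this is a very long trial transcript " * 1000
get_prefix_state(doc)  # first query pays the full prefill cost
get_prefix_state(doc)  # follow-up query is a cache hit
print(PREFILL_CALLS)   # 1
```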

            • By localhost 2024-02-1718:20

              Do you have examples of where this has been done? Based on my understanding you can do things like cache the embeddings to avoid the tokenization/embedding cost, but you will still need to do a forward pass through the model with the new user prompt and the cached context. That is where the naive O(N^2) complexity comes from and that is the cost that cannot be avoided (because the whole point is to present the next user prompt to the model along with the cached context).

    • By joshsabol46 2024-02-1521:54

      > The 10M context ability wipes out most RAG stack complexity immediately.

      RAG is needed for the same reason you don't `SELECT *` all of your queries.

    • By renonce 2024-02-1518:481 reply

      > They don't talk about how they get to 10M token context

      I don't know how either but maybe https://news.ycombinator.com/item?id=39367141

      Anyway I mean, there is plenty of public research on this so it's probably just a matter of time for everyone else to catch up

      • By albertzeyer 2024-02-1519:341 reply

        Why do you think this specific variant (RingAttention)? There are so many different variants for this.

        As far as I know, the problem in most cases is that while the context length might be high in theory, the actual ability to use it is still limited. E.g. recurrent networks even have infinite context, but they actually only use 10-20 frames as context (longer only in very specific settings; or maybe if you scale them up).

        • By renonce 2024-02-16 2:09

          There are ways to test the neural network’s ability to recall from a very long sequence. For example, if you insert a random sentence like “X is Sam Altman” somewhere in the text, will the model be able to answer the question “Who is X?”, or maybe somewhat indirectly “Who is X (in another language)?”, “Which sentence was inserted out of context?”, or “Which celebrity was mentioned in the text?”

          Anyway, the ability to generalize to longer context lengths is evidenced by such tests. If every token of the model’s output can answer questions in such a way that any sentence from the input is taken into account, this gives evidence that the full context window indeed matters. Currently I find Claude 2 to perform very well on such tasks, so that sets my expectation of what a language model with an extremely long context window should look like.
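The probe described above is easy to script; a minimal needle-in-a-haystack harness (the filler sentences, needle, and question are all made-up stand-ins) just plants the sentence at a chosen depth and builds the prompt:

```python
def make_needle_prompt(haystack_sentences, needle, question, depth=0.5):
    """Insert `needle` at a relative depth into the haystack, then append the question."""
    sents = list(haystack_sentences)
    sents.insert(int(len(sents) * depth), needle)
    return " ".join(sents) + f"\n\nQuestion: {question}\nAnswer:"

filler = [f"Sentence number {i} is ordinary filler text." for i in range(200)]
prompt = make_needle_prompt(filler, "X is Sam Altman.", "Who is X?", depth=0.75)

assert "X is Sam Altman." in prompt
# The needle lands after sentence 149, i.e. three quarters of the way in.
assert prompt.index("X is Sam Altman.") > prompt.index("Sentence number 149 ")
```

A model with genuine long-context recall should answer "Sam Altman" regardless of where `depth` places the needle; sweeping `depth` over many positions is exactly how the published needle-in-a-haystack plots are generated.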

    • By bschne 2024-02-16 0:09

      > The 10M context ability wipes out most RAG stack complexity immediately.

      1. People mention accuracy issues with longer contexts.

      2. People mention processing time issues with longer contexts.

      3. Something people haven't mentioned in this thread is cost: even though prompt tokens are usually cheaper than generated tokens, and Gemini seems to be cheaper than GPT-4, putting a whole knowledge base or 80-page document in the context is going to make every run of that prompt quite expensive.
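A back-of-envelope sketch of point 3, using a made-up prompt price (real Gemini and GPT-4 rates differ and change often):

```python
# Hypothetical price; substitute the provider's real prompt-token rate.
PRICE_PER_1K_PROMPT_TOKENS = 0.01  # dollars

full_context_tokens = 1_000_000  # whole knowledge base stuffed into every prompt
rag_context_tokens = 4_000       # only the top-k retrieved chunks

cost_full = full_context_tokens / 1000 * PRICE_PER_1K_PROMPT_TOKENS
cost_rag = rag_context_tokens / 1000 * PRICE_PER_1K_PROMPT_TOKENS

print(f"full-context prompt: ${cost_full:.2f} per query")  # $10.00 per query
print(f"RAG prompt: ${cost_rag:.4f} per query")            # $0.0400 per query
```

At these illustrative numbers the full-context query costs 250x the retrieval-augmented one, every single time the prompt runs.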

    • By DebtDeflation 2024-02-16 10:56

      >The 10M context ability wipes out most RAG stack complexity immediately

      From a technology standpoint, maybe. From an economics standpoint, it seems like it would be quite expensive to jam the entire corpus into every single prompt.

    • By tomaskafka 2024-02-16 20:36

      "I really want to know how they're getting to 10M context, though."

      My $5 says it's a RAG or a similar technique (hierarchical RAG comes to mind), just like all other large context LLMs.

    • By outside1234 2024-02-15 20:30

      It takes 60 seconds to process all of that context in their three.js demo, which is, I will say, not super interactive. So there is still room for RAG and other faster alternatives to narrow the context.

    • By lqcfcjx 2024-02-16 1:44

      This might be a stupid question - even if there's no quality degradation from 10M context, will inference be extremely slow?

    • By Havoc 2024-02-16 2:22

      >3. The 10M context ability wipes out most RAG stack complexity immediately.

      I'd imagine RAG would still be much more efficient computationally

    • By kylerush 2024-02-15 18:15

      I assume using this large of a context window instead of RAG would mean the consumption of many orders of magnitude more GPU.

    • By zitterbewegung 2024-02-15 17:47

      RAG doesn’t go away at 10 Million tokens if you do esoteric sources like shodan API queries.

    • By karmasimida 2024-02-15 18:28 (2 replies)

      Even 1m tokens eliminate the need for RAG, unless it is for cost.

      • By 7734128 2024-02-15 21:41 (1 reply)

        1 million might sound like a lot, but it's only a few megabytes. I would want RAG, somehow, to be able to process gigabytes or terabytes of material in a streaming fashion.

        • By karmasimida 2024-02-15 22:27 (1 reply)

          RAG will not change how many tokens an LLM can produce at once.

          Longer context, on the other hand, could put some RAG use cases to sleep: if your instructions are literally as long as a manual, then there is no need for RAG.

          • By 7734128 2024-02-16 7:36

            I think RAG could be used to do that. If you have a one-time retrieval at the beginning, basically amending the prompt, then I agree with you. But there are projects (a classmate is doing his master's thesis on one implementation of this) that retrieve once every few tokens and make the retrieved information available to the generation somehow. That would not take a toll on the context window.

      • By sroussey 2024-02-15 18:40

        Or accuracy

    • By jorvi 2024-02-15 17:13 (1 reply)

      I just hope at some point we get access to mostly uncensored models. Both GPT-4 and Gemini are extremely shackled, and a slightly inferior model that hasn’t been hobbled by a very restricting preprompt would handily outperform them.

      • By ShamelessC 2024-02-15 19:39

        You can customize the system prompt with ChatGPT or via the completions API, just fyi.
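For reference, the system prompt is simply the first message in the Chat Completions payload. A minimal sketch (model name and prompt text are illustrative; the call itself is commented out since it needs an API key):

```python
messages = [
    {"role": "system", "content": "You are a terse assistant. Answer plainly."},
    {"role": "user", "content": "Summarize long-context tradeoffs in one line."},
]

# from openai import OpenAI
# reply = OpenAI().chat.completions.create(model="gpt-4", messages=messages)
# print(reply.choices[0].message.content)
```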

    • By oblio 2024-02-16 3:32 (2 replies)

      What's RAG?

      • By ohmyiv 2024-02-16 5:48 (1 reply)

        Retrieval Augmented Generation. In basic terms, it optimizes output of LLMs by using additional external data sources before answering queries. (That actually might be too basic of a description)

        Here:

        https://blogs.nvidia.com/blog/what-is-retrieval-augmented-ge...

        • By ssd532 2024-02-16 19:15 (1 reply)

          Is it the same as embedding? Is embedding a RAG method?

          • By scotty79 2024-02-17 1:28

            I don't think so. I think embedding is just converting a token string into its numeric representation. Numeric representations of semantically similar token strings are close geometrically.

            RAG is training AI to be a guy who read a lot of books. He doesn't know all of them in the context of this conversation you are having with him, but he sort of remembers where he read about the thing you are talking about and he has a library behind him into which he can reach and cite what he read verbatim thus introducing it into the context of your conversation.

            I might be wrong though. I'm a newb.

      • By girvo 2024-02-16 5:15

        Retrieval augmented generation.

        > Retrieval Augmented Generation (RAG) is a technique where the capabilities of a large language model (LLM) are augmented by retrieving information from other systems and inserting them into the LLM’s context window via a prompt.

        (stolen from: https://github.com/psychic-api/rag-stack)
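To make the definition concrete, here is a toy retrieve-then-prompt loop. The bag-of-words "embedding" and the chunk texts are purely illustrative; production RAG stacks use neural embedding models and a vector store:

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; real stacks use a neural embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "The refund policy allows returns within 30 days of purchase.",
    "Our office is located in Mountain View, California.",
    "A refund is issued to the original payment method.",
]
top = retrieve("How do I get a refund?", chunks)
prompt = "Context:\n" + "\n".join(top) + "\n\nQuestion: How do I get a refund?"
```

The Mountain View chunk stays out of the prompt because it scores lowest against the query, which is the whole point: only relevant context spends tokens.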

    • By aubanel 2024-02-15 21:00

      > They are pretty clear that 1.5 Pro is better than GPT-4 in general, and therefore we have a new LLM-as-judge leader, which is pretty interesting

      I fully disagree. They compare Gemini 1.5 Pro and GPT-4 only on context length; on other tasks they compare it only to other Gemini models, which is a strange self-own.

      I'm convinced that if they do not show the results against GPT4/Claude, it is because they do not look good.

    • By shostack 2024-02-16 17:24

      Wake me when I can get access without handing over my texts and contacts. I opened the Gemini app on Android and that onerous privacy policy was the first experience. Worse, I couldn't seem to get past accepting Google's ability to hoover up my data in order to later disable it in the settings, so I just gave up and went back to ChatGPT, where I at least generally have control over the data I give it.

    • By a_vanderbilt 2024-02-15 20:12

      After their giant fib with the Gemini video a few weeks back, I'm not believing anything until I see it used by actual people. I hope it's that much better than GPT-4, but I'm not holding my breath that there isn't an asterisk or trick hiding somewhere.

    • By tbruckner 2024-02-15 16:49

      How do you know it isn't RAG?

    • By qwerty_clicks 2024-02-15 20:06 (1 reply)

      FYI, MM is the standard for million: 10MM, not 10M. I'm reading all these comments confused as heck about why you are excited about 10M tokens.

      • By MichaelNolan 2024-02-15 23:38

        Maybe for accountants, but for everyone else a single M is much more common.

  • By scarmig 2024-02-15 16:46

    One interesting tidbit from the technical report:

    >HumanEval is an industry standard open-source evaluation benchmark (Chen et al., 2021), but we found controlling for accidental leakage on webpages and open-source code repositories to be a non-trivial task, even with conservative filtering heuristics. An analysis of the test data leakage of Gemini 1.0 Ultra showed that continued pretraining on a dataset containing even a single epoch of the test split for HumanEval boosted scores from 74.4% to 89.0%, highlighting the danger of data contamination. We found that this sharp increase persisted even when examples were embedded in extraneous formats (e.g. JSON, HTML). We invite researchers assessing coding abilities of these models head-to-head to always maintain a small set of truly held-out test functions that are written in-house, thereby minimizing the risk of leakage. The Natural2Code benchmark, which we announced and used in the evaluation of Gemini 1.0 series of models, was created to fill this gap. It follows the exact same format of HumanEval but with a different set of prompts and tests.

  • By alphabetting 2024-02-15 15:26 (8 replies)

    Massive whoa if true from technical report

    "Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens"

    https://storage.googleapis.com/deepmind-media/gemini/gemini_...

    • By Workaccount2 2024-02-15 15:48 (1 reply)

      10M tokens is absolutely jaw dropping. For reference, this is approximately thirty books of 500 pages each.

      Having 99% retrieval is nuts too. Models tend to unwind pretty badly as the context (tokens) grows.

      Put these together and you are getting into the territory of dumping all your company's documents, or all your department's documents, into a single GPT (or whatever Google will call it) and everyone working with that. Wild.

      • By kranke155 2024-02-15 16:22 (1 reply)

        Seems like Google caught up. Demis is again showing an incredible ability to lead a team to make groundbreaking work.

        • By huytersd 2024-02-15 17:15 (1 reply)

          If any of this is remotely true, not only did it catch up, it's wiping the floor with GPT-4 in terms of usefulness. Not going to make a judgement until I can actually try it out, though.

          • By singularity2001 2024-02-15 18:22 (3 replies)

            In the demo videos Gemini needs about a minute to answer long-context questions. Which is better than reading thousands of pages yourself. But if it has to compete with classical search and skimming, it might need some optimization.

            • By a_wild_dandan 2024-02-15 21:19 (1 reply)

              Replacing grep or `ctrl+F` with Gemini would be the user's fault, not Gemini's. If classical search is already a performant solution for a job, use classical search. Save your tokens for jobs worthy of solving with a general intelligence!

              • By Breza 2024-02-20 19:18

                I think some of the most useful apps will involve combining this level of AI with traditional algorithms. I've written lots of code using the OpenAI APIs and I look forward to seeing what can be done here. If you type, "How has management's approach to comp changed over the past five years?" it would be neat to see an app generate the greps needed to find the appropriate documents and then feed them back into the LLM to answer the question.
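A sketch of that pipeline's retrieval half; the document names are made up, and the hard-coded regex stands in for one an LLM would generate from the user's question:

```python
import re

def grep_docs(pattern, docs):
    """Return the docs whose text matches the (LLM-generated) regex pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return {name: text for name, text in docs.items() if rx.search(text)}

docs = {
    "2019_10k.txt": "Executive compensation rose 4% this year.",
    "2023_10k.txt": "Management shifted compensation toward equity grants.",
    "press_release.txt": "We launched a new product line.",
}

# In the pipeline described above, an LLM would emit `pattern` from the
# user's question; here it is hard-coded for illustration.
pattern = r"compensat\w*"
hits = grep_docs(pattern, docs)
context = "\n".join(hits.values())  # would be fed back into the LLM
```

Only the two matching filings reach the model's context, so the expensive LLM call runs over kilobytes instead of the whole corpus.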

            • By huytersd 2024-02-15 19:27

              That’s a compute problem, something that involves just throwing money at the problem.

            • By IanCal 2024-02-16 10:53

              If you had this for your business could this approach be faster than RAG?

              Input is parsed one token at a time right? Can you cache the state after the initial prompt has been provided?

    • By famouswaffles 2024-02-15 16:16 (1 reply)

      Another whoa for me

      >Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person learning from the same content.

      Results - https://imgur.com/a/qXcVNOM

      • By usaar333 2024-02-15 16:49 (1 reply)

        I think this is mostly due to the ability to handle long context lengths better. Note how Claude 2.1 already highly outperforms GPT-4 on this task.

        • By a_wild_dandan 2024-02-15 21:23

          GPT-4V turbo outperforms Claude on long contexts, IIRC. Unless that's mistaken, I'd suspect a different explanation for that task.

    • By cchance 2024-02-15 17:36

      Did you watch the demo of Gemini 1.5's recall after it processed the 44-minute video... holy shit

    • By megaman821 2024-02-15 15:33 (5 replies)

      So, will this outperform any RAG approach as long as the data fits inside the context window?

      • By CuriouslyC 2024-02-15 16:02

        A perfect RAG system would probably outperform everything in a larger context due to prompt dilution, but in the real world putting everything in context will win a lot of the time. The large context system will also almost certainly be more usable due to elimination of retrieval latency. The large context system might lose on price/performance though.

      • By TheGeminon 2024-02-15 15:40 (1 reply)

        Whether it outperforms depends on the RAG approach (and this would be a RAG approach anyway; you can already do this with smaller context sizes). A simplistic one, probably, but dumping in data that you don't need dilutes the useful information, so I would imagine there would be at least _some_ degradation.

        But there is also the downside that by "tuning" the RAG to return fewer tokens, you will miss extra context that could be useful to the model.

        • By megaman821 2024-02-15 15:48

          Doesn't their needle/haystack benchmark seem to suggest there is almost no dilution? They pushed that demo out to 10M tokens.

      • By ArcaneMoose 2024-02-15 15:37

        Cost would still be a big concern

      • By chasd00 2024-02-15 21:22

        are you going to upload 10M tokens to Gemini on every request? That's a lot of data moving around when the user is expecting a near realtime response. Seems like it would still be better to only set the context with information relevant to the user's prompt which is what plain rag does.

      • By saliagato 2024-02-15 15:38 (1 reply)

        Basically, yes. Pinecone? Dead. Azure AI Search? Dead. Qdrant? Dead.

        • By _boffin_ 2024-02-15 15:41

          Prompt token cost still a variable.

    • By matsemann 2024-02-15 15:58 (2 replies)

      Could you (or someone) explain what this means?

      • By ehsankia 2024-02-15 19:38 (2 replies)

        It's how much text the model can consider at a time when generating a response: basically the size of the prompt. A token is not quite a word, but you can think of it as roughly that. Previously, the best most LLMs could do was around 32K. This new model does 1M, and in testing they could push it to 10M with near-perfect retrieval.

        As the other comment mentions, you can paste the content of entire books or documents and ask very pointed question about it. Last year, Anthropic was showing off their 100K context window, and that's exactly what they did, they gave it the content of The Great Gatsby and asked it questions about specific lines of the book.

        Similarly, imagine giving it hundreds of documents and asking it to spot some specific detail in there.
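As a rough yardstick (token-to-word ratios vary by tokenizer; ~0.75 English words per token is a common approximation, and the word counts below are commonly cited round figures, not exact):

```python
WORDS_PER_TOKEN = 0.75  # rough English average; tokenizer-dependent

def tokens_for_words(words):
    """Estimate how many tokens a text of the given word count needs."""
    return int(words / WORDS_PER_TOKEN)

for title, words in {
    "The Great Gatsby": 47_000,
    "A 500-page novel": 150_000,
    "The LOTR trilogy": 480_000,
}.items():
    print(f"{title}: ~{tokens_for_words(words):,} tokens")
```

By this estimate the whole trilogy needs about 640K tokens, so a 1M-token window holds it with room to spare.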

        • By Breza 2024-02-20 19:20

          Great explanation. I was amazed when I started using Claude because I could find a recently-transcribed novella, upload it, and ask specific questions. I'm downright giddy to try a 1M+ model.

        • By liamYC 2024-02-15 23:31

          Awesome explanation, thanks for the comparison

      • By FergusArgyll 2024-02-15 17:36 (3 replies)

        The input you give it can be very long. This can qualitatively change the experience. Imagine, for example, copy pasting the entire lord of the rings plus another 100 books you like and asking it to write a similar book...

        • By HarHarVeryFunny 2024-02-15 18:07 (1 reply)

          I just googled it, and the LOTR trilogy apparently has a total of 480,000 words, which brings home how huge 1M is! It'd be fascinating to see how well Gemini could summarize the plot or reason about it.

          One point I'm unclear on is how these huge context sizes are implemented by the various models. Are any of them the actual raw "width of the model" that is propagated through it, or are these all hierarchical summarization and chunk embedding index lookup type tricks?

          • By mburns 2024-02-15 21:41

            For another reference, Shakespeare’s complete works are ~885k words.

            The Encyclopedia Britannica is ~44M words.

        • By staticman2 2024-02-15 22:36 (1 reply)

          Reading Lord of the Rings, and writing a quality book in the same style, are almost wholly unrelated tasks. Over 150 million copies of Lord of the Rings have been sold, but few readers are capable of "writing a similar book" in terms of quality. There's no reason to think this would work well.

          • By pfooti 2024-02-16 2:01

            I mean, Terry Brooks did it with the Sword of Shannara. (/s)

        • By teaearlgraycold 2024-02-15 17:54

          I doubt it’s smart enough to write another (coherent, good) book based on 103 books. But you could ask it questions about the books and it would search and synthesize good answers.

    • By stavros 2024-02-15 15:31 (2 replies)

      Until I can talk to it, I care exactly zero.

      • By peterisza 2024-02-15 16:25 (1 reply)

        you can buy their stock if you think they'll make a lot of money with their tech

        • By HarHarVeryFunny 2024-02-15 17:02 (1 reply)

          Well, that's really the right question... what can, and will, Google do with this that can move their corporate earnings needle in a meaningful way? Obviously they can sell API access and integrate it into their Google Docs suite, as well as their new Project IDX IDE, but do any of these have the potential to make a meaningful impact?

          It's also not obvious how these huge models will fare against increasingly capable open source ones like Mixtral, perhaps especially since Google are confirming here that MoE is the path forward, which perhaps helps limit how big these models need to be.

          • By plaidfuji 2024-02-15 17:55

            In the long run it could move the needle in enterprise market share of Workspace and GCP. They have a lot of room to grow and IMO have a far superior product to O365/Azure, an edge that could be amplified by strong AI products. The only problem is that this sales cycle can take a decade or more, and Google hasn't historically been patient or strategic about things like this.

HackerNews