How is ChatGPT's behavior changing over time?

2023-07-19 1:06 · arxiv.org


Abstract: GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four diverse tasks: 1) solving math problems, 2) answering sensitive/dangerous questions, 3) generating code and 4) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%) but GPT-4 (June 2023) was very poor on these same questions (accuracy 2.4%). Interestingly, GPT-3.5 (June 2023) was much better than GPT-3.5 (March 2023) on this task. GPT-4 was less willing to answer sensitive questions in June than in March, and both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March. Overall, our findings show that the behavior of the same LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLM quality.
From: Lingjiao Chen [view email]
[v1] Tue, 18 Jul 2023 06:56:08 UTC (536 KB)


Comments

  • By dudeinhawaii 2023-07-192:499 reply

    I think we should stop trying to quiz LLMs on mathematics, something they are explicitly not designed for, given their tokenized view of the world. Ask GPT-4 to use its Wolfram plugin and it returns the answers quickly and correctly.

    Second, I think the code generation bit of this paper is blown out of proportion. The code can't be immediately injected into a codebase due to a formatting change (triple quotes). I'd be more interested in changes to the quality and performance of the code generated, not whether it can be easily copy/pasted from the page.

    I could not replicate that result (but could replicate the other math issues). I've also never seen the triple quotes in results so it's unclear if there was a temporary presentation bug.

    • By xp84 2023-07-194:5211 reply

      Seriously. GPT doing math is like using a 737 to drive around on the ground, or like having the phone number of a prominent astrophysicist and calling him to do long division for you. Wtf is the point? We have computers to do every math problem. It's a waste of energy to use LLMs for it, in my opinion.

      • By jstummbillig 2023-07-1910:591 reply

        "We" are never going to stop trying to use LLMs for math. They are obviously mimicking a smart person. What do you do with smart people? You query them with bad and lazy questions (often without being too honest with yourself about that), and hope for or expect helpful answers.

        Because that is how it works. The limiting factor so far has been the smart person's time and patience. Now, no longer.

        People moderating their LLM usage is never happening, from here on out until the end of civilisation. Any LLM service that is designed around that is done. You need to make lazy questions efficient. People do not care how complicated your SQL query is and they never will. People will not give up on energy, meat, or cars as long as they feel they are giving something up.

        People will never think twice to not make your LLM think twice.

        If it seems useful and convenient, people will use it. If it's not giving good answers to lazy questions out of the box, they will go to the thing that does.

        • By hammock 2023-07-1912:072 reply

          >"We" are never going to stop trying to use LLMs for math. They are obviously mimicking a smart person.

          A lot of words to make a big deal out of nothing. All that is needed is some new abstraction layer that identifies a math question and then proxies it over to the Wolfram plugin. That's it.

          We don’t have crazy debates over whether a polygon should be rendered by the CPU or the GPU. We solved this problem.
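
          The dispatch layer described above can be pictured as a naive sketch (hypothetical names; a real router would be a learned classifier, not a regex):

```python
import re

def route(query: str) -> str:
    # hypothetical dispatch layer: anything that looks like pure
    # arithmetic goes to a math tool, everything else to the LLM
    if re.fullmatch(r"[0-9\s.,+\-*/()]+", query.strip()):
        return "math_tool"
    return "llm"

print(route("100,000 + 987 - 1444"))  # math_tool
print(route("write me a haiku"))      # llm
```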

          • By zeusk 2023-07-1912:211 reply

            So a Mixture of Experts model; but isn't ChatGPT already using that? Why are the models so bad at math despite clearly being trained on academic papers and books? And why do they hallucinate and make up non-existent citations?

            • By edgyquant 2023-07-1913:14

              Because they don’t actually reason about things; they just map words based on probability. If they’ve seen a math problem often enough they may get it right by probability, but using that knowledge to solve a new problem isn’t likely.

          • By klausa 2023-07-1914:50

            I have bad news about how "solved" CPU/GPU rendering split is.

      • By SanderNL 2023-07-199:083 reply

        It’s not about the results, it’s about its ability to “reason”. Math is about as close to pure reasoning as we get, so I don’t get the pessimism.

        If it is bad at math and can’t be taught, then you have a fundamental problem. It’s a matter of time before this limit gets hit in other domains.

        • By wizofaus 2023-07-1910:211 reply

          This seems fair enough to me. That ChatGPT currently struggles with certain types of maths problems points to reasonably fundamental shortcomings in what otherwise appears to have the beginnings of a general purpose reasoning engine (whether you consider it AGI or not), and I'm willing to bet extraordinarily clever minds are working hard on trying to address those shortcomings.

          • By firewolf34 2023-07-1914:03

            I don't think that trying to shoehorn LLM into being AGI is the right path. Maybe some are trying to achieve this, or are hoping for this... but IMO this is trying to fit a square peg into a round hole. It's definitely an element in the overall puzzle, I think we can all agree, though. Even so, rather than warping this hammer to also screw in a bolt, why not combine it with another tool more fit for the job?

        • By orange_fritter 2023-07-2022:58

          Your thinking is very emotional and shows your inability to understand that "conversational output" is basically an accidental side-effect of a complex tool that just replicates our speech without understanding what it is saying.

          Through continuous use, I have found that it does not "reason". That doesn't mean it's not valuable in many ways, and I have found it to be very helpful in a multitude of diverse applications, including helping me reflect on my own life through my own interpretations of its output. It's also a great interface for JSTOR, wikipedia, and basically any language learning.

          I'm having a hard time with the jump from "this must be a calculator" to "this must be a philosopher" for the tool to be useful. When did we ever have those requirements for a tool?

          This tool is just not made for math. Most of its logic processing abilities seem to surpass mine if I am only given 5 minutes to understand a problem. If you understand the tool, you will get the most out of it. Stop anthropomorphizing it, and stop pretending that it can't generate both highly beneficial or highly harmful content simply because it doesn't have a soul/d*ck or whatever.

        • By FeepingCreature 2023-07-1910:152 reply

          In my experience, it can reason usably well but acts like it has dyscalculia. It'll set up a proper algorithm, step through it and trip over digits.

          • By throwaway4aday 2023-07-1915:19

            Yes and the reason it trips over digits is because those digits wind up being tokenized in ways that seem unexpected to us and would produce the same difficulty if we were presented with them:

            100,000 + 987 - 1444 * 25,945.842 / 0.0042

            becomes

            "100" (one hundred), "," (comma), "000" (triple zero), " +" (space plus), " 9" (space nine), "87" (eighty-seven), " -" (space minus), " 14" (space fourteen), "44" (forty-four), " *" (space times), " 25" (space twenty-five), "," (comma), "9" (nine), "45" (forty-five), "." (period), "8" (eight), "42" (forty-two), " /" (space divide), " 0" (space zero), "." (period), "00" (double zero), "42" (forty-two)

            Now imagine someone reading that to you over the phone once and asking you to do the math in your head and you aren't allowed to use paper and pencil and you have to get it right the first time.
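
            The contrast with per-digit tokenization can be sketched with a toy tokenizer (a crude stand-in only; real BPE merges are learned and do not follow this regex):

```python
import re

def bpe_like(s: str) -> list:
    # crude stand-in for GPT-style BPE: digits end up grouped into
    # multi-character chunks that ignore place value
    return re.findall(r"\d{1,3}|.", s)

def per_digit(s: str) -> list:
    # LLaMA-style special-casing: every digit is its own token
    return list(s)

print(bpe_like("25,945.842"))   # ['25', ',', '945', '.', '842']
print(per_digit("25,945.842"))  # ['2', '5', ',', '9', '4', '5', '.', '8', '4', '2']
```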

          • By hutzlibu 2023-07-1912:331 reply

            "It'll set up a proper algorithm, step through it and trip over digits."

            Or it is just pretending to do so. And since it pretends, of course it trips over all small things as it does not understand them.

            • By FeepingCreature 2023-07-1914:321 reply

              I don't think you can pretend to properly describe and evaluate an algorithm, any more than you can pretend to solve a riddle - the answer is either right or it isn't.

              And in this case, the shape of the answer is often right; it just makes ... ordinary errors. Ironically, the AI is a lot better at high-level thinking than correct calculation.

              • By hutzlibu 2023-07-1916:38

                "Ironically, the AI is a lot better at high-level thinking than correct calculation."

                That would actually be a quite human-like feature... except I do not consider what LLMs are doing to be thinking.

      • By remus 2023-07-1911:022 reply

        > GPT doing math is like using a 737 to drive around on the ground, or if you had the phone number of a prominent astrophysicist and you call him to do long division for you.

        I don't think this is a great analogy. If your 737 couldn't drive on the ground, or your astrophysicist couldn't answer basic maths questions, I wouldn't want to fly in that plane or put much faith in the astrophysicist's answers to more complex questions.

        Maybe maths is not a particular strength of LLMs, but asking questions where it is easy to judge the factual accuracy of the responses seems a pretty reasonable test to be running.

        • By brushfoot 2023-07-1911:311 reply

          > asking questions where it is easy to judge the factual accuracy of the responses seems a pretty reasonable test to be running.

          It isn't reasonable if that isn't what the system was designed to do.

          It would be a poor test of my general practitioner's competence to ask him calculus questions and conclude he doesn't know what he's talking about because he can't answer them.

          • By fragmede 2023-07-1917:52

            Given that a drug's concentration in your bloodstream is absolutely critical for prescribing them, and considering that this is calculated using Calculus, you'd better hope your GP does actually know their Calculus!

        • By xp84 2023-07-1922:49

          My point was that a 737 can be a land vehicle, but a very bad one, because it's optimized (to an extreme degree) for flying. I fly on 737s all the time, knowing they have terrible stopping distances and cornering. The 0-60 could be decent, but you can't really accelerate all-out to try it, or you'll overshoot, end up going 175, and crash.

          The astrophysicist can do long division in his head, but he'll be about as fast and accurate as the next person, because he doesn't practice arithmetic every day.

          I agree with the commenter somewhere in my thread who said all an LLM should be optimized for is to classify the type of problem and feed it into a purpose-built, deterministic solver that is trained to interpret math as math and not as language, be it ML-based or algorithmic.

      • By sheepscreek 2023-07-195:061 reply

        A better example would be calling your guitar teacher for help with a statistics problem.

        • By ben_w 2023-07-197:52

          Possibly more like instantiating the statistics problem by getting, say, every 3 members of an orchestra to represent a triplet of doors in the Monty Hall problem and then making a ball-park guess what the results are from a seat in the middle of the back row.

      • By theelous3 2023-07-198:181 reply

        The point of doing $non-llm-optimal-thing with an LLM is the hope that it lets you skip formal syntaxes, which are mentally taxing.

        It's far easier even for an expert to communicate what they want in natural language than it is in a formal syntax for all but the most trivial things.

        It should be a goal of these tools to do this correctly.

        • By xp84 2023-07-1922:521 reply

          > the hope that it let's you skip on formal syntaxes, which are mentally taxing

          Agree completely, but the LLM should then be focused strictly on formulating an "execution plan" of sorts and handing it off, not on performing the math itself.

          In other words, when asked "If I have 349 blueberries and one blueberry turns into a cherry per hour, how many of each fruit will I have in 93,478 minutes?" it shouldn't be doing the actual arithmetic, but it should be figuring out what arithmetic needs to be done.
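
          The blueberry question reduces to exactly this plan-then-compute split. A sketch of the deterministic solver the LLM would hand off to (hypothetical helper name):

```python
def fruit_after(blueberries: int, minutes: int, per_hour: int = 1) -> tuple:
    # the arithmetic the LLM should plan but not perform itself
    hours = minutes // 60                        # whole hours elapsed
    turned = min(blueberries, hours * per_hour)  # can't turn more than we have
    return blueberries - turned, turned          # (blueberries left, cherries)

print(fruit_after(349, 93478))  # (0, 349): every berry turned long ago
```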

          • By orange_fritter 2023-07-217:50

            This is my second ChatGPT comment so I hate to sound like an evangelist, but it basically does this, reproducibly:

            You describe a complex relationship between several nodes (people, cities, etc.), and then ask ChatGPT to draw the relationship as a graph data structure. It will create a formula that Mathematica can render, and then send the formula to Mathematica before presenting an ASCII drawing. Usually. Sometimes it just complains that it can't draw and explains what the graph looks like in a formal, human-readable syntax.

            In other words, yes it just summarizes the problem, converts it to a formal syntax, and sends it off to some other tool.

      • By Closi 2023-07-199:402 reply

        > Seriously. GPT doing math is like using a 737 to drive around on the ground, or if you had the phone number of a prominent astrophysicist and you call him to do long division for you. Wtf is the point. We have computer things to do every math problem. It’s a waste of energy to use LLMs for it in my opinion.

        Using GPT to do maths is probably like using a 737 to drive around on the ground.

        Teaching GPT to do maths might be like teaching a child the times tables - a skill that can help overall reasoning.

        • By viraptor 2023-07-1913:511 reply

          Well that's... a claim. What's your reasoning for operations on digits within the model itself improving any current metrics?

          • By Closi 2023-07-1916:19

            Well I didn't really make a bold claim, I said "might", but presumably there is a reason why we teach kids basic mental maths before we teach them how to use a calculator.

            Humans are different to an AI, but putting that aside, my intuition would be that if we never taught kids any mental maths, their concept/understanding of numbers would be fundamentally different to how it is if they learn that 9 x 9 = 81 (also look at how your fingers move - there is a relationship there!).

            But who knows, AI is strange and there's lots of stuff that needs to be experimented with. I would think training an intuitive sense of numbers would have other fallout too, though. This is half the beauty of LLMs, right? You show an LLM some history books and it also learns about biology, politics, grammar, etymology and love. You teach an LLM maths and it also learns ... ?

        • By marci 2023-07-1910:22

          Apparently, models finetuned for coding are better at logical inquiries than those that are not.

      • By latexr 2023-07-1910:06

        > GPT doing math is like using a 737 to drive around on the ground, or if you had the phone number of a prominent astrophysicist and you call him to do long division for you.

        One significant difference is that in both of those examples it is (or quickly becomes) plain why it’s a ridiculous idea. Even if you don’t understand it yourself, you’ll get external feedback fast. Not so with LLMs, where even people with technical needs may fail to see what is or isn’t a good use of the tool. Case in point: https://news.ycombinator.com/item?id=36782446

        “You’re holding it wrong” isn’t a valid argument in perpetuity. At a certain point it becomes the fault of the designer, not the user.

      • By herdcall 2023-07-1915:56

        My guess is that for most folks, being good at math => you're smart, so the most obvious way to test an LLM is to ask it a math question, which is trivial to construct. Another motivation, IMO, is that asking a relatively simple math question and seeing an LLM fall flat makes many feel good about themselves and gives them a chance to show it off on social media.

      • By vicentwu 2023-07-1913:291 reply

        Asking models to do math is kind of an effective way to measure their capabilities, especially reasoning and abstraction, which are quite important for problem solving.

        • By viraptor 2023-07-1913:48

          You don't need reasoning or abstraction to do basic calculation. ChatGPT will, however, happily give you decent answers about not-too-hard math that requires reasoning. It just won't operate on digits.

          Those are completely different ideas.

      • By reportgunner 2023-07-198:54

        What is the point of doing anything if you can't use flashy technologies ? Leave that to the old people. /s

      • By lumost 2023-07-195:327 reply

        I used GPT-4 to generate a non-cryptographic random 64 character string. It was faster to ask GPT-4 for the string than ask GPT-4 for the instructions to generate the string from my terminal. GPT-4 was faster than google.

        • By oefnak 2023-07-197:342 reply

          That definitely won't be random.

          • By taneq 2023-07-1911:32

            It's probably 4, that's pretty random.

          • By lumost 2023-07-1912:28

            Close enough for the purpose at hand!

        • By reportgunner 2023-07-198:561 reply

          Were you in a room that had its walls slowly caving in, like in Indiana Jones, or why does it matter that it was faster?

          • By lumost 2023-07-200:42

            Close! Was in a meeting and needed to see the behavior of a system when passed a string greater than 64 char.

            My mental capacity was used elsewhere - using chatGPT let me answer an important customer question authoritatively.

        • By deafpolygon 2023-07-197:59

          I don't think this will be a truly random string.

        • By josefx 2023-07-199:101 reply

          Wouldn't it take less time to just type one yourself? Like open Notepad, hit the keys like a deranged monkey, select the first 64 characters, done?

          • By Traubenfuchs 2023-07-199:292 reply

            That is very much not random, but generally good enough for 99% of all use cases.

            • By WithinReason 2023-07-1910:48

              It would be more random than an LLM's output

            • By rcxdude 2023-07-1910:061 reply

              I would not bet on whether chatgpt's results would be more or less random than this process.

              • By Traubenfuchs 2023-07-1914:34

                Aren't ChatGPT's results deterministic if you use the same seed, temperature and other parameters?

        • By ThrowawayTestr 2023-07-197:32

          Surely navigating to random.org is faster than typing a prompt.

        • By dgb23 2023-07-197:52

          That’s likely a very good example of the limitations of LLMs.

        • By 0x000xca0xfe 2023-07-199:232 reply

          Obligatory bash solution:

            $ random_bytes() { xxd -plain -c 0 -l "$1" /dev/urandom; }
            $ random_bytes 32
            e6a4a7bbea69a0164cbb66c89f8f528af93c6d2459fd28d2640e2952c031b618
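
          A Python equivalent from the standard library does the same in one call:

```python
import secrets

# 32 random bytes, hex-encoded: 64 characters, cryptographically strong
print(secrets.token_hex(32))
```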

          • By ineedasername 2023-07-1911:14

            >/dev/urandom

            Hmm, I’ve been using /usr/games/fortune

            Is this not a best practice?

          • By gardenhedge 2023-07-1912:08

            I'm glad ChatGPT exists so I don't have to remember any of that

    • By deelly 2023-07-197:181 reply

      No, we definitely should continue to quiz LLMs on mathematics and absolutely any other topic. Otherwise how do we know and understand the limitations of the system?

      • By muzani 2023-07-199:052 reply

        We should also test its capabilities on cooking steak, flying rockets, and making love. Only then will we know if AI can be superior to humans on all things.

        • By falcor84 2023-07-1910:271 reply

          I know you're being facetious, but I'd absolutely be in favor of having AI benchmarks for any and all of these.

          • By muzani 2023-07-1923:39

            I realized that halfway through typing that too. As well as the absurdity of trying to create AIs that exceed a human in all tasks.

            On a serious note, most of these AIs are bad at math but good at writing code for calculators. So what you'll be benchmarking is their ability to create and use tools.

        • By ineedasername 2023-07-1911:211 reply

          Well, if there was an API hooked up to a webcam & 6-dof arm that would be an interesting task. (The steak)

    • By ehnto 2023-07-193:512 reply

      I think knowing whether the code can be used verbatim is actually the more important part, practically speaking. That is the genuinely useful part.

      Quality is important to humans, because humans have to read it, but correctness is what people using ChatGPT for code actually need. So long as the quality and performance is good enough, then it will be useful.

      Performance is such a nuanced topic that you need very context aware devs anyway, and I think a general purpose LLM is never going to have that kind of awareness.

      • By staunton 2023-07-196:211 reply

        > never going to have that kind of awareness

        Be careful with that goalpost, it might make sudden movements.

        • By ineedasername 2023-07-1911:27

          Yes, when ChatGPT went public using 3 I was like, “Hah, yeah, that can’t come close to writing the narrative portions in my work.” Then 4 came along and it was more like, “ooh, a basic first draft in 15 seconds? That I can work with.”

          Now, at least for some projects, I can give it a few rapid bullet points in incomplete sentences and have that first draft in seconds, after which I just need to tweak, add in tables and more detailed stats & results, etc. quite useful.

      • By gonzo41 2023-07-1910:14

        GPT is very useful for the scenario where you have a complex, dialect-specific bit of SQL and need to turn it into an ORM query, such as SQLAlchemy. It gets you about 90% of the way there.

    • By sanxiyn 2023-07-193:132 reply

      Or OpenAI can stop being stupid and adopt LLaMA-like tokenization, which special-cases numbers and tokenizes them into individual digits.

      • By jinay 2023-07-193:23

        Are LLaMA-based models better at math because of this digit-based tokenization?

      • By bugglebeetle 2023-07-193:251 reply

        Isn’t the tokenization tied to the training of the model?

        • By colobas 2023-07-195:16

          The model learns an embedding table, where (roughly) each row is used as the model’s internal representation of each token. The numbers in that table are learned. What isn’t learned is the map from token (i.e. combination of characters/byte-pairs) to row-index in the embedding table. That is given by tokenization

          EDIT: removed redundant bit
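
          A toy version of that split (illustrative numbers only; a real embedding table has tens of thousands of rows learned by gradient descent):

```python
# the tokenizer's map is fixed; the embedding rows are what get learned
vocab = {" 25": 0, ",": 1, "9": 2, "45": 3}                 # token -> row index
table = [[0.1, -0.3], [0.0, 0.7], [0.5, 0.2], [-0.4, 0.9]]  # learned rows

def embed(tokens):
    # look up each token's learned vector by its fixed row index
    return [table[vocab[t]] for t in tokens]

print(embed([" 25", ",", "9", "45"]))  # the model's internal view of " 25,945"
```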

    • By dr_dshiv 2023-07-198:18

      They should stop teaching math to kids for the same reason. Just give them a calculator! Or show us the evidence that math education improves cognitive abilities in other domains.

    • By falcor84 2023-07-1910:25

      I would argue that ChatGPT is actually quite good at "mathematics", in the sense of helping me formulate a problem in an appropriate mathematical structure, and coming up with a good sequence of steps to solve / simplify it. Not perfect by any means, but not bad at all.

      What it is bad at is actually performing the steps accurately, but as others mentioned, that's where Wolfram and/or a code interpreter would come in.

    • By TeMPOraL 2023-07-198:391 reply

      In terms of evaluating LLMs, I'd argue quizzing them on maths is still better than the other thing people keep doing - quizzing them on facts and self-contradicting scenarios, hoping to get them to only either recall information perfectly, or answer with "I don't know".

      I'm not an AI/ML scientist, so I may be way off mark here, but everything I've read so far, and all my experience playing with GPT-3.5 and GPT-4, tell me that comparing performance of an LLM to that of a human is a category error, because the LLM isn't a good analogue of a whole human mind - but it's a very good analogue to human inner voice. The stream of consciousness. The whatever-it-is that surfaces your unconscious/subconscious thought process in form of words and sentences.

      The inner voice is fast, it's reactive. It generates thoughts that match the situation, whether they're correct or factually accurate or not. It's up to the conscious part of your mind to stop, refine, or recycle those thoughts. If you let it keep going, it'll give you thoughts based on what feels like should follow the thoughts that came before. And, unless you habituated responding to anything new with "I don't know" followed by ignoring the topic, the inner voice will start blurting answers to what looks like a question/problem statement; whether or not they'll make any sense, depends on your familiarity with the topic in question.

      Pretty much 1:1 what LLMs do.

      Now, this could all be noise, but I don't think so. I know not everyone has a distinct inner narrative (much like not everyone can visualize things in their mind - I can't), but many (most?) people do. The description of the "inner voice experience" I gave above is something I figured out over a decade ago - before LLMs or even deep learning were a thing, before I knew anything about the NLP beyond recognizing the term "Markov chain" is somehow related. Could my inner narration style be unique? Possibly, but given how advice to avoid connecting your inner voice directly with your vocal apparatus is deeply infused in culture and literature, I strongly suspect this is just how it works.

      All this to say: it is my hypothesis, so far corroborated by experience, that when you start feeding absurd amount of unlabeled text to a transformer model, letting it pick up on the structures encoded within, what you get is a close equivalent to our own inner voice - the part that deals with associations, not logic or data storage. You can't expect it to get good at performing arbitrary computation or recalling data with perfect fidelity, because it's structurally not what it's suited for. For humans, performing arbitrary calculations or perfect recall requires engaging a slower, more algorithmic thinking process (and/or external memory). That part is currently missing in the LLM-based AI systems we're playing with.

      • By Kubuxu 2023-07-1911:12

        I couldn’t agree with you more. Today when we talk with an LLM, it’s like we (sorry for the anthropomorphism) interact with a naked mind, relying on pure immediate recall and its train of thought. I wonder whether the next step will be allowing LLMs to perform inner dialogue (as is done with the thought/action pattern), to stage information for later, and then form the response based on that.

        It will be much more compute intensive, as each response will probably require multiple context windows and distillations.

    • By dontupvoteme 2023-07-197:373 reply

      Could not agree more. It doesn't understand what a number is, so why is everyone trying to quiz it on maths instead of, perhaps, seeing how good it is at language tasks, or even foreign languages? I suspect it has gotten a lot worse in non-English languages since launch.

      • By manojlds 2023-07-199:13

        The paper is talking about the DELTA, though. It used to do well and doesn't now.

      • By gwd 2023-07-198:52

        Why do you think this?

        FWIW I use GPT-4 regularly to explain Koine Greek from the New Testament to me; its ability there certainly hasn't diminished in the last two months.

      • By delusional 2023-07-197:401 reply

        It "understands" numbers exactly as much as it understands words.

        • By vermilingua 2023-07-197:44

          Well, no, because numbers are “words” that represent something that behaves in a very different way from words in a sentence.

    • By tough 2023-07-1913:08

      Triple quotes sound like how you do markdown code blocks to me

      ``` code ```

  • By randomwalker 2023-07-193:591 reply

    This paper is being misinterpreted. The degradations reported are somewhat peculiar to the authors' task selection and evaluation method and can easily result from fine tuning rather than intentionally degrading GPT-4's performance for cost saving reasons.

    They report 2 degradations: code generation & math problems. In both cases, they report a behavior change (likely fine tuning) rather than a capability decrease (possibly intentional degradation). The paper confuses these a bit: they mostly say behavior, including in the title, but the intro says capability in a couple of places.

    Code generation: the change they report is that the newer GPT-4 adds non-code text to its output. They don't evaluate the correctness of the code. They merely check if the code is directly executable. So the newer model's attempt to be more helpful counted against it.
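
    The distinction matters because an executability check along those lines (a rough reconstruction; the authors' actual harness may differ) fails a correct answer that is merely wrapped in fences or prose:

```python
import ast

def directly_executable(reply: str) -> bool:
    # rough sketch of the paper's check: does the raw reply parse as
    # Python, with no stripping of markdown fences or surrounding text?
    try:
        ast.parse(reply)
        return True
    except SyntaxError:
        return False

print(directly_executable("print(1 + 1)"))                  # True
print(directly_executable("```python\nprint(1 + 1)\n```"))  # False: fences fail it
```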

    Math problems (primality checking): to solve this the model needs to do chain of thought. For some weird reason, the newer model doesn't seem to do so when asked to think step by step (but the current ChatGPT-4 does, as you can easily check). The paper doesn't say that the accuracy is worse conditional on doing CoT.
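
    For reference, the ground truth behind the primality questions is trivial to compute deterministically, which is what makes accuracy on them easy to grade:

```python
def is_prime(n: int) -> bool:
    # trial division: exact, if slow for very large n
    if n < 2:
        return False
    f = 2
    while f * f <= n:
        if n % f == 0:
            return False
        f += 1
    return True

print([p for p in range(2, 30) if is_prime(p)])  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```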

    The other two tasks are visual reasoning and answering sensitive questions. On the former, they report a slight improvement. On the latter, they report that the filters are much more effective — unsurprising since we know that OpenAI has been heavily tweaking these.

    In short, everything in the paper is consistent with fine tuning. It is possible that OpenAI is gaslighting everyone by denying that they degraded performance for cost saving purposes — but if so, this paper doesn't provide evidence of it. Still, it's a fascinating study of the unintended consequences of model updates.

    • By AbrahamParangi 2023-07-1912:021 reply

      In my opinion the more likely thing is that OpenAI is gaslighting people that the finetuning is improving the model when it likely mostly improves safety at some cost to capability. I'd bet this is measured against a set of evals and it looks like it performs well BUT I'd also bet the evals are asymmetrically good at detecting "unsafe" or jailbreak behavior and bad at detecting reduced general cognitive flexibility.

      The obvious avenue to degradation is that the "HR personality" is much more strictly applied and the resistance to being jailbroken is also in some sense an inability to think.

      • By kuchenbecker 2023-07-1914:40

        The ability to detect quality is harder than the ability to detect defects, so the obvious metric is improved while the nebulous one is "good enough". They are competing goals.

        This is not necessarily the case, and even if it is, it doesn't imply gaslighting so much as an inability to measure.

  • By swyx 2023-07-193:262 reply

    Previous commentary I know of from OpenAI staff:

    Logan: The API does not just change without us telling you. The models are static there. https://twitter.com/OfficialLoganK/status/166393494793189785... may 31

    Peter: No, we haven't made GPT-4 dumber. Quite the opposite: we make each new version smarter than the previous one. https://twitter.com/jlowin/status/1679660938415177731 july 14

    either the models are static, or they are being improved continuously and there have been unforeseen regressions. only one can be true at any point in time. was this policy changed in the last 1.5 months?

    • By airgapstopgap 2023-07-193:401 reply

      Not really. They have a way of squaring this circle: changing their inference code. Speculative sampling [1] would still make their first claim a lie – sure, there'd still be the original GPT-4 model, plus a smaller draft worker. But early-exit decoding [2] lets you get almost-as-good results much more cheaply from exactly the same checkpoint. We know this line of research on large-scale inference is going strong [3], so it stands to reason that OpenAI, with their wealth of talent focused on GPT-4 throughput and inference [4], large contexts, and an aggressive pricing policy, would also develop something like that. And of course it's "smarter" that way – in a very deceptive sense of the word.

      1. https://arxiv.org/abs/2302.01318

      2. https://arxiv.org/abs/2207.07061

      3. https://arxiv.org/abs/2307.02628

      4. https://openai.com/contributions/gpt-4
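
For intuition on why speculative sampling alone wouldn't degrade quality: a cheap draft model proposes several tokens, the target model verifies them, and only tokens the target agrees with are kept, so the output matches plain target-model decoding. A greedy toy sketch (the "models" here are stand-in functions, not real LLMs):

```python
from typing import Callable, List

Token = str
Model = Callable[[List[Token]], Token]  # prefix -> next token (greedy)

def speculative_decode(target: Model, draft: Model,
                       prompt: List[Token], k: int, steps: int) -> List[Token]:
    """Greedy speculative decoding: the cheap draft proposes k tokens at a
    time; the target verifies and keeps the longest agreeing prefix, so the
    output is identical to decoding with the target alone. The win is that
    the target can score all k proposals in one batched forward pass."""
    out = list(prompt)
    while len(out) < len(prompt) + steps:
        proposal = []
        for _ in range(k):                     # draft runs k cheap steps
            proposal.append(draft(out + proposal))
        for tok in proposal:                   # target verifies each one
            expected = target(out)
            out.append(expected)               # always keep the target's token
            if expected != tok or len(out) >= len(prompt) + steps:
                break                          # mismatch: discard the rest
    return out[len(prompt):]

# Toy "models": deterministic next-token functions of the prefix length.
target = lambda prefix: "abcd"[len(prefix) % 4]
draft = lambda prefix: "abca"[len(prefix) % 4]  # wrong on every 4th token
print(speculative_decode(target, draft, ["a"], k=3, steps=8))
# -> ['b', 'c', 'd', 'a', 'b', 'c', 'd', 'a']
```

Early-exit decoding, by contrast, does change the output distribution, which is why it is the kind of optimization that could plausibly surface as a quality regression.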

      • By BoorishBears 2023-07-194:013 reply

        I don't get why you're jumping to cloak-and-dagger operations: OpenAI would not kneecap their commercial offering by randomly changing how it works.

        At the end of the day 99% of the confusion comes from people using the web interface, which undoubtedly does change much more often than the API versions they share.

        The web app they host isn't a simple API wrapper, it does summarization, has some sort of system prompt, and calls the moderation API. That's undoubtedly being updated all the time.

        • By visarga 2023-07-194:541 reply

          > OpenAI would not kneecap their commercial offering by randomly changing how it works.

          Have you seen the 25 messages/3 hours limitation for GPT-4? Why do you think they did that? Of course they would make more money by scaling up volume, but how can they do that when compute is so limited? Of course, by using some kind of approximation – a quantised model or speculative sampling come to mind. It's hard to pinpoint model regressions, but scaling up volume is great: one more incentive to do it.

          • By BoorishBears 2023-07-195:062 reply

            You realize that's a limitation in the web application right?

            The web app is a consumer app (B2C); the API is commercial (B2B). They tinker with the B2C app because, between the summarization and the system prompt, it's already a lossy approximation of using the model.

            They cannot mess with the commercial offering willy-nilly: people are building businesses predicated on it behaving a certain way. That's why there are dated versions that you can pin to with the API. The web app changes whenever they feel like it.

            • By zo1 2023-07-197:521 reply

              You keep repeating that. You don't even know if the people commenting to you use the API or the "web app". I use the API and I noticed the same stuff others have.

              • By BoorishBears 2023-07-1919:131 reply

                > Have you seen the 25 messages/3 hours limitation for GPT-4?

                If you can't tell whether that's about the API or the web app, I don't think you're familiar enough with the subject to speak on it.

                • By zo1 2023-07-2014:231 reply

                  I don't need to be familiar with it to see that you're not being genuine in your interpretation of people's comments and in the way you're responding to people in this thread. Case in point: your reply to me.

                  • By BoorishBears 2023-07-2112:16

                    What was I supposed to reply to your factless baseless anecdote with?

                    Millions of dollars in spend from products predicated on the API not randomly changing?

            • By TeMPOraL 2023-07-1912:061 reply

              Arguably, both ChatGPT and API are consumer apps. That includes researchers. Pay as you go, no strings attached, "oh yeah no no, we're not changing anything, follow our CEO on Twitter if you want to know more". That kind of stuff.

              The actual B2B offering is handled by Microsoft, via Azure OpenAI. Same models, but deployed on Azure – meaning they come with an SLA and all the right protocol and compliance stuff, so that your people can negotiate with their people – and if you're willing to spend enough, you'll get the models for yourself. Not the weights, of course: just no training on your inputs, not even retaining inputs for 30 days "because ${legal reasons}" – instead, you can pick and choose, fine-tune and deploy OpenAI models on your own tenant, and basically manage everything except the weights themselves.

              • By BoorishBears 2023-07-1919:111 reply

                Maybe arguable if you don't know what consumer apps are? Also sounds like you haven't actually used Azure OpenAI:

                - It has the same 30 day retention for legal reasons unless you manually request (just like OpenAI)

                - You can't fine tune any models that you can't fine tune on OpenAI, and in fact default access is a subset of what OpenAI offers.

                - "and if you're willing to spend enough, you'll get the models for yourself" is a bit of nonsense, Azure OpenAI forces everyone to make a "tenant", that's just for VPC stuff to work. Outside of that it's bog standard fine tuning and at most "on your data" which is a wrapper for chunking + vector embeddings

                - Azure OpenAI has a narrower built in filter that you can't modify without again, a separate request.

                Azure OpenAI overall is mostly for companies that need to signal to other companies that they're using Azure: it's no more commercial than the OpenAI offering.

                • By TeMPOraL 2023-07-202:182 reply

                  > Also sounds like you haven't actually used Azure OpenAI

                  On the contrary, I am using Azure OpenAI daily at work, and I'm explicitly not allowed to use "regular" OpenAI offerings.

                  > It has the same 30 day retention for legal reasons unless you manually request (just like OpenAI)

                  It doesn't, at least not for us.

                  > Azure OpenAI has a narrower built in filter that you can't modify without again, a separate request.

                  I'm not sure if it's narrower, but it is there and I have a strong suspicion that MS is just trying to extract additional rent from companies that really want to turn the filter off.

                  > Azure OpenAI overall is mostly for companies that need to signal to other companies that they're using Azure: it's no more commercial than the OpenAI offering.

                  No. Azure OpenAI is for companies that don't play fast and loose with data - their own data, and their customer data. Of course, most companies don't give a damn, but for big enough companies, or those operating in certain industries, there are actual, severe legal consequences for mishandling the data, and such companies don't have the option to just not give a fuck and dance with OpenAI - they need to sign an actual contract with a serious entity that understands regulatory compliance and how corporations tick. Microsoft is such an entity. OpenAI isn't.

                  • By BoorishBears 2023-07-203:271 reply

                    > It doesn't, at least not for us.

                    So then you filled out the request because the default is exactly the same as OpenAI: retained unless you manually apply for an exception.

                    https://customervoice.microsoft.com/Pages/ResponsePage.aspx?...

                    > I'm not sure if it's narrower, but it is there and I have a strong suspicion that MS is just trying to extract additional rent from companies that really want to turn the filter off.

                    You don't need to question if it's narrower, OpenAI used to surface it as an API separate from the moderation API and it's much stricter by design.

                    > No. Azure OpenAI is for companies that don't play fast and loose with data - their own data, and their customer data...

                    I don't know if you actually believe this or you're just not aware, but the companies that don't play fast and loose aren't using OpenAI period: Azure flavored or otherwise.

                    OpenAI has SOC2, GDPR and CCPA compliance. They comply with HIPAA and offer BAAs. They sign DPAs on a case-by-case basis, same as Azure.

                    You're pretty much proving the value of Azure in your comment: it's a veneer of familiarity that coaxes people who are convinced the new kid on the block must be untrustworthy.

                    If OpenAI can't promise something, Azure can't either: they're entirely dependent on OpenAI for this. Every idiosyncrasy behind Azure OpenAI maps back 1:1 to OpenAI.

                    • By TeMPOraL 2023-07-208:301 reply

                      > I don't know if you actually believe this or you're just not aware, but the companies that don't play fast and loose aren't using OpenAI period: Azure flavored or otherwise.

                      They do, and that's the biggest value proposition of Azure OpenAI right now: strong contractual guarantees, from a reputable partner (that's easy to hit with lawsuits should they go rogue :)).

                      The current situation is that it's pretty unwise for any company to ignore GPT models. OpenAI itself is a wildcard, but getting the same from Microsoft isn't "playing fast and loose with data" any more than using Windows and Office 365 across the organization is. Most large corporations and governments have been building their office work and communication around those tools for decades now, so - questions of antitrust aside - all the kinks have been worked out. I don't think you appreciate how big a difference this makes.

                      I mean, it's either that or all the company communication I got on this was bullshit.

                      > OpenAI has SOC2, GDPR and CCPA compliance. They comply with HIPAA and offer BAAs.

                      That's the first I hear of it, but since I never dealt with OpenAI itself on that level, I accept this was my ignorance speaking; thanks for clarifying.

                      > You're pretty much proving the value of Azure in your comment: it's a veneer of familiarity that coaxes people who are convinced the new kid on the block must be untrustworthy.

                      I think you're underestimating the importance of this. What you call "veneer of familiarity" translates to billions of dollars of differences in terms of security risk.

                      As mentioned before, MS has been in this space for a while, and has decades of trust and experience built with governments and corporations and other big organizations. Microsoft is a known, trusted quantity. That alone is worth a lot.

                      But then, there are also technical aspects too - like how deploying to a tenant on Azure integrates properly with all the other services you use to run half the company. In practical terms, this means all use is monitored and auditable by in-house teams, and all the in-house policies are being enforced. OpenAI can't begin to offer this level of integration - they have neither technical nor legal resources for that.

                      > If OpenAI can't promise something Azure can't either: They're entirely dependent on OpenAI for this. Every idiosyncrasy behind Azure OpenAI maps back 1:1 to OpenAI.

                      None of that matters here. The models are what they are - peculiar large matrix multiplication as a service. By themselves, they're pretty much pure functions. The part that matters is operations - both technical and legal aspects - and this is where Microsoft and OpenAI are independent and have different offerings.

                      Also, looking at the way money flows, I think it's OpenAI that's dependent on Microsoft right now, not the other way around. They kinda pretend to be just friends with benefits, but it's obvious who the dependent party is.

                      • By BoorishBears 2023-07-2017:11

                        I'm advising on calls with firms that have existed since the 1800s: their clients don't even want LLMs involved in output, regardless of who's hosting what.

                        Companies that don't play fast and loose are not using LLMs yet. They use "old school" ML at most with much narrower scope because at this point it's simply less of a liability.

                        You seem to think I'm underestimating what Azure's name adds to OpenAI: I fully understand how bureaucratic organizations work off vibes under the guise of name recognition and my point is I simply have no respect for it.

                        If you genuinely care about customer data, then the value of being able to sue MS instead of OpenAI is moot. You also probably aren't going to use a service that shouts from the rooftops about not using your data and then quietly keeps it for 30 days unless you manually opt out. You probably don't use a model with unsolved copyright/PII questions. And a million other unknowns.

                        > The part that matters is operations - both technical and legal aspects - and this is where Microsoft and OpenAI are independent and have different offerings.

                        You might want to check OpenAI's subprocessor list if you think that they're not the same technically...

                        https://platform.openai.com/subprocessors/openai-subprocesso...

                        And Azure's subprocessor list is a superset of that list, not a subset.

        • By airgapstopgap 2023-07-194:411 reply

          No, it makes sense to secure engagement with the most expensive implementation and then cut costs; this kind of thing is pervasive in the industry. Besides, we have Brockman on record saying that they do "a lot of quantization"[1][2], so it's not paranoia to suspect other optimization schemes when there's a clear performance drop, which they have also denied a few times.

          1. https://chat.openai.com/share/44a0c5b6-c629-470a-992f-8cdbbe...

          2. https://www.youtube.com/watch?v=_hpuPi7YZX8
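
For context on what "a lot of quantization" means mechanically: weights are stored at lower precision, which shrinks memory and boosts throughput at the cost of small rounding errors that can accumulate across layers. A toy symmetric int8 sketch (illustrative only; nothing here reflects OpenAI's actual scheme):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor int8: one float scale plus int8 weights,
    # roughly 4x smaller than float32.
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
err = float(np.abs(dequantize(q, scale) - w).max())
# err is bounded by about scale/2 -- each weight moves only slightly,
# yet across billions of weights model behavior can drift measurably.
```

That per-weight error bound is exactly why regressions from this kind of optimization are subtle: no single answer obviously breaks, the model just gets a little worse on hard cases.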

          • By BoorishBears 2023-07-195:10

            Paranoia would be charitable: it's FUD.

            If you intentionally smear the line between their web app which is chock full of optimizations to even let it function as it does (the web app's max conversation length exceeds the context window) and the API which is versioned and iterated on in the open... it's either a lack of understanding or FUD.

        • By Shank 2023-07-194:101 reply

          > OpenAI would not kneecap their commercial offering by randomly changing how it works.

          > As of July 3, 2023, we’ve disabled the Browse with Bing beta feature out of an abundance of caution while we fix this in order to do right by content owners. We are working to bring the beta back as quickly as possible, and appreciate your understanding!

          https://help.openai.com/en/articles/8077698-how-do-i-use-cha...

          • By BoorishBears 2023-07-194:201 reply

            Thank you for confirming my point?

            > At the end of the day 99% of the confusion comes from people using the web interface, which undoubtedly does change much more often than the API versions they share.

            The API does not offer any browsing features, that's the web app.

    • By flangola7 2023-07-193:34

      You can call a specific version of the model. It's one of the API values. The latter person is referring to "gpt-4", which the documentation states will update and change without warning.
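
Concretely, the distinction lives in one field of the request payload (model names as documented in mid-2023, subject to availability; this sketch only builds the payload and makes no API call):

```python
# "gpt-4" is a floating alias that OpenAI may repoint at newer snapshots;
# "gpt-4-0314" names the fixed March 2023 snapshot, for as long as it is served.
floating = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Is 17077 prime? Think step by step."}],
}
pinned = dict(floating, model="gpt-4-0314")  # same request, pinned snapshot
```

Anyone benchmarking drift over time, as the paper does, needs the dated names; the bare alias is exactly the thing that will update and change without warning.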

HackerNews