Comments

  • By simonw 2026-02-11 17:47, 9 replies

    Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07...

    Solid bird, not a great bicycle frame.
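
    For anyone curious how to reproduce this, here is a minimal sketch of calling OpenRouter's OpenAI-compatible chat completions endpoint. The model slug and the environment variable name are assumptions for illustration, not details from the gist:

```python
# Hypothetical sketch of requesting the pelican SVG via OpenRouter's
# OpenAI-compatible chat completions endpoint. The model slug and the
# OPENROUTER_API_KEY environment variable are assumptions.
import json
import os
import urllib.request

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) the HTTP request for one completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ.get('OPENROUTER_API_KEY', '')}",
            "Content-Type": "application/json",
        },
    )

req = build_request("z-ai/glm-5", "Generate an SVG of a pelican riding a bicycle")
# urllib.request.urlopen(req) would return JSON with the SVG in
# choices[0].message.content; the live call is omitted here.
```

    The returned SVG markup can then be saved to a file and opened in any browser to judge the bird and the bicycle for yourself.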

    • By btown 2026-02-11 18:01, 3 replies

      Thank you for continuing to maintain the only benchmarking system that matters!

      Context for the unaware: https://simonwillison.net/tags/pelican-riding-a-bicycle/

      • By gabiruh 2026-02-11 19:38, 2 replies

        It's interesting how some features, such as green grass, a blue sky, clouds, and the sun, are ubiquitous among all of these models' responses.

        • By btown 2026-02-11 21:44, 1 reply

          If you were a pelican, wouldn't you want to go cycling on a sunny day?

          Do electric pelicans dream of touching electric grass?

          • By Magniquick 2026-02-12 3:37, 1 reply

            Do electric pelicans dream of touching electric grass?

            That would be shocking news to me.

        • By derefr 2026-02-11 21:49

          It is odd, yeah.

          I'm guessing both humans and LLMs would tend to get the "vibe" from the pelican task, that they're essentially being asked to create something like a child's crayon drawing. And that "vibe" then brings with it associations with all the types of things children might normally include in a drawing.

      • By l_eo 2026-02-11 20:55, 2 replies

        They will start to max out this benchmark as well at some point.

        • By ljm 2026-02-11 22:25, 3 replies

          It's not a benchmark though, right? Because there's no control group or reference.

          It's just an experiment on how different models interpret a vague prompt. "Generate an SVG of a pelican riding a bicycle" is loaded with ambiguity. It's practically designed to generate 'interesting' results because the prompt is not specific.

          It also happens to be an example of the least practical way to engage with an LLM. It's no more capable of reading your mind than anyone or anything else.

          I'd argue that, in the service of AI, a lot of flexibility is being created around the scientific method.

          • By tylervigen 2026-02-11 22:32

            For 2026 SOTA models I think that is fair.

            For the last generation of models, and for today's flash/mini models, I think there is still a not-unreasonable binary question ("is this a pelican on a bicycle?") that you can answer by just looking at the result: https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/

          • By vidarh 2026-02-12 7:58

            RLHF (reinforcement learning from human feedback) is to a large extent about resolving that ambiguity by simply polling people for their subjective judgement.

            I've worked on an RLHF project for one of the larger model providers. The instructions provided to the reviewers were very clear: even when there was no objectively correct answer, they were still required to choose the best one. While there were of course disagreements at the margins, groups of people do tend to converge on the big lines.

          • By interstice 2026-02-11 22:36

            So if it could generate exactly what you had in mind, presumably from the subtlest of cues (like personal quirks gleaned from a few sentences), that would be _terrifying_, right?

        • By 9dev 2026-02-12 11:19

          Simon has written a page specifically for you: https://simonwillison.net/2025/nov/13/training-for-pelicans-...

      • By segmondy 2026-02-12 1:01

        This is actually a good benchmark; I used to roll my eyes at it. Then I decided to apply the same idea and asked the models to generate an SVG image of "something" (not going to put it out there). There was a strong correlation between how good the models are and the images they generated. These were also models without vision capabilities, so whether or not you are serious, this is a decent benchmark.

    • By hasperdi 2026-02-12 9:48, 1 reply

      That's a bike that's ergonomically designed for pelicans.

      It is unreasonable to expect pelicans to ride human bikes, they have different anatomy.

      • By MrsPeaches 2026-02-12 10:23, 3 replies

        The next frontier:

        Draw a pelican on a bicycle ergonomically designed for pelicans.

        • By ben_w 2026-02-12 11:14

          It may be a joke, but I think this is correct.

          For reasons, I have tried to get Stable Diffusion to put parrots into spacesuits. Always ended up with the beak coming out where the visor glass should've been, either no wings at all or wings outside the suit, legs and torso just human-shaped.

          ChatGPT got the helmet right, but the wings and tail (and sometimes claws) were exposed to vacuum; still very much closer to a human in a normal or sci-fi space suit who happens to also be wearing a parrot head inside the helmet, with some costume wings tacked on the outside.

          Essentially, it's got the same category of wrong as fantasy art's approach to what women's armour should look like: aesthetics are great, but it would be instantly lethal if done for real.

        • By simonw 2026-02-12 19:37

          My more advanced prompt, for when models do a good job on the original, is this one:

          > Generate an SVG of a California brown pelican riding a bicycle. The bicycle must have spokes and a correctly shaped bicycle frame. The pelican must have its characteristic large pouch, and there should be a clear indication of feathers. The pelican must be clearly pedaling the bicycle. The image should show the full breeding plumage of the California brown pelican.

        • By mitjam 2026-02-12 11:08

          Thereafter: Design a bike that an actual pelican can learn to ride in real life.

    • By _joel 2026-02-11 18:09

      Now this is the test that matters, cheers Simon.

    • By RC_ITR 2026-02-11 21:23, 4 replies

      The bird not having wings, but all of us calling it a 'solid bird' is one of the most telling examples of the AI expectations gap yet. We even see its own reasoning say it needs 'webbed feet' which are nowhere to be found in the image.

      This pattern of considering 90% accuracy (like the level we've seemingly stalled out at on MMLU and AIME) to be 'solved' is really concerning to me.

      AGI has to be 100% right 100% of the time to be AGI and we aren't being tough enough on these systems in our evaluations. We're moving on to new and impressive tasks toward some imagined AGI goal without even trying to find out if we can make true Artificial Niche Intelligence.

      • By zarzavat 2026-02-12 10:15, 1 reply

        This test is so far beyond AGI. Try to spit out the SVG for a pelican riding a bicycle. You are only allowed to use a simple text editor. No deleting or moving the text cursor. You have 1 minute.

        • By RC_ITR 2026-02-12 23:49, 1 reply

          Sorry, is your definition of AGI "doing things worse than humans can do, but way faster"? Because that's been true of computers for a long time.

          • By pixl97 2026-02-13 14:14

            I mean for this particular benchmark, yes.

            You'd have to put it in an agentic loop to perform corrections otherwise.

      • By Rudybega 2026-02-11 21:50, 1 reply

        MMLU performance caps out around 90% because there are tons of errors in the actual test set. There's a pretty solid post on it here: https://www.reddit.com/r/LocalLLaMA/comments/163x2wc/philip_...

        As far as I can tell for AIME, pretty much every frontier model gets 100% https://llm-stats.com/benchmarks/aime-2025

        • By RC_ITR 2026-02-12 23:44

          Here's the score for the new AIMEs, where we know the answers aren't in the training data.

          https://matharena.ai/?view=problem&comp=aime--aime_2026

          As for MMLU, is your assertion that these AI labs are not correcting for errors in these exams and then self-reporting scores less than 100%?

          As implied by the video, wouldn't it then take 1 intern a week max to fix those errors and allow any AI lab to become the first to consistently 100% the MMLU? I can guarantee Moonshot, DeepSeek, or Alibaba would be all over the opportunity to do just that if it were a real problem.

      • By kingstnap 2026-02-12 15:50, 1 reply

        The benchmarks are harder than you might imagine and contain more wrong answers and terrible questions than you would expect.

        You don't need to take my word for it, try playing MMLU yourself.

        https://d.erenrich.net/are-you-smarter-than-an-llm/index.htm...

        It's not MMLU-Pro, btw, which is considerably harder.

        • By RC_ITR 2026-02-12 23:50, 1 reply

          Sure and AGI will 100% it 100% of the time, even if it is hard.

          • By hieudesu 2026-02-14 13:35

            Your definition of AGI must be absurd

      • By simonw 2026-02-12 0:22

        It has a wing. Look at the code comments in the SVG!

    • By solarized 2026-02-11 20:22, 2 replies

      This Pelican benchmark has become irrelevant. SVG is already ubiquitous.

      We need a new, authentic scenario.

      • By viraptor 2026-02-11 20:52, 3 replies

        Like identifying names of skateboard tricks from the description? https://skatebench.t3.gg/

      • By echelon 2026-02-11 21:56

          1. Take the top ten searches on Google Trends
             (on the day of a new model release).
          2. Concatenate them.
          3. SHA-1 hash the result.
          4. Use the hash as a seed to perform a random noun-verb
             lookup in an agreed-upon large dictionary.
          5. Construct a sentence using an agreed-upon stable
             algorithm that generates reasonably coherent prompts
             from an immensely deep probability space.
        
        That's the prompt. Every existing model is given that prompt and compared side-by-side.

        You can generate a few such sentences for more samples.

        Alternatively, take the top ten F500 stock performers. Some easy signal that provides enough randomness but is easy to agree upon and doesn't provide enough time to game.

        It's also something teams can pre-generate candidate problems for to attempt improvement across the board. But they won't have the exact questions on test day.
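
        Steps 1-4 of the scheme above can be sketched roughly like this (the search list and the mini-dictionary here are placeholder assumptions; a real version would pull live Trends data and use a much larger dictionary):

```python
# Sketch of the deterministic-seed idea: concatenate the day's top
# searches, SHA-1 hash them, and use the digest to seed a noun-verb
# lookup. Search list and word lists are placeholder assumptions.
import hashlib
import random

def seeded_prompt(searches: list[str], nouns: list[str], verbs: list[str]) -> str:
    digest = hashlib.sha1("".join(searches).encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16))  # use the hash digest as the RNG seed
    noun, verb = rng.choice(nouns), rng.choice(verbs)
    return f"Generate an SVG of a {noun} {verb} a bicycle"

# Deterministic: the same day's searches always yield the same prompt,
# so every lab can reproduce it, but nobody knows it in advance.
prompt = seeded_prompt(
    ["new model release", "weather"],
    nouns=["pelican", "walrus", "heron"],
    verbs=["riding", "juggling", "repairing"],
)
```

        Since anyone can recompute the hash from public data, the prompt is verifiable after the fact without being gameable before model release day.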

    • By TZubiri 2026-02-12 0:46

      The idea at the time was that it was obviously not part of the training set; now that it's a metric, it's worthless. Try an elephant smoking a cigar on the beach.

    • By blurbleblurble 2026-02-12 4:41

      Have you tried with qwen-coder-next yet?

    • By pwython 2026-02-11 18:25, 2 replies

      How many pelican riding bicycle SVGs were there before this test existed? What if the training data is being polluted with all these wonky results...

      • By bwilliams18 2026-02-11 20:45

        I'd argue that a model's ability to ignore/manage/sift through the noise added to the training set from other LLMs increases in importance and value as time goes on.

      • By nerdsniper 2026-02-11 19:36

        You're correct. It's not as useful as it (ever?) was as a measure of performance...but it's fun and brings me joy.

    • By brianjking 2026-02-12 2:45

      Pretty damn great bird, tbh.

  • By NiloCK 2026-02-11 14:40, 8 replies

    Grey market fast-follow via distillation seems like an inevitable feature of the near to medium future.

    I've previously doubted that the N-1 or N-2 open-weight models would ever be attractive to end users, especially power users. But it now seems that user preference will be yet another saturated benchmark, one that even the N-2 models will fully satisfy.

    Heck, even my own preferences may be getting saturated already. Opus 4.5 was a very legible jump from 4.1. But 4.6? Apparently better, but it hasn't changed my workflows or the types of problems / questions I put to it.

    It's poetic - the greatest theft in human history followed by the greatest comeuppance.

    No end-user on planet earth will suffer a single qualm at the notion that their bargain-basement Chinese AI provider 'stole' from American big tech.

    • By jaccola 2026-02-11 14:59, 3 replies

      I have no idea how an LLM company can make any argument that their use of content to train the models is allowed that doesn't equally apply to the distillers using an LLM output.

      "The distilled LLM isn't stealing the content from the 'parent' LLM, it is learning from the content just as a human would, surely that can't be illegal!"...

      • By mikehearn 2026-02-11 15:07, 2 replies

        The argument is that converting static text into an LLM is sufficiently transformative to qualify for fair use, while distilling one LLM's output to create another LLM is not. Whether you buy that or not is up to you, but I think that's the fundamental difference.

        • By zozbot234 2026-02-11 15:19

          The whole notion of 'distillation' at a distance is extremely iffy anyway. You're just training on LLM chat logs, but that's nowhere near enough to even loosely copy or replicate the actual model. You need the weights for that.

        • By budududuroiu 2026-02-11 15:24, 3 replies

          > The U.S. Court of Appeals for the D.C. Circuit has affirmed a district court ruling that human authorship is a bedrock requirement to register a copyright, and that an artificial intelligence system cannot be deemed the author of a work for copyright purposes

          > The court’s decision in Thaler v. Perlmutter, on March 18, 2025, supports the position adopted by the United States Copyright Office and is the latest chapter in the long-running saga of an attempt by a computer scientist to challenge that fundamental principle.

          I, like many others, believe the only way AI won't immediately get enshittified is by fighting tooth and nail for LLM output to never be copyrightable

          https://www.skadden.com/insights/publications/2025/03/appell...

          • By roywiggins 2026-02-11 15:51, 1 reply

            Thaler v. Perlmutter is a weird case because Thaler explicitly disclaimed human authorship and tried to register a machine as the author.

            Whereas someone trying to copyright LLM output would likely insist that there is human authorship via the choice of prompts and careful selection of the best LLM output. I am not sure if claims like that have been tested.

            • By wongarsu 2026-02-12 9:30

              The US copyright office has published a statement that they see AI output as analogous to a human contracting the work out to a machine. The machine would hold the copyright, but it can't, so consequently there is none. This is imho slightly surprising, since your argument about choice of prompt and output seems analogous to the argument that led to photographs being subject to copyright despite being made by a machine.

              On the other hand in a way the opinion of the US copyright office doesn't matter, what matters is what the courts decide

          • By mikehearn 2026-02-11 16:09, 1 reply

            It's a fine line that's been drawn, but this ruling says that AI can't own a copyright itself, not that AI output is inherently ineligible for copyright protection or automatically public domain. A human can still own the output from an LLM.

            • By budududuroiu 2026-02-12 0:07

              > A human can still own the output from an LLM.

              It specifically highlights human authorship, not ownership

          • By Aerroon 2026-02-12 9:32

            >I, like many others, believe the only way AI won't immediately get enshittified is by fighting tooth and nail for LLM output to never be copyrightable

            If the person who prompted the AI tool to generate something isn't considered the author (and therefore doesn't deserve copyright), then does that mean they aren't liable for the output of the AI either?

            Ie if the AI does something illegal, does the prompter get off scot-free?

      • By amenhotep 2026-02-11 15:52, 2 replies

        When you buy, or pirate, a book, you didn't enter into a business relationship with the author specifically forbidding you from using the text to train models. When you get tokens from one of these providers, you sort of did.

        I think it's a pretty weak distinction: by separating the concerns (a company collects a corpus and then "illegally" sells it for training), you can pretty much exactly reproduce the acquire-books-and-train-on-them scenario. But in the simplest case, the EULA does actually make it slightly different.

        Like, if a publisher pays an author to write a book, with the contract specifically saying they're not allowed to train on that text, and then they train on it anyway, that's clearly worse than someone just buying a book and training on it, right?

        • By BeetleB 2026-02-11 16:23

          > When you buy, or pirate, a book, you didn't enter into a business relationship with the author specifically forbidding you from using the text to train models.

          Nice phrasing, using "pirate".

          Violating the TOS of an LLM is the equivalent of pirating a book.

        • By creamyhorror 2026-02-12 4:57

          Contracts can't exclude things that weren't invented when the contracts were written.

          Ultimately it's up to legislation to formalize rules, ideally based on principles of fairness. Is it fair, in a non-legalistic sense, for all old books to be trainable-on, but not LLM outputs?

      • By TZubiri 2026-02-12 0:49, 1 reply

        Because the terms set by each provider are different.

        American models train on public data without a "do not use this without permission" clause.

        Chinese models train on models that have a "you will not reverse engineer" clause.

        • By WSSP 2026-02-12 0:59

          > American Model trains on public data without a "do not use this without permission" clause.

          this is going through various courts right now, but likely not

    • By miohtama 2026-02-11 14:43, 3 replies

      In some ways, Opus 4.6 is a step backwards due to massively higher token consumption.

      • By theshrike79 2026-02-12 7:39

        You need to adjust the effort from the default (High) to Medium to match the token usage of 4.5

        High is for people with infinite budgets and Anthropic employees. =)

      • By nwienert 2026-02-11 14:53, 1 reply

        For me, it's just plain worse.

        • By cmrdporcupine 2026-02-11 15:12, 1 reply

          Try Codex / GPT 5.3 instead. Basically superior in all respects, and the codex CLI uses 1/10 the memory and doesn't have stupid bugs. And I can use my subscription in opencode, too.

          Anthropic has blown their lead in coding.

          • By toraway 2026-02-11 16:06, 1 reply

            Yeah, I have been loving GPT 5.2/3 once I figured out how to change to High reasoning in OpenCode.

            It has been crushing every request that would have gone to Opus at a fraction of the cost considering the massively increased quota of the cheap Codex plan with official OpenCode support.

            I just roll my eyes now whenever I see HN comments defending Anthropic and suggesting OpenCode users are being petulant TOS-violating children asking for the moon.

            Like, why would I voluntarily subject myself to a worse, more expensive, and locked-down plan from Anthropic, one that has become more enshittified every month since I originally subscribed, given that Codex exists and is just as good?

            It won't last forever I'm sure but for now Codex is ridiculously good value without OpenAI crudely trying to enforce vendor lock-in. I hate so much about this absurd AI/VC era in tech but aggressive competition is still a big bright spot.

            • By cmrdporcupine 2026-02-11 16:10, 1 reply

              I like using Codex inside OpenCode, but frankly most times I just use it inside Codex itself because O.Ai has clearly made major improvements to it in the last 3 months -- performance and stability -- instead of mucking around trying to vibe code a buggy "game loop" in React on a VT100 terminal.

              • By toraway 2026-02-11 16:14, 1 reply

                I had been using Codex for a couple weeks after dropping Claude Code to evaluate as a baseline vs OpenCode and agreed, it is a very solid CLI that has improved a lot since it was originally released.

                I mainly use OC just because I had refined my workflow and like reducing lock-in in general, but Codex CLI is definitely much more pleasant to use than CC.

                • By cmrdporcupine 2026-02-11 16:15

                  Yeah, if the eng team working on it is on this forum: kudos to you. Thanks.

      • By chillfox 2026-02-12 1:50

        yeah, I am still using 4.5 for coding.

        I have started using Gemini Flash on high for general cli questions as I can't tell the difference for those "what's the command again" type questions and it's cheap/fast/accurate.

    • By deaux 2026-02-12 7:09

      > But 4.6? Apparently better, but it hasn't changed my workflows or the types of problems / questions I put to it.

      The incremental steps are now more domain-specific. For example, Codex 5.3 is supposedly improved at agentic use (tools, skills). Opus 4.6 is markedly better at frontend UI design than 4.5. I'm sure at some point we'll see across-the-board noticeable improvement again, but that would probably be a major version rather than minor.

    • By vessenes 2026-02-11 16:18, 1 reply

      Just to say - 4.6 really shines on working longer without input. It feels to me like it gets twice as far. I would not want to go back.

      • By cmrdporcupine 2026-02-11 16:44, 3 replies

        If that's what they're tuning for, that's just not what I want. So I'm glad I switched off of Anthropic.

        What teams of programmers need, when AI tooling is thrown into the mix, is more interaction with the codebase, not less. To build reliable systems the humans involved need to know what was built and how.

        I'm not looking for full automation, I'm looking for intelligence and augmentation, and I'll give my money and my recommendation as team lead / eng manager to whatever product offers that best.

        • By p1esk 2026-02-12 1:06, 1 reply

          > I'm not looking for full automation

          But your boss probably is.

          • By wyre 2026-02-12 2:31

            Full automation is also possible by putting your coding agent into a loop. The point is that an LLM that can solve a small task is more valuable for quality output than an LLM that can solve a larger task autonomously.
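
            A minimal sketch of that loop idea, with a stand-in for the real model call (nothing here is a specific API; it just shows the feed-results-back-in structure):

```python
# Illustrative sketch of "putting the agent into a loop": keep feeding the
# model its own results until it signals success or a step budget runs out.
# `model` is a stand-in callable for a real LLM call, not a specific API.
from typing import Callable

def agent_loop(model: Callable[[str], str], task: str, max_steps: int = 10) -> str:
    transcript = task
    for _ in range(max_steps):
        action = model(transcript)
        if action == "DONE":           # model signals the task is solved
            return transcript
        transcript += "\n" + action    # feed the result back into the context
    return transcript                  # step budget exhausted

# A toy "model" that finishes after two steps:
fake = iter(["step 1", "step 2", "DONE"])
result = agent_loop(lambda t: next(fake), "fix the failing test")
```

            Real agent frameworks add tool execution, error handling, and context truncation inside that loop, but the structure is the same.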

        • By vidarh 2026-02-12 8:32, 1 reply

          That sounds like wishful thinking. Every client I work for wants to reduce the rate at which humans need to intervene. You might not want that, but odds are your CEO does. And babysitting intermediate stages is not productive use of developer time.

          • By officialchicken 2026-02-12 9:22, 1 reply

            And the odds are good you use the models and understand them in detail while the CEO is just buying the hype, ill informed or not.

            • By vidarh 2026-02-12 9:39

              Well, I want to reduce the rate at which I have to intervene in the work my agents do as well. I spend more time improving how long agents can work without my input than I spend writing actual code these days.

        • By vessenes 2026-02-12 22:58

          A year ago (geez!) I used aider, as you describe.

          Now I use claude with agent orchestration and beads.

          Well actually, I’m currently using openclaw to spin up multiple claudes with the above skills.

          If I need to drop down to claude, I do.

          If I need to edit something (usually writing I hate), I do.

          I haven’t needed to line edit something in a while - it’s just faster to be like “this is a bad architecture, throw it away, do this instead, write additional red-green tests first, and make sure X. Then write a step by step tutorial document (I like simonw’s new showboat a lot for this), and fix any bugs / API holes you see.”

          But I guess I could line edit something if I had to. The above takes a minute, though.

    • By throwaw12 2026-02-11 15:10

      not allowing distillation should be illegal :)

      One can create thousands of topic-specific AI-generated content websites; as a disclaimer, each post should include the prompt and the model used.

      Others can "accidentally" crawl those websites and include in their training/fine-tuning.

    • By hasperdi 2026-02-12 10:47

      Why distill, if you can run the full model yourself... or at other inference providers.

      Quantization is the better approach in most cases, unless you want to, for instance, create hybrid models, i.e. distilling from here and there.
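
      For the unfamiliar, here is a toy sketch of what quantization does: map float weights to small integers plus a scale factor, trading a little precision for a big size reduction. Real schemes (per-channel int8, GPTQ, AWQ, etc.) are far more sophisticated; this only illustrates the basic trade-off.

```python
# Toy per-tensor int8 quantization: store 8-bit integers plus one float
# scale instead of full-precision floats, then reconstruct approximately.
def quantize(weights: list[float]) -> tuple[list[int], float]:
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid zero scale
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

w = [0.5, -1.27, 0.01]
q, s = quantize(w)            # 8-bit integers plus one float scale
w_hat = dequantize(q, s)      # approximate reconstruction of the weights
```

      Unlike distillation, nothing is retrained: the full model's weights and structure are kept, just stored at lower precision.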

    • By jona-f 2026-02-12 8:46

      "The greatest theft in human history"? What nonsense. I was curious how the AI haters would cope now that the tides have changed. We have built systems that can look at any output and replicate it. That is progress. If you think some particular sequence of numbers belongs to you, you are wrong. Current intellectual property laws are crooked. You are stuck in a crooked system.

    • By benterix 2026-02-12 9:52, 1 reply

      > No end-user on planet earth will suffer a single qualm at the notion that their bargain-basement Chinese AI provider 'stole' from American big tech.

      Just like nobody cares[0] that American big tech stole from authors of millions of books.

      [0] Interestingly, the only ones who cared were the FB employees told to pirate Library Genesis, who reported back that "it didn't feel right".

      • By DannyBee 2026-02-12 12:28

        As one of those authors (3 books in this case) I'll just point out:

        Most authors don't own any interesting rights to their books because they are works for hire.

        Maybe I would have gotten something, maybe not. Depends on the contract. One of my books that was used is from 1996. That contract did not say a lot about the internet, and I was also 16 at the time ;)

        In practice they stole from a relatively small number of publishers. The rest is PR.

        The settlement goes to authors in part because anything else would generate immensely bad PR.

        As usual, nothing is really black and white.

  • By cmrdporcupine 2026-02-11 14:32, 2 replies

    Bought some API credits and ran it through opencode (model was "GLM 5").

    Pretty impressed, it did good work. Good reasoning skills and tool use, even in "unfamiliar" programming languages: I had it connect to my running MOO and refactor and rewrite some MOO (a dynamically typed OO scripting language) verbs via MCP. It made basically no mistakes with the programming language, despite it being my own bespoke language & runtime with syntactical and runtime additions of my own (lambdas, new types, for comprehensions, etc.). It reasoned everything through by looking at the API surface and example code. No serious mistakes; it tested its work and fixed things as it went.

    Its initial analysis phase found leftover/sloppy work that Codex/GPT 5.3 left behind in a session yesterday.

    Cost me $1.50 USD in token credits to do it, but z.AI offers a coding plan which is absolutely worth it if this is the caliber of model they're offering.

    I could absolutely see combining the z.AI coding plan with a $20 Codex plan such that you switch back and forth between GPT 5.3 and GLM 5 depending on task complexity or intricacy. GPT 5.3 would only be necessary for really nitty gritty analysis. And since you can use both in opencode, you could start a session by establishing context and analysis in Codex and then having GLM do the grunt work.

    Thanks z.AI!

    • By muyuu 2026-02-11 16:04, 2 replies

      When I look at the prices these people are offering, and also the likes of Kimi, I wonder how OpenAI, Anthropic, and Google are going to justify billions of dollars of investment. Surely they have something in mind other than competing for subscriptions against the abliterated open models that won't say "I cannot do that".

      EDIT:

      cheechw - point taken. I'm very sceptical of that business model too, as it's fairly simple to offer that chat front-end with spreadsheet processing and use the much cheaper, perfectly workable (and de facto less censored, for non-Chinese users) Chinese models as a back-end. Maybe if somehow they manage to ban them effectively.

      sorry, don't seem to be able to reply to you directly

      • By cmrdporcupine 2026-02-11 16:07

        They're all pretending to bring about the singularity (surely a 1 million token context window is enough, right?) and simultaneously begging the US government to help them create monopolies.

        Meanwhile said government burns bridges with all its allies, declaring economic and cultural warfare on everybody outside their borders (and most of everyone inside, too). So nobody outside of the US is going to be rooting for them or getting onside with this strategy.

        2026 is the year where we get pragmatic about these things. I use them to help me code. They can make my team extremely effective. But they can't replace them. The tooling needs improvement. Dario and SamA can f'off with their pronouncements about putting us all out of work and bringing about ... god knows what.

        The future belongs to the model providers who can make it cost effective and the tool makers who augment us instead of trying ineptly to replace us with their bloated buggy over-engineered glorified chat loop with shell access.

      • By cheechw 2026-02-11 16:48

        [dead]

    • By jfaat 2026-02-11 14:42, 2 replies

      Yeah, that's a good idea. I played around with Kimi 2.5/Gemini in a similar way and it's solid for the price. It would be pretty easy to build some skills out and delegate heavy lifting to better models without managing it yourself, I think. This has all been driven by Anthropic's shenanigans (I cancelled my Max sub after almost a year, both because of the opencode thing and because of them consistently nerfing everything for weeks to keep up in the arms race).

      • By mattkevan 2026-02-11 15:57

        Cancelled my Anthropic subscription this week after about 18 months of membership. Usage limits have dropped drastically (or token usage has increased) to the point where it's unusable.

        Codex + Z.ai combined is the same price, has far higher usage limits, and is just as good.

      • By cmrdporcupine 2026-02-11 14:47

        Yeah I did the same (cancel Anthropic). Mainly because the buggy/bloatiness of their tooling pissed me off and I got annoyed by Dario's public pronouncements (not that SamA is any better).

        I ended up impressed enough with GPT 5.3 that I did the $200 for this month, but only because I can probably write it off as a business expense in next year's accounting.

        Next month I'll probably do what I just said: $20 each to OpenAI and Google for GPT 5.3 and Gemini 3 [only because it gets me drive and photo storage], buy the z.AI plan, and only use GPT for nitty gritty analysis heavy work and review and GLM for everything else.

HackerNews