gkamradt

Karma: 307

Created: 2015-07-02

Recent Activity

  • Commented: "OpenAI o3-pro"

    o3-pro is not the same as the o3-preview that was shown in Dec '24. OpenAI confirmed this for us. More on that here: https://x.com/arcprize/status/1932535380865347585

  • Ah yes, two things

    1. We had a no-data-retention agreement with them. We were assured by the highest level of their company + security division that the box our test was run on would be wiped after testing.

    2. We only tested o3 against the semi-private set. We didn't test it with the private eval.

  • #4 (private test set) doesn't get used for any public model testing. It is only used on the Kaggle leaderboard where no internet access is allowed.

  • Good question! This was one of the main motivations of our "Paper Prize" track. We wanted to reward conceptual progress vs leaderboard chasing. In fact, when we increased the prizes mid-year, we awarded more money towards the paper track vs top score.

    We had 40 papers submitted last year and 8 were awarded prizes. [1]

    One of the main teams, MindsAI, just published their paper on their novel test-time fine-tuning approach. [2]

    Jan/Daniel (1st place winners last year) talk all about their progress and journey building out here [3]. Stories like theirs help push the field forward.

    [1] https://arcprize.org/blog/arc-prize-2024-winners-technical-r...

    [2] https://github.com/MohamedOsman1998/deep-learning-for-arc/bl...

    [3] https://www.youtube.com/watch?v=mTX_sAq--zY

  • We have a few sets:

    1. Public Train - 1,000 tasks that are public

    2. Public Eval - 120 tasks that are public

    So for those two we don't have protections.

    3. Semi-Private Eval - 120 tasks that are exposed to 3rd parties. We sign data agreements where we can, but we understand this is exposed and not 100% secure. It's a risk we are open to in order to keep testing velocity. In theory it is very difficult to secure this 100%. The cost to create a new semi-private test set is lower than the effort needed to secure it 100%.

    4. Private Eval - Only on Kaggle, not exposed to any 3rd parties at all. Very few people have access to this. Our trust vectors are with Kaggle and the internal team only.

HackerNews