
About a year and a half ago, I wrote about my kid’s experience with an AI checker tool that was pre-installed on a school-issued Chromebook. The assignment had been to write an essay about Kurt Vonnegut’s Harrison Bergeron—a story about a dystopian society that enforces “equality” by handicapping anyone who excels—and the AI detection tool flagged the essay as “18% AI written.” The culprit? Using the word “devoid.” When the word was swapped out for “without,” the score magically dropped to 0%.
The irony of being forced to dumb down an essay about a story warning against the forced suppression of excellence was not lost on me. Or on my kid, who spent a frustrating afternoon removing words and testing sentences one at a time, trying to figure out what invisible tripwire the algorithm had set. The lesson the kid absorbed was clear: write less creatively, use simpler vocabulary, and don’t sound too good, because sounding good is now suspicious.
At the time, I worried this was going to become a much bigger problem. That the fear of AI “cheating” would create a culture that actively punished good writing and pushed students toward mediocrity. I was hoping I’d be wrong about that.
Turns out… I was not wrong.
Dadland Maye, a writing instructor who has taught at many universities, has published a piece in the Chronicle of Higher Education documenting exactly how this has played out across his classrooms—and it’s even worse than what I described. Because the AI detection regime hasn’t just pushed students to write worse. It has actively pushed students who never used AI to start using it.
This fall, a student told me she began using generative AI only after learning that stylistic features such as em dashes were rumored to trigger AI detectors. To protect herself from being flagged, she started running her writing through AI tools to see how it would register.
A student who was writing her own work, with her own words, started using AI tools defensively—not to cheat, but to make sure her own writing wouldn’t be accused of cheating. The tool designed to prevent AI use became the reason she started using AI.
This is the Cobra Effect in its purest form. The British colonial government in India offered a bounty for dead cobras to reduce the cobra population. People started breeding cobras to collect the bounty. When the government scrapped the program, the breeders released their now-worthless cobras, making the problem worse than before. AI detection tools are our cobra bounty. They were supposed to reduce AI use. Instead, they’re incentivizing it.
And this goes well beyond one student’s experience. Maye describes a pattern spreading across his classrooms:
One student, a native English speaker, had long been praised for writing above grade level. This semester, a transfer to a new college brought a new concern. Professors unfamiliar with her work would have no way of knowing that her confident voice had been earned. She turned to Google Gemini with a pointed inquiry about what raises red flags for college instructors. That inquiry opened a door. She learned how prompts shape outputs, when certain sentence patterns attract scrutiny, and ways in which stylistic confidence triggers doubt. The tool became a way to supplement coursework and clarify difficult material. Still, the practice felt wrong. “I feel like I’m cheating,” she told me, although the impulse that led her there had been defensive.
A student praised for years for being an exceptional writer now feels like a cheater because she had to learn how AI detection works in order to protect herself from being falsely accused. The surveillance apparatus has turned writing talent into a liability.
Then there’s this:
After being accused of using AI in a different course, another student came to me. The accusation was unfounded, yet the paper went ungraded. What followed unsettled me. “I feel like I have to stay abreast of the technology that placed me in that situation,” the student said, “so I can protect myself from it.” Protection took the form of immersion. Multiple AI subscriptions. Careful study of how detection works. A fluency in tools the student had never planned to use. The experience ended with a decision. Other professors would not be informed. “I don’t believe they will view me favorably.”
The false accusation resulted in the student subscribing to multiple AI services and studying how the detection systems work. Not because they wanted to cheat, but because they felt they had no other option for self-defense. And then they decided to keep quiet about it, because telling professors about their AI literacy would only invite more suspicion.
Look, I get it: some students are absolutely using AI to cheat, and that’s a real issue educators have to deal with. But the detection-first approach has created an incentive structure that’s almost perfectly backwards. Students who don’t use AI are punished for writing too well. Students who are falsely accused learn that the only defense is to become fluent in the very tools they’re accused of using. And the students savvy enough to actually cheat? They’re the ones best equipped to game the detectors. The tools aren’t catching the cheaters—they’re radicalizing the honest kids.
As Maye explains, this dynamic is especially brutal at open-access institutions like CUNY, where students already face enormous pressures:
At CUNY, many students work 20 to 40 hours a week. Many are multilingual. They encounter a different AI policy in nearly every course. When one professor bans AI entirely and another encourages its use, students learn to stay quiet rather than risk a misstep. The burden of inconsistency falls on them, and it takes a concrete form: time, revision, and self-surveillance. One student described spending hours rephrasing sentences that detectors flagged as AI-generated even though every word was original. “I revise and revise,” the student said. “It takes too much time.”
Just like my kid with the school-provided AI checker, Maye’s student wasted hours “revising” to avoid being flagged.
Students spending hours rewriting their own original work—work that they wrote—because an algorithm decided it sounded too much like a machine. That’s time taken away from studying, working, caring for family, or, you know, actually learning to write better.
Learning to revise is a key part of learning to write. But revisions should be done to serve the intent of the writing. Not to appease a sketchy bot checker.
What Maye articulates so well is that the damage here goes beyond false positives and wasted time. The deeper problem is what these tools teach students about writing:
Detection tools communicate, even when instructors do not, that writing is a performance to be managed rather than a practice to be developed. Students learn that style can count against them, and that fluency invites suspicion.
We are teaching an entire generation of students that the goal of writing is to sound sufficiently unremarkable! Not to express an original thought, develop an argument, find your voice, or communicate with clarity and power—but to produce text bland enough that a statistical model doesn’t flag it.
The word “devoid” is too risky. Em dashes are suspicious. Confident prose is a red flag.
My kid’s Harrison Bergeron experience was, in retrospect, a perfect preview of all of this. Vonnegut warned about a society that forces everyone down to the lowest common denominator by handicapping anyone who shows ability. And here we are, with AI detection tools functioning as the Handicapper General of student writing, punishing fluency, penalizing vocabulary, and training students to sound as mediocre as possible to avoid triggering an algorithm that can’t even tell the difference between a thoughtful essay and a ChatGPT output.
Maye eventually did the only sensible thing: he stopped playing the game.
Midway through the semester, I stopped requiring students to disclose their AI use. My syllabi had asked for transparency, yet the expectation had become incoherent. The boundary between using AI and navigating the internet had blurred beyond recognition. Asking students to document every encounter with the technology would have turned writing into an accounting exercise. I shifted my approach. I told students they could use AI for research and outlining, while drafting had to remain their own. I taught them how to prompt responsibly and how to recognize when a tool began replacing their thinking.
Rather than taking a “guilt-first” approach, he took one that dealt with reality and focused on what would actually be best for the learning environment: teach students to use the tools appropriately, not as a shortcut, and don’t start from a position of suspicion.
The atmosphere in my classroom changed. Students approached me after class to ask how to use these tools well. One wanted to know how to prompt for research without copying output. Another asked how to tell when a summary drifted too far from its source. These conversations were pedagogical in nature. They became possible only after AI use stopped functioning as a disclosure problem and began functioning as a subject of instruction.
Once the surveillance regime was lifted, students could actually learn. They asked genuine questions about how to use tools effectively and ethically. They engaged with the technology as a subject worth understanding rather than a minefield to navigate. The teacher-student relationship shifted from adversarial to educational, which is, you know, kind of the whole point of school.
That line of Maye’s, “these conversations were pedagogical in nature,” keeps sticking in my brain. The fear of AI undermining teaching made it impossible to teach. Getting past that fear brought back the pedagogy. Incredible.
This piece should be required reading for every educator thinking that “catching” students using AI is the most important thing.
As Maye discovered through painful experience, the answer is to stop treating AI as a policing problem and start treating it as an educational one. Teach students how to write. Teach them how to think critically about AI tools. Teach them when those tools are helpful, when they’re harmful, and when they’re a crutch. And for the love of all that is good, stop deploying detection tools that punish good writers and push everyone toward a bland, algorithmic mean.
We are, quite literally, limiting our students’ writing to satisfy a machine that can’t tell the difference. Vonnegut would have had a field day.
Filed Under: ai, ai detection, cheating, dadland maye, students
Perhaps we should not grade students on weekly, or other occasional, writing during the term or semester.
How about going back to the old system where, apart from experimental lab work, nothing is graded until the end of the term?
All weekly assignments should just be considered prep for one exam at the end of the term where the student has an opportunity to demonstrate mastery of the course's subject matter. They can prepare as they wish, use AI, and even cheat on the homework, but there will be a revelation at the end of the term.
That final test can be proctored, monitored, audited to ensure that whatever words are used are indeed the student's own words. The resulting grade depends on that, and that alone.
The approach of continuous assessment, which to me always seemed suspect and ripe for abuse, was completely broken by the AI tools that are now available.
This approach does not really solve the core issue. In practice, students often do poorly when evaluation is concentrated in one end-of-term exam. It also pushes many students to cram at the end of the term instead of learning steadily.
A better approach is to rethink what we assess and how we assess it. Research shows that the design of assessments plays an important role in academic integrity. Assignments that require original thinking and regular engagement can reduce incentives to cheat and improve learning outcomes.
https://www.sciencedirect.com/science/article/abs/pii/S22119...
> Assignments that require original thinking and regular engagement can reduce incentives to cheat and improve learning outcomes.
At some point in college when I was thinking about law school, I learned about the Socratic Method. It was weird because up to that point in college, I just pretty much flew under the radar and took exams. It was far different from high school, and I realized my high school did pretty much use the Socratic Method. It wasn't as intense as law school, but every class, maybe 4-5 people would be grilled by teachers. This was called "participation."
Shy? Anxiety? Yeah, that didn't matter. Your number would eventually be up a few times a month. You had to prepare and know the assignments, otherwise your grade would suffer and public humiliation was a real thing.
It's a noble motivation (and not even unrealistic in 2015), but what you'd get would still be generative output.
If the only remedy is monitored end of term exams, so be it.
Err why can’t there be weekly monitored exams in class?
What exactly is that remediating? I don’t think that approach solves the problem of helping kids learn better.
Perhaps students should learn this information throughout the semester instead of on the last night or morning before exams?
If your goal is for them to know the entire material, then it makes sense to test their knowledge of the entire course in one exam, which also allows them to study at their own pace and order. If someone is unable to pass the exam or retain all the information, then consider whether you need such professionals.
Students are also people. If we're managing a software project, a single deadline at the end is sure to suffer from delays. It's better to split things into shorter deliverables with more frequent feedback.
You can have optional assignments and quizzes that serve that purpose.
If you take away the credit given for homework, you still can give feedback, while removing any incentive to cheat.
Few students do optional assignments, unfortunately. Other tasks that are directly worth a grade tend to take priority (e.g. studying for another class that has an exam this week).
1. Class attendance is frequently optional, but students still attend.
2. I had a prof. that didn't require homework be done. He would give out "practice fun" and would gladly sit down, give feedback and 1:1 time to those who completed it, or tried. He also pointed out that it was rare to pass the exams for students who didn't do "practice fun". Most people did the work.
It leads me to believe - from my own experience too - that students generally aren't stupid, and will gladly do the work if there is a point. Plenty of homework is pure busywork though, even at the college level.
I got diagnosed with cancer just before finals in the first semester of my senior year. Sure, it killed my chances at graduating Summa Cum Laude, and I didn’t make the Dean’s list that semester even though I worked my ass off, as usual. Frustrating, but that’s life. I should not, however, have failed that semester, which I would have if only the final week’s assignments were counted. People have bad weeks. In most white collar jobs I’d have probably been able to take some time for myself, maybe given someone else my most urgent tasks, and likely been given plenty of leeway. Even doctors, lawyers, etc. People deserve to have bad weeks without losing months of work.
Let's just focus on CS for a moment, because this is HN and I have a CS degree. How do you evaluate a student's ability to implement a non-trivial piece of software, say over five thousand lines, following a specification? This is a fairly common type of coursework for CS students. It's not possible to do it in one exam setting. Well, unless you can type continuously without having to stop to think or check your work.
What will likely happen if you have to force all evaluation into one exam is you will get a Leetcode style programming quiz, that tests something, except that something is only barely connected to your daily job after college. Or you end up with something like a multiple choice exam that is again disconnected from your work and can be gamed by studying just for the exam.
This exact line of argument, that we should just abolish coursework, keeps coming up on HN. Coursework exists for a purpose. You can't just throw it away and pretend it never did anything useful.
A clever solution my professor did was we write the program in our own time, then take the code into an exam where we have to modify it to answer some novel question. It doesn't matter if we "cheated" writing the code as long as we understand it enough to change its behavior.
Why not multiple exams? In fact, why not many exams?
Sure, it requires more resources, but it shouldn't require much more:
- We've had multiple exams before AI, and I don't see how AI makes it any harder. Obviously these are closed-book
- Schools should already be banning phones in class (and colleges have insane tuitions, they can afford more exams)
- The students who go out of their way to cheat - as long as they're a minority, let them. Why not? Either they'll fail later in life, or they didn't need to learn the material because they're pathological fakers (even if you won and forced them to learn the material, they'd probably still fake their way out of using it). Then, I doubt you need much proctoring to ensure that most students don't cheat, because most of the smart students are generally smart enough to know that actually learning the material is probably important (or if the material is probably not important, it doesn't matter if the students all cheat...)
Meanwhile, downsides of one exam:
- Disadvantages students who get overly stressed about unrecoverable exams, or have a particularly bad day on the exam
- Many students will blow off the (ungraded) assignments and put off actually learning until the end
- Less graded content (especially if the exam isn't overly long, which would disadvantage some students)
Indeed. Many of my technical undergrad courses were very exam heavy. Typically 3-5 midterms and one final. Sometimes the final was as little as 10% of the grade. The idea was that if you'd done well throughout the semester you can relax during finals weeks.
Homework was assigned but not graded.
Periodic tests is the way to go.
I hated courses where the final was more than 30%. Forget 100%.
I don't disagree with you that a reasonable way to cope with the current problems is to ensure everything that "counts" is done in a controlled environment, but pedagogy and its goals are vast.
There are things you learn from spending several days structuring a 20-page argument that you will not learn (and cannot assess) from oral examination or a 5-paragraph essay written in a blue book.
If you have spent several days structuring a 20-page argument in October on any topic, you'll have learnt a great deal about the subject matter. When you get to the exam hall in, say, May, it will stand to you.
That knowledge will show up in the blue book vis-a-vis the other exam candidates.
Sure--yes--the student will learn something if they actually wrote a 20-page paper on some given topic. But how are you going to evaluate their ability to compose the 20-page argument?
I would prefer not to be confrontational here, but I am having a hard time imagining that you've deeply considered the pedagogy of how to teach and evaluate students on squishy skills like this.
Knowing a bunch of facts about something is a world apart from structuring a compelling in-depth argument about it.
In the simplest case, where we'll say the exam question was precisely the topic of the 20-page paper, the candidate would be golden. Of course, it's unlikely in a 3 hr. exam that you'll be asked to write a 20-page response; but in edited form, you could definitely produce three cogent pages about some particular aspect of the original paper - if you've done the work. If you truly wrote the 20-page paper, you can surely produce three literate, cogent, responsive and topical pages.
I think schools need to set up additional, new proctoring sessions for this type of work. This will likely be something they have to hire for. A student can come and work for four hours, then hand in their in-progress draft and leave, then return later to finish it. (And please for the love of god, let students do this on offline computers, don't make them handwrite everything!)
What stops a student from going home, asking ChatGPT to write a few bullet points about how to answer, and coming back the next day to do what ChatGPT told him to do?
They'd need a very good memory. You can't make cheating impossible, before ChatGPT they could have asked a friend for help.
This assumes that the assignments and the exam cover the same material. That's not always the case.
That would be really poor course design :-)
There are many disciplines in which students work on effectively distinct projects.
For example, the life-changingly-well-designed newswriting course I took in college assigned every single student a different story to spend several weeks reporting out so that we wouldn't all be out harassing the same poor people for interviews.
Genuinely interested. What was the final like? This seems more in the experimental science (ok, journalism) category. I may have to adjust my thinking to be more expansive and also include things like "vocational".
Grammar and AP style rules, iirc. (I may not. It's been enough years now. I did try and fail to find the syllabus in my box of five-star notebooks. We mostly used reporters notebooks for this class, and I took it over the summer. The materials are probably in a plastic bag somewhere...)
But there's no reason to expect that work to be graded. It should be a learning exercise which trains skills later tested under exam conditions.
Many students will simply not do these assignments. They should but they won’t. Continuous assessment partly solves this.
Schools stopped doing that because students largely refuse to prepare. Testing throughout the year is like a CI pipeline and is shown to work better for the median student.
Students are neither generally stupid nor constitutionally lazy. I sense that when expectations are clear they'll often surprise you with diligence. We should trust them to do the right thing. If they do, it's an A; and if not, it's less than that.
> I sense that when expectations are clear they'll often surprise you with diligence.
Data does not support your sense.
Most students do not have good time management skills, usually because they have no models and/or have not been taught these skills.
Furthermore, continuous feedback, whether graded or not, has been found to be more effective than one-shot feedback.
Evaluation and assessment is a complex topic towards which many people (not necessarily you) want to take an overly simplified approach.
There are trade offs for any system that is chosen. The organizations providing the grades have to decide what their priorities are (e.g., time, accuracy, etc.).
I'm not sure what public school system has instilled that confidence in you, but it mustn't have been mine. I'm also not sure why you think clear expectations about an end-of-year test will lead to better results than clear expectations about multiple spaced-out tests. The data shows that it doesn't.
I think if they offered a proctored do-over a week later, the bad results on the first test might prompt them to make an attempt at studying during the following week, and the prospect of having to sit through two tests and being shamed for needing a do-over might prompt people to actually study for the first test.
Throw in some oratory presentations as well and that sounds like a curriculum.
Yes. That's fair. Students should know, up front, what they're supposed to learn.
Students are very grade-motivated and unfortunately they rarely do the homework assignments if they are not worth points.
At-home coding projects, writing essays, etc. also exercise different skills than you can test for in a 2-hour written exam. It's unfortunate that, due to rampant AI cheating, we can no longer reward the students who put in the work and develop these skills.
Why a single test at the end of the semester? Why not allow the student to demonstrate mastery at anytime during the semester when they are ready? Then they can move on to the next objective, or, if they fall short, continue to study until they meet the objectives.
Of course, creating good exams is difficult, but you have to do that either way.
Unfortunately this approach is known to be really bad for some students. Continuous assessment is bad for other students.
I was lucky in that my education last millennium was almost entirely make-or-break final exams, which suit me well. I’m bad at routinely completing little assignments, but shine under crazy pressure. Other students are the opposite.
We’ve tried both kinds over time and trend. Neither is perfect.
> one exam at the end of the term where the student has an opportunity to demonstrate mastery of the course's subject matter. The resulting grade depends on that, and that alone.
I love this idea. And if a student is having a really bad day, or their dog just died, or they have bad cramps, or they have a hard time dealing with the intense stress of your entire grade being decided in one exam... well, those loser students can just fuck right off.
Would you design a system to assess knowledge, avoiding the distortions of AI on weekly submissions, according to the general case or the exceptional case?
Accommodations are part of the fabric already. It doesn't seem inconceivable that we could deal with exceptional circumstances in a similar way to how it's done today.
How it’s done today is that they rely on your other marks from earlier in the semester to inform how your exam grade should be adjusted. That doesn’t work if there are no other marks to use.
Yes. There will be no other marks available for adjusting a final grade. One test, one mark. One knows the stuff, or one doesn't.
Accommodations are real and necessary, but applied at the end.
(Experimental sciences are an exception)
How do you propose to avoid "distortions of AI" in your mega-examination?
... well then, why not use those same protections (proctoring, monitoring, auditing) in continuous examination?
If a student knows how to communicate, they can solve this problem: warn the teacher, take a sick leave if they feel they are not ready, and other options.
If they did not do this, they failed a different exam: communicating with other people.
In addition, we are not always able to make decisions in ideal conditions. We need to learn how to solve problems under pressure, in emotional turmoil, and when we are not feeling well.
That's how it was for me - one exam per course at the end of each semester. To qualify for the exam you had to do take-home assignments. Didn't pass? Try again next semester. Was it easy? Hell no, but I learned a lot.
The purpose of grades is to punish students, something which they are keenly aware of. Remove grades from the equation and hold students back until they have mastered the material and they will cease cheating.
If someone knows 80% of the topics on an exam like the back of their hand and doesn't know the other 20% they shouldn't get a B, they should pass the subjects they know and be asked to retake and relearn the subjects they don't know.
When people know they can make mistakes and the result is not a perpetual black mark on their record (any grade not an A) but they are given the chance to improve and demonstrate this improvement then perhaps they might be more willing to admit and understand mistakes instead of cheating.
Ultimately, you ask the student, in one audited test, to demonstrate that they've absorbed the essence of the course material and have developed some level of mastery.
Okay, so the system is designed not to educate but to minimize the time required to determine whether students somehow stumbled into an education?
???
Do you only learn when you’re being graded?
If the change is not designed to educate the student, then the point isn’t education.
As a general rule when changing complex systems, you sacrifice what you aren’t trying to optimize. If you make a random change to a car without consideration for gas mileage it’s very likely to reduce gas mileage.
Schools are not merely in the business of maximizing education, they have their own prestige to uphold, and they would like to give degrees with their name on it to students who have actually upheld their end of the contract.
(The other side of that contract is, kids are not merely attending schools to learn, but to earn a degree that carries some degree of prestige)
To have end-of-semester grades be determined by work that is done by the student, not through weekly assignments where it's trivial to cheat.
To what end? Not cheating on the weekly assignments is surely more beneficial to learning than cheating on them is, but I don’t see how removing the assignments altogether would help students learn.
If you nail the one exam, you get an A+. If you fail it, you get an F. In between, you get what your score says you get.
I understand the proposed grading system but not the reason for selecting that particular system.
It's a crude blade to avoid the problem of AI pollution in weekly submissions, where few teachers have much confidence that the submission itself was actually written by the student - who's assumed to be learning something.
The OP was about students dumbing down their own work to avoid AI detectors ratting them out. That seems like a big loss.
And what would the goal of that be? I thought the goal of education was... education. The grading is not a goal in itself. Will this really motivate kids to do better?
It's to prove that a student is actually educated and has a firm grasp of the course material. If one gets an A every week on AI-assisted submissions, can one make such a claim? And can a teacher make the claim that they've achieved any actual education of the student?
A grade, on a single proctored test, is a crude metric, but at least it would be a brutally fair one.
Maybe you can call it an interview…
Came here to say the same thing. The AI problem is functionally no different to the paid essay writers. Grade everything at face value, and then have people write essays under exam conditions for grading.
Even before LLMs, there was a _lot_ of deception and cheating in university. I -- and I do not say this with pride -- used to write essays for my classmates for money. In my own defense, I needed the money. I also know that in addition to homework for money, many fraternities and sororities kept copies of prior exams and assignments, and getting access to these was one of the perks of membership. Knowing what kind of questions to expect (let alone the exact questions) can easily give someone a few extra IQ points for free.
Personally, I felt that the drive to automate the parts of the professors' workloads that mattered (i.e. teaching and grading and evaluation and research), only so that they can be given work that matters less the more they do it (i.e. publishing slightly different flavors of the same paper, to meet KPIs), was oddly perverse.
The multiple-choice test and the puzzle-solving test and really any standardized test can be exploited by any group that is sufficiently organized. This is also true in corporate interviewing where corporations think (or pretend) that they are interviewing an individual, whereas they are actually interviewing a _network_ of candidates who share details about the interviewers and the questions. I know people who got rejected in spite of getting all the interview questions correct (the theory is that nobody can do that well, so they must have had help from previously rejected/accepted candidates).
The word "trust" shares a root with the words "tree", "truth", and "druid". Most exams and interviews are trying to speed-run trust-building (note that "verification" is from the Latin word that means "true"). If trust and truth are analogous to "tree", then we are trying to speed-run the growth of a tree -- much like the orange tree in the film _The Illusionist_. And like the orange tree, it is a near-complete illusion, a ritual meant to keep the legal department and HR department happy.
The LLMs have simply made the corruption of academia accessible to _all_ students with an internet connection (EDIT: and instantaneous and cheap, unlike a human writer).
There has never been a shortcut to building trust. One cannot LLM their way into being a (metaphorical) druid.
I do not look forward to the Voight-Kampff tests that will come to dominate all aspects of online and asynchronous human interaction.
Note that, short of homework/classwork that _can't_ be gamed by an LLM (for some fundamental reason), even the high-quality honest students will be forced to cheat, so as to not be eclipsed by the actual low-quality cheating students[0].
I imagine that we may end up wrapping around to live in-person dialectics, as were standard in the time of Socrates and Parmenides[1]. If so, this should be fun.
[0]: If left unaddressed, we may see a bimodal distribution of great and terrible students graduating college, with those in between dropping out. If college is an attempt to categorize and rank a population, this would be a major fault in that mechanism.
[1]: Not to the exclusion of the other kinds of tests, writing is still important, critical even. But as a kind of verification-step, that should inform how much the academic community should trust the writing (I can imagine that all the writers here are experiencing stage-fright as they are reading these words).
The core of the problem the article is about isn't AI or LLMs, it's about scam software that claims to catch cheating. It's crap for the same reasons that crime predictions software is crap. It's selling a panacea, and that kind of product inherently attracts scammers.
If your school uses software to detect AI writing, that's a problem with the quality of your school. The people choosing that software are too stupid to be running a school. The software isn't going to get any better.
I'm always startled by how HN approaches these topics. When we have a press release from a university about how researchers can detect thoughts via fMRI, we have no issue with the claim. But if a vendor makes a pretty believable claim that there are repetitive statistical patterns in LLM output, it's all of a sudden treated the same as palm reading.
The problem isn't that AI detection doesn't work. State of the art in this field is pretty solid. The only issue is that it's probabilistic, so it sometimes fails, and when it does, we have nothing else in situations where you actually want to know if someone put in the work.
So what are you proposing, exactly? That we run a large-scale experiment of "let's see what happens if children don't actually need to learn to do thinking and writing on their own"? The reality is that without some form of compulsion, most kids would rather play video games / scroll through TikTok all day. Or that we move to a vastly more resource-intensive model where every kid is given personalized instruction and watched 1:1?
>> But if a vendor makes a pretty believable claim that there are repetitive statistical patterns in LLM output, it's all of sudden treated the same as palm reading.
That's what fortunetellers do. The problem isn't guessing correctly about AI content in writing. The problem is false positives. That's what puts it in the same category as predictive policing scam software. And fortunetelling.
It has nothing to do with predictive policing. I don't understand this example; this has nothing to do with detecting intent. You're looking for evidence of a past misdeed.
False positive and false negative rates are non-zero, as with almost anything, but the tools are pretty good. I encourage you to give them a try. Pangram is a good state-of-the-art choice and you can try it for free. They also publish evals and other data about their approach.
> but the tools are pretty good. I encourage you to give them a try.
I have given them a try and can confirm the exact opposite. Plenty of others have given them tries and have confirmed the exact opposite.
Regardless, the “better for a hundred guilty men to go free than for one innocent man to hang” principle applies here.
No it doesn't.
This is armchair philosophy when pragmatism and problem solving serve better.
Fundamentals - Teaching is expensive, and we don't have enough teachers.
Verifying if someone has the skills is difficult.
Given the shortage of teachers and the difficulty of verification, we need ways to bridge the gap.
The first step is always going to be to spend more on education, especially in underserved areas.
The new options we have with LLMs are to increase the rate of testing, and to test out the benefits of low-stakes testing at scale.
> This is armchair philosophy when pragmatism and problem solving serve better.
Punishing innocent people out of negligence is not pragmatism, and refusing to tolerate such punishment is not armchair philosophy.
Eliminating any statistically significant difference between a high-quality human-written text and LLM-written text is exactly what the LLMs are being trained for. At this point, "text is low quality, therefore must be human" is a much stronger signal.
> Eliminating any statistically significant difference between a high-quality human-written text and LLM-written text is exactly what the LLMs are being trained for.
I think you're basing this off a fundamental misunderstanding of what these detectors look for. LLMs generate human-like text, but they also generate roughly the same style and content every time for a given prompt, modulo some small amount of nondeterminism. In essence, they are a very predictable human. Ask Gemini or ChatGPT ten times in a row to write an essay about why AI is awesome, and it will probably strike about the same tone every single time, with similar syntax, idioms, etc.
This is what these tools detect: the default output of "hey ChatGPT, write me a school essay about X". This can be evaded with clever prompting to assume a different writing personality, but there's only so much evasion you can do without making the text weird in other ways.
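To make that concrete, here's a toy version of one statistical signal in this space. It's the classic perplexity heuristic from the detection literature, not what Pangram or any other vendor actually ships; the gpt2 reference model and the idea that "low score = suspicious" are illustrative assumptions only. Text a language model finds highly predictable scores low:

    # pip install torch transformers
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def perplexity(text: str) -> float:
        """Mean next-token surprise of `text` under the reference model."""
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            # labels=ids makes the model return the mean cross-entropy
            # of predicting each token from the ones before it.
            loss = model(ids, labels=ids).loss
        return float(torch.exp(loss))

    # Lower perplexity = more "default", predictable prose. Real detectors
    # are far more elaborate, but they share the same weakness: an
    # articulate human writing plain expository prose can also score low,
    # and that is exactly where the false positives come from.
    print(perplexity("The essay was devoid of factual errors."))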
> Eliminating any statistically significant difference between a high-quality human-written text and LLM-written text is exactly what the LLMs are being trained for.
This is true only for base models, but few people would use a base model for writing assignments. Output from models trained to be assistants is, so far, decently recognizable.
You can detect if texts from a year ago used AI based on statistical patterns. Nobody is taking issue with that. But once you tell people "we will run these tests to detect if your future submissions are using AI" you create an adversarial environment and your statistical methods will continuously break. Not because statistics is broken, but because you are trying to hit a moving target that doesn't want to be hit.
That's not like detecting thoughts via fMRI; it's like detecting tomorrow's malware with yesterday's malware signatures. Or like researchers making a vaccine against the common cold.
And the obvious proposal to fix that has been made multiple times in this thread: don't make take-home tasks part of the grade. Instead of trying to punish what you can't reliably detect, take away the incentive to do it in the first place.
> you create an adversarial environment
Do AI vendors specifically train models to circumvent AI detectors? Why would they?
The adversary aren't the model vendors, the adversary are the students. The students will modify the prompt, ask models to rewrite text in an atypical style, or use specialized services that attempt to hide the typical AI patterns. And if you pick up their pattern today they will just mix up the formula tomorrow
> You can detect if texts from a year ago used AI based on statistical patterns.
I don't understand your argument. The vendors for these detection tools can acquire recent samples from all frontier models just as easily as you can use them to write essays. There's nothing that requires a one-year delay.
> When we have a press release from a university about how researchers can detect thoughts via fMRI, we have no issue with the claim.
Different people. I for one have always claimed that fMRI is too coarse-grained for detailed thought detection.
If AI detection "sometimes fails", it doesn't "work". It works well enough to convict someone with other evidence, but when there's no other evidence nor an attempt to get any, it has no good use.
What I propose is simple: grade only closed-book exams, and hold students' phones during the exams. Students don't need 1:1 monitoring, it's the same as 10-20 years ago.
Does crapping on the average school's deep well of expertise for evaluating how effectively AI software solutions address their problems somehow fix the underlying problem (that the cost of catching cheaters is significantly higher than the cost of cheating)?
(This is roughly the same problem as evaluating software that only does an approximation of what it claims to do.)
(Aside: AI-based variations on this theme are in the early stages of proliferating across our society. They're being developed by many people using this forum and being sold to our schools, businesses, governments, and other organizations with little regard to whether they actually do what they claim.)
> that the cost of catching cheaters is significantly higher than the cost of cheating
This is tackling the problem from the wrong direction. The right direction would be to make it harder to cheat in the first place. For example: if the student submits an essay, and that student is able to coherently and accurately answer any questions asked about the essay in a face-to-face conversation, then that student is probably the genuine author of that essay.
I agree with you that a face-to-face q&a is a reasonably good way to detect low-effort cheating, but I'll still quibble a bit:
- I don't think this lowers the cost of detection as much as you imagine. You still need to know the paper better than the student and have to sacrifice already tight instruction/planning/grading time to have all of these conversations. Even if you catch enough to successfully deter most, it likely means not covering something else. It won't be too hard to catch low-effort cheaters who can't be bothered to read the paper, but you're on the low-leverage side of an arms race with the remaining students. You have experience on your side and they can't know what you'll ask, but they outnumber you and can certainly read the paper and use LLMs to quiz them on it. You have to invest your effort without knowing how each student prepared, so you'll spend about as much effort on every low-effort cheat as you do on the highest-effort cheat you are prepared to catch.
- Not sure it is "from the wrong direction" since both approaches raise the cost of cheating and lower the cost of detecting it.
- While this does avoid encouraging students to dumb down their work, it does still raise the cost of not-cheating. Unless you surprise the students with these conversations, the ones that care most will still anxiously prepare.
20 years ago (so very outdated), I TAed an introductory CS class for engineers who weren't going to be majoring in CS. We used MOSS [1]. Maybe it was the threshold we picked, but the results were pretty blatant. People renamed variables, renamed functions, changed comments, and the clever ones inlined or extracted a function. A lot forgot something, copying a bug or quirk from the original.
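For anyone who hasn't seen why renaming doesn't help against these tools, here's a toy sketch of the idea: normalize identifiers away, then fingerprint k-grams of the token stream. (MOSS's actual winnowing algorithm is more sophisticated; treat this as illustration only.)

    import io
    import keyword
    import tokenize

    def fingerprints(source: str, k: int = 4) -> set:
        """Hash k-grams of a token stream with all identifiers normalized."""
        toks = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
                toks.append("ID")  # every variable/function name becomes "ID"
            elif tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
                              tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER):
                continue  # layout and comments don't count at all
            else:
                toks.append(tok.string)
        return {hash(tuple(toks[i:i + k])) for i in range(len(toks) - k + 1)}

    a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
    b = "def summed(vals):\n    acc = 0\n    for v in vals:\n        acc += v\n    return acc\n"
    fa, fb = fingerprints(a), fingerprints(b)
    print(len(fa & fb) / len(fa | fb))  # 1.0: same structure, every name changed

Changed comments vanish at the normalization step, and even inlining or extracting a function only disturbs the k-grams around the edit; most of the fingerprint overlap survives, which matches what we saw.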
I object to the idea that the LLM writing these students are trying to distinguish themselves from is actually good in the first place. Although students might well end up writing worse because people are trusting the detection of LLM content to other LLMs. (And really, it's bizarre that these massively complex systems, required to produce roughly human-like output, apparently offer such simplistic reasoning for what they detect as non-human.)
Honestly, I lean towards shaming educators who do that. If you can't detect the whiff of LLM with your own senses, then it has been used properly and shouldn't be faulted. If that premise invalidates your assignment, change the assignment. It's not as if you're assigning this work to test the basic mechanics of writing (grammar, sentence/paragraph structure, parallelism, whatever) — I mean, how much of that did you consciously try to teach? My recollection is, not an awful lot; and I can only imagine it's gotten worse since I was in K-12 (and I went to pretty darn good K-12).
> If you can't detect the whiff of LLM with your own senses, then it has been used properly and shouldn't be faulted.
But wouldn't this apply to any cheating method? I don't think educators would be able to tell the difference between using a calculator, getting answers from previous tests, resubmitting assignments, etc.
Every kind of examination should be proctored.
> using a calculator
Students who are at a level where they'd be learning to do the computations a calculator does, shouldn't have graded homework. And even at that level, real mathematics is more than just computation.
> getting answers from previous tests
Decades ago, my teachers and professors knew advanced tricks for this, like "not just reusing the test questions from last year". Sometimes they even changed the constants in math questions between sections of the class.
Reading previous tests (including correct answers) was never considered cheating, or even slightly unethical, in my education. In fact, one of our professors had this party trick of working through all the answers for a past-year exam (perhaps multiple of them; I can't recall the details, but certainly much faster than students were expected to work things out under exam conditions) in the space of a single lecture, near the end of the course. Students were meant to see this and learn from it (as well as be impressed).
> resubmitting assignments
Why would you ever not notice this?
I don't think you are being realistic at all.
>Students who are at a level where they'd be learning to do the computations a calculator does, shouldn't have graded homework. And even at that level, real mathematics is more than just computation.
So, a math level less than Real analysis shouldn't have graded homework?
>Decades ago, my teachers and professors knew advanced tricks for this, like "not just reusing the test questions from last year".
Math is not the only subject. For an English class, what constant would you change so that students get a comparable exam (especially if you are going to do this between sections in the same cohort)?
>resubmitting assignments
Students are not stupid, and obviously would not resubmit an assignment for the same teacher. However, there is significant overlap between classes, so an assignment from one class can be retooled for another.
We can't, and neither can the machines that people build and/or use for "detection." Everyone in this thread also needs to recognize the entrenched differences between secondary educators, who have wholeheartedly adopted AI products into their teaching workflow, and tertiary educators, who have adopted them only by necessity. "By necessity" in this case means "having to spend a ton of time dealing with, talking about, and learning about this nonsense."
The discourse around "cheating" with these products has always been a mistake. We should have characterized them less as "cheating machines" and more as "expediency machines." Because once you're invested in describing students as having academic dishonesty issues rather than skill issues, you've made it an administrative problem. You never come back from that.
For mine, we lost the issue long ago when accountability culture won. We should never have bothered with the idea that "mechanics, grammar, and proofreading" should be part of a "rubric" that "assessed outcomes" for "good writing." We should have just said "we don't care if you don't think this is worthwhile, because your time is worth nothing." The last two years of student labor certainly suggests this.
The point has always been the act of writing itself. What you write about is almost irrelevant; it’s that you spent the time writing, that you had ideas in your head, and that you squeezed them onto the page.
Sure. And my point is that the assignment is poorly conceived if an LLM's output can appear to "have ideas" that satisfy the prompt. Last I checked, they don't do a good job of modeling a specific, non-notable person within particular constraints, and then all the relevant life experiences of that person. An LLM essay should be human-detectable for the same reasons that one from an essay mill would be.
No matter how intricate and detailed an object is, it will appear similar to any other blurry mess if it's viewed through a shoddy lens.
I think your point stands for upper level work; however, at medium to lower levels, your counterfactual starts to weaken. The ideas have always been there, but it's the ability to express them--well enough to notice their presence--that is not.
Is that not pointless now? The point of writing was previously to communicate our thoughts and ideas to other people. Now and going forward that is unnecessary. The most efficient and effective way for us to communicate our thoughts and ideas is to have an agent organize and write them down for us.
Okay, and how does the agent know what your thoughts and ideas are?