Or: How I learned to stop worrying and love the bot
By Jim Lynch
For this website, the editor of Contingencies wanted a follow-up to the recent piece I wrote about ChatGPT and similar large language models (LLMs) (“Artificial Untelligence,” May/June 2023).
In that article, I was skeptical that the bot—which develops content based on user input and a huge bank of knowledge it has been fed—would develop into a writer worth reading.
A lot has changed in a few weeks.
The bot has gotten much smarter (though it flunked another actuarial exam). It still makes stuff up to escape a rhetorical jam, so you never know whether its facts are fancies.
Yet, people are MacGyvering ways to use LLMs to solve their problems. The bot does its job, but that work is not a final product. It is reviewed, molded, adapted, and sometimes ignored on the way to a final response.
I’ve changed, too. I laid out a rigid way to write an article—research/design/write. I didn’t consider that any worker uses the tools available. When you are writing, it helps if an aide can write 400 words per minute, even if few or none of those words make the final product.
The commotion has the attention of regulators worldwide, including the National Association of Insurance Commissioners (NAIC). I’ve added a few words about that.
Treat this article as an update. Per the deck above: I’ve stopped worrying because the bot is evolving so rapidly and people are adapting so quickly that the bots and our ability to use them will advance at warp speed.
The bot got smarter … kinda
The original ChatGPT was a free chatbot in beta testing (“research preview” in the words of its originator, OpenAI). It’s a leader in generative artificial intelligence—the ability of artificial intelligence (AI) to write or draw or otherwise create new content.
ChatGPT got its brainpower and eloquence from GPT-3.5, a version of OpenAI’s large language model. Think of GPT-3.5 as Cyrano de Bergerac and ChatGPT as the handsome soldier to whom Cyrano whispered.
This Cyrano, though, had little of his namesake’s eloquence or intelligence. It garbled facts and muddled rhymes.
In March, Cyrano got an upgrade: a $20-a-month product called GPT-4. It is a lot smarter. GPT-4 scored in the 90th percentile on a simulated bar exam, OpenAI reports. The original bot was in the bottom decile.
Of course, OpenAI created all the GPTs, so a thumb might be on that scale. Not so with researchers from University of California, San Diego; Bryn Mawr College; and Johns Hopkins University. They found that the bot outperformed physicians responding to questions posted on a subreddit. The bot was more empathetic, too.
It’s encouraging that a bot can outperform credentialed randos. The last computer touted as a diagnostic tool, IBM’s Watson, never got that far.
GPT-4 hallucinates, just like the original—and, now that I think of it, “hallucinates” might be a charitable term. It can get stuff plain wrong.
I tested GPT-4 by using Microsoft’s Bing search engine. It’s tucked into the upper-right-hand corner of the Edge browser. Click the Bing icon, and a narrow window drops down. “Ask me anything,” it implores.
That’s how I found a study showing that a generative AI improved the productivity of customer service representatives by 14%. True to itself, the bot hallucinated. It said Bloomberg News wrote the study. But Bloomberg only published an article summarizing it. The study itself was written by researchers from Stanford and MIT.
Bing will also generate an essay on any topic. I asked for a brief, comical blog post about Will Levis, the college quarterback who was passed over in the first round of the NFL draft, which surprised a lot of football experts. I know it was trying to be funny, but Bing told me Levis:
- wasn’t drafted at all (he was the first pick in the second round)
- had thrown for 3,000 yards and 26 touchdowns in 2022 (2,406 and 19)
- had taken up banana peeling, stripping 10 in a minute with one hand (he had actually eaten an unpeeled banana for his Instagram feed, which is funny enough)
In more serious work, GPT-4 scored 19.75 out of 52.5 on the Casualty Actuarial Society’s Exam 9 (financial risk and rate of return), about half the pass mark of 38.5. David Wright, a blogger and podcaster, administered the exam. He earned his ACAS and learned the principles of Exam 9 in an insurance position a few years back.
The bot’s strength, he wrote on LinkedIn, is its ability to regurgitate information. CAS exams have been developing more interpretive exam questions, something that will tend to thwart AI and maintain the value proposition of the actuarial profession.
Loving the one you are with
The shortcomings inspire workarounds.
Wright publishes a blog, embarrassinglies.com, that rewrites actual Securities and Exchange Commission press releases to, with a straight face, accommodate nonsense that Wright inserts. It “self-generates satire,” he said.
“David A. Hedges sat down with the interviewer, his eyes wild and fidgety. ‘Did you read the disclosure? Did you see what those damn regulators are doing to us?’ he spat. ‘They’re making us adopt this CECL thing and it’s costing us a million bucks just like that! It’s highway robbery!’ The interviewer cleared his throat. ‘Well, Mr. Hedges, it’s actually a new accounting standard that many companies are adopting. And while it may be costing you a bit up front, it’s meant to better reflect and prepare for potential credit losses in the future.’ David blinked a few times, as if coming out of a trance. ‘Oh, right. That…I knew that,’ he said, smoothing his tie. ‘It just seems like a lot of money, you know?’ The rest of the interview proceeded without any further hallucinations or outbursts.”
In satire, buffoonery amplifies insight. Wright’s insight: Generative AI translates complexity into a narrative that we can understand. Rather than abandoning nonsense, the bot adjusts the world to fit the narrative.
For most, hallucinations are a bug. Wright makes them a feature.
“People call it a bullshit machine,” he said. “But bullshit has to be plausible. If you can take incoherence and make it coherent, that’s interesting to think about.
“Truth is not a black-and-white thing in most circumstances,” he said. GPT-4 forces us to confront that.
He has a point. Google for information today, and you get whatever Google chooses to spit at you—a litany of purchased AdWords and sly SEOing. You are left to dig out the answer to your question, then hope the answer is correct.
Bing, to its credit, footnotes its search-engine essays. And looking at those footnotes, you’re more likely to consider the source. For example, I wonder why my question about GPT-4 cites Digital Trends and ZDNet but not OpenAI, which, after all, created the bot itself.
I can’t solve that mystery, but I can look at the articles the bot cribs from. Click back to them, and you can verify accuracy.
That’s better than Bard, Google’s attempt at catch-up. It gave me a good-looking summation of how ChatGPT and GPT-4 differ. I asked for sources. It turned vague: official websites of ChatGPT and GPT-4; articles and blog posts; and “my own research and experimentation with ChatGPT and GPT-4.”
Input is the limiting factor on generative AI. It can’t be more accurate than the information it is fed. Generative AI would seem to operate best where language is most precise.
That would be the law. Legal definitions have been forged over centuries. Grammar has been straitjacketed; a million-dollar case can turn on the use of an Oxford comma. Legal writing is the closest thing we have to narrative datasets.
Daniel Schwarcz is among a team of researchers documenting how generative AI will, they believe, thrive at law firms. First, they showed that old-school ChatGPT, though an unimpressive student, would pass four courses at the University of Minnesota Law School, where Schwarcz is on the faculty. The newer GPT-4 scores in the 90th percentile on the bar exam.
That level of quality means the bots could, if well-directed, write legal documents. Memos, briefs, contracts, motions, opinions: The law requires more writing than any other discipline. A bot trained strictly on legal matters could handle much of it—perhaps all.
In their paper, “AI Tools for Lawyers: A Practical Guide,” they documented a process that keeps the bot from straying. The method seems applicable outside the law, too.
They constrain the bot at the beginning of the process and the end.
At the outset, they carefully construct prompts to guide the bot, “requiring AIs to cite and quote from specific source material the lawyer directly provides.” They insist the bot answer as a specific jurist—say, Harvard Law School professor Cass Sunstein—would.
Instead of accepting the bot’s first answer, they take something closer to a Socratic approach. They cajole the bot, telling it when it is correct or incorrect, then ask again.
At the end, they verify the bot’s work, “in much the same way that they would cite-check human work product.”
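The constrain-then-verify workflow above lends itself to a sketch. The minimal Python below is illustrative only, not code from the researchers’ paper: build_prompt() assembles the kind of prompt they describe, pinning the bot to supplied source material and a named persona, and check_quotes() mimics the final cite-checking step by flagging any quoted passage that does not actually appear in the source. The function names and message format are assumptions for the sake of the example.

```python
# Sketch of the constrain-then-verify workflow (names are illustrative).
# The model call itself is omitted; these helpers show the first and
# last steps around it.
import re


def build_prompt(source_text: str, question: str, persona: str) -> list[dict]:
    """Construct chat messages that pin the bot to supplied source material."""
    system = (
        f"Answer as {persona} would. "
        "Rely only on the source material below. "
        "Quote any passage you use verbatim, in double quotes.\n\n"
        f"SOURCE MATERIAL:\n{source_text}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]


def check_quotes(draft: str, source_text: str) -> list[str]:
    """Return quoted passages in the draft that do NOT appear in the source."""
    quotes = re.findall(r'"([^"]+)"', draft)
    return [q for q in quotes if q not in source_text]
```

The Socratic middle step would sit between the two: any passage flagged by check_quotes() becomes feedback to the bot (“your second quote is not in the source; try again”) before the answer is accepted.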
“I don’t think it will replace lawyers,” Schwarcz told me. Humans need to understand the law to guide successfully. “But it will affect the way that tasks are done and the efficiency with which they are done.”
There are ethical considerations, note Schwarcz and co-author Jonathan H. Choi. Law students shouldn’t use the bot, because they need to demonstrate their own legal competence. On the other hand, in the professional world, lawyers might be ethically bound to use bots if the increased efficiency lowers legal costs.
One darker concern: Lawyers need to worry about what client information they feed bots. They write: “OpenAI and Microsoft’s assurances about data security and confidentiality will be insufficient for many legal settings.”
The buzz around ChatGPT—100 million users in three months gave it the fastest growth rate in internet history—naturally attracted regulatory scrutiny. Italy banned the bot for about a month, citing privacy concerns. Now users can deny the bot access to their conversations as training tools.
China’s Cyberspace Administration laid out draft rules. The rules focus on output; content needs to reflect the values of socialism and should not challenge the power of the state.
The U.S. insurance world has grouped the issues ChatGPT presents with other high-tech concerns. All of this harks back to the late teens and the emergence of insurtechs, according to Kathleen Birrane. She is Maryland’s insurance commissioner and chair of the Innovation, Technology and Cybersecurity Committee of the NAIC. It is the first new NAIC committee in more than a decade and, as the eighth overall, is commonly referred to as the H committee.
In general, she said, regulators recognize that insurance companies are using increasingly sophisticated tools—“We get that.”
Around a dozen NAIC working groups have been grappling with different aspects of the issue. They must balance the desire to innovate with the risk that comes with, as Birrane said, “incredibly complex processes with lots of room for error.”
The generative AI concerns parallel those of other insurtech innovations:
- Data privacy and security. “Who collects the data?” she said. “What do they do with it?” “Have you secured the data?”
- Methodologies. Are the methods accurate? Do they meet antidiscrimination laws? How reliable is data from external sources?
The H committee is developing a bulletin, deliverable this year, on fairness and antidiscrimination within the new tech world.
The bulletin will be principles-based, Birrane said, rather than a model law. It will focus on risk management, governance, and testing. It won’t seek to regulate third parties but will set expectations that companies should be following.
Regulators are focusing less on new rules for emerging technology. Instead, they emphasize that standards apply, regardless of new techniques.
“It doesn’t matter what methods you are using,” Birrane said. “The rules say you have to be accurate and precise when you are giving information to consumers.”
JIM LYNCH, MAAA, FCAS, is a freelance writer.