July 6, 2023

Are we there yet? “Hackathon mode” and the pursuit of reliability with LLMs

Human comprehensibility leads to human ingenuity.  

I don’t want to understate the scale of the technical breakthrough that OpenAI has achieved with GPT-3 and GPT-4. But it’s easy to miss how much work the “chat” part of the phrase ChatGPT is doing.  

It’s an overlooked insight that the flowering of amazement associated with the current generations of large language models is directly linked to the user-experience breakthrough of allowing anyone to chat directly with the model.

Recursively, OpenAI used its own model to make its model legible to the world.  

This is a very FUN breakthrough.  

Three weeks ago, I asked ChatGPT to write an AML policy for a fintech (generic but passable) and then asked it to translate the policy into Japanese. Interestingly, it had to be coaxed into completing the task (several clicks of "continue generating"), but the contrast with my own painstaking efforts to translate much less complex material when I studied Japanese in college was a joy in and of itself.  

Last week, my seventh grader wanted more fill-in-the-blank practice problems on complex verb tenses – do you remember what the past perfect progressive tense is? GPT does, and it was much quicker to generate practice problems than I would have been. (GPT is both creative and a fast typer!) Somewhat hilariously, despite three separate prompts, GPT kept putting the correct answers right after each sentence – almost as if it couldn't resist the urge to demonstrate its own competence, much like a certain seventh grader!  

In a recent survey, I found that most people (56%) are using GPT this way – really just playing around.  

But in startup land, the real value is what I would call "hackathon mode." In hackathon mode, a single (usually junior) developer builds a novel feature they have no relevant experience with. In a day or two the build is done, whereas before it would have taken a small team weeks or even months. Nearly 37% of respondents to my survey say that they are using ChatGPT this way, including many who otherwise describe themselves as non-technical.

This is a very EXCITING breakthrough.  

Largely because of "hackathon mode," at QED we believe that the companies most likely to benefit from the LLM/generative AI moment are the very ones most threatened by it.  

Usually that threat is framed around the potential that new "open" models will become so powerful that they obviate the need for specialized models and erode the power of proprietary data advantages. The power of GPT-4 in particular suggests a kind of reasoning ability, though the underlying model does not include any explicit logical engine. Some are projecting that this common sense is going to wipe out narrowly specialized AI and other vertical SaaS. But my sense is that we are so awed by the passing of the Turing test that we're overlooking how difficult non-language tasks can be.

Predictions can be hard, especially about the future, but most of the data in the world is actually not public and not accessible to LLMs. This is especially true of financial data, so I think we are still years away from LLMs or other publicly trained models displacing the value of proprietary data, training informed by domain expertise, and managed feedback loops. Finance may be a particularly hard nut to crack here, because those ingredients will still be necessary to achieve reliable performance in any domain where the downsides to wrong decisions are high. LLMs must be combined with specialized models, at least for this next stage of implementations.  

But there is another pernicious dynamic: the excitement around "hackathon mode" causes people to overlook obvious drivers of value and the business thresholds that must be crossed for software to be effective. Builders overrate their own ability to "build it quickly" and underrate the challenges of perfecting and maintaining a feature.

In financial services, the point of any system is to have a reliable representation of the world in the form of data. In underwriting or fraud, a mistake costs real money – the quantifiable impact of a mistake changes, quite literally, by an order of magnitude each time it moves one digit to the left or the right. Automation and straight-through processing require accuracy. I once had a founder pitch me that their solution was 100% accurate 80% of the time!

Two of my companies, Ocrolus and Ntropy, announced a partnership that is not only a triumph over "hackathon mode" for each of them, but one that also gives their customers the benefits of proprietary data and fit-for-purpose models.  

Both Ocrolus and Ntropy have dedicated data science and machine learning teams – AI is what both of them DO – and they also recognize that GPT-4 would enable each of them to create a rudimentary, "slide-ware" version of the other's product in a few weeks. But "slide-ware" is not good enough for lending and fraud fighting. These functions require Ocrolus and Ntropy's customers to make predictions, so mistakes will happen. But mistakes at the level of data input must be squeezed out. And since both companies build accuracy testing into their products by default, they recognize that excellence is worth partnering for.  

Ntropy's core business is ingesting transaction data feeds and enriching that data to give lenders a deeper view into their customers' behavior. While lenders can get this data from aggregators, many borrowers still prefer to provide PDF documents directly, and unfortunately aggregation services break more often than many realize – many lenders find that 20-80% of their pipeline requires PDF documents.  

Using Ntropy and Ocrolus together allows lenders not only to run a unified pipeline for digital and non-digital applicants, but also to take full advantage of their marketing and sales – no longer wasting spend on applicants stopped by unnecessary friction in the underwriting process.  

For Ntropy, the decision to partner with Ocrolus should have been easy (I'm on the board of both companies!), but they still went through exhaustive testing on latency and accuracy – and Ocrolus was still the right choice. Its human-in-the-loop approach provides exactly the guardrails that companies need to embed AI into a core business process.

So, while I'm as excited about our eventual destination as anyone, like any dad in the front seat this summer, my answer to the inevitable question "are we there yet?" is "not yet." Value still matters. Proprietary data, models that are fit for purpose, and teams that thoughtfully include LLMs to improve margin without sacrificing quality are the recipe for success.