tl;dr: For increasingly capable AIs to gain widespread adoption, they will have to earn our trust by reliably doing what we want in the ways we want. Four promising areas for ensuring reliability and corrigibility in AI products are: refining AI behavior, interpreting models, evaluating models, and building systemic guardrails to mitigate the harms of misbehaving AIs.
Recent progress in AI is extremely exciting and has the potential to change the world for the better. However, in order for AI products to gain deep traction, they will have to earn our trust in the domains that matter the most to us. Andy summarized many of our questions around trust and AI in a recent post. As he points out, different business models, data policies, security systems, and use-cases will demand different levels of trust from the end-user.
One area I’ve been thinking more deeply about is how we’ll come to trust that AI products will do what we want in the way we want. Given their incredible capabilities, we will increasingly cede responsibility to AIs. But as black boxes, there’s no guarantee that they’ll act exactly how we’d like. How can we get sufficiently comfortable with an AI’s “character” before empowering it to be an agent in the world?
As AIs are deployed into consumer products and enterprise tools, companies will risk losing their customers’ trust if they can’t avoid especially aberrant behavior. Parents won’t want their children exposed to inappropriate content or manipulated for nefarious purposes; businesses won’t want artificial agents spending resources inefficiently or undermining their brand. In order for AI to scale to its full potential, we will need better technologies for ensuring AI reliability and corrigibility.
There are four areas that seem especially promising in this regard: refining AI behavior, interpreting models, evaluating models prior to deployment, and building systemic guardrails to mitigate the harms of misbehaving AIs. The first three of these are products that could be sold to AI companies looking to build or preserve trust, and the fourth is something that non-AI-native companies will need to deliver a great experience to their users in a world full of AI spamming, hacking, etc.
It’s useful to think of AIs as, in the extreme, akin to random number generators. In that analogy, if you’re hooking them to an important process, you’d want to constrain the set of possible outcomes as much as possible without sacrificing usefulness. Without any further specification, a sales bot trained on the set of language on the internet might find itself cursing out potential customers. Instead, you’d want to refine its vocabulary and tone to that of an effective sales agent (without making it worse at understanding language or otherwise selling products).
One example of this approach is Reinforcement Learning from Human Feedback (RLHF), which incentivizes a model to shape its outputs based on judgements from human evaluators. RLHF drove ChatGPT’s viability as a consumer product by forming the model outputs into conversational responses and by helping avoid potentially controversial content. This is probably the most developed of the four areas, and we’re excited about further progress in behavioral refinement techniques like Anthropic’s Constitutional AI and enabling technologies like Scale AI’s RLHF labeling tools.
Though techniques like RLHF can help make an AI behave more reliably, the goals and behaviors an AI develops will always be an imperfect approximation for how we want it to act: it only receives finite data and feedback, after all. Since we can’t perfectly specify how it should act in any given situation without losing generality of capabilities, it would be helpful if we could “read the AI’s mind” to understand how it represents our goals for it.
You could imagine an AI doctor trained to output diagnoses based on symptoms. This AI doctor might get a diagnosis wrong like any ordinary doctor might, which is to say it might follow the prescripts of accepted medical reasoning to an incorrect conclusion. However, the AI doctor might also get a diagnosis wrong because it simply “hallucinates” a connection between symptoms and a disease that no doctor – or human – would suggest. In such a situation, it would be helpful to contextualize a diagnosis by “reading the AI’s mind” to verify if its reasoning resembles that of a competent doctor. More broadly, companies might be better empowered to preempt problematic AIs if they can check how the AIs are reasoning about the tasks they face.
The field of interpretability is still nascent, but recent advances with smaller-scale models makes us excited about the potential for commercial applications. Our hunch is that the minimum viable interpretability product is much simpler than a complete “brain scan”. A tool as narrow as detecting whether an AI thinks it’s being truthful – without otherwise understanding its “brain” state – could help companies avoid manipulative behavior.
It’s hard for companies – let alone consumers – to vet the models they’re relying on for themselves. With everything that could go wrong with giving a black box the power to act on your behalf, customers of AI products will most likely want some validation that the product won’t end up in a predictable failure mode. We’re already seeing this dynamic with the Alignment Research Center (ARC) red-teaming OpenAI’s models prior to public release and thereby lending OpenAI a stamp of approval they can reference at launch. Though ARC is doing interesting work, they operate in a fairly bespoke way that doesn’t obviously scale to a world with many orders of magnitude more AI products being deployed each year. We’re interested in what a more scalable version of model evaluations might look like.
The first three areas I mentioned are all tools that AI companies can use to build a trusted brand. In a world with misbehaving AI agents running around, however, other companies will have to prove resilient to AI manipulation in order to maintain their brands. In the same way that Cloudflare helped websites develop a reputation for reliability by making them robust to DDoS attacks, we think there will be new tools to promote resilience to malicious AI behavior. New AIs might be able to get around CAPTCHAs (perhaps by asking for help on TaskRabbit), develop better phishing and spoofing strategies, and so much more. The internet’s infrastructure is not yet prepared for a world of AI agents interacting with websites on humans’ behalf.
Of course merely making AI products do what we want in the way we want isn’t sufficient for earning our trust. We hope to dive into the other issues around AI and trust in more detail in future blog posts.