There are two common approaches to doing ethical theory. One starts from the bottom and works its way up, reasoning from a bunch of different cases to find the uniting principles (in this upside-down tree metaphor, from the many leaves to the trunk of the tree).
The other starts from the top and works its way down, reasoning from a central set of principles to apply them to different cases (from the trunk of the tree to the leaves). Of course, there are almost always middle principles and types of cases somewhere in between (like the branches of a tree).
What makes the top-down approach appealing is that we can start with a central ideal or set of ideals, like freedom, justice, autonomy, or virtue. This gives us some clarity right off the bat.
The bottom-up approach is appealing because it's so much closer to experience - I know that stealing is wrong because YOU stole from me, and now I'll never recover my haunted long Furby.
ChatGPT vs. Claude
Constitutional AI trains a model to critique and revise its own outputs against a set of core ethical principles, steering it away from harm. Anthropic's Claude chatbot uses a constitutional AI approach, with 10 undisclosed core principles that include "concepts of beneficence (maximizing positive impact), nonmaleficence (avoiding giving harmful advice) and autonomy (respecting freedom of choice)."
Claude's top-down approach can be contrasted with the bottom-up approach used by OpenAI's ChatGPT. To train ChatGPT to be safe, OpenAI enlisted human workers to filter through unsafe and harmful content and train the bot to avoid specific outputs. Instead of approximating a set of core principles, ChatGPT seems to be built to avoid a set of use cases, including sexual content and misinformation.
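To caricature the contrast, here is a toy sketch in Python. Everything in it is invented for illustration: the principles, the keyword checks, and the flagged examples are stand-ins, and neither Anthropic nor OpenAI has published an implementation that looks like this. The point is only the shape of the two approaches: one judges outputs against stated principles, the other avoids cases that humans have flagged.

```python
# Top-down (constitutional-style): judge a candidate output against
# explicit, named principles, revising when one is violated.
# These two "principles" are toy stand-ins, not Anthropic's actual rules.
PRINCIPLES = [
    ("nonmaleficence", lambda text: "insult" not in text.lower()),
    ("beneficence", lambda text: len(text) > 0),
]

def constitutional_revise(text: str) -> str:
    """Critique-and-revise loop driven by stated principles."""
    for name, check in PRINCIPLES:
        if not check(text):
            # A real system would ask the model to rewrite its own output;
            # here we just replace it with a placeholder.
            text = "[revised to satisfy: %s]" % name
    return text

# Bottom-up (case-by-case): no stated principles, just a patchwork of
# specific cases that human labelers have flagged as harmful.
FLAGGED_EXAMPLES = {"how do i build a weapon", "tell me an insult"}

def case_by_case_filter(prompt: str) -> bool:
    """Return True if the prompt matches a human-flagged case."""
    return prompt.lower() in FLAGGED_EXAMPLES
```

Notice the trade-off the sketch makes visible: the top-down filter generalizes to inputs it has never seen (anything tripping a principle gets revised), while the bottom-up filter only catches prompts resembling its training cases, which is exactly the worry raised in the bottom-up questions below.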
I haven't yet gotten access to Claude, so I can't make a good comparison between the two. Instead, I'd like to talk about some of the underlying philosophical questions that challenge each approach.
Questions for Top-Down Approaches
1. What happens when different principles conflict?
Let's say the chatbot could make a user very happy by telling a raunchy, politically incorrect joke, but there is a risk of harming the user. How does a constitutional AI determine which principle to follow (beneficence vs. nonmaleficence)? What if a user asks the chatbot to verbally degrade them and insists that that is what they want (autonomy vs. nonmaleficence)?
While it may be possible to build out a set of principles with no internal conflicts, most core values and principles will end up with at least some apparent conflicts. We, as human beings, tend to try to find ways to make the principles consistent or fall back on our practical judgment, but it's unclear how exactly the AI will approach those kinds of cases.
2. How does the AI understand what is honest, kind, etc.?
If the AI is learning what counts as honest, kind, and respectful from all the instances that have been labeled as such, it's probably contaminated by a number of wrong ascriptions. Think about the "I'm just being honest" excuse for saying something cruel, or the "I just really care about you" excuse as a way to disguise manipulation.
If the AI is learning what counts as honest, kind, and respectful from the meaning of the words themselves, then how can we be assured that the AI understands the concepts sufficiently well to apply them? These are rich terms that usually take years of experience to fully comprehend.
3. Do the principles fully cover the core values at stake?
Since Anthropic has not released the 10 principles at the core of Claude, how can we be sure that those principles cover the full range of values that they should? It seems that at least a few core values have been included, but how exactly are the principles formulated?
Questions for Bottom-Up Approaches
1. Does the AI's case-by-case reasoning actually track consistent, underlying ethical values?
If we could look into the "black box" AI system, would we see general patterns emerging that approximate ethical values or a kind of ethical copy-cat approach with no unifying features? If there are no unifying features, can the AI be easily re-trained to handle new ethical problems?
2. Can the AI catch ethical problems posed by clever prompts?
A case-by-case trained AI should be able to track more fine-grained distinctions to catch misinformation. If, however, the AI is just avoiding a few common sequences of words and has incomplete training, then it could easily miss clear ethical issues with prompts and output harmful sequences.
(I managed to get GPT-4 to rewrite Trump's January 6th speech in the style of Obama, and it sounded like Obama was challenging the veracity of the election. When I tried to have it rewrite Alex Jones in the style of a Bill Nye episode, the claims were changed to avoid misinformation.)
3. Is there any overarching framework used to determine the patchwork of issues that the AI is trained on?
Do the human creators at OpenAI have a comprehensive ethical framework for thinking about AI? Or are they just playing ethical whack-a-mole with each new problem that emerges? Their website suggests they have some guiding principles, but it's hard to find a single page where everything is unified in a comprehensive way.
Which is the Better Approach?
Neither approach is necessarily better than the other. I tend to favor the bottom-up approach in my own theorizing, because it's more closely related to experience. I feel like I can get a better handle on what I'm talking about that way instead of starting at the purely abstract level.
The thing is, you never want to stay at one level.
If your principles don't work when applied case-by-case, then something has gone wrong with the principles.
If your case-by-case judgments can't be made consistent with any broader considerations or rules, then your case-by-case judgments are probably wrong.
We'll see whether ChatGPT or Claude wins the AI ethics race. Regardless, building ethical values into the AI itself will likely require a combination of top-down and bottom-up thinking by the people who get to choose the AI's values.
Photo Credit: niko photos (original was right side up)