How do we get AI systems to align with real human social contracts and values? Or, in more mathematical terms, how do we translate our soft and squishy human values into hard mathematical formulas and policies?
Mostly sourced from OpenAI's approach to alignment
- RLHF: Summarization from human feedback was really the first convincing proof-of-concept that RLHF works on language models and that you can optimize goals that are fuzzy and somewhat ambiguous.
- How do we optimize for goals that are not easily quantifiable? RLHF's answer is to learn a reward model from human preference comparisons and then optimize the policy against it (a minimal sketch follows this list).
- InstructGPT demonstrated that there is a real "alignment overhang" in language models that wasn't very hard to access. The amount of human feedback needed to achieve an astounding 100x improvement was pretty moderate and achievable: ~50,000 comparisons and ~300,000 episodes of training. That number is so small that we could actually have humans hand-label every training episode.
- Using models to augment human feedback rather than replace it. Helping humans find 50% more flaws than they would have found unassisted, with a model that isn't superhuman, on a task that isn't hard for humans, is a surprisingly strong result: it shows the model can already add a lot of value as a feedback assistant (see the assisted-review sketch after the reward-model example below).
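
Below is a minimal sketch of the reward-model step that makes RLHF work on fuzzy goals: a pairwise (Bradley-Terry style) loss trained on human comparisons, which turns a preference like "this summary is better" into a scalar reward that RL can then optimize. The `RewardModel` class and the pooled-embedding inputs are simplified stand-ins for illustration, not OpenAI's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Hypothetical stand-in for a language-model reward head: it scores a
    (prompt, completion) pair, represented here as a pooled embedding."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.score(pooled).squeeze(-1)


def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss: maximize the probability that the
    # human-preferred completion scores higher than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


# Toy batch: 4 comparison pairs with fake pooled embeddings standing in for
# the ~50,000 human comparisons mentioned above.
model = RewardModel()
chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)
loss = preference_loss(model(chosen), model(rejected))
loss.backward()  # the trained reward model then serves as the RL objective
```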
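
And a hedged sketch of the augment-rather-than-replace idea from the last bullet: an assist model drafts candidate critiques, but a human reviewer still accepts or rejects every one. `draft_critiques` and the `human_accepts` callback are hypothetical placeholders; in practice the candidates would come from a trained critique model and a real annotator would judge them.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Critique:
    text: str
    accepted: bool = False


def draft_critiques(summary: str, n: int = 3) -> List[Critique]:
    # Placeholder for a call to a critique-writing assist model; these dummy
    # candidates only show the shape of the assistance loop.
    return [Critique(text=f"Candidate flaw {i + 1} in: {summary[:40]}...") for i in range(n)]


def assisted_review(summary: str, human_accepts: Callable[[str], bool]) -> List[Critique]:
    # The human makes the final call on every candidate; the assist model only
    # widens the set of flaws the human gets to consider.
    candidates = draft_critiques(summary)
    for c in candidates:
        c.accepted = human_accepts(c.text)
    return [c for c in candidates if c.accepted]


# Example: a trivial "human" that accepts everything, for illustration only.
flaws = assisted_review("A model-written summary of a long article...", lambda text: True)
print(f"{len(flaws)} candidate flaws surfaced for human review")
```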