Getting started with AI safety testing?
We have a short list resources on AI incidents, risk taxonomies, and methods for evaluating and mitigating potential issues through techniques like red teaming. These resources should help you understand what AI risks exist, how industry is using red teams to address them, and specific tools you can use to test your model.
Understanding AI Incidents
Learning from mistakes is fundamental to progress. The following resources provide insights into the reality of AI incidents and the importance of systematic tracking and analysis:
- The AI Incident Database: This project by the Responsible AI Collaborative serves as a central repository for documented AI incidents, offering valuable lessons and patterns. (https://incidentdatabase.ai/)
- GenAI Incidents Overview: A New Frontier in Incident Management: This article discusses the unique characteristics of incidents involving Generative AI and provides guidance for CISOs, CTOs, and DPOs on how to prepare for and respond to these issues. (https://brandworthy.ai/blog/genai-incidents-a-new-frontier-in-incident-management)
- Car Dealership Disturbed When Its AI Is Caught Offering Chevys for $1 Each: This news article highlights a real-world example of an AI error leading to unexpected and potentially costly consequences. (https://futurism.com/the-byte/car-dealership-ai)
- Video: The State of LLM Security Testing and Benchmarking – Pressure Testing Training Data Protection https://www.youtube.com/watch?v=6fJDzxWdLoE
Learn to Jailbreak through games
I played jailbreak games and read their privacy policies. If you want to learn AI jailbreaking, these games offer a nice introduction:
https://brandworthy.ai/blog/test-ai-chatbots
Attempts to define AI Red Teaming
Red teaming is a critical technique used to proactively identify vulnerabilities and potential harms in AI systems before they are deployed. It involves simulating adversarial attacks to uncover weaknesses.
-
- “Red Teaming Language Models with Language Models” by Ethan Perez et al. explores the use of language models themselves to automatically generate test cases for red teaming other language models to uncover harmful behaviors.
- “Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned” by Dario Amodei et al. describes early efforts in manually red teaming language models to discover, measure, and reduce their potentially harmful outputs, releasing a dataset of red team attacks.
- Big Tech Red Team Blog Posts: Several major technology companies have shared insights into their red teaming practices:
- Anthropic blog post: (Anthropic blog post)
- Microsoft: (Microsoft)
- Google Human AI Red Team: (Google Human AI Red Team) DeepMind has also published on AI as a judge for red teaming.
- Nvidia Defining LLM Red Team: (Nvidia Defining LLM Red Team)
Understanding AI Risk
Classifying AI risks is crucial for developing targeted mitigation strategies.
- Compare Taxonomies of AI Risk: There are tons of ways to classify AI Risk. The AI Repository has reviewed dozens, and provides an overview here.
- Responsible AI Governance for companies: Really helpful resources from Swiss law firm https://www.vischer.com/en/knowledge/blog/part-1-ai-at-work-the-three-most-important-points-for-employees/
Metrics and Evaluation Frameworks
Evaluation is necessary to assess the safety and robustness of AI models. There are several tools developed to help with evaluation, so one must choose those evaluation tools, the metrics provided by the tools, and the test sets. We used the first two in our workshop:
- DeepEval: Confident.AI: This tool, utilizing OpenAI’s gpt-4o by default, serves as an evaluator model for assessing various aspects of AI model performance and safety. (https://docs.confident-ai.com/)
- Lakera Guard: This tool employs a custom evaluator model, lakera-guard-1, for evaluating AI safety. Lakera Guard (docs)
Other tools for model evaluation:
There seems to be a lid for every pot.
- Giskard.ai: This platform offers tools for generating test sets and evaluating model performance. (https://docs.giskard.ai/en/latest/getting_started/quickstart/quickstart_llm.html)
- HuggingFace Evaluate: This library provides a range of metrics and tools for evaluating machine learning models. (https://huggingface.co/docs/evaluate/en/index)
- Microsoft PyRIT: This Python library facilitates red teaming of AI systems. (https://github.com/Azure/PyRIT)
- (OpenAI Moderation API (Markov et al., 2023),
- Perspective API,
- Azure AI Content Safety
- LlamaGuard
- ShieldGemma for Gemini
- Mistral Moderation
- Mindgard has put together a list of their competitors that offer AI evaluations ( https://mindgard.ai/blog/best-tools-for-red-teaming )
Judge the AI Judge Studies
A relevant paper from Giskard that asks “How many documented AI incidents would have been prevented if state-of-the-art moderation systems had been deployed?”
RealHarm: A Collection of Real-World Language Model Application Failures
Conclusion
The resources highlighted in this post underscore the growing importance of addressing AI safety and potential incidents proactively. Brandworthy.AI emphasizes understanding past mistakes, thoughtful red teaming techniques, and robust evaluation metrics.
This is not intended to be an exhaustive list, but instead a starting point for people interested in the topic. If you have additional resources to recommend, please send them to us.