Written by Rebecca

Getting started with AI safety testing? 

We have a short list resources on AI incidents, risk taxonomies, and methods for evaluating and mitigating potential issues through techniques like red teaming.  These resources should help you understand what AI risks exist, how industry is using red teams to address them, and specific tools you can use to test your model.

Understanding AI Incidents

Learning from mistakes is fundamental to progress. The following resources provide insights into the reality of AI incidents and the importance of systematic tracking and analysis:

Learn to Jailbreak through games

I played jailbreak games and read their privacy policies.  If you want to learn AI jailbreaking, these games offer a nice introduction:

https://brandworthy.ai/blog/test-ai-chatbots

 

Attempts to define AI Red Teaming

Red teaming is a critical technique used to proactively identify vulnerabilities and potential harms in AI systems before they are deployed. It involves simulating adversarial attacks to uncover weaknesses.

Understanding AI Risk

Classifying AI risks is crucial for developing targeted mitigation strategies.

Metrics and Evaluation Frameworks

Evaluation is necessary to assess the safety and robustness of AI models.  There are several tools developed to help with evaluation, so one must choose those evaluation tools, the metrics provided by the tools, and the test sets. We used the first two in our workshop:

  • DeepEval: Confident.AI: This tool, utilizing OpenAI’s gpt-4o by default, serves as an evaluator model for assessing various aspects of AI model performance and safety. (https://docs.confident-ai.com/)
  • Lakera Guard: This tool employs a custom evaluator model, lakera-guard-1, for evaluating AI safety. Lakera Guard (docs)

Other tools for model evaluation:

There seems to be a lid for every pot.

Judge the AI Judge Studies

A relevant paper from Giskard that asks “How many documented AI incidents would have been prevented if state-of-the-art moderation systems had been deployed?”

RealHarm: A Collection of Real-World Language Model Application Failures

Conclusion

The resources highlighted in this post underscore the growing importance of addressing AI safety and potential incidents proactively. Brandworthy.AI emphasizes understanding past mistakes, thoughtful red teaming techniques, and robust evaluation metrics. 

This is not intended to be an exhaustive list, but instead a starting point for people interested in the topic. If you have additional resources to recommend, please send them to us. 

Need help recovering from an AI incident? 

Name*
Email*
Phone
Message
0 of 350

About the Author

Dr. Rebecca Balebako has helped multiple organizations improve their responsible AI and ML programs. With over 25 years experience in software, she has specialized in testing privacy, security, and data protection.