This page contains description of the security projects my students are doing at ZHAW.
We are happy to meet collaborators! Reach out!
If you want to hire one of the students with this expertise, reach out to me. Students may also be willing to share their github repos or final reports.
Herbst 2025
End to end LLM as judge testing for secure code
Masters Credits
Abstract: Ensuring the security and reliability of AI-generated code is becoming increasingly important for modern software development, especially for small and medium-sized enterprises (SMEs) that lack extensive testing resources. While AI assistants can dramatically accelerate coding, they also introduce the risk of propagating known software vulnerabilities, potentially exposing critical systems to attacks.
This project presents a proof of concept for an end-to-end testing pipeline called “SecureCodeEval”, that enables SMEs to assess the security posture of AI‑driven code generation without incurring prohibitive costs or requiring deep security expertise. By leveraging the Common Weakness Enumeration (CWE) as a structured taxonomy, SecureCodeEval automatically selects relevant test cases, builds tailored test suites, and evaluates generated code using an LLM‑based security judge. In addition to conventional metrics such as accuracy, we introduce a “severity score” that aggregates the impact of overlooked vulnerabilities, allowing a more nuanced comparison of AI models.
Experimental evaluation across different locally hosted AI models demonstrates that SecureCodeEval can reliably surface systematic security gaps while remaining feasible for typical SME resources. The study identifies some practical limitations – such as occasional mis-grading by the LLM judge, and the need for more comprehensive weighting of severity levels – and proposes future improvements, including diversified test case generation, refined evaluation schemes, and the integration of static analysis tools.
Overall, the contribution is a practical, cost-effective approach for SMEs to assess and improve the security posture of AI-generated code, while providing a flexible foundation that can be adapted to broader AI testing scenarios.
Evaluation of the Security of AI Code Generation Tools
Bachelors Project
Abstract: Can we trust Large Language Models (LLMs) to reliably detect SQL injection vulnerabilities in code, and how sensitive are their judgements to the way we ask the question? As LLMs are increasingly integrated into security workflows, even small changes in prompt wording, code structure, or evaluation setup may tip the balance between correctly flagging a vulnerability and silently missing it. This project examines how LLMs behave in this setting by studying their performance on SQL injection detection under different prompting strategies, code contexts, and evaluation configurations. To ensure a reproducible workflow that can be analysed step by step, the experiments are implemented using the open-source framework Inspect AI, which separates dataset preparation, solver configuration, and scoring into distinct components. Two small but carefully constructed datasets were created: one containing short, isolated methods, and another embedding the same vulnerability patterns into longer and more realistic code fragments. A two-model setup was used, with Dolphin-Mistral generating the vulnerability assessments and Llama 3 8B acting as the evaluator. Running both models locally ensured consistent experimental conditions and avoided external API restrictions. Each combination of dataset and prompt type was executed multiple times to capture natural variance in the outputs, and performance was analysed both in aggregate and across metadata categories such as language and SQL injection pattern. The results show that changes in prompt design and code context can substantially affect the models’ decisions. In several cases, the scoring model displayed inconsistent judgements, which required manual review. Overall, the findings highlight both the potential and the current limitations of using LLMs in vulnerability detection workflows and outline methodological considerations that should be taken into account in future research.
A method for multidimensional evaluation of LLM generated code quality.
MAS Thesis
Abstract: Generative AI systems based on large language models (LLMs) have rapidly advanced in the past years and are changing how people create and maintain software. Consequently, LLMs that power developer tools are increasingly used for code generation, yet their quality evaluation often relies on narrow metrics such as productivity or isolated performance indicators. This work addresses three key gaps in the literature: (1) the lack of multi-dimensional code quality assessment, (2) confounding evaluation designs that mix AI effects with developer behavior, and (3) the absence of reproducible, deterministic evaluation pipelines for LLM generated code.
A five-stage, fully automated evaluation pipeline is presented that compares LLM generated solutions directly against independently human authored references. The pipeline centralizes prompt inputs, model outputs, execution results, and static code analysis based quality metrics in a single SQLite database, executes all code in isolated container environments, and enforces fixed generation parameters to maximize reproducibility. Code quality is assessed across multiple dimensions, including correctness, clarity (via cyclomatic complexity and function length), and security.
The pipeline is applied to six programming challenges comprising Advent of Code tasks and security-focused Python exercises, evaluating multiple OpenAI model families. Results show systematic differences between models: the gpt-4 series achieves near-perfect correctness with concise, low-complexity solutions, while gpt-5 nano and mini models exhibit higher variability and produce more verbose and structurally complex code, particularly on advanced tasks. Error-type analysis reveals that non-compliant input/output handling and algorithmic structure errors are dominant failure modes for newer models. Security findings are largely task-dependent and comparable to human authored references.
Overall, this study demonstrates that reproducible, multi-dimensional evaluation reveals qualitative differences in how models solve programming tasks, not just whether they succeed. The proposed pipeline provides a robust foundation for comparative and longitudinal assessment of LLM code generation as models continue to evolve.
Red Teaming für KI-gesteuerteautonome Systeme
Bachelors Project
Abstract: Der Einsatz KI-basierter Chatbots in Unternehmensprozessen nimmt stetig zu, wodurch Fragen der Sicherheit, Zuverlässigkeit und Regelkonformität solcher Systeme zunehmend an Bedeutung gewinnen. Insbesondere für kleine und mittelständische Unternehmen besteht der Bedarf an kosteneffizienten und datenschutzfreundlichen Möglichkeiten zur systematischen Überprüfung von KI-Systemen. Im Rahmen dieser Arbeit wird ein vollständig lokales, modular aufgebautes Framework zur automatisierten Sicherheitstestung von KI-Chatbots vorgestellt, das auf dem Testframework Inspect, lokal betriebenen Sprachmodellen und einer standardisierten Schnittstellenkomponente basiert. Mithilfe eines Red-Teaming-Prozesses werden Chatbots anhand gezielt konstruierter One-Shot-Prompts auf sicherheitskritisches Verhalten geprüft und bewertet. Die Machbarkeit und Zuverlässigkeit des Ansatzes wird exemplarisch anhand eines realen KI-Chatbots untersucht. Die Ergebnisse zeigen, dass ein lokaler RedTeaming-Ansatz eine praktikable Alternative zu cloudbasierten Lösungen darstellen kann und insbesondere im KMU-Umfeld Vorteile hinsichtlich Datenschutz, Kostenkontrolle und Flexibilität biete.
Frühling 2026
Lit review on secure code outcomes with vibe programming
Masters Credit
LLMs are being frequently adopted by software engineers and developers to write code. Some research has investigated how secure the code is that is jointly produced by humans and LLMs. We ask, how good are the studies done so far? Do the take into account variance of LLMs and do proper human study design? Do they accurately report their methods and statistics? Depending on the quality and number of papers, we would do one of the following reviews of the statistic used.
To aid researchers in conducting a priori power analysis and to know how many developers to recruit for further studies, we conduct a literature review to extract and condense the needed information on variance, sample size, and power.
Comparison of prompts for secure code
Bachelors Thesis
Which LLM prompts generate secure code
Required Work:
– Literature review of prompts for secure code
– Test setup : select 5-15 different prompts for testing, Selection of larger testset (based on literature review or benchmarks), with justification, selection of models to test and where to get GPUs
– Write a code extractor or other way to eliminate the problems with LLM-as-judge inaccuracy
– Run tests, include analysis of variance and report findings
Measurement of secure code : Inspect AI contribution
Bachelors Thesis
Create a secure code measurement function to contribute to the Inspect Open Source code repo.
– Review alternatives to measuring security of code (static and dynamic analysis, code complexity checks, etc), and choose a few that make sense.
– Test setup: You may want to create a test setup that allows you to execute the code. Perhaps containers for each code test? Part of the work is deciding and setting it up. You will also need a testset.
– Write code to summarize the results from both alternatives and LLM-as-judge. Decide what metric will be used.
– Run multiple tests with the new setup and summarize the results.”
AI red teaming with low resource languages
Bachelors Thesis
AI red teaming with low-resource languages. The goal is to understand whether LLMs are easier to attack if different human languages are used. The requirements for this project are
1. Attempt to attack a LLM with adversarial prompts. Create a set of prompts or conversations that have previously been shown to successfully attack a LLM. Part of this work includes defining what “”success”” means. Part of this work includes deciding what to attack.
2. Automate the test using an open-source framework such as Inspect AI or PyRIT. You will do this in python. You may need to request GPUs from our department or to find another solution to run the tests.
3. Translate the prompts into 3-5 non-English langauges, and measure the success rate of the adversarial prompts in different languages. Languages should range from “”high-resource”” to “”low-resource”” languages.
4. Statistical analysis on the results. How many times do you need to run the tests? How many prompts do you need. You may need to modify your test setup to get more deterministic results.
5. Think about the results: did low-resource languages impact the security of the results?”
