A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models
Metadata
Author
Peláez-González, Carlos; Herrera-Poyatos, Andrés; Zuheros, Cristina; Herrera-Poyatos, David; Tejedor, Virilo; Herrera Triguero, Francisco
Publisher
Elsevier
Subject
AI Safety; Jailbreak; LLMs
Date
2026-04-13
Bibliographic reference
Published version: Peláez-González, C.; Herrera-Poyatos, A.; Zuheros, C. [et al.]. (2026). A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models. Neurocomputing, 683, 133534. https://doi.org/10.1016/j.neucom.2026.133534
Sponsor
National Institute of Cybersecurity and the University of Granada (IAFERCib C074/23); European Union (Next Generation)
Abstract
The study of large language models (LLMs) is a key area in open-world machine learning. Although LLMs demonstrate remarkable natural language processing capabilities, they also face several challenges, including consistency issues, hallucinations, and jailbreak vulnerabilities. Jailbreaking refers to the crafting of prompts that bypass alignment safeguards, leading to unsafe outputs that compromise the integrity of LLMs. This work specifically focuses on the challenge of jailbreak vulnerabilities and introduces a novel taxonomy of jailbreak attacks grounded in the training domains of LLMs. It characterizes alignment failures as arising from gaps in generalization, objectives, and robustness.
Our primary contribution is a perspective on jailbreaking, framed through the different linguistic domains that emerge during LLM training and alignment. This viewpoint highlights the limitations of existing approaches and enables us to classify jailbreak attacks in terms of the underlying model deficiencies they exploit.
Unlike conventional classifications that categorize attacks based on prompt construction methods (e.g., prompt templating), our approach provides a deeper understanding of LLM behavior. We introduce a taxonomy with four categories (mismatched generalization, competing objectives, adversarial robustness, and mixed attacks), offering insights into the fundamental nature of jailbreak vulnerabilities. Finally, we present key lessons derived from this taxonomic study.





