Automating the Initial Development of Intent-Based Task-Oriented Dialog Systems Using Large Language Models: Experiences and Challenges
Metadata
Publisher
Tech Science Press
Subject
Task-oriented dialog systems; Large language models (LLMs); RASA
Date
2026-03-12
Bibliographic reference
Kharitonova, K., Pérez-Fernández, D., Callejas, Z., Griol, D. (2026). Automating the Initial Development of Intent-Based Task-Oriented Dialog Systems Using Large Language Models: Experiences and Challenges. Computers, Materials & Continua, 87(2), 43. https://doi.org/10.32604/cmc.2026.075777
Funding
MICIU/AEI/10.13039/501100011033 and FEDER, UE; Universidad Politécnica de Madrid (UPM) and University of Granada (UGR) - (PID2023-150584OB-C21) (PID2023-150584OB-C22)
Abstract
Building reliable intent-based, task-oriented dialog systems typically requires substantial manual effort: designers must derive intents, entities, responses, and control logic from raw conversational data, then iterate until the assistant behaves consistently. This paper investigates how far large language models (LLMs) can automate this development. We use two reference corpora, Let’s Go (English, public transport) and MEDIA (French, hotel booking), to prompt four LLM families (GPT-4o, Claude, Gemini, Mistral Small) and generate the core specifications required by the Rasa platform. These include intent sets with example utterances, entity definitions with slot mappings, response templates, and basic dialog flows. To structure this process, we introduce a model- and platform-agnostic pipeline with two phases. The first normalizes and validates LLM-generated artifacts, enforcing cross-file consistency and making slot usage explicit. The second uses a lightweight dialog harness that runs scripted tests and incrementally patches failure points until conversations complete reliably. Across eight projects, all models required some targeted repairs before training. After applying our pipeline, all reached ≥70% task completion (many above 84%), while NLU performance ranged from mid-0.6 to 1.0 macro-F1 depending on domain breadth. These results show that, with modest guidance, current LLMs can produce workable end-to-end dialog prototypes directly from raw transcripts. Our main contributions are: (i) a reusable bootstrap method aligned with industry domain-specific languages (DSLs), (ii) a small set of high-impact corrective patterns, and (iii) a simple but effective harness for closed-loop refinement across conversational platforms.
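The first-phase validation described in the abstract (enforcing cross-file consistency and making slot usage explicit) can be sketched in miniature as follows. This is an illustrative assumption, not the authors' implementation: the artifacts are reduced to plain Python dicts, and the names (`domain`, `undeclared_slots`, `utter_confirm`) are hypothetical. The check flags response templates that reference slots never declared in the generated entity/slot definitions.

```python
import re

# Hypothetical LLM-generated artifacts, reduced to plain dicts for illustration.
domain = {
    "slots": ["city", "checkin_date"],  # slots declared by the generated entity definitions
    "responses": {
        "utter_confirm": "Booking a hotel in {city} from {checkin_date}.",
        "utter_ask_nights": "How many nights will you stay, {num_nights}?",
    },
}

def undeclared_slots(domain):
    """Map each response name to any slots it references that the domain never declares."""
    declared = set(domain["slots"])
    problems = {}
    for name, template in domain["responses"].items():
        used = set(re.findall(r"\{(\w+)\}", template))
        missing = used - declared
        if missing:
            problems[name] = sorted(missing)
    return problems

print(undeclared_slots(domain))  # → {'utter_ask_nights': ['num_nights']}
```

A repair step in this spirit would either declare the missing slot or rewrite the offending template before training, which is the kind of targeted pre-training fix the abstract reports all models needed.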





