News

MITRE and FAA Introduce Novel Aerospace Large Language Model Evaluation Benchmark

Published on September 18th, 2025
2 Minute Read

Aerospace Language Understanding Evaluation Benchmark Enables Thorough Evaluation of LLMs for Aerospace Tasks 

The Federal Aviation Administration (FAA) and MITRE are introducing a new benchmark to enable the evaluation and assessment of large language models (LLMs) for aerospace tasks. Given the safety-critical nature of aerospace, it is imperative that LLMs undergo thorough evaluation prior to their integration into systems.

The Aerospace Language Understanding Evaluation (ALUE) benchmark provides a crucial tool for guiding the assurance of LLMs tailored to the unique demands of the aerospace domain. It incorporates diverse datasets and tasks and introduces several metrics for evaluating the correctness of LLM-generated responses.

ALUE is designed to streamline and improve the evaluation and inference of LLMs using aerospace domain-specific information. The versatile benchmark supports custom datasets, open-source and domain-specific LLMs, user-defined prompts, and various quantitative performance metrics. Such evaluations are essential not only for assessing a model’s performance but also for understanding its inherent limitations and potential risks, including issues such as hallucinations, biases, and privacy concerns.
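The evaluation workflow described above can be pictured as a small harness: a curated dataset, a user-defined prompt, a pluggable model, and a quantitative metric. The sketch below is purely illustrative; the function names (`run_benchmark`, `exact_match`) and the toy data are assumptions, not the actual ALUE API.

```python
def exact_match(predicted: str, expected: str) -> float:
    """Score 1.0 when the normalized answers agree, else 0.0."""
    return float(predicted.strip().lower() == expected.strip().lower())

def run_benchmark(model, dataset, prompt_template, metric):
    """Format each example with the prompt template, query the model,
    and return the mean metric score across the dataset."""
    scores = []
    for example in dataset:
        prompt = prompt_template.format(question=example["question"])
        response = model(prompt)
        scores.append(metric(response, example["expected"]))
    return sum(scores) / len(scores)

# Toy aerospace Q/A dataset (illustrative content only).
dataset = [
    {"question": "What does METAR stand for?",
     "expected": "Meteorological Aerodrome Report"},
    {"question": "What unit is used for altimeter settings in the US?",
     "expected": "inches of mercury"},
]

def stub_model(prompt: str) -> str:
    # Stand-in for a real open-source or domain-specific LLM.
    answers = {
        "METAR": "Meteorological Aerodrome Report",
        "altimeter": "inches of mercury",
    }
    for key, answer in answers.items():
        if key in prompt:
            return answer
    return ""

accuracy = run_benchmark(
    stub_model, dataset,
    "Answer the aerospace question concisely: {question}",
    exact_match,
)
print(accuracy)  # 1.0 for this stub
```

Because the model, prompt template, and metric are all parameters, the same loop supports the custom datasets, user-defined prompts, and alternative performance metrics the benchmark describes.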

Ongoing work will continue to expand the benchmark’s complexity and scope to address more intricate real-world aerospace challenges. This includes developing tasks for extracting complex information from charts, such as airspace boundaries or navigational aids, which require sophisticated spatial and symbolic reasoning.

Future work will also incorporate tasks that require LLMs to consult external data sources, such as aircraft operational manuals, to determine precise parameters such as flap and thrust settings under specific conditions, moving beyond simple information extraction to knowledge application.
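The distinction between extraction and knowledge application can be sketched as follows: rather than generating a parameter from memory, the model consults an external source. Here a toy lookup table stands in for an aircraft operational manual; the aircraft types, conditions, and flap values are all invented for illustration.

```python
# Toy stand-in for an operational-manual table: (aircraft, runway condition)
# maps to a flap setting. Values are invented, not real performance data.
MANUAL = {
    ("A320", "dry"): "FLAPS 3",
    ("A320", "wet"): "FLAPS FULL",
}

def lookup_flap_setting(aircraft: str, runway_condition: str) -> str:
    """Retrieve the flap setting from the external source instead of
    having the model generate it, grounding the answer in the manual."""
    try:
        return MANUAL[(aircraft, runway_condition)]
    except KeyError:
        return "NOT FOUND: consult the operational manual"

print(lookup_flap_setting("A320", "wet"))  # FLAPS FULL
```

In a real task the model would first decide which document and conditions apply, then ground its answer in the retrieved text, which is what makes such tasks harder to evaluate than simple extraction.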

CAASD’s engineers, scientists, and analysts pair cross-disciplinary capabilities with deep mission-centric expertise to deliver impactful solutions to advance aviation and aerospace safety.

ALUE is available via GitHub to airlines, academia, and aerospace stakeholders who are using or considering using LLMs on aerospace data. Active community collaboration is key to enhancing the benchmark with additional curated datasets and tasks, and organizations can run the benchmark on their own machines. ALUE is a starting point for assuring sophisticated, reliable AI tools that enhance the safety and efficiency of the National Airspace System.

Reference: Aerospace Language Understanding Evaluation (ALUE): Large Language Benchmark with Aerospace Datasets, AIAA

Vincent Lambercy
Vincent brings 24 years of Air Traffic Management experience to the team. He founded FoxATM after 17 years in technical and sales roles within ANSPs and the ATM industry, and has strong technical and commercial experience in international projects.