Evaluating large language models: model architectures, training regimes, and data selection

Contact person: Yves Scherrer
Keywords: large language models, evaluation, benchmarking    
Research group: Language Technology Group (LTG)
Department of Informatics

In recent years, (generative) large language models have become the core foundation models for a wide range of traditional NLP tasks, and they have also seen widespread adoption by the general public. At the same time, little is known about the specific training setups of commercial models, and some design decisions (in terms of model architecture, training regime, and data selection) are based on tradition rather than on empirical or theoretical considerations. Moreover, most current LLMs rely heavily on English training and evaluation data, and their performance on non-English languages remains difficult to assess. Potential candidates are expected to formulate their research project within the broad area of LLM evaluation. Examples of research topics are given below.

Methodological research topics:

  • Compare fine-tuning externally pre-trained LLMs with training language-specific LLMs from scratch.
  • Compare encoder-decoder LLMs with decoder-only LLMs.
  • Evaluate generative LLMs on various text generation tasks, such as summarization, simplification, and text normalization (a minimal evaluation sketch follows this list).
  • Assess the multilingual capabilities of LLMs (e.g. machine translation) as well as their cross-lingual transfer abilities.
  • Investigate how closely related low-resource languages can best be accommodated in LLMs.
  • Develop benchmarking datasets for LLM evaluation.
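
As an illustration of the evaluation and benchmarking topics above, the following Python sketch scores a generative model on abstractive summarization with ROUGE. It assumes the Hugging Face transformers and evaluate libraries (the latter with the rouge_score backend) are installed; the model name t5-small and the toy document/reference pairs are placeholders for whichever system and benchmark dataset are actually under study.

# Minimal sketch: evaluating a generative LLM on summarization with ROUGE.
from transformers import pipeline
import evaluate

# Toy document/reference pairs; a real benchmark would load an existing
# dataset or a newly constructed language-specific test set instead.
documents = [
    "The city council approved the new budget after a long debate on Tuesday.",
    "Researchers released a multilingual language model trained on web data.",
]
references = [
    "City council approves new budget.",
    "New multilingual language model released.",
]

# Candidate system; t5-small is only a placeholder for the model under study.
summarizer = pipeline("summarization", model="t5-small")

# Generate candidate summaries; length settings are illustrative.
outputs = summarizer(documents, max_length=20, min_length=5, truncation=True)
predictions = [out["summary_text"] for out in outputs]

# Compare system outputs against references with ROUGE.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

The same pattern (generate outputs, then compare them against references with a task-specific metric) carries over to simplification, text normalization, or machine translation by swapping in a suitable metric such as SARI or chrF.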

Mentoring and an internship will be offered by a relevant external partner.