
Reinforcement Learning tool for Large Language Model in Collaboration Games
Eligibility: UK/International (including EU) graduates with the required entry requirements
Duration: Full-Time – between three and three and a half years fixed term
Application deadline: 25 October 2025
Interview date: Will be confirmed to shortlisted candidates
Start date: January 2026
For further details contact: Professor James Brusey
Introduction
Self-play and reinforcement learning have been applied by DeepSeek R1 to help learn coding and maths tasks. This works because the answers are given in a single turn (one prompt and one response) and can be assessed automatically. However, for social deduction problems, where responses form actions that have long-term consequences, there isn't a well-established way to make use of self-play.
We propose to apply Reinforcement Learning (RL) approaches to the problem of identifying long-term conversational costs and benefits, in order to provide a steering mechanism for the LLM.
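As a rough illustration only, steering by estimated long-term value could be sketched as below. This is a minimal sketch under stated assumptions, not the project's method: the names `steer` and `toy_value` are hypothetical, and `toy_value` is a hand-written stand-in for a value function that would in practice be learned.

```python
from typing import Callable, Sequence


def steer(history: Sequence[str],
          candidates: Sequence[str],
          value_fn: Callable[[Sequence[str]], float]) -> str:
    """Return the candidate reply whose resulting conversation scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate reply")
    # Score each candidate by the estimated value of the conversation
    # that would result from appending it to the history.
    return max(candidates, key=lambda reply: value_fn(list(history) + [reply]))


def toy_value(conversation: Sequence[str]) -> float:
    """Hypothetical stand-in for a learned value function.

    It simply penalises a direct accusation in the latest utterance.
    """
    last = conversation[-1].lower()
    return -1.0 if "is a werewolf" in last else 0.5


history = ["Alice: Who was quiet last night?"]
candidates = ["Bob is a werewolf!", "I noticed Carol changed her story."]
chosen = steer(history, candidates, toy_value)
print(chosen)
```

Here the LLM itself is untouched: it only proposes candidate replies, and the value estimate selects among them.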
While much work with LLMs is computationally costly, we believe that the cost can be kept low, as the aim is to demonstrate an improvement through self-play rather than to achieve super-human performance. Furthermore, the constrained nature of the game will help keep compute costs down.
Although this work is focused on a specific game, it might be generalised to other situations. For example, the LLM can imitate the human response and thus gain an idea of how a conversation will play out. Value might be measured in terms of level of engagement or information transferred.
Project details
It is difficult to overestimate the enormous impact of Large Language Models (LLMs) on human society. Despite many impressive achievements, however, there are many areas where LLMs fall far short of human-level intellectual ability.
There are several possible avenues currently being explored to address this, such as:
- Improving the training data by: curating it, making it more diverse, or adding synthetic data;
- Increasing the number of model parameters (although this is already exceedingly large), or making the underlying networks more efficient;
- Increasing the number of tokens (allowing the LLM to think for longer);
- Improving the fine-tuning mechanism;
- Using some sort of hybrid approach that combines a basic LLM with tool use;
- Using agentic approaches to LLMs that take a divide-and-conquer approach to large problems;
- Adding modalities, such as sound or vision.
There are also a number of ways in which models can be made more efficient (while not necessarily improving intelligence), such as distillation. For example, the Llama 8B model is a distilled version of the Llama 70B model and thus is faster and can be run on less expensive hardware.
While there are many benchmarks for rating the intellectual capacity of an LLM, most are geared toward problems that can be answered in one "turn". The Werewolf / Villager game is a collaborative game where each player is assigned a role (werewolf or villager). The villagers don't know who the werewolves are, and the whole group engages in conversational reasoning to decide whom to "vote out". One example of where an LLM typically does poorly when playing a werewolf is in trying to cast suspicion on a villager without evidence. The villagers then take this as evidence that the accuser is a werewolf.
The overall aim of the project is to demonstrate an improvement in the quality of play from an LLM for a social deduction game without explicitly adjusting the LLM itself.
Programme and Methodology
The key research questions are as follows:
1. What negative behaviours (that lead to a loss) are observed when LLMs play social deduction games?
2. How can partial game logs be appropriately mapped into a hidden or latent state? Can this mapping be learned? How does this relate to learning a mapping from hidden state to value?
3. Can the value estimate so produced be used in self-play, and thus improve the assisted LLM's ability to play a social deduction game?
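To make questions 2 and 3 more concrete, the sketch below shows one way a value estimate might be learned from finished game logs using tabular TD(0). It is illustrative only and rests on loud assumptions: `encode` is a hypothetical hand-written mapping from a partial log to a tiny "latent state" (turn count and accusation count), whereas the project asks whether such a mapping can be learned; the logs and outcomes are invented toy data.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]


def encode(log_prefix: List[str]) -> State:
    """Hypothetical latent state: (turns so far, accusations so far)."""
    accusations = sum("accuse" in line.lower() for line in log_prefix)
    return (len(log_prefix), accusations)


def td0_update(values: Dict[State, float], log: List[str], outcome: float,
               alpha: float = 0.1, gamma: float = 1.0) -> None:
    """Apply one TD(0) sweep over every prefix of a finished game log.

    Intermediate rewards are zero; `outcome` (e.g. +1 for a win, -1 for a
    loss) is received on the final transition.
    """
    for t in range(1, len(log) + 1):
        state = encode(log[:t])
        if t == len(log):
            target = outcome  # terminal step: no bootstrapping
        else:
            target = gamma * values.get(encode(log[:t + 1]), 0.0)
        current = values.get(state, 0.0)
        values[state] = current + alpha * (target - current)


# Toy data (invented): one lost game full of accusations, one won game.
values: Dict[State, float] = {}
lost_game = ["I accuse Bob", "I accuse Carol"]
won_game = ["I was at the mill", "Carol's story changed"]
for _ in range(200):
    td0_update(values, lost_game, outcome=-1.0)
    td0_update(values, won_game, outcome=+1.0)

# Early states now carry information about the eventual outcome.
print(values[encode(lost_game[:1])] < 0 < values[encode(won_game[:1])])
```

The point of the sketch is that the final outcome propagates back to earlier conversational states, so an utterance can be judged by its long-term consequences rather than only at the end of the game.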
The planned programme of work is as follows:
1. M1--M6 Initial set-up, literature review, and tools training.
2. M7--M12 Social deduction framework development and initial experiments towards first RQ.
3. M13--M18 Initial variant of RL framework for social deduction.
4. M19--M24 Experimental development.
5. M25--M30 Resolve main research questions.
6. M31--M36 Writing up.
This is a joint PhD studentship between Coventry (UK) and Stellenbosch (South Africa); you will be registered for a PhD at both universities. The successful candidate will be based at Coventry University and will spend a few months at Stellenbosch (South Africa).
This project will leverage the complementary expertise of both supervisory teams in EEG signal processing and statistical physics. You will be jointly supervised by the Coventry and Stellenbosch teams.
Funding
Tuition fees and stipend
Benefits
The successful candidate will receive comprehensive research training including technical, personal and professional skills. All researchers at Coventry University (from PhD to Professor) are part of the Doctoral and Researcher College, which provides support with high-quality training and career development activities.
Candidate specification
- A minimum of a 2:1 first degree in a relevant discipline/subject area with a minimum 60% mark in the project element or equivalent with a minimum 60% overall module average.
- A Master's degree with a minimum mark of 60% in the dissertation.
PLUS
- The potential to engage in innovative research and to complete the PhD within 3.5 years.
- English language proficiency at a minimum of IELTS Academic 6.5 overall, with a minimum of 6.0 in each component.
How to apply
In the first instance please submit your expression of interest via the button below with a supporting statement detailing your suitability with evidence of the following:
• A background in computer science (or engineering), systems engineering, or physics/mathematics.
• Knowledge of machine learning techniques (demonstrated through successfully completed courses or projects).
• Proficiency in programming (preferably in Python).
• Ideally, familiarity with machine/deep learning, signal processing, dynamical systems, or mathematical modelling.
To find out more about the technical details of the project, please contact
Apply to Coventry University