Aligned large language models (LLMs) can largely follow verbal prompts to generate desired content. However, some control objectives, such as certain reward functions, cannot be described verbally, and LLMs' ability to follow instructions is imperfect, as demonstrated by hallucinations and jailbreak vulnerabilities. In these cases, fine-grained control of LLMs can achieve non-verbal objectives, enabling more tasks and enhancing the models' abilities to follow instructions and reason, ultimately maximizing their utility to humans.
In this proposal, we introduce a framework for fine-grained control and present two simple instantiations to demonstrate its potential. First, we provide a more unified understanding of controllable generation for LLMs from a sampling perspective, from which we derive a general framework for fine-grained controllable generation. Then, we instantiate the framework on two specific objectives: generating coherent jailbreak prompts and generating harmless prompts that trigger LLMs' false refusals. We also demonstrate on a classification task that allowing models to refine their outputs has the potential to enhance reasoning capabilities. Finally, we outline the steps required to fully implement this framework, with the ultimate goal of achieving fine-grained control over LLMs.
Sicheng Zhu is a PhD student at the University of Maryland, College Park, where he is advised by Prof. Furong Huang. His research focuses on trustworthy machine learning, including robustness to distribution shift and adversarial examples, and the intersection of these areas with geometric deep learning.