How Cohere Scaled Preference-Based Fine-Tuning for Enterprise LLMs

Introduction
Aligning LLM performance with human values is a key differentiator in today's competitive AI market. However, operationalising human feedback at scale while maintaining high-quality inputs and low latency poses several challenges. To address this growing demand, Cohere built PANDA Plus, a program for preference data generation and reward signal development, and partnered with Appen to source expert annotators, support real-time model feedback, and deliver human-centric LLM training data for both experimental and production fine-tuning. Appen enabled scalable, high-quality data generation and real-time annotation for PANDA Plus, supporting Cohere in improving their generative large language model, Command.
About Cohere
Cohere is the leading security-first enterprise AI company. They build cutting-edge AI models and end-to-end solutions designed to solve real-world business problems. Their flagship generative LLM series, optimised for secure enterprise deployments, is called Command. Leading enterprises in regulated industries trust Cohere with customer-facing and internal support use cases, so it is essential that the model produces helpful, safe, and brand-aligned responses across diverse domains from retail to banking. Maintaining this high standard requires continual reinforcement learning and fine-tuning with reliable, domain-relevant human feedback.
To accelerate Command's performance, Cohere developed Preference Annotation Data Acquisition Plus Supervised Fine-Tuning (SFT), also known as PANDA Plus. This program improves model performance by collecting structured human preference data and editing the preferred response to better satisfy Command's principles and the user's instructions. Cohere collaborated with Appen to scale this system across live models while maintaining quality and adaptability.
1. Project Goals
PANDA Plus integrates real-time model evaluation and editing into Cohere's training loop. Each task presents annotators with two model completions for a given prompt and asks them to (see the record sketch after this list):
- Choose the more helpful or aligned response
- Optionally edit a completion to better reflect ideal model behaviour
- Provide justification and qualitative feedback
- Rewrite completions for supervised fine-tuning (SFT)
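Taken together, these fields map naturally onto a single structured annotation record. The sketch below is purely illustrative, assuming a simple Python dataclass; the field names are hypothetical and do not represent Cohere's or Appen's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreferenceRecord:
    """Illustrative shape of one PANDA Plus-style annotation (hypothetical field names)."""
    prompt: str                                  # instruction or chat turn shown to the annotator
    completion_a: str                            # first model completion
    completion_b: str                            # second model completion
    preferred: str                               # "a" or "b": the more helpful or aligned response
    edited_completion: Optional[str] = None      # optional edit reflecting ideal model behaviour
    justification: str = ""                      # free-text rationale and qualitative feedback
    sft_rewrite: Optional[str] = None            # completion rewrite used as a supervised fine-tuning target
    task_variant: str = "instruction-following"  # e.g. "chat-continuation" for multi-turn tasks
```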
Cohere partnered with Appen to:
- Ensure consistent, high-quality annotations from contributors with LLM experience
- Reduce latency for model feedback using Appen's real-time delivery system
- Support dynamic task variants (e.g. chat continuation, open-ended instruction-following)
- Enable both experimental and production-ready training cycles
2. Challenges
A. Finding Qualified Annotators
Cohere required annotators who were familiar with LLMs, could deliver high-quality data, and could onboard efficiently. Appen provided Cohere with a vetted pool of 200 US English-language contributors, prioritising those with prior LLM/RLHF experience.
B. Prioritising Quality over Volume
Unlike traditional annotation pipelines, PANDA Plus emphasised handling time and fidelity over throughput. This required tuning incentive structures and managing contributor pacing to optimise for thoughtful, context-aware edits.
C. Real-Time Feedback Loop
PANDA Plus required a live connection to Command's API, enabling contributors to evaluate model outputs in near-real time. Appen adapted its AI Chat Feedback Tool to interface with PANDA Plus, including dynamic preambles, prompt routing, and response comparison.
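To illustrate what such a live loop can look like, the sketch below uses Cohere's public Python SDK to request two candidate completions for side-by-side comparison. It is a minimal sketch under stated assumptions: the API key, sampling settings, and the fetch_candidate_pair helper are hypothetical, and the production integration ran against Command's own inference endpoint rather than this exact call.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # hypothetical key; the real integration hit Command's endpoint directly

def fetch_candidate_pair(prompt: str, preamble: str, chat_history: list) -> tuple[str, str]:
    """Sample two completions so an annotator can compare them in near-real time."""
    candidates = [
        co.chat(
            message=prompt,
            preamble=preamble,          # dynamic preamble routed with the task
            chat_history=chat_history,  # earlier turns for chat-continuation tasks
            temperature=0.7,            # non-zero temperature so the two candidates differ
        ).text
        for _ in range(2)
    ]
    return candidates[0], candidates[1]
```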
D. Supporting Model Evolution
Cohere fine-tuned a production-grade model using Appen-generated preference data, while parallel PANDA Plus tasks fed into ongoing experimental variants. This required Appen to maintain annotation consistency across shifting model checkpoints without compromising data structure or quality.
3. The Appen Solution
Step One: Expert Contributor Pipeline
Appen assembled a domain-qualified contributor pool tailored for PANDA Plus. Contributors were trained to evaluate:
- Usefulness, safety, and tone
- Instruction adherence and domain relevance
- Opportunities for refinement or escalation
Appen contributors performed:
- A/B preference ranking
- Multi-turn chat continuation scoring
- Freeform feedback for tooling and prompt iteration
- Complex prompt and preamble writing
- Completion rewriting for "perfect" SFT inputs
Step Two: Tooling and Real-Time Delivery
The PANDA Plus workflow was delivered through a custom deployment of Appen's AI Data Platform (ADAP), with enhancements including:
- Direct integration with Command鈥檚 inference endpoint
- Multi-turn prompt/response workflows
- Structured fields for ranking, editing, and justification
- Weekly batch summaries and daily live data streams (an illustrative delivery format is sketched after this list)
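Reusing the PreferenceRecord sketch from earlier, the snippet below shows one way a daily live data stream could be serialised for the training loop, pairing the preferred (or edited) completion with the rejected one for each prompt. The file name, field names, and the export_daily_stream helper are assumptions for illustration, not Appen's actual delivery format.

```python
import json
from datetime import date
from typing import List

def export_daily_stream(records: List["PreferenceRecord"], out_dir: str = ".") -> str:
    """Write one JSON line per completed annotation for a day's live data stream (illustrative format)."""
    path = f"{out_dir}/panda_plus_{date.today().isoformat()}.jsonl"
    with open(path, "w", encoding="utf-8") as fh:
        for r in records:
            preferred = r.completion_a if r.preferred == "a" else r.completion_b
            rejected = r.completion_b if r.preferred == "a" else r.completion_a
            fh.write(json.dumps({
                "prompt": r.prompt,
                "chosen": r.edited_completion or preferred,  # edit-based supervision takes precedence
                "rejected": rejected,
                "sft_target": r.sft_rewrite,                 # optional "perfect" rewrite for SFT
                "justification": r.justification,
            }) + "\n")
    return path
```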
Appen contributors logged over 2,400 expert hours in 12 weeks, enabling Command's training loop to incorporate human feedback in near-real time.
4. Results
High-Confidence Fine-Tuning Data
PANDA Plus data contributed directly to the Command model, with multiple fine-tuning runs leveraging human preference signals collected by Appen.
Support for Experimental Training
Beyond production, PANDA Plus also supported research-grade experimentation, offering long-term value for model iteration.
Contributor Retention and Quality
Appen maintained a consistent contributor pool over the project's 12-week duration, ensuring stable annotation behaviour and predictable performance across variants.
System-Level Impact
By integrating real-time model interaction, edit-based supervision, and crowd feedback into PANDA Plus, Cohere advanced its alignment pipeline, with Appen playing a key role in turning subjective preference into structured AI training data.
Conclusion
Cohere's collaboration with Appen on PANDA Plus is a model example of enterprise-scale preference training, including:
- Skilled annotators with LLM context
- Custom tooling for real-time feedback
- Structured editing and justification
- Integration with both research and production fine-tuning loops
As frontier model builders look to scale human feedback efficiently and responsibly, PANDA Plus demonstrates how data partnerships can drive both model performance and alignment quality without sacrificing control, safety, or enterprise readiness.