How Cohere Scaled Preference-Based Fine-Tuning for Enterprise LLMs

July 11, 2025

Introduction

Aligning LLM performance with human values is a key differentiator in today's competitive AI market. However, operationalising human feedback at scale while maintaining high-quality inputs and low latency poses several challenges. To address this growing demand, Cohere built PANDA Plus, a program for preference data generation and reward signal development, and partnered with 色导航 to source expert annotators, support real-time model feedback, and deliver human-centric LLM training data for both experimental and production fine-tuning. 色导航 enabled scalable, high-quality data generation and real-time annotation for PANDA Plus, supporting Cohere in improving their generative Large Language Model, Command.

About Cohere

Cohere is the leading security-first enterprise AI company. They build cutting-edge AI models and end-to-end solutions designed to solve real-world business problems. Their flagship generative LLM series, optimised for secure enterprise deployments, is called Command. Leading enterprises in regulated industries trust Cohere with customer-facing and internal support use cases, so it is essential that the model produces helpful, safe, and brand-aligned responses across diverse domains from retail to banking. Maintaining this high standard requires continual reinforcement learning and fine-tuning with reliable, domain-relevant human feedback.

To accelerate Command's performance, Cohere developed Preference Annotation Data Acquisition Plus Supervised Fine-Tuning (SFT), also known as PANDA Plus. This program improves model performance by collecting structured human preference data and editing the preferred response to better satisfy Command's principles and the user's instructions. Cohere collaborated with 色导航 to scale this system across live models while maintaining quality and adaptability.

1. Project Goals

PANDA Plus integrates real-time model evaluation and editing into Cohere's training loop. Each task presents annotators with two model completions for a given prompt (one possible task record is sketched after the list below) and asks them to:

  • Choose the more helpful or aligned response
  • Optionally edit a completion to better reflect ideal model behaviour
  • Provide justification and qualitative feedback
  • Complete supervised fine-tuning completion rewrites
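
The exact task schema is internal to PANDA Plus. As a rough illustration only, one such record might look like the following Python sketch, where every field name is a hypothetical stand-in:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreferenceTask:
    """One PANDA Plus-style annotation task (all field names hypothetical)."""
    prompt: str                               # instruction shown to the annotator
    completion_a: str                         # first Command completion
    completion_b: str                         # second Command completion
    preferred: Optional[str] = None           # "A" or "B", chosen by the annotator
    edited_completion: Optional[str] = None   # optional rewrite of the preferred response
    justification: Optional[str] = None       # free-text rationale for the choice
```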

Cohere partnered with 色导航 to:

  • Ensure consistent, high-quality annotations from contributors with LLM experience
  • Reduce latency for model feedback using 色导航鈥檚 real-time delivery system
  • Support dynamic task variants (e.g. chat continuation, open-ended instruction-following)
  • Enable both experimental and production-ready training cycles

2. Challenges

A. Finding Qualified Annotators

Cohere required annotators already familiar with LLMs who could deliver high-quality data and onboard quickly. 色导航 provided Cohere with a vetted pool of 200 US-English contributors, prioritising those with prior LLM/RLHF experience.

B. Prioritising Quality over Volume

Unlike traditional annotation pipelines, PANDA Plus emphasised handling time and fidelity over throughput. This required tuning incentive structures and managing contributor pacing to optimise for thoughtful, context-aware edits.

C. Real-Time Feedback Loop

PANDA Plus required a live connection to Command's API, enabling contributors to evaluate model outputs in near-real time. 色导航 adapted its AI Chat Feedback Tool to interface with PANDA Plus, including dynamic preambles, prompt routing, and response comparison.
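
The PANDA Plus integration itself is proprietary, but the general shape of such a loop can be sketched with Cohere's public Python SDK. The model name, preamble, and temperature choices below are illustrative assumptions, not details from the project:

```python
import os

import cohere  # Cohere's public Python SDK: pip install cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

def fetch_pair(prompt: str, preamble: str) -> tuple[str, str]:
    """Request two independent Command completions for side-by-side comparison.

    Varying the temperature is just one way to make the two candidates differ;
    how PANDA Plus actually routed prompts and preambles is not public.
    """
    responses = [
        co.chat(model="command", message=prompt, preamble=preamble, temperature=t)
        for t in (0.3, 0.9)
    ]
    return responses[0].text, responses[1].text

completion_a, completion_b = fetch_pair(
    prompt="Summarise our refund policy for a customer email.",
    preamble="You are a helpful, brand-aligned support assistant.",
)
```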

D. Supporting Model Evolution

Cohere fine-tuned a production-grade model using 色导航-generated preference data, while parallel PANDA Plus tasks fed into ongoing experimental variants. This required 色导航 to maintain annotation consistency across shifting model checkpoints, without compromising data structure or quality.

3. The 色导航 Solution

Step One: Expert Contributor Pipeline

色导航 assembled a domain-qualified contributor pool tailored for PANDA Plus. Contributors were trained to evaluate:

  • Usefulness, safety, and tone
  • Instruction adherence and domain relevance
  • Opportunities for refinement or escalation

色导航 contributors performed:

  • A/B preference ranking
  • Multi-turn chat continuation scoring
  • Freeform feedback for tooling and prompt iteration
  • Complex prompt and preamble writing
  • Completion re-writing for "perfect" SFT inputs

Step Two: Tooling and Real-Time Delivery

The PANDA Plus workflow was delivered through a custom deployment of 色导航's AI Data Platform (ADAP), with enhancements including the following (one possible annotation record is sketched after this list):

  • Direct integration with Command鈥檚 inference endpoint
  • Multi-turn prompt/response workflows
  • Structured fields for ranking, editing, and justification
  • Weekly batch summaries and daily live data streams
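
ADAP's actual field layout is not described here. As a minimal sketch, a completed annotation carrying ranking, edits, and justification together might be validated and serialised as one JSONL line for a daily live stream; all field names below are hypothetical:

```python
import json
from datetime import datetime, timezone

REQUIRED_FIELDS = {"task_id", "preferred", "justification"}  # hypothetical schema

def to_stream_record(annotation: dict) -> str:
    """Validate a completed annotation and serialise it as one JSONL line."""
    missing = REQUIRED_FIELDS - annotation.keys()
    if missing:
        raise ValueError(f"annotation is missing fields: {sorted(missing)}")
    if annotation["preferred"] not in ("A", "B"):
        raise ValueError("preferred must be 'A' or 'B'")
    # Timestamp the record so daily streams and weekly batches can be windowed.
    annotation["submitted_at"] = datetime.now(timezone.utc).isoformat()
    return json.dumps(annotation, ensure_ascii=False)

print(to_stream_record({
    "task_id": "pp-00123",
    "preferred": "B",
    "edited_completion": "Here is a clearer, policy-compliant answer...",
    "justification": "B follows the instruction and stays on brand.",
}))
```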

色导航 contributors logged over 2,400 expert hours in 12 weeks, enabling Command's training loop to incorporate human feedback in near-real time.

4. Results

High-Confidence Fine-Tuning Data

PANDA Plus data contributed directly to the Command model, with multiple fine-tuning runs leveraging human preference signals collected by 色导航.
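
Cohere's training recipe is not public, but a common way to turn A/B preference labels into a reward signal is a Bradley-Terry loss over scores for the chosen and rejected completions. A self-contained sketch, with made-up scores standing in for a reward model's outputs:

```python
import math

def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen completion outranks the rejected one.

    In real reward-model training these scores come from a scalar head on an
    LLM; plain floats keep the sketch self-contained.
    """
    # P(chosen beats rejected) = sigmoid(score_chosen - score_rejected)
    p_chosen = 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))
    return -math.log(p_chosen)

# A reward model aligned with the annotators scores the preferred response
# higher, driving this loss toward zero.
print(bradley_terry_loss(score_chosen=2.1, score_rejected=0.4))  # ~0.168
```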

Support for Experimental Training

Beyond production, PANDA Plus also supported research-grade experimentation, offering long-term value for model iteration.

Contributor Retention and Quality

色导航 maintained a consistent contributor pool over the project's 12-week duration, ensuring stable annotation behaviour and predictable performance across variants.

System-Level Impact

By integrating real-time model interaction, edit-based supervision, and crowd feedback into PANDA Plus, Cohere advanced its alignment pipeline, with 色导航 playing a key role in turning subjective preference into structured AI training data.

Conclusion

Cohere's collaboration with 色导航 on PANDA Plus is a model example of enterprise-scale preference training, including:

  • Skilled annotators with LLM context
  • Custom tooling for real-time feedback
  • Structured editing and justification
  • Integration with both research and production fine-tuning loops

As frontier model builders look to scale human feedback efficiently and responsibly, PANDA Plus demonstrates how data partnerships can drive both model performance and alignment quality, without sacrificing control, safety, or enterprise readiness.