
A multi-model AI workflow where one system generates research and another reviews and validates it before delivery. Image Source: DALL·E via ChatGPT (OpenAI)
Microsoft Adds Multi-Model AI to Copilot Researcher to Improve Trust in Enterprise Research
Microsoft has introduced Critique and Council, 2 new multi-model capabilities inside Microsoft 365 Copilot's Researcher agent, adding a built-in validation layer to AI-generated research. This matters now because AI is increasingly used in high-stakes professional workflows where accuracy and trust are critical, and single-model outputs are no longer sufficient.
The update introduces a system where multiple AI models generate, review, and compare research outputs before delivering a final result. This development points to AI systems being designed for verification, not just generation, especially in enterprise environments where decisions depend on reliable information.
This directly impacts knowledge workers, enterprise teams, and professionals using AI for research, analysis, and decision support.
In short, Microsoft has restructured how Researcher produces results by separating generation and evaluation across multiple models, enabling AI to check its own work before presenting it.
Multi-model AI research refers to systems that distribute research tasks across multiple models, using one model to generate outputs and another to evaluate, challenge, or compare them before delivering a final result.
Key Takeaways: Microsoft Copilot Researcher Multi-Model AI (Critique and Council)
Microsoft's Critique and Council introduce a multi-model AI architecture in Copilot Researcher that enables AI systems to generate, evaluate, and compare research outputs, improving accuracy and trust in enterprise knowledge workflows.
Critique introduces a two-model architecture inside Researcher — one model generates the research report, a second independently reviews and refines it — replacing the single-model approach that handled all tasks at once.
Council runs multiple models simultaneously, producing separate reports side by side and surfacing where the models agree, where they diverge, and what unique insights each one contributes.
Models from both Anthropic and OpenAI power the new capabilities, making this a cross-frontier multi-model deployment inside a major enterprise platform.
Critique's performance was validated on the DRACO benchmark — 100 complex deep research tasks across 10 domains — where it showed meaningful improvements in analytical depth, presentation quality, and factual accuracy over the single-model approach.
The architecture is designed with clear role separation, giving Microsoft the ability to swap or upgrade individual generator and reviewer models over time as the system evolves.
Critique is now the default experience in Researcher and is broadly available through the Microsoft 365 Copilot Frontier program.
Microsoft Expands Copilot Researcher With a New Research Architecture
Microsoft 365 Copilot's Researcher agent was designed to handle complex research tasks inside the flow of professional work — not as a standalone tool, but embedded within the enterprise applications where decisions get made. With the launch of Critique and Council, Microsoft has made a fundamental change to how Researcher operates, replacing its single-model foundation with a multi-model system where AI outputs are evaluated by a second AI before reaching the user.
Critique is a deep research system built for complex tasks that separates generation from evaluation across 2 AI models — one leading the generation phase by planning the task, iterating through retrieval, and producing an initial draft, and a second acting as an expert reviewer focused on refining and strengthening that draft before the final report is produced.
Models from both Anthropic and OpenAI power this architecture. According to Microsoft, this design exceeds traditional single-model approaches and delivers best-in-class deep research quality, with built-in flexibility to swap or expand the generator and reviewer roles over time as the system evolves. Council takes a different approach, running multiple models in parallel and delivering their responses side by side, along with a cover letter that surfaces where the models agree, where they diverge, and the unique insights each one contributes.
Critique is now the default experience in Researcher, automatically activated when Auto is selected in the model picker. Council is available as an alternative mode built for side-by-side comparison across multiple models, accessible by selecting Model Council in the model picker. Both capabilities are currently broadly available through Microsoft's Frontier program — Microsoft's early access tier for enterprise customers advancing AI transformation with Copilot and agents.
That makes Microsoft 365 Copilot Researcher one of the first major enterprise AI products to deploy competing frontier models — Anthropic and OpenAI — as collaborating components within a single unified workflow.
How Critique Works: Separating Generation From Evaluation
The central design principle behind Critique is role separation. Rather than assigning 1 model to handle everything from planning to writing, Critique divides the work into 2 distinct roles. The first model plans the task, gathers sources, and produces an initial draft. The second model then takes over, validating claims, improving presentation, and strengthening the structure.
That review process follows academic and professional research standards, built around rubric-based evaluation. The reviewer examines the report from multiple angles — focused on strengthening it rather than becoming a second author — and evaluates it across 3 dimensions before producing an enhanced final version:
Source Reliability Assessment — the reviewer prioritizes reputable, authoritative, domain-appropriate sources and emphasizes evidence that is verifiable in the relevant research context.
Report Completeness — the reviewer evaluates whether the final report fully addresses the intent of the original request, with relevant and unique insights rather than surface-level coverage.
Strict Evidence Grounding Enforcement — every key claim must be anchored to a reliable source with a precise citation. The reviewer applies a conservative grounding standard, directly targeting factual accuracy and trustworthiness.
Microsoft describes this architecture as creating a powerful feedback loop — by giving evaluation as much emphasis as generation, the system is designed to catch errors, close coverage gaps, and challenge unsupported claims that a single model reviewing its own work is less likely to surface. The result is higher-quality outputs across factual accuracy, analytical breadth, and presentation.
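The two-role flow described above can be sketched in a few lines of Python. This is purely illustrative: the `call_model` stub, the function names, and the rubric labels are assumptions for the sketch, not Microsoft's actual API or prompts.

```python
# Illustrative sketch of a Critique-style generator/reviewer pipeline.
# `call_model` is a hypothetical stand-in for any LLM API call.

RUBRIC = [
    "source_reliability",   # prefer reputable, domain-appropriate sources
    "report_completeness",  # does the report address the full request?
    "evidence_grounding",   # every key claim cites a reliable source
]

def call_model(model: str, prompt: str) -> str:
    # Placeholder: a real system would call a model endpoint here.
    return f"[{model} output for: {prompt[:40]}...]"

def critique_pipeline(question: str) -> str:
    # Role 1: the generator plans, retrieves, and drafts the report.
    draft = call_model("generator", f"Research and draft a report: {question}")

    # Role 2: an independent reviewer evaluates the draft against the
    # rubric and returns a strengthened final version (refining the
    # draft rather than rewriting it from scratch).
    review_prompt = (
        f"Review this draft against the rubric {RUBRIC} and return an "
        f"enhanced final report:\n{draft}"
    )
    return call_model("reviewer", review_prompt)
```

The point of the structure is that the reviewer never sees its own reasoning in the draft, which is what lets it challenge unsupported claims a single model would gloss over.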
How Council Works: Side-by-Side Model Comparison
The goal of Council is to surface what each model emphasizes differently — which facts it prioritizes, how it weights competing evidence, and how it frames analytical conclusions. Where Critique runs models in sequence, Council runs them in parallel: an Anthropic model and an OpenAI model each generate a full, independent report at the same time, each one potentially surfacing different facts, citing different sources, and reaching different analytical conclusions on the same question. A dedicated judge model then reviews both reports and produces a consolidated summary that maps where the 2 models reach the same conclusions, where they diverge, and what distinct insights each one contributed that the other missed.
The result is a cover letter that gives the user a clear map of the research landscape across models before they decide how to proceed. Where models agree, that shared conclusion carries stronger confidence. Where they diverge, the user gains visibility into genuine interpretive differences — not errors, but different analytical framings that may each be relevant depending on the decision being made.
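The parallel-run-plus-judge pattern can be sketched as follows. Everything here is a hypothetical placeholder — the model names, the `call_model` stub, and the prompt wording are assumptions for illustration, not Council's real implementation.

```python
# Illustrative sketch of a Council-style parallel run with a judge model.
from concurrent.futures import ThreadPoolExecutor

def call_model(model: str, prompt: str) -> str:
    # Placeholder: a real system would call a model endpoint here.
    return f"[{model} report on: {prompt}]"

def council(question: str) -> dict:
    # Both models run in parallel, each producing a full independent report.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {
            name: pool.submit(call_model, name, question)
            for name in ("anthropic_model", "openai_model")
        }
        reports = {name: f.result() for name, f in futures.items()}

    # A dedicated judge compares the reports and writes the cover letter:
    # where the models agree, where they diverge, and what each uniquely adds.
    cover_letter = call_model(
        "judge",
        "Compare these reports; summarize agreement, divergence, and "
        "unique insights:\n" + "\n".join(reports.values()),
    )
    return {"reports": reports, "cover_letter": cover_letter}
```

The design choice worth noting is that the judge only consolidates; both full reports are delivered to the user, so divergence is surfaced rather than silently resolved.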
DRACO Benchmark Results Show Measurable Gains in AI Research Accuracy
Microsoft validated Critique against the DRACO benchmark — Deep Research Accuracy, Completeness, and Objectivity — a set of 100 complex research tasks spanning 10 domains, including medicine, technology, and law. The benchmark was developed by researchers from Perplexity and academia, and the tasks were drawn from anonymized real-world usage patterns in a large-scale research system. Each response is scored against task-specific rubrics across 4 dimensions: factual accuracy, breadth and depth of analysis, presentation quality, and citation quality.
Results were evaluated using OpenAI's GPT-5.2 as the judge model — the strictest of the 3 judge models included in the benchmark paper. Microsoft applied the same evaluation protocol and configuration published in the benchmark paper to ensure a direct apples-to-apples comparison. Scores were calculated by averaging results across the full DRACO dataset, with each question evaluated across 5 independent runs. To measure the advantage of Critique, Microsoft then compared the new architecture against the single-model version of Researcher using the same GPT-5.2 judge across all 4 evaluation dimensions.
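The scoring protocol described above reduces to a simple two-level average: judge scores are averaged over the independent runs for each task, then across the full dataset. The sketch below illustrates that aggregation; all numbers in it are invented for the example and are not DRACO data.

```python
# Sketch of DRACO-style score aggregation: each task is judged over
# several independent runs, then per-task means are averaged across
# the dataset. All numbers below are invented for illustration.

def aggregate(scores_per_task: list[list[float]]) -> float:
    # Average the runs within each task, then average across tasks.
    per_task_means = [sum(runs) / len(runs) for runs in scores_per_task]
    return sum(per_task_means) / len(per_task_means)

# Two hypothetical tasks, each judged over 5 runs (e.g. factual accuracy).
single_model = [[70, 72, 71, 69, 73], [80, 79, 81, 80, 80]]
with_critique = [[73, 74, 72, 73, 73], [83, 82, 83, 82, 83]]

# The reported deltas are differences between these aggregate scores.
delta = aggregate(with_critique) - aggregate(single_model)
```

Averaging over multiple runs per question is what makes the per-dimension deltas stable enough to test for statistical significance.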
Critique's largest improvements came in breadth and depth of analysis (+3.33) and presentation quality (+3.04), with a meaningful gain in factual accuracy (+2.58) as well. All 4 evaluation dimensions showed statistically significant improvements across the full dataset.
Microsoft notes that the gains in analytical breadth and presentation reflect Critique's ability to identify missing angles, close coverage gaps, and sharpen reports into more organized, clearly structured narratives — directly accounting for the substantial improvements in both of those scores.
The factual accuracy improvement reflects the reviewer model actively challenging weak claims and enforcing higher precision. Citation quality improvements came not from pulling in more sources, but from more selective use of existing ones — with the reviewer emphasizing evidence quality and precise citation over source volume.
Breaking the results down by subject area, significant gains held across 8 of the 10 domains tested. The 2 exceptions — Academic and Needle-in-a-Haystack — produced inconsistent results across runs, making it difficult to draw a statistically reliable conclusion. Microsoft flags both as open areas rather than fundamental limitations.
Q&A: Microsoft Copilot Researcher Multi-Model AI (Critique and Council)
Q: What did Microsoft announce for Copilot Researcher?
A: Microsoft launched 2 new multi-model capabilities inside Microsoft 365 Copilot's Researcher agent: Critique, a two-model architecture where one model generates a research report and a second independently reviews and refines it; and Council, which runs an Anthropic model and an OpenAI model simultaneously to produce side-by-side reports with a judge model summarizing where they agree and diverge.
Q: What is multi-model AI in Copilot Researcher?
A: Multi-model AI in Copilot Researcher refers to a system where multiple AI models are assigned distinct roles — such as generating, reviewing, and comparing research outputs — allowing the system to validate its own work before delivering results.
Q: How does Critique work inside Copilot Researcher?
A: Critique separates the research process into 2 distinct roles. A generation model handles planning, source retrieval, and drafting the initial report. A reviewer model then evaluates that report against a structured rubric — assessing source reliability, completeness, and evidence grounding — and produces an enhanced final version. The reviewer does not rewrite the report from scratch; it strengthens and refines what the generator produced.
Q: Why does this matter for professionals using AI in their work?
A: As AI moves into research, analysis, and decision-support workflows, the cost of confidently delivered errors grows. A system that separates generation from evaluation — and validates claims before delivering results — provides a higher standard of reliability than a single model checking its own work. Critique and Council make that validation layer part of the standard research experience inside Microsoft 365 Copilot.
Q: Are there areas where Critique's improvements are still limited?
A: Yes. In DRACO benchmark testing, 2 domains — Academic and Needle-in-a-Haystack — did not show statistically significant improvement under Critique. Microsoft attributes this to high variance in those task types rather than a fundamental limitation, but those areas remain open questions as the architecture continues to evolve.
Q: Which AI models are involved in Critique and Council?
A: Microsoft has confirmed that models from both Anthropic and OpenAI are used across the 2 capabilities. In Critique, models from both frontier labs fill the generator and reviewer roles. In Council, an Anthropic model and an OpenAI model each produce independent reports, with a separate judge model comparing and synthesizing the results.
What This Means: Multi-Model AI Changes How Enterprises Trust AI Research
AI is no longer being used only for drafting and summarization. It is increasingly embedded in research, analysis, legal review, financial assessment, and other professional workflows where the cost of a confidently delivered wrong answer is real. That context is why Critique and Council matter now.
Key point: Microsoft has shown that structuring AI research as a multi-model process — with separate roles for generation and evaluation — produces more reliable outputs than single-model systems, establishing a new architectural standard for enterprise AI.
Who should care: Enterprise teams, research professionals, knowledge workers, and AI practitioners embedded in decision-making workflows should pay close attention. If your organization is using AI-generated research to inform strategy, policy, compliance, or investment decisions, the architecture of the system producing that research directly affects how much you can trust it.
Why this matters now: The AI industry has been moving toward agentic systems — AI that acts, not just answers. As those systems take on more consequential tasks, the need for internal validation layers becomes critical. Critique and Council show that multi-model verification is operationally viable inside a major enterprise product, not just a research concept.
What decision this affects: Organizations evaluating AI tools for serious knowledge work now have a concrete benchmark for what "trustworthy AI research" looks like architecturally. The question is no longer just which model gives the best answer — it is whether the system is designed to check itself before you rely on it.
In short, Critique and Council mark a concrete step toward AI research systems that don't just generate outputs — they validate them. The architecture Microsoft has deployed inside Copilot Researcher reflects a design philosophy that accuracy requires process, not just capability.
The future of AI in professional workflows isn’t just smarter models—it’s systems designed to question, validate, and prove their own answers before you rely on them.
Sources:
Microsoft Tech Community - Introducing Multi-Model Intelligence in Researcher
https://techcommunity.microsoft.com/blog/microsoft365copilotblog/introducing-multi-model-intelligence-in-researcher/4506011
arXiv - DRACO: Deep Research Accuracy, Completeness, and Objectivity Benchmark
https://arxiv.org/abs/2602.11685
Microsoft - Microsoft 365 Copilot Frontier Program
https://www.microsoft.com/en-us/microsoft-365-copilot/frontier-program
Microsoft 365 Blog - Powering Frontier Transformation with Copilot and Agents
https://www.microsoft.com/en-us/microsoft-365/blog/2026/03/09/powering-frontier-transformation-with-copilot-and-agents/
Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from Claude, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to Claude for assistance with research and editorial support in crafting this article.



