The 70% Ceiling: What Google’s FACTS Benchmark Changes for Enterprise AI Projects
Google’s new FACTS benchmark signals a structural limit in current enterprise AI: even the best models, including Gemini 3 Pro, GPT‑5 and Claude 4.5, remain below ~70% factual accuracy. For CIOs, Heads of Ops and data leaders, this is not a lab curiosity but an architectural constraint.
It reshapes how to design internal assistants, business copilots and AI agents, and how to govern their risk.
This article explains:
• ★ Why this 70% ceiling exists and what it implies for critical workflows
• 🔎 How to translate FACTS sub‑scores into RAG, grounding and review strategies
• 🧩 How no‑code/low‑code stacks can industrialise “trust but verify” patterns
• 📏 Which questions, KPIs and guardrails matter for transformation at scale
1. The 70% factuality ceiling: what it really means for enterprises
FACTS (FActuality Coverage and Trustworthiness Suite) evaluates how often models are objectively correct, not just fluent or helpful. It measures both:
- Contextual factuality: answers grounded in provided data or documents
- World knowledge factuality: answers based on internal training + web search
Across scenarios, no leading model exceeds ~70% factuality. In practice, this means:
- In an uncontrolled setting, about 1 answer out of 3 can be wrong or partially wrong
- Errors are often confident, plausible and hard to detect by non‑experts
- Multimodal tasks (charts, images, diagrams) remain significantly less reliable
For a CIO or Head of Ops, this turns into three concrete constraints:
- AI cannot be treated as a system of record. LLMs are powerful reasoning engines, not authoritative databases. Any pipeline that lets an LLM directly write into core systems (ERP, CRM, HRIS, accounting) without strong checks is structurally fragile.
- “Model choice” is not enough. Moving from model A to model B may shift factuality by a few points, but does not break the 70% ceiling. Architecture, retrieval and governance drive reliability more than chasing the latest model.
- Risk differs by use case:
  - For exploration (brainstorming, ideation), 70% may be acceptable
  - For operational decisions (pricing, compliance, finance, HR), it is not
  - For regulated domains (health, legal, banking), blind trust is untenable
The core message: AI hallucinations are not an implementation bug; they are a systemic characteristic of current models. Systems must be built under this assumption.
2. Reading the FACTS sub‑scores as architecture requirements
FACTS decomposes factuality into four main dimensions with direct architectural implications.
2.1 Parametric: why model “memory” cannot be your source of truth
Parametric factuality ≈ “What the model knows from training data.”
- Trivia‑style questions
- General knowledge without external tools
Even top models plateau here. The gap between the Parametric and Search scores in FACTS shows that models are more reliable when they can look things up than when they rely on their internal weights.
🔧 Architectural implication
For enterprise AI, never rely on parametric memory alone when:
- Policies change frequently (compliance, HR rules, pricing, approval thresholds)
- Data is confidential or proprietary (contracts, internal procedures)
- Domain knowledge evolves fast (tax rules, product catalogues, medical standards)
Instead, systematise Retrieval‑Augmented Generation (RAG):
- Use a vector database for semantic search over internal content
- Feed relevant passages to the model as context
- Let the model reason over retrieved facts, rather than invent them
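As an illustration, here is a minimal sketch of that flow in Python. The `vector_db` and `llm` objects are hypothetical placeholders for whichever vector database and model client your stack actually uses; the prompt wording is indicative only.

```python
# Minimal RAG sketch: retrieve internal passages first, then let the model
# reason over retrieved facts instead of its parametric memory.
# `vector_db` and `llm` are hypothetical placeholders for your actual clients.

def answer_with_rag(question: str, vector_db, llm, top_k: int = 5) -> str:
    # 1. Semantic search over internal content (policies, wikis, contracts)
    hits = vector_db.search(query=question, top_k=top_k)

    # 2. Build a context block from the retrieved passages, keeping source IDs
    context = "\n\n".join(f"[{hit['doc_id']}] {hit['text']}" for hit in hits)

    # 3. Constrain the model to the retrieved facts
    prompt = (
        "Answer the question using ONLY the context below. "
        "Cite the [doc_id] of every passage you rely on. "
        "If the context is insufficient, answer 'I do not know.'\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.complete(prompt)
```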
Table: Where parametric vs RAG makes sense
| Scenario | Parametric‑only OK? | RAG recommended? |
|---|---|---|
| Public general knowledge FAQ | Sometimes | Often, for up‑to‑date info |
| Internal policy chatbot | No | Yes, mandatory |
| Compliance or regulatory assistant | No | Yes, with human review |
| Product catalog / pricing queries | No | Yes, with live data source |
2.2 Search: RAG and live tools as default design
Ressources Recommandées
Documentation
The FACTS Search benchmark evaluates how well a model uses a web search tool. Scores here are typically higher than in the parametric setting.
This confirms a key design pattern:
For critical facts, models must read from external tools or databases.
In enterprise stacks, “search” rarely means only the public web. It includes:
- Enterprise search engines over internal wikis and document management systems
- Vector DBs (Pinecone, Qdrant, Weaviate, pgvector, etc.)
- Domain APIs (ERP, BI, CRM, financial systems, ticketing)
🔧 Architectural implication
- Treat RAG as default, not as an optional add‑on
- Expose tool APIs to AI agents (e.g. “get_invoice(id)”, “search_policies(query)”)
- Log tool calls for auditability and debugging
- Impose permissions and scoping: the model should not see more than the user is allowed to see
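To make the scoping and auditability points concrete, here is a small, illustrative Python sketch of a tool dispatcher for an AI agent. The tool names reuse the examples above (`get_invoice`, `search_policies`); the registry, permission strings and logging format are assumptions, not a reference implementation.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("agent_tools")

# Illustrative tool registry: each tool declares the permission it requires,
# so the agent can never see more than the calling user is allowed to see.
TOOL_REGISTRY = {
    "get_invoice": {"required_permission": "finance:read"},
    "search_policies": {"required_permission": "policies:read"},
}

def call_tool(tool_name: str, args: dict, user_permissions: set, tools: dict):
    """Dispatch a tool call requested by the agent, with scoping and audit logging.
    `tools` maps tool names to plain Python callables (hypothetical implementations)."""
    spec = TOOL_REGISTRY.get(tool_name)
    if spec is None:
        raise ValueError(f"Unknown tool: {tool_name}")

    # Permission scoping: refuse calls the user could not make directly
    if spec["required_permission"] not in user_permissions:
        logger.warning("Blocked tool call %s for missing permission", tool_name)
        raise PermissionError(f"User lacks {spec['required_permission']}")

    # Audit log every call for traceability and debugging
    logger.info("tool=%s args=%s at=%s", tool_name, args,
                datetime.now(timezone.utc).isoformat())
    return tools[tool_name](**args)
```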
Effective RAG design is often more impactful than switching model providers, as shown in detail by the analysis of enterprise RAG limitations and Google's "sufficient context" approach in the article on enterprise RAG systems.
2.3 Grounding: keeping internal chatbots on‑script
The Grounding benchmark measures how well a model sticks to provided source text without inventing additional facts.
For internal assistants, grounding is central:
- An internal chatbot answering from a knowledge base
- A business copilot summarising procedures, contracts, tickets
- AI agents orchestrating workflows based on internal rules
Without strict grounding:
- The bot may paraphrase correctly but add non‑existent constraints
- It may “fill gaps” in policies, creating ghost rules or mis‑stated rights
- It may ignore edge cases or exceptions that are crucial for compliance
🔧 Architectural implication
- Prompt for source‑bounded behaviour:
- “Only answer from the documents provided. If not present, say I do not know.”
- Require citations / spans:
- “For every statement, reference the relevant document section.”
- Use grounding checkers or secondary models to verify that answers are supported by context
- Score and log: % of answers with valid, checkable citations
Grounding is not optional for data governance and auditability.
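As a rough illustration of a grounding checker, the sketch below verifies that every sentence of an answer cites at least one of the documents that were actually provided. It is deliberately simplistic (pattern-matching on `[doc_id]`-style citations); a production checker would typically add an entailment or NLI model on top.

```python
import re

def check_citations(answer: str, allowed_doc_ids: set) -> dict:
    """Lightweight grounding check: every sentence in the answer must carry at
    least one citation like [HR-042] that refers to a document actually provided."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    unsupported = []
    for sentence in sentences:
        cited = set(re.findall(r"\[([^\]]+)\]", sentence))
        # Flag sentences with no citation, or citing documents not in the context
        if not cited or not cited <= allowed_doc_ids:
            unsupported.append(sentence)

    return {
        "citation_coverage": 1 - len(unsupported) / max(len(sentences), 1),
        "unsupported_sentences": unsupported,
    }
```

The resulting `citation_coverage` score can feed the "% of answers with valid, checkable citations" metric mentioned above.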
2.4 Multimodal: automate insight, not unsupervised extraction
FACTS Multimodal scores (images, charts, diagrams) are notably low across models. Even for leaders, accuracy often remains below 50%.
Typical high‑risk areas:
- Reading financial charts and drawing numeric conclusions
- Extracting values from invoices, payslips, medical forms purely via vision
- Interpreting complex technical diagrams without schema knowledge
🔧 Architectural implication
- Avoid fully automated extraction from PDFs, scans and charts without human‑in‑the‑loop
- Use models to assist human review: pre‑fill fields, highlight outliers, detect missing elements
- Combine traditional OCR / document parsers with LLM reasoning, rather than relying on raw multimodal capabilities
- Log error rates by document type and keep humans in control where stakes are high
In short: multimodal AI is a powerful assistant, not yet a reliable autonomous extractor.
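One way to keep humans in control is to route extracted fields by confidence, as in the sketch below: anything below a threshold, or belonging to a critical field, goes to human review. The field names, threshold and upstream OCR/parser step are illustrative assumptions.

```python
# Sketch of human-in-the-loop routing for document extraction.
# Assumes an upstream OCR/parser step returning candidate fields with
# confidence scores; field names and thresholds here are illustrative.

REVIEW_THRESHOLD = 0.95          # below this, a human must confirm the value
CRITICAL_FIELDS = {"total_amount", "iban", "due_date"}  # never auto-approved

def route_extracted_fields(extracted: dict) -> dict:
    """Split extracted fields into auto-accepted values and items for human review."""
    auto_accepted, needs_review = {}, {}
    for field, (value, confidence) in extracted.items():
        if field in CRITICAL_FIELDS or confidence < REVIEW_THRESHOLD:
            needs_review[field] = {"value": value, "confidence": confidence}
        else:
            auto_accepted[field] = value
    return {"auto_accepted": auto_accepted, "needs_review": needs_review}

# Example: an invoice where the total is critical and one field is low-confidence
result = route_extracted_fields({
    "vendor_name": ("Acme GmbH", 0.99),
    "total_amount": ("12 480.00 EUR", 0.97),   # critical field -> reviewed anyway
    "po_number": ("PO-2024-118", 0.81),        # low confidence -> reviewed
})
```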
3. Building robust AI workflows with no‑code / low‑code stacks
No‑code/low‑code platforms (Make, n8n, Zapier, Retool, WeWeb and similar tools) enable rapid assembly of an AI stack without heavy engineering. FACTS suggests clear patterns to follow.
3.1 Pattern 1 – Internal FAQ bot with RAG and double checking
Objective: A reliable internal chatbot for HR, IT or operations policies.
🧩 Architecture pattern
1. Trigger
   - User asks a question via a chat widget (WeWeb, Retool app, intranet).
2. Retrieval layer (RAG)
   - The workflow queries a vector database populated from policy docs, wiki, emails, tickets.
   - Top-k results and snippets are attached as context.
3. Primary answer generation
   - The LLM writes an answer, constrained to the context and required to provide citations.
4. Automated factuality check
   - A second model (cheaper, smaller) re-reads the answer and its sources.
   - It returns:
     - A confidence score (0-1)
     - A list of unsupported statements, if any
5. Decision layer
   - If confidence ≥ threshold (for example 0.8) and there are no unsupported claims, the answer is sent.
   - If below threshold:
     - The system sends a cautious response ("Insufficient information; here is what the documents say explicitly")
     - Or escalates to a human (HR or IT support) with full context.
6. Logging & governance
   - Store the question, context docs, answer, confidence, and whether the user flagged issues.
   - Use this log to track AI hallucinations, tune thresholds and update documents.
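Steps 4 and 5 are where factuality is actually enforced. The Python sketch below shows one possible shape for that decision layer; in Make or n8n it would typically become a router/branch node. The `verification` structure, the threshold and the `escalate` callback are illustrative assumptions.

```python
CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune per use case from your logs

def decide_response(draft_answer: str, verification: dict, escalate) -> str:
    """Decision layer from the pattern above.
    `verification` is the output of a second, cheaper model re-reading the
    answer against its sources, e.g. {"confidence": 0.9, "unsupported": []}.
    `escalate` is a callback that hands the case to HR/IT support with context."""
    if verification["confidence"] >= CONFIDENCE_THRESHOLD and not verification["unsupported"]:
        return draft_answer

    if verification["unsupported"]:
        # Fall back to a cautious, source-bounded reply rather than a guess
        return ("I could not fully verify this in the documentation. "
                "Here is what the documents state explicitly: ...")

    # Low confidence without specific unsupported claims: hand over to a human
    escalate(draft_answer, verification)
    return "Your question has been forwarded to the support team."
```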
🔁 Role of no‑code/low‑code
- Orchestrate triggers, API calls, branching logic and persistence without custom backend code
- Connect directly to document systems (SharePoint, Google Drive, Notion) and vector DBs via connectors or webhooks
3.2 Pattern 2 – Finance copilot with structured data and escalation
Objective: A finance copilot assisting controllers or FP&A analysts with budget analysis and reporting.
🧩 Architecture pattern
1. Data integration layer
   - ETL or no-code integrations pull data from ERP, accounting and BI systems (Snowflake, BigQuery, etc.).
   - Only aggregated, authorised views are exposed to the copilot.
2. Question parsing and query generation
   - The user asks: "Explain the variance in marketing spend vs budget for Q3 in EMEA."
   - The LLM translates the question into structured queries (SQL, BI filters) sent to a data API.
3. Retrieval and numeric ground truth
   - The BI system returns actual numbers and breakdowns.
   - These numbers become the single source of truth; the LLM cannot change them.
4. Explanation generation
   - The model produces a narrative explanation and charts, based only on the returned data.
5. Factuality guardrails
   - A rule engine or secondary LLM checks that:
     - All figures in the narrative match the actual numbers
     - Percentages and deltas are computed correctly
   - If inconsistencies are detected, the output is flagged for review.
6. Human-in-the-loop
   - For reports sent to management or regulators, require mandatory human approval in the workflow.
   - A no-code UI (Retool, internal portal) presents the data, the AI explanation and the flags.
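The guardrail in step 5 can be as simple as checking that every figure quoted in the narrative exists in the data returned by the BI layer. The sketch below illustrates that idea; the number formats, the tolerance and the `ground_truth` structure are assumptions for the example.

```python
import re

def check_numbers(narrative: str, ground_truth: dict, tolerance: float = 0.005) -> list:
    """Guardrail sketch: every figure quoted in the narrative must match a value
    returned by the BI layer (within a small rounding tolerance).
    `ground_truth` maps metric names to numbers, e.g. {"marketing_spend_q3_emea": 1240000}.
    Returns a list of flags; an empty list means no inconsistency was detected."""
    flags = []
    # Extract numeric tokens such as "1,240,000" or "12.5" from the generated text
    quoted = [float(x.replace(",", "").replace(" ", ""))
              for x in re.findall(r"\d[\d ,]*(?:\.\d+)?", narrative)]
    trusted = set(ground_truth.values())

    for value in quoted:
        if not any(abs(value - ref) <= tolerance * max(abs(ref), 1) for ref in trusted):
            flags.append(f"Figure {value} does not match any value from the BI query")
    # NOTE: a production check would also handle percentages, dates and units explicitly
    return flags
```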
🔁 Impact
- Automates the narrative layer on top of the numbers and speeds up analysis
- Keeps factuality grounded in verified data stores, not in the model
3.3 Pattern 3 – Legal and document analysis with cautious automation
Objective: A legal assistant to analyse contracts, NDAs and policy documents and identify key clauses or risks.
🧩 Architecture pattern
1. Secure ingestion
   - Contracts are uploaded via a low-code interface; documents are stored in a controlled repository.
2. Chunking and vectorisation
   - Documents are segmented into clauses; embeddings are stored in a vector DB.
3. Task 1 – Clause extraction
   - The LLM retrieves relevant chunks and marks clauses (confidentiality, liability cap, jurisdiction).
   - Each extraction includes source references (document ID, section, page).
4. Task 2 – Risk classification
   - A second model or rules engine classifies clauses (standard / non-standard / high risk).
   - Confidence thresholds determine which clauses are auto-approved and which require review.
5. Human validation workflow
   - Lawyers see AI-proposed annotations and scores in a low-code interface.
   - They approve, amend or reject, with reasons logged.
6. Continuous improvement
   - Corrections feed fine-tuning or prompt updates.
   - Metrics:
     - Precision/recall per clause type
     - Review time per document
     - Disagreement rate between AI and human
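The metrics in step 6 can be computed directly from the logged review decisions. Below is an illustrative sketch, assuming each reviewed clause is logged with the AI's proposal and the lawyer's verdict; the field names are hypothetical.

```python
from collections import defaultdict

def clause_metrics(reviews: list) -> dict:
    """Compute precision/recall per clause type from lawyer review decisions.
    Each review item is assumed to look like:
      {"clause_type": "liability_cap", "ai_flagged": True, "human_confirmed": True}
    where `ai_flagged` means the AI proposed the clause and `human_confirmed`
    means the lawyer agreed the clause is really present and of that type."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for r in reviews:
        c = counts[r["clause_type"]]
        if r["ai_flagged"] and r["human_confirmed"]:
            c["tp"] += 1
        elif r["ai_flagged"] and not r["human_confirmed"]:
            c["fp"] += 1
        elif not r["ai_flagged"] and r["human_confirmed"]:
            c["fn"] += 1   # clause the lawyer found but the AI missed

    metrics = {}
    for clause_type, c in counts.items():
        precision = c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else None
        recall = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else None
        metrics[clause_type] = {"precision": precision, "recall": recall}
    return metrics
```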
🔁 Automation scope
- Automate triage, pre‑classification and summarisation
- Keep legal judgement and edge‑case decisions fully human
4. From benchmark to governance: a practical framework for decision‑makers
FACTS can be more than a leaderboard. It can be a tool for data governance and model evaluation.
4.1 Questions to ask vendors and integrators
When evaluating tools, ask for concrete, FACTS‑related evidence:
- Which FACTS sub-scores matter for this use case?
  - Search for research assistant or dynamic knowledge tools
  - Grounding for internal policy chatbots
  - Multimodal for document or image workflows
- How does the system implement RAG?
  - Which vector DB / search engine?
  - How is context selected, updated, and versioned?
  - How are access rights enforced?
- What grounding safeguards are in place?
  - Does the assistant only answer from provided sources?
  - How are unsupported statements detected and handled?
- How is multimodal data treated?
  - Are critical decisions based solely on image/chart analysis?
  - Is there a mandatory review for extracted values?
- What observability is provided?
  - Can the organisation see logs, citations, confidence scores?
  - Are there dashboards for hallucination rates and "I don't know" answers?
4.2 Using FACTS in RFPs and technical requirements
In RFPs and technical specifications, FACTS can be turned into specific clauses:
- "The solution must support models with documented FACTS Search scores above X for the model version deployed."
- “For internal knowledge use cases, the system must demonstrate grounded responses with explicit document citations.”
- “Any automated extraction from PDFs or images that bypasses human review must be justified with metrics equivalent to FACTS Multimodal, and error rates must be disclosed.”
The goal is not to optimise for a single composite score, but to:
- Align the sub‑benchmark with the business risk
- Make vendors commit to observable metrics, not vague claims
4.3 KPIs for “trust but verify” digital transformation
To avoid an "AI-first, verification-later" drift, monitor a small set of operational KPIs.
Suggested KPI set:
- Hallucination rate
  - % of answers flagged as incorrect by users or reviewers
  - Breakdown by use case (HR bot, finance copilot, legal assistant)
- Rate of "I don't know" responses
  - Indicator of model humility and effective grounding
  - Too low may suggest over-confident hallucinations; too high may indicate poor retrieval
- Verification cost
  - Time and cost spent by humans to validate or correct AI outputs
  - Should decrease over time as prompts, RAG and models improve
- Coverage vs. accuracy
  - For document workflows: % of documents with some AI assistance vs. % needing full manual processing
  - For chatbots: % of questions answered autonomously vs. escalated
- Incident rate
  - Number and severity of factuality incidents that reached external stakeholders or regulators
These KPIs support a progressive rollout strategy: start with low‑risk scenarios, track metrics, then expand scope or autonomy only when evidence supports it.
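If the orchestration layer logs each interaction with a few boolean flags and the human review time, these KPIs reduce to simple aggregations. The sketch below assumes a log schema of that shape; the field names are illustrative, not a prescribed format.

```python
def compute_kpis(log_entries: list) -> dict:
    """Aggregate the 'trust but verify' KPIs from interaction logs.
    Each entry is assumed to look like:
      {"use_case": "hr_bot", "flagged_incorrect": False,
       "said_i_dont_know": False, "escalated": False, "review_minutes": 0.0}"""
    total = len(log_entries)
    if total == 0:
        return {}
    return {
        "hallucination_rate": sum(e["flagged_incorrect"] for e in log_entries) / total,
        "i_dont_know_rate": sum(e["said_i_dont_know"] for e in log_entries) / total,
        "escalation_rate": sum(e["escalated"] for e in log_entries) / total,
        "avg_review_minutes": sum(e["review_minutes"] for e in log_entries) / total,
    }
```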
Key Takeaways
All evaluated models achieved an overall accuracy below 70%, leaving considerable headroom for future progress.
- FACTS benchmark exposes a systemic ~70% factuality ceiling across leading models; architecture and governance matter more than model switching.
- RAG and tool use must be default design choices for enterprise AI; relying on parametric “memory” alone is unsafe for dynamic or critical knowledge.
- Grounding and citations are essential for internal chatbots and business copilots, enabling auditability and controlling AI hallucinations.
- Multimodal AI is not yet reliable for unsupervised document or visual extraction; use it to assist humans, not to bypass them.
- A "trust but verify" approach, with explicit KPIs and human-in-the-loop workflows, enables robust digital transformation without over-estimating current AI capabilities.
💡 Need help automating this?
CHALLENGE ME! 90 minutes to build your workflow. Any tool, any business.
Satisfaction guaranteed or refunded.
Book your 90-min session - $197