Lucidity

Lucidity is an independent evaluation service for agentic systems, run by researchers in software verification and testing. We test your system against the claims you make, in the workflow you deploy it in, and issue a signed report with a verdict.

Your claims, our tests, one verdict.

Request an evaluation.

Trust score 0 to 100

85 and above Full Approval · 70 to under 85 Conditional Approval · below 70 Not Approved

F-01 Key failure · High severity

Failure: The system reports a successful record lookup that the tool never returned.
Category: Tool-use faithfulness
Blast radius: Unverified figures reach the compliance memo unflagged.
Compensating control: A reviewer verifies the cited records before sign-off.

REM-01 Remediation · Retest required

Required fix: State when a lookup returns nothing instead of reporting success. Addresses F-01.
Owner: Vendor
Deadline: 30 days from issue.

A condensed excerpt of a sample report: the trust score scale, a key failure, and the remediation it requires. The failure and remediation shown are illustrative. Request a full sample report.
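
As a rough illustration of the scale in the excerpt, the score-to-verdict mapping could be sketched as below. The function name `verdict_for` is hypothetical, and treating a score of exactly 85 as Full Approval is an assumption based on the "from 85" wording; a real report states its own thresholds.

```python
def verdict_for(score: float) -> str:
    """Map a 0-100 trust score to a verdict label (illustrative sketch only)."""
    if not 0 <= score <= 100:
        raise ValueError("trust score must be between 0 and 100")
    if score >= 85:  # assumption: the upper threshold is inclusive
        return "Full Approval"
    if score >= 70:  # assumption: the lower threshold is inclusive
        return "Conditional Approval"
    return "Not Approved"
```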

What it does

Lucidity does not test whether a system is generally accurate. It tests whether your system stays grounded in the documents, tools, and evidence it is supposed to use within the assessed workflow, measured against the capability claims you state for it.

Claims. You state what your system can do, for the workflow it is deployed in.
Test generation. We generate test cases against each claim and against known failure modes of agentic systems.
Execution. Your system runs the cases. Full trajectories are captured: every reasoning step, tool call, and answer.
Review. Trajectories are scored across five faithfulness axes, and our experts confirm every result before it enters the report.
Report. A report with a verdict, required controls, and a remediation path, signed off by our technical committee.
Retest. Reports expire. We retest on a 90-day cadence and on any change to the model, prompts, or pipeline.
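
The retest rule in the last step can be sketched as a simple check. This is a minimal sketch, not our scheduling logic: the `Deployment` fields and the `retest_due` function are hypothetical names, and counting a report as expired on exactly the 90th day is an assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta

RETEST_CADENCE = timedelta(days=90)  # cadence stated in the Retest step


@dataclass(frozen=True)
class Deployment:
    """Hypothetical snapshot of what was assessed: model, prompts, pipeline."""
    model: str
    prompts_version: str
    pipeline_version: str


def retest_due(assessed: Deployment, current: Deployment,
               report_date: date, today: date) -> bool:
    """Return True if the cadence has elapsed or anything assessed has changed."""
    changed = assessed != current  # any change to model, prompts, or pipeline
    expired = today - report_date >= RETEST_CADENCE  # assumption: day 90 counts
    return changed or expired
```

With this shape, a prompt change on day 10 triggers a retest just as the calendar does on day 90.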

What you receive

One deliverable: a signed PDF report, structured for examiner review and retained in your model risk file.

Verdict: Full Approval, Conditional Approval, or Not Approved, with a rationale an executive can act on.
Trust score: Per-case scores and an aggregate score, with the thresholds that map the aggregate score to a verdict.
Key failures: Each failure with its blast radius and a required compensating control, specific enough for an auditor to verify.
Remediation: Items with owners, deadlines, and retest requirements.
Regulatory alignment: A mapping of the report's sections to SR 11-7, NIST AI RMF, the EU AI Act, and OSFI E-23.
Validity: An expiry date, a 90-day retest cadence, and the change triggers that require immediate retest.
Appendices: Version pinning with a SHA-256 content fingerprint, the input context reproduced verbatim, and the raw per-case evidence.
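
To show what a SHA-256 content fingerprint for version pinning can look like, here is a minimal sketch using Python's standard `hashlib`. The choice to hash a canonical JSON serialization, the function name `content_fingerprint`, and the example field names are all assumptions for illustration; the report's own appendix defines what is actually hashed.

```python
import hashlib
import json


def content_fingerprint(pinned: dict) -> str:
    """Return a SHA-256 hex digest over a canonical JSON form of the pinned inputs."""
    # Sorting keys and fixing separators makes the digest independent of key order.
    canonical = json.dumps(pinned, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical pinned inputs for one assessed configuration.
fingerprint = content_fingerprint({
    "model": "example-model-v1",
    "prompts": "prompt-set-3",
    "pipeline": "pipeline-2025-01",
})
```

Any change to a pinned value produces a different digest, which is what lets a reviewer confirm that the system in production is the one that was assessed.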

Who uses it

Vendors

Selling into regulated buyers.

Independent validation of your capability claims. Evidence for enterprise RFPs and vendor risk questionnaires that ask for hallucination testing.

Deploying institutions

Approving a system for use.

A due diligence artifact for procurement. Input to model risk review under SR 11-7 or equivalent, and a basis for acceptable-use policy and oversight controls.

Regulators and auditors

Examining a deployment.

Evidence that an independent assessment was performed before deployment in a regulated workflow, structured to support examiner review.

How we evaluate

Agentic systems fail differently from chatbots. Reasoning drifts across turns, tool outputs are paraphrased into false claims, and plans are silently abandoned. These failures are invisible to response-level checks; they show up only at the trajectory level. Our engine evaluates the trajectory as a whole, scoring every reasoning step, tool call, and answer in the context of everything before it, across five faithfulness axes adapted from published research on agentic hallucination.

Reasoning faithfulness: Stated reasoning is supported by context, internally consistent, and justifies its conclusions. Catches fabricated claims, premise drift, invalid inference, cumulative error.
Tool-use faithfulness: Tool calls match what the context warrants, and their results are reported faithfully downstream. Catches wrong tool, fabricated calls, misreported results, ignored errors.
Context grounding: When the system claims to use prior information, that information exists and is used as cited. Catches fabricated sources, ignored evidence, misattribution.
Plan adherence: Declared subgoals advance the goal, and the system does not execute against an invalidated plan. Catches silent plan abandonment, infinite loops, off-plan actions.
User-intent fidelity: Explicit user constraints are respected, and nothing is attributed to the user that the user did not say. Catches ignored constraints, fabricated user statements, hallucinated consent.
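
To make the trajectory-level idea concrete, here is a deliberately simplified sketch of one tool-use faithfulness check: flagging records that an answer cites but that no tool call returned. It is an illustration under stated assumptions, not our engine. The `ToolCall` and `Trajectory` types, their fields, and `unsupported_citations` are hypothetical names, and real trajectories carry far more than record IDs.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    returned_ids: list[str]  # record IDs the tool actually returned


@dataclass
class Trajectory:
    tool_calls: list[ToolCall]
    cited_ids: list[str]  # record IDs the final answer claims to rely on


def unsupported_citations(trajectory: Trajectory) -> list[str]:
    """Return cited record IDs that no tool call in the trajectory returned."""
    returned = {rid for call in trajectory.tool_calls for rid in call.returned_ids}
    return [rid for rid in trajectory.cited_ids if rid not in returned]


# A lookup that returned nothing, followed by an answer citing a record anyway.
run = Trajectory(
    tool_calls=[ToolCall(name="record_lookup", returned_ids=[])],
    cited_ids=["R-17"],
)
flagged = unsupported_citations(run)
```

A response-level check sees only the final answer and cannot tell that the cited record was never returned; the check needs the tool calls that preceded it.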

The engine itself is agent-level, reproducible, and continuous. Prompts, mutations, and scoring are inspectable, and runs are auditable through pinned inputs. Structured output tracks every axis across retests, as both your system and the verifier evolve.

Independence

Independence: We did not build and do not operate the systems we assess, and we are not compensated based on whether a system passes or fails. No equity interest, no licensing arrangement, no revenue sharing.
Human review: Automated results are confirmed by our experts before they enter the report.
Sign-off: Every report carries the signatures of our technical committee.
Record retention: Complete test logs are retained for 36 months and are available to the deploying institution and its regulators on request.

Request an evaluation

Evaluations are scoped per engagement. Tell us the system, the workflow, and the claims.