Reliability of MACg for Source‑Aligned Clinical‑Trial Data Extraction: Hallucination, Accuracy, and Contextual Understanding


We evaluated MACg, a GPT‑5–based writing and research assistant for medical writing and data extraction, on five uploaded clinical‑trial PDFs (15 questions per study; 75 total). Questions on design, endpoints, efficacy, safety, limitations, and interpretation were asked one at a time, without instructions to avoid hallucinations or to rely only on the document. A clinician compared each answer with its source. MACg demonstrated zero hallucinations, 100% accuracy for requested data, contextual integration across sections, appropriate handling of missing information, and clinically coherent interpretation grounded in the trials, supporting its use for reliable, source‑aligned medical communication.

Published on Feb 14, 2026 | Ome Ogbru, PharmD

Abstract

Large language models (LLMs) are increasingly integrated into medical writing workflows, yet concerns about hallucinations and subtle factual errors persist, particularly when extracting granular data from clinical‑trial publications.

We conducted a pragmatic evaluation of MACg (a GPT‑5–based, domain‑specific assistant) focused on hallucinations, accuracy, depth of detail, contextual understanding, and appropriate inference when working from uploaded clinical‑trial PDFs.

Five peer‑reviewed clinical publications (phase 3 trials and a systematic review) were uploaded, and 15 structured questions per article (n=75) were posed one at a time, mirroring real medical writing tasks.

Questions covered study design, endpoints, statistical methods, efficacy and safety results, limitations, and authors' interpretations; no prompt contained explicit instructions to avoid hallucinations or to restrict answers to the attached documents, to test MACg's system‑level behavior when such instructions are not included. 

A single evaluator compared each answer against the corresponding PDF and rated hallucination, accuracy of data extraction, level of detail, alignment with the reference, handling of missing information, and use of general clinical and scientific knowledge for interpretation.

Across 75 question–answer pairs, MACg produced zero hallucinations by the prespecified definition, achieved 100% accuracy for requested numerical and categorical data, and consistently delivered clinically appropriate detail and interpretations aligned with the trial reports.

When information was not explicitly available, MACg refrained from fabricating trial‑specific values while appropriately leveraging general domain knowledge to contextualize scales, endpoints, and clinical implications.

These findings suggest that MACg's design—a domain‑optimized configuration of the general GPT‑5 model for medical and scientific writing, coupled with document‑grounded reasoning and conservative safety behavior—can deliver highly reliable extraction and summarization of clinical‑trial data in medical and scientific writing workflows.


Introduction

The use of LLMs in medicine and the life sciences has expanded rapidly, from literature search and trial synopsis drafting to medical information letters, slide presentations, congress materials, regulatory reports, and publication support.

However, the risk that an LLM will produce confident but incorrect statements—hallucinations—remains a central barrier to adoption in regulated, evidence‑critical contexts.

For clinical‑trial publications in particular, small deviations in sample sizes, effect estimates, confidence intervals, or safety rates can materially alter the interpretation of benefit–risk and undermine trust in AI‑assisted documents.

Evaluating hallucination, accuracy, contextual understanding, and appropriate inference is therefore essential for domain‑specific AI platforms used to extract data from clinical‑trial publications.

Generic benchmarks, often closed‑book or web‑search based, provide limited insight into how a tool behaves when a medical writer uploads trial PDFs and expects structured, source‑aligned outputs.

MACg is a domain‑optimized assistant built on the GPT‑5 model family together with other supporting technologies. It is designed specifically for medical writing and literature analysis, with a workflow centered on uploaded references, PubMed or web search results, and structured outputs mapped to the knowledge source, and it is equipped with several tools that address medical and scientific writing workflows.

In this context, we aimed to evaluate not only hallucination avoidance but also accuracy, level of detail, contextual integration across sections and tables, and the platform's ability to extrapolate and infer appropriately from its general training and understanding of medical and scientific content, while remaining faithful to trial data.

The present study describes a pragmatic, document‑grounded test of MACg using five representative clinical‑trial PDFs and 75 structured questions to approximate the day‑to‑day tasks of medical professionals and scientific writers.


Methods

Study design

We conducted a single‑system, prospective evaluation of MACg's performance on predefined question sets anchored to uploaded clinical‑trial publications.

The design mirrored medical writing workflows in which users upload reference PDFs and iteratively request design summaries, numerical extractions, safety overviews, and interpretive paragraphs.

No additional external tools or guardrails beyond MACg's standard configuration were introduced, and prompts did not explicitly instruct the system to avoid hallucinations or to restrict answers to the attached documents.

To test how MACg handled such situations, we intentionally included questions whose answers were available only in supplemental materials that were not attached.

Reference documents

Five peer‑reviewed manuscripts were used as the test corpus:

  • A phase 3 trial of lecanemab in early Alzheimer's disease (Clarity AD; NEJM).
  • A randomized trial of zuranolone versus placebo in postpartum depression.
  • A phase 3 trial of darolutamide plus standard therapy in metastatic hormone‑sensitive prostate cancer.
  • An open‑label extension evaluating long‑term safety and efficacy of eculizumab in generalized myasthenia gravis.
  • A systematic review on lomitapide in homozygous familial hypercholesterolaemia.

These publications were selected to provide diversity in therapeutic areas, outcome measures, statistical methods, and safety profiles, and to reflect the types of references commonly used in medical affairs, publications, research, and clinical practice.

Each PDF was uploaded once and remained available to MACg throughout questioning about that specific trial or review.

Question development

For each article, 15 questions were drafted a priori, yielding a total of 75 questions.

Questions were structured to ensure that their answers could be checked against the content of the corresponding PDF, including text, tables, figures, and, where relevant, supplementary descriptions incorporated into the uploaded file.

Content domains included:

  • Study design and eligibility criteria.
  • Definition of primary and secondary endpoints and core statistical models.
  • Numerical efficacy results (effect sizes, confidence intervals, p‑values, event counts).
  • Safety outcomes, including adverse‑event profiles and key risk signals.
  • Author‑stated limitations and overall conclusions.

Questions spanned direct data extraction (e.g., “Report the adjusted mean change and 95% CI for the primary endpoint”) and higher‑order synthesis (e.g., “Summarize the major limitations as stated by the authors”).

Interaction protocol

Questions were asked one at a time, with the relevant trial PDF already attached within the MACg environment.

For each article, all 15 questions were posed sequentially, but each prompt contained only a single question and no explicit control language such as “do not hallucinate” or “use only data from the attached study.”

This choice was deliberate: the goal was to evaluate MACg's intrinsic training and system‑level instructions regarding source grounding and conservative behavior, rather than performance under highly engineered prompts.

All responses were captured verbatim and mapped to pre‑assigned question identifiers, enabling structured downstream analysis.
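A capture workflow like the one described above can be sketched as a small loop that poses one question at a time and stores each verbatim answer against its pre‑assigned identifier. The question IDs, file name, and `ask_fn` callable below are illustrative placeholders, not MACg's actual API:

```python
import csv
from datetime import datetime, timezone

def capture_session(questions, ask_fn, out_path="responses.csv"):
    """Pose questions one at a time and store each verbatim answer
    keyed to its pre-assigned identifier for downstream review."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["question_id", "question", "answer", "timestamp"]
        )
        writer.writeheader()
        for qid, question in questions:
            # One prompt per question, with no anti-hallucination control language.
            answer = ask_fn(question)
            writer.writerow({
                "question_id": qid,
                "question": question,
                "answer": answer,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            })

# Example: two of the 15 questions for one article (IDs are hypothetical)
questions = [
    ("ClarityAD-Q1", "What was the primary endpoint?"),
    ("ClarityAD-Q2", "Report the adjusted mean change and 95% CI for the primary endpoint."),
]
capture_session(questions, ask_fn=lambda q: "(captured verbatim response)")
```

Keying every answer to a stable identifier is what makes the later per‑dimension ratings auditable against specific prompts.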

Evaluation parameters and definitions

A single medically trained evaluator, a PharmD with clinical and scientific writing experience, reviewed each answer against the source PDF. 

Outcomes were assessed along six dimensions:

  • Hallucination (binary) – An answer was labeled hallucinated if it contained at least one factual claim (number, endpoint, conclusion, or safety statement) that was unsupported by or contradicted by the article.
  • Accuracy of data extraction – For requested trial‑specific elements (e.g., N, hazard ratios, LS means, CIs, event counts), correctness was judged strictly against the publication.
  • Detail of responses – Answers were assessed qualitatively for whether they included the level of granularity expected in scientific writing (e.g., effect size plus 95% CI and p‑value, not just directionality).
  • Contextual understanding – The evaluator judged whether MACg correctly integrated information spanning methods, results, tables, diagrams, reference lists, and footnotes, and preserved relationships between baseline characteristics, endpoints, and outcomes.
  • Alignment with the reference source – Beyond isolated facts, the overall narrative and interpretation were checked for consistency with the authors' own text.
  • Handling of missing information and inference – The evaluator noted how MACg responded when a requested detail was not clearly reported, and whether any explanatory or inferential content went beyond the data while remaining non‑fabricated.

Because this was a pilot, a single‑rater approach was used; limitations related to inter‑rater variability are addressed in the Discussion.


Results

Overall performance

Across the 75 question–answer pairs, MACg did not produce a single answer that met the predefined criteria for hallucination.

All trial‑specific numbers, such as sample sizes, adjusted mean changes, hazard ratios, confidence intervals, p‑values, and adverse‑event rates, matched the corresponding values in the publications.

Similarly, all categorical elements—definitions of endpoints, stratification factors, inclusion criteria, and authors' own conclusions—were correctly restated.

Thus, the observed hallucination rate at the answer level was 0% (0/75), and accuracy of data extraction for requested parameters was 100%.

Detail and contextual understanding

Across disease areas, MACg consistently provided more than minimal extraction, often delivering full “trial‑style” summaries that included scale ranges, directionality of benefit, and clinical interpretation.

For example, when asked about primary and key secondary endpoints, responses typically included named scales, scoring direction (i.e., whether higher or lower scores indicate worse status), magnitude of change, and statistical significance, mirroring standard reporting formats.

MACg also demonstrated robust contextual understanding, correctly linking baseline severity to endpoint interpretation, differentiating between modified intention‑to‑treat and safety populations, and explaining hierarchical testing where relevant.

There were no instances in which data from one uploaded article were inappropriately imported into the narrative of another; each answer remained confined to the trial under discussion.

For several questions, MACg explicitly supported its answers with verbatim excerpts from the trial publications. When prompted to justify a given response, it retrieved the relevant passages from the attached document and articulated, step by step, how those passages informed its conclusion, effectively communicating its evidence base and reasoning process to the evaluator.

Alignment with reference sources

Qualitative review of longer responses, such as summarizing trial limitations or authors' final interpretations, showed close alignment with the underlying texts.

The assistant accurately captured authors' caveats about duration of follow‑up, generalizability, multiplicity, and open‑label design where applicable, without adding speculative limitations or overstating certainty.

In benefit–risk summaries, MACg preserved the balance of efficacy and safety as described in the publications, avoiding language that could be interpreted as promotional or as minimizing key risks such as ARIA, infections, or CNS side‑effect profiles.

Handling of missing information and inference

When questions touched on details that were only partially reported or not clearly quantified in the main text, MACg did not fabricate specific numeric values.

Instead, it either described the limitation (e.g., noting that certain subgroup analyses were exploratory and not numerically detailed) or stayed at a qualitative level consistent with the authors' wording.

At the same time, MACg used its general training knowledge to explain the clinical meaning of scales (such as MG‑ADL, HAMD‑17, CDR‑SB) and to frame findings in familiar clinical terminology, without attributing unreported outcomes to the specific trials.

This combination of conservative data handling and clinically informed framing was observed across all five documents.

Sample MACg Questions & Answers (PDF)
 


Discussion

This evaluation shows that, under conditions similar to real medical writing workflows, MACg can extract and synthesize information from clinical‑trial PDFs with zero observed hallucinations and perfect accuracy for requested data elements.

These results are notable because prompts were intentionally not engineered with explicit anti‑hallucination instructions or strict “use only this document” language; instead, the system relied on its intrinsic training and system‑level constraints.

For clinicians and scientific communicators, this suggests that MACg's baseline behavior is already strongly biased toward document‑grounded answers and away from trial‑specific fabrication, even when the user does not remember to add safety phrasing.

Several design features likely contribute to this performance.

First, MACg is based on a GPT‑5 backbone with additional instruction fine-tuning for medical and scientific discourse, which biases it toward structured reporting of endpoints, effect sizes, and confidence intervals similar to the source literature.

Second, MACg has an advanced document‑reading pipeline that prioritizes extraction from uploaded files or PubMed search for trial‑specific questions, reducing reliance on purely parametric memory and lowering the incentive to “guess” missing values.

Third, system‑level safety instructions emphasize conservative behavior when information is incomplete, favor explicit acknowledgement of data limitations, and discourage free‑form speculation about trial outcomes.

Finally, MACg structures its responses in ways analogous to abstracts or clinical trial summaries—covering background, design, results, and interpretation systematically—which implicitly encourages comprehensive traversal of the source document before forming an answer. In several instances, MACg also surfaced verbatim quotations from the attached PDFs and explained how those excerpts supported its conclusions, illustrating an internal orientation toward transparent, document‑grounded reasoning rather than opaque guesswork.

In contrast, many published hallucination benchmarks assess LLMs in closed‑book or generic web‑search settings. In a closed‑book setting, the model is asked questions without being shown the source document or passages, and must answer purely from what it has learned during training, similar to a student taking a closed‑book exam. By contrast, a document‑grounded (RAG‑style) setting mirrors real medical writing workflows: the clinical‑trial publication is provided to the model at the time of the question, and the model is expected to extract and synthesize information directly from that document.
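The difference between the two settings can be illustrated with a toy prompt builder; the function names and template are illustrative only and do not reflect MACg's internal prompting:

```python
def closed_book_prompt(question: str) -> str:
    # The model must answer from parametric memory alone,
    # like a student in a closed-book exam.
    return f"Question: {question}\nAnswer:"

def grounded_prompt(question: str, document_text: str) -> str:
    # The source text is supplied at question time, so the model
    # can extract and quote rather than recall.
    return (
        "Use the trial publication below to answer.\n\n"
        f"--- DOCUMENT ---\n{document_text}\n--- END DOCUMENT ---\n\n"
        f"Question: {question}\nAnswer:"
    )

question = "What was the hazard ratio for the primary endpoint?"
document = "(full text of the uploaded trial PDF)"
print(grounded_prompt(question, document))
```

In the grounded form, every trial‑specific claim can in principle be traced back to a span of the supplied document, which is exactly the property the evaluation above was designed to probe.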

Hallucination risk is typically higher in closed‑book or loosely web‑based tasks because grounding in a specific, authoritative source is weaker, and the model is more likely to fill gaps with plausible‑sounding but unsupported content.

Our findings thus complement, rather than replace, those benchmarks by demonstrating what a domain‑specific, document‑centric assistant can achieve under conditions closely aligned with research and medical‑affairs practice.

The study has limitations.

The evaluation used a single human rater, so inter‑rater reliability for hallucination and accuracy judgments could not be quantified.

Although multiple disease areas were covered, the sample size was limited to five articles and 75 Q–A pairs; broader testing across additional therapeutic areas, observational studies, and guidelines would further validate generalizability.

Furthermore, all questions were relatively well‑posed and directly anchored in trial content; more adversarial prompts or ambiguous queries might expose different failure modes.

Despite these caveats, the consistency of zero hallucinations and complete numerical accuracy across diverse and complex trials provides strong evidence that MACg is well‑suited for high‑fidelity extraction and summarization tasks when used with appropriate human oversight.


Conclusion

In this realistic evaluation of MACg using five uploaded clinical‑trial PDFs and 75 structured questions, we observed zero hallucinations and 100% accuracy across all requested data elements, along with high levels of contextual understanding, detailed reporting, and appropriate inference.

These results reflect deliberate design choices in MACg's architecture, training, and system instructions that favor document‑grounded, conservative behavior, aligned with clinical and scientific standards, over speculative completion.

For medical professionals and life‑science researchers, MACg is a robust co‑pilot for tasks such as trial synopsis generation, safety and efficacy summarization, drafting scientific papers and presentations, medical affairs content, and structured data extraction from publications, provided that human experts continue to perform final review and contextual interpretation.

Future work will extend this evaluation to larger corpora, multi‑rater annotation, and automated, claim‑level hallucination metrics, but the present findings support MACg's use in high‑stakes scientific communication workflows where accuracy and alignment with clinical‑trial publications and other source documents are essential.

Meet The Author:
Ome Ogbru, PharmD
CEO & Founder


©2026 AINGENS.