October 2, 2025
Clinicians are increasingly turning to AI to synthesize vast amounts of literature and guidelines to support clinical decisions. It is therefore critical to continuously evaluate these solutions for their scientific rigor and medical trustworthiness.
System’s Synthesize API accepts clinical and biomedical queries and returns highly accurate, fully cited, and extremely flexible natural language responses based on the relevant literature and guidelines. Powered by the proprietary System Graph and novel agent-based retrieval and reranking algorithms, our Synthesize API is currently integrated into clinical decision support (CDS) applications at leading healthcare providers and health technology companies.
Today we’re sharing the latest results of our ongoing benchmarking of this solution. In this blinded head-to-head study, 31 clinicians contributed 93 real-world clinical questions and scored the accuracy of the responses generated by System and OpenEvidence.
Participants
Thirty-one clinicians were recruited via the User Interviews platform to participate in the evaluation. Participants included physicians, physician assistants, and nurse practitioners working across academic medical centers, family practice, and other settings.
Data Collection
As part of the recruitment process, we asked clinicians to identify their medical specialty (or specialties) and board certifications, and to share 2 to 4 questions relevant to their clinical practice that required, or could require, guidelines and literature to answer. A total of 93 clinical questions were evaluated.
For each question, we generated two responses, one from System’s Synthesize API and one from OpenEvidence. Each response was copied into its own Google Doc including the entire text output, tables, figures, and references, and labeled as Synthesis 1 or Synthesis 2. Participants were blinded to the source of the response.
A single Google Form was created for each participant, containing the responses to each of their questions and a set of criteria for evaluating them. We asked reviewers to rate their agreement that each answer was accurate on a scale of 1 (strongly disagree) to 5 (strongly agree).
Participants were also asked to select which response they preferred overall, with the option to select either of the syntheses or mark ‘no preference’.
We collected data from September 7 to September 28.
Analysis
At the end of the data collection period, we counted the responses in each score category (1 to 5) for both System and OpenEvidence, calculated the average accuracy score for each across all participant responses, and tallied the overall preference selections.
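The tallying described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used in the study: the `Rating` record, its field names, the `summarize` function, and the four toy ratings are all hypothetical and were made up for this example. They are not study data or results.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    """One reviewer's ratings for one question (hypothetical record layout)."""
    question_id: int
    system_score: int        # 1-5 agreement that the System response was accurate
    openevidence_score: int  # 1-5 agreement that the OpenEvidence response was accurate
    preference: str          # "System", "OpenEvidence", or "no preference"


def summarize(ratings):
    """Count scores per category, average them, and tally overall preference."""
    if not ratings:
        raise ValueError("at least one rating is required")

    system_scores = [r.system_score for r in ratings]
    oe_scores = [r.openevidence_score for r in ratings]
    system_counts = Counter(system_scores)
    oe_counts = Counter(oe_scores)

    return {
        # Report every category from 1 to 5, including those with zero responses.
        "system_counts": {s: system_counts.get(s, 0) for s in range(1, 6)},
        "openevidence_counts": {s: oe_counts.get(s, 0) for s in range(1, 6)},
        "system_mean": sum(system_scores) / len(system_scores),
        "openevidence_mean": sum(oe_scores) / len(oe_scores),
        "preference_counts": dict(Counter(r.preference for r in ratings)),
    }


# Made-up ratings for illustration only; these are NOT results from the study.
toy_ratings = [
    Rating(1, 5, 4, "System"),
    Rating(2, 4, 4, "no preference"),
    Rating(3, 3, 5, "OpenEvidence"),
    Rating(4, 5, 3, "System"),
]

summary = summarize(toy_ratings)
print(summary)
```

Keeping one record per question, with both scores and the preference side by side, makes it straightforward to check that every question received a score for both syntheses before any averages are computed.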
System builds knowledge infrastructure to transform decision-making from silos to systems — starting in healthcare. System’s APIs are used today by leading healthcare providers in the US and Europe to power groundbreaking clinical decision support systems (CDSS). At the core of System is the System Graph, a patented, large-scale, statistical graph of the world modeled as one interconnected system, based on trusted sources of evidence that are updated daily. System Inc. is a Public Benefit Corporation committed to advancing systems thinking in the world.