Chapter 8. Artificial Intelligence-based Software
128 Regulatory Affairs Professionals Society (RAPS)
obtain a general idea of what current practices are
for appropriate standalone performance metrics.
Widely accepted performance assessment
methodologies include ROC and free-response
ROC (FROC) curves, sensitivity, specificity,
positive predictive value (precision), negative
predictive value and area under the ROC curve.
Manufacturers also use performance metrics
such as the Dice coefficient and the Hausdorff
distance for image segmentation features, and
mean squared error. Reporting 95% confidence
intervals alongside the results is critical,
because the width of the interval conveys the
uncertainty or variability in the results.41
For example, time-sensitive triage devices
that fall under the FDA product codes QFM
(Radiological Computer-Assisted Prioritization
Software for Lesions) and QAS (Radiological
Computer-Assisted Triage and Notification
Software) have established high sensitivity
and specificity with 95% AUC.
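As an illustration of the standalone metrics listed above, the following minimal Python sketch computes sensitivity, specificity, positive predictive value and negative predictive value from confusion-matrix counts and attaches a 95% confidence interval to each. The Wilson score interval is an assumption made for this example; the text does not prescribe an interval method, and a statistical analysis plan may call for a different one (for instance, bootstrap intervals for the AUC). The function names and the counts in the usage lines are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def diagnostic_metrics(tp, fp, tn, fn):
    """Return {metric: (estimate, (ci_low, ci_high))} from confusion-matrix counts."""
    counts = {
        "sensitivity": (tp, tp + fn),  # true positives among diseased cases
        "specificity": (tn, tn + fp),  # true negatives among non-diseased cases
        "ppv": (tp, tp + fp),          # precision
        "npv": (tn, tn + fn),
    }
    return {name: (k / n, wilson_interval(k, n)) for name, (k, n) in counts.items()}


# Hypothetical counts, for illustration only.
for name, (estimate, (low, high)) in diagnostic_metrics(90, 10, 80, 20).items():
    print(f"{name}: {estimate:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Reporting each estimate together with its interval makes it easier to judge whether an apparently high value is supported by the size of the test set.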
Clinical Reader Performance Assessment
In general, reader studies are used to assess the
clinical performance of detection and diagnostic
devices. A reader study (either retrospective or
prospective) directly compares the performance
of a reader with and without the assistance of
the device. A well-designed reader study with an
appropriate statistical analysis plan is essential to
show direct assessment of diagnostic performance
in clinical practice. The quality of the
reference standard (ground truth), the number
of experts establishing it, and the selection
of readers are all important considerations in
reader studies, as each of these factors
affects the reported performance.
It is also important to select appropriate
board-certified readers with experience in the
disease area to conduct the study. In general,
US board-certified readers should be used,
as this avoids the burden of having to
demonstrate to the FDA that international
readers’ training is equivalent to that of their US
counterparts and that clinical practice in their
countries is similar to that in the US. In addition,
close attention needs to be paid to the potential
source of bias and variability among the readers.
Factors such as study population, data acquisition,
characteristics of the AI/ML applications, and
human-AI interactions need to be carefully
evaluated to avoid bias and ensure the reproducibility
of the results. Lesion detection and segmentation,
for example, require metrics that describe how
well a predicted area matches the reference area,
which is typically delineated by a radiologist. It
is recommended that developers consult with
the FDA via the pre-submission process prior to
executing the clinical reader study. Manufacturers
should also be aware that, although it does not
appear to be a regulatory requirement or to be
stated explicitly in guidance, the FDA often
requests that at least 50% of the data used to
test a device come from a US-based dataset.
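The overlap metrics mentioned above for lesion segmentation can be illustrated with a short Python sketch of the Dice coefficient and the Hausdorff distance. It assumes each mask is given as a set of (row, column) pixel coordinates and measures distance in pixel units; the function names, the brute-force search and the empty-mask convention are choices made for this example, not a method taken from the text.

```python
import math


def dice_coefficient(pred, ref):
    """Dice = 2 * |A and B| / (|A| + |B|) for two sets of pixel coordinates."""
    pred, ref = set(pred), set(ref)
    if not pred and not ref:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(pred & ref) / (len(pred) + len(ref))


def hausdorff_distance(pred, ref):
    """Symmetric Hausdorff distance (brute force, Euclidean, pixel units)."""
    pred, ref = list(pred), list(ref)
    if not pred or not ref:
        raise ValueError("Hausdorff distance is undefined for an empty mask")

    def directed(a, b):
        # Largest distance from any point in a to its nearest point in b.
        return max(min(math.dist(p, q) for q in b) for p in a)

    return max(directed(pred, ref), directed(ref, pred))


# Hypothetical 2x2 reference region and a prediction shifted by one column.
reference = {(0, 0), (0, 1), (1, 0), (1, 1)}
predicted = {(0, 1), (0, 2), (1, 1), (1, 2)}
print(dice_coefficient(predicted, reference))    # 0.5
print(hausdorff_distance(predicted, reference))  # 1.0
```

The two metrics are complementary: Dice summarizes overall overlap, while the Hausdorff distance reflects the single worst boundary disagreement. The brute-force search is quadratic in the number of pixels, so it suits small illustrative masks rather than full-resolution volumes.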
It is vital to ensure that appropriate risk-based
models are considered in software life cycle
development as per IEC 62304.71 Well-defined
training and validation studies should also be
planned to demonstrate generalizability as per
AAPM TG Report 273.41
Ground Truthing
Establishing a robust reference standard is one
of the key steps in developing successful CAD
applications. There are two types of reference
standards: subjective and objective. Reference
standards based on physicians’ opinions are
subjective and are considered reliable only if
they reflect the consensus of multiple experts.
Based on the authors’ experience, it is
recommended that developers include at least
three experts to establish a subjective reference
standard. In addition, it is always preferable to
include an objective reference standard, which
should draw on pathology results, clinical
diagnosis and outcomes, and the results of
longitudinal studies.
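A subjective reference standard built from several expert reads can be sketched in Python as follows. The text does not say how consensus is reached, so the strict-majority rule here is only one possible approach and is an assumption of this example, as are the function name and the labels; cases without a majority would need some other resolution, such as adjudication.

```python
from collections import Counter


def consensus_label(reads):
    """Return the strict-majority label, or None when no label has a majority."""
    if not reads:
        raise ValueError("at least one read is required")
    label, votes = Counter(reads).most_common(1)[0]
    return label if votes > len(reads) / 2 else None


# Hypothetical reads from three experts for two cases.
print(consensus_label(["malignant", "malignant", "benign"]))      # malignant
print(consensus_label(["malignant", "benign", "indeterminate"]))  # None
```

With three readers and two possible labels a majority always exists, which is one practical reason to use an odd number of experts.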
There are, of course, other processes employed
for setting the reference standard. For example,
with reconstruction algorithms, one method is to
establish the ground truth from CT images
reconstructed by filtered back projection from
a high-radiation-dose version of the same data.
Filtered back projection is a mathematically
accurate reconstruction algorithm developed under
the best data acquisition and reconstruction
conditions. The ground truth images could be based
on images collected from both phantoms in the