How to Conduct an AI Audit
An AI audit is a structured review of how an organization builds, buys, deploys, and governs artificial intelligence systems, measured against defined standards for accuracy, fairness, security, transparency, and legal compliance. It produces evidence: a complete inventory of AI systems, documented risk ratings, test results, and a remediation plan with owners and dates. A good audit answers three questions for every system in scope: what does it do, what could go wrong, and who is accountable when it does.
What is an AI audit and why does it differ from a software audit?
A traditional software audit checks whether code does what the specification says and whether controls protect data. An AI audit adds questions that conventional review does not cover, because machine learning systems behave probabilistically and change with their data.
Three properties separate AI systems from deterministic software:
They learn from data, so bias in the training set becomes bias in the output. A model trained on historical hiring decisions can reproduce past discrimination even when no protected attribute appears in the input.
Their behavior drifts. Performance that passed validation in March can degrade by September as real-world inputs shift away from the training distribution. Static, point-in-time review misses this.
Their decision logic is often opaque. A deep network does not expose a readable rule set, so the audit has to rely on testing, documentation, and explainability techniques rather than line-by-line inspection.
An AI audit therefore examines the full lifecycle: the data that trained the model, the model itself, the application wrapped around it, the humans in the loop, and the governance that holds the whole thing together. It treats the system as something that keeps operating and keeps changing, not as a fixed artifact.
What frameworks should an AI audit be measured against?
Audits need a defined reference point. Picking recognized standards keeps findings defensible and comparable across systems. Four references cover most needs.
Framework: NIST AI Risk Management Framework (AI RMF)
What it governs: Voluntary AI risk management practices organized into four functions: Govern, Map, Measure, and Manage.
How an audit uses it: Provides the overall audit structure, with each function mapping to a specific set of evidence to collect.
Framework: EU AI Act
What it governs: Legal obligations based on AI risk tiers: unacceptable, high, limited, and minimal risk.
How an audit uses it: Identifies which AI systems are subject to mandatory conformity assessments, documentation, and compliance requirements.
Framework: ISO/IEC 42001
What it governs: Requirements for an AI Management System (AIMS).
How an audit uses it: Evaluates whether AI governance processes are documented, repeatable, and suitable for certification.
Framework: OECD AI Principles
What it governs: High-level principles for trustworthy AI, including transparency, accountability, robustness, and human-centered values.
How an audit uses it: Serves as the basis for assessing fairness, transparency, and responsible AI governance.
The NIST AI RMF is usually the most practical structure because its four functions read as a workflow. Map identifies context and risk. Measure runs the tests. Manage prioritizes and treats the risks. Govern applies across all three, defining roles, policies, and accountability. If your organization sells into or operates inside the EU, the EU AI Act risk tiers decide how heavy the obligations are, with high-risk uses such as credit scoring, hiring, and biometric identification carrying the strictest requirements.
For organizations that want a documented control set before they start, our AI compliance checklist maps these standards to specific artifacts you can collect.
How do you scope an AI audit before you start?
Scope failures are a common reason audits produce thin findings. You cannot audit what you have not located, and most organizations underestimate how many AI systems they run. Shadow AI, meaning tools adopted by teams without central sign-off, can outnumber the systems leadership knows about.
Start by building an AI system inventory. For each system, record:
Business purpose and the decisions it influences
Whether it was built in-house, bought, or accessed through an API
Data sources, including any personal or sensitive data
The affected population: customers, employees, applicants, patients
A preliminary risk tier
The accountable business owner and the technical owner
Then set the boundary. A first audit rarely covers everything at once. Prioritize by risk and reach. A model that screens job applicants or sets insurance prices affects people directly and carries legal exposure, so it ranks above an internal tool that drafts meeting summaries. Document what is in scope, what is out, and why, so the limits of the audit are explicit and nobody mistakes silence for a clean result.
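One way to keep the inventory and the prioritization consistent is to hold each system as a structured record and sort on risk tier and reach. The sketch below is illustrative only: the `AISystem` fields mirror the list above, while the `people_affected` estimate, the tier labels, and the example systems are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

# Assumed tier scale, most severe first. Swap in your own labels.
TIER_RANK = {"high": 0, "limited": 1, "minimal": 2}


@dataclass
class AISystem:
    name: str
    purpose: str               # business purpose and decisions influenced
    origin: str                # "in-house", "bought", or "api"
    data_sources: List[str]
    affected_population: str
    people_affected: int       # rough reach estimate (assumed field)
    risk_tier: str             # preliminary tier, a key of TIER_RANK
    business_owner: str
    technical_owner: str


def prioritize(systems: List[AISystem]) -> List[AISystem]:
    """Sort by risk tier first, then by reach, largest first."""
    return sorted(systems, key=lambda s: (TIER_RANK[s.risk_tier], -s.people_affected))


# Hypothetical inventory entries.
inventory = [
    AISystem("meeting-summarizer", "Drafts meeting summaries", "api",
             ["meeting transcripts"], "employees", 400, "minimal",
             "Operations lead", "IT lead"),
    AISystem("applicant-screener", "Screens job applicants", "bought",
             ["resumes"], "applicants", 12000, "high",
             "Head of HR", "HR systems lead"),
    AISystem("support-chatbot", "Answers customer questions", "in-house",
             ["support tickets"], "customers", 50000, "limited",
             "Support director", "ML lead"),
]

audit_order = [s.name for s in prioritize(inventory)]
print(audit_order)
```

Sorting on tier before reach means a high-risk system with a small population still ranks above a widely used low-risk tool. Reverse the key if your organization weighs reach more heavily.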
What are the steps to conduct an AI audit?
The work moves through seven stages. Each produces an artifact that the next stage depends on, which keeps the audit auditable in its own right.
Inventory and scope. Catalog every AI system, assign preliminary risk tiers, and fix the audit boundary in writing.
Assemble documentation. Collect model cards, data sheets, training data lineage, intended-use statements, and prior validation reports. Gaps here are themselves findings.
Review data. Examine sources, consent and licensing, representativeness across affected groups, labeling quality, and retention. Check whether sensitive attributes leak through proxy variables such as ZIP code standing in for race.
Test the model. Measure accuracy on a held-out set, then test for bias across demographic groups using metrics like demographic parity and equalized odds. Probe robustness with edge cases and adversarial inputs. For generative systems, test for harmful outputs, prompt injection, and confident wrong answers.
Evaluate controls and oversight. Confirm human review exists where stakes are high, that an appeal path is available to affected people, and that the model cannot act beyond its intended scope.
Check monitoring and incident response. Verify that production logging, drift detection, and alerting are live, and that a defined process handles failures when they surface.
Report and remediate. Rate each finding by severity, assign an owner and a deadline, and schedule re-testing to confirm fixes hold.
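The proxy check in the data review stage can be approximated with a simple question: how much better can you guess the protected attribute once you know the permitted variable? The sketch below is one rough way to ask it, not a complete proxy analysis. The function name `proxy_strength` and the toy ZIP code data are invented for illustration.

```python
from collections import Counter, defaultdict


def proxy_strength(feature_values, protected_values):
    """Return (baseline_accuracy, feature_accuracy) for guessing the protected group.

    The baseline always guesses the most common group overall. The
    feature-based guess uses the most common group within each feature value.
    """
    n = len(protected_values)
    baseline = Counter(protected_values).most_common(1)[0][1] / n

    by_value = defaultdict(Counter)
    for feature, group in zip(feature_values, protected_values):
        by_value[feature][group] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return baseline, correct / n


# Toy data: two ZIP codes, two groups (hypothetical values).
zip_codes = ["10001"] * 5 + ["10002"] * 5
groups = ["a", "a", "a", "a", "b", "b", "b", "b", "b", "a"]

baseline, with_zip = proxy_strength(zip_codes, groups)
print(f"baseline={baseline:.2f}, knowing ZIP={with_zip:.2f}")
```

A large lift over the baseline suggests the variable carries information about the protected attribute and deserves closer review. Because the guess is scored on the same rows it was built from, the lift is optimistic for features with many distinct values. A held-out split gives a fairer estimate.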
Treat documentation gaps as substantive. If no one can produce the training data lineage for a model that prices loans, that absence is a high-severity finding regardless of how well the model scores on accuracy.
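Turning missing artifacts into findings can be done mechanically once the required set is written down. The following is a minimal sketch under stated assumptions: the required-artifact list is taken from the documentation stage above, and the rule that a gap is rated "high" for a high-stakes system and "medium" otherwise is an illustrative choice, not a standard.

```python
# Artifacts named in the documentation stage above.
REQUIRED_ARTIFACTS = [
    "model card",
    "data sheet",
    "training data lineage",
    "intended-use statement",
    "validation report",
]


def documentation_gaps(collected):
    """Return the required artifacts missing from what was collected."""
    present = {name.strip().lower() for name in collected}
    return [a for a in REQUIRED_ARTIFACTS if a not in present]


def gap_findings(system_name, collected, high_stakes):
    """Log each missing artifact as a finding. The severity rule is an assumption."""
    severity = "high" if high_stakes else "medium"
    return [
        {"system": system_name, "finding": f"Missing {artifact}", "severity": severity}
        for artifact in documentation_gaps(collected)
    ]


# Hypothetical loan-pricing model with only two artifacts on file.
findings = gap_findings("loan-pricer", ["Model card", "validation report"], high_stakes=True)
for f in findings:
    print(f)
```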
How do you test an AI model for bias and fairness?
Fairness testing is where audits most often go wrong, because "fair" has several mathematical definitions that cannot all hold at once. The audit's job is to pick the definitions that fit the use case, state them plainly, and test against them.
Common fairness metrics:
Demographic parity: each group receives the favorable outcome at a similar rate
Equalized odds: true positive and false positive rates are similar across groups
Predictive parity: a positive prediction is equally likely to be correct for every group
Calibration: predicted probabilities match observed frequencies within each group
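These metrics can conflict. A model can satisfy predictive parity while violating equalized odds, and when the underlying base rates differ between groups, an imperfect model generally cannot satisfy both. The audit documents which metric governs and the reasoning, rather than claiming a single universal fairness number.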
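A small worked example makes the conflict concrete. The confusion-matrix counts below are invented for illustration: two groups with different base rates, scored by the same hypothetical model. The helper names `rates` and `gap` are assumptions as well.

```python
# Invented confusion-matrix counts per group (100 people each).
# Group A has a 50% base rate, group B a 20% base rate.
counts = {
    "A": {"tp": 40, "fp": 10, "fn": 10, "tn": 40},
    "B": {"tp": 16, "fp": 4, "fn": 4, "tn": 76},
}


def rates(c):
    """Compute the per-group quantities behind the common fairness metrics."""
    total = c["tp"] + c["fp"] + c["fn"] + c["tn"]
    return {
        "selection_rate": (c["tp"] + c["fp"]) / total,   # demographic parity
        "tpr": c["tp"] / (c["tp"] + c["fn"]),            # equalized odds, part 1
        "fpr": c["fp"] / (c["fp"] + c["tn"]),            # equalized odds, part 2
        "ppv": c["tp"] / (c["tp"] + c["fp"]),            # predictive parity
    }


def gap(per_group, metric):
    """Largest difference in one metric across groups."""
    values = [r[metric] for r in per_group.values()]
    return max(values) - min(values)


per_group = {g: rates(c) for g, c in counts.items()}
gaps = {m: gap(per_group, m) for m in ["selection_rate", "tpr", "fpr", "ppv"]}
for metric, value in gaps.items():
    print(f"{metric}: gap = {value:.2f}")
```

In this toy data the positive predictive value is identical for both groups, so predictive parity holds, yet the false positive rate differs by 0.15 and the selection rate by 0.30. Which of those gaps matters depends on the use case, which is why the governing metric has to be chosen and documented.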
Practical fairness testing follows a sequence:
Define the protected groups that matter for this system and this jurisdiction.
Disaggregate performance, computing accuracy and error rates separately for each group.
Look for proxy discrimination where a permitted variable encodes a protected one.
Set thresholds in advance for what gap counts as a problem, so the decision is not made after seeing inconvenient results.
Record both the numbers and the judgment calls behind them.
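The disaggregation and threshold steps in the sequence above can be sketched as follows. This is a minimal illustration, assuming binary labels and predictions: the threshold values in `MAX_GAP`, the helper names, and the toy records are all assumptions, and every group is assumed to contain both positive and negative cases.

```python
# Thresholds written down before any results are seen (illustrative values).
MAX_GAP = {"accuracy": 0.05, "false_positive_rate": 0.05, "false_negative_rate": 0.05}


def group_metrics(records):
    """records: iterable of (group, y_true, y_pred) with 0/1 labels."""
    counts = {}
    for group, y_true, y_pred in records:
        c = counts.setdefault(group, {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
        key = ("t" if y_true == y_pred else "f") + ("p" if y_pred == 1 else "n")
        c[key] += 1
    metrics = {}
    for group, c in counts.items():
        total = sum(c.values())
        metrics[group] = {
            "accuracy": (c["tp"] + c["tn"]) / total,
            "false_positive_rate": c["fp"] / (c["fp"] + c["tn"]),
            "false_negative_rate": c["fn"] / (c["fn"] + c["tp"]),
        }
    return metrics


def check_gaps(metrics):
    """Compare the largest between-group gap for each metric to its preset limit."""
    report = {}
    for name, limit in MAX_GAP.items():
        values = [m[name] for m in metrics.values()]
        gap = max(values) - min(values)
        report[name] = {"gap": gap, "limit": limit, "flagged": gap > limit}
    return report


def make(group, tp, fp, fn, tn):
    """Build toy (group, y_true, y_pred) records from confusion counts."""
    return ([(group, 1, 1)] * tp + [(group, 0, 1)] * fp
            + [(group, 1, 0)] * fn + [(group, 0, 0)] * tn)


records = make("A", tp=4, fp=1, fn=1, tn=4) + make("B", tp=3, fp=1, fn=2, tn=4)
report = check_gaps(group_metrics(records))
for name, row in report.items():
    print(name, row)
```

Keeping `MAX_GAP` in version control, with a dated note on why each limit was chosen, gives the audit a record that the thresholds predate the results.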
Reported disparities vary widely by domain and dataset, so cite your own measured figures rather than borrowed ones. If you need an industry comparison point, mark it clearly as unverified until you can trace it to a source.
What goes in the audit report and remediation plan?
The report converts evidence into decisions leadership can act on. Vague findings stall. Specific findings with owners and dates produce action.
A usable AI audit report contains:
Executive summary: the highest-severity findings and the overall risk posture in plain language
Scope statement: systems covered, systems excluded, and the standards applied
Findings register: each finding with a severity rating, the evidence behind it, and the standard it relates to
Remediation plan: for every finding, a named owner, a defined fix, a deadline, and a re-test date
Residual risk: what remains after planned fixes, and who formally accepts it
Severity ratings should rest on stated criteria, combining the likelihood of harm with the scale of the affected population. A bias finding in a system that touches every loan applicant outranks a documentation gap in a low-traffic internal tool. The plan only works if accountability is individual. "The data team will address this" is weaker than "Priya Nair owns remediation, due 15 September, re-test 1 October."
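Stated criteria are easier to apply consistently when they are written as a rule. The sketch below shows one possible scheme, assuming a three-level scale for likelihood and for population reach. The scales, the score cutoffs, the `Finding` fields, and the example findings are illustrative assumptions, and the year in the dates is an arbitrary placeholder.

```python
from dataclasses import dataclass
from datetime import date

# Assumed three-level scales. Replace with your organization's criteria.
LIKELIHOOD = {"rare": 1, "possible": 2, "likely": 3}
REACH = {"small": 1, "moderate": 2, "broad": 3}


def rate_severity(likelihood, reach):
    """Combine likelihood of harm with scale of affected population."""
    score = LIKELIHOOD[likelihood] * REACH[reach]
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"


@dataclass
class Finding:
    title: str
    likelihood: str
    reach: str
    owner: str      # a named individual, not a team
    fix: str
    due: date
    retest: date

    @property
    def severity(self):
        return rate_severity(self.likelihood, self.reach)

    def problems(self):
        """Return reasons this remediation entry is not yet actionable."""
        issues = []
        if not self.owner.strip():
            issues.append("no named owner")
        if self.retest <= self.due:
            issues.append("re-test must follow the deadline")
        return issues


bias = Finding("Approval-rate gap in loan model", "likely", "broad",
               "Priya Nair", "Retrain with reweighted data",
               due=date(2025, 9, 15), retest=date(2025, 10, 1))  # placeholder year
doc_gap = Finding("Missing data sheet for internal tool", "possible", "small",
                  "", "Write data sheet",
                  due=date(2025, 9, 15), retest=date(2025, 9, 15))

print(bias.severity, bias.problems())
print(doc_gap.severity, doc_gap.problems())
```

The `problems` check reflects the point about individual accountability: an entry with a blank owner or a re-test date that does not follow the deadline is reported as incomplete rather than silently accepted.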
How often should an AI audit be repeated?
Once is not enough, because the systems keep changing after you audit them. Set a cadence rather than treating the audit as a one-time event.
Continuous: automated monitoring for drift, performance drops, and anomalies in production
Triggered: a fresh audit when a model is retrained, when the data source changes materially, when the system enters a new market, or after any incident
Periodic: a full audit on a fixed schedule, commonly annual for high-risk systems and less frequent for low-risk ones
The continuous layer catches the failures that point-in-time review cannot, and it gives the next scheduled audit real evidence about how the system behaves under live conditions.
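For the continuous layer, one widely used drift signal is the population stability index (PSI), which compares the distribution of a score or input feature in production against a baseline sample. The sketch below is a minimal version under stated assumptions: the bin edges, the toy samples, and the 0.25 alert level (a common rule of thumb, not a standard) are all choices made for illustration.

```python
import math

PSI_ALERT = 0.25  # assumed alert level. Tune it per system.


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population stability index between a baseline sample and a live sample."""
    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached or passed.
            counts[sum(v >= edge for edge in bin_edges)] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


edges = [0.25, 0.5, 0.75]
# Toy model scores: baseline at validation time vs. live traffic.
baseline_scores = [0.1] * 5 + [0.3] * 5 + [0.6] * 5 + [0.9] * 5
live_scores = [0.1] * 1 + [0.3] * 2 + [0.6] * 5 + [0.9] * 12

drift = psi(baseline_scores, live_scores, edges)
print(f"PSI = {drift:.3f}, alert = {drift > PSI_ALERT}")
```

A PSI alert says only that the distribution has moved. Whether performance or fairness degraded still has to be checked with labeled outcomes, which is the kind of evidence the next scheduled audit can draw on.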
Next Steps
Work through this checklist to run your first AI audit:
Build the inventory. List every AI system, including bought tools and API-based services, with owners named.
Assign risk tiers. Use EU AI Act categories or your own scale to rank systems by potential harm and reach.
Pick your standard. Adopt the NIST AI RMF functions as the audit structure and add the EU AI Act if you operate in the EU.
Set the scope. Document what is in, what is out, and the reasoning, signed off by leadership.
Collect documentation. Gather model cards, data sheets, and validation reports. Log every gap as a finding.
Test data, model, and controls. Run accuracy, bias, and robustness tests. Verify human oversight and appeal paths.
Confirm monitoring. Check that drift detection, logging, and incident response are live in production.
Write the findings register. Rate severity, attach evidence, and tie each finding to a standard.
Build the remediation plan. Assign an owner, a deadline, and a re-test date to every finding.
Set the cadence. Schedule continuous monitoring, triggered re-audits, and a periodic full review.
Frequently Asked Questions
What is the difference between an AI audit and AI governance?
AI governance is the standing system of policies, roles, and accountability that controls how an organization uses AI. An AI audit is a point-in-time evaluation that tests whether those controls work and whether specific systems meet defined standards. Governance is the operating model; the audit is the inspection. A finding from an audit often becomes an input that strengthens governance.
Who should conduct an AI audit?
It depends on independence requirements. Internal teams combining data science, legal, risk, and the relevant business unit can run operational audits. For high-risk systems, regulatory conformity, or external assurance, an independent third party adds credibility, because the people who built a model are poorly placed to judge its risks objectively. Many organizations use both: internal reviews on a frequent cycle and external audits at key milestones.
How long does an AI audit take?
Duration scales with scope, system complexity, and documentation quality. A single moderate-risk model with clean records can be audited in a few weeks. A first organization-wide audit, where the inventory itself has to be built and shadow AI located, runs longer. Poor documentation is often the largest single driver of delay, since missing artifacts have to be reconstructed before testing can start.
What does an AI audit cost?
Cost tracks scope, the number and risk level of systems, and whether you use internal staff or external auditors. The largest expense is usually skilled time across data science, legal, and risk functions rather than tooling. Treat the figure as proportional to risk exposure: a system that makes legally consequential decisions about people justifies more investment than a low-stakes internal assistant.
Can AI audits be automated?
Parts can. Drift detection, performance monitoring, bias metric computation, and documentation checks all benefit from tooling and should run continuously. Judgment-heavy work resists automation: deciding which fairness definition fits a use case, weighing residual risk, and assessing whether human oversight is meaningful all require human reviewers. The practical model pairs automated monitoring with periodic human-led audits.