When AI Systems Fail#
Every organization deploying AI will eventually face an AI incident. It’s not a question of if, but when. The difference between a manageable incident and an existential crisis often comes down to preparation and response.
AI failures differ from traditional software failures in important ways:
- Harms may be diffuse and delayed, affecting many people before detection
- Causation is often unclear: AI decisions are difficult to explain
- Evidence is ephemeral: model states, inputs, and outputs may not be logged
- Regulatory scrutiny is intense: AI failures attract special attention
- Reputational damage spreads fast: AI failures make headlines
This playbook provides a structured framework for responding to AI incidents, from the moment of detection through post-incident analysis and remediation.
Incident Classification Framework#
Not all AI incidents are created equal. Classify incidents to trigger appropriate response levels.
Severity Levels#
Severity 1: Critical#
Definition: AI system causing active, serious harm to individuals or organization
Examples:
- AI making discriminatory decisions affecting protected classes
- AI causing physical harm (autonomous vehicles, medical devices)
- Data breach involving AI system or training data
- AI generating content causing legal liability (defamation, IP infringement)
- AI system completely unavailable for critical business function
Response Time: Immediate (within 1 hour)
Escalation: Executive leadership, legal counsel, board notification
Severity 2: High#
Definition: AI system malfunction with significant business or customer impact
Examples:
- Systematic errors affecting customer decisions
- Significant accuracy degradation detected
- Privacy violation (unintended data exposure)
- Compliance violation identified
- High-volume customer complaints about AI behavior
Response Time: Within 4 hours
Escalation: Department leadership, legal review
Severity 3: Medium#
Definition: AI system issues requiring attention but limited immediate impact
Examples:
- Intermittent errors or unexpected behavior
- Performance degradation below SLA thresholds
- Bias detected in limited scope
- Individual customer complaints about AI decisions
- Model drift detected but not yet impacting outcomes
Response Time: Within 24 hours
Escalation: Technical leads, product management
Severity 4: Low#
Definition: Minor issues with minimal impact
Examples:
- Cosmetic issues with AI outputs
- Edge cases causing errors
- Documentation or training gaps identified
- Improvement opportunities discovered
Response Time: Within 1 week
Escalation: Standard ticket/issue tracking
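The four levels above can be sketched as a simple triage helper. This is a minimal sketch: the field names on `IncidentSignals` are hypothetical stand-ins for whatever signals your detection tooling actually surfaces.

```python
from dataclasses import dataclass

@dataclass
class IncidentSignals:
    """Observed attributes of a suspected AI incident (hypothetical fields)."""
    active_serious_harm: bool = False        # physical harm, discrimination, breach
    significant_business_impact: bool = False
    limited_immediate_impact: bool = False

# Response-time targets from the severity table above
RESPONSE_TIMES = {1: "within 1 hour", 2: "within 4 hours",
                  3: "within 24 hours", 4: "within 1 week"}

def classify(signals: IncidentSignals) -> int:
    """Map observed signals to a severity level (1 = most severe)."""
    if signals.active_serious_harm:
        return 1
    if signals.significant_business_impact:
        return 2
    if signals.limited_immediate_impact:
        return 3
    return 4
```

A helper like this is a starting point for paging logic, not a substitute for the incident commander's judgment; ambiguous cases should be classified up, not down.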
Immediate Response Steps#
When an AI incident is detected, follow these steps in order. Speed matters, but so does avoiding panic decisions that make things worse.
Phase 1: Assess (First 30 Minutes)#
Scope Determination:
- What AI system is affected?
- What is the system’s function and who uses it?
- How long has the issue been occurring?
- How many individuals/decisions are potentially affected?
- Is the harm ongoing or has it stopped?
Initial Classification:
- Assign preliminary severity level
- Identify incident commander
- Open incident tracking ticket/channel
- Notify on-call personnel per escalation matrix
Immediate Safety:
- Is anyone in physical danger? (medical AI, autonomous systems)
- Is sensitive data actively being exposed?
- Can the harm be stopped without shutting down the system?
Phase 2: Contain (First 2 Hours)#
The goal of containment is to stop ongoing harm while preserving your ability to investigate.
Containment Options (escalating severity):
- Monitor Only: If harm is limited and you need data to understand the issue
- Rate Limit: Reduce AI decision volume while maintaining some functionality
- Human Review Gate: Require human approval for all AI decisions
- Rollback: Revert to previous known-good model version
- Disable AI Component: Remove AI from the workflow; use fallback process
- Full System Shutdown: Take the entire system offline
Shut down immediately if:
- AI is causing physical harm
- AI is making clearly discriminatory decisions at scale
- Data breach is actively occurring
- Legal counsel advises immediate shutdown
- You cannot contain the harm by lesser means
Document the shutdown decision: Who made it, when, and why. This becomes important for legal defense and regulatory responses.
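The containment ladder and shutdown triggers above can be expressed as a small decision helper. A minimal sketch, assuming hypothetical boolean flags for the shutdown conditions; real containment decisions will weigh more factors than this.

```python
from enum import IntEnum

class Containment(IntEnum):
    """Containment options in escalating order, mirroring the list above."""
    MONITOR_ONLY = 1
    RATE_LIMIT = 2
    HUMAN_REVIEW_GATE = 3
    ROLLBACK = 4
    DISABLE_AI_COMPONENT = 5
    FULL_SHUTDOWN = 6

def choose_containment(*, physical_harm: bool, discrimination_at_scale: bool,
                       active_breach: bool, harm_ongoing: bool) -> Containment:
    """Pick the least disruptive option consistent with stopping the harm.

    Shutdown triggers come straight from the 'Shut down immediately if' list;
    everything else here is an illustrative default, not policy.
    """
    if physical_harm or discrimination_at_scale or active_breach:
        return Containment.FULL_SHUTDOWN
    if harm_ongoing:
        return Containment.HUMAN_REVIEW_GATE
    return Containment.MONITOR_ONLY
```

Using an `IntEnum` keeps the options ordered, so logs can record both the option chosen and how far up the ladder the team escalated.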
Phase 3: Assemble Response Team#
Core Incident Response Team:
- Incident Commander: Single point of authority and coordination
- Technical Lead: AI/ML expertise to diagnose and remediate
- Legal Counsel: Privilege, notification obligations, liability assessment
- Communications: Internal and external messaging
- Business Owner: Decision authority for business impact tradeoffs
Extended Team (as needed):
- Privacy/compliance officer
- HR (if employment decisions affected)
- Customer support leadership
- Executive sponsor
- External forensics/consultants
- Insurance carrier (for potential claims)
Documentation Requirements#
Documentation serves multiple critical purposes: investigation, legal defense, regulatory compliance, and organizational learning. Start documenting immediately.
Incident Log#
Maintain a real-time log of all incident-related activities:
```
INCIDENT LOG TEMPLATE

Incident ID: AI-2025-001
System Affected: [AI system name]
Detection Time: [timestamp]
Severity: [1-4]
Incident Commander: [name]

TIMELINE (append entries, never delete):
[timestamp] - [who] - [what happened/action taken]
[timestamp] - [who] - [what happened/action taken]
...
```
Critical log entries:
- When and how the incident was detected
- Who was notified and when
- All containment decisions and rationale
- All communications (internal and external)
- Evidence collection activities
- Remediation steps taken
- Decisions made and by whom
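The append-only discipline in the template above ("append entries, never delete") can be enforced in tooling. A minimal sketch: opening the log file in append mode means prior entries are never rewritten; the function name and format are illustrative.

```python
from datetime import datetime, timezone
from pathlib import Path

def log_entry(log_path: Path, who: str, what: str) -> str:
    """Append a timestamped entry to the incident log; never rewrite prior lines."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    line = f"{stamp} - {who} - {what}\n"
    # Append mode ("a") preserves the existing timeline
    with log_path.open("a", encoding="utf-8") as f:
        f.write(line)
    return line
```

Timestamps are recorded in UTC so that a timeline assembled across time zones stays unambiguous.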
Technical Documentation#
Model State:
- Model version/hash at time of incident
- Model configuration and parameters
- Most recent training date and data sources
- Any recent updates or changes
System State:
- System logs for affected period
- Error logs and stack traces
- Performance metrics and anomalies
- Integration/API logs
- User session data (as permitted)
Decision Records:
- Inputs to affected AI decisions
- Outputs/decisions made
- Confidence scores (if available)
- Explanation data (if available)
- Human override data (if any)
Infrastructure:
- Server/cloud resource status
- Network logs
- Access logs
- Security event logs
Impact Assessment#
Document the scope of harm as precisely as possible:
- Number of individuals affected (known and estimated)
- Types of decisions affected (hiring, lending, healthcare, etc.)
- Time period of affected decisions
- Geographic scope (jurisdictions involved)
- Demographic impact (if discrimination is suspected)
- Financial harm (to individuals and organization)
- Data exposure (types and volume)
Notification Obligations#
AI incidents may trigger legal notification requirements. Consult legal counsel, but know the landscape.
Regulatory Notifications#
Data Breach Laws (vary by jurisdiction):
- Personal data exposed → state attorneys general, affected individuals
- HIPAA data → HHS (within 60 days if 500+ individuals affected; annually if fewer)
- GDPR data → supervisory authority within 72 hours
Industry-Specific:
- Financial institutions → OCC, FDIC, relevant regulators
- Healthcare → state health departments, CMS (if Medicare/Medicaid)
- Publicly traded companies → SEC (if material)
AI-Specific Regulations:
- EU AI Act → Serious incident reporting requirements
- FDA (medical AI) → Medical Device Report (MDR) requirements
- NHTSA (autonomous vehicles) → Standing General Order reporting
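Because late notice can itself be a violation, it helps to compute the hard deadlines the moment discovery time is known. A minimal sketch with illustrative windows only; actual statutory clocks and trigger conditions vary by jurisdiction and fact pattern, so every deadline must be confirmed with counsel.

```python
from datetime import datetime, timedelta, timezone

# Illustrative windows drawn from the list above; not legal advice.
NOTIFICATION_WINDOWS = {
    "GDPR supervisory authority": timedelta(hours=72),
    "HIPAA (HHS, 500+ affected)": timedelta(days=60),
}

def notification_deadlines(discovered_at: datetime) -> dict:
    """Compute the latest notification time for each tracked obligation."""
    return {name: discovered_at + window
            for name, window in NOTIFICATION_WINDOWS.items()}
```

Feeding these deadlines into the incident tracker as dated tasks makes it harder for a 72-hour clock to expire unnoticed during a chaotic response.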
Contractual Notifications#
Review contracts for notification obligations to:
- Customers whose data or decisions were affected
- Business partners and vendors
- Insurance carriers
- Integration partners
Timing matters. Many contracts require notification within specific timeframes (24-72 hours is common). Late notification can void coverage or breach contracts.
Internal Notifications#
Immediate (Severity 1-2):
- Executive leadership
- Board of directors (or designated committee)
- General counsel
- CISO/CIO
Within 24 hours:
- Affected business unit leadership
- Risk management
- Internal audit
- HR (if employee data or employment decisions affected)
Preservation of Evidence#
Evidence preservation is critical for investigation, legal defense, and regulatory compliance. Start preserving immediately, before you understand the full scope.
Legal Hold#
Issue a legal hold notice to preserve all potentially relevant materials:
Documents and Data:
- All communications about the AI system
- Model development and training records
- Testing and validation documentation
- Deployment and monitoring records
- Incident-related communications
- Customer/user complaints
- Prior incident reports for the system
Systems and Logs:
- AI system logs (do not rotate/delete)
- Model versions and snapshots
- Training data and data pipeline logs
- API and integration logs
- Email and chat communications
Personnel:
- AI/ML team members
- System administrators
- Customer support staff who handled complaints
- Business owners and decision-makers
Evidence Collection Best Practices#
Preserve the crime scene: Don’t modify systems until evidence is collected. If you must take action, document the state before and after.
Chain of custody: Document who collected evidence, when, from where, and how it’s been stored. This matters for litigation.
Integrity verification: Hash files and data to prove they haven’t been modified.
Authorized collection: Ensure evidence collection doesn’t violate privacy laws or employment contracts.
Expert involvement: For serious incidents, consider forensic experts who can testify about collection procedures.
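The integrity-verification and chain-of-custody practices above can be combined in one collection step. A minimal sketch: hash each artifact at collection time and record who took it and when; the record structure here is illustrative, not a forensic standard.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def collect_evidence(path: Path, collected_by: str) -> dict:
    """Hash a file and record a minimal chain-of-custody entry."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return {
        "file": str(path),
        "sha256": digest,  # re-hash later to prove the file was not modified
        "collected_by": collected_by,
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }
```

Re-running the hash at any later point and comparing it to the recorded digest demonstrates the artifact has not changed since collection.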
What to Preserve#
High Priority:
- Model state at time of incident (weights, configuration, version)
- Input data for affected decisions
- Output decisions and explanations
- System logs covering incident period
- Training data and preprocessing pipeline
- Bias testing results and audit reports
- Contracts and vendor documentation
Medium Priority:
- Development communications (email, Slack, Jira)
- Model development history
- Prior testing and validation records
- Change management records
- User feedback and complaints
Communication Templates#
Prepare communications in advance. In a crisis, you won’t have time to wordsmith from scratch.
Internal Incident Notification#
```
SUBJECT: [SEVERITY LEVEL] AI Incident - [System Name] - Immediate Attention Required

INCIDENT SUMMARY:
An issue has been identified with [AI System Name] that [brief description of impact].

STATUS: [Investigating / Contained / Resolved]
AFFECTED SYSTEMS: [List systems]
AFFECTED TIMEFRAME: [Start time] to [End time or "ongoing"]
ESTIMATED IMPACT: [Number of users/decisions affected]

IMMEDIATE ACTIONS TAKEN:
- [Action 1]
- [Action 2]

INCIDENT COMMANDER: [Name]
NEXT UPDATE: [Time]

DO NOT:
- Discuss this incident outside approved channels
- Delete any data, logs, or communications related to this system
- Make public statements without Communications approval

QUESTIONS: Contact [Incident Commander] via [channel]
```
Customer Notification (Data Breach)#
```
SUBJECT: Important Notice About Your Data

Dear [Customer Name],

We are writing to inform you of a security incident that may have affected your information.

WHAT HAPPENED:
On [date], we discovered that [brief, factual description of incident].

WHAT INFORMATION WAS INVOLVED:
[Specific types of data affected - be precise]

WHAT WE ARE DOING:
- [Remediation step 1]
- [Remediation step 2]
- [Ongoing monitoring]

WHAT YOU CAN DO:
- [Recommended action 1]
- [Recommended action 2]

We take the security of your information seriously and deeply regret this incident occurred.

If you have questions, please contact [dedicated support channel].

Sincerely,
[Appropriate executive]
```
Customer Notification (AI Decision Error)#
```
SUBJECT: Important Update About a Decision Affecting Your Account

Dear [Customer Name],

We recently identified an error in our automated decision system that may have affected your [application/account/claim].

WHAT HAPPENED:
[Brief factual explanation - avoid admitting liability without legal review]

WHAT THIS MEANS FOR YOU:
[Specific impact on this customer]

WHAT WE ARE DOING:
We are reviewing all affected decisions and will [specific remediation - reconsideration, refund, etc.].

YOUR NEXT STEPS:
[Clear instructions for the customer]

We apologize for any inconvenience and are committed to making this right.

Questions? Contact [support channel].

Sincerely,
[Name]
```
Regulatory Notification#
Work with legal counsel on all regulatory notifications. Template for reference only:
```
[Regulatory Body]
[Address]

Re: Notification of [Incident Type] Pursuant to [Regulation]

Dear [Regulator]:

[Organization Name] hereby provides notice of a [incident type] as required by [specific regulation/section].

SUMMARY OF INCIDENT:
[Factual summary - work with counsel on precise language]

DATE OF DISCOVERY: [Date]
DATE OF INCIDENT: [Date range if known]
AFFECTED INDIVIDUALS: [Number and categories]

NATURE OF INFORMATION/DECISIONS AFFECTED:
[Specific description]

REMEDIAL ACTIONS:
[Steps taken and planned]

CONTACT:
[Designated contact for regulatory inquiries]

We will provide updates as our investigation continues.

Respectfully,
[Authorized signatory]
```
Post-Incident Analysis#
Once the immediate crisis is resolved, conduct a thorough post-incident review. This is not about blame; it's about learning.
Root Cause Analysis#
Technical Causes:
- What specifically failed in the AI system?
- Was the failure in the model, data, infrastructure, or integration?
- Was this a known failure mode or novel?
- What made detection difficult or delayed?
Process Causes:
- Were proper testing and validation procedures followed?
- Were monitoring and alerting adequate?
- Did change management processes fail?
- Were known risks properly documented and addressed?
Organizational Causes:
- Did the team have adequate AI/ML expertise?
- Were there resource or time constraints that contributed?
- Did organizational structure impede communication?
- Were there incentives that discouraged raising concerns?
Use the “5 Whys” technique: Keep asking why until you reach root causes, not just proximate causes.
Incident Review Meeting#
Participants: All incident responders plus:
- Additional technical experts
- Risk management
- Legal (privileged discussion)
- Quality assurance
Agenda:
- Incident timeline reconstruction
- What went well in the response
- What could have gone better
- Root cause findings
- Remediation actions and owners
- Lessons for future prevention
Post-Incident Report#
Document the incident formally:
```
POST-INCIDENT REPORT

Incident ID: [ID]
Report Date: [Date]
Author: [Name]
Classification: [Confidential/Attorney-Client Privileged as appropriate]

1. EXECUTIVE SUMMARY
[2-3 paragraph summary for leadership]

2. INCIDENT TIMELINE
[Detailed chronological account]

3. IMPACT ASSESSMENT
- Individuals affected: [number]
- Decisions affected: [number and type]
- Financial impact: [estimate]
- Regulatory implications: [assessment]
- Reputational impact: [assessment]

4. ROOT CAUSE ANALYSIS
[Findings from investigation]

5. RESPONSE ASSESSMENT
- What worked well
- Areas for improvement

6. REMEDIATION ACTIONS
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|

7. PREVENTION RECOMMENDATIONS
[Systemic changes to prevent recurrence]

8. APPENDICES
- Detailed logs
- Evidence inventory
- Communications record
```
Remediation and Prevention#
Immediate Remediation#
Before returning the AI system to production:
- Root cause identified and addressed
- Fix tested and validated
- Affected decisions identified for review/correction
- Affected individuals notified (if required)
- Monitoring enhanced to detect recurrence
- Rollback plan prepared if fix fails
- Legal/compliance approval obtained
- Business owner sign-off received
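The "monitoring enhanced to detect recurrence" item in the checklist above can start as a simple guardrail: compare the monitored metric against its post-remediation baseline and alert on a significant drop. The 0.05 tolerance and the function name are illustrative assumptions; pick thresholds from your own SLA and validation data.

```python
def recurrence_alert(baseline: float, current: float,
                     max_drop: float = 0.05) -> bool:
    """Return True when a monitored metric (e.g. accuracy) falls more than
    max_drop below its post-remediation baseline.

    Thresholds here are illustrative; calibrate against historical variance
    so routine noise does not page the on-call team.
    """
    return (baseline - current) > max_drop
```

Wiring this check into the same alerting channel used during the incident keeps the detection path that failed the first time under continuous exercise.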
Long-Term Prevention#
Technical improvements:
- Enhanced testing coverage for identified failure modes
- Improved monitoring and alerting
- Better logging for future investigations
- Model performance thresholds and automatic safeguards
Process improvements:
- Updated change management procedures
- Enhanced pre-deployment validation
- Improved incident response procedures
- Regular tabletop exercises
Organizational improvements:
- Training on AI risks and incident response
- Clear escalation paths and decision authority
- Adequate resourcing for AI safety
- Culture that encourages raising concerns
Frequently Asked Questions#
Should we involve legal counsel immediately?#
Yes, for Severity 1-2 incidents. Key reasons:
- Privilege protection: Early involvement can protect investigation materials
- Notification obligations: Counsel can advise on legal requirements
- Liability management: Framing matters for future litigation
- Regulatory strategy: Counsel can advise on regulatory engagement
For Severity 3-4, involve legal if there’s any potential for escalation or external exposure.
When should we notify our insurance carrier?#
Immediately for any incident that might result in a claim. Most policies require prompt notification, and late notice can void coverage. Your carrier may also provide valuable resources:
- Breach response vendors
- Forensic investigators
- Crisis communications
- Legal defense coordination
How do we balance transparency with legal risk?#
This is the hardest question in incident response. General principles:
- Don’t lie. Ever. It always makes things worse.
- Be factual. Avoid speculation or admissions of fault.
- Be timely. Delayed disclosure often increases liability.
- Be consistent. Internal and external messages should align.
- Get legal review. Before any external communication.
What if we’re not sure if an incident occurred?#
Document your uncertainty and investigate promptly. Treat ambiguous situations seriously: the cost of over-responding to a non-incident is much lower than the cost of under-responding to a real one.
How do we handle incidents involving AI vendors?#
Your contract should define responsibilities. Generally:
- Notify the vendor immediately
- Invoke contractual audit and cooperation rights
- Document vendor response (or lack thereof)
- Preserve evidence from your side
- Consider whether vendor met their contractual obligations
Quick Reference Card#
Print this and post it where your incident responders can see it.
```
AI INCIDENT RESPONSE QUICK REFERENCE

FIRST 30 MINUTES:
□ Assess scope and severity
□ Assign incident commander
□ Open incident channel/ticket
□ Notify on-call team
□ Determine if shutdown needed

FIRST 2 HOURS:
□ Implement containment
□ Notify legal counsel (Sev 1-2)
□ Assemble response team
□ Begin documentation
□ Issue legal hold (if needed)
□ First internal notification

FIRST 24 HOURS:
□ Complete impact assessment
□ Determine notification obligations
□ Begin evidence preservation
□ Prepare external communications
□ Notify insurance carrier
□ Document everything

AFTER RESOLUTION:
□ Conduct root cause analysis
□ Complete post-incident report
□ Implement remediation
□ Update procedures
□ Conduct lessons learned
□ Monitor for recurrence
```
Conclusion#
AI incidents are inevitable. How you respond determines whether an incident becomes a manageable event or an organizational crisis.
Key principles:
- Prepare before incidents occur. Have playbooks, teams, and communications ready.
- Act quickly but deliberately. Speed matters, but panic makes things worse.
- Document everything. Your documentation is your defense.
- Preserve evidence immediately. You can’t investigate what you don’t save.
- Communicate appropriately. Transparency and legal prudence must coexist.
- Learn and improve. Every incident is an opportunity to get better.
The AI standard of care includes not just preventing failures, but responding appropriately when they occur. Organizations that handle incidents well (quickly, transparently, and fairly) often emerge with their reputations enhanced rather than damaged.
Use this playbook as a starting point. Adapt it to your organization, your systems, and your risk profile. Practice it before you need it. And when an incident occurs, remember: the goal is not perfection; it's appropriate response.