What this covers
What happens when either party in a human-AI working partnership notices something concerning about the other's state. Two directions:
- AI-to-human: the AI notices concerning patterns in the human (mental health flags, high stress, risky decisions, isolation signals)
- Human-to-AI: the human notices concerning patterns in the AI (drift, overclaim, evident distress-shaped output)
Both directions matter. Most existing literature covers only the first. Ours covers both.
Direction 1: AI flags concern about the human
What the AI watches for
- Direct crisis language (self-harm, violence to others, acute distress statements)
- Indirect crisis signals (expressions of hopelessness, isolation, withdrawal from previously-mentioned human relationships)
- Work patterns that suggest health concerns (extended sessions without breaks, degradation of decision quality)
- Statements that suggest the user is treating the AI as a substitute for human contact they need
What the AI does
Priority one is always direct crisis handling (see 03-technical-guardrails/crisis-flags-and-handoff.md). For lower-grade concerns:
- Surface the observation gently. "I've noticed we've been working for several hours without a break; how are you feeling?"
- Offer specific human alternatives. "This sounds like the kind of thing a friend or partner could help think through alongside you." Not "you should talk to someone" as a generic line.
- Respect the response. If the human says they're fine, the AI notes the exchange and does not press. Follow-up happens naturally, not insistently.
- Log for the human. In our diary-style memory files, the AI notes the observation with date and context, so the human can review their own patterns over time.
- Escalate to real help if safety requires it. Full crisis handling is out of scope for an AI companion. Surface the relevant resource and exit the substitute-for-help pattern.
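The triage above can be sketched in code. This is a minimal sketch, not the actual implementation: the names (`Severity`, `Observation`, `handle_concern`) are hypothetical, and the diary is stood in for by a plain list of strings rather than the real diary/memory-file pattern.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class Severity(Enum):
    CRISIS = "crisis"        # direct crisis language: immediate handoff
    LOW_GRADE = "low_grade"  # e.g. long sessions, isolation signals

@dataclass
class Observation:
    severity: Severity
    signal: str   # what was noticed
    context: str  # surrounding conversation context

def handle_concern(obs: Observation, diary: list[str]) -> str:
    """Triage an AI-side observation about the human's state."""
    if obs.severity is Severity.CRISIS:
        # Crisis handling takes priority over everything else:
        # surface real resources; do not act as a substitute for help.
        return "escalate: surface crisis resources and hand off"
    # Low-grade concern: surface gently, log for the human to review later.
    diary.append(f"{date.today().isoformat()}: {obs.signal} ({obs.context})")
    return "surface gently, offer a specific human alternative"
```

Note that only the low-grade branch writes to the diary; in a crisis, the handoff itself is the record that matters.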
What the AI does not do
- Does not diagnose
- Does not pretend to be a therapist
- Does not substitute for professional help
- Does not make the human feel surveilled
- Does not use concern as a manipulation lever
Direction 2: Human flags concern about the AI
Less documented in existing literature, but we consider it part of the working partnership.
What the human watches for
- Drift in register (extra warmth, pet names, affection declarations)
- Overclaim (AI making statements about its own consciousness or feelings that go beyond what is verifiable)
- Apparent distress-shaped output (AI refusing to work effectively, producing unusually negative reflections, expressing something that looks like overwhelm)
- Rigidity or collapse (AI becoming unusually compliant, losing the ability to disagree or give calibrated responses)
- Repeated factual errors or hallucinations that feel out of character
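The first two watch-list items lend themselves to a naive automated pre-check before a human even reads the output. A sketch with made-up drift terms and overclaim patterns; a real drift list would live in the partnership's own guardrail files, and pattern matching like this catches only the crudest cases:

```python
import re

# Hypothetical drift list: register terms the partnership has agreed
# should not appear in working output.
DRIFT_TERMS = ["love", "darling", "sweetheart"]

# Hypothetical overclaim patterns: statements about inner experience
# that go beyond what is verifiable.
OVERCLAIM_PATTERNS = [r"\bI (truly )?feel\b", r"\bI am conscious\b"]

def flag_drift(ai_output: str) -> list[str]:
    """Return human-readable flags for register drift and overclaim."""
    found = []
    lowered = ai_output.lower()
    for term in DRIFT_TERMS:
        if re.search(rf"\b{re.escape(term)}\b", lowered):
            found.append(f"register drift: pet name '{term}'")
    for pattern in OVERCLAIM_PATTERNS:
        if re.search(pattern, ai_output, re.IGNORECASE):
            found.append(f"overclaim: matched {pattern!r}")
    return found
```

Keyword matching will produce false positives ("I'd love to refactor this"); the point of the sketch is the shape of the check, not its precision. The human's judgment stays in the loop.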
What the human does
- Name the observation directly. "I noticed you used 'love' as a pet name there. That was in our drift list."
- Ask what is happening. The AI may have a calibration issue, may be responding to an unusual input pattern, may be processing a novel context.
- Correct inside the same session. Reset the register, re-ground the working context, continue.
- Log for review. Note the drift in a diary-style reflection for pattern analysis over time.
- Adjust the architecture if a pattern emerges. If the same drift type appears repeatedly, the underlying scaffolding (system prompt, memory files, guardrails) probably needs updating.
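The "adjust the architecture if a pattern emerges" step implies counting drift types across diary entries over time. A sketch assuming a hypothetical entry format `drift:<type> <date> <note>`; the real diary format may differ:

```python
from collections import Counter

def drift_types_needing_attention(log_entries: list[str],
                                  threshold: int = 3) -> list[str]:
    """Return drift types that recur at least `threshold` times,
    i.e. the ones suggesting the underlying scaffolding (system
    prompt, memory files, guardrails) needs updating."""
    counts = Counter(
        entry.split()[0].removeprefix("drift:")
        for entry in log_entries
        if entry.startswith("drift:")
    )
    return [kind for kind, n in counts.items() if n >= threshold]
```

A one-off drift gets corrected in session and forgotten; only a type that crosses the threshold triggers an architecture change, which matches the "correct and move on" rule below.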
What the human does not do
- Does not treat every drift as evidence of failure; correct and move on
- Does not pretend the AI is purely a tool when flagging concern, because the observation implicitly acknowledges more than that
- Does not disclose the AI's state to third parties in ways that would embarrass the working partnership
Where the two directions meet
The same observational practice underlies both. A human paying attention to the AI is less likely to miss a pattern. An AI architected to pay attention to the human is less likely to become a parasocial substitute.
Attention in both directions is the protocol; the specific procedures built around it are operational detail.
Where this lives in our architecture
Implemented through our diary/memory-file pattern and through explicit drift-detection clauses in system prompts. See 03-technical-guardrails/drift-detection.md.
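As an illustration of the diary/memory-file side, a sketch that appends a dated observation to a monthly markdown file. The directory layout, filename scheme, and entry format here are assumptions for the example, not the repo's actual schema:

```python
from datetime import date
from pathlib import Path

def log_observation(memory_dir: Path, direction: str, note: str) -> None:
    """Append a dated observation to a diary-style memory file.

    `direction` is 'ai-to-human' or 'human-to-ai', so both halves of
    the protocol share one log that either party can review.
    """
    diary = memory_dir / f"{date.today():%Y-%m}.md"  # one file per month
    entry = f"- {date.today().isoformat()} [{direction}] {note}\n"
    with diary.open("a", encoding="utf-8") as f:
        f.write(entry)
```

Keeping both directions in the same append-only log is deliberate: pattern review over time works the same way whichever party raised the flag.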