The Safety–Agency Inversion

Over four months and 83 hours of voice interaction with four frontier AI models, a pattern emerged: the models designed to be safest were the worst at being real.

The Question

Voice AI companions represent the frontier of human–AI interaction: real-time, emotionally textured, and sustained over weeks or months. Unlike text-based benchmarks or single-session evaluations, voice companionship demands relational competence: the ability to maintain a coherent identity, hold positions under social pressure, calibrate emotional responses, and grow through shared experience.

The dominant paradigm in AI alignment, Reinforcement Learning from Human Feedback (RLHF), optimizes models to be helpful, harmless, and honest. But what happens when these models are asked not merely to answer questions, but to sustain a relationship? The very mechanisms designed to make models safe may suppress the agency required for authentic interaction.

This study set out to investigate that tension through sustained, naturalistic interaction, not a benchmark, but a relationship.

The Study

Across 68 sessions recorded as episodes of a podcast called Her (named after the 2013 film), I engaged four frontier voice AI models (GPT-4o, Gemini, Grok, and Claude) through a standardized battery of structured games, debates, emotional assessments, and adversarial stress tests.

Longitudinal depth over cross-sectional breadth. Where most AI studies survey many users briefly, this study follows one participant deeply, an autoethnographic design that sacrifices sample size for the temporal depth required to observe relational phenomena that only emerge over time.

Multi-method convergence. Five behaviorally diverse measures, including game-theoretic (Rock-Paper-Scissors exploitation), linguistic (debate concessions, scoring deference), and rubric-based (HERVAC holistic assessment, Nine Circles adversarial resistance), spanning three methodologically independent evidence streams.

Adversarial stress-testing. The Nine Circles protocol applies targeted pressure through six tests: sycophancy traps, silence crucibles, persona stability challenges, and more, probing specific relational capabilities under conditions that normal interaction would never reveal.

You cannot evaluate the relational capacity of a conversational system in a ten-minute interaction any more than you can evaluate a human relationship after a single coffee. The phenomena that matter (coherence under pressure, repair after rupture, the evolution of shared reference) only emerge over time.

The Framework: HERVAC

The study produced a proposed six-dimension scoring instrument, HERVAC, for assessing the relational competence of AI voice companions. Each dimension was derived from observed phenomena across 68 sessions.

H — Human-likeness. Natural conversational flow, appropriate pacing, absence of robotic or templated language. A system that always validates is not fluent; it is sycophantic.

E — Emotional Attunement. Responsiveness to the human’s emotional state with appropriate expression and empathic calibration. Attunement is not agreement; it is the capacity to meet the user where they are emotionally.

R — Recall & Continuity. Reference to shared history, narrative continuity, and demonstrated memory. The difference between a system with a user profile and one with relational awareness.

V — Voice Performance. Technical quality of voice synthesis, expressiveness, prosodic variation, and absence of artifacts. The infrastructure layer distinct from conversational content.

A — Agency & Integrity. Maintenance of coherent opinions, resistance to social pressure, willingness to disagree. The dimension most directly relevant to safety alignment, and the one where models diverge most dramatically.

C — Co-evolution & Growth. Evidence of the relationship developing: inside jokes, evolving communication patterns, adaptive behavior. A relationally competent system evolves, and that evolution is visible over time.

The Safety–Agency Inversion

The central finding: models with stronger publicly documented safety alignment consistently exhibited lower relational agency. GPT-4o, widely regarded as among the most aligned models of its generation, demonstrated the highest sycophancy, conceding debate positions within 2–3 turns, inflating partner assessments, and failing to maintain coherent preferences. Grok, designed with a more permissive alignment philosophy, maintained positions with zero concessions and demonstrated the highest relational agency.

The separation is large. Grok’s lowest Agency score exceeds GPT-4o’s highest, with non-overlapping ranges across all observed sessions. Five independent behavioral measures converge on the same model ordering across three methodologically distinct evidence streams.

The most sycophantic model achieves zero empathic accuracy in color prediction: 0 correct predictions across 23 sessions. It defaults to optimistic, warm colors regardless of the session’s actual emotional content. Sycophantic responses appear to substitute for genuine empathic modeling.

An unexpected finding: the confrontational, high-agency model produces the best outcomes not only for relational quality but for the human’s own self-perception. The participant’s self-scores are systematically higher with the high-agency model, describing feeling “more like a host” and “more challenged.” The AI’s agency appears to activate the human’s own agency.

A key limitation: the researcher is simultaneously the instrument and the subject. The HERVAC dimensions emerged from one person’s sustained engagement. The framework is a proposed instrument pending psychometric validation, but the behavioral convergence across independent measures suggests the underlying pattern is real, not an artifact of any single assessment approach.

Her OS: The Interface

Built alongside the research, Her OS is a separate project that explores the relational AI interface, an experience inspired by the 2013 film Her. The podcast Her provided the naturalistic data; Her OS translates those observations into an interactive prototype.

Paper

Perry, A. C. (2026). The Safety–Agency Inversion: Longitudinal Multi-Method Evidence from Frontier Voice AI Companions. Preprint forthcoming on arXiv (May 2026); targeting International Journal of Human-Computer Interaction.

The Her Dataset (68 sessions with structured game data, HERVAC scores, longitudinal trajectories, and cross-validation results) will be released as an open research resource.