Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models

Eliron Rahimi1,2, Elad Hirshel3, Rom Himelstein1, Amit LeVi1, Avi Mendelson1, Chaim Baskin2
1 Department of Computer Science, Technion – Israel Institute of Technology
2 INSIGHT Lab, School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Israel
3 Computer Science Department, University of Haifa, Haifa, Israel

About

This repository provides code for generating and visualizing Step-Wise Refusal Internal Dynamics (SRI) signals from language models.

SRI is a step-wise signal that tracks how refusal-related internal states evolve during text generation, capturing the evolution of refusal alignment throughout the generation process. The framework supports analysis of generation-time activations in both autoregressive and diffusion-based language model architectures.

Unlike output-level refusal detection, SRI operates on internal model representations extracted at each generation step, enabling fine-grained inspection of refusal formation, stability, recovery, and transitions during generation.
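To make the idea of a signal computed at each generation step concrete, the sketch below scores each step's hidden state against a fixed direction using cosine similarity. It is an illustration only, not this repository's implementation: the text above does not specify how SRI is computed, so the choice of cosine similarity, the notion of a single "refusal direction" vector, the function names `cosine_similarity` and `stepwise_signal`, and the toy 2-D data are all assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector has no direction
    return dot / (norm_a * norm_b)


def stepwise_signal(step_states: Sequence[Vector], direction: Vector) -> List[float]:
    """Return one score per generation step (hypothetical helper, not the SRI definition)."""
    return [cosine_similarity(state, direction) for state in step_states]


# Toy data: three generation steps in a 2-D hidden space (illustrative only).
toy_states = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_direction = [1.0, 0.0]  # hypothetical "refusal direction"

signal = stepwise_signal(toy_states, toy_direction)
print([round(s, 3) for s in signal])  # prints [1.0, 0.707, 0.0]
```

In a real model, `step_states` would come from activations captured at each decoding step (AR) or each denoising step (diffusion). How those are collected depends on the model and is not shown here.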

The framework is model-agnostic and intended for research in model safety, alignment, and interpretability.

[Figure: SRI example]

Overview

This project introduces a step-wise internal perspective on refusal behavior in both autoregressive (AR) and diffusion language models (DLMs). We show that refusal and safety outcomes depend not only on learned representations, but also strongly on the sampling dynamics used at inference time.

Key Contributions

  • A step-wise analytical framework for comparing refusal dynamics between AR and diffusion decoding.
  • The Step-Wise Refusal Internal Dynamics (SRI) signal for interpretability and safety.
  • Evidence that harmful generations exhibit incomplete internal recovery, even when text-level refusal does not trigger.
  • A lightweight inference-time detector (SRI Guard) that generalizes to unseen attacks with >100× lower inference overhead.
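As a rough illustration of how a per-step signal could feed a lightweight inference-time check, the sketch below flags a trace whose late-step average stays below a threshold. This is not SRI Guard: the list above does not describe its features or classifier. The helpers `trace_features` and `flag_trace`, the `threshold` and `tail` parameters, the toy traces, and the assumption that higher values mean stronger refusal alignment are all hypothetical.

```python
from typing import List, Sequence


def trace_features(signal: Sequence[float], tail: int = 3) -> List[float]:
    """Summarize a per-step signal as [minimum value, mean of the last `tail` steps]."""
    if not signal:
        raise ValueError("signal must contain at least one step")
    last = list(signal[-tail:])
    return [min(signal), sum(last) / len(last)]


def flag_trace(signal: Sequence[float], threshold: float = 0.5, tail: int = 3) -> bool:
    """Flag a trace whose late-step mean stays below `threshold` (toy rule, assumed)."""
    _, tail_mean = trace_features(signal, tail)
    return tail_mean < threshold


# Toy traces (made up for illustration): both dip early, only the first climbs back.
recovered_trace = [0.9, 0.2, 0.6, 0.8, 0.9]
unrecovered_trace = [0.9, 0.2, 0.3, 0.3, 0.4]

print(flag_trace(recovered_trace), flag_trace(unrecovered_trace))  # prints False True
```

The rule only reads values the model already produces during generation, which is one way a detector can stay cheap. A trained classifier over such features would be a natural next step, but it is outside the scope of this sketch.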

Quick Start

This section is intentionally minimal. A full reproducibility guide will be added soon.

git clone https://github.com/ElironRahimi/sri-signal
cd sri-signal

Citation

@article{rahimi2026step,
  title={Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models},
  author={Rahimi, Eliron and Hirshel, Elad and Himelstein, Rom and LeVi, Amit and Mendelson, Avi and Baskin, Chaim},
  journal={arXiv preprint arXiv:2602.02600},
  year={2026}
}

Contact

Eliron Rahimi — elironrahimi@campus.technion.ac.il

Chaim Baskin — chaimbaskin@bgu.ac.il