[Tag:] Activation Patching

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods (ICLR 2024)

This paper systematically analyzes how strongly the experimental settings (hyperparameters) of activation patching (also known as causal tracing or interchange intervention) change interpretability results. Its core message is: “More than activation patching itself, the choice of corruption method and of evaluation metric substantially changes localization and circuit discovery results.” In other words, the findings of existing mechanistic interpretability papers may be sensitive to these settings, and the paper argues that activation patching also needs “best practices.” 1. What is Activation Patching?…

May 27, 2026
Towards Faithful Natural Language Explanations: A Study Using Activation Patching in LLMs (EMNLP 2025)

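This paper is an important mechanism-based study that measures the “faithfulness” of LLM Natural Language Explanations (NLEs) from an internal causal perspective. 1. Core problem definition. Problem: LLMs readily generate **plausible explanations** through CoT and similar methods, but whether those explanations reflect the model's actual internal reasoning (that is, whether they are faithful) is a separate question. Faithfulness definition: the paper adopts the following definition: “How accurately the explanation reflects the model's actual reasoning process.” That is,…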

March 20, 2026