* Circuit Component Reuse Across Tasks in Transformer Language Models (ICLR 2024)

Paper: “Circuit Component Reuse Across Tasks in Transformer Language Models” (ICLR 2024)

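1. Core claim

The paper shows that circuit components inside a Transformer LM are not dedicated to a single task and can be reused across different tasks.

The authors compare two tasks.

Task	Required behavior
IOI: Indirect Object Identification	Predict the indirect object's name in a sentence
Colored Objects	Predict the color of an object mentioned in the context

Although they look different on the surface, both come down to selecting the correct token from several candidates in the context and copying it.

The key result is as follows.

In GPT2-Medium, the important attention heads of the IOI circuit and the Colored Objects circuit overlap by about 78%, and intervening on a few heads raises Colored Objects accuracy from **49.6% → 93.7%**.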

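2. Background: the IOI circuit

An IOI example looks like this.

“Then, Matthew and Robert had a lot of fun at the school. Robert gave a ring to …”

The correct answer is Matthew.

The IOI circuit roughly carries out the following algorithm.

1. Duplicate / Induction Heads
   Detect the subject token that appears more than once.
   Example: Robert appears twice.

2. Inhibition Heads
   Write an inhibition signal for the duplicated subject token into the residual stream, meaning "do not copy this token".

3. Mover Heads
   Attend to the candidate name that has not been inhibited
   and increase the logit in that token's direction.

4. Negative Mover Head
   Directly lowers the logit of a specific wrong candidate token.

In short, IOI suppresses the duplicated subject and copies the remaining name.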
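
The four steps above can be caricatured at the symbolic level as follows. This is only a sketch of the algorithm's logic over a list of names, not the model's actual computation; the function name `ioi_toy` and its input format are assumptions made for illustration.

```python
from collections import Counter


def ioi_toy(names_in_context):
    """Symbolic caricature of the IOI algorithm (hypothetical helper)."""
    counts = Counter(names_in_context)
    # Steps 1-2: duplicate detection finds repeated names; inhibition marks them.
    inhibited = {name for name, count in counts.items() if count > 1}
    # Step 3: the mover step copies the first candidate that was not inhibited.
    candidates = [n for n in dict.fromkeys(names_in_context) if n not in inhibited]
    return candidates[0] if candidates else None


# Names in order of appearance for the example sentence above.
print(ioi_toy(["Matthew", "Robert", "Robert"]))  # Matthew
```

The negative mover step is left out here, since in this symbolic version there is no logit to push down.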

3. Method 1: Path Patching

The paper's main analysis tool is path patching.

Ordinary activation patching replaces a head's activation with its activation on a different input and observes the change in logits. In that setup, however, it is hard to tell whether the change is a direct effect or an indirect effect mediated by downstream components.

Path patching measures the direct causal effect of a specific sender component on a specific receiver component.

Procedure

Prepare two inputs, A and B.

Example:

A: Matthew and Robert ... Robert gave a ring to
Answer: Matthew

B: Matthew and Robert ... Matthew gave a ring to
Answer: Robert

Then carry out the following four steps.

1. Run a forward pass on input A and cache all activations
2. Run a forward pass on input B and cache all activations
3. During the B forward pass, patch only the chosen sender's activation with its activation from A,
   while freezing every downstream activation other than the receiver to its B-cache value
4. Recompute only the receiver's activation,
   then patch it back into a B forward pass and measure the change in logit difference
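
A minimal sketch of these four steps on a toy scalar "model" may help separate the direct sender → receiver effect from the total effect. Everything here is an assumption made for illustration: the component names (`sender`, `mid`, `receiver`), the weights, and the `run` helper are invented and have nothing to do with GPT2-Medium's real weights.

```python
def run(x, patch=None):
    """Toy model: compute activations for scalar input x.

    `patch` maps a component name to a value that overrides the computed one.
    """
    patch = patch or {}
    act = {}
    act["sender"] = patch.get("sender", 2.0 * x)
    act["mid"] = patch.get("mid", act["sender"] + 1.0)
    act["receiver"] = patch.get("receiver", 3.0 * act["sender"] + act["mid"])
    act["logit_diff"] = act["receiver"] + 0.5 * act["sender"] + act["mid"]
    return act


def path_patch(x_a, x_b):
    """Effect on the logit difference of the sender -> receiver path only."""
    cache_a = run(x_a)  # step 1
    cache_b = run(x_b)  # step 2
    # Step 3: sender comes from A, the other non-receiver component is frozen to B.
    recomputed = run(x_b, patch={"sender": cache_a["sender"], "mid": cache_b["mid"]})
    # Step 4: patch only the recomputed receiver into a fresh B run.
    patched = run(x_b, patch={"receiver": recomputed["receiver"]})
    return patched["logit_diff"] - cache_b["logit_diff"]


def activation_patch(x_a, x_b):
    """Total effect of swapping the sender, including all indirect paths."""
    cache_a, cache_b = run(x_a), run(x_b)
    patched = run(x_b, patch={"sender": cache_a["sender"]})
    return patched["logit_diff"] - cache_b["logit_diff"]


print(path_patch(1.0, 0.0))        # 6.0
print(activation_patch(1.0, 0.0))  # 11.0
```

In this toy, path patching recovers exactly the contribution of the direct sender → receiver weight (3.0 times the change in the sender), while plain activation patching also picks up the effects routed through `mid` and the sender's direct contribution to the output.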

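The measured quantity is usually a logit difference of the following form.

$\mathrm{logit}(\text{correct}_A) - \mathrm{logit}(\text{correct}_B)$

If this value changes substantially, the sender → receiver path is considered important for performing the task.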
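
As a small illustration of the metric, the sketch below computes the logit difference from a dictionary of next-token logits. The function name and the logit values are invented for illustration and are not taken from the paper.

```python
def logit_difference(logits, correct_a, correct_b):
    """Logit of A's correct answer minus logit of B's correct answer."""
    return logits[correct_a] - logits[correct_b]


# Hypothetical next-token logits, made up for this example.
toy_logits = {"Matthew": 3.5, "Robert": 1.0, "ring": -2.0}
print(logit_difference(toy_logits, "Matthew", "Robert"))  # 2.5
```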

4. Method 2: Reproducing the IOI circuit in GPT2-Medium

Wang et al. (2022) analyzed the IOI circuit in GPT2-Small. This paper first checks whether the same circuit is reproduced in GPT2-Medium.

The analysis proceeds in the following order.

Step 1. Path patching from attention heads → logits
        Find the heads that directly affect the correct-answer logit

Step 2. Path patching from earlier heads → mover head query vectors
        Find the upstream heads that determine where the mover heads attend

Step 3. Path patching from earlier heads → inhibition head value vectors
        Find the upstream heads that produce the inhibition signal

As a result, the following components are also identified in GPT2-Medium.

Component	Role
Duplicate Token / Induction Heads	Detect duplicated tokens
Inhibition Heads	Produce a signal telling mover heads not to attend to the duplicated subject
Mover Heads	Promote the attended token in the next-token logits
Negative Mover Head	Lowers the logit of a wrong token

In GPT2-Medium, the negative mover head 19.1 is especially important. In IOI, this head attends to the S2 token (the second mention of the subject) and lowers that token's logit.

5. Method 3: Analyzing the Colored Objects circuit

A Colored Objects example looks like this.

Q: On the table, I see an orange textbook, a red puzzle, and a purple cup.
What color is the textbook?
A: Orange

Q: On the table, there is a blue pencil, a black necklace, and a yellow lighter.
What color is the pencil?
A:

The correct answer is blue.

The model has to predict one of the three color tokens.

GPT2-Medium does predict a color token on this task, but its accuracy on the correct answer is low, at **49.6%**.

Algorithm of the Colored Objects circuit

The circuit the paper finds is as follows.

1. Duplicate / Induction Heads
   Detect that the object in the question duplicates an object that appeared earlier.
   Example: pencil appears in both the description and the question.

2. Content Gatherer Heads
   At the final position, gather information from the object token and the "color" token inside the question.
   That is, they write into the residual stream which object's color is being asked about.

3. Mover Heads
   Attend to one of the three color tokens
   and increase the logit in that color token's direction.
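
The intended behavior of these three steps can be written symbolically as below. This is an idealized sketch: it always returns the right color, whereas GPT2-Medium reaches only 49.6% on the task. The function name and the `(color, object)` input format are assumptions made for illustration.

```python
def colored_objects_toy(description, queried_object):
    """Idealized Colored Objects algorithm over (color, object) pairs."""
    # Step 1: duplicate detection links the queried object to its earlier mention.
    # Step 2: content gathering identifies which object's color is requested.
    matches = [color for color, obj in description if obj == queried_object]
    # Step 3: the mover step copies the color attached to that object.
    return matches[0] if matches else None


scene = [("blue", "pencil"), ("black", "necklace"), ("yellow", "lighter")]
print(colored_objects_toy(scene, "pencil"))  # blue
```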

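The differences from IOI are as follows.

IOI	Colored Objects
Suppresses the duplicated subject	Uses the queried object as a positive signal
Centered on Inhibition Heads	Centered on Content Gatherer Heads
Negative Mover Head is active	Negative Mover Head is mostly inactive
Removes wrong candidates	Gathers information to find the right candidate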

6. Key finding: circuit component reuse

The components that overlap heavily between the two tasks are the following.

6.1 Reuse of Mover Heads

Colored Objects uses the same mover heads as IOI.

Representative heads:

15.14, 16.15, 17.4, 18.5, 19.15

These heads share the following role.

Promote the attended token as the next-token prediction

That is, they copy a name token in IOI and a color token in Colored Objects.

6.2 Reuse of Duplicate / Induction Heads

In IOI, they detect the duplicated name.

In Colored Objects, they detect that the object in the question duplicates an object in the description.

Representative heads:

9.3, 6.4, among others

Rather than understanding task-specific semantics, these heads perform a more general duplicate-token detection / induction pattern.

6.3 About 78% overlap

The paper compares the top 2% of heads ranked by path patching importance.

Result:

25 / 32 heads overlap ≈ 78%

That is, the two task circuits do not merely use similar algorithms; they actually reuse many of the same attention heads.
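
The 78% figure is simply 25 / 32. The sketch below shows one way such an overlap could be computed from two sets of head labels. The head labels are placeholders, and dividing by the size of the first set is an assumption; these notes do not restate exactly how the paper defines the denominator of 32.

```python
def overlap_fraction(heads_a, heads_b):
    """Fraction of heads in heads_a that also appear in heads_b."""
    return len(heads_a & heads_b) / len(heads_a)


# Placeholder head labels, not the paper's actual head lists.
circuit_a = {"h1", "h2", "h3", "h4"}
circuit_b = {"h1", "h2", "h3", "h9"}
print(overlap_fraction(circuit_a, circuit_b))  # 0.75

# The reported numbers: 25 shared heads out of 32.
print(25 / 32)  # 0.78125
```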

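7. Why is performance low on Colored Objects?

GPT2-Medium almost always predicts one of the three colors on Colored Objects. In other words, it knows that it has to answer with a color.

The problem is which of the three colors to choose.

The paper's analysis is that two components that work well in IOI do not function properly in Colored Objects.

7.1 The problem with the Inhibition Heads

The inhibition heads are active to some degree in Colored Objects as well.

However, instead of suppressing only the wrong colors, they spread attention noisily across the color tokens in general.

That is, they fail to implement the following ideal behavior.

Suppress black and yellow, and leave blue

Instead, they send a vague inhibition signal to every color candidate.

7.2 The problem with the Negative Mover Head

The negative mover head 19.1, which matters in IOI, mostly has its attention “parked” on the first token in Colored Objects.

So it contributes almost nothing to actually solving the task.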

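8. Method 4: Intervention experiments

The most important experiment forces the Colored Objects circuit to behave like the IOI circuit.

Intervention procedure

If the correct answer is blue and the distractors are black and yellow, the following heads are manipulated so that they attend to the distractor colors.

Inhibition heads: 12.3, 13.4, 13.13
Negative mover head: 19.1

Example:

Force the attention pattern to put 50% attention on black and 50% on yellow

In effect, the model is given the following signal.

Black and yellow are not the answer, so suppress them.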
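
The forced attention pattern can be sketched as a single attention row whose mass is split equally over the distractor positions. This only builds the target row; how it would be written into heads 12.3, 13.4, 13.13, and 19.1 during a forward pass depends on the hooking mechanism of the framework used and is not shown. The function name and the token positions are assumptions made for illustration.

```python
def force_attention(seq_len, distractor_positions):
    """Attention row that splits all mass equally over the distractors."""
    row = [0.0] * seq_len
    share = 1.0 / len(distractor_positions)
    for pos in distractor_positions:
        row[pos] = share
    return row


# Hypothetical token layout; black and yellow sit at positions 2 and 4.
tokens = ["blue", "pencil", "black", "necklace", "yellow", "lighter"]
distractors = [tokens.index("black"), tokens.index("yellow")]
print(force_attention(len(tokens), distractors))
# [0.0, 0.0, 0.5, 0.0, 0.5, 0.0]
```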

9. Intervention results

Accuracy rises as follows.

Setting	Colored Objects accuracy
Baseline	49.6%
Negative mover only	78.1%
Inhibition heads only	81.5%
Both interventions	93.7%

The important point is that the intervention does not simply raise the correct-answer logit directly; rather, the subcircuit interaction predicted from IOI is reproduced in Colored Objects.

The paper confirms the following.

1. After the inhibition intervention, the mover heads attend less to the wrong colors
2. The mover heads' logit attribution to the correct color increases
3. The mover heads' logit attribution increases by about 3x
4. Path patching importance before the intervention correlates strongly with the change in logit attribution after the intervention
   Spearman ρ = 0.69, p < 0.01

This is strong evidence that the inhibition → mover head subcircuit exists in a task-general form.
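
For readers unfamiliar with the statistic in item 4, here is a minimal sketch of Spearman's rank correlation using the textbook formula for data without ties. The input numbers are invented for illustration and are not the paper's per-head measurements; in practice a library routine such as `scipy.stats.spearmanr` would also handle ties and report a p-value.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up per-head scores: importance before vs. attribution change after.
importance = [0.1, 0.2, 0.3, 0.4, 0.5]
attribution_change = [0.2, 0.1, 0.4, 0.3, 0.5]
print(round(spearman_rho(importance, attribution_change), 2))  # 0.8
```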

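10. Significance of the paper

The paper's main contribution is the following.

It provides evidence that circuit analysis in mechanistic interpretability is not confined to a single toy task and can uncover task-general algorithmic building blocks.

In particular, it supports the following claims.

1. Attention heads do not operate only in a task-specific way.
2. Functions such as duplicate detection, inhibition, and moving/copying are reused across multiple tasks.
3. Circuit analysis of a toy task in a small model can help predict behavior on a different task in a larger model.
4. Circuit-level interventions can explain and correct model errors.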

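11. Limitations

The main limitations are also clear.

1. The model studied, GPT2-Medium, is relatively small.
2. Only two tasks are covered: IOI and Colored Objects.
3. The Colored Objects intervention is an idealized intervention.
   That is, attention is forced with knowledge of the correct answer and the distractors.
4. MLP components are not analyzed in depth; the focus is mainly on attention heads.
5. The 78% overlap estimate depends on the choice of threshold.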

12. One-sentence summary

This mechanistic interpretability paper uses path patching and causal interventions to show that, on two different tasks (IOI and Colored Objects), GPT2-Medium reuses both a similar algorithm (duplicate detection → candidate selection → token copy) and many of the same attention heads.