[태그:] Faithfulness of NLE

* Self-Critique and Refinement for Faithful Natural Language Explanations (EMNLP 2025)

다음 논문은 LLM이 생성하는 자연어 설명(NLE)의 “faithfulness(충실성)”을 어떻게 개선할 것인가를 다룬 매우 중요한 연구입니다. 핵심은 모델이 스스로 자신의 설명을 비판하고 수정할 수 있는가입니다. 1. 문제 정의 (Why this paper?) 핵심 문제: NLE의 “비충실성 (Unfaithfulness)” 예: 즉, plausible explanation ≠ faithful explanation 2. 핵심 아이디어: SR-NLE Self-Critique + Refinement 논문에서 제안한 프레임워크: SR-NLE (Self-critique and Refinement…

3월 20, 2026
* Towards Faithful Natural Language Explanations: A Study Using Activation Patching in LLMs (EMNLP 2025)

다음 논문은 LLM의 Natural Language Explanation (NLE)의 “faithfulness(충실성)”을 내부 causal 관점에서 측정하는 매우 중요한 메커니즘 기반 연구입니다 1. 핵심 문제 정의 문제 LLM은 CoT 등으로 **그럴듯한 설명(plausible explanation)**을 잘 생성하지만, 이 설명이 실제 내부 reasoning을 반영하는지 (faithful) 는 별개 즉, Faithfulness 정의 논문은 다음 정의를 채택: “Explanation이 모델의 실제 reasoning process를 얼마나 정확히 반영하는가” 즉,…

3월 20, 2026
*** The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in LLMs (ACL 2024)

1. 문제 정의 (핵심 동기) LLM이 생성하는 explanation (CoT, NLE 등)는 그럴듯하지만 실제 reasoning을 반영하지 않을 수 있음. 즉, plausibility ≠ faithfulness 논문은 다음 질문을 다룸: “설명이 모델의 실제 결정 과정과 얼마나 일치하는가?” 2. 기존 접근의 한계: Counterfactual Test (CT) CT 방식 (기존 SOTA) CT의 두 가지 치명적 문제 (1) “언급 여부만” 평가 → trivial…

3월 19, 2026

[태그:] Faithfulness of NLE

* Self-Critique and Refinement for Faithful Natural Language Explanations (EMNLP 2025)

* Towards Faithful Natural Language Explanations: A Study Using Activation Patching in LLMs (EMNLP 2025)

*** The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in LLMs (ACL 2024)