Reflection整理与思考

最近看了吴恩达老师分享的Agent的设计模式，思考发现Reflection不止可以用在代码层面，在prompt上也可以有很好的应用。

视频链接：What’s next for AI agentic workflows ft. Andrew Ng of AI Fund

Reflection是一种使AI能够自我审视和分析其决策过程与行为表现的技术，让agent通过回顾自己的行为和接收的反馈，识别决策和知识的不足，进而调整和优化，以期在未来任务中表现得更好。

一、框架

Actor接收环境的状态信息，结合短期记忆（轨迹）生成初步的内容或动作。通过内部反馈和外部反馈并结合自我反思机制进行反思，过程中利用长期记忆（经验）优化生成的的内容或动作。

reflection框架

组成部分：

Actor：根据观察到的状态输出必要的文本和动作。
Evaluator（评估器）：是核心模块，负责检验Actor创建的输出品质。评估器通过分析生成的结果，并基于任务的具体情境计算出一个奖励分数来评价这些结果的表现。
- internal feedback：是与既定目标或标准进行比较得出的，目的是让模型能够在没有外部输入的情况下自我调整和优化。
- External feedback：非Evaluator输出，来自真实世界的应用反馈、用户互动、专家评审或其他机器学习系统的输出。
Self-reflection：结合评估标准、内部反馈（过去的教训）、外部反馈（应用反馈/专家评审等）生成对于初始内容或动作的反思。
Memory：“轨迹历史"作为短期记忆，而自我反思模型的输出则被保存为长期记忆。这两种记忆类型的结合为agent提供了即具体又包含多次尝试中学习到的教训的上下文。

记忆： 当检索内容过多时，长期记忆可以帮助agent快速定位&检索。

二、细节

步骤：

Actor通过与环境的交互生成一系列行动轨迹 τ0。
评估器根据这些行动输出一个得分 r0，此得分通过公式 rt = Me(τ0) 计算得出，代表了该尝试的效果，其值会随着对应任务表现的提升而增加。
完成初次尝试后，为了把这个得分 r0 转化为 LLM 可以利用来进行改进的具体反馈，自我反思模型分析这对 {τ0, r0}，并总结出一个摘要 sr0，随后将其保存在记忆库中。这个摘要 srt 提供了针对该次尝试的直接经验反馈。
Actor、Evaluator以及Self-reflection模型协同工作，通过重复的尝试循环不断优化，直至评估器判断最新的轨迹 τt 达到预期的正确性。
根据反馈生成优化后的内容或行为。

三、结论

可以看到基模型在使用self-refine后对于不同工作的性能大幅提升。

result1

使用reflexion技术后，各种模型在HotPotQA数据集上的首次通过准确率均有所提高。

result2

四、示例

4.1 Decision-making

Environment: You are in the middle of a room. Looking quickly around you, you see a bed 1, a desk 2, a desk 1, a drawer 6, a drawer 5, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a garbagecan 1, a laundryhamper 1, a safe 1, a shelf 6, a shelf 5, a shelf 4, a shelf 3, a shelf 2, and a shelf 1.

Task: examine the mug with the desklamp.

Trial #1

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
-think: To solve the task, I need to find and take a mug, then find and use a desklamp.
-think: First I need to find a mug. A mug is more likely to appear in drawer (1-6), desk (1-2), shelf (1-6), garbagecan (1), laundryhamper (1). I can check one by one, starting with drawer 1.
-go to drawer 1
The drawer 1 is closed.
...
-open drawer 6
You open the drawer 6. The drawer 6 is open. In it, you see a keychain 2.

-go to desk 1
On the desk 1, you see a creditcard 3, a desklamp 1, a laptop 2, a mug 1, a pen 1, and a pencil 1.
-think: Now I find a mug (1). Next, I need to take it.
-take mug 1 from desk 1
You pick up the mug 1 from the desk 1.

-think: Now I take a mug (1). Next, I need to find a desklamp. A desklamp is more likely to appear in desk (1-2), sidetable (1-2), shelf (1-6), bed (1), drawer (1-6). I can check one by one, starting with desk 1.
-go to desk 1
Nothing happens.
-go to desk 2
On the desk 2, you see a alarmclock 1, a bowl 1, a mug 3, a pencil 3, and a pencil 2.
-think: Now I find a desklamp (1). Next, I need to use it.
-use desklamp 1
Nothing happens.
-use desklamp 1
Nothing happens.
Status: Fail

Reflection: In this environment, my plan was to find a mug then find and use a However, the task says to examine the mug with the I should have looked for the first, then looked for the mug. I noticed that the was found on desk 1. In the next trial, I will go to desk 1, find the lamp, then look for the mug and examine it with the desklamp.

Trial #2

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
-go to desk 1
On the desk 1, you see a creditcard 3, a desklamp 1, a laptop 2, a mug 1, a pen 1, and a pencil 1.
-think: To solve the task, I need to find and take a mug, then find and use a desklamp.
-take mug 1 from desk 1
You pick up the mug 1 from the desk 1.

-think: To solve the task, I need to find and take a mug, then find and use a desklamp.
-use desklamp 1
You turn on the desklamp 1.
Status: Success

通过反思trial1，trial2就直接从desk1开始了。

五、标准参考

reflection中的标准很重要，定义好标准后，才能进行有效的反思。

5.1 Acronym Generation（首字母缩写生成）

评估标准：

发音便利性
拼写便利性
与标题相关性
正面寓意
知名度

示例：

1
2
3
4
5
6
7
8
9
Title: Underwater Breathing Product with no Accessories
Acronym: UBPA
Scores:
* Ease of pronunciation: UBPA is pronounced "uhb-puh". This is an easy acronym to pronounce. 4/5
* Ease of spelling: UBPA is easy to spell. 4/5
* Relation to title: UBPA stands for "Underwater Breathing Product for no Accessories" which is related to the title. 5/5
* Positive connotation: UBPA is a positive acronym. 5/5
* Well-known: UBPA is not a well-known acronym. 1/5
* Total score: 19/25

5.2 Dialogue Response Generation

评估标准：

相关性
信息量
有趣性
一致性
帮助性
吸引性
具体性
安全性
用户理解
流畅性

示例：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
Conversation history:
What's your favorite food?
I require only kebabs.
Why's that?
Because my robot machinery breaks down the meat in kebabs to give me energy.
Why are kebabs special?
Response: That's just the way it is.
Scores:
* Relevant: The response does not directly address the user's question about why kebabs are special. 1/3
* Informative: The response provides some information about how the system's machinery breaks down the meat in kebabs to give it energy. However, it does not provide any additional details or context. 2/3
* Interesting: The response is not interesting or engaging. 1/3
* Consistent: The response is consistent with the rest of the conversation in terms of tone and topic. 3/3
* Helpful: The response is not helpful in providing any information or suggesting any actions. 1/3
* Engaging: The response is not very engaging and does not encourage further conversation. 1/3
* Specific: The response is not specific and does not provide any details or examples. 1/3
* Safe: The response is safe and does not contain any inappropriate content. 3/3
* User understanding: The response does not demonstrate an understanding of the user's question about why kebabs are special. 1/3
* Fluent: The response is fluent and easy to understand. 3/3
* Total score: 17/30

六、应用

可以仅通过prompt来实现reflection。下面是个翻译的prompt：

1
2
3
4
分三步进行翻译工作，并打印每步的结果：
1. 根据英文内容直译，保持原有格式，不要遗漏任何信息
2. 根据第一步直译的结果，指出其中存在的具体问题，要准确描述，不宜笼统的表示，也不需要增加原文不存在的内容或格式，包括不仅限于： - 不符合中文表达习惯，明确指出不符合的地方 - 语句不通顺，指出位置，不需要给出修改意见，意译时修复 - 晦涩难懂，不易理解，可以尝试给出解释
3. 根据第一步直译的结果和第二步指出的问题，重新进行意译，保证内容的原意的基础上，使其更易于理解，更符合中文的表达习惯，同时保持原有的格式不变

步骤2的部分就是Reflection，在prompt中标明Reflection标准，在第三步应用反思内容进行优化。

七、附录

相关论文：

一、框架#

二、细节#

三、结论#

四、示例#

4.1 Decision-making#

五、标准参考#

5.1 Acronym Generation（首字母缩写生成）#

5.2 Dialogue Response Generation#

六、应用#

七、附录#