Why the Transformer Multiplies Embeddings by √d Before Adding Positional Encoding | YelloooBlue Blog

Date: Jun 24, 2025

Tags: DL

Category: Unity of Knowledge and Action (知行合一)


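📝 Main Content

To keep the information from being "drowned out"

The goal is to make the positional encoding relatively small compared with the embedding. That way, when the two are added together, the original meaning carried by the embedding vector is not lost.

In other words, the Embedding and the positional encoding (PE) are scaled to a comparable magnitude.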
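The magnitude argument can be checked numerically. The following is a minimal sketch using only the Python standard library. It assumes sinusoidal positional encodings and an embedding whose entries are drawn with standard deviation `d_model ** -0.5`; both are assumptions for illustration, and the helper names `sinusoidal_pe` and `l2_norm` are made up here.

```python
import math
import random


def sinusoidal_pe(pos, d_model):
    """Sinusoidal positional encoding for a single position (d_model must be even)."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


d_model = 512
rng = random.Random(0)

# Assumption: embedding entries are drawn from a normal distribution
# with mean 0 and standard deviation d_model ** -0.5.
embedding = [rng.gauss(0.0, d_model ** -0.5) for _ in range(d_model)]
scaled = [x * math.sqrt(d_model) for x in embedding]
pe = sinusoidal_pe(pos=10, d_model=d_model)

pe_norm = l2_norm(pe)          # sqrt(d_model / 2): each sin/cos pair contributes 1
raw_norm = l2_norm(embedding)  # close to 1 under the assumed initialization
scaled_norm = l2_norm(scaled)  # close to sqrt(d_model)

print(f"PE norm:               {pe_norm:.2f}")
print(f"embedding norm:        {raw_norm:.2f}")
print(f"scaled embedding norm: {scaled_norm:.2f}")
```

Under these assumptions the unscaled embedding is much smaller than the positional encoding, while the scaled embedding has a norm of the same order. With an N(0, 1) initialization the picture is different, because the embedding is already of norm roughly sqrt(d_model) before scaling.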

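Related to weight sharing with the softmax layer

The scaling may be there so that weights can be shared between the decoder embedding and the decoder's pre-softmax linear layer. The paper (Attention Is All You Need) describes it as follows: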

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension 𝑑model. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [24]. In the embedding layers, we multiply those weights by √𝑑model
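To make the quoted passage concrete, here is a minimal sketch of weight sharing in plain Python. It is not the paper's implementation: the names `shared_weight`, `embed`, and `pre_softmax_logits`, the tiny sizes, and the initialization with standard deviation `d_model ** -0.5` are all assumptions chosen for illustration.

```python
import math
import random

vocab_size, d_model = 6, 4
rng = random.Random(0)

# A single matrix of shape (vocab_size, d_model), used in two places.
# Assumption: entries drawn with standard deviation d_model ** -0.5.
shared_weight = [
    [rng.gauss(0.0, d_model ** -0.5) for _ in range(d_model)]
    for _ in range(vocab_size)
]


def embed(token_id):
    """Embedding lookup: one row of the shared matrix, multiplied by sqrt(d_model)."""
    scale = math.sqrt(d_model)
    return [scale * w for w in shared_weight[token_id]]


def pre_softmax_logits(hidden):
    """Output projection: the same matrix, unscaled, giving one logit per vocabulary entry."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in shared_weight]


x = embed(2)
logits = pre_softmax_logits(x)
print(len(x), len(logits))
```

The point of the sketch is that the same matrix serves two roles with different scale requirements. The factor sqrt(d_model) is applied only on the embedding side, so the output projection can keep using the unscaled weights.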

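Related to how the Embedding is initialized

Why is the output of the embedding layer in the Transformer multiplied by sqrt(d_k)? - Kerry's answer - Zhihu https://www.zhihu.com/question/12138033385/answer/100458991151

According to the Zhihu answer:

The embedding is initialized with Xavier initialization, whose variance is 1/sqrt(d_k), so multiplying by sqrt(d_k) restores the variance to 1. (Strictly speaking, multiplying by sqrt(d_k) scales the variance by d_k, so the argument only holds if the standard deviation, not the variance, is 1/sqrt(d_k).)

However, after looking at the PyTorch source code, I found that W is initialized like this: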

init.normal_(self.weight)

The official documentation also states that PyTorch's Embedding does not use Xavier initialization:

weight (Tensor) – the learnable weights of the module of shape (num_embeddings, embedding_dim) initialized from N(0,1)

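By contrast, some implementations initialize the word embeddings with mean 0.0 and standard deviation embedding_dim ** -0.5. In that case, multiplying by sqrt(embedding_dim) brings the standard deviation back to 1.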
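The effect of the scaling on the two initializations mentioned above can be compared with a short simulation. This is a sketch using only the standard library rather than PyTorch itself; `n_samples` and the variable names are hypothetical.

```python
import math
import random
import statistics

d_model = 512
n_samples = 20000  # hypothetical sample count, large enough for a stable estimate
rng = random.Random(0)
scale = math.sqrt(d_model)

# Case 1: weights drawn from N(0, 1), as the nn.Embedding documentation states.
unit_init = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
# Case 2: weights drawn with standard deviation d_model ** -0.5.
small_init = [rng.gauss(0.0, d_model ** -0.5) for _ in range(n_samples)]

std_unit_scaled = statistics.pstdev([x * scale for x in unit_init])
std_small_scaled = statistics.pstdev([x * scale for x in small_init])

print(f"N(0, 1) init, after scaling:        std ~ {std_unit_scaled:.2f}")
print(f"std d_model ** -0.5, after scaling: std ~ {std_small_scaled:.2f}")
```

Only the second case ends up with a standard deviation near 1 after scaling. With the N(0, 1) default, the scaled embedding has a standard deviation near sqrt(d_model), so the "restore the variance to 1" explanation does not apply to it.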

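To make training more stable

Scaling also keeps the dot products between the input tokens and the weights within a reasonable range. It helps prevent gradients from becoming too large or too small during training, which could otherwise make the learning dynamics unstable.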


📎 References

https://stackoverflow.com/questions/56930821/why-does-embedding-vector-multiplied-by-a-constant-in-transformer-model

https://datascience.stackexchange.com/questions/87906/transformer-model-why-are-word-embeddings-scaled-before-adding-positional-encod

https://github.com/pytorch/tutorials/issues/2849

https://discuss.pytorch.org/t/pytorch-transformers/76993/2

https://www.reddit.com/r/MachineLearning/comments/fbn0oe/d_attention_is_all_you_need_transformer_decoder/

https://docs.pytorch.org/docs/stable/generated/torch.nn.Embedding.html

https://github.com/espnet/espnet/issues/2797

https://www.zhihu.com/question/12138033385/answer/116327921219


Author:YelloooBlue
URL:https://tangly1024.com/article/21ce32f0-1b7f-805a-bd41-cd02c13cf703
Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!
