Seq as input
Sophisticated Input
- Input is a vector
- Input is a set of vectors
Vector Set as Input
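Examples:
- Text processing
There are two common ways to represent the words of a text: One-hot Encoding and Word Embedding
- One-hot encoding: does not capture the (semantic) relationships between different labels
- Word Embedding: places semantically similar words close together, which works like a semantic clustering of the words
To learn more: https://youtu.be/X7PH3NuYW0Q (in Mandarin)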
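The difference between the two representations can be sketched with a toy example. The vocabulary and the 2-D embedding values below are made up for illustration; real embeddings are learned from data.

```python
import numpy as np

vocab = ["apple", "bag", "cat", "dog"]  # hypothetical toy vocabulary

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

# Two distinct one-hot vectors are always orthogonal, so the encoding
# cannot say that "cat" is closer to "dog" than to "bag".
cat_dog = one_hot("cat") @ one_hot("dog")
cat_bag = one_hot("cat") @ one_hot("bag")

# Hand-picked 2-D embeddings (illustrative values, not learned).
embedding = {
    "apple": np.array([0.9, 0.1]),
    "bag": np.array([0.8, 0.3]),
    "cat": np.array([0.1, 0.9]),
    "dog": np.array([0.2, 0.8]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_cat_dog = cosine(embedding["cat"], embedding["dog"])
sim_cat_bag = cosine(embedding["cat"], embedding["bag"])
```

With one-hot vectors both dot products are zero, while with the toy embeddings "cat" ends up more similar to "dog" than to "bag".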
- Speech recognition (simplified)
Graph
- Social Network
- Drug Discovery
Output
- Each vector has a label
- The whole sequence has a label
- The model decides the number of labels itself (seq2seq)
Sequence Labeling
The first input/output pattern above (each vector has a label) is called Sequence Labeling.
Self-attention
FC: Fully-connected network
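Self-attention takes in an entire Sequence as input, and it outputs exactly as many vectors as it receives as input.
Self-attention can be applied multiple times.
Attention Is All You Need
In the paper above, Google proposed the transformer network architecture for the first time.
The most important Module in the transformer is Self-attention.
Mu Li (李沐): after Transformer (as in the Transformers robots), model names have become more and more fanciful.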
Relevant
- Dot-product
- Additive
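The two relevance scores listed above can be sketched as follows. This is a minimal illustration, assuming the usual formulation in which each input vector is first projected by a query matrix and a key matrix; the names `W_q`, `W_k`, and `w` and all sizes are chosen here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_k = 4, 3                      # illustrative sizes
a_i = rng.normal(size=d_in)           # vector asking "who is relevant to me?"
a_j = rng.normal(size=d_in)           # vector being scored
W_q = rng.normal(size=(d_k, d_in))    # query projection
W_k = rng.normal(size=(d_k, d_in))    # key projection
w = rng.normal(size=d_k)              # extra vector used only by the additive score

def dot_product_score(a_i, a_j):
    """Relevance as the dot product of query and key."""
    q = W_q @ a_i
    k = W_k @ a_j
    return float(q @ k)

def additive_score(a_i, a_j):
    """Relevance as a projection of tanh(query + key)."""
    return float(w @ np.tanh(W_q @ a_i + W_k @ a_j))

alpha_dot = dot_product_score(a_i, a_j)
alpha_add = additive_score(a_i, a_j)
```

The dot-product form is the one used in the Transformer; the additive form needs the extra parameter `w` and a nonlinearity.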
where
Extract information based on attention scores
Vectorization
Despite all these operations, the only parameters that need to be learned are the projection matrices W^q, W^k, and W^v.
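The whole computation can be written as a handful of matrix products. The sketch below assumes one input vector per row (some presentations stack them as columns instead), and it includes the 1/sqrt(d_k) scaling from "Attention Is All You Need", which these notes do not mention. All names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over the rows of X."""
    Q = X @ W_q                               # queries, shape (n, d_k)
    K = X @ W_k                               # keys,    shape (n, d_k)
    V = X @ W_v                               # values,  shape (n, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # attention scores, shape (n, n)
    A = softmax(scores, axis=-1)              # each row sums to 1
    return A @ V, A                           # one output vector per input vector

rng = np.random.default_rng(0)
n, d_in, d_k = 5, 8, 4
X = rng.normal(size=(n, d_in))
W_q, W_k, W_v = (rng.normal(size=(d_in, d_k)) for _ in range(3))
out, A = self_attention(X, W_q, W_k, W_v)
```

Only `W_q`, `W_k`, and `W_v` would be trained; everything else is fixed arithmetic on the input.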
Multi-head Self-attention
Different types of relevance
2 heads
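Each quantity with a given superscript (the head index) only interacts with the quantities that carry the same superscript.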
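A minimal sketch of the 2-head case, assuming the common implementation in which the projected queries, keys, and values are split into equal slices, one per head, and the head outputs are concatenated and mixed by an output matrix `W_o`. The names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head self-attention over the rows of X."""
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    # Head h only combines Q[h], K[h], and V[h]; heads never mix here.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, n, n)
    A = softmax(scores, axis=-1)
    heads = A @ V                                         # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, A

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 8, 2
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, A = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
```

Each head produces its own attention matrix, which is how different heads can capture different types of relevance.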
Positional Encoding
Each column represents a positional vector
- No position information in self-attention
- Each position has a unique positional vector
- hand-crafted
- learned from data
More recent research: https://arxiv.org/abs/2003.09229
It proposes and compares different kinds of positional encoding.
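As one example of a hand-crafted encoding, the sinusoidal scheme from "Attention Is All You Need" can be sketched as below. Here each row (rather than each column) is a positional vector, and the function name and sizes are chosen for illustration; the sketch assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a (num_positions, d_model) array; d_model is assumed to be even."""
    positions = np.arange(num_positions)[:, None]     # (num_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(num_positions=50, d_model=16)
# The positional vector is added to the input vector at the same position,
# for example: X_with_position = X + pe[: X.shape[0]]
```

A learned encoding would replace `pe` with a trainable array of the same shape.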
Application
Transformer: https://arxiv.org/abs/1706.03762
BERT: https://arxiv.org/abs/1810.04805
Widely used in Natural Language Processing (NLP)!
Self-attention for Speech
https://arxiv.org/abs/1910.12977
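If the attention mechanism is applied as described above, the size of the Attention Matrix grows quadratically with the length of the seq, which takes up a lot of memory. Truncated Self-attention is therefore commonly used, where each position only attends to part of the seq.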
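One simple way to truncate attention is to let each position look only at neighbours within a fixed window. The sketch below assumes that windowed form (the cited paper may differ in detail) and uses a hypothetical `window` parameter. For clarity it still builds the full n x n score matrix and masks it; a memory-efficient version would compute only the entries inside the window.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def truncated_self_attention(Q, K, V, window):
    """Each position attends only to positions at distance <= window."""
    n = Q.shape[0]
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window   # (n, n) band mask
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = np.where(mask, scores, -np.inf)               # block distant positions
    A = softmax(scores, axis=-1)                           # blocked entries become 0
    return A @ V, A

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out, A = truncated_self_attention(Q, K, V, window=2)
```

With `window=2`, a position in the middle of the sequence attends to 5 positions instead of all 8.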
Self-attention for Image
An image can also be considered as a vector set.
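Concretely, an image of height H, width W, and C channels can be flattened into H x W vectors of dimension C, one per pixel. The sketch below uses a tiny made-up image.

```python
import numpy as np

H, W, C = 4, 5, 3                                  # illustrative image size
image = np.arange(H * W * C, dtype=float).reshape(H, W, C)

# Every pixel becomes one C-dimensional vector: (H, W, C) -> (H*W, C).
vectors = image.reshape(-1, C)

# The pixel at row r, column c ends up at index r * W + c.
r, c = 1, 2
pixel_vector = vectors[r * W + c]
```

The resulting `vectors` array has the same shape as any other vector set, so it can be fed to self-attention directly.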
Self-Attention GAN
https://arxiv.org/abs/1805.08318
Detection Transformer (DETR)
https://arxiv.org/abs/2005.12872
Self-attention vs. CNN
- CNN: self-attention that can only attend within a receptive field
- CNN is simplified self-attention.
- Self-attention: CNN with a learnable receptive field
- Self-attention is the complex version of CNN
The receptive field of a CNN is specified by hand, while the receptive field of Self-attention is learned by the model itself.
Relationship
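A 2019 paper:
On the Relationship between Self-Attention and Convolutional Layers
This paper gives a rigorous mathematical proof that:
- CNN is a special case of Self-attention: with suitably chosen parameters, Self-attention can do exactly what a CNN does
- Self-attention is a more flexible CNN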
Amount of training data
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
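Self-attention vs. RNN
- Like Self-attention, an RNN handles the case where the input is a sequence
- An RNN has a memory vector and an RNN block, and so on (Prof. Lee did not explain this part clearly)
- An RNN cannot produce its outputs in parallel, while Self-attention can
- In terms of computation speed, Self-attention is more efficient, and many models now replace the RNN architecture with Self-attention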
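The difference in parallelism can be sketched as follows. The RNN is a plain tanh recurrence, and the self-attention part skips the query/key/value projections to keep the example short; both choices are simplifications made here, not details from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))            # a sequence of n input vectors

# RNN: step t needs the hidden state from step t-1,
# so the outputs have to be produced one after another.
W_x = rng.normal(size=(d, d))
W_h = rng.normal(size=(d, d))
h = np.zeros(d)                        # the memory vector
rnn_steps = []
for x_t in X:
    h = np.tanh(x_t @ W_x + h @ W_h)
    rnn_steps.append(h)
rnn_outputs = np.stack(rnn_steps)      # shape (n, d)

# Self-attention: all n outputs come from a few matrix products,
# with no dependency between output positions.
scores = X @ X.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
attn_outputs = weights @ X             # shape (n, d)
```

The loop cannot be parallelized across time steps, whereas the matrix products can be executed in parallel on suitable hardware.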
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Self-attention for Graph
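A Graph can also be seen as a set of vectors, so it can be processed with self-attention as well.
Self-attention has its own mechanism for finding the relevance between different items, but the structure of the Graph already contains relevance information.
Consider edges: only attend to connected nodes
This is one type of Graph Neural Network (GNN).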
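Restricting attention to connected nodes can be sketched by masking the attention scores with the adjacency matrix. This is a simplified illustration: the query/key/value projections are omitted, each node is also allowed to attend to itself (an assumption made here), and the 4-node path graph is made up.

```python
import numpy as np

def graph_self_attention(X, adjacency):
    """Self-attention where node i only attends to its neighbours and itself."""
    n, d = X.shape
    mask = adjacency.astype(bool) | np.eye(n, dtype=bool)   # assumed self-loops
    scores = X @ X.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)                # drop unconnected pairs
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ X, weights

# Hypothetical path graph: 0 - 1 - 2 - 3
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))            # one feature vector per node
out, weights = graph_self_attention(X, adjacency)
```

Node 0 is only connected to node 1, so its attention weights on nodes 2 and 3 are exactly zero.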
To learn more
Long Range Arena: A Benchmark for Efficient Transformers
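The drawback of the attention mechanism is its heavy computation. This paper tries a wide range of variants and benchmarks their performance and training efficiency.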
- Post title: 07_Hung-yi Lee_Self-attention
- Create time: 2022-04-06 14:39:01
- Post link: https://overduse.github.io/Machine-Learning/07-hung-yi-lee-self-attention/
- Copyright notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.