07_Hung-yi Lee_Self-attention
Carpe Tu Black Whistle

Seq as input

Sophisticated Input

  • Input is a vector
    image
  • Input is a set of vectors
    image

Vector Set as Input

Examples:

  1. Text processing
    image
    There are two common ways to represent the words of a text: one-hot encoding and word embedding.
  • One-hot encoding: does not capture the relations (semantic relations) between different labels.
  • Word embedding: words with similar meanings are roughly clustered together in the embedding space (see the sketch below).

To learn more: https://youtu.be/X7PH3NuYW0Q (in Mandarin)
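A minimal sketch of the two word representations, assuming a toy vocabulary and a randomly initialized embedding matrix (a real embedding would be learned from data):

```python
import numpy as np

# Toy vocabulary and embedding size; both are assumptions for illustration.
vocab = ["cat", "dog", "apple"]
V, d = len(vocab), 4

# One-hot encoding: each word is a V-dimensional vector with a single 1.
# Every pair of words is equally distant, so no semantic relation is encoded.
one_hot = np.eye(V)

# Word embedding: a lookup into a V x d matrix that is learned from data;
# here a random matrix stands in for a trained embedding.
embedding = np.random.randn(V, d)

word_id = vocab.index("dog")
print(one_hot[word_id])      # one-hot vector for "dog"
print(embedding[word_id])    # embedding vector for "dog"
```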

  2. Speech recognition (simplified)
    image

  3. Graph

    1. Social Network
      image
    2. Drug Discovery
      image

Output

  • Each vector has a label
    image
  • The whole sequence has a label
    image
  • The model decides the number of labels itself (seq2seq)
    image

Sequence Labeling

The first type of input-output mapping above, where each vector gets its own label, is called Sequence Labeling.

Self-attention

FC: Fully-connected network

image

Self-attention takes in a whole sequence at once: however many vectors go in, the same number of output vectors come out.

Self-attention can be used multiple times, e.g. alternating with fully-connected layers.

Attention Is All You Need

In the paper "Attention Is All You Need", Google proposed the Transformer architecture for the first time.
The most important module inside the Transformer is self-attention.

Mu Li: ever since the Transformer (named after the Transformers robots), model names have become more and more fanciful.

Relevance

Two common ways to compute the relevance α between a pair of vectors:

  1. Dot-product
  2. Additive

image

image
The values obtained before entering the softmax are called attention scores.
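A minimal sketch of the two relevance functions for one pair of input vectors; all dimensions and weight matrices (W_q, W_k, and the W, w of the additive variant) are stand-ins for learned parameters:

```python
import numpy as np

d, d_k = 8, 4                              # input dim, query/key dim (assumptions)
a1, a2 = np.random.randn(d), np.random.randn(d)
W_q, W_k = np.random.randn(d_k, d), np.random.randn(d_k, d)

q = W_q @ a1                               # query computed from a1
k = W_k @ a2                               # key computed from a2

# 1. Dot-product relevance (the variant used in the Transformer):
alpha_dot = q @ k

# 2. Additive relevance: project the concatenated q and k, pass the result
#    through tanh, then reduce to a scalar with a learned vector w.
W = np.random.randn(d_k, 2 * d_k)
w = np.random.randn(d_k)
alpha_add = w @ np.tanh(W @ np.concatenate([q, k]))

print(alpha_dot, alpha_add)                # two ways to get an attention score
```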

Extract information based on attention scores

image
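A sketch of how one output vector b^1 is extracted for query position 1: the attention scores against every position are passed through a softmax and then used to take a weighted sum of the value vectors (all shapes and weight matrices are assumptions):

```python
import numpy as np

n, d, d_k = 4, 8, 4                   # sequence length and dims (assumptions)
A_in = np.random.randn(n, d)          # input vectors a1..a4, one per row
W_q, W_k, W_v = (np.random.randn(d_k, d) for _ in range(3))

q1 = W_q @ A_in[0]                    # query for position 1
K = A_in @ W_k.T                      # keys for every position,   shape (n, d_k)
V = A_in @ W_v.T                      # values for every position, shape (n, d_k)

scores = K @ q1                                  # attention scores for position 1
alpha = np.exp(scores) / np.exp(scores).sum()    # softmax over the scores
b1 = alpha @ V                                   # weighted sum of values = b^1
print(b1.shape)                                  # (d_k,)
```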

Vectorization

image

image

image

image

Despite all these operations, in the end the only parameters that need to be learned are the matrices W^q, W^k, and W^v.
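The whole layer in matrix form, following the lecture's notation Q = W^q I, K = W^k I, V = W^v I, A = K^T Q, A' = softmax(A), O = V A', where the columns of I are the input vectors; the concrete shapes below are assumptions. Note that W^q, W^k, W^v really are the only learnable quantities:

```python
import numpy as np

def self_attention(I, W_q, W_k, W_v):
    """Single-head self-attention; I is (d, n) with one input vector per column."""
    Q = W_q @ I                                   # queries, shape (d_k, n)
    K = W_k @ I                                   # keys,    shape (d_k, n)
    V = W_v @ I                                   # values,  shape (d_k, n)
    A = K.T @ Q                                   # attention matrix, shape (n, n)
    A = np.exp(A - A.max(axis=0, keepdims=True))  # column-wise softmax ...
    A_prime = A / A.sum(axis=0, keepdims=True)    # ... gives A'
    return V @ A_prime                            # outputs O, one column per input

d, d_k, n = 8, 4, 5                               # dims and length (assumptions)
I = np.random.randn(d, n)
W_q, W_k, W_v = (np.random.randn(d_k, d) for _ in range(3))
O = self_attention(I, W_q, W_k, W_v)
print(O.shape)                                    # (d_k, n): one output per input
```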

Multi-head Self-attention

Different types of relevance

2 heads

Quantities with the same superscript (i.e. belonging to the same head) only interact with the corresponding quantities that share that superscript.

image

image
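A minimal 2-head sketch: each head i has its own W^{q,i}, W^{k,i}, W^{v,i}, quantities of head i only interact with each other, and the per-head outputs are concatenated and mixed by an extra matrix W^o. The output mixing matrix and all shapes follow the usual multi-head formulation and are assumptions here:

```python
import numpy as np

def softmax_cols(A):
    A = np.exp(A - A.max(axis=0, keepdims=True))
    return A / A.sum(axis=0, keepdims=True)

def multi_head_self_attention(I, heads, W_o):
    """I: (d, n) inputs as columns; heads: list of (W_q, W_k, W_v) per head."""
    outputs = []
    for W_q, W_k, W_v in heads:            # each head attends independently
        Q, K, V = W_q @ I, W_k @ I, W_v @ I
        A_prime = softmax_cols(K.T @ Q)    # head-specific attention matrix
        outputs.append(V @ A_prime)        # head-specific outputs
    return W_o @ np.concatenate(outputs, axis=0)   # mix the concatenated heads

d, d_head, n, n_heads = 8, 4, 5, 2         # assumptions
I = np.random.randn(d, n)
heads = [tuple(np.random.randn(d_head, d) for _ in range(3)) for _ in range(n_heads)]
W_o = np.random.randn(d, n_heads * d_head)
print(multi_head_self_attention(I, heads, W_o).shape)   # (d, n)
```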

Positional Encoding

Each column represents a positional vector

  • No position information in self-attention
  • Each position has a unique positional vector
  • hand-crafted
  • learned from data

A relatively recent study, https://arxiv.org/abs/2003.09229, proposes and compares different positional encodings.

image
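One common hand-crafted choice is the sinusoidal encoding from "Attention Is All You Need"; the positional vector e^i is simply added to the input vector a^i. The dimensions below are assumptions:

```python
import numpy as np

def sinusoidal_positional_encoding(n, d):
    """Return an (n, d) matrix; row i is the positional vector for position i.
    d is assumed to be even here."""
    pos = np.arange(n)[:, None]                        # positions 0..n-1
    dim = np.arange(0, d, 2)[None, :]                  # even dimension indices
    angles = pos / np.power(10000, dim / d)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)                       # even dims: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dims:  cosine
    return pe

n, d = 5, 8                                            # assumptions
A_in = np.random.randn(n, d)                           # input vectors, one per row
A_with_pos = A_in + sinusoidal_positional_encoding(n, d)   # add e^i to a^i
```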

Application

Transformer: https://arxiv.org/abs/1706.03762
BERT: https://arxiv.org/abs/1810.04805

Widely used in Natural Language Processing (NLP)!

Self-attention for Speech

image
https://arxiv.org/abs/1910.12977

If self-attention is applied as described above, the size of the attention matrix grows quadratically with the length of the sequence, which takes a lot of memory. For speech, truncated self-attention is therefore commonly used: each position only attends to a limited part of the sequence (see the sketch below).
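A sketch of the idea, assuming a fixed attention window around each position: scores outside the window are masked out before the softmax, so each output only depends on nearby frames. For clarity this sketch still builds the full attention matrix; a real implementation would skip the masked entries to actually save memory.

```python
import numpy as np

def truncated_self_attention(I, W_q, W_k, W_v, window=2):
    """I: (d, n) inputs as columns; each position attends only to
    positions within `window` steps of itself."""
    n = I.shape[1]
    Q, K, V = W_q @ I, W_k @ I, W_v @ I
    A = K.T @ Q                                   # full (n, n) scores
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) > window
    A[mask] = -np.inf                             # drop out-of-window pairs
    A = np.exp(A - A.max(axis=0, keepdims=True))
    A_prime = A / A.sum(axis=0, keepdims=True)    # column-wise softmax
    return V @ A_prime

d, d_k, n = 8, 4, 10                              # assumptions
I = np.random.randn(d, n)
W_q, W_k, W_v = (np.random.randn(d_k, d) for _ in range(3))
O = truncated_self_attention(I, W_q, W_k, W_v, window=2)
print(O.shape)                                    # (d_k, n)
```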

Self-attention for Image

An image can also be considered as a vector set.
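A sketch of treating an image as a vector set: each pixel position becomes one vector whose dimension is the number of channels (the image size is an assumption):

```python
import numpy as np

H, W, C = 32, 32, 3                     # toy image size (assumption)
image = np.random.rand(H, W, C)

# Each of the H*W pixel positions is one C-dimensional vector,
# so the image becomes a "sequence" of H*W vectors.
vector_set = image.reshape(H * W, C)
print(vector_set.shape)                 # (1024, 3)
```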

Self-Attention GAN

image
https://arxiv.org/abs/1805.08318

Detection Transformer (DETR)

image
https://arxiv.org/abs/2005.12872

Self-attention v.s. CNN

  • CNN: self-attention that can only attend within a receptive field
    • CNN is simplified self-attention.
  • Self-attention: CNN with a learnable receptive field
    • Self-attention is the complex version of CNN.

The receptive field of a CNN is hand-crafted, while the "receptive field" of self-attention is learned by the model itself.

Relationship

image

A 2019 paper:
On the Relationship between Self-Attention and Convolutional Layers

This paper proves rigorously, in mathematical terms, that:

  • CNN is a special case of self-attention: with suitably chosen parameters, self-attention can do exactly what a CNN does.
  • Self-attention is a more flexible CNN.

Amount of training data: with less data a CNN performs better, while with enough data self-attention eventually outperforms the CNN (results from the paper below).

image

An image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Self-attention v.s. RNN

  • Like self-attention, an RNN also handles the case where the input is a sequence.
  • An RNN keeps a memory vector that is fed into the RNN block together with each input vector (the lecture only sketches this part briefly).

image

  • An RNN cannot produce its outputs in parallel, whereas self-attention computes all of its outputs in parallel.
  • In terms of computation speed, self-attention is therefore more efficient, and many models now replace the RNN architecture with self-attention.

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

Self-attention for Graph

A graph can also be regarded as a set of vectors (one per node), and a set of vectors can be processed with self-attention.

Self-attention has its own mechanism for discovering the relevance between different items, but a graph's structure (its edges) already contains relevance information.

Consider the edges: only attend to connected nodes (see the sketch below).
This is one type of Graph Neural Network (GNN).

image
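A sketch of attention restricted by the edges: the attention matrix is masked with the adjacency matrix so that each node only attends to its neighbours (and itself). This only illustrates the idea, not a full GNN; the toy graph and all shapes are assumptions.

```python
import numpy as np

def graph_self_attention(I, adj, W_q, W_k, W_v):
    """I: (d, n) node features as columns; adj: (n, n) adjacency matrix."""
    Q, K, V = W_q @ I, W_k @ I, W_v @ I
    A = K.T @ Q                                    # pairwise attention scores
    allowed = (adj + np.eye(adj.shape[0])) > 0     # neighbours plus self-loops
    A[~allowed] = -np.inf                          # no edge -> no attention
    A = np.exp(A - A.max(axis=0, keepdims=True))
    A_prime = A / A.sum(axis=0, keepdims=True)     # column-wise softmax
    return V @ A_prime

# Toy graph with 4 nodes and edges (0,1), (1,2), (2,3); all values are assumptions.
adj = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    adj[i, j] = adj[j, i] = 1

d, d_k = 8, 4
I = np.random.randn(d, 4)
W_q, W_k, W_v = (np.random.randn(d_k, d) for _ in range(3))
O = graph_self_attention(I, adj, W_q, W_k, W_v)
print(O.shape)    # (d_k, 4): one output vector per node
```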

To learn more

Long Range Arena: A Benchmark for Efficient Transformers
The drawback of the attention mechanism is its heavy computation; this paper evaluates a wide range of attention variants and compares their performance and training efficiency.
image

image
Efficient Transformers: A Survey