07_Hung-yi Lee_Self-attention
Carpe Tu Black Whistle

Seq as input

Sophisticated Input

  • Input is a vector
    image
  • Input is a set of vectors
    image

Vector Set as Input

Examples:

  1. Text processing
    image
    There are two common ways to represent the words of a text: one-hot encoding and word embedding.
  • One-hot encoding: does not capture the relations (semantic relations) between different labels.
  • Word embedding: words with similar meanings are roughly clustered together in the embedding space (see the sketch below).

To learn more: https://youtu.be/X7PH3NuYW0Q (in Mandarin)
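A minimal sketch of the two word representations, assuming a toy vocabulary and a randomly initialized embedding matrix (a real embedding would be learned from data):

```python
import numpy as np

# Toy vocabulary and embedding size; both are assumptions for illustration.
vocab = ["cat", "dog", "apple"]
V, d = len(vocab), 4

# One-hot encoding: each word is a V-dimensional vector with a single 1.
# Every pair of words is equally distant, so no semantic relation is encoded.
one_hot = np.eye(V)

# Word embedding: a lookup into a V x d matrix that is learned from data;
# here a random matrix stands in for a trained embedding.
embedding = np.random.randn(V, d)

word_id = vocab.index("dog")
print(one_hot[word_id])      # one-hot vector for "dog"
print(embedding[word_id])    # embedding vector for "dog"
```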

  2. Speech recognition (simplified)
    image

  3. Graph

    1. Social Network
      image
    2. Drug Discovery
      image

Output

  • Each vector has a label
    image
  • The whole sequence has a label
    image
  • The model decides the number of labels itself (seq2seq)
    image

Sequence Labeling

The first type of input-output mapping above, where each vector gets its own label, is called Sequence Labeling.

Self-attention

FC: Fully-connected network

image

Self-attention takes in a whole sequence at once: however many vectors go in, the same number of output vectors come out.

Self-attention can be used multiple times, e.g. alternating with fully-connected layers.

Attention Is All You Need

In the paper "Attention Is All You Need", Google proposed the Transformer architecture for the first time.
The most important module inside the Transformer is self-attention.

Mu Li: ever since the Transformer (named after the Transformers robots), model names have become more and more fanciful.

Relevance

Two common ways to compute the relevance α between a pair of vectors:

  1. Dot-product
  2. Additive

image

image
The values obtained before entering the softmax are called attention scores.
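A minimal sketch of the two relevance functions for one pair of input vectors; all dimensions and weight matrices (W_q, W_k, and the W, w of the additive variant) are stand-ins for learned parameters:

```python
import numpy as np

d, d_k = 8, 4                              # input dim, query/key dim (assumptions)
a1, a2 = np.random.randn(d), np.random.randn(d)
W_q, W_k = np.random.randn(d_k, d), np.random.randn(d_k, d)

q = W_q @ a1                               # query computed from a1
k = W_k @ a2                               # key computed from a2

# 1. Dot-product relevance (the variant used in the Transformer):
alpha_dot = q @ k

# 2. Additive relevance: project the concatenated q and k, pass the result
#    through tanh, then reduce to a scalar with a learned vector w.
W = np.random.randn(d_k, 2 * d_k)
w = np.random.randn(d_k)
alpha_add = w @ np.tanh(W @ np.concatenate([q, k]))

print(alpha_dot, alpha_add)                # two ways to get an attention score
```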

Extract information based on attention scores

image
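A sketch of how one output vector b^1 is extracted for query position 1: the attention scores against every position are passed through a softmax and then used to take a weighted sum of the value vectors (all shapes and weight matrices are assumptions):

```python
import numpy as np

n, d, d_k = 4, 8, 4                   # sequence length and dims (assumptions)
A_in = np.random.randn(n, d)          # input vectors a1..a4, one per row
W_q, W_k, W_v = (np.random.randn(d_k, d) for _ in range(3))

q1 = W_q @ A_in[0]                    # query for position 1
K = A_in @ W_k.T                      # keys for every position,   shape (n, d_k)
V = A_in @ W_v.T                      # values for every position, shape (n, d_k)

scores = K @ q1                                  # attention scores for position 1
alpha = np.exp(scores) / np.exp(scores).sum()    # softmax over the scores
b1 = alpha @ V                                   # weighted sum of values = b^1
print(b1.shape)                                  # (d_k,)
```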

Vectorization

image

image

image

image

Despite all these operations, in the end the only parameters that need to be learned are the matrices W^q, W^k, and W^v.
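The whole layer in matrix form, following the lecture's notation Q = W^q I, K = W^k I, V = W^v I, A = K^T Q, A' = softmax(A), O = V A', where the columns of I are the input vectors; the concrete shapes below are assumptions. Note that W^q, W^k, W^v really are the only learnable quantities:

```python
import numpy as np

def self_attention(I, W_q, W_k, W_v):
    """Single-head self-attention; I is (d, n) with one input vector per column."""
    Q = W_q @ I                                   # queries, shape (d_k, n)
    K = W_k @ I                                   # keys,    shape (d_k, n)
    V = W_v @ I                                   # values,  shape (d_k, n)
    A = K.T @ Q                                   # attention matrix, shape (n, n)
    A = np.exp(A - A.max(axis=0, keepdims=True))  # column-wise softmax ...
    A_prime = A / A.sum(axis=0, keepdims=True)    # ... gives A'
    return V @ A_prime                            # outputs O, one column per input

d, d_k, n = 8, 4, 5                               # dims and length (assumptions)
I = np.random.randn(d, n)
W_q, W_k, W_v = (np.random.randn(d_k, d) for _ in range(3))
O = self_attention(I, W_q, W_k, W_v)
print(O.shape)                                    # (d_k, n): one output per input
```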

Multi-head Self-attention

Different types of relevance

2 heads

Quantities with the same superscript (i.e. belonging to the same head) only interact with the corresponding quantities that share that superscript.

image

image
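A minimal 2-head sketch: each head i has its own W^{q,i}, W^{k,i}, W^{v,i}, quantities of head i only interact with each other, and the per-head outputs are concatenated and mixed by an extra matrix W^o. The output mixing matrix and all shapes follow the usual multi-head formulation and are assumptions here:

```python
import numpy as np

def softmax_cols(A):
    A = np.exp(A - A.max(axis=0, keepdims=True))
    return A / A.sum(axis=0, keepdims=True)

def multi_head_self_attention(I, heads, W_o):
    """I: (d, n) inputs as columns; heads: list of (W_q, W_k, W_v) per head."""
    outputs = []
    for W_q, W_k, W_v in heads:            # each head attends independently
        Q, K, V = W_q @ I, W_k @ I, W_v @ I
        A_prime = softmax_cols(K.T @ Q)    # head-specific attention matrix
        outputs.append(V @ A_prime)        # head-specific outputs
    return W_o @ np.concatenate(outputs, axis=0)   # mix the concatenated heads

d, d_head, n, n_heads = 8, 4, 5, 2         # assumptions
I = np.random.randn(d, n)
heads = [tuple(np.random.randn(d_head, d) for _ in range(3)) for _ in range(n_heads)]
W_o = np.random.randn(d, n_heads * d_head)
print(multi_head_self_attention(I, heads, W_o).shape)   # (d, n)
```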

Positional Encoding

Each column represents a positional vector

  • No position information in self-attention
  • Each position has a unique positional vector
  • hand-crafted
  • learned from data

A relatively recent study, https://arxiv.org/abs/2003.09229, proposes and compares different positional encodings.

image
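One common hand-crafted choice is the sinusoidal encoding from "Attention Is All You Need"; the positional vector e^i is simply added to the input vector a^i. The dimensions below are assumptions:

```python
import numpy as np

def sinusoidal_positional_encoding(n, d):
    """Return an (n, d) matrix; row i is the positional vector for position i.
    d is assumed to be even here."""
    pos = np.arange(n)[:, None]                        # positions 0..n-1
    dim = np.arange(0, d, 2)[None, :]                  # even dimension indices
    angles = pos / np.power(10000, dim / d)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)                       # even dims: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dims:  cosine
    return pe

n, d = 5, 8                                            # assumptions
A_in = np.random.randn(n, d)                           # input vectors, one per row
A_with_pos = A_in + sinusoidal_positional_encoding(n, d)   # add e^i to a^i
```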

Application

Transformer: https://arxiv.org/abs/1706.03762
BERT: https://arxiv.org/abs/1810.04805

Widely used in Natural Language Processing (NLP)!

Self-attention for Speech

image
https://arxiv.org/abs/1910.12977

If self-attention is applied as described above, the size of the attention matrix grows quadratically with the length of the sequence, which takes a lot of memory. For speech, truncated self-attention is therefore commonly used: each position only attends to a limited part of the sequence (see the sketch below).
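A sketch of the idea, assuming a fixed attention window around each position: scores outside the window are masked out before the softmax, so each output only depends on nearby frames. For clarity this sketch still builds the full attention matrix; a real implementation would skip the masked entries to actually save memory.

```python
import numpy as np

def truncated_self_attention(I, W_q, W_k, W_v, window=2):
    """I: (d, n) inputs as columns; each position attends only to
    positions within `window` steps of itself."""
    n = I.shape[1]
    Q, K, V = W_q @ I, W_k @ I, W_v @ I
    A = K.T @ Q                                   # full (n, n) scores
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) > window
    A[mask] = -np.inf                             # drop out-of-window pairs
    A = np.exp(A - A.max(axis=0, keepdims=True))
    A_prime = A / A.sum(axis=0, keepdims=True)    # column-wise softmax
    return V @ A_prime

d, d_k, n = 8, 4, 10                              # assumptions
I = np.random.randn(d, n)
W_q, W_k, W_v = (np.random.randn(d_k, d) for _ in range(3))
O = truncated_self_attention(I, W_q, W_k, W_v, window=2)
print(O.shape)                                    # (d_k, n)
```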

Self-attention for Image

An image can also be considered as a vector set.
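A sketch of treating an image as a vector set: each pixel position becomes one vector whose dimension is the number of channels (the image size is an assumption):

```python
import numpy as np

H, W, C = 32, 32, 3                     # toy image size (assumption)
image = np.random.rand(H, W, C)

# Each of the H*W pixel positions is one C-dimensional vector,
# so the image becomes a "sequence" of H*W vectors.
vector_set = image.reshape(H * W, C)
print(vector_set.shape)                 # (1024, 3)
```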

Self-Attention GAN

image
https://arxiv.org/abs/1805.08318

Detection Transformer (DETR)

image
https://arxiv.org/abs/2005.12872

Self-attention v.s. CNN

  • CNN: self-attention that can only attend within a receptive field
    • CNN is simplified self-attention.
  • Self-attention: CNN with a learnable receptive field
    • Self-attention is the complex version of CNN.

The receptive field of a CNN is hand-crafted, while the "receptive field" of self-attention is learned by the model itself.

Relationship

image

A 2019 paper:
On the Relationship between Self-Attention and Convolutional Layers

This paper proves rigorously, in mathematical terms, that:

  • CNN is a special case of self-attention: with suitably chosen parameters, self-attention can do exactly what a CNN does.
  • Self-attention is a more flexible CNN.

Amount of training data: with less data a CNN performs better, while with enough data self-attention eventually outperforms the CNN (results from the paper below).

image

An image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Self-attention v.s. RNN

  • Like self-attention, an RNN also handles the case where the input is a sequence.
  • An RNN keeps a memory vector that is fed into the RNN block together with each input vector (the lecture only sketches this part briefly).

image

  • An RNN cannot produce its outputs in parallel, whereas self-attention computes all of its outputs in parallel.
  • In terms of computation speed, self-attention is therefore more efficient, and many models now replace the RNN architecture with self-attention.

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

Self-attention for Graph

A graph can also be regarded as a set of vectors (one per node), and a set of vectors can be processed with self-attention.

Self-attention has its own mechanism for discovering the relevance between different items, but a graph's structure (its edges) already contains relevance information.

Consider the edges: only attend to connected nodes (see the sketch below).
This is one type of Graph Neural Network (GNN).

image
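A sketch of attention restricted by the edges: the attention matrix is masked with the adjacency matrix so that each node only attends to its neighbours (and itself). This only illustrates the idea, not a full GNN; the toy graph and all shapes are assumptions.

```python
import numpy as np

def graph_self_attention(I, adj, W_q, W_k, W_v):
    """I: (d, n) node features as columns; adj: (n, n) adjacency matrix."""
    Q, K, V = W_q @ I, W_k @ I, W_v @ I
    A = K.T @ Q                                    # pairwise attention scores
    allowed = (adj + np.eye(adj.shape[0])) > 0     # neighbours plus self-loops
    A[~allowed] = -np.inf                          # no edge -> no attention
    A = np.exp(A - A.max(axis=0, keepdims=True))
    A_prime = A / A.sum(axis=0, keepdims=True)     # column-wise softmax
    return V @ A_prime

# Toy graph with 4 nodes and edges (0,1), (1,2), (2,3); all values are assumptions.
adj = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    adj[i, j] = adj[j, i] = 1

d, d_k = 8, 4
I = np.random.randn(d, 4)
W_q, W_k, W_v = (np.random.randn(d_k, d) for _ in range(3))
O = graph_self_attention(I, adj, W_q, W_k, W_v)
print(O.shape)    # (d_k, 4): one output vector per node
```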

To learn more

Long Range Arena: A Benchmark for Efficient Transformers
The drawback of the attention mechanism is its heavy computation; this paper evaluates a wide range of attention variants and compares their performance and training efficiency.
image

image
Efficient Transformers: A Survey