## Paper Summary

• Drops RNNs entirely and uses only CNNs for the Seq2Seq (machine translation) task; even so, the paper is full of CNN-vs-RNN comparisons.
• An RNN is a chain structure (Chain Structure) and cannot be trained in parallel; a CNN (Hierarchical Structure) can, which greatly reduces the sequential computation cost.

(This post mainly expands on Section 3 of the paper, in walkthrough style: quote a sentence, then explain how I understand it, tidying and trimming after writing and keeping only the sentences I find important. It is best read with some familiarity with the paper, or with the paper open on one side and this post on the other; every quoted passage below is taken verbatim from the paper. I have read many papers before, usually just sketching notes on paper; this is my first written paper note.)

## 1. Introduction

CNN vs RNN

1. Chain vs. hierarchical structure; parallel computation
2. Context dependence
3. Fixed input/output lengths
4. Computational complexity

RNNs are chain-structured; CNNs are hierarchical.

Compared to recurrent layers, convolutions create representations for fixed size contexts, however, the effective context size of the network can easily be made larger by stacking several layers on top of each other. This allows to precisely control the maximum length of dependencies to be modeled.

• A single convolution only sees a fixed-size context window (not a fixed-length input sentence), but simply stacking more layers enlarges the effective context to cover longer spans.
• This structure gives precise control over the maximum dependency length the model captures.
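The growth of the effective context can be sketched in a few lines of plain Python (`receptive_field` is a hypothetical helper, not from the paper): each stacked convolution of kernel width k, with stride 1 and no dilation, extends the window by k − 1 tokens, so relating words n positions apart takes on the order of n/k layers, versus n sequential steps in an RNN.

```python
def receptive_field(num_layers: int, kernel_width: int) -> int:
    """Tokens visible after stacking `num_layers` convolutions of width
    `kernel_width` (stride 1, no dilation): each layer adds k - 1."""
    return num_layers * (kernel_width - 1) + 1

print(receptive_field(1, 5))   # one width-5 layer sees 5 tokens -> 5
print(receptive_field(6, 5))   # six stacked layers already span 25 tokens -> 25
```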

Convolutional networks do not depend on the computations of the previous
time step and therefore allow parallelization over every element in a sequence. This contrasts with RNNs which maintain a hidden state of the entire past that prevents parallel computation within a sequence.

• Every word in the sequence is computed in parallel, with no dependence on the previous word's computation. This contrasts with an RNN, whose hidden state summarizes the entire past and forces sequential processing.
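A toy numpy sketch (hypothetical shapes and weights, not the paper's code) makes the two computation patterns concrete: the convolution produces every output position from the input in one vectorized step, while the RNN must loop because step t needs the hidden state from step t−1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))      # 10 tokens, embedding dim 4
W = rng.normal(size=(3 * 4, 4))   # conv kernel: width 3 over dim-4 inputs
U = rng.normal(size=(4, 4))       # RNN recurrence weights

# CNN: gather each width-3 window, then apply the kernel to all positions at once.
windows = np.stack([x[i:i + 3].reshape(-1) for i in range(8)])  # (8, 12)
conv_out = np.tanh(windows @ W)   # all 8 positions computed in parallel

# RNN: inherently sequential -- h[t] cannot start before h[t-1] is done.
h = np.zeros(4)
states = []
for t in range(10):
    h = np.tanh(x[t] @ U + h)
    states.append(h)

print(conv_out.shape, len(states))  # -> (8, 4) 10
```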

Multi-layer convolutional neural networks create hierarchical representations over the input sequence in which nearby input elements interact at lower layers while distant elements interact at higher layers.

Hierarchical structure provides a shorter path to capture long-range dependencies compared to the chain structure modeled by recurrent networks, e.g. we can obtain a feature representation capturing relationships within a window of n words by applying only O(n/k) convolutional operations for kernels of width k, compared to a linear number O(n) for recurrent neural networks.

Inputs to a convolutional network are fed through a constant number of kernels and non-linearities, whereas recurrent networks apply up to n operations and non-linearities to the first word and only a single set of operations to the last word.

CNN: every input element passes through a constant number of kernels and non-linearities.

RNN: the amount of computation varies by position: up to n operations and non-linearities are applied to the first word, but only a single set to the last.

In this paper we propose an architecture for sequence to sequence modeling that is entirely convolutional. Our model is equipped with gated linear units (Dauphin et al., 2016) and residual connections (He et al., 2015a).We also use attention in every decoder layer and demonstrate that each attention layer only adds a negligible amount of overhead.

• Entirely convolutional
• Plus gated linear units (GLU), residual connections, and attention in every decoder layer (each attention layer adding only negligible overhead)
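The gated linear unit is simple enough to sketch in numpy (a minimal sketch of the GLU formula from Dauphin et al., not the paper's implementation): the convolution emits twice the needed channels, split into halves A and B, and GLU([A; B]) = A ⊗ σ(B), so B gates how much of A passes through.

```python
import numpy as np

def glu(y: np.ndarray) -> np.ndarray:
    """y has an even number of channels in its last axis."""
    a, b = np.split(y, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))  # a ⊗ σ(b)

y = np.array([[1.0, 2.0, 0.0, 100.0]])  # channels: A = [1, 2], B = [0, 100]
print(glu(y))  # σ(0) = 0.5, σ(100) ≈ 1, so ≈ [[0.5, 2.0]]
```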

## 2. Recurrent Sequence to Sequence Learning

• Input Sequence                   $x = (x_1,…,x_m)$

• Encoder Embedding           $w = (w_1,…,w_m)$

• State Representation           $z = (z_1,…,z_m)$

================================================= [Encoder]

• Conditional Input                $c = (c_1,…,c_i,…)$

================================================= [Decoder]

• Hidden State                        $h = (h_1,…,h_n)$

• Decoder Embedding            $g = (g_1,…,g_n)$

• Output Sequence                  $y = (y_1,…,y_n)$

1. Because it is an encoder-decoder architecture, the computations mirror each other across the two halves.

• w and g are the embeddings of the input sequence and the output sequence, respectively.
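The notation above can be pinned down with a shape-only numpy sketch (hypothetical sizes: source length m=5, target length n=7, embedding dim d=8), which also makes the encoder/decoder symmetry visible:

```python
import numpy as np

m, n, d = 5, 7, 8
x = np.arange(m)          # input sequence: m token ids
w = np.zeros((m, d))      # encoder embeddings of x
z = np.zeros((m, d))      # encoder state representations
c = np.zeros((n, d))      # conditional input c_i fed to decoder step i
h = np.zeros((n, d))      # decoder hidden states
g = np.zeros((n, d))      # decoder embeddings of the outputs
y = np.arange(n)          # output sequence: n token ids
```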
