
NLP From Scratch: Classifying Names with a Character-Level RNN¶

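We will build and train a basic character-level RNN to classify words. This tutorial, along with the following two, shows how to preprocess data for NLP modeling "from scratch", in particular without many of the convenience functions of torchtext, so you can see how preprocessing for NLP modeling works at a low level.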

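A character-level RNN reads a word as a series of characters. At each step it outputs a prediction and a "hidden state", and feeds the previous hidden state into the next step. We take the final prediction as the output, i.e. which class the word belongs to.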
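
To make the recurrence concrete before any PyTorch code appears, here is a minimal plain-Python sketch of reading a word one character at a time. It is not the tutorial's network: the alphabet, the sizes, the random untrained weights, and the helper names (log_softmax, linear, read_word) are all assumptions chosen for illustration.

```python
import math
import random

def log_softmax(scores):
    """Convert raw scores into log-probabilities (numerically stable)."""
    m = max(scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_total for s in scores]

def linear(vec, weights):
    """Multiply a vector by a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def read_word(word, alphabet, w_hidden, w_output, hidden_size):
    """Feed a word through the recurrence one character at a time."""
    hidden = [0.0] * hidden_size             # initial hidden state is all zeros
    output = None
    for ch in word:
        one_hot = [1.0 if ch == a else 0.0 for a in alphabet]
        combined = one_hot + hidden          # current input joined with previous hidden state
        hidden = linear(combined, w_hidden)  # becomes the hidden state for the next step
        output = log_softmax(linear(combined, w_output))
    return output, hidden                    # only the final prediction is used

# Toy setup (hypothetical sizes, untrained random weights).
random.seed(0)
alphabet = "abcdefghijklmnopqrstuvwxyz"
hidden_size, n_classes = 4, 3
n_in = len(alphabet) + hidden_size
w_hidden = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(hidden_size)]
w_output = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_classes)]

final_output, final_hidden = read_word("hinton", alphabet, w_hidden, w_output, hidden_size)
print(final_output)  # three log-probabilities, one per toy class
```

Because the weights are random, the prediction itself is meaningless; the point is only the data flow, in which each step sees the current character together with the hidden state left behind by the previous one.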

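Specifically, we will train on a few thousand surnames from 18 languages of origin, and predict which language a name comes from based on its spelling: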

$ python predict.py Hinton
(-0.47) Scottish
(-1.52) English
(-3.57) Irish

$ python predict.py Schmidhuber
(-0.19) German
(-2.48) Czech
(-2.68) Dutch

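Recommended reading:

I assume you have at least installed PyTorch, know Python, and understand Tensors:

https://pytorch.org/ for installation instructions
Deep Learning with PyTorch: A 60 Minute Blitz, to get started with PyTorch in general
Learning PyTorch with Examples, for a broad and deep overview
PyTorch for Former Torch Users, if you are a former Lua Torch user

It is also useful to know about RNNs and how they work:

The Unreasonable Effectiveness of Recurrent Neural Networks, which shows a number of real-life examples
Understanding LSTM Networks, which is specifically about LSTMs but also informative about RNNs in general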

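Preparing the Data¶

Note

Download the data and extract it to the current directory.

The data/names directory contains 18 text files named "[Language].txt". Each file contains a list of names, one per line, mostly romanized (but we still need to convert from Unicode to ASCII).

We will end up with a dictionary of lists of names per language, {language: [names ...]}. The generic variable names "category" and "line" (language and name in our case) are used for later extensibility.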

from __future__ import unicode_literals, print_function, division
from io import open
import glob
import os

def findFiles(path): return glob.glob(path)

print(findFiles('data/names/*.txt'))

import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('Ślusàrski'))

# Build the category_lines dictionary, a list of names per language
category_lines = {}
all_categories = []

# Read a file and split into lines
def readLines(filename):
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)

Output:

['data/names/Czech.txt', 'data/names/Vietnamese.txt', 'data/names/Arabic.txt', 'data/names/Irish.txt', 'data/names/Chinese.txt', 'data/names/German.txt', 'data/names/Korean.txt', 'data/names/Polish.txt', 'data/names/Scottish.txt', 'data/names/Greek.txt', 'data/names/English.txt', 'data/names/Spanish.txt', 'data/names/Portuguese.txt', 'data/names/French.txt', 'data/names/Japanese.txt', 'data/names/Dutch.txt', 'data/names/Russian.txt', 'data/names/Italian.txt']
Slusarski
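
The ASCII conversion relies on Unicode NFD normalization, which splits an accented character into a base letter plus a combining mark (category 'Mn') that can then be filtered out. The following standard-library sketch shows that mechanism and one limitation; the names to_ascii and ALLOWED are illustrative stand-ins for unicodeToAscii and all_letters.

```python
import string
import unicodedata

ALLOWED = string.ascii_letters + " .,;'"

def to_ascii(s):
    """Decompose accented characters (NFD), drop combining marks, keep allowed characters."""
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(
        c for c in decomposed
        if unicodedata.category(c) != 'Mn' and c in ALLOWED
    )

# 'Ś' decomposes into 'S' plus a combining acute accent (category 'Mn'),
# so only the base letter survives.
print([unicodedata.category(c) for c in unicodedata.normalize('NFD', 'Ś')])

print(to_ascii('Ślusàrski'))
print(to_ascii("O'Néàl"))

# Characters without a canonical decomposition, such as 'Ł' or 'ß',
# are not in ALLOWED and are dropped entirely rather than transliterated.
print(to_ascii('Łukasz'))
print(to_ascii('Groß'))
```

The last two calls show that this approach removes some letters instead of mapping them to an ASCII equivalent, which is worth keeping in mind when applying it to other datasets.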

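Now we have category_lines, a dictionary mapping each category (language) to a list of lines (names). We also keep track of all_categories (just a list of languages) and n_categories for later reference.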

print(category_lines['Italian'][:5])

Output:

['Abandonato', 'Abatangelo', 'Abatantuono', 'Abate', 'Abategiovanni']

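Turning Names into Tensors¶

Now that we have all the names organized, we need to turn them into Tensors to make any use of them.

To represent a single letter, we use a "one-hot vector" of size <1 x n_letters>. A one-hot vector is filled with 0s except for a 1 at the index of the current letter, e.g. "b" = <0 1 0 0 0 ...>.

To make a word, we join a bunch of those into a 3D tensor of shape <line_length x 1 x n_letters>.

The extra dimension of size 1 is there because PyTorch assumes everything is in batches; we are just using a batch size of 1 here.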

import torch

# Find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    tensor[0][letterToIndex(letter)] = 1
    return tensor

# Turn a line into a <line_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

print(letterToTensor('J'))

print(lineToTensor('Jones').size())

Output:

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0.]])
torch.Size([5, 1, 57])
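
The printed shape and the position of the single 1 can be checked without PyTorch. This sketch uses nested Python lists purely as a stand-in for the tensor layout; one_hot and line_to_nested_list are hypothetical helper names.

```python
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)  # 52 ASCII letters + 5 punctuation characters = 57

def one_hot(letter):
    """Return a list of n_letters zeros with a single 1 at the letter's index."""
    vec = [0] * n_letters
    vec[all_letters.find(letter)] = 1
    return vec

def line_to_nested_list(line):
    """Mirror the <line_length x 1 x n_letters> layout with nested lists."""
    return [[one_hot(letter)] for letter in line]

print(all_letters.find('J'))  # the 26 lowercase letters come first, so 'J' is at 26 + 9
print(one_hot('b')[:5])

nested = line_to_nested_list('Jones')
print((len(nested), len(nested[0]), len(nested[0][0])))
```

One detail to be aware of: str.find returns -1 for a character that is not in all_letters, and index -1 silently marks the last position instead of raising an error. Passing names through the ASCII conversion first avoids that case.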

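Creating the Network¶

Before autograd, creating a recurrent neural network in Torch involved cloning the parameters of a layer over several timesteps. The layers held hidden state and gradients, which are now handled entirely by the graph itself. This means you can implement an RNN in a very "pure" way, as regular feed-forward layers.

This RNN module (mostly copied from the PyTorch for Torch users tutorial) is just 2 linear layers that operate on an input and a hidden state, with a LogSoftmax layer after the output.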

import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size

        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)

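To run one step of this network we need to pass an input (in our case, the Tensor for the current letter) and a previous hidden state (which we initialize to zeros at first). We get back the output (the log-probability of each language) and the next hidden state (which we keep for the next step).

input = letterToTensor('A')
hidden = torch.zeros(1, n_hidden)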

output, next_hidden = rnn(input, hidden)

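For the sake of efficiency we do not want to create a new Tensor for every step, so we will use lineToTensor instead of letterToTensor and take slices of the result. This could be optimized further by pre-computing batches of Tensors.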

input = lineToTensor('Albert')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input[0], hidden)
print(output)

Output:

tensor([[-2.9094, -2.9352, -2.9637, -2.9170, -2.9943, -2.7928, -2.9661, -2.9507,
         -2.7908, -2.8993, -2.8857, -2.8203, -2.8707, -2.9812, -2.8445, -2.8495,
         -2.8213, -2.8694]], grad_fn=<LogSoftmaxBackward>)

As you can see, the output is a <1 x n_categories> Tensor in which each item is the log-likelihood of that category (higher means more likely).
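
Because the last layer is LogSoftmax, exponentiating these values should give probabilities that sum to roughly 1. The sketch below checks this in plain Python using the numbers copied from the printed output above; only the variable names are new.

```python
import math

# Log-probabilities copied from the printed output above.
log_probs = [
    -2.9094, -2.9352, -2.9637, -2.9170, -2.9943, -2.7928, -2.9661, -2.9507,
    -2.7908, -2.8993, -2.8857, -2.8203, -2.8707, -2.9812, -2.8445, -2.8495,
    -2.8213, -2.8694,
]

probs = [math.exp(v) for v in log_probs]
print(round(sum(probs), 3))  # close to 1, up to rounding of the printed values

# Index of the most likely category.
best = max(range(len(log_probs)), key=lambda i: log_probs[i])
print(best, round(probs[best], 4))
```

All 18 probabilities are close to 1/18, which is what one would expect from an untrained network that has no preference yet.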

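Training¶

Preparing for Training¶

Before going into training we should write a few helper functions. The first interprets the output of the network, which we know to be a log-likelihood for each category. We can use Tensor.topk to get the index of the greatest value: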

def categoryFromOutput(output):
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return all_categories[category_i], category_i

print(categoryFromOutput(output))

Output:

('Scottish', 8)

We will also want a quick way to get a training example (a name and its language):

import random

def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randomTrainingExample():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    line_tensor = lineToTensor(line)
    return category, line, category_tensor, line_tensor

for i in range(10):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    print('category =', category, '/ line =', line)

Output:

category = Russian / line = Hazov
category = Scottish / line = Fraser
category = German / line = Stieber
category = Greek / line = Close
category = Italian / line = Ruggeri
category = Polish / line = Gorka
category = Arabic / line = Antar
category = Polish / line = Sokal
category = Vietnamese / line = Doan
category = Czech / line = Fritsch
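
Note that the sampling is done in two stages: first a category, then a name within it. A consequence is that every language is drawn about equally often, regardless of how many names its file contains. The sketch below illustrates this with hypothetical toy data (the 'Big' and 'Small' categories are made up) and random.choice, which selects uniformly just like randomChoice.

```python
import random
from collections import Counter

# Hypothetical toy data: one large category and one small one.
toy_lines = {
    'Big': ['name%d' % i for i in range(1000)],
    'Small': ['a', 'b', 'c'],
}
toy_categories = list(toy_lines)

def random_example():
    """Pick a category uniformly, then a line uniformly within it."""
    category = random.choice(toy_categories)
    line = random.choice(toy_lines[category])
    return category, line

random.seed(0)
counts = Counter(random_example()[0] for _ in range(10000))
print(counts)
```

Both categories appear in roughly half of the draws, so each of the three 'Small' names is seen far more often than any single 'Big' name. Sampling names directly from the pooled data would instead favor the larger category.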

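Training the Network¶

Now all it takes to train this network is to show it a bunch of examples, have it make guesses, and tell it when it is wrong.

For the loss function, nn.NLLLoss is appropriate, since the last layer of the RNN is nn.LogSoftmax.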
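
To see why the two fit together: a negative log-likelihood loss expects log-probabilities as input and, for a single example, simply returns minus the log-probability of the correct class. Applied after a log-softmax, this equals the cross-entropy computed from the raw scores. The plain-Python sketch below checks that identity on made-up scores; it illustrates the math and is not PyTorch's implementation.

```python
import math

def log_softmax(scores):
    """Convert raw scores into log-probabilities (numerically stable)."""
    m = max(scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_total for s in scores]

def nll_loss(log_probs, target):
    """Negative log-likelihood for one example: minus the log-probability of the true class."""
    return -log_probs[target]

def cross_entropy(scores, target):
    """Cross-entropy computed directly from raw scores."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores)) - scores[target]

raw_scores = [2.0, 0.5, -1.0]  # hypothetical raw scores for three classes
target = 0

loss = nll_loss(log_softmax(raw_scores), target)
print(loss)
print(cross_entropy(raw_scores, target))  # same value
```

The loss is small when the correct class already has the highest score and grows as the network assigns it less probability.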

criterion = nn.NLLLoss()

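Each loop of training will:

Create input and target tensors
Create a zeroed initial hidden state
Read each letter in and
- Keep the hidden state for the next letter
Compare the final output to the target
Back-propagate
Return the output and loss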

learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn

def train(category_tensor, line_tensor):
    hidden = rnn.initHidden()

    rnn.zero_grad()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    loss = criterion(output, category_tensor)
    loss.backward()

    # Gradient descent step: subtract each parameter's gradient, scaled by the learning rate
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item()
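
The parameter update at the end of train is plain gradient descent, and the comment on learning_rate can be illustrated on a much simpler problem. This sketch minimizes the one-dimensional function f(p) = p**2, which is an assumption chosen for illustration and has nothing to do with the RNN itself; run_sgd is a hypothetical helper.

```python
def run_sgd(learning_rate, steps=100, start=1.0):
    """Minimize f(p) = p**2 with the update p <- p - learning_rate * grad."""
    p = start
    for _ in range(steps):
        grad = 2 * p                # derivative of p**2
        p += -learning_rate * grad  # same form as the update in train()
    return p

print(run_sgd(0.005))  # moves slowly toward the minimum at 0
print(run_sgd(0.4))    # converges quickly
print(run_sgd(1.1))    # overshoots more on every step and diverges
```

On this toy function each step multiplies p by (1 - 2 * learning_rate), so a rate above 1 makes the magnitude grow. Real networks have no such simple threshold, but the qualitative behavior (too low is slow, too high explodes) is the same.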

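Now we just have to run that with a bunch of examples. Since the train function returns both the output and the loss, we can print its guesses and also keep track of the loss for plotting. Since there are thousands of examples, we print only every print_every examples, and take an average of the loss.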

import time
import math

n_iters = 100000
print_every = 5000
plot_every = 1000



# Keep track of losses for plotting
current_loss = 0
all_losses = []

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    # Print iter number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add current loss avg to list of losses
    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0

Output:

5000 5% (0m 13s) 1.9342 Minnubaev / French ✗ (Russian)
10000 10% (0m 25s) 2.6328 Elena / Spanish ✗ (Italian)
15000 15% (0m 39s) 1.5368 Porto / Italian ✓
20000 20% (0m 52s) 1.4485 Zou / Korean ✗ (Chinese)
25000 25% (1m 4s) 3.3202 Martz / Spanish ✗ (German)
30000 30% (1m 17s) 2.4654 Malone / French ✗ (Irish)
35000 35% (1m 29s) 1.0719 Gzovsky / Polish ✗ (Russian)
40000 40% (1m 42s) 2.4962 Jackson / Scottish ✗ (English)
45000 45% (1m 56s) 2.3762 Schult / Scottish ✗ (German)
50000 50% (2m 9s) 0.1110 Thach / Vietnamese ✓
55000 55% (2m 22s) 0.3596 Yoon / Korean ✓
60000 60% (2m 35s) 2.6410 Amsel / Arabic ✗ (German)
65000 65% (2m 48s) 0.6986 Prosdocimi / Italian ✓
70000 70% (3m 1s) 2.8788 Santiago / Japanese ✗ (Portuguese)
75000 75% (3m 14s) 1.4967 Longo / Italian ✓
80000 80% (3m 27s) 0.9245 Ijichi / Japanese ✓
85000 85% (3m 40s) 0.1977 Anetakis / Greek ✓
90000 90% (3m 53s) 1.4531 Gonzales / Greek ✗ (Spanish)
95000 95% (4m 6s) 3.5294 Kasamatsu / Greek ✗ (Japanese)
100000 100% (4m 19s) 4.0148 Shigemitsu / Greek ✗ (Japanese)

Plotting the Results¶

Plotting the historical loss from all_losses shows the network learning:

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure()
plt.plot(all_losses)

[Figure: plot of all_losses, the average training loss per plot_every iterations]

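Evaluating the Results¶

To see how well the network performs on different categories, we will create a confusion matrix, indicating for every actual language (rows) which language the network guesses (columns). To calculate the confusion matrix, a bunch of samples are run through the network with evaluate(), which is the same as train() minus the backprop.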

# Keep track of correct guesses in a confusion matrix
confusion = torch.zeros(n_categories, n_categories)
n_confusion = 10000

# Just return an output given a line
def evaluate(line_tensor):
    hidden = rnn.initHidden()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    return output

# Go through a bunch of examples and record which are correctly guessed
for i in range(n_confusion):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output = evaluate(line_tensor)
    guess, guess_i = categoryFromOutput(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1

# Normalize by dividing every row by its sum
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()

# Set up plot
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)

# Set up axes
ax.set_xticklabels([''] + all_categories, rotation=90)
ax.set_yticklabels([''] + all_categories)

# Force label at every tick
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

# sphinx_gallery_thumbnail_number = 2
plt.show()

[Figure: confusion matrix of actual language (rows) against guessed language (columns)]

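You can pick out bright spots off the main diagonal that show which languages it guesses incorrectly, e.g. Chinese for Korean, and Spanish for Italian. It seems to do very well with Greek, and very poorly with English (perhaps because of overlap with other languages).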
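
For readers who want to see the row normalization in isolation, here is a small plain-Python sketch of the same idea. The labels are hypothetical, not results from the tutorial, and the helper names are made up; unlike the loop above, it also guards against a category that was never sampled.

```python
def confusion_matrix(actual, guessed, categories):
    """Count how often each actual category (row) was guessed as each category (column)."""
    index = {c: i for i, c in enumerate(categories)}
    n = len(categories)
    counts = [[0] * n for _ in range(n)]
    for a, g in zip(actual, guessed):
        counts[index[a]][index[g]] += 1
    return counts

def normalize_rows(counts):
    """Divide each row by its sum; rows with no samples are left as zeros."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

# Hypothetical labels, not results from the tutorial.
categories = ['Chinese', 'Korean', 'Greek']
actual = ['Chinese', 'Chinese', 'Chinese', 'Korean', 'Korean', 'Greek', 'Greek']
guessed = ['Chinese', 'Korean', 'Chinese', 'Korean', 'Chinese', 'Greek', 'Greek']

matrix = normalize_rows(confusion_matrix(actual, guessed, categories))
for name, row in zip(categories, matrix):
    print(name, [round(v, 2) for v in row])
```

After normalization each row sums to 1, so the diagonal entry is the fraction of that language's names classified correctly and the off-diagonal entries show where the mistakes go.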

Running on User Input¶

def predict(input_line, n_predictions=3):
    print('\n> %s' % input_line)
    with torch.no_grad():
        output = evaluate(lineToTensor(input_line))

        # Get top N categories
        topv, topi = output.topk(n_predictions, 1, True)
        predictions = []

        for i in range(n_predictions):
            value = topv[0][i].item()
            category_index = topi[0][i].item()
            print('(%.2f) %s' % (value, all_categories[category_index]))
            predictions.append([value, all_categories[category_index]])

predict('Dovesky')
predict('Jackson')
predict('Satoshi')

Output:

> Dovesky
(-0.39) Russian
(-1.78) Czech
(-2.29) Polish

> Jackson
(-0.43) Scottish
(-1.73) English
(-2.92) Russian

> Satoshi
(-1.30) Japanese
(-1.59) Portuguese
(-1.83) Italian
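
The top-N selection done by output.topk in predict can be mimicked in plain Python by sorting (value, category) pairs. The log-probabilities below are hypothetical numbers picked for illustration, and top_k is a made-up helper, not part of the tutorial.

```python
import math

def top_k(log_probs, categories, k=3):
    """Return the k (log-probability, category) pairs with the highest values."""
    ranked = sorted(zip(log_probs, categories), reverse=True)
    return ranked[:k]

# Hypothetical log-probabilities for four categories.
categories = ['Czech', 'English', 'Polish', 'Russian']
log_probs = [-1.8, -3.5, -2.3, -0.4]

for value, category in top_k(log_probs, categories):
    # exp() turns the log-probability back into a probability
    print('(%.2f) %s  p=%.2f' % (value, category, math.exp(value)))
```

Printing the exponentiated value next to the log-probability can make the output easier to read, since a value such as -0.4 corresponds to a probability of about two thirds.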

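The final versions of the scripts in the Practical PyTorch repo split the above code into a few files:

data.py (loads files)
model.py (defines the RNN)
train.py (runs training)
predict.py (runs predict() with command line arguments)
server.py (serves predictions as a JSON API with bottle.py)

Run train.py to train and save the network.

Run predict.py with a name to view predictions: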

$ python predict.py Hazaki
(-0.42) Japanese
(-1.39) Polish
(-3.51) Czech

Run server.py and visit http://localhost:5533/Yourname to get JSON output of predictions.

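Exercises¶

Try with a different dataset of line -> category, for example:
- Any word -> language
- First name -> gender
- Character name -> writer
- Page title -> blog or subreddit
Get better results with a bigger and/or better shaped network
- Add more linear layers
- Try the nn.LSTM and nn.GRU layers
- Combine multiple of these RNNs as a higher level network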

Total running time of the script: (4 minutes 36.196 seconds)
