TensorFlow Linear Model Tutorial

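In this tutorial, we will use the tf.estimator API in TensorFlow to solve a binary classification problem: given census data about a person such as age, gender, education, and occupation (the features), we will try to predict whether the person earns more than 50,000 dollars a year (the target label). We will train a logistic regression model, and given an individual's information our model will output a number between 0 and 1, which can be interpreted as the probability that the individual has an annual income of over 50,000 dollars.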

Setup

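To try the code for this tutorial:

Install TensorFlow if you haven't already.
Download the tutorial code.
Install the pandas data analysis library. tf.estimator doesn't require pandas, but it does support it, and this tutorial uses pandas. To install pandas:
1. Get pip:

  Ubuntu / Linux 64-bit

  $ sudo apt-get install python-pip python-dev

  Mac OS X

  $ sudo easy_install pip
  $ sudo easy_install --upgrade six
2. Use pip to install pandas:

  $ sudo pip install pandas
If you have trouble installing pandas, consult the instructions on the pandas site.
Execute the tutorial code with the following command to train the linear model described in this tutorial: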
```
$ python wide_n_deep_tutorial.py --model_type=wide
```

Read on to find out how this code builds its linear model.

Reading The Census Data

The dataset we will use is the Census Income Dataset. You can download the training data and test data manually, or use code like this:

import tempfile
from six.moves import urllib  # urllib.request on both Python 2 and Python 3

# Temporary files that hold the downloaded CSV data.
train_file = tempfile.NamedTemporaryFile()
test_file = tempfile.NamedTemporaryFile()
urllib.request.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", train_file.name)
urllib.request.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", test_file.name)

Once the CSV files are downloaded, we read them into pandas dataframes.

import pandas as pd
CSV_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "gender",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income_bracket"]
df_train = pd.read_csv(train_file.name, names=CSV_COLUMNS, skipinitialspace=True)
df_test = pd.read_csv(test_file.name, names=CSV_COLUMNS, skipinitialspace=True, skiprows=1)

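Since the task is a binary classification problem, we construct a label for each row whose value is 1 if the income is over 50K and 0 otherwise. The labels are kept in train_labels and test_labels.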

train_labels = (df_train["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)
test_labels = (df_test["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)
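
The substring test matters because the two files spell the label differently: rows in adult.test end with a period (">50K."), so an exact comparison with ">50K" would label every test row 0. A minimal sketch using hand-written sample strings rather than rows read from the dataset (the helper name to_label is hypothetical):

```python
def to_label(income_bracket):
    """Return 1 if the raw income string denotes more than 50K, else 0."""
    return int(">50K" in income_bracket)

# Hand-written sample values covering both spellings of each class.
samples = [">50K", "<=50K", ">50K.", "<=50K."]
labels = [to_label(s) for s in samples]
print(labels)  # [1, 0, 1, 0]
```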

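Next, let's take a look at the dataframe and see which columns we can use to predict the target label. The columns can be grouped into two types, categorical and continuous: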

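A column is called categorical if its value can only be one of the categories in a finite set. For example, the native country of a person (U.S., India, Japan, etc.) or the education level (high school, college, etc.) are categorical columns.
A column is called continuous if its value can be any numerical value in a continuous range. For example, the capital gain of a person (e.g. 14,084 dollars) is a continuous column.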

Here is a list of the columns available in the Census Income dataset:

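| Column Name | Type | Description |
| -------------- | ----------- | --------------------------------- |
| age | Continuous | The age of the individual. |
| workclass | Categorical | The type of employer the individual has (government, military, private, etc.). |
| fnlwgt | Continuous | The number of people the census takers believe that observation represents (sample weight). This variable will not be used. |
| education | Categorical | The highest level of education achieved for that individual. |
| education_num | Continuous | The highest level of education in numerical form. |
| marital_status | Categorical | Marital status of the individual. |
| occupation | Categorical | The occupation of the individual. |
| relationship | Categorical | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. |
| race | Categorical | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. |
| gender | Categorical | Female, Male. |
| capital_gain | Continuous | Capital gains recorded. |
| capital_loss | Continuous | Capital losses recorded. |
| hours_per_week | Continuous | Hours worked per week. |
| native_country | Categorical | Country of origin of the individual. |
| income_bracket | Categorical | ">50K" or "<=50K", meaning whether the person makes more than 50,000 dollars annually. |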

Converting Data into Tensors

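When building a tf.estimator model, the input data is specified by means of an input builder function. This builder function is not called until it is later passed to tf.estimator.Estimator methods such as train and evaluate. The purpose of this function is to construct the input data, which is represented in the form of tf.Tensors or tf.SparseTensors. In more detail, the input builder function returns the following as a pair: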

feature_cols: A dict from feature column names to Tensors or SparseTensors.
label: A Tensor containing the label column.

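The keys of feature_cols will be used to construct columns in the next section. Because we want to call the train and evaluate methods with different data, we define a method that returns an input function based on the given data. Note that the returned input function is called while the TensorFlow graph is being constructed, not while the graph is running. What it returns is a representation of the input data as the fundamental unit of TensorFlow computations, a Tensor (or SparseTensor).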

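We use the tf.estimator.inputs.pandas_input_fn method to create an input function from a pandas dataframe. Each continuous column in the train or test dataframe will be converted into a Tensor, which in general is a good format to represent dense data. For categorical data, we must represent the data as a SparseTensor, a format that is well suited to sparse data. Another, more advanced way to represent input data is to construct Inputs and Readers that represent a file or other data source and iterate through the file as TensorFlow runs the graph.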

import tensorflow as tf

def input_fn(data_file, num_epochs, shuffle):
  """Input builder function."""
  df_data = pd.read_csv(
      tf.gfile.Open(data_file),
      names=CSV_COLUMNS,
      skipinitialspace=True,
      engine="python",
      skiprows=1)
  # remove NaN elements
  df_data = df_data.dropna(how="any", axis=0)
  labels = df_data["income_bracket"].apply(lambda x: ">50K" in x).astype(int)
  return tf.estimator.inputs.pandas_input_fn(
      x=df_data,
      y=labels,
      batch_size=100,
      num_epochs=num_epochs,
      shuffle=shuffle,
      num_threads=5)
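
The pattern used here, a function that captures the data and returns another function to be called later, can be illustrated without TensorFlow. The sketch below is plain Python with hypothetical names (make_input_fn, batch_size) and toy data; it only mimics the (features, labels) pair that an input function produces and is not how pandas_input_fn is implemented.

```python
def make_input_fn(rows, labels, batch_size):
    """Return a zero-argument function that yields (features, labels) batches."""
    def input_fn():
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            # Features are keyed by column name, with one list of values per column.
            features = {name: [row[name] for row in batch] for name in batch[0]}
            yield features, labels[start:start + batch_size]
    return input_fn

# Toy data for illustration only.
rows = [
    {"age": 39, "education": "Bachelors"},
    {"age": 50, "education": "HS-grad"},
    {"age": 38, "education": "Masters"},
]
labels = [0, 1, 0]

train_input_fn = make_input_fn(rows, labels, batch_size=2)  # no data is walked yet
batches = list(train_input_fn())  # batches are produced only when the function is called
print(batches[0])
```

Returning a function rather than the data itself lets the same builder serve both training and evaluation with different arguments.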

Selecting and Engineering Features for the Model

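Selecting and crafting the right set of feature columns is key to learning an effective model. A feature column can be either one of the raw columns in the original dataframe (let's call them base feature columns), or any new column created by a transformation defined over one or more base columns (let's call them derived feature columns). Basically, "feature column" is an abstract concept for any raw or derived variable that can be used to predict the target label.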

Base Categorical Feature Columns

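To define a feature column for a categorical feature, we can create a CategoricalColumn using the tf.feature_column API. If you know the set of all possible feature values of a column and there are only a few of them, you can use categorical_column_with_vocabulary_list. Each key in the list is assigned an auto-incremental ID starting from 0. For example, for the gender column we can assign the feature string "Female" to an integer ID of 0 and "Male" to 1 by doing: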

gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])

What if we don't know the set of possible values in advance? Not a problem. We can use categorical_column_with_hash_bucket instead:

occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)

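What happens is that each possible value in the feature column occupation is hashed to an integer ID as we encounter it in training. See the example illustration below: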

| ID | Feature |
| --- | ------------- |
| ... | |
| 9 | "Machine-op-inspct" |
| ... | |
| 103 | "Farming-fishing" |
| ... | |
| 375 | "Protective-serv" |
| ... | |

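No matter which way we choose to define a categorical column, each feature string is mapped to an integer ID by looking up a fixed mapping or by hashing. Note that hash collisions are possible, but may not significantly impact the model quality. Under the hood, the LinearModel class is responsible for managing the mapping and creating tf.Variable to store the model parameters (also known as model weights) for each feature ID. The model parameters will be learned through the model training process discussed later.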
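
Both ways of turning a string into an integer ID can be sketched in plain Python. The helper names vocabulary_id and hash_bucket_id are hypothetical, returning -1 for a string outside the vocabulary is an assumption of this sketch, and MD5 is only a stand-in: TensorFlow uses a different hash function internally, so the IDs it assigns will not match the ones computed here.

```python
import hashlib

def vocabulary_id(value, vocabulary, default_value=-1):
    """Fixed mapping: the ID is the position of the string in the vocabulary."""
    lookup = {word: index for index, word in enumerate(vocabulary)}
    return lookup.get(value, default_value)

def hash_bucket_id(value, hash_bucket_size):
    """Hashing: the ID is a stable hash of the string, reduced to [0, hash_bucket_size)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

gender_vocabulary = ["Female", "Male"]
print(vocabulary_id("Female", gender_vocabulary))  # 0
print(vocabulary_id("Male", gender_vocabulary))    # 1

occupations = ["Machine-op-inspct", "Farming-fishing", "Protective-serv"]
ids = {name: hash_bucket_id(name, 1000) for name in occupations}
print(ids)

# With fewer buckets than distinct values, at least two values must share an ID.
small_ids = [hash_bucket_id(name, 2) for name in occupations]
print(len(set(small_ids)) < len(occupations))  # True
```

The last two lines show why collisions are unavoidable once the number of distinct values exceeds hash_bucket_size; choosing a larger bucket count makes them less frequent.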

We use similar techniques to define the other categorical features:

education = tf.feature_column.categorical_column_with_vocabulary_list(
    "education", [
        "Bachelors", "HS-grad", "11th", "Masters", "9th",
        "Some-college", "Assoc-acdm", "Assoc-voc", "7th-8th",
        "Doctorate", "Prof-school", "5th-6th", "10th", "1st-4th",
        "Preschool", "12th"
    ])
marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    "marital_status", [
        "Married-civ-spouse", "Divorced", "Married-spouse-absent",
        "Never-married", "Separated", "Married-AF-spouse", "Widowed"
    ])
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    "relationship", [
        "Husband", "Not-in-family", "Wife", "Own-child", "Unmarried",
        "Other-relative"
    ])
workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    "workclass", [
        "Self-emp-not-inc", "Private", "State-gov", "Federal-gov",
        "Local-gov", "?", "Self-emp-inc", "Without-pay", "Never-worked"
    ])
native_country = tf.feature_column.categorical_column_with_hash_bucket(
    "native_country", hash_bucket_size=1000)

Base Continuous Feature Columns

Similarly, we can define a NumericColumn for each continuous feature column that we want to use in the model:

age = tf.feature_column.numeric_column("age")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")

Making Continuous Features Categorical through Bucketization

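Sometimes the relationship between a continuous feature and the label is not linear. As a hypothetical example, a person's income may grow with age in the early stage of a career, then the growth may slow at some point, and finally the income decreases after retirement. In this scenario, using the raw age as a real-valued feature column might not be a good choice because the model can only learn one of three cases: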

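Income always increases at some rate as age grows (positive correlation),
Income always decreases at some rate as age grows (negative correlation), or
Income stays the same no matter the age (no correlation)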

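If we want to learn the fine-grained correlation between income and each age group separately, we can leverage bucketization. Bucketization is the process of dividing the entire range of a continuous feature into a set of consecutive bins/buckets, and then converting the original numerical feature into a bucket ID (as a categorical feature) depending on which bucket the value falls into. So, we can define a bucketized_column over age as: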

age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

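where boundaries is a list of bucket boundaries. In this case, there are 10 boundaries, resulting in 11 age group buckets (from age 17 and below, 18-24, 25-29, ..., to 65 and over).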
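
The mapping from a raw age to a bucket ID can be reproduced with a binary search over the boundaries. This is a plain-Python sketch of the idea (age_bucket_id is a hypothetical helper), assuming each boundary is the inclusive lower edge of its bucket, which matches the 18-24, 25-29 grouping described here.

```python
import bisect

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def age_bucket_id(age):
    """Return the index of the bucket that age falls into (0 through 10)."""
    # bisect_right counts the boundaries that are <= age, so an age equal to
    # a boundary lands in the bucket that starts at that boundary.
    return bisect.bisect_right(boundaries, age)

bucket_ids = [age_bucket_id(age) for age in [17, 18, 24, 25, 64, 65, 90]]
print(bucket_ids)  # [0, 1, 1, 2, 9, 10, 10]
```

Ten boundaries give bucket IDs 0 through 10, which is the 11 buckets counted in the text.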

Intersecting Multiple Columns with CrossedColumn

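Using each base feature column separately may not be enough to explain the data. For example, the correlation between education and the label (earning more than 50,000 dollars) may be different for different occupations. Therefore, if we only learn a single model weight for education="Bachelors" and education="Masters", we won't be able to capture every education-occupation combination (e.g. distinguishing between education="Bachelors" AND occupation="Exec-managerial" and education="Bachelors" AND occupation="Craft-repair"). To learn the differences between different feature combinations, we can add crossed feature columns to the model.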

education_x_occupation = tf.feature_column.crossed_column(
    ["education", "occupation"], hash_bucket_size=1000)

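We can also create a CrossedColumn over more than two columns. Each constituent column can be either a base feature column that is categorical (CategoricalColumn), a bucketized real-valued feature column (BucketizedColumn), or even another CrossedColumn. Here's an example: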

age_buckets_x_education_x_occupation = tf.feature_column.crossed_column(
    [age_buckets, "education", "occupation"], hash_bucket_size=1000)
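
One way to picture a crossed column is as a hash of the combined values, so that each combination is mapped to a bucket ID of its own and can carry its own weight. The sketch below is plain Python under that assumption; cross_bucket_id and the "_X_" separator are hypothetical, and MD5 stands in for TensorFlow's internal hash, so the IDs will differ from the ones TensorFlow computes.

```python
import hashlib

def cross_bucket_id(values, hash_bucket_size):
    """Hash a combination of feature values into [0, hash_bucket_size)."""
    key = "_X_".join(str(v) for v in values)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

# Same education, different occupation: each pair is hashed as a unit.
exec_id = cross_bucket_id(["Bachelors", "Exec-managerial"], 1000)
craft_id = cross_bucket_id(["Bachelors", "Craft-repair"], 1000)

# A three-way cross works the same way; 3 stands for an age bucket ID.
three_way_id = cross_bucket_id([3, "Bachelors", "Exec-managerial"], 1000)
print(exec_id, craft_id, three_way_id)
```

As with any hashed column, two different combinations can still collide in the same bucket.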

Defining The Logistic Regression Model

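After processing the input data and defining all the feature columns, we're now ready to put them all together and build a logistic regression model. In the previous section we've seen several types of base and derived feature columns, including: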

CategoricalColumn
NumericColumn
BucketizedColumn
CrossedColumn

All of these are subclasses of the abstract FeatureColumn class, and can be added to the feature_columns field of a model:

base_columns = [
    gender, native_country, education, occupation, workclass, relationship,
    age_buckets,
]
crossed_columns = [
    tf.feature_column.crossed_column(
        ["education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        [age_buckets, "education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        ["native_country", "occupation"], hash_bucket_size=1000)
]

model_dir = tempfile.mkdtemp()
m = tf.estimator.LinearClassifier(
    model_dir=model_dir, feature_columns=base_columns + crossed_columns)

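The model also automatically learns a bias term, which controls the prediction one would make without observing any features (see the section "How Logistic Regression Works" for more explanation). The learned model files will be stored in model_dir.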

Training and Evaluating Our Model

After adding all the features to the model, let's look at how to actually train it. Training a model is just a one-liner using the tf.estimator API:

# Set num_epochs to None to get an infinite stream of data.
# train_steps is the number of training steps to run; define it before this call.
m.train(
    input_fn=input_fn(train_file.name, num_epochs=None, shuffle=True),
    steps=train_steps)

After the model is trained, we can evaluate how good it is at predicting the labels of the holdout data:

results = m.evaluate(
    input_fn=input_fn(test_file.name, num_epochs=1, shuffle=False),
    steps=None)
print("model directory = %s" % model_dir)
for key in sorted(results):
  print("%s: %s" % (key, results[key]))

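The first line of the output should be something like accuracy: 0.83557522, which means the accuracy is 83.6%. Feel free to try more features and transformations and see if you can do even better!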

If you'd like to see a working end-to-end example, you can download our example code and set the model_type flag to wide.

Adding Regularization to Prevent Overfitting

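Regularization is a technique used to avoid overfitting. Overfitting happens when a model does well on the data it is trained on, but worse on test data that it has not seen before, such as live traffic. Overfitting generally occurs when a model is excessively complex, for example when it has too many parameters relative to the number of observed training examples. Regularization allows you to control the model's complexity and makes the model generalize better to unseen data.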

In the Linear Model library, you can add L1 and L2 regularization to the model as follows:

m = tf.estimator.LinearClassifier(
    model_dir=model_dir, feature_columns=base_columns + crossed_columns,
    optimizer=tf.train.FtrlOptimizer(
      learning_rate=0.1,
      l1_regularization_strength=1.0,
      l2_regularization_strength=1.0))

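One important difference between L1 and L2 regularization is that L1 regularization tends to make model weights stay at zero, creating sparser models, whereas L2 regularization also tries to make the model weights closer to zero but not necessarily zero. Therefore, if you increase the strength of L1 regularization, you will have a smaller model size because many of the model weights will be zero. This is often desirable when the feature space is very large but sparse, and when there are resource constraints that prevent you from serving a model that is too large.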

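In practice, you should try various combinations of L1 and L2 regularization strengths and find the parameters that best control overfitting while giving you a desirable model size.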
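
The difference in sparsity can be seen on a single shrinkage step. The sketch below is not the FTRL update that the optimizer performs; it applies the standard proximal steps for an L1 penalty (soft-thresholding) and for an L2 penalty (uniform scaling) to a hand-picked list of weights, and the helper names are hypothetical.

```python
def l1_shrink(weights, strength):
    """Soft-thresholding: pull each weight toward zero by strength, stopping at zero."""
    shrunk = []
    for w in weights:
        if w > strength:
            shrunk.append(w - strength)
        elif w < -strength:
            shrunk.append(w + strength)
        else:
            shrunk.append(0.0)  # small weights become exactly zero
    return shrunk

def l2_shrink(weights, strength):
    """Scale every weight toward zero by the same factor; nonzero weights stay nonzero."""
    return [w / (1.0 + strength) for w in weights]

weights = [2.0, -0.5, 0.25, -3.0]  # hand-picked values for illustration
l1_result = l1_shrink(weights, 1.0)
l2_result = l2_shrink(weights, 1.0)
print(l1_result)  # [1.0, 0.0, 0.0, -2.0]
print(l2_result)  # [1.0, -0.25, 0.125, -1.5]
```

The L1 step zeroes out the two small weights, which is what makes the resulting model sparse, while the L2 step only makes every weight smaller.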

How Logistic Regression Works

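Finally, let's take a minute to talk about what the logistic regression model actually looks like, in case you're not already familiar with it. We'll denote the label as $Y$, and the set of observed features as a feature vector $\mathbf{x}=[x_1, x_2, ..., x_d]$. We define $Y=1$ if an individual earned more than 50,000 dollars and $Y=0$ otherwise. In logistic regression, the probability of the label being positive ($Y=1$) given the features $\mathbf{x}$ is: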

$P(Y=1|\mathbf{x}) = \frac{1}{1+\exp(-(\mathbf{w}^T\mathbf{x}+b))}$

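where $\mathbf{w}=[w_1, w_2, ..., w_d]$ are the model weights for the features $\mathbf{x}=[x_1, x_2, ..., x_d]$. $b$ is a constant that is often called the bias of the model. The equation consists of two parts, a linear model and a logistic function: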

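Linear Model: First, we can see that $\mathbf{w}^T\mathbf{x}+b = b + w_1x_1 + ... +w_dx_d$ is a linear model where the output is a linear function of the input features $\mathbf{x}$. The bias $b$ is the prediction one would make without observing any features. The model weight $w_i$ reflects how the feature $x_i$ is correlated with the positive label. If $x_i$ is positively correlated with the positive label, the weight $w_i$ increases, and the probability $P(Y=1|\mathbf{x})$ will be closer to 1. On the other hand, if $x_i$ is negatively correlated with the positive label, then the weight $w_i$ decreases and the probability $P(Y=1|\mathbf{x})$ will be closer to 0.
Logistic Function: Second, we can see that there's a logistic function (also known as the sigmoid function) $S(t) = 1/(1+\exp(-t))$ being applied to the linear model. The logistic function is used to convert the output of the linear model $\mathbf{w}^T\mathbf{x}+b$ from any real number into the range of $[0, 1]$, which can be interpreted as a probability.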
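
The two parts of the equation translate directly into a few lines of plain Python. This is a minimal sketch with hand-picked weights and features, not values learned from the census data, and predict_probability is a hypothetical helper name.

```python
import math

def predict_probability(weights, features, bias):
    """Return P(Y=1|x) = 1 / (1 + exp(-(w^T x + b)))."""
    # Linear model: b + w_1*x_1 + ... + w_d*x_d
    linear = bias + sum(w * x for w, x in zip(weights, features))
    # Logistic function: squash the linear output into a probability
    return 1.0 / (1.0 + math.exp(-linear))

features = [1.0, 1.0]
p_neutral = predict_probability([0.0, 0.0], features, bias=0.0)
p_positive = predict_probability([2.0, 1.0], features, bias=0.0)
p_negative = predict_probability([-2.0, -1.0], features, bias=0.0)
print(p_neutral)  # 0.5
print(p_positive > 0.5 > p_negative)  # True
```

With all weights and the bias at zero the model is undecided at 0.5; positive weights on active features push the probability toward 1 and negative weights push it toward 0.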