gradient – Symbolic Differentiation

Symbolic gradients are usually computed from gradient.grad(), which offers a more convenient syntax for the common case of wanting the gradient of a scalar cost with respect to some input expressions. The grad_sources_inputs() function does the underlying work and is more flexible, but it is also more awkward to use when gradient.grad() can do the job.
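
A minimal sketch of that common case (the vector input x and the sum-of-squares cost below are purely illustrative, not part of the API):

import theano
import theano.tensor as T

x = T.dvector('x')
cost = (x ** 2).sum()              # scalar cost
g = theano.gradient.grad(cost, x)  # symbolic gradient, here 2*x
f = theano.function([x], g)
print(f([1.0, 2.0, 3.0]))          # expected: [ 2.  4.  6.]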

Gradient related functions

Driver for gradient calculations.

exception theano.gradient.DisconnectedInputError[source]

Raised when grad is asked to compute the gradient with respect to a disconnected input and disconnected_inputs='raise'.

class theano.gradient.DisconnectedType[source]

A type indicating that a variable is the result of taking the gradient of c with respect to x when c is not a function of x. It is a symbolic placeholder for 0, but it conveys the extra information that this gradient is 0 because it is disconnected.

exception theano.gradient.GradientError(arg, err_pos, abs_err, rel_err, abs_tol, rel_tol)[source]

This error is raised when a gradient is calculated, but is incorrect.

theano.gradient.Lop(f, wrt, eval_points, consider_constant=None, disconnected_inputs='raise')[source]

Computes the L operation on f with respect to wrt, evaluated at the points given in eval_points. Mathematically this stands for the Jacobian of f with respect to wrt, left-multiplied by the eval points.

Return type: Variable or list/tuple of Variables (depending upon f)
Returns: Symbolic expression such that L_op[j] = sum_i (d f[i] / d wrt[j]) eval_point[i], where the indices in that expression are magic multidimensional indices that specify both the position within a list and all coordinates of the tensor element in the last. If f is a list/tuple, then a list/tuple with the results is returned.
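
An illustrative sketch (the names W, v, x and the dot-product f are our own choice); Lop computes the vector-times-Jacobian product without building the full Jacobian:

import theano
import theano.tensor as T

W = T.dmatrix('W')
v = T.dvector('v')
x = T.dvector('x')
y = T.dot(x, W)                    # f
vJ = theano.gradient.Lop(y, W, v)  # v left-multiplied by d y / d W
f = theano.function([v, x], vJ)
print(f([2, 2], [0, 1]))           # expected: [[ 0.  0.] [ 2.  2.]]
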
exception theano.gradient.NullTypeGradError[source]

Raised when grad encounters a NullType.

theano.gradient.Rop(f, wrt, eval_points)[source]

Computes the R operation on f with respect to wrt, evaluated at the points given in eval_points. Mathematically this stands for the Jacobian of f with respect to wrt, right-multiplied by the eval points.

Return type: Variable or list/tuple of Variables (depending upon f)
Returns: Symbolic expression such that R_op[i] = sum_j (d f[i] / d wrt[j]) eval_point[j], where the indices in that expression are magic multidimensional indices that specify both the position within a list and all coordinates of the tensor element in the last. If wrt is a list/tuple, then a list/tuple with the results is returned.
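
A corresponding sketch for the R operation (again with illustrative names), computing the Jacobian-times-perturbation product directly:

import theano
import theano.tensor as T

W = T.dmatrix('W')
V = T.dmatrix('V')                 # perturbation, same shape as W
x = T.dvector('x')
y = T.dot(x, W)                    # f
JV = theano.gradient.Rop(y, W, V)  # d y / d W right-multiplied by V
f = theano.function([W, V, x], JV)
print(f([[1, 1], [1, 1]], [[2, 2], [2, 2]], [0, 1]))  # expected: [ 2.  2.]
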
theano.gradient.consider_constant(x)[source]

DEPRECATED: use zero_grad() or disconnected_grad() instead.

Consider an expression constant when computing gradients.

The expression itself is unaffected, but when its gradient is computed, or the gradient of another expression that this expression is a subexpression of, it will not be backpropagated through. In other words, the gradient of the expression is truncated to 0.

Parameters:x – A Theano expression whose gradient should be truncated.
Returns:The expression is returned unmodified, but its gradient is now truncated to 0.

New in version 0.7.

theano.gradient.disconnected_grad(x)[source]

Consider an expression constant when computing gradients, while effectively not backpropagating through it.

The expression itself is unaffected, but when its gradient is computed, or the gradient of another expression that this expression is a subexpression of, it will not be backpropagated through. This is effectively equivalent to truncating the gradient expression to 0, but is executed faster than zero_grad(), which still has to go through the underlying computational graph related to the expression.

Parameters:x – A Theano expression whose gradient should not be backpropagated through.
Returns:The expression is returned unmodified, but its gradient is now effectively truncated to 0.
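
A small sketch (the expressions are ours) showing that the forward value is kept while backpropagation through the wrapped sub-expression is cut:

import theano
import theano.tensor as T

x = T.dscalar('x')
y = x ** 2
# y is treated as a constant when differentiating the cost below
cost = x * theano.gradient.disconnected_grad(y)
g = theano.function([x], theano.gradient.grad(cost, x))
print(g(3.0))  # only the direct path through x contributes: 9.0
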
theano.gradient.format_as(use_list, use_tuple, outputs)[source]

Formats the outputs according to the flags use_list and use_tuple. If use_list is True, outputs is returned as a list (if outputs is not a list or a tuple then it is converted into a one-element list). If use_tuple is True, outputs is returned as a tuple (if outputs is not a list or a tuple then it is converted into a one-element tuple). Otherwise (if both flags are False), outputs is returned unchanged.

theano.gradient.grad(cost, wrt, consider_constant=None, disconnected_inputs='raise', add_names=True, known_grads=None, return_disconnected='zero', null_gradients='raise')[source]

Return symbolic gradients of one cost with respect to one or more variables.

For more information about how automatic differentiation works in Theano, see gradient. For information on how to implement the gradient of a certain Op, see grad().

Parameters:
  • cost标量0维张量None)— 我们微分所涉及的值。如果提供known_grads,可以为None
  • wrt变量变量列表)— 我们想要的梯度的项。
  • consider_constant变量列表)— 不反向传播的表达式。
  • disconnected_inputs ({'ignore', 'warn', 'raise'}) —

    Defines the behaviour if some of the variables in wrt are not part of the computational graph computing cost (or if all links are non-differentiable). The possible values are:

    • ‘ignore’: considers that the gradient on these parameters is zero.
    • 'warn': considers the gradient zero, and prints a warning.
    • 'raise': raises DisconnectedInputError.
  • add_names (bool) – If True, variables generated by grad will be named (d<cost.name>/d<wrt.name>) provided that both cost and wrt have names.
  • known_grads (dict, optional) – A dictionary mapping variables to their gradients. This is useful in the case where you know the gradients of some variables but do not know the original cost.
  • return_disconnected ({'zero', 'None', 'Disconnected'}) —
    • 'zero' : If wrt[i] is disconnected, return value i will be wrt[i].zeros_like()
    • 'None' : If wrt[i] is disconnected, return value i will be None
    • 'Disconnected' : returns variables of type DisconnectedType
  • null_gradients ({'raise', 'return'}) —

    Defines the behaviour when some of the variables in wrt have a null gradient. The possible values are:

    • 'raise' : raise a NullTypeGradError exception
    • 'return' : return the null gradients
Returns:

Symbolic expression of the gradient of cost with respect to each of the wrt terms. If an element of wrt is not differentiable with respect to the output, then a zero variable is returned.

Return type:

变量或变量组成的列表/元组(匹配wrt

theano.gradient.grad_clip(x, lower_bound, upper_bound)[source]

This op does a view in the forward pass, but clips the gradient.

This is an elemwise operation.

Parameters:
  • x – the variable whose gradient is to be clipped
  • lower_bound – The lower bound of the gradient value
  • upper_bound – The upper bound of the gradient value.
Examples:

x = theano.tensor.scalar()
z = theano.tensor.grad(theano.gradient.grad_clip(x, -1, 1)**2, x)
z2 = theano.tensor.grad(x**2, x)
f = theano.function([x], outputs=[z, z2])
print(f(2.0))  # output (1.0, 4.0)

Note:

We register an optimization in tensor/opt.py that removes the GradClip op, so it has zero cost in the forward pass and only does work in the gradient.

theano.gradient.grad_not_implemented(op, x_pos, x, comment='')[source]

Return an un-computable symbolic variable of type x.type.

If any call to tensor.grad results in an expression containing this un-computable variable, an exception (NotImplementedError) will be raised indicating that the gradient on the x_pos'th input of op has not been implemented. Likewise if any call to theano.function involves this variable.

Optionally adds a comment to the exception explaining why this gradient is not implemented.

theano.gradient.grad_undefined(op, x_pos, x, comment='')[source]

Return an un-computable symbolic variable of type x.type.

If any call to tensor.grad results in an expression containing this un-computable variable, an exception (GradUndefinedError) will be raised indicating that the gradient on the x_pos'th input of op is mathematically undefined. Likewise if any call to theano.function involves this variable.

Optionally adds a comment to the exception explaining why this gradient is not defined.
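
A hedged sketch of how grad_undefined (and, analogously, grad_not_implemented) is typically used inside an Op's grad() method; MyOp and its inputs are hypothetical:

import theano

class MyOp(theano.Op):  # hypothetical Op taking inputs (x, axis)
    # make_node / perform omitted for brevity
    def grad(self, inputs, output_grads):
        x, axis = inputs
        gz, = output_grads
        # pass the output gradient through for x; the discrete axis
        # argument has no mathematically defined gradient
        return [gz,
                theano.gradient.grad_undefined(
                    self, 1, axis, "axis is a discrete argument")]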

theano.gradient.hessian(cost, wrt, consider_constant=None, disconnected_inputs='raise')[source]
Parameters:
  • consider_constant – a list of expressions not to backpropagate through.
  • disconnected_inputs (string) – Defines the behaviour if some of the variables in wrt are not part of the computational graph computing cost (or if all links are non-differentiable). The possible values are: 'ignore' (considers that the gradient on these parameters is zero), 'warn' (considers the gradient zero and prints a warning), 'raise' (raises an exception).
Returns:

Either an instance of Variable or a list/tuple of Variables (depending upon wrt) representing the Hessian of the expression with respect to (the elements of) wrt. If an element of wrt is not differentiable with respect to the output, then a zero variable is returned. The return value is of the same type as wrt: a list/tuple or TensorVariable in all cases.
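
A brief sketch (the cost is illustrative); hessian expects a scalar cost and a vector wrt:

import theano
import theano.tensor as T

x = T.dvector('x')
cost = (x ** 2).sum()  # scalar cost
H = theano.gradient.hessian(cost, x)
f = theano.function([x], H)
print(f([1.0, 2.0]))   # expected: [[ 2.  0.] [ 0.  2.]]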

theano.gradient.jacobian(expression, wrt, consider_constant=None, disconnected_inputs='raise')[source]
Parameters:
  • consider_constant – a list of expressions not to backpropagate through.
  • disconnected_inputs (string) – Defines the behaviour if some of the variables in wrt are not part of the computational graph computing cost (or if all links are non-differentiable). The possible values are: 'ignore' (considers that the gradient on these parameters is zero), 'warn' (considers the gradient zero and prints a warning), 'raise' (raises an exception).
Returns:

Either an instance of Variable or a list/tuple of Variables (depending upon wrt) representing the Jacobian of the expression with respect to (the elements of) wrt. If an element of wrt is not differentiable with respect to the output, then a zero variable is returned. The return value is of the same type as wrt: a list/tuple or TensorVariable in all cases.
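
A brief sketch (the expression is illustrative); jacobian expects a vector-valued expression:

import theano
import theano.tensor as T

x = T.dvector('x')
y = x ** 2
J = theano.gradient.jacobian(y, x)
f = theano.function([x], J)
print(f([1.0, 2.0]))   # expected: [[ 2.  0.] [ 0.  4.]]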

class theano.gradient.numeric_grad(f, pt, eps=None, out_type=None)[source]

Compute the numeric derivative of a scalar-valued function at a particular point.

static abs_rel_err(a, b)[source]

Return absolute and relative error between a and b.

The relative error is a small number when a and b are close, relative to how big they are.

Formulas used:
abs_err = abs(a - b)
rel_err = abs_err / max(abs(a) + abs(b), 1e-8)

The denominator is clipped at 1e-8 to avoid dividing by 0 when a and b are both close to 0.

The tuple (abs_err, rel_err) is returned

abs_rel_errors(g_pt)[source]

Return the abs and rel error of gradient estimate g_pt

g_pt must be a list of ndarrays of the same length as self.gf, otherwise a ValueError is raised.

Corresponding ndarrays in g_pt and self.gf must have the same shape or ValueError is raised.

max_err(g_pt, abs_tol, rel_tol)[source]

Find the biggest error between g_pt and self.gf.

What is measured is the violation of relative and absolute errors, wrt the provided tolerances (abs_tol, rel_tol). A value > 1 means both tolerances are exceeded.

Return the argmax of min(abs_err / abs_tol, rel_err / rel_tol) over g_pt, as well as abs_err and rel_err at this point.

theano.gradient.subgraph_grad(wrt, end, start=None, cost=None, details=False)[source]

With respect to wrt, computes gradients of cost and/or from existing start gradients, up to the end variables of a symbolic digraph. In other words, computes gradients for a subgraph of the symbolic theano function. Ignores all disconnected inputs.

This can be useful when one needs to perform the gradient descent iteratively (e.g. one layer at a time in an MLP), or when a particular operation is not differentiable in theano (e.g. stochastic sampling from a multinomial). In the latter case, the gradient of the non-differentiable process could be approximated by user-defined formula, which could be calculated using the gradients of a cost with respect to samples (0s and 1s). These gradients are obtained by performing a subgraph_grad from the cost or previously known gradients (start) up to the outputs of the stochastic process (end). A dictionary mapping gradients obtained from the user-defined differentiation of the process, to variables, could then be fed into another subgraph_grad as start with any other cost (e.g. weight decay).

In an MLP, we could use subgraph_grad to iteratively backpropagate:

import numpy as np
import theano
import theano.tensor

x, t = theano.tensor.fvector('x'), theano.tensor.fvector('t')
w1 = theano.shared(np.random.randn(3,4))
w2 = theano.shared(np.random.randn(4,2))
a1 = theano.tensor.tanh(theano.tensor.dot(x,w1))
a2 = theano.tensor.tanh(theano.tensor.dot(a1,w2))
cost2 = theano.tensor.sqr(a2 - t).sum()
cost2 += theano.tensor.sqr(w2.sum())
cost1 = theano.tensor.sqr(w1.sum())

params = [[w2],[w1]]
costs = [cost2,cost1]
grad_ends = [[a1], [x]]

next_grad = None
param_grads = []
for i in range(2):
    param_grad, next_grad = theano.subgraph_grad(
        wrt=params[i], end=grad_ends[i],
        start=next_grad, cost=costs[i]
    )
    next_grad = dict(zip(grad_ends[i], next_grad))
    param_grads.extend(param_grad)
Parameters:
  • wrt (list of variables) – Gradients are computed with respect to wrt.
  • end (list of variables) – Theano variables at which to end gradient descent (they are considered constant in theano.grad). For convenience, the gradients with respect to these variables are also returned.
  • start (dictionary of variables) – If not None, a dictionary mapping variables to their gradients. This is useful when the gradient on some variables are known. These are used to compute the gradients backwards up to the variables in end (they are used as known_grad in theano.grad).
  • cost (scalar (0-dimensional) variable) –

    Additional costs for which to compute the gradients. For example, these could be weight decay, an l1 constraint, MSE, NLL, etc. May optionally be None if start is provided.

    Warning

    If the gradients of cost with respect to any of the start variables is already part of the start dictionary, then it may be counted twice with respect to wrt and end.

  • details (bool) – When True, additionally returns the list of gradients from start and of cost, respectively, with respect to wrt (not end).
Return type:

Tuple of 2 or 4 Lists of Variables

Returns:

Returns lists of gradients with respect to wrt and end, respectively.

New in version 0.7.

theano.gradient.verify_grad(fun, pt, n_tests=2, rng=None, eps=None, out_type=None, abs_tol=None, rel_tol=None, mode=None, cast_to_output_type=False)[source]

Test a gradient by Finite Difference Method. Raise error on failure.

Example:
>>> verify_grad(theano.tensor.tanh,
...             (numpy.asarray([[2,3,4], [-1, 3.3, 9.9]]),),
...             rng=numpy.random)

Raises an Exception if the difference between the analytic gradient and numerical gradient (computed through the Finite Difference Method) of a random projection of the fun’s output to a scalar exceeds the given tolerance.

Parameters:
  • fun – a Python function that takes Theano variables as inputs, and returns a Theano variable. For instance, an Op instance with a single output.
  • pt – the list of numpy.ndarrays to use as input values. These arrays must be either float32 or float64 arrays.
  • n_tests – number of times to run the test
  • rng – random number generator used to sample u, we test gradient of sum(u * fun) at pt
  • eps – stepsize used in the Finite Difference Method (Default None is type-dependent) Raising the value of eps can raise or lower the absolute and relative errors of the verification depending on the Op. Raising eps does not lower the verification quality for linear operations. It is better to raise eps than raising abs_tol or rel_tol.
  • out_type – dtype of output, if complex (i.e. ‘complex32’ or ‘complex64’)
  • abs_tol – absolute tolerance used as threshold for gradient comparison
  • rel_tol – relative tolerance used as threshold for gradient comparison
  • cast_to_output_type – if the output is float32 and cast_to_output_type is True, cast the random projection to float32. Otherwise it is float64.
Note:

WARNING to unit-test writers: if op is a function that builds a graph, try to make it a SMALL graph. Often verify grad is run in debug mode, which can be very slow if it has to verify a lot of intermediate computations.

Note:

This function does not support multiple outputs. In tests/test_scan.py there is an experimental verify_grad that covers that case as well by using random projections.

theano.gradient.zero_grad(x)[source]

Consider an expression constant when computing gradients.

The expression itself is unaffected, but when its gradient is computed, or the gradient of another expression that this expression is a subexpression of, it will be backpropagated through with a value of zero. In other words, the gradient of the expression is truncated to 0.

Parameters:x – A Theano expression whose gradient should be truncated.
Returns:The expression is returned unmodified, but its gradient is now truncated to 0.
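
A small sketch (the cost is ours): the wrapped term still contributes to the forward value, but a zero gradient is backpropagated through it:

import theano
import theano.tensor as T

x = T.dscalar('x')
cost = theano.gradient.zero_grad(x ** 2) + 3 * x
g = theano.function([x], theano.gradient.grad(cost, x))
print(g(2.0))  # the x**2 term contributes zero gradient: 3.0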

List of Implemented R op

See the gradient tutorial for the R op documentation.

list of ops that support R-op:
  • With tests [most are in tensor/tests/test_rop.py]
    • SpecifyShape
    • MaxAndArgmax
    • Subtensor
    • IncSubtensor set_subtensor too
    • Alloc
    • Dot
    • Elemwise
    • Sum
    • Softmax
    • Shape
    • Join
    • Rebroadcast
    • Reshape
    • Flatten
    • DimShuffle
    • Scan [In scan_module/tests/test_scan.test_rop]
  • Without tests
    • Split
    • ARange
    • ScalarFromTensor
    • AdvancedSubtensor1
    • AdvancedIncSubtensor1
    • AdvancedIncSubtensor

Partial list of ops without support for R-op:

  • All sparse ops
  • All linear algebra ops
  • PermuteRowElements
  • Tile
  • AdvancedSubtensor
  • TensorDot
  • Outer
  • Prod
  • MulwithoutZeros
  • ProdWithoutZeros
  • CAReduce(for max,... done for MaxAndArgmax op)
  • MaxAndArgmax (only for matrices, on axis 0 or 1)