@@ -23,21 +23,51 @@ class AttachSpec:
 class GradManager:
     r"""
-    GradManager manages auto differentiation and all resources required to perform it.
+    GradManager computes gradients or, more generally, vector-Jacobian products, by
+    reverse mode automatic differentiation (a.k.a. back propagation).
 
-    Our auto differentiation framework requires that the user explicitly indicates when
-    the forward operations start and when all resources should be released. A typical usage of
-    GradManager is as follows:
+    Reverse mode autodiff normally reuses many intermediate tensors for best computation
+    efficiency. In a read-eval-print-loop (REPL) environment, however, it is impossible to
+    know how the user would take gradients later, and thus which tensors to keep. To solve
+    this problem, the user must somehow declare beforehand which gradients could possibly
+    be taken. With GradManager, users are required to call the :meth:`attach` method on a
+    tensor if they want to take gradients with respect to it later. Furthermore, any
+    computation on a tensor before it is attached is completely ignored from the autodiff
+    perspective, so :meth:`attach` must be called before any computation that needs
+    differentiation.
+
+    For example, the following symbolic differentiation code
+
+    .. code-block::
+
+        x = get_x()
+        y = f(x)
+        dy = ones_like(y)
+        dx = vjp(y, x, dy)  # vector-Jacobian product
+
+    can be rewritten using GradManager for a REPL environment as
+
+    .. code-block::
+
+        with GradManager() as gm:
+            x = get_x()
+            gm.attach(x)  # must be placed before any computation on x that needs differentiation
+            y = f(x)
+            dy = ones_like(y)
+            gm.backward(y, dy)  # doesn't need x, already known via attach()
+            dx = x.grad  # backward() saves the result to the .grad attribute
+
+    A more realistic example of training a neural network would be like
 
     .. code-block::
 
         gm = GradManager()
         gm.attach(model.parameters())
-        with gm:
-            # forward operations
-            ...
-            # backward gradients
-            gm.backward(loss)
+
+        for data in dataset:
+            with gm:
+                loss = model(data)
+                gm.backward(loss)
+            # gradients w.r.t. parameters are accumulated into their .grad attributes
 
     You can also use ``record()`` and ``release()`` method instead of ``with`` context:
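The attach-before-compute contract described in the new docstring can be mimicked with a tiny pure-Python scalar tape. This is a toy sketch only; `Var`, `ToyGradManager`, and `mul` are hypothetical stand-ins and not MegEngine APIs:

```python
class Var:
    """A mutable scalar 'tensor' with a .grad attribute."""
    def __init__(self, value):
        self.value = value
        self.grad = None

class ToyGradManager:
    """Toy scalar reverse-mode tape mimicking attach/record/backward."""
    def __init__(self):
        self.attached = set()
        self.tape = []        # list of (output, [(input, local_grad), ...])
        self.recording = False

    def attach(self, var):
        self.attached.add(var)

    def __enter__(self):
        self.recording = True
        return self

    def __exit__(self, *exc):
        self.recording = False
        self.tape.clear()     # release recorded history

    def mul(self, a, b):
        out = Var(a.value * b.value)
        if self.recording:
            # d(a*b)/da = b, d(a*b)/db = a
            self.tape.append((out, [(a, b.value), (b, a.value)]))
        return out

    def backward(self, y, dy=1.0):
        grads = {y: dy}
        for out, inputs in reversed(self.tape):
            g = grads.get(out, 0.0)
            for inp, local in inputs:
                grads[inp] = grads.get(inp, 0.0) + g * local
        # accumulate into .grad of attached vars only
        for var in self.attached:
            if var in grads:
                var.grad = (var.grad or 0.0) + grads[var]

gm = ToyGradManager()
x = Var(3.0)
gm.attach(x)          # must precede any recorded computation on x
with gm:
    y = gm.mul(x, x)  # y = x^2
    gm.backward(y)    # dy/dx = 2x = 6
print(x.grad)  # 6.0
```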
@@ -46,14 +76,29 @@ class GradManager:
         gm = GradManager()
         gm.attach(model.parameters())
-        gm.record()
-        # forward operations
-        ...
-        # backward gradients
-        gm.backward(loss)
-        gm.release()
+
+        for data in dataset:
+            gm.record()
+            loss = model(data)
+            gm.backward(loss)
+            # backward() will clear recorded history and free resources
+            # call release() if backward() is not called
+            # gm.release()
+
+    For your convenience, GradManager may (but need not) be reused. As shown in the
+    examples, you only need to attach a tensor once and GradManager will remember it
+    afterwards. However, a single GradManager can record only one computation history at
+    a time. To run multiple differentiations simultaneously or perform higher order
+    differentiation, create as many GradManagers as you need.
+
+    .. note::
+
+        Mutable tensors introduce ambiguities when doing symbolic differentiation: which
+        version of the tensor are we referring to? For attached tensors, GradManager
+        resolves this ambiguity by "snapshotting" them on first encounter, either on
+        :meth:`record` (or on entering a with statement) if the tensor is attached before
+        :meth:`record`, or on :meth:`attach` if GradManager is already recording. Attached
+        tensors will then be interpreted as their snapshotted version for differentiation
+        purposes. The same ambiguity on the first parameter of :meth:`backward` is simply
+        resolved by using the latest version.
 
     Typically, in data parallel, we would like to average the gradients across
     processes. Users will finally get the averaged gradients if an "AllReduce"
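The gradient averaging mentioned above can be sketched in plain Python. All names here are hypothetical; a real data-parallel setup would invoke a distributed AllReduce collective inside the callback instead of reading a local list:

```python
WORLD_SIZE = 4  # hypothetical number of data-parallel processes

def make_allreduce_cb(all_grads):
    # Stand-in for a real AllReduce: sum the gradients computed on every
    # process, then divide by the process count to obtain the mean.
    def cb(tensor, grad):
        return sum(all_grads) / WORLD_SIZE
    return cb

# Gradients for the same parameter as computed on 4 processes:
per_process_grads = [1.0, 2.0, 3.0, 6.0]
cb = make_allreduce_cb(per_process_grads)

# During backward(), the callback receives the local gradient and returns
# the averaged one, which is then accumulated into .grad.
averaged = cb("param", per_process_grads[0])
print(averaged)  # 3.0
```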
@@ -77,17 +122,59 @@ class GradManager:
     def attach(self, tensors: list, callbacks=None):
         r"""
-        Registers parameters that gradients should be calculated with respect to.
-        Callback Functions should have a signature like this:
+        Instruct GradManager to track operations on tensors, so that gradients with
+        respect to those tensors can be evaluated later.
+
+        :meth:`attach` also accepts a list of callbacks, which will be called with the
+        tensor and its gradient during :meth:`backward`. The signature of a callback
+        should look like:
 
         .. code-block::
 
-            def cb(param: Tensor, grad: Tensor) -> Tensor:
-                # do something
+            def callback(tensor: Tensor, grad: Tensor) -> Tensor:
+                ...
+                # returned grad is passed to subsequent callbacks
+                # and finally accumulated into the .grad attribute of tensor
                 return grad
 
-        :param params: to be registered parameters
-        :param callbacks: list of callback functions
+        :meth:`attach` calls with overlapping tensors will result in their callbacks
+        being concatenated, independently for each tensor. For example,
+
+        .. code-block::
+
+            gm.attach([x, y], callbacks=[f])
+            gm.attach([y], callbacks=[g])
+
+        is equivalent to
+
+        .. code-block::
+
+            gm.attach([x], callbacks=[f])
+            gm.attach([y], callbacks=[f, g])
+
+        The effect of :meth:`attach` will persist across multiple uses of the
+        GradManager. When reusing a GradManager, it is likely a mistake to call
+        :meth:`attach` on the same set of tensors and callbacks repeatedly, which may
+        grow the callback list indefinitely.
+
+        .. note::
+
+            When reusing a GradManager, it is sometimes desirable to attach temporary
+            tensors each time, e.g. for computing gradients of the inputs of a neural
+            network. GradManager tries to accommodate such usages by holding weak
+            references to attached tensors. Most of the time, this should be enough to
+            prevent resource leaks. Unfortunately, there are still some pitfalls left:
+
+            - Callbacks should not hold strong references, directly or indirectly, to
+              attached tensors. Any strong reference, including those from callbacks,
+              will prevent garbage collection (even by the cycle collector!) of an
+              attached tensor, until the GradManager object is garbage collected.
+
+            Please also note that GradManager might hold additional strong references to
+            attached tensors when it is in use. This note only covers potential resource
+            leaks across multiple uses of a GradManager, which is unrelated to whether
+            resources are released in a timely manner within a single use.
+
+        :param tensors: tensor or list of tensors to track
+        :param callbacks: callback or list of callbacks
         """
         if callbacks is None:
             callbacks = []
@@ -127,10 +214,30 @@ class GradManager:
     def backward(self, y=None, dy=None):
         r"""
-        Performs back-propagation and computes gradients.
+        Compute gradients (or, more generally, vector-Jacobian products) for all attached
+        tensors, accumulate them into the corresponding .grad attributes, and release
+        resources along the way.
 
-        :param ys: outputs of forward operators, e.g., the loss tensor
-        :param dys: derivatives of ys
+        :meth:`backward` computes the vector-Jacobian product
+        :math:`dx_j = \sum_{i} dy_i J_{ij}`, where :math:`J_{ij} = ∂y_i/∂x_j` is the
+        Jacobian matrix between the vector variables :math:`y` and :math:`x`, with all
+        vectors involved represented as lists of tensors, in the sense of direct sums
+        (or flatten-and-concatenate). :math:`y` and :math:`dy` are passed as the first
+        and second parameter respectively, whereas :math:`x` is directly taken from the
+        list of all attached tensors. The result :math:`dx` is not returned. Instead, it
+        is directly accumulated into the .grad attribute of the matching attached tensors
+        (a.k.a. :math:`x`). This can be done unambiguously since :math:`dx`, as a list of
+        tensors, has the same structure as :math:`x`.
+
+        If :math:`y` is a scalar and :math:`dy` is chosen to be 1, the vector-Jacobian
+        product yields the gradient of :math:`y` with respect to :math:`x` as a special
+        case. In that case, you may omit the :math:`dy` parameter and :meth:`backward`
+        will automatically use 1 for it and compute the gradient.
+
+        :meth:`backward` consumes all resources held by this GradManager and releases
+        them in the process of this call. When the call successfully finishes, the
+        GradManager will be put back into an inactive state.
+
+        :param y: tensor or list of tensors
+        :param dy: tensor or list of tensors. Defaults to 1 if y is scalar
         """
         from ..functional import ones_like
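The vector-Jacobian product formula in the docstring can be checked by hand on a small linear map, where the Jacobian is just the matrix itself:

```python
# For y_i = sum_j A[i][j] * x_j, the Jacobian is J_ij = A[i][j],
# so the VJP is dx_j = sum_i dy_i * J_ij (a matrix-transpose product).
A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]          # 3x2 Jacobian of the linear map y = A @ x
dy = [1.0, 1.0, 1.0]      # cotangent vector (all ones, as backward() defaults to)

dx = [sum(dy[i] * A[i][j] for i in range(3)) for j in range(2)]
print(dx)  # [9.0, 12.0] — the column sums of A
```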
@@ -144,14 +251,18 @@ class GradManager:
                 "call a method that clears the history?"
             )
         assert self._grad is not None
-        if ys is None:
+        if y is None:
             ys = []
-        if not isinstance(ys, (tuple, list)):
-            ys = [ys]
-        if dys is None:
+        elif isinstance(y, (tuple, list)):
+            ys = y
+        else:
+            ys = [y]
+        if dy is None:
             dys = [ones_like(y) for y in ys]
-        if not isinstance(dys, (tuple, list)):
-            dys = [dys]
+        elif isinstance(dy, (tuple, list)):
+            dys = dy
+        else:
+            dys = [dy]
         try:
             self._grad(ys, dys)
             for callback in self._after_backward_callback:
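The argument normalization at the top of `backward` follows a simple convention: wrap scalars in lists, pass lists through, and default `dy` to ones when omitted. A standalone sketch (with `1.0` standing in for `ones_like`):

```python
def normalize_args(y=None, dy=None):
    # Wrap single values in lists; pass tuples/lists through unchanged.
    if y is None:
        ys = []
    elif isinstance(y, (tuple, list)):
        ys = list(y)
    else:
        ys = [y]
    # dy defaults to one cotangent of ones per output.
    if dy is None:
        dys = [1.0 for _ in ys]  # stands in for ones_like(y)
    elif isinstance(dy, (tuple, list)):
        dys = list(dy)
    else:
        dys = [dy]
    return ys, dys

print(normalize_args(3.0))         # ([3.0], [1.0])
print(normalize_args([1.0, 2.0]))  # ([1.0, 2.0], [1.0, 1.0])
```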
@@ -172,7 +283,9 @@ class GradManager:
     def record(self):
         r"""
-        Starts recording forward operations.
+        Start recording operations.
+
+        After this call, you will be able to call :meth:`backward`.
         """
         if self._recording:
             raise RuntimeError("already recording")
@@ -198,7 +311,9 @@ class GradManager:
     def release(self):
         r"""
-        Stops recording and releases resources for gradients calculation.
+        Stop recording operations and release resources kept for gradient computation.
+
+        After this call, you will not be able to call :meth:`backward`.
         """
         if self._grad is not None:
             self._grad.__exit__(None, None, None)