I have been using PyTorch extensively in some of my projects lately, and one of the things that confused me was how to implement a hidden layer of Rectified Linear Units (ReLU) using the nn.ReLU() syntax. I was already using the functional F.relu() syntax, and wanted to move away from it towards a more object-oriented approach.

The following is a straightforward example of how to convert an F.relu() model-building approach into an nn.ReLU() model-building approach, along with some discoveries about PyTorch and ReLUs in general that I made along the way.

Implementation: F.relu()

An example of what I was starting out with is the following. This is a model with two hidden layers implemented in the functional style, where the linear layers are activated using F.relu in the forward method:

import torch.nn as nn
import torch.nn.functional as F

class Network(nn.Module):
    def __init__(self, num_input, num_output):
        super(Network, self).__init__()
        self.fc1 = nn.Linear(num_input, 10)
        self.fc2 = nn.Linear(10, 10)
        self.fc3 = nn.Linear(10, num_output)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
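
For reference, a quick way to exercise this network might look like the sketch below. The sizes and batch shape are arbitrary, purely for illustration:

import torch

net = Network(num_input=4, num_output=2)  # sizes chosen arbitrarily for illustration
batch = torch.randn(8, 4)                 # a batch of 8 samples, 4 features each
out = net(batch)
print(out.shape)                          # torch.Size([8, 2])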

Implementation: nn.ReLU()

The nn.ReLU implementation closely mirrors the one above, and looks like this:

import torch.nn as nn

class Network(nn.Module):
    def __init__(self, num_input, num_output):
        super(Network, self).__init__()
        self.fc1 = nn.Linear(num_input, 10)
        self.fc2 = nn.Linear(10, 10)
        self.fc3 = nn.Linear(10, num_output)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

The first thing we need to realise is that F.relu doesn’t return a hidden layer. Rather, it activates the hidden layer that comes before it. F.relu is a function that simply takes an output tensor as an input, converts all values that are less than 0 in that tensor to zero, and spits this out as an output.
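
For example, applying F.relu to a small tensor by hand makes the behaviour obvious:

import torch
import torch.nn.functional as F

t = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(F.relu(t))  # negative values become 0 -> 0., 0., 0., 1.5, 3.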

nn.ReLU does the exact same thing, except that it represents this operation in a different way, requiring us to first initialise the module with nn.ReLU(), before using it in the forward call. In fact, nn.ReLU itself encapsulates F.relu, as we can verify by peering directly into PyTorch’s torch.nn source code (repo url / source url).

This led me to an important realisation: F.relu itself doesn’t hold any tensor state, which is why it is known as the functional approach!
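
One quick way to convince yourself of this is to check that an nn.ReLU module carries no learnable parameters, unlike nn.Linear:

import torch.nn as nn

print(list(nn.ReLU().parameters()))                            # [] -> nothing to learn or store
print(sum(p.numel() for p in nn.Linear(10, 10).parameters()))  # 110 -> 100 weights + 10 biases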

Which brings up the following question that had me stuck for a little bit.

Do we need to initialise nn.ReLU multiple times?

In other words (or code), do we need to do this:

import torch.nn as nn

class Network(nn.Module):
    def __init__(self, num_input, num_output):
        super(Network, self).__init__()
        self.fc1 = nn.Linear(num_input, 10)
        self.relu_1 = nn.ReLU()
        self.fc2 = nn.Linear(10, 10)
        self.relu_2 = nn.ReLU()
        self.fc3 = nn.Linear(10, num_output)

    def forward(self, x):
        x = self.relu_1(self.fc1(x))
        x = self.relu_2(self.fc2(x))
        x = self.fc3(x)
        return x

The answer is simply no.

Remember that nn.ReLU encapsulates F.relu? Specifically, it does so like this, as per the source code:

class ReLU(Module):
    def __init__(self, inplace=False):
        super(ReLU, self).__init__()
        self.inplace = inplace

    def forward(self, input):
        return F.relu(input, inplace=self.inplace)

Notice that nn.ReLU directly uses F.relu in its forward pass. Therefore we can surmise that the nn.ReLU approach is simply a more verbose way of calling the F.relu function we were using earlier.
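
A quick equivalence check bears this out (random input, purely illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(3, 5)
print(torch.equal(nn.ReLU()(x), F.relu(x)))  # True -> identical outputs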

Layer Abstractions

If nn.ReLU is simply a more verbose way of calling the F.relu function, why would we bother with it in the first place?

To me it’s simply a question of consistency. The nn.ReLU approach offers us the ability to think in terms of a convenient set of layer abstractions. Instead of looking at a hidden layer and having to think that it becomes activated by a ReLU function, I can look at a layer and think of it as a ReLU layer.

The fact is, of course, there is no such thing as a ReLU layer, or even a ReLU tensor. In fact, there isn’t even such a thing as a hidden layer. Rather, in neural networks there are a bunch of neurons (tensors in PyTorch) interacting with each other. Even nn.Linear is an abstraction that defines the relationship between a set of tensors according to a specific formula: initialise a set of tensors that are connected in a certain way, then with each pass, take in a certain number of inputs, do a bunch of computations, and spit out a certain number of outputs. It also records what it computes so that the back-propagation that comes later can work out the gradients.
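
To make that concrete, the formula behind nn.Linear is just a matrix multiplication plus a bias, which we can reproduce by hand (shapes chosen arbitrarily):

import torch
import torch.nn as nn

linear = nn.Linear(4, 3)
x = torch.randn(2, 4)

manual = x @ linear.weight.T + linear.bias  # the computation nn.Linear wraps up
print(torch.allclose(linear(x), manual))    # True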

But being able to think of a set of interconnected neurons as a ReLU layer, or a Dropout layer, or a Convolutional layer brings a set of benefits. It allows us to treat that specific set of interconnected neurons as a single abstraction, which lets me code out my network in the following way:

import torch.nn as nn

class Network(nn.Module):
    def __init__(self, num_input, num_output):
        super(Network, self).__init__()
        self.relu_1 = ReLULayer(num_input, 10)
        self.relu_2 = ReLULayer(10, 10)
        self.output = nn.Linear(10, num_output)

    def forward(self, x):
        x = self.relu_1(x)
        x = self.relu_2(x)
        x = self.output(x)
        return x

So when I look at the above code,

  1. I need only think in terms of the layers defined in the __init__ method when evaluating what the network does.
  2. I need only think in terms of classes and objects, without needing to mix in functional concepts.
  3. Anything in forward simply becomes placeholder code, and could potentially even be automated away.

This doesn’t seem like much with a small model, but as models get more complicated, interpretation of the code becomes much simpler using layer abstractions.

Some implementations of the ReLU layer

How might we go about implementing the ReLULayer class then? I can think of a few options.

Option 1: nn.Sequential

import torch.nn as nn

class ReLULayer(nn.Module):
    def __init__(self, input_number, output_number):
        super(ReLULayer, self).__init__()
        self.layer = nn.Sequential(
            nn.Linear(input_number, output_number),
            nn.ReLU())

    def forward(self, x):
        return self.layer(x)

Option 2: Using F.relu in Forward

import torch.nn as nn
import torch.nn.functional as F

class ReLULayer(nn.Module):
    def __init__(self, input_number, output_number):
        super(ReLULayer, self).__init__()
        self.linear = nn.Linear(
            input_number,
            output_number)

    def forward(self, x):
        return F.relu(self.linear(x))

Option 3: Clamp

import torch.nn as nn

class ReLULayer(nn.Module):
    def __init__(self, input_number, output_number):
        super(ReLULayer, self).__init__()
        self.linear = nn.Linear(
            input_number, 
            output_number)

    def forward(self, x):
        return self.linear(x).clamp(min=0)

This last one seems strange at first, but turns out to be extremely interesting. This is because .clamp(min=0) essentially does what F.relu() does: it takes a tensor and limits its values so that none of them are less than 0.
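
We can confirm that the two produce identical outputs on the forward pass (random input, illustrative only):

import torch
import torch.nn.functional as F

x = torch.randn(4, 6)
print(torch.equal(x.clamp(min=0), F.relu(x)))  # True -> same result on the forward pass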

The above OOP approach might seem overly verbose, but when we do something like this, we move the abstraction out of the network and into its own class. It’s a tradeoff decision that answers the following question: where do we want to be confused by the code? While it sits encapsulated in its own class doing one single thing? Or while it sits side-by-side, jumbled up with a bunch of other layers, each doing its own thing?

For simple networks this would almost certainly be overkill, but in larger, more complex networks, an abstraction like this could be extremely useful.