import gzip, pickle

with gzip.open(path_gz, 'rb') as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
Initial setup
Data
n,m = x_train.shape
c = y_train.max()+1
nh = 50
class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.layers = [nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out)]

    def __call__(self, x):
        for l in self.layers: x = l(x)
        return x
model = Model(m, nh, 10)
pred = model(x_train)
pred.shape
torch.Size([50000, 10])
Cross entropy loss
First, we will need to compute the softmax of our activations. This is defined by:

\[ \hbox{softmax}(x)_{i} = \frac{e^{x_{i}}}{\sum_{j} e^{x_{j}}} \]
The formula \[\log \left ( \frac{a}{b} \right ) = \log(a) - \log(b)\] gives a simplification when we compute the log softmax:
def log_softmax(x): return x - x.exp().sum(-1,keepdim=True).log()
Then, there is a way to compute the log of the sum of exponentials in a more numerically stable way, called the LogSumExp trick. The idea is to use the following formula:

\[\log \left ( \sum_{j=1}^{n} e^{x_{j}} \right ) = \log \left ( e^{a} \sum_{j=1}^{n} e^{x_{j}-a} \right ) = a + \log \left ( \sum_{j=1}^{n} e^{x_{j}-a} \right )\]

where \(a\) is the maximum of the \(x_{j}\).
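As a sketch of the trick (the `logsumexp` helper name here is illustrative): subtracting the maximum before exponentiating keeps the intermediate values from overflowing, and the formula above guarantees the result is unchanged:

```python
import torch

def logsumexp(x):
    a = x.max(-1, keepdim=True)[0]                      # a = max of each row
    return a + (x - a).exp().sum(-1, keepdim=True).log()

def log_softmax(x): return x - logsumexp(x)
```

With a naive implementation, an activation of 1000 would overflow `exp` to `inf`; with the max subtracted out, the result stays finite.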
The cross entropy loss for some target \(x\) and some prediction \(p(x)\) is given by:
\[ -\sum x\, \log p(x) \]
But since our \(x\)s are 1-hot encoded (in practice, they are stored as integer class indices), this can be rewritten as \(-\log(p_{i})\), where \(i\) is the index of the desired target.
This can be done using numpy-style integer array indexing. Note that PyTorch supports all the tricks described in the NumPy advanced-indexing documentation.
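For example (a toy sketch; the `preds` and `targ` values here are illustrative), integer array indexing picks out \(-\log(p_{i})\) for each row at once:

```python
import torch

def log_softmax(x): return x - x.exp().sum(-1, keepdim=True).log()

# negative log likelihood: index row k by its target class targ[k]
def nll(inp, targ): return -inp[range(targ.shape[0]), targ].mean()

preds = torch.randn(3, 5)            # 3 samples, 5 classes (illustrative)
targ  = torch.tensor([0, 2, 4])
loss  = nll(log_softmax(preds), targ)
```

This composition of `log_softmax` and `nll` computes the same quantity as `F.cross_entropy`, which the document switches to next.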
Basically the training loop repeats over the following steps:

- get the output of the model on a batch of inputs
- compare the output to the labels we have and compute a loss
- calculate the gradients of the loss with respect to every parameter of the model
- update said parameters with those gradients to make them a little bit better
loss_func = F.cross_entropy
bs = 50                 # batch size
xb = x_train[0:bs]      # a mini-batch from x
preds = model(xb)       # predictions
preds[0], preds.shape
for epoch in range(epochs):
    for i in range(0, n, bs):
        s = slice(i, min(n, i+bs))
        xb,yb = x_train[s],y_train[s]
        preds = model(xb)
        loss = loss_func(preds, yb)
        loss.backward()
        with torch.no_grad():
            for l in model.layers:
                if hasattr(l, 'weight'):
                    l.weight -= l.weight.grad * lr
                    l.bias   -= l.bias.grad * lr
                    l.weight.grad.zero_()
                    l.bias.grad.zero_()
    report(loss, preds, yb)
def fit():
    for epoch in range(epochs):
        for i in range(0, n, bs):
            s = slice(i, min(n, i+bs))
            xb,yb = x_train[s],y_train[s]
            preds = model(xb)
            loss = loss_func(preds, yb)
            loss.backward()
            with torch.no_grad():
                for p in model.parameters(): p -= p.grad * lr
                model.zero_grad()
        report(loss, preds, yb)
fit()
0.19, 0.96
0.11, 0.96
0.04, 1.00
Behind the scenes, PyTorch overrides the __setattr__ function in nn.Module so that the submodules you define are properly registered, and their parameters show up when you call model.parameters().
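A simplified sketch of the idea (this is not PyTorch's actual implementation, just an illustration of the mechanism): intercept attribute assignment and record each submodule so its parameters can be collected later:

```python
from torch import nn

class MyModule:
    def __init__(self, n_in, nh, n_out):
        self._modules = {}
        self.l1 = nn.Linear(n_in, nh)
        self.l2 = nn.Linear(nh, n_out)

    def __setattr__(self, k, v):
        if not k.startswith('_'): self._modules[k] = v   # register submodule
        super().__setattr__(k, v)

    def parameters(self):
        for m in self._modules.values(): yield from m.parameters()
```

Assigning `self.l1 = nn.Linear(...)` goes through `__setattr__`, so the layer lands in `_modules` and its weight and bias are reachable from `parameters()`.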
class SequentialModel(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        for l in self.layers: x = l(x)
        return x
class Optimizer():
    def __init__(self, params, lr=0.5): self.params,self.lr = list(params),lr

    def step(self):
        with torch.no_grad():
            for p in self.params: p -= p.grad * self.lr

    def zero_grad(self):
        for p in self.params: p.grad.data.zero_()
model = nn.Sequential(nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh,10))
opt = Optimizer(model.parameters())
for epoch in range(epochs):
    for i in range(0, n, bs):
        s = slice(i, min(n, i+bs))
        xb,yb = x_train[s],y_train[s]
        preds = model(xb)
        loss = loss_func(preds, yb)
        loss.backward()
        opt.step()
        opt.zero_grad()
    report(loss, preds, yb)
0.18, 0.94
0.13, 0.96
0.11, 0.94
PyTorch already provides this exact functionality in optim.SGD (it also handles extras like momentum, which we'll look at later).
from torch import optim
def get_model():
    model = nn.Sequential(nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh,10))
    return model, optim.SGD(model.parameters(), lr=lr)
model,opt = get_model()
loss_func(model(xb), yb)
tensor(2.33, grad_fn=<NllLossBackward0>)
for epoch in range(epochs):
    for i in range(0, n, bs):
        s = slice(i, min(n, i+bs))
        xb,yb = x_train[s],y_train[s]
        preds = model(xb)
        loss = loss_func(preds, yb)
        loss.backward()
        opt.step()
        opt.zero_grad()
    report(loss, preds, yb)
0.12, 0.98
0.09, 0.98
0.07, 0.98
Dataset and DataLoader
Dataset
It’s clunky to iterate through minibatches of x and y values separately:
xb = x_train[s]
yb = y_train[s]
Instead, let’s do these two steps together, by introducing a Dataset class:
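The class itself isn't shown in this excerpt; a minimal sketch that indexes x and y together might look like:

```python
class Dataset():
    def __init__(self, x, y): self.x,self.y = x,y
    def __len__(self): return len(self.x)
    def __getitem__(self, i): return self.x[i], self.y[i]
```

With this, `train_ds = Dataset(x_train, y_train)` yields (x, y) pairs from a single index, and `len(train_ds)` works, which the Sampler relies on.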
We want our training set to be in a random order, and that order should differ each iteration. But the validation set shouldn’t be randomized.
import random
class Sampler():
    def __init__(self, ds, shuffle=False): self.n,self.shuffle = len(ds),shuffle

    def __iter__(self):
        res = list(range(self.n))
        if self.shuffle: random.shuffle(res)
        return iter(res)
from itertools import islice
ss = Sampler(train_ds)
it = iter(ss)
for o in range(5): print(next(it))
0
1
2
3
4
list(islice(ss, 5))
[0, 1, 2, 3, 4]
ss = Sampler(train_ds, shuffle=True)
list(islice(ss, 5))
[9479, 15594, 48548, 36621, 15204]
import fastcore.all as fc
class BatchSampler():
    def __init__(self, sampler, bs, drop_last=False): fc.store_attr()
    def __iter__(self): yield from fc.chunked(iter(self.sampler), self.bs, drop_last=self.drop_last)
class DataLoader():
    def __init__(self, ds, batchs, collate_fn=collate): fc.store_attr()
    def __iter__(self): yield from (self.collate_fn(self.ds[i] for i in b) for b in self.batchs)
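The `collate` function used as the default above isn't defined in this excerpt; a minimal sketch stacks the individual (x, y) pairs into batch tensors:

```python
import torch

def collate(b):
    xs, ys = zip(*b)                          # split pairs into xs and ys
    return torch.stack(xs), torch.stack(ys)   # stack into batch tensors
```

Given a batch of per-sample pairs, this returns one x tensor with a leading batch dimension and one y tensor of labels.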
You should always also have a validation set, in order to identify whether you are overfitting.
We will calculate and print the validation loss at the end of each epoch.
(Note that we always call model.train() before training, and model.eval() before inference, because these are used by layers such as nn.BatchNorm2d and nn.Dropout to ensure appropriate behaviour for these different phases.)
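Putting the pieces together as a sketch (the toy data and shapes here are illustrative, not the MNIST setup above): train with `model.train()`, then switch to `model.eval()` and disable gradient tracking for the validation pass at the end of each epoch.

```python
import torch
from torch import nn, optim
import torch.nn.functional as F

# toy data standing in for the real training/validation sets
x_t, y_t = torch.randn(64, 4), torch.randint(0, 2, (64,))
x_v, y_v = torch.randn(16, 4), torch.randint(0, 2, (16,))
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
opt = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(2):
    model.train()                        # training mode
    for i in range(0, 64, 16):
        loss = F.cross_entropy(model(x_t[i:i+16]), y_t[i:i+16])
        loss.backward()
        opt.step(); opt.zero_grad()
    model.eval()                         # eval mode for the validation pass
    with torch.no_grad():                # no gradient tracking needed here
        valid_loss = F.cross_entropy(model(x_v), y_v)
    print(epoch, valid_loss.item())
```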