Transfer Learning: Working through the pytorch tutorial

17 Aug 2017 » deeplearning, pytorch

Quick post on Transfer Learning

A common situation we encounter is a lack of data: we do not have enough examples to properly train a high-capacity architecture from scratch. Thus, a pretrained model is often used either as an initialization that is then fine-tuned, or as a fixed feature extractor, where all layers except the final fully connected (FC) layer are frozen.

The pytorch tutorial[1] provides a couple of examples, one of which fine-tunes a resnet18 model pre-trained on the ImageNet 1000-class dataset. When fine-tuning, we use the pre-trained model as the initialization of our new architecture. We redefine the final fully connected layer to keep the same number of input features (model_ft.fc.in_features) while resetting the number of output features to match the number of labels, which in this case is two since we are classifying between ants and bees, and then train as usual on our smaller data set.

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models

use_gpu = torch.cuda.is_available()

model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
# replace the final layer: nn.Linear(number of in_features, number of out_features)
model_ft.fc = nn.Linear(num_ftrs, 2)

if use_gpu:
    model_ft = model_ft.cuda()

criterion = nn.CrossEntropyLoss()

# Observe that all parameters are being optimized
optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)
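
For comparison, the fixed feature extractor approach mentioned above freezes the pre-trained weights with requires_grad = False and only optimizes the re-initialized final layer. The sketch below follows the same lines as the tutorial's second example[1] and reuses the imports from the snippet above:

model_conv = models.resnet18(pretrained=True)
for param in model_conv.parameters():
    param.requires_grad = False  # frozen layers accumulate no gradients

# parameters of a newly constructed layer have requires_grad=True by default
num_ftrs = model_conv.fc.in_features
model_conv.fc = nn.Linear(num_ftrs, 2)

if use_gpu:
    model_conv = model_conv.cuda()

criterion = nn.CrossEntropyLoss()

# only the final layer's parameters are passed to the optimizer
optimizer_conv = optim.SGD(model_conv.fc.parameters(), lr=0.001, momentum=0.9)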

Resnets

Deep residual learning is presented in “Deep Residual Learning for Image Recognition”[2], and builds on the evidence that the depth of a neural network plays a significant role in the performance of a given model. A key excerpt from the paper is quoted below:

Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) − x (assuming that the input and output are of the same dimensions). So rather than expect stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) − x. The original function thus becomes F(x) + x. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different.
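
In code, the residual function corresponds to a skip connection that adds the block's input back to its output. Below is a minimal illustrative sketch of such a block; it is simplified relative to the actual BasicBlock used by resnet18, which also includes batch normalization and optional downsampling:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # A simplified residual block: output = relu(F(x) + x)
    def __init__(self, channels):
        super(ResidualBlock, self).__init__()
        # F(x): two 3x3 convolutions that preserve the spatial size
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                   # identity shortcut
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        out = out + residual           # F(x) + x
        return self.relu(out)

# quick shape check on a dummy input
block = ResidualBlock(channels=64)
y = block(torch.randn(1, 64, 56, 56))  # output keeps the same shape as the input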

The results are pretty impressive considering the ants and bees data set consists of only 120 training images and 75 validation images. As suggested in the paper, this result is likely owed to the excellent generalization performance of deep representations on recognition tasks.

Epoch 22/24
----------
train Loss: 0.0800 Acc: 0.8525
val Loss: 0.0429 Acc: 0.9542

Epoch 23/24
----------
train Loss: 0.0714 Acc: 0.8730
val Loss: 0.0425 Acc: 0.9477

Epoch 24/24
----------
train Loss: 0.0620 Acc: 0.9016
val Loss: 0.0434 Acc: 0.9412

Training complete in 51m 51s
Best val Acc: 0.954248

Next Steps

As for next steps, it would be interesting to see how fine-tuning the resnet model performs on a data set with a larger number of labels.
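
Architecturally, the only change needed would be the output size of the final layer; for example, with a hypothetical data set of 100 labels:

from torchvision import models
import torch.nn as nn

model_ft = models.resnet18(pretrained=True)
num_classes = 100  # hypothetical: a data set with 100 labels instead of 2
model_ft.fc = nn.Linear(model_ft.fc.in_features, num_classes)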

References

  1. http://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
  2. https://arxiv.org/pdf/1512.03385.pdf