In this example, you use Deep Learning HDL Toolbox™ to deploy a quantized deep convolutional neural network and classify an image. The example uses the pretrained ResNet-18 convolutional neural network to demonstrate transfer learning, quantization, and deployment of the quantized network. Use MATLAB® to retrieve the prediction results.
ResNet-18 has been trained on over a million images and can classify images into 1000 object categories (such as keyboard, coffee mug, pencil, and many animals). The network has learned rich feature representations for a wide range of images. The network takes an image as input and outputs a label for the object in the image together with the probabilities for each of the object categories.
For this example, you need:
Deep Learning Toolbox™
Deep Learning HDL Toolbox™
Deep Learning Toolbox Model for ResNet-18 Network
Deep Learning HDL Toolbox Support Package for Xilinx FPGA and SoC Devices
Image Processing Toolbox™
Deep Learning Toolbox Model Quantization Library
MATLAB Coder Interface for Deep Learning Libraries
To perform classification on a new set of images, you fine-tune a pretrained ResNet-18 convolutional neural network by transfer learning. In transfer learning, you can take a pretrained network and use it as a starting point to learn a new task. Fine-tuning a network with transfer learning is usually much faster and easier than training a network with randomly initialized weights from scratch. You can quickly transfer learned features to a new task using a smaller number of training images.
To load the pretrained ResNet-18 network, enter:
snet = resnet18;
To view the layers of the pretrained network, enter:
analyzeNetwork(snet);
The first layer, the image input layer, requires input images of size 224-by-224-by-3, where 3 is the number of color channels.
inputSize = snet.Layers(1).InputSize;
This example uses the MathWorks MerchData data set. This is a small data set containing 75 images of MathWorks merchandise, belonging to five different classes (cap, cube, playing cards, screwdriver, and torch).
curDir = pwd;
unzip('MerchData.zip');
imds = imageDatastore('MerchData', ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');
[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');
The fully connected layer and the classification layer of the pretrained network snet are configured for 1000 classes. These two layers, fc1000 and ClassificationLayer_predictions in ResNet-18, contain information on how to combine the features that the network extracts into class probabilities and predicted labels. They must be fine-tuned for the new classification problem. Convert the pretrained network to a layer graph, then replace these two layers with new layers adapted to the new data set.
lgraph = layerGraph(snet)
lgraph =
LayerGraph with properties:
Layers: [71×1 nnet.cnn.layer.Layer]
Connections: [78×2 table]
InputNames: {'data'}
OutputNames: {'ClassificationLayer_predictions'}
numClasses = numel(categories(imdsTrain.Labels))
numClasses = 5
newLearnableLayer = fullyConnectedLayer(numClasses, ...
    'Name','new_fc', ...
    'WeightLearnRateFactor',10, ...
    'BiasLearnRateFactor',10);
lgraph = replaceLayer(lgraph,'fc1000',newLearnableLayer);
newClassLayer = classificationLayer('Name','new_classoutput');
lgraph = replaceLayer(lgraph,'ClassificationLayer_predictions',newClassLayer);
The network requires input images of size 224-by-224-by-3, but the images in the image datastores have different sizes. Use an augmented image datastore to automatically resize the training images. Specify additional augmentation operations to perform on the training images, such as randomly flipping the training images along the vertical axis and randomly translating them up to 30 pixels horizontally and vertically. Data augmentation helps prevent the network from overfitting and memorizing the exact details of the training images.
pixelRange = [-30 30];
imageAugmenter = imageDataAugmenter( ...
    'RandXReflection',true, ...
    'RandXTranslation',pixelRange, ...
    'RandYTranslation',pixelRange);
To automatically resize the validation images without performing further data augmentation, use an augmented image datastore without specifying any additional preprocessing operations.
augimdsTrain = augmentedImageDatastore(inputSize(1:2),imdsTrain, ...
    'DataAugmentation',imageAugmenter);
augimdsValidation = augmentedImageDatastore(inputSize(1:2),imdsValidation);
Specify the training options. For transfer learning, keep the features from the early layers of the pretrained network (the transferred layer weights). To slow down learning in the transferred layers, set the initial learning rate to a small value. Specify the mini-batch size and validation data. The software validates the network every ValidationFrequency iterations during training.
options = trainingOptions('sgdm', ...
    'MiniBatchSize',10, ...
    'MaxEpochs',6, ...
    'InitialLearnRate',1e-4, ...
    'Shuffle','every-epoch', ...
    'ValidationData',augimdsValidation, ...
    'ValidationFrequency',3, ...
    'Verbose',false, ...
    'Plots','training-progress');
Train the network that consists of the transferred and new layers. By default, trainNetwork uses a GPU if one is available. Using a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For more information, see GPU Support by Release (Parallel Computing Toolbox). Otherwise, trainNetwork uses a CPU (requires MATLAB Coder Interface for Deep Learning Libraries™). You can also specify the execution environment by using the 'ExecutionEnvironment' name-value argument of trainingOptions.
netTransfer = trainNetwork(augimdsTrain,lgraph,options);
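Before quantizing, it can help to check how well the fine-tuned network classifies the validation images. The following is a minimal sketch of that accuracy calculation. The label arrays and the shortened class names are hypothetical placeholders so that the snippet runs on its own; in this workflow the predictions would presumably come from classify(netTransfer,augimdsValidation) and the ground truth from imdsValidation.Labels.

```matlab
% Hypothetical placeholder labels so that the snippet runs on its own.
% In the workflow above they would instead come from:
%   YPred = classify(netTransfer,augimdsValidation);
%   YTrue = imdsValidation.Labels;
classNames = {'cap','cube','cards','screwdriver','torch'};
YTrue = categorical({'cap';'cube';'cards';'screwdriver';'torch'},classNames);
YPred = categorical({'cap';'cube';'cards';'torch';'torch'},classNames);

% Accuracy is the fraction of predictions that match the true labels.
accuracy = mean(YPred == YTrue);
fprintf('Validation accuracy: %.1f%%\n',100*accuracy);
```

Recording this floating-point accuracy gives a baseline to compare against after quantization.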

Create a dlquantizer object and specify the network to quantize.
dlquantObj = dlquantizer(netTransfer,'ExecutionEnvironment','FPGA');
Use the calibrate function to exercise the network with sample inputs and collect the range information. The calibrate function exercises the network and collects the dynamic ranges of the weights and biases in the convolution and fully connected layers of the network and the dynamic ranges of the activations in all layers of the network. The calibrate function returns a table. Each row of the table contains range information for a learnable parameter of the quantized network.
dlquantObj.calibrate(augimdsTrain)
ans=95×5 table
Optimized Layer Name Network Layer Name Learnables / Activations MinValue MaxValue
__________________________ __________________ ________________________ ________ ________
{'conv1_Weights' } {'bn_conv1' } "Weights" -0.86045 1.3675
{'conv1_Bias' } {'bn_conv1' } "Bias" -0.66706 0.67651
{'res2a_branch2a_Weights'} {'bn2a_branch2a'} "Weights" -0.40354 0.34824
{'res2a_branch2a_Bias' } {'bn2a_branch2a'} "Bias" -0.7954 1.3412
{'res2a_branch2b_Weights'} {'bn2a_branch2b'} "Weights" -0.75855 0.5863
{'res2a_branch2b_Bias' } {'bn2a_branch2b'} "Bias" -1.3406 1.7593
{'res2b_branch2a_Weights'} {'bn2b_branch2a'} "Weights" -0.32464 0.35274
{'res2b_branch2a_Bias' } {'bn2b_branch2a'} "Bias" -1.1606 1.5388
{'res2b_branch2b_Weights'} {'bn2b_branch2b'} "Weights" -1.1713 0.95244
{'res2b_branch2b_Bias' } {'bn2b_branch2b'} "Bias" -0.73906 1.2628
{'res3a_branch2a_Weights'} {'bn3a_branch2a'} "Weights" -0.19423 0.2396
{'res3a_branch2a_Bias' } {'bn3a_branch2a'} "Bias" -0.53868 0.69323
{'res3a_branch2b_Weights'} {'bn3a_branch2b'} "Weights" -0.53801 0.73706
{'res3a_branch2b_Bias' } {'bn3a_branch2b'} "Bias" -0.6457 1.1458
{'res3a_branch1_Weights' } {'bn3a_branch1' } "Weights" -0.64085 0.98864
{'res3a_branch1_Bias' } {'bn3a_branch1' } "Bias" -0.9258 0.76574
⋮
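To see why the collected ranges matter, the sketch below maps one MinValue/MaxValue pair from the table to an 8-bit fixed-point format. It assumes a signed, power-of-two scaling scheme chosen purely for illustration; it is not taken from the toolbox and may differ from what dlquantizer does internally. The variable names are hypothetical.

```matlab
% Range reported for conv1_Weights in the calibration table above.
minValue = -0.86045;
maxValue =  1.3675;

wordLength = 8;                               % signed 8-bit storage (int8)
maxAbs  = max(abs([minValue maxValue]));      % largest magnitude to represent
intBits = floor(log2(maxAbs)) + 1;            % bits left of the binary point
fracLen = wordLength - 1 - intBits;           % remaining bits after the sign bit
scale   = 2^(-fracLen);                       % value of one integer step

% Quantize the two range endpoints: scale, round, and saturate to int8 limits.
qMax = max(-128,min(127,round(maxValue/scale)));
qMin = max(-128,min(127,round(minValue/scale)));

fprintf('Fraction length: %d bits, step size: %g\n',fracLen,scale);
fprintf('MaxValue %g -> int8 %d -> %g\n',maxValue,qMax,qMax*scale);
fprintf('MinValue %g -> int8 %d -> %g\n',minValue,qMin,qMin*scale);
```

Under this assumed scheme, a wider dynamic range leaves fewer fractional bits and therefore a coarser step size, which is why calibration data should be representative of the inputs that the deployed network will see.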
Use the dlhdl.Target class to create a target object with a custom name for your target device and an interface to connect your target device to the host computer. Interface options are JTAG and Ethernet. To use JTAG, install Xilinx™ Vivado™ Design Suite 2019.2. To set the Xilinx Vivado toolpath, enter:
% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2019.2\bin\vivado.bat');
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');
Use the dlhdl.Workflow class to create an object. When you create the object, specify the network and the bitstream name. Specify the quantized ResNet-18 network, represented by the dlquantizer object dlquantObj, as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx ZCU102 SoC board. The bitstream uses the int8 data type.
hW = dlhdl.Workflow('Network', dlquantObj, 'Bitstream', 'zcu102_int8','Target',hTarget);
To compile the netTransfer DAG network, run the compile method of the dlhdl.Workflow object. You can optionally specify the maximum number of input frames.
dn = hW.compile('InputFrameNumberLimit',15)
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_int8 ...
### The network includes the following layers:
1 'data' Image Input 224×224×3 images with 'zscore' normalization (SW Layer)
2 'conv1' Convolution 64 7×7×3 convolutions with stride [2 2] and padding [3 3 3 3] (HW Layer)
3 'bn_conv1' Batch Normalization Batch normalization with 64 channels (HW Layer)
4 'conv1_relu' ReLU ReLU (HW Layer)
5 'pool1' Max Pooling 3×3 max pooling with stride [2 2] and padding [1 1 1 1] (HW Layer)
6 'res2a_branch2a' Convolution 64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer)
7 'bn2a_branch2a' Batch Normalization Batch normalization with 64 channels (HW Layer)
8 'res2a_branch2a_relu' ReLU ReLU (HW Layer)
9 'res2a_branch2b' Convolution 64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer)
10 'bn2a_branch2b' Batch Normalization Batch normalization with 64 channels (HW Layer)
11 'res2a' Addition Element-wise addition of 2 inputs (HW Layer)
12 'res2a_relu' ReLU ReLU (HW Layer)
13 'res2b_branch2a' Convolution 64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer)
14 'bn2b_branch2a' Batch Normalization Batch normalization with 64 channels (HW Layer)
15 'res2b_branch2a_relu' ReLU ReLU (HW Layer)
16 'res2b_branch2b' Convolution 64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer)
17 'bn2b_branch2b' Batch Normalization Batch normalization with 64 channels (HW Layer)
18 'res2b' Addition Element-wise addition of 2 inputs (HW Layer)
19 'res2b_relu' ReLU ReLU (HW Layer)
20 'res3a_branch2a' Convolution 128 3×3×64 convolutions with stride [2 2] and padding [1 1 1 1] (HW Layer)
21 'bn3a_branch2a' Batch Normalization Batch normalization with 128 channels (HW Layer)
22 'res3a_branch2a_relu' ReLU ReLU (HW Layer)
23 'res3a_branch2b' Convolution 128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer)
24 'bn3a_branch2b' Batch Normalization Batch normalization with 128 channels (HW Layer)
25 'res3a' Addition Element-wise addition of 2 inputs (HW Layer)
26 'res3a_relu' ReLU ReLU (HW Layer)
27 'res3a_branch1' Convolution 128 1×1×64 convolutions with stride [2 2] and padding [0 0 0 0] (HW Layer)
28 'bn3a_branch1' Batch Normalization Batch normalization with 128 channels (HW Layer)
29 'res3b_branch2a' Convolution 128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer)
30 'bn3b_branch2a' Batch Normalization Batch normalization with 128 channels (HW Layer)
31 'res3b_branch2a_relu' ReLU ReLU (HW Layer)
32 'res3b_branch2b' Convolution 128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer)
33 'bn3b_branch2b' Batch Normalization Batch normalization with 128 channels (HW Layer)
34 'res3b' Addition Element-wise addition of 2 inputs (HW Layer)
35 'res3b_relu' ReLU ReLU (HW Layer)
36 'res4a_branch2a' Convolution 256 3×3×128 convolutions with stride [2 2] and padding [1 1 1 1] (HW Layer)
37 'bn4a_branch2a' Batch Normalization Batch normalization with 256 channels (HW Layer)
38 'res4a_branch2a_relu' ReLU ReLU (HW Layer)
39 'res4a_branch2b' Convolution 256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer)
40 'bn4a_branch2b' Batch Normalization Batch normalization with 256 channels (HW Layer)
41 'res4a' Addition Element-wise addition of 2 inputs (HW Layer)
42 'res4a_relu' ReLU ReLU (HW Layer)
43 'res4a_branch1' Convolution 256 1×1×128 convolutions with stride [2 2] and padding [0 0 0 0] (HW Layer)
44 'bn4a_branch1' Batch Normalization Batch normalization with 256 channels (HW Layer)
45 'res4b_branch2a' Convolution 256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer)
46 'bn4b_branch2a' Batch Normalization Batch normalization with 256 channels (HW Layer)
47 'res4b_branch2a_relu' ReLU ReLU (HW Layer)
48 'res4b_branch2b' Convolution 256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer)
49 'bn4b_branch2b' Batch Normalization Batch normalization with 256 channels (HW Layer)
50 'res4b' Addition Element-wise addition of 2 inputs (HW Layer)
51 'res4b_relu' ReLU ReLU (HW Layer)
52 'res5a_branch2a' Convolution 512 3×3×256 convolutions with stride [2 2] and padding [1 1 1 1] (HW Layer)
53 'bn5a_branch2a' Batch Normalization Batch normalization with 512 channels (HW Layer)
54 'res5a_branch2a_relu' ReLU ReLU (HW Layer)
55 'res5a_branch2b' Convolution 512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer)
56 'bn5a_branch2b' Batch Normalization Batch normalization with 512 channels (HW Layer)
57 'res5a' Addition Element-wise addition of 2 inputs (HW Layer)
58 'res5a_relu' ReLU ReLU (HW Layer)
59 'res5a_branch1' Convolution 512 1×1×256 convolutions with stride [2 2] and padding [0 0 0 0] (HW Layer)
60 'bn5a_branch1' Batch Normalization Batch normalization with 512 channels (HW Layer)
61 'res5b_branch2a' Convolution 512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer)
62 'bn5b_branch2a' Batch Normalization Batch normalization with 512 channels (HW Layer)
63 'res5b_branch2a_relu' ReLU ReLU (HW Layer)
64 'res5b_branch2b' Convolution 512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer)
65 'bn5b_branch2b' Batch Normalization Batch normalization with 512 channels (HW Layer)
66 'res5b' Addition Element-wise addition of 2 inputs (HW Layer)
67 'res5b_relu' ReLU ReLU (HW Layer)
68 'pool5' Global Average Pooling Global average pooling (HW Layer)
69 'new_fc' Fully Connected 5 fully connected layer (HW Layer)
70 'prob' Softmax softmax (SW Layer)
71 'new_classoutput' Classification Output crossentropyex with 'MathWorks Cap' and 4 other classes (SW Layer)
### Optimizing series network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
5 Memory Regions created.
Skipping: data
Compiling leg: conv1>>pool1 ...
Compiling leg: conv1>>pool1 ... complete.
Compiling leg: res2a_branch2a>>res2a_branch2b ...
Compiling leg: res2a_branch2a>>res2a_branch2b ... complete.
Compiling leg: res2b_branch2a>>res2b_branch2b ...
Compiling leg: res2b_branch2a>>res2b_branch2b ... complete.
Compiling leg: res3a_branch1 ...
Compiling leg: res3a_branch1 ... complete.
Compiling leg: res3a_branch2a>>res3a_branch2b ...
Compiling leg: res3a_branch2a>>res3a_branch2b ... complete.
Compiling leg: res3b_branch2a>>res3b_branch2b ...
Compiling leg: res3b_branch2a>>res3b_branch2b ... complete.
Compiling leg: res4a_branch1 ...
Compiling leg: res4a_branch1 ... complete.
Compiling leg: res4a_branch2a>>res4a_branch2b ...
Compiling leg: res4a_branch2a>>res4a_branch2b ... complete.
Compiling leg: res4b_branch2a>>res4b_branch2b ...
Compiling leg: res4b_branch2a>>res4b_branch2b ... complete.
Compiling leg: res5a_branch1 ...
Compiling leg: res5a_branch1 ... complete.
Compiling leg: res5a_branch2a>>res5a_branch2b ...
Compiling leg: res5a_branch2a>>res5a_branch2b ... complete.
Compiling leg: res5b_branch2a>>res5b_branch2b ...
Compiling leg: res5b_branch2a>>res5b_branch2b ... complete.
Compiling leg: pool5 ...
Compiling leg: pool5 ... complete.
Compiling leg: new_fc ...
Compiling leg: new_fc ... complete.
Skipping: prob
Skipping: new_classoutput
Creating Schedule...
.............................
Creating Schedule...complete.
Creating Status Table...
............................
Creating Status Table...complete.
Emitting Schedule...
..........................
Emitting Schedule...complete.
Emitting Status Table...
..............................
Emitting Status Table...complete.
### Allocating external memory buffers:
offset_name offset_address allocated_space
_______________________ ______________ ________________
"InputDataOffset" "0x00000000" "24.0 MB"
"OutputResultOffset" "0x01800000" "4.0 MB"
"SchedulerDataOffset" "0x01c00000" "4.0 MB"
"SystemBufferOffset" "0x02000000" "28.0 MB"
"InstructionDataOffset" "0x03c00000" "4.0 MB"
"ConvWeightDataOffset" "0x04000000" "16.0 MB"
"FCWeightDataOffset" "0x05000000" "4.0 MB"
"EndOffset" "0x05400000" "Total: 84.0 MB"
### Network compilation complete.
dn = struct with fields:
weights: [1×1 struct]
instructions: [1×1 struct]
registers: [1×1 struct]
syncInstructions: [1×1 struct]
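The external memory map in the compiler log can be cross-checked with a few lines of base MATLAB. This sketch copies the offsets from the log and assumes that each region extends up to the start of the next one; the variable names are illustrative only.

```matlab
% Region names and hexadecimal offsets copied from the compiler log above.
% The last offset is EndOffset, which closes the final region.
names   = {'InputData','OutputResult','SchedulerData','SystemBuffer', ...
           'InstructionData','ConvWeightData','FCWeightData'};
offsets = {'00000000','01800000','01c00000','02000000', ...
           '03c00000','04000000','05000000','05400000'};

addr    = hex2dec(offsets);      % byte addresses as a column vector
sizesMB = diff(addr)/2^20;       % each region ends where the next one begins

for k = 1:numel(names)
    fprintf('%-16s %5.1f MB\n',names{k},sizesMB(k));
end
totalMB = addr(end)/2^20;
fprintf('%-16s %5.1f MB\n','Total',totalMB);
```

The computed sizes agree with the allocated_space column, and the end offset corresponds to the reported total of 84.0 MB.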
To deploy the network on the Xilinx ZCU102 hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. It also downloads the network weights and biases. The deploy function starts programming the FPGA device and displays progress messages and the time it takes to deploy the network.
hW.deploy
### Programming FPGA Bitstream using Ethernet...
Downloading target FPGA device configuration over Ethernet to SD card ...
# Copied /tmp/hdlcoder_rd to /mnt/hdlcoder_rd
# Copying Bitstream hdlcoder_system.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/hdlcoder_system.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
Downloading target FPGA device configuration over Ethernet to SD card done.
The system will now reboot for persistent changes to take effect.
System is rebooting . . . . . .
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 11-Jan-2021 11:26:16
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 11-Jan-2021 11:26:16
Load the example image.
imgFile = fullfile(pwd,'MerchData','MathWorks Cube','Mathworks cube_0.jpg');
inputImg = imresize(imread(imgFile),[224 224]);
imshow(inputImg)

Execute the predict method on the dlhdl.Workflow object and then show the label in the MATLAB command window.
[prediction, speed] = hW.predict(single(inputImg),'Profile','on');
### Finished writing input activations.
### Running single input activations.
Deep Learning Processor Profiler Performance Results
LastFrameLatency(cycles) LastFrameLatency(seconds) FramesNum Total Latency Frames/s
------------- ------------- --------- --------- ---------
Network 7323615 0.02929 1 7323615 34.1
conv1 1111619 0.00445
pool1 235563 0.00094
res2a_branch2a 268736 0.00107
res2a_branch2b 269031 0.00108
res2a 94319 0.00038
res2b_branch2a 268677 0.00107
res2b_branch2b 268863 0.00108
res2b 94255 0.00038
res3a_branch1 155156 0.00062
res3a_branch2a 226445 0.00091
res3a_branch2b 243593 0.00097
res3a 47248 0.00019
res3b_branch2a 243461 0.00097
res3b_branch2b 243581 0.00097
res3b 47232 0.00019
res4a_branch1 133899 0.00054
res4a_branch2a 134402 0.00054
res4a_branch2b 234184 0.00094
res4a 23628 0.00009
res4b_branch2a 234058 0.00094
res4b_branch2b 234648 0.00094
res4b 23756 0.00010
res5a_branch1 310730 0.00124
res5a_branch2a 310810 0.00124
res5a_branch2b 595374 0.00238
res5a 11827 0.00005
res5b_branch2a 595150 0.00238
res5b_branch2b 595904 0.00238
res5b 12012 0.00005
pool5 35870 0.00014
new_fc 17811 0.00007
* The clock frequency of the DL processor is: 250MHz
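The latency and throughput columns of the profiler report follow directly from the cycle counts and the clock frequency. The sketch below reproduces that arithmetic for the network total and for conv1, using numbers copied from the report; the variable names are illustrative.

```matlab
clockHz       = 250e6;      % DL processor clock frequency from the report
networkCycles = 7323615;    % LastFrameLatency(cycles) for the whole network
conv1Cycles   = 1111619;    % LastFrameLatency(cycles) for conv1

latencySec   = networkCycles/clockHz;          % seconds per frame
framesPerSec = 1/latencySec;                   % throughput for a single frame
conv1Share   = 100*conv1Cycles/networkCycles;  % percentage of total cycles

fprintf('Latency: %.5f s, throughput: %.1f frames/s\n',latencySec,framesPerSec);
fprintf('conv1 accounts for %.1f%% of the cycles\n',conv1Share);
```

Computing per-layer shares in this way shows where the cycles go. Here conv1 alone takes roughly 15% of the total.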
[val, idx] = max(prediction);
dlquantObj.NetworkObject.Layers(end).ClassNames{idx}
ans = 'MathWorks Cube'
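To inspect more than the single best class, the scores can be sorted and the top few labels listed. In the sketch below, the score vector is hypothetical and stands in for the prediction output. The class names are also assumptions based on the MerchData folder names; in the workflow they would be read from dlquantObj.NetworkObject.Layers(end).ClassNames.

```matlab
% Hypothetical scores standing in for the prediction output of hW.predict.
prediction = single([0.02 0.90 0.03 0.01 0.04]);

% Assumed class names, used here so that the snippet runs on its own.
classNames = {'MathWorks Cap','MathWorks Cube','MathWorks Playing Cards', ...
              'MathWorks Screwdriver','MathWorks Torch'};

% Sort the scores from highest to lowest and keep the matching indices.
[sortedScores,order] = sort(prediction,'descend');

topK = 3;
for k = 1:topK
    fprintf('%-26s %.2f\n',classNames{order(k)},sortedScores(k));
end
```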