SPRADB0 May 2023 AM62A3, AM62A3-Q1, AM62A7, AM62A7-Q1, AM67A, AM68A, AM69A
Once a model architecture is selected and a dataset is created, the next step is to retrain the model on the new dataset.
TI provides a suite of easy-to-use tools for model training and compilation. Any training framework can be used to train the deep learning model, so long as the model contains only layers supported on the device's accelerator. Experts may prefer tools they are more familiar with, such as training in PyTorch and exporting to ONNX for compilation. Less experienced users can use TI tools like Model Composer and edgeai-modelmaker; the steps covered here use edgeai-modelmaker. An advantage of using TI tools is that compilation can be handled automatically at the end of training, such that Section 5 can be skipped entirely.
Once set up, edgeai-modelmaker uses a separate training framework such as edgeai-mmdetection to perform the training itself. Training starts from a pretrained model and fine-tunes it via transfer learning: the last layer of the network is reset and a low learning rate is used. For models supported in the model zoo, the pretrained weights are downloaded automatically. Note that in this process, layers before the last layer are not frozen and will change during training.
To train a model, follow the steps below; refer to the modelmaker READMEs for the most up-to-date instructions:
If a GPU is present on the training machine, it is highly recommended to configure it for training, as it provides a substantial speedup. The most important part of this configuration is ensuring that the installed CUDA version matches the version PyTorch/torchvision is compiled against; the version shipped with modelmaker at the time of this writing is CUDA 11.3. Correctly setting up these drivers can be a pain point, as the CUDA version must also be compatible with the NVIDIA display/graphics driver.
If the dataset is fairly small (<10,000 images), it is best to start from a network pretrained on a large, generic dataset such as ImageNet-1k for image classification or COCO for object detection. Larger networks generally require more training samples due to their increased number of parameters.
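The CUDA/PyTorch consistency mentioned in the steps above can be checked from a Python prompt (a minimal sketch; the version strings printed will vary by installation):

```python
import torch

# CUDA toolkit version this PyTorch build was compiled against
# (None for CPU-only builds); compare this to the system's installed CUDA.
print("PyTorch built against CUDA:", torch.version.cuda)

# Whether a usable GPU + driver combination was found at runtime.
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```

If `torch.cuda.is_available()` returns False on a GPU machine, the usual culprit is a mismatch between the CUDA version and the NVIDIA display driver.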
For the food-recognition dataset of ~2500 images (after train-test split and augmentations), 30 epochs of training took approximately 1.5 hours on an A2000 GPU (6 GB VRAM) with a 12-core CPU, 16 GB RAM, and an SSD. The model reached a mean average precision (mAP; a common object-detection accuracy metric) of 0.68, which is extremely high. This stems from two facts:
A full evaluation on the final test set was not performed for the retail-scanner model. Instead, a visual inspection was done to confirm that the model performed reasonably well before moving to the next stage. Several training iterations were performed by varying the number of epochs, the degree of augmentation, and the variety of augmentations.
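For reference, the mAP metric mentioned above builds on the intersection-over-union (IoU) overlap between predicted and ground-truth boxes: a detection counts as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5, and precision is then averaged over recall levels and classes. A minimal pure-Python sketch of IoU:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Overlapping region (empty if the boxes are disjoint).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # partial overlap: 25 / 175 ≈ 0.143
```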