
 

  Abstract
  Trademarks
  1 Introduction
    1.1 Intended Audience
    1.2 Host Machine Information
  2 Creating the Dataset
    2.1 Collecting Images
    2.2 Labelling Images
    2.3 Augmenting the Dataset (Optional)
  3 Selecting a Model
  4 Training the Model
    4.1 Input Optimization (Optional)
  5 Compiling the Model
  6 Using the Model
  7 Building the End Application
    7.1 Optimizing the Application With TI’s Gstreamer Plugins
    7.2 Using a Raw MIPI-CSI2 Camera
  8 Summary
  9 References

Collecting Images

Images should be collected in a context similar to how the model will be used in practice on the target embedded processor. For example, a checkout station can have a downward-facing camera with bright lighting and a neutral-colored background. The space can be mostly free of additional objects, but possibly shows a user’s hands, wallet, smartphone, and so forth.

The camera used to collect images should be of good quality, but it does not need to be professional grade. Most images used during deep learning are scaled to a smaller resolution (under 1 megapixel, for example, 512x512) than the original input. It can be advantageous to use the same camera that the end product will use, including the same autogain, white balancing, and other tuning settings the camera may require, but this is not strictly necessary. Using different cameras or settings can improve robustness later in case the camera must change. The retail-scanner application in this document used both the end camera (an IMX219 image sensor) and a 1080p USB camera for image capture.
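As a starting point for image capture, the following is a minimal sketch that grabs frames from a camera with OpenCV and saves them to disk at a fixed interval. The device index, resolution, capture interval, and output directory are assumptions and should be adjusted for the camera actually used in the end product (a CSI sensor typically requires a GStreamer pipeline string instead of a plain device index).

  # Sketch: periodically save frames from a camera for dataset collection.
  import os
  import time
  import cv2

  OUTPUT_DIR = "dataset/raw_images"   # hypothetical output location
  CAPTURE_INTERVAL_S = 2.0            # seconds between saved frames

  os.makedirs(OUTPUT_DIR, exist_ok=True)

  cap = cv2.VideoCapture(0)           # 0 = first camera; adjust for the sensor in use
  cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
  cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)

  last_save = 0.0
  count = 0
  while count < 200:                  # stop after 200 saved images
      ok, frame = cap.read()
      if not ok:
          break
      now = time.time()
      if now - last_save >= CAPTURE_INTERVAL_S:
          cv2.imwrite(os.path.join(OUTPUT_DIR, f"img_{count:04d}.jpg"), frame)
          last_save = now
          count += 1

  cap.release()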

Within the dataset, the goal is to give the machine learning model enough instances of the patterns that relate to the problem to solve, for example, what a banana looks like so that it can be added to a customer’s order. Irrelevant patterns should not consistently appear in the training samples, such as a banana always being near the edge of the image or always against a white background. To improve robustness of the model, it is recommended to vary lighting conditions; object orientations, positions, and angles; camera angle and height; and the versions of the objects themselves (for example, bananas at different stages of ripeness) if characteristics can change. If there is not enough variation, the model can overfit such that the neural network is extremely effective at recognizing the training data but fails to generalize to new data. The dataset should reflect the images that the end system will see in practice.

Larger datasets are more effective at training deep neural networks than small ones, and large, high-quality datasets are extremely effective. However, a large dataset is time-consuming to collect and label. Starting with a smaller dataset is helpful for diagnosing the types of problems that will require more data. A dataset with as few as 50 instances of each class of item is sufficient for a first attempt, and as few as 10 can be used for an initial proof of concept. For these tiny datasets, it is important that the model has already been pretrained on a large dataset such as ImageNet or COCO, which is the case for models trained using TI tools.
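A quick check of per-class instance counts helps confirm whether each class meets a minimum such as the 50-instance guideline above. The sketch below assumes labels in COCO-format JSON and a hypothetical file path; adapt it to the annotation format produced by the labelling tool actually used.

  # Sketch: count labelled instances per class in a COCO-format annotation file.
  import json
  from collections import Counter

  with open("dataset/annotations/instances.json") as f:   # hypothetical path
      coco = json.load(f)

  id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
  counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])

  for name, n in sorted(counts.items()):
      flag = "" if n >= 50 else "  <-- consider collecting more samples"
      print(f"{name}: {n}{flag}")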

The food-detection dataset uses 100-200 instances of each food item for more robustness. This application was intended to demonstrate device capability: while the model did not need to be production worthy, it did need to perform reliably enough to demonstrate in ad hoc settings with reasonable accuracy. The food dataset included realistic fake versions of foods or packaged non-perishable foods to make the demo easier to transport and reproduce in new places; this way, the objects’ appearances do not change substantially after training. Figure 2-1 is an example image from the food-detection dataset.

Figure 2-1. An Image From the Food-Recognition Dataset

The dataset can be downloaded from the GitHub repository for this demo application [2].