Accurate ego-location within a map is an essential requirement for autonomous navigation. In the ADAS and robotics communities, this problem is referred to as the localization problem. Typically, when a vehicle or robot is outdoors, localization can be handled to some extent by the Inertial Navigation System (INS), which uses Global Positioning System (GPS) data together with measurements from an Inertial Measurement Unit (IMU) to localize the vehicle or robot. However, an INS can only communicate with a GPS satellite when there is no obstruction between the two, that is, when there is a clear line of sight (LOS) to the satellite. When the vehicle or robot is in a garage, warehouse, or tunnel, the accuracy of the GPS location degrades substantially because the LOS to the satellite is obstructed. Furthermore, even when GPS is available, it can only position a vehicle within approximately a 5 meter radius [1]. This error, coupled with errors in the IMU, results in noisy localization that may not be sufficiently accurate for high-complexity ADAS or robotics tasks.
Visual localization is a popular method employed by the ADAS and robotics communities to meet the stringent localization requirements of autonomous navigation. As the name implies, in visual localization, images from one or more cameras are used to localize the vehicle or robot within a map. Of course, for this task, a map of the environment needs to be constructed and saved prior to localization. Thus far, the more popular localization solutions have been based on LiDAR, because LiDAR measurements are dense and precise. However, although LiDAR-based localization is highly accurate, it is cost-prohibitive for the everyday vehicle, because high-precision LiDARs typically cost on the order of thousands of dollars. It is therefore critical that a cheaper alternative, such as visual localization, is made available.
In robotics and in automotive, the computations for localization, as well as for other tasks, need to be performed within the vehicle or robot. Therefore, it is critical that the vehicle or robot is equipped with high-performance embedded processors that operate at low power. The Jacinto 7 family of processors from TI was designed from the ground up with applications such as visual localization in mind. The Jacinto 7 family is the culmination of two decades of TI experience in the automotive field and many decades of TI experience in electronics. These processors are equipped with deep learning engines that boast one of the best power-to-performance ratios of any device on the market today, together with Hardware Accelerators (HWAs) for specific Computer Vision (CV) tasks, and Digital Signal Processors (DSPs) that can efficiently perform related CV tasks.
In the simplest sense, visual localization, as the name implies, is the problem of determining the location of a vehicle or robot by matching key-points in a stored map with key-points extracted from images captured by a camera mounted on the vehicle or robot. A key-point is a unique or distinctive point in space from which a descriptor can be extracted. A descriptor is a set of values (a vector) that holds information about a key-point and helps distinguish that key-point from others. The method used to compute these features is described in the next section.
The first step in localization is extracting key-points from the image. The extracted key-points, which lie on a 2D image plane, then need to be matched with a sparse 3D map held in memory. To create the sparse 3D map, features need to be extracted and stored together with their corresponding locations in some arbitrary but known coordinate system. This task is typically achieved by driving a vehicle equipped with a high-precision differential GPS and a camera along all the paths that make up the map. To ensure the features are not biased by the time of day, or the day of the year, information is gathered throughout the year to refine the map. Then, when the position of the vehicle or robot needs to be estimated, key-points extracted from an image are matched with key-points in the sparse 3D map, and the pose of the vehicle or robot is estimated from the point correspondences. This process is described in more detail in the next section.
The entire localization process is shown at a high level in Figure 2-1.
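To make the structure of the sparse 3D map concrete, the following is a minimal sketch of how such a map could be represented, written in Python with NumPy purely for illustration; the class and method names are hypothetical and are not part of the TI SDK.

import numpy as np

class SparseMap:
    # Holds N map key-points: a 3D position and a fixed-length descriptor for each.
    def __init__(self, positions, descriptors):
        # positions: (N, 3) world coordinates in the chosen map coordinate system
        # descriptors: (N, D) one descriptor vector per key-point
        assert positions.shape[0] == descriptors.shape[0]
        self.positions = positions
        self.descriptors = descriptors

    def candidates_near(self, estimated_position, radius):
        # Return the subset of map points that lie within `radius` of a rough
        # position estimate (for example, from the previous pose or the INS).
        distances = np.linalg.norm(self.positions - estimated_position, axis=1)
        keep = distances < radius
        return self.positions[keep], self.descriptors[keep]

Restricting the search to map points near a rough position estimate in this way is what keeps the descriptor matching step, described later, computationally tractable.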
In the next two sections, the implementation of the steps that make up visual localization, namely key-point extraction, descriptor computation, feature matching, and pose estimation, is described in more detail.
There are a variety of techniques used by the Computer Vision community to extract key-points. These techniques fall into one of two categories: traditional Computer Vision based feature extraction methods such as SIFT, SURF [2], and KAZE [3], or Deep Neural Network (DNN) based feature extraction methods.
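As a point of reference for the traditional category, the snippet below extracts KAZE key-points and descriptors using OpenCV on a desktop machine; it is purely illustrative and unrelated to the embedded implementation described later (the image file name is a placeholder).

import cv2

# Load a grayscale frame (file name is only a placeholder).
image = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Detect KAZE key-points and compute their descriptors in one call.
kaze = cv2.KAZE_create()
keypoints, descriptors = kaze.detectAndCompute(image, None)
# keypoints: list of cv2.KeyPoint objects (2D image locations)
# descriptors: float array of shape (num_keypoints, 64) by default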
An important advantage of DNN-based key-point extraction is that the process can be performed using a generic Deep Learning accelerator. In contrast, for traditional CV-based key-point extractors, one either needs to design specialized hardware accelerators (HWAs) or use general-purpose processor cores. The former limits the types of features the customer can use, and the latter is prohibitively inefficient; as a consequence, DNN-based key-point extraction becomes the more practical solution.
This document describes a DNN-based feature extraction method for localization. In particular, the algorithm described here learns feature descriptors similar to KAZE [3] in a supervised manner using DNNs, and is therefore named DKAZE, or Deep KAZE. Using the DKAZE framework, one can extract both key-points and the corresponding descriptors as shown in [3]. More details on this algorithm can be found here. Once key-points are extracted, the next step is to match the extracted features with features in the stored 3D map and thereby estimate the pose of the vehicle or robot. The network structure of the DKAZE DNN is shown in Figure 2-2.
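The exact output format of the DKAZE network is given in the referenced documentation; as an illustration of the general idea, the sketch below assumes a network that produces a per-pixel key-point score map and a dense descriptor tensor, and shows how the top-scoring key-points and their descriptors could be gathered. This is an assumption for illustration, not the actual DKAZE post-processing.

import numpy as np

def extract_keypoints(score_map, descriptor_map, top_k=500):
    # score_map: (H, W) key-point confidence per pixel
    # descriptor_map: (H, W, D) descriptor vector per pixel
    # Returns the top_k (x, y) locations and their D-dimensional descriptors.
    h, w = score_map.shape
    flat_scores = score_map.ravel()
    top_idx = np.argpartition(flat_scores, -top_k)[-top_k:]   # indices of the top_k scores
    ys, xs = np.unravel_index(top_idx, (h, w))
    keypoints = np.stack([xs, ys], axis=1)                     # (top_k, 2) image coordinates
    descriptors = descriptor_map[ys, xs]                       # (top_k, D)
    return keypoints, descriptors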
Feature matching is the process of matching M 2D key-points from an image to N stored 3D key-points. The goodness of a match between two key-points is computed using the descriptors that correspond to each point. In the implementation described here, the Sum of Absolute Differences, or SAD, is used as the measure of how well two descriptors match; a smaller SAD score corresponds to a better match. However, since computing M×N SAD scores is computationally prohibitive, the SAD scores are only computed between the M 2D points from the image and n<N 3D points from the sparse 3D map. These n points are selected based on the estimated position of the vehicle or robot. Then, from these scores, the correspondences that give the lowest cumulative SAD score are selected.
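The following sketch illustrates the SAD-based matching just described: each of the M image descriptors is compared against the n candidate map descriptors, and the candidate with the lowest SAD score is kept as the match. It is a plain NumPy illustration, not the optimized DSP implementation.

import numpy as np

def match_sad(image_descriptors, map_descriptors):
    # image_descriptors: (M, D) descriptors extracted from the current image
    # map_descriptors:   (n, D) descriptors of the candidate 3D map points
    # Build an (M, n) table of SAD scores; a smaller score is a better match.
    sad = np.abs(image_descriptors[:, None, :] - map_descriptors[None, :, :]).sum(axis=2)
    best_match = sad.argmin(axis=1)                         # best map point for each image point
    best_score = sad[np.arange(sad.shape[0]), best_match]   # the corresponding SAD scores
    return best_match, best_score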
Once the feature correspondences are computed, the next step is to compute the pose of the vehicle or robot. In this implementation, a 6D pose is computed, that is, rotation as roll, pitch, and yaw, and translation as X, Y, Z. In particular, the Perspective-n-Point, or PnP, method is used for pose estimation. PnP is the problem of estimating the pose of a calibrated camera given a set of n 3D points in the world and their corresponding 2D projections in the image plane. In this implementation, the P3P scheme [4] is used within the RANdom SAmple Consensus (RANSAC) framework [5]. At a high level, this is an iterative process where point correspondences are randomly selected in each iteration to refine the pose estimate.
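As an illustration of the P3P-plus-RANSAC step, the sketch below uses OpenCV's PnP solver on the matched 2D-3D correspondences; the camera matrix and correspondence arrays are assumed to come from the earlier steps, and this desktop example only stands in for the optimized on-device implementation.

import cv2
import numpy as np

def estimate_pose(points_3d, points_2d, camera_matrix, dist_coeffs=None):
    # points_3d: (K, 3) matched map points; points_2d: (K, 2) matched image points.
    # RANSAC repeatedly samples minimal point sets, solves P3P, and keeps the
    # pose supported by the largest set of inliers under the error threshold.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(points_3d, dtype=np.float64),
        np.asarray(points_2d, dtype=np.float64),
        camera_matrix,
        dist_coeffs,
        flags=cv2.SOLVEPNP_P3P,
        reprojectionError=3.0,   # inlier threshold in pixels
        iterationsCount=100,
    )
    rotation, _ = cv2.Rodrigues(rvec)   # rotation vector -> 3x3 rotation matrix
    return ok, rotation, tvec, inliers

The rotation matrix and translation vector together form the 6D pose (roll, pitch, yaw, X, Y, Z) described above.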
In the next section, the subtasks that make up the visual localization algorithm are mapped to the different components of the TDA4VM SoC, to showcase the seamless mapping that can be achieved with the SoC.
This section describes how each of the subtasks that make up the visual localization algorithm described here maps seamlessly to the TDA4VM device. This application consists of three primary steps: image pre-processing, DKAZE feature extraction, and localization. Since the TDA4x family of devices was designed with applications like this in mind, each of these subtasks can be mapped to specialized hardware within the device, which ensures efficient and accurate execution of the tasks. A diagram of the TDA4VM device, the first variant of the TDA4x family available to customers, is shown in Figure 3-1.
Figure 3-1 is a block diagram that details the key components of the TDA4VM SoC. These include a Deep Learning hardware accelerator coupled to a C7x DSP, a few general-purpose Arm® cores, a vision pre-processing hardware accelerator, and hardware accelerators designed specifically for certain widely used CV tasks. Next, the sub-tasks that make up the algorithm are mapped to the different components of the SoC in Figure 3-2.
The first subtask, image pre-processing, can be performed entirely on the on-chip Vision Pre-Processing Accelerator, or VPAC, which includes an Image Signal Processor (ISP). This module takes the image from the camera, which arrives over a CSI-2 interface, and performs the pre-processing steps necessary before further processing. The VPAC module on the TDA4VM consists of a Raw Front End (RFE), a dual noise filter, a global and local tone mapping module, a flexible color processing module, lens distortion correction, and a scaling engine. More information about the VPAC can be found here.
The next sub-task, DKAZE feature extraction, can be performed using the on-chip DNN hardware accelerator, the C7x/MMA. The C7x/MMA is an HWA designed specifically to accelerate commonly used Deep Learning operations. Designed with automotive and industrial applications in mind by engineers with decades of experience, the C7x/MMA is one of the most power-efficient Deep Learning accelerators on the market today and boasts one of the best power-to-TOPS ratios of any device available. More information about the C7x/MMA can be found here.
Finally, the remaining visual localization subtasks are performed on one of the DSPs available on the SoC, either the C7x or C66x.
The mapping of the visual localization subtasks to the TDA4VM device is shown in the flow diagram in Figure 3-2 below.
A visual localization application built on the subtasks described above is included with the SDK that accompanies the TDA4x family of devices. Since this application was designed to run on simulated data, which allows for comparison with the ground truth, the data does not need to be processed by the VPAC; the application instead consumes saved simulated image data. Detailed documentation of this application can be found here.
Figure 4-1 shows an example output from this application. Here, the computed and actual paths are overlaid on a bird's-eye view of the scene. The blue markings show the computed trajectory, and the white markings show the ground truth. The utilization of the different compute cores at this instant is shown in the bottom right corner. Detailed results can be found here.
It is important to note that the visual localization application described above is optimized end-to-end. However, developers who wish to construct their own localization pipeline can take advantage of the optimized building blocks contained in this pipeline for certain compute-heavy tasks within their own pipeline. The optimized building blocks provided as part of the TIADALG component package are listed below:
More details can be found here.