Camera-based systems, thanks to their high image-sensor resolution, perform outstandingly well in 2D detection tasks. They can also be used for 3D position estimation; however, compared to LiDARs, for example, they localize distant objects less accurately. Since different sensor types have different strengths and weaknesses, it is worth using them jointly in order to achieve better detection performance. Let us briefly introduce one of our detectors, which is based on the fusion of camera images and LiDAR point clouds.
The approach can be decomposed into three main stages. First, there is a 2D detection stage, during which the objects present in the camera images are detected in the form of 2D bounding boxes expressed in the pixel coordinate frame. Since the camera(s) as well as the LiDAR(s) are calibrated in advance, the next step can determine the frustum formed by the camera center and the 2D bounding box of a detected object (see the animation). At this point we know that, in the LiDAR's point cloud, the object is located somewhere inside the determined frustum, and the task is to find its exact location.
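The frustum test above can be sketched as follows. This is a minimal illustration, not our actual implementation: it assumes a standard pinhole camera model with known intrinsics `K` and LiDAR-to-camera extrinsics `R`, `t`; the function and parameter names are hypothetical.

```python
import numpy as np

def points_in_frustum(points, K, R, t, bbox):
    """Keep the LiDAR points whose image projections fall inside a 2D box.

    points : (N, 3) points in the LiDAR frame
    K      : (3, 3) camera intrinsic matrix
    R, t   : LiDAR-to-camera rotation (3, 3) and translation (3,)
    bbox   : (x_min, y_min, x_max, y_max) in pixel coordinates
    """
    # Transform into the camera frame and drop points behind the camera.
    cam = points @ R.T + t
    cam = cam[cam[:, 2] > 0.0]

    # Pinhole projection into pixel coordinates.
    uv = cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]

    x_min, y_min, x_max, y_max = bbox
    inside = ((uv[:, 0] >= x_min) & (uv[:, 0] <= x_max) &
              (uv[:, 1] >= y_min) & (uv[:, 1] <= y_max))
    return cam[inside]
```

In other words, instead of constructing the frustum geometrically, one can equivalently project every LiDAR point into the image and keep those landing inside the detected bounding box.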
There are several approaches to this latter problem; however, if processing time is of key importance, many available solutions can no longer be considered. For example, segmenting the laser points belonging to the pedestrian and estimating the 3D bounding box with a machine learning model may take a significant amount of processing time (depending on the complexity of the network and the hardware used). Note that the 3D localization stage relies on the 2D bounding boxes estimated during the 2D detection phase of the processing. Thus, the 2D detection should be as reliable as possible, since the 3D localization stage depends on it.
There is, however, a trade-off between reliability and time complexity. For instance, during this experiment we considered using both YOLOv4 and tiny-YOLOv4 to detect objects in the camera images. Tiny-YOLOv4 obviously runs faster, but it is less reliable than its more complex YOLOv4 counterpart, which in turn runs significantly slower. Our aim was to process the data at more than 20 FPS (the maximum configurable frame rate of the LiDARs is 20 Hz), including both the image and the point cloud processing. Therefore, to localize the object inside the frustum, we applied simpler, statistics-based methods, and an acceptable performance could still be achieved. Another important aspect when using multiple sources of time-series data is the precise synchronization of all the sensors, so that corresponding LiDAR and camera data frames can be obtained. The detector has also been extended with an interacting multiple model (IMM) filter based target-tracking feature, which significantly contributes to its robustness.
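As an illustration of what such a statistics-based localization can look like (a hedged sketch only; the concrete statistics we use are not detailed here, and the names and the `depth_band` threshold below are hypothetical): the median depth of the frustum points tends to pick out the object itself rather than the background, and averaging the points near that depth yields a rough 3D center.

```python
import numpy as np

def localize_in_frustum(frustum_pts, depth_band=1.0):
    """Rough 3D localization from frustum points using simple statistics.

    Assumes the target dominates the frustum: the median depth selects the
    object layer, and points within +/- depth_band of it are averaged.
    """
    if len(frustum_pts) == 0:
        return None
    depths = frustum_pts[:, 2]
    d_med = np.median(depths)          # robust against background points
    obj = frustum_pts[np.abs(depths - d_med) <= depth_band]
    center = obj.mean(axis=0)                    # estimated object center
    extent = obj.max(axis=0) - obj.min(axis=0)   # axis-aligned box size
    return center, extent
```

Such order-statistic operations run in a fraction of the time a segmentation network needs, which is what makes the 20+ FPS budget attainable.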
The developed detector was running on a GeForce RTX 2060 Super GPU; the achieved processing time was ~30 ms.
The sensor setup of the measurement vehicle is depicted in Fig. 1: two 16-channel side LiDARs and a single 2-megapixel industrial camera running at 30 FPS. The vehicle was equipped with an IMU and a dGPS system as well. The point clouds of the two side LiDARs were merged (given the extrinsics) in order to obtain a denser point cloud. The detected objects were expressed in both the IMU frame and UTM coordinates. The results can be followed in Fig. 2, while the main steps of the detection are illustrated by the animation. The calibration of the LiDAR and the camera was based on the method developed by the authors of [1], which uses a chessboard to determine the camera-LiDAR extrinsics.
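Merging the two side LiDAR clouds is a plain rigid-body transformation into a common frame. A minimal sketch, assuming the extrinsics are available as 4x4 homogeneous LiDAR-to-body matrices (the function and argument names are hypothetical):

```python
import numpy as np

def merge_clouds(cloud_left, cloud_right, T_left, T_right):
    """Merge two LiDAR point clouds into a common (e.g. IMU/body) frame.

    cloud_* : (N, 3) points in each LiDAR's own frame
    T_*     : (4, 4) homogeneous LiDAR-to-body extrinsic transforms
    """
    def transform(pts, T):
        # Append a homogeneous coordinate, apply T, drop it again.
        homo = np.hstack([pts, np.ones((len(pts), 1))])
        return (homo @ T.T)[:, :3]

    return np.vstack([transform(cloud_left, T_left),
                      transform(cloud_right, T_right)])
```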
Reference:
[1] W. Wang, K. Sakurada, and N. Kawaguchi, "Reflectance Intensity Assisted Automatic and Accurate Extrinsic Calibration of 3D LiDAR and Panoramic Camera Using a Printed Chessboard", Remote Sensing, Vol. 9, No. 8, 2017, ISSN: 2072-4292, DOI: 10.3390/rs9080851.