Reviewing my implementation design choices.
Last week I spoke with Arnoud. One idea was that Bleser et al.’s approach might be outdated, as it dates from 2008. Back when I started my thesis, that was „fresh enough”, but now it is 10 years old and a lot has happened in the field of tracking for both virtual and augmented reality applications. So, let’s do some literature research again.
The focus of this part of the literature study is recent developments in vision-hybrid pose tracking systems. „Recent” means that I will focus on papers from 2014 up to now. „Vision-hybrid” indicates systems using multiple sensors, of which at least one is a camera. This camera might work with infrared light, visible light, or structured-light approaches. „Pose tracking systems” are those systems that estimate the 6D pose (3D position and 3D orientation) over time. This excludes one-shot localization systems.
Billinghurst et al. is a big survey from 2014 covering the full breadth of AR technologies. It subdivides AR tracking into methodologies based on the sensor used: magnetic, vision, inertial, GPS, and hybrid for sensor-fusion systems. It does not go into the technical depth of these approaches.
In their penultimate chapter, „Research Directions”, they state that „Tracking is one of the most popular areas for AR research because it is one of the most important of the AR enabling technologies. […] However, there is still significant work that needs to be done before the vision of ‚anywhere augmentation’ is achieved, where users will be able to have a compelling AR experience in any environment.” They list some robust indoor tracking methods, including a method with ceiling-mounted ultrasonic receivers by Dissanayake et al. in 2001; a method where camera images are combined with a 2D laser tracker for range finding [Scheer and Müller, 2012]; „computer vision tracking from markers pasted over the walls [Wagner, 2002, Schmalstieg and Wagner, 2007]” and „hybrid systems with computer vision, inertial sensors and ultra-wide-band tracking [Newman et al., 2006] among others”.
They note that handheld depth sensors for indoor tracking, such as the Google Tango project and Intel’s RealSense, seem very promising. I would say that these devices are not yet small enough to be embedded in consumer devices and become widely used.
Another challenge this paper observes is seamless switching „from” (why not both ways, „between”?) outdoor to indoor tracking environments. All cited sources in this section use at least GPS, with only [Piekarski et al., 2003] fusing that with vision. This brings them to 20 cm precision, which is too coarse for my case.
As a follow-up on Billinghurst et al., I looked at Google Tango. Google ended support for Tango on 1 March 2018, in favor of Google ARCore. Looking at the required permissions, ARCore only uses the camera, and requires no other sensor permissions.
Related to ARCore is Apple’s ARKit. I need to investigate that API.
Bloesch et al. presented a visual-inertial tracking method using an EKF. It might be a proper alternative to Bleser’s models, although I think Bleser et al. have been a bit more precise. Compare Bleser’s models to the following state vector:
\[\begin{align} \mathbf{x}_t &= \begin{bmatrix} \mathbf{s}_{rs,t} \\ \dot{\mathbf{s}}_{rs,t} \\ \mathbf{q}_{sw,t} \\ \mathbf{b}^a_{s,t} \\ \mathbf{b}^\omega_{s,t} \\ \mathbf{c}_{s,t} \\ \mathbf{z}_{sc,t} \\ \mathbf{\mu}^{(i)}_{c,t} \\ \rho^{(i)}_{c,t} \end{bmatrix} \end{align}\]
with coordinate frames at a global world coordinate system \(\mathcal W\), the robot’s center \(\mathcal R\), the IMU center \(\mathcal S\) and the camera center \(\mathcal C\), and (quoted from Bloesch et al., replacing their notation with Bleser’s):
- \(\mathbf{s}_{rs,t}\): robocentric position of IMU (expressed in \(\mathcal S\))
- \(\dot{\mathbf{s}}_{rs,t}\): robocentric velocity of IMU (expressed in \(\mathcal S\))
- \(\mathbf{q}_{sw,t}\): attitude of IMU (map from \(\mathcal S\) to \(\mathcal W\))
- \(\mathbf{b}^a_{s,t}\): additive bias on accelerometer (expressed in \(\mathcal S\))
- \(\mathbf{b}^\omega_{s,t}\): additive bias on gyroscope (expressed in \(\mathcal S\))
- \(\mathbf{c}_{s,t}\): translational part of IMU-camera extrinsics (expressed in \(\mathcal S\))
- \(\mathbf{z}_{sc,t}\): rotational part of IMU-camera extrinsics (map from \(\mathcal S\) to \(\mathcal C\))
- \(\mathbf{\mu}^{(i)}_{c,t}\): bearing vector to feature \(i\) (expressed in \(\mathcal C\))
- \(\rho^{(i)}_{c,t}\): distance parameter of feature \(i\).
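To make the last two state entries concrete, here is a minimal sketch of how a bearing vector and a distance parameter together describe a 3D feature position in \(\mathcal C\). It assumes the bearing vector is stored as a plain 3D vector and that the distance follows the inverse-distance model \(d(\rho) = 1/\rho\); the 2D tangent-space handling of the paper is not reproduced here, and the function names are my own, hypothetical choices.

```python
import math


def inverse_distance(rho):
    """Distance model d(rho) = 1 / rho (assumed inverse-distance parametrization)."""
    return 1.0 / rho


def feature_position(mu, rho):
    """3D feature position in the camera frame: unit bearing vector scaled by distance."""
    norm = math.sqrt(sum(c * c for c in mu))
    unit = [c / norm for c in mu]  # normalise, in case mu is not exactly unit length
    d = inverse_distance(rho)
    return [d * c for c in unit]


# A feature on the optical axis with rho = 0.25 1/m lies 4 m in front of the camera.
print(feature_position([0.0, 0.0, 1.0], 0.25))  # [0.0, 0.0, 4.0]
```

A practical reason to prefer such a parametrization is that very distant features map to \(\rho\) close to zero instead of to an unbounded distance.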
Their prediction model is given below as a differential equation, with \(\mathbf{u}_t^T = \begin{bmatrix} \mathbf{y}^\omega_{s,t} & \mathbf{y}^a_{s,t} \end{bmatrix}\):
\[\begin{align} \partial_{\mathbf{x}} f(\mathbf{x}_{t-T}, \mathbf{u}_t, \mathbf{v}_t) &= \begin{bmatrix} -{\bar{\mathbf{y}}^\omega_{s,t}}^\times \mathbf{s}_{rs,t} + \dot{\mathbf{s}}_{rs,t} + \mathbf{v}^r_t \\ -{\bar{\mathbf{y}}^\omega_{s,t}}^\times \dot{\mathbf{s}}_{rs,t} + \bar{\mathbf{y}}^a_{s,t} + \mathbf{q}^{-1}_{sw,t} (\mathbf{g}_w) \\ -\mathbf{q}_{sw,t}(\bar{\mathbf{y}}^\omega_{s,t}) \\ \mathbf{v}^{b^a}_{s,t} \\ \mathbf{v}^{b^\omega}_{s,t} \\ \mathbf{v}^{c}_{s,t} \\ \mathbf{v}^{z}_{sc,t} \\ \mathbf{N}^T(\mathbf{\mu}^{(i)}_{c,t}) \mathbf{z}_{sc,t}(\bar{\mathbf{y}}^\omega_{s,t}) - \begin{bmatrix}0 & 1 \\ -1 & 0\end{bmatrix} \mathbf{N}^T(\mathbf{\mu}^{(i)}_{c,t}) \frac{\mathbf{z}_{sc,t}(\dot{\mathbf{s}}_{rs,t} + {\bar{\mathbf{y}}^\omega_{s,t}}^\times \mathbf{c}_{s,t})}{d(\rho^{(i)}_{c,t})} + \mathbf{v}^{\mu^{(i)}}_t \\ -{\mathbf{\mu}^{(i)}_{c,t}}^T \frac{\mathbf{z}_{sc,t}(\dot{\mathbf{s}}_{rs,t} + {\bar{\mathbf{y}}^\omega_{s,t}}^\times \mathbf{c}_{s,t})}{\partial_{\rho^{(i)}} d(\rho^{(i)}_{c,t})} + \mathbf{v}^{\rho^{(i)}}_{c,t} \end{bmatrix} \end{align}\]
with:
- superscript \(\mbox{}^\times\) denoting the skew-symmetric matrix of a vector,
- \(\mathbf{N}^T(\mathbf{\mu}^{(i)}_{c,t})\) linearly projects a 3D vector onto the 2D tangent space around the bearing vector \(\mathbf{\mu}^{(i)}_{c,t}\),
- \(d(\rho^{(i)}_{c,t}) = \frac{1}{\rho^{(i)}_{c,t}}\) the (inverse-)distance model that maps the distance parameter \(\rho^{(i)}_{c,t}\) to a distance,
- corrected measurements \(\bar{\mathbf{y}}^a_{s,t} = \mathbf{y}^a_{s,t} - \mathbf{b}^a_{s,t} - \mathbf{v}^a_{s,t}\) and \(\bar{\mathbf{y}}^\omega_{s,t} = \mathbf{y}^\omega_{s,t} - \mathbf{b}^\omega_{s,t} - \mathbf{v}^\omega_{s,t}\),
- \(\mathbf{g}_{w}\): the gravity vector (expressed in \(\mathcal W\)).
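As a sanity check of the first two rows of this model, the sketch below evaluates the position and velocity rates with all noise terms set to zero. It is only an illustration under assumptions: the attitude is passed as a 3×3 rotation matrix `R_sw` (mapping \(\mathcal S\) to \(\mathcal W\)) instead of a quaternion, and all function names are hypothetical.

```python
def skew(w):
    """Skew-symmetric matrix of a 3-vector: skew(w) applied to x equals cross(w, x)."""
    wx, wy, wz = w
    return [[0.0, -wz, wy],
            [wz, 0.0, -wx],
            [-wy, wx, 0.0]]


def matvec(m, x):
    """Product of a 3x3 matrix (nested lists) with a 3-vector."""
    return [sum(m[i][j] * x[j] for j in range(3)) for i in range(3)]


def transpose(m):
    """Transpose of a 3x3 matrix; for a rotation matrix this is its inverse."""
    return [[m[j][i] for j in range(3)] for i in range(3)]


def corrected(y, bias):
    """Bias-corrected measurement; the noise term is assumed to be zero here."""
    return [yi - bi for yi, bi in zip(y, bias)]


def position_velocity_rates(s, s_dot, R_sw, y_gyro, y_acc, b_gyro, b_acc, g_w):
    """Rates of robocentric position and velocity (first two rows of the model)."""
    w = corrected(y_gyro, b_gyro)
    a = corrected(y_acc, b_acc)
    w_cross_s = matvec(skew(w), s)
    w_cross_v = matvec(skew(w), s_dot)
    g_s = matvec(transpose(R_sw), g_w)  # inverse rotation: gravity from W into S
    pos_rate = [-c + v for c, v in zip(w_cross_s, s_dot)]
    vel_rate = [-c + ai + gi for c, ai, gi in zip(w_cross_v, a, g_s)]
    return pos_rate, vel_rate


# IMU at rest and level: the accelerometer measures the reaction to gravity,
# so the gravity term should cancel the measured specific force.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zero = [0.0, 0.0, 0.0]
print(position_velocity_rates(zero, zero, identity, zero, [0.0, 0.0, 9.81],
                              zero, zero, [0.0, 0.0, -9.81]))
```

For a pure rotation about the z-axis, a point at rest in front of the IMU gets a position rate opposite to the rotation, which is what one would expect from a robocentric formulation.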
One thing that stands out is that they do not take into account variable intervals between updates. In other words, \(T = 1\).
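For comparison, a variable interval could enter a discretised version of such a model as follows. This is a sketch of a plain forward-Euler step, not the integration scheme of Bloesch et al.; it only makes sense for the Euclidean parts of the state (not for the attitude or the bearing vectors), and the function names are hypothetical.

```python
def euler_step(state, rate, T):
    """One forward-Euler step over interval T: x_t = x_{t-T} + T * x_dot."""
    return [x + T * dx for x, dx in zip(state, rate)]


def propagate(state, rate_fn, timestamps):
    """Propagate the state over irregular timestamps, using each measured interval."""
    for t_prev, t_next in zip(timestamps, timestamps[1:]):
        state = euler_step(state, rate_fn(state), t_next - t_prev)
    return state


# Toy 1D constant-velocity model: state = [position, velocity].
constant_velocity = lambda x: [x[1], 0.0]
print(propagate([0.0, 2.0], constant_velocity, [0.0, 0.5, 0.75, 1.75]))  # [3.5, 2.0]
```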
I need to study their observation model further.
Mur-Artal is the main author of a series of ORB-SLAM papers covering monocular, stereo and RGB-D cameras, which use the DBoW2 place recognizer. Another addition to that series by Mur-Artal and Tardós integrates IMU information for accurate tracking. Nothing seems to indicate that this system is limited to robotics settings (where estimates of a control vector are reliable), and it seems equally applicable to augmented reality. In the conclusion of the visual-inertial paper they state:
We have presented a novel tightly coupled Visual-Inertial SLAM system, that is able to close loops in real-time and localize the sensor reusing the map in already mapped areas. This allows to achieve a zero-drift localization, in contrast to visual odometry approaches where drift grows unbounded. The experiments show that our monocular SLAM recovers metric scale with high precision, and achieves better accuracy than the state-of-the-art in stereo visual-inertial odometry when continually localizing in the same environment. We consider this zero-drift localization of particular interest for virtual/augmented reality systems, where the predicted user viewpoint must not drift when the user operates in the same workspace. Moreover we expect to achieve better accuracy and robustness by using stereo or RGB-D cameras, which would also simplify IMU initialization as scale is known. The main weakness of our proposed IMU initialization is that it relies on the initialization of the monocular SLAM. We would like to investigate the use of the gyroscope to make the monocular initialization faster and more robust.
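Bold is mine. The first bold remark (better accuracy and robustness with stereo or RGB-D cameras) is followed up in their final paper from this series, ORB-SLAM2, but no simplification of IMU initialisation is presented there. The second bold remark (using the gyroscope to make the monocular initialization faster and more robust) can be investigated further.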
It is a resource-intensive method, using three threads for tracking, local mapping and loop closing, but that should not limit its applicability.
TODO
- Investigate what Apple’s ARKit actually does. Does it use sensor fusion? Did they reveal the papers that inspired their methods?
- Is combining the visual-inertial ORB-SLAM with the stereo/RGB-D ORB-SLAM2 interesting enough?
References
- Gabriele Bleser and Didier Stricker. Advanced tracking through efficient image processing and visual-inertial sensor fusion. In Virtual Reality Conference, pages 137–144. IEEE, 2008. doi:10.1109/VR.2008.4480765
- Dorian Galvez-López and Juan Domingo Tardós. Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics, 28(5):1188–1197, October 2012. doi:10.1109/TRO.2012.2197158
- Mark Billinghurst, Adrian Clark, and Gun Lee. A survey of augmented reality. Foundations and Trends® in Human-Computer Interaction, 8(2-3):73–272, 2014. doi:10.1561/1100000049
- Michael Bloesch, Sammy Omari, Marco Hutter, and Roland Siegwart. Robust visual inertial odometry using a direct EKF-based approach. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 298–304, September 2015. doi:10.1109/IROS.2015.7353389
- Raúl Mur-Artal, José M. Martínez Montiel, and Juan Domingo Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, October 2015. doi:10.1109/TRO.2015.2463671
- Raúl Mur-Artal and Juan Domingo Tardós. Visual-inertial monocular SLAM with map reuse. IEEE Robotics and Automation Letters, 2(2):796–803, April 2017. doi:10.1109/LRA.2017.2653359
- Raúl Mur-Artal and Juan Domingo Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, October 2017. doi:10.1109/TRO.2017.2705103