Perception August 9, 2024

LiDAR vs Stereo Vision for AMR Obstacle Detection

Sensor comparison diagram for warehouse AMR obstacle detection systems

Every new AMR deployment involves a sensor configuration decision that ops teams make once and live with for years. 2D LiDAR only? Add a 3D unit? Go with an RGBD camera or a stereo pair? Layer all of the above? The choice affects what kinds of obstacles the robot can reliably detect, how fast the perception pipeline can process each frame, and how much compute budget is left for the planner.

We have tested Mobvynt's perception stack against seven different sensor configurations. This post shares what we actually measured, not benchmarks from sensor data sheets. The numbers below come from logged sessions in a controlled test environment with repeatable obstacle scenarios — not from simulated or synthetic data.

The Core Trade-Off in One Paragraph

2D LiDAR is fast, reliable, and predictable. A single-plane scan at 10 Hz gives you a clean distance profile around the robot's horizontal plane. Processing latency is low — from raw scan to obstacle detections in the costmap, you are typically in the 15 to 25 ms range end-to-end. The limitation is obvious: it only sees what intersects that horizontal plane. A person's legs at scan height, yes. A pallet at scan height, yes. A wire rack with horizontal bars only at the very top, possibly not.

Stereo vision and RGBD cameras give you depth across a full field of view, but at the cost of substantially higher computational load and more sensitivity to environmental conditions. A stereo pair producing 640×480 disparity maps at 30 fps, processed through a point cloud and into object detections, runs 60 to 80 ms end-to-end on a mid-range embedded compute platform, and that is before any classification. Up to 80 ms of the frame processing budget is spent before planning has even started.

Fusion configurations — 2D LiDAR primary, RGBD or stereo secondary — can get the best of both if the pipeline is designed carefully. Poorly designed, they just add latency from both paths while trusting neither.

What 2D LiDAR Actually Misses

The standard argument against 2D LiDAR is that it misses obstacles that do not intersect the scan plane. In practice, this is real but narrower than it sounds. Most warehouse AMRs mount their primary 2D LiDAR at a height between 200 mm and 350 mm above the floor, chosen to catch loaded pallets, cart frames, and people's lower legs reliably. At this height, a loaded Euro pallet on the floor is detectable (the load intersects the scan plane, although an empty Euro pallet is only 144 mm tall and can sit entirely below it), a person walking is detectable (their legs are within the scan window), and most warehouse obstacles are detectable.

The genuine failure modes are:

  • Elevated overhangs. A loaded pallet rack extending into an aisle at 1.8 m height while the floor zone is clear. The 2D scan sees clear floor; the robot proceeds and clips the overhang. We have seen this with tall-mounted conveyor side guards and with stretch-wrapped pallet loads that bow outward above the scan plane.
  • Thin vertical features. Floor-mounted bollards with diameters under 50 mm can fall between adjacent scan points at range. At 5 m distance with a typical 0.33° angular resolution, point spacing is roughly 29 mm, so a narrow bollard can produce zero or one scan hit, which a threshold-based detector may reject as noise.
  • Low-profile floor debris. Shrink wrap on the floor, a dropped glove, a thin wooden board: all of these lie below the scan plane, never intersect a beam, and are completely invisible to horizontal 2D LiDAR.
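
The point-spacing figure in the thin-feature case is simple arc-length arithmetic, and it is worth rerunning for your own sensor and range. A minimal sketch follows; the function names are ours for illustration, and the 25 mm and 50 mm widths are example values rather than measured bollards.

```python
import math


def point_spacing_m(range_m: float, angular_resolution_deg: float) -> float:
    """Arc-length gap between adjacent beams at the given range."""
    return range_m * math.radians(angular_resolution_deg)


def expected_hits(target_width_m: float, range_m: float,
                  angular_resolution_deg: float) -> float:
    """Average number of beams landing on a target of the given width."""
    return target_width_m / point_spacing_m(range_m, angular_resolution_deg)


spacing = point_spacing_m(5.0, 0.33)
print(f"spacing at 5 m: {spacing * 1000:.1f} mm")  # 28.8 mm

# Example widths only: anything averaging below one hit per scan can be
# missed entirely on some scans.
for width_mm in (25, 50):
    hits = expected_hits(width_mm / 1000, 5.0, 0.33)
    print(f"{width_mm} mm target: {hits:.2f} expected hits per scan")
```

An average below one hit per scan means some scans return nothing from the target at all, which is the case a single-scan threshold detector handles worst.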

These are real failure modes. We are not saying 2D LiDAR handles everything. The question is whether adding a second sensor modality actually fixes these failure modes in operational conditions, or whether it introduces new ones.

Where Stereo Vision Struggles in Warehouse Environments

Stereo depth estimation depends on finding matching features between the left and right camera images. Uniform surfaces — the side of a white pallet, a plain painted concrete wall, a stretch-wrapped load — are exactly the kind of low-texture scenes that confuse stereo matching algorithms. The resulting disparity maps have holes or high uncertainty in precisely the areas that matter most for obstacle detection.

In our test environment, we ran a ZED 2i stereo camera at 720p/30fps against a set of standard warehouse obstacle scenarios and measured depth accuracy on several representative surfaces:

  • Standard Euro pallet, unpainted wood grain: depth error <3 cm at 2 m. Reliable.
  • Pallet wrapped in white stretch film: depth error 8–18 cm at 2 m, with frequent holes in the disparity map at the wrapped faces. Not reliable for precise clearance estimation.
  • Plain painted concrete floor: depth error >20 cm at distances beyond 1.5 m. Essentially unusable for floor-level obstacle detection at range.
  • Person in high-vis vest: reliable across all tested distances. Feature-rich surface, no ambiguity.
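
One way to read these depth errors is through the standard first-order stereo relation: depth is Z = f·B/d, so a disparity matching error of δd pixels produces a depth error of roughly Z²/(f·B)·δd. The sketch below applies that relation. The focal length and baseline are illustrative assumptions, not calibration values from our test rig, so substitute the numbers from your own camera calibration.

```python
# Illustrative assumptions, NOT calibration data from the tests above.
FOCAL_PX = 530.0    # assumed focal length in pixels at 720p
BASELINE_M = 0.12   # assumed stereo baseline in metres


def depth_error_m(depth_m: float, disparity_error_px: float,
                  focal_px: float = FOCAL_PX,
                  baseline_m: float = BASELINE_M) -> float:
    """First-order depth error from Z = f*B/d: dZ ~= Z^2 / (f*B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


# Depth error at 2 m for a range of disparity matching errors.
for dd in (0.25, 0.5, 1.0, 2.0):
    print(f"{dd:.2f} px disparity error -> {depth_error_m(2.0, dd) * 100:.1f} cm")
```

Under these assumed values, one pixel of disparity error at 2 m corresponds to roughly 6 cm of depth error, and the error grows with the square of distance. Low-texture surfaces tend to push matching error up, which is consistent with the pattern in the measurements above, although this sketch does not model the holes in the disparity map.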

Warehouse environments have a disproportionately high fraction of low-texture surfaces specifically because operators wrap, stack, and standardize everything. The obstacle population that stereo vision is worst at happens to be the most common one.

RGBD cameras that project their own infrared pattern or use time-of-flight (the Intel RealSense D435i, an active-IR stereo unit, is one example) reduce the texture dependency but introduce a different problem: range. This class of short-range RGBD sensor is accurate at 0.3 to 3 m but degrades significantly beyond that. For a robot operating in 40-meter aisles, a three-meter effective depth range means the robot is already within about three seconds of the obstacle at 1.0 m/s travel speed before it gets reliable depth readings. That is too late for confident replanning.
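
The range argument can be turned into a quick budget calculation. The sketch below is a rough illustration only: the 0.1 s perception latency and 0.5 m/s² deceleration are placeholder assumptions, not values from our robots, and the function names are hypothetical.

```python
def seconds_to_contact(sensing_range_m: float, speed_mps: float) -> float:
    """Time until the robot reaches an obstacle first seen at max range."""
    return sensing_range_m / speed_mps


def margin_after_stop_m(sensing_range_m: float, speed_mps: float,
                        latency_s: float, decel_mps2: float) -> float:
    """Distance left once perception latency and braking are accounted for."""
    reaction_m = speed_mps * latency_s               # travelled before reacting
    braking_m = speed_mps ** 2 / (2.0 * decel_mps2)  # v^2 / (2a)
    return sensing_range_m - reaction_m - braking_m


# Placeholder inputs: 3 m range and 1.0 m/s come from the text above;
# latency and deceleration are assumptions for illustration.
print(seconds_to_contact(3.0, 1.0))
print(margin_after_stop_m(3.0, 1.0, latency_s=0.1, decel_mps2=0.5))
```

Whatever margin remains has to cover detection confirmation over several frames and a smooth replanned path, not just an emergency stop, which is why a short sensing range leaves so little room.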

Fusion: The Right Answer, Done Right

The configuration we have found most reliable for typical ambient warehouse environments is 2D LiDAR at standard scan height as the primary obstacle detection input, with a forward-facing RGBD camera as a secondary input specifically targeted at the overhang and upper-body detection gap. The fusion is not pixel-level sensor fusion — it is obstacle-level: detections from each modality are separately processed and merged in the obstacle tracking layer.

The key design choice is the trust hierarchy. In our pipeline, the 2D LiDAR detections have lower latency and higher spatial precision; they drive the planner's immediate obstacle map. The RGBD detections are used to augment the obstacle classification (discriminating person vs object based on upper-body geometry) and to flag overhang obstacles that would be missed by the horizontal scan alone. If the two modalities disagree — RGBD shows a clear path but LiDAR shows an obstacle — we trust LiDAR.

This asymmetric trust model matters because RGBD false positives (reporting an obstacle where there is none) cause replanning and route deviations that reduce throughput. False negatives (missing a real obstacle) are a safety issue. Neither is acceptable at scale, but the failure modes of each sensor modality push in opposite directions, and the fusion architecture should account for that asymmetry explicitly rather than averaging both sources naively.
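
To make the trust hierarchy concrete, here is a minimal sketch of obstacle-level merging. It is not our production tracker: the `Detection` type, the `fuse` function, and the 0.3 m match radius are hypothetical, and dropping RGBD-only detections that are not flagged as overhangs is an assumption of this sketch rather than something described above.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Detection:
    x: float                      # metres, robot frame
    y: float
    source: str                   # "lidar" or "rgbd"
    label: Optional[str] = None   # e.g. "person"; set by the RGBD path
    overhang: bool = False        # True if seen only above the scan plane


def fuse(lidar: List[Detection], rgbd: List[Detection],
         match_radius_m: float = 0.3) -> List[Detection]:
    """Merge per-modality detections with LiDAR as the trusted source."""
    fused: List[Detection] = []
    unmatched = list(rgbd)
    for det in lidar:
        # A LiDAR detection always survives; RGBD "clear" never removes it.
        match = next(
            (r for r in unmatched
             if (r.x - det.x) ** 2 + (r.y - det.y) ** 2 <= match_radius_m ** 2),
            None,
        )
        if match is not None:
            unmatched.remove(match)
            det = replace(det, label=match.label)  # RGBD only adds the class
        fused.append(det)
    # Assumption: RGBD-only detections are kept only when flagged as overhangs.
    fused.extend(r for r in unmatched if r.overhang)
    return fused


lidar = [Detection(1.0, 0.0, "lidar")]
rgbd = [Detection(1.1, 0.0, "rgbd", label="person"),
        Detection(2.0, 0.0, "rgbd", overhang=True)]
for d in fuse(lidar, rgbd):
    print(d)
```

The point of the structure is that the two sources are never averaged: LiDAR decides whether something is there at scan height, and RGBD contributes only labels and overhang flags.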

Latency Budget Across Configurations

To put concrete numbers on this, here are the end-to-end obstacle detection latencies we measured for three of the configurations in our lab environment, defined as time from sensor frame capture to obstacle detection available in the planner's costmap:

  • 2D LiDAR only (RPLIDAR A3, 10 Hz): 18–28 ms median. Very consistent. The variance comes almost entirely from ROS 2 message scheduling jitter.
  • Stereo vision only (ZED 2i, 720p/30fps): 62–95 ms median. High variance; GPU availability on embedded hardware creates scheduling spikes.
  • 2D LiDAR + RGBD fusion (RPLIDAR A3 + RealSense D435i): 22–35 ms median for LiDAR-path detections; RGBD augmentation path adds 45–60 ms but runs asynchronously and does not block the planner.

The fusion configuration effectively gives you the LiDAR's low latency for the primary obstacle path, while the RGBD contribution, with its additional 45–60 ms, arrives later and updates the obstacle classification and upper-zone detection asynchronously. This is the architecture that keeps the planner's critical latency path at sub-35 ms while still getting the benefits of volumetric depth data.
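
The non-blocking behavior is the important part, so here is a small sketch of one way to structure it. The `ObstacleLayer` class and the 0.5 s annotation time-to-live are hypothetical, and plain strings stand in for real detection messages. In a ROS 2 system the two `on_*` methods would be subscriber callbacks.

```python
import threading
from typing import List, Tuple


class ObstacleLayer:
    """Latest LiDAR obstacles plus whatever RGBD annotations have arrived."""

    def __init__(self, annotation_ttl_s: float = 0.5) -> None:
        self._lock = threading.Lock()
        self._lidar: List[str] = []
        self._rgbd: List[str] = []
        self._rgbd_stamp_s = float("-inf")  # nothing received yet
        self._ttl_s = annotation_ttl_s      # assumed staleness cutoff

    def on_lidar(self, obstacles: List[str]) -> None:
        with self._lock:
            self._lidar = list(obstacles)

    def on_rgbd(self, annotations: List[str], stamp_s: float) -> None:
        with self._lock:
            self._rgbd = list(annotations)
            self._rgbd_stamp_s = stamp_s

    def snapshot(self, now_s: float) -> Tuple[List[str], List[str]]:
        """Return immediately; never wait for the slower RGBD path."""
        with self._lock:
            fresh = now_s - self._rgbd_stamp_s <= self._ttl_s
            return list(self._lidar), (list(self._rgbd) if fresh else [])


layer = ObstacleLayer()
layer.on_lidar(["pallet"])
print(layer.snapshot(now_s=0.0))   # LiDAR result available, no RGBD yet
layer.on_rgbd(["overhang"], stamp_s=0.05)
print(layer.snapshot(now_s=0.1))   # RGBD annotation has now arrived
```

The planner calls `snapshot` on its own schedule and always gets the current LiDAR obstacles, with RGBD annotations included only if they are recent enough to trust.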

Practical Recommendation

For most ambient warehouse deployments, start with 2D LiDAR as the required baseline and add a forward-facing RGBD unit if the facility has known overhang obstacles or frequent tall-load movement. Do not assume stereo vision alone gives you better coverage than a well-tuned LiDAR stack — in warehouse conditions, it often gives you worse coverage with higher compute cost.

Where stereo vision does outperform: outdoor or semi-outdoor environments with natural texture, environments with complex 3D geometry where scan-plane coverage is inherently inadequate (ramps, multi-level pick areas), and scenarios where you need full-scene 3D reconstruction rather than just obstacle detection. For the majority of flat-floor logistics environments, it is a configuration premium that does not always pay off in detection reliability.

The sensor modality decision is also not the last word. How well the perception software uses the sensor data matters as much as which sensors are installed. A 2D LiDAR feeding a naive threshold-based obstacle detector is going to miss more obstacles than the same LiDAR feeding a tracking-aware detector that distinguishes moving from static objects and maintains per-object confidence decay. The sensor is the input. What you do with it is the real work.
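
As an illustration of per-object confidence decay, here is a minimal sketch. It is not our tracker: the `Track` class, the one-second half-life, and the 0.4 hit weight are hypothetical values chosen only to show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass
class Track:
    confidence: float = 0.0
    last_seen_s: float = 0.0

    def decayed(self, now_s: float, half_life_s: float = 1.0) -> float:
        """Confidence after exponential decay since the last observation."""
        dt = max(0.0, now_s - self.last_seen_s)
        return self.confidence * 0.5 ** (dt / half_life_s)

    def observe(self, now_s: float, hit_weight: float = 0.4,
                half_life_s: float = 1.0) -> None:
        """Decay to the present, then move part of the way toward 1.0."""
        prior = self.decayed(now_s, half_life_s)
        self.confidence = prior + (1.0 - prior) * hit_weight
        self.last_seen_s = now_s


track = Track()
for t in (0.0, 0.1, 0.2):          # one weak hit per scan at 10 Hz
    track.observe(t)
    print(f"t={t:.1f}s confidence={track.confidence:.2f}")
print(f"t=5.0s confidence={track.decayed(5.0):.2f}")  # unseen for ~5 s
```

A single weak return that a per-scan threshold would discard can accumulate into a confident track over consecutive scans, and an object that stops being observed fades out rather than persisting in the map indefinitely.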