Jethro's Braindump

Event-based Vision

Event-based cameras are bio-inspired sensors. They asynchronously measure per-pixel brightness changes, and output a stream of digital events (or spikes) that encode the time, location and sign of the brightness changes.

Principle of Operation

Standard cameras acquire full images at a rate specified by an external clock (e.g. 30fps). Event-cameras respond to brightness changes in the scene asynchronously and independently for every pixel. The output is a variable data-rate sequence of spikes, each event representing a change of brightness (log intensity).

An event consists of the location (x, y), the time (t), and the polarity of the change (p).

The incident light at a pixel is a product of scene illumination and surface reflectance. Hence, changes in log intensity in the scene generally signals a reflectance change (e.g. as a result of movement in the scene), as the illumination is usually constant. This is why event cameras are invariant to scene illumination.

DVS Cameras

DVS Cameras followed a frame-based silicon retina design, where the continuous-time photoreceptor was coupled to a readout circuit that was reset each time the pixel was sampled (Gallego et al., n.d.).

DVS events (x, y, t, p) can be used in many applications, but some also require static output (i.e. absolute brightness). Cameras have been developed to concurrently output both dynamic and static information.

Benefits

High Temporal Resolution
Monitoring brightness changes is fast in analog circuitry. The read-out of events is digital with a 1MHz clock, so events are detected and timestamped at the microsecond resolution. Hence, they can capture fast motions without motion blur.
Low Latency
Each pixel operates independently, and can transmit log intensity changes upon detection.
Low Power Consumption
Event cameras transmit only changes in brightness and removes redundant data.
High Dynamic Range
Event cameras can acquire information in different lighting conditions, due to the fact that photoreceptors operate in logarithmic scale, and that each pixel operates independently, not waiting for a global shutter

Event Generation Model

INdependent pixels respond to changes in their log photocurrent \(L = \log (I)\). In a noise-free scenario, an event \((\boldsymbol{x}_{k}, t_{k}, p_{k})\) is generated as soon as the brightness increment since the last event at the pixel:

\begin{equation} \Delta L(\boldsymbol{x}_{k}, t_{k}) = L(\boldsymbol{x_{k}}, t_{k}) - L(\boldsymbol{x}_{k}, t_{k} - \Delta t_{k}) \end{equation}

reaches a temporal contrast threshold \(\pm C\):

\begin{equation} \Delta L(\boldsymbol{x}_{k}, t_{k}) = p_{k} C \end{equation}

where \(C > 0\), and \(\Delta t_{k}\) is the time elapsed since the last event at the same pixel, and \(p_{k} \in \{+1, -1\}\) is the polarity of the brightness change.

The contrast sensitivity \(C\) is determined by pixel bias currents, which set the speed and threshold voltages of the change detector, and are generated by an on-chip digitally programmed bias generator. Typical DVS Cameras can set the threshold to 10%-50% illumination change.

Event Representations

Events are processed and transformed into alternative representations to facilitate the extraction of meaningful information to solve a given task.

Individual Events \(e_{k} = (x_{k}, t_{k}, p_{k})\) are used by probabilistic filters and Spiking Neural Networks. Filters and SNNs fuse incoming events asynchronously to produce and output. (7, 24, 61, 96, 97)

Event Packet Events \(\mathcal{E} \doteq\left\{e_{k}\right\}_{k}^{N_{e}}\) in a spatio-temporal neighbourhood are processed together to produce an output. Precise timestamp and polarity information is retained. Choosing the appropriate packet size \(N_{e}\) is crucial to satisfy assumptions of the algorithm.

Event Frame The events are converted in a simple way (e.g. pixel-wise by counting events, accumulating polarity) into an image, that can be fed into image-based computer vision algorithms. This discards sparsity, quantizes timestamps, and results in images that are sensitive to the number of events received. Hence, this is not the ideal representation, but have an intuitive interpretation and are compatible with traditional computer-vision algorithms.

Time surface (TS) A TS is a 2D map where each pixel stores a single time value. Events are converted into an image whose intensity is a function of the motion history at that location, with brighter values corresponding to more recent motion. TSs highly compress information as they only keep one timestamp per pixel, and their effectiveness degrades on textured scenes, where pixels spike frequently.

Voxel Grid is a space-time (3D) histogram of events, where each voxel represents a particular pixel and time interval. This representation preserves the temporal information of events by avoiding to collapse them on a 2D grid.

3D Point Set Events in a spatio-temporal neighbourhood are treated as points in 3D space \((x_{k}, y_{k}, t_{k})\). The temporal dimension is transformed into a geometric one, to be used with point-based geometric processing methods such as PointNet.

Point sets on image plane Events are treated as an evolving set of 2D points on the image plane. Early shape tracking methods such as mean-shift or ICP work on this data.

Motion-compensated event image This representation depends not only on events but also on motion hypothesis.

Reconstructed Images Brightness images can be obtained by image reconstruction, that can be interpreted as a motion-invariant representation.

Methods For Event Processing

  • Event-by-event-based methods

    Deterministic filters such as space-time convolutions and activity filters have been used for noise reduction, feature extraction, image reconstruction, and brightness filtering. Probabilistic filters such as Kalman Filter and Particle Filter have been used for pose tracking in SLAM. Incoming events are compared against additional information to update the filter state.

    Alternatively, multi-layer ANNs that take in frames are trained using gradient-based methods, and then converted into SNNs that process data event-by-event.

  • Methods for Groups of Events

    Each event carries little information, and is subject to noise. Hence, it is common to process several events together to yield a sufficient signal-to-noise ratio. Time Surfaces are useful for motion analysis and shape recognition, because of their sensitivity to direction of motion and scene edges. Methods using Voxel Grids involve more memory and computation than lower-dimensional representations.

    Motion compensation is a technique used to estimate the parameters of the motion that best fit a group of events. It has a continuous-time warping model that allows to exploit the fine temporal resolution of events.

Algorithms and Challenges

Feature Detection and Tracking

A key challenge to overcome is the variation of scene appearance from motion in event cameras. Tracking requires establishing correspondence between events at different times.

tracking more complex, user-defined shapes can be done using event-by-event adaptations of the Iterative Closed Point algorithm, gradient-descent, mean-shift and Monte Carlo Methods, and Particle Filter. (118, 119, 176, 177)

Corner Detection

Event cameras naturally respond to edges in the scene, and shorten the detection of lower-level primitives such as keypoints or “corners”. Corners can be computed as the intersection of two moving edges.

Optical Flow Estimation

Events do not contain enough data to determine flow, and need to be aggregated to produce an estimate. Computing flows from events is attractive, because they represent edges, which are parts of the scene where flow estimation is less ambiguous. Their low latency also allows for fine-grained computation of flow.

3D reconstruction, Monocular and Stereo

Depth estimation with event cameras can be done in multiple ways. Most works target the problem of “instantaneous” depth estimation on a per-event basis, from two or more event cameras.

Image Reconstruction

Events are a compressed per-pixel way of encoding visual content in the scene. Hence, the data can be decoded in the event stream, at very high frame rate. Since these cameras report offsets in brightness, a base offset image is required. Some works use spatial and/or temporal smoothing to reconstruct brightness from zero initial condition.

Motion Segmentation

Segmentation of moving objects viewed by a stationary event camera is simple, because events are solely imputable to the motion of the objects (assuming constant illumination). However, when the camera is moving, events are triggered everywhere because the static scene appears to be moving as a result of the camera’s ego-motion.

Neuromorphic Control

Event-based control changes the control commands asynchronously. It is justified by considering the trade-off between computation/communication cost. One should choose control frequencies based on changes in the plant dynamics. Event-triggered controllers can achieve the same performance with a fraction of computation.

A key challenge is being able to find useful signals in the large number of events per second.

_

Bibliography

Gallego, Guillermo, Tobi Delbruck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, et al. n.d. “Event-Based Vision: A Survey.” http://arxiv.org/abs/1904.08405.

Links to this note