
Unlocking Gesture Control: The Rise of a Neural Input Wristband as the Next Generation's Pointing Device

A neural input wristband will replace the dead weight of XR glasses


In the constantly changing realm of technology, gesture recognition stands as a crucial interface between humans and machines. While traditional gesture recognition methods often rely on cameras for detecting gestures, a new contender has emerged: a neural input wristband.


This blog explores how these wristbands, equipped with sensors such as IMUs (Inertial Measurement Units) and SNC (Surface Nerve Conductance), are reshaping the way we interact with devices, and the great benefit they may provide for face-worn devices such as the Apple Vision Pro or the Meta Quest 3.


Delving into the technical intricacies, we uncover some of the mechanics behind these sensors and how they translate human movement and intent into actionable digital commands. Furthermore, we dissect the algorithms powering a wearable neural gesture recognition system, shedding light on the distinct computations that enable natural and intuitive interaction. From enhancing accessibility to providing a more intuitive user experience, the potential applications of neural input wristbands are vast.


Join us as we delve into this groundbreaking technology and its implications for the future of human-computer interaction.



AVP to MVP using Mudra




There is a pattern in the emerging technologies of today – the way we interact with the digital world is changing. Search-by-keyword is being replaced by natural language processing chatbots such as ChatGPT and Claude, offering a more intuitive conversation with a computer and enhancing the experience by adapting and customizing results based on user queries. Similarly, gesture control using intuitive hand and fingertip movements is used in face-worn devices such as the Apple Vision Pro (AVP) and Meta Quest 3 (MQ3), which offer familiar tap, pinch and swipe gestures to interact with the GUI. Even the recalibration of the eye-gaze and hand-gesture tracking algorithms is gamified, all in service of the two fundamental functionalities of a pointing device: navigating in 2D or 3D space, and manipulating – selecting or interacting with – digital elements.


However, a multitude of high-speed, high-resolution sensors driving large neural networks for interaction and input comes at a price and a weight: the AVP at $3,499 and 650g, the MQ3 at $499 and 515g. Compared to the 40g of ordinary sunglasses, there must be a better way to shed weight from these high-end face computers – one that yields greater convenience and greater user willingness to adopt and wear such devices on a day-to-day basis.


Attaining an 'overall positive product experience' poses a significant challenge, involving different display and interface technologies, various sensors, multiple algorithm approaches, and complex integration processes. Each technology influences the overall experience, such as the comfort of the display and the performance of its control interface. This “overall experience” is the crux of mass adoption.


Part 1 of this blog presents a common use-case and an example of a desirable pointing-device system, and discusses the accuracy required to reach mass adoption across varied user physiologies.


In Part 2 we elaborate on the fusion of IMU and SNC sensors in a wrist-worn device, and present a method that may provide a user experience for controlling a face-worn device with spatial gestures similar to the AVP's vision-based gesture input.


In Part 3 we discuss the Field-of-View (FoV) challenges vision-based gesture control interfaces confront, and our alternative approach to solving this challenge.


We recommend that you also read our blog "What Provides the Best Experience to Interact with Smart Glasses?", which covers human-computer interaction, an in-depth analysis of the AVP and MQ3 input interfaces, and the Mudra Band solution.

Let’s Dive In.



PART 1: A COMMON USE-CASE FOR USING AND CONTROLLING FACE-WORN GLASSES


An example of a usage scenario for smart glasses involves a user on an online shopping tour. As the user browses the various items displayed in the store window, a certain item captures their attention. The user gazes at it, "selects" it, and it is added to the shopping cart. The user then takes a closer look at the product by interacting with it. Finally, the user completes the transaction by going through the checkout procedure.


These interactions involve two main functionalities of any pointing device: Navigation and Pointing.


A technical break-down of the user’s input interactions and actions includes:


  1. Browsing through different items

  2. Staring at a specific item to learn more about it

  3. Pinching the item to reveal more product information

  4. Selecting the item to add it to the cart

  5. Interacting with the interface to initiate the checkout

  6. Completing the checkout process



For the simplicity of the discussion, let's assume the user can perform the above actions either with a computer and a computer mouse, or with the AVP and its current input interface.


The table below shows how each of the above actions is performed:

| # | Functionality | Computer and Mouse | Apple Vision Pro |
|---|---------------|--------------------|------------------|
| 1 | Browsing | Moving the neck sideways | Moving the neck sideways |
| 2 | Staring (Navigation) | Moving the cursor over the desired item | Looking at the desired item |
| 3 | Pinch + Hold | Pressing on the left button | Pinching and holding the index and the thumb |
| 4 | Selecting (Pointing) | Clicking on the left button | Tapping the index and thumb fingers together |
| 5 | Interacting (Pointing) | Pressing on the left button while using wrist movement | Pinching and holding the index and the thumb while using wrist movement (flick) |


Actions such as the above can be performed using a computer mouse, a trackpad, and a gaming controller – to name a few. The traditional mouse method to input commands is best suited for sitting down and working with a screen. For more intimate interactions, such as using a mobile phone or a tablet, the touchscreen is a more direct medium to input commands. For the outdoors, gesturing to a device is the optimal input method. Thus a user can input commands by either holding a device, touching a device, or gesturing to a device.


Gesturing can be performed either through vision-based input or by using a wearable for gesture control (voice does not serve pointing-device functionalities well, and is better suited to general actions such as "Hey Siri, launch the gaze-recognition calibration procedure"). And while commercially available gesture recognition solutions such as Leap Motion and Kinect are over a decade old, wearable gesture control is gradually maturing, with research conducted at Meta's Reality Labs [1], at Apple [2], and in academia [3].


To recap the above literature, Navigation and Pointing can be performed with a wearable by integrating two types of sensors: an IMU for movement, and an array of bio-potential sensors for finger movement and innervation. An early example of such a product was the Myo Armband, launched in 2014: a gesture-control armband that combined eight Electromyography (EMG) sensors and an IMU to offer five palm and hand gestures.

 

An Inertial Measurement Unit (IMU) sensor is a device that typically combines multiple sensors, such as accelerometers, gyroscopes, and sometimes magnetometers, to measure and report the forces acting on the body it is attached to, such as gravitational and geomagnetic forces.


Surface Nerve Conductance (SNC) sensors are an array of electrodes that, via ionic exchange, react to the innervation produced mostly by finger and hand usage patterns. The advantage of this sensor type is that it offers a non-invasive and convenient way to track physiological signals, such as fingertip pressure gradations.


A good combination of the two sensors allows for an accurate representation of both hand motion and fingertip pressure. Thus, the function of Navigation can be performed by the IMU sensor, and the function of Pointing will be handled by the SNC sensor. The IMU data could also be used in the Pointing algorithms, which will follow in the next chapter.
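To make this division of labor concrete, here is a minimal sketch (in Python, with hypothetical names, an arbitrary gain, and an assumed axis mapping – not the Mudra firmware) of how the two sensor streams could be routed to the two pointer functions:

```python
# Illustrative sketch only (hypothetical names, arbitrary gain and axis
# mapping): routing the two sensor streams to the two pointer functions.
from dataclasses import dataclass

@dataclass
class PointerEvent:
    dx: float      # cursor delta from wrist motion (Navigation, IMU)
    dy: float
    pressed: bool  # fingertip pressure state (Pointing, SNC)

def fuse(gyro_xyz, snc_pressure_prob, press_threshold=0.5, gain=0.02):
    """Map wrist angular velocity to cursor motion and the SNC classifier's
    pressure probability to a click state."""
    _, gy, gz = gyro_xyz                    # rad/s; axis mapping is an assumption
    return PointerEvent(dx=gain * gz,       # "yaw"  -> horizontal movement
                        dy=gain * gy,       # "pitch" -> vertical movement
                        pressed=snc_pressure_prob > press_threshold)
```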


To summarize Part 1: the basic functionalities of a pointing device are Navigation and Pointing. Each functionality can be achieved by a hand-held device (e.g. a computer mouse), a touch-based device (e.g. a touchscreen), or by gesturing. Traditionally, gesture control has materialized through vision-based sensors, with wearables harnessing bio-potential signals such as EMG or SNC for the Pointing functionality.


PART 2: IMU AND SNC SENSOR FUSION FOR WEARABLE SPATIAL GESTURE CONTROL


Sensors such as IMU and SNC, integrated into a wristband form factor, sit at a crucial anatomical touchpoint. IMUs sense the motion of the arm, as well as vibrations caused by tapping fingers together. SNC sensors work even when no motion of the wrist occurs, such as when pressing the fingers together. When sampled in tandem, these sensors provide (noisy) data covering both arm movement and finger innervation. Each sensor has different properties: IMUs are well-established MEMS sensors that measure acceleration and angular velocity, while SNC sensors are sensitive to the electrical activity of the nerves and muscles adjacent to the wrist.


For the purpose of simplicity and coherence we focus on the "Tap" gesture, which is very intuitive, i.e. it is familiar and is cognitively related to the action of choice (picking). The tap movement creates a vibration, which can be sensed with an IMU mounted on the wrist. The vibration pattern is distinct; few other wrist movements can produce a similar pattern. However, different people tend to have different tap patterns, varying tapping force, different arm orientations, etc., so the tap gesture itself contains a large amount of variability. The neural networks and methodologies are therefore inherently different from those used in other applications, such as computer vision-based systems.


To collect data, we used a Mudra full-wrap wristband with high-speed IMU and SNC sensors mounted on it. Data was collected from multiple first-time users, while performing gestures of real usage scenarios – gestures such as tap, pinch-and-hold, and swipe. Additional recordings include “noisy” hand movements such as typing, drumming, walking.

To label the data, we added an additional discrete sensor, mounted on the index and thumb fingertips, using a conductive material. When the index touches the thumb, current passes through the fabric, providing both an external signal and a physical force measurement.


Such a labelling mechanism will yield '1' when the fingers are joined, i.e. a tap is performed, and '0' otherwise. It provides an automated annotation mechanism for each of the above-mentioned gestures. This automatic method alleviates the need for laborious manual segmentation or heuristic/algorithmic approaches for labeling relevant data. Thus, our approach leverages neural networks without the need for manual annotation.
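As an illustration of this automatic annotation, the sketch below (hypothetical names, assuming the contact channel is recorded in sync with the sensor streams) derives per-sample and per-window labels from the conductive fingertip sensor:

```python
import numpy as np

def make_labels(contact_channel, on_threshold=0.5):
    """Derive per-sample labels from the fingertip contact sensor:
    1 while the index and thumb are joined (current flows), 0 otherwise."""
    return (np.asarray(contact_channel) > on_threshold).astype(np.int8)

def window_labels(labels, win=200, hop=50):
    """Assign each analysis window the label 1 if a contact occurs inside it."""
    n = 1 + max(0, (len(labels) - win) // hop)
    return np.array([labels[i * hop : i * hop + win].max() for i in range(n)])
```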


IMU data – Tap Recognition


Training the neural network was conducted on a large data collection from many unique, first-time users who were not familiar with such a setup, or with gesture control in general. Such a database therefore contains a variety of ("worst case") noisy inputs, recorded across different usage scenarios: different arm orientations, movements and gestures, applied force, arm circumferences, skin types, etc.


To visualize the results, we've placed the IMU accelerometer data stream (left) alongside the neural network prediction (right), as can be seen in the following illustration. This example depicts a typical tap "tremor" classification.


On the top left is an animation of the accelerometer data, which is (part of) what the IMU "senses", presented in red, green, and blue. On the bottom left, the labels denote the state of the button, presented as a black line. The right side displays an animation of the neural network inference: the compact, fully colored black circle tracks the state of the button, and the inner circle denotes the probability of taps (a "double tap" gesture is shown as an example), using a real-time aggregation neural network.



As sensor data represents a time series, we can base decisions on multiple successive windows (resembling several frames in the illustration example provided). The window-wise accuracy measured on an internal database is 86%. When aggregating per-window results across multiple windows, with a second neural network, the accuracy rises to 96%.
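The aggregation stage in our pipeline is a second neural network; purely to illustrate the idea of pooling evidence across successive windows, the sketch below replaces it with a moving average and a threshold (both values are arbitrary assumptions, not our trained model):

```python
import numpy as np

def aggregate_tap_probability(window_probs, k=5, threshold=0.8):
    """Smooth per-window tap probabilities over k successive windows and
    fire a tap decision only when the smoothed score crosses a threshold.
    This stands in for the second-stage aggregation network described above."""
    probs = np.asarray(window_probs, dtype=float)
    kernel = np.ones(k) / k
    smoothed = np.convolve(probs, kernel, mode="same")
    return smoothed > threshold            # boolean tap decision per window
```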


A detailed discussion of this two-stage detector can be found in blog posts [4] and [5]. We observe that the classification errors tend towards high sensitivity (even subtle taps are recognized; few false negatives) and low specificity (other movements may trigger the classifier; most errors are false positives). This point will be addressed later.


SNC data – Fingertip Pressure Estimation


Biopotentials are measured with an electrode array composed of three electrode pairs. The signals presented show how pressing one's fingers together can be sensed using a minuscule fraction of the data a camera sensor produces. Every subtle detail of innervation is picked up, along with noise and noisy movement. For a detailed review of biopotential sensing using electrodes, see [1].


The animation below shows how applying fingertip pressure affects the SNC array. On the top left is an animation of the SNC sensors, colored in red, green and blue. The data stream shows a drag-and-move sample acquisition. Such data is also a time series, yet it has very different properties: it can last indefinitely longer than a momentary tap, and disconnections, motion, and friction artifacts may occur during that time. Note that the labeling (bottom left) is '1' while the user applies fingertip pressure. In general, higher-amplitude SNC signals (at certain frequencies) are indicative of innervation.


Combined movement and pressure is generally noisy: electrodes may detach from the skin surface of the wrist, introducing motion and friction artifacts, on top of differences in user behavior, the amount of pressure applied, and variances in skin impedance across user physiologies. Not all electrodes maintain snug, constant contact with the skin, which makes classification much more challenging.



Each electrode is mapped to a corresponding state on the large circle on the right, colored accordingly. Note how a majority vote (black circle) overcomes disconnections in the SNC sensor array by ignoring noisy electrode inference. More details on the properties of SNC and on overcoming disconnections can be found in [6].


A neural network is trained on each sensor in the array separately, which yields a per-window accuracy of 83%. Aggregating the results from all the sensors with a majority vote raises the accuracy to 89%. Results are also aggregated in time, which requires more advanced memory mechanisms; when aggregating in time using custom recurrent cells, we observe an accuracy of 94%.
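A minimal sketch of the majority-vote stage (hypothetical shapes and threshold; the temporal aggregation with recurrent cells is omitted) could look like this:

```python
import numpy as np

def majority_vote(per_electrode_probs, threshold=0.5):
    """Combine per-electrode pressure predictions, shaped
    [n_electrodes, n_windows], into one decision per window, so that a noisy
    or disconnected electrode is out-voted by electrodes in good contact."""
    votes = (np.asarray(per_electrode_probs) > threshold).astype(int)
    return votes.sum(axis=0) > votes.shape[0] // 2   # True where most agree
```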


To summarize Part 2: we've introduced our data collection system – a wristband mounted with IMU and SNC sensors, plus a fingertip contact sensor used for labeling. Training on the IMU data shows the feasibility of high-accuracy tap recognition in real-world systems; training on the SNC data shows fine-grained control of an interface based on a bio-potential sensor array. Following training, the reported results are 96% accuracy for tap-based gestures and 94% accuracy for fingertip pressure estimation. These results pose a competitive alternative to current-generation systems, which exhibit similar accuracy.


PART 3: TRANSCENDING MOTION BEYOND VISION-BASED FIELD-OF-VIEW BOUNDARIES


As mentioned in our previous blog post, one of the most hailed features of the Apple Vision Pro is its ability to use hand gestures to interact with visionOS. The AVP's outward-facing cameras track hand gestures; with a large enough Field of View (FOV) and line-of-sight, the user is not required to hold the arm up in mid-air when performing a gesture. The hand can rest comfortably on a desk or by the waist for most gestures. The eyes are used for Navigation – the user simply gazes at an element, and it slightly changes its contrast or texture to hint that it is selectable.


By carefully positioning multiple cameras on the headset, hand gestures can be tracked (mostly) correctly across a wide FOV, resulting in comfortable body postures. Using more cameras and more compute power widens the FOV, at the cost of device weight, battery usage, wearability, user comfort, and price.


Line-of-sight mostly dictates the ability to correctly recognize and use a familiar gesture, while field of view mostly correlates with how comfortable the user's body posture is. Everybody loves "Minority Report"-style gesture control, yet these types of mid-air gestures are not comfortable and have been proven to be unergonomic (aka "gorilla arm syndrome"). On the set of the film, Cruise needed riggers to tie his wrists up to the scaffolding (marionette-style) because his arms got so tired from shooting the opening scene that he couldn't lift them on his own. Therefore, detecting user hand motion is a challenge – the field of view must be large enough to allow a comfortable body posture.


XR gaming controllers, a hand-held input interface, use "inside-out" tracking; an IMU can perform such tracking by estimating orientation and position. However, the quality of such tracking is limited by the noise properties of the IMU itself and by the estimation algorithms used. This problem is a form of dead reckoning [7] and is subject to cumulative error over time. This contrasts with AVP motion tracking, which performs hand tracking using cameras positioned on the device; the error of a computer vision-based system is not cumulative and does not depend on usage time.
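A toy example makes the cumulative-error point tangible: double-integrating an accelerometer stream that contains even a tiny constant bias yields a position estimate that drifts roughly quadratically with time. All numbers below are arbitrary illustration values, not measurements of any real device:

```python
import numpy as np

# Toy demonstration of dead-reckoning drift from a small constant bias.
fs = 100.0                       # sample rate [Hz]
t = np.arange(0, 60, 1 / fs)     # one minute of "stationary" data
bias = 0.01                      # 0.01 m/s^2 residual accelerometer bias
accel = bias + 0.05 * np.random.randn(t.size)   # bias + white noise

velocity = np.cumsum(accel) / fs                # first integration
position = np.cumsum(velocity) / fs             # second integration
# Error grows roughly as bias * t^2 / 2, i.e. on the order of 18 m here.
print(f"position error after 60 s: {position[-1]:.1f} m")
```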


Early VR systems, such as the HTC Vive, used "outside-in" tracking, with fixed base stations positioned in the surrounding environment (in the playroom, for example). This does not constrain the user to FOV limitations, since the base stations are situated so that, at each moment in time, at least one of them has line-of-sight to the user's hands. The obvious disadvantage is that such a system is completely stationary after installation.


The key advantage of a wearable interface is that it is completely free of such constraints – stationarity, FOV, and line-of-sight limitations – yet it provides the same user experience: familiar gestures and comfortable spatial body postures beyond LoS and FoV boundaries.


To explain why such limitations are not necessary, we examine how an IMU's accelerometer measures orientation. An accelerometer is exceptionally good at measuring the constant pull of gravity we all feel: a three-dimensional vector called g, the gravity vector. When we move our body or hand we apply force, which means we accelerate, say at a rate a_move, and the total acceleration is a = g + a_move. The IMU measures the sum, not each acceleration individually. When an IMU is placed on the wrist, it is possible to derive the pointing angle from g, since the direction of "down" and the direction of "forward" differ by a constant rotation. So, by decoupling g and a_move, both the pointing direction and the hand movement are known. By applying a classifier only to a_move, hand gestures are invariant to orientation, allowing the classification f(a_move) to disregard hand orientation. In addition, different hand or finger movements have distinct acceleration patterns, which dominate particular frequency bands.
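One common way to approximate this decoupling is to treat g as the slowly varying part of the accelerometer signal, for example with a first-order low-pass filter. The sketch below illustrates the idea (alpha is a tuning assumption, not a value from our system):

```python
import numpy as np

def split_gravity(accel, alpha=0.98):
    """Estimate the slowly varying gravity vector g with a first-order
    low-pass filter and subtract it to obtain the movement component a_move.
    accel: array of shape (n_samples, 3)."""
    g = np.zeros_like(accel)
    g[0] = accel[0]
    for i in range(1, len(accel)):
        g[i] = alpha * g[i - 1] + (1 - alpha) * accel[i]
    a_move = accel - g
    return g, a_move   # g gives pointing direction; classify f(a_move) for gestures
```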

The illustration below depicts accelerometer samples of typical usage patterns, recorded at a high sampling rate. We constructed a time-frequency representation that utilizes prior knowledge of typical user movement by partitioning the spectrum via a deep scattering transform [8]. Such a transform partitions the spectrum into bands and sub-bands, forming an interpretable and easily learnable representation for classification. This implies that in practice less data is required to train a classifier, due to the properties of the "signature" of the transformation (you can learn more about these properties in blog post [4]). Intuitively, a deep transform can be thought of as a "heatmap" (with zero representing no energy), such that higher values indicate higher spectral content and thus more motion (see the scale of each image). The x-axis represents time, and the y-axis represents frequency – higher values correspond to faster changes of acceleration.



typical IMU accelerometer usage patterns and analysis




The second- and third-row images show the first- and second-order scattering transforms, with and without the low-frequency (gravity) component. The rightmost transform (scrolling) is shown with the gravity component, to illustrate the effect of a slow change in the orientation of the arm. Removing this component from the spectrum removes its effect on the signature. Having such a representation, along with a representative dataset of user motion and interaction, facilitates training neural networks that are immune by design to tapping "outside the FOV". Taking a systematic approach, instead of learning from data what we already know, avoids unnecessary parametrization and inefficient modeling.
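For readers who want to experiment, a first- and second-order scattering representation of an accelerometer window can be computed with the open-source kymatio library. The snippet below is an illustrative usage sketch (library API as we understand it; the parameters are assumptions, not the values behind the figures above):

```python
import numpy as np
from kymatio.numpy import Scattering1D

T = 2 ** 12                         # samples per analysis window (assumed)
scattering = Scattering1D(J=6, shape=T, Q=8)

x = np.random.randn(T)              # stand-in for one accelerometer axis
Sx = scattering(x)                  # rows: scattering paths, cols: time
order = scattering.meta()['order']  # 0, 1, 2 -> zeroth/first/second order

# Dropping the zeroth-order (low-frequency, gravity-dominated) rows removes
# slow orientation changes from the "signature", as described above.
Sx_no_gravity = Sx[order > 0]
```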


As previously mentioned, the typical error of a wearable sensor classifier is skewed towards false positives, with very few false negatives. Most of our day-to-day movements are completely unrelated to wearable interaction. There are very few specific patterns which represent interaction with a digital device, such as by tapping our fingers together. When training a classifier, balancing the dataset with equal representation of noise and gestures is essential. The false positives of such a classifier will be more “easily discoverable” since they dominate the way such a system is used. Classification for XR UX must balance the error itself, taking care to mitigate such false positives.


Even with such errors in place, we can effectively fuse eye tracking and gesture recognition, much as home automation sensors are fused together: with simple logic. The probability of recognizing an action with AND logic is P = P_gesture × P_eye-tracker. Since false negatives are low, both will trigger when a gesture is conducted (p ≈ 0.99² = 0.98). A false positive may occur (in either system) only when we hover over the screen with no intended interaction; it will rarely occur simultaneously in both systems, so the combined probability becomes exceptionally low (p ≈ 0.01² = 0.0001).
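Expressed as code, the AND-fusion arithmetic is simply a product of the two detectors' rates (the 0.99 and 0.01 defaults are the illustrative values used above, not measured figures):

```python
def combined_true_positive(p_tp_gesture=0.99, p_tp_eye=0.99):
    """Intended gestures still get through because both detectors have very
    few false negatives: 0.99 * 0.99 ~= 0.98 in the example above."""
    return p_tp_gesture * p_tp_eye

def combined_false_positive(p_fp_gesture=0.01, p_fp_eye=0.01):
    """A spurious activation requires a false positive on BOTH the gesture
    classifier and the eye tracker at the same moment, so the rates
    multiply: 0.01 * 0.01 = 1e-4 in the example above."""
    return p_fp_gesture * p_fp_eye
```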


CONCLUSION


We’ve analyzed the properties of wearable sensors compared to camera-based systems. This analysis included a discussion of signal decomposition and classification and the expected performance of such a system. This solution overcomes FOV challenges and limitations associated with running large computer vision models for hand tracking.

These properties in turn provide a higher level of comfort and usability, moving XR towards mass adoption.

 

THE MUDRA BAND


Mudra Band is the world's first neural input wristband. It translates movement intent into digital commands to control digital devices using subtle finger and hand gestures. It connects to the Apple Watch just like any regular watch band, and lets you control Apple ecosystem devices using simple gestures. Your iPhone, iPad, Apple TV, Mac computer, Vision Pro, and additional Bluetooth-controlled devices can be paired with the Mudra Band and operated using Touchless Gestures.


The Mudra Band is equipped with three proprietary Surface Nerve Conductance (SNC) sensors. These sensors are located on the inside face of the band and maintain constant contact with the skin surface. The sensors sit approximately above the ulnar, median, and radial nerve bundles, which control hand and finger movement.

The Mudra Band also uses an IMU to track your wrist movement and speed. If you’ve moved your wrist up, down, left or right, inwards or outwards - the IMU captures the motion.


Using sensor fusion, our algorithms integrate fingertip pressure and wrist motion to determine the type of gesture you've performed. It can be a pure navigation function that uses only wrist movement, or it can incorporate fingertip pressure for pointing. Combining the two readings, motion and pressure, creates the magical experience of Air-Touch: performing simple gestures such as tap, pinch, and glide using a neural wristband.


If you’ve liked what you’ve read, we welcome you to Start a Movement and Join the Band at www.mudra-band.com


[1] A generic noninvasive neuromotor interface for human-computer interaction - CTRL-labs at Reality Labs, David Sussillo, Patrick Kaifosh, Thomas Reardon
[2] Enabling Hand Gesture Customization on Wrist-Worn Devices – Xuhai Xu et al. 
[3] Deep learning based multimodal complex human activity recognition using wearable devices – Chen et al.
[8] Deep Scattering Spectrum – Anden, Mallat.








