At Wearable Devices we are building the next big thing in human-machine interfaces. So what does this have to do with deep learning and biopotentials? Biopotentials are electric potentials (typically on a scale of micro-volts) that are measured between points on living cells. We measure such biopotentials directly from the wrist. This phenomenon holds the key to unlocking a truly great Human Machine Interface (HMI). You can think of a biopotential HMI as a “spy” which listens to the brain — nervous system — wrist conversation and translates it into a language we can understand.
So why is the above interesting for an HMI? Think of the computer mouse for example. The mouse is a mature technology which “simply works”. What makes it so successful is the ability to transform very minute movements into digital actions. In addition, the error is very small. We get fine grained control (subtle movements) with high accuracy (low error). Another example is the smartphone touchscreen interface. I would argue that the touchscreen is what makes a smartphone “smart”. This interface requires very little effort and its accuracy is high (though not as high as a keyboard + mouse which we use for work).
In the past few years we have been developing Mudra, a wrist-worn gesture controller. When the user moves their fingers, sensors mounted on the device measure biopotential signals originating in the brain and travelling through the nervous system to the wrist. Deep Learning algorithms map these signals into a low-dimensional representation, and Machine Learning algorithms classify that representation as user-intended gestures. The Mudra device detects both discrete and continuous finger movements.
Each discrete gesture defines a unique control operation on the device: a soft tap of the index finger against the thumb can serve as 'Select', while a soft tap of the middle finger against the thumb can serve as 'Back'. Continuous gestures, in turn, can be used to zoom in/out, increase or decrease volume, scroll up/down and more.
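To give a feel for how such a mapping might be wired into an application, here is a minimal sketch; the gesture names and actions are hypothetical illustrations, not the actual Mudra API.

```python
# Hypothetical mapping from recognized gestures to UI actions; the gesture
# names and actions here are illustrative, not the actual Mudra API.
DISCRETE_ACTIONS = {
    "index_thumb_tap": "select",
    "middle_thumb_tap": "back",
}

def on_continuous_gesture(name: str, delta: float, state: dict) -> None:
    """Apply a continuous gesture as an incremental change to some UI state."""
    if name == "volume":
        state["volume"] = min(1.0, max(0.0, state["volume"] + delta))
    elif name == "scroll":
        state["scroll_offset"] += delta

ui_state = {"volume": 0.5, "scroll_offset": 0}
on_continuous_gesture("volume", +0.1, ui_state)
print(DISCRETE_ACTIONS["index_thumb_tap"], round(ui_state["volume"], 2))  # select 0.6
```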
Biopotentials can do what no other method can: no camera in the world can recognize how hard we press our fingers together, and no other sensor can capture such fine-grained movements. The Mudra device detects the constant or gradual fingertip pressure the user applies between two fingers, or on physical objects. You can use this functionality for drag and drop, move and rotate, the way we handle physical objects in real life. The only problem is accuracy: sensing such delicate activity is prone to various noise sources. I will explain this in depth below…
While deep learning has made an impact on many problems in machine learning, much of its tooling and intuition is rooted in computer vision. In addition, deep learning requires a lot of data, and in our case such data is expensive to collect. So how can we overcome these problems? Below I will focus on recent developments that may interest deep learning practitioners as well as HMI enthusiasts.
To classify finger-movement gestures we need to begin by examining the data itself. A gesture observation looks like this:
Gesture frame as seen by the Mudra sensors
The top plot represents the raw digital sensor data of my thumb movement within a time frame. The first three sensors are termed Surface Nerve Conductance (SNC) sensors and are designed to sense the biopotentials emanating from the wrist. The fourth sensor's data is the norm of the IMU acceleration. I've come to think of the data above as energy readings of the body: the stronger the gesture, the higher the amplitude and frequency at which the data fluctuates (this is a slight over-simplification).
To make more sense of the SNC data, the sensors' readings are transformed via a custom time-frequency representation. We could feed a neural net the raw data and expect it to learn a mapping function, but this is not always the best idea; in particular, learning basis functions is difficult. By using such a transform we introduce domain knowledge, and domain knowledge is a valuable tool in all machine learning applications. Why? Notice that the first electrode (marked sensor 0 in the legend) slightly disconnects near the end of the frame. This manifests itself in all bands of the red (sensor 0) transform, as can be seen in the wavelet image (middle subplot); such noise is termed a motion artifact in the literature, and making it explicit makes the mapping easier for a neural net to learn. The transformation also makes the data sparser, another recurring idea in machine learning.
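The exact transform we use is custom, but to give a flavor of the idea, here is a minimal sketch that maps a single raw SNC channel to a wavelet scalogram using an off-the-shelf continuous wavelet transform (PyWavelets). The sampling rate, wavelet choice and number of scales are illustrative assumptions, not our actual parameters.

```python
# A minimal sketch of a time-frequency transform for one SNC channel.
# The actual Mudra transform is custom; this uses an off-the-shelf
# continuous wavelet transform purely for illustration.
import numpy as np
import pywt

FS = 500  # assumed sampling rate in Hz (illustrative only)

def snc_to_scalogram(snc_channel: np.ndarray, num_scales: int = 32) -> np.ndarray:
    """Map a 1-D raw SNC signal to a (num_scales, time) magnitude image."""
    scales = np.arange(1, num_scales + 1)
    coeffs, _freqs = pywt.cwt(snc_channel, scales, "morl", sampling_period=1.0 / FS)
    return np.abs(coeffs)  # magnitude scalogram: sparser, artifact-revealing

# Example: a synthetic one-second frame
frame = np.random.randn(FS)
scalogram = snc_to_scalogram(frame)
print(scalogram.shape)  # (32, 500)
```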
Once we’ve conditioned our data properly, let’s feed it into a Convolutional Neural Network (CNN). Here we run into another problem, which I term contradicting reactions: one person’s energy signature for a thumb movement, for example, might be very similar to another person’s index movement! Imagine if dogs in ImageNet were labeled as cats by one person and as lions by another…
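For illustration, a small Keras CNN over such time-frequency "images" might look like the sketch below. The input shape, layer sizes and class set are assumptions, not our production architecture.

```python
# A hypothetical CNN over stacked SNC scalograms; shapes and layers are
# illustrative assumptions, not the actual Mudra architecture.
import tensorflow as tf

NUM_SCALES, FRAME_LEN, NUM_CHANNELS = 32, 500, 3  # assumed: 3 SNC channels

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_SCALES, FRAME_LEN, NUM_CHANNELS)),
    tf.keras.layers.Conv2D(16, (3, 5), activation="relu", padding="same"),
    tf.keras.layers.MaxPool2D((2, 4)),
    tf.keras.layers.Conv2D(32, (3, 5), activation="relu", padding="same"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation="softmax"),  # e.g. thumb / index / tap / noise
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```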
For this reason, we have laboriously worked to label the unique sets of data we collect from each user. To make gesture recognition work, we use a mixture of techniques, including transformations, Deep Metric Learning and Few-Shot Learning, and we use TensorFlow for some of this work. We also follow the intuition of prominent deep learning practitioners. I particularly like the wide-versus-deep argument by Lei Jimmy Ba and Rich Caruana (see here), which gives great intuition regarding neural net architectures. In a nutshell: when your data is noisy and contains very little structure (SNC data, for example), go for a wide and shallow net; if you have a lot of structure with a small amount of noise (typical images, for example), go for a deep neural net. I simply do not believe in plugging data into a neural net and hoping for the best.
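To make the wide-versus-deep intuition concrete, here is a toy contrast in Keras; the layer widths and depths are arbitrary and only meant to illustrate the trade-off Ba and Caruana discuss.

```python
# Toy contrast of the wide-shallow vs. deep-narrow intuition (sizes are arbitrary).
import tensorflow as tf

# Wide and shallow: suited to noisy, weakly structured inputs such as SNC features.
wide_shallow = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(256,)),
    tf.keras.layers.Dense(2048, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Deeper and narrower: suited to highly structured inputs such as natural images.
deep_narrow = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(256,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])
```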
In order to map the gestures collected in-house to a low-dimensional space, we use Deep Metric Learning. In particular, our loss function is itself a complete algorithm that simulates how “good” a mapping is. In essence, what we are creating with deep metric learning is a metric space X with a distance function d. Distances are key to an interpretable model: a model is interpretable when we can look at its output and have at least a partial understanding of why it came to the conclusion it did.
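Our actual loss is a full algorithm and is not reproduced here; as a stand-in, the sketch below shows a common deep-metric-learning objective (a triplet loss), which illustrates how a distance function d over the embedding space is shaped during training.

```python
# A generic triplet loss as a stand-in for our own metric-learning objective.
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull same-gesture embeddings together and push different gestures apart.

    anchor, positive, negative: (batch, embedding_dim) tensors produced by the
    embedding network; `positive` shares the anchor's gesture label, `negative`
    does not.
    """
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))
```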
3D embedding visualization
The image above visualizes a 3D embedding of 15 proficient users performing 1,500 different gesture movements: thumb, index and tap gestures. Each point represents a gesture observation; gray points denote noise observations. Note how the embedding “makes sense”: each user is mapped to a specific color shade, users are mapped to different overlapping regions, and gestures sometimes overlap one another. This embedding space can also be represented in 2D for a simpler (though not necessarily as accurate) representation:
2D Embedding Space
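A 2D view like the one above can be produced by projecting the learned embeddings with any dimensionality-reduction method. The sketch below uses PCA and placeholder arrays standing in for our internal data.

```python
# Project a learned embedding space to 2D for plotting (PCA as an example;
# the arrays here are placeholders for our internal gesture embeddings).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1500, 16))   # placeholder: 1,500 gesture embeddings
user_ids = rng.integers(0, 15, size=1500)  # placeholder: 15 users

xy = PCA(n_components=2).fit_transform(embeddings)
plt.scatter(xy[:, 0], xy[:, 1], c=user_ids, cmap="tab20", s=5)
plt.title("2D projection of gesture embeddings (illustrative)")
plt.show()
```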
A new user will be mapped to the feature space above (see the dots in the image). This doesn’t mean the user will be mapped exactly where the proficient users are embedded, but they should land close by. As long as we keep making sense of the data we can drive the accuracy higher and higher, on a course to create the next big breakthrough in human-machine interfaces.
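To give a concrete sense of how such an embedding can be used for a new user, here is a minimal nearest-neighbour sketch over placeholder data; our actual pipeline combines the embedding with few-shot learning and is not shown here.

```python
# Classify a new user's gesture by its nearest neighbours in the embedding space.
# (Placeholder data; not the production classifier.)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
reference_embeddings = rng.normal(size=(1500, 16))  # proficient users' gestures
reference_labels = rng.integers(0, 3, size=1500)    # e.g. thumb / index / tap

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(reference_embeddings, reference_labels)

new_user_embedding = rng.normal(size=(1, 16))       # output of the embedding net
print(knn.predict(new_user_embedding))              # predicted gesture label
```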
A few words about myself: I am the co-founder and CTO of Wearable Devices LTD. I have held lead algorithm-engineer positions in the Israeli high-tech industry and had the opportunity to work with extremely talented individuals in academia. My background involves machine learning and signal/image processing. I hold an MSc in Applied Mathematics and a BSc in Electrical Engineering.
Wearable Devices LTD is a startup company which develops hardware and software solutions to interact with computers. Our vision is to transform interaction and control of computers to be as natural and intuitive as real-life experiences. We imagine a future in which the human hand becomes a universal input device for interacting with digital devices, using simple gestures.
The original blog post was published on Medium.