I'm a real sucker for natural user interface. I eat it up at tradeshows, I drain my wallet on the various accessories, and I carpal my tunnels with its contortions. But does it make me fitter / happier / more productive / comfortable / not drinking too much? Not really, not yet, or at least not all by itself. I would argue that "natural user interface" is the wrong goal to have in mind while building user experiences, because it says nothing about the efficiency of the operation or the ultimate results. What we want is a Dyson sphere.
What's that? Let's let Captain Picard explain, from Star Trek: The Next Generation's sixth season episode "Relics":
PICARD: It's a very old theory, Number One. I'm not surprised that you haven't heard of it. In the twentieth century a physicist called Freeman Dyson postulated the theory that an enormous hollow sphere could be constructed around a star. This would have the advantage of harnessing all the radiant energy of that star. A population living on the interior surface would have virtually inexhaustible sources of power.
I'm guessing he read this on Wikipedia, as did I, though I first heard of the Dyson sphere theory from the Peter F. Hamilton novel Pandora's Star. Who's the biggest nerd now, Picard? A more popular derivative of the theory is Larry Niven's Ringworld, which has a Dyson ring encompassing a star, instead of a sphere. Basically, the premise is that a sufficiently advanced civilization will harness a lot more of the energy of its star than the rays that just happen to hit their home planet.
No matter which manner of nerdery led you to the theory, however, my point is about user interface and user experience. I would argue that instead of attempting to make our computer interaction look more like "real life" or be more "physical," the UX designer should attempt to make humans more efficient by capturing and utilizing a larger portion of the user's existing output (through new and existing sensors in our devices), instead of just capturing a different, contrived sort of output. Just as a Dyson sphere captures 100 percent of the star's output, computers should capture 100 percent of a user's output... and make sense of it.
There are plenty of examples where this is already beginning to happen, or failing to happen; let's explore a few.
Location and status-based
This is one of my favorite recent developments in the mobile world. Applications like Locale, Tasker, reQall, GeoReminders, and Apple's upcoming Reminders all offer ways to capture already existing user output (Where are you? What time is it?), and do stuff based on that data. Locale and Tasker are the most elaborate of the group, capable of setting up profiles that change settings on the phone or execute workflows -- and are naturally Android only. Examples include turning off your ringer while you're at work or church, sending an automatic "away" text message response when it's late at night and you're away from your phone, or reducing your power consumption aggressively when you're low on battery.
The likes of reQall, GeoReminders, and Reminders are a bit less ambitious, but potentially more approachable. All of them let you set geotagged to-dos that remind you to pick up the dry cleaning when you're in the neighborhood, or remove the chicken from the freezer when you get home, or call your mom back when you leave work. This is a great augmentation of an existing process (adding to-dos) with existing user input (the user's location) and existing sensors (a phone's built-in GPS).
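The core mechanic here is simple enough to sketch: compare the phone's current fix against each reminder's tagged location and fire anything within range. Here's a minimal illustration in Python; the data shapes, the `due_reminders` helper, and the 200-meter radius are all my own invention, not how any of these apps actually work internally.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6_371_000  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def due_reminders(reminders, lat, lon, radius_m=200):
    """Return the reminders whose tagged spot is within radius_m of the user."""
    return [rem for rem in reminders
            if haversine_m(lat, lon, rem["lat"], rem["lon"]) <= radius_m]

reminders = [
    {"text": "pick up the dry cleaning", "lat": 40.7411, "lon": -73.9897},
    {"text": "remove the chicken from the freezer", "lat": 40.6782, "lon": -73.9442},
]
# Standing a few meters from the dry cleaner:
hits = due_reminders(reminders, 40.7412, -73.9895)
```

The phone's GPS does the "existing sensor" part; everything else is one distance check per reminder.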
Gaming
While gaming represents some of the flashiest new forms of user interaction, I'd argue it also harbors some of the worst executions of the natural user interface concept. The likes of Nintendo, Microsoft and Sony seem bent on replacing existing control schemes instead of augmenting them. This has made for a great splash in sales, and an expansion in the customer base, which is great. But it has seldom resulted in me having better control over an on-screen character or mechanism, merely a different sort of control.
Here's an example: for a decade or so, first-person shooters have mapped the aim + look control to the right analog stick. The targeting reticule stays in the center of the screen, and the entire view follows every twitch of your right thumb. Now, if you don't think too much about it, you'd think a motion-sensing controller that actually tracks your physical aim would be a boon to the FPS, but instead it's a serious handicap. Why? Because the motion controller is only mapped to aim, not look. (If the differentiation between aim and look is confusing to you, think of it this way: aim = your eyeballs, look = your neck). To look around in Wii and PlayStation Move shooters you typically have to edge your aim reticule up against the edge of the screen (each game has its own level of sensitivity) to turn the entire view.
See what happened there? For the sake of "immersion" we actually reduced our ability to control the character, and I'm not so sure how "immersive" the concept of moving a reticule up to the edge of a screen to move the overall view is -- I don't recall moving my eyeballs to the edge of their sockets to turn my neck. Obviously, the idea of operating a motion controller for aim in addition to an analog nub for look sounds like a real challenge to accessibility, but at least it's not leaving anything on the cutting room floor. The recently released Razer Hydra could supply us with an answer, since it combines precise, two-handed motion control with dual analog sticks.
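To make the complaint concrete, here's a toy model of that edge-turning scheme. The constants (screen width, band size, turn speed) are illustrative, not pulled from any actual game: the camera only yaws once the reticule pushes into a band near the screen edge, which is exactly why "look" lags behind "aim."

```python
SCREEN_W = 1280   # horizontal resolution in pixels (illustrative)
EDGE_BAND = 160   # width of the turn-triggering band at each edge
MAX_TURN = 120.0  # maximum yaw speed, degrees per second

def yaw_rate(reticle_x):
    """Camera yaw speed given the reticule's horizontal position.
    Zero across the middle of the screen; ramps up linearly as the
    reticule pushes into the edge band."""
    if reticle_x < EDGE_BAND:                         # pushing the left edge
        return -MAX_TURN * (EDGE_BAND - reticle_x) / EDGE_BAND
    if reticle_x > SCREEN_W - EDGE_BAND:              # pushing the right edge
        return MAX_TURN * (reticle_x - (SCREEN_W - EDGE_BAND)) / EDGE_BAND
    return 0.0                                        # center: aim moves, view doesn't

center = yaw_rate(SCREEN_W / 2)  # dead center: aiming without looking at all
edge = yaw_rate(1279)            # jammed against the right edge: finally turning
```

Note the dead zone covering most of the screen: for the entire middle stretch your physical motion changes aim but the view stays planted, and turning your "neck" means parking the reticule at the border and waiting.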
While the Wii and Move FPS problem bugs me, I've reserved a much larger pool of ire for a sort of interaction that Kinect seems to favor… I call it "the charade." You know what I'm talking about: you jump and your on-screen character jumps, you duck and your on-screen character ducks, you mime an elaborate fireball releasing gesture, and your on-screen character releases a fireball. What's wrong with that? Well, I've got three suggestions: A, B, and X. See, every one of those motions could be performed with a single button press, and they would all happen much more quickly and accurately to boot. Imagine if my buddy asked me what movie I just watched, and I proceeded to gesticulate wildly, offering hints at genre, number of words in the title, and an exaggerated enactment of the memorable scene where the one dude kills the other dude. Or I could've just told him the name of the movie, you know, with my mouth.
Don't get me wrong, I don't hate all full-body gestures. It's fun to get the heart pumping and to look stupid in front of friends and family. But I'm certainly not fooled. What I'm asking for is that the detail of the input match the detail of the output. My motion shouldn't cause the character to merely jump, it should determine how high. When I merely had a four-directional d-pad on my NES controller I didn't expect to have diagonal control of my 8-bit adventurer, but once I got the analog nub I started to expect my characters to do crazy things like go diagonal and modify their speed based on how far I tilted the nub. Imagine playing Mario 64 with an NES controller, or, worse, mapping the N64 analog stick input to a simple four directions! Now that we've graduated from analog to full-body input, I expect the resulting output on screen to expand as well.
This isn't just about the appearance of analog input, I want analog results as well. It's not just about my avatar's sword moving 1:1 with my motion control wand of choice, it's about the game calculating the physical results of a strike based on the abundance of data it's receiving -- position, angle, speed, momentum, and that fancy follow through. I've bothered to get up off the couch and immerse myself, immerse me back!
Touch
The emergence and now ubiquity of pinch-to-zoom is one of those things that seems a little obvious in hindsight, but the fact that it didn't show up in a mainstream device before 2007 proves how non-obvious, or at least non-trivial, it actually was. Apple remains at the vanguard of touch UI, but I'm starting to wonder if touchscreens peaked early with the iPhone -- every major device these days has a multitouch capacitive screen, but they don't use it for much more than swiping and zooming and un-zooming. Apple's looking to change that somewhat with its new three-finger gestures for switching between apps or desktops. Unfortunately, while I enjoy using these gestures in Lion and on the iPad, they're too obscure for a regular user to intuit.
I know it doesn't really fit with my whole Dyson sphere thesis, but I'll throw it out there anyway: any "natural" input action that requires memorization of a certain gesture or finger configuration isn't natural at all. The problem is that my hands and fingers have too wide a field of possibilities. Yes, it's easy to put three fingers on a trackpad and swipe, but how would I know that ahead of time? And where does it stop? Should I assume that since I can pinch-to-zoom and three-finger-swipe that I can turn that knob with two fingers or punch the screen with my knuckles to delete something?
For me the best new example of actually intuitive touch interaction isn't multitouch at all, and it wasn't developed by Apple -- though iOS scrolling might've been part of the inspiration. It's pull to refresh, an action that originated in Tweetie (now the official Twitter client). It builds on something I'm already doing anyway, scrolling to the top to see if there are any new messages, and it's at that point that I'm prompted to pull just a bit further to load them. Its eventual adoption as a default behavior in all list-oriented touchscreen applications on desktop and mobile seems an inevitability to me.
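The gesture reduces to a surprisingly small state machine: once you're already at the top of the list, pulling down past a threshold and letting go triggers the refresh. A minimal model, with an invented threshold value (real implementations tune this and add the rubber-band animation on top):

```python
PULL_THRESHOLD = 60  # points of overscroll needed to arm a refresh (illustrative)

class PullToRefresh:
    """Toy model of the Tweetie-style pull-to-refresh gesture."""

    def __init__(self):
        self.overscroll = 0
        self.refreshing = False

    def drag(self, dy):
        """Accumulate downward drag while the list is already at the top."""
        self.overscroll = max(0, self.overscroll + dy)

    def release(self):
        """On finger-up: refresh only if the pull went past the threshold,
        then let the list snap back either way."""
        if self.overscroll >= PULL_THRESHOLD:
            self.refreshing = True
        self.overscroll = 0
        return self.refreshing

ptr = PullToRefresh()
ptr.drag(40)           # a timid pull...
timid = ptr.release()  # ...snaps back without refreshing
ptr.drag(80)           # a proper pull past the threshold
fired = ptr.release()  # triggers the refresh
```

The reason it feels intuitive is right there in the model: the "arming" motion is the exact same scroll you were already making, just extended.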
Video chat / video recording
You might be scratching your head, wondering why I'd add something like video chat to a list of interaction types, but I think video chat is really the furthest along of any of these interaction types in reaching Dyson sphere status -- and proving the worth of such a goal. See, while interacting with a computer is typically a highly lossy process -- my volume of physical and mental output is vastly greater than the computer's received set of instructions -- video chat merely asks the computer to pass along everything. The input is my face, my voice, my hand movements, my eye movement, and even my terrible posture, and the output is a good 99 percent of that.
For the history of the internet, communication has been predominantly text, and I have the chat logs to prove it. But voice has been creeping in, and now video chat is becoming a default on most devices and services. People have been using the "technology" of writing to communicate over distances or with permanence for thousands of years, and the internet has had literacy as a barrier to entry for its entire history. But that could be changing. When I was in high school, kids had a LiveJournal, now they have a video blog on YouTube. When I was in middle school I used AIM and ICQ to communicate, now kids can video chat. I really hope textual literacy doesn't die completely, because I'd be out of a job, but all the tools are in place now for a generation to grow up illiterate and yet completely informed and completely in touch with one another.
For an educated person this might sound like an apocalypse, but it's not all doom and gloom. I consume most of my fiction through audio books -- it takes a little longer, but I find it much more immersive and memorable. Video is much more information dense than text or audio (at least in some ways), and allows people who aren't gifted with written words (but are still nice people) to communicate complicated ideas more effectively. You ever notice how everybody sounds a lot dumber online than the people you meet in real life? There's a real disparity there, and while I'm happy to reap the rewards now of being a highly literate person (relatively speaking), a more level playing field could be a great thing for society.
This is where things really get interesting. I've gone over some of the existing standard methods of input, but I think we're on the verge of adding a few more. Ones that, if I'm right, we won't know how we lived without in a few years' time.
Voice recognition
If there were a chart somewhere that tracked the quality and ubiquity of voice recognition technology (with mainly Google to thank) against the prevalence of actual voice recognition usage by regular consumers, the former line would be winning handily.
There are two primary problems with voice recognition as it exists today. The first is that you look kind of stupid talking to your phone, computer or TV. Our society has begrudgingly accepted the apparent insanity of a Bluetooth headset talker, but the idea of negotiating with a computer verbally is still a bridge too far in most scenarios. The one possible exception is the car, where your hands are occupied and you have an entire commute to convince SYNC you want to listen to Ryan Adams, not Bryan Adams.
Which brings me to the second problem: no matter how good voice recognition gets (and it's really, really good right now), it has to be near perfect to actually be acceptable as a form of input. Additionally, computers will need to learn how to ask for clarification, use logic, and consider context to really be useful in this sense. Not out of the realm of possibility, but we have some work to do.
Still, I think there's an even larger use for ambient voice recognition. It goes back to the concept of the Dyson sphere: I'm conversing all day long, spouting ideas and taking in facts, so why isn't that being recorded, transcribed, logged, and sorted? The sci-fi-prevalent idea of an AI personal assistant that lives in your head might still be a bit far out, but everybody I know already carries a device capable of recording and storing audio all day long (at least, as long as it's outside a pocket or purse) -- why not put this skill to use? There are already some services that let you call yourself and they'll transcribe it, but that requires me to do something to get the process started.
Initially the transcriptions would require a lot of digging through to get any use out of them, but eventually I imagine a service being capable of sorting out who I talked to, or labeling conversations by subject. As the algorithms get better and better (something akin to IBM's Watson), some of this stuff could happen in real time. Perhaps my phone hears me attempting to make a bet, and gives me a "call" to let me know I have my facts wrong. In fact, I'd love to have my entire day fact checked by a Watson-style machine. How many bizarre urban legends, misremembered stats and political lies do I hear every day? I'd love to browse through a log of flagged statements made by myself and others at the end of the day. Another thing I'd love would be a playlist composed of all the music I heard that day -- I rarely manage to open Spotify in time to tag a song accurately. All the technology to do this is here; it's just not evenly distributed. In fact, a bunch of nerds at Microsoft Research have been working on this idea for more than a decade.
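Even the dumbest version of that end-of-day log is easy to picture in code. This sketch is entirely hypothetical: it takes a day's worth of timestamped utterances (the output of some imagined always-on transcription service) and flags the ones that look like checkable factual claims. Real fact checking needs a Watson-class system; here "flagging" is just a naive keyword match, which is the point -- the plumbing is trivial, the understanding is the hard part.

```python
# Crude markers that suggest an utterance contains a checkable claim.
# A real system would use actual language understanding, not substrings.
FACT_MARKERS = ("percent", "in 19", "in 20", "fastest", "never")

def flag_claims(transcript):
    """Return the utterances worth fact-checking, keeping time and speaker."""
    return [u for u in transcript
            if any(m in u["text"].lower() for m in FACT_MARKERS)]

transcript = [
    {"time": "12:03", "speaker": "me",  "text": "Lunch, then the demo at two."},
    {"time": "12:41", "speaker": "Joe", "text": "The Great Wall is visible from space."},  # slips past these crude markers
    {"time": "17:22", "speaker": "me",  "text": "Unemployment fell 3 percent in 2010."},
]
flagged = flag_claims(transcript)
```

Note what the keyword approach misses (Joe's urban legend sails right through) -- which is exactly why this feature is waiting on Watson-grade language understanding rather than on microphones or storage.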
Image recognition
One of the initial sparks of this Dyson sphere idea for me was the recent Mango incarnation of Bing Vision on Windows Phone. The way it's implemented is really slick: you don't take a picture of something and then search for it, like with Google Goggles, you just hold it up in front of the camera and get search results dynamically. It only works with bar codes and the covers of media like movies and books right now, but it's just such an elegant presentation.
The Dyson sphere version just builds on the existing power of Bing Vision and Google Goggles. Anything you take a picture of during the day should not only be geotagged and time stamped, but any data inside the picture should be scanned and recognized, including text and brands and faces (obviously faces are already being scanned by Facebook and iPhoto, a great start). The next step would be to have a camera actually recording everything you see during a day, with that copious amount of data made useful through these contextual clues -- and of course cross-referenced with the audio log. What was the name of that person you were talking with at the party? Skim through your video of the party and find their face, then either look them up on Facebook based on facial recognition algorithms (super creepy, but kind of cool), or just wade through the log of your conversation until you find the part where introductions were made.
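That party lookup is really just a join across two timestamped logs. Here's a hypothetical sketch -- the face-event and transcript structures are invented for illustration, and the face IDs stand in for whatever a recognition system would emit -- that finds when a face first appeared on camera and pulls the conversation from around that moment, hopefully catching the introductions.

```python
def first_seen(face_events, face_id):
    """Timestamp (seconds into the day) when a given face first shows up."""
    times = [e["t"] for e in face_events if e["face"] == face_id]
    return min(times) if times else None

def words_around(transcript, t, window=15):
    """Transcript utterances within `window` seconds of time t."""
    return [u["text"] for u in transcript if abs(u["t"] - t) <= window]

# Imagined output of a day's video face detection and audio transcription:
face_events = [{"t": 3605, "face": "unknown_17"}, {"t": 3922, "face": "unknown_17"}]
transcript = [
    {"t": 3598, "text": "Oh hey, this is Dana, we work together."},
    {"t": 3610, "text": "Nice to meet you!"},
    {"t": 4100, "text": "Anyone seen my coat?"},
]

t0 = first_seen(face_events, "unknown_17")
intro = words_around(transcript, t0)  # the lines where introductions were made
```

The creepy Facebook-lookup version swaps the transcript join for a facial-recognition query, but the cross-referencing skeleton is the same.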
Robots
Look, I love robotics, alright? I'm not going to get too in-depth here, since this is obviously a technology in its infancy, but I think it's pretty obvious what robots bring to the world of natural user interface and Dyson spheres. Basically, the idea of a robot is to address a human on the human's terms, instead of a human addressing a computer on the computer's terms. This is the holy grail of interaction, but it's also a huge undertaking that we've barely begun. The problem is that a robot can't just solve for one natural interaction paradigm, it has to solve for all of them -- gestures, location, object recognition, voice control, etc. -- while pairing that input with precise output. You can't have a robot that gets you your drink some of the time, but other times brings you a pineapple.
The good news is that as we solve all these other interaction cases, robots will be able to build directly off of those achievements. Everybody wins.
As computers have gotten better, more useful and more mobile, I've spent more and more time with a screen in my face. Paul Miller at age 26 is perpetually hunched over a digital device, contorting his fingers into different keyboard configurations, clicking or scrolling or swiping or pinching or flicking as each specific UI requires. I've fully conformed myself to the digital devices that enable me to create and communicate. I don't begrudge them this, and I don't advocate an alternative where I do less merely so that I can appear a little more like a mid-twentieth-century man.
Still, what I think a Dyson sphere approach to interaction allows is for me to step away from the screen at times without losing any of the utility. Let's say I'm at a party. This party has been in my calendar for a month, so it's obviously important to me. I'm in a conversation, gesticulating wildly. When I receive a call from a good friend who lives out of state, my phone gently pushes him to voicemail so as to not interrupt me -- if he calls back the phone will ring out loud. Someone agrees with the amazing point I'm making and recommends an essay in The New Yorker from two months ago that totally beat me to it. Instead of pulling out my phone and making a note of this essay, my phone finds and records the link for me while I continue. Perhaps I peel off and talk more with this interesting person, and at the end of the night we're great friends -- but it's rude to ask their name again. Should I pull out my phone and do the awkward "let's become Facebook friends right now" dance? No worries, my phone remembers who this is, and since they're a friend of a friend there's an "Add as Friend" button next to their face in my conversation log waiting for me when I get home.
In fact, when you think through that example, another spherical "Dyson" comes to mind... you know, a Dyson Ball. A vacuum cleaner sucks in everything, but filters out the dirt (preferably through a HEPA filter, but let's not get too far down this tangent). Capturing all this data really isn't going to be that much of a stretch -- we already have all the tools at our disposal -- but making sense of it is going to be a gargantuan undertaking.
Ultimately, it's not that I want to use less of my computers, I just want my computers to use more of me. Luckily, most of the things I've outlined here are obvious and inevitable evolutions of computer interaction. What scares me is that in the meantime we'll get so hung up on "natural user interface" that we'll just use fancy new gestures and voice commands to do the exact same tasks we've been doing with perfect speed and accuracy for years. I don't think either Dyson would be happy about that.