I'll say this as a linguist: I was always intrigued by the voice recognition software in Kinect. How it would pick apart different accents, different languages, different dialects of those languages, etc. I would imagine that Kinect would pick up what I say to it, but if someone like my mom talked to it, it wouldn't be able to pick it up at all despite her speaking English (she speaks a completely different dialect of English that's hard for people to decipher). I have to wonder how much further along they are with that tech.
The short answer is that accounting for those sorts of differences is really damn hard and represents most of the research effort in the field for the last forty years.
The long answer is that modern voice recognition works off of a giant probability graph that matches input to the most likely sequence of phonemes (and the most likely sequence of words, from there). If you're trying to match against the entire vocabulary of a language, your error rates are still fairly substantial. Cutting the vocabulary way down, like you would for a device like Kinect, leaves far fewer possibilities to choose between, which lets you achieve much better accuracy even across multiple speakers.
A dialect so thick that even other people have trouble understanding it isn't likely to play well with a device expecting fairly standard English input, however.
Read up on hidden Markov models and how they're applied to speech recognition if you want to get more into the science behind the tech.
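If you want something concrete to poke at, here's a toy sketch in Python of the idea: a tiny hand-built HMM where the hidden states are phonemes, the observations are pre-classified acoustic frame labels, and a Viterbi search picks the most likely phoneme path. Every state, probability, and frame label here is invented purely for illustration; a real recognizer learns these numbers from hours of recorded speech and works on raw audio features, not neat labels like these.

```python
import math

# Hypothetical phoneme states for just two commands, "play" and "stop".
# Restricting the start dict to a handful of words is the "cut the
# vocabulary way down" trick that command-style devices rely on.
states = ["P", "L", "EY", "S", "T", "AA"]
start = {"P": 0.5, "S": 0.5}

# Invented transition probabilities between phonemes.
trans = {
    "P":  {"L": 0.9, "P": 0.1},
    "L":  {"EY": 0.9, "L": 0.1},
    "EY": {"EY": 1.0},
    "S":  {"T": 0.9, "S": 0.1},
    "T":  {"AA": 0.9, "T": 0.1},
    "AA": {"AA": 1.0},
}

# Invented emission probabilities: how likely each phoneme is to produce
# each (already-classified) acoustic frame label.
emit = {
    "P":  {"pop": 0.7,  "hiss": 0.1,  "vowel": 0.2},
    "L":  {"pop": 0.1,  "hiss": 0.2,  "vowel": 0.7},
    "EY": {"pop": 0.05, "hiss": 0.05, "vowel": 0.9},
    "S":  {"pop": 0.1,  "hiss": 0.8,  "vowel": 0.1},
    "T":  {"pop": 0.7,  "hiss": 0.2,  "vowel": 0.1},
    "AA": {"pop": 0.05, "hiss": 0.05, "vowel": 0.9},
}


def viterbi(frames):
    """Return the most probable phoneme sequence for a list of frame labels."""
    # best[s] = (log-probability of the best path ending in state s, that path)
    best = {
        s: (math.log(p) + math.log(emit[s].get(frames[0], 1e-9)), [s])
        for s, p in start.items()
    }
    for frame in frames[1:]:
        new_best = {}
        for prev, (score, path) in best.items():
            for nxt, p_trans in trans[prev].items():
                cand = score + math.log(p_trans) + math.log(emit[nxt].get(frame, 1e-9))
                if nxt not in new_best or cand > new_best[nxt][0]:
                    new_best[nxt] = (cand, path + [nxt])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]


print(viterbi(["pop", "vowel", "vowel"]))  # -> ['P', 'L', 'EY'], i.e. "play"
print(viterbi(["hiss", "pop", "vowel"]))   # -> ['S', 'T', 'AA'], i.e. "stop"
```

A speaker with an unfamiliar accent or dialect is, in this picture, someone whose acoustics don't line up with the emission probabilities the model was trained on, so the "most likely" path stops matching what they actually said.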