I'd rather see a game where the person samples the chat at one-second intervals or whatever and executes the most popular choice at that point.
If I were designing it, after having watched this, I'd work on a frame-length timestep (presumably about 17ms, one frame at roughly 60fps). I'd go on a three-frame cycle.
Frame 1: Accumulate inputs
Frame 2: Tally inputs. Each up cancels a down. Each left cancels a right. The direction with the most net keypresses wins out, or A/B if either has more presses than any direction.
Frame 3: For the duration of this frame, send those inputs to the emulator.
Frame 4: Begin accumulating inputs again.
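
The Frame 2 step could look something like the Python sketch below. The function name `resolve`, the lowercase button names, and the tie-breaking order (directions before A/B, so A or B only wins with strictly more presses) are my assumptions for illustration, not anything the actual stream does.

```python
from collections import Counter

def resolve(inputs):
    """Pick one button from a frame's worth of inputs (hypothetical rule)."""
    counts = Counter(inputs)  # missing buttons count as 0
    # Opposite directions cancel, so only the net surplus counts.
    vertical = counts["up"] - counts["down"]
    horizontal = counts["left"] - counts["right"]
    # Directions are listed first so they win ties against A/B.
    candidates = {
        "up": vertical,
        "down": -vertical,
        "left": horizontal,
        "right": -horizontal,
        "a": counts["a"],
        "b": counts["b"],
    }
    winner, score = max(candidates.items(), key=lambda kv: kv[1])
    # If everything cancelled out (or nothing came in), press nothing.
    return winner if score > 0 else None
```

For example, three ups, one down and one left resolve to up (net 2 against 1), while an up and a down with nothing else resolve to no input at all.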
I'd have three such processes operating independently of one another, each offset by one frame, so every frame is covered. (When one process is accumulating inputs, a second is calculating the outcome of the previous set and the third is outputting the results from two sets ago.)
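
As a rough sketch of that staggering in Python: three offset processes behave like a three-stage pipeline, where every frame one batch is being gathered, one decided and one sent. The names `run_pipeline` and `majority` are made up for this example, and `majority` is just a stand-in decision rule so the block runs on its own.

```python
from collections import Counter

def majority(inputs):
    """Stand-in decision rule: most common input, or None if the batch is empty."""
    return Counter(inputs).most_common(1)[0][0] if inputs else None

def run_pipeline(frames, decide):
    """frames: one list of inputs per frame. Returns what is sent on each frame."""
    sent = []
    accumulated = None  # batch gathered on the previous frame
    decided = None      # outcome computed on the previous frame
    # Two extra empty frames let the last real batches drain through.
    for batch in frames + [[], []]:
        sent.append(decided)  # stage 3: output the result from two batches ago
        # Stage 2: decide on the batch gathered last frame.
        decided = decide(accumulated) if accumulated is not None else None
        accumulated = batch   # stage 1: gather this frame's inputs
    return sent
```

Inputs gathered on frame N reach the emulator on frame N+2, so the cost of covering every frame is a fixed two-frame delay.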
I think that might be fairly effective. I'm slightly banking on the idea that each frame can determine a result in the milliseconds available, which would ordinarily be easy but could in theory be overwhelmed by Twitch spam.
(Speaking of the Twitch spam, I wonder if the converter is smart enough to quick-reject certain entries rather than parse them *entirely*, even when they're just general chat.)
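
I have no idea how the real converter works, but a quick-reject could be as cheap as a length check before any other processing. This Python sketch assumes commands arrive as bare words; `VALID`, `MAX_LEN` and `parse_command` are hypothetical names, and the command set is only the buttons mentioned above.

```python
# Assumed command set: only the buttons discussed in this post.
VALID = frozenset({"up", "down", "left", "right", "a", "b"})
MAX_LEN = max(len(command) for command in VALID)

def parse_command(message):
    """Return the button named by a chat message, or None for general chat."""
    # Quick reject: anything longer than the longest command cannot be one,
    # so it is dropped before any stripping or lowercasing.
    if len(message) > MAX_LEN:
        return None
    word = message.strip().lower()
    return word if word in VALID else None
```

Most general chat is longer than five characters, so under this assumption it would be thrown out after a single length comparison.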