To try and explain this better, look at this image:
That area I circled in the blue/grey is where you wind up a lot on the third floor once you're in the maze, because it is the end point of every failed attempt.
When people watching see him there, a ton of right inputs come in to start moving him out of here, and those inputs are stretched over probably 20+ seconds due to discrepancies in delay between individuals. This doesn't even slow down until most people see him start going that way. It's hundreds and hundreds of inputs. There are a lot of up commands here too, from people that don't want to go back down onto the spin tile.
By the time he hits the right wall and starts going up, people are only just starting to do left/down inputs to try and get him into the hallway right above there. But the right/up commands are going to come in for another 15-20 seconds, which is often more than enough to put us at the staircase. So the left/down inputs don't start affecting his movement until we are on the floor above... and there, left/down inputs easily send us into the wrong maze. The left inputs are really prevalent too because people desperately want to send him into the hallway, so over half of them are left. They just come in way later than you'd actually need to have hit left to even have a chance. The only times we've gotten into that hallway were with predictive random lefts while people were still mostly spamming right from below, and the rights usually continue long enough to counter those.
What dropping start into all of these movement commands does is waste a lot of those unnecessary commands that flood in, so long as the start commands are put in during the big movement pushes. Realistically you need him to be like one foot in that doorway with a menu open for 10-ish seconds before the people doing inputs based on what they see have a solid chance of getting him in the rest of the way.
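
If it helps, here's a rough toy model of the start trick in Python. It's only a sketch under my own simplifying assumptions: commands are handled one at a time in the order they arrive, "start" just toggles the menu open or closed, and any direction sent while the menu is open moves the menu cursor instead of him. The function name and the example input lists are made up for illustration, not taken from the actual stream.

```python
def count_moves(inputs):
    """Count how many directional inputs actually move the character.

    Assumption (simplified): "start" toggles the menu, and while the menu
    is open every directional input is absorbed by the menu cursor.
    """
    menu_open = False
    moves = 0
    for command in inputs:
        if command == "start":
            menu_open = not menu_open
        elif not menu_open:
            moves += 1
    return moves


# A flood of late "right" inputs with nothing to soak them up.
flood = ["right"] * 10

# The same ten rights, but with a start dropped in early and another later.
with_start = ["right", "right", "start"] + ["right"] * 6 + ["start", "right", "right"]

print(count_moves(flood))       # 10
print(count_moves(with_start))  # 4
```

Same ten rights in both cases, but with the menu open in the middle only four of them move him, which is the whole point of timing the starts during the big pushes.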