Watching != interacting. You perceive framerate quite differently when passively observing a linear visual sequence than when interacting with a non-linear one. The former requires little investment, and lets your brain relax and comfortably find patterns in "24 frames per second" data to form a cohesive visual sequence. The issue interactive mediums introduce is the fact that we're no longer casual observers; we naturally expect patterns not just in the visuals, but in the visuals as feedback for our interaction. We don't "run" at a low framerate, or any framerate, so interaction paired with low-framerate feedback can lead to a weird dissonance. In almost all scenarios your brain can and will appreciate visual feedback that updates more frequently in response to your physical input, as that is what we're used to simply via our own existence.
An easy experiment is to play a game at 60fps and at 30fps while recording footage, then go back and watch that footage. In the 60fps case the footage will almost always seem oddly smooth and fast, more so than you remember it being while playing. Playing at 30fps, on the other hand, may lead to frustration as you try to line up shots and move your character, yet the footage will seem perfectly cohesive and smooth.
Watching is a passive experience: your brain only needs to find patterns in the visual data and that's it. If it works, it works, and everybody is happy. Interaction is different: your brain is not passively finding visual patterns, but attempting to correlate those patterns with your own deliberate physical input, which is not bound by "framerates".
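To put rough numbers on that, here's a quick sketch of how long each frame stays on screen at the framerates mentioned above. The function name frame_interval_ms is made up for illustration, and this only covers the gap between frames; it ignores input polling, rendering and display latency, which all add to the real delay between what you do and what you see.

```python
def frame_interval_ms(fps: float) -> float:
    """Time between consecutive frames, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Compare the film-style rate with the two common game targets.
for fps in (24, 30, 60):
    print(f"{fps} fps: a new frame every {frame_interval_ms(fps):.1f} ms")
```

So going from 30fps to 60fps halves the time you can wait before the screen reflects your input, which matters a lot more when you're the one holding the controller than when you're just watching.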
An argument that gameplay should be 60+fps and cutscenes 30fps is another matter entirely.