Materialisation, Emotion, & Attention: Tracking Sound’s Perceptual Effects in Film

Creator's Statement

Musicians, film-goers, and cinema theorists alike understand the emotional power of sound in film. Whilst the effects of the soundtrack may often be elicited subliminally, the strength of audio to add value to image and narrative is well accepted, if not fully understood. Part I: “Sound, Ears & Emotion” begins by outlining the functional mechanics of score and sound design. It draws on inherent elements of evolutionary biology and principles of systematic musicology to better understand how sound can make the audience feel. Importantly, these academic principles support an active practitioner’s perspective, and the essay explores through its form (music choices, vocal delivery, sound design, and visual choices) the very theories it is presenting.

Having established the means by which sound charges and changes the emotional experience of image and narrative, the essay then takes advantage of the opportunities afforded by eye tracking technology to consider more deeply the effects of sound beyond affect. Part II: “Sound, Eyes & Attention” draws on results from an ETMI Research Group study. It examines data collected from six participants watching Monsters Inc. under two conditions – sound on and sound off. By so doing, it considers how audio shifts the traditional balance between volitional (top down) and reactive (bottom up) attention. Through the exploration of the data collected, it demonstrates how contemporary sound film under silent conditions draws the eye to low-level, salient stimuli when a scene is assessed. These findings speak to sound’s capacity to immerse the audience in the world of the film, rather than leaving them distracted in its absence by movement, light, and colour. Through audiovision, decoding the image is no longer a series of knee-jerk reactions. Viewers “view” differently. Where the established understanding treats sound design as an instrument for generating the perception of the image’s authenticity, and music as a means of synchronizing group emotions, this research demonstrates sound’s capacity to actively shift our focus and change the methods by which we engage with what’s on screen.

For the first section, the video essay format provides an ideal opportunity to present the cinematic audiovisual relationships under discussion. Beyond mere outlines, these principles can be actively demonstrated. Examples are played with and without sound, radical recontextualisations and juxtapositions are constructed, and scenes from very different films are sutured together with audio as ideas are being explored. In the second section, examples of the footage from the eye tracking tests exploit a different advantage of the format, and data is presented as a temporal experience. In so doing, it allows for more dynamic and, at times, playful graphic analysis as a means to present what could otherwise be a series of dry stills. What interests me most is the extent to which this format allows for a shift in tone from a standard journal article. Sadly, in print, there is limited capacity to augment academic writing with violence and stupidity. The possibility for academic rigor to coexist with devices from popular entertainment, to inform and engage beyond the limits of an academic cohort, is possibly the most exciting potential of the form.

Response to Reviewers' Comments:

This essay was my first attempt to explore the affordances and limitations of a new medium for disseminating ideas. I think both authors have identified the compromises at the core of trying to balance tone - a sense of playful provocation delivered at an inherently fast pace - with the traditional clarifications, cautions and caveats expected of academic writing. Even though I would argue that some of the listed criticisms have misunderstood what I was saying, the fact that such misunderstanding occurred speaks to the options for careful explanation inherent in the printed journal format when compared to a video essay explicitly aiming to bring these ideas to a broader audience at speed. This is in part to do with space and traditions of the written word – but interestingly, I think it also identifies the ease in print of rereading until understood, when compared to the one-pass approach we expect from temporal audiovisual media.

With more time – and if provocation and pace were to be sacrificed - there are a number of issues I could certainly have made clearer:

For example, the essay isn’t suggesting that profound monosensory experiences aren’t possible (indeed, that they are is at the core of my more recent 4D research into the flavours of aesthetic experience). Rather, this essay’s focus was what happens to sound film when its audio is removed. Similarly, it’s not suggesting that misdirection isn’t direction – in fact, I agree that it’s quite the opposite. “Reality”? That cornstarch sounds more like snow than recordings of snow is an example of how visceral exaggeration can make the audience intuitively feel that something is “right” even when the sound in a film is as patently false as the colour. For misattribution, Brown and I are arguing about agreeing – as the point that I was making was that the salient stimulus, to which the feelings generated by sound are ‘misattributed’, is the vision. It’s simply an extension, via cognitive musicology, of Chion’s uncontroversial ‘added value’ principle.

For Smith, whilst I certainly agree that there is usually an emotional audiovisual alignment in film, hence “misattribution” as a term is contestable, I would argue that, in other instances, our emotional state, perceived as a “legitimate response to the movie”, is still being “misattributed” to the narrative experience when it is often being generated by the music. The misattribution then refers to where we might mistakenly perceive the source of the emotion. This is completely independent of empathetic or unempathetic alignments. Whether it’s the Strauss waltz or the Jaws theme as we look up at the swimmer’s legs (see Higson’s Jaws Opening Scene on YouTube), our emotion is still misattributed onto the situation we’re watching – e.g. this is a beautiful image/scene vs this is a creepy/anxiety-inducing scene – all from the black ops of the sound.

These clarifications simply highlight the importance of time and space to ensure a point is understood – and also the care with which arguments should be made and qualified in academic discourse. In a way, my first attempt at a video essay was an opportunity to play with the violence and stupidity to explore how it might sit with academic rigour. Ultimately, this was done in order to find a broader audience to contemplate these ideas. This piece should be viewed then as a playful starting lob for more rigorous sound discussions to come. That it might be better suited to YouTube than an academic journal, however, is certainly something I am open to.

P.S. Hilariously, I actually thought that my selection of Citizen Kane for a “good” film and Batman vs Superman for a “bad” would be utterly uncontroversial!! I have much to learn about cinema theorists….

This video-essay provides an interesting and playful analysis of the role that sound plays in guiding our emotional responses to films, as well as in directing our attention across the frame during film viewing. Split into two halves, with the first looking at sound and emotion while the second considers sound and attention, the film uses heat map imagery drawn from eye tracking experiments in order to demonstrate the latter.

While thought-provoking and engaging, the video-essay offers some points of contention. First of all, for the video-essay to claim that ‘at best, worlds on screen without audio aren’t engaging’ potentially dismisses as ‘boring’ a history of silent cinema, including films that are not intended to have scores played live alongside them. Surely silent cinema can be engaging – and certainly images without sound can attract our attention (and thus engage us on one level), since we commonly find our eyes drawn involuntarily towards the screens that others carry about with them in our everyday lives, even though we cannot hear sound emanating from them. As such, its claim that images are not scary, exciting, or believable without sound is also highly contentious; indeed, it would suggest that neither a painting nor a photograph can induce such responses, since such visual representations typically do not involve sound.

Later, the video-essay discusses film music, suggesting via David Huron that ‘[w]henever we experience a strong emotion, the brain has a tendency to associate the emotional state with whatever salient stimuli exists in the environment’. And yet, shortly after, the video-essay claims that music works ‘subliminally’ in that we attribute our emotional response not to the music but to what is happening in the narrative at the time that the music is deployed. Music is also later described as a ‘black ops’ device. These two claims contradict each other: if music is a salient stimulus, then its salience takes it outside the realm of the subliminal; if it works subliminally, then it cannot be salient.

When the video states that ‘if the film is good, this [addition of a musical score to elicit an emotional response] may be a built-in layer of redundancy,’ the video-essay does not explain what a ‘good’ film is or might be. Indeed, it pictures Citizen Kane, a film that at the time of its release struggled to find audiences (even if this was related to William Randolph Hearst boycotting the film), with one of my film students recently describing Kane as ‘meh’ (i.e., not very good). The video-essay then gives as an example of a ‘bad’ film Batman v Superman: Dawn of Justice. The latter film has been generally panned critically, and yet Kane also received poor reviews at the time of its release – and only later became recognised as a masterpiece. While I find it unlikely that Batman v Superman will experience the same reversal of critical fortune, the video-essay is nonetheless revealing here its own prejudices about what constitutes a good and/or a bad film (and in a way that is both ‘canonical’ – Kane’s status as classic is beyond question – and ‘trendy’ – Batman v Superman is an obvious butt of a joke). And yet, according to Box Office Mojo, Citizen Kane made $1,588,634 at the box office, which, when adjusted for inflation using the website areppim, amounts to $20,132,091 in 2015. Batman v Superman, meanwhile, has made $873,260,194 at the box office worldwide – and this during an era when people are supposedly not going to the cinema anymore to watch films, with greater sales taking place in ancillary markets. This means that Batman v Superman has a 43-times greater box office return than Kane – meaning that for a ‘bad’ film, it still made a ton of money, which surely complicates our understanding of good and bad more generally. This isn’t supposed to be a defence of Batman v Superman. But simply put: if it was that bad, how and why did it make so much money? Is it because, contrary to the video’s claim, sound really can save a duff film? Or is it because, contrary to the video’s claim, the film is not as bad as all that – at least to a huge number of filmgoers? Either way, this demonstrates something wrong with the video’s claims.

A final point in relation to the first half: we are told of the ear removal scene from Reservoir Dogs that director Quentin Tarantino ‘misdirects’ our emotional response by including Mr Blonde singing along to ‘Stuck in the Middle with You’ by Stealers Wheel. And yet, since Tarantino clearly wants to create a strong counterpoint between the horror of the torture being carried out and Mr Blonde’s delight in carrying it out, surely Tarantino simply directs our emotional response. I am not condoning torture, but do we know a priori that torture is always de facto bad? Or that it must be something about which we cannot laugh and be horrified at the same time? Is our very laughter, or at least discomfort, not a sign of our capacity for cruelty, which Tarantino exploits rather than blandly condemns? This is not to address the possibility that sometimes a score can be ‘too manipulative’, as viewers occasionally allege happens in relation to the music of John Williams in the films of Steven Spielberg, for example.

In other words, the first half of the essay makes various elisions and conflations that might under other circumstances be more carefully thought through.

The second half of the video-essay generally makes significantly more solid claims, although some issues remain, mostly centered on the concept of ‘volitional attention’. We are told that volitional attention is deployed when sound is included for viewers watching Monsters Inc. while wearing an eye tracking device. In one shot with sound, viewers look at both the dark and the light part of the screen, while without sound they look only/predominantly at the light. In a second shot with sound, viewers look mainly at the mouths and eyes of characters having a conversation; without sound, they look also at a red light and a character’s moving feet. If viewers are deploying volitional attention by looking at the shadow in the first shot, then why viewers look at the shadow needs to be made clearer; what is the narrative context for them to do so? For, if viewers look only at mouths and eyes in the second shot – i.e., they have a more homogeneous response – then one might contend that where we look is not so much ‘volitional’ in such an instance, but far more under the control of the film. Conversely, if without sound our eyes go searching the second shot in far more detail, paying attention to ‘low-level stimuli’ such as the moving feet and the red light, then surely this is a sign of actively searching the visual scene, i.e., searching it voluntarily? Perhaps we might say that without sound the light and the feet equally attract our attention in an involuntary fashion. Nonetheless, this still does not answer the issue of how or why looking only at mouths and eyes with sound is any less involuntary. Indeed, it is presumably where the filmmakers want us to look. Some conceptual clarity might be useful here.

Additionally, it is not clear how the experiment was conducted. The voice over tells us that participants in the experiment watched the clips with the sound on and then the sound off. This would imply that all viewers watched the same clips twice, and that all viewers always watched the clips in that order (sound on then sound off). When the voice over then tells us that the way in which the participants’ eyes searched the visual space of the scene is typical for the first time that one sees a room, then it may simply be because of the space’s novelty that the participants’ eyes wandered in the way that they did, and not so much to do with the sound. By the time they are looking at the clip a second time, the room is no longer novel to them, and so it might simply be because the space is familiar – and not because of the change in sound – that they look at the room in a different way. Again, greater clarity about the methodology of the experiment would be useful here in order to get to the veracity of the claims being made.

At the end of the same discussion, the voice over suggests that without the sound we are ‘distracted’ by the moving feet and that with sound ‘we are no longer distracted by bright things, moving things, colourful things’. If bright, moving, and colourful things are ‘distractions’ that stop us from ‘properly’ understanding a film, then why are they in films? In this use of the words ‘distract’ and ‘distraction’, the voice over wishes to suggest a singular and normative way of watching films, together with a singular and normative way of making films. Why is it not the sound that ‘distracts’ us from the feet rather than the feet ‘distracting’ us from the mouths from which the sound (supposedly) emanates? Why the emphasis simply on narrative rather than contextual or other information? How is the hierarchy established between low and high level salience? Again, the film raises in its language these issues without ever clearly addressing them.

Finally, on a technical note, the video-essay’s voice over is not always clearly audible on the monitor and sound system on which I watched the film, especially during the otherwise witty sequence of having the voice play out the possibility that it comes from a boat on the sea during a storm. What is important, though, is that I stopped and rewatched the video only because I had to for the purposes of carrying out this peer review; otherwise I would not have done so, and I would happily have watched the video without really hearing, and not really listening to, the verbal information that was being given to me. That is, I was being happily enough entertained without feeling the need to hear the voice over in full detail.

If the voice is not particularly important when watching this very video about film sound, as my pleasure is derived as much from the image and other elements on the soundtrack as it is from the voice, then does this not raise important questions regarding precisely the role of sound – especially the voice – in film? How salient is it? How subliminal is it? How much do I care about it when I am swept up by impressively edited sequences using footage from professional films (sometimes with heat maps added to them)?

But more than this: to what extent is the video essay weakened by such techniques? (I attend not so much to what the video essay is actually trying to tell me, instead focussing on the glossy images and rapid editing that do not tell me so much as entertain me.) Can a filmmaker deliberately use these glossy techniques in order for their audience not to pay attention to what they are actually hearing? Do filmmakers do this? Is this in fact a common trait of the video essay – that often it uses its status as a video to gloss over issues that upon closer scrutiny (when you stop and listen to what is being said) in fact do not make that much sense? Should the makers of video essays not lay bare these contradictions, especially in a video essay that is about how sound can direct our attention and ‘distract’ us from information that it wants to convince us is unimportant, but which may in fact undermine its entire argument? To what extent is cinema a machine as a whole for ‘distracting’ us and inducing in us a capacity not to think critically? And to what extent might the increasingly audiovisual nature of our society (including in the academy via projects like [in]Transition?) play a role in that process of distracting people away from critical thinking? These and other issues are all implicitly raised by this film.

Materialisation, Emotion & Attention: Tracking Sound’s Perceptual Effects in Film is an entertaining and thought-provoking video-essay. It raises some important questions about the role of sound in our emotional and attentional response to film, while also raising important questions about the aims of the video essay itself (and cinema more generally). To explore all of its elisions and contradictions – while also explaining in full detail the nature and methodology of the experiments upon which it is based – would require significantly more space and perhaps also a different medium (namely, traditional text). Nonetheless, since the video offers us a forum to discuss such issues, I recommend publication: it is better to raise and to discuss such questions than a priori to silence them.

Review by Murray Smith (Princeton University and the University of Kent)

This is an engaging, tightly argued, and well-constructed piece that packs a lot into its 15-minute duration. The two-part structure, focussing respectively on sound and emotion and sound and attention, works well to lay out a number of existing arguments concerning film sound before moving on to the original research conducted by the team behind the video. It draws on a range of vivid examples to support its claims, varying straight ‘quotation’ with subversive and often very funny re-edited versions of the chosen sequences. And the thesis of the original research in Part II – that stripping out the soundtrack undermines the carefully wrought balance between bottom-up and top-down perception in film (sound) design – is plausible, well supported, and clearly illustrated. In addition to the value of the piece in terms of introducing ‘new knowledge’ to the field, it will also be an excellent resource in teaching, combining its presentation of original research with a lively overview of existing ideas on film sound.

Nobody and nothing is perfect, though, so let’s get down to some criticism. I begin with a few ‘generic’ worries, that is, points that really concern the short video essay as a form in itself. Then I turn my attention to some issues raised by this particular piece.


The video is divided pretty much equally between its two parts – ‘Sound, Ears & Emotion’ and ‘Sound, Eyes & Attention’ – such that each occupies about 7-8 minutes. Relative to the norms of conventional written and published research, these are very odd proportions because the first part is essentially an introduction to the research background against which the original study, reported in the second part, takes place. In a conventional academic essay, the introduction might take up 10-15% of the available space, not the 50% it consumes here. Rather than seeing this as a shortcoming of the piece, however, I think it would be fairer to say that it is symptomatic of the fact that video essays like this are subject to another set of pressures and expectations in addition to scholarly and academic ones. As is made explicit in the written exegesis, where Verhagen writes amusingly of the possibility of augmenting academic writing ‘with violence and stupidity’, he actively wants the piece to reach beyond an academic audience and to break with some of the hidebound norms of academic discourse.


Verhagen’s piece is edited at fairly breakneck speed, partly for reasons of economy and partly for aesthetic (comic) effect (again, ‘entertainment’ value has a higher priority here than it would do in a conventional scholarly context). In an ordinary viewing, one can grasp the gist of the argument but not a lot of the nuance. Many of the examples and experimental details fly by so quickly it is hard to register and digest them. Is this a problem? Probably not. It would be if this were a work conceived to be appreciated under historically ‘standard’ movie viewing conditions, that is, watch it once, without pause, and then move on to the next work. But if the target viewer is someone who will watch it, rewatch it, pause it, and dissect it in the same way that Verhagen has analysed the films he references, then the density and fast tempo of the piece are perhaps a virtue. (But see below for one sequence where the fast pace may mask problems in the argument.)

Some theoretical questions

On to some more specific queries and concerns, beginning with part one. The general picture the video paints regarding the roles of sound in film (material authentication + emotional evocation) is good so far as it goes… which is a long way, but not all the way. Specifically, it seems to leave no space at all for films that eschew sound altogether. I don’t mean so-called ‘silent’ (pre-synchronous sound) cinema, as we know that wasn’t silent (although we surely do need an account of how music and sound effects work in such films, one that differs in certain respects from the account of synchronous sound film given by Verhagen). I mean that small subset of experimental films that quite literally do without sound altogether. Stan Brakhage’s films are an obvious example, but there are narrative films without soundtracks as well.

Misattribution: I wonder if this is the right model for understanding how it is that music commonly intensifies the affective force of the narrative as it is rendered visually (or strictly speaking, ‘amusically’ – visually, with synchronous sound, but without music). In the psychological literature on misattribution drawn on by David Huron and in turn by Verhagen, an affect with a particular causal etiology (fear caused by walking on a rickety bridge over a gorge) is misconstrued (‘misattributed’) by subjects as having another cause (erotic arousal). But in a film, surely what we have are two aspects of one aesthetic object – the visual ‘track’ and the musical score of a film – designed to work together (where ‘together’ encompasses both direct and contrapuntal uses of sound relative to the image). When John Williams draws on the conventions of tonal music to elicit excitement, suspense, pathos, or grandeur, there is nothing mistaken in our experiencing the music in that way, and in experiencing the film as an audiovisual whole in a way shaped by the music. Nothing is going on ‘behind the backs’ of viewers comparable with what happens in the misattribution experiments, since every competent film viewer and music listener knows that both music and stories elicit emotion, and can be conjoined for ‘added value’. (Huron’s arguments about the role of ‘misattribution’ in the context of music perception, however valid in that context, do not carry over to our perception of film music – but it would be quite a digression to explain why.)

Reservoir Dogs: At this point in the essay, I think the fast tempo of the piece helps it glide over some significant complications in both the argument and the sequence from Tarantino’s film used to illustrate it. The idea here is that the effect of the music (‘Stuck in the Middle with You’) is so potent that it compels us to adopt the sadist’s viewpoint, and that there is ‘no sonic support to back up how we expect we should be feeling’ (that is, for the victim). But that’s just wrong. Throughout the scene we hear the muffled but terrified cries of the victim, and these are amplified by close-ups of him (e.g., at 4.33), underlining the context of which we are already fully aware, i.e., that the hapless policeman is in the hands of a psychopathic and sadistic criminal. If the music did draw us in so completely, the sequence could not work as an instance of sound counterpoint; for counterpoint to work, the visual and sonic elements must be able to set up their expressive effects ‘autonomously’ such that they can then clash. Nor is the end of the extract, when the cop is obscured by the psychopathic Mr. Blonde leaning over him, the camera craning up to and lingering on the grimly ironic ‘watch your head’ graffiti, really an example of an ‘uncertain sign’ in Barthes’ sense. We have a very good idea of what’s going on off-screen. Rather, the camera movement here further elaborates the structure of counterpoint, so that now we see an ‘empty’ section of the dramatic space while we hear the sounds of torture off-screen along with the incongruously upbeat pop song: at least three ‘voices’ working contrapuntally.

Mike’s foot shuffle: In the second part of the essay (at 12.15), Verhagen notes that in the ‘sound off’ condition, viewers’ attention was drawn to Mike’s shuffling feet – while in the ‘sound on’ condition, viewers’ attention is typically directed at Mike’s face. Verhagen argues that ‘this movement, far from being [narratively] important, is simply an animation device to confer a sense of authenticity on the character, in the way that eye blinks, facial expressions, and fur weft, do.’ I think Verhagen is right about this detail in this example, but it would be a mistake to generalize from this example to the conclusion that behavioural details such as blinks, expressions, and micro-movements can do nothing more than establish perceptual realism (or, in Verhagen’s terms, ‘authenticate’ the film). In his ‘Who Blinked First?’ essay, David Bordwell has shown how something as narrowly physiological (in the real world) as an eye blink can become narratively charged (in a film), suggesting much about the psychology of a character and patterns of dominance and submission between characters.

The rapidly scrolling bibliography – set to the theme music from The Benny Hill Show – is a nice comic touch, appropriately playing up once again the emotional power of music. But of course you can’t read bibliographic references zipping by at this speed! So perhaps the bibliography should be added in a conventional fashion to the written contextualizing statement.