Audiovision, psy-ops and the perfect crime: Zombie Agents and sound design

Darrin Verhagen

This paper examines audiovisual relationships - how they may be processed by an audience, and how such cognitive mechanics (and their histories) can play directly into the practiced, manipulative hands of composers and sound designers. As a member of that cohort myself, I have designed this paper to research the underpinnings of what was previously an intuitive set of artistic responses I would draw upon when scoring for vision. Through an exploration of subliminal processing, and with some of my own material as research examples, I intend to explore the elasticity of audiovisual alignment, and, ultimately, the range of potential responses that can occur when preset frames of reference are violated.

In order to understand how the subconscious can be manipulated through sound though, it's important to first grasp the collection capabilities of our sensory interface, and what happens to that data upon collection. At any given moment, each sense is gathering and delivering different amounts of information to the brain. Bundled, it amounts to about 11 million bits each second - sight 10,000,000 bits, skin/feel 1,000,000 bits, hearing 100,000 bits, smell 100,000 bits, taste 1,000 bits (Norretranders 1999: 121). This figure stands in stark contrast to the amount of information one can actually devote conscious attention to – which is said to be somewhere between 16-40 bits. This is a focussed beam, trained, deliberately, on a small selection of the available data. To deal with the massive contrast (between the breadth of collection and narrow bandwidth of this conscious registration), we've developed skills where we can divert incoming information to subconscious routines. This is "gist perception" - processing executed by what Christof Koch terms "Zombie Agents" – routines, based on past experience, which can run, unnoticed, in the background whilst we're paying attention to other things (Koch & Crick 2001).

One of the systems which makes this possible, and such an effective operative model, is the development of mental schemas which are referenced at the point of data reception. These are useful mechanisms when it comes to allocating or bypassing conscious attention, as they abstract from the specific a general principle which will assist greater cognitive efficiency in decoding similar future streams. In other words, based on experience, they flag a clearly defined territory – so we can (with confidence) avoid wasting our 16-40 bits a second on what ends up being a "known known". Rather, we rely on these pre-determined subroutines to make sense of such data in the background. Genres and relationships - musical and audiovisual behaviours - are all thumbnailed to assist us in more efficiently understanding incoming flows of information. This store of knowledge feeds successful "feature integration" (Levitin 2006: 101) – where the elements are framed by such schemas and the "gist" is correctly and effortlessly ascertained. These are the expectations that composers and sound designers exploit as part of their stock in trade. Through subconscious manipulation, sonic practitioners can minimise any audience member's cognitive load so that their 16-40 bits can be better allocated. If executed well, sound (and its craft) will avoid the trip-wires of curiosity, and pass by conscious attention completely "unnoticed". The noises will do their work – but the audience member won't have registered this work being done (busy as they may well be concentrating on the machinations of plot, poignant dialogue, or the excitement of large things exploding).

But not paying active attention to data isn't simply cognitive resource management from the audience. The process also opens up serious potential (for exaggeration and manipulation) for the sound designer. Ultimately, the cloak of invisibility on offer to designers (and, to a lesser extent, composers) gives us license to indulge in some magnificent black-ops when the opportunity allows. And the more manipulation we can effect off-radar, the more successfully we can influence the audience's perception and judgement (whether it be at the service of the narrative, broader artistic experience or whichever particular commercial agenda lies immediately at hand). So, with sound, whilst one might not be focussing on what's being delivered, this can only be achieved through successful top-down processing/management of the auditory stimulus being fed up the chain. If the internal logic of the data stream is intuitively understood and contained within the appropriate  schema there's no need for the audience to give it any conscious thought. And with that assurance (and with enough 16-40 "distractions") in place, there's plenty of mischief to be had in the background.

Audiovisual relationships will be examined shortly. For now though, for our first practical demonstration, let's limit the exploration of this process to audio. At any given point, the elements of sound (timbre, pitch, volume, rhythm, tempo, spatial location, not to mention any associated context-coding) are being processed in different parts of the brain. Listeners make unconscious judgement calls as to how best these should be integrated – according to the templates they have developed to amalgamate and read sound. There are three basic possibilities – (i) sounds which it pays to focus on, (ii) sounds which can be left to a subconscious subroutine/"Zombie Agent", and (iii) sounds which can be completely ignored.

Steve Reich's tape loop piece  "It's gonna rain"  affords a good example of this decision-making process: as one listens, attention shifts from one frame of reference to another, in an attempt to extract as much meaning and significance from the work as it unfolds in all its minimalist glory.

Given that the text which forms the piece is mostly just that single phrase, the value of linguistic content is exhausted almost immediately, simply  through repetition. The audience member doesn't need to keep listening to what is being said. As a result, they change focus – perhaps initially to rhythmic registration, maybe then melodic, then timbral. These shifts aren't acknowledged problems. In fact, in this case, they're actually the source of joy in the experience of the composition. Through repetition, there's a reveal of hidden qualities – recontextualised (as they come up for conscious scrutiny) as recognisable musical attributes - ones that would normally be handled by Zombie Agents as they riffled through the paralinguistic codes of speech.

"It's gonna rain" is an example of what one could call an  "artistic experience" brought about through sequential stimulation of the conscious registration of previously hidden components - "gifts from the Zombies." But, in this instance, they're not being offered up for inspection due to a registered difficulty. Rather, it's the other way around – conscious attention is faced with the repetition of the same incoming data, and once one chosen template reveals nothing further, pulls up others in its ongoing attempt to extract meaning and significance from the incoming stream – "clearly important" simply because of its continued repetition.

But immediate (as opposed to gradual or sequential) registration of a traditionally hidden component presents a different processing experience for the listener. For example, why is the sound of a fart or a bubble with minimal pitch contour less funny than one with an exaggerated tonal envelope? It's not humour through the violation of expectation. It's humour through this conscious registration of too much pitch information. It's a reveal – but the question remains: why is that inherently funny? This isn't relief laughter. It's not humour through incongruity, resolution, indignity (Pinker 1997: 550), surprise disambiguation, or contextual violation. It's a "peak shift" (Ramachandran 2005: 43) of a previously integrated component exaggerated into the territory of caricature. The commonly invisible (pitch), now "seen", is inherently ridiculous simply because it gets on radar.

The class laughter which follows at 2:53 in this next example is a further demonstration. Case studies presented by Steven Mithen suggest that we are primed to register sound according to a tripartite cognitive stratification of Language, Environment or Music (Mithen 2005: 39). In Diana Deutsch's "Sometimes behave so strangely", a humour reflex kicks in as the text, in the last few moments, shifts "invisibly" from speech to music – a transformation only possible once the listener has been primed with the requisite training. And, in this case, the reflex (yet forced) substitution of one schema for another is enough of a contextual breach for its infraction to trigger amusement.

Examples thus far have been examining Zombie Agents processing audio. But audiovisual associations produce similarly vital schemas when managing a dense dataflow. The following four video clips are taken from my own practice. I divide my time between writing for the (remnants of the) post-Industrial underground and more conservative corporate commissions. But regardless of the demography or desired artistic outcomes, the tools on offer (and options contained therein) are all the same. This first demonstration is a small excerpt from a larger noise piece (which therefore needs to be played at high volume) and provides examples of "synchresis," the tight structural alignment of a successfully fabricated audiovisual relationship (Chion 1994: 58).

Clip 1

Richard Grant, the video maker, working with my pre-composed music, delivers a clear (albeit artificial/projected) structural and contextual rationale for what is occurring in sound. Through structure he binds the sound and vision in synchretic fury; through context, he delivers visual logic as to why the auditory data is so confronting. Before vision, as a standalone piece of music, we were overloaded with a mass of sensory information telling us how we should feel. Run alongside vision, this is now legitimised by additional data telling us what we might think.

This united audiovisual logic has been the sensory cornerstone of our evolutionary experience. Prior to the twentieth century and the development of electronics and recording, there was always a binding between a sonic event and the corresponding action necessary to cause the sound. To make a noise one needed some equivalent movement. Sound, as this by-product, therefore authenticated activity. The residue of this history now sees us primed to make synchronously occurring sound and vision "work" with as little cognitive scrutiny as possible. If simultaneous data is received through two senses, the probability that they confirm one event tends to be too strong to challenge. It's not until there are violations (in the conceptual or physical distance between the sound and vision) which are too significant to ignore that we need to access other tools (Murch 2002: 119).

The next excerpt from the "Distorted" festival provides an example of how an audience might handle such transgressions (in this case, conceptual). This instance sees Grant deliberately setting up points of potential cognitive dissonance, where what we're being told to think and feel through vision and sound aren't always in clear alignment. When presented with a sliding scale (from ambient contrast/gentle counterpoint to absolute disjunct) one of four things seems to happen – with the responses depending on the specifics of the set-ups. I'd suggest that this manipulation will either be the perfect crime, quite funny, a failure, or an interesting failure ("Art", perhaps).

Clip 2

The perfect crime

In this reading, the score successfully overpowers the vision, the devices (of audiovisual incongruity) stay off radar, and the music utterly recontextualises the innocent scenes (e.g. the dancing, joyous moments, party sequences) into step with their sinister sonic associates. Such a reading is unlikely for the Danny Kaye sections in this sequence – as the non sequitur is probably too significant. But in general, if the vision is quick enough with an edit, or careful enough managing the extent of the "risk", and the corresponding sound progresses with enough confidence, this perfect crime is certainly possible. The party excerpts in this clip, for example, are points where I think such recontextualisation successfully takes place. Those scenes are quick, narratively ambiguous, and tonally innocuous. They're no match for the emotional power of sound, and will be coloured accordingly. Under circumstances in day-to-day experience where there may be confusion, imperfect knowledge or multiple, conflicting goals (Mithen 2005: 87), emotions traditionally replace reason to guide immediate responses and actions. In film, fast edits of material from incongruous sources will invariably confound a rational reading, so an emotive interpretation (with guidance boldly on offer from the sound) can become an invaluable and instinctive tool. The party sequence reframing process constitutes such a successful sleight of hand, and, in this instance, the jury of conscious attention is nowhere to be seen.

Concentrating on Danny Kaye, however, is different. It brings us to a second option where the mismatch has some potential to be funny.

Funny

As an alternative to the invisible recontextualisation of vision by sound, in this option, there's actually a conscious registration from the audience of the inappropriate relationship – "Hey, that was Danny Kaye!" - but the misalignment is potentially accepted and folded into the experience as humour. The insertion of "Danny Kaye" at the end of the list "Pain, violence, death and horror" essentially constitutes a punchline. In this case, it's a reading in step with the idea that laughter is a residue of a neurological coping mechanism for the contravention of expectation (Ramachandran 2005: 22).

Obviously the width of disparity between the audio and visual stimuli will determine the likely on- versus off-radar registration. But, commensurate with the vicissitudes of personal subjectivity, it can also determine whether it's conventional humour, black humour, or simple tastelessness.

The perfect crime and funny outline two successful possibilities when reconciling such (tonal/conceptual) audiovisual mismatching. But there's also a third alternative.

Failure

With failure, there's no sleight of hand recontextualisation, no "folding in". The moment just doesn't work. The device (incongruous visual quotation against "evil" music) is registered, and registered as a type of malfunction. This process can be immediate. Alternatively, and worse still, identifying what's wrong can take some time and thought, with the suspension of disbelief/utter acceptance provisionally derailed for the duration of such contemplation. Whether instant or eventual, either way, it ends up on radar and it's a mistake. Clearly, there's a problem, and – albeit momentarily - the full force of conscious attention has been brought to bear upon it. Whilst this can be a crisis in many circumstances, there are possible exceptions . . .

Interesting Failure (at times, "Art")

As with ordinary failure, there is still a sense that something doesn't work - the device is again registered - but this time it potentially gives rise to desired (as opposed to unwanted) thought. For example, "Hmmm . . . what was the idea behind the artist doing that? It's not automatically working the way I think it should because – obviously – they're saying something. Clearly they're not incompetent. So what informs this disjunct? What might be being communicated through this frightfully interesting lack of success?" etc. In this instance, the mismatch, and the thoughts its registration generates, are all part of the artist's agenda. While the perfect crime and humour open up dimensionality through their success, interesting failure opens up dimensionality through the very "problems" it creates.

All of these manipulations and their commensurate responses, with the probable exception of utter failure, are perfectly acceptable options. Which response is triggered depends on the artist's objective, the execution, and – obviously - the mediating effect of the viewer's subjective experience. And if the options for the likely reaction are sufficiently open/ambiguous (i.e. if there hasn't been enough control at the point of creation to completely corral the outcome) then the audience member's response to, and experience of, the audiovisual material is determined by their individual clutch of "top-down" schemas (the aforementioned conceptual templates) when processing "bottom up" (saliency-driven, memory free) data.

The principle that has clearly emerged in these examples is that the relationships between sounds and images have their place and are only offered up for conscious attention once the set-up is transgressed beyond what a Zombie Agent can actually handle. I will explore one final project to elaborate on this – an example drawn, this time, from the other side of my practice. This illustration is an early draft of a station ID taken from my recent work with the Showtime Movie Channel on Foxtel. And of the four options developed earlier, I'd suggest that it's somewhere between an "unmitigated" and a "gently interesting" (arty) failure.

Clip 3

This example, I'd suggest, articulates two of the main issues I've raised. Given the short timeframe in (and agenda within) which it operates, the first is the potential need for a consistency of internal logic in the audiovisual contract, so as not to violate a likely schematic expectation (no matter how interesting such a breach may be). And this is where the trouble starts – because it was precisely this violation (withholding the third "ting") which, for me, was the most seductive aspect of the soundtrack. It successfully "fails" along a number of lines. There are problems with:

a) the established audiovisual logic (flare = ting) being discarded
b) the musical trajectory (informed by that audiovisual logic) being ruptured, i.e. an implied three part phrase with only two parts delivered
c) the infraction compounded by the close of a clearly musical phrase being delivered at that point instead by the sound design ("woosh" not "ting")
d) (compounding a-c) the overall loss of a sense of presence at the point where the third note should fall, i.e. if the objective is to seamlessly morph musical description into prescribed sonification of the vision (Murch's "sonic shadow"), then the mix should ensure a more successful pass from score to sound design for the shift to remain unnoticed. To do this, a structural con would need to take effect. At the moment, there's a dissipation of energy at the transition point – so the device is all the more likely to trigger unwanted attention. To work, it would need to displace the loss of presence with a subtly greater sense of mass than currently present (effectively replacing the diamond with some coal before the security system registers the swap).

That's the logic. But here's the problem - for me, there is an interesting feeling which accompanies the absence of the third ting (and all the issues I've just outlined). This is something which is difficult to explain or justify rationally. But, in this case, there is a "resonance" which arises from the transgression of an expected audiovisual logic. Squint in the cinema and you'll start to hear these devices being employed more often than you'd expect. What happens to the way we feel when we are denied the articulation of events we expect sonified? For all the clean and exciting aural violence we demand in our action films, what is the effect of this resonance when things are subtly de-materialised? Whilst overt denial of expected sound is part of our staple diet in certain film genres (Horror, Asian ghost cinema, Science Fiction), what happens to the audience experience when such denial is executed to remain off radar, and in a context not legitimised by devices such as the supernatural or the extraterrestrial? The close car and tail blinker for the flashback child death in "A Civil Action" (1998), the lack of impact for the informant's smeared truck death in the remake of "Miami Vice" (2006), hearing the sandals in the dust over the road but not the dialogue of the same character in "Zatoichi" (2003) - they're all gentle violations which don't necessarily register, but they result in a type of feeling through transgressing an expected audiovisual logic. It's a managed risk, and the payoff can be as sublime as it is significant.

By comparison, and as previously outlined, in certain Art-forms the registrations of such abuse can be essential to the craft.  That is, such cognitive dissonances can be the tools by which an artist extracts an audience member from their aesthetic bath and leads them to a thought, an issue, a contemplative moment. Mainstream film, jingles and advertising, however, tend to be different. Risks are certainly taken, but they tend to be controlled risks. There may well be dangers which will allow the perception of edginess, but not ones which should draw obvious attention to the mechanics of the craft – particularly if it's likely to distract from the narrative experience, message or product itself.

And herein lies the issue with the failed Showcase clip. I'd suggest that whilst the flares are ostensibly a "physical event", the first two are accompanied by enough tonal information to force them into a musical, more representational structure: they do not arise from within the image, but are imposed (more descriptively) from without. Whether the sense of tonality in the music occurring at the site of the third flare is strong enough in people's perception to follow through the job begun by the first two (and reads as something off-radar, resonant, and interesting) or whether the absence triggers attention (and reads as a problem) depends on the individual. As it stands though, it's an outcome which is left open. It's a risk.

And this raises a second issue – and explains where this consistency was ruptured. This issue concerns the dangerous – subjective and uncertain - tipping points when it comes to top-down management of the components of sound. At what point do one's personal Zombie Agents give up and flag the problem? Such a consideration became important when I was experimenting (with a certain degree of artsy curiosity) with this relative smudginess between sound design and score.  Initially, I kept the elements fairly autonomous and running in parallel: sound design to materialise the vision; score to tell us how we should feel about it.

My first mistake in this strategy was to allow an innate musical sense to align the tonality of the tings with the key of the music (and, quite coincidentally, with its rhythm). Suddenly, those tings moved into a netherworld. In one way, they're still (sort of) being created by the visual flare, but they've now started to inhabit the more descriptive – and more on radar - Disney-esque realm of the soundtrack. That in and of itself is not a problem - the smudginess could be seen as a fairly harmless call. The problem entered when I attempted to "interestingly" violate that set-up. For me, the slight vacuum in the absence of a third musical ting creates the most interesting feeling in the whole station ID. It doesn't register for me as a problem – to a degree because the music subtly continues the line – but rather, I find any issues of the moment simply experienced as one of these "resonances" I discussed earlier. That said, consistent queries from friends – questioning where the third ting was - suggested that my satisfaction was a very personal resonance, and not one which was translating successfully to the experience of others.

To resolve the issue, I composed the following version - a less interesting, but certainly less problematic ID.

Clip 4

This version offers a continuity of logic; and it has no "problems" to rupture the magic, distract from the message or confound an Agent. But looking back to the original ID - the shifting of the audiovisual contract, in and of itself, certainly isn't the crime. Sound constantly adjusts its role, changing over – not to mention wearing multiple hats all the time. At any given point sound can work to:

And it's rare that any of these agendas are mutually exclusive or restricted to either the score (music) or the sound design (environment). The audiovisual relationships are fluid, adaptable, expedient, complex, multifaceted, and under constant revision. And, for the most part, the Zombies are quick to adapt, moment by moment, to make everything work without a second's active consideration.

The problem occurs when you confuse an Agent – and you end up being sprung, in the glare of conscious attention, whilst you're engaged in such black-ops.

It's hardly the perfect crime if you're going to get caught.


All visuals created by Richard Grant. Excerpts one and two from "EPA live" (Distorted Festival, Melbourne 2005 © Dorobo); excerpts three and four from a "Showcase" station ID (Premium Movie Partnership 2007). Used with permission.


