IF Audiogaming

The recent post on ifbyphone discussed the technique of making interactive text available as interactive audio. How does this media-shifting affect the experience?

There is a long-standing connection between the IF community and blind/visually impaired gaming, or audiogaming. While not all IF is accessible, the vast majority can be used by a visually impaired gamer simply by reading the text-only content with ordinary screen-reading software. Special IF text-to-speech interfaces are also in development which streamline the text-to-speech process (say, by preventing rereading of the status bar when it has not changed). In this context, it makes sense to ask to what extent interactive audio experiences are accessible. Are they similar to the screen experience of the work? What changes? “Photopia” and “Fail-Safe” provide good case studies.
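The status-bar streamlining mentioned above can be sketched in a few lines. This is a hypothetical illustration, not the API of any real IF interpreter or screen reader: the `speak` callback and the status/body split are assumptions standing in for whatever the real interface exposes.

```python
# Hypothetical sketch: wrap a speak() callback so an IF status bar is
# only re-voiced when its contents actually change between turns.
def make_status_filter(speak):
    last_status = None

    def emit(status_line, body_text):
        nonlocal last_status
        if status_line != last_status:   # only voice the bar on change
            speak(status_line)
            last_status = status_line
        speak(body_text)                 # story text is always spoken

    return emit

spoken = []
emit = make_status_filter(spoken.append)
emit("West of House  Score: 0", "You are standing in an open field.")
emit("West of House  Score: 0", "There is a small mailbox here.")
# the unchanged status bar is voiced only once
```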

The color of sound

ifbyphone features a nicely chosen selection of games for the new player. “Adventure” and “Zork I” are traditional, while “Dreamhold,” “Theatre,” and “Edifice” are well written and provide a wide range of genre and tone. “Photopia” is an acclaimed work, and a particularly interesting choice in that, like some editions of novels such as Michael Ende’s “The Neverending Story” or Mark Z. Danielewski’s “House of Leaves,” it uses color printing to influence the mood and indicate story progression. These color effects are primarily ornamental, as “Photopia” is fully playable without color. However, it is interesting to see how quickly media translation issues appear even when dealing with collections of apparently minimalist work. To adapt the color edition of “Photopia,” should a different text-to-speech (TTS) voice read each color? Should it be the same voice, but pitch-shifted?

Even without color, auditory adaptation of interactive fiction immediately raises some interesting aesthetic questions - many of which are common to the production of TTS audio books, but some of which are unique to IF. This becomes most apparent in considering the choice of Jon Ingold’s “Fail-Safe” to be experienced as TTS audio.

The sound of sound

Jon Ingold’s “Fail-Safe” is perhaps a perfect choice for “by phone” transmission, as the game already claims to be occurring entirely via radio transmission. Just as you, the player, cannot see anything beyond the text, the operator whose role you play in “Fail-Safe” relies entirely on audio from the other end of the line to find out what is going on, from the very introduction:

Bzzt. Crackle. *Static*

“…hello? Hello? Can… me? .. Anyone! Hel…. Need.. hello?”

Bleep - PLEASE WAIT - Locating/Tuning signal…

..

“.. help. Repeat, can anybody hear me? Can you hear me? Hello..”

>> hello

“Hello? Hello! The .. pretty bad. Are you receiving this? Over.”

>> _

Considering this text, we can see immediately that a lot of audio information has been textually encoded. “Bzzt. Crackle. *Static*” is not supposed to sound like those onomatopoeic words pronounced aloud - it indicates that we should imagine actual line noise, with the bold emphasis and the asterisks indicating high volume, and the ellipses indicating not just lapses of time between sounds, but shifts in relative volume as the static drowns out the signal, or the signal volume fades.

“Fail-Safe” is most appropriate for ifbyphone because it is most particular to an auditory experience - but ironically the visual encoding of auditory information is also what makes much of the content totally inappropriate for processing by text-to-speech (TTS). Similarly, what makes “Fail-Safe” so perfect for its interaction loop is that its parser error messages refer to not hearing what was said - which in my experience was frequently the actual problem. Given that we can expect to be yelling “OPEN DOOR!” into the phone in a low-reception area while the computer replies “That isn’t something you can throw,” wouldn’t it be interesting to generate error messages based on the real audio input? Pattern matching failures might produce error messages like “you’re speaking too quickly,” “you’re breaking up,” “there’s too much static,” or “could you please speak a little louder?” based on the actual quality of the audio received, rather than at random (as “Fail-Safe” currently does). Importantly, the difficulty of speaking over the crackly channel is part of the suspense of the story - it is an obstacle of the medium (the limits of cell-phone networks and TTS technology) that is already subsumed into the dramatic situation.
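The idea can be sketched under some invented assumptions: suppose the speech front end reports a signal-to-noise ratio, a words-per-minute estimate, and a recognizer confidence for each utterance. None of these names come from ifbyphone or “Fail-Safe”; they are placeholders for whatever a real audio pipeline would expose.

```python
# Sketch: pick an in-fiction parser error that reflects the real
# audio problem, instead of choosing one at random. The metric names
# and thresholds are hypothetical.
def diegetic_error(snr_db, words_per_minute, confidence):
    if snr_db < 10:
        return "There's too much static - say again?"
    if words_per_minute > 180:
        return "You're speaking too quickly - slow down. Over."
    if confidence < 0.4:
        return "You're breaking up - could you speak a little louder?"
    return None  # audio was fine; fall through to the normal parser error

print(diegetic_error(snr_db=5, words_per_minute=120, confidence=0.9))
```

A `None` return would hand control back to the ordinary parser, so the diegetic messages only appear when the channel itself is the problem.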

Of course, these kinds of interactive audio designs go far beyond the ifbyphone concept, which appears to be a comparatively simple, efficient and reliable text-to-speech / speech-to-text loop. Still, this shouldn’t prevent us from asking how the experience might be improved. How could we make a better “Fail-Safe” audiogame? One approach might be to pre-record each of the text outputs from the game, creating a kind of IF audio-book in which the parser strings together mp3s for playing rather than printing text strings - these audio clips could be performed, with appropriate background noise. A more complex approach would be to introduce an audio mixing layer, where TTS strings from the parser could be combined with radio-play-esque sound effect strings as well as background sound - in this case, a volume-controllable wall of static. A parser that produced its audio rather than presenting it would maintain many of the advantages of the text-based system - for example, the possibility of inline translation, of substituting voices with different accents, or of fine-tuning the audio effects.
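A minimal sketch of such a mixing layer, with audio represented as plain lists of float samples so it runs without any audio library. In a real system the tracks would come from a TTS engine, recorded effects, and a looping static bed; here they are stand-ins, and the gain values are arbitrary.

```python
# Sum (gain, samples) tracks into one clip, padding to the longest track.
# This is the core of a volume-controllable mix of voice, effects, and static.
def mix(tracks):
    length = max(len(samples) for _, samples in tracks)
    out = [0.0] * length
    for gain, samples in tracks:
        for i, s in enumerate(samples):
            out[i] += gain * s
    return out

voice  = [0.5, -0.5, 0.5, -0.5]   # stand-in for TTS output of one response
bleep  = [1.0, 1.0]               # stand-in for a radio-play effect
static = [0.1, -0.1, 0.1, -0.1]   # stand-in for the background wall of static

clip = mix([(1.0, voice), (0.3, bleep), (0.8, static)])
```

Because each track carries its own gain, the static bed can be turned up or down dramatically - exactly the kind of fine-tuning a text-printing parser cannot do.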

If we did create such an audiogame, however, it would be asymmetrically text/audio based even when interacting with it by keyboard - much like Subservient Chicken was an asymmetrical text/video chatbot. This suggests that our model for looking at text-based systems in general could use some refining - what separates IF and chatbots from many related works is not that they are text-based, but that they are text/text as opposed to say text/image or image/text.

Is an experience that is both heard and voice-controlled really “text art,” even if the background processing occurs entirely within a text engine? Recently, WRT has covered a lot of image-based “text art”: words embedded in cartoon images, words formed in water, words made of light, or words that otherwise depart from the traditional understanding of digital characters as computationally encoded symbols (ASCII, Unicode, etc.), treating them instead as letterforms whose creation was enabled by some digitally-assisted process (digitally controlled spigots, digitally controlled LED boards, and so on).

The example of ifbyphone represents yet another limit on the concept of “digital text.” ifbyphone and IF/TTS rely on symbolic manipulation of encoded symbols (rather than, say, audio transformations) in determining their output. Yet, in order to argue that they are digital text art, we must concede that neither the reception (heard audio) nor the interaction (spoken audio) need be textual.

This concession would seem to throw any definition of “digital character art” wide open, and leave us with only “character manipulation processes” - which sounds like a Turing Machine to me….

Related: “You’re Kidding Me,” a simulated phone conversation, based on an earlier work by the same designer


