Uncanny Simulations

Pictured above are the lead actors in two recent Hollywood feature films, Final Fantasy: The Spirits Within and Polar Express. When designing these animated characters, the creators of each film made different trade-offs in their attempts to make the artificial appear real and appealing. Aki Ross, the lead in Final Fantasy, was made from whole cloth. Though Ming-Na provided the voice for the lead, as a 38-year-old Chinese-born actor, she was not an explicit model for the very realistic on-screen image of Aki, a 27-year-old Caucasian. The conductor in Polar Express, on the other hand, was clearly an interpretation of Tom Hanks. This representation was less than photorealistic, yet not a caricature. Using both the voice and visage of a screen star doubles down on the public’s association with a known persona. The expectation is that the popularity and comfort moviegoers maintain for the star will carry over to the animation. Each film also incorporated a somewhat different balance between the realism of the simulated reflectance and a natural similitude to human movement. Masahiro Mori claimed in his 40-year-old seminal article on the Uncanny Valley that capturing human-appropriate motion has a greater effect on the level of acceptance of a simulation than does the accuracy of its surface reflectance. It is likely, however, that these two aspects of representation maintain a much more complex interaction than has been worked out so far.

In these two movies, this notion about the relative importance of motion and reflective appearance on the acceptance of simulations was put to the test. The animators of Final Fantasy arguably produced some of the best representations of simulated skin and hair ever done, and Polar Express moved the state of the art of motion capture considerably forward. Nonetheless, both films unquestionably stumbled into the Uncanny Valley. The results, whether because of regrettable aesthetic choices or because of existing technological (or financial) limitations, evoked an appreciable degree of Mori’s uncanniness. Evidently, doing either of the major constituents of human representation disproportionately well is not, in itself, sufficient to elicit enough acceptance to span the Valley. In the final analysis, these movies became technological Pyrrhic victories. They were good enough to be bad.

However, it can be argued that these films were brought low by more than animation flaws. Certainly, Polar Express had to contend with a script that took place mostly at night and suffered the concomitant ghoulish overtones of shadows from non-overhead lighting, while Final Fantasy had to contend with a script that was itself completely in the dark – a script that was so painfully subtle that the crucial plot turning point became apparent only after listening to the animators’ commentary on the DVD. Nonetheless, it has been difficult to establish a causal link from problems such as these to the plummeting of the animations’ aesthetics that occurred just as their imperfections were becoming very small.

Why, then, did these (very expensive) simulations fall short? Three areas might prove informative: what factors make positive, or even essential, contributions to the construction of a digital human; what factors are antithetical to this creation; and, finally, what it is within an observer that has possibly evolved to promote survival as a human being and, in consequence, finds some of these simulations appealing and others so disquieting.

Is it that our sensitivity to simulation flaws is adaptive, and that as the animation improves so does our discernment? Or, is it the case that qualitative errors in simulation design continue to be made that consign the best (to date) of the objectively improving results to the Uncanny Valley? It may well be the latter; else, an observer would end up rejecting real humans for no more than an eyelash out of position. In this view, there are essential contributors to the composition of any simulation that must be present to elicit the perception of humanness. While the exact set of these features may vary across individuals according to their experience, the evaluation of what constitutes an acceptable assembly of features appears to be guided by a judgment mechanism that is as innate to human beings as breathing. Here in The Skin Appearance Laboratory, many potential skin properties have been collected over the last twelve years from the clinical and scientific literature for inclusion in a skin gazetteer. What remains is to define the interactions among the candidate details, as well as to determine under what conditions they become essential. These are the positive increments of appearance and motion that hopefully, when properly configured, will be sufficient to bridge the valley.

At present, all the candidate entries for the gazetteer are contributors to static appearance. A taxonomy of human motion is needed to complement the still image reflectance measures and thereby extend the representational quality of humanness, or what Mori would describe as increased familiarity. Potentially, one way to factor out the contribution from each modality (appearance v. motion) and perhaps begin to construct a motion taxonomy would be to extract still and movie samples from the same image data source, two data types obtained from identical conditions. To evaluate these data, assessments of the familiarity of movement and appearance could be formed by comparing movie segments and corresponding stills. In rating the movies, a continuous real-time scoring slider similar to that used to rate TV shows might be employed. Video segments associated with high positive ratings could then be analyzed for ‘essential’ candidate motion components. It would then be possible to compare the values assigned to the motion segments to the ratings of the isolated stills. In the best of worlds, a representational synergy would emerge that would enable creation of simulations with even greater acceptance.
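The comparison proposed above, between slider ratings of movie segments and ratings of the corresponding stills, could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the function names, the use of a simple mean to summarize each slider trace, and the idea of reading the movie-minus-still difference as a crude "motion gain" are all hypothetical choices made for the example, not an established protocol.

```python
from statistics import mean


def segment_score(slider_trace):
    """Collapse one continuous slider trace (a value per time sample) to its mean."""
    return mean(slider_trace)


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def compare_modalities(movie_traces, still_ratings):
    """Compare movie ratings with ratings of stills taken from the same source.

    movie_traces:  {segment_id: [slider values over time]}   (hypothetical layout)
    still_ratings: {segment_id: rating of the matching still} (hypothetical layout)
    Returns per-segment movie scores, the movie-minus-still difference
    (an assumed, crude proxy for what motion adds), and the correlation
    between the two sets of ratings.
    """
    movie_scores = {seg: segment_score(trace) for seg, trace in movie_traces.items()}
    segments = sorted(movie_scores)
    motion_gain = {seg: movie_scores[seg] - still_ratings[seg] for seg in segments}
    r = pearson(
        [movie_scores[seg] for seg in segments],
        [still_ratings[seg] for seg in segments],
    )
    return movie_scores, motion_gain, r


# Invented ratings, for illustration only.
movie_traces = {"a": [1, 2, 3], "b": [4, 4, 4], "c": [6, 7, 8]}
still_ratings = {"a": 1.0, "b": 4.0, "c": 6.0}
scores, gain, r = compare_modalities(movie_traces, still_ratings)
```

Segments with a large positive difference would be the natural first candidates to inspect for ‘essential’ motion components, although a simple difference of this kind cannot by itself capture any interaction between the two modalities.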

As the general problem of rating humanoid similitude is cast differently under varying physical configurations, it would perhaps be helpful to explore the overall space of representations being employed, to see the complete range of acceptable percepts that a given humanoid configuration is able to produce. If the movies are CGI, it should also be possible to extract the stills with any associated motion blur removed. An assessment can then be made on the basis of rendered appearance factors alone. It should be recognized, however, that although a video is composed of a series of still image frames (give or take some motion blur), the perception elicited by a video is rarely the concatenation of the percepts obtained from the individual constituent frames.

Beyond determining all the properties that need to be done right, there may well be some that are inappropriate and need to be avoided in any simulation. Through a comparison of positive and negative features, it may be possible to establish the basis for a taxonomy of appropriate human movement that leads to high levels of acceptance as well as for an equally helpful taxonomy of inappropriate motion that leads to residency in the Uncanny Valley. It has been shown in the literature that just as individuals make virtually instantaneous judgments of personal attractiveness that they do not appear to abandon over time, they also appear to assess what is uncanny with the same immediacy. An attempt should be made to deconstruct these judgments in detail so as to have a basis for both correcting and understanding the humanoid evaluation process. How would human raters evaluate movies of our primate cousins, or perhaps simulations constructed from varying amounts of different motion components distilled from primate mocap driving human avatars? Continuing with a previous example, when one recent commentary referred to the children in Polar Express as appearing to labor as if under a carapace, it rang true to the subjective viewing experience where even the best mocap apparently failed to produce the necessary levels of fluidity and compliance in body motion simulations to induce viewer acceptance. What needs to be determined in this case is what, in the essence of human flow and flexibility, was not being properly represented in the animation, even with all that motion capture sampling. Failures also abound in still image representations. A rating survey across the oeuvres of different artists, covering photography, painting and drawing, might possibly aid in further fleshing out the basis functions for appearance.

A growing body of evidence supports the existence of mechanisms in the frontal lobes of our brains that provide a facial recognition capability, one that facilitates recognition of cues associated with vocalizations, with interactions for social bonding, and with the maintenance of group hierarchies. Mechanisms such as these may provide additional reasons to avoid certain properties of facial appearance in simulations. It has been proposed that humans have developed an ability that is beneficial to survival, a tendency that can be described as innate pathogen avoidance. Since the appearance of the skin is the greatest source of visual information on the state of an individual’s health, it might be helpful to survey the dermal manifestations of the signs and symptoms associated with the historic diseases that have decimated mankind – e.g., plague, smallpox, cholera and anthrax. The aim would then be to avoid combinations of light sources, shadows and skin tones that together approximate any of these symptoms. It would probably be appropriate to do the same for the changes in skin appearance associated with the end of life and the stages of death. Certainly, disease and mortality are closely tied in our cultural memory. In the more remote past, human beings would have been more proximal to the percepts of mortal pandemic disease and of death than we are today (at least in the West). The Middle Ages, and several of the scourges listed above that are associated with that period, should be sufficiently distant for the observed biases and inclinations to have taken firm root in humanity’s collective mind.

All that would remain then would be to assess the extent to which these properties, when present in the skin, make us uncomfortable with any simulation hosting them. This analysis might well benefit from breaking down the properties into changes in the distribution of the different skin chromophores and textures. It is expected that efforts built on previous dermatologic imaging work in my laboratory that looks for the mechanisms underlying the appearance of all manifestations of diseased (and healthy) skin can be brought to bear on this analysis. Again, measures of motion are missing. Are there characteristic changes to human movement brought on by the debilitating consequences of these mortal diseases? Can any be inferred based on the modern medical understanding of the course of these conditions?

Finally, in addition to determining what should be included, and avoided, in these simulations (as described above), it might be better to first take a more detailed look at what is being attempted. What in the simulations is it that actually needs to be measured during an observation? In Mori’s original article, familiarity was plotted against human likeness. Several of the later translations took issue with the choice of the ordinate label and, consequently, terms such as likeability, comfort level, and rapport have all been proposed as alternatives for familiarity. Each selection corresponds to a different modification of the task, and of the results. It is not only the dependent variable that is in published contention. There is more here than a dispute over the translation of words. The classification of still and moving images has also evolved – motion/immobility has changed to movement, and movement on to behavior. The widespread attempts at redefinition belie the existence of any consensus on the proper space for these transformations. What we can perhaps all agree on is that there is some alteration of perception in the humanoid/human transition that becomes significantly nonmonotonic just prior to the ideal. Exactly what that appropriate description is hasn’t yet been determined. To obtain a feeling for the potential of the space of human representation beyond where we now can reach with simulations, compare the two images above with the two images from the beginning of the photographic portraiture section. There evidently is very much more to understand.
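Whatever labels are finally chosen for the two axes, the nonmonotonic dip described above could in principle be located in a set of ratings with a sketch along these lines. The function name, the neutral terms "likeness" and "rating" (stand-ins for the contested axis labels), and the sample values are all assumptions made for illustration; none of the numbers are measurements.

```python
def find_valley(samples):
    """Locate the deepest dip in rating that is followed by a recovery.

    samples: iterable of (likeness, rating) pairs. Both names are placeholders,
    since the proper axis labels remain in contention.
    Returns (likeness, rating, depth) for the deepest dip below the best rating
    seen at lower likeness, or None if the ratings never dip and recover.
    """
    ordered = sorted(samples)  # ascending human likeness
    best = None
    running_peak = float("-inf")
    for i, (likeness, rating) in enumerate(ordered):
        running_peak = max(running_peak, rating)
        depth = running_peak - rating  # how far below the earlier peak we are
        recovers = any(later > rating for _, later in ordered[i + 1:])
        if depth > 0 and recovers and (best is None or depth > best[2]):
            best = (likeness, rating, depth)
    return best


# Invented values with a dip just short of full human likeness.
illustrative = [
    (0.10, 0.10), (0.30, 0.30), (0.50, 0.50), (0.70, 0.65),
    (0.85, -0.40), (0.95, 0.20), (1.00, 1.00),
]
valley = find_valley(illustrative)
```

A purely monotonic set of ratings returns None, which makes the sketch usable as a rough check on whether a given choice of dependent variable exhibits a valley at all.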

So where will this ever higher level of human simulation quality ultimately lead us? Will our increasing success in this endeavor be our downfall? “I am the death of real.” This assertion affirms S1m0ne’s recognition that the ability to create illusion, or fraud, in images, moving or still, can now sometimes exceed our capacity to detect it. More importantly, she also implies that this level of corruption devalues our ability to rely on what appears real. Not that it is all that easy to get there, as the two movies discussed above have demonstrated all too well. Of concern in the long term are what might be new instances of perceptual acquiescence for effects such as the classic ‘wagon wheels’ illusion (motion aliasing). These are distortions that creep into our cultural (or clinical) acceptance of what subjectively passes as adequately real to us as we become inured to such fabrication by viewing these corrupted simulations repeatedly. While this may be more of a concern for applications such as image-based diagnostic decision-making, it would be good to ensure that artifice in human simulations for entertainment does not overly corrupt their essential representation so as to poison the well for this new cinematic style. As S1m0ne would have it, the days of real may be numbered; however, if the transition is done right, nothing else of value need perish with it.

Brian C. Madden