Criteria for Judging the Quality of Science Exhibits

Scientific and technical exhibits exist in a profusion of types, styles, sizes,
and designs. However, their underlying purpose is generally the same: to
impart knowledge about various technical subjects, and/or to change the
attitude of the viewer in a favorable direction toward science, its practitioners, and its institutions. Such exhibits are considered by many to be a
uniquely appropriate and effective means of narrowing the gap between
the sophisticated world of modern science and technology and the “everyday” world of that abstraction we call the general public. Museums, schools,
private industry, and the Federal Government are deeply committed to
the use of scientific and technical exhibits to carry their informational and
attitudinal messages to the public. The extent of this commitment can be
judged by the number of such exhibits in existence or planned, their prominence at major fairs, exhibitions and museums, and the funds expended on
their design, fabrication and housing.
Support for this work was provided by the United States Atomic Energy Commission under Purchase Order No. NY-65-366.

APPROACHES TO STUDYING EXHIBIT EFFECTIVENESS

Considering exhibits as a form of visual communication puts them in the same methodological and research context as educational television, slides,
movies, and other forms of pictorially-based educational media. Thus, one
might expect that research strategies typically used to measure effectiveness
for these other media would also be used in studying exhibit effectiveness.
However, this is not generally the case. Since exhibits depend on voluntary
audiences in most situations, they must be concerned with an element not
usually shared by these other media: attracting power. Unless people come
to view an exhibit, any possible “teaching” attempt is wasted. By the same
token, of course, an exhibit that successfully attracts an audience but fails
to reach its educational and attitudinal objectives represents a waste of
effort and funds (both of which can be sizable for a modern exhibit). Because attracting power is such an obviously necessary ingredient for the
success of an exhibit, studies of exhibit effectiveness have tended to concentrate on their popularity, and at the same time have tended to neglect
efforts to measure their actual “teaching power.” The underlying rationale
of many effectiveness studies could thus be reduced to a kind of syllogism:
If exhibit A attracts more people than exhibit B, then exhibit A is better
than exhibit B. This kind of head-count research is paralleled by a complementary emphasis by exhibit designers on the use of display effects to
attract attention. One could argue, however, that the excessive use of such
attention-getting techniques may actually detract from the achievement
of the exhibit’s educational and attitudinal objectives. Given “x” dollars to
spend on an exhibit, the temptation to use them for the “sizzle” rather than
the “steak” may be difficult to resist. However, this situation is not likely
to improve as long as those working in the field are content to equate
popularity with effectiveness.
This is not to say that there have been no attempts to assess success in
terms of knowledge gained and attitudes changed. Work has been done on
measurement and several interesting projects are underway that should
shed considerable light on exhibits as change agents. However, the results
of studies completed to date have been generally disappointing. The measured change in behavior resulting from exposure to a given exhibit has been
either quite small, nonexistent, or, in a few instances, in the wrong direction.
These kinds of results may be partly due to unsolved methodological problems, particularly in defining objectives, and in designing appropriate measuring instruments that are sensitive to changes. These problems, of course,
still plague all of our media houses.
APPROACH USED IN THIS STUDY
The disappointing results found in many studies of exhibits may also be
due to the fact that principles of good design and proper utilization are
known by specialists in the field but often ignored by them. Such principles
are found in abundance in the exhibit literature in the form of prescriptive
or normative statements as to what constitutes a “good,” or “effective”
exhibit and what, conversely, constitutes a “bad,” or “ineffective” exhibit.
While knowledge based solely on expertise and experience generally lacks
the precision of scientifically-based knowledge, it may still be valuable and
worthwhile. It would be unwise to reject offhand the experience of those
who have worked with exhibits for many, many years. If such knowledge
and experience could be condensed and shown to be related to the proven
effectiveness of exhibits, it would serve as a very useful guide. Poor and
ineffective practices would be less likely to become incorporated into exhibits, exhibit effectiveness would be enhanced, and exhibit utilization
would be based on firmer ground.
In short, exhibits may be poorer than they ought to be simply because
designers are not utilizing those principles known by the leaders in the
field, or, the experts have not adequately translated their prescriptions into
terms that designers can implement. If either of these statements is true,
then the situation could be improved by making these principles more
available or more understandable, or both. This would be essentially a
translation and dissemination problem, and would involve a more intimate
dialogue between conceptualist and designer.
Two kinds of “truths” could be determined for the principles of good
design and practice in the exhibit literature. The first has to do with their
validity. That is, is there a relationship between the use of the principles
and the measured success of the exhibit? This is a difficult question to
answer, requiring appropriate measuring instruments and large-scale field
studies. In short, this approach takes us back to complex effectiveness studies. However, one could save much time and effort by first studying the
reliability of such statements. For example, if the literature states that an
effective exhibit must be “well-lighted,” the reliability of this statement
could be determined by asking a number of persons knowledgeable about
the exhibit field to rate an exhibit on the degree to which it does or does
not conform to this principle. If the informed persons agree with each
other, then one would be encouraged to further investigate the question of
validity; do people who view poorly lighted exhibits learn less and/or have
their attitudes changed less than people who view well-lighted exhibits?
If, on the other hand, “A” says an exhibit is well-lighted and “B” says it is
poorly lighted, then the prescription about lighting has little substantive
meaning, at least for these two individuals. If such a result were found to
hold generally for all or most of the statements found in the literature, there
would be good reason to question the usefulness of such statements. This
finding would also help to explain why exhibits tend to be an art form
rather than a technology. It would suggest that the reason for ineffectiveness is not that designers aren’t incorporating those features known to be
effective, but that the leaders in the field are not clear in their own thinking
as to what constitutes good exhibit design.
In summary, the study reported on here was designed to determine only
the extent to which the statements made in the published literature regarding the quality of scientific and technical exhibits are meaningful and unambiguous. This was done by constructing a rating scale, the items of
which were drawn from the exhibit literature. By having persons qualified
in the exhibit field use the scale and then comparing their ratings, the
reliability of the statements could be measured. Only if such statements
were found to be reliable would a study of their validity be considered
profitable.
Literature Survey and Item Assembly
A review of the exhibit literature was conducted in an effort to locate
those sources most likely to contain prescriptive or normative statements.
A total of forty-seven references were thus identified. Each potential source
was carefully read. Whenever the author made a statement that involved
exhibit effectiveness, it was recorded. Statements that were specific to a
particular exhibit were included only if a general principle was either
explicitly or implicitly associated with the statement. Thus, the item: “The
red lettering on the agricultural exhibit did not show up against the pink
background, thus making the labels difficult to read” would be recorded,
but the general principle would also be noted: “Lettering should contrast
with the background.” Most authors did, in fact, write in general terms
since their remarks were meant to apply to more than one exhibit. The
complexity and variety of exhibits would seem to preclude the possibility
of anyone saying “The letters of all exhibits must be white on a black background.” The items in the rating scale would also have to avoid this level
of specificity if the scale were to have applicability to all scientific and
technical exhibits. Over 350 different statements were thus recorded from
the forty-seven references.
Next, related detailed statements were grouped into fifteen logical categories. These categories became the general headings under which various
numbers of specific items of the draft scale were placed. The fifteen categories are shown in Table 1. The items falling under a specific category
were then reviewed to see to what extent they could be combined. In
general, the aim was to provide adequate coverage of all the different
characteristics noted in the literature. The more than 350 statements
recorded were reduced in this manner to seventy-four specific questionnaire items. Every effort was made to avoid distorting or changing the
meaning of an item. Even though some items seemed vague or even unintelligible, they were retained essentially as they appeared in the original
quotation. Thus the initial scale was as nearly as possible an empirically
developed instrument.
TABLE 1
Basic Exhibit Categories

1. Attractiveness of Exhibit
2. Ease of Comprehension
3. Unity Within the Exhibit
4. Ability to Attract Attention
5. Ability to Hold Visitor Attention
6. Appropriateness of Exhibit Presentation
7. Accuracy of Information Presented
8. Location and Crowd Flow
9. Visitor Characteristics
10. Focus of Attention
11. Textual Material (labels, headings, etc.)
12. Relation of Exhibit to Surrounding Area and Other Exhibits
13. Design of Exhibit
      Size
      Physical layout
      Use of color
      Use of light
      Use of contrast
14. Exhibit Items
      Quantity
      Attractiveness
      Size
      Sound
      Motion
      Demonstrations
      Charts
      Films
      Models
      Auxiliary teaching techniques
      Audience participation
15. Communication Techniques
To illustrate the way in which scale items were generated, two items are shown below along with the individual references which supported these items.

Scale Item 1: “How would you rate this exhibit on the appropriate use of light?”
    Supporting Statements from the Literature:
Wright, G., 1958. “Is lighting adequate, or could it be improved?”
    Gardner, J. & Caroline Heller, 1960. “It is impossible to exaggerate the
    importance of lighting in exhibitions since it is lighting after all that
    largely determines what we see and how we feel about what we see.”
    Goins, A. E. & G. B. Griffenhagen, 1957 and 1958. “Several factors operate simultaneously to determine the popularity of an exhibit. . . . Some
    of the factors which are inherent in the exhibition are . . . light (e.g.,
    illumination, movement), . . .”
    Carmel, J. H., 1962. “As with any other part of exhibition design, light
    when correctly employed in an exhibition should enhance, emphasize,
    create atmosphere and otherwise help tell the story; it should never
    dominate, dazzle or distract.”
    New York Museum of Science and Industry, 1940.
    (1) “The functions of an exhibition and the more important means for
    carrying them out are: To draw attention: [by means of] color,
    light, motion, sound. . . .”
    (2) “From our experience here at the Museum we have found that the
    three-part formula of sound architectural design, proper use of
    illumination, and good color effects, is the first essential of good
    exhibition practice.”
    [Note that this item also contains material relating to other categories. Such items were noted on separate cards and filed under
    each category, i.e., light, color and design.]
    Borhegyi, S. F., 1963. “. . . light can be used to heighten the dramatic
    effect of visual images.”
    [Note: There were many additional references to lighting, but they all
    were concerned with more specific areas of lighting and thus generated
    more specific items in the initial rating scale.]
    Scale Item 2: Not all subject matter lends itself to an exhibit presentation.
    How suitable is this subject matter for exhibit presentation?
    Supporting Statements from the Literature:
    Carmel, J. H., 1962.
    (1) “. . . material ill-suited to temporary exhibition use . . . includes
    anything hazardous, objects requiring lengthy labels for comprehension, or anything which is too complex or obscure to be comprehended in a reasonable length of time by a standing visitor.”
    (2) “For permanent exhibitions, it is probably sound policy to avoid
    exhibitions on any subject that can be explained as well or better
    by an article, book or pamphlet with well selected illustrations.”
    Inverarity, R. B., 1961. “An exhibition should not attempt to do what
can be better done in some other medium.”
Gardner, J. & Caroline Heller, 1960.
(1) “Exhibition has its limitation . . . complicated stories and arguments should be left to leaflet or guides.”
(2) “It must be accepted that there are some subjects that will never make good exhibitions.”
Dale, E., 1946. “Is the material worth the time, expense, and effort involved?”
Weiss, R. S. & S. Boutourline, Jr., 1963. “The most attractive exhibits had the characteristic that they could only be seen in the museum. Visitors would have been justified in feeling it unnecessary to come to the museum for information which might be found in a book.”
Weiss, R. S. & S. Boutourline, Jr., 1963. “A show which fair-goers believe could have been seen in a book, in their local library, in their local museum, or even at an industrial show is likely to create some resentment.”
Hull, T. G. & T. Jones, 1961. “Not all subjects lend themselves to presentation by exhibits.”
    Draft Scale
    The draft form of the scale contained fifty-five items, five of which had
    two or more subparts for a total of seventy-four individual questions. The
    rater was asked to judge each of the items by circling one of six verbal
    “tags.” An item from the scale, along with the six rating categories, is
shown below:

How well do the various elements of the exhibit combine or relate to one another to produce a coherent unity?*

Excellent   Very Good   High Average   Low Average   Fair   Poor

* This item is an example of one which seemed ambiguous and potentially unreliable, but since it was noted several times in the literature, it was included in the scale for initial tryout. After all, those knowledgeable in the exhibit field may have a very precise meaning for “coherent unity.”
    The scale used freely distributed ratings rather than forced ratings. It
    was felt that the use of unforced ratings was a realistic approach to evaluation since an exhibit need not have a given number of excellent qualities
    and an equal number of poor ones.
    In order to make it possible to check the internal reliability of the ratings,
    the initial scale was divided into two parts. Part I contained items dealing
only with each broad category (as seen in Table 1), and Part II contained
    the more specific items falling under each of these categories. The latter
    were not identified to the rater as to the broad category to which they
    pertained. This format of the scale made it possible to compare the rating
    of a broad category (such as the “lighting” item referred to earlier) with
the various specific items which would fall under the general lighting category (such as one dealing only with “glare and reflection”).
    An item asking for an overall judgment of the exhibit was also included
    as part of the scale. In this way, it could be determined to what extent the
    raters agreed with each other on a total evaluation of the entire exhibit. The
    format of this item was the same as that shown above for the sample item.
While both Parts I and II of the scale required the rater to circle the
appropriate word which best reflected his judgment for each item, Part II
    further requested the rater to indicate in a few words why he rated each
    specific item the way he did. Two raters may both agree that the “attractiveness of the display materials” is Fair, but one may have rated it that way
because “they were poorly selected” while another may have rated it Fair
    because “they were all bunched together.” Thus, the written comments
    would make it possible to better evaluate the extent of agreement as indicated by the ratings.
    The draft scale was tried out at the American Museum of Atomic Energy
    at Oak Ridge, Tennessee. This museum is operated by the Information and
    Exhibits Division, Oak Ridge Institute of Nuclear Studies, Inc. The various
    exhibits and models show nuclear reactors and atomic power plants, describe the production of raw materials and radioisotopes, and emphasize
    the peaceful applications of atomic energy in industry, agriculture, and
    medicine. Most of the exhibits are designed and fabricated by the staff of
the Information and Exhibits Division. Seven displays in the museum were selected for rating. They covered a variety of exhibit techniques, design
    features, size, complexity and subject matter. Members of the museum
    staff, including those concerned with management, tours, and exhibit design and construction used the scale. A total of thirty-three scales were
    completed by twenty-five separate raters at the museum. Of the seven
    exhibits covered by the tryout, one was rated by six raters, three by five
    raters, and three by four raters. Eight of the twenty-five individual raters
    rated two exhibits.
    A form was prepared that asked the raters to make written comments on
    the scale itself. In addition, a member of the project staff discussed the
    scale with each rater after he had completed the rating of at least one
    exhibit.
    ANALYSIS
    The initial step in the data analysis was to transform the six values into
    a numerical scale (Excellent = 6, Very Good = 5, High Average = 4,
Low Average = 3, Fair = 2, Poor = 1). In assigning integral weights, the
    assumption was made that equal intervals existed between categories.
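As a concrete illustration of this scoring step, a minimal sketch follows (Python is used purely for illustration; the sample responses are invented, but the weights are exactly those given above):

```python
# Verbal rating tags mapped to the integral weights described in the text,
# under the stated assumption of equal intervals between categories.
RATING_WEIGHTS = {
    "Excellent": 6,
    "Very Good": 5,
    "High Average": 4,
    "Low Average": 3,
    "Fair": 2,
    "Poor": 1,
}

# Hypothetical responses from one rater, converted to numeric scores.
responses = ["Very Good", "Low Average", "Poor"]
scores = [RATING_WEIGHTS[r] for r in responses]
print(scores)  # [5, 3, 1]
```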
    Two sorts of questions might be asked in connection with scale reliability. One involves self-agreement, that is, the consistency with which a given
    rater evaluated similar elements in a given exhibit. The second has to do
    with interrater reliability, the extent to which different raters agreed with
    each other in their evaluation of the same element. This latter question will
    be dealt with first.
Interrater Reliability
A primary interest in the analysis of these data was in determining the
    agreement or lack of agreement among the raters on individual items and
over the entire scale, i.e., interrater reliability. Such information could be
    derived only by an exhibit-by-exhibit analysis, since there is no logical
    reason for assuming that the rating of a particular feature in one exhibit has
    any relation to the rating of that feature in another exhibit.
    As an estimate of variability in individual item ratings, the standard
    error of the mean was computed for each item in the scale for each of the
    seven exhibits. This analysis would indicate the range in which the true
    mean will be found. The larger the standard error of the mean, the wider
    the range and the greater the variability; therefore, the lower the reliability.
    A logical (and liberal) criterion for the mean was established by defining
    the actual mean rating for an item as being within plus or minus one rating
    category of the obtained mean. In other words, if the obtained mean for
    all ratings of a given item was three, or Low Average, it was assumed that
    the actual mean was between High Average and Fair. Any item where
    the standard error of the mean failed to meet this criterion (at the .05 level)
    was considered to be statistically unreliable for that exhibit. The data for
    each item on the scale were obtained for each exhibit rated.
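To make the criterion concrete, here is a minimal sketch of one way such a check might be implemented. It is not taken from the original study: the function names are invented, and the normal-theory cutoff of 1.96 is an assumed reading of “the .05 level” (with only four to six raters per exhibit, a slightly larger t-value would arguably be more faithful):

```python
from statistics import stdev  # sample standard deviation (n - 1 denominator)

def standard_error_of_mean(ratings):
    """Standard error of the mean for one item's numeric ratings on one exhibit."""
    return stdev(ratings) / len(ratings) ** 0.5

def item_is_reliable(ratings, half_width=1.0, z=1.96):
    """Assumed reading of the criterion: the .05-level interval around the
    obtained mean (roughly z * SEM on each side) must stay within plus or
    minus one rating category."""
    return z * standard_error_of_mean(ratings) <= half_width

# Hypothetical ratings from five raters on a single item (6 = Excellent ... 1 = Poor).
item_ratings = [5, 3, 4, 2, 6]
print(round(standard_error_of_mean(item_ratings), 2))  # 0.71
print(item_is_reliable(item_ratings))                  # False: the raters disagree too much
```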
The items failing to meet this criterion by exhibit and across exhibits are
    found in Tables 2 and 3. It can be seen in Table 2, for example, that for
    Exhibit 6, forty-eight of the seventy-four items on the scale, or sixty-five
    percent, do not meet the established criterion of reliability. For four of the
    seven exhibits, more than half the items fail to meet the criterion. From
    Table 3, one notes that two of the seventy-four items fail to meet the established criterion for all seven exhibits and only one item shows acceptable
    reliability for all seven exhibits.
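Tallies such as those reported in Tables 2 and 3 follow mechanically from the per-item reliability flags. The sketch below (hypothetical data layout, reusing the assumed criterion from the previous sketch) shows one way the counts could be produced:

```python
from collections import Counter
from statistics import stdev

def item_is_reliable(ratings, half_width=1.0, z=1.96):
    """Same assumed criterion as in the previous sketch: z * SEM within one category."""
    return z * stdev(ratings) / len(ratings) ** 0.5 <= half_width

def tally_failures(exhibits, n_items=74):
    """Count unreliable items per exhibit (as in Table 2) and the number of
    items failing on k exhibits (as in Table 3).

    exhibits: dict mapping an exhibit id to a list of n_items rating lists,
    each inner list holding that item's numeric ratings from every rater.
    """
    per_exhibit = {}
    failures_per_item = [0] * n_items
    for name, items in exhibits.items():
        failed = [i for i, ratings in enumerate(items) if not item_is_reliable(ratings)]
        per_exhibit[name] = len(failed)
        for i in failed:
            failures_per_item[i] += 1
    distribution = Counter(failures_per_item)  # k exhibits failed on -> number of items
    return per_exhibit, distribution
```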
TABLE 2
Items Failing to Meet Reliability Criterion on Each Exhibit

             Items Failing to Meet Criterion (Total = 74)     Number of
             Number               Percent                     Raters
Exhibit 1      44                   59                          5
Exhibit 2      31                   42                          5
Exhibit 3      38                   51                          5
Exhibit 4      26                   35                          4
Exhibit 5      42                   57                          4
Exhibit 6      48                   65                          4
Exhibit 7      36                   49                          6
TABLE 3
Items Failing to Meet Reliability Criterion Across Exhibits

Number of Exhibits     Number of Items Failing to Meet Criterion (Total = 74)
        7                               2
        6                               4
        5                              11
        4                              25
        3                              16
        2                               9
        1                               6
        0                               1
    The standard error of the mean was also computed for the two overall
    judgment items on the scale. This provides comparison with the individual
    items within the scale. On only one exhibit does the standard error of
    the mean for the overall rating fail to meet the established criterion. This
    occurs on both the pre- and postscale ratings. It should be pointed out that
    this result was found on the exhibit where the greatest number of individual items failed to meet the established criterion (Exhibit 6, Table
    2). Thus, while raters tend to disagree on the quality, success, or suitability of the individual items that go to make up the exhibit, they tend to
    agree (at least within plus or minus one category) that a given exhibit is
    Fair, Very Good, etc.
One is not encouraged to attach much significance to the reliability of
    the general ratings since they seem to be based on large areas of disagreement. The extent of this disagreement can perhaps be better understood
    by examples of the actual item ratings. To the item, “How would you
    rate the overall design of the exhibit?” the following results were obtained
    from six raters: 1 Excellent, 1 Very Good, 1 High Average, 1 Low Average,
    2 Fair. And on an exhibit evaluated by four raters, the question, “How
    would you rate the actual wording of the main title of the exhibit?” brought
    the following results: 1 Excellent, 1 Very Good, 1 Low Average, 1 Poor. In
    looking at the “Why” answers for this item, the areas of disagreement are
    revealed: “What else would be more clear?” “It speaks principally to this
topic.” “A title which would stir the curiosity of the audience must be used.”
    “Title not complete. Should be . . .”
    Some of the items that had varied ratings had quite similar comments,
    such as the following item evaluated by four raters: “The size of an exhibit is influenced by a variety of factors, some having to do with its
objective(s), its subject matter, amount of material displayed, the surrounding objects, etc. How would you rate the appropriateness of the
    size of this exhibit?” The responses were: 1 Very Good, 2 Low Average,
    and 1 Poor. To the question “Why” we found, “Possibly taking up too
    much space for its purpose and also with respect to other exhibits around
    it . . .,” “Seems a little large for material used,” “It could be smaller and
    still do the same job,” and, “Too large to present such few concepts.”
    No statistic was used to account for this anomaly of divergent ratings
    and convergent comments, since it is difficult to weigh the statements.
    While the raters all seemed to agree that the exhibit was too large, it is
    not clear how serious each rater considered this deficiency except by his
    own rating. The only logical conclusion one could draw from this situation
    is that although experts may agree on the nature of a particular deficiency,
    they may differ widely on the importance attached to that deficiency.
    Another way of looking at the reliability of the scale is to measure
    agreement among judges over the entire scale. That is, knowing how rater
    A scaled the individual items, how accurately can one predict how rater
    B will scale those items, using, of course, the same exhibit. To answer
this, all possible interjudge correlations (Pearson product-moment correlation coefficients) were computed for each exhibit. A total of sixty-three
correlations were computed. The distribution of these correlation coefficients is shown in Table 4. The individual coefficients range from -.17 to .58.
    There are three negative correlation coefficients. The median correlation
    coefficient is .24. From these results, it is evident again that there are large
    areas of disagreement. With few exceptions, knowing how one rated individual aspects of an exhibit would tell relatively little about how another
    would rate the same aspects in that exhibit.
    TABLE 4
    Distribution of Interjudge Correlations on All Exhibits
    (Total Number of Correlations = 63)
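The interjudge measure described above reduces to correlating every pair of raters’ item vectors for a given exhibit. A minimal sketch follows; the rater labels and ratings are invented, and Python’s statistics.correlation is used as a stand-in for the Pearson product-moment calculation:

```python
from itertools import combinations
from statistics import correlation  # Pearson product-moment r (Python 3.10+)

def interjudge_correlations(ratings_by_rater):
    """All pairwise correlations among raters of the same exhibit.

    ratings_by_rater maps a rater id to that rater's numeric item ratings,
    listed in the same item order for every rater.
    """
    return {
        (a, b): correlation(ratings_by_rater[a], ratings_by_rater[b])
        for a, b in combinations(sorted(ratings_by_rater), 2)
    }

# Hypothetical data: three raters, five items each, scored 1 (Poor) to 6 (Excellent).
exhibit = {"A": [6, 5, 3, 2, 4], "B": [4, 4, 2, 3, 5], "C": [2, 6, 5, 1, 3]}
for pair, r in interjudge_correlations(exhibit).items():
    print(pair, round(r, 2))
```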
    On the basis of these interrater measures, it may be concluded that the
    terms commonly used in exhibit literature for describing the effectiveness
    of an exhibit are not adequate or, at the very least, are not sufficiently
    reliable. There may be agreement that lighting, color, labels, etc., are important elements in exhibits, but those knowledgeable in the field seem
    not to agree as to the quality of these elements as they exist in a particular
    exhibit.
Internal Scale Reliability
    The draft scale was divided into two parts to permit a check on its
    internal consistency, or more accurately, the internal consistency of individual judges in rating specific categories. For most categories, a general item appeared in Part I of the scale and the more specific items falling
under that category appeared in Part II. A test for internal consistency
    would measure the degree to which individuals gave the specific items
    in a category (e.g., glare and reflection) the same rating as to the corresponding general item (e.g., lighting).
    This analysis was performed on only six of the fifteen categories. The
    six categories selected were: design, lighting, color, title, labelling, and
    general text. Because consistency in rating the items should be independent of the exhibit and because combining the exhibits would provide a
greater spread of scores, a single correlation coefficient (Pearson product-moment) was computed for each variable over all seven exhibits. Since
    each category contained more than one related item, each subject’s average rating for all the specific items in the category was correlated with his
    corresponding general item rating. The resulting six correlation coefficients
    are shown in Table 5. It is interesting to note that only two of these fall
below the highest interjudge correlation coefficient (.58). Thus, as would
    be expected, there tends to be greater consistency among judges in rating
    general and specific items in individual categories than there is between
    judges over the entire scale. However these correlations still illustrate that
    lack of agreement is evident in both interrater and intrarater measures.
    TABLE 5
    Internal Consistency Correlation Coefficients
    Category r
    Design .65
    Color .85
    Light .70
    Title .40
    Label .64
    Text .52
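The internal-consistency coefficients in Table 5 pair, for every completed scale, the rater’s average rating on a category’s specific (Part II) items with his rating of the corresponding general (Part I) item, pooling all exhibits before correlating. The sketch below illustrates that computation; the field names and example data are assumptions, not the study’s actual records:

```python
from statistics import correlation, mean  # Python 3.10+ for correlation()

def category_consistency(scales, category):
    """Correlate each rater's mean specific-item rating in a category with the
    same rater's general-item rating, pooled over all completed scales."""
    general_ratings = [s["general"][category] for s in scales]
    specific_means = [mean(s["specific"][category]) for s in scales]
    return correlation(specific_means, general_ratings)

# Hypothetical completed scales (one per rater-exhibit pair), scored 1 (Poor) to 6 (Excellent).
scales = [
    {"general": {"Light": 5}, "specific": {"Light": [5, 4, 6]}},
    {"general": {"Light": 2}, "specific": {"Light": [3, 2, 2]}},
    {"general": {"Light": 4}, "specific": {"Light": [4, 4, 3]}},
    {"general": {"Light": 3}, "specific": {"Light": [2, 3, 4]}},
]
print(round(category_consistency(scales, "Light"), 2))  # a high r means a consistent rater
```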
    Rater-Evaluation Sheet
    Each rater using the initial version of the scale was asked to fill out an
    evaluation sheet noting suggestions and criticisms. The comments thus
    collected indicated the following major drawbacks to the initial rating
    scale: it was too long (average time to complete was one hour, ten
    minutes), it was overly redundant, and the discrimination required on the
    six-point scale was too fine. The first two criticisms were not unexpected
    since redundancy was purposely built into the scale to permit intrarater
    measures to be taken. The latter objection was not anticipated, although
    it appeared to be well taken. Many reported that at times they had to
    resort to guess-work in choosing between two adjacent ratings. Many of
    them recommended a four-point scale as being more realistic.
    The raters consistently and enthusiastically noted that the scale forced
    them to “look” at an exhibit in a more analytic fashion, and that for this
    reason deficiencies came to their attention that had previously gone unnoticed. This is particularly interesting because most raters had been exposed to the exhibits for extended periods of time, and in many cases had
    actually conducted tours which used the rated exhibits.
    SCALE REVISION
    One objective in revising the rating scale was to shorten it. An effort
    was also made to reduce any apparent ambiguities in the format of the
    scale or in wording individual items.
    The remaining revision involved eliminating overlapping items and adding a few items about certain exhibit features that had not been adequately
    covered. As noted earlier, the intrusion of uninformed opinion was avoided
    in preparing the initial form of the scale. By the time the scale had been
    completed and tested, the authors felt better informed and did add several
    items. Since there had been some confusion as to how to rate an item when
    a feature was not present in a particular exhibit (motion, for example), the
    format was changed to make it appropriate to answer only if that feature
    were present. At the end of the scale the rater is given the opportunity to
    check those features which the exhibit does not include but which, in his
    opinion, should have been included. Comments from the rater’s evaluation
    sheets, the reliability data, and the data on internal consistency were all instrumental in making these revisions.
The revised scale is substantially shorter than the version used in the
tryout (thirty-five items yielding forty-eight separate questions), and is
    no longer divided into two separate parts. Because of the objections by
    the raters to the six-point scale, the scale in its revised form uses only four
    categories: Excellent, Good, Fair, Poor. Estimated time to complete the
    revised scale is forty-five minutes. A sample of items from the revised version of the rating scale is found at the end of this article.
    DISCUSSION
    A note of caution is in order in interpreting the results of this investigation. There was a rather wide diversity of qualifications and duties among
    the raters used in the tryout. Perhaps it could be said that the raters
represented different categories of exhibit expertise. Some of them were responsible for designing and fabricating exhibits, some for using and interpreting exhibits for the public, and some for the dissemination of
    technical information in the atomic energy field. Each of these individuals
    may be expected to view a particular exhibit from his own personal and
professional bias, based on his background, associations with the exhibits, etc. It may be argued that this diversity in raters would result in
    less consistent ratings than would be the case had the raters all been designers, or managers, or curators. On the other hand, one could argue that
    fundamental agreement ought to exist within the field on so important a
    question as exhibit effectiveness, regardless of the particular interests of
    any one group of specialists. In any case, due to the relatively small number of raters who used the scale in the present study, no effort was made
    to divide them into subgroups for analysis purposes. If the scale (as revised) is given wider use, it would be possible to separate raters by the
    categories noted above (and perhaps others as well) to see to what extent
    they differed in their judgment of exhibit characteristics, and to see if any
patterns emerge that relate specific occupation to ratings.*

* In this connection, a more detailed report of this study is available from the Clearinghouse for Federal Scientific and Technical Information, National Bureau of Standards, U.S. Department of Commerce, Springfield, Virginia 22151 (TID-22703, “An Evaluation of Existing Criteria for Judging the Quality of Science Exhibits”). This report contains the revised rating scale and the annotated bibliography used in developing the scale.
    It remains true that the data analysis indicates the general inadequacy
    and unreliability of published criteria as over-all guides to determining
    exhibit effectiveness, and suggests that there would be little to gain from
    an effort to test the validity of such criteria. Nevertheless, the scale seemed
    to make a worthwhile contribution to the more analytic inspection of
    exhibits. Each of the raters knew how he would improve the exhibit he
    rated. However, since raters tended to disagree on what the deficiencies
    were, these improvements might not lead to an increase in actual effectiveness as measured by some external measure, such as knowledge gained,
    attitudes changed, or people attracted. In short, the scale may at best lead
    to better informed opinions as to what constitutes an effective exhibit.
    At worst, it may mislead those who use it to believe that the categories on
    the scale are known to be related to actual effectiveness (they may or may
    not be), and, that the rater knows what the relationship is. It is therefore
    the opinion of the author that the scale should be made available to interested parties while making them aware of its limitations and its lack
    of demonstrated reliability. Additional use may lead to improvements and
    refinements in the scale which would increase its usefulness and perhaps
    even its reliability.
    One observation stands out very clearly as a result of this small-scale
    study, and that is the need for more clearly stated objectives for exhibits.
    This deficiency very likely contributed to the low reliability of the scale.
    Raters often reflected this need in their written comments and in their
    discussions after using the scale. They realized (many of them for the first
    time) that they had no baseline against which to judge the various elements. The question which should be asked is, “Specifically what do you
    want whom to do, know, or feel after seeing the exhibit that they could
    not do, know, or feel before seeing the exhibit?” Answering such a question in adequate detail would cut through much of the ambiguity, and
even mystique, that surrounds the exhibit field. Otherwise it is not possible
    to design reliable and valid measuring instruments that would determine
exhibit effectiveness, since it is not clear what should be measured. How can a rater judge the adequacy of a label if it is not known exactly what the label is supposed to communicate (teach) and the characteristics of the intended audience (age, background, education, etc.)?

If those who write about exhibit effectiveness have difficulty in communicating with others in the field (as the data here seem to indicate),
    then it is not surprising that attracting power is often equated with the
    success of an exhibit. A designer may not know what “coherent unity”
    means, and he may not have very specific objectives to use as a basis for
    his design, but he does know that he can attract people with clever and
    dramatic effects. And the success of such efforts can be easily and accurately measured.
    It is probably true that prescriptions for effective exhibit design will
    never be reduced to a set of specifications that can be looked up in a
    handbook. It is equally true, though, that until those responsible for the
    preparation of scientific and technical exhibits become more analytic and
    more concerned with objectives and evaluation, little real advance in the
    field will be made. No sensible person would want to take the art out of
    exhibit design; no sensible person should resist injecting more systematic
    knowledge into exhibit design. Ad hoc solutions to these problems are not
    only a bad risk, they are becoming increasingly more expensive.
    The techniques for improving this situation do exist. They have been
    successfully applied to other media of communication and information such
    as films, educational television, and programmed instructional materials.
    In each of these areas, improved statements of intended objectives and
    evaluation instruments based on these objectives are of primary importance. Better statements of objectives and improved evaluation instruments
    can be prepared in the exhibit area. Ultimately, design variables can be
related to effectiveness variables. Only when this is accomplished will it
    be possible to put the development of scientific and technical exhibits on a
    solid foundation.
    SELECTED ITEMS FROM THE REVISED
    EXHIBIT EFFECTIVENESS RATING SCALE
1. Not all subject matter lends itself to the exhibit medium. How suitable is this subject matter for exhibit presentation?
   Excellent   Good   Fair   Poor
   WHY?
   (NOTE: The above format was repeated for each item in the scale.)
2. How would you rate the following in terms of visitor ease of viewing?
   a. The exhibit’s distance from the visitor
   b. Its physical layout (height of exhibit; amount of material displayed; placement and arrangement of material within the exhibit)
3. How would you rate this exhibit on its appropriate use of color?
4. How would you rate the main title of this exhibit from the standpoint of:
   a. Wording-content
   b. Design
5. By its very nature, an exhibit can never tell the “complete” story or show “everything.” To what extent does this exhibit incorporate material which contributes most to the exhibit story and avoid including unrelated or unimportant material?
6. Consider “drawing power” for a moment. How would you rate the popularity of this exhibit?
7. Now consider “holding power.” To what extent do you feel this exhibit will hold visitor interest?
8. If this exhibit makes use of slide and film projection techniques, how appropriately are these devices used?
9. The appropriateness of the visual materials, the clarity of the textual material, and the spatial arrangement of the various exhibit elements all affect the exhibit’s intelligibility. How would you rate the overall exhibit for ease of comprehension by its intended audience?
    REFERENCES
    Bloomberg, Marguerite: An experiment in museum instruction. American Association of Museums, Washington, D.C., no. 8. 1929.
    Borhegyi, S. F.: Visual communication in the science museum. Curator, vol. 6,
    no. 1, pp. 45-57. 1963.
    Bureau of Social Science Research: Audience reaction to two ICS cultural exhibits: report on the pretest of a questionnaire. Washington, D.C., no. 518.
    1954.
Bureau of Social Science Research: The Japanese house: a study of its visitors and their reactions. Washington, D.C., no. 518. 1956.
Bureau of Social Science Research: The people’s capitalism exhibit: a study of reactions of foreign visitors to the Washington preview. Washington, D.C. 1956.
    Carmel, J. H.: Exhibition techniques-traveling and temporary. Reinhold Publishing Corporation, New York. 1962.
    Dale, E.: Audio-visual methods in teaching. The Dryden Press, New York. 1946.
    Gardner, J. and Caroline Heller: Exhibition and display. F. W. Dodge Corporation, New York. 1960.
Goins, A. E. and G. B. Griffenhagen: Psychological studies of museum visitors and exhibits at the U.S. National Museum. Museologist, vol. 64, pp. 1-6. 1957.
The effect of location, and a combination of color, lighting, and artistic design on exhibit appeal. Museologist, vol. 67, pp. 6-10. 1958.
    Hull, T. G. and T. Jones: Scientific exhibits. Charles C. Thomas, Springfield,
    Illinois. 1961.
    Inverarity, R. B.: Museum and exhibits. Museologist, vol. 81, pp. 2-4. 1961.
    Melton, A. W., N. G. Feldman, and C. W. Mason: Experimental studies of the
education of children in a museum of science. American Association of Museums, Washington, D.C., no. 15. 1936.
    New York Museum of Science and Industry: Exhibition techniques. 1940.
Robinson, P. V.: An experimental study of exhibit arrangement and viewing method to determine their effect upon learning of factual material. Unpublished doctoral dissertation, University of Southern California. 1960.
    Seattle World’s Fair, 1962. Institute for Sociological Research, University of
    Washington, Seattle. 1963.
    Weiss, R. S. and S. Boutourline Jr.: A summary of fairs, exhibits, pavilions, and
    their audiences. Robert S. Weiss. 1962.
The communication value of exhibits. Museum News, vol. 42, no. 3, pp. 23 ff. 1963.
Wright, G.: Some criteria for evaluating displays in museums of science and industry. Midwest Museum Quarterly, vol. 18, no. 3, pp. 62-71. 1958.
