The Validity of Student Ratings Interpretations


Franklin, J., & Theall, M. (1990). Communicating student ratings to decision makers: Design for good practice. In M. Theall & J. Franklin (Eds.), Student Ratings of Instruction: Issues for Improving Practice. San Francisco: Jossey-Bass.
---------

Conversations with faculty and administrators...led increasingly to concerns about what users [e.g., chairmen, deans] were doing with the information we were providing. We saw that some departmental administrators, who routinely use ratings to make decisions about personnel, evaluation policy, and resource allocation, were not familiar enough with important ratings issues to make well-informed decisions...

We received many requests from faculty for assistance in interpreting reports, and we discovered that our clients would not or could not use many of the instructions for interpretation that we had provided. Clearly stated disclaimers regarding the limitations of ratings data in particular circumstances appeared to have little effect on the inclination of some clients to use invalid or inadequate data...

Our research findings, as well as anecdotal reports from many of our colleagues, suggest that many of those who routinely use ratings are liable to be seriously uninformed about critical issues. For example, among faculty respondents who reported using ratings for personnel decisions involving other faculty, nearly half were unable to identify likely sources of bias in ratings results, recognize standards for proper samples, or interpret commonly used descriptive statistics...

A great deal of scholarly attention has been paid to the validity and reliability of student ratings as a measure of instructional quality. Considerably less has been given to actual practice... Utilization of ratings is one of the least often studied or discussed issues in the realm of ratings phenomena. There are far fewer reported observations of ratings users in action in personnel decision making, or of the ways in which teaching improvement consultants use ratings in interactions with their faculty clients...

Even given the inherently less than perfect nature of ratings data and the analytical inclinations of academics, the problem of unskilled users, making decisions based on invalid interpretations of ambiguous or frankly bad data, deserves attention. According to Thompson (1988, p. 217), "Bayes' Theorem shows that anything close to an accurate interpretation of the results of imperfect predictors is very elusive at the intuitive level. Indeed, empirical studies have shown that persons unfamiliar with conditional probability are quite poor at doing so (that is, interpreting ratings results) unless the situation is quite simple." It seems likely that the combination of less than perfect data with less than perfect users could quickly yield completely unacceptable practices, unless safeguards were in place to ensure that users knew how to recognize problems of validity and reliability, understood the inherent limitations of ratings data, and knew valid procedures for using ratings data in the contexts of summative and formative evaluation.
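
To make Thompson's point concrete, here is a small worked example of the Bayesian arithmetic involved. It is a sketch in Python, and every number in it is invented for illustration rather than taken from the chapter.

# A minimal sketch of the base-rate problem behind Thompson's remark.
# All numbers are illustrative assumptions, not data from the chapter.
# Suppose 10% of instructors are genuinely poor teachers, and that low
# ratings flag a poor teacher 80% of the time but also falsely flag an
# adequate teacher 20% of the time.

p_poor = 0.10               # base rate of genuinely poor teaching
p_flag_given_poor = 0.80    # chance low ratings flag a poor teacher
p_flag_given_ok = 0.20      # chance low ratings flag an adequate teacher

# Bayes' Theorem: P(poor | flagged)
#   = P(flagged | poor) * P(poor) / P(flagged)
p_flag = p_flag_given_poor * p_poor + p_flag_given_ok * (1 - p_poor)
p_poor_given_flag = p_flag_given_poor * p_poor / p_flag

print(f"P(flagged)        = {p_flag:.2f}")             # 0.26
print(f"P(poor | flagged) = {p_poor_given_flag:.2f}")  # about 0.31

Even with a predictor that catches 80 percent of genuinely poor teaching, roughly two out of three flagged instructors in this sketch are in fact adequate. This is exactly the counterintuitive arithmetic that users unfamiliar with conditional probability get wrong.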

Whether the practices of those who operate rating systems or use ratings can stand close inspection has become open to question.  It is hard to ignore the mounting anecdotal evidence of abuse.  Our findings, and the evidence that ratings use is on the increase, taken together, suggest that ratings malpractice, causing harm to individual careers and undermining institutional goals, deserves our attention ...(pp. 78-80). 
               _________________________________

The mechanics and style of interpreting ratings appear to vary dramatically across the domains of ratings use, particularly with respect to the role of quantitative information.  It is our impression that many teaching consultants employ subjective, experientially based methods of dealing with information, while administrative decision makers may strive to construct empirically based (or "empirical looking") formulas...

There are some fundamental concepts for using numbers in decision making. To the degree that these concepts are ignored, interpretations of data become, at best, projective tests reflecting what the user [e.g., a chairperson or dean] already knows, believes, or perceives in the data. Treating tables of numbers like inkblots ('ratings by Rorschach') will cause decisions to be subjective and liable to error or even litigation...

Ratings are particularly subject to sampling problems, such as not having enough courses on which to base a comparison between two instructors and not involving enough students in rating each course section. Moreover, the fact that classes with fewer than thirty students are statistically small samples means that special statistical methods are required for some purposes... Substantially different models for analysis are also required for various uses of the data. Given such problems, there are many opportunities for error in dealing with numbers. Three types of errors come to mind immediately.
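
As a rough illustration of the small-sample point (the ratings, the five-point scale, and the mean_ci helper below are all invented for this sketch), the 95 percent confidence interval around a small class's mean rating is strikingly wide, and the t-distribution rather than the normal curve is the appropriate reference:

# Sketch: uncertainty around a mean rating shrinks slowly with class size.
# Ratings are invented values on a five-point scale.
import numpy as np
from scipy import stats

def mean_ci(ratings, confidence=0.95):
    """Confidence interval for a mean using the t-distribution,
    which is the appropriate reference for small samples."""
    ratings = np.asarray(ratings, dtype=float)
    n = len(ratings)
    se = ratings.std(ddof=1) / np.sqrt(n)            # standard error
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return ratings.mean() - t_crit * se, ratings.mean() + t_crit * se

small_class = [4, 5, 3, 4, 5, 4, 3, 5]        # n = 8 raters
large_class = small_class * 10                # n = 80, same spread

print(mean_ci(small_class))   # roughly (3.4, 4.8): well over a point wide
print(mean_ci(large_class))   # roughly (3.9, 4.3): far tighter

A class of eight raters leaves the mean rating uncertain by more than a full point on a five-point scale, which is why comparisons built on such sections need the special handling described above.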

The first involves interpretation of severely flawed data, with no recognition of the limitations imposed by problems in data collection, sampling, or analysis. This error can be compared to a Type I error in research -- wrongly rejecting the null hypothesis -- because it involves incorrectly interpreting the data and coming to an unwarranted conclusion. In this case, misinterpretation of statistics could lead to a decision favoring one instructor over another, when in fact the two instructors are not significantly different.
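
A minimal sketch of that comparison, with invented section means: a two-sample t-test shows an apparent gap between two instructors to be well within chance variation.

# Sketch: comparing two instructors' section-mean ratings.
# The data are invented; the point is that a visible gap in means
# can be statistically indistinguishable from noise.
from scipy import stats

instructor_a = [4.2, 3.8, 4.5, 4.0, 3.9]   # mean about 4.08
instructor_b = [3.9, 4.4, 3.6, 4.1, 3.7]   # mean about 3.94

t_stat, p_value = stats.ttest_ind(instructor_a, instructor_b)
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")

if p_value >= 0.05:
    # Favoring A over B on the raw means alone would be the Type I
    # analogue described above.
    print("No significant difference; the gap may well be noise.")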

The second type of error occurs when, given adequate data, there is a failure to distinguish significant differences from insignificant differences. This error can be compared to a Type II error -- failure to reject the null hypothesis -- because the user does not realize that there is enough evidence to warrant a decision. In this case, failure to use data from available reports (assuming the reports to be complete, valid, reliable, and appropriate) may be prejudicial to an instructor whose performance has been outstanding but who, as a result of the error, is not appropriately rewarded or, worse, is penalized.
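
One statistical face of this error is low power: with few observations per instructor, even a genuine difference will usually fail to reach significance. The sketch below is an assumption-laden illustration (the effect size and sample counts are invented), using the statsmodels power routines:

# Sketch: probability of detecting a true medium-sized difference
# (Cohen's d = 0.5) between two instructors at alpha = .05.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n in (5, 15, 30, 100):
    power = analysis.power(effect_size=0.5, nobs1=n, alpha=0.05)
    print(f"n = {n:3d} observations per instructor -> power = {power:.2f}")

With only five observations apiece, power is near ten percent: roughly nine times out of ten the genuinely better instructor will not look significantly different, and the deserved recognition described above is lost.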

The third type of error occurs when, given significant differences, there is a failure to account for or correctly identify the sources of differences. This error combines the other two types and is caused by misunderstanding of the influences of relevant and irrelevant variables. In this case, a personal predisposition toward teaching style...may lead a user to attribute negative meanings to good ratings, or to misinterpret the results of an item as negative evidence when the item is actually irrelevant and there is no quantitative justification for such a decision.
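
One way such misattribution arises can be shown with a small sketch (all numbers invented): a raw gap between two instructors' overall means that disappears entirely once class size, a frequently reported correlate of ratings, is held constant.

# Sketch: a raw ratings gap that vanishes within class-size bands.
# All data are invented. Tuples: (instructor, class-size band, section mean)
import numpy as np

sections = [
    ("A", "small", 4.4), ("A", "small", 4.5), ("A", "small", 4.3),
    ("A", "large", 3.8),
    ("B", "small", 4.4),
    ("B", "large", 3.7), ("B", "large", 3.9), ("B", "large", 3.8),
]

def mean_rating(who, band=None):
    vals = [r for (i, b, r) in sections
            if i == who and (band is None or b == band)]
    return float(np.mean(vals))

# Raw means make A look clearly better...
print(mean_rating("A"), mean_rating("B"))                    # 4.25 vs 3.95
# ...but within each class-size band the two are identical.
print(mean_rating("A", "small"), mean_rating("B", "small"))  # 4.40 vs 4.40
print(mean_rating("A", "large"), mean_rating("B", "large"))  # 3.80 vs 3.80

Here the entire difference comes from who happened to teach the small sections, not from teaching quality; attributing it to the instructors would be precisely the attribution error at issue.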

Any of these errors can render an interpretation entirely invalid...

How can we conceptualize the problem of ensuring that users do not make decisions or take actions that are based on invalid interpretations of data? In the following example, invalid interpretations are seen to result from either invalid or unreliable data or from lack of skill, knowledge, or necessary information on the part of the user. The strategy is to make sure that users either have or have access to sufficient skills or information to form valid hypotheses. Valid, reliable hypotheses are those interpretations of ratings that knowledgeable, skilled users, with adequate information concerning the present data, would be likely to produce or concur with.

Let us...state our goal in the following way: "The user will make decisions that are based on valid, reliable hypotheses about the meaning of data." In this case, the user should receive or construct working hypotheses that do the following things:

-Take into account problems in measurement, sampling, or data collection and include any appropriate warnings or disclaimers regarding the suitability of the data for interpretation and use.

-Do not attempt to account for differences between results when they are not statistically significant (for example, at the p < .05 level).

-Disregard any significant differences that are merely artifacts (for example, small differences observed in huge samples, which can be technically significant but practically unimportant; see the sketch following this passage).

-Account for any practically important, significant differences between results in terms of known, likely sources of systematic bias in ratings or reliably observed correlations, as well as in terms of relevant praxiological constructs about teaching or instruction.

The user should also refrain from constructing or acting on hypotheses that do not meet these conditions...  (pp. 87-89)...
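
The "significant but unimportant" artifact in the third condition above is easy to reproduce. In the following sketch (the data are simulated, not real ratings), a difference of three hundredths of a point on a five-point scale clears the p < .05 bar purely because the pooled sample is huge:

# Sketch: in very large samples a trivially small difference in mean
# ratings reaches p < .05. Simulated data, invented parameters.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20_000                                   # a huge pooled sample
a = rng.normal(loc=4.00, scale=0.8, size=n)  # instructor A, true mean 4.00
b = rng.normal(loc=4.03, scale=0.8, size=n)  # instructor B, true mean 4.03

t_stat, p_value = stats.ttest_ind(a, b)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd

print(f"p = {p_value:.4f}")           # typically well below .05
print(f"Cohen's d = {cohens_d:.3f}")  # about 0.04: a negligible effect

No sensible personnel decision should turn on such a difference, however small the p value attached to it.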

The validity of inferences or interpretations should concern those who design and operate ratings systems as much as the validity and reliability of the instruments used to obtain the data... How use occurs ought to be a very important issue, one for which those who develop ratings systems ought to be held accountable... (pp. 80-81).
                        _______________________________ 
  
 


                        John C. Damron, PhD
                        Douglas College, DLC
                        P.O. Box 2503
                        New Westminster, British Columbia
                        Canada V3L 5B2
                        Tel: (604) 527-5860
                        FAX: (604) 527-5969
                        e-mail: damronjc@dowco.com

                        http://www.douglas.bc.ca/

                        http://www.douglas.bc.ca/psychd/index.html 
 

 



Society for a Return To Academic Standards