Student evaluations of teaching are biased and unreliable

Universities should rethink how they use student evaluations of teaching because of their bias towards male instructors, argue Anne Boring, Kellie Ottoboni and Philip B. Stark

October 3, 2018

Many universities rely heavily or exclusively on student evaluations of teaching (SET) for hiring, promoting and firing instructors. After all, who experiences teaching more directly than students? But to what extent do SET measure what universities expect them to measure – teaching effectiveness?

To answer this question, we analysed data from a natural experiment at a French university (based on work by Anne Boring) and a randomised, controlled, blind experiment in the US (based on work by Lillian MacNell, Adam Driscoll and Andrea N. Hunt). We confirm and extend the studies’ main conclusion: student evaluations of teaching are strongly associated with the gender of the instructor. Female instructors receive lower scores than male instructors. SET are also significantly correlated with students’ grade expectations: students who expect to get higher grades give higher SET, on average. But SET are not strongly associated with learning outcomes.

Some studies have found little difference between average SET for male and female instructors, but the design of those studies has serious flaws. Not only are they observational studies rather than experiments, they also ask the wrong question, namely, “do male and female instructors get similar SET?”. A better question is, “would female instructors get higher SET but for the mere fact that they are women?”. We can answer that question using these unique data sets: yes.

The French data

Since effective teaching should promote student learning, students of more effective instructors should have better learning outcomes on average. Students in different sections of each course, taught by different instructors, take the same final exam, allowing us to compare learning outcomes. We find that SET are, at best, weakly associated with student performance.

Figure 1. Average correlation between SET and final exam score, by subject

Note: p-values are one-sided, since, if SET measured teaching effectiveness, mean SET should be positively associated with mean final exam scores. Correlations are computed for course-level averages of SET and final exam score within years, then averaged across years. *** p<0.01, * p<0.1
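
The statistic described in the note can be sketched in a few lines of Python. This is only an illustration: the note does not say how the p-values were computed, so the permutation test below is an assumed approach rather than the authors' method, the function names are hypothetical, and the numbers in `toy` are made up.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def avg_correlation(by_year):
    """Correlate mean SET with mean exam score within each year, then average across years."""
    return mean(pearson(sets, exams) for sets, exams in by_year.values())

def one_sided_p(by_year, reps=2000, seed=0):
    """One-sided permutation p-value (assumed method): the share of within-year
    shuffles whose average correlation is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = avg_correlation(by_year)
    hits = 0
    for _ in range(reps):
        shuffled = {}
        for year, (sets, exams) in by_year.items():
            perm = list(exams)
            rng.shuffle(perm)  # break any link between SET and exam score
            shuffled[year] = (sets, perm)
        if avg_correlation(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (reps + 1)

# Made-up averages: (mean SET, mean final exam score) per unit, grouped by year.
toy = {
    2010: ([3.1, 3.4, 2.9, 3.8], [11.0, 10.5, 11.2, 10.9]),
    2011: ([3.0, 3.6, 3.3, 2.8], [10.8, 11.1, 10.4, 11.0]),
}

if __name__ == "__main__":
    print(avg_correlation(toy), one_sided_p(toy))
```

The test is one-sided for the reason given in the note: only a positive association would support the idea that SET measure teaching effectiveness, so only shuffles with a correlation at least as large as the observed one count against the null.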

On the other hand, SET correlate significantly with instructor gender (male students gave higher SET to male instructors, Figure 2) and with students’ expected grades. This adds evidence to the hypothesis that SET do not promote better teaching. We find no evidence that male teachers are more effective than female teachers. If anything, students of male instructors perform worse on the final exam.

Figure 2. Average correlation between SET and gender concordance

Note: p-values are two-sided. *** p<0.01, ** p<0.05, * p<0.1

The US data

Lillian MacNell, Adam Driscoll and Andrea Hunt collected data from four online sections of a course, two taught by a male instructor and two by a female instructor. Students were assigned randomly to the four sections. The male instructor taught one section using his own identity and switched identities with the female instructor for the other section, and vice versa.

This lets us see how believing that an instructor is male or female affects SET for the very same instructor. We confirm the original authors’ main finding that students generally rate perceived female instructors lower in several dimensions of teaching.

Even on measures one would expect to be objective, ratings were lower for perceived female instructors. For instance, graded assignments were returned simultaneously in all four sections, but students reported that the perceived female instructor was less prompt in returning assignments.

Figure 3. Difference in mean ratings and reported instructor gender (male minus female)

Note: The scale is 1-5 points, so a difference of 0.8 is 20% of the full range. p-values are two-sided. *** p<0.01, * p<0.1
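
The arithmetic behind the "20% of the full range" remark, and one way a two-sided p-value for a difference in mean ratings could be obtained, can be sketched as follows. This is a minimal illustration under stated assumptions: the permutation test is an assumed approach (the post does not describe the method used), the function names are hypothetical, and the ratings in the example are made up.

```python
import random
from statistics import mean

def range_share(diff, low=1, high=5):
    """Express a rating difference as a share of the scale's full range."""
    return diff / (high - low)

def perm_test_diff(male, female, reps=5000, seed=0):
    """Two-sided permutation p-value (assumed method) for the difference in
    mean ratings, perceived male minus perceived female."""
    rng = random.Random(seed)
    observed = mean(male) - mean(female)
    pooled = list(male) + list(female)
    n = len(male)
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # reassign the "perceived gender" labels at random
        diff = mean(pooled[:n]) - mean(pooled[n:])
        if abs(diff) >= abs(observed) - 1e-12:
            hits += 1
    return observed, (hits + 1) / (reps + 1)

if __name__ == "__main__":
    # A 0.8-point gap on a 1-5 scale spans 0.8 / (5 - 1) of the range.
    print(range_share(0.8))
    # Made-up ratings for one item, grouped by perceived instructor gender.
    print(perm_test_diff([4, 5, 4, 5, 3], [3, 4, 3, 3, 4]))
```

Shuffling the labels mimics the experimental design, in which students were randomly assigned to sections: if perceived gender had no effect, any relabelling of the ratings would be equally likely.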

In both the French and US data, male instructors got higher SET, but in the US data, female students tended to give higher scores to perceived male instructors, whereas in the French data, male students tended to give higher scores to male instructors.

Figure 4. Difference in mean SET by student gender, for perceived and actual instructor gender (male minus female)

In another study, researchers are finding that female instructors receive lower scores because male students give lower scores to female instructors.

Differences among these studies could be cultural or related to topic, class size, mode of instruction (online versus face-to-face), ethnicity, race, physical attractiveness, or other confounding variables that have been found to affect SET. Clearly, there can be no simple adjustment for the bias.

The French data show that bias varies by course subject, further complicating any attempt to correct for these biases. The only field in which male students do not rate male instructors significantly higher is sociology. This is especially interesting because sociology is the only field in which there was near gender balance among instructors (46.4 per cent female instructors). This could suggest that gender balance in a field affects gender stereotypes and might reduce bias against female instructors.

Why don’t universities use better methods? SET are the familiar devil. Habits are hard to change. Alternatives (reviewing teaching materials, peer observation, surveying past students) are more expensive and time-consuming, and this cost falls on faculty and administrators rather than on students.

The mere fact that SET are numerical gives them an unearned air of scientific precision and reliability. And reducing the complexity of teaching to a single (albeit meaningless) number makes it possible to compare teachers. This might seem useful to administrators, but it is a gross oversimplification of teaching quality.

Evidence of any connection between SET and teaching effectiveness is murky, whereas the associations between SET and grade expectations and between SET and instructor gender are clear and significant. Because SET are evidently biased against women (and likely against other underrepresented and protected groups) and, worse, do not reliably measure teaching effectiveness, the onus should be on universities either to abandon SET for employment decisions or to prove that their reliance on SET does not have disparate impact.

This blog post is based on a preprint.

Anne Boring is a research fellow at Sciences Po and a research affiliate at Paris Dauphine University. Kellie Ottoboni is a PhD student in the statistics department at the University of California, Berkeley and a fellow at the Berkeley Institute for Data Science. Philip B. Stark is professor of statistics and associate dean of mathematical and physical sciences at the University of California, Berkeley.
