Another nerdy tab post: What’s the point of speaker points in forensics?

In other words, the epistemology of “the top competitors at a given tournament.”

When we break ties for a final, we’re asking how to determine the top 6 (or so) speakers at a particular speech tournament, or how to distinguish between several teams that may have the same debate record. But how do we know who those are? What makes the difference between #4 and #5?

My last post got me to thinking about the whole subject of speaker points. If judges have their own individual ranges, and there’s very little that compels a judge to follow a certain standard^[1], then why do we have speaker points in the first place? And do they help us distinguish between different kinds of speakers?

It’s long been argued that tournaments need speaker points because judges need some kind of way of communicating about the quality of the round — a 25 out of 25 in a speech round represents “perfection” or nearly so. Yet, many judges as a matter of course start off with a 25 for the first place speaker, so clearly, not every judge subscribes to that paradigm. I do recall a tournament where a judge, as a form of protest of speaker points, gave every speaker 25 speaker points.

As I alluded to in my previous post, the average speaker points given by judges have definitely increased over the years, and this seems to hold true regardless of the format: individual events, policy debate, parliamentary debate, and even IPDA debate. Even though I was typically below the normal averages^[2], my speaker points generally increased over the years. But then again, in the late 1980s/early 1990s, judges felt freer to use more of the 1-25 range in speech events than they do now.^[3] In my case, here are some interesting numbers:^[4]

Season	Mean (out of 25)	Standard Deviation
1989-1990	17.39	3.97
1997-1998	19.05	2.54
1998-1999	19.76	2.20
2010-2011	19.94	2.53
2020-2021	20.31	2.20

My speaker points over the years: Individual Events

Season	Mean (out of 30)	Standard Deviation
1997-1998	24.10	3.08
1998-1999	25.06	2.40
2010-2011	26.03	1.63
2019-2020	26.58	2.16

My speaker points over the years: Parliamentary Debate

Now there should be a couple of qualifications about my speaker points: the number of rounds/speakers I saw each year definitely was not consistent. In 1997-1998, I judged about 90 rounds, and thus saw about 400 speakers. I was in a much higher percentage of tabrooms^[5] the longer my career went on (in 1997-1998, 2 out of 20 tournaments; in 2010-2011, 11 out of 17 tournaments).

Policy debate has been quite helpful in examining the trends regarding debate speaker points in particular, and Dr. Gary Larson and Jason Regnier have done a lot of work in the area. Jason’s work in particular is quite comprehensive and looked at the entirety of the NDT judging community. Given a relatively stable set of judges, and given a finite number of teams, it’s possible to make some fairly educated assumptions as to how debaters perform in the course of a season (or indeed, a career).

But what about the run-of-the-mill speech or debate tournament, which has varying judge pools, varying types of competition, and these days, variances as to online vs. in-person tournaments? And there are arguably far more students competing in individual events than there are in policy debate. How do we make those same types of determinations about the quality of speakers? After all, Bradley and Bethany Lutheran draw different types of competition^[6], and have different types of judges. We don’t always have that much data about a local hired judge.

What makes Regnier’s data interesting is that there’s a relatively consistent pool of judges, each of whom has to judge at least 12 rounds in order to be eligible to judge at the NDT (National Debate Tournament). So those judges have some motivation to be considered regular judges, and some motivation to want to conform to norms, since students and coaches are allowed to prefer certain kinds of judges as opposed to others.^[7] And back in the days when Regnier published his data about judging, there were several dozen judges in the database, each of whom judged at arguably a similar mix of regional and national-level tournaments. That’s far different than a speech judge who judges at a Twin Cities Forensics League tournament on a Tuesday and then judges at a large national-level tournament the following weekend. It’s like umpiring a minor-league baseball game one day, and then the playoffs the next.^[8]

There’s a theoretical question that gets talked about in judges’ lounges and tabrooms, but not to my knowledge in any kind of scholarly work: should the point values remain constant across different kinds of tournaments? In other words, should a 25 at a TCFL mean the same as a 25 at Bradley? Most, but not all, judges would say that it shouldn’t. Some judges believe that the best speaker at that tournament deserves a 25, whereas other judges may give a 25 only a few times per season, if at all. And yet, how often do we see judges give the usual 24-23-22-21-20-19 point distribution at every single tournament?

It would be interesting to see if the standard deviation for each judge is relatively consistent. Some individual events judges will have a scale such as the following:

1st place speaker – 24 points
2nd place speaker – 23 points
3rd place speaker – 22 points
4th place speaker – 21 points
5th place speaker – 20 points.

In that scenario, the mean = 22, standard deviation = 1.58.^[9]

But what about the judge who does the following:

1st place speaker = 23 points
2nd place speaker = 21 points
3rd place speaker = 19 points
4th place speaker = 18 points
5th place speaker = 17 points

For the second judge, the mean is 19.6, with a standard deviation of 2.41.
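For anyone who wants to check the arithmetic on the two scales above, here is a minimal sketch in Python. It assumes the sample standard deviation (which is what `statistics.stdev` computes); the names `judge_one`, `judge_two`, and `summarize` are just labels made up for this illustration.

```python
from statistics import mean, stdev

# Point scales copied from the two example judges above (1st through 5th place).
judge_one = [24, 23, 22, 21, 20]
judge_two = [23, 21, 19, 18, 17]

def summarize(points):
    """Return (mean, sample standard deviation), each rounded to 2 places."""
    return round(mean(points), 2), round(stdev(points), 2)

for label, points in (("Judge 1", judge_one), ("Judge 2", judge_two)):
    avg, sd = summarize(points)
    print(f"{label}: mean = {avg}, standard deviation = {sd}")
```

With these inputs, the first judge comes out at a mean of 22 and a standard deviation of about 1.58, and the second at a mean of 19.6 and a standard deviation of about 2.41.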

This judge is likely sending multiple signals: first, that there are significant differences between the top 3 speakers as compared to the bottom ranked speakers, and second, that there’s a large difference between the first and fifth ranked speaker.

Some tournaments allow judges to tie points and some don’t. That’s worth a full post or two in and of itself as to how standard deviation changes impact who makes elimination rounds. And debate tournaments allow gradations of either 0.1 points or 0.5 points while using a 30 point scale. To explain all of that would require another post. And some speech tournaments use a 70-100 scale instead of a 1-25 scale, which I’m personally a fan of and implemented at two different national tournaments. To justify that (you guessed it!) would be another post.

The problem is that at any one tournament, judges are judging a wide variety of events, and a wide variety of skill levels. And community norms have developed such that if a judge gives too low of a rating, there’s a strong likelihood that judge will get questioned by both the tournament and the coach of the student receiving that rating. Thus, such a judge needs to put a very clear justification for a low point rating.^[10]

So what do we do about speaker points? One possible solution would be to look at all of the points that judge has given in the past. But the problems that Regnier doesn’t have to deal with rear their ugly heads for other kinds of tournaments: different quality of teams, differing quality of students in each round, and judges who vary their speaker points (intentionally or unintentionally) by event and/or their own comfort level with that event.^[11]
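To make the "look at a judge’s past points" idea a little more concrete, here is one possible sketch in Python: restate each point value as a z-score against that judge’s own history. This is only an assumption about how such an adjustment might work, not a method any tournament or researcher mentioned here is known to use, and it does nothing about the problems just listed (differing fields, events, and comfort levels). The function `standardize` and both histories are hypothetical.

```python
from statistics import mean, stdev

def standardize(points, history):
    """Express each point value as a z-score against the judge's own history.

    `history` is every point value the judge has given before (hypothetical
    data here). It needs at least two values, and they cannot all be equal,
    or the sample standard deviation would be zero.
    """
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        raise ValueError("judge has given identical points every time")
    return [(p - center) / spread for p in points]

# Hypothetical histories: a compressed-range judge and a wider-range judge.
narrow_history = [24, 23, 22, 21, 20]
wide_history = [23, 21, 19, 18, 17]

# The same raw 23 reads differently depending on who gave it.
z_narrow = standardize([23], narrow_history)[0]
z_wide = standardize([23], wide_history)[0]
print(round(z_narrow, 2), round(z_wide, 2))
```

In this made-up case, a 23 from the compressed-range judge sits about 0.63 standard deviations above that judge’s average, while a 23 from the wider-range judge sits about 1.41 above. Whether that difference reflects the speakers or just the rounds each judge happened to see is exactly the open question.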

It’s worth mentioning why we’re even having this discussion: we still have to break ties in some kind of way, otherwise the tournament can easily become unwieldy (10 people in a final anyone?).

And more importantly, the assumptions that we make about systems, and especially the individual forensics tournament as a system, assume that each speaker is relatively consistent. Dare I even say that the speaker follows a normal distribution in terms of the quality of their performances? What about that speaker that improves consistently throughout the year, or that speaker that just has the tournament of their life? People will over/under-perform during the course of a tournament…that’s just a given. And really, what we’re measuring is performance at *a* tournament in a given round against a certain group of competitors, hopefully with enough randomness in the tournament and each of the rounds such that each competitor is fairly distributed against a representative sample of other competitors and judges, and that someone doesn’t have too many elite or not-so-elite competitors in their round.^[12]

The forensics community and the gymnastics community^[13] suffer from the same problems, but in slightly different ways. The gymnastics community has agreed on what is considered the “ideal” when it comes to certain elements of performances, and has agreed upon penalties for not fulfilling those elements. In forensics, we’ve long written about unwritten rules^[14]. We each have different ideals about the different events. And when those ideals are dramatically different, you get the final round where someone takes 1st place on one ballot and 6th place on another ballot: same speech, but people see it very differently.

I know that for some of you reading this, statistics is one of those bad words… and others may read this and think “La-la-la…I’m not sure if I understand or care.” But until the forensics community can do more to think about how it judges and what level of deviation it wants to allow from judges, and how and why we’ve chosen to keep the system that we have of breaking ties, it will remain difficult to really answer the question of how we determine the best competitors at any given tournament.

Notes
↑1	The National Parliamentary Debate Association used to put speaker point ranges on ballots with some amount of detail, and individual events tournaments often will have ranges corresponding to “excellent,” “good,” “fair” and “poor.”
↑2	Which is probably just my perception…since there are no really good studies as to what those averages are/were.
↑3	Of course, I’m nerdy enough to have the data for all but one tournament I judged, and that’s only because I must have lost the piece of paper from a tournament I judged in 1996.
↑4	I’ll spare y’all the full tables. I do have them though. :)
↑5	The behind-the-scenes part of the tournament that does little if any judging because it’s where the tournament is tabulated and directed.
↑6	Although there are some competitors who would do well at either tournament!
↑7	For my non-forensic friends, the simplest way to explain is this: in policy debate and in certain other kinds of debate tournaments, students and coaches can choose to constrain themselves from judging certain students or teams, and can rank the judges in the order they’d prefer to be judged by them. Each team is allowed to do so, and the computer figures out the best matches. So if you and I prefer the same judge equally, that’s likely who is going to judge our round. There are good reasons for constraints: for example, if a judge were to judge someone they were teammates with last year, that’s probably not a good thing. If someone from school A were to judge someone they’re dating from school B, again, not a good thing. There’s a whole lot more to the discussion of judging preferences and constraints, but I’ll stop here, since that’s another post in and of itself.
↑8	But even that analogy isn’t perfect — quite good national-level competitors would go to Twin Cities tournaments to try out significant changes in their speeches, or new events, or just to get back into tournaments after being away for a semester — kind of like when a major leaguer does a rehab assignment in the minors.
↑9	I’m using the standard deviation of a sample here; arguably there are other ways of calculating the standard deviation. I’m more interested in the differences in standard deviations for this argument.
↑10	Racist, sexist, and similar comments are normally a very good reason, as well as ad hominem attacks in debate rounds and significant rudeness. Or if a speech is 30 seconds long and is supposed to be 3-5 minutes long, a judge can generally get away with giving low speaker points, but even then, that’s usually on the order of the student receiving 10-12 points out of 25, and some judges would still go much higher than that.
↑11	I found in working on this post that I’m much harder in extemporaneous speaking and impromptu rounds. My old teammates and coach wouldn’t be too surprised: those were my best events, along with communication analysis/rhetorical criticism. Perhaps this will encourage me to look at my averages and standard deviations by event. But my files aren’t set up that way, and having judged at over 400 tournaments in my career — that’s a project for a long break.
↑12	As my mind drifts back to a persuasion round at UW-Whitewater where I had 4 national elimination round contestants from the previous year in my preliminary section. I took 4th in the round, and was quite happy.
↑13	Figure skating could be used here too.
↑14	I’d list all the sources that have talked about them for the last 40 years, but that could be another post in itself…. And I’m admittedly a bit lazy to do so here. Regardless, it’s not too hard to find those references.