[One can tell it’s reviewing and letter-writing season when I escape to blogging more often…]
There’s been some discussion on the NIPS experiment, enough of it that even my neuro-scientist brother sent me a link to Eric Price’s blog post. The gist of it is that the program chairs duplicated the reviewing process for 10% of the papers, to see how many would get inconsistent decisions, and it turned out that 25.9% of them did (one of the program chairs predicted that it would be 20% and the other that it would be 25%, see also here, here and here). Eric argues that the right way to measure disagreement is to look at the fraction of papers that process A accepted that would have been rejected by process B, which comes out to more than 50%.
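To make Eric’s point concrete, here is a rough back-of-the-envelope version of his metric. It is only a sketch: the figures of roughly 166 duplicated papers and a 22.5% acceptance rate are my assumptions for illustration, not official numbers from the experiment.

```python
# Sketch of Eric Price's disagreement metric under assumed figures.
# Assumptions (for illustration only): ~166 papers were reviewed by both
# committees, with a ~22.5% acceptance rate; 25.9% got inconsistent decisions.

n_papers = 166                  # papers reviewed by both committees (assumed)
accept_rate = 0.225             # overall acceptance rate (assumed)
inconsistent_rate = 0.259       # reported fraction with inconsistent decisions

n_inconsistent = n_papers * inconsistent_rate       # ~43 papers
# An inconsistent paper is accepted by exactly one of the two committees,
# so on average half of them appear in either committee's accept list.
accepted_by_A = n_papers * accept_rate              # ~37 papers
accepted_by_A_rejected_by_B = n_inconsistent / 2    # ~21.5 papers

fraction = accepted_by_A_rejected_by_B / accepted_by_A
print(f"Share of one committee's accepts the other rejects: {fraction:.0%}")
# Prints ~58%, which is Eric's "more than 50%".
```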
It’s hard for me to interpret this number. One interpretation is that it’s a failure of the refereeing process that people can’t agree on more than half of the list of accepted papers. Another viewpoint is that since the disagreement is not much larger than predicted beforehand, we shouldn’t be that surprised about it.

It’s tempting when having such discussions to model papers as having some inherent quality score, where the goal of the program committee is to find all papers above a certain threshold. The truth is that different papers have different, incomparable qualities that appeal to different subsets of people. The goal of the program committee is to curate a diverse and intellectually stimulating program for the conference. This is an inherently subjective task, and it’s not surprising that different committees would arrive at different conclusions. I do not know what the “optimal” amount of variance in this process is, but I would have been quite worried if it were zero, since that would be a clear sign of groupthink.

Lastly, I think this experiment actually points to an important benefit of the conference system. Unlike journals, where the editorial board tends to stay constant for a long period, in conferences one gets a fresh draw of the committee every 6 months or a year.
It’s nice to see a post where someone clearly acknowledges how much of a role subjectivity might play. I hope we’ll go some way to characterising that when we write up the results.
Yes, I think it’s a very good thing you added a prediction question ahead of time, since it helps calibrate whether or not we should consider the results surprising.
Knowing what it takes to be a PC chair of even a much smaller conference, I hope you will take your time and rest before writing up the results.
Thanks Boaz. The main issue I’m dealing with now is the backlog that chairing generated! Corinna’s also had a lot on her plate. Looking forward to a break over Christmas. In fact I’ve been looking forward to it for over a year now. With REF and NIPS things have been relentless!
It would be very interesting to look at the 25.9% of papers with inconsistent decisions one by one and see what really happened to each of them. Maybe the paper was considered borderline by both committees, but they made different decisions in the end. Or one committee found a weakness that the other did not. Or the paper had a champion in one committee, but not in the other. Or maybe one committee had stronger papers in this subarea, making this paper look less impressive. These are very different situations and it would be nice to know the main reason for the discrepancy.
Hi Daniel, we went through the papers doing this, mainly to check whether the rejecting side had found a fatal flaw not found by the accepting side. This hadn’t happened, and broadly speaking the reasons for different scores seemed to reflect subjective reviewer opinion.
More analysis here, collecting miscellaneous references on the subject and also agreeing with the inherently subjective nature of the process. In any case, there is room for some rethinking of the peer review system, especially in the new cyber/open-access age, and for trying out some new experiments and models. However, it’s definitely a hard task that cannot be pulled off on a widespread basis. One might also take into account Machiavelli’s famous quote about the new vs. old order…