International surveys of educational achievement and functional literacy are increasingly common. We consider two aspects of the robustness of their results. First, we compare results from four surveys: the Trends in International Mathematics and Science Study, the Programme for International Student Assessment, the Progress in International Reading Literacy Study and the International Adult Literacy Survey. This contrasts with the standard approach, which is to analyse just one survey in isolation. Second, we investigate whether results are sensitive to the choice of item response model that survey organizers use to aggregate respondents’ answers into a single score. In both cases we focus on countries’ average scores, on the within-country differences in scores and on the association between the two.