What Do Consistency Estimates Tell Us about Reliability in Holistic Scoring?
Keywords: Consistency Estimates, Reliability, Holistic Scoring

Abstract
Essay writing is a widely used form of assessment in many types of examination. Preferred for their "validity and authenticity" (Hamp-Lyons, 2003:163), direct writing tests predominate in entrance and placement examinations, as well as in continuous assessment. Compared to indirect writing tests, however, their reliability is often questioned and their scoring procedures called into doubt. A true score cannot be observed directly; error stems from various sources, including the raters, their training, and the task (Huot, 1990), which makes essay marking uncertain and raters' scoring inconsistent. This study reports on a large-scale, high-stakes writing proficiency test taken by 441 students. The essays were holistically scored on a 7-point scale by 16 raters. The Pearson correlation coefficient was used to assess the degree of consistency between raters; the coefficient was calculated for each pair of judges in the 25 groups of students. Results show positive correlations overall, but the strength of the relationship varied considerably across the paired samples: correlations ranged from .16 to .91, with the majority between .50 and .74. These findings raise issues about the factors that threaten the consistency of scoring in writing tests.
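The consistency estimate reported above is the pairwise Pearson correlation between raters' scores. The sketch below, a minimal illustration in Python, shows how such per-pair coefficients can be computed; the rater labels and 7-point scores are hypothetical examples rather than the study's data, and scipy.stats.pearsonr is assumed to be available.

# Pairwise Pearson correlations between raters, illustrating the
# consistency estimate described in the abstract. All rater names
# and scores below are hypothetical.
from itertools import combinations
from scipy.stats import pearsonr

# Hypothetical holistic scores on a 7-point scale: one list per rater,
# aligned so that position i is the same essay for every rater.
scores = {
    "rater_A": [4, 6, 3, 5, 7, 2, 4, 5],
    "rater_B": [5, 6, 2, 5, 6, 3, 4, 6],
    "rater_C": [3, 5, 4, 4, 7, 2, 5, 5],
}

# One coefficient per pair of judges, mirroring the per-pair analysis above.
for (name_a, a), (name_b, b) in combinations(scores.items(), 2):
    r, p = pearsonr(a, b)
    print(f"{name_a} vs {name_b}: r = {r:.2f} (p = {p:.3f})")

A high r indicates that two raters rank essays similarly, but Pearson's r is a consistency estimate only: two raters who differ by a constant (one systematically harsher) can still correlate perfectly, which is one reason consensus estimates such as exact-agreement rates are sometimes reported alongside it (Stemler, 2004).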
References
Bachman, L. F. and Palmer, A. S. (1996). Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford: Oxford University Press.
Barkaoui, K. (2010). Explaining ESL Essay Holistic Scores: A Multilevel Modeling Approach. Language Testing, 27(3).
Brown, G. T. L. (2009). The Reliability of Essay Scores: The Necessity of Rubrics and Moderation. In L. H. Meyer, S. Davidson, M. Rees, R. B. Fletcher and P. M. Johnston (Eds.), Tertiary Assessment and Higher Education Student Outcomes: Policy, Practice and Research. Wellington, New Zealand: Ako Aotearoa.
Douglas, D. (2000). Assessing Language for Specific Purposes. Cambridge: Cambridge University Press.
Eckes, T. (2012). Operational Rater Types in Writing Assessment: Linking Rater Cognition to Rater Behavior. Language Assessment Quarterly, 9, 270-292.
Wong Fook Fei, Mohd Sallehhudin Abd Aziz and Thang Siew Ming (2011). The Practice of ESL Writing Instructors in Assessing Writing Performance. Procedia Social and Behavioral Sciences, 18, 1-5.
Greenberg, K. L. (1992). Validity and Reliability: Issues in the Direct Assessment of Writing. Writing Program Administration, 16(1-2).
Hamp-Lyons, L. (2003). Writing Teachers as Assessors of Writing. In B. Kroll (Ed.), Exploring the Dynamics of Second Language Writing (pp. 162-190). Cambridge: Cambridge University Press.
Huot, B. (1990). The Literature of Direct Writing Assessment: Major Concerns and Prevailing Trends. Review of Educational Research, 60(2), 237-263. http://www.jstor.org/stable/1170611
Huot, B. (1996). Toward a New Theory of Writing Assessment. College Composition and Communication, 47(4), 549-566. National Council of Teachers of English. http://www.jstor.org/stable/358601
Huot, B. and O'Neill, P. (2007). Introduction. In Assessing Writing: A Critical Sourcebook. Bedford/St. Martin's. http://casymposium.blogspot.com/2007/10/assessing-writing-introduction.html
Klapper, J. (2006). Understanding and Developing Good Practice: Language Teaching in Higher Education. London: CILT.
Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19, 246-276.
McNamara, T. F. (1996). Measuring Second Language Performance. London; New York: Longman.
Shohamy, E., Gordon, C. and Kraemer, R. (1992). The Effect of Raters' Background and Training on the Reliability of Direct Writing Tests. Modern Language Journal, 76(4), 513-521.
Stemler, S. E. (2004). A Comparison of Consensus, Consistency, and Measurement Approaches to Estimating Interrater Reliability. Practical Assessment, Research & Evaluation, 9(4). http://PAREonline.net/getvn.asp?v=9&n=4
Wang, P. (2009). The Inter-Rater Reliability in Scoring Composition. English Language Teaching, 2(3).
Weigle, S. C. (1994). Effects of Training on Raters of ESL Compositions. Language Testing, 11, 197-223.
Weigle, S. C. (1998). Using FACETS to Model Rater Training Effects. Language Testing, 15(2), 263. http://ltj.sagepub.com/content/15/2/263
Weigle, S. C. (2002). Assessing Writing. Cambridge: Cambridge University Press.