Ethical Considerations in Second Language Testing and Assessment: A Path Towards Fairness, Transparency, and Accountability

Language testing and assessment, especially in the context of second language acquisition, are instrumental in shaping educational trajectories and societal integration (Lynch, 2001). These processes, however, are not merely technical or administrative tasks; they are laden with ethical considerations that directly influence their fairness, transparency, and accountability (Estaji, 2011). Historically, language testing and assessment were largely viewed through a psychometric lens, focusing on reliability, validity, and standardization (Abedi, 2006). This perspective, rooted in the traditional paradigm of language as a fixed system of structures, prioritized accuracy and uniformity. However, it often overlooked the complex sociocultural factors influencing language learning and use, leading to potential biases and inequities.

The latter part of the 20th century witnessed a paradigm shift in the field, with the advent of communicative language teaching (Savignon, 1987) and the recognition of language as a dynamic social practice (Bandura, 2001; Li, 2018). This led to a more holistic and inclusive approach to language testing and assessment, emphasizing communicative competence, cultural sensitivity, and real-world applicability. Yet, even as the field evolved, ethical considerations remained a critical concern, underscoring the need for ongoing dialogue and action (Kunnan, 2000). Theoretical advancements have further highlighted the centrality of ethics in language testing and assessment (ibid.). Contemporary theories argue that ethical considerations are not peripheral to the testing process but integral to its purpose and function (Taylor, 2009).

This essay situates itself within this evolving academic context, arguing for the necessity of addressing ethical considerations in second language testing and assessment. The core thesis posits that ensuring (1) unbiased and fair testing and (2) transparency and accountability requires confronting and addressing the ethical considerations inherent in these processes. The main sections of the present essay explore each of these two ethical considerations in detail, followed by implications for teaching practice and potential solutions. As a former tutor for CAT4 (Cognitive Abilities Test, 4th Edition) Verbal Reasoning, I provide sample items I encountered on this test to better illustrate the arguments proposed in the essay.

Bias in Language Testing

For language testing and assessment, bias remains a pervasive issue, particularly concerning cultural, linguistic, and educational backgrounds. As delineated by Kunnan (2000), test fairness embodies a multifaceted construct encompassing diverse aspects such as test-taker traits and contextual factors surrounding the testing environment. This intricacy frequently culminates in systemic bias that can disproportionately impact specific student populations. For example, students from distinct cultural or linguistic backgrounds may grapple with culturally exclusive test items, or with a conception of language proficiency that differs systematically from the one emphasized in their own schooling. Viewed comprehensively, these biases primarily manifest as cultural bias in test content, linguistic bias in the language of test items, and educational bias in the assumptions made about students' prior knowledge and experiences (Shohamy, 2001). They can subsequently precipitate inequitable outcomes, especially for individuals from marginalized or minority groups. These biases may manifest as reduced test scores for the affected students, thereby exerting a substantial influence upon their academic and professional pathways.

One of the critical areas of bias in language testing lies in the realm of cultural bias. Roever and McNamara (2006) discussed the issue of test items frequently being culturally specific, deeply rooted in the experiences and knowledge acquired by individuals from a particular cultural group. The presence of culturally exclusive content poses a challenge for test-takers from diverse backgrounds, as they may not have experienced or comprehended the cultural context incorporated within the questions. To exemplify this issue, consider an English language proficiency assessment that includes a question centred around baseball – a sport that may not be universally understood or played across all cultures. Test-takers who are not familiar with baseball are placed at a disadvantage as their linguistic competency becomes erroneously conflated with their cultural knowledge. Moreover, while teaching the CAT4 Verbal Analogy section, I encountered various culturally embedded items that could reinforce such biases. One striking instance is Test Sample 1, which concerns traditional customs:

Test Sample 1 from CAT4
Thanksgiving → Turkey = Christmas → ?
A. Pumpkin B. Fireworks C. Ham D. Easter Bunny E. Champagne

The inherent cultural bias in this type of test item is likely to contribute to lower scores for students whose cultural backgrounds do not encompass these specific customs.

Similarly, linguistic bias also plays a significant role in disadvantaging certain groups of students. A pervasive assumption underlying language assessments is the reliance on standardized dialects, primarily influenced by Western and English-speaking conventions (Lippi-Green, 2012). This presumption poses challenges for students from diverse linguistic backgrounds who speak other dialects or languages in their home environments (i.e., students from outer- or expanding-circle countries; Pennycook, 2017). These individuals potentially face two layers of difficulty – reduced familiarity with the test language and confrontation with dialectal discrepancies that may impede comprehension and articulation. The consequential effect materializes as diminished test scores for such students; however, this outcome does not accurately reflect their genuine language competence. Rather, it is indicative of ingrained linguistic biases present in the design of these assessments.

Test Sample 2 from CAT4
Choose the word that does not belong:
A. Mum B. Mom C. Mama D. Mother E. Father

For example, in Test Sample 2 from CAT4, the intended correct answer is E, "Father," as it denotes a different parental role from the others. However, this question may inadvertently introduce bias against test-takers with varying dialectal familiarity. Those acquainted with British English, for instance, may use "Mum" rather than the American counterpart, "Mom." The phrasing of the question implies that one term is erroneous or unsuitable, possibly placing students familiar with alternative dialects at a disadvantage. Furthermore, the question assumes that test-takers originate from cultural backgrounds or family structures that clearly demarcate these parental roles. Such an assumption may not be universally applicable and could perpetuate exclusionary tendencies within the evaluation process.

Furthermore, educational bias in language testing often results from assumptions made about students' prior knowledge and experiences. Kunnan (2013) posits that developers of such tests often overlook the diversity in educational backgrounds among examinees. Consequently, they inadvertently include items that favour certain educational experiences over others. This predisposition can hinder students with differing backgrounds, culminating in lower scores and a distorted portrayal of their language proficiency.

Test Sample 3 from CAT4
Cello → Orchestra = ? → NBA
A. Baseball B. Basketball C. Tennis D. Goal E. Football

A prime illustration of potential bias is evident in Test Sample 3. Here, the validity of this item relies on an assumed familiarity with Western cultural knowledge—specifically, understanding the role of a cello within an orchestra and awareness of the National Basketball Association (NBA). Students from varying cultural or educational contexts may lack exposure to these facets of information, which places them at a comparative disadvantage. Rather than experience with the content embedded in this test item, they might possess knowledge of alternative musical instruments, sports, authors, or artists that remain underrepresented in such assessments. It is crucial to acknowledge that this issue also ties into concerns regarding test transparency: specifically, the inappropriate use and misinterpretation of results stemming from such bias-laden evaluations. Instead of fulfilling their intended purpose as measures of language and cognitive abilities, they may turn into proxies for assessing intelligence or socio-economic status—an outcome that deviates significantly from their original objective (see later sections for more).

In light of the discussed biases, it is also imperative to recognize their interconnected nature, resulting in a cumulative disadvantage for specific cohorts of learners. For example, students representing distinct cultural origins, utilizing non-standard dialects, and navigating unfamiliar educational systems may experience a threefold disadvantage within language assessments. This intersectionality of bias in language testing necessitates a more nuanced understanding of the issue and the development of equitable testing practices.

Such inequality in student assessment also has considerable consequences for motivation and engagement in the learning process. Persistent underperformance on biased examinations can demoralize learners, leading to disconnection from academic pursuits and subsequent adverse effects on their long-term educational achievements and career prospects (Fusarelli, 2004; Harber, 2023). This disenchantment is notably observed among students from diverse backgrounds who may perceive language tests as prejudiced due to inherent biases, leaving them feeling as though their cultural, linguistic, or educational experiences are inadequately valued and acknowledged. Kunnan (2004) explored the ramifications of perceived unjust testing practices on students, finding that such perceptions correlate with diminished motivation, reduced academic self-concept, and heightened anxiety among learners. These factors can significantly impact students' academic pathways, resulting in suboptimal educational attainment and constrained opportunities for societal integration.

Furthermore, the demotivating effects of bias in language testing can be exacerbated for students who are already marginalized or disadvantaged. For these students, biased language tests can reinforce existing inequities and further marginalize them within the education system (Lynch, 2001). Moreover, these demotivating consequences extend beyond individual experiences, manifesting at the collective level as well. Patterns of poor performance on language assessments due to inherent biases can contribute to the development of stereotype threat among student groups (Steele & Aronson, 1995). This phenomenon entails a heightened fear of reinforcing pre-existing negative stereotypes about one's own group, subsequently resulting in diminished performance. As a consequence, stereotype threat has the potential to demoralize students further, augmenting disengagement from learning and exacerbating societal inequities.

Transparency and Accountability in Language Testing

Besides biases presented above, the importance of transparency and accountability in language testing and assessment cannot be overstated. These principles are fundamental to the integrity of the testing process, and their absence can lead to misuse or abuse of tests, mistrust, and potential harm to students.

Transparency and accountability are two critical factors in the practice of language testing that enable the assessment process to be open and fair. This discussion draws from Roever and McNamara (2006) and Shohamy (2001), who outline the importance of each respective factor and their implications for various stakeholders, including test-takers, educators, parents, and policymakers. Transparency in language testing encompasses the clarity and openness surrounding the test process. The significance of transparency lies in empowering test-takers with a comprehensive understanding of expectations and evaluation methods for their performance. Moreover, it enables stakeholders—such as educators, parents, and policymakers—to critically analyse the testing process and hold test developers and administrators accountable. Accountability, on the other hand, pertains to the responsibility of test developers, administrators, and users to ensure that the test is used appropriately and ethically. This includes designing and administering the test in a way that minimizes bias and maximizes fairness, and using test results appropriately and ethically. Accountability also means that these individuals and organizations are answerable for any negative consequences resulting from the misuse or abuse of the test.

The absence of these important features can lead to severe consequences, including the misuse or abuse of assessments. Instances of such inappropriate use involve employing test results for purposes that they were not designed or validated for (Kunnan, 2004). Consequently, test-takers may face unjust outcomes, such as denial of opportunities or benefits based on invalid or untrustworthy results (Roever & McNamara, 2006). In Test Sample 3 (Cello → Orchestra = ? → NBA), for example, cultural and educational biases become apparent, and the pronounced influence of socio-economic factors on a student's performance is highlighted. The term 'Orchestra' implicitly assumes familiarity with Western classical music, which is often affiliated with specific socio-economic classes. This connection emerges from various factors, such as the expense of music lessons and instruments and the cultural capital associated with this musical genre. Employing tests like these to estimate a student's socio-economic status, rather than their language proficiency (CAT4 has commonly been used as an entrance assessment by UK private schools), can perpetuate systematic disadvantages for individuals from diverse socio-economic backgrounds. The harm also extends beyond the individual, with profound implications for entire communities: because such tests might inform decision-making on educational policies or immigration regulations, they can reinforce socio-economic disparities in access to opportunities.

Potential Strategies and Solutions

Addressing these ethical issues in second language testing and assessment requires comprehensive strategies targeting both inherent biases and issues surrounding transparency and accountability. Adopting a multipronged approach that involves all stakeholders—test developers, administrators, educators, students, and policymakers—can foster a more equitable testing environment and minimize adverse outcomes.

In pursuit of equitable assessments, assembling multidisciplinary teams of test developers with varied cultural, linguistic, and educational backgrounds holds promise (Gordon, 1995; Taylor, 2009). To tackle cultural bias, overdependence on culture-specific knowledge or presumptions should be avoided in favour of a culturally impartial stance. Where cultural particulars are inescapable, the context selected ought to possess either global familiarity or be accompanied by a comprehensive explanation within the assessment. Addressing linguistic bias involves the integration of diverse dialects and languages within test items. This strategy can alleviate the exclusion and unwarranted penalization of individuals employing non-standard dialects. With regard to educational bias, it is essential for tests to curtail assumptions about students' prior knowledge or experiences. Furthermore, vigilance should be exercised to preclude any inherent bias that favours particular educational systems or backgrounds in test items.

To promote transparency and accountability, test providers should publish comprehensive details about test design, administration procedures, and scoring methods (Taylor, 2009). To mitigate potential misuse in language proficiency testing, education plays a crucial role (Gebril, 2023). Informing test users about appropriate interpretations and uses of test results can counteract adverse effects. For example, underscoring that such tests are not valid indicators of intelligence or socio-economic status will help discourage improper applications of the results. Moreover, fostering accountability can be achieved through regular evaluations and systematic audits of the testing processes (Spolsky, 2013). Establishing ethics review boards contributes to this endeavour by ensuring that tests adhere to ethical guidelines, minimize biases, and maintain fairness in assessing language proficiency. These review boards can also serve as platforms for addressing grievances and remediating instances of misuse or unethical practices within the testing field.

Following such implementation of review boards, conducting periodic validity studies that evaluate the impact of language tests on different groups of test-takers can be another way to enhance accountability (Fulcher, 2006). These rigorous investigations serve to identify disproportionate negative consequences for specific groups, thereby prompting corrective actions. In the artificial intelligence (AI) era, various language-based services, including Turnitin, have incorporated detection methods to identify AI-generated language outputs (Perkins, 2023). However, a recent study by researchers at Stanford University revealed that non-native English speakers' writing is consistently classified as 'machine-generated,' possibly due to systematic differences in their use of English (Liang et al., 2023). Consequently, such investigations are critical for helping educators understand the ethical concerns surrounding tools like these and for promoting greater accountability.
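Validity studies of the kind described above often operationalize group impact through differential item functioning (DIF) analysis, which asks whether test-takers of equal ability but different group membership answer an item differently. As a minimal illustration only (the function and the synthetic data are hypothetical, not drawn from CAT4 or any published study), the widely used Mantel-Haenszel procedure compares the odds of a correct response across groups within strata of matched total score:

```python
import math
from collections import defaultdict

def mantel_haenszel_dif(responses):
    """Estimate Mantel-Haenszel DIF for a single item.

    responses: iterable of (group, total_score, correct) tuples, where
    group is 'ref' (reference) or 'focal', total_score stratifies
    ability, and correct is 1 or 0 for the studied item.
    Returns the statistic on the ETS delta scale: negative values
    indicate the item disadvantages the focal group at matched ability.
    """
    # Build a 2x2 table (group x correct/incorrect) per score stratum.
    strata = defaultdict(lambda: {"A": 0, "B": 0, "C": 0, "D": 0})
    for group, score, correct in responses:
        cell = strata[score]
        if group == "ref":
            cell["A" if correct else "B"] += 1  # A: ref correct, B: ref wrong
        else:
            cell["C" if correct else "D"] += 1  # C: focal correct, D: focal wrong

    num = den = 0.0
    for cell in strata.values():
        t = cell["A"] + cell["B"] + cell["C"] + cell["D"]
        if t == 0:
            continue
        num += cell["A"] * cell["D"] / t
        den += cell["B"] * cell["C"] / t
    if num == 0 or den == 0:
        return None  # too little data to estimate the odds ratio

    alpha = num / den  # common odds ratio across strata
    return -2.35 * math.log(alpha)  # rescale to the ETS delta metric

# Synthetic example: at the same matched score, 80% of the reference
# group but only 50% of the focal group answer the item correctly.
data = ([("ref", 5, 1)] * 80 + [("ref", 5, 0)] * 20
        + [("focal", 5, 1)] * 50 + [("focal", 5, 0)] * 50)
delta = mantel_haenszel_dif(data)
```

Under the conventional ETS classification, an absolute delta of 1.5 or more (with statistical significance) flags large DIF; in the synthetic example the statistic is strongly negative, so the item would be flagged for review as potentially disadvantaging the focal group.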

Conclusion

In conclusion, the complex tapestry of language testing and assessment is inextricably intertwined with ethical considerations. Addressing these ethical concerns, including bias, transparency, and accountability, is essential to foster a more equitable and just educational landscape. Shifting our mindset from solely prioritizing psychometric aspects to incorporating a holistic approach that recognizes sociocultural factors can lead to reduced disparities among test-takers. By embracing multi-disciplinary collaborations, fostering cultural and linguistic inclusivity, minimizing educational assumptions in test items, and promoting transparency and accountability through ethics review boards and ongoing evaluations, we can strive toward a more empathetic and conscious approach in the realm of language testing and assessment.

By refining our evaluative practices in alignment with contemporary theoretical advancements, we can cultivate an environment wherein the experiences of diverse students are valued and acknowledged. Ultimately, this will contribute to safeguarding students' motivation and engagement within the learning process, which subsequently has the potential to reverberate positively within their wider societal integration. As educators, policymakers, test developers, and administrators synthesize collective efforts to address these ethical considerations, we can pave the way for an enhanced understanding and celebration of language as a dynamic social practice.

References

Abedi, J. (2006). Psychometric Issues in the ELL Assessment and Special Education Eligibility. Teachers College Record: The Voice of Scholarship in Education, 108(11), 2282–2303. https://doi.org/10.1111/j.1467-9620.2006.00782.x

Bandura, A. (2001). Social Cognitive Theory: An Agentic Perspective. Annual Review of Psychology, 52(1), 1–26. https://doi.org/10.1146/annurev.psych.52.1.1

Estaji, M. (2011). Ethics and Validity Stance in Educational Assessment. English Language and Literature Studies, 1(2), 89. https://doi.org/10.5539/ells.v1n2p89

Fulcher, G. (2006). Language Testing and Assessment: An Advanced Resource Book (1st ed.). Routledge. https://doi.org/10.4324/9780203449066

Fusarelli, L. D. (2004). The Potential Impact of the No Child Left Behind Act on Equity and Diversity in American Education. Educational Policy, 18(1), 71–94. https://doi.org/10.1177/0895904803260025

Gebril, A. (2023). Book Review: Challenges in Language Testing Around the World: Insights for Language Test Users. Language Testing, 40(1), 180–183. https://doi.org/10.1177/02655322221113189

Gordon, E. W. (1995). Toward an Equitable System of Educational Assessment. The Journal of Negro Education, 64(3), 360. https://doi.org/10.2307/2967215

Harber, K. D. (2023). The model of threat-infused intergroup feedback: Why, when, and how feedback to ethnic minority learners is positively biased. Educational Psychologist, 0(0), 1–16. https://doi.org/10.1080/00461520.2023.2170377

Kunnan, A. J. (2000). Fairness and Validation in Language Assessment: Selected Papers from the 19th Language Testing Research Colloquium, Orlando, Florida. Cambridge University Press.

Kunnan, A. J. (2004). Test fairness. In M. Milanovic & C. Weir (Eds.), European language testing in a global context: Proceedings of the ALTE Barcelona Conference (Vol. 18, pp. 27–48). Cambridge University Press.

Kunnan, A. J. (2013). Validation in Language Assessment. Routledge.

Li, W. (2018). Translanguaging as a Practical Theory of Language. Applied Linguistics, 39(1), 9–30. https://doi.org/10.1093/applin/amx039

Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023). GPT detectors are biased against non-native English writers (arXiv:2304.02819). arXiv. https://doi.org/10.48550/arXiv.2304.02819

Lippi-Green, R. (2012). English with an Accent: Language, Ideology and Discrimination in the United States. Routledge.

Lynch, B. K. (2001). Rethinking assessment from a critical perspective. Language Testing, 18(4), 351–372. https://doi.org/10.1177/026553220101800403

Pennycook, A. (2017). The cultural politics of English as an international language. Taylor & Francis.

Perkins, M. (2023). Academic Integrity considerations of AI Large Language Models in the post-pandemic era: ChatGPT and beyond. Journal of University Teaching & Learning Practice, 20(2). https://doi.org/10.53761/1.20.02.07

Roever, C., & McNamara, T. (2006). Language testing: The social dimension. International Journal of Applied Linguistics, 16(2), 242–258. https://doi.org/10.1111/j.1473-4192.2006.00117.x

Savignon, S. J. (1987). Communicative language teaching. Theory Into Practice, 26(4), 235–242. https://doi.org/10.1080/00405848709543281

Shohamy, E. (2001). Democratic assessment as an alternative. Language Testing, 18(4), 373–391. https://doi.org/10.1177/026553220101800404

Spolsky, B. (2013). The Influence of Ethics in Language Assessment. In A. J. Kunnan (Ed.), The Companion to Language Assessment (pp. 1571–1585). John Wiley & Sons, Inc. https://doi.org/10.1002/9781118411360.wbcla005

Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African Americans. Journal of Personality and Social Psychology, 69, 797–811. https://doi.org/10.1037/0022-3514.69.5.797

Taylor, L. (2009). Developing Assessment Literacy. Annual Review of Applied Linguistics, 29, 21–36. https://doi.org/10.1017/S0267190509090035