
Analysis of the CELF-5 for validity, reliability and bias: Every clinician’s responsibility 

Catherine J. Crowley & Elsa Bucaj (Updated, May 2023). 

Any time a clinician decides to use the results of a test or any assessment instrument to identify a language disorder, the clinician must have already done a rigorous analysis of that assessment instrument. This analysis must include examining the assessment materials and any information on validity, reliability, and biases, including cultural and linguistic biases and biases related to socio-economic background and prior experiences. 

To do this, the evaluator must analyze the assessment tool itself, including the information available in the test examiner’s manual and the technical and interpretive manual. The evaluator then compares that content to the standards of the field, including guidance from the American Speech-Language-Hearing Association (ASHA), such as on ethics (ASHA, 2023) and evidence-based practice (ASHA, 2005), and to the available research. This includes any research on the particular assessment tool, as well as research on what is being assessed and how well the test focuses on that area. Since the ASHA Code of Ethics requires SLPs to practice consistent with federal and state laws and regulations, and most school-age evaluations must follow the standard set by the federal law on special education (IDEA, 2004), evaluators must also determine whether the test and its results meet these standards.

According to IDEA (2004), evaluations must “gather relevant functional, developmental and academic information” (20 U.S.C. 1414(b)(2)), using a variety of assessment tools and instruments. The law also sets clear standards for disability evaluations, including that assessment materials be used for purposes for which they are “valid and reliable” (20 U.S.C. 1414(b)(3)(A)(iii)) and be “not discriminatory on a racial or cultural basis” (20 U.S.C. 1414(b)(3)(A)(i)), and that evaluations be able to distinguish a true disability from a lack of adequate instruction in reading or math or from “Limited English Proficiency” (20 U.S.C. 1414(b)(5)).

While an evaluator is responsible for ensuring the quality and accuracy of an assessment measure, SLP evaluators often do not perform this analysis (Betz et al., 2013). Because SLPs use the CELF-5 and CELF-4 for approximately 60 percent of their evaluations (Fulcher-Rood et al., 2019; Ogiela & Montzka, 2021), an analysis of the CELF-5 is especially critical to ensure the test can accurately identify a student with a language disorder. For SLPs, the point of a disability evaluation is to determine whether the student has a language disorder that has prevented them from acquiring the language(s) or varieties of English that they were exposed to in their home and community (Wolfram et al., 1999). Too often, assessment materials such as the CELF-5 instead identify whether the student has acquired the morphology of Standard/General/Mainstream American English (GAE) and the vocabulary expected of students from mainstream American middle-class school-oriented homes.

As discussed further below, the Clinical Evaluation of Language Fundamentals–Fifth Edition (CELF-5) fails to meet current standards for construct validity, “the cornerstone of good test development” (McCauley & Swisher, 1984). Additionally, the test raises serious bias concerns because it focuses on the acquisition of the morphology of one variety of American English, GAE, and on vocabulary acquisition, both in the specific vocabulary subtests and in those subtests that require certain vocabulary knowledge to do well.

Construct validity

The CELF-5 was developed to identify those students with and without a language disorder. Construct validity relates to the test’s ability to accurately measure what it is meant to measure, which with the CELF-5 is primarily whether or not a student has a language disorder. 

The construct validity of a language disorder test is typically measured using sensitivity and specificity, which denote a test’s degree of discriminant accuracy. In the case of the CELF-5, construct validity would be measured by the CELF-5’s accuracy in distinguishing a student who has a true language disorder from a student who does not. Sensitivity measures the degree to which the assessment will accurately identify those who truly have a language disorder as having one (Dollaghan, 2007). Specificity measures the degree to which the assessment will accurately identify those who truly do not have a language disorder as typical (Dollaghan, 2007).
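
The two definitions above reduce to simple proportions. The following is a minimal sketch in Python; the function names and all of the counts are hypothetical, chosen only to illustrate the arithmetic, and are not taken from the CELF-5 Technical Manual.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Share of students WITH a disorder whom the test flags as disordered."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Share of students WITHOUT a disorder whom the test scores as typical."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical counts (illustration only): 100 students with a confirmed
# disorder and 100 typically developing students take the same test.
flagged_with_disorder = 57   # correctly identified (true positives)
missed_with_disorder = 43    # scored as typical (false negatives)
typical_scored_typical = 93  # correctly identified (true negatives)
typical_flagged = 7          # wrongly flagged (false positives)

print(f"Sensitivity: {sensitivity(flagged_with_disorder, missed_with_disorder):.2f}")
print(f"Specificity: {specificity(typical_scored_typical, typical_flagged):.2f}")
```

Both proportions can only be computed when each student’s true diagnostic status is already known from some other source.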

The Plante and Vance standard

No test is 100% accurate, and the standard set forth by Plante and Vance (1994) is used to determine whether a test is “accurate enough” to identify students with language disorders and those without language disorders. A test is considered “good” if it is accurate 90% or more of the time; “fair” if it is accurate 80% to 89% of the time; and “unacceptable” if it is accurate less than 80% of the time, because such a high rate of misdiagnosis can lead to serious social consequences (Plante & Vance, 1994).

The Technical Manual of the CELF-5 reports sensitivity and specificity at four cutoff scores. The optimal cutoff is 1.3 standard deviations below the mean, where sensitivity and specificity are both 97 percent, which is “good” according to the Plante and Vance standard (Wiig et al., 2013, p. 68). At 2 standard deviations below the mean, however, sensitivity drops to only 57 percent, meaning that at this cutoff the CELF-5 is only slightly better at identifying a language disorder than flipping a coin. Moreover, the sensitivity group on the CELF-5 includes only 66 students with language disorders (Wiig et al., 2013, p. 69).
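
As an illustration, the Plante and Vance bands can be applied mechanically to the figures reported above. This is a sketch only: the function name is hypothetical, and treating a value of exactly 90% as “good” is an assumption about where the boundary falls.

```python
def plante_vance_rating(accuracy: float) -> str:
    """Rate a sensitivity or specificity value given as a proportion (0 to 1)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a proportion between 0 and 1")
    if accuracy >= 0.90:  # assumption: exactly 90% counts as "good"
        return "good"
    if accuracy >= 0.80:
        return "fair"
    return "unacceptable"


# Figures cited in the text from the CELF-5 Technical Manual.
reported = {
    "sensitivity at 1.3 SD below the mean": 0.97,
    "specificity at 1.3 SD below the mean": 0.97,
    "sensitivity at 2 SD below the mean": 0.57,
}

for label, value in reported.items():
    print(f"{label}: {value:.0%} -> {plante_vance_rating(value)}")
```

A rating produced this way is only as trustworthy as the groups used to calculate the underlying percentages.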

The inadequacy of the CELF-5 reference standard

The reference standard is the standard used to define which individuals are placed in the sensitivity group (students with true language disorders) and which are placed in the specificity group (students without language disorders). The reference standard is often called the “gold standard” because it is meant to be the standard accepted in the field for identifying whether a student truly has a language disorder.

According to the CELF-5’s Technical Manual (Wiig et al., 2013, p. 24), the reference standard for the sensitivity group was that the students had scored 1.5 standard deviations below the mean, or lower, on any language test and were receiving speech-language services. The reference standard for the specificity group was that the students were not receiving services and had not been identified as needing services (Wiig et al., 2013, p. 24). However, neither of these reference standards is the “gold standard” for identifying a language disorder. Instead, they are based on circular reasoning: the CELF-5’s “gold standard” for identifying a language disorder is whether the student already has an IEP and is getting services (sensitivity) or is not getting services (specificity).

According to the CELF-5 Technical Manual (Wiig et al., 2013), how the students in the sensitivity group were identified as having a language disorder is also concerning. Over half of the students were identified using the CELF-4 and PLS-3, and the construct validity of those tests is also questionable. Sixty-six percent of individuals in the CELF-4 sensitivity group were included based on their performance on the CELF-3 or PLS-3. Both the PLS-3 and CELF-3 have “unacceptable” levels of sensitivity and specificity. The sensitivity of the PLS-3 ranged from 36% to 61% for 3- to 5-year-olds (Zimmerman, Steiner, & Pond, 1991, as cited in Crowley, 2010). The sensitivity of the CELF-3 was reported as 78.7%, but only 57% when considering children with language disorders alone, and its specificity as 92.6% (Ballantyne et al., 2007). To put it more plainly, the CELF-3 Total Language score incorrectly identified 43% of students with language disorders as not having a language disorder, meaning that the test was only slightly more accurate than flipping a coin (Ballantyne et al., 2007). Because each revision of the CELF relied primarily on its prior version for the discriminant accuracy analysis, the flaws in prior versions mean that construct validity concerns are deeply embedded in each revision, including the CELF-5.

Another concern from the research is the use of an arbitrary cutoff score for the sensitivity reference standard. The CELF-5 reference standard, which uses a score of 1.5 SD below the mean or lower to define the students in the sensitivity group, is flawed and ineffective in identifying a true disorder, because the cutoff score at which a test is most accurate differs from test to test (Spaulding et al., 2006). Additionally, there is no analysis of the validity, reliability, and bias of the underlying tests that identified the students for the sensitivity group. As for the specificity group, the simple fact that a student is not receiving speech-language services cannot be a “gold standard,” as there are many reasons why a student with a language disorder might not yet have been identified. This means that the true diagnostic status of the students in both groups is unknown, another reason why the test lacks construct validity.

While the CELF-5 meets the Plante and Vance standard at 1.3 SD below the mean, the caseload-based reference standard used to identify the sensitivity and specificity groups does not meet the standard of the field (Dollaghan, 2007) and is in no way an accepted “gold standard” for the identification of a language disorder. With such flaws in the CELF-5’s construct validity, any clinician must question its use in identifying the presence or absence of a language disorder.

Standardization Sample

The CELF-5’s Technical Manual states that the standardization sample was based on the March 2010 US Census Update and was stratified by age, sex, race/ethnicity, geographic region, and parent education level (Wiig et al., 2013, p. 26). This is a concerning approach to identifying whether a student has a language disorder. Comparing a student’s language skills to some idealized linguistic norm derived from students across the country does not measure whether the student has a language disorder. It does tell us that the student’s language differs from some calculated mean, and perhaps from an idealized vision of how students should be speaking (Nair et al., 2023), but this is not the purpose of a speech-language disability evaluation.

Rather, a speech-language evaluation determines whether the student has a true language disability that would limit their ability to acquire the language variety(ies) the student has been exposed to in their home and community. This requires that the evaluator gather information about the student, including their language acquisition history, developmental milestones, family background, adverse childhood experiences that may affect learning without a language disability, significant changes in the family structure or disruptive prior experiences, present academic performance and progress, language structures and exposure, and the student’s and family’s culture(s) and values. Then the evaluator analyzes the information and determines whether the student has a language disorder, using their clinical judgment (ASHA, 2005; ASHA, 2023; Castilla-Earls et al., 2020).

Using the U.S. Census to create a normed mean to see how that student compares to that calculated mean essentially reduces speech-language evaluations to asking whether the student has been raised in mainstream American middle-class school-oriented culture(s) where GAE is the only accepted variety of English. This violates federal law, ASHA’s Code of Ethics, and the civil rights of the students being identified with a disability based on this flawed instrument and approach to assessment (20 U.S.C. 1414; ASHA, 2005; ASHA, 2023; Nair et al., 2023).

Reliability


Reliability relates to the consistency of the results of a given test, as any assessment has inherent error. Reliability measures estimate the amount of error or inconsistency in the test results.

Test-Retest Reliability: Test-retest reliability measures how consistent a test score is over time (McCauley & Swisher, 1984). The CELF-5 test was administered twice to 137 students across three age bands, with the correlation coefficients for composite and index scores ranging from .83-.90. However, according to Salvia et al. (2010, as cited in Betz et al., 2013), these are inadequate reliability coefficients, as a minimum standard of .90 is advised for making educational placement decisions. 
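
A test-retest coefficient of this kind is a correlation between the scores from the two administrations. The sketch below assumes a Pearson correlation and uses invented scores for six students; the function names and data are hypothetical and are not drawn from the CELF-5 Technical Manual.

```python
from math import sqrt

MINIMUM_RELIABILITY = 0.90  # minimum cited in the text (Salvia et al., 2010)


def pearson_r(first: list[float], second: list[float]) -> float:
    """Pearson correlation between two sets of paired scores."""
    n = len(first)
    mean_first = sum(first) / n
    mean_second = sum(second) / n
    covariance = sum(
        (a - mean_first) * (b - mean_second) for a, b in zip(first, second)
    )
    spread_first = sum((a - mean_first) ** 2 for a in first)
    spread_second = sum((b - mean_second) ** 2 for b in second)
    return covariance / sqrt(spread_first * spread_second)


def meets_minimum(r: float, minimum: float = MINIMUM_RELIABILITY) -> bool:
    """True when a coefficient reaches the advised minimum."""
    return r >= minimum


# Invented standard scores for the same six students, tested twice.
first_administration = [82, 95, 101, 110, 88, 97]
second_administration = [85, 93, 104, 108, 92, 95]

r = pearson_r(first_administration, second_administration)
print(f"Test-retest r = {r:.2f}; meets .90 minimum: {meets_minimum(r)}")

# The range reported in the text for composite and index scores.
for reported in (0.83, 0.90):
    print(f"r = {reported:.2f}; meets .90 minimum: {meets_minimum(reported)}")
```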

Inter-examiner Reliability: Inter-examiner reliability measures the effect of different scorers or administrators on test outcomes (McCauley & Swisher, 1984). The CELF-5 tests were evaluated using specially trained examiners, and inter-examiner reliability was calculated for four tests with specific scoring protocols. Results showed acceptable reliability with coefficients ranging from .91 to .99.

Inter-item Consistency: Inter-item consistency assesses whether different parts of the assessment measure similar things. For the CELF-5, the split-half method was used, comparing scores from the first half of the test to scores from the second half. The study included normative and clinical samples, the latter including students with language disorders, autism spectrum disorder, and reading and learning disabilities. Reliability coefficients for the normative sample ranged from .75 to .98, while those for composite and index scores were acceptable, ranging from .95 to .96. Reliability coefficients for the clinical groups ranged from .81 to .99, with the majority being acceptable.

Standard Error of Measurement: The Standard Error of Measurement (SEM) and confidence intervals (CIs) provide information about how precisely a student’s observed test score reflects their “true” score (Betz et al., 2013). The SEM is included under “reliability” as it relates to the inherent error in a test, sometimes called “human error.” SEM estimates the amount of error in a student’s test scores and is inversely related to test reliability. This means that the smaller the SEM, the greater the reliability and the confidence in the precision of the observed test score. Confidence intervals show the range of scores within which we can be confident that a student’s true score lies. The CELF-5 provides CIs at 68%, 90%, and 95%. The clinician selects the appropriate confidence level: higher confidence levels yield a larger range of scores, while lower levels yield a smaller range. Wide ranges are often necessary to achieve high levels of confidence in language assessment instruments, which in practice underscores the limitations of relying primarily on a single test score to identify a language disorder.

Any evaluator who uses this test to identify a language disorder must do so within a confidence interval that reflects the inherent error within any test, as captured in the SEM and provided in the test’s examiner’s manual. While some evaluators may write the confidence interval on the CELF-5 scoring protocol sheet or in a box in the evaluation report, it is extremely rare to see an evaluator analyze a student’s results within that confidence interval. These confidence intervals are calculated assuming the CELF-5 is valid, reliable, and free of bias, which evaluators using this test to identify a disorder are also likely to assume, however incorrect that assumption is. It is the responsibility of evaluators who rely on this test to, at the very least, use the SEM and confidence intervals in analyzing any student outcomes on the CELF-5, which would reveal great variability in identifying a true language disorder.
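
To make the relationship between reliability, SEM, and confidence intervals concrete, the sketch below uses the classical formula SEM = SD × √(1 − reliability) and intervals centered on the observed score. These are assumptions for illustration: a test manual may compute its intervals differently, and the scale (mean 100, SD 15), the reliability value, and the observed score are all hypothetical.

```python
from math import sqrt

# z-values for the three confidence levels the text says the CELF-5 provides.
Z_VALUES = {0.68: 1.00, 0.90: 1.645, 0.95: 1.96}


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical SEM: smaller when the reliability coefficient is higher."""
    return sd * sqrt(1.0 - reliability)


def confidence_interval(observed: float, sem: float, level: float) -> tuple[float, float]:
    """Interval around the observed score at the given confidence level."""
    margin = Z_VALUES[level] * sem
    return observed - margin, observed + margin


SCALE_SD = 15.0      # assumption: standard scores with mean 100, SD 15
RELIABILITY = 0.90   # hypothetical reliability coefficient
OBSERVED_SCORE = 78  # hypothetical student score

sem = standard_error_of_measurement(SCALE_SD, RELIABILITY)
print(f"SEM = {sem:.2f}")
for level in (0.68, 0.90, 0.95):
    low, high = confidence_interval(OBSERVED_SCORE, sem, level)
    print(f"{level:.0%} CI: {low:.1f} to {high:.1f}")
```

Under these assumptions, the 95% interval runs from roughly 69 to 87, so it falls on both sides of a cutoff set 1.3 SD below the mean (80.5 on this scale). A score reported without its interval would hide that uncertainty.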


Linguistic Biases

     English as a New Language: Students learning English as a new language show morphological differences from GAE that are consistent with those of students with a true language disorder (Paradis, 2005, 2010). This means they can be misidentified as having a language disorder when they are showing signs of typical second language acquisition. This misidentification most often occurs in the CELF-5 Word Structure, Recalling Sentences, and Formulated Sentences subtests.

     Dialect varieties: Student performance on the CELF-5 may be affected by the student’s dialect of English. GAE is the dialect in which the test is administered, which can cause difficulties for speakers of other varieties of English, such as African American English (AAE), Spanish-influenced English, Appalachian English, and Chinese-influenced English. Hendricks and Adlof (2017) found that using the CELF-5 “modified scoring guidelines” for speakers of AAE led to an unacceptable underidentification of students with true language disabilities. Conversely, typically developing AAE-speaking students given the CELF-5 without the “modified scoring guidelines” were overidentified as having language disorders. Hamilton et al. (2018) describe the dialectal differences in the morphology of Filipino English, which would affect speakers’ performance on a morphology-focused test such as the CELF-5. Twenty of the 33 items in the Word Structure subtest, or about 60 percent of the subtest items, assess non-obligatory features of AAE and other varieties of American English, such as possessive nouns, contractible copulas, auxiliary “be” verbs, and regular/irregular past tenses (Wiig et al., 2013).

Socio-economic Biases 

Research indicates that a student’s vocabulary knowledge positively correlates with their socioeconomic status (SES) and that students from lower SES backgrounds have smaller vocabularies than those from higher SES families. This has been attributed to parents’ education level, which also correlates with family SES. Higher education leads to exposure to more, higher-level academic vocabulary words like the ones that tend to be assessed on these tests. Parents with less education tend to have lower SES and tend to use fewer of the words that appear in vocabulary-based tests than parents with higher education and higher SES (e.g., Hart & Risley, 1995; Horton-Ikard & Ellis Weismer, 2007; Stockman, 2000). Students learning English as a new language also present with more limited vocabulary skills (e.g., Anaya et al., 2018; Bonifacci et al., 2020).

Besides the focus on the morphological acquisition of GAE, the CELF-5 heavily emphasizes vocabulary. Using vocabulary-based tasks, such as those in a number of the CELF-5 subtests, to identify a language disorder assumes that all students given the assessment have had the same amount and the same quality of contextualized exposure to that vocabulary. It follows from that assumption that if a student does not do well on a vocabulary-focused subtest, the cause must be a language disorder. Of course, there are many reasons why a student might not do well on a vocabulary-based assessment other than a language disorder, which indicates a flaw in the foundational logic of the CELF-5. Examples of vocabulary-based CELF-5 subtests include Word Definitions and Semantic Relationships, but also the Word Classes subtest, which requires vocabulary knowledge (e.g., quest/search, longitude/latitude, prosperous/wealthy, essential/crucial, and biography/memoir). While it is true that grade-level vocabulary skills are important for academic success, performing well on the CELF-5 relies heavily on vocabulary knowledge. Because this test is designed to identify a language disorder, the CELF-5 draws a causal connection between a student not knowing certain vocabulary words and the presence of a language disorder. Yet, given the myriad reasons why a student might not have a grade-level vocabulary beyond a pure language disorder, the connection between a low vocabulary and a language disorder is correlative at best.

Prior Knowledge/Experience Biases

Evaluators cannot assume that all students have equal exposure to the content of the CELF-5. Yet, by using this test to identify a language disorder, evaluators are failing again to distinguish between causation (performance due to a language disorder) and correlation (performance due to any number of factors, including possibly a language disorder). Some examples of these assumed experiences include the topics of some of the stories in the Understanding Spoken Paragraphs subtest, such as school field days, class trips to the zoo, museums, marching band trips, and hurricanes. While some of these experiences may be quite common, perhaps even in most school districts across the country, in many schools, especially those in poor urban areas, they may not be part of a shared school experience. Once again, this puts these students at a disadvantage, as they do not share the assumed knowledge and prior experiences that are used to identify a language disorder.

Cultural Biases

According to Peña and Quinn (1997), language assessments often do not account for variations in socialization practices, which can affect a student’s response to the type of questions asked, how they are asked, and their interaction with the examiner during testing. Cultural experiences and practices can influence a student’s behavior during testing, such as conversing with unfamiliar adults or asking questions. Some cultures may not expect students to speak until spoken to, which can lead to false identification of language disorders if they do not respond to the clinician’s questions. 

Many tasks on the CELF-5 Pragmatics Activities Checklist (PAC) subtest relate to culturally specific nonverbal skills, including gaze, gesture, expression, and body language. The PAC includes games and a requirement to interact with an unknown adult. Performance on both is highly culturally dependent, and differences are not indicative of a disorder when the student is compared to typically developing peers from their own speech community. These items could be difficult for a student whose cultural background differs from that of the examiner or from mainstream American middle-class school-oriented culture(s).

The CELF-5 Pragmatics Profile for older students reflects awareness of how pragmatic interactions are affected by culture. As part of this checklist-formatted subtest, evaluators are asked to consider, as they answer, whether a student “demonstrates culturally appropriate use of language.” However, how likely is it that an evaluator using the CELF-5 as the primary tool to identify a language disorder will consider, or even be able to know, what is “culturally appropriate” for the student being evaluated?

While well-intentioned, the caveat that evaluators must assess pragmatic behaviors within the framework of whether the student’s behavior is a “culturally appropriate use of language” is unlikely to lead to accurate findings. A subtest like this one is likely to contribute to the overidentification of “minority” students as having an emotional disturbance (20 U.S.C. 1401(3)(A)) and is, thus, quite dangerous. In fact, the congressional findings listed at the front of IDEA (2004) include that “African-American students are identified … emotional disturbance at rates greater than their White counterparts” (20 U.S.C. 1400(c)(12)(C)) and that “greater efforts are needed to prevent the intensification of problems connected with mislabeling and high dropout rates among minority students with disabilities” (20 U.S.C. 1400(c)(12)(A)). This subtest is unlikely to address these congressional findings.

Attention and Memory

During the administration of standardized tests, attention and motivation are critical factors that can affect a student’s performance. Fatigue, lack of interest, and discomfort with the test format can affect results. Additionally, short-term memory demands may be a limiting factor but are not necessarily indicative of a speech or language disorder. As anyone, with or without attention and memory problems, will attest, the CELF-5 and most other omnibus assessments are quite tiring and rarely engage the student’s interest, especially when that student is unaware of, and likely uninterested in, the results of the assessment. Typical students who understandably find these tests uninteresting may perform more poorly simply because they are not motivated to stay engaged.

Conclusion


While the CELF-5 continues to be, by far, the assessment instrument SLPs use most widely to identify a language disorder in disability evaluations, an analysis of the test shows that it fails to meet standards. Its evidence for construct validity is insufficient because the reference standard used to identify the sensitivity and specificity groups for the crucial discriminant accuracy analysis is flawed according to the research and standards of the field. Moreover, this test contains very significant racial, cultural, and socioeconomic biases. These biases mean that the results of the individual subtests, even without scoring, will certainly show gaps in exposure and experience rather than true language weaknesses and cannot distinguish a true language disorder from “something else.”

The continued use of this test to identify a language disorder is not the fault of its publisher. Rather, each evaluator, as a licensed professional, must determine whether it is ethical and evidence-based to continue to use this test. School districts must ensure that assessment instruments meet the federal standard. State and federal education departments must use their monitoring powers to identify and address disproportionality while also ensuring that students are identified as having a disability by accurate assessment materials. Master’s and undergraduate CSD programs must ensure that they are educating future SLPs to have the knowledge and skills to recognize quality assessment materials and approaches.

Most importantly, evaluators must do this kind of analysis before using any assessment materials, rather than blindly foisting what can be invalid and biased tests on the students in our schools, who have been entrusted to our care and clinical expertise.

References


American Speech-Language-Hearing Association (2005). Evidence-based practice in communication disorders [Position statement]. Available from 

American Speech-Language-Hearing Association (2023). Code of Ethics. Available from 

Anaya, J. B., Peña, E. D., & Bedore, L. M. (2018). Conceptual scoring and classification accuracy of vocabulary testing in bilingual children. Language, Speech, and Hearing Services in Schools, 49(1), 85-97. doi: 10.1044/2017_LSHSS-16-0081 

Ballantyne, A. O., Spilkin, A. M., & Trauner, D. A. (2007). The revision decision: Is change always good? A comparison of CELF-R and CELF-3 test scores in children with language impairment, focal brain damage, and typical development. Language, Speech, and Hearing Services in Schools, 38(3), 182–189.

Betz, S. K., Eickhoff, J. R., & Sullivan, S. F. (2013). Factors influencing the selection of standardized tests for the diagnosis of specific language impairment. Language, Speech, and Hearing Services in Schools, 44(2), 133-146. doi: 10.1044/0161-1461(2012/12-0093)

Bonifacci, P., Atti, E., Casamenti, M., Piani, B., Porrelli, M., & Mari, R. (2020). Which measures better discriminate language minority bilingual children with and without developmental language disorder? A study testing a combined protocol of first and second language assessment. Journal of Speech, Language, and Hearing Research, 63(6), 1898-1915.

Castilla-Earls, A., Bedore, L., Rojas, R., Fabiano-Smith, L., Pruitt-Lord, S., Restrepo, M. A., & Peña, E. (2020). Beyond scores: Using converging evidence to determine speech and language services eligibility for dual language learners. American Journal of Speech-Language Pathology, 29(3), 1116–1132. doi: 10.1044/2020_AJSLP-19-00179

Crowley, C. (2010). A Critical Analysis of the CELF-4: The Responsible Clinician’s Guide to the CELF-4. Dissertation. 

Dollaghan, C. (2007). The Handbook for Evidence-based Practice in Communication Disorders. Baltimore, MD: Paul H. Brookes Publishing Co.

Fulcher-Rood, K., Castilla-Earls, A. P., & Higginbotham, J. (2019). Diagnostic decisions in child language assessment: Findings from a case review assessment task. Language, Speech, and Hearing Services in Schools, 50(3), 385–398. doi: 10.1044/2019_LSHSS-18-0044

Hamilton, M., Angulo-Jiménez, H., Taylo, C., & DeThorne, L. S. (2018). Clinical implications for working with nonmainstream dialect speakers: A focus on two Filipino kindergartners. Language, Speech, and Hearing Services in Schools, 49(3), 497-508.

Hart, B., & Risley, T. R. (1995). Meaningful Differences in the Everyday Experience of Young American Children. Baltimore, MD: Paul H. Brookes Publishing Co.

Hendricks, A. E., & Adlof, S. (2017). Language assessment with children who speak nonmainstream dialects: Examining the effects of scoring modifications in norm-referenced assessment. Language, Speech, and Hearing Services in Schools, 48(3), 168-182. doi: 10.1044/2017_LSHSS-16-0060

Horton-Ikard, R., & Ellis Weismer, S. (2007). A preliminary examination of vocabulary and word learning in African American toddlers from middle and low socioeconomic status homes. American Journal of Speech-Language Pathology, 16(4), 381–392.

Individuals with Disabilities Education Improvement Act, 20 U.S.C. § 1400 (2004).

McCauley, R. J., & Swisher, L. (1984). Psychometric review of language and articulation tests for preschool children. Journal of Speech and Hearing Disorders, 49(1), 34-42. doi: 10.1044/jshd.4901.34

Nair, V. K., Farah, W., & Cushing, I. (2023). A critical analysis of standardized testing in speech and language therapy. Language, Speech, and Hearing Services in Schools, 1–13.

Ogiela, D. A., & Montzka, J. L. (2021). Norm-referenced language test selection practices for elementary school children with suspected developmental language disorder. Language, Speech, and Hearing Services in Schools, 52(1), 288–303.

Paradis, J. (2005). Grammatical morphology in children learning English as a second language: Implications of similarities with Specific Language Impairment. Language, Speech, and Hearing Services in Schools, 36(3), 172-187. doi: 10.1044/0161-1461(2005/019)

Paradis, J. (2010). Bilingual Children’s Acquisition of English Verb Morphology: Effects of Language Exposure, Structure Complexity, and Task Type. Language Learning, 60(3), 651–680.

Peña, E., & Quinn, R. (1997). Task familiarity: Effects on the test performance of Puerto Rican and African American children. Language, Speech, and Hearing Services in Schools, 28(4), 323–332. 

Plante, E., & Vance, R. (1994). Selection of preschool language tests: A data-based approach. Language, Speech, and Hearing Services in Schools, 25(1), 15-24.

Salvia, J., Ysseldyke, J. E., & Bolt, S. (2010). Assessment in Special and Inclusive Education (11th edition). Belmont, CA: Wadsworth Cengage Learning.

Spaulding, T. J., Plante, E., & Farinella, K. A. (2006). Eligibility criteria for language impairment: Is the low end of normal always appropriate? Language, Speech, and Hearing Services in Schools, 37(1), 61-72. doi: 10.1044/0161-1461(2006/007)

Stockman, I. (2000). The new Peabody Picture Vocabulary Test: An illusion of unbiased assessment. Language, Speech, and Hearing Services in Schools, 31(4), 340-353. doi: 10.1044/0161-1461.3104.340

Wiig, E. H., Semel, E., & Secord, W. A. (2013). Clinical Evaluation of Language Fundamentals–Fifth Edition (CELF-5). Bloomington, MN: NCS Pearson.

Wiig, E. H., Semel, E., & Secord, W. A. (2013). Clinical Evaluation of Language Fundamentals–Fifth Edition: Technical Manual. Pearson, pp. 24, 26, 68, 69.

Wolfram, W., Adger, C. T., & Christian, D. (1999). Dialects in Schools and Communities. Mahwah, NJ: Lawrence Erlbaum Associates, Inc., Publishers.