There are many problems with the construction of CANS, a few of which have been acknowledged by the developers and then ignored. The problems include the use of nominal data, the absence of a standard set of CANS questions, changing response choices without investigation, and failing to establish standardized scoring and administration procedures.
CANS’ Versions
All nine of the peer-reviewed CANS’ Publications used a different version of CANS. We have labeled the CANS’ versions by listing the year, state, and number of items. For example, the 41-item Iowa version of CANS reported in Anderson et al. (2003 [03]) is listed as “2003 IA CANS-41”. Epstein [07] used the 2011 TN CANS-65. Kisiel [16] used the 2009 IL CANS-105. Lyons [25] did not report the number of items; the version was likely similar to, but not exactly the same as, the 41-item version reported by Dilley [U]. For example, items like “Social” appear to differ between these two versions of CANS. This is odd, since both versions were used in Illinois during the same year. Sieracki [47] used the 2008 IL CANS-44. Weiner’s two articles ([50], [51]) fail to report the number of items, as does Lyons [35].
None of these nine (9) peer-reviewed publications mention the external validity limitations of their design. Failing to note the lack of external validity of these studies violates ORI Guideline 22: “Authors have an ethical obligation to report all aspects of the study that may impact the replicability of their research by independent observers.”
This lack of consistency in the published literature is just the tip of the iceberg. There appears to be an unending proliferation of tools labeled “CANS”. In 2004 Lyons wrote:
“The modular design of the CANSMH allows the tool to be adapted for local applications without jeopardizing its measurement properties. By modular, we mean that because each item can stand alone, consensus process allows partners to decide which items should or should not be included or whether a new item should be created for a local application. For example, in the Alaskan Youth Initiative application, partners in this program expressed a need to include an item called Cycling of Symptoms” ([27], p. 9).
In 2015 Israel wrote:
“The instrument is modular in the sense that both items and domains can be selected based on the context and system uses of the instrument” ([H], p. 241).
We can find no justification for this approach to test construction in the scientific literature. Despite Lyons’ insistence that inter-rater reliability, even at the item level, is paramount, none of the new items appear to have been tested, nor has the validity of the newly revised summary scales been evaluated. There is no mention of this issue or limitation in any form of publication or manual, with one exception:
“The weakness of this approach is that the use of different item sets across studies limits their replicability. In order to facilitate replicability, future studies should be clear about which items are included when using the CANS or other modular measures.” ([H], p. 246)
Instead, all other authors sidestep the problem. For example, Israel’s [H] very next sentence after boasting about the modular design, states: “The instrument is explicitly designed to be reliable and valid at the item level (Lyons, 2009b [34]).” This leads the casual reader to assume these new items or modules have been tested and are reliable and valid.
Developer assistance or permission does not appear to be required when revising CANS (e.g., creating new items, deleting items, or otherwise revising scales and response choices). However, praedfoundation.org has a long list of downloadable versions of CANS for virtually every state in the US and for a few other countries. In October 2019, we downloaded them all. In total there were fifty-nine (59) distinct versions of CANS on the website, none with any validity or reliability support. Yet most of the manuals claim these versions are valid, supporting the claim with false citations, or ignore the problem altogether.
Evolving Response Choices
How children are rated on CANS has evolved. Respondents are no longer rating how depressed the child is, for example. Instead, the child is rated on how much treatment he needs. This is why raters are asked to guess how depressed the child would be if treatments, like medication and therapy, were suddenly removed. This is likely an impossible task for anyone to do reliably, even a clinician; and now there are legions of non-clinical raters making these guesses every day. We can find no other assessment tool that measures outcomes within this odd, hypothetical framework.
Originally, CANS was scored on the same dimensions as the other tools that Lyons had developed, like the Childhood Severity of Psychiatric Illness (CSPI; Lyons, 1998; cf. Anderson et al., 2003 [03], p. 283). The CSPI requires expert clinical judgment in rating the severity of disturbance from “0” = “no disturbance” to “3” = “severe disturbance” (Greenham & Bisnaire, 2008 [11]). The CANS, on the other hand, now rates children with the same four numbers, but using different action-level anchors related to intervention and treatment needs. “The anchors themselves are worded in terms of the level of intervention needed” (Lyons, 2004 [26], p. 145).
This evolution is ignored in all CANS’ Publications. For example, most publications reference Anderson [03] for the reliability of CANS and universally fail to mention that the response choices have changed since that publication.
Lyons acknowledges this evolution in his 2009 book, but does not discuss the scientific impact: “The levels of the CSPI was defined using more traditional Likert-type rating scales of None, Mild, Moderate, and Severe. But in training first in the CSPI and then with the CANS, I had often mentioned that you could also think about things from a service planning perspective of No Evidence, Watchful Waiting/Prevention, Action and Immediate/Intensive Action. These ratings were not explicit aspects of CANS at the time, just alternative ways of understanding the Likert ratings” (Lyons, 2009 [34], p. 101). He credits a parent (Julie Hdalio) for convincing him to change the rating scale on CANS.
By way of example only, in Lyons 2004 book [26], the appendix of which reproduced the most recent CANS Manual, “Psychotic Symptoms” and the “1” rating were listed as follows:
This rating is used to describe symptoms of psychiatric disorders with a known neurological base. DSM-IV disorders included on this dimension are Schizophrenia and Psychotic Disorders (unipolar, bipolar, NOS). The common symptoms of these disorders include hallucinations, delusions, unusual thought process, strange speech, and bizarre/idiosyncratic behavior.
1 – This rating indicates a child with evidence of mild disruption in thought processes or content. The child may be somewhat tangential in speech or evidence of somewhat illogical thinking (age inappropriate). This also includes children with a history of hallucinations but not currently. The category would be used for children who are below the threshold for one of the DSM IV diagnoses listed above. (emphasis added)
By contrast, in 2018, the California Integrated Practice CANS Manual [W] lists the parallel CANS’ construct (Psychosis – Thought Disorder) as follows:
This item rates the symptoms of psychiatric disorders with a known neurological base, including schizophrenia spectrum and other psychotic disorders. The common symptoms of these disorders include hallucinations (i.e. experiencing things others do not experience), delusions (i.e. a false belief or an incorrect inference about reality that is firmly sustained despite the fact that nearly everybody thinks the belief is false or proof exists of its inaccuracy), disorganized thinking, and bizarre/idiosyncratic behavior.
1 – Identified need requires monitoring, watchful waiting, or preventive activities. Evidence of disruption in thought processes or content. Child/youth may be somewhat tangential in speech or evidence somewhat illogical thinking (age-inappropriate). This also includes child/youth with a history of hallucinations but none currently. Use this category for child/youth who are below the threshold for one of the DSM diagnoses listed above. (emphasis added)
These untested changes to CANS responses have serious consequences, especially when using repeated measures to track treatment outcomes. Even if progress is made during treatment, with anti-psychotic medication, for example, and symptoms ameliorate and functioning improves, CANS’ ratings remain in the severe/treatment-needed ranges. Raters are no longer rating the child’s current status; they are rating whether continued treatment is needed. As Lyons wrote:
The CANS separates the description of the functional status of a child from the specific impact of ongoing, necessary treatment interventions and environments. Since a communimetric measure is designed to effectively communicate status across varied treatment environments, it defines a person’s functioning without interventions in place. (Lyons & Israel, 2017 [X], p. 1)
Therefore, CANS cannot be used to demonstrate functional improvements in children until those treatments “cure” the child. “Stimulants do not cure ADHD, they simply manage the symptoms. To say that putting a child on stimulants resolves his/her ADHD is simply illogical and does not represent the child’s clinical and functional status” (Lyons & Israel, 2017 [X], p. 1).
Nominal Data
CANS response choices are nominal. Quoting from California’s 2018 Integrated Practice CA IP-CANS Manual ([W], p. 6), the score of 1 (on a 0-3 scale) can mean a “significant history” of a problem that is not currently “interfering with functioning”, or a “possible need” that requires “additional assessment.”
This midpoint anomaly is a catchall category for “we’re not sure, yet.” With more time and evaluation on each clinical case, some of these categorized-1-scores should really be a 3, meaning “hey, after more assessment time, we realize he’s really suicidal, is not contracting for safety, has a plan, has access to a gun, and it’s life threatening.” And some of these categorized-1-scores are more accurately 0, meaning “we’ve investigated further and conclude that he’s not even depressed and we’re convinced he’s never been suicidal, and has no current thoughts or intentions. He’s just having trouble sharing his feelings with the team.”
This midpoint anomaly is made infinitely more complicated by Lyons’ belief in “shared visions” and “pre-measurement triangulation” – an idiosyncratic process that requires all stakeholders to work together to complete the CANS, not independently, but as a collective group-think exercise. It comes from Lyons’ theory of human services that he calls Communimetrics (Lyons, 2009 [34]).
“These constitutive theories fit perfectly within the goals of human service enterprises—to establish a shared vision of the needs of people served, and to monitor the impact of interventions on these needs over time…. The concept of creating and measuring a shared meaning can be understood as engaging in a pre-measurement triangulation process. If the multiple parties involved in the human service enterprise participate in the creation [of the ratings] …, then the value of triangulation is preserved—multiple perspectives are represented. But since only one measure results, the burden, expense, and analytic complexity of both analysis and interpretation is dramatically simplified. This simplification makes the use of data from these measures far more widely accessible to individuals who may not be sufficiently sophisticated in statistics or program evaluation to analyze, report, and interpret data collected using traditional triangulation strategies…. This is usually accomplished by capturing information from the multiple service providers involved and then analyzing each perspective separately. Integration can only occur at the interpretation of findings. This process can be both cost- and labour-intensive, and allows for the introduction of single and/or group decision-making biases depending on the manner in which the team operates….
Of course, pre-measurement triangulation through the use of a communimetric measurement approach has limitations. It is always possible that someone who completes the measure fails to actually involve others in its production. This problem is reduced by making the measure an active part of the intervention itself and ensuring that all parties expect to see it and participate in its creation…. It is also possible that users with different perspectives cannot come to an agreement. Although this phenomenon does happen, it has been our experience that it is uncommon. The CANS was used more than one million times worldwide in 2009, and the anecdotal experience of those using it suggest that significant contention arises in no more than 1% of cases. However, when disagreements are not resolvable, the “1” rating (indicating watchful waiting/prevention) is recommended, so that the parties can monitor the need and see over time who has the more accurate perspective” (Obeid & Lyons, 2011 [38], pp. 67-77) (emphasis added)
Therefore, the score of 1 can be used when team members, family members, and the child, who are supposed to be sitting around together, debating and having their voices heard, do not agree. For example, they may disagree as to whether the substance abuse problem is urgent enough to require “immediate or intensive” intervention (3), or can be handled as a 2, with normal out-patient intervention. Instead of categorizing these disagreements as, say, a 2.5 (or leaving the item blank, or having a cannot-agree category), raters are required to record a 1.
It is not clear that any jurisdiction or user of CANS has ever been able to accomplish the tool-required pre-measurement triangulation. Lyons boasts that he knows from the million children assessed each year that it can happen. It is more probable that the pre-measurement triangulation meetings never happen, or would take days to reach consensus on over a hundred ratings. An adolescent who is abusing substances and not ready to change, for example, will insist on a low rating while others insist on a high one. This is what the initial phase of much substance abuse treatment focuses on: increasing readiness to change and admitting that one has a problem. If treatment cannot start until the assessment is done and everyone agrees on everything, many consumers will likely suffer.
Consider tracking the substance abuse outcomes of this hypothetical child. If he really did need urgent care, and was able to be discharged from the detox facility to out-patient care within thirty days, this progress would be recorded on CANS as worsening (going from a score of 1 to 2).
When analyzing a CANS’ dataset, it is impossible to know which of the very different meanings of 1 the rater intended. CANS data are nominal; summary scores on CANS domains, and statistical analyses that assume the data are ordinal, interval, or continuous, should not be reported or performed.
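A minimal Python sketch of this problem, using invented ratings rather than real CANS data: two clinically very different situations produce the same numeric code of 1, and any arithmetic over such codes, like a domain mean, erases the distinction entirely.

```python
# Hypothetical illustration (not real CANS data): the same rating of "1"
# can arise from entirely different clinical situations.
MEANINGS_OF_1 = {
    "history": "significant past problem, not currently interfering",
    "watchful_waiting": "identified need requires monitoring/prevention",
    "unresolved_disagreement": "team could not agree on a 2 versus a 3",
}

child_a = {"rating": 1, "basis": "history"}                  # problem resolved
child_b = {"rating": 1, "basis": "unresolved_disagreement"}  # possibly urgent

# The numeric codes are indistinguishable...
assert child_a["rating"] == child_b["rating"]

# ...so a domain mean computed over the codes treats them as identical.
domain = [child_a["rating"], 0, 2, child_b["rating"]]
mean_score = sum(domain) / len(domain)
print(mean_score)  # 1.0, regardless of what each "1" actually meant
```

The mean is the same whether the two 1s reflect resolved histories or unresolved, potentially life-threatening disagreements, which is exactly why treating these codes as interval data is unsound.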
No Scoring Standardization
Lyons discusses three scoring options: (i) no scoring, just using individual items; (ii) averaging the items in each domain (like Risk) and multiplying by 10; or (iii) combining certain CANS domains into one total score (Lyons, 2009 [34], p. 104). The third approach is presented without a scoring algorithm and is “not recommended.” Making matters more complicated, since there is no standard set of items in CANS domains, the weighting and power of specific items can overwhelm others. Lyons discusses this issue in his 2009 book [34], but then neither follows his own advice nor provides a scoring manual to untangle the mystery.
He uses the domain School as an example, admitting that it was only one item in the original version (cf. Anderson et al., 2003 [03]). In many applications since 2008, however, the single School item in the Functioning domain has been replaced by three (School Achievement, School Attendance, School Behavior). Weighting each item equally would triple the weight of school-related issues relative to Family functioning in a Functioning scale that originally had four items. To compensate for this change in the meaning of the Functioning scale, Lyons recommends collapsing the three School items back into one before averaging them with the other items in the Functioning domain.
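A short Python sketch of this weighting problem, applying Lyons’ option (ii) rule (average the domain’s items and multiply by 10) to hypothetical ratings; the item names and numbers are illustrative only, not drawn from any actual CANS dataset:

```python
# Sketch of Lyons' option (ii) domain scoring: average the items in a
# domain and multiply by 10. Hypothetical ratings on a 0-3 scale.
def domain_score(ratings):
    """Average the item ratings in a domain and multiply by 10."""
    return 10 * sum(ratings) / len(ratings)

family, social, recreation = 3, 0, 0
school_combined = 3                  # original single School item
school_split = [3, 3, 3]             # Achievement, Attendance, Behavior

# Original four-item Functioning domain.
original = domain_score([family, social, recreation, school_combined])

# Weighting all six items equally triples School's influence on the scale.
naive = domain_score([family, social, recreation] + school_split)

# The suggested fix: collapse School back to one item before averaging.
collapsed = domain_score(
    [family, social, recreation, sum(school_split) / len(school_split)]
)

print(original, naive, collapsed)  # 15.0 20.0 15.0
```

With identical clinical facts, the naive six-item average inflates the domain score from 15 to 20; a reader of a publication that does not report its item set and scoring rule has no way to know which number was computed.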
We find no CANS’ Publication that discusses the scoring process used, and the School issue is just the tip of the iceberg. Original items like Depression/Anxiety have been split into two separate items in most modern applications. The 2008 IL CANS-139 added at least six items to the domain of Problem Presentation (Anger control, Somatization, Behavioral regression, Attachment difficulties, Eating disturbances, Affect dysregulation). The 2018 CA CANS-50 includes only the Anger control addition and instead substitutes Adjustment to trauma. There is simply no way to understand the meaning of CANS’ scores presented in publications.
Further obfuscating the meaning of the domains, Chor and colleagues ([A], [B], [C]) “standardize” CANS’ scores, not to any general population or clinical sample (norms have not been established), but to the baseline scores in each research study. As such, one cannot compare scores across each of these three publications.
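A brief Python sketch of why standardizing to each study’s own baseline blocks cross-study comparison; the baseline samples below are invented for illustration, not taken from the Chor publications:

```python
# Hypothetical illustration: z-scoring a raw score against each study's
# own baseline sample (rather than an established norm) means identical
# raw scores map to different "standardized" values in different studies.
from statistics import mean, pstdev

def standardize(raw, baseline):
    """z-score a raw score against a study's baseline sample."""
    return (raw - mean(baseline)) / pstdev(baseline)

study_1_baseline = [10, 12, 14, 16, 18]   # invented baseline CANS sums
study_2_baseline = [20, 24, 28, 32, 36]

raw = 15  # the same raw score in both studies...
z1 = standardize(raw, study_1_baseline)
z2 = standardize(raw, study_2_baseline)
print(round(z1, 2), round(z2, 2))  # ...yields very different z-scores
```

Here a raw score of 15 is slightly above average in the first study’s baseline but far below average in the second’s, so the standardized scores cannot be compared across the two studies.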
Returning to the only independent, peer-reviewed CANS’ Publication (Effland [05]): the version of CANS, the number of items, and how CANS was scored are not discussed. As such, the study has limited external validity. It is impossible to know what the “Baseline behavioral health” CANS’ score of 15.96 listed in Table 2 (p. 740) means, or how it compares to the “Baseline strengths” score of 19.93.
By contrast, Anderson and colleagues [03] report that, in a study of community mental health and child protective services, the mean psychosis CANS’ score was 0.17 when rated by clinicians and 0.02 when rated by researchers conducting chart audits. The Effland [05] results seem extreme by this comparison, but one can only guess how they really compare.
No Standardized Administrative Procedure
A serious CANS’ limitation is the lack of administration standards. It can be completed by a non-clinician who has some minimal training. It can be completed by just looking at the medical record. It can be completed after just interviewing the parent. It can be completed without any interview.
In addition to the pre-measurement triangulation approach (already reviewed above) that requires all participants to be in one location and agree on all CANS’ scores, developers describe other administration approaches in these different ways:
“The CANS is designed as an information integration tool that incorporates data from multiple sources (e.g., interview of child and caregivers, child self report, caregiver and teacher report tools, observation of child and family, review of case records, and clinical judgment of the clinician). This tool is scored by a clinician who is trained and certified in its reliable use” ([16], p. 146).
In Los Angeles, the CANS is administered as a semi-structured interview. “The parent or parent/caregiver’s perspective is the most important” (p. 8). To elicit what the response choice should be, facilitators are encouraged to ask questions like: “Would you say that is something that you feel needs to be watched [i.e., a score of 1], or is help needed [i.e., a score of 2]?” (p. 8) “Remember, this is not a ‘form’ to be completed, but the reflection of a story that needs to be heard” (LA CANS MANUAL [AA], p. 5).
“However, parent interviews are not always feasible… Whoever completes the CANS must take all the information available (e.g., observation, documentation, or both) and integrate it into his or her best estimate of the level of need or strength… The sources of information may vary from child to child; therefore, the CANS method allows the rater to take the information available from all sources and integrate it into the rater’s best estimate of the level of needs and strengths” (Lyons, Weiner & Lyons, 2004 [27], p. 8).