Research report no.14 2005
The efficacy of early childhood interventions
by Sarah Wise, Lisa da Silva, Elizabeth Webster and Ann Sanson
5. Adequacy of evaluation design
Ideally, evaluations of interventions should be systematic, comprehensive and use rigorous scientific controls, such as randomised trials and sufficient statistical power, to find meaningful program effects (Sanders 2003). Some existing reviews of program evaluations have developed standards, grades or levels of evidence for early childhood interventions, based on certain criteria. These categories are used as a means of reporting the rigour of the evaluation design (for example, Mrazek and Brown 2002).
Evidence rating system
The evidence rating system adopted in this report aims to provide information on a number of fundamental research design elements. The elements included in this review are:
- Appropriate evaluation design methodology. Evaluations (including cost-benefit analyses) require an appropriate control or comparison group. This can be achieved either by randomly assigning participants to the intervention or control group, or by selecting a group of participants who are matched to the intervention group on a number of characteristics such as gender and age (matched comparison group).
- Pre-intervention data. For matching intervention and control groups, and to detect change as a result of implementation, it is necessary to collect baseline information.
- Intermediate follow-up and long-term follow-up. To determine whether the intervention has had any short-term and/or long-term effects, outcome data should be regularly collected on the intervention and comparison groups. Ideally, follow-up should continue for a number of years.
- Representative sample of participants in the evaluation. To ensure that an evaluation is representative of the intervention it is evaluating, the evaluation sample must be representative of the whole sample that received the intervention.
- Low attrition at follow-up and non-random attrition. In relation to evaluation integrity, attrition refers to the number of participants who could not be included in the immediate or long-term follow-up. Attrition is generally deemed to be acceptable if it is no more than 10 per cent per follow-up time point. For example, in a sample of 100, no more than ten participants could be lost at each follow-up time point.
- Adequate statistical power. To ensure that an evaluation is statistically adequate, the case-to-variable ratio used in an analysis needs to be considered. A minimum of five participants for every one characteristic measured is standard.
- Reliable measures. The integrity of an evaluation is enhanced if the tools used to measure outcomes are standardised (that is, have known psychometric properties) and widely used.
- Appropriate choice of measures. In making decisions about how outcomes are to be measured, serious consideration must be given to the measures used. A measure that does not adequately assess what evaluators want it to assess will compromise the integrity of the evaluation.
- Appropriate analytic approach. This criterion refers to the use of appropriate statistical techniques, which is necessary to ensure that the findings are reliable.
The presence or absence of each design element is recorded in Tables 1-5 below. Full details of the intervention evaluations and outcomes are provided in Appendix 2.
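The two numeric rules of thumb above (attrition of no more than 10 per cent per follow-up time point, and at least five participants per characteristic measured) can be illustrated with a minimal sketch. The function and parameter names below are hypothetical and are not part of the report's method. The sketch also assumes that attrition is counted against the sample entering a given follow-up point, which the text does not settle.

```python
# Illustrative sketch only: names and structure are assumptions,
# not taken from the report.

def attrition_acceptable(sample_size: int, lost: int, max_percent: int = 10) -> bool:
    """Return True if the participants lost at one follow-up time point
    are no more than max_percent of the sample (integer arithmetic)."""
    if sample_size <= 0 or not 0 <= lost <= sample_size:
        raise ValueError("need sample_size > 0 and 0 <= lost <= sample_size")
    return lost * 100 <= max_percent * sample_size


def min_sample_size(n_characteristics: int, cases_per_variable: int = 5) -> int:
    """Smallest sample that satisfies the case-to-variable rule of thumb."""
    if n_characteristics < 0 or cases_per_variable <= 0:
        raise ValueError("counts must be non-negative and the ratio positive")
    return n_characteristics * cases_per_variable


def power_adequate(n_participants: int, n_characteristics: int,
                   cases_per_variable: int = 5) -> bool:
    """Return True if the sample meets the minimum case-to-variable ratio."""
    return n_participants >= min_sample_size(n_characteristics, cases_per_variable)


if __name__ == "__main__":
    # The report's own example: a sample of 100 may lose at most ten participants.
    print(attrition_acceptable(100, 10))
    print(attrition_acceptable(100, 11))
    # Eight measured characteristics call for at least 40 participants.
    print(min_sample_size(8))
```

With the default thresholds, a sample of 100 that loses ten participants passes the attrition check, while a loss of eleven does not. The ratio check is only a rule of thumb and does not replace a formal power analysis.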
Adequacy of cluster 1 evaluations
All evaluations in cluster 1 included a representative sample of participants. Most used reliable measures, made appropriate choices about measures and used appropriate analytic approaches. Four of the six interventions (Perry, CPC, High/Scope and PIDI) included an appropriate control or comparison group and four (Perry, Head Start, High/Scope, PIDI) collected pre-intervention data. Half of the interventions had follow-up data (Perry, CPC, High/Scope).
The evaluation integrity of three interventions in cluster 1 (Perry, CPC, High/Scope) was very good, with each containing nine of the ten research design elements. The evaluation integrity of one intervention (Saginaw) was very poor, containing only two of the research design elements, while the evaluation integrity of the remaining two interventions (Head Start, PIDI) was moderate (six design elements). These details are illustrated in Table 1.
Adequacy of cluster 2 evaluations
All of the evaluations in cluster 2 except SHELLS contained an appropriate control or comparison group. All of the evaluations included pre-intervention measures. SHELLS and Baby HUGS did not collect follow-up data, while the remaining evaluations included at least intermediate follow-up data. Half of the evaluations did not have adequate statistical power and half did not use reliable measures.
The evaluation integrity of one intervention (Elmira PEIP) was excellent, reflecting all ten of the design elements. One intervention (SHELLS) had very poor evaluation integrity (one design element present) while the evaluation integrity of the remaining six interventions was moderate to good. These details are illustrated in Table 2.
Adequacy of cluster 3 evaluations
All of the evaluations of interventions in cluster 3 included appropriate control or comparison groups, a representative sample, adequate statistical power and reliable measures, and all chose appropriate outcome measures.
Table 3 shows that the evaluation integrity of two of the interventions was very good, with both evaluations containing nine of the ten design elements (New Hope, FTP). The evaluation integrity of the remaining intervention (TPDP) was good, containing seven design elements.
Adequacy of cluster 4 evaluations
Most of the evaluations in cluster 4 included a representative sample and chose appropriate outcome measures, while two-thirds of the evaluations included an appropriate control or comparison group and two-thirds used reliable measures. Each of the other design elements was present in approximately half of the evaluations. Attrition was acceptable in only four of the evaluations (Abecedarian, IHDP, Incredible Years, ECEAP) and was not applicable in half of the interventions due to the lack of longitudinal follow-up.
The evaluation integrity of three interventions was very good, with all evaluations containing nine of the ten design elements (Abecedarian, IHDP, Incredible Years). Two interventions (Sure Start and NEWPIN) had very poor evaluation integrity, with each intervention containing only one design element. However, more comprehensive evaluations of Sure Start are pending. The evaluation integrity of the remaining seven evaluations was moderate to good (five to seven design elements). These details are illustrated in Table 4.
Adequacy of cluster 5 evaluations
All three of the evaluations in cluster 5 contained an intermediate follow-up and a representative sample; however, none of them contained a long-term follow-up. In addition, attrition was high in all but one evaluation (Cuyahoga), and only Triple P included an appropriate control group and used an appropriate analytic approach.
As shown in Table 5, the evaluation integrity of Triple P was good (seven design elements); the evaluation integrity of PAT was poor (four design elements); and the evaluation integrity of Cuyahoga was moderate (five design elements).
Relative adequacy of evaluations across clusters
It is difficult to make any firm distinctions between clusters, given the great variability in evaluation integrity within clusters. With the exception of cluster 5, each cluster contained evaluations with very good integrity, while all clusters except cluster 3 contained evaluations with very poor to poor integrity.
One design element that warrants further discussion is the use of reliable measures. Regardless of cluster, most of the evaluations included some objective measures, as well as parental reports. Although parent-reported measures have merit, and are usually the most expedient way of collecting data, they are subjective by nature. Objective measures are therefore needed to corroborate parental reports.
