This study is unique because it examines both within-observer and between-observer variation, among more than two observers, in the clinical findings in children with dyspnoea that are used in all published composite dyspnoea severity scoring systems. The results of our study show moderate-to-good intraobserver reliability and poor interobserver reliability of the clinical assessment of dyspnoea. Subcostal retractions and wheeze showed the best interobserver agreement and mental state the least; the other signs were more or less comparable.
Due to this variation within and between observers, the SDC exceeded the minimally important effect of treatment in 69.4% of observations in our study, obscuring the detection of a clinically important improvement in dyspnoea after treatment.
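As an illustration of how measurement error translates into a smallest detectable change (SDC), the sketch below uses the conventional definition SDC = 1.96 × √2 × SEM, where SEM is the standard error of measurement. The numerical values and the function names are hypothetical and chosen only for illustration; they are not the estimates from this study.

```python
import math


def smallest_detectable_change(sem, z=1.96):
    """Conventional SDC: z * sqrt(2) * SEM.

    The factor sqrt(2) reflects that a change score is the difference
    between two measurements, each carrying its own measurement error.
    """
    return z * math.sqrt(2) * sem


def change_is_detectable(sem, important_change):
    """True if the clinically important change is larger than the SDC."""
    return important_change > smallest_detectable_change(sem)


# Hypothetical values, NOT taken from the study:
sem_points = 1.0          # assumed standard error of measurement (score points)
important_change = 2.0    # assumed minimally important change (score points)

sdc = smallest_detectable_change(sem_points)
detectable = change_is_detectable(sem_points, important_change)
print(f"SDC = {sdc:.2f} points; important change detectable: {detectable}")
```

With these assumed values, the SDC is larger than the minimally important change, so a true improvement of that size could not be distinguished from measurement error.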
Our findings imply that, in clinical practice, assessments of the severity of dyspnoea in children are not interchangeable between professionals. The results of our study, therefore, argue for great caution in interpreting the effect of a trial treatment with a bronchodilator, which is recommended in clinical guidelines that include young children with acute severe wheeze, in particular when the assessment of dyspnoea before and after bronchodilator is performed by different observers. However, even when the same professional assesses the degree of dyspnoea before and after bronchodilator, the considerable intraobserver variation (Table 3) should be taken into account. If clinical dyspnoea scoring is used in clinical trials of young children, the number of different observers should be presented and discussed, because of the large variation between observers (Table 3). Our results suggest that clinical dyspnoea scoring systems require further validation testing and assessment of variation between and within observers. We postulate that the use of more objective parameters, such as oxygen saturation and lung function assessments with acceptably small measurement error, will provide less variable and thus more reliable assessments of dyspnoea in children. Furthermore, it remains important to assess children with dyspnoea together with the colleague who will take over the care of the patient in the next shift.
Strengths and Limitations
The major strengths of our study include the measurement of intraobserver and interobserver reliability, the use of a large group of observers in a crossed design and the assessment of the clinical impact of reliability of these clinical signs of dyspnoea by computing measurement error.
We acknowledge the following weaknesses of our study. The use of video recordings has limitations. The video recordings were of relatively short duration (2–3 min), which may have led to less accurate ratings or missed observations and may have decreased the likelihood of detecting subtle signs on physical examination. For our study purposes, however, video recordings were considered to be the only feasible method. The lack of chest auscultation could also be viewed as a weakness; however, previous studies have shown poor association between wheeze severity on auscultation and the degree of airway obstruction and hypoxaemia. Furthermore, leaving out auscultation in the assessment of dyspnoea severity in children reflects clinical practice, where many assessments are made by healthcare professionals who have not been trained in chest auscultation.
We examined a limited number of patients for feasibility reasons, to avoid observer fatigue and boredom while assessing the videos. One could argue that the number of observers is also relatively small, although it is considerably larger than the two to four observers used in previous studies (see online supplement 1). The reliability of the dyspnoea score and the individual items might have been greater if we had included more children with (very) severe dyspnoea. It would also have been interesting to evaluate the relation of the observer variation to the severity of the dyspnoea. However, our study group was too small to compare the observer variation between these subgroups. Additionally, in our general practice, children with very severe dyspnoea needing mechanical ventilation comprise only a small minority of all children with dyspnoea presenting to our clinic (1%–2%). Thus, a clinical dyspnoea scoring system is potentially most useful in mild-to-moderate dyspnoea, and this is represented by our study population.
Furthermore, one could hypothesise that the age of patients might influence the observer variation. The median age of patients in our sample was 19 months, but the sample also included a patient aged 7 years. Our sample size was too small to divide the patients into different age categories, as most (>90%) patients were <4 years old. On the other hand, acute wheeze is most commonly represented by this preschool age group. Therefore, we still feel that our sample is sufficiently representative for this purpose.
Variation between observers may be reduced by formal standardisation of the assessment and by training. Only a few examples are available in the literature, but all point towards a positive effect of training and/or standardisation. In clinical practice, it is uncommon to (re)train basic skills after graduation, apart from newly developed tools or resuscitation skills. This study may help to increase awareness that evaluating (and maybe training) commonly used day-to-day skills is valuable. Further studies are needed to assess whether training professionals, and in particular what kind of training, can reduce the amount of variation we observed in this study.
Another aspect that might be interesting to take into account in future studies aiming to improve the assessment of dyspnoea in children is parental judgement. In this study, we asked the treating doctor as well as the parent to rate the effect of treatment with bronchodilators. Parents and doctors were in full agreement in 67% (18/27) of patients. In the nine patients where there was no agreement, disagreement occurred in both directions (in four patients doctors rated improvement while parents rated no change or slight improvement, and in five patients doctors rated no effect while parents rated improvement). This may suggest that parents take other aspects into consideration than medical professionals, which may be of importance when further elaborating the way we assess children with dyspnoea.
Finally, we feel that it might be useful to further evaluate the assessment of the 'mental state'. Mental state, sometimes described as 'general condition' or 'cerebral function', is included in many dyspnoea scores. In the present study, the prevalence of affected mental state was much lower (4.4%) than the prevalence of the other clinical findings (21.2%–68.0%), resulting in bias of the κ value. This low prevalence may help to explain why the κ value for mental state was low even though the percentage agreement was high. The mental state in our study was rated as 'affected' when the observer assessed the mental state either as 'hyperalert or anxious' or as 'decreased consciousness'. Possibly, a more subtle description of the mental state will improve the precision of dyspnoea assessment, especially when evaluating responsiveness over time, which is sometimes apparent only as small differences that are difficult to describe. The assessment of the mental state would typically be an item where involving the parents might improve accuracy and utility.
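The effect of low prevalence on κ described above can be illustrated with a small worked example. The sketch below computes the two-observer (Cohen's) κ for two hypothetical 2×2 tables with identical percentage agreement but different prevalence of the finding. The two-observer form is an assumption made for simplicity and may differ from the multi-observer statistic used in this study; all counts are invented for illustration.

```python
def agreement_and_kappa(both_yes, a_only, b_only, both_no):
    """Return (percentage agreement, Cohen's kappa) for a 2x2 table.

    both_yes / both_no: both observers rate the sign present / absent.
    a_only / b_only:    only observer A / only observer B rates it present.
    """
    n = both_yes + a_only + b_only + both_no
    p_observed = (both_yes + both_no) / n
    p_a_yes = (both_yes + a_only) / n
    p_b_yes = (both_yes + b_only) / n
    # Agreement expected by chance from each observer's marginal rates
    p_expected = p_a_yes * p_b_yes + (1 - p_a_yes) * (1 - p_b_yes)
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return p_observed, kappa


# Hypothetical counts, NOT study data. Both tables have 92% agreement.
rare = agreement_and_kappa(both_yes=1, a_only=4, b_only=4, both_no=91)
common = agreement_and_kappa(both_yes=46, a_only=4, b_only=4, both_no=46)

print(f"rare finding:   agreement={rare[0]:.2f}, kappa={rare[1]:.2f}")
print(f"common finding: agreement={common[0]:.2f}, kappa={common[1]:.2f}")
```

In this example the rare finding yields κ ≈ 0.16 and the common finding κ ≈ 0.84, although the observers agree on 92% of ratings in both tables. When a sign is rarely present, most agreement is expected by chance alone, so κ is low despite high percentage agreement.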