Correlation and agreement between physical and ultrasound examination after a training session dedicated to the standardization of synovitis assessment in rheumatoid arthritis patients

Assessing disease activity in rheumatoid arthritis (RA) patients requires comprehensive quantification of tender and swollen joints. We aimed to evaluate the correlation and agreement between rheumatologists after a training session dedicated to the standardization of synovitis assessment and to compare its performance with a reference imaging modality, musculoskeletal ultrasonography (MSUS). In this cross-sectional study, 28 and 10 joints per RA patient were evaluated by physical examination and ultrasound (US), respectively. After participating in a training session, three rheumatologists performed individual joint assessment for tenderness and swelling. MSUS examination was performed separately by an experienced radiologist in a standardized manner, evaluating findings according to the Outcome Measures in Rheumatology Clinical Trial (OMERACT) guidelines. A total of 80 RA patients were included, with a mean Disease Activity Score based on 28 joints (DAS28)-ESR of 4.02. Across the 2240 joints assessed, the overall interobserver agreement and concordance rate were 81.7% (k = 0.449, p < 0.0001) for tender joints and 66% (k = 0.227, p < 0.0001) for swollen joints. Between physical and US examination, the overall concordance rate was fair (Fleiss' kappa = 0.21, p = 0.027), with an overall agreement of 67.18%; however, more joints were found to be swollen by US assessment than by physical examination (43% vs 39%). In our study population, joint tenderness showed better interobserver agreement, correlation, and concordance rate than joint swelling. When comparing the US assessment to the physical examination, a fair overall concordance rate supports the need for the implementation of training sessions dedicated to standardization in rheumatology clinics.


Advances in Rheumatology (Open Access)

Introduction
Rheumatoid arthritis (RA) is a chronic systemic autoimmune disease characterized by hyperplasia and inflammation of synovial tissue with subsequent bone erosion and loss of joint space, with a noteworthy impact on functionality and quality of life [1]. No single parameter suffices to assess disease activity in RA patients. A comprehensive approach is therefore necessary, combining several individual clinical and/or laboratory parameters into quantitative indexes for use in daily clinical practice by rheumatology health professionals [2][3][4].
Among the most commonly used methods, the Disease Activity Score (DAS) and its modified version, DAS28, are based on tender and swollen joint counts combined with other parameters such as a patient global health assessment [4,5]. Joint tenderness assessment estimates the patient's response to potentially painful stimulation, whereas joint swelling assessment measures synovial inflammation or effusion recognized by fluid displacement. The examination technique should include exerting continuous and direct pressure on an affected joint with the thumb and index finger until the examiner's nail bed turns pale; this corresponds to a pressure of approximately 4 kg/cm² [6,7].
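For orientation, the way these components combine into a single score can be sketched with the commonly published DAS28-ESR formula. The text does not restate the formula, so the coefficients and the category cut-offs (2.6, 3.2, and 5.1) below are taken from the standard definition rather than from this study, and the function and parameter names are purely illustrative.

```python
import math


def das28_esr(tender28: int, swollen28: int, esr_mm_h: float, gh_0_100: float) -> float:
    """Standard DAS28-ESR formula (assumed here, not restated in the text).

    tender28, swollen28: joint counts out of 28
    esr_mm_h: erythrocyte sedimentation rate in mm/h (must be > 0)
    gh_0_100: patient global health on a 0-100 visual analogue scale
    """
    if not (0 <= tender28 <= 28 and 0 <= swollen28 <= 28):
        raise ValueError("joint counts must be between 0 and 28")
    if esr_mm_h <= 0:
        raise ValueError("ESR must be positive to take its logarithm")
    return (
        0.56 * math.sqrt(tender28)
        + 0.28 * math.sqrt(swollen28)
        + 0.70 * math.log(esr_mm_h)
        + 0.014 * gh_0_100
    )


def das28_category(score: float) -> str:
    """Conventional disease activity categories for DAS28."""
    if score < 2.6:
        return "remission"
    if score <= 3.2:
        return "low"
    if score <= 5.1:
        return "moderate"
    return "high"


# Hypothetical patient: 4 tender joints, 2 swollen joints, ESR 20 mm/h, GH 40/100.
example_score = das28_esr(4, 2, 20, 40)
example_category = das28_category(example_score)
```

Because the joint counts enter through a square root, a disagreement of one or two joints between examiners shifts the score most when counts are low, which is one reason reliable counts matter for categorization.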
Although tender and swollen joint counts are considered fundamental parameters for estimating disease activity, their assessment is not as straightforward as often assumed and has potential disadvantages. Among the prevailing concerns are poor reproducibility and substantial interobserver variability [6,8]. One reasonable explanation is the gap in training and clinical experience among practitioners; standardization may therefore offer a feasible solution.
Historically, training sessions focused on standardizing the assessment of all DAS28 parameters have been proposed and, once applied, have been shown to substantially reduce the variation between examiners [6][7][8]. The entire medical and nursing staff who will be assessing RA patients should be trained regularly (at least once a year) and together, allowing discussion of current standard procedures. The use of ultrasonography to demonstrate active synovitis in cases of disagreement has also been proposed [6].
Prognostic factors, treatment decisions, monitoring, and complications are defined based on disease activity and, consequently, depend on a reliable tender and swollen joint count for each RA patient [9]. The aim of this study was to evaluate the correlation and agreement between rheumatologists after a training session dedicated to the standardization of synovitis assessment and to compare its performance with a reference imaging modality, musculoskeletal ultrasonography (MSUS).

Patients
Patients were recruited from the outpatient clinic of the Rheumatology Department at Fundación Santa Fe de Bogotá University Hospital, Colombia. Stratified random sampling was conducted, selecting patients with different disease activity states (according to the most recent clinical record) from a previously established ongoing cohort of 820 patients. The sample size was calculated for a desired correlation coefficient of 0.6, a population correlation coefficient of 0.8, a power of 0.8, and a confidence level of 0.95. All those invited to participate fulfilled the 2010 ACR/European League Against Rheumatism (EULAR) classification criteria [10] and were at least 18 years old. Those with a history of trauma, septic arthritis, joint replacement or synovectomy, joint deformity, and/or crystal arthropathy were excluded.
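The text does not state which formula was used for the sample size. As a rough illustration only, one common approach for comparing two correlation coefficients relies on Fisher's z transformation; the sketch below assumes a two-sided test and treats 0.6 and 0.8 as the two coefficients being contrasted, which may differ from the authors' actual procedure. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_for_correlation(r0: float, r1: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size to distinguish correlation r1 from r0.

    Uses Fisher's z transformation (atanh) with a two-sided test.
    This is an assumed method; the text does not specify one.
    """
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)  # critical value for the confidence level
    z_beta = normal.inv_cdf(power)           # quantile corresponding to the desired power
    effect = abs(math.atanh(r1) - math.atanh(r0))
    return math.ceil(((z_alpha + z_beta) / effect) ** 2 + 3)


# Parameters quoted in the text: 0.6, 0.8, power 0.8, confidence level 0.95.
minimum_n = n_for_correlation(0.6, 0.8, alpha=0.05, power=0.80)
```

Under these assumptions the result is only a lower bound on the number of patients; stratified sampling across disease activity states may justify a larger sample.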
The following data were registered at baseline: age, gender, treatment, erythrocyte sedimentation rate (ESR), C-reactive protein (CRP), rheumatoid factor (RF), global health assessed by the patient (GH), Clinical Disease Activity Index (CDAI), and Simplified Disease Activity Index (SDAI).

Training session dedicated to standardization
Our planned training session was structured to focus on standardization of the tenderness and swelling assessment of 28 joints (bilateral shoulders, elbows, wrists, knees, metacarpophalangeal (MCP), and proximal interphalangeal (PIP) joints).
Three rheumatologists, with 10, 13, and 15 years of experience in clinical examination of RA patients, attended three sessions (separated by 1-2 weeks). Each session was divided into three steps: (1) individual joint assessment by each rheumatologist (blinded to both clinical data and the other rheumatologists' data), (2) a 20-min discussion of practice observations, difficulties, limitations, and facilitating factors during the physical examination, in order to reach agreement on uniform criteria and technique, and (3) individual joint reassessment based on the unified criteria. In the third session, disease activity indexes (DAS28-ESR, SDAI, and CDAI) were calculated individually by each rheumatologist.

Ultrasound assessment
Twenty minutes after the third step of the last training session, each patient was instructed to proceed to the ultrasound (US) assessment room. The US examination was performed by a radiologist with 15 years of experience and training in musculoskeletal radiology (blinded to the physical examination data) on 10 joints (bilateral wrists and 2nd to 5th MCPs), using a GE (General Electric) LOGIQ e ultrasound machine with a 6-13 MHz multifrequency transducer [11]. PIP joints were considered potential confounders because of the possible overlap with tenosynovitis and were therefore not assessed. Synovitis grading was based on a scoring method initially introduced by Szkudlarek et al., widely used for studies of this kind [12][13][14] and currently supported by the EULAR-OMERACT ultrasound taskforce [15,16]: 0 = absence of synovial thickening; 1 (mild) = minimal synovial thickening, filling the angle between the periarticular bones, without bulging over the line linking the tops of the corresponding bones; 2 (moderate) = synovial thickening bulging over the line linking the tops of the periarticular bones but without extension along the bone diaphysis; 3 (severe) = severe synovial thickening bulging over the line linking the tops of the periarticular bones and with extension to at least one bone diaphysis. Normal distances between bone and joint capsule were taken from the average population values proposed by Schmidt et al. [17].
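The relationship between the 28 joints examined clinically and the 10 joints examined by US can be made explicit with a short sketch. The labels are illustrative; in particular, "PIP1" is used here as an assumed label for the thumb interphalangeal joint, which the conventional 28-joint count groups with the proximal interphalangeal joints.

```python
SIDES = ("left", "right")

# Joints per side in the conventional 28-joint count.
JOINTS_PER_SIDE = (
    ["shoulder", "elbow", "wrist", "knee"]
    + [f"MCP{i}" for i in range(1, 6)]
    + [f"PIP{i}" for i in range(1, 6)]  # PIP1 = thumb interphalangeal joint (assumed label)
)

# Joints per side examined by US in this study: wrist and 2nd to 5th MCP.
US_JOINTS_PER_SIDE = ["wrist"] + [f"MCP{i}" for i in range(2, 6)]

DAS28_JOINTS = [f"{side} {joint}" for side in SIDES for joint in JOINTS_PER_SIDE]
US_JOINTS = [f"{side} {joint}" for side in SIDES for joint in US_JOINTS_PER_SIDE]

# Only joints examined by both methods can enter the physical vs US comparison.
COMPARABLE_JOINTS = [joint for joint in DAS28_JOINTS if joint in set(US_JOINTS)]
```

Every joint examined by US is also part of the 28-joint physical examination, so the comparison between methods is restricted to these 10 joints per patient.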

Statistical analysis
Statistical analysis was performed using STATA software, version 15.0. Continuous variables were described with measures of central tendency: mean and standard deviation (SD) for normally distributed data, or median and interquartile range (IQR) for non-normally distributed data. Dichotomous variables were presented as absolute values and percentages.
Interobserver agreement and concordance were calculated through Cohen's kappa (between two observers, considering all possible pairs, i.e., observer A vs B, observer A vs C, observer B vs C), Fleiss' kappa (among the three observers), and the percentage of overall agreement (percentage of observed exact agreements). The relative strength of agreement was described according to the following ranges of kappa (k) coefficients: < 0.00 = poor, 0.00-0.20 = slight, 0.21-0.40 = fair, 0.41-0.60 = moderate, 0.61-0.80 = substantial, and 0.81-1.00 = almost perfect [18]. Linear correlation coefficients (Pearson correlation coefficient) were also calculated for tender and swollen joint counts.
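The agreement measures named above can be illustrated with a minimal sketch using only the textbook definitions. The analysis itself was run in STATA; the Python functions below are not the authors' code, and the ratings are invented toy data (1 = swollen, 0 = not swollen) for three hypothetical observers.

```python
from collections import Counter
from itertools import combinations


def percent_agreement(ratings):
    """Fraction of joints on which every observer gave the same rating."""
    return sum(len(set(row)) == 1 for row in ratings) / len(ratings)


def cohen_kappa(a, b):
    """Cohen's kappa for two observers rating the same joints."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_exp = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)  # undefined if p_exp == 1


def fleiss_kappa(ratings):
    """Fleiss' kappa; one row per joint, one column per observer."""
    n_subjects, n_raters = len(ratings), len(ratings[0])
    categories = sorted({r for row in ratings for r in row})
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Agreement within each joint, then averaged over joints.
    p_i = [
        (sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects
    # Chance agreement from the overall share of each category.
    p_j = [sum(row[j] for row in counts) / (n_subjects * n_raters)
           for j in range(len(categories))]
    p_exp = sum(p * p for p in p_j)
    return (p_bar - p_exp) / (1 - p_exp)  # undefined if p_exp == 1


def strength_of_agreement(kappa):
    """Labels for the kappa ranges listed in the text [18]."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy data: each row is one joint rated by observers A, B, and C.
toy_ratings = [(1, 1, 1), (1, 1, 0), (0, 0, 0), (0, 1, 0),
               (1, 1, 1), (0, 0, 0), (1, 0, 0), (0, 0, 0)]
columns = list(zip(*toy_ratings))
names = "ABC"
pairwise_kappa = {
    f"{names[i]} vs {names[j]}": cohen_kappa(columns[i], columns[j])
    for i, j in combinations(range(3), 2)
}
overall_kappa = fleiss_kappa(toy_ratings)
overall_agreement = percent_agreement(toy_ratings)
```

Note that the percentage of exact agreement does not correct for chance, which is why it can look high while kappa stays in the slight to fair range when one rating dominates.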
Spearman's correlation coefficient was calculated for all possible pairs (observer A-B, observer A-C, observer B-C), yielding an overall correlation coefficient of 0.81 and 0.431 for tender and swollen joints, respectively.
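Both Pearson's and Spearman's coefficients are named in the text, so the following sketch computes either one for every pair of observers. How the pairwise values were combined into an overall coefficient is not stated; averaging them, as done here, is an assumption, and the joint counts are invented toy data.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy swollen joint counts for five patients, one list per hypothetical observer.
joint_counts = {"A": [2, 5, 0, 8, 3], "B": [3, 5, 1, 7, 3], "C": [1, 6, 0, 9, 2]}
pairwise_rho = {
    (a, b): spearman(joint_counts[a], joint_counts[b])
    for a, b in combinations(joint_counts, 2)
}
# Assumption: the overall coefficient is the mean of the pairwise values.
overall_rho = mean(list(pairwise_rho.values()))
```

Spearman's coefficient is the safer choice when joint counts are skewed or contain many ties, as is common in patients with low disease activity.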

Agreement and concordance between physical and US examination
What is interesting in Table 3 is that, among the ten joints assessed by both methods, slightly more joints were found to be swollen by US examination (345 of 800 joints assessed by US, 43%) than by physical examination (946 of 2400 joint assessments by our three rheumatologists, 39%). The concordance strength of agreement was fair, with Fleiss' kappa ranging from 0.168 to 0.264. The k coefficients between physical and US examination for the left MCP5 joint (k = 0.168, p = 0.06), followed by the left MCP4 joint (k = 0.177, p = 0.053), stand out as the lowest values. By contrast, the k coefficients for the wrists were slightly higher (k = 0.241, p = 0.007 and k = 0.213, p = 0.023). The percentage of overall agreement and the concordance rate were 67.18% and k = 0.210 (p = 0.027), respectively.
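The denominators quoted above follow directly from the study design, and the percentages can be rechecked with a few lines of arithmetic. The swollen joint counts (345 and 946) are the figures reported in the text; the variable names are illustrative.

```python
patients = 80
observers = 3
das28_joints_per_patient = 28
us_joints_per_patient = 10

# Joints assessed per observer in the physical examination.
physical_joints = patients * das28_joints_per_patient  # 2240

# Joint assessments available for the physical vs US comparison.
us_assessments = patients * us_joints_per_patient       # 800
physical_assessments = us_assessments * observers       # 2400

# Swollen joints reported in the text.
us_swollen = 345
physical_swollen = 946

us_swollen_pct = 100 * us_swollen / us_assessments
physical_swollen_pct = 100 * physical_swollen / physical_assessments

print(f"US: {us_swollen_pct:.1f}% swollen")
print(f"Physical examination: {physical_swollen_pct:.1f}% swollen")
```

The physical examination denominator is three times the US denominator because each of the ten joints was rated by all three rheumatologists but by a single radiologist.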

Discussion
As explained in the introduction, training sessions focused on the standardization of joint assessment techniques play a pivotal role in understanding the reliability of the quantitative indexes used daily in RA patients. We aimed to evaluate the correlation and agreement between rheumatologists after such a training session and to compare its performance with a reference imaging modality, namely MSUS, in search of a reasonable approach to the high intra- and interobserver variability described in the literature. Contrary to expectations, although all three participating rheumatologists attended the proposed training sessions, interobserver variability among them persisted. The wide range of observed concordance rates suggests that the assessment of some individual joints may be particularly challenging, posing specific difficulties for standardization that may reflect a long and comprehensive underlying learning process.
In terms of overall interobserver agreement and concordance rates, as well as overall correlation, our findings suggest that tenderness assessment was more homogeneous than swelling assessment. A possible explanation is that joint tenderness is inferred solely from the patient's response, whereas joint swelling depends on the physician's technique and interpretation of findings.
The importance of training sessions focused on standardization was first stated by Scott et al. in 1996 [19], whose findings suggested a considerably increased sensitivity of measurement for both tender and swollen joints and a reduction in the mean coefficient of variation for the number of swollen joints (82% vs 59%) after a 60-min training session based on the EULAR handbook for joint evaluation. Unlike Scott et al., in a multicenter cohort study evaluating standardization based on the same EULAR handbook, Grunke et al. [20] reported that although consistency and variability improved significantly, the mean numbers of tender and swollen joints decreased. Nonetheless, a reference imaging modality was not used.
In addition, clinical experience plays an essential part, as recently proposed by El-Hadidi et al. [21], who, after a consensus on joint assessment, calculated the interobserver agreement between an experienced rheumatology professor (30 years of experience) and a young rheumatology fellow (3 years of training). Although a high correlation between professor and fellow was described, the specific results on joint assessment correlation are similar to ours, showing a more robust correlation for tender joints than for swollen joints.
When compared to MSUS, as previously stated, agreement and concordance were slightly lower for swelling assessment. This discrepancy could be attributed to the mean age of our patients (55 years); it seems possible that the older the patient, the more frequent synovial thickening and incipient joint deformities become, thus confounding not only MSUS but also physical examination. Moreover, it is important to consider what recent reports have widely held, namely the trend towards higher agreement and concordance rates for swollen joint counts [22,23]. These data must be interpreted with caution because the OMERACT synovitis definition considers even a minimal amount of intraarticular tissue an abnormal finding of synovial hypertrophy, which may lead to overestimation.
It could be argued that the average age and the duration and activity of the disease contribute to the differences between physical examination and US; nevertheless, patients with joint deformity were excluded. It is relevant to bear in mind that, in populations with poorly controlled and longer-standing disease, joint assessment by either physical examination or ultrasound remains a challenge because of joint surface irregularities. This supports the pivotal role of conventional radiography as the first choice for the evaluation of structural changes such as erosions.
The limitations of this study include, firstly, the absence of power Doppler ultrasonographic assessment because of time constraints given the large number of evaluated joints; prior studies have noted the role of power Doppler US in detecting subclinical and residual synovitis when assessing synovial vascularity, although it was beyond the scope of our work to evaluate such conditions. Secondly, we did not have a second radiologist with sufficient experience in MSUS to act as a second evaluator; additionally, the fact that the US evaluation was performed by a radiologist and not by a rheumatologist trained in articular US might be considered a source of uncertainty; nonetheless, the extensive experience and specific MSUS training of our radiologist support the internal validity of the assessments. Very little was found in the literature on performance differences between those scenarios (radiologist vs. rheumatologist), which remains an intriguing issue for future research, especially in low- and middle-income countries where training in such medical subspecialties is less common.
Thirdly, early RA patients and patients in clinical remission were absent from our sample. Finally, we did not report pretraining joint count results in order to demonstrate an improvement in variability. The latter must be considered with care: this decision was based on the fact that pretraining correlation and agreement scores were highly variable, close to the null value, and even negative. Taking this into consideration, the gain of the training session was to reverse that tendency and turn it to positive values, even though the kappa coefficients were in the range of slight to moderate. It is worth mentioning that when more than two observers are included in the calculations, the slightest discrepancy substantially lowers the statistical estimator; thus, an adjusted threshold must be applied when interpreting such results. Lastly, it is reasonable to hypothesize that the increase in the total swollen joint count through US examination would have substantial implications for the categorization of disease activity; although this was beyond our scope, we trust that future research on this topic will arise as a natural progression of this work.

Conclusions
Taken together, our findings indicate that joint tenderness assessment had better interobserver agreement and concordance rate, as well as a higher overall correlation coefficient, than joint swelling assessment. Our results suggest a fair overall concordance rate when comparing the US examination to the physical examination, thus supporting the need for the implementation of training sessions dedicated to standardization in rheumatology clinics to guarantee reliability in identifying tender and swollen joints among rheumatologists.