Development and validation of machine learning models for predicting lung metastasis risk in differentiated thyroid cancer based on two databases

Haolin Shen; Caiyun Yang; Yuegui Wang; Jianmei Liao; Xianbo Zuo; Bo Zhang; Xiao Yang

doi:10.21037/gs-24-481

Original Article

Haolin Shen¹#, Caiyun Yang¹#, Yuegui Wang¹, Jianmei Liao¹, Xianbo Zuo², Bo Zhang³,⁴, Xiao Yang¹,⁵

¹Department of Ultrasound, Zhangzhou Municipal Hospital Affiliated to Fujian Medical University, Zhangzhou, China; ²Department of Dermatology, China-Japan Friendship Hospital, Beijing, China; ³Department of Ultrasound, China-Japan Friendship Hospital, National Center for Respiratory Medicine, National Clinical Research Center for Respiratory Diseases, Institute of Respiratory Medicine of Chinese Academy of Medical Sciences, Beijing, China; ⁴Department of Institute of Respiratory, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China; ⁵Department of Ultrasound, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China

Contributions: (I) Conception and design: H Shen, X Zuo, X Yang; (II) Administrative support: J Liao, X Zuo, B Zhang, X Yang; (III) Provision of study materials or patients: C Yang, Y Wang; (IV) Collection and assembly of data: H Shen, C Yang; (V) Data analysis and interpretation: X Zuo, B Zhang, X Yang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

#These authors contributed equally to this work.

Correspondence to: Xiao Yang, MD. Department of Ultrasound, Zhangzhou Municipal Hospital Affiliated to Fujian Medical University, No. 59 North Shengli Road, Zhangzhou 363000, China; Department of Ultrasound, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, 1 Shuaifuyuan Road, Beijing 100730, China. Email: yang_smile@163.com; Xianbo Zuo, PhD. Department of Dermatology, China-Japan Friendship Hospital, No. 2 East Yinghua Road, Hepingli, Chaoyang District, Beijing 100029, China. Email: zuoxianbo@qq.com; Bo Zhang, MD. Department of Ultrasound, China-Japan Friendship Hospital, National Center for Respiratory Medicine, National Clinical Research Center for Respiratory Diseases, Institute of Respiratory Medicine of Chinese Academy of Medical Sciences, Beijing 100029, China; Department of Institute of Respiratory, Chinese Academy of Medical Sciences and Peking Union Medical College, 1 Shuaifuyuan Road, Beijing 100730, China. Email: thyroidus@163.com.

Background: Differentiated thyroid cancer (DTC) progresses slowly, but patients with lung metastasis (LM) have a poor prognosis. The aim of this study was to develop and evaluate the predictive ability of machine learning (ML) models in estimating the risk of LM in patients with DTC and to identify the independent risk factors specific to different age and gender subgroups.

Methods: The demographic and clinicopathological data of patients with DTC were obtained from two databases: firstly, the National Institutes of Health Surveillance, Epidemiology, and End Results (SEER) database [2010–2015], which provides extensive epidemiological and clinical information on cancer patients; secondly, the Zhangzhou Municipal Hospital Affiliated to Fujian Medical University [2014–2017], which focuses more on patients’ specific clinicopathological characteristics and treatment outcomes. Common variables from both databases were extracted. The data were then split into training, testing, and validation sets. The training set was used to build and train ML models, while the testing and validation sets were employed to assess the performance of these models. In terms of model development, we established five different ML models: logistic regression (LR), random forest (RF), decision tree (DT), extreme gradient boosting (XGBoost), and gradient boosting machine (GBM). For model validation, we utilized various evaluation metrics, including accuracy, precision, recall, F1 score, Brier score, area under the receiver operating characteristic (ROC) curve (AUROC), area under the precision-recall (PR) curve (PR-AUC), calibration curve, and decision curve analysis (DCA). The importance of various features was ranked and visualized for the top-performing models.

Results: The analysis identified age, gender, tumor size, T stage, N stage, and histologic type as significant independent risk factors for LM. The effects of gender, T stage, and histological type on the risk of LM varied across the different age subgroups. In the female population, tumor size was an independent risk factor for LM, while it was not in the male population. GBM achieved an AUROC of 0.982, a Brier score of 0.047, an accuracy of 0.974, and an F1 score of 0.818 in the validation set, outperforming the other models.

Conclusions: The GBM model emerged as an effective tool for identifying high-risk LM populations in DTC, with the potential to guide clinical practice and facilitate the development of individualized treatment plans. Further research to validate these findings across more diverse patient populations and clinical settings is recommended.

Keywords: Lung metastasis (LM); machine learning (ML); Surveillance, Epidemiology, and End Results database (SEER database); differentiated thyroid cancer (DTC)

Submitted Nov 05, 2024. Accepted for publication Nov 20, 2024. Published online Nov 26, 2024.

Highlight box

Key findings

• The independent risk factors for lung metastasis (LM) vary across different age groups and genders, and machine learning (ML) models can accurately predict the risk of LM in patients with differentiated thyroid cancer (DTC) to guide clinical decision-making.

What is known and what is new?

• Previous models for predicting LM risk in DTC have relied solely on the Surveillance, Epidemiology, and End Results (SEER) database.

• We integrated data from the SEER database with those from Chinese hospitals to develop ML models. Our model not only effectively identifies high-risk populations for LM but also represents a novel approach in data fusion modeling, significantly enhancing the accuracy and generalization capability of the models.

What is the implication, and what should change now?

• The gradient boosting machine model performed well on different datasets and showed high value in predicting LM risk in patients with DTC. It can aid clinicians in formulating personalized treatment plans and thus improve patient outcomes.

Introduction

Thyroid cancer is the most common malignancy to affect the endocrine glands (1). Differentiated thyroid cancer (DTC) typically has a slow progression and favorable prognosis, with a 10-year survival rate of approximately 85–90% (2). Distant organ metastasis, most commonly to the lung, develops in 3–20% of patients with DTC (3,4). Patient prognosis is related to the size of lung metastases (LMs): the larger the LM, the worse the prognosis (5,6). Research in this field underscores the significance of the early detection of LM for improved treatment efficacy and survival rates in DTC. It is therefore crucial to accurately assess the aggressiveness of DTC, identify individuals at high risk for LM, and develop personalized follow-up plans accordingly.

Currently, the assessment of DTC malignancy and screening for LM mainly rely on clinical features, pathologic type, and molecular markers. The predictive value of clinical characteristics such as age, gender, tumor size, and tumor-node-metastasis (TNM) staging varies significantly across studies (7,8). Pathologic type alone is insufficient for accurately assessing the invasiveness of DTC (9), while the high cost and technological demands of genetic testing have restricted its widespread clinical application. Therefore, it is imperative to develop a simple and accurate method for assessing the risk of LM in DTC.

Risk prediction models, by employing a series of predictive factors, can accurately quantify the risk level of individuals with specific characteristics (10). In clinical practice, there is an escalating demand for such high-precision prediction models, as they provide a scientific basis for personalized medical decision-making, the formulation of early intervention strategies, and the optimal allocation of resources. Logistic regression (LR) is the most traditional risk prediction model and constructs a regression equation from the input variables (11). With the rise of artificial intelligence, machine learning (ML) algorithms such as decision trees (DTs) and random forests (RFs) have been utilized to improve disease prediction models (12). An important advantage of ML over conventional statistical methods (e.g., LR) is that ML algorithms do not require the data to meet statistical assumptions, such as independence of observations and the absence of multicollinearity among independent variables. Another advantage is that ML can exploit the abundance of electronic health record (EHR) data and modern computing power (13). Numerous studies have demonstrated the high accuracy of ML in cancer prediction (14,15). Despite the availability of various ML algorithms, there is currently no consensus on which algorithm outperforms others in disease prediction.

The Surveillance, Epidemiology, and End Results (SEER) database provides clinical, demographic, and survival data for patients diagnosed with cancer (16,17). SEER*Stat software can be used for filtering and retrieving data from the SEER database. The SEER database has been used to construct ML models to predict distant metastasis in patients with thyroid cancer (18,19). However, these models have not been externally validated, and their applicability and effectiveness in the Chinese population have not been demonstrated.

Thus, we developed several ML models and validated them using the SEER and other datasets. The best-performing models were selected to facilitate the early identification of patients with DTC at high risk of developing LM. We present this article in accordance with the TRIPOD reporting checklist (available at https://gs.amegroups.com/article/view/10.21037/gs-24-481/rc) (20,21).

Methods

Variables were extracted from two databases to construct multiple ML models, which were then validated using different datasets to select the optimal model. This retrospective study was approved by the research ethics committee of Zhangzhou Municipal Hospital Affiliated to Fujian Medical University (No. 2022KYB138). The requirement for individual consent was waived due to the retrospective nature of the analysis. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).

Definition criteria for LM

Patients who exhibited indications of LM through imaging or pathological examination were classified as being at high risk for LM. Specifically, a patient was placed into this high-risk category if any of the following criteria were fulfilled: (I) the presence of LM discernible on chest computed tomography; (II) the presence of LM discernible on ¹³¹I imaging; (III) surgical resection of lung nodules, with subsequent pathological analysis suggesting metastasis; and (IV) bronchoalveolar lavage suggestive of LM. The assessment of these criteria and the subsequent classification of patients as high risk for LM were performed by experienced senior radiologists and pathologists who were blinded to the patients’ clinical information.

Datasets

The data used in this study were extracted from the SEER Registry Research Database. These data originated from 17 cancer registries, covering cancer incidence from 2000 to 2020, and were collected up to November 2022 (SEER Incidence Database; 17 Registries; November 2022; sub 2000–2020). Case lists were generated using SEER*Stat software. Our study focused on patients diagnosed with DTC between 2010 and 2015, as the LM site information prior to 2010 was not available. The collected data encompassed demographic and clinical characteristics, including age, sex, multifocality, maximum diameter, T stage, N stage, histological type, and the presence of LM. The International Classification of Diseases (ICD) codes considered as DTC [third edition of the ICD for Oncology (ICD-O-3)] included those for papillary thyroid carcinoma (PTC) (8050 and 8260) and follicular thyroid carcinoma (FTC) (8330, 8331, 8332, 8335, and 8337). The exclusion criteria were as follows: (I) DTC diagnosis confirmed solely by autopsy or death certificate; (II) incomplete patient information; (III) 0 months of survival; (IV) unclarified T or N stage, or stage T0; and (V) nonprimary DTC tumor.

Another dataset from Zhangzhou Municipal Hospital Affiliated to Fujian Medical University comprised the demographic and clinical characteristics of patients with DTC treated from January 2014 to December 2017 and was in alignment with the SEER database variables. Patients with incomplete clinical data were excluded. It has been proposed that the accuracy of samples larger than 120 exhibits relatively minor variations across all ML classifiers, leading to more accurate results (22). We ultimately included 274 patients. Each patient was followed up for a minimum of 5 years and was considered to be at low risk of LM if they did not develop LM 5 years after DTC was discovered.

Data processing

Variables common to the SEER database and the Zhangzhou Municipal Hospital Affiliated to Fujian Medical University dataset were included in the analysis. Ordinal or categorical variables were used for gender (male or female), age (<40, 40–59, or ≥60 years), multifocality (single or multiple), T stage (T1, T2, T3, or T4), N stage (N0 or N1), and size.

Outliers were defined as values lying more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile, and each outlier was replaced with the nearest nonoutlier boundary value. Missing values were handled by multiple imputation.
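The outlier rule can be illustrated with a short sketch. It is written in Python for illustration only (the study itself used R), and the function name `clip_iqr_outliers`, the quartile method, and the sample tumor sizes are assumptions that do not come from the study. The "nearest nonoutlier boundary value" is interpreted here as the IQR fence itself.

```python
from statistics import quantiles

def clip_iqr_outliers(values, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence.

    Assumption: the 'inclusive' quartile method; other conventions shift
    the fences slightly.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]

# Hypothetical tumor sizes in mm; 140 is an implausibly large entry.
sizes_mm = [8, 10, 12, 15, 18, 20, 25, 140]
clipped = clip_iqr_outliers(sizes_mm)
print(clipped)
```

Only the extreme value is pulled back to the upper fence; values inside the fences are left unchanged.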

Sampling

The original SEER dataset was severely unbalanced, with LM accounting for fewer than 1% of cases (238 of 30,433). We therefore employed an undersampling technique to balance the SEER dataset prior to modeling (23). The processed dataset was then divided into a training set and a testing set (8:2 ratio). The data of Zhangzhou Municipal Hospital Affiliated to Fujian Medical University were divided into a training set [2014–2015] and validation set [2016–2017] according to time nodes without any sampling processing. The training set from the SEER database and the training set from our hospital data were fused into a new dataset, which was used as the training set for the ML models. The testing set from SEER and the validation set from our hospital data were used for evaluating the models’ performance.
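The sampling scheme in this section (undersample the registry data, split it 8:2, split the hospital data by calendar year, then fuse the two training parts) can be sketched as follows. This is a minimal illustration in Python, not the study's R code. The record layout, the function names, the seeds, and the toy cohorts are all hypothetical.

```python
import random

def undersample(records, label_key="lm", seed=0):
    """Randomly drop majority-class records until both classes are equal in size."""
    rng = random.Random(seed)
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

def random_split(records, train_fraction=0.8, seed=0):
    """Shuffle and cut into a training part and a testing part."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def temporal_split(records, last_training_year, year_key="year"):
    """Earlier years go to training, later years to validation (no resampling)."""
    train = [r for r in records if r[year_key] <= last_training_year]
    valid = [r for r in records if r[year_key] > last_training_year]
    return train, valid

# Hypothetical toy cohorts (not the study data).
registry = [{"lm": 1, "year": 2012}] * 20 + [{"lm": 0, "year": 2012}] * 180
hospital = [{"lm": 0, "year": 2014 + i % 4} for i in range(40)]

balanced = undersample(registry)
registry_train, registry_test = random_split(balanced)
hospital_train, hospital_valid = temporal_split(hospital, last_training_year=2015)
fused_training = registry_train + hospital_train
print(len(fused_training), len(registry_test), len(hospital_valid))
```

Keeping the hospital validation years untouched preserves the real-world class ratio, which is why precision-based metrics behave differently on that set than on the balanced testing set.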

Statistical analysis

Statistical analysis was conducted using R v. 4.3.0 software (https://www.r-project.org; The R Foundation of Statistical Computing, Vienna, Austria). Following the Kolmogorov-Smirnov test for normality, measurement data adhering to a normal distribution are presented as mean [standard deviation (SD)], and group comparisons were performed using one-way analysis of variance. Data with a nonnormal distribution are represented as the median and IQR, and group comparisons were performed using the rank-sum test. Categorical variables are expressed as the number of cases and rate, and the Chi-square test was employed for intergroup comparisons. Univariate analyses were conducted using the Mann-Whitney test, Pearson Chi-square test, or Fisher exact test. Variables with P values less than 0.20 in the univariate analyses were analyzed with multivariate LR to identify independent risk factors for LM (P<0.05).

Age was divided into three subgroups: <40, 40–59, and ≥60 years. Gender was divided into two subgroups: male and female. Univariate and multivariate analyses were performed on various variables within each subgroup to screen for independent risk factors for LM, and the differences in independent risk factors among different subgroups were compared.
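The two-stage selection described above (univariate screening at P<0.20, then multivariate LR) is what produces the "–" entries in the subgroup tables. A minimal Python sketch of the screening step is given below. The function name is hypothetical, the P values are the univariate values reported for the ≥60 years subgroup in Table 3, and "<0.001" is encoded as 0.001 purely so that it can be compared numerically.

```python
def screen_variables(univariate_p, alpha=0.20):
    """Keep variables whose univariate P value is below the screening level."""
    return [name for name, p in univariate_p.items() if p < alpha]

# Univariate P values for the >=60 years subgroup (Table 3);
# 0.001 stands in for the reported "<0.001".
p_age_60_plus = {
    "gender": 0.001,
    "multifocality": 0.21,
    "size": 0.001,
    "T stage": 0.001,
    "N stage": 0.001,
    "histology": 0.001,
}

candidates = screen_variables(p_age_60_plus)
print(candidates)
```

Multifocality (P=0.21) does not pass the screen in this subgroup, so it has no multivariate estimate.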

The five ML models used for this study were LR (with the Akaike information criterion used to select optimal variables), RF (optimized by tuning the mtry and ntree parameters), DT (with the default configuration), extreme gradient boosting (XGBoost; configured with a binary logistic objective function, a learning rate of 0.1, a maximum tree depth of 4, and sample and feature sampling rates of 0.8), and gradient boosting machine (GBM; with a learning rate of 0.005, a maximum tree depth of 4, a minimum of 3 observations per node, and a total of 5,000 trees). The training data were fed into these models, yielding predictions from each model.

The receiver operating characteristic (ROC) curve, precision-recall (PR) curve, calibration curve, and decision curve analysis (DCA) were plotted. Model performance was evaluated in terms of accuracy, precision, recall, F1 score, Brier score, area under the ROC curve (AUROC), and area under the PR curve (PR-AUC). The top-performing model’s variable importance was visualized.
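For reference, the threshold-based metrics and the Brier score listed above follow the standard definitions, which can be written out as a short sketch. The code is illustrative Python rather than the R used in the study. The 0.5 decision threshold, the function name, and the toy labels and probabilities are assumptions.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall, F1 at a threshold, plus the Brier score."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, yhat in pairs if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in pairs if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in pairs if y == 1 and yhat == 0)
    tn = sum(1 for y, yhat in pairs if y == 0 and yhat == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Brier score: mean squared difference between predicted probability and outcome.
    brier = sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "brier": brier,
    }

# Hypothetical labels (1 = LM) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.1, 0.05]
metrics = classification_metrics(y_true, y_prob)
print(metrics)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas the Brier score is computed directly from the predicted probabilities and reflects both discrimination and calibration.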

Results

The original dataset obtained from the SEER database, which totaled 30,433 patients, was balanced using the undersampling technique. After balancing, the dataset comprised 238 patients with LM and 238 patients without LM. The dataset was then randomly partitioned into a training set (382 cases) and a testing set (94 cases) at an 8:2 ratio. Data from a total of 267 cases were collected in our hospital and divided into a training set (111 cases) and a validation set (156 cases) according to time nodes. The fused training set had a total of 493 cases. The distribution of variables in the different datasets is shown in Table 1.

Table 1

Distribution of variables in the different datasets

Variables	Training, NLM (n=288)	Training, LM (n=205)	Testing, NLM (n=47)	Testing, LM (n=47)	Validation, NLM (n=145)	Validation, LM (n=11)
Age (years)
<40	94 (32.6)	38 (18.5)	14 (29.8)	7 (14.9)	41 (28.3)	3 (27.3)
40–59	144 (50.0)	50 (24.4)	29 (61.7)	12 (25.5)	95 (65.5)	2 (18.2)
≥60	50 (17.4)	117 (57.1)	4 (8.5)	28 (59.6)	9 (6.2)	6 (54.5)
Gender
Male	61 (21.2)	96 (46.8)	10 (21.3)	22 (46.8)	27 (18.6)	4 (36.4)
Female	227 (78.8)	109 (53.2)	37 (78.7)	25 (53.2)	118 (81.4)	7 (63.6)
Multifocality
Solitary	172 (59.7)	102 (49.8)	26 (55.3)	24 (51.1)	89 (61.4)	2 (18.2)
Multifocal	116 (40.3)	103 (50.2)	21 (44.7)	23 (48.9)	56 (38.6)	9 (81.8)
Size (mm)	17.5 (15.7)	44.1 (24.4)	25.1 (25.8)	52.0 (28.1)	10.3 (7.3)	27.0 (6.8)
T
T1	180 (62.5)	18 (8.8)	29 (61.7)	2 (4.3)	116 (80.0)	1 (9.1)
T2	43 (14.9)	17 (8.3)	8 (17.0)	4 (8.5)	7 (4.8)	1 (9.1)
T3	38 (13.2)	78 (38.0)	9 (19.1)	13 (27.7)	12 (8.3)	0 (0.0)
T4	27 (9.4)	92 (44.9)	1 (2.1)	28 (59.6)	10 (6.9)	9 (81.8)
N
N0	178 (61.8)	62 (30.2)	32 (68.1)	12 (25.5)	59 (40.7)	1 (9.1)
N1	110 (38.2)	143 (69.8)	15 (31.9)	35 (74.5)	86 (59.3)	10 (90.9)
Histology
PTC	278 (96.5)	158 (77.1)	46 (97.9)	39 (83.0)	144 (99.3)	8 (72.7)
FTC	10 (3.5)	47 (22.9)	1 (2.1)	8 (17.0)	1 (0.7)	3 (27.3)

Data are presented as n (%) or mean (SD). NLM, non-lung metastasis; LM, lung metastasis; PTC, papillary thyroid carcinoma; FTC, follicular thyroid carcinoma; SD, standard deviation.

Results of univariate and multivariate LR analyses

Analysis was performed on the data from 30,433 patients in the SEER database. Table 2 shows the results of univariate and multivariate LR analyses. In univariate analysis, all variables with a P value below 0.2 were included in the multivariate analysis. Multivariate analysis showed that age, gender, tumor size, T stage, N stage, and histologic type were independently correlated with LM (all P values <0.05).

Table 2

Univariate analysis and multivariate LR analysis of clinical characteristics related to LM

Variables	NLM (n=30,195)	LM (n=238)	Univariate P value	Multivariate OR	Multivariate 95% CI	Multivariate P value
Age (years)			<0.001
<40	10,583 (35.0)	38 (16.0)		Reference	–	–
40–59	13,871 (45.9)	56 (23.5)		1.261	0.828–1.940	0.28
≥60	5,741 (19.0)	144 (60.5)		5.882	4.049–8.731	<0.001
Gender			<0.001
Male	6,773 (22.4)	114 (47.9)		Reference	–	–
Female	23,422 (77.6)	124 (52.1)		0.607	0.459–0.803	<0.001
Multifocality			0.003
Solitary	18,375 (60.9)	122 (51.3)		Reference	–	–
Multifocal	11,820 (39.1)	116 (48.7)		1.197	0.898–1.595	0.22
Size (mm)	32.1 (116.5)	109.2 (240.5)	<0.001	1.001	1.001–1.002	<0.001
T			<0.001
T1	17,957 (59.5)	18 (7.6)		Reference	–	–
T2	4,953 (16.4)	18 (7.6)		2.253	1.146–4.430	0.02
T3	6,506 (21.5)	89 (37.4)		6.483	3.911–11.325	<0.001
T4	779 (2.6)	113 (47.5)		46.843	28.015–82.499	<0.001
N			<0.001
N0	21,371 (70.8)	74 (31.1)		Reference	–	–
N1	8,824 (29.2)	164 (68.9)		3.814	2.657–5.550	<0.001
Histology			<0.001
PTC	27,944 (92.5)	184 (77.3)		Reference	–	–
FTC	2,251 (7.5)	54 (22.7)		5.433	3.587–8.211	<0.001

Data are presented as n (%) or mean (SD), unless otherwise stated. LR, logistic regression; LM, lung metastasis; NLM, non-lung metastasis; OR, odds ratio; CI, confidence interval; PTC, papillary thyroid carcinoma; FTC, follicular thyroid carcinoma; SD, standard deviation.
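To make the odds ratios in Table 2 easier to interpret, the sketch below computes a crude (unadjusted) odds ratio with a Woolf-type 95% confidence interval from the N-stage counts in the table. It is an illustration in Python with a hypothetical function name. Because it ignores the other covariates, the result is not expected to match the adjusted multivariate OR of 3.814 reported for N1.

```python
import math

def crude_odds_ratio(exposed_cases, exposed_noncases, ref_cases, ref_noncases, z=1.96):
    """Unadjusted odds ratio and normal-approximation CI from a 2x2 table."""
    odds_ratio = (exposed_cases * ref_noncases) / (exposed_noncases * ref_cases)
    # Standard error of log(OR) by Woolf's method.
    se_log_or = math.sqrt(
        1 / exposed_cases + 1 / exposed_noncases + 1 / ref_cases + 1 / ref_noncases
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# N stage counts from Table 2: N1 has 164 LM and 8,824 NLM; N0 has 74 LM and 21,371 NLM.
or_n1, ci_low, ci_high = crude_odds_ratio(164, 8824, 74, 21371)
print(f"crude OR for N1 vs N0: {or_n1:.2f} ({ci_low:.2f} to {ci_high:.2f})")
```

The crude estimate is larger than the adjusted one, which is the usual pattern when nodal status is correlated with other predictors such as T stage and tumor size.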

Subgroup analysis according to age and gender

Table 3 summarizes the independent risk factors for LM, stratified by age: <40, 40–59, and ≥60 years. Gender was found to be an independent risk factor solely in patients aged <60 years and not for those aged ≥60 years. Multifocality was not significantly associated with risk in any of the age groups. Tumor size consistently emerged as an independent risk factor regardless of age. T stage was found to be a risk factor in the youngest and oldest subgroups but not the middle-aged subgroup. N stage was prominent as a uniform risk factor across all ages, while histological type was only significant in the age groups of ≥40 years.

Table 3

P values from univariate and multivariate analyses of the association between variables and LM across the different age groups

Variables	<40 years, univariate P	<40 years, multivariate P	40–59 years, univariate P	40–59 years, multivariate P	≥60 years, univariate P	≥60 years, multivariate P
Gender	<0.001		<0.001		<0.001
Male		Reference		Reference		Reference
Female		0.005		0.02		0.12
Multifocality	0.01		0.004		0.21
Solitary		Reference		Reference		–
Multifocal		0.69		0.11		–
Size (mm)	<0.001	0.01	<0.001	0.11	<0.001	<0.001
T	<0.001		<0.001		<0.001
T1		Reference		Reference		Reference
T2		0.44		0.98		0.28
T3		0.004		0.98		<0.001
T4		<0.001		0.98		<0.001
N	<0.001		<0.001		<0.001
N0		Reference		Reference		Reference
N1		0.002		<0.001		<0.001
Histology	0.52		<0.001		<0.001
PTC		–		Reference		Reference
FTC		–		<0.001		<0.001

LM, lung metastasis; PTC, papillary thyroid carcinoma; FTC, follicular thyroid carcinoma.

Table 4 provides a comparison of the independent risk factors associated with LM between the male and female populations. Tumor size was an independent risk factor in females but not males. In contrast, no significant differences between genders were observed for other variables in their association with LM.

Table 4

P values from gender-stratified univariate and multivariate analyses of the association between variables and LM

Variables	Male, univariate P	Male, multivariate P	Female, univariate P	Female, multivariate P
Age (years)	<0.001		<0.001
<40		Reference		Reference
40–59		0.78		0.33
≥60		<0.001		<0.001
Multifocality	0.582		0.004
Solitary		–		Reference
Multifocal		–		0.21
Size (mm)	<0.001	0.06	<0.001	<0.001
T	<0.001		<0.001
T1		Reference		Reference
T2		0.15		0.08
T3		<0.001		<0.001
T4		<0.001		<0.001
N	<0.001		<0.001
N0		Reference		Reference
N1		<0.001		<0.001
Histology	<0.001		<0.001
PTC		Reference		Reference
FTC		<0.001		<0.001

LM, lung metastasis; PTC, papillary thyroid carcinoma; FTC, follicular thyroid carcinoma.

Construction and validation of the ML models

The model accuracy, precision, recall, F1 score, AUROC, and Brier score in the training set, testing set, and external validation set are shown in Table 5.

Table 5

Prediction performance for each model

Model	Datasets	Accuracy	Precision	Recall	F1	Brier	AUROC
LR	Training	0.834	0.783	0.829	0.806	0.117	0.913
	Testing	0.894	0.894	0.894	0.894	0.094	0.946
	Validation	0.885	0.370	0.909	0.526	0.067	0.969
RF	Training	0.901	0.868	0.898	0.883	0.073	0.967
	Testing	0.883	0.860	0.915	0.887	0.093	0.944
	Validation	0.910	0.421	0.727	0.533	0.056	0.962
DT	Training	0.830	0.798	0.790	0.794	0.170	0.906
	Testing	0.872	0.872	0.872	0.872	0.128	0.918
	Validation	0.942	0.563	0.818	0.667	0.058	0.961
XGBoost	Training	0.911	0.889	0.898	0.893	0.069	0.971
	Testing	0.830	0.844	0.809	0.826	0.103	0.936
	Validation	0.942	0.556	0.909	0.690	0.053	0.972
GBM	Training	0.866	0.836	0.844	0.840	0.097	0.942
	Testing	0.851	0.884	0.809	0.844	0.093	0.943
	Validation	0.974	0.818	0.818	0.818	0.047	0.982

AUROC, area under the receiver operating characteristic curve; LR, logistic regression; RF, random forest; DT, decision tree; XGBoost, extreme gradient boosting; GBM, gradient boosting machine.
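The validation-set rows of Table 5 can be cross-checked against the class sizes in Table 1 (11 LM and 145 NLM). The Python sketch below back-calculates the confusion-matrix counts implied by the reported precision and recall, then recomputes accuracy and F1. The counts are inferred from the rounded metrics and are not taken from the study, and the function name is hypothetical.

```python
def counts_from_report(n_pos, n_neg, precision, recall):
    """Infer TP, FP, FN, TN from rounded precision and recall and the class sizes."""
    tp = round(recall * n_pos)
    fp = round(tp / precision) - tp
    fn = n_pos - tp
    tn = n_neg - fp
    return tp, fp, fn, tn

# Validation-set rows of Table 5: (accuracy, precision, recall, F1).
reported = {
    "LR": (0.885, 0.370, 0.909, 0.526),
    "RF": (0.910, 0.421, 0.727, 0.533),
    "DT": (0.942, 0.563, 0.818, 0.667),
    "XGBoost": (0.942, 0.556, 0.909, 0.690),
    "GBM": (0.974, 0.818, 0.818, 0.818),
}

counts = {}
recomputed = {}
for model, (accuracy, precision, recall, f1) in reported.items():
    tp, fp, fn, tn = counts_from_report(11, 145, precision, recall)
    counts[model] = (tp, fp, fn, tn)
    recomputed[model] = ((tp + tn) / 156, 2 * tp / (2 * tp + fp + fn))
    print(model, counts[model], [round(v, 3) for v in recomputed[model]])
```

Under this reconstruction, the reported accuracy and F1 values are reproduced for all five models. It also shows that the differences in validation precision come from a small number of false positives measured against only 11 LM cases.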

Training set

In the training set, all models exhibited an AUROC exceeding 0.9. RF, XGBoost, and GBM had Brier scores below 0.1 and achieved accuracy, precision, recall, and F1 scores exceeding 0.8.

Test set

In the test set, the ROC curves indicated that all models exhibited an AUROC exceeding 0.9 (Figure 1A), with the PR-AUC also exceeding 0.9 (Figure 1B). LR, RF, and GBM maintained Brier scores below 0.1, with all models maintaining accuracy, precision, recall, and F1 score above 0.80. The DCA showed that the threshold range of clinical benefit was greater for LR, XGBoost, and GBM than for DT and RF (Figure 1C). Calibration curves overlapped well with standard curves for all models (Figure 1D).

Figure 1 A series of curves for the test set. (A) ROC curves for each model in the test set. (B) PR curves for each model in the test set. (C) DCA of each model in the test set. (D) Calibration curves of each model in the test set. ROC, receiver operating characteristic; LR, logistic regression; RF, random forest; DT, decision tree; XGBoost, extreme gradient boosting; GBM, gradient boosting machine; LM, lung metastasis; PR, precision-recall; DCA, decision curve analysis.
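Decision curve analysis compares models by net benefit across threshold probabilities. The standard definition is sketched below in Python as an illustration; the study's own DCA implementation is not described, and the function names and toy data are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical labels (1 = LM) and predicted risks.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]

nb_model = net_benefit(y_true, y_prob, threshold=0.5)
nb_all = net_benefit_treat_all(y_true, threshold=0.5)
print(nb_model, nb_all)
```

A model offers clinical benefit at a given threshold when its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is zero. The "threshold range of clinical benefit" is the span of thresholds over which this holds.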

Validation set

In the validation set, all five models exhibited AUROC values exceeding 0.90, with GBM obtaining the highest AUROC of 0.982 (Figure 2A). Similarly, GBM had the highest PR-AUC of 0.7507 (Figure 2B). All models had a Brier score below 0.1. The accuracy, precision, recall, and F1 score of GBM were 0.974, 0.818, 0.818, and 0.818, respectively. In contrast, the accuracy and recall of LR, DT, and XGBoost remained above 0.8, but their precision (0.370–0.563) and F1 scores (0.526–0.690) were markedly lower. RF retained an accuracy of 0.910, but its precision (0.421), recall (0.727), and F1 score (0.533) were also markedly lower. The DCA showed that the threshold range of clinical benefit was greater for LR, XGBoost, and GBM than for DT and RF (Figure 2C). The GBM model calibration curve showed the greatest overlap with standard curves (mean absolute error =0.008) (Figure 2D).

Figure 2 A series of curves for the validation set. (A) ROC curves for each model in the validation set. (B) PR curves for each model in the validation set. (C) DCA of each model in the validation set. (D) Calibration curves of each model in the validation set. LR, logistic regression; RF, random forest; DT, decision tree; XGBoost, extreme gradient boosting; GBM, gradient boosting machine; ROC, receiver operating characteristic; LM, lung metastasis; PR, precision-recall; DCA, decision curve analysis.
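The two ranking metrics plotted in panels A and B can be computed directly from labels and predicted scores. The Python sketch below uses the pairwise (Mann-Whitney) form of AUROC and average precision as one common estimator of PR-AUC. The estimator actually used in the study is not stated, so this choice, the function names, and the toy data are assumptions.

```python
def auroc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive case."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Hypothetical labels (1 = LM) and scores.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
roc_value = auroc(y_true, scores)
ap_value = average_precision(y_true, scores)
print(roc_value, ap_value)
```

AUROC depends only on how positives rank relative to negatives, while average precision also penalizes every negative ranked above a positive. This is why the two metrics can diverge when positives are rare, as in the validation set.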

GBM model

Model parameters

During the model-building process, we tuned the parameters to optimize performance: the learning rate was set to 0.005 to ensure robust iterative learning of the model; the tree depth was fixed at four levels, which could capture complex interactions while preventing overfitting; the minimum number of observations per internal node was set to 3, balancing model complexity and generalization ability; 5,000 trees were constructed to leverage the potential of big data to enhance prediction accuracy; and the model was validated through five-fold cross-validation to ensure its stability and reliability in practical applications. By using the prediction function in R software, we could generate predictions for new data.
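The five-fold cross-validation mentioned above can be pictured with a small index-splitting sketch in Python. The text does not state which dataset was cross-validated; the example assumes the 493-case fused training set, and the function name and seed are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, validation

# Assumption: cross-validation over the 493 fused training cases.
splits = list(k_fold_indices(493, k=5))
for train, validation in splits:
    print(len(train), len(validation))
```

Each case is used for validation exactly once and for fitting in the other four folds, which gives an estimate of how stable the chosen hyperparameters are across resamples of the training data.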

Variable importance

In the GBM model, which showed the best diagnostic performance, the features ranked by importance from highest to lowest were tumor size, T stage, age, N stage, multifocality, gender, and histology (Figure 3).

Figure 3 Histogram of feature importance in the GBM model. The feature importance ranking of the GBM model includes tumor size, T stage, age, N stage, multifocality, gender, and histology. GBM, gradient boosting machine.

Discussion

In the assessment of feature importance within the GBM model, tumor size emerged as the leading risk factor for LM in DTC. This finding aligns with a previous report indicating that the risk of LM increases for patients with tumor sizes exceeding 4 cm (24). Therefore, precise measurement of thyroid nodule size using ultrasound technology offers valuable preoperative insights for assessing the risk of LM. In our study, T stage emerged as an independent risk factor for LM, a finding that is consistent with prior research conducted by Li et al. (25). Previous research indicates that lymph node metastasis is associated with a fourfold increase in the likelihood of LM (26). Similarly, our study confirmed the independent significance of lymph node metastasis for LM. Distant metastases have been reported to be more common in FTC than in PTC due to the propensity for peripheral and vascular invasion (27), which is in line with our findings. Although some pathological subtypes such as diffuse sclerosing papillary carcinoma and the tall cell variant have been revealed to be highly aggressive (28,29), the lack of pathologic subtype data in the SEER dataset precluded a subtype-specific analysis of LM risk. This warrants further study in the future. Age has often been reported to be an independent risk factor for distant metastasis in DTC (30). Similarly, we observed that the risk of LM increased with age. It has been noted that older patients with DTC tend to develop resistance to radioactive iodine therapy. With greater age, immune system function wanes, and overall mortality rate increases (30). Whether these factors contribute to the positive correlation between age and LM risk requires further study. Furthermore, analysis of the SEER data identified male gender as a risk factor for LM. This finding supports a previous study indicating that DTC tends to be more aggressive and prone to metastasis in men, although the overall incidence of DTC is reported to be lower in men than in women (31).

To further investigate the risk factors, we performed subgroup analyses stratified by age and gender. Notably, gender lost its independent status as a risk factor for LM among individuals aged 60 years and above, which aligns with the findings of Shobab et al. (31). One study suggested that women typically possess a more robust immune system, exhibiting stronger antibody, T-cell, and cytokine responses, which confers greater resistance to tumor aggressiveness. However, this advantage diminishes with advancing age (32). This may be attributable to immunosenescence, a process in which the immune system progressively weakens with age. Immunosenescence impairs both innate and adaptive immunity, reducing T-cell function, cytokine production, and overall immune surveillance, which may explain why the gender differences in LM risk are absent in the older adult population (33,34).

Among individuals younger than 40 years, histologic type did not emerge as an independent risk factor for LM, suggesting that there is no significant difference in the risk of LM occurrence between PTC and FTC. Although FTC is generally more aggressive than is PTC, a study showed that in children and young adults, PTC is more prone to invading lymphatic vessels, resulting in LM in 8% to 20% of patients (35). PTC’s lymphatic spread in younger individuals may stem from heightened lymphangiogenic factor expression and a responsive tumor microenvironment, fostering aggressiveness (36). Further research is needed to clarify if these factors underlie the observed lack of difference in LM risk between PTC and FTC in this age group. Interestingly, in the 40- to 59-year age group, we unexpectedly found no significant association between T stage and LM. This result challenges the conventional view that the depth of tumor invasion directly affects its aggressiveness. It is possible that in this age group, earlier detection and more aggressive medical interventions, such as prophylactic central neck dissection, may reduce the apparent influence of T stage on LM and thus mask the expected relationship. Moreover, hormonal shifts during this age period, such as the perimenopausal transition in women, might alter tumor biology and its interaction with the immune system (37), warranting further investigation into how these factors interact with LM.

In the gender subgroup analysis, we observed no major LM risk differences between sexes, except that tumor size was an independent risk factor in females but not in males. Extensive research has highlighted significant differences in sex hormone levels and gene expression profiles between men and women with DTC (38,39). These biological differences are likely to provide crucial insights into the different associations of tumor size and LM risk between the sexes. A previous study demonstrated that the expression of estrogen receptors in thyroid cancer cells significantly promotes cell proliferation and growth (40). Another analysis revealed that the expression of this receptor correlates with a more aggressive phenotype in DTC (41). Estrogen receptors are expressed at higher levels in women than in men, which may explain why tumor size is an independent risk factor for LM in women with DTC, while this association is not present in men.

Overall, RF, XGBoost, and GBM performed well on both the training set and the SEER test set, achieving satisfactory results across all metrics. However, in the validation set, only the GBM model maintained an accuracy, precision, recall, and F1 score above 0.8. Given the class imbalance in the external dataset, PR-AUC provides a more sensitive assessment of a model's ability to identify LM cases than AUROC, which is largely unaffected by the ratio of positive to negative instances (42). The GBM model achieved the highest PR-AUC, suggesting that it can more effectively distinguish between patients with DTC at high and low risk of LM in imbalanced data. Furthermore, the calibration curve showed that the predictions of the GBM model aligned closely with the actual outcomes, indicating good calibration. Taken together, GBM had the best performance among the models.
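The difference between the two metrics can be illustrated with a minimal Python sketch. It is not the analysis code used in this study: the labels and scores are made-up toy values, and `auroc` and `average_precision` are hypothetical helper names that implement the standard rank-based AUROC and a step-wise estimate of PR-AUC (average precision). Replicating the negative class tenfold leaves AUROC unchanged but lowers PR-AUC.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    true_pos = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            total += true_pos / rank  # precision at each recalled positive
    return total / n_pos


# Toy scores (assumed values, not study data): 3 positives, 5 negatives.
pos_scores = [0.9, 0.6, 0.4]
neg_scores = [0.8, 0.5, 0.3, 0.2, 0.1]


def build(neg_copies):
    """Return (labels, scores) with the negative class replicated neg_copies times."""
    negatives = neg_scores * neg_copies
    labels = [1] * len(pos_scores) + [0] * len(negatives)
    return labels, pos_scores + negatives


balanced = build(1)
imbalanced = build(10)

print(f"AUROC   balanced={auroc(*balanced):.3f}  imbalanced={auroc(*imbalanced):.3f}")
# AUROC   balanced=0.800  imbalanced=0.800
print(f"PR-AUC  balanced={average_precision(*balanced):.3f}  "
      f"imbalanced={average_precision(*imbalanced):.3f}")
# PR-AUC  balanced=0.756  imbalanced=0.432
```

Because precision counts false positives against the small number of true positives, it falls as the negative class grows, whereas AUROC depends only on how positives rank relative to negatives.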

Nonetheless, each model has its own strengths and weaknesses: LR is intuitive and efficient but limited in handling nonlinear relationships; RF offers high accuracy and robustness to noise but can overfit and is computationally demanding; DT is easy to interpret but prone to overfitting and instability; and XGBoost is efficient but requires intricate parameter tuning and is susceptible to overfitting in small samples. In contrast, GBM combines ensemble learning with a stepwise optimization mechanism to approximate optimal solutions. With its flexible parameter configuration, GBM can achieve high predictive accuracy on complex data. In this study, GBM excelled on several indicators, demonstrating high accuracy and clinical utility. Among patients with DTC, it can help physicians identify groups at high risk of LM in advance and promote early intervention, which may improve patient prognosis.
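The stepwise mechanism behind gradient boosting can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions and not the GBM implementation or parameters used in this study: it uses squared-error loss, one-split regression stumps on a single feature, and made-up toy data; the names `fit_stump`, `fit_gbm`, `predict_gbm`, and `mse` are hypothetical. Each round fits a stump to the current residuals and adds a shrunken copy of it to the ensemble.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error on the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue  # a split must leave samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return t, left_mean, right_mean


def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    return left_mean if x <= t else right_mean


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boost stumps on residuals (the negative gradient of squared-error loss)."""
    base = sum(ys) / len(ys)  # initial prediction: the mean outcome
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict_gbm(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict_gbm(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data (assumed values, not study data): one feature and a binary outcome.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

for rounds in (0, 1, 50):
    model = fit_gbm(xs, ys, n_rounds=rounds)
    print(rounds, round(mse(model, xs, ys), 4))
```

The training error shrinks as rounds are added, which is the gradual optimization the text refers to; in practice the learning rate and number of rounds are tuned on held-out data to limit overfitting.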

The SEER dataset predominantly comprises patients from the United States, who may differ from the Chinese population in demographics, lifestyle, and genetics, and these differences may affect disease patterns and progression. To enhance model generalization, we merged the SEER database with our hospital's data to form a new training set. This integration not only broadened the training sample but also validated the effectiveness of data fusion in boosting model performance.

Certain limitations to this study should be acknowledged. First, the sample size was constrained, and we relied solely on common variables across both databases, restricting model complexity and predictive strength. Second, certain predictive factors significant in the United States (such as race/ethnicity) may be less relevant in China due to its homogeneous population, limiting their predictive value. We will expand our sample size and gather data from a wider array of regions, thereby enriching our data sources and further substantiating the generalization capability of our models.

Conclusions

Age, tumor size, T stage, N stage, gender, and histologic type were identified as independent risk factors for LM in DTC. The subgroup analyses indicate that the independent risk factors for LM differ across age groups and genders. The GBM model performed well on different datasets and showed high value in predicting LM risk for patients with DTC. This model can help clinicians formulate personalized treatment plans and improve patient outcomes. In the future, we will further validate our findings using multicenter, large-sample data.

Acknowledgments

Funding: This work was supported by the Natural Science Foundation of Fujian Province (No. 2022J011478), the National Natural Science Foundation Project (No. 82273523), and the National High-Level Hospital Clinical Research Funding (Nos. 2022-NHLHCRF-LX-02-03 and 2023-NHLHCRF-YXHZ-ZRZD-06).

Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://gs.amegroups.com/article/view/10.21037/gs-24-481/rc

Data Sharing Statement: Available at https://gs.amegroups.com/article/view/10.21037/gs-24-481/dss

Peer Review File: Available at https://gs.amegroups.com/article/view/10.21037/gs-24-481/prf

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://gs.amegroups.com/article/view/10.21037/gs-24-481/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This retrospective study was approved by the research ethics committee of Zhangzhou Municipal Hospital Affiliated to Fujian Medical University (No. 2022KYB138). The requirement for individual consent was waived due to the retrospective nature of the analysis. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

1. Choi SM, Kim DG, Lee JE, et al. Thyroid lobectomy is sufficient for differentiated thyroid cancer with upgraded risk after surgery. Gland Surg 2022;11:1451-63.
2. Yoon JH, Jeon MJ, Kim M, et al. Unusual metastases from differentiated thyroid cancers: A multicenter study in Korea. PLoS One 2020;15:e0238207.
3. Qiu ZL, Shen CT, Sun ZK, et al. Lung Metastases From Papillary Thyroid Cancer With Persistently Negative Thyroglobulin and Elevated Thyroglobulin Antibody Levels During Radioactive Iodine Treatment and Follow-Up: Long-Term Outcomes and Prognostic Indicators. Front Endocrinol (Lausanne) 2019;10:903.
4. Lee JS, Lee JS, Yun HJ, et al. Prognosis of Anaplastic Thyroid Cancer with Distant Metastasis. Cancers (Basel) 2022;14:5784.
5. Yang J, Liang M, Jia Y, et al. Therapeutic response and long-term outcome of differentiated thyroid cancer with pulmonary metastases treated by radioiodine therapy. Oncotarget 2017;8:92715-26.
6. Chen P, Feng HJ, Ouyang W, et al. Risk factors for nonremission and progression-free survival after I-131 therapy in patients with lung metastasis from differentiated thyroid cancer: a single-institute, retrospective analysis in southern China. Endocr Pract 2016;22:1048-56.
7. Li Y, Zhang H, Cao Y, et al. Establishment and verification of the first prognostic nomograms in locally advanced thyroid cancer based on the analysis of clinical and follow-up information on 2396 patients. Heliyon 2024;10:e24798.
8. Li Y, Gao X, Guo T, et al. Development and validation of nomograms for predicting the risk of central lymph node metastasis of solitary papillary thyroid carcinoma of the isthmus. J Cancer Res Clin Oncol 2023;149:14853-68.
9. Baloch ZW, Asa SL, Barletta JA, et al. Overview of the 2022 WHO Classification of Thyroid Neoplasms. Endocr Pathol 2022;33:27-63.
10. Moons KG, Kengne AP, Woodward M, et al. Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio)marker. Heart 2012;98:683-90.
11. Cai D, Wu S. Efficacy of logistic regression model based on multiparametric ultrasound in assessment of cervical lymphadenopathy - a retrospective study. Dentomaxillofac Radiol 2022;51:20210308.
12. Rajkomar A, Dean J, Kohane I. Machine Learning in Medicine. N Engl J Med 2019;380:1347-58.
13. Song X, Liu X, Liu F, et al. Comparison of machine learning and logistic regression models in predicting acute kidney injury: A systematic review and meta-analysis. Int J Med Inform 2021;151:104484.
14. Borzooei S, Briganti G, Golparian M, et al. Machine learning for risk stratification of thyroid cancer patients: a 15-year cohort study. Eur Arch Otorhinolaryngol 2024;281:2095-104.
15. Li Q, Wang Y, Chen J, et al. Machine learning based androgen receptor regulatory gene-related random forest survival model for precise treatment decision in prostate cancer. Heliyon 2024;10:e37256.
16. Che WQ, Li YJ, Tsang CK, et al. How to use the Surveillance, Epidemiology, and End Results (SEER) data: research design and methodology. Mil Med Res 2023;10:50.
17. Bi J, Zhang H. Nomogram predicts risk and prognostic factors for lung metastasis of anaplastic thyroid carcinoma: a retrospective study in the Surveillance Epidemiology and End Results (SEER) database. Transl Cancer Res 2023;12:3547-64.
18. Qiao L, Li H, Wang Z, et al. Machine learning based on SEER database to predict distant metastasis of thyroid cancer. Endocrine 2024;84:1040-50.
19. Liu W, Wang S, Ye Z, et al. Prediction of lung metastases in thyroid cancer using machine learning based on SEER database. Cancer Med 2022;11:2503-15.
20. Collins GS, Reitsma JB, Altman DG, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 2015;350:g7594.
21. Moons KG, Altman DG, Reitsma JB, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med 2015;162:W1-73.
22. Rajput D, Wang WJ, Chen CC. Evaluation of a decided sample size in machine learning applications. BMC Bioinformatics 2023;24:48.
23. Koivu A, Sairanen M, Airola A, et al. Synthetic minority oversampling of vital statistics data with generative adversarial networks. J Am Med Inform Assoc 2020;27:1667-74.
24. Vuong HG, Duong UNP, Pham TQ, et al. Clinicopathological Risk Factors for Distant Metastasis in Differentiated Thyroid Carcinoma: A Meta-analysis. World J Surg 2018;42:1005-17.
25. Li M, Trivedi N, Dai C, et al. Does T stage affect prognosis in patients with stage IV B differentiated thyroid cancer? Endocr Pract 2019;25:877-86.
26. Li Y, Gao X, Guo T, et al. Development and validation of a nomogram for risk of pulmonary metastasis in non-papillary thyroid carcinoma: A SEER-based study. Medicine (Baltimore) 2023;102:e34581.
27. Zhang T, He L, Wang Z, et al. Risk factors for death of follicular thyroid carcinoma: a systematic review and meta-analysis. Endocrine 2023;82:457-66.
28. Chereau N, Giudicelli X, Pattou F, et al. Diffuse Sclerosing Variant of Papillary Thyroid Carcinoma Is Associated With Aggressive Histopathological Features and a Poor Outcome: Results of a Large Multicentric Study. J Clin Endocrinol Metab 2016;101:4603-10.
29. Nath MC, Erickson LA. Aggressive Variants of Papillary Thyroid Carcinoma: Hobnail, Tall Cell, Columnar, and Solid. Adv Anat Pathol 2018;25:172-9.
30. Huang X, Xia Q, Huang Y, et al. Age increased the cancer-specific mortality risk of thyroid cancer with lung metastasis. Clin Endocrinol (Oxf) 2022;96:719-27.
31. Shobab L, Burman KD, Wartofsky L. Sex Differences in Differentiated Thyroid Cancer. Thyroid 2022;32:224-35.
32. Klein SL, Flanagan KL. Sex differences in immune responses. Nat Rev Immunol 2016;16:626-38.
33. Lian J, Yue Y, Yu W, et al. Immunosenescence: a key player in cancer development. J Hematol Oncol 2020;13:151.
34. Liu Z, Liang Q, Ren Y, et al. Immunosenescence: molecular mechanisms and diseases. Signal Transduct Target Ther 2023;8:200.
35. Bauer AJ. Papillary and Follicular Thyroid Cancer in children and adolescents: Current approach and future directions. Semin Pediatr Surg 2020;29:150920.
36. Skuletic V, Radosavljevic GD, Pantic J, et al. Angiogenic and lymphangiogenic profiles in histological variants of papillary thyroid carcinoma. Pol Arch Intern Med 2017;127:429-37.
37. Hoffmann JP, Liu JA, Seddu K, et al. Sex hormone signaling and regulation of immune function. Immunity 2023;56:2472-91.
38. LeClair K, Bell KJL, Furuya-Kanamori L, et al. Evaluation of Gender Inequity in Thyroid Cancer Diagnosis: Differences by Sex in US Thyroid Cancer Incidence Compared With a Meta-analysis of Subclinical Thyroid Cancer Rates at Autopsy. JAMA Intern Med 2021;181:1351-8.
39. Zahedi A, Bondaz L, Rajaraman M, et al. Risk for Thyroid Cancer Recurrence Is Higher in Men Than in Women Independent of Disease Stage at Presentation. Thyroid 2020;30:871-7.
40. Kabat GC, Kim MY, Wactawski-Wende J, et al. Menstrual and reproductive factors, exogenous hormone use, and risk of thyroid carcinoma in postmenopausal women. Cancer Causes Control 2012;23:2031-40.
41. Derwahl M, Nicula D. Estrogen and its role in thyroid cancer. Endocr Relat Cancer 2014;21:T273-83.
42. Liu S, Roemer F, Ge Y, et al. Comparison of evaluation metrics of deep learning for imbalanced imaging data in osteoarthritis studies. Osteoarthritis Cartilage 2023;31:1242-8.

(English Language Editor: J. Gray)

Cite this article as: Shen H, Yang C, Wang Y, Liao J, Zuo X, Zhang B, Yang X. Development and validation of machine learning models for predicting lung metastasis risk in differentiated thyroid cancer based on two databases. Gland Surg 2024;13(11):2174-2188. doi: 10.21037/gs-24-481
