Introduction

Coronavirus disease 2019 (COVID-19) is caused by severe acute respiratory syndrome coronavirus 2, an enveloped single-strand-RNA β-coronavirus, from the family Coronaviridae1,2. The virus was initially named 2019 novel coronavirus, after isolation from patients with viral pneumonia in Wuhan China in late December 20193. Poor outcomes have been associated with multiple host factors, including age, sex, comorbidities, laboratory markers and lack of vaccination1,4,5. Higher mortality was also previously associated with the delta variant6,7,8. This variant originated in October 2021 in India and has three clades, 21A, 21I and 21J, with latter dominating across continents9. More recent work suggests delta variant was not more fatal than pre-delta variants4, but more fatal than omicron variant5.

However, World Health Organization describes a wide range of determinants of health, encompassing aspects of our social, economic, and physical environments, in addition to personal characteristics and behaviors10. It is therefore plausible that in addition to individual risks, country-level factors, including demographic-, socioeconomic-, and environment-related health parameters, may play an important role in COVID-19 incidence and subsequent mortality.

Previous authors have reported on COVID-19 mortality risk factors within geographic regions11,12,13,14, or during defined pandemic phases e.g., the first wave12,14. However, other authors have reported worldwide country-level COVID-19 mortality risk factors using data close to time of publication. Implicated risk factors include general pre-pandemic variables comprising country-level demographics (population ≥ 60 years15); socioeconomic and governance factors e.g., higher GDP per capita16,17, higher income disparity16, higher transport infrastructure quality and lower government effectiveness18; national healthcare metrics including lower general health expenditure, lower infectious disease system responsiveness and greater accountability19; and population health characteristics like higher prevalence of obesity15,16, chronic obstructive pulmonary disease, Alzheimer’s disease, and depression17. Reported risk factors also comprise pandemic-specific variables. These include pandemic features e.g., duration15 and case load19; pandemic-associated public health policies inclusive of delayed international travel restrictions15 and lower testing18; and pandemic-associated public behaviours, e.g., shorter duration of mask wearing15. In addition, prior work also explored worldwide country-level COVID-19 mortality with a focus on select predictors, for example health system parameters19,20.

However, to the best of my knowledge, no report has tested potential country-level predictors of COVID-19 mortality using a wide range of potential predictors, unrestricted geographic scope, and late 2022 COVID-19 mortality data. This study therefore aimed to identify country-level independent predictors of COVID-19 mortality after controlling for a diverse range of potential factors utilizing current worldwide mortality data.

Results

Study data

There were no duplicates or invalid entries. Number of countries with missing data ranged from 0 (Country) to 46 (Delta21Jp100K). One hundred and fifty-two countries had complete data for all variables. The study data therefore comprised 152 cases (countries) and 41 variables, used for all analyses.

Descriptive statistics are presented in Table 1. The dataset comprised 38, 26, 18, 48, 8, and 14 countries within the Africa, Americas, Eastern Mediterranean, Europe, South-East Asia, and Western Pacific UN-regions, respectively. COVID-19 deaths/100,000 population ranged widely between 0.13 and 660.38. Cases/100,000 population also ranged widely from 39.12 to 70,445.77. Mean per capita deaths/cases were 26.7/2300, 204/13,531, 73.3/9334, 251/33,367, 47/6971 and 44.2/18,842, within the Africa, Americas, Eastern Mediterranean, Europe, South-East Asia, and Western Pacific UN-regions, respectively, with Africa reporting the lowest per capita cases and deaths.

Table 1 Descriptive statistics.

Data quality checks revealed most variables contained recent data (2018–2022), as seen with GDPpC_Yr, IGSpGDP_Yr, HEpGDP_Yr, and pPopH20_Yr, with minimum year of 2014, 2004, 2011, and 2007 respectively. Datapoints earlier than 2018 occurred in 1, 5, 2, and 3 cases for GDPpC, IGSpGDP, HEpGDP, and pPopH20, respectively. Thirteen variables contained outliers, but these were relatively few for most variables. Quality checks also identified a few additional datapoints of potential concern, including within the AdLit and HBp100K variables, with a solitary minimal value of 0, for Chad and Mali, respectively. However, these were not outliers as confirmed with the boxplot function (graphics package21), and therefore not removed. Also, no outliers were removed as they were judged to represent valid data after visualization with plot function (base package22).

Correlation analysis

Correlations between COVID-19 mortality and continuous variables of interest ranged between − 0.58 and 0.78, with all 19 correlations being statistically significant, and the strongest correlation being with COVID-19 cases per capita (Table 2). Among the 19 associations, there were 1, 7, and 10 small, medium, and large positive correlations, as well as 1 large negative correlation (Table 2, Supplementary Fig. S1).

Table 2 Summary of correlation analysis.

ANOVA analysis/Welch’s Heteroscedastic F Test

Summary of the ANOVA and Welch’s Heteroscedastic F Test results are presented in Table 3. Mean COVID-19 deaths did not differ across PoeMgt groups. However, there was a statistically significant difference across WHO_Region groups. Pairwise comparisons demonstrated significant differences (two-sided) between several pairs (Supplementary Table S1).

Table 3 Summary of ANOVA/Welch’s Heteroscedastic F Test.

Multivariate analysis

The GAMs analyzed included 6 limited models (geographic, demographic, socioeconomic, health metric, population health, and pandemic-related) and one full model. As shown in Table 4, WHO_Region and s(AvgTemp), s(pPop65) and s(AdLit), s(pWfUem) and s(CPI), s(HEpGDP) and s(MDp100K), s(pPopH20), and s(CovCp100K), were identified as independent predictors of COVID-19 mortality for the gam.MODEL_1Geo, gam.MODEL_2Dem, gam.MODEL_3SoEc_mod, gam.MODEL_4Heal_mod, gam.MODEL_5PopH, and gam.MODEL_6Pand models respectively. For the gam.MODEL_Full_mod model, independent predictors were WHO_Region, s(pPop65), s(CPI), s(HBp100K), and s(CovCp100K), accounting for 80.7% of variance. All models demonstrated practical significance with at least medium effect size: R2 > 13% or Cohen’s f2 > 0.1523. Models fitted with standardized independent variables produced similar results. Backward elimination using the R2 criterion, identified gam.MODEL_Full_mod as most parsimonious, without further modification (not shown).

Table 4 Independent predictors of COVID-19 mortality.

Mortality varied across regions with highest mean deaths in Europe and the Americas. Initial analysis revealed that mortality was significantly greater in the Americas compared with Africa, with an average increase of 66.1 deaths/100,000 population. This analysis also suggested that the Americas and to a lesser extent Eastern Mediterranean and Europe appeared to stand out from the other regions. Additional models were therefore run with these WHO regions as reference. With the Americas as reference group, in addition to Africa, mortality was significantly greater in the Americas compared with South-East Asia and Western Pacific, with an average increase of 60.4 and 80.4 deaths/100,000 population respectively. With Eastern Mediterranean as reference group, mortality was significantly greater in this region compared with Western Pacific, with an average increase of 50.8 deaths/100,000 population. Finally, with Europe as reference group, mortality was significantly greater in Europe compared with Western Pacific, with an average increase of 49.6 deaths/100,000 population.

Partial effects

Partial effects plots assessed impact of individual predictors. The gam.MODEL_1Geo model demonstrated greatest COVID-19 mortality at average temperature between approximately 10 °C to 15 °C (Supplementary Fig. S2). In the gam.MODEL_2Dem model, mortality increased with percent of population ≥ 65 up to approximately 20% then tapered off, but was initially unchanged with increasingly percent of adult literacy, before increasing above approximately 70% (Supplementary Fig. S3). Among socioeconomic variables, deaths increased logarithmically with Corruption Perception Index (CPI), but increased with greater percent unemployment up to approximately 15% before leveling off (Supplementary Fig. S4). COVID-19 death increased progressively with health expenditure relative to GDP and doctors/100,000, as shown in gam.MODEL_4Heal_mod model (Supplementary Fig. S5). In the gam.MODEL_5PopH and gam.MODEL_6Pand models, deaths increased exponentially with percent of population using at least basic drinking water services (Supplementary Fig. S6), but increased logarithmically with COVID-19 Cases/100,000 population (Supplementary Fig. S7), respectively.

Including all eligible variables in the gam.MODEL_Full_mod model, demonstrated that COVID-19 deaths increased progressively with greater percent population ≥ 65 and up to approximately 30,000 cases/100,000, but decreased progressively above a CPI of approximately 50 and above approximately 50 hospital beds/100,000 population. Assessment of the trend changes for these independent predictors comparing the modified full model with the corresponding smooth term versus the linear term, using the F-test25, demonstrated significant differences (1-sided) for CPI (F = 3.3735, p = 0.03838) and CovCp100K (F = 5.7823, p = 0.003924). The partial effects plots also suggested deaths increased progressively with increasing COVID-19 Delta 21J sequence count/100,000 tests, and decreased progressively above life expectancy at birth (LEB) of approximately 75. However, these latter trends were not statistically significant (Fig. 1).

Figure 1
figure 1

Partial effects plots for gam.MODEL_Full_mod model. (a) AvgTemp, (b) pPop65, (c) pPopUrb, (d) AdLit, (e) pWfUem, (f) CPI, (g) HEpGDP, (h) LEB, (i) HBp100K, (j) CovCp100K, (k) Delta21Jp100K, and (l) VacFullp100. Plotted with the shift argument to shift the scale based on the intercept value, for a more natural interpretation28. The smooth function on y-axis therefore represents the partial effect of the independent variable on COVID-19 mortality (presented as the independent variable, with effective degrees of freedom). Shading reflects 95% confidence interval for the mean shape of the effect28. AvgTemp = average temperature (2021); pPop65 =  ≥ 65 years of age (percent of population); pPopUrb = urban population (percent of population); AdLit = adult literacy rate (percent ≥ 15 years); pWfUem = unemployment (percent of workforce); CPI = Corruption Perception Index; HEpGDP = health expenditure (% of GDP); LEB = life expectancy at birth; HBp100K = hospital beds/100,000 population; CovCp100K = COVID-19 Cases/100,000 population; Delta21Jp100K = COVID-19 Delta 21J sequence count/100,000 tests; VacFullp100 = number of persons fully vaccinated/100 population.

Differences in continuous independent predictors across who regions

As with COVID-19 deaths/100,000 population, WHO regions also differed significantly in percent of population ≥ 65 years of age (Welch's Heteroscedastic F Test statistic = 71.86161, dfnum = 5, dfdenom = 38.79369, p < 0.001), CPI (Welch's Heteroscedastic F Test statistic = 11.31819, dfnum = 5, dfdenom = 45.55676, p < 0.001), HBp100K (Welch's Heteroscedastic F Test statistic = 26.15499, dfnum = 5, dfdenom = 40.21732, p < 0.001), and COVID-19 cases/100,000 population (Welch's Heteroscedastic F Test statistic = 28.97315, dfnum = 5, dfdenom = 37.22894, p < 0.001). For all five variables tested across WHO regions, Africa had the lowest mean, while Europe had the highest. Pairwise comparisons demonstrated significant differences (two-sided) between several WHO region pairs for these variables: always including Africa vs. Europe and South-East Asia vs. Europe, as well as Africa vs. Americas for cases, deaths, and percent of population ≥ 65 years (Supplementary Tables  S1S5).

Linear modelling

Multiple linear regression identified similar independent predictors for the limited models, except the MODEL_1Geo model where only WHO_Region was predictive. However, in the corresponding full linear model (MODEL_Full_mod), COVID-19 mortality independent predictors were WHO_Region, pPop65, CPI, and AdLit, accounting for 63.2% of variance. Neither HBp100K nor CovCp100K were predictive (not shown).

Discussion

This study identified WHO region, percent of population ≥ 65 years, CPI, hospital beds/100,000 population, and COVID-19 Cases/100,000 population as independent predictors of COVID-19 mortality, accounting for 80.7% of variance. Mortality varied across regions with highest mean deaths in Europe and the Americas. The partial effects plots further demonstrated that COVID-19 deaths increased progressively with greater percent of population ≥ 65 and up to approximately 30,000 cases/100,000, but decreased progressively above a CPI of approximately 50 and above approximately 50 hospital beds/100,000 population.

Mortality was significantly greater in the Americas, compared with Africa. Caseload was lowest for Africa in the dataset and consistent with WHO’s confirmation of under-reporting in this region29. However, the low African mortality was determined after controlling for eligible variables including caseload. Potential contributors to the low African mortality include younger age structure with associated reduced comorbidities, genetic factors including decreased response to angiotensin-converting enzyme inhibitors, natural selection conferring protection, trained immunity-based herd immunity, lower life expectancy, and low seeding rate due to lower air traffic to the continent30,31,32,33. Additional analyses revealed that mortality was also significantly greater in the Americas compared with South-East Asia and Western Pacific, in Eastern Mediterranean compared with Western Pacific, and in Europe compared with Western Pacific region. Various factors have been suggested for the low Western Pacific mortality including prior investment in pandemic preparation, as well as rapid and stringent public health responses, such as aggressive testing and early case management34,35. The causes of the lower Africa, South-East Asia, and Western Pacific mortality warrant further study.

Progressively greater COVID-19 mortality with increasing percent of population ≥ 65 is not surprising given previous similar findings for proportion of population ≥ 6015, and can be explained by multiple factors. These include atypical presentation of respiratory infections with associated delayed intervention and polypharmacy, age-related altered immune response, increased presence of multiple comorbidities, and polypharmacy-associated enhanced susceptibility to viral infections36. This finding is also consistent with the younger age-structure in Africa, contrasting that in Europe and the Americas. The initial enhanced mortality with increasing caseload also appears logical. It is less clear why mortality levels off beyond approximately 30,000 cases/100,000. Possible explanations include increasing competence37,38, or resource allocation39, in settings with high caseloads, offsetting the heightened burden.

There was progressively decreasing mortality above a CPI of approximately 50. CPI is a composite index derived from studies and expert surveys, published annually by Transparency International, and measures perceived public sector corruption. The index ranges from 1 to 100, with 100 indicating the lowest level of perceived corruption40, and is strongly correlated with other measures of corruption41. It is possible that higher levels of corruption could negatively impact reporting, and recent work has shown that high CPI was associated with increased daily reported COVID-19 cases and deaths. However, this analysis was restricted to data for the initial 120 days from first confirmed case42, and could be reflective of the early pandemic response. In contrast, the present CPI finding is consistent with findings of a significant negative association between CPI and poor health outcomes, and a positive association between health-sector corruption specifically and chronic disease, using data covering the period 2004–201543. It is also consistent with evidence corruption undermines various aspects of healthcare system performance, including efficiency44,45. However, efficiency is not guaranteed by abundance of healthcare system inputs including health expenditure46, which may help explain why countries with comparable health expenditure differ with respect to important health outcomes including LEB and infant mortality rate47. This lack of congruence is consistent with the present finding that health expenditure was not an independent predictor of COVID-19 mortality in the modified full model.

The decreased mortality above approximately 50 hospital beds/100,000 population is also not consistent with some prior reports. An early study, using global October 2020 data, found no significant association between beds/100,000 and COVID-19 deaths20, a result that could at least partly be due to the early data. A more recent study found increased mortality in Italian regions with higher beds per capita, after adjusting for percentage of population ≥ 65/LEB/aging index, health expenditure per capita, general practitioners per 1000, and number of long-term care facilities48. However, as these authors suggest, regions with higher beds per capita are more centralized, and will likely attract higher caseloads and hence mortality. This is supported by the observation that hospital beds/100,000 population was lost as an independent predictor if percentage of population ≥ 65 and cases per capita were removed from the current paper’s modified full model. The current findings are also consistent with a USA report that regions with more general medicine/surgical beds per COVID-19 case had significantly lower COVID-19 mortality39. Populations served by less than 50 hospital beds/100,000 may therefore be at risk.

The study results suggest some important implications. They highlight the complex nature of the relationships under investigation. For example, even though average temperature, adult literacy rate, health expenditure, doctors per capita, and percent of population using at least basic drinking water services, were all highly correlated with mortality and identified as independent predictors in their respective limited models, these relationships were lost after controlling for all eligible variables in the full model. Likewise, the present study found that Africa had the lowest cases, deaths, and CPI among WHO regions. Low African caseload was consistent with the low mortality seen in both the relevant limited model and the full model. However, although there was a positive medium correlation between CPI and mortality and progressively greater mortality with increasing CPI in the relevant limited model, after adjusting for other variables in the full model, low CPI was associated with high COVID-19 mortality. This implies the low African CPI does not adequately explain the low COVID-19 mortality seen in Africa. Also, full vaccination per capita, moderately correlated with COVID-19 deaths, was not a predictor of mortality in the full model. Previous USA49 and global50 analyses suggested that vaccination reduces mortality. However, the USA analysis considered any level of vaccination, controlled for county population size, social vulnerability index, and mobility changes, and assessed data from the alpha/delta phase of the pandemic49, during which vaccinations may have been more impactful, considering the lower post-delta mortality risk5. Similarly, although the global analysis included a diverse range of covariates, the data was from late 2021/early 202250, in proximity to the delta wave4. Therefore, the timing of the current dataset may partly explain why Delta 21J sequence count/100,000 tests was not identified as an independent predictor of COVID-19 mortality. These findings suggest the need for future work to determine the temporal relationships between COVID-19 mortality and potential predictors. The results also imply that a substantial portion of COVID-19 mortality risk originates from factors beyond the control of individuals. Accordingly, the WHO’s Sustainable Development Goal 3, “Good Health and Well-Being”51, arguably represents a justifiable mandate for countries to assume substantial responsible for the welfare of their citizens. However, further work also seems prudent to assess the impact of other potential predictors relevant to personal responsibility.

Among the study’s strengths, the dataset comprised complete data on 152 countries, with representation from all UN-defined regions. A comprehensive list of variables was also included in the models. This was important, as reported COVID-19 deaths may also depend on extraneous factors including population demographics, governance, and health system capacity. Including such variables therefore facilitated controlling for diverse factors. Additionally, the methodology utilized generalized additive models, allowing for analysis and visualization of complex, non-linear relationships. Regarding analysis, use of GAMs probably explains why HBp100K and CovCp100K were identified as independent predictors in the gam.MODEL_Full_mod model, but not the linear MODEL_Full_mod model, based on their clearly non-linear partial effects plots. Regarding visualization, comparing gam.MODEL_Full_mod with corresponding smooth term versus linear term for CPI and CovCp100K, demonstrated significant differences, implying non-linear trends, supporting utility of visualizing trend changes with GAM-based partial effects plots.

There were also some limitations. Firstly, the cross-sectional design prevents causal assumptions. In addition, the database used was dependent on available sources. It is possible that some sources under-reported COVID-19 cases and deaths, as suggested for Africa29, as well as other variables. The results must therefore be interpreted accordingly. Further, data quality could have varied between different sources. However, the overall data quality was generally good, with most data being recent (2018–2022), with only a solitary datapoint in two variables appearing potentially questionable.

In conclusion, COVID-19 mortality varied across regions with highest mean deaths in Europe and the Americas. Mortality increased progressively with increasing population ≥ 65, as well as with caseload up to ~ 30,000/100,000 population. Finally, mortality decreased progressively at high CPI and high hospital beds per capita. These findings suggest areas for focused intervention in the event of similar future public health emergencies, including prioritization of the elderly, optimizing healthcare capacity, and improving deficient health sector-related governance.

Methods

Study design and sample size calculation

This was a cross-sectional study. The required sample size was calculated based on the Raosoft online sample size calculator at http://www.raosoft.com/samplesize.html. Assuming a population of approximately 200 United Nation-defined countries, the minimum required sample size to achieve a 5% margin of error at the 95% confidence level was 132 countries.

Raw data

Data was obtained from several sources (Table 5), including two datasets from the World Health Organization (WHO); one from Trading Economics; nine from the World Bank; and one each from the Nuclear Threat Initiative/Johns Hopkins Center for Health Security/Economist Impact (NTI/JHCHS/EI), CoVariants, and Worldometer. Country, WHO region, as well as COVID-19 cases and deaths were from 16th December 2022. COVID-19 Delta 21J sequence count per country (based on data made available by GISAID: the Global initiative on sharing all influenza data52,53) was from 15th December 2022. The per capita vaccination data was from 13th December 2022, and average temperature data was for 2021. Other variables were obtained for the most recent year for each country. As a result, for each of these variables, the year of collection varied between countries, which was collected as an associated “_Yr” variable (Table 1). All variables were numeric, except for WHO region and point of entry management (PoeMgt). There were six WHO regions: Africa, Americas, Eastern Mediterranean, Europe, South-East Asia, and Western Pacific. PoeMgt was measured with 3 groups: no plan; plan between public health system and border control authorities to identify international cases, and trace and quarantine contacts in response to active public health emergencies; and plan between public health system and border control authorities to identify international cases, and trace and quarantine contacts to prepare for future public health emergencies27. All variables were used as obtained from source, except for Delta 21J sequence count/100,000 tests, which was computed based on sequence counts per country from CoVariants54 and total tests per country from Worldometer55. Variables of interest were extracted from the original datasets and manually merged based on the “Country” identifier, in Microsoft Excel comma separated values (.csv) format.

Table 5 Data description, sources, and access.

Preprocessing

Raw data was read into R software with the base22 and haven68 packages, and comprised 221 cases (countries) with 41 variables, including country name, COVID-19 deaths, WHO region, average temperature, COVID-19 cases, Delta 21J sequence counts per country, total tests per country, Delta 21J sequence count/100,000 tests, and per capita vaccination, in addition to 16 other potential independent variables and their associated year (Table 5). The categorical variable PoeMgt, originally coded as 0, 1, and 2, but normalized to 0, 50, and 100 respectively, to make them directly comparable with other indicators27, was re-coded (mutate function from dplyr package69) to reflect the underlying categories as defined in the source documentation27. Data was screened for duplicates (distinct function from dplyr package69), invalid entries (empty rows or columns), and missing data (is.na function from R base package22). Countries with incomplete data were removed with the complete.cases function from the stats package70, and lack of missing values subsequently confirmed. Outliers, defined as datapoints > Q3 + 3 × IQR or < Q1 – 3 × IQR, were then detected with the rstatix package71. Finally, the e1071 package72 was used to compute descriptive statistics.

Statistical analysis

Due to non-linearity and outliers among continuous bivariate models, Spearman’s correlation (correlation package73) was used to examine the associations between COVID-19 mortality and continuous variables. The relationship with PoeMgt was tested with ANOVA (stats package70), but with WHO_Region using Welch’s Heteroscedastic F Test (onewaytests package74), because of heteroscedasticity between variable groups. Based on the presence of non-linear heteroscedastic multivariate models, country-level independent predictors of COVID-19 mortality were identified by weighted generalized additive models (GAMs) using the gam function (mgcv package:75). Due to cone-shaped residual plots, inverse error variances were applied as weights, as previously described for weighted least squares76,77. Briefly, model residuals were regressed on model fitted values, and weights estimated as the inverse of the squared extracted fitted values. GAMs apply smooth functions to continuous independent variables that capture non-linear aspects of non-linear relationships, with each flexible smooth comprised of smaller basis functions that model a portion of the relationship78. Adequacy of basis functions (model complexity) and concurvity were tested for all models. GAMs were fitted without adjusting model complexity (k-value), using a smoothing parameter of 0.0001 to minimize risk of overfitting78, and after removal of model variables with high partial concurvity (Table 5) defined as > 0.8 between a variable pair28. Non-significant or less significant high-concurvity variables, were removed first. Limited models were initially fitted for groups of related variables (Table 5), and then a final (full) model for all variables that did not violate the concurvity limit. Each model was evaluated with Cohen’s R2 and f2 to identify practically significant models defined as at least medium effect size: Cohen’s R2 > 13% or f2 > 0.1523. The impact of individual variables was then assessed with partial effects plots. All seven models were fitted with multiple linear regression, fitted with standardized independent variables, and the final model was run with various WHO regions as reference, for comparison. Trend changes for identified independent predictors were further assessed, by comparing the full model with the corresponding smooth term versus the linear term, using the F-test25. Differences in continuous independent predictors across WHO regions was also tested with Welch’s Heteroscedastic F Test (onewaytests package74. Finally, the full model was tested by backward elimination (R2 criterion) to determine the most parsimonious model. All data preprocessing, statistical analysis, and data visualization were performed with R, version 4.2.1 (The R Foundation for Statistical Computing, 2022). A p value of < 0.05 was considered statistically significant. However, for multiple comparisons, p-values were adjusted by Bonferroni correction.