07/03/2026
The α–regression for compositional data: a unified framework for standard, spatially-lagged, spatial autoregressive and geographically-weighted regression models

Compositional data (vectors of non-negative components summing to unity) frequently arise in scientific applications where covariates influence the relative proportions of components, yet traditional regression approaches face challenges regarding the unit-sum constraint and zero values. This paper revisits the α–regression framework, which uses a flexible power transformation parameterized by α to interpolate between raw data analysis and log-ratio methods, naturally handling zeros without imputation while allowing data-driven transformation selection. We formulate α–regression as a non-linear least squares problem, study its asymptotic properties, provide efficient estimation via the Levenberg-Marquardt algorithm, and derive marginal effects for interpretation.

Compositional data are characterized as vectors of non-negative components constrained to sum to a constant, conventionally normalized to unity. Compositional data structures arise across diverse scientific domains, as evidenced by the substantial body of methodological literature devoted to their rigorous statistical analysis. The sample space of such data is defined by the standard simplex.

The need for models tailored to compositional data has driven considerable innovation in the recent statistical literature. The foundational framework was established by Aitchison (2003), subsequently known as Aitchison’s model, which is based on log-ratio transformations and gave rise to log-ratio analysis (LRA). This methodology was refined by Egozcue et al. (2003), who introduced the isometric log-ratio (ilr) transformation to preserve geometric properties. In contrast, the stay-in-the-simplex approach employs probability distributions and regression structures defined directly on the simplex. Notably, Dirichlet regression has been used extensively for compositional data (Gueorguieva et al., 2008, Hijazi and Jernigan, 2009, Melo et al., 2009). Furthermore, Iyengar and Dey (2002) investigated the generalized Liouville family of distributions, which accommodates negative or heterogeneous correlation structures, thereby extending beyond the restrictive correlation structure imposed by Dirichlet distributions. A less theoretically justified, yet occasionally employed, strategy disregards the unit-sum constraint and treats compositional data within a Euclidean framework, an approach known as raw data analysis (RDA) (Baxter, 2001, Baxter et al., 2005). A fourth approach employs the α-transformation family (Tsagris et al., 2011), which interpolates continuously between RDA and LRA, affording greater model flexibility while accommodating zero components naturally.
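To make the interpolation between RDA and LRA concrete, the following is a minimal sketch of the α-transformation in the form commonly attributed to Tsagris et al. (2011): a power transform followed by closure, centring and projection with a Helmert sub-matrix. The function names `helmert_sub` and `alpha_transform` are our own illustrative choices, and the exact scaling is an assumption that should be checked against the original reference.

```python
import numpy as np

def helmert_sub(D):
    """(D-1) x D matrix with orthonormal rows, each orthogonal to the vector of ones."""
    H = np.zeros((D - 1, D))
    for i in range(1, D):
        H[i - 1, :i] = 1.0
        H[i - 1, i] = -float(i)
        H[i - 1] /= np.sqrt(i * (i + 1))
    return H

def alpha_transform(x, alpha):
    """Map an (n x D) array of compositions to (n x (D-1)) Euclidean coordinates.

    alpha = 1 gives a linear map of the raw data; alpha = 0 is treated as the
    limiting case (centred log-ratios projected by H) and needs strictly
    positive parts, whereas alpha > 0 tolerates zero components.
    """
    x = np.asarray(x, dtype=float)
    x = x / x.sum(axis=1, keepdims=True)       # closure to unit sum
    D = x.shape[1]
    if alpha == 0.0:
        logx = np.log(x)
        w = logx - logx.mean(axis=1, keepdims=True)
    else:
        u = x ** alpha
        u = u / u.sum(axis=1, keepdims=True)   # power transform, then closure
        w = (D * u - 1.0) / alpha
    return w @ helmert_sub(D).T

# A composition with a zero part is handled without imputation when alpha > 0.
z = alpha_transform([[0.0, 0.4, 0.6], [0.2, 0.3, 0.5]], alpha=0.5)
```

In this sketch, values of α near zero reproduce the log-ratio coordinates for strictly positive data, while α = 1 leaves the data on a linear scale, which is the sense in which the family bridges the two analyses.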

Concerning spatial autocorrelation structures, the spatially lagged X (SLX) model represents a parsimonious specification incorporating spatial dependence exclusively through exogenous covariates, thereby excluding spatial lags of the dependent variable (Elhorst, 2014, LeSage and Pace, 2009). The spatial autoregressive (SAR) model (Kazar and Celik, 2012, Shi et al., 2025), analogous to temporal autoregressive processes, posits that observations are influenced by proximate spatial neighbors. Specifically, the SAR model expresses the dependent variable as a function of both explanatory covariates and a spatially weighted average of neighboring dependent variable realizations. Geographically weighted regression (GWR) constitutes a local regression methodology designed to capture spatially heterogeneous relationships (Brunsdon et al., 1996). In contrast to conventional regression, which assumes parameter stationarity, GWR permits spatial nonstationarity through location-specific coefficient estimation.
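For orientation, the three specifications described above can be written in their standard textbook forms for a scalar response. The notation is ours and is assumed for illustration only: $\mathbf{W}$ is an $n \times n$ spatial weights matrix, $\rho$ the spatial autoregressive parameter, and $\mathbf{G}_i$ a diagonal matrix of kernel weights centred at location $i$ with bandwidth $h$. The compositional versions developed in the paper may be parameterized differently.

```latex
% SLX: spatial dependence enters only through lagged covariates
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{W}\mathbf{X}\boldsymbol{\theta} + \boldsymbol{\varepsilon}

% SAR: spatially weighted average of neighbouring responses on the right-hand side
\mathbf{y} = \rho\,\mathbf{W}\mathbf{y} + \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}
\quad\Longleftrightarrow\quad
\mathbf{y} = (\mathbf{I}_n - \rho\,\mathbf{W})^{-1}\left(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\right)

% GWR: location-specific weighted least squares estimate at location i
\hat{\boldsymbol{\beta}}_i = \left(\mathbf{X}^{\top}\mathbf{G}_i\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{G}_i\,\mathbf{y},
\qquad
\mathbf{G}_i = \operatorname{diag}\!\left\{K\!\left(d_{i1}/h\right), \ldots, K\!\left(d_{in}/h\right)\right\}
```

Here $d_{ij}$ denotes the distance between locations $i$ and $j$ and $K$ is a kernel function, for example Gaussian.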

The integration of the spatial regression framework within compositional data analysis represents a relatively narrow research area. Leininger et al. (2013) synthesized hierarchical Bayesian models for zero-inflated compositional data, incorporating spatial random effects to accommodate local variation. Nguyen et al. (2021) and Yoshida et al. (2021) developed a SAR specification and a GWR specification, respectively, both employing the ilr transformation for compositional responses. Clarotto et al. (2022) introduced a novel power transformation, conceptually analogous to the α-transformation, specifically calibrated for geostatistical modeling of compositional data.

In this paper we adopt a pragmatic methodological stance tailored to regression with compositional data. The principal contribution is a unified framework for regression modelling of compositional data. We examine the α-regression (Tsagris, 2015b), which was proposed as a generalization of Aitchison’s log-ratio regression (Aitchison, 2003) and which naturally accommodates zero components while offering flexibility via the α-transformation. First, we review the α-regression model, formulate it as a non-linear least squares minimization problem, and use a computationally efficient modified Levenberg-Marquardt algorithm to estimate the regression coefficients. We suggest two approaches to select the optimal value of α and provide formulas for the marginal effects (MEs) of the covariates, including their asymptotic variance. We then establish the consistency and asymptotic normality of the estimated regression coefficients. Concluding the presentation of the α-regression, we discuss robust extensions and a simple method to incorporate both compositional and Euclidean predictors.
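The non-linear least squares view can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: we assume the fitted compositions take an inverse additive log-ratio (softmax with a reference component) form, that the residuals are differences between α-transformed responses and α-transformed fitted values, and we substitute SciPy's standard MINPACK Levenberg-Marquardt routine (`scipy.optimize.least_squares` with `method="lm"`) for the modified algorithm mentioned above. All function names are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

def helmert_sub(D):
    """(D-1) x D matrix with orthonormal rows orthogonal to the vector of ones."""
    H = np.zeros((D - 1, D))
    for i in range(1, D):
        H[i - 1, :i] = 1.0
        H[i - 1, i] = -float(i)
        H[i - 1] /= np.sqrt(i * (i + 1))
    return H

def alpha_transform(x, alpha):
    """alpha-transformation for alpha != 0 (zeros allowed when alpha > 0)."""
    D = x.shape[1]
    u = x ** alpha
    u = u / u.sum(axis=1, keepdims=True)
    return ((D * u - 1.0) / alpha) @ helmert_sub(D).T

def fitted_compositions(beta_flat, X, D):
    """Assumed mean function: softmax with the first component as reference."""
    B = beta_flat.reshape(X.shape[1], D - 1)
    eta = np.hstack([np.zeros((X.shape[0], 1)), X @ B])
    eta = eta - eta.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(eta)
    return e / e.sum(axis=1, keepdims=True)

def residuals(beta_flat, X, Y, alpha):
    """Stacked residuals in alpha-transformed coordinates."""
    mu = fitted_compositions(beta_flat, X, Y.shape[1])
    return (alpha_transform(Y, alpha) - alpha_transform(mu, alpha)).ravel()

def fit_alpha_regression(X, Y, alpha):
    """Minimise the sum of squared residuals with Levenberg-Marquardt."""
    p, D = X.shape[1], Y.shape[1]
    start = np.zeros(p * (D - 1))
    res = least_squares(residuals, start, args=(X, Y, alpha), method="lm")
    return res.x.reshape(p, D - 1), res.cost

# Usage on simulated, noise-free data (illustrative values only).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
B_true = np.array([[0.5, -0.3], [1.0, 0.2]])
Y = fitted_compositions(B_true.ravel(), X, 3)
B_hat, cost = fit_alpha_regression(X, Y, alpha=0.5)
```

Because the objective depends on α only through the transformation, the same routine can be called over a grid of α values when selecting the transformation by a data-driven criterion.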

The advantages of the α-regression are: a) the ability to handle zeros naturally without imputation; b) flexibility, as α provides a continuum from power transforms to log-ratio methods; c) predictive performance that is often higher than that of classical methods; and d) a balance of the strengths of power transformations and log-ratio methods, providing a flexible and effective tool for predictive modeling on the simplex. A disadvantage, though, is the reduced interpretability of the regression coefficients compared to log-ratio approaches.

We next extend the α-regression to accommodate spatial dependence in three directions. The first extension is the α-SLX model, which allows for spatial correlation in the predictors, that is, for spillover effects at the covariate level: the covariates affect the response directly, but also indirectly via the covariate values of neighbouring locations. The second extension is the α-SAR model, which places the spatial correlation on the response side: the response is affected not only by the covariates but also by the response values of its neighbours. Finally, we propose the GWαR model, in which the regression coefficients are location-specific. For all of these spatial regression models, α and the remaining free parameters are selected via a spatial K-fold cross-validation (CV) protocol. Further, since the resulting regression coefficients do not directly convey the effect of the covariates, we provide formulas to compute the MEs (and their asymptotic variance).
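As an illustration of spillover at the covariate level, the sketch below builds a row-standardised k-nearest-neighbour weights matrix and augments the covariates with their spatial lags. The choice of k-nearest-neighbour weights and the helper names `knn_weights` and `slx_design` are assumptions made for this example; the paper's own weighting scheme may differ.

```python
import numpy as np

def knn_weights(coords, k):
    """Row-standardised spatial weights: each location puts weight 1/k on its k nearest neighbours."""
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # a location is not its own neighbour
    nbrs = np.argsort(d, axis=1)[:, :k]         # indices of the k closest locations
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0 / k
    return W

def slx_design(X, W):
    """Design matrix [1, X, WX]; X must not contain an intercept column,
    because the lag of a constant is the same constant (perfect collinearity)."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X, W @ X])

# Usage: four locations on a line, one covariate, one neighbour each.
coords = np.array([[0.0], [1.0], [2.5], [10.0]])
X = np.array([[1.0], [2.0], [3.0], [4.0]])
W = knn_weights(coords, k=1)
Z = slx_design(X, W)
```

The augmented matrix can then be passed to any regression routine in place of the original covariates, so that the coefficients on the lagged columns capture the indirect, neighbour-mediated effects.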
