Flexible Non-parametric Regression Models for Compositional Response Data with Zeros

Compositional data arise in many real-life applications, and versatile methods for properly analyzing this type of data in the regression context are needed. When parametric assumptions do not hold or are difficult to verify, non-parametric regression models can provide a convenient alternative method for prediction. To this end, we consider an extension to the classical k–NN regression, termed a–k–NN regression, that yields a highly flexible non-parametric regression model for compositional data through the use of the a–transformation.

Non-negative multivariate vectors with variables (typically called components) conveying only relative information are referred to as compositional data. When the vectors are normalized to sum to 1, their sample space is the standard simplex. The need for valid regression models for compositional data in practice has led to several developments in this area, many of which have been proposed in recent years. The first regression model for compositional response data was developed by Aitchison (2003), commonly referred to as Aitchison’s model, and was based on the additive log-ratio transformation defined in Section 2. Dirichlet regression was applied to compositional data in Gueorguieva et al. (2008), Hijazi and Jernigan (2009), and Melo et al. (2009). Iyengar and Dey (2002) investigated the generalized Liouville family of distributions that permits distributions with negative or mixed correlation and also contains non-Dirichlet distributions with non-positive correlation. The additive log-ratio transformation was again used by Tolosana-Delgado and von Eynatten (2009), while Egozcue et al. (2012) extended Aitchison’s regression model by using an isometric log-ratio transformation but, instead of employing the usual Helmert sub-matrix, chose a different orthogonal matrix that is compositional data dependent. For a substantial number of specific examples of applications involving compositional data, see Tsagris and Stewart (2020).
A drawback of the aforementioned regression models is their inability to handle zero values directly and, consequently, a few models have recently been proposed to address the zero problem. In particular, Scealy and Welsh (2011) transformed the compositional data onto the unit hyper-sphere and introduced the Kent regression which treats zero values naturally. Spatial compositional data with zeros were modelled in Leininger et al. (2013) from the Bayesian stance. In Mullahy (2015), regression models for economic share data were estimated, with the shares taking zero values with nontrivial probability. Alternative regression models in the field of econometrics and applicable when zero values are present are discussed in Murteira and Ramalho (2016). In Tsagris (2015a), a regression model that minimizes the Jensen-Shannon divergence was proposed while in Tsagris (2015b), a-regression (a generalization of Aitchison’s log-ratio regression) was introduced, and both of these approaches are compatible with zeros. An extension to Dirichlet regression allowing for zeros was developed by Tsagris and Stewart (2018) and referred to as zero adjusted Dirichlet regression.

Most of the preceding regression models are parametric in the sense that they assume linear or generalized linear relationships between the dependent and independent variables, even though the relationships in many real-life applications are neither restricted to the linear setting nor conform to fixed parametric forms. In the case of unconstrained data, when parametric assumptions are not satisfied or not easily verified, non-parametric regression models and algorithms, such as k–NN regression, are often considered. As detailed in Section 3, kernel regression (Wand and Jones, 1994) is a more sophisticated technique that generalizes k–NN regression by assigning each observation a weight that decays exponentially with distance. A disadvantage of kernel regression is that it is more complex and computationally expensive than k–NN regression.
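To make the contrast between the two classical estimators concrete, the following is a minimal sketch for unconstrained (non-compositional) responses. It is illustrative only: the function names, the toy data, and the choice of a Gaussian kernel with bandwidth `h` are assumptions made here, not the formulation given in Section 3.

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    """Average the responses of the k training points nearest to x_new."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    return y_train[nearest].mean(axis=0)

def kernel_regress(X_train, y_train, x_new, h=1.0):
    """Weighted average of all responses (Nadaraya-Watson form).

    Assumption: Gaussian kernel, so weights decay as exp(-d^2 / (2 h^2)).
    """
    dists = np.linalg.norm(X_train - x_new, axis=1)
    weights = np.exp(-0.5 * (dists / h) ** 2)
    return weights @ y_train / weights.sum()

# Hypothetical toy data: y = x^2 on a small grid.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 1.0, 4.0, 9.0, 16.0])
x0 = np.array([2.0])

pred_knn = knn_regress(X, y, x0, k=3)          # equal weight on 3 neighbours
pred_kernel = kernel_regress(X, y, x0, h=0.5)  # every point used, far ones downweighted
```

The k–NN estimate gives equal weight to the selected neighbours and ignores the rest, whereas the kernel estimate uses every observation, which is one reason it costs more to compute.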

The contribution of this paper is an extension of these classical non-parametric approaches for application to compositional data through the utilization of the a–transformation. Specifically, the proposed a–k–NN and a–kernel regressions for compositional data link the predictor variables in a non-parametric, non-linear fashion, thus allowing for more flexibility. The models have the potential to provide a better fit to the data compared to conventional models and yield improved predictions when the relationships between the compositional and the non-compositional variables are complex. Furthermore, in contrast to other non-parametric regressions such as projection pursuit, applicable to log-ratio transformed compositional data (see Friedman and Stuetzle (1981)), the two proposed methods allow for zero values in the data. While our objective is improved prediction for complex relationships involving compositional response data, a disadvantage of non-parametric regression strategies, in general, is that the usual statistical inference of the effect of the independent variables is not straightforward.
However, the use of ICE plots (Goldstein et al., 2015) can overcome this obstacle and facilitate visualization of the effect of the independent variables. Finally, a significant advantage of a–k–NN regression in particular is its high computational efficiency compared to all current regression techniques, even when the sample size runs to hundreds of thousands or even millions of observations. This advantage carries over to big data and to online or streaming data that require rapid predictions. Our work also lays the foundation for novel methods in machine learning, since many current methods rely on k–NN methodology. Functions to carry out both a–k–NN and a–kernel regression are provided in the R package Compositional.
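The idea behind a–k–NN regression can be sketched as follows. This is a rough illustration under stated assumptions, not the implementation in the Compositional package. The a–transformation is not defined in this section, so the sketch assumes that the neighbours' compositions are combined by a power-transform-based mean: raise each composition to the power a, renormalize, average, invert the power, and close to sum 1. The names `frechet_mean` and `a_knn_predict`, the restriction to 0 < a ≤ 1, and the toy data are all hypothetical.

```python
import numpy as np

def frechet_mean(Y, a):
    """Assumed a-based mean of compositions (rows of Y sum to 1).

    Only a > 0 is handled here, so zero components pass through unchanged.
    """
    if a <= 0:
        raise ValueError("this sketch only handles a > 0")
    U = Y ** a
    U = U / U.sum(axis=1, keepdims=True)   # power-transformed compositions
    m = U.mean(axis=0) ** (1.0 / a)        # average, then invert the power
    return m / m.sum()                     # close back onto the simplex

def a_knn_predict(X_train, Y_train, x_new, k=3, a=0.5):
    """Predict a composition from the k nearest neighbours in predictor space."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    return frechet_mean(Y_train[nearest], a)

# Hypothetical toy data: one predictor, 3-part compositions containing zeros.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.1, 0.6, 0.3],
              [0.0, 0.4, 0.6]])

pred_a1 = a_knn_predict(X, Y, np.array([1.4]), k=2, a=1.0)   # a = 1: plain average
pred_a05 = a_knn_predict(X, Y, np.array([1.4]), k=4, a=0.5)  # zeros cause no failure
```

With a = 1 the assumed mean reduces to the ordinary arithmetic average of the neighbouring compositions. No logarithm is taken for a > 0, which is why zero components pose no difficulty in this sketch.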
