04/10/2018
A Folded Model for Compositional Data Analysis

A folded-type model is developed for analyzing compositional data, providing a new and flexible class of distributions for modeling data defined on the simplex sample space. Despite its seemingly complex structure, use of the EM algorithm guarantees efficient parameter estimation.

Compositional data are met in many different scientific fields. In sedimentology, for example, samples were taken from an Arctic lake and their compositions of water, clay and sand were the quantities of interest. Data from oceanography studies involving Foraminiferal (a marine plankton species) compositions at 30 different sea depths were analyzed in Aitchison (2003, pg 399). Schnute & Haigh (2007) also analyzed marine compositional data through catch curve models for a quillback rockfish (Sebastes maliger) population. In hydrochemistry, Otero et al. (2005) used regression analysis to draw conclusions about anthropogenic and geological pollution sources of rivers in Spain. Stewart & Field (2011) modeled compositional diet estimates with an abundance of zero values obtained through quantitative fatty acid signature analysis. In another biological setting, Ghosh & Chakrabarti (2009) were interested in the classification of immunohistochemical data. Other application areas of compositional data analysis include archaeometry (Baxter et al., 2005), where the composition of ancient glasses, for instance, is of interest, and economics (Fry, Fry, & McLaren, 2000), where the focus is on the percentage of household expenditure allocated to different products. Compositional data are also met in political science (Katz & King, 1999) for modeling electoral data, and in forensic science, where the compositions of forensic glasses are compared and classified (Neocleous, Aitken, & Zadora, 2011). In demography, compositional data are met in multiple-decrement life tables, where the mortality rates amongst age groups are modeled (Oeppen, 2008). In a study of the brain, Prados et al. (2010) evaluated the diffusion anisotropy from diffusion tensor imaging using new measures derived from compositional data distances. Some recent areas of application include bioinformatics and specifically microbiome data analysis (Xia et al., 2013; Chen & Li, 2016; Shi, Zhang & Li, 2016).
These examples illustrate the breadth of compositional data analysis applications and consequently the need for parametric models defined on the simplex.

The Dirichlet distribution is a natural distribution for such data because its support is the simplex. However, it has long been recognized that this distribution is not rich and flexible enough to capture many types of variability (especially curvature) in compositional data and, for this reason, a variety of transformations have been proposed that map the data outside of the simplex. The log-ratio transformation approach was developed in Aitchison (1982), followed later by the so-called isometric log-ratio transformation methodology, which was first proposed in Aitchison (2003, pg 90) and examined in detail by Egozcue et al. (2003). More recently, Tsagris, Preston & Wood (2011) suggested the α-transformation, which includes the isometric log-ratio transformation as a special case. The α-transformation is a Box-Cox type transformation and has been successfully applied in regression analysis (Tsagris, 2015) and classification settings (Tsagris, Preston & Wood, 2016).
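To make the relationship between the α-transformation and the isometric log-ratio transformation concrete, here is a minimal Python sketch. The formula is not stated in this text: the code assumes the commonly cited form z_α(x) = H (D u_α(x) − 1) / α, with u_α(x) the power-transformed and re-closed composition and H a Helmert sub-matrix. The function names and the sign convention chosen for H are illustrative assumptions.

```python
import math

def helmert_sub(D):
    """(D-1) x D Helmert sub-matrix: rows are orthonormal and each sums to zero.
    The sign convention used here is an assumption made for illustration."""
    rows = []
    for k in range(1, D):
        scale = 1.0 / math.sqrt(k * (k + 1))
        rows.append([scale] * k + [-k * scale] + [0.0] * (D - k - 1))
    return rows

def matvec(H, v):
    """Plain matrix-vector product for lists of lists."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def ilr(x):
    """Isometric log-ratio: Helmert sub-matrix applied to the centred log-ratios."""
    logs = [math.log(xi) for xi in x]
    mean_log = sum(logs) / len(logs)
    clr = [l - mean_log for l in logs]
    return matvec(helmert_sub(len(x)), clr)

def alpha_transform(x, alpha):
    """Assumed form of the alpha-transformation; alpha = 0 is taken as the ilr limit."""
    D = len(x)
    if alpha == 0:
        return ilr(x)
    powered = [xi ** alpha for xi in x]
    total = sum(powered)
    # Power-transform, re-close to the simplex, then centre and rescale by 1/alpha.
    stretched = [(D * p / total - 1.0) / alpha for p in powered]
    return matvec(helmert_sub(D), stretched)

x = [0.2, 0.3, 0.5]                    # a composition with D = 3 parts
z_ilr = ilr(x)                         # isometric log-ratio coordinates
z_small = alpha_transform(x, 1e-6)     # alpha close to 0: numerically close to z_ilr
z_one = alpha_transform(x, 1.0)        # alpha = 1: a linear map of the composition
```

Under these assumptions, the output has D − 1 coordinates, and letting α shrink toward zero recovers the isometric log-ratio coordinates, which is the sense in which the latter is a special case.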

Regardless of the transformation chosen, the usual approach for modeling compositional data is to assume that the transformed data follow a multivariate normal distribution. While the α-transformation offers flexibility, a disadvantage is that it maps the compositional data from the simplex (S^{D-1}) to a subset of R^{D-1}, and not to R^{D-1} itself, on which the multivariate normal density is defined. An improvement to this method can be obtained by using a folded model procedure, similar to the approach used in Scealy & Welsh (2014) and to the folded normal distribution on the real line R employed in Leone, Nelson & Nottingham (1961) and Johnson (1962). The folded normal distribution on R, for example, corresponds to taking the absolute value of a normal random variable, essentially folding the negative values onto the positive side of the distribution. The model we propose here works in a similar fashion: values outside the simplex are mapped inside it via a folding type transformation. An advantage of this approach over the aforementioned log-ratio methodology is that it allows one to fit any suitable multivariate distribution on S^{D-1} through the parameter α.
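The univariate folding idea can be illustrated with a short Python sketch. It covers only the folded normal distribution on the real line described above, not the simplex folding of the proposed model, whose details are not given in this text. The function names and the parameter values (mu = 0.5, sigma = 1) are arbitrary choices for illustration.

```python
import math
import random

def folded_normal_pdf(y, mu, sigma):
    """Density of |X| for X ~ N(mu, sigma^2): the normal density at y
    plus the density at -y that folding moves onto the positive side."""
    if y < 0:
        return 0.0
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return c * (math.exp(-0.5 * ((y - mu) / sigma) ** 2)
                + math.exp(-0.5 * ((y + mu) / sigma) ** 2))

def folded_normal_mean(mu, sigma):
    """Closed-form mean of |X| for X ~ N(mu, sigma^2)."""
    return (sigma * math.sqrt(2.0 / math.pi) * math.exp(-mu ** 2 / (2.0 * sigma ** 2))
            + mu * math.erf(mu / (sigma * math.sqrt(2.0))))

mu, sigma = 0.5, 1.0            # illustrative values, not taken from the paper
rng = random.Random(0)          # fixed seed so the run is reproducible

# Folding: draw from the normal distribution and take absolute values.
samples = [abs(rng.gauss(mu, sigma)) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)

# Midpoint-rule check that the folded density integrates to about 1 on [0, 10].
step = 0.001
area = sum(folded_normal_pdf((i + 0.5) * step, mu, sigma) * step
           for i in range(10_000))

print(sample_mean, folded_normal_mean(mu, sigma), area)
```

The sample mean of the folded draws should sit close to the closed-form mean, and the folded density keeps total mass one because the mass on the negative side is simply relocated rather than discarded.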
