Clustering methods such as k-means and its variants are standard tools for finding groups in data. However, despite their huge popularity, the underlying uncertainty cannot be easily quantified. On the other hand, mixture models represent a well-established inferential tool for probabilistic clustering, but they are characterized by severe computational bottlenecks and may yield unreliable solutions in the presence of misspecification. Instead, we rely on a generalized Bayes framework for probabilistic clustering based on Gibbs posteriors. Broadly speaking, in this setting the log-likelihood is replaced by an arbitrary loss function, which arguably leads to much richer families of clustering methods. Our contribution is twofold: first, we describe a clustering pipeline for efficiently finding groups and then quantifying the associated uncertainty. Second, we discuss two broad classes of loss functions that have advantages in terms of analytic tractability and interpretability. Specifically, we consider losses based on Bregman divergences and pairwise dissimilarities, and we show that they can be interpreted as profile and composite log-likelihoods, respectively. Full Bayesian inference is conducted via Gibbs sampling, but efficient deterministic algorithms are available for point estimation. As an important byproduct of our work, we show that several existing clustering approaches can be interpreted as generalized Bayesian estimators under specific loss functions. Hence, our methodology can also be used to formally quantify the uncertainty in widely used clustering approaches.
Joint work with Amy Herring and David Dunson (Duke University)
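To illustrate the connection between classical clustering and generalized Bayes estimation mentioned in the abstract, the sketch below shows how point estimation under a Gibbs posterior built from the squared Euclidean loss (a Bregman divergence) reduces to a Lloyd-type k-means iteration. The function name and implementation details are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

def gibbs_map_clustering(X, k, n_iter=50, seed=0):
    """Illustrative sketch: the MAP of a Gibbs posterior based on the
    squared Euclidean loss coincides with the k-means solution, so
    point estimation can be carried out with Lloyd's algorithm.
    This is a toy example, not the authors' implementation."""
    rng = np.random.default_rng(seed)
    # initialize centers at k distinct data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assignment step: each point joins the center minimizing the loss
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # update step: centers are the loss-minimizing cluster means
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

Replacing the squared Euclidean distance with another Bregman divergence changes the implied clustering model while the same two-step iteration carries over, which is one sense in which the loss-based framework enriches the family of clustering methods.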