## Thursday, August 22, 2013

### Making Statistical Inferences with Images: Using CANDELS Data to Learn About Our Universe

The data from the CANDELS program have been, unquestionably, a boon for astronomers, providing them with an unprecedented means by which to view galaxy evolution in the early Universe. But these data are a boon for statisticians as well. That's because of their complexity: going from, say, images to making inferences about the Universe requires the use of statistical techniques that can be considerably more advanced than simple linear regression. (And it may require new advanced techniques as well.) The application of the newest, most cutting-edge statistical techniques to CANDELS data: this is where I and my collaborators come in.

I am part of a small but growing class of astrostatisticians. We work at the boundary between astronomy and statistics, understanding the data and the science of the former, and applying the latest methods of the latter.  For my part, my Ph.D. is in astronomy, but for the last decade I have worked in Carnegie Mellon's Department of Statistics and with the International Computational Astrostatistics (or InCA) group, a collection of CMU statisticians and computer scientists as well as astronomers at CMU, Pitt, and places beyond. I work to facilitate communication between statisticians and their astronomical counterparts and, to borrow a phrase from computer science, I try to "abstract away" the complexity of astronomical data to make them more immediately accessible to statisticians, particularly to statistics students.

One major aspect of an astrostatistician's work is working with experts to internalize the nitty-gritty details of the science underlying a particular dataset. But another involves looking beyond these details towards the "big picture."  Are there newer and better ways to analyze a particular dataset, or to answer a particular scientific question? For the specific case of CANDELS, thinking about the big picture has led me to ask the following question: how exactly should we turn galaxy images into constraints on parameter values for cosmological and astrophysical models? Below, I'll describe how one might do this...and long story short: it won't be as easy as performing simple linear regression. An illustration of the question posed in the text above: how do we turn galaxy images (in this case, the four representative H-band images of high-redshift galaxies from the CANDELS program shown at left) into constraints on parameter values for cosmological and astrophysical models (in this case, the 68- and 95-percent confidence regions for the values of two cosmological parameters shown at right)?  The pixels for each of the galaxies at left are colored according to relative intensity: for instance, we observe less light in a pixel colored green than we do in one colored pink, or red.  Yellow-colored pixels are associated with maximum galaxy intensity.  One might wonder about the two galaxies at left, where the yellow pixels appear offset from the image centers.  These are examples of disturbed galaxies, which are relatively common in the chaotic universe of 10 billion years ago: what we see as a single "galaxy" actually might be two galaxies that are completing the process of merging together.

The basic recipe for constraining the parameters of a given model (like the Lambda CDM model) should be familiar to anyone who has ever tried to fit a curve to a set of data: assume parameter values (like the slope and intercept of a line), construct predicted data based on these values (determine the amplitude of the line at each observed datum), numerically evaluate the difference between predicted and observed data (such as by computing the squares of the differences), and iterate, changing the parameter values until the difference metric is minimized. This recipe seems clear and easy to follow...but with CANDELS data the devil will be in the details!

The first devil-infused detail is "constructing predicted data." Our predicted data are galaxy images, and to map a set of parameter values to images we need state-of-the-art simulation software that takes into account all the intricacies of the physics involved in galaxy formation. This is a detail I leave to the experts; it suffices to say here that members of the CANDELS team including Joel Primack, Avishai Dekel, and others are on the case, combining cosmological simulations with code that models how stars evolve and how photons diffuse through dusty environments to create lovely images of galaxies. So while there is presumably still much to do to  perfect the simulation models, the construction of predicted data is not something that we should have to worry too much about in the long term.

The second devil-infused detail, "numerically evaluating the difference between predicted and observed data," is more problematic. How exactly does one compare (up to hundreds of thousands of) simulated and observed images? A direct image-by-image and pixel-by-pixel comparison is not appropriate, because of the random nature of simulations: we cannot expect a simulation to reproduce the structural details of galaxies that exhibit bars, arms, and/or clumps of star formation. Instead, what we want to do is to compare the entire ensemble of simulated images with that of observed images.  For instance, suppose that each galaxy image is a "postage stamp" measuring 100 by 100 pixels. One can use each of these pixels to define an axis, creating a 10,000-dimensional space; a given galaxy's value along the first axis would be, for instance, the intensity in the lower-leftmost pixel, while the galaxy's value along the second axis would be the intensity in an adjoining pixel, etc., etc. Once all the galaxy images are mapped to this space, we'd have two sets of points, for simulated and observed galaxies, respectively. Then all we have to do it compare them: how "far" is one set from the other? Eventually, we'd want to find the set of model parameters that minimizes that distance.

Before answering that, we should probably admit that working in a 10,000-dimensional space is actually infeasible: we would suffer from the so-called "curse of dimensionality," wherein the amount of data needed to achieve a precise statistical result generally grows exponentially with dimension. But all is not lost, because we wouldn't want to work in such a large space anyway, if we can avoid it.  We expect that galaxies "live" in some subset of that space, perhaps one of much lower dimensionality. For instance, all the galaxies might inhabit a line or curve running through our large space, or perhaps a plane. We don't expect the space in which the data live to be quite so low dimensional, but it may not be much more complicated than that. The "noisy spiral": these data live along a one-dimensional spiral embedded in a two-dimensional space, much like we expect galaxies to inhabit a low-dimensional subset of the high-dimensional space defined by image pixels.  From Lafon & Lee (2006).
There are various flavors of "dimension reduction" techniques that we astrostatisticians use frequently. Some are relatively familiar to astronomers, like principal components analysis, and others less so. Here we'll concentrate on one familiar to galaxy morphologists: using summary statistics. A summary statistic is simply a mapping from data values to a number, with the most familiar one being the mean, or average, of all the values in a dataset. A number of statistics have been developed for summarizing galaxy images in a way that retains important morphological information. (For instance, in Freeman et al.  (2013), we propose three new summary statistics that are useful for detecting disturbed morphologies, which the reader can think of as separating those galaxies with clumpy profiles from those that appear smooth.)

So, let's assume that we have summarized all of our simulated and observed images with some number of these oft-used statistics. Now, instead of two sets of points in a 10,000-dimensional space, we might have two sets in, e.g., an eight-dimensional space. That's more like it...but what do we do now?

One answer to this question involves the statistical concept of density. A probability density function, or pdf, is the probability of sampling a datum, per unit (multi-dimensional) volume. For instance, there is a pdf from which human heights are sampled: we can compute the probability of sampling a height between 66 and 68 inches by integrating this pdf over the range (66,68), and we would find that that probability is much larger than the probability of sampling a height, for instance, between 80 and 82 inches. To compare our sets of points, we could estimate the pdfs for each, then compute some metric that encapsulates how different the two pdfs are. Or even simpler, we could utilize work by my colleagues Ann Lee and Rafael Izbicki and base our metric on the ratio of pdfs at each cluster point, which can be estimated simply without knowing the pdfs themselves. We don't know yet what the optimal method will be, both statistically and in terms of computational feasibility. It's an open question, which is the kind of question statistics researchers like best!

So check back in a few years: maybe by then our group here at CMU, or some other group, will have solved the problem of making statistical inferences in an effective manner using galaxy images. Hopefully it will happen, and if it does, don't worry that astrostatisticians like me will be left with nothing to do. The great thing about astronomy is that there shiny objects everywhere that can grab an astrostatistician's attention. There will always be more data to analyze, and with these data there will always be more statistical problems to solve.