Tuesday, August 30, 2016

From PhD Astronomer to Data Scientist

Like so many other recent graduates, I have decided to trade in research in academia for research in the tech industry.

A few years ago, about halfway into my PhD program, I wasn't sure what I wanted to do after I graduated. Would I enter the nomadic post-doc life? Was I actually qualified to do anything else? It was at this time that I took a class simply called Data Analysis in Astronomy. This class really opened my eyes to a multitude of tools, such as principal component analysis (PCA), k-means clustering, and many other statistical techniques. We had to do a final group project where we developed a facial recognition routine using PCA. This was a fun assignment and really got me thinking about a career where these tools are used in an applied way like this.

The other formative experience was listening to a talk by an astronomy professor/data scientist about some of the underappreciated results of statistical analysis. For example, he described how in Florida, right before a hurricane, Targets and Wal-Marts were experiencing a huge spike in sales of a specific item, but it wasn't an obvious one. Not tissue paper, nor bread, nor eggs, nor milk, nor water, nor whatever most people would immediately think to stock up on. Instead it was Pop-Tarts, which kids like, don't need to be heated, and are cheap. Here's a perfect example of a result that makes sense when you reflect on it but wouldn't be apparent at first thought.

(Funny this looks a lot like me! courtesy:
http://www.marketingdistillery.com/2014/08/30/data-science-skill-set-explained/)
After taking that class and listening to that speaker, I realized I was more interested in the tools used to analyze data than in the data itself. I discovered I wanted to solve a lot more problems than just those in astronomy. So I focused my thesis on machine learning (PCA, random forest, time series analysis) so I could more effectively market myself for a post-grad school life.

Applying to jobs and fellowship incubator opportunities is a little different from applying to a post-doc or graduate school. I decided to apply to the Insight Data Science and Data Incubator fellowship programs, which seek to provide training to academics so that they can transfer their skill sets to work in tech. Additionally, the Data Science for the Social Good fellowship looks like a great place to go if you're interested in working for non-profits or city governments.

These programs offer different resources to accomplish those goals, so it would be helpful to ask recent graduates how they liked the experience. Insight's application was easier since all it required was a short 30-min chat with them to explain a project (thesis or other) that uses data. It's important to have something you can show visually. Data Incubator's application was much more intense. They require that you solve 2 difficult data problems, and you also have to propose the project you will work on during the fellowship. I didn't quite realize this project had to be near its final stages even before applying, so it's best to come up with something well before the application due date. All told, I was offered a spot in the Insight Health data science fellowship, but in Boston. I was more interested in staying in the Baltimore/DC region, so I decided to continue to look for jobs in the area.

After a short search on Glassdoor, LinkedIn, and other job websites I found my current company, SocialCode. They focus on analyzing ads and ad interactions on social media (e.g. Facebook, Twitter, Instagram).

I applied to a few other places, but the interview process was pretty similar for every company. Each began with a very short (~10 min) phone screen just to make sure I was who I said I was. Then came a short (~30 min) chat with a current data scientist about my thesis work, with some follow-up questions about the data analysis. At some point in the process, the company would send a short data project that I had 3-7 days to complete. It was usually an open-ended question to see how I would analyze data I'd never seen before. This project was then followed by a longer (~45-60 min) chat about the results of the project. If they liked what they'd seen and heard from me, I'd be invited to an in-person interview. At these I would meet with a few current employees and they would grill me on my research, abstract data analysis questions, and specific computer science questions, among other topics. Honestly, the oral examination of abstract data analysis was much more difficult than defending my thesis!

I'm excited about what the future of data science will bring and how I can contribute, but I'd be lying if I said I wasn't going to miss astronomy. All the wonderful people I've met and the interesting projects and teams I've worked on have been a great source of happiness. The academic route was just not for me. Everyone should follow their path as they see it. Sometimes that means academia, but sometimes not. Don't let anyone else's expectations for you determine your trajectory.

Monday, February 8, 2016

The VIMOS UltraDeep Survey – a spectroscopic survey of high redshift galaxies


The VIMOS UltraDeep Survey (short: VUDS) is an observational program to gain spectroscopic measurements for ~10,000 galaxies at high redshift, when the Universe was only between 1 and 3 billion years old (today, the Universe is 13.8 billion years old). This is a particularly interesting era to study in terms of galaxy evolution since astronomers expect galaxies at that epoch to look very different from today. For example, at that early time we observe that galaxies have a much more disturbed morphology compared to the beautiful structured spiral galaxies or smooth elliptical galaxies that we see in the local Universe. We expect that galaxies formed many more stars at that time, partly triggered by disturbances from the merging of galaxies but also because more gas was still available to form stars in those galaxies. The time between redshift 2 and 6 (i.e. the first 1-3 billion years of the Universe’s age) is thus a major epoch of galaxy assembly.
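
For readers who want to check how redshift maps onto cosmic age, here is a minimal sketch. It assumes a flat universe containing only matter and dark energy, with round-number parameters (a Hubble constant of 70 km/s/Mpc and a matter density of 0.3) that are our illustrative choice and not values taken from the survey. The function name `age_at_redshift` is likewise just for this example. With these assumptions, redshifts 6 and 2 correspond to ages of roughly 1 and 3 billion years, and today's age comes out slightly below 13.8 billion years because the parameters are rounded.

```python
import math

def age_at_redshift(z, hubble_constant=70.0, omega_matter=0.3):
    """Age of a flat matter + dark-energy universe at redshift z, in Gyr.

    Uses the closed-form solution for a flat Lambda-CDM model and
    ignores radiation, which only matters at much higher redshift.
    The default parameters are assumed round numbers.
    """
    omega_lambda = 1.0 - omega_matter
    # 1/H0 converted from (km/s/Mpc)^-1 to billions of years
    hubble_time_gyr = 977.8 / hubble_constant
    x = math.sqrt(omega_lambda / omega_matter) * (1.0 + z) ** -1.5
    return hubble_time_gyr * 2.0 / (3.0 * math.sqrt(omega_lambda)) * math.asinh(x)

for z in (0, 2, 6):
    print(f"z = {z}: the Universe is about {age_at_redshift(z):.1f} billion years old")
```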

Figure 1: Very Large Telescope in Chile, photo credit: R. Thomas.
With CANDELS, galaxies in that epoch are studied mostly based on photometry, meaning images taken at different wavelengths. We described in earlier blog posts how with photometry at many different wavelengths astronomers are able to study the properties of galaxies through comparing the observed data to model galaxy spectra.

With VUDS galaxy evolution is approached from the spectroscopic side. A spectrum of an object is created by dispersing all its emitted light by directing it through a disperser like a prism, meaning the light is split up according to its wavelength. An easy example is the creation of a rainbow where the light from the sun hits raindrops in the air which act as dispersers and split the originally "white" sunlight up by wavelength, creating the typical coloured stripes. Such spectra allow us to study the properties of galaxies in much more detail compared to the study of images alone. 

The VUDS survey covers about 1 square degree in the sky. As a comparison the diameter of the full moon is about 0.5 degrees and its area is ~0.2 square degrees, which means it’s a fifth of the area covered by the VUDS survey. However, this 1 square degree of area of the VUDS survey is split over 3 separate fields in the sky that have been observed with a lot of different instruments and at many wavelengths already, creating a unique and precious data set for astronomers to carry out their studies. The three fields are the COSMOS field (which overlaps with the CANDELS-COSMOS field), the Extended-Chandra Deep Field South (which overlaps with the CANDELS-GOODS-South field) and the VVDS-2h field. Within those 3 fields spectra of ~10,000 galaxies were taken with the VIMOS multi-object spectrograph at the Very Large Telescope (VLT) in Chile (Figure 1). We described how multi-object spectroscopy works in more detail in this recent post. In short, suffice it to say that with that instrument, astronomers are able to take a spectrum of many galaxies at the same time. VUDS is the largest spectroscopic survey of galaxies at these early cosmic times.
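
The comparison with the full moon is simple arithmetic, sketched here with the numbers quoted above and the moon treated as a flat disc (the variable names are just for illustration):

```python
import math

moon_diameter_deg = 0.5      # apparent diameter of the full moon, in degrees
vuds_area_sq_deg = 1.0       # total area covered by VUDS, in square degrees

# Area of a disc: pi * radius^2
moon_area_sq_deg = math.pi * (moon_diameter_deg / 2.0) ** 2
moon_fraction = moon_area_sq_deg / vuds_area_sq_deg

print(f"Full moon area: {moon_area_sq_deg:.2f} square degrees")
print(f"Fraction of the VUDS area: {moon_fraction:.2f}")
```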

Two of the 3 fields covered by VUDS overlap with the CANDELS area. The spectra and spectroscopic redshifts in that overlap area (~ 700 galaxies) were just publicly released by the VUDS team.

Figure 2: Stacked spectrum of galaxies between redshift 3 to 4 with the most reliable spectroscopic redshifts in VUDS. Vertical dashed lines indicate known spectral lines which are used to determine spectroscopic redshift and galaxy properties.
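
A stacked spectrum of this kind is commonly built by shifting every individual spectrum back to its rest frame and averaging the results on a common wavelength grid. The sketch below illustrates that general idea with invented toy spectra. It is not the VUDS team's actual pipeline, and the helper names, the wavelength range and the line width are all assumptions made for the example.

```python
import bisect
import math

LYMAN_ALPHA = 1215.67  # rest-frame wavelength in Angstrom

def to_rest_frame(z, observed_wavelengths):
    """Undo the redshift: rest wavelength = observed wavelength / (1 + z)."""
    return [w / (1.0 + z) for w in observed_wavelengths]

def interpolate(x, xs, ys):
    """Linearly interpolate the curve (xs, ys) at x; None if x is out of range."""
    if x < xs[0] or x > xs[-1]:
        return None
    i = bisect.bisect_right(xs, x) - 1
    if i >= len(xs) - 1:  # x sits exactly on the last sample
        return ys[-1]
    frac = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + frac * (ys[i + 1] - ys[i])

def stack_spectra(spectra, rest_grid):
    """Average (z, observed_wavelengths, fluxes) spectra on a rest-frame grid."""
    rest_frame = [(to_rest_frame(z, waves), fluxes) for z, waves, fluxes in spectra]
    stacked = []
    for lam in rest_grid:
        values = [interpolate(lam, waves, fluxes) for waves, fluxes in rest_frame]
        values = [v for v in values if v is not None]
        stacked.append(sum(values) / len(values) if values else float("nan"))
    return stacked

def toy_spectrum(z):
    """Invented spectrum: flat continuum plus a Gaussian Lyman-alpha emission line."""
    observed = [3600.0 + i for i in range(5801)]  # assumed 3600 to 9400 Angstrom
    centre = LYMAN_ALPHA * (1.0 + z)
    width = 5.0 * (1.0 + z)
    fluxes = [1.0 + math.exp(-0.5 * ((w - centre) / width) ** 2) for w in observed]
    return z, observed, fluxes

rest_grid = [1100.0 + i for i in range(301)]  # 1100 to 1400 Angstrom
stacked = stack_spectra([toy_spectrum(3.0), toy_spectrum(3.5), toy_spectrum(3.9)], rest_grid)
peak_wavelength = rest_grid[stacked.index(max(stacked))]
print(f"Stacked emission peak near {peak_wavelength:.0f} Angstrom (rest frame)")
```

Although the three toy galaxies show their emission line at very different observed wavelengths, the line falls at the same place once each spectrum is moved to the rest frame, which is why it stands out in the average.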

For the VUDS survey, the objects targeted for spectroscopy were selected primarily based on their redshift as derived purely from photometry (again, see this blog post here). Additionally, some sources were added based on their photometric colours (i.e. the difference in brightness between two wavelength bands), which indicate a high redshift. These objects were then observed with two different grisms (one for the blue wavelength end and one for the red wavelength end) for about 14 hours each. The resulting spectra cover a wavelength range from the blue optical to the very red optical. This means that for these high-redshift galaxies, we really observed their ultra-violet to blue optical light, which the redshift has shifted into the optical wavelength range covered by the VIMOS instrument. This wavelength range reveals many properties of galaxies, especially with regard to their star formation. In Figure 2 we show a stacked spectrum of some VUDS sources in which the spectral lines are also indicated. In Figure 3 you can see all the spectra of the VUDS survey compiled into a picture and sorted by redshift, where each line represents one spectrum. Emission and absorption lines are nicely visible in this image as bright and dark lines that stretch across it from bottom left to top right. This also illustrates how spectral features are redshifted towards redder wavelengths. The most common spectral lines and features in these spectra are the Hydrogen Lyman-alpha, Lyman-beta and Lyman-gamma lines, the Lyman limit (below which almost all emission is absorbed by neutral Hydrogen around newly formed stars), the Carbon lines (CII, CIII and CIV, where the Roman numerals after the letters indicate the ionization level of the element) and lines from Helium (He), Oxygen (O), Silicon (Si) and Aluminium (Al).
These lines are used not only to determine the spectroscopic redshift of these galaxies (i.e., through their known rest-frame wavelength), but also to derive other galaxy properties such as star formation rate and chemical composition. Overall in VUDS we were able to determine reliable spectroscopic redshifts for ~6000 galaxies, which cover a large range of brightnesses and stellar masses. Some of the galaxies in this survey form stars at a rate of up to 1000 solar masses per year!
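
To make the link between a line's rest-frame wavelength and the redshift concrete, here is a small sketch. The relation is the standard one, observed wavelength = rest wavelength × (1 + z). The function names are ours, and the rest wavelengths are approximate textbook values (the C IV entry is a rounded value for a close pair of lines).

```python
# Approximate rest-frame wavelengths in Angstrom
REST_WAVELENGTHS = {
    "Lyman-beta": 1025.72,
    "Lyman-alpha": 1215.67,
    "C IV": 1549.0,
}

def observed_wavelength(rest_wavelength, z):
    """Wavelength at which a line emitted at rest_wavelength is observed."""
    return rest_wavelength * (1.0 + z)

def redshift_from_line(observed, rest_wavelength):
    """Redshift implied by one identified line."""
    return observed / rest_wavelength - 1.0

# Where do the lines land for galaxies at redshift 3 and 4?
for name, rest in REST_WAVELENGTHS.items():
    at_3 = observed_wavelength(rest, 3.0)
    at_4 = observed_wavelength(rest, 4.0)
    print(f"{name}: {rest:.1f} A at rest, {at_3:.1f} A at z=3, {at_4:.1f} A at z=4")

# Going the other way: a Lyman-alpha line found at 4862.68 A implies z = 3
z_measured = redshift_from_line(4862.68, REST_WAVELENGTHS["Lyman-alpha"])
```

This is why ultra-violet lines such as Lyman-alpha show up in an optical spectrograph for galaxies at these redshifts: a factor of 4 to 5 in wavelength moves them from about 1200 Angstrom to roughly 4900 to 6100 Angstrom.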

Figure 3: Compilation of each spectrum taken in the VIMOS UltraDeep Survey and sorted by redshift. Redshift increases from the bottom to the top, meaning the further up in the image we go, the further into the past we look and the younger the Universe is. Marked are spectral emission (bright spots in the spectrum) and absorption lines (faint spots in the spectrum) at each redshift. This figure illustrates nicely how certain spectral features seem to be present in galaxies in this survey at the various redshifts and thus across cosmic time. Figure from Le Fevre et al. 2015, A&A 576, A79
Since the completion of the observations, many researchers in the international VUDS team have been working on all aspects of galaxy formation and evolution, from morphology to identifying proto-galaxy-clusters and groups, and from studying the ultra-violet spectroscopic properties of very young galaxies to the merging history of the Universe, in alignment with the science goals of the overall survey. If you are interested in following the results from the VUDS survey, you can find our Facebook page here and our Twitter account here.