Tuesday, August 30, 2016

From PhD Astronomer to Data Scientist

Like so many other recent graduates, I have decided to trade in research in academia for research in the tech industry.

A few years ago, about halfway into my PhD program, I wasn't sure what I wanted to do after I graduated. Would I enter the nomadic post-doc life? Was I actually qualified to do anything else? It was at this time that I took a class simply called Data Analysis in Astronomy. This class really opened my eyes to a multitude of tools, such as principal component analysis (PCA), k-means clustering, and many other statistical techniques. We had to do a final group project where we developed a facial recognition routine using PCA. This was a fun assignment and really got me thinking about a career where these tools are used in an applied way like this.

The other formative experience was listening to a talk by an astronomy professor/data scientist in which he discussed some of the underappreciated results of statistical analysis. For example, he described how in Florida, right before a hurricane, Targets/Wal-Marts were experiencing a huge spike in sales of a specific item, but it wasn't an obvious one. Not tissue paper, nor bread, nor eggs, nor milk, nor water, nor whatever most people would immediately think to stock up on. Instead it was Pop-Tarts, which kids like, don't need to be heated, and are cheap. Here's a perfect example of a result that makes sense when you reflect on it but isn't apparent at first thought.

(Funny this looks a lot like me! courtesy:
http://www.marketingdistillery.com/2014/08/30/data-science-skill-set-explained/)
After taking that class and listening to that speaker I realized I was more interested in the tools used to analyze data than the data itself. I discovered I wanted to potentially solve a ton more problems than just in astronomy. So I focused my thesis on machine learning (PCA, random forest, time series analysis) so I could more effectively market myself for a post-grad school life.

Applying to jobs and fellowship incubator opportunities is a little different from applying to a post-doc or graduate school. I decided to apply to the Insight Data Science and Data Incubator fellowship programs, which seek to provide training to academics so that they can transfer their skill sets to work in tech. Additionally, the Data Science for the Social Good fellowship looks like a great place to go if you're interested in working for non-profits or city governments.

These programs offer different resources to accomplish those goals, so it would be helpful to ask recent graduates how they liked the experience. Insight's application was easier, since all it required was a short 30-min chat with them to explain a project (thesis or other) that uses data. It's important to have something you can show visually. Data Incubator's application was much more intense. They require that you solve 2 difficult data problems, plus you have to propose the project you will work on during the fellowship. I didn't quite realize this project had to be near the final stages even before applying, so it's best to come up with something well before the application due date. All told, I was offered a spot in the Insight Health data science fellowship, but in Boston. I was more interested in staying in the Baltimore/DC region, so I decided to continue to look for jobs in the area.

After a short search on Glassdoor, LinkedIn, and other job websites I found my current company, SocialCode. They focus on analyzing ads and ad interactions on social media (e.g. Facebook, Twitter, Instagram).

I applied to a few other places, but the interview process was pretty similar for every company. Each began with a very short (~10 min) phone screen just to make sure I was who I said I was. Then came a short (~30 min) chat with a current data scientist about my thesis work, with some follow-up questions about the data analysis. Sometime in the process, the company would send a short data project that I had 3-7 days to complete. It was usually an open-ended question to see how I would analyze data I'd never seen before. This project was then followed by a longer (~45-60 min) chat about the results of the project. If they liked what they'd seen and heard from me, I'd be invited to an in-person interview. At these I would meet with a few current employees and they would grill me on my research, abstract data analysis questions, and specific computer science questions, among other topics. Honestly, the oral examination of abstract data analysis was much more difficult than defending my thesis!

I'm excited about what the future of data science will bring and how I can contribute, but I'd be lying if I didn't say I was going to miss astronomy. All the wonderful people I've met and the interesting projects and teams I've worked on have been a great source of happiness. The academic route was just not for me. Everyone should follow their path as they see it; sometimes that means academia, but sometimes not. Don't let anyone else's expectations for you determine your trajectory.

Monday, February 8, 2016

The VIMOS UltraDeep Survey – a spectroscopic survey of high redshift galaxies


The VIMOS UltraDeep Survey (short: VUDS) is an observational program to obtain spectroscopic measurements for ~10,000 galaxies at high redshift, when the Universe was only between 1 and 3 billion years old (today, the Universe is 13.8 billion years old). This is a particularly interesting era to study in terms of galaxy evolution, since astronomers expect galaxies at that epoch to look very different from today. For example, at that early time we observe that galaxies have a much more disturbed morphology compared to the beautifully structured spiral galaxies or smooth elliptical galaxies that we see in the local Universe. We expect that galaxies formed many more stars at that time, partly triggered by disturbances from the merging of galaxies but also because more gas was still available to form stars in those galaxies. The time between redshift 2 and 6 (i.e. the first 1-3 billion years of the Universe’s age) is thus a major epoch of galaxy assembly.

Figure 1: Very Large Telescope in Chile, photo credit: R. Thomas.
With CANDELS, galaxies in that epoch are studied mostly based on photometry, meaning images taken at different wavelengths. We described in earlier blog posts how with photometry at many different wavelengths astronomers are able to study the properties of galaxies through comparing the observed data to model galaxy spectra.

With VUDS galaxy evolution is approached from the spectroscopic side. A spectrum of an object is created by dispersing all its emitted light by directing it through a disperser like a prism, meaning the light is split up according to its wavelength. An easy example is the creation of a rainbow where the light from the sun hits raindrops in the air which act as dispersers and split the originally "white" sunlight up by wavelength, creating the typical coloured stripes. Such spectra allow us to study the properties of galaxies in much more detail compared to the study of images alone. 

The VUDS survey covers about 1 square degree in the sky. As a comparison the diameter of the full moon is about 0.5 degrees and its area is ~0.2 square degrees, which means it’s a fifth of the area covered by the VUDS survey. However, this 1 square degree of area of the VUDS survey is split over 3 separate fields in the sky that have been observed with a lot of different instruments and at many wavelengths already, creating a unique and precious data set for astronomers to carry out their studies. The three fields are the COSMOS field (which overlaps with the CANDELS-COSMOS field), the Extended-Chandra Deep Field South (which overlaps with the CANDELS-GOODS-South field) and the VVDS-2h field. Within those 3 fields spectra of ~10,000 galaxies were taken with the VIMOS multi-object spectrograph at the Very Large Telescope (VLT) in Chile (Figure 1). We described how multi-object spectroscopy works in more detail in this recent post. In short, suffice it to say that with that instrument, astronomers are able to take a spectrum of many galaxies at the same time. VUDS is the largest spectroscopic survey of galaxies at these early cosmic times.
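
The moon comparison above is easy to check with a few lines of arithmetic. This is only a quick sketch: it treats the moon as a flat disc on the sky and uses the rounded numbers quoted in the text (0.5 degrees for the diameter, 1 square degree for VUDS). The function name is just for illustration.

```python
import math

# Angular diameter of the full moon in degrees (rounded value from the text).
MOON_DIAMETER_DEG = 0.5
# Approximate sky area covered by VUDS in square degrees (value from the text).
VUDS_AREA_SQ_DEG = 1.0


def disc_area_sq_deg(diameter_deg):
    """Area of a small circular disc on the sky, treated as flat."""
    radius = diameter_deg / 2.0
    return math.pi * radius ** 2


moon_area = disc_area_sq_deg(MOON_DIAMETER_DEG)
print(f"Moon area: {moon_area:.2f} square degrees")
print(f"Fraction of the VUDS area: {moon_area / VUDS_AREA_SQ_DEG:.2f}")
```

The result rounds to about 0.2 square degrees, i.e. roughly a fifth of the VUDS area.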

Two of the 3 fields covered by VUDS overlap with the CANDELS area. The spectra and spectroscopic redshifts in that overlap area (~ 700 galaxies) were just publicly released by the VUDS team.

Figure 2: Stacked spectrum of galaxies between redshift 3 to 4 with the most reliable spectroscopic redshifts in VUDS. Vertical dashed lines indicate known spectral lines which are used to determine spectroscopic redshift and galaxy properties.

For the VUDS survey, the objects targeted for spectroscopy were selected primarily based on their redshift as derived purely from photometry (again, see this blog post here). Additionally, some sources were added based on their photometric colours (i.e. the difference in brightness between two wavelength bands) which indicate a high redshift. These objects were then observed with two different grisms -- one for the blue wavelength end and one for the red wavelength end -- for about 14 hours each. The resulting spectra cover a wavelength range from the blue optical to the very red optical. This means that for these high-redshift galaxies, we really observed their ultra-violet to blue optical wavelength range, which is shifted by the redshift into the optical wavelength range covered by the VIMOS instrument. This wavelength range reveals many properties of galaxies, especially with regard to their star formation. In Figure 2 we show you a stacked spectrum of some VUDS sources in which the spectral lines are also indicated. In Figure 3 you can see all the spectra of the VUDS survey compiled into a picture and sorted by redshift, where each line represents one spectrum. Emission and absorption lines are nicely visible in this image as bright and dark lines that stretch across the image from bottom left to top right. This also illustrates how spectral features are redshifted towards redder wavelengths. The most common spectral lines and features in these spectra are the Hydrogen Lyman-alpha, Lyman-beta and Lyman-gamma lines, the Lyman limit (below which almost all emission is absorbed by neutral Hydrogen around newly formed stars), the Carbon lines (CII, CIII and CIV, where the Roman numerals behind the letters indicate the ionization level of the element) and lines from Helium (He), Oxygen (O), Silicon (Si) and Aluminium (Al). 
These lines are used not only to determine the spectroscopic redshift of these galaxies (i.e., through their known rest-frame wavelength), but also other galaxy properties such as star formation and chemical composition of the galaxies. Overall in VUDS we were able to determine reliable spectroscopic redshifts for ~6000 galaxies which cover a large range of brightnesses and stellar masses. Some of the galaxies in this survey form up to 1000 solar masses per year!
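
To make the link between a line's known rest-frame wavelength and the spectroscopic redshift concrete, here is a minimal sketch of the standard relation: observed wavelength = rest wavelength x (1 + z). The function names are made up for this example, and the Lyman-alpha rest wavelength of 1215.67 Angstroms is the commonly quoted value. Real redshift measurements fit many lines at once; this only shows the arithmetic for a single line.

```python
# Rest-frame wavelength of Lyman-alpha in Angstroms (commonly quoted value).
LYA_REST_ANGSTROM = 1215.67


def observed_wavelength(rest_wavelength, redshift):
    """Wavelength at which a line emitted at rest_wavelength is observed."""
    return rest_wavelength * (1.0 + redshift)


def redshift_from_line(observed, rest_wavelength):
    """Redshift implied by one identified line."""
    return observed / rest_wavelength - 1.0


# Where Lyman-alpha lands for galaxies at redshift 3 and 4.
for z in (3.0, 4.0):
    print(f"z = {z}: Lyman-alpha observed at "
          f"{observed_wavelength(LYA_REST_ANGSTROM, z):.1f} Angstroms")
```

For a galaxy at redshift 3 the ultra-violet Lyman-alpha line is observed near 4860 Angstroms, well inside the optical range, which is why an optical spectrograph can see it.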

Figure 3: Compilation of each spectrum taken in the VIMOS UltraDeep Survey and sorted by redshift. Redshift increases from the bottom to the top, meaning the further up in the image we go, the further into the past we look and the younger the Universe is. Marked are spectral emission (bright spots in the spectrum) and absorption lines (faint spots in the spectrum) at each redshift. This figure illustrates nicely how certain spectral features seem to be present in galaxies in this survey at the various redshifts and thus across cosmic time. Figure from Le Fevre et al. 2015, A&A 576, A79
Since the completion of the observations, many researchers in the international VUDS team work on all aspects of galaxy formation and evolution, from morphology to identifying proto-galaxy-clusters and groups, from studying the ultra-violet spectroscopic properties of very young galaxies to the merging history of the Universe, in alignment with the science goals of the overall survey. If you are interested in following the results from the VUDS survey, you can find our Facebook page here and our Twitter account here.

Thursday, December 17, 2015

Where is the Dust in Distant Galaxies?

In a previous post we wrote about the morphology of a galaxy's star light. Most galaxies have matter that we can see in 3 forms: stars, gas, and dust.



Figure 1: M51 at optical wavelengths of light. Credit: NASA, ESA, 
S. Beckwith (STScI), and The Hubble Heritage Team (STScI/AURA).

Figure 1 shows a picture of M51 at optical wavelengths of light. The yellow, red, and blue parts of the picture are the regions hosting M51's stars which are visible to us. Along the spiral arms we also see dark structures. The dark parts of the picture are the regions hosting M51's stars which are invisible to us --- the stars which are obscured by dust.

Dust grains in M51 absorb light from these stars and reemit that light at infrared wavelengths. Figure 2 shows a picture of M51 at an infrared wavelength of light. The spiral arms in the infrared picture line up with the dark structure in the optical picture. We know where the dust is in M51. What about the dust in other galaxies?

Figure 2: M51 at an infrared wavelength of light. Credit: IRSA.
As we study galaxies that are further and further away from our own, we lose information on where the dust in these galaxies is. Infrared telescopes cannot produce pictures of a distant galaxy at the same resolution as pictures of M51. We can guess at where the dust is by looking at dark structures in pictures at optical wavelengths. Your eye is good at picking out dark structures in a many-color image. What is a dark structure in a two-color image?

Figure 3 shows a zoomed in picture of M51. The dark structure is black in many places. In many other places the dark structure is adjacent to a red spot, a spot missing blue and yellow colors. Dust grains in M51 are good at obscuring blue and yellow light and less good at obscuring red light. If we measure the brightness of a spot in a red image, and the brightness at the same location in a blue image, many galaxies will have the same ratio between those brightnesses. The ratio comes from two aspects of the dust: the sizes of the grains and the number of grains. A dark structure in a distant galaxy might be a red spot with a weak blue spot.

Figure 3: a cropped and zoomed-in view of M51 at optical
wavelengths of light. Credit: NASA, ESA, S. Beckwith (STScI),
and The Hubble Heritage Team (STScI/AURA).
The red color in the image of M51 is due to light from Hydrogen atoms. The Hubble Space Telescope has an instrument allowing us to see the light from Hydrogen atoms in distant galaxies; it has another instrument allowing us to see blue light from distant galaxies. I wrote a paper using CANDELS data to compare the brightnesses of the light at the two wavelengths. We conclude that we need more data!

The ratio of brightnesses between red spots and blue spots for distant galaxies is different from the ratio for local galaxies. Dust grains in distant galaxies might have different sizes compared to their sizes in M51, which would make them more or less good at obscuring red light compared to how they obscure blue and yellow light. We cannot distinguish between this hypothesis and the one saying that the number of grains differs.

NASA has a plan to launch several telescopes into space and connect them, which would solve the problem of resolution that prevents us from having detailed pictures at infrared wavelengths of distant galaxies. You can find out more about the Far-IR Surveyor here:

Tuesday, November 24, 2015

Coming Out of the Dark Ages

Until about 400,000 years after the Big Bang, the Universe was mostly full of electrons and protons, zipping in random directions. It was only when the Universe cooled down enough, because of expansion, that electrons and protons had a chance to combine to form neutral hydrogen (the lightest element in the Universe) for the first time. This epoch is known as the epoch of recombination. The Universe then entered and remained in what we call the Dark Ages until the formation of the first luminous sources -- first stars, first galaxies, quasars, and so on. During this period, the Universe was full of neutral hydrogen, and thus completely opaque to any ultra-violet (UV) radiation, because neutral hydrogen is very efficient at absorbing UV radiation. Intense UV ionizing photons from the first stars and first galaxies then started to ionize their surroundings, forming ionized bubbles. These bubbles grew with time, and eventually the entire Universe was filled with ionized bubbles. The epoch during which this phase transition occurred -- i.e., the ionization of most of the neutral hydrogen -- is called the epoch of reionization (see Figure below). This was the last major transition in the history of the Universe, and it had a significant impact on the large scale structure of the Universe. Therefore, this is one of the frontier research areas in modern observational cosmology.


Timeline of the history of the Universe from the Big Bang (left) to the present-day Universe (right). Before the process of reionization, the Universe was completely filled with neutral hydrogen. It is only after the formation of the first sources, including first stars and first galaxies, that the neutral hydrogen in the Universe started ionizing, and by about one billion years after the Big Bang, most of the neutral hydrogen in the Universe was ionized, marking the end of the epoch of reionization (Image credit: NASA, ESA, A. Fields (STScI)).


Probing the Epoch of Reionization
One of the most powerful and practical tools to probe the epoch of reionization is the Lyman-alpha emission test. Lyman-alpha photons are produced by the n=2 to n=1 transition in neutral hydrogen, which emits a photon with a wavelength of lambda=1215.67 Angstroms. In the presence of neutral hydrogen, Lyman-alpha photons are scattered again and again, and eventually many of them are scattered away from our line of sight. As a result, we expect to see fewer and fewer galaxies with Lyman-alpha emission as we probe higher and higher redshifts (closer to the Big Bang).
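
The Lyman-alpha wavelength quoted above can be recovered from the textbook Rydberg formula for hydrogen, 1/lambda = R_H (1/n_lower^2 - 1/n_upper^2). The sketch below is only a back-of-the-envelope check: the constants are standard values rounded by hand, the function name is made up for this example, and the simple formula ignores fine structure, so the result agrees with 1215.67 Angstroms only to within a few hundredths of an Angstrom.

```python
# Rydberg constant for an infinitely heavy nucleus, in 1/m (rounded).
RYDBERG_INF_PER_M = 10973731.568
# Proton-to-electron mass ratio (rounded).
PROTON_ELECTRON_MASS_RATIO = 1836.15267

# Reduced-mass correction gives the Rydberg constant for hydrogen.
RYDBERG_H_PER_M = RYDBERG_INF_PER_M / (1.0 + 1.0 / PROTON_ELECTRON_MASS_RATIO)


def hydrogen_line_angstrom(n_lower, n_upper):
    """Vacuum wavelength of the n_upper -> n_lower hydrogen transition."""
    inverse_wavelength = RYDBERG_H_PER_M * (1.0 / n_lower**2 - 1.0 / n_upper**2)
    return 1e10 / inverse_wavelength  # convert metres to Angstroms


lyman_alpha = hydrogen_line_angstrom(1, 2)
print(f"Lyman-alpha (n=2 to n=1): {lyman_alpha:.2f} Angstroms")
```

The same function gives the other Lyman lines by changing n_upper (3 for Lyman-beta, 4 for Lyman-gamma).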

To study the epoch of reionization, we did exactly this using a large sample of very distant (high-redshift) galaxy candidates selected from the Hubble Space Telescope (HST) CANDELS survey -- the largest galaxy survey ever undertaken using HST. To know the exact distance of a galaxy, it is critical to obtain spectroscopic observations of these galaxies. We did this using a near-infrared spectrograph, MOSFIRE, on the Keck Telescope located at 13,000 ft on top of Mauna Kea, a dormant-volcano mountain in Hawaii.


To our surprise, we discovered that most of the galaxies we observed did not show Lyman-alpha emission. The figure below shows our results combined with previous studies. It shows the Lyman-alpha equivalent width, i.e. the ratio of the strength of Lyman-alpha emission from a galaxy to its underlying blue stellar light continuum (non-Lyman-alpha light), as a function of redshift (or age of the Universe on the top axis), as we probe closer and closer to the Big Bang. As can be seen, there are fewer galaxies, and at the same time the strength of Lyman-alpha emission also decreases as we go to higher redshifts. While this can be a result of a few different things, upon careful inspection, we think that this is likely because the Universe becomes more neutral as we go beyond redshift ~7, and we are witnessing the epoch of reionization in progress.
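
As a rough illustration of the equivalent width described above, the sketch below divides a line flux by the continuum flux density next to the line, and then converts the observed value to the galaxy's rest frame by dividing by (1 + z). The input numbers are invented purely for illustration and are not measurements from our sample; the function names are also just for this example.

```python
def equivalent_width(line_flux, continuum_flux_density):
    """Observed equivalent width in Angstroms.

    line_flux: integrated line flux (e.g. erg/s/cm^2)
    continuum_flux_density: continuum level near the line (erg/s/cm^2/Angstrom)
    """
    return line_flux / continuum_flux_density


def rest_frame_equivalent_width(observed_ew, redshift):
    """The observed width is stretched by (1 + z); undo that stretch."""
    return observed_ew / (1.0 + redshift)


# Hypothetical galaxy at z = 7 (illustrative values only).
ew_obs = equivalent_width(line_flux=2.0e-17, continuum_flux_density=1.0e-19)
ew_rest = rest_frame_equivalent_width(ew_obs, redshift=7.0)
print(f"Observed EW: {ew_obs:.0f} Angstroms, rest-frame EW: {ew_rest:.0f} Angstroms")
```

A large equivalent width means the line is strong relative to the galaxy's starlight, which is why it is a convenient measure of Lyman-alpha strength that does not depend on how bright the galaxy is overall.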

This figure shows the evolution of the strength of Lyman-alpha emission in galaxies as we get closer and closer to the Big Bang. As can be seen, the strength of Lyman-alpha emission appears to be decreasing; in other words, we are missing very strong Lyman-alpha emitting galaxies as we go towards higher redshifts. This is likely a consequence of increasing neutral hydrogen, as expected from theoretical studies (Image credit: Tilvi et al 2014).
Currently, Lyman-alpha emission provides the best tool to discover and confirm very distant galaxies. While there are a few other emission lines that could be used to confirm the distance to a galaxy, they are much weaker than the Lyman-alpha emission. Despite this, we have made quite significant progress in understanding the first billion years of the Universe.

The figure below summarizes the progress astronomers have made over the past few years in understanding the transition of the Universe from a completely neutral to an ionized phase. Below a redshift of about 6, that is, about 1 billion years after the Big Bang, the Universe is almost completely full of ionized hydrogen: only one part in 10,000 is neutral. At redshifts greater than 6, the Universe becomes more and more neutral. The James Webb Space Telescope (JWST) will be instrumental in discovering galaxies within the first 600 Myrs, and will help us gain even more insight into the details of this crucial epoch.


This figure shows the evolution of neutral hydrogen fraction as a function of redshift (or age of the Universe shown on top axis). Only one part in 10,000 is neutral below redshift of about 6 which implies that the Universe is mostly ionized and the process of reionization has occurred at redshifts greater than six, where the Universe is becoming increasingly neutral (Image credit: V. Tilvi).

Thursday, November 19, 2015

Preparing Multi Object Spectroscopy Observations

Although CANDELS is a photometric survey, many team members have proposed for and been granted observing time for CANDELS sources to obtain spectroscopy. Such additional data not only provides us with a more accurate measurement of the distances of galaxies (aka redshift), but also with additional information to decode their properties, such as how many stars they are forming and how much dust is contained in the galaxies.

Figure 1: Example pointing for a MOS observation with the GMOS
instrument at the Gemini Telescope. The image in the background shows
the targeted sky area. The cyan outline shows the field of view of the
instrument with the gaps between the 3 CCD detectors. The dashed outlined
box shows the sky area in which the guide star needs to be placed. The red
"arm" shows the arm that holds the camera that monitors the guide star.

Classically, spectroscopy was carried out object by object, by placing one long slit where your one object is located. With this you restrict the area which lets light through to the detector to a narrow slit and block out everything else around it. The light that enters the prism or grism through this slit is then dispersed according to its wavelength, creating a spectrum of the object. Bright spots highlight the presence of elements that emit at this frequency/wavelength, and dark spots tell us where certain elements absorbed light and stopped it from reaching us. You can imagine though that carrying out such observations object by object is very time consuming.


In the last decades though, astronomical studies for galaxy evolution started to greatly profit from new instrumentation which allows us to observe many objects at the same time. This is not only true for taking images of the sky, but also for spectroscopic observations.

One method to take spectroscopy of many objects at the same time is grism spectroscopy, which we showed you in our post about grism spectroscopy with the Hubble Space Telescope. In that case nothing in your field of view is masked out and everything is dispersed. If your field of view is very crowded, meaning you have many many objects in your piece of sky, many spectra will overlap and will be hard to disentangle.

Figure 2: I-band image of the piece of sky to be observed with Multi Object
Spectroscopy within the mask-making software. The red outline shows the
field of view of the instrument, the blue stripes mark the gaps between the
detectors. All potential target objects are marked with different smaller
symbols according to their priority (blue triangles, green boxes, white circles
and cyan diamonds for alignment stars).
Another method is multi-object spectroscopy (MOS) via slit-masks. With this method you can take spectra for many objects at the same time by placing slits on many objects and blocking out the rest of the sky. This requires the creation of so-called MOS-masks in which the slit areas and the blocked out areas are clearly defined. This means that for every different observation you need a custom mask. Most current instruments require these masks to be prepared well in advance of the observation and to be cut out of plastic. This process isn't feasible for a space telescope, but works very well on the ground. However, times are changing. For example, for the MOSFIRE (Multi-Object Spectrometer for InfraRed Exploration) instrument at the Keck Telescope, the masks are created on the fly and "bars" that create slits are then moved into the right position within the instrument. Also for the upcoming James Webb Space Telescope a MOS unit will be available. It is designed in such a way that little shutters open and close to produce slits and masked out areas. For many other instruments however, a mask is essentially one large piece of plastic that has lots of tiny slits cut out of it. The slits are placed exactly where you want to observe an object. To create such a mask is in principle relatively simple and I illustrate the process here with a series of images.

I recently created some MOS masks for the Gemini Multi Object Spectrograph (GMOS) instrument at the Gemini Telescope to observe CANDELS galaxies and will use one of the masks I created as an example here to illustrate the process. First, we need an image of the desired piece of sky, in which the positions of the objects we want to observe are measured (Figure 2), and a list of objects, i.e. a catalogue. From that list we picked our desired targets. Often these are selected based on specific properties and limited by their brightness to ensure the maximum success with the granted observation time. Then we also need a list of stars to guide the telescope and to align the mask properly. Guide stars are used to correct for the rotation of the Earth throughout the observation so that the telescope is pointing at the same portion of the sky the entire time. You can see an example pointing in the first figure.


Figure 3: Zoom in to show the placement of slits on some targets. Objects with blue triangles have highest priority, next are objects with green boxes, and then those with white circles. The yellow vertical stripes overlaid on an object show where the slit will be placed and cut out of the mask. The horizontal white lines mark the extension of the dispersed light, i.e. the spectrum of the object. Basically, all the light that hits the disperser when it comes through the vertically extended slit, is dispersed in the horizontal direction.

Alignment stars are included on the mask to make sure all the slits are on the selected objects and not on some other piece of empty sky when the telescope operators define the pointing of the telescope. Then we take this image and list of targets and run them through the provided software for the given instrument. Usually, the original list of targets leaves room for other objects to be placed on the mask as well, so we basically work with a prioritized list of objects. The highest priority objects are "forced" onto the mask into the space left after placement of the alignment stars, to observe as many of the desired targets as possible. Then any available gaps are filled with objects of lower priority. Figure 3 shows a zoom-in on a few individual slits, and Figure 4 shows all the slits that were placed on this particular mask.
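
The prioritized placement described above can be sketched as a simple greedy procedure. To be clear, this is not the actual mask-making software, whose algorithm is not described here. It is a toy model under stated assumptions: slits are vertical, every spectrum is taken to span the full width of the detector, and so two slits conflict whenever their vertical extents overlap. The Target class, the priority convention (0 for alignment stars, 1 for the highest science priority) and the slit length are all invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    y: float                  # position along the slit direction (arbitrary units)
    priority: int             # 0 = alignment star, 1 = highest science priority
    slit_length: float = 5.0  # assumed slit length, same units as y


def overlaps(a, b):
    """Two slits conflict if their vertical extents overlap (toy assumption)."""
    return abs(a.y - b.y) < (a.slit_length + b.slit_length) / 2.0


def place_slits(targets):
    """Greedy placement: lowest priority number first, skip anything that
    collides with a slit already on the mask."""
    placed = []
    for target in sorted(targets, key=lambda t: t.priority):
        if not any(overlaps(target, other) for other in placed):
            placed.append(target)
    return placed


catalogue = [
    Target("star1", y=10.0, priority=0),
    Target("A", y=12.0, priority=1),  # too close to the alignment star
    Target("B", y=30.0, priority=1),
    Target("C", y=33.0, priority=2),  # collides with higher-priority B
    Target("D", y=50.0, priority=3),  # lower priority, but fills a free gap
]

mask = place_slits(catalogue)
print([t.name for t in mask])
```

In this toy catalogue the alignment star goes on first, target B is forced on next, and the low-priority target D fills a remaining gap, while A and C are dropped because they collide with slits already placed.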


Figure 4: The finished mask. The red outline is the field of view of the instrument, the blue vertical lines mark the gaps in the detector. Each rectangle box shows where the spectrum of that object will extend. Yellow vertical lines mark the position of the slit on the selected object. The cyan rectangle boxes mark the position of the alignment stars.

After this, the observer can manually remove objects that received a slit if he/she wants the software to pick out a different object, for example one that might be more optimally placed. Then there are usually a few iterations in which the slit placement is refined a bit more and the maximum number of objects is placed on the mask. And that's it, the mask is finished. All that is left to do is create the masks for all the pointings in the same manner and then send them off to the telescope and instrument support team for checking and approval. Once a mask is approved, all the necessary information is sent to the mask cutting team who cut the mask, meaning all the tiny slits are cut out. After the masks are cut, they are installed in the instrument, and then it's anxious waiting for the completion of our observations if they are carried out by the support astronomers at the observatory (Figure 5), or hoping for good weather if we go to the telescope ourselves to carry out the observations. 

The CANDELS fields are currently targeted by astronomers all over the world with many observational programs on instruments such as DEIMOS (on the Keck Telescope), MOSFIRE (on the Keck Telescope), GMOS (on the Gemini Telescopes, described in this post) and VIMOS (at the VLT, for example with the VIMOS UltraDeep Survey). 


Figure 5: Example observation from one of the GMOS masks. Each horizontal package of lines is the dispersed light from one slit. The bright vertical lines (a few are highlighted by the violet arrows) are emission lines caused by the night sky, meaning elements in our atmosphere emit light at certain wavelengths which are also detected and then overlap with the spectrum of the target object. The spectral traces of the target objects are highlighted by red arrows and are faint horizontal lines. In the red box, we can clearly see 2 bright dots, these are emission lines in the target object which we can use to determine its redshift and other properties. The green arrows point towards high energy cosmic rays that hit the detector and cause a detection. In order to retrieve the spectra for the target objects, astronomers have to remove the cosmic rays and subtract the spectrum of the night sky, so that ideally only the spectra of the real targets are left in the end.
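
The two clean-up steps mentioned in the caption above can be illustrated with a toy example. This is only a sketch under simple assumptions, not what a real reduction pipeline does: it assumes we have several exposures of the same slit, so that a cosmic ray (which hits only one exposure) is rejected by taking the pixel-wise median, and it assumes the night-sky lines run vertically (the same wavelength in every row), so the sky at each wavelength can be estimated as the median down each column. All pixel values are invented.

```python
from statistics import median


def combine_exposures(exposures):
    """Pixel-wise median of several 2D frames; rejects single-frame cosmic rays."""
    n_rows, n_cols = len(exposures[0]), len(exposures[0][0])
    return [[median(frame[r][c] for frame in exposures) for c in range(n_cols)]
            for r in range(n_rows)]


def subtract_sky(frame):
    """Subtract the median of each column (one wavelength) from that column."""
    n_rows, n_cols = len(frame), len(frame[0])
    sky = [median(frame[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [[frame[r][c] - sky[c] for c in range(n_cols)] for r in range(n_rows)]


def make_frame():
    """Toy slit: 5 rows (spatial) x 4 columns (wavelength)."""
    sky = [10, 50, 10, 10]               # column 1 is a bright sky line
    frame = [list(sky) for _ in range(5)]
    frame[2] = [v + 5 for v in frame[2]]  # faint target trace in row 2
    frame[2][2] += 20                     # emission line in the target
    return frame


exposures = [make_frame() for _ in range(3)]
exposures[0][0][3] += 1000                # cosmic ray in the first exposure only

cleaned = subtract_sky(combine_exposures(exposures))
for row in cleaned:
    print(row)
```

After both steps only the target row is non-zero, with the emission line standing out above the faint trace.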