At AIPL, we’ve been posting Manhattan plots of the marker effects for each breed-trait combination with each official release of our genomic predictions. For example, consider the plot of lifetime net merit for Holsteins from the April, 2010 run:
These plots are produced automatically using Python and matplotlib. The scripts run fairly quickly and they’ve proven to be fairly dependable. I recently had a request from a colleague to produce similar plots for data produced as part of an ongoing research project here in the lab. I thought that this would be a great time to try making similar plots using the very cool ggplot2 library for R.
I got 90% of the way to the result I wanted very quickly, which was great. Then I ran into problems. Fortunately, I found a straightforward solution that I will discuss below. As you can see from the plot above, the data on the x-axis consist of markers ordered by their location in the genome. Markers assigned to the same chromosome are shaded the same color, and I’m using a simple seven-color palette that repeats. I was able to produce a graphic that had the correct shading but an x-axis labeled with marker locations, not chromosome numbers. All I needed to do was replace the x-axis labels, or scale in the vocabulary used by ggplot2, with the correct labels. Fortunately, I had seen a final product similar to what I wanted over at Getting Genomics Done in the post GWAS Manhattan plots and QQ plots using ggplot2 in R.
The solution was very simple: I had to provide a list of breaks (tick mark locations) and labels to scale_x_continuous() which I then overlaid on the graph. Here are the results:
I used the aggregate() function to find the locations of the first and last marker on each chromosome. The breaks were calculated as: first SNP location + floor( ( last SNP location – first SNP location ) / 2 ), and the chromosome numbers were used as the corresponding labels. (I know that cattle have only 30 chromosomes; we use a numbering scheme internally which distinguishes between SNP in the pseudoautosomal and non-pseudoautosomal regions of the X chromosome, which is why there’s a 30 between autosome 29 and the X. The label “PAR” is too big to fit in the space available.)
Here’s the code that produced the second plot:
p_milk <- ggplot(data = yld, aes(snp_id, milk, color = chrome)) + geom_point() + scale_colour_manual(value = colours) + opts(legend.position = "none") + scale_x_continuous(name = "Chromosome", breaks = unique(yld$tick), labels = unique(yld$label)) + ylab("Marker Effect (pounds)") + ylim(0, 35) + opts(title = "Distribution of adjusted marker effects for HO milk")
Of particular interest is the line scale_x_continuous(name = “Chromosome”, breaks = unique(yld$tick), labels = unique(yld$label)), which overlays the new scale with the chromosome numbers on it. The vector yld$tick includes the breaks calculated using the formula mentioned above, and yld$label is a vector of strings used to label the breaks.
I also ran into a minor issue while trying to use scale_colour_brewer() to shade the points on each chromosome — the number of chromosomes (31) exceeded the maximum number of levels that can scale_colour_brewer() accommodate. I still wanted to use those colors, so I manually pulled the first 7 colors from the Brewer “Set1″ palette, put them in a list, and handed them to scale_colour_manual(). Worked like a charm.
library("RColorBrewer") colours <- rep(c(brewer.pal(n = 7, name = "Set1")),5)
In addition to looking at the examples and code on Getting Genomics Done I also made heavy use of Hadley Wickham’s excellent book “ggplot2: Elegant Graphics for Data Analysis”. (Note: I copied the link from the author’s website, so I suppose there could be an affiliate code in there.) The conceptual framework of building graphics in layers is both easy to understand and powerful in application. I look forward to making more use of ggplot2 in the future.
