This notebook details the retrieval and analysis of CRISPR dependency data from DepMap.
These are packages you will need for this notebook. For exact versions used, please refer to the session info at the bottom of this notebook.
library('tidyverse')
library('ggplot2')
library('RColorBrewer')
library('vroom')
I usually set a couple of directories of where to store things, or where I have located things we may need. Below, generalDatasets is just where we have stored more global data, such as a general use database. The baseWorkspace is where we will store my coding files that will usually end up getting pushed to GitHub. The last is baseRepository. This is set up the same as baseWorkspace but is not pushed to GitHub, so will store larger files (such as the data we download from GEO).
generalDatasets = 'C:/Users/chris/OneDrive/Documents/bccrc/projectsRepository/generalDatasets'
baseWorkspace = 'C:/Users/chris/OneDrive/Documents/bccrc/protocolsDryLab/relatedToPublishedData'
baseRepository = 'C:/Users/chris/OneDrive/Documents/bccrc/protocolsRepository'
The data we are interested in is from DepMap. Specifically, we want the gene dependency data from the CRISPR screens. To get this, we can go to the downloads page. On the left panel, select ‘All Downloads’. Click ‘Genetic Dependency’. Click the release you would like to get the data for, in this case we are going to go with the most recent as of the writing of this notebook, DepMap Public 20Q2. The file we want to download is ‘Achilles_gene_effect.csv’. I saved this in a named folder in my generalDatasets directory. We can now read it into R.
##########################################################################################
crispr = vroom(paste(generalDatasets, '/depmap20Q2/Achilles_gene_effect.csv', sep = ''))
Rows: 769
Columns: 18,120
Delimiter: ","
chr [ 1]: DepMap_ID
dbl [18119]: A1BG (1), A1CF (29974), A2M (2), A2ML1 (144568), A3GALT2 (127550), A4GALT (53947), A4GNT (51146), AAAS (8086), AACS (65985), AADAC (13), AADACL2 (344752), AA...
Use `spec()` to retrieve the guessed column specification
Pass a specification to the `col_types` argument to quiet this message
Rename the first column as we will need this later.
##########################################################################################
colnames(crispr)[1] = 'DepMapId'
head(crispr[,1:5])
From here, we can already do some basic processing of the data. For example, let’s imagine we have a gene we are interested in looking at the dependency values for. We can first select just this column using subsetting based on the gene name. Basically we look for the index of the column that has the gene name that we are looking for, and keep just this one.
##########################################################################################
crisprGeneSubset = crispr[,c('DepMapId',
colnames(crispr)[which(grepl('YBX1', colnames(crispr)))])]
colnames(crisprGeneSubset)[2] = 'geneOfInterest'
crisprGeneSubset$gene = 'YBX1'
head(crisprGeneSubset)
Probably the most sensible plot for these data is a boxplot, so we can make a quick one.
##########################################################################################
ggplot(crisprGeneSubset, aes(gene, geneOfInterest)) +
geom_boxplot(width = 0.25, size = 1) +
labs(x = '', y = 'CERES Dependency Score', title = 'YBX1 gene dependency') +
scale_y_continuous(limits = c(-2,0), breaks = seq(-2,0,0.5)) +
geom_hline(yintercept = -1, linetype = 'dashed') +
theme_classic()
This doesn’t really tell us much other than that our gene is pretty close to the median score for ‘essential’ genes (= -1). If we add some annotation it may be more informative. To get the annotation info, we can go back to the downloads page, click ‘All Downloads’, and ‘Cellular Models’, and ‘DepMap Public 20Q2’. The file we are interested in is called ‘sample_info.csv’. Download this file to the same directory as the CRISPR data from before. Read this data in to R.
##########################################################################################
cellAnno = read_csv(paste(generalDatasets, '/depmap20Q2/sample_info.csv', sep = ''))
Parsed with column specification:
cols(
.default = col_character(),
COSMICID = col_double(),
Achilles_n_replicates = col_double(),
cell_line_NNMD = col_double(),
age = col_double(),
depmap_public_comments = col_logical()
)
See spec(...) for full column specifications.
11 parsing failures.
row col expected actual file
1105 depmap_public_comments 1/0/T/F/TRUE/FALSE Adherent version of CCLFPEDS0001T 'C:/Users/chris/OneDrive/Documents/bccrc/projectsRepository/generalDatasets/depmap20Q2/sample_info.csv'
1166 depmap_public_comments 1/0/T/F/TRUE/FALSE SV40+TERT immortalized kidney cell line 'C:/Users/chris/OneDrive/Documents/bccrc/projectsRepository/generalDatasets/depmap20Q2/sample_info.csv'
1454 depmap_public_comments 1/0/T/F/TRUE/FALSE Drug resistance: Dabrafenib and Trametinib 'C:/Users/chris/OneDrive/Documents/bccrc/projectsRepository/generalDatasets/depmap20Q2/sample_info.csv'
1455 depmap_public_comments 1/0/T/F/TRUE/FALSE Drug resistance: SCH772984 'C:/Users/chris/OneDrive/Documents/bccrc/projectsRepository/generalDatasets/depmap20Q2/sample_info.csv'
1456 depmap_public_comments 1/0/T/F/TRUE/FALSE Drug resistance: Dabrafenib and Trametinib 'C:/Users/chris/OneDrive/Documents/bccrc/projectsRepository/generalDatasets/depmap20Q2/sample_info.csv'
.... ...................... .................. .......................................... .......................................................................................................
See problems(...) for more details.
We can see this file contains a good amount of annotation information.
##########################################################################################
colnames(cellAnno)[1] = 'DepMapId'
head(cellAnno)
Since our CRISPR data is already in a suitable format, we can add this annotation data directly to that data frame.
##########################################################################################
crisprAnnotated = crispr %>%
left_join(cellAnno)
Joining, by = "DepMapId"
Now we can look at our gene of interest again, but this time add some lineage details. First, reshape it as we did before.
##########################################################################################
crisprGeneSubset = crisprAnnotated[,c('DepMapId', 'lineage', 'lineage_subtype',
colnames(crisprAnnotated)[which(grepl('YBX1', colnames(crisprAnnotated)))])]
colnames(crisprGeneSubset)[4] = 'geneOfInterest'
head(crisprGeneSubset)
We are almost ready to make a plot. We should first look at the lineages and discard any that have too few entries.
##########################################################################################
table(crisprGeneSubset$lineage)
bile_duct blood bone breast central_nervous_system
29 41 28 34 60
cervix colorectal esophagus eye fibroblast
12 37 25 4 1
gastric kidney liver lung lymphocyte
26 23 22 106 21
ovary pancreas peripheral_nervous_system plasma_cell prostate
42 34 19 20 1
skin soft_tissue thyroid upper_aerodigestive urinary_tract
54 36 6 33 29
uterus
26
So it looks like we probably want to get rid of a few of these lineages, like fibroblast that only has an n of 1. We will keep any lineage with an n greater than 5.
##########################################################################################
crisprGeneSubset2 = subset(crisprGeneSubset, crisprGeneSubset$lineage %in% names(which(table(crisprGeneSubset$lineage) > 5)))
Now we can go ahead and make a plot. The difference here is that we will use the lineage for grouping.
This plot is kind of hard to see anything, so we can make some graphical adjustments.
##########################################################################################
ggplot(crisprGeneSubset2, aes(reorder(lineage, geneOfInterest, FUN = median), geneOfInterest, color = lineage)) +
geom_boxplot(width = 0.25, size = 1) +
labs(x = '', y = 'CERES Dependency Score', title = 'YBX1 gene dependency') +
scale_y_continuous(limits = c(-2,0), breaks = seq(-2,0,0.5)) +
geom_hline(yintercept = -1, linetype = 'dashed') +
theme_classic() +
theme(legend.position = 'none', axis.text.x = element_text(size = 9, angle = 45, hjust = 1))
We can also specify the colors to change, if we are interested in the specific lineage, for example, bone.
##########################################################################################
boxColors = setNames(c(rep(brewer.pal(8,'Greys')[4], length(unique(crisprGeneSubset2$lineage)) - 1), brewer.pal(8,'Spectral')[1]),
c(unique(crisprGeneSubset2$lineage)[!unique(crisprGeneSubset2$lineage) %in% 'bone'], 'bone'))
##
ggplot(crisprGeneSubset2, aes(reorder(lineage, geneOfInterest, FUN = median), geneOfInterest, color = lineage)) +
geom_boxplot(width = 0.25, size = 1) +
scale_color_manual(values = boxColors) +
labs(x = '', y = 'CERES Dependency Score', title = 'YBX1 gene dependency') +
scale_y_continuous(limits = c(-2,0), breaks = seq(-2,0,0.5)) +
geom_hline(yintercept = -1, linetype = 'dashed') +
theme_classic() +
theme(legend.position = 'none', axis.text.x = element_text(size = 9, angle = 45, hjust = 1))
This is the end of this notebook. As a final command, we will run sessionInfo to ensure others who look at our work know exactly how our environment was set up.
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
[5] LC_TIME=English_Canada.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] vroom_1.2.1 RColorBrewer_1.1-2 forcats_0.5.0 stringr_1.4.0 dplyr_0.8.5 purrr_0.3.4 readr_1.3.1
[8] tidyr_1.0.3 tibble_3.0.1 ggplot2_3.3.1 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] tidyselect_1.1.0 xfun_0.14 haven_2.2.0 lattice_0.20-38 colorspace_1.4-1 vctrs_0.3.0 generics_0.0.2 htmltools_0.4.0
[9] base64enc_0.1-3 yaml_2.2.1 utf8_1.1.4 blob_1.2.1 rlang_0.4.6 pillar_1.4.4 glue_1.4.1 withr_2.2.0
[17] DBI_1.1.0 bit64_0.9-7 dbplyr_1.4.4 modelr_0.1.8 readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0
[25] cellranger_1.1.0 rvest_0.3.5 evaluate_0.14 labeling_0.3 knitr_1.28 parallel_3.6.3 fansi_0.4.1 broom_0.5.6
[33] Rcpp_1.0.4.6 scales_1.1.1 backports_1.1.7 jsonlite_1.6.1 farver_2.0.3 fs_1.4.1 bit_1.1-15.2 digest_0.6.25
[41] hms_0.5.3 stringi_1.4.6 grid_3.6.3 cli_2.0.2 tools_3.6.3 magrittr_1.5 crayon_1.3.4 pkgconfig_2.0.3
[49] ellipsis_0.3.1 xml2_1.3.2 reprex_0.3.0 lubridate_1.7.8 assertthat_0.2.1 rmarkdown_2.1 httr_1.4.1 rstudioapi_0.11
[57] R6_2.4.1 nlme_3.1-144 compiler_3.6.3