cis-eQTL analysis with DAP: a step-by-step guide
This page provides a step-by-step guide to analyzing cis-eQTL data (from a single tissue) using the software package DAP, which enables highly efficient Bayesian multi-SNP analysis. Among its many features, DAP can incorporate genomic annotations into eQTL mapping.
Here we walk through the analysis of a subset of the GEUVADIS data that contains genotype-expression data for 92 Toscani (TSI) samples. This data set contains expression data for 11,837 protein-coding and lincRNA genes measured in LCLs. The candidate SNPs for cis-eQTL mapping are those within a 100kb radius of the transcription start site (TSS) of each gene.
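The candidate-SNP rule above can be sketched as a simple position filter. This is a minimal illustration and not part of DAP: the function name, the tuple layout, and the choice to treat the 100kb boundary as inclusive are all assumptions.

```python
CIS_RADIUS = 100_000  # 100kb radius around the TSS, as stated above


def cis_candidates(snps, gene_chrom, tss, radius=CIS_RADIUS):
    """Return IDs of SNPs on the gene's chromosome within `radius` bp of the TSS.

    `snps` is an iterable of (snp_id, chrom, position) tuples (hypothetical layout).
    The boundary is treated as inclusive, which is an assumption.
    """
    return [
        snp_id
        for snp_id, chrom, pos in snps
        if chrom == gene_chrom and abs(pos - tss) <= radius
    ]
```

The downloadable data files used below are described as already formatted, so a filter like this would only be needed when preparing a new data set.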
The analysis described here is performed on a single multi-core Linux machine and makes use of multi-thread processing. It can also be adapted to run in a cluster environment.
Step 1: software installation
The following binary executables are required by the analysis
- dap-g: the newest implementation of the DAP algorithm for multi-SNP fine-mapping
- torus: for prior specification
The following utility is recommended
- openmp_wrapper: for automatic multi-thread processing
Download and compile the source code from the above URLs and make the binary executables accessible to the analysis.
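Whether the executables are accessible can be checked before starting. The sketch below is a hedged example: the tool names are taken from the commands used later on this page (the wrapper is assumed to be invoked as openmp_wrapper), and the helper name is hypothetical.

```python
import shutil

# Executable names as they appear in the commands on this page (assumed).
REQUIRED_TOOLS = ("dap-g", "torus", "openmp_wrapper")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of `tools` that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from missing_tools() suggests that all three binaries can be found on the PATH.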
Step 2: data preparation
After standard QC and pre-processing steps, the genotype-phenotype information for each gene should be organized into a single text file. The file format is explained here. The formatted genotype-phenotype data files for the 11,837 genes in the GEUVADIS TSI samples, geuv.tsi.eqtl.sbams.tgz, can be downloaded from here.
The illustrated analysis also utilizes SNP and gene position information. The SNP position file, geuv.snp.map.gz, can be downloaded from here. The gene position file, geuv.gene.map.gz, can be downloaded from here.
Step 3: set up working directory
- create a working directory workspace; this directory is assumed to be the current working directory (cwd) from this point on
- create an empty directory workspace/sbams_data/ and move the downloaded data file into this directory
- unpack the data file (and return to workspace afterwards):
  cd sbams_data; tar zxf geuv.tsi.eqtl.sbams.tgz
In the end, there should be 11,837 data files named "gene_name.dat" unpacked in the sbams_data directory.
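The unpacking step can be verified by counting the data files. This is a minimal sketch, assuming the archive unpacks the .dat files directly into sbams_data rather than into a subdirectory; the helper name is hypothetical.

```python
from pathlib import Path

EXPECTED_GENES = 11_837  # number of genes stated for the GEUVADIS TSI data set


def count_dat_files(data_dir="sbams_data"):
    """Count the per-gene *.dat files directly inside `data_dir`."""
    return sum(1 for _ in Path(data_dir).glob("*.dat"))
```

Run from workspace, count_dat_files() should equal EXPECTED_GENES if the unpacking was complete.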
Step 4: prior estimation
Before the multi-SNP mapping, we first set the prior for each candidate cis-SNP. In cis-eQTL mapping, it is well known that candidate SNPs located closer to the transcription start site (TSS) are more likely to be associated with the expression level of the target gene. We use this information to assign different priors to different SNPs according to their distance to TSS (DTSS). In particular, we treat the genomic feature DTSS as a categorical annotation and use the executable torus for prior estimation. The statistical procedure is described in Wen 2016.
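To make the idea of a categorical DTSS annotation concrete, the sketch below maps a signed distance to a bin label. It only illustrates the concept: the bin edges are hypothetical, gene strand is ignored, and the binning that torus applies internally is not described on this page.

```python
from bisect import bisect_right

# Hypothetical bin edges in base pairs; torus chooses its own binning internally.
BIN_EDGES = [-50_000, -10_000, 0, 10_000, 50_000]


def dtss_category(snp_pos, tss, edges=BIN_EDGES):
    """Map the signed distance (snp_pos - tss) to an integer bin label.

    Label 0 covers distances below the first edge; label len(edges)
    covers distances at or above the last edge.
    """
    return bisect_right(edges, snp_pos - tss)
```

Each SNP of a gene then carries one discrete label, and a prior can be estimated per label instead of per SNP.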
Step 4.1: obtain single-SNP association test statistics
If the eQTL data have already been analyzed by either MatrixEQTL or fastQTL, the output from either software can be fed into torus for prior estimation.
Alternatively, single-SNP Bayes factors can be computed by dap-g using the following command
dap-g -d gene_name.dat -scan > gene_name.bf
where gene_name.dat is a single sbams format genotype-expression file.
For batch processing,
- download batch_scan.pl from the repo into the workspace directory
- create directory workspace/scan_out
- run
  perl batch_scan.pl > batch_scan.cmd
- batch processing by
  openmp_wrapper -d batch_scan.cmd -t 8
  where "-t 8" specifies that 8 parallel threads are requested
- upon completion, combine the data by
  cat scan_out/*.bf | gzip - > geuv.tsi.bf.gz
The output geuv.tsi.bf.gz should appear in workspace upon completion.
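The contents of batch_scan.pl are not shown on this page. As a rough Python equivalent, the sketch below writes one single-SNP scan command per gene, assuming the command file simply holds one shell command per line and that outputs go to scan_out/gene_name.bf; both function names are hypothetical.

```python
from pathlib import Path


def scan_commands(data_dir="sbams_data", out_dir="scan_out"):
    """Build one `dap-g ... -scan` command line per *.dat file in `data_dir`."""
    commands = []
    for dat in sorted(Path(data_dir).glob("*.dat")):
        bf = Path(out_dir) / f"{dat.stem}.bf"
        commands.append(f"dap-g -d {dat} -scan > {bf}")
    return commands


def write_command_file(commands, path="batch_scan.cmd"):
    """Write the commands one per line (the layout assumed for openmp_wrapper -d)."""
    Path(path).write_text("".join(cmd + "\n" for cmd in commands))
```

Calling write_command_file(scan_commands()) from workspace would produce a batch_scan.cmd in the assumed format.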
Step 4.2: run torus
To obtain the priors, issue the following command from the workspace directory
torus -d geuv.tsi.bf.gz -smap geuv.snp.map.gz -gmap geuv.gene.map.gz --load_bf -dump_prior priors
In the end, 11,837 prior files should be output into the newly created directory workspace/priors. It is important to emphasize that the priors should be estimated for all genes (instead of just a few selected ones) for the necessary statistical rigor.
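A quick consistency check is to confirm that every gene with a data file also received a prior file. The sketch below assumes the prior files are named gene_name.prior, matching the file name used in the fine-mapping command of Step 5; the helper name is hypothetical.

```python
from pathlib import Path


def genes_missing_priors(data_dir="sbams_data", prior_dir="priors"):
    """Return sorted gene names that have a *.dat file but no *.prior file."""
    genes = {p.stem for p in Path(data_dir).glob("*.dat")}
    with_prior = {p.stem for p in Path(prior_dir).glob("*.prior")}
    return sorted(genes - with_prior)
```

An empty list indicates that torus produced a prior for every gene in sbams_data.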
Step 5: multi-SNP fine-mapping
With the prior and genotype-phenotype files ready, the multi-SNP cis-eQTL mapping can be performed with the following command
dap-g -d gene_name.dat -p gene_name.prior -t 4 -ld_control 0.25 --no_size_limit > gene_name.fm.out
The command line options are explained below
- -d gene_name.dat: specify the sbams format genotype-phenotype data
- -p gene_name.prior: specify the prior file for the corresponding gene generated by torus
- -t 4: run dap-g with 4 parallel threads
- -ld_control 0.25: lower r^2 threshold for dap-g to consider multiple SNPs (in LD) responsible for a single association signal (if not specified, the threshold is 0)
- --no_size_limit: do not restrict the number of SNPs within a single association signal cluster (the default limit is 25)
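When dap-g is driven from a script, the options listed above can be assembled programmatically. The following is a minimal sketch using only the flags documented on this page; the function names and default values are illustrative choices, not part of dap-g.

```python
import subprocess


def dap_command(data_file, prior_file, threads=4, ld_control=0.25, no_size_limit=True):
    """Return the dap-g argument list for one gene, using the options listed above."""
    args = ["dap-g", "-d", str(data_file), "-p", str(prior_file),
            "-t", str(threads), "-ld_control", str(ld_control)]
    if no_size_limit:
        args.append("--no_size_limit")
    return args


def run_dap(data_file, prior_file, out_file, **options):
    """Run dap-g for one gene and write its standard output to `out_file`."""
    with open(out_file, "w") as out:
        subprocess.run(dap_command(data_file, prior_file, **options),
                       stdout=out, check=True)
```

For example, run_dap("gene_name.dat", "gene_name.prior", "gene_name.fm.out") corresponds to the command shown above, provided dap-g is on the PATH.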
For batch processing,
- download batch_dap.pl from the repo into the workspace directory
- create directory workspace/dap_out
- run
  perl batch_dap.pl > batch_dap.cmd
- batch processing by
  openmp_wrapper -d batch_dap.cmd -t 4
  where "-t 4" specifies that 4 parallel threads are requested
The result files are output to the directory dap_out.
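After a long batch run it can be useful to list the genes whose results are missing or empty, so that only those are re-run. The sketch below assumes the result files are named gene_name.fm.out inside dap_out, following the single-gene command above; the naming used by batch_dap.pl is not shown on this page and may differ.

```python
from pathlib import Path


def unfinished_genes(data_dir="sbams_data", out_dir="dap_out", suffix=".fm.out"):
    """Return sorted gene names whose result file is missing or empty."""
    unfinished = []
    for dat in sorted(Path(data_dir).glob("*.dat")):
        result = Path(out_dir) / f"{dat.stem}{suffix}"
        if not result.is_file() or result.stat().st_size == 0:
            unfinished.append(dat.stem)
    return unfinished
```

An empty file is treated as unfinished because shell redirection creates the output file before dap-g writes anything to it.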
Step 6: Post-analysis utilities
Utilities for summarizing and visualizing fine-mapping analysis results, along with their demonstrations, are provided in this repo. The utilities include
- Signal cluster visualization: plot_dap