Introduction and References
Install and run

Quick examples
Options FAQ

PRVCS: Predicting regulatory variant with composite statistic


Introduction


Interpreting human regulatory variants affecting gene regulation is critical to understand phenotypic diversity and promote personalized medicine. PRVCS is designed to predict and prioritize the regulatory variants by integrating existing prediction algorithms (like CADD, GWAVA, FunSeq, etc.) to estimate the composite likelihood of regulatory potential given tools statistics.

References

Mulin Jun Li, Zhicheng Pan, Zipeng Liu, Jiexing Wu, Panwen Wang, Yun Zhu, Feng Xu, Zhengyuan Xia, Pak Chung Sham, Jean-Pierre A. Kocher, Miaoxin Li, Jun S. Liu, Junwen Wang. Predicting regulatory variants with composite statistic. Bioinformatics 2016 June. 10.1093/bioinformatics/btw288

Install and run

System requirement

Java Runtime Environment (JRE) version 6.0 or above is required for PRVCS. It can be downloaded from the Java web site. Installing the JRE is very easy in Windows OS and Mac OS X.

In Linux, you have more work to do. Details of the installation can be found http://www.java.com/en/download/help/linux_install.xml.

In Ubuntu, if you have an error message like: "Exception in thread "AWT-EventQueue-0" java.awt.HeadlessException ... " , please install the Sun Java Running Environment (JRE) first.

To install the Sun JRE on Ubuntu(10.04), please use the following commands:
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-plugin sun-java6-fonts
Detailed explanation of above commands can be found at http://www.ubuntugeek.com/how-install-sun-java-runtime-environment-jre-in-ubuntu-10-04-lucid-lynx.html.


For Mac OS, the JRE 1.6 has been available at http://developer.apple.com/java/download/ since April 2008. Mac OS users may need update the Java application to run PRVCS. A potential problem is that this update does not replace the existing installation of J2SE 5.0 or change the default version of Java. Similar to the Linux OS, the Java_Home environmental variable has to be configured to initiate PRVCS.

Download: latest Java version (Fast, now integrate into KGGSeq) | latest Perl version (Slow but no need to download big annotation, <10k variants)

20170313 Update Note  We have released a lightweight database called "database of non-coding SNV functional prediction score (dbNCFP)", which integrates 12 scores form 8 different algorithms for all possible SNVs in human. The database is compiled using modified ANN format (Allele | CADD CScore | CADD PHRED | fathmm-MKL non coding score | fathmm-MKL coding score | DANN score | FunSeq2 score | GWAS3D score | FunSeq score | GWAVA unmatched score | GWAVA tss score | GWAVA matched score | SuRFR score). To download the database, please visit our ftp site: ftp://147.8.193.36/PRVCS/dbNCFP

20160621 Update Note  We have integrated our JAVA version PRCVS into KGGSeq package, now it supports whole genome scroing in an efficient manner. Please refer to the manual for package configuration and running parameters (--regulatory-causing-predict). The Perl version of PRVCS is still supported for easy execution of small list of SNVs and reseacher who do not want to download big program-dependent data (Please refer to Perl version manual).

20160520 Update Note  CADD score had been updated to v1.3 in PERL version of PRVCS. If you want to validate our benchmark result in Bioinforamtics paper, please use JAVA version of PRVCS (PRVCS_v1.1,zip for CADD v1.1).

First release  Please download PRVCS v1.1 package from PRVCS_v1.1.zip (2.2 GB) (JAVA program supports MS Windows / Linux / Mac OS X outdated) or PRVCS_PL package from PRVCS_PL_v0.11.zip (27 MB) (Perl script only supports Linux / Mac OS X now). Note£º Since our compiled score file is very big, we currently only supported 1000 Genomes Project biallelic variants in the JAVA program. For scores of other variants and possible alleles, we also provided a Tabix Perl wrapper script to facilitate the random access remotely. Compiled prediction scores of eight algorithms for around 8.6 billion possible single nucleotide substitutions in the human reference genome (GRCh37).
    The compiled dataset: dbNCFP_whole_genome_SNVs.bgz (218 GB)
    Tabix index file: dbNCFP_whole_genome_SNVs.bgz.tbi
    The file header file: HEADER.txt

Package pre-requirement

Since PRVCS Java program now only supports 1000 Genomes project bi-allelic variants, if you will use PRVCS_PL perl script to score whole genome all possible substitution SNVs, you have to install Tabix and pass the executable file to script. To install Tabix, please follow this site.

Installing PRVCS

Simply decompress the archive and run the following command to execute PRVCS!

  java -Xms256m -Xmx1300m -jar ./PRVCS.jar <arguments>

The arguments -Xms256m and -Xmx1300m set the initial and maximum Java heap sizes for PRVCS as 256 megabytes and 1.3 gigabytes respectively. Specifying a larger maximum heap size can speed up the analysis. A higher setting like -Xmx2g or even-Xmx5g is required when there is a large number of variants, say 5 million. The number, however, should be less than the size of physical memory of a machine.

Note  <arguments >can be saved in a flat text file.


Note  To use PRVCS under Proxy Network, you need specifically use the following java command to configure the proxy settings:
java -Dhttp.proxyHost=xxx.xxx.xxx -Dhttp.proxyPort=xxxx -Dhttp.proxyUser=xxx -Dhttp.proxyPassword=xxx -jar "./PRVCS.jar"

Quick example

Variants file is needed (example.vcf). This example use VCF variant format without genotype, and other formats are also supported!

Note  All files were included in the examples folder of PRVCS.

1. Run the PRVCS by passing options:

java -jar PRVCS.jar --vcf-file examples/example.vcf --db-score dbncfp --regulatory-causing-predict all

2. Run the PRVCS using parameter file:

java -jar PRVCS.jar examples/params.example.txt

We now walk through the parameter file ¡°params.composite.example.txt¡± before going into the results. Lines starting with hash sign # are comments. Detailed interpretation for each argument in the parameter file is included in 'Options' part.

#one argument per line
#I.Environmental setting
#--no-log \

#II. Specify the input files
--vcf-file examples/example.vcf \

#III. Output setting
--out example_result \

#IV. Filtering and Prioritization
--db-score dbncfp \
--regulatory-causing-predict 1,3,4,5,6,8,10,11 \

Part (I): Specify general environmental setting

Arguments in this part are used to set general PRVCS running environment, including resource path and program update

Part (II): Specify the input files

Arguments in this part are used to specify the input files which support various data format, and they are compulsory for running PRVCS.

Part (III): Output setting

Arguments in this part are used to set the output file name and type, which can be produced simultaneously by PRVCS.

Part (IV): Prioritization

Arguments in this part are used to apply a set of models (selected regulatory variant scores, etc.) to enable better prioritization of the regulatory variant.

Notes Most of the above arguments are optional, so user can mask some lines by ¡°#¡± or delete the lines. Under this circumstance, user can have a systematic view of the impact for each level or even steps. And it will be easier to produce this parameter file by PRVCS command generator we provide.

3. Run the PRVCS_PL for whole genome all possible SNVs scoring:

perl PRVCS_PL.pl -i examples/example.vcf -f vcf -t /user/bin/tabix -r http://147.8.193.36/ftp/PRVCS/v1.1/dbNCFP_whole_genome_SNVs.bgz -s 1,3,4,5,6,8,10,11 -o prvcs1.flt.txt


Options

Environmental settings

Specify the local path to store the resource datasets java -jar PRVCS.jar --resource path/to/resource/files

By default, it is --resource "./"

Disable/enable the log information in the screen. java -jar PRVCS.jar --no-log

By default, PRVCS will display the log information in the command line.

Input formats

Variants file


NOTE  Currently, PRVCS only supports hg19 coordinates. PRVCS compiled an integrative database for eight latest algorithms in the noncoding variants functional prediction field, including CADD, DANN, fathmm-MKL, FunSeq, FunSeq2, GWAS3D, GWAVA and SuRFR. To make a universal format for all collected algorithms, PRVCS now only support all 1000 Genomes Project phase 1 biallelic variants.

VCF format
java -jar PRVCS.jar --vcf-file path/to/file1

ANNOVAR format java -jar PRVCS.jar --annovar-file path/to/file NOTE PRVCS can recognize an extended ANNOVAR format in which a head row and and multiple columns for comments are allowed.

Example:
chr startpos endpos ref alt comment1 comment2 comment3
1 69428 69428 T G T 92 129
1 69476 69476 T C T 1 0

Outputs

PRVCS can flexibly output different formats of the prioritization and annotation results for either final validation or further analysis by third-party tools.

Output file path and file name prefix: --outjava -jar PRVCS.jar --vcf-file path/to/file1 --out path/to/prefixname Specify path and prefix name of outputs. It is "./PRVCS" by default.

Odinary outputs

Text format: This is by defaultjava -jar PRVCS.jar --vcf-file path/to/file1 By default, PRVCS output results in TEXT format in a file named prvcs1.flt.txt, in which the fields (or columns) are delimited by the tabs.

Produce VCF input: --o-vcf java -jar PRVCS.jar --vcf-file path/to/file1 --o-vcf Generate an extra copy of output for the prioritized variants in VCF format, which can be further analyzed by other tools.

Produce ANNOVAR input: --o-annovar java -jar PRVCS.jar --vcf-file path/to/file1 --o-annovar Generate an extra copy of output for the prioritized variants in ANNOVAR input format, which can be further annotated by ANNOVAR.

Prediction and Prioritization

One attractive feature of PRVCS is that it combines the functional impact scores from multiple algorithms (i.e., CADD and GWAVA) to predict whether variants are regulatory or not.

Predicting noncoding regulatory variants based on composite strategy


NotePRVCS compiles an integrative database for several latest algorithms in the noncoding variants functional prediction field, including CADD, DANN, fathmm-MKL, FunSeq, FunSeq2, GWAS3D, GWAVA and SuRFR. Here named "dbncfp" database.

Predict regulatory variants: --db-score dbncfp --regulatory-causing-predict
java -jar PRVCS.jar --vcf-file path/to/file1 --db-score dbncfp --regulatory-causing-predict 1,3,4,5,6,8,10,11 Use at most 8 existing algorithms for 11 available functional impact scores (listed below) to RE-predict whether a SNV will potentially be regulatory. By default, PRVCS uses all algorithms (8 scores) for a combinatorial prediction.

On the other hand, one can FIX the prediction using any specified subset (>2) or full set (all) of the impact scores by option like --regulatory-causing-predict 1,6,8,10
The coding for the functional impact scores used in'--regulatory-causing-predict' options is listed below:

Coding Method Description
1 CADD_CScore "Raw" CADD scores come straight from the CADD model, and are interpretable as the extent to which the annotation profile for a given variant suggests that that variant is likely to be "observed" (negative values) vs "simulated" (positive values). These values have no absolute unit of meaning and are incomparable across distinct annotation combinations, training sets, or model parameters. However, raw values do have relative meaning, with higher values indicating that a variant is more likely to be simulated (or "not observed") and therefore more likely to have deleterious effects.
2 CADD_PHRED Since the CScores do have relative meaning, one can take a specific group of variants, define the rank for each variant within that group, and then use that value as a "normalized" and now externally comparable unit of analysis. CADD scored and ranked all ~8.6 billion SNVs of the GRCh37/hg19 reference and then "PHRED-scaled" those values by expressing the rank in order of magnitude terms rather than the precise rank itself.
3 DANN DANN uses the same feature set and training data as CADD to train a deep neural network (DNN). DNNs can capture nonlinear relationships among features and are better suited than SVMs for problems with a large number of samples and features.
4 FunSeq FunSeq filters mutations overlapping 1000 Genomes variants and then prioritizes those in regions under strong selection (sensitive and ultrasensitive), breaking TF motifs, and those associated with hubs. It can score the deleterious potential of variants in single or multiple genomes. The scores for each noncoding variant vary from 0 to 6, with 6 corresponding to maximum deleterious effect. When multiple tumor genomes are given as input, FunSeq also identifies recurrent mutations in the same element.
5 FunSeq2 FunSeq2 is originally to annotate and prioritize somatic alterations integrating various resources from genomic and cancer studies. The framework consists of two components: (1) data context from uniformly processing large-scale datasets; and (2) a high-throughput variant prioritization pipeline. FunSeq2 can also be used to prioritize noncoding genetic variants.
6 GWAS3D GWAS3D systematically assesses the genetic variants that could affect regulatory elements, by integrating annotations from cell type-specific chromatin states, epigenetic modifications, sequence motifs and cross-species conservation. It combines the original GWAS signal, risk haplotype, binding affinity significance and conservation information to prioritize the leading variants, and infer the putative causal variant in the LD of leading variant.
7 GWAVA_Region GWAVA uses the random forest algorithm to build three classifiers using all available annotations to discriminate between the disease variants and variants from each of the three control sets. This control set first was composed of all 1KG variants in the 1 kb surrounding each of the HGMD variants.
8 GWAVA_TSS GWAVA uses the random forest algorithm to build three classifiers using all available annotations to discriminate between the disease variants and variants from each of the three control sets. This control set first was matched for distance to the nearest TSS genome-wide.
9 GWAVA_Unmatched GWAVA uses the random forest algorithm to build three classifiers using all available annotations to discriminate between the disease variants and variants from each of the three control sets. This control set first was constructed from a random selection of SNVs from across the genome in order to sample overall background.
10 SuRFR SuRFR integrates functional annotation and prior biological knowledge to prioritise candidate functional variants by regression model. It introduces novel training and validation datasets that i) capture the regional heterogeneity of genomic annotation better than previously applied approaches, and ii) facilitate understanding of which annotations are most important for discriminating different classes of functionally relevant variants from background variants.
11 FATHMM-MKL FATHMM-MKL uses MKL classifier to predict the functional consequences of both coding and non-coding sequence variants from various genomic annotations and weights the significance of each component annotation source.
Note Predictions at variants with missing scores at specified methods will use population mean of corresponding method!

The option will append the corresponding score of selected prediction methods to the output file, as well as the bayes factor (BF) and composite probability (Composite_P):

PRVCS_PL whole genome all possible SNVs scoring

Command line:perl PRVCS_PL.pl -i query.vcf -f vcf -t /user/bin/tabix -r http://147.8.193.36/ftp/PRVCS/v1.1/dbNCFP_whole_genome_SNVs.bgz -s 1,3,4,5,6,8,10,11 -o prvcs1.flt.txt By default, PRVCS use Tabix to visit remote compiled dataset for random access. You have to install Tabix and assign the progaram path to script by "-t". You can also download the whole genome score file to local, and use "-r" to assign the file path.

Options: PRVCS_PL.pl -- Perl version of PRVCS for scoring allele-specific composite statistics given VCF or ANNOVAR variant input file -h output help information to screen -i the input variants file -f the format of input file, supporting VCF and ANNOVAR format; default: vcf -t the path of executable tabix program; default: /user/bin/tabix -r the path of dbNCFP reference database; default: dbNCFP_whole_genome_SNVs.bgz -s the selected tool scores; default: 1,3,4,5,6,8,10,11 1 CADD_cscore 2 CADD_PHRED 3 DANN_score 4 FunSeq_score 5 FunSeq2_score 6 GWAS3D_score 7 GWAVA_region_score 8 GWAVA_TSS_score 9 GWAVA_unmatched_score 10 SuRFR_score 11 Fathmm_MKL_score -p the casual distribution folder; default: resources/all_causal_distribution -n the neutral distribution folder; default: resources/all_neutral_distribution -o the path of output file; default: prvcs1.flt.txt


FAQ?

1) Why PRVCS does not read my VCF file?

If you use standard VCF output from GATK pipeline, it usually contains variants on mitochondrial DNA. However, mitochondrial DNA is not annotated by gene feature database. Therefore PRVCS currently only accept VCF file excluding variants on ChrM.

2) Whether PRVCS supports rare variants or somatic variants?

PRVCS Java program supports the genetic variants from 1000 Genomes Project phase 1 since we currently prefer to make all collected eight methods work well (no missing values). For scores of other variants and possible alleles, we also provided a Tabix Perl wrapper script to facilitate the random access remotely.

3) Can I use ANNOVAR and PRVCS together on my dataset?

PRVCS is quite flexible for interacting with other sequence-oriented analytical programs/software for downstream annotation (including ANNOVAR, etc). It can accept various input formats, and output different kinds of sequence data. In case of ANNOVAR, PRVCS can read ANNOVAR-formatted sequence variants, and write the final remaining variants in ANNOVAR format.

4) Can I run PRVCS on my laptop? Is it time consuming to run a complete PRVCS process?

Normally, PRVCS run well and fast with >=1 GB RAM memory (For small input dataset). Hence current laptop are certainty affordable for running PRVCS. The program takes ~0.5hr/CPU and 10GB RAM to score all 1000 Genomes Project variants.

5) How do I report an error or bugs to PRVCS?

You are welcomed to write an email to mulin0424.li@gmail.com.