Fast Two Stage P-value Computation Software

##
*Introduction*

** FastPval ** is a two stage p-value computation software
which compute the empirical p-value by two stage ranking strategy,
it can produce very low P-value based on huge dataset. This fast and
powerful tool takes advantage of a delicate cutoff which separate
the exactly significant area. Compared to the traditional ranking
method, Tspvc has a good time efficiency, lower memory consuming and
tiny storage spaces with high accuracy.

Traditionally, exact P-value computation based on permutation is needed for sorting the dataset before ranking. For huge dataset(more than 1 billion values), the time complexity of sorting time will be long(nlogn). Furthermore, surprisingly, the space complexity will be very large(1 billion dataset will take ~4G memory).

We adopt this two stage strategy in order to reduce the computation time and overcome memory restriction using the exact P-value computation method especially in the huge dataset. At first stage, we make a random sampling array

**(which means the first model)**with given sampling number (up to the volume of dataset) and sort it. Then we select the satisfied array

**(which means the second model)**from the dataset with the cutoff number which fetched form the first sampling array. So, directly, we can compute the approximate P-value with those two model.

The core procedure is implemented by C. We also offer a Java-based GUI by encapsulating the C package.

##
*Features*

- Is flexible enough to produce model files and quick to compute P-value;
- Is efficient enough to store two approximate models for reduplicated P-value computation without loading all raw dataset;
- Allow visible GUI interface for the functions of model generating, two stage P-value computation and exact P-value computation using traditional method;
- Is compact enough for model size without loading the whole dataset into memory;

##
*Data Format*

**The dataset and test sample should follow the
format: dataset example test sample **

1.323204 2.323434 34.767567 0.565465 -2.432545 5.675477

*No space, new line, tab, etc. before the first data, each data should follow a '\n' at the end of line.*

##
*Usage*

**There are three ways to launch the software:**

*Please install the latest sun JRE
in your local machine firstly.*

1. Running from Java Web Start by clicking Launch the application
**32bit** /
**64bit** .

2. You can download
**32bit** /
**64bit** the
package and run using following commands:

java -cp [the path of pvalue.jar] pvalue.MainGUI

3. Run the procedure by command line:

1). Quick start:

java -cp <pvalue jar file path> pvalue.Pvalue -m 0 -d <dataset> -s <test sample> -o <output path>

Required parameters: -m <Integer> The computation method: 0: the two stage method 1: the exact method(brute ranking) Optional parameters:(default in [ ]): -i <Integer> The instruction of current method [0] This parameter is effective only when '-m' is 0 0: the general two stage method 1: generate the model 2: compute from given model

Required parameters: -d <String> The dataset file path -s <String> The test sample file path -o <String> The output directory Optional parameters:(default in [ ]): -n <Integer> The sampling number for first model [100000]it is recommended (100000--<100M; 1000000--<10B;)-c <Integer> The P-value for the cutoff of first distribution [0.001] -a <Integer> The assumed distribution of dataset [0] 0: normal distribution 1: extreme distribution

Required parameters: -d <String> The dataset file path -o <String> The output directory Optional parameters:(default in [ ]): -n <Integer> The sampling number for first model [100000]it is recommended (100000--<100M; 1000000--<10B;)-c <Integer> The P-value for the cutoff of first distribution [0.001]

Required parameters: -b1 <String> The first model file path -b2 <String> The second model file path -s <String> The test sample file path -o <String> The output directory Optional parameters:(default in [ ]): -a <Integer> The assumed distribution of dataset [0] 0: normal distribution 1: extreme distribution

Required parameters: -d <String> The dataset file path -s <String> The test sample file path -o <String> The output directory

4. Source code

**download**.

##
*Evaluation*

1).

**Complexity**

Procedure has been tested and evaluated in our server (Intel Xeon CPU E5410 2.33GHz; 16G Memory; -n:100000; -c:0.001)

Dataset Size | FastPval(Time) | Exact Computation(Time) | FastPval(RAM) | Exact Computation(RAM) | FastPval Model Size |
---|---|---|---|---|---|

1M/12.1MB | 1.05s+0.08s+1.53s | 1.10s+0.74s+2.33s | 0.39MB | 4MB | 1.3MB+13KB |

10M/121.4MB | 9.29s+0.09s+16.07s | 11.21s+7.61s+29.88s | 0.39MB | 38MB | 1.3MB+131KB |

100M/1.2GB | 91.46s+0.14s+249.44s | 116.73s+77.13s+332.13s | 0.78MB | 373MB | 1.3MB+1.3MB |

500M/5.9GB | 455.23s+0.40s+1297.44s | 677.58s+380.47s+1885.12s | 2MB | 1.9GB | 1.3MB+6MB |

1B/11.9GB | 919.45s+0.72s+2530.65s | 1409.61s+761.52s+4019.77s | 4MB | 3.7GB | 1.3MB+11.9MB |

5B/59.9GB | 5475.32s+3.34s+12885.87s | N/A | 19MB | N/A(>18.5GB) | 1.3MB+59.8MB |

Table: The comparison between FastPval and Exact P-value
Computation(brutal sorting).
*Dateset
Size:* The number of dataset(M: million; B: billion) and the file
size. *FastPval(Time):* The model generating time and
model loading time and P-value computation time using current
dataset as test sample. *Exact Computation(Time):* The
time of sorting dataset and background loading time and exact
P-value computation time by binary search using current dataset as
test sample. *FastPval(RAM):* The total memory
consumption of FastPval. *Exact Computation(RAM) :* The
total memory consumption using exact P-value computation. *FastPval
Model Size :* The two model files size of FastPval.

**2N**and the time complexity of the Exact P-value Computation is

**NlogN**. The space complexity of two stage P-value Computation is changed by sampling number and cutoff number, but it will be very small.

2).

**Accuracy**

Compared to the Exact computation, we test the accuracy of the two stage P-value Computation using Q-Q plot.

##
*Development*

We will continue to develop next version with advanced functions and good usability.

##
*Acknowledgments*

The research was supported by internal funds from CRCG, CRDG and genomic SRT of HKU, GRF 778609M and AoE M-04/04 from RGC of Hong Kong.

##
*Citation*

**FastPval: a fast and memory efficient program to calculate very low P-values from empirical distribution.**

*Mulin Jun Li, Pak Chung Sham, and Junwen Wang*

**Bioinformatics.**2010 Nov 15;26(22):2897-9. Epub 2010 Sep 21.