Peregrine Assembly
# Introduction
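Peregrine is a genome assembler for accurate long reads (length > 10 kb, accuracy > 99%). It can produce error-corrected consensus sequences from 30x reads in only 20 CPU core-hours. Most long-read assemblers follow the overlap-layout-consensus (OLC) approach, which requires all-to-all read comparison. Peregrine instead uses the SHIMMER (Sparse HIerarchical MiniMizER) method to avoid this time-consuming comparison.
A human genome assembly can be completed in about two hours on a single node using cloud-accessible hardware. Likewise, a dedicated desktop computer with enough physical memory (for example, a 2019 Mac Pro) can do the same job, which removes the need for software infrastructure and cluster-computing expertise.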
Repository: https://github.com/cschin/Peregrine
Paper: https://www.biorxiv.org/content/10.1101/705616v1.full
# Installation and Testing
# Installation
Install with conda
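#!/bin/bash
# Run from the root of a Peregrine checkout (the py/ and src/ directories must exist).
# Create and activate a dedicated conda environment with Python 3.7 and PyPy.
. /opt/conda/bin/activate
conda create -n peregrine -y python=3.7
conda activate peregrine
conda install -c conda-forge -y pypy3.6
# Install the Python package from py/ after clearing old build products.
pushd py
rm -rf .eggs/ dist/ build/ peregrine.egg-info/ peregrine_pypy.egg-info get-pip.py
python3 setup.py install
python3 setup.py clean --all
popd
# Install the peregrine branch of pypeFLOW.
git clone -b peregrine https://github.com/cschin/pypeFLOW.git
pushd pypeFLOW
python3 setup.py install
popd
# Install pip for PyPy, then the PyPy build of the package.
pushd py
wget -q https://bootstrap.pypa.io/get-pip.py
pypy3 get-pip.py
pypy3 setup_pypy.py install
popd
# Build and install the tools in src/.
pushd src
make all
make install
popd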
Pull the Docker image directly (recommended)
docker pull cschin/peregrine:latest
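# Downloading a Small Test Dataset
The author's own E. coli test data is hosted on Amazon cloud storage, which is inconvenient to access from mainland China. Alternatively, you can download a genome yourself and simulate some reads with simulate_reads.py from the author's repository. Here, the hifiasm test data is used instead.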
wget https://github.com/chhylp123/hifiasm/releases/download/v0.7/chr11-2M.fa.gz
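Before running the assembler, it can be worth checking that the download is intact. The following is a minimal sketch, not part of Peregrine or hifiasm; the function name `count_fasta_gz` is hypothetical, and it assumes the file is a gzipped FASTA as the `.fa.gz` extension suggests.

```shell
#!/bin/bash
# Hypothetical helper: verify a gzipped FASTA file and print its number of records.
count_fasta_gz() {
    gzip -t "$1" || return 1        # fail early if the archive is truncated or corrupt
    gzip -dc "$1" | grep -c '^>'    # each FASTA record starts with a ">" header line
}

# Usage: count_fasta_gz chr11-2M.fa.gz
```

A non-zero exit status from `gzip -t` usually means the download was interrupted and should be repeated.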
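# Running on the Test Data
Once the download finishes, write the absolute path of the data to a reads.lst file. FASTA and FASTQ are supported, optionally gzipped (.gz).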
echo $PWD/chr11-2M.fa.gz >reads.lst
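When the reads are spread over several files, the list can be generated instead of typed by hand. The following is a minimal sketch under stated assumptions: the function name `make_reads_lst` and the set of file extensions it matches are chosen for illustration and are not part of Peregrine.

```shell
#!/bin/bash
# Hypothetical helper: write the absolute path of every FASTA/FASTQ file
# (optionally gzipped) found directly inside a directory to a list file.
make_reads_lst() {
    local dir="$1" out="$2"
    local abs_dir f
    abs_dir="$(cd "$dir" && pwd)" || return 1   # resolve to an absolute path
    : > "$out"                                  # start with an empty list
    for f in "$abs_dir"/*; do
        case "$f" in
            *.fa|*.fasta|*.fq|*.fastq|*.fa.gz|*.fasta.gz|*.fq.gz|*.fastq.gz)
                if [ -f "$f" ]; then
                    printf '%s\n' "$f" >> "$out"   # one absolute path per line
                fi
                ;;
        esac
    done
    return 0
}

# Usage: make_reads_lst "$PWD" reads.lst
```

Absolute paths matter here because the Docker run mounts `$PWD` at the same location inside the container, so the paths in the list resolve identically on both sides.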
Run with the Docker image
docker run -it -v $PWD:$PWD --user $(id -u):$(id -g) cschin/peregrine:latest asm \
$PWD/reads.lst 24 24 24 24 24 24 24 24 24 \
--with-consensus --shimmer-r 3 --best_n_ovlp 8 \
--output $PWD/test
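A prompt appears stating that the software is for non-commercial use only; enter yes to continue. Answering no, as in the transcript below, stops the run: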
This is a pre-release, please do not re-distribute without permission.
I agree that I am not using this software for any commericial purposes (yes/no): no
Sorry, please contact us to get a license before using Peregrine. Thanks
The result is ready in about 5 seconds and is located at "4-cns/cns-merge/ctg_cns.fa" (relative to the output directory).
# Parameters
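A quick sanity check of the consensus FASTA is to count the contigs and their total length. The following is a minimal sketch using awk; the function name `fasta_stats` is hypothetical, and the path in the usage comment assumes the `--output $PWD/test` directory from the run command.

```shell
#!/bin/bash
# Hypothetical helper: print "<number of sequences> <total bases>" for a FASTA file.
fasta_stats() {
    awk '
        /^>/ { n++; next }              # header line: one more sequence
             { bases += length($0) }    # sequence line: add its length
        END  { printf "%d %d\n", n, bases }
    ' "$1"
}

# Usage (path assumes --output $PWD/test):
# fasta_stats "$PWD/test/4-cns/cns-merge/ctg_cns.fa"
```

For the 2 Mb test region, a small number of contigs with a total length near 2 Mb would be the expected shape of a good result, though the exact numbers depend on the data and parameters.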
Usage:
  pg_run.py asm <reads.lst> <index_nchunk> <index_nproc>
                            <ovlp_nchunk> <ovlp_nproc>
                            <mapping_nchunk> <mapping_nproc>
                            <cns_nchunk> <cns_nproc>
                            <sort_nproc>
                            [--with-consensus]
                            [--with-L0-index]
                            [--output <output>]
                            [--shimmer-k <shimmer_k>]
                            [--shimmer-w <shimmer_w>]
                            [--shimmer-r <shimmer_r>]
                            [--shimmer-l <shimmer_l>]
                            [--best_n_ovlp <n_ovlp>]
                            [--mc_lower <mc_lower>]
                            [--mc_upper <mc_upper>]
                            [--aln_bw <aln_bw>]
                            [--ovlp_upper <ovlp_upper>]
  pg_run.py (-h | --help)
  pg_run.py --verison
Options:
  -h --help                   Show this help
  --version                   Show version
  --with-consensus            Generate consensus after getting the draft contigs
  --with-L0-index             Keep level-0 index
  --output <output>           Set output directory (will be created if not exist) [default: ./wd]
  --shimmer-k <shimmer_k>     Level 0 k-mer size [default: 16]
  --shimmer-w <shimmer_w>     Level 0 window size [default: 80]
  --shimmer-r <shimmer_r>     Reduction factore for high level SHIMMER [default: 6]
  --shimmer-l <shimmer_l>     number of level of shimmer used, the value should be 1 or 2 [default: 2]
  --best_n_ovlp <n_ovlp>      Find best n_ovlp overlap [default: 4]
  --mc_lower <mc_lower>       Does not cosider SHIMMER with count less than mc_low [default: 2]
  --mc_upper <mc_upper>       Does not cosider SHIMMER with count greater than mc_upper [default: 240]
  --aln_bw <aln_bw>           Max off-diagonal gap allow during overlap confirmation [default: 100]
  --ovlp_upper <ovlp_upper>   Ignore cluster with overlap count greater ovlp_upper [default: 120]
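nchunk is the number of chunks and nproc is the number of CPUs used for a stage. On a machine with little memory, use a larger ovlp_nchunk and a smaller ovlp_nproc; for example, on a machine with 32 GB of RAM you can set ovlp_nchunk=24 and ovlp_nproc=1.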
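Because the nine positional numbers are easy to mix up, it can help to build them from named variables. The following is a minimal sketch: the function `pg_asm_args` and its variable names are hypothetical, and it only prints the argument string in the order given by the usage text, without running the assembler.

```shell
#!/bin/bash
# Hypothetical helper: print the nine positional numbers of `asm` in usage order.
# Arguments: <nproc> <nchunk> [ovlp_nchunk] [ovlp_nproc]
# The two optional arguments override the overlap stage only.
pg_asm_args() {
    local nproc="$1" nchunk="$2"
    local ovlp_nchunk="${3:-$nchunk}" ovlp_nproc="${4:-$nproc}"
    # Order: index_nchunk index_nproc ovlp_nchunk ovlp_nproc
    #        mapping_nchunk mapping_nproc cns_nchunk cns_nproc sort_nproc
    echo "$nchunk $nproc $ovlp_nchunk $ovlp_nproc $nchunk $nproc $nchunk $nproc $nproc"
}

# Same setting as the test run: every value is 24.
pg_asm_args 24 24
# Low-memory variant from the 32 GB example: 24 overlap chunks, 1 overlap process.
pg_asm_args 24 24 24 1
```

The printed string can be pasted after `reads.lst` in the run command; only the overlap pair changes between the two calls, which keeps the memory-heavy overlap stage serial while the other stages stay parallel.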