BUSCO v5
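# Introduction
BUSCO assesses the quality of genome assemblies and annotation results; at the time of writing, the authors have released v5. Using the OrthoDB database, BUSCO builds single-copy gene sets for several major evolutionary lineages. Within its lineage, each gene in a set is expected to be conserved and present in exactly one copy. A genome or protein set is compared against the appropriate gene set, and the proportion of genes matched, together with how complete those matches are, is used to judge the accuracy and completeness of the assembly.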
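To make the scoring idea concrete, here is a minimal sketch of how category counts turn into percentages. BUSCO reports each gene as complete, fragmented, or missing; the helper name `busco_percentages` and the counts below are made up for illustration and are not BUSCO output.

```shell
# Convert BUSCO category counts into percentages of the dataset size.
# Arguments: complete fragmented missing (hypothetical counts).
busco_percentages() {
    awk -v c="$1" -v f="$2" -v m="$3" 'BEGIN {
        n = c + f + m
        if (n == 0) { print "no BUSCOs"; exit 1 }
        printf "C:%.1f%% F:%.1f%% M:%.1f%% n:%d\n", 100*c/n, 100*f/n, 100*m/n, n
    }'
}

# Illustrative numbers only: 950 complete, 30 fragmented, 20 missing.
busco_percentages 950 30 20
# prints: C:95.0% F:3.0% M:2.0% n:1000
```

A high complete percentage with few fragmented or missing genes suggests a more complete assembly or annotation.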
User guide: https://busco.ezlab.org/busco_userguide.html
Cite: Seppey M., Manni M., Zdobnov E.M. (2019) BUSCO: Assessing Genome Assembly and Annotation Completeness. In: Kollmar M. (eds) Gene Prediction. Methods in Molecular Biology, vol 1962. Humana, New York, NY. doi.org/10.1007/978-1-4939-9173-0_14. PMID: 31020564
# Installation and testing
# Installation
Pull the Docker image directly (recommended):
docker pull ezlabgva/busco:v5.0.0_cv1
# Downloading a small test dataset
Download the Saccharomyces cerevisiae (baker's yeast) genome:
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
gzip -d GCF_000146045.2_R64_genomic.fna.gz
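Before running BUSCO it can help to confirm that the decompressed file is a usable FASTA file. The sketch below counts header lines; `count_fasta_records` is a hypothetical helper name, and the demo runs on a tiny made-up file rather than the yeast genome.

```shell
# Count FASTA records by counting header lines that start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Self-contained demo on a made-up two-record FASTA file.
demo_fasta=$(mktemp)
printf '>chrA\nACGT\n>chrB\nGGCC\n' > "$demo_fasta"
count_fasta_records "$demo_fasta"   # prints 2
rm -f "$demo_fasta"
```

On the real data the call would be `count_fasta_records GCF_000146045.2_R64_genomic.fna`; a count of zero would point to a failed download or decompression.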
Start a container from the image:
docker run -ti --rm --user $(id -u) -v $PWD:/busco_wd ezlabgva/busco:v5.0.0_cv1 bash
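Because BUSCO v4 and later can select and download the OrthoDB (odb) lineage dataset automatically, the genome file can be passed directly as input: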
busco --in GCF_000146045.2_R64_genomic.fna --mode genome --out tmp -f
The INFO log suggests that the automatically selected dataset is questionable:
INFO: Downloading file 'https://busco-data.ezlab.org/v5/data/lineages/archaea_odb10.2021-02-23.tar.gz'
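The file being downloaded is the archaeal dataset archaea_odb10, which looks wrong for yeast. (When no lineage is given, auto-lineage evaluates the archaea, bacteria and eukaryota root datasets before choosing, so this download may simply be the first of those steps.) To restrict automatic selection to eukaryotic datasets, add --auto-lineage-euk: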
busco --in GCF_000146045.2_R64_genomic.fna --mode genome --out tmp -f --auto-lineage-euk
This time a eukaryote dataset is downloaded, but busco --list-datasets shows that a closer dataset exists: saccharomycetes_odb10. The automatic selection does not look very reliable here.
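The dataset listing is long, so filtering it by keyword is convenient. This is a small sketch: `find_lineage` is a hypothetical helper, and the demo input is a made-up excerpt that only assumes dataset names end in `_odb` plus a version number.

```shell
# Filter a dataset listing (read from stdin) for names containing a keyword.
# Prints bare dataset names such as "saccharomycetes_odb10".
find_lineage() {
    grep -io "[a-z0-9_]*$1[a-z0-9_]*_odb[0-9]*"
}

# Inside the container the real listing would be piped in:
#   busco --list-datasets | find_lineage saccharo
# Demo on a made-up excerpt of such a listing:
printf '  - fungi_odb10\n    - saccharomycetes_odb10\n' | find_lineage saccharo
# prints: saccharomycetes_odb10
```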
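Download saccharomycetes_odb10 manually, unpack it into a directory of your choice, and point the command at that directory. Note that the path seen inside the container may differ from the host path because of the volume mount: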
busco -i GCF_000146045.2_R64_genomic.fna -l saccharomycetes_odb10 --download_path busco_downloads -o tmp -m genome -f --offline -c 10
odb dataset downloads: https://busco-data.ezlab.org/v5/data/lineages/
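The manual download step can be sketched as follows. This assumes BUSCO looks for lineages under `<download_path>/lineages/<lineage_name>`, so verify that layout against the user guide. `install_lineage` is a hypothetical helper, and `<date>` stands for the date stamp in the real file name, which has to be copied from the directory listing above.

```shell
# Unpack a lineage tarball into <download_path>/lineages/.
# Usage: install_lineage TARBALL DOWNLOAD_PATH
install_lineage() {
    mkdir -p "$2/lineages" && tar -xzf "$1" -C "$2/lineages"
}

# Hypothetical usage (replace <date> with the value from the listing):
#   wget https://busco-data.ezlab.org/v5/data/lineages/saccharomycetes_odb10.<date>.tar.gz
#   install_lineage saccharomycetes_odb10.<date>.tar.gz busco_downloads
```

With the dataset unpacked this way, the `--download_path busco_downloads --offline` options in the command above should be able to find it without network access.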
# Command-line options
-i FASTA FILE, --in FASTA FILE
Input sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set.
-o OUTPUT, --out OUTPUT
Give your analysis run a recognisable short name. Output folders and files will be labelled with this name. WARNING: do not provide a path
-m MODE, --mode MODE Specify which BUSCO analysis mode to run.
There are three valid modes:
- geno or genome, for genome assemblies (DNA)
- tran or transcriptome, for transcriptome assemblies (DNA)
- prot or proteins, for annotated gene sets (protein)
-l LINEAGE, --lineage_dataset LINEAGE
Specify the name of the BUSCO lineage to be used.
--auto-lineage Run auto-lineage to find optimum lineage path
--auto-lineage-prok Run auto-lineage just on non-eukaryote trees to find optimum lineage path
--auto-lineage-euk Run auto-placement just on eukaryote tree to find optimum lineage path
-c N, --cpu N Specify the number (N=integer) of threads/cores to use.
-f, --force Force rewriting of existing files. Must be used when output files with the provided name already exist.
-r, --restart Continue a run that had already partially completed.
-q, --quiet Disable the info logs, displays only errors
--out_path OUTPUT_PATH
Optional location for results folder, excluding results folder name. Default is current working directory.
--download_path DOWNLOAD_PATH
Specify local filepath for storing BUSCO dataset downloads
--datasets_version DATASETS_VERSION
Specify the version of BUSCO datasets, e.g. odb10
--download_base_url DOWNLOAD_BASE_URL
Set the url to the remote BUSCO dataset location
--update-data Download and replace with last versions all lineages datasets and files necessary to their automated selection
--offline To indicate that BUSCO cannot attempt to download files
--metaeuk_parameters METAEUK_PARAMETERS
Pass additional arguments to Metaeuk for the first run. All arguments should be contained within a single pair of quotation marks, separated by commas. E.g. "--param1=1,--param2=2"
--metaeuk_rerun_parameters METAEUK_RERUN_PARAMETERS
Pass additional arguments to Metaeuk for the second run. All arguments should be contained within a single pair of quotation marks, separated by commas. E.g. "--param1=1,--param2=2"
-e N, --evalue N E-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03 (Default: 1e-03)
--limit REGION_LIMIT How many candidate regions (contig or transcript) to consider per BUSCO (default: 3)
--augustus Use augustus gene predictor for eukaryote runs
--augustus_parameters AUGUSTUS_PARAMETERS
Pass additional arguments to Augustus. All arguments should be contained within a single pair of quotation marks, separated by commas. E.g. "--param1=1,--param2=2"
--augustus_species AUGUSTUS_SPECIES
Specify a species for Augustus training.
--long Optimization Augustus self-training mode (Default: Off); adds considerably to the run time, but can improve results for some non-model organisms
--config CONFIG_FILE Provide a config file
-v, --version Show this version and exit
-h, --help Show this help message and exit
--list-datasets Print the list of available BUSCO datasets