hmmsearch -h

# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.4 (Aug 2023); http://hmmer.org/
# Copyright (C) 2023 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmsearch [options] <hmmfile> <seqdb>

Basic options:
  -h : show brief help on version and usage

Options directing output:
  -o <f>           : direct output to file <f>, not stdout
  -A <f>           : save multiple alignment of all hits to file <f>
  --tblout <f>     : save parseable table of per-sequence hits to file <f>
  --domtblout <f>  : save parseable table of per-domain hits to file <f>
  --pfamtblout <f> : save table of hits and domains to file, in Pfam format <f>
  --acc            : prefer accessions over names in output
  --noali          : don't output alignments, so output is smaller
  --notextw        : unlimit ASCII text output line width
  --textw <n>      : set max width of ASCII text output lines  [120]  (n>=120)

Options controlling reporting thresholds:
  -E <x>     : report sequences <= this E-value threshold in output  [10.0]  (x>0)
  -T <x>     : report sequences >= this score threshold in output
  --domE <x> : report domains <= this E-value threshold in output  [10.0]  (x>0)
  --domT <x> : report domains >= this score cutoff in output

Options controlling inclusion (significance) thresholds:
  --incE <x>    : consider sequences <= this E-value threshold as significant
  --incT <x>    : consider sequences >= this score threshold as significant
  --incdomE <x> : consider domains <= this E-value threshold as significant
  --incdomT <x> : consider domains >= this score threshold as significant

Options controlling model-specific thresholding:
  --cut_ga : use profile's GA gathering cutoffs to set all thresholding
  --cut_nc : use profile's NC noise cutoffs to set all thresholding
  --cut_tc : use profile's TC trusted cutoffs to set all thresholding

Options controlling acceleration heuristics:
  --max    : Turn all heuristic filters off (less speed, more power)
  --F1 <x> : Stage 1 (MSV) threshold: promote hits w/ P <= F1  [0.02]
  --F2 <x> : Stage 2 (Vit) threshold: promote hits w/ P <= F2  [1e-3]
  --F3 <x> : Stage 3 (Fwd) threshold: promote hits w/ P <= F3  [1e-5]
  --nobias : turn off composition bias filter

Other expert options:
  --nonull2     : turn off biased composition score corrections
  -Z <x>        : set # of comparisons done, for E-value calculation
  --domZ <x>    : set # of significant seqs, for domain E-value calculation
  --seed <n>    : set RNG seed to <n> (if 0: one-time arbitrary seed)  [42]
  --tformat <s> : assert target <seqfile> is in format <s>: no autodetection
  --cpu <n>     : number of parallel CPU workers to use for multithreads  [2]

进行全基因组扫描：

hmmsearch -E 1e-5 --domtblout Bo_P450_hmm.out PF00067.hmm ../00_raw_data/Bombyx_mori/Bombyx_mori.anno.pep.fa

参数	说明
`hmmsearch`	用HMM模型在蛋白质数据库中搜索同源序列
`-E 1e-5`	序列级别E-value阈值，只报告E值 ≤ 1e-5 的hit
`--domtblout Bo_P450_hmm.out`	将结构域比对结果以表格格式输出到该文件
`PF00067.hmm`	输入的HMM模型文件（这里是P450家族的Pfam模型）
`../00_raw_data/Bombyx_mori/Bombyx_mori.anno.pep.fa`	目标蛋白质序列数据库（家蚕注释蛋白序列）

E-value 阈值说明：

1e-5 是比较严格的阈值，表示期望假阳性数 ≤ 0.00001
越小越严格：1e-3 < 1e-5 < 1e-10
初步筛选可用 1e-3，严格筛选用 1e-10

输出文件 --domtblout 包含的主要列：

目标序列名、HMM模型名
E-value（序列级 & 结构域级）
比对得分、比对坐标范围

最后的输出的是一个列表，比如 01_HMMER_search/Bo_P450_hmm.out：

#                                                                            --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name        accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
Bmor004196.1         -            499 p450                 PF00067.28   462  2.7e-113  379.0   0.1   1   1  1.5e-115    3e-113  378.8   0.1     2   459    31   486    30   489 0.94 annotation: Cytochrome P450 4d2 OS=Drosophila melanogaster OX=7227 GN=Cyp4d2 PE=2 SV=2

其中有一个 target name，就是候选的蛋白，我们需要捞出来，捞出来的蛋白序列丢到一个文件里面：

开始建树

把前面几个捞出来的蛋白序列输入到一个文件里面：

cat Bo_P450_hits.pep.fa Dr_P450_hits.pep.fa Ma_P450_hits.pep.fa Pl_P450_hits.pep.fa > All_species_P450s.pep.fa

多序列比对

mafft --auto 01_HMMER_search/P450_hits_all_species/All_species_P450s.pep.fa > 02_Alignment_Tree/All_P450_aligned.fasta

结果是这样的：

在对比完成后，序列两端或者中间通常会有很多缺失，可以使用 trimAl 这个小工具开进行去掉，可以使用 conda 进行安装：

conda install -c bioconda trimal

执行：

trimal -in All_P450_aligned.fasta -out All_P450_clean.fasta -automated1

构建最大似然树

使用 conda 安装 iqtree：

conda install -c bioconda iqtree

iqtree -s All_P450_clean.fasta -m TEST -bb 1000 -alrt 1000

35/F,Tencent Building,Kejizhongyi Avenue,Nanshan District,Shenzhen

京ICP备11018762号-111