主要有这四个物种:
数据来自于 IncestBase,下载时间2026年3月8日
主要有这三类数据:
首先需要从数据库中下载 P450 的隐马尔可夫模型(HMM),这里的模型是从 InterPro 中去下载,https://www.ebi.ac.uk/interpro/entry/pfam/PF00067/
brew install hmmer
验证安装:
hmmsearch -h
# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.4 (Aug 2023); http://hmmer.org/
# Copyright (C) 2023 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmsearch [options] <hmmfile> <seqdb>
Basic options:
-h : show brief help on version and usage
Options directing output:
-o <f> : direct output to file <f>, not stdout
-A <f> : save multiple alignment of all hits to file <f>
--tblout <f> : save parseable table of per-sequence hits to file <f>
--domtblout <f> : save parseable table of per-domain hits to file <f>
--pfamtblout <f> : save table of hits and domains to file, in Pfam format <f>
--acc : prefer accessions over names in output
--noali : don't output alignments, so output is smaller
--notextw : unlimit ASCII text output line width
--textw <n> : set max width of ASCII text output lines [120] (n>=120)
Options controlling reporting thresholds:
-E <x> : report sequences <= this E-value threshold in output [10.0] (x>0)
-T <x> : report sequences >= this score threshold in output
--domE <x> : report domains <= this E-value threshold in output [10.0] (x>0)
--domT <x> : report domains >= this score cutoff in output
Options controlling inclusion (significance) thresholds:
--incE <x> : consider sequences <= this E-value threshold as significant
--incT <x> : consider sequences >= this score threshold as significant
--incdomE <x> : consider domains <= this E-value threshold as significant
--incdomT <x> : consider domains >= this score threshold as significant
Options controlling model-specific thresholding:
--cut_ga : use profile's GA gathering cutoffs to set all thresholding
--cut_nc : use profile's NC noise cutoffs to set all thresholding
--cut_tc : use profile's TC trusted cutoffs to set all thresholding
Options controlling acceleration heuristics:
--max : Turn all heuristic filters off (less speed, more power)
--F1 <x> : Stage 1 (MSV) threshold: promote hits w/ P <= F1 [0.02]
--F2 <x> : Stage 2 (Vit) threshold: promote hits w/ P <= F2 [1e-3]
--F3 <x> : Stage 3 (Fwd) threshold: promote hits w/ P <= F3 [1e-5]
--nobias : turn off composition bias filter
Other expert options:
--nonull2 : turn off biased composition score corrections
-Z <x> : set # of comparisons done, for E-value calculation
--domZ <x> : set # of significant seqs, for domain E-value calculation
--seed <n> : set RNG seed to <n> (if 0: one-time arbitrary seed) [42]
--tformat <s> : assert target <seqfile> is in format <s>: no autodetection
--cpu <n> : number of parallel CPU workers to use for multithreads [2]
进行全基因组扫描:
hmmsearch -E 1e-5 --domtblout Bo_P450_hmm.out PF00067.hmm ../00_raw_data/Bombyx_mori/Bombyx_mori.anno.pep.fa
| 参数 | 说明 |
|---|---|
hmmsearch | 用HMM模型在蛋白质数据库中搜索同源序列 |
-E 1e-5 | 序列级别E-value阈值,只报告E值 ≤ 1e-5 的hit |
--domtblout Bo_P450_hmm.out | 将结构域比对结果以表格格式输出到该文件 |
PF00067.hmm | 输入的HMM模型文件(这里是P450家族的Pfam模型) |
../00_raw_data/Bombyx_mori/Bombyx_mori.anno.pep.fa | 目标蛋白质序列数据库(家蚕注释蛋白序列) |
E-value 阈值说明:
1e-5 是比较严格的阈值,表示期望假阳性数 ≤ 0.000011e-3 < 1e-5 < 1e-101e-3,严格筛选用 1e-10输出文件 --domtblout 包含的主要列:
最后的输出的是一个列表,比如 01_HMMER_search/Bo_P450_hmm.out:
# --- full sequence --- -------------- this domain ------------- hmm coord ali coord env coord
# target name accession tlen query name accession qlen E-value score bias # of c-Evalue i-Evalue score bias from to from to from to acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
Bmor004196.1 - 499 p450 PF00067.28 462 2.7e-113 379.0 0.1 1 1 1.5e-115 3e-113 378.8 0.1 2 459 31 486 30 489 0.94 annotation: Cytochrome P450 4d2 OS=Drosophila melanogaster OX=7227 GN=Cyp4d2 PE=2 SV=2
其中有一个 target name,就是候选的蛋白,我们需要捞出来,捞出来的蛋白序列丢到一个文件里面:

把前面几个捞出来的蛋白序列输入到一个文件里面:
cat Bo_P450_hits.pep.fa Dr_P450_hits.pep.fa Ma_P450_hits.pep.fa Pl_P450_hits.pep.fa > All_species_P450s.pep.fa
mafft --auto 01_HMMER_search/P450_hits_all_species/All_species_P450s.pep.fa > 02_Alignment_Tree/All_P450_aligned.fasta
结果是这样的:

在对比完成后,序列两端或者中间通常会有很多缺失,可以使用 trimAl 这个小工具开进行去掉,可以使用 conda 进行安装:
conda install -c bioconda trimal
执行:
trimal -in All_P450_aligned.fasta -out All_P450_clean.fasta -automated1

使用 conda 安装 iqtree:
conda install -c bioconda iqtree
iqtree -s All_P450_clean.fasta -m TEST -bb 1000 -alrt 1000