logo
0
0
WeChat Login

昆虫P450s解毒酶小实验

下载数据集

主要有这四个物种:

  • Plutella xylostella:小菜蛾
  • Bombyx mori:家蚕
  • Manduca sexta:烟草天蛾
  • Drosophila melanogaster:黑腹果蝇

数据来自于 IncestBase,下载时间2026年3月8日

主要有这三类数据:

  • 蛋白质序列文件(Protein FASTA)
  • 编码核苷酸序列文件(CDS FASTA)
  • 基因组注释文件(GFF/GFF3)

运行 hmmsearch 进行全基因组扫描

首先需要从数据库中下载 P450 的隐马尔可夫模型(HMM),这里的模型是从 InterPro 中去下载,https://www.ebi.ac.uk/interpro/entry/pfam/PF00067/

在 macOS 中安装 hmmer

brew install hmmer

验证安装:

hmmsearch -h

# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.4 (Aug 2023); http://hmmer.org/
# Copyright (C) 2023 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmsearch [options] <hmmfile> <seqdb>

Basic options:
  -h : show brief help on version and usage

Options directing output:
  -o <f>           : direct output to file <f>, not stdout
  -A <f>           : save multiple alignment of all hits to file <f>
  --tblout <f>     : save parseable table of per-sequence hits to file <f>
  --domtblout <f>  : save parseable table of per-domain hits to file <f>
  --pfamtblout <f> : save table of hits and domains to file, in Pfam format <f>
  --acc            : prefer accessions over names in output
  --noali          : don't output alignments, so output is smaller
  --notextw        : unlimit ASCII text output line width
  --textw <n>      : set max width of ASCII text output lines  [120]  (n>=120)

Options controlling reporting thresholds:
  -E <x>     : report sequences <= this E-value threshold in output  [10.0]  (x>0)
  -T <x>     : report sequences >= this score threshold in output
  --domE <x> : report domains <= this E-value threshold in output  [10.0]  (x>0)
  --domT <x> : report domains >= this score cutoff in output

Options controlling inclusion (significance) thresholds:
  --incE <x>    : consider sequences <= this E-value threshold as significant
  --incT <x>    : consider sequences >= this score threshold as significant
  --incdomE <x> : consider domains <= this E-value threshold as significant
  --incdomT <x> : consider domains >= this score threshold as significant

Options controlling model-specific thresholding:
  --cut_ga : use profile's GA gathering cutoffs to set all thresholding
  --cut_nc : use profile's NC noise cutoffs to set all thresholding
  --cut_tc : use profile's TC trusted cutoffs to set all thresholding

Options controlling acceleration heuristics:
  --max    : Turn all heuristic filters off (less speed, more power)
  --F1 <x> : Stage 1 (MSV) threshold: promote hits w/ P <= F1  [0.02]
  --F2 <x> : Stage 2 (Vit) threshold: promote hits w/ P <= F2  [1e-3]
  --F3 <x> : Stage 3 (Fwd) threshold: promote hits w/ P <= F3  [1e-5]
  --nobias : turn off composition bias filter

Other expert options:
  --nonull2     : turn off biased composition score corrections
  -Z <x>        : set # of comparisons done, for E-value calculation
  --domZ <x>    : set # of significant seqs, for domain E-value calculation
  --seed <n>    : set RNG seed to <n> (if 0: one-time arbitrary seed)  [42]
  --tformat <s> : assert target <seqfile> is in format <s>: no autodetection
  --cpu <n>     : number of parallel CPU workers to use for multithreads  [2]

进行全基因组扫描:

hmmsearch -E 1e-5 --domtblout Bo_P450_hmm.out PF00067.hmm ../00_raw_data/Bombyx_mori/Bombyx_mori.anno.pep.fa
参数说明
hmmsearch用HMM模型在蛋白质数据库中搜索同源序列
-E 1e-5序列级别E-value阈值,只报告E值 ≤ 1e-5 的hit
--domtblout Bo_P450_hmm.out将结构域比对结果以表格格式输出到该文件
PF00067.hmm输入的HMM模型文件(这里是P450家族的Pfam模型)
../00_raw_data/Bombyx_mori/Bombyx_mori.anno.pep.fa目标蛋白质序列数据库(家蚕注释蛋白序列)

E-value 阈值说明:

  • 1e-5 是比较严格的阈值,表示期望假阳性数 ≤ 0.00001
  • 越小越严格:1e-3 < 1e-5 < 1e-10
  • 初步筛选可用 1e-3,严格筛选用 1e-10

输出文件 --domtblout 包含的主要列:

  • 目标序列名、HMM模型名
  • E-value(序列级 & 结构域级)
  • 比对得分、比对坐标范围

最后的输出的是一个列表,比如 01_HMMER_search/Bo_P450_hmm.out

#                                                                            --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name        accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
Bmor004196.1         -            499 p450                 PF00067.28   462  2.7e-113  379.0   0.1   1   1  1.5e-115    3e-113  378.8   0.1     2   459    31   486    30   489 0.94 annotation: Cytochrome P450 4d2 OS=Drosophila melanogaster OX=7227 GN=Cyp4d2 PE=2 SV=2

其中有一个 target name,就是候选的蛋白,我们需要捞出来,捞出来的蛋白序列丢到一个文件里面:

开始建树

把前面几个捞出来的蛋白序列输入到一个文件里面:

cat Bo_P450_hits.pep.fa Dr_P450_hits.pep.fa Ma_P450_hits.pep.fa Pl_P450_hits.pep.fa > All_species_P450s.pep.fa

多序列比对

mafft --auto 01_HMMER_search/P450_hits_all_species/All_species_P450s.pep.fa > 02_Alignment_Tree/All_P450_aligned.fasta

结果是这样的:

在对比完成后,序列两端或者中间通常会有很多缺失,可以使用 trimAl 这个小工具开进行去掉,可以使用 conda 进行安装:

conda install -c bioconda trimal

执行:

trimal -in All_P450_aligned.fasta -out All_P450_clean.fasta -automated1

构建最大似然树

使用 conda 安装 iqtree

conda install -c bioconda iqtree
iqtree -s All_P450_clean.fasta -m TEST -bb 1000 -alrt 1000

About

No description, topics, or website provided.
Language
Python100%