hinanwitenshi0711/anima-lora-train

Forked from hanpi2233/anima-lora-train; 103 commits ahead of main, 2 commits behind main

Latest commit: feat: add staged sample interval (e395d289), by Hinanaw0Tenshi <ljt20060711@gmail.com>; 124 commits in total

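Anima LoRA/LoKr Training Repository

This is the Anima LoRA/LoKr training project used in the current CNB environment. The day-to-day entry point is run.sh in the repository root. It starts training in the background and, at the same time, starts a LoKr auto-push script as a safety net, so that results already saved to disk are not lost when the CNB workspace is reclaimed.

Quick Start

Run from the project root: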

git pull
bash run.sh

Two log files are created after startup:

tail -f anima_training.log
tail -f lokr_auto_push.log

Stop training and auto-push:

bash stop_anima_training.sh

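Default Startup Behavior

run.sh does three things:

Selects the Python environment: it prefers /opt/venv/bin/python from the image and, if that is missing, falls back to the local ./venv/bin/python or the Python on PATH.
Starts training in the background: the default config file is AnimaLoraToolkit/config/train_my.yaml.
Starts the LoKr watcher in the background: it monitors AnimaLoraToolkit/output and, when a new .safetensors file with step >= 800 appears, waits until the file has finished being written, then runs git add -f, commit, and push for that file alone.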
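The "wait until the file has finished being written" step can be sketched as a size-polling loop. This is a minimal illustration, not the code in watch_and_push_lokr.py: the function names, the poll count, the timeout, and the commit message are all hypothetical, and the real watcher may detect stability differently.

```python
import os
import time


def wait_until_stable(path, checks=3, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Return the file size once it was unchanged for `checks` consecutive polls.

    Assumption: a non-empty file whose size stops changing is fully written.
    """
    deadline = time.monotonic() + timeout
    last_size, stable = -1, 0
    while True:
        size = os.path.getsize(path)
        if size > 0 and size == last_size:
            stable += 1
            if stable >= checks:
                return size
        else:
            # Size changed (or file is still empty): restart the count.
            stable, last_size = 0, size
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{path} did not stabilise within {timeout}s")
        sleep(interval)


def git_push_commands(path, step):
    """Build the add/commit/push commands for a single checkpoint file.

    `git add -f` is used because the README says outputs may be git-ignored.
    The commit message format is made up for this sketch.
    """
    return [
        ["git", "add", "-f", path],
        ["git", "commit", "-m", f"auto: save LoKr step {step}"],
        ["git", "push"],
    ]
```

Each command list could be passed to subprocess.run(cmd, check=True) in order, so that a failed commit stops the push for that file.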

Common Configuration

Training parameters are mainly edited here:

AnimaLoraToolkit/config/train_my.yaml

Key fields:

data_dir: "./AnimaLoraToolkit/Dataset/<your-dataset>"
output_dir: "./AnimaLoraToolkit/output/lokr_c8"
output_name: "anima-lokrC8"

lora_type: "lokr"
lora_rank: 32
lokr_factor: 8

batch_size: 4
grad_accum: 1
mixed_precision: "bf16"
grad_checkpoint: true

attention_backend: "sdpa"
sdpa_kernel: "auto"

save_every_steps: 40
save_state_every: 200

sample_steps: 40
sample_steps_before: 100
sample_steps_switch_step: 1000

On the H20, the current recommendation is to keep:

attention_backend: "sdpa"
sdpa_kernel: "auto"

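That is, let PyTorch SDPA choose automatically among the flash / memory-efficient / cuDNN / math kernels, without an extra dependency on xformers or an external flash-attn.

If sampling slows training down, use staged sampling:

sample_steps: 40                 # after step 1000, sample every 40 steps
sample_steps_before: 100         # before step 1000, sample every 100 steps
sample_steps_switch_step: 1000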
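The staged schedule can be read as a step-dependent interval. The sketch below illustrates that reading with the three config values; the function names are hypothetical, and it assumes the switch takes effect at step >= sample_steps_switch_step, which the config comments do not spell out for the boundary step itself.

```python
def sample_interval(step, sample_steps=40, sample_steps_before=100, switch_step=1000):
    """Return the sampling interval that applies at `step`.

    Assumption: steps below `switch_step` use `sample_steps_before`,
    and steps at or above it use `sample_steps`.
    """
    return sample_steps_before if step < switch_step else sample_steps


def should_sample(step, **config):
    """True when `step` is a positive multiple of the interval in force."""
    return step > 0 and step % sample_interval(step, **config) == 0


sampled = [s for s in range(1, 1121) if should_sample(s)]
print(sampled)
# [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1040, 1080, 1120]
```

Under this reading, the first 999 steps trigger 9 sampling runs instead of the 24 that a flat interval of 40 would trigger.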

LoKr Auto-Push

Enabled by default. It only handles .safetensors files whose names contain step<number>, and by default it starts from step 800:

anima-lokrC8_step800.safetensors
anima-lokrC8_step840.safetensors
...

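Each file becomes its own commit, so results can still be recovered from the repository after the CNB workspace is reclaimed.

Temporarily disable auto-push: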
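As an illustration of the file-name rule, the following sketch filters a list of names down to .safetensors files at or above a minimum step. The regular expression and helper names are assumptions based on the example file names above, not taken from the watcher's source.

```python
import re

# Assumption: the step number is the digits that follow "step" in the name.
STEP_RE = re.compile(r"step(\d+)")


def parse_step(filename):
    """Return the step number of a .safetensors file name, or None."""
    if not filename.endswith(".safetensors"):
        return None
    match = STEP_RE.search(filename)
    return int(match.group(1)) if match else None


def eligible(filenames, min_step=800):
    """Return names with step >= min_step, ordered by step number."""
    steps = [(parse_step(name), name) for name in filenames]
    kept = [(step, name) for step, name in steps if step is not None and step >= min_step]
    return [name for _, name in sorted(kept)]


names = [
    "anima-lokrC8_step760.safetensors",
    "anima-lokrC8_step1000.safetensors",
    "anima-lokrC8_step800.safetensors",
    "anima-lokrC8.safetensors",
    "training_state_step2000.pt",
]
print(eligible(names))
# ['anima-lokrC8_step800.safetensors', 'anima-lokrC8_step1000.safetensors']
```

Sorting by the parsed integer keeps step1000 after step800, which a plain string sort would not.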

LOKR_AUTO_PUSH=0 bash run.sh

Change the minimum step at which pushing starts:

LOKR_MIN_STEP=1200 bash run.sh

Change the watched directory:

LOKR_WATCH_DIR=./AnimaLoraToolkit/output/lokr_c8 bash run.sh
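The three environment variables and their documented defaults can be summarised in a small lookup. run.sh itself is a shell script, so this Python sketch only mirrors the behaviour described here; the function name is hypothetical, and treating every value other than "0" as "enabled" is an assumption.

```python
import os


def watcher_settings(env=None):
    """Resolve the auto-push settings from environment variables.

    Defaults follow the README: enabled, min step 800, watch AnimaLoraToolkit/output.
    """
    env = os.environ if env is None else env
    return {
        # Assumption: only the literal value "0" disables auto-push.
        "enabled": env.get("LOKR_AUTO_PUSH", "1") != "0",
        "min_step": int(env.get("LOKR_MIN_STEP", "800")),
        "watch_dir": env.get("LOKR_WATCH_DIR", "AnimaLoraToolkit/output"),
    }


print(watcher_settings({}))
print(watcher_settings({"LOKR_AUTO_PUSH": "0", "LOKR_MIN_STEP": "1200"}))
```

Passing a plain dict instead of os.environ makes the lookup easy to check without touching the real environment.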

Start the watcher manually on its own:

python -u AnimaLoraToolkit/utils/watch_and_push_lokr.py \
  --watch-dir AnimaLoraToolkit/output \
  --min-step 800 \
  --interval 20

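File Preservation Rules

A CNB workspace may be reclaimed after a long period of inactivity. Files already written to disk are not guaranteed to persist with the workspace, especially large files, files ignored by .gitignore, and uncommitted output files.

Important LoKr results should therefore meet at least one of these conditions:

Automatically committed and pushed by the watcher.
Manually added with git add -f, then committed and pushed.
Downloaded locally or uploaded to other persistent storage.

Note: until training reaches a save point, the intermediate state held in GPU memory is not preserved. To keep training state, set save_state_every.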

Resuming from a Checkpoint

Training state files usually look like:

training_state_step2000.pt

To resume, set this in the config:

resume_state: "./AnimaLoraToolkit/output/lokr_c8/training_state_step2000.pt"

To continue training from existing LoRA weights only, set:

resume_lora: "./AnimaLoraToolkit/output/lokr_c8/anima-lokrC8_step2000.safetensors"

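The difference between the two:

resume_state restores the optimizer, random number state, step, and loss history, so it is closer to a true resume.
resume_lora only loads the LoRA weights; the optimizer state starts from scratch.

Directory Layout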
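When several state files exist, the one with the highest step is usually the one to put in resume_state. The helper below is a hypothetical convenience, not part of the toolkit; it assumes the training_state_step<N>.pt naming shown above.

```python
import re
from pathlib import Path

# Assumption: state files are named exactly like the example in this README.
STATE_RE = re.compile(r"^training_state_step(\d+)\.pt$")


def latest_state(output_dir):
    """Return (step, path) of the highest-step state file, or None if there is none."""
    best = None
    for path in Path(output_dir).glob("training_state_step*.pt"):
        match = STATE_RE.match(path.name)
        if match is None:
            continue
        step = int(match.group(1))
        if best is None or step > best[0]:
            best = (step, path)
    return best


found = latest_state("./AnimaLoraToolkit/output/lokr_c8")
if found is not None:
    step, path = found
    print(f'resume_state: "{path.as_posix()}"  # step {step}')
```

Comparing the parsed integers avoids picking step800 over step2000, which a string comparison of the file names would do.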

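run.sh                                      # one-command background training + LoKr auto-push
stop_anima_training.sh                      # one-command stop script generated by run.sh
anima_training.log                          # training log
lokr_auto_push.log                          # auto-push log
AnimaLoraToolkit/anima_train.py             # main training program
AnimaLoraToolkit/config/train_my.yaml       # current main training config
AnimaLoraToolkit/output/                    # LoRA/LoKr output directory
AnimaLoraToolkit/utils/watch_and_push_lokr.py  # LoKr watcher (auto commit + push)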

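Common Commands

List the most recently saved LoKr files (sort -V orders step numbers numerically, so step1000 comes after step800):

find AnimaLoraToolkit/output -name "*.safetensors" | sort -V | tail

Check whether training is still running: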

pgrep -af "anima_train.py"
pgrep -af "watch_and_push_lokr.py"

Show recent commits:

git log --oneline -10
