速度60tokens/s,目前不支持多并发
运行命令
/workspace/llama-server-mtp -m /workspace/model/Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf --host 0.0.0.0 --port 5000 -ngl 99 -t 8 --spec-type mtp --spec-draft-n-max 3 -np 1 -c 131072 -ctk q8_0 -ctv q8_0 --reasoning off
openclaw config
"agents": {
"defaults": {
"workspace": "/home/mls/.openclaw/workspace",
"model": {
"primary": "cnb/Qwen3.6-27B-Q4"
},
"models": {
"modelscope/ZhipuAI/GLM-5.1": {"alias": "GLM-5.1"},
"cnb/Qwen3.6-27B-UD-Q4_K_XL.gguf": {"alias": "Qwen3.6-27B-Q4"}
}
}
}
"models": {
"mode": "merge",
"providers": {
"cnb": {
"baseUrl": "https://vd1odtlvc7-8082.cnb.run/v1",
"api": "openai-completions",
"apiKey": "ss-",
"models": [
{
"id": "Qwen3.6-27B-UD-Q4_K_XL.gguf",
"name": "Qwen3.6-27B-Q4",
"contextWindow": 262144,
"maxTokens": 262144,
"input": ["text"],
"cost": {"input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0},
"reasoning": false
}
]
},
"modelscope": {
"baseUrl": "https://api-inference.modelscope.cn/v1",
"api": "openai-completions",
"apiKey": "ms-",
"models": [
{
"id": "ZhipuAI/GLM-5.1",
"name": "GLM-5.1",
"contextWindow": 202752,
"maxTokens": 202752,
"input": ["text"],
"cost": {"input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0},
"reasoning": false
}
]
}
}
}
apikey 随便填,关闭了,视觉识别,个人感觉,目前针对openclaw没啥用,还不完善,报错太多.
Building llama.cpp with MTP Support — Step by Step
1. Clone and enter the repo
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
2. Fetch the latest remote changes
git fetch origin
This pulls all new refs from the upstream repository so you're working off the current tip of master.
3. Fetch PR #22673 as a local branch
git fetch origin pull/22673/head:pr-22673
PR #22673 ("llama + spec: MTP Support") adds the speculative-decoding / MTP infrastructure that lets llama-server consume multi-token prediction heads. We pull it directly without waiting for it to be merged upstream.
4. Checkout master and reset to latest remote
git checkout master
git reset --hard origin/master
Ensures a clean starting point at the current upstream master, discarding any local drift.
5. Merge the PR on top (non-fast-forward)
git merge --no-ff pr-22673 -m "Merge [PR #22673](https://github.com/ggml-org/llama.cpp/pull/22673): llama + spec: MTP Support"
The --no-ff flag preserves a merge commit so you can cleanly cherry-pick or revert later if the PR lands officially and changes.
6. Build llama-server
cmake -B build -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON
cmake --build build --config Release -j 16
This produces build/bin/llama-server.
7. Run the server with MTP enabled
/workspace/llama-server-mtp \
-m /workspace/model/Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf \
--spec-type mtp \
--spec-draft-n-max 3
--spec-type mtp tells llama.cpp to use the MTP heads baked into the GGUF.
--spec-draft-n-max 3 sets the max number of draft tokens per step (matching the model's 3 MTP layers).