artnsyll1/Qwen3.6-27B-MTP-UD-GGUF

Public

WeChat Login

Code Issues Pull requests Events Packages Insights

main

Branch

Tag

Forkfromartnsyll/Qwen3.6-27B-MTP-UD-GGUF

青梧昭明

编辑文件 README.md

f2833a19

6 commits

model
.cnb.yml
.gitattributes
README.md
llama-server-mtp

速度60tokens/s,目前不支持多并发

运行命令

/workspace/llama-server-mtp -m /workspace/model/Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf --host 0.0.0.0 --port 5000 -ngl 99 -t 8 --spec-type mtp --spec-draft-n-max 3 -np 1 -c 131072 -ctk q8_0  -ctv q8_0 --reasoning off

openclaw config

  "agents": {
    "defaults": {
      "workspace": "/home/mls/.openclaw/workspace",
      "model": {
        "primary": "cnb/Qwen3.6-27B-Q4"
      },
      "models": {
        "modelscope/ZhipuAI/GLM-5.1": {"alias": "GLM-5.1"},
        "cnb/Qwen3.6-27B-UD-Q4_K_XL.gguf": {"alias": "Qwen3.6-27B-Q4"}
      }
    }
  }

  "models": {
    "mode": "merge",
    "providers": {
      "cnb": {
        "baseUrl": "https://vd1odtlvc7-8082.cnb.run/v1",
        "api": "openai-completions",
        "apiKey": "ss-",
        "models": [
          {
            "id": "Qwen3.6-27B-UD-Q4_K_XL.gguf",
            "name": "Qwen3.6-27B-Q4",
            "contextWindow": 262144,
            "maxTokens": 262144,
            "input": ["text"],
            "cost": {"input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0},
            "reasoning": false
          }
        ]
      },
      "modelscope": {
        "baseUrl": "https://api-inference.modelscope.cn/v1",
        "api": "openai-completions",
        "apiKey": "ms-",
        "models": [
          {
            "id": "ZhipuAI/GLM-5.1",
            "name": "GLM-5.1",
            "contextWindow": 202752,
            "maxTokens": 202752,
            "input": ["text"],
            "cost": {"input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0},
            "reasoning": false
          }
        ]
      }
    }
  }

apikey 随便填,关闭了,视觉识别,个人感觉,目前针对openclaw没啥用,还不完善,报错太多.

Building llama.cpp with MTP Support — Step by Step
1. Clone and enter the repo
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

2. Fetch the latest remote changes
git fetch origin

This pulls all new refs from the upstream repository so you're working off the current tip of master.

3. Fetch PR #22673 as a local branch
git fetch origin pull/22673/head:pr-22673

PR #22673 ("llama + spec: MTP Support") adds the speculative-decoding / MTP infrastructure that lets llama-server consume multi-token prediction heads. We pull it directly without waiting for it to be merged upstream.

4. Checkout master and reset to latest remote
git checkout master
git reset --hard origin/master

Ensures a clean starting point at the current upstream master, discarding any local drift.

5. Merge the PR on top (non-fast-forward)
git merge --no-ff pr-22673 -m "Merge [PR #22673](https://github.com/ggml-org/llama.cpp/pull/22673): llama + spec: MTP Support"

The --no-ff flag preserves a merge commit so you can cleanly cherry-pick or revert later if the PR lands officially and changes.

6. Build llama-server
cmake -B build -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON

cmake --build build --config Release  -j 16

This produces build/bin/llama-server.

7. Run the server with MTP enabled
/workspace/llama-server-mtp \
  -m /workspace/model/Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf \
  --spec-type mtp \
  --spec-draft-n-max 3

--spec-type mtp tells llama.cpp to use the MTP heads baked into the GGUF.
--spec-draft-n-max 3 sets the max number of draft tokens per step (matching the model's 3 MTP layers).