Problem description

Launching the qwq-32b model with MindIE fails with AttributeError: 'ForkAwareLocal' object has no attribute 'connection'. The full error output is:

[2025-03-25 21:50:29,878] [1125] [281464557924704] [llm] [INFO] [dist.py-81] : initialize_distributed has been Set
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
Process ForkServerProcess-8:
Traceback (most recent call last):
  File "/usr/lib64/python3.11/multiprocessing/managers.py", line 814, in _callmethod
    conn = self._tls.connection
           ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 71, in wrapper
    raise exp
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 63, in wrapper
    func(*args, **kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 264, in task_distribute
    resource_proxy[SUB_PROCESS_STATE].append(True)
  File "<string>", line 2, in append
  File "/usr/lib64/python3.11/multiprocessing/managers.py", line 818, in _callmethod
    self._connect()
  File "/usr/lib64/python3.11/multiprocessing/managers.py", line 805, in _connect
    conn = self._Client(self._token.address, authkey=self._authkey)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/connection.py", line 518, in Client
    c = SocketClient(address)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/connection.py", line 646, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
Process ForkServerProcess-6:
Traceback (most recent call last):
  File "/usr/lib64/python3.11/multiprocessing/managers.py", line 814, in _callmethod
    conn = self._tls.connection
           ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 71, in wrapper
    raise exp
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 63, in wrapper
    func(*args, **kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 264, in task_distribute
    resource_proxy[SUB_PROCESS_STATE].append(True)
  File "<string>", line 2, in append
  File "/usr/lib64/python3.11/multiprocessing/managers.py", line 818, in _callmethod
    self._connect()
  File "/usr/lib64/python3.11/multiprocessing/managers.py", line 805, in _connect
    conn = self._Client(self._token.address, authkey=self._authkey)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/connection.py", line 518, in Client
    c = SocketClient(address)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/connection.py", line 646, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

The log above does not reveal the root cause. Edit the mindie-service configuration file to set the log level to debug, restart mindie-service, and check the log again.

vim /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json
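If you prefer to script the change instead of editing by hand, the following is a minimal sketch that overwrites one nested key in a JSON file. The helper name set_nested_key is hypothetical, and the key path and value shown in the usage comment ("LogConfig", "logLevel", "Debug") are assumptions about the config layout, not taken from the original post; open your own config.json to confirm the real key names first. The helper raises KeyError when the path does not exist, so a wrong guess fails loudly instead of silently adding a new key.

```python
import json
from pathlib import Path


def set_nested_key(config_path, key_path, value):
    """Overwrite an existing nested key in a JSON file and return its old value.

    key_path is a list of keys from the top level down to the target key.
    Raises KeyError if any key in the path is missing.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))

    # Walk down to the parent object of the target key.
    node = config
    for key in key_path[:-1]:
        node = node[key]

    last = key_path[-1]
    if last not in node:
        raise KeyError(last)

    previous = node[last]
    node[last] = value
    path.write_text(
        json.dumps(config, indent=4, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return previous


# Hypothetical usage; the key path and value are assumptions, verify them first:
# set_nested_key(
#     "/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json",
#     ["LogConfig", "logLevel"],
#     "Debug",
# )
```

Back up the original file before running anything like this, since the JSON is re-serialized and its formatting may change.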

The error message is as follows:

[2025-03-26 20:23:13.026+0800] [18045] [281473177874784] [mindie-server] [ERROR] [model.py:40] : [Model]        >>> Exception:Error while deserializing header: HeaderTooLarge
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/model_wrapper/model.py", line 38, in initialize
    return self.python_model.initialize(config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/model_wrapper/standard_model.py", line 93, in initialize
    self.generator = Generator(
                     ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/generator.py", line 85, in __init__
    self.generator_backend = get_generator_backend(model_config)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/adapter/__init__.py", line 26, in get_generator_backend
    return generator_cls(model_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/adapter/generator_torch.py", line 97, in __init__
    super().__init__(model_config)
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/adapter/generator_backend.py", line 113, in __init__
    self.model_wrapper = get_model_wrapper(model_config, backend_type)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/modeling/model_wrapper/__init__.py", line 15, in get_model_wrapper
    return wrapper_cls(**model_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/modeling/model_wrapper/atb/atb_model_wrapper.py", line 52, in __init__
    self.model_runner.load_weights()
  File "/usr/local/Ascend/atb-models/atb_llm/runner/model_runner.py", line 149, in load_weights
    weights = Weights(
              ^^^^^^^^
  File "/usr/local/Ascend/atb-models/atb_llm/utils/weights.py", line 49, in __init__
    routing = self.load_routing(process_group)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Ascend/atb-models/atb_llm/utils/weights.py", line 77, in load_routing
    with safe_open(filename, framework="pytorch") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
[2025-03-26 20:23:13.026+0800] [18045] [281473177874784] [mindie-server] [ERROR] [model.py:43] : [MIE04E13030A] [Model] >>> return initialize error result: {'status': 'error', 'npuBlockNum': '0', 'cpuBlockNum': '0'}

This pointed to a problem with the weights. On inspection, the weight files had not been downloaded completely.
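A .safetensors file starts with an 8-byte little-endian unsigned integer giving the length of a JSON header, followed by the header and then the tensor data. HeaderTooLarge generally means that this length prefix decoded to an implausibly large value, which tends to happen when the file is not really a safetensors file (for example a small text placeholder saved under that name) or is otherwise corrupted. The sketch below checks each shard using only the standard library; the function names check_safetensors and scan_weight_dir are hypothetical, and the check is a rough sanity test, not a replacement for the safetensors library's own validation.

```python
import json
import struct
from pathlib import Path


def check_safetensors(path):
    """Return "ok" or a short description of the first problem found."""
    path = Path(path)
    file_size = path.stat().st_size
    with path.open("rb") as f:
        prefix = f.read(8)
        if len(prefix) < 8:
            return "file is shorter than the 8-byte length prefix"
        # The first 8 bytes hold the JSON header length (little-endian u64).
        (header_len,) = struct.unpack("<Q", prefix)
        if header_len > file_size - 8:
            return f"header length {header_len} exceeds file size {file_size}"
        try:
            header = json.loads(f.read(header_len))
        except ValueError:  # covers invalid UTF-8 and invalid JSON
            return "header is not valid JSON"

    if not isinstance(header, dict):
        return "header is not a JSON object"

    # The largest end offset tells how many data bytes must follow the header.
    try:
        data_end = max(
            (entry["data_offsets"][1]
             for name, entry in header.items()
             if name != "__metadata__"),
            default=0,
        )
    except (KeyError, IndexError, TypeError):
        return "header entries are malformed"

    expected = 8 + header_len + data_end
    if file_size < expected:
        return f"truncated: expected at least {expected} bytes, found {file_size}"
    return "ok"


def scan_weight_dir(weight_dir):
    """Map every *.safetensors file name in weight_dir to its check result."""
    return {
        shard.name: check_safetensors(shard)
        for shard in sorted(Path(weight_dir).glob("*.safetensors"))
    }
```

Calling scan_weight_dir on the weight directory and printing the entries whose value is not "ok" lists the shards that need to be downloaded again.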

du -sh *
# run inside the weights directory

Sizes of the source files:

Re-downloading the weight files resolved the problem.
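After re-downloading, one quick sanity check is to confirm that every shard the checkpoint index refers to is actually present, and to list the on-disk sizes for comparison with the source. The sketch below assumes the weight directory follows the Hugging Face sharded layout with a model.safetensors.index.json file containing a "weight_map" object; that assumption, and the function names missing_shards and shard_sizes, are mine and not from the original post.

```python
import json
from pathlib import Path


def missing_shards(weight_dir, index_name="model.safetensors.index.json"):
    """Return sorted shard names that the index references but that are absent or empty."""
    weight_dir = Path(weight_dir)
    index = json.loads((weight_dir / index_name).read_text(encoding="utf-8"))
    # weight_map maps each tensor name to the shard file that stores it.
    referenced = set(index["weight_map"].values())
    return sorted(
        name
        for name in referenced
        if not (weight_dir / name).is_file()
        or (weight_dir / name).stat().st_size == 0
    )


def shard_sizes(weight_dir):
    """Return {file name: size in bytes} for every *.safetensors file in weight_dir."""
    return {
        shard.name: shard.stat().st_size
        for shard in sorted(Path(weight_dir).glob("*.safetensors"))
    }
```

An empty list from missing_shards only shows that the files exist and are non-empty; the byte counts from shard_sizes still need to be compared against the sizes published by the source to catch a partial download.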
