Cannot run the double-die version of the demo model Deepseek-7B #49

@chripell

I followed the NPU memory adjustment instructions at https://github.com/DC-DeepComputing/Framework/blob/main/FML13V03/DC-ROMA%20RISC-V%20AI%20PC%2C%20RISC-V%20Mainboard%20II%20NPU%20Memory%20Adjustment%20Instructions.md for the 32 GB RAM version (which I have) to expand the reserved memory. Here is the dmesg output:

[    0.000000] Linux version 6.6.92-eic7x-2025.07 (root@b67a2314ece6) (riscv64-unknown-linux-gnu-gcc () 13.2.0, GNU ld (GNU Binutils) 2.42) #2025.09.26.03.45+ SMP Fri Sep 26 03:53:01 UTC 2025
[    0.000000] Machine model: DeepComputing FML13V03
[    0.000000] SBI specification v1.0 detected
[    0.000000] SBI implementation ID=0x1 Version=0x10003
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI SRST extension detected
[    0.000000] earlycon: sbi0 at I/O port 0x0 (options '')
[    0.000000] printk: bootconsole [sbi0] enabled
[    0.000000] efi: EFI v2.10 by Das U-Boot
[    0.000000] efi: RTPROP=0xe8cc8040 SMBIOS=0xe8cf5000 INITRD=0xe33bc040 MEMRESERVE=0xe33bb040 
[    0.000000] OF: reserved mem: OVERLAP DETECTED!
               mmz_nid_0_part_0@1,c0000000 (0x00000001c0000000--0x0000000480000000) overlaps with g2d_8GB_boundary_reserved_4k (0x00000001fe000000--0x0000000200000000)
[    0.000000] OF: reserved mem: OVERLAP DETECTED!
               mmz_nid_1_part_0@21,40000000 (0x0000002140000000--0x0000002400000000) overlaps with d1_g2d_8GB_boundary_reserved_4k (0x00000021fe000000--0x0000002200000000)
[    0.000000] Reserved memory: created CMA memory pool at 0x0000002120000000, size 512 MiB
[    0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
[    0.000000] OF: reserved mem: 0x0000002120000000..0x000000213fffffff (524288 KiB) map reusable linux,cma
[    0.000000] OF: reserved mem: 0x0000000059000000..0x00000000593fffff (4096 KiB) nomap non-reusable sprammemory@59000000
[    0.000000] OF: reserved mem: 0x0000000079000000..0x00000000793fffff (4096 KiB) nomap non-reusable sprammemory@79000000
[    0.000000] OF: reserved mem: 0x0000000080000000..0x000000008007ffff (512 KiB) nomap non-reusable mmode_resv0@80000000
[    0.000000] OF: reserved mem: 0x00000000dffe0000..0x00000000dfffffff (128 KiB) nomap non-reusable lpcpures@dffe0000
[    0.000000] OF: reserved mem: 0x00000000e0000000..0x00000000e1ffffff (32768 KiB) nomap non-reusable region@e0000000
[    0.000000] OF: reserved mem: 0x00000000fff00000..0x00000000ffffffff (1024 KiB) nomap non-reusable ramoops@fff00000
[    0.000000] Reserved memory: created mmz_nid_0_part_0@1,c0000000 eswin reserve memory at 0x00000001c0000000, size 11264 MiB
[    0.000000] OF: reserved mem: initialized node mmz_nid_0_part_0@1,c0000000, compatible id eswin-reserve-memory
[    0.000000] OF: reserved mem: 0x00000001c0000000..0x000000047fffffff (11534336 KiB) nomap non-reusable mmz_nid_0_part_0@1,c0000000
[    0.000000] OF: reserved mem: 0x00000001fe000000..0x00000001ffffffff (32768 KiB) nomap non-reusable g2d_8GB_boundary_reserved_4k
[    0.000000] OF: reserved mem: 0x00000002fffff000..0x00000002ffffffff (4 KiB) nomap non-reusable g2d_12GB_boundary_reserved_4k
[    0.000000] OF: reserved mem: 0x0000002040000000..0x00000020403fffff (4096 KiB) nomap non-reusable nid_1_zero_device_simu@2040000000
[    0.000000] OF: reserved mem: 0x00000020e0000000..0x00000020e1ffffff (32768 KiB) nomap non-reusable region@20,e0000000
[    0.000000] OF: reserved mem: 0x00000020fffff000..0x00000020ffffffff (4 KiB) nomap non-reusable d1_g2d_4GB_boundary_reserved_4k
[    0.000000] Reserved memory: created mmz_nid_1_part_0@21,40000000 eswin reserve memory at 0x0000002140000000, size 11264 MiB
[    0.000000] OF: reserved mem: initialized node mmz_nid_1_part_0@21,40000000, compatible id eswin-reserve-memory
[    0.000000] OF: reserved mem: 0x0000002140000000..0x00000023ffffffff (11534336 KiB) nomap non-reusable mmz_nid_1_part_0@21,40000000
[    0.000000] OF: reserved mem: 0x00000021fe000000..0x00000021ffffffff (32768 KiB) nomap non-reusable d1_g2d_8GB_boundary_reserved_4k
[    0.000000] OF: NUMA: parsing numa-distance-map-v1
[    0.000000] NUMA: NODE_DATA [mem 0x1bfffe1c0-0x1bfffffff]
[    0.000000] NUMA: NODE_DATA [mem 0x211fcb71c0-0x211fcb8fff]
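The "OVERLAP DETECTED!" warnings above can be cross-checked offline. The following is a minimal Python sketch, not part of the vendor guide, that parses the `OF: reserved mem:` range lines from a saved dmesg capture and lists every pair of regions whose address ranges intersect. The names `parse_regions` and `find_overlaps` are hypothetical helpers, and the sample text is three lines copied from the log above. It only reports intersections; it does not say whether an overlap is harmful or related to the hang.

```python
import re
from itertools import combinations

# Matches range lines such as:
# OF: reserved mem: 0x00000001c0000000..0x000000047fffffff (11534336 KiB) nomap non-reusable <name>
REGION_RE = re.compile(
    r"OF: reserved mem: (0x[0-9a-fA-F]+)\.\.(0x[0-9a-fA-F]+) \((\d+) KiB\) \S+ \S+ (\S+)"
)

# Three lines taken verbatim from the dmesg output in this issue.
SAMPLE = """\
[    0.000000] OF: reserved mem: 0x00000001c0000000..0x000000047fffffff (11534336 KiB) nomap non-reusable mmz_nid_0_part_0@1,c0000000
[    0.000000] OF: reserved mem: 0x00000001fe000000..0x00000001ffffffff (32768 KiB) nomap non-reusable g2d_8GB_boundary_reserved_4k
[    0.000000] OF: reserved mem: 0x00000002fffff000..0x00000002ffffffff (4 KiB) nomap non-reusable g2d_12GB_boundary_reserved_4k
"""


def parse_regions(dmesg_text):
    """Return (name, start, end) tuples; end is inclusive, as printed by the kernel."""
    regions = []
    for m in REGION_RE.finditer(dmesg_text):
        start = int(m.group(1), 16)
        end = int(m.group(2), 16)
        size_kib = int(m.group(3))
        # Sanity check: the printed size should match the printed range.
        assert (end - start + 1) // 1024 == size_kib
        regions.append((m.group(4), start, end))
    return regions


def find_overlaps(regions):
    """Return name pairs for all regions whose inclusive ranges intersect."""
    overlaps = []
    for (n1, s1, e1), (n2, s2, e2) in combinations(regions, 2):
        if s1 <= e2 and s2 <= e1:
            overlaps.append((n1, n2))
    return overlaps


if __name__ == "__main__":
    for a, b in find_overlaps(parse_regions(SAMPLE)):
        print(f"{a} overlaps {b}")
```

To run it against a full capture, replace `SAMPLE` with the contents of a file saved via `dmesg > dmesg.txt`. Because the sketch compares every pair of regions, it can list more intersections than the kernel warnings do.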

Then I followed the Deepseek-7B install guide at https://github.com/DC-DeepComputing/Framework/blob/main/FML13V03/DC-ROMA%20RISC-V%20AI%20PC%20Install%20AI%20Models(Deepseek-7B)%20Guide.md

The single-die model works. Unfortunately, the dual-die model does not. It just hangs after I enter the question:

root@roma:/home/roma# /opt/eswin/sample-code/npu_sample/qwen_sample/bin/es_qwen2 /opt/eswin/sample-code/npu_sample/qwen_sample/src/deepseek_7b_1k_int8_peer/config.json
Loading models: [==================================================] 100.00% ( 70.834682 seconds )
----------------------------------------------------------------------------------
0: Role setting: 你是一个智能助理.
----------------------------------------------------------------------------------
1: 介绍一下大语言模型
2: The quantum computers
3: Humans and robots coexist
4: Customized prompts
----------------------------------------------------------------------------------
[YOU]: 4
[YOU]: Who are you?

On the serial console, I get messages that suggest a driver bug is blocking the system:

[ 1344.354910] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1344.360906] rcu:     0-...0: (5 ticks this GP) idle=f28c/1/0x4000000000000002 softirq=26168/26170 fqs=7031
[ 1344.370318] rcu:              hardirqs   softirqs   csw/system
[ 1344.375895] rcu:      number: 45651993          0            0
[ 1344.381473] rcu:     cputime:        0          0            0   ==> 30028(ms)
[ 1344.388441] rcu:     (detected by 6, t=15010 jiffies, g=39517, q=632 ncpus=8)
[ 1451.314391] INFO: task kworker/u21:2:332 blocked for more than 120 seconds.
[ 1451.321379]       Not tainted 6.6.92-eic7x-2025.07 #2025.09.26.03.45+
[ 1451.327832] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1451.335841] INFO: task login:834 blocked for more than 120 seconds.
[ 1451.342117]       Not tainted 6.6.92-eic7x-2025.07 #2025.09.26.03.45+
[ 1451.348571] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
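When a serial log like this grows long, it can help to pull out just the blocked tasks. The sketch below is an illustration only: `blocked_tasks` is a hypothetical helper, and it assumes the hung-task lines have the exact form shown above (`INFO: task <name>:<pid> blocked for more than <N> seconds.`).

```python
import re

# Task names may themselves contain colons (e.g. "kworker/u21:2"),
# so the name group is greedy and the PID is the last ":<digits>" before " blocked".
HUNG_RE = re.compile(r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds")

# Two lines taken verbatim from the serial console output in this issue.
SAMPLE_LOG = """\
[ 1451.314391] INFO: task kworker/u21:2:332 blocked for more than 120 seconds.
[ 1451.335841] INFO: task login:834 blocked for more than 120 seconds.
"""


def blocked_tasks(log_text):
    """Return (task_name, pid, seconds) for each hung-task report in the log."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in HUNG_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    for name, pid, secs in blocked_tasks(SAMPLE_LOG):
        print(f"{name} (pid {pid}) blocked for more than {secs}s")
```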

If I press Ctrl-C, I get:

[YOU]: Who are you?
^C^Cterminate called after throwing an instance of 'std::system_error'
  what():  Resource deadlock avoided
^C^C^C^C

and at that point the system is deadlocked for good. I am only following the documentation linked from another issue, so I don't really know what I am doing, but I would like to see the demo working before trying anything more complex.

Thanks!
