
Helium SmartNIC Function Verification: Offloading OVS, vFW, and SSL Encryption/Decryption

1 Solution Overview

This document describes the Helium SmartNIC (hereinafter "the SmartNIC") solution and verifies the card's ability to offload VNF functions. The verification covers three items:

  • Offloading OVS to the SmartNIC;
  • Offloading a VM-based VNF (vFW) to the SmartNIC;
  • Offloading a container-based VNF (SSL encryption/decryption) to the SmartNIC.

2 Hardware and Software Environment

The hardware and software used in the verification are listed in Table 2-1 and Table 2-2.

Figure 2-1: Physical topology of the verification setup

Name            Model      Hardware spec              Qty
SmartNIC        EC2004Y    [see product datasheet]    1
Server          x86        must fit full-height NICs  2
Optical module  25G        SFP28                      2
Fiber           multimode  suitable for 10G/25G       1

Table 2-1: Hardware environment

Software              Version          Notes
Host OS               CentOS 7.8.2003
Installation package  helium-V1.0.zip  download from support.asterfusion.com

Table 2-2: Software environment

3 Verification Approach and Procedure

3.1 Offloading OVS to the SmartNIC

3.1.1 Approach

The card ships with its operating system preinstalled, so only basic setup is required: configuring the management port, installing and debugging the drivers, and so on.

Figure 3-1: OVS offload verification topology

With the basic setup complete, run OVS on the SmartNIC and create a test bridge, then start two VMs on the host and connect them to the bridge. Testing shows that the two VMs can communicate through the OVS bridge on the SmartNIC, proving that OVS still provides normal service after being offloaded to the card.

3.1.2 Procedure

3.1.2.1 Basic configuration of the host and the SmartNIC
# Modify the kernel boot parameters
# Edit /etc/default/grub and append the following to GRUB_CMDLINE_LINUX:
intel_iommu=on iommu=pt pci=assign-busses pcie_acs_override=downstream
[root@asterfusion ~]# cat /etc/default/grub
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt pci=assign-busses pcie_acs_override=downstream crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet"
GRUB_DISABLE_RECOVERY="true"
# Finally, run the following command, then reboot the host
[root@asterfusion ~]# grub2-mkconfig -o /boot/grub2/grub.cfg

# Configure the PCI upstream bus address on the SmartNIC
root@OCTEONTX:~# ifconfig mvmgmt0 down 2>/dev/null 
root@OCTEONTX:~# echo 0 > /sys/bus/pci/devices/0000\:05\:00.0/sriov_numvfs 
root@OCTEONTX:~# rmmod mgmt_net 2>/dev/null 
root@OCTEONTX:~# rmmod pcie_ep 
root@OCTEONTX:~# rmmod dpi_dma 
root@OCTEONTX:~# echo 1 > /sys/bus/pci/devices/0000\:05\:00.0/sriov_numvfs   
root@OCTEONTX:~# modprobe dpi_dma 
# Run `lspci -tv | grep -C2 b200` to obtain the card's host-sid; on this machine the bus is 03, so host_sid below is 0x30300
root@OCTEONTX:~# modprobe pcie_ep    host_sid=0x30300 pem_num=0 epf_num=0
# Configure the mvmgmt0 port
root@OCTEONTX:~#modprobe mgmt_net 
root@OCTEONTX:~#ifconfig mvmgmt0 12.12.12.12  
root@OCTEONTX:~#ifconfig mvmgmt0 up

# Load the NIC driver on the host
[root@asterfusion ~]#tar -xvf Helium-Driver-V1.0R1.tar.gz
[root@asterfusion ~]#cd Helium-ep-driver
[root@asterfusion ~]#make
[root@asterfusion ~]#insmod ./drivers/legacy/modules/driver/src/host/linux/kernel/drv/octeon_drv.ko num_vfs=4
[root@asterfusion~]#insmod ./drivers/mgmt_net/mgmt_net.ko
[root@asterfusion~]#ifconfig mvmgmt0 12.12.12.1
[root@asterfusion~]#ifconfig mvmgmt0 up

# After the above configuration, confirm the virtual port information on the SmartNIC and the host
# On the SmartNIC, confirm the modules are loaded
root@OCTEONTX:~# lspci -nn -d 177d:a0f7
# On the host, confirm the driver is loaded
[root@asterfusion~]# lspci | grep b203
# If loading fails on the host, or the virtual ports need to be shut down or changed, unload the drivers first and then reload them following the steps above
[root@asterfusion~]# ifconfig mvmgmt0 down
[root@asterfusion~]# rmmod mgmt_net
[root@asterfusion~]# rmmod octeon_drv
# Note: always load the driver on the SmartNIC before loading it on the host
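The host_sid value passed to pcie_ep above is derived from the PCI bus number found via `lspci -tv | grep -C2 b200`. The sketch below is illustrative only and assumes the pattern implied by the single example in this document (bus 03 gives 0x30300, i.e. the bus number shifted up with a fixed 0x300 suffix); confirm against your card's documentation before relying on it.

```python
# Hypothetical helper (assumption based on the bus-03 -> 0x30300 example above):
# derive the pcie_ep host_sid module parameter from the PCI bus number.
def host_sid(bus: int) -> int:
    """Return the assumed host_sid value for a given PCI bus number."""
    return (bus << 16) | 0x300

print(hex(host_sid(0x03)))  # bus 03, as on the machine in this document -> 0x30300
```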
3.1.2.2 Installing OVS and related services

DPDK and OVS must be installed on both the host and the SmartNIC.

# Upload the helium-V1.0.zip archive to a directory on the host and unzip it.
[root@asterfusion~]# unzip helium-V1.0.zip
# Install DPDK on the host
[root@asterfusion~]# tar -zxvf helium-v1.0/Helium-DPDK19.11-V1.0R1.tar.gz 
[root@asterfusion~]# cd helium-v1.0/dpdk-19.11 
[root@asterfusion~]# export RTE_SDK=$PWD 
[root@asterfusion~]# export RTE_TARGET=build 
[root@asterfusion~]# make config T=x86_64-native-linuxapp-gcc 
[root@asterfusion~]# make
# Check the PCI addresses of the VF ports
[root@asterfusion~]#lspci | grep b203
# Load the vfio driver on the host
[root@asterfusion~]#modprobe vfio-pci
[root@asterfusion~]#echo 1 > /sys/module/vfio/parameters/enable_unsafe_noiommu_mode
[root@asterfusion~]# helium-v1.0/dpdk-19.11/usertools/dpdk-devbind.py -b vfio-pci  0000:03:02.0  0000:03:02.1 0000:03:02.2 0000:03:02.3
# Install the packages required by virtio-forwarder on the host
[root@asterfusion~]# yum install protobuf.x86_64 -y
[root@asterfusion~]# yum install protobuf-c.x86_64 -y
[root@asterfusion~]# yum install czmq.x86_64 -y
[root@asterfusion~]# helium-v1.0/Helium-VirtioForwarder-V1.0R1-intel-ivb.bin
# Enable hugepages on the host
[root@asterfusion~]# sysctl vm.nr_hugepages=20480
# Start the virtio-forwarder service
[root@asterfusion~]# systemctl start virtio-forwarder.service
# Add the vhost and VF ports
[root@asterfusion~]# /usr/local/lib/virtio-forwarder/virtioforwarder_port_control add_sock --vhost-path="/tmp/vhost1.sock" --pci-addr="03:02.0" --tso=on --mtu=9000
[root@asterfusion~]# /usr/local/lib/virtio-forwarder/virtioforwarder_port_control add_sock --vhost-path="/tmp/vhost2.sock" --pci-addr="03:02.1" --tso=on --mtu=9000
# Verify the added port configuration
[root@asterfusion~]# /usr/local/lib/virtio-forwarder/virtioforwarder_stats -d 0

# Upload the helium-V1.0.zip archive to the /data directory on the SmartNIC and unzip it.
root@OCTEONTX:/data# unzip helium-V1.0.zip
# Install DPDK on the SmartNIC
root@OCTEONTX:/data# tar -zxvf Helium-DPDK19.11-V1.0R1.tar.gz 
root@OCTEONTX:/data# cd dpdk-19.11 
root@OCTEONTX:/data# export RTE_SDK=$PWD 
root@OCTEONTX:/data# export RTE_TARGET=build 
root@OCTEONTX:/data# make config T=arm64-octeontx2-linux-gcc 
root@OCTEONTX:/data# make -j8
# Enable hugepages on the SmartNIC
root@OCTEONTX:/data# sysctl vm.nr_hugepages=32
# Bind the ports
root@OCTEONTX:/data# /data/helium-v1.0/dpdk-19.11/usertools/dpdk-devbind.py -b vfio-pci 0002:02:00.0  0002:0f:00.2 0002:0f:00.3
# Install OVS on the SmartNIC
root@OCTEONTX:/data# chmod +x Helium-OvS-V1.0R1.bin
root@OCTEONTX:/data# ./Helium-OvS-V1.0R1.bin
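The hugepage settings used above reserve quite different amounts of memory on each side. A quick sanity check, assuming the default 2 MiB hugepage size on both the x86 host and the aarch64 card (an assumption; check `/proc/meminfo` for the actual Hugepagesize):

```python
# Memory reserved by the `sysctl vm.nr_hugepages=...` settings above,
# assuming 2 MiB hugepages (verify with /proc/meminfo on each machine).
HUGEPAGE_MiB = 2

host_pages = 20480   # vm.nr_hugepages on the host
card_pages = 32      # vm.nr_hugepages on the SmartNIC

print(f"host: {host_pages * HUGEPAGE_MiB // 1024} GiB")   # 40 GiB
print(f"card: {card_pages * HUGEPAGE_MiB} MiB")           # 64 MiB
```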
3.1.2.3 Verifying OVS
# Start OVS on the SmartNIC
root@OCTEONTX:/data# cd ovs_install 
root@OCTEONTX:/data/ovs_install# chmod +x ovs_start.sh
root@OCTEONTX:/data/ovs_install# ./ovs_start.sh
# Verify the OVS and DPDK versions
root@OCTEONTX:/data/ovs_install# ovs-vsctl get Open_vSwitch . dpdk_initialized
true
root@OCTEONTX:/data/ovs_install# ovs-vsctl get Open_vSwitch . dpdk_version
"DPDK 19.11.0"
root@OCTEONTX:/data/ovs_install# ovs-vswitchd --version
ovs-vswitchd (Open vSwitch) 2.11.1
DPDK 19.11.0
3.1.2.4 Configuring the management and service bridges on the SmartNIC
# Create the management bridge and move the SmartNIC's management IP onto it
root@OCTEONTX:~# ovs-vsctl add-br br-m -- set bridge br-m datapath_type=netdev
root@OCTEONTX:~# ip add del dev eth4 192.168.5.45/24
root@OCTEONTX:~# ovs-vsctl add-port br-m eth4
root@OCTEONTX:~# ip link set dev br-m up
root@OCTEONTX:~# ip add add dev br-m 192.168.5.45/24
root@OCTEONTX:~# ip route add default via 192.168.5.1 dev br-m
# Create the service bridge and connect the SmartNIC's physical port eth0 to it
# Check the PCI addresses of the SmartNIC's physical ports
root@OCTEONTX:/data/helium-v1.0# lspci|grep a063
0002:02:00.0 Ethernet controller: Cavium, Inc. Device a063 (rev 09)
0002:03:00.0 Ethernet controller: Cavium, Inc. Device a063 (rev 09)
0002:04:00.0 Ethernet controller: Cavium, Inc. Device a063 (rev 09)
0002:05:00.0 Ethernet controller: Cavium, Inc. Device a063 (rev 09)
root@OCTEONTX:~# ovs-vsctl add-br br-net -- set bridge br-net datapath_type=netdev
root@OCTEONTX:~# ovs-vsctl add-port br-net eth0 -- set Interface eth0 type=dpdk options:dpdk-devargs=0002:02:00.0  mtu_request=9000 
root@OCTEONTX:~# ip link set dev br-net up
3.1.2.5 Creating two VMs on the host and connecting them to the service bridge on the SmartNIC
# Edit each VM's XML definition to add a vhost-user virtual NIC.
# centos-00:
<domain type='kvm' id='16'>
  <name>centos-00</name>
  <uuid>549a2cc5-0b8b-4b7a-acd5-6171d0e85000</uuid>
  <memory unit='KiB'>2194432</memory>
  <currentMemory unit='KiB'>2194304</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB' nodeset='0'/>
    </hugepages>
  </memoryBacking>
  <vcpu placement='static'>4</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-rhel7.6.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <vmport state='off'/>
  </features>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Haswell-noTSX-IBRS</model>
    <vendor>Intel</vendor>
    <feature policy='require' name='vme'/>
    <feature policy='require' name='ss'/>
    <feature policy='require' name='f16c'/>
    <feature policy='require' name='rdrand'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='arat'/>
    <feature policy='require' name='tsc_adjust'/>
    <feature policy='require' name='md-clear'/>
    <feature policy='require' name='stibp'/>
    <feature policy='require' name='ssbd'/>
    <feature policy='require' name='xsaveopt'/>
    <feature policy='require' name='pdpe1gb'/>
    <feature policy='require' name='abm'/>
    <feature policy='require' name='ibpb'/>
    <numa>
      <cell id='0' cpus='0-3' memory='2194432' unit='KiB' memAccess='shared'/>
    </numa>
  </cpu>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/home/CentOS-7-x86_64-GenericCloud-01.qcow2'/>
      <backingStore/>
      <target dev='hda' bus='ide'/>
      <alias name='ide0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <alias name='usb'/>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0' multifunction='on'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <alias name='usb'/>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <alias name='usb'/>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='pci' index='1' model='pci-bridge'>
      <model name='pci-bridge'/>
      <target chassisNr='1'/>
      <alias name='pci.1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </controller>
    <controller type='pci' index='2' model='pci-bridge'>
      <model name='pci-bridge'/>
      <target chassisNr='2'/>
      <alias name='pci.2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </controller>
    <controller type='pci' index='3' model='pci-bridge'>
      <model name='pci-bridge'/>
      <target chassisNr='3'/>
      <alias name='pci.3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </controller>
    <controller type='ide' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </controller>
    <interface type='vhostuser'>
      <source type='unix' path='/tmp/vhost1.sock' mode='server'/>
      <model type='virtio'/>
      <mtu size='9000'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/4'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/4'>
      <source path='/dev/pts/4'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <channel type='spicevmc'>
      <target type='virtio' name='com.redhat.spice.0' state='disconnected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'>
      <alias name='input0'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input1'/>
    </input>
    <graphics type='spice' port='5900' autoport='yes' listen='127.0.0.1'>
      <listen type='address' address='127.0.0.1'/>
      <image compression='off'/>
    </graphics>
    <sound model='ich6'>
      <alias name='sound0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </sound>
    <video>
      <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1' primary='yes'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <redirdev bus='usb' type='spicevmc'>
      <alias name='redir0'/>
      <address type='usb' bus='0' port='1'/>
    </redirdev>
    <redirdev bus='usb' type='spicevmc'>
      <alias name='redir1'/>
      <address type='usb' bus='0' port='2'/>
    </redirdev>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>
</domain>


# centos-01:
<domain type='kvm' id='15'>
  <name>centos-01</name>
  <uuid>549a2cc5-0b8b-4b7a-acd5-6171d0e85001</uuid>
  <memory unit='KiB'>2194432</memory>
  <currentMemory unit='KiB'>2194304</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB' nodeset='0'/>
    </hugepages>
  </memoryBacking>
  <vcpu placement='static'>4</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-rhel7.6.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <vmport state='off'/>
  </features>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Haswell-noTSX-IBRS</model>
    <vendor>Intel</vendor>
    <feature policy='require' name='vme'/>
    <feature policy='require' name='ss'/>
    <feature policy='require' name='f16c'/>
    <feature policy='require' name='rdrand'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='arat'/>
    <feature policy='require' name='tsc_adjust'/>
    <feature policy='require' name='md-clear'/>
    <feature policy='require' name='stibp'/>
    <feature policy='require' name='ssbd'/>
    <feature policy='require' name='xsaveopt'/>
    <feature policy='require' name='pdpe1gb'/>
    <feature policy='require' name='abm'/>
    <feature policy='require' name='ibpb'/>
    <numa>
      <cell id='0' cpus='0-3' memory='2194432' unit='KiB' memAccess='shared'/>
    </numa>
  </cpu>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/home/CentOS-7-x86_64-GenericCloud-02.qcow2'/>
      <backingStore/>
      <target dev='hda' bus='ide'/>
      <alias name='ide0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <alias name='usb'/>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0' multifunction='on'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <alias name='usb'/>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <alias name='usb'/>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='ide' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </controller>
    <interface type='vhostuser'>
      <source type='unix' path='/tmp/vhost2.sock' mode='server'/>
      <model type='virtio'/>
      <mtu size='9000'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/5'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/5'>
      <source path='/dev/pts/5'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <channel type='spicevmc'>
      <target type='virtio' name='com.redhat.spice.0' state='disconnected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'>
      <alias name='input0'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input1'/>
    </input>
    <graphics type='spice' port='5901' autoport='yes' listen='127.0.0.1'>
      <listen type='address' address='127.0.0.1'/>
      <image compression='off'/>
    </graphics>
    <sound model='ich6'>
      <alias name='sound0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </sound>
    <video>
      <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1' primary='yes'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <redirdev bus='usb' type='spicevmc'>
      <alias name='redir0'/>
      <address type='usb' bus='0' port='1'/>
    </redirdev>
    <redirdev bus='usb' type='spicevmc'>
      <alias name='redir1'/>
      <address type='usb' bus='0' port='2'/>
    </redirdev>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>
</domain>

# The CentOS-7-x86_64-GenericCloud-XXXX.qcow2 images must be downloaded separately.
# Create and start the two CentOS 7 VMs.
[root@asterfusion ~]# virsh define centos-00.xml
[root@asterfusion ~]# virsh define centos-01.xml
[root@asterfusion ~]# virsh start centos-00
[root@asterfusion ~]# virsh start centos-01
[root@asterfusion ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 13    centos-00                      running
 14    centos-01                      running

# Connect the VMs to the management bridge on the host.
[root@asterfusion ~]# ip link add centos-00-m type veth peer name centos-00-m-s
[root@asterfusion ~]# ip link add centos-01-m type veth peer name centos-01-m-s
[root@asterfusion ~]# ovs-vsctl add-br br-m
[root@asterfusion ~]# ip link set dev br-m up
[root@asterfusion ~]# ip address add dev br-m 192.168.5.145/24
[root@asterfusion ~]# ovs-vsctl add-port br-m eno2
[root@asterfusion ~]# ip link set dev eno2 up
[root@asterfusion ~]# ovs-vsctl add-port br-m centos-00-m-s
[root@asterfusion ~]# ovs-vsctl add-port br-m centos-01-m-s
[root@asterfusion ~]# virsh attach-interface centos-00 --type direct --source centos-00-m --config
[root@asterfusion ~]# virsh attach-interface centos-00 --type direct --source centos-00-m --live
[root@asterfusion ~]# virsh attach-interface centos-01 --type direct --source centos-01-m --config
[root@asterfusion ~]# virsh attach-interface centos-01 --type direct --source centos-01-m --live
[root@asterfusion ~]# ip link set dev centos-00-m up
[root@asterfusion ~]# ip link set dev centos-01-m up
[root@asterfusion ~]# ip link set dev centos-00-m-s up
[root@asterfusion ~]# ip link set dev centos-01-m-s up

# Assign a service IP to each of the two VMs.
# centos-00:
[root@centos-00 ~]# ip link set dev eth0 up
[root@centos-00 ~]# ip add add dev eth0 172.0.0.100/24
# centos-01:
[root@centos-01 ~]# ip link set dev eth0 up
[root@centos-01 ~]# ip add add dev eth0 172.0.0.200/24

# Assign a management IP to each of the two VMs.
# centos-00:
[root@centos-00 ~]# ip link set dev eth1 up
[root@centos-00 ~]# ip add add dev eth1 192.168.5.155/24
[root@centos-00 ~]# ip route add default via 192.168.5.1 dev eth1
# centos-01:
[root@centos-01 ~]# ip link set dev eth1 up
[root@centos-01 ~]# ip add add dev eth1 192.168.5.165/24
[root@centos-01 ~]# ip route add default via 192.168.5.1 dev eth1
# Check the PCI addresses of the VF ports on the SmartNIC; of the VFs listed, the entries from the second onward map one-to-one to the host-side VFs.
root@OCTEONTX:/data/helium-v1.0# lspci -nn -d 177d:a0f7
0002:0f:00.1 System peripheral [0880]: Cavium, Inc. Device [177d:a0f7]
0002:0f:00.2 System peripheral [0880]: Cavium, Inc. Device [177d:a0f7]
0002:0f:00.3 System peripheral [0880]: Cavium, Inc. Device [177d:a0f7]
0002:0f:00.4 System peripheral [0880]: Cavium, Inc. Device [177d:a0f7]
0002:0f:00.5 System peripheral [0880]: Cavium, Inc. Device [177d:a0f7]
# On the SmartNIC, attach the two VFs used by the VMs to the service bridge br-net.
root@OCTEONTX:~# ovs-vsctl add-port br-net sdp1 -- set Interface sdp1 type=dpdk options:dpdk-devargs=0002:0f:00.2 mtu_request=9000  
root@OCTEONTX:~# ovs-vsctl add-port br-net sdp2 -- set Interface sdp2 type=dpdk options:dpdk-devargs=0002:0f:00.3 mtu_request=9000
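The one-to-one correspondence between card-side and host-side VFs can be sketched as follows (PCI addresses taken from the `lspci` outputs in this document; the first card-side entry is skipped, the rest pair up in order):

```python
# Illustrative mapping between the SmartNIC-side VFs (lspci -nn -d 177d:a0f7)
# and the host-side VFs (lspci | grep b203): skip the first card-side entry,
# then pair the remaining entries one-to-one in order.
card_vfs = ["0002:0f:00.1", "0002:0f:00.2", "0002:0f:00.3",
            "0002:0f:00.4", "0002:0f:00.5"]
host_vfs = ["0000:03:02.0", "0000:03:02.1", "0000:03:02.2", "0000:03:02.3"]

mapping = dict(zip(card_vfs[1:], host_vfs))
for card, host in mapping.items():
    print(f"{card} <-> {host}")
```

This matches the configuration above: vhost1 sits on host VF 03:02.0, whose card-side peer 0002:0f:00.2 is attached to br-net as sdp1.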

3.1.3 Offload Verification Result

# The two VMs communicate normally through the bridge br-net on the SmartNIC.
# centos-00:
[root@centos-00 ~]# ping 172.0.0.200 -c 4
PING 172.0.0.200 (172.0.0.200) 56(84) bytes of data.
64 bytes from 172.0.0.200: icmp_seq=1 ttl=64 time=0.220 ms
64 bytes from 172.0.0.200: icmp_seq=2 ttl=64 time=0.164 ms
64 bytes from 172.0.0.200: icmp_seq=3 ttl=64 time=0.140 ms
64 bytes from 172.0.0.200: icmp_seq=4 ttl=64 time=0.132 ms

--- 172.0.0.200 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3000ms
rtt min/avg/max/mdev = 0.132/0.164/0.220/0.034 ms
[root@centos-00 ~]#
# centos-01:
[root@centos-01 ~]# ping 172.0.0.100 -c 4
PING 172.0.0.100 (172.0.0.100) 56(84) bytes of data.
64 bytes from 172.0.0.100: icmp_seq=1 ttl=64 time=0.159 ms
64 bytes from 172.0.0.100: icmp_seq=2 ttl=64 time=0.163 ms
64 bytes from 172.0.0.100: icmp_seq=3 ttl=64 time=0.179 ms
64 bytes from 172.0.0.100: icmp_seq=4 ttl=64 time=0.180 ms

--- 172.0.0.100 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2999ms
rtt min/avg/max/mdev = 0.159/0.170/0.180/0.013 ms


3.2 Offloading a VM-based vFW

3.2.1 Approach

To verify the SmartNIC's ability to offload VM-based VNFs, this solution starts two VMs on the host as tenant service instances and runs a CentOS VM on the SmartNIC, configured with iptables rules, as the VPC gateway and firewall.

Figure 3-2: VM-based vFW offload topology

Testing confirms that the firewall function is successfully offloaded to the SmartNIC. As the figure shows, the tenant's VPC subnet is 172.0.0.0/24 and the instances use the VPC gateway 172.0.0.1 as their default gateway. When the firewall running on the SmartNIC receives traffic from a service instance destined for other networks, it filters and forwards the traffic according to the configured rules, and applies SNAT before forwarding it to the service network, so instances inside the tenant VPC can reach the service network.

3.2.2 Procedure

All operations in this section build on the environment configured in section 3.1, so the basic steps are not repeated here.

3.2.2.1 Creating the bridge vm-net on the SmartNIC and connecting the two host-side VMs to it
# Create the bridge vm-net on the SmartNIC.
root@OCTEONTX:~# ovs-vsctl add-br vm-net -- set bridge vm-net datapath_type=netdev

# Remove the VFs from the bridge br-net.
root@OCTEONTX:~# ovs-vsctl del-port br-net sdp1
root@OCTEONTX:~# ovs-vsctl del-port br-net sdp2

# Attach the VFs to the bridge vm-net.
root@OCTEONTX:~# ovs-vsctl add-port vm-net sdp1 -- set Interface sdp1 type=dpdk options:dpdk-devargs=0002:0f:00.2 mtu_request=9000
root@OCTEONTX:~# ovs-vsctl add-port vm-net sdp2 -- set Interface sdp2 type=dpdk options:dpdk-devargs=0002:0f:00.3 mtu_request=9000

3.2.2.2 Creating a VM on the SmartNIC and connecting it to the bridges vm-net and br-net

# Install the virtualization packages on the SmartNIC.
root@OCTEONTX:~# apt install -y qemu qemu-utils qemu-efi-arm qemu-efi-aarch64 qemu-system-arm qemu-system-common qemu-system-data qemu-system-gui

# Prepare the VM image and XML files; the result is as follows:
root@OCTEONTX:/data# mkdir libvirt && cd libvirt
root@OCTEONTX:/data/libvirt# tree
.
|-- images
|   |-- CentOS-7-aarch64-GenericCloud-2009.qcow2
|   `-- QEMU_EFI.fd
`-- xml
    |-- firewall-00.xml
    `-- default-net.xml

2 directories, 4 files
root@OCTEONTX:/data/libvirt# cat xml/firewall-00.xml
<domain type='qemu'>
  <name>firewall-00</name>
  <uuid>dc042799-4e06-466f-8fce-71ac2105f786</uuid>
  <metadata>
    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
      <libosinfo:os id="http://centos.org/centos/7.0"/>
    </libosinfo:libosinfo>
  </metadata>
  <memory unit='KiB'>2097152</memory>
  <currentMemory unit='KiB'>2097152</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <os>
    <type arch='aarch64' machine='virt-4.2'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/AAVMF/AAVMF_CODE.fd</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/centos-00_VARS.fd</nvram>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <gic version='2'/>
  </features>
  <cpu mode='custom' match='exact' check='none'>
    <model fallback='allow'>cortex-a57</model>
  </cpu>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-system-aarch64</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/data/libvirt/images/CentOS-7-aarch64-GenericCloud-2009.qcow2'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </disk>
    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0xa'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0xb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0xc'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0xd'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x5'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0xe'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x6'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </controller>
    <serial type='pty'>
      <target type='system-serial' port='0'>
        <model name='pl011'/>
      </target>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </memballoon>
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </rng>
  </devices>
</domain>
# The CentOS-7-aarch64-GenericCloud-XXXX.qcow2 image must be downloaded separately.
# Create and start the VM.
root@OCTEONTX:/data/libvirt# virsh define firewall-00.xml
root@OCTEONTX:/data/libvirt# virsh start firewall-00
root@OCTEONTX:/data/libvirt# virsh list --all
Id   Name        State
---------------------------
30   firewall-00   running

# Connect the VM to the bridges vm-net, br-net, and br-m.
root@OCTEONTX:/data/libvirt# ip link add fw-if-in type veth peer name fw-if-in-sw
root@OCTEONTX:/data/libvirt# ip link add fw-if-ou type veth peer name fw-if-ou-sw
root@OCTEONTX:/data/libvirt# ip link add fw-m type veth peer name fw-m-sw
root@OCTEONTX:/data/libvirt# ip link set dev fw-m up
root@OCTEONTX:/data/libvirt# ip link set dev fw-m-sw up
root@OCTEONTX:/data/libvirt# ip link set dev fw-if-in up
root@OCTEONTX:/data/libvirt# ip link set dev fw-if-in-sw up
root@OCTEONTX:/data/libvirt# ip link set dev fw-if-ou up
root@OCTEONTX:/data/libvirt# ip link set dev fw-if-ou-sw up
root@OCTEONTX:/data/libvirt# ovs-vsctl add-port vm-net fw-if-in-sw
root@OCTEONTX:/data/libvirt# ovs-vsctl add-port br-net fw-if-ou-sw
root@OCTEONTX:/data/libvirt# ovs-vsctl add-port br-m fw-m-sw
root@OCTEONTX:/data/libvirt# virsh attach-interface firewall-00 --type direct --source fw-if-in --config
root@OCTEONTX:/data/libvirt# virsh attach-interface firewall-00 --type direct --source fw-if-in --live
root@OCTEONTX:/data/libvirt# virsh attach-interface firewall-00 --type direct --source fw-if-ou  --config
root@OCTEONTX:/data/libvirt# virsh attach-interface firewall-00 --type direct --source fw-if-ou  --live
root@OCTEONTX:/data/libvirt# virsh attach-interface firewall-00 --type direct --source fw-m  --config
root@OCTEONTX:/data/libvirt# virsh attach-interface firewall-00 --type direct --source fw-m  --live
# Assign the gateway IP to br-net
root@OCTEONTX:/data/libvirt# ip address add dev br-net 10.0.0.1/24

3.2.2.3 Configuring firewall rules on the VM on the SmartNIC

# Configure IP addresses on each of the VM's NICs.
root@OCTEONTX:~# virsh console firewall-00
Connected to domain firewall-00
Escape character is ^]

[root@firewall ~]# ip link set dev eth0 up
[root@firewall ~]# ip link set dev eth1 up
[root@firewall ~]# ip link set dev eth2 up
[root@firewall ~]# ip add add dev eth0 172.0.0.1/24
[root@firewall ~]# ip add add dev eth1 10.0.0.45/24
[root@firewall ~]# ip add add dev eth2 192.168.5.155/24
[root@firewall ~]# ip route add default via 10.0.0.1 dev eth1

# Enable packet forwarding in the VM.
[root@firewall ~]# echo '1' > /proc/sys/net/ipv4/ip_forward

# Add a test firewall rule: drop all packets from instance 172.0.0.100 (i.e. traffic sent by the first VM on the host).
[root@firewall ~]# iptables -I FORWARD  -s 172.0.0.100 -j DROP
[root@firewall ~]# iptables -nvL
Chain INPUT (policy ACCEPT 332K packets, 135M bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain FORWARD (policy ACCEPT 7158 packets, 545K bytes)
 pkts bytes target     prot opt in     out     source               destination         
 7305  544K DROP       all  --  *      *       172.0.0.100          0.0.0.0/0           

Chain OUTPUT (policy ACCEPT 20823 packets, 1740K bytes)
 pkts bytes target     prot opt in     out     source               destination         

# Add the firewall's SNAT rule.
[root@firewall ~]# iptables -t nat -A POSTROUTING -o eth1 -s 172.0.0.0/24 -j SNAT --to-source 10.0.0.45
[root@firewall ~]# iptables -nvL -t nat
Chain PREROUTING (policy ACCEPT 11048 packets, 828K bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain INPUT (policy ACCEPT 16 packets, 784 bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain OUTPUT (policy ACCEPT 9639 packets, 725K bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain POSTROUTING (policy ACCEPT 9639 packets, 725K bytes)
 pkts bytes target     prot opt in     out     source               destination         
 6188  470K SNAT       all  --  *      eth1    172.0.0.0/24         0.0.0.0/0            to:10.0.0.45
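The combined effect of the two rules above can be modeled in a few lines (an illustrative sketch, not part of the setup; the `forward` helper is hypothetical): traffic from 172.0.0.100 is dropped in FORWARD, while other VPC traffic leaving eth1 has its source rewritten to 10.0.0.45 in POSTROUTING.

```python
# Illustrative model of the iptables rules configured above.
import ipaddress

VPC = ipaddress.ip_network("172.0.0.0/24")
SNAT_IP = "10.0.0.45"

def forward(src: str, out_if: str = "eth1"):
    """Return the packet's source address after the firewall, or None if dropped."""
    addr = ipaddress.ip_address(src)
    if addr == ipaddress.ip_address("172.0.0.100"):   # FORWARD -s 172.0.0.100 -j DROP
        return None
    if out_if == "eth1" and addr in VPC:              # POSTROUTING -o eth1 -j SNAT
        return SNAT_IP
    return src

print(forward("172.0.0.100"))  # dropped -> None
print(forward("172.0.0.200"))  # SNATed -> 10.0.0.45
```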

3.2.3 Offload Verification Result

# On the host VM centos-00, ping the "external gateway" 10.0.0.1 on the service network: communication fails.
[root@centos-00 ~]# ip route del default via 192.168.5.1 dev eth1
[root@centos-00 ~]# ip route add default via 172.0.0.1 dev eth0
[root@centos-00 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.0.0.1       0.0.0.0         UG    0      0        0 eth0
172.0.0.0       0.0.0.0         255.255.255.0   U     0      0        0 eth0
192.168.5.0     0.0.0.0         255.255.255.0   U     0      0        0 eth1
[root@centos-00 ~]# ping 10.0.0.1 -c 4
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.

--- 10.0.0.1 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 2999ms
[root@centos-00 ~]#

# From VM centos-01 on the host, ping the "external gateway" 10.0.0.1 on the service network; communication succeeds.
[root@centos-01 ~]# ip route del default via 192.168.5.1 dev eth1
[root@centos-01 ~]# ip route add default via 172.0.0.1 dev eth0
[root@centos-01 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.0.0.1       0.0.0.0         UG    0      0        0 eth0
172.0.0.0       0.0.0.0         255.255.255.0   U     0      0        0 eth0
192.168.5.0     0.0.0.0         255.255.255.0   U     0      0        0 eth1
[root@centos-01 ~]# ping 10.0.0.1 -c 4
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=63 time=1.07 ms
64 bytes from 10.0.0.1: icmp_seq=2 ttl=63 time=1.04 ms
64 bytes from 10.0.0.1: icmp_seq=3 ttl=63 time=1.04 ms
64 bytes from 10.0.0.1: icmp_seq=4 ttl=63 time=1.04 ms

--- 10.0.0.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms
rtt min/avg/max/mdev = 1.042/1.052/1.075/0.041 ms

3.3 Container-based SSL encryption/decryption offload

3.3.1 Verification approach

To verify the smart NIC's ability to offload container-based VNFs, this solution starts two VMs on the host as web backends and runs an nginx container on the smart NIC as the SSL termination frontend and load balancer, serving HTTPS at https://10.0.0.50/ to other users on the service network.

Container-based SSL offload topology
Figure 3-3: Container-based SSL offload topology

The verification test shows that SSL termination was successfully offloaded to the smart NIC. When the nginx container receives HTTPS traffic for https://10.0.0.50/ from client 10.0.0.1 on the NIC's 25G service port, it forwards it as plain HTTP to the web nodes on the host. Backends are selected round-robin, so repeated requests from client 10.0.0.1 alternately receive the pages served by WEB-00 and WEB-01.
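The round-robin backend selection described above can be modeled in a couple of lines of Python (with equal weights, nginx's default upstream algorithm reduces to simple alternation):

```python
from itertools import cycle

# The two equal-weight backends from the verification setup.
backends = cycle(["172.0.0.100", "172.0.0.200"])

picks = [next(backends) for _ in range(4)]
print(picks)  # ['172.0.0.100', '172.0.0.200', '172.0.0.100', '172.0.0.200']
```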

3.3.2 Verification procedure

All operations in this section build on the environment configured in section 3.2, so the basic setup steps are not repeated here.

3.3.2.1 Start two VMs on the host and connect them to the management and service networks
# Rename the VMs created in section 3.1 (editing their XML definitions) to serve as web backends.
[root@asterfusion ~]# virsh shutdown centos-00
[root@asterfusion ~]# virsh shutdown centos-01
[root@asterfusion ~]# virsh domrename centos-00 WEB-00
[root@asterfusion ~]# virsh domrename centos-01 WEB-01
[root@asterfusion ~]# virsh start WEB-00
[root@asterfusion ~]# virsh start WEB-01
[root@asterfusion ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 13    WEB-00                      running
 14    WEB-01                      running
# Reassign management IPs to the two VMs.
# WEB-00:
[root@WEB-00 ~]# ip link set dev eth1 up
[root@WEB-00 ~]# ip add add dev eth1 192.168.5.155/24
[root@WEB-00 ~]# ip link set dev eth0 up
[root@WEB-00 ~]# ip add add dev eth0 172.0.0.100/24
[root@WEB-00 ~]# ip route add default via 172.0.0.1  dev eth0

# WEB-01:
[root@WEB-01 ~]# ip link set dev eth1 up
[root@WEB-01 ~]# ip add add dev eth1 192.168.5.165/24
[root@WEB-01 ~]# ip link set dev eth0 up
[root@WEB-01 ~]# ip add add dev eth0 172.0.0.200/24
[root@WEB-01 ~]# ip route add default via 172.0.0.1 dev eth0
3.3.2.2 Configure the two host-side VMs as web backends
# Install the httpd service and create an index page on each VM.
# WEB-00:
[root@WEB-00 ~]# setenforce  0
[root@WEB-00 ~]# yum update && yum install -y httpd
[root@WEB-00 ~]# cd /var/www/html/
[root@WEB-00 html]# echo "I'm end server: 172.0.0.100" > index.html
[root@WEB-00 html]# systemctl restart httpd
[root@WEB-00 html]# systemctl enable httpd
# WEB-01:
[root@WEB-01 ~]# getenforce 
Disabled
[root@WEB-01 ~]# yum update && yum install -y httpd
[root@WEB-01 ~]# cd /var/www/html/
[root@WEB-01 html]# echo "I'm end server: 172.0.0.200" > index.html
[root@WEB-01 html]# systemctl restart httpd
[root@WEB-01 html]# systemctl enable httpd
3.3.2.3 Create two bridges on the smart NIC for the frontend/backend network IPs
# Remove the ports and bridges from section 3.2 that are no longer needed.
root@OCTEONTX:~# ovs-vsctl del-port vm-net fw-if-in-sw
root@OCTEONTX:~# ovs-vsctl del-port br-net fw-if-ou-sw
root@OCTEONTX:~# ovs-vsctl del-port br-m fw-m-sw
root@OCTEONTX:~#  ip link delete fw-if-in type veth peer name fw-if-in-sw
root@OCTEONTX:~#  ip link delete fw-if-ou type veth peer name fw-if-ou-sw
root@OCTEONTX:~#  ip link delete  fw-m type veth peer name fw-m-sw

root@OCTEONTX:~# ifconfig vm-net 172.0.0.50/24
root@OCTEONTX:~# ifconfig br-net 10.0.0.50/24
3.3.2.4 Offload container-based SSL termination on the smart NIC
# Prepare nginx's directories and configuration files.
root@OCTEONTX:~# cd /data/
root@OCTEONTX:/data# mkdir nginx && cd nginx
root@OCTEONTX:/data/nginx# mkdir config data logs ssl
root@OCTEONTX:/data/nginx# ll
total 20K
drwxr-xr-x 3 root root 4.0K Sep 18 01:54 config
drwxr-xr-x 2 root root 4.0K Sep 17 08:06 data
drwxr-xr-x 2 root root 4.0K Sep 18 02:15 logs
drwxr-xr-x 2 root root 4.0K Sep 18 02:02 ssl

# Create a self-signed certificate.
root@OCTEONTX:/data/nginx# cd ssl/
root@OCTEONTX:/data/nginx/ssl# openssl genrsa -des3 -out server.key 2048
root@OCTEONTX:/data/nginx/ssl# openssl req -new -key server.key -out server.csr
root@OCTEONTX:/data/nginx/ssl# openssl rsa -in server.key -out server_nopwd.key
root@OCTEONTX:/data/nginx/ssl# openssl x509 -req -days 365 -in server.csr -signkey server_nopwd.key -out server.crt

# The nginx directory tree and configuration after preparation.
root@OCTEONTX:/data/nginx# tree
.
|-- config
|   |-- conf.d
|   |   `-- default.conf
|   `-- nginx.conf
|-- data
|   `-- index.html
|-- logs
|   |-- access.log
|   `-- error.log
|-- ssl
|   |-- server.crt
|   |-- server.csr
|   |-- server.key
|   `-- server_nopwd.key
`-- start-n.sh

5 directories, 10 files
root@OCTEONTX:/data/nginx# cat data/index.html 
I'm SSL Proxer
root@OCTEONTX:/data/nginx# cat config/conf.d/default.conf 
upstream end_server {                                                         
    server 172.0.0.100:80 weight=1 max_fails=3 fail_timeout=15s;                                                
    server 172.0.0.200:80 weight=1 max_fails=3 fail_timeout=15s;                                                
}
server {
    listen 443 ssl;
    server_name	localhost;

    ssl_certificate /ssl/server.crt;
    ssl_certificate_key /ssl/server_nopwd.key;

    ssl_session_cache shared:SSL:1m;
    ssl_session_timeout 5m;

    ssl_protocols TLSv1.2 TLSv1.3;

     ssl_ciphers HIGH:!aNULL:!MD5;
     ssl_prefer_server_ciphers  on;

     location / {
        root /usr/share/nginx/html;
        index index.html index.htm;
        proxy_pass http://end_server/;
        proxy_set_header Host $host:$server_port;
     }

    error_page 500 502 503 504 /50x.html;
    location = /50x.html {
        root /usr/share/nginx/html;
    }
    
    proxy_ignore_client_abort on;    
}

# Run the nginx container on the smart NIC's operating system.
root@OCTEONTX:/data/nginx# docker run -d --network host --name nginx-00 -v /data/nginx/data:/usr/share/nginx/html:rw -v /data/nginx/config/nginx.conf:/etc/nginx/nginx.conf:rw -v /data/nginx/config/conf.d/default.conf:/etc/nginx/conf.d/default.conf:rw -v /data/nginx/logs:/var/log/nginx/:rw -v /data/nginx/ssl:/ssl/:rw nginx

3.3.3 Verifying the offload result

# Access https://10.0.0.50/ from a server on the service network; if the offload succeeded, the backend web pages are returned.
[root@compute-01 ~]# ip a | grep -i "enp2s0f0"
8: enp2s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
inet 10.0.0.1/24 scope global enp2s0f0
[root@compute-01 ~]# curl --insecure https://10.0.0.50/
I'm end server: 172.0.0.100
[root@compute-01 ~]# curl --insecure https://10.0.0.50/
I'm end server: 172.0.0.200

4 Running VNFs on a Server vs. on the Smart NIC

VNFs are usually deployed as virtual machines or containers. For installation and deployment there is essentially no difference between a server and the Helium smart NIC: as long as a build for the platform exists, it installs and runs the same way, with essentially identical configuration methods and command lines.

As for software availability: because x86 servers hold an overwhelming market share, operating systems, container runtimes, hypervisor software, images, and applications all ship x86 builds. For the Helium smart NIC's arm platform, most popular operating systems, container runtimes, and hypervisors also provide arm builds. However, only a minority of application software and container images support arm; if a customer's VNF software previously ran on x86 and has no arm build, the development team must first port and test it.

Porting touches the code itself, so it must be done by developers. Moving code from x86 to arm raises two issues: first, the two CPUs behave differently when handling certain kinds of overflow; second, they use different instruction sets (complex vs. reduced), so their machine instructions do not map one-to-one. If the project embeds assembly for acceleration, the port becomes even more involved.

5 Summary

The verification tests described above demonstrate that the Helium smart NIC can offload OVS, a VM-based VNF (vFW), and a container-based VNF (SSL termination). In the future, combined with the coprocessors in the Helium smart NIC's SoC, it will be able not only to offload VNFs but also to further improve their processing performance.

Operation Guide: A Full-Mesh VPN Solution Based on WireGuard

1 Objective

This document briefly introduces the open-source VPN protocol WireGuard and Netmaker, an open-source project that builds full-mesh networks on top of WireGuard, along with the concrete steps to install and deploy them.

2 Overview

2.1 About WireGuard

WireGuard is an open-source VPN protocol written in C by Jason Donenfeld and others. Regarded as a next-generation VPN protocol, it aims to solve many of the problems that plague other VPN protocols such as IPsec/IKEv2, OpenVPN, and L2TP. It shares some traits with modern VPN products like Tinc and MeshBird: state-of-the-art cryptography and simple configuration.

Since January 2020 it has been merged into version 5.6 of the Linux kernel, meaning users of most Linux distributions get WireGuard out of the box.

Performance comparison between WireGuard and other VPN protocols
Figure 2-1: Performance comparison between WireGuard and other VPN protocols

The chart above is WireGuard's official performance comparison against commonly used VPN protocols, measured over a 10G physical NIC on a server running Linux kernel 4.6.1. WireGuard's throughput nearly saturates the NIC's physical bandwidth, and its latency is the lowest of the protocols compared.

WireGuard's key characteristics can be summarized as follows:

  • Runs over UDP;
  • Core concept: cryptokey routing
    • Associates public keys with lists of IP addresses (AllowedIPs);
    • Each WireGuard interface has one private key and a list of Peers;
    • Each Peer has a public key and a list of allowed IPs;
  • When sending, AllowedIPs acts as a routing table: each of a Peer's AllowedIPs generates a static route to the device;
  • When receiving, AllowedIPs acts as an access control list: a packet is accepted if its source IP is in that Peer's AllowedIPs list, and dropped otherwise.
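The two rules above can be sketched with Python's standard ipaddress module (the peer names and AllowedIPs below are illustrative, not WireGuard's implementation):

```python
import ipaddress

# Illustrative peer table: peer name -> AllowedIPs.
peers = {
    "peer-A": ["10.20.20.2/32"],
    "peer-B": ["10.20.20.3/32"],
}

def route_out(dst_ip):
    """Sending: AllowedIPs act as a routing table from destination to peer."""
    addr = ipaddress.ip_address(dst_ip)
    for peer, nets in peers.items():
        if any(addr in ipaddress.ip_network(n) for n in nets):
            return peer
    return None  # no matching AllowedIPs: not routed into the tunnel

def accept_in(peer, src_ip):
    """Receiving: accept only if the source IP is in that peer's AllowedIPs."""
    addr = ipaddress.ip_address(src_ip)
    return any(addr in ipaddress.ip_network(n) for n in peers[peer])

print(route_out("10.20.20.3"))            # peer-B
print(accept_in("peer-A", "10.20.20.9"))  # False: dropped
```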

With the basics of WireGuard covered, let's look at how it is used in practice to build more complex topologies. Three typical ones are introduced here.

  • Point-to-point

This is the simplest topology: all nodes are either on the same LAN or directly reachable over the public Internet, so WireGuard connects peers directly with no relay node in between.

  • Hub-and-spoke

WireGuard has no notion of server and client; every node is a Peer. A common way to build a network is to pick a machine with a public IP as the relay node, i.e., the VPN gateway, and have all traffic between nodes forwarded through it. For ease of understanding you can think of this gateway as the server and the other nodes as clients, but strictly speaking there is no such distinction. The architecture is shown below.

WireGuard Hub-and-Spoke
Figure 2-2: WireGuard hub-and-spoke

The drawbacks of this architecture are obvious: as Peers multiply, the VPN gateway becomes a vertical-scaling bottleneck. Relaying traffic through the gateway also requires a public IP and substantial bandwidth, which is costly. Finally, from a performance standpoint, relaying through the gateway adds significant latency and introduces a single point of failure.

  • Full mesh
WireGuard FullMesh
Figure 2-3: WireGuard full mesh

In a full-mesh architecture, every Peer connects directly to every other Peer, with no VPN gateway relaying traffic, which resolves essentially all the shortcomings of hub-and-spoke. To build a full mesh with WireGuard, every Peer's configuration file must declare all Peers other than itself. The logic is simple; the hard part is the tedium of configuration: as the network grows, a change to any single Peer triggers a very large amount of configuration work.
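The configuration burden grows quadratically: n peers need n(n-1)/2 tunnels, and since every peer's file lists the other n-1 peers, a full mesh carries n(n-1) [Peer] entries in total. A quick illustration:

```python
def mesh_stats(n):
    """n peers: n*(n-1)/2 tunnels; n*(n-1) [Peer] entries across all configs."""
    return n * (n - 1) // 2, n * (n - 1)

for n in (3, 10, 50):
    tunnels, entries = mesh_stats(n)
    print(f"{n} peers -> {tunnels} tunnels, {entries} [Peer] config entries")
```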

For this reason, many open-source tools now exist to configure and manage WireGuard full-mesh networks. The rest of this document walks through building a WireGuard-based full mesh with one such tool, Netmaker.

2.2 About Netmaker

Netmaker is a visual tool for configuring WireGuard in full-mesh mode. It is very capable, supporting NAT traversal, UDP hole punching, and multi-tenancy. Its client runs on almost every platform, including Linux, macOS, and Windows, and smartphones (Android and iPhone) can connect via the native WireGuard client.

Benchmarks of its latest release show that a Netmaker-based WireGuard network is more than 50% faster than other full-mesh VPNs such as Tailscale and ZeroTier.

Netmaker architecture
Figure 2-4: Netmaker architecture

Netmaker uses a client/server architecture. The Netmaker server contains two core components: a visual interface for managing networks, and a gRPC server that communicates with clients. Optionally, a DNS server (CoreDNS) can be deployed to manage private DNS.

The client (netclient) is a single binary that runs on most Linux distributions, macOS, and Windows. Its job is to manage WireGuard automatically and keep Peer configuration up to date. It checks in with the Netmaker server periodically to keep the local Peer configuration current, then establishes point-to-point connections with all Peers, forming the full mesh.

3 System Environment

This verification uses four VMs for the full-mesh test: one Ubuntu machine as the Netmaker server and three CentOS machines as clients.

3.1 Server

Operating system: Ubuntu 20.04.4 LTS;

Kernel: 5.15.0-46-generic;

WireGuard: 1.0.20220627;

Docker CE: 20.10.12;

Netmaker: 0.12.0.

3.2 Clients

Operating system: CentOS Linux release 7.9.2009 (Core);

Kernel: 3.10.0-1160.66.1.el7.x86_64;

WireGuard: 1.0.20200513;

NetClient: 0.12.0.

4 Installation and Deployment

4.1 Installing and configuring WireGuard

WireGuard must be installed on every client in the network. Only the steps for node-01 are shown here; the other clients are configured identically.

4.1.1 Install the kernel module

[root@node-01 ~]# yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm
[root@node-01 ~]# yum install kmod-wireguard wireguard-tools
[root@node-01 ~]# reboot
...
[root@node-01 ~]# modprobe wireguard

4.1.2 Enable IP forwarding

[root@node-01 ~]# cat <<'EOF'>> /etc/sysctl.conf
net.ipv4.ip_forward = 1
net.ipv4.conf.all.proxy_arp = 1
EOF
[root@node-01 ~]# sysctl -p /etc/sysctl.conf

4.2 Installing and configuring the Netmaker server

The steps in this section are performed only on the Netmaker server; no configuration is needed on the clients.

4.2.1 Set up the Docker CE container runtime

root@open-source:~# yum-config-manager --add-repo  https://download.docker.com/linux/centos/docker-ce.repo
root@open-source:~# yum remove -y docker docker-common docker-selinux docker-engine
root@open-source:~# yum update
root@open-source:~# yum install -y yum-utils device-mapper-persistent-data lvm2
root@open-source:~# yum install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
root@open-source:~# systemctl enable docker
root@open-source:~# systemctl restart docker

4.2.2 Prepare the Docker Compose file

root@open-source:~# cat docker-compose.yml
version: "3.4"

services:
  netmaker:
    container_name: netmaker
    image: gravitl/netmaker:v0.12.0
    volumes:
      - dnsconfig:/root/config/dnsconfig
      - sqldata:/root/data
    cap_add: 
      - NET_ADMIN
      - NET_RAW
      - SYS_MODULE
    sysctls:
      - net.ipv4.ip_forward=1
      - net.ipv4.conf.all.src_valid_mark=1
    restart: always
    environment:
      SERVER_HOST: "192.168.4.44"
      SERVER_HTTP_HOST: "192.168.4.44"
      SERVER_GRPC_HOST: "192.168.4.44"
      COREDNS_ADDR: "192.168.4.44"
      SERVER_API_CONN_STRING: "192.168.4.44:8081"
      SERVER_GRPC_CONN_STRING: "192.168.4.44:50051"
      GRPC_SSL: "off"
      DNS_MODE: "on"
      API_PORT: "8081"
      GRPC_PORT: "50051"
      CLIENT_MODE: "on"
      MASTER_KEY: "36lGTBLyp8itKCYeh7mzTYWej9RgF0"
      CORS_ALLOWED_ORIGIN: "*"
      DISPLAY_KEYS: "on"
      DATABASE: "sqlite"
      NODE_ID: "netmaker-server-1"
      MQ_HOST: "mq"
      HOST_NETWORK: "off"
      MANAGE_IPTABLES: "on"
      PORT_FORWARD_SERVICES: "mq"
      VERBOSITY: "1"
    ports:
      - "51821-51830:51821-51830/udp"
      - "8081:8081"
      - "50051:50051"
  netmaker-ui:
    container_name: netmaker-ui
    depends_on:
      - netmaker
    image: gravitl/netmaker-ui:v0.12.0
    links:
      - "netmaker:api"
    ports:
      - "8082:80"
    environment:
      BACKEND_URL: "http://192.168.4.44:8081"
    restart: always
  coredns:
    depends_on:
      - netmaker 
    image: coredns/coredns
    command: -conf /root/dnsconfig/Corefile
    container_name: coredns
    restart: always
    volumes:
      - dnsconfig:/root/dnsconfig
  caddy:
    image: caddy:latest
    container_name: caddy
    restart: unless-stopped
    network_mode: host
    volumes:
      - /root/Caddyfile:/etc/caddy/Caddyfile
      - caddy_data:/data
      - caddy_conf:/config
  mq:
    image: eclipse-mosquitto:2.0.14
    container_name: mq
    restart: unless-stopped
    ports:
      - "1883:1883"
    volumes:
      - /root/mosquitto.conf:/mosquitto/config/mosquitto.conf
      - mosquitto_data:/mosquitto/data
      - mosquitto_logs:/mosquitto/log
volumes:
  caddy_data: {}
  caddy_conf: {}
  sqldata: {}
  dnsconfig: {}
  mosquitto_data: {}
  mosquitto_logs: {}

root@open-source:~# 

4.2.3 Start the Netmaker container group

root@open-source:~# docker-compose up -d
Creating network "root_default" with the default driver
Creating mq ... 
Creating netmaker ... 
Creating caddy    ... 
Creating netmaker-ui ... 
Creating coredns     ... 
root@open-source:~# 

4.2.4 Create the full-mesh network

Netmaker home page
Figure 4-1: Netmaker home page
Create a network for testing
Figure 4-2: Create a network for testing
The Networks tab after creation
Figure 4-3: The Networks tab after creation
Create an Access Key for joining the test network
Figure 4-4: Create an Access Key for joining the test network
Access Key Details
Figure 4-5: Access Key Details

4.3 Installing and configuring the Netmaker client

4.3.1 Download the netclient binary and join the test network with the Access Token

Node-01:
[root@node-01 ~]# wget https://github.com/gravitl/netmaker/releases/download/v0.12.0/netclient
[root@node-01 ~]# chmod +x netclient
[root@node-01 ~]# ./netclient join -t eyJncnBjY29ubiI6IjE5Mi4xNjguNC40NDo1MDA1MSIsImdycGNzc2wiOiJvZmYiLCJjb21tc25ldHdvcmsiOiJ6TFk5c2dLQyIsIm5ldHdvcmsiOiJhc3RlcmZ1c2lvbiIsImtleSI6IkhmZkd5ZjJMTkVCcXBGRkgiLCJsb2NhbHJhbmdlIjoiIn0=
2022/08/12 14:34:10 [netclient] joining asterfusion at 192.168.4.44:50051
2022/08/12 14:34:10 [netclient] node created on remote server...updating configs
2022/08/12 14:34:10 [netclient] retrieving peers
2022/08/12 14:34:10 [netclient] starting wireguard
2022/08/12 14:34:10 [netclient] waiting for interface...
2022/08/12 14:34:10 [netclient] interface ready - netclient.. ENGAGE
2022/08/12 14:34:12 [netclient] restarting netclient.service
2022/08/12 14:34:13 [netclient] joined asterfusion
[root@node-01 ~]# 
Node-02:
[root@node-02 ~]# ./netclient join -t eyJncnBjY29ubiI6IjE5Mi4xNjguNC40NDo1MDA1MSIsImdycGNzc2wiOiJvZmYiLCJjb21tc25ldHdvcmsiOiJ6TFk5c2dLQyIsIm5ldHdvcmsiOiJhc3RlcmZ1c2lvbiIsImtleSI6IkhmZkd5ZjJMTkVCcXBGRkgiLCJsb2NhbHJhbmdlIjoiIn0=
2022/08/12 14:41:49 [netclient] joining asterfusion at 192.168.4.44:50051
2022/08/12 14:41:49 [netclient] node created on remote server...updating configs
2022/08/12 14:41:49 [netclient] retrieving peers
2022/08/12 14:41:49 [netclient] starting wireguard
2022/08/12 14:41:49 [netclient] waiting for interface...
2022/08/12 14:41:49 [netclient] interface ready - netclient.. ENGAGE
2022/08/12 14:41:50 [netclient] restarting netclient.service
2022/08/12 14:41:52 [netclient] joined asterfusion
[root@node-02 ~]# 
Node-03:
[root@node-03 ~]# ./netclient join -t eyJncnBjY29ubiI6IjE5Mi4xNjguNC40NDo1MDA1MSIsImdycGNzc2wiOiJvZmYiLCJjb21tc25ldHdvcmsiOiJ6TFk5c2dLQyIsIm5ldHdvcmsiOiJhc3RlcmZ1c2lvbiIsImtleSI6IkhmZkd5ZjJMTkVCcXBGRkgiLCJsb2NhbHJhbmdlIjoiIn0=
2022/08/12 14:42:06 [netclient] joining asterfusion at 192.168.4.44:50051
2022/08/12 14:42:06 [netclient] node created on remote server...updating configs
2022/08/12 14:42:06 [netclient] retrieving peers
2022/08/12 14:42:06 [netclient] starting wireguard
2022/08/12 14:42:06 [netclient] waiting for interface...
2022/08/12 14:42:06 [netclient] interface ready - netclient.. ENGAGE
2022/08/12 14:42:08 [netclient] restarting netclient.service
2022/08/12 14:42:09 [netclient] joined asterfusion
[root@node-03 ~]#
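As an aside, the -t token passed to netclient join is just base64-encoded JSON. A minimal sketch of building and decoding such a token (only a subset of the fields in the token above is shown; the exact schema belongs to Netmaker v0.12.0):

```python
import base64
import json

# A subset of the fields carried by the access token used above.
token_fields = {
    "grpcconn": "192.168.4.44:50051",  # gRPC endpoint the client dials
    "grpcssl": "off",
    "network": "asterfusion",
}

token = base64.b64encode(json.dumps(token_fields).encode()).decode()
decoded = json.loads(base64.b64decode(token))
print(decoded["grpcconn"])  # 192.168.4.44:50051
```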

4.3.2 Check the network status in the controller UI

Checking the network status in the web UI
Figure 4-6: Checking the network status in the web UI

4.4 Network tests

4.4.1 Check the WireGuard and routing information on the client nodes

Node-01:
[root@node-01 ~]# wg
interface: nm-asterfusion
  public key: qmPw+9r2+S94EjMAkNwMm9YV8ZDoSay8Fyi1HgyqFlg=
  private key: (hidden)
  listening port: 51821

peer: KnwIOGgWWCvXDqgrWfpc0xvlSf7GN/LjvUeJlpJhMy0=
  endpoint: 192.168.4.103:51821
  allowed ips: 10.20.20.3/32
  latest handshake: 2 seconds ago
  transfer: 304 B received, 272 B sent
  persistent keepalive: every 20 seconds

peer: rsAp8gC+vW63ET7YPpFT2oWMgrelFM1nO+9pAS2KLmQ=
  endpoint: 192.168.4.102:51821
  allowed ips: 10.20.20.2/32
  latest handshake: 5 seconds ago
  transfer: 304 B received, 272 B sent
  persistent keepalive: every 20 seconds

interface: nm-zLY9sgKC
  public key: B9QORmZw9PmhsHwDmZzxyzXZKxMTmx2qCAmlRWIzGG0=
  private key: (hidden)
  listening port: 55829
[root@node-01 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.4.1     0.0.0.0         UG    100    0        0 ens192
10.20.20.0      0.0.0.0         255.255.255.0   U     0      0        0 nm-asterfusion
10.20.20.2      0.0.0.0         255.255.255.255 UH    0      0        0 nm-asterfusion
10.20.20.3      0.0.0.0         255.255.255.255 UH    0      0        0 nm-asterfusion
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
192.168.4.0     0.0.0.0         255.255.255.0   U     100    0        0 ens192
[root@node-01 ~]#
Node-02:
[root@node-02 ~]# wg
interface: nm-asterfusion
  public key: rsAp8gC+vW63ET7YPpFT2oWMgrelFM1nO+9pAS2KLmQ=
  private key: (hidden)
  listening port: 51821

peer: KnwIOGgWWCvXDqgrWfpc0xvlSf7GN/LjvUeJlpJhMy0=
  endpoint: 192.168.4.103:51821
  allowed ips: 10.20.20.3/32
  latest handshake: 10 seconds ago
  transfer: 304 B received, 272 B sent
  persistent keepalive: every 20 seconds

peer: qmPw+9r2+S94EjMAkNwMm9YV8ZDoSay8Fyi1HgyqFlg=
  endpoint: 192.168.4.101:51821
  allowed ips: 10.20.20.1/32
  latest handshake: 13 seconds ago
  transfer: 92 B received, 180 B sent
  persistent keepalive: every 20 seconds

interface: nm-zLY9sgKC
  public key: asu7DXf5slyqN7xjzo1BQ+OinxbG2ECgf38SSY6u9xM=
  private key: (hidden)
  listening port: 37758
[root@node-02 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.4.1     0.0.0.0         UG    100    0        0 ens192
10.20.20.0      0.0.0.0         255.255.255.0   U     0      0        0 nm-asterfusion
10.20.20.1      0.0.0.0         255.255.255.255 UH    0      0        0 nm-asterfusion
10.20.20.3      0.0.0.0         255.255.255.255 UH    0      0        0 nm-asterfusion
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
192.168.4.0     0.0.0.0         255.255.255.0   U     100    0        0 ens192
[root@node-02 ~]#
Node-03:
[root@node-03 ~]# wg
interface: nm-asterfusion
  public key: KnwIOGgWWCvXDqgrWfpc0xvlSf7GN/LjvUeJlpJhMy0=
  private key: (hidden)
  listening port: 51821

peer: qmPw+9r2+S94EjMAkNwMm9YV8ZDoSay8Fyi1HgyqFlg=
  endpoint: 192.168.4.101:51821
  allowed ips: 10.20.20.1/32
  latest handshake: 15 seconds ago
  transfer: 92 B received, 180 B sent
  persistent keepalive: every 20 seconds

peer: rsAp8gC+vW63ET7YPpFT2oWMgrelFM1nO+9pAS2KLmQ=
  endpoint: 192.168.4.102:51821
  allowed ips: 10.20.20.2/32
  latest handshake: 15 seconds ago
  transfer: 92 B received, 180 B sent
  persistent keepalive: every 20 seconds

interface: nm-zLY9sgKC
  public key: tKLd9l1H8NmZmvv5C8amrt5FJNGc/rmfv8pxY1eWdis=
  private key: (hidden)
  listening port: 44378
[root@node-03 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.4.1     0.0.0.0         UG    100    0        0 ens192
10.20.20.0      0.0.0.0         255.255.255.0   U     0      0        0 nm-asterfusion
10.20.20.1      0.0.0.0         255.255.255.255 UH    0      0        0 nm-asterfusion
10.20.20.2      0.0.0.0         255.255.255.255 UH    0      0        0 nm-asterfusion
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
192.168.4.0     0.0.0.0         255.255.255.0   U     100    0        0 ens192
[root@node-03 ~]#

4.4.2 Ping tests between clients over the VPN subnet IPs

[root@node-01 ~]# ping 10.20.20.1
PING 10.20.20.1 (10.20.20.1) 56(84) bytes of data.
64 bytes from 10.20.20.1: icmp_seq=1 ttl=64 time=0.046 ms
64 bytes from 10.20.20.1: icmp_seq=2 ttl=64 time=0.041 ms
^C
--- 10.20.20.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.041/0.043/0.046/0.007 ms
[root@node-01 ~]# ping 10.20.20.2
PING 10.20.20.2 (10.20.20.2) 56(84) bytes of data.
64 bytes from 10.20.20.2: icmp_seq=1 ttl=64 time=0.637 ms
64 bytes from 10.20.20.2: icmp_seq=2 ttl=64 time=0.912 ms
^C
--- 10.20.20.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.637/0.774/0.912/0.140 ms
[root@node-01 ~]# ping 10.20.20.3
PING 10.20.20.3 (10.20.20.3) 56(84) bytes of data.
64 bytes from 10.20.20.3: icmp_seq=1 ttl=64 time=0.738 ms
64 bytes from 10.20.20.3: icmp_seq=2 ttl=64 time=0.760 ms
^C
--- 10.20.20.3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.738/0.749/0.760/0.011 ms
[root@node-01 ~]#


Performance Evaluation: Common Tools for HPC Performance Testing

1 Introduction to the HPC Test Plans

This document introduces several test plans for HPC (high-performance computing) scenarios:

  • E2E forwarding test

Measures the forwarding latency and bandwidth of an HPC solution in end-to-end (E2E) scenarios. The test points can use the Mellanox IB packet tools, Qperf (with RDMA support), and the Perftest suite, sweeping message sizes from 2 to 8388608 bytes.
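The 2~8388608-byte sweep used throughout these tests is simply the powers of two from 2^1 to 2^23; generating it:

```python
# Message sizes swept by the tests: powers of two from 2 B to 8 MiB.
sizes = [2 ** i for i in range(1, 24)]
print(sizes[0], sizes[-1], len(sizes))  # 2 8388608 23
```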

  • MPI benchmark

MPI benchmarks are commonly used to evaluate HPC network performance. The test points use OSU Micro-Benchmarks to measure forwarding latency and bandwidth in MPI scenarios, sweeping message sizes from 2 to 8388608 bytes.

  • Linpack performance test

Linpack is commonly used to measure the floating-point performance of high-performance computing systems. The test points use HPL and HPCG to evaluate server performance in Linpack scenarios.

  • HPC application test

This plan runs HPC applications in different scenarios, using WRF and LAMMPS to evaluate the parallel efficiency of HPC applications under each solution.

2 Testing Tools by Scenario

The testing software and versions involved in the different HPC scenarios are listed in Table 1:

Scenario              Tool                   Version
E2E forwarding test   Mellanox IB toolkit    same as the installed driver version
                      Qperf                  0.4.9
                      Perftest               V4.5-0.20
MPI benchmark         OSU Micro-Benchmarks   V5.6.3
Linpack test          HPL                    V2.3
                      HPCG                   V3.1
HPC application test  WRF                    V4.0
                      LAMMPS                 LAMMPS (3 Mar 2020)
Table 1: Software environment

3 E2E Test Tools: Deployment and Introduction

3.1 Mellanox IB toolkit

Install the Mellanox NIC's MLNX_OFED driver on the servers; the driver ships with the IB test toolkit (network performance tools such as ib_read_lat, ib_send_lat, and ib_write_lat). For detailed driver installation steps, see the joint lab's publication "Solution: Mellanox NIC Driver Installation". The toolkit's main test programs are listed in Table 2:

RDMA operation   Bandwidth test   Latency test
RDMA Send        ib_send_bw       ib_send_lat
RDMA Read        ib_read_bw       ib_read_lat
RDMA Write       ib_write_bw      ib_write_lat
Table 2: Common IB toolkit test programs

3.1.1 Install the MLNX_OFED NIC driver

[root@server ~]# wget \
https://content.mellanox.com/ofed/MLNX_OFED-5.0-1.0.0.0/MLNX_OFED_LINUX-5.0-1.0.0.0-rhel7.8-x86_64.tgz
[root@server ~]# tar -zxvf MLNX_OFED_LINUX-5.0-1.0.0.0-rhel7.8-x86_64.tgz
[root@server ~]# cd MLNX_OFED_LINUX-5.0-1.0.0.0-rhel7.8-x86_64
[root@server ~]# ./mlnx_add_kernel_support.sh -m \
/root/MLNX_OFED_LINUX-5.0-1.0.0.0-rhel7.8-x86_64 -v
[root@server ~]# tar xzvf \
MLNX_OFED_LINUX-5.0-1.0.0.0-rhel7.8-x86_64-ext.tgz
[root@server ~]# cd MLNX_OFED_LINUX-5.0-1.0.0.0-rhel7.8-x86_64-ext
[root@server ~]# ./mlnxofedinstall

3.1.2 Check the NIC and driver status

[root@server ~]# /etc/init.d/openibd start
[root@server ~]# ibstatus
(screenshot: command output)
[root@server ~]# systemctl start mst
[root@server ~]# mst status
(screenshot: command output)

3.1.3 IB toolkit test

[root@server1 ~]# ib_send_bw -R -d mlx5_2 -F --report_gbits -a
[root@server2 ~]# ib_send_bw -a -R -x 5 -d mlx5_2 -F -f 2 10.230.1.11

3.2 Qperf

Like iperf and netperf, qperf measures bandwidth and latency between two nodes; unlike them, qperf uniquely supports RDMA. Its measured performance is slightly lower than the Mellanox IB tools', and it can be used to test domestically produced RDMA NICs. Qperf's main test programs are listed in Table 3:

RDMA operation   Bandwidth test      Latency test
RDMA Send        rc_bw               rc_lat
RDMA Read        rc_rdma_read_bw     rc_rdma_read_lat
RDMA Write       rc_rdma_write_bw    rc_rdma_write_lat
Table 3: Common Qperf test programs

3.2.1 Install Qperf

[root@server ~]# yum -y install qperf*

3.2.2 Qperf RDMA test

Server:
[root@server ~]# qperf
Client:
send/receive:
qperf -cm1 -oo msg_size:1:64K:*2 10.230.1.11 rc_lat 
qperf -cm1 -oo msg_size:1:64K:*2 10.230.1.11 rc_bw
write/read:
qperf -cm1 -oo msg_size:1:64K:*2 10.230.1.11 rc_rdma_write_lat
qperf -cm1 -oo msg_size:1:64K:*2 10.230.1.11 rc_rdma_write_bw

3.3 Perftest

Perftest is a set of test programs written on top of uverbs, serving as an RDMA performance benchmark. It can be used for software and hardware tuning as well as functional testing. Its test programs are listed in Table 4:

RDMA operation                       Bandwidth test     Latency test
Send                                 ib_send_bw         ib_send_lat
RDMA Read                            ib_read_bw         ib_read_lat
RDMA Write                           ib_write_bw        ib_write_lat
RDMA Atomic                          ib_atomic_bw       ib_atomic_lat
Native Ethernet (raw Ethernet test)  raw_ethernet_bw    raw_ethernet_lat
Table 4: Common Perftest test programs

3.3.1 Install Perftest

[root@Server ~]# git clone https://github.com/linux-rdma/perftest.git
[root@Server ~]# cd perftest
[root@Server perftest]# ./autogen.sh
[root@Server perftest]# ./configure
[root@Server perftest]# make
[root@Server perftest]# make install

3.3.2 Perftest RDMA test

[root@Server ~]# ib_read_lat -R -d rdmap2s0f0 -F --report_gbits -a
[root@Server ~]# ib_read_lat -a -R -x 5 -d rdmap3s0f0 -F -f 2 10.230.1.11

4 MPI Test Tools: Deployment and Introduction

Install the OSU Micro-Benchmarks MPI communication-efficiency suite on the servers. Tests fall into two categories, point-to-point and collective communication, and run various MPI patterns to measure bandwidth and latency.

4.1 Install OSU Micro-Benchmarks

[root@server ~]# yum -y install openmpi3 openmpi3-devel -y
[root@server ~]# wget \
http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.6.3.tar.gz
[root@server ~]# tar zxvf osu-micro-benchmarks-5.6.3.tar.gz
[root@server ~]# cd osu-micro-benchmarks-5.6.3
[root@server ~]# ./configure
[root@server ~]# make -j
[root@server ~]# make install
[root@server ~]# mkdir /osu
[root@server ~]# cp -rf \
/usr/mpi/gcc/openmpi-4.0.3rc4/tests/osu-micro-benchmarks-5.3.2/* /osu

4.2 Using OSU Micro-Benchmarks

Bandwidth test:
[root@Server ~]# mpirun -np 2 --allow-run-as-root \
--host 10.230.1.11,10.230.1.12 /osu_bw
Latency test:
[root@Server ~]# mpirun -np 2 --allow-run-as-root \
--host 10.230.1.11,10.230.1.12 /osu_latency

5 Linpack Test Tools: Deployment and Introduction

Linpack has become the most popular tool internationally for measuring the floating-point performance of high-performance computer systems. It focuses on the performance of solving dense linear equations and stresses the processors' theoretical peak. Linpack tests come in three flavors: Linpack100, Linpack1000, and HPL. HPCG uses a more complex differential-equation workload and emphasizes achievable real-world performance, so any supercomputer's HPCG score is much lower than its Linpack score.
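HPL reports performance as the standard LINPACK operation count divided by wall-clock time; a quick sketch of that arithmetic (the N and runtime below are made-up example values):

```python
def hpl_gflops(n, seconds):
    """GFLOPS from the standard LINPACK operation count 2/3*N^3 + 2*N^2."""
    return (2.0 / 3.0 * n**3 + 2.0 * n**2) / seconds / 1e9

# Hypothetical run: N = 30000 solved in 600 s.
print(round(hpl_gflops(30000, 600.0), 2))  # ~30.0 GFLOPS
```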

5.1 Installing and using HPL

5.1.1 Prepare the base environment

Before installing HPL, set up the GCC/Fortran 77 compilers, the BLAS/CBLAS/ATLAS libraries, and the MPICH parallel environment.

GCC/Fortran 77:
[root@Server ~]# yum -y install gcc gcc-gfortran

BLAS:
[root@Server ~]# mkdir ~/prepare && cd ~/prepare
[root@Server prepare]# wget http://www.netlib.org/blas/blas-3.8.0.tgz
[root@Server prepare]# tar -xzf blas-3.8.0.tgz
[root@Server prepare]# cd BLAS-3.8.0
[root@Server BLAS-3.8.0]# make
[root@Server BLAS-3.8.0]# ar rv libblas.a *.o
[root@Server BLAS-3.8.0]# cd ~/prepare
[root@Server prepare]# wget http://www.netlib.org/blas/blast-forum/cblas.tgz
[root@Server prepare]# tar -xzf cblas.tgz
[root@Server prepare]# cd CBLAS
[root@Server CBLAS]# cp ~/prepare/BLAS-3.8.0/blas_LINUX.a ./
[root@Server CBLAS]# vim Makefile.in
BLLIB = ~/prepare/BLAS-3.8.0/blas_LINUX.a
[root@Server CBLAS]# make
[root@Server CBLAS]# ./testing/xzcblat1

MPICH2:
[root@Server ~]# cd ~/prepare
[root@Server prepare]# wget \
http://www.mpich.org/static/downloads/3.2.1/mpich-3.2.1.tar.gz
[root@Server prepare]# tar xzf mpich-3.2.1.tar.gz
[root@Server prepare]# cd mpich-3.2.1
[root@Server mpich-3.2.1]# ./configure --disable-cxx
[root@Server mpich-3.2.1]# make
[root@Server mpich-3.2.1]# make install
[root@Server mpich-3.2.1]# mkdir machinefile
[root@Server mpich-3.2.1]# mpiexec -f machinefile -n 3 hostname && mpiexec -n 5 -f machinefile ./examples/cpi

5.1.2 Install HPL and run a parallel test

[root@Server ~]# cd ~/prepare
[root@Server prepare]# cp CBLAS/lib/* /usr/local/lib
[root@Server prepare]# cp BLAS-3.8.0/blas_LINUX.a /usr/local/lib
[root@Server prepare]# wget http://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz
[root@Server prepare]# tar -xzf hpl-2.3.tar.gz
[root@Server prepare]# cd hpl-2.3
[root@Server hpl-2.3]# cp setup/Make.Linux_PII_CBLAS ./
[root@Server hpl-2.3]# cp include/* /usr/local/include
[root@Server hpl-2.3]# vim Make.top
arch = Linux_PII_CBLAS
[root@Server hpl-2.3]# vim Makefile
arch = Linux_PII_CBLAS
[root@Server hpl-2.3]# vim Make.Linux_PII_CBLAS
LN_S         = ln -sf
ARCH         = Linux_PII_CBLAS
TOPdir       = /root/prepare/hpl-2.3
MPdir        = /usr/local
MPlib        = $(MPdir)/lib/libmpich.so
LAdir        = /usr/local/lib
LAlib        = $(LAdir)/cblas_LINUX.a $(LAdir)/blas_LINUX.a
CC           = /usr/local/bin/mpicc
LINKER       = /usr/local/bin/mpif77
[root@Server hpl-2.3]# make arch=Linux_PII_CBLAS
[root@Server hpl-2.3]# cd /bin/Linux_PII_CBLAS
[root@Server Linux_PII_CBLAS]# mpirun -np 4 ./xhpl
(screenshot: command output)

5.1.3 Understanding the HPL configuration file

HPL's results depend on the parameters in its configuration file.

[root@Server ~]# cd /root/prepare/hpl-2.3/bin/Linux_PII_CBLAS
[root@server1 Linux_PII_CBLAS]# cat HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
4            # of problems sizes (N)
29 30 34 35  Ns
4            # of NBs
1 2 3 4      NBs
0            PMAP process mapping (0=Row-,1=Column-major)
3            # of process grids (P x Q)
2 1 4        Ps
2 4 1        Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
  • Lines 5-6: N is the number and size of the matrices to solve. The larger the problem size N, the larger the share of effective computation and the higher the measured floating-point performance. A larger size also consumes more memory, and if physical memory runs out and swap is used, performance drops sharply. The sweet spot is a matrix occupying about 80% of total system memory, i.e., N x N x 8 = total memory x 80%.
  • Lines 7-8: the best NB value is found by experiment. NB is generally under 384, and NB x 8 should be a multiple of the cache line size.
  • Lines 10-12: P is the number of processors in the horizontal direction and Q in the vertical direction; P x Q forms a two-dimensional processor grid. P x Q = number of CPUs = number of processes. One process per CPU generally gives the best performance.
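The N-sizing rule above can be turned into a small helper (the 80% ratio and the rounding to a multiple of NB follow the guidance above; the NB of 192 and the 64 GiB memory size are example values):

```python
def suggest_n(total_mem_bytes, nb=192):
    """HPL problem size: N*N*8 bytes ~ 80% of memory, with N rounded
    down to a multiple of the block size NB (192 is just an example NB)."""
    n = int((total_mem_bytes * 0.8 / 8) ** 0.5)
    return n - n % nb

print(suggest_n(64 * 1024**3))  # suggested N for a 64 GiB node
```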

5.2 HPCG

5.2.1 Prepare the base environment

Before installing HPCG, set up the C++ compiler and the MPICH parallel environment.

C++ compiler:
[root@server1 ~]# c++ -v
Using built-in specs.
COLLECT_GCC=c++
Target: x86_64-redhat-linux
gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)

MPICH2:
[root@Server ~]# cd ~/prepare
[root@Server prepare]# wget  \
http://www.mpich.org/static/downloads/3.2.1/mpich-3.2.1.tar.gz
[root@Server prepare]# tar xzf mpich-3.2.1.tar.gz
[root@Server prepare]# cd mpich-3.2.1
[root@Server mpich-3.2.1]# ./configure --disable-cxx
[root@Server mpich-3.2.1]# make
[root@Server mpich-3.2.1]# make install
[root@Server mpich-3.2.1]# mkdir machinefile
[root@Server mpich-3.2.1]# mpiexec -f machinefile -n 3 hostname && mpiexec -n 5 -f machinefile ./examples/cpi

5.2.3 Understanding the HPCG configuration file

HPCG's results depend on the parameters in its configuration file. When the test completes it generates an HPCG-Benchmark report; the key part of the results is the "Performance Summary (times in sec)" section.

[root@Server ~]# cd /root/prepare/hpcg/setup/build/bin
[root@server1 bin]# cat hpcg.dat
HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
104 104 104 # problem size
1800  # run time; a run must last 1800 s to produce an official result
[root@Server ~]# cat HPCG-Benchmark_3.1_2023-03-23_15-30-40.txt
(screenshot: command output)

6 HPC Application Deployment and Parallel Testing

6.1 WRF

6.1.1 Prepare the base environment

Install the compilers and configure the base environment variables on the servers.

[root@Server ~]# cd /data/home/wrf01/202302test/
[root@Server 202302test]# mkdir Build_WRF
[root@Server 202302test]# mkdir TESTS
[root@Server ~]# yum -y install gcc cpp gcc-gfortran gcc-g++ m4 make csh
[root@Server ~]# vi ~/.bashrc
export DIR=/data/home/wrf01/202302test/Build_WRF/LIBRARIES
export CC=gcc
export CXX=g++
export FC=gfortran
export CFLAGS='-m64'
export F77=gfortran
export FFLAGS='-m64'
export PATH=$DIR/mpich/bin:$PATH
export PATH=$DIR/netcdf/bin:$PATH
export NETCDF=$DIR/netcdf
export JASPERLIB=$DIR/grib2/lib
export JASPERINC=$DIR/grib2/include
export LDFLAGS=-L$DIR/grib2/lib
export CPPFLAGS=-I$DIR/grib2/include
export LD_LIBRARY_PATH=$DIR/grib2/lib:$LD_LIBRARY_PATH
[root@Server ~]# source ~/.bashrc

6.1.2 Installing third-party dependencies

Install the third-party libraries on the Server node: build zlib, libpng, mpich, jasper and netcdf, then test the dependencies.

[root@Server ~]# cd /data/home/wrf01/202302test/Build_WRF
[root@Server Build_WRF]# mkdir LIBRARIES

Download the third-party libraries:
[root@Server Build_WRF]# wget https://www2.mmm.ucar.edu/wrf/OnLineTutorial/compile_tutorial/tar_files/zlib-1.2.7.tar.gz
[root@Server Build_WRF]# wget https://www2.mmm.ucar.edu/wrf/OnLineTutorial/compile_tutorial/tar_files/mpich-3.0.4.tar.gz
[root@Server Build_WRF]# wget https://www2.mmm.ucar.edu/wrf/OnLineTutorial/compile_tutorial/tar_files/netcdf-4.1.3.tar.gz
[root@Server Build_WRF]# wget https://www2.mmm.ucar.edu/wrf/OnLineTutorial/compile_tutorial/tar_files/jasper-1.900.1.tar.gz
[root@Server Build_WRF]# wget https://www2.mmm.ucar.edu/wrf/OnLineTutorial/compile_tutorial/tar_files/libpng-1.2.50.tar.gz

Build and install zlib:
[root@Server Build_WRF]# tar xzvf zlib-1.2.7.tar.gz 
[root@Server Build_WRF]# cd zlib-1.2.7    
[root@Server zlib-1.2.7]# ./configure --prefix=$DIR/grib2
[root@Server zlib-1.2.7]# make
[root@Server zlib-1.2.7]# make install

Build and install libpng:
[root@Server Build_WRF]# tar xzvf libpng-1.2.50.tar.gz
[root@Server Build_WRF]# cd  libpng-1.2.50
[root@Server libpng-1.2.50]# ./configure --prefix=$DIR/grib2
[root@Server libpng-1.2.50]# make
[root@Server libpng-1.2.50]# make install

Build and install mpich:
[root@Server Build_WRF]# tar xzvf mpich-3.0.4.tar.gz 
[root@Server Build_WRF]# cd  mpich-3.0.4
[root@Server mpich-3.0.4]# ./configure --prefix=$DIR/mpich
[root@Server mpich-3.0.4]# make
[root@Server mpich-3.0.4]# make install

Build and install jasper:
[root@Server Build_WRF]# tar xzvf jasper-1.900.1.tar.gz 
[root@Server Build_WRF]# cd  jasper-1.900.1
[root@Server jasper-1.900.1]# ./configure --prefix=$DIR/grib2
[root@Server jasper-1.900.1]# make
[root@Server jasper-1.900.1]# make install

Build and install netcdf:
[root@Server Build_WRF]# tar xzvf netcdf-4.1.3.tar.gz
[root@Server Build_WRF]# cd  netcdf-4.1.3
[root@Server netcdf-4.1.3]# ./configure --prefix=$DIR/netcdf \
--disable-dap --disable-netcdf-4 --disable-shared
[root@Server netcdf-4.1.3]# make
[root@Server netcdf-4.1.3]# make install

6.1.3 Testing the dependencies

Verify on the Server node that the installed dependencies are usable.

[root@Server Build_WRF]# cd TESTS
[root@Server TESTS]# wget https://www2.mmm.ucar.edu/wrf/OnLineTutorial/compile_tutorial/tar_files/Fortran_C_NETCDF_MPI_tests.tar
[root@Server TESTS]# tar -xf Fortran_C_NETCDF_MPI_tests.tar

Test Fortran + C + NetCDF:
[root@Server TESTS]# cp ${NETCDF}/include/netcdf.inc .
[root@Server TESTS]# gfortran -c 01_fortran+c+netcdf_f.f
[root@Server TESTS]# gcc -c 01_fortran+c+netcdf_c.c
[root@Server TESTS]# gfortran 01_fortran+c+netcdf_f.o 01_fortran+c+netcdf_c.o -L${NETCDF}/lib -lnetcdff -lnetcdf
[root@Server TESTS]# ./a.out

Test Fortran + C + NetCDF + MPI:
[root@Server TESTS]# cp ${NETCDF}/include/netcdf.inc .
[root@Server TESTS]# mpif90 -c 02_fortran+c+netcdf+mpi_f.f
[root@Server TESTS]# mpicc -c 02_fortran+c+netcdf+mpi_c.c
[root@Server TESTS]# mpif90 02_fortran+c+netcdf+mpi_f.o 02_fortran+c+netcdf+mpi_c.o -L${NETCDF}/lib -lnetcdff -lnetcdf
[root@Server TESTS]# mpirun ./a.out

6.1.4 Installing WRF

[root@Server ~]# cd /data/home/wrf01/202302test/Build_WRF
[root@Server Build_WRF]# wget https://www2.mmm.ucar.edu/wrf/src/WRFV4.0.TAR.gz
[root@Server Build_WRF]# tar xzvf WRFV4.0.TAR.gz
[root@Server Build_WRF]# cd WRF
[root@Server WRF]# ./configure
(output omitted)
[root@Server WRF]# ./compile
[root@Server WRF]# ls -ls main/*.exe

6.1.5 Installing WPS

[root@Server ~]# cd /data/home/wrf01/202302test/Build_WRF
[root@Server Build_WRF]# wget \
https://www2.mmm.ucar.edu/wrf/src/WPSV4.0.TAR.gz
[root@Server Build_WRF]# tar xzvf WPSV4.0.TAR.gz
[root@Server Build_WRF]# cd WPS
[root@Server WPS]# ./clean

Modify the intmath.f file:
[root@Server WPS]# cat ./ungrib/src/ngl/g2/intmath.f
(output omitted)
Build and install WPS:
[root@Server WPS]# ./configure
Enter selection [1-40] : 1
[root@Server WPS]# ./compile
[root@Server WPS]# ls -las *.exe
(output omitted)
[root@Server WPS]# vi namelist.wps
&share
 wrf_core = 'ARW',
 max_dom = 1,
 start_date = '2000-01-24_12:00:00',
 end_date   = '2000-01-26_00:00:00',
 interval_seconds = 21600
 io_form_geogrid = 2,
/

&geogrid
 parent_id         =   1,   1,
 parent_grid_ratio =   1,   3,
 i_parent_start    =   1,  31,
 j_parent_start    =   1,  17,
 e_we              =  104, 142,
 e_sn              =  61,  97,
geog_data_res = '10m','2m',
 dx = 30000,
 dy = 30000,
 map_proj = 'lambert',
 ref_lat   =  34.83,
 ref_lon   = -81.03,
 truelat1  =  30.0,
 truelat2  =  60.0,
 stand_lon = -98.0,
 geog_data_path = '/data/home/wrf01/202302test/Build_WRF/WPS_GEOG/WPS_GEOG/'
/

&ungrib
 out_format = 'WPS',
 prefix = 'FILE',
/

&metgrid
 fg_name = 'FILE'
 io_form_metgrid = 2, 
/

Download the static geographic data:
[root@Server ~]# cd /data/home/wrf01/202302test/Build_WRF
[root@Server Build_WRF]# mkdir WPS_GEOG
Download link: https://www2.mmm.ucar.edu/wrf/users/download/get_sources_wps_geog.html
(output omitted)

6.1.6 Generating the WRF input files

[root@Server ~]# cd /data/home/wrf01/202302test/Build_WRF
Generate the geographic data:
[root@Server ~]# cd /data/home/wrf01/202302test/Build_WRF/WPS
[root@Server WPS]# ./geogrid.exe
[root@Server WPS]# ls -lah geo_em.d01.nc

Download and link the meteorological data:
Meteorological data download site: https://rda.ucar.edu/.
[root@Server Build_WRF]# mkdir DATA
[root@Server Build_WRF]# ls -lah ./DATA/JAN00/fnl*
(output omitted)
[root@Server Build_WRF]# cd WPS
[root@Server WPS]# ./link_grib.csh ../DATA/JAN00/fnl
[root@Server WPS]# ln -sf ungrib/Variable_Tables/Vtable.GFS Vtable
[root@Server WPS]# ./ungrib.exe
[root@Server WPS]# ls -lah FILE*

Merge the meteorological and geographic data:
[root@Server WPS]# ./metgrid.exe

Link the WPS output into WRF:
[root@Server WPS]#  cd ../WRF/test/em_real/
[root@Server em_real]# ln -sf ~/Build_WRF/WPS/met_em* .
[root@Server em_real]# mpirun -np 1 ./real.exe
[root@Server em_real]# ls -alh wrfbdy_d01 wrfinput_d01
(output omitted)

6.1.7 Running the WRF parallel test

[root@Server em_real]# time /usr/mpi/gcc/openmpi-4.1.5a1/bin/mpirun -np 24 -oversubscribe --allow-run-as-root \
--host 10.230.1.11,10.230.1.12  ./wrf.exe
(output omitted)

6.2 LAMMPS

LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is mainly used for molecular-dynamics computation and simulation.

6.2.1 Building and installing GCC 7.3

[root@server ~]# yum -y install gcc gcc-c++ gcc-gfortran texinfo
[root@server ~]# wget http://mirrors.ustc.edu.cn/gnu/gcc/gcc-7.3.0/gcc-7.3.0.tar.gz
[root@server ~]# tar zxvf gcc-7.3.0.tar.gz
[root@server ~]# cd gcc-7.3.0
[root@server ~]# sh ./contrib/download_prerequisites
[root@server ~]# mkdir build && cd build
[root@server ~]# ../configure \
--prefix=/usr/local/gcc-7.3 \
--disable-bootstrap \
--enable-languages=c,c++,fortran \
--disable-multilib
[root@server ~]# make -j
[root@server ~]# make install
[root@server ~]# vi ~/.bashrc
export GCC_HOME=/usr/local/gcc-7.3
export PATH=$GCC_HOME/bin:$PATH
export MANPATH=$GCC_HOME/share/man
export CPATH=$GCC_HOME/include:$CPATH
export LD_LIBRARY_PATH=$GCC_HOME/lib:$GCC_HOME/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=$GCC_HOME/lib:$GCC_HOME/lib64:$LIBRARY_PATH
[root@server ~]# source ~/.bashrc
[root@server ~]# gcc --version
(output omitted)

6.2.2 Building and installing OpenMPI

[root@server ~]# yum install -y gcc gcc-c++ gcc-gfortran
[root@server ~]# wget  \
https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.4.tar.bz2
[root@server ~]# tar jxvf openmpi-4.0.4.tar.bz2
[root@server ~]# cd openmpi-4.0.4
[root@server ~]# mkdir build && cd build
[root@server ~]# ../configure \
--prefix=/usr/local/openmpi-4.0.4 CC=gcc CXX=g++ \
FC=gfortran F77=gfortran
[root@server ~]# make -j
[root@server ~]# make install
[root@server ~]# vi ~/.bashrc
export GCC_HOME=/usr/local/gcc-7.3
export PATH=$GCC_HOME/bin:$PATH
export MANPATH=$GCC_HOME/share/man
export CPATH=$GCC_HOME/include:$CPATH
export LD_LIBRARY_PATH=$GCC_HOME/lib:$GCC_HOME/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=$GCC_HOME/lib:$GCC_HOME/lib64:$LIBRARY_PATH
export PATH=/usr/local/openmpi-4.0.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi-4.0.4/lib:$LD_LIBRARY_PATH
export MANPATH=/usr/local/openmpi-4.0.4/share/man:$MANPATH
[root@server ~]# source ~/.bashrc
[root@server ~]# mpirun --version
(output omitted)

6.2.3 Building and installing FFTW

[root@server ~]# wget ftp://ftp.fftw.org/pub/fftw/fftw-3.3.8.tar.gz
[root@server ~]# tar zxvf fftw-3.3.8.tar.gz
[root@server ~]# cd fftw-3.3.8
[root@server ~]# mkdir build && cd build 
[root@server ~]# ../configure \
--prefix=/usr/local/fftw \
--enable-mpi \
--enable-openmp \
--enable-shared \
--enable-static
[root@server ~]# make -j
[root@server ~]# make install
[root@server ~]# vi ~/.bashrc
export PATH=/usr/local/fftw/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/fftw/lib:$LD_LIBRARY_PATH
[root@server ~]# source ~/.bashrc
[root@server ~]# fftw-wisdom --version
(output omitted)

6.2.4 Building and installing LAMMPS

[root@server ~]# yum -y install libjpeg-devel libpng-devel
[root@server ~]# wget https://lammps.sandia.gov/tars/lammps-3Mar20.tar.gz
[root@server ~]# tar zxvf lammps-3Mar20.tar.gz
[root@server ~]# cd lammps-3Mar20/src
[root@server ~]# vi MAKE/Makefile.mpi
(output omitted)
[root@server ~]# make yes-MANYBODY
[root@server ~]# make -j mpi
[root@server ~]# mkdir -p /usr/local/lammps/bin
[root@server ~]# cp lmp_mpi /usr/local/lammps/bin/
[root@server ~]# vi ~/.bashrc
export PATH=/usr/local/lammps/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lammps/lib:$LD_LIBRARY_PATH
[root@server ~]# source ~/.bashrc

6.2.5 Running the LAMMPS parallel test

[root@server1 ~]# cd ~/lammps/lammps-stable_3Mar2020/examples/shear
[root@server1 ~]# vi in.shear
atom_style       atomic
region            box block 0 16.0 0 10.0 0 2.828427
create_box       100 box
thermo            25
thermo_modify   temp new3d
timestep         0.001
thermo_modify   temp new2d
reset_timestep  0
run               340000
[root@server1 ~]# mpirun --allow-run-as-root -np 4 --oversubscribe \
--host 10.230.1.11,10.230.1.12 lmp_mpi \
< /root/lammps/lammps-3Mar20/examples/shear/in.shear

Offload Verification and Performance Testing of DPVS, a Layer-4 Load Balancer, on the DPU Daughter Card of X-T Series Switches

1 Solution Overview

This document describes the verification of the service-offload capability of the DPU daughter card for Asterfusion X-T series switches, and the load-balancing performance measured after offload. DPVS, an open-source high-performance layer-4 load balancer based on DPDK, is used as the offload example.

Before reproducing the scenarios in this document, it is recommended to read the "X-T Programmable Bare Metal User Guide" to understand the concepts of Asterfusion X-T series switches and DPU daughter cards.

2 Hardware and Software Environment

The hardware and software environments used in the verification are listed in Table 2-1 and Table 2-2.

Name            Model                 Hardware spec                       Qty
Switch          X312P-48Y-T           fitted with one DPU daughter card   1
Server          generic x86 server    with 10G optical ports              4
Optical module  10G                   SFP+                                8
Fiber           multimode             10G capable                         4

Table 2-1: Hardware environment

Software     Version                               Notes
Server OS    CentOS Linux release 7.8.2003 (Core)  open-source release
Switch OS    AsterNOS v3.1                         contact technical support for the matching package
DPU card OS  Debian 10.3 (Kernel 4.14.76-17.0.1)   contact technical support for the matching package
DPDK         19.11.0                               contact technical support for the matching package
DPVS         1.8-8                                 open-source release

Table 2-2: Software environment

3 Verification Approach and Procedure

3.1 Verification approach

To verify the DPVS offload capability of the DPU daughter card in Asterfusion X-T series switches, four servers are attached directly to the switch: one acts as the Client issuing HTTP requests, and the other three act as Real Servers running web services that answer those requests. DPDK and DPVS are built and installed on the DPU daughter card, and a two-arm Full-NAT layer-4 load-balancing configuration is tested. The device connection topology is shown in Figure 3-1.

Device connection topology for the DPVS offload verification
Figure 3-1: Device connection topology for the DPVS offload verification

In DPVS two-arm mode, two VLANs must be configured on the switch: one carries traffic between the Client and port dpdk1 on the DPU daughter card, the other carries traffic between the three back-end Real Servers and port dpdk0. The VLAN assignment, port allocation, port roles and IP addressing are shown in Figure 3-2.

Network topology for the DPVS offload verification
Figure 3-2: Network topology for the DPVS offload verification

3.2 Verification procedure

3.2.1 Configuring VLANs on the switch

# Configure VLANs
admin@sonic:~$ sudo config vlan add 10
admin@sonic:~$ sudo config vlan add 200
admin@sonic:~$ sudo config vlan member add 10 Ethernet1 -u
admin@sonic:~$ sudo config vlan member add 10 Ethernet2 -u
admin@sonic:~$ sudo config vlan member add 10 Ethernet3 -u
admin@sonic:~$ sudo config vlan member add 10 Ethernet116 -u
admin@sonic:~$ sudo config vlan member add 200 Ethernet20 -u
admin@sonic:~$ sudo config vlan member add 200 Ethernet112 -u

3.2.2 Building and installing DPDK and DPVS on the DPU daughter card

# Set up the build environment
root@OCTEONTX:~# apt-get install libpopt0 libpopt-dev libnl-3-200 libnl-3-dev libnl-genl-3-dev libpcap-dev
root@OCTEONTX:~# tar xvf linux-custom.tgz
root@OCTEONTX:~# ln -s `pwd`/linux-custom /lib/modules/`uname -r`/build

# Build DPDK
root@OCTEONTX:~# cd /var/dpvs/
root@OCTEONTX:/var/dpvs# tar xvf dpdk-19.11.0_raw.tar.bz2
root@OCTEONTX:/var/dpvs# cd dpdk-19.11.0
root@OCTEONTX:/var/dpvs/dpdk-19.11.0# export TARGET="arm64-octeontx2-linux-gcc"
root@OCTEONTX:/var/dpvs/dpdk-19.11.0# export RTE_SDK=`pwd`
root@OCTEONTX:/var/dpvs/dpdk-19.11.0# export RTE_TARGET="build"
root@OCTEONTX:/var/dpvs/dpdk-19.11.0# export PATH="${PATH}:$RTE_SDK/usertools"
root@OCTEONTX:/var/dpvs/dpdk-19.11.0# make config T=arm64-octeontx2-linux-gcc
root@OCTEONTX:/var/dpvs/dpdk-19.11.0# sed -i 's/CONFIG_RTE_LIBRTE_PMD_PCAP=n/CONFIG_RTE_LIBRTE_PMD_PCAP=y/g' $RTE_SDK/build/.config
root@OCTEONTX:/var/dpvs/dpdk-19.11.0# make -j

# Build DPVS
root@OCTEONTX:~# cd /var/dpvs/
root@OCTEONTX:/var/dpvs# tar xvf dpvs.tar
root@OCTEONTX:/var/dpvs# cd dpvs/
root@OCTEONTX:/var/dpvs/dpvs# patch -p1 < dpvs_5346e4c645c_with_dpdk.patch
root@OCTEONTX:/var/dpvs/dpvs# make -j
root@OCTEONTX:/var/dpvs/dpvs# make install

# Load kernel modules, set up hugepages, and bind the DPDK driver to the target ports
root@OCTEONTX:~# cd /var/dpvs
root@OCTEONTX:/var/dpvs# insmod /var/dpvs/dpdk-19.11.0/build/build/kernel/linux/kni/rte_kni.ko carrier=on
root@OCTEONTX:/var/dpvs# echo 128 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
root@OCTEONTX:/var/dpvs# mount -t hugetlbfs nodev /mnt/huge -o pagesize=2M
root@OCTEONTX:/var/dpvs# dpdk-devbind.py -b vfio-pci 0002:02:00.0
root@OCTEONTX:/var/dpvs# dpdk-devbind.py -b vfio-pci 0002:07:00.0
root@OCTEONTX:/var/dpvs# dpdk-devbind.py -s

Network devices using DPDK-compatible driver
============================================
0002:02:00.0 'Device a063' drv=vfio-pci unused=
0002:07:00.0 'Device a063' drv=vfio-pci unused=

Network devices using kernel driver
===================================
0000:01:10.0 'Device a059' if= drv=octeontx2-cgx unused=vfio-pci 
0000:01:10.1 'Device a059' if= drv=octeontx2-cgx unused=vfio-pci 
0000:01:10.2 'Device a059' if= drv=octeontx2-cgx unused=vfio-pci 
......
root@OCTEONTX:/var/dpvs#

3.2.3 Configuring the load-balancing service on the DPU daughter card

root@OCTEONTX:/var/dpvs# ./dpvs/bin/dpvs -- -w 0002:02:00.0 -w 0002:07:00.0
root@OCTEONTX:/var/dpvs# 
root@OCTEONTX:/var/dpvs# ./dpvs/bin/dpip link set dpdk0 link up
root@OCTEONTX:/var/dpvs# ./dpvs/bin/dpip link set dpdk1 link up
root@OCTEONTX:/var/dpvs# ./dpvs/bin/dpip addr add 10.0.0.10/32 dev dpdk0 sapool
root@OCTEONTX:/var/dpvs# ./dpvs/bin/dpip addr add 200.0.0.200/32 dev dpdk1
root@OCTEONTX:/var/dpvs# ./dpvs/bin/dpip route add 10.0.0.0/24 dev dpdk0
root@OCTEONTX:/var/dpvs# ./dpvs/bin/dpip route add 200.0.0.0/24 dev dpdk1
root@OCTEONTX:/var/dpvs# 
root@OCTEONTX:/var/dpvs# ./dpvs/bin/ipvsadm -A -t 200.0.0.200:80 -s rr
root@OCTEONTX:/var/dpvs# ./dpvs/bin/ipvsadm -a -t 200.0.0.200:80 -r 10.0.0.11 -b
root@OCTEONTX:/var/dpvs# ./dpvs/bin/ipvsadm -a -t 200.0.0.200:80 -r 10.0.0.12 -b
root@OCTEONTX:/var/dpvs# ./dpvs/bin/ipvsadm -a -t 200.0.0.200:80 -r 10.0.0.13 -b
root@OCTEONTX:/var/dpvs# ./dpvs/bin/ipvsadm --add-laddr -z 10.0.0.10 -t 200.0.0.200:80 -F dpdk0
root@OCTEONTX:/var/dpvs# 
root@OCTEONTX:/var/dpvs# ./dpvs/bin/ipvsadm -G 
VIP:VPORT            TOTAL    SNAT_IP              CONFLICTS  CONNS     
200.0.0.200:80       1        
                              10.0.0.10            0          0    
root@OCTEONTX:/var/dpvs# ./dpvs/bin/ipvsadm -ln
IP Virtual Server version 0.0.0 (size=0)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  200.0.0.200:80 rr
  -> 10.0.0.11:80                 FullNat 1      0          0         
  -> 10.0.0.12:80                 FullNat 1      0          0         
  -> 10.0.0.13:80                 FullNat 1      0          0      
root@OCTEONTX:/var/dpvs# 

3.2.4 Configuring the network and web service on the three Real Servers

# Real Server 01
[root@node-01 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether b8:59:9f:42:36:69 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.11/24 brd 10.0.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::ba59:9fff:fe42:3669/64 scope link 
       valid_lft forever preferred_lft forever
[root@node-01 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.0.10       0.0.0.0         UG    0      0        0 eth0
10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 eth0
[root@node-01 ~]# cat index.html 
Real Server 01
[root@node-01 ~]# python -m SimpleHTTPServer 80
Serving HTTP on 0.0.0.0 port 80 ...
10.0.0.10 - - [23/Dec/2022 02:57:18] "GET / HTTP/1.1" 200 -

# Real Server 02
[root@node-02 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 68:91:d0:64:02:f1 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.12/24 brd 10.0.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::6a91:d0ff:fe64:2f1/64 scope link 
       valid_lft forever preferred_lft forever
[root@node-02 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.0.10       0.0.0.0         UG    0      0        0 eth0
10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 eth0
[root@node-02 ~]# python -m SimpleHTTPServer 80
Serving HTTP on 0.0.0.0 port 80 ...
10.0.0.10 - - [23/Dec/2022 08:16:40] "GET / HTTP/1.1" 200 -

# Real Server 03
[root@node-03 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ac state UP group default qlen 1000
    link/ether b8:59:9f:c7:73:cb brd ff:ff:ff:ff:ff:ff
    inet6 fe80::ba59:9fff:fec7:73cb/64 scope link 
       valid_lft forever preferred_lft forever
[root@node-03 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.0.10       0.0.0.0         UG    0      0        0 eth1
10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 eth1
[root@node-03 ~]# python -m SimpleHTTPServer 80
Serving HTTP on 0.0.0.0 port 80 ...
10.0.0.10 - - [23/Dec/2022 08:16:39] "GET / HTTP/1.1" 200 -

3.2.5 Configuring the Client network

[root@node-00 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether b8:59:9f:42:36:68 brd ff:ff:ff:ff:ff:ff
    inet 200.0.0.48/24 brd 200.0.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::ba59:9fff:fe42:3668/64 scope link 
       valid_lft forever preferred_lft forever
[root@node-00 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         200.0.0.1       0.0.0.0         UG    0      0        0 eth0
200.0.0.0       0.0.0.0         255.255.255.0   U     0      0        0 eth0
[root@node-00 ~]#

3.3 Verifying the offload result

# On the Client, use curl against http://<VIP> to verify DPVS load balancing
[root@node-00 ~]# curl http://200.0.0.200
Real Server 01
[root@node-00 ~]# curl http://200.0.0.200
Real Server 02
[root@node-00 ~]# curl http://200.0.0.200
Real Server 03

The verification shows that DPVS, a high-performance layer-4 load balancer, was successfully offloaded onto the DPU daughter card of the Asterfusion X-T switch. The Client's results confirm that when it accesses http://200.0.0.200, packets are first forwarded by the switch to the DPU daughter card, where DPVS distributes them round-robin, according to the configured rules, to the three back-end web servers.
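The scheduler behavior seen above (the rr option of ipvsadm) can be sketched in a few lines of Python; this is only an illustration of round-robin selection, not DPVS code:

```python
from itertools import cycle

# The three back-end Real Servers configured with ipvsadm above.
real_servers = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
rr = cycle(real_servers)

# Each new connection to the VIP goes to the next backend in turn,
# matching the alternating Real Server 01/02/03 responses seen with curl.
picks = [next(rr) for _ in range(6)]
print(picks)
```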

4 Performance Testing

4.1 Test environment

4.1.1 Test topology

Test network topology
Figure 4-1: Test network topology

As shown in Figure 4-1, the tester's 100G port is connected to a 100G front-panel port of the switch. Rules configured in AsterNOS send traffic arriving on 100G port C1 to C16, a 100G channel connected to the DPU daughter card; after DPVS processes the load, the traffic returns along the same path.

4.1.2 Traffic generator configuration

  • Source MAC fixed; destination MAC set to the MAC of the NIC used by DPVS;
  • Source IP incrementing; destination IP fixed to the VIP configured in DPVS;
  • In the TCP header, source port fixed, destination port 80, SYN flag set to 1, all other flags 0.

4.2 Test results

4.2.1 Maximum single-flow forwarding performance per core count, Full-NAT mode

Streaming 64-byte packets at 100 Gbps, the maximum forwarding performance of DPVS measured at each core count is shown in the table below.

Cores                              1      4      8      16     23
DPVS forwarding bandwidth (Gbps)   0.77   2.66   5.16   10.01  14.16

Table 4-1: Maximum single-flow forwarding performance per core count, Full-NAT mode

4.2.2 Stable forwarding performance with 16 cores, Full-NAT mode

Streaming packets of each length continuously for 5 minutes, the zero-loss stable forwarding performance of DPVS with 16 cores was measured.

Packet length (Byte)               78    128   256   512
DPVS forwarding bandwidth (Gbps)   9.6   13.6  21.6  31.2

Table 4-2: Stable forwarding performance with 16 cores, Full-NAT mode

4.2.3 Stable forwarding performance with 23 cores, Full-NAT mode

Streaming packets of each length continuously for 5 minutes, the zero-loss stable forwarding performance of DPVS with 23 cores was measured.

Packet length (Byte)               78    128   256   512
DPVS forwarding bandwidth (Gbps)   12.2  17.7  27.9  42.8

Table 4-3: Stable forwarding performance with 23 cores, Full-NAT mode

4.2.4 Multi-core connection-setup performance, Full-NAT mode

Streaming 64-byte packets at 100 Gbps, the maximum number of new connections per second DPVS can establish was measured at each core count.

Cores                        1     4     8     16     23
Max connections per second   220k  550k  940k  1630k  2030k

Table 4-4: Multi-core connection-setup performance, Full-NAT mode

5 Summary

The functional verification and performance results above show that the DPU daughter card of Asterfusion X-T series switches can run DPVS and carry load-balancing services just like a general-purpose x86 or Arm server. A single DPU daughter card lets DPVS reach 42.8 Gbps of Full-NAT forwarding bandwidth, and an X312P-48Y-T fitted with two DPU daughter cards reaches roughly 85 Gbps. For production deployments, higher-performance DPU daughter cards can be selected to match heavier traffic.

Beyond the layer-4 load-balancing gateway, the DPU daughter card has many other commercial uses: smart gateways (rate-limiting, leased-line, peering, protocol, east-west, and public-service gateways, among others), DC Spine/Leaf fabrics, and traffic aggregation and distribution. Once a service is offloaded, the hardware acceleration units on the card can further raise its performance.

6 Appendix 1: DPVS Configuration File Used During Testing

root@OCTEONTX:/var/dpvs# cat /etc/dpvs.conf 
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! This is dpvs default configuration file.
!
! The attribute "<init>" denotes the configuration item at initialization stage. Item of
! this type is configured oneshoot and not reloadable. If invalid value configured in the
! file, dpvs would use its default value.
!
! Note that dpvs configuration file supports the following comment type:
!   * line comment: using '#" or '!'
!   * inline range comment: using '<' and '>', put comment in between
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

! global config
global_defs {
    log_level   INFO
    log_file    /var/log/dpvs.log
    ! log_async_mode    on
}

! netif config
netif_defs {
    <init> pktpool_size     65535
    <init> pktpool_cache    32

    <init> device dpdk0 {
        rx {
            queue_number        1
            descriptor_number   1024
            rss                 all
        }
        tx {
            queue_number        1
            descriptor_number   1024
        }
        fdir {
            mode                perfect
            pballoc             64k
            status              matched
        }
        promisc_mode
        kni_name                dpdk0.kni
    }

    <init> device dpdk1 {
        rx {
            queue_number        1
            descriptor_number   1024
            rss                 all
        }
        tx {
            queue_number        1
            descriptor_number   1024
        }
        fdir {
            mode                perfect
            pballoc             64k
            status              matched
        }
        promisc_mode
        kni_name                dpdk1.kni
    }

    ! <init> bonding bond0 {
    !    mode        0
    !    slave       dpdk0
    !    slave       dpdk1
    !    primary     dpdk0
    !    kni_name    bond0.kni
    !}
}

! worker config (lcores)
worker_defs {
    <init> worker cpu0 {
        type    master
        cpu_id  0
    }

    <init> worker cpu1 {
        type    slave
        cpu_id  1
        port    dpdk0 {
            rx_queue_ids     0
            tx_queue_ids     0
            ! isol_rx_cpu_ids  9
            ! isol_rxq_ring_sz 1048576
        }
        port    dpdk1 {
            rx_queue_ids     0
            tx_queue_ids     0
            ! isol_rx_cpu_ids  9
            ! isol_rxq_ring_sz 1048576
        }
    }
}

! timer config
timer_defs {
    # cpu job loops to schedule dpdk timer management
    schedule_interval    500
}

! dpvs neighbor config
neigh_defs {
    <init> unres_queue_length  128
    timeout                    60
}

! dpvs ipv4 config
ipv4_defs {
    forwarding                 on
    <init> default_ttl         64
    fragment {
        <init> bucket_number   4096
        <init> bucket_entries  16
        <init> max_entries     4096
        <init> ttl             1
    }
}

! dpvs ipv6 config
ipv6_defs {
    disable                     off
    forwarding                  off
    route6 {
        <init> method           hlist
        recycle_time            10
    }
}

! control plane config
ctrl_defs {
    lcore_msg {
        <init> ring_size                4096
        sync_msg_timeout_us             20000
        priority_level                  low
    }
    ipc_msg {
        <init> unix_domain /var/run/dpvs_ctrl
    }
}

! ipvs config
ipvs_defs {
    conn {
        <init> conn_pool_size       2097152
        <init> conn_pool_cache      256
        conn_init_timeout           3
        ! expire_quiescent_template
        ! fast_xmit_close
        ! <init> redirect           off
    }

    udp {
        ! defence_udp_drop
        uoa_mode        opp
        uoa_max_trail   3
        timeout {
            normal      300
            last        3
        }
    }

    tcp {
        ! defence_tcp_drop
        timeout {
            none        2
            established 90
            syn_sent    3
            syn_recv    30
            fin_wait    7
            time_wait   7
            close       3
            close_wait  7
            last_ack    7
            listen      120
            synack      30
            last        2
        }
        synproxy {
            synack_options {
                mss             1452
                ttl             63
                sack
                ! wscale
                ! timestamp
            }
            ! defer_rs_syn
            rs_syn_max_retry    3
            ack_storm_thresh    10
            max_ack_saved       3
            conn_reuse_state {
                close
                time_wait
                ! fin_wait
                ! close_wait
                ! last_ack
           }
        }
    }
}

! sa_pool config
sa_pool {
    pool_hash_size   16
}
root@OCTEONTX:/var/dpvs# 

Storage Performance Metrics and Common Test Tools

1 Storage Scenarios

By product/architecture, storage divides into NAS, SAN and software-defined storage; software-defined storage further divides, by business scenario and architecture, into distributed storage, hyper-converged, and database-appliance architectures, as shown in the diagrams below.

Distributed storage architecture

Figure 1: Distributed storage architecture

Hyper-converged architecture

Figure 2: Hyper-converged architecture

Database appliance architecture

Figure 3: Database appliance architecture

From the end user's point of view, storage divides into block (virtual disks for VMs, databases, etc.), file (AI, HPC, big data, etc.) and object (massive data storage). The performance metrics and test tools below should be understood from this end-user perspective; the three architecture diagrams above are background only.

2 Common Test Tools

Application scenario        Tools
Basic performance / block   dd, fio, iostat
File system                 filebench, iozone, mdtest
Object storage              cosbench
Database                    swingbench, hammerdb
Cloud environment           vdbench

Table 1: Common test tools by category

3 Interpreting Storage Performance Metrics

Storage performance tests fall into two dimensions, IO latency and IOPS, and each dimension is measured separately for reads and writes and for different block sizes. An IO is a single read or write request; IO latency is the time from issuing a request until the storage system's response is received; IOPS is the number of IO requests the storage system can service per second.

IO size also directly affects performance. Small IOs move little data per operation, for example 1K, 4K or 8K; large IOs move more, for example 32K, 64K or beyond. In general, larger IOs give higher throughput and smaller IOs give higher IOPS; most real workloads mix IO sizes.

IO is also either sequential or random. Because of controller read/write caching, prefetching, and the physics of the storage media, random IO usually performs far worse than sequential IO, and writes far worse than reads. Sequential IO requests long runs of adjacent blocks; typical workloads are logging, backup/restore and streaming media, and sequential performance is usually the peak figure. Random IO requests blocks scattered across the media, for example highly concurrent reads and writes of many small files, which depresses both IOPS and throughput; typical workloads are OLTP, swap partitions and operating systems, and random performance is usually the floor.

Next is a real storage performance test result: a domestic database-appliance vendor built one fabric with a Mellanox SB7700 and one with an Asterfusion CX532P-N, then tested the appliance's storage system with fio. The results are shown below.

Test item                            Mellanox SB7700 (100G IB)   Asterfusion CX532P-N (low-latency Ethernet)
latr (latency, 4k random read)       141.79 us                   132.84 us
latw (latency, 4k random write)      79.67 us                    71.6 us
latr-8k (latency, 8k random read)    150.64 us                   145.83 us
latw-8k (latency, 8k random write)   80.89 us                    73.89 us
4kr, 1 stress server (IOPS)          1239k                       1275k
4kw, 1 stress server (IOPS)          493k                        453k
8kr, 1 stress server (IOPS)          1007k                       939k
8kw, 1 stress server (IOPS)          330k                        310k
1024kr, 1 stress server (IOPS)       11.7k                       11.0k
1024kw, 1 stress server (IOPS)       3709                        3669
4kr, 2 stress servers (IOPS)         2548k                       2633k
4kw, 2 stress servers (IOPS)         850k                        916k
8kr, 2 stress servers (IOPS)         1992k                       1877k
8kw, 2 stress servers (IOPS)         535k                        591k
1024kr, 2 stress servers (IOPS)      17474                       21.2k
1024kw, 2 stress servers (IOPS)      3673                        4820

Table 2: Storage performance test report

Latency was tested 1-to-1; IOPS was stress-tested both 1-to-1 and 2-to-1. When judging a storage system, lower latency is better, since latency reflects the system's responsiveness; higher IOPS is better, and the peak given by IOPS x IO size is the storage system's maximum throughput.
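The IOPS x IO size rule of thumb above is easy to check. The short Python sketch below applies it to two IOPS figures taken from Table 2; it only illustrates the arithmetic, not a measurement:

```python
def peak_throughput_gbps(iops, io_size_bytes):
    """Peak throughput implied by an IOPS figure: IOPS * IO size, in Gbit/s."""
    return iops * io_size_bytes * 8 / 1e9

# 4kr, 2 stress servers, CX532P-N: 2633k IOPS
gbps_4k = peak_throughput_gbps(2_633_000, 4 * 1024)
# 1024kr, 2 stress servers, CX532P-N: 21.2k IOPS
gbps_1m = peak_throughput_gbps(21_200, 1024 * 1024)
print(round(gbps_4k, 1), round(gbps_1m, 1))
```

As expected, the large-IO row implies the higher peak throughput even though its raw IOPS number is far smaller.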

4 Test Procedure and Software Used

In storage scenarios, network-related testing usually proceeds in three steps:

First, storage-network performance testing, which looks at single-link throughput and latency; common tools are iperf and ib_read/write_bw, ib_read/write_lat.

Second, basic storage-system performance testing, which looks at the storage system's latency and throughput; the common tool is fio.

Third, business-level compatibility, stability and performance testing. Compatibility testing checks whether the switch APIs meet the business system's requirements; stability testing covers device-level and link-level high availability; performance testing uses workload-specific tools, for example swingbench and hammerdb for database appliances and cosbench for object storage.

5 Using fio and Interpreting Its Results

5.1 About the tool

fio (Flexible IO Tester) was written by Jens Axboe, who is also known as the maintainer of the Linux kernel's block-IO subsystem. fio is the Swiss-army knife of storage testing: its many tunable parameters combine into a very large number of test cases, and it remains under active development, continually adapting as storage evolves.

5.2 Parameter reference

This demonstration measures the server's performance under a hypothetical small-IO workload (100% random, 70% read, 30% write, 4K IO size).

[root@server ~]# fio \
-filename=/root/randrw_70read_4k.fio \
-direct=1 \
-iodepth 1 \
-thread \
-rw=randrw \
-rwmixread=70 \
-ioengine=psync \
-bs=4k \
-size=5G \
-numjobs=8 \
-runtime=300 \
-group_reporting \
-name=randrw_70read_4k_local

-filename=/root/randrw_70read_4k.fio

fio supports files, raw block devices, and RBD images. This test targets a file system, so filename=<file name>; for an RBD image, filename=<image name>; for a raw device, filename=<device name>. Multiple devices or files can be given at once, separated by colons, for example -filename=/dev/vdc:/dev/vdd.

-direct=1

direct=1 uses direct IO, bypassing the operating system's page cache.

-iodepth=1

iodepth sets the IO queue depth, i.e. how many IO requests a single thread keeps outstanding at once. With synchronous engines the per-thread iodepth is always 1; with asynchronous engines it can be raised so that IOs are submitted in batches and the underlying IO scheduler can merge them, typically to 32 or 64. Increase queue depth only while response time stays acceptable: a deeper queue means IOs wait longer in the queue, which raises IO latency, so there is a trade-off. Use 1 for single-stream tests and 32 for multi-stream tests.

-thread

By default fio forks multiple jobs, i.e. runs them as processes; with thread specified, jobs are created as POSIX threads via pthread_create().

-rw=randrw

Sets the read/write pattern: write (sequential write), read (sequential read), rw (sequential read/write), randwrite (random write), randread (random read), randrw (random read/write).

-rwmixread=70

Sets the read/write mix; in this test reads are 70% of the total IO and writes 30%.

-ioengine=psync

Sets how fio issues IO: sync (synchronous IO), psync (synchronous IO using pwrite/pread internally; unlike write/read they do not move the file position pointer), libaio (Linux native asynchronous IO; Linux only supports asynchronous queueing of non-buffered IO, so direct must be set to 1), posixaio (POSIX asynchronous IO, implemented by glibc in user space with its own worker threads; resource-hungry and poorly scalable), rados (tests RADOS-layer IO directly through the librados interface), rbd (tests RBD image IO directly through the librbd interface). This test uses the psync engine.

-bs=4k

bs (block size) is the data size of each IO. Database workloads typically use small blocks such as 4k or 8k and focus on IOPS; video storage and archiving workloads use large blocks such as 1m or 4m and focus on throughput. By default a lowercase unit means a base of 1024 and an uppercase one a base of 1000, i.e. 1m = 1024k and 1M = 1000k. Use 4K for random read/write tests and 1M for sequential throughput tests.

-size=5g

Total amount of data to transfer. This parameter and runtime both bound the run: fio stops when either limit is reached.
For performance tests set it reasonably large, for example 2g, 5g, 10g or more; when testing on a file system, -size should be < 4g.

-numjobs=8

Number of threads or processes running the workload concurrently; whether they are threads or processes is controlled by the thread parameter described above.

-runtime=300

Total test duration in seconds. Together with size it bounds the run; for general performance tests use a longer duration, for example 5 or 10 minutes.

-group_reporting

With multiple jobs, results are reported per job by default; this option aggregates the results of all jobs into one summary.

-name=randrw_70read_4k_local

The name of this test job.

5.3 Test results

(fio output omitted)

Figure 4: fio performance test results

5.4 Interpreting the results

Lines 16~22

Software version, run parameters, job name, runtime progress output, and so on.

randrw_70read_4k_local: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.7
Starting 8 threads
randrw_70read_4k_local: Laying out IO file (1 file / 5120MiB)
Jobs: 8 (f=8): [m(8)][100.0%][r=404KiB/s,w=164KiB/s][r=101,w=41 IOPS][eta 00m:00s]
randrw_70read_4k_local: (groupid=0, jobs=8): err= 0: pid=49066: Wed Mar  8 14:33:21 2023

Lines 23~33

This part is the read performance result. Overall IO latency lat = submission latency slat + completion latency clat. slat (submission latency) is the time spent submitting an IO, from fio creating it until the kernel starts processing it, i.e. its queueing time; fio reports the minimum, maximum, mean and standard deviation. Synchronous IO has no queue, so slat is not shown for synchronous engines. clat (completion latency) runs from when the kernel starts processing the IO until it completes, excluding submission time.

This part of the report also gives the latency distribution. Taking 99.99th=[ 1020] as an example, it means that 99.99% of IOs completed in under 1020 ms. The last two lines are the measured read bandwidth and IOPS.

# What is the difference between kB/s and KiB/s?
# 1 kB = 1000 bytes. 1 KiB = 1024 bytes.
# Time unit conversions:
# 1 second (s)       = 1000 milliseconds (ms)
# 1 millisecond (ms) = 1000 microseconds (us)
# 1 microsecond (us) = 1000 nanoseconds (ns)
# 1 nanosecond (ns)  = 1000 picoseconds (ps)
# Read performance
   read: IOPS=96, BW=387KiB/s (396kB/s)(113MiB/300047msec)
    clat (usec): min=159, max=1206.1k, avg=81519.63, stdev=87349.89
     lat (usec): min=159, max=1206.1k, avg=81519.97, stdev=87349.89
    clat percentiles (msec):
     |  1.00th=[    4],  5.00th=[    8], 10.00th=[   14], 20.00th=[   21],
     | 30.00th=[   30], 40.00th=[   41], 50.00th=[   54], 60.00th=[   70],
     | 70.00th=[   93], 80.00th=[  127], 90.00th=[  184], 95.00th=[  245],
     | 99.00th=[  405], 99.50th=[  493], 99.90th=[  835], 99.95th=[  885],
     | 99.99th=[ 1020]
   bw (  KiB/s): min=    7, max=  143, per=12.66%, avg=48.99, stdev=20.31, samples=4730
   iops        : min=    1, max=   35, avg=12.17, stdev= 5.09, samples=4730
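To make the percentile notation concrete, here is a small sketch of the nearest-rank percentile calculation on synthetic latency samples (the numbers are illustrative only, not real fio data):

```python
import math

# Synthetic latency samples in ms -- illustrative only, not real fio data.
lats = [4, 8, 14, 21, 30, 41, 54, 70, 93, 127, 184, 245,
        405, 493, 835, 885, 1020, 1206]

def percentile(samples, pct):
    """Smallest value v such that at least pct% of samples are <= v (nearest-rank)."""
    s = sorted(samples)
    k = max(0, math.ceil(pct / 100 * len(s)) - 1)
    return s[k]

print(percentile(lats, 50))     # median of the samples
print(percentile(lats, 99.99))  # effectively the maximum here
```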

Line 34~44

This section reports write performance. The metrics have the same meaning as in the read results above, so they are not repeated here.

# Write performance
  write: IOPS=42, BW=169KiB/s (173kB/s)(49.5MiB/300047msec)
    clat (usec): min=155, max=956586, avg=2619.71, stdev=32750.22
     lat (usec): min=156, max=956586, avg=2620.25, stdev=32750.24
    clat percentiles (usec):
     |  1.00th=[   208],  5.00th=[   233], 10.00th=[   247], 20.00th=[   306],
     | 30.00th=[   330], 40.00th=[   453], 50.00th=[   529], 60.00th=[   857],
     | 70.00th=[   971], 80.00th=[  1156], 90.00th=[  1614], 95.00th=[  4047],
     | 99.00th=[ 14877], 99.50th=[ 18744], 99.90th=[750781], 99.95th=[817890],
     | 99.99th=[918553]
   bw (  KiB/s): min=    7, max=  120, per=14.85%, avg=24.95, stdev=15.98, samples=4044
   iops        : min=    1, max=   30, avg= 6.16, stdev= 4.00, samples=4044

Line 45~47

This section is the overall latency distribution. 250=3.28% means that 3.28% of IOs had a latency between 0us and 250us, so the distribution for this test is: 0us ~ 250us 3.28%, 250us ~ 500us 11.27%, …, 500ms ~ 750ms 0.24%, 750ms ~ 1000ms 0.13%.

lat (usec)   : 250=3.28%, 500=11.27%, 750=2.09%, 1000=5.49%
lat (msec)   : 2=6.16%, 4=1.48%, 10=3.85%, 20=9.96%, 50=19.76%
lat (msec)   : 100=17.51%, 250=15.85%, 500=2.91%, 750=0.24%, 1000=0.13%
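Reading those buckets cumulatively shows how they add up to roughly 100% (a sketch using the percentages from the report above; a small rounding error is expected):

```python
# Cumulative view of the latency distribution above.
# Percentages are taken from the fio report; each bucket covers
# the range up to its bound.
buckets = [
    ("250us", 3.28), ("500us", 11.27), ("750us", 2.09), ("1ms", 5.49),
    ("2ms", 6.16), ("4ms", 1.48), ("10ms", 3.85), ("20ms", 9.96),
    ("50ms", 19.76), ("100ms", 17.51), ("250ms", 15.85), ("500ms", 2.91),
    ("750ms", 0.24), ("1000ms", 0.13),
]
cum = 0.0
for bound, pct in buckets:
    cum += pct
    print(f"<= {bound}: {cum:.2f}%")
# The total reaches ~100%, up to rounding of the per-bucket percentages.
```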

Line 48

This section shows CPU usage: user-space CPU usage, kernel-space CPU usage, number of context switches, major page faults, and minor page faults.

  cpu          : usr=0.01%, sys=0.05%, ctx=41717, majf=0, minf=380

Line 49~53

This section shows the IO depth distribution during the run, reflecting how IO requests were queued against the storage system. This test used a depth of 1, so the result shows 1=100%.

  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=29034,12664,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Line 55~57

This section aggregates the read and write bandwidth results: bandwidth (bw), total IO volume (io), and run time (run).

Run status group 0 (all jobs):
   READ: bw=387KiB/s (396kB/s), 387KiB/s-387KiB/s (396kB/s-396kB/s), io=113MiB (119MB), run=300047-300047msec
  WRITE: bw=169KiB/s (173kB/s), 169KiB/s-169KiB/s (173kB/s-173kB/s), io=49.5MiB (51.9MB), run=300047-300047msec
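As a sanity check, aggregate bandwidth is simply total IO volume divided by run time. Using the 29034 read IOs of 4 KiB issued over 300047 ms from the report above reproduces the READ line, and also shows the KiB/s vs kB/s distinction:

```python
# Cross-check the READ line: 29034 read IOs of 4 KiB over 300047 ms
# should reproduce bw=387KiB/s (396kB/s).
ios, bs, run_ms = 29034, 4096, 300047
total_bytes = ios * bs
bw_kib = total_bytes / 1024 / (run_ms / 1000)   # KiB/s (1 KiB = 1024 B)
bw_kb  = total_bytes / 1000 / (run_ms / 1000)   # kB/s  (1 kB = 1000 B)
print(round(bw_kib), round(bw_kb))
```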

Line 59~62

This section shows block-device activity on the server during the test: device name; total IO count (ios, split by '/', read IOs before the slash and write IOs after); IOs merged by the IO scheduler (merge, read merges/write merges); ticks the device spent busy (ticks, read ticks/write ticks); total time spent in the device queue (in_queue); and device utilization.

Disk stats (read/write):
    dm-0: ios=29013/12724, merge=0/0, ticks=2364321/50184, in_queue=2415058, util=100.00%, aggrios=14517/6380, aggrmerge=0/2, aggrticks=1183183/24592, aggrin_queue=1207762, aggrutil=100.00%
  sdc: ios=29034/12684, merge=0/1, ticks=2366367/48619, in_queue=2414960, util=100.00%
  sda: ios=0/76, merge=0/4, ticks=0/565, in_queue=565, util=0.06%
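The read/write split fields can be pulled apart mechanically; a small sketch that parses the sdc line above:

```python
# Parse the read/write split fields of a "Disk stats" device line.
# Each field is formatted as name=value, with read/write pairs as r/w.
line = ("sdc: ios=29034/12684, merge=0/1, ticks=2366367/48619, "
        "in_queue=2414960, util=100.00%")

stats = {}
for field in line.split(": ", 1)[1].split(", "):
    name, value = field.split("=")
    if "/" in value:
        r, w = value.split("/")
        stats[name] = (int(r), int(w))  # (read, write)
    else:
        stats[name] = value
print(stats["ios"])  # (read IOs, write IOs)
```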
In the full report, Line 23~33 and Line 34~44 summarize the read and write results respectively; these two sections are usually enough to judge a storage system's performance: the lower the overall latency and the higher the IOPS, the better the storage performs.

Appendix: usage documents for common test tools

【1】dd.md

【2】fio.md

【3】iostat.md

【4】HammerDB.md


Full Walkthrough: Building a Distributed GPU Computing Environment from 0 to 1

With the rapid development of AI and large models, traditional centralized computing can no longer keep up with surging data-processing demand. Distributed computing decomposes a computing task into multiple subtasks, computes them in parallel on multiple nodes, and aggregates the results into the final answer. It handles large-scale data and complex computation more efficiently, stably, and flexibly, and has been widely adopted across industries.

So how do you build a distributed computing environment from zero to one? This article walks through the whole process: hardware selection; server-side base configuration; GPU driver installation and collective-communication library setup; enabling lossless Ethernet; and finally importing a large model and running training tests.

1 Hardware Preparation

1.1 GPU Server Selection

GPUs contain a large number of compute cores and can process many data tasks in parallel, making them the key hardware of an AI computing center.

At the overall design level of an AI computing center, the GPU server cluster and the storage server cluster are connected by the compute network (the Scale-out network) and the storage network respectively. Of the two management networks, the in-band management network interconnects the GPU servers for AIOS management-plane traffic, while the out-of-band management network connects every device in the center for operations and maintenance access.

Figure 1: High-level design topology of the AI computing center

With the overall design clear, we compare the internal hardware topologies of a general-purpose compute server and a GPU server to understand the logic behind GPU server selection:

Figure 2 (top): hardware interconnect topology inside a general-purpose compute server
Figure 3 (bottom): hardware interconnect topology inside a GPU server

Figure 2 shows the internal hardware topology of a general-purpose compute server. Its core is two AMD EPYC CPUs, with a number of interfaces fanned out through the IO chiplets to help the CPUs fully deliver their general-purpose compute capability.

Figure 3 shows the internal hardware topology of a GPU server. It is equipped with 8 A100 GPUs, 8 RDMA NICs for compute traffic, and 2 RDMA NICs for storage traffic; the entire IO design exists to let these 8 GPUs deliver their full compute power.

These two topology diagrams show that general-purpose servers and GPU servers differ fundamentally in hardware construction: one is built around general-purpose CPUs, the other around GPUs. These differences must therefore be considered at the hardware-selection stage. As a rule, a general-purpose server cannot be retrofitted into a high-performance GPU server; its PCIe slot count, chassis space, thermal design, and power delivery all fall short.

Once the compute workload has determined the required compute capacity, and thus the GPU model and count, we can go on to plan the networking of the whole GPU cluster.

Due to resource constraints, this experiment uses three lightly modified general-purpose servers for the subsequent parallel training and inference tests.

The compute nodes are configured as follows:

CPU: Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz * 2

GPU: NVIDIA GeForce RTX 4060 Ti 16G * 1

Memory: 128G

Storage: 10T HDD * 2

NICs: MGMT, CX5

Other notes:

Cooling: the GPUs are full-height cards but the servers are only 2U, so the top covers had to be removed;

Power: general-purpose servers usually do not reserve enough power connectors, so an external PSU is used to provide additional power to the GPUs;

The PSU chosen is a Great Wall unit rated at 650W (X6). Its capacity covers three GPUs at once (each RTX 4060 Ti needs 150W of external power), and it provides three 8-pin connectors, one for each GPU.
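A quick arithmetic check on that PSU sizing (assuming, as stated above, up to 150 W of external power per RTX 4060 Ti and a 650 W rated PSU):

```python
# Sanity-check the external PSU sizing described above.
# Assumptions: each RTX 4060 Ti draws up to 150 W from the external
# 8-pin feed, and the PSU is rated at 650 W.
gpu_count = 3
gpu_external_w = 150
psu_rated_w = 650

load_w = gpu_count * gpu_external_w   # total external GPU load
headroom_w = psu_rated_w - load_w     # remaining margin on the PSU
print(load_w, headroom_w)
```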

Figure 4: PSU selection
Figure 5: GPUs and RDMA NICs installed in the servers

1.2 High-Performance Compute Network Selection

The management networks of an AI computing center differ little from those of a traditional general-purpose data center. What is special are the Scale-out compute network and the storage network: the traffic they carry drives the switch selection requirements of RDMA support, low latency, and high throughput.

As shown below, the cabling also differs. GPUs are grouped (#L0~7 in one group, #L8~15 in another) into single-hop high-bandwidth interconnect domains (HB domains) and connected through Rail switches optimized for AI computing, enabling efficient data transfer and compute coordination.

Figure 6: Network cabling diagram

In this experiment, the compute network uses an Asterfusion®️ CX-N series ultra-low-latency switch, specifically the CX308P-48Y-N.

Model          Service Ports                        Switching Capacity
CX864E-N       64 x 800GE OSFP, 2 x 10GE SFP+       102.4Tbps
CX732Q-N       32 x 400GE QSFP-DD, 2 x 10GE SFP+    25.6Tbps
CX664D-N       64 x 200GE QSFP56, 2 x 10GE SFP+     25.6Tbps
CX564P-N       64 x 100GE QSFP28, 2 x 10GE SFP+     12.8Tbps
CX532P-N       32 x 100GE QSFP28, 2 x 10GE SFP+     6.4Tbps
CX308P-48Y-N   48 x 25GE SFP28, 8 x 100GE QSFP28    4.0Tbps
Table 1: Model specifications

Faster large-model training

The CX-N data center switch's single-hop forwarding latency (400ns) is as low as 1/4 to 1/5 of the industry average, minimizing the network's share of end-to-end AI/ML application latency, while multi-dimensional high-availability design keeps the network up at all times, substantially reducing large-model training time and improving overall efficiency.

RoCEv2 as standard across the series

Unlike the tiered license schemes of traditional vendors, CX-N data center switches ship with the same license rights for every application scenario: RoCEv2 capability is standard across the series, including PFC, ECN, Easy RoCE, and other production-grade enhanced network features. Users pay no extra network-build cost for these advanced features and achieve a higher ROI.

An open, vendor-neutral AI/ML network

The openness of Asterfusion's AI/ML network solution lets users manage the network with the systems they already run (K8s, Prometheus, etc.) without duplicated investment. Asterfusion participates in the AI ecosystem as a neutral network vendor, providing focused network solutions that help users avoid full-stack lock-in.

The lab topology and base configuration used in the experiment are shown below.

Figure 7: Lab topology and base configuration

2 Software Preparation

With hardware selection complete, we move on to software configuration: deploying the RoCEv2 switch, configuring the GPU servers, and installing the GPU driver and the collective-communication library.

2.1 RoCEv2 Switch

Figure 8: The CX308P-48Y-N switch

This training environment has few devices, so the network is simple:

1. Connect the 25GE service ports of the CX5 NICs to the CX308P;

2. Enable the global RoCE lossless configuration on the switch with a single command;

3. Put the three 25G service ports into one VLAN to form a single layer-2 network.

As mentioned earlier, CX-N data center switches ship with RoCEv2 as standard; with the AsterNOS network operating system, just two commands configure all the necessary QoS rules and parameters:

noone@MacBook-Air ~ % ssh admin@10.230.1.17
Linux AsterNOS 5.10.0-8-2-amd64 #1 SMP Debian 5.10.46-4 (2021-08-03) x86_64
    _          _                _   _   ___   ____  
   / \    ___ | |_   ___  _ __ | \ | | / _ \ / ___| 
  / _ \  / __|| __| / _ \| '__||  \| || | | |\___ \ 
 / ___ \ \__ \| |_ |  __/| |   | |\  || |_| | ___) |
/_/   \_\|___/ \__| \___||_|   |_| \_| \___/ |____/ 

------- Asterfusion Network Operating System -------

Help:    http://www.asterfusion.com/

Last login: Sun Sep 29 17:10:46 2024 from 172.16.20.241

AsterNOS# configure terminal 
AsterNOS(config)# qos roce lossless   
AsterNOS(config)# qos service-policy roce_lossless 
AsterNOS(config)# end
AsterNOS# show qos roce
                    operational    description
------------------  -------------  ---------------------------------------------------
status              bind           qos roce binding status
mode                lossless       Roce Mode
cable-length        40m            Cable Length(in meters) for Roce Lossless Config
congestion-control
- congestion-mode   ECN            congestion-control
- enabled-tc        3,4            Congestion config enabled Traffic Class
- max-threshold     750000         Congestion config max-threshold
- min-threshold     15360          Congestion config max-threshold
pfc
- pfc-priority      3,4            switch-prio on which PFC is enabled
- rx-enabled        enable         PFC Rx Enabled status
- tx-enabled        enable         PFC Tx Enabled status
trust
- trust-mode        dscp           Trust Setting on the port for packet classification

 RoCE DSCP->SP mapping configurations
==========================================
dscp                       switch-prio
-----------------------  -------------
0,1,2,3,4,5,6,7                      0
10,11,12,13,14,15,8,9                1
16,17,18,19,20,21,22,23              2
24,25,26,27,28,29,30,31              3
32,33,34,35,36,37,38,39              4
40,41,42,43,44,45,46,47              5
48,49,50,51,52,53,54,55              6
56,57,58,59,60,61,62,63              7

 RoCE SP->TC mapping and ETS configurations
================================================
  switch-prio  mode    weight
-------------  ------  --------
            6  SP
            7  SP

 RoCE pool config
======================
name                     switch-prio
-----------------------  -------------
egress_lossy_profile     0 1 2 5 6
ingress_lossy_profile    0 1 2 5 6
egress_lossless_profile  3 4
roce_lossless_profile    3 4

2.2 GPU Server Base Configuration

All of the following operations must be performed on all three servers; this document uses server3 as the example.

2.2.1 Disable the Firewall and SELinux

[root@server3 ~]# systemctl stop firewalld
[root@server3 ~]# systemctl disable firewalld
[root@server3 ~]# setenforce 0
[root@server3 ~]# sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/sysconfig/selinux

2.2.2 Configure Passwordless SSH Between Servers

[root@server3 ~]# ssh-keygen
[root@server3 ~]# ssh-copy-id root@server1
[root@server3 ~]# ssh-copy-id root@server2

2.2.3 Configure Package Repositories

[root@server3 ~]# ll /etc/yum.repos.d/
总用量 80
-rw-r--r-- 1 root root 2278 9月  19 08:00 CentOS-Base.repo
-rw-r--r-- 1 root root  232 9月  19 08:00 cuda-rhel7.repo
-rw-r--r-- 1 root root  210 9月  19 08:00 cudnn-local-rhel7-8.9.7.29.repo
drwxr-xr-x 2 root root 4096 9月  19 07:58 disable.d
-rw-r--r-- 1 root root  664 9月  19 08:00 epel.repo
-rw-r--r-- 1 root root  381 9月  19 08:00 hashicorp.repo
-rw-r--r-- 1 root root  218 9月  19 08:00 kubernetes.repo
-rw-r--r-- 1 root root  152 9月  19 08:00 MariaDB.repo
-rw-r--r-- 1 root root  855 9月  19 08:00 remi-modular.repo
-rw-r--r-- 1 root root  456 9月  19 08:00 remi-php54.repo
-rw-r--r-- 1 root root 1314 9月  19 08:00 remi-php70.repo
-rw-r--r-- 1 root root 1314 9月  19 08:00 remi-php71.repo
-rw-r--r-- 1 root root 1314 9月  19 08:00 remi-php72.repo
-rw-r--r-- 1 root root 1314 9月  19 08:00 remi-php73.repo
-rw-r--r-- 1 root root 1314 9月  19 08:00 remi-php74.repo
-rw-r--r-- 1 root root 1314 9月  19 08:00 remi-php80.repo
-rw-r--r-- 1 root root 1314 9月  19 08:00 remi-php81.repo
-rw-r--r-- 1 root root 1314 9月  19 08:00 remi-php82.repo
-rw-r--r-- 1 root root 2605 9月  19 08:00 remi.repo
-rw-r--r-- 1 root root  750 9月  19 08:00 remi-safe.repo
[root@server3 ~]# more /etc/yum.repos.d/*.repo
::::::::::::::
/etc/yum.repos.d/CentOS-Base.repo
::::::::::::::
# CentOS-Base.repo
#
# The mirror system uses the connecting IP address of the client and the
# update status of each mirror to pick mirrors that are updated to and
# geographically close to the client.  You should use this for CentOS updates
# unless you are manually picking other mirrors.
#
# If the mirrorlist= does not work for you, as a fall back you can try the 
# remarked out baseurl= line instead.
#
#
 
[base]
name=CentOS-7 - Base - mirrors.aliyun.com
failovermethod=priority
baseurl=http://mirrors.aliyun.com/centos/7/os/x86_64/
        http://mirrors.aliyuncs.com/centos/7/os/x86_64/
        http://mirrors.cloud.aliyuncs.com/centos/7/os/x86_64/
gpgcheck=1
gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-7
 
#released updates 
[updates]
name=CentOS-7 - Updates - mirrors.aliyun.com
failovermethod=priority
baseurl=http://mirrors.aliyun.com/centos/7/updates/x86_64/
        http://mirrors.aliyuncs.com/centos/7/updates/x86_64/
        http://mirrors.cloud.aliyuncs.com/centos/7/updates/x86_64/
gpgcheck=1
gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-7
 
#additional packages that may be useful
[extras]
name=CentOS-7 - Extras - mirrors.aliyun.com
failovermethod=priority
baseurl=http://mirrors.aliyun.com/centos/7/extras/x86_64/
        http://mirrors.aliyuncs.com/centos/7/extras/x86_64/
        http://mirrors.cloud.aliyuncs.com/centos/7/extras/x86_64/
gpgcheck=1
gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-7
 
#additional packages that extend functionality of existing packages
[centosplus]
name=CentOS-7 - Plus - mirrors.aliyun.com
failovermethod=priority
baseurl=http://mirrors.aliyun.com/centos/7/centosplus/x86_64/
        http://mirrors.aliyuncs.com/centos/7/centosplus/x86_64/
        http://mirrors.cloud.aliyuncs.com/centos/7/centosplus/x86_64/
gpgcheck=1
enabled=0
gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-7
 
#contrib - packages by Centos Users
[contrib]
name=CentOS-7 - Contrib - mirrors.aliyun.com
failovermethod=priority
baseurl=http://mirrors.aliyun.com/centos/7/contrib/x86_64/
        http://mirrors.aliyuncs.com/centos/7/contrib/x86_64/
        http://mirrors.cloud.aliyuncs.com/centos/7/contrib/x86_64/
gpgcheck=1
enabled=0
gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-7
::::::::::::::
/etc/yum.repos.d/cuda-rhel7.repo
::::::::::::::
[cuda-rhel7-x86_64]
name=cuda-rhel7-x86_64
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64
enabled=1
gpgcheck=1
gpgkey=https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/D42D0685.pub
::::::::::::::
/etc/yum.repos.d/cudnn-local-rhel7-8.9.7.29.repo
::::::::::::::
[cudnn-local-rhel7-8.9.7.29]
name=cudnn-local-rhel7-8.9.7.29
baseurl=file:///var/cudnn-local-repo-rhel7-8.9.7.29
enabled=1
gpgcheck=1
gpgkey=file:///var/cudnn-local-repo-rhel7-8.9.7.29/90F10142.pub
obsoletes=0
::::::::::::::
/etc/yum.repos.d/epel.repo
::::::::::::::
[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
baseurl=http://mirrors.aliyun.com/epel/7/$basearch
failovermethod=priority
enabled=1
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
 
[epel-debuginfo]
name=Extra Packages for Enterprise Linux 7 - $basearch - Debug
baseurl=http://mirrors.aliyun.com/epel/7/$basearch/debug
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=0
 
[epel-source]
name=Extra Packages for Enterprise Linux 7 - $basearch - Source
baseurl=http://mirrors.aliyun.com/epel/7/SRPMS
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=0
::::::::::::::
/etc/yum.repos.d/hashicorp.repo
::::::::::::::
[hashicorp]
name=Hashicorp Stable - $basearch
baseurl=https://rpm.releases.hashicorp.com/RHEL/$releasever/$basearch/stable
enabled=0
gpgcheck=1
gpgkey=https://rpm.releases.hashicorp.com/gpg

[hashicorp-test]
name=Hashicorp Test - $basearch
baseurl=https://rpm.releases.hashicorp.com/RHEL/$releasever/$basearch/test
enabled=0
gpgcheck=1
gpgkey=https://rpm.releases.hashicorp.com/gpg
::::::::::::::
/etc/yum.repos.d/kubernetes.repo
::::::::::::::
[kubernetes]
name=Kubernetes
baseurl=https://mirrors.aliyun.com/kubernetes-new/core/stable/v1.28/rpm/
enabled=1
gpgcheck=1
gpgkey=https://mirrors.aliyun.com/kubernetes-new/core/stable/v1.28/rpm/repodata/repomd.xml.key
::::::::::::::
/etc/yum.repos.d/MariaDB.repo
::::::::::::::
[mariadb]
name = MariaDB
baseurl = https://mirror.mariadb.org/yum/11.2/centos74-amd64
gpgkey = https://yum.mariadb.org/RPM-GPG-KEY-MariaDB
gpgcheck = 0
::::::::::::::
/etc/yum.repos.d/remi-modular.repo
::::::::::::::
# Repository: https://rpms.remirepo.net/
# Blog:       https://blog.remirepo.net/
# Forum:      https://forum.remirepo.net/

[remi-modular]
name=Remi's Modular repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/modular/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/modular/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/modular/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-modular-test]
name=Remi's Modular testing repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/modular-test/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/modular-test/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/modular-test/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

::::::::::::::
/etc/yum.repos.d/remi-php54.repo
::::::::::::::
# This repository only provides PHP 5.4 and its extensions
# NOTICE: common dependencies are in "remi-safe"

[remi-php54]
name=Remi's PHP 5.4 RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/php54/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/php54/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/php54/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

::::::::::::::
/etc/yum.repos.d/remi-php70.repo
::::::::::::::
# This repository only provides PHP 7.0 and its extensions
# NOTICE: common dependencies are in "remi-safe"

[remi-php70]
name=Remi's PHP 7.0 RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/php70/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/php70/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/php70/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php70-debuginfo]
name=Remi's PHP 7.0 RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-php70/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php70-test]
name=Remi's PHP 7.0 test RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/test70/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/test70/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/test70/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php70-test-debuginfo]
name=Remi's PHP 7.0 test RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-test70/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi
::::::::::::::
/etc/yum.repos.d/remi-php71.repo
::::::::::::::
# This repository only provides PHP 7.1 and its extensions
# NOTICE: common dependencies are in "remi-safe"

[remi-php71]
name=Remi's PHP 7.1 RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/php71/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/php71/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/php71/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php71-debuginfo]
name=Remi's PHP 7.1 RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-php71/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php71-test]
name=Remi's PHP 7.1 test RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/test71/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/test71/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/test71/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php71-test-debuginfo]
name=Remi's PHP 7.1 test RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-test71/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi
::::::::::::::
/etc/yum.repos.d/remi-php72.repo
::::::::::::::
# This repository only provides PHP 7.2 and its extensions
# NOTICE: common dependencies are in "remi-safe"

[remi-php72]
name=Remi's PHP 7.2 RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/php72/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/php72/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/php72/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php72-debuginfo]
name=Remi's PHP 7.2 RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-php72/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php72-test]
name=Remi's PHP 7.2 test RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/test72/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/test72/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/test72/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php72-test-debuginfo]
name=Remi's PHP 7.2 test RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-test72/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi
::::::::::::::
/etc/yum.repos.d/remi-php73.repo
::::::::::::::
# This repository only provides PHP 7.3 and its extensions
# NOTICE: common dependencies are in "remi-safe"

[remi-php73]
name=Remi's PHP 7.3 RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/php73/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/php73/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/php73/mirror
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php73-debuginfo]
name=Remi's PHP 7.3 RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-php73/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php73-test]
name=Remi's PHP 7.3 test RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/test73/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/test73/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/test73/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php73-test-debuginfo]
name=Remi's PHP 7.3 test RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-test73/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi
::::::::::::::
/etc/yum.repos.d/remi-php74.repo
::::::::::::::
# This repository only provides PHP 7.4 and its extensions
# NOTICE: common dependencies are in "remi-safe"

[remi-php74]
name=Remi's PHP 7.4 RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/php74/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/php74/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/php74/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php74-debuginfo]
name=Remi's PHP 7.4 RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-php74/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php74-test]
name=Remi's PHP 7.4 test RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/test74/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/test74/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/test74/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php74-test-debuginfo]
name=Remi's PHP 7.4 test RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-test74/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi
::::::::::::::
/etc/yum.repos.d/remi-php80.repo
::::::::::::::
# This repository only provides PHP 8.0 and its extensions
# NOTICE: common dependencies are in "remi-safe"

[remi-php80]
name=Remi's PHP 8.0 RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/php80/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/php80/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/php80/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php80-debuginfo]
name=Remi's PHP 8.0 RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-php80/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php80-test]
name=Remi's PHP 8.0 test RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/test80/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/test80/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/test80/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php80-test-debuginfo]
name=Remi's PHP 8.0 test RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-test80/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi
::::::::::::::
/etc/yum.repos.d/remi-php81.repo
::::::::::::::
# This repository only provides PHP 8.1 and its extensions
# NOTICE: common dependencies are in "remi-safe"

[remi-php81]
name=Remi's PHP 8.1 RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/php81/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/php81/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/php81/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php81-debuginfo]
name=Remi's PHP 8.1 RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-php81/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php81-test]
name=Remi's PHP 8.1 test RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/test81/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/test81/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/test81/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php81-test-debuginfo]
name=Remi's PHP 8.1 test RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-test81/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi
::::::::::::::
/etc/yum.repos.d/remi-php82.repo
::::::::::::::
# This repository only provides PHP 8.2 and its extensions
# NOTICE: common dependencies are in "remi-safe"

[remi-php82]
name=Remi's PHP 8.2 RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/php82/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/php82/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/php82/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php82-debuginfo]
name=Remi's PHP 8.2 RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-php82/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php82-test]
name=Remi's PHP 8.2 test RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/test82/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/test82/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/test82/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php82-test-debuginfo]
name=Remi's PHP 8.2 test RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-test82/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi
::::::::::::::
/etc/yum.repos.d/remi.repo
::::::::::::::
# Repository: http://rpms.remirepo.net/
# Blog:       http://blog.remirepo.net/
# Forum:      http://forum.remirepo.net/

[remi]
name=Remi's RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/remi/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/remi/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/remi/mirror
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php55]
name=Remi's PHP 5.5 RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/php55/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/php55/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/php55/mirror
# NOTICE: common dependencies are in "remi-safe"
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php56]
name=Remi's PHP 5.6 RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/php56/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/php56/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/php56/mirror
# NOTICE: common dependencies are in "remi-safe"
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-test]
name=Remi's test RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/test/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/test/mirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/test/mirror
# WARNING: If you enable this repository, you must also enable "remi"
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-debuginfo]
name=Remi's RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-remi/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php55-debuginfo]
name=Remi's PHP 5.5 RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-php55/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-php56-debuginfo]
name=Remi's PHP 5.6 RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-php56/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-test-debuginfo]
name=Remi's test RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-test/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

::::::::::::::
/etc/yum.repos.d/remi-safe.repo
::::::::::::::
# This repository is safe to use with RHEL/CentOS base repository
# it only provides additional packages for the PHP stack
# all dependencies are in base repository or in EPEL

[remi-safe]
name=Safe Remi's RPM repository for Enterprise Linux 7 - $basearch
#baseurl=http://rpms.remirepo.net/enterprise/7/safe/$basearch/
#mirrorlist=https://rpms.remirepo.net/enterprise/7/safe/httpsmirror
mirrorlist=http://cdn.remirepo.net/enterprise/7/safe/mirror
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi

[remi-safe-debuginfo]
name=Remi's RPM repository for Enterprise Linux 7 - $basearch - debuginfo
baseurl=http://rpms.remirepo.net/enterprise/7/debug-remi/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi
[root@server3 ~]# 

2.2.4 Install Python 3

Prepare a working directory
[root@server3 lichao]# mkdir AIGC
[root@server3 lichao]# cd AIGC/

Install Python 3

Install the build toolchain and dependencies
[root@server3 AIGC]# yum install wget gcc openssl-devel bzip2-devel libffi-devel
[root@server3 AIGC]# yum install openssl11 openssl11-devel openssl-devel
Extract the source package
[root@server3 AIGC]# tar xvf Python-3.11.9.tar.xz 
[root@server3 AIGC]# cd Python-3.11.9
[root@server3 Python-3.11.9]# 
Set environment variables
[root@server3 Python-3.11.9]# export CFLAGS=$(pkg-config --cflags openssl11)
[root@server3 Python-3.11.9]# export LDFLAGS=$(pkg-config --libs openssl11)
进行编译安装
[root@server3 Python-3.11.9]# mkdir -p /home/lichao/opt/python3.11.9
[root@server3 Python-3.11.9]# ./configure --prefix=/home/lichao/opt/python3.11.9
[root@server3 Python-3.11.9]# make && make install
创建软链接,用于全局访问
[root@server3 Python-3.11.9]# cd /home/lichao/opt/python3.11.9/
[root@server3 python3.11.9]# ln -s /home/lichao/opt/python3.11.9/bin/python3 /usr/bin/python3
[root@server3 python3.11.9]# ln -s /home/lichao/opt/python3.11.9/bin/pip3 /usr/bin/pip3
[root@server3 python3.11.9]# ll /usr/bin/python3 
lrwxrwxrwx 1 root root 41 5月  16 08:32 /usr/bin/python3 -> /home/lichao/opt/python3.11.9/bin/python3
[root@server3 python3.11.9]# ll /usr/bin/pip3
lrwxrwxrwx 1 root root 38 5月  16 08:32 /usr/bin/pip3 -> /home/lichao/opt/python3.11.9/bin/pip3
验证测试
[root@server3 python3.11.9]# python3
Python 3.11.9 (main, May 16 2024, 08:23:00) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
[root@server3 python3.11.9]# 
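编译安装完成后,除交互式进入解释器外,也可以用一小段Python(示意)确认ssl模块已正确链接openssl11——pip通过HTTPS访问源时依赖它:

```python
import ssl
import sys

# 若 configure/make 阶段未找到 openssl11 的头文件与库,这里 import ssl 会直接报错
print("Python:", sys.version.split()[0])
print("OpenSSL:", ssl.OPENSSL_VERSION)
```

若输出了OpenSSL版本号,说明后续pip安装依赖包不会因SSL模块缺失而失败。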

2.2.5 安装MLNX网卡驱动

下文以CentOS7为例,详细介绍了Mellanox网卡MLNX_OFED的驱动安装和固件升级方法。

本次下载的驱动版本为:MLNX_OFED_LINUX-4.7-3.2.9.0-rhel7.6-x86_64.tgz。

把下载好的Mellanox驱动解压缩
[root@server3 ~]# tar -zxvf MLNX_OFED_LINUX-4.7-3.2.9.0-rhel7.6-x86_64.tgz
[root@server3 ~]# cd MLNX_OFED_LINUX-4.7-3.2.9.0-rhel7.6-x86_64
查看当前系统的内核版本
[root@server3 MLNX_OFED_LINUX-4.7-3.2.9.0-rhel7.6-x86_64]# uname -r
3.10.0-957.el7.x86_64
查看当前驱动所支持的内核版本
[root@server3 MLNX_OFED_LINUX-4.7-3.2.9.0-rhel7.6-x86_64]# cat .supported_kernels 
3.10.0-957.el7.x86_64 
注:由以上可知下载的默认驱动支持当前的内核版本
如果当前内核与驱动所支持的内核版本不匹配,则需要手动编译适配当前内核的驱动。在编译之前,首先安装gcc编译环境和kernel开发包
[root@server3 MLNX_OFED_LINUX-4.7-3.2.9.0-rhel7.6-x86_64]# yum install gcc gcc-c++ libstdc++-devel kernel-devel
添加针对当前内核版本的驱动
[root@server3 MLNX_OFED_LINUX-4.7-3.2.9.0-rhel7.6-x86_64]#./mlnx_add_kernel_support.sh -m /root/MLNX_OFED_LINUX-4.7-3.2.9.0-rhel7.6-x86_64  -v
注:完成后生成的驱动文件在/tmp目录下
[root@server3 MLNX_OFED_LINUX-4.7-3.2.9.0-rhel7.6-x86_64]# ls -l /tmp/MLNX_OFED_LINUX-4.7-3.2.9.0-rhel7.6-x86_64-ext.tgz
-rw-r--r-- 1 root root 282193833 Dec 23 09:49 /tmp/MLNX_OFED_LINUX-4.7-3.2.9.0-rhel7.6-x86_64-ext.tgz
安装驱动
[root@server3 tmp]# tar xzvf MLNX_OFED_LINUX-4.7-3.2.9.0-rhel7.6-x86_64-ext.tgz
[root@server3 tmp]# cd MLNX_OFED_LINUX-4.7-3.2.9.0-rhel7.6-x86_64-ext
[root@server3 MLNX_OFED_LINUX-4.7-3.2.9.0-rhel7.6-x86_64-ext]# ./mlnxofedinstall
最后启动openibd服务
[root@server3 ~]#/etc/init.d/openibd start
[root@server3 ~]#chkconfig openibd on

2.3 安装GPU驱动和集合通讯库

2.3.1 安装配置

2.3.1.1 安装GPU驱动和CUDA、CUDNN

安装开始前,请根据自己的GPU型号、操作系统版本去英伟达官网下载相对应的软件包。

[root@server3 AIGC]# ll
总用量 1733448
-rw-r--r--  1 root root 1430373861 5月  16 08:55 cudnn-local-repo-rhel7-8.9.7.29-1.0-1.x86_64.rpm
drwxr-xr-x  7 root root        141 5月  17 13:45 nccl-tests
-rwxr-xr-x  1 root root  306736632 5月  16 08:43 NVIDIA-Linux-x86_64-550.67.run
drwxrwxr-x 10 1000 1000       4096 5月  17 13:21 openmpi-4.1.6
-rw-r--r--  1 root root   17751702 9月  30 2023 openmpi-4.1.6.tar.gz
drwxr-xr-x 17 root root       4096 5月  16 08:23 Python-3.11.9
-rw-r--r--  1 root root   20175816 4月   2 13:11 Python-3.11.9.tar.xz
[root@server3 AIGC]# ./NVIDIA-Linux-x86_64-550.67.run
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.67...................
[root@server3 AIGC]# yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
已加载插件:fastestmirror, nvidia
adding repo from: https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
grabbing file https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo to /etc/yum.repos.d/cuda-rhel7.repo
repo saved to /etc/yum.repos.d/cuda-rhel7.repo
[root@server3 AIGC]# yum install libnccl-2.21.5-1+cuda12.4 libnccl-devel-2.21.5-1+cuda12.4 libnccl-static-2.21.5-1+cuda12.4
[root@server3 AIGC]# yum install cudnn-local-repo-rhel7-8.9.7.29-1.0-1.x86_64.rpm
#后续编译OpenMPI和NCCL-Test时会用到/usr/local/cuda-12.4,如尚未安装完整CUDA Toolkit,可从上面添加的仓库安装(包名以仓库实际提供为准):
[root@server3 AIGC]# yum install cuda-toolkit-12-4

安装完成后,可以通过nvidia-smi查看驱动和CUDA版本。如果驱动安装有问题或版本不匹配,执行该命令会报错。

[root@server3 AIGC]# nvidia-smi 
Mon Jun  3 11:59:36 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:02:00.0 Off |                  N/A |
|  0%   34C    P0             27W /  165W |       1MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
[root@server3 AIGC]# 

2.3.1.2 编译安装OpenMPI

[root@server3 AIGC]# tar xvf openmpi-4.1.6.tar.gz 
[root@server3 AIGC]# cd openmpi-4.1.6
[root@server3 openmpi-4.1.6]# mkdir -p /home/lichao/opt/openmpi
[root@server3 openmpi-4.1.6]# ./configure --prefix=/home/lichao/opt/openmpi --with-cuda=/usr/local/cuda-12.4 --with-nccl=/usr/lib64

Open MPI configuration:
-----------------------
Version: 4.1.6
Build MPI C bindings: yes
Build MPI C++ bindings (deprecated): no
Build MPI Fortran bindings: mpif.h, use mpi
MPI Build Java bindings (experimental): no
Build Open SHMEM support: yes
Debug build: no
Platform file: (none)

Miscellaneous
-----------------------
CUDA support: yes
HWLOC support: internal
Libevent support: internal
Open UCC: no
PMIx support: Internal
 
Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel TrueScale (PSM): no
Mellanox MXM: no
Open UCX: yes
OpenFabrics OFI Libfabric: no
OpenFabrics Verbs: yes
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes
 
Resource Managers
-----------------------
Cray Alps: no
Grid Engine: no
LSF: no
Moab: no
Slurm: yes
ssh/rsh: yes
Torque: no
 
OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no
Lustre: no
PVFS2/OrangeFS: no
 
[root@server3 openmpi-4.1.6]# make -j $(nproc) && make install

2.3.1.3 编译安装NCCL-Test

[root@server3 lichao]# cd AIGC/
[root@server3 AIGC]# git clone https://github.com/NVIDIA/nccl-tests.git
[root@server3 AIGC]# cd nccl-tests/
[root@server3 nccl-tests]# make clean
[root@server3 nccl-tests]# make MPI=1 MPI_HOME=/home/lichao/opt/openmpi/ CUDA_HOME=/usr/local/cuda-12.4/ NCCL_HOME=/usr/lib64/

2.3.2 集合通信性能测试方法(all_reduce)

[root@server1 lichao]# cat run_nccl-test.sh 
/home/lichao/opt/openmpi/bin/mpirun --allow-run-as-root \
-np 3 \
-host "server1,server2,server3" \
-mca btl ^openib \
-x NCCL_DEBUG=INFO \
-x NCCL_ALGO=ring \
-x NCCL_IB_DISABLE=0 \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_SOCKET_IFNAME=ens11f1 \
-x NCCL_IB_HCA=mlx5_1:1 \
/home/lichao/AIGC/nccl-tests/build/all_reduce_perf -b 128 -e 8G -f 2 -g 1
[root@server1 lichao]# ./run_nccl-test.sh 
# nThread 1 nGpus 1 minBytes 128 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  18697 on    server1 device  0 [0x02] NVIDIA GeForce RTX 4060 Ti
#  Rank  1 Group  0 Pid  20893 on    server2 device  0 [0x02] NVIDIA GeForce RTX 4060 Ti
#  Rank  2 Group  0 Pid   2458 on    server3 device  0 [0x02] NVIDIA GeForce RTX 4060 Ti
#
# Reducing maxBytes to 5261099008 due to memory limitation
server1:18697:18697 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens11f1
server1:18697:18697 [0] NCCL INFO Bootstrap : Using ens11f1:172.16.0.11<0>
server1:18697:18697 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
server1:18697:18697 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
server1:18697:18697 [0] NCCL INFO NET/Plugin: Using internal network plugin.
server2:20893:20893 [0] NCCL INFO cudaDriverVersion 12040
server2:20893:20893 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens11f1
server2:20893:20893 [0] NCCL INFO Bootstrap : Using ens11f1:172.16.0.12<0>
server2:20893:20893 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
server2:20893:20893 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
server2:20893:20893 [0] NCCL INFO NET/Plugin: Using internal network plugin.
server1:18697:18697 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
server3:2458:2458 [0] NCCL INFO cudaDriverVersion 12040
server3:2458:2458 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens11f1
server3:2458:2458 [0] NCCL INFO Bootstrap : Using ens11f1:172.16.0.13<0>
server3:2458:2458 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
server3:2458:2458 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
server3:2458:2458 [0] NCCL INFO NET/Plugin: Using internal network plugin.
server2:20893:20907 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
server2:20893:20907 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens11f1
server2:20893:20907 [0] NCCL INFO NCCL_IB_HCA set to mlx5_1:1
server2:20893:20907 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [RO]; OOB ens11f1:172.16.0.12<0>
server2:20893:20907 [0] NCCL INFO Using non-device net plugin version 0
server2:20893:20907 [0] NCCL INFO Using network IB
server3:2458:2473 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
server3:2458:2473 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens11f1
server3:2458:2473 [0] NCCL INFO NCCL_IB_HCA set to mlx5_1:1
server1:18697:18712 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
server1:18697:18712 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens11f1
server3:2458:2473 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [RO]; OOB ens11f1:172.16.0.13<0>
server1:18697:18712 [0] NCCL INFO NCCL_IB_HCA set to mlx5_1:1
server3:2458:2473 [0] NCCL INFO Using non-device net plugin version 0
server3:2458:2473 [0] NCCL INFO Using network IB
server1:18697:18712 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [RO]; OOB ens11f1:172.16.0.11<0>
server1:18697:18712 [0] NCCL INFO Using non-device net plugin version 0
server1:18697:18712 [0] NCCL INFO Using network IB
server1:18697:18712 [0] NCCL INFO ncclCommInitRank comm 0x23622c0 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 2000 commId 0x35491327c8228dd0 - Init START
server3:2458:2473 [0] NCCL INFO ncclCommInitRank comm 0x346ffc0 rank 2 nranks 3 cudaDev 0 nvmlDev 0 busId 2000 commId 0x35491327c8228dd0 - Init START
server2:20893:20907 [0] NCCL INFO ncclCommInitRank comm 0x2a1af20 rank 1 nranks 3 cudaDev 0 nvmlDev 0 busId 2000 commId 0x35491327c8228dd0 - Init START
server3:2458:2473 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
server2:20893:20907 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
server1:18697:18712 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
server1:18697:18712 [0] NCCL INFO comm 0x23622c0 rank 0 nRanks 3 nNodes 3 localRanks 1 localRank 0 MNNVL 0
server1:18697:18712 [0] NCCL INFO Channel 00/02 :    0   1   2
server1:18697:18712 [0] NCCL INFO Channel 01/02 :    0   1   2
server1:18697:18712 [0] NCCL INFO Trees [0] 2/-1/-1->0->-1 [1] 2/-1/-1->0->1
server1:18697:18712 [0] NCCL INFO P2P Chunksize set to 131072
server3:2458:2473 [0] NCCL INFO comm 0x346ffc0 rank 2 nRanks 3 nNodes 3 localRanks 1 localRank 0 MNNVL 0
server2:20893:20907 [0] NCCL INFO comm 0x2a1af20 rank 1 nRanks 3 nNodes 3 localRanks 1 localRank 0 MNNVL 0
server3:2458:2473 [0] NCCL INFO Trees [0] 1/-1/-1->2->0 [1] -1/-1/-1->2->0
server3:2458:2473 [0] NCCL INFO P2P Chunksize set to 131072
server2:20893:20907 [0] NCCL INFO Trees [0] -1/-1/-1->1->2 [1] 0/-1/-1->1->-1
server2:20893:20907 [0] NCCL INFO P2P Chunksize set to 131072
server3:2458:2473 [0] NCCL INFO Channel 00/0 : 1[0] -> 2[0] [receive] via NET/IB/0
server3:2458:2473 [0] NCCL INFO Channel 01/0 : 1[0] -> 2[0] [receive] via NET/IB/0
server3:2458:2473 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/IB/0
server3:2458:2473 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/IB/0
server2:20893:20907 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IB/0
server2:20893:20907 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/IB/0
server2:20893:20907 [0] NCCL INFO Channel 00/0 : 1[0] -> 2[0] [send] via NET/IB/0
server2:20893:20907 [0] NCCL INFO Channel 01/0 : 1[0] -> 2[0] [send] via NET/IB/0
server1:18697:18712 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/IB/0
server1:18697:18712 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/IB/0
server1:18697:18712 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IB/0
server1:18697:18712 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IB/0
server3:2458:2475 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
server1:18697:18714 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
server2:20893:20909 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
server1:18697:18712 [0] NCCL INFO Connected all rings
server1:18697:18712 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IB/0
server3:2458:2473 [0] NCCL INFO Connected all rings
server2:20893:20907 [0] NCCL INFO Connected all rings
server1:18697:18712 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/IB/0
server2:20893:20907 [0] NCCL INFO Channel 00/0 : 2[0] -> 1[0] [receive] via NET/IB/0
server1:18697:18712 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/IB/0
server3:2458:2473 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/IB/0
server2:20893:20907 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/IB/0
server3:2458:2473 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/IB/0
server3:2458:2473 [0] NCCL INFO Channel 00/0 : 2[0] -> 1[0] [send] via NET/IB/0
server3:2458:2473 [0] NCCL INFO Connected all trees
server1:18697:18712 [0] NCCL INFO Connected all trees
server1:18697:18712 [0] NCCL INFO NCCL_ALGO set by environment to ring
server3:2458:2473 [0] NCCL INFO NCCL_ALGO set by environment to ring
server3:2458:2473 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
server3:2458:2473 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
server2:20893:20907 [0] NCCL INFO Connected all trees
server2:20893:20907 [0] NCCL INFO NCCL_ALGO set by environment to ring
server2:20893:20907 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
server2:20893:20907 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
server1:18697:18712 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
server1:18697:18712 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
server2:20893:20907 [0] NCCL INFO TUNER/Plugin: Plugin load returned 11 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
server2:20893:20907 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
server2:20893:20907 [0] NCCL INFO ncclCommInitRank comm 0x2a1af20 rank 1 nranks 3 cudaDev 0 nvmlDev 0 busId 2000 commId 0x35491327c8228dd0 - Init COMPLETE
server3:2458:2473 [0] NCCL INFO TUNER/Plugin: Plugin load returned 11 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
server3:2458:2473 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
server3:2458:2473 [0] NCCL INFO ncclCommInitRank comm 0x346ffc0 rank 2 nranks 3 cudaDev 0 nvmlDev 0 busId 2000 commId 0x35491327c8228dd0 - Init COMPLETE
server1:18697:18712 [0] NCCL INFO TUNER/Plugin: Plugin load returned 11 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
server1:18697:18712 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
server1:18697:18712 [0] NCCL INFO ncclCommInitRank comm 0x23622c0 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 2000 commId 0x35491327c8228dd0 - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
         128            32     float     sum      -1    28.39    0.00    0.01      0    27.35    0.00    0.01      0
         256            64     float     sum      -1    29.44    0.01    0.01      0    28.54    0.01    0.01      0
         512           128     float     sum      -1    29.99    0.02    0.02      0    29.66    0.02    0.02      0
        1024           256     float     sum      -1    32.89    0.03    0.04      0    30.64    0.03    0.04      0
        2048           512     float     sum      -1    34.81    0.06    0.08      0    31.87    0.06    0.09      0
        4096          1024     float     sum      -1    37.32    0.11    0.15      0    36.09    0.11    0.15      0
        8192          2048     float     sum      -1    45.11    0.18    0.24      0    43.12    0.19    0.25      0
       16384          4096     float     sum      -1    57.92    0.28    0.38      0    56.98    0.29    0.38      0
       32768          8192     float     sum      -1    72.68    0.45    0.60      0    70.79    0.46    0.62      0
       65536         16384     float     sum      -1    95.77    0.68    0.91      0    93.73    0.70    0.93      0
      131072         32768     float     sum      -1    162.7    0.81    1.07      0    161.5    0.81    1.08      0
      262144         65536     float     sum      -1    177.3    1.48    1.97      0    177.4    1.48    1.97      0
      524288        131072     float     sum      -1    301.4    1.74    2.32      0    302.0    1.74    2.31      0
     1048576        262144     float     sum      -1    557.9    1.88    2.51      0    559.2    1.88    2.50      0
     2097152        524288     float     sum      -1   1089.8    1.92    2.57      0   1092.2    1.92    2.56      0
     4194304       1048576     float     sum      -1   2165.7    1.94    2.58      0   2166.6    1.94    2.58      0
     8388608       2097152     float     sum      -1   4315.7    1.94    2.59      0   4316.1    1.94    2.59      0
    16777216       4194304     float     sum      -1   8528.8    1.97    2.62      0   8529.3    1.97    2.62      0
    33554432       8388608     float     sum      -1    16622    2.02    2.69      0    16610    2.02    2.69      0
    67108864      16777216     float     sum      -1    32602    2.06    2.74      0    32542    2.06    2.75      0
   134217728      33554432     float     sum      -1    63946    2.10    2.80      0    63831    2.10    2.80      0
   268435456      67108864     float     sum      -1   126529    2.12    2.83      0   126412    2.12    2.83      0
   536870912     134217728     float     sum      -1   251599    2.13    2.85      0   251327    2.14    2.85      0
  1073741824     268435456     float     sum      -1   500664    2.14    2.86      0   501911    2.14    2.85      0
  2147483648     536870912     float     sum      -1  1001415    2.14    2.86      0  1000178    2.15    2.86      0
  4294967296    1073741824     float     sum      -1  1999361    2.15    2.86      0  1997380    2.15    2.87      0
server1:18697:18697 [0] NCCL INFO comm 0x23622c0 rank 0 nranks 3 cudaDev 0 busId 2000 - Destroy COMPLETE
server2:20893:20893 [0] NCCL INFO comm 0x2a1af20 rank 1 nranks 3 cudaDev 0 busId 2000 - Destroy COMPLETE
server3:2458:2458 [0] NCCL INFO comm 0x346ffc0 rank 2 nranks 3 cudaDev 0 busId 2000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.66163 
#

结果详解:

– size (B):操作处理的数据的大小,以字节为单位;

– count (elements):操作处理的元素的数量;

– type:元素的数据类型;

– redop:使用的归约操作;

– root:对于某些操作(如 reduce 和 broadcast),这列指定了根节点的编号,值是 -1 表示这个操作没有根节点(all-reduce 操作涉及到所有的节点);

– time (us):操作的执行时间,以微秒为单位;

– algbw (GB/s):算法带宽,以每秒吉字节(GB/s)为单位;

– busbw (GB/s):总线带宽,以每秒吉字节(GB/s)为单位;

– wrong:错误的数量,如果这个值不是 0,那可能表示有一些错误发生。

在这个例子中,你可以看到,当处理的数据量增大时,算法带宽和总线带宽都有所提高,这可能表示 NCCL 能够有效地利用大量的数据。

查看结果时,需要关注如下几点

1. 数据量增加时,带宽是否会下降(下降明显不符合预期);

2. 更应关注带宽的峰值;每次测得的带宽峰值,可以只看 in-place 或 out-of-place 其中一列;

3. 平均值,在数据量递增的情况下,可能无法体现最终的结果;

4. 请确保数据量足够大,可以压到带宽上限(通过调整 b、e 或者 n 选项)。
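上表中 algbw 与 busbw 的换算关系为(参见 NCCL-Tests 文档):algbw = size / time;对 all_reduce,busbw = algbw × 2(n−1)/n,其中 n 为参与的 rank 数。下面的 Python 片段(示意)用表中 out-of-place 最后一行的数据验证该换算:

```python
def allreduce_bw(size_bytes, time_us, nranks):
    """按 NCCL-Tests 的定义计算 all_reduce 的算法带宽与总线带宽(GB/s,GB=1e9 字节)。"""
    algbw = size_bytes / (time_us * 1e-6) / 1e9   # algbw = size / time
    busbw = algbw * 2 * (nranks - 1) / nranks     # all_reduce 的总线带宽换算因子 2(n-1)/n
    return algbw, busbw

# 取上表 out-of-place 最后一行:size=4294967296 B,time=1999361 us,共 3 个 rank
algbw, busbw = allreduce_bw(4294967296, 1999361, 3)
print(f"algbw={algbw:.2f} GB/s, busbw={busbw:.2f} GB/s")  # 与表中 2.15 / 2.86 一致
```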

2.3.3 常用参数及解释

– GPU 数量

  – -t,--nthreads <num threads> 每个进程的线程数量,默认:1;

  – -g,--ngpus <GPUs per thread> 每个线程的 GPU 数量,默认:1;

– 数据大小配置

  – -b,--minbytes <min size in bytes> 开始的最小数据量,默认:32M;

  – -e,--maxbytes <max size in bytes> 结束的最大数据量,默认:32M;

– 数据步长设置

  – -i,--stepbytes <increment size> 每次增加的数据量,默认:1M;

  – -f,--stepfactor <increment factor> 每次增加的倍数,默认禁用;

– NCCL 操作相关配置

  – -o,--op <sum/prod/min/max/avg/all> 指定归约(reduce)操作类型,仅适用于 Allreduce、Reduce、ReduceScatter 等归约操作,默认:求和(Sum);

  – -d,--datatype <nccltype/all> 指定使用哪种数据类型,默认:Float;

– 性能相关配置

  – -n,--iters <iteration count> 每次操作(一次发送)循环多少次,默认:20;

  – -w,--warmup_iters <warmup iteration count> 预热迭代次数(不计时),默认:5;

  – -m,--agg_iters <aggregation count> 每次迭代中聚合在一起的操作数,默认:1;

  – -a,--average <0/1/2/3> 在所有 ranks 上计算均值作为最终结果(仅 MPI=1 时有效),<0=Rank0,1=Avg,2=Min,3=Max>,默认:1;

– 测试相关配置

  – -p,--parallel_init <0/1> 使用线程并行初始化 NCCL,默认:0;

  – -c,--check <0/1> 检查结果的正确性(在大量 GPU 上可能会非常慢),默认:1;

  – -z,--blocking <0/1> 使 NCCL 集合操作阻塞,即在每个集合操作之后让 CPU 等待和同步,默认:0;

  – -G,--cudagraph <num graph launches> 将迭代作为 CUDA Graph 捕获,然后重复指定的次数,默认:0;
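以本文实际使用的 `-b 128 -e 8G -f 2` 为例,下面的 Python 片段(示意)还原按 -f 倍数从 -b 递增到 -e 生成的数据量序列;本次测试中因显存限制,maxBytes 被实际降到了 5261099008:

```python
def size_sweep(min_bytes, max_bytes, factor):
    """模拟 nccl-tests 按 -f 倍数从 -b 递增到 -e 生成的数据量序列。"""
    sizes = []
    s = min_bytes
    while s <= max_bytes:
        sizes.append(s)
        s *= factor
    return sizes

# -b 128 -f 2,maxBytes 因显存限制降为 5261099008
sizes = size_sweep(128, 5261099008, 2)
print(len(sizes), sizes[0], sizes[-1])  # 26 行,对应上表 128 B 到 4294967296 B
```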

3 实验测试

完成硬件、软件的选型和配置后,下一步将进行实践测试。

3.1.1 获取LLaMA-Factory源码包

因为网络问题,很难直接通过git clone命令拉取源码,建议通过打包下载后自行上传的方式获取:

noone@MacBook-Air Downloads % scp LLaMA-Factory-0.8.3.zip root@10.230.1.13:/tmp

[root@server3 AIGC]# pwd
/home/lichao/AIGC
[root@server3 AIGC]# cp /tmp/LLaMA-Factory-0.8.3.zip ./
[root@server3 AIGC]# unzip LLaMA-Factory-0.8.3.zip
[root@server3 AIGC]# cd LLaMA-Factory-0.8.3
[root@server3 LLaMA-Factory-0.8.3]# ll
总用量 128
drwxr-xr-x  2 root root    83 9月  13 05:04 assets
drwxr-xr-x  2 root root   122 9月   6 08:26 cache
-rw-r--r--  1 root root  1378 7月  18 19:36 CITATION.cff
drwxr-xr-x  6 root root  4096 9月  13 05:03 data
drwxr-xr-x  4 root root    43 7月  18 19:36 docker
drwxr-xr-x  5 root root    44 7月  18 19:36 evaluation
drwxr-xr-x 10 root root   182 7月  18 19:36 examples
-rw-r--r--  1 root root 11324 7月  18 19:36 LICENSE
-rw-r--r--  1 root root   242 7月  18 19:36 Makefile
-rw-r--r--  1 root root    33 7月  18 19:36 MANIFEST.in
-rw-r--r--  1 root root   645 7月  18 19:36 pyproject.toml
-rw-r--r--  1 root root 44424 7月  18 19:36 README.md
-rw-r--r--  1 root root 44093 7月  18 19:36 README_zh.md
-rw-r--r--  1 root root   245 7月  18 19:36 requirements.txt
drwxr-xr-x  3 root root    16 9月   6 18:48 saves
drwxr-xr-x  2 root root   219 7月  18 19:36 scripts
-rw-r--r--  1 root root  3361 7月  18 19:36 setup.py
drwxr-xr-x  4 root root   101 9月   6 08:22 src
drwxr-xr-x  5 root root    43 7月  18 19:36 tests
[root@server3 LLaMA-Factory-0.8.3]# 

3.1.2 安装LLaMA-Factory,并进行验证

[root@server3 LLaMA-Factory-0.8.3]# pip install -e ".[torch,metrics]"
[root@server3 LLaMA-Factory-0.8.3]# llamafactory-cli version
[2024-09-23 08:51:28,722] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
----------------------------------------------------------
| Welcome to LLaMA Factory, version 0.8.3                |
|                                                        |
| Project page: https://github.com/hiyouga/LLaMA-Factory |
----------------------------------------------------------
[root@server3 LLaMA-Factory-0.8.3]# 

3.1.3 下载训练时所需的预训练模型和数据集

根据当前GPU服务器所配置的GPU硬件规格,选择适合的训练方法、模型和数据集。

GPU型号:NVIDIA GeForce RTX 4060 Ti 16GB

预训练模型:Qwen/Qwen1.5-0.5B-Chat

数据集:identity、alpaca_zh_demo

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://hf-mirror.com/Qwen/Qwen1.5-0.5B-Chat
# If you want to clone without large files - just their pointers
GIT_LFS_SKIP_SMUDGE=1 git clone https://hf-mirror.com/Qwen/Qwen1.5-0.5B-Chat

因为网络问题,通过命令行很难直接下载。这里使用huggingface的国内镜像站hf-mirror.com拉取预训练模型,并使用“GIT_LFS_SKIP_SMUDGE=1”变量跳过大文件,随后将大文件手工下载后再上传。

如果觉得麻烦,也可以安装使用huggingface的命令行工具,下载预训练模型和数据集。同样地,安装完成后,需要配置一些环境变量(使用镜像站hf-mirror.com)来解决网络问题。
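如采用huggingface命令行/库的方式,一种常见做法(示意,API与环境变量以huggingface_hub官方文档为准)是通过HF_ENDPOINT指向镜像站,再用snapshot_download拉取模型:

```python
import os

# huggingface_hub 在导入时读取 HF_ENDPOINT,需在 import 之前设置为镜像站地址
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

# 假设已通过 pip 安装 huggingface_hub;未安装时下面的 import 会失败
try:
    from huggingface_hub import snapshot_download
    snapshot_download("Qwen/Qwen1.5-0.5B-Chat", local_dir="./Qwen1.5-0.5B-Chat")
except ImportError:
    print("huggingface_hub 未安装,可先执行:pip3 install -U huggingface_hub")
except Exception as exc:  # 网络不可达等情况
    print(f"下载失败:{exc}")
```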

下载预训练模型:
[root@server3 AIGC]# mkdir models
[root@server3 AIGC]# cd models/
[root@server3 models]# GIT_LFS_SKIP_SMUDGE=1 git clone https://hf-mirror.com/Qwen/Qwen1.5-0.5B-Chat
[root@server3 models]# tree -h Qwen1.5-0.5B-Chat/
Qwen1.5-0.5B-Chat/
├── [ 656]  config.json
├── [ 661]  config.json.raw
├── [ 206]  generation_config.json
├── [7.1K]  LICENSE
├── [1.6M]  merges.txt
├── [1.2G]  model.safetensors
├── [4.2K]  README.md
├── [1.3K]  tokenizer_config.json
├── [6.7M]  tokenizer.json
└── [2.6M]  vocab.json

0 directories, 10 files
[root@server3 models]# 

下载数据集:默认情况下,LLaMA-Factory项目文件下的data目录,自带了一些本地数据集可直接使用。
[root@server3 LLaMA-Factory-0.8.3]# tree -h data/
data/
├── [841K]  alpaca_en_demo.json
├── [621K]  alpaca_zh_demo.json
├── [  32]  belle_multiturn
│   └── [2.7K]  belle_multiturn.py
├── [733K]  c4_demo.json
├── [ 13K]  dataset_info.json
├── [1.5M]  dpo_en_demo.json
├── [833K]  dpo_zh_demo.json
├── [722K]  glaive_toolcall_en_demo.json
├── [665K]  glaive_toolcall_zh_demo.json
├── [  27]  hh_rlhf_en
│   └── [3.3K]  hh_rlhf_en.py
├── [ 20K]  identity.json
├── [892K]  kto_en_demo.json
├── [  45]  mllm_demo_data
│   ├── [ 12K]  1.jpg
│   ├── [ 22K]  2.jpg
│   └── [ 16K]  3.jpg
├── [3.1K]  mllm_demo.json
├── [9.8K]  README.md
├── [9.2K]  README_zh.md
├── [  27]  ultra_chat
│   └── [2.3K]  ultra_chat.py
└── [1004K]  wiki_demo.txt

4 directories, 20 files
[root@server3 LLaMA-Factory-0.8.3]# 

3.1.4 使用准备好的模型与数据集,在单机上进行训练测试

LLaMA-Factory支持通过WebUI微调大语言模型。在完成安装后,我们可以使用WebUI进行快速调测验证,没问题后可使用命令行工具进行多机分布式训练。

[root@server3 LLaMA-Factory-0.8.3]# llamafactory-cli webui
[2024-09-23 17:54:45,786] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Running on local URL:  http://0.0.0.0:7861

To create a public link, set `share=True` in `launch()`.

3.1.5 使用命令行运行多机分布式训练任务

1. 准备目录
[root@server3 LLaMA-Factory-0.8.3]# mkdir asterun
[root@server3 LLaMA-Factory-0.8.3]# mkdir -p asterun/saves/qwen/full/sft
2. 根据集群环境和训练任务,准备分布式训练的配置文件
[root@server3 LLaMA-Factory-0.8.3]# cat asterun/qwen_full_sft_ds2.yaml 
### model
model_name_or_path: /home/lichao/AIGC/models/Qwen1.5-0.5B-Chat

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json

### dataset
dataset: identity,alpaca_zh_demo
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: asterun/saves/qwen/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

report_to: tensorboard
logging_dir: asterun/saves/qwen/full/sft/runs


### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
[root@server3 LLaMA-Factory-0.8.3]# 
3. 用同样的方式,在Server1和Server2上准备运行环境
步骤略。
4. 依次在集群中的3个GPU节点上启动分布式训练任务
主节点rank0:
[root@server3 LLaMA-Factory-0.8.3]# FORCE_TORCHRUN=1 NNODES=3 RANK=0 MASTER_ADDR=172.16.0.13 MASTER_PORT=29500 llamafactory-cli train asterun/qwen_full_sft_ds2.yaml
从节点rank1:
[root@server2 LLaMA-Factory-0.8.3]# FORCE_TORCHRUN=1 NNODES=3 RANK=1 MASTER_ADDR=172.16.0.13 MASTER_PORT=29500 llamafactory-cli train asterun/qwen_full_sft_ds2.yaml
从节点rank2:
[root@server1 LLaMA-Factory-0.8.3]# FORCE_TORCHRUN=1 NNODES=3 RANK=2 MASTER_ADDR=172.16.0.13 MASTER_PORT=29500 llamafactory-cli train asterun/qwen_full_sft_ds2.yaml
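数据并行训练的全局有效批大小可按 per_device_train_batch_size × gradient_accumulation_steps × 总GPU数估算。按上面的配置文件与3节点(每节点1块GPU)计算(示意):

```python
def effective_batch_size(per_device_bs, grad_accum, world_size):
    """估算数据并行训练的全局有效批大小。"""
    return per_device_bs * grad_accum * world_size

# 本文配置:per_device_train_batch_size=1,gradient_accumulation_steps=2,
# 3 台服务器各 1 块 GPU,world_size=3
print(effective_batch_size(1, 2, 3))  # 6
```

调大 gradient_accumulation_steps 可以在不增加显存占用的前提下提高有效批大小。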

附件-分布式训练全流程的终端打印日志:

[root@server3 LLaMA-Factory-0.8.3]# FORCE_TORCHRUN=1 NNODES=3 RANK=0 MASTER_ADDR=172.16.0.13 MASTER_PORT=29500 llamafactory-cli train asterun/qwen_full_sft_ds2.yaml 
[2024-09-23 10:01:33,036] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
09/23/2024 10:01:37 - INFO - llamafactory.cli - Initializing distributed tasks at: 172.16.0.13:29500
[2024-09-23 10:01:52,891] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-23 10:01:56,575] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-23 10:01:56,575] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
09/23/2024 10:01:56 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2267] 2024-09-23 10:01:56,613 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2267] 2024-09-23 10:01:56,613 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2267] 2024-09-23 10:01:56,613 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2267] 2024-09-23 10:01:56,614 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2267] 2024-09-23 10:01:56,614 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2267] 2024-09-23 10:01:56,614 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2513] 2024-09-23 10:01:56,941 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/23/2024 10:01:56 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
09/23/2024 10:01:56 - WARNING - llamafactory.data.template - New tokens have been added, make sure `resize_vocab` is True.
09/23/2024 10:01:56 - INFO - llamafactory.data.loader - Loading dataset identity.json...
Converting format of dataset (num_proc=16): 100%|█████████████████████████████████████████████████████████████████████████████████| 91/91 [00:00<00:00, 347.58 examples/s]
09/23/2024 10:01:58 - INFO - llamafactory.data.loader - Loading dataset alpaca_zh_demo.json...
Converting format of dataset (num_proc=16): 100%|████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 4042.14 examples/s]
Running tokenizer on dataset (num_proc=16): 100%|█████████████████████████████████████████████████████████████████████████████| 1091/1091 [00:02<00:00, 476.63 examples/s]
training example:
input_ids:
[27, 91, 2468, 8757, 842, 91, 29, 872, 27, 91, 408, 8757, 842, 91, 1339, 6023, 151646, 27, 91, 2468, 8757, 842, 91, 29, 77091, 27, 91, 408, 8757, 842, 91, 1339, 9707, 0, 358, 1079, 5867, 606, 38154, 458, 15235, 17847, 7881, 553, 5867, 3094, 3417, 13, 2585, 646, 358, 7789, 498, 3351, 30, 151646]
inputs:
<|start_header_id|>user<|end_header_id|>

hi<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hello! I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today?<|eot_id|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 9707, 0, 358, 1079, 5867, 606, 38154, 458, 15235, 17847, 7881, 553, 5867, 3094, 3417, 13, 2585, 646, 358, 7789, 498, 3351, 30, 151646]
labels:
Hello! I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today?<|eot_id|>
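The `label_ids` above show how supervised fine-tuning masks the prompt: every token before the assistant response is set to `-100` (the `ignore_index` of PyTorch's cross-entropy loss), so the loss is computed only on the response tokens. A minimal sketch of that masking, using toy token IDs rather than the real vocabulary:

```python
# Sketch of the label masking shown in the log above: prompt positions are
# replaced with -100 (PyTorch CrossEntropyLoss ignore_index), so only the
# assistant response contributes to the loss. Token IDs are illustrative.
def mask_labels(input_ids, response_start):
    """Return label_ids with everything before response_start ignored."""
    return [-100] * response_start + input_ids[response_start:]

input_ids = [27, 91, 2468, 9707, 0, 358]  # toy sequence: 3 prompt + 3 response tokens
labels = mask_labels(input_ids, 3)
assert labels == [-100, -100, -100, 9707, 0, 358]
```

This mirrors the printed example, where the first 32 positions (the `<|start_header_id|>user ... hi<|eot_id|>` prompt) are all `-100` and the labels decode back to only the assistant reply.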
[INFO|configuration_utils.py:731] 2024-09-23 10:02:03,983 >> loading configuration file /home/lichao/AIGC/models/Qwen1.5-0.5B-Chat/config.json
[INFO|configuration_utils.py:800] 2024-09-23 10:02:03,986 >> Model config Qwen2Config {
  "_name_or_path": "/home/lichao/AIGC/models/Qwen1.5-0.5B-Chat",
  "architectures": [
    "Qwen2Config"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 2816,
  "max_position_embeddings": 32768,
  "max_window_layers": 21,
  "model_type": "qwen2",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "num_key_value_heads": 16,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.45.0.dev0",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}

[INFO|modeling_utils.py:3654] 2024-09-23 10:02:04,036 >> loading weights file /home/lichao/AIGC/models/Qwen1.5-0.5B-Chat/model.safetensors
[INFO|modeling_utils.py:1585] 2024-09-23 10:02:04,058 >> Instantiating Qwen2ForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1038] 2024-09-23 10:02:04,062 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151645
}

[INFO|modeling_utils.py:4489] 2024-09-23 10:02:05,417 >> All model checkpoint weights were used when initializing Qwen2ForCausalLM.

[INFO|modeling_utils.py:4497] 2024-09-23 10:02:05,417 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /home/lichao/AIGC/models/Qwen1.5-0.5B-Chat.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2024-09-23 10:02:05,421 >> loading configuration file /home/lichao/AIGC/models/Qwen1.5-0.5B-Chat/generation_config.json
[INFO|configuration_utils.py:1038] 2024-09-23 10:02:05,421 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.1,
  "top_p": 0.8
}

09/23/2024 10:02:05 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
09/23/2024 10:02:05 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
09/23/2024 10:02:05 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
09/23/2024 10:02:05 - INFO - llamafactory.model.adapter - Fine-tuning method: Full
09/23/2024 10:02:05 - INFO - llamafactory.model.loader - trainable params: 463,987,712 || all params: 463,987,712 || trainable%: 100.0000
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:655] 2024-09-23 10:02:05,593 >> Using auto half precision backend
[2024-09-23 10:02:06,167] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.15.1, git-hash=unknown, git-branch=unknown
[2024-09-23 10:02:06,167] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 3
[2024-09-23 10:02:06,406] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-09-23 10:02:06,408] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-09-23 10:02:06,408] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-09-23 10:02:06,424] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-09-23 10:02:06,424] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2024-09-23 10:02:06,424] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-09-23 10:02:06,424] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500000000
[2024-09-23 10:02:06,424] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500000000
[2024-09-23 10:02:06,424] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-09-23 10:02:06,424] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: True
[2024-09-23 10:02:08,342] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-09-23 10:02:08,343] [INFO] [utils.py:782:see_memory_usage] MA 1.63 GB         Max_MA 1.63 GB         CA 1.75 GB         Max_CA 2 GB 
[2024-09-23 10:02:08,343] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 6.67 GB, percent = 5.3%
[2024-09-23 10:02:08,568] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-09-23 10:02:08,569] [INFO] [utils.py:782:see_memory_usage] MA 1.63 GB         Max_MA 2.2 GB         CA 2.33 GB         Max_CA 2 GB 
[2024-09-23 10:02:08,570] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 6.67 GB, percent = 5.3%
[2024-09-23 10:02:08,570] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
[2024-09-23 10:02:08,792] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-09-23 10:02:08,793] [INFO] [utils.py:782:see_memory_usage] MA 1.63 GB         Max_MA 1.63 GB         CA 2.33 GB         Max_CA 2 GB 
[2024-09-23 10:02:08,793] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 6.67 GB, percent = 5.3%
[2024-09-23 10:02:08,794] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2024-09-23 10:02:08,794] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None
[2024-09-23 10:02:08,794] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-09-23 10:02:08,795] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2024-09-23 10:02:08,796] [INFO] [config.py:999:print] DeepSpeedEngine configuration:
[2024-09-23 10:02:08,796] [INFO] [config.py:1003:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-09-23 10:02:08,796] [INFO] [config.py:1003:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2024-09-23 10:02:08,796] [INFO] [config.py:1003:print]   amp_enabled .................. False
[2024-09-23 10:02:08,796] [INFO] [config.py:1003:print]   amp_params ................... False
[2024-09-23 10:02:08,797] [INFO] [config.py:1003:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-09-23 10:02:08,797] [INFO] [config.py:1003:print]   bfloat16_enabled ............. True
[2024-09-23 10:02:08,797] [INFO] [config.py:1003:print]   bfloat16_immediate_grad_update  False
[2024-09-23 10:02:08,797] [INFO] [config.py:1003:print]   checkpoint_parallel_write_pipeline  False
[2024-09-23 10:02:08,797] [INFO] [config.py:1003:print]   checkpoint_tag_validation_enabled  True
[2024-09-23 10:02:08,797] [INFO] [config.py:1003:print]   checkpoint_tag_validation_fail  False
[2024-09-23 10:02:08,797] [INFO] [config.py:1003:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f0d52b5d3d0>
[2024-09-23 10:02:08,797] [INFO] [config.py:1003:print]   communication_data_type ...... None
[2024-09-23 10:02:08,797] [INFO] [config.py:1003:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-09-23 10:02:08,797] [INFO] [config.py:1003:print]   curriculum_enabled_legacy .... False
[2024-09-23 10:02:08,797] [INFO] [config.py:1003:print]   curriculum_params_legacy ..... False
[2024-09-23 10:02:08,797] [INFO] [config.py:1003:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-09-23 10:02:08,797] [INFO] [config.py:1003:print]   data_efficiency_enabled ...... False
[2024-09-23 10:02:08,797] [INFO] [config.py:1003:print]   dataloader_drop_last ......... False
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   disable_allgather ............ False
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   dump_state ................... False
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   dynamic_loss_scale_args ...... None
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   eigenvalue_enabled ........... False
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   eigenvalue_gas_boundary_resolution  1
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   eigenvalue_layer_num ......... 0
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   eigenvalue_max_iter .......... 100
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   eigenvalue_stability ......... 1e-06
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   eigenvalue_tol ............... 0.01
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   eigenvalue_verbose ........... False
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   elasticity_enabled ........... False
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   fp16_auto_cast ............... None
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   fp16_enabled ................. False
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   fp16_master_weights_and_gradients  False
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   global_rank .................. 0
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   grad_accum_dtype ............. None
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   gradient_accumulation_steps .. 2
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   gradient_clipping ............ 1.0
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   gradient_predivide_factor .... 1.0
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   graph_harvesting ............. False
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   initial_dynamic_scale ........ 1
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   load_universal_checkpoint .... False
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   loss_scale ................... 1.0
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   memory_breakdown ............. False
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   mics_hierarchial_params_gather  False
[2024-09-23 10:02:08,798] [INFO] [config.py:1003:print]   mics_shard_size .............. -1
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   optimizer_legacy_fusion ...... False
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   optimizer_name ............... None
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   optimizer_params ............. None
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   pld_enabled .................. False
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   pld_params ................... False
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   prescale_gradients ........... False
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   scheduler_name ............... None
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   scheduler_params ............. None
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   seq_parallel_communication_data_type  torch.float32
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   sparse_attention ............. None
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   sparse_gradients_enabled ..... False
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   steps_per_print .............. inf
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   timers_config ................ enabled=True synchronized=True
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   train_batch_size ............. 6
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   train_micro_batch_size_per_gpu  1
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   use_data_before_expert_parallel_  False
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   use_node_local_storage ....... False
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   wall_clock_breakdown ......... False
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   weight_quantization_config ... None
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   world_size ................... 3
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   zero_allow_untested_optimizer  True
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=True zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   zero_enabled ................. True
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   zero_force_ds_cpu_optimizer .. True
[2024-09-23 10:02:08,799] [INFO] [config.py:1003:print]   zero_optimization_stage ...... 2
[2024-09-23 10:02:08,800] [INFO] [config.py:989:print_user_config]   json = {
    "train_batch_size": 6, 
    "train_micro_batch_size_per_gpu": 1, 
    "gradient_accumulation_steps": 2, 
    "gradient_clipping": 1.0, 
    "zero_allow_untested_optimizer": true, 
    "fp16": {
        "enabled": false, 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "initial_scale_power": 16, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "bf16": {
        "enabled": true
    }, 
    "zero_optimization": {
        "stage": 2, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 5.000000e+08, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 5.000000e+08, 
        "contiguous_gradients": true, 
        "round_robin_gradients": true
    }, 
    "steps_per_print": inf
}
[INFO|trainer.py:2141] 2024-09-23 10:02:08,800 >> ***** Running training *****
[INFO|trainer.py:2142] 2024-09-23 10:02:08,800 >>   Num examples = 981
[INFO|trainer.py:2143] 2024-09-23 10:02:08,800 >>   Num Epochs = 3
[INFO|trainer.py:2144] 2024-09-23 10:02:08,800 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:2147] 2024-09-23 10:02:08,800 >>   Total train batch size (w. parallel, distributed & accumulation) = 6
[INFO|trainer.py:2148] 2024-09-23 10:02:08,800 >>   Gradient Accumulation steps = 2
[INFO|trainer.py:2149] 2024-09-23 10:02:08,800 >>   Total optimization steps = 489
[INFO|trainer.py:2150] 2024-09-23 10:02:08,801 >>   Number of trainable parameters = 463,987,712
  0%|                                                                                                                                             | 0/489 [00:00<?, ?it/s]/home/lichao/opt/python3.11.9/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
{'loss': 2.3658, 'grad_norm': 25.19988250732422, 'learning_rate': 2.0408163265306123e-05, 'epoch': 0.06}                                                                  
{'loss': 2.6136, 'grad_norm': 9.38448429107666, 'learning_rate': 4.0816326530612245e-05, 'epoch': 0.12}                                                                   
{'loss': 2.2796, 'grad_norm': 13.728240013122559, 'learning_rate': 6.122448979591838e-05, 'epoch': 0.18}                                                                  
{'loss': 2.1511, 'grad_norm': 18.125511169433594, 'learning_rate': 8.163265306122449e-05, 'epoch': 0.24}                                                                  
{'loss': 2.3712, 'grad_norm': 22.641611099243164, 'learning_rate': 9.999872552137497e-05, 'epoch': 0.31}                                                                  
{'loss': 2.3982, 'grad_norm': 19.40285301208496, 'learning_rate': 9.98458666866564e-05, 'epoch': 0.37}                                                                    
{'loss': 2.5063, 'grad_norm': 11.834580421447754, 'learning_rate': 9.943900474099748e-05, 'epoch': 0.43}                                                                  
{'loss': 2.4219, 'grad_norm': 11.096634864807129, 'learning_rate': 9.878021295961217e-05, 'epoch': 0.49}                                                                  
{'loss': 2.5318, 'grad_norm': 11.01838493347168, 'learning_rate': 9.787284839440982e-05, 'epoch': 0.55}                                                                   
{'loss': 2.6357, 'grad_norm': 15.102975845336914, 'learning_rate': 9.672153476722816e-05, 'epoch': 0.61}                                                                  
{'loss': 2.5858, 'grad_norm': 11.936942100524902, 'learning_rate': 9.533213890840657e-05, 'epoch': 0.67}                                                                  
{'loss': 2.3013, 'grad_norm': 10.956372261047363, 'learning_rate': 9.371174086076363e-05, 'epoch': 0.73}                                                                  
{'loss': 2.443, 'grad_norm': 11.979649543762207, 'learning_rate': 9.186859780132164e-05, 'epoch': 0.8}                                                                    
{'loss': 2.4357, 'grad_norm': 7.360419273376465, 'learning_rate': 8.981210196462533e-05, 'epoch': 0.86}                                                                   
{'loss': 2.5534, 'grad_norm': 14.005857467651367, 'learning_rate': 8.755273278206749e-05, 'epoch': 0.92}                                                                  
{'loss': 2.5753, 'grad_norm': 9.832633018493652, 'learning_rate': 8.510200348110868e-05, 'epoch': 0.98}                                                                   
{'loss': 1.7594, 'grad_norm': 10.028552055358887, 'learning_rate': 8.247240241650918e-05, 'epoch': 1.04}                                                                  
{'loss': 1.4025, 'grad_norm': 12.267614364624023, 'learning_rate': 7.967732943253571e-05, 'epoch': 1.1}                                                                   
{'loss': 1.1433, 'grad_norm': 7.551489353179932, 'learning_rate': 7.673102758042653e-05, 'epoch': 1.16}                                                                   
{'loss': 1.2479, 'grad_norm': 8.397479057312012, 'learning_rate': 7.364851053906718e-05, 'epoch': 1.22}                                                                   
{'loss': 1.1978, 'grad_norm': 9.697928428649902, 'learning_rate': 7.044548610872434e-05, 'epoch': 1.28}                                                                   
{'loss': 1.1877, 'grad_norm': 14.016590118408203, 'learning_rate': 6.713827616769614e-05, 'epoch': 1.35}                                                                  
{'loss': 1.2349, 'grad_norm': 11.697397232055664, 'learning_rate': 6.374373349976169e-05, 'epoch': 1.41}                                                                  
{'loss': 1.214, 'grad_norm': 8.01415729522705, 'learning_rate': 6.027915591625804e-05, 'epoch': 1.47}                                                                     
{'loss': 1.1724, 'grad_norm': 8.013666152954102, 'learning_rate': 5.6762198110398444e-05, 'epoch': 1.53}                                                                  
{'loss': 1.2709, 'grad_norm': 10.372663497924805, 'learning_rate': 5.3210781693002754e-05, 'epoch': 1.59}                                                                 
{'loss': 1.1069, 'grad_norm': 14.193530082702637, 'learning_rate': 4.964300386807653e-05, 'epoch': 1.65}                                                                  
{'loss': 1.3013, 'grad_norm': 14.019328117370605, 'learning_rate': 4.607704521360776e-05, 'epoch': 1.71}                                                                  
{'loss': 1.2138, 'grad_norm': 11.885704040527344, 'learning_rate': 4.253107703750875e-05, 'epoch': 1.77}                                                                  
{'loss': 1.1027, 'grad_norm': 8.35533332824707, 'learning_rate': 3.9023168780796294e-05, 'epoch': 1.83}                                                                   
{'loss': 1.1346, 'grad_norm': 12.683867454528809, 'learning_rate': 3.557119593986208e-05, 'epoch': 1.9}                                                                   
{'loss': 1.0305, 'grad_norm': 7.334381580352783, 'learning_rate': 3.219274897704053e-05, 'epoch': 1.96}                                                                   
{'loss': 0.9327, 'grad_norm': 4.699033737182617, 'learning_rate': 2.8905043683644872e-05, 'epoch': 2.02}                                                                  
{'loss': 0.5392, 'grad_norm': 5.634421348571777, 'learning_rate': 2.5724833452240792e-05, 'epoch': 2.08}                                                                  
{'loss': 0.5446, 'grad_norm': 5.442759990692139, 'learning_rate': 2.2668323905198108e-05, 'epoch': 2.14}                                                                  
{'loss': 0.4084, 'grad_norm': 5.1523966789245605, 'learning_rate': 1.9751090314553878e-05, 'epoch': 2.2}                                                                  
{'loss': 0.4885, 'grad_norm': 6.668193340301514, 'learning_rate': 1.698799823399628e-05, 'epoch': 2.26}                                                                   
{'loss': 0.4697, 'grad_norm': 5.780378818511963, 'learning_rate': 1.4393127747410417e-05, 'epoch': 2.32}                                                                  
{'loss': 0.4652, 'grad_norm': 4.824888706207275, 'learning_rate': 1.1979701719998453e-05, 'epoch': 2.39}                                                                  
{'loss': 0.4356, 'grad_norm': 12.217597961425781, 'learning_rate': 9.760018417589334e-06, 'epoch': 2.45}                                                                  
{'loss': 0.4252, 'grad_norm': 5.763933181762695, 'learning_rate': 7.745388837495188e-06, 'epoch': 2.51}                                                                   
{'loss': 0.4486, 'grad_norm': 8.276981353759766, 'learning_rate': 5.946079070261773e-06, 'epoch': 2.57}                                                                   
{'loss': 0.4308, 'grad_norm': 12.236105918884277, 'learning_rate': 4.371257986024202e-06, 'epoch': 2.63}                                                                  
{'loss': 0.4139, 'grad_norm': 5.1657185554504395, 'learning_rate': 3.0289505120464743e-06, 'epoch': 2.69}                                                                 
{'loss': 0.3718, 'grad_norm': 6.259467124938965, 'learning_rate': 1.925996739531577e-06, 'epoch': 2.75}                                                                   
{'loss': 0.3833, 'grad_norm': 8.667612075805664, 'learning_rate': 1.0680170680846259e-06, 'epoch': 2.81}                                                                  
{'loss': 0.4498, 'grad_norm': 7.922170639038086, 'learning_rate': 4.593835654447709e-07, 'epoch': 2.87}                                                                   
{'loss': 0.4422, 'grad_norm': 5.631829261779785, 'learning_rate': 1.0319768843018996e-07, 'epoch': 2.94}                                                                  
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 489/489 [26:28<00:00,  3.26s/it][INFO|trainer.py:3510] 2024-09-23 10:28:37,461 >> Saving model checkpoint to asterun/saves/qwen/full/sft/checkpoint-489
[INFO|configuration_utils.py:472] 2024-09-23 10:28:37,464 >> Configuration saved in asterun/saves/qwen/full/sft/checkpoint-489/config.json
[INFO|configuration_utils.py:807] 2024-09-23 10:28:37,464 >> Configuration saved in asterun/saves/qwen/full/sft/checkpoint-489/generation_config.json
[INFO|modeling_utils.py:2778] 2024-09-23 10:28:43,244 >> Model weights saved in asterun/saves/qwen/full/sft/checkpoint-489/model.safetensors
[INFO|tokenization_utils_base.py:2684] 2024-09-23 10:28:43,251 >> tokenizer config file saved in asterun/saves/qwen/full/sft/checkpoint-489/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2024-09-23 10:28:43,252 >> Special tokens file saved in asterun/saves/qwen/full/sft/checkpoint-489/special_tokens_map.json
[2024-09-23 10:28:43,459] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step489 is about to be saved!
[2024-09-23 10:28:43,470] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: asterun/saves/qwen/full/sft/checkpoint-489/global_step489/mp_rank_00_model_states.pt
[2024-09-23 10:28:43,470] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving asterun/saves/qwen/full/sft/checkpoint-489/global_step489/mp_rank_00_model_states.pt...
[2024-09-23 10:28:48,175] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved asterun/saves/qwen/full/sft/checkpoint-489/global_step489/mp_rank_00_model_states.pt.
[2024-09-23 10:28:48,178] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving asterun/saves/qwen/full/sft/checkpoint-489/global_step489/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-09-23 10:28:57,930] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved asterun/saves/qwen/full/sft/checkpoint-489/global_step489/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-09-23 10:28:57,931] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved asterun/saves/qwen/full/sft/checkpoint-489/global_step489/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-09-23 10:28:57,931] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step489 is ready now!
[INFO|trainer.py:2401] 2024-09-23 10:28:57,940 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 1609.1394, 'train_samples_per_second': 1.829, 'train_steps_per_second': 0.304, 'train_loss': 1.3682080348820287, 'epoch': 2.99}                         
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 489/489 [26:49<00:00,  3.29s/it]
[INFO|trainer.py:3510] 2024-09-23 10:28:58,466 >> Saving model checkpoint to asterun/saves/qwen/full/sft
[INFO|configuration_utils.py:472] 2024-09-23 10:28:58,470 >> Configuration saved in asterun/saves/qwen/full/sft/config.json
[INFO|configuration_utils.py:807] 2024-09-23 10:28:58,470 >> Configuration saved in asterun/saves/qwen/full/sft/generation_config.json
[INFO|modeling_utils.py:2778] 2024-09-23 10:29:04,536 >> Model weights saved in asterun/saves/qwen/full/sft/model.safetensors
[INFO|tokenization_utils_base.py:2684] 2024-09-23 10:29:04,552 >> tokenizer config file saved in asterun/saves/qwen/full/sft/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2024-09-23 10:29:04,552 >> Special tokens file saved in asterun/saves/qwen/full/sft/special_tokens_map.json
***** train metrics *****
  epoch                    =     2.9908
  total_flos               =   772542GF
  train_loss               =     1.3682
  train_runtime            = 0:26:49.13
  train_samples_per_second =      1.829
  train_steps_per_second   =      0.304
Figure saved at: asterun/saves/qwen/full/sft/training_loss.png
09/23/2024 10:29:05 - WARNING - llamafactory.extras.ploting - No metric eval_loss to plot.
09/23/2024 10:29:05 - WARNING - llamafactory.extras.ploting - No metric eval_accuracy to plot.
[INFO|trainer.py:3826] 2024-09-23 10:29:05,042 >> 
***** Running Evaluation *****
[INFO|trainer.py:3828] 2024-09-23 10:29:05,042 >>   Num examples = 110
[INFO|trainer.py:3831] 2024-09-23 10:29:05,042 >>   Batch size = 1
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37/37 [00:01<00:00, 19.78it/s]
***** eval metrics *****
  epoch                   =     2.9908
  eval_loss               =     2.7517
  eval_runtime            = 0:00:01.92
  eval_samples_per_second =     57.029
  eval_steps_per_second   =     19.182
[INFO|modelcard.py:449] 2024-09-23 10:29:06,975 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
[root@server3 LLaMA-Factory-0.8.3]# 

3.1.6 Inference Testing

Install the GGUF library

Download the llama.cpp source archive to the server and extract it into the working directory:
[root@server3 AIGC]# unzip llama.cpp-master.zip
[root@server3 AIGC]# cd llama.cpp-master
[root@server3 llama.cpp-master]# ll
总用量 576
-rw-r--r--  1 root root  33717 9月  26 11:38 AUTHORS
drwxr-xr-x  2 root root     37 9月  26 11:38 ci
drwxr-xr-x  2 root root    164 9月  26 11:38 cmake
-rw-r--r--  1 root root   6591 9月  26 11:38 CMakeLists.txt
-rw-r--r--  1 root root   3164 9月  26 11:38 CMakePresets.json
drwxr-xr-x  3 root root   4096 9月  26 11:38 common
-rw-r--r--  1 root root   2256 9月  26 11:38 CONTRIBUTING.md
-rwxr-xr-x  1 root root 199470 9月  26 11:38 convert_hf_to_gguf.py
-rwxr-xr-x  1 root root  15993 9月  26 11:38 convert_hf_to_gguf_update.py
-rwxr-xr-x  1 root root  19106 9月  26 11:38 convert_llama_ggml_to_gguf.py
-rwxr-xr-x  1 root root  14901 9月  26 11:38 convert_lora_to_gguf.py
drwxr-xr-x  4 root root    109 9月  26 11:38 docs
drwxr-xr-x 43 root root   4096 9月  26 11:38 examples
-rw-r--r--  1 root root   1556 9月  26 11:38 flake.lock
-rw-r--r--  1 root root   7469 9月  26 11:38 flake.nix
drwxr-xr-x  5 root root     85 9月  26 11:38 ggml
drwxr-xr-x  6 root root    116 9月  26 11:38 gguf-py
drwxr-xr-x  2 root root    154 9月  26 11:38 grammars
drwxr-xr-x  2 root root     21 9月  26 11:38 include
-rw-r--r--  1 root root   1078 9月  26 11:38 LICENSE
-rw-r--r--  1 root root  50865 9月  26 11:38 Makefile
drwxr-xr-x  2 root root    163 9月  26 11:38 media
drwxr-xr-x  2 root root   4096 9月  26 11:38 models
-rw-r--r--  1 root root    163 9月  26 11:38 mypy.ini
-rw-r--r--  1 root root   2044 9月  26 11:38 Package.swift
drwxr-xr-x  3 root root     40 9月  26 11:38 pocs
-rw-r--r--  1 root root 124786 9月  26 11:38 poetry.lock
drwxr-xr-x  2 root root   4096 9月  26 11:38 prompts
-rw-r--r--  1 root root   1280 9月  26 11:38 pyproject.toml
-rw-r--r--  1 root root    528 9月  26 11:38 pyrightconfig.json
-rw-r--r--  1 root root  28481 9月  26 11:38 README.md
drwxr-xr-x  2 root root   4096 9月  26 11:38 requirements
-rw-r--r--  1 root root    505 9月  26 11:38 requirements.txt
drwxr-xr-x  2 root root   4096 9月  26 11:38 scripts
-rw-r--r--  1 root root   5090 9月  26 11:38 SECURITY.md
drwxr-xr-x  2 root root     97 9月  26 11:38 spm-headers
drwxr-xr-x  2 root root    289 9月  26 11:38 src
drwxr-xr-x  2 root root   4096 9月  26 11:38 tests
[root@server3 llama.cpp-master]# 

Enter the gguf-py subdirectory and install the GGUF library:
[root@server3 llama.cpp-master]# cd gguf-py
[root@server3 gguf-py]# ll
总用量 12
drwxr-xr-x 2 root root   40 9月  26 11:38 examples
drwxr-xr-x 2 root root  230 9月  26 11:38 gguf
-rw-r--r-- 1 root root 1072 9月  26 11:38 LICENSE
-rw-r--r-- 1 root root 1049 9月  26 11:38 pyproject.toml
-rw-r--r-- 1 root root 2719 9月  26 11:38 README.md
drwxr-xr-x 2 root root  151 9月  26 11:38 scripts
drwxr-xr-x 2 root root   71 9月  26 11:38 tests
[root@server3 gguf-py]# pip install --editable .
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Obtaining file:///home/lichao/AIGC/llama.cpp-master/gguf-py
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Requirement already satisfied: numpy>=1.17 in /home/lichao/opt/python3.11.9/lib/python3.11/site-packages (from gguf==0.10.0) (1.26.4)
Requirement already satisfied: pyyaml>=5.1 in /home/lichao/opt/python3.11.9/lib/python3.11/site-packages (from gguf==0.10.0) (6.0.2)
Requirement already satisfied: sentencepiece<=0.2.0,>=0.1.98 in /home/lichao/opt/python3.11.9/lib/python3.11/site-packages (from gguf==0.10.0) (0.2.0)
Requirement already satisfied: tqdm>=4.27 in /home/lichao/opt/python3.11.9/lib/python3.11/site-packages (from gguf==0.10.0) (4.66.5)
Building wheels for collected packages: gguf
  Building editable for gguf (pyproject.toml) ... done
  Created wheel for gguf: filename=gguf-0.10.0-py3-none-any.whl size=3403 sha256=4a0851426e263076c64c9854be9dfe95493844062484d001fddb08c1be5fa2ca
  Stored in directory: /tmp/pip-ephem-wheel-cache-iiq8ofh3/wheels/80/80/9b/c6c23d750f4bd20fc0c2c75e51253d89c61a2369247fb694db
Successfully built gguf
Installing collected packages: gguf
Successfully installed gguf-0.10.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
[root@server3 gguf-py]# 

Model format conversion

Convert the safetensors-format model produced by the earlier fine-tuning run into GGUF format:
[root@server3 gguf-py]# cd .. 
[root@server3 llama.cpp-master]# python3 convert_hf_to_gguf.py /home/lichao/AIGC/LLaMA-Factory-0.8.3/asterun/saves/qwen/full/sft
INFO:hf-to-gguf:Loading model: sft
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:output.weight,             torch.bfloat16 --> F16, shape = {1024, 151936}
INFO:hf-to-gguf:token_embd.weight,         torch.bfloat16 --> F16, shape = {1024, 151936}
INFO:hf-to-gguf:blk.0.attn_norm.weight,    torch.bfloat16 --> F32, shape = {1024}
INFO:hf-to-gguf:blk.0.ffn_down.weight,     torch.bfloat16 --> F16, shape = {2816, 1024}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,     torch.bfloat16 --> F16, shape = {1024, 2816}
INFO:hf-to-gguf:blk.0.ffn_up.weight,       torch.bfloat16 --> F16, shape = {1024, 2816}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,     torch.bfloat16 --> F32, shape = {1024}
INFO:hf-to-gguf:blk.0.attn_k.bias,         torch.bfloat16 --> F32, shape = {1024}
INFO:hf-to-gguf:blk.0.attn_k.weight,       torch.bfloat16 --> F16, shape = {1024, 1024}
INFO:hf-to-gguf:blk.0.attn_output.weight,  torch.bfloat16 --> F16, shape = {1024, 1024}
INFO:hf-to-gguf:blk.0.attn_q.bias,         torch.bfloat16 --> F32, shape = {1024}
INFO:hf-to-gguf:blk.0.attn_q.weight,       torch.bfloat16 --> F16, shape = {1024, 1024}
INFO:hf-to-gguf:blk.0.attn_v.bias,         torch.bfloat16 --> F32, shape = {1024}
INFO:hf-to-gguf:blk.0.attn_v.weight,       torch.bfloat16 --> F16, shape = {1024, 1024}
INFO:hf-to-gguf:... (equivalent per-layer lines for blk.1 through blk.23 omitted; each layer logs the same attn/ffn tensors as blk.0 above) ...
INFO:hf-to-gguf:output_norm.weight,        torch.bfloat16 --> F32, shape = {1024}
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 32768
INFO:hf-to-gguf:gguf: embedding length = 1024
INFO:hf-to-gguf:gguf: feed forward length = 2816
INFO:hf-to-gguf:gguf: head count = 16
INFO:hf-to-gguf:gguf: key-value head count = 16
INFO:hf-to-gguf:gguf: rope theta = 1000000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-06
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Adding 151387 merge(s).
INFO:gguf.vocab:Setting special token type eos to 151646
INFO:gguf.vocab:Setting special token type pad to 151643
INFO:gguf.vocab:Setting special token type bos to 151643
INFO:gguf.vocab:Setting chat_template to {% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{% endif %}{% if system_message is defined %}{{ '<|start_header_id|>system<|end_header_id|>

' + system_message + '<|eot_id|>' }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|start_header_id|>user<|end_header_id|>

' + content + '<|eot_id|><|start_header_id|>assistant<|end_header_id|>

' }}{% elif message['role'] == 'assistant' %}{{ content + '<|eot_id|>' }}{% endif %}{% endfor %}
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/home/lichao/AIGC/LLaMA-Factory-0.8.3/asterun/saves/qwen/full/sft/Sft-620M-F16.gguf: n_tensors = 291, total_size = 1.2G
Writing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.24G/1.24G [00:03<00:00, 338Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /home/lichao/AIGC/LLaMA-Factory-0.8.3/asterun/saves/qwen/full/sft/Sft-620M-F16.gguf
[root@server3 llama.cpp-master]# cd /home/lichao/AIGC/LLaMA-Factory-0.8.3/asterun/saves/qwen/full/sft
After the conversion succeeds, rename the GGUF model file so it is easier to identify later.
[root@server3 sft]# ll
总用量 2883588
-rw-r--r-- 1 root root        104 9月  23 10:29 added_tokens.json
-rw-r--r-- 1 root root        358 9月  23 10:29 all_results.json
drwxr-xr-x 3 root root       4096 9月  19 09:59 checkpoint-1000
drwxr-xr-x 3 root root       4096 9月  19 10:05 checkpoint-1470
drwxr-xr-x 3 root root       4096 9月  13 11:02 checkpoint-489
drwxr-xr-x 3 root root       4096 9月  19 09:51 checkpoint-500
-rw-r--r-- 1 root root        731 9月  23 10:28 config.json
-rw-r--r-- 1 root root        175 9月  23 10:29 eval_results.json
-rw-r--r-- 1 root root        210 9月  23 10:28 generation_config.json
-rw-r--r-- 1 root root    1671853 9月  23 10:29 merges.txt
-rw-r--r-- 1 root root 1239173352 9月  23 10:28 model.safetensors
-rw-r--r-- 1 root root       1398 9月  23 10:29 README.md
drwxr-xr-x 2 root root        222 9月  23 10:29 runs
-rw-r--r-- 1 root root 1245334112 9月  26 11:58 Sft-620M-F16.gguf
-rw-r--r-- 1 root root        367 9月  23 10:29 special_tokens_map.json
-rw-r--r-- 1 root root       1720 9月  23 10:29 tokenizer_config.json
-rw-r--r-- 1 root root    7028230 9月  23 10:29 tokenizer.json
-rw-r--r-- 1 root root      11984 9月  23 10:28 trainer_log.jsonl
-rw-r--r-- 1 root root       9284 9月  23 10:29 trainer_state.json
-rw-r--r-- 1 root root       6584 9月  23 10:29 training_args.bin
-rw-r--r-- 1 root root      38333 9月  19 10:06 training_eval_loss.png
-rw-r--r-- 1 root root      37022 9月  23 10:29 training_loss.png
-rw-r--r-- 1 root root        218 9月  23 10:29 train_results.json
-rw-r--r-- 1 root root    2776833 9月  23 10:29 vocab.json
[root@server3 sft]# mv Sft-620M-F16.gguf qwen-sft-620M-F16.gguf 
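Before handing the renamed file to Ollama, a quick sanity check can confirm it really is a GGUF file: every GGUF file begins with the 4-byte magic "GGUF". The helper below is a sketch; the filename matches the rename step above.

```shell
# Sanity check (a sketch): every GGUF file starts with the 4-byte magic
# "GGUF"; verify the renamed file before registering it with Ollama.
check_gguf_magic() {
    magic=$(head -c 4 "$1" 2>/dev/null || true)
    if [ "$magic" = "GGUF" ]; then
        echo "OK: $1 looks like a GGUF file"
    else
        echo "WARN: $1 does not start with the GGUF magic"
    fi
}
check_gguf_magic qwen-sft-620M-F16.gguf
```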

Install Ollama

Download the Ollama release tarball to the server and extract it into the working directory:
[root@server3 AIGC]# tar -C /usr -xzf ollama-linux-amd64.tgz
Start the ollama service from the command line:
[root@server3 AIGC]# ollama serve
Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is: 

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILZVS+rUG5x5wd6issBvGuj3YYzMnPUUOmVbEz4iZFCt

2024/09/26 12:04:20 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2024-09-26T12:04:20.753+02:00 level=INFO source=images.go:753 msg="total blobs: 0"
time=2024-09-26T12:04:20.754+02:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
time=2024-09-26T12:04:20.754+02:00 level=INFO source=routes.go:1200 msg="Listening on 127.0.0.1:11434 (version 0.3.12)"
time=2024-09-26T12:04:20.755+02:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama316805737/runners
time=2024-09-26T12:04:39.145+02:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]"
time=2024-09-26T12:04:39.145+02:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
time=2024-09-26T12:04:39.283+02:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-2d337ad0-020d-0464-2d00-715b0d00c7ba library=cuda variant=v12 compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4060 Ti" total="15.7 GiB" available="15.6 GiB"
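Running `ollama serve` in a foreground terminal stops when the session closes. As an optional alternative, the service could be managed by systemd; the unit below is a hedged sketch, not part of the original setup: the binary path matches the `tar -C /usr` extraction above, everything else (unit name, restart policy) is an assumption.

```
# /etc/systemd/system/ollama.service -- hypothetical unit file (sketch)
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=root
Restart=always
# Uncomment to listen on all interfaces instead of the default 127.0.0.1
#Environment="OLLAMA_HOST=0.0.0.0:11434"

[Install]
WantedBy=multi-user.target
```

It would then be enabled with `systemctl daemon-reload && systemctl enable --now ollama`.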

Register the Model

Open a new terminal.

[root@server3 AIGC]# cd LLaMA-Factory-0.8.3/asterun/
[root@server3 asterun]# ll
总用量 4
-rw-r--r-- 1 root root 817 9月  19 09:33 qwen_full_sft_ds2.yaml
drwxr-xr-x 3 root root  18 9月  13 10:28 saves
Create a Modelfile for the model:
[root@server3 asterun]# touch qwen_full_sft_ds2.ollama.Modelfile
[root@server3 asterun]# vim qwen_full_sft_ds2.ollama.Modelfile 
[root@server3 asterun]# cat qwen_full_sft_ds2.ollama.Modelfile 
FROM /home/lichao/AIGC/LLaMA-Factory-0.8.3/asterun/saves/qwen/full/sft/qwen-sft-620M-F16.gguf
[root@server3 asterun]# cd ../..
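The Modelfile above only points Ollama at the base weights with FROM. The Modelfile format also supports PARAMETER and SYSTEM directives; the extended version below is a hedged sketch, and the parameter values and system prompt are illustrative additions, not taken from the original setup.

```
FROM /home/lichao/AIGC/LLaMA-Factory-0.8.3/asterun/saves/qwen/full/sft/qwen-sft-620M-F16.gguf

# Illustrative additions -- not part of the original Modelfile
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a helpful assistant."
```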
Register the model using the Modelfile:
[root@server3 AIGC]# ollama create qwen-full-sft -f ./LLaMA-Factory-0.8.3/asterun/qwen_full_sft_ds2.ollama.Modelfile 
transferring model data 100% 
using existing layer sha256:19d794be57081c1a5aa7e03c4045a0fdc5b8a40f080f0c550ab38033cf0d5d58 
creating new layer sha256:c33681b055686143e7d6e0bb0f1054c9910c05c3f4ab16932fbc567a8961929a 
writing manifest 
success 
[root@server3 AIGC]# 

Inference Test

Run an inference session with the registered model:
[root@server3 AIGC]# ollama run qwen-full-sft
>>> who are you?
<|im_end|>
我是 {{name}},一个由 {{author}} 开发的人工智能助手,我可以帮助用户查询信息、安排日程、提供建议等。

>>> can you speak english?
 I am an AI assistant developed by {{author}}.

>>> 好吧,用中文交流吧。
没问题。

>>> 你喜欢中国哪个城市?
每个城市都有其独特的魅力,各具特色,比如:
成都:美食之都,生活悠闲。
北京:历史悠久,文化丰富。
杭州:风景优美,以西湖闻名。
上海:现代化大都市,经济繁荣。

>>> 感谢,再见
好的,我是个人工智能助手,很高兴见到您。

>>> exit
[root@server3 AIGC]# 
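Besides the interactive REPL, the registered model can also be queried over Ollama's HTTP API. The sketch below builds a request body for the /api/generate endpoint; the model name matches the `ollama create` step and the host/port are the Ollama defaults seen in the serve log above.

```shell
# Hedged sketch: prepare a request body for Ollama's /api/generate endpoint
# instead of using the interactive prompt.
cat > /tmp/ollama_req.json <<'EOF'
{
  "model": "qwen-full-sft",
  "prompt": "who are you?",
  "stream": false
}
EOF
cat /tmp/ollama_req.json
```

It would be sent with: `curl http://127.0.0.1:11434/api/generate -d @/tmp/ollama_req.json`.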

At this point, the setup and testing of the distributed computing environment are complete.

4 Deployment and Usage Q&A

  • Issue 1

Running a single-node nccl-test job with the parameters below produces the message "No OpenFabrics connection schemes reported that they were able to be used on a specific port. As such, the openib BTL (OpenFabrics support) will be disabled for this port." The test still runs to completion; it is not yet clear what impact this has.

[root@server3 ~]# /home/lichao/opt/openmpi/bin/mpirun --allow-run-as-root -np 1 /home/lichao/AIGC/nccl-tests/build/all_reduce_perf -b 512 -e 8G -f 2 -g 1
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           server3
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
# nThread 1 nGpus 1 minBytes 512 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   8080 on    server3 device  0 [0x02] NVIDIA GeForce RTX 4060 Ti
#
# Reducing maxBytes to 5261099008 due to memory limitation
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
         512           128     float     sum      -1     3.77    0.14    0.00      0     0.34    1.50    0.00      0
        1024           256     float     sum      -1     3.96    0.26    0.00      0     0.34    3.04    0.00      0
        2048           512     float     sum      -1     3.63    0.56    0.00      0     0.34    6.03    0.00      0
        4096          1024     float     sum      -1     3.63    1.13    0.00      0     0.34   12.06    0.00      0
        8192          2048     float     sum      -1     3.65    2.25    0.00      0     0.34   24.17    0.00      0
       16384          4096     float     sum      -1     3.63    4.51    0.00      0     0.34   48.23    0.00      0
       32768          8192     float     sum      -1     3.61    9.08    0.00      0     0.34   97.21    0.00      0
       65536         16384     float     sum      -1     3.60   18.18    0.00      0     0.34  193.52    0.00      0
      131072         32768     float     sum      -1     3.67   35.72    0.00      0     0.34  389.86    0.00      0
      262144         65536     float     sum      -1     3.66   71.54    0.00      0     0.35  757.97    0.00      0
      524288        131072     float     sum      -1     4.38  119.60    0.00      0     0.34  1542.25    0.00      0
     1048576        262144     float     sum      -1     6.66  157.41    0.00      0     0.33  3164.08    0.00      0
     2097152        524288     float     sum      -1    15.73  133.29    0.00      0     0.34  6233.18    0.00      0
     4194304       1048576     float     sum      -1    31.38  133.66    0.00      0     0.34  12457.10    0.00      0
     8388608       2097152     float     sum      -1    65.34  128.37    0.00      0     0.34  24467.28    0.00      0
    16777216       4194304     float     sum      -1    132.4  126.70    0.00      0     0.34  49156.80    0.00      0
    33554432       8388608     float     sum      -1    275.5  121.81    0.00      0     0.34  99258.78    0.00      0
    67108864      16777216     float     sum      -1    549.5  122.13    0.00      0     0.34  199728.76    0.00      0
   134217728      33554432     float     sum      -1   1101.8  121.81    0.00      0     0.34  398863.98    0.00      0
   268435456      67108864     float     sum      -1   2203.6  121.81    0.00      0     0.34  785128.56    0.00      0
   536870912     134217728     float     sum      -1   4414.9  121.60    0.00      0     0.34  1567735.18    0.00      0
  1073741824     268435456     float     sum      -1   8819.1  121.75    0.00      0     0.34  3121342.51    0.00      0
  2147483648     536870912     float     sum      -1    17639  121.75    0.00      0     0.35  6218281.88    0.00      0
  4294967296    1073741824     float     sum      -1    35280  121.74    0.00      0     0.30  14144466.64    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#

[server3:08076] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[server3:08076] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[root@server3 ~]# 

Root cause analysis / solution

Adding the parameter -mca btl '^openib' to the mpirun command line, i.e. setting the btl MCA value to '^openib' (exclude openib), resolves the issue.

[root@server3 ~]# /home/lichao/opt/openmpi/bin/mpirun --allow-run-as-root -np 1 -mca btl '^openib' /home/lichao/AIGC/nccl-tests/build/all_reduce_perf -b 512 -e 8G -f 2 -g 1
# nThread 1 nGpus 1 minBytes 512 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   8106 on    server3 device  0 [0x02] NVIDIA GeForce RTX 4060 Ti
#
# Reducing maxBytes to 5261099008 due to memory limitation
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
         512           128     float     sum      -1     3.43    0.15    0.00      0     0.31    1.64    0.00      0
        1024           256     float     sum      -1     6.29    0.16    0.00      0     0.30    3.39    0.00      0
        2048           512     float     sum      -1     4.07    0.50    0.00      0     0.32    6.36    0.00      0
        4096          1024     float     sum      -1     4.00    1.02    0.00      0     0.33   12.59    0.00      0
        8192          2048     float     sum      -1     3.97    2.06    0.00      0     0.32   25.24    0.00      0
       16384          4096     float     sum      -1     3.97    4.13    0.00      0     0.30   54.30    0.00      0
       32768          8192     float     sum      -1     4.00    8.20    0.00      0     0.30  108.49    0.00      0
       65536         16384     float     sum      -1     3.94   16.64    0.00      0     0.30  215.22    0.00      0
      131072         32768     float     sum      -1     4.64   28.23    0.00      0     0.31  424.32    0.00      0
      262144         65536     float     sum      -1     4.12   63.65    0.00      0     0.31  848.09    0.00      0
      524288        131072     float     sum      -1     4.36  120.27    0.00      0     0.30  1719.26    0.00      0
     1048576        262144     float     sum      -1     6.44  162.86    0.00      0     0.30  3451.53    0.00      0
     2097152        524288     float     sum      -1    15.74  133.21    0.00      0     0.30  6880.42    0.00      0
     4194304       1048576     float     sum      -1    31.58  132.83    0.00      0     0.31  13688.98    0.00      0
     8388608       2097152     float     sum      -1    64.95  129.15    0.00      0     0.30  27799.86    0.00      0
    16777216       4194304     float     sum      -1    132.0  127.09    0.00      0     0.30  55849.59    0.00      0
    33554432       8388608     float     sum      -1    274.4  122.29    0.00      0     0.31  109834.47    0.00      0
    67108864      16777216     float     sum      -1    550.3  121.94    0.00      0     0.31  218845.15    0.00      0
   134217728      33554432     float     sum      -1   1101.1  121.89    0.00      0     0.31  439409.82    0.00      0
   268435456      67108864     float     sum      -1   2204.8  121.75    0.00      0     0.31  867459.87    0.00      0
   536870912     134217728     float     sum      -1   4411.4  121.70    0.00      0     0.31  1728774.47    0.00      0
  1073741824     268435456     float     sum      -1   8822.3  121.71    0.00      0     0.31  3515278.52    0.00      0
  2147483648     536870912     float     sum      -1    17639  121.75    0.00      0     0.31  6842388.56    0.00      0
  4294967296    1073741824     float     sum      -1    35284  121.73    0.00      0     0.31  13942435.63    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#

[root@server3 ~]# 

References:

https://www.open-mpi.org/video/internals/Sandia_BrianBarrett-1up.pdf

https://github.com/open-mpi/ompi/issues/11063

https://www.open-mpi.org/doc/v4.1/man1/mpirun.1.php

  • Issue 2

Running the multi-node nccl-test across three nodes reports a route-related error and hangs at the initialization stage, unable to proceed.

[root@server1 lichao]# ./run_nccl-test.sh 
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           server1
  Local device:         mlx5_1
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
[1716789553.453110] [server1:7255 :0]            sock.c:325  UCX  ERROR   connect(fd=54, dest_addr=200.200.0.2:49112) failed: No route to host

Root cause analysis / solution

Inspecting the network configuration of the three nodes showed that server3 had an extra Mellanox interface enabled and configured with an address in the 200.200.0.0 network, while the IP range used for nccl-test is 172.16.0.0. During job initialization, server1 and server2 could not find a route to the 200 network, so the communication test failed.

Adding the parameters "-x NCCL_SOCKET_IFNAME=ens11f1 -x NCCL_IB_HCA=mlx5_1:1" to pin the network interface did not solve it; the missing route to the 200 network was still reported. Finally, the ens11f0 interface was shut down and the test rerun, after which everything returned to normal.
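For reference, a multi-node invocation with the interface pinned could look like the sketch below. It only assembles and prints the command line rather than executing it; the hostnames, rank count, paths and NCCL variables are assumptions based on the three-node setup described in this section.

```shell
# Hedged sketch of a three-node nccl-test launch with the BTL exclusion
# and NCCL interface pinning discussed above; printed, not executed here.
MPIRUN=/home/lichao/opt/openmpi/bin/mpirun
CMD="$MPIRUN --allow-run-as-root -np 3 -H server1,server2,server3 \
 -mca btl '^openib' \
 -x NCCL_SOCKET_IFNAME=ens11f1 -x NCCL_IB_HCA=mlx5_1:1 \
 /home/lichao/AIGC/nccl-tests/build/all_reduce_perf -b 512 -e 8G -f 2 -g 1"
echo "$CMD"   # run it only on the actual cluster
```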

[root@server3 ~]# ibdev2netdev 
mlx5_0 port 1 ==> ens11f0 (Up)
mlx5_1 port 1 ==> ens11f1 (Up)
[root@server3 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether ac:1f:6b:dd:1b:f2 brd ff:ff:ff:ff:ff:ff
    inet 10.230.1.13/24 brd 10.230.1.255 scope global eno1
       valid_lft forever preferred_lft forever
    inet6 fe80::ae1f:6bff:fedd:1bf2/64 scope link 
       valid_lft forever preferred_lft forever
3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether ac:1f:6b:dd:1b:f3 brd ff:ff:ff:ff:ff:ff
6: ens11f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether b8:59:9f:3b:57:b6 brd ff:ff:ff:ff:ff:ff
    inet 200.200.0.2/30 brd 200.200.0.3 scope global ens11f0
       valid_lft forever preferred_lft forever
    inet6 fe80::ba59:9fff:fe3b:57b6/64 scope link 
       valid_lft forever preferred_lft forever
7: ens11f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether b8:59:9f:3b:57:b7 brd ff:ff:ff:ff:ff:ff
    inet 172.16.0.13/24 brd 172.16.0.255 scope global ens11f1
       valid_lft forever preferred_lft forever
    inet6 fe80::ba59:9fff:fe3b:57b7/64 scope link 
       valid_lft forever preferred_lft forever
[root@server3 ~]# 
  • Issue 3

The log reports "NET/Plugin: No plugin found (libnccl-net.so)".

server1:41185:41185 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens11f1
server1:41185:41185 [0] NCCL INFO Bootstrap : Using ens11f1:172.16.0.11<0>
server1:41185:41185 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
server1:41185:41185 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
server1:41185:41185 [0] NCCL INFO NET/Plugin: Using internal network plugin.
server1:41185:41185 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4

Root cause analysis / solution

This is normal behavior: NCCL added support for external network plugins, which lets third-party vendors provide their own network transport plugins for NCCL to use (for example, https://github.com/aws/aws-ofi-nccl). The message does not affect normal operation.

Right after that message there is another INFO line, "NET/Plugin: Using internal network plugin", which indicates that NCCL has fallen back to its internal network transport.
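Whether an external plugin is installed can be checked directly; the check below is a sketch that looks for libnccl-net.so in the linker cache, mirroring what NCCL's own plugin loader reports.

```shell
# Hedged check: is an external NCCL network plugin present on this host?
# If not, NCCL uses its internal transport, which is exactly what the
# "Using internal network plugin" INFO message reports.
if ldconfig -p 2>/dev/null | grep -q libnccl-net; then
    echo "external NCCL net plugin found"
else
    echo "no libnccl-net.so on this host; NCCL will use its internal transport"
fi
```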

References:

https://github.com/NVIDIA/nccl/issues/162

  • Issue 4

After the GPU driver and related acceleration libraries were installed, the NVIDIA tools and nccl-test collective-communication tests all worked, but after a server reboot, running nvidia-smi reported a driver/library version mismatch.

[root@server3 ~]# nvidia-smi 
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 550.67
[root@server3 ~]# 

Root cause analysis / solution

As the tool's error message indicates, one of the components must have had its version overwritten when other applications were installed later.

Checking component by component revealed a GPU driver package installed via yum, "nvidia-driver-latest-dkms-NVML 550.54.15-1.el7", whose version does not match the reported "NVML library version: 550.67". After removing it and reinstalling the driver from the binary package, everything returned to normal.
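Before reinstalling, the mismatch can be confirmed by comparing the loaded kernel-module version with the user-space NVML library version; the sketch below uses standard NVIDIA driver file locations, which may differ on other systems.

```shell
# Hedged sketch: compare the loaded kernel module against the NVML
# user-space library to confirm a driver/library version mismatch.
loaded=$(awk '/Kernel Module/ {print $8}' /proc/driver/nvidia/version 2>/dev/null || true)
nvml=$(ls /usr/lib64/libnvidia-ml.so.*.* 2>/dev/null | head -n 1 | sed 's/.*so\.//' || true)
echo "loaded kernel module : ${loaded:-none}"
echo "NVML library on disk : ${nvml:-none}"
```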

[root@server3 ~]# yum remove nvidia* libnvidia*
已加载插件:fastestmirror, nvidia
参数 libnvidia* 没有匹配
正在解决依赖关系
--> 正在检查事务
---> 软件包 nvidia-driver-latest-dkms.x86_64.3.550.54.15-1.el7 将被 删除
---> 软件包 nvidia-driver-latest-dkms-NVML.x86_64.3.550.54.15-1.el7 将被 删除
---> 软件包 nvidia-driver-latest-dkms-NvFBCOpenGL.x86_64.3.550.54.15-1.el7 将被 删除
---> 软件包 nvidia-driver-latest-dkms-cuda.x86_64.3.550.54.15-1.el7 将被 删除
---> 软件包 nvidia-driver-latest-dkms-cuda-libs.x86_64.3.550.54.15-1.el7 将被 删除
---> 软件包 nvidia-driver-latest-dkms-devel.x86_64.3.550.54.15-1.el7 将被 删除
---> 软件包 nvidia-driver-latest-dkms-libs.x86_64.3.550.54.15-1.el7 将被 删除
---> 软件包 nvidia-kmod-common.x86_64.3.550.54.15-1.el7 将被 删除
--> 正在处理依赖关系 nvidia-kmod-common = 3:550.54.15,它被软件包 3:kmod-nvidia-open-dkms-550.54.15-1.el7.x86_64 需要
--> 正在处理依赖关系 nvidia-kmod-common = 3:550.54.15,它被软件包 3:kmod-nvidia-open-dkms-550.54.15-1.el7.x86_64 需要
---> 软件包 nvidia-modprobe-latest-dkms.x86_64.3.550.54.15-1.el7 将被 删除
---> 软件包 nvidia-persistenced-latest-dkms.x86_64.3.550.54.15-1.el7 将被 删除
---> 软件包 nvidia-xconfig-latest-dkms.x86_64.3.550.54.15-1.el7 将被 删除
--> 正在检查事务
---> 软件包 kmod-nvidia-open-dkms.x86_64.3.550.54.15-1.el7 将被 删除
--> 解决依赖关系完成

依赖关系解决

==========================================================================================================================================================
 Package                                               架构                   版本                               源                                  大小
==========================================================================================================================================================
正在删除:
 nvidia-driver-latest-dkms                             x86_64                 3:550.54.15-1.el7                  @cuda-rhel7-x86_64                 175 M
 nvidia-driver-latest-dkms-NVML                        x86_64                 3:550.54.15-1.el7                  @cuda-rhel7-x86_64                 2.0 M
 nvidia-driver-latest-dkms-NvFBCOpenGL                 x86_64                 3:550.54.15-1.el7                  @cuda-rhel7-x86_64                 135 k
 nvidia-driver-latest-dkms-cuda                        x86_64                 3:550.54.15-1.el7                  @cuda-rhel7-x86_64                 1.3 M
 nvidia-driver-latest-dkms-cuda-libs                   x86_64                 3:550.54.15-1.el7                  @cuda-rhel7-x86_64                 222 M
 nvidia-driver-latest-dkms-devel                       x86_64                 3:550.54.15-1.el7                  @cuda-rhel7-x86_64                 0.0  
 nvidia-driver-latest-dkms-libs                        x86_64                 3:550.54.15-1.el7                  @cuda-rhel7-x86_64                 305 M
 nvidia-kmod-common                                    x86_64                 3:550.54.15-1.el7                  @cuda-rhel7-x86_64                 1.3 k
 nvidia-modprobe-latest-dkms                           x86_64                 3:550.54.15-1.el7                  @cuda-rhel7-x86_64                  70 k
 nvidia-persistenced-latest-dkms                       x86_64                 3:550.54.15-1.el7                  @cuda-rhel7-x86_64                  65 k
 nvidia-xconfig-latest-dkms                            x86_64                 3:550.54.15-1.el7                  @cuda-rhel7-x86_64                 222 k
为依赖而移除:
 kmod-nvidia-open-dkms                                 x86_64                 3:550.54.15-1.el7                  @cuda-rhel7-x86_64                  21 M

事务概要
==========================================================================================================================================================
移除  11 软件包 (+1 依赖软件包)

安装大小:727 M
是否继续?[y/N]:y

[root@server3 ~]# cd /home/lichao/AIGC/
[root@server3 AIGC]# sh NVIDIA-Linux-x86_64-550.67.run 
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.67........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
[root@server3 AIGC]# nvidia-smi
Thu May 16 09:28:11 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:02:00.0 Off |                  N/A |
|  0%   36C    P8              5W /  165W |       2MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
[root@server3 AIGC]# 

PXE and the MC-LAG Fallback Mechanism

  1. PXE Overview
  2. PXE and SONiC LAG Fallback
  3. PXE and AsterNOS MC-LAG Fallback
  4. Verifying the AsterNOS MC-LAG Fallback Feature

1. PXE Overview

PXE (Preboot Execution Environment) is a network boot protocol that allows a computer to fetch an operating-system image from a remote server over the network and install it. The basic flow of a PXE installation: when the server boots, it requests an IP address and PXE-related configuration over the network; it then downloads the boot file from the PXE server via TFTP (Trivial File Transfer Protocol), and that boot file drives the rest of the operating-system installation; finally, the OS image is transferred over the network to the server and the installation completes.
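In ISC dhcpd terms, the DHCP step of this flow needs a next-server (the TFTP server) and a boot filename in addition to the address range. The fragment below is a hedged sketch with illustrative values; the verification later in this article only exercises address assignment, so its dhcpd.conf omits these options.

```
subnet 172.16.10.0 netmask 255.255.255.0 {
  range 172.16.10.100 172.16.10.200;
  next-server 172.16.10.1;      # TFTP server holding the boot file (illustrative)
  filename "pxelinux.0";        # boot loader fetched via TFTP (illustrative)
}
```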

Figure: PXE overview

2. PXE and SONiC LAG Fallback

While a server is booting via PXE, it has no operating system yet, so it cannot establish a LAG with the switch or send LACP frames. The switch's LAG member ports therefore all stay in the Inactive state and will not forward the DHCP Discover broadcasts, so the PXE process cannot proceed.

SONiC LAG Fallback solves exactly this problem: with Fallback enabled on a LAG, if no LACP frames are received, one member port of the LAG group is set to the Active state so that the PXE boot process can complete. Once LACP frames are received, the LAG automatically exits the Fallback state.

3. PXE and AsterNOS MC-LAG Fallback

MC-LAG (Multi-Chassis Link Aggregation Group) is a mechanism for link aggregation across devices. By aggregating the links from one device to two other devices, it retains all the advantages of ordinary link aggregation while adding device-level redundancy.

Figure: PXE and AsterNOS MC-LAG Fallback

MC-LAG virtualizes two physical devices into a single logical device, and this virtual "single device" forms a one-to-one link aggregation with the upstream or downstream device connected to it. The PXE installation problem seen in the plain LAG scenario therefore also occurs in MC-LAG scenarios. AsterNOS currently supports the Fallback feature in both LAG and MC-LAG scenarios.

4. Verifying the AsterNOS MC-LAG Fallback Feature

Figure: AsterNOS MC-LAG Fallback verification

Complete the mode-4 bond configuration and DHCP server configuration on Centos76-1:

[root@server1 dhcp]# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 0c:0a:0e:54:00:01
Active Aggregator Info:
        Aggregator ID: 3
        Number of ports: 2
        Actor Key: 9
        Partner Key: 0
        Partner Mac Address: 52:54:00:12:34:56

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 0c:0a:0e:54:00:01
Slave queue ID: 0
Aggregator ID: 3
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 0c:0a:0e:54:00:01
    port key: 9
    port priority: 255
    port number: 1
    port state: 61
details partner lacp pdu:
    system priority: 65535
    system mac address: 52:54:00:12:34:56
    oper key: 0
    port priority: 255
    port number: 2
    port state: 63

Slave Interface: eth2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 0c:0a:0e:54:00:02
Slave queue ID: 0
Aggregator ID: 3
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 0c:0a:0e:54:00:01
    port key: 9
    port priority: 255
    port number: 2
    port state: 61
details partner lacp pdu:
    system priority: 65535
    system mac address: 52:54:00:12:34:56
    oper key: 0
    port priority: 255
    port number: 2
    port state: 63
[root@server1 dhcp]# 
[root@server1 dhcp]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 0c:0a:0e:54:00:00 brd ff:ff:ff:ff:ff:ff
    inet 10.240.3.121/24 brd 10.240.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::e0a:eff:fe54:0/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP group default qlen 1000
    link/ether 0c:0a:0e:54:00:01 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP group default qlen 1000
    link/ether 0c:0a:0e:54:00:01 brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0c:0a:0e:54:00:03 brd ff:ff:ff:ff:ff:ff
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 0c:0a:0e:54:00:01 brd ff:ff:ff:ff:ff:ff
    inet 172.16.10.1/24 brd 172.16.10.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::e0a:eff:fe54:1/64 scope link 
       valid_lft forever preferred_lft forever
[root@server1 dhcp]# cat dhcpd.conf 
#
# DHCP Server Configuration file.
#   see /usr/share/doc/dhcp*/dhcpd.conf.example
#   see dhcpd.conf(5) man page
#

subnet 172.16.10.0 netmask 255.255.255.0 {
  range 172.16.10.100 172.16.10.200;
  #option routers 172.16.10.254;
  #option domain-name-servers 223.5.5.5;
}
[root@server1 dhcp]# 
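On CentOS 7, the mode-4 bond shown in /proc/net/bonding/bond0 above would typically be defined by ifcfg files roughly like the following. This is a sketch reconstructed from the bonding and ip output; the exact file contents on Centos76-1 are assumptions.

```
# /etc/sysconfig/network-scripts/ifcfg-bond0 (sketch)
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=4 miimon=100 lacp_rate=slow"
IPADDR=172.16.10.1
PREFIX=24
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth1 (eth2 is analogous)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
```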

Complete the MC-LAG configuration on Leaf1 and Leaf2 and confirm that its state is normal:

leaf1# show mclag state 
The MCLAG's keepalive is: OK
MCLAG info sync is: completed
Domain id: 1
MCLAG session Channel: Primary channel
VRF Name: default
consistency Check Action: idle
Local Ip: 12.12.12.1
Peer Ip: 12.12.12.2
Dad Local Ip: 
Dad Peer Ip: 
Peer Link Interface: lag 99
Keepalive time: 1
Dad Detection Delay: 15
Dad Recovery Delay Mlag Intf: 60
Dad Recovery Delay Non Mlag Intf: 0
Dad VRF Name: default
Dad Status: disable
session Timeout : 15
Peer Link Mac: 52:54:00:12:34:56 
Admin Role: None
Role: Active
MCLAG Interface: lag 2,lag 1
Loglevel: NOTICE   
leaf1# show link-aggregation summary
Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available,
       S - selected, D - deselected, * - not synced
  No.  Team Dev         Protocol     Ports          Description
-----  ---------------  -----------  -------------  -------------
 0001  lag 1            LACP(A)(Up)  0/1      (S)   N/A
 0002  lag 2            LACP(A)(Dw)  0/2      (D)   N/A
 0099  lag 99           LACP(A)(Up)  0/9      (S)   N/A
                                     0/8      (S)
leaf1# 

leaf2# show mclag state 
The MCLAG's keepalive is: OK
MCLAG info sync is: completed
Domain id: 1
MCLAG session Channel: Primary channel
VRF Name: default
consistency Check Action: idle
Local Ip: 12.12.12.2
Peer Ip: 12.12.12.1
Dad Local Ip: 
Dad Peer Ip: 
Peer Link Interface: lag 99
Keepalive time: 1
Dad Detection Delay: 15
Dad Recovery Delay Mlag Intf: 60
Dad Recovery Delay Non Mlag Intf: 0
Dad VRF Name: default
Dad Status: disable
session Timeout : 15
Peer Link Mac: 52:54:00:12:34:57 
Admin Role: None
Role: Standby
MCLAG Interface: lag 2,lag 1
Loglevel: NOTICE
leaf2# show link-aggregation summary
Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available,
       S - selected, D - deselected, * - not synced
  No.  Team Dev         Protocol     Ports          Description
-----  ---------------  -----------  -------------  -------------
 0001  lag 1            LACP(A)(Up)  0/1      (S)   N/A
 0002  lag 2            LACP(A)(Dw)  0/2      (D)   N/A
 0099  lag 99           LACP(A)(Up)  0/9      (S)   N/A
                                     0/8      (S)
leaf2# 

On the two service ports of Centos76-2, no IP address can be obtained via DHCP:

[root@server2 ~]# ifup eth1

正在确定 eth1 的 IP 信息... 完成。
[root@server2 ~]# ifup eth2

正在确定 eth2 的 IP 信息... 完成。
[root@server2 network-scripts]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 0c:a8:80:2f:00:00 brd ff:ff:ff:ff:ff:ff
    inet 10.240.3.122/24 brd 10.240.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::ea8:80ff:fe2f:0/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 0c:a8:80:2f:00:01 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::ea8:80ff:fe2f:1/64 scope link 
       valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 0c:a8:80:2f:00:02 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::ea8:80ff:fe2f:2/64 scope link 
       valid_lft forever preferred_lft forever
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0c:a8:80:2f:00:03 brd ff:ff:ff:ff:ff:ff
[root@server2 network-scripts]# 

Enable the fallback function on the LAG2 interface group on Leaf1 and Leaf2. AsterNOS temporarily keeps one member port active, and restores dynamic aggregation mode once LACP negotiation packets are received.

leaf1# show link-aggregation summary 
Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available,
       S - selected, D - deselected, * - not synced
  No.  Team Dev         Protocol     Ports          Description
-----  ---------------  -----------  -------------  -------------
 0001  lag 1            LACP(A)(Up)  0/1      (S)   N/A
 0002  lag 2            LACP(A)(Dw)  0/2      (D)   N/A
 0099  lag 99           LACP(A)(Up)  0/8      (S)   N/A
                                     0/9      (S)
Enable fallback:
leaf1# configure terminal 
leaf1(config)# interface link-aggregation 2
leaf1(config-lagif-2)# show this
!
interface link-aggregation 2
 lacp fallback
 lacp fast-rate
 commit
 switchport access vlan 512
leaf1(config-lagif-2)# end
leaf1# show link-aggregation summary 
Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available,
       S - selected, D - deselected, * - not synced
  No.  Team Dev         Protocol     Ports          Description
-----  ---------------  -----------  -------------  -------------
 0001  lag 1            LACP(A)(Up)  0/1      (S)   N/A
 0002  lag 2            LACP(A)(Up)  0/2      (S)   N/A
 0099  lag 99           LACP(A)(Up)  0/9      (S)   N/A
                                     0/8      (S)
leaf1# 


leaf2# show link-aggregation summary 
Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available,
       S - selected, D - deselected, * - not synced
  No.  Team Dev         Protocol     Ports          Description
-----  ---------------  -----------  -------------  -------------
 0001  lag 1            LACP(A)(Up)  0/1      (S)   N/A
 0002  lag 2            LACP(A)(Dw)  0/2      (D)   N/A
 0099  lag 99           LACP(A)(Up)  0/8      (S)   N/A
                                     0/9      (S)
Enable fallback:
leaf2# configure terminal 
leaf2(config)# interface link-aggregation 2
leaf2(config-lagif-2)# show this
!
interface link-aggregation 2
 lacp fallback
 lacp fast-rate
 commit
 switchport access vlan 512
leaf2(config-lagif-2)# end
leaf2# show link-aggregation summary 
Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available,
       S - selected, D - deselected, * - not synced
  No.  Team Dev         Protocol     Ports          Description
-----  ---------------  -----------  -------------  -------------
 0001  lag 1            LACP(A)(Up)  0/1      (S)   N/A
 0002  lag 2            LACP(A)(Dw)  0/2      (D)   N/A
 0099  lag 99           LACP(A)(Up)  0/8      (S)   N/A
                                     0/9      (S)
leaf2# 

With fallback enabled, one of the two service ports on Centos76-2 can now obtain an IP address via DHCP:

[root@server2 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 0c:a8:80:2f:00:00 brd ff:ff:ff:ff:ff:ff
    inet 10.240.3.122/24 brd 10.240.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::ea8:80ff:fe2f:0/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether 0c:a8:80:2f:00:01 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether 0c:a8:80:2f:00:02 brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0c:a8:80:2f:00:03 brd ff:ff:ff:ff:ff:ff
[root@server2 ~]# ifup eth1

正在确定 eth1 的 IP 信息... 完成。
[root@server2 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 0c:a8:80:2f:00:00 brd ff:ff:ff:ff:ff:ff
    inet 10.240.3.122/24 brd 10.240.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::ea8:80ff:fe2f:0/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 0c:a8:80:2f:00:01 brd ff:ff:ff:ff:ff:ff
    inet 172.16.10.100/24 brd 172.16.10.255 scope global dynamic eth1
       valid_lft 43197sec preferred_lft 43197sec
    inet6 fe80::ea8:80ff:fe2f:1/64 scope link 
       valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether 0c:a8:80:2f:00:02 brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0c:a8:80:2f:00:03 brd ff:ff:ff:ff:ff:ff
[root@server2 ~]# ifup eth2

正在确定 eth2 的 IP 信息... 完成。
[root@server2 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 0c:a8:80:2f:00:00 brd ff:ff:ff:ff:ff:ff
    inet 10.240.3.122/24 brd 10.240.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::ea8:80ff:fe2f:0/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 0c:a8:80:2f:00:01 brd ff:ff:ff:ff:ff:ff
    inet 172.16.10.100/24 brd 172.16.10.255 scope global dynamic eth1
       valid_lft 42370sec preferred_lft 42370sec
    inet6 fe80::ea8:80ff:fe2f:1/64 scope link 
       valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 0c:a8:80:2f:00:02 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::ea8:80ff:fe2f:2/64 scope link 
       valid_lft forever preferred_lft forever
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0c:a8:80:2f:00:03 brd ff:ff:ff:ff:ff:ff
[root@server2 ~]# 

The lease information can be seen on the DHCP server:

[root@server1 dhcp]# cat /var/lib/dhcpd/dhcpd.leases
# The format of this file is documented in the dhcpd.leases(5) manual page.
# This lease file was written by isc-dhcp-4.2.5

server-duid "\000\001\000\001.,\333g\014\012\016T\000\001";

lease 172.16.10.100 {
  starts 5 2024/07/19 08:08:19;
  ends 5 2024/07/19 20:08:19;
  cltt 5 2024/07/19 08:08:19;
  binding state active;
  next binding state free;
  rewind binding state free;
  hardware ethernet 0c:a8:80:2f:00:01;
  client-hostname "server2";
}
[root@server1 dhcp]# systemctl status dhcpd
● dhcpd.service - DHCPv4 Server Daemon
   Loaded: loaded (/usr/lib/systemd/system/dhcpd.service; disabled; vendor preset: disabled)
   Active: active (running) since 五 2024-07-19 08:11:09 UTC; 1h 13min ago
     Docs: man:dhcpd(8)
           man:dhcpd.conf(5)
 Main PID: 4036 (dhcpd)
   Status: "Dispatching packets..."
   CGroup: /system.slice/dhcpd.service
           └─4036 /usr/sbin/dhcpd -f -cf /etc/dhcp/dhcpd.conf -user dhcpd -group dhcpd --no-pid

7月 19 08:11:09 server1 dhcpd[4036]: 
7月 19 08:11:09 server1 dhcpd[4036]: No subnet declaration for eth0 (10.240.3.121).
7月 19 08:11:09 server1 dhcpd[4036]: ** Ignoring requests on eth0.  If this is not what
7月 19 08:11:09 server1 dhcpd[4036]:    you want, please write a subnet declaration
7月 19 08:11:09 server1 dhcpd[4036]:    in your dhcpd.conf file for the network segment
7月 19 08:11:09 server1 dhcpd[4036]:    to which interface eth0 is attached. **
7月 19 08:11:09 server1 dhcpd[4036]: 
7月 19 08:11:09 server1 dhcpd[4036]: Sending on   Socket/fallback/fallback-net
7月 19 08:11:58 server1 dhcpd[4036]: DHCPREQUEST for 172.16.10.100 from 0c:a8:80:2f:00:01 (server2) via bond0
7月 19 08:11:58 server1 dhcpd[4036]: DHCPACK on 172.16.10.100 to 0c:a8:80:2f:00:01 (server2) via bond0
[root@server1 dhcp]# 
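The dhcpd.leases format shown above is line-oriented and easy to post-process. A small sketch that extracts the address and MAC of active leases; the sample file path and contents below are illustrative, mirroring the lease shown above:

```shell
# Write a sample lease block, then print "IP MAC" for every active lease.
cat > /tmp/dhcpd.leases.sample <<'EOF'
lease 172.16.10.100 {
  starts 5 2024/07/19 08:08:19;
  ends 5 2024/07/19 20:08:19;
  binding state active;
  hardware ethernet 0c:a8:80:2f:00:01;
  client-hostname "server2";
}
EOF

awk '
  /^lease /               {ip = $2; a = 0}
  /binding state active;/ {a = 1}
  /hardware ethernet/     {mac = $3; sub(/;/, "", mac)}
  /^}/                    {if (a) print ip, mac}
' /tmp/dhcpd.leases.sample
```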

Bond the two service ports on Centos76-2. Both member ports of LAG2 on Leaf1 and Leaf2 are observed to enter the Active state: the fallback function has taken effect and LAG2 has returned to dynamic aggregation mode.

[root@server2 network-scripts]# ifup bond0
[root@server2 network-scripts]# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 0c:a8:80:2f:00:01
Active Aggregator Info:
        Aggregator ID: 2
        Number of ports: 2
        Actor Key: 9
        Partner Key: 0
        Partner Mac Address: 52:54:00:12:34:56

Slave Interface: eth2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 0c:a8:80:2f:00:02
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 0c:a8:80:2f:00:01
    port key: 9
    port priority: 255
    port number: 2
    port state: 61
details partner lacp pdu:
    system priority: 65535
    system mac address: 52:54:00:12:34:56
    oper key: 0
    port priority: 255
    port number: 3
    port state: 63

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 0c:a8:80:2f:00:01
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: monitoring
Partner Churn State: monitoring
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 0c:a8:80:2f:00:01
    port key: 9
    port priority: 255
    port number: 3
    port state: 61
details partner lacp pdu:
    system priority: 65535
    system mac address: 52:54:00:12:34:56
    oper key: 0
    port priority: 255
    port number: 3
    port state: 63
[root@server2 network-scripts]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 0c:a8:80:2f:00:00 brd ff:ff:ff:ff:ff:ff
    inet 10.240.3.122/24 brd 10.240.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::ea8:80ff:fe2f:0/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP group default qlen 1000
    link/ether 0c:a8:80:2f:00:01 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP group default qlen 1000
    link/ether 0c:a8:80:2f:00:01 brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0c:a8:80:2f:00:03 brd ff:ff:ff:ff:ff:ff
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 0c:a8:80:2f:00:01 brd ff:ff:ff:ff:ff:ff
    inet 172.16.10.101/24 brd 172.16.10.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::ea8:80ff:fe2f:1/64 scope link 
       valid_lft forever preferred_lft forever
[root@server2 network-scripts]# ping 172.16.10.1 -c 4
PING 172.16.10.1 (172.16.10.1) 56(84) bytes of data.
64 bytes from 172.16.10.1: icmp_seq=1 ttl=64 time=5.38 ms
64 bytes from 172.16.10.1: icmp_seq=2 ttl=64 time=3.29 ms
64 bytes from 172.16.10.1: icmp_seq=3 ttl=64 time=3.97 ms
64 bytes from 172.16.10.1: icmp_seq=4 ttl=64 time=3.11 ms

--- 172.16.10.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3005ms
rtt min/avg/max/mdev = 3.115/3.943/5.389/0.895 ms
[root@server2 network-scripts]# 

leaf1# show link-aggregation summary 
Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available,
       S - selected, D - deselected, * - not synced
  No.  Team Dev         Protocol     Ports          Description
-----  ---------------  -----------  -------------  -------------
 0001  lag 1            LACP(A)(Up)  0/1      (S)   N/A
 0002  lag 2            LACP(A)(Up)  0/2      (S)   N/A
 0099  lag 99           LACP(A)(Up)  0/9      (S)   N/A
                                     0/8      (S)
leaf1# 
leaf2# show link-aggregation summary 
Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available,
       S - selected, D - deselected, * - not synced
  No.  Team Dev         Protocol     Ports          Description
-----  ---------------  -----------  -------------  -------------
 0001  lag 1            LACP(A)(Up)  0/1      (S)   N/A
 0002  lag 2            LACP(A)(Up)  0/2      (S)   N/A
 0099  lag 99           LACP(A)(Up)  0/8      (S)   N/A
                                     0/9      (S)
leaf2# 
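The `port state: 61` / `port state: 63` values in the bonding output above are the raw IEEE 802.3ad actor/partner state bytes, one flag per bit. A small sketch to decode them (the function name is illustrative):

```shell
# Decode the 802.3ad port-state byte reported in /proc/net/bonding/<bond>.
# Bit order (LSB first): LACP activity, LACP short timeout, aggregation,
# synchronization, collecting, distributing, defaulted, expired.
decode_lacp_state() {
    s=$1; i=0; out=""
    for flag in activity short-timeout aggregation synchronization \
                collecting distributing defaulted expired; do
        [ $(( (s >> i) & 1 )) -eq 1 ] && out="$out $flag"
        i=$((i + 1))
    done
    echo "$s:$out"
}

decode_lacp_state 61   # actor above: active, aggregating, in sync, collecting, distributing
decode_lacp_state 63   # partner: the same flags plus the short (fast) LACP timeout
```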

MC-LAG Validation on CX-N Series Switches

1 Solution Overview

This document describes solutions based on MC-LAG in a Layer-3 network built with CX-N series switches, validating network communication, failover, and recovery capabilities. Throughout the validation, all switch commands are entered via the KLISH command line.

2 Physical Network Topology

The overall physical topology used in this validation is shown in Figure 1:

Figure 1: Overall physical topology of the CX-N MC-LAG deployment

3 Hardware and Software Environment

3.1 Device Management Ports

The devices, hostnames, and management IP addresses involved in the validation are listed below:

Device     Hostname   Management IP
CX532-N    Spine1     10.230.1.7
CX532-N    Spine2     10.230.1.8
CX308-N    Leaf1      10.230.1.18
CX308-N    Leaf2      10.230.1.19
CX308-N    Leaf3      10.230.1.20
CX308-N    Leaf4      10.230.1.21
Server     Server1    10.230.1.11
Server     Server2    10.230.1.13

3.2 Hardware Environment

The hardware used in the validation environment is listed below:

Name            Model          Hardware spec            Qty   Notes
Spine           CX532P-N       [see product datasheet]  2
Leaf            CX308P-48Y-N   [see product datasheet]  4
Optical module  SFP+           10G                      8     To minimize the number of part types, cable and module speeds were unified: switch interconnects use 100G modules and cables, servers use 10G modules and cables
Optical module  QSFP28         100G                     24
Network cable   /              /                        8
Fiber           multi-mode     for 10G/25G              4
Fiber           multi-mode     for 100G                 12
Server          /              8 GB+ RAM recommended    2

3.3 Software Environment

The software used in the validation environment is listed below:

Name           Version
iperf3         3.1.7
CX532-N        SONiC.201911.R0312P03
CX308-N        SONiC.201911.R0312P03
Server OS      CentOS Linux 7.8.2003
Server kernel  3.10.0-1127.18.2.el7

4 Base Environment Deployment

Install the base software required for this validation on the two servers.

Note: commands prefixed with "[root@server ~]#" must be executed on both servers.

4.1 LLDP

Install the LLDP service on both servers (for X710 NICs the driver version must be later than 2.3.6), then configure the NICs for LLDP.

[root@server ~]# yum -y install epel-release
[root@server ~]# yum -y install lldpd
[root@server ~]# systemctl start lldpd
[root@server ~]# systemctl enable lldpd
[root@server ~]# lspci |grep -i ether

[root@server ~]# ethtool -i ens1f2
[root@server ~]# ethtool -i ens1f3

[root@server ~]# ethtool --set-priv-flags ens1f2 disable-fw-lldp on
[root@server ~]# ethtool --set-priv-flags ens1f3 disable-fw-lldp on

4.2 Install iPerf3

Install iPerf3 on both servers for traffic generation.

Run on both servers:
[root@server ~]# yum -y install iperf3
[root@server ~]# iperf3 -v
iperf 3.1.7
Linux compute-2 3.10.0-1160.62.1.el7.x86_64 #1 SMP Tue Apr 5 16:57:59 UTC 2022 x86_64
Optional features available: CPU affinity setting, IPv6 flow label, TCP congestion algorithm setting, sendfile / zerocopy, socket pacing

4.3 Check Link Connectivity

Before proceeding, verify the link connectivity between every switch and the servers to make sure all links are healthy. Run the following commands on all switches.

admin@sonic:~$ sudo config cli-mode cli
admin@sonic:~$ sudo sonic-cli
sonic#
Spine1# show lldp neighbor summary

Spine2# show lldp neighbor summary

Leaf1#show lldp neighbor summary

Leaf2#show lldp neighbor summary

Leaf3#show lldp neighbor summary

Leaf4#show lldp neighbor summary

5 Network Configuration

5.1 Logical Topology

Figure 2: Logical topology and interface assignments

5.2 Spine1

  • Restore the device to factory defaults

Switch to the Cisco-like command line and restore Spine1 to factory defaults.

Spine1@sonic:~$ sudo config cli-mode cli
Spine1@sonic:~$ sudo sonic-cli
sonic# delete startup-config
sonic# reload
  • Configure Spine1 interface IPs

On Spine1, configure the IP addresses of the interconnect interfaces toward the four Leaf switches.

Spine1# configure terminal
Spine1(config)# interface ethernet 0/4
Spine1(config-if-0/4)# ip address 10.0.10.2/24
Spine1(config)# interface ethernet 0/8
Spine1(config-if-0/8)# ip address 10.0.11.2/24
Spine1(config)# interface ethernet 0/12
Spine1(config-if-0/12)# ip address 10.0.12.2/24
Spine1(config)# interface ethernet 0/16
Spine1(config-if-0/16)# ip address 10.0.13.2/24
  • Configure BGP on Spine1

On Spine1, configure BGP sessions to the four Leaf switches.

Spine1# configure terminal
Spine1(config)# router bgp 65003     
Spine1(config-router)# bgp router-id 10.10.0.3
Spine1(config)# interface loopback 0
Spine1(config-loif-0)# ip address 10.10.0.3/32
Change Loopback0 ip from 10.1.0.1/32 to 10.10.0.3/32
Loopback ip will be used as bgp router-id in frr
Spine1(config)# router bgp 65003
Spine1(config-router)# no bgp ebgp-requires-policy 
Spine1(config-router)# neighbor 10.0.10.1 remote-as 65007
Spine1(config-router)# neighbor 10.0.11.1 remote-as 65007
Spine1(config-router)# neighbor 10.0.12.1 remote-as 65008
Spine1(config-router)# neighbor 10.0.13.1 remote-as 65008
Spine1(config-router)# address-family ipv4 unicast
Spine1(config-router)# address-family l2vpn evpn
Spine1(config-router-af)# neighbor 10.0.10.1 activate
Spine1(config-router-af)# neighbor 10.0.11.1 activate
Spine1(config-router-af)# neighbor 10.0.12.1 activate
Spine1(config-router-af)# neighbor 10.0.13.1 activate
Spine1(config-router-af)# advertise-all-vni

5.3 Spine2

  • Restore the device to factory defaults

Switch to the Cisco-like command line and restore Spine2 to factory defaults.

Spine2@sonic:~$ sudo config cli-mode cli
Spine2@sonic:~$ sudo sonic-cli
sonic# delete startup-config
sonic# reload
  • Configure Spine2 interface IPs

On Spine2, configure the IP addresses of the interconnect interfaces toward the four Leaf switches.

Spine2# configure terminal
Spine2(config)# interface ethernet 0/4
Spine2(config-if-0/4)# ip address 10.1.10.2/24
Spine2(config)# interface ethernet 0/8
Spine2(config-if-0/8)# ip address 10.1.11.2/24
Spine2(config)# interface ethernet 0/12
Spine2(config-if-0/12)# ip address 10.1.12.2/24
Spine2(config)# interface ethernet 0/16
Spine2(config-if-0/16)# ip address 10.1.13.2/24
  • Configure BGP on Spine2

On Spine2, configure BGP sessions to the four Leaf switches.

Spine2# configure terminal
Spine2(config)# router bgp 65004
Spine2(config-router)# bgp router-id 10.10.0.4
Spine2(config)# interface loopback 0
Spine2(config-loif-0)# ip address 10.10.0.4/32
Change Loopback0 ip from 10.1.0.1/32 to 10.10.0.4/32
Loopback ip will be used as bgp router-id in frr
Spine2(config)# router bgp 65004
Spine2(config-router)# no bgp ebgp-requires-policy 
Spine2(config-router)# neighbor 10.1.10.1 remote-as 65007
Spine2(config-router)# neighbor 10.1.11.1 remote-as 65007
Spine2(config-router)# neighbor 10.1.12.1 remote-as 65008
Spine2(config-router)# neighbor 10.1.13.1 remote-as 65008
Spine2(config-router)# address-family l2vpn evpn
Spine2(config-router-af)# neighbor 10.1.10.1 activate
Spine2(config-router-af)# neighbor 10.1.11.1 activate
Spine2(config-router-af)# neighbor 10.1.12.1 activate
Spine2(config-router-af)# neighbor 10.1.13.1 activate
Spine2(config-router-af)# advertise-all-vni

5.4 Leaf1

  • Restore the device to factory defaults

Restore Leaf1 to factory defaults.

Leaf1# delete startup-config
Leaf1# reload
  • Configure Leaf1 port speed

Set the speed of Leaf1's Ethernet 0/2 port to 10G.

Leaf1# configure terminal 
Leaf1(config)# interface ethernet 0/2  
Leaf1(config-if-0/2)# speed 10000
Leaf1(config-if-0/2)# show this
!
interface ethernet 0/2
speed 10000
  • Configure Leaf1 interface IPs

On Leaf1, configure the interconnect interface IPs toward the Leaf and Spine switches, along with the PortChannel and VLAN settings.

Leaf1# configure terminal 
Leaf1(config)# interface ethernet 0/48
Leaf1(config-if-0/48)# ip address 10.0.10.1/24
Leaf1(config)# interface ethernet 0/52
Leaf1(config-if-0/52)# ip address 10.1.10.1/24
Leaf1(config)# interface link-aggregation 1
Leaf1(config)# interface link-aggregation 3
Leaf1(config)# interface ethernet 0/2       
Leaf1(config-if-0/2)# link-aggregation-group 1
Leaf1(config-if-0/2)# interface ethernet 0/56
Leaf1(config-if-0/56)# link-aggregation-group 3
Leaf1(config-if-0/56)# interface ethernet 0/60
Leaf1(config-if-0/60)# link-aggregation-group 3
Leaf1(config)# vlan 10
Leaf1(config)# interface vlan 10
Leaf1(config-vlanif-10)# ip address 100.0.10.1/24
Leaf1(config)# interface link-aggregation 1
Leaf1(config-lagif-1)# switchport access vlan 10
Leaf1(config)# interface link-aggregation 3
Leaf1(config-lagif-3)# switchport trunk vlan 10
  • Configure MC-LAG on Leaf1

On Leaf1, configure MC-LAG on the interfaces interconnecting with Leaf2.

Leaf1# configure terminal 
Leaf1(config)# vlan 30
Leaf1(config)# interface link-aggregation 3
Leaf1(config-lagif-3)# switchport trunk vlan 30
Leaf1(config)# interface vlan 30
Leaf1(config-vlanif-30)# ip address 11.0.0.6/24
Leaf1(config)# mclag domain 1
Leaf1(mclag-domain)# peer-link link-aggregation 3 
Leaf1(mclag-domain)# local-address 11.0.0.6   
Leaf1(mclag-domain)# peer-address 11.0.0.7
Leaf1(mclag-domain)# member lag 1
Leaf1(mclag-domain)# commit
Leaf1(config)# interface vlan 10
Leaf1(config-vlanif-10)# mac-address 18:17:25:37:64:40
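After committing the MC-LAG domain, the peering can be verified with the same command used earlier in this document; the keepalive should report OK and lag 1 should appear in the MCLAG interface list:

```
Leaf1# show mclag state
```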
  • Configure BGP on Leaf1

On Leaf1, configure BGP sessions to the two Spine switches.

Leaf1# configure terminal 
Leaf1(config)# router bgp 65007
Leaf1(config-router)# bgp router-id 10.10.0.7
Leaf1(config)# interface loopback 0
Leaf1(config-loif-0)# ip address 10.10.0.7/32
Change Loopback0 ip from 10.1.0.1/32 to 10.10.0.7/32
Loopback ip will be used as bgp router-id in frr
Leaf1(config)# router bgp 65007
Leaf1(config-router)# no bgp ebgp-requires-policy
Leaf1(config-router)# neighbor 10.0.10.2 remote-as 65003
Leaf1(config-router)# neighbor 10.1.10.2 remote-as 65004
Leaf1(config-router)# address-family ipv4 unicast
Leaf1(config-router)# network 10.10.0.7/32
Leaf1(config-router)# address-family l2vpn evpn
Leaf1(config-router-af)# neighbor 10.0.10.2 activate
Leaf1(config-router-af)# neighbor 10.1.10.2 activate
Leaf1(config-router-af)# advertise-all-vni
  • Configure EVPN on Leaf1

On Leaf1, configure EVPN, create the VNET, and establish the Layer-2 and Layer-3 VXLAN mappings.

Leaf1# configure terminal
Leaf1(config)# interface vxlan 0
Leaf1(config-vxlanif-0)# source 10.10.0.7
Leaf1(config)# evpn-overlay enable
Leaf1(config)# vrf 123
Leaf1(config-vrf)# mac 60:eb:5a:00:86:20
Leaf1(config-vrf)# interface vlan 10
Leaf1(config-vlanif-10)# vrf 123
Leaf1(config)# vlan 10
Leaf1(config-vlan-10)# vni 10
Leaf1(config)# vrf 123
Leaf1(config-vrf)# vni 1000

5.5 Leaf2

  • Restore the device to factory defaults

Restore Leaf2 to factory defaults.

sonic# delete startup-config
sonic# reload
  • Configure Leaf2 port speed

Set the speed of Leaf2's Ethernet 0/2 port to 10G.

sonic# configure terminal 
sonic(config)# interface ethernet 0/2  
sonic(config-if-0/2)# speed 10000
sonic(config-if-0/2)# show this
!
interface ethernet 0/2
speed 10000
  • Configure Leaf2 interface IPs

On Leaf2, configure the interconnect interface IPs toward the Leaf and Spine switches, along with the PortChannel and VLAN settings.

Leaf2# configure terminal 
Leaf2(config)# interface ethernet 0/48
Leaf2(config-if-0/48)# ip address 10.0.11.1/24
Leaf2(config)# interface ethernet 0/52
Leaf2(config-if-0/52)# ip address 10.1.11.1/24
Leaf2(config)# interface link-aggregation 1
Leaf2(config)# interface link-aggregation 3
Leaf2(config)# interface ethernet 0/2       
Leaf2(config-if-0/2)# link-aggregation-group 1
Leaf2(config-if-0/2)# interface ethernet 0/56
Leaf2(config-if-0/56)# link-aggregation-group 3
Leaf2(config-if-0/56)# interface ethernet 0/60
Leaf2(config-if-0/60)# link-aggregation-group 3
Leaf2(config)# vlan 10
Leaf2(config)# interface vlan 10
Leaf2(config-vlanif-10)# ip address 100.0.10.1/24
Leaf2(config)# interface link-aggregation 1
Leaf2(config-lagif-1)# switchport access vlan 10
Leaf2(config)# interface link-aggregation 3
Leaf2(config-lagif-3)# switchport trunk vlan 10
  • Configure MC-LAG on Leaf2

On Leaf2, configure MC-LAG on the interfaces interconnecting with Leaf1.

Leaf2# configure terminal 
Leaf2(config)# vlan 30
Leaf2(config)# interface link-aggregation 3
Leaf2(config-lagif-3)# switchport trunk vlan 30
Leaf2(config)# interface vlan 30
Leaf2(config-vlanif-30)# ip address 11.0.0.7/24
Leaf2(config)# mclag domain 1
Leaf2(mclag-domain)# peer-link link-aggregation 3 
Leaf2(mclag-domain)# local-address 11.0.0.7   
Leaf2(mclag-domain)# peer-address 11.0.0.6
Leaf2(mclag-domain)# member lag 1
Leaf2(mclag-domain)# commit
Leaf2(config)# interface vlan 10
Leaf2(config-vlanif-10)# mac-address 18:17:25:37:64:40
  • Configure BGP on Leaf2

On Leaf2, configure BGP sessions to the two Spine switches.

Leaf2# configure terminal 
Leaf2(config)# router bgp 65007
Leaf2(config-router)# bgp router-id 10.10.0.7
Leaf2(config)# interface loopback 0
Leaf2(config-loif-0)# ip address 10.10.0.7/32
Change Loopback0 ip from 10.1.0.1/32 to 10.10.0.7/32
Loopback ip will be used as bgp router-id in frr
Leaf2(config)# router bgp 65007
Leaf2(config-router)# no bgp ebgp-requires-policy
Leaf2(config-router)# neighbor 10.0.11.2 remote-as 65003
Leaf2(config-router)# neighbor 10.1.11.2 remote-as 65004
Leaf2(config-router)# address-family ipv4 unicast
Leaf2(config-router)# network 10.10.0.7/32
Leaf2(config-router)# address-family l2vpn evpn
Leaf2(config-router-af)# neighbor 10.0.11.2 activate
Leaf2(config-router-af)# neighbor 10.1.11.2 activate
Leaf2(config-router-af)# advertise-all-vni
  • Configure EVPN on Leaf2

On Leaf2, configure EVPN, create the VNET, and establish the Layer-2 and Layer-3 VXLAN mappings.

Leaf2# configure terminal
Leaf2(config)# interface vxlan 0
Leaf2(config-vxlanif-0)# source 10.10.0.7
Leaf2(config)# evpn-overlay enable
Leaf2(config)# vrf 123
Leaf2(config-vrf)# mac 60:eb:5a:00:86:20
Leaf2(config-vrf)# interface vlan 10
Leaf2(config-vlanif-10)# vrf 123
Leaf2(config)# vlan 10
Leaf2(config-vlan-10)# vni 10
Leaf2(config)# vrf 123
Leaf2(config-vrf)# vni 1000

5.6 Leaf3

  • Restore the device to factory defaults

Restore Leaf3 to factory defaults.

sonic# delete startup-config
sonic# reload
  • Configure Leaf3 port speed

Set the speed of Leaf3's Ethernet 0/2 port to 10G.

Leaf3# configure terminal 
Leaf3(config)# interface ethernet 0/2  
Leaf3(config-if-0/2)# speed 10000
Leaf3(config-if-0/2)# show this
!
interface ethernet 0/2
speed 10000
  • Configure Leaf3 interface IPs

On Leaf3, configure the interconnect interface IPs toward the Leaf and Spine switches, along with the PortChannel and VLAN settings.

Leaf3# configure terminal 
Leaf3(config)# interface ethernet 0/48
Leaf3(config-if-0/48)# ip address 10.0.12.1/24
Leaf3(config)# interface ethernet 0/52
Leaf3(config-if-0/52)# ip address 10.1.12.1/24
Leaf3(config)# interface link-aggregation 1
Leaf3(config)# interface link-aggregation 3
Leaf3(config)# interface ethernet 0/2       
Leaf3(config-if-0/2)# link-aggregation-group 1
Leaf3(config-if-0/2)# interface ethernet 0/64
Leaf3(config-if-0/64)# link-aggregation-group 3
Leaf3(config-if-0/64)# interface ethernet 0/68
Leaf3(config-if-0/68)# link-aggregation-group 3
Leaf3(config)# vlan 20
Leaf3(config)# interface vlan 20
Leaf3(config-vlanif-20)# ip address 100.0.20.1/24
Leaf3(config)# interface link-aggregation 1
Leaf3(config-lagif-1)# switchport access vlan 20
Leaf3(config)# interface link-aggregation 3
Leaf3(config-lagif-3)# switchport trunk vlan 20
  • Configure MC-LAG on Leaf3

On Leaf3, configure MC-LAG on the interfaces interconnecting with Leaf4.

Leaf3(config)# vlan 30
Leaf3(config)# interface link-aggregation 3
Leaf3(config-lagif-3)# switchport trunk vlan 30
Leaf3(config)# interface vlan 30
Leaf3(config-vlanif-30)# ip address 11.0.0.8/24
Leaf3(config)# mclag domain 1
Leaf3(mclag-domain)# peer-link link-aggregation 3 
Leaf3(mclag-domain)# local-address 11.0.0.8 
Leaf3(mclag-domain)# peer-address 11.0.0.9
Leaf3(mclag-domain)# member lag 1
Leaf3(mclag-domain)# commit
Leaf3(config)# interface vlan 20
Leaf3(config-vlanif-20)# mac-address 18:17:25:37:64:32
  • Configure BGP on Leaf3

On Leaf3, configure BGP sessions to the two Spine switches.

Leaf3(config)# router bgp 65008
Leaf3(config-router)# bgp router-id 10.10.0.8
Leaf3(config)# interface loopback 0
Leaf3(config-loif-0)# ip address 10.10.0.8/32
Change Loopback0 ip from 10.1.0.1/32 to 10.10.0.8/32
Loopback ip will be used as bgp router-id in frr
Leaf3(config)# router bgp 65008
Leaf3(config-router)# no bgp ebgp-requires-policy
Leaf3(config-router)# neighbor 10.0.12.2 remote-as 65003
Leaf3(config-router)# neighbor 10.1.12.2 remote-as 65004
Leaf3(config-router)# address-family ipv4 unicast
Leaf3(config-router)# network 10.10.0.8/32
Leaf3(config-router)# address-family l2vpn evpn
Leaf3(config-router-af)# neighbor 10.0.12.2 activate
Leaf3(config-router-af)# neighbor 10.1.12.2 activate
Leaf3(config-router-af)# advertise-all-vni
  • Configure EVPN on Leaf3

On Leaf3, configure EVPN, create the VNET, and establish the Layer-2 and Layer-3 VXLAN mappings.

Leaf3# configure terminal
Leaf3(config)# interface vxlan 0
Leaf3(config-vxlanif-0)# source 10.10.0.8
Leaf3(config)# evpn-overlay enable
Leaf3(config)# vrf 456
Leaf3(config-vrf)# mac 60:eb:5a:00:86:22
Leaf3(config-vrf)# interface vlan 20
Leaf3(config-vlanif-20)# vrf 456
Leaf3(config)# vlan 20
Leaf3(config-vlan-20)# vni 20
Leaf3(config)# vrf 456
Leaf3(config-vrf)# vni 1000

5.7 Leaf4

  • Restore the device to factory defaults

Restore Leaf4 to factory defaults.

sonic# delete startup-config
sonic# reload
  • Configure Leaf4 port speed

Set the speed of Leaf4's Ethernet 0/2 port to 10G.

Leaf4# configure terminal 
Leaf4(config)# interface ethernet 0/2  
Leaf4(config-if-0/2)# speed 10000
Leaf4(config-if-0/2)# show this
!
interface ethernet 0/2
speed 10000
  • Configure Leaf4 interface IPs

On Leaf4, configure the interconnect interface IPs toward the Leaf and Spine switches, along with the PortChannel and VLAN settings.

Leaf4# configure terminal 
Leaf4(config)# interface ethernet 0/48
Leaf4(config-if-0/48)# ip address 10.0.13.1/24
Leaf4(config)# interface ethernet 0/52
Leaf4(config-if-0/52)# ip address 10.1.13.1/24
Leaf4(config)# interface link-aggregation 1
Leaf4(config)# interface link-aggregation 3
Leaf4(config)# interface ethernet 0/2       
Leaf4(config-if-0/2)# link-aggregation-group 1
Leaf4(config-if-0/2)# interface ethernet 0/64
Leaf4(config-if-0/64)# link-aggregation-group 3
Leaf4(config-if-0/64)# interface ethernet 0/68
Leaf4(config-if-0/68)# link-aggregation-group 3
Leaf4(config)# vlan 20
Leaf4(config)# interface vlan 20
Leaf4(config-vlanif-20)# ip address 100.0.20.1/24
Leaf4(config)# interface link-aggregation 1
Leaf4(config-lagif-1)# switchport access vlan 20
Leaf4(config)# interface link-aggregation 3
Leaf4(config-lagif-3)# switchport trunk vlan 20
  • 配置Leaf4的MC-LAG

在Leaf4交换机上配置与Leaf3交换机互联接口的MC-LAG。

Leaf4(config)# vlan 30
Leaf4(config)# interface link-aggregation 3
Leaf4(config-lagif-3)# switchport trunk vlan 30
Leaf4(config)# interface vlan 30
Leaf4(config-vlanif-30)# ip address 11.0.0.9/24
Leaf4(config)# mclag domain 1
Leaf4(mclag-domain)# peer-link link-aggregation 3 
Leaf4(mclag-domain)# local-address 11.0.0.9
Leaf4(mclag-domain)# peer-address 11.0.0.8
Leaf4(mclag-domain)# member lag 1
Leaf4(mclag-domain)# commit
Leaf4(config)# interface vlan 20
Leaf4(config-vlanif-20)# mac-address 18:17:25:37:64:32
  • 配置Leaf4的BGP

在Leaf4交换机上配置2台Spine交换机的BGP邻居。

Leaf4(config)# router bgp 65008
Leaf4(config-router)# bgp router-id 10.10.0.8
Leaf4(config)# interface loopback 0
Leaf4(config-loif-0)# ip address 10.10.0.8/32
Change Loopback0 ip from 10.1.0.1/32 to 10.10.0.8/32
Loopback ip will be used as bgp router-id in frr
Leaf4(config)# router bgp 65008
Leaf4(config-router)# no bgp ebgp-requires-policy
Leaf4(config-router)# neighbor 10.0.13.2 remote-as 65003
Leaf4(config-router)# neighbor 10.1.13.2 remote-as 65004
Leaf4(config-router)# address-family ipv4 unicast
Leaf4(config-router)# network 10.10.0.8/32
Leaf4(config-router)# address-family l2vpn evpn
Leaf4(config-router-af)# neighbor 10.0.13.2 activate
Leaf4(config-router-af)# neighbor 10.1.13.2 activate
Leaf4(config-router-af)# advertise-all-vni
  • 配置Leaf4的EVPN

在Leaf4交换机上配置EVPN、创建VNET,建立二三层VXLAN映射。

Leaf4# configure terminal
Leaf4(config)# interface vxlan 0
Leaf4(config-vxlanif-0)# source 10.10.0.8
Leaf4(config)# evpn-overlay enable
Leaf4(config)# vrf 456
Leaf4(config-vrf)# mac 60:eb:5a:00:86:22
Leaf4(config-vrf)# interface vlan 20
Leaf4(config-vlanif-20)# vrf 456
Leaf4(config)# vlan 20
Leaf4(config-vlan-20)# vni 20
Leaf4(config)# vrf 456
Leaf4(config-vrf)# vni 1000
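两台Leaf的EVPN配置完成后,可通过FRR提供的show命令检查BGP EVPN邻居与VNI映射状态。以下命令仅为示意(基于FRR/SONiC类系统的常见命令,具体命令格式与输出以设备实际CLI为准):

```
Leaf4# show bgp l2vpn evpn summary
Leaf4# show evpn vni
```

若邻居状态为Established,且能看到二层VNI 20与三层VNI 1000,说明EVPN控制面工作正常。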

CX102S-DPU开放智能网关-DPU操作系统安装指导-Debian

1 操作前声明

技术人员在进行后续操作前,建议仔细阅读产品的用户手册,充分了解CX102S-DPU设备的结构设计。
本文档将以Debian Linux系统的安装为例,介绍如何安装一个新系统到设备的计算单元(DPU)。

2 安装流程

2.1 准备安装所需文件和物料

  • 系统文件,包括:内核镜像、设备树、文件系统;
  • U盘,容量不小于4GB。

常见的Linux发行版系统(Debian、OpenWRT等)的内核镜像与设备树文件,请联系星融元的售前/售后获取。用户也可以根据需求自行编译适配内核镜像和设备树,以支持特定的系统和版本。

U盘烧录:

可以使用balenaEtcher烧录工具,或通过Linux的命令行,将准备好的Debian Linux烧录进U盘。

工具参考下载地址:

  1. https://etcher.balena.io/#download-etcher
  2. https://github.com/balena-io/etcher/releases/tag/v1.18.11

2.2 从U盘中引导临时系统

把制作好的U盘插入设备USB接口,连接串口到电脑,设备上电启动,根据系统提示按任意键中断autoboot进入uboot界面。在串口连接下输入switchUart0、switchUart1或switchUart2,可分别切换到交换单元、计算单元1或计算单元2。交换单元中断autoboot流程后会进入ac5y uboot,计算单元中断autoboot流程后会进入9130 uboot。

默认情况下会进入ac5y uboot,后续操作需要通过switchUart*命令切换到指定计算单元的9130 uboot界面中进行。

9130 uboot界面

设置环境变量,让计算单元从U盘中引导系统:

设置环境变量

2.3 安装系统到DPU硬盘

成功从U盘引导系统后,说明准备的系统能够正常适配计算单元芯片;接下来将U盘中的系统文件拷贝到计算单元的本地存储MMC。

2.4 设置uboot环境变量

重启系统,进入uboot界面,设置环境变量使其从MMC引导系统。

2.5 从DPU硬盘引导系统

从DPU硬盘启动后可以正常进入操作系统,进入系统后进行测试,确认系统工作状态正常。至此完成系统安装的所有流程,可拔掉U盘。

3 附录

3.1 环境变量解释

3.1.1 setenv bootusb

3.1.2 setenv bootarg

3.1.3 setenv bootcmd

setenv bootcmd:设置bootcmd环境变量;
  • 'run bootusb':bootcmd环境变量的值,表示系统启动时通过run命令执行之前设置好的bootusb环境变量中的命令序列。
命令作用:将系统启动时要执行的命令序列设置为bootusb。

3.1.4 saveenv
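上述变量的一个最小设置流程示例如下(bootusb中具体的加载命令因设备与系统文件路径而异,此处以省略号示意,仅展示命令形式):

```
# 在计算单元的9130 uboot命令行中执行(示意)
setenv bootusb '...'          # 定义从U盘加载内核、设备树并启动的命令序列
setenv bootcmd 'run bootusb'  # 设置启动时执行bootusb
saveenv                       # 将环境变量写入持久化存储,重启后仍生效
reset                         # 重启设备以验证引导
```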


基于Kubeadm安装Kubernetes部署方案

1 Kubernetes简介

Kubernetes是一个轻便和可扩展的开源云平台,用于管理容器化应用和服务。通过Kubernetes能够进行应用的自动化部署以及扩容和缩容等操作。在Kubernetes中,可以将组成应用的容器结合成一个逻辑单元,更易于管理和发现。

2 Kubernetes功能

  • 自动装箱

基于容器对应用运行环境的资源配置要求自动部署应用容器。

  • 自我修复

当容器失败时,会对容器进行重启;当所部署的Node节点出现问题时,会对容器进行重新部署和重新调度;当容器未通过监控检查时,会关闭此容器,直到容器正常运行时,才会对外提供服务。

  • 水平扩展

通过简单的命令,对应用容器进行规模扩大或剪裁。

  • 服务发现

用户不需要使用额外的服务发现机制就能够基于Kubernetes自身能力实现服务的发现和负载均衡。

  • 滚动更新

可以根据应用的变化,对应用容器的应用进行一次性或批量更新。

  • 版本回退

可以根据应用部署情况,对应用容器运行的应用,进行历史版本即时回退。

  • 密钥和配置管理

在不需要重新构建镜像情况下,可以部署和更新密钥以及应用配置。

  • 存储编排

自动实现存储系统挂载及应用,尤其对有状态应用实现数据持久化十分重要。存储系统可以来自本地目录、网络存储(NFS、Gluster、Ceph、Cinder等)、公共云存储等。

3 Kubernetes集群角色

  • Master Node

集群控制节点,对集群进行调度管理,接收集群外用户的操作请求,由API Server、Scheduler、Cluster State Store(ETCD数据库)和Controller Manager组成。

  • Worker Node

集群工作节点,运行用户业务应用容器,由Kubelet、Kube Proxy和Container Runtime组成。

4 Kubernetes架构

 Kubernetes架构
图4-1 Kubernetes架构

架构说明:

  • Etcd

保存整个集群的状态。

  • API Server

提供了资源操作的唯一入口,并提供认证、授权、访问控制、API注册和发现等机制。

  • Controller Manager

负责维护集群的状态,如故障检测、自动扩展、滚动更新等。

  • Scheduler

负责资源的调度,按照预定的调度策略将Pod调度到相应的机器上。

  • Kubelet

负责维护容器的生命周期、Volume(CVI) 和网络(CNI)的管理。

  • Container Runtime

负责镜像管理以及Pod和容器的真正运行(CRI)。

  • Kube-proxy

负责为Service提供Cluster内部的服务发现和负载均衡(四层)。

5 Kubernetes安装环境

本次部署三个节点,一个master节点,两个worker节点,如表5-1。

节点       系统          网卡:eth0
master1    CentOS 7.6    10.0.0.100
worker1    CentOS 7.6    10.0.0.101
worker2    CentOS 7.6    10.0.0.102

表5-1:安装环境

配置说明:

master1

  • 内存:16G
  • CPU:双核双线程,虚拟化开启
  • 硬盘:300G

worker1/2

  • 内存:16G
  • CPU:双核双线程,虚拟化开启
  • 硬盘:300G

6 基础环境部署

6.1 修改主机名(所有节点)

  • master节点
hostnamectl set-hostname master1
  • worker1节点
hostnamectl set-hostname worker1
  • worker2节点
hostnamectl set-hostname worker2

6.2 配置域名解析(所有节点)

vi /etc/hosts
10.0.0.100 master1
10.0.0.101 worker1
10.0.0.102 worker2

6.3 关闭防火墙与SELINUX(所有节点)

systemctl stop firewalld
systemctl disable firewalld
vi /etc/selinux/config
SELINUX=disabled
reboot

6.4 关闭swap分区(所有节点)

使用Kubeadm部署时必须关闭swap分区,此处采用将/etc/fstab中swap挂载项注释掉的方式。

vi /etc/fstab
#/dev/mapper/centos-swap swap         swap    defaults         0 0
reboot
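若不希望立即重启,也可以先执行swapoff -a临时关闭swap。修改/etc/fstab前,可先在副本上验证注释命令是否正确,下面是一个写入/tmp演示文件的示例(不会改动系统文件):

```shell
# 在fstab副本上演示:用sed在swap挂载行行首加上"#"将其注释掉
printf '/dev/mapper/centos-swap swap swap defaults 0 0\n' > /tmp/fstab.demo
sed -i 's|^/dev/mapper/centos-swap|#&|' /tmp/fstab.demo
cat /tmp/fstab.demo
```

确认无误后,再对/etc/fstab执行同样的sed命令并重启。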

6.5 配置时间同步(所有节点)

master节点与worker节点的时间需要同步,否则可能会出现意外问题。

  • master1节点
yum install -y chrony
vi /etc/chrony.conf 
allow 10.0.0.0/24
systemctl enable chronyd.service
systemctl start chronyd.service
  • worker1/worker2节点
yum install -y chrony
vi /etc/chrony.conf 
server 10.0.0.100 iburst
  • 设置开机自启并启动
systemctl enable chronyd.service
systemctl start chronyd.service

6.6 配置优化(所有节点)

  • 添加网桥过滤及地址转发,实现内核的过滤
vi /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
  • 加载br_netfilter模块
modprobe br_netfilter
  • 加载网桥过滤配置文件
sysctl -p /etc/sysctl.d/k8s.conf
  • 所有节点开启ipvs
  • 安装软件ipset和ipvsadm
yum install -y ipset ipvsadm
  • 添加需要加载的模块
cat > /etc/sysconfig/modules/ipvs.modules <<EOF
#!/bin/bash
modprobe -- ip_vs
modprobe -- ip_vs_rr
modprobe -- ip_vs_wrr
modprobe -- ip_vs_sh
modprobe -- nf_conntrack_ipv4
EOF
  • 添加权限并应用
chmod 777 /etc/sysconfig/modules/ipvs.modules
sh /etc/sysconfig/modules/ipvs.modules

7 安装docker(所有节点)

7.1 安装docker依赖包

yum install -y yum-utils device-mapper-persistent-data lvm2

7.2 设置阿里镜像源

yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo

7.3 安装指定版本docker

yum -y install docker-ce-20.10.12-3.el7

7.4 设置开机自启并启动

systemctl enable docker
systemctl start docker

7.5 修改配置文件

vi /etc/docker/daemon.json
{
     "exec-opts": ["native.cgroupdriver=systemd"]
}

修改配置后需重启Docker使其生效:

systemctl restart docker
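daemon.json必须是合法的JSON,语法错误会导致Docker无法启动。可以在应用配置前先校验语法,下面用/tmp下的临时文件演示(示例假设系统装有python3;实际校验时把路径换成/etc/docker/daemon.json即可):

```shell
# 将待用配置写入临时文件,并用python3自带的json.tool校验语法
cat > /tmp/daemon.json <<'EOF'
{
    "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
python3 -m json.tool /tmp/daemon.json > /dev/null && echo "daemon.json 语法正确"
```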

8 升级系统内核(所有节点)

由于CentOS 7.x 系统自带的3.10.x内核存在一些Bug,导致运行的Docker和Kubernetes不稳定,因此需要将系统内核升级至最新版本,升级步骤如下(如内核已是新版则跳过此步骤)。

  • 安装工具wget和unzip
yum install -y curl wget unzip
  • 导入ELRepo仓库的公共密钥
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
  • 安装ELRepo仓库的yum源
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
  • 安装最新版本内核
yum --enablerepo=elrepo-kernel install kernel-ml
  • 设置新的内核为grub2的默认版本
grub2-set-default 0
  • 生成grub配置文件并重启
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot

9 Kubernetes组件安装(所有节点)

Kubernetes组件包含Kubeadm、Kubelet、Kubectl,功能如下。

  • Kubeadm

初始化集群、管理集群等。

  • Kubelet

接收api-server指令,对Pod生命周期进行管理。

  • Kubectl

集群命令行管理工具。

9.1 配置Kubernetes的yum源

cat > /etc/yum.repos.d/kubernetes.repo << EOF
[kubernetes]
name=Kubernetes
baseurl=https://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=0
repo_gpgcheck=0
gpgkey=https://mirrors.aliyun.com/kubernetes/yum/doc/yum-key.gpg https://mirrors.aliyun.com/kubernetes/yum/doc/rpm-package-key.gpg
EOF

9.2 安装组件

安装组件Kubeadm,Kubelet和Kubectl并指定版本。

yum makecache fast
yum install -y kubelet-1.21.3 kubeadm-1.21.3 kubectl-1.21.3
systemctl enable kubelet

修改配置文件。

vi /etc/sysconfig/kubelet
KUBELET_EXTRA_ARGS="--cgroup-driver=systemd"

10 初始化集群(master节点)

10.1 master节点初始化

在master节点的任意路径下执行以下命令。

kubeadm init \
  --apiserver-advertise-address=10.0.0.100 \
  --image-repository registry.aliyuncs.com/google_containers \
  --kubernetes-version v1.21.3 \
  --service-cidr=10.96.0.0/12 \
  --pod-network-cidr=10.244.0.0/16 \
  --ignore-preflight-errors=all

参数说明:

  • apiserver-advertise-address

集群通告地址。

  • image-repository

由于默认拉取镜像地址k8s.gcr.io国内无法访问,这里指定阿里云镜像仓库地址。

  • kubernetes-version

Kubernetes版本,与上面安装的一致。

  • service-cidr

集群内部虚拟网络,Pod统一访问入口。

  • pod-network-cidr

Pod网络,与下面部署的CNI网络组件yaml中保持一致。

初始化正常完成后,会输出图10-1所示的内容,请将箭头所指处的join命令复制保存到本地。

初始化提示
图10-1 初始化提示

根据提示配置如下内容。

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

10.2 镜像准备

查看集群使用的容器镜像。

[root@master1 ~]# kubeadm config images list
I0608 09:54:34.987170    2894 version.go:254] remote version is much newer: v1.24.1; falling back to: stable-1.21
registry.aliyuncs.com/google_containers/kube-apiserver:v1.21.13
registry.aliyuncs.com/google_containers/kube-controller-manager:v1.21.13
registry.aliyuncs.com/google_containers/kube-scheduler:v1.21.13
registry.aliyuncs.com/google_containers/kube-proxy:v1.21.13
registry.aliyuncs.com/google_containers/pause:3.4.1
registry.aliyuncs.com/google_containers/etcd:3.4.13-0
registry.aliyuncs.com/google_containers/coredns/coredns:v1.8.0
[root@master1 ~]#

生成脚本。

kubeadm config images list >> image.list

编辑脚本。

vi image.list
#!/bin/bash
img_list='registry.aliyuncs.com/google_containers/kube-apiserver:v1.21.3
registry.aliyuncs.com/google_containers/kube-controller-manager:v1.21.3
registry.aliyuncs.com/google_containers/kube-scheduler:v1.21.3
registry.aliyuncs.com/google_containers/kube-proxy:v1.21.3
registry.aliyuncs.com/google_containers/pause:3.4.1
registry.aliyuncs.com/google_containers/etcd:3.4.13-0
registry.aliyuncs.com/google_containers/coredns/coredns:v1.8.0'
for img in ${img_list}
do
          docker pull $img
done

执行脚本。

sh image.list
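实际环境中docker pull可能因网络波动失败,可为脚本加入失败记录,便于结束后统一重试。下面的示例演示这一思路,其中用echo代替docker pull,以便在没有Docker的环境下验证循环逻辑:

```shell
#!/bin/bash
# 逐个处理镜像并记录失败项(演示:用echo代替docker pull)
img_list='registry.aliyuncs.com/google_containers/pause:3.4.1
registry.aliyuncs.com/google_containers/etcd:3.4.13-0'
failed=""
count=0
for img in ${img_list}
do
    # 实际使用时将echo一行替换为: docker pull ${img}
    if echo "pull ${img}" > /dev/null; then
        count=$((count+1))
    else
        failed="${failed} ${img}"
    fi
done
echo "成功处理 ${count} 个镜像"
if [ -n "${failed}" ]; then
    echo "失败列表:${failed}"
fi
```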

10.3 worker节点加入集群

将图10-1所指处复制的join命令,在worker1和worker2节点上执行。

kubeadm join 10.0.0.100:6443 --token wq53fj.x28gsb67wd3josc4 \
  --discovery-token-ca-cert-hash sha256:ecabaf79ece2225a8d52b0febe03001ad512ada9dd8b26926161a85a341ac6f9

master节点查看集群。

kubectl get nodes
节点加入检查
图10-2 节点加入检查

11 部署容器网络

本文档使用Calico部署容器网络,Calico是一个纯三层的数据中心网络方案,是目前Kubernetes主流的网络方案,步骤如下。

  • 下载yaml
wget https://docs.projectcalico.org/manifests/calico.yaml
  • 应用calico.yaml
kubectl apply -f calico.yaml
  • 查看部署进度,全部为running后则正常
[root@master1 ~]# kubectl get pods -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-685b65ddf9-hgjx7   1/1     Running   0          148m
calico-node-8ngrz                          1/1     Running   0          148m
calico-node-p2lc9                          1/1     Running   0          148m
calico-node-r2tkg                          1/1     Running   0          148m
coredns-59d64cd4d4-fqfq9                   1/1     Running   0          163m
coredns-59d64cd4d4-zcph8                   1/1     Running   0          163m
etcd-master1                               1/1     Running   0          163m
kube-apiserver-master1                     1/1     Running   0          163m
kube-controller-manager-master1            1/1     Running   0          163m
kube-proxy-lszzs                           1/1     Running   0          162m
kube-proxy-pbjhs                           1/1     Running   0          163m
kube-proxy-wjl7x                           1/1     Running   0          162m
kube-scheduler-master1                     1/1     Running   0          163m
[root@master1 ~]#

12 测试Kubernetes集群

在Kubernetes集群中创建一个Pod,验证是否正常运行。

kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80 --type=NodePort
kubectl get pod,svc

结果如下图12-1。

检查Pod创建
图12-1 检查Pod创建

验证Kubernetes维持期望副本数量的能力,此处将replicas修改为3,如下。

kubectl scale --replicas=3 deployment/nginx
kubectl get pod -o wide
 修改副本数量后Pod查看
图12-2 修改副本数量后Pod查看

尝试删除一个正在运行中的Pod。

kubectl delete pod nginx-6799fc88d8-dcxql
图12-3 删除一个Pod

再次查看Pod数量。

kubectl get pod -o wide
删除一个Pod后查看
图12-4 删除一个Pod后查看

可以看到之前的ip为10.244.235.136的Pod已经被删除,并产生了新的Pod,ip为10.244.235.137,说明Kubernetes功能正常。

检查各ip的连通性。

ping各Pod的ip:10.244.235.141、10.244.235.140、10.244.189.72,如下图12-5、12-6、12-7。

 ping测试1
图12-5 ping测试1
ping测试2
图12-6 ping测试2

 ping测试3
图12-7 ping测试3

curl service的ip:10.107.23.235,如图12-8。

curl svc测试
图12-8 curl svc测试

curl node:port测试,如图12-9、12-10、12-11。

curl node:port测试1
图12-9 curl node:port测试1

curl node:port测试2
图12-10 curl node:port测试2

 curl node:port测试3
图12-11 curl node:port测试3

Pod与Pod连通性测试,如图12-12。

 Pod与Pod连通性测试
图12-12 Pod与Pod连通性测试

检查DNS解析可用性,如图12-13。

 dns可用性检查
图12-13 dns可用性检查

访问地址:http://<任意node的ip>:port,此处访问:10.0.0.101:31339,结果如图12-14。

 nginx测试访问
图12-14 nginx测试访问

测试结果:无异常。

13 部署Dashboard

Dashboard是官方提供的一个UI,可用于管理Kubernetes资源。

master节点输入如下命令。

wget https://raw.githubusercontent.com/kubernetes/dashboard/v2.2.0/aio/deploy/recommended.yaml

默认Dashboard只能在集群内部访问,可将recommended.yaml文件中Service的类型修改为NodePort,方便集群外的机器访问。

kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: kubernetes-dashboard
  name: kubernetes-dashboard
  namespace: kubernetes-dashboard
spec:
  ports:
    - port: 443
      targetPort: 8443
      nodePort: 30443
  selector:
    k8s-app: kubernetes-dashboard
  type: NodePort

再次输入如下命令。

kubectl apply -f recommended.yaml
kubectl get pods -n kubernetes-dashboard

待所有Pod处于running的状态后,创建service account并绑定默认cluster-admin管理员集群角色。

kubectl create serviceaccount dashboard-admin -n kube-system
kubectl create clusterrolebinding dashboard-admin --clusterrole=cluster-admin --serviceaccount=kube-system:dashboard-admin
kubectl describe secrets -n kube-system $(kubectl -n kube-system get secret | awk '/dashboard-admin/{print $1}')

访问地址:https://<任意node的ip>:30443,将上条命令产生的token复制后填入,进行登录,如图13-1,13-2。

登录界面
图13-1 登录界面

登录成功
图13-2 登录成功

至此,一个可用的Kubernetes集群安装完毕。
