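On a supercomputer, I first installed OpenMPI following the procedure at http://sobereva.com/451, then installed the quantum chemistry package CP2K following http://sobereva.com/586, and ran a test case: a single-point energy calculation with the PBE functional on a small diamond unit cell. I ran it on 4 cores (mpirun -np 4 cp2k.popt) and set the input file to read a previously computed wavefunction (SCF_GUESS RESTART). The following errors appeared on stderr:
- [l06c41n4:2206753:0:2206753] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4c00081b)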
- [l06c41n4:2206750:0:2206750] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4c00081b)
- [l06c41n4:2206751:0:2206751] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4c00081b)
- [l06c41n4:2206752:0:2206752] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4c00081b)
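After some experimenting I found that the error does not occur, and the calculation finishes normally, if I either skip reading the previous wavefunction (comment out SCF_GUESS RESTART) or run on a single core (mpirun -np 1 cp2k.popt). It appears only when the run is both parallel and reading the wavefunction, and the out file shows that the calculation never enters the SCF procedure. The calculations run on my university's server, where I have no root privileges. I also changed the number of cores requested from the system in the Slurm script from 4 to 16 (which greatly increases the available memory) while leaving mpirun -np 4 cp2k.popt unchanged, and the parallel run that reads the wavefunction still fails with the same error. A classmate on the same supercomputer (also without root privileges) can complete multi-core calculations that read the wavefunction without any problem.
What could be causing this, and how should I troubleshoot it?
My .bashrc: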
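As a first check, something like the sketch below could be run inside the same Slurm job (after `source ~/.bashrc` and `module load gcc/12.2.0`) to see which `mpirun` and `cp2k.popt` are actually picked up and which MPI/UCX shared libraries they resolve to. `report_binary` is a hypothetical helper name, not part of CP2K or OpenMPI, and a library mismatch is only one possible explanation for the crash, not a confirmed one.

```shell
#!/bin/bash
# Hypothetical helper: show where a command resolves in PATH and which
# MPI/UCX-related shared libraries it would load in the current environment.
report_binary() {
    local name="$1"
    local resolved
    resolved="$(command -v "$name" 2>/dev/null)"
    if [ -z "$resolved" ]; then
        echo "$name: not found in PATH"
        return 1
    fi
    echo "$name -> $resolved"
    # ldd resolves libraries using the current LD_LIBRARY_PATH, so the result
    # reflects whatever modules and exports are active at this point.
    ldd "$resolved" 2>/dev/null | grep -Ei 'mpi|ucx|ucp|uct' \
        || echo "  (no MPI/UCX libraries listed)"
    return 0
}

# "|| true" keeps the script going even if one of the binaries is missing.
report_binary mpirun || true
report_binary cp2k.popt || true
```

Comparing this output with the same output from my classmate's account might show whether the two of us are really running the same OpenMPI build and the same libraries.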
- # .bashrc
- # Source global definitions
- if [ -f /etc/bashrc ]; then
- . /etc/bashrc
- fi
- # User specific environment
- if ! [[ "$PATH" =~ "$HOME/.local/bin:$HOME/bin:" ]]
- then
- PATH="$HOME/.local/bin:$HOME/bin:$PATH"
- fi
- export PATH
- # Uncomment the following line if you don't like systemctl's auto-paging feature:
- # export SYSTEMD_PAGER=
- # User specific aliases and functions
- module use ~/.modulefiles
- module load vasp/
- module load anaconda3/
- export PATH="/lustre/home/2001110394/FreqScript:$PATH"
- export PATH="/lustre/home/2001110394/vaspkit.1.5.1/bin:$PATH"
- #export PATH="/lustre/home/2001110394/cmake-4.0.0-rc1/build/bin:$PATH"
- # mpi
- export PATH=/lustre/home/2001110394/openmpi416/bin:$PATH
- export LD_LIBRARY_PATH=/lustre/home/2001110394/openmpi416/lib:$LD_LIBRARY_PATH
- # cp2k
- source /lustre/home/2001110394/cp2k-2025.1/tools/toolchain/install/setup
- export PATH=$PATH:/lustre/home/2001110394/cp2k-2025.1/exe/local
My Slurm script:
- #!/bin/bash
- #SBATCH -A hpc2006189113
- #SBATCH --get-user-env
- #SBATCH --partition=C064M0256G
- #SBATCH --qos=low
- #SBATCH -J test
- #SBATCH -N 1
- #SBATCH -n 4
- #SBATCH -o jobid%j-%N.out
- #SBATCH -e jobid%j-%N.error
- #SBATCH --time=0:02:00
- ulimit -s unlimited
- ulimit -l unlimited
- source ~/.bashrc
- module load gcc/12.2.0
- export OMPI_MCA_btl_openib_if_exclude=mlx5_0:1
- mpirun -np 4 cp2k.popt test.inp |tee test.out
Contents of test.inp: