Problem Description
Distributed cluster jobs on Linux do not start, and I receive error messages from MPI.
Solution
The underlying reason for COMSOL not working on a Linux cluster might be that the network interface and fabrics are not detected correctly. On Linux, COMSOL 6.1 ships with Intel MPI 2021.6 and COMSOL 6.0 with Intel MPI 2021.2. You can check whether there is an incompatibility with Intel MPI using the following steps:
If Intel MPI does not seem to work on your cluster, first make sure that your submission script is configured correctly. Then run the MPI test by calling
comsol hydra mpitest -nn 2 -f hostfile
or, e.g. with Slurm,
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
...
comsol hydra mpitest -nn 2 -nnhost 1
to confirm that MPI is actually the issue. Add the switch '-mpidebug 10' to get additional debug output.
To resolve the problem, try suggestions A and B. Even if A works for you, you should also try B, as it offers better performance.
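The hostfile variant of the test can be wrapped in two small helpers, shown here as a minimal sketch. It assumes the hostfile holds one host name per line, which is the usual layout for Hydra-style launchers but should be verified for your installation. The function names, the log file name mpitest.log, and the host names node01 and node02 are placeholders, not part of COMSOL.

```shell
#!/bin/sh
# Write a hostfile with one host name per line.
# Assumption: one name per line is the layout your launcher expects.
write_hostfile() {
    hostfile=$1
    shift
    printf '%s\n' "$@" > "$hostfile"
}

# Run the COMSOL MPI test with extra debug output and keep a copy in a log.
# Only switches mentioned in this article are used; mpitest.log is a placeholder.
run_mpitest() {
    nodes=$1
    hostfile=$2
    comsol hydra mpitest -nn "$nodes" -f "$hostfile" -mpidebug 10 2>&1 | tee mpitest.log
}

# Example (node01 and node02 are hypothetical host names):
#   write_hostfile hostfile node01 node02
#   run_mpitest 2 hostfile
```

Keeping the debug output in a log file makes it easier to compare runs before and after trying the suggestions below.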
A. Fall Back to TCP
Export the environment variable FI_PROVIDER and set it to 'sockets'. With Slurm, this can be done by means of
#SBATCH --export=FI_PROVIDER=sockets
Otherwise, you can use
export FI_PROVIDER=sockets
or
setenv FI_PROVIDER sockets
and make sure that this environment variable is handed over to your cluster job.
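One way to check that the variable really reaches the job is to test it inside the job script before COMSOL is launched. The following is a minimal sketch; check_fi_provider is a hypothetical helper name, not a COMSOL or Slurm command.

```shell
#!/bin/sh
# Succeed only if FI_PROVIDER has the expected value in the current environment.
# The expected value defaults to "sockets" and can be overridden by argument 1.
check_fi_provider() {
    expected=${1:-sockets}
    if [ "${FI_PROVIDER:-}" = "$expected" ]; then
        echo "FI_PROVIDER=$FI_PROVIDER"
    else
        echo "FI_PROVIDER is '${FI_PROVIDER:-unset}', expected '$expected'" >&2
        return 1
    fi
}
```

In a job script you could call `check_fi_provider || exit 1` ahead of the comsol command, so that a job that lost the variable fails early with a clear message.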
If you are running cluster jobs from the COMSOL Desktop, add --export=FI_PROVIDER=sockets to the Additional scheduler arguments field. If you are using Slurm, also add the FLROOT environment variable, using a comma as the separator. The value of FLROOT should be the path of the COMSOL installation directory.
--export=FI_PROVIDER=sockets,FLROOT=<COMSOL installation directory>
The downside of this approach is that communication falls back to TCP, which might be slow if you have a faster fabric.
B. Install a Later Intel MPI
Download the latest Intel MPI from Intel and install it. You can install it to your home directory if you don't have admin rights on the cluster.
Launch COMSOL with the additional switch
-mpiroot <Intel MPI installation directory>/intel/oneapi/mpi/latest
On Slurm, you can call for example
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
...
comsol hydra mpitest -nn 2 -nnhost 1 -mpiroot <Intel MPI installation directory>/intel/oneapi/mpi/latest
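Before passing a directory to -mpiroot, it can help to confirm that it looks like an Intel MPI root. The sketch below assumes that the Hydra launcher is found at bin/mpiexec.hydra below that directory, as in typical Intel MPI installations; check_mpiroot is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if the given directory looks like an Intel MPI root.
# Assumption: the launcher is located at bin/mpiexec.hydra below the root.
check_mpiroot() {
    root=$1
    if [ -x "$root/bin/mpiexec.hydra" ]; then
        echo "ok: $root"
    else
        echo "no executable bin/mpiexec.hydra under $root" >&2
        return 1
    fi
}
```

If the check fails, the path most likely points one level too high or too low in the installation tree.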
Remarks:
- You can also point to other MPICH2-based MPI installations (but not to OpenMPI, for example).
- In COMSOL 5.6 you can point to the new Intel MPI via -mpiroot as well.