Recovery files overrunning disk space
Posted Mar 23, 2024, 8:53 p.m. EDT Cluster & Cloud Computing, Results & Visualization Version 6.0 2 Replies
I've been helping a colleague run some COMSOL jobs (multiphysics simulations involving a parametric sweep) on our university's computing cluster. These require several terabytes of RAM and/or temporary storage space, which we have access to on the compute nodes while a job is running; however, we do not have quite the same amount of storage available while the jobs are not actively running. This is a problem when COMSOL tries to write 1 TB+ of recovery files to a filesystem that is less than 1 TB away from its group quota.
The observed ensuing behavior is that COMSOL exits due to exceeded disk quota; then the next run seems to see that the recovery files are not intact (hardly a surprise if the last attempt failed while trying to write them) and starts over.
For further context (as much as I can give without currently having access to the recovery files) one attempt produced hundreds of savepoint subdirectories in addition to about 50k .mph.bin files; I believe the latter is what took up most of the space (perhaps relevantly, this is on a striped filesystem with a large block size).
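One way to check that belief once the recovery folder is accessible again would be a short script along these lines. This is only a sketch, not anything COMSOL provides: the function names summarize and report are made up for illustration, and it assumes a POSIX filesystem where st_blocks is reported in 512-byte units. It compares the apparent size of the files with the space actually allocated for them, grouped by file suffix, which should show whether the large block size is inflating the footprint of many small files.

```python
import os
from collections import defaultdict
from pathlib import Path


def summarize(root):
    """Return {suffix: [file_count, apparent_bytes, allocated_bytes]} for files under root."""
    totals = defaultdict(lambda: [0, 0, 0])
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable (a job may still be writing)
            # Group by the last two suffixes so "x.mph.bin" is counted as ".mph.bin".
            suffix = "".join(Path(name).suffixes[-2:]) or "(none)"
            entry = totals[suffix]
            entry[0] += 1
            entry[1] += st.st_size          # apparent size in bytes
            entry[2] += st.st_blocks * 512  # allocated size; POSIX-only assumption
    return dict(totals)


def report(totals):
    """Print one line per suffix, largest allocated size first."""
    gib = 1024 ** 3
    rows = sorted(totals.items(), key=lambda kv: kv[1][2], reverse=True)
    for suffix, (count, apparent, allocated) in rows:
        print(f"{suffix:12s} {count:8d} files  "
              f"{apparent / gib:10.2f} GiB apparent  "
              f"{allocated / gib:10.2f} GiB allocated")
```

Calling report(summarize(path)) on the recovery directory would print the breakdown. If the allocated total is much larger than the apparent total for the .mph.bin files, the per-file overhead of the filesystem, rather than the data itself, would be a large part of the problem. Whether the group quota is charged by allocated or apparent size depends on the filesystem, so that part is also an assumption to verify.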
My questions, then, are:
- What options can we invoke to reduce the amount of space COMSOL uses for recovery, short of simply not saving recovery files at all?
- On the flip side, should we expect that the final output file would actually be roughly that big?
- Can we get COMSOL to use fewer, large files instead of saving so many small files?
- Are there any files in this recovery folder that can be safely deleted after a certain point? How would these be identified?
- If we were to run a process that (say) ran behind COMSOL zipping up recovery files once they were written, would this interfere with its operations?
- Failing that, is it possible to impose a limit on the amount of disk space COMSOL will use so that it doesn't interfere with other users from our group who rely on the same scratch space?
- As for the existing recovery files, are there shell commands to convert them to a combined (hopefully smaller) .mph file or export their contents?
- What about only parsing some of them (especially if others are corrupted)?
Thanks for any help.