Differences

--- cluster:cluster [2023-03-20 11:33] – [SLURM Job Submission] Markus Rosenstihl
+++ cluster:cluster [2024-10-08 11:54] (current) – [AG Vogel] Markus Rosenstihl
@@ Line 1: / Line 1: @@
 ====== Tips/Tricks ======
-===== Profiling =====
+===== Profiling C Programs =====
 You can profile programs with [[https://valgrind.org/docs/manual/cl-manual.html#cl-manual.options|''valgrind'']] and analyze the output file with ''kcachegrind''.
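As a minimal sketch of that workflow, assuming a C program in a hypothetical ''my_program.c'' and that ''valgrind'' and ''kcachegrind'' are installed (all file names below are placeholders, not taken from the page):

```shell
# Placeholder names (assumptions): my_program.c, ./my_program, callgrind.out.my_program
src=my_program.c
prog=./my_program
out=callgrind.out.my_program

if command -v valgrind >/dev/null 2>&1 && [ -f "$src" ]; then
    # Build with debug symbols so the profile can be mapped back to source lines
    cc -g -O2 -o "$prog" "$src"
    # Run the program under valgrind's callgrind tool and write the profile to $out
    valgrind --tool=callgrind --callgrind-out-file="$out" "$prog"
    # Inspect the profile in the GUI viewer (requires a graphical session)
    kcachegrind "$out"
else
    echo "valgrind or $src not found; nothing to profile"
fi
```

Expect the program to run much slower under ''valgrind'' than natively, so a small but representative input is usually enough.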
@@ Line 20: / Line 20: @@
 === General points ===
 == Bigger is better ==
-Prefer a few large files over many small ones: large files generate fewer IOPS, which puts less laod on the network and the SSDs.
+Prefer a few large files over many small ones: large files generate fewer IOPS, which puts less load on the network and the SSDs.
 Do not forget that many other users may be accessing the same file system at the same time.
@@ Line 100: / Line 100: @@
 Writing in parallel to a single file is almost impossible (locking, etc.), so the only option is to write separate files.
-What you can do, however, is conver the files to HDF after the run and bundle them.
+What you can do, however, is convert the files to HDF after the run and bundle them.
 Example code for packing many ASCII files into one HDF5 file (''./create_h5.py params.0?''):
@@ Line 123: / Line 122: @@
 # do not forget to delete the files
 </code>
+With many (>1000) small files, it is advisable to use only the local (scratch) file system rather than a network drive. Unfortunately, this makes collecting the results more complex again. One suggestion from the admins is to pack the job's files with ''tar'' (''tar cfz files.tar.gz files*.dat''), then unpack and merge these archives from all nodes on a single node.
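The ''tar''-based collection could look roughly like the following sketch. Everything except the ''tar cfz'' call is an assumption for illustration: temporary directories stand in for each node's local scratch space and for the collecting node, and the node and file names are hypothetical. On a real cluster the pack step would run inside the job on each node, and the archive would then be copied to the collecting node.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical layout: one directory per "node" stands in for its local scratch space.
work=$(mktemp -d)
collect="$work/collect"
mkdir -p "$collect"

for node in node1 node2; do
    scratch="$work/$node"
    mkdir -p "$scratch"
    # Simulate the many small output files a job writes to local scratch.
    for i in 1 2 3; do
        echo "$node result $i" > "$scratch/files_${node}_$i.dat"
    done
    # Pack all small files into one archive per node, so that only a
    # single file per node has to travel over the network.
    ( cd "$scratch" && tar cfz "$work/files_$node.tar.gz" files*.dat )
done

# On the collecting node: unpack every archive into one directory.
for archive in "$work"/files_*.tar.gz; do
    tar xfz "$archive" -C "$collect"
done

# Count the merged files, then remove the temporary tree.
n_files=$(find "$collect" -name 'files*.dat' | wc -l | tr -d ' ')
echo "collected $n_files files"
rm -rf "$work"
```

The file names in this sketch include the node name, so that archives from different nodes do not overwrite each other when unpacked into the same directory.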
 ====== SLURM Job Submission  ======
 References:
@@ Line 128: / Line 130: @@
   * [[https://doku.lrz.de/display/PUBLIC/Job+farming+with+SLURM|Leibniz Rechenzentrum]]
   * [[https://ulhpc-tutorials.readthedocs.io/en/latest/sequential/manytasks-manynodes/|UL HPC]]
-  * [[https://ulhpc-tutorials.readthedocs.io/en/latest/sequential/basics/|UL HPC]] <-- VERY CONCISE RESOURCE
+  * [[https://ulhpc-tutorials.readthedocs.io/en/latest/sequential/basics/|UL HPC]] <- **VERY CONCISE RESOURCE**
@@ Line 289: / Line 291: @@
 Another possible way is the ''--multi-prog'' parameter for srun. As an example you can use this [[https://hpc.nmsu.edu/discovery/slurm/serial-parallel-jobs/#_using_multi_prog|document]].
+Jobs can also be made to wait for each other with the ''-d, --dependency=singleton'' parameter.
+This tells a job to begin execution only after all previously launched jobs with the same job name and user have [[https://ulhpc-tutorials.readthedocs.io/en/latest/sequential/basics/|terminated]]. The job name is set with the ''-J'' parameter.
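+<code bash>
+# Abstract search space parameters
+min=1
+max=2000
+chunksize=200
+# CMD_PREFIX, JOBNAME, MAXNODES and LAUNCHER must be set beforehand.
+for i in $(seq $min $chunksize $max); do
+    # Jobs that map to the same name run one after another (singleton),
+    # so at most MAXNODES jobs run concurrently.
+    ${CMD_PREFIX} sbatch \
+                  -J ${JOBNAME}_$(($i/$chunksize%${MAXNODES})) --dependency singleton \
+                  ${LAUNCHER} --joblog log/state.${i}.parallel.log  "{$i..$((i+$chunksize))}";
+done
+</code>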
+===== Tools =====
+The following tools exist; on request we will try to make them available on the cluster:
+  * [[https://researchcomputing.princeton.edu/support/knowledge-base/spark|Spark]]
+  * [[https://docs.dask.org/en/stable/deploying.html|Dask]]
+  * [[https://modin.readthedocs.io/en/stable/|Modin]]
+  * [[https://researchcomputing.princeton.edu/support/knowledge-base/apptainer|Apptainer]]
 ====== Group Specific ======
 ===== AG Drossel =====
 ===== AG Liebchen =====
 ===== AG Vogel =====
+The head node ''protein'' does not allow password logins; you need to use SSH keys.
+  - Create a key: ''ssh-keygen -t ed25519''
+  - We admins strongly recommend using a very strong passphrase. Together with ''ssh-agent'' you have to type it only once per login to your desktop!
+  - Add the public part to the ''authorized_keys'' file and set the correct permissions: ''cat ~/.ssh/id_ed25519.pub | tee -a ~/.ssh/authorized_keys && chmod 0600 ~/.ssh/authorized_keys''
+  - Now log in to protein.cluster; it may ask for the passphrase.
+===== SSH Agent =====
+The ''ssh-agent'' should be started automatically on login; Cinnamon, for example, will show a screen upon login to the desktop. If not, you need to set the **GNOME Keyring SSH Agent** to start automatically:
+{{:cluster:startup_apps_cinnamon.png?600 |}}