====== Tips/Tricks ======

===== Profiling C Programs ====
You can profile programs with [[https://valgrind.org/docs/manual/cl-manual.html#cl-manual.options|''valgrind'']] and analyze the output file with ''kcachegrind''.

<code bash>
valgrind --tool=callgrind ./out
</code>
===== Daten erzeugen =====

Die Performance, besonders über das Netzwerk hängt besonders ab von Dateizugriffen.


Was sind denn "viele" Dateien, "grosse" und "kleine" Dateien?

  * Viele Dateien sind mehr als 10-20k in einem Verzeichnis, und Millionen in Unterverzeichnissen (Bauchgefühl)
  * Grosse Dateien sind mehr als 1GB
  * Kleine Dateien sind <100kB

=== Generelle Punkte ===
== Gross ist besser ==
Lieber wenige grosse Dateien als viele kleine Dateien, grose Dateien erzeugen weniger IOPS, das belastet das Netzwerk und die SSDs nicht so stark.
Nicht vergessen, es könnten noch viele andere ebenfalls auf das gleiche Dateisystem zugreifen.

== Minimiere Dateioperationen ==
Teure Operationen wie ''flush'', ''open'', ''close'' nicht in **inneren** Schleifen verwenden:

<code c>

// NEGATIVES BEISPIEL
main()
{
    for(int i = 0; i <= 300; i += 20)
    {
        f=open(.., 'a');
        f.write(..);
        f.flush(..); // besonders teuer: Daten werden hier auf Platte geschrieben, vorher sind die noch im Cache. Nach dem flush ist
        f.close(..);
    }
}
</code>

<code c>

// POSITIVES BEISPIEL

{
    f=open(.., 'a');
    for(int i = 0; i <= 300; i += 20)
    {

        f.write(..);
        if (i%100 ==0) f.flush(..); // flush nur nach jedem 100ten Wert

    }
    f.close(..);
}
</code>

== Nicht ASCII benutzen ==
ASCII Text ist sehr unpraktisch und sehr verschwenderisch bezgl. des Platzverbrauchs, es lässt sich idR auf 20% komprimieren. 
Je nachdem was man hat kann man z.B. Numpy Arrays [[https://numpy.org/doc/stable/reference/generated/numpy.save.html|binär speichern]], z.B. mit Pickle, und dann direkt einlesen ohne zu parsen.

<code python>
from tempfile import TemporaryFile
outfile = TemporaryFile()
x = np.arange(10)
np.save(outfile, x) # oder savez für Komprimierung.
y=np.load(outfile)
</code>

Auch C Arrays kann man binär Speichern und [[https://stackoverflow.com/questions/18597685/how-to-write-an-array-to-file-in-c|einlesen]]:
<code c>
FILE *f = fopen("client.data", "wb");
fwrite(clientdata, sizeof(char), sizeof(clientdata), f);
fclose(f);
</code>

[[https://www.hdfgroup.org/solutions/hdf5/|HDF5]] komprimiert z.B. wenn man das einstellt, man kann auch eine fixe Genauigkeit angeben etc. 
Zudem kann man den Datasets Attribute/Metadaten zuweisen wie z.B. Parameter die zur Erzeugung verwendet wurden. HDF5 ist relativ komplex aber machbar, die Vorteile sind auf jedenfall nicht einfach von der Hand zu weisen. Die Dokumentation ist sehr ausgiebig, man kommt recht schnell zu einem Ergebnis. In der AG Vogel wechseln wir zwischen C und Python für den Zugriff. Mit ''hdf-compass'' können HDF Dateien angezeigt werden, Tabellen können geplotted werden und man kann Daten exportieren.
Für Python gibt es das [[|https://docs.h5py.org/en/stable/quick.html|h5py]] Modul.

==== Szenario 1: sequentiell ====

Programm erzeugt 1024 Zeilen x 6 Spalten Arrays, und das 1000 mal bei einem Aufruf.

Dafür kann man z.B hdf5 nutzen und jedes Array in einer Baumstruktur abspeichern:

{{:agdrossel:2023-03-15_13-18.png?nolink&400|}}

(Es gibt Tools zum anzeigen der Dateien)

[[https://docs.hdfgroup.org/hdf5/develop/_u_g.html|HDF5]] Bibliotheken machen es einfach z.B. über alle Arrays zu laufen und auszuwerten.
Es gibt natürlich noch andere Dateiformate, aber HDF5 ist mir geläufig da wir das für unsere NMR Messungen nutzen.
Es sind mit [[https://www.pytables.org/usersguide/tutorials.html|pytables]] und [[https://docs.h5py.org/en/stable/quick.html|h5py]] auch sehr stabile Python Bibliotheken vorhanden. h5py sollte in Anaconda drin sein.

==== Szenario 2: parallel ====

Programm erzeugt **ein** 1024 Zeilen x 6 Spalten Array, das Programm wird 1000 mal aufgerufen, auf unterscheidlichsten Rechnern.

Paralleles schreiben in eine einzige Datei ist fast nicht möglich (Locking, etc..). Deswegen bleibt nur der Weg über einzelne Dateien.
Was man aber machen kann ist nach dem Durchlauf die Dateien in HDF umzuwandeln und zu bündeln.

Beispielcode um viele ASCII Dateien in eine HDF5 Datei zu packen (''./create_h5.py params.0?''):

<code python create_h5.py>
#!/usr/bin/env python3
import h5py
from numpy import loadtxt
from sys import argv

with h5py.File("collect.h5","w") as h5:
    for afile in argv[1:]:
        print(afile)
        data = loadtxt(afile)
        # here is the data type an intt64, could be also Float64 (f), int32 (i8), etc. 
        dataset = h5.create_dataset(f"{afile}", data=data, compression='gzip', shuffle=True, dtype="i") 
        # example for an attribute for this dataset (the original filename)
        dataset.attrs["filename"] = afile

h5.close()
# do not forget to delete the files
</code>

Bei vielen (>1000) kleinen Dateien bietet es sich an nur das lokale (scratch) Dateisystem zu benutzen, nicht ein Netzlaufwerk. Das macht aber leider die Sammlung wieder komplexer. Ein Vorschlag der Admins wäre die Dateien des Jobs mit ''tar'' zu packen (''tar cfz files.tar.gz files*.dat''), dann diese Dateien von allen Knoten auf einem Konten entpacken und zusammenführen.

====== SLURM Job Submission  ======
References:
  * [[https://portal.supercomputing.wales/index.php/index/slurm/interactive-use-job-arrays/batch-submission-of-serial-jobs-for-parallel-execution/|Supercomputing Wales]]
  * [[https://doku.lrz.de/display/PUBLIC/Job+farming+with+SLURM|Leibniz Rechezentrum]]
  * [[https://ulhpc-tutorials.readthedocs.io/en/latest/sequential/manytasks-manynodes/|UL HPC]]
  * [[https://ulhpc-tutorials.readthedocs.io/en/latest/sequential/basics/|UL HPC]] <- **VERY CONCISE RESOURCE**


<WRAP center round important 60%>
Please limit the number of submitted jobs, do not use job arrays for massive parameter sweeps. 
This will overwhelm the SLURM scheduler.

Don't submit thousands of jobs!
</WRAP>

===== Job Submission with GNU parallel =====


Some tips fot submitting embarrassingly parallel jobs more efficiently.


<WRAP center round alert 60%>
This needs some more testing! I am not convinced that srun has to be used after the parallel command.
The examples I found on the internet all used it, but that job step will still be recorded in the slurmdbd.
</WRAP>

You need several part for this:
  * out.sh : the actual program doing the calculations, takes here 2 arguments like ''out.sh 2 222323''
  * slurm.batch : the actual job script to do the calculations
  * params.list : list of parameters
  * collect.batch : a sbatch script to collect the created data
  * submit.sh : a job submission script
  * ''--mem-per-cpu'' setting for ''srun'' is **crucial**!

<code bash slurm.batch>
#!/bin/bash
#SBATCH --mail-type=NONE
#SBATCH --nodes=1
#SBATCH --ntasks=16
 
# parallel options
# -P,j,--jobs N     Number of jobslots on each machine. Run up to N jobs in parallel.  0 means as many as possible.
# -a input-file Use input-file as input source. 
#   If multiple -a are given, each input-file will be treated as an input source, and all combinations of input sources will be generated. 
#    E.g. The file foo contains 1 2, the file bar contains a b c.  -a foo -a bar will result in the combinations (1,a) (1,b) (1,c) (2,a) (2,b) (2,c). This is useful for replacing nested for-loops.
# --colsep Column separator. The input will be treated as a table with regexp separating the columns. The n'th column can be access using {n} or {n.}. E.g. {3} is the 3rd column.
 
parallel --delay=0.1 --jobs=$SLURM_NTASKS --arg-file=params.list --colsep=" " \
	srun --mem-per-cpu=2000 --exclusive --ntasks=1 --nodes=1 bash out.sh {1} {2}
</code>

<code - params.list>
6 10064217
8 55066712
10 53991502
9 25649130
9 31032513
9 74063598
6 32381514
7 31916727
7 27385465
6 43468477
6 93701606
7 75175606
9 51106571
8 84389210
...
</code>


<code bash collect.batch>
#!/bin/bash
#SBATCH --mail-type=NONE
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4000
 
## postprocessing the data a bit and cleanup
tar cvfz mydata.tar.gz output_*.dat
rm -v output_*.dat
</code>

<code bash submit.sh>
#!/bin/bash

if [ $? -ne 0 ]; then
    echo "sbatch failed"
    exit 1
fi

# first job
SBATCH_OUTPUT=$(sbatch $1)
PARALLEL_JOB=$(echo $SBATCH_OUTPUT | grep -oE "[0-9]+")
echo $PARALLEL_JOB

# second job
sbatch --dependency=afterok:$PARALLEL_JOB collect.batch
</code>


The submit script extracts the jobif of the slurm.batch job and adds that jobid as a dependency for the collect.batch script.
Only if the first job was succesfull and all runs are finished will the secodn job collect the all the data.

The actual slurm.batch script will run 16 processes simultaniously, it will loop through the params.list files with possibly a lot of parameters. Only when all parameters are consumed will the job finish and the collect.bastch job will start.
{{drawio>agdrossel:diagram1.png}}

If you have a huge parameter list, you can create the list, split the list in parts (''man split'') and submit each part as a separate job.
If it is a text file, you can use the shell command:

<code bash>
split --verbose -n l/4 --numeric-suffixes=1  params.list params.
</code>
to split the file in 4 parts named ''params.01'' to ''params.04''.

You can then submit multiple jobs with the splitted files. The call graph looks something akin to the following.

{{drawio>agdrossel:diagram2.png}}

The corresponding submit script looks like this:

<code bash submit_parts.sh>
#!/bin/bash
if [ $# -le 2 ]; then
    echo "sbatch failed: no enough input files"
    exit 1
fi

PARALLEL_JOB_LIST=()

# first argument is the job file
# second and more are the parameter files
for PARAM_FILE in "${@:2}"; do
	SBATCH_OUTPUT=$(sbatch $1 $PARAM_FILE)
	PARALLEL_JOB_LIST+=($(echo $SBATCH_OUTPUT | grep -oE "[0-9]+"))
	echo "${SBATCH_OUTPUT}"
done

# collect the data after all jobs are done
# some bash magic to concatenate the jobids seperated with ","

IFS="," 
sbatch --dependency=afterok:"${PARALLEL_JOB_LIST[*]}" collect.batch

</code>

We need to modify the ''slurm.batch script'' to take a part of the splitted parameter list as an argument:
<code bash slurm_part.batch>
#!/bin/bash
#SBATCH --mail-type=NONE
#SBATCH --ntasks=16
 
PARAMS=$1
parallel --delay=0.1 --jobs=$SLURM_NTASKS --arg-file=$PARAMS --colsep=" " \
	srun --mem-per-cpu=2000 --exclusive --ntasks=1 --nodes=1 bash out.sh {1} {2}
</code>

You can run it like this:

<code bash>
bash submit_parts.sh slurm_part.batch params.0? # the ? matches any single character.
</code>

Another possible way is the ''--multi-prog'' parameter for srun. As an example you can use this [[https://hpc.nmsu.edu/discovery/slurm/serial-parallel-jobs/#_using_multi_prog|document]].

One can let jobs wait for each other also with the ''-d, --dependency=singleton'' parameter. 
This tells the job to begin execution after any previously launched jobs sharing the same job name and user has [[https://ulhpc-tutorials.readthedocs.io/en/latest/sequential/basics/|terminated]]. Job name is set with ''-J'' parameter.

<code bash>
# Abstract search space parameters
min=1
max=2000
chunksize=200
for i in $(seq $min $chunksize $max); do
    ${CMD_PREFIX} sbatch \
                  -J ${JOBNAME}_$(($i/$chunksize%${MAXNODES})) --dependency singleton \
                  ${LAUNCHER} --joblog log/state.${i}.parallel.log  "{$i..$((i+$chunksize))}";
done
</code>

===== Tools =====
These are tools that exist, if requested we will try and make them available on the cluster:

  * [[https://researchcomputing.princeton.edu/support/knowledge-base/spark|Spark]]
  * [[https://docs.dask.org/en/stable/deploying.html|Dask]]
  * [[https://modin.readthedocs.io/en/stable/|Modin]]
  * [[https://researchcomputing.princeton.edu/support/knowledge-base/apptainer|Apptainer]]
====== Group Specific ======
===== AG Drossel =====
===== AG Liebchen =====
===== AG Vogel =====

The head node protein does not allow password logins, you need to use ssh keys.

  - create a key: ''ssh-keygen -t ed25519''
  - We admins stronlgy recommend to use a very strong passphrase. Together with ssh-agent you have to type it only once per login to your desktop!
  - add the public part to the authorized_keys file and set correct premissions: ''cat .ssh/id_ed25519.pub | tee -a .ssh/authorized_keys && chmod 0600 .ssh/authorized_keys''
  - now login to protein.cluster, it may ask for the passphrase.
  
===== SSH Agent =====

The ''ssh-agent'' should be startet automatically on login, Cinnamon for example will show a screen upon login to the desktop. If not you need to set the **GNOME Keyring SSH Agent** to start automatically:

{{:cluster:startup_apps_cinnamon.png?600 |}}