{ "cells": [ { "cell_type": "markdown", "id": "f3a7c04021f471a7", "metadata": {}, "source": [ "# MPIEvaluator: Run on multi-node HPC systems\n", "\n", "The `MPIEvaluator` is a new addition to the EMAworkbench that allows experiment execution on multi-node systems, including high-performance computers (HPCs). This capability is particularly useful if you want to conduct large-scale experiments that require distributed processing. Under the hood, the evaluator leverages the `MPIPoolExecutor` from [`mpi4py.futures`](https://mpi4py.readthedocs.io/en/stable/mpi4py.futures.html).\n", "\n", "#### Limiations\n", "- Currently, the MPIEvaluator is only tested on Linux and macOS, while it might work on other operating systems.\n", "- Currently, the MPIEvaluator only works with Python-based models, and won't work with file-based model types (like NetLogo or Vensim).\n", "- The MPIEvaluator is most useful for large-scale experiments, where the time spent on distributing the experiments over the cluster is negligible compared to the time spent on running the experiments. For smaller experiments, the overhead of distributing the experiments over the cluster might be significant, and it might be more efficient to run the experiments locally.\n", "\n", "The MPIEvaluator is experimental and its interface and functionality might change in future releases. We appreciate feedback on the MPIEvaluator, share any thoughts about it at https://github.com/quaquel/EMAworkbench/discussions/311.\n", "\n", "#### Contents\n", "This tutorial will first show how to set up the environment, and then how to use the MPIEvaluator to run a model on a cluster. Finally, we'll use the [DelftBlue supercomputer](https://doc.dhpc.tudelft.nl/delftblue/) as an example, to show how to run on a system which uses a SLURM scheduler." ] }, { "cell_type": "markdown", "id": "93f999bb09734a30", "metadata": {}, "source": [ "## 1. Setting up the environment\n", "To use the MPIEvaluator, MPI and mpi4py must be installed. \n", "\n", "Installing MPI on Linux typically involves the installation of a popular MPI implementation such as OpenMPI or MPICH. Below are the instructions for installing OpenMPI:\n", "\n", "### 1a. Installing OpenMPI\n", "\n", "_If you use conda, it might install MPI automatically along when installing mpi4py (see 1b)._\n", "\n", "You can install OpenMPI using you package manager. First, update your package repositories, and then install OpenMPI:\n", " \n", " For **Debian/Ubuntu**:\n", " ```bash\n", " sudo apt update\n", " sudo apt install openmpi-bin libopenmpi-dev\n", " ```\n", "\n", " For **Fedora**:\n", " ```bash\n", " sudo dnf check-update\n", " sudo dnf install openmpi openmpi-devel\n", " ```\n", "\n", " For **CentOS/RHEL**:\n", " ```bash\n", " sudo yum update\n", " sudo yum install openmpi openmpi-devel\n", " ```\n", "\n", "Many times, the necessary environment variables are automatically set up. You can check if this is the case by running the following command:\n", "\n", " ```bash\n", " mpiexec --version\n", " ```\n", "\n", "If not, add OpenMPI's `bin` directory to your `PATH`:\n", "\n", " ```bash\n", " export PATH=$PATH:/usr/lib/openmpi/bin\n", " ```\n", "\n", "### 1b. Installing mpi4py\n", "The python package mpi4py needs to installed as well. This is most easily done [from PyPI](https://pypi.org/project/mpi4py/), by running the following command:\n", "```bash\n", "pip install -U mpi4py\n", "```" ] }, { "cell_type": "markdown", "id": "203ca7e78198c2e5", "metadata": {}, "source": [ "## 2. Creating a model\n", "First, let's set up a minimal model to test with. This can be any Python-based model, we're using the [`example_python.py`](https://emaworkbench.readthedocs.io/en/latest/examples/example_python.html) model from the EMA Workbench documentation as example.\n", "\n", "We recommend crafting and testing your model in a separate Python file, and then importing it into your main script. This way, you can test your model without having to run it through the MPIEvaluator, and you can easily switch between running it locally and on a cluster.\n", "\n", "### 2a. Define the model\n", "First, we define a Python model function." ] }, { "cell_type": "code", "execution_count": null, "id": "16974ce8be11ba65", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "def some_model(x1=None, x2=None, x3=None):\n", " return {\"y\": x1 * x2 + x3}" ] }, { "cell_type": "markdown", "id": "6a13cf8baa10597c", "metadata": {}, "source": [ "Now, create the EMAworkbench model object, and specify the uncertainties and outcomes:" ] }, { "cell_type": "code", "execution_count": null, "id": "initial_id", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "from ema_workbench import Model, RealParameter, ScalarOutcome, ema_logging, perform_experiments\n", "\n", "if __name__ == \"__main__\":\n", " # We recommend setting pass_root_logger_level=True when running on a cluster, to ensure consistent log levels.\n", " ema_logging.log_to_stderr(level=ema_logging.INFO, pass_root_logger_level=True)\n", "\n", " ema_model = Model(\"simpleModel\", function=some_model) # instantiate the model\n", "\n", " # specify uncertainties\n", " ema_model.uncertainties = [\n", " RealParameter(\"x1\", 0.1, 10),\n", " RealParameter(\"x2\", -0.01, 0.01),\n", " RealParameter(\"x3\", -0.01, 0.01),\n", " ]\n", " # specify outcomes\n", " ema_model.outcomes = [ScalarOutcome(\"y\")]" ] }, { "cell_type": "markdown", "id": "33a9d08b66e7f92c", "metadata": {}, "source": [ "### 2b. Test the model\n", "Now, we can run the model locally to test it:" ] }, { "cell_type": "code", "execution_count": null, "id": "a3bb51cc5ae8ef7c", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "from ema_workbench import SequentialEvaluator\n", "\n", "with SequentialEvaluator(ema_model) as evaluator:\n", " results = perform_experiments(ema_model, 100, evaluator=evaluator)" ] }, { "cell_type": "markdown", "id": "b410deb34e956178", "metadata": {}, "source": [ "In this stage, you can test your model and make sure it works as expected. You can also check if everything is included in the results and do initial validation on the model, before scaling up to a cluster." ] }, { "cell_type": "markdown", "id": "268ae4235c66dd83", "metadata": {}, "source": [ "## 3. Run the model on a MPI cluster\n", "Now that we have a working model, we can run it on a cluster. To do this, we need to import the `MPIEvaluator` class from the `ema_workbench` package, and instantiate it with our model. Then, we can use the `perform_experiments` function as usual, and the MPIEvaluator will take care of distributing the experiments over the cluster. Finally, we can save the results to a tarball, as usual." ] }, { "cell_type": "code", "execution_count": null, "id": "fae9c15d70d18454", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "# ema_example_model.py\n", "from ema_workbench import (\n", " Model,\n", " RealParameter,\n", " ScalarOutcome,\n", " ema_logging,\n", " perform_experiments,\n", " MPIEvaluator,\n", " save_results,\n", ")\n", "\n", "\n", "def some_model(x1=None, x2=None, x3=None):\n", " return {\"y\": x1 * x2 + x3}\n", "\n", "\n", "if __name__ == \"__main__\":\n", " ema_logging.log_to_stderr(level=ema_logging.INFO, pass_root_logger_level=True)\n", "\n", " ema_model = Model(\"simpleModel\", function=some_model)\n", "\n", " ema_model.uncertainties = [\n", " RealParameter(\"x1\", 0.1, 10),\n", " RealParameter(\"x2\", -0.01, 0.01),\n", " RealParameter(\"x3\", -0.01, 0.01),\n", " ]\n", " ema_model.outcomes = [ScalarOutcome(\"y\")]\n", "\n", " # Note that we switch to the MPIEvaluator here\n", " with MPIEvaluator(ema_model) as evaluator:\n", " results = evaluator.perform_experiments(scenarios=10000)\n", "\n", " # Save the results\n", " save_results(results, \"ema_example_model.tar.gz\")" ] }, { "cell_type": "markdown", "id": "373e62cf3af804b3", "metadata": {}, "source": [ "To run this script on a cluster, we need to use the `mpiexec` command:\n", "\n", "```bash\n", "mpiexec -n 1 python ema_example_model.py\n", "```\n", "\n", "This command will execute the `ema_example_model.py` Python script using MPI. MPI-specific configurations are inferred from default settings or any configurations provided elsewhere, such as in an MPI configuration file or additional flags to `mpiexec` (not shown in the provided command). The number of workers that will be spawned by the MPIEvaluator depends on the MPI universe size. The way this size can be controlled depends on which implementation of MPI you have. See the discussion in [the docs of mpi4py for details](https://mpi4py.readthedocs.io/en/stable/mpi4py.futures.html#examples).\n", "\n", "The output of the script will be saved to the `ema_mpi_test.tar.gz` file, which can be loaded and analyzed later as usual." ] }, { "cell_type": "markdown", "id": "a0116cace0bd0a87", "metadata": {}, "source": [ "## Example: Running on the DelftBlue supercomputer (with SLURM)\n", "\n", "As an example, we'll show how to run the model on the [DelftBlue supercomputer](https://doc.dhpc.tudelft.nl/delftblue/), which uses the SLURM scheduler. The DelftBlue supercomputer is a cluster of 218 nodes, each with 2 Intel Xeon Gold E5-6248R CPUs (48 cores total), 192 GB of RAM, and 480 GB of local SSD storage. The nodes are connected with a 100 Gbit/s Infiniband network.\n", "\n", "_These steps roughly follow the [DelftBlue Crash-course for absolute beginners](https://doc.dhpc.tudelft.nl/delftblue/crash-course/). If you get stuck, you can refer to that guide for more information._\n", "\n", "### 1. Creating a SLURM script\n", "\n", "First, you need to create a SLURM script. This is a bash script that will be executed on the cluster, and it will contain all the necessary commands to run your model. You can create a new file, for example `slurm_script.sh`, and add the following lines:\n", "\n", "```bash\n", "#!/bin/bash\n", "\n", "#SBATCH --job-name=\"Python_test\"\n", "#SBATCH --time=00:02:00\n", "#SBATCH --ntasks=10\n", "#SBATCH --cpus-per-task=1\n", "#SBATCH --partition=compute\n", "#SBATCH --mem-per-cpu=4GB\n", "#SBATCH --account=research-tpm-mas\n", "\n", "module load 2023r1\n", "module load openmpi\n", "module load python\n", "module load py-numpy\n", "module load py-scipy\n", "module load py-mpi4py\n", "module load py-pip\n", "\n", "pip install --user --upgrade ema_workbench\n", "\n", "mpiexec -n 1 python3 ema_example_model.py\n", "```\n", "\n", "Modify the script to suit your needs:\n", "- Set the `--job-name` to something descriptive.\n", "- Update the maximum `--time` to the expected runtime of your model. The job will be terminated if it exceeds this time limit.\n", "- Set the `--ntasks` to the number of cores you want to use. Each node in DelftBlue currently has 48 cores, so for example `--ntasks=96` might use two nodes, but can also be distributed over more nodes. Note that using MPI involves quite some overhead. If you do not plan to use more than 48 cores, you might want to consider using the MultiprocessingEvaluator and request exclusive node access (see below). \n", "- Update the memory `--mem-per-cpu` to the amount of memory you need per core. Each node has 192 GB of memory, so you can use up to 4 GB per core.\n", "- Add `--exclusive` if you want to claim a full node for your job. In that case, specify `--nodes` next to `--ntasks`. Requesting exclusive access to one or more nodes will delay you scheduling time, because you need to wait for the full nodes to become available.\n", "- Set the `--account` to your project account. You can find this on the [Accounting and Shares](https://doc.dhpc.tudelft.nl/delftblue/Accounting-and-shares/) page of the DelftBlue docs.\n", "\n", "\n", "See [Submit Jobs](https://doc.dhpc.tudelft.nl/delftblue/Slurm-scheduler/) at the DelftBlue docs for more information on the SLURM script configuration.\n", "\n", "Then, you need to load the necessary modules. You can find the available modules on the [DHPC modules](https://doc.dhpc.tudelft.nl/delftblue/DHPC-modules/) page of the DelftBlue docs. In this example, we're loading the 2023r1 toolchain, which includes Python 3.9, and then we're loading the necessary Python packages.\n", "\n", "You might want to install additional Python packages. You can do this with `pip install -U --user `. Note that you need to use the `--user` flag, because you don't have root access on the cluster.\n", "To install the EMA Workbench, you can use `pip install -U --user ema_workbench`. If you want to install a development branch, you can use `pip install -e -U --user git+https://github.com/quaquel/EMAworkbench@#egg=ema-workbench`, where `` is the name of the branch you want to install.\n", "\n", "Finally, the script uses `mpiexec` to run Python script in a way that allows the MPIEvaluator to distribute the experiments over the cluster. The `-n 1` argument meanst that we only start a single process. This main process in turn will spawn additional worker processes. The number of worker processes that will spawn defaults to the value of `--ntasks` - 1. \n", "\n", "Note that the bash scripts (sh), including the `slurm_script.sh` we just created, need LF line endings. If you are using Windows, line endings are CRLF by default, and you need to convert them to LF. You can do this with most text editors, like Notepad++ or Atom for example." ] }, { "cell_type": "markdown", "id": "49b3ae210d69c2cb", "metadata": {}, "source": [ "### 1. Setting up the environment\n", "\n", "First, you need to log in on DelftBlue. As an employee, you can login from the command line with:\n", "\n", "```bash\n", "ssh @login.delftblue.tudelft.nl\n", "```\n", "\n", "where `` is your TU Delft netid. You can also use an SSH client such as [PuTTY](https://www.putty.org/).\n", "\n", "As a student, you need to jump though an extra hoop:\n", "\n", "```bash\n", "ssh -J @student-linux.tudelft.nl @login.delftblue.tudelft.nl\n", "```\n", "\n", "Note: Below are the commands for students. If you are an employee, you need to remove the `-J @student-linux.tudelft.nl` from all commands below.\n", "\n", "Once you're logged in, you want to jump to your scratch directory (note it's not but is not backed up).\n", "\n", "```bash\n", "cd ../../scratch/\n", "```\n", "\n", "Create a new directory for this tutorial, for example `mkdir ema_mpi_test` and then `cd ema_mpi_test`\n", "\n", "Then, you want to send your Python file and SLURM script to DelftBlue. Make sure your files are all in one folder. Open a **new** command line terminal, `cd` in the folder where your files are, and then send the files with `scp`:\n", "\n", "```bash\n", "scp -J @student-linux.tudelft.nl ema_example_model.py slurm_script.sh @login.delftblue.tudelft.nl:/scratch//ema_mpi_test\n", "```\n", "\n", "Now go back to your original cmd window, on which you logged in on DelftBlue. Before scheduling the SLURM script, we first have to make it executable:\n", "\n", "```bash\n", "chmod +x slurm_script.sh\n", "```\n", "\n", "Then we can schedule it:\n", "\n", "```bash\n", "sbatch slurm_script.sh\n", "```\n", "\n", "Now it's scheduled!\n", "\n", "You can check the status of your job with `squeue`:\n", "\n", "```bash\n", "squeue -u \n", "```\n", "\n", "You might want to inspect the log file, which is created by the SLURM script. You can do this with `cat`:\n", "\n", "```bash\n", "cat slurm-.out\n", "```\n", "\n", "where `` is the job ID of your job, which you can find with `squeue`.\n", "\n", "When the job is finished, we can download the tarball with our results. Open the command line again (can be the same one as before), and you can copy the results back to your local machine with `scp`:\n", "\n", "```bash\n", "scp -J @student-linux.tudelft.nl @login.delftblue.tudelft.nl:/scratch//ema_mpi_test/ema_mpi_test.pickle .\n", "```\n", "\n", "Finally, we can clean up the files on DelftBlue, to avoid cluttering the scratch directory:\n", "\n", "```bash\n", "cd ..\n", "rm -rf \"ema_mpi_test\"\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "7fa60315-6145-4cac-950b-e9006428a955", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" }, "nbsphinx": { "execute": "never" } }, "nbformat": 4, "nbformat_minor": 5 }