Metadata-Version: 2.4
Name: edps
Version: 1.7.0
Summary: EDPS - ESO Data Processing System
Home-page: https://gitlab.eso.org/dfs/edps
Author: Stanislaw Podgorski, Stefano Zampieri
Author-email: szampier@eso.org
Project-URL: Bug Tracker, https://jira.eso.org/
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: astropy
Requires-Dist: fastapi
Requires-Dist: networkx
Requires-Dist: pyyaml
Requires-Dist: requests
Requires-Dist: uvicorn
Requires-Dist: tinydb
Requires-Dist: frozendict
Requires-Dist: jinja2
Requires-Dist: pydantic>=2.0
Requires-Dist: psutil
Dynamic: license-file

# EDPS - ESO Data Processing System

[[_TOC_]]

## PIP installation (for system integrators)

### Build EDPS

1. create new virtual environment: `python -m venv build-env`
1. activate the environment: `. build-env/bin/activate`
1. install the build tools: `pip install build`
1. build edps: `python -m build` (creates build artifacts in the `dist` directory)

### Install EDPS

1. create new virtual environment: `python -m venv install-env`
1. activate the environment: `. install-env/bin/activate`
1. install edps and its dependencies: `pip install dist/edps-<version>.tar.gz` (the tarball created by the build step, e.g. `dist/edps-1.7.0.tar.gz`)

## Source code installation (for developers)

### Install EDPS Software

```shell
cd installation-dir
git clone https://gitlab.eso.org/dfs/edps.git
```

where `installation-dir` is a user-defined path specifying the location of the EDPS installation directory.

To update the EDPS software, run the following command:

```shell
cd installation-dir/edps
git pull
```

### Install Python Environment

```shell
python -m venv path-to-my-env
. path-to-my-env/bin/activate
cd edps
pip install -r pip-requirements-noversion.txt
```

where `path-to-my-env` is a user-defined path specifying the location of the python environment.

To run the tests, see section [Running Unit/Integration tests](#running-unitintegration-tests).

NOTE: EDPS requires Python 3.8 or above.

### Install Instrument Pipelines

To reduce data with EDPS you also need to install the relevant instrument pipeline(s), following the instructions in:

https://www.eso.org/sci/software/pipelines/

## Configuring EDPS

To configure EDPS, edit the file `installation-dir/edps/src/application.properties`.
Below is the default configuration:

```
[server]
port=5000
host=localhost

[application]
data_dir=.
executors_pool=2
logging=DEBUG
workflow_dir=/path/to/edps/workflows

[executor]
processes=2
esorex_path=esorex
base_dir=.
dummy=true
ordering=dfs
```

The important configuration parameters are:

- `port` TCP port number used by the EDPS server. Clients should use this port number to communicate with the server.
- `esorex_path` the esorex command to launch; specify the full path if the command is not in your system path.
- `base_dir` base directory for the pipeline working directories, where the pipeline logs and products are created. EDPS
  creates a unique working directory under `base_dir` for each pipeline execution.
- `dummy` when `true`, the workflow is executed in simulated mode; when `false`, the real pipeline is executed.
- `ordering` defines the order of data reduction. Four options are currently supported: `dfs`, `bfs`, `type` and
  `dynamic`; see details below.

NOTE: the EDPS server must be restarted after changing the configuration.
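
The configuration file uses standard INI syntax. As an illustrative sketch (not how EDPS itself loads the file), it can be parsed with Python's standard `configparser`:

```python
import configparser

# Parse an EDPS-style application.properties file (INI syntax).
config = configparser.ConfigParser()
config.read_string("""
[server]
port=5000
host=localhost

[executor]
dummy=true
ordering=dfs
""")

port = config.getint("server", "port")          # typed access: int
dummy = config.getboolean("executor", "dummy")  # typed access: bool
print(port, config["server"]["host"], dummy, config["executor"]["ordering"])
```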

### Execution ordering

All orderings follow a topological order, so parent tasks are always placed before their children. For subtrees without
a common root, the orderings can differ.
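
The topological guarantee can be stated as a small predicate (an illustrative sketch, not EDPS code): an ordering is topological if every task appears after all of its parents.

```python
def is_topological(order, parents):
    """Check that every node in `order` appears after all of its parents.

    order: list of node names; parents: dict mapping node -> parent nodes.
    """
    position = {node: i for i, node in enumerate(order)}
    return all(position[p] < position[n] for n in order for p in parents.get(n, ()))

# The example tree used in the sections below: A is the parent of B and C, etc.
parents = {"B": ["A"], "C": ["A"], "D": ["B"], "E": ["B"], "F": ["C"], "G": ["C"]}
print(is_topological(["A", "B", "D", "E", "C", "F", "G"], parents))  # True
print(is_topological(["D", "A", "B", "E", "C", "F", "G"], parents))  # False: D before B
```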

#### ordering=dfs

Orders nodes using a DFS-like traversal: it goes as deep as possible (to a leaf) before moving to a parallel branch.
For example, given:

```
          A
        /   \
       B     C
      / \   / \
     D   E F   G
```

The order will favour `A->B->D->E` before proceeding into the C branch.

The idea is to start at a leaf node and work our way up recursively to find all of its parents, scheduling them before the leaf.
Once this is done, we move to another leaf and repeat the process. Leaves are processed in a particular order:
all leaves from the same sub-graph are processed before moving to another one. For that, the incoming graph is split into
disjoint sub-graphs to order the leaves.
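
For the tree above, the favoured order can be sketched as a plain pre-order DFS (an illustration only; the real scheduler additionally splits the graph into disjoint sub-graphs and orders leaves as described above):

```python
def dfs_order(node, children):
    """Pre-order DFS: emit the node, then fully explore each child branch in turn."""
    order = [node]
    for child in children.get(node, []):
        order.extend(dfs_order(child, children))
    return order

# The example tree: A at the root, B and C below it, leaves D, E, F, G.
children = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
print(dfs_order("A", children))  # ['A', 'B', 'D', 'E', 'C', 'F', 'G']
```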

#### ordering=bfs

Orders nodes using a BFS-like traversal: it spreads evenly between all children branches. For example, given:

```
          A
        /   \
       B     C
      / \   / \
     D   E F   G
```

The order will favour `A->BC->DEFG`.

The idea is to start from root nodes (those with no parents), then proceed to nodes which depend only on those roots, then to
nodes which depend on the previous ones, and so on. Nodes on a given level are ordered by the size of the sub-tree rooted at
each node, placing "heavier" nodes first. The weight of a node is `num_inputs/avg_num_inputs` for
the given task type plus the weights of all its children.
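
The weight computation can be sketched as follows (hypothetical names; the `num_inputs`/`avg_num_inputs` statistics would come from the actual task types):

```python
def weight(node, children, num_inputs, avg_num_inputs):
    """Weight = num_inputs/avg_num_inputs for the node's task type,
    plus the weights of all of its children."""
    own = num_inputs[node] / avg_num_inputs[node]
    return own + sum(weight(c, children, num_inputs, avg_num_inputs)
                     for c in children.get(node, []))

# Hypothetical example: B has many inputs and a larger sub-tree than C.
children = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
num_inputs = {"A": 2, "B": 4, "C": 1, "D": 1, "E": 1, "F": 1}
avg_num_inputs = {n: 2 for n in num_inputs}

# Within a BFS level, heavier nodes are scheduled first.
level = ["C", "B"]
ordered = sorted(level, reverse=True,
                 key=lambda n: weight(n, children, num_inputs, avg_num_inputs))
print(ordered)  # ['B', 'C']: weight(B) = 2 + 0.5 + 0.5 = 3.0, weight(C) = 1.0
```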

#### ordering=type

Orders nodes using a BFS-like traversal, but grouping nodes of the same "type". E.g. if there are FLATs and DARKs on the same
level, a whole group (such as all DARKs) is processed together before moving to the other group. For example, given:

```
              A
            /   \
          /       \
         B         C
      /    \     /    \
     DARK FLAT DARK FLAT
```

The order will favour `A->BC->DARKs->FLATs` or `A->BC->FLATs->DARKs`.
The idea is the same as for BFS: start from root nodes (those with no parents), then proceed to nodes which depend only on
those roots, then to nodes which depend on the previous ones, and so on. Within a given level, however, nodes are ordered by
type (as opposed to ordering by weight as in BFS).

#### ordering=dynamic

Orders nodes in a purely dynamic fashion, at runtime, similarly to BFS, but using runtime information rather than the
static graph structure. A background thread periodically checks which nodes are ready for execution (i.e. all of their
parents have completed) and schedules them. If more than one node is ready, an ordering mechanism similar to BFS is applied:
the node weight is calculated as `num_inputs/avg_num_inputs` for the given task type plus the weights of all its children.

Currently, the check for new tasks ready for submission runs every 5 seconds.
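
The readiness check can be sketched as (illustrative only, not EDPS code):

```python
def ready_nodes(parents, completed, running):
    """A node is ready when all of its parents have completed and it is
    neither completed nor already running itself."""
    done = set(completed)
    busy = done | set(running)
    return [n for n in parents
            if n not in busy and all(p in done for p in parents[n])]

# Hypothetical diamond graph: D depends on B and C, which both depend on A.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
print(ready_nodes(parents, completed=["A"], running=[]))            # ['B', 'C']
print(ready_nodes(parents, completed=["A", "B"], running=["C"]))    # []
print(ready_nodes(parents, completed=["A", "B", "C"], running=[]))  # ['D']
```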

## Running EDPS

To execute the **EDPS server**, activate the python environment and type the following commands in a terminal:

```shell
cd installation-dir/edps/src
python edps/scripts/server.py
```

To terminate the EDPS server, press `Ctrl+C` in the terminal where EDPS is running.

## Reducing data with EDPS

1. **Make sure the EDPS server is running on your system**
2. **Invoke the EDPS client tool to send processing requests to EDPS**

### EDPS Client Tool Usage
```shell
edps-client -h

usage: edps-client [-h] [-H HOST] [-P PORT] [-i [INPUTS [INPUTS ...]]] [-t [TARGETS [TARGETS ...]]]
                   [-m [META_TARGETS [META_TARGETS ...]]] -w WORKFLOW [-e] [-f] [-g] [-a] [-r] [-x] [-d]
                   [-wp PARAMETER VALUE] [-rp TASK PARAMETER VALUE] [-wps WORKFLOW_PARAMETER_SET]
                   [-rps RECIPE_PARAMETER_SET]

EDPS CLI

optional arguments:
  -h, --help            show this help message and exit
  -H HOST, --host HOST  e.g. "localhost"
  -P PORT, --port PORT  e.g. "5000"
  -i [INPUTS [INPUTS ...]], --inputs [INPUTS [INPUTS ...]]
                        input files or directories
  -t [TARGETS [TARGETS ...]], --targets [TARGETS [TARGETS ...]]
                        targets
  -m [META_TARGETS [META_TARGETS ...]], --meta-targets [META_TARGETS [META_TARGETS ...]]
                        meta-targets
  -w WORKFLOW, --workflow WORKFLOW
                        e.g. "fors.fors_imaging_wkf"
  -e, --execute         execute full processing
  -f, --flat            produce flat organization output
  -g, --graph           print workflow graph in DOT format
  -a, --assocmap        print association map in MD format
  -r, --reset           reset given workflow
  -x, --expand-meta-targets
                        expand meta-targets
  -d, --default-parameters
                        print default recipe params
  -wp PARAMETER VALUE, --workflow-param PARAMETER VALUE
                        workflow parameter
  -rp TASK PARAMETER VALUE, --recipe-param TASK PARAMETER VALUE
                        recipe parameter
  -wps WORKFLOW_PARAMETER_SET, --workflow-parameter-set WORKFLOW_PARAMETER_SET
                        workflow parameter set
  -rps RECIPE_PARAMETER_SET, --recipe-parameter-set RECIPE_PARAMETER_SET
                        recipe parameter set
```

### EDPS Client Tool Example

```shell
edps-client -w kmos.kmos_wkf -i /data/kmos/2019-01-07 -t response -wps qc1_parameters -wp molecfit standard -rps qc1_parameters -rp response kmos.kmos_gen_telluric.method 4
```

## Other commands

### Get workflow graph in DOT format

EDPS can generate a workflow graph in DOT format, which can be rendered using the `dot` command. DOT is a popular graph
description language; see https://en.wikipedia.org/wiki/DOT_%28graph_description_language%29.

```shell
edps-client -w WORKFLOW_NAME -g > <filename>.dot
dot -Tpng <filename>.dot > <filename>.png   # for .png format
dot -Tps  <filename>.dot > <filename>.ps    # for .ps format
```

where WORKFLOW_NAME is a valid workflow name and `<filename>` is the name of the desired output. A list of valid
WORKFLOW_NAMEs can be obtained with:

```shell
ls installation-dir/edps/src/edps/workflow/
```

Valid names are those ending in `_wkf` (do not include the `.py` extension).

NOTE: to generate a graph from a DOT file one can also use an online service such
as https://dreampuf.github.io/GraphvizOnline.

## Python programmatic client

Instead of the EDPS client tool, one can use the EDPS Python client to send processing requests to the EDPS server from a
Python script.

### Organise

```python
from edps.client.EDPSClient import EDPSClient
from edps.client.ProcessingRequest import ProcessingRequest

client = EDPSClient("localhost", 5000)
request = ProcessingRequest(inputs=["/path/to/data/pool"],
                            targets=["science"],
                            workflow="fors.fors_imaging_wkf")
result = client.submit_to_organise(request)
jobs = result.get()
print(jobs)
```

### Execute and browse

```python
from edps.client.EDPSClient import EDPSClient
from edps.client.ProcessingRequest import ProcessingRequest

client = EDPSClient("localhost", 5000)
request = ProcessingRequest(inputs=["/path/to/data/pool"],
                            targets=["science"],
                            workflow="fors.fors_imaging_wkf")

response = client.submit_processing_request(request)
jobs = response.get().jobs
for job in jobs:
    print(job)

search_result = client.get_jobs_list("bias")
print(search_result.get())
```

## Running Unit/Integration tests

* Follow [Install EDPS Software](#install-edps-software)
* Assuming the Python environment is activated and all the core dependencies are installed, install the additional dev dependencies
  * `pip install -r pip-requirements-noversion-dev.txt`
* Setup PYTHONPATH
  * `export PYTHONPATH="$PWD/src:$PWD/test:$PWD/test/tests:$PYTHONPATH"`
* Go to the tests folder
  * `cd test/tests`
* Run the tests, e.g.:
  * Running the workflow json tests
    * `pytest -v json_tests.py`
  * Running the workflow json tests via unittest with full output
    * `python -m unittest -q json_tests.py`
  * Running all the tests
    * `pytest -v *.py`
  * Running tests matching a filter, e.g. 'fors'
    * `pytest -v *.py -k fors`

    Note: this will not match the json test descriptions, as those are created at runtime
  * Running all the tests with coverage and html report
    * `pytest -v --cov=../../src --cov-branch --cov-report html --html=test_results.html --self-contained-html *.py`
      * For coverage report `open htmlcov/index.html`
      * For Test result `open test_results.html`

NOTE: Individual (passing) subtests (tests dynamically created from the json files) are not yet reported; see issue [39](https://github.com/pytest-dev/pytest-subtests/issues/39).

## JSON tests guide

`json_tests.py` contains generic logic which executes all test cases defined under the `tests/json_configuration` directory.
Adding a new test requires adding a new scenario description to one of the existing `.json` files or creating a new file.
A new file will be picked up automatically on the next test run.

### Test suite JSON file

Each `.json` file contains a document with a single field, `scenarios`, which holds a list of scenario definitions.
These are run sequentially, in the exact order in which they are defined in the file.

#### Scenario definition

Each scenario has 3 major sections:

- Test-case metadata
- Input file definitions
- Result expectations

##### Test-case metadata

The first section contains:
- `description` of the test, which will be used as the `test name` in the execution result
- `workflow` to be used for data organization
- `workflow_parameters` optional dictionary to pass with the request to EDPS
- `workflow_parameter_set` optional name of a `named parameter set` for EDPS to use
- `targets` list of target tasks to consider when generating jobs (EDPS will generate jobs for the targets and also for anything those targets depend on)
- `meta_targets` list of labelled meta-targets which will be expanded by EDPS into a list of tasks to be used as targets
- `skip` optional flag; when `true`, the scenario is skipped

Example:
```json
{
  "description": "fors bias flat",
  "workflow": "fors.fors_imaging_wkf",
  "workflow_parameters": {
    "a": "b"
  },
  "workflow_parameter_set": "qc0_parameters",
  "targets": [
    "bias",
    "flat"
  ],
  "meta_targets": [],
  "skip": false
}
```

##### Input file definitions

Each test scenario runs EDPS on generated FITS files.
The list `input_files` holds the definitions of file templates.
A template consists of:

- `name_prefix` which will be prepended to the generated file names (each file gets a random UUID suffix)
- `count` number of files to generate; defaults to 1
- `keywords` dictionary of keywords to place in the primary header of the FITS file. Keywords defined like this are put in the file as-is.

Example:

```json
{
  "input_files": [
    {
      "name_prefix": "bias",
      "count": 4,
      "keywords": {
        "instrume": "FORS1",
        "dpr.catg": "CALIB",
        "dpr.type": "BIAS"
      }
    },
    {
      "name_prefix": "flat",
      "keywords": {
        "instrume": "FORS1",
        "dpr.catg": "CALIB",
        "dpr.type": "FLAT,SKY",
        "dpr.tech": "IMAGE"
      }
    }
  ]
}
```

##### Result expectations

Each test scenario is validated against the defined list of expected results.
The list `results` contains the definitions of the jobs that EDPS is expected to create.

Each job is defined by:

- `recipe` name of the recipe which is supposed to be used
- `inputs_prefixes` list of allowed filename prefixes for the input files

Example:
```json
{
  "results": [
    {
      "recipe": "fors_bias",
      "inputs_prefixes": [
        "bias"
      ]
    },
    {
      "recipe": "fors_img_sky_flat",
      "inputs_prefixes": [
        "flat"
      ]
    }
  ]
}
```

##### Full example

Example of a single scenario:

```json
    {
      "description": "fors bias flat",
      "workflow": "fors.fors_imaging_wkf",
      "workflow_parameters": {
        "a": "b"
      },
      "workflow_parameter_set": "qc0_parameters",
      "targets": [
        "bias",
        "flat"
      ],
      "meta_targets": [],
      "input_files": [
        {
          "name_prefix": "bias",
          "count": 4,
          "keywords": {
            "instrume": "FORS1",
            "dpr.catg": "CALIB",
            "dpr.type": "BIAS"
          }
        },
        {
          "name_prefix": "flat",
          "keywords": {
            "instrume": "FORS1",
            "dpr.catg": "CALIB",
            "dpr.type": "FLAT,SKY",
            "dpr.tech": "IMAGE"
          }
        }
      ],
      "results": [
        {
          "recipe": "fors_bias",
          "inputs_prefixes": [
            "bias"
          ]
        },
        {
          "recipe": "fors_img_sky_flat",
          "inputs_prefixes": [
            "flat"
          ]
        }
      ]
    }
```

### Default behaviour

#### MJD-OBS

For each scenario a random `base mjd-obs` is generated.
Unless the MJD-OBS keyword is explicitly defined for a given template, the `base` value is used.
If there is more than one file in the template, each consecutive file has its MJD-OBS slightly further in the future than the previous one: `base_mjd_obs + i * 0.02`.
If the keyword is explicitly defined for a template, it is used as-is, without the increment.
Each consecutive template starts with an MJD-OBS further back in time, based on the order in which the inputs are defined in the file.
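
The per-template rule can be sketched as (illustrative helper, not part of the test framework):

```python
def mjd_obs_values(base_mjd_obs, count, explicit=None):
    """MJD-OBS per file of a template: an explicitly defined value is used
    as-is for every file; otherwise base + i * 0.02 for consecutive files."""
    if explicit is not None:
        return [explicit] * count
    return [base_mjd_obs + i * 0.02 for i in range(count)]

print(mjd_obs_values(59000.0, 3))                    # base, base+0.02, base+0.04
print(mjd_obs_values(59000.0, 2, explicit=58999.5))  # explicit value, no increment
```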

#### TPL.START

If not explicitly defined, `tpl.start` is set to a randomly generated value, the same for each file of the template.
An input file template definition supports only a single set of keywords, so when files from different template definitions should be marked as part of the same template, it may be necessary to set `tpl.start` explicitly.

Example:

```json
{
  "input_files": [
    {
      "name_prefix": "orderdef_a",
      "count": 1,
      "keywords": {
        "instrume": "ESPRESSO",
        "dpr.catg": "CALIB",
        "dpr.type": "ORDERDEF,LAMP,OFF",
        "tpl.start": "1"
      }
    },
    {
      "name_prefix": "orderdef_b",
      "count": 1,
      "keywords": {
        "instrume": "ESPRESSO",
        "dpr.catg": "CALIB",
        "dpr.type": "ORDERDEF,OFF,LAMP",
        "tpl.start": "1"
      }
    }
  ]
}
```

With such definition both generated files will have the same `tpl.start`.

#### Default keywords

Certain keywords are inserted automatically, even if not explicitly defined:

- `arcfile` set to the same value as the file name: `{prefix}_{i + 1}_{uuid.uuid4()}.fits`
- `tpl.nexp` set to the number of files in the template
- `tpl.expno` set to `1..n` for the consecutive files of the template
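
A hypothetical helper following the scheme above might look like this (an illustration, not the framework's actual code):

```python
import uuid

def expand_template(name_prefix, count=1, keywords=None):
    """Generate (file name, header) pairs for a template, filling in the
    default arcfile/tpl.nexp/tpl.expno keywords when not explicitly given."""
    files = []
    for i in range(count):
        name = f"{name_prefix}_{i + 1}_{uuid.uuid4()}.fits"
        header = dict(keywords or {})
        header.setdefault("arcfile", name)   # same as the file name
        header.setdefault("tpl.nexp", count) # number of files in the template
        header.setdefault("tpl.expno", i + 1)
        files.append((name, header))
    return files

for name, header in expand_template("bias", count=2,
                                    keywords={"dpr.type": "BIAS"}):
    print(name, header["tpl.expno"], header["tpl.nexp"])
```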

### Known limitations

- It is not possible to re-run one selected test, because tests are generated dynamically. If you want to work on a single test, set the `skip` flag on all other tests.
- Synthetic data generation has no knowledge of the type of the data, nor of any real-world relative order in which such data are taken. Unless you explicitly specify the MJD-OBS keyword, you should not make any assumptions about the MJD-OBS values, and therefore about the chronological ordering of the generated files.
- Each template definition can have only one set of keywords.
- The tests perform only the `data organization` part and are designed to verify a workflow against expectations about which jobs should be created for a given set of inputs.
No recipes are run, so it is still possible that the workflow is not really correct (e.g. min/max-ret is set incorrectly, or some task does not declare an association necessary for the recipe).
- Result verification does not check that all defined prefixes are included in the list of input files for the recipe (e.g. that there is at least one file with each prefix); it checks only that there are no input files other than those with the right prefixes (e.g. if the only expected prefix is `bias` and there is a file `flat_...` in the input, the test will fail).
- Result verification is `strict`: it requires that the number of resulting jobs matches the number of defined expectations and that there is at least one job matching each expectation.
- The tests can check only happy paths; they always assert that the request succeeded, so they are not suitable for checking error conditions.
