wfGenes

wfGenes (workflow generator) is a tool that generates inputs for various workflow management systems (WMSs) by parsing a single workflow configuration file called WConfig. Within the wfGenes framework, workflows are defined in human-readable formats, JSON or YAML, with an efficient and concise structure; wfGenes then performs dependency analysis and automatic code generation for the chosen WMS. This approach enables users to evaluate different WMSs based on the application requirements and the available computing environment. Here we demonstrate the capability of wfGenes by constructing workflows for four different WMSs from one WConfig (configuration file). The following lines briefly introduce these systems and summarize their main features in the table below. For more information about these systems and their potential, refer to the hyperlinks.

  1. FireWorks is an open-source WMS with well-separated data storage and computation phases. Built on MongoDB, it offers a powerful workflow management system for distributed workers across one or more clusters, while providing a strong query mechanism thanks to its database back-end.

  2. SimStack is a commercial tool featuring a Graphical User Interface (GUI) and a set of customizable building blocks (WaNos) deployed in the nanomaterial simulation domain.

  3. Dask is an open-source library for parallel computing in Python. With various parallel array types built on top of NumPy and Pandas, Dask is suitable for memory-intensive computation and can scale across many nodes. Dask's built-in task scheduler coordinates the execution sequence of tasks and exploits parallelism in a lazy manner. This enables users to generate their task graph prior to execution with a minimal amount of code modification.

  4. Parsl is a parallel Python library for scaling Python scripts across many cores. With various kinds of executors, Parsl enables users to accelerate their applications and achieve extreme scalability using Parsl-specific syntax and decorators.

All of the discussed tools can scale applications from personal laptops to supercomputers.

Overview of WMSs supported by wfGenes

| WMS       | Input Language   | GUI | Post-Processing           | Fault tolerance                              | License    |
|-----------|------------------|-----|---------------------------|----------------------------------------------|------------|
| FireWorks | YAML/JSON/Python | No  | Monitoring/Database query | Relaunching fizzled subworkflows (fireworks) | BSD        |
| SimStack  | XML              | Yes | Monitoring                | Relaunching fizzled subworkflows (WaNos)     | Commercial |
| Dask      | Python           | No  | Monitoring                | None                                         | BSD        |
| Parsl     | Python           | No  | Monitoring                | Lazy failure and checkpointing               | Apache     |

How it works

To get started with wfGenes, a WConfig must be prepared based on the workflow graph. WConfig is an abstract description of the inputs, outputs, and function names to be parsed by wfGenes. Apart from the configuration file, additional arguments provide control over the automation process and output generation.

workflowconfig

Path to the workflowconfig file that contains the workflow data in YAML/JSON format, i.e. input/output files, module names, and argument names. The default is workflow.yaml.

inputpath

Sets the input path of the workflow, i.e. the directory from which input data is fetched. The default is the current working directory.

wms

Chooses the target workflow management system. Possible values are FireWorks, SimStack, Dask and Parsl.

The following snippets describe two simple workflows in YAML format and illustrate the WConfig structure. During the modeling phase, wfGenes validates the user's input against the WConfig schema to ensure a successful generation phase.

workflow_name: First workflow
nodes:
- name: node_1
  id: 1
  tasks:
  - func: [source_1 , module_1]
    inputs: [input1]
    outputs: [output1_id1]
    kwargs: {}
  - func: [source_1, module_2]
    inputs: [input1, output1_id1]
    outputs: [output2_id1]
    kwargs: {}
- name: node_2
  id: 2
  tasks:
  - func: [source_1, module_3]
    inputs: [input2, input1]
    outputs: [output1_id2]
    kwargs: {}
- name: node_3
  id: 3
  tasks:
  - func: [source_2, module_1]
    inputs: [output2_id1, output1_id2]
    outputs: [output1_id3]
    kwargs: {}
workflow_name: Second workflow
nodes:
- name: node_1
  id: 1
  tasks:
  - func: [source_1, module_1]
    inputs: [input1]
    outputs: [output1_id1]
    kwargs: {}
  - func: [source_1, module_2]
    inputs: [input2, output1_id1]
    outputs: [output2_id1, output3_id1]
    kwargs: {}
- name: node_2
  id: 2
  tasks:
  - func: [source_1, module_3]
    inputs: [input1, output2_id1]
    outputs: [output1_id2]
    kwargs: {}
- name: node_3
  id: 3
  tasks:
  - func: [source_2, module_1]
    inputs: [input1, output3_id1]
    outputs: [output1_id3]
    kwargs: {}
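The validation step mentioned above can be sketched with a minimal pure-Python check. The actual WConfig schema used by wfGenes is richer; the required keys below are an illustrative assumption based on the snippets:

```python
# Minimal, illustrative WConfig check; the real wfGenes schema is an assumption here.

def validate_wconfig(wconfig):
    """Collect structural errors for a WConfig dict shaped like the snippets above."""
    errors = []
    if "workflow_name" not in wconfig:
        errors.append("missing 'workflow_name'")
    for node in wconfig.get("nodes", []):
        for key in ("name", "id", "tasks"):
            if key not in node:
                errors.append(f"node {node.get('name', '?')}: missing '{key}'")
        for task in node.get("tasks", []):
            for key in ("func", "inputs", "outputs"):
                if key not in task:
                    errors.append(f"node {node.get('name', '?')}: task missing '{key}'")
    return errors

wconfig = {
    "workflow_name": "First workflow",
    "nodes": [
        {"name": "node_1", "id": 1, "tasks": [
            {"func": ["source_1", "module_1"], "inputs": ["input1"],
             "outputs": ["output1_id1"], "kwargs": {}},
        ]},
    ],
}
print(validate_wconfig(wconfig))  # an empty list means the check passed
```

Catching such structural errors during the modeling phase, rather than at execution time, is what makes the later generation phase reliable.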

Although wfGenes constructs and adapts the workflow for the user's preferred WMS, several common considerations are taken into account to ensure that the quality of the result is not affected by the automation process.

  • In a unified fashion, the configuration file contains source and module names, passing the necessary information to the tool for automatic wrapper generation.

The WGenerator generates an executable Python wrapper from the custom configuration while taking care of three main criteria that boost performance while preserving functionality:

  1. Extra inputs are loaded only once.

  2. Duplicate modules are imported only once.

  3. Data-flow dependencies are resolved and optimized code is generated.
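Criterion 2, for instance, amounts to deduplicating module imports across tasks. A minimal sketch of how a generated wrapper could collect them follows; the exact code emitted by WGenerator will differ:

```python
# Illustrative only: collect one import per unique source module from the
# func entries of a WConfig, as the generated wrapper must avoid re-imports.
tasks = [
    ["source_1", "module_1"],
    ["source_1", "module_2"],
    ["source_1", "module_3"],
    ["source_2", "module_1"],
]

# Preserve first-seen order while dropping duplicate sources.
unique_sources = list(dict.fromkeys(source for source, _ in tasks))
imports = [f"import {source}" for source in unique_sources]
print(imports)  # ['import source_1', 'import source_2']
```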

  • wfGenes constructs the task graph by matching names. For example, in the first workflow, output1_id1 of node_1 is passed to the next function in the same node (local dependency), while output2_id1 of node_1 is passed to node_3 (global dependency).
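This name-matching can be sketched for the first sample workflow: an edge is added whenever a node consumes an output produced by another node (a global dependency). This is an illustrative re-implementation, not the wfGenes internals:

```python
# Derive global (inter-node) dependencies of the first sample workflow
# by matching output names to input names.
nodes = {
    1: {"inputs": ["input1", "output1_id1"], "outputs": ["output1_id1", "output2_id1"]},
    2: {"inputs": ["input2", "input1"], "outputs": ["output1_id2"]},
    3: {"inputs": ["output2_id1", "output1_id2"], "outputs": ["output1_id3"]},
}

# Map every produced name to the node that produces it.
producer = {out: nid for nid, node in nodes.items() for out in node["outputs"]}

edges = set()
for nid, node in nodes.items():
    for name in node["inputs"]:
        # An input produced by a different node is a global dependency;
        # an input produced by the same node is a local one and adds no edge.
        if name in producer and producer[name] != nid:
            edges.add((producer[name], nid))

print(sorted(edges))  # [(1, 3), (2, 3)]
```

The resulting edge set is exactly the DAG shown in the figure below: node_3 depends on both node_1 and node_2.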

Figure: Directed Acyclic Graphs (DAGs) generated by wfGenes for the two sample workflows discussed above.

  • Regardless of the target WMS, the outputs generated by wfGenes are validated against a schema to ensure early-stage validation and to ease further improvement of the tool.

wfGenesLab

wfGenesLab is a widget-based user interface for wfGenes that runs on top of JupyterLab. It provides a lightweight and intuitive interface to generate, visualize, and execute workflow graphs. wfGenesLab offers a dashboard of Jupyter widgets (various types of buttons and clickable links) to couple the modeling phase to execution in a customizable manner, using wfGenes under the hood.

The WConfig file and inputpath should be set before generating target WMS.

wfGenes visualizes the task graph from the WConfig and produces valid input for the specified system. Moreover, the WConfig can be modified interactively using the link provided in the console.

wfGenesEngine

The wfGenes engine is mainly designed to execute workflows generated by wfGenes via a graphical environment built on top of JupyterLab. In the case of FireWorks, thanks to the available Python APIs, the engine is also equipped with monitoring instruments that capture the state of each workflow and enable users to deal with failed and fizzled workflows.

The wfGenesLab executor runs the generated Python models for Dask and Parsl on two different types of resources: 1. Local, to be used on personal workstations, or 2. Slurm, to run the workflow on supercomputers.
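To give an idea of what a generated Dask model boils down to, here is a minimal dask.delayed graph executed with the default local scheduler. This is a generic Dask sketch, not the code wfGenes actually emits; on Slurm, the same graph could be dispatched to a distributed cluster instead:

```python
from dask import delayed

@delayed
def add(a, b):
    # Nothing runs yet: calling a delayed function only records a graph node.
    return a + b

# Build the task graph lazily, then execute it. compute() uses a local
# scheduler by default; an HPC deployment would attach a Slurm-backed cluster.
total = add(add(1, 2), 3)
result = total.compute()
print(result)  # 6
```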

Installation and Setup

To complete the hands-on, you should first clone the wfgenes git repository:

git clone https://gitlab.com/wfgenes/wfgenes.git

Afterwards, the required packages can be installed using Anaconda.

conda env create -f environment.yml

Note

If conda is not installed for your account, you can install Miniconda locally and use the conda package manager. Alternatively, if it is provided by an administrator, you can load conda and resume the wfGenes installation.

After installation, activate the wfGenes environment by issuing:

conda activate wfgenes

Next, run JupyterLab on the remote server (default port 8888):

jupyter-lab --no-browser

To complete the hands-on smoothly on your personal workstation while using the computing power of supercomputers, it is recommended to use local port forwarding. Simply open a new terminal on your local machine and issue the command below:

ssh -N -f -L localhost:8888:localhost:8888 {user}@{server_ip}

This command forwards remote port 8888 to port 8888 on your local machine. Note that you will need to replace {user} and {server_ip} with your own values.

Finally, you can open JupyterLab from the previous terminal by clicking on the link generated on the server side. This will open a JupyterLab session on your workstation while connected to the cluster session.

Hands-on

Click Here

Contacts

GRK 2450

Twitter