GNU parallel basic usage

Introduction to GNU Parallel

parallel is a GNU utility that lets you run processes simultaneously, sending them to different CPU cores, threads, or even other hosts to be run.

There are countless use cases for this, and it is incredibly useful when you need to process tasks that are CPU-intensive or network-intensive, or when you simply lack resources and need a way to load balance around those bottlenecks.

Thinking twice before starting one of these batch processing tasks can save you and your organization incredible amounts of time.

In this article I will explain my use case for this utility and how to understand everything around the tools I am going to use. I needed to run a batch processing task: indexing entire websites, which are sometimes made of millions of files. Breaking the task down into getting a url-list.txt (spidering), keeping track of what has been downloaded so far and what has not, and then being able to start the download process from different hosts to multiply the download speed from the server is crucial for bringing the download time from months to weeks, to days, even hours.

Knowing how to process this task in parallel is vital in this use case.

As you can see, the main point of this is to avoid bottlenecks originating from the nature of the task and the resources available to process it, be they:

- Processing power
- Storage availability
- Network resources

In my case the bottlenecks are Network Resources and Storage Availability. I will be using GNU parallel to allocate and run the download task across different hosts, which multiplies my network capabilities, and we will also see how I tackle the storage availability issues.

Thus I will be more interested in parallel's ability to run a task on different hosts than in running it in different threads, but I will give a few examples of both.

Introduction: When can you load balance?

Load balancing techniques can be applied as long as the task can be broken down into separate system calls to the operating system, and thus to the processor.

A CPU core can run just one process at a time, but contemporary CPUs are made of more than a single core or thread. Operating systems work on the basis of allocating system calls to different threads according to their internals. A single system call cannot be divided and sent to different threads to speed up its execution, and statements of programming languages are often made to run on a single thread.

It is important, then, that programmers understand the nature of system calls: their code is made of a bunch of them, and if not specified otherwise, the program will run sequentially on a single thread, meaning that each system call has to wait for the previous one to complete before starting its own execution.

But some parts of a program or script may be made of tasks that can run asynchronously in different threads, which speeds up the overall execution time of the program. You can identify them and program to load balance these tasks onto additional CPU threads, or even (if they are networking tasks) spread their execution across different hosts. This is the great usefulness of GNU parallel.
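
A quick way to see the payoff, using sleeps to stand in for independent network-bound tasks (a minimal sketch; timings are approximate):

# Sequential: the three sleeps add up to ~3 seconds
time (sleep 1; sleep 1; sleep 1)
# Parallel: each sleep gets its own job, ~1 second total
time parallel sleep ::: 1 1 1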

Installing parallel, and sources of information

You may want to install the package available in your distro's repositories, but my experience with different versions has been mixed: an old-stable one was buggy with some issues, and so was the latest one, so I recommend building this version:

wget https://ftp.gnu.org/gnu/parallel/parallel-20240222.tar.bz2
tar xjf parallel-20240222.tar.bz2
cd parallel-20240222
./configure && make
sudo make install
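
Once installed you can confirm the build; on a fresh install parallel also asks you once to acknowledge its citation notice, which --citation handles:

parallel --version | head -n 1
parallel --citation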

If you want to get familiarised with parallel I recommend following the tutorials found in its manuals; here I will just give a quick overview based on my understanding of it, which may not be 100% accurate. The relevant man pages are:

man parallel
man parallel_tutorial
man parallel_examples
man env_parallel 

Breakdown of how to build the command line

By the nature of the utility, which deals with running other commands within a command, the syntax can get quite overwhelming whether you run it directly in the shell or inside a shell script: it is easy to lose track of what each part is doing and how to avoid escaping issues with the shell or with the commands used.
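
One habit that avoids most escaping trouble: wrap the composed command in single quotes, so your interactive shell does not consume metacharacters like ; before parallel ever sees them. A minimal illustration:

# Broken: the shell splits at ';', so 'echo done' runs once, after parallel
#   parallel sleep {}; echo done ::: 1 2
# Quoted: parallel receives the whole command string intact
parallel 'sleep {}; echo done' ::: 1 2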

Let's take up some simple commands:

# Running a command with one argument, multiple times (varying the argument)
## Arguments read directly from the CLI
parallel echo ::: 'pep' 'costa' 'tom'
# Output
pep
costa
tom

## Arguments read from stdin
# newlines are interpreted as separate arguments, each run in parallel with the same command
echo -e "pep\ncosta\ntom" | parallel echo
# Same Output


## Arguments read from a file
# input.txt
pep
costa
tom
#
parallel -a "input.txt" echo
# Same Output

## Running in a script
# myscript.sh
#!/bin/bash
printing_stuff() {
    echo "$1"
}
export -f printing_stuff
parallel printing_stuff ::: 'pep' 'costa' 'tom'

#
./myscript.sh
# Same Output

You can achieve the same as these examples with different syntaxes, refining the combinations to achieve different things; refer to the documentation for that, as we will keep it simple here to achieve what we want.
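
For instance, reading arguments from a file can also be written with the :::: operator, which as far as I understand is equivalent to -a:

parallel echo :::: input.txt
# Same Output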

To clarify the last example: when running parallel inside a script with a function, you need to export that function in bash with the export -f statement; then you can call it as parallel my_func.

Allocating jobs in different threads

In the examples before, all the echo commands were sent to different job slots to be run in parallel, as many at a time as there are CPU cores (that is the default behavior). As soon as each job returns, its output is logged to stdout (asynchronous execution), so if one of the jobs takes longer, it will finish later even if it was started first.

Note: {} is a replacement string for the argument passed to that job.

echo -e "5\n2\n1" | parallel 'sleep {}; echo {}'
# Output
1
2
5

We can now pick different parameters to be used in different parts of our command: {n} will be replaced by parameter number n.

We do need to specify how to separate the parameters in our input; here we say that each whitespace delimits a different parameter: --colsep " "

echo -e "5 pep\n2 costa\n1 tom" | parallel --colsep " " 'sleep {1}; echo {2}'
# Output
tom
costa
pep

We have demonstrated that tom returns earlier than pep despite pep's job being started first.

That is because parallel runs jobs in as many slots as there are CPU cores by default. You can change this with -j <num-jobs> to run at most that number of jobs in parallel.

echo -e "5 pep\n2 costa\n1 tom" | parallel -j 1 --colsep " " 'sleep {1}; echo {2}'
# Output
pep
costa
tom

If only 1 job runs at a time, it becomes a sequenced execution (synchronous).
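
If what you actually want is ordered output rather than serialized execution, parallel's -k (--keep-order) flag prints results in input order while the jobs still run concurrently, so the total time stays around 5 seconds instead of 8:

echo -e "5 pep\n2 costa\n1 tom" | parallel -k --colsep " " 'sleep {1}; echo {2}'
# Output
pep
costa
tom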

No arguments

You can truncate the number of arguments sent to the command with -n <number>. In the special case where you run commands that need no arguments as input, parallel won't let you omit input entirely: pass arguments anyway and use -n0 so they are discarded.

echo -e "Whatever\nNothing" | parallel -n0 "ip addr"
## Will run the ip addr command twice, asynchronously

Remote execution, running jobs on other hosts

We are going to run a bash function on different hosts. First we need to pass each host IP we want to use for the execution after -S; we keep adding more -S options for every host we want to add to the processing pool. Note that to execute the function we need to pass its name as an environment variable with --env <function_name>. Note also that if you want your localhost to be added to the pool of hosts that execute the processes, you need to add it manually with -S 127.0.0.1 (it is not added by default).

# input.txt
pep
costa
tom

# my_script.sh
check_hostname(){
    echo $HOSTNAME
}
export -f check_hostname

parallel --env check_hostname -S <WAN_SERVER> -S <LOCAL_HOST> -n0 -a "input.txt" check_hostname

# Output
<LOCAL_HOST>
<LOCAL_HOST>
<WAN_SERVER> 

Had it been parallel -S <WAN_SERVER> -S <LOCAL_HOST> -n0 -a "input.txt" check_hostname (without --env), it would not have worked: the remote host would not know the function's definition.
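
When the host pool grows, repeating -S gets tedious; parallel also accepts a file of sshlogins via --sshloginfile (the hostnames below are placeholders):

# hosts.txt — one sshlogin per line; the special entry ':' means the local machine
# user@<WAN_SERVER>
# :
parallel --sshloginfile hosts.txt --env check_hostname -n0 -a "input.txt" check_hostname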

Sending environment variables to the remote host

Sometimes you want to work with different data across the hosts. If you need the other hosts to adopt certain environment variables of the executing host, you need to use env_parallel and then add each environment variable with --env <MY_VAR>.

# my_script.sh
source `which env_parallel.bash`
check_hostname(){
    echo $HOSTNAME
}
export -f check_hostname

env_parallel --env check_hostname --env HOSTNAME -S <WAN_SERVER> -S <LOCAL_HOST> -n0 -a "input.txt" check_hostname

# Output
<LOCAL_HOST>
<LOCAL_HOST>
<LOCAL_HOST> 

Reading environment variables from the remote host

In complex script interactions, you may want to define host-specific values to perform certain things differently on each one.

This is not strictly about GNU parallel, but I have found that trying to remotely echo a variable defined in $HOME/.bashrc does not work. It is not caused by any GNU parallel configuration; it is an actual ssh security feature. The variables we want to access through ssh need to be contained in a file called $HOME/.ssh/environment:

MY_VAR=some_value

Then enable the following directive in /etc/ssh/sshd_config:

PermitUserEnvironment yes

Restart the sshd server and try echoing from another host:

ssh hostIp 'echo ${MY_VAR}'
# Output
some_value

Putting this together

See the example below

vim parallel_echo_variables.sh

#!/bin/bash

## Previously I have modified $HOME/.bashrc and $HOME/.ssh/environment
## in each host (they hold different values)
# DOWN_PATH=/home/fakuve/downloads

source `which env_parallel.bash`

## SETTING GLOBAL SCRIPT VARIABLES
# this will be shared regardless to which hosts is executing
MASTER_HOSTNAME=$(hostname)
MASTER_DOWNPATH=${DOWN_PATH}


main() {

# each newline echoed will fire the command 1 more time (just for testing)
echo -e ' \n ' | env_parallel \
        `## PARALLEL ARGUMENTS` \
        -n0 \
        --jobs 1 \
        `## GLOBAL SCRIPT VARIABLES` \
        --env MASTER_HOSTNAME \
        --env MASTER_DOWNPATH \
        `## FUNCTIONS TO BE EXECUTED` \
        --env echo_variables \
        `## WORKER_HOSTS` \
        -S 192.168.43.241 \
        -S 127.0.0.1 \
        `## FUNCTION TO CALL` \
        echo_variables

}

echo_variables() {

    echo "WORKER_HOST      : ${HOSTNAME}"
    echo "WORKER_DOWNPATH  : ${DOWN_PATH}"
    echo "MASTER_HOST      : ${MASTER_HOSTNAME}"
    echo "MASTER_DOWNPATH  : ${MASTER_DOWNPATH}"
    echo "-------------------------------------"

}
export -f echo_variables

main


### Output
#WORKER_HOST      : verneynas
#WORKER_DOWNPATH  : /home/fakuve/files/downloads
#MASTER_HOST      : elitebook-x360
#MASTER_DOWNPATH  : /home/fakuve/downloads
#-------------------------------------
#WORKER_HOST      : elitebook-x360
#WORKER_DOWNPATH  : /home/fakuve/downloads
#MASTER_HOST      : elitebook-x360
#MASTER_DOWNPATH  : /home/fakuve/downloads
#-------------------------------------

You can see how we can proceed with clarity from now on. At the beginning of the script we assign the GLOBAL SCRIPT VARIABLES, which are not going to be modified regardless of which host is the caller. In this case I like to name caller variables MASTER_<VARNAME>, and host-specific variables WORKER_<VARNAME>.

As you can see, the MASTER variables also need to be passed via env_parallel; don't forget that. The rest get assigned on each call of parallel on each host. For clarity you can reassign them within the function:

echo_variables() {
    WORKER_HOST=${HOSTNAME}
    WORKER_DOWNPATH=${DOWN_PATH}
}

Note that DOWN_PATH has been assigned per host, in each host's .bashrc and $HOME/.ssh/environment.
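
For reference, assuming the paths shown in the output above, each host's $HOME/.ssh/environment would carry its own value, something like:

# on elitebook-x360
DOWN_PATH=/home/fakuve/downloads
# on verneynas
DOWN_PATH=/home/fakuve/files/downloads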

Conclusion

After all of this we will have unravelled the GNU parallel tool, and we will be able to work with it for our use case in the next chapter: building up a parallel web crawler.