Straxen scripts
Straxen comes with several scripts that allow common uses of straxen. Some of these scripts are designed to run on the DAQ whereas others are for common use cases. Each of the scripts will be briefly discussed below:
straxer
straxer is the most useful straxen script for regular users. It allows data to be generated from the command line, which is especially useful for reprocessing data in batch jobs. For example, a user can reprocess the data of run 012100 up to event_info_double using the following command:
straxer 012100 --target event_info_double
For more information on the options, please refer to the help:
straxer --help
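For batch jobs, straxer is typically wrapped in a simple loop over run IDs. A minimal dry-run sketch (the run IDs here are made up; remove echo to actually execute the commands):

```shell
# Dry-run sketch of batch reprocessing (hypothetical run IDs).
# Remove `echo` to actually invoke straxer.
for run_id in 012100 012101 012102; do
    echo straxer "$run_id" --target event_info_double
done
```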
straxen-print_versions
straxen-print_versions is a small bin utility that wraps around straxen.print_versions. It allows one to quickly print which versions and installation paths are used, for example:
straxen-print_versions strax straxen cutax wfsim pema
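As an illustration of what such a utility reports, here is a minimal, generic sketch (not straxen's actual implementation) that collects the version and installation path of each requested module:

```python
import importlib

def version_report(module_names=("json",)):
    """Return {module: (version, path)} -- a hypothetical stand-in
    for straxen.print_versions, shown for illustration only."""
    report = {}
    for name in module_names:
        try:
            mod = importlib.import_module(name)
        except ImportError:
            report[name] = (None, None)  # module not installed
            continue
        report[name] = (getattr(mod, "__version__", "unknown"),
                        getattr(mod, "__file__", "unknown"))
    return report

for name, (version, path) in version_report().items():
    print(f"{name:12s} {version} {path}")
```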
ajax [DAQ-only]
The DAQ-cleaning script. Data is stored on the DAQ such that other tools like admix may ship the data to distributed storage. A portion of the high-level data is stored on the DAQ for longer periods of time for diagnostic purposes. ajax removes this data when needed.
The ajax script looks for data on the eventbuilders that can be deleted for at least one of the following reasons:
- A run has been "abandoned". This means that there is no further use for this data, e.g. because a board failed during the run; there is no point in keeping a run where part of the data is missing.
- The live-data (the intermediate DAQ format, even more raw than raw-records) has been successfully processed, so this intermediate datakind can be removed from the DAQ.
- A run has been abandoned but there is still live-data on the DAQ-buffer.
- Data is "unregistered" (not in the runs-database); this only occurs when DAQ experts perform tests on the DAQ.
- Since bootstrax runs on multiple hosts, some of the data may appear to be stored more than once, because a given bootstrax instance could crash during its processing. The data of unsuccessful processings should be removed by ajax.
Finally, ajax also checks whether all the entries that are in the database are still present on the host. This sanity check catches potential issues in the data handling by admix.
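The deletion criteria above can be summarized in a short sketch. This is illustrative only, with assumed field names on a hypothetical run-document dict, not ajax's actual implementation:

```python
def deletion_reasons(run_doc, registered_run_ids):
    """Collect the reasons (if any) why data for this run may be deleted.
    `run_doc` is a hypothetical run-document dict; all field names here
    are assumptions made for this sketch."""
    reasons = []
    if run_doc.get("abandoned"):
        reasons.append("run abandoned (e.g. a board failed during the run)")
    if run_doc.get("live_data_processed"):
        reasons.append("live-data already successfully processed")
    if run_doc.get("run_id") not in registered_run_ids:
        reasons.append("unregistered run (DAQ-expert test)")
    return reasons

# Example: an abandoned run that is also not in the runs-database
print(deletion_reasons({"run_id": "999999", "abandoned": True}, {"012100"}))
```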
restrax [DAQ-only]
Bootstrax creates many files when processing the data live. To prevent aggregating too much data in memory, it stores each datatype as soon as 200 MB (see strax.default_chunk_size_mb) of that datatype has been aggregated. Furthermore, it does not rechunk the raw-records (i.e. it saves them just as they come from redax). This leads to many small files, which is an issue for the data management, as each datatype may create a file every few seconds, and the data management has to bookkeep all the separate files. Restrax rechunks and recompresses the files after bootstrax is done with processing.
This means that the DAQ data flow is:
From left to right:
- The digitizers read out the PMTs, and redax reads from the digitizers. redax also converts the data (which we call live_data) to a format that bootstrax can read. This data is stored to CEPH (see https://arxiv.org/abs/2212.11032).
- bootstrax reads the live_data from CEPH and processes it to the strax datatypes (raw_records, peaks, events, and so on). Additionally, it fills the online_monitor collection with a selection of the data (see the online monitor). All data is also written to a pre_processed directory. The data in the pre_processed directory is not yet considered by the data management tools.
- The data is rechunked and recompressed by restrax and stored in the production folder /data/xenonnt_processed. We will elaborate on the rechunking and recompressing below.
- Finally, the data is uploaded into the data management tools by admix, which reads the data from the production folder and uploads it into rucio (our data management tool).
The first three steps run on the DAQ; the last step runs on datamanager, a server from the computing group that is also on the LNGS network.
Rechunking & recompression
Restrax does rechunking, which is the process of combining multiple blocks of data into one. Additionally, it recompresses the data, i.e. it stores the data with heavier compression algorithms. These take more CPU, but reduce the overall disk size of a given datatype, which is especially useful for long-term storage.
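Conceptually, the rechunking step can be sketched as follows. This is a simplified illustration, using byte strings as stand-ins for strax chunks, and is not restrax's actual code (which also handles compression and metadata):

```python
def rechunk(chunks, target_size_bytes=2000):
    """Combine many small chunks into chunks of roughly target_size_bytes.
    Simplified sketch of the rechunking idea described above."""
    buffer, buffered_bytes = [], 0
    for chunk in chunks:
        buffer.append(chunk)
        buffered_bytes += len(chunk)
        if buffered_bytes >= target_size_bytes:
            yield b"".join(buffer)  # one big chunk instead of many small ones
            buffer, buffered_bytes = [], 0
    if buffer:  # flush the remainder
        yield b"".join(buffer)

small_chunks = [b"\x00" * 800 for _ in range(10)]  # ten 800-byte chunks
print([len(c) for c in rechunk(small_chunks)])  # -> [2400, 2400, 2400, 800]
```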
Why not have bootstrax do the rechunking/compression?
In principle, bootstrax could also do the recompression and rechunking. However, there are several issues with bootstrax doing this while also live-processing the data.
First of all, the memory usage would blow up massively if bootstrax were to rechunk all data types up to a chunk size of ~1000 MB, as it would buffer data for each data type up to that chunk size, concatenate the data and then store it. If ~50 data types are stored, this would give a memory consumption of up to 50 x 1000 MB = 50 GB. If you also account for the concatenating (which doubles the memory consumption), you quickly allocate 100 GB just for saving data, not taking into account the requirements for the actual processing.
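The estimate above as a quick back-of-the-envelope calculation (the round numbers are the ones assumed in the text):

```python
# Back-of-the-envelope memory estimate for bootstrax doing the rechunking.
n_datatypes = 50         # ~50 data types being stored
chunk_size_mb = 1000     # ~1000 MB buffered per data type
buffered_gb = n_datatypes * chunk_size_mb / 1000  # buffering alone
peak_gb = 2 * buffered_gb                         # concatenation doubles it
print(buffered_gb, peak_gb)  # -> 50.0 100.0
```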
Additionally, at high rates we do not always have time for heavy compression algorithms, as these take a lot of CPU. Doing those at a later time ensures we keep processing live while still applying heavy compression afterwards.
Restrax philosophy
Restrax is designed as a lazy algorithm, doing one thing at a time and only updating the runs-database after the job is done.
It does allow for parallelization, but this should be used with caution as it also increases the memory footprint.
The maximum memory usage can be approximated by 2x target_size_mb (from Restrax.get_compressor_and_size) times the number of (raw-records) threads, so 4 * target_size_mb, which usually maxes out at 20 GB for a raw-records target size of 5 GB.
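The same bound as a quick calculation (a thread count of 2 is assumed here, consistent with the 4 * target_size_mb figure in the text):

```python
# Approximate restrax memory bound: 2 x target_size_mb per thread.
target_size_mb = 5000   # 5 GB raw-records target size
n_threads = 2           # assumed number of raw-records threads
max_memory_gb = 2 * target_size_mb * n_threads / 1000
print(max_memory_gb)    # -> 20.0
```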
Restrax configuration
Most of the restrax configurations are set as class variables. These can be overwritten by a document in the daq-database. For example, the snippet below sets max_workers to 5.
from straxen import daq_core

db = daq_core.DataBases()
db.daq_db['restrax_config'].update_one(
    {'name': 'restrax_config'},
    {'$set': {'user': 'angevaare',
              'last_modified': daq_core.now(),
              'max_workers': 5,  # <-- Increase the number of workers
              }
     })
Several settings make restrax go faster:
- increase max_workers: this increases the number of workers per data type. More workers use more memory.
- increase max_threads: this increases the number of concurrent data types that are handled. More concurrent data types increase the memory footprint.
- decrease is_heavy_rate_mbs: if the data rate is higher than this number, restrax will use faster (but less squeezy) compression algorithms for raw records.
- disable deep_compare: this is a slow and over-engineered check that should only be used during testing.
- change target_compressor to a faster compression algorithm.
- expand the skip_compression list of targets that are skipped during compression. Since most time is spent in raw-records (re)compression, this option only saves time if raw-records are among the skipped targets.
Similarly, decreasing the options above often leads to a lower memory footprint. So does decreasing target_size, as restrax has to keep 2x target_size in memory for each datatype it is handling at a given time.
Additionally, setting process=False disables the multithreaded processing (the default is process=True). Setting process='process' changes the processing to multiprocessing (multicore) instead of multithreading.
Bypass mode
If needed, restrax can be bypassed by passing the --bypass_mode argument. This will skip all compression and rechunking steps, and will complete a run within ~0.5 s. It's advised to only do this in conjunction with the --process RUN_ID argument and to use it for single runs, but this is not required. Bypass mode can also be activated via the configuration mechanism shown in the example above.
bootstrax [DAQ-only]
This is the main DAQ processing script; it is discussed separately. It is only used for XENONnT.
fake_daq
Script that allows mimicking DAQ-processing by opening raw-records data.
microstrax
Mini strax interface that allows strax-data to be retrieved using HTTP requests on a given port. At the time of writing, it is used on the DAQ as a pulse viewer.
refresh_raw_records
Updates raw-records from old strax versions. This data is of a different format and needs to be refreshed before it can be opened with more recent versions of strax.
Last updated 2023-02-14. Joran Angevaare