Straxen scripts
Straxen comes with several scripts that allow common uses of straxen. Some of these scripts are designed to run on the DAQ whereas others are for common use cases. Each of the scripts will be briefly discussed below:
straxer
straxer is the most useful straxen script for regular users. It allows data to be generated from the command line, which is especially useful for reprocessing data in batch jobs. For example, a user can reprocess the data of run 012100 up to event_info_double using the following command:
straxer 012100 --target event_info_double
For more information on the options, please refer to the help:
straxer --help
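For batch jobs, straxer is typically wrapped in a simple loop over run IDs. A minimal dry-run sketch (the run IDs here are made up; remove echo to actually execute the commands):

```shell
# Dry-run sketch of batch reprocessing (hypothetical run IDs).
# Remove `echo` to actually invoke straxer.
for run_id in 012100 012101 012102; do
    echo straxer "$run_id" --target event_info_double
done
```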
straxen-print_versions
straxen-print_versions is a small bin utility that wraps around straxen.print_versions. It allows one to quickly print which versions and installation paths are used, for example:
straxen-print_versions strax straxen cutax wfsim pema
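As an illustration of what such a utility reports, here is a minimal, generic sketch (not straxen's actual implementation) that collects the version and installation path of each requested module:

```python
import importlib

def version_report(module_names=("json",)):
    """Return {module: (version, path)} -- a hypothetical stand-in
    for straxen.print_versions, shown for illustration only."""
    report = {}
    for name in module_names:
        try:
            mod = importlib.import_module(name)
        except ImportError:
            report[name] = (None, None)  # module not installed
            continue
        report[name] = (getattr(mod, "__version__", "unknown"),
                        getattr(mod, "__file__", "unknown"))
    return report

for name, (version, path) in version_report().items():
    print(f"{name:12s} {version} {path}")
```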
ajax [DAQ-only]
The DAQ-cleaning script. Data is stored on the DAQ such that other tools like admix may ship the data to distributed storage. A portion of the high-level data is stored on the DAQ for longer periods of time for diagnostic purposes. ajax removes this data when needed.
The ajax script looks for data on the eventbuilders that can be deleted for at least one of the following reasons:
- A run has been "abandoned". This means that there is no further use for this data, e.g. because a board failed during the run; there is no point in keeping a run where part of the data is missing.
- The live-data (the intermediate DAQ format, even more raw than raw-records) has been successfully processed, so this intermediate datakind can be removed from the DAQ.
- A run has been abandoned but there is still live-data on the DAQ-buffer.
- Data is "unregistered" (not in the runs-database); this only occurs when DAQ experts perform tests on the DAQ.
- Since bootstrax runs on multiple hosts, some of the data may appear to be stored more than once, because a given bootstrax instance could crash during its processing. The data of unsuccessful processings should be removed by ajax.
Finally, ajax also checks whether all the entries that are in the database are still present on the host. This sanity check catches potential issues in the data handling by admix.
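The deletion criteria above can be summarized in a short sketch. This is illustrative only, with assumed field names on a hypothetical run-document dict, not ajax's actual implementation:

```python
def deletion_reasons(run_doc, registered_run_ids):
    """Collect the reasons (if any) why data for this run may be deleted.
    `run_doc` is a hypothetical run-document dict; all field names here
    are assumptions made for this sketch."""
    reasons = []
    if run_doc.get("abandoned"):
        reasons.append("run abandoned (e.g. a board failed during the run)")
    if run_doc.get("live_data_processed"):
        reasons.append("live-data already successfully processed")
    if run_doc.get("run_id") not in registered_run_ids:
        reasons.append("unregistered run (DAQ-expert test)")
    return reasons

# Example: an abandoned run that is also not in the runs-database
print(deletion_reasons({"run_id": "999999", "abandoned": True}, {"012100"}))
```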
restrax [DAQ-only]
Bootstrax creates many files when processing the data live. To prevent aggregating too much data in memory, it stores each datatype as soon as 200 MB (see strax.default_chunk_size_mb) of that datatype has been aggregated. Furthermore, it does not rechunk the raw-records (i.e. it saves them just as they come from redax). This leads to many small files, which is an issue for the data management, as each datatype may create a file every few seconds, and the data management has to bookkeep all the separate files. Restrax rechunks and recompresses the files after bootstrax is done with processing.
This means that the DAQ data flow is:
From left to right:
- The digitizers read out the PMTs, and redax reads from the digitizers. redax also converts the data (which we call live_data) to a format that bootstrax can read. This data is stored to CEPH (see https://arxiv.org/abs/2212.11032).
- bootstrax reads the live_data from CEPH and processes it to the strax datatypes (raw_records, peaks, events, and so on). Additionally, it fills the online_monitor collection with a selection of the data (see the online monitor). All data is also written to a pre_processed directory. The data in the pre_processed directory is not yet considered by the data management tools.
- The data is rechunked and recompressed by restrax and stored in the production folder /data/xenonnt_processed. We will elaborate on the rechunking and recompressing below.
- Finally, the data is uploaded into the data management tools by admix, which reads the data from the production folder and uploads it into rucio (our data management tool).
The first three steps run on the DAQ; the last step runs on datamanager, a server from the computing group that is also on the LNGS network.
Rechunking & recompression
Restrax does rechunking, which is the process of combining multiple blocks of data into one. Additionally, it recompresses the data, i.e. it stores the data with heavier compression algorithms. These take more CPU, but reduce the overall disk size of a given datatype, which is especially useful for long-term storage.
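Conceptually, the rechunking step can be sketched as follows. This is a simplified illustration, using byte strings as stand-ins for strax chunks, and is not restrax's actual code (which also handles compression and metadata):

```python
def rechunk(chunks, target_size_bytes=2000):
    """Combine many small chunks into chunks of roughly target_size_bytes.
    Simplified sketch of the rechunking idea described above."""
    buffer, buffered_bytes = [], 0
    for chunk in chunks:
        buffer.append(chunk)
        buffered_bytes += len(chunk)
        if buffered_bytes >= target_size_bytes:
            yield b"".join(buffer)  # one big chunk instead of many small ones
            buffer, buffered_bytes = [], 0
    if buffer:  # flush the remainder
        yield b"".join(buffer)

small_chunks = [b"\x00" * 800 for _ in range(10)]  # ten 800-byte chunks
print([len(c) for c in rechunk(small_chunks)])  # -> [2400, 2400, 2400, 800]
```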
Why not have bootstrax do the rechunking/compression?
In principle, bootstrax could also do the recompression and rechunking. However, there are several issues with bootstrax doing this while also live-processing the data.
First of all, the memory usage would blow up massively if bootstrax were to rechunk all data types up to a chunk size of ~1000 MB, as it would buffer data for each data type up to that chunk size, concatenate the data and then store it. If ~50 data types are stored, this would give a memory consumption of up to 50 x 1000 MB = 50 GB. If you also account for the concatenating (which doubles the memory consumption), you quickly allocate 100 GB just for saving data, not taking into account the requirements for the actual processing.
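The estimate above as a quick back-of-the-envelope calculation (the round numbers are the ones assumed in the text):

```python
# Back-of-the-envelope memory estimate for bootstrax doing the rechunking.
n_datatypes = 50         # ~50 data types being stored
chunk_size_mb = 1000     # ~1000 MB buffered per data type
buffered_gb = n_datatypes * chunk_size_mb / 1000  # buffering alone
peak_gb = 2 * buffered_gb                         # concatenation doubles it
print(buffered_gb, peak_gb)  # -> 50.0 100.0
```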
Additionally, at high rates we do not always have time for heavy compression algorithms, as these take a lot of CPU. Doing those at a later time ensures we keep processing live while still applying heavy compression afterwards.
Restrax philosophy
Restrax is designed as a lazy algorithm, doing one thing at a time and only updating the runs-database after the job is done.
It does allow for parallelization, but this should be used with caution as it also increases the memory footprint.
The maximum memory usage can be approximated by 2x target_size_mb (from Restrax.get_compressor_and_size) times the number of (raw-records) threads, so 4 * target_size_mb, which usually maxes out at 20 GB for a raw-records target size of 5 GB.
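The same bound as a quick calculation (a thread count of 2 is assumed here, consistent with the 4 * target_size_mb figure in the text):

```python
# Approximate restrax memory bound: 2 x target_size_mb per thread.
target_size_mb = 5000   # 5 GB raw-records target size
n_threads = 2           # assumed number of raw-records threads
max_memory_gb = 2 * target_size_mb * n_threads / 1000
print(max_memory_gb)    # -> 20.0
```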
Restrax configuration
Most of the restrax configurations are set as class variables. These can be overwritten by a document in the daq-database. For example, the snippet below sets max_workers to 5.
from straxen import daq_core

db = daq_core.DataBases()
db.daq_db['restrax_config'].update_one(
    {'name': 'restrax_config'},
    {'$set': {'user': 'angevaare',
              'last_modified': daq_core.now(),
              'max_workers': 5,  # <-- Increase the number of workers
              }
     })
Several settings make restrax go faster:
- increase max_workers: this increases the number of workers per data type. More workers use more memory.
- increase max_threads: this increases the number of concurrent data types that are handled. More concurrent data types increase the memory footprint.
- decrease is_heavy_rate_mbs: if the data rate is higher than this number, restrax will use faster (but less squeezy) compression algorithms for raw records.
- disable deep_compare: this is a slow and over-engineered check that should only be used during testing.
- change target_compressor to a faster compression algorithm.
- expand the skip_compression list of targets that are skipped during compression. Since most time is spent in raw-records (re)compression, this option only saves time if raw-records are among the skipped targets.
Similarly, decreasing the options above often leads to a lower memory footprint. So does decreasing target_size, as restrax has to keep 2x target_size in memory for each datatype it is handling at a given time.
Additionally, setting process=False disables the multithreaded processing (the default is process=True). Setting process='process' changes the processing to multiprocessing (multicore) instead of multithreading.
Bypass mode
If needed, restrax can be bypassed by passing the --bypass_mode argument. This will skip all compression and rechunking steps, and will complete a run within ~0.5 s. It's advised to only do this in conjunction with the --process RUN_ID argument and to use it for single runs, but this is not required. Bypass mode can also be activated via the configuration mechanism shown in the example above.
bootstrax [DAQ-only]
This is the main DAQ processing script; it is discussed separately. It is only used for XENONnT.
fake_daq
Script that allows mimicking DAQ-processing by opening raw-records data.
microstrax
Mini strax interface that allows strax-data to be retrieved using HTTP requests on a given port. At the time of writing, it is used on the DAQ as a pulse viewer.
refresh_raw_records
Updates raw-records from old strax versions. This data is of a different format and needs to be refreshed before it can be opened with more recent versions of strax.
Last updated 2023-02-14. Joran Angevaare