Straxen scripts
===================
Straxen comes with `several scripts `_ that cover common uses of straxen. Some of these scripts
are designed to run on the DAQ, whereas others are for common use cases. Each of the scripts is
briefly discussed below:

straxer
-------
``straxer`` is the most useful straxen script for regular users. It allows data to be generated from
a script, which is especially useful for reprocessing data in batch jobs.

For example, a user can reprocess the data of run ``012100`` up to ``event_info_double`` using the
following command:

.. code-block:: bash

    straxer 012100 --target event_info_double

For more information on the options, please refer to the help:

.. code-block:: bash

    straxer --help

straxen_print_versions
----------------------
``straxen_print_versions`` is a small bin utility that wraps around ``straxen.print_versions``. It
allows one to quickly print which versions and installation paths are used, for example:

.. code-block:: bash

    straxen_print_versions strax straxen cutax wfsim pema

ajax [DAQ-only]
----------------
The DAQ-cleaning script. Data is stored on the DAQ such that other tools like `admix `_ may ship the
data to distributed storage. A portion of the high level data is stored on the DAQ for diagnostic
purposes for longer periods of time. ``ajax`` removes this data if needed. The ``ajax`` script looks
for data on the eventbuilders that can be deleted for at least one of the following reasons:

- A run has been "abandoned". This means that there is no further use for this data. For example, if
  a board failed during a run, there is no point in keeping a run for which only part of the data is
  on the DAQ.
- The live-data (intermediate DAQ format, even more raw than raw-records) has been successfully
  processed. This intermediate data kind is therefore removed from the DAQ.
- A run has been abandoned but there is still live-data on the DAQ-buffer.
- Data is "unregistered" (not in the runs-database). This only occurs if DAQ-experts perform tests
  on the DAQ.
- Since bootstrax runs on multiple hosts, some of the data may appear to be stored more than once,
  since a given bootstrax instance could crash during its processing. The data of unsuccessful
  processings should be removed by ``ajax``.
- Finally, ``ajax`` also checks whether all the entries that are in the database are still on the
  host. This sanity check catches any potential issues in the data handling by admix.

restrax [DAQ-only]
--------------------
Bootstrax creates many files when processing the data live. To prevent aggregating too much data in
memory, it stores each datatype as soon as 200 MB (see ``strax.default_chunk_size_mb``) of that
datatype has been aggregated. Furthermore, it does not rechunk the ``raw-records`` (i.e. it just
saves them as they come from redax). This leads to **many small files**, which is an issue for the
data-management, as each datatype may create a single file every few seconds, and the
data-management has to bookkeep all the separate files.

``Restrax`` rechunks and recompresses the files after bootstrax is done with processing. This means
that the DAQ data flow is:

.. image:: figures/restrax.svg

From left to right:

- The digitizers read out the PMTs, and `redax `_ reads from the digitizers. ``redax`` also converts
  the data (which we call ``live_data``) to a format that bootstrax can read. This data is stored to
  ``CEPH`` (see ``_).
- ``bootstrax`` reads the ``live_data`` from ``CEPH`` and processes it to the strax-datatypes
  (``raw_records``, ``peaks``, ``events`` and so on). Additionally, it fills the ``online_monitor``
  collection with a selection of the data (see `the online monitor `_). All data is also written to
  a ``pre_processed`` directory. The data in the ``pre_processed`` directory is not yet considered
  by the data management tools.
- The data is rechunked and recompressed by ``restrax`` and stored in the production folder
  ``/data/xenonnt_processed``. We will elaborate on the rechunking and recompressing below.
- Finally, the data is uploaded to the data-management tools by `admix `_, which reads the data from
  the production folder and uploads it into ``rucio`` (our data management tool).

The first three steps run on the ``DAQ``; the last step runs on ``datamanager``, which is a server
from the computing group that is also on the LNGS network.

**Rechunking & recompression**

Restrax does rechunking, which is the process of combining multiple blocks of data into one.
Additionally, it recompresses the data, which means storing the data with heavier compression
algorithms. These take more CPU, but reduce the overall disk size of a given datatype, which is
especially useful for long-term storage.

**Why not have bootstrax do the rechunking/compression?**

In principle, ``bootstrax`` could also do the recompression and rechunking. However, there are
several issues with ``bootstrax`` doing this while also live-processing the data. First of all, the
memory usage would blow up massively if ``bootstrax`` were to rechunk all data types up to a chunk
size of ~1000 MB, as it would buffer data for each data type up until that chunk size, concatenate
the data and then store it. If ~50 data types are stored, this would give a memory consumption of up
to 50 x 1000 MB = 50 GB. If you also account for the concatenating (which doubles the memory
consumption), you quickly allocate 100 GB just for saving data, not taking into account the
requirements for the actual processing.

Additionally, for high rates, we do not always have the time for heavy compression algorithms, as
these take a lot of CPU. Doing those at a later time ensures that we keep processing live while
still applying heavy compression later.

**Restrax philosophy**

Restrax is designed as a lazy algorithm, doing one thing at a time and only updating the
runs-database after the job is done. It does allow for parallelization, but this should be used with
caution as it also increases the memory footprint.
The maximum memory usage can be approximated by ``2 * target_size_mb`` (from
``Restrax.get_compressor_and_size``) times the number of (raw-records) threads, so
``4 * target_size_mb`` (which corresponds to two threads). This usually maxes out at 20 GB for a
raw-records target size of 5 GB.

**Restrax configuration**

Most of the restrax configurations are set as class variables. These can be overwritten by a
document in the daq-database. For example, the snippet below sets ``max_workers`` to 5.

.. code-block:: python

    from straxen import daq_core
    db = daq_core.DataBases()
    db.daq_db['restrax_config'].update_one(
        {'name': 'restrax_config'},
        {'$set': {'user': 'angevaare',
                  'last_modified': daq_core.now(),
                  'max_workers': 5,  # <-- Increase the number of workers
                  }
         })

Several settings make ``restrax`` go faster:

- increase ``max_workers``. This increases the number of workers per data type. More workers use
  more memory.
- increase ``max_threads``. This increases the number of concurrent data types that are handled.
  More threads increase the memory footprint.
- decrease ``is_heavy_rate_mbs`` to a lower value. If the data rate is higher than this number,
  restrax will use faster (but less squeezy) compression algorithms for raw records.
- disable ``deep_compare``. This is a slow and over-engineered check that should only be used during
  testing.
- change ``target_compressor`` to faster compression algorithms.
- extend the ``skip_compression`` list of targets that are skipped during compression. Since most
  time is spent in raw-records (re)compression, this option only saves time if

Similarly, decreasing the options above often leads to a lower memory footprint. So does decreasing
the ``target_size`` to lower values, as restrax has to keep ``2 * target_size`` in memory for each
datatype it is handling at a given time. Additionally, setting ``process=False`` stops the
multithreaded processing (``process=True``). Setting ``process='process'`` changes the processing to
multicore instead of multithreaded.

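
The memory estimates quoted in this section can be reproduced with a short calculation. The sketch
below is only illustrative: the two helper functions are hypothetical and not part of straxen or
restrax, and the thread count of two is an assumption inferred from the ``4 * target_size_mb``
rule of thumb.

.. code-block:: python

    def bootstrax_rechunk_memory_mb(n_data_types, chunk_size_mb, concat_factor=2):
        """Hypothetical helper: buffer memory if bootstrax were to rechunk itself.

        Each data type is buffered up to ``chunk_size_mb``; concatenating the
        buffered data roughly doubles the footprint (``concat_factor=2``).
        """
        return n_data_types * chunk_size_mb * concat_factor


    def restrax_max_memory_mb(target_size_mb, n_threads):
        """Hypothetical helper: restrax keeps about 2 x the target size per thread."""
        return 2 * target_size_mb * n_threads


    # ~50 data types rechunked to ~1000 MB each, doubled by the concatenation
    print(bootstrax_rechunk_memory_mb(50, 1000))  # 100000 MB, i.e. about 100 GB

    # Raw-records target size of 5 GB, assuming two threads
    print(restrax_max_memory_mb(5000, 2))  # 20000 MB, i.e. about 20 GB

Lowering either the target size or the number of threads reduces the estimate linearly, which is
why those are the first settings to look at when memory is tight.
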
**Bypass mode**

If needed, restrax can be bypassed by passing the ``--bypass_mode`` argument. This will skip all
compression and rechunking steps, and will complete a run within ~0.5 s. It is advised to only do
this in conjunction with the ``--process RUN_ID`` argument and to use it for single runs, but this
is not required. Bypass mode can be activated through the configuration, as in the example above.

bootstrax [DAQ-only]
--------------------
The main DAQ processing script. It is discussed separately. It is only used for XENONnT.

microstrax
------------------
Mini strax interface that allows strax-data to be retrieved using HTTP requests on a given port. At
the time of writing, this is used on the DAQ as a pulse viewer.

refresh_raw_records
-------------------
Updates raw-records from old strax versions. This data is of a different format and needs to be
refreshed before it can be opened with more recent versions of strax.

*Last updated 2023-02-14. Joran Angevaare*