Superruns

Basic concept of a superrun:

A superrun is made up of many regular runs and helps us therefore to organize data in logic units and to load it faster. In the following notebook we will give some brief examples how superruns work and can be used to make analysts lives easier.

Let’s get started how we can define superruns. The example I demonstrate here is based on some dummy Record and Peak plugins. But it works in the same way for regular data.

[1]:

import strax
import straxen

Define context and create some dummy data:

In the subsequent cells I create a dummy context and write some dummy-data. You can either read through it if you are interested or skip until Define a superrun. For the working examples on superruns you only need to know:

Superruns can be created with any of our regular online and offline contexts.
In the two cells below I define 3 runs and records for the run_ids 0, 1, 2.
The constituents of a superrun are called subruns which we call runs.

[2]:

from strax.testutils import Records, Peaks, PeakClassification

superrun_name = "_superrun_test"
st = strax.Context(
    storage=[
        strax.DataDirectory(
            "./strax_data", provide_run_metadata=True, readonly=False, deep_scan=True
        )
    ],
    register=[Records, Peaks, PeakClassification],
    config={"bonus_area": 42},
)
st.set_context_config({"use_per_run_defaults": False})

[3]:

import datetime
import pytz

import numpy as np

import json
from bson import json_util


def _write_run_doc(context, run_id, time, endtime):
    """Function which writes a dummy run document."""
    run_doc = {"name": run_id, "start": time, "end": endtime}
    with open(context.storage[0]._run_meta_path(str(run_id)), "w") as fp:
        json.dump(run_doc, fp, sort_keys=True, indent=4, default=json_util.default)


offset_between_subruns = 10

now = datetime.datetime.now()
now.replace(tzinfo=pytz.utc)
subrun_ids = [str(r) for r in range(3)]

for run_id in subrun_ids:
    rr = st.get_array(run_id, "peaks")
    time = np.min(rr["time"])
    endtime = np.max(strax.endtime(rr))

    _write_run_doc(
        st,
        run_id,
        now + datetime.timedelta(0, int(time)),
        now + datetime.timedelta(0, int(endtime)),
    )

    st.set_config({"secret_time_offset": endtime + offset_between_subruns})  # untracked option
    assert st.is_stored(run_id, "peaks")

If we print now the lineage and hash for the three runs you will see it is equivalent to our regular data.

[4]:

print(st.key_for("2", "peaks"))
st.key_for("2", "peaks").lineage

2-peaks-xia2iit6vb

[4]:

{'peaks': ('Peaks',
  '0.0.0',
  {'bonus_area': 42, 'base_area': 0, 'give_wrong_dtype': False}),
 'records': ('Records', '0.0.0', {'crash': False, 'dummy_tracked_option': 42})}

Metadata of our subruns:

To understand a bit better how our dummy data looks like we can have a look into the metadata for a single run. Each subrun is made of 10 chunks each containing 10 waveforms in 10 different channels.

[5]:

st.get_metadata("2", "peaks")

[5]:

{'chunk_target_size_mb': 200,
 'chunks': [{'chunk_i': 0,
   'end': 50,
   'filename': 'peaks-xia2iit6vb-000000',
   'filesize': 1323,
   'first_endtime': 41,
   'first_time': 40,
   'last_endtime': 50,
   'last_time': 49,
   'n': 100,
   'nbytes': 223500,
   'run_id': '2',
   'start': 40,
   'subruns': None}],
 'compressor': 'blosc',
 'data_kind': 'peaks',
 'data_type': 'peaks',
 'dtype': "[(('Start time since unix epoch [ns]', 'time'), '<i8'), (('Length of the interval in samples', 'length'), '<i4'), (('Width of one sample [ns]', 'dt'), '<i4'), (('Channel/PMT number', 'channel'), '<i2'), (('Classification of the peak(let)', 'type'), '|i1'), (('Integral across channels [PE]', 'area'), '<f4'), (('Integral per channel [PE]', 'area_per_channel'), '<f4', (100,)), (('Number of hits contributing at least one sample to the peak ', 'n_hits'), '<i4'), (('Waveform data in PE/sample (not PE/ns!)', 'data'), '<f4', (200,)), (('Waveform data in PE/sample (not PE/ns!), top array', 'data_top'), '<f4', (200,)), (('Peak widths in range of central area fraction [ns]', 'width'), '<f4', (11,)), (('Peak widths: time between nth and 5th area decile [ns]', 'area_decile_from_midpoint'), '<f4', (11,)), (('Does the channel reach ADC saturation?', 'saturated_channel'), '|i1', (100,)), (('Total number of saturated channels', 'n_saturated_channels'), '<i2'), (('Channel within tight range of mean', 'tight_coincidence'), '<i2'), (('Largest gap between hits inside peak [ns]', 'max_gap'), '<i4'), (('Maximum interior goodness of split', 'max_goodness_of_split'), '<f4'), (('Largest time difference between apexes of hits inside peak [ns]', 'max_diff'), '<i4'), (('Smallest time difference between apexes of hits inside peak [ns]', 'min_diff'), '<i4')]",
 'end': 50,
 'lineage': {'peaks': ['Peaks',
   '0.0.0',
   {'base_area': 0, 'bonus_area': 42, 'give_wrong_dtype': False}],
  'records': ['Records',
   '0.0.0',
   {'crash': False, 'dummy_tracked_option': 42}]},
 'lineage_hash': 'xia2iit6vb',
 'run_id': '2',
 'start': 40,
 'strax_version': '1.6.5',
 'writing_ended': 1724569392.7568176,
 'writing_started': 1724569392.7369668}

Define a superrun:

Defining a superrun is quite simple one has to call:

[6]:

st.define_run(superrun_name, subrun_ids)
print("superrun_name: ", superrun_name, "\nsubrun_ids: ", subrun_ids)

superrun_name:  _superrun_test
subrun_ids:  ['0', '1', '2']

where the first argument is a string specifying the name of the superrun e.g. _Kr83m_20200816. Please note that superrun names must start with an underscore.

The second argument is a list of run_ids of subruns the superrun should be made of. Please note that the definition of a superrun does not need any specification of a data_kind like peaks or event_info because it is a “run”.

By default, it is only allowed to store new runs under the usere’s specified strax_data directory. In this example it is simply ./strax_data and the run_meta data can be looked at via:

[7]:

st.run_metadata("_superrun_test")

[7]:

{'comments': [{'comment': ''}],
 'end': datetime.datetime(2024, 8, 25, 2, 5, 48, 884000),
 'livetime': 30.0,
 'mode': [''],
 'name': '_superrun_test',
 'source': [''],
 'start': datetime.datetime(2024, 8, 25, 2, 4, 58, 884000),
 'sub_run_spec': {'0': 'all', '1': 'all', '2': 'all'},
 'tags': [{'name': ''}]}

The superrun-metadata contains a list of all subruns making up the superrun, the start and end time (in milliseconds) of the corresponding collections of runs and its naive livetime in nanoseconds without any corrections for deadtime.

Please note that in the presented example the time difference between start and end time is 50 s while the live time is only about 30 s. This comes from the fact that I defined the time between two runs to be 10 s. It should be always kept in mind for superruns that livetime is not the same as the end - start of the superrun.

The superun will appear in the run selection as any other run:

[8]:

st.select_runs()

[8]:

	name	number
0	0	0.0
1	1	1.0
2	2	2.0
3	_superrun_test	NaN
4	024399	24399.0

Loading data with superruns:

Loading superruns can be done in two different ways. Lets try first the already implemented approach and compare the data with loading the individual runs separately:

[9]:

sub_runs = st.get_array(subrun_ids, "peaks")  # Loading all subruns individually like we are used to
superrun = st.get_array(superrun_name, "peaks")  # Loading the superrun
assert np.all(sub_runs["time"] == superrun["time"])  # Comparing if the data is the same

To increase the loading speed it can be allowed to skip the lineage check of the individual subruns:

[10]:

sub_runs = st.get_array(subrun_ids, "peaks")
superrun = st.get_array(superrun_name, "peaks", _check_lineage_per_run_id=False)
assert np.all(sub_runs["time"] == superrun["time"])

Unknown config option _check_lineage_per_run_id; will do nothing.
Invalid context option _check_lineage_per_run_id; will do nothing.

So how does this magic work? Under the hood a superrun first checks if the data of the different subruns has been created before. If not it will make the data for you. After that the data of the individual runs is loaded.

The loading speed can be further increased if we rechunk and write the data of our superrun as “new” data to disk. This can be done easily for light weight data_types like peaks and above. Further, this allows us to combine multiple data_types if the same data_kind, like for example event_info and cuts.

Writing a “new” superrun:

To write a new superrun one has to set the corresponding context setting to true:

[11]:

st.set_context_config({"write_superruns": True})

[12]:

st.is_stored(superrun_name, "peaks")

[12]:

True

[13]:

st.make(superrun_name, "peaks")
st.is_stored(superrun_name, "peaks")

[13]:

True

Lets see if the data is the same:

[14]:

sub_runs = st.get_array(subrun_ids, "peaks")
superrun = st.get_array(superrun_name, "peaks", _check_lineage_per_run_id=False)
assert np.all(sub_runs["time"] == superrun["time"])

Unknown config option _check_lineage_per_run_id; will do nothing.
Invalid context option _check_lineage_per_run_id; will do nothing.

And the data will now shown as available in select runs:

[15]:

st.select_runs(available=("peaks",))

[15]:

	name	number	peaks_available
0	0	0.0	True
1	1	1.0	True
2	2	2.0	True
3	_superrun_test	NaN	True

If a some data does not exist for a super run we can simply created it via the superrun_id. This will not only create the data of the rechunked superrun but also the data of the subrungs if not already stored:

[16]:

st.is_stored(subrun_ids[0], "peak_classification")

[16]:

False

[17]:

st.make(superrun_name, "peak_classification")
st.is_stored(subrun_ids[0], "peak_classification")

[17]:

False

[18]:

peaks = st.get_array(superrun_name, "peak_classification")

Some developer information:

In case of a stored and rechunked superruns every chunk has also now some additional information about the individual subruns it is made of:

[19]:

for chunk in st.get_iter(superrun_name, "peaks"):
    chunk
chunk.subruns, chunk.run_id

[19]:

({'0': {'end': 10, 'start': 0},
  '1': {'end': 30, 'start': 20},
  '2': {'end': 50, 'start': 40}},
 '_superrun_test')

The same goes for the meta data:

[20]:

st.get_metadata(superrun_name, "peaks")["chunks"]

[20]:

[{'chunk_i': 0,
  'end': 50,
  'filename': 'peaks-xia2iit6vb-000000',
  'filesize': 3338,
  'first_endtime': 1,
  'first_time': 0,
  'last_endtime': 50,
  'last_time': 49,
  'n': 300,
  'nbytes': 670500,
  'run_id': '_superrun_test',
  'start': 0,
  'subruns': {'0': {'end': 10, 'start': 0},
   '1': {'end': 30, 'start': 20},
   '2': {'end': 50, 'start': 40}}}]