Tutorial

Jelle, updated May 2020

updated Feb 2022 by Joran

This notebook shows how to do basic analysis with straxen, much like hax.minitrees.

For reference, here are some jargon terms which we will introduce below:

  • Context: holds configuration on how to process.

  • Dataframe or array: a table of related information produced by a plugin.

  • Plugin: an algorithm that produces a dataframe.

  • Data type: specification of which columns are in a dataframe.

  • Data kind: e.g. ‘events’ or ‘peaks’. Dataframes of the same kind have the same number of rows and can be merged.

[1]:
import numpy as np
import pandas as pd

# This just ensures some comments in dataframes below display nicely
pd.options.display.max_colwidth = 100
import straxen
[2]:
straxen.print_versions()
cutax is not installed
[2]:
module version path git
0 python 3.10.0 /home/angevaare/miniconda3/envs/dev_strax/bin/python None
1 strax 1.1.7 /home/angevaare/software/dev_strax/strax/strax branch:master | db14f80
2 straxen 1.2.8 /home/angevaare/software/dev_strax/straxen/straxen branch:master | 024602e

Setting up

First we load a strax context, much like hax.init(). A strax context contains all information on how to process: where to read what files from, what plugins provide what data, etc.

You can make a context yourself using strax.Context, but straxen provides standardized contexts as well. Most future analyses will use such standardized contexts, defined by analysis coordinators or straxen maintainers.
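
For illustration, here is a minimal sketch of building a context by hand. The storage folder and the single registered plugin below are just placeholders; a real analysis would register many more plugins, or simply use one of the straxen-provided contexts as we do next:

import strax
import straxen

st_manual = strax.Context(
    # Local folder where processed data is read from and written to
    storage=[strax.DataDirectory("./strax_data")],
    # Plugin class(es) to register; normally a long list or a whole module
    register=straxen.PeakletClassification,
)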

Unlike hax.init, you can have multiple active contexts, e.g. to load analysis and MC data, or compare data processed with different settings (we will see examples of this below).

[3]:
st = straxen.contexts.xenonnt_online()

Finding your data

Suppose we want to make a cS1/cS2 plot. We have to figure out which type of dataframe to load. A specific type of dataframe is also called a data type (in hax these were called minitrees).

We can find this out automatically if we know (part of) the name of a field to load:

[4]:
st.search_field("cs1")

cs1 is part of corrected_areas (provided by CorrectedAreas)
cs1 is part of event_info (provided by EventInfo)

cs1 is used in CorrectedAreas.infer_dtype
cs1 is used in CorrectedAreas.compute
cs1 is used in EnergyEstimates.compute
cs1 is used in EnergyEstimates.cs1_to_e

It seems we’re after one of the data types called event_info or corrected_areas. In the current context, these are provided by plugins called EventInfo and CorrectedAreas, respectively (but this doesn’t concern us yet).

Additionally, we see where the cs1 field itself is used: it is accessed directly in the code of CorrectedAreas and EnergyEstimates.

Let’s see what else is in these data types:

[5]:
st.data_info("event_info")
[5]:
Field name Data type Comment
0 time int64 Start time since unix epoch [ns]
1 endtime int64 Exclusive end time since unix epoch [ns]
2 cs1 float32 Corrected area of main S1 [PE]
3 cs2_wo_elifecorr float32 Corrected area of main S2 before elife correction (s2 xy correction + SEG/EE correction applied)...
4 cs2_wo_timecorr float32 Corrected area of main S2 before SEG/EE and elife corrections(s2 xy correction applied) [PE]
... ... ... ...
106 z_naive float32 Interaction z-position using mean drift velocity only (cm)
107 r_naive float32 Interaction r-position using observed S2 positions directly (cm)
108 r_field_distortion_correction float32 Correction added to r_naive for field distortion (cm)
109 z_field_distortion_correction float32 Correction added to z_naive for field distortion (cm)
110 theta float32 Interaction angular position (radians)

111 rows × 3 columns

As you can see, event_info has a lot more information; let’s load that one. You can see from the documentation (TODO link) that event_info’s job is to merge the info from corrected_areas and other things.

Loading data

Next, you’ll want to select a run. The select_runs function will return a dataframe with all available runs; there is a separate tutorial on more advanced use of this. In this demo context, we only have high-level data for the run 180215_1029 available (and low-level data for another):

[ ]:
st.select_runs()

So lets’ take 021932.

To actually load data, you use get_df to get a pandas DataFrame, or get_array to get a numpy (structured) array. Let’s go with pandas for now:

[6]:
run_id = "021932"
# The seconds_range=[0,60] is an optional argument to prevent loading too much data at once
df = st.get_df(run_id, "event_info", seconds_range=[0, 60])

The first time you run this, it will take a moment: it has to actually process the data somewhat. We didn’t ship highest-level demo data with straxen: that would mean we’d have to constantly update the test data when the algorithms change.

You can also specify a list of run IDs instead of a single run, and get the concatenated result back. Likewise, you can specify multiple data types (of the same kind) to load, and they will be merged for you, as sketched below.
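
A sketch of both (the second run ID is a placeholder, and event_pattern_fit is just an example of another event-level data type):

# Several runs: the results are concatenated
df_many = st.get_df(["021932", "021933"], "event_info", seconds_range=[0, 60])

# Several data types of the same kind: their columns are merged into one dataframe
df_merged = st.get_df(run_id, ("event_info", "event_pattern_fit"), seconds_range=[0, 60])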

Just like hax.minitrees.load, get_df gave us a dataframe back. You can also inspect the configuration options that were in effect with show_config:

[7]:
st.show_config("event_info")
[7]:
option default current applies_to help
0 gain_model <OMITTED> (to_pe_model, ONLINE, True) (afterpulses,) PMT gain model. Specify as (model_type, model_config)
1 n_tpc_pmts <OMITTED> 494 (afterpulses,) Number of PMTs in TPC
2 LED_window_left 50 <OMITTED> (afterpulses,) Left boundary of sample range for LED pulse integration
3 LED_window_right 100 <OMITTED> (afterpulses,) Right boundary of sample range for LED pulse integration
4 baseline_samples 40 <OMITTED> (afterpulses,) Number of samples to use at start of WF to determine the baseline
... ... ... ... ... ...
269 recon_alg_included (_mlp, _gcn, _cnn) <OMITTED> (s2_recon_pos_diff,) The list of all reconstruction algorithm considered.
270 max_veto_window 7200000000000 <OMITTED> (veto_intervals,) Maximum separation between veto stop and start pulses [ns]. Set to be >> than the max duration o...
271 event_window_fields (time, endtime) <OMITTED> (veto_proximity,) Fields to determine where to look for overlaps for using this plugin in the events. The default ...
272 veto_proximity_window 300000000000 <OMITTED> (veto_proximity,) Maximum separation between veto stop and start pulses [ns]
273 time_no_aqmon_veto_found 3600000000000 <OMITTED> (veto_proximity,) If no next/previous veto is found, we will fill the fields time_to_previous_XX with this time. S...

274 rows × 5 columns

[8]:
df.head()
[8]:
time endtime cs1 cs2_wo_elifecorr cs2_wo_timecorr cs2_area_fraction_top cs2_bottom cs2 alt_cs1 alt_cs2_wo_elifecorr ... alt_s2_y_mlp x y z r z_naive r_naive r_field_distortion_correction z_field_distortion_correction theta
0 1623144887003148092 1623144887005863030 2.659672 348.436493 370.887268 0.777278 82.604637 370.887268 NaN 34.098747 ... -23.990379 -7.353356 -15.314320 -43.339104 16.988239 -43.339104 16.190155 0.798083 0.007348 -2.018448
1 1623144887006147772 1623144887008854250 13155.237305 NaN NaN NaN NaN NaN 4.279365 NaN ... NaN 45.517769 55.207436 -0.189628 71.552284 -0.189628 69.542831 2.009451 0.000000 0.881302
2 1623144887030319882 1623144887033327360 61723.960938 28599.917969 29029.152344 0.723225 8034.542969 29029.152344 19.304628 1338.243530 ... 53.898460 -35.264309 55.141556 -10.200156 65.453514 -10.200156 61.628628 3.824886 0.744288 2.139771
3 1623144887054958322 1623144887057677700 76545.257812 1266.177612 1266.558594 0.159955 1063.966309 1266.558594 2.062342 17.207802 ... 2.884162 31.293451 9.913309 -0.026945 32.826115 -0.026945 37.803139 -4.977024 0.000000 0.306784
4 1623144887127075472 1623144887129777490 NaN 5178.317383 NaN 0.749424 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 61.323208 NaN NaN 2.864871

5 rows × 111 columns

Let’s make a quick plot of the events we just loaded:

[9]:
import matplotlib.pyplot as plt

df.plot.scatter("cs1", "cs2")
plt.xscale("log")
plt.xlim(1, None)
plt.yscale("log")
../_images/tutorials_strax_demo_23_0.png

Making a cS1 vs cS2 plot for a dataset is such a common task that straxen has a built-in method for it. There are other similar mini-analyses, such as waveform plotting, which we will see in action below.

[10]:
st.event_scatter(run_id, s=20, seconds_range=[0, 60])
../_images/tutorials_strax_demo_25_0.png

Can you guess what kind of data this is?

Waveform analysis

The peaks data type contains the sum waveform information:

[11]:
st.data_info("peaks")
[11]:
Field name Data type Comment
0 time int64 Start time since unix epoch [ns]
1 length int32 Length of the interval in samples
2 dt int32 Width of one sample [ns]
3 channel int16 Channel/PMT number
4 type int8 Classification of the peak(let)
5 area float32 Integral across channels [PE]
6 area_per_channel ('<f4', (494,)) Integral per channel [PE]
7 n_hits int32 Number of hits contributing at least one sample to the peak
8 data ('<f4', (200,)) Waveform data in PE/sample (not PE/ns!)
9 width ('<f4', (11,)) Peak widths in range of central area fraction [ns]
10 area_decile_from_midpoint ('<f4', (11,)) Peak widths: time between nth and 5th area decile [ns]
11 saturated_channel ('i1', (494,)) Does the channel reach ADC saturation?
12 n_saturated_channels int16 Total number of saturated channels
13 tight_coincidence int16 Channel within tight range of mean
14 max_gap int32 Largest gap between hits inside peak [ns]
15 max_goodness_of_split float32 Maximum interior goodness of split

Notice the compound data types of the data, width and saturated_channel fields. Pandas does not support such types (well, it sort of does, but the resulting dataframes are quite inefficient), so we have to load this as a numpy array.

[12]:
peaks = st.get_array(run_id, "peaks", seconds_range=[0, 60])
type(peaks), peaks.dtype.names
[12]:
(numpy.ndarray,
 ('time',
  'length',
  'dt',
  'channel',
  'type',
  'area',
  'area_per_channel',
  'n_hits',
  'data',
  'width',
  'area_decile_from_midpoint',
  'saturated_channel',
  'n_saturated_channels',
  'tight_coincidence',
  'max_gap',
  'max_goodness_of_split'))
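
If you do want a pandas dataframe with just a few of the scalar fields, a minimal sketch (the chosen fields are arbitrary):

scalar_fields = ["time", "area", "type", "n_hits"]
peaks_df = pd.DataFrame({f: peaks[f] for f in scalar_fields})
peaks_df.head()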

Now we can plot peak waveforms:

[13]:
def plot_peak(p, t0=None, **kwargs):
    n = p["length"]
    if t0 is None:
        t0 = p["time"]
    plt.plot(
        (p["time"] - t0) + np.arange(n) * p["dt"],
        p["data"][:n] / p["dt"],
        drawstyle="steps-mid",
        **kwargs
    )
    plt.xlabel("Time (ns)")
    plt.ylabel("Sum waveform (PE / ns)")


plot_peak(peaks[148])
plt.show()
../_images/tutorials_strax_demo_33_0.png
[14]:
def plot_peaks(main_i, n_before=0, n_after=0, label_threshold=0, legendloc="best"):
    # Note: main_i indexes into peaks[mask], using the mask defined below
    for i in main_i + np.arange(-n_before, n_after + 1):
        p = peaks[mask][i]
        label = None
        if p["area"] > label_threshold:
            label = "%d PE, %d ns dt" % (
                p["area"],
                p["dt"],
            )
            color = None
        else:
            color = "gray"
        plot_peak(p, t0=peaks[mask][main_i]["time"], label=label, color=color)
    plt.ylim(0, None)
    plt.legend(loc=legendloc)
    plt.yscale("symlog")


# Find the largest peak with area below 1e4 PE
mask = peaks["area"] < 1e4
i_of_largest_peak = np.argmax(peaks[mask]["area"])
plot_peaks(i_of_largest_peak, n_after=5, n_before=2, label_threshold=10, legendloc=(1.1, 0.0))
../_images/tutorials_strax_demo_34_0.png

The abrupt termination of the S2 above is due to strax’s data reduction.

If you have access to the raw data (at least the records level) you can use straxen’s built-in waveform display. For example, try:

st.waveform_display(run_id, seconds_range=(0, 0.15))

(we didn’t evaluate this in the tutorial, as it creates a substantial amount of javascript, which would have made the notebook quite huge).

Configuration changes

As you can see in the above plot, we have many events high up in the TPC at low S1. Perhaps you want to get rid of them by increasing the ‘S1 coincidence requirement’, i.e. the number of PMTs that must see something before a peak is labeled as S1. Then, of course, you want to load the event-level data again to see if it worked.

First, we need to see which configuration option we have to change. Strax plugins declare what configuration they take and what other plugins they depend on, so this is not very difficult. We just ask which options with s1 in their name influence event_basics:

[29]:
st.show_config("event_basics", "s1*")[["option", "applies_to", "help", "current", "default"]]
[29]:
option applies_to help current default
0 s1_xyz_map (corrected_areas,) <OMITTED> itp_map://resource://cmt://format://s1_xyz_map_{algo}?version=ONLINE&run_id=plugin.run_id&fmt=js...
1 s1_risetime_area_parameters (peaklet_classification,) norm, const, tau in the empirical boundary in the risetime-area plot <OMITTED> (50, 80, 12)
2 s1_risetime_aft_parameters (peaklet_classification,) Slope and offset in exponential of emperical boundary in the rise time-AFT plot. Specified as (s... <OMITTED> (-1, 2.6)
3 s1_flatten_threshold_aft (peaklet_classification,) Threshold for AFT, above which we use a flatted boundary for rise timeSpecified values: (AFT bou... <OMITTED> (0.6, 100)
4 s1_max_rise_time_post100 (peaklet_classification,) Maximum S1 rise time for > 100 PE [ns] <OMITTED> 200
5 s1_min_coincidence (peaklet_classification,) Minimum tight coincidence necessary to make an S1 <OMITTED> 2
6 s1_optical_map (event_pattern_fit,) S1 (x, y, z) optical/pattern map. <OMITTED> XENONnT_s1_xyz_patterns_LCE_corrected_qes_MCva43fa9b_wires.pkl
7 s1_aft_map (event_pattern_fit,) Date drive S1 area fraction top map. <OMITTED> s1_aft_dd_xyz_XENONnT_Kr83m_41500eV_31Oct2021.json
8 s1_min_area_pattern_fit (event_pattern_fit,) Skip EventPatternFit reconstruction if S1 area (PE) is less than this <OMITTED> 2
9 s1_risetime_area_parameters (peaklet_classification_he,) norm, const, tau in the empirical boundary in the risetime-area plot <OMITTED> (50, 80, 12)
10 s1_risetime_aft_parameters (peaklet_classification_he,) Slope and offset in exponential of emperical boundary in the rise time-AFT plot. Specified as (s... <OMITTED> (-1, 2.6)
11 s1_flatten_threshold_aft (peaklet_classification_he,) Threshold for AFT, above which we use a flatted boundary for rise timeSpecified values: (AFT bou... <OMITTED> (0.6, 100)
12 s1_max_rise_time_post100 (peaklet_classification_he,) Maximum S1 rise time for > 100 PE [ns] <OMITTED> 200
13 s1_min_coincidence (peaklet_classification_he,) Minimum tight coincidence necessary to make an S1 <OMITTED> 2

Looks like we're after the s1_min_coincidence option. Note this is not part of the event_basics data type, but of a data type called peaklet_classification. As you can see from the table, this option is not set in the current context, so the default value (2) is used.

To try out a different option, just pass it to get_df:

[30]:
# Let's use a short run for the following examples
run_id = "038769"
[31]:
df_2 = st.get_df(
    run_id,
    "event_info",
    config=dict(s1_min_coincidence=50),
)
st.event_scatter(run_id, events=df_2)
../_images/tutorials_strax_demo_43_1.png

Notice all the small S1 events are indeed gone now.

Behind the scenes, strax figured out which data types had to be remade: the option applies to peaklet_classification, so that data type and everything downstream of it (up to the event_info we asked for) was recomputed. For the data types that strax saves by default, you will now find new folders in your output directory, with a different hash in their names (e.g. event_basics_<somehash>).
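
If you are curious which lineage hash a context will use for a data type (and hence which folder it reads and writes), here is a sketch; st_tmp is just an illustrative name for a throwaway context:

st_tmp = st.new_context(config=dict(s1_min_coincidence=50))
# The DataKey printed below includes the lineage hash that appears in the folder name
print(st.key_for(run_id, "event_basics"))
print(st_tmp.key_for(run_id, "event_basics"))
# Check whether data for this lineage is already stored
st_tmp.is_stored(run_id, "event_basics")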

More on configuration changes

Changing configuration can be done in two other ways. We can change it permanently in the current context:

st.set_config(dict(s1_min_coincidence=50))

Or we could make a new context, with this option set:

st_2 = st.new_context(config=dict(s1_min_coincidence=50))

(feeding it to get_df just does the latter behind the scenes).

If you just want to run a mini-analysis (like event_scatter), you can also pass a new config option directly to it, as in the example below.

Strax protects you from typos in the configuration. Suppose we typed s1_min_n_channelz instead:

[32]:
st.event_scatter(run_id, config=dict(s1_min_n_channelz=10), seconds_range=[0, 60])
Option s1_min_n_channelz not taken by any registered plugin
Option s1_min_n_channelz not taken by any registered plugin
---------------------------------------------------------------------------
DataNotAvailable                          Traceback (most recent call last)
Input In [32], in <module>
----> 1 st.event_scatter(run_id, config=dict(s1_min_n_channelz=10), seconds_range=[0,60])

File ~/software/dev_strax/straxen/straxen/mini_analysis.py:106, in mini_analysis.<locals>.decorator.<locals>.wrapped_f(context, run_id, **kwargs)
    100         kwargs[dkind] = strax.apply_selection(
    101             kwargs[dkind],
    102             selection_str=kwargs['selection_str'],
    103             time_range=kwargs['time_range'],
    104             time_selection=kwargs['time_selection'])
    105     else:
--> 106         kwargs[dkind] = context.get_array(
    107             run_id,
    108             dtypes,
    109             selection_str=kwargs['selection_str'],
    110             time_range=kwargs['time_range'],
    111             time_selection=kwargs['time_selection'],
    112             # Arguments for new context, if needed
    113             config=kwargs.get('config'),
    114             register=kwargs.get('register'),
    115             storage=kwargs.get('storage', tuple()),
    116             progress_bar=False,
    117         )
    119 # If user did not give time kwargs, but the function expects
    120 # a time_range, try to add one based on the time range of the data
    121 base_dkind = list(deps_by_kind.keys())[0]

File ~/software/dev_strax/strax/strax/context.py:1406, in Context.get_array(self, run_id, targets, save, max_workers, **kwargs)
   1399 else:
   1400     source = self.get_iter(
   1401         run_ids[0],
   1402         targets,
   1403         save=save,
   1404         max_workers=max_workers,
   1405         **kwargs)
-> 1406     results = [x.data for x in source]
   1408 results = np.concatenate(results)
   1409 return results

File ~/software/dev_strax/strax/strax/context.py:1406, in <listcomp>(.0)
   1399 else:
   1400     source = self.get_iter(
   1401         run_ids[0],
   1402         targets,
   1403         save=save,
   1404         max_workers=max_workers,
   1405         **kwargs)
-> 1406     results = [x.data for x in source]
   1408 results = np.concatenate(results)
   1409 return results

File ~/software/dev_strax/strax/strax/context.py:1234, in Context.get_iter(self, run_id, targets, save, max_workers, time_range, seconds_range, time_within, time_selection, selection_str, keep_columns, drop_columns, allow_multiple, progress_bar, _chunk_number, **kwargs)
   1225     elif (self.context_config['timeout'] > 7200 or (
   1226             self.context_config['allow_lazy'] and
   1227             not self.context_config['allow_multiprocess'])):
   1228         # For allow_multiple we don't want allow this when in lazy mode
   1229         # with long timeouts (lazy-mode is disabled if multiprocessing
   1230         # so if that is activated, we can also continue)
   1231         raise RuntimeError(f'Cannot allow_multiple in lazy mode or '
   1232                            f'with long timeouts.')
-> 1234 components = self.get_components(run_id,
   1235                                  targets=targets,
   1236                                  save=save,
   1237                                  time_range=time_range,
   1238                                  chunk_number=_chunk_number)
   1240 # Cleanup the temp plugins
   1241 for k in list(self._plugin_class_registry.keys()):

File ~/software/dev_strax/strax/strax/context.py:979, in Context.get_components(self, run_id, targets, save, time_range, chunk_number)
    975             savers = self._add_saver(savers, d_to_save, target_plugin,
    976                                      _is_superrun, loading_this_data)
    978 for target_i in targets:
--> 979     check_cache(target_i)
    980 plugins = to_compute
    982 intersec = list(plugins.keys() & loaders.keys())

File ~/software/dev_strax/strax/strax/context.py:894, in Context.get_components.<locals>.check_cache(target_i)
    891     if target_plugin.save_when[target_i] == strax.SaveWhen.TARGET:
    892         error_message += (f"\nFirst run st.make({run_id}, "
    893                           f"{target_i}) to make {target_i}.")
--> 894     raise strax.DataNotAvailable(error_message)
    895 if '*' in self.context_config['forbid_creation_of']:
    896     raise strax.DataNotAvailable(
    897         f"{target_i} for {run_id} not found in any storage, and "
    898         "your context specifies no new data can be created.")

DataNotAvailable: Time range selection assumes data is already available, but event_info for 038769 is not.

As the warning says, s1_min_n_channelz is not taken by any registered plugin, so the result would be just the same as if the option wasn't set (just like in pax/hax); you only get a warning about the unknown configuration option. (The DataNotAvailable error above is a separate issue: the seconds_range selection requires event_info for this run to be available already.)

By the way, you can use

import warnings
warnings.filterwarnings("error")

to ensure any warning raises an exception instead.

Customization: new plugins

To add or change processing algorithms, or to define new variables to use in cuts, you have to write new strax plugins. These are somewhat similar to hax’s treemakers.

Suppose you have a brilliant new idea for peak classification. Strax does this in the peaklet_classification plugin, which produces:

[33]:
st.data_info("peaklet_classification")
[33]:
Field name Data type Comment
0 time int64 Start time since unix epoch [ns]
1 length int32 Length of the interval in samples
2 dt int32 Width of one sample [ns]
3 channel int16 Channel/PMT number
4 type int8 Classification of the peak(let)
  • The first three fields contain time information of the peak. This is duplicated in many datatypes – unfortunately, this is necessary for strax to be able to track the data and combine it with other data. Instead of (time, length, dt), plugins could also provide (time, endtime). See here for more information.

  • The ‘channel’ field is an historical artifact.

  • The ‘type’ field contains the classification: 0 for unknown, 1 for S1, 2 for S2 (note #8). A short selection example follows this list.
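
For example, using the peaks array loaded earlier, you can select peaks by classification like this:

# Select classified S1s and S2s from the peaks array loaded above
s1s = peaks[peaks["type"] == 1]
s2s = peaks[peaks["type"] == 2]
len(s1s), len(s2s)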

You can find the original plugin in peaklet_processing.py. Here's how you would make a different classification plugin:

[34]:
import strax
import numpy as np


class AdvancedExpertClassification(strax.Plugin):
    """Everything is an S1!"""

    # Name of the data type this plugin provides
    provides = "peaklet_classification"

    # Data types this plugin requires. Note we don't specify
    # what plugins should produce them: maybe the default PeakBasics
    # has been replaced by another AdvancedExpertBlabla plugin?
    depends_on = ("peaklets",)

    # Numpy datatype of the output
    dtype = straxen.PeakletClassification.dtype

    # Version of the plugin. Increment this if you change the algorithm.
    __version__ = "0.0.2"

    def compute(self, peaklets):
        # Your code here.
        # This function will be called several times, each time with
        # 'peaklets': a numpy array of the 'peaklets' datatype
        # covering a small part of the run.

        # You have to return a numpy array of the dtype you declared above
        result = np.zeros(len(peaklets), self.dtype)

        # Copy the basic time fields over from peaklets
        for (_, field), _ in strax.time_dt_fields:
            result[field] = peaklets[field]

        # Store the classification results
        # You might want to do real work here
        result["type"] = 1

        return result

        # Instead of an array, you are also allowed to return a dictionary
        # we can transform into an array.
        # That is, (dict keys -> field names, values -> field values)
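
As the final comment notes, compute may also return a dictionary that strax converts into an array of the declared dtype. A minimal sketch of what that could look like (the class name and version below are made up for illustration):

class DictReturningClassification(AdvancedExpertClassification):
    """Same as above, but compute returns a dict instead of an array."""

    __version__ = "0.0.3"

    def compute(self, peaklets):
        # A dict of field name -> values is converted by strax
        # into an array of self.dtype
        return dict(
            time=peaklets["time"],
            length=peaklets["length"],
            dt=peaklets["dt"],
            channel=peaklets["channel"],
            type=np.ones(len(peaklets), dtype=np.int8),
        )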

To use it in place of PeakletClassification, we only have to register it. Again, we can do so permanently using

st.register(AdvancedExpertClassification)

or temporarily, by feeding the registration as an extra argument to get_df:

[35]:
df = st.get_df(
    run_id,
    "event_info",
    register=AdvancedExpertClassification,
)
Found chunk without events?!
Found chunk without events?!
Found chunk without events?!
[36]:
df["s2_area"].max()
[36]:
nan

As you can see, all events are now S1-only events, as expected. Maybe this is not the best alternative classification :-)

This was a rather basic plugin. You'll also want to learn about LoopPlugins and OverlapWindowPlugins, but that's beyond the scope of this tutorial.