Tutorial

Jelle, updated May 2020

This notebook shows how to do basic analysis with straxen, much like hax.minitrees.

For reference, here are some jargon terms which we will introduce below:

  • Context: Holds configuration on how to process

  • Dataframe or array: table of related information produced by a plugin.

  • Plugin: an algorithm that produces a dataframe

  • Data type: specification of which columns are in a dataframe.

  • Data kind: e.g. ‘events’ or ‘peaks’. Dataframes of the same kind have the same number of rows and can be merged.

[1]:
12+1
[1]:
13
[2]:
import numpy as np
# This just ensures some comments in dataframes below display nicely
import pandas as pd
pd.options.display.max_colwidth = 100

Setting up

First we load a strax context, much like hax.init(). A strax context contains all information on how to process: where to read what files from, what plugins provide what data, etc.

You can make a context yourself using strax.Context, but straxen provides standardized contexts as well. Most future analyses will use such standardized contexts, defined by analysis coordinators or straxen maintainers.

Unlike hax.init, you can have multiple active contexts, e.g. to load analysis and MC data, or compare data processed with different settings (we will see examples of this below).

[3]:
import straxen
st = straxen.contexts.xenon1t_dali()
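
For example, nothing stops you from holding a second context next to this one (a sketch, not evaluated here; straxen.contexts.demo is another standard context, meant for small test data):

st_demo = straxen.contexts.demo()

Each context keeps its own configuration and storage options, so settings changed in one do not affect the other.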

Finding your data

Suppose we want to make a cS1/cS2 plot. We have to figure out which type of dataframe to load. A specific type of dataframe is also called a data type (in hax these were called minitrees).

We can find this out automatically if we know (part of) the name of a field to load:

[4]:
st.search_field('cs1')
cs1 is part of corrected_areas (provided by CorrectedAreas)
cs1 is part of event_info (provided by EventInfo)

It seems we’re after one of the data types called event_info or corrected_areas. In the current context, these are provided by plugins called EventInfo and CorrectedAreas, respectively (but this doesn’t concern us yet).

Let’s see what else is in these data types:

[5]:
st.data_info('event_info')
[5]:
Field name Data type Comment
0 cs1 float32 Corrected S1 area [PE]
1 cs2 float32 Corrected S2 area [PE]
2 alt_cs1 float32 Corrected area of the alternate S1 [PE]
3 alt_cs2 float32 Corrected area of the alternate S2 [PE]
4 time int64 Start time since unix epoch [ns]
... ... ... ...
59 z_naive float32 Interaction z-position using mean drift velocity only (cm)
60 r_naive float32 Interaction r-position using observed S2 positions directly (cm)
61 r_field_distortion_correction float32 Correction added to r_naive for field distortion (cm)
62 theta float32 Interaction angular position (radians)
63 event_number int64 Event number in this dataset

64 rows × 3 columns

As you can see, event_info has a lot more information; let’s load that one. You can see from the documentation (TODO link) that event_info’s job is to merge the info from corrected_areas and other things.

Loading data

Next, you'll want to select a run. The select_runs function returns a dataframe with all runs the context knows about, together with which data types are available for each; there is a separate tutorial on more advanced run selection:

[6]:
st.select_runs()
Checking data availability: 100%|██████████| 5/5 [00:38<00:00,  7.60s/it]
[6]:
name number start reader.ini.name trigger.events_built end tags mode livetime tags.name events_available event_info_available peaklets_available records_available raw_records_available
0 170204_1410 6786 2017-02-04 14:10:08+00:00 background_stable 19574 2017-02-04 15:10:13+00:00 blinded,_sciencerun1_candidate,_sciencerun1 background_stable 01:00:05 NaN True True True True True
1 170204_1510 6787 2017-02-04 15:10:28+00:00 background_stable 19634 2017-02-04 16:10:32+00:00 blinded,_sciencerun1_candidate,_sciencerun1 background_stable 01:00:04 NaN True True True True True
2 170204_1610 6788 2017-02-04 16:10:39+00:00 background_stable 19400 2017-02-04 17:10:43+00:00 blinded,_sciencerun1_candidate,_sciencerun1 background_stable 01:00:04 NaN True True True True True
3 170204_1710 6789 2017-02-04 17:10:51+00:00 background_stable 19415 2017-02-04 18:10:54+00:00 blinded,_sciencerun1_candidate,_sciencerun1 background_stable 01:00:03 NaN True True True True True
4 170204_1810 6790 2017-02-04 18:11:01+00:00 background_stable 19671 2017-02-04 19:11:05+00:00 blinded,_sciencerun1_candidate,_sciencerun1 background_stable 01:00:04 NaN True True True True True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
211 181027_2044 23315 2018-10-27 20:44:21+00:00 ar37_stable 36496 2018-10-27 21:44:25+00:00 Kr83m,_sciencerun2_candidate,_sciencerun2_preliminary ar37_stable 01:00:04 NaN True True True True True
212 181027_2144 23316 2018-10-27 21:44:33+00:00 ar37_stable 36330 2018-10-27 22:44:35+00:00 Kr83m,_sciencerun2_candidate,_sciencerun2_preliminary ar37_stable 01:00:02 NaN True True True True True
213 181027_2244 23317 2018-10-27 22:44:43+00:00 ar37_stable 36486 2018-10-27 23:44:47+00:00 Kr83m,_sciencerun2_candidate,_sciencerun2_preliminary ar37_stable 01:00:04 NaN True True True True True
214 181027_2344 23318 2018-10-27 23:44:59+00:00 ar37_stable 36777 2018-10-28 00:45:03+00:00 Kr83m,_sciencerun2_candidate,_sciencerun2_preliminary ar37_stable 01:00:04 NaN True True True True True
215 181028_0045 23319 2018-10-28 00:45:11+00:00 ar37_stable 36708 2018-10-28 01:45:15+00:00 Kr83m,_sciencerun2_candidate,_sciencerun2_preliminary ar37_stable 01:00:04 NaN True True True True True

216 rows × 15 columns
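
You can also narrow the selection directly, for example by requiring certain data to be available or certain tags to be present. A sketch, not evaluated here:

dsets = st.select_runs(available='event_info', include_tags='sciencerun2')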

So let's take 180215_1029.

To actually load data, you use get_df to get a pandas DataFrame, or get_array to get a numpy (structured) array. Let’s go with pandas for now:

[7]:
run_id = '180215_1029'
df = st.get_df(run_id, 'event_info')

The first time you run this, it will take a moment: it has to actually process the data somewhat. We didn’t ship highest-level demo data with straxen: that would mean we’d have to constantly update the test data when the algorithms change.
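
If you only want to build and store the data, without loading it into memory (say, in a batch job), you can use st.make instead; it takes the same arguments (a sketch, not evaluated here):

st.make(run_id, 'event_info')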

You can also specify a list of run_ids instead of a single run, and get the concatenated result back. Likewise, you can specify multiple data types (of the same kind) to load, and they will be merged for you.
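
A sketch (the second run_id here is hypothetical):

df_multi = st.get_df(['180215_1029', '180215_1030'],
                     ('event_basics', 'corrected_areas'))

Both data types are of the 'events' kind, so their columns are merged side by side; the rows of the two runs are simply concatenated.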

Each data type also comes with configuration options; show_config shows which options apply to event_info and what values are used in the current context:

[8]:
st.show_config('event_info')
[8]:
option default current applies_to help
0 trigger_min_area 100 <OMITTED> (events,) Peaks must have more area (PE) than this to cause events
1 trigger_max_competing 7 <OMITTED> (events,) Peaks must have FEWER nearby larger or slightly smaller peaks to cause events
2 left_event_extension 1000000 <OMITTED> (events,) Extend events this many ns to the left from each triggering peak
3 right_event_extension 1000000 <OMITTED> (events,) Extend events this many ns to the right from each triggering peak
4 n_top_pmts 127 <OMITTED> (peak_basics,) Number of top PMTs
... ... ... ... ... ...
56 s2_relative_lce_map https://raw.githubusercontent.com/XENON1T/pax/master/pax/data/XENON1T_s2_xy_ly_SR1_v2.2.json <OMITTED> (corrected_areas,) S2 relative LCE(x, y) map
57 elife_file https://raw.githubusercontent.com/XENONnT/strax_auxiliary_files/master/elife.npy <OMITTED> (corrected_areas,) link to the electron lifetime
58 g1 0.1426 <OMITTED> (energy_estimates,) S1 gain in PE / photons produced
59 g2 31.2162 <OMITTED> (energy_estimates,) S2 gain in PE / electrons produced
60 lxe_w 0.0137 <OMITTED> (energy_estimates,) LXe work function in quanta/keV

61 rows × 5 columns

Just like hax.minitrees.load, we got a dataframe back:

[9]:
df.head()
[9]:
cs1 cs2 alt_cs1 alt_cs2 time endtime e_light e_charge e_ces n_peaks ... alt_s2_y x y z r z_naive r_naive r_field_distortion_correction theta event_number
0 3154.413574 652438.812500 0.000000 176.090179 1518690592149882940 1518690592152169100 303.053741 286.338745 589.392456 9 ... -31.338512 44.141258 -6.838715 -37.135551 44.667870 -37.247639 41.780415 2.887457 -0.153706 0
1 5758.058105 568073.625000 141.350296 337445.937500 1518690592201439770 1518690592203722050 553.193542 249.313004 802.506531 19 ... -21.093775 -41.987190 -18.791229 -34.336990 46.000374 -34.498425 42.666840 3.333534 -2.720781 1
2 1435.035889 82255.000000 0.000000 6068.285645 1518690592466567440 1518690592468799730 137.868103 36.099621 173.967728 66 ... 5.692088 -48.027901 1.946625 -26.670229 48.067337 -27.682955 40.648117 7.419219 3.101084 2
3 1345.859131 122233.304688 0.000000 19143.283203 1518690592747771050 1518690592749798640 129.300629 53.645077 182.945709 10 ... -25.501575 -27.031414 -25.482199 -2.938360 37.148888 -2.960282 36.789291 0.359594 -2.385687 3
4 2126.204346 112584.046875 0.000000 104159.750000 1518690593003844370 1518690593005853770 204.270691 49.410263 253.680954 6 ... 43.027340 8.248243 42.250587 -0.578694 43.048180 -0.612683 42.846947 0.201231 1.377999 4

5 rows × 64 columns

Let’s make a quick plot of the events we just loaded:

[10]:
import matplotlib.pyplot as plt

df.plot.scatter('cs1', 'cs2')
plt.xscale('log')
plt.xlim(1, None)
plt.yscale('log')
../_images/tutorials_strax_demo_23_0.png

Making a cS1/cS2 plot for a dataset is such a common task that straxen has a built-in method for it. There are other similar mini-analyses, such as waveform plotting, which we will see in action below.

[11]:
st.event_scatter(run_id, s=20)
../_images/tutorials_strax_demo_25_0.png

Can you guess what kind of data this is?

Waveform analysis

The peaks data type contains the sum waveform information:

[12]:
st.data_info('peaks')
[12]:
Field name Data type Comment
0 time int64 Start time since unix epoch [ns]
1 length int32 Length of the interval in samples
2 dt int16 Width of one sample [ns]
3 channel int16 Channel/PMT number
4 type int8 Classification of the peak(let)
5 area float32 Integral across channels [PE]
6 area_per_channel ('<f4', (248,)) Integral per channel [PE]
7 n_hits int32 Number of hits from which peak was constructed (currently zero if peak is split afterwards)
8 data ('<f4', (200,)) Waveform data in PE/sample (not PE/ns!)
9 width ('<f4', (11,)) Peak widths in range of central area fraction [ns]
10 area_decile_from_midpoint ('<f4', (11,)) Peak widths: time between nth and 5th area decile [ns]
11 saturated_channel ('<i2', (248,)) Does the channel reach ADC saturation?
12 n_saturated_channels int16 Total number of saturated channels
13 tight_coincidence int16 Hits within tight range of mean
14 max_gap int32 Largest gap between hits inside peak [ns]
15 max_goodness_of_split float32 Maximum interior goodness of split

Notice the compound data types of the data, width and saturated_channel fields. Pandas does not support such types (well, it sort of does, but the resulting dataframes are quite inefficient), so we have to load this as a numpy array.

[13]:
peaks = st.get_array(run_id, 'peaks')
type(peaks), peaks.dtype.names
[13]:
(numpy.ndarray,
 ('time',
  'length',
  'dt',
  'channel',
  'type',
  'area',
  'area_per_channel',
  'n_hits',
  'data',
  'width',
  'area_decile_from_midpoint',
  'saturated_channel',
  'n_saturated_channels',
  'tight_coincidence',
  'max_gap',
  'max_goodness_of_split'))
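
If you only need a few scalar columns, you can of course still build a pandas DataFrame from the structured array yourself. A small sketch, using the array we just loaded:

peaks_df = pd.DataFrame({f: peaks[f] for f in ('time', 'area', 'type')})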

Now we can plot peak waveforms:

[14]:
def plot_peak(p, t0=None, **kwargs):
    n = p['length']
    if t0 is None:
        t0 = p['time']
    plt.plot((p['time'] - t0) + np.arange(n) * p['dt'],
             p['data'][:n] / p['dt'],
             drawstyle='steps-mid',
             **kwargs)
    plt.xlabel("Time (ns)")
    plt.ylabel("Sum waveform (PE / ns)")

plot_peak(peaks[9100])
../_images/tutorials_strax_demo_33_0.png
[15]:
def plot_peaks(main_i, n_before=0, n_after=0, label_threshold=0, legendloc='best'):
    # Plot the peak at index main_i, together with n_before peaks before it
    # and n_after peaks after it, on a common time axis.
    for i in main_i + np.arange(-n_before, n_after + 1):
        p = peaks[i]
        label = None
        if p['area'] > label_threshold:
            label = '%d PE, %d ns dt' % (p['area'], p['dt'], )
            color = None
        else:
            color = 'gray'
        plot_peak(p,
                  t0=peaks[main_i]['time'],
                  label=label,
                  color=color)
    plt.ylim(0, None)
    plt.legend(loc=legendloc)
    plt.yscale('symlog')

# Find the largest peak
i_of_largest_peak = np.argmax(peaks['area'])
plot_peaks(i_of_largest_peak,
           n_after=1,
           n_before=5,
           label_threshold=10,
           legendloc=(0.1, 0.5))
../_images/tutorials_strax_demo_34_0.png

The abrupt termination of the S2 above is due to strax’s data reduction.

If you have access to the raw data (at least the records level) you can use straxen’s built-in waveform display. For example, try:

st.waveform_display(run_id, seconds_range=(0, 0.15))

(we didn’t evaluate this in the tutorial, as it creates a substantial amount of javascript, which would have made the notebook quite huge).

Configuration changes

As you can see in the above plot, we have many events high up in the TPC at low S1. Perhaps you want to get rid of them by increasing the ‘S1 coincidence requirement’, i.e. the number of PMTs that must see something before a peak is labeled as S1. Then, of course, you want to load the event-level data again to see if it worked.

First, we need to see which configuration option we have to change. Strax plugins declare what configuration they take and what other plugins they depend on, so this is not very difficult. We just ask which options with s1 in their name influence event_basics:

[16]:
st.show_config('event_basics', 's1*')
[16]:
option default current applies_to help
0 s1_max_rise_time 60 <OMITTED> (peaklet_classification,) Maximum S1 rise time for < 100 PE [ns]
1 s1_max_rise_time_post100 150 <OMITTED> (peaklet_classification,) Maximum S1 rise time for > 100 PE [ns]
2 s1_min_coincidence 3 <OMITTED> (peaklet_classification,) Minimum tight coincidence necessary to make an S1

Looks like we're after the s1_min_coincidence option. Note this option does not belong to the event_basics data type itself, but to peaklet_classification, one of the data types event_basics depends on. As you can see from the table, this option is not set in the current context, so the default value (3) is used.

To try out a different option, just pass it to get_df:

[17]:
df_2 = st.get_df(run_id, 'event_info',
                 config=dict(s1_min_coincidence=50))
st.event_scatter(run_id, events=df_2)
../_images/tutorials_strax_demo_42_1.png

Notice all the small S1 events are indeed gone now.

Behind the scenes, strax figured out which dataframes had to be remade: in this case peaklet_classification and everything that depends on it, up to event_info. You will now have new folders such as event_basics_<somehash> (with a hash different from before) in ./custom_data which contain the results.
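
You can see exactly which version of a data type a context would load or make by asking for its storage key, which includes this hash. A sketch, not evaluated here:

print(st.key_for(run_id, 'event_basics'))

Comparing this in a context with and without the config override (for example via new_context, shown below) shows the hash change.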

More on configuration changes

Changing configuration can be done in two other ways. We can change it permanently in the current context:

st.set_config(dict(s1_min_coincidence=50))

Or we could make a new context, with this option set:

st_2 = st.new_context(config=dict(s1_min_coincidence=50))

(feeding it to get_df just does the latter behind the scenes).

If you just want to run a mini-analysis (like event_scatter), you can also pass a new config option directly to it, as in the example below.

Strax also protects you from typos in the configuration. Suppose we passed an option that doesn't exist, say s1_min_n_channelz:

[18]:
st.event_scatter(run_id, config=dict(s1_min_n_channelz=10))
/home/aalbers/software/strax/strax/context.py:373: UserWarning: Option s1_min_n_channelz not taken by any registered plugin
  warnings.warn(f"Option {k} not taken by any registered plugin")
../_images/tutorials_strax_demo_47_1.png

The result is just the same as if the option wasn't set (just like in pax/hax), but you also get a warning about the unknown configuration option.

By the way, you can use

import warnings
warnings.filterwarnings("error")

to ensure any warning raises an exception instead.

Customization: new plugins

To add or change processing algorithms, or to define new variables to use in cuts, you have to write new strax plugins. These are somewhat similar to hax’s treemakers.

Suppose you have a brilliant new idea for peak classification. Straxen does this in the PeakletClassification plugin, which produces the peaklet_classification data type:

[19]:
st.data_info('peaklet_classification')
[19]:
Field name Data type Comment
0 time int64 Start time since unix epoch [ns]
1 length int32 Length of the interval in samples
2 dt int16 Width of one sample [ns]
3 channel int16 Channel/PMT number
4 type int8 Classification of the peak(let)
  • The first three fields contain the time information of the peak. This is duplicated in many data types; unfortunately, this is necessary for strax to be able to track the data and combine it with other data. Instead of (time, length, dt), plugins can also provide (time, endtime); see the small example after this list. See here for more information.

  • The ‘channel’ field is a historical artifact.

  • The ‘type’ field contains the classification: 0 for unknown, 1 for S1, 2 for S2.
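
For the (time, length, dt) convention, the end time of an interval follows from the other fields; this is what strax.endtime computes for you. A quick check using the peaks array we loaded earlier:

p = peaks[0]
p['time'] + p['length'] * p['dt']   # the interval's end time in ns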

You can find the original plugin in peaklet_processing.py. Here's how you would make a different classification plugin:

[20]:
import strax
import numpy as np

class AdvancedExpertClassification(strax.Plugin):
    """Everything is an S1!"""

    # Name of the data type this plugin provides
    provides = 'peaklet_classification'

    # Data types this plugin requires. Note we don't specify
    # which plugins should produce them: maybe the default Peaklets
    # plugin has been replaced by another AdvancedExpertBlabla plugin?
    depends_on = ('peaklets',)

    # Numpy datatype of the output
    dtype = straxen.PeakletClassification.dtype

    # Version of the plugin. Increment this if you change the algorithm.
    __version__ = '0.0.2'

    def compute(self, peaklets):
        # Your code here.
        # This function will be called several times, with
        # 'peaklets' a numpy array of the 'peaklets' datatype.
        # Each time you'll see a small part of the run.

        # You have to return a numpy array of the dtype you declared above
        result = np.zeros(len(peaklets), self.dtype)

        # Copy the basic time fields over from peaklets
        for (_, field), _ in strax.time_dt_fields:
            result[field] = peaklets[field]

        # Store the classification results
        # You might want to do real work here
        result['type'] = 1

        return result

        # Instead of an array, you are also allowed to return a dictionary
        # we can transform into an array.
        # That is, (dict keys -> field names, values -> field values)

To use it in place of PeakletClassification, we only have to register it. Again, we can do so permanently using

st.register(AdvancedExpertClassification)

or temporarily, by feeding the registration as an extra argument to get_df:

[21]:
df = st.get_df(run_id, 'event_info',
               register=AdvancedExpertClassification)
[22]:
df['s2_area'].max()
[22]:
0.0

As you can see, all events are now S1-only events, as expected. Maybe this is not the best alternative classification :-)

This was a rather basic plugin. You'll also want to learn about LoopPlugins and OverlapWindowPlugins, but that's beyond the scope of this tutorial.