Tutorial
Jelle, updated May 2020
This notebook shows how to do basic analysis with straxen, much like hax.minitrees.
For reference, here are some jargon terms which we will introduce below:
Context: Holds configuration on how to process
Dataframe or array: table of related information produced by a plugin.
Plugin: an algorithm that produces a dataframe
Data type: specification of which columns are in a dataframe.
Data kind: e.g. ‘events’ or ‘peaks’. Dataframes of the same kind have the same number of rows and can be merged.
[1]:
12+1
[1]:
13
[2]:
import numpy as np
# This just ensures some comments in dataframes below display nicely
import pandas as pd
pd.options.display.max_colwidth = 100
Setting up
First we load a strax context, much like hax.init(). A strax context contains all information on how to process: where to read what files from, which plugins provide what data, etc.
You can make a context yourself using strax.Context, but straxen provides standardized contexts as well. Most future analyses will use such standardized contexts, defined by analysis coordinators or straxen maintainers.
Unlike hax.init, you can have multiple active contexts, e.g. to load analysis and MC data, or to compare data processed with different settings (we will see examples of this below).
[3]:
import straxen
st = straxen.contexts.xenon1t_dali()
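As mentioned above, contexts can coexist side by side. A minimal sketch (the option name is just an example taken from the configuration table shown later; real configuration changes are covered below):
# A second, independent context with a tweaked option; st is unaffected
st_tweaked = straxen.contexts.xenon1t_dali()
st_tweaked.set_config(dict(trigger_min_area=200))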
Finding your data
Suppose we want to make a cS1/cS2 plot. We have to figure out which type of dataframe to load. A specific type of dataframe is also called a data type (in hax these were called minitrees).
We can find this out automatically if we know (part of) the name of a field to load:
[4]:
st.search_field('cs1')
cs1 is part of corrected_areas (provided by CorrectedAreas)
cs1 is part of event_info (provided by EventInfo)
It seems we’re after one of the data types called event_info or corrected_areas. In the current context, these are provided by plugins called EventInfo and CorrectedAreas, respectively (but this doesn’t concern us yet).
Let’s see what else is in these data types:
[5]:
st.data_info('event_info')
[5]:
Field name | Data type | Comment | |
---|---|---|---|
0 | cs1 | float32 | Corrected S1 area [PE] |
1 | cs2 | float32 | Corrected S2 area [PE] |
2 | alt_cs1 | float32 | Corrected area of the alternate S1 [PE] |
3 | alt_cs2 | float32 | Corrected area of the alternate S2 [PE] |
4 | time | int64 | Start time since unix epoch [ns] |
... | ... | ... | ... |
59 | z_naive | float32 | Interaction z-position using mean drift velocity only (cm) |
60 | r_naive | float32 | Interaction r-position using observed S2 positions directly (cm) |
61 | r_field_distortion_correction | float32 | Correction added to r_naive for field distortion (cm) |
62 | theta | float32 | Interaction angular position (radians) |
63 | event_number | int64 | Event number in this dataset |
64 rows × 3 columns
As you can see, event_info has a lot more information; let’s load that one. You can see from the documentation (TODO link) that event_info’s job is to merge the info from corrected_areas and other things.
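You can also check this from the plugin class itself. A sketch, assuming the EventInfo plugin named in the search_field output above is exported at the straxen top level:
# Data types that the EventInfo plugin merges into event_info
straxen.EventInfo.depends_on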
Loading data
Next, you’ll want to select a run. The select_runs function will return a dataframe with all available runs; there is a separate tutorial on more advanced use of this. In this demo context, we only have high-level data for the run 180215_1029 available (and low-level data for another):
[6]:
st.select_runs()
Checking data availability: 100%|██████████| 5/5 [00:38<00:00, 7.60s/it]
[6]:
name | number | start | reader.ini.name | trigger.events_built | end | tags | mode | livetime | tags.name | events_available | event_info_available | peaklets_available | records_available | raw_records_available | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 170204_1410 | 6786 | 2017-02-04 14:10:08+00:00 | background_stable | 19574 | 2017-02-04 15:10:13+00:00 | blinded,_sciencerun1_candidate,_sciencerun1 | background_stable | 01:00:05 | NaN | True | True | True | True | True |
1 | 170204_1510 | 6787 | 2017-02-04 15:10:28+00:00 | background_stable | 19634 | 2017-02-04 16:10:32+00:00 | blinded,_sciencerun1_candidate,_sciencerun1 | background_stable | 01:00:04 | NaN | True | True | True | True | True |
2 | 170204_1610 | 6788 | 2017-02-04 16:10:39+00:00 | background_stable | 19400 | 2017-02-04 17:10:43+00:00 | blinded,_sciencerun1_candidate,_sciencerun1 | background_stable | 01:00:04 | NaN | True | True | True | True | True |
3 | 170204_1710 | 6789 | 2017-02-04 17:10:51+00:00 | background_stable | 19415 | 2017-02-04 18:10:54+00:00 | blinded,_sciencerun1_candidate,_sciencerun1 | background_stable | 01:00:03 | NaN | True | True | True | True | True |
4 | 170204_1810 | 6790 | 2017-02-04 18:11:01+00:00 | background_stable | 19671 | 2017-02-04 19:11:05+00:00 | blinded,_sciencerun1_candidate,_sciencerun1 | background_stable | 01:00:04 | NaN | True | True | True | True | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
211 | 181027_2044 | 23315 | 2018-10-27 20:44:21+00:00 | ar37_stable | 36496 | 2018-10-27 21:44:25+00:00 | Kr83m,_sciencerun2_candidate,_sciencerun2_preliminary | ar37_stable | 01:00:04 | NaN | True | True | True | True | True |
212 | 181027_2144 | 23316 | 2018-10-27 21:44:33+00:00 | ar37_stable | 36330 | 2018-10-27 22:44:35+00:00 | Kr83m,_sciencerun2_candidate,_sciencerun2_preliminary | ar37_stable | 01:00:02 | NaN | True | True | True | True | True |
213 | 181027_2244 | 23317 | 2018-10-27 22:44:43+00:00 | ar37_stable | 36486 | 2018-10-27 23:44:47+00:00 | Kr83m,_sciencerun2_candidate,_sciencerun2_preliminary | ar37_stable | 01:00:04 | NaN | True | True | True | True | True |
214 | 181027_2344 | 23318 | 2018-10-27 23:44:59+00:00 | ar37_stable | 36777 | 2018-10-28 00:45:03+00:00 | Kr83m,_sciencerun2_candidate,_sciencerun2_preliminary | ar37_stable | 01:00:04 | NaN | True | True | True | True | True |
215 | 181028_0045 | 23319 | 2018-10-28 00:45:11+00:00 | ar37_stable | 36708 | 2018-10-28 01:45:15+00:00 | Kr83m,_sciencerun2_candidate,_sciencerun2_preliminary | ar37_stable | 01:00:04 | NaN | True | True | True | True | True |
216 rows × 15 columns
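select_runs can also do the filtering for you. A sketch (the tag and data type names are just examples; they must match your runs database and context):
# Only runs tagged 'sciencerun1' for which event_info is already available
st.select_runs(include_tags='sciencerun1', available='event_info')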
So let’s take that 180215_1029.
To actually load data, you use get_df to get a pandas DataFrame, or get_array to get a numpy (structured) array. Let’s go with pandas for now:
[7]:
run_id = '180215_1029'
df = st.get_df(run_id, 'event_info')
The first time you run this, it will take a moment: it has to actually process the data somewhat. We didn’t ship highest-level demo data with straxen: that would mean we’d have to constantly update the test data when the algorithms change.
You can also specify a list of run_ids instead of a single run, and get the concatenated result back. Likewise, you can specify multiple data types (of the same kind) to load, and they will be merged for you.
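A sketch of both forms (the run and data type names are just examples from above; merged data types must be of the same kind):
# Several runs, concatenated into one dataframe
df_two_runs = st.get_df(['170204_1410', '170204_1510'], 'event_info')

# Several data types of the same kind ('events'), merged column-wise
df_merged = st.get_df(run_id, ('event_basics', 'event_positions'))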
Just like hax.minitrees.load, we get a dataframe back (shown below with df.head()). First, let’s check which configuration options apply to event_info and the plugins it depends on, using show_config:
[8]:
st.show_config('event_info')
[8]:
option | default | current | applies_to | help | |
---|---|---|---|---|---|
0 | trigger_min_area | 100 | <OMITTED> | (events,) | Peaks must have more area (PE) than this to cause events |
1 | trigger_max_competing | 7 | <OMITTED> | (events,) | Peaks must have FEWER nearby larger or slightly smaller peaks to cause events |
2 | left_event_extension | 1000000 | <OMITTED> | (events,) | Extend events this many ns to the left from each triggering peak |
3 | right_event_extension | 1000000 | <OMITTED> | (events,) | Extend events this many ns to the right from each triggering peak |
4 | n_top_pmts | 127 | <OMITTED> | (peak_basics,) | Number of top PMTs |
... | ... | ... | ... | ... | ... |
56 | s2_relative_lce_map | https://raw.githubusercontent.com/XENON1T/pax/master/pax/data/XENON1T_s2_xy_ly_SR1_v2.2.json | <OMITTED> | (corrected_areas,) | S2 relative LCE(x, y) map |
57 | elife_file | https://raw.githubusercontent.com/XENONnT/strax_auxiliary_files/master/elife.npy | <OMITTED> | (corrected_areas,) | link to the electron lifetime |
58 | g1 | 0.1426 | <OMITTED> | (energy_estimates,) | S1 gain in PE / photons produced |
59 | g2 | 31.2162 | <OMITTED> | (energy_estimates,) | S2 gain in PE / electrons produced |
60 | lxe_w | 0.0137 | <OMITTED> | (energy_estimates,) | LXe work function in quanta/keV |
61 rows × 5 columns
[9]:
df.head()
[9]:
cs1 | cs2 | alt_cs1 | alt_cs2 | time | endtime | e_light | e_charge | e_ces | n_peaks | ... | alt_s2_y | x | y | z | r | z_naive | r_naive | r_field_distortion_correction | theta | event_number | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3154.413574 | 652438.812500 | 0.000000 | 176.090179 | 1518690592149882940 | 1518690592152169100 | 303.053741 | 286.338745 | 589.392456 | 9 | ... | -31.338512 | 44.141258 | -6.838715 | -37.135551 | 44.667870 | -37.247639 | 41.780415 | 2.887457 | -0.153706 | 0 |
1 | 5758.058105 | 568073.625000 | 141.350296 | 337445.937500 | 1518690592201439770 | 1518690592203722050 | 553.193542 | 249.313004 | 802.506531 | 19 | ... | -21.093775 | -41.987190 | -18.791229 | -34.336990 | 46.000374 | -34.498425 | 42.666840 | 3.333534 | -2.720781 | 1 |
2 | 1435.035889 | 82255.000000 | 0.000000 | 6068.285645 | 1518690592466567440 | 1518690592468799730 | 137.868103 | 36.099621 | 173.967728 | 66 | ... | 5.692088 | -48.027901 | 1.946625 | -26.670229 | 48.067337 | -27.682955 | 40.648117 | 7.419219 | 3.101084 | 2 |
3 | 1345.859131 | 122233.304688 | 0.000000 | 19143.283203 | 1518690592747771050 | 1518690592749798640 | 129.300629 | 53.645077 | 182.945709 | 10 | ... | -25.501575 | -27.031414 | -25.482199 | -2.938360 | 37.148888 | -2.960282 | 36.789291 | 0.359594 | -2.385687 | 3 |
4 | 2126.204346 | 112584.046875 | 0.000000 | 104159.750000 | 1518690593003844370 | 1518690593005853770 | 204.270691 | 49.410263 | 253.680954 | 6 | ... | 43.027340 | 8.248243 | 42.250587 | -0.578694 | 43.048180 | -0.612683 | 42.846947 | 0.201231 | 1.377999 | 4 |
5 rows × 64 columns
Let’s make a quick plot of the events we just loaded:
[10]:
df.plot.scatter('cs1', 'cs2')
import matplotlib.pyplot as plt
plt.xscale('log')
plt.xlim(1, None)
plt.yscale('log')
[figure: scatter plot of cs1 vs cs2 on log-log axes]
Making a cS1/cS2 plot for a dataset is such a common task that straxen has a built-in method for it. There are other similar mini-analyses, such as waveform plotting, which we will see in action below.
[11]:
st.event_scatter(run_id, s=20)
[figure: built-in event_scatter plot for this run]
Can you guess what kind of data this is?
Waveform analysis
The peaks data type contains the sum waveform information:
[12]:
st.data_info('peaks')
[12]:
Field name | Data type | Comment | |
---|---|---|---|
0 | time | int64 | Start time since unix epoch [ns] |
1 | length | int32 | Length of the interval in samples |
2 | dt | int16 | Width of one sample [ns] |
3 | channel | int16 | Channel/PMT number |
4 | type | int8 | Classification of the peak(let) |
5 | area | float32 | Integral across channels [PE] |
6 | area_per_channel | ('<f4', (248,)) | Integral per channel [PE] |
7 | n_hits | int32 | Number of hits from which peak was constructed (currently zero if peak is split afterwards) |
8 | data | ('<f4', (200,)) | Waveform data in PE/sample (not PE/ns!) |
9 | width | ('<f4', (11,)) | Peak widths in range of central area fraction [ns] |
10 | area_decile_from_midpoint | ('<f4', (11,)) | Peak widths: time between nth and 5th area decile [ns] |
11 | saturated_channel | ('<i2', (248,)) | Does the channel reach ADC saturation? |
12 | n_saturated_channels | int16 | Total number of saturated channels |
13 | tight_coincidence | int16 | Hits within tight range of mean |
14 | max_gap | int32 | Largest gap between hits inside peak [ns] |
15 | max_goodness_of_split | float32 | Maximum interior goodness of split |
Notice the compound data types of the data, width and saturated_channel fields. Pandas does not support such types (well, it sort of does, but the resulting dataframes are quite inefficient), so we have to load this as a numpy array.
[13]:
peaks = st.get_array(run_id, 'peaks')
type(peaks), peaks.dtype.names
[13]:
(numpy.ndarray,
('time',
'length',
'dt',
'channel',
'type',
'area',
'area_per_channel',
'n_hits',
'data',
'width',
'area_decile_from_midpoint',
'saturated_channel',
'n_saturated_channels',
'tight_coincidence',
'max_gap',
'max_goodness_of_split'))
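If you do want a pandas view anyway, one option is to keep only the scalar columns. A sketch, with field names taken from the table above:
scalar_fields = ['time', 'length', 'dt', 'channel', 'type', 'area',
                 'n_hits', 'n_saturated_channels', 'tight_coincidence',
                 'max_gap', 'max_goodness_of_split']
peaks_df = pd.DataFrame({f: peaks[f] for f in scalar_fields})
peaks_df.head()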
Now we can plot peak waveforms:
[14]:
def plot_peak(p, t0=None, **kwargs):
    n = p['length']
    if t0 is None:
        t0 = p['time']
    plt.plot((p['time'] - t0) + np.arange(n) * p['dt'],
             p['data'][:n] / p['dt'],
             drawstyle='steps-mid',
             **kwargs)
    plt.xlabel("Time (ns)")
    plt.ylabel("Sum waveform (PE / ns)")

plot_peak(peaks[9100])
[figure: sum waveform of a single peak]
[15]:
def plot_peaks(main_i, n_before=0, n_after=0, label_threshold=0, legendloc='best'):
    for i in main_i + np.arange(-n_before, n_after + 1):
        p = peaks[i]
        label = None
        if p['area'] > label_threshold:
            label = '%d PE, %d ns dt' % (p['area'], p['dt'])
            color = None
        else:
            color = 'gray'
        plot_peak(p,
                  t0=peaks[main_i]['time'],
                  label=label,
                  color=color)
    plt.ylim(0, None)
    plt.legend(loc=legendloc)
    plt.yscale('symlog')

# Find the largest peak
i_of_largest_peak = np.argmax(peaks['area'])
plot_peaks(i_of_largest_peak,
           n_after=1,
           n_before=5,
           label_threshold=10,
           legendloc=(0.1, 0.5))
[figure: sum waveforms of the largest peak and its neighbours, symlog y-scale]
The abrupt termination of the S2 above is due to strax’s data reduction.
If you have access to the raw data (at least the records level), you can use straxen’s built-in waveform display. For example, try:
st.waveform_display(run_id, seconds_range=(0, 0.15))
(We didn’t evaluate this in the tutorial, as it creates a substantial amount of JavaScript, which would have made the notebook quite huge.)
Configuration changes
As you can see in the above plot, we have many events high up in the TPC at low S1. Perhaps you want to get rid of them by increasing the ‘S1 coincidence requirement’, i.e. the number of PMTs that must see something before a peak is labeled as S1. Then, of course, you want to load the event-level data again to see if it worked.
First, we need to see which configuration option we have to change. Strax plugins declare what configuration they take and what other plugins they depend on, so this is not very difficult. We just ask which options with s1 in their name influence event_basics:
[16]:
st.show_config('event_basics', 's1*')
[16]:
option | default | current | applies_to | help | |
---|---|---|---|---|---|
0 | s1_max_rise_time | 60 | <OMITTED> | (peaklet_classification,) | Maximum S1 rise time for < 100 PE [ns] |
1 | s1_max_rise_time_post100 | 150 | <OMITTED> | (peaklet_classification,) | Maximum S1 rise time for > 100 PE [ns] |
2 | s1_min_coincidence | 3 | <OMITTED> | (peaklet_classification,) | Minimum tight coincidence necessary to make an S1 |
Looks like we’re after the s1_min_n_channels option. Note this is not part of the event_basics data type, but of a data type called peaklet_classification. As you can see from the table, this option is not set in the current context, so the default value (3) is used.
To try out a different option, just pass it to get_df:
[17]:
df_2 = st.get_df(run_id, 'event_info',
config=dict(s1_min_n_channels=50))
st.event_scatter(run_id, events=df_2)
/home/aalbers/software/strax/strax/context.py:373: UserWarning: Option s1_min_n_channels not taken by any registered plugin
warnings.warn(f"Option {k} not taken by any registered plugin")
[figure: event_scatter with the modified configuration]
Notice all the small S1 events are indeed gone now.
Behind the scenes, this figured out which dataframes had to be remade: as it happens, this time just event_basics and peak_basics. You will now have a new event_basics_<somehash> folder in ./custom_data which contains the results, as well as a new peak_basics_<somehash> folder.
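If you’re curious which hash that is, you can ask the context for the storage key it uses. A sketch (key_for returns a key whose string form combines the run id, the data type, and a hash of the lineage, i.e. the plugin versions and options involved):
print(st.key_for(run_id, 'event_basics'))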
More on configuration changes
Changing configuration can be done in two other ways. We can change it permanently in the current context:
st.set_config(dict(s1_min_n_channels=50))
Or we could make a new context, with this option set:
st_2 = st.new_context(config=dict(s1_min_n_channels=50))
(feeding it to get_df just does the latter behind the scenes).
If you just want to run a mini-analysis (like event_scatter), you can also pass a new config option directly to it, as in the example below.
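For instance, the same override as in the previous section, passed straight to the mini-analysis (a sketch, not evaluated here):
st.event_scatter(run_id, config=dict(s1_min_n_channels=50))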
Strax protects you from typos in the configuration. Suppose we typed s1_min_n_channelz instead:
[18]:
st.event_scatter(run_id, config=dict(s1_min_n_channelz=10))
/home/aalbers/software/strax/strax/context.py:373: UserWarning: Option s1_min_n_channelz not taken by any registered plugin
warnings.warn(f"Option {k} not taken by any registered plugin")
[figure: event_scatter output, unchanged from the default settings]
The result is just the same as if the option hadn’t been set (just like in pax/hax), but you also get a warning about an unknown configuration option.
By the way, you can use
import warnings
warnings.filterwarnings("error")
to ensure any warning raises an exception instead.
Customization: new plugins
To add or change processing algorithms, or to define new variables to use in cuts, you have to write new strax plugins. These are somewhat similar to hax’s treemakers.
Suppose you have a brilliant new idea for peak classification. Straxen currently does this in the PeakletClassification plugin, which produces:
[19]:
st.data_info('peaklet_classification')
[19]:
Field name | Data type | Comment | |
---|---|---|---|
0 | time | int64 | Start time since unix epoch [ns] |
1 | length | int32 | Length of the interval in samples |
2 | dt | int16 | Width of one sample [ns] |
3 | channel | int16 | Channel/PMT number |
4 | type | int8 | Classification of the peak(let) |
The first three fields contain time information of the peak. This is duplicated in many datatypes – unfortunately, this is necessary for strax to be able to track the data and combine it with other data. Instead of (time, length, dt), plugins could also provide (time, endtime). See here for more information.
The ‘channel’ field is an historical artifact.
The ‘type’ field contains the classification: 0 for unknown, 1 for S1, 2 for S2. (note #8)
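As a sketch of the two time-field conventions, using the dtype helpers from strax (the extra 'type' field mirrors the table above):
import numpy as np
import strax

# Option A: (time, length, dt), as used by peaklet_classification
dtype_dt = strax.time_dt_fields + [
    (('Classification of the peak(let)', 'type'), np.int8)]

# Option B: (time, endtime)
dtype_endtime = strax.time_fields + [
    (('Classification of the peak(let)', 'type'), np.int8)]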
You can find the original plugin in peaklet_processing.py. Here’s how you would make a different classification plugin:
[20]:
import strax
import numpy as np

class AdvancedExpertClassification(strax.Plugin):
    """Everything is an S1!"""

    # Name of the data type this plugin provides
    provides = 'peaklet_classification'

    # Data types this plugin requires. Note we don't specify
    # what plugins should produce them: maybe the default Peaklets plugin
    # has been replaced by another AdvancedExpertBlabla plugin?
    depends_on = ('peaklets',)

    # Numpy datatype of the output
    dtype = straxen.PeakletClassification.dtype

    # Version of the plugin. Increment this if you change the algorithm.
    __version__ = '0.0.2'

    def compute(self, peaklets):
        # Your code here.
        # This function will be called several times, each time with
        # 'peaklets' a numpy array of the 'peaklets' datatype,
        # covering a small part of the run.
        # You have to return a numpy array of the dtype you declared above.
        result = np.zeros(len(peaklets), self.dtype)

        # Copy the basic time fields over from peaklets
        for (_, field), _ in strax.time_dt_fields:
            result[field] = peaklets[field]

        # Store the classification results
        # (you might want to do real work here)
        result['type'] = 1
        return result

        # Instead of an array, you are also allowed to return a dictionary
        # we can transform into an array.
        # That is, (dict keys -> field names, values -> field values)
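The dictionary form mentioned in the closing comment could look like this (a sketch; the class name here is just illustrative and we don’t register it anywhere):
class DictReturningClassification(AdvancedExpertClassification):
    """Same classification, but returning a dict instead of an array."""
    __version__ = '0.0.3'

    def compute(self, peaklets):
        # Keys must match the field names of the declared dtype
        return dict(
            time=peaklets['time'],
            length=peaklets['length'],
            dt=peaklets['dt'],
            channel=peaklets['channel'],
            type=np.ones(len(peaklets), dtype=np.int8))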
To use it in place of PeakletClassification, we only have to register it. Again, we can do so permanently using
st.register(AdvancedExpertClassification)
or temporarily, by feeding the registration as an extra argument to get_df:
[21]:
df = st.get_df(run_id, 'event_info',
register=AdvancedExpertClassification)
[22]:
df['s2_area'].max()
[22]:
0.0
As you can see, all events are now S1-only events, as expected. Maybe this is not the best alternative classification :-)
This was a rather basic plugin. You’ll also want to learn about LoopPlugins and OverlapWindowPlugins, but that’s beyond the scope of this tutorial.