Opening data to do analyses
06-01-2020
To this end we will take one of our recent 1T datasets (a single run), load it, and make some selections and plots. We will only look at two parameters, but the data is rich and can be explored in many more ways.
Goal & setup
We want to use straxen to open some data. To this end, we show how to make a basic selection and plot some waveforms.
[1]:
print('Start import')
import socket
import strax
import straxen
import numpy as np
import datetime
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
print(
f'''Working on {socket.getfqdn()} with the following versions
and installation paths:
strax
\tv{strax.__version__}\t{str(strax.__path__[0])}
straxen
\tv{straxen.__version__}\t{str(straxen.__path__[0])}
''')
Start import
Working on dali030.rcc.local with the following versions
and installation paths:
strax
v0.13.2 /home/angevaare/software/strax/strax
straxen
v0.14.1 /home/angevaare/software/straxen/straxen
Setup straxen
We just have to use the xenonnt_online context for nT, or the one below for 1T.
[2]:
st = straxen.contexts.xenon1t_dali()
Data-availability
Using the context to check what is available.
This queries the RunsDB, looks at the runs considered in the context, and checks whether each is available.
NB: I’m going to slightly alter the config here to check for multiple datatypes. This takes considerably longer, so you can sit back for a while.
[3]:
%%time
# This config change will take time
st.set_context_config({'check_available': ('raw_records', 'records', 'peaklets', 'peak_basics')})
st.select_runs()
Checking data availability: 100%|██████████| 4/4 [00:12<00:00, 3.02s/it]
CPU times: user 6.8 s, sys: 711 ms, total: 7.51 s
Wall time: 13.8 s
[3]:
name | number | start | reader.ini.name | trigger.events_built | end | tags | mode | livetime | peak_basics_available | peaklets_available | records_available | raw_records_available | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 170204_1410 | 6786 | 2017-02-04 14:10:08+00:00 | background_stable | 19574 | 2017-02-04 15:10:13+00:00 | blinded,_sciencerun1_candidate,_sciencerun1 | background_stable | 01:00:05 | True | True | True | True |
1 | 170204_1510 | 6787 | 2017-02-04 15:10:28+00:00 | background_stable | 19634 | 2017-02-04 16:10:32+00:00 | blinded,_sciencerun1_candidate,_sciencerun1 | background_stable | 01:00:04 | True | True | True | True |
2 | 170204_1610 | 6788 | 2017-02-04 16:10:39+00:00 | background_stable | 19400 | 2017-02-04 17:10:43+00:00 | blinded,_sciencerun1_candidate,_sciencerun1 | background_stable | 01:00:04 | True | True | True | True |
3 | 170204_1710 | 6789 | 2017-02-04 17:10:51+00:00 | background_stable | 19415 | 2017-02-04 18:10:54+00:00 | blinded,_sciencerun1_candidate,_sciencerun1 | background_stable | 01:00:03 | True | True | True | True |
4 | 170204_1810 | 6790 | 2017-02-04 18:11:01+00:00 | background_stable | 19671 | 2017-02-04 19:11:05+00:00 | blinded,_sciencerun1_candidate,_sciencerun1 | background_stable | 01:00:04 | True | True | True | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
211 | 181027_2044 | 23315 | 2018-10-27 20:44:21+00:00 | ar37_stable | 36496 | 2018-10-27 21:44:25+00:00 | Kr83m,_sciencerun2_candidate,_sciencerun2_prel... | ar37_stable | 01:00:04 | True | True | True | True |
212 | 181027_2144 | 23316 | 2018-10-27 21:44:33+00:00 | ar37_stable | 36330 | 2018-10-27 22:44:35+00:00 | Kr83m,_sciencerun2_candidate,_sciencerun2_prel... | ar37_stable | 01:00:02 | True | True | True | True |
213 | 181027_2244 | 23317 | 2018-10-27 22:44:43+00:00 | ar37_stable | 36486 | 2018-10-27 23:44:47+00:00 | Kr83m,_sciencerun2_candidate,_sciencerun2_prel... | ar37_stable | 01:00:04 | True | True | True | True |
214 | 181027_2344 | 23318 | 2018-10-27 23:44:59+00:00 | ar37_stable | 36777 | 2018-10-28 00:45:03+00:00 | Kr83m,_sciencerun2_candidate,_sciencerun2_prel... | ar37_stable | 01:00:04 | True | True | True | True |
215 | 181028_0045 | 23319 | 2018-10-28 00:45:11+00:00 | ar37_stable | 36708 | 2018-10-28 01:45:15+00:00 | Kr83m,_sciencerun2_candidate,_sciencerun2_prel... | ar37_stable | 01:00:04 | True | True | True | True |
216 rows × 13 columns
The second time we run this, it will be much faster.
[4]:
st.select_runs(available = 'peaklets',
exclude_tags = ('bad', 'messy'))
[4]:
name | number | start | reader.ini.name | trigger.events_built | end | tags | mode | livetime | peak_basics_available | peaklets_available | records_available | raw_records_available | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 170204_1410 | 6786 | 2017-02-04 14:10:08+00:00 | background_stable | 19574 | 2017-02-04 15:10:13+00:00 | blinded,_sciencerun1_candidate,_sciencerun1 | background_stable | 01:00:05 | True | True | True | True |
1 | 170204_1510 | 6787 | 2017-02-04 15:10:28+00:00 | background_stable | 19634 | 2017-02-04 16:10:32+00:00 | blinded,_sciencerun1_candidate,_sciencerun1 | background_stable | 01:00:04 | True | True | True | True |
2 | 170204_1610 | 6788 | 2017-02-04 16:10:39+00:00 | background_stable | 19400 | 2017-02-04 17:10:43+00:00 | blinded,_sciencerun1_candidate,_sciencerun1 | background_stable | 01:00:04 | True | True | True | True |
3 | 170204_1710 | 6789 | 2017-02-04 17:10:51+00:00 | background_stable | 19415 | 2017-02-04 18:10:54+00:00 | blinded,_sciencerun1_candidate,_sciencerun1 | background_stable | 01:00:03 | True | True | True | True |
4 | 170204_1810 | 6790 | 2017-02-04 18:11:01+00:00 | background_stable | 19671 | 2017-02-04 19:11:05+00:00 | blinded,_sciencerun1_candidate,_sciencerun1 | background_stable | 01:00:04 | True | True | True | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
211 | 181027_2044 | 23315 | 2018-10-27 20:44:21+00:00 | ar37_stable | 36496 | 2018-10-27 21:44:25+00:00 | Kr83m,_sciencerun2_candidate,_sciencerun2_prel... | ar37_stable | 01:00:04 | True | True | True | True |
212 | 181027_2144 | 23316 | 2018-10-27 21:44:33+00:00 | ar37_stable | 36330 | 2018-10-27 22:44:35+00:00 | Kr83m,_sciencerun2_candidate,_sciencerun2_prel... | ar37_stable | 01:00:02 | True | True | True | True |
213 | 181027_2244 | 23317 | 2018-10-27 22:44:43+00:00 | ar37_stable | 36486 | 2018-10-27 23:44:47+00:00 | Kr83m,_sciencerun2_candidate,_sciencerun2_prel... | ar37_stable | 01:00:04 | True | True | True | True |
214 | 181027_2344 | 23318 | 2018-10-27 23:44:59+00:00 | ar37_stable | 36777 | 2018-10-28 00:45:03+00:00 | Kr83m,_sciencerun2_candidate,_sciencerun2_prel... | ar37_stable | 01:00:04 | True | True | True | True |
215 | 181028_0045 | 23319 | 2018-10-28 00:45:11+00:00 | ar37_stable | 36708 | 2018-10-28 01:45:15+00:00 | Kr83m,_sciencerun2_candidate,_sciencerun2_prel... | ar37_stable | 01:00:04 | True | True | True | True |
205 rows × 13 columns
For nT one can also select a run from the website:
https://xenon1t-daq.lngs.infn.it/runsui
[5]:
# The name of the run
run_id = '181027_2044'
Alternative method: if you know what to look for, you can use this command to double-check:
[6]:
st.is_stored(run_id, "peak_basics")
[6]:
True
Let’s load the peaks of this run and have a look. We are going to look at peak_basics: loading e.g. peaks would mean much more data, making everything slow. You can check the size before loading using the command below:
[7]:
for p in ('raw_records', 'records', 'peaklets', 'peak_basics', 'pulse_counts'):
if st.is_stored(run_id, p):
print(f'Target: {p} \t is {st.size_mb(run_id, p):.0f} MB')
Target: raw_records is 50612 MB
Target: records is 17145 MB
Target: peaklets is 2159 MB
Target: peak_basics is 56 MB
Target: pulse_counts is 7 MB
Loading the data
NB: Don’t crash your notebook!
The prints above are why you only want to load small portions of data, especially for ‘low-level’ data. If you try to load all of records or even peaklets, your notebook will crash.
We are going to load only 100 seconds of data, as that will be enough to see some populations. Later we can load more if we want.
My favorite reduction method is just loading one or a few seconds, like below:
[8]:
peaks = st.get_array(run_id,
targets = 'peak_basics',
seconds_range = (0, 100), # Only first 100 seconds
progress_bar=False)
[9]:
# Let's see how much we have loaded:
print(f'We loaded {peaks.nbytes/1e6:.1f} MB of peaks-data')
We loaded 1.5 MB of peaks-data
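This matches a back-of-envelope estimate (assuming the data rate is roughly constant over the run): each run above has about one hour of livetime, so 100 seconds of peak_basics should be roughly a 100/3600 fraction of the ~56 MB reported by st.size_mb earlier:

```python
# Back-of-envelope size estimate, assuming a constant data rate over the run
full_run_mb = 56   # st.size_mb for peak_basics of the full run (see above)
livetime_s = 3600  # runs above are ~1 hour long
loaded_s = 100     # the seconds_range we loaded

estimate_mb = full_run_mb * loaded_s / livetime_s
print(f'Expected ~{estimate_mb:.1f} MB')  # prints 'Expected ~1.6 MB'
```

Close enough to the 1.5 MB we actually got; peak rates fluctuate a bit over a run.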
[10]:
# You can also use get_df that will give you a pandas.DataFrame. For example:
st.get_df(run_id,
targets = 'peak_basics',
seconds_range = (0, 0.01), # Only first 10 ms because I'm going to use the array from above
progress_bar=False).head(10)
[10]:
time | endtime | center_time | area | n_channels | max_pmt | max_pmt_area | range_50p_area | range_90p_area | area_fraction_top | length | dt | rise_time | tight_coincidence | type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1540673061000064360 | 1540673061000072510 | 1540673061000067775 | 630.876892 | 129 | 7 | 63.06451 | 2154.426758 | 6508.952637 | 0.650553 | 163 | 50 | 2509.838867 | 15 | 2 |
1 | 1540673061000080260 | 1540673061000080940 | 1540673061000080508 | 12.416115 | 3 | 66 | 9.20665 | 236.856506 | 606.827026 | 0.741508 | 68 | 10 | 138.616135 | 2 | 0 |
Make plots of this one run
These are my helper functions:
- I want to make an area vs. width plot to check which population to select
- I want to be able to check some waveforms
[11]:
# My function to make a 2D histogram
def plots_area_vs_width(data, log=None, **kwargs):
"""basic wrapper to plot area vs width"""
# Things to put on the axes
x, y = data['area'], data['range_50p_area']
# Make a log-plot manually
if log:
x, y = np.log10(x), np.log10(y)
plt.hist2d(x, y, norm=LogNorm(), **kwargs)
plt.ylabel('Width [ns]')
plt.xlabel('Area [PE]')
# Change the ticks slightly for nicer formats
if log:
xt = np.arange(int(plt.xticks()[0].min()), int(plt.xticks()[0].max()))
plt.xticks(xt, 10 ** xt)
yt = np.arange(int(plt.yticks()[0].min()), int(plt.yticks()[0].max()))
plt.yticks(yt, 10 ** yt)
plt.colorbar(label='counts/bin')
[12]:
# My function to plot some waveforms
def plot_some_peaks(peaks, max_plots = 5, randomize = False):
"""Plot the first peaks in the peaks collection (max number of 5 by default)"""
if randomize:
# This randomly takes max_plots in the range (0, len(peaks)). Plot these indices
indices = np.random.randint(0, len(peaks), max_plots)
else:
# Just take the first max_plots peaks in the data
indices = range(max_plots)
for i in indices:
p = peaks[i]
start, stop = p['time'], p['endtime']
st.plot_peaks(run_id,
time_range = (start, stop))
plt.title(f'S{p["type"]} of ~{int(p["area"])} PE (width {int(p["range_50p_area"])} ns)\n'
f'detected by {p["n_channels"]} PMTs at:\n'
f'{datetime.datetime.fromtimestamp(start/1e9).isoformat()}')  # /1e9 converts ns to s
plt.show()
Plotting waveforms
This may help you if you want to look at specific waveforms. Below I’m just going to show that it works. As you can see, there are a lot of junk waveforms! We are going to do some cuts to select ‘nice’ ones later.
[13]:
# Let's plot the first 2
plot_some_peaks(peaks, max_plots = 2)


[14]:
# Or just look at two random peaks somewhere in our dataset
plot_some_peaks(peaks, max_plots=2, randomize = True)


Display populations
Area vs width plot
You see these plots a lot on Slack. We’ll make a 2D histogram of two properties of the peaks. From this, we try to isolate a population to look at the waveforms.
[15]:
# All peaks
plots_area_vs_width(peaks,
log = True,
bins = 100,
range=[[0,4], [1,4]])
plt.title('Area vs width in log-log space of all populations\n10 s of data')
/home/angevaare/software/Miniconda3/envs/strax/lib/python3.6/site-packages/ipykernel_launcher.py:9: RuntimeWarning: invalid value encountered in log10
if __name__ == '__main__':
/home/angevaare/software/Miniconda3/envs/strax/lib/python3.6/site-packages/ipykernel_launcher.py:9: RuntimeWarning: divide by zero encountered in log10
if __name__ == '__main__':
[15]:
Text(0.5, 1.0, 'Area vs width in log-log space of all populations\n10 s of data')

I want to zoom in on the population on the lower right. Therefore I reduce the range:
[16]:
plots_area_vs_width(peaks,
log = True,
bins = 100,
range=[[2,4], [1.5,2.5]]) # zoomed wrt to previous plot!
plt.title('Area vs width in log-log space zoomed on large S1s\n10 s of data')
/home/angevaare/software/Miniconda3/envs/strax/lib/python3.6/site-packages/ipykernel_launcher.py:9: RuntimeWarning: invalid value encountered in log10
if __name__ == '__main__':
/home/angevaare/software/Miniconda3/envs/strax/lib/python3.6/site-packages/ipykernel_launcher.py:9: RuntimeWarning: divide by zero encountered in log10
if __name__ == '__main__':
[16]:
Text(0.5, 1.0, 'Area vs width in log-log space zoomed on large S1s\n10 s of data')

Let’s select this blob of peaks.
For this I switch to lin-lin space rather than log-log, as I’m too lazy to convert the selection I’m going to make from lin to log and vice versa.
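The conversion itself is just exponentiating the box edges. A minimal sketch (the edge values are the ones used in the zoomed log-log plot above):

```python
import numpy as np

# Box edges in log10 space, as used in the zoomed log-log plot above
log_edges = {'area': (2, 4), 'range_50p_area': (1.5, 2.5)}

# The same box in linear space: raise 10 to each edge
lin_edges = {k: tuple(10.0 ** np.array(v)) for k, v in log_edges.items()}
print(lin_edges)
# area spans 100 -> 10000 PE; width spans ~31.6 -> ~316 ns
```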
[17]:
# Let's look in linspace
plots_area_vs_width(peaks,
log = False,
bins = 100,
range=[[10**2, 10**4], [10**1.5, 10**2.5]])
plt.title('Area vs width in lin-lin space zoomed on large S1s\n10 s of data')
[17]:
Text(0.5, 1.0, 'Area vs width in lin-lin space zoomed on large S1s\n10 s of data')

Data selection
Let’s make a very rudimentary selection in this data
Below I’m going to draw a box around the blob above and load some peaks from it. I suspect these are nice fat S1s.
[18]:
# Parameters used for plot below
low_area = 400
high_area = 8000
low_width = 40
high_width = 60
[19]:
# select data in box
plots_area_vs_width(peaks,
log = False,
bins = 100,
range=[[10**2, 10**4], [10**1.5, 10**2.5]])
ax = plt.gca()
x = np.arange(low_area,high_area,2)
y = np.full(len(x), low_width)
y2 = np.full(len(x), high_width)
ax.fill_between(x,y,y2, alpha = 0.5, color= 'red', label = 'selection')
plt.legend()
plt.title('Area vs width in lin-lin space zoomed on large S1s\nwith selection\n10 s of data')
[19]:
Text(0.5, 1.0, 'Area vs width in lin-lin space zoomed on large S1s\nwith selection\n10 s of data')

Load more data from the selection.
For this I will use a ‘selection string’; this reduces memory usage. See straxferno 0 and 1: https://xe1t-wiki.lngs.infn.it/doku.php?id=xenon:xenon1t:aalbers:straxferno
I’m also going to load all of the data!
[20]:
# Clear peaks from our namespace, this reduces our RAM usage (what usually crashes a notebook).
del peaks
[21]:
%%time
# use selection string to load a lot of data
selected_peaks = st.get_array(run_id,
targets = 'peak_basics',
selection_str = f'({high_area}>area) & (area>{low_area}) ' \
f'& ({high_width}>range_50p_area) & (range_50p_area>{low_width})')
CPU times: user 92 ms, sys: 51.7 ms, total: 144 ms
Wall time: 162 ms
Alternative method yielding the same result (if all data were loaded)
The cell above returns the same result as this selection (only now we are taking the entire run instead of the first 100 seconds):
mask = ((high_area > peaks['area']) & (peaks['area'] > low_area) &
        (high_width > peaks['range_50p_area']) & (peaks['range_50p_area'] > low_width))
selected_peaks = peaks[mask]
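To convince yourself the two approaches are equivalent, you can check the mask logic on a small synthetic structured array (the field names mirror peak_basics; the values are made up for illustration):

```python
import numpy as np

# Synthetic stand-in for peak_basics (values are made up)
peaks = np.zeros(5, dtype=[('area', float), ('range_50p_area', float)])
peaks['area'] = [100, 500, 1000, 9000, 5000]
peaks['range_50p_area'] = [30, 50, 45, 50, 70]

low_area, high_area = 400, 8000
low_width, high_width = 40, 60

# Same logic as the selection string above, written as a numpy mask
mask = ((high_area > peaks['area']) & (peaks['area'] > low_area) &
        (high_width > peaks['range_50p_area']) & (peaks['range_50p_area'] > low_width))
selected = peaks[mask]
print(selected['area'])  # only the two peaks inside the box survive: [ 500. 1000.]
```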
[22]:
# Let's double check that we have the data we want in the area vs width space
plots_area_vs_width(selected_peaks,
log = True,
bins = 100,
range=[[2,4], [1,3]])
plt.title('Area vs width in log-log space with selected S1s\nall data')
[22]:
Text(0.5, 1.0, 'Area vs width in log-log space with selected S1s\nall data')

[23]:
# select data in box
plots_area_vs_width(selected_peaks,
log = False,
bins = 100,
range=[[10**2,10**4], [10**1,10**2.5]])
plt.title('Area vs width in lin-lin space with selected S1s\nall data')
[23]:
Text(0.5, 1.0, 'Area vs width in lin-lin space with selected S1s\nall data')

Plotting waveforms
Let’s check that these are indeed big S1s:
[24]:
plot_some_peaks(selected_peaks)





Nice, we made a simple cut and checked that some waveforms are of the kind we expected. Of course we can do more elaborate cuts and check the other dimensions of the parameter space. Very interesting times to be an analyst!
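As an illustration of checking other dimensions: peak_basics also carries fields like area_fraction_top (see the table earlier), and an additional cut is just another mask term. A numpy-only sketch with made-up values and illustrative (not physically motivated) bounds:

```python
import numpy as np

# Made-up peaks with one extra dimension of the parameter space
peaks = np.zeros(4, dtype=[('area', float), ('area_fraction_top', float)])
peaks['area'] = [500, 600, 700, 800]
peaks['area_fraction_top'] = [0.1, 0.4, 0.6, 0.9]

# Additional cut: keep peaks where the top PMT array saw between
# 20% and 80% of the light (illustrative bounds only)
extra_cut = (peaks['area_fraction_top'] > 0.2) & (peaks['area_fraction_top'] < 0.8)
print(peaks[extra_cut]['area'])  # -> [600. 700.]
```

The same condition can equally be appended to the selection string passed to st.get_array.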