Opening data to do analyses

06-01-2020

j.angevaare@nikhef.nl

In this notebook we will play with one of our most recent 1T datasets (runs): load it, make some selections, and produce some plots. We will only be looking at two parameters, but the data is rich and can be explored in many more ways.

Goal & setup

We want to use straxen to open some data. To this end we show how to make a basic selection and plot some waveforms.

[1]:
print('Start import')
import socket
import strax
import straxen
import numpy as np
import datetime
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

print(
f'''Working on {socket.getfqdn()} with the following versions
and installation paths:
strax
\tv{strax.__version__}\t{str(strax.__path__[0])}
straxen
\tv{straxen.__version__}\t{str(straxen.__path__[0])}
''')
Start import
Working on dali030.rcc.local with the following versions
and installation paths:
strax
        v0.13.2 /home/angevaare/software/strax/strax
straxen
        v0.14.1 /home/angevaare/software/straxen/straxen

Setup straxen

We just have to use the xenonnt_online context for nT, or the one below for 1T:

[2]:
st = straxen.contexts.xenon1t_dali()

Data-availability

Using the context to check what is available.

This queries the RunsDB, looks at the runs that are considered in the context, and checks whether they are available.

NB: I’m going to slightly alter the config here to check for multiple datatypes. This takes considerably longer, so you can sit back for a while.

[3]:
%%time
# This config change will take time
st.set_context_config({'check_available': ('raw_records', 'records', 'peaklets', 'peak_basics')})
st.select_runs()
Checking data availability: 100%|██████████| 4/4 [00:12<00:00,  3.02s/it]
CPU times: user 6.8 s, sys: 711 ms, total: 7.51 s
Wall time: 13.8 s

[3]:
name number start reader.ini.name trigger.events_built end tags mode livetime peak_basics_available peaklets_available records_available raw_records_available
0 170204_1410 6786 2017-02-04 14:10:08+00:00 background_stable 19574 2017-02-04 15:10:13+00:00 blinded,_sciencerun1_candidate,_sciencerun1 background_stable 01:00:05 True True True True
1 170204_1510 6787 2017-02-04 15:10:28+00:00 background_stable 19634 2017-02-04 16:10:32+00:00 blinded,_sciencerun1_candidate,_sciencerun1 background_stable 01:00:04 True True True True
2 170204_1610 6788 2017-02-04 16:10:39+00:00 background_stable 19400 2017-02-04 17:10:43+00:00 blinded,_sciencerun1_candidate,_sciencerun1 background_stable 01:00:04 True True True True
3 170204_1710 6789 2017-02-04 17:10:51+00:00 background_stable 19415 2017-02-04 18:10:54+00:00 blinded,_sciencerun1_candidate,_sciencerun1 background_stable 01:00:03 True True True True
4 170204_1810 6790 2017-02-04 18:11:01+00:00 background_stable 19671 2017-02-04 19:11:05+00:00 blinded,_sciencerun1_candidate,_sciencerun1 background_stable 01:00:04 True True True True
... ... ... ... ... ... ... ... ... ... ... ... ... ...
211 181027_2044 23315 2018-10-27 20:44:21+00:00 ar37_stable 36496 2018-10-27 21:44:25+00:00 Kr83m,_sciencerun2_candidate,_sciencerun2_prel... ar37_stable 01:00:04 True True True True
212 181027_2144 23316 2018-10-27 21:44:33+00:00 ar37_stable 36330 2018-10-27 22:44:35+00:00 Kr83m,_sciencerun2_candidate,_sciencerun2_prel... ar37_stable 01:00:02 True True True True
213 181027_2244 23317 2018-10-27 22:44:43+00:00 ar37_stable 36486 2018-10-27 23:44:47+00:00 Kr83m,_sciencerun2_candidate,_sciencerun2_prel... ar37_stable 01:00:04 True True True True
214 181027_2344 23318 2018-10-27 23:44:59+00:00 ar37_stable 36777 2018-10-28 00:45:03+00:00 Kr83m,_sciencerun2_candidate,_sciencerun2_prel... ar37_stable 01:00:04 True True True True
215 181028_0045 23319 2018-10-28 00:45:11+00:00 ar37_stable 36708 2018-10-28 01:45:15+00:00 Kr83m,_sciencerun2_candidate,_sciencerun2_prel... ar37_stable 01:00:04 True True True True

216 rows × 13 columns

The second time we run this, it will be much faster.

[4]:
st.select_runs(available = 'peaklets',
               exclude_tags = ('bad', 'messy'))
[4]:
name number start reader.ini.name trigger.events_built end tags mode livetime peak_basics_available peaklets_available records_available raw_records_available
0 170204_1410 6786 2017-02-04 14:10:08+00:00 background_stable 19574 2017-02-04 15:10:13+00:00 blinded,_sciencerun1_candidate,_sciencerun1 background_stable 01:00:05 True True True True
1 170204_1510 6787 2017-02-04 15:10:28+00:00 background_stable 19634 2017-02-04 16:10:32+00:00 blinded,_sciencerun1_candidate,_sciencerun1 background_stable 01:00:04 True True True True
2 170204_1610 6788 2017-02-04 16:10:39+00:00 background_stable 19400 2017-02-04 17:10:43+00:00 blinded,_sciencerun1_candidate,_sciencerun1 background_stable 01:00:04 True True True True
3 170204_1710 6789 2017-02-04 17:10:51+00:00 background_stable 19415 2017-02-04 18:10:54+00:00 blinded,_sciencerun1_candidate,_sciencerun1 background_stable 01:00:03 True True True True
4 170204_1810 6790 2017-02-04 18:11:01+00:00 background_stable 19671 2017-02-04 19:11:05+00:00 blinded,_sciencerun1_candidate,_sciencerun1 background_stable 01:00:04 True True True True
... ... ... ... ... ... ... ... ... ... ... ... ... ...
211 181027_2044 23315 2018-10-27 20:44:21+00:00 ar37_stable 36496 2018-10-27 21:44:25+00:00 Kr83m,_sciencerun2_candidate,_sciencerun2_prel... ar37_stable 01:00:04 True True True True
212 181027_2144 23316 2018-10-27 21:44:33+00:00 ar37_stable 36330 2018-10-27 22:44:35+00:00 Kr83m,_sciencerun2_candidate,_sciencerun2_prel... ar37_stable 01:00:02 True True True True
213 181027_2244 23317 2018-10-27 22:44:43+00:00 ar37_stable 36486 2018-10-27 23:44:47+00:00 Kr83m,_sciencerun2_candidate,_sciencerun2_prel... ar37_stable 01:00:04 True True True True
214 181027_2344 23318 2018-10-27 23:44:59+00:00 ar37_stable 36777 2018-10-28 00:45:03+00:00 Kr83m,_sciencerun2_candidate,_sciencerun2_prel... ar37_stable 01:00:04 True True True True
215 181028_0045 23319 2018-10-28 00:45:11+00:00 ar37_stable 36708 2018-10-28 01:45:15+00:00 Kr83m,_sciencerun2_candidate,_sciencerun2_prel... ar37_stable 01:00:04 True True True True

205 rows × 13 columns

For nT one can also select a run from the website:

https://xenon1t-daq.lngs.infn.it/runsui

[5]:
# The name of the run
run_id = '181027_2044'

Alternative method.

If you know what to look for, you can use this command to double-check:

[6]:
st.is_stored(run_id, "peak_basics")
[6]:
True

Let’s load the peaks of this run and have a look. We are going to look at peak_basics; loading e.g. peaks would mean much more data, making everything slow. You can check the size before loading using the command below:

[7]:
for p in ('raw_records', 'records', 'peaklets', 'peak_basics', 'pulse_counts'):
    if st.is_stored(run_id, p):
        print(f'Target: {p} \t is {st.size_mb(run_id, p):.0f} MB')
Target: raw_records      is 50612 MB
Target: records          is 17145 MB
Target: peaklets         is 2159 MB
Target: peak_basics      is 56 MB
Target: pulse_counts     is 7 MB

Loading the data

NB: Don’t crash your notebook!

The printout above shows why you only want to load small portions of data, especially for ‘low-level’ data. If you try to load all of records or even peaklets, your notebook will crash.

We are going to load only 100 seconds of data, as that will be enough to see some populations. Later we can load more if we want.

My favorite reduction method is just loading one or a few seconds, like below:
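Under the hood, seconds_range amounts to keeping only entries whose timestamp lies within the requested window after the start of the run. A minimal numpy sketch of that idea (the run start and peak times below are made up purely for illustration):

```python
import numpy as np

run_start_ns = 1_540_673_061_000_000_000  # made-up run start (ns since the epoch)
# Three hypothetical peaks, at 5 s, 50 s and 150 s into the run
times = run_start_ns + np.array([5, 50, 150]) * 1_000_000_000
# Equivalent of seconds_range=(0, 100): keep peaks in the first 100 s
in_first_100s = (times >= run_start_ns) & (times < run_start_ns + 100 * 1_000_000_000)
print(in_first_100s)  # [ True  True False]
```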

[8]:
peaks = st.get_array(run_id,
                     targets = ('peak_basics'),
                     seconds_range = (0, 100), # Only first 100 seconds
                     progress_bar=False)
[9]:
# Let's see how much we have loaded:
print(f'We loaded {peaks.nbytes/1e6:.1f} MB of peaks-data')
We loaded 1.5 MB of peaks-data
[10]:
# You can also use get_df that will give you a pandas.DataFrame. For example:
st.get_df(run_id,
          targets = ('peak_basics'),
          seconds_range = (0, 0.01), # Only the first 10 ms; I'll keep using the array from above
          progress_bar=False).head(10)
[10]:
time endtime center_time area n_channels max_pmt max_pmt_area range_50p_area range_90p_area area_fraction_top length dt rise_time tight_coincidence type
0 1540673061000064360 1540673061000072510 1540673061000067775 630.876892 129 7 63.06451 2154.426758 6508.952637 0.650553 163 50 2509.838867 15 2
1 1540673061000080260 1540673061000080940 1540673061000080508 12.416115 3 66 9.20665 236.856506 606.827026 0.741508 68 10 138.616135 2 0

Make plots of this one run

These are my helper functions:

- I want to make an area vs width plot to check which population to select
- I want to be able to check some waveforms

[11]:
# My function to make a 2D histogram
def plots_area_vs_width(data, log=None, **kwargs):
    """basic wrapper to plot area vs width"""
    # Things to put on the axes
    x, y = data['area'], data['range_50p_area']

    # Make a log-plot manually
    if log:
        x, y = np.log10(x), np.log10(y)
    plt.hist2d(x, y, norm=LogNorm(), **kwargs)
    plt.ylabel(f'Width [ns]')
    plt.xlabel(f'Area [PE]')

    # Change the ticks slightly for nicer formats
    if log:
        xt = np.arange(int(plt.xticks()[0].min()), int(plt.xticks()[0].max()))
        plt.xticks(xt, 10 ** xt)
        yt = np.arange(int(plt.yticks()[0].min()), int(plt.yticks()[0].max()))
        plt.yticks(yt, 10 ** yt)
    plt.colorbar(label='counts/bin')
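The tick trick at the end of the function plots log10 values but labels them as powers of ten. The label computation itself can be sketched standalone (the axis range below is made up):

```python
import numpy as np

# Suppose the log10-space axis runs from 0.3 to 3.7; the integer tick
# positions inside that range, and their power-of-ten labels, are:
lo, hi = 0.3, 3.7
xt = np.arange(int(np.ceil(lo)), int(np.floor(hi)) + 1)
labels = 10 ** xt
print(xt.tolist(), labels.tolist())  # [1, 2, 3] [10, 100, 1000]
```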
[12]:
# My function to plot some waveforms
def plot_some_peaks(peaks, max_plots = 5, randomize = False):
    """Plot the first peaks in the peaks collection (max number of 5 by default)"""
    if randomize:
        # This randomly takes max_plots in the range (0, len(peaks)). Plot these indices
        indices = np.random.randint(0, len(peaks), max_plots)
    else:
        # Just take the first max_plots peaks in the data
        indices = range(max_plots)
    for i in indices:
        p = peaks[i]
        start, stop = p['time'], p['endtime']
        st.plot_peaks(run_id,
                      time_range = (start, stop))
        plt.title(f'S{p["type"]} of ~{int(p["area"])} PE (width {int(p["range_50p_area"])} ns)\n'
                  f'detected by {p["n_channels"]} PMTs at:\n'
                  f'{datetime.datetime.fromtimestamp(start/1e9).isoformat()}')  # times are in ns; datetime wants s, hence /1e9
        plt.show()
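As a quick standalone check of that /1e9 conversion, using the first timestamp from the peak_basics table above:

```python
import datetime

# strax stores times as int64 nanoseconds since the Unix epoch, while
# datetime works in seconds: hence the division by 1e9.
t_ns = 1540673061000064360  # the first 'time' entry from the table above
t = datetime.datetime.fromtimestamp(t_ns / 1e9, tz=datetime.timezone.utc)
print(t.isoformat())  # 2018-10-27T20:44:21.000064+00:00
```

This matches the start time of run 181027_2044 in the runs table.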

Plotting waveforms

This may help you if you want to look at specific waveforms. Below I’m just going to show that it works. As you can see, there are a lot of junk waveforms! We are going to do some cuts to select ‘nice’ ones later.

[13]:
# Let's plot the first 2
plot_some_peaks(peaks, max_plots = 2)
../_images/tutorials_Open_data_22_0.png
../_images/tutorials_Open_data_22_1.png
[14]:
# Or just look at two random peaks somewhere in our dataset
plot_some_peaks(peaks, max_plots=2, randomize = True)
../_images/tutorials_Open_data_23_0.png
../_images/tutorials_Open_data_23_1.png

Display populations

Area vs width plot

You see these plots a lot on Slack. We’ll make a 2D histogram of two properties of the peaks. From here we try to isolate a population and look at its waveforms.

[15]:
# All peaks
plots_area_vs_width(peaks,
                    log = True,
                    bins = 100,
                    range=[[0,4], [1,4]])
plt.title('Area vs width in log-log space of all populations\n10 s of data')
/home/angevaare/software/Miniconda3/envs/strax/lib/python3.6/site-packages/ipykernel_launcher.py:9: RuntimeWarning: invalid value encountered in log10
  if __name__ == '__main__':
/home/angevaare/software/Miniconda3/envs/strax/lib/python3.6/site-packages/ipykernel_launcher.py:9: RuntimeWarning: divide by zero encountered in log10
  if __name__ == '__main__':
[15]:
Text(0.5, 1.0, 'Area vs width in log-log space of all populations\n10 s of data')
../_images/tutorials_Open_data_25_2.png

I want to zoom in on the population on the lower right. Therefore I reduce the range:

[16]:
plots_area_vs_width(peaks,
                    log = True,
                    bins = 100,
                    range=[[2,4], [1.5,2.5]]) # zoomed wrt to previous plot!
plt.title('Area vs width in log-log space zoomed on large S1s\n10 s of data')
/home/angevaare/software/Miniconda3/envs/strax/lib/python3.6/site-packages/ipykernel_launcher.py:9: RuntimeWarning: invalid value encountered in log10
  if __name__ == '__main__':
/home/angevaare/software/Miniconda3/envs/strax/lib/python3.6/site-packages/ipykernel_launcher.py:9: RuntimeWarning: divide by zero encountered in log10
  if __name__ == '__main__':
[16]:
Text(0.5, 1.0, 'Area vs width in log-log space zoomed on large S1s\n10 s of data')
../_images/tutorials_Open_data_27_2.png

Let’s select this blob of peaks.

For this I switch to lin-lin space rather than log-log, as I’m too lazy to convert the selection I’m going to make from lin to log and vice versa.
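For reference, the conversion the lazy route avoids is just a power of ten; a short sketch mapping the log-space zoom box to linear units:

```python
import numpy as np

# The log10-space zoom box [[2, 4], [1.5, 2.5]] used above corresponds
# to area in [10**2, 10**4] PE and width in [10**1.5, 10**2.5] ns:
log_box = np.array([[2.0, 4.0], [1.5, 2.5]])
lin_box = 10.0 ** log_box
print(lin_box)  # [[  100. 10000.] [ ~31.6  ~316.2]]
```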

[17]:
# Let's look in linspace
plots_area_vs_width(peaks,
                    log = False,
                    bins = 100,
                    range=[[10**2, 10**4], [10**1.5, 10**2.5]])
plt.title('Area vs width in lin-lin space zoomed on large S1s\n10 s of data')
[17]:
Text(0.5, 1.0, 'Area vs width in lin-lin space zoomed on large S1s\n10 s of data')
../_images/tutorials_Open_data_29_1.png

Data selection

Let’s make a very rudimentary selection in this data.

Below I’m going to draw a box around the blob shown above and load some peaks from it. I suspect these are nice fat S1s.

[18]:
# Parameters used for plot below
low_area = 400
high_area = 8000
low_width = 40
high_width = 60
[19]:
# select data in box
plots_area_vs_width(peaks,
                    log = False,
                    bins = 100,
                    range=[[10**2, 10**4], [10**1.5, 10**2.5]])

ax = plt.gca()
x = np.arange(low_area,high_area,2)
y = np.full(len(x), low_width)
y2 = np.full(len(x), high_width)
ax.fill_between(x,y,y2, alpha = 0.5, color= 'red', label = 'selection')
plt.legend()
plt.title('Area vs width in lin-lin space zoomed on large S1s\nwith selection\n10 s of data')
[19]:
Text(0.5, 1.0, 'Area vs width in lin-lin space zoomed on large S1s\nwith selection\n10 s of data')
../_images/tutorials_Open_data_32_1.png

Load more data from the selection.

For this I will use a ‘selection string’; this will reduce memory usage. See straxferno 0 and 1: https://xe1t-wiki.lngs.infn.it/doku.php?id=xenon:xenon1t:aalbers:straxferno

I’m also going to load all of the data (the entire run)!

[20]:
# Clear peaks from our namespace, this reduces our RAM usage (what usually crashes a notebook).
del peaks
[21]:
%%time
# use selection string to load a lot of data
selected_peaks = st.get_array(run_id,
    targets = ( 'peak_basics'),
    selection_str = f'({high_area}>area) & (area>{low_area}) ' \
                    f'& ({high_width}>range_50p_area) & (range_50p_area>{low_width})')

CPU times: user 92 ms, sys: 51.7 ms, total: 144 ms
Wall time: 162 ms

Alternative method yielding the same (if all data were loaded)

The cell above returns the same results as this selection (only now we are taking the entire run instead of the first 100 s):

mask = ((high_area > peaks['area']) & (peaks['area'] > low_area) &
        (high_width > peaks['range_50p_area']) & (peaks['range_50p_area'] > low_width))
selected_peaks = peaks[mask]
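Since peak_basics is a numpy structured array, the mask approach can be tried standalone on a toy array (the values below are made up purely for illustration; the real array comes from st.get_array):

```python
import numpy as np

# Toy 'peak_basics'-like structured array with made-up areas and widths
peaks = np.zeros(4, dtype=[('area', np.float64), ('range_50p_area', np.float64)])
peaks['area'] = [100., 1000., 5000., 9000.]
peaks['range_50p_area'] = [30., 50., 55., 50.]

# The same box as in the plots above
low_area, high_area = 400, 8000
low_width, high_width = 40, 60

# Boolean mask: inside the box in both area and width
mask = ((peaks['area'] > low_area) & (peaks['area'] < high_area) &
        (peaks['range_50p_area'] > low_width) & (peaks['range_50p_area'] < high_width))
selected = peaks[mask]
print(len(selected))  # 2
```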
[22]:
# Let's double check that we have the data we want in the area vs width space
plots_area_vs_width(selected_peaks,
                    log = True,
                    bins = 100,
                    range=[[2,4], [1,3]])
plt.title('Area vs width in log-log space with selected S1s\nall data')
[22]:
Text(0.5, 1.0, 'Area vs width in log-log space with selected S1s\nall data')
../_images/tutorials_Open_data_37_1.png
[23]:
# select data in box
plots_area_vs_width(selected_peaks,
                    log = False,
                    bins = 100,
                    range=[[10**2,10**4], [10**1,10**2.5]])
plt.title('Area vs width in lin-lin space with selected S1s\nall data')
[23]:
Text(0.5, 1.0, 'Area vs width in lin-lin space with selected S1s\nall data')
../_images/tutorials_Open_data_38_1.png

Plotting waveforms

Let’s check that these are indeed big S1s:

[24]:
plot_some_peaks(selected_peaks)
../_images/tutorials_Open_data_40_0.png
../_images/tutorials_Open_data_40_1.png
../_images/tutorials_Open_data_40_2.png
../_images/tutorials_Open_data_40_3.png
../_images/tutorials_Open_data_40_4.png

Nice, we made a simple cut and checked that some waveforms are of the kind we expected. Of course we can do more elaborate cuts and check the other dimensions of the parameter space. Very interesting times to be an analyst!
