Download CSV-Like Files#
You can download the pre-generated CSV-like files that were created using the repository, without having to clone it. These files are available on EOS in the directory `/eos/lhcb/user/a/anthonyc/tracking/data/csv/`.
The YAML configuration files used to generate these files are available in `jobs/eos/`.
List of Available Data (from v2)#
The most up-to-date data available for use are in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/`.
The files contain hits from the VELO, UT, and SciFi detectors, as well as MC particles.
They are stored in `.parquet.lz4` format, which is faster to read and much smaller on disk.
Important

Please note that the files from the `RAL_MC-DST` storage element had to be removed due to issues with Moore reading them. Even after downloading the files and running Moore on them locally, the same bug was encountered.
Version 2.0#
The data for version 2.0 is summarized in the table below. The files are available in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2/{data}`.
| data | What is it? | # events | # events / subfolder |
|---|---|---|---|
| | All SMOG2 data | 42,125 | ~ 1000 or 2000 |
| | Min-bias data in sim10b, first 500 files | 464,078 | ~ 1000 or 2000 |
| | Min-bias data in sim10b, 500 files, from the 510th | 462,395 | |
| | Min-bias data in sim10a, first 500 files | 468,985 | ~ 100 |
The bookkeeping paths or local paths of the original files are detailed in the table below.

| data | Original path |
|---|---|
| | |
| | |
| | Same as the previous one |
| | |
Version 2.1#
The data for version 2.1 is summarized in the table below. The files are available in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.1/{data}`.
| data | What is it? | # events | # events / subfolder |
|---|---|---|---|
| | Min-bias data in sim10b, 500 files, no spill-over | 964,498 | ~ 5000 |
The bookkeeping paths or local paths of the original files are detailed in the table below.

| data | Original path |
|---|---|
| | |
Version 2.2.2#
The data for version 2.2.2 is summarized in the table below. The files are available in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.2.2/{data}`.
| data | What is it? | # events | # events / subfolder |
|---|---|---|---|
| | Min-bias data in sim10b, 500 files | ~400,000 | ~ 1000 or 2000 |
| | 500 files of events with \(B^{+} \to K^{\star} e^{+}e^{-}\) decays | ~400,000 | ~ 1000 or 2000 |
The bookkeeping paths or local paths of the original files are detailed in the table below.

| data | Original path |
|---|---|
| | |
Version 2.3#
The data for version 2.3 is summarized in the table below. The files are available in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.3/{data}`.
| data | What is it? | # events | # events / subfolder |
|---|---|---|---|
| | Min-bias data in sim10b, 500 files, no spill-over | ~400,000 | ~ 5000 |
| | \(B^{+} \to K_{s} \pi\), 100 files, with spill-over | ~100,000 | ~ 2000 |
| | \(D^{\star} \to \left(D^{0} \to K_S \pi \pi \right) \pi\), 100 files, with spill-over | ~100,000 | ~ 2000 |
Version 2.4#
The data for version 2.4 is summarized in the table below. The files are available in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.4/{data}`.
| data | What is it? | # events | # events / subfolder |
|---|---|---|---|
| | Min-bias data in sim10b, 1500 files, no spill-over | ~1,500,000 | ~ 1000 |
Download the Data#
Using an XRootD Client (recommended)#
Prerequisites

Kerberos must be installed. Refer to this guide for instructions.
The XRootD client must be installed. On Ubuntu, as partially explained in this guide, execute:
```bash
# Configure the required APT repository
echo "deb [arch=$(dpkg --print-architecture)] http://storage-ci.web.cern.ch/storage-ci/debian/xrootd/ $(lsb_release -cs) release" | sudo tee -a /etc/apt/sources.list.d/cerneos-client.list > /dev/null
curl -sL http://storage-ci.web.cern.ch/storage-ci/storageci.key | sudo apt-key add -
sudo apt update

# Install `xrootd-client`
sudo apt install xrootd-client
```
Then, after authenticating with Kerberos (`kinit <your-username>@CERN.CH`), you can download the data using the following commands:
```bash
xrdcp -r root://eoslhcb.cern.ch//eos/lhcb/user/a/anthonyc/tracking/data/csv/v2/smog2-digi/ . --parallel 4
xrdcp -r root://eoslhcb.cern.ch//eos/lhcb/user/a/anthonyc/tracking/data/csv/v2/minbias-sim10aU1-xdigi/ . --parallel 4
xrdcp -r root://eoslhcb.cern.ch//eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.2.2/bu2kstee-sim10aU1-xdigi/ . --parallel 4
xrdcp -r root://eoslhcb.cern.ch//eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.3/minbias-sim10b-xdigi-nospillover/ . --parallel 3
xrdcp -r root://eoslhcb.cern.ch//eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.4/minbias-sim10b-xdigi/ . --parallel 4
```
The `--parallel n` option parallelises the copy, using up to `n <= 4` threads.
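If you want to fetch several productions in one go, the same `xrdcp` calls can be scripted. Below is a minimal Python sketch that simply shells out to `xrdcp`; it assumes `xrdcp` is on your `PATH` and that a valid Kerberos ticket is available, and the dataset list is only an example:

```python
import subprocess

# Example dataset directories under the EOS csv/ area (adjust to what you need).
DATASETS = [
    "v2/smog2-digi",
    "v2/minbias-sim10aU1-xdigi",
]

EOS_PREFIX = "root://eoslhcb.cern.ch//eos/lhcb/user/a/anthonyc/tracking/data/csv"

for dataset in DATASETS:
    # Recursive copy into the current directory with up to 4 parallel streams,
    # mirroring the xrdcp commands above.
    subprocess.run(
        ["xrdcp", "-r", f"{EOS_PREFIX}/{dataset}/", ".", "--parallel", "4"],
        check=True,
    )
```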
From LXPLUS#
To download the data, you can copy it from one of the LXPLUS machines, where the EOS space is FUSE-mounted, using the following command:

```bash
scp -r jdoe@lxplus.cern.ch:/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2/minbias-sim10b-xdigi/ .
```

Make sure to replace `jdoe` with your actual username.
Organisation of the CSV-like files#
To process the large amount of data represented by the parquet files and the (X)DIGI files they come from, a job must be split into smaller subjobs that can be executed in parallel (in this case, using HTCondor) and produce smaller, more manageable output files.

The processing of a given set of `n` (e.g., 500) (X)DIGI files (e.g., `minbias-sim10b-xdigi`) is split into subjobs, with each subjob processing a subset of the files (e.g., two files per subjob). The resulting CSV-like output is stored in different subfolders (e.g., within `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2/minbias-sim10b-xdigi/`). Each subfolder is named after the index of the first file processed (e.g., `0`, `2`, etc.) and contains the following `.parquet.lz4` files:
- `hits_velo.parquet.lz4`: VELO hits
- `hits_ut.parquet.lz4`: UT hits
- `hits_scifi.parquet.lz4`: SciFi hits
- `mc_particles.parquet.lz4`: MC particles
In addition to the `.parquet.lz4` files, each subfolder contains a `log.yaml` file that provides information about the status of the job, including:

- `input`: a list of the input paths
- `banned`: a list of the paths that were removed
- `returncode`: the job’s return code (0 for success, non-zero for failure)
If the `log.yaml` file does not exist, it means that the job was not fully processed.

Furthermore, each production also includes a `job_config.yaml` file that stores the job’s configuration. This file contains all the unrolled input paths used for the job.
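As a concrete example of how this layout can be used, the sketch below walks a local copy of one production, skips subfolders whose `log.yaml` is missing (i.e. jobs that were not fully processed) or whose `returncode` is non-zero, and counts the usable subfolders. The local folder name and the use of PyYAML are assumptions:

```python
from pathlib import Path

import yaml  # PyYAML, assumed to be installed (pip install pyyaml)

# Hypothetical local copy of one production, e.g. downloaded with xrdcp as shown above.
production = Path("minbias-sim10b-xdigi")

subfolders = sorted(p for p in production.iterdir() if p.is_dir())
usable = []
for subfolder in subfolders:
    log_path = subfolder / "log.yaml"
    if not log_path.exists():
        # No log.yaml: the subjob was not fully processed.
        continue
    log = yaml.safe_load(log_path.read_text())
    if log.get("returncode") != 0:
        # Non-zero return code: the subjob failed.
        continue
    usable.append(subfolder)

print(f"{len(usable)} usable subfolders out of {len(subfolders)}")
```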
Note

At present, the repository does not have the capability to split CSV-like files per event. If such a capability were implemented, care would need to be taken over how the CSV-like files can be read efficiently in Python.
Log#
The log for the production of these files can be found in my AFS space at the following path:

`/afs/cern.ch/work/a/anthonyc/public/tracking/persist-lhcb-tracks/log`

To download the logs on Ubuntu, you can use `rsync`, for example:

```bash
rsync -aP lxplus.cern.ch:/afs/cern.ch/work/a/anthonyc/public/tracking/persist-lhcb-tracks/log .
```

Alternatively, you can use the `scp` command.
Note

It is not feasible to store the log files on EOS due to the large number of small files involved. As a result, the log files are currently stored in my AFS space. However, I am open to suggestions for alternative storage solutions that would provide more consistent access to the log files while avoiding the limitations of EOS.
The log for a given production includes several files:

- The `condor.submit` file, which is the submit file used to submit the job to HTCondor
- The main `log.log` file, which summarizes the job submission and execution details for all subjobs
- For each subjob, the `stdout` and `stderr` of the subjob are stored in the `stdout.{n}.log` and `stderr.{n}.log` files, respectively. The value of `{n}` corresponds to the index of this subjob, and is the same as the subfolder name of the corresponding CSV-like files in the EOS space (see the sketch after this list).
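Because `{n}` matches the subfolder name on EOS, you can go from a given subfolder directly to the corresponding subjob logs. Below is a minimal sketch assuming the logs for one production have been copied locally into `log/minbias-sim10b-xdigi`; the exact layout of the log copy is an assumption:

```python
from pathlib import Path

# Hypothetical local copy of the logs for one production (see the rsync command above).
log_dir = Path("log/minbias-sim10b-xdigi")


def subjob_logs(subfolder_name: str) -> tuple[Path, Path]:
    """Return the stdout and stderr log paths for the subjob whose CSV-like
    output lives in the subfolder with the given name (e.g. "0" or "2")."""
    n = subfolder_name
    return log_dir / f"stdout.{n}.log", log_dir / f"stderr.{n}.log"


stdout_path, stderr_path = subjob_logs("0")
print(stdout_path, stderr_path)
```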
Open a `parquet.lz4` File#
To open a Parquet file and load it into a pandas DataFrame, you need to install the `pyarrow` package (e.g., `pip install pyarrow`). After installing `pyarrow`, you can use the following code with pandas:
```python
import pandas as pd

#: path to the `.parquet` or `.parquet.lz4` file
path = ...

dataframe = pd.read_parquet(path, engine="pyarrow")
```
Alternatively, you can use `pyarrow` directly to accomplish the same task:

```python
import pyarrow.parquet as pq

path = ...

dataframe = pq.read_table(path).to_pandas()
```
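In practice you will often want to combine the per-subfolder files of a production into a single DataFrame. Below is a minimal sketch, assuming a local copy of a production in a folder named `minbias-sim10b-xdigi` (the folder name is only an example):

```python
from pathlib import Path

import pandas as pd

# Hypothetical local copy of one production.
production = Path("minbias-sim10b-xdigi")

frames = []
for parquet_path in sorted(production.glob("*/hits_velo.parquet.lz4")):
    frame = pd.read_parquet(parquet_path, engine="pyarrow")
    # Remember which subfolder (i.e. which subjob) each hit came from.
    frame["subfolder"] = parquet_path.parent.name
    frames.append(frame)

velo_hits = pd.concat(frames, ignore_index=True)
print(velo_hits.shape)
```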