Download CSV-Like Files#
You can download the pre-generated CSV-like files that were created using the repository, without having to clone it. These files are available on EOS in the directory `/eos/lhcb/user/a/anthonyc/tracking/data/csv/`.
The YAML configuration files used to generate these files are available in `jobs/eos/`.
List of Available Data (from v2)#
The most up-to-date data available for use are in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/`.
The files contain hits from the VELO, UT, and SciFi detectors, as well as MC particles.
They are stored in `.parquet.lz4` format, which is faster to read and much smaller on disk.
Important

Please note that the files from the `RAL_MC-DST` storage element had to be removed due to issues with Moore reading them. Even after downloading the files and running Moore on them locally, the same bug was encountered.
Version 2.0#
The data for version 2.0 is summarized in the table below. The files are available in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2/{data}`.
| data | What is it? | # events | # events / subfolder |
|---|---|---|---|
| | All SMOG2 data | 42,125 | ~ 1000 or 2000 |
| | Min-bias data in sim10b, first 500 files | 464,078 | ~ 1000 or 2000 |
| | Min-bias data in sim10b, 500 files, from the 510th | 462,395 | |
| | Min-bias data in sim10a, first 500 files | 468,985 | ~ 100 |
The bookkeeping paths or local paths of the original files are detailed in the table below.

| data | Original path |
|---|---|
| | |
| | |
| | Same as the previous one |
| | |
Version 2.1#
The data for version 2.1 is summarized in the table below. The files are available in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.1/{data}`.
| data | What is it? | # events | # events / subfolder |
|---|---|---|---|
| | Min-bias data in sim10b, 500 files, no spill-over | 964,498 | ~ 5000 |
The bookkeeping paths or local paths of the original files are detailed in the table below.

| data | Original path |
|---|---|
| | |
Version 2.2.2#
The data for version 2.2.2 is summarized in the table below. The files are available in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.2.2/{data}`.
| data | What is it? | # events | # events / subfolder |
|---|---|---|---|
| | Min-bias data in sim10b, 500 files | ~400,000 | ~ 1000 or 2000 |
| | 500 files of events with \(B^{+} \to K^{\star} e^{+}e^{-}\) decays | ~400,000 | ~ 1000 or 2000 |
The bookkeeping paths or local paths of the original files are detailed in the table below.

| data | Original path |
|---|---|
| | |
Version 2.3#
The data for version 2.3 is summarized in the table below. The files are available in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.3/{data}`.
| data | What is it? | # events | # events / subfolder |
|---|---|---|---|
| | Min-bias data in sim10b, 500 files, no spill-over | ~400,000 | ~ 5000 |
| | \(B^{+} \to K_{s} \pi\), 100 files, with spill-over | ~100,000 | ~ 2000 |
| | \(D^{\star} \to \left(D^{0} \to K_S \pi \pi \right) \pi\), 100 files, with spill-over | ~100,000 | ~ 2000 |
Version 2.4#
The data for version 2.4 is summarized in the table below. The files are available in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.4/{data}`.
| data | What is it? | # events | # events / subfolder |
|---|---|---|---|
| | Min-bias data in sim10b, 1500 files, no spill-over | ~1,500,000 | ~ 1000 |
Download the Data#
Using an XRootD Client (recommended)#
Prerequisites

Kerberos must be installed. Refer to this guide for instructions.
The XRootD client must be installed. On Ubuntu, as partially explained in this guide, execute:
```bash
# Configure the required APT repository
echo "deb [arch=$(dpkg --print-architecture)] http://storage-ci.web.cern.ch/storage-ci/debian/xrootd/ $(lsb_release -cs) release" | sudo tee -a /etc/apt/sources.list.d/cerneos-client.list > /dev/null
curl -sL http://storage-ci.web.cern.ch/storage-ci/storageci.key | sudo apt-key add -
sudo apt update

# Install `xrootd-client`
sudo apt install xrootd-client
```
Then, after authenticating with Kerberos (`kinit <your-username>@CERN.CH`), you can download the data using the following commands:
```bash
xrdcp -r root://eoslhcb.cern.ch//eos/lhcb/user/a/anthonyc/tracking/data/csv/v2/smog2-digi/ . --parallel 4
xrdcp -r root://eoslhcb.cern.ch//eos/lhcb/user/a/anthonyc/tracking/data/csv/v2/minbias-sim10aU1-xdigi/ . --parallel 4
xrdcp -r root://eoslhcb.cern.ch//eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.2.2/bu2kstee-sim10aU1-xdigi/ . --parallel 4
xrdcp -r root://eoslhcb.cern.ch//eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.3/minbias-sim10b-xdigi-nospillover/ . --parallel 3
xrdcp -r root://eoslhcb.cern.ch//eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.4/minbias-sim10b-xdigi/ . --parallel 4
```
The `--parallel n` option parallelises the copy, using up to `n <= 4` threads.
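If you want to fetch several productions in one go, the same `xrdcp` calls can be scripted. Below is a minimal Python sketch that simply shells out to `xrdcp`; it assumes `xrdcp` is on your `PATH` and that a valid Kerberos ticket is available, and the dataset list is only an example:

```python
import subprocess

# Example dataset directories under the EOS csv/ area (adjust to what you need).
DATASETS = [
    "v2/smog2-digi",
    "v2/minbias-sim10aU1-xdigi",
]

EOS_PREFIX = "root://eoslhcb.cern.ch//eos/lhcb/user/a/anthonyc/tracking/data/csv"

for dataset in DATASETS:
    # Recursive copy into the current directory with up to 4 parallel streams,
    # mirroring the xrdcp commands above.
    subprocess.run(
        ["xrdcp", "-r", f"{EOS_PREFIX}/{dataset}/", ".", "--parallel", "4"],
        check=True,
    )
```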
From LXPLUS#
To download the data, you can copy it from one of the LXPLUS machines, where the EOS space is FUSE-mounted, using the following command:

```bash
scp -r jdoe@lxplus.cern.ch:/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2/minbias-sim10b-xdigi/ .
```

Make sure to replace `jdoe` with your actual username.
Organisation of the CSV-like files#
To process the large amount of data represented by the parquet files and the (X)DIGI files they come from, a job must be split into smaller subjobs that can be executed in parallel (in this case, using HTCondor) and produce smaller, more manageable output files.

The processing of a given set of `n` (e.g., 500) (X)DIGI files (e.g., `minbias-sim10b-xdigi`) is split into subjobs, with each subjob processing a subset of the files (e.g., two files per subjob). The resulting CSV-like output is stored in different subfolders (e.g., within `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2/minbias-sim10b-xdigi/`). Each subfolder is named after the index of the first file processed (e.g., `0`, `2`, etc.) and contains the following `.parquet.lz4` files:
- `hits_velo.parquet.lz4`: VELO hits
- `hits_ut.parquet.lz4`: UT hits
- `hits_scifi.parquet.lz4`: SciFi hits
- `mc_particles.parquet.lz4`: MC particles
In addition to the `.parquet.lz4` files, each subfolder contains a `log.yaml` file that provides information about the status of the job, including:

- `input`: a list of the input paths
- `banned`: a list of the paths that were removed
- `returncode`: the job’s return code (0 for success, non-zero for failure)
If the `log.yaml` file does not exist, it means that the job was not fully processed.

Furthermore, each production also includes a `job_config.yaml` file that stores the job’s configuration. This file contains all the unrolled input paths used for the job.
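As a concrete example of how this layout can be used, the sketch below walks a local copy of one production, skips subfolders whose `log.yaml` is missing (i.e. jobs that were not fully processed) or whose `returncode` is non-zero, and counts the usable subfolders. The local folder name and the use of PyYAML are assumptions:

```python
from pathlib import Path

import yaml  # PyYAML, assumed to be installed (pip install pyyaml)

# Hypothetical local copy of one production, e.g. downloaded with xrdcp as shown above.
production = Path("minbias-sim10b-xdigi")

subfolders = sorted(p for p in production.iterdir() if p.is_dir())
usable = []
for subfolder in subfolders:
    log_path = subfolder / "log.yaml"
    if not log_path.exists():
        # No log.yaml: the subjob was not fully processed.
        continue
    log = yaml.safe_load(log_path.read_text())
    if log.get("returncode") != 0:
        # Non-zero return code: the subjob failed.
        continue
    usable.append(subfolder)

print(f"{len(usable)} usable subfolders out of {len(subfolders)}")
```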
Note

At present, the repository does not have the capability to split CSV-like files per event. If such a capability were implemented, care would need to be taken over how the CSV-like files can be read efficiently in Python.
Log#
The log for the production of these files can be found in my AFS space at the following path:

`/afs/cern.ch/work/a/anthonyc/public/tracking/persist-lhcb-tracks/log`

To download the logs on Ubuntu, you can use `rsync`, for example:

```bash
rsync -aP lxplus.cern.ch:/afs/cern.ch/work/a/anthonyc/public/tracking/persist-lhcb-tracks/log .
```

Alternatively, you can use the `scp` command.
Note

It is not feasible to store the log files on EOS due to the large number of small files involved. As a result, the log files are currently stored in my AFS space. However, I am open to suggestions for alternative storage solutions that would provide more consistent access to the log files while avoiding the limitations of EOS.
The log for a given production includes several files:

- The `condor.submit` file, which is the submit file used to submit the job to HTCondor
- The main `log.log` file, which summarizes the job submission and execution details for all subjobs
- For each subjob, the `stdout` and `stderr` of the subjob are stored in the `stdout.{n}.log` and `stderr.{n}.log` files, respectively. The value of `{n}` corresponds to the index of this subjob, and is the same as the subfolder name of the corresponding CSV-like files in the EOS space (see the sketch after this list).
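Because `{n}` matches the subfolder name on EOS, you can go from a given subfolder directly to the corresponding subjob logs. Below is a minimal sketch assuming the logs for one production have been copied locally into `log/minbias-sim10b-xdigi`; the exact layout of the log copy is an assumption:

```python
from pathlib import Path

# Hypothetical local copy of the logs for one production (see the rsync command above).
log_dir = Path("log/minbias-sim10b-xdigi")


def subjob_logs(subfolder_name: str) -> tuple[Path, Path]:
    """Return the stdout and stderr log paths for the subjob whose CSV-like
    output lives in the subfolder with the given name (e.g. "0" or "2")."""
    n = subfolder_name
    return log_dir / f"stdout.{n}.log", log_dir / f"stderr.{n}.log"


stdout_path, stderr_path = subjob_logs("0")
print(stdout_path, stderr_path)
```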
Open a `parquet.lz4` File#
To open a Parquet file and load it into a pandas DataFrame, you need to install the `pyarrow` package (e.g., `pip install pyarrow`). After installing `pyarrow`, you can use the following code with pandas:
```python
import pandas as pd

#: path to the `.parquet` or `.parquet.lz4` file
path = ...

dataframe = pd.read_parquet(path, engine="pyarrow")
```
Alternatively, you can use `pyarrow` directly to accomplish the same task:

```python
import pyarrow.parquet as pq

path = ...

dataframe = pq.read_table(path).to_pandas()
```
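In practice you will often want to combine the per-subfolder files of a production into a single DataFrame. Below is a minimal sketch, assuming a local copy of a production in a folder named `minbias-sim10b-xdigi` (the folder name is only an example):

```python
from pathlib import Path

import pandas as pd

# Hypothetical local copy of one production.
production = Path("minbias-sim10b-xdigi")

frames = []
for parquet_path in sorted(production.glob("*/hits_velo.parquet.lz4")):
    frame = pd.read_parquet(parquet_path, engine="pyarrow")
    # Remember which subfolder (i.e. which subjob) each hit came from.
    frame["subfolder"] = parquet_path.parent.name
    frames.append(frame)

velo_hits = pd.concat(frames, ignore_index=True)
print(velo_hits.shape)
```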