# Download CSV-Like Files

You can download the pre-generated CSV-like files produced with this repository without having to clone it. The files are available on EOS in the directory `/eos/lhcb/user/a/anthonyc/tracking/data/csv/`.

The YAML configuration files that were used to generate these files are available in `jobs/eos/`.

## List of Available Data (from v2)

The most up-to-date data are in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/`. The files contain hits from the VELO, UT, and SciFi detectors, as well as MC particles. They are stored in `.parquet.lz4` format, which is faster to read and much smaller than plain CSV.

> **Important:** The files from the RAL_MC-DST storage element had to be removed due to issues with Moore reading them. Even when the files were downloaded and run with Moore locally, the same bug was encountered.

### Version 2.0

The data for version 2.0 is summarized in the table below. The files are available in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2/{data}`.

| `data` | What is it? | # events | # events / subfolder |
| --- | --- | --- | --- |
| `smog2-digi` | All SMOG2 data | 42,125 | ~1000 or 2000 |
| `minbias-sim10b-xdigi` | Min-bias data in sim10b, first 500 files | 464,078 | ~1000 or 2000 |
| `minbias-sim10b-xdigi-part2` | Min-bias data in sim10b, 500 files, starting from the 510th | 462,395 | |
| `minbias-sim10aU1-xdigi` | Min-bias data in sim10aU1, first 500 files | 468,985 | ~100 |

The bookkeeping paths or local paths of the original files are detailed in the table below.

| `data` | Original path |
| --- | --- |
| `smog2-digi` | `/eos/lhcb/wg/IonPhysics/Simulations/SMOG2Arv56/digi/*.digi` |
| `minbias-sim10b-xdigi` | `/MC/Upgrade/Beam7000GeV-Upgrade-MagDown-Nu7.6-25ns-Pythia8/Sim10b/30000000/XDIGI` |
| `minbias-sim10b-xdigi-part2` | Same as the previous one |
| `minbias-sim10aU1-xdigi` | `/MC/Upgrade/Beam7000GeV-Upgrade-MagDown-Nu7.6-25ns-Pythia8/Sim10aU1/30000000/XDIGI` |

### Version 2.1

The data for version 2.1 is summarized in the table below. The files are available in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.1/{data}`.

| `data` | What is it? | # events | # events / subfolder |
| --- | --- | --- | --- |
| `minbias-sim10b-xdigi-nospillover` | Min-bias data in sim10b, 500 files, no spill-over | 964,498 | ~5000 |

The bookkeeping paths or local paths of the original files are detailed in the table below.

| `data` | Original path |
| --- | --- |
| `minbias-sim10b-xdigi-nospillover` | `/MC/Upgrade/Beam7000GeV-Upgrade-MagDown-Nu7.6-Pythia8/Sim10b/30000000/XDIGI` |

### Version 2.2.2

The data for version 2.2.2 is summarized in the table below. The files are available in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.2.2/{data}`.

| `data` | What is it? | # events | # events / subfolder |
| --- | --- | --- | --- |
| `minbias-sim10b-xdigi` | Min-bias data in sim10b, 500 files | ~400,000 | ~1000 or 2000 |
| `bu2kstee-sim10aU1-xdigi` | 500 files of events with \(B^{+} \to K^{\star} e^{+}e^{-}\) decays | ~400,000 | ~1000 or 2000 |

The bookkeeping paths or local paths of the original files are detailed in the table below.

| `data` | Original path |
| --- | --- |
| `bu2kstee-sim10aU1-xdigi` | `/MC/Upgrade/Beam7000GeV-Upgrade-MagDown-Nu7.6-25ns-Pythia8/Sim10aU1/11124001/XDIGI` |

### Version 2.3

The data for version 2.3 is summarized in the table below. The files are available in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.3/{data}`.

| `data` | What is it? | # events | # events / subfolder |
| --- | --- | --- | --- |
| `minbias-sim10b-xdigi-nospillover` | Min-bias data in sim10b, 500 files, no spill-over | ~400,000 | ~5000 |
| `bu2kspi-sim10aU1-xdigi` | \(B^{+} \to K_{S} \pi\), 100 files, with spill-over | ~100,000 | ~2000 |
| `dst2d0pi_kspipi-sim10aU1-xdigi` | \(D^{\star} \to \left(D^{0} \to K_{S} \pi \pi \right) \pi\), 100 files, with spill-over | ~100,000 | ~2000 |

### Version 2.4

The data for version 2.4 is summarized in the table below. The files are available in `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2.4/{data}`.

| `data` | What is it? | # events | # events / subfolder |
| --- | --- | --- | --- |
| `minbias-sim10b-xdigi` | Min-bias data in sim10b, 1500 files, no spill-over | ~1,500,000 | ~1000 |

## Download the Data

### From LXPLUS

To download the data, you can copy it from one of the LXPLUS machines, where the EOS space is FUSE-mounted, using the following command:

```bash
scp -r jdoe@lxplus.cern.ch:/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2/minbias-sim10b-xdigi/ .
```

Make sure to replace `jdoe` with your actual username.

## Organisation of the CSV-like files

To process the large amount of data represented by the parquet files and the (X)DIGI files they come from, a job must be split into smaller subjobs that can be executed in parallel (in this case, using HTCondor) and produce smaller, more manageable output files.

The processing of a given set of *n* (e.g., 500) (X)DIGI files (e.g., `minbias-sim10b-xdigi`) is split into subjobs, with each subjob processing a subset of the files (e.g., two files per subjob). The resulting CSV-like output is stored in separate subfolders (e.g., within `/eos/lhcb/user/a/anthonyc/tracking/data/csv/v2/minbias-sim10b-xdigi/`).

Each subfolder is named after the index of the first file processed (e.g., `0`, `2`, etc.) and contains the following `.parquet.lz4` files:

- `hits_velo.parquet.lz4`: VELO hits
- `hits_ut.parquet.lz4`: UT hits
- `hits_scifi.parquet.lz4`: SciFi hits
- `mc_particles.parquet.lz4`: MC particles
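
To get a quick overview of a production, you can walk its subfolders and check that each one contains the four expected files. The snippet below is a minimal sketch: the local `production` path is a placeholder for wherever you copied (or FUSE-mounted) the EOS directory.

```python
from pathlib import Path

# Placeholder: local copy of one production, e.g. v2/minbias-sim10b-xdigi
production = Path("data/csv/v2/minbias-sim10b-xdigi")

EXPECTED_FILES = [
    "hits_velo.parquet.lz4",
    "hits_ut.parquet.lz4",
    "hits_scifi.parquet.lz4",
    "mc_particles.parquet.lz4",
]

# Subfolders are named after the index of the first file processed,
# so sort them numerically rather than lexicographically
subfolders = sorted(
    (path for path in production.iterdir() if path.is_dir()),
    key=lambda path: int(path.name),
)

for subfolder in subfolders:
    missing = [name for name in EXPECTED_FILES if not (subfolder / name).exists()]
    if missing:
        print(f"Subfolder {subfolder.name} is missing: {missing}")
```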

In addition to the `.parquet.lz4` files, each subfolder contains a `log.yaml` file that provides information about the status of the job, including:

- `input`: a list of the input paths
- `banned`: a list of the paths that were removed
- `returncode`: the job's return code (0 for success, non-zero for failure)

If the `log.yaml` file does not exist, the job was not fully processed.
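
Here is a minimal sketch of such a check, assuming the `log.yaml` keys hold the structure listed above (lists for `input` and `banned`, an integer `returncode`) and that PyYAML is installed (`pip install pyyaml`); the subfolder path is a placeholder:

```python
from pathlib import Path

import yaml  # PyYAML

# Placeholder: one subfolder of a production
subfolder = Path("data/csv/v2/minbias-sim10b-xdigi/0")
log_path = subfolder / "log.yaml"

if not log_path.exists():
    # No log.yaml means the subjob was not fully processed
    print(f"Subfolder {subfolder.name}: not fully processed")
else:
    with open(log_path) as log_file:
        log = yaml.safe_load(log_file)
    status = "success" if log["returncode"] == 0 else f"failed ({log['returncode']})"
    print(
        f"Subfolder {subfolder.name}: {status}, "
        f"{len(log['input'])} input path(s), {len(log['banned'])} banned"
    )
```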

In addition, each production includes a `job_config.yaml` file that stores the job's configuration, including all the unrolled input paths used for the job.

> **Note:** At present, the repository does not have the capability to split CSV-like files per event. If such a capability were implemented, care would be needed to ensure the CSV-like files can still be read efficiently in Python.

## Log

The log for the production of these files can be found in my AFS space at the following path:

```
/afs/cern.ch/work/a/anthonyc/public/tracking/persist-lhcb-tracks/log
```

To download the logs on Ubuntu, you can use `rsync`, for example:

```bash
rsync -aP lxplus.cern.ch:/afs/cern.ch/work/a/anthonyc/public/tracking/persist-lhcb-tracks/log .
```

Alternatively, you can use the `scp` command.

> **Note:** It is not feasible to store the log files on EOS due to the large number of small files involved, so they are currently stored in my AFS space. Suggestions for alternative storage solutions that provide more consistent access while avoiding the limitations of EOS are welcome.

The log for a given production includes several files:

- The `condor.submit` file, which is the submit file used to submit the job to HTCondor.
- The main `log.log` file, which summarizes the job submission and execution details for all subjobs.
- For each subjob, the stdout and stderr are stored in the `stdout.{n}.log` and `stderr.{n}.log` files, respectively. The value of `{n}` is the index of the subjob, and matches the name of the subfolder containing the corresponding CSV-like files in the EOS space.
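
For example, to spot subjobs that wrote to stderr, you can scan the per-subjob stderr files. This is a minimal sketch; the local `log_dir` path is a placeholder for a downloaded copy of one production's log directory:

```python
from pathlib import Path

# Placeholder: local copy of the log directory for one production
log_dir = Path("log/minbias-sim10b-xdigi")

# `stderr.{n}.log` files, sorted by the subjob index `{n}`
stderr_paths = sorted(
    log_dir.glob("stderr.*.log"),
    key=lambda path: int(path.name.split(".")[1]),
)

for stderr_path in stderr_paths:
    index = stderr_path.name.split(".")[1]
    content = stderr_path.read_text()
    if content.strip():
        # The index also identifies the EOS subfolder of this subjob
        print(f"Subjob {index} (EOS subfolder {index}) wrote to stderr:")
        print(content[:500])  # show at most the first 500 characters
```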

## Open a parquet.lz4 File

To open a parquet file and load it into a Pandas dataframe, you need to install the `pyarrow` package (e.g., `pip install pyarrow`).

After installing `pyarrow`, you can use the following code with Pandas:

```python
import pandas as pd

# Path to the `.parquet` or `.parquet.lz4` file
path = ...
dataframe = pd.read_parquet(path, engine="pyarrow")
```

Alternatively, you can use `pyarrow` directly to accomplish the same task:

```python
import pyarrow.parquet as pq

# Path to the `.parquet` or `.parquet.lz4` file
path = ...
dataframe = pq.read_table(path).to_pandas()
```
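
As a usage example, the following sketch concatenates the VELO hits of an entire production into a single dataframe. The local `production` path is again a placeholder:

```python
from pathlib import Path

import pandas as pd

# Placeholder: local copy of one production
production = Path("data/csv/v2/minbias-sim10b-xdigi")

# Read the VELO hits of every subfolder and concatenate them
frames = [
    pd.read_parquet(path, engine="pyarrow")
    for path in sorted(production.glob("*/hits_velo.parquet.lz4"))
]
velo_hits = pd.concat(frames, ignore_index=True)
print(f"{len(velo_hits)} VELO hits from {len(frames)} subfolder(s)")
```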