The XDIGI2CSV Program#

XDIGI2CSV is a program that converts raw data (in the form of (X)DIGI files) from particle detector simulations into CSV files that can be used for further analysis. It provides a flexible and configurable way to convert different types of data from multiple detectors into separate CSV files. Using this repository, the program can be configured through a YAML file, and can be run either locally or on a cluster using the HTCondor system.

This page explains how to run this program.

  • The Prerequisites section recalls what is required to run the program

  • The Run Using Python Files section shows how to run the XDIGI2CSV program like any other Moore algorithm, that is, using Python files

  • The Example: Using This Repository section explains how to run the XDIGI2CSV program using this repository, by configuring the algorithm using a YAML file

  • Three other sections explain how to further configure this algorithm

  • The Run in HTCondor section explains how to submit many XDIGI2CSV jobs to HTCondor.

If you would like more information about the algorithms used by the XDIGI2CSV program, please refer to the More Information About the XDIGI2CSV Program page.

Prerequisites#

As explained in the Setup page, you need to have the LHCb stack set up on the anthonyc-persistence_csv branches, in a detdesc build (e.g., x86_64_v2-centos7-gcc11+detdesc-opt).

Do not forget to run

source setup/setup.sh

and to set up your LHCb proxy with lhcb-proxy-init if you use files stored in the grid.
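For instance, a typical session at the root of the repository starts with:

source setup/setup.sh
lhcb-proxy-init  # only needed if your input files are stored on the grid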

Run Using Python Files#

The XDIGI2CSV program can be run like any other Moore algorithm, and this section explains how to do so. Behind the scenes, the Python helper scripts such as run/run.py simply execute these Python files.

To run this algorithm, you need to:

  • Set up an input.py Python file that specifies the input, which can be either a local file or a file stored on the grid.

  • Configure the algorithms in Moore, using a Python file such as run/xdigi2csv/moore_program_standalone.py.

An example of an input python file is provided in jobs/examples/xdigi2csv/input_pfn.py:

from Moore import options

# These won't change:
options.simulation = True  # we are only working with simulated data
options.input_type = "ROOT"  # use `"ROOT"` for XDIGI files
options.data_type = "Upgrade"  # we are only working with simulations for the Upgrade

# These depend on the MC Upgrade simulation you're using
options.dddb_tag = "dddb-20221004"
options.conddb_tag = "sim-20220929-vc-md100"
options.input_files = [
    "root://gridproxy@ccxrootdlhcb.in2p3.fr//pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi"
]
options.evt_max = 100  # at most 100 events

options.input_files is a list of physical file names (PFNs) pointing to the input files, whether they are stored on the grid or on your local machine. You can obtain the dddb_tag and conddb_tag by following the instructions on the Accessing Data on the Grid page.

Alternatively, you can provide logical file names (LFNs) and an XML catalog, as shown in the jobs/examples/xdigi2csv/input_lfn.py python file:

from Moore import options
from Gaudi.Configuration import (
    Gaudi__MultiFileCatalog as FileCatalog,
    ApplicationMgr,
)

options.simulation = True
options.data_type = "Upgrade"
options.conddb_tag = "sim-20220929-vc-md100"
options.dddb_tag = "dddb-20221004"
options.input_type = "ROOT"
options.input_files = [
    "LFN:/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi"
]
options.evt_max = 100

# (relative path w.r.t. the working directory, here, the root repository)
xml_file_name = "jobs/examples/xdigi2csv/pool_xml_catalog.xml"
catalog = FileCatalog(Catalogs=[f"xmlcatalog_file:{xml_file_name}"])
ApplicationMgr().ExtSvc.append(catalog)

You can generate an XML catalog by following the instructions on the Accessing Data on the Grid page.
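For reference, such a catalog simply maps each LFN to one or more PFNs. A schematic entry for the file above (with the file ID abbreviated) looks like:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!DOCTYPE POOLFILECATALOG SYSTEM "InMemory">
<POOLFILECATALOG>
  <File ID="...">
    <physical>
      <pfn filetype="ROOT" name="root://gridproxy@ccxrootdlhcb.in2p3.fr//pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi"/>
    </physical>
    <logical>
      <lfn name="/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi"/>
    </logical>
  </File>
</POOLFILECATALOG>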

In this example, we'll use the jobs/examples/xdigi2csv/input_lfn.py input file.

To configure the XDIGI2CSV program, go to run/xdigi2csv/moore_program_standalone.py and change the following to your needs:

# List of detectors for which to dump the hits or clusters
selected_detectors = ["velo", "ut", "scifi"]
# Whether to dump the MC hits (`mchits_{detector}.csv` files)
dump_mc_hits = False
# Whether to dump `event_info.csv`
dump_event_info = False
# If set to `False`, we only dump the variables that are necessary for tracking
extended = False
# If set to `True`, all the MC particles in `mc_particles.csv` are dumped, even those
# which don't have any hits
all_mc_particles = False
# Whether to use Retina clusters
retina_clusters = True
# Output directory where the CSV files are stored
outdir = "./jobs/examples/xdigi2csv/output"
# If set to `True`, a CSV file that is about to be written will first be erased if
# it already exists
erase = True

To run Moore, which is part of the LHCb software stack, you can execute the following command on LXPLUS or a similar system:

/afs/cern.ch/work/a/anthonyc/public/tracking/stack/Moore/build.x86_64_v2-centos7-gcc11+detdesc-opt/run gaudirun.py jobs/examples/xdigi2csv/input_lfn.py run/xdigi2csv/moore_program_standalone.py

Replace /afs/cern.ch/work/a/anthonyc/public/tracking/stack/Moore/build.x86_64_v2-centos7-gcc11+detdesc-opt/run with the path to your own build of Moore, if needed. Make sure to execute this command from the root of the XDIGI2CSV repository: the tools and definitions modules are used in the run/xdigi2csv/moore_program_standalone.py Python file to configure the algorithms.

Note

If the XDIGI2CSV algorithms were merged into the master branches of the LHCb stack projects, you could run these algorithms on LXPLUS using lb-run Moore/latest gaudirun.py, where latest can also be replaced by a specific version of Moore.

However, using only Python files to convert a large amount of (X)DIGI files to CSV files comes with several constraints:

  • Relative paths are not well-defined by default, and can cause issues when specifying input and output files.

  • Job splitting and submission can be challenging.

  • The fact that the algorithms are not merged into the master branches greatly complicates the use of these algorithms in HTC systems.

  • Separate scripts will also have to be written if one wants to convert the .csv files into another format.

  • Since the PFNs and the XML catalog may evolve over time, they have to be regenerated manually, which is time-consuming for hundreds of files.

For all of these reasons, it is much easier to use the XDIGI2CSV repository to run the XDIGI2CSV program.

Example: Using This Repository#

This repository provides a convenient way of running the XDIGI2CSV program. You can set up a YAML file that configures the program, such as the one in jobs/examples/xdigi2csv/xdigi2csv_lfn.yaml:

moore_input:
  paths: "LFN:/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi"
  dddb_tag: "dddb-20221004"
  conddb_tag: "sim-20220929-vc-md100"
  evt_max: 100
xdigi2csv:
  detectors:
  - velo
  - ut
  - scifi
  format: parquet
  compression: lz4
output:
  outdir: "xdigi2csv_lfn"
computing:
  program: xdigi2csv

By providing the LFN path in the YAML file, the XML catalog will be generated automatically. However, the dddb_tag and conddb_tag need to be provided manually by following the instructions on the Accessing Data on the Grid page.

In addition, there are two other options, format and compression, which can be used to convert the output CSV files into a different compressed format. The conversions are applied after running the algorithms in Moore.

Note

Behind the scenes, two Python files are used:

  • ./run/moore/moore_input.py to configure an input to a Moore algorithm

  • ./run/xdigi2csv/moore_program.py to configure the XDIGI2CSV program

These two Python files read the configuration that is passed to them using two different temporary YAML files. Their paths are stored in the temporary environment variables XDIGI2CSV_INPUT_CONFIG and XDIGI2CSV_PROGRAM_CONFIG.
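Conceptually, each of these files begins by loading its configuration along the following lines (a sketch, not the exact code):

import os
import yaml

# The run scripts write the configuration to a temporary YAML file and
# pass its path through an environment variable
with open(os.environ["XDIGI2CSV_INPUT_CONFIG"]) as f:
    input_config = yaml.safe_load(f)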

You can then run the XDIGI2CSV program from the command line:

./run/run.py -c jobs/examples/xdigi2csv/xdigi2csv_lfn.yaml

or equivalently:

./run/moore/run.py xdigi2csv -c jobs/examples/xdigi2csv/xdigi2csv_lfn.yaml

The resulting .parquet.lz4 files are saved in ./jobs/examples/xdigi2csv/xdigi2csv_lfn/, and a log.yaml file is generated to keep track of the original input file and the Moore return code.
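If the lz4 compression is applied as the internal Parquet codec (an assumption; check the conversion options in the repository if in doubt), these files can be read back directly, for example with pandas. The file name below is hypothetical:

import pandas as pd

# Hypothetical file name: the actual names depend on the selected detectors
df = pd.read_parquet("jobs/examples/xdigi2csv/xdigi2csv_lfn/hits_velo.parquet.lz4")
print(df.head())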

Setting Up Moore Input#

The moore_input section of the YAML file is used to provide input to Moore. There are three ways to define the input.

Using a Python Input File#

You can provide a Python input file that defines the input in the Moore.options object. For example:

moore_input:
  python_input: input.py

Using LFNs, PFNs, and/or Local Paths#

You can also provide LFN(s), PFN(s), and/or local paths to the input files. For example:

moore_input:
  paths:
  - LFN:/lhcb/some/LFN/path.xdigi
  - "{XDIGI2CSV_REPO}/a/path/starting/from/the/root/of/the/repo.xdigi"
  - ./a/relative/path/expressed/relative/to/the/yaml/file.xdigi

  • If LFNs are provided, the XML catalog will be generated automatically.

  • Elements enclosed in curly brackets, such as {XDIGI2CSV_REPO}, are replaced by the corresponding environment variables. The XDIGI2CSV_REPO environment variable is initialized when executing source setup/setup.sh and contains the path to the root of this repository.
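As an illustration, this curly-bracket substitution can be implemented in a few lines (a sketch; the repository's actual helper may differ):

import os

def expand_path(path: str) -> str:
    # Replace `{VAR}` placeholders with the corresponding environment variables
    return path.format(**os.environ)

# Requires XDIGI2CSV_REPO to be set, e.g. by running `source setup/setup.sh`
print(expand_path("{XDIGI2CSV_REPO}/a/path/starting/from/the/root/of/the/repo.xdigi"))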

Using a Bookkeeping Path#

You can provide a bookkeeping path to the input files, which is then translated into a list of LFN paths. For example:

moore_input:
  bookkeeping_path: /some/bookkeeping/path
  start_index: 0  # index of the first LFN path
  nb_files: 2  # number of files to process at max

If LFNs or a bookkeeping path are used, storage elements can be banned using banned_storage_elements. This removes the replicas located on the banned storage elements and drops the LFNs that were only stored on them.
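For example, assuming the option takes a list of storage-element names (the name below is purely illustrative):

moore_input:
  bookkeeping_path: /some/bookkeeping/path
  banned_storage_elements:
  - SOME-STORAGE-ELEMENT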

You can find a list of predefined minimum-bias data inputs to Moore in the jobs/moore_input/ directory. You can use them with the include option of any configuration file, for example:

include: "{XDIGI2CSV_REPO}/jobs/moore_input/minbias-sim10b-xdigi.yaml"

This allows you to easily provide pre-defined input to the Moore algorithm.
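For instance, a configuration could combine such an include with its own options (assuming the included options are merged with those defined in the file itself):

include: "{XDIGI2CSV_REPO}/jobs/moore_input/minbias-sim10b-xdigi.yaml"
xdigi2csv:
  detectors:
  - velo
output:
  outdir: output/
computing:
  program: xdigi2csv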

Setting Up Output Directory#

The output section is used to define where the output files of the program will be saved. There are two ways to specify the output directory.

The first method is to use the outdir option, which specifies the path where the output files should be saved. If a relative path is used, it is expressed relative to the YAML file containing the configuration. For example:

output:
  outdir: output/

The second method is to save the output files in the path specified in the global/datadir variable in the setup/config_default.yaml file:

global: # define global variables
  # Path where the files are stored in EOS
  datadir: "/eos/lhcb/user/a/anthonyc/tracking/data"

This can be useful if you want to save the files in a centralized location accessible by other users. To use this method, you need to specify the following options in the output section of the YAML file:

output:
  auto_output_mode: eos  # set this to "eos"
  dataname: my_data  # name of the data

This will save the data under:

{datadir}/{datatype}/{version}/{dataname}

where datadir is global/datadir, datatype depends on the output format of the program used (for example, csv for the XDIGI2CSV program), version is the current version of the repository and dataname is the name of the data.
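For example, with the default datadir above, the output of an XDIGI2CSV job (datatype csv) run from a hypothetical version v1.0 of the repository with dataname: my_data would be saved under /eos/lhcb/user/a/anthonyc/tracking/data/csv/v1.0/my_data.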

This method is used in the jobs in jobs/eos/ to save the output parquet files in the EOS space, which can be accessed and downloaded by everyone with a CERN computing account.

Configure the XDIGI2CSV Program#

The XDIGI2CSV program can be configured using the xdigi2csv section of the configuration file. Please have a read through the default configuration file setup/config_default.yaml for a description of all the possible options.

Run in HTCondor#

Important

The Ganga software provides the capability to run jobs on the grid, using the Dirac backend, which would be much faster than HTCondor.

However, because the modifications made to the LHCb software stack have not been merged into the main branches, a local build of the stack must be used, which makes Ganga more complicated to use.

As a result, we currently use HTCondor to execute the XDIGI2CSV program. Since the Kerberos ticket is passed to HTCondor jobs, they are able to access a local build of the software stack located in the user’s AFS space. It is worth noting that using Ganga with the HTCondor backend does not pass the Kerberos ticket.

The XDIGI2CSV repository provides a way to submit XDIGI2CSV jobs while keeping output organised and logs easily accessible.

If you need to split your production into sub-jobs and run them in HTCondor, you can use the computing section of the YAML file to specify the configuration options. To use HTCondor, set the backend option to condor in the computing section, as shown in the example below:

computing:
  # default is `local` but you need to switch it to `condor`
  backend: condor
  # Number of files to process per subjob
  nb_files_per_job: 2
  # Maximum run time (in seconds) of each sub-job.
  max_runtime: 1800

In this example:

  • nb_files_per_job specifies the number of files to process per sub-job

  • max_runtime specifies the maximum run time (in seconds) of each sub-job. If a job runs longer than the specified time, it will be terminated by HTCondor.

For instance, if you have 500 input files and want to process them using sub-jobs of 2 files each, 250 jobs will be submitted to HTCondor. To parallelize the transformation of LFNs into PFNs, the XML catalog is generated within the sub-jobs.

As explained in the Organisation of the CSV-like files section, the output directory will contain sub-folders start_index, start_index + nb_files_per_job, start_index + 2 * nb_files_per_job, and so on, where the output files of the XDIGI2CSV program are stored.
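Schematically, with start_index: 0 and nb_files_per_job: 2, the output directory looks like:

outdir/
  0/  # output of the sub-job processing input files 0 and 1
  2/  # output of the sub-job processing input files 2 and 3
  4/  # output of the sub-job processing input files 4 and 5
  ...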

The log of the HTCondor jobs is saved in the path specified in the computing/wildcard_data_logdir option of setup/config_default.yaml. The log is further described in the Log section.

All jobs in jobs/eos/ are run using HTCondor following this method.