The XDIGI2CSV Program#
XDIGI2CSV is a program that converts raw data (in the form of (X)DIGI files) from particle detector simulations into CSV files that can be used for further analysis. It provides a flexible and configurable way to convert different types of data from multiple detectors into separate CSV files. Using this repository, the program can be configured through a YAML file, and can be run either locally or on a cluster using the HTCondor system.
This page explains how to run this program.
The Prerequisites section recalls what is required to run the program.
The Run Using Python Files section shows how to run the XDIGI2CSV program like any other Moore algorithm, that is, using Python files.
The Example: Using This Repository section explains how to run the XDIGI2CSV program using this repository, configuring the algorithm through a YAML file.
Three other sections explain how to further configure this algorithm:
The Setting Up Moore Input section explains the possible ways the input to a Moore algorithm can be configured.
The Setting Up Output Directory section explains how to configure the output directory.
The Configure the XDIGI2CSV Program section explains how to further configure the XDIGI2CSV program.
The Run in HTCondor section explains how to submit many XDIGI2CSV jobs to HTCondor.
If you would like more information about the algorithms used by the XDIGI2CSV program, please refer to the More Information About the XDIGI2CSV Program page.
Prerequisites#
As explained in the Setup page, you need to have the LHCb stack set up on the anthonyc-persistence_csv branches, in a detdesc build (e.g., x86_64_v2-centos7-gcc11+detdesc-opt).
Do not forget to run source setup/setup.sh and to set up your LHCb proxy with lhcb-proxy-init if you use files stored on the grid.
Run Using Python Files#
The XDIGI2CSV program can be run like any other Moore algorithm, and this section explains how to do so. Behind the scenes, the Python helper scripts such as run/run.py simply execute the Python files described below.
To run this algorithm, you need to:
Set up an input.py Python file that specifies the input, which can either be a local file or a file stored on the grid.
Configure the algorithms in Moore, using a Python file such as run/xdigi2csv/moore_program_standalone.py.
An example of an input Python file is provided in jobs/examples/xdigi2csv/input_pfn.py:
from Moore import options
# These won't change:
options.simulation = True # we are only working with simulated data
options.input_type = "ROOT" # use "ROOT" for XDIGI files
options.data_type = "Upgrade" # we are only working with simulations for the Upgrade
# These depend on the MC Upgrade simulation you're using
options.dddb_tag = "dddb-20221004"
options.conddb_tag = "sim-20220929-vc-md100"
options.input_files = [
"root://gridproxy@ccxrootdlhcb.in2p3.fr//pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi"
]
options.evt_max = 100 # at most 100 events
options.input_files is a list of physical links to the input files, whether they are on the grid or on your local machine. You can obtain the dddb_tag and conddb_tag by following the instructions on the Accessing Data on the Grid page.
Alternatively, you can provide logical file names (LFNs) and an XML catalog, as shown in the jobs/examples/xdigi2csv/input_lfn.py Python file:
from Moore import options
from Gaudi.Configuration import (
Gaudi__MultiFileCatalog as FileCatalog,
ApplicationMgr,
)
options.simulation = True
options.data_type = "Upgrade"
options.conddb_tag = "sim-20220929-vc-md100"
options.dddb_tag = "dddb-20221004"
options.input_type = "ROOT"
options.input_files = [
"LFN:/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi"
]
options.evt_max = 100
# (relative path w.r.t. the working directory, here, the root repository)
xml_file_name = "jobs/examples/xdigi2csv/pool_xml_catalog.xml"
catalog = FileCatalog(Catalogs=[f"xmlcatalog_file:{xml_file_name}"])
ApplicationMgr().ExtSvc.append(catalog)
You can generate an XML catalog by following the instructions on the Accessing Data on the Grid page.
For the example, we’ll use the jobs/examples/xdigi2csv/input_lfn.py
input file.
To configure the XDIGI2CSV program, open run/xdigi2csv/moore_program_standalone.py and adapt the following to your needs:
# List of detectors for which to dump the hits or clusters
selected_detectors = ["velo", "ut", "scifi"]
# Whether to dump the MC hits (`mchits_{detector}.csv` files)
dump_mc_hits = False
# Whether to dump `event_info.csv`
dump_event_info = False
# If set to `False`, we only dump the variables that are necessary for tracking
extended = False
# If set to `True`, all the MC particles in `mc_particles.csv` are dumped, even those
# which don't have any hits
all_mc_particles = False
# Whether to use Retina clusters
retina_clusters = True
# Output directory where the CSV files are stored
outdir = "./jobs/examples/xdigi2csv/output"
# If set to `True`, an existing CSV file that is about to be written again is
# erased and replaced by the new one
erase = True
To run Moore, which is part of the LHCb software stack, you can execute the following command on LXPLUS or a similar system:
/afs/cern.ch/work/a/anthonyc/public/tracking/stack/Moore/build.x86_64_v2-centos7-gcc11+detdesc-opt/run gaudirun.py jobs/examples/xdigi2csv/input_lfn.py run/xdigi2csv/moore_program_standalone.py
Replace /afs/cern.ch/work/a/anthonyc/public/tracking/stack/Moore/build.x86_64_v2-centos7-gcc11+detdesc-opt/run with the path to your own build of Moore, if needed.
Make sure to execute this line at the root of the XDIGI2CSV repository. This is necessary because the modules tools and definitions are used in the run/xdigi2csv/moore_program_standalone.py Python file to configure the algorithms.
Note
If the XDIGI2CSV algorithms were merged into the master branches of the LHCb stack projects, you could run these algorithms on LXPLUS using lb-run Moore/latest gaudirun.py, where latest can be replaced by a specific version of Moore.
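In that case, the command above would hypothetically become:
lb-run Moore/latest gaudirun.py jobs/examples/xdigi2csv/input_lfn.py run/xdigi2csv/moore_program_standalone.py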
However, using only Python files to convert a large number of (X)DIGI files to CSV files comes with several constraints:
Relative paths are not well-defined by default, and can cause issues when specifying input and output files.
Job splitting and submission can be challenging.
The fact that the algorithms are not merged into the master branches greatly complicates the use of these algorithms in HTC systems.
Separate scripts will also have to be written if one wants to convert the .csv files into another format.
Given that the PFNs and XML catalog may evolve over time, they will have to be generated manually, which is time-consuming for hundreds of files.
For all of these reasons, it is much easier to use the XDIGI2CSV repository to run the XDIGI2CSV program.
Example: Using This Repository#
This repository provides a convenient way of running the XDIGI2CSV program.
You can set up a YAML file that configures the program,
such as the one in jobs/examples/xdigi2csv/xdigi2csv_lfn.yaml:
moore_input:
paths: "LFN:/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi"
dddb_tag: "dddb-20221004"
conddb_tag: "sim-20220929-vc-md100"
evt_max: 100
xdigi2csv:
detectors:
- velo
- ut
- scifi
format: parquet
compression: lz4
output:
outdir: "xdigi2csv_lfn"
computing:
program: xdigi2csv
If the LFN path is provided in the YAML file, the XML catalog will be generated automatically. However, the dddb_tag and conddb_tag still need to be provided manually, by following the instructions on the Accessing Data on the Grid page.
In addition, there are two other options, format and compression, which can be used to convert the output CSV files into a different compressed format.
The conversions are applied after running the algorithms in Moore.
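As an illustration, here is a minimal sketch of what such a conversion amounts to, using pandas with a Parquet engine; the file name hits_velo.csv is hypothetical, and the repository's actual conversion code may differ:
import pandas as pd

# Minimal sketch (hypothetical file name): re-write a CSV file produced by
# Moore as an LZ4-compressed Parquet file
df = pd.read_csv("hits_velo.csv")
df.to_parquet("hits_velo.parquet.lz4", compression="lz4")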
Note
Behind the scenes, two Python files are used:
./run/moore/moore_input.py to configure the input to a Moore algorithm
./run/xdigi2csv/moore_program.py to configure the XDIGI2CSV program
These two Python files read the configuration that is passed to them through two temporary YAML files, whose paths are stored in the temporary environment variables XDIGI2CSV_INPUT_CONFIG and XDIGI2CSV_PROGRAM_CONFIG.
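As an illustration, a script could read such a configuration as follows; this is a minimal sketch assuming the PyYAML library, not the repository's actual code:
import os
import yaml

# Minimal sketch: read the temporary YAML configuration whose path is stored
# in the XDIGI2CSV_INPUT_CONFIG environment variable
config_path = os.environ["XDIGI2CSV_INPUT_CONFIG"]
with open(config_path) as config_file:
    config = yaml.safe_load(config_file)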
You can then run the XDIGI2CSV program from the command line:
./run/run.py -c jobs/examples/xdigi2csv/xdigi2csv_lfn.yaml
or, equivalently:
./run/moore/run.py xdigi2csv -c jobs/examples/xdigi2csv/xdigi2csv_lfn.yaml
The resulting .parquet.lz4 files are saved in ./jobs/examples/xdigi2csv/xdigi2csv_lfn/, and a log.yaml file is generated to keep track of the original input file and the Moore return code.
Setting Up Moore Input#
The moore_input
section of the YAML file is used to provide input to Moore.
There are three ways to define the input.
Using a Python Input File#
You can provide a Python input file that defines the input in the Moore.options object. For example:
moore_input:
python_input: input.py
Using LFNs, PFNs, and/or Local Paths#
You can also provide LFN(s), PFN(s), and/or local paths to the input files. For example:
moore_input:
paths:
- LFN:/lhcb/some/LFN/path.xdigi
- "{XDIGI2CSV_REPO}/a/path/starting/from/the/root/of/the/repo.xdigi"
- ./a/relative/path/expressed/relative/to/the/yaml/file.xdigi
If LFNs are provided, the XML catalog will be generated automatically.
Elements enclosed in curly brackets, such as {XDIGI2CSV_REPO}, are replaced by the corresponding environment variables. The XDIGI2CSV_REPO environment variable is initialized when executing source setup/setup.sh and contains the path to the root of this repository.
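As an illustration, this substitution can be sketched in Python as follows (the repository's actual implementation may differ):
import os

# Minimal sketch: elements in curly brackets are replaced by the
# corresponding environment variables
path_template = "{XDIGI2CSV_REPO}/a/path/starting/from/the/root/of/the/repo.xdigi"
resolved_path = path_template.format(**os.environ)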
Using a Bookkeeping Path#
You can provide a bookkeeping path to the input files, which is then translated into a list of LFN paths. For example:
moore_input:
  bookkeeping_path: /some/bookkeeping/path
  start_index: 0 # index of the first LFN path
  nb_files: 2 # maximum number of files to process
If LFNs or a bookkeeping path are used, storage elements can be banned using banned_storage_elements. This removes the replicas hosted on the specified storage elements, and drops the LFNs that were only stored on banned storage elements.
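For example (the storage element name below is only a placeholder):
moore_input:
  bookkeeping_path: /some/bookkeeping/path
  nb_files: 2
  banned_storage_elements:
    - SOME-BANNED-SE # placeholder name of a storage element to ban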
You can find a list of possible min-bias data inputs to Moore in the jobs/moore_input/ directory. You can use them with the include option of any configuration file, for example:
for example:
include: "{XDIGI2CSV_REPO}/jobs/moore_input/minbias-sim10b-xdigi.yaml"
This allows you to easily provide pre-defined input to the Moore algorithm.
Setting Up Output Directory#
The output
section is used to define where the output files of the program
will be saved. There are two ways to specify the output directory.
The first method is to use the outdir
option, which specifies the path
where the output files should be saved.
If a relative path is used, it is expressed relative to the YAML file containing
the configuration. For example:
output:
outdir: output/
The second method is to save the output files in the path specified
in the global/datadir
variable in the setup/config_default.yaml file:
global: # define global variables
# Path where the files are stored in EOS
datadir: "/eos/lhcb/user/a/anthonyc/tracking/data"
This can be useful if you want to save the files in a centralized location accessible
by other users. To use this method, you need to specify the following options
in the output
section of the YAML file:
output:
auto_output_mode: eos # set this to "eos"
dataname: my_data # name of the data
This will save the data under:
{datadir}/{datatype}/{version}/{dataname}
where datadir is global/datadir, datatype depends on the output format of the program used (for example, csv for the XDIGI2CSV program), version is the current version of the repository, and dataname is the name of the data.
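For example, with the default datadir shown above, dataname: my_data, the csv output of the XDIGI2CSV program, and a hypothetical repository version v1.0, the output would be saved under:
/eos/lhcb/user/a/anthonyc/tracking/data/csv/v1.0/my_data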
This method is used in the jobs in
jobs/eos/
to save the output parquet files
in the EOS space, which can be accessed and downloaded by everyone with a
CERN computing account.
Configure the XDIGI2CSV Program#
The XDIGI2CSV program can be configured using the xdigi2csv
section of the
configuration file. Please read through the default configuration file setup/config_default.yaml for a description of all the possible options.
Run in HTCondor#
Important
The Ganga software can run jobs on the grid using the Dirac backend, which would be much faster than HTCondor.
However, because the modifications made to the LHCb software stack have not been merged into the main branches, a local build of the stack must be used, which makes Ganga more complicated to use.
As a result, we currently use HTCondor to execute the XDIGI2CSV program. Since the Kerberos ticket is passed to HTCondor jobs, they are able to access a local build of the software stack located in the user's AFS space. It is worth noting that Ganga with the HTCondor backend does not pass the Kerberos ticket.
The XDIGI2CSV repository provides a way to submit XDIGI2CSV jobs while keeping output organised and logs easily accessible.
If you need to split your production into sub-jobs and run them in HTCondor,
you can use the computing
section of the YAML file to specify the configuration options.
To use HTCondor, set the backend
option to condor in the computing section,
as shown in the example below:
computing:
# default is `local` but you need to switch it to `condor`
backend: condor
# Number of files to process per subjob
nb_files_per_job: 2
# Maximum run time (in seconds) of each sub-job.
max_runtime: 1800
In this example:
nb_files_per_job specifies the number of files to process per sub-job.
max_runtime specifies the maximum run time (in seconds) of each sub-job. If a job runs longer than the specified time, it will be terminated by HTCondor.
For instance, if you have 500 input files and want to process them using sub-jobs of 2 files each, 250 jobs will be submitted to HTCondor. To parallelize the transformation of LFNs into PFNs, the XML catalog is generated within the sub-jobs.
As explained in the Organisation of the CSV-like files section, the output directory will contain sub-folders start_index, start_index + nb_files_per_job, start_index + 2 * nb_files_per_job, and so on, where the output files of the XDIGI2CSV program are stored. For example, with start_index 0 and nb_files_per_job 2, the sub-folders are named 0, 2, 4, and so on.
The logs of the HTCondor jobs are saved in the path specified in the computing/wildcard_data_logdir option of config_default.yaml. The log is further described in the Log section.
All jobs in jobs/eos/
are run using HTCondor following this method.