Accessing Data on the Grid#
In this page, we provide an overview of how to access data on the grid
using the Dirac Portal and Dirac commands.
The last section, Using ganga, explains how to use ganga,
a Python-based job submission and management system, to handle data on the grid.
Warning
Please note that this page is intended as a brief overview of accessing data on the grid and was not reviewed by experts in bookkeeping, data management, or DIRAC. For accurate and detailed information, refer to the official documentation.
Note
Before executing any Dirac command, remember to initialise your LHCb proxy with the lhcb-proxy-init command.
Bookkeeping path, LFNs, replicas and PFNs#
As explained in this twiki page, data on the grid are organised as follows:
Each production is assigned a bookkeeping path, which keeps track of all its output files.
The files are identified by their Logical File Names (LFNs), which are always of the form /lhcb/...
Each file can be stored at different sites, called Storage Elements (SEs), such as one of the CERN sites (e.g., CERN_MC-DST-EOS), the IN2P3 sites (e.g., IN2P3_MC-DST), and others.
These instances of a file are called replicas, and it is important to note that replicas may change with time.
Each replica is identified by its Physical File Name (PFN).
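The relationship between LFNs, Storage Elements and PFNs can be pictured as a nested mapping. The snippet below is only an illustrative sketch in plain Python and does not call any DIRAC API; the dictionary layout and the helper name pfns_for are assumptions made for this example, while the LFN, SE name and PFN are the ones used in the Get replicas section.

```python
# Illustrative data model only: one LFN maps to one replica per Storage Element,
# and each replica is addressed by its PFN.
replica_catalog = {
    "/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi": {
        "IN2P3_MC-DST": (
            "root://gridproxy@ccxrootdlhcb.in2p3.fr//pnfs/in2p3.fr/data/lhcb/"
            "LHCb-Disk/lhcb/MC/Upgrade/XDIGI/00171960/0000/"
            "00171960_00000353_1.xdigi"
        ),
    },
}


def pfns_for(lfn, catalog):
    """Return the PFNs of all known replicas of an LFN (hypothetical helper)."""
    return list(catalog.get(lfn, {}).values())
```

Because replicas may change with time, such a mapping should be treated as a snapshot rather than a permanent record.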
The Dirac Portal is a web-based interface
that allows you, among other things, to browse the data available on the grid,
find their bookkeeping path and LFNs (see the next section for more details), and
obtain their dddb_tag and conddb_tag
(see the Obtain the Condition Database Tags of a MC Production section).
Additionally, Dirac commands can be used to quickly retrieve LFNs (see the Get the LFNs section) and PFNs (see the Get replicas section).
For further information, please refer to:
The LHCb starterkit
This LHCb twiki page.
Explore Available Data on the Grid#
The Dirac Portal (https://lhcb-portal-dirac.cern.ch/DIRAC/)
provides access to the data available on the grid, which is indexed under
Application > Bookkeeping Browser.
For instance, consider the bookkeeping path
/MC/Upgrade/Beam7000GeV-Upgrade-MagDown-Nu7.6-25ns-Pythia8/Sim10b/30000000/XDIGI
It corresponds to simulated data with the following characteristics:
8722 XDIGI files
Simulation version: Sim10b
Min-bias data (event type 30000000): no selection or particular decay applied
Spill-over: bunch crossings occur every 25 ns, and the overlap between events is taken into account in the simulation
Magnet polarity: Down
Average number of \(p\)-\(p\) collisions: \(\nu = 7.6\)
Beam energy: 7 TeV per beam (Beam7000GeV)
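The way these characteristics are encoded in the bookkeeping path can be made explicit with a few lines of Python. This is a sketch based only on the example path above: it assumes that the first three components are the configuration name, configuration version and conditions, that the last two are the event type and file type, and that whatever lies in between is the processing pass. The function name and the dictionary keys are chosen for this example and are not DIRAC terms.

```python
def split_bookkeeping_path(bkpath):
    """Split a bookkeeping path into labelled parts.

    The key names are chosen for this example; they are not official DIRAC terms.
    """
    parts = bkpath.strip("/").split("/")
    if len(parts) < 6:
        raise ValueError(f"expected at least 6 components, got {len(parts)}")
    return {
        "config_name": parts[0],
        "config_version": parts[1],
        # Assumes the condition tags are separated by hyphens, as in the example.
        "conditions": parts[2].split("-"),
        "processing_pass": "/".join(parts[3:-2]),
        "event_type": parts[-2],
        "file_type": parts[-1],
    }


bkpath = "/MC/Upgrade/Beam7000GeV-Upgrade-MagDown-Nu7.6-25ns-Pythia8/Sim10b/30000000/XDIGI"
info = split_bookkeeping_path(bkpath)
print(info["processing_pass"], info["event_type"], info["file_type"])
print(info["conditions"])
```

For other productions the conditions string may not split as cleanly on hyphens, so the result should be checked against the Bookkeeping browser.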
Tip
To quickly find the files, you can type
sim+std://MC/Upgrade/Beam7000GeV-Upgrade-MagDown-Nu7.6-25ns-Pythia8/Sim10b/30000000/XDIGI
in the address bar at the bottom of the right panel of the LHCb Bookkeeping browser.
Get the LFNs#
There are different ways to retrieve the LFNs of files on the grid.
One way is to use the Bookkeeping browser in the Dirac Portal,
as explained in the previous section.
To save the LFNs, click on Save in the bottom right panel of the Bookkeeping browser
and save them in a file with the .txt, .py or .csv extension.
Alternatively, you can use the dirac-bookkeeping-get-files command on LXPLUS.
For example, to retrieve the LFNs of all files in a given production,
run the following command:
lb-dirac dirac-bookkeeping-get-files --BKQuery /MC/Upgrade/Beam7000GeV-Upgrade-MagDown-Nu7.6-25ns-Pythia8/Sim10b/30000000/XDIGI
This command will output a list of LFN paths, such as:
LFN:/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi
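If the output is saved to a text file or captured in a script, the LFNs can be extracted with a small helper. The following is a minimal sketch, assuming each relevant line starts with an LFN that may carry the LFN: prefix shown above and may be followed by other columns; the function name parse_lfns is hypothetical and the exact output layout of the command should be checked before relying on it.

```python
def parse_lfns(output):
    """Extract LFNs from command output, accepting an optional 'LFN:' prefix."""
    lfns = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        token = fields[0]
        if token.startswith("LFN:"):
            token = token[len("LFN:"):]
        # LFNs are always of the form /lhcb/...
        if token.startswith("/lhcb/"):
            lfns.append(token)
    return lfns


sample = "LFN:/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi\n"
print(parse_lfns(sample))
```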
If you have an LFN and want to retrieve its bookkeeping path,
you can use the dirac-bookkeeping-file-path command. For example:
lb-dirac dirac-bookkeeping-file-path /lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi
Get replicas#
To retrieve the Physical File Names (PFNs) of the replicas of a Logical File Name (LFN),
you can use the dirac-dms-lfn-replicas command.
For example:
lb-dirac dirac-dms-lfn-replicas LFN:/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi
The output of this command will provide you with the PFNs of the file in the different Storage Elements (SE) where it’s stored. For instance:
Successful :
    /lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi :
        IN2P3_MC-DST : root://gridproxy@ccxrootdlhcb.in2p3.fr//pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi
IN2P3_MC-DST is the name of the Storage Element where the file is stored.
root://gridproxy@ccxrootdlhcb.in2p3.fr//pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi is the PFN of the replica within this SE.
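To use this output in a script, it can be turned into a mapping from LFN to Storage Element to PFN. The sketch below assumes the layout shown above (an LFN line ending with a colon, followed by lines of the form "SE : PFN"); the function name parse_replicas is hypothetical, and other sections that the command may print are simply ignored.

```python
def parse_replicas(output):
    """Parse replica listings into {lfn: {storage_element: pfn}}."""
    replicas = {}
    current_lfn = None
    for line in output.splitlines():
        text = line.strip()
        if not text:
            continue
        if text.startswith("/lhcb/") and text.endswith(":"):
            # A new LFN block starts here.
            current_lfn = text[:-1].strip()
            replicas[current_lfn] = {}
        elif current_lfn is not None and " : " in text:
            # Split only on the first separator; the PFN itself contains colons.
            se, pfn = text.split(" : ", 1)
            replicas[current_lfn][se.strip()] = pfn.strip()
    return replicas


sample_output = """Successful :
    /lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi :
        IN2P3_MC-DST : root://gridproxy@ccxrootdlhcb.in2p3.fr//pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi
"""
replicas = parse_replicas(sample_output)
```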
Get the XML catalog#
You can generate an XML catalog that contains information about the association
between LFNs and replicas. To generate the XML catalog, use the
dirac-bookkeeping-genXMLCatalog command:
lb-dirac dirac-bookkeeping-genXMLCatalog -l LFN:/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi
Important
Replicas can change over time, so the XML catalog needs to be regenerated periodically.
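One simple way to decide when to regenerate the catalog is to look at the age of the catalog file. The helper below is a sketch under that assumption; the function name catalog_is_stale, the example file name and the 24-hour default are all hypothetical choices, and a suitable threshold depends on how quickly replicas move for your data.

```python
import os
import time


def catalog_is_stale(path, max_age_hours=24.0):
    """Return True if the catalog file is missing or older than max_age_hours."""
    if not os.path.exists(path):
        return True
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds > max_age_hours * 3600


# "catalog.xml" is a placeholder name for wherever the catalog was written.
if catalog_is_stale("catalog.xml"):
    print("The XML catalog should be regenerated.")
```

Note that a recent file can still be out of date if replicas were moved in the meantime, so this check only reduces the risk.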
Download a file#
To download a file, you can use the dirac-dms-get-file command:
lb-dirac dirac-dms-get-file LFN:/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi
You can also use the PFN of the file instead of the LFN:
lb-dirac dirac-dms-get-file root://gridproxy@ccxrootdlhcb.in2p3.fr//pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi
Finally, you can use the xrdcp command with the PFN:
xrdcp root://gridproxy@ccxrootdlhcb.in2p3.fr//pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi .
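When many files have to be downloaded, the commands above can be assembled from Python. The following sketch only builds the argument lists; the function name download_command is hypothetical, and routing root:// PFNs to xrdcp while sending everything else to dirac-dms-get-file is a choice made for this example.

```python
import shlex


def download_command(source):
    """Build the command line used to fetch one file into the current directory."""
    if source.startswith("root://"):
        # PFN: copy directly with xrdcp.
        return ["xrdcp", source, "."]
    # LFN: let DIRAC pick a replica.
    return ["lb-dirac", "dirac-dms-get-file", source]


cmd = download_command("LFN:/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi")
print(shlex.join(cmd))
# To actually run it (requires a valid proxy):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the command as a list to subprocess.run avoids shell quoting issues with the long PFN strings.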
Using ganga#
You can perform the same operations using ganga, a Python-based job submission
and management system.
To start an interactive ganga Python shell, simply run ganga in your terminal on LXPLUS.
Then, you can use the following code:
# Query the bookkeeping database
bkpath = "/MC/Upgrade/Beam7000GeV-Upgrade-MagDown-Nu7.6-25ns-Pythia8/Sim10b/30000000/XDIGI"
bkq = BKQuery(bkpath)
data = bkq.getDataset()
# Get the LFNs of the dataset
data.getLFNs()
# Get the replicas of the first two files in the dataset
data[0:2].getReplicas()
# Generate the XML catalog for the first two files in the dataset
data[0:2].getCatalog()
You can also use ganga inside a Python script.
To do so, you’ll first need to export the following environment variables:
export GANGA_CONFIG_PATH=${GANGA_CONFIG_PATH:-GangaLHCb/LHCb.ini}
export GANGA_SITE_CONFIG_AREA=${GANGA_SITE_CONFIG_AREA:-/cvmfs/lhcb.cern.ch/lib/GangaConfig/config}
export PYTHONPATH=$PYTHONPATH:/cvmfs/ganga.cern.ch/Ganga/install/LATEST/lib/python3.8/site-packages/
These environment variables are already set up in setup/setup.sh.
Once you’ve exported the environment variables,
you can import ganga in your Python script using
import ganga.ganga
All the ganga objects, such as BKQuery, are accessible inside the ganga namespace,
e.g., ganga.BKQuery.
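Putting the pieces together, a script could wrap the bookkeeping query in a function. This is a minimal sketch that only reuses the calls shown on this page (BKQuery, getDataset and getLFNs); the function name lfns_from_bookkeeping is hypothetical, and the code assumes that the environment variables above are set and that a valid proxy exists.

```python
def lfns_from_bookkeeping(bkpath):
    """Return the LFNs found under a bookkeeping path, using ganga."""
    # Imported inside the function so that this module can be loaded
    # even where ganga is not available.
    import ganga.ganga

    dataset = ganga.BKQuery(bkpath).getDataset()
    return dataset.getLFNs()


# Example call (needs the ganga environment and a valid proxy):
#   lfns = lfns_from_bookkeeping(
#       "/MC/Upgrade/Beam7000GeV-Upgrade-MagDown-Nu7.6-25ns-Pythia8/Sim10b/30000000/XDIGI"
#   )
```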
Important
Please note that ganga does not handle many concurrent sessions well,
so it cannot be used inside many subjobs running in parallel, for example in HTCondor.
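One possible way around this limitation is to run the ganga query once, store the resulting LFNs in a file, and let the parallel subjobs read that file instead of starting their own ganga session. The sketch below illustrates the idea with a JSON file; the function names and the file name are hypothetical, and this is a suggestion rather than an established workflow.

```python
import json


def save_lfns(lfns, path):
    """Write a list of LFNs to a JSON file (run once, where ganga is available)."""
    with open(path, "w") as handle:
        json.dump(list(lfns), handle, indent=2)


def load_lfns(path):
    """Read the list of LFNs back (safe to call from many subjobs in parallel)."""
    with open(path) as handle:
        return json.load(handle)


# "lfns.json" is a placeholder file name.
save_lfns(["/lhcb/MC/Upgrade/XDIGI/00171960/0000/00171960_00000353_1.xdigi"], "lfns.json")
print(load_lfns("lfns.json"))
```

Since replicas can change over time, it is safer to cache LFNs rather than PFNs in such a file.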