This part of the tutorial deals with D3PD analysis using ROOT only. It should work on any Linux or Mac OS system that has ROOT installed, but the instructions below are for running on lxplus. You are encouraged to try it on whatever system you like, but if you don't use lxplus, you'll need to set up ROOT yourself and copy the tutorial files from CERN (e.g. via AFS or scp).
This tutorial will introduce you to all the tools the Physics Analysis Tools (PAT) group provides for D3PD analysis in ROOT. While doing this tutorial you should try out all of these tools, though for your actual analysis you are free to replace them with tools you have written yourself or received from other sources. If you choose not to use our tools or have a suggestion on what we could do better, please let us know and we will do our best to accommodate that.
There are two exceptions to this mix-and-match rule: The performance groups distribute various tools for performing good object selection, corrections, etc. In general the validity of your physics result will depend on following these prescriptions precisely (including any last minute changes), so unless you enjoy doing unnecessary checks as part of your blessing procedure you should just use the official tools. The other exception is the package manager RootCore, which the other packages (including the performance tools) depend on. While it is not impossible to skip it, doing so would be a lot of unnecessary effort.
This tutorial is divided into sections. Once you have completed the first section, which gives you the essential setup (A Basic EventLoop Algorithm), the remaining sections may be completed in any order. However, unless otherwise noted, within each section you should work from start to finish. Each section begins with a short introduction, for orientation, of what you should learn in it.
Note: This tutorial is meant as an example of the basic usage of tools on lxplus using standalone root. Occasionally you will see sections like this with remarks on how you would change things to run in other places or how to change things for serious work. If you see a section like this you can typically ignore it safely.
The purpose of this section is to provide you with the basic setup you will need to complete the remaining sections. Along the way you will get a basic introduction to the package manager RootCore, the n-tuple reader D3PDReader, the sample manager SampleHandler and the job manager EventLoop. All of these are discussed in more depth in later sections of the tutorial. At the end of the section you will have a job that runs over the data and produces a single histogram, which is pretty much the minimal amount of work you can do.
As the first step, log in to the machine you want to use. Lxplus is a safe default, but you should be able to do this on any Linux or MacOS machine with root installed:
ssh lxplus.cern.ch |
Now let's set up ROOT (you'll have to do this every time you log in):
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh
localSetupROOT |
Note that here we are using ROOT from CVMFS at CERN, which will work on lxplus as well as on many tier 3 sites. More information can be found here. If your shell is csh based (e.g. tcsh) rather than bash based (e.g. zsh), replace "sh" with "csh" above. As an alternative you can also set up an ATLAS release and use its ROOT release. However you should NOT try to set up both ROOT and an ATLAS release in the same session, as it can cause problems. Also, if you log in again, you have to set up ROOT exactly the same way, or you will likely run into trouble.
It is highly recommended not to put the code for setting up ROOT into your .login file, since you will often work on different projects, each of which may require a different way to set up your environment. However, you can put all these setup commands into a file setup.sh (or whatever you'd like to call it) and source that whenever you log in. You would naturally also add the RootCore setup from the next section to this file as well.
Warning: When using ROOT from CVMFS it will occasionally be updated to the latest version. RootCore will detect such a change and give you a warning, in which case you should follow the instructions in the next section for changing ROOT versions. Please note that while updating to the latest ROOT version has its advantages, it can at the same time break your existing code. If you want to stick to a specific version of ROOT (and avoid this trouble), you can list all available versions by calling
showVersions --show=root
and then set up a particular version using
localSetupROOT --rootVersion=<value>
Other Sites: If you don't work on lxplus, make sure that you have a local ROOT version installed. You can then set it up by using its thisroot.sh script:
source /opt/root/root_5.27.06/bin/thisroot.sh
Or you can manually set ROOTSYS, LD_LIBRARY_PATH, and PATH to point to your local ROOT installation, e.g.:
export ROOTSYS=/opt/root/root_5.27.06
export LD_LIBRARY_PATH=${ROOTSYS}/lib:$LD_LIBRARY_PATH
export PATH=$ROOTSYS/bin:$PATH
First make a directory in which you want to put your RootCore installation:
mkdir ROOTTutorial
cd ROOTTutorial |
Now let's set up RootCore. Recall that RootCore is PAT's system for building packages outside of the Athena environment. First we need to check out RootCore:
svn co svn+ssh://svn.cern.ch/reps/atlasoff/PhysicsAnalysis/D3PDTools/RootCore/tags/RootCore-00-01-52 RootCore |
Note that if you don't have ssh keys or a kerberos ticket set up, you'll have to type in your CERN password every time you check something out via ssh. This can be quite tedious, so it is recommended that you set up one or the other whenever possible.
Now you have to let RootCore configure itself. To do this, execute the following:
cd RootCore/
./configure
cd .. |
This will scan your local system, detect the location of your ROOT installation and configure RootCore accordingly. You will have to repeat that step in a fresh shell whenever you change your ROOT or RootCore version. The next step is to set up your local environment for RootCore (you need to repeat this every time you log in):
source RootCore/scripts/setup.sh |
Now you are in the RootCore build environment.
Warning: If you ever change your ROOT or RootCore version you have to repeat these steps in a fresh shell. If you change the ROOT version (on purpose or by accident) you should also remove the old object files to force recompilation with the new ROOT version; in other words, you need to call
$ROOTCOREDIR/scripts/clean.sh
For future reference, the RootCore documentation walks you through how to add packages, build, etc., and is a good resource.
The next step is to check out all the packages used in this tutorial. We could also check them out as we use them, but for simplicity we'll check them out all at once. We could do that ourselves directly, but instead we are going to ask RootCore to do it for us. For that, create a file packages.txt with the list of packages:
atlasoff/DataQuality/GoodRunsLists/tags
atlasoff/PhysicsAnalysis/AnalysisCommon/PATCore/tags
atlasoff/PhysicsAnalysis/D3PDTools/EventLoop/tags
atlasoff/PhysicsAnalysis/D3PDTools/EventLoopGrid/tags
atlasoff/PhysicsAnalysis/D3PDTools/EventLoopAlgs/tags
atlasoff/PhysicsAnalysis/D3PDTools/MultiDraw/tags
atlasoff/PhysicsAnalysis/D3PDTools/RootCore/tags
atlasoff/PhysicsAnalysis/D3PDTools/RootCoreUtils/tags
atlasoff/PhysicsAnalysis/D3PDTools/SampleHandler/tags
atlasoff/PhysicsAnalysis/JetMissingEtID/JetSelectorTools/tags
atlasoff/Reconstruction/egamma/egammaEvent/tags
Please note that in this case we ask for a tagged version, but do not specify a tag. This instructs RootCore to use the latest tagged version, which is usually the latest stable version (as opposed to the trunk version, which may be in a non-working state). In practice this is often a good choice, though there are good reasons to pick a specific tagged version instead, as shown below.
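For example, if you wanted to pin EventLoop to one particular tag, you would list the full tag path in packages.txt instead; the tag number below is just a placeholder, not a real tag:
atlasoff/PhysicsAnalysis/D3PDTools/EventLoop/tags/EventLoop-XX-YY-ZZ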
The actual command for checking out the packages is:
$ROOTCOREDIR/scripts/checkout.sh packages.txt |
Next you have to ask RootCore to locate all the packages and check their dependencies. You'll have to repeat this step whenever you check out or create a new package, or change the dependencies of an existing package. It is very important that you run this inside the ROOTTutorial directory and not in one of the subdirectories, since it picks up all the packages inside the current directory:
$ROOTCOREDIR/scripts/find_packages.sh |
Sometimes when debugging problems you may want to find out which versions of the packages you are actually using. Particularly if you send out email asking for help you should provide that information. RootCore provides a script that lists all the currently configured packages for you:
$ROOTCOREDIR/scripts/version.sh |
Now you have to compile all of the packages. You essentially have to do that whenever you change your code or add packages. The first time you do this it may take a while, but subsequent compilations should be faster. You can also compile packages individually, but that doesn't do quite as much as compiling all packages, so when you add packages or programs you should just recompile everything:
$ROOTCOREDIR/scripts/compile.sh |
For this tutorial we will be using the D3PDReader to read in the data, which is the way recommended by the PAT group, but there are numerous other ways. Perhaps the main advantage of the D3PDReader is that it is highly optimized, meaning that (all else being equal) your code will likely run faster with less effort.
The D3PDReader is actually not a centrally distributed package; instead the package gets generated from the data files you are using. This is necessary so that it can contain all the variables contained in your D3PD. Please note that this only works with D3PDs produced with the D3PDMaker, not with custom n-tuples you may have produced yourself. The documentation for the package that generates these classes can be found under D3PDMakerReader.
The D3PDReader package is actually generated using Athena, so you will have to open up a new shell and set up Athena. Please don't use the same session, as setting up Athena may confuse RootCore. On lxplus you can set up Athena in your test area (you may already have the first three lines in your login script from the first day of the tutorial; in that case you do not need to repeat them):
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
alias setupATLAS='source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh'
source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh
asetup 17.2.7.4.1,here,64 |
Now, generate the D3PDReader files in this session (the .root files are the files we will be using later on). First, go to the main directory and create a new code directory where the code will go:
cd ROOTTutorial
mkdir code |
d3pdReadersFromFile.exe -f /afs/cern.ch/atlas/project/PAT/data/D3PDForRootAnalysis/user.krumnack.pat_tutorial_data.v1/user.lheelan.003342.MyFirstD3PD._00001.root /afs/cern.ch/atlas/project/PAT/data/D3PDForRootAnalysis/user.krumnack.pat_tutorial_ttbar_corrected.v1/user.lheelan.003375.MyFirstD3PD._00001.root /afs/cern.ch/atlas/project/PAT/data/D3PDForRootAnalysis/user.krumnack.pat_tutorial_Zmumu.v1/user.lheelan.003373.MyFirstD3PD._00001.root -n Event -o ./code
Also generate a RootCore package, transferring the generated code from the code directory into it:
d3pdReaderRootCoreMaker.py -p D3PDReader ./code/* |
Check that the new D3PDReader package actually contains the source code (look in the Root sub-directory). If that is the case, delete the temporary code directory again:
rm -rf code |
You can now close this session. You should have a new package called D3PDReader now.
While you need Athena to generate the D3PDReader package, you do not need Athena to use it. In other words, you can generate the D3PDReader package on lxplus and then copy it to your laptop or any other machine you would like to work on. As long as you have ROOT and RootCore working on your machine you can use the D3PDReader.
Warning: The following should be done in your original RootCore session, or in a new session that you set up in the same way. Do not try to do this in the session in which you set up Athena, or you will get all kinds of problems.
In your original RootCore session you can now add and compile this package (this will take a while):
$ROOTCOREDIR/scripts/find_packages.sh
$ROOTCOREDIR/scripts/compile.sh |
Now it is time to create your own package. RootCore provides a script that creates a skeleton package, and we highly recommend you use it. We'll call the package MyAnalysis, but you can name it anything you want. In the ROOTTutorial directory execute:
$ROOTCOREDIR/scripts/make_skeleton.sh MyAnalysis |
Side Note: For actual work (not this tutorial) we highly recommend storing all your packages, even private ones, in SVN. This makes it a lot easier to share your code across machines or with other people, and it gives you the advantages of version management in case something ever goes wrong. To do that (assuming the base package already exists in SVN), just check it out before you call make_skeleton.sh, i.e.:
svn co [svn-url] MyAnalysis
$ROOTCOREDIR/scripts/make_skeleton.sh MyAnalysis
If you list the contents of the package directory, you'll see the skeleton that was created for you, including a LinkDef.h file that ROOT uses to build dictionaries.
Now let's just ask RootCore to pick up this new package:
$ROOTCOREDIR/scripts/find_packages.sh
$ROOTCOREDIR/scripts/compile.sh |
Technically you don't have to rebuild after every step, since you won't be running code until the very end of this tutorial, but it is good practice to check at regular intervals whether your code still compiles.
EventLoop is a package maintained by the PAT group for looping over events. You first start out by making an algorithm class that will hold your analysis code. We'll call it MyFirstD3PDAnalysis, but you can name it anything you want. You could create this class by hand, but the EventLoop package provides a script to create the skeleton of the class for you inside the package you specify (here MyAnalysis):
$ROOTCOREBIN/user_scripts/EventLoop/make_skeleton MyAnalysis MyFirstD3PDAnalysis |
Now we should be able to build our package with our empty algorithm:
$ROOTCOREDIR/scripts/find_packages.sh
$ROOTCOREDIR/scripts/compile.sh |
We'll have to rerun find_packages.sh here, because make_skeleton updated the dependencies for our package. Normally you can recompile using just compile.sh.
To use the D3PDReader classes inside our code, we first need to add the dependency to our package. For that, edit the file MyAnalysis/cmt/Makefile.RootCore and change the line with PACKAGE_DEP to
PACKAGE_DEP = EventLoop D3PDReader
Now we are going to add the hooks needed for the D3PDReader classes to our algorithm. First we'll need to add the D3PDReader header file to the includes at the top of MyAnalysis/MyAnalysis/MyFirstD3PDAnalysis.h:
#include <D3PDReader/Event.h> |
Now let's insert a new data member into the class definition in the same file (please note that the //! is important):
D3PDReader::Event *event; //! |
Forgetting the //! is the most common problem when using EventLoop. It tends to manifest itself as a crash when you try to run the job. In effect, the //! tells EventLoop that the corresponding variable will not be initialized until initialize gets called. So you should add it to all variables in your algorithm, except for configuration parameters.
The exact meaning of //!: What happens internally is that EventLoop will store a copy of your algorithm object on disk, and then for every sample it reads back a copy, initializes it and processes the sample. Saving objects and reading them back happens through a so-called streamer. These get automatically generated for you by ROOT's CINT. Adding //! to a variable tells CINT to ignore that variable. If you were to forget the //! in this case, CINT would try to store a copy of the event structure on disk, but since it hasn't been initialized yet it will (most likely) crash.
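To make the rule concrete, here is a minimal sketch of how the members of a typical algorithm header split up; the histName and hist names are made up for illustration, while the event member is the one we just added:
public:
  std::string histName;     // configuration parameter, set before submission: no //!, so it gets streamed

  TH1 *hist;                //! created in histInitialize(), so tell CINT to ignore it
  D3PDReader::Event *event; //! set in initialize(), so tell CINT to ignore it |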
Now we need to tell EventLoop that we actually want to use the D3PDReader in our job. The easiest way to do that is to add the request in the setupJob function in MyAnalysis/Root/MyFirstD3PDAnalysis.cxx:
EL::StatusCode MyFirstD3PDAnalysis :: setupJob (EL::Job& job)
{
  // Here you put code that sets up the job on the submission object
  // so that it is ready to work with your algorithm, e.g. you can
  // request the D3PDReader service or add output files.  Any code you
  // put here could instead also go into the submission script.  The
  // sole advantage of putting it here is that it gets automatically
  // activated/deactivated when you add/remove the algorithm from your
  // job, which may or may not be of value to you.
  job.useD3PDReader ();

  return EL::StatusCode::SUCCESS;
} |
Now we need to change the initialize function in MyAnalysis/Root/MyFirstD3PDAnalysis.cxx so that we connect this member to our D3PDReader. All this does is create a local reference to the D3PDReader event object, which adds to our convenience and is also slightly faster than accessing it otherwise:
EL::StatusCode MyFirstD3PDAnalysis :: initialize ()
{
  // Here you do everything that you need to do after the first input
  // file has been connected and before the first event is processed,
  // e.g. create additional histograms based on which variables are
  // available in the input files.  You can also create all of your
  // histograms and trees in here, but be aware that this method
  // doesn't get called if no events are processed.  So any objects
  // you create here won't be available in the output if you have no
  // input events.
  event = wk()->d3pdreader();

  return EL::StatusCode::SUCCESS;
} |
Now we should be able to build our package using D3PDReader classes in our algorithm:
$ROOTCOREDIR/scripts/find_packages.sh
$ROOTCOREDIR/scripts/compile.sh |
Again we had to rerun find_packages.sh here, because we updated the dependencies for our package.
Now we can add code to our algorithm to fill a simple histogram. In our case, all we want to do is make a histogram of the electron pt spectrum, which is fairly straightforward.
Again in the header file MyFirstD3PDAnalysis.h, add an include for the TH1 class file (before the class statement):
#include <TH1.h> |
Within our MyFirstD3PDAnalysis algorithm class (in MyFirstD3PDAnalysis.h), add a histogram pointer as a member. You can make this either public or private, it doesn't matter. Please note that the //! is important:
TH1 *AHist; //! |
Now inside the source file MyFirstD3PDAnalysis.cxx, we need to book the histogram and add it to the output. That happens in the histInitialize function:
EL::StatusCode MyFirstD3PDAnalysis :: histInitialize ()
{
  // Here you do everything that needs to be done at the very
  // beginning on each worker node, e.g. create histograms and output
  // trees.  This method gets called before any input files are
  // connected.
  AHist = new TH1F("AHist", "AHist", 100, 0, 1.e5);
  wk()->addOutput (AHist);

  return EL::StatusCode::SUCCESS;
} |
This method is called before processing any events. Note that the wk()->addOutput call is the mechanism EventLoop uses for delivering the results of an algorithm to the outside world. When running in PROOF, ROOT will merge all of the objects in this list. In principle you can place any object that inherits from TObject in there, but if you put anything other than histograms there you need to take special precautions for the merging step.
Now that we have the histogram, we also need to fill it. For that we can use the execute method, which is called for every event:
EL::StatusCode MyFirstD3PDAnalysis :: execute ()
{
  // Here you do everything that needs to be done on every single
  // event, e.g. read input variables, apply cuts, and fill
  // histograms and trees.  This is where most of your actual analysis
  // code will go.

  // Histogramming the pT of electrons
  for( Int_t i = 0; i < event->el.n(); ++i ){
    AHist->Fill(event->el[i].pt());
  }

  return EL::StatusCode::SUCCESS;
} |
Now we should be able to build our package with the full algorithm:
$ROOTCOREDIR/scripts/compile.sh |
This time around you don't need to call find_packages.sh since the dependencies didn't change. This is the normal situation when you are actually developing code. In this tutorial you will call find_packages.sh a lot, but that is because we show you a lot of different packages.
To actually run your algorithm you need some steering code. This can be a ROOT macro in either C++ or Python, or some compiled C++ code. For this tutorial we will create a C++ macro.
First let's create a Run directory for it. Within ROOTTutorial execute:
mkdir Run
cd Run |
And in that directory we will create a new file called ATestRun.cxx. Note that we will use another PAT tool called SampleHandler, which allows for easy sample management. Fill this new file, ATestRun.cxx, with the following:
void ATestRun (const std::string& submitDir)
{
  // Load the libraries for all packages
  gROOT->ProcessLine(".x $ROOTCOREDIR/scripts/load_packages.C");

  // create a new sample handler to describe the data files we use
  SH::SampleHandler sh;

  // scan for datasets in the given directory
  // this works if you are on lxplus, otherwise you'd want to copy over files
  // to your local machine and use a local path.  if you do so, make sure
  // that you copy all subdirectories and point this to the directory
  // containing all the files, not the subdirectories.
  SH::scanDir (sh, "/afs/cern.ch/atlas/project/PAT/data/D3PDForRootAnalysis/");

  // set the name of the tree in our files
  sh.setMetaString ("nc_tree", "MyFirstD3PD");

  // further sample handler configuration may go here

  // print out the samples we found
  sh.print ();

  // this is the basic description of our job
  EL::Job job;
  job.sampleHandler (sh);

  // add our algorithm to the job
  MyFirstD3PDAnalysis *alg = new MyFirstD3PDAnalysis;

  // later on we'll add some configuration options for our algorithm that go here

  job.algsAdd (alg);

  // make the driver we want to use:
  // this one works by running the algorithm directly
  EL::DirectDriver driver;

  // process the job using the driver
  driver.submit (job, submitDir);

  // Fetch and plot our histogram...  Note that "AHist" is the name of
  // the histogram and that "data..." is the name of the sample.
  // this is mostly for show, in a real analysis you would likely do this in a separate macro.
  SH::SampleHandler sh_hist;
  sh_hist.load (submitDir + "/hist");
  TH1 *hist = sh_hist.get ("user.krumnack.pat_tutorial_data.v1")->readHist ("AHist");
  hist->Draw ();
} |
OK, now the big moment has come. Within your Run directory execute your ATestRun.cxx macro with ROOT:
root -l 'ATestRun.cxx ("submitDir")' |
You should have a histogram pop up. You can quit ROOT by typing ".q" at the ROOT prompt once you're happy with the resulting plot.
Note that submitDir is the directory where the output of your job is stored. If you want to run again, you either have to remove that directory or pass a different name into ATestRun.cxx.
Please Note: For actual work you should give more thought to how you want to organize your submission directories. Since they contain the output of your jobs you probably want to keep old versions around, so that you can compare. Also, you may not want to keep them in your source directory tree, but instead put them into a separate data directory tree (see the sketch below for one way to do that). However, try to avoid putting them inside a RootCore managed package. Doing so may result in them getting copied around when submitting to the grid or a batch system.
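One (purely hypothetical) way to do this is to build the submission directory name inside the steering macro from the current date, pointing at a data disk outside your source tree; the paths below are made up for illustration:
// sketch: put each submission into its own dated directory on a data disk
TString submitDir = TString::Format ("/data/%s/submissions/myAnalysis-%d",
                                     gSystem->Getenv ("USER"), TDatime ().GetDate ());
driver.submit (job, submitDir.Data ()); |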
This section introduces you to some of the tools that allow you to select good physics objects and events. Each subsection can be done independently of the others. To be honest this section is overall too short and only covers a subset of tools. We plan to address this shortcoming for the next tutorial and provide a uniform interface for all tools, but for now it is what it is.
Object selectors are tools that allow you to select physics objects that fulfill the requirements of the joint performance groups. In this example we'll use the JetCleaningTool to apply tighter jet cleaning. Note that the procedure for using the other selectors (e.g. Muon or Electron) is essentially the same.
First we need to tell RootCore that we will be depending on the package where the JetCleaningTool lives. Open the Makefile.RootCore in the cmt directory of our MyAnalysis package, and add JetSelectorTools to PACKAGE_DEP. So the PACKAGE_DEP line should look something like this (if you have other packages listed here, don't remove them):
PACKAGE_DEP = EventLoop D3PDReader JetSelectorTools |
Next add the include for the JetCleaningTool to the header file of our analysis algorithm (MyFirstD3PDAnalysis.h):
#include "JetSelectorTools/TJetCleaningTool.h" |
Now add a pointer to a TJetCleaningTool to the analysis class. This is a configurable object we will initialize in our steering code, i.e. a configuration option. As such you should not be using //! and you have to make it public:
public:
  Root::TJetCleaningTool *my_JetCleaningTool; |
For this tutorial we'll demonstrate the result of jet cleaning in a single histogram. Let's add that histogram to the analysis class as well (note that the //! is important):
TH1 *jetHist; //! |
Now in the source file (MyFirstD3PDAnalysis.cxx) add the includes we need:
#include <TH2D.h> |
We are going to instantiate and configure the jet cleaning tool in our steering macro, so the class itself doesn't have to take care of that. For now let's just initialize our jet cleaning tool to a null pointer in the MyFirstD3PDAnalysis::MyFirstD3PDAnalysis constructor (your code will likely crash if you forget to do this):
my_JetCleaningTool = 0; |
Inside initialize, check that the tool is there and initialize it:
if (!my_JetCleaningTool){
  throw std::string ("No JetCleaningTool configured");
}
my_JetCleaningTool->initialize(); |
Since we want to create another histogram, we have to create it in the initialize function:
jetHist = new TH2D ("jet_n", "cut vs. uncut jets", 20, 0, 20, 20, 0, 20); wk()->addOutput (jetHist); |
Now in the execute function in the source file, we add a jet counter and fill our histogram to compare the number of tight jets with the total number of jets:
unsigned njets = 0;
for( Int_t i = 0; i < event->jet.n(); ++i ) {
  if (my_JetCleaningTool->accept(event->jet[i].eta(),
                                 event->jet[i].NegativeE(),
                                 event->jet[i].hecf(),
                                 event->jet[i].HECQuality(),
                                 event->jet[i].emfrac(),
                                 event->jet[i].sumPtTrk(),
                                 event->jet[i].LArQuality(),
                                 event->jet[i].Timing(),
                                 event->jet[i].fracSamplingMax(),
                                 event->jet[i].AverageLArQF() )){
    ++ njets;
  } // end if
} // end for loop over jets
jetHist->Fill (event->jet.n(), njets); |
Now we need to compile our package again. Since we changed our dependencies, you have to run find_packages.sh again. Remember that this has to be done in the ROOTTutorial directory:
$ROOTCOREDIR/scripts/find_packages.sh
$ROOTCOREDIR/scripts/compile.sh |
Now we need to update our steering macro (ATestRun.cxx) to initialize the jet cleaning tool in our algorithm. There is a python macro provided with the jet selector tools for configuring the jet cleaning tool correctly, and all the following code does is call that macro. Just add this code in your steering macro where it says to put the algorithm configuration:
// Configure the Jet selector (tight selections)
TString path = "$ROOTCOREBIN/python/JetSelectorTools/ConfiguredTJetCleaningTools.py";
gSystem->ExpandPathName (path);
TPython::LoadMacro (path.Data());
Root::TJetCleaningTool* myTightJetCleanTool
  = static_cast<Root::TJetCleaningTool*>((void*)TPython::Eval("ConfiguredTJetCleaningTool_Tight()"));

// Tell our analysis to use this selector.
alg->my_JetCleaningTool=myTightJetCleanTool; |
In order to see the effect of the jet cleaning tool, let's change the last two lines which display the final histogram:
TH1 *hist = sh_hist.get ("user.krumnack.pat_tutorial_data.v1")->readHist ("jet_n");
hist->Draw ("BOX"); |
Now let's run again: inside the Run directory call the macro as before. If you already have a submitDir, you either have to rename or remove it first:
root -l 'ATestRun.cxx ("submitDir")' |
You should now see a 2D histogram which shows the total number of jets on the x-axis and the number of tight jets on the y-axis. The latter should never be larger than the former.
The GoodRunsLists (GRL) package allows you to select events that are limited to periods that are considered good for physics. In general you will have a good runs list containing a list of all good data taking periods. The actual file is a .xml file; how you create or obtain it will be covered elsewhere. For the purposes of this tutorial we will provide you with one.
First we need to tell RootCore that we will be depending on the GoodRunsLists package. Open the Makefile.RootCore in the cmt directory of our MyAnalysis package, and add GoodRunsLists to PACKAGE_DEP. So the PACKAGE_DEP line should look something like this (if you have other packages listed here, don't remove them):
PACKAGE_DEP = EventLoop D3PDReader GoodRunsLists |
Next add the include for GoodRunsLists to the header file of our analysis algorithm (MyFirstD3PDAnalysis.h):
#include <GoodRunsLists/TGoodRunsListReader.h> |
And then add a good runs list to our algorithm class. This is a configuration option, so we should not use the //! and have to make it public:
public:
  Root::TGoodRunsList my_grl; |
We will add another histogram so that we can compare the result with and without the GRL, so let's add that to the class (note that the //! is important):
TH1 *GRLHist; //! |
Now in the source file let's add that histogram in the histInitialize function:
GRLHist = new TH1F("GRLHist", "GRLHist", 100, 0, 1.e5); wk()->addOutput (GRLHist); |
Within the execute function we check if the event passes the GRL, and if it does we fill GRLHist just like we did for AHist:
if (event->eventinfo.isSimulation() ||
    my_grl.HasRunLumiBlock(event->eventinfo.RunNumber(), event->eventinfo.lbn()))
{
  for( Int_t i = 0; i < event->el.n(); ++i )
    GRLHist->Fill(event->el[i].pt());
} |
Now we need to compile our package again. Since we changed our dependencies, you have to run find_packages.sh again. Remember that this has to be done in the ROOTTutorial directory:
$ROOTCOREDIR/scripts/find_packages.sh
$ROOTCOREDIR/scripts/compile.sh |
Now we need to update our steering macro (ATestRun.cxx) to set the GRL in our algorithm. Add this code where it says to put the algorithm configuration:
// Read the GRL.  if you don't have AFS on your local machine, copy
// the file over and change the path accordingly.
Root::TGoodRunsListReader* grlR = new Root::TGoodRunsListReader();
grlR->SetXMLFile ("/afs/cern.ch/user/a/atlasdqm/www/grlgen/All_Good/data11_7TeV.periodAllYear_DetStatus-v36-pro10-02_CoolRunQuery-00-04-08_All_Good.xml");
grlR->Interpret ();

// Tell our analysis to use this GRL
alg->my_grl= grlR->GetMergedGoodRunsList(); |
In order to see the effect of the GRL, let's change the lines which display the final histogram to:
sh_hist.load (submitDir + "/hist");
TH1 *hist = sh_hist.get ("user.krumnack.pat_tutorial_data.v1")->readHist ("AHist");
hist->Draw ();
TH1 *hist2 = sh_hist.get ("user.krumnack.pat_tutorial_data.v1")->readHist ("GRLHist");
hist2->SetLineColor (2);
hist2->Draw ("SAME"); |
Now let's run again: inside the Run directory call the macro as before. If you already have a submitDir, you either have to rename or remove it first:
root -l 'ATestRun.cxx ("submitDir")' |
You should now see two histograms of the electron pt spectrum: the black one shows all events and the red one only the events passing the GRL. Hopefully the latter will always be smaller than the former, but not by much.
This section contains examples of how to run EventLoop differently, so that you can take a bigger advantage of it.
SampleHandler has the ability to store additional data with each sample, called meta-data. This data can be accessed from within your algorithm. Among other things this can serve as a way to configure your algorithm differently for each sample. The way this works in practice is quite simple, and a small worked sketch is given at the end of this subsection. If you look at the section on making a stack plot with SampleHandler, it uses this mechanism quite a bit.
You can set the sample meta-data the following way (we use that in the stack plot example):
sh.get ("sampleName")->setMetaDouble ("doubleVarName", 17.3); sh.get ("sampleName")->setMetaString ("stringVarName", "some string"); |
Or you can also set a meta-data field for all the samples in the sample handler. This is useful if you want to provide a default for all samples and then change individual ones. We also use this mechanism in the base example to set the tree name for all samples. The general form is like this:
sh.setMetaDouble ("doubleVarName", 17.3); sh.setMetaString ("stringVarName", "some string"); |
You can then retrieve the meta-data within your algorithm using:
double doubleMeta = wk()->metaData()->getDouble ("doubleVarName"); std::string stringMeta = wk()->metaData()->getString ("stringVarName"); |
If you want to, you can also provide a default value, in case the meta-data wasn't set for a particular sample. This value will then be used if the field wasn't initialized:
double doubleMetaDefault = wk()->metaData()->getDouble ("doubleVarName", 17.3); std::string stringMetaDefault = wk()->metaData()->getString ("stringVarName", "some string"); |
Just for completeness, let's mention that you can then retrieve the meta-data within your steering script as well. We use that in the stack plot example, the general case is:
double doubleMeta = sh.get ("sampleName")->getMetaDouble ("doubleVarName");
std::string stringMeta = sh.get ("sampleName")->getMetaString ("stringVarName");
double doubleMetaDefault = sh.get ("sampleName")->getMetaDouble ("doubleVarName", 17.3);
std::string stringMetaDefault = sh.get ("sampleName")->getMetaString ("stringVarName", "some string"); |
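As a concrete (if made-up) illustration of configuring an algorithm per sample: you could attach a cross-section value to each sample in the steering macro and pick it up in initialize(). The field name xsection, the value 165., and the member m_xsection (which would be declared in the header with a //!, since it is set in initialize) are hypothetical and only for this sketch:
// in the steering macro: a default for all samples, overridden for one of them
sh.setMetaDouble ("xsection", 1);
sh.get ("user.krumnack.pat_tutorial_ttbar_corrected.v1")->setMetaDouble ("xsection", 165.);

// in MyFirstD3PDAnalysis::initialize(): read it back (falls back to 1 if it was never set)
m_xsection = wk()->metaData()->getDouble ("xsection", 1); |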
PROOF lite is a fairly effective and easy way of improving the performance of your analysis code. PROOF lite will execute your code in parallel on every CPU core available on your local machine, so if your machine has 16 cores, your analysis will run up to 16 times as fast. The actual amount of speed-up will depend on a variety of things, the number of cores in your machine, the speed of your hard drive, how you read your data, the amount of processing you do for each event, etc.
Using PROOF lite is very straightforward: you just change the driver statement in ATestRun.cxx to:
EL::ProofDriver driver; |
That's it. There are a couple of practical changes though. The main one is that any messages your job prints while processing events no longer appear on the screen. Instead they go into a log file in the ~/.proof directory hierarchy. In general it is a little hard to debug jobs in PROOF, so if something goes wrong there is an advantage to running with DirectDriver instead.
Warning: This feature is not well tested yet, and I wouldn't be surprised if you run into trouble using it at the current time.
Beyond the lite mode, PROOF also can be used to run on your local batch system. For that the batch system has to be configured properly for PROOF mode. This can be done either centrally by your system administrator, or you can use a mode called PROOF-on-demand that will run on top of your local batch system.
PROOF has been specifically designed for use with ROOT, and it offers a couple of advantages over a traditional batch system.
To make use of your PROOF farm you need to configure the PROOF driver accordingly:
EL::ProofDriver driver;

// set the PROOF master.  this needs to be changed to point to your PROOF master node
driver.proofMaster = "acas1010";

// by default the PROOF driver only creates a .par file for PROOF lite
// resetting this option will give you a full-featured .par file
driver.makeParOptions = "";

// this configures how output n-tuples are returned.
// for full PROOF mode the output n-tuples
// stay on the farm, as configured by this option.
driver.returnFiles = false; |
The way PROOF works is that it will upload your code to your PROOF cluster and then recompile it there. This means you'll encounter a full recompile of your code the first time you upload; after that it will only recompile partially according to the changes that you made. However, sometimes it gets confused, mostly if you actually delete any source files (they don't get deleted on the server). If that happens it is easiest to remove your software package from the PROOF master and recompile it completely. The easiest way to do that is to use the following command:
driver.removePar (); |
If you work on lxplus, you can run your job on the local batch system, lxbatch. This allows you to run on many nodes in parallel, improving the turnaround time of your job drastically. If you work at your institution's tier 3 site you may be able to do something similar there by using the correct driver for your batch system. For lxbatch, just change the driver statement in ATestRun.cxx to:
EL::LSFDriver driver;
driver.shellInit = "export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase && source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh && eval localSetupROOT --skipConfirm"; |
Note the shellInit parameter, which is used to set up ROOT on each of the worker nodes. If you run somewhere other than lxplus you will have to update it. Notice that we call localSetupROOT slightly differently than normal to make it work on the lxbatch worker nodes.
For now there are a number of practical limitations though: right now both the submitDir and your ROOTTutorial directory need to be located on a shared filesystem, and if you create n-tuples they will be copied into submitDir, not to your local file server. All of these limitations can be overcome, probably with modest effort. However, I will not start working on them until I actually get a request from a user to do so.
The speed with which your jobs run will in most cases be dominated by the speed with which you can read your input data, so it is often worthwhile to try to maximize that speed. One important step towards that end is using the D3PDReader, which will only read the data you actually use, thereby avoiding reading unneeded data.
Another way to improve read performance is to use TTreeCache, which will predict the data that you are likely to access and preload it in big chunks. This is mostly important when reading files over the network, and it is virtually mandatory in case you read the files directly from the grid (see the next section on FAX).
Using TTreeCache with EventLoop is very straightforward, just specify the size of the cache you would like for your job before submitting it (in this case 10MB):
job.options()->setDouble (EL::Job::optCacheSize, 10*1024*1024); |
The default way in which TTreeCache predicts what data your job will access is by looking at the first n events and assuming that the pattern will hold for the rest of your job. If you want to, you can change the number of events to be read to suit your needs:
job.options()->setDouble (EL::Job::optCacheLearnEntries, 20); |
You may have to play around a little to determine which number works best for you. There is a definite tradeoff, in that too large a number means that you read too many events without the benefit of the cache, while too small a number means that you will not be able to cache variables that you do not access in every event.
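Putting the two options together, a setupJob along these lines is a reasonable sketch; the numbers are only a starting point you would tune for your own job:
EL::StatusCode MyFirstD3PDAnalysis :: setupJob (EL::Job& job)
{
  job.useD3PDReader ();

  // 10 MB read-ahead cache, trained on the first 20 events
  job.options()->setDouble (EL::Job::optCacheSize, 10*1024*1024);
  job.options()->setDouble (EL::Job::optCacheLearnEntries, 20);

  return EL::StatusCode::SUCCESS;
} |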
Warning: FAX is a very new product and you may very well find it is not yet ready to suit your needs. However, it is at a stage where user feedback is appreciated, and it can be used for productive work by experienced users. This section is for advanced users who are willing to give it a try or just want to see how it works. If you are a beginner, you might just want to skip this section and go on to grid processing instead. If you run into trouble or want to leave feedback, the support mailing lists are atlas-adc-federated-xrootd@cernSPAMNOT.ch and usatlas-federated-xrootd@cernSPAMNOT.ch. There are a number of issues that you may encounter.
FAX is a mechanism that allows you to read data directly from the grid without downloading it first. Depending on your usage pattern this may not only be more convenient, but also faster than downloading the files directly: If you download a file, you have to download it in its entirety, whereas if you read the file directly from the grid you may read only a small fraction of it. So if you are reading the file only once (or only a few times) this may be an option to improve your analysis workflow.
Warning: While not strictly necessary, it is strongly recommended that you use TTreeCache when using FAX, otherwise your performance is likely to be very poor. So if you haven't done so already, you should work through the section on TTreeCache first.
As a first step, set up the DQ2 tools, which will be needed for using FAX. There may be some specific advantages to setting them up before or after setting up ROOT, but I wouldn't know about that. Anyway, when setting up on lxplus we already sourced atlasLocalSetup.sh, so from there the recommended commands are:
localSetupDQ2Client
voms-proxy-init -voms atlas
export STORAGEPREFIX=root://atlas-xrd-eos-n2n.cern.ch/ |
The first line here takes care of setting up the actual DQ2 tools. If you do not work on lxplus you may need to do something different. The second line will initialize the VOMS proxy that holds your grid certificate. If you have problems with that command you should ask your resident grid expert. There are too many things that can go wrong with that to discuss them here. Now the last line is specific to FAX and tells it which machine to contact for samples. This again may vary depending on your site. You can check the official documentation, particularly this list here.
Now it's time to actually use FAX. For that comment out the part where we scan the local directory for samples, and instead scan DQ2:
// SH::scanDir (sh, "/afs/cern.ch/atlas/project/PAT/data/D3PDForRootAnalysis/");
SH::scanDQ2 (sh, "user.krumnack.pat_tutorial_*.v1"); |
That should do it. You can now just run your script the same way you did before and with a little luck you will get the same result as before. The initialization will take a little longer, as this will actually query dq2 to find the datasets matching your request, and then again for each dataset to locate the actual files. However, compared to the overall run-time this overhead should be small, and the power you gain is most likely worth it.
Warning: There are a lot more things that can go wrong with this than with reading files locally. After all, you might be reading files from the other side of the world and not even know it. And this is also a fairly new service, so there might be some growing pains associated with it as well.
If you can not or do not want to use FAX for accessing grid datasets, you either have to run on the grid or download them to your site. One option for the latter is to tell SampleHandler that you have a certain area where your grid samples should go, and it will then automatically download them for you the first time you access them. For the most part this is a convenience feature, but if you activate it at the same time as FAX, it will try accessing the dataset with FAX first and only if that fails will it download the dataset.
To use this feature you have to set up the DQ2 tools as well as a cache area. There may be some specific advantages to setting up the DQ2 tools before or after setting up ROOT, but I wouldn't know about that. Anyway, when setting up on lxplus we already sourced atlasLocalSetup.sh, so from there the recommended commands are:
localSetupDQ2Client
voms-proxy-init -voms atlas
export SAMPLEHANDLER_DQ2_AREA=/tmp/$USER/sh_dq2
mkdir -p $SAMPLEHANDLER_DQ2_AREA |
The first line here takes care of setting up the actual DQ2 tools. If you do not work on lxplus you may need to do something different. The second line will initialize the VOMS proxy that holds your grid certificate. If you have problems with that command you should ask your resident grid expert. There are too many things that can go wrong with that to discuss them here. The third line defines the directory where your cached data samples will go, and the fourth line makes sure that area exists.
Warning: In any real life application you will want to point this somewhere other than the /tmp area. We only point it to the /tmp area because by default you don't have any other place to put large data files when working on lxplus. For actual work please point this to some nice and large data disk. If you want to put your files directly onto a storage element, please talk to me and I'll see if I can make that a reality.
Now let's use some grid samples instead of the AFS ones. For that comment out the part where we scan the local directory for samples, and instead scan DQ2:
// SH::scanDir (sh, "/afs/cern.ch/atlas/project/PAT/data/D3PDForRootAnalysis/");
SH::scanDQ2 (sh, "user.krumnack.pat_tutorial_*.v1"); |
That should do it. You can now just run your script the same way you did before and with a little luck you will get the same result as before. The initialization will take a little longer, as this will actually query dq2 to find the datasets matching your request, and then for the first time it will take quite a while to download all data samples. However, for subsequent runs the overhead for querying dq2 should be small, and the power you gain is most likely worth it.
Warning: The Achilles heel of this mechanism is that if you try to download the same dataset multiple times simultaneously, something bad is likely to happen, and I'm not sure that my code will catch that. The easiest way to avoid it is to run only one job at a time; if you run multiple jobs, run them on different datasets, run them in sequence, or make sure all datasets are already downloaded. If you feel you can't work around this, talk to me.
So far our tutorial has assumed that when you run a job you will just sit there and wait until it finishes. This works well for the jobs we are doing in the tutorial, since they are fairly short, but it works less well for longer jobs; particularly when working on the grid with large volumes of data, a job can take many hours. To better deal with these cases, one may want to split our ATestRun.cxx script into two: one script that submits the job, and one that retrieves and processes its output.
Please note that while the following will work with all drivers, you will only get the full benefits with some of the drivers. At the time of this writing the supported drivers are the grid driver and the batch drivers. The other drivers will still do the full event processing in the submission step. If you feel there is another driver to which this capability should be added, send me an email (nils.erik.krumnack@cernSPAMNOT.ch) and I will try to implement it for you.
To create the first script, copy the content of ATestRun.cxx all the way up to, and including, the driver.submit statement into a new file ATestSubmit.cxx. Don't forget to change the name of the macro at the beginning of the file to ATestSubmit:
Instead of the regular driver.submit statement we will ask EventLoop only to submit the job and immediately return control to us. To do that, at the end of ATestSubmit.cxx replace driver.submit with:
// submit the job, returning control immediately if the driver supports it
driver.submitOnly (job, submitDir);
} |
Next, let's create the script for retrieving and processing the job output; let's call it ATestRetrieve.cxx. Start it out by loading all the packages and then retrieving the output:
void ATestRetrieve (const std::string& submitDir)
{
  // Load the libraries for all packages
  gROOT->ProcessLine(".x $ROOTCOREDIR/scripts/load_packages.C");

  // retrieve the job, waiting for it, if it is not finished
  EL::Driver::wait (submitDir); |
And then to process the output, just copy everything from ATestRun.cxx that comes after driver.submit. Please note that it is very important to call wait, even if you know that your job already finished. wait does more than just wait for the job to finish: it also takes care of merging histograms, moving files, and any other post-processing your job may require.
Now for the actual running.
root -l -b -q 'ATestSubmit.cxx ("submitDir")'
root -l 'ATestRetrieve.cxx ("submitDir")' |
The point here is that you can wait an arbitrary amount of time between the two commands, and you can (if you want to) call the second command as often as you want, e.g. to tweak how the histograms get displayed. Should the job not be finished, wait will perform a partial retrieve, print a message, wait for a minute and then try again. It is perfectly safe to kill the job while it is waiting, but you should not kill it while it does the partial retrieve.
Please note that, at least for the batch drivers, you may get your results somewhat faster if you have wait running all the time, as opposed to only after the job finished. This is because the driver can do some of the post-processing during the partial retrieve while the other jobs are still running. Whether that matters will depend on your job. If you don't want to have your job actively waiting, you can instead trigger a partial retrieve manually (just don't do both at the same time):
root -l -b -q $ROOTCOREDIR/scripts/retrieve.C \(\"submitDir\"\) |
In this section of the tutorial we will teach you how to run on the grid. There are two main advantages to running on the grid: First, you have access to the vast computing power of the grid, which means even very lengthy jobs can finish within hours, instead of days or weeks. Second, you have direct access to all the datasets available on the grid, which means you don't need to download them to your site first, which can save you hours if not days of time. There are also two main disadvantages: First, for all but the simplest jobs your turnaround time will be measured in hours, if not days. And this time can vary depending on the load at the various grid sites. Second, there are more things beyond your control that can go wrong, e.g. the only grid site with your samples may experience problems and go offline for a day or two thereby delaying the execution of your jobs.
One common model is to run on the grid to produce a smaller, more highly processed set of n-tuples and then run on those n-tuples locally with a second job. If done right this will take advantage of the strengths of the grid to do the large scale processing, and then take advantage of your local resources to produce histograms efficiently. This is a common model, but it may not be the best one for you, e.g.: your primary n-tuples may be small enough that you can just download them directly, in which case you can probably avoid grid processing altogether. Or you may be one of the (rare) cases in which one can not (sensibly) produce a processed n-tuple.
Warning: While not strictly necessary, it is strongly recommended that you use TTreeCache when running on the grid; this can significantly improve the run time on some grid sites. So if you haven't done so already, you should work through the section on TTreeCache first.
Warning: While not strictly necessary, it is strongly recommended that you use a separate submit and retrieval script when running on the grid. A grid job can take hours if not days to finish, and you wouldn't want your job to be just sitting there. So if you haven't done so already, you should work through the section on separate submit-retrieve first.
Before running for the first time: The GridDriver uses ganga, which currently needs to create a few configuration defaults before it can be run for the first time. If you have never used ganga before, start it once in interactive mode before using the GridDriver:
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh
localSetupGanga
ganga
Answer the questions, then press ctrl+d to exit. You only need to do this once on each computer where you want to use the GridDriver.
As a first step, set up the DQ2 tools, which will be needed for running on the grid. There may be some specific advantages to setting them up before or after setting up ROOT, but I wouldn't know about that. Anyway, when setting up on lxplus we already sourced atlasLocalSetup.sh, so from there the recommended commands are:
localSetupDQ2Client
voms-proxy-init -voms atlas |
The first line here takes care of setting up the actual DQ2 tools. If you do not work on lxplus you may need to do something different. The second line will initialize the VOMS proxy that holds your grid certificate. If you have problems with that command you should ask your resident grid expert. There are too many things that can go wrong with that to discuss them here.
Open ATestSubmit.cxx and make sure that you are using the grid datasets. If you did the section on FAX or DQ2 downloads, this should be set already, otherwise comment out the directory scan now and instead scan DQ2:
// SH::scanDir (sh, "/afs/cern.ch/atlas/project/PAT/data/D3PDForRootAnalysis/");
SH::scanDQ2 (sh, "user.krumnack.pat_tutorial_*.v1"); |
Next, replace the driver with the GridDriver:
//EL::DirectDriver driver;
EL::GridDriver driver; |
Before the jobs can be submitted you need to tell the driver which ATLAS software release should be set up on the worker nodes, otherwise the jobs cannot run. This in turn will decide which versions of ROOT and GCC are used for your analysis:
driver.athenaTag="17.2.7.4.1"; |
The GridDriver supports a number of optional configuration options that you might recognize from the prun program. If you want detailed control over how jobs are submitted, please consult this page for a list of options: https://twiki.cern.ch/twiki/bin/viewauth/AtlasProtected/EventLoop#Grid_Driver
Finally, submit the jobs as before:
root -l -b -q 'ATestSubmit.cxx ("myGridJob")' |
This command may take a while to complete - do not interrupt it! You can follow the evolution of your jobs by going to http://panda.cern.ch, clicking on users and then finding yourself in the list. Note that jobs for all samples may not appear immediately; they will be submitted over a period of time.
As opposed to the other batch drivers, it is not necessary to call wait() to trigger a partial retrieve. Grid jobs are often very large and numerous. Output of finished jobs is therefore continuously being downloaded in the background and only the final merging step happens when you call retrieve() or wait(). Failed jobs are also automatically retried a maximum of 4 times. This process will continue for 24 hours even if you log out. If your jobs are not completed by then, it will be restarted whenever you call wait() or retrieve(). But if you want the driver to inform you when the jobs are done, you can just keep wait() running:
root -l 'ATestRetrieve.cxx ("myGridJob")' |
Depending on the size of your samples and the load on the grid, it can take hours or even days before results are ready. However, our test jobs in this tutorial are quite small and should hopefully not take that long to finish. Otherwise, just cancel the retrieve script and try again later or the next day. Upon completion, you should see the histogram pop up as before.
What is happening behind the scenes? The grid driver relies on ganga to manage your jobs. Ganga creates a "server" process that will keep running on your machine until the jobs are finished (max 24 hrs), even if you log off. In case of problems, the ganga log file (~/.ganga.log) or the server messages (~/gangadir-EL/server) might provide some insight into what is going on. In addition, you can set GridDriver::gangaLogFile to save driver/server communication to a location of your choice (hint: the tail -f command lets you follow log file updates in real time).
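For example (the path is made up), in your submission macro:
// keep a copy of the driver/server communication for debugging
driver.gangaLogFile = "/tmp/myGridJob-ganga.log"; |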
Specifying paths to local files: When running on the grid, your local file system is of course inaccessible. All auxiliary files needed by your job must be placed in the share directory of one of your RootCore packages and accessed using paths of the form $ROOTCOREDIR/data/PackageName/FileName.root (take a look around inside the RootCore directory after compiling to see how symlinks of this form are automatically created). It is good practice to always use this system, as it works with all drivers.
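A minimal sketch of that pattern in your algorithm code, mirroring what we did for the jet cleaning configuration above; the file name myAuxFile.root is hypothetical:
#include <TSystem.h>
#include <TFile.h>

// resolve a file shipped in MyAnalysis/share at run time; works with all drivers
TString auxFile = "$ROOTCOREDIR/data/MyAnalysis/myAuxFile.root";
gSystem->ExpandPathName (auxFile);
TFile *file = TFile::Open (auxFile.Data ()); |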
In this section we'll build a simple example of writing a new n-tuple. Done right, creating your own n-tuple is a good technique for improving the speed of your analysis.
For our example we'll use the capabilities of the EventLoop NTupleSvc and D3PDReader to create the new n-tuple. You can also create a new n-tuple yourself by using the facilities of TTree directly. As a demonstration, we'll do a bit of skimming/slimming/thinning.
Before we start, make a copy of ATestRun.cxx, which we will need in a second:
cp ATestRun.cxx ATestRun2.cxx |
ATestRun.cxx will be the skim job and ATestRun2.cxx will run over the skimmed n-tuple. As a first step we will simply use the NTupleSvc and a basic selection in our job, so let us add the NTupleSvc to the job. This has to happen before you add any other algorithms that want to use the service:
EL::OutputStream output ("output");
job.outputAdd (output);
EL::NTupleSvc *ntuple = new EL::NTupleSvc ("output");
ntuple->copyBranch ("RunNumber");
ntuple->copyBranch ("EventNumber");
ntuple->copyBranch ("isSimulation");
ntuple->copyBranch ("el_n");
ntuple->copyBranch ("el_pt");
ntuple->copyBranch ("lbn");
ntuple->copyBranch ("jet_n");
ntuple->copyBranch ("jet_hecf");
ntuple->copyBranch ("jet_eta");
ntuple->copyBranch ("jet_NegativeE");
ntuple->copyBranch ("jet_HECQuality");
ntuple->copyBranch ("jet_emfrac");
ntuple->copyBranch ("jet_sumPtTrk");
ntuple->copyBranch ("jet_LArQuality");
ntuple->copyBranch ("jet_Timing");
ntuple->copyBranch ("jet_fracSamplingMax");
ntuple->copyBranch ("jet_AverageLArQF");
job.algsAdd (ntuple); |
The string output is just an arbitrary name that identifies this particular output. If you create multiple outputs in the same job, each of them needs to be given a different label. In the future, some drivers may allow you to specify further options as part of the OutputStream object, e.g. on which disk the output n-tuples should be stored.
Now let's add a simple selection, in this case all events which contain at least two electrons or two muons:
EL::AlgSelect *select = new EL::AlgSelect ("output"); select->addCut ("el_n>=2||mu_muid_n>=2"); select->histName ("cut_flow"); job.algsAdd (select); |
We are also creating a cut flow histogram here, which is actually kind of boring in that we only have a single cut. However, if you add multiple cuts you will get to see how many events survive each cut.
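For example, the selection could be extended with additional cuts; each addCut call adds another step (and another bin) to the cut flow. The extra cut values below are purely illustrative:
// each additional cut appears as one more entry in the cut flow histogram
// (cut values are just examples)
select->addCut ("el_pt[0]>=25e3");
select->addCut ("el_pt[1]>=20e3"); |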
Now we can go on and try to produce our n-tuple:
root -l 'ATestRun.cxx ("submitDir")' |
Assuming that went fine, this is now where things get complicated. Depending on how you ran your code, your n-tuple will show up in different places, and there are two ways to get to it. The more general way is to load the sample handler that was created for your output stream. You can either use that SampleHandler to access the n-tuples yourself, or use them as the input for another run of EventLoop. Assuming the latter, we now modify the ATestRun2.cxx we created earlier. First make sure that the function name matches the name of the macro, so change the first line to
void ATestRun2 (const std::string& submitDir) |
Also change the section for the input sample handler to:
// create a new sample handler to describe the data files we use
SH::SampleHandler sh;
// load an existing sample handler
sh.load ("submitDir/output-output");
// set the name of the tree in our files
sh.setMetaString ("nc_tree", "MyFirstD3PD");
// further sample handler configuration may go here
// print out the samples we found
sh.print (); |
Now you can run ATestRun2.cxx
like you did for ATestRun.cxx
:
root -l 'ATestRun2.cxx ("submitDir2")' |
This will run your analysis job again, but on your skimmed n-tuple instead of the full n-tuple. Note that in a more realistic example we would probably have different algorithms for running on the skimmed n-tuples than on the full ones; e.g. we wouldn't want to write out another skimmed n-tuple every time we run on the skimmed n-tuple.
As I said, there is another way to access your data, and that is to do it directly. Where the data is located will strongly depend on the EventLoop driver you are using. If you are running on a single machine (e.g. direct or PROOF-lite driver) you will find the output in submitDir/data-output, but if you are running on a computing farm the output will normally be stored on the farm storage elements or the farm nodes themselves. That is for a very practical reason: n-tuple files are typically huge. Copying them all to your interactive machine could take a lot of time and may require more disk space than is locally available. Furthermore, if your n-tuples are large you will typically want to run over them using your farm, and your interactive node is typically not set up to deliver n-tuples efficiently to your farm machines.
However assuming you used the direct or PROOF-lite driver you can try to access and inspect the output n-tuple like this:
root -l submitDir/data-output/user.krumnack.pat_tutorial_data.v1.root
root [0] TBrowser x |
So far we have copied branches directly from the input tree to the output tree, and specified the events we want to keep using a simple expression. While that is sufficient in some cases, in many cases we would like to do more complicated things. For that we need to connect our algorithm to the n-tuple service, so that it can provide additional inputs.
As a first step we need to add the n-tuple service to our job. To do that, make sure your package depends on EventLoopAlgs, i.e. modify the PACKAGE_DEP line in your Makefile.RootCore so that it reads:
PACKAGE_DEP = ... EventLoopAlgs |
(where ... stands for the other package dependencies you might have listed in previous parts of this tutorial or in your own analysis)
Now recompile. We need to run find_packages.sh
since we updated our dependencies:
$ROOTCOREDIR/scripts/find_packages.sh
$ROOTCOREDIR/scripts/compile.sh |
Next, add to the includes in MyFirstD3PDAnalysis.h:
#include <EventLoopAlgs/NTupleSvc.h> |
And add a pointer to the n-tuple service in the class (note that the //!
is important):
EL::NTupleSvc *output; //! |
And add a configuration member that contains the name of the output
stream. This is an example of a configuration parameter, basically the
only case where you do not need and should not use the //!
:
public: std::string outputName; |
Now connect to the n-tuple service in the initialize
function in MyFirstD3PDAnalysis.cxx
:
if (!outputName.empty()){ output = EL::getNTupleSvc (wk(), outputName); } else{ output = 0; } |
And in ATestRun.cxx
add the output stream to the configuration of our algorithm:
alg->outputName = "output"; |
One simple thing we can do, is add another variable that is calculated from the existing variables. For demonstration purposes we are using the invariant mass of the first two electrons. In real life you would apply quality cuts to make sure you have two good electrons, but for simplicity we just take the first two.
Since we need the TLorentzVector class, we need to update our Makefile.RootCore so that the PACKAGE_PRELOAD line reads:
PACKAGE_PRELOAD = ... Physics |
(where ... stands for the other preloaded libraries you might have listed in previous parts of this tutorial or in your own analysis)
First add the new variable to the class in MyFirstD3PDAnalysis.h
:
Float_t diel_mass; //! |
Now in MyFirstD3PDAnalysis.cxx
add another include for the mass calculation:
#include <TLorentzVector.h> |
Now register the variable with the tree in initialize
after you connect to the ntuple service:
if (output){ output->tree()->Branch ("diel_mass", &diel_mass, "diel_mass/F"); } |
And in execute
fill the variable with a value:
diel_mass = 0; if (event->el.n() >= 2) { TLorentzVector mom1, mom2; mom1.SetPtEtaPhiM (event->el[0].pt(), event->el[0].eta(), event->el[0].phi(), 0); mom2.SetPtEtaPhiM (event->el[1].pt(), event->el[1].eta(), event->el[1].phi(), 0); diel_mass = (mom1 + mom2).M(); } |
Now we can go on and try to produce our n-tuple:
$ROOTCOREDIR/scripts/compile.sh
root -l 'ATestRun.cxx ("submitDir")' |
And go on to inspect the new variable in our n-tuple:
root -l submitDir/data-output/user.krumnack.pat_tutorial_data.v1.root
root [0] MyFirstD3PD->Draw ("diel_mass") |
In this section we will replace our basic skim selection with a selection based on quantities we calculated in our algorithm, in this case the dielectron mass. Note that this is simplified; for a realistic scenario we would have to put precautions in place for what to do if we have more than two electrons.
Add a cut parameter to the class in MyFirstD3PDAnalysis.h
:
float massCut; |
And in the constructor MyFirstD3PDAnalysis()
in MyFirstD3PDAnalysis.cxx
set a default value for that parameter:
massCut = 30e3; |
Adding the selection in our code is very straightforward. After we fill diel_mass
in execute
add these lines:
if (output && diel_mass > massCut) output->setFilterPassed (); |
Now remove the AlgSelect from ATestRun.cxx, or you will get an OR of both selections (which is not what we want). Then rerun the code and inspect the output n-tuple:
$ROOTCOREDIR/scripts/compile.sh
root -l 'ATestRun.cxx ("submitDir")'
root -l submitDir/data-output/user.krumnack.pat_tutorial_data.v1.root
root [0] MyFirstD3PD->Draw ("diel_mass") |
You should now see that the mass spectrum is cut off at the specified value. As an exercise, modify the cut value from ATestRun.cxx (a sketch follows below) and check that you get what you expect.
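A minimal sketch of such a change, placed in the algorithm configuration section of ATestRun.cxx; the value 50e3 (50 GeV) is an arbitrary example:
// tighten the dielectron mass cut (arbitrary example value, in MeV)
alg->massCut = 50e3; |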
In the next section we will thin out the input electron collection to include only high pt electrons. To that end we are using the facilities of the D3PDReader to create that new collection and then write it out.
As a first step we need to add the new collection to our class in MyFirstD3PDAnalysis.h:
D3PDReader::ElectronD3PDObject out_electrons; //! |
Let's also add a configuration member for the electron pt cut to that class:
float electron_pt_cut; |
And in the constructor MyFirstD3PDAnalysis() in the source file (MyFirstD3PDAnalysis.cxx) give the parameter a default value:
electron_pt_cut=10e3; |
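If you prefer, this parameter can also be overridden from ATestRun.cxx in the algorithm configuration section, just like massCut (assuming you declared it in the public section like the other configuration parameters); the value is only an example:
// override the default electron pt cut from the steering macro (example value, in MeV)
alg->electron_pt_cut = 20e3; |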
Then in your initialize function (just before the return statement) set up the writing out of some selected objects. After the code from the previous step, add a custom prefix for the output collection. Note that we use regular expressions to control what information we would like to write out (i.e. slimming):
// Set a new prefix for the selected objects
out_electrons.SetPrefix ("my_el_");
// We select just basic kinematic info by activating just those branches
out_electrons.SetActive (kTRUE, "^my_el_n$|^my_el_pt$|^my_el_eta$|^my_el_phi$");
// Tell the objects to write into the output TTree
if (output) out_electrons.WriteTo (output->tree()); |
Now back in the execute
function, add just the selected input objects to the output:
// Select objects to put into output
out_electrons.Clear();
for (Int_t i = 0; i < event->el.n(); ++ i)
  if (event->el[i].pt() > electron_pt_cut)
    out_electrons += event->el[i]; |
Now re-run the skimming job and check the effect on the pt spectrum:
$ROOTCOREDIR/scripts/compile.sh
root -l 'ATestRun.cxx ("submitDir")'
root -l submitDir/data-output/user.krumnack.pat_tutorial_data.v1.root
root [0] MyFirstD3PD->Draw ("my_el_pt") |
This should show that the final collection only contains electrons that are above the cut. If you want to play with this code a little more, here are a couple of things you can try:
- configure the electron pt cut from ATestRun.cxx instead of relying on the default value set in the constructor
- change the diel_mass calculation to use only the electrons that pass our selection (a sketch follows after this list)
- make the new collection my_el_ replace the original collection el_ with our new thinned collection. This is often useful, in that it allows code to run on either n-tuple, if it only looks at electrons passing the selection. To do this you need to change the prefix in initialize (within MyFirstD3PDAnalysis.cxx) and remove the electron variables from the list of variables configured in NTupleSvc (within ATestRun.cxx)
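As an illustration of the second item, here is a sketch of how the diel_mass calculation in execute could be restricted to the selected electrons, reusing the out_electrons collection (this assumes the output collection supports the same n() and element access as event->el, and it has to run after the selection loop that fills out_electrons):
// compute the dielectron mass from the selected (thinned) electrons only;
// this block replaces the earlier calculation based on event->el
diel_mass = 0;
if (out_electrons.n() >= 2)
{
  TLorentzVector mom1, mom2;
  mom1.SetPtEtaPhiM (out_electrons[0].pt(), out_electrons[0].eta(), out_electrons[0].phi(), 0);
  mom2.SetPtEtaPhiM (out_electrons[1].pt(), out_electrons[1].eta(), out_electrons[1].phi(), 0);
  diel_mass = (mom1 + mom2).M();
} |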
In this section of the tutorial we will demonstrate how to use the features of SampleHandler to produce a stack plot from the histogram we created. The idea is both to show you how to use SampleHandler more effectively and to give you a generic tool for creating stack plots that you can build on and use in your analysis. It is probably not suited for all cases, but it should form a good basis for making quick and unofficial plots.
In this section we will build on the algorithm we developed earlier, but there is also an older, more self-contained version of this tutorial here.
While this is a very nice example of how to use SampleHandler to manage a large and changing number of samples (as well as their meta-data), it is not necessarily the best way to actually produce your final plots. Once you try to add systematics or other advanced features to your plot it can become quite tricky. To deal with this, I am currently working on a new package that will produce publication quality plots. Among other things, it will allow you to produce a nice looking stack plot directly from the output of your EventLoop job using only a few lines. It hasn't been reviewed or released yet, but if you are interested you can send me an email (Nils.Erik.Krumnack@cernSPAMNOT.ch) and I will inform you once it is. If you want, you can also volunteer as a pre-release tester, which would help me and may make life easier for you as well, but you would have to contend with the fact that this package isn't stable (or official) yet.
So far we haven't used SampleHandler to do anything except provide the file lists for EventLoop. One very useful feature of SampleHandler is that you can attach meta-data to your individual data samples. Let's say we want to attach a label and the luminosity to each sample. Then you would add the following lines to ATestRun.cxx after the SH::scanDir command, where it says to put further meta-data configuration:
// the luminosity we have to set manually and one by one
sh.get ("user.krumnack.pat_tutorial_Zmumu.v1")->setMetaDouble ("lumi", 23.25);
sh.get ("user.krumnack.pat_tutorial_Zmumu.v1")->setMetaString ("label", "Z+jets");
sh.get ("user.krumnack.pat_tutorial_ttbar_corrected.v1")->setMetaDouble ("lumi", 121.21);
sh.get ("user.krumnack.pat_tutorial_ttbar_corrected.v1")->setMetaString ("label", "ttbar");
sh.get ("user.krumnack.pat_tutorial_data.v1")->setMetaDouble ("lumi", 5.891);
sh.get ("user.krumnack.pat_tutorial_data.v1")->setMetaString ("label", "data"); |
Since these meta-data fields are defined for the input sample handler, they get automatically propagated to the sample handlers EventLoop generates for the output data. This will require you to re-run ATestRun.cxx
.
If you want to avoid that, you can instead add these lines to the
macro we are about to create after you load the sample handler (there is
a comment marking the right place).
Start out with a new macro AStackPlot.cxx
which will contain your plotting code. You could make this part of ATestRun.cxx
,
but I don't recommend that, since you would have to rerun your event
loop every time you want to change your plotting style. In the
beginning you first have to load all your packages again
#include <vector>

void AStackPlot (const std::string& submitDir)
{
  // Load the libraries for all packages
  gROOT->ProcessLine(".x $ROOTCOREDIR/scripts/load_packages.C"); |
To make our life simpler, define the plot parameters at the beginning of our code:
// the name of the histogram we are trying to plot
std::string histName = "AHist"; |
The way our code will work is that we first create all the histograms and then plot them, so we need some variables to hold the histograms in between (you'll also need to add an include for <vector> at the beginning):
// these vectors contain the data and mc histograms
std::vector<TH1*> data, mc; |
Now we use the ability of SampleHandler to loop over events to get all the histograms we need:
// create and initialize the sample handler
SH::SampleHandler sh;
sh.load (submitDir + "/hist");
// add any extra meta-data you desire here

for (unsigned samplesIter = 0; samplesIter != sh.size(); ++ samplesIter)
{
  // the sample we are plotting
  SH::Sample *sample = sh.at (samplesIter);
  // determine whether this is data or MC
  bool isMC = sample->name().find ("data") == std::string::npos;
  // retrieve the histogram
  TH1 *hist = sample->readHist (histName);
  // store the histogram for later.
  if (isMC){
    mc.push_back (hist);
  } else{
    data.push_back (hist);
  }
} // end for loop over samples |
Since we want to make nice data-MC comparison stack plots we now have to do a bit of addition to get the correct plots:
// add together the histograms for proper stack-plots
for (unsigned iter = 1; iter < data.size(); ++ iter){
  data[0]->Add (data[iter]);
}
for (unsigned iter = 1; iter < mc.size(); ++ iter){
  mc[mc.size()-1-iter]->Add (mc[mc.size()-iter]);
} |
And now just plot the histograms in the right order in a decent style:
// now plot all mc histograms
for (unsigned iter = 0; iter < mc.size(); ++ iter)
{
  mc[iter]->SetLineColor (2 + iter);
  mc[iter]->SetFillColor (2 + iter);
  mc[iter]->SetFillStyle (3001);
  mc[iter]->SetTitle (0);
  mc[iter]->Draw (iter ? "SAME HIST" : "HIST");
}
// and if we have data, plot that too
if (!data.empty())
{
  data[0]->SetMarkerStyle(20);
  data[0]->SetTitle (0);
  data[0]->Draw (mc.empty() ? "ERR" : "SAME ERR");
}
} |
Now we have a complete plot macro we can try. First re-run your event-loop to make sure that all the meta-data is properly propagated:
root -l -b -q 'ATestRun.cxx ("submitDir")' |
Now run your new macro, this should create a new canvas with your stack plot:
root -l 'AStackPlot.cxx ("submitDir")' |
While the macro we just created does produce a plot, the plot is not very meaningful: the samples are not properly normalized to each other. To take care of that we use the luminosity meta-data we stored in the sample handler. To do that properly we need to keep track of the luminosity in data, so create a variable to store it (before the sample loop):
// the total luminosity of all data histograms
double data_lumi = 0; |
Inside the sample loop, normalize the MC to a fixed luminosity, and keep track of the data luminosity:
// evaluate the luminosity
// note the explicit use of meta-value default values
if (isMC)
  hist->Scale (1./sample->getMetaDouble ("lumi", 1));
else
  data_lumi += sample->getMetaDouble ("lumi", 0); |
Now after the sample loop, but before you add the histograms together, apply the data luminosity as a normalization factor to the MC:
// do something meaningful if we don't have any data luminosity
if (data_lumi <= 0){
  data_lumi = 100;
}
// apply data luminosity to MC (if available)
for (unsigned iter = 0; iter < mc.size(); ++ iter){
  mc[iter]->Scale (data_lumi);
} |
Now your macro should create a new canvas with your stack plot:
root -l 'AStackPlot.cxx ("submitDir")' |
Note that when I last tried this, the luminosities didn't seem to add up; I'm not sure why, as they used to in the past. On the other hand, they are close, and this is more of a proof of principle anyway. For a real analysis we would have to select only good electrons and apply a GRL selection to use only valid events.
While this plot is now kind of correct from a physics side, it is still somewhat meaningless, in that we can't tell which distribution is which. To fix that, we want to add a key to our plot. As a first step, we store the label in the title of the histogram. So in the sample loop add the following code:
// set the title of the histogram to the label we want to use.
// note that we use the sample name if no label was set
hist->SetTitle (sample->getMetaString ("label", sample->name()).c_str()); |
Now add some code after the sample loop:
// turn off the statistics box
gStyle->SetOptStat (0);
gStyle->SetOptFit (0);
// create a key instead of the statistics box
TLegend *legend = new TLegend(0.6666667,0.809322,0.9885057,0.9957627,NULL,"brNDC"); |
Then before we plot the histograms, add them to the key:
// now add all mc histograms to the key
for (unsigned iter = 0; iter < mc.size(); ++ iter)
{
  std::string label = mc[iter]->GetTitle();
  // drawing code
  legend->AddEntry (mc[iter], label.c_str());
}
// and if we have data, add that to the key too
if (!data.empty())
{
  // drawing code
  legend->AddEntry (data[0], "data");
} |
And then in the end, also draw the key:
// don't forget to add the key
legend->Draw (); |
Please note that this example now uses all the meta-data from the original samples, which got passed through to the derived samples.
MultiDraw is a package that was developed to allow using TTree::Draw-like formulas inside EventLoop. There are two ways of doing so: by using MultiDraw algorithms directly, or by incorporating MultiDraw formulas inside your own algorithms. The former is envisioned to be useful for quickly adding a couple more plots to your analysis. The latter is useful if you want to make it easier to configure your algorithms.
The big advantage of using MultiDraw formulas is that they are typically very easy to write in simple cases. The disadvantage is that more complicated cases can result in very long formulas or may be impossible to express at all. One of the underlying assumptions is that what you have in your n-tuple is very close to your final physics quantities, i.e. all the cleaning and correction tools have already been applied. You can still use MultiDraw in other situations, but its value will be reduced.
As the simplest exercise, let us simply add the electron pt spectrum to the list of our final histograms. We are already doing this in our algorithm, but now we will be doing it in MultiDraw, allowing for a direct comparison.
All we need to do is add another algorithm to our steering macro ATestRun.cxx
:
job.algsAdd (new MD::AlgHist (new TH1F("MDHist", "MDHist", 100, 0, 1.e5), "el_pt")); |
Now to see that it is actually doing the right thing, let's overlay our old histogram with the new one:
SH::SampleHandler sh_hist; sh_hist.load (submitDir + "/hist"); TH1 *hist = sh_hist.get ("user.krumnack.pat_tutorial_data.v1")->readHist ("AHist"); hist->Draw (); TH1 *hist2 = sh_hist.get ("user.krumnack.pat_tutorial_data.v1")->readHist ("MDHist"); hist2->SetLineColor (2); hist2->SetLineStyle (2); hist2->Draw ("SAME"); |
Now let's run again: inside the Run directory, call the command below. If you already have a submitDir from an earlier run, you have to either rename or remove it first:
root -l 'ATestRun.cxx ("submitDir")' |
It should now show you our original histogram in black and the new histogram overlaid using a red dotted line. Both histograms should turn out the same.
MultiDraw can also fill multi-dimensional histograms and apply weights to each event; just pass additional formulas into the constructor, as sketched below.
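A sketch of what that could look like. This assumes that AlgHist takes one formula per histogram axis and treats one extra trailing formula as a per-event weight; the histogram names, the binning and the mcevt_weight branch are only examples, so check the MultiDraw documentation for the exact convention:
// 2D histogram: electron pt vs eta (one formula per axis; names and binning are examples)
job.algsAdd (new MD::AlgHist (new TH2F ("MDHist2D", "MDHist2D", 100, 0, 1.e5, 50, -2.5, 2.5),
                              "el_pt", "el_eta"));

// 1D histogram filled with a per-event weight
// (assumed convention: the extra formula is the weight; "mcevt_weight[0][0]" is just an example branch)
job.algsAdd (new MD::AlgHist (new TH1F ("MDHistW", "MDHistW", 100, 0, 1.e5),
                              "el_pt", "mcevt_weight[0][0]")); |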
A slightly more complicated, but quite powerful use of MultiDraw algorithms is to create a cut flow. A cut flow is a histogram that shows the number of events (or objects) remaining after each cut and is commonly used when comparing analyses.
All we need to do is add another algorithm to our steering macro ATestRun.cxx
. For that we first create a histogram describing the cut flow and then use it to create a cut flow algorithm:
TH1 *cflow = new TH1D ("cflow", "cutflow", 10, 0, 10); cflow->GetXaxis()->SetBinLabel (2, "el_n>=2"); cflow->GetXaxis()->SetBinLabel (3, "el_pt[0]>=25e3"); cflow->GetXaxis()->SetBinLabel (4, "el_pt[1]>=20e3"); job.algsAdd (new MD::AlgCFlow (cflow)); |
To check that it is actually doing the right thing, change the histogram drawing section of the steering macro to display the cut flow:
SH::SampleHandler sh_hist; sh_hist.load (submitDir + "/hist"); TH1 *hist = sh_hist.get ("user.krumnack.pat_tutorial_data.v1")->readHist ("cflow"); hist->Draw (); |
Now let's run again: inside the Run directory, call the command below. If you already have a submitDir from an earlier run, you have to either rename or remove it first:
root -l 'ATestRun.cxx ("submitDir")' |
It should now show a cut flow, which is a histogram that gets smaller with each consecutive cut. Note that the first and the second bin are the same, because our sample is filtered to contain only di-electron events.
MultiDraw can also create more complicated cut flows, for details see the MultiDraw documentation.
Another form of using MultiDraw is to use its formulas in your own algorithm. This has the advantage of allowing your algorithms to be configured to use arbitrary input variables and simple combinations of input variables without having to hard-code these variables in your algorithm. On the downside it is slower than the hard coded variables, so you should only use this feature for variables that you actually intend to configure this way.
First we need to tell RootCore that we will be depending on the MultiDraw package. Open the Makefile.RootCore
in the cmt
directory of our MyAnalysis
package, and add MultiDraw
to PACKAGE_DEP
. So the PACKAGE_DEP
line should look something like this (if you have other packages listed here, don't remove them):
PACKAGE_DEP = EventLoop D3PDReader MultiDraw |
Next add the include for MultiDraw to the header file of our analysis algorithm (MyFirstD3PDAnalysis.h
):
#include <MultiDraw/Global.h> |
And then add a cut string to our algorithm class. This is a configuration option, so we should not use the //!
and have to make it public:
public: std::string cutString; |
We will need access to the formula service and a single formula, so let's add that to the class (note that the //!
is important):
MD::FormulaSvc *formSvc; //! const MD::Formula *cutFormula; //! |
We will add another histogram so that we can compare the result with and
without the cut, so let's add that to the class (note that the //!
is important):
TH1 *cutHist; //! |
Now in the source file (MyFirstD3PDAnalysis.cxx
) add some more includes we need:
#include <MultiDraw/Formula.h>
#include <MultiDraw/FormulaSvc.h> |
Now in the setupJob
function we need to specify that we use MultiDraw formulas in our job:
MD::useFormulas (job); |
And in the initialize
function we need to register our formula with the formula service:
formSvc = MD::formulas (wk()); cutFormula = formSvc->addForm (cutString); |
And we also need to add a histogram for the result in initialize
as well:
cutHist = new TH1F("cutHist", "cutHist", 100, 0, 1.e5); wk()->addOutput (cutHist); |
Within the execute
function we check if the event passes our cut, and if it does we fill cutHist
just like we did for AHist
except we allow our cutString
to encode a weight as well:
double weight = cutFormula->value (0); if (weight > 0) { for( Int_t i = 0; i < event->el.n(); ++i ) cutHist->Fill(event->el[i].pt(), weight); }; |
Now we need to compile our package again. Since we changed our dependencies, you have to run find_packages.sh
again. Remember that this has to be done in the ROOTTutorial
directory:
$ROOTCOREDIR/scripts/find_packages.sh
$ROOTCOREDIR/scripts/compile.sh |
Now we need to update our steering macro (ATestRun.cxx
) to set the cut string in our algorithm. Add this code where it says to put the algorithm configuration:
alg->cutString = "jet_n>=3"; |
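Since execute uses the formula's value as a weight, the cut string can also carry a per-event weight. A purely illustrative example (the weight branch name is an assumption, not something defined in this tutorial):
// select 3-jet events and weight each of them with an (example) per-event weight branch
alg->cutString = "(jet_n>=3)*mcevt_weight[0][0]"; |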
In order to see the effect of the cut, let's change the lines which display the final histogram to:
SH::SampleHandler sh_hist; sh_hist.load (submitDir + "/hist"); TH1 *hist = sh_hist.get ("user.krumnack.pat_tutorial_data.v1")->readHist ("AHist"); hist->Draw (); TH1 *hist2 = sh_hist.get ("user.krumnack.pat_tutorial_data.v1")->readHist ("cutHist"); hist2->SetLineColor (2); hist2->Draw ("SAME"); |
Now let's run again: inside the Run directory, call the command below. If you already have a submitDir from an earlier run, you have to either rename or remove it first:
root -l 'ATestRun.cxx ("submitDir")' |
You should now see two histograms of the electron pt spectrum, with the black one showing all events and the red one only the events passing our cut. The latter should always be smaller than the former, though hopefully not by much.
While our general recommendation is that you work through the tutorial code one step at a time, there is also a version of the code in SVN, which you can check out if you want to investigate particular problems.
Simply check out the control package:
svn co svn+ssh://svn.cern.ch/reps/atlasusr/krumnack/tutorial/trunk/control |
Make sure you have ROOT set up, then source the following script to set up a couple more control variables:
source control/setup.sh lxplus BaseVersion |
(The second argument selects which version of the tutorial code to use; see control/versions for a list.)
Now build it:
control/build |
And run it as many times as you want:
control/run |