Data points, sets and files
An instance of DataPoint represents one kinematic point. It can correspond to a single experimental measurement, and most Gepard functions that calculate observables or various form factors accept a DataPoint as an argument. So, when you want to tabulate or plot something (such as a CFF or a cross-section) as a continuous function of some variable, you will have to create a “continuum” of DataPoints.
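For example, a scan over the azimuthal angle at fixed kinematics can be built as a simple list of points (the kinematic values below are purely illustrative):

>>> import gepard as g
>>> import numpy as np
>>> phis = np.linspace(0., 2*np.pi, 5)   # five azimuthal angles from 0 to 2*pi
>>> pts = [g.DataPoint(xB=0.1, t=-0.2, Q2=4.0, phi=phi) for phi in phis]
>>> len(pts)
5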
Several DataPoint objects can be collected in a special DataSet object. This is not necessary, but it is convenient, and all experimental datasets that ship with Gepard are DataSet objects with a unique ID number.
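If you want to group your own points in the same way, a DataSet can presumably be built from such a list. The sketch below assumes that the DataSet constructor accepts the points via a datapoints argument; check the gepard.data module for the exact signature:

>>> pts = [g.DataPoint(xB=0.1, t=-0.2, Q2=4.0, phi=phi) for phi in (0.5, 1.0, 1.5)]
>>> myset = g.DataSet(datapoints=pts)   # assumed constructor signature
>>> len(myset)   # a DataSet behaves like a list of points
3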
DataPoint Attributes
An instance of DataPoint can have the following attributes. Some are used by the code, some are just for convenience.
| Attribute | Description |
|---|---|
| xB | Bjorken \(x_B\) |
| t | Mandelstam t, i.e., momentum transfer to the target squared |
| Q2 | \(Q^2\) |
| phi | azimuthal angle \(\phi\) |
| FTn | harmonic of the azimuthal angle \(\phi\). Values 0, 1, … correspond to zeroth, first, … cosine harmonics, while -1, -2, … correspond to first, second, … sine harmonics |
| observable | measured observable |
| val | value measured |
| err | total uncertainty of val |
| errstat | statistical uncertainty of val |
| errsyst | systematic uncertainty of val |
| frame | coordinate frame used (Trento or BMK) |
| id | ID number of the dataset to which the point belongs |
| reference | reference to where the data was published |
Some details and other attributes are given below.
Coordinate frames
Take note that Gepard internally works in the BMK frame, while most of the experimental data is published in the Trento frame. There are convenience methods to_conventions and from_conventions that transform a data point in place from the Trento to the BMK frame and back, respectively.
>>> import gepard as g
>>> pt = g.DataPoint(xB=0.1, phi=1, frame='Trento')
>>> pt.to_conventions()
>>> pt.phi # = (pi - phi)
2.141592653589793
>>> pt.frame
'Trento'
>>> pt.from_conventions()
>>> pt.phi
1.0
Note that the frame attribute keeps the original value even after transformation to the BMK frame.
All datasets that are made available in Gepard as g.dset
are already
transformed into the BMK frame.
Working with datasets
Datasets that ship with Gepard are all collected in the Python dictionary g.dset, where keys are the ID numbers of the datasets. A detailed description of the available datasets will be given here. There is a utility function g.list_data that gives a short tabular description of the sets with given IDs:
>>> g.list_data(list(range(47, 54)))
[ 47] ZEUS 6 XGAMMA 0812.2517 Table 1
[ 48] ZEUS 6 XGAMMA 0812.2517 Table 2
[ 49] ZEUS 8 XGAMMA 0812.2517 Table 3
[ 50] HALLA 288 XLUw 0607029 DFT analysis with MC error propagation by KK
[ 51] HALLA 96 XUUw 0607029 DFT analysis with MC error propagation by KK
[ 52] HERMES 36 TSA 1004.0177 Table 4
[ 53] HERMES 36 BTSA 1004.0177 Table 4
The first column above gives the ID number of each dataset.
Another utility function, g.describe_data, gives a short tabular description of a given DataSet:
>>> g.describe_data(g.dset[52])
npt x obs collab FTn id ref.
----------------------------------------------
12 x TSA HERMES -1.0 52 arXiv:1004.0177v1
12 x TSA HERMES -2.0 52 arXiv:1004.0177v1
12 x TSA HERMES -3.0 52 arXiv:1004.0177v1
----------------------------------------------
TOTAL = 36
>>> pt = g.dset[52][0] # First point of this dataset
>>> pt.xB, pt.t, pt.Q2, pt.val, pt.err
(0.079, -0.031, 1.982, -0.008, 0.05239274758971894)
A useful utility function is g.select, which selects a subset of points from a dataset according to some criteria:
>>> len(g.dset[143])
90
>>> twist_resist = g.select(g.dset[143], criteria=['Q2 > 5', 't < 0.2'])
>>> len(twist_resist)
40
There are some plotting routines available for inspection of data and
comparison with theory. First, there is a universal jbod
(“just a bunch
of data”) routine that plots any dataset, alone or with theory prediction lines.
For example, ZEUS cross section data (id=49) from the table above:
>>> import gepard as g
>>> import gepard.plots
>>> from gepard.fits import th_KM15, th_KM10b
>>> gepard.plots.jbod(points=g.dset[49], lines=[th_KM15, th_KM10b]).show()
Also, for some datasets there are dedicated plots, like
>>> import gepard.plots
>>> from gepard.fits import th_KM15, th_KM10b
>>> gepard.plots.H1ZEUS(lines=[th_KM15, th_KM10b]).show()
Finally, there is a convenient method df which transforms any DataSet into a corresponding pandas DataFrame, making it easy to perform various dataset analyses. For example, to find the mean values of the kinematic variables of a dataset:
>>> g.dset[52].df()[['Q2', 'xB', 't']].mean()
Q2 2.780750
xB 0.107083
t -0.143667
dtype: float64
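Since df() returns an ordinary pandas DataFrame, all of the usual pandas machinery is available. A small sketch, using only the columns demonstrated above:

>>> df = g.dset[52].df()
>>> coverage = df[['Q2', 'xB', 't']].agg(['min', 'max', 'mean'])   # kinematic coverage of the set
>>> high_Q2 = df[df['Q2'] > 2.5]   # keep only points with Q2 above 2.5 GeV^2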
Dataset files
Each dataset that ships with Gepard is stored in a single ASCII file. Users can add their own data files by placing them in a separate directory, say mydatafiles, and adding an empty file named __init__.py to that directory, which turns the data files into a proper Python package. (Read about Python's importlib_resources library for details.)
This directory has to be in the Python module search path. The current working directory (where you start Python; it can be displayed in IPython or Jupyter by issuing %pwd) is usually in the search path, and the user can explicitly add some other directory to the path like this:
>>> import sys
>>> sys.path.append('<path to mydatafiles>')
Then the data files are available to be imported, and there is a utility function g.data.loaddata that parses all files in the directory and creates the corresponding DataSet objects:
>>> import mydatafiles
>>> mydset = g.data.loaddata(mydatafiles)
Now mydset is analogous to g.dset, which means that the datasets are available as mydset[id].
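If you want to treat your own datasets and the shipped ones uniformly, the two dictionaries can simply be merged with plain Python (taking care that your ID numbers do not clash with the shipped ones):

>>> alldata = {**g.dset, **mydset}   # shipped and user datasets in one dictionary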
Data files are meant to be readable by both humans and computers and obey the following rules.

Syntactic rules:

- Empty lines and lines starting with a hash sign (#) are ignored by the parser and can be used for comments meant for human readers.
- The first part of the file is a preamble, consisting of lines with the structure key = value, where key should be a regular computer variable identifier, i.e., it should consist only of letters and numbers (no spaces) and should not start with a number. These keys become attributes of the DataPoint object and can be accessed using the dot (.) operator, like this:

>>> pt = g.dset[52][0]  # first point of this dataset
>>> pt.collaboration
'HERMES'

- The second and final part of the file is just a grid of numbers.
Semantic rules:

- There is a world-unique ID number of the file, given by the key id, and the name of the person who created the file, given by the key editor. If there are further edits by other people, keys such as editor2 can be used.
- Other information describing the origin of the data can be given using keys such as collaboration, year, reference, etc. These keys can be used for automatic plot generation.
- The coordinate frame used is given by the key frame, equal to either Trento or BMK.
- The scattering process is described using keys in1particle, in2particle, …, out1particle, …, set equal to the usual symbols for HEP particle names (e for electron, p for proton, …).
- Kinematical and polarization properties of a particle in1 are then given using keys in1energy, in1polarizationvector (L for longitudinal, T for transversal, U or unspecified for unpolarized), etc.
- The key in1polarization describes the amount of polarization and is set to 1 if the polarization is 100% or if measurements are already renormalized to take into account smaller polarization (which they mostly are).
- The sign of in1polarization describes how the asymmetries are formed, by giving the polarization of the first term in the asymmetry numerator (and similarly for in1charge).
- For convenience, the type of the process is summarized by the keys process (equal to ep2epgamma for leptoproduction of a photon, gammastarp2gammap for DVCS, gammastarp2rho0p for DVMP of rho0, etc.) and exptype (equal to fixed target or collider).
- Finally, the columns of the number grid are described in the preamble using keys such as x1name, giving the column variable, and x1value = columnK, where K is the corresponding grid column number, counting from 1. Here x1, x2, …, are used for kinematics (x-axes, such as \(x_{\rm B}\), \(Q^2\), \(t\), \(\phi\)), while y1 is for the measured observable.
- Units should be specified by keys such as in1unit, and in particular for angles it should be stated whether their unit is deg or rad.
- Uncertainties are given by the keys y1error (total error), y1errorstatistic, y1errorsystematic, y1errorsystematicplus, y1errorsystematicminus, and y1errornormalization.
- For Fourier harmonics, special column names are used: FTn for the harmonic of the azimuthal angle \(\phi\) between the lepton and reaction planes, and varFTn for the harmonic of the azimuthal angle \(\phi_S\) of the target polarization vector. Then, in the grid, positive numbers 0, 1, 2, … denote \(\cos 0\phi\), \(\cos\phi\), \(\cos 2\phi\), … harmonics, while negative numbers -1, -2, … denote \(\sin\phi\), \(\sin 2\phi\), … harmonics.
- If some kinematical value is common to the whole dataset, then instead of x1value = columnK we can specify, e.g., x1value = 0.36.
- Names of observables are standardized and given in a table of observables.
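To make these rules concrete, here is a hypothetical sketch of a complete data file. The dataset, all numbers, and some key spellings (in particular the unit keys, the y1name/y1value pair, and the way y1error points to a column, all written by analogy with the x1name/x1value pattern) are invented for illustration; consult the data files that ship with Gepard for authoritative examples.

# Hypothetical example data file (everything below is invented for illustration)
id = 1001
editor = A. Person

# origin of the data
collaboration = EXAMPLE
year = 2023
reference = arXiv:2301.00000
exptype = fixed target
frame = Trento

# process and beam
process = ep2epgamma
in1particle = e
in1energy = 27.6
in1unit = GeV
in1polarizationvector = L
in1polarization = +1
in2particle = p

# grid columns
x1name = xB
x1value = column1
x2name = Q2
x2value = column2
x3name = t
x3value = column3
x4name = phi
x4unit = deg
x4value = column4
y1name = BSA
y1value = column5
y1error = column6

# xB    Q2     t     phi    BSA    err
0.10   2.5   -0.10   15.0   0.21   0.04
0.10   2.5   -0.10   45.0   0.25   0.05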