Profiling with jemalloc

The jemalloc allocator is a drop-in replacement for malloc. It is used primarily as a shared library preloaded via the dynamic linker, although it can also be linked statically. jemalloc offers substantial improvements over malloc in its concurrency support and its avoidance of memory fragmentation. When Mantid is launched via workbench, or when a Python interpreter is launched to call Mantid algorithms from scripts (in the recommended manner), the jemalloc library is preloaded automatically.

When used as the allocator, jemalloc provides extensive allocation-control and profiling hooks. These hooks may be accessed either via the environment variable MALLOC_CONF (and several other environment variables) or programmatically via the nonstandard mallctl entry point. Note that a very large number of control hooks are available! This note elaborates only a few specific profiling techniques, primarily through concrete examples.
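
As a trivial illustration of the environment-variable route (a generic sketch, not specific to Mantid or its launch scripts), the stats_print option asks jemalloc to print an allocator-statistics summary when the process exits:

# Print a jemalloc statistics summary at process exit; <your program> is a placeholder
# for any executable that has jemalloc loaded (e.g. via LD_PRELOAD, as shown later in this note).
MALLOC_CONF="stats_print:true" <your program>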

Keep in mind that although Mantid does use jemalloc, its usage is only partially optimized, and there is always room for improvement. This means that whenever profiling reveals a memory issue, the first solution to consider is further tuning of the jemalloc control settings applied while Mantid loads.

All of the following examples assume a standard developer environment on a Linux machine. Most other operating systems should also work, and the setup commands will be similar, but this note won’t cover those details.

Build jemalloc including profiling

The standard jemalloc library does not include the profiling hooks; to use them, it is necessary to download the source code and build a special version of the library.

jemalloc’s GitHub repository (now archived) is at: https://github.com/jemalloc/jemalloc. Be sure to download the source code corresponding to the jemalloc release that Mantid is currently using (at the time of writing, 5.2.0).

jemalloc uses the autoconf system. To build from the root of the jemalloc source tree:

./autogen.sh
./configure --enable-prof --enable-shared --prefix=<your installation prefix>
make
make install
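
To confirm that the profiling hooks were actually compiled in, one option (assuming the jemalloc-config helper script was installed alongside the library) is to inspect the recorded configure flags:

# The output should include --enable-prof among the configure arguments.
<your installation prefix>/bin/jemalloc-config --config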

Use jemalloc to profile an existing Mantid build

Any existing build of Mantid can be profiled simply by changing LD_PRELOAD to point at the new jemalloc build instead of the standard one. This example also shows how to enable profiling using the MALLOC_CONF environment variable. Here’s how to generate a series of memory-allocation profiles, with a new profile written every time memory allocation increases by roughly 1 GiB:

# Notes: enable and activate profiling, and generate a profile for every ~1 GiB of additional memory allocated.
#   The profile-dump interval is expressed somewhat cryptically as the base-2 logarithm of the
#   allocation interval in bytes: 2^30 bytes == 1 GiB, hence lg_prof_interval:30.
export MALLOC_CONF="prof:true,prof_active:true,lg_prof_interval:30,prof_prefix:jeprof.out"

LOCAL_PRELOAD=<your installation prefix>/lib/libjemalloc.so.2

# To get your `PYTHONPATH`, you can enter `import sys; ':'.join(sys.path)` in the `IPython` window in workbench,
#   although for a working installation you probably don't need to worry about it.
#   First try removing the `PYTHONPATH` lines below and assume that your Pixi environment has set the path correctly!
LOCAL_PYTHONPATH=<Mantid's PYTHONPATH>

LD_PRELOAD=${LOCAL_PRELOAD} \
    PYTHONPATH=${LOCAL_PYTHONPATH} \
    ${CONDA_PREFIX}/bin/python -m workbench
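
With lg_prof_interval set as above, jemalloc writes a new heap profile after roughly every additional gigabyte of allocation. Assuming the default dump-file naming convention (<prof_prefix>.<pid>.<seq>.i<iseq>.heap), the profiles appear in the current working directory:

# List the generated heap profiles, newest first.
ls -lt jeprof.out.*.heap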

Optimize the Mantid build for profiling using jemalloc

There are several flags that can be set to optimize a Mantid build for jemalloc profiling. The main objective is to ensure that all of the line-number and stack-frame information is retained in the build:

# From the Mantid repository root:
mkdir build; cd build
cmake .. -GNinja -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DCMAKE_C_FLAGS_RELWITHDEBINFO="-O2 -g -DNDEBUG -fno-omit-frame-pointer" \
  -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-O2 -g -DNDEBUG -fno-omit-frame-pointer"
cmake --build . --target AllTests
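
As a quick sanity check, you can verify that the debug sections made it into the built binaries (the library path below is only a placeholder; adjust it to your build layout):

# Expect .debug_info and .debug_line sections to be listed.
readelf -S <path to a built Mantid shared library>.so | grep debug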

Viewing and analyzing memory-allocation profiles

The next examples assume that you have already generated a profile, either using MALLOC_CONF as discussed above, or by modifying the code and running a script (as described in the final section).

Any of the profile-display examples shown next can also be used to display the difference between any two profiles. See the jemalloc man page at https://jemalloc.net/jemalloc.3.html, or its wiki at https://github.com/jemalloc/jemalloc/wiki.
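
For example, the following sketch (assuming jeprof's pprof-style --base option) displays only the allocations that appeared between an earlier and a later dump:

# Show what was allocated between the two dumps.
<your installation prefix>/bin/jeprof --svg --base=/path/to/earlier.heap ${CONDA_PREFIX}/bin/python /path/to/later.heap > heap-diff.svg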

Memory-allocation details: create an overview graph

jeprof can generate graphs, in SVG and several other formats, showing all of the allocation details:

export JEPROF='<your installation prefix>/bin/jeprof'

# The basic form of the command:
${JEPROF} --svg /path/to/program /path/to/dump.heap > heap.svg

# A Mantid-specific example, assuming that you've been running a script in Python:
${JEPROF} --svg ${CONDA_PREFIX}/bin/python /path/to/dump.heap > heap.svg

Now you can view the SVG in your browser (Firefox is useful for this, but Google Chrome will also work):

firefox heap.svg &

Memory-allocation details, by source-line number

We can get a by-line-number allocation profile of any function (including even inlined functions) from any suitable profile; it’s not necessary to generate a separate profile for each function you’re interested in:

export JEPROF=<your installation prefix>/bin/jeprof

# Assume that the profile you want to examine has already been generated, and is contained in the file <your profile>.dat
${JEPROF} --lines --inuse_space --list="GridDetector::createLayer" ${CONDA_PREFIX}/bin/python <your profile>.dat > <your profile>.heap.txt

The text output will look something like this:

ROUTINE ====================== Mantid::Geometry::GridDetector::createLayer in /mnt/R5_data1/data1/workspaces/ORNL-work/mantid/Framework/Geometry/src/Instrument/GridDetector.cpp
 258.6  322.1 Total MB (flat / cumulative)
     .      .  432:     return m_gridBase->getRelativePosAtXYZ(x, y, z) * V3D(scalex, scaley, scalez);
     .      .  433:   } else
     .      .  434:     return V3D(m_xstart + m_xstep * x, m_ystart + m_ystep * y, m_zstart + m_zstep * z);
     .      .  435: }
     .      .  436:
---
     .      .  437: void GridDetector::createLayer(const std::string &name, CompAssembly *parent, int iz, int &minDetID, int &maxDetID) {
     .      .  438:   // Loop and create all detectors in this layer.
    ~SNIP~
     .      .  469:       // Create the detector from the given id & shape and with xColumn as the
     .      .  470:       // parent.
 258.6  314.1  471:       auto *detector = new GridDetectorPixel(oss.str(), id, m_shape, xColumn, this, size_t(ix), size_t(iy), size_t(iz));
     .      .  472:
    ~SNIP~

This specific example shows the huge allocation associated with a grid detector expansion during an EventWorkspace instrument initialization. (Note that each detector pixel [out of possibly millions] is initialized using its own name, as a string!)

More advanced techniques: in-depth profiling of algorithm steps

The most direct way to profile the allocation details associated with step-wise execution of any algorithm is to use the jemalloc mallctl entry point. We still need to use MALLOC_CONF to enable profiling:

export MALLOC_CONF="prof:true"

For this example, we modify a C++ source file and add the following section (near the top):

// =============================================================
#include <chrono>
#include <cstring>
#include <string>
#include <thread>
#include <jemalloc/jemalloc.h>
namespace {
  // Trigger an immediate heap-profile dump to `path` via the "prof.dump" mallctl.
  // Returns true on success (profiling must have been enabled via MALLOC_CONF).
  bool mem_stats(const std::string& path) {
    const char *path_ = path.c_str();
    int err = mallctl("prof.dump",
                      nullptr, nullptr,
                      const_cast<char**>(&path_),
                      sizeof(const char*));
    return err == 0;
  }

  // Dump a profile to `path` every `period`; intended to be run on its own
  // (detached) thread if periodic snapshots are wanted.
  void periodic_mem_stats(std::chrono::seconds period, const std::string& path) {
    for (;;) {
      mem_stats(path);
      std::this_thread::sleep_for(period);
    }
  }
}
const std::string STATS_ROOT = "<your profiles dump directory path>";
// =============================================================

Now, whenever we want to generate a new profile after the execution of a section of code, we simply call the function:

mem_stats(STATS_ROOT + "/loadEvents-exit.dat");

After rebuilding Mantid, setting MALLOC_CONF as above, and executing an appropriate Python script, this call generates a profile and writes it to the on-disk location "${STATS_ROOT}/loadEvents-exit.dat".
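
The resulting dump can then be inspected with any of the jeprof commands from the earlier sections, for example (a sketch reusing the JEPROF variable defined above):

# Print a plain-text summary of the largest in-use allocations at the dump point.
${JEPROF} --text --inuse_space ${CONDA_PREFIX}/bin/python "<your profiles dump directory path>/loadEvents-exit.dat" | head -n 25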