/* * Copyright © 2009 CNRS * Copyright © 2009-2011 INRIA. All rights reserved. * Copyright © 2009-2011 Université Bordeaux 1 * Copyright © 2009-2011 Cisco Systems, Inc. All rights reserved. * See COPYING in top-level directory. */ /*! \mainpage Hardware Locality

Portable abstraction of hierarchical architectures for high-performance computing


\htmlonly
\endhtmlonly \section Introduction hwloc provides command line tools and a C API to obtain the hierarchical map of key computing elements, such as: NUMA memory nodes, shared caches, processor sockets, processor cores, and processing units (logical processors or "threads"). hwloc also gathers various attributes such as cache and memory information, and is portable across a variety of different operating systems and platforms. hwloc primarily aims at helping high-performance computing (HPC) applications, but is also applicable to any project seeking to exploit code and/or data locality on modern computing platforms. *** Note that the hwloc project represents the merger of the libtopology project from INRIA and the Portable Linux Processor Affinity (PLPA) sub-project from Open MPI. Both of these prior projects are now deprecated. The first hwloc release was essentially a "re-branding" of the libtopology code base, but with both a few genuinely new features and a few PLPA-like features added in. Prior releases of hwloc included documentation about switching from PLPA to hwloc; this documentation has been dropped on the assumption that everyone who was using PLPA has already switched to hwloc. hwloc supports the following operating systems: hwloc only reports the number of processors on unsupported operating systems; no topology information is available. For development and debugging purposes, hwloc also offers the ability to work on "fake" topologies: hwloc can display the topology in a human-readable format, either in graphical mode (X11), or by exporting in one of several different formats, including: plain text, PDF, PNG, and FIG (see \ref cli_examples below). Note that some of the export formats require additional support libraries. hwloc offers a programming interface for manipulating topologies and objects. It also brings a powerful CPU bitmap API that is used to describe topology objects location on physical/logical processors. See the \ref interface below. 
It may also be used to bind applications onto certain cores or memory nodes. Several utility programs are also provided to ease command-line manipulation of topology objects, binding of processes, and so on. \htmlonly
\endhtmlonly
\section installation Installation

hwloc (http://www.open-mpi.org/projects/hwloc/) is available under the BSD license. It is hosted as a sub-project of the overall Open MPI project (http://www.open-mpi.org/). Note that hwloc does not require any functionality from Open MPI -- it is a wholly separate (and much smaller!) project and code base. It just happens to be hosted as part of the overall Open MPI project.

Nightly development snapshots are available on the web site. Additionally, the code can be directly checked out of Subversion:

\code
shell$ svn checkout http://svn.open-mpi.org/svn/hwloc/trunk hwloc-trunk
shell$ cd hwloc-trunk
shell$ ./autogen.sh
\endcode

Note that GNU Autoconf >=2.63, Automake >=1.10 and Libtool >=2.2.6 are required when building from a Subversion checkout.

Installation by itself is the fairly common GNU-based process:

\code
shell$ ./configure --prefix=...
shell$ make
shell$ make install
\endcode

The hwloc command-line tool "lstopo" produces human-readable topology maps, as mentioned above. It can also export maps to the "fig" file format. Support for PDF, PostScript, and PNG exporting is provided if the "Cairo" development package can be found when hwloc is configured and built. Similarly, lstopo's XML support requires the libxml2 development package. \htmlonly
\endhtmlonly \section cli_examples CLI Examples On a 4-socket 2-core machine with hyperthreading, the \c lstopo tool may show the following graphical output: \image html dudley.png \image latex dudley.png "" width=9cm Here's the equivalent output in textual form: \verbatim Machine (16GB) Socket L#0 + L3 L#0 (4096KB) L2 L#0 (1024KB) + L1 L#0 (16KB) + Core L#0 PU L#0 (P#0) PU L#1 (P#8) L2 L#1 (1024KB) + L1 L#1 (16KB) + Core L#1 PU L#2 (P#4) PU L#3 (P#12) Socket L#1 + L3 L#1 (4096KB) L2 L#2 (1024KB) + L1 L#2 (16KB) + Core L#2 PU L#4 (P#1) PU L#5 (P#9) L2 L#3 (1024KB) + L1 L#3 (16KB) + Core L#3 PU L#6 (P#5) PU L#7 (P#13) Socket L#2 + L3 L#2 (4096KB) L2 L#4 (1024KB) + L1 L#4 (16KB) + Core L#4 PU L#8 (P#2) PU L#9 (P#10) L2 L#5 (1024KB) + L1 L#5 (16KB) + Core L#5 PU L#10 (P#6) PU L#11 (P#14) Socket L#3 + L3 L#3 (4096KB) L2 L#6 (1024KB) + L1 L#6 (16KB) + Core L#6 PU L#12 (P#3) PU L#13 (P#11) L2 L#7 (1024KB) + L1 L#7 (16KB) + Core L#7 PU L#14 (P#7) PU L#15 (P#15) \endverbatim Finally, here's the equivalent output in XML. 
Long lines were artificially broken for document clarity (in the real output, each XML tag is on a single line), and only socket #0 is shown for brevity: \verbatim \endverbatim On a 4-socket 2-core Opteron NUMA machine, the \c lstopo tool may show the following graphical output: \image html hagrid.png \image latex hagrid.png width=\textwidth Here's the equivalent output in textual form: \verbatim Machine (32GB) NUMANode L#0 (P#0 8190MB) + Socket L#0 L2 L#0 (1024KB) + L1 L#0 (64KB) + Core L#0 + PU L#0 (P#0) L2 L#1 (1024KB) + L1 L#1 (64KB) + Core L#1 + PU L#1 (P#1) NUMANode L#1 (P#1 8192MB) + Socket L#1 L2 L#2 (1024KB) + L1 L#2 (64KB) + Core L#2 + PU L#2 (P#2) L2 L#3 (1024KB) + L1 L#3 (64KB) + Core L#3 + PU L#3 (P#3) NUMANode L#2 (P#2 8192MB) + Socket L#2 L2 L#4 (1024KB) + L1 L#4 (64KB) + Core L#4 + PU L#4 (P#4) L2 L#5 (1024KB) + L1 L#5 (64KB) + Core L#5 + PU L#5 (P#5) NUMANode L#3 (P#3 8192MB) + Socket L#3 L2 L#6 (1024KB) + L1 L#6 (64KB) + Core L#6 + PU L#6 (P#6) L2 L#7 (1024KB) + L1 L#7 (64KB) + Core L#7 + PU L#7 (P#7) \endverbatim And here's the equivalent output in XML. Similar to above, line breaks were added and only PU #0 is shown for brevity: \verbatim \endverbatim On a 2-socket quad-core Xeon (pre-Nehalem, with 2 dual-core dies into each socket): \image html emmett.png \image latex emmett.png "" width=7cm Here's the same output in textual form: \verbatim Machine (16GB) Socket L#0 L2 L#0 (4096KB) L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0) L1 L#1 (32KB) + Core L#1 + PU L#1 (P#4) L2 L#1 (4096KB) L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2) L1 L#3 (32KB) + Core L#3 + PU L#3 (P#6) Socket L#1 L2 L#2 (4096KB) L1 L#4 (32KB) + Core L#4 + PU L#4 (P#1) L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5) L2 L#3 (4096KB) L1 L#6 (32KB) + Core L#6 + PU L#6 (P#3) L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7) \endverbatim And the same output in XML (line breaks added, only PU #0 shown): \verbatim \endverbatim \htmlonly
\endhtmlonly
\section interface Programming Interface

The basic interface is available in hwloc.h. It essentially offers low-level routines for advanced programmers who want to manually manipulate objects and follow links between them. Documentation for everything in hwloc.h is provided later in this document. Developers should also look at hwloc/helper.h (also covered in this document), which provides good higher-level topology traversal examples.

To precisely define the vocabulary used by hwloc, a \ref termsanddefs section is available and should probably be read first.

Each hwloc object contains a cpuset describing the list of processing units that it contains. These bitmaps may be used for \ref hwlocality_cpubinding and \ref hwlocality_membinding. hwloc offers an extensive bitmap manipulation interface in hwloc/bitmap.h.

Moreover, hwloc also comes with additional helpers for interoperability with several commonly used environments. See the \ref interoperability section for details.

The complete API documentation is available in a full set of HTML pages, man pages, and self-contained PDF files (formatted for both US letter and A4 formats) in the source tarball in doc/doxygen-doc/.

NOTE: If you are building the documentation from a Subversion checkout, you will need to have Doxygen and pdflatex installed -- the documentation will be built during the normal "make" process. The documentation is installed during "make install" to $prefix/share/doc/hwloc/ and your system's default man page tree (under $prefix, of course).

\subsection portability Portability

As shown in \ref cli_examples, hwloc can obtain information on a wide variety of hardware topologies. However, some platforms and/or operating system versions will only report a subset of this information. 
For example, on a PPC64-based system with 32 cores (each with 2 hardware threads) running a default 2.6.18-based kernel from RHEL 5.4, hwloc is only able to glean information about NUMA nodes and processor units (PUs). No information about caches, sockets, or cores is available.

Similarly, operating systems have varying support for CPU and memory binding: some provide interfaces for all kinds of CPU and memory binding, others provide interfaces for only a limited number of kinds, and some do not provide any binding interface at all. hwloc's binding functions would then simply return the ENOSYS error (Function not implemented), meaning that the underlying operating system does not provide any interface for them. \ref hwlocality_cpubinding and \ref hwlocality_membinding provide more information on which hwloc binding functions should be preferred because interfaces for them are usually available on the supported operating systems.

Here's the graphical output from lstopo on this platform when Simultaneous Multi-Threading (SMT) is enabled:

\image html ppc64-with-smt.png
\image latex ppc64-with-smt.pdf "" width=\textwidth

And here's the graphical output from lstopo on this platform when SMT is disabled:

\image html ppc64-without-smt.png
\image latex ppc64-without-smt.pdf "" width=\textwidth

Notice that hwloc only sees half the PUs when SMT is disabled. PU #15, for example, seems to change location from NUMA node #0 to #1. In reality, no PUs "moved" -- they were simply re-numbered when hwloc only saw half as many. Hence, PU #15 in the SMT-disabled picture probably corresponds to PU #30 in the SMT-enabled picture.

This same "PUs have disappeared" effect can be seen on other platforms -- even platforms / OSs that provide much more information than the above PPC64 system. This is an unfortunate side-effect of how operating systems report information to hwloc. 
Note that after upgrading the Linux kernel on the same PPC64 system mentioned above to 2.6.34, hwloc is able to discover all the topology information. The following picture shows the entire topology layout when SMT is enabled:

\image html ppc64-full-with-smt.png
\image latex ppc64-full-with-smt.pdf "" width=\textwidth

Developers using the hwloc API or XML output for portable applications should therefore be extremely careful to not make any assumptions about the structure of data that is returned. For example, per the above reported PPC topology, it is not safe to assume that PUs will always be descendants of cores.

Additionally, future hardware may insert new topology elements that are not available in this version of hwloc. Long-lived applications that are meant to span multiple different hardware platforms should also be careful about making structure assumptions. For example, there may someday be an element "lower" than a PU, or perhaps a new element may exist between a core and a PU.

\subsection interface_example API Example

The following small C example (named ``hwloc-hello.c'') prints the topology of the machine and binds the process to the first logical processor of the second core of the machine.

\include hwloc-hello.c

hwloc installs a \c pkg-config file, so the \c pkg-config executable can be used to obtain relevant compiler and linker flags. 
For example, it can be used as follows to compile applications that utilize the hwloc library (assuming GNU Make):

\verbatim
CFLAGS += $(shell pkg-config --cflags hwloc)
LDLIBS += $(shell pkg-config --libs hwloc)

hwloc-hello: hwloc-hello.c
	cc hwloc-hello.c $(CFLAGS) -o hwloc-hello $(LDLIBS)
\endverbatim

On a machine with 4GB of RAM and 2 processor sockets -- each socket of which has two processing cores -- the output from running \c hwloc-hello could be something like the following:

\verbatim
shell$ ./hwloc-hello
*** Objects at level 0
Index 0: Machine(3938MB)
*** Objects at level 1
Index 0: Socket#0
Index 1: Socket#1
*** Objects at level 2
Index 0: Core#0
Index 1: Core#1
Index 2: Core#3
Index 3: Core#2
*** Objects at level 3
Index 0: PU#0
Index 1: PU#1
Index 2: PU#2
Index 3: PU#3
*** Printing overall tree
Machine(3938MB)
  Socket#0
    Core#0
      PU#0
    Core#1
      PU#1
  Socket#1
    Core#3
      PU#2
    Core#2
      PU#3
*** 2 socket(s)
shell$
\endverbatim
\htmlonly
\endhtmlonly
\section bugs Questions and Bugs

Questions should be sent to the devel mailing list (http://www.open-mpi.org/community/lists/hwloc.php). Bugs should be reported in the tracker (https://svn.open-mpi.org/trac/hwloc/).

If hwloc discovers an incorrect topology for your machine, the very first thing you should check is that you have the most recent updates installed for your operating system. Indeed, most of hwloc's topology discovery relies on hardware information retrieved through the operating system (e.g., via the /sys virtual filesystem of the Linux kernel). If upgrading your OS or Linux kernel does not solve your problem, you may also want to ensure that you are running the most recent version of the BIOS for your machine.

If those things fail, contact us on the mailing list for additional help. Please attach the output of lstopo from a build configured with --enable-debug (pass the option to ./configure and rebuild completely), so that the debugging output is included. \htmlonly
\endhtmlonly
\section history History / Credits

hwloc is the evolution and merger of the libtopology (http://runtime.bordeaux.inria.fr/libtopology/) project and the Portable Linux Processor Affinity (PLPA) (http://www.open-mpi.org/projects/plpa/) project. Because of functional and ideological overlap, these two code bases and ideas were merged and released under the name "hwloc" as an Open MPI sub-project.

libtopology was initially developed by the INRIA Runtime Team-Project (http://runtime.bordeaux.inria.fr/), headed by Raymond Namyst (http://dept-info.labri.fr/~namyst/). PLPA was initially developed by the Open MPI development team as a sub-project. Both are now deprecated in favor of hwloc, which is distributed as an Open MPI sub-project. \htmlonly
\endhtmlonly \section further_read Further Reading The documentation chapters include
  • \ref termsanddefs
  • \ref tools
  • \ref envvar
  • \ref cpu_mem_bind
  • \ref interoperability
  • \ref threadsafety
  • \ref embed
  • \ref faq
Make sure to have had a look at those too! \htmlonly
\endhtmlonly \page termsanddefs Terms and Definitions
Object
A kind of system component that is of interest, such as a Core, a Cache, a Memory node, etc. The different types detected by hwloc are detailed in the ::hwloc_obj_type_t enumeration. They are topologically sorted by CPU set into a tree.
CPU set
The set of logical processors (or processing units) logically included in an object (if it makes sense). They are always expressed using the physical (OS) numbers of the logical processors, as announced by the OS. They are implemented as the ::hwloc_bitmap_t opaque structure. hwloc CPU sets are just masks; they do \em not have any relation with an operating system's actual binding notion, such as Linux cpusets.
Node set
The set of NUMA memory nodes logically included in an object (if it makes sense). They are always expressed using physical node numbers (as announced by the OS). They are implemented with the ::hwloc_bitmap_t opaque structure.
Bitmap
A possibly-infinite set of bits used for describing sets of objects such as CPUs (CPU sets) or memory nodes (Node sets). They are implemented with the ::hwloc_bitmap_t opaque structure.
Parent object
The object logically containing the current object, for example because its CPU set includes the CPU set of the current object.
Ancestor object
The parent object, or its own parent object, and so on.
Children object(s)
The object (or objects) contained in the current object because their CPU set is included in the CPU set of the current object.
Arity
The number of children of an object.
Sibling objects
Objects which have the same parent. They usually have the same type (and hence are cousins, as well), but they may not if the topology is asymmetric.
Sibling rank
Index that uniquely identifies an object among those which have the same parent. It is always in the range [0, parent_arity).
Cousin objects
Objects of the same type (and depth) as the current object, even if they do not have the same parent.
Level
Set of objects of the same type and depth. All these objects are cousins.
Depth
Nesting level in the object tree, starting from 0 for the topmost (root) object.
OS or physical index
The index that the operating system (OS) uses to identify the object. This may be completely arbitrary, non-unique, non-contiguous, not representative of logical proximity, and may depend on the BIOS configuration. That is why hwloc almost never uses them, only in the default lstopo output (P#x) and cpuset masks.
Logical index
Index to uniquely identify objects of the same type and depth, automatically computed by hwloc according to the topology. It expresses logical proximity in a generic way, i.e. objects which have adjacent logical indexes are adjacent in the topology. That is why hwloc almost always uses it in its API, since it expresses logical proximity. They can be shown (as L#x) by lstopo thanks to the -l option. This index is always linear and in the range [0, num_objs_same_type_same_level-1]. Think of it as ``cousin rank.'' The ordering is based on topology first, and then on OS CPU numbers, so it is stable across everything except firmware CPU renumbering. "Logical index" should not be confused with "Logical processor". A "Logical processor" (which in hwloc we rather call "processing unit" to avoid the confusion) has both a physical index (as chosen arbitrarily by BIOS/OS) and a logical index (as computed according to logical proximity by hwloc).
Logical processor
Processing unit
The smallest processing element that can be represented by a hwloc object. It may be a single-core processor, a core of a multicore processor, or a single thread in an SMT processor. "Logical processor" should not be confused with "Logical index of a processor". "Logical processor" is only one of the names which can be found in various documentations to designate a processing unit.
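The parent/child definitions above rest on CPU set inclusion. The following is only a minimal sketch of that idea in plain C: it does not use hwloc's bitmap API, the \c toy_ names are hypothetical, and a single 64-bit word stands in for what hwloc implements as an opaque, possibly-infinite bitmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a CPU set: bit i is set when the PU whose
 * physical (OS) index is i belongs to the set. This is NOT hwloc_bitmap_t;
 * a 64-bit word is only large enough for this illustration. */
typedef uint64_t toy_cpuset_t;

/* Build the set that contains a single PU, given its physical index. */
toy_cpuset_t toy_cpuset_of_pu(unsigned os_index)
{
    assert(os_index < 64);
    return (toy_cpuset_t)1 << os_index;
}

/* Return true when every PU of `inner` is also in `outer`, i.e. when an
 * object with CPU set `inner` could be a descendant of an object with
 * CPU set `outer`. */
bool toy_cpuset_isincluded(toy_cpuset_t inner, toy_cpuset_t outer)
{
    return (inner & ~outer) == 0;
}

/* Count the PUs in the set by clearing the lowest set bit repeatedly. */
unsigned toy_cpuset_weight(toy_cpuset_t set)
{
    unsigned count = 0;
    while (set != 0) {
        set &= set - 1;
        count++;
    }
    return count;
}
```

With the 4-socket example from \ref cli_examples, Socket L#0 holds P#0, P#4, P#8 and P#12 (mask 0x1111) and Core L#0 holds P#0 and P#8 (mask 0x0101), so `toy_cpuset_isincluded(0x0101, 0x1111)` holds while a set containing only P#1 (mask 0x0002) is not included in that socket.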
The following diagram can help to understand the vocabulary of the relationships by showing the example of a machine with two dual core sockets (with no hardware threads); thus, a topology with 4 levels. Each box with rounded corners corresponds to one hwloc_obj_t, containing the values of the different integer fields (depth, logical_index, etc.), and arrows show which other hwloc_obj_t the pointers point to (first_child, parent, etc.). The L2 cache of the last core is intentionally missing to show how asymmetric topologies are handled.

\image html diagram.png
\image latex diagram.eps width=\textwidth

It should be noted that for PU objects, the logical index -- as computed linearly by hwloc -- is not the same as the OS index. See also \ref faq_asymmetric for more details.

\page tools Command-Line Tools

hwloc comes with an extensive C programming interface and several command line utilities. Each of them is fully documented in its own manual page; the following is a summary of the available command line tools.

\section cli_lstopo lstopo

lstopo (also known as hwloc-info and hwloc-ls) displays the hierarchical topology map of the current system. The output may be graphical or textual, and can also be exported to numerous file formats such as PDF, PNG, XML, and others. This command can also display the processes currently bound to a part of the machine (via the --ps option).

Note that lstopo can read XML files and/or alternate chroot filesystems and display topological maps representing those systems (e.g., use lstopo to output an XML file on one system, and then use lstopo to read in that XML file and display it on a different system).

\section cli_hwloc_bind hwloc-bind

hwloc-bind binds processes to specific hardware objects through a flexible syntax. A simple example is binding an executable to specific cores (or sockets or bitmaps or ...). The hwloc-bind(1) man page provides much more detail on what is possible. 
hwloc-bind can also be used to retrieve the current process' binding.

\section cli_hwloc_calc hwloc-calc

hwloc-calc is generally used to create bitmap strings to pass to hwloc-bind. Although hwloc-bind accepts many forms of object specification (i.e., bitmap strings are one of many forms that hwloc-bind understands), bitmap strings can be useful, compact representations in shell scripts, for example.

hwloc-calc generates bitmap strings from given hardware objects with the ability to aggregate them, intersect them, and more. hwloc-calc generally uses the same syntax as hwloc-bind, but multiple instances may be composed to generate complex combinations.

Note that hwloc-calc can also generate lists of logical processors or NUMA nodes that are convenient to pass to some external tools such as taskset or numactl.

\section cli_hwloc_distrib hwloc-distrib

hwloc-distrib generates a set of bitmap strings that are uniformly distributed across the machine for the given number of processes. These strings may be used with hwloc-bind to run processes to maximize their memory bandwidth by properly distributing them across the machine.

\section cli_hwloc_ps hwloc-ps

hwloc-ps is a tool to display the bindings of processes that are currently running on the local machine. By default, hwloc-ps only lists processes that are bound; unbound processes (and Linux kernel threads) are not displayed.

\section cli_hwloc_gather hwloc-gather-topology

hwloc-gather-topology is a Linux-specific tool that saves the relevant topology files of the current machine into a tarball (and the corresponding lstopo output). These files may be used later (possibly offline) for simulating or debugging a machine without actually running on it.

\page envvar Environment Variables

The behavior of the hwloc library and tools may be tuned thanks to the following environment variables.
HWLOC_XMLFILE=/path/to/file.xml
enforces the discovery from the given XML file as if hwloc_topology_set_xml() had been called. This file may have been generated earlier with lstopo file.xml. For convenience, this backend provides empty binding hooks which just return success. To have hwloc still actually call OS-specific hooks, HWLOC_THISSYSTEM should be set to 1 in the environment too, to assert that the loaded file really describes the underlying system.
HWLOC_FSROOT=/path/to/linux/filesystem-root/
switches to reading the topology from the specified Linux filesystem root instead of the main file-system root, as if hwloc_topology_set_fsroot() had been called. Not using the main file-system root causes hwloc_topology_is_thissystem() to return 0. For convenience, this backend provides empty binding hooks which just return success. To have hwloc still actually call OS-specific hooks, HWLOC_THISSYSTEM should be set to 1 in the environment too, to assert that the loaded filesystem root really describes the underlying system.
HWLOC_THISSYSTEM=1
enforces the return value of hwloc_topology_is_thissystem(). It means that it makes hwloc assume that the selected backend provides the topology for the system on which we are running, even if it is not the OS-specific backend but the XML backend for instance. This means making the binding functions actually call the OS-specific system calls and really do binding, while the XML backend would otherwise provide empty hooks just returning success. This can be used for efficiency reasons to first detect the topology once, save it to an XML file, and quickly reload it later through the XML backend, but still having binding functions actually do bind.
HWLOC_IGNORE_DISTANCES=0
disables object grouping based on distances when set to 1. By default, hwloc uses distance matrices between objects (either read from the OS or given by the user) to find groups of close objects. These groups are described by adding intermediate Group objects in the topology. Setting this environment variable to 1 will disable this grouping.
HWLOC_<type>_DISTANCES=index,...:X*Y
HWLOC_<type>_DISTANCES=index,...:X*Y*Z
HWLOC_<type>_DISTANCES=index,...:distance,...
sets a distance matrix for objects of the given type and physical indexes. The type should be given as its case-sensitive stringified value (e.g. NUMANode, Socket, Cache, Core, PU). The variable value starts with a comma-separated list of the objects' physical indexes. Distances are then specified after a colon.
  • If X*Y is given, X groups of Y close objects are specified.
  • If X*Y*Z is given, X groups of Y groups of Z close objects are specified.
  • Otherwise, the comma-separated list of distances should be given. If N objects are considered, the i*N+j-th value gives the distance from the i-th object to the j-th object.
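The indexing rule for the explicit distance list can be sketched in plain C as follows. This is not hwloc code: the \c toy_ function names are hypothetical, and the assumption that the X*Y form groups objects in the order in which their indexes are listed is mine, not something stated above.

```c
#include <assert.h>
#include <stddef.h>

/* Look up the distance from the i-th listed object to the j-th listed
 * object in a flat list of n*n values, using the i*N+j rule. */
float toy_distance(const float *values, size_t n, size_t i, size_t j)
{
    assert(values != NULL);
    assert(i < n && j < n);
    return values[i * n + j];
}

/* For the X*Y form: return 1 when the i-th and j-th listed objects fall
 * in the same group of y objects, 0 otherwise.
 * ASSUMPTION: groups are formed from consecutive entries of the index
 * list (entries 0..y-1 form the first group, and so on). */
int toy_same_group(size_t i, size_t j, size_t y)
{
    assert(y > 0);
    return (i / y) == (j / y);
}
```

For two objects and the made-up list `10,20,20,10`, `toy_distance(values, 2, 0, 1)` reads the second value (20) and `toy_distance(values, 2, 1, 1)` reads the fourth (10). Under the grouping assumption, a `2*2` specification over four objects puts the first two in one group and the last two in another.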
\page cpu_mem_bind CPU and Memory Binding Overview

Some operating systems do not systematically provide separate functions for CPU and memory binding. This means that CPU binding functions may have effects on the memory binding policy. Likewise, changing the memory binding policy may change the CPU binding of the current thread. This is often not a problem for applications, so by default hwloc will make use of these functions when they provide better binding support.

If the application does not want the CPU binding to change when changing the memory policy, it needs to use the HWLOC_MEMBIND_NOCPUBIND flag to prevent hwloc from using OS functions which would change the CPU binding. Additionally, HWLOC_CPUBIND_NOMEMBIND can be passed to CPU binding functions to prevent hwloc from using OS functions that would change the memory binding policy. Of course, using these flags will reduce hwloc's overall support for binding, so their use is discouraged.

One can avoid using these flags but still closely control both memory and CPU binding by allocating memory, touching each page in the allocated memory, and then changing the CPU binding. The already-really-allocated memory will then be "locked" to physical memory and will not be migrated. Thus, even if the memory binding policy gets changed by the CPU binding order, the already-allocated memory will not change with it. When binding and allocating further memory, the CPU binding should be performed again in case the memory binding altered the previously-selected CPU binding.

Not all operating systems support the notion of a "current" memory binding policy for the current process, but such operating systems often still provide a way to allocate data on a given node set. Conversely, some operating systems support the notion of a "current" memory binding policy and do not permit allocating data on a specific node set without first changing the current policy and then allocating the data. 
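The "touch each page" step described above can be sketched in plain C as follows. This is only an illustration under stated assumptions: the function name \c toy_touch_pages is hypothetical, the page size is obtained through POSIX sysconf(), and the hwloc binding calls that would come before and after this step are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Write one byte in every page-sized stride of a freshly allocated buffer
 * so that the OS has to back each page with physical memory.
 * The buffer contents are overwritten, so this is only meant for memory
 * that has not been filled with useful data yet.
 * Returns the number of strided writes, i.e. ceil(len / page size). */
size_t toy_touch_pages(volatile char *buf, size_t len)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);

    size_t stride = (size_t)page_size;
    size_t writes = 0;

    for (size_t offset = 0; offset < len; offset += stride) {
        buf[offset] = 0;
        writes++;
    }

    /* The buffer may not start on a page boundary, in which case its last
     * page could be skipped by the strided loop; touch the last byte too. */
    if (len > 0)
        buf[len - 1] = 0;

    return writes;
}
```

In the sequence described above, an application would allocate the buffer, call a function like this one on it, and only then change the CPU binding.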
To provide the most powerful coverage of these facilities, hwloc provides:
  • functions that set/get the current memory binding policies (if supported): hwloc_set/get_membind_*() and hwloc_set/get_proc_membind()
  • functions that allocate memory bound to a specific node set without changing the current memory binding policy (if supported): hwloc_alloc_membind() and hwloc_alloc_membind_nodeset().
  • helpers which, if needed, change the current memory binding policy of the process in order to obtain memory binding: hwloc_alloc_membind_policy() and hwloc_alloc_membind_policy_nodeset()
An application can thus use the first two sets of functions if it wants to manage the global process binding policy and directed allocation separately, or use the third set of functions if it does not care about the process memory binding policy.

See \ref hwlocality_cpubinding and \ref hwlocality_membinding for hwloc's API functions regarding CPU and memory binding, respectively.

\page interoperability Interoperability With Other Software

Although hwloc offers its own portable interface, it still may have to interoperate with specific or non-portable libraries that manipulate similar kinds of objects. hwloc therefore offers several specific "helpers" to assist converting between those specific interfaces and hwloc.

Some external libraries may be specific to a particular OS; others may not always be available. The hwloc core therefore generally does not explicitly depend on these types of libraries. However, when a custom application uses or otherwise depends on such a library, it may optionally include the corresponding hwloc helper to extend the hwloc interface with dedicated helpers.
Linux specific features
hwloc/linux.h offers Linux-specific helpers that utilize some non-portable features of the Linux system, such as binding threads through their thread ID ("tid") or parsing kernel CPU mask files.
Linux libnuma
hwloc/linux-libnuma.h provides conversion helpers between hwloc CPU sets and libnuma-specific types, such as nodemasks and bitmasks. It helps you use libnuma memory-binding functions with hwloc CPU sets.
Glibc
hwloc/glibc-sched.h offers conversion routines between Glibc and hwloc CPU sets in order to use hwloc with functions such as sched_setaffinity().
OpenFabrics Verbs
hwloc/openfabrics-verbs.h helps interoperability with the OpenFabrics Verbs interface. For example, it can return a list of processors near an OpenFabrics device.
Myrinet Express
hwloc/myriexpress.h offers interoperability with the Myrinet Express interface. It can return the list of processors near a Myrinet board managed by the MX driver.
NVIDIA CUDA
hwloc/cuda.h and hwloc/cudart.h enable interoperability with NVIDIA CUDA Driver and Runtime interfaces. For instance, it may return the list of processors near NVIDIA GPUs.
Taskset command-line tool
The taskset command-line tool is widely used for binding processes. It manipulates CPU set strings in a format that is slightly different from hwloc's: unlike hwloc, it does not divide the string into fixed-size subsets separated by commas. To ease interoperability, hwloc offers routines to convert hwloc CPU sets from/to the taskset-specific string format. Most hwloc command-line tools also support the --taskset option to manipulate taskset-specific strings.
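The difference between the two string layouts can be sketched in plain C as follows. This does not call hwloc's conversion routines: the \c toy_ functions are hypothetical, and the subset size (32 bits) and ordering (most significant subset first) are assumptions made for the illustration; the hwloc/bitmap.h documentation is the reference for the exact format.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* taskset-like layout: the whole mask as a single hexadecimal number. */
int toy_format_single(char *out, size_t size, uint64_t mask)
{
    assert(out != NULL && size > 0);
    return snprintf(out, size, "0x%" PRIx64, mask);
}

/* Comma-separated layout: the mask is cut into fixed-size subsets.
 * ASSUMPTION: subsets are 32 bits wide and printed most significant
 * first; the upper subset is omitted when it is empty. */
int toy_format_subsets(char *out, size_t size, uint64_t mask)
{
    assert(out != NULL && size > 0);

    uint32_t high = (uint32_t)(mask >> 32);
    uint32_t low = (uint32_t)(mask & 0xffffffffu);

    if (high != 0)
        return snprintf(out, size, "0x%08" PRIx32 ",0x%08" PRIx32, high, low);
    return snprintf(out, size, "0x%08" PRIx32, low);
}
```

For a set containing PUs 0 and 32, the first function writes `0x100000001` while the second writes `0x00000001,0x00000001`.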
\page threadsafety Thread Safety Like most libraries that mainly fill data structures, hwloc is not thread safe but rather reentrant: all state is held in a \ref hwloc_topology_t instance without mutex protection. That means, for example, that two threads can safely operate on and modify two different \ref hwloc_topology_t instances, but they should not simultaneously invoke functions that modify the same instance. Similarly, one thread should not modify a \ref hwloc_topology_t instance while another thread is reading or traversing it. However, two threads can safely read or traverse the same \ref hwloc_topology_t instance concurrently. When running in multiprocessor environments, be aware that proper thread synchronization and/or memory coherency protection is needed to pass hwloc data (such as \ref hwloc_topology_t pointers) from one processor to another (e.g., a mutex, semaphore, or a memory barrier). Note that this is not a hwloc-specific requirement, but it is worth mentioning. For reference, \ref hwloc_topology_t modification operations include (but may not be limited to):
Creation and destruction
hwloc_topology_init(), hwloc_topology_load(), hwloc_topology_destroy() (see \ref hwlocality_creation) imply major modifications of the structure, including freeing some objects. No other thread can access the topology or any of its objects at the same time. Also, references to objects inside the topology are not valid anymore after these functions return.
Runtime topology modifications
hwloc_topology_insert_misc_object_by_* (see \ref hwlocality_tinker) may modify the topology significantly by adding objects inside the tree, changing the topology depth, etc. hwloc_topology_restrict modifies the topology even more dramatically by removing some objects. Although references to former objects may still be valid after insertion or restriction, it is strongly advised to not rely on any such guarantee and always re-consult the topology to reacquire new instances of objects.
Locating topologies
hwloc_topology_ignore*, hwloc_topology_set* (see \ref hwlocality_configuration) do not modify the topology directly, but they do modify internal structures describing the behavior of the next invocation of hwloc_topology_load(). Hence, none of these functions should be used concurrently. Note that these functions do not modify the current topology until it is actually reloaded; it is possible to use them while other threads are only reading the current topology.
\page embed Embedding hwloc in Other Software

It can be desirable to include hwloc in a larger software package (be sure to check out the LICENSE file) so that users don't have to separately download and install it before installing your software. This can be advantageous to ensure that your software uses a known-tested/good version of hwloc, or for use on systems that do not have hwloc pre-installed.

When used in "embedded" mode, hwloc will:

- not install any header files
- not build any documentation files
- not build or install any executables or tests
- not build libhwloc.* -- instead, it will build libhwloc_embedded.*

There are two ways to put hwloc into "embedded" mode. The first is directly from the configure command line:

\verbatim
shell$ ./configure --enable-embedded-mode ...
\endverbatim

The second requires that your software project uses the GNU Autoconf / Automake / Libtool tool chain to build your software. If you do this, you can directly integrate hwloc's m4 configure macro into your configure script. You can then invoke hwloc's configuration tests and build setup by calling an m4 macro (see below).

\section embedding_m4 Using hwloc's M4 Embedding Capabilities

Every project is different, and there are many different ways of integrating hwloc into yours. What follows is one example of how to do it.

If your project uses recent versions of Autoconf, Automake, and Libtool to build, you can use hwloc's embedded m4 capabilities. We have tested the embedded m4 with projects that use Autoconf 2.65, Automake 1.11.1, and Libtool 2.2.6b. Slightly earlier versions may also work but are untested. Autoconf versions prior to 2.65 are almost certain to not work.

You can either copy all the config/hwloc*m4 files from the hwloc source tree to the directory where your project's m4 files reside, or you can tell aclocal to find more m4 files in the embedded hwloc's "config" subdirectory (e.g., add "-Ipath/to/embedded/hwloc/config" to your Makefile.am's ACLOCAL_AMFLAGS).
The following macros can then be used from your configure script (only HWLOC_SETUP_CORE must be invoked if using the m4 macros):

- HWLOC_SETUP_CORE(config-dir-prefix, action-upon-success, action-upon-failure, print_banner_or_not): Invoke the hwloc configuration tests and set up the hwloc tree to build. The first argument is the prefix to use for AC_OUTPUT files -- it's where the hwloc tree is located relative to $top_srcdir. Hence, if your embedded hwloc is located in the source tree at contrib/hwloc, you should pass [contrib/hwloc] as the first argument. If HWLOC_SETUP_CORE and the rest of configure complete successfully, then "make" traversals of the hwloc tree with standard Automake targets (all, clean, install, etc.) should behave as expected. For example, it is safe to list the hwloc directory in the SUBDIRS of a higher-level Makefile.am. The last argument, if not empty, will cause the macro to display an announcement banner that it is starting the hwloc core configuration tests.

  HWLOC_SETUP_CORE will set the following environment variables and AC_SUBST them: HWLOC_EMBEDDED_CFLAGS, HWLOC_EMBEDDED_CPPFLAGS, and HWLOC_EMBEDDED_LIBS. These flags are filled with the values discovered in the hwloc-specific m4 tests, and can be used in your build process as relevant. The _CFLAGS, _CPPFLAGS, and _LIBS variables are necessary to build libhwloc (or libhwloc_embedded) itself.

  HWLOC_SETUP_CORE also sets the HWLOC_EMBEDDED_LDADD environment variable (and AC_SUBSTs it) to contain the location of the libhwloc_embedded.la convenience Libtool archive. It can be used in your build process to link an application or other library against the embedded hwloc library.

  NOTE: If the HWLOC_SET_SYMBOL_PREFIX macro is used, it must be invoked before HWLOC_SETUP_CORE.

- HWLOC_BUILD_STANDALONE: HWLOC_SETUP_CORE defaults to building hwloc in an "embedded" mode (described above). If HWLOC_BUILD_STANDALONE is invoked *before* HWLOC_SETUP_CORE, the embedded definitions will not apply (e.g., libhwloc.la will be built, not libhwloc_embedded.la).

- HWLOC_SET_SYMBOL_PREFIX(foo_): Tells hwloc to prefix all of its types and public symbols with "foo_", meaning that function hwloc_init() becomes foo_hwloc_init(). Enum values are prefixed with an upper-case translation of the prefix supplied; HWLOC_OBJ_SYSTEM becomes FOO_HWLOC_OBJ_SYSTEM. This is recommended behavior if you are including hwloc in middleware -- it is possible that your software will be combined with other software that links to another copy of hwloc. If both uses of hwloc utilize different symbol prefixes, there will be no type/symbol clashes, and everything will compile, link, and run successfully. If you both embed hwloc without changing the symbol prefix and also link against an external hwloc, you may get multiple symbol definitions when linking your final library or application.

- HWLOC_SETUP_DOCS, HWLOC_SETUP_UTILS, HWLOC_SETUP_TESTS: These three macros only apply when hwloc is built in "standalone" mode (i.e., they should NOT be invoked unless HWLOC_BUILD_STANDALONE has already been invoked).

- HWLOC_DO_AM_CONDITIONALS: If you embed hwloc in a larger project and build it conditionally with Automake (e.g., if HWLOC_SETUP_CORE is invoked conditionally), you must unconditionally invoke HWLOC_DO_AM_CONDITIONALS to avoid warnings from Automake (for the cases where hwloc is not selected to be built). This macro is necessary because hwloc uses some AM_CONDITIONALs to build itself, and AM_CONDITIONALs cannot be defined conditionally. Note that it is safe (but unnecessary) to call HWLOC_DO_AM_CONDITIONALS even if HWLOC_SETUP_CORE is invoked unconditionally. If you are not using Automake to build hwloc, this macro is unnecessary (and will actually cause errors because it invokes AM_* macros that will be undefined).
NOTE: When using the HWLOC_SETUP_CORE m4 macro, it may be necessary to explicitly invoke AC_CANONICAL_TARGET (which requires config.sub and config.guess) and/or AC_USE_SYSTEM_EXTENSIONS macros early in the configure script (e.g., after AC_INIT but before AM_INIT_AUTOMAKE). See the Autoconf documentation for further information.

Also note that hwloc's top-level configure.ac script uses exactly the macros described above to build hwloc in a standalone mode (by default). You may want to examine it for one example of how these macros are used.

\section embedding_example Example Embedding hwloc

Here's an example of integrating with a larger project named sandbox that already uses Autoconf, Automake, and Libtool to build itself:

\verbatim
# First, cd into the sandbox project source tree
shell$ cd sandbox
shell$ cp -r /somewhere/else/hwloc- my-embedded-hwloc
shell$ edit Makefile.am
  1. Add "-Imy-embedded-hwloc/config" to ACLOCAL_AMFLAGS
  2. Add "my-embedded-hwloc" to SUBDIRS
  3. Add "$(HWLOC_EMBEDDED_LDADD)" and "$(HWLOC_EMBEDDED_LIBS)" to
     sandbox's executable's LDADD line.  The former is the name of the
     Libtool convenience library that hwloc will generate.  The latter
     is any dependent support libraries that may be needed by
     $(HWLOC_EMBEDDED_LDADD).
  4. Add "$(HWLOC_EMBEDDED_CFLAGS)" to AM_CFLAGS
  5. Add "$(HWLOC_EMBEDDED_CPPFLAGS)" to AM_CPPFLAGS
shell$ edit configure.ac
  1. Add "HWLOC_SET_SYMBOL_PREFIX(sandbox_hwloc_)" line
  2. Add "HWLOC_SETUP_CORE([my-embedded-hwloc], [happy=yes], [happy=no])" line
  3. Add error checking for happy=no case
shell$ edit sandbox.c
  1. Add #include <hwloc.h>
  2. Add calls to sandbox_hwloc_init() and other hwloc API functions
\endverbatim

Now you can bootstrap, configure, build, and run the sandbox as normal -- all calls to "sandbox_hwloc_*" will use the embedded hwloc rather than any system-provided copy of hwloc.
\page faq Frequently Asked Questions

\section faq_xml I do not want hwloc to rediscover my enormous machine topology every time I rerun a process

Although the topology discovery is not expensive on common machines, its overhead may become significant when multiple processes repeat the discovery on large machines (for instance when starting one process per core in a parallel application). The machine topology usually does not vary much, except if some cores are stopped/restarted or if administrator restrictions are modified. Thus rediscovering the whole topology again and again is often wasted work.

For this purpose, hwloc offers XML import/export features. It lets you save the discovered topology to a file (for instance with the lstopo program) and reload it later by setting the HWLOC_XMLFILE environment variable. Loading an XML topology is usually much faster than querying multiple files or calling multiple functions of the operating system. It is also possible to manipulate such XML files with the C programming interface, and the import/export may also be directed to a memory buffer (that may for instance be transmitted between applications through a socket).

\section faq_onedim hwloc only has a one-dimensional view of the architecture, it ignores distances

hwloc places all objects in a tree. Each level is a one-dimensional view of a set of similar objects. All children of the same object (siblings) are assumed to be equally interconnected (same distance between any of them), while the distance between children of different objects (cousins) is supposed to be larger.

Modern machines exhibit complex hardware interconnects, so this tree may miss some information about the actual physical distances between objects. The hwloc topology may therefore be annotated with distance information that may be used to build a more realistic (multi-dimensional) representation of each level.
For instance, the root object may contain a distance matrix that represents the latencies between any pair of NUMA nodes if the BIOS and/or operating system reports them.

\section faq_smt How may I ignore symmetric multithreading, hyper-threading, ... ?

hwloc creates one PU (processing unit) object per hardware thread. If your machine supports symmetric multithreading, for instance Hyper-Threading, each Core object may contain multiple PU objects.

\verbatim
$ lstopo -
...
  Core L#1
    PU L#2 (P#1)
    PU L#3 (P#3)
\endverbatim

If you need to ignore symmetric multithreading, you should likely manipulate hwloc Core objects directly:

\verbatim
/* get the number of cores */
unsigned nbcores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
...
/* get the third core below the first socket */
hwloc_obj_t socket, core;
socket = hwloc_get_obj_by_type(topology, HWLOC_OBJ_SOCKET, 0);
core = hwloc_get_obj_inside_cpuset_by_type(topology, socket->cpuset,
                                           HWLOC_OBJ_CORE, 2);
\endverbatim

Whenever you want to bind a process or thread to a core, make sure you singlify its cpuset first, so that the task is actually bound to a single thread within this core (to avoid useless migrations).

\verbatim
/* bind on the second core */
hwloc_obj_t core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, 1);
hwloc_cpuset_t set = hwloc_bitmap_dup(core->cpuset);
hwloc_bitmap_singlify(set);
hwloc_set_cpubind(topology, set, 0);
hwloc_bitmap_free(set);
\endverbatim

With the hwloc-calc or hwloc-bind command-line tools, you may specify that you only want a single thread within each core by asking for their first PU object:

\verbatim
$ hwloc-calc core:4-7
0x0000ff00
$ hwloc-calc core:4-7.pu:0
0x00005500
\endverbatim

When binding a process on the command line, you may either specify the exact thread that you want to use, or ask hwloc-bind to singlify the cpuset before binding:

\verbatim
$ hwloc-bind core:3.pu:0 -- echo "hello from first thread on core #3"
hello from first thread on core #3
...
$ hwloc-bind core:3 --single -- echo "hello from a single thread on core #3"
hello from a single thread on core #3
\endverbatim

\section faq_asymmetric What happens if my topology is asymmetric?

hwloc supports asymmetric topologies even if most platforms are usually symmetric. For example, there may be different types of processors in a single machine, each with different numbers of cores, symmetric multithreading, or levels of caches.

To understand how hwloc manages such cases, one should first remember the meaning of levels and cousin objects. All objects of the same type are gathered as horizontal levels with a given depth. They are also connected through the cousin pointers of the hwloc_obj structure. Some types, such as Caches or Groups, are usually annotated with a depth or level attribute (for instance L2 cache). In this case, this attribute is also taken into account when gathering objects as horizontal levels. To be clear: there will be one level for L1 caches, another level for L2 caches, etc.

If the topology is asymmetric (e.g., if a cache is missing in one of the processors), a given horizontal level will still exist if there exist any objects of that type. However, some branches of the overall tree may not have an object located in that horizontal level. Note that this specific hole within one horizontal level does not imply anything for other levels. All objects of the same type are gathered in horizontal levels even if their parents or children have different depths and types.

Moreover, it is important to understand that the same parent object may have children of different types (and therefore, different depths). These children are therefore siblings (because they have the same parent), but they are not cousins (because they do not belong to the same horizontal levels).

\section faq_annotate How do I annotate the topology with private notes?

Each hwloc object contains a userdata field that may be used by applications to store private pointers.
This field is kept intact as long as the object is valid, which means as long as topology objects are not modified by reloading or restricting the topology.

It is also possible to insert Misc objects with custom names anywhere in the topology (hwloc_topology_insert_misc_object_by_cpuset()) or as a leaf of the topology (hwloc_topology_insert_misc_object_by_parent()).

\section faq_upgrade How do I handle API upgrades?

The hwloc interface is extended with every new major release. Any application using the hwloc API should be prepared to check at compile-time whether some features are available in the currently installed hwloc distribution.

To check whether hwloc is at least 1.2, you should use:

\verbatim
#include <hwloc.h>
#if HWLOC_API_VERSION >= 0x00010200
...
#endif
\endverbatim

One of the major changes in hwloc 1.1 was the addition of the bitmap API. It supersedes the now deprecated cpuset API, which will be removed in a future hwloc release. It is strongly recommended to switch existing code to the bitmap API. Keeping support for older hwloc versions is easy. For instance, if your code uses hwloc_cpuset_alloc, you should use hwloc_bitmap_alloc instead and add the following code to one of your common headers:

\verbatim
#include <hwloc.h>
#if HWLOC_API_VERSION < 0x00010100
#define hwloc_bitmap_alloc hwloc_cpuset_alloc
#endif
\endverbatim

Similarly, the hwloc 1.0 interface may be detected by comparing HWLOC_API_VERSION with 0x00010000.

hwloc 0.9 did not define any HWLOC_API_VERSION, but this very old release probably does not deserve support from your application anymore.

*/