Portable abstraction of hierarchical architectures for high-performance computing
\htmlonly
\endhtmlonly
\section Introduction
hwloc provides command line tools and a C API to obtain the
hierarchical map of key computing elements, such as: NUMA memory
nodes, shared caches, processor sockets, processor cores, and
processing units (logical processors or "threads").
hwloc also gathers various attributes such as
cache and memory information, and is portable across a variety of
different operating systems and platforms.
hwloc primarily aims at helping high-performance computing (HPC)
applications, but is also applicable to any project seeking to exploit
code and/or data locality on modern computing platforms.
*** Note that the hwloc project represents the merger of the
libtopology project from INRIA and the Portable Linux Processor
Affinity (PLPA) sub-project from Open MPI. Both of these prior
projects are now deprecated. The first hwloc release was
essentially a "re-branding" of the libtopology code base, but with
both a few genuinely new features and a few PLPA-like features added
in. Prior releases of hwloc included documentation about switching
from PLPA to hwloc; this documentation has been dropped on the
assumption that everyone who was using PLPA has already switched to
hwloc.
hwloc supports the following operating systems:
Linux (including old kernels not having sysfs topology
information, with knowledge of cpusets, offline CPUs, ScaleMP vSMP,
and Kerrighed support)
Solaris
AIX
Darwin / OS X
FreeBSD and its variants, such as kFreeBSD/GNU
OSF/1 (a.k.a., Tru64)
HP-UX
Microsoft Windows
hwloc only reports the number of processors on unsupported operating
systems; no topology information is available.
For development and debugging purposes, hwloc also offers the ability to
work on "fake" topologies:
Symmetrical tree of resources generated from a list of level arities
Remote machine simulation through the gathering of Linux sysfs topology files
hwloc can display the topology in a human-readable format, either in
graphical mode (X11), or by exporting in one of several different
formats, including: plain text, PDF, PNG, and FIG (see \ref cli_examples
below). Note that some of the export formats require additional
support libraries.
hwloc offers a programming interface for manipulating topologies and
objects. It also brings a powerful CPU bitmap API that is used to
describe the location of topology objects on physical/logical processors. See
the \ref interface below. It may also be used to bind applications
onto certain cores or memory nodes. Several utility programs are also
provided to ease command-line manipulation of topology objects,
binding of processes, and so on.
\htmlonly
\endhtmlonly
\section installation Installation
hwloc (http://www.open-mpi.org/projects/hwloc/) is available under the
BSD license. It is hosted as a sub-project of the overall Open MPI
project (http://www.open-mpi.org/). Note that hwloc does not require
any functionality from Open MPI -- it is a wholly separate (and much
smaller!) project and code base. It just happens to be hosted as part
of the overall Open MPI project.
Nightly development snapshots are available on the web site.
Additionally, the code can be directly checked out of Subversion:
\code
shell$ svn checkout http://svn.open-mpi.org/svn/hwloc/trunk hwloc-trunk
shell$ cd hwloc-trunk
shell$ ./autogen.sh
\endcode
Note that GNU Autoconf >=2.63, Automake >=1.10 and Libtool >=2.2.6 are
required when building from a Subversion checkout.
Installation by itself is the fairly common GNU-based process:
\code
shell$ ./configure --prefix=...
shell$ make
shell$ make install
\endcode
The hwloc command-line tool "lstopo" produces human-readable topology
maps, as mentioned above. It can also export maps to the "fig" file
format. Support for PDF, Postscript, and PNG exporting is provided if
the "Cairo" development package can be found when hwloc is configured
and built. Similarly, lstopo's XML support requires the libxml2
development package.
\htmlonly
\endhtmlonly
\section cli_examples CLI Examples
On a 4-socket 2-core machine with hyperthreading, the \c lstopo tool
may show the following graphical output:
\image html dudley.png
\image latex dudley.png "" width=9cm
Here's the equivalent output in textual form:
\verbatim
Machine (16GB)
Socket L#0 + L3 L#0 (4096KB)
L2 L#0 (1024KB) + L1 L#0 (16KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#8)
L2 L#1 (1024KB) + L1 L#1 (16KB) + Core L#1
PU L#2 (P#4)
PU L#3 (P#12)
Socket L#1 + L3 L#1 (4096KB)
L2 L#2 (1024KB) + L1 L#2 (16KB) + Core L#2
PU L#4 (P#1)
PU L#5 (P#9)
L2 L#3 (1024KB) + L1 L#3 (16KB) + Core L#3
PU L#6 (P#5)
PU L#7 (P#13)
Socket L#2 + L3 L#2 (4096KB)
L2 L#4 (1024KB) + L1 L#4 (16KB) + Core L#4
PU L#8 (P#2)
PU L#9 (P#10)
L2 L#5 (1024KB) + L1 L#5 (16KB) + Core L#5
PU L#10 (P#6)
PU L#11 (P#14)
Socket L#3 + L3 L#3 (4096KB)
L2 L#6 (1024KB) + L1 L#6 (16KB) + Core L#6
PU L#12 (P#3)
PU L#13 (P#11)
L2 L#7 (1024KB) + L1 L#7 (16KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#15)
\endverbatim
Finally, here's the equivalent output in XML. Long lines were
artificially broken for document clarity (in the real output, each XML
tag is on a single line), and only socket #0 is shown for brevity:
\verbatim
\endverbatim
On a 4-socket 2-core Opteron NUMA machine, the \c lstopo tool may show
the following graphical output:
\image html hagrid.png
\image latex hagrid.png width=\textwidth
Here's the equivalent output in textual form:
\verbatim
Machine (32GB)
NUMANode L#0 (P#0 8190MB) + Socket L#0
L2 L#0 (1024KB) + L1 L#0 (64KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (1024KB) + L1 L#1 (64KB) + Core L#1 + PU L#1 (P#1)
NUMANode L#1 (P#1 8192MB) + Socket L#1
L2 L#2 (1024KB) + L1 L#2 (64KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (1024KB) + L1 L#3 (64KB) + Core L#3 + PU L#3 (P#3)
NUMANode L#2 (P#2 8192MB) + Socket L#2
L2 L#4 (1024KB) + L1 L#4 (64KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (1024KB) + L1 L#5 (64KB) + Core L#5 + PU L#5 (P#5)
NUMANode L#3 (P#3 8192MB) + Socket L#3
L2 L#6 (1024KB) + L1 L#6 (64KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (1024KB) + L1 L#7 (64KB) + Core L#7 + PU L#7 (P#7)
\endverbatim
And here's the equivalent output in XML. Similar to above, line
breaks were added and only PU #0 is shown for brevity:
\verbatim
\endverbatim
On a 2-socket quad-core Xeon (pre-Nehalem, with 2 dual-core dies in
each socket):
\image html emmett.png
\image latex emmett.png "" width=7cm
Here's the same output in textual form:
\verbatim
Machine (16GB)
Socket L#0
L2 L#0 (4096KB)
L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L1 L#1 (32KB) + Core L#1 + PU L#1 (P#4)
L2 L#1 (4096KB)
L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L1 L#3 (32KB) + Core L#3 + PU L#3 (P#6)
Socket L#1
L2 L#2 (4096KB)
L1 L#4 (32KB) + Core L#4 + PU L#4 (P#1)
L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
L2 L#3 (4096KB)
L1 L#6 (32KB) + Core L#6 + PU L#6 (P#3)
L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
\endverbatim
And the same output in XML (line breaks added, only PU #0 shown):
\verbatim
\endverbatim
\htmlonly
\endhtmlonly
\section interface Programming Interface
The basic interface is available in hwloc.h. It essentially offers
low-level routines for advanced programmers that want to manually
manipulate objects and follow links between them. Documentation for
everything in hwloc.h is provided later in this document. Developers
should also look at hwloc/helper.h (also documented later in this
document), which provides good higher-level topology traversal examples.
To precisely define the vocabulary used by hwloc, a \ref termsanddefs
section is available and should probably be read first.
Each hwloc object contains a cpuset describing the list of processing
units that it contains. These bitmaps may be used for \ref
hwlocality_cpubinding and \ref hwlocality_membinding. hwloc offers an extensive
bitmap manipulation interface in hwloc/bitmap.h.
Moreover, hwloc also comes with additional helpers for
interoperability with several commonly used environments.
See the \ref interoperability section for details.
The complete API documentation is available in a full set of HTML
pages, man pages, and self-contained PDF files (formatted for
both US letter and A4 formats) in the source tarball in
doc/doxygen-doc/.
NOTE: If you are building the documentation from a
Subversion checkout, you will need to have Doxygen and pdflatex
installed -- the documentation will be built during the normal "make"
process. The documentation is installed during "make install" to
$prefix/share/doc/hwloc/ and your system's default man page tree (under
$prefix, of course).
\subsection portability Portability
As shown in \ref cli_examples, hwloc can obtain information on a wide
variety of hardware topologies. However, some platforms and/or
operating system versions will only report a subset of this
information. For example, on a PPC64-based system with 32 cores
(each with 2 hardware threads) running a default 2.6.18-based kernel
from RHEL 5.4, hwloc is only able to glean information about NUMA
nodes and processor units (PUs). No information about caches,
sockets, or cores is available.
Similarly, operating systems have varying support for CPU and memory binding.
Some operating systems provide interfaces for all kinds of CPU and memory
binding, others provide interfaces for only a limited number of kinds, and
some do not provide any binding interface at all. In the unsupported cases,
hwloc's binding functions simply return the ENOSYS error
(Function not implemented), meaning that the underlying operating system
does not provide any interface for them. \ref hwlocality_cpubinding and \ref
hwlocality_membinding provide more information on which hwloc binding functions
should be preferred because interfaces for them are usually available on the
supported operating systems.
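As a minimal sketch of how an application might react to that error, the
following C fragment classifies the outcome of a binding attempt. The names
\c classify_bind_result and \c bind_outcome are hypothetical and not part of
hwloc. The sketch also assumes that a failing binding call returns a negative
value and leaves the error code in \c errno.

```c
#include <errno.h>

/* Hypothetical classification of a binding attempt; not part of hwloc. */
enum bind_outcome { BIND_OK, BIND_UNSUPPORTED, BIND_FAILED };

/* rc is the return value of a binding call and err is the errno value
 * saved right after it. Assumption: failure is reported as a negative rc
 * with errno set. */
static enum bind_outcome classify_bind_result(int rc, int err)
{
    if (rc >= 0)
        return BIND_OK;
    if (err == ENOSYS)
        return BIND_UNSUPPORTED;  /* the OS offers no interface for this binding */
    return BIND_FAILED;           /* any other error, e.g. an invalid argument */
}
```

A caller would typically treat \c BIND_UNSUPPORTED as non-fatal and continue
unbound, while reporting \c BIND_FAILED to the user.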
Here's the graphical output from lstopo on this platform when
Simultaneous Multi-Threading (SMT) is enabled:
\image html ppc64-with-smt.png
\image latex ppc64-with-smt.pdf "" width=\textwidth
And here's the graphical output from lstopo on this platform when SMT is
disabled:
\image html ppc64-without-smt.png
\image latex ppc64-without-smt.pdf "" width=\textwidth
Notice that hwloc only sees half the PUs when SMT is disabled. PU #15,
for example, seems to change location from NUMA node #0 to #1. In
reality, no PUs "moved" -- they were simply re-numbered when hwloc
only saw half as many. Hence, PU #15 in the SMT-disabled picture
probably corresponds to PU #30 in the SMT-enabled picture.
This same "PUs have disappeared" effect can be seen on other platforms
-- even platforms / OSs that provide much more information than the
above PPC64 system. This is an unfortunate side-effect of how
operating systems report information to hwloc.
Note that after upgrading the Linux kernel on the same PPC64 system
mentioned above to 2.6.34, hwloc is able to discover all the topology
information. The following picture shows the entire topology layout
when SMT is enabled:
\image html ppc64-full-with-smt.png
\image latex ppc64-full-with-smt.pdf "" width=\textwidth
Developers using the hwloc API or XML output for portable applications
should therefore be extremely careful to not make any assumptions
about the structure of data that is returned. For example, per the
above reported PPC topology, it is not safe to assume that PUs will
always be descendants of cores.
Additionally, future hardware may insert new topology elements that
are not available in this version of hwloc. Long-lived applications
that are meant to span multiple different hardware platforms should
also be careful about making structure assumptions. For example,
there may someday be an element "lower" than a PU, or perhaps a new
element may exist between a core and a PU.
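The defensive style recommended above can be sketched as follows. The code
walks up the tree and accepts that the requested kind of ancestor may simply
not exist. The \c node structure and the \c find_ancestor function are
simplified, hypothetical stand-ins used for illustration only; they are not
the hwloc object type or an hwloc function.

```c
#include <stddef.h>

/* Hypothetical stand-ins for topology objects; not the hwloc types. */
enum node_type { NODE_MACHINE, NODE_NUMANODE, NODE_SOCKET, NODE_CORE, NODE_PU };

struct node {
    enum node_type type;
    struct node *parent;   /* NULL for the root object */
};

/* Walk up from obj and return the closest ancestor of the wanted type,
 * or NULL when no such level exists above obj in this topology. */
static struct node *find_ancestor(struct node *obj, enum node_type wanted)
{
    struct node *cur;

    for (cur = obj ? obj->parent : NULL; cur != NULL; cur = cur->parent)
        if (cur->type == wanted)
            return cur;
    return NULL;
}
```

On a topology like the reduced PPC64 one, where PUs sit directly below NUMA
nodes, looking for a core above a PU returns NULL, and the caller has to
handle that case instead of dereferencing the result.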
\subsection interface_example API Example
The following small C example (named ``hwloc-hello.c'') prints the
topology of the machine and binds the process to the first logical processor
of the second core of the machine.
\include hwloc-hello.c
hwloc provides a \c pkg-config file so that the \c pkg-config tool can
report the relevant compiler and linker flags. For example, it can be used
as follows in a makefile to compile applications that use the hwloc
library (assuming GNU Make):
\verbatim
CFLAGS += $(shell pkg-config --cflags hwloc)
LDLIBS += $(shell pkg-config --libs hwloc)

hwloc-hello: hwloc-hello.c
	$(CC) hwloc-hello.c $(CFLAGS) -o hwloc-hello $(LDLIBS)
\endverbatim
On a machine with 4GB of RAM and 2 processor sockets -- each socket of
which has two processing cores -- the output from running \c
hwloc-hello could be something like the following:
\verbatim
shell$ ./hwloc-hello
*** Objects at level 0
Index 0: Machine(3938MB)
*** Objects at level 1
Index 0: Socket#0
Index 1: Socket#1
*** Objects at level 2
Index 0: Core#0
Index 1: Core#1
Index 2: Core#3
Index 3: Core#2
*** Objects at level 3
Index 0: PU#0
Index 1: PU#1
Index 2: PU#2
Index 3: PU#3
*** Printing overall tree
Machine(3938MB)
Socket#0
Core#0
PU#0
Core#1
PU#1
Socket#1
Core#3
PU#2
Core#2
PU#3
*** 2 socket(s)
shell$
\endverbatim
\htmlonly
\endhtmlonly
\section bugs Questions and Bugs
Questions should be sent to the devel mailing
list (http://www.open-mpi.org/community/lists/hwloc.php).
Bugs should be reported in the tracker
(https://svn.open-mpi.org/trac/hwloc/).
If hwloc discovers an incorrect topology for your machine, the very
first thing you should check is to ensure that you have the most
recent updates installed for your operating system. Indeed, most of
hwloc topology discovery relies on hardware information retrieved
through the operating system (e.g., via the /sys virtual filesystem of
the Linux kernel). If upgrading your OS or Linux kernel does not
solve your problem, you may also want to ensure that you are running
the most recent version of the BIOS for your machine.
If those things fail, contact us on the mailing list for additional
help. Please attach the output of lstopo after having given the
--enable-debug option to ./configure and rebuilt completely, to get
debugging output.
\htmlonly
\endhtmlonly
\section history History / Credits
hwloc is the evolution and merger of the libtopology
(http://runtime.bordeaux.inria.fr/libtopology/) project and the Portable
Linux Processor Affinity (PLPA) (http://www.open-mpi.org/projects/plpa/)
project. Because of functional and ideological overlap, these two code bases
and ideas were merged and released under the name "hwloc" as an Open MPI
sub-project.
libtopology was initially developed by the INRIA Runtime Team-Project
(http://runtime.bordeaux.inria.fr/), headed by Raymond Namyst
(http://dept-info.labri.fr/~namyst/). PLPA was initially developed by
the Open MPI development team as a sub-project. Both are now deprecated
in favor of hwloc, which is distributed as an Open MPI sub-project.
\htmlonly
\endhtmlonly
\section further_read Further Reading
The documentation chapters include
\ref termsanddefs
\ref tools
\ref envvar
\ref cpu_mem_bind
\ref interoperability
\ref threadsafety
\ref embed
\ref faq
Make sure to have had a look at those too!
\htmlonly
\endhtmlonly
\page termsanddefs Terms and Definitions
Object
Interesting kind of part of the system, such as a Core, a Cache,
a Memory node, etc. The different types detected by hwloc are
detailed in the ::hwloc_obj_type_t enumeration.
They are topologically sorted by CPU set into a tree.
CPU set
The set of logical processors (or processing units) logically included in an object
(if it makes sense). They are always expressed using the physical numbers of
the logical processors (as announced by the OS). They are implemented as the
::hwloc_bitmap_t opaque structure. hwloc CPU sets are just masks; they
do \em not have any relation with an operating system's actual binding notion, such as
Linux cpusets.
Node set
The set of NUMA memory nodes logically included in an object
(if it makes sense). They are always expressed using physical node
numbers (as announced by the OS). They are implemented with the
::hwloc_bitmap_t opaque structure.
Bitmap
A possibly-infinite set of bits used for describing sets of objects
such as CPUs (CPU sets) or memory nodes (Node sets). They are implemented
with the ::hwloc_bitmap_t opaque structure.
Parent object
The object logically containing the current object, for example
because its CPU set includes the CPU set of the current object.
Ancestor object
The parent object, or its own parent object, and so on.
Children object(s)
The object (or objects) contained in the current object because
their CPU set is included in the CPU set of the current object.
Arity
The number of children of an object.
Sibling objects
Objects which have the same parent. They usually have the same type
(and hence are cousins, as well), but they may not if the topology
is asymmetric.
Sibling rank
Index that uniquely identifies objects which have
the same parent. It is always in the range [0, parent_arity).
Cousin objects
Objects of the same type (and depth) as the current object,
even if they do not have the same parent.
Level
Set of objects of the same type and depth. All these objects
are cousins.
Depth
Nesting level in the object tree, starting from the 0th object.
OS or physical index
The index that the operating system (OS) uses to identify the
object. This may be completely arbitrary, non-unique, non-contiguous, not
representative of logical proximity, and may depend on the BIOS
configuration. That is why hwloc almost never uses them, only in the default
lstopo output (P#x) and cpuset masks.
Logical index
Index to uniquely identify objects of the same type and depth,
automatically computed by hwloc according to the topology. It expresses
logical proximity in a generic way, i.e. objects which have adjacent logical
indexes are adjacent in the topology. That is why hwloc almost always uses
it in its API, since it expresses logical proximity. They can be shown (as
L#x) by lstopo thanks to the -l option. This index
is always linear and in
the range [0, num_objs_same_type_same_level-1]. Think of it as ``cousin
rank.'' The ordering is based on topology first, and then on OS CPU numbers,
so it is stable across everything except firmware CPU renumbering.
"Logical index" should not be confused with "Logical processor". A "Logical
processor" (which in hwloc we rather call "processing unit" to avoid the
confusion) has both a physical index (as chosen arbitrarily by BIOS/OS) and a logical
index (as computed according to logical proximity by hwloc).
Logical processor
Processing unit
The smallest processing element that can be represented by a hwloc
object. It may be a single-core processor, a core of a multicore
processor, or a single thread in an SMT processor.
"Logical processor" should not be confused with "Logical index of a
processor". "Logical processor" is only one of the names which can be found in
various documentations to designate a processing unit.
The following diagram can help to understand the vocabulary of the relationships
by showing the example of a machine with two dual core sockets (with no
hardware threads); thus, a topology with 4 levels. Each box with rounded corners
corresponds to one hwloc_obj_t, containing the values of the different integer
fields (depth, logical_index, etc.), and arrows show which other hwloc_obj_t
the pointers point to (first_child, parent, etc.). The L2 cache of the last
core is intentionally missing to show how asymmetric topologies are handled.
\image html diagram.png
\image latex diagram.eps width=\textwidth
It should be noted that for PU objects, the logical index -- as
computed linearly by hwloc -- is not the same as the OS index.
See also \ref faq_asymmetric for more details.
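The relationships defined on this page can be restated as small predicates.
The following C sketch uses a simplified, hypothetical \c obj structure (not
the real hwloc_obj_t) that keeps only the fields needed to express sibling,
cousin, and sibling-rank properties.

```c
#include <stddef.h>

/* Simplified, hypothetical object; not the real hwloc_obj_t. */
struct obj {
    int type;               /* illustrative type code */
    unsigned depth;         /* nesting level, 0 for the root */
    unsigned sibling_rank;  /* index among the children of the parent */
    unsigned arity;         /* number of children */
    struct obj *parent;     /* NULL for the root */
};

/* Siblings are distinct objects that share the same parent. */
static int are_siblings(const struct obj *a, const struct obj *b)
{
    return a != b && a->parent != NULL && a->parent == b->parent;
}

/* Cousins have the same type and depth, whatever their parents are. */
static int are_cousins(const struct obj *a, const struct obj *b)
{
    return a != b && a->type == b->type && a->depth == b->depth;
}

/* A sibling rank always lies in the range [0, parent_arity). */
static int rank_is_valid(const struct obj *o)
{
    return o->parent == NULL || o->sibling_rank < o->parent->arity;
}
```

With two sockets of two cores each, the two cores of one socket are both
siblings and cousins, while cores of different sockets are cousins only.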
\page tools Command-Line Tools
hwloc comes with an extensive C programming interface and several
command line utilities. Each of them is fully documented in its own
manual page; the following is a summary of the available command line
tools.
\section cli_lstopo lstopo
lstopo (also known as hwloc-info and hwloc-ls) displays the
hierarchical topology map of the current system. The output may be
graphical or textual, and can also be exported to numerous file
formats such as PDF, PNG, XML, and others.
This command can also display the processes currently bound to a part
of the machine (via the --ps option).
Note that lstopo can read XML files and/or alternate chroot
filesystems and display topological maps representing those systems
(e.g., use lstopo to output an XML file on one system, and then use
lstopo to read in that XML file and display it on a different system).
\section cli_hwloc_bind hwloc-bind
hwloc-bind binds processes to specific hardware objects through a
flexible syntax. A simple example is binding an executable to
specific cores (or sockets or bitmaps or ...). The hwloc-bind(1) man
page provides much more detail on what is possible.
hwloc-bind can also be used to retrieve the current process' binding.
\section cli_hwloc_calc hwloc-calc
hwloc-calc is generally used to create bitmap strings to pass to
hwloc-bind. Although hwloc-bind accepts many forms of object
specification (i.e., bitmap strings are one of many forms that
hwloc-bind understands), they can be useful, compact representations
in shell scripts, for example.
hwloc-calc generates bitmap strings from given hardware objects with
the ability to aggregate them, intersect them, and more. hwloc-calc
generally uses the same syntax as hwloc-bind, but multiple instances
may be composed to generate complex combinations.
Note that hwloc-calc can also generate lists of logical processors or
NUMA nodes that are convenient to pass to some external tools such as
taskset or numactl.
\section cli_hwloc_distrib hwloc-distrib
hwloc-distrib generates a set of bitmap strings that are uniformly
distributed across the machine for the given number of processes.
These strings may be used with hwloc-bind to run processes to maximize
their memory bandwidth by properly distributing them across the
machine.
\section cli_hwloc_ps hwloc-ps
hwloc-ps is a tool to display the bindings of processes that are
currently running on the local machine. By default, hwloc-ps only
lists processes that are bound; unbound processes (and Linux kernel
threads) are not displayed.
\section cli_hwloc_gather hwloc-gather-topology
hwloc-gather-topology is a Linux-specific tool that saves the
relevant topology files of the current machine into a tarball
(and the corresponding lstopo output). These files may be used
later (possibly offline) for simulating or debugging a machine
without actually running on it.
\page envvar Environment Variables
The behavior of the hwloc library and tools may be tuned thanks to the
following environment variables.
HWLOC_XMLFILE=/path/to/file.xml
enforces the discovery from the given XML file as if
hwloc_topology_set_xml() had been called.
This file may have been generated earlier with lstopo file.xml.
For convenience, this backend provides empty binding hooks which just
return success. To have hwloc still actually call OS-specific hooks,
HWLOC_THISSYSTEM should be set to 1 in the environment too, to assert that
the loaded file is really the underlying system.
HWLOC_FSROOT=/path/to/linux/filesystem-root/
switches to reading the topology from the specified
Linux filesystem root instead of the main file-system root, as if
hwloc_topology_set_fsroot() had been called.
Not using the main file-system root causes hwloc_topology_is_thissystem()
to return 0.
For convenience, this backend provides empty binding hooks which just
return success. To have hwloc still actually call OS-specific hooks,
HWLOC_THISSYSTEM should be set to 1 in the environment too, to assert that
the loaded filesystem root is really the underlying system.
HWLOC_THISSYSTEM=1
enforces the return value of hwloc_topology_is_thissystem().
It means that it makes hwloc assume that the selected backend provides the
topology for the system on which we are running, even if it is not the
OS-specific backend but the XML backend for instance.
This means making the binding functions actually call the OS-specific
system calls and really do binding, while the XML backend would otherwise
provide empty hooks just returning success.
This can be used for efficiency reasons to first detect the topology once,
save it to an XML file, and quickly reload it later through the XML
backend, but still having binding functions actually do bind.
HWLOC_IGNORE_DISTANCES=0
controls whether object grouping based on distances is disabled.
By default, hwloc uses distance matrices between objects (either read
from the OS or given by the user) to find groups of close objects.
These groups are described by adding intermediate Group objects in the topology.
Setting this environment variable to 1 will disable this grouping.
HWLOC_<type>_DISTANCES=index,...:X*Y
HWLOC_<type>_DISTANCES=index,...:X*Y*Z
HWLOC_<type>_DISTANCES=index,...:distance,...
sets a distance matrix for objects of the given type and physical indexes.
The type should be given as its case-sensitive stringified value
(e.g. NUMANode, Socket, Cache, Core, PU).
The variable value starts with a comma-separated list of the objects'
physical indexes. Distances are then specified after a colon.
If X*Y is given, X groups of Y close objects are specified.
If X*Y*Z is given, X groups of Y groups of Z close objects are specified.
Otherwise, the comma-separated list of distances should be given.
If N objects are considered, the i*N+j-th value gives the
distance from the i-th object to the j-th object.
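The indexing rule for the explicit list of distances can be sketched in C as
follows. The function name \c distance_between is hypothetical, and storing
the values as \c float is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Return the distance from the i-th object to the j-th object, where
 * values holds n*n entries in the order of the comma-separated list:
 * entry i*n+j is the distance from object i to object j. */
static float distance_between(const float *values, size_t n, size_t i, size_t j)
{
    assert(values != NULL && i < n && j < n);
    return values[i * n + j];
}
```

For instance, with two NUMA nodes and the purely illustrative setting
HWLOC_NUMANode_DISTANCES=0,1:10,20,20,10, the four values form a 2x2 matrix,
and the value at position 0*2+1 is the distance from the first listed node
to the second one.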
\page cpu_mem_bind CPU and Memory Binding Overview
Some operating systems do not systematically provide separate
functions for CPU and memory binding. This means that CPU binding
functions may have effects on the memory binding policy.
Likewise, changing the memory binding policy may change the CPU
binding of the current thread. This is often not a problem for
applications, so by default hwloc will make use of these functions
when they provide better binding support.
If the application does not want the CPU binding to change when
changing the memory policy, it needs to use the
HWLOC_MEMBIND_NOCPUBIND flag to prevent hwloc from using OS functions
which would change the CPU binding. Additionally,
HWLOC_CPUBIND_NOMEMBIND can be passed to CPU binding functions to
prevent hwloc from using OS functions that would change the memory binding
policy. Of course, using these flags will reduce hwloc's overall support for
binding, so their use is discouraged.
One can avoid using these flags but still closely control both memory
and CPU binding by allocating memory, touching each page in the
allocated memory, and then changing the CPU binding. The
already-really-allocated memory will then be "locked" to physical
memory and will not be migrated. Thus, even if the memory binding
policy gets changed by the CPU binding order, the already-allocated
memory will not change with it. When binding and allocating further
memory, the CPU binding should be performed again in case the memory
binding altered the previously-selected CPU binding.
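The "touch each page" step of this technique can be sketched in C as below.
The helper name \c touch_pages is hypothetical, and the sketch assumes a
POSIX system where \c sysconf(_SC_PAGESIZE) reports the page size. It writes
zeroes, so it is meant for a freshly allocated buffer whose contents do not
matter yet.

```c
#include <stddef.h>
#include <unistd.h>

/* Write one byte per page of buf so that the OS physically allocates the
 * memory now, under the memory binding policy currently in effect.
 * Returns the number of page-sized strides that were touched. */
static size_t touch_pages(void *buf, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t page = ps > 0 ? (size_t)ps : 4096;  /* fallback value is an assumption */
    volatile char *p = buf;
    size_t off, count = 0;

    for (off = 0; off < len; off += page) {
        p[off] = 0;
        count++;
    }
    if (len > 0)
        p[len - 1] = 0;  /* covers the last page when buf is not page-aligned */
    return count;
}
```

The intended order is: allocate the buffer, call \c touch_pages() on it, and
only then change the CPU binding of the thread.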
Not all operating systems support the notion of a "current" memory
binding policy for the current process, but such operating systems often still
provide a way to allocate data on a given node set. Conversely, some
operating systems support the notion of a "current" memory binding policy and do
not permit allocating data on a specific node set without first changing the
current policy and then allocating the data. To provide the most powerful coverage of
these facilities, hwloc provides:
functions that set/get the current memory binding policies (if supported):
hwloc_set/get_membind_*() and hwloc_set/get_proc_membind()
functions that allocate memory bound to specific node set without changing
the current memory binding policy (if supported): hwloc_alloc_membind() and
hwloc_alloc_membind_nodeset().
helpers which, if needed, change the current memory binding policy of the
process in order to obtain memory binding: hwloc_alloc_membind_policy() and
hwloc_alloc_membind_policy_nodeset()
An application can thus use the first two sets of functions if it wants to
manage separately the global process binding policy and directed allocation,
or use the third set of functions if it does not care about the process memory
binding policy.
See \ref hwlocality_cpubinding and \ref hwlocality_membinding for
hwloc's API functions regarding CPU and memory binding, respectively.
\page interoperability Interoperability With Other Software
Although hwloc offers its own portable interface, it still may have to
interoperate with specific or non-portable libraries that manipulate
similar kinds of objects. hwloc therefore offers several specific
"helpers" to assist converting between those specific interfaces and
hwloc.
Some external libraries may be specific to a particular OS; others may
not always be available. The hwloc core therefore generally does not
explicitly depend on these types of libraries. However, when a custom
application uses or otherwise depends on such a library, it may
optionally include the corresponding hwloc helper to extend the hwloc
interface with dedicated helpers.
Linux specific features
hwloc/linux.h offers Linux-specific helpers that utilize some
non-portable features of the Linux system, such as binding threads
through their thread ID ("tid") or parsing kernel CPU mask files.
Linux libnuma
hwloc/linux-libnuma.h provides conversion helpers between hwloc CPU
sets and libnuma-specific types, such as nodemasks and bitmasks. It
helps you use libnuma memory-binding functions with hwloc CPU sets.
Glibc
hwloc/glibc-sched.h offers conversion routines between Glibc and
hwloc CPU sets in order to use hwloc with functions such as
sched_setaffinity().
OpenFabrics Verbs
hwloc/openfabrics-verbs.h helps interoperability with the
OpenFabrics Verbs interface. For example, it can return a list of
processors near an OpenFabrics device.
Myrinet Express
hwloc/myriexpress.h offers interoperability with the Myrinet
Express interface. It can return the list of processors near
a Myrinet board managed by the MX driver.
NVIDIA CUDA
hwloc/cuda.h and hwloc/cudart.h enable interoperability with
NVIDIA CUDA Driver and Runtime interfaces. For instance, it may
return the list of processors near NVIDIA GPUs.
Taskset command-line tool
The taskset command-line tool is widely used for binding
processes. It manipulates CPU set strings in a format that
is slightly different from hwloc's (unlike hwloc, it does not divide the
string into fixed-size subsets separated by commas).
To ease interoperability, hwloc offers routines to convert
hwloc CPU sets from/to taskset-specific string format.
Most hwloc command-line tools also support the --taskset
option to manipulate taskset-specific strings.
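To illustrate the difference between the two string formats, the following C
sketch prints the same mask in both styles. The function names are
hypothetical and are not the hwloc conversion routines. The 32-bit subset
width, the zero padding, and the \c 0x prefixes are assumptions of this
sketch; the text above only states that the hwloc format uses fixed-size
subsets separated by commas.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/* words[0] holds the least significant 32 bits of the mask. */

/* Comma-separated, fixed-size subsets, most significant subset first. */
static void format_subsets(const uint32_t *words, size_t n, char *out, size_t outlen)
{
    size_t used = 0;

    assert(n > 0 && outlen > 0);
    while (n > 1 && words[n - 1] == 0)
        n--;                              /* skip all-zero leading subsets */
    out[0] = '\0';
    while (n-- > 0 && used < outlen) {
        int w = snprintf(out + used, outlen - used, "%s0x%08" PRIx32,
                         used ? "," : "", words[n]);
        if (w < 0)
            break;
        used += (size_t)w;
    }
}

/* One single hexadecimal number, without separators. */
static void format_taskset(const uint32_t *words, size_t n, char *out, size_t outlen)
{
    size_t used;
    int w;

    assert(n > 0 && outlen > 0);
    while (n > 1 && words[n - 1] == 0)
        n--;
    w = snprintf(out, outlen, "0x%" PRIx32, words[n - 1]);  /* top word, unpadded */
    used = w > 0 ? (size_t)w : 0;
    while (--n > 0 && used < outlen) {
        w = snprintf(out + used, outlen - used, "%08" PRIx32, words[n - 1]);
        if (w < 0)
            break;
        used += (size_t)w;
    }
}
```

Under these assumptions, a mask with only bit 32 set is written as two
subsets in the first style and as one nine-digit number in the second.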
\page threadsafety Thread Safety
Like most libraries that mainly fill data structures, hwloc is not
thread safe but rather reentrant: all state is held in a \ref
hwloc_topology_t instance without mutex protection. That means, for
example, that two threads can safely operate on and modify two
different \ref hwloc_topology_t instances, but they should not
simultaneously invoke functions that modify the same
instance. Similarly, one thread should not modify a \ref
hwloc_topology_t instance while another thread is reading or
traversing it. However, two threads can safely read or traverse the
same \ref hwloc_topology_t instance concurrently.
When running in multiprocessor environments, be aware that proper thread
synchronization and/or memory coherency protection is needed to pass hwloc
data (such as \ref hwloc_topology_t pointers) from one processor
to another (e.g., a mutex, semaphore, or a memory barrier).
Note that this is not a hwloc-specific requirement, but it is worth
mentioning.
For reference, \ref hwloc_topology_t modification operations include
(but may not be limited to):
Creation and destruction
hwloc_topology_init(), hwloc_topology_load(),
hwloc_topology_destroy() (see \ref hwlocality_creation) imply
major modifications of the structure, including freeing some
objects. No other thread can access the topology or any of its
objects at the same time.
Also references to objects inside the topology are not valid anymore
after these functions return.
Runtime topology modifications
hwloc_topology_insert_misc_object_by_* (see \ref
hwlocality_tinker) may modify the topology significantly by adding
objects inside the tree, changing the topology depth, etc.
hwloc_topology_restrict modifies the topology even more
dramatically by removing some objects.
Although references to former objects may still be valid
after insertion or restriction, it is strongly advised to not rely on any such
guarantee and always re-consult the topology to reacquire new
instances of objects.
Locating topologies
hwloc_topology_ignore*, hwloc_topology_set*
(see \ref hwlocality_configuration) do not modify the topology
directly, but they do modify internal structures describing the
behavior of the next invocation of hwloc_topology_load().
Hence, none of these functions should be used concurrently on the
same topology.
Note that these functions do not modify the current topology until
it is actually reloaded; it is possible to use them while other
threads are only reading the current topology.
\page embed Embedding hwloc in Other Software
It can be desirable to include hwloc in a larger software package (be
sure to check out the LICENSE file) so that users don't have to
separately download and install it before installing your software.
This can be advantageous to ensure that your software uses a
known-tested/good version of hwloc, or for use on systems that do not
have hwloc pre-installed.
When used in "embedded" mode, hwloc will:
- not install any header files
- not build any documentation files
- not build or install any executables or tests
- not build libhwloc.* -- instead, it will build
libhwloc_embedded.*
There are two ways to put hwloc into "embedded" mode. The first is
directly from the configure command line:
\verbatim
shell$ ./configure --enable-embedded-mode ...
\endverbatim
The second requires that your software project uses the GNU Autoconf /
Automake / Libtool tool chain to build your software. If you do this,
you can directly integrate hwloc's m4 configure macro into your
configure script. You can then invoke hwloc's configuration tests and
build setup by calling an m4 macro (see below).
\section embedding_m4 Using hwloc's M4 Embedding Capabilities
Every project is different, and there are many different ways of
integrating hwloc into yours. What follows is one example of
how to do it.
If your project uses recent versions of Autoconf, Automake, and Libtool
to build, you can use hwloc's embedded m4 capabilities. We have
tested the embedded m4 with projects that use Autoconf 2.65, Automake
1.11.1, and Libtool 2.2.6b. Slightly earlier versions may also
work but are untested. Autoconf versions prior to 2.65 are almost
certain to not work.
You can either copy all the config/hwloc*m4 files from the hwloc
source tree to the directory where your project's m4 files reside, or
you can tell aclocal to find more m4 files in the embedded hwloc's
"config" subdirectory (e.g., add "-Ipath/to/embedded/hwloc/config" to
your Makefile.am's ACLOCAL_AMFLAGS).
The following macros can then be used from your configure script (only
HWLOC_SETUP_CORE must be invoked if using the m4 macros):
- HWLOC_SETUP_CORE(config-dir-prefix, action-upon-success,
action-upon-failure, print_banner_or_not): Invoke the hwloc
configuration tests and set up the hwloc tree to build. The first
argument is the prefix to use for AC_OUTPUT files -- it's where the
hwloc tree is located relative to $top_srcdir. Hence, if
your embedded hwloc is located in the source tree at contrib/hwloc,
you should pass [contrib/hwloc] as the first argument. If
HWLOC_SETUP_CORE and the rest of configure complete
successfully, then "make" traversals of the hwloc tree with standard
Automake targets (all, clean, install, etc.) should behave as
expected. For example, it is safe to list the hwloc directory in
the SUBDIRS of a higher-level Makefile.am. The last argument, if
not empty, will cause the macro to display an announcement banner
that it is starting the hwloc core configuration tests.
HWLOC_SETUP_CORE will set the following environment variables and
AC_SUBST them: HWLOC_EMBEDDED_CFLAGS, HWLOC_EMBEDDED_CPPFLAGS, and
HWLOC_EMBEDDED_LIBS. These flags are filled with the values
discovered in the hwloc-specific m4 tests, and can be used in your
build process as relevant. The _CFLAGS, _CPPFLAGS, and _LIBS
variables are necessary to build libhwloc (or libhwloc_embedded)
itself.
HWLOC_SETUP_CORE also sets the HWLOC_EMBEDDED_LDADD environment variable
(and AC_SUBSTs it) to contain the location of the
libhwloc_embedded.la convenience Libtool archive. It can be used in
your build process to link an application or other library against
the embedded hwloc library.
NOTE: If the HWLOC_SET_SYMBOL_PREFIX macro is used, it must
be invoked before HWLOC_SETUP_CORE.
- HWLOC_BUILD_STANDALONE: HWLOC_SETUP_CORE defaults to building hwloc
in an "embedded" mode (described above). If HWLOC_BUILD_STANDALONE
is invoked *before* HWLOC_SETUP_CORE, the embedded definitions will
not apply (e.g., libhwloc.la will be built, not
libhwloc_embedded.la).
- HWLOC_SET_SYMBOL_PREFIX(foo_): Tells hwloc to prefix all of
hwloc's types and public symbols with "foo_", meaning that function
hwloc_init() becomes foo_hwloc_init(). Enum values are prefixed
with an upper-case translation of the prefix supplied;
HWLOC_OBJ_SYSTEM becomes FOO_HWLOC_OBJ_SYSTEM. This is the recommended
behavior if you are including hwloc in middleware -- it is possible
that your software will be combined with other software that links
to another copy of hwloc. If both uses of hwloc utilize different
symbol prefixes, there will be no type/symbol clashes, and
everything will compile, link, and run successfully. If you both
embed hwloc without changing the symbol prefix and also link against
an external hwloc, you may get multiple symbol definitions when
linking your final library or application.
- HWLOC_SETUP_DOCS, HWLOC_SETUP_UTILS, HWLOC_SETUP_TESTS: These three
macros only apply when hwloc is built in "standalone" mode (i.e.,
they should NOT be invoked unless HWLOC_BUILD_STANDALONE has already
been invoked).
- HWLOC_DO_AM_CONDITIONALS: If you embed hwloc in a larger project and
build it conditionally with Automake (e.g., if HWLOC_SETUP_CORE is
invoked conditionally), you must unconditionally invoke
HWLOC_DO_AM_CONDITIONALS to avoid warnings from Automake (for the
cases where hwloc is not selected to be built). This macro is
necessary because hwloc uses some AM_CONDITIONALs to build itself,
and AM_CONDITIONALs cannot be defined conditionally. Note that it
is safe (but unnecessary) to call HWLOC_DO_AM_CONDITIONALS even if
HWLOC_SETUP_CORE is invoked unconditionally. If you are not using
Automake to build hwloc, this macro is unnecessary (and will actually
cause errors because it invokes AM_* macros that will be undefined).
NOTE: When using the HWLOC_SETUP_CORE m4 macro, it may
be necessary to explicitly invoke AC_CANONICAL_TARGET (which requires
config.sub and config.guess) and/or AC_USE_SYSTEM_EXTENSIONS macros
early in the configure script (e.g., after AC_INIT but before
AM_INIT_AUTOMAKE). See the Autoconf documentation for further
information.
Also note that hwloc's top-level configure.ac script uses exactly the
macros described above to build hwloc in a standalone mode (by
default). You may want to examine it for one example of how these
macros are used.
\section embedding_example Example Embedding hwloc
Here's an example of integrating with a larger project named sandbox
that already uses Autoconf, Automake, and Libtool to build itself:
\verbatim
# First, cd into the sandbox project source tree
shell$ cd sandbox
shell$ cp -r /somewhere/else/hwloc-<version> my-embedded-hwloc
shell$ edit Makefile.am
1. Add "-Imy-embedded-hwloc/config" to ACLOCAL_AMFLAGS
2. Add "my-embedded-hwloc" to SUBDIRS
3. Add "$(HWLOC_EMBEDDED_LDADD)" and "$(HWLOC_EMBEDDED_LIBS)" to
sandbox's executable's LDADD line. The former is the name of the
Libtool convenience library that hwloc will generate. The latter
is any dependent support libraries that may be needed by
$(HWLOC_EMBEDDED_LDADD).
4. Add "$(HWLOC_EMBEDDED_CFLAGS)" to AM_CFLAGS
5. Add "$(HWLOC_EMBEDDED_CPPFLAGS)" to AM_CPPFLAGS
shell$ edit configure.ac
1. Add "HWLOC_SET_SYMBOL_PREFIX(sandbox_hwloc_)" line
2. Add "HWLOC_SETUP_CORE([my-embedded-hwloc], [happy=yes], [happy=no])" line
3. Add error checking for happy=no case
shell$ edit sandbox.c
1. Add #include <hwloc.h>
2. Add calls to sandbox_hwloc_init() and other hwloc API functions
\endverbatim
Now you can bootstrap, configure, build, and run the sandbox as normal
-- all calls to "sandbox_hwloc_*" will use the embedded hwloc rather
than any system-provided copy of hwloc.
\page faq Frequently Asked Questions
\section faq_xml I do not want hwloc to rediscover my enormous machine topology every time I rerun a process
Although the topology discovery is not expensive on common machines,
its overhead may become significant when multiple processes repeat
the discovery on large machines (for instance when starting one process
per core in a parallel application).
The machine topology usually does not vary much, except when some cores
are stopped/restarted or when the administrator's restrictions are modified.
Rediscovering the whole topology again and again is therefore mostly wasted work.
To avoid this, hwloc offers XML import/export features. They let you
save the discovered topology to a file (for instance with the lstopo program)
and reload it later by setting the HWLOC_XMLFILE environment variable.
Loading an XML topology is usually much faster than querying multiple
files or calling multiple functions of the operating system.
It is also possible to manipulate such XML files with the C programming
interface, and the import/export may also be directed to a memory buffer
(which may, for instance, be transmitted between applications through a socket).
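As an illustration of the environment-variable approach, a launcher could
point HWLOC_XMLFILE at a previously saved file before it starts its worker
processes. The helper below is a minimal sketch: the function name is
hypothetical and not part of hwloc, and it only checks that the file is
readable, not that it contains a valid topology.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <unistd.h>

// Hypothetical helper, not part of hwloc. If a previously exported XML
// topology is readable at `path`, point HWLOC_XMLFILE at it so that later
// topology loads in this process and in its children import that file.
// Returns 1 if the variable was set, 0 if the file is not readable
// (normal discovery then applies), and -1 if setenv() failed.
int sketch_use_cached_topology(const char *path)
{
    if (access(path, R_OK) != 0)
        return 0;
    return setenv("HWLOC_XMLFILE", path, 1) == 0 ? 1 : -1;
}
```

The variable has to be set before the topology is loaded. If the machine
configuration may have changed since the file was saved, the file should
be regenerated first.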
\section faq_onedim hwloc only has a one-dimensional view of the architecture, it ignores distances
hwloc places all objects in a tree. Each level is a one-dimensional
view of a set of similar objects. All children of the same object (siblings)
are assumed to be equally interconnected (same distance between any of them),
while the distance between children of different objects (cousins) is supposed
to be larger.
Modern machines exhibit complex hardware interconnects, so this tree
may miss some information about the actual physical distances between objects.
The hwloc topology may therefore be annotated with distance information that
may be used to build a more realistic, multi-dimensional representation
of each level.
For instance, the root object may contain a distance matrix that represents
the latencies between all pairs of NUMA nodes if the BIOS and/or operating
system reports them.
\section faq_smt How may I ignore symmetric multithreading, hyper-threading, ... ?
hwloc creates one PU (processing unit) object per hardware thread.
If your machine supports symmetric multithreading, for instance Hyper-Threading,
each Core object may contain multiple PU objects.
\verbatim
$ lstopo -
...
Core L#1
PU L#2 (P#1)
PU L#3 (P#3)
\endverbatim
If you need to ignore symmetric multithreading, you should likely manipulate
hwloc Core objects directly:
\verbatim
/* get the number of cores */
unsigned nbcores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
...
/* get the third core below the first socket */
hwloc_obj_t socket, core;
socket = hwloc_get_obj_by_type(topology, HWLOC_OBJ_SOCKET, 0);
core = hwloc_get_obj_inside_cpuset_by_type(topology, socket->cpuset,
HWLOC_OBJ_CORE, 2);
\endverbatim
Whenever you want to bind a process or thread to a core, make sure you
singlify its cpuset first, so that the task is actually bound to a single
thread within this core (to avoid useless migrations).
\verbatim
/* bind on the second core */
hwloc_obj_t core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, 1);
hwloc_cpuset_t set = hwloc_bitmap_dup(core->cpuset);
hwloc_bitmap_singlify(set);
hwloc_set_cpubind(topology, set, 0);
hwloc_bitmap_free(set);
\endverbatim
With the hwloc-calc or hwloc-bind command-line tools, you may specify that
you only want a single thread within each core by asking for their first
PU object:
\verbatim
$ hwloc-calc core:4-7
0x0000ff00
$ hwloc-calc core:4-7.pu:0
0x00005500
\endverbatim
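The two masks above are consistent with a machine that has two hardware
threads per core, with PU indexes numbered consecutively inside each core.
That layout is an assumption, since the example does not state it. Under
this assumption the arithmetic can be reproduced with the following
sketch, whose helper name is hypothetical and not part of hwloc:

```c
// Hypothetical helper, not part of hwloc. Assumes every core owns
// `pus_per_core` consecutive PU indexes (core c starts at index
// c * pus_per_core) and that all indexes fit in an unsigned long.
unsigned long sketch_core_range_mask(unsigned first_core, unsigned last_core,
                                     unsigned pus_per_core, int first_pu_only)
{
    unsigned long mask = 0;
    unsigned core, pu;

    for (core = first_core; core <= last_core; core++) {
        // Either keep every PU of the core or only its first one.
        unsigned count = first_pu_only ? 1 : pus_per_core;
        for (pu = 0; pu < count; pu++)
            mask |= 1UL << (core * pus_per_core + pu);
    }
    return mask;
}
```

Cores 4 to 7 then cover bits 8 to 15 (0x0000ff00), and keeping only the
first PU of each core leaves bits 8, 10, 12, and 14 (0x00005500).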
When binding a process on the command line, you may either specify
the exact thread that you want to use, or ask hwloc-bind to singlify
the cpuset before binding:
\verbatim
$ hwloc-bind core:3.pu:0 -- echo "hello from first thread on core #3"
hello from first thread on core #3
...
$ hwloc-bind core:3 --single -- echo "hello from a single thread on core #3"
hello from a single thread on core #3
\endverbatim
\section faq_asymmetric What happens if my topology is asymmetric?
hwloc supports asymmetric topologies even if most platforms are usually
symmetric. For example, there may be different types of processors
in a single machine, each with different numbers of cores, symmetric
multithreading, or levels of caches.
To understand how hwloc manages such cases, one should first remember
the meaning of levels and cousin objects. All objects of the same type
are gathered as horizontal levels with a given depth. They are also
connected through the cousin pointers of the hwloc_obj structure.
Some types, such as Caches or Groups, are usually annotated with a
depth or level attribute (for instance L2 cache). In this case, this
attribute is also taken into account when gathering objects as
horizontal levels. To be clear: there will be one level for L1
caches, another level for L2 caches, etc.
If the topology is asymmetric (e.g., if a cache is missing in one of
the processors), a given horizontal level will still exist if there
exist any objects of that type. However, some branches of the overall
tree may not have an object located in that horizontal level. Note
that this specific hole within one horizontal level does not imply
anything for other levels. All objects of the same type are gathered
in horizontal levels even if their parents or children have different
depths and types.
Moreover, it is important to understand that a single parent object may
have children of different types (and therefore, different
depths). These children are siblings (because they
have the same parent), but they are not cousins (because they
do not belong to the same horizontal level).
\section faq_annotate How do I annotate the topology with private notes?
Each hwloc object contains a userdata field that may be used by
applications to store private pointers. This field is kept intact as long
as the object is valid, which means as long as topology objects are not
modified by reloading or restricting the topology.
It is also possible to insert Misc objects with custom names anywhere
in the topology (hwloc_topology_insert_misc_object_by_cpuset())
or as a leaf of the topology (hwloc_topology_insert_misc_object_by_parent()).
\section faq_upgrade How do I handle API upgrades?
The hwloc interface is extended with every new major release.
Any application using the hwloc API should be prepared to check at
compile-time whether some features are available in the currently
installed hwloc distribution.
To check whether hwloc is at least 1.2, you should use:
\verbatim
#include <hwloc.h>
#if HWLOC_API_VERSION >= 0x00010200
...
#endif
\endverbatim
One of the major changes in hwloc 1.1 was the addition of the bitmap
API. It supersedes the now deprecated cpuset API, which will be removed
in a future hwloc release. It is strongly recommended to switch existing
code to the bitmap API. Keeping support for older hwloc versions is easy.
For instance, if your code uses hwloc_cpuset_alloc, you should use
hwloc_bitmap_alloc instead and add the following code to one of your
common headers:
\verbatim
#include <hwloc.h>
#if HWLOC_API_VERSION < 0x00010100
#define hwloc_bitmap_alloc hwloc_cpuset_alloc
#endif
\endverbatim
Similarly, the hwloc 1.0 interface may be detected by comparing
HWLOC_API_VERSION with 0x00010000.
hwloc 0.9 did not define any HWLOC_API_VERSION but this very old
release probably does not deserve support from your application anymore.
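The version constants quoted above (0x00010000, 0x00010100, 0x00010200)
suggest that the major number sits in bits 16-23 and the minor number in
bits 8-15. Assuming that layout, a small helper macro can make version
tests easier to read. The names below are hypothetical and are not defined
by hwloc.

```c
// Hypothetical helper macro, not defined by hwloc. It assumes the layout
// suggested by the values quoted above: major number in bits 16-23 and
// minor number in bits 8-15.
#define SKETCH_HWLOC_VERSION(major, minor) (((major) << 16) | ((minor) << 8))

// Decoders, for instance to print the version an application was built with.
unsigned sketch_version_major(unsigned long version)
{
    return (unsigned)((version >> 16) & 0xff);
}

unsigned sketch_version_minor(unsigned long version)
{
    return (unsigned)((version >> 8) & 0xff);
}
```

Because the macro expands to a constant integer expression, it can be used
in preprocessor tests such as
"#if HWLOC_API_VERSION >= SKETCH_HWLOC_VERSION(1, 2)".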
*/