[patch] Simple Topology API

[patch] Simple Topology API

Post by Matthew Dobson » Sun, 14 Jul 2002 09:40:04



Here is a very rudimentary topology API for NUMA systems.  It uses prctl() for
the userland calls, and exposes some useful things to userland.  It would be
nice to expose these simple structures to both users and the kernel itself.
Any architecture wishing to use this API simply has to write a .h file that
defines the 5 calls defined in core_ibmnumaq.h and include it in asm/mmzone.h.
Voila!  Instant inclusion in the topology!

Enjoy!

-Matt

[ 2.5.25-simple_topo.patch 11K ]
diff -Nur linux-2.5.25-vanilla/include/asm-i386/core_ibmnumaq.h linux-2.5.25-api/include/asm-i386/core_ibmnumaq.h
--- linux-2.5.25-vanilla/include/asm-i386/core_ibmnumaq.h       Wed Dec 31 16:00:00 1969
+++ linux-2.5.25-api/include/asm-i386/core_ibmnumaq.h   Thu Jul 11 13:58:25 2002
@@ -0,0 +1,62 @@
+/*
+ * linux/include/asm-i386/core_ibmnumaq.h
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.          
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT.  See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpa...@us.ibm.com>
+ */
+#ifndef _ASM_CORE_IBMNUMAQ_H_
+#define _ASM_CORE_IBMNUMAQ_H_
+
+/*
+ * These functions need to be defined for every architecture.
+ * The first five are necessary for the Memory Binding API to function.
+ * The last is needed by several pieces of NUMA code.
+ */
+
+
+/* Returns the number of the node containing CPU 'cpu' */
+#define _cpu_to_node(cpu) (cpu_to_logical_apicid(cpu) >> 4)
+
+/* Returns the number of the node containing MemBlk 'memblk' */
+#define _memblk_to_node(memblk) (memblk)
+
+/* Returns the number of the node containing Node 'nid'.  This architecture is flat,
+   so it is a pretty simple function! */
+#define _node_to_node(nid) (nid)
+
+/* Returns the number of the first CPU on Node 'node' */
+static inline int _node_to_cpu(int node)
+{
+       int i, cpu, logical_apicid = node << 4;
+
+       for(i = 1; i < 16; i <<= 1)
+               if ((cpu = logical_apicid_to_cpu(logical_apicid | i)) >= 0)
+                       return cpu;
+
+       return 0;
+}
+
+/* Returns the number of the first MemBlk on Node 'node' */
+#define _node_to_memblk(node) (node)
+
+#endif /* _ASM_CORE_IBMNUMAQ_H_ */
diff -Nur linux-2.5.25-vanilla/include/asm-i386/mmzone.h linux-2.5.25-api/include/asm-i386/mmzone.h
--- linux-2.5.25-vanilla/include/asm-i386/mmzone.h      Wed Dec 31 16:00:00 1969
+++ linux-2.5.25-api/include/asm-i386/mmzone.h  Fri Jul 12 16:10:43 2002
@@ -0,0 +1,49 @@
+/*
+ * linux/include/asm-i386/mmzone.h
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.          
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT.  See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpa...@us.ibm.com>
+ */
+#ifndef _ASM_MMZONE_H_
+#define _ASM_MMZONE_H_
+
+#include <asm/smpboot.h>
+
+#ifdef CONFIG_IBMNUMAQ
+#include <asm/core_ibmnumaq.h>
+/* Other architectures wishing to use this simple topology API should fill
+   in the below functions as appropriate in their own <arch>.h file. */
+#else /* !CONFIG_IBMNUMAQ */
+
+#define _cpu_to_node(cpu)      (0)
+#define _memblk_to_node(memblk)        (0)
+#define _node_to_node(nid)     (0)
+#define _node_to_cpu(node)     (0)
+#define _node_to_memblk(node)  (0)
+
+#endif /* CONFIG_IBMNUMAQ */
+
+/* Returns the number of the current Node. */
+#define numa_node_id()         (_cpu_to_node(smp_processor_id()))
+
+#endif /* _ASM_MMZONE_H_ */
diff -Nur linux-2.5.25-vanilla/include/linux/membind.h linux-2.5.25-api/include/linux/membind.h
--- linux-2.5.25-vanilla/include/linux/membind.h        Wed Dec 31 16:00:00 1969
+++ linux-2.5.25-api/include/linux/membind.h    Fri Jul 12 16:31:30 2002
@@ -0,0 +1,38 @@
+/*
+ * linux/include/linux/membind.h
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.          
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT.  See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpa...@us.ibm.com>
+ */
+#ifndef _LINUX_MEMBIND_H_
+#define _LINUX_MEMBIND_H_
+
+int cpu_to_node(int);
+int memblk_to_node(int);
+int node_to_node(int);
+int node_to_cpu(int);
+int node_to_memblk(int);
+int get_curr_cpu(void);
+int get_curr_node(void);
+
+#endif /* _LINUX_MEMBIND_H_ */
diff -Nur linux-2.5.25-vanilla/include/linux/prctl.h linux-2.5.25-api/include/linux/prctl.h
--- linux-2.5.25-vanilla/include/linux/prctl.h  Fri Jul  5 16:42:28 2002
+++ linux-2.5.25-api/include/linux/prctl.h      Wed Jul 10 13:58:17 2002
@@ -26,4 +26,17 @@
 # define PR_FPEMU_NOPRINT      1       /* silently emulate fp operations accesses */
 # define PR_FPEMU_SIGFPE       2       /* don't emulate fp operations, send SIGFPE instead */

+/* Get CPU/Node */
+#define PR_GET_CURR_CPU                13
+#define PR_GET_CURR_NODE       14
+
+/* XX to Node conversion functions */
+#define PR_CPU_TO_NODE         15
+#define PR_MEMBLK_TO_NODE              16
+#define PR_NODE_TO_NODE                17
+
+/* Node to XX conversion functions */
+#define PR_NODE_TO_CPU         18
+#define PR_NODE_TO_MEMBLK              19
+
 #endif /* _LINUX_PRCTL_H */
diff -Nur linux-2.5.25-vanilla/kernel/membind.c linux-2.5.25-api/kernel/membind.c
--- linux-2.5.25-vanilla/kernel/membind.c       Wed Dec 31 16:00:00 1969
+++ linux-2.5.25-api/kernel/membind.c   Fri Jul 12 16:13:17 2002
@@ -0,0 +1,130 @@
+/*
+ * linux/kernel/membind.c
+ *
+ * Written by: Matthew Dobson, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.          
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT.  See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <colpa...@us.ibm.com>
+ */
+#include <linux/kernel.h>
+#include <linux/unistd.h>
+#include <linux/config.h>
+#include <linux/sched.h>
+#include <linux/membind.h>
+#include <linux/mmzone.h>
+#include <linux/errno.h>
+#include <linux/smp.h>
+
+extern unsigned long memblk_online_map;
+
+/*
+ * cpu_to_node(cpu): Returns the number of the most specific Node
+ * containing CPU 'cpu'.
+ */
+inline int cpu_to_node(int cpu)
+{
+       if (cpu == -1)     /* return highest numbered node */
+               return (numnodes - 1);
+
+       if ((cpu < 0) || (cpu >= NR_CPUS) ||
+           (!(cpu_online_map & (1 << cpu))))   /* invalid cpu # */
+               return -ENODEV;
+
+       return _cpu_to_node(cpu);
+}
+
+/*
+ * memblk_to_node(memblk): Returns the number of the most specific Node
+ * containing Memory Block 'memblk'.
+ */
+inline int memblk_to_node(int memblk)
+{
+       if (memblk == -1)  /* return highest numbered node */
+               return (numnodes - 1);
+
+       if ((memblk < 0) || (memblk >= NR_MEMBLKS) ||
+           (!(memblk_online_map & (1 << memblk))))   /* invalid memblk # */
+               return -ENODEV;
+
+       return _memblk_to_node(memblk);
+}
+
+/*
+ * node_to_node(nid): Returns the number of the most specific Node that
+ * encompasses Node 'nid'.  Some may call this the parent Node of 'nid'.
+ */
+int node_to_node(int nid)
+{
+       if ((nid < 0) || (nid >= numnodes))   /* invalid node # */
+               return -ENODEV;
+
+       return _node_to_node(nid);
+}
+
+/*
+ * node_to_cpu(nid): Returns the lowest numbered CPU on Node 'nid'
+ */
+inline int node_to_cpu(int nid)
+{
+       if (nid == -1)  /* return highest numbered cpu */
+               return (num_online_cpus() - 1);
+
+       if ((nid < 0) || (nid >= numnodes))   /* invalid node # */
+               return -ENODEV;
+
+       return _node_to_cpu(nid);
+}
+
+/*
+ * node_to_memblk(nid): Returns the lowest numbered MemBlk on Node 'nid'
+ */
+inline int node_to_memblk(int nid)
+{
+       if (nid == -1)  /* return highest numbered memblk */
+               return (num_online_memblks() - 1);
+
+       if ((nid < 0) || (nid >= numnodes))   /* invalid node # */
+               return -ENODEV;
+
+       return _node_to_memblk(nid);
+}
+
+/*
+ * get_curr_cpu(): Returns the currently executing CPU number.
+ * For now, this has only mild usefulness, as this information could
+ * change on the return from syscall (which automatically calls schedule()).
+ * Due to this, the data could be stale by the time it gets back to the user.
+ ...

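For illustration, here is roughly how userland might query the topology
through the new prctl() commands.  This is only a sketch: the sys_prctl()
hunk is truncated above, so the calling convention assumed here (object
number in the second prctl() argument, result returned as the syscall's
return value) is a guess, not something the posted patch confirms.

	#include <stdio.h>
	#include <sys/prctl.h>

	/* Mirror the values from the prctl.h hunk above; stock headers
	   will not have them until this patch is applied. */
	#ifndef PR_GET_CURR_CPU
	#define PR_GET_CURR_CPU	13
	#define PR_CPU_TO_NODE	15
	#endif

	int main(void)
	{
		int cpu  = prctl(PR_GET_CURR_CPU, 0, 0, 0, 0);
		int node = prctl(PR_CPU_TO_NODE, cpu, 0, 0, 0);

		printf("running on cpu %d, node %d\n", cpu, node);
		return 0;
	}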

 
 
 

[patch] Simple Topology API

Post by Andrew Morton » Sun, 14 Jul 2002 11:50:04



> Here is a very rudimentary topology API for NUMA systems.  It uses prctl() for
> the userland calls, and exposes some useful things to userland.  It would be
> nice to expose these simple structures to both users and the kernel itself.
> Any architecture wishing to use this API simply has to write a .h file that
> defines the 5 calls defined in core_ibmnumaq.h and include it in asm/mmzone.h.

Matt,

I suspect what happens when these patches come out is that most people simply
don't have the knowledge/time/experience/context to judge them, and nothing
ends up happening.  No way would I pretend to be able to comment on the
big picture, that's for sure.

If the code is clean, the interfaces make sense, the impact on other
platforms is minimised and the stakeholders are OK with it then that
should be sufficient, yes?

AFAIK, the interested parties with this and the memory binding API are
ia32-NUMA, ia64, PPC, some MIPS and x86-64-soon.  It would be helpful
if the owners of those platforms could review this work and say "yes,
this is something we can use and build upon".  Have they done that?

I'd have a few micro-observations:

> ...
> --- linux-2.5.25-vanilla/kernel/membind.c       Wed Dec 31 16:00:00 1969
> +++ linux-2.5.25-api/kernel/membind.c   Fri Jul 12 16:13:17 2002
> ..
> +inline int memblk_to_node(int memblk)

The inlines with global scope in this file seem strange?


> Here is a Memory Binding API
> ...
> +    memblk_binding:    { MEMBLK_NO_BINDING, MPOL_STRICT },             \
> ...
> +typedef struct memblk_list {
> +       memblk_bitmask_t bitmask;
> +       int behavior;
> +       rwlock_t lock;
> +} memblk_list_t;

Is it possible to reduce this type to something smaller for
CONFIG_NUMA=n?

In the above task_struct initialiser you should initialise the
rwlock to RWLOCK_LOCK_UNLOCKED.

It's nice to use the `name:value' initialiser format in there, too.
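
Something like this, perhaps (a sketch only, using the field names from
the quoted memblk_list initialiser; note that the 2.5 initialiser macro
is spelled RW_LOCK_UNLOCKED):

	/* in the INIT_TASK() initialiser */
	memblk_binding: {				\
		bitmask:	MEMBLK_NO_BINDING,	\
		behavior:	MPOL_STRICT,		\
		lock:		RW_LOCK_UNLOCKED	\
	},						\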

> ...
> +int set_memblk_binding(memblk_bitmask_t memblks, int behavior)
> +{
> ...
> +       read_lock_irqsave(&current->memblk_binding.lock, flags);

Your code accesses `current' a lot.  You'll find that the code
generation is fairly poor - evaluating `current' chews 10-15
bytes of code.  You can perform a manual CSE by copying current
into a local, and save a few cycles.
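
For instance (the function body here is invented purely to show the
pattern; only the memblk_binding field names come from the quoted patch):

	static int get_memblk_behavior(void)
	{
		struct task_struct *me = current;	/* evaluate `current' once */
		unsigned long flags;
		int behavior;

		read_lock_irqsave(&me->memblk_binding.lock, flags);
		behavior = me->memblk_binding.behavior;
		read_unlock_irqrestore(&me->memblk_binding.lock, flags);

		return behavior;
	}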

> ...
> +struct page * _alloc_pages(unsigned int gfp_mask, unsigned int order)
> +{
> ...
> +       spin_lock_irqsave(&node_lock, flags);
> +       temp = pgdat_list;
> +       spin_unlock_irqrestore(&node_lock, flags);

Not sure what you're trying to lock here, but you're not locking
it ;)  This is either racy code or unneeded locking.

Thanks.

 
 
 

[patch] Simple Topology API

Post by Andi Kleen » Mon, 15 Jul 2002 05:20:04



> AFAIK, the interested parties with this and the memory binding API are
> ia32-NUMA, ia64, PPC, some MIPS and x86-64-soon.  It would be helpful
> if the owners of those platforms could review this work and say "yes,
> this is something we can use and build upon".  Have they done that?

Comment from the x86-64 side:

Current x86-64 NUMA essentially has no 'nodes', just each CPU has
local memory that is slightly faster than remote memory. This means
the node number would always be identical to the CPU number. As long
as the API provides that, it's ok for me. Just the node concept will not be
very useful on that platform. memblk will also be identity mapped to
node/cpu.

Some way to tell user space about memory affinity seems to be useful,
but...

General comment:

I don't see what the application should do with the memblk concept
currently. Just knowing about it doesn't seem too useful.
Surely it needs some way to allocate memory in a specific memblk to be useful?
Also doesn't it need to know how much memory is available in each memblk?
(otherwise I don't see how it could do any useful partitioning)

-Andi


 
 
 

[patch] Simple Topology API

Post by Linus Torvalds » Tue, 16 Jul 2002 04:20:09


[ I've been off-line for a week, so I didn't follow all of the discussion,
  but here goes anyway ]



> Current x86-64 NUMA essentially has no 'nodes', just each CPU has
> local memory that is slightly faster than remote memory. This means
> the node number would always be identical to the CPU number. As long
> as the API provides that, it's ok for me. Just the node concept will not be
> very useful on that platform. memblk will also be identity mapped to
> node/cpu.

The whole "node" concept sounds broken. There is no such thing as a node,
since even within nodes latencies will easily differ for different CPU's
if you have local memories for CPU's within a node (which is clearly the
only sane thing to do).

If you want to model memory behaviour, you should have memory descriptors
(in linux parlance, "zone_t") have an array of latencies to each CPU. That
latency is _not_ a "is this memory local to this CPU" kind of number, that
simply doesn't make any sense. The fact is, what matters is the number of
hops. Maybe you want to allow one hop, but not five.

Then, make the memory binding interface a function of just what kind of
latency you allow from a set X of CPU's. Simple, straightforward, and it
has a direct meaning in real life, which makes it unambiguous.

So your "memory affinity" system call really needs just one number: the
acceptable latency. You may also want to have a CPU-set argument, although
I suspect that it's equally correct to just assume that the CPU-set is the
set of CPU's that the process can already run on.

After that, creating a new zone array is nothing more than:

 - give each zone a "latency value", which is simply the minimum of all
   the latencies for that zone from CPU's that are in the CPU set.

 - sort the zone array, lowest latency first.

 - the passed-in latency is the cut-off-point - clear the end of the
   array (with the sanity check that you always accept one zone, even if
   it happens to have a latency higher than the one passed in).

End result: you end up with a priority-sorted array of acceptable zones.
In other words, a zone list, which is _exactly_ what you want anyway
(that's what the current "zone_table" is).

And then you associate that zone-list with the process, and use that
zone-list for all process allocations.
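
In code, that construction might look something like this (purely a
sketch: zone_cpu_latency(), the scratch array and the function itself
are invented for illustration, they are not existing 2.5 interfaces):

	struct zone_lat {
		zone_t *zone;
		int latency;
	};

	static int build_latency_zonelist(zone_t **zones, int nr_zones,
					  unsigned long cpuset, int cutoff,
					  zone_t **list)
	{
		struct zone_lat z[64];	/* arbitrary bound, fine for a sketch */
		int i, j, n = 0;

		/* latency of a zone = minimum latency from any CPU in the set */
		for (i = 0; i < nr_zones; i++) {
			int cpu, lat = 1 << 30;

			for (cpu = 0; cpu < NR_CPUS; cpu++) {
				if (!(cpuset & (1UL << cpu)))
					continue;
				if (zone_cpu_latency(zones[i], cpu) < lat)
					lat = zone_cpu_latency(zones[i], cpu);
			}
			z[n].zone = zones[i];
			z[n].latency = lat;
			n++;
		}

		/* sort lowest latency first */
		for (i = 1; i < n; i++)
			for (j = i; j > 0 && z[j].latency < z[j-1].latency; j--) {
				struct zone_lat tmp = z[j];
				z[j] = z[j-1];
				z[j-1] = tmp;
			}

		/* cut off above the allowed latency, but always keep one zone */
		for (i = 0; i < n; i++) {
			if (i > 0 && z[i].latency > cutoff)
				break;
			list[i] = z[i].zone;
		}
		return i;		/* length of the resulting zone list */
	}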

Advantages:

 - very direct mapping to what the hardware actually does

 - no complex data structures for topology

 - works for all topologies, the process doesn't even have to know, you
   can trivially encode it all internally in the kernel by just having the
   CPU latency map for each memory zone we know about.

Disadvantages:

 - you cannot create "crazy" memory bindings. You can only say "I don't
   want to allocate from slow memory". You _can_ do crazy things by
   initially using a different CPU binding, then doing the memory
   binding, and then re-doing the CPU binding. So if you _want_ bad memory
   bindings you can create them, but you have to work at it.

 - we have to use some standard latency measure, either purely time-based
   (which changes from machine to machine), or based on some notion of
   "relative to local memory".

My personal suggestion would be the "relative to local memory" thing, and
call that 10 units. So a cross-CPU (but same module) hop might imply a
latency of 15, which a memory access that goes over the backbone between
modules might be a 35. And one that takes two hops might be 55.

So then, for each CPU in a machine, you can _trivially_ create the mapping
from each memory zone to that CPU. And that's all you really care about.

No?

                Linus


 
 
 

[patch] Simple Topology API

Post by Andi Kleen » Tue, 16 Jul 2002 04:50:09



> The whole "node" concept sounds broken. There is no such thing as a node,
> since even within nodes latencies will easily differ for different CPU's
> if you have local memories for CPU's within a node (which is clearly the
> only sane thing to do).

I basically agree, but then when you go for a full graph everything
becomes very complex. It's not clear if that much detail is useful
for the application.

> latency is _not_ a "is this memory local to this CPU" kind of number, that
> simply doesn't make any sense. The fact is, what matters is the number of
> hops. Maybe you want to allow one hop, but not five.

> Then, make the memory binding interface a function of just what kind of
> latency you allow from a set X of CPU's. Simple, straightforward, and it
> has a direct meaning in real life, which makes it unambiguous.

Hmm - that could be a problem for applications that care less about
latency, but more about equal use of bandwidth (see below).
They just want their data structures to be spread out evenly over
all the available memory controllers. I don't see how that could be
done with a single latency value; you really need some more complete
idea about the topology.

At least on Hammer the latency difference is small enough that
caring about the overall bandwidth makes more sense.

> And then you associate that zone-list with the process, and use that
> zone-list for all process allocations.

That's the basic idea sure for normal allocations from applications
that do not care much about NUMA.

But "numa aware" applications want to do other things like:
- put some memory area into every node (e.g. for the numa equivalent of
per CPU data in the kernel)
- "stripe" a shared memory segment over all available memory subsystems
(e.g. to use memory bandwidth fully if you know your interconnect can
take it; that's e.g. the case on the Hammer)

As I understood it, this API is supposed to be the base of such a
NUMA API for applications (it just offers the information, with no way
to use it usefully yet).

More comments from the NUMA gurus please.

-Andi


 
 
 

[patch] Simple Topology API

Post by Eric W. Biederman » Tue, 16 Jul 2002 11:50:05



> At least on Hammer the latency difference is small enough that
> caring about the overall bandwidth makes more sense.

I agree.  I will have to look closer but unless there is more
juice than I have seen in Hyper-Transport it is going to become
one of the architectural bottlenecks of the Hammer.

Currently you get 1600MB/s in a single direction.  Not too bad.
But when the memory controllers get out to dual channel DDR-II 400,
the local bandwidth to that memory is 6400MB/s, and the bandwidth to
remote memory 1600MB/s, or 3200MB/s (if reads are as common as
writes).  
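
(For reference, the 6400MB/s figure is just the dual channel DDR-II 400
arithmetic: 2 channels x 8 bytes per transfer x 400 MT/s = 6400MB/s.)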

So I suspect bandwidth intensive applications will really benefit
from local memory optimization on the Hammer.  I can buy that the
latency is negligible, the fact the links don't appear to scale
in bandwidth as well as the connection to memory may be a bigger
issue.

> > And then you associate that zone-list with the process, and use that
> > zone-list for all process allocations.

> That's the basic idea sure for normal allocations from applications
> that do not care much about NUMA.

> But "numa aware" applications want to do other things like:
> - put some memory area into every node (e.g. for the numa equivalent of
> per CPU data in the kernel)
> - "stripe" a shared memory segment over all available memory subsystems
> (e.g. to use memory bandwidth fully if you know your interconnect can
> take it; that's e.g. the case on the Hammer)

The latter I really quite believe.  Even dual channel PC2100 can
exceed your interprocessor bandwidth.

And yes I have measured 2000MB/s memory copy with an Athlon MP and
PC2100 memory.

Eric

 
 
 

[patch] Simple Topology API

Post by Sandy Harris » Wed, 17 Jul 2002 01:30:06




> > At least on Hammer the latency difference is small enough that
> > caring about the overall bandwidth makes more sense.

> I agree.  I will have to look closer but unless there is more
> juice than I have seen in Hyper-Transport it is going to become
> one of the architectural bottlenecks of the Hammer.

> Currently you get 1600MB/s in a single direction.

That's on an 8-bit channel, as used on Clawhammer (AMD's lower cost
CPU for desktop market). The spec allows 2, 4, 8, 16 or 32-bit
channels. If I recall correctly, the AMD presentation at OLS said
Sledgehammer (server market) uses 16-bit.

> Not too bad.
> But when the memory controllers get out to dual channel DDR-II 400,
> the local bandwidth to that memory is 6400MB/s, and the bandwidth to
> remote memory 1600MB/s, or 3200MB/s (if reads are as common as
> writes).

> So I suspect bandwidth intensive applications will really benefit
> from local memory optimization on the Hammer.  I can buy that the
> latency is negligible,

I'm not so sure. Clawhammer has two links, can do dual-CPU. One link
to the other CPU, one for I/O. Latency may well be negligible there.

Sledgehammer has three links, can do no-glue 4-way with each CPU
using two links to talk to others, one for I/O.

    I/O -- A ------ B -- I/O
           |        |
           |        |
    I/O -- C ------ D -- I/O

They can also go to no-glue 8-way:

    I/O -- A ------ B ------ E ------ G -- I/O
           |        |        |        |
           |        |        |        |
    I/O -- C ------ D ------ F ------ H -- I/O

I suspect latency may become an issue when more than one link is
involved and there can be contention.

Beyond 8-way, you need glue logic (hypertransport switches?) and
latency seems bound to become an issue.

> the fact the links don't appear to scale
> in bandwidth as well as the connection to memory may be a bigger
> issue.

 
 
 

[patch] Simple Topology API

Post by Chris Friesen » Wed, 17 Jul 2002 01:40:10



> I suspect latency may become an issue when more than one link is
> involved and there can be contention.

According to the AMD talk at OLS, worst case on a 4-way is better than current
best-case on a uniprocessor athlon.

> Beyond 8-way, you need glue logic (hypertransport switches?) and
> latency seems bound to become an issue.

Nope.  Just extend the ladder.  Each cpu talks to three other entities, either
cpu or I/O.  Can be extended arbitrarily until latencies are too high.

Chris

--
Chris Friesen                    | MailStop: 043/33/F10  
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986


 
 
 

[patch] Simple Topology API

Post by Matthew Dobson » Wed, 17 Jul 2002 03:00:08




>>AFAIK, the interested parties with this and the memory binding API are
>>ia32-NUMA, ia64, PPC, some MIPS and x86-64-soon.  It would be helpful
>>if the owners of those platforms could review this work and say "yes,
>>this is something we can use and build upon".  Have they done that?

> Comment from the x86-64 side:

> Current x86-64 NUMA essentially has no 'nodes', just each CPU has
> local memory that is slightly faster than remote memory. This means
> the node number would always be identical to the CPU number. As long
> as the API provides that, it's ok for me. Just the node concept will not be
> very useful on that platform. memblk will also be identity mapped to
> node/cpu.

> Some way to tell user space about memory affinity seems to be useful,
> but...

That shouldn't be a problem at all.  Since each architecture is responsible for
defining the 5 main topology functions, you could do this:

#define _cpu_to_node(cpu)       (cpu)
#define _memblk_to_node(memblk) (memblk)
#define _node_to_node(node)     (node)
#define _node_to_cpu(node)      (node)
#define _node_to_memblk(node)   (node)

> General comment:

> I don't see what the application should do with the memblk concept
> currently. Just knowing about it doesn't seem too useful.
> Surely it needs some way to allocate memory in a specific memblk to be useful?
> Also doesn't it need to know how much memory is available in each memblk?
> (otherwise I don't see how it could do any useful partitioning)

For that, you need to look at the Memory Binding API that I sent out moments
after this patch...  It builds on top of this infrastructure to allow binding
processes to individual memory blocks or groups of memory blocks.

Cheers!

-Matt

> -Andi


 
 
 

[patch] Simple Topology API

Post by Matthew Dobson » Wed, 17 Jul 2002 04:00:09


 > Matt,
 >
 > I suspect what happens when these patches come out is that most people simply
 > don't have the knowledge/time/experience/context to judge them, and nothing
 > ends up happening.  No way would I pretend to be able to comment on the
 > big picture, that's for sure.
Absolutely correct.  I know that most people here on LKML don't have 8, 16, 32,
or more CPU systems to test this code on, or for that matter, even care about
code designed for said systems.  I'm lucky enough to get to work on such
machines, and I'm sure there are others out there (as evidenced by some of the
replies I've gotten) that do care.  Also, there are publicly available NUMA
machines at the OSDL that people can use to "play" on large systems.  I hope
that by seeing code and using these systems, some more people might get
interested in some of the interesting scalability issues that crop up with
these machines.

 > If the code is clean, the interfaces make sense, the impact on other
 > platforms is minimised and the stakeholders are OK with it then that
 > should be sufficient, yes?
I would hope so.  That's what I'm trying to establish! ;)

 > AFAIK, the interested parties with this and the memory binding API are
 > ia32-NUMA, ia64, PPC, some MIPS and x86-64-soon.  It would be helpful
 > if the owners of those platforms could review this work and say "yes,
 > this is something we can use and build upon".  Have they done that?
I've gotten some feedback from large systems people.  I hope to get feedback
from anyone with large systems that could potentially use this kind of API, and
get a "this is great" or a "this sucks".  I believe that bigger systems need
new ways to improve efficiency and scalability beyond what the kernel offers now.
I know I do...

 > I'd have a few micro-observations:
 >
 >>...
 >>--- linux-2.5.25-vanilla/kernel/membind.c       Wed Dec 31 16:00:00 1969
 >>+++ linux-2.5.25-api/kernel/membind.c   Fri Jul 12 16:13:17 2002
 >>..
 >>+inline int memblk_to_node(int memblk)
 >
 >
 > The inlines with global scope in this file seem strange?
 >
 >

 >
 >>Here is a Memory Binding API
 >>...
 >>+    memblk_binding:    { MEMBLK_NO_BINDING, MPOL_STRICT },             \
 >
 >
 >>...
 >>+typedef struct memblk_list {
 >>+       memblk_bitmask_t bitmask;
 >>+       int behavior;
 >>+       rwlock_t lock;
 >>+} memblk_list_t;
 >
 >
 > Is it possible to reduce this type to something smaller for
 > CONFIG_NUMA=n?
Probably...  I'll look at that today...

 > In the above task_struct initialiser you should initialise the
 > rwlock to RWLOCK_LOCK_UNLOCKED.
Yep..  Totally forgot about that! :(

 > It's nice to use the `name:value' initialiser format in there, too.
Sure, enhanced readability is always a good thing!

 >>...
 >>+int set_memblk_binding(memblk_bitmask_t memblks, int behavior)
 >>+{
 >>...
 >>+       read_lock_irqsave(&current->memblk_binding.lock, flags);
 >
 >
 > Your code accesses `current' a lot.  You'll find that the code
 > generation is fairly poor - evaluating `current' chews 10-15
 > bytes of code.  You can perform a manual CSE by copying current
 > into a local, and save a few cycles.
Sure..  I've actually gotten a couple different ideas about improving the
efficiency of that function, and will also be rewriting that today..

 >>...
 >>+struct page * _alloc_pages(unsigned int gfp_mask, unsigned int order)
 >>+{
 >>...
 >>+       spin_lock_irqsave(&node_lock, flags);
 >>+       temp = pgdat_list;
 >>+       spin_unlock_irqrestore(&node_lock, flags);
 >
 >
 > Not sure what you're trying to lock here, but you're not locking
 > it ;)  This is either racy code or unneeded locking.
To be honest, I'm not entirely sure what that's locking either.  That is the
non-NUMA path of that function, and the locking was in the original code, so I
just moved it along.  After doing a bit of searching, that lock seems
COMPLETELY useless there.  Especially since in the original function, a few
lines further down pgdat_list is read again, without the lock!  I guess, unless
someone here says otherwise, I'll pull that locking out of the next rev.

Thanks for all the feedback.  I'll incorporate most of it into the next rev of
the patch!

Cheers!

-Matt



 
 
 

[patch] Simple Topology API

Post by Jukka Honkela » Wed, 17 Jul 2002 05:00:06



>> Beyond 8-way, you need glue logic (hypertransport switches?) and
>> latency seems bound to become an issue.
>Nope.  Just extend the ladder.  Each cpu talks to three other entities,
>either cpu or I/O.  Can be extended arbitrarily until latencies are too
>high.

You seem to be missing one critical piece from the OLS talk. The HT
protocol (or something related) can't handle more than 8 CPU's in a single
configuration. You need to have some kind of bridge to connect
more than 8 CPU's together, although systems with more than 8 CPU's have
not been discussed officially anywhere, afaik.

8 CPU's and less belongs to the SUMO category (Sufficiently Uniform Memory
Organization, apparently new AMD terminology) whereas 9 CPU's and more is
likely to be NUMA.

--
Jukka Honkela


 
 
 

[patch] Simple Topology API

Post by Eric W. Biederman » Wed, 17 Jul 2002 19:50:08





> > > At least on Hammer the latency difference is small enough that
> > > caring about the overall bandwidth makes more sense.

> > I agree.  I will have to look closer but unless there is more
> > juice than I have seen in Hyper-Transport it is going to become
> > one of the architectural bottlenecks of the Hammer.

> > Currently you get 1600MB/s in a single direction.

> That's on an 8-bit channel, as used on Clawhammer (AMD's lower cost
> CPU for desktop market). The spec allows 2, 4, 8, 16 or 32-bit
> channels. If I recall correctly, the AMD presentation at OLS said
> Sledgehammer (server market) uses 16-bit.

Thanks, my confusion.  The danger of having more bandwidth to memory
than to other processors is still present, but it may be one of those
places where the CPU designers are able to stay one step ahead of
the problem.  I will definitely agree the problem goes away for the
short term with a 32-bit link.


> > Not too bad.
> > But when the memory controllers get out to dual channel DDR-II 400,
> > the local bandwidth to that memory is 6400MB/s, and the bandwidth to
> > remote memory 1600MB/s, or 3200MB/s (if reads are as common as
> > writes).

> > So I suspect bandwidth intensive applications will really benefit
> > from local memory optimization on the Hammer.  I can buy that the
> > latency is negligible,

> I'm not so sure. Clawhammer has two links, can do dual-CPU. One link
> to the other CPU, one for I/O. Latency may well be negligible there.

> Sledgehammer has three links, can do no-glue 4-way with each CPU
> using two links to talk to others, one for I/O.

>     I/O -- A ------ B -- I/O
>            |        |
>            |        |
>     I/O -- C ------ D -- I/O

> They can also go to no-glue 8-way:

>     I/O -- A ------ B ------ E ------ G -- I/O
>            |        |        |        |
>            |        |        |        |
>     I/O -- C ------ D ------ F ------ H -- I/O
> I suspect latency may become an issue when more than one link is
> involved and there can be contention.

I think the 8-way topology is a little more interesting than
presented.  But if not, it does look like you can run into issues.
The more I look at it, the more there appears to be a strong dynamic balance
in the architecture between having just enough bandwidth, and low
enough latency not to become a bottleneck, and having a low hardware
cost.

> Beyond 8-way, you need glue logic (hypertransport switches?) and
> latency seems bound to become an issue.

Beyond 8-way you get into another system architecture entirely, which
should be considered on its own merits.  In large part cache
directories and other very sophisticated techniques are needed when
you scale a system beyond the SMP point.  As long as the inter-cpu
bandwidth is >= the memory bandwidth on a single memory controller
Hammer can probably get away with being just a better SMP, and not
really a NUMA design.

Eric