Scaling Linux to 512 and Beyond
===============================

Mark Goodwin, SGI Engineering

Basic scalability is there; working on more advanced things.

NUMA Overview

* SMP hits a bus bandwidth bottleneck at ~64 processors
* NUMA scalability depends on interconnect speed/topology
  - remote access is typically 1.9x to 3.5x local latency, with comparable bandwidth
  - interconnects try to parallelise access
  - SGI protocol is ~1000 pages, very complex, very proprietary
* Always cache-coherent (CC) global shared memory (GSM)
  - much simplified compared to message passing
  - cache is always local
* Stanford DASH -- beginning of NUMA

Altix topologies

- ring; cheapest (NUMALink), no routers
- 32p -- fat tree topology. routers and meta-routers (between routers)
- out of the research community
- 512p fat tree topology -- NASA's machine is 20 of these
- Visualisation tool came out of Melbourne. At 512p the drawing algorithms broke down. Simulated annealing algorithm (?) minimum energy. Links never cross over.

ccNUMA

* All memory access goes through the cache.
* An L3 miss goes to main memory, either local or remote.
* The memory controller handles cache and interconnect.
* ccNUMA encodes the node address in the physical address; the memory controller knows the "home node" for each PA.
* Directory uses 1/32 of local memory on each node.
  - one directory entry for every cache line of physical memory
  - directory says whether each cache line is exclusive, shared or unowned
  - directory says where the cache line is; it may be present in any CPU on any node
  - coherency protocol and state machine details are proprietary; up to 16 transitional states in the cc protocol.

ccNUMA scalability considerations

* not just hardware; software needs to work with hardware and interconnect
* always use cache friendly declarations
  - global locks should never be in the same cache line
  - avoid false sharing: if one node takes the first lock and another node goes for the second lock, the shared cache line bounces between them.
* node local (private data areas (PDA)) and CPU local allocation
  - keep stuff where you know you will need it; avoid locks
  - per_cpu() tied to TLB in ia64
  - kmem_alloc_node()
  - "first touch" allocations; keep further allocations where first touched
* if local node is full, spill "nearby"
  - ACPI SLIT table shows layout.
  - anonymous pages such as buffer cache go round robin, as they are not associated with a particular task.
  - clean buffer cache pages are treated as free memory on the local node
  - this currently needs some work; NASA reports their 4TB machine swapping, for example.
* cache line colour
  - cache colouring helps with cache friendliness; structures with different colours don't overlap.

Performance tools and application tuning considerations

* SGI customers require global shared memory rather than MPI
  - apps are not cache friendly, with big memory footprints
  - reproducible runtimes/results; have to show customers before they buy
  - thread/memory locality is critical
  - MTBF falls as machines get bigger; must be fault tolerant. Taking out one node can kill an entire benchmark.
* To solve this sort of thing you need good performance monitoring tools.
* dplace, runon, cpusets, checkpoint/restart
* complicated by multiple threads/core, cores/socket, sockets/node
  - some tools can't handle hyperthreads; involves a whole new range of issues. SGI sometimes disables them.
* profiling tools
  - pfmon, histx, gprof, lockstat, kprof, vtune
  - these come into play when you notice things like high levels of local cache misses
* system perf tools
  - Performance Co-Pilot, pmshub, linkstat, gtopology
  - gtopology can show animated link-stats

Scalability for Altix first release

* 2.4.19 kernel; patch nightmare
  - not working with community at first
  - patches must be accepted by the community
* discontig memory, EFI mem map, leverage early Linux/MIPS
  - physical address space is not contiguous due to nodes.
  - started before hardware (IA64) arrived
* extend device/nic naming; hundreds of disks/net interfaces
  - devfs, related to physical location
  - in 2.6 have ditched devfs; too many deadlocks, etc.
* bitmasks need to be wider than one unsigned long; converted from unsigned long to unsigned long[]
* things like proc assumed there would only be single digits for cpu numbers.
  - /proc/interrupts memory corruption because of too many interrupts
* tunables such as default number of file descriptors.
* lock contention; esp. page and dentry cache. pmshub helped eliminate false sharing or large global locks.
* big memory issues -- tcp hash table sized as % of memory
* interrupt/cpu affinity
  - sn_topology helps map pci busses to nodes
* O(1) scheduler required for 2.4; scheduler runqueues caused livelocks.
  - very difficult to debug as the whole machine stops
  - load balancing back-off from linear to exponential
  - per-cpu task migration threads
* Memory controller had a bug, requiring global TLB purge serialisation. Confused things and caused a big problem.
* Need ways of poisoning memory
  - Slab allocator does it, need more

Ongoing work

* major issues and low hanging fruit fixed
* move to 2.6, move devfs to udev, NPTL, new chipsets
* increased max nodes to 16; 2 sockets per node, 2 cpus per socket, 2 cores per CPU, 2 threads per CPU. adds up to a lot of CPUs.
* manual page migration
  - apac contract(?)
  - want to be able to stop a job to run a bigger one, then get the smaller job back. placement is then thrown out the window.
  - Christoph Lameter; try to keep things node local
* cpusets
  - Paul Jackson. Just as things were looking good, along come multi-thread, multi-core, multi-socket etc.
  - thermal clock throttle from Intel.
  - if you make something run faster it gets hotter, which makes things run slower. you have to tune for heat.
* PAGG, batch/gang schedulers
  - pluggable schedulers
  - John Hawkes doing pluggable stuff; new guys based in Eagan.
* RAS - reliability, availability, serviceability
  - memory hotplug for bad DIMMs
  - mark pages of memory as bad
  - DIMM hot plug
* Comprehensive System Accounting (CSA)
  - souped-up BSD accounting for charging customers.
* Compilers
  - optimising ia64 asm by hand is hard
  - with -O2 the compiler can re-use the registers that are used as parameters; makes debugging very hard.
* GFX DRI kernel tuning
  - David Mosberger coming to SGI looking at that.
* sysfs for NUMA topology
  - not community code at the moment.
* Large page sizes, huge TLB
  - max kmalloc() size.
* VM
  - constantly changing.
  - fix one thing, makes something else worse.
* Page/Buffer cache
  - big problems; mainly
    - kmem_alloc overflow to remote nodes due to local page cache
    - page cache grows without bound
    - writeback triggered by memory reclaim from tail of LRU list
    - slab cache reclaim not independently controllable
    - mempools for dirty page flushing to avoid i/o req allocation deadlock

Questions:

Does IRIX help with Linux?
  Borrowed a lot of ideas for Linux; already gone down some paths which they know don't work out. Things like checkpoint-restart, placement tools, avoiding cache line thrashing.

VM rewrites? Are they always better?
  Scalability team pulls latest kernels and tries to keep things up to date.

How much in Melbourne as opposed to US?
  Him, Keith, and then mostly XFS and NFS performance. 4 in Melbourne; not that many in Melbourne.
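
Note on cache line thrashing: the advice under "ccNUMA scalability considerations" that global locks should never share a cache line can be sketched in C roughly as below. This is only an illustration, not code from the talk. The 128-byte CACHE_LINE value, the demo_lock_t type and the struct and function names are all assumptions made up for the example; real kernel code would use its own lock types and alignment macros.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumed cache line size, for illustration only (not from the talk). */
#define CACHE_LINE 128

/* Hypothetical lock type standing in for a real spinlock. */
typedef struct { volatile int held; } demo_lock_t;

/* Packed layout: both locks normally end up in the same cache line,
 * so CPUs taking different locks still fight over one line. */
struct locks_packed {
    demo_lock_t a;
    demo_lock_t b;
};

/* Padded layout: each lock starts on its own cache line boundary. */
struct locks_padded {
    alignas(CACHE_LINE) demo_lock_t a;
    alignas(CACHE_LINE) demo_lock_t b;
};

/* Cache line index of a byte offset inside a line-aligned object. */
static size_t line_of(size_t offset)
{
    return offset / CACHE_LINE;
}

/* Nonzero when two offsets fall in the same cache line. */
static int same_line(size_t x, size_t y)
{
    return line_of(x) == line_of(y);
}

/* Check the layouts: packed locks share a line, padded locks do not. */
static void check_layout(void)
{
    assert(same_line(offsetof(struct locks_packed, a),
                     offsetof(struct locks_packed, b)));
    assert(!same_line(offsetof(struct locks_padded, a),
                      offsetof(struct locks_padded, b)));
}
```

Calling check_layout() from a test or at start-up confirms the layout the compiler actually produced. The padded form trades memory (one full line per lock) for the guarantee that contention on one lock does not invalidate the line holding the other.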