Time+Place: Tuesday 31/01/2012 15:30 Room 337-8 Taub Bld. NOTE UNUSUAL HOUR
Title: The Plural Architecture: Shared Memory Many-cores with Hardware Scheduling
Speaker: Ran Ginosar http://webee.technion.ac.il/~ran/
Affiliation: EE & CS, Technion
Host: Johann Makowsky

Abstract:


The Plural many-core architecture combines hundreds of simple
cache-less cores, many shared cache banks, a hardware scheduler, and
two custom active networks-on-chip: cores-to-shared-caches and
cores-to-scheduler. A theoretical model (almost) justifies increasing
the number of cores while making them smaller and slower, maximizing
the performance-to-power ratio. Several benchmark simulations are
presented, showing close-to-linear speedup and a high
performance-to-power ratio.
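
As a rough illustration of why many small cores can win on performance per
unit of power, the sketch below assumes Pollack's rule (single-core
performance grows roughly as the square root of core area) and power
proportional to total active area. Both are assumptions of this example, not
necessarily the model presented in the talk, and the function names are
hypothetical. It also assumes perfectly parallel work, which real programs
only approximate.

```python
import math

def throughput(total_area, n_cores):
    """Aggregate performance of n_cores equal cores sharing total_area.

    Assumption (Pollack's rule): per-core performance ~ sqrt(core area).
    Assumption: the workload is perfectly parallel, so core speeds add up.
    """
    core_area = total_area / n_cores
    return n_cores * math.sqrt(core_area)

def perf_per_power(total_area, n_cores):
    """Performance-to-power ratio in arbitrary units.

    Assumption: power is proportional to total active area, so it does not
    depend on how that area is divided into cores.
    """
    power = total_area
    return throughput(total_area, n_cores) / power

if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n, perf_per_power(64.0, n))
```

Under these assumptions the ratio grows as the square root of the core
count: quadrupling the number of cores in the same area doubles it.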

A de-synchronized PRAM-like task-based non-CSP programming model for 
shared memory enables fine-grain parallelism. Plural tasks are sequential.
Precedence relations among tasks are described by a task map, which is
executed by the hardware scheduler. Duplicable tasks are described once 
and executed as multiple instances, under control of the hardware scheduler.
Tasks are not functions: they neither receive inputs nor generate
outputs; data are shared only through shared memory. Control tasks (join,
fork, condition) contain no code, and are executed only by the scheduler.
There are no locking mechanisms; all synchronization is formulated as
inter-task dependencies and managed by the scheduler.
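
The programming model above can be sketched in software. This is a minimal,
sequential stand-in for the hardware scheduler, not the actual Plural task-map
syntax: the TASK_MAP layout, the task names and the instance index passed to
each task are all assumptions made for illustration. Tasks exchange data only
through the `shared` dictionary, and the only synchronization is the
precedence list of each task.

```python
# Shared memory: the only channel through which tasks exchange data.
shared = {"data": list(range(8)), "squares": [0] * 8, "total": 0}

def square(instance):
    # Duplicable task: each instance handles the element selected by its index.
    shared["squares"][instance] = shared["data"][instance] ** 2

def reduce_sum(instance):
    # Single-instance task: depends on every `square` instance having finished.
    shared["total"] = sum(shared["squares"])

# Hypothetical task map: name -> (code, number of instances, predecessors).
TASK_MAP = {
    "square": (square, 8, []),
    "sum": (reduce_sum, 1, ["square"]),
}

def run(task_map):
    """Dispatch each task once all of its predecessors have completed."""
    done = set()
    order = []
    pending = dict(task_map)
    while pending:
        ready = [name for name, (_, _, preds) in pending.items()
                 if all(p in done for p in preds)]
        if not ready:
            raise ValueError("task map contains a dependency cycle")
        for name in ready:
            code, count, _ = pending.pop(name)
            for i in range(count):  # hardware would run instances in parallel
                code(i)
                order.append((name, i))
            done.add(name)
    return order

order = run(TASK_MAP)
print(shared["total"])
```

No locks appear anywhere: `reduce_sum` is safe only because the scheduler
releases it after all instances of `square` have completed.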

The shared memory is organized as many banks, allowing all or most cores
simultaneous access. A multistage interconnection network resolves address
conflicts and may include a fetch-and-op facility to enhance PRAM-like
concurrent read-and-write as well as unique indexing operations. Addresses
are interleaved to reduce conflicts. The entire shared memory is organized
as a shared L1 cache. The architecture supports an optional L2 cache
on-chip.
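
To make the interleaving and fetch-and-op ideas concrete, here is a small
sketch. The bank count, word size and function names are assumptions chosen
for the example, not Plural parameters. It shows how low-order interleaving
spreads consecutive words over banks, how a bad stride defeats it, and how a
fetch-and-add on one shared counter hands out unique indices.

```python
from collections import Counter

NUM_BANKS = 16    # assumed number of shared-memory banks
WORD_BYTES = 4    # assumed word size in bytes

def bank_of(addr):
    """Low-order interleaving: consecutive words fall in consecutive banks."""
    return (addr // WORD_BYTES) % NUM_BANKS

def worst_bank_load(addrs):
    """Largest number of simultaneous requests that target the same bank."""
    return max(Counter(bank_of(a) for a in addrs).values())

def fetch_and_add(memory, addr, delta):
    """Return the old value at addr, then add delta to it.

    Hardware would perform this atomically; this Python stand-in runs
    sequentially, so atomicity holds trivially.
    """
    old = memory[addr]
    memory[addr] = old + delta
    return old

# 16 cores reading consecutive words: every request hits a different bank.
unit_stride = [core * WORD_BYTES for core in range(16)]
# 16 cores reading with a stride of NUM_BANKS words: all hit the same bank.
bank_stride = [core * WORD_BYTES * NUM_BANKS for core in range(16)]
print(worst_bank_load(unit_stride), worst_bank_load(bank_stride))

# Unique indexing: each of 16 requesters receives a distinct slot number.
memory = {0: 0}
slots = [fetch_and_add(memory, 0, 1) for _ in range(16)]
```

The strided case is the kind of conflict the interconnection network must
resolve; interleaving only makes such patterns less likely.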

The Plural architecture employs standard processors; we have tried SPARC,
MicroBlaze and some proprietary ones. Cores contain a small private
scratch-pad memory for unshared variables. Shared co-processors include FPU
and collective support. DMA processors provide for data pre-fetching.

The Plural architecture is intended for one-job-at-a-time accelerators; it
is not a multitasking multicore, and there should be no OS. The architecture
has been implemented as an IP core for mobile SoC and as an FPGA accelerator.
It has yet to be demonstrated as a standalone IC. During the talk we will
also contrast it with other many-core architectures including Tiles, Rigel
and XMT.


Short Bio:

Prof. Ran Ginosar received his BSc from the Technion and his PhD from Princeton
University in 1982. He has conducted research at Bell Laboratories, at
the University of Utah and at Intel Research Laboratories in Oregon, USA.
He is a member of the faculty of the EE and CS departments at the Technion, and
heads the VLSI Systems Research Center. He has also co-founded several 
start-up companies in the area of VLSI and parallel processing. His research
interests focus on VLSI and parallel processing architectures.