Differences

This shows you the differences between two versions of the page.

--- programmable_hardware [2007-06-18 10:13] – external edit 127.0.0.1
+++ programmable_hardware [2007-12-30 12:57] (current) – nik
@@ Line 1: / Line 1: @@
 ==== Programmable Logic ====
 aka programmable hardware, (re)configurable logic, etc
 === Hardware ===
@@ Line 27: / Line 24: @@
   * http://www.al-williams.com/pldhome.htm
   * open directory category http://dmoz.org/Computers/Hardware/Programmable_Logic/
-  * suvey/ tutorial http://www.eecg.toronto.edu/~jayar/pubs/brown/survey.html
+  * survey/ tutorial http://www.eecg.toronto.edu/~jayar/pubs/brown/survey.html
+----
+==== other ====
+<code>
+Subject: Repost: Performance Benchmarks: Emulating FPGAs Using
+         General Purpose Processors
+Date: 09 Feb 1996 00:00:00 GMT
+newsgroups: comp.arch.fpga
+Reposted after Netcom corrupted the first copy.
+In <1996Feb9.082016.20892@super.org> sc-@vcc.com (Steve Casselman)
+writes:
+>>Another thread that would be good is just for some benchmark
+>>data on what people are doing now. For example if someone
+>>is doing neural nets you might post:
+>>
+>>I did a 12 neuron neural net that did 12 Billion connections/sec
+>>in 3000 gates and a ALPHA 330MegHz can do them @ 300 Million
+>>connections/sec.  (no I did not do this it is just an example
+>>of the detail of the proposed post)
+A while back I was wondering about this in general: "how much better
+are FPGAs at executing, in hardware, an arbitrary computation, than are
+modern general purpose processors at executing, in software, the same
+arbitrary computation?".  And so I wrote the following loose analysis
+which might be interesting to discuss:
+Emulating FPGAs using general purpose processors
+Jan Gray, 1995
+It is well known that FPGAs can implement general purpose computers,
+and general purpose computers can emulate FPGAs.  In this message I
+consider how efficiently a general purpose processor can emulate an
+arbitrary FPGA circuit.  It arises from my curiosity regarding what
+performance advantage (if any) the custom reconfigurable computing
+crowd has over general purpose processors, to quantify, best case, how
+much faster an FPGA is at arbitrary FPGA problems.
+Setting aside such modern features as embedded SRAM, 3-state buses,
+asynchronous flip-flop set/reset, and even flip-flop clock enables, we
+model a typical lookup-table-based FPGA logic element as an arbitrary
+-input logic function driving either another stage of logic or a
+flip-flop.  Without fudging too much we can then consider that all
+multilevel logic functions are pipelined with a register between each
+logic function.  This simplifies my equivalence construction below and
+allows us to compare against an FPGA clock rate based upon the sum of
+flip-flop clock-to-output delay, delay through some interconnect, one
+logic function delay, and the flip-flop setup time.
+Thus, we model an FPGA as a vector of n 4-input function generators
+F[i] driving n flip-flops R[i] which are in turn fed back into the
+function generator inputs. In practice most FPGAs are incapable of
+modeling arbitrary interconnect schemes between logic elements, instead
+greatly favouring local interconnect between nearby logic elements.
+However, we’ll be overly generous and permit any function generator
+F[i] to compute any function of any 4 flip-flop outputs:
+	R[i]’ = F[i](R[a[i]], R[b[i]], R[c[i]], R[d[i]]);
+for i, a[i], b[i], c[i], d[i] all in [0..n).
+Note that a, b, c, d, are simple mappings which describe which
+flip-flop outputs are inputs to each given function generator.  For
+instance, if R[0]’ = R[2] xor R[3] xor R[5] xor R[7], we would have
+a[0]=2, b[0]=3, c[0]=5, d[0]=7.
+On a general purpose processor with word width w, we can implement the
+flip-flop vector by partitioning it into w bit registers.  To simplify
+the presentation, assume that n == w.  Then a simulation of one
+clocking of the FPGA is to form A, B, C, D, each a mapping of R such
+that
+	A[i] == R[a[i]]
+	B[i] == R[b[i]]
+	C[i] == R[c[i]]
+	D[i] == R[d[i]]
+and finally compute
+	R’[i] = F[i](A[i], B[i], C[i], D[i])
+word-wise, over all the bit positions i simultaneously.
+The first sub-problem is to compute 4 mappings of bits of R into A, B,
+C, D efficiently.  One approach is to perform wordwise bit shifting,
+xor, and mask operations upon R.  For instance, R can be bit-reversed
+in 5 lg w simple RISC instructions, arbitrarily permuted in 10 lg w
+instructions, and arbitrary mapped (any bit input to any bit output,
+including arbitrary bit replication) in roughly 20 lg w instructions
+employing 4 lg w constants.  (Knuth developed this by analogy from
+exchange networks).  Another approach is to build w/8 256-entry lookup
+tables, for each byte of bits in R, and "or" the contributions
+together:
+    word R, LUA[w/8][256], A = 0;
+    for (i = 0; i < w/8; i++)
+	A |= LUA[i][(R>>(i*8)) % 256];
+Written in-line this would require about w/8 shifts, masks, loads and
+"or"s.  Let us assume the tables fit into d-cache and charge overall
+*w/8 == w/2 instructions to arbitrarily map R into A.
+For example, for w==64, the wordwise bit mapping approach would require
+approximately 120 instructions, whereas the table lookup approach would
+require approximately 32 instructions (including 8 loads).  Thus the
+four mappings A, B, C, D require about 4*32 == 128 instructions,
+including 32 loads and about 8 KB of lookup tables.
+The next sub-problem is to compute F(A,B,C,D).  To perform this in
+parallel across all n bits, it suffices to generate the 16 minterms
+M[0]=~A&~B&~C&~D, M[1]=~A&~B&~C&D, ..., M[15]=A&B&C&D and then combine
+them under 16 masks N[0..15]:
+	R = M[0]&N[0] | M[1]&N[1] | ... | M[15]&N[15];
+By reusing common subexpressions this can be done in about 60
+instructions.
+All totaled, on a w-bit processor, we need about 4*max(w/2, 20 lg w) +
+instructions to simulate one clocking of an arbitrary w-bit wide
+FPGA, as modeled.  Also of interest, the state required to describe the
+arbitrary w-bit FPGA of 4-input logic elements, is only about 16 + 4 lg
+w words.
+Let’s try out this model against some real hardware!  A lowly XC4002A-5
+has 8x8 CLBs each with 2 function generators and two flip-flops, and
+with a clock to flip-flop output delay of 3 ns, an "interconnect delay"
+which I'll empirically and arbitrarily state is 17 ns, and a combined
+function generator plus flip-flop setup time of 5 ns, all totaled,
+let’s say 25 ns.  In summary, we won’t be far wrong (?) to say an ‘02A
+is a "128-bit FPGA" which you can clock at 40 MHz.  In comparison a
+x32 CLB XC4025-5 would be a "2048-bit FPGA".
+In the general purpose processor corner, let’s employ an Alpha 21164A.
+I recall this machine can issue 4 instructions per clock at 300 MHz.
+Since w==64, it would take an Alpha about 200 instructions or about 50
+issue slots (167 ns) to emulate one clocking of an arbitrary 64-bit
+FPGA.  Thus our Alpha might emulate half an ‘02A at a rate of about 6
+MHz.
+For another example, the BiCMOS MicroUnity media processor
+implementation can apparently perform 128-bit operations at 1000 MHz.
+Here with w==128, it could emulate an arbitrary 128-bit FPGA at about 4
+MHz, perhaps faster if we can take advantage of some of its special bit
+shuffle instructions.
+Well!  These results surprised me.   Even executing a billion
+operations per second on 64- or 128-bit registers, these general
+purpose processors couldn’t achieve 10% of the speed*gates figure of a
+small FPGA.  Even if you consider my "12 ns" FPGA interconnect delay
+assumption far too generous given the arbitrary FPGA’s arbitrary
+interconnection topology, and derate it to 50 ns or more, the little
+FPGA is still several times faster at FPGA type work than a general
+purpose microprocessor.
+(I also wonder if this presentation doesn’t suggest a slow, yet
+compact, physical FPGA implementation.  Attach a vector of flip-flops
+to the input and output sides of a "4-context" n-stage
+shuffle/copy/exchange interconnect, set up to route its input values
+according to 4 different routing configurations.  The interconnect can
+be pipelined, and run at high speed, to map the flip-flop outputs into
+registered function generator inputs and every 8 or 9 clocks the
+flip-flop itself could be clocked.  Overall the pipeline could run at
++ MHz and the overall FPGA at 25 MHz.)
+Jan Gray
+</code>