programmable_hardware: revised 2007-12-30 12:57 (current) by nik; previous external edit 2007-06-18 10:13 by 127.0.0.1

==== Programmable Logic ====
aka programmable hardware, (re)configurable logic, etc

=== Hardware ===
  * http://
  * open directory category http://
  * survey/ tutorial http://

----

==== other ====

<code>
Subject: Repost: Performance Benchmarks: Emulating FPGAs Using
  General Purpose Processors
Date: 09 Feb 1996 00:00:00 GMT
Newsgroups: comp.arch.fpga

Reposted after Netcom corrupted the first copy.

In <...>
writes:
>>[...]
>>[...]
>>is doing neural nets you might post:
>>
>>I did a 12 neuron neural net that did 12 Billion connections/sec
>>in 3000 gates and a ALPHA 330MegHz can do them @ 300 Million
>>connections/sec [...]
>>of the detail of the proposed post)

A while back I was wondering about this in general: "how much better
are FPGAs at executing, in hardware, an arbitrary computation, than
modern general purpose processors at executing, in software, the same
arbitrary computation?"  Here is an analysis I wrote up at the time,
which might be interesting to discuss:


Emulating FPGAs using general purpose processors
Jan Gray, 1995

It is well known that FPGAs can implement general purpose computers,
and general purpose computers can emulate FPGAs.  Here we consider
how efficiently a general purpose processor can emulate an
arbitrary FPGA circuit.  This should help establish what
performance advantage (if any) the custom reconfigurable computing
crowd has over general purpose processors, to quantify, best case, how
much faster an FPGA is at arbitrary FPGA problems.

Setting aside such modern features as embedded SRAM, 3-state buses,
asynchronous flip-flop set/reset, and even flip-flop clock enables, we
model a typical lookup-table-based FPGA logic element as an arbitrary
4-input logic function driving either another stage of logic or a
flip-flop.  We further assume multilevel logic functions are
pipelined, with a register between each logic function.  This
allows us to compare against an FPGA clock rate based upon the sum of
flip-flop clock-to-output delay, delay through some interconnect, the
logic function delay, and the flip-flop setup time.

Thus, we model an FPGA as a vector of n 4-input function generators
F[i] driving n flip-flops R[i] which are in turn fed back into the
function generator inputs.  In practice most FPGAs are incapable of
implementing arbitrary interconnect schemes between logic elements,
instead greatly favouring local interconnect between nearby logic
elements.  However, we'll be overly generous and permit any function
generator F[i] to compute any function of any 4 flip-flop outputs:
    R[i]' = F[i](R[a[i]], R[b[i]], R[c[i]], R[d[i]])
for i, a[i], b[i], c[i], d[i] all in [0..n).

Note that a, b, c, d are simple mappings which describe which
flip-flop outputs are inputs to each given function generator.  For
instance, if R[0]' = R[2] xor R[3] xor R[5] xor R[7], we would have
a[0]=2, b[0]=3, c[0]=5, d[0]=7.

On a general purpose processor with word width w, we can implement the
flip-flop vector by partitioning it into w-bit registers.  To simplify
the presentation, assume n == w.  Then one way to simulate one
clocking of the FPGA is to form A, B, C, D, each a mapping of R such
that
    A[i] == R[a[i]]
    B[i] == R[b[i]]
    C[i] == R[c[i]]
    D[i] == R[d[i]]
and finally compute
    R'[i] = F[i](A[i], B[i], C[i], D[i])
word-wise, over all the bit positions i simultaneously.

The first sub-problem is to compute the 4 mappings of bits of R into
A, B, C, D efficiently.  In general, each mapping can be done with a
series of shift, and, or, xor, and mask operations upon R.  For
instance, R can be bit-reversed in 5 lg w simple RISC instructions,
and R can be arbitrarily bit permuted (even
including arbitrary bit replication) in roughly 20 lg w instructions
employing 4 lg w constants (e.g. by routing through butterfly/
exchange networks).  Another approach is to precompute lookup
tables, for each byte of bits in R, and "or" the partial mappings
together:
    word R, LUA[w/8][256], A = 0;
    for (i = 0; i < w/8; i++)
        A |= LUA[i][(R>>(i*8)) & 255];
Written in-line this would require about w/8 shifts, masks, loads and
"or"s per mapping, that is, about
4*w/8 == w/2 instructions to arbitrarily map R into A.

For example, for w==64, the wordwise bit mapping approach would require
approximately 120 instructions, while the table lookup approach would
require approximately 32 instructions (including 8 loads).  All
four mappings A, B, C, D thus require about 4*32 == 128 instructions,
including 32 loads and about 8 KB of lookup tables.

The next sub-problem is to compute F(A,B,C,D).  To evaluate all n
arbitrary 4-input functions in parallel across all n bits, it
suffices to generate the 16 minterms
    M[0]=~A&~B&~C&~D, M[1]=~A&~B&~C&D, ..., M[15]=A&B&C&D
and "or" them together under 16 masks N[0..15]:
    R = M[0]&N[0] | M[1]&N[1] | ... | M[15]&N[15]
By reusing common subexpressions this can be done in about 60
instructions.

All totaled, on a w-bit processor, we need about 4*min(w/2, 20 lg w) +
60 instructions to simulate one clocking of an arbitrary w-bit wide
FPGA, as modeled.  Note that the configuration data describing an
arbitrary w-bit FPGA of 4-input logic elements is only about 16 + 4 lg
w words.

Let's try out this model against some real hardware!  A Xilinx XC4002A
has 8x8 CLBs, each with 2 function generators and two flip-flops, and
with a clock to flip-flop output delay of 3 ns, an "interconnect delay"
which I'll empirically and arbitrarily state is 17 ns, and a combined
function generator plus flip-flop setup time of 5 ns, all totaled,
let's say 25 ns. In summary, we won't be far wrong (?) to say an '02A
is a "128 function generator, 40 MHz" FPGA, and by the same measure a
32x32 CLB XC4025-5 would be a "2048 function generator, 40 MHz" FPGA.

In the general purpose processor corner, let's employ an Alpha 21164A.
I recall this machine can issue 4 instructions per clock at 300 MHz.
Since w==64, it would take an Alpha about 200 instructions or about 50
issue slots (167 ns) to emulate one clocking of an arbitrary 64-bit
FPGA. Thus our Alpha might emulate half an '02A at a rate of about 6
MHz.

For another example, the BiCMOS MicroUnity media processor
implementation can apparently perform 128-bit operations at 1000 MHz.
Here, with w==128, it could emulate an arbitrary 128-bit FPGA at about 4
MHz, perhaps faster if we can take advantage of some of its special bit
shuffle instructions.

Well!  These results surprised me.  Even executing a billion
operations per second on 64- or 128-bit registers, these general
purpose processors couldn't achieve 10% of the speed*gates figure of a
small FPGA.  Even if you consider my "17 ns" FPGA interconnect delay
assumption far too generous given the arbitrary FPGA's arbitrary
interconnection topology, and derate it to 50 ns or more, the little
FPGA is still several times faster at FPGA type work than a general
purpose microprocessor.


(I also wonder if this presentation doesn't suggest a slow, yet
compact, physical FPGA implementation.  Imagine a w-bit flip-flop
vector attached to the input and output sides of a "w-bit
shuffle/exchange network" which permutes the flip-flop outputs
according to 4 different routing configurations.  The network could
be pipelined, and run at high speed, to map the flip-flop outputs into
4 registered function generator inputs, and every 8 or 9 clocks the
flip-flop itself could be clocked.  Perhaps the network could run at
200+ MHz and the overall FPGA at 25 MHz.)

Jan Gray

</code>