Hardware Acceleration for Machine Learning
1. Key Concepts in Machine Learning Hardware
1.1 Key Concepts in Machine Learning Hardware
Computational Requirements of Neural Networks
The core operation in deep learning is the multiply-accumulate (MAC) computation, which dominates the computational workload. For a fully connected layer with N inputs and M outputs, the number of MAC operations is given by:
\( \text{MAC}_{\text{FC}} = N \times M \)
Convolutional layers exhibit higher complexity due to their sliding-window nature. For a 2D convolution with C input channels, K output channels, and a kernel size of F × F, the MAC count becomes:
\( \text{MAC}_{\text{conv}} = C \times K \times F^{2} \times H_{\text{out}} \times W_{\text{out}} \)
where \( H_{\text{out}} \) and \( W_{\text{out}} \) are the output spatial dimensions. The quadratic dependence on both kernel size and spatial resolution explains why convolutional networks demand specialized hardware.
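As a quick sanity check on these formulas, the following minimal Python sketch counts MACs for a fully connected and a convolutional layer; the layer shapes are illustrative, not taken from the text.

def fc_macs(n_inputs: int, m_outputs: int) -> int:
    # Fully connected layer: each output is a dot product over all inputs.
    return n_inputs * m_outputs

def conv2d_macs(c_in: int, k_out: int, f: int, h_out: int, w_out: int) -> int:
    # 2D convolution: each output element needs C * F * F MACs,
    # and there are K * H_out * W_out output elements.
    return c_in * k_out * f * f * h_out * w_out

if __name__ == "__main__":
    print("FC 4096->4096:", fc_macs(4096, 4096))
    print("Conv 64->128, 3x3 kernel, 56x56 output:", conv2d_macs(64, 128, 3, 56, 56))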
Memory Bandwidth Bottleneck
The von Neumann bottleneck becomes particularly severe in neural networks due to their large parameter counts. The memory bandwidth requirement for a layer can be expressed as:
\( \text{BW} = (I_{\text{size}} + W_{\text{size}} + O_{\text{size}}) \times f_{\text{op}} \)
where \( I_{\text{size}} \), \( W_{\text{size}} \), and \( O_{\text{size}} \) are the input, weight, and output tensor sizes respectively, and \( f_{\text{op}} \) is the operating frequency. Modern architectures like Transformers exacerbate this bottleneck through the O(N²) memory complexity of their attention mechanisms.
Precision Requirements
While floating-point (FP32) provides numerical stability during training, inference can often utilize reduced precision:
- FP16: 16-bit floating point (5 exponent, 10 mantissa bits)
- INT8: 8-bit integer with quantization scaling factors
- Binary/ternary: Extreme quantization (1-2 bits per weight)
The error introduced by uniform quantization can be modeled as additive noise with variance:
\( \sigma_{q}^{2} = \frac{\Delta^{2}}{12} \)
where Δ is the quantization step size. Modern accelerators implement mixed-precision pipelines to balance accuracy and efficiency.
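The Δ²/12 noise model is easy to verify empirically. The NumPy sketch below (step size chosen arbitrarily for an 8-bit quantizer over [-1, 1)) quantizes uniformly distributed values and compares the measured error variance to the prediction.

import numpy as np

# Empirically check that uniform quantization error has variance ~ delta^2 / 12.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1_000_000)

delta = 2.0 / 256                      # quantization step size
x_q = np.round(x / delta) * delta      # uniform (mid-tread) quantizer

measured = np.var(x - x_q)
predicted = delta**2 / 12
print(f"measured {measured:.3e} vs predicted {predicted:.3e}")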
Parallelism Strategies
Hardware accelerators exploit three fundamental forms of parallelism:
- Data parallelism: Batch dimension partitioning across multiple processing elements
- Model parallelism: Layer-wise or channel-wise distribution of the network
- Operation parallelism: Concurrent execution of independent tensor operations
The theoretical speedup from N parallel processing elements is limited by Amdahl's Law:
\( S(N) = \frac{1}{(1 - P) + P/N} \)
where P is the parallelizable fraction of the computation. Practical implementations must also account for communication overhead between processing elements.
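A one-line helper makes the diminishing returns explicit; the parallel fraction and unit counts below are illustrative.

def amdahl_speedup(parallel_fraction: float, n_units: int) -> float:
    # Amdahl's Law: the serial fraction (1 - P) bounds the achievable speedup.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_units)

for n in (8, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 1))   # saturates near 1/0.05 = 20x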
Energy Efficiency Metrics
The energy-delay product (EDP) captures the trade-off between performance and power consumption:
\( \text{EDP} = E \times t_{\text{delay}} \)
State-of-the-art accelerators achieve >100 TOPS/W for INT8 operations through architectural innovations like:
- Systolic arrays for dense matrix multiplication
- Weight stationary dataflow to minimize memory accesses
- Near-memory computing with 3D-stacked memories
Hardware-Software Co-Design
Modern accelerators employ specialized instructions for neural network primitives. For example, a matrix-multiply-accumulate (MMA) operation on NVIDIA Tensor Cores computes
\( D = A \times B + C \)
where A, B, C, and D are small matrix tiles whose dimensions reflect the warp-level tensor core operation (e.g., 16×16×16). Such instructions are exposed through programming models like CUDA's WMMA API or direct compiler intrinsics.
1.2 The Need for Hardware Acceleration
Traditional general-purpose processors, such as CPUs, are ill-suited for modern machine learning workloads due to their sequential execution model and limited parallelism. The computational demands of training deep neural networks (DNNs) grow rapidly with model and dataset size, making efficient hardware acceleration essential.
Computational Complexity of Neural Networks
The forward pass of a fully connected layer with n inputs and m outputs requires
\( n \times m \)
multiply-accumulate operations. For convolutional layers with an N×N input, a K×K kernel, and C channels, the complexity becomes
\( O(N^{2} K^{2} C) \)
operations. This rapid polynomial growth in operation count quickly overwhelms CPU capabilities, especially when processing high-resolution images or video data.
Memory Bandwidth Limitations
Neural networks exhibit two key memory access patterns that stress conventional architectures:
- Weight-stationary: Repeated access to the same parameters across multiple input samples
- Input-stationary: Reuse of input activations across different filters
The von Neumann bottleneck becomes particularly severe when memory bandwidth cannot keep up with the processor's computational throughput. For a matrix multiplication of dimensions M×N and N×P in FP32, the arithmetic intensity (operations per byte) is
\( I = \frac{2MNP}{4(MN + NP + MP)} \)
which often falls below the machine balance point (peak FLOP/s divided by peak memory bandwidth) for CPUs.
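The sketch below (FP32, square matrices chosen for illustration) computes this intensity and compares it against a hypothetical machine balance point. Note that the formula gives the ideal intensity assuming each operand is moved only once; without careful cache blocking, the realized intensity on a CPU is far lower.

def matmul_arithmetic_intensity(m: int, n: int, p: int, bytes_per_elem: int = 4) -> float:
    # 2*M*N*P FLOPs over the bytes needed to read A and B and write C once.
    flops = 2 * m * n * p
    bytes_moved = bytes_per_elem * (m * n + n * p + m * p)
    return flops / bytes_moved

# Example: 1024^3 GEMM vs. a CPU with ~3 TFLOP/s peak and ~300 GB/s bandwidth.
intensity = matmul_arithmetic_intensity(1024, 1024, 1024)
machine_balance = 3e12 / 300e9   # FLOPs per byte needed to stay compute-bound
print(f"ideal intensity = {intensity:.1f} FLOPs/byte, machine balance = {machine_balance:.1f}")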
Energy Efficiency Considerations
Specialized accelerators achieve orders of magnitude better energy efficiency than general-purpose processors. The energy per operation breakdown shows:
Component | Energy (pJ/op) |
---|---|
32-bit CPU ALU | 3.1 |
GPU tensor core | 0.3 |
TPU systolic array | 0.05 |
This difference becomes critical at scale - training a single large language model on CPUs could consume megawatt-hours versus kilowatt-hours on specialized hardware.
Parallelism Opportunities
Neural networks expose multiple dimensions of parallelism that hardware accelerators exploit:
- Data parallelism: Batch processing across multiple examples
- Model parallelism: Distributing layers across devices
- Operation parallelism: Concurrent execution of independent tensor operations
Modern accelerators achieve peak performance through carefully designed execution pipelines that maintain high utilization:
\( \text{Utilization} = \frac{\text{active CE cycles}}{\text{total CE cycles}} \)
where CEs are compute elements. High-end GPUs sustain >80% utilization on DNN workloads, compared to <10% for CPUs.
Real-World Performance Gains
Benchmarks on ResNet-50 demonstrate the impact of hardware acceleration:
Platform | Throughput (images/sec) | Latency (ms) |
---|---|---|
Xeon 8280 (28-core) | 210 | 133 |
V100 GPU | 1,250 | 7.8 |
TPU v3 | 2,800 | 3.5 |
This roughly 6-13× throughput gain (and 17-38× latency reduction) enables practical deployment of complex models in production environments with strict latency requirements.
1.3 Comparison of CPU, GPU, and TPU Architectures
Architectural Differences
CPUs, GPUs, and TPUs are optimized for fundamentally different computational workloads. CPUs are designed for sequential task execution with a few high-performance cores, while GPUs employ thousands of smaller cores optimized for parallel processing. TPUs, in contrast, are application-specific integrated circuits (ASICs) designed explicitly for tensor operations prevalent in machine learning.
The von Neumann architecture dominates CPU design, featuring:
- Complex instruction pipelining
- Sophisticated branch prediction
- Large cache hierarchies (L1-L3)
- Clock speeds exceeding 5 GHz in modern processors
GPU architectures follow a single-instruction, multiple-data (SIMD) paradigm:
- Thousands of CUDA cores (NVIDIA) or stream processors (AMD)
- High-bandwidth memory (HBM/GDDR)
- Hardware-accelerated matrix operations
TPUs implement a systolic array architecture:
- Specialized matrix multiplication units (MXUs)
- Reduced-precision arithmetic (bfloat16, int8)
- On-chip memory with high bandwidth
- Minimal control logic for deterministic execution
Performance Metrics
The computational throughput of these architectures can be quantified by the theoretical peak:
\( \text{FLOPS}_{\text{peak}} = N_{\text{cores}} \times f_{\text{clock}} \times \text{FLOPs/cycle} \)
Where:
- \( N_{\text{cores}} \) = number of processing units
- \( f_{\text{clock}} \) = operating frequency
- FLOPs/cycle = operations per clock cycle per unit
For matrix multiplication (A×B), the theoretical peak performance differs substantially:
Architecture | Peak TFLOPS | Memory Bandwidth | Power Efficiency |
---|---|---|---|
CPU (Xeon Platinum 8380) | 3.8 | 307 GB/s | 50 GFLOPS/W |
GPU (A100 80GB) | 312 | 2 TB/s | 250 GFLOPS/W |
TPUv4 | 275 | 1.2 TB/s | 900 GFLOPS/W |
Memory Hierarchy and Dataflow
Memory access patterns critically impact performance for machine learning workloads. CPUs rely on sophisticated caching strategies to mitigate latency, while GPUs use coalesced memory access to maximize bandwidth utilization. TPUs implement weight-stationary or output-stationary dataflows to minimize data movement.
The energy cost of data movement follows the memory hierarchy:
\( E_{\text{reg}} \ll E_{\text{SRAM}} \ll E_{\text{DRAM}} \)
where a DRAM access typically consumes ~100× more energy than a register access. TPUs optimize this by keeping frequently used operands in on-chip buffers.
Precision and Numerical Representation
Modern machine learning leverages reduced-precision arithmetic to improve throughput:
- CPUs: Native support for FP64/FP32, with some AVX-512 extensions for INT8
- GPUs: Tensor cores accelerate mixed-precision (FP16/FP32) operations
- TPUs: Native bfloat16 support with INT8/INT4 quantization capabilities
The maximum relative error introduced by reduced precision can be modeled by the machine epsilon:
\( \epsilon = \beta^{\,1-p} \)
where β is the base and p is the number of significand bits. For bfloat16 (β=2, p=8), this gives a maximum relative error of ~0.8%.
Practical Considerations
In real-world deployments, architectural choices depend on:
- Batch size: GPUs outperform CPUs for large batches (>64 samples)
- Model architecture: TPUs excel at dense matrix operations but struggle with irregular computations
- Latency requirements: CPUs provide better single-thread performance for small inferences
The optimal architecture choice follows from Amdahl's Law:
\( S = \frac{1}{(1 - p) + p/s} \)
where p is the parallelizable fraction and s is the speedup of the parallel portion. For neural networks with >95% parallelizable operations, GPUs and TPUs capture most of the theoretically available speedup.
2. Graphics Processing Units (GPUs)
2.1 Graphics Processing Units (GPUs)
Architectural Advantages for Parallel Processing
Modern GPUs are built around a massively parallel architecture consisting of thousands of smaller, efficient cores designed for concurrent execution. Unlike CPUs that optimize for single-thread performance with complex control logic and large caches, GPUs employ a Single Instruction, Multiple Data (SIMD) paradigm. This allows them to execute the same operation simultaneously across multiple data points, making them exceptionally well-suited for the matrix and tensor operations fundamental to machine learning.
The computational throughput of a GPU can be quantified by its floating-point operations per second (FLOPS). For a GPU with N cores, each running at frequency f and performing k operations per cycle, the peak is
\( \text{FLOPS}_{\text{peak}} = N \times f \times k \)
For example, an NVIDIA A100 GPU with 6912 CUDA cores running at 1.41 GHz and capable of 2 operations per cycle (via fused multiply-add) achieves
\( 6912 \times 1.41 \times 10^{9} \times 2 \approx 19.5\ \text{TFLOPS (FP32)} \)
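The same calculation as a small reusable helper, plugging in the A100 numbers quoted above:

def peak_flops(n_cores: int, clock_hz: float, ops_per_cycle: int) -> float:
    # Theoretical peak = cores x clock x operations per core per cycle.
    return n_cores * clock_hz * ops_per_cycle

a100_fp32 = peak_flops(6912, 1.41e9, 2)   # fused multiply-add = 2 FLOPs/cycle
print(f"A100 FP32 peak: {a100_fp32 / 1e12:.1f} TFLOPS")   # ~19.5 TFLOPS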
Memory Hierarchy and Bandwidth Optimization
GPUs employ a tiered memory architecture to balance latency and bandwidth:
- Global memory (high-bandwidth GDDR6/HBM2): 10-100x higher bandwidth than CPU RAM (e.g., ~1.6 TB/s on an NVIDIA A100 vs ~50 GB/s for DDR5)
- Shared memory/L1 cache: On-chip memory with ~20x lower latency than global memory
- Registers: Thread-local storage with single-cycle access
The effective memory bandwidth (Beff) for a kernel depends on its access pattern:
\( B_{\text{eff}} = \frac{\text{bytes read} + \text{bytes written}}{\text{kernel execution time}} \)
Coalesced memory accesses (adjacent threads accessing contiguous addresses) can achieve >80% of theoretical bandwidth, while random access patterns may drop below 10%.
CUDA and Tensor Cores
NVIDIA's CUDA architecture introduces three key abstractions for parallel programming:
- Grids: Top-level organization of thread blocks
- Blocks: Groups of threads with shared memory synchronization
- Threads: Individual execution units
Tensor Cores (introduced in the Volta architecture) accelerate mixed-precision matrix operations through dedicated hardware. For two 4×4 FP16 matrices A and B, they compute
\( D = A \times B + C \)
where C and D are 4×4 FP32 matrices, completing the operation in one clock cycle versus roughly 64 cycles on conventional CUDA cores.
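The numerical behavior (FP16 inputs, FP32 accumulation) can be mimicked on the CPU with NumPy; the 4×4 shapes below mirror the per-Tensor-Core tile and are purely illustrative.

import numpy as np

# Emulate a Tensor-Core-style mixed-precision MMA: D = A @ B + C,
# with A and B stored in FP16 and the result accumulated in FP32.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = rng.standard_normal((4, 4)).astype(np.float32)

D = A.astype(np.float32) @ B.astype(np.float32) + C   # FP32 product and accumulation
D_lossy = (A @ B).astype(np.float32) + C              # product rounded to FP16 before adding C

print("max deviation caused by FP16 rounding:", np.abs(D - D_lossy).max())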
Practical Considerations for ML Workloads
When deploying models on GPUs, consider:
- Occupancy: Ratio of active warps to maximum supported warps per SM (Streaming Multiprocessor). Optimal occupancy balances parallelism with resource constraints.
- Kernel fusion: Combining multiple operations into a single kernel to reduce memory transfers
- Asynchronous execution: Overlapping computation with data transfers using CUDA streams
The execution time (T) for a compute-bound kernel can be estimated as
\( T = \frac{\text{total FLOPs}}{\text{achieved FLOPS}} \)
where achieved FLOPS typically reaches 60-70% of peak for well-optimized linear algebra operations.
2.2 Tensor Processing Units (TPUs)
Architecture and Design Principles
Tensor Processing Units (TPUs) are application-specific integrated circuits (ASICs) optimized for tensor operations, particularly matrix multiplications and convolutions prevalent in deep learning. Unlike general-purpose CPUs or even GPUs, TPUs employ a systolic array architecture—a grid of multiply-accumulate (MAC) units that enable massive parallelism for matrix operations. Each MAC unit performs a partial computation and passes intermediate results to adjacent units, minimizing memory bandwidth bottlenecks.
The systolic array operates at a lower clock frequency (~700 MHz) compared to GPUs (~1.5 GHz), but achieves higher throughput via extreme parallelism. For an N×N systolic array, O(N²) operations execute per cycle. Google’s TPUv3, for instance, uses a 128×128 systolic array, enabling 16,384 parallel MAC operations per clock cycle.
Quantization and Numerical Precision
TPUs leverage 8-bit integer quantization (INT8) for matrix multiplications, trading numerical precision for energy efficiency and throughput. The quantization process maps 32-bit floating-point weights and activations to 8-bit integers via an affine transformation:
\( q = \text{round}\!\left(\frac{x}{s}\right) + z \)
where s is a scaling factor and z is a zero-point offset. This reduces the memory footprint by 4× and increases MAC operation density compared to FP32. Error analysis shows that INT8 quantization typically introduces <1% accuracy loss for well-conditioned models post-training.
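A minimal NumPy sketch of this affine mapping; here the scale and zero-point are derived from the tensor's min/max range, whereas production toolchains may calibrate them differently.

import numpy as np

def quantize_int8(x: np.ndarray):
    # Affine (asymmetric) quantization: q = round(x / s) + z, clipped to int8 range.
    s = (x.max() - x.min()) / 255.0
    z = np.round(-x.min() / s) - 128
    q = np.clip(np.round(x / s) + z, -128, 127).astype(np.int8)
    return q, s, z

def dequantize_int8(q: np.ndarray, s: float, z: float) -> np.ndarray:
    return (q.astype(np.float32) - z) * s

w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, s, z = quantize_int8(w)
err = np.abs(w - dequantize_int8(q, s, z)).max()
print(f"scale={s:.4f}, zero_point={z:.0f}, max abs reconstruction error={err:.4f}")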
Memory Hierarchy and Dataflow
TPUs implement a unified buffer (UB) for activations and a weight FIFO for pre-loaded parameters, decoupling memory access patterns. The UB acts as a software-managed cache, while the weight FIFO streams data directly into the systolic array. This separation avoids von Neumann bottlenecks, achieving 95%+ utilization rates for large matrix multiplications.
Data flows through the TPU in a wavefront pattern: weights are loaded once and remain stationary, while activations propagate horizontally through the systolic array. Partial sums accumulate vertically, minimizing external memory accesses; for a convolution with a K×K kernel, each weight loaded into the array is reused across every output position rather than being re-fetched from external memory.
Performance Benchmarks
In ResNet-50 inference tasks, TPUv4 achieves 400 TOPS at 30W power draw, outperforming contemporary GPUs by 3–5× in TOPS/Watt. The table below compares key metrics:
Metric | TPUv4 | A100 GPU |
---|---|---|
Peak TOPS | 400 | 312 |
Memory Bandwidth | 1.2 TB/s | 2 TB/s |
Power Efficiency | 13.3 TOPS/W | 4.2 TOPS/W |
Compiler and Software Stack
TPUs require model compilation via XLA (Accelerated Linear Algebra), which optimizes computation graphs for systolic execution. XLA performs operator fusion, memory layout transformations, and tiling to match the 128×128 array dimensions. The software stack includes:
- TensorFlow/XLA: Graph optimization and compilation
- MLIR: Intermediate representation for hardware-specific passes
- BSP (Bulk Synchronous Parallel): Execution model for distributed TPU pods
2.3 Field-Programmable Gate Arrays (FPGAs)
Architecture and Reconfigurability
FPGAs consist of an array of programmable logic blocks (PLBs) interconnected via a reconfigurable routing fabric. Each PLB typically contains lookup tables (LUTs), flip-flops, and multiplexers, enabling the implementation of custom digital circuits. The key advantage lies in their post-fabrication programmability, allowing hardware architectures to be optimized for specific machine learning workloads through hardware description languages (HDLs) like VHDL or Verilog.
Parallelism and Low-Latency Execution
Unlike CPUs, FPGAs exploit fine-grained parallelism by implementing custom datapaths that match the computational graph of a neural network. For example, matrix multiplications can be unrolled into spatially parallel multiplier-accumulator (MAC) units. The absence of instruction fetch-decode overhead reduces latency to the nanosecond range, critical for real-time inference. The achievable parallelism is governed by:
\( \text{Throughput}_{\text{peak}} = f_{\text{clk}} \times N_{\text{MAC}} \)
where \( f_{\text{clk}} \) is the clock frequency and \( N_{\text{MAC}} \) is the number of parallel MAC units instantiated in the fabric.
Energy Efficiency
FPGAs outperform GPUs in operations-per-watt for fixed-precision arithmetic, as they eliminate redundant fetch-execute cycles and memory hierarchies. Dynamic power consumption scales with:
\( P_{\text{dynamic}} = \alpha\, C\, V^{2} f_{\text{clk}} \)
where \( \alpha \) is the activity factor, \( C \) the switched capacitance, \( V \) the supply voltage, and \( f_{\text{clk}} \) the clock frequency. Partial reconfiguration further reduces power by disabling unused logic blocks.
High-Level Synthesis (HLS) Tools
Modern toolchains like Xilinx Vitis HLS or the Intel FPGA SDK for OpenCL enable algorithm-to-hardware compilation from C/C++/OpenCL, abstracting away HDL complexities. HLS optimizations include:
- Loop pipelining: Overlapping iterations to maximize throughput
- Dataflow parallelism: Concurrent execution of independent operations
- Memory partitioning: Reducing access contention via bank splitting
Case Study: Quantized Neural Networks
FPGAs excel at low-precision arithmetic (e.g., 8-bit or binary networks). A binarized CNN implemented on a Xilinx Zynq FPGA achieves 14.8 TOPS/W, leveraging:
- XNOR-popcount operations for binary layers
- On-chip BRAM for feature map caching
- Custom precision DSP blocks
Limitations and Trade-offs
While FPGAs provide flexibility, their performance is bounded by:
- Fixed DSP and memory resources per chip
- High development effort compared to GPU CUDA kernels
- Lower peak FLOPs than ASICs for homogeneous workloads
2.4 Application-Specific Integrated Circuits (ASICs)
Application-Specific Integrated Circuits (ASICs) represent the pinnacle of hardware acceleration for machine learning, offering unparalleled performance and energy efficiency by eliminating the general-purpose overhead found in CPUs and GPUs. Unlike FPGAs, which are reprogrammable, ASICs are custom-designed for a specific computational task, enabling extreme optimization at the transistor level. This specialization comes at the cost of non-reconfigurability, making ASICs ideal for high-volume, fixed-workload applications such as deep learning inference in data centers or edge devices.
Architectural Advantages
ASICs achieve superior performance through domain-specific architectures that maximize parallelism and minimize data movement. For example, Google's Tensor Processing Unit (TPU) employs a systolic array architecture, where processing elements (PEs) are arranged in a grid to enable high-throughput matrix multiplications. The data flows rhythmically between PEs without external memory access, reducing latency and power consumption. The computational efficiency can be modeled as:
\( \text{Efficiency} = \frac{N_{\text{ops}}}{P_{\text{dynamic}} + P_{\text{static}}} \)
where \( N_{\text{ops}} \) is the number of operations per second, \( P_{\text{dynamic}} \) is dynamic power, and \( P_{\text{static}} \) is static leakage power. ASICs often achieve 10–100× better TOPS/W than GPUs by optimizing for sparsity, quantization, and near-memory computing.
Design Trade-offs
The development of an ASIC involves a rigorous design cycle spanning RTL synthesis, place-and-route, and fabrication. Key considerations include:
- Die Area: Larger dies accommodate more PEs but increase manufacturing costs and defect rates.
- Memory Hierarchy: On-chip SRAM provides low-latency access but consumes significant area, while off-chip DRAM offers capacity at higher energy costs.
- Precision: Fixed-point or logarithmic arithmetic reduces power but may require retraining models for acceptable accuracy.
For instance, the TPUv4 uses 128×128 systolic arrays with bfloat16 support, achieving 275 TOPS at 75W, whereas Groq’s LPU employs a deterministic execution model to eliminate control overhead entirely.
Case Study: Cryptocurrency Mining ASICs
Bitmain’s Antminer S19j Pro demonstrates ASIC optimization for SHA-256 hashing, delivering 104 TH/s at 29.5 J/TH. The design employs custom datapaths to unroll hash rounds, minimizing register usage and clock cycles. While not a machine learning example, it illustrates how ASICs exploit algorithmic rigidity—similar to how AI accelerators optimize for GEMM (General Matrix Multiply) operations.
Emerging Trends
Recent advancements include 3D-stacked memories (HBM2E) to alleviate bandwidth bottlenecks and analog in-memory computing using resistive RAM (ReRAM) for neuromorphic architectures. Cerebras’ Wafer-Scale Engine epitomizes scale, integrating 850,000 cores on a single 46,225 mm² die, bypassing inter-chip communication delays entirely.
3. Latency and Throughput in Accelerated Systems
3.1 Latency and Throughput in Accelerated Systems
Fundamental Definitions
Latency refers to the time delay between the initiation of a computation and the availability of its result, typically measured in milliseconds (ms) or microseconds (µs). In hardware-accelerated machine learning systems, latency is dominated by data transfer overheads, pipeline stalls, and computational dependencies. For a single inference task, latency (L) can be modeled as:
\( L = t_{\text{data}} + t_{\text{compute}} + t_{\text{sync}} \)
where \( t_{\text{data}} \) is the time to move data between memory and compute units, \( t_{\text{compute}} \) is the execution time, and \( t_{\text{sync}} \) accounts for synchronization delays.
Throughput, measured in operations per second (OPS) or inferences per second (IPS), quantifies the system's capacity to process multiple tasks concurrently. For a batch size B, throughput (T) is:
\( T = \frac{B}{L_{\text{batch}}} \)
where \( L_{\text{batch}} \) is the end-to-end latency of processing one batch. Throughput is maximized when the hardware pipeline is fully utilized, avoiding idle cycles.
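A small helper makes the batching trade-off concrete; the latency model below (a fixed overhead plus a per-sample cost) is an assumption for illustration only.

def batch_latency(batch_size: int, overhead_s: float = 2e-3, per_sample_s: float = 0.5e-3) -> float:
    # Toy latency model: fixed launch/transfer overhead plus per-sample compute time.
    return overhead_s + per_sample_s * batch_size

for b in (1, 8, 64):
    lat = batch_latency(b)
    print(f"batch={b:3d}  latency={lat*1e3:5.1f} ms  throughput={b/lat:7.1f} inf/s")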
Trade-offs and Bottlenecks
Hardware accelerators like GPUs, TPUs, and FPGAs optimize throughput by exploiting parallelism, but this often increases latency for individual tasks due to:
- Batching overheads: Larger batches improve throughput but may increase latency due to memory contention.
- Memory hierarchy: Accessing off-chip DRAM (high latency) vs. on-chip SRAM (low latency).
- Kernel launch delays: In CUDA-based systems, kernel invocation adds ~10–100 µs of latency.
Quantitative Analysis
The Roofline Model formalizes the relationship between compute and memory limits. For an operation with arithmetic intensity I (FLOPs/byte) on a machine with peak compute F and memory bandwidth β, the achievable throughput is
\( \text{FLOPS}_{\text{achievable}} = \min(F,\ \beta \times I) \)
For example, an NVIDIA A100 GPU (F = 312 TFLOPS, β = 1.5 TB/s) running a model with I = 100 FLOPs/byte achieves
\( \min(312,\ 1.5 \times 100) = 150\ \text{TFLOPS} \)
i.e., the operation is memory-bound.
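The same calculation as a reusable function, using the device numbers quoted above:

def roofline_tflops(peak_tflops: float, bandwidth_tb_s: float, intensity_flops_per_byte: float) -> float:
    # Attainable performance is capped by either peak compute or memory bandwidth x intensity.
    return min(peak_tflops, bandwidth_tb_s * intensity_flops_per_byte)

# A100-like device: 312 TFLOPS peak, 1.5 TB/s HBM bandwidth.
for intensity in (10, 100, 300):
    print(intensity, "FLOPs/byte ->", roofline_tflops(312, 1.5, intensity), "TFLOPS")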
Case Study: Transformer Inference
In a transformer model with 175B parameters, latency is dominated by memory bandwidth. Using 8-way tensor parallelism on TPUv4 pods reduces latency by 4.2× compared to a single TPU, but throughput scales sublinearly due to cross-device synchronization costs.
Optimization Techniques
- Pipelining: Overlap data transfer and computation (e.g., CUDA streams).
- Quantization: Reducing precision from FP32 to INT8 cuts latency by 2–4×.
- Operator fusion: Combine multiple kernels to reduce launch overheads.
3.2 Power Efficiency and Thermal Considerations
Power Dissipation in Accelerator Architectures
Hardware accelerators for machine learning, such as GPUs, TPUs, and FPGAs, achieve high computational throughput at the cost of significant power dissipation. The total power consumption \( P_{\text{total}} \) comprises dynamic power \( P_{\text{dynamic}} \), short-circuit power \( P_{\text{short}} \), and leakage power \( P_{\text{leakage}} \):
\( P_{\text{total}} = P_{\text{dynamic}} + P_{\text{short}} + P_{\text{leakage}} \)
Dynamic power dominates in high-frequency operation and scales with clock frequency f, supply voltage \( V_{dd} \), and switched capacitance C:
\( P_{\text{dynamic}} = \alpha\, C\, V_{dd}^{2}\, f \)
where α is the activity factor. At advanced process nodes (below 7 nm), leakage current becomes non-negligible due to quantum tunneling effects, and subthreshold leakage grows roughly exponentially with temperature T:
\( I_{\text{leak}} \propto e^{-\,qV_{\text{th}}/(n k T)} \)
Thermal Design Constraints
The junction-to-ambient thermal resistance \( \theta_{JA} \) determines the steady-state temperature rise ΔT for a given power dissipation P:
\( \Delta T = P \times \theta_{JA} \)
Modern accelerators employ multi-domain thermal management:
- Package-level: High-conductivity thermal interface materials (TIMs) with κ > 5 W/m·K
- System-level: Liquid cooling solutions achieving heat fluxes > 100 W/cm²
- Architectural: Dynamic voltage/frequency scaling (DVFS) with thermal throttling
Energy-Efficient Design Techniques
Approximate computing methods trade computational precision for power savings. For the matrix operations common in neural networks, reduced precision (FP16/INT8) provides a 2-4× energy reduction. Sparse tensor cores additionally exploit neural network weight sparsity through zero-skipping gating:
\( P_{\text{sparse}} \approx (1 - \rho)\, P_{\text{dense}} \)
where ρ is the sparsity ratio. In practice, 50-90% sparsity is achievable with pruning techniques while maintaining model accuracy.
Case Study: Data Center Cooling
Google's TPUv4 pods demonstrate advanced cooling at scale:
- Liquid immersion cooling with dielectric fluids (3M Novec)
- Power usage effectiveness (PUE) of 1.10 versus 1.67 for air cooling
- Waste heat repurposing for district heating systems
The thermal resistance network for such systems combines multiple heat transfer mechanisms:
\( \frac{1}{\theta_{\text{total}}} = \frac{1}{\theta_{\text{cond}}} + \frac{1}{\theta_{\text{conv}}} + \frac{1}{\theta_{\text{rad}}} \)
where conduction through TIMs (\( \theta_{\text{cond}} \approx 0.05 \) K/W), convection into the coolant (\( \theta_{\text{conv}} \approx 0.02 \) K/W), and radiation (\( \theta_{\text{rad}} \approx 0.5 \) K/W) form parallel thermal paths.
3.3 Benchmarking ML Hardware Accelerators
Benchmarking machine learning hardware accelerators requires a systematic approach to evaluate performance, power efficiency, and scalability across different architectures. Key metrics include throughput (inferences per second), latency (time per inference), power consumption (watts), and energy efficiency (inferences per joule). These metrics must be measured under controlled conditions to ensure fair comparisons.
Performance Metrics
The primary performance metric for ML accelerators is throughput, defined as the number of inferences processed per second (IPS). For a batch of size B processed in total time T, throughput is
\( \text{IPS} = \frac{B}{T} \)
Latency, the time taken for a single inference, is critical for real-time applications; for a batch size of 1, latency equals T. Power consumption is measured in watts (W), and energy efficiency is derived as
\( \text{Efficiency} = \frac{\text{IPS}}{\text{Power (W)}} \quad \text{(inferences per joule)} \)
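A minimal timing harness in the spirit of these definitions; the model callable and batch are placeholders, and power would come from an external meter or vendor tooling rather than from this script.

import time

def benchmark(model, batch, n_warmup: int = 10, n_iters: int = 100):
    # Warm up to exclude one-time costs (JIT compilation, cache population, clock ramp-up).
    for _ in range(n_warmup):
        model(batch)
    start = time.perf_counter()
    for _ in range(n_iters):
        model(batch)
    elapsed = time.perf_counter() - start
    batch_size = len(batch)
    return {
        "latency_ms": 1e3 * elapsed / n_iters,             # per batch
        "throughput_ips": n_iters * batch_size / elapsed,  # inferences per second
    }

# Example with a dummy "model" that just sums the batch.
print(benchmark(lambda b: sum(b), list(range(64))))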
Benchmarking Workloads
Standardized benchmarks like MLPerf provide reproducible workloads for comparing accelerators. These include:
- Image Classification: ResNet-50 on ImageNet
- Object Detection: SSD on COCO
- Natural Language Processing: BERT on SQuAD
Each benchmark stresses different aspects of the hardware, such as matrix multiplication efficiency (CNNs) or memory bandwidth (transformers).
Measurement Methodology
Accurate benchmarking requires:
- Fixed-Point vs. Floating-Point: Quantized models (INT8) often achieve higher throughput but may sacrifice accuracy.
- Thermal Throttling: Sustained performance must account for thermal limits.
- Software Stack: Frameworks like TensorFlow Lite or ONNX Runtime can impact results.
Case Study: GPU vs. TPU
Comparing an NVIDIA A100 GPU and a Google TPU v4 on ResNet-50:
- A100: 12,000 IPS at 250W (48 IPS/W)
- TPU v4: 18,000 IPS at 200W (90 IPS/W)
The TPU’s higher energy efficiency stems from its systolic array architecture, optimized for large matrix operations.
Advanced Considerations
For research-grade benchmarking, additional factors include:
- Memory Hierarchy: Cache hit rates and DRAM bandwidth.
- Scalability: Multi-accelerator performance scaling.
- Numerical Precision: Impact of mixed-precision (FP16, BF16) on accuracy.
Tools like NVIDIA Nsight and Intel VTune provide low-level profiling to identify bottlenecks in memory access or compute utilization.
4. CUDA and cuDNN for GPU Acceleration
4.1 CUDA and cuDNN for GPU Acceleration
GPU Parallel Computing Architecture
Modern GPUs leverage thousands of small, efficient cores designed for parallel computation. Unlike CPUs optimized for sequential tasks, GPUs excel at executing the same operation across multiple data points simultaneously. NVIDIA's CUDA (Compute Unified Device Architecture) provides a programming model that exposes this parallelism, allowing developers to offload compute-intensive tasks to the GPU.
The CUDA execution model organizes threads into hierarchical groups:
- Threads - The smallest executable unit, grouped into warps (32 threads)
- Blocks - A collection of threads that can synchronize and share memory
- Grids - A set of blocks that execute the same kernel function
CUDA Kernel Optimization
Writing efficient CUDA kernels requires careful memory management and thread organization. Key considerations include:
- Coalesced memory access - Ensuring consecutive threads access consecutive memory locations
- Shared memory utilization - Using fast on-chip memory for data reused across threads
- Occupancy - Maximizing the number of active warps per streaming multiprocessor
The theoretical occupancy can be calculated as
\( \text{Occupancy} = \frac{\text{active warps per SM}}{\text{maximum warps per SM}} \)
cuDNN: Deep Learning Primitives
The CUDA Deep Neural Network library (cuDNN) provides highly optimized implementations of common DL operations:
- Convolution forward/backward propagation
- Pooling operations (max, average)
- Activation functions (ReLU, sigmoid, tanh)
- Tensor transformations
cuDNN uses autotuning to select the most efficient algorithm based on input parameters and hardware capabilities. For convolution operations, it evaluates different approaches, including implicit GEMM, FFT-based convolution, and Winograd transforms.
Mixed Precision Training
Modern GPUs support Tensor Cores that accelerate mixed-precision matrix operations:
\( D = A \times B + C \)
where A and B are FP16 matrices while C and D accumulate in FP32. This approach provides:
- 2-4x faster matrix multiplication
- Reduced memory bandwidth requirements
- Maintained numerical stability through FP32 accumulation
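As a concrete (hedged) illustration of this pattern on an NVIDIA GPU, the following PyTorch sketch uses the standard torch.cuda.amp autocast/GradScaler combination; the model, data, and hyperparameters are placeholders, and the script degrades gracefully to plain FP32 on a CPU.

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

inputs = torch.randn(64, 512, device=device)
labels = torch.randint(0, 10, (64,), device=device)

for _ in range(10):
    optimizer.zero_grad()
    # Matrix multiplies run in FP16 on Tensor Cores; accumulation stays in FP32.
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = nn.functional.cross_entropy(model(inputs), labels)
    # Loss scaling avoids FP16 gradient underflow.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()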
Performance Optimization Case Study
When optimizing a ResNet-50 model on NVIDIA V100, the following techniques yielded 3.2x speedup:
- Kernel fusion to reduce memory transfers
- Automatic mixed precision training
- cuDNN heuristic selection for convolution algorithms
- Increased batch size to maximize GPU utilization
The final throughput followed Amdahl's law:
\( S = \frac{1}{(1 - p) + p/n} \)
where p is the parallel fraction and n is the speedup of the parallelized portion.
4.2 TensorFlow and PyTorch Integration with TPUs
TensorFlow TPU Support
TensorFlow provides native support for TPUs through its TPUStrategy API, enabling distributed training across multiple TPU cores. The execution model follows a data-parallel approach, where input batches are split across TPU workers. The key steps for integration include:
- Initializing the TPU cluster using tf.distribute.cluster_resolver.TPUClusterResolver.
- Defining the model within a TPUStrategy scope to ensure automatic distribution.
- Converting datasets to tf.data.Dataset and optimizing with prefetching for TPU memory bandwidth.
import tensorflow as tf
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
with strategy.scope():
    model = tf.keras.Sequential([...])
    model.compile(optimizer='adam', loss='mse')
PyTorch XLA Backend
PyTorch leverages TPUs via XLA (Accelerated Linear Algebra), a compiler-based backend that optimizes tensor operations. The torch_xla package replaces CUDA tensors with XLA tensors, enabling execution on TPUs. Critical components include:
- Device management using xla_device = xm.xla_device().
- Explicit marking of training steps with xm.mark_step() for synchronization.
- Gradient synchronization and optimizer stepping via xm.optimizer_step(optimizer, barrier=True).
import torch
import torch_xla.core.xla_model as xm
device = xm.xla_device()
model = Net().to(device)
optimizer = torch.optim.Adam(model.parameters())
for inputs, labels in train_loader:
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, labels)
    loss.backward()
    xm.optimizer_step(optimizer)  # all-reduce gradients across replicas, then step
Performance Optimization
Maximizing TPU utilization requires addressing bottlenecks:
- Batch sizing: TPUs generally perform best when the per-core batch size is a multiple of 8 (commonly 64-128), so that tensors tile evenly onto the MXU dimensions.
- Mixed precision: Enable tf.keras.mixed_precision (bfloat16) in TensorFlow, or bfloat16 autocast on the PyTorch/XLA side, for reduced-precision computation.
- Dataset pipeline: Use tf.data.Dataset.cache() and prefetching, or PyTorch's DataLoader with num_workers > 0, to keep the TPU fed.
Debugging and Profiling
TPU-specific tools include:
- TensorFlow's tf.debugging utilities and the Cloud TPU profiler.
- PyTorch/XLA's xm.get_memory_info() and the torch_xla.debug.metrics report API.
4.3 OpenCL and FPGA Toolchains
OpenCL for FPGA Acceleration
OpenCL (Open Computing Language) provides a standardized framework for heterogeneous computing, enabling developers to write parallel programs that execute across CPUs, GPUs, and FPGAs. Unlike GPU-centric acceleration, FPGAs leverage OpenCL to exploit fine-grained parallelism through custom hardware pipelines. The OpenCL execution model consists of host code (running on a CPU) and kernels (parallel functions offloaded to the FPGA).
To first order, the execution time of an offloaded kernel is
\( T_{\text{exec}} = \frac{N \cdot t_{\text{clk}}}{P} \)
where \( T_{\text{exec}} \) is execution time, \( N \) is the operation count, \( t_{\text{clk}} \) is the clock period, and \( P \) is the parallelism factor. FPGAs optimize \( P \) via their spatial architecture, unlike GPUs' temporal SIMD approach.
FPGA Toolchain Integration
Major vendors (Xilinx, Intel) provide OpenCL-compatible toolchains:
- Xilinx Vitis: Translates OpenCL kernels into RTL (Register Transfer Level) via LLVM-IR, optimizing for Xilinx FPGAs.
- Intel FPGA SDK for OpenCL: Compiles kernels for Intel FPGAs, integrating with Quartus for place-and-route.
Key compilation stages:
- Kernel parsing: OpenCL C → LLVM intermediate representation.
- Hardware synthesis: LLVM-IR → RTL (VHDL/Verilog).
- Place-and-route: RTL → FPGA bitstream.
Memory Hierarchy Optimization
FPGAs use configurable memory blocks (BRAM, URAM) with non-uniform access latencies. OpenCL's __local and __constant qualifiers map to on-chip memories, while __global maps to external DDR. Optimal dataflow requires the memory system to keep the compute pipeline fed:
\( B_{\text{mem}} \geq B_{\text{kernel}} \)
where \( B_{\text{mem}} \) is the available memory bandwidth and \( B_{\text{kernel}} \) is the bandwidth demanded by the kernel's compute throughput.
Case Study: Matrix Multiplication
A 1024×1024 matrix multiply on a Xilinx Alveo U280 achieves 3.2 TFLOPS using:
- 16 parallel processing elements (PEs).
- Blocking factor of 64×64 with double-buffered BRAM.
// N (the matrix dimension) is assumed to be defined at compile time, e.g. via -DN=1024.
__kernel void matmul(__global const float* A, __global const float* B, __global float* C) {
    int row = get_global_id(0);
    int col = get_global_id(1);
    float sum = 0.0f;
    for (int k = 0; k < N; k++) {
        sum += A[row*N + k] * B[k*N + col];
    }
    C[row*N + col] = sum;
}
Performance Trade-offs
FPGAs outperform GPUs in power efficiency (GFLOPS/W) for fixed-precision workloads but require longer development cycles. Key metrics:
Metric | FPGA (OpenCL) | GPU (CUDA) |
---|---|---|
Latency | 10–100 µs | 50–500 µs |
Power Efficiency | 20–50 GFLOPS/W | 5–15 GFLOPS/W |
5. Neuromorphic Computing for ML
5.1 Neuromorphic Computing for ML
Neuromorphic computing architectures emulate the biological structure and function of neural networks, offering energy-efficient alternatives to traditional von Neumann-based machine learning accelerators. Unlike conventional hardware, neuromorphic systems leverage event-driven spiking neural networks (SNNs) and analog or mixed-signal circuits to achieve low-power, high-parallelism computation.
Biological Inspiration and Computational Model
The human brain operates at approximately 20 W while performing complex cognitive tasks—a stark contrast to the kilowatt-scale power consumption of GPU clusters running deep learning models. Neuromorphic engineering draws from three key neurobiological principles:
- Sparse, event-driven computation (spikes replace dense matrix operations)
- Massive parallelism (~10¹⁵ synapses in the brain vs. ~10⁹ transistors in CPUs)
- Co-located memory and processing (eliminating von Neumann bottlenecks)
The Leaky Integrate-and-Fire (LIF) neuron model forms the mathematical basis for most neuromorphic implementations:
\( \tau_m \frac{dV}{dt} = -(V - V_{\text{rest}}) + R_m I_{\text{syn}} \)
where \( \tau_m \) is the membrane time constant, V the membrane potential, \( V_{\text{rest}} \) the resting potential, \( R_m \) the membrane resistance, and \( I_{\text{syn}} \) the synaptic current. When V crosses the threshold \( V_{\text{th}} \), the neuron fires a spike and V is reset.
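A discretized (Euler) simulation of this LIF dynamic in NumPy; the time constants, thresholds, and input current below are arbitrary illustrative values.

import numpy as np

# Euler integration of a leaky integrate-and-fire neuron.
dt, tau_m, r_m = 1e-3, 20e-3, 1.0            # 1 ms step, 20 ms membrane time constant
v_rest, v_th, v_reset = -65e-3, -50e-3, -65e-3

v = v_rest
spikes = []
for step in range(200):
    i_syn = 20e-3 if 50 <= step < 150 else 0.0   # square current pulse (R_m * I = 20 mV drive)
    v += dt / tau_m * (-(v - v_rest) + r_m * i_syn)
    if v >= v_th:
        spikes.append(step * dt)                 # record spike time, then reset
        v = v_reset

print(f"{len(spikes)} spikes, first at t={spikes[0]*1e3:.0f} ms" if spikes else "no spikes")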
Hardware Implementations
Modern neuromorphic chips employ various technologies to implement SNNs:
Technology | Example | Key Features |
---|---|---|
Digital many-core CMOS | IBM TrueNorth | Event-driven digital cores, 1 million neurons/chip |
Digital with on-chip plasticity | Intel Loihi 2 | Programmable synaptic weights, on-chip learning |
Photonic | Lightmatter | Optical interference for matrix multiplication |
Case Study: Intel Loihi 2
The second-generation Loihi chip demonstrates architectural innovations:
- 128 neuromorphic cores with 1 million programmable neurons
- 3-factor spike-timing-dependent plasticity (STDP) learning rule
- 10x energy efficiency improvement over Loihi 1 for SNN inference
A representative trace-based plasticity update has the form
\( \Delta w_{ij} = \eta\, x_i\, y_j - \alpha\, w_{ij} \)
where η is the learning rate, \( x_i \) the presynaptic trace, \( y_j \) the postsynaptic trace, and α the weight decay constant.
Applications and Performance
Neuromorphic systems excel in edge computing scenarios requiring low latency and power efficiency:
- Real-time classification: IBM's TrueNorth achieves 1,200 fps on gesture recognition at 300 mW
- Event-based vision: Dynamic Vision Sensors (DVS) coupled with SNNs show 100x lower power than frame-based CNN approaches
- Robotic control: Loihi-based controllers demonstrate 10 ms latency for motor feedback loops
The energy advantage emerges from the sparse activation paradigm: a typical convolutional layer might perform ~10⁹ multiply-accumulate (MAC) operations per inference, while an equivalent SNN layer often requires fewer than 10⁶ spike events.
Challenges and Future Directions
Despite promising results, neuromorphic computing faces several hurdles:
- Training complexity: Backpropagation through time (BPTT) for SNNs requires 5-10x more iterations than ANNs
- Precision limitations: Analog implementations typically achieve 4-6 bit precision versus 32-bit floating point in GPUs
- Toolchain immaturity: Lack of standardized frameworks comparable to PyTorch/TensorFlow
Emerging solutions include hybrid analog-digital architectures and surrogate gradient methods for training:
\( \frac{\partial S}{\partial V} \approx \frac{1}{\left(1 + k\,|V - V_{\text{th}}|\right)^{2}} \)
where k controls the smoothness of the pseudo-derivative used during backpropagation.
5.2 Quantum Computing and Machine Learning
Quantum Parallelism and Superposition
Quantum computing leverages superposition and entanglement to perform computations in parallel across multiple states. A quantum bit (qubit) can exist in a superposition of states:
\( |\psi\rangle = \alpha|0\rangle + \beta|1\rangle \)
where α and β are complex probability amplitudes satisfying \( |\alpha|^2 + |\beta|^2 = 1 \). This enables some quantum algorithms to manipulate exponentially large state spaces with fewer operations than their classical counterparts.
Quantum Machine Learning Algorithms
Several quantum algorithms accelerate machine learning tasks:
- Quantum Support Vector Machines (QSVM): Use quantum kernel estimation for classification, with potential (in idealized settings, exponential) speedups.
- Quantum Principal Component Analysis (QPCA): Can extract eigenvalues and eigenvectors faster than classical PCA under certain data-access assumptions.
- Variational Quantum Eigensolver (VQE) and related variational algorithms: Optimize parameterized quantum circuits for tasks such as regression and optimization.
Quantum Data Encoding
Classical data must be mapped to quantum states for processing. Common encoding methods include:
\( |\psi\rangle = \frac{1}{\lVert x \rVert} \sum_i x_i |i\rangle \)
where \(x_i\) represents classical data points and \(|i\rangle\) denotes computational basis states. Amplitude encoding allows \(N\)-dimensional data to be stored in \(\log_2 N\) qubits.
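A NumPy sketch of amplitude encoding (the data vector is illustrative): it pads the data to a power-of-two length and normalizes it into a valid state vector over \(\log_2 N\) qubits.

import numpy as np

def amplitude_encode(x: np.ndarray) -> np.ndarray:
    # Pad to the next power of two and normalize so that sum(|amplitude|^2) == 1.
    n = 1 << int(np.ceil(np.log2(len(x))))
    state = np.zeros(n, dtype=complex)
    state[: len(x)] = x
    return state / np.linalg.norm(state)

x = np.array([3.0, 1.0, 2.0, 2.0, 1.0])   # 5 values -> 8 amplitudes -> 3 qubits
psi = amplitude_encode(x)
print(len(psi), "amplitudes,", int(np.log2(len(psi))), "qubits, norm =", round(float(np.linalg.norm(psi)), 6))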
Challenges and Limitations
Despite theoretical advantages, practical challenges remain:
- Noise and Decoherence: Qubits lose coherence quickly, requiring error correction.
- Limited Qubit Count: Current NISQ (Noisy Intermediate-Scale Quantum) devices have insufficient qubits for large-scale ML.
- Hybrid Classical-Quantum Approaches: Most algorithms rely on classical optimization loops.
Case Study: Quantum Neural Networks
Quantum neural networks (QNNs) replace classical neurons with parametrized quantum gates. A single-qubit rotation gate can be expressed as:
\( R_x(\theta) = e^{-i\theta\sigma_x/2} = \cos\tfrac{\theta}{2}\, I - i \sin\tfrac{\theta}{2}\, \sigma_x \)
where \(\sigma_x\) is the Pauli-X operator. Training involves optimizing \(\theta\) via gradient descent on quantum hardware.
Future Prospects
Research focuses on fault-tolerant quantum computing and efficient error mitigation. Applications in drug discovery, financial modeling, and optimization are actively being explored.
5.3 Edge AI and Low-Power Acceleration
Energy Constraints in Edge AI Systems
Edge AI deployments operate under stringent power budgets, often limited to milliwatt or microwatt ranges for battery-powered or energy-harvesting applications. The total energy consumption \( E_{\text{total}} \) of an edge inference system can be decomposed as
\( E_{\text{total}} = E_{\text{comp}} + E_{\text{mem}} + E_{\text{comm}} \)
where \( E_{\text{comp}} \) represents computation energy, \( E_{\text{mem}} \) accounts for memory access energy, and \( E_{\text{comm}} \) covers wireless transmission costs. For a typical CNN layer with N MAC operations, the computation energy follows
\( E_{\text{comp}} = N\,(E_{\text{MAC}} + E_{\text{data}}) \)
where \( E_{\text{MAC}} \) denotes energy per multiply-accumulate operation (ranging from 1-100 pJ in modern accelerators), while \( E_{\text{data}} \) captures the energy overhead of operand fetch.
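A back-of-the-envelope estimate using this decomposition; the per-operation energies below are assumed values chosen within the ranges quoted above.

def layer_energy_joules(n_macs: int, e_mac_pj: float = 2.0, e_data_pj: float = 10.0) -> float:
    # E_comp = N * (E_MAC + E_data), converted from picojoules to joules.
    return n_macs * (e_mac_pj + e_data_pj) * 1e-12

# Example: a 3x3 convolution, 64 -> 128 channels, 56x56 output (~231 million MACs).
n_macs = 64 * 128 * 3 * 3 * 56 * 56
print(f"{layer_energy_joules(n_macs) * 1e3:.2f} mJ for this layer per inference")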
Architectural Optimizations
Three dominant approaches have emerged for efficient edge acceleration:
- Spatial architectures exploit data reuse through systolic arrays or processing-in-memory (PIM) designs, reducing Emem by 10-100x compared to von Neumann systems.
- Temporal architectures employ voltage scaling and subthreshold operation, trading throughput for ultra-low power (e.g., 1-10 μW for keyword spotting).
- Mixed-signal implementations leverage analog compute for linear operations like matrix multiplication, achieving < 1 pJ/MAC at the cost of reduced precision.
Quantization-Aware Silicon
Modern edge accelerators implement native support for 4-8 bit integer operations. To first order, multiplier energy scales quadratically with operand bit-width, so the savings from reduced precision follow
\( \frac{E_{\text{int}}}{E_{\text{fp}}} \approx \left(\frac{b_{\text{int}}}{b_{\text{fp}}}\right)^{2} \cdot \frac{C_{\text{int}}}{C_{\text{fp}}} \)
where b denotes operand bit-width and \( C_{\text{int}} \), \( C_{\text{fp}} \) represent circuit complexity factors for integer versus floating-point units.
Real-World Implementations
Commercial edge AI processors demonstrate these principles:
- Google's Edge TPU achieves roughly 2-4 TOPS/W for INT8 inference through tightly coupled MAC arrays and on-chip SRAM hierarchies.
- ARM Ethos-U55 combines weight compression with SIMD vector processing for < 1 mW inference in microcontroller deployments.
- Mythic's analog compute tiles leverage flash memory cells as programmable resistors, enabling 25 TOPS/W for vision applications.
Thermal Considerations
Power density constraints become critical at the edge, where passive cooling is often mandatory. The maximum sustainable power dissipation follows
\( P_{\text{max}} = \frac{T_{\text{junction}} - T_{\text{ambient}}}{R_{\text{th}}} \)
For a typical IoT node with \( R_{\text{th}} \) = 50°C/W and \( T_{\text{ambient}} \) = 45°C, an 85°C junction limit constrains power dissipation to 800 mW, requiring careful thermal-aware floorplanning in accelerator designs.
6. Key Research Papers and Articles
6.1 Key Research Papers and Articles
- Deep-Learning Inferencing with High-Performance Hardware Accelerators — surveys the components needed to accelerate machine-learning inference across architectures, frameworks, and platforms, distinguishing the training and inferencing phases.
- FPGA-Based Hardware Accelerators for Deep Learning in Mobile ... (UTUPub) — thesis examining the performance, accuracy, and power consumption of GPU and FPGA platforms designed for edge computing.
- Being-ahead: Benchmarking and Exploring Accelerators for Hardware ... — leverages the DNNExplorer automation tool to benchmark customized DNN accelerators and explore new accelerator designs, with direct support for popular machine learning frameworks.
- Hardware Acceleration of Machine Learning (odr.chalmers.se) — master's thesis evaluating and comparing hardware-aware optimization techniques (Chalmers University of Technology and University of Gothenburg, 2023).
- Co-designing Model Compression Algorithms and Hardware Accelerators for ... — cross-domain work on efficient deep learning, including specialized FPGA architectures for binarized neural networks and the use of high-level synthesis tools to reduce design effort.
- Scalable and Broad Hardware Acceleration Through Practical Speculative ... — doctoral thesis on practical speculative parallelism for scalable hardware acceleration.
- Hardware Acceleration of Machine Learning Using FPGA (JUIT) — B.Tech project report on FPGA-based acceleration of machine learning.
- Designing Deep Learning Hardware Accelerator and Efficiency Evaluation — reviews state-of-the-art FPGA-based accelerator designs and discusses how to exploit parallel computing in the convolution algorithm on FPGA hardware structures.
- Design and Optimization of Hardware Accelerators for Deep Learning — covers mixed-signal accelerator designs (ISAAC, Newton) and techniques for addressing the ADC bottleneck.
- A Survey of Accelerator Architectures for Deep Neural Networks — surveys accelerator architectures addressing the data-processing speed and scalability challenges that big-data ML imposes on conventional computer systems.
6.2 Recommended Books and Tutorials
- Artificial Intelligence Hardware Design (Wiley-VCH) — textbook covering neural network models and frameworks, deep learning layers (convolutional, activation, pooling, normalization), and hardware design for AI.
- 6. GPU and Hardware Acceleration (Machine Learning Compiler course documentation) — covers tensor program abstraction, GPU acceleration, and integration with machine learning frameworks.
- 11 AI Acceleration (Machine Learning Systems) — chapter on ML accelerators, which optimize matrix multiplications, tensor operations, and data movement for high-throughput, energy-efficient computation from edge devices to data centers.
- Energy-Efficient Design of Advanced Machine Learning Hardware (Springer) — chapter on maximally parallel DNN acceleration and energy-efficient accelerator design.
- Hardware Acceleration of EDA Algorithms — research monograph comparing alternative hardware platforms and describing the programming model used to interface with graphics processors.
- Artificial intelligence and hardware accelerators (SearchWorks catalog) — explores methods, architectures, tools, and algorithms for AI hardware accelerators and the computational requirements of complex AI algorithms.
- Energy-Efficient Single-Core Hardware Acceleration — on CNN-based object detection under the latency and power constraints of resource-limited, battery-powered edge devices.
- Learning on Hardware: A Tutorial on Neural Network Accelerators ... — tutorial recommending that dataflow and processing elements be considered already in the hardware design phase.
6.3 Online Resources and Communities
- Energy-Efficient Design of Advanced Machine Learning Hardware (Springer) — discusses robust, intelligent, self-learning systems and the resulting surge of research at the intersection of deep learning and hardware architecture.
- Efficient Methods and Hardware for Deep Learning — doctoral dissertation bridging research in deep learning and hardware efficiency.
- Large-Scale Optical Hardware for Neural Network Inference Acceleration — doctoral thesis on optical hardware for accelerating neural network inference.
- 6 Hardware-Aware Execution (Machine Learning under Resource Constraints) — chapter on feed-forward networks and the difficulty of designing customized training hardware given the diversity of operations in forward and backward propagation.