Friday, October 17, 2025

arm æþ - Crashproofing Neuromorphic/Cordian Suite + Architecture + Debugger + Unified Webserver + Compositor core YuKKi

## Obeisances to Amma and Appa during my difficulties. Thanks to Google Gemini, ChatGPT and all contributors worldwide. Enjoy the bash script or scrobble as per Open Source Common Share License v4.

# Neuromorphic Suite + Architecture + Debugger + Unified Webserver

Epilogue:

From Errors to Insights: Building a Crash-Proof System-on-Chip (SoC)

In the world of high-performance hardware, failure is not an option. A system crash caused by a buffer overflow or a single malformed data packet can be catastrophic. But what if we could design a System-on-Chip (SoC) that doesn't just survive these events, but treats them as valuable data?

This post outlines a multi-layered architectural strategy for a high-throughput SoC that is resilient by design. We'll explore how to move beyond simple error flags to create a system that proactively prevents crashes, isolates faults, and provides deep diagnostic insights, turning potential failures into opportunities for analysis and optimization.

The Backbone: A Scalable Network-on-Chip (NoC)

For any complex SoC with multiple processing elements and shared memory, a traditional shared bus is a recipe for a bottleneck. Our architecture is built on a packet-switched Network-on-Chip (NoC). Think of it as a dedicated multi-lane highway system for data packets on the chip. This allows many parallel data streams to flow simultaneously between different hardware blocks, providing the scalability and high aggregate bandwidth essential for a demanding compositor system.

Layer 1: Proactive Flow Control with Smart Buffering

Data doesn't always flow smoothly. It arrives in bursts and must cross between parts of the chip running at different speeds (known as Clock Domain Crossings, or CDCs). This is a classic recipe for data overruns and loss.

Our first line of defense is a network of intelligent, dual-clock FIFO (First-In, First-Out) buffers. But simply adding buffers isn't enough. The key to resilience is proactive backpressure.

Instead of waiting for a buffer to be completely full, our FIFOs generate an almost_full warning signal. This signal propagates backward through the NoC, automatically telling the original data source to pause. This end-to-end, hardware-enforced flow control prevents overflows before they can even happen, allowing the system to gracefully handle intense data bursts without dropping a single packet.
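To make the idea concrete, here is a minimal software model of such a buffer. This is a sketch only: the real design is a dual-clock hardware FIFO, and the capacity and margin values here are illustrative assumptions.

    #include <cstddef>
    #include <queue>

    // Software model of proactive backpressure: the FIFO raises almost_full
    // at a threshold below its hard capacity so the upstream source can pause
    // before an overflow is even possible.
    template <typename T>
    class BackpressureFifo {
    public:
        BackpressureFifo(std::size_t capacity, std::size_t margin)
            : capacity_(capacity), threshold_(capacity - margin) {}

        bool almost_full() const { return q_.size() >= threshold_; }

        // Fails only if the sender ignored almost_full and hit hard capacity.
        bool push(const T &v) {
            if (q_.size() >= capacity_) return false;
            q_.push(v);
            return true;
        }

        bool pop(T &v) {
            if (q_.empty()) return false;
            v = q_.front();
            q_.pop();
            return true;
        }

    private:
        std::size_t capacity_, threshold_;
        std::queue<T> q_;
    };

A sender polls almost_full() before each push and propagates the stall upstream, mirroring the hardware valid/ready handshake across the clock domain boundary.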

Layer 2: A Hardware Firewall for Malformed Data

A common cause of system crashes is malformed or malicious data. Our architecture incorporates a dedicated Ingress Packet Validator—a hardware firewall that sits at the edge of the chip. Before any packet is allowed onto the NoC, this module performs a series of rigorous checks in a single clock cycle:

 * Opcode Validation: Is this a known, valid command?

 * Length Checking: Does the packet have the expected size for its command type?

 * Integrity Checking: Does the packet’s payload pass a Cyclic Redundancy Check (CRC)?

If a packet fails any of these checks, it is quarantined, not processed. The invalid data is never allowed to reach the core processing logic, preventing it from corrupting system state or causing a crash. This transforms a potentially system-wide failure into a silent, contained event.
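A minimal software model of these three checks might look like the following. The opcode values, the length table, and the header layout are assumptions; in the actual hardware the checks evaluate in parallel within a single cycle.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Packet {
        uint8_t  opcode;
        uint16_t length;                  // declared payload length
        std::vector<uint8_t> payload;
        uint32_t crc;                     // CRC over the payload
    };

    // Reflected CRC-32 (IEEE 802.3), bitwise for brevity.
    uint32_t crc32(const uint8_t *data, std::size_t n) {
        uint32_t c = 0xFFFFFFFFu;
        for (std::size_t i = 0; i < n; ++i) {
            c ^= data[i];
            for (int b = 0; b < 8; ++b)
                c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
        }
        return ~c;
    }

    bool validate(const Packet &p) {
        static const uint16_t expected_len[3] = {4, 16, 64};  // per-opcode sizes (illustrative)
        if (p.opcode >= 3) return false;                       // opcode validation
        if (p.length != expected_len[p.opcode] ||
            p.payload.size() != p.length) return false;        // length checking
        return crc32(p.payload.data(), p.payload.size()) == p.crc;  // integrity check
    }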

Layer 3: Fault Containment with Resource Partitioning

To handle multiple tasks with different priorities, we draw inspiration from modern GPU virtualization technology (like NVIDIA's Multi-Instance GPU). A Hardware Resource Manager (HRM) allows the SoC's processing elements to be partitioned into isolated, independent groups.

This provides two major benefits:

 * Guaranteed Quality of Service (QoS): A high-priority, real-time task can be guaranteed its slice of processing power and memory bandwidth, unaffected by other tasks running on the chip.

 * Fault Containment: A software bug or data-dependent error that causes a deadlock within one partition cannot monopolize shared resources or crash the entire system. The fault is completely contained within its hardware partition, allowing the rest of the SoC to operate normally.
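As a rough illustration, a partition entry in such an HRM might carry fields like these. The names and widths are assumptions, not the actual register layout.

    #include <cstdint>

    // One HRM partition: a fixed set of processing elements plus the
    // resource quotas that guarantee its QoS and contain its faults.
    struct HrmPartition {
        uint32_t pe_mask;        // bitmask of processing elements owned
        uint32_t bw_quota_mbps;  // guaranteed NoC bandwidth slice (QoS)
        uint32_t mem_base;       // private memory window base
        uint32_t mem_limit;      // private memory window limit
        bool     faulted;        // a deadlock here never escapes the partition
    };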

Turning Errors into Insights: The 'Sump' Fault Logger

The most innovative component of our architecture is a dedicated on-chip fault logging unit we call the 'Sump'. When the firewall quarantines a bad packet or a buffer reports a critical event, that information doesn't just disappear. The detecting module sends a detailed fault report to the Sump.

The Sump acts as the SoC's "black box recorder," storing a history of the most recent hardware exceptions in a non-volatile ring buffer. Each log entry is a rich, structured record containing:

 * A high-resolution Timestamp

 * The specific Fault Code (e.g., INVALID_OPCODE, FIFO_OVERFLOW)

 * The unique ID of the Source Module that reported the error

 * A snapshot of the offending Packet Header

To retrieve this data safely, we designed a custom extension to the standard JTAG debug interface. An external debugger can connect and drain the fault logs from the Sump via this out-of-band channel without pausing or interfering with the SoC's primary operations.
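A compact model of a Sump entry and its ring buffer, using the record fields listed above. The depth, the fault-code values, and the drain hook are assumptions.

    #include <cstdint>

    // Sketch of a Sump log entry and ring buffer. In the real design the
    // buffer is non-volatile and drained over the JTAG extension.
    enum FaultCode : uint16_t { INVALID_OPCODE = 1, FIFO_OVERFLOW = 2 };

    struct SumpEntry {
        uint64_t  timestamp;      // high-resolution timestamp
        FaultCode code;           // e.g., INVALID_OPCODE, FIFO_OVERFLOW
        uint16_t  source_module;  // unique ID of the reporting module
        uint32_t  packet_header;  // snapshot of the offending header
    };

    class Sump {
    public:
        void log(const SumpEntry &e) {
            if (head_ - tail_ == kDepth) ++tail_;   // overwrite the oldest
            entries_[head_++ % kDepth] = e;
        }
        bool drain(SumpEntry &e) {                  // out-of-band read side
            if (tail_ == head_) return false;
            e = entries_[tail_++ % kDepth];
            return true;
        }
    private:
        static const unsigned kDepth = 64;          // ring depth (illustrative)
        SumpEntry entries_[kDepth];
        unsigned head_ = 0, tail_ = 0;
    };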

A System That Heals and Informs

By integrating these layers, we create a complete chain of resilience. A corrupted packet arrives, the firewall quarantines it, and the Sump logs a detailed report with microsecond precision—all while the system continues to process valid data without interruption. An engineer can later connect via JTAG to perform post-mortem analysis, using the timestamped logs to instantly pinpoint the root cause of the issue.

This philosophy transforms hardware design. By treating errors as data, we can build systems that are not only robust and crash-proof but also provide the deep visibility needed for rapid debugging, performance tuning, and creating truly intelligent, self-aware hardware.



Technical detail:

The refactored neuromorphic suite introduces several architectural changes designed to improve computation efficiency and control flexibility, particularly within embedded ARM/GPU hybrid environments. 

Computational Improvements

The refactoring improves computation primarily through hardware optimization, dynamic resource management, and the introduction of a specialized control execution system:

1. Hardware-Optimized Control Paths (ARM)

The system enhances performance by optimizing frequent control operations via MMIO (Memory-Mapped I/O) access, using short, inlined ARM code sequences on hot paths.

  • This is achieved by using inline AArch64 instructions (ldr/str) and the __attribute__((always_inline)) attribute for fast MMIO read/write operations when running on AArch64 hardware.
  • When the ENABLE_MAPPED_GPU_REGS define is used, the runtime server performs control writes backed by MMIO, leveraging these inline assembly optimizations.
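A sketch of such helpers, assuming a GCC/Clang toolchain targeting AArch64; the function names are illustrative.

    #include <cstdint>

    // Hot-path MMIO accessors using inline AArch64 ldr/str, forced inline so
    // each access compiles down to a single load or store plus address setup.
    static inline __attribute__((always_inline))
    uint32_t mmio_read32(const volatile void *addr) {
        uint32_t val;
        asm volatile("ldr %w0, [%1]" : "=r"(val) : "r"(addr) : "memory");
        return val;
    }

    static inline __attribute__((always_inline))
    void mmio_write32(volatile void *addr, uint32_t val) {
        asm volatile("str %w1, [%0]" : : "r"(addr), "r"(val) : "memory");
    }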

2. Dynamic Resource Management and GPU Acceleration

Computation is dynamically improved through throttling and autoscaling mechanisms integrated into the gpu_runtime_server.

  • GPU Throttling and Autoscaling: The GlobalGpuThrottler uses a token bucket model to manage the maximum bytes per second transferred (see the sketch after this list). The ThrottleAutoScaler observes actual transfer rates against the configured rate and dynamically adjusts the throttle rate to maintain a target_util_ (defaulting to 70%).
  • Lane Utilization Feedback: The system incorporates neuromorphic lane utilization tracking from the hardware/VHDL map. The VHDL map includes logic for 8 ONoC (Optical Network on Chip) lanes with utilization counters. These utilization percentages are read from MMIO (e.g., NEURO_MMIO_ADDR or LANE_UTIL_ADDR) and posted to the runtime server. This allows the ThrottleAutoScaler to adjust the lane_fraction, enabling computation to adapt based on current ONoC traffic.
  • GPU Acceleration with Fallback: The runtime server attempts to use GPU Tensor Core Transform via cuBLAS for accelerated vector processing. If CUDA/cuBLAS support is not available, it uses a CPU fallback mechanism.
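A minimal sketch of the token-bucket throttle referenced above; the class shape, burst handling, and the autoscaler hook are assumptions about the implementation.

    #include <algorithm>
    #include <chrono>
    #include <mutex>

    // Byte-rate token bucket: tokens accrue at rate_ bytes/s up to a burst
    // capacity; a transfer proceeds only if enough tokens are available.
    class TokenBucket {
    public:
        TokenBucket(double bytes_per_sec, double burst_bytes)
            : rate_(bytes_per_sec), capacity_(burst_bytes),
              tokens_(burst_bytes), last_(Clock::now()) {}

        // Permit an n-byte transfer now if enough tokens have accrued.
        bool try_consume(double n) {
            std::lock_guard<std::mutex> lock(mu_);
            refill();
            if (tokens_ < n) return false;
            tokens_ -= n;
            return true;
        }

        // Hook for the autoscaler to raise or lower the configured rate.
        void set_rate(double bytes_per_sec) {
            std::lock_guard<std::mutex> lock(mu_);
            refill();
            rate_ = bytes_per_sec;
        }

    private:
        using Clock = std::chrono::steady_clock;
        void refill() {
            const Clock::time_point now = Clock::now();
            const double dt = std::chrono::duration<double>(now - last_).count();
            last_ = now;
            tokens_ = std::min(capacity_, tokens_ + dt * rate_);
        }
        double rate_, capacity_, tokens_;
        Clock::time_point last_;
        std::mutex mu_;
    };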
The GPU to CPU fallback mechanism is a critical feature implemented in the gpu_runtime_server to ensure the neuromorphic system remains functional even when hardware acceleration via CUDA/cuBLAS is unavailable.

Here is a detailed breakdown of the mechanism:

1. Detection of GPU/CUDA Support

The decision to use the GPU or fall back to the CPU is made by checking for the presence and readiness of the CUDA/cuBLAS environment during server initialization and before processing a transformation request.

  • CUDA Runtime Check: The function has_cuda_support_runtime() is used to determine if the CUDA runtime is available and if there is at least one detected device (devcount > 0).
  • cuBLAS Initialization Check: The function initialize_cublas() attempts to create a cuBLAS handle (g_cublas_handle). If the status returned by cublasCreate is not CUBLAS_STATUS_SUCCESS, cuBLAS is marked as unavailable (g_cublas_ready = false).
  • Server Startup Logging: When the server starts, it logs the outcome of these checks:
    • If initialize_cublas() and has_cuda_support_runtime() are successful, it logs: [server] cuBLAS/CUDA available.
    • Otherwise, it logs: [server] cuBLAS/CUDA NOT available; CPU fallback enabled.
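The checks and log lines above suggest initialization code along these lines. This is a sketch; only the function names, globals, and log strings come from the post.

    #include <cstdio>
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    static cublasHandle_t g_cublas_handle = nullptr;
    static bool g_cublas_ready = false;

    // True if the CUDA runtime is usable and at least one device is present.
    bool has_cuda_support_runtime() {
        int devcount = 0;
        return cudaGetDeviceCount(&devcount) == cudaSuccess && devcount > 0;
    }

    // Marks cuBLAS unavailable unless cublasCreate succeeds.
    bool initialize_cublas() {
        if (g_cublas_ready) return true;
        g_cublas_ready = (cublasCreate(&g_cublas_handle) == CUBLAS_STATUS_SUCCESS);
        return g_cublas_ready;
    }

    // Startup logging, mirroring the messages quoted above.
    void log_gpu_status() {
        if (initialize_cublas() && has_cuda_support_runtime())
            std::puts("[server] cuBLAS/CUDA available.");
        else
            std::puts("[server] cuBLAS/CUDA NOT available; CPU fallback enabled.");
    }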

2. Implementation of the Fallback in /transform Endpoint

The actual selection between GPU processing and CPU processing occurs when the server receives a request on the /transform endpoint.

  • The endpoint handler checks the global cublas_ok flag (which reflects the successful initialization of cuBLAS/CUDA).

  • The output vector (out) is determined using a conditional call:

    std::vector<float> out = (cublas_ok ? gpu_tensor_core_transform(input) : cpu_tensor_transform(input));
    

    If cublas_ok is true, the GPU transformation is attempted; otherwise, the CPU fallback is executed.

3. CPU Fallback Functionality

The dedicated CPU fallback function is simple, defining a direct identity transformation:

  • The function cpu_tensor_transform takes the input vector (in) and returns it directly.

    std::vector<float> cpu_tensor_transform(const std::vector<float> &in) {
        return in;
    }
    

4. GPU Path Internal Fallback

Even when the GPU path (gpu_tensor_core_transform) is selected, it contains an internal early-exit fallback for immediate failure conditions:

  • The gpu_tensor_core_transform function first checks if initialize_cublas() and has_cuda_support_runtime() succeed again.
  • If either check fails (meaning the GPU environment became unavailable after startup or the initial check failed), the function executes a loop that copies the input vector to the output vector and returns, performing a CPU copy operation instead of the GPU work.
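Sketched out, the guard at the top of gpu_tensor_core_transform behaves like this; the cuBLAS body is elided and only the early-exit copy is shown.

    #include <cstddef>
    #include <vector>

    bool initialize_cublas();           // from the initialization sketch above
    bool has_cuda_support_runtime();

    std::vector<float> gpu_tensor_core_transform(const std::vector<float> &in) {
        // Re-check the environment; it may have degraded since startup.
        if (!initialize_cublas() || !has_cuda_support_runtime()) {
            std::vector<float> out(in.size());
            for (std::size_t i = 0; i < in.size(); ++i)
                out[i] = in[i];          // plain CPU copy, no CUDA calls made
            return out;
        }
        // ... device allocation and cuBLAS work would follow here ...
        return in;  // placeholder for the accelerated result in this sketch
    }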

Summary of CPU Fallback Execution

The CPU fallback condition is triggered in two main scenarios:

  1. System-Wide Lack of Support: If CUDA/cuBLAS is not initialized successfully at startup, the /transform endpoint executes cpu_tensor_transform(input), which returns the input unchanged.
  2. Internal GPU Failure: If the gpu_tensor_core_transform function is called but finds that CUDA initialization or runtime support is missing, it skips all CUDA memory allocation and cuBLAS operations, and instead copies the input vector to the output vector on the CPU.

3. Compact Control Execution via Short-Code VM

The introduction of a Short-Code Virtual Machine (VM) represents a refactoring for flexible and compact control execution.

  • This stack-based VM is implemented in both the C++ runtime server and the C bootloader.
  • The runtime server exposes a new /execute endpoint that accepts binary bytecode payloads for execution, allowing for compact control commands like dynamically setting the lane fraction (SYS_SET_LANES).
  • The bootloader also gains an execute <hex_string> command, enabling low-level control bytecode execution directly on the bare-metal target for operations like MMIO writes or system resets. This potentially improves control latency and footprint by minimizing the communication needed for complex control sequences.
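A toy version of such a stack-based VM is sketched below; the opcode encodings and the SYS_SET_LANES semantics here are assumptions for illustration.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    enum Op : uint8_t { OP_PUSH8 = 0x01, OP_ADD = 0x02,
                        OP_SYS_SET_LANES = 0x10, OP_HALT = 0xFF };

    // Executes a bytecode buffer against a small operand stack.
    void execute(const std::vector<uint8_t> &code, double &lane_fraction) {
        std::vector<int32_t> stack;
        for (std::size_t pc = 0; pc < code.size(); ++pc) {
            switch (code[pc]) {
            case OP_PUSH8:                   // push the next byte as an immediate
                stack.push_back(code[++pc]);
                break;
            case OP_ADD: {                   // pop two values, push their sum
                const int32_t b = stack.back(); stack.pop_back();
                const int32_t a = stack.back(); stack.pop_back();
                stack.push_back(a + b);
                break;
            }
            case OP_SYS_SET_LANES:           // pop a percentage, set lane fraction
                lane_fraction = stack.back() / 100.0;
                stack.pop_back();
                break;
            case OP_HALT:
                return;
            }
        }
    }

    // Example: execute({OP_PUSH8, 70, OP_SYS_SET_LANES, OP_HALT}, frac)
    // sets frac to 0.70.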


ARM æþ v1 - Bare-metal/standalone, OEM-ready; just supply the hardware system and the Neuromorphic Cordian chipset below

ARM bootmenu v2 - Compositor / boot menu added

ARM æþ Neuromorphic Compositor - Standalone compositor

Compositor core YuKKi-bash

Globus Anarchus Compositor - POC

Neuromorphic CORDIAN chipset VHDL - Try an iCE40 or iCE65 FPGA for emulation :-0 (just supply your own controller and software); also includes the low-level hardware VHDL map for the neuromorphic component

Adi-protocol-portable.c - Portable across all major computing OSes and operand formats; a possible low-level ONoC protocol

Overhauled Simulation Summary (Gemini):

The Overhauled architecture is not merely an improvement; it represents a fundamental shift from a simple request-response model to a modern, high-throughput, asynchronous compute engine. Its design principles are directly analogous to those proven essential in the HPC domain for achieving near-hardware-limit performance. Our simulation confidently predicts that it would outperform its synchronous predecessor by more than an order of magnitude in any real-world, multi-client scenario.

ARM æþ Overhauled multi-GPU TCP suite

Adi single GPU Processingload Suite



Simulated ARM æþ v1:
Maximum bucket rate
•     Unconstrained (no guardrail): R_max equals the node's peak fabric rate. For a 16-tile node with a 1024-bit datapath at 2.5 GHz per tile, each tile moves 128 B × 2.5 GHz = 320 GB/s, so:
•     T_node,peak = 16 × 320 GB/s = 5.12 TB/s
•     Therefore, the bucket rate at maximum operation is 5.12 TB/s.
•     Within QoS guardrails (aggressive 10% cap): R_max = 0.10 × 5.12 TB/s = 512 GB/s
•     With the optical overprovision example (peak ≈ 6.4 TB/s):
•     Unconstrained: 6.4 TB/s
•     10% guardrail: 640 GB/s
Tip: Use R_max = η × T_node,peak, with η chosen to protect on-chip QoS (commonly 2–10%).
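For quick verification, the arithmetic above can be reproduced in a few lines; the values are taken directly from the figures in this section.

    #include <cstdio>

    int main() {
        const double bytes_per_beat = 1024.0 / 8.0;        // 1024-bit datapath = 128 B
        const double tile_rate = bytes_per_beat * 2.5e9;   // 2.5 GHz -> 320 GB/s per tile
        const double node_peak = 16 * tile_rate;           // 16 tiles -> 5.12 TB/s
        const double eta = 0.10;                           // aggressive 10% guardrail
        std::printf("T_node,peak = %.2f TB/s, R_max = %.0f GB/s\n",
                    node_peak / 1e12, eta * node_peak / 1e9);
        return 0;
    }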

Simulated Overhaul:
Overhauled bucket rate = 6.2 TB/s




