DragonFire Cache

The DragonFire Cache is a revolutionary frictionless memory subsystem that enables ultra-high-throughput code execution through direct memory addressing and zero-overhead access patterns.

Overview

DragonFire Cache is an advanced memory subsystem that fundamentally rethinks how data is stored, accessed, and processed. Unlike traditional caching systems that rely on complex lookup mechanisms, DragonFire Cache leverages a fixed "Zero Cube" memory structure where the address itself encodes information about the content, eliminating traditional caching overhead.

Frictionless Access

Direct mapping between memory addresses and content eliminates lookup overhead

Zero Cube Structure

Fixed memory pattern optimized for L1 cache with deterministic bit sequences

Massive Throughput

Capable of handling over 30 million worker requests per second even on minimal hardware

L1 Optimization

Designed to fit entirely within L1 cache for ultimate performance (32KB footprint)

Architecture

The DragonFire Cache system consists of three primary components that work together to deliver frictionless memory operations:

1. Zero Cube Memory Structure

The Zero Cube is a fixed memory block mapped to L1 cache containing a deterministic bit pattern. It serves as the foundation of the DragonFire Cache's direct memory addressing system.

  • Size: 8KB (128 cache lines × 64 bytes)
  • Lifetime: Generated once at system startup, never changes
  • Mapping: Binary de Bruijn sequence B(2, 16): across its 65,536 bits, every 16-bit pattern appears exactly once per cycle (see the generator sketch below)
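
How the pattern is produced is not specified in this documentation; as a minimal sketch, the standard FKM (Fredricksen-Kessler-Maiorana) construction below generates B(2, 16), filling exactly 8KB with a sequence in which every 16-bit window occurs once per cycle:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define N          16                  // window width in bits
#define CUBE_BITS  (1u << N)           // 65,536 bits = 8KB

static uint8_t  cube[CUBE_BITS / 8];   // the Zero Cube bit pattern
static uint8_t  seq[N + 1];            // FKM working buffer
static uint32_t pos;                   // next bit position to write

static void emit(uint8_t bit) {
    if (bit) cube[pos >> 3] |= (uint8_t)(1u << (pos & 7));
    pos++;
}

// FKM: concatenating the binary Lyndon words whose length divides N
// yields the de Bruijn sequence B(2, N).
static void db(int t, int p) {
    if (t > N) {
        if (N % p == 0)
            for (int j = 1; j <= p; j++) emit(seq[j]);
    } else {
        seq[t] = seq[t - p];
        db(t + 1, p);
        if (seq[t - p] == 0) {
            seq[t] = 1;
            db(t + 1, t);
        }
    }
}

int main(void) {
    memset(seq, 0, sizeof seq);
    db(1, 1);
    printf("generated %u bits (%u bytes)\n", pos, pos / 8);   // 65536 bits, 8192 bytes
    return 0;
}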

2. Worker Request Queue

The Worker Request Queue is an optimized memory area for batching worker requests, designed to maximize throughput by grouping related memory operations.

  • Size: 24KB (384 cache lines × 64 bytes)
  • Organization: Hash-clustered by target memory region
  • Workers Per Line: 16 (4 bytes per worker)
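
The slot format is not given here; one plausible 4-byte record that yields 16 workers per 64-byte cache line is sketched below (the field widths are illustrative assumptions):

#include <stdint.h>

// Hypothetical 4-byte worker slot; field widths are illustrative.
typedef struct {
    uint32_t worker_id : 24;   // worker identifier (up to ~16M workers)
    uint32_t state     :  8;   // e.g. empty / waiting / served
} worker_slot_t;

// One queue line: 16 slots = 64 bytes = one cache line.
typedef struct {
    worker_slot_t slots[16];
} queue_line_t;

_Static_assert(sizeof(queue_line_t) == 64, "queue line must match a cache line");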

3. Address-to-Content Mapping System

The core innovation of DragonFire Cache is its direct address-to-content mapping system, which creates a mathematical relationship between memory addresses and their contents.

  • Implementation: Direct relationship where address = f(content)
  • Characteristics: Bidirectional, deterministic, O(1) complexity
  • Benefit: Eliminates the need for traditional cache lookups
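
A minimal sketch of both directions, assuming the Zero Cube holds the B(2, 16) sequence described above: reading the 16-bit window at a bit address yields the content, and a one-time inverse table maps each 16-bit value back to the unique address where it occurs.

#include <stdint.h>

#define N         16
#define CUBE_BITS (1u << N)

extern uint8_t  cube[CUBE_BITS / 8];   // filled once at startup (see above)
static uint32_t addr_of[1u << N];      // content -> bit address

// Address -> content: read the 16-bit window at a cyclic bit address.
static uint16_t content_at(uint32_t addr) {
    uint16_t w = 0;
    for (int i = 0; i < N; i++) {
        uint32_t b = (addr + i) & (CUBE_BITS - 1);
        w = (uint16_t)((w << 1) | ((cube[b >> 3] >> (b & 7)) & 1));
    }
    return w;
}

// Content -> address: each window value occurs exactly once per cycle
// (the de Bruijn property), so the table is a perfect inverse and
// lookups afterwards are O(1).
static void build_inverse(void) {
    for (uint32_t a = 0; a < CUBE_BITS; a++)
        addr_of[content_at(a)] = a;
}

The 256KB inverse table is for clarity only; a real implementation would need a more compact decoding scheme to stay inside the 32KB budget.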

Memory Layout

┌─────────────────────────────────────────┐
│              L1 Cache (32KB)            │
├─────────────────┬───────────────────────┤
│   Zero Cube     │    Worker Queues      │
│     (8KB)       │       (24KB)          │
│  128 cache lines│    384 cache lines    │
├─────────────────┴───────────────────────┤
│      Memory-Mapped Control Registers    │
└─────────────────────────────────────────┘
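
The layout can be mirrored in code; a compile-time assertion pins the footprint to exactly 32KB (the struct name and 64-byte cache-line alignment are illustrative assumptions):

#include <stdint.h>

// Illustrative view of the fixed 32KB working set.
typedef struct {
    _Alignas(64) uint8_t zero_cube[8 * 1024];        // 128 cache lines
    _Alignas(64) uint8_t worker_queues[24 * 1024];   // 384 cache lines
} dragon_l1_layout_t;

_Static_assert(sizeof(dragon_l1_layout_t) == 32 * 1024,
               "working set must fit a 32KB L1 data cache");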

Operational Flow

The DragonFire Cache operates on a frame-based schedule, processing batches of worker requests at a fixed rate of 1024 frames per second.

  1. Request Registration

    Worker requests arrive and are hashed to appropriate queue positions based on their target memory regions

  2. Batch Formation

    Requests are batched by target memory region to maximize L1 cache efficiency

  3. Batch Processing

    When batch size reaches threshold or timer expires:

    • Single read from Zero Cube memory
    • Results distributed to all workers in batch

  4. Continuous Operation

    System processes at 1024 frames per second (976.56 μs per frame)

Note: The frame rate is fixed at 1024 FPS to ensure deterministic performance and align with system timing requirements.
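
A sketch of the fixed-rate frame loop, assuming a POSIX environment (an absolute deadline with clock_nanosleep prevents timing drift across frames):

#include <time.h>
#include "dragon_cache.h"

#define FRAME_NS (1000000000L / 1024)   // 976,562 ns ≈ 976.56 μs per frame

void run_frames(long n_frames) {
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);
    for (long f = 0; f < n_frames; f++) {
        dragon_process_frame();          // steps 1-3 above
        next.tv_nsec += FRAME_NS;        // advance the absolute deadline
        if (next.tv_nsec >= 1000000000L) {
            next.tv_sec  += 1;
            next.tv_nsec -= 1000000000L;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}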

Advanced Addressing Scheme

The DragonFire Cache employs a sophisticated addressing scheme that enables its frictionless operation:

Zero Cube Addressing

Each bit position in the Zero Cube is addressable through a 16-bit address (the field packing is sketched below):

  • 7 bits: cache line selector (0-127)
  • 3 bits: 64-bit word selector within the cache line (0-7)
  • 6 bits: bit position within the word (0-63)
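
Packing and unpacking the three fields is plain bit arithmetic (the helper names are illustrative):

#include <stdint.h>

// 16-bit Zero Cube address: [line:7 | word:3 | bit:6]
static inline uint16_t zc_pack(unsigned line, unsigned word, unsigned bit) {
    return (uint16_t)((line << 9) | (word << 6) | bit);
}

static inline void zc_unpack(uint16_t addr,
                             unsigned *line, unsigned *word, unsigned *bit) {
    *line = (addr >> 9) & 0x7F;   // cache line, 0-127
    *word = (addr >> 6) & 0x07;   // 64-bit word within the line, 0-7
    *bit  =  addr       & 0x3F;   // bit within the word, 0-63
}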

Worker Queue Addressing

Workers are assigned to queue positions using a hash function designed to optimize memory access patterns:

queue_position = hash(worker_id, target_code)

The hash function is specifically designed to cluster workers needing the same memory region, maximizing batch efficiency.
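
The hash itself is not specified in this document; one illustrative choice, shown below, derives the queue line from the target's cache-line region (so workers wanting the same region share a line) and spreads worker IDs across the 16 slots within it:

#include <stdint.h>

#define QUEUE_LINES    384   // 24KB / 64B
#define SLOTS_PER_LINE 16

// Illustrative clustering hash, not the production function.
static inline uint32_t queue_position(uint32_t worker_id, uint64_t target_code) {
    uint32_t line = (uint32_t)(target_code >> 6) % QUEUE_LINES;  // same region -> same line
    uint32_t slot = (worker_id * 2654435761u) >> 28;             // Fibonacci hash -> 0..15
    return line * SLOTS_PER_LINE + slot;
}

In practice, two workers can still hash to the same slot, so a real implementation would need a collision policy (for example, probing to the next free slot in the line).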

Direct Address-to-Content Mapping

The core innovation is the ability to directly calculate:

  • Memory address from the desired content
  • Content from the memory address

This bidirectional mapping eliminates the need for traditional cache lookups, dramatically reducing latency.

Performance Characteristics

The DragonFire Cache delivers exceptional performance across various hardware configurations:

Throughput

Hardware Class      No SIMD    SSE2 (128-bit)   AVX2 (256-bit)   AVX-512 (512-bit)
Low-end (2GHz)      60.3M/s    241.2M/s         482.3M/s         964.7M/s
Mid-range (3GHz)    60.3M/s    241.2M/s         482.3M/s         964.7M/s
High-end (4GHz)     60.3M/s    241.2M/s         482.3M/s         964.7M/s

Latency

  • Queue Insertion: 2-4 cycles
  • Batch Processing: 8-12 cycles per batch
  • Code Extraction: 4-8 cycles
  • Total End-to-End: 14-24 cycles (~7-12 ns on a 2GHz CPU)

Memory Efficiency

  • Bit Utilization: 96% of bits effectively utilized
  • L1 Cache Usage: 32KB (100% of typical L1 data cache)
  • Memory Footprint: Fixed 32KB regardless of worker count

Key Performance Insight: The DragonFire Cache's throughput is largely CPU-frequency independent because it is bounded by batch processing width rather than clock speed: the table above scales with SIMD class but is flat across 2-4GHz. This means predictable, consistent performance across a wide range of hardware.

API Reference

The DragonFire Cache exposes a simple but powerful C API for integration into applications:

Core Functions

// Initialize Dragon Cache system
dragon_status_t dragon_init(dragon_config_t* config);

// Register worker interest in a specific code
dragon_status_t dragon_register_worker(uint32_t worker_id, uint64_t target_code);

// Process one frame
dragon_stats_t dragon_process_frame(void);

// Query system status
dragon_status_t dragon_get_status(dragon_stats_t* stats);

// Shutdown system
void dragon_shutdown(void);

Configuration Structure

typedef struct {
    uint32_t zero_cube_size;        // Size of Zero Cube in bytes
    uint32_t worker_queue_size;     // Size of worker queues in bytes
    uint32_t batch_threshold;       // Minimum workers to process as batch
    uint32_t frame_rate;            // Frames per second (1024 default)
    bool use_simd;                  // Enable SIMD acceleration
    bool pin_to_l1;                 // Force pinning to L1 cache
} dragon_config_t;

For complete API documentation, see the DragonFire Cache API Reference.

Integration with DragonFire Ecosystem

The DragonFire Cache is designed to integrate seamlessly with other components of the DragonFire ecosystem:

Krishna Routing System (KRS)

The DragonFire Cache works in tandem with the Krishna Routing System, which handles the routing of execution packets to appropriate processing units.

The KRS leverages the DragonFire Cache's frictionless memory access to achieve zero-overhead routing of computation tasks.

Lion Caching System

The Lion Caching System sits on both the front-end and back-end of the Krishna Routing System, managing execution point allocation and optimizing cache usage.

It uses nibble-based caching control to dynamically assign execution paths based on usage patterns.

TARDIS Timing System

The TARDIS system ensures precise timing of cache operations, maintaining the 1024 FPS frame rate with microsecond precision.

This temporal precision is critical for deterministic performance in the DragonFire Cache.

[Diagram: DragonFire Cache integration with the Krishna Routing System, Lion Caching System, and TARDIS]

Code Examples

Basic Usage

#include "dragon_cache.h"

int main() {
    // Configure the Dragon Cache
    dragon_config_t config = {
        .zero_cube_size = 8 * 1024,         // 8KB Zero Cube
        .worker_queue_size = 24 * 1024,     // 24KB Worker Queue
        .batch_threshold = 64,              // Process batches of 64+
        .frame_rate = 1024,                 // 1024 FPS
        .use_simd = true,                   // Enable SIMD acceleration
        .pin_to_l1 = true                   // Pin to L1 cache
    };
    
    // Initialize the system
    dragon_status_t status = dragon_init(&config);
    if (status != DRAGON_SUCCESS) {
        fprintf(stderr, "Failed to initialize Dragon Cache: %d\n", status);
        return 1;
    }
    
    // Register workers
    for (uint32_t i = 0; i < 1000; i++) {
        // Worker ID, Target code
        dragon_register_worker(i, calculate_target_code(i));
    }
    
    // Process frames in a loop
    for (int frame = 0; frame < 60; frame++) {
        dragon_stats_t stats = dragon_process_frame();
        printf("Frame %d: Processed %u workers\n", frame, stats.workers_processed);
    }
    
    // Shutdown
    dragon_shutdown();
    return 0;
}

Advanced: Multi-core Scaling

#include "dragon_cache.h"
#include 

#define NUM_CORES 4

typedef struct {
    int core_id;
    dragon_config_t config;
} thread_args_t;

void* core_thread(void* arg) {
    thread_args_t* args = (thread_args_t*)arg;
    
    // Customize config for this core
    args->config.zero_cube_size = 8 * 1024;
    args->config.worker_queue_size = 24 * 1024;
    
    // Initialize Dragon Cache for this core
    dragon_status_t status = dragon_init_on_core(&args->config, args->core_id);
    if (status != DRAGON_SUCCESS) {
        fprintf(stderr, "Failed to initialize on core %d\n", args->core_id);
        return NULL;
    }
    
    // Register workers for this core
    uint32_t workers_per_core = 1000;
    uint32_t worker_offset = args->core_id * workers_per_core;
    
    for (uint32_t i = 0; i < workers_per_core; i++) {
        uint32_t worker_id = worker_offset + i;
        dragon_register_worker(worker_id, calculate_target_code(worker_id));
    }
    
    // Process frames
    for (int frame = 0; frame < 1000; frame++) {
        dragon_process_frame();
        // Wait for synchronization point
        dragon_sync_barrier();
    }
    
    dragon_shutdown_core(args->core_id);
    return NULL;
}

int main() {
    pthread_t threads[NUM_CORES];
    thread_args_t args[NUM_CORES];
    
    // Create core-specific Dragon Cache instances
    for (int i = 0; i < NUM_CORES; i++) {
        args[i].core_id = i;
        pthread_create(&threads[i], NULL, core_thread, &args[i]);
    }
    
    // Wait for all cores to finish
    for (int i = 0; i < NUM_CORES; i++) {
        pthread_join(threads[i], NULL);
    }
    
    return 0;
}
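
Because L1 data caches are per-core, giving each core its own 32KB instance keeps the Zero Cube and worker queues resident in that core's L1 and avoids cross-core cache-line contention; the dragon_sync_barrier() call aligns the cores at frame boundaries so their batches stay in step.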

Additional Resources

  • Technical Documentation
  • Related Components
  • Examples & Tutorials
  • SDK Downloads