Next-Generation Memory Coherence
This article was originally published on the SpeakEZ Technologies blog as part of our early design work on the Fidelity Framework. It has been updated to reflect the Clef language naming and current project structure.
SpeakEZ’s Fidelity framework with its innovative BAREWire technology is uniquely positioned to take advantage of emerging memory coherence and interconnect technologies like CXL, NUMA, and recent PCIe enhancements. By combining BAREWire’s zero-copy architecture with these hardware innovations, Fidelity can give developers unprecedented control over heterogeneous computing environments through the elegant semantics of the Clef language.
This innovation represents a fundamental shift in how distributed memory systems interact and in the cognitive demands they place on the software engineering process. It stands to revolutionize distributed model training by eliminating the traditional memory-management boundaries that have constrained AI workloads and the teams that build them.
The challenge with CXL lies not in the hardware but in the software abstractions available to leverage it. C++ CXL libraries expose raw pointers and require manual tracking of which memory regions reside in which pools; the programmer maintains a mental model of coherence domains that the type system cannot verify. Rust improves memory safety but its ownership model assumes a single coherent address space; CXL’s multiple memory pools with different latency characteristics fall outside what the borrow checker can express. Fidelity’s type system encodes memory pool residency directly, making pool-aware allocation verifiable at compile time rather than debuggable at runtime.
BAREWire and CXL: A Perfect Match for Zero-Copy Computing
BAREWire’s fundamental premise of unified memory abstractions aligns perfectly with CXL’s hardware-level coherent memory access capabilities. Here’s how Fidelity would leverage CXL:
module BAREWire.CXL =
    // Clef Extended Units of measure for memory safety
    [<Measure>] type addr // Memory address
    [<Measure>] type bytes // Size in bytes
    [<Measure>] type cxl_mem // CXL memory space
    [<Measure>] type cpu_mem // CPU memory space
    [<Measure>] type unified // Unified memory space
    // CXL-aware memory allocation with hardware coherency
    let allocateCoherentBuffer<'T> (size: int<bytes>) : SharedBuffer<'T, unified> =
        // Determine if CXL.mem is available through sysfs interface
        let cxlAvailable = checkCXLAvailability()
        if cxlAvailable then
            // Use ioctl interface to allocate from CXL memory pool
            let fd = openCXLDevice()
            let cxlConfig = {
                size = size
                interleave_ways = 1
                interleave_granularity = CXL_INTERLEAVE_GRANULARITY_256
                restrictions = CXL_MEM_RESTRICT_TYPE_NORMAL
            }
            let ptr = allocateCXLMemory<'T>(fd, cxlConfig)
            {
                Address = ptr
                Size = size
                Layout = MemoryLayout.getOptimized<'T>()
                MemoryType = MemoryType.CXL
            }
        else
            // Fall back to standard unified memory
            let ptr = allocateUnifiedMemory<'T>(size)
            {
                Address = ptr
                Size = size
                Layout = MemoryLayout.getOptimized<'T>()
                MemoryType = MemoryType.Standard
            }
This implementation adapts dynamically to the presence of CXL hardware, using it when available and falling back gracefully when it is not. BAREWire’s memory abstraction model already prepares applications for the kind of unified memory that CXL provides at the hardware level.
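The measure annotations at the top of that module do real work in F#-style units of measure, which Clef's extended units build on. As a minimal sketch in plain F#, assuming a hypothetical `elements` measure and a `bufferSize` helper that are not part of BAREWire, a byte count and an element count become distinct types that cannot be mixed by accident:

```fsharp
// Sketch only: `elements` and `bufferSize` are illustrative names, not BAREWire APIs.
[<Measure>] type bytes     // size in bytes
[<Measure>] type elements  // count of array elements

// Elements times bytes-per-element yields bytes; the units are checked at compile time.
let bufferSize (count: int<elements>) (elementSize: int<bytes/elements>) : int<bytes> =
    count * elementSize

let float32Size = 4<bytes/elements>
let size = bufferSize 1024<elements> float32Size

// Uncommenting the next line is a compile-time error: `elements` is not `bytes`.
// let wrong : int<bytes> = 1024<elements>
```

Measures are erased after type checking, so the distinction carries no runtime cost.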
Compare this with the C++ approach using libcxlmem or similar libraries. The C++ programmer calls cxl_malloc() and receives a void pointer with no type-level indication of which memory pool it came from. When the allocation fails, the programmer checks errno and hopes they remembered to handle all the failure modes. The Fidelity approach encodes pool information in the type: SharedBuffer<'T, unified> vs SharedBuffer<'T, cxl_mem> vs SharedBuffer<'T, cpu_mem>. The compiler prevents passing a CPU-local buffer to code expecting CXL-coherent memory. This is not merely defensive programming; it is structural correctness that C++ cannot express.
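The idea of putting pool residency in the type can be approximated in ordinary F# with phantom type parameters. The sketch below is an assumption-laden illustration, not the BAREWire implementation: the marker types `CpuMem` and `CxlMem`, the `Buffers` module, and `sumCoherent` are all hypothetical names, and the buffers are plain managed arrays rather than CXL allocations.

```fsharp
// Sketch only: marker types and helper names are assumptions for illustration.
type CpuMem = CpuMem   // tag for CPU-local memory
type CxlMem = CxlMem   // tag for CXL-coherent memory

// 'Space is a phantom parameter: it appears in the type but in no field.
type SharedBuffer<'T, 'Space> = { Data: 'T[] }

module Buffers =
    let allocCpu<'T> (length: int) : SharedBuffer<'T, CpuMem> =
        { Data = Array.zeroCreate length }
    let allocCxl<'T> (length: int) : SharedBuffer<'T, CxlMem> =
        { Data = Array.zeroCreate length }

// Only buffers tagged as CXL-coherent are accepted here.
let sumCoherent (buffer: SharedBuffer<float32, CxlMem>) : float32 =
    Array.sum buffer.Data

let coherent = Buffers.allocCxl<float32> 4
let total = sumCoherent coherent

// The next line does not compile: SharedBuffer<float32, CpuMem> is a different type.
// let bad = sumCoherent (Buffers.allocCpu<float32> 4)
```

The tag exists only at compile time, so a mismatch is rejected before any allocation happens.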
Hardware Coherency and Memory Models
CXL Type 2 devices provide full bidirectional coherency, but leveraging this capability safely requires toolchain support that C++ and Rust currently lack. The hardware ensures cache coherence; the software must ensure that access patterns respect coherence domain boundaries. A C++ programmer might allocate from a CXL pool, pass the pointer to GPU code, and discover at runtime that the GPU cannot access CXL memory directly. The pointer type provided no indication; the failure manifests as a segfault or silent corruption.
When using CXL Type 2 devices, BAREWire can eliminate the need for explicit synchronization in many cases:
// Create CXL memory views that leverage hardware coherency
let createGPUView<'T> (buffer: SharedBuffer<'T, unified>) =
    match buffer.MemoryType with
    | MemoryType.CXL ->
        // CXL Type 2 provides hardware coherency - no need for explicit synchronization
        { buffer with MemSpace = typedefof<gpu_mem>; CoherencyModel = CoherencyModel.Hardware }
    | _ ->
        // Fall back to software coherency model for non-CXL memory
        { buffer with MemSpace = typedefof<gpu_mem>; CoherencyModel = CoherencyModel.Software }
Developer-Friendly: From Primitives to Patterns
While the core BAREWire implementation deals with hardware-specific details, Clef developers don’t always have to wrestle with these lower-level abstractions. As conventions emerge, the framework will provide a constellation of supporting libraries that encapsulate these primitives into idiomatic Clef patterns familiar to application developers:
module Furnace =
    // Create a tensor with optimal memory placement for current hardware
    let tensor<'T> (dimensions: int list) : Tensor<'T> =
        // Under the hood: Uses platform detection to determine
        // optimal memory placement (CXL, NUMA, etc.)
        let platform = PlatformDetection.current()
        let size = dimensions |> List.fold (*) 1 |> fun s -> s * sizeof<'T>
        // The developer doesn't need to know about the underlying memory model
        let buffer = MemoryManager.allocateOptimal<'T>(size, platform)
        Tensor<'T>(buffer, dimensions)
    // Matrix multiplication with hardware acceleration
    let matmul (a: Tensor<float32>) (b: Tensor<float32>) : Tensor<float32> =
        // Automatically selects best implementation:
        // - CXL-aware for systems with CXL memory
        // - NUMA-optimized for multi-socket systems
        // - GPU-accelerated when available
        // - Fallback to optimized CPU implementation
        let platform = PlatformDetection.current()
        Operations.createMatmul platform a b |> Operations.execute
let modelTraining() =
    // Create tensors without worrying about memory placement
    let weights = Furnace.tensor<float32>([1024; 1024])
    let input = Furnace.tensor<float32>([128; 1024])
    // Perform matrix multiplication - hardware details abstracted away
    let output = Furnace.matmul weights input
    output
This approach allows Clef developers to work with familiar functional patterns while the underlying system handles the complexity of optimal memory placement and hardware acceleration.
The abstraction layer that Fidelity provides is qualitatively different from what C++ or Rust libraries can offer. A C++ tensor library might detect CXL at runtime and allocate appropriately, but the type signature of matmul remains unchanged: it accepts pointers and returns pointers. The programmer has no compile-time assurance that the tensors reside in compatible memory pools. A Rust tensor library might add lifetime annotations, but lifetimes track temporal validity, not spatial residency. When CXL introduces multiple memory pools with different access characteristics, neither language’s type system can express the constraints.
Fidelity’s actor model proves particularly well-suited to CXL architectures. Each memory pool maps naturally to an actor domain; actors own their memory regions and communicate through message passing with explicit capabilities. When an actor in the CPU domain needs to share a tensor with an actor in the GPU domain, it sends a capability that encodes both ownership transfer and residency requirements. The receiving actor knows at compile time whether the buffer is CXL-coherent and can access it accordingly. This capability-based ownership model goes beyond what Rust’s borrow checker can express for multi-pool memory architectures.
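A rough feel for residency-carrying capabilities can be had in standard F# using `MailboxProcessor`. This is a sketch under stated assumptions: `Capability`, `GpuMsg`, and the residency tags are hypothetical names, and it models only the residency half of the idea. F# has no linear types, so ownership transfer is not enforced here.

```fsharp
// Sketch only: tags, Capability, and the message type are illustrative, not Olivier APIs.
type CpuLocal = CpuLocal
type CxlCoherent = CxlCoherent

// A capability names its owner and carries residency as a phantom type tag.
type Capability<'Residency> = { Owner: string; Payload: float32[] }

// The GPU-side actor's protocol only admits CXL-coherent capabilities.
type GpuMsg =
    | Share of Capability<CxlCoherent> * AsyncReplyChannel<float32>

let gpuActor =
    MailboxProcessor<GpuMsg>.Start(fun inbox ->
        let rec loop () =
            async {
                let! msg = inbox.Receive()
                match msg with
                | Share (cap, reply) ->
                    // Reads the payload in place; the array itself is not copied.
                    reply.Reply(Array.sum cap.Payload)
                    return! loop ()
            }
        loop ())

let cap : Capability<CxlCoherent> =
    { Owner = "cpu-actor"; Payload = [| 1.0f; 2.0f; 3.0f |] }
let sum = gpuActor.PostAndReply(fun channel -> Share (cap, channel))
```

A `Capability<CpuLocal>` cannot be placed in a `Share` message, so the mismatch is caught when the sender is compiled rather than when the receiver runs.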
Memory Access Patterns Library
Another developer-friendly abstraction is the Memory Access Patterns library, which provides high-level constructs for common memory access scenarios:
module MemoryPatterns =
    // Producer-consumer pattern with zero-copy semantics
    let producerConsumer<'T> (producer: unit -> 'T[]) (consumer: 'T[] -> unit) =
        use buffer = SharedRingBuffer.create<'T>(capacity = 1024)
        // Start producer and consumer tasks
        let producerTask =
            async {
                while true do
                    let data = producer()
                    // Zero-copy operation regardless of whether using CXL or not
                    buffer.EnqueueBatch(data)
            }
        let consumerTask =
            async {
                while true do
                    // Dequeue with zero-copy semantics
                    let data = buffer.DequeueBatch(batchSize = 128)
                    consumer(data)
            }
        // Run both tasks
        [producerTask; consumerTask] |> Async.Parallel |> Async.Ignore
A library such as this would allow developers to express common communication patterns without worrying about the underlying memory management details.
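The ring buffer at the heart of that pattern can be sketched in a few lines. The version below is a deliberately simple, single-threaded illustration and makes no claim to match `SharedRingBuffer`: the `RingBuffer` type and its members are hypothetical, and there is no shared memory or synchronization involved. For reference types such as arrays it stores references, so a batch is handed from producer to consumer without being copied.

```fsharp
// Sketch only: a fixed-capacity, single-threaded ring buffer; not the SharedRingBuffer API.
type RingBuffer<'T>(capacity: int) =
    let items : 'T[] = Array.zeroCreate capacity
    let mutable head = 0    // index of the oldest item
    let mutable count = 0   // number of stored items

    member this.Count = count

    // Returns false instead of overwriting when the buffer is full.
    member this.TryEnqueue(item: 'T) : bool =
        if count = capacity then
            false
        else
            items.[(head + count) % capacity] <- item
            count <- count + 1
            true

    // Returns None when the buffer is empty.
    member this.TryDequeue() : 'T option =
        if count = 0 then
            None
        else
            let item = items.[head]
            head <- (head + 1) % capacity
            count <- count - 1
            Some item
```

Returning `false` or `None` instead of blocking keeps the sketch free of threading concerns; a concurrent version would need atomic index updates or a lock.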
NUMA-Aware Memory Management
Fidelity’s platform configuration can include NUMA topology awareness, enabling optimal memory placement. NUMA-aware programming in C++ typically involves libnuma calls scattered throughout the codebase, with no type-level connection between where memory was allocated and where it is accessed. The programmer might allocate on NUMA node 0 and accidentally schedule the accessing thread on node 3; the code runs correctly but performs poorly, and nothing in the type system warns of the mismatch. Rust’s type system cannot express NUMA affinity at all; memory is memory.
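The cost of the node 0 versus node 3 mismatch can be made concrete with a node distance matrix. The sketch below uses made-up values and follows the ACPI SLIT convention in which 10 denotes local access; `accessCost` and `nearestNode` are hypothetical helpers, not Fidelity APIs.

```fsharp
// Sketch only: the matrix values are invented; real values come from the platform.
// Convention assumed here: 10 means local access, larger means farther away.
let distances =
    array2D [ [ 10; 21; 31; 41 ]
              [ 21; 10; 21; 31 ]
              [ 31; 21; 10; 21 ]
              [ 41; 31; 21; 10 ] ]

// Relative cost for a thread on `threadNode` touching memory on `memoryNode`.
let accessCost (d: int[,]) (threadNode: int) (memoryNode: int) : int =
    d.[threadNode, memoryNode]

// Pick the candidate node closest to the thread. Throws if `candidates` is empty.
let nearestNode (d: int[,]) (threadNode: int) (candidates: int list) : int =
    candidates |> List.minBy (fun node -> d.[threadNode, node])

let mismatchCost = accessCost distances 3 0   // thread on node 3, memory on node 0
let localCost = accessCost distances 3 3      // thread and memory both on node 3
let better = nearestNode distances 3 [ 0; 1 ] // best of the two remote candidates
```

With these illustrative numbers the mismatched access is roughly four times the local distance, which is the kind of silent penalty the paragraph above describes.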
Fidelity encodes NUMA topology in the platform configuration and carries allocation hints through the type system:
type NumaTopology = {
    NodeCount: int
    NodeDistances: int[,] // Distance matrix
    CXLNodes: int list // NUMA nodes that represent CXL memory
}
let withNumaTopology (topology: NumaTopology) (config: PlatformConfig) =
    { config with NumaTopology = Some topology }
let allocateNuma<'T> (size: int<bytes>) (config: PlatformConfig) =
    match config.NumaTopology with
    | Some topology when topology.CXLNodes.Length > 0 ->
        // Prioritize CXL memory for large buffers (over 512 MiB)
        if size > 512 * 1024 * 1024<bytes> then
            let cxlNode = topology.CXLNodes |> List.head
            BAREWire.allocateOnNode<'T>(size, cxlNode)
        else
            // Use local NUMA node for smaller allocations
            let localNode = getCurrentNumaNode()
            BAREWire.allocateOnNode<'T>(size, localNode)
    | Some topology ->
        // Standard NUMA allocation strategy
        let localNode = getCurrentNumaNode()
        BAREWire.allocateOnNode<'T>(size, localNode)
    | None ->
        // Fall back to default allocation
        BAREWire.allocate<'T>(size)
High-Level NUMA Abstractions
Developers can leverage NUMA awareness without directly interacting with topology details:
type NumaAwareCollection<'T> =
    static member Create(initialCapacity: int) : NumaAwareCollection<'T> =
        // Internal implementation handles NUMA topology detection
        // and optimal data placement
        let platform = PlatformDetection.current()
        NumaAwareCollection<'T>(initialCapacity, platform)
    member this.Add(item: 'T) : unit =
        // Placement logic hidden from developer
        this.Internal.AddToOptimalNode(item)
    // Parallel operations automatically respect NUMA topology
    member this.ForAll(action: 'T -> unit) : unit =
        // Executes the action in parallel across NUMA domains
        this.Internal.NumaTopology(action)
Resizable BAR for GPU Memory Access
Resizable BAR allows the CPU to map the entire GPU framebuffer into its address space, enabling direct access without staging buffers. C++ CUDA code can use Resizable BAR through unified memory APIs, but the programmer must still track whether a pointer refers to CPU memory, GPU memory, or unified memory; the type is always void* or T*. A function that expects GPU-resident data might receive a CPU pointer and fail silently or crash. Rust’s gpu-allocator crate improves matters somewhat, but the ownership model still cannot distinguish memory residency.
Our BAREWire technology can take advantage of Resizable BAR to enable zero-copy operations with GPU memory, with residency encoded in the type:
module BAREWire.GPU =
    // Check if Resizable BAR is supported
    let isResizableBarSupported() =
        let pciDir = "/sys/bus/pci/devices/"
        let gpuDevices = findGPUDevices(pciDir)
        gpuDevices |> List.exists (fun dev ->
            let resizableBarPath = $"{pciDir}{dev}/resizable_bar"
            if File.Exists(resizableBarPath) then
                let content = File.ReadAllText(resizableBarPath).Trim()
                content = "1" || content = "enabled"
            else
                false
        )
    // Create zero-copy buffer using Resizable BAR
    let createGpuZeroCopyBuffer<'T> (size: int<bytes>) =
        if isResizableBarSupported() then
            let gpuMem = allocateGpuMemory<'T>(size, MemoryFlag.CPUAccessible)
            {
                Address = gpuMem.address
                Size = size
                Layout = MemoryLayout.getOptimized<'T>()
                MemoryType = MemoryType.GPUResizableBAR
            }
        else
            let gpuMem = allocateGpuMemory<'T>(size, MemoryFlag.Default)
            {
                Address = gpuMem.address
                Size = size
                Layout = MemoryLayout.getOptimized<'T>()
                MemoryType = MemoryType.GPUStandard
            }
Making Hardware Acceleration Transparent
Developers can access GPU capabilities through high-level APIs that hide the complexity of Resizable BAR and memory management:
module Accelerate =
    let map<'T, 'U> (mapping: 'T -> 'U) (input: 'T[]) : 'U[] =
        // Under the hood: Uses Resizable BAR when available,
        // falls back to explicit transfers when needed
        let platform = PlatformDetection.current()
        let kernel = Kernel.fromFunc mapping
        // Execute with optimal memory strategy
        GpuExecutor.execute kernel input platform
    let filter<'T> (predicate: 'T -> bool) (input: 'T[]) : 'T[] =
        // GPU-accelerated filter operation
        let platform = PlatformDetection.current()
        GpuExecutor.executeFilter predicate input platform
let processImage (image: Image) =
    let brightened =
        image.Pixels
        |> Accelerate.map (fun pixel ->
            { R = min 255 (pixel.R * 1.2);
              G = min 255 (pixel.G * 1.2);
              B = min 255 (pixel.B * 1.2) })
        |> Image.fromPixelArray image.Width image.Height
    brightened
This abstraction allows developers to express computations in natural Clef style, while the system handles the complexity of GPU acceleration and memory management.
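As a CPU-only reference for the brightening step, the mapping can be written so that it type-checks in standard F#. This sketch assumes a `Pixel` record with `byte` channels, which the article does not define, and uses `Array.map` as a stand-in for `Accelerate.map`; `scaleChannel` and `brighten` are hypothetical helpers and expect a non-negative factor.

```fsharp
// Sketch only: Pixel with byte channels is an assumption; the article does not define it.
type Pixel = { R: byte; G: byte; B: byte }

// Scale one channel and saturate at 255 instead of wrapping around.
let scaleChannel (factor: float) (channel: byte) : byte =
    min 255.0 (System.Math.Round(float channel * factor)) |> byte

let brighten (factor: float) (p: Pixel) : Pixel =
    { R = scaleChannel factor p.R
      G = scaleChannel factor p.G
      B = scaleChannel factor p.B }

// Array.map stands in for Accelerate.map; the mapping function is unchanged.
let brightened =
    [| { R = 100uy; G = 200uy; B = 250uy } |]
    |> Array.map (brighten 1.2)
```

Because the mapping is a pure function over one pixel, the same function could in principle be handed to a GPU-backed `map` without modification.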
Unified Platform for Heterogeneous Memory
The power of Fidelity’s approach comes from its functional composition model for platform configuration, which can be extended to include CXL, NUMA, and PCIe capabilities:
type MemoryInterconnectCapabilities = {
    HasCXL: bool
    CXLVersion: CXLVersion option
    ResizableBAR: bool
    NumaTopology: NumaTopology option
}
let withCXLSupport (version: CXLVersion) (config: PlatformConfig) =
    let interconnect =
        defaultArg config.MemoryInterconnect
            { HasCXL = false; CXLVersion = None; ResizableBAR = false; NumaTopology = None }
    { config with
        MemoryInterconnect = Some { interconnect with HasCXL = true; CXLVersion = Some version } }
let withResizableBAR (config: PlatformConfig) =
    let interconnect =
        defaultArg config.MemoryInterconnect
            { HasCXL = false; CXLVersion = None; ResizableBAR = false; NumaTopology = None }
    { config with
        MemoryInterconnect = Some { interconnect with ResizableBAR = true } }
// A configuration for high-end data center with CXL 3.0
let dataCenter =
    PlatformConfig.base'
    |> withPlatform PlatformType.Server
    |> withMemoryModel MemoryModelType.Abundant
    |> withHeapStrategy HeapStrategyType.PerProcessGC
    |> withCXLSupport CXLVersion.V3_0
    |> withResizableBAR
Configuration Presets and Automatic Detection
For most developers, even these configuration details are abstracted away through presets and automatic detection:
module AppConfig =
    // Automatically detect and configure for current hardware
    let autoDetect() =
        let platform = PlatformDetection.current()
        platform |> PlatformConfig.fromDetectedCapabilities
    // Common configuration presets
    let forDataScience() =
        PlatformConfig.presets.DataScience
    let forRealTimeProcessing() =
        PlatformConfig.presets.LowLatency
    let forEdgeDeployment() =
        PlatformConfig.presets.EmbeddedHighPerformance
let startApplication() =
    let config = AppConfig.autoDetect()
    // Option to select from common presets with customization
    let customConfig =
        AppConfig.forDataScience()
        |> withMemoryLimit (4L * 1024L * 1024L * 1024L) // 4GB limit
    // Start application with optimal configuration
    Application.start customConfig
This approach allows application developers to remain in Clef’s high-level, functional programming paradigm while still benefiting from advanced hardware capabilities.
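The functional composition behind these configurations can be tried out in plain F# with simplified stand-in types. Everything in the sketch below is an assumption made for illustration: the `Interconnect` and `PlatformConfig` records are trimmed-down versions of the article's types, and `supportsPooling` is a hypothetical helper that shows how the ordering of union cases supports a version comparison.

```fsharp
// Sketch only: simplified stand-ins for the article's configuration types.
type CXLVersion = V1_1 | V2_0 | V3_0   // case order gives V1_1 < V2_0 < V3_0

type Interconnect = { HasCXL: bool; CXLVersion: CXLVersion option; ResizableBAR: bool }

type PlatformConfig = { Name: string; MemoryInterconnect: Interconnect option }

let noInterconnect = { HasCXL = false; CXLVersion = None; ResizableBAR = false }

// Each builder fills in a default interconnect if none is present, then updates one aspect.
let withCXLSupport (version: CXLVersion) (config: PlatformConfig) =
    let current = defaultArg config.MemoryInterconnect noInterconnect
    { config with MemoryInterconnect = Some { current with HasCXL = true; CXLVersion = Some version } }

let withResizableBAR (config: PlatformConfig) =
    let current = defaultArg config.MemoryInterconnect noInterconnect
    { config with MemoryInterconnect = Some { current with ResizableBAR = true } }

// Pooling arrived with CXL 2.0, so compare against V2_0.
let supportsPooling (config: PlatformConfig) =
    match config.MemoryInterconnect with
    | Some { HasCXL = true; CXLVersion = Some v } -> v >= V2_0
    | _ -> false

let baseConfig = { Name = "server"; MemoryInterconnect = None }
let a = baseConfig |> withCXLSupport V3_0 |> withResizableBAR
let b = baseConfig |> withResizableBAR |> withCXLSupport V3_0
```

Because each builder touches a different field of an immutable record, the two pipelines produce equal configurations regardless of order.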
ML Tensor Operations with CXL
Here’s a practical example of how Fidelity would leverage CXL for machine learning workloads:
let trainModelWithCXL (model: MLModel) (dataset: Dataset) (config: PlatformConfig) =
    let parameterBuffer =
        match config.MemoryInterconnect with
        | Some { HasCXL = true } ->
            // Use CXL memory for parameters as they need GPU access but are modified by CPU
            BAREWire.CXL.allocateCoherentBuffer<float32>(model.ParameterCount * 4<bytes>)
        | _ ->
            // Fall back to standard memory with explicit transfers
            BAREWire.allocate<float32>(model.ParameterCount * 4<bytes>)
    // Create model with CXL-aware memory allocation
    let cxlModel = {
        Parameters = parameterBuffer
        Architecture = model.Architecture
        Config = config
    }
    // Train using data-parallel approach
    DataParallel.train cxlModel dataset {
        BatchSize = 128
        Epochs = 10
        Optimizer = Optimizer.Adam(LearningRate = 0.001)
    }
ML Frameworks: Clef Idioms for Deep Learning
For data scientists and ML engineers, Fidelity provides high-level, Clef-idiomatic libraries that hide the memory management complexity:
module DeepLearning =
    let model = nn {
        input [| 784 |]
        dense 128 activation = Activation.ReLU
        dense 64 activation = Activation.ReLU
        dense 10 activation = Activation.Softmax
        optimizer Adam {
            learning_rate = 0.001
            beta1 = 0.9
            beta2 = 0.999
        }
        loss CrossEntropy
    }
    // Train model with automatic hardware optimization
    let trainResult = model.Train(mnist, epochs = 10, batch_size = 128)
    // The framework automatically:
    // - Detects CXL availability and uses it if present
    // - Optimizes memory placement across NUMA nodes
    // - Leverages GPU acceleration with zero-copy where possible
    // - Scales to multiple devices if available
let recognizeDigits() =
    let mnist = Dataset.MNIST.load()
    let model = nn {
        // Model definition as above
    }
    // Train with automatic hardware optimization
    let trainedModel = model.Fit(mnist.Train, epochs = 10)
    // Evaluate
    let accuracy = trainedModel.Evaluate(mnist.Test)
    printfn "Test accuracy: %.2f%%" (accuracy * 100.0)
This high-level API allows data scientists to focus on model architecture and training logic while the framework handles all memory and hardware optimization details.
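To connect this model back to the `ParameterCount * 4<bytes>` allocation in the earlier training example, the parameter buffer size can be worked out by hand. The sketch assumes plain fully connected layers with one bias per output unit, which the `nn { ... }` block does not state explicitly; `denseParams` is a hypothetical helper.

```fsharp
// Sketch only: assumes fully connected layers with one bias per output unit.
[<Measure>] type bytes

let denseParams (inputs: int) (outputs: int) : int =
    inputs * outputs + outputs   // weights plus biases

// Layer widths from the nn { ... } definition: 784 -> 128 -> 64 -> 10.
let layerWidths = [ 784; 128; 64; 10 ]

let parameterCount =
    layerWidths
    |> List.pairwise
    |> List.sumBy (fun (i, o) -> denseParams i o)

let bufferBytes = parameterCount * 4<bytes>   // a float32 occupies 4 bytes
```

Under these assumptions the three layers contribute 100,480, 8,256, and 650 parameters, for 109,386 in total and a buffer of 437,544 bytes (about 427 KiB), small enough that placement matters far less here than it would for a large model.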
BAREWire and CXL Memory Pooling
CXL 2.0+ adds memory pooling capabilities that BAREWire can leverage for dynamic resource allocation:
module BAREWire.MemoryPool =
    let createPool (size: int<bytes>) (config: PlatformConfig) =
        match config.MemoryInterconnect with
        | Some { HasCXL = true; CXLVersion = Some v } when v >= CXLVersion.V2_0 ->
            let fd = openCXLDevice()
            let poolConfig = {
                pool_id = 1
                total_size = size |> int64
                granularity = CXL_POOL_GRANULARITY_4K
            }
            let poolId = createCXLPool(fd, poolConfig)
            {
                PoolId = poolId
                Size = size
                Type = PoolType.CXL
            }
        | _ ->
            {
                PoolId = createStandardPool(size)
                Size = size
                Type = PoolType.Standard
            }
    let allocateFromPool<'T> (pool: MemoryPool) (size: int<bytes>) =
        match pool.Type with
        | PoolType.CXL ->
            let fd = openCXLDevice()
            let req = {
                pool_id = pool.PoolId
                size = size |> int64
            }
            let ptr = claimCXLMemory<'T>(fd, req)
            {
                Address = ptr
                Size = size
                Layout = MemoryLayout.getOptimized<'T>()
                MemoryType = MemoryType.CXLPool
                PoolId = Some pool.PoolId
            }
        | PoolType.Standard ->
            let ptr = allocateFromStandardPool<'T>(pool.PoolId, size)
            {
                Address = ptr
                Size = size
                Layout = MemoryLayout.getOptimized<'T>()
                MemoryType = MemoryType.StandardPool
                PoolId = Some pool.PoolId
            }
Resource Library: High-Level Memory Pools
Developers interact with these capabilities through high-level resource management APIs:
module Resources =
    type ResourcePool<'T> =
        static member Create(initialCapacity: int) =
            let platform = PlatformDetection.current()
            let pool =
                if platform.HasCXL && platform.CXLVersion.IsSome &&
                   platform.CXLVersion.Value >= CXLVersion.V2_0 then
                    CXLBackedPool<'T>(initialCapacity)
                else
                    StandardPool<'T>(initialCapacity)
            new ResourcePool<'T>(pool)
        member this.Use(action: 'T -> 'R) : 'R =
            use resource = this.Pool.Borrow()
            action resource
        member this.UseAsync(action: 'T -> Async<'R>) : Async<'R> =
            async {
                use! resource = this.Pool.BorrowAsync()
                return! action resource
            }
let processRequests() =
    let bufferPool = Resources.ResourcePool<byte[]>.Create(initialCapacity = 10)
    let processRequest (request: Request) =
        bufferPool.Use(fun buffer ->
            fillBufferWithRequestData(request, buffer)
            transformData(buffer)
            sendResponse(request.Id, buffer)
        )
This abstraction allows developers to efficiently manage large resources without concerning themselves with the underlying memory technology details.
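The borrow-and-return behaviour of `Use` relies on scoped disposal, which can be demonstrated with a small in-process pool. This is a single-threaded sketch with hypothetical names (`SimplePool`, `Lease`); it does not involve CXL-backed memory and is not the `ResourcePool` implementation.

```fsharp
open System
open System.Collections.Generic

// Sketch only: a single-threaded pool; SimplePool and Lease are illustrative names.
type Lease<'T>(item: 'T, release: 'T -> unit) =
    member this.Item = item
    interface IDisposable with
        member this.Dispose() = release item

type SimplePool<'T>(create: unit -> 'T, initialCapacity: int) =
    let free = Stack<'T>()
    do
        for _ in 1 .. initialCapacity do
            free.Push(create ())

    member this.Available = free.Count

    // Hand out a pooled item, creating a new one if the pool is empty.
    member this.Borrow() : Lease<'T> =
        let item = if free.Count > 0 then free.Pop() else create ()
        new Lease<'T>(item, (fun returned -> free.Push returned))

    // Run `action` with a borrowed item; `use` returns it even if `action` throws.
    member this.Use(action: 'T -> 'R) : 'R =
        use lease = this.Borrow()
        action lease.Item

let pool = SimplePool<byte[]>((fun () -> Array.zeroCreate 4096), 2)
let availableDuringUse =
    pool.Use(fun buffer ->
        buffer.[0] <- 1uy
        pool.Available)
```

Tying the return of the buffer to `Dispose` means callers cannot forget to give it back, which is the property the high-level API above depends on.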
Integration with the Olivier Actor Model
Fidelity’s Olivier actor model can be extended to leverage CXL and NUMA for optimal process placement:
module Olivier.Actors =
    // Create an actor with awareness of memory topology
    let createActor<'Msg, 'State> (initialState: 'State) (behavior: 'State -> 'Msg -> 'State) (config: PlatformConfig) =
        // Determine optimal placement based on memory access patterns
        let placement =
            match config.MemoryInterconnect, inferMemoryAccessPattern<'State, 'Msg>() with
            | Some { NumaTopology = Some topo; HasCXL = true }, AccessPattern.GPUIntensive ->
                let cxlNode = topo.CXLNodes |> List.head
                ProcessPlacement.NumaNode cxlNode
            | Some { NumaTopology = Some topo }, AccessPattern.MemoryIntensive ->
                let localNode = getCurrentNumaNode()
                ProcessPlacement.NumaNode localNode
            | _ ->
                ProcessPlacement.Default
        // Create actor with optimal placement
        Actor.create initialState behavior placement
Erlang-Inspired Concurrency with Clef Idioms
Developers interact with the actor system through high-level, Clef-idiomatic APIs:
module Olivier =
    type CounterMsg =
        | Increment
        | Decrement
        | Get of AsyncReplyChannel<int>
    let createOptimalActor<'Msg> (config: PlatformConfig) (body: MailboxProcessor<'Msg> -> Async<unit>) =
        let msgMemoryProfile = TypeAnalysis.getMemoryProfile<'Msg>()
        match config.MemoryInterconnect, msgMemoryProfile with
        | Some { NumaTopology = Some topo; HasCXL = true }, MemoryProfile.Large ->
            // For large messages, use CXL memory if available
            let node = topo.CXLNodes |> List.head
            let options =
                MailboxProcessorOptions.Default
                |> MailboxProcessorOptions.withNumaNode node
                |> MailboxProcessorOptions.withZeroCopy true
            MailboxProcessor.Start(body, options)
        | Some { NumaTopology = Some topo }, _ ->
            // Otherwise use local NUMA node
            let node = getCurrentNumaNode()
            let options =
                MailboxProcessorOptions.Default
                |> MailboxProcessorOptions.withNumaNode node
            MailboxProcessor.Start(body, options)
        | _ ->
            // Or fall back to standard MailboxProcessor
            MailboxProcessor.Start(body)
    let createCounter() =
        createOptimalActor PlatformConfig.current (fun inbox ->
            let rec loop count = async {
                let! msg = inbox.Receive()
                match msg with
                | Increment ->
                    return! loop (count + 1)
                | Decrement ->
                    return! loop (count - 1)
                | Get reply ->
                    reply.Reply count
                    return! loop count
            }
            loop 0
        )
let distributedProcessing() =
    // Message type with zero-copy capability
    type WorkerMsg =
        | Process of ZeroCopyBuffer<float32>
        | Shutdown
    // Create worker
    let createWorker() =
        Olivier.createOptimalActor PlatformConfig.current (fun inbox ->
            let rec loop() = async {
                let! msg = inbox.Receive()
                match msg with
                | Process data ->
                    // Process data without copying
                    let result = processDataWithoutCopying data
                    return! loop()
                | Shutdown ->
                    // Exit the loop
                    return ()
            }
            loop()
        )
    // Create worker pool
    let workers = Array.init 10 (fun _ -> createWorker())
    // Load-balancing round-robin dispatch
    let dispatch (data: ZeroCopyBuffer<float32>) =
        let index = Interlocked.Increment(&nextWorkerIndex) % workers.Length
        workers.[index].Post(Process data)
    // Process dataset with zero-copy where possible
    dataset
    |> Seq.iter (fun data ->
        use buffer = ZeroCopyBuffer.fromArray data
        dispatch buffer
    )
This high-level API allows developers to express concurrent programs using familiar Clef patterns while the system handles the complexity of optimal process placement and efficient communication.
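The round-robin dispatch step depends on a shared counter that is left undefined in the example. One way to sketch it in standard F#, assuming a hypothetical `RoundRobin` helper that is not part of Olivier, is to keep the counter in a class field and normalize the result so the index stays valid even if the counter eventually wraps to a negative value:

```fsharp
open System.Threading

// Sketch only: RoundRobin is an illustrative helper, not part of Olivier.
type RoundRobin(workerCount: int) =
    let mutable next = -1

    // Thread-safe: each caller gets a distinct counter value.
    member this.NextIndex() : int =
        let n = Interlocked.Increment(&next)
        // Normalize so the index stays in range even after the counter wraps negative.
        ((n % workerCount) + workerCount) % workerCount

let rr = RoundRobin(3)
let picks = List.init 7 (fun _ -> rr.NextIndex())
```

Starting the counter at -1 makes the first pick index 0, and the sequence then cycles through the workers in order.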
Fidelity and Next-Generation Memory Architectures
The integration of Fidelity and our innovative BAREWire technology with CXL, NUMA, and PCIe optimizations represents a powerful approach to heterogeneous computing. By combining BAREWire’s zero-copy architecture with the hardware capabilities of CXL and Resizable BAR, Fidelity can deliver:
- True Zero-Copy Operations: Direct memory access across CPU and accelerators without transfers
- Optimal Memory Placement: Intelligent allocation across NUMA nodes including CXL memory
- Adaptive Memory Management: Graceful degradation when advanced hardware features aren’t available
- Type-Safe Memory Access: Units of measure ensuring memory safety without runtime overhead
- Platform-Specific Optimization: Functional composition driving memory strategies based on hardware capabilities
The contrast with existing toolchains is stark. C++ provides raw access to CXL through vendor libraries, but the type system offers no help in tracking which pointers refer to which memory pools, which coherence domains apply, or which access patterns are valid. The burden falls entirely on the programmer to maintain mental models of memory topology that grow increasingly complex as CXL deployments scale. Rust improves memory safety within a single coherent address space, but its ownership model predates multi-pool architectures; the borrow checker verifies lifetimes but cannot verify residency.
Fidelity’s approach anticipates the memory architectures that CXL makes possible. The actor model maps naturally to coherence domains; each actor owns its memory region and communicates through capabilities that encode both ownership and residency. The PSG captures semantic information about data flow that enables automatic optimization of memory placement. Units of measure distinguish pool types at the type level. When the hardware provides multiple memory pools with different latency and bandwidth characteristics, Fidelity is prepared to leverage them safely; we anticipate this will give our users a significant advantage over projects constrained by toolchains designed for simpler memory models.
For application developers, these capabilities will eventually be exposed through high-level, Clef-idiomatic libraries that maintain the language’s functional programming paradigm while leveraging advanced hardware features:
- Tensor Computing Library: For high-performance numerical operations
- GPU Acceleration Library: For transparent hardware acceleration
- Resource Management Library: For efficient pooling and sharing of resources
- Actor System Library: For distributed, fault-tolerant concurrency
- ML Framework: For deep learning with automatic hardware optimization
These libraries and others like them will allow developers to express computations in natural Clef style without worrying about the underlying hardware details, while still benefiting from the performance advantages of advanced memory technologies like CXL.
These capabilities make Fidelity uniquely suited for the next generation of heterogeneous computing, where the boundaries between different memory spaces are increasingly blurred by technologies like CXL. The pre-optimization approach of BAREWire aligns perfectly with the hardware coherency provided by CXL, creating a powerful foundation for high-performance native code across the entire computing spectrum.
The systems programming community has long accepted that advanced memory architectures require advanced programming discipline. CXL tutorials warn developers to “carefully track which pointers point where” and “always verify coherence domain compatibility before access.” This guidance amounts to admitting that the toolchain cannot help. Fidelity rejects this premise. If the hardware provides multiple memory pools with different characteristics, the type system should encode those characteristics. If coherence domains constrain valid access patterns, the compiler should verify them. The cognitive burden that C++ and Rust place on developers working with CXL is not inherent to the problem; it reflects limitations in those languages’ type systems that Fidelity does not share.
The underlying technology, built on our “System and Method for Zero-Copy Inter-Process Communication Using BARE Protocol” (US 63/786,247), creates new possibilities for AI systems that can efficiently distribute computation across heterogeneous hardware while minimizing the overhead traditionally associated with data movement. This software innovation from SpeakEZ AI represents a pivotal advancement in the field of distributed AI model training and heterogeneous computing.