Benchmarks

Performance measurements taken on Apple M-series hardware with Python 3.13 and PyArrow 19.x.

Serialization Performance

Time to serialize/deserialize a single RPC request or response payload via Arrow IPC.

| Payload Type | Serialize | Deserialize |
|---|---|---|
| Primitive (single float64) | 0.8 μs | 0.6 μs |
| String parameter | 1.2 μs | 0.9 μs |
| List[float] (100 elements) | 3.5 μs | 2.8 μs |
| Dict[str, int] (50 keys) | 8.2 μs | 6.1 μs |
| Nested dataclass | 12 μs | 9.5 μs |
| Complex (dataclass + lists + enums) | 18 μs | 14 μs |
| 1K-row batch (3 columns) | 25 μs | 12 μs |
| 100K-row batch (3 columns) | 1.8 ms | 0.9 ms |
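
For context on how figures like these are typically gathered, here is a minimal timing sketch using PyArrow's IPC stream API. The 3-column layout is an assumed stand-in for the benchmark's actual 1K-row payload, and the loop mirrors the warm-up and iteration counts described under Methodology:

```python
# Timing sketch for the batch rows above, using PyArrow's IPC stream API.
# The 3-column layout is an assumed stand-in for the benchmark's payload.
import time

import pyarrow as pa
import pyarrow.ipc as ipc

batch = pa.RecordBatch.from_pydict({
    "id": pa.array(range(1_000), type=pa.int64()),
    "value": pa.array([float(i) for i in range(1_000)], type=pa.float64()),
    "label": pa.array([str(i) for i in range(1_000)], type=pa.string()),
})

def serialize(b: pa.RecordBatch) -> bytes:
    sink = pa.BufferOutputStream()
    with ipc.new_stream(sink, b.schema) as writer:
        writer.write_batch(b)
    return sink.getvalue().to_pybytes()

def deserialize(buf: bytes) -> pa.RecordBatch:
    return ipc.open_stream(pa.BufferReader(buf)).read_next_batch()

for _ in range(1_000):                       # warm-up (see Methodology)
    deserialize(serialize(batch))

N = 10_000
start = time.perf_counter()
for _ in range(N):
    serialize(batch)
print(f"serialize: {(time.perf_counter() - start) / N * 1e6:.1f} μs/batch")
```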

End-to-End Unary Latency

Full round-trip time for a simple unary call (add(a=1.0, b=2.0) → float), including serialization, transport, dispatch, and deserialization.

| Transport | P50 | P99 | Throughput |
|---|---|---|---|
| Pipe (in-process) | 5.2 μs | 8.1 μs | ~190K calls/s |
| Shared Memory (100 MB) | 5.8 μs | 9.2 μs | ~170K calls/s |
| Unix Socket | 33 μs | 52 μs | ~30K calls/s |
| Subprocess | 52 μs | 85 μs | ~19K calls/s |
| HTTP (in-process WSGI) | 520 μs | 820 μs | ~1.9K calls/s |
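
A P50/P99 harness of the kind behind this table can be built from the standard library alone. The `client` object and its `add` method below are hypothetical stand-ins for whichever transport is being measured:

```python
# Hypothetical latency harness: times a unary call and reports P50/P99
# in microseconds. `client.add` stands in for the framework under test.
import statistics
import time

def measure(client, warmup: int = 1_000, iters: int = 10_000):
    for _ in range(warmup):                  # warm schema cache, buffers, etc.
        client.add(a=1.0, b=2.0)

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        client.add(a=1.0, b=2.0)
        samples.append(time.perf_counter() - start)

    cuts = statistics.quantiles(samples, n=100)   # 99 cut points
    return cuts[49] * 1e6, cuts[98] * 1e6         # P50, P99 in μs
```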

Streaming Throughput

Sustained streaming performance for producer and exchange patterns.

| Pattern | Throughput | Latency |
|---|---|---|
| Producer (pipe, 1K-row batches) | 48,000 batches/s | 21 μs/batch |
| Producer (subprocess) | 12,000 batches/s | 83 μs/batch |
| Exchange (pipe) | 32,000 round-trips/s | 31 μs/round-trip |
| Exchange (subprocess) | 8,500 round-trips/s | 118 μs/round-trip |
| SHM Producer (100 MB segment) | 29 GB/s | ~3.4 μs/batch |

The shared memory transport achieves 29 GB/s by writing Arrow IPC batches directly to a memory-mapped segment and sending only a pointer (offset + length) over the pipe.
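
In outline, that scheme might look like the sketch below. The helper names are hypothetical, and a real transport additionally has to manage offset allocation, reclamation, and reader attachment:

```python
# Illustrative pointer-passing over shared memory: Arrow IPC bytes land in
# a pre-allocated segment, and only (offset, length) crosses the pipe.
# Helper names are hypothetical; offset management is elided.
from multiprocessing import shared_memory

import pyarrow as pa
import pyarrow.ipc as ipc

shm = shared_memory.SharedMemory(create=True, size=100 * 1024 * 1024)

def publish(batch: pa.RecordBatch, offset: int) -> tuple[int, int]:
    """Writer side: serialize into the segment, return the pointer to send."""
    sink = pa.BufferOutputStream()
    with ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    data = sink.getvalue().to_pybytes()
    shm.buf[offset : offset + len(data)] = data
    return offset, len(data)

def consume(offset: int, length: int) -> pa.RecordBatch:
    """Reader side: rebuild the batch from the pointer received."""
    data = bytes(shm.buf[offset : offset + length])
    return ipc.open_stream(pa.BufferReader(data)).read_next_batch()
```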

Cross-Language Performance

Performance of cross-language RPC over the subprocess transport, with a Python client calling servers implemented in different languages.

| Scenario | Unary Latency | Producer Throughput |
|---|---|---|
| Python client → Go server (subprocess) | ~120 μs | ~15K batches/s |
| Python client → TypeScript server (subprocess) | ~140 μs | ~12K batches/s |
| Python client → Python server (subprocess) | ~52 μs | ~12K batches/s |
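
The subprocess transport pattern behind these numbers can be sketched roughly as follows. The server binary name and the 4-byte length-prefix framing are assumptions for illustration, not the library's actual wire protocol:

```python
# Illustrative subprocess transport: spawn a server binary and exchange
# length-prefixed payloads over stdin/stdout. The binary name and the
# 4-byte framing are assumptions, not the library's actual protocol.
import struct
import subprocess

proc = subprocess.Popen(
    ["./server"],                            # hypothetical server binary
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)

def call(payload: bytes) -> bytes:
    """Send one Arrow IPC request frame, read back one response frame."""
    proc.stdin.write(struct.pack("<I", len(payload)) + payload)
    proc.stdin.flush()
    (length,) = struct.unpack("<I", proc.stdout.read(4))
    return proc.stdout.read(length)
```

Spawn cost is paid once per process, which is why the Methodology below notes it is amortized in throughput numbers.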

Methodology

  • All benchmarks run with 1,000+ warm-up iterations followed by 10,000 measured iterations
  • P50 and P99 are wall-clock time including all framework overhead
  • Schema generation is cached (first-call overhead ~50 μs; subsequent calls use the cached schema); see the sketch after this list
  • Shared memory benchmarks use pre-allocated 100 MB segments
  • HTTP benchmarks use in-process WSGI (no network round-trip)
  • Cross-language benchmarks include subprocess spawn overhead in first-call measurements (amortized in throughput numbers)
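
The schema cache mentioned above can be approximated with functools.lru_cache; whether the framework uses exactly this mechanism, and the flat-dataclass mapping shown, are assumptions:

```python
# Sketch of per-type schema caching: the first call pays inference cost
# (~50 μs per the note above); later calls hit the cache. Whether the
# framework uses lru_cache specifically is an assumption.
import functools
from dataclasses import dataclass, fields, is_dataclass

import pyarrow as pa

_PRIMITIVES = {float: pa.float64(), int: pa.int64(), str: pa.string()}

@functools.lru_cache(maxsize=None)
def schema_for(cls: type) -> pa.Schema:
    assert is_dataclass(cls), "sketch handles flat dataclasses only"
    return pa.schema([(f.name, _PRIMITIVES[f.type]) for f in fields(cls)])

@dataclass(frozen=True)
class AddRequest:                            # hypothetical request type
    a: float
    b: float

schema_for(AddRequest)                       # first call: builds and caches
schema_for(AddRequest)                       # subsequent calls: cache hit
```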