Benchmarks

Performance measurements taken on Apple M-series hardware with Python 3.13 and PyArrow 19.x.

Serialization Performance

Time to serialize/deserialize a single RPC request or response payload via Arrow IPC.

| Payload Type | Serialize | Deserialize |
|---|---|---|
| Primitive (single float64) | 0.8 μs | 0.6 μs |
| String parameter | 1.2 μs | 0.9 μs |
| List[float] (100 elements) | 3.5 μs | 2.8 μs |
| Dict[str, int] (50 keys) | 8.2 μs | 6.1 μs |
| Nested dataclass | 12 μs | 9.5 μs |
| Complex (dataclass + lists + enums) | 18 μs | 14 μs |
| 1K-row batch (3 columns) | 25 μs | 12 μs |
| 100K-row batch (3 columns) | 1.8 ms | 0.9 ms |
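
For context on how figures like these are typically gathered, here is a minimal timing sketch using PyArrow's IPC stream API. The 3-column layout is an assumed stand-in for the benchmark's actual 1K-row payload, and the loop mirrors the warm-up and iteration counts described under Methodology:

```python
# Timing sketch for the batch rows above, using PyArrow's IPC stream API.
# The 3-column layout is an assumed stand-in for the benchmark's payload.
import time

import pyarrow as pa
import pyarrow.ipc as ipc

batch = pa.RecordBatch.from_pydict({
    "id": pa.array(range(1_000), type=pa.int64()),
    "value": pa.array([float(i) for i in range(1_000)], type=pa.float64()),
    "label": pa.array([str(i) for i in range(1_000)], type=pa.string()),
})

def serialize(b: pa.RecordBatch) -> bytes:
    sink = pa.BufferOutputStream()
    with ipc.new_stream(sink, b.schema) as writer:
        writer.write_batch(b)
    return sink.getvalue().to_pybytes()

def deserialize(buf: bytes) -> pa.RecordBatch:
    return ipc.open_stream(pa.BufferReader(buf)).read_next_batch()

for _ in range(1_000):                       # warm-up (see Methodology)
    deserialize(serialize(batch))

N = 10_000
start = time.perf_counter()
for _ in range(N):
    serialize(batch)
print(f"serialize: {(time.perf_counter() - start) / N * 1e6:.1f} μs/batch")
```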

End-to-End Unary Latency

Full round-trip time for a simple unary call (add(a=1.0, b=2.0) → float), including serialization, transport, dispatch, and deserialization.

| Transport | P50 | P99 | Throughput |
|---|---|---|---|
| Pipe (in-process) | 5.2 μs | 8.1 μs | ~190K calls/s |
| Shared Memory (100 MB) | 5.8 μs | 9.2 μs | ~170K calls/s |
| Unix Socket | 33 μs | 52 μs | ~30K calls/s |
| Subprocess | 52 μs | 85 μs | ~19K calls/s |
| HTTP (in-process WSGI) | 520 μs | 820 μs | ~1.9K calls/s |
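
A P50/P99 harness of the kind behind this table can be built from the standard library alone. The `client` object and its `add` method below are hypothetical stand-ins for whichever transport is being measured:

```python
# Hypothetical latency harness: times a unary call and reports P50/P99
# in microseconds. `client.add` stands in for the framework under test.
import statistics
import time

def measure(client, warmup: int = 1_000, iters: int = 10_000):
    for _ in range(warmup):                  # warm schema cache, buffers, etc.
        client.add(a=1.0, b=2.0)

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        client.add(a=1.0, b=2.0)
        samples.append(time.perf_counter() - start)

    cuts = statistics.quantiles(samples, n=100)   # 99 cut points
    return cuts[49] * 1e6, cuts[98] * 1e6         # P50, P99 in μs
```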

Streaming Throughput

Sustained streaming performance for producer and exchange patterns.

| Pattern | Throughput | Latency |
|---|---|---|
| Producer (pipe, 1K-row batches) | 48,000 batches/s | 21 μs/batch |
| Producer (subprocess) | 12,000 batches/s | 83 μs/batch |
| Exchange (pipe) | 32,000 round-trips/s | 31 μs/round-trip |
| Exchange (subprocess) | 8,500 round-trips/s | 118 μs/round-trip |
| SHM Producer (100 MB segment) | 29 GB/s | ~3.4 μs/batch |

The shared memory transport achieves 29 GB/s by writing Arrow IPC batches directly to a memory-mapped segment and sending only a pointer (offset + length) over the pipe.
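
In outline, that scheme might look like the sketch below. The helper names are hypothetical, and a real transport additionally has to manage offset allocation, reclamation, and reader attachment:

```python
# Illustrative pointer-passing over shared memory: Arrow IPC bytes land in
# a pre-allocated segment, and only (offset, length) crosses the pipe.
# Helper names are hypothetical; offset management is elided.
from multiprocessing import shared_memory

import pyarrow as pa
import pyarrow.ipc as ipc

shm = shared_memory.SharedMemory(create=True, size=100 * 1024 * 1024)

def publish(batch: pa.RecordBatch, offset: int) -> tuple[int, int]:
    """Writer side: serialize into the segment, return the pointer to send."""
    sink = pa.BufferOutputStream()
    with ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    data = sink.getvalue().to_pybytes()
    shm.buf[offset : offset + len(data)] = data
    return offset, len(data)

def consume(offset: int, length: int) -> pa.RecordBatch:
    """Reader side: rebuild the batch from the pointer received."""
    data = bytes(shm.buf[offset : offset + length])
    return ipc.open_stream(pa.BufferReader(data)).read_next_batch()
```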

Cross-Language Performance

Performance of cross-language RPC over the subprocess transport, with a Python client calling servers implemented in different languages.

| Scenario | Unary Latency | Producer Throughput |
|---|---|---|
| Python client → Go server (subprocess) | ~120 μs | ~15K batches/s |
| Python client → TypeScript server (subprocess) | ~140 μs | ~12K batches/s |
| Python client → Python server (subprocess) | ~52 μs | ~12K batches/s |
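
The subprocess transport pattern behind these numbers can be sketched roughly as follows. The server binary name and the 4-byte length-prefix framing are assumptions for illustration, not the library's actual wire protocol:

```python
# Illustrative subprocess transport: spawn a server binary and exchange
# length-prefixed payloads over stdin/stdout. The binary name and the
# 4-byte framing are assumptions, not the library's actual protocol.
import struct
import subprocess

proc = subprocess.Popen(
    ["./server"],                            # hypothetical server binary
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)

def call(payload: bytes) -> bytes:
    """Send one Arrow IPC request frame, read back one response frame."""
    proc.stdin.write(struct.pack("<I", len(payload)) + payload)
    proc.stdin.flush()
    (length,) = struct.unpack("<I", proc.stdout.read(4))
    return proc.stdout.read(length)
```

Spawn cost is paid once per process, which is why the Methodology below notes it is amortized in throughput numbers.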

Methodology

  • All benchmarks run with 1,000+ warm-up iterations followed by 10,000 measured iterations
  • P50 and P99 are wall-clock time including all framework overhead
  • Schema generation is cached (first-call overhead ~50 μs; subsequent calls use the cached schema); see the sketch after this list
  • Shared memory benchmarks use pre-allocated 100 MB segments
  • HTTP benchmarks use in-process WSGI (no network round-trip)
  • Cross-language benchmarks include subprocess spawn overhead in first-call measurements (amortized in throughput numbers)
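
The schema cache mentioned above can be approximated with functools.lru_cache; whether the framework uses exactly this mechanism, and the flat-dataclass mapping shown, are assumptions:

```python
# Sketch of per-type schema caching: the first call pays inference cost
# (~50 μs per the note above); later calls hit the cache. Whether the
# framework uses lru_cache specifically is an assumption.
import functools
from dataclasses import dataclass, fields, is_dataclass

import pyarrow as pa

_PRIMITIVES = {float: pa.float64(), int: pa.int64(), str: pa.string()}

@functools.lru_cache(maxsize=None)
def schema_for(cls: type) -> pa.Schema:
    assert is_dataclass(cls), "sketch handles flat dataclasses only"
    return pa.schema([(f.name, _PRIMITIVES[f.type]) for f in fields(cls)])

@dataclass(frozen=True)
class AddRequest:                            # hypothetical request type
    a: float
    b: float

schema_for(AddRequest)                       # first call: builds and caches
schema_for(AddRequest)                       # subsequent calls: cache hit
```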