WebAssembly vs JavaScript for parsing big JSON files

Fast JSON Viewer ships the same parser twice: once in JavaScript and once in WebAssembly compiled from C with SIMD. Here is why we keep both, how the WASM one is built and shipped, and a 100 MB benchmark on this machine where it runs 2.1 times faster.

webassembly simd json simdjson performance

What we are trying to do

Fast JSON Viewer is a web page that opens enormous JSON files. You drag a file onto it, and it shows you the pretty-printed text with collapsible objects, search, and Vim-style navigation. The target is files that no editor will open: hundreds of megabytes, and on the high end past 10 GB. None of it touches a server. Everything runs in your browser, the page is static, and the bytes never leave your machine.

To show a file we first have to read every byte of it. Not to build a JavaScript object: we never do that, because a 1 GB file would not fit as a live object tree. What we need is cheaper and more specific: confirm the JSON is well formed, count how many display lines it will render to, remember byte offsets so the virtual scroller can jump straight to line nine million, and record where every object and array opens and closes so the collapse chevrons know what they fold. We call that pass over the bytes validate and index. It is the one piece of work that scales with file size, so it is the piece worth making fast.

The file is split into chunks, each chunk handed to a Web Worker, and the workers run the validate-and-index pass in parallel while the main thread stays free to paint. The hot loop inside each worker walks bytes one at a time through a small state machine. That loop is where the time goes, and it is the reason this page has two parsers.
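
As a rough illustration of the splitting step, here is a minimal sketch that turns a file size into byte ranges. The helper name chunkRanges is hypothetical, and the 256 KB default simply borrows the block size mentioned later in this article; the viewer's real scheduling code is not shown here and may differ.

```javascript
// Hypothetical helper: split a file of `size` bytes into [start, end) ranges.
// The 256 KB default is an assumption taken from the block size quoted later.
function chunkRanges(size, chunkSize = 256 * 1024) {
  const ranges = [];
  for (let start = 0; start < size; start += chunkSize) {
    // The last range is clamped so it never runs past the end of the file.
    ranges.push({ start, end: Math.min(start + chunkSize, size) });
  }
  return ranges;
}

const ranges = chunkRanges(1000000);
console.log(ranges.length, ranges[ranges.length - 1]);
```

In a browser, each range would typically become a file.slice(start, end) that is posted to a worker, so no worker ever needs the whole file in memory.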

Why ship two engines

The byte loop started life in plain JavaScript. It is the reference implementation: easy to read, easy to test, and correct by construction because every behavior is spelled out in one file. We never threw it away. It is still the parser that runs when anything about the faster path looks wrong, and it is still the thing the test suite checks against.

The second engine is the same logic rewritten in C and compiled to WebAssembly. It is the default in the settings menu, and it does the heavy lifting on real files. The two are kept in lockstep: the same scenarios, the same byte classification tables, the same error codes. If the WASM module fails to load, fails an ABI check, or hits any instantiation problem, the worker quietly falls back to the JavaScript loop and the page keeps working. You get the speed when the platform cooperates and correctness when it does not.

A subtle trap we hit early: when WASM silently fell back to JS, our benchmark happily reported the JS numbers under the WASM column. The fix was to make the fallback loud in the bench harness, so a fallback fails the run instead of masquerading as a fast result. If you ship two engines, make sure your measurements can tell which one actually ran.
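
One way to make that distinction hard to miss is to have the loader always report which engine it returned, and to give the bench harness a strict mode that refuses to fall back. The sketch below is an assumption about how that could look, not the project's actual harness; loadEngine, loadWasm, loadJs, and strict are all hypothetical names.

```javascript
// Hypothetical loader: tries the fast engine, falls back to the reference one,
// and always labels the result with the engine that actually loaded.
function loadEngine({ loadWasm, loadJs, strict = false }) {
  try {
    return { name: "wasm", engine: loadWasm() };
  } catch (err) {
    if (strict) {
      // Bench mode: a fallback is a failed run, not a slow result.
      throw new Error(`WASM engine unavailable: ${err.message}`);
    }
    return { name: "js", engine: loadJs(), fallbackReason: err.message };
  }
}

// Stand-ins for the two engines, for illustration only.
const brokenWasm = () => { throw new Error("ABI mismatch"); };
const jsEngine = () => ({ validate: () => "ok" });

const page = loadEngine({ loadWasm: brokenWasm, loadJs: jsEngine });
console.log(page.name, page.fallbackReason);
```

The page would call the loader in its default mode and keep working; the benchmark would pass strict: true, so a run that could not load the WASM engine stops with an error instead of filling in the WASM column.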

What WebAssembly actually is

WebAssembly, or WASM, is a compact binary instruction format that every modern browser can run. Its behavior is pinned down by the W3C WebAssembly Core Specification, which is part of why it runs the same way across engines. You write code in a language like C, Rust, or Zig, compile it to a .wasm file, and the browser hands that file to a fast compiler of its own that turns it into real machine code for the CPU it is running on. The result runs close to native speed, much closer than JavaScript can usually get, because the format was designed to be easy for the engine to compile and because it has no garbage collector pausing in the middle of a tight loop.

A WASM module is deliberately small and sealed off. It has no DOM, no network, no access to anything you do not explicitly pass in. Its whole world is one flat array of bytes called linear memory. That suits us well. The byte loop does not need the DOM or the network; it needs a block of bytes to scan and a place to write the counts it finds. A worker thread with a sandboxed number cruncher inside it is exactly the shape of the problem.
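
To make "one flat array of bytes" concrete, here is a small sketch using the standard WebAssembly.Memory API from the JavaScript side. The offset INPUT_OFFSET is made up for illustration, and whether the real module imports or exports its memory is not stated here, so treat the layout as an assumption.

```javascript
// One page of linear memory is 64 KiB. A module sees exactly these bytes.
const memory = new WebAssembly.Memory({ initial: 1 });
const heap = new Uint8Array(memory.buffer);

// Copy a chunk of JSON text in at a hypothetical offset.
const input = new TextEncoder().encode('{"ok":true}');
const INPUT_OFFSET = 1024;
heap.set(input, INPUT_OFFSET);

// A real module would now be called with (INPUT_OFFSET, input.length) and
// would write its counts somewhere else in the same buffer.
console.log(memory.buffer.byteLength, heap[INPUT_OFFSET]);
```

Pointers on the C side are just offsets into this buffer, which is why passing a chunk to the parser amounts to copying bytes in and handing over two integers.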

Two properties matter for our use. First, predictable performance: no JIT warmup games, no deoptimization when a hidden class changes, the same code path every time. Second, SIMD, which we will come back to, because it is the single biggest reason the WASM version pulls ahead.

How we wrote it

The parser lives in one C file, wasm/chunk-parser.c, about 1,600 lines. It is a faithful port of the JavaScript byte loop, including all 25 entry scenarios the JS parser tracks for chunk boundaries (more on why there are 25 further down). It is freestanding C, which means no standard library at all:

// No stdlib. Linear memory is the only address space.
// A single bump allocator owns all analyzer-scoped data; new_analyzer()
// resets it so long-running workers never leak.
#include <stdint.h>
#include <stddef.h>
#include <wasm_simd128.h>

There is no malloc. Memory is handed out by a bump allocator that walks a pointer forward through linear memory, and resetting it is just moving the pointer back to the start. A worker can validate ten thousand chunks in a row without leaking a byte, because each new chunk resets the arena. Byte classification, deciding whether a character is whitespace, a digit, a hex digit, a string delimiter, is done with a 256-entry lookup table of bit flags, the same table the JS version uses, so the two stay byte-for-byte identical in their decisions.
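
The bump-allocator idea is small enough to sketch in a few lines. The version below is written in JavaScript over a typed array purely for illustration; the class name BumpArena and the 8-byte default alignment are assumptions, and the C implementation in the module may be organized differently.

```javascript
// Hypothetical bump allocator over a fixed arena, mirroring the idea in JS.
class BumpArena {
  constructor(byteLength) {
    this.bytes = new Uint8Array(byteLength);
    this.top = 0; // next free offset
  }

  // `align` must be a power of two.
  alloc(size, align = 8) {
    const start = (this.top + align - 1) & ~(align - 1); // round up to alignment
    if (start + size > this.bytes.length) throw new RangeError("arena exhausted");
    this.top = start + size;
    return start; // an offset into the arena, like a pointer into linear memory
  }

  // "Free" every allocation at once by moving the pointer back to the start.
  reset() {
    this.top = 0;
  }
}

const arena = new BumpArena(1024);
const first = arena.alloc(10);  // offset 0
const second = arena.alloc(4);  // offset 16, after rounding 10 up to a multiple of 8
arena.reset();
const reused = arena.alloc(4);  // offset 0 again
console.log(first, second, reused);
```

There is no per-allocation free and no bookkeeping to corrupt, which is what makes the "reset per chunk" guarantee easy to reason about.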

How we compile it

We compile straight with LLVM's clang, which has a built-in wasm32 backend, and link with wasm-ld. No Emscripten runtime, no wasm-bindgen, no npm toolchain in the hot path. The build script is one file and the flags are short:

# -msimd128        turn on the 128-bit SIMD instruction set
# -nostdlib        freestanding: no libc
# -O3 -flto        optimize hard, link-time optimization
# -Wl,--no-entry   a library, not a program with main()
clang \
  --target=wasm32 \
  -msimd128 \
  -nostdlib \
  -fno-builtin \
  -O3 -flto \
  -Wl,--no-entry \
  -Wl,--export-dynamic \
  -Wl,--strip-all \
  -o src/wasm/chunk-parser.wasm  wasm/chunk-parser.c

The whole module comes out to about 17.5 KB. That is the entire parser, SIMD and all, in less space than a small image. One wrinkle worth recording: even with -fno-builtin, clang will lower a large struct copy to a memcpy call, and under -nostdlib there is no memcpy to call, so the module fails to instantiate and you fall back to JS without noticing. We ended up providing tiny freestanding memcpy and memset in the C file so the linker has something to bind to.

How we ship it

The compiled .wasm is committed to the repository. That is on purpose: anyone cloning the project, and the static host serving it, never needs clang installed. The browser fetches the file and instantiates it once per worker, then checks that it got the module it expected before letting it anywhere near a byte loop:

const bytes = await fetchWasmBytes();             // fetch() in the browser, fs in node
const { instance } = await WebAssembly.instantiate(bytes, {});
const exports = instance.exports;

// Refuse a mismatched or truncated module up front, not mid-loop.
if ((exports.parser_abi_version() >>> 0) !== EXPECTED_ABI) {
  throw new Error('WASM ABI mismatch - rebuild with `npm run build:wasm`');
}
const probe = 0x01020304;
if ((exports.ping(probe) >>> 0) !== ((probe ^ 0xA5A5A5A5) >>> 0)) {
  throw new Error('WASM ping sanity check failed');
}

The ABI version and the ping round-trip are cheap insurance. If the layout the C side writes and the layout the JS loader reads ever drift apart, we want to find out at load time, not from a corrupted line count three million rows into a file. Any throw here is caught one level up and turns into the JavaScript fallback.

The benchmark

Talk is cheap, so here is the run. I took a 100 MB JSON file (you can make your own with the built-in JSON test-file generator, which writes a synthetic file of any size straight to disk) and ran both engines through the full chunk pipeline, 10 iterations each, reporting the fastest iteration. The machine is an 8-core box with about 8 GB of RAM. The file was split across 4 workers.

$ node scripts/runbench.mjs scratch/pass-perf-100mb.json --iters 10

Engine	Throughput	Wall time	Relative
JavaScript pipeline	143 MB/s	670 ms	1.0×
WASM + SIMD128 pipeline	305 MB/s	314 ms	2.1×

Same file, same pipeline, same machine; the only difference is which byte loop the workers call. WebAssembly with SIMD validates and indexes the file in less than half the wall time. For this file, which comes to 95.9 MB, that is the difference between 670 ms and 314 ms, and the gap widens on bigger files, where the per-byte loop dominates everything else.
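
The table's columns can be checked against each other with a few lines of arithmetic. This sketch uses only the figures quoted above (95.9 MB, 670 ms, 314 ms); the variable names are illustrative.

```javascript
const FILE_MB = 95.9;                      // file size as quoted above
const wallMs = { js: 670, wasm: 314 };     // fastest iteration per engine

const throughput = (ms) => FILE_MB / (ms / 1000); // MB per second

const jsRate = Math.round(throughput(wallMs.js));     // 143
const wasmRate = Math.round(throughput(wallMs.wasm)); // 305
const speedup = (wallMs.js / wallMs.wasm).toFixed(1); // "2.1"

console.log(`${jsRate} MB/s vs ${wasmRate} MB/s, ${speedup}x`);
```

Both throughput figures follow from the 95.9 MB size rather than a round 100 MB, which is worth knowing if you rerun the benchmark on a file of your own.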

One thing this table is not: it is not a comparison against JSON.parse. The two measure different jobs. JSON.parse builds a full JavaScript object in memory, which we deliberately never do, so comparing our throughput to it would be comparing two different amounts of work. Our number is the cost of validating the bytes and recording the metadata the viewer needs, nothing more.

What SIMD is, and what we took from simdjson

SIMD stands for Single Instruction, Multiple Data. A normal CPU instruction adds one number to one number. A SIMD instruction does the same operation on a whole vector of values at once. WebAssembly's fixed-width SIMD, the 128-bit vector instructions folded into the core specification in 2.0, gives us registers that hold sixteen bytes, so one comparison can ask "which of these sixteen bytes is a double quote" and answer all sixteen lanes in a single instruction. For a parser that spends its life asking yes-or-no questions about bytes, that is a sixteen-to-one head start.

The idea of pushing SIMD this hard at JSON is not ours. It comes from simdjson, the parser by Daniel Lemire and Geoff Langdale that broke the gigabyte-per-second barrier on commodity hardware. If you want the full story, Lemire's talk on simdjson is the best hour you can spend on it, and his blog is a long-running source on this kind of work. The lesson we took was not their exact algorithm but the mindset behind it: stop branching on one byte at a time. Load a wide vector, classify every lane in parallel, collapse the result to a bitmask, and use a single count-trailing-zeros instruction to find the first interesting byte. Branches are where a scalar parser stalls; SIMD lets you replace a run of branches with one arithmetic comparison.

Here is the hottest loop in our parser, the scan that races through the inside of a string until it hits a quote, a backslash, or a control character. It checks sixteen bytes per step:

// Advance past bytes that are not quote / backslash / control.
// String content dominates real-world JSON, so this loop matters most.
static inline uint32_t scan_string_simd(const uint8_t *bytes, uint32_t index, uint32_t len) {
  const v128_t quote  = wasm_i8x16_splat(0x22);  // sixteen copies of '"'
  const v128_t bslash = wasm_i8x16_splat(0x5c);  // sixteen copies of '\'
  const v128_t space  = wasm_i8x16_splat(0x20);
  while (index + 16 <= len) {
    const v128_t v = wasm_v128_load(bytes + index);          // load 16 bytes
    const v128_t hit = wasm_v128_or(
      wasm_v128_or(wasm_i8x16_eq(v, quote), wasm_i8x16_eq(v, bslash)),
      wasm_u8x16_lt(v, space));                               // control char = byte < 0x20
    const uint32_t mask = wasm_i8x16_bitmask(hit);            // 16 lanes -> 16-bit mask
    if (mask) return index + (uint32_t)__builtin_ctz(mask);   // first hit, in one op
    index += 16;
  }
  // scalar tail: a 0x00 sentinel past the end breaks the loop
  while ((CHAR_FLAGS[bytes[index]] & (F_STRING_SPECIAL | F_CONTROL)) == 0) index++;
  return index;
}
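
If the intrinsics are hard to read, the classify, bitmask, and count-trailing-zeros steps can be emulated lane by lane in JavaScript. This is a scalar stand-in for illustration only: it does sixteen comparisons in a loop where the WASM version does them in one instruction, and the function name firstSpecialIn16 is hypothetical.

```javascript
// Scalar emulation of one 16-lane step: classify every byte in the block,
// pack one bit per lane into a mask, then find the lowest set bit.
// The caller is expected to pass an index with at least 16 bytes after it.
function firstSpecialIn16(bytes, index) {
  let mask = 0;
  for (let lane = 0; lane < 16; lane++) {
    const b = bytes[index + lane];
    // Same three tests as the C loop: quote, backslash, control character.
    if (b === 0x22 || b === 0x5c || b < 0x20) mask |= 1 << lane;
  }
  if (mask === 0) return -1; // nothing interesting in these 16 bytes
  // Count trailing zeros: isolate the lowest set bit, then locate it.
  return index + (31 - Math.clz32(mask & -mask));
}

const sample = new TextEncoder().encode('abcdefgh"ijklmnop');
console.log(firstSpecialIn16(sample, 0)); // 8, the position of the quote
```

The point of the mask is that lane order becomes bit order, so "first interesting byte" turns into "lowest set bit", which hardware answers without a branch per byte.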

The same pattern handles two other common runs: skipping long stretches of whitespace in indented files, and skipping the digit runs inside big numbers. One detail we learned the slow way is that SIMD is not free. Setting up the vectors costs a few instructions, so for a one or two byte gap between tokens it is pure overhead. We guard the wide scans with a short scalar probe and only drop into SIMD once a run proves long enough to pay for itself:

// Most gaps between tokens are 0-1 bytes, where SIMD setup is wasted.
// Probe scalar first; fall into the 16-wide scan only on a real run.
while ((CHAR_FLAGS[bytes[index]] & F_WHITESPACE) != 0) {
  index++;
  if (index - startIdx >= 8) { index = scan_whitespace_simd(bytes, index, len); break; }
}
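
The same probe-then-switch shape can be sketched in JavaScript, together with a cut-down version of the flag table. Only the whitespace flag is filled in, the flag value is made up, and the "wide" scan is an ordinary loop standing in for the SIMD routine, so this shows the control flow rather than the speedup. The stats counter is a hypothetical addition to make the switch observable.

```javascript
// Hypothetical flag bit; the real table carries several flags per byte.
const F_WHITESPACE = 1;
const CHAR_FLAGS = new Uint8Array(256);
for (const b of [0x20, 0x09, 0x0a, 0x0d]) CHAR_FLAGS[b] |= F_WHITESPACE; // JSON whitespace

const PROBE_LIMIT = 8; // same threshold as the C snippet above

// Stand-in for the wide scan: a plain loop here, SIMD in the WASM module.
function scanWhitespaceWide(bytes, index) {
  while (index < bytes.length && (CHAR_FLAGS[bytes[index]] & F_WHITESPACE)) index++;
  return index;
}

// Probe byte by byte; hand over to the wide scan only once a run is long.
function skipWhitespace(bytes, index, stats) {
  const start = index;
  while (index < bytes.length && (CHAR_FLAGS[bytes[index]] & F_WHITESPACE)) {
    index++;
    if (index - start >= PROBE_LIMIT) {
      stats.wideScans++;
      return scanWhitespaceWide(bytes, index);
    }
  }
  return index;
}

const doc = new TextEncoder().encode('{ "a":\n' + " ".repeat(20) + "1}");
const stats = { wideScans: 0 };
console.log(skipWhitespace(doc, 1, stats), skipWhitespace(doc, 6, stats), stats.wideScans);
```

The single space after the opening brace never leaves the probe, while the newline plus twenty spaces of indentation crosses the threshold and takes the wide path once.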

Where we part ways with simdjson

It would be fair to ask why we did not just embed simdjson and be done. The answer is that we are not solving the same problem. simdjson is a parser: its job is to tell you the document is valid and hand you its values. Our job is to make a 10 GB file navigable in a browser, and that needs a layer of bookkeeping a parser has no reason to produce.

As the byte loop runs, it builds metadata for the viewer to render:

Display-line counts and byte offsets, sampled per 256 KB block, so the virtual scroller can map a scrollbar position to a line and seek to its bytes without re-reading the file.
Open and close depth for every container, so the collapse chevrons know exactly which range of lines a fold hides.
Running tallies of strings, keys, numbers, and literals, the kind of summary the UI shows about a file.
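
To show how sampled offsets let a scroller seek without re-reading the file, here is a minimal sketch. It makes a loud simplification: a "line" is a raw newline in the input, whereas the viewer counts pretty-printed display lines. The function names, the sample shape, and the tiny block size are all hypothetical.

```javascript
// Build one sample per block: the line number at the block's first byte.
// Assumption: a "line" here is a raw newline, a stand-in for display lines.
function buildLineIndex(bytes, blockSize) {
  const samples = [];
  let line = 0;
  for (let offset = 0; offset < bytes.length; offset += blockSize) {
    samples.push({ offset, line });
    const end = Math.min(offset + blockSize, bytes.length);
    for (let i = offset; i < end; i++) if (bytes[i] === 0x0a) line++;
  }
  return { samples, totalLines: line + 1 };
}

// Find the byte offset where `targetLine` (0-based) starts: binary-search for
// the last block that begins before that line, then scan within that block.
function offsetOfLine(bytes, index, targetLine) {
  if (targetLine <= 0) return 0;
  let lo = 0;
  let hi = index.samples.length - 1;
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (index.samples[mid].line < targetLine) lo = mid;
    else hi = mid - 1;
  }
  let { offset, line } = index.samples[lo];
  while (line < targetLine && offset < bytes.length) {
    if (bytes[offset++] === 0x0a) line++;
  }
  return offset;
}

const text = new TextEncoder().encode("aa\nbbbb\nc\ndddddd\ne");
const lineIndex = buildLineIndex(text, 4); // tiny block size to exercise the search
console.log(lineIndex.totalLines, offsetOfLine(text, lineIndex, 3));
```

The index costs one small record per block regardless of how many lines the block holds, and a lookup touches at most one block of bytes after the binary search.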

And there is a structural complication a single-shot parser never faces. We do not feed the parser a whole document. We hand each worker a 256 KB block that can begin anywhere, including in the middle of a string, halfway through a number, or three characters into the word false. The parser cannot know which until the neighboring chunk is stitched in. So it runs every plausible interpretation of where the block began at once, all 25 of them: one out-of-string scenario, six ways of being mid-string, and eighteen ways of being mid-number or mid-keyword. A later validation step picks the scenario that actually matches the chunk before it and discards the rest. That is the real reason the C port is 1,600 lines instead of a few hundred: every one of those 25 scenarios had to be carried across, in lockstep with the JS reference, and every one of them had to keep its line counts and tallies correct so the surviving scenario hands the viewer the right numbers.
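
The scenario idea is easier to see at toy scale. The sketch below tracks only three entry states instead of 25 and counts only closed strings, so it is a simplified model of the technique and not the viewer's parser; every name in it is hypothetical.

```javascript
// Hypothetical entry states; the real parser tracks 25.
const ENTRY_STATES = ["outside", "inString", "inStringEscape"];

// Scan one chunk assuming it began in `entry`; report where it ended and
// how many strings closed inside it.
function scanChunk(bytes, entry) {
  let state = entry;
  let stringsClosed = 0;
  for (const b of bytes) {
    if (state === "inStringEscape") {
      state = "inString"; // the escaped byte is consumed, whatever it is
    } else if (state === "inString") {
      if (b === 0x5c) state = "inStringEscape";
      else if (b === 0x22) { state = "outside"; stringsClosed++; }
    } else if (b === 0x22) {
      state = "inString";
    }
  }
  return { exit: state, stringsClosed };
}

// Workers run every scenario, because they cannot know how the chunk began.
function scanAllScenarios(bytes) {
  const results = {};
  for (const entry of ENTRY_STATES) results[entry] = scanChunk(bytes, entry);
  return results;
}

// Stitching: walk the chunks in order, keep the scenario whose entry state
// matches the previous chunk's exit state, and discard the others.
function stitch(perChunk) {
  let state = "outside";
  let total = 0;
  for (const scenarios of perChunk) {
    const chosen = scenarios[state];
    total += chosen.stringsClosed;
    state = chosen.exit;
  }
  return { exit: state, stringsClosed: total };
}

const json = new TextEncoder().encode('{"a":"b\\"c","d":"e"}');
// The first cut lands right after a backslash, the awkward case.
const chunks = [json.subarray(0, 8), json.subarray(8, 16), json.subarray(16)];
console.log(stitch(chunks.map(scanAllScenarios)));
```

The stitched totals match a single pass over the whole input, even though the second chunk begins with a quote that only the mid-escape scenario interprets correctly.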

simdjson taught us how to make the inner loop fly. The outer shape (parallel chunks, boundary scenarios, viewer metadata) is the part that is specific to showing a file rather than parsing one, and it is the part we had to build ourselves.

The short version

Fast JSON Viewer opens JSON files far too large for an editor, entirely in your browser, with nothing uploaded. The expensive step is one pass over every byte to validate the file and record what the viewer needs to render and scroll it. We wrote that pass twice: a readable JavaScript version that stays as the reference and the safety net, and a C version compiled to a 17.5 KB WebAssembly module that uses SIMD to classify sixteen bytes at a time. On a 100 MB file on a normal machine, the WASM path does the same work at 305 MB/s against the JavaScript path's 143 MB/s, a little over twice as fast, and the margin only grows as the files do. The technique comes straight from the simdjson school of parsing; the bookkeeping that turns a parse into a scrollable, collapsible view of a 10 GB file is ours.
