
Introducing Wizwand v2 — and what we learned after burning 5.4B tokens

I’m excited to introduce wizwand.com v2 — a big step toward helping researchers discover state-of-the-art machine learning and AI papers and compare results in a way that actually makes sense.

It took us a little over a month to build this version. Along the way, we ran a lot of experiments and burned billions of tokens trying to answer one deceptively hard question:

How do you make paper results comparable when the world refuses to be neatly structured?

Here’s the backstory, the biggest technical lessons from v1 → v2, and what we’re improving next.


Why we rebuilt: the limits of “legacy structure”

The initial version of Wizwand was built on top of the (now sunsetted) PapersWithCode data and followed a similar approach: extract new papers, map them into a predefined schema, and render leaderboards.

That structure works… until you try to scale to real papers in the wild.

We observed two recurring issues in v1:

  1. Dataset inconsistency (the “apples-to-apples” problem)
  2. Task granularity (the “what even counts as the same task?” problem)

Both sound simple. Neither is.


Lesson 1: Datasets are not a clean key — they’re a messy language problem

In v1, we tried to make dataset matching fully structured.

In theory, “ImageNet” is a dataset. In practice, the term “ImageNet” can mean:

  • ImageNet-1K val vs test
  • ImageNet-1K at different resolutions (e.g., 224 vs 512)
  • TinyImageNet
  • ImageNet 1% / subset variants
  • Randomly sampled subsets (“we sampled 50k images from ImageNet”)
  • Merged datasets (“we combine COCO + Visual Genome”)
  • Alternate naming (“ImageNet”, “IN”, “ILSVRC”, “ImageNet-1K”, …)

If two papers report “ImageNet” results, are they comparable?

  • If one uses val and another uses test, is that apples-to-apples?
  • If one uses ImageNet-1K but 512×512, should it live on the same leaderboard as standard 224×224?
  • If one uses ImageNet-1K 1%, should it be compared at all?

Our v1 approach (and why it broke)

We started by building a schema to represent dataset identity:

  • family (e.g., ImageNet)
  • split (train/val/test)
  • variant (1K / 21K / tiny)
  • version (2012 / 2014 / etc.)
  • subset rules
  • transforms / preprocessing
  • and so on…
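For a flavor of what this looked like, here is a stripped-down sketch of such a schema. The field names and types are illustrative, not our actual data model:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of a v1-style dataset identity schema; every field
# is a constrained value that has to anticipate all real-world usage.
@dataclass(frozen=True)
class DatasetIdentity:
    family: str                          # e.g. "ImageNet"
    split: Optional[str] = None          # "train" / "val" / "test"
    variant: Optional[str] = None        # "1K" / "21K" / "tiny"
    version: Optional[str] = None        # "2012" / "2014" / ...
    subset_rule: Optional[str] = None    # "1%", "random 50k images", ...
    preprocessing: Optional[str] = None  # "224x224 center crop", ...

# The failure mode: a paper that evaluates on "a random 50k-image subset of
# ImageNet-21K at 384x384" has no faithful encoding here, so it either gets
# dropped or silently collapsed into a nearby bucket.
```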

It quickly became clear: a data model that can describe every real-world dataset mention becomes either

  • too complex to maintain, or
  • too incomplete to trust.

And the worst failure mode is subtle: if the schema can’t represent a case, you risk silently merging results that should never be compared.

Our v2 approach: describe datasets in natural language, then let the model judge

In v2, we stopped forcing everything into rigid fields.

Instead, we represent datasets as natural-language descriptors and use an LLM to:

  • interpret what a dataset mention likely means (based on paper context)
  • compare two dataset descriptors
  • decide whether they are equivalent, compatible, or different
  • merge only when it’s safe

Examples like:

  • “MS COCO 2014 Karpathy test”
  • “gRefCOCO (testB)”

…are much easier to keep accurate in language than to shoehorn into a universal schema.
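As a rough sketch of how the comparison step can work (the prompt wording, verdict labels, and the call_llm helper below are illustrative placeholders, not our production pipeline):

```python
import json

EQUIVALENCE_PROMPT = """You are comparing two dataset descriptions taken from papers.

Descriptor A: {a}
Descriptor B: {b}

Reply with JSON: {{"verdict": "equivalent" | "compatible" | "different",
                   "reason": "<one sentence>"}}
Only answer "equivalent" if results on A and B are directly comparable
(same split, same subset, same evaluation protocol)."""

def judge_dataset_pair(a: str, b: str, call_llm) -> dict:
    """Ask an LLM judge whether two natural-language dataset descriptors
    refer to the same comparable benchmark. `call_llm` is any function
    that takes a prompt string and returns the model's text output."""
    raw = call_llm(EQUIVALENCE_PROMPT.format(a=a, b=b))
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        verdict = {}
    # Default to "different" on anything malformed: a wrong merge costs far
    # more than a missed merge.
    if verdict.get("verdict") not in {"equivalent", "compatible", "different"}:
        return {"verdict": "different", "reason": "unparseable judge output"}
    return verdict

# judge_dataset_pair("MS COCO 2014 Karpathy test",
#                    "COCO Karpathy test split", call_llm)
```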

Principle we optimized for:

It’s better to be slightly less structured than to accidentally compare non-comparable results.


Lesson 2: Task taxonomy doesn’t scale like you think

Datasets are hard. Tasks are arguably harder.

Take “image classification.” Sounds like a single task, right?

But papers might be doing:

  • medical image classification
  • few-shot classification
  • self-supervised classification
  • multi-label classification
  • fine-grained classification
  • zero-shot classification
  • domain-specific variants (e.g., satellite, pathology, industrial defects)

Should those be grouped together? Sometimes yes, often no.

Our v1 approach: parent/child task trees

We initially tried a structured hierarchy:

  • parent task → subtask → sub-subtask…
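For a sense of the shape, here is a toy version of such a tree (labels invented for illustration):

```python
# Toy sketch of a v1-style parent/child task tree (labels are illustrative).
TASK_TREE = {
    "computer-vision": {
        "image-classification": {
            "fine-grained-classification": {},
            "medical-image-classification": {},
        },
        "object-detection": {},
    },
    "nlp": {
        "text-classification": {},
    },
}

# Cross-cutting settings have no single home in a tree like this:
# "few-shot" would need to appear under image-classification,
# text-classification, object-detection, ..., and every new combination
# ("few-shot medical fine-grained classification") forces yet another node.
```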

But real research doesn’t evolve like a tidy tree. New tasks and settings often appear as cross-cuts:

  • “few-shot” applies across vision, NLP, audio, multimodal…
  • “medical” can be classification, segmentation, detection…
  • “robustness” is more of an evaluation regime than a task…

So the hierarchy kept exploding and still failed to capture the nuance.

Our v2 approach: keep domains, infer tasks, and merge with LLM judgment

In v2, we kept a simpler concept of domain/task labels (as categories), but removed the brittle parent/child taxonomy.

Pipeline-wise, we do two things:

  1. Infer the task for each evaluation result from the paper context (LLM extraction + reasoning)
  2. Use an LLM “judge” step to decide whether two results belong in the same comparable group
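As an illustrative sketch of step 2, the grouping pass can be written as a greedy clustering over pairwise judge calls (the function names here are placeholders, not our actual code):

```python
def group_comparable_results(results: list[dict], same_group) -> list[list[dict]]:
    """Greedily cluster extracted results into comparable groups.
    Each result carries natural-language 'task' and 'dataset' descriptors;
    `same_group(a, b)` wraps an LLM judge and returns True only when it is
    confident the two results can sit on the same leaderboard."""
    groups: list[list[dict]] = []
    for result in results:
        for group in groups:
            # Compare against a representative member of each existing group.
            if same_group(group[0], result):
                group.append(result)
                break
        else:
            # No safe match found: start a new group rather than risk a bad merge.
            groups.append([result])
    return groups
```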

The trade-off is intentional:

  • You may see more tasks and less rigid nesting
  • But the comparisons are more accurate, and far less likely to mix incompatible results

Again, we’re choosing correctness of comparison over the illusion of structure.


How we used LLMs in v2 (and why we picked Gemini + Vertex AI)

For v2, we heavily relied on Google Gemini models for both:

  • data extraction (pulling structured + descriptive info from papers)
  • aggregation/merging (dataset equivalence, task grouping, deduplication, conflict resolution)

Reasons this stack worked well for us:

  • Models are fast and relatively cost-effective for large-scale processing
  • Vertex AI makes it practical to process tens of thousands of papers with robust orchestration
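For context, a minimal extraction call through the google-genai SDK on Vertex AI looks roughly like this. The project, location, model name, and prompt are placeholders, not our production setup:

```python
# Minimal sketch of an extraction call via the google-genai SDK on Vertex AI.
from google import genai

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

paper_excerpt = "..."  # e.g. the text and table content of one parsed paper

prompt = (
    "From the following paper excerpt, list every reported evaluation result as "
    "a JSON object with fields: task, dataset (natural-language descriptor), "
    "metric, value, and method.\n\n" + paper_excerpt
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt,
)
print(response.text)  # downstream steps parse and validate this output
```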

The unsexy part: evaluation harnesses

We also built our own evaluation tests so we could iterate quickly on the extraction + merge pipelines. Without tests, you’re basically guessing whether “it looks better” — which is not a strategy once you’re merging large corpora.

The harness helped us answer questions like:

  • Are we extracting the right dataset/task descriptors?
  • Are we over-merging or under-merging?
  • Where do errors cluster (tables vs captions vs main text)?
  • What’s the delta between pipeline versions?
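To give a flavor, a harness check for over-merging vs. under-merging might look like this (the labeled pairs and the judge interface are illustrative):

```python
# Illustrative harness check: measure over-merging and under-merging on a
# small hand-labeled set of dataset-descriptor pairs.
LABELED_PAIRS = [
    ("ImageNet-1K val, 224x224", "ILSVRC 2012 validation set", True),
    ("ImageNet-1K val, 224x224", "ImageNet-1K test", False),
    ("MS COCO 2014 Karpathy test", "COCO Karpathy test split", True),
]

def merge_error_rates(judge) -> dict:
    """`judge(a, b)` returns True if the pipeline would merge the two
    descriptors. Over-merges (comparing the non-comparable) are the
    dangerous errors, so we track them separately from under-merges."""
    over = under = 0
    for a, b, should_merge in LABELED_PAIRS:
        merged = judge(a, b)
        if merged and not should_merge:
            over += 1
        elif not merged and should_merge:
            under += 1
    return {
        "over_merge_rate": over / len(LABELED_PAIRS),
        "under_merge_rate": under / len(LABELED_PAIRS),
    }
```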

The token bill: one processing round

Here’s the subtotal for one round of paper processing in v2:

Category        Tokens (in + out)
Image tokens    1,262,471,868 (1.2B)
Text tokens     4,194,202,614 (4.1B)
Total           5,456,674,482 (5.4B)

Yes — we burned a lot of tokens.

But it was the fastest way for us to explore the design space and converge on something that produces trustworthy comparisons.


What’s next

v2 is a big improvement over v1, but we’re not done. Two areas we’re actively working on:

  1. Reducing residual pipeline errors (currently ~1–2%). These show up across extraction, transformation, and merge steps. Some are straightforward edge cases; others require better paper understanding (and better guardrails).

  2. Better handling of method attributes and configs. We want to display and filter results by things that often matter in practice, like:

    • training setup (data aug, epochs, optimizer, batch size)
    • inference setup (resolution, decoding strategy, test-time aug)
    • model variants and ablations

This is essential if we want Wizwand to be more than “a leaderboard” — it should be a tool for serious research navigation.


Try Wizwand v2

If you’re doing ML research (or just trying to keep up with it), I’d love for you to try wizwand.com v2 and tell us what breaks, what’s confusing, and what you wish it did.

We built v2 around one core idea:

Comparisons are only useful when they’re actually comparable.

And we’re going to keep iterating until that’s consistently true.

Join the community

Have feedback or ideas? We'd love to hear from you!