Dense-WebVid-CoVR (video+text)

Composed retrieval doubles signal for long queries

Marengo 3.0 fuses text + visual intent into a single embedding, surfacing coherent grassy scenes in the first positions where baselines drift.

Query

Replace the fallen autumn leaves covering the ground with a patch of long, unkempt green grass swaying gently in a light breeze.

QUERY CLIP (30s)

Composed R@10: 97.0%

Text-Only R@10: 90.9% vs 78.3% (Vertex)

Embedding size: 512d vs Nova 3072d
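For readers unfamiliar with the metric, R@10 (Recall@10) is the fraction of queries whose ground-truth item appears among the top 10 retrieved results. The sketch below shows one common way to compute it from embeddings. It assumes cosine similarity as the ranking score, which this page does not state, and the names `cosine` and `recall_at_k` are hypothetical helpers, not part of any Marengo API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recall_at_k(query_embs, gallery_embs, gt_indices, k=10):
    """Fraction of queries whose ground-truth gallery index is in the top-k results."""
    hits = 0
    for query, gt in zip(query_embs, gt_indices):
        # Rank every gallery item by similarity to the query, best first.
        ranked = sorted(
            range(len(gallery_embs)),
            key=lambda i: cosine(query, gallery_embs[i]),
            reverse=True,
        )
        if gt in ranked[:k]:
            hits += 1
    return hits / len(query_embs)

# Toy data: three gallery items and two queries in 2-D (illustrative only).
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 0.1], [0.1, 1.0]]
ground_truth = [0, 2]

print(recall_at_k(queries, gallery, ground_truth, k=1))
print(recall_at_k(queries, gallery, ground_truth, k=2))
```

In this toy setup the second query's ground truth is ranked second, so it counts as a miss at k=1 and a hit at k=2.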

Marengo 3.0 (composed, IMAGE + TEXT): GT at rank 1 (top-5 results shown)

Librispeech (speech → text)

Speech retrieval stays faithful to exact utterances

Marengo 3.0 and 2.7 rank the ground-truth utterance first; Nova fails to retrieve speech reliably.

Query

It is so made that everywhere we feel the sense of punishment

Marengo 3.0: GT at rank 1
