Hacker News — vinext + Cloudflare Workers

Asymmetric Quantization: Near-Lossless Retrieval with 97% Storage Reduction (mixedbread.com)

107 points by breadislove 4 days ago | 43 comments

Zagreus2142 1 days ago [-]

``` We evaluated several precision pairings across our internal retrieval benchmark suite. Scores are NDCG@10 averaged across the suite, scaled to 0–100. NDCG@10 (Normalized Discounted Cumulative Gain at rank 10) measures how well the top 10 results are ordered against the ideal ranking, rewarding relevant documents more when they appear higher, with 100 being a perfect ranking. The full-precision baseline averages 90.26. Int8 query against binary documents averages 89.65, a 0.61 point drop, while reducing document-vector storage by 32x ```

Saying "Near lossless" to mean 90% accurate retrieval of saved vectors is simply a lie. Lossy-ness is binary, not something you can paper over with getting close enough. And 90% is not close. Sure, LLMs are all about gradient descent on noisy data sets so I guess this is acceptable in this field but that terminology usage still bothered me
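For readers following the quoted definition, here is a minimal sketch of how NDCG@10 is typically computed. It scores the ordering of ranked results, not bit-exact recovery of stored vectors. The function names and the toy relevance grades are illustrative assumptions, and the ideal ranking is taken from the returned list only, which is a simplification of a full benchmark setup.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the returned order divided by DCG of the ideal order, scaled to 0-100."""
    # Simplification: the ideal order is built from the returned items only.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return 100.0 * dcg_at_k(relevances, k) / ideal

# Hypothetical graded relevance labels for the top 10 results of one query.
perfect = [3, 2, 2, 1, 0, 0, 0, 0, 0, 0]
swapped = [2, 3, 2, 1, 0, 0, 0, 0, 0, 0]  # best document pushed to rank 2

print(ndcg_at_k(perfect))
print(ndcg_at_k(swapped))
```

A single swap near the top already lowers the score noticeably, which is why a benchmark average around 90 does not mean that 10% of the stored vectors are retrieved incorrectly.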

kittoes 24 hours ago [-]

I don't believe that's what they were saying at all though. The claim appears to be that it's near lossless relative to their own baseline that uses float. Which I'd grant, since a 32x storage reduction for a 0.61-point loss in quality is a reasonable trade-off when you've already decided to accept that ~90% is "good enough".

coldtea 5 hours ago [-]

>Saying "Near lossless" to mean 90% accurate retrieval of saved vectors is simply a lie. Lossy-ness is binary, not something you can paper over with getting close enough.

Lossy-ness is binary; "near-lossless", however, is still a valid term (and is not the same as saying lossless).

How else, if not by comparing to losslessness (whether with a more abstract qualitative term like "near" or with some distance or error measurement), do you report the level of fidelity to non-lossy results?

It depends on the context, but even in the abstract, for whatever domain, 90% sounds pretty close if we're talking about a linear level of degradation corresponding to each X% level.

In this case, if this is calculated against a lossless baseline that is itself close to 90%, the figure is the distance from that baseline; it doesn't represent distance from some pure 100% perfect retrieval. So ~90% vs ~89% is very close to lossless capability.

theropost 19 hours ago [-]

Yeah, what bugs me about stuff like that is that they spend all this time and then publish with minimal real testing to prove the theory. It's like you're building your model to [...]. And just because it takes a long time to compute and do the testing, you'd rather publish your article and then try to get credit for something that hasn't really been proven. Look, prove your results. Study it. Ruggedize it. Make sure it works. Then, show us.

seritools 23 hours ago [-]

near lossless refers to being 89.65/90.26 = 99.32% of baseline, i'm pretty sure.
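The arithmetic behind this reading can be checked in a few lines. The two NDCG figures are the ones quoted from the article; the 32-bit baseline width is an assumption (float32 document vectors), chosen because it is consistent with the quoted 32x reduction.

```python
baseline_ndcg = 90.26   # full-precision score quoted from the article
quantized_ndcg = 89.65  # int8 query x binary documents, quoted from the article

absolute_drop = baseline_ndcg - quantized_ndcg  # in NDCG points
retained = quantized_ndcg / baseline_ndcg       # fraction of baseline quality

bits_full = 32    # assumption: float32 per dimension in the baseline
bits_binary = 1   # one bit per dimension after binarization
storage_ratio = bits_full / bits_binary
storage_saved = 1 - bits_binary / bits_full

print(f"absolute drop: {absolute_drop:.2f} points")
print(f"retained: {retained:.2%} of baseline")
print(f"storage: {storage_ratio:.0f}x smaller, {storage_saved:.1%} saved")
```

Under that assumption, the 32x figure and the "97% storage reduction" in the title are the same claim: 1 - 1/32 is 96.875%.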

breadislove 23 hours ago [-]

yes exactly.

ttoinou 17 hours ago [-]

Ask a SOTA LLM when Newton was born, without any internet access: the answer is lossless with respect to our shared cultural understanding of this question. Not near-lossless, lossless. Ask the same LLM when YOU were born: the answer is just wrong for almost anyone in the world, not lossy. Between the two there is a whole new field of lossyness to study.

What 90% means depends entirely on what the measure means here. Do you understand what "Normalized Discounted Cumulative Gain at rank 10" means for the set of data that we are comparing?

Sometimes coming up with new codecs (compressors/decompressors) means coming up with new ways to interpret artifacts of the real world. And this is exactly why LLMs are so powerful: they are like a giant lossy (but near-lossless for various use cases) ZIP file / database of the whole knowledge of the training data.

Nobody is trying to manipulate you here, humanity just has to find new explanations for complex topics.

    Lossy-ness is binary

Lossless is binary in pure information theory. To quote my other comment:

Lossless is objective for information theory. To get from the real world to the digital world you need an analog-to-digital converter, and this process is by definition lossy. We are interested in the real world; information is pure but never represents reality exactly. Lossyness is baked into our problem statement here.

Using terms like near lossless means we think we are very close to reality for what we’re trying to do

elil17 1 days ago [-]

I would love to see real examples of what reduced quality means in practice. Are you able to recover a document from the vector in a human readable format? If so, what sort of changes come up?

I could imagine a scenario where differences tend to be more substantive than you'd expect because of how less frequent words with fine distinctions in meaning - the very words that make the document special - may be embedded in the vector space.

yorwba 1 days ago [-]

Most of the fine distinctions are already lost when a document is processed through a pile of linear algebra to turn it into a fixed-size list of floating-point numbers, as you can see from the NDCG@10. Vector search is not a tool for fine distinctions. It's a tool for reducing a large pile of documents to a smaller selection of candidates, which you can then check individually with some more expensive method.
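The candidate-reduction pattern described here can be sketched as a two-stage pipeline. Everything in this example is hypothetical: `retrieve_then_rerank`, the toy vectors, and the `expensive_scores` table, which stands in for whatever slower, more accurate scorer is used in the second stage.

```python
def dot(a, b):
    """Plain dot product; stands in for the cheap vector similarity."""
    return sum(x * y for x, y in zip(a, b))

def retrieve_then_rerank(query_vec, doc_vecs, rerank_score, shortlist_size=3, top_k=2):
    """Stage 1 picks a shortlist by vector similarity.
    Stage 2 orders only the shortlist with a more expensive scorer."""
    by_similarity = sorted(
        range(len(doc_vecs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    shortlist = by_similarity[:shortlist_size]
    reranked = sorted(shortlist, key=rerank_score, reverse=True)
    return reranked[:top_k]

# Toy data: four documents in a 2-dimensional embedding space.
doc_vecs = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.7, 0.2]]
query_vec = [1.0, 0.0]

# Hypothetical scores from an expensive checker, keyed by document index.
expensive_scores = {0: 0.2, 1: 0.9, 2: 1.0, 3: 0.5}

result = retrieve_then_rerank(query_vec, doc_vecs, expensive_scores.get)
print(result)
```

In this toy setup document 2 has the best expensive score but never reaches the second stage, because the vector stage did not shortlist it. That is the trade-off in question: the cheap stage only needs to keep the right candidates in the shortlist, not order them perfectly.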

breadislove 23 hours ago [-]

The NDCG loss is minimal: 90.26 -> 89.65. This means it maintains most of the quality.

breadislove 23 hours ago [-]

this is the reason why we report ndcg and not recall. ndcg respects fine-grained details, so you get an overview of how much detail you are trading off, since losing it would hurt the ranking.

purple-leafy 1 days ago [-]

Hey breadislove; amazing article, I’ll be sending mixedbread an email in the morning that may interest you (email will be <5-characters>@pm.me)

I have also been working in compression and performance engineering, and managed to get a 99+% compression unlock versus conventional approaches (100+KB down to 1KB) in the scenario of 30 minute massive multiplayer game replays for a “game+engine” I’m developing

I think there’s a synergy between these 2 concepts I’d love to chat some more

palinnilap 1 days ago [-]

Any way I can read about this or the use case? I have a hobby interest

purple-leafy 19 hours ago [-]

Yes soon I’ll be launching my game and engine, and will have a blog post - just keep an eye on Show HN over the following week

breadislove 23 hours ago [-]

to which email did you send it? can u send it to support please?

purple-leafy 19 hours ago [-]

Sent to the support email with the subject line “Hackernews …”

derrickquinn 22 hours ago [-]

Asymmetry is clever. FWIW, this is very similar to the strategy employed by BitNet models (i.e., int8 activations with binary or ternary weights); I suspect retrieval is a little more amenable to this approach.

In principle, binary x binary should be pretty fast since it just requires bitwise XNOR and popcount/reduction, but in practice it's slow unless you've really optimized it. And, as stated in the article, you'd still be losing a lot of accuracy that way.
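A rough sketch of the two scoring schemes mentioned here, in plain Python for readability rather than speed. It assumes sign-based binarization (bit set when the component is non-negative) and a query that has already been quantized to int8; the function names are illustrative and this is not the article's implementation.

```python
def pack_signs(values):
    """Pack the signs of a vector into an int: bit i is 1 when values[i] >= 0."""
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, dim):
    """Dot product of two {-1, +1} vectors via XNOR and popcount."""
    mask = (1 << dim) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")  # positions with equal signs
    return 2 * agree - dim  # agreements count +1, disagreements count -1

def int8_binary_dot(query_int8, doc_bits):
    """Asymmetric score: int8 query against a {-1, +1} document.
    Each query component is added or subtracted depending on the document bit."""
    return sum(q if (doc_bits >> i) & 1 else -q for i, q in enumerate(query_int8))

doc_bits = pack_signs([0.4, -1.2, 0.3, -0.1])     # signs + - + -
query_bits = pack_signs([0.5, -0.9, -0.05, 0.2])  # signs + - - +
query_int8 = [71, -127, -7, 28]                   # same query, pre-quantized to int8

print(binary_dot(query_bits, doc_bits, dim=4))
print(int8_binary_dot(query_int8, doc_bits))
```

In the binary x binary case the small third and fourth query components count as much as the large first two, so the score collapses to zero. The int8 query keeps the magnitudes, which is one way to see why the asymmetric pairing loses less accuracy.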

kaizenite 24 hours ago [-]

To people smarter than me, how impressive and/or revolutionary is this?

alfiedotwtf 1 days ago [-]

If you squint hard enough, it sounds like their storage layer is a bloom filter

rq1 1 days ago [-]

The Pi compression algorithm is better.

luma 1 days ago [-]

Doubtful. The problem with the pi idea is that you need to include the offset, which will likely be as long as or longer than your data.

nathan_compton 1 days ago [-]

" A single document produces more then one embedding, depending on the complexity of the document it can produce hundreds or thousands of vectors."

That typo up there is kind of endearing in the AI slop era.

HenryMulligan 23 hours ago [-]

Not seeing a typo in your quote. Can you point it out?

thatspartan 22 hours ago [-]

I think they're referring to "then" vs "than"

breadislove 22 hours ago [-]

ah whoops, I'll fix it. ty!

nathan_compton 17 hours ago [-]

Genuinely, from the bottom of my heart, thank you for writing without an AI.

breadislove 14 hours ago [-]

everything worth writing, you should write yourself

vasylvd 21 hours ago [-]

[flagged]

dismissed181 21 hours ago [-]

[dead]

m_m_carvalho 1 days ago [-]

[dead]

mv_d5339e31 1 days ago [-]

[dead]

johnathan101 1 days ago [-]

[flagged]

TradingReality 3 days ago [-]

[flagged]

Ameo 1 days ago [-]

[flagged]

mwigdahl 23 hours ago [-]

Unfortunately as cost reduction trends to 100%, it comes along with an intrinsic high-pass sarcasm filter.

peheje 1 days ago [-]

Reminds me of 'Learning to be me' by Greg Egan

throwaway2027 1 days ago [-]

You would obviously be trading storage for compute and time to retrieve the storage.

throwaw12 1 days ago [-]

100% reduction is impossible for something which should work, because -100% means it is now 0

neonstatic 1 days ago [-]

They were clearly being sarcastic

functionmouse 1 days ago [-]

there is no such thing as "near lossless"

ttoinou 1 days ago [-]

There is, after you define what you’re ready to lose and understand the lossy space. That’s how we came up with mobile cellphones, audio and video codecs etc. Literally powering all modern devices we use.

greenleafone7 24 hours ago [-]

So then ... "lossy"

magicalhippo 8 hours ago [-]

Yes of course. A compression algorithm which just stores the number of bits and decodes all bits as zeros is also lossy.

A floating-point compression algorithm which reconstructs the elements so they differ by at most one ULP[1] compared to the original value is, no surprise, also lossy.

Being able to communicate that an algorithm is closer to the latter than the former is useful, hence terms like "near-lossless".

[1]: https://en.wikipedia.org/wiki/Unit_in_the_last_place
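The contrast between the two lossy extremes can be made concrete with a small sketch. Both decoders below are hypothetical stand-ins, not real compression algorithms: one only remembers the length, the other simulates a worst-case one-ULP error by stepping each value to the next representable float. It relies on `math.ulp` and `math.nextafter`, which need Python 3.9 or later.

```python
import math

def decode_zeros(n):
    """Degenerate lossy codec: remember only the length, decode everything as zero."""
    return [0.0] * n

def decode_one_ulp_off(values):
    """Stand-in for a codec whose output is at most one ULP away from the input."""
    return [math.nextafter(v, math.inf) for v in values]

def max_error_in_ulps(original, decoded):
    """Largest reconstruction error, measured in ULPs of the original value."""
    return max(abs(d - o) / math.ulp(o) for o, d in zip(original, decoded))

original = [1.0, 3.14159, 1024.5]

print(max_error_in_ulps(original, decode_zeros(len(original))))
print(max_error_in_ulps(original, decode_one_ulp_off(original)))
```

Neither decoder returns the original values, so both are lossy in the strict sense, yet their errors differ by many orders of magnitude. That gap is what a qualifier like "near-lossless" is meant to communicate.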

tancop 22 hours ago [-]

theres a big difference between 99% quality and 30%. near lossless is a good name for the first one. if you treat it in a binary way where everything short of 100 falls into one "lossy" bucket you lose all the practical differences that make one encoding much better than another.

functionmouse 21 hours ago [-]

> theres a big difference between 99% quality and 30%.

sure

> if you treat it in a binary way where everything short of 100 falls into one "lossy" bucket you lose all the practical differences that make one encoding much better than another.

no; lossless is an inherently binary term. and I don't lose all the practical differences of better lossy encoders by understanding that; I'm not just going to start using mp3 96k because I have an understanding of lossless vs lossy encoders...

Lossless is an objectively binary term.

ttoinou 18 hours ago [-]

Lossyness is baked into our problem statement here.

Using terms like near lossless means we think we are very close to reality for what we’re trying to do

greenleafone7 19 hours ago [-]

I agree with you somewhat, and I like what is described in the article. But I also feel like we are diluting the meaning of the word to make things sound better. Lossy/Lossless is inherently binary, and it carries a specific meaning. It would not detract from the work at all if it was described differently.

You can't be a little bit on fire :)

functionmouse 1 days ago [-]

Actually, all of those things are considered "lossy".

ttoinou 24 hours ago [-]

Yes, anything not lossless is lossy. Near-lossless is not lossless, so it is lossy. I hope we speak the same language

breadislove 14 hours ago [-]

yes, you are right. what heading would you have taken here?
