Adding semantic search to my website and giving up
I tried to add semantic search to my website. The tech works, but in the end I decided against it.
The idea started from seeing openring being used on https://drewdevault.com/. If I'm going to add recommendations anyway, why not make them semantic?
Mmmm… I need an auto encoder-decoder? I have no idea. Let me ask in the RWKV chat room. They said I should use BERT. I don't know what BERT is, or whether it is edible, so I looked up whether there is BERT support in ggml. There is. There is!
So, quickly, I got the model up and running:
git clone --depth 1 https://github.com/ggerganov/llama.cpp
cd llama.cpp
# build
cmake -B build -G Ninja
pushd build
ninja llama-embedding llama-quantize
popd
# download
GIT_LFS_SKIP_SMUDGE=1 git clone --depth 1 https://huggingface.co/BAAI/bge-small-en-v1.5/ # skip the LFS blob; the weights are fetched below
wget2 'https://huggingface.co/BAAI/bge-small-en-v1.5/resolve/main/model.safetensors?download=true' -O bge-small-en-v1.5/model.safetensors
# convert
pip install torch # this step can definitely be skipped if I work on the conversion script a bit
pip install sentencepiece transformers
python convert-hf-to-gguf.py bge-small-en-v1.5/ --outfile models/ggml-model-f16.gguf --outtype f16
# quantize
build/bin/llama-quantize models/ggml-model-f16.gguf Q8_0
# generate embedding for text
echo hi | build/bin/llama-embedding -m models/ggml-model-f16.gguf -f /dev/stdin
echo hi | build/bin/llama-embedding -m models/ggml-model-Q8_0.gguf -f /dev/stdin
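If I had kept going, the next step would have been calling that binary from the site generator. A rough sketch of what that could look like in Python, reusing the paths from above; the embed() helper is made up, and the parsing assumes llama-embedding prints the embedding as whitespace-separated floats on stdout (the exact output format varies between llama.cpp versions, so treat it as a guess):

import subprocess

MODEL = "models/ggml-model-Q8_0.gguf"  # the quantized model from the step above

def embed(text: str) -> list[float]:
    """Run llama-embedding on one piece of text and return its vector."""
    out = subprocess.run(
        ["build/bin/llama-embedding", "-m", MODEL, "-f", "/dev/stdin"],
        input=text, capture_output=True, text=True, check=True,
    ).stdout
    # Assumption: the embedding values are printed to stdout as floats; keep
    # only the tokens that parse as floats and ignore any labels around them.
    values = []
    for token in out.split():
        try:
            values.append(float(token))
        except ValueError:
            pass
    return values

if __name__ == "__main__":
    print(len(embed("hi")))  # bge-small-en-v1.5 embeddings have 384 dimensions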
At this point, I was nearly bored to death, so I gave up on integrating this with Atom/RSS/JSON feeds that I subscribe to.
Why I was so bored
I left the ML scene a while ago.
I fiddled with quantization and file formats so that llama/vicuna could run on my machine.
I watched the author of ggml and the author of a quantization method talk about whether they should name it q51 or not. It is not named q51.
I watched the companies behind the public hype openly flaunt their incompetence.1
I fiddled a lot.
Back to the present. The concept of BERT/bge is boring. You feed it text, and it gives you back a high-dimensional point. Getting the model to run is also not fiddly at all. The ggml community has streamlined the library (llama.cpp) so much since I last used it. The technology is mature now; it is only going to get more boring from now on.2
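Concretely, that "high-dimensional point" is the whole interface: two pieces of text are related when their points are close, which is usually measured with cosine similarity. A toy sketch with made-up 4-dimensional vectors (real bge-small-en-v1.5 vectors have 384 dimensions):

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 means the two vectors point in exactly the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for embeddings of a post and two feed entries.
post = [0.1, 0.9, 0.0, 0.3]
related = [0.2, 0.8, 0.1, 0.2]
unrelated = [0.9, -0.1, 0.4, -0.7]
print(cosine_similarity(post, related))    # ~0.98, i.e. "recommend this"
print(cosine_similarity(post, unrelated))  # ~-0.18, i.e. "skip this"

Recommendations then reduce to: embed every feed entry, embed the current post, and pick the entries with the highest scores.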
Constraints of BERT/bge
However good bge is, it's still an individual. Well, besides not having domain knowledge of my writing (it was not trained on such data), it is going to have its own view on what I write about. This is probably OK, I guess. If more perspectives are better, wouldn't plural systems have an advantage in copywriting?
To do
If you are interested, you can try the things below, which I didn't do.
- test more quantization methods
- quantize embeddings (why not; a sketch follows below)
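For the second item, the most naive version is to store one scale per vector plus int8 values, which shrinks a 384-dimensional f32 embedding from about 1.5 KB to a bit under 0.4 KB. A plain-Python sketch of that rounding trick (this is not llama.cpp's quantization format, just the idea):

def quantize_q8(vec: list[float]) -> tuple[float, list[int]]:
    """Represent a vector as one f32 scale plus int8 values in [-127, 127]."""
    scale = max(abs(x) for x in vec) / 127 or 1.0  # avoid dividing by zero for an all-zero vector
    return scale, [round(x / scale) for x in vec]

def dequantize_q8(scale: float, quants: list[int]) -> list[float]:
    return [scale * q for q in quants]

original = [0.12, -0.98, 0.33, 0.0]
scale, quants = quantize_q8(original)
print(quants)                        # small integers: [16, -127, 43, 0]
print(dequantize_q8(scale, quants))  # roughly the original values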