When querying a variable in a project containing more than 600 files, it can never be found. #274

NingenSh1kkaku · 2025-08-21T03:08:15Z

Each time, vectorcode returns the number of files I ask for, but they are totally irrelevant. I am sure the file containing that variable is vectorised (I checked with vectorcode files ls).
I have re-vectorised the project several times, and the result stays the same. My config is like this:

{
  // "host": "127.0.0.1", // before 0.6.8
  // "port": 8000, // before 0.6.8
  "db_url": "http://127.0.0.1:8000", // since 0.6.8
  "embedding_function": "SentenceTransformerEmbeddingFunction",
  "embedding_params": {
  },
  // "db_path": "~/.local/share/vectorcode/chromadb/",
  "db_log_path": "~/.local/share/vectorcode/",
  "db_settings": null,
  "chunk_size": 2000,
  "overlap_ratio": 0.2,
  "query_multiplier": -1,
  "reranker": "CrossEncoderReranker",
  "reranker_params": {
    "model_name_or_path": "cross-encoder/ms-marco-MiniLM-L-6-v2"
  },
  "hnsw": {
    "hnsw:M": 256,
    "hnsw:num_threads": 40
  },
  "chunk_filters": {},
  "encoding": "utf8"
}

I don't know a lot about vector databases. Some params are changed because I tried a few things, but had no luck.
Any suggestions?
BTW the project is Rust code.

Answered by Davidyz

Davidyz · 2025-08-21T04:35:09Z

Davidyz
Aug 21, 2025
Maintainer

I'd suggest trying different embedding models. I've been playing with qwen3-embedding recently, and had some SIGNIFICANTLY better retrieval results. I'll put together a guide to demonstrate how to do it in the next few days.

5 replies

NingenSh1kkaku Aug 21, 2025
Author

Thanks for the suggestion.
In fact I have tried all the embedding functions listed in the Chroma docs, but none of them works.
As for the one you refer to, I get the following error:

ERROR: asyncio : Task exception was never retrieved
future: <Task finished name='Task-13' coro=<chunked_add() done, defined at /home/csy/.local/share/uv/tools/vectorcode/lib/python3.13/site-packages/vectorcode/subcommands/vectorise.py:86> exception=JSONDecodeError('Expecting value: line 1 column 1 (char 0)')>
Traceback (most recent call last):
  File "/home/csy/.local/share/uv/tools/vectorcode/lib/python3.13/site-packages/vectorcode/subcommands/vectorise.py", line 149, in chunked_add
    embeddings = embedding_function(
        list(str(c) for c in inserted_chunks)
    )
  File "/home/csy/.local/share/uv/tools/vectorcode/lib/python3.13/site-packages/chromadb/api/types.py", line 466, in __call__
    result = call(self, input)
  File "/home/csy/.local/share/uv/tools/vectorcode/lib/python3.13/site-packages/chromadb/utils/embedding_functions/huggingface_embedding_function.py", line 52, in __call__
    ).json(),
      ~~~~^^
  File "/home/csy/.local/share/uv/tools/vectorcode/lib/python3.13/site-packages/httpx/_models.py", line 832, in json
    return jsonlib.loads(self.content, **kwargs)
           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ~~~~~~~~~~~~~~~~~~~~~~~^^^
  File "/usr/lib/python3.13/json/decoder.py", line 345, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/json/decoder.py", line 363, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

It seems that the Hugging Face embedding function always fails with this JSON parsing error.

Davidyz Aug 21, 2025
Maintainer

That's curious. I've never used the huggingface embedding function so I'm not quite sure what's wrong. I initially used qwen3-embedding with sentence transformers, and have now switched to llama.cpp's openai-compatible endpoint to avoid repeatedly loading and unloading the model (I'll include my setup in the write-up that I mentioned earlier).

Also, is the Rust project open-source? If it is, could you leave the URL here? I have a side project that tries to evaluate retrieval results (still WIP, not ready to meet the public yet), and if you have a repo that VectorCode struggles to work with, it's probably a good "evaluation dataset".

Davidyz Aug 21, 2025
Maintainer

Also, I personally think "querying a variable" is not a good use case for VectorCode: this (search for exact match) sounds like a job for grep/rg/LSP, not for vector search. Vector search solves the problem of "I know what it is, but I don't know how it's written". While vector search should still work, it's probably too slow (and inefficient) compared to the 3 alternatives I mentioned.
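
To make the distinction concrete: an exact-identifier lookup needs no embeddings at all. Below is a minimal Python sketch of a whole-word, exact-match scan, roughly what a word-boundary search in grep/rg does for a single string. It is not part of VectorCode; the function name find_identifier and the sample Rust snippet are made up for illustration.

```python
import re


def find_identifier(source: str, name: str) -> list[int]:
    """Return the 1-based line numbers where `name` appears as a whole word."""
    # \b keeps `retry_count` from matching inside `retry_count_max`.
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


# Hypothetical sample input, standing in for the contents of one source file.
sample = """\
let retry_count = 3;
let retry_count_max = 10;
println!("{}", retry_count);
"""

hits = find_identifier(sample, "retry_count")
print(hits)
```

A scan like this either finds the name or it doesn't, which is why it suits "where is this variable?" better than a similarity ranking that always returns the requested number of results.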

NingenSh1kkaku Aug 22, 2025
Author

I am looking forward to your upcoming guide, and I'm sorry that my Rust project is private. Never mind, I will try some of CodeCompanion's other tools (yes, that is what led me here). Thank you for your advice.

Davidyz Aug 24, 2025
Maintainer

Turns out I was mistaken. The rerankers don't make that much of a difference (at least on the VectorCode repo itself). You could try using NaiveReranker instead of CrossEncoderReranker and see if it makes things better. My theory is that the default reranker is either too small to understand code snippets correctly, or simply isn't optimised for code at all. I'll do some more tests on different repos, and if this is consistent across them, I'll revert the default reranker to NaiveReranker.

PS: if you're interested in what NaiveReranker is, it directly uses the distances between the embedding vectors as the similarity.
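
As a rough illustration of that idea, ranking purely by embedding distance can be sketched as follows. This is not VectorCode's actual implementation: the function names, the toy 2-dimensional vectors, and the choice of cosine distance are all assumptions made for the example.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(
    query: list[float], chunks: dict[str, list[float]], top_k: int
) -> list[str]:
    """Order chunk ids by distance to the query embedding, closest first."""
    scored = sorted(chunks.items(), key=lambda kv: cosine_distance(query, kv[1]))
    return [chunk_id for chunk_id, _ in scored[:top_k]]


# Hypothetical embeddings; real ones have hundreds of dimensions.
query_embedding = [1.0, 0.0]
chunk_embeddings = {
    "a.rs": [0.0, 1.0],  # orthogonal to the query
    "b.rs": [1.0, 0.1],  # nearly parallel to the query
    "c.rs": [1.0, 1.0],  # 45 degrees away from the query
}

print(rank_by_distance(query_embedding, chunk_embeddings, top_k=2))
```

No second model is involved, so there is nothing that can misread code snippets the way a small cross-encoder might. In the config posted at the top of this thread, trying it would mean setting the "reranker" value to "NaiveReranker".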

NingenSh1kkaku · 2025-08-25T15:13:41Z

NingenSh1kkaku
Aug 25, 2025
Author

Alright, it is kind of you to dig into it. I'll give it a try some day. I have one question: if NaiveReranker works, why is CrossEncoderReranker the default?

1 reply

Davidyz Aug 25, 2025
Maintainer

why is the CrossEncoderReranker the default?

Because I, while not having much data to support my opinions, was under the impression that the CrossEncoderReranker was better.

I've been actively (and slowly) researching how I can systematically evaluate the retrievals (see #67), but I never made it work, mostly because I wasn't sure how to score the retrievals. The lack of benchmarking/evaluation led to a lot of flaky (and opinionated) decisions about the retrieval process. A lot of the research in this field seems to use Claude/GPT4o/Gemini 2.5 Pro as a "grader model", but that's not financially feasible for me (that is, if we want to be able to run the evaluations as part of the routine CI process). Nevertheless, my recent reading gave me some new inspiration. Maybe I can get my evaluation side project working, and solve this problem for good.
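
For context on scoring without a grader model: one option that may fit a CI budget, assuming a small set of queries can be hand-labelled with the files that should come back, is to use classic retrieval metrics such as recall@k and reciprocal rank. The sketch below is only an illustration of those two metrics. It is not the approach from #67 or from the side project mentioned above, and the paths and labels are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant files that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, path in enumerate(retrieved, start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical labelled example: what a query returned vs. what it should return.
retrieved_files = ["src/a.rs", "src/b.rs", "src/c.rs"]
relevant_files = {"src/b.rs", "src/d.rs"}

print(recall_at_k(retrieved_files, relevant_files, k=3))
print(reciprocal_rank(retrieved_files, relevant_files))
```

Both metrics are deterministic and cost nothing to run, at the price of needing labelled queries up front.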
