When querying a variable in a project containing more than 600 files, it can never be found. #274
-
Each time, vectorcode returns the number of files I ask for, but they are all totally irrelevant. I am sure the file containing that variable is vectorised (I checked). My config:

    {
      // "host": "127.0.0.1", // before 0.6.8
      // "port": 8000, // before 0.6.8
      "db_url": "http://127.0.0.1:8000", // since 0.6.8
      "embedding_function": "SentenceTransformerEmbeddingFunction",
      "embedding_params": {
      },
      // "db_path": "~/.local/share/vectorcode/chromadb/",
      "db_log_path": "~/.local/share/vectorcode/",
      "db_settings": null,
      "chunk_size": 2000,
      "overlap_ratio": 0.2,
      "query_multiplier": -1,
      "reranker": "CrossEncoderReranker",
      "reranker_params": {
        "model_name_or_path": "cross-encoder/ms-marco-MiniLM-L-6-v2"
      },
      "hnsw": {
        "hnsw:M": 256,
        "hnsw:num_threads": 40
      },
      "chunk_filters": {},
      "encoding": "utf8"
    }

I don't know much about vector databases. I changed some of the params while experimenting, but had no luck.
-
I'd suggest trying different embedding models. I've been playing with qwen3-embedding recently, and got SIGNIFICANTLY better retrieval results. I'll put together a guide to demonstrate how to do it in the next few days.
-
Alright, it is kind of you to dig into it. I'll give it a try some day. I
have one question: if the naive reranker works, why is
CrossEncoderReranker the default?
David ***@***.***> wrote on Sunday, 24 August 2025 at 13:41:
… Turns out I was mistaken. The rerankers don't make that much of a
difference (at least on the VectorCode repo itself). You could try using
NaiveReranker instead of CrossEncoderReranker and see if it makes things
better. My theory is that the default reranker is either too small to
understand code snippets correctly, or simply isn't optimised for code at
all. I'll do some more tests on different repos, and if this is consistent
across them, I'll revert the default reranker to NaiveReranker.
PS: if you're interested in what NaiveReranker is, it directly uses the
distances between the embedding vectors as the similarity.
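
To make the PS concrete, here is a minimal sketch of distance-based reranking. It is not VectorCode's actual code: the function names, the toy 2-dimensional vectors, and the choice of cosine distance are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two vectors: 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def naive_rerank(query_vec, chunks, top_k):
    """Rank chunk ids by embedding distance to the query (hypothetical helper).

    chunks maps a chunk id (here: a file name) to its embedding vector.
    No second model is involved; the smallest distance wins.
    """
    ranked = sorted(chunks, key=lambda cid: cosine_distance(query_vec, chunks[cid]))
    return ranked[:top_k]


# Toy embeddings, made up for this example.
chunks = {
    "a.py": [1.0, 0.0],
    "b.py": [0.0, 1.0],
    "c.py": [0.7, 0.7],
}
ranked = naive_rerank([1.0, 0.1], chunks, top_k=2)
print(ranked)  # ['a.py', 'c.py']
```

A cross-encoder reranker would instead score each (query, chunk) text pair with a separate model, which is where a model that is too small or not trained on code could hurt the results.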