Fix termux/android build #336
Conversation
Thank you for this. So, the issue on Android was that no visibility was specified for the iqk functions; Android apparently uses hidden visibility by default, so the linker does not find them. I guess we need an …
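For illustration, here is a minimal sketch of what an explicit default-visibility annotation could look like. The IQK_API macro and iqk_example_kernel are made-up names for this example, not the actual change in this PR:

```cpp
// Hypothetical sketch: export entry points explicitly so they stay visible
// even when the toolchain (e.g. Android/termux clang) defaults to hidden
// symbol visibility. The macro name IQK_API is made up for this example.
#if defined(__GNUC__) || defined(__clang__)
#define IQK_API __attribute__((visibility("default")))
#else
#define IQK_API
#endif

// A declaration annotated this way is kept in the dynamic symbol table,
// so the linker can resolve it from other objects and libraries.
extern "C" IQK_API void iqk_example_kernel(const float * x, float * y, int n);
```

An alternative with a similar effect would be adjusting the visibility flags at build time instead of annotating individual declarations.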
It would be interesting to benchmark it, but I can't since my phone doesn't support IQK. My main motivation was thinking about doing a release (I can build on Windows, Linux, and now Android, but I haven't done many non-native builds, and don't have access to a Mac).
Yes, that and the definition fix for the iqk_flash_attn_noalibi.
That should work.
"Attempt fix 3" was my last try at that, I couldn't get it to work. |
Cleaned it up using an …
I wonder if something else apart from the dot product is needed to have the iqk functions work on your phone. I see that I have consistently used ggml_vdotq_s32, where ggml provides an implementation when __ARM_FEATURE_DOTPROD is not available. The one known missing ingredient without __ARM_FEATURE_DOTPROD is vdotq_laneq_s32. But is there something else missing? If vdotq_laneq_s32 was the only missing thing, one could add an implementation, and then one would be able to use the iqk stuff on generic __aarch64__. I don't have an Android phone myself, so was never compelled to try.
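For what it's worth, a rough sketch of what such a fallback for vdotq_laneq_s32 could look like on generic __aarch64__ without __ARM_FEATURE_DOTPROD. This is untested, the helper name is made up, and the lane index has to be a compile-time constant, hence the template parameter:

```cpp
#include <arm_neon.h>

// Hypothetical fallback for vdotq_laneq_s32(acc, a, b, lane) using widening
// multiplies, for AArch64 CPUs without the dot-product extension.
template <int LANE>
static inline int32x4_t iqk_vdotq_laneq_s32_fallback(int32x4_t acc, int8x16_t a, int8x16_t b) {
    // Broadcast the selected 32-bit lane of b (a group of four int8 values)
    // across the whole register, matching what the real intrinsic dots against.
    int8x16_t bl = vreinterpretq_s8_s32(vdupq_laneq_s32(vreinterpretq_s32_s8(b), LANE));
    // Widening 8x8 -> 16-bit multiplies of the low and high halves.
    int16x8_t p_lo = vmull_s8(vget_low_s8(a),  vget_low_s8(bl));
    int16x8_t p_hi = vmull_s8(vget_high_s8(a), vget_high_s8(bl));
    // Two pairwise additions collapse the 16 products into the four
    // 4-byte-group sums that vdotq_laneq_s32 would produce.
    int32x4_t sums = vpaddq_s32(vpaddlq_s16(p_lo), vpaddlq_s16(p_hi));
    return vaddq_s32(acc, sums);
}
```

Whether this alone would be enough, or whether other DOTPROD-only pieces are needed in the iqk kernels, is exactly the open question above.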
I do have an Android device, but I don't plan on using ik_llama on it; the limited RAM and slow CPU/GPU make it not worthwhile for me. I made the two suggested changes, and it compiles.
So now we need to find someone with a modern phone willing to test. I would be really curious to compare the performance to Vulkan. The GPUs on many of the phones are quite underpowered, and the …
I should be able to get temporary access to a modern phone. I want to test the new Bitnet model (that needs to be ported) as that does seem like a really good fit for mobile use, and also a really good showcase of ik_llama.cpp.
Yes, Vulkan and this OpenCL backend, which was introduced after this repo forked (this repo is actually in an awkward middle where it has neither the old nor the new OpenCL). Do you have a model/quant in mind you would want run across the 3 backends?
Including Android? Then something small like LLaMA-3B using …
I had a little bit of time with a Galaxy S22 (1×3.00 GHz Cortex-X2 & 3×2.40 GHz Cortex-A710 & 4×1.70 GHz Cortex-A510).
Flash attention did worse:
I'll be able to test more with it again later.
I was able to test a bit more, and it turns out the results I got above are meaningless, as the model returns gibberish. I have to build with the arch flags set manually (armv9 caused illegal instructions even though this device supports it, but …). The best result I was able to get was with 4 threads and FA off, but I haven't managed to get another result close to it (even with those same settings for FA and thread count).
Do you know how …?
I don't, and I really want to, but until I find a way to get more consistent performance numbers on the device, I'm not sure any meaningful comparisons can be made. The issue seems like a mix of the system scheduler, thermal throttling, and core assignment (and there might be more issues still). Using taskset does seem to help with core assignment, but results still fluctuate an incredible amount. I wanted to provide the flash attention numbers as well, but I'm not sure if I just can't get a good run, or if flash attention really is worse on this device.
So, my Arm optimizations are totally based on the M2 chip. Your results and what was reported in #345 may indicate that they are not really optimal for lower-end Arm processors. For instance, I often use more vector registers than are available. On the M2-Max this register spillage is better (faster) than not using all vector registers, but the lower-end chips may not handle this very well (common wisdom is that one should avoid register spillage). Or perhaps the compiler is not producing optimum code. Have you tried a different compiler?

I guess, if I want to become serious about supporting mobile devices, I should get myself a Raspberry Pi to play with. Or perhaps the Rock 5b board. I haven't done any experiments on that sort of CPU for a long time. But I think around 2016 or so I did experiment with a bunch of heavy-duty number-crunching algorithms on my Android phone at the time (don't remember what the CPU was). It was actually quite impressive, being only about 3 times slower than my desktop PC at the time, but only for a short period. After a minute or two, performance would totally disintegrate, and would not come back without a reboot even after long periods of letting the phone sit idle. That is now almost 10 years ago and mobile phone CPUs have improved a lot since then, but I'm not surprised you are observing issues with sustaining performance over longer periods.
Interesting.
I have only tried clang on this device (and I'm still not sure why the …).
The Raspberry Pi 5 has 4×2.40 GHz Cortex-A76 cores, which are far worse than the 1×3.00 GHz Cortex-X2 & 3×2.40 GHz Cortex-A710 (+ ...) of the phone I am using. The Apple cores, though, are definitely nicer (but they take up a lot more die area).
It really is impressive how much compute mobile devices have.
If it were just throttling, that would make it easy, but the fast run I posted wasn't even the first full run, and the phone was already noticeably warm by that point. The SoC in that phone is notorious for throttling though, so that probably played a part.
@ikawrakow
Sorry this is a mess, but this does get it to build now on my Android device, where I was able to replicate the compile error (my device does not support __ARM_FEATURE_DOTPROD, so even though it now builds, it does not use the IQK stuff; I may be able to confirm it works later on a device that does support dotprod).
I did catch the additional issue of the changed iqk_flash_attn_noalibi definition in the case where you're building this repo and IQK_IMPLEMENT is not defined because the device doesn't support dotprod.
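To make this concrete, the shape of the issue is roughly as below. The argument list shown is a placeholder, not the real signature (which lives in the repo's headers), and the bool "handled it" return value is an assumption based on how the other iqk entry points behave:

```cpp
// Illustration only: IQK_FA_ARGS_EXAMPLE stands in for the real parameter list
// of iqk_flash_attn_noalibi. The stub compiled when IQK_IMPLEMENT is not
// defined must use exactly the same argument list as the declaration,
// otherwise builds without dotprod support fail.
#define IQK_FA_ARGS_EXAMPLE int example_arg   // placeholder, not the real arguments

#ifdef IQK_IMPLEMENT
bool iqk_flash_attn_noalibi(IQK_FA_ARGS_EXAMPLE) {
    // the real fast flash-attention path lives here
    return true;
}
#else
bool iqk_flash_attn_noalibi(IQK_FA_ARGS_EXAMPLE) {
    return false;  // fast path compiled out; caller falls back to the generic implementation
}
#endif
```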
Fixes #159