
Conversation

saood06
Collaborator

@saood06 saood06 commented Apr 20, 2025

@ikawrakow

Sorry this is a mess, but it now builds on my Android device, where I was able to reproduce the compile error. My device does not support __ARM_FEATURE_DOTPROD, so even though it builds, it does not use the IQK code paths; I may be able to confirm it works later on a device that does support dotprod.

I also caught an additional issue: the changed iqk_flash_attn_noalibi definition in the case where you're building this repo and IQK_IMPLEMENT is not defined (my situation, since my device doesn't support dotprod).

Fixes #159

@ikawrakow
Owner

Thank you for this.

So, the issue on Android was that no visibility was specified for the iqk functions; Android apparently uses hidden visibility by default, so the linker does not find the iqk functions.

I guess we need an IQK_API macro similar to GGML_API. Or one can just reuse GGML_API as the iqk stuff gets built as part of the ggml library.

@saood06
Collaborator Author

saood06 commented Apr 20, 2025

> Thank you for this.

It would be interesting to benchmark it, but I can't since my phone doesn't support IQK. My main motivation was thinking about doing a release (I can build on Windows, Linux, and now Android, but I haven't done many non-native builds and don't have access to a Mac).

> So, the issue on Android was that no visibility was specified for the iqk functions; Android apparently uses hidden visibility by default, so the linker does not find the iqk functions.

Yes, that and the definition fix for the iqk_flash_attn_noalibi.

> I guess we need an IQK_API macro similar to GGML_API.

That should work.

> Or one can just reuse GGML_API as the iqk stuff gets built as part of the ggml library.

"Attempt fix 3" was my last try at that; I couldn't get it to work.

@saood06 saood06 marked this pull request as ready for review April 21, 2025 03:37
@saood06 saood06 requested a review from ikawrakow April 21, 2025 03:37
@saood06
Collaborator Author

saood06 commented Apr 21, 2025

Cleaned it up using an IQK_API macro.

Owner

@ikawrakow ikawrakow left a comment


I wonder if something else apart from the dot product is needed for the iqk functions to work on your phone. I see that I have consistently used ggml_vdotq_s32, where ggml provides an implementation when __ARM_FEATURE_DOTPROD is not available. The one known missing ingredient without __ARM_FEATURE_DOTPROD is vdotq_laneq_s32. But is there something else missing? If vdotq_laneq_s32 were the only missing piece, one could add an implementation, and then the iqk stuff could be used on generic __aarch64__. I don't have an Android phone myself, so was never compelled to try.

@ikawrakow ikawrakow merged commit 93cd77b into main Apr 21, 2025
@saood06
Collaborator Author

saood06 commented Apr 21, 2025

> I don't have an Android phone myself, so was never compelled to try.

I do have an Android device, but I don't plan on using ik_llama on it; the limited RAM and slow CPU/GPU make it not worthwhile for me.

I made the two suggested changes, and it compiles.

@ikawrakow
Owner

So now we need to find someone with a modern phone willing to test. I would be really curious to compare the performance to Vulkan. The GPUs on many of the phones are quite underpowered, and the llama.cpp Vulkan implementation is not particularly performant (although it seems to have been improving lately), so now that it builds on Android, running ik_llama.cpp on the CPU is possibly a viable alternative to Vulkan.

@saood06
Collaborator Author

saood06 commented Apr 21, 2025

> So now we need to find someone with a modern phone willing to test.

I should be able to get temporary access to a modern phone. I want to test the new BitNet model (which still needs to be ported), as that seems like a really good fit for mobile use, and also a really good showcase of ik_llama.cpp.

> I would be really curious to compare the performance to Vulkan. The GPUs on many of the phones are quite underpowered, and the llama.cpp Vulkan implementation is not particularly performant (although it seems to have been improving lately), so now that it builds on Android, running ik_llama.cpp on the CPU is possibly a viable alternative to Vulkan.

Yes, Vulkan and this OpenCL backend, which was introduced after this repo forked (this repo is actually in an awkward middle where it has neither the old nor the new OpenCL).

Do you have a model/quant in mind you would want run across the 3 backends?

@ikawrakow
Owner

> Do you have a model/quant in mind you would want run across the 3 backends?

Including Android? Then something small like LLaMA-3B using IQ4_XS or IQ4_KS. Bitnet would be good too.

@saood06
Collaborator Author

saood06 commented Apr 24, 2025

I had a little bit of time with a Galaxy S22 (1×3.00 GHz Cortex-X2 & 3×2.40 GHz Cortex-A710 & 4×1.70 GHz Cortex-A510).

~/ik_llama.cpp/build1 $ bin/llama-sweep-bench -m ~/ggml-model-iq2_bn_r4.gguf -t 4

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 512 | 128 | 0 | 6.081 | 84.20 | 3.537 | 36.18 |
| 512 | 128 | 512 | 8.509 | 60.17 | 4.594 | 27.86 |
| 512 | 128 | 1024 | 12.571 | 40.73 | 5.461 | 23.44 |
| 512 | 128 | 1536 | 16.879 | 30.33 | 6.582 | 19.45 |
| 512 | 128 | 2048 | 20.344 | 25.17 | 7.640 | 16.75 |
| 512 | 128 | 2560 | 29.417 | 17.40 | 10.138 | 12.63 |
| 512 | 128 | 3072 | 34.477 | 14.85 | 11.348 | 11.28 |
| 512 | 128 | 3584 | 38.911 | 13.16 | 12.595 | 10.16 |

Flash attention did worse:
~/ik_llama.cpp/build1 $ bin/llama-sweep-bench -m ~/ggml-model-iq2_bn_r4.gguf -fa -t 4

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 512 | 128 | 0 | 9.496 | 53.92 | 3.954 | 32.38 |
| 512 | 128 | 512 | 19.082 | 26.83 | 7.029 | 18.21 |
| 512 | 128 | 1024 | 27.123 | 18.88 | 10.393 | 12.32 |
| 512 | 128 | 1536 | 32.178 | 15.91 | 14.209 | 9.01 |
| 512 | 128 | 2048 | 40.818 | 12.54 | 16.617 | 7.70 |
| 512 | 128 | 2560 | 48.743 | 10.50 | 20.061 | 6.38 |
| 512 | 128 | 3072 | 55.976 | 9.15 | 25.354 | 5.05 |
| 512 | 128 | 3584 | 76.750 | 6.67 | 27.247 | 4.70 |

I'll be able to test more with it again later.

@saood06
Collaborator Author

saood06 commented Apr 30, 2025

I was able to test a bit more, and it turns out the results I got above are meaningless, as the model returned gibberish. I have to build with the arch flags set manually (armv9 caused illegal instructions even though this device supports it, but armv8.2-a+dotprod+fp16 worked). The new build was verified with the test prompt in the CLI, returning coherent results (and the much longer compile time showed that iqk_mul_mat.cpp was actually being compiled), but performance numbers were wildly inconsistent between runs. Even using taskset to try to pin the work to the performance cores helped a bit, but results were still very inconsistent.
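The build and run described above might look something like this. This is a sketch: the -march string is the one reported to work, but the build directory layout and the big-core mask (4-7) are assumptions for a Galaxy S22-class SoC:

```shell
# Configure with the arch flags set manually (clang, armv8.2-a+dotprod+fp16).
cmake -B build \
  -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
  -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+fp16" \
  -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+fp16"
cmake --build build --config Release -j

# Pin to the performance cores (mask is an assumption for this SoC) to
# reduce, but not eliminate, run-to-run noise.
taskset -c 4-7 build/bin/llama-sweep-bench -m ~/ggml-model-iq2_bn_r4.gguf -t 4
```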

Best result I was able to get was with 4 threads and FA off, but I haven't managed to get another result close to it (even with those same FA and thread settings).

bin/llama-sweep-bench -m ~/ggml-model-iq2_bn_r4.gguf -t 4

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 512 | 128 | 0 | 10.261 | 49.90 | 5.130 | 24.95 |
| 512 | 128 | 512 | 11.840 | 43.24 | 6.445 | 19.86 |
| 512 | 128 | 1024 | 16.336 | 31.34 | 6.925 | 18.48 |
| 512 | 128 | 1536 | 13.914 | 36.80 | 7.685 | 16.66 |
| 512 | 128 | 2048 | 14.825 | 34.54 | 8.168 | 15.67 |
| 512 | 128 | 2560 | 17.940 | 28.54 | 8.694 | 14.72 |
| 512 | 128 | 3072 | 19.040 | 26.89 | 8.911 | 14.36 |
| 512 | 128 | 3584 | 20.549 | 24.92 | 9.319 | 13.74 |

@ikawrakow
Owner

Do you know how BitNet.cpp does on this device?

@saood06
Collaborator Author

saood06 commented Apr 30, 2025

> Do you know how BitNet.cpp does on this device?

I don't, and I really want to, but until I find a way to get more consistent performance numbers on the device, I'm not sure any meaningful comparison can be made. The issue seems to be a mix of the system scheduler, thermal throttling, and core assignment (and there might be more). Using taskset does help with the core assignment issue, but results still fluctuate an incredible amount.

I wanted to provide the flash attention numbers as well, but I'm not sure if I just can't get a good run, or if flash attention is worse on this device.

@ikawrakow
Owner

So, my Arm optimizations are totally based on the M2 chip. Your results, and what was reported in #345, may indicate that they are not really optimal for lower-end Arm processors. For instance, I often use more vector registers than available. On the M2-Max this register spillage is better (faster) than not using all vector registers. But the lower-end chips may not handle this very well (common wisdom is that one should avoid register spillage). Or perhaps the compiler is not producing optimum code. Have you tried clang (which is what I use for the M2)?

I guess, if I want to become serious with supporting mobile devices, I should get myself a Raspberry Pi to play with. Or perhaps the Rock 5b board.

I haven't done any experiments on that sort of CPU for a long time. But I think around 2016 or so I did experiment with a bunch of heavy duty number crunching algorithms on my Android phone at the time (don't remember what the CPU was). It was actually quite impressive, being only about 3 times slower than my desktop PC at the time. But only for a short period of time. After a minute or two, performance would totally disintegrate, and would not come back without a reboot even after long periods of letting the phone sit idle. This is now almost 10 years ago and mobile phone CPUs have improved a lot since then, but I'm not surprised you are observing issues with performance sustaining over longer periods.

@saood06
Collaborator Author

saood06 commented Apr 30, 2025

> For instance, I often use more vector registers than available. On the M2-Max this register spillage is better (faster) than not using all vector registers. But the lower-end chips may not handle this very well (common wisdom is that one should avoid register spillage).

Interesting.

> Or perhaps the compiler is not producing optimum code. Have you tried clang (which is what I use for the M2)?

I have only tried clang on this device (and I'm still not sure why the armv9-a build gives an illegal instruction even though my CPU supports that instruction set).

> I guess, if I want to become serious with supporting mobile devices, I should get myself a Raspberry Pi to play with. Or perhaps the Rock 5b board.

The Raspberry Pi 5 has a 4×2.40GHz Cortex-A76, which is far worse than the (1×3.00 GHz Cortex-X2 & 3×2.40 GHz Cortex-A710 + ...) of the phone I am using. The Apple cores though are definitely nicer (but they take up a lot more die area).

> I haven't done any experiments on that sort of CPU for a long time. But I think around 2016 or so I did experiment with a bunch of heavy duty number crunching algorithms on my Android phone at the time (don't remember what the CPU was). It was actually quite impressive, being only about 3 times slower than my desktop PC at the time.

It really is impressive how much compute mobile devices have.

> But only for a short period of time. After a minute or two, performance would totally disintegrate, and would not come back without a reboot even after long periods of letting the phone sit idle. This is now almost 10 years ago and mobile phone CPUs have improved a lot since then, but I'm not surprised you are observing issues with performance sustaining over longer periods.

If it were just throttling, that would make it easy, but the fast run I posted wasn't even the first full run, and the phone was already noticeably warm by that point. The SoC in that phone is notorious for throttling though, so that probably played a part.
