-
Notifications
You must be signed in to change notification settings - Fork 3.5k
Description
Please confirm the following
- I agree to follow this project's code of conduct.
- I have checked the current issues for duplicates.
- I understand that AWX is open source software provided for free and that I might not receive a timely response.
- I am NOT reporting a (potential) security vulnerability. (These should be emailed to
[email protected]
instead.)
Bug Summary
Our environment :
- 1 awx master on k3s with 2 external workers (using receptor)
- projects are hosted on enterprise github
- SAML auth
- Upgraded from 21.10.0 to 22.1.0 (~mid june)
- Control node EE untouched (only custom workers EE for playbooks)
- No limit added to pods
We have recently issues on AWX. It seems random (maybe related with some sort of accumulation, or burst, or else).
The web interface isn't working correctly (slow, some errors) and any project sync fails.
Without the root cause, the only thing we can do is scale down web/task deployments so the operator restart them. Everything is ok after that.
We checked server's system (cpu/mem/disk/net/etc..) and logs without any luck.
No specific action on the plateform.
No high number of jobs running.
AWX version
22.1.0
Select the relevant components
- UI
- UI (tech preview)
- API
- Docs
- Collection
- CLI
- Other
Installation method
kubernetes
Modifications
no
Ansible version
2.14
Operating system
RHEL 8.6
Web browser
No response
Steps to reproduce
Didn't find a way to reproduce at will.
Expected results
The interface is smooth.
Project sync works.
Actual results
Interface is slow to show and all project sync fails (so no job with outdated project cache can be executed)
Additional information
• Example of project sync error
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 604, in run
res = receptor_job.run()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 317, in run
res = self._run_internal(receptor_ctl)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 392, in _run_internal
res = processor_future.result()
File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 446, in result
return self.__get_result()
File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/usr/lib64/python3.9/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/utils/common.py", line 1188, in wrapper_cleanup_new_process
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 460, in processor
return ansible_runner.interface.run(
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/interface.py", line 211, in run
r.run()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 355, in run
self.event_callback(data)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 310, in event_callback
should_write = self.event_handler(event_data)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/callback.py", line 222, in event_handler
super_return_value = super(RunnerCallbackForProjectUpdate, self).event_handler(event_data)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/callback.py", line 166, in event_handler
self.dispatcher.dispatch(event_data)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/queue.py", line 32, in dispatch
self.connection.rpush(self.queue, json.dumps(obj, cls=AnsibleJSONEncoder))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/commands/core.py", line 2695, in rpush
return self.execute_command("RPUSH", name, *values)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/client.py", line 1238, in execute_command
return conn.retry.call_with_retry(
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/retry.py", line 49, in call_with_retry
fail(error)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/client.py", line 1242, in <lambda>
lambda error: self._disconnect_raise(conn, error),
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/client.py", line 1228, in _disconnect_raise
raise error
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/retry.py", line 46, in call_with_retry
return do()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/client.py", line 1239, in <lambda>
lambda: self._send_command_parse_response(
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/client.py", line 1215, in _send_command_parse_response
return self.parse_response(conn, command_name, **options)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/client.py", line 1254, in parse_response
response = connection.read_response()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/connection.py", line 824, in read_response
response = self._parser.read_response(disable_decoding=disable_decoding)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/connection.py", line 487, in read_response
raise response
redis.exceptions.ConnectionError: max number of clients reached
Checked redis configuration (~20 clients connected for 10K limit).
With AWX metrics, I found some differences but I can't properly interpret them (I guess some kind of burst but I don't really know).
• Metrics near platform issues
# HELP awx_database_connections_total Number of connections to database
# TYPE awx_database_connections_total gauge
awx_database_connections_total 15.0
# HELP callback_receiver_events_queue_size_redis Current number of events in redis queue
# TYPE callback_receiver_events_queue_size_redis gauge
callback_receiver_events_queue_size_redis{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0
# HELP callback_receiver_events_popped_redis Number of events popped from redis
# TYPE callback_receiver_events_popped_redis gauge
callback_receiver_events_popped_redis{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 241058
# HELP callback_receiver_events_in_memory Current number of events in memory (in transfer from redis to db)
# TYPE callback_receiver_events_in_memory gauge
callback_receiver_events_in_memory{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0
# HELP callback_receiver_batch_events_errors Number of times batch insertion failed
# TYPE callback_receiver_batch_events_errors gauge
callback_receiver_batch_events_errors{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0
# HELP callback_receiver_events_insert_db_seconds Total time spent saving events to database
# TYPE callback_receiver_events_insert_db_seconds gauge
callback_receiver_events_insert_db_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 6633.5608182273245
# HELP callback_receiver_events_insert_db Number of events batch inserted into database
# TYPE callback_receiver_events_insert_db gauge
callback_receiver_events_insert_db{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 236855
# HELP callback_receiver_events_broadcast Number of events broadcast to other control plane nodes
# TYPE callback_receiver_events_broadcast gauge
callback_receiver_events_broadcast{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 175081
# HELP callback_receiver_batch_events_insert_db Number of events batch inserted into database
# TYPE callback_receiver_batch_events_insert_db histogram
callback_receiver_batch_events_insert_db_bucket{le="10",node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 122780
callback_receiver_batch_events_insert_db_bucket{le="50",node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 124498
callback_receiver_batch_events_insert_db_bucket{le="150",node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 124528
callback_receiver_batch_events_insert_db_bucket{le="350",node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 124528
callback_receiver_batch_events_insert_db_bucket{le="650",node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 124528
callback_receiver_batch_events_insert_db_bucket{le="2000",node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 124528
callback_receiver_batch_events_insert_db_bucket{le="+Inf",node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 124528
callback_receiver_batch_events_insert_db_count{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 124528
callback_receiver_batch_events_insert_db_sum{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 236855
# HELP callback_receiver_event_processing_avg_seconds Average processing time per event per callback receiver batch
# TYPE callback_receiver_event_processing_avg_seconds gauge
callback_receiver_event_processing_avg_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 1.6913103333333332
# HELP subsystem_metrics_pipe_execute_seconds Time spent saving metrics to redis
# TYPE subsystem_metrics_pipe_execute_seconds gauge
subsystem_metrics_pipe_execute_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 118.10413664561273
# HELP subsystem_metrics_pipe_execute_calls Number of calls to pipe_execute
# TYPE subsystem_metrics_pipe_execute_calls gauge
subsystem_metrics_pipe_execute_calls{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 71599
# HELP subsystem_metrics_send_metrics_seconds Time spent sending metrics to other nodes
# TYPE subsystem_metrics_send_metrics_seconds gauge
subsystem_metrics_send_metrics_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 355.07566434563284
# HELP task_manager_get_tasks_seconds Time spent in loading tasks from db
# TYPE task_manager_get_tasks_seconds gauge
task_manager_get_tasks_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0.005228978581726551
# HELP task_manager_start_task_seconds Time spent starting task
# TYPE task_manager_start_task_seconds gauge
task_manager_start_task_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0.0
# HELP task_manager_process_running_tasks_seconds Time spent processing running tasks
# TYPE task_manager_process_running_tasks_seconds gauge
task_manager_process_running_tasks_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0.0
# HELP task_manager_process_pending_tasks_seconds Time spent processing pending tasks
# TYPE task_manager_process_pending_tasks_seconds gauge
task_manager_process_pending_tasks_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0.0
# HELP task_manager__schedule_seconds Time spent in running the entire _schedule
# TYPE task_manager__schedule_seconds gauge
task_manager__schedule_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0.016287230886518955
# HELP task_manager__schedule_calls Number of calls to _schedule, after lock is acquired
# TYPE task_manager__schedule_calls gauge
task_manager__schedule_calls{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 254603
# HELP task_manager_recorded_timestamp Unix timestamp when metrics were last recorded
# TYPE task_manager_recorded_timestamp gauge
task_manager_recorded_timestamp{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 1688769903.3931773
# HELP task_manager_tasks_started Number of tasks started
# TYPE task_manager_tasks_started gauge
task_manager_tasks_started{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0
# HELP task_manager_running_processed Number of running tasks processed
# TYPE task_manager_running_processed gauge
task_manager_running_processed{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0
# HELP task_manager_pending_processed Number of pending tasks processed
# TYPE task_manager_pending_processed gauge
task_manager_pending_processed{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0
# HELP task_manager_tasks_blocked Number of tasks blocked from running
# TYPE task_manager_tasks_blocked gauge
task_manager_tasks_blocked{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0
# HELP task_manager_commit_seconds Time spent in db transaction, including on_commit calls
# TYPE task_manager_commit_seconds gauge
task_manager_commit_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0.0002639293670654297
# HELP dependency_manager_get_tasks_seconds Time spent loading pending tasks from db
# TYPE dependency_manager_get_tasks_seconds gauge
dependency_manager_get_tasks_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0.007382605224847794
• Metrics with healthy platform
# HELP awx_database_connections_total Number of connections to database
# TYPE awx_database_connections_total gauge
awx_database_connections_total 15.0
# HELP callback_receiver_events_queue_size_redis Current number of events in redis queue
# TYPE callback_receiver_events_queue_size_redis gauge
callback_receiver_events_queue_size_redis{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0
# HELP callback_receiver_events_popped_redis Number of events popped from redis
# TYPE callback_receiver_events_popped_redis gauge
callback_receiver_events_popped_redis{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 4836
# HELP callback_receiver_events_in_memory Current number of events in memory (in transfer from redis to db)
# TYPE callback_receiver_events_in_memory gauge
callback_receiver_events_in_memory{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0
# HELP callback_receiver_batch_events_errors Number of times batch insertion failed
# TYPE callback_receiver_batch_events_errors gauge
callback_receiver_batch_events_errors{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0
# HELP callback_receiver_events_insert_db_seconds Total time spent saving events to database
# TYPE callback_receiver_events_insert_db_seconds gauge
callback_receiver_events_insert_db_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 107.60422206111252
# HELP callback_receiver_events_insert_db Number of events batch inserted into database
# TYPE callback_receiver_events_insert_db gauge
callback_receiver_events_insert_db{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 4779
# HELP callback_receiver_events_broadcast Number of events broadcast to other control plane nodes
# TYPE callback_receiver_events_broadcast gauge
callback_receiver_events_broadcast{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 3373
# HELP callback_receiver_batch_events_insert_db Number of events batch inserted into database
# TYPE callback_receiver_batch_events_insert_db histogram
callback_receiver_batch_events_insert_db_bucket{le="10",node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2962
callback_receiver_batch_events_insert_db_bucket{le="50",node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2990
callback_receiver_batch_events_insert_db_bucket{le="150",node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2990
callback_receiver_batch_events_insert_db_bucket{le="350",node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2990
callback_receiver_batch_events_insert_db_bucket{le="650",node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2990
callback_receiver_batch_events_insert_db_bucket{le="2000",node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2990
callback_receiver_batch_events_insert_db_bucket{le="+Inf",node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2990
callback_receiver_batch_events_insert_db_count{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2990
callback_receiver_batch_events_insert_db_sum{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 4779
# HELP callback_receiver_event_processing_avg_seconds Average processing time per event per callback receiver batch
# TYPE callback_receiver_event_processing_avg_seconds gauge
callback_receiver_event_processing_avg_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 1.7844824285714285
# HELP subsystem_metrics_pipe_execute_seconds Time spent saving metrics to redis
# TYPE subsystem_metrics_pipe_execute_seconds gauge
subsystem_metrics_pipe_execute_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2.358793856576085
# HELP subsystem_metrics_pipe_execute_calls Number of calls to pipe_execute
# TYPE subsystem_metrics_pipe_execute_calls gauge
subsystem_metrics_pipe_execute_calls{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 1727
# HELP subsystem_metrics_send_metrics_seconds Time spent sending metrics to other nodes
# TYPE subsystem_metrics_send_metrics_seconds gauge
subsystem_metrics_send_metrics_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 5.858100923709571
# HELP task_manager_get_tasks_seconds Time spent in loading tasks from db
# TYPE task_manager_get_tasks_seconds gauge
task_manager_get_tasks_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0.005362008698284626
# HELP task_manager_start_task_seconds Time spent starting task
# TYPE task_manager_start_task_seconds gauge
task_manager_start_task_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0.0
# HELP task_manager_process_running_tasks_seconds Time spent processing running tasks
# TYPE task_manager_process_running_tasks_seconds gauge
task_manager_process_running_tasks_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0.0
# HELP task_manager_process_pending_tasks_seconds Time spent processing pending tasks
# TYPE task_manager_process_pending_tasks_seconds gauge
task_manager_process_pending_tasks_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0.0
# HELP task_manager__schedule_seconds Time spent in running the entire _schedule
# TYPE task_manager__schedule_seconds gauge
task_manager__schedule_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0.019680513069033623
# HELP task_manager__schedule_calls Number of calls to _schedule, after lock is acquired
# TYPE task_manager__schedule_calls gauge
task_manager__schedule_calls{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 355
# HELP task_manager_recorded_timestamp Unix timestamp when metrics were last recorded
# TYPE task_manager_recorded_timestamp gauge
task_manager_recorded_timestamp{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 1689609843.2095385
# HELP task_manager_tasks_started Number of tasks started
# TYPE task_manager_tasks_started gauge
task_manager_tasks_started{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0
# HELP task_manager_running_processed Number of running tasks processed
# TYPE task_manager_running_processed gauge
task_manager_running_processed{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0
# HELP task_manager_pending_processed Number of pending tasks processed
# TYPE task_manager_pending_processed gauge
task_manager_pending_processed{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0
# HELP task_manager_tasks_blocked Number of tasks blocked from running
# TYPE task_manager_tasks_blocked gauge
task_manager_tasks_blocked{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0
# HELP task_manager_commit_seconds Time spent in db transaction, including on_commit calls
# TYPE task_manager_commit_seconds gauge
task_manager_commit_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0.0001938343048095703
# HELP dependency_manager_get_tasks_seconds Time spent loading pending tasks from db
# TYPE dependency_manager_get_tasks_seconds gauge
dependency_manager_get_tasks_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0.011181914247572422