October 16, 2019 at 4:21 am #26875
The monitoring-prometheus-operator and alertmanager pods both get stuck in crash loops when following the AWS EKS Cookbook using forgeops 6.5.2 code and kube 1.11.
I discovered that the alertmanager crashloop is caused by a memory limitation in v0.20.0 of the prometheus-operator image. Upgrading to v0.33.0 resolves the alertmanager crash loop as the newer images allocate more memory.
However, the prometheus-operator is still stuck in a crash loop.
The relevant error appears to be:
ts=2019-10-16T02:01:55.075633201Z caller=main.go:303 msg="Unhandled error received. Exiting..." err="creating CRDs failed: waiting for PodMonitor crd failed: timed out waiting for Custom Resource: failed to list CRD: podmonitors.monitoring.coreos.com is forbidden: User \"system:serviceaccount:monitoring:monitoring-prometheus-operator\" cannot list podmonitors.monitoring.coreos.com at the cluster scope"
But I can’t square it with what’s in forgeops.
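For anyone triaging the same message, the denial can be pulled apart to see exactly which service account, verb, and resource RBAC is rejecting. The following is only a minimal sketch: the regex and the helper names `parse_forbidden` and `can_i_command` are illustrative assumptions, not part of forgeops or the Prometheus Operator. It prints a `kubectl auth can-i` check (run with `--as` impersonation) that should answer "no" while the operator's ClusterRole lacks access to podmonitors.

```python
import re

# Tail of the log message quoted above, with the escaped quotes unescaped.
ERR = (
    'failed to list CRD: podmonitors.monitoring.coreos.com is forbidden: '
    'User "system:serviceaccount:monitoring:monitoring-prometheus-operator" '
    'cannot list podmonitors.monitoring.coreos.com at the cluster scope'
)

# Assumed shape of a cluster-scope RBAC denial, based only on the message above.
FORBIDDEN = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\w+) (?P<resource>\S+) '
    r'at the cluster scope'
)


def parse_forbidden(message):
    """Return the user, verb and resource from a denial, or None if no match."""
    match = FORBIDDEN.search(message)
    return match.groupdict() if match else None


def can_i_command(denial):
    """Build a kubectl command that re-checks the denied permission."""
    return "kubectl auth can-i {verb} {resource} --as={user}".format(**denial)


denial = parse_forbidden(ERR)
print(can_i_command(denial))
```

Here the denied resource is `podmonitors.monitoring.coreos.com`, so the operator's service account has no `list` permission on it at cluster scope.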
What is the best way to resolve this error?
October 21, 2019 at 5:15 pm #26933
I’ve tested Prometheus myself on EKS. I saw the alertmanager issue that was mentioned and resolved it by updating the Prometheus Operator version as stated, but I cannot reproduce the Prometheus Operator crash loop on either 0.20 or 0.33. I have never seen this before. Can you please confirm the exact steps carried out to deploy the Prometheus Operator, and which k8s cluster version was used?
Lee
October 22, 2019 at 2:11 pm #26937
I have reproduced this issue now. The Prometheus Operator Helm chart has moved to a different Helm repo and although we fixed versions, it looks like it’s not working. I’m currently in the process of updating it now in master, then in 6.5.2 branch.
October 22, 2019 at 9:40 pm #26942
Thanks Lee. I will watch for the change and test it out.
October 30, 2019 at 2:40 pm #26977
@lee-baines Can you please point me at the relevant commit that fixes this?
Thanks!
November 1, 2019 at 2:49 pm #26987
Hi, the whole Prometheus Helm chart was out of date, so I’ve completely revamped it to use the most recent Prometheus Operator deployment. I’ll post here once this fix is merged.
Lee
November 5, 2019 at 2:38 pm #26996
The Prometheus fix has now been applied to the release/6.5.2 branch, so if you do a git pull, you should get a working, up-to-date version of Prometheus.
November 6, 2019 at 8:34 pm #27040