monitoring-prometheus-operator crash loop

This topic contains 7 replies, has 2 voices, and was last updated 2 weeks, 1 day ago.

    #26875

    The monitoring-prometheus-operator and alertmanager pods both get stuck in crash loops when following the AWS EKS Cookbook with the forgeops 6.5.2 code and Kubernetes 1.11.

    I discovered that the alertmanager crash loop is caused by a memory limitation in v0.20.0 of the prometheus-operator image. Upgrading to v0.33.0 resolves it, as the newer images allocate more memory.

    However, the prometheus-operator is still stuck in a crash loop.
    The relevant error appears to be:

    ts=2019-10-16T02:01:55.075633201Z caller=main.go:303 msg="Unhandled error received. Exiting..." err="creating CRDs failed: waiting for PodMonitor crd failed: timed out waiting for Custom Resource: failed to list CRD: podmonitors.monitoring.coreos.com is forbidden: User \"system:serviceaccount:monitoring:monitoring-prometheus-operator\" cannot list podmonitors.monitoring.coreos.com at the cluster scope"
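    Reading the message: the operator's service account is not allowed to list PodMonitor objects cluster-wide. As a minimal sketch only (this is not the fix later applied in forgeops), the Python below pulls the service account, verb, and resource out of that message and builds a matching RBAC ClusterRole as JSON. The role name `podmonitor-reader-example` and both helper functions are hypothetical. The sketch grants only the single denied verb, and a ClusterRoleBinding to the service account would still be needed.

```python
import json
import re

# The "forbidden" part of the err="..." payload from the log line above,
# with the backslash escapes removed.
ERR = ('podmonitors.monitoring.coreos.com is forbidden: User '
       '"system:serviceaccount:monitoring:monitoring-prometheus-operator" '
       'cannot list podmonitors.monitoring.coreos.com at the cluster scope')

# Matches a cluster-scope RBAC denial for a service account.
PATTERN = re.compile(
    r'User "system:serviceaccount:(?P<namespace>[^:"]+):(?P<name>[^"]+)" '
    r'cannot (?P<verb>\w+) (?P<resource>[a-z0-9-]+)\.(?P<group>[a-z0-9.-]+) '
    r'at the cluster scope'
)


def parse_forbidden(message):
    """Return namespace, name, verb, resource and API group from the message."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("not a cluster-scope RBAC denial")
    return match.groupdict()


def cluster_role_for(denial, role_name="podmonitor-reader-example"):
    """Build a ClusterRole manifest granting only the denied verb (hypothetical name)."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": role_name},
        "rules": [{
            "apiGroups": [denial["group"]],
            "resources": [denial["resource"]],
            "verbs": [denial["verb"]],
        }],
    }


if __name__ == "__main__":
    denial = parse_forbidden(ERR)
    print(json.dumps(cluster_role_for(denial), indent=2))
```

    The printed JSON is only a starting point for comparing against the RBAC rules that the installed chart actually created.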

    I found this:
    https://github.com/coreos/prometheus-operator/issues/2656

    But I can’t reconcile it with what’s in forgeops.

    What is the best way to resolve this error?

    #26933
     lee.baines 
    Participant

    Hi,

    I’ve tested Prometheus myself on EKS. I saw the alertmanager issue you mentioned and resolved it by updating the Prometheus Operator version as you described, but I cannot reproduce the Prometheus Operator crash loop on either 0.20 or 0.33, and I have never seen it before. Could you confirm the exact steps you used to deploy the Prometheus Operator, and which Kubernetes cluster version you are using?

    regards,
    Lee

    #26937
     lee.baines 
    Participant

    I have now reproduced this issue. The Prometheus Operator Helm chart has moved to a different Helm repo, and although we pinned the versions, it no longer works. I’m updating it in master now, and then in the 6.5.2 branch.

    #26942

    Thanks Lee. I will watch for the change and test it out.

    #26977

    @lee-baines Can you please point me at the relevant commit that fixes this?
    Thanks!

    #26987
     lee.baines 
    Participant

    Hi, the whole Prometheus Helm chart was out of date, so I’ve completely revamped it to use the most recent Prometheus Operator deployment. I’ll post here once the fix is merged.

    regards,
    Lee

    #26996
     lee.baines 
    Participant

    Hi,

    The Prometheus fix has now been applied to the release/6.5.2 branch, so if you do a git pull you should get a working, up-to-date version of Prometheus.

    regards,
    Lee

    #27040

    Thanks!
