Troubleshooting


Expired Certificates

Sometimes the certificates may have expired and cannot renew automatically. You will typically see this problem when you can no longer log in to your CloudCenter Suite cluster and start receiving a networking error or a similar error. If you review the AUTH pod logs, you may see issues with accessing the certificate or its location. If you inspect the certificate with the following command, you will see that it is failing due to an auto-renew setting:

kubectl -n cisco describe cert suite-auth-tls

These kinds of errors are caused by changes in the certificate manager, or by the certificate failing to auto-renew, which leaves the cluster down.

To address this issue, fix your cluster by following this process. Use the following scripts with caution.

These scripts were only tested in a GCP environment where this error was first seen.

  1. Export the current certificates and secrets to YAML files.

    #!/bin/bash
    # Export all secrets and certificates in the CloudCenter Suite
    # namespace to YAML files so that they can be restored later.
    namespace="cisco"
    mkdir -p "$namespace"
    for n in $(kubectl -n "$namespace" get secrets -o custom-columns=:metadata.name | grep -v 'service-account')
    do
            echo "Saving $namespace/secret_$n..."
            kubectl -n "$namespace" get secret "$n" -o yaml > "$namespace/secret_$n.yaml"
    done
    for n in $(kubectl -n "$namespace" get cert -o custom-columns=:metadata.name)
    do
            echo "Saving $namespace/cert_$n..."
            kubectl -n "$namespace" get cert "$n" -o yaml > "$namespace/cert_$n.yaml"
    done
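
    To confirm that the export succeeded, you can list the saved files (an optional check; the cisco directory is created by the script above):

    ls cisco/secret_*.yaml cisco/cert_*.yaml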
  2. Delete the old certificates and secrets.

    #!/bin/bash
    # Delete the old secrets and certificates so that cert-manager can
    # reissue them. Set commit=0 for a dry run that only prints the names.
    commit=1
    for n in $(kubectl -n cisco get secrets -o custom-columns=:metadata.name | grep -v 'service-account')
    do
            echo "Deleting $n..."
            if [[ $commit == 1 ]]; then
                    kubectl -n cisco delete secret "$n"
            fi
    done
    for n in $(kubectl -n cisco get cert -o custom-columns=:metadata.name)
    do
            echo "Deleting $n..."
            if [[ $commit == 1 ]]; then
                    kubectl -n cisco delete cert "$n"
            fi
    done
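
    To confirm the deletion, you can verify that no certificates or non-service-account secrets remain (an optional check):

    kubectl -n cisco get cert
    kubectl -n cisco get secrets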
  3. Restore the certificates from their respective YAML files to the cluster. The following script also restores the opaque secrets and restarts the cert-manager and CloudCenter Suite pods (Steps 4 through 6).

    #!/bin/bash
    # Restore the exported secrets and certificates, then restart the
    # cert-manager and CloudCenter Suite pods so that new TLS secrets are issued.
    namespace=cisco
    echo "Restoring Opaque secrets..."
    kubectl apply -f $namespace/secret_ca-key-pair.yaml
    kubectl apply -f $namespace/secret_suite-fluentd-s3-config.yaml
    kubectl apply -f $namespace/secret_suite-fluentd-s3-config-original.yaml
    kubectl apply -f $namespace/secret_suite-gateway-external-tls-secrets.yaml
    kubectl apply -f $namespace/secret_suite-random-password.yaml
    kubectl apply -f $namespace/secret_suite-image-pull-secret.yaml
    kubectl apply -f $namespace/secret_action-orchestrator-jwt-secret.yaml
    echo "Restoring Certs via YAML"
    # Apply every exported YAML file whose name contains "cert".
    for n in $namespace/*.yaml; do
            [ -f "$n" ] || break
            if [[ $n =~ "cert" ]]; then
                    echo "Restoring Cert via yaml file $n..."
                    kubectl apply -f "$n"
            fi
    done
    echo "Restarting Cert Manager Pod..."
    kubectl delete --all pods --namespace=cert-manager
    echo "Restarting all CCS Pods..."
    kubectl delete --all pods --namespace=$namespace
  4. Restore the opaque and other non-TLS based secrets.

  5. Restart the cert-manager.

  6. Restart all the CloudCenter Suite cluster pods.
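
After the pods restart, you can verify that the certificates were reissued and are no longer failing (an optional check, using the same describe command shown at the beginning of this section):

kubectl -n cisco get cert
kubectl -n cisco describe cert suite-auth-tls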

Finding Kubernetes Resources

For private clouds, the download link for the Kubeconfig file is available on the last page of the installer UI as displayed in the following screenshot. 

While you may see this file for successful installations in the above screen, you will not be able to access this file if your installation was not successful. This file is required to issue any command listed in the https://kubernetes.io/docs/reference/kubectl/cheatsheet/ section of the Kubernetes documentation. 

By default, the kubectl command looks for the Kubeconfig file in the $HOME/.kube folder.  

  • Successful installation: Copy the downloaded Kubeconfig file to your $HOME/.kube folder and then issue any of the kubectl commands listed in the Kubernetes cheatsheet link above.

  • Stalled Installation

    • Private clouds and most public clouds: SSH into one of the primary server nodes and copy the Kubeconfig file from /etc/kubernetes/admin.conf to the /root/.kube folder (see the example after this list).

    • GCP: Log in to GCP, access the Kubernetes Engine, locate your cluster, click Connect to connect to the cluster, and click the copy icon as displayed in the following screenshot. You must have already installed gcloud to view this icon.
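
For example, on a primary server node you can make the admin Kubeconfig the default for kubectl (a minimal sketch; the admin.conf path is the one cited above, and kubectl reads $HOME/.kube/config by default):

    mkdir -p $HOME/.kube
    cp /etc/kubernetes/admin.conf $HOME/.kube/config
    kubectl get nodes   # confirms that kubectl can reach the cluster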

A Pod has unbound PersistentVolumeClaims

The problem displayed in the following screenshot is usually caused when the cloud user does not have permissions to the configured storage. For example, a vSphere user may not have permissions to the selected datastore.

Error during the Suite Installation Process

At any time, if your installation stalls due to a lack of resources, perform this procedure to analyze the error logs.

To fetch the logs for this pod:

  1. Locate the actual name of the container by running the following command:

    kubectl get pods --all-namespaces | grep common-framework-suite-prod-mgmt
  2. Click the Download Logs link to download the installation logs for the failed service in case of an installation failure. 

  3. View the logs for the container common-framework-suite-prod-mgmt ...

  4. Run the following command to view the error:

    kubectl logs -f common-framework-suite-prod-mgmt-xxxx -n cisco
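
For example, you can locate the pod and tail its logs in one pass (a sketch; the pod name prefix is the one from Step 1, and the actual name includes a generated suffix):

    pod=$(kubectl get pods -n cisco | grep common-framework-suite-prod-mgmt | awk '{print $1}')
    kubectl logs -f "$pod" -n cisco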

Error in Creating Cluster

In case of failure (due to a quota availability issue) during the installation process, an error message similar to the one displayed in the following screenshot appears.

The Progress bar for a Kubernetes Cluster is stuck at Launching cluster nodes on the cloud or Configuring the primary cluster

The issue displayed in the following screenshot could be caused by a problem with the cloud environment. Refer to your cloud documentation for possible issues.

Other examples:

  • If the target cloud is vSphere, check if the cloud account being used has permissions to launch a VM and if the VM is configured with a valid IPv4 address. 

  • If the cluster nodes are configured to use static IP, verify if the IP pool used is valid and if all the launched nodes have a unique IP from the pool.

The Kubernetes Cluster is installed successfully, but the progress bar for Suite Administration is stuck at Waiting for product to be ready

This issue indicates a problem with the CloudCenter Suite installation. Use the downloaded SSH key to SSH into one of the primary server nodes. To check the status of the pods, run kubectl get pods --all-namespaces. If the status of any pod does not display Running, run the following commands to debug further:

kubectl describe pod <pod-name> -n cisco

or

kubectl logs -f <pod-name> -n cisco

Use the downloaded SSH key to SSH into each cluster node and check if the system clock is synchronized on all nodes. Even if the NTP servers were initially synchronized, verify that they are still active by using the following command.

ntpdate <ntp_server>
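
To compare the clocks on all nodes at once, you can run a quick loop from your workstation (a sketch; the node IPs, SSH key, and user name are placeholders that you must substitute for your environment):

# Timestamps should agree across nodes to within a second or two.
for node in <node-ip-1> <node-ip-2> <node-ip-3>; do
        ssh -i <ssh-key> <user>@$node date -u
done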

Installation Failed: Failed to copy <script-name.sh> to remote host, or any error related to an SSH connection failure

If any of the nodes are in the Not Ready state, run the following command to examine the node:

kubectl describe node <node-name>

This issue can occur when the installer node cannot SSH/SCP into launched cluster nodes. Verify if all the launched nodes have a valid IPv4 address and if the installer network can communicate with the Kubernetes cluster network (if they are on different networks). Also verify that the cluster nodes are able to connect to vSphere.

If none of the above methods work, retry the installation or contact your CloudCenter Suite admin.

DHCP IP Allocation Mode

During installation, if you select the DHCP IP allocation mode, you may see an error when you start the installation (assuming other values are appropriate). In this case, check your installer VM's /etc/resolv.conf file, and comment out or remove the entry containing the following keyword.

searchdomain

This entry adds a search domain to the /etc/resolv.conf file. The search domain resolves the IPs of Nginx services from external locations and maps them to the Nginx service within the CloudCenter Suite. However, CCP services do not need these mappings, as the internal IP mapping to the Nginx service is the only required mapping. As such, you must remove the errant entries from the /etc/resolv.conf file.
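
For example, you can comment out the offending line in place (a sketch; it assumes the entry contains the searchdomain keyword shown above and that you have sudo permissions):

sudo sed -i '/searchdomain/s/^/# /' /etc/resolv.conf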

To correct the error, restart all the installer pods by using the following command (sudo permissions required).

Execute sudo -i or prefix the command with sudo

kubectl delete pod $(kubectl get pods -n ccp | grep suite | awk '{print $1}') -n ccp

Now, wait for a minute to ensure that the pods have started running before restarting the installation process.
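
To confirm that the pods are running again before you retry, you can check their status (an optional check):

kubectl get pods -n ccp | grep suite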

After using Suite Admin for a while, users cannot log in to Suite Admin if any of the cluster nodes are in a Not Ready state

This issue may be caused by cluster nodes that are down or unreachable. Perform the following checks:

  • Verify that all the cluster nodes are up and running with a valid IP address.

  • If the nodes are running, use the downloaded SSH key to SSH into one of the primary server nodes.

  • Run the following command on the primary server to verify that all the nodes are in the Ready state.

     kubectl get nodes 

When one of the workers is down, a worker node scale-up operation is stuck

When one of the workers is down and you try to scale up the worker nodes, the node does not scale up. The scale-up operation remains stuck in the Scaling state.

Restart the operator pod for your environment by using the following command. The following example displays vSphere and the corresponding vSphere operator. Similarly, if you are working in an OpenStack environment, use the OpenStack operator as applicable.

kubectl delete pod kaas-ccp-vsphere-operator-<dynamic alphanumeric characters> -n ccp

#or

kubectl delete pod kaas-ccp-openstack-operator-<dynamic alphanumeric characters> -n ccp
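
To find the exact operator pod name, including the generated suffix, you can list the operator pods first (a quick lookup):

kubectl get pods -n ccp | grep operator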

Restarting this service starts the worker VM that was shut down and scales up the new node that was stuck during the scale operation.

Download Logs

Click the Download Logs link to download the installation logs for the failed service in case of an installation failure. See Monitor Modules > Download Logs for additional information.

Velero Issues

Refer to https://heptio.github.io/velero/v0.11.0/ for Velero troubleshooting information.


