Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
Streamline and secure CLI interactions, improve backend validation processes, and refine user guidance and documentation for the hcp command and related functionalities.
Future:
As a customer, I would like to deploy OpenShift on OpenStack, using the IPI workflow, where my control plane would have 3 machines and each machine would use a root volume (a Cinder volume attached to the Nova server) and also an attached ephemeral disk using local storage, which would only be used by etcd.
As this feature will be TechPreview in 4.15, this will only be implemented as a day 2 operation for now. This might or might not change in the future.
We know that etcd requires storage with strong performance capabilities, and currently a root volume backed by Ceph has difficulty providing these capabilities.
Attaching local storage to the machine and mounting it for etcd would solve the performance issues that we saw when customers were using Ceph as the backend for the control plane disks.
Gophercloud already supports creating a server with multiple ephemeral disks:
We need to figure out how we want to address that in CAPO, probably involving a new API that would later be used in OpenShift (MAPO, and probably the installer).
We'll also have to update the OpenStack Failure Domain in CPMS.
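For illustration only, here is a rough sketch of how an extra local disk reserved for etcd might be expressed on a control-plane Machine providerSpec. The additionalBlockDevices field and its layout are assumptions about where this could land, not a committed API:

apiVersion: machine.openshift.io/v1beta1
kind: Machine
spec:
  providerSpec:
    value:
      # Root volume backed by Cinder, as in the IPI workflow described above.
      rootVolume:
        volumeType: tripleo   # placeholder volume type
        diskSize: 100
      # Assumed field: an extra local (ephemeral) disk dedicated to etcd.
      additionalBlockDevices:
      - name: etcd
        sizeGiB: 10
        storage:
          type: Local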
ARO (Azure) has conducted some benchmarks and is now recommending putting etcd on a separate data disk:
https://docs.google.com/document/d/1O_k6_CUyiGAB_30LuJFI6Hl93oEoKQ07q1Y7N2cBJHE/edit
Also interesting thread: https://groups.google.com/u/0/a/redhat.com/g/aos-devel/c/CztJzGWdsSM/m/jsPKZHSRAwAJ
This feature follows up OCPBU-186, Image mirroring by tags.
OCPBU-186 implemented the new APIs ImageDigestMirrorSet and ImageTagMirrorSet and rolled them out through the MCO.
This feature will update the components using ImageContentSourcePolicy to use ImageDigestMirrorSet.
The list of the components: https://docs.google.com/document/d/11FJPpIYAQLj5EcYiJtbi_bNkAcJa2hCLV63WvoDsrcQ/edit?usp=sharing.
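For reference, a minimal ImageDigestMirrorSet that expresses the same intent an ImageContentSourcePolicy did; the source and mirror repositories below are placeholders:

apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: example-digest-mirrors
spec:
  imageDigestMirrors:
  - source: registry.example.com/ocp/release   # placeholder source repository
    mirrors:
    - mirror.example.com/ocp/release           # placeholder mirror repository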
Migrate OpenShift Components to use the new Image Digest Mirror Set (IDMS)
This doc lists the OpenShift components that currently use ICSP: https://docs.google.com/document/d/11FJPpIYAQLj5EcYiJtbi_bNkAcJa2hCLV63WvoDsrcQ/edit?usp=sharing
Plan for ImageDigestMirrorSet Rollout:
Epic: https://issues.redhat.com/browse/OCPNODE-521
4.13: Enable ImageDigestMirrorSet, both ICSP and ImageDigestMirrorSet objects are functional
4.14: Update OpenShift components to use IDMS
4.17: Remove support for ICSP within MCO
As an OpenShift developer, I want an --idms-file flag so that I can fetch image info from an alternative mirror if --icsp-file gets deprecated.
Create custom roles for GCP with minimal set of required permissions.
Enable customers to better scope credential permissions and create custom roles on GCP that only include the minimum subset of what is needed for OpenShift.
Some of the service accounts that CCO creates, e.g. a service account with the role roles/iam.serviceAccountUser, provide elevated permissions that are not required/used by the requesting OpenShift components. This is because we use predefined GCP roles that come with a bunch of additional permissions. The goal is to create custom roles with only the required permissions.
TBD
Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Image Registry Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct {
    metav1.TypeMeta `json:",inline"`
    // PredefinedRoles is the list of GCP pre-defined roles
    // that the CredentialsRequest requires.
    PredefinedRoles []string `json:"predefinedRoles"`
    // Permissions is the list of GCP permissions required to
    // create a more fine-grained custom role to satisfy the
    // CredentialsRequest.
    // When both Permissions and PredefinedRoles are specified
    // service account will have union of permissions from
    // both the fields
    Permissions []string `json:"permissions"`
    // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
    // have the necessary services enabled
    // +optional
    SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
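As an illustrative (not authoritative) example, a CredentialsRequest using the new permissions field instead of a broad predefined role might look like the following; the name, namespaces, and permission list are placeholders:

apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: example-gcp-credentials            # placeholder name
  namespace: openshift-cloud-credential-operator
spec:
  secretRef:
    name: installer-cloud-credentials
    namespace: example-component-namespace  # placeholder namespace
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: GCPProviderSpec
    # Fine-grained permissions replace a broader predefined role.
    permissions:
    - storage.buckets.get                   # placeholder permissions
    - storage.objects.create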
We can use the following command to check the permissions associated with a GCP predefined role:
gcloud iam roles describe <role_name>
The sample output for the role roles/iam.roleViewer is as follows. The permissions are listed in the "includedPermissions" field.
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
Evaluate if any of the GCP predefined roles in the credentials request manifests of OpenShift cluster operators give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct {
    metav1.TypeMeta `json:",inline"`
    // PredefinedRoles is the list of GCP pre-defined roles
    // that the CredentialsRequest requires.
    PredefinedRoles []string `json:"predefinedRoles"`
    // Permissions is the list of GCP permissions required to
    // create a more fine-grained custom role to satisfy the
    // CredentialsRequest.
    // When both Permissions and PredefinedRoles are specified
    // service account will have union of permissions from
    // both the fields
    Permissions []string `json:"permissions"`
    // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
    // have the necessary services enabled
    // +optional
    SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
We can use the following command to check the permissions associated with a GCP predefined role:
gcloud iam roles describe <role_name>
The sample output for the role roles/iam.roleViewer is as follows. The permissions are listed in the "includedPermissions" field.
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
Evaluate if any of the GCP predefined roles in the credentials request manifest of Cloud Controller Manager Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct {
    metav1.TypeMeta `json:",inline"`
    // PredefinedRoles is the list of GCP pre-defined roles
    // that the CredentialsRequest requires.
    PredefinedRoles []string `json:"predefinedRoles"`
    // Permissions is the list of GCP permissions required to
    // create a more fine-grained custom role to satisfy the
    // CredentialsRequest.
    // When both Permissions and PredefinedRoles are specified
    // service account will have union of permissions from
    // both the fields
    Permissions []string `json:"permissions"`
    // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
    // have the necessary services enabled
    // +optional
    SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
We can use the following command to check the permissions associated with a GCP predefined role:
gcloud iam roles describe <role_name>
The sample output for the role roles/iam.roleViewer is as follows. The permissions are listed in the "includedPermissions" field.
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster CAPI Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct {
    metav1.TypeMeta `json:",inline"`
    // PredefinedRoles is the list of GCP pre-defined roles
    // that the CredentialsRequest requires.
    PredefinedRoles []string `json:"predefinedRoles"`
    // Permissions is the list of GCP permissions required to
    // create a more fine-grained custom role to satisfy the
    // CredentialsRequest.
    // When both Permissions and PredefinedRoles are specified
    // service account will have union of permissions from
    // both the fields
    Permissions []string `json:"permissions"`
    // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
    // have the necessary services enabled
    // +optional
    SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
We can use the following command to check the permissions associated with a GCP predefined role:
gcloud iam roles describe <role_name>
The sample output for the role roles/iam.roleViewer is as follows. The permissions are listed in the "includedPermissions" field.
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
These are phase 2 items from CCO-188
Moving items from other teams that need to be committed to for 4.13 in order for this work to complete.
Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Storage Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct {
    metav1.TypeMeta `json:",inline"`
    // PredefinedRoles is the list of GCP pre-defined roles
    // that the CredentialsRequest requires.
    PredefinedRoles []string `json:"predefinedRoles"`
    // Permissions is the list of GCP permissions required to
    // create a more fine-grained custom role to satisfy the
    // CredentialsRequest.
    // When both Permissions and PredefinedRoles are specified
    // service account will have union of permissions from
    // both the fields
    Permissions []string `json:"permissions"`
    // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
    // have the necessary services enabled
    // +optional
    SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
We can use the following command to check the permissions associated with a GCP predefined role:
gcloud iam roles describe <role_name>
The sample output for the role roles/iam.roleViewer is as follows. The permissions are listed in the "includedPermissions" field.
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Ingress Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct {
    metav1.TypeMeta `json:",inline"`
    // PredefinedRoles is the list of GCP pre-defined roles
    // that the CredentialsRequest requires.
    PredefinedRoles []string `json:"predefinedRoles"`
    // Permissions is the list of GCP permissions required to
    // create a more fine-grained custom role to satisfy the
    // CredentialsRequest.
    // When both Permissions and PredefinedRoles are specified
    // service account will have union of permissions from
    // both the fields
    Permissions []string `json:"permissions"`
    // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
    // have the necessary services enabled
    // +optional
    SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
We can use the following command to check the permissions associated with a GCP predefined role:
gcloud iam roles describe <role_name>
The sample output for the role roles/iam.roleViewer is as follows. The permissions are listed in the "includedPermissions" field.
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Network Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct {
    metav1.TypeMeta `json:",inline"`
    // PredefinedRoles is the list of GCP pre-defined roles
    // that the CredentialsRequest requires.
    PredefinedRoles []string `json:"predefinedRoles"`
    // Permissions is the list of GCP permissions required to
    // create a more fine-grained custom role to satisfy the
    // CredentialsRequest.
    // When both Permissions and PredefinedRoles are specified
    // service account will have union of permissions from
    // both the fields
    Permissions []string `json:"permissions"`
    // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
    // have the necessary services enabled
    // +optional
    SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
We can use the following command to check the permissions associated with a GCP predefined role:
gcloud iam roles describe <role_name>
The sample output for the role roles/iam.roleViewer is as follows. The permissions are listed in the "includedPermissions" field.
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
Update GCP Credentials Request manifest of the Cluster Network Operator to use new API field for requesting permissions.
As an Infrastructure Administrator, I want to deploy OpenShift on Nutanix distributing the control plane and compute nodes across multiple regions and zones, forming different failure domains.
As an Infrastructure Administrator, I want to configure an existing OpenShift cluster to distribute the nodes across regions and zones, forming different failure domains.
Install OpenShift on Nutanix using IPI / UPI in multiple regions and zones.
This implementation would follow the same approach used for vSphere. The following are the main PRs for vSphere:
https://github.com/openshift/enhancements/blob/master/enhancements/installer/vsphere-ipi-zonal.md
Nutanix Zonal: Multiple regions and zones support for Nutanix IPI and Assisted Installer
Note
As a user, I want to be able to spread control plane nodes for an OCP cluster across Prism Elements (zones).
Unify and update hosted control planes storage operators so that they have similar code patterns and can run properly in both standalone OCP and HyperShift's control plane.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal*
Our current design of the EBS driver operator to support HyperShift does not scale well to other drivers. The existing design will lead to more code duplication between driver operators and a greater possibility of errors.
Why is this important? (mandatory)
An improved design will allow more storage drivers and their operators to be added to hypershift without requiring significant changes in the code internals.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Our CSI driver YAML files are mostly copy-pasted from the initial CSI driver (AWS EBS?).
As an OCP engineer, I want the YAML files to be generated, so we can easily keep consistency among the CSI drivers and make them less error-prone.
It should have no visible impact on the resulting operator behavior.
Finally switch both CI and ART to the refactored aws-ebs-csi-driver-operator.
The functionality and behavior should be the same as the existing operator, however, the code is completely new. There could be some rough edges. See https://github.com/openshift/enhancements/blob/master/enhancements/storage/csi-driver-operator-merge.md
CI should catch the most obvious errors; however, we need to test features that we do not cover in CI, such as:
Enable support to bring your own encryption key (BYOK) for OpenShift on IBM Cloud VPC.
As a user, I want to be able to provide my own encryption key when deploying OpenShift on IBM Cloud VPC, so that the cluster infrastructure objects (VM instances and storage objects) can use that user-managed key to encrypt the information.
The Installer will provide a mechanism to specify a user-managed key that will be used to encrypt the data on the virtual machines that are part of the OpenShift cluster as well as any other persistent storage managed by the platform via Storage Classes.
This feature is a required component for IBM's OpenShift replatforming effort.
The feature will be documented as usual to guide the user while using their own key to encrypt the data on the OpenShift cluster running on IBM Cloud VPC
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
EOL, do not upgrade:
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Goal
Hardware RAID support on Dell, Supermicro and HPE with Metal3.
Why is this important
Setting up RAID devices is a common operation in the hardware for OpenShift nodes. While there's been work at Fujitsu for configuring RAID in Fujitsu servers with Metal3, we don't support any generic interface with Redfish to extend this support and set it up in Metal3.
Dell, Supermicro and HPE, which are the most common hardware platforms we find in our customers' environments, are the main target.
Goal
Hardware RAID support on Dell with Metal3.
Why is this important
Setting up RAID devices is a common operation in the hardware for OpenShift nodes. While there's been work at Fujitsu for configuring RAID in Fujitsu servers with Metal3, we don't support any generic interface with Redfish to extend this support and set it up in Metal3 for Dell, which is one of the most common hardware platforms we find in our customers' environments.
Before implementing generic support, we need to understand the implications of enabling an interface in Metal3 to allow it on multiple hardware types.
Scope questions
While rendering BMO in https://issues.redhat.com/browse/METAL-829, the node cpu_arch was hardcoded to x86_64.
We should use bmh.Spec.Architecture instead to be more future-proof.
Extend OpenShift on IBM Cloud integration with additional features to pair the capabilities offered for this provider integration to the ones available in other cloud platforms.
Extend the existing features while deploying OpenShift on IBM Cloud.
This top level feature is going to be used as a placeholder for the IBM team who is working on new features for this integration in an effort to keep in sync their existing internal backlog with the corresponding Features/Epics in Red Hat's Jira.
A user currently is not able to create a Disconnected cluster, using IPI, on IBM Cloud.
Currently, support for BYON and Private clusters does exist on IBM Cloud, but support to override IBM Cloud Service endpoints does not exist, which is required to allow for Disconnected support to function (reach IBM Cloud private endpoints).
IBM dependent components of OCP will need to add support to use a set of endpoint override values in order to reach IBM Cloud Services in Disconnected environments.
The Image Registry components will need to be able to direct all API calls to IBM Cloud Services to these endpoint values, in order to communicate in environments where the Public or default IBM Cloud Service endpoint is not available.
The endpoint overrides are available via the infrastructure/cluster (.status.platformStatus.ibmcloud.serviceEndpoints) resource, which is how a majority of components consume cluster-specific configurations (Ingress, MAPI, etc.). It will be structured as such:
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-10-04T22:02:15Z"
  generation: 1
  name: cluster
  resourceVersion: "430"
  uid: b923c3de-81fc-4a0e-9fdb-8c4c337fba08
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    type: IBMCloud
status:
  apiServerInternalURI: https://api-int.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  apiServerURL: https://api.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: us-east-disconnect-21-gtbwd
  infrastructureTopology: HighlyAvailable
  platform: IBMCloud
  platformStatus:
    ibmcloud:
      dnsInstanceCRN: 'crn:v1:bluemix:public:dns-svcs:global:a/fa4fd9fa0695c007d1fdcb69a982868c:f00ac00e-75c2-4774-a5da-44b2183e31f7::'
      location: us-east
      providerType: VPC
      resourceGroupName: us-east-disconnect-21-gtbwd
      serviceEndpoints:
      - name: iam
        url: https://private.us-east.iam.cloud.ibm.com
      - name: vpc
        url: https://us-east.private.iaas.cloud.ibm.com/v1
      - name: resourcecontroller
        url: https://private.us-east.resource-controller.cloud.ibm.com
      - name: resourcemanager
        url: https://private.us-east.resource-controller.cloud.ibm.com
      - name: cis
        url: https://api.private.cis.cloud.ibm.com
      - name: dnsservices
        url: https://api.private.dns-svcs.cloud.ibm.com/v1
      - name: cis
        url: https://s3.direct.us-east.cloud-object-storage.appdomain.cloud
    type: IBMCloud
The CCM is currently relying on updates to the openshift-cloud-controller-manager/cloud-conf configmap, in order to override its required IBM Cloud Service endpoints, such as:
data:
  config: |+
    [global]
    version = 1.1.0
    [kubernetes]
    config-file = ""
    [provider]
    accountID = ...
    clusterID = temp-disconnect-7m6rw
    cluster-default-provider = g2
    region = eu-de
    g2Credentials = /etc/vpc/ibmcloud_api_key
    g2ResourceGroupName = temp-disconnect-7m6rw
    g2VpcName = temp-disconnect-7m6rw-vpc
    g2workerServiceAccountID = ...
    g2VpcSubnetNames = temp-disconnect-7m6rw-subnet-compute-eu-de-1,temp-disconnect-7m6rw-subnet-compute-eu-de-2,temp-disconnect-7m6rw-subnet-compute-eu-de-3,temp-disconnect-7m6rw-subnet-control-plane-eu-de-1,temp-disconnect-7m6rw-subnet-control-plane-eu-de-2,temp-disconnect-7m6rw-subnet-control-plane-eu-de-3
    iamEndpointOverride = https://private.iam.cloud.ibm.com
    g2EndpointOverride = https://eu-de.private.iaas.cloud.ibm.com
    rmEndpointOverride = https://private.resource-controller.cloud.ibm.com
Installer validates and injects user provided endpoint overrides into cluster deployment process and the Image Registry components use specified endpoints and start up properly.
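A hedged sketch of how such user-provided endpoint overrides might be expressed in the install-config; the exact field placement under platform.ibmcloud is an assumption, and the endpoint names simply mirror the Infrastructure status shown above:

platform:
  ibmcloud:
    region: us-east
    # Assumed field for user-provided private endpoint overrides.
    serviceEndpoints:
    - name: iam
      url: https://private.us-east.iam.cloud.ibm.com
    - name: vpc
      url: https://us-east.private.iaas.cloud.ibm.com/v1
    - name: resourcecontroller
      url: https://private.us-east.resource-controller.cloud.ibm.com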
Networking Definition of Planned
Epic Template descriptions and documentation
With ovn-ic we have multiple actors (zones) setting status on some CRs. We need to make sure individual zone statuses are reported and then optionally merged to a single status.
Without that change, zones will overwrite each other's statuses.
Additional information on each of the above items can be found here: Networking Definition of Planned
Cluster administrators need an in-product experience to discover and install new Red Hat offerings that can add high value to developer workflows.
Requirements | Notes | IS MVP |
---|---|---|
Discover new offerings in Home Dashboard | | Y |
Access details outlining value of offerings | | Y |
Access step-by-step guide to install offering | | N |
Allow developers to easily find and use newly installed offerings | | Y |
Support air-gapped clusters | | Y |
What are we making, for who, and why/what problem are we solving?
Discovering solutions that are not available for installation on cluster
No known dependencies
Background, and strategic fit
None
Quick Starts
Cluster admins need to be guided to install RHDH on the cluster.
Enable admins to discover RHDH, be guided to installing it on the cluster, and verifying its configuration.
RHDH is a key multi-cluster offering for developers. This will enable customers to self-discover and install RHDH.
RHDH operator
Description of problem:
The OpenShift Console quick starts promote RHDH but also include Janus IDP information.
The Janus IDP quick starts should be removed, and all information about Janus IDP should be removed.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
Just navigate to Quick starts and select the "Install Red Hat Developer Hub (RHDH) with an Operator" quick start
Actual results:
Expected results:
Additional info:
Initial PR: https://github.com/openshift/console-operator/pull/806
Consolidated Enhancement of HyperShift/KubeVirt Provider Post GA
This feature aims to provide a comprehensive enhancement to the HyperShift/KubeVirt provider integration post its GA release.
By consolidating CSI plugin improvements, core improvements, and networking enhancements, we aim to offer a more robust, efficient, and user-friendly experience.
Post GA quality of life improvements for the HyperShift/KubeVirt core
Who | What | Reference |
---|---|---|
DEV | Upstream roadmap issue (or individual upstream PRs) | <link to GitHub Issue> |
DEV | Upstream documentation merged | <link to meaningful PR> |
DEV | gap doc updated | <name sheet and cell> |
DEV | Upgrade consideration | <link to upgrade-related test or design doc> |
DEV | CEE/PX summary presentation | label epic with cee-training and add a <link to your support-facing preso> |
QE | Test plans in Polarion | <link or reference to Polarion> |
QE | Automated tests merged | <link or reference to automated tests> |
DOC | Downstream documentation merged | <link to meaningful PR> |
The RBAC required for external infra needs to be documented on this page:
https://hypershift-docs.netlify.app/how-to/kubevirt/external-infrastructure/
Currently there is no option to influence the placement of the VMs of a hosted cluster with the KubeVirt provider. The existing NodeSelector in HostedCluster influences only the pods in the hosted control plane namespace.
The goal is to introduce a new field in the .spec.platform.kubevirt stanza in NodePool for a node selector, propagate it to the VirtualMachineSpecTemplate, and expose this in the hypershift and hcp CLIs.
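A sketch of what the proposed NodePool field could look like; the nodeSelector field under .spec.platform.kubevirt is the proposal described above, not a settled API, and the names and labels are placeholders:

apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: example-nodepool        # placeholder
  namespace: clusters
spec:
  platform:
    type: KubeVirt
    kubevirt:
      # Proposed field: propagated to the VirtualMachine template so the
      # hosted cluster's VMs land on matching infra nodes.
      nodeSelector:
        topology.kubernetes.io/zone: zone-a   # placeholder label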
oc, the OpenShift CLI, needs to get as close to feature parity as we can without the built-in OAuth server and its associated user and group management. This will enable scripts, documentation, blog posts, and knowledge base articles to function across all form factors, and across the same form factor with different configurations.
CLI users and scripts should be usable in a consistent way regardless of the token issuer configuration.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
oc login needs to work without the embedded oauth server
Why is this important? (mandatory)
We are removing the embedded oauth-server, and we utilize a special oauthclient in order to make our login flows functional.
This allows documentation, scripts, etc to be functional and consistent with the last 10 years of our product.
This may require vendoring entire CLI plugins. It may require new kubeconfig shapes.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Description of problem:
Introduce an --issuer-url flag in oc login.
Version-Release number of selected component (if applicable):
[xxia@2024-03-01 21:03:30 CST my]$ oc version --client
Client Version: 4.16.0-0.ci-2024-03-01-033249
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
[xxia@2024-03-01 21:03:50 CST my]$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.ci-2024-02-29-213249   True        False         8h      Cluster version is 4.16.0-0.ci-2024-02-29-213249
How reproducible:
Always
Steps to Reproduce:
1. Launch fresh HCP cluster.
2. Login to https://entra.microsoft.com. Register application and set properly.
3. Prepare variables.
   HC_NAME=hypershift-ci-267920
   MGMT_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/kubeconfig
   HOSTED_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/hypershift-ci-267920.kubeconfig
   AUDIENCE=7686xxxxxx
   ISSUER_URL=https://login.microsoftonline.com/64dcxxxxxxxx/v2.0
   CLIENT_ID=7686xxxxxx
   CLIENT_SECRET_VALUE="xxxxxxxx"
   CLIENT_SECRET_NAME=console-secret
4. Configure the HC without oauthMetadata.
   [xxia@2024-03-01 20:29:21 CST my]$ oc create secret generic console-secret -n clusters --from-literal=clientSecret=$CLIENT_SECRET_VALUE --kubeconfig $MGMT_KUBECONFIG
   [xxia@2024-03-01 20:34:05 CST my]$ oc patch hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG --type=merge -p="
   spec:
     configuration:
       authentication:
         oauthMetadata:
           name: ''
         oidcProviders:
         - claimMappings:
             groups:
               claim: groups
               prefix: 'oidc-groups-test:'
             username:
               claim: email
               prefixPolicy: Prefix
               prefix:
                 prefixString: 'oidc-user-test:'
           issuer:
             audiences:
             - $AUDIENCE
             issuerURL: $ISSUER_URL
           name: microsoft-entra-id
           oidcClients:
           - clientID: $CLIENT_ID
             clientSecret:
               name: $CLIENT_SECRET_NAME
             componentName: console
             componentNamespace: openshift-console
         type: OIDC
   "
   Wait for pods to renew:
   [xxia@2024-03-01 20:52:41 CST my]$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp
   ...
   certified-operators-catalog-7ff9cffc8f-z5dlg     1/1   Running   0   5h44m
   kube-apiserver-6bd9f7ccbd-kqzm7                  5/5   Running   0   17m
   kube-apiserver-6bd9f7ccbd-p2fw7                  5/5   Running   0   15m
   kube-apiserver-6bd9f7ccbd-fmsgl                  5/5   Running   0   13m
   openshift-apiserver-7ffc9fd764-qgd4z             3/3   Running   0   11m
   openshift-apiserver-7ffc9fd764-vh6x9             3/3   Running   0   10m
   openshift-apiserver-7ffc9fd764-b7znk             3/3   Running   0   10m
   konnectivity-agent-577944765c-qxq75              1/1   Running   0   9m42s
   hosted-cluster-config-operator-695c5854c-dlzwh   1/1   Running   0   9m42s
   cluster-version-operator-7c99cf68cd-22k84        1/1   Running   0   9m42s
   konnectivity-agent-577944765c-kqfpq              1/1   Running   0   9m40s
   konnectivity-agent-577944765c-7t5ds              1/1   Running   0   9m37s
5. Check console login and oc login.
   $ export KUBECONFIG=$HOSTED_KUBECONFIG
   $ curl -ksS $(oc whoami --show-server)/.well-known/oauth-authorization-server
   {
     "issuer": "https://:0",
     "authorization_endpoint": "https://:0/oauth/authorize",
     "token_endpoint": "https://:0/oauth/token",
     ...
   }
   Check console login: it succeeds, and the console upper right correctly shows the user name oidc-user-test:xxia@redhat.com.
   Check oc login:
   $ rm -rf ~/.kube/cache/oc/
   $ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
   error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
   error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
   Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1
Actual results:
Console login succeeds. oc login fails.
Expected results:
oc login should also succeed.
Additional info:
As part of the deprecation progression of the openshift-sdn CNI plug-in, remove it as an install-time option for new 4.15+ release clusters.
The openshift-sdn CNI plug-in is sunsetting according to the following progression:
All development effort is directed to the default primary CNI plug-in, ovn-kubernetes, which has feature parity with the older openshift-sdn CNI plug-in that has been feature frozen for the entire 4.x timeframe. In order to best serve our customers now and in the future, we are reducing our support footprint to the dominant plug-in, only.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Address technical debt around self-managed HCP deployments, including but not limited to
Users are encountering an issue when attempting to "Create hostedcluster on BM+disconnected+ipv6 through MCE." This issue is related to `--enable-uwm-telemetry-remote-write` defaulting to true, which might mean that in the default disconnected case, whatever is configured in the configmap for UWM, e.g. (minBackoff: 1s, url: https://infogw.api.openshift.com/metrics/v1/receive), is not reachable. So we should look into reporting the issue and remediating it vs. fataling on it for disconnected scenarios.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
In MCE 2.4, we currently document disabling `--enable-uwm-telemetry-remote-write` if the hosted control plane feature is used in a disconnected environment: https://github.com/stolostron/rhacm-docs/blob/lahinson-acm-7739-disconnected-bare-[…]s/hosted_control_planes/monitor_user_workload_disconnected.adoc. Once this Jira is fixed, that documentation needs to be removed; users will not need to disable `--enable-uwm-telemetry-remote-write`. The HO is expected to fail gracefully on `--enable-uwm-telemetry-remote-write` and continue to be operational.
The CLI cannot create dual-stack clusters with the default values. We need to create the proper flags to enable the HostedCluster to be a dual-stack one using the default values.
This can be based on the existing CAPI agent provider workflow, which already has an env var flag for disconnected.
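For context, dual stack on a HostedCluster ultimately means both IPv4 and IPv6 CIDRs in its networking stanza; a rough sketch (the CIDR values are placeholders) of what the CLI defaults would need to produce:

spec:
  networking:
    networkType: OVNKubernetes
    clusterNetwork:
    - cidr: 10.132.0.0/14   # placeholder IPv4 CIDR
    - cidr: fd01::/48       # placeholder IPv6 CIDR
    serviceNetwork:
    - cidr: 172.31.0.0/16   # placeholder IPv4 CIDR
    - cidr: fd02::/112      # placeholder IPv6 CIDR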
Epic Goal*
There was an epic / enhancement to create a cluster-wide TLS config that applies to all OpenShift components:
https://issues.redhat.com/browse/OCPPLAN-4379
https://github.com/openshift/enhancements/blob/master/enhancements/kube-apiserver/tls-config.md
For example, this is how KCM sets --tls-cipher-suites and --tls-min-version based on the observed config:
https://issues.redhat.com/browse/WRKLDS-252
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/506/files
The cluster admin can change the config based on their risk profile, but if they don't change anything, there is a reasonable default.
We should update all CSI driver operators to use this config. Right now we have a hard-coded cipher list in library-go. See OCPBUGS-2083 and OCPBUGS-4347 for background context.
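For reference, the cluster-wide profile that the operators would observe is set on the APIServer config resource, for example:

apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  tlsSecurityProfile:
    # The predefined Intermediate profile supplies the cipher list and minimum TLS version.
    type: Intermediate
    intermediate: {}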
Why is this important? (mandatory)
This will keep the cipher list consistent across many OpenShift components. If the default list is changed, we get that change "for free".
It will reduce support calls from customers and backport requests when the recommended defaults change.
It will provide flexibility to the customer, since they can set their own TLS profile settings without requiring code change for each component.
Scenarios (mandatory)
As a cluster admin, I want to use TLSSecurityProfile to control the cipher list and minimum TLS version for all CSI driver operator sidecars, so that I can adjust the settings based on my own risk assessment.
Dependencies (internal and external) (mandatory)
None, the changes we depend on were already implemented.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Goal:
As an administrator, I would like to use my own managed DNS solution instead of only specific openshift-install supported DNS services (such as AWS Route53, Google Cloud DNS, etc...) for my OpenShift deployment.
Problem:
While cloud-based DNS services provide convenient hostname management, there are a number of regulatory (ITAR) and operational constraints customers face that prohibit the use of those DNS hosting services on public cloud providers.
Why is this important:
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
Estimate (XS, S, M, L, XL, XXL):
Previous Work:
Open questions:
Link to Epic: https://docs.google.com/document/d/1OBrfC4x81PHhpPrC5SEjixzg4eBnnxCZDr-5h3yF2QI/edit?usp=sharing
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Append an Infra CR with only the GCP PlatformStatus field (without any other fields, especially the Spec) set with the LB IPs at the end of the bootstrap ignition. The theory is that when the Infra CR is applied from the bootstrap ignition, the infra manifest is applied first. As we progress through all the other assets in the ignition files, the Infra CR appears again but with only the LB IPs set. That way it will update the existing Infra CR already applied to the cluster.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
This Epic details the work required to augment this CoreDNS pod to also resolve the *.apps URL. In addition, it will include changes to prevent the Ingress Operator from configuring the cloud DNS after the ingress LBs have been created.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
We want to add a new CLI option named `--all-images` to the `oc adm must-gather` command.
The oc client will scan all the CSVs on the cluster looking for the `operators.openshift.io/must-gather-image=pullUrlOfTheImage` annotation, building a list of must-gather images to be used to collect logs for all the products installed on the cluster; the client will then execute all of them, aggregating the collection results.
Operator authors that want to opt in to this mechanism should explicitly annotate their CSV with the `operators.openshift.io/must-gather-image` annotation.
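A minimal sketch of how an operator author might opt in via their ClusterServiceVersion; the CSV name and image pull spec are placeholders:

apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: example-operator.v1.0.0   # placeholder
  annotations:
    # Image that `oc adm must-gather --all-images` would discover and run.
    operators.openshift.io/must-gather-image: "registry.example.com/example/must-gather:latest"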
Proposed title of this feature request
Add support to OpenShift Telemetry to report the provider that has been added via "platform: external"
What is the nature and description of the request?
There is a new platform we have added support for in OpenShift 4.14 called "external", which has been added for partners to enable and support their own integrations with OpenShift rather than requiring RH to develop and support them.
When deploying OpenShift using "platform: external" we don't have the ability right now to identify the provider where the platform has been deployed, which is key for the product team to analyze demand and other metrics.
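For context, the external platform already carries a provider name at install time; a hedged example of the install-config stanza (the platformName value is a placeholder) whose value telemetry could report:

platform:
  external:
    platformName: example-provider   # placeholder provider name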
Why does the customer need this? (List the business requirements)
OpenShift Product Management needs this information to analyze adoption of these new platforms as well as other metrics specifically for these platforms to help us to make decisions for the product development.
List any affected packages or components.
Telemetry for OpenShift
There is some additional information in the following Slack thread --> https://redhat-internal.slack.com/archives/CEG5ZJQ1G/p1698758270895639
Placeholder feature for ccx-ocp-core maintenance tasks.
This is epic tracks "business as usual" requirements / enhancements / bug fixing of Insights Operator.
Filing a ticket based on this conversation here: https://github.com/openshift/enhancements/pull/1014#discussion_r798674314
Basically the tl;dr here is that we need a way to ensure that machinesets are properly advertising the architecture that the nodes will eventually have. This is needed so the autoscaler can predict the correct pool to scale up/down. This could be accomplished through user-driven means like adding node arch labels to machinesets; if we have to do this automatically, we need to do some more research and figure out a way.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision PowerVS infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Implement an initial PowerVS provider for CAPI.
Generate the machine manifests. Follow how it's done for AWS [0] and the PoC code [1].
[0] https://github.com/openshift/installer/blob/master/pkg/asset/machines/aws/awsmachines.go
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for OpenStack deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision OpenStack infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
We need to get CI on this PR (https://github.com/openshift/installer/pull/7939) in good shape so we can look for reviews.
Right now, when trying an installation with this work (https://github.com/openshift/installer/pull/7939), the bootstrap machine is not getting deleted. We need to ensure it's gone once bootstrap is finalized.
This is needed to identify if masters are schedulable and to upload the RHCOS image to Glance.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision IBM Cloud VPC infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer - specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision GCP infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
I want to create the public and private DNS records using one of the CAPI interface SDK hooks.
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Machines for GCP need to be generated for use in CAPI. This will be similar to the AWS machine implementation
(https://github.com/openshift/installer/blob/master/pkg/asset/machines/aws/awsmachines.go) added in
https://github.com/openshift/installer/pull/7771
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
I want the installer to create the service accounts that would be assigned to control plane and compute machines, similar to what is done in terraform now.
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
When installing on GCP, I want control-plane (including bootstrap) machines to bootstrap using ignition.
I want bootstrap ignition to be secured so that sensitive data is not publicly available.
Description of criteria:
Destroying bootstrap ignition can be handled separately.
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Create the GCP Infrastructure controller in /pkg/clusterapi/system.go.
It will be based on the AWS controller in that file, which was added in https://github.com/openshift/installer/pull/7630.
Description of problem:
The bootstrap machine never contains a public IP address. When the publish strategy is set to External, the bootstrap machine should contain a public IP address.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision AWS infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
The schema check[1] in the LB reconciliation is hardcoded to check the primary Load Balancer only, which results in always filtering the subnets from the schema for the primary, ignoring additional Load Balancers ("SecondaryControlPlaneLoadBalancer").
How to reproduce:
Actual results:
Expected results:
References:
When installconfig.platform.aws.userTags is specified, all taggable resources should have the specified user tags.
The IAM role is correctly attached to the control plane node when installconfig.controlPlane.platform.aws.iamRole is specified.
Epic Goal
Through this epic, we will update our CI to use an available agent-based workflow instead of the libvirt openshift-installer, allowing us to eliminate the use of Terraform in our deployments.
Why is this important?
There is an active initiative in OpenShift to remove Terraform from the OpenShift installer.
Acceptance Criteria
Done Checklist
As a CI job author, I would like to be able to reference a yaml/json parsing tool that works across architectures and doesn't need to be downloaded for each unique step.
Rafael pointed out that Alessandro added multi-arch containers for yq to the UPI installer image:
https://github.com/openshift/release/pull/49036#discussion_r1499870554
yq should have the ability to parse json.
We should evaluate if this can be added to the libvirt-installer image as well, and then used by all of our libvirt CI steps.
A customer has escalated the following issues where ports don't have TLS support. This Feature request lists all the component ports raised by the customer.
Details here https://docs.google.com/document/d/1zB9vUGB83xlQnoM-ToLUEBtEGszQrC7u-hmhCnrhuXM/edit
Currently, we are serving the metrics over plain HTTP on port 9537; we need to upgrade to use TLS.
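As a rough sketch of the change being asked for (a plain-HTTP metrics endpoint upgraded to TLS), assuming a standard promhttp handler and a serving certificate mounted at an illustrative path, for example one minted by the service-ca operator:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())

	// Previously the equivalent of http.ListenAndServe(":9537", mux) served metrics in clear text.
	// Serving with a mounted certificate/key pair upgrades the endpoint to TLS.
	log.Fatal(http.ListenAndServeTLS(
		":9537",
		"/etc/metrics-certs/tls.crt", // assumed mount path for the serving certificate
		"/etc/metrics-certs/tls.key",
		mux,
	))
}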
Related to https://docs.google.com/document/d/1zB9vUGB83xlQnoM-ToLUEBtEGszQrC7u-hmhCnrhuXM/edit
Enhance Dynamic plugin with similar capabilities as Static page. Add new control and security related enhancements to Static page.
The Dynamic plugin should list pipelines similar to the current static page.
The Static page should allow users to override task and sidecar task parameters.
The Static page should allow users to control tasks that are set up for manual approval.
The TSSC security and compliance policies should be visible in Dynamic plugin.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
As a DevOps Engineer, I want to add manual approval points in my pipeline so that the pipeline pauses at that point and waits for a manual approval before continuing execution. Manual approvals are commonly used for approving deploying to production or modeling activities that are not automated (e.g. manual testing) in the pipeline.
Acceptance Criteria
Description of problem:
Update the Pipeline/PipelineRun List and Details Pages to acknowledge Custom Task
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Always when Custom Task is used
Steps to Reproduce:
1. Create a Pipeline with a CustomTask like the Approval Task
2. Check the Tasks List in the Pipeline/Run Details Page
3. Check the Progress Bar in the PipelineRun List page
Actual results:
The CustomTask is not recognized: the UI either throws undefined or always shows the Task as pending.
Expected results:
Just like normal Tasks, CustomTasks should be fully integrated into the mentioned pages.
Additional info:
With the OpenShift Pipelines operator 1.2x we added support for a dynamic console plugin to the operator. In the first version it is only responsible for the Dashboard and the Pipeline/Repository Metrics tab. We want to move more and more code to the dynamic plugin and later remove it from the console repository.
Add TaskRun tab in PLR details page using plugin
Enable developers to perform actions in a faster and better way than before.
Developers will be able to reduce the time or clicks spent on the UI to perform specific set of actions.
Search->Resources option should show 5 of the recently searched resources across all sessions by the user.
The recently searched resources should be clearly visible and separated from the rest.
Pinning resources capability should be removed.
Getting Started menu on Add page can be collapsed and expanded
Use Cases (Optional):
Developers have to repeatedly search for resources and pin them separately in order to view details of a resource that they have seen in the past.
Provide developers with the ability to see the last 5 resources that they have seen in the past so they can quickly view their details without any further actions.
Provides a better user experience for developers when using the console.
As a user, I want the console to remember the resources I have recently searched so that I don't have to type the names of the same resources I use frequently in the Search Page.
The Getting Started menu on the Add page cannot be restored after users click the "X" symbol to hide it.
Users should be able to collapse and expand the Getting Started menu on Add Page.
The current behavior causes confusion.
As of now, the Getting Started section can be hidden and enabled again using a button, but users can close that button and then the section will not show again. This is confusing for users. So add an expandable section instead of the hide/show button, similar to the Functions List page.
Check Functions list page in Dev perspective for the design
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision OpenShift on the existing supported providers' infrastructure without the use of Terraform.
This feature will be used to track all the CAPI preparation work that is common for all the supported providers
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
I want hack/build.sh to embed the kube-apiserver and etcd dependencies in openshift-install without making external network calls so that ART/OSBS can build the installer with CAPI dependencies.
Description of criteria:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
PoC & design for running CAPI control plane using binaries.
Fit provisioning via the CAPI system into the infrastructure.Provider interface
so that:
The 100.88.0.0/14 IPv4 subnet is currently reserved for the transit switch in OVN-Kubernetes for east-west traffic in the OVN Interconnect architecture. We need to make this value configurable so that users can avoid conflicts with their local infrastructure. We need to support this config both prior to installation and post installation (day 2).
This epic will include stories for the upstream ovn-org work, getting that work downstream, an api change, and a cno change to consume the new api
After the upstream PR merges, it needs to get into openshift ovn-k via a downstream merge.
Allow setting custom tags to machines created during the installation of an OpenShift cluster on vSphere.
Just as labeling is important in Kubernetes for organizing API objects and compute workloads (pods/containers), the same is true for the Kube/OCP node VMs running on the underlying infrastructure in any hosted or cloud platform.
Reporting, auditing, troubleshooting and internal organization processes all require ways of easily filtering on and referencing servers by naming, labels or tags. Ensuring appropriate tagging is added to all OCP nodes in VMware ensures those troubleshooting, reporting or auditing can easily identify and filter Openshift node VMs.
For example we can use tags for prod vs. non-prod, VMs that should have backup snapshots vs. those that shouldn't, VMs that fall under certain regulatory constraints, etc.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
As an OpenShift administrator, when I run must-gather in a large cluster the output tends to be gigabytes in size, which fills up my master node. I want the ability to define the time window in which the problem happened so that we gather targeted logs that take less space.
Have "--since" and "--until" options in must-gather.
Epic Goal*
As an OpenShift administrator, when I run must-gather in a large cluster the output tends to be gigabytes in size, which fills up my master node. I want the ability to define the time window in which the problem happened so that we gather targeted logs that take less space.
Why is this important? (mandatory)
Reduce must-gather size
Scenarios (mandatory)
Must-gather running over a cluster with many logs can produce tens of GBs of data in cases where only a few MBs are needed. Such a huge must-gather archive takes too long to collect and too long to upload, which makes the process impractical.
Dependencies (internal and external) (mandatory)
The default must-gather images need to implement this functionality. Custom images will be asked to implement the same.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Must-gather contains only logs from the requested time interval.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Must-gather currently does not allow customers to limit the amount of data collected, which can lead to collecting tens of GBs even when only a limited set of data (e.g. for the last 2 days) is required, in some cases ending with a master node down.
Suggested solution:
Acceptance criteria:
rotated-pod-logs does not interact with since/since-time. This causes inspect (and must-gather) outputs to increase considerably in size, even when users attempt to filter them by using time constraints.
Suggested solution:
Acceptance criteria:
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview due to the following reasons:
(1) Low customer interest in using OpenShift on Alibaba Cloud
(2) Removal of Terraform usage
(3) MAPI to CAPI migration
(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Impacted areas based on CI:
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Self-managed |
Classic (standalone cluster) | Classic (standalone) |
Hosted control planes | N/A |
Multi node, Compact (three node), or Single node (SNO), or all | All |
Connected / Restricted Network | All |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All |
Operator compatibility | N/A |
Backport needed (list applicable versions) | N/A |
Other (please specify) | N/A |
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Epic Goal*
Per OCPSTRAT-1042, we are removing Alicloud IPI/UPI support in 4.16 and removing the code in 4.16. This epic tracks the necessary actions to remove the Alicloud Disk CSI driver.
Why is this important? (mandatory)
Since we are removing Alicloud as a supported provider, we need to clean up all the storage artefacts related to Alicloud.
Dependencies (internal and external) (mandatory)
Alicloud IPI/UPI support removal must be confirmed. IPI/UPI code should be removed.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview due to the following reasons:
(1) Low customer interest in using OpenShift on Alibaba Cloud
(2) Removal of Terraform usage
(3) MAPI to CAPI migration
(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)
Impacted areas based on CI:
alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
machine-config-operator/openshift-machine-config-operator-release-4.16.yaml
Enable the OCP Console to send back user analytics to our existing endpoints in console.redhat.com. Please refer to the doc for details of what we want to capture in the future:
Collect desired telemetry of user actions within OpenShift console to improve knowledge of user behavior.
OpenShift console should be able to send telemetry to a pre-configured Red Hat proxy that can be forwarded to 3rd party services for analysis.
User analytics should respect the existing telemetry mechanism used to disable data being sent back
Need to update existing documentation with what user data we track from the OCP Console: https://docs.openshift.com/container-platform/4.14/support/remote_health_monitoring/about-remote-health-monitoring.html
Capture and send desired user analytics from OpenShift console to Red Hat proxy
Red Hat proxy to forward telemetry events to appropriate Segment workspace and Amplitude destination
Use existing setting to opt out of sending telemetry: https://docs.openshift.com/container-platform/4.14/support/remote_health_monitoring/opting-out-of-remote-health-reporting.html#opting-out-remote-health-reporting
Also, allow disabling just user analytics without affecting the rest of telemetry: add an annotation to the Console to disable just user analytics.
Update docs to show this method as well.
We will require a mechanism to store all the segment values
We need to be able to pass back orgID that we receive from the OCM subscription API call
Sending telemetry from OpenShift cluster nodes
Console already has support for sending analytics to segment.io in Dev Sandbox and OSD environments. We should reuse this existing capability, but default to http://console.redhat.com/connections/api for analytics and http://console.redhat.com/connections/cdn to load the JavaScript in other environments. We must continue to allow Dev Sandbox and OSD clusters a way to configure their own segment key, whether telemetry is enabled, segment API host, and other options currently set as annotations on the console operator configuration resource.
Console will need a way to determine the org-id to send with telemetry events. Likely the console operator will need to read this from the cluster pull secret.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
The console telemetry plugin needs to send data to a new Red Hat ingress point that will then forward it to Segment for analysis.
Goal:
Update console telemetry plugin to send data to the appropriate ingress point.
Ingress point created for console.redhat.com
The console telemetry plugin needs to send data to a new Red Hat ingress point that will then forward it to Segment for analysis.
For that the telemetry-console-plugin must have options to configure where it loads the analytics.js and where to send the API calls (analytics events).
Graduate the new PV access mode ReadWriteOncePod to GA.
Such PV/PVC can be used only in a single pod on a single node compared to the traditional ReadWriteOnce access mode, where such a PV/PVC can be used on a single node by many pods.
The customers can start using the new ReadWriteOncePod access mode.
This new mode allows customers to provision and attach PV and get the guarantee that it cannot be attached to another local pod.
This new mode should support the same operations as regular ReadWriteOnce PVs; therefore, it should pass the regression tests. We should also ensure that this PV can't be accessed by another local-to-node pod.
As a user I want to attach a PV to a pod and ensure that it can't be accessed by another local pod.
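A minimal Go sketch of a claim requesting the new access mode is below; the PVC name and storage class are illustrative, and the VolumeResourceRequirements field assumes a recent k8s.io/api (v0.29+), where it replaced ResourceRequirements.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	sc := "standard-csi" // assumed CSI storage class name
	pvc := corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "app-data"},
		Spec: corev1.PersistentVolumeClaimSpec{
			// ReadWriteOncePod restricts the volume to a single pod on a single node,
			// unlike ReadWriteOnce which only restricts it to a single node.
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOncePod},
			StorageClassName: &sc,
			Resources: corev1.VolumeResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("10Gi"),
				},
			},
		},
	}
	fmt.Println(pvc.Name, pvc.Spec.AccessModes)
}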
We are getting this feature from upstream as GA. We need to test it and fully support it.
Check that there is no limitations / regression.
Remove tech preview warning. No additional change.
N/A
Support upstream feature "New RWO access mode" in OCP as GA, i.e. test it and have docs for it.
This is continuation of STOR-1171 (Beta/Tech Preview in 4.14), now we just need to mark it as GA and remove all TechPreview notes from docs.
Currently it's not possible to modify the BMC address of a BareMetalHost after it has been set.
Users need to be able to update the BMC entries in the BareMetalHost objects after they have been created.
There are at least two scenarios in which this causes problems:
1. When deploying baremetal clusters with the Assisted Installer, Metal3 is deployed and BareMetalHost objects are created but with empty BMC entries. We can modify these BMC entries after the cluster has come up to bring the nodes "under management", but if you make a mistake like I did with the URI then it's not possible to fix that mistake.
2. BMC addresses may change over time - whilst it may appear unlikely, network readdressing, or changes in DNS names may mean that BMC addresses will need to be updated.
Currently it's not possible to modify the BMC address of a BareMetalHost after it has been set. It's understood that this was an initial design choice, and there's a webhook that prevents any modification of the object once it has been set for the first time. There are at least two scenarios in which this causes problems:
1. When deploying baremetal clusters with the Assisted Installer, Metal3 is deployed and BareMetalHost objects are created but with empty BMC entries. We can modify these BMC entries after the cluster has come up to bring the nodes "under management", but if you make a mistake like I did with the URI then it's not possible to fix that mistake.
2. BMC addresses may change over time - whilst it may appear unlikely, network readdressing, or changes in DNS names may mean that BMC addresses will need to be updated.
Thanks!
Currently, the baremetal operator does not allow the BMC address of a node to be updated after BMH creation. This ability needs to be added in BMO.
Volume Group Snapshots is a key new Kubernetes storage feature that allows multiple PVs to be grouped together and snapshotted at the same time. This enables customers to takes consistent snapshots of applications that span across multiple PVs.
This is also a key requirement for backup and DR solutions.
https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/
https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3476-volume-group-snapshot
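For illustration, a hedged sketch of what such a group snapshot object looks like, built as an unstructured object since the typed client may not be vendored; the group/version and field names follow the upstream v1alpha1 proposal and may change as the feature graduates.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

func main() {
	// Groups every PVC labelled app=my-db into one consistent snapshot.
	// API group/version and field names are assumptions based on the upstream alpha API.
	vgs := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "groupsnapshot.storage.k8s.io/v1alpha1",
		"kind":       "VolumeGroupSnapshot",
		"metadata":   map[string]interface{}{"name": "my-db-group-snapshot"},
		"spec": map[string]interface{}{
			"volumeGroupSnapshotClassName": "csi-group-snapclass",
			"source": map[string]interface{}{
				"selector": map[string]interface{}{
					"matchLabels": map[string]interface{}{"app": "my-db"},
				},
			},
		},
	}}
	fmt.Println(vgs.GetName())
}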
Productise the volume group snapshots feature as tech preview: have docs and testing, as well as a feature gate to enable it, in order for customers and partners to test it in advance.
The feature should be graduated beta upstream to become TP in OCP. Tests and CI must pass and a feature gate should allow customers and partners to easily enable it. We should identify all OCP shipped CSI drivers that support this feature and configure them accordingly.
CSI drivers development/support of this feature.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Drivers must support this feature and enable it. Partners may need to change their operator and/or doc to support it.
Document how to enable the feature, what this feature does and how to use it. Update the OCP driver's table to include this capability.
Can be leveraged by ODF and OCP virt, especially around backup and DR scenarios.
Epic Goal*
Create an OCP feature gate that allows customers and partners to use the VolumeGroupSnapshot feature while the feature is in alpha & beta upstream.
Why is this important? (mandatory)
Volume group snapshot is an important feature for ODF, OCP virt and backup partners. It requires driver support so partners need early access to the feature to confirm their driver works as expected before GA. The same applies to backup partners.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
This depends on the drivers' support; the feature gate will enable it in the drivers that support it (OCP-shipped drivers).
The feature gate should
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
By enabling the feature gate, partners should be able to use the VolumeGroupSnapshot API. Non-OCP-shipped drivers may need to be configured.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Support the SMB CSI driver through an OLM operator as tech preview. The SMB CSI driver allows OCP to consume SMB/CIFS storage with a dynamic CSI driver. This enables customers to leverage their existing storage infrastructure with either SAMBA or Microsoft environment.
https://github.com/kubernetes-csi/csi-driver-smb
Customers can start testing connecting OCP to their backends exposing CIFS. This allows them to consume net-new volumes or to consume existing data produced outside OCP.
The driver already exists and is under the storage SIG umbrella. We need to make sure the driver meets OCP quality requirements and, if so, develop an operator to deploy and maintain it.
Review and clearly define all driver limitations and corner cases.
Review the different authentication methods.
Windows containers support.
Only the storage class login/password authentication method is in scope. Other methods can be reviewed and considered for GA.
Customers are expecting to consume storage and possibly existing data via SMB/CIFS. As of today, vendors' driver support for CIFS is quite limited, whereas this protocol is widely used on premises, especially with MS/AD customers.
Need to understand what customers expect in terms of authentication.
How to extend this feature to windows containers.
Document the operator and driver installation, usage capabilities and limitations.
Future: How to manage interoperability with windows containers (not for TP)
Add a new provider here:
https://github.com/openshift/api/blob/6d48d55c0598ec78adacdd847dcf934035ec2e1b/operator/v1/types_csi_cluster_driver.go#L86
It should be "smb.csi.k8s.io", from the CSIDriver manifest:
https://github.com/kubernetes-csi/csi-driver-smb/blob/master/deploy/csi-smb-driver.yaml
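A sketch of the kind of addition implied, assuming the new entry follows the pattern of the existing driver constants in that file; the constant name is illustrative and the final name may differ.

package v1

// CSIDriverName is already defined in types_csi_cluster_driver.go; it is repeated here only
// so the sketch is self-contained.
type CSIDriverName string

const (
	// SambaCSIDriver is the proposed entry for the SMB/CIFS CSI driver; the constant name is
	// illustrative, the value comes from the upstream CSIDriver manifest linked above.
	SambaCSIDriver CSIDriverName = "smb.csi.k8s.io"
)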
HyperShift-provisioned clusters, regardless of the cloud provider, support the proposed OLM-managed integration outlined in OCPBU-559 and OCPBU-560.
There is no degradation in capability or coverage of OLM-managed operators supporting short-lived token authentication on clusters that are lifecycled via HyperShift.
Currently, Hypershift lacks support for CCO.
Currently, Hypershift will be limited to deploying clusters in which the cluster core operators are leveraging short-lived token authentication exclusively.
If we are successful, no special documentation should be needed for this.
Outcome Overview
Operators on guest clusters can take advantage of the new tokenized authentication workflow that depends on CCO.
Success Criteria
CCO is included in HyperShift and its footprint is minimal while meeting the above outcome.
Expected Results (what, how, when)
Post Completion Review – Actual Results
After completing the work (as determined by the "when" in Expected Results above), list the actual results observed / measured during Post Completion review(s).
Every guest cluster should have a running CCO pod with its kubeconfig attached to it.
Enhancement doc: https://github.com/openshift/enhancements/blob/master/enhancements/cloud-integration/tokenized-auth-enablement-operators-on-cloud.md
The etcd CA must be rotatable both on demand and automatically when expiry approaches.
Requirements (aka. Acceptance Criteria):
Deliver rotation and recovery requirements from OCPSTRAT-714
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
This spike explores using the library-go cert rotation utils in the etcd-operator to replace or augment the existing etcdcertsigner controller.
https://github.com/openshift/library-go/blob/master/pkg/operator/certrotation/client_cert_rotation_controller.go
https://github.com/openshift/cluster-etcd-operator/pull/1177
The goal of this spike is to evaluate if the library-go cert rotation util gives us rotation capabilities for the signer cert along with the peer and server certs.
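For orientation, a hedged sketch of the library-go building blocks the spike would wire together; the namespaces, names, validity/refresh periods and field names are assumptions to verify against the vendored certrotation package, not a working configuration.

package main

import (
	"fmt"
	"time"

	"github.com/openshift/library-go/pkg/operator/certrotation"
)

func main() {
	nodeHostnames := []string{"etcd.openshift-etcd.svc"} // placeholder SANs

	// Signing CA secret: rotated automatically as the Refresh window (or expiry) approaches.
	signer := certrotation.RotatedSigningCASecret{
		Namespace: "openshift-etcd",
		Name:      "etcd-signer",
		Validity:  10 * 365 * 24 * time.Hour,
		Refresh:   8 * 365 * 24 * time.Hour,
	}
	// Leaf serving cert signed by the CA above; this is the part that could replace or
	// augment the certificates the etcdcertsigner controller generates today.
	serving := certrotation.RotatedSelfSignedCertKeySecret{
		Namespace:   "openshift-etcd",
		Name:        "etcd-serving",
		Validity:    3 * 365 * 24 * time.Hour,
		Refresh:     30 * 24 * time.Hour,
		CertCreator: &certrotation.ServingRotation{Hostnames: func() []string { return nodeHostnames }},
	}
	fmt.Println(signer.Name, serving.Name)
}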
There are a couple of issues to explore with the use of the library-go cert signer controller:
Refactoring in ETCD-512 does not clean up certificates that are dynamically generated. Imagine you're recreating all your master nodes every day; we would create new peer/serving/metrics certificates for each node and never clean them up.
We should try to be conservative when cleaning them up, so keep them around for a certain retention period (7-10 days?) after the node went away.
AC:
Testing in ETCD-512 revealed that CEO does not react to changes in the CA bundle or the client certificates.
The current mounts are defined here:
https://github.com/openshift/cluster-etcd-operator/blob/60b7665a26610a095722d3b12b2bb08dcae6965f/manifests/0000_20_etcd-operator_06_deployment.yaml#L90-L106
A simple fix would be to watch the respective resources in a controller and exit the container on changes. This is how we did it with feature gates as well: (https://github.com/openshift/cluster-etcd-operator/blob/60b7665a26610a095722d3b12b2bb08dcae6965f/pkg/operator/starter.go#L174C1-L174C1)
If hot reload were feasible we should take a look at it, but it seems like a larger refactoring.
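A minimal sketch of the "exit on change" approach follows, assuming a plain client-go informer filtered to the secret backing the mounted client certificate; the namespace and secret name are illustrative.

package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch only the secret that backs the mounted client certificate (name is illustrative).
	factory := informers.NewSharedInformerFactoryWithOptions(client, 10*time.Minute,
		informers.WithNamespace("openshift-etcd-operator"),
		informers.WithTweakListOptions(func(o *metav1.ListOptions) {
			o.FieldSelector = "metadata.name=etcd-client"
		}))

	factory.Core().V1().Secrets().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// Exiting lets the kubelet restart the container, which re-reads the mounted certs.
			klog.Info("etcd client certificate changed, restarting to pick it up")
			os.Exit(0)
		},
	})

	ctx := context.Background()
	factory.Start(ctx.Done())
	<-ctx.Done()
}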
AC:
Refactoring in ETCD-512 creates a rotation due to incompatible certificate creation processes. We should update the render [1] the same way the controller manages the certificates. Keep in mind that important information is always stored in annotations, which means we also need to update the manifest template itself (just exchanging file bytes isn't enough).
AC:
[1] https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/cmd/render/render.go#L347-L365
Include the version of the oc binary used, and the logs generated by the command into the must-gather directory when running the oc adm must-gather command.
The version of the oc binary could be included in the oc adm must-gather output [1], and if it differs by 2 or more minor versions from the running cluster, a warning should be shown.
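A small sketch of the minor-version skew check described above; version parsing is deliberately simplistic here and the real implementation would reuse oc's existing version handling.

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minorSkew returns the difference between the minor versions of the oc client and the
// cluster, e.g. ("4.14.8", "4.16.0") -> 2.
func minorSkew(clientVersion, clusterVersion string) int {
	minor := func(v string) int {
		parts := strings.Split(v, ".")
		if len(parts) < 2 {
			return 0
		}
		m, _ := strconv.Atoi(parts[1])
		return m
	}
	d := minor(clientVersion) - minor(clusterVersion)
	if d < 0 {
		d = -d
	}
	return d
}

func main() {
	if skew := minorSkew("4.14.8", "4.16.0"); skew >= 2 {
		fmt.Printf("warning: oc client is %d minor versions away from the cluster; collected data may be incomplete\n", skew)
	}
}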
1. Proposed title of this feature request
Include additional info into must-gather directory
2. What is the nature and description of the request?
Include the version of the oc binary used, and the logs generated by the command into the must-gather directory when running the oc adm must-gather command.
The version of the oc binary can help to identify if an issue could be caused because the oc version is different than the version of the cluster.
The logs generated by the oc adm must-gather command will help to identify if some information could not be collected, and also the exact image used
4. List any affected packages or components.
oc
Include logs generated by the command into the must-gather directory when running the oc adm must-gather command.
Add support for standalone secondary networks for HCP kubevirt.
Advanced multus integration involves the following scenarios
1. Secondary network as single interface for VM
2. Multiple Secondary Networks as multiple interfaces for VM
Users of HCP KubeVirt should be able to create a guest cluster that is completely isolated on a secondary network outside of the default pod network.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | self-managed |
Classic (standalone cluster) | na |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | na |
Connected / Restricted Network | yes |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86 |
Operator compatibility | na |
Backport needed (list applicable versions) | na |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | na |
Other (please specify) | na |
ACM documentation should include how to configure secondary standalone networks.
This is a continuation of CNV-33392.
Multus Integration for HCP KubeVirt has three scenarios.
1. Secondary network as single interface for VM
2. Multiple Secondary Networks as multiple interfaces for VM
3. Secondary network + pod network (default for kubelet) as multiple interfaces for VM
Item 3 is the simplest use case because it does not require any additional considerations for ingress and load balancing. This scenario [item 3] is covered by CNV-33392.
Items [1,2] are what this epic is tracking, which we are considering advanced use cases.
When creating a cluster with a secondary network, the ingress is broken since the created service cannot reach the VMs' secondary addresses.
We need to create a controller that manually creates and updates the service endpoints so they always point to the VMs' IPs.
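As a sketch of what that controller would reconcile, the snippet below builds an EndpointSlice for the ingress service whose addresses are the VMs' secondary-network IPs; how those IPs are discovered (for example from the VMI status) is outside the sketch, and the names and port are illustrative.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	discoveryv1 "k8s.io/api/discovery/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildEndpointSlice returns an EndpointSlice whose endpoints are the VMs' secondary IPs.
func buildEndpointSlice(serviceName, namespace string, vmIPs []string, port int32) *discoveryv1.EndpointSlice {
	endpoints := make([]discoveryv1.Endpoint, 0, len(vmIPs))
	for _, ip := range vmIPs {
		endpoints = append(endpoints, discoveryv1.Endpoint{Addresses: []string{ip}})
	}
	portName := "https"
	proto := corev1.ProtocolTCP
	return &discoveryv1.EndpointSlice{
		ObjectMeta: metav1.ObjectMeta{
			Name:      serviceName + "-secondary",
			Namespace: namespace,
			// The well-known label ties the slice to the Service so the dataplane uses it.
			Labels: map[string]string{discoveryv1.LabelServiceName: serviceName},
		},
		AddressType: discoveryv1.AddressTypeIPv4,
		Endpoints:   endpoints,
		Ports:       []discoveryv1.EndpointPort{{Name: &portName, Protocol: &proto, Port: &port}},
	}
}

func main() {
	slice := buildEndpointSlice("apps-ingress", "clusters-guest", []string{"192.168.10.11", "192.168.10.12"}, 443)
	fmt.Println(slice.Name, len(slice.Endpoints))
}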
The default OpenShift installation on AWS uses multiple IPv4 public IPs which Amazon will start charging for starting in February 2024. As a result, there is a requirement to find an alternative path for OpenShift to reduce the overall cost of a public cluster while this is deployed on AWS public cloud.
Provide an alternative path to reduce the new costs associated with public IPv4 addresses when deploying OpenShift on AWS public cloud.
There is a new path for "external" OpenShift deployments on AWS public cloud where the new costs associated with public IPv4 addresses have a minimum impact on the total cost of the required infrastructure on AWS.
Ongoing discussions on this topic are happening in Slack in the #wg-aws-ipv4-cost-mitigation private channel
Usual documentation will be required in case there are any new user-facing options available as a result of this feature.
*Resources which consume public IPv4 addresses: bootstrap, public API NLB, NAT gateways
USER STORY:
DESCRIPTION:
<!--
Provide as many details as possible, so that any team member can pick it up
and start to work on it immediately without having to reach out to you.
-->
Required:
Nice to have:
...
ACCEPTANCE CRITERIA:
ENGINEERING DETAILS:
-
This Feature covers the pending tasks from OCPBU-16 to be covered in openshift-4.14.
Goal: Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.
Problem: There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.
Why is this important: Increased operational simplicity and scale flexibility of the cluster's control plane deployment.
See slack working group: #wg-ctrl-plane-resize
As a user I'd like to be warned when I'm setting up a Control Plane Machine Set for my control plane machines on GCP and I don't have the necessary TargetPools requirement for it to work correctly.
We previously had a validating webhook on the CPMSO for GCP that would check whether the control plane machine provider config set on the CPMS had TargetPools, and would error otherwise.
This unfortunately collides with GCP Private clusters, see https://issues.redhat.com/browse/OCPBUGS-6760.
As such we decided to remove the check until warnings are supported in controller-runtime.
Once those land we can re-add the check and throw a warning in those situations, to still inform the user without disrupting normal operations where that's not an issue.
See: https://redhat-internal.slack.com/archives/C68TNFWA2/p1675105892589279.
Webhook warnings are available as of the 0.16.0 release of controller-runtime.
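A hypothetical sketch of what the re-added check could look like with controller-runtime 0.16 warnings; the GCPProviderSpec type and the decoding helper are stand-ins for the operator's real provider-config handling.

package webhook

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// GCPProviderSpec is a stand-in for the decoded GCP provider config embedded in the CPMS.
type GCPProviderSpec struct {
	TargetPools []string
}

type cpmsValidator struct{}

func (v *cpmsValidator) ValidateCreate(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {
	spec := extractGCPProviderSpec(obj) // assumed helper that decodes the embedded provider config
	if spec != nil && len(spec.TargetPools) == 0 {
		// Warn instead of rejecting: private clusters legitimately have no target pools.
		return admission.Warnings{
			"no target pools found for the control plane machines; new machines will not be added to the API load balancer",
		}, nil
	}
	return nil, nil
}

func extractGCPProviderSpec(obj runtime.Object) *GCPProviderSpec {
	// Decoding of the CPMS provider config is omitted in this sketch.
	return nil
}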
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
As a cluster administrator, I would like to migrate my existing cluster (one that does not currently use Azure AD Workload Identity) to use Azure AD Workload Identity.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Many customers would like to migrate to Azure AD Workload Identity with minimal downtime but have numerous existing clusters and an aversion to supporting two concurrent operational requirements. Therefore they would like to migrate existing Azure clusters to Azure AD Workload Identity in a safe manner after they have been upgraded to a version of OCP supporting that feature (4.14+).
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Provide a documented method for migration to Azure AD Workload Identity for OpenShift 4.14+ with minimal downtime, and without customers having to start over with a new cluster using Azure AD Workload Identity and migrating over their workload. If there is a risk of workload disruption or downtime, we will inform customers of this risk and have them accept it.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Self-managed |
Classic (standalone cluster) | Classic |
Hosted control planes | N/A |
Multi node, Compact (three node), or Single node (SNO), or all | All |
Connected / Restricted Network, or all | All |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All applicable architectures |
Operator compatibility | |
Backport needed (list applicable versions) | 4.14+ |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | N/A |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Goal
Spike to evaluate whether we can provide an automated way to support migration to Azure Managed Identity (preferred), a documented and supported manual method (second option) for customers to perform the migration themselves, or no migration path at all.
This spike will evaluate, scope the level of effort (sizing), and make recommendation on next steps.
Feature request
Support migration to Azure Managed Identity
Feature description
Many customers would like to migrate to Azure Managed Identity but have numerous existing clusters and an aversion to supporting two concurrent operational requirements. Therefore they would like to migrate existing Azure clusters to Managed Identity in a safe manner after they have been upgraded to a version of OCP supporting that feature (4.14+).
Why?
Provide a uniform operational experience for all clusters running versions which support Azure Managed Identity without having to decommission long running clusters
Other considerations
The cloud-credential-operator repository documentation can be updated for installing and/or migrating a cluster with Azure workload identity integration.
Add support for Johannesburg, South Africa (africa-south1) in GCP
As a user I'm able to deploy OpenShift in Johannesburg, South Africa (africa-south1) in GCP and this region is fully supported
A user can deploy OpenShift in GCP Johannesburg, South Africa (africa-south1) using all the supported installation tools for self-managed customers.
The support of this region is backported to the previous OpenShift EUS release.
Google Cloud has added support for a new region in their public cloud offering and this region needs to be supported for OpenShift deployments as other regions.
The information of the new region needs to be added to the documentation so this is supported.
Add support for Dammam, Saudi Arabia, Middle East (me-central2) region in GCP
As a user I'm able to deploy OpenShift in Dammam, Saudi Arabia, Middle East (me-central2) region in GCP and this region is fully supported
A user can deploy OpenShift in GCP Dammam, Saudi Arabia, Middle East (me-central2) region using all the supported installation tools for self-managed customers.
The support of this region is backported to the previous OpenShift EUS release.
Google Cloud has added support for a new region in their public cloud offering and this region needs to be supported for OpenShift deployments as other regions.
The information of the new region needs to be added to the documentation so this is supported.
Enable Hosted Control Planes guest clusters to support up to 500 worker nodes. This enables customers to have clusters with a large number of worker nodes.
Max cluster size 250+ worker nodes (mainly about control plane). See XCMSTRAT-371 for additional information.
Service components should not be overwhelmed by additional customer workloads; they should use larger cloud instances, and when the number of worker nodes is below the threshold they should use smaller cloud instances.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Managed |
Classic (standalone cluster) | N/A |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | N/A |
Connected / Restricted Network | Connected |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_64 ARM |
Operator compatibility | N/A |
Backport needed (list applicable versions) | N/A |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | N/A |
Other (please specify) |
Check with OCM and CAPI requirements to expose larger worker node count.
As a HyperShift controller in the management plane, I want to be able to:
so that I can achieve
Description of criteria:
This requires a design proposal.
This does not require a feature gate.
As the HyperShift scheduler, I want to be able to:
so that I can achieve
As the HyperShift administrator, I want to be able to:
so that I can achieve
Description of criteria:
This requires a design proposal.
This does not require a feature gate.
When a HostedCluster is given a size label, it needs to ensure that request serving nodes exist for that size label and, when they do, reschedule request serving pods to the appropriate nodes.
Description of criteria:
This does not require a design proposal.
This does not require a feature gate.
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift
prerequisite work Goals completed in OCPSTRAT-1122
Complete the design of the Cluster API (CAPI) architecture and build the core operator logic needed for Phase-1, incorporating the assets from different repositories to simplify asset management.
Phase 1 & 2 covers implementing base functionality for CAPI.
There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.
sets up CAPI ecosystem for vSphere
So far we haven't tested this provider at all. We have to run it and see whether there are any issues with it.
Steps:
Outcome:
The vSphere provider is not present in the downstream operator; someone has to add it there.
This will include adding a tech preview e2e job in release repo and going through the process described here https://github.com/openshift/cluster-capi-operator/blob/main/docs/provideronboarding.md
Continue scale testing and performance improvements for ovn-kubernetes
Networking Definition of Planned
Epic Template descriptions and documentation
Manage OpenShift Virtual Machines' IP addresses from within the SDN solution provided by OVN-Kubernetes.
Customers want to offload IPAM from their custom solutions (e.g. custom DHCP server running on their cluster network) to SDN.
Additional information on each of the above items can be found here: Networking Definition of Planned
Investigate options to identify OVN scale problems that usually cause high CPU usage for ovn-controller and vswitchd.
https://docs.google.com/document/d/15PLDLKB9tGnbGYMhdHjlOsvlvXzT9TV7VfKmTJCwMRk/edit
Goal:
Support enablement of dual-stack VIPs on existing clusters created as dual-stack but at a time when it was not possible to have both v4 and v6 VIPs at the same time.
Why is this important?
This is a followup to SDN-2213 ("Support dual ipv4 and ipv6 ingress and api VIPs").
We expect that customers with existing dual stack clusters will want to make use of the new dual stack VIPs fixes/enablement, but it's unclear how this will work because we've never supported modifying on-prem networking configuration after initial deployment. Once we have dual stack VIPs enabled, we will need to investigate how to alter the configuration to add VIPs to an existing cluster.
We will need to make changes to the VIP fields in the Infrastructure and/or ControllerConfig objects. Infrastructure would be the first option since that would make all of the fields consistent, but that relies on the ability to change that object and have the changes persist and be propagated to the ControllerConfig. If that's not possible, we may need to make changes just in ControllerConfig.
For epics https://issues.redhat.com/browse/OPNET-14 and https://issues.redhat.com/browse/OPNET-80 we need a mechanism to change configuration values related to our static pods. Today that is not possible because all of the values are put in the status field of the Infrastructure object.
We had previously discussed this as part of https://issues.redhat.com/browse/OPNET-21 because there was speculation that people would want to move from internal LB to external, which would require mutating a value in Infrastructure. In fact, there was a proposal to put that value in the spec directly and skip the status field entirely, but that was discarded because a migration would be needed in that case and we need separate fields to indicate what was requested and what the current state actually is.
There was some followup discussion about that with Joel Speed from the API team (which unfortunately I have not been able to find a record of yet) where it was concluded that if/when we want to modify Infrastructure values we would add them to the Infrastructure spec and when a value was changed it would trigger a reconfiguration of the affected services, after which the status would be updated.
This means we will need new logic in MCO to look at the spec field (currently there are only fields in the status, so spec is ignored completely) and determine the correct behavior when they do not match. This will mean the values in ControllerConfig will not always match those in Infrastructure.Status. That's about as far as the design has gone so far, but we should keep the three use cases we know of (internal/external LB, VIP addition, and DNS record overrides) in mind as we design the underlying functionality to allow mutation of Infrastructure status values.
Depending on how the design works out, we may only track the design phase in this epic and do the implementation as part of one of the other epics. If there is common logic that is needed by all and can be implemented independently we could do that under this epic though.
Infrastructure.Spec will be modified by end-user. CNO needs to validate those changes and if valid, propagate them to Infrastructure.Status
For clusters that are installed as fresh 4.15 o/installer will propagate Infrastructure.Spec and Infrastructure.Status based on the install-config. However for clusters that are upgraded this code in o/installer will never run.
In order to have a consistent state at upgrade, we will make CNO propagate Status back to Spec when the cluster is upgraded to OCP 4.15.
As we already did this when introducing multiple VIPs (an API change that created a plural field next to the singular one), all the necessary code scaffolding is already in place.
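For illustration, a day-2 VIP addition could then look roughly like the following Infrastructure spec; the exact field names and their location under the platform-specific spec section are an assumption here, not the final API, and the addresses are placeholders:

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
spec:
  platformSpec:
    type: BareMetal
    baremetal:
      apiServerInternalIPs:
        - 192.0.2.10
        - fd2e:6f44:5dd8::10
      ingressIPs:
        - 192.0.2.11
        - fd2e:6f44:5dd8::11

CNO would validate such a change and, if valid, propagate it to Infrastructure.Status (and from there to ControllerConfig) to trigger reconfiguration of the static pods.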
Tasks to do here
Ensure CSI Stack for Azure is running on management clusters with hosted control planes, allowing customers to associate a cluster as "Infrastructure only" and move the following parts of the stack:
This feature enables customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers should care most about their workloads, not about the management stack needed to operate their clusters; this feature gets us closer to that goal.
Non-CSI Stack for Azure-related functionalities are out of scope for this feature.
Workload identity authentication is not covered by this feature - see STOR-1748
This feature is designed to enable customers to run their Azure infrastructure more efficiently and cost-effectively by using HyperShift control planes and supporting infrastructure without incurring additional charges from Red Hat.
Documentation for this feature should provide clear instructions on how to enable the CSI Stack for Azure on management clusters with hosted control planes and associate a cluster as "Infrastructure only." It should also include instructions on how to move the Azure Disk CSI driver, Azure File CSI driver, and Azure File CSI driver operator to the appropriate clusters.
This feature impacts the CSI Stack for Azure and any layered products that interact with it. Interoperability test scenarios should be factored by the layered products.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Run Azure File CSI driver operator + Azure File CSI driver control-plane Pods in the management cluster, run the driver DaemonSet in the hosted cluster allowing customers to associate a cluster as "Infrastructure only".
Why is this important? (mandatory)
This allows customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers should care most about their workloads, not about the management stack needed to operate their clusters; this feature gets us closer to that goal.
Scenarios (mandatory)
When leveraging Hosted control planes, the Azure File CSI driver operator + Azure File CSI driver control-plane Pods should run in the management cluster. The driver DaemonSet should run on the managed cluster. This deployment model should provide the same feature set as the regular OCP deployment.
Dependencies (internal and external) (mandatory)
Hosted control plane on Azure.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
We need to make changes to CSO so that it can run the azure-file operator on HyperShift.
As part of this story, we will simply move building and CI of the existing code to the combined csi-operator.
We need to modify csi-operator so that it can run as the azure-file operator on both HyperShift and standalone clusters.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Run Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods in the management cluster, run the driver DaemonSet in the hosted cluster allowing customers to associate a cluster as "Infrastructure only".
Why is this important? (mandatory)
This allows customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers should care most about their workloads, not about the management stack needed to operate their clusters; this feature gets us closer to that goal.
Scenarios (mandatory)
When leveraging Hosted control planes, the Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods should run in the management cluster. The driver DaemonSet should run on the managed cluster. This deployment model should provide the same feature set as the regular OCP deployment.
Dependencies (internal and external) (mandatory)
Hosted control plane on Azure.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
As part of this epic, Engineers working on Azure Hypershift should be able to build and use Azure Disk storage on hypershift guests via developer preview custom build images.
For this story, we are going to enable deployment of the Azure Disk driver and operator by default in the HyperShift environment.
Add a knob in CNO to allow users to modify the changes made in ovn-k
Make it possible to entirely disable the Ingress Operator by leveraging the OCPPLAN-9638 Composable OpenShift capability.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
RFE: https://issues.redhat.com/browse/RFE-3395
Enhancement PR: https://github.com/openshift/enhancements/pull/1415
API PR: https://github.com/openshift/api/pull/1516
Ingress Operator PR: https://github.com/openshift/cluster-ingress-operator/pull/950
Feature Goal: Make it possible to entirely disable the Ingress Operator by leveraging the Composable OpenShift capability.
Implement the ingress capability focusing on the HyperShift users.
As described in the EP PR.
# ...
* Release Technical Enablement - Provide necessary release enablement details and documents.
* Ingress Operator can be disabled on HyperShift.
# The install-config and ClusterVersion API have been updated with the capability feature.
# The console operator.
* CI - CI is running, tests are automated and merged.
* Release Enablement <link to Feature Enablement Presentation>
* DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
* DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
* DEV - Downstream build attached to advisory: <link to errata>
* QE - Test plans in Polarion: <link or reference to Polarion>
* QE - Automated tests merged: <link or reference to automated tests>
* DOC - Downstream documentation merged: <link to meaningful PR>
Context
HyperShift uses the cluster-version-operator (CVO) to manage part of the ingress operator's payload, namely CRDs and RBAC. The ingress operator's deployment, though, is reconciled directly by the control plane operator. That's why HyperShift projects the ClusterVersion resource into the hosted cluster, and the enabled capabilities have to be set properly to enable/disable the ingress operator's payload and to let users as well as the other operators be aware of the state of the capabilities.
Goal
The goal of this user story is to implement a new capability in the OpenShift API repository. Use this procedure as example.
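For context, this is roughly how a cluster could opt out of the Ingress capability at install time once the new capability exists. The capability name and the use of the existing baselineCapabilitySet/additionalEnabledCapabilities knobs are assumptions based on the enhancement, not the final shape:

capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
    - marketplace
    - openshift-samples

Because the Ingress capability is deliberately not listed, the CVO would skip the ingress operator's payload.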
We need to create CI tests to ensure that, during the live migration, when a cluster has both OVN-K and OpenShift-SDN deployed on some nodes, the cluster network still works as normal. We need to run the disruptive tests as we do for upgrades.
Image and artifact signing is a key part of a DevSecOps model. The Red Hat-sponsored sigstore project aims to simplify signing of cloud-native artifacts and sees increasing interest and uptake in the Kubernetes community. This document proposes to incrementally invest in OpenShift support for sigstore-style signed images and be public about it. The goal is to give customers a practical and scalable way to establish content trust. It will strengthen OpenShift’s security philosophy and value-add in the light of the recent supply chain security crisis.
CRIO
https://docs.google.com/document/d/12ttMgYdM6A7-IAPTza59-y2ryVG-UUHt-LYvLw4Xmq8/edit#
We need to be able to install the HO with external DNS and create HCPs on AKS clusters
CSI pods are failing to create on HCP on an AKS cluster.
% k get pods | grep -v Running
NAME                                      READY   STATUS                       RESTARTS   AGE
csi-snapshot-controller-cfb96bff7-7tc94   0/1     CreateContainerConfigError   0          17h
csi-snapshot-webhook-57f9799848-mlh8k     0/1     CreateContainerConfigError   0          17h
The issue is
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  34m                 default-scheduler  Successfully assigned clusters-brcox-hypershift-arm/csi-snapshot-controller-cfb96bff7-7tc94 to aks-nodepool1-24902778-vmss000001
  Normal   Pulling    34m                 kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:805280104b2cc8d8799b14b2da0bd1751074a39c129c8ffe5fc1b370671ecb83"
  Normal   Pulled     34m                 kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:805280104b2cc8d8799b14b2da0bd1751074a39c129c8ffe5fc1b370671ecb83" in 3.036610768s (10.652709193s including waiting)
  Warning  Failed     32m (x12 over 34m)  kubelet            Error: container has runAsNonRoot and image will run as root (pod: "csi-snapshot-controller-cfb96bff7-7tc94_clusters-brcox-hypershift-arm(45ab89f2-9c00-4afa-bece-7846505edbfc)", container: snapshot-controller)
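The failure indicates the container is admitted with runAsNonRoot set but without an explicit non-root UID, and on an AKS management cluster there is no SCC machinery to inject one. A minimal sketch of the kind of securityContext the control-plane Deployment would have to carry explicitly (the UID value is illustrative, not necessarily what the fix will use):

securityContext:
  runAsNonRoot: true
  runAsUser: 65534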
As a HyperShift CLI user, I want to be able to:
so that I can achieve
Description of criteria:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a hub cluster admin, I want to be able to:
so that I can prevent
Description of criteria:
Implementing managed-only configs is out of scope for this story
This does not require a design proposal.
This does not require a feature gate.
As a hub cluster admin, I want to be able to:
so that I can prevent
Description of criteria:
Implementing managed-only configs is out of scope for this story
This does not require a design proposal.
This does not require a feature gate.
As of OpenShift 4.14, this functionality is Tech Preview for all platforms but OpenStack, where it is GA. This Feature is to bring the functionality to GA for all remaining platforms.
Allow to configure control plane nodes across multiple subnets for on-premise IPI deployments. With separating nodes in subnets, also allow using an external load balancer, instead of the built-in (keepalived/haproxy) that the IPI workflow installs, so that the customer can configure their own load balancer with the ingress and API VIPs pointing to nodes in the separate subnets.
I want to install OpenShift with IPI on an on-premise platform (high priority for bare metal and vSphere) and I need to distribute my control plane and nodes across multiple subnets.
I want to use IPI automation but I will configure an external load balancer for the API and Ingress VIPs, instead of using the built-in keepalived/haproxy-based load balancer that comes with the on-prem platforms.
Customers require using multiple logical availability zones to define their architecture and topology for their datacenter. OpenShift clusters are expected to fit in this architecture for the high availability and disaster recovery plans of their datacenters.
Customers want the benefits of IPI and automated installations (and avoid UPI) and at the same time when they expect high traffic in their workloads they will design their clusters with external load balancers that will have the VIPs of the OpenShift clusters.
Load balancers can distribute incoming traffic across multiple subnets, which is something our built-in load balancers aren't able to do and which represents a big limitation for the topologies customers are designing.
While this is possible with IPI AWS, this isn't available with on-premise platforms installed with IPI (for the control plane nodes specifically), and customers see this as a gap in OpenShift for on-premise platforms.
Epic | Control Plane with Multiple Subnets | Compute with Multiple Subnets | Doesn't need external LB | Built-in LB |
---|---|---|---|---|
NE-1069 (all-platforms) | ✓ | ✓ | ✓ | ✓ |
NE-905 (all-platforms) | ✓ | ✓ | ✓ | ✕ |
✓ | ✓ | ✓ | ✓ | |
✓ | ✓ | ✓ | ✓ | |
✓ | ✓ | ✓ | ||
✓ | ✓ | ✓ | ✕ | |
NE-905 (all platforms) | ✓ | ✓ | ✓ | ✕ |
✓ | ✓ | ✓ | ✓ | |
✕ | ✓ | ✓ | ✓ | |
✕ | ✓ | ✓ | ✓ | |
✕ | ✓ | ✓ | ✓ |
Workers on separate subnets with IPI documentation
We can already deploy compute nodes on separate subnets by preventing the built-in LBs from running on the compute nodes. This is documented for bare metal only for the Remote Worker Nodes use case: https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#configure-network-components-to-run-on-the-control-plane_ipi-install-installation-workflow
This procedure works on vSphere too, albeit without QE CI coverage and without documentation.
External load balancer with IPI documentation
Currently o/installer validates that an external load balancer is used only in TechPreview clusters. This validation needs to be removed so that external load balancers can be consumed as GA.
Goal:
Enable and support Multus CNI for microshift.
Background:
Customers with advanced networking requirements need to be able to attach additional networks to a pod, e.g. for high-performance use cases using SR-IOV or for complex VLAN setups.
Requirements:
Documentation:
Testing:
Customer Considerations:
Out of scope:
It should include all CNIs (bridge, macvlan, ipvlan, etc.) - if we decide to support something, we'll just update scripting to copy those CNIs
It needs to include IPAMs: static, dynamic (DHCP), host-local (we might just not copy it to host)
RHEL9 binaries only to save space
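For reference, attaching a secondary network with the CNIs and IPAM types listed above is done through a NetworkAttachmentDefinition plus a pod annotation; a minimal sketch with illustrative names and addresses:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: bridge-static
spec:
  config: |-
    {
      "cniVersion": "0.4.0",
      "type": "bridge",
      "bridge": "br-test",
      "ipam": {
        "type": "static",
        "addresses": [ { "address": "192.0.2.30/24" } ]
      }
    }

A pod then requests the attachment with the annotation k8s.v1.cni.cncf.io/networks: bridge-static.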
Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for the deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.
With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.
With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn".
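Concretely, the per-namespace configuration uses the standard upstream Pod Security Admission labels; with "restricted" enforcement, an opted-in namespace would carry labels like the following (the namespace name is illustrative):

apiVersion: v1
kind: Namespace
metadata:
  name: example
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted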
Epic Goal
Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.
OpenShift-prefixed namespaces should all define their required PSa labels. Currently, the namespaces that are missing some or all PSa labels are the following:
namespace | in review | merged |
---|---|---|
openshift | ||
openshift-apiserver-operator | PR | |
openshift-cloud-credential-operator | PR | |
openshift-cloud-network-config-controller | PR | |
openshift-cluster-samples-operator | PR | |
openshift-cluster-storage-operator | PR | |
openshift-config OCPBUGS-28621 | PR | |
openshift-config-managed | PR | |
openshift-config-operator | PR | |
openshift-console | PR | |
openshift-console-operator | PR | |
openshift-console-user-settings | PR | |
openshift-controller-manager | PR | |
openshift-controller-manager-operator | PR | |
openshift-dns-operator | PR | |
openshift-etcd-operator | PR | |
openshift-host-network | PR | |
openshift-ingress-canary | PR | |
openshift-ingress-operator | PR | |
openshift-insights | PR | |
openshift-kube-apiserver-operator | PR | |
openshift-kube-controller-manager-operator | PR | |
openshift-kube-scheduler-operator | PR | |
openshift-kube-storage-version-migrator | PR | |
openshift-kube-storage-version-migrator-operator | PR | |
openshift-network-diagnostics | PR | |
openshift-node | ||
openshift-operator-lifecycle-manager | PR | |
openshift-operators | PR | |
openshift-route-controller-manager | PR | |
openshift-service-ca | PR | |
openshift-service-ca-operator | PR | |
openshift-user-workload-monitoring | PR |
When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.
To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).
Each workload should pin the least-privileged SCC that satisfies it, except workloads in runlevel 0 namespaces, which should pin the "privileged" SCC (SCC admission is not enabled in these namespaces, but we should pin an SCC for tracking purposes).
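Pinning is done with the openshift.io/required-scc annotation on the workload's pod template; a minimal sketch (the Deployment name and namespace are illustrative, and other Deployment fields are omitted for brevity):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-operator
  namespace: openshift-example
spec:
  template:
    metadata:
      annotations:
        openshift.io/required-scc: restricted-v2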
The following table tracks progress:
namespace | in review | merged |
---|---|---|
openshift-apiserver-operator | PR | |
openshift-authentication | PR | |
openshift-authentication-operator | PR | |
openshift-catalogd | PR | |
openshift-cloud-controller-manager | ||
openshift-cloud-controller-manager-operator | ||
openshift-cloud-credential-operator | PR | |
openshift-cloud-network-config-controller | PR | |
openshift-cluster-csi-drivers | PR1, PR2 | |
openshift-cluster-machine-approver | ||
openshift-cluster-node-tuning-operator | PR | |
openshift-cluster-olm-operator | PR | |
openshift-cluster-samples-operator | PR | |
openshift-cluster-storage-operator | PR1, PR2 | |
openshift-cluster-version | PR | |
openshift-config-operator | PR | |
openshift-console | PR | |
openshift-console-operator | PR | |
openshift-controller-manager | PR | |
openshift-controller-manager-operator | PR | |
openshift-dns | ||
openshift-dns-operator | ||
openshift-etcd | ||
openshift-etcd-operator | ||
openshift-image-registry | PR | |
openshift-ingress | PR | |
openshift-ingress-canary | PR | |
openshift-ingress-operator | PR | |
openshift-insights | PR | |
openshift-kube-apiserver | ||
openshift-kube-apiserver-operator | ||
openshift-kube-controller-manager | ||
openshift-kube-controller-manager-operator | ||
openshift-kube-scheduler | ||
openshift-kube-scheduler-operator | ||
openshift-kube-storage-version-migrator | PR | |
openshift-kube-storage-version-migrator-operator | PR | |
openshift-machine-api | PR1, PR2, PR3, PR4, PR5, PR6 | |
openshift-machine-config-operator | PR | |
openshift-marketplace | PR | |
openshift-monitoring | PR | |
openshift-multus | ||
openshift-network-diagnostics | PR | |
openshift-network-node-identity | PR | |
openshift-network-operator | ||
openshift-oauth-apiserver | PR | |
openshift-operator-controller | PR | |
openshift-operator-lifecycle-manager | PR | |
openshift-ovn-kubernetes | ||
openshift-route-controller-manager | PR | |
openshift-service-ca | PR | |
openshift-service-ca-operator | PR | |
openshift-user-workload-monitoring | PR |
As a customer of self-managed OpenShift, or as an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and a status API which can be used by cluster admins to monitor the progress. The status command/API should also contain data to alert users about potential issues which can make the update problematic.
Here are common update improvements from customer interactions on Update experience
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
Update docs for UX and CLI changes
Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22
Epic Goal*
Add a new `oc adm upgrade status` command which is backed by an API. Please find the mock output of the command attached in this card.
Why is this important? (mandatory)
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Add node status as mentioned in the sample output to "oc adm upgrade status" in OpenShift Update Concepts
With this output, users will be able to see the state of the nodes that are not part of the master machine config pool and what version of RHCOS they are currently on. I am not sure whether it is possible to also show the corresponding OCP version; if possible, we should display that as well.
=Worker Upgrade=

Worker Pool:   openshift-machine-api/machineconfigpool/worker
Assessment:    Admin required
Completion:    25% (Est Time Remaining: N/A - Manual action required)
Mode:          Manual | Assisted [- No Disruption][- Disruption Permitted][- Scaling Permitted]
Worker Status: 4 Total, 4 Available, 0 Progressing, 3 Outdated, 0 Draining, 0 Excluded, 0 Degraded

Worker Pool Node(s)
NAME                           ASSESSMENT   PHASE     VERSION   EST   MESSAGE
ip-10-0-134-165.ec2.internal   Manual       N/A       4.12.1    ?     Awaiting manual reboot
ip-10-0-135-128.ec2.internal   Complete     Updated   4.12.16   -
ip-10-0-138-184.ec2.internal   Manual       N/A       4.12.1    ?     Awaiting manual reboot
ip-10-0-139-31.ec2.internal    Manual       N/A       4.12.1    ?     Awaiting manual reboot
Implementing RFE-928 would help a number of update-team use-cases:
The updates team is not well positioned to maintain oc access long-term; that seems like a better fit for the monitoring team (who maintain Alertmanager) or the workloads team (who maintain the bulk of oc). But we can probably hack together a proof-of-concept which we could later hand off to those teams, and in the meantime it would unblock our work on tech-preview commands consuming the firing-alert information.
The proof-of-concept could follow the following process:
$ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts ...dump of firing alerts...
and a backing Go function that other subcommands like oc adm upgrade status can consume internally.
Expose 'Result of work' as structured JSON in a ClusterVersion condition (ResourceReconciliationIssues?). This would allow oc to explain what the CVO is having trouble with, without needing to reach around the CVO and look at ClusterOperators directly (avoiding the risk of latency/skew between what the CVO thinks it's waiting on and the current ClusterOperator content). And we could replace the current Progressing string with more information, and iterate quickly on the oc side. Although if we find ways we like more to stringify this content, we probably want to push that back up to the CVO so all clients can benefit.
We want to do this work behind a feature gate OTA-1169.
Note that a condition with a Message containing JSON instead of a human-readable message is against apimachinery conventions and is very poor API design in general. The purpose of this story is simply to experiment with how such an API could look, and it will inform how we build it in the future. Do not worry too much about tech debt; this is exploratory code.
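A purely hypothetical illustration of what such a condition might look like (the condition type, reason, and payload are invented for this exploration):

- type: ResourceReconciliationIssues
  status: "True"
  reason: ResourcesNotReady
  lastTransitionTime: "2024-01-01T00:00:00Z"
  message: '[{"group":"config.openshift.io","kind":"ClusterOperator","name":"machine-config","message":"waiting on Available=True"}]'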
As a PoC (proof of concept), warn about Available=False and Degraded=True ClusterOperators, smoothing flapping conditions (as discussed on a refinement call on Nov 29).
Following Justin's direction from the design ideas, make this easily pluggable for more "warnings".
=Update Health=

SINCE      LEVEL     IMPACT             MESSAGE
3h         Warning   API Availability   High control plane CPU during upgrade
Resolved   Info      Update Speed       Pod is slow to terminate
4h         Info      Update Speed       Pod was slow to terminate
30m        Info      None               Worker node version skew
1h         Info      None               Update worker node
As an OpenShift on vSphere administrator, I want to specify static IP assignments to my VMs.
As an OpenShift on vSphere administrator, I want to completely avoid using a DHCP server for the VMs of my OpenShift cluster.
Customers want the convenience of IPI deployments for vSphere without having to use DHCP. As in bare metal, where METAL-1 added this capability, some of the reasons are the security implications of DHCP (customers report, for example, that depending on configuration it allows any device to join the network). At the same time, IPI deployments only require our OpenShift installation software, while with UPI they would need automation software that, in secure environments, they would have to certify along with OpenShift.
Bare metal related work:
CoreOS Afterburn:
https://github.com/coreos/afterburn/blob/main/src/providers/vmware/amd64.rs#L28
https://github.com/openshift/installer/blob/master/upi/vsphere/vm/main.tf#L34
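A rough sketch of what the static IP input could look like in install-config.yaml for vSphere; the hosts/networkDevice field names follow the Tech Preview API as we understand it and should be treated as an assumption, and the addresses are placeholders:

platform:
  vsphere:
    hosts:
      - role: control-plane
        networkDevice:
          ipAddrs:
            - 192.0.2.20/24
          gateway: 192.0.2.1
          nameservers:
            - 192.0.2.53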
USER STORY:
As a system admin, I would like the static IP support for vSphere to use IPAddressClaims to provide IP addresses during installation, so that after the install the machines are defined in a way that is intended for use with IPAM controllers.
DESCRIPTION:
Currently the installer for vSphere directly sets the static IPs in the machine object YAML files. We would like to enhance the installer to create an IPAddress and IPAddressClaim for each machine, as well as update the MachineSets to use addressesFromPools to request the IPAddress. Also, we should create a custom CRD that is the basis for the pool referenced in the addressesFromPools field.
ACCEPTANCE CRITERIA:
After a vSphere IPI install with static IPs, the cluster should contain the machines, MachineSets, CRD, IPAddresses, and IPAddressClaims related to static IP assignment.
ENGINEERING DETAILS:
These changes should all be contained in the installer project. We will need to be sure to cover static IP for zonal and non-zonal installs. Additionally, we need to have this work for all control-plane and compute machines.
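A sketch of the intended post-install shape of a vSphere MachineSet network device with a pool reference; the pool group/resource names are hypothetical and only illustrate the addressesFromPools mechanism:

network:
  devices:
    - networkName: VM Network
      addressesFromPools:
        - group: installer.openshift.io   # hypothetical pool API group
          resource: ippools
          name: default-ip-pool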
We need to ensure that, when a vSphere static IP IPI install is performed, the masters that are generated are treated as valid machines and do not get recreated by the CPMS operator.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This is part of the overall multi-release Composable OpenShift effort (OCPPLAN-9638), which is being delivered in multiple phases:
Phase 1 (OpenShift 4.11): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators
Phase 2 (OpenShift 4.12): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators
Phase 3 (OpenShift 4.13): OCPBU-117
Phase 4 (OpenShift 4.14): OCPSTRAT-36 (formerly OCPBU-236)
Phase 5 (OpenShift 4.15): OCPSTRAT-421 (formerly OCPBU-519)
Phase 6 (OpenShift 4.16): OCPSTRAT-731
Phase 7 (OpenShift 4.17): OCPSTRAT-1308
Questions to be addressed:
In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment, or an enterprising administrator with some knowledge of OCP Builds could potentially set one up on-cluster.
The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.
On-cluster, automated RHCOS Layering builds are important for multiple reasons:
This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.
Description of problem:
In clusters with OCB functionality enabled, sometimes the machine-os-builder pod is not restarted when we update the imageBuilderType. What we have observed is that the pod is restarted if a build is running, but it is not restarted if we are not building anything.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-09-12-195514   True        False         88m     Cluster version is 4.14.0-0.nightly-2023-09-12-195514
How reproducible:
Always
Steps to Reproduce:
1. Create the configuration resources needed by the OCB functionality. To reproduce this issue we use an on-cluster-build-config configmap with an empty imageBuilderType:

   oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": ""}}'

2. Create an infra pool and label it so that it can use the OCB functionality:

   apiVersion: machineconfiguration.openshift.io/v1
   kind: MachineConfigPool
   metadata:
     name: infra
   spec:
     machineConfigSelector:
       matchExpressions:
         - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
     nodeSelector:
       matchLabels:
         node-role.kubernetes.io/infra: ""

   oc label mcp/infra machineconfiguration.openshift.io/layering-enabled=

3. Wait for the build pod to finish.

4. Once the build has finished and it has been cleaned up, update the imageBuilderType so that we now use the "custom-pod-builder" type:

   oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": "custom-pod-builder"}}'
Actual results:
We waited for one hour, but the pod is never restarted.

$ oc get pods |grep build
machine-os-builder-6cfbd8d5d-xk6c5   1/1   Running   0   56m

$ oc logs machine-os-builder-6cfbd8d5d-xk6c5 |grep Type
I0914 08:40:23.910337       1 helpers.go:330] imageBuilderType empty, defaulting to "openshift-image-builder"

$ oc get cm on-cluster-build-config -o yaml |grep Type
imageBuilderType: custom-pod-builder
Expected results:
When we update the imageBuilderType value, the machine-os-builder pod should be restarted.
Additional info:
Note: This feature will be a TechPreview in 4.16 since the newly introduced API must graduate to v1.
Overarching Goal
Customers should be able to update and boot a cluster without a container registry in disconnected environments. This feature is for bare metal disconnected clusters.
Background
Given enhancement - https://github.com/openshift/enhancements/pull/1481
Design Review Doc: https://docs.google.com/document/d/1-XuHN6_jvJMLULFwwAThfIcHqY32s32lU6m4bx7BiBE/edit
We want to allow the relevant APIs to pin images and make sure those don't get garbage collected.
Here is a summary of what will be required:
Create a GCP cloud-specific spec.resourceTags entry in the Infrastructure CRD. This should create and update tags (labels in GCP) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet, and once the tags in the Infrastructure CRD are changed, all the resources should be updated accordingly.
Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").
Once we are confident we have all components updated, we should introduce an end-to-end test that makes sure we never create resources that are untagged.
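For illustration, the user-defined labels would be supplied once in the Infrastructure resource and then applied to every managed GCP resource; the exact field path under the platform spec is an assumption here, and the key/value pair is a placeholder:

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
spec:
  platformSpec:
    type: GCP
    gcp:
      resourceLabels:
        - key: cost-center
          value: platform-engineering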
Goals
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
The installer creates the below list of GCP resources during the create-cluster phase, and these resources should have the user-defined tags applied.
Resources List
Resource | Terraform API |
---|---|
VM Instance | google_compute_instance |
Storage Bucket | google_storage_bucket |
Acceptance Criteria:
ABI uses the assisted installer plus kube-api, or any other client that communicates with the service; all the building blocks related to day-2 installation exist in those components.
The assisted installer can create an already-installed cluster definition and use it to perform day-2 operations.
A doc that explains how it's done with kube-api
Parameters that are required from the user:
Actions required from the user
To keep the flow similar between day 1 and day 2, I suggest running the service on each node that the user is trying to add; it will create the cluster definition and start the installation, and after the first reboot it will pull the ignition from the day-1 cluster.
Deploy a command to generate a suitable ISO for adding a node to an existing cluster
Not all the required info is provided by the user (in reality, we do want to minimize as much as possible the amount of configuration provided by the user). Some of the required info needs to be extracted from the existing cluster, or from the existing kubeconfig. A dedicated asset could be useful for such an operation.
A new workflow will be required to talk to assisted-service to import an existing cluster / add the node. New services could be required in the ignition assets to handle that properly.
The two commands, one for adding the nodes (ISO generation) and the other for monitoring the process, should be exposed by a new CLI tool (name to be defined) built using the installer source. This task will be used to add the main of the CLI tool and the two (empty) command entry points.
We need to ensure we have parity with OCP and support heterogeneous clusters
https://github.com/openshift/enhancements/pull/1014
Using a multi-arch Node requires the HC to be multi-arch as well. This is a good recipe for letting users shoot themselves in the foot. We need to automate the required input via the CLI for multi-arch NodePools to work, e.g. enabling a multi-arch flag on HC creation which sets the right release image.
Acceptance Criteria:
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a user of multi-arch HyperShift, I would like a CEL validation to be added to the NodePool types to prevent the arch field from being changed from `amd64` when the platform is not supported (AWS is currently the only supported platform).
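A minimal sketch of how such a CEL rule could be expressed on the NodePool spec schema; the exact rule, message, and field paths are assumptions for illustration, not the final validation:

x-kubernetes-validations:
  - rule: "self.platform.type == 'AWS' || !has(self.arch) || self.arch == 'amd64'"
    message: "arch can only be set to a non-amd64 value on supported platforms (currently AWS)"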
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Using a multi-arch Node requires the HC to be multi-arch as well. This is a good recipe for letting users shoot themselves in the foot. We need to automate the required input via the CLI for multi-arch NodePools to work, e.g. enabling a multi-arch flag on HC creation which sets the right release image.
Acceptance Criteria:
Technology Preview of the oc mirror enclaves support (Dev Preview was OCPSTRAT-765 in OpenShift 4.15).
Feature description
oc-mirror already focuses on mirroring content to disconnected environments for installing and upgrading OCP clusters.
This specific feature addresses use cases where mirroring is needed for several enclaves (disconnected environments) that are secured behind at least one intermediate disconnected network.
In this context, enclave users are interested in:
This epic covers the work for RFE-3800 (which includes RFE-3393 and RFE-3733) for mirroring operators and additional images
The full description / overview of the enclave support is best described here
The design document can be found here
Architecture Overview (diagram)
User Stories
All user stories are in the form :
Acceptance Criteria
OCI for ibm
Acceptance Criteria
Tasks
Acceptance Criteria
Tasks
Acceptance Criteria
Tasks
I need an interface and implementation that can read the history file and present it in a map of digests
Clear out all tars before starting the next mirrorToDisk
so that newly generated archives are not mixed with archives generated in previous runs
If the archived content exceeds the max size specified, oc-mirror shall generate as many archive chunks as needed.
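A sketch of how the size limit is expressed in the ImageSetConfiguration; the archiveSize knob (in GiB) follows the existing oc-mirror configuration as we understand it, and the apiVersion, catalog, and package shown are illustrative:

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
archiveSize: 4
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.16
      packages:
        - name: aws-load-balancer-operator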
Acceptance Criteria
Tasks
In order to avoid an increased support overhead once the license changes at the end of the year, we should replace the instances in which metal IPI uses Terraform.
What we do in static mode, we need to do in k8s mode.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision vSphere infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
level=info msg=Running process: vsphere infrastructure provider with args [-v=2 --metrics-bind-addr=0 --health-addr=127.0.0.1:37167 --webhook-port=38521 --webhook-cert-dir=/tmp/envtest-serving-certs-445481834 --leader-elect=false] and env [...]
This may contain sensitive data (passwords, logins, etc.); it should be filtered.
OKD requires the guestinfo domain and stealclock accounting.
Description of problem:
The installer downloads the RHCOS image to the local cache multiple times when using failure domains.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
time="2024-02-22T14:34:42-05:00" level=debug msg="Generating Cluster..." time="2024-02-22T14:34:42-05:00" level=warning msg="FeatureSet \"CustomNoUpgrade\" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster." time="2024-02-22T14:34:42-05:00" level=info msg="Creating infrastructure resources..." time="2024-02-22T14:34:43-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'" time="2024-02-22T14:34:43-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..." time="2024-02-22T14:36:02-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'" time="2024-02-22T14:36:02-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..." time="2024-02-22T14:37:22-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'" time="2024-02-22T14:37:22-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..." time="2024-02-22T14:38:39-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'" time="2024-02-22T14:38:39-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..." time="2024-02-22T14:39:33-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'" time="2024-02-22T14:39:33-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..." time="2024-02-22T14:39:33-05:00" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: unable to initialize folders and templates: failed to import ova: failed to lease wait: The name 'ngirard-dev-pr89z-rhcos-us-west-us-west-1a' already exists."
Expected results:
should only download once
Additional info:
WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster.
INFO Creating infrastructure resources...
DEBUG Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'
DEBUG The file was found in cache: /home/jcallen/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing...
<very long pause here with no indication>
Causes password leaking in CI.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision Azure infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
As a developer, I want to:
so that I can achieve
Description of criteria:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Console enhancements based on customer RFEs that improve customer user experience.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
A guest cluster can use an external OIDC token issuer. This will allow machine-to-machine authentication workflows
A guest cluster can configure OIDC providers to support the current capability: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#openid-connect-tokens and the future capability: https://github.com/kubernetes/kubernetes/blob/2b5d2cf910fd376a42ba9de5e4b52a53b58f9397/staging/src/k8s.io/apiserver/pkg/apis/apiserver/types.go#L164 with an API that
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Description of problem:
In https://issues.redhat.com/browse/OCPBUGS-28625?focusedId=24056681&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24056681 , Seth Jennings states "It is not required to set the oauthMetadata to enable external OIDC".
Today having a chance to try without setting oauthMetadata, hit oc login fails with the error:
$ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1
Console login can succeed, though.
Note: OCM QE also encounters this when using the ocm CLI to test ROSA HCP external OIDC. Whether the fix belongs in oc, in HCP, or elsewhere (as a tester I'm not sure, TBH), it is worth having one; otherwise oc login is affected.
Version-Release number of selected component (if applicable):
[xxia@2024-03-01 21:03:30 CST my]$ oc version --client
Client Version: 4.16.0-0.ci-2024-03-01-033249
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
[xxia@2024-03-01 21:03:50 CST my]$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.ci-2024-02-29-213249   True        False         8h      Cluster version is 4.16.0-0.ci-2024-02-29-213249
How reproducible:
Always
Steps to Reproduce:
1. Launch fresh HCP cluster. 2. Login to https://entra.microsoft.com. Register application and set properly. 3. Prepare variables. HC_NAME=hypershift-ci-267920 MGMT_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/kubeconfig HOSTED_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/hypershift-ci-267920.kubeconfig AUDIENCE=7686xxxxxx ISSUER_URL=https://login.microsoftonline.com/64dcxxxxxxxx/v2.0 CLIENT_ID=7686xxxxxx CLIENT_SECRET_VALUE="xxxxxxxx" CLIENT_SECRET_NAME=console-secret 4. Configure HC without oauthMetadata. [xxia@2024-03-01 20:29:21 CST my]$ oc create secret generic console-secret -n clusters --from-literal=clientSecret=$CLIENT_SECRET_VALUE --kubeconfig $MGMT_KUBECONFIG [xxia@2024-03-01 20:34:05 CST my]$ oc patch hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG --type=merge -p=" spec: configuration: authentication: oauthMetadata: name: '' oidcProviders: - claimMappings: groups: claim: groups prefix: 'oidc-groups-test:' username: claim: email prefixPolicy: Prefix prefix: prefixString: 'oidc-user-test:' issuer: audiences: - $AUDIENCE issuerURL: $ISSUER_URL name: microsoft-entra-id oidcClients: - clientID: $CLIENT_ID clientSecret: name: $CLIENT_SECRET_NAME componentName: console componentNamespace: openshift-console type: OIDC " Wait pods to renew: [xxia@2024-03-01 20:52:41 CST my]$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp ... certified-operators-catalog-7ff9cffc8f-z5dlg 1/1 Running 0 5h44m kube-apiserver-6bd9f7ccbd-kqzm7 5/5 Running 0 17m kube-apiserver-6bd9f7ccbd-p2fw7 5/5 Running 0 15m kube-apiserver-6bd9f7ccbd-fmsgl 5/5 Running 0 13m openshift-apiserver-7ffc9fd764-qgd4z 3/3 Running 0 11m openshift-apiserver-7ffc9fd764-vh6x9 3/3 Running 0 10m openshift-apiserver-7ffc9fd764-b7znk 3/3 Running 0 10m konnectivity-agent-577944765c-qxq75 1/1 Running 0 9m42s hosted-cluster-config-operator-695c5854c-dlzwh 1/1 Running 0 9m42s cluster-version-operator-7c99cf68cd-22k84 1/1 Running 0 9m42s konnectivity-agent-577944765c-kqfpq 1/1 Running 0 9m40s konnectivity-agent-577944765c-7t5ds 1/1 Running 0 9m37s 5. Check console login and oc login. $ export KUBECONFIG=$HOSTED_KUBECONFIG $ curl -ksS $(oc whoami --show-server)/.well-known/oauth-authorization-server { "issuer": "https://:0", "authorization_endpoint": "https://:0/oauth/authorize", "token_endpoint": "https://:0/oauth/token", ... } Check console login, it succeeds, console upper right shows correctly user name oidc-user-test:xxia@redhat.com. Check oc login: $ rm -rf ~/.kube/cache/oc/ $ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080 error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1
Actual results:
Console login succeeds. oc login fails.
Expected results:
oc login should also succeed.
Additional info:
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Description of problem:
Updating oidcProviders does not take effect. See details below.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-02-26-155043
How reproducible:
Always
Steps to Reproduce:
1. Install fresh HCP env and configure external OIDC as steps 1 ~ 4 of https://issues.redhat.com/browse/OCPBUGS-29154 (to avoid repeated typing those steps, only referencing as is here).
2. Pods renewed:
$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp
... network-node-identity-68b7b8dd48-4pvvq 3/3 Running 0 170m oauth-openshift-57cbd9c797-6hgzx 2/2 Running 0 170m kube-controller-manager-66f68c8bd8-tknvc 1/1 Running 0 164m kube-controller-manager-66f68c8bd8-wb2x9 1/1 Running 0 164m kube-controller-manager-66f68c8bd8-kwxxj 1/1 Running 0 163m kube-apiserver-596dcb97f-n5nqn 5/5 Running 0 29m kube-apiserver-596dcb97f-7cn9f 5/5 Running 0 27m kube-apiserver-596dcb97f-2rskz 5/5 Running 0 25m openshift-apiserver-c9455455c-t7prz 3/3 Running 0 22m openshift-apiserver-c9455455c-jrwdf 3/3 Running 0 22m openshift-apiserver-c9455455c-npvn5 3/3 Running 0 21m konnectivity-agent-7bfc7cb9db-bgrsv 1/1 Running 0 20m cluster-version-operator-675745c9d6-5mv8m 1/1 Running 0 20m hosted-cluster-config-operator-559644d45b-4vpkq 1/1 Running 0 20m konnectivity-agent-7bfc7cb9db-hjqlf 1/1 Running 0 20m konnectivity-agent-7bfc7cb9db-gl9b7 1/1 Running 0 20m
3. oc login can succeed:
$ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
Please visit the following URL in your browser: http://localhost:8080
Logged into "https://a4af9764....elb.ap-southeast-1.amazonaws.com:6443" as "oidc-user-test:xxia@redhat.com" from an external oidc issuer. You don't have any projects. Contact your system administrator to request a project.
4. Update HC by changing claim: email to claim: sub:
$ oc edit hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG
... username: claim: sub ...
Update is picked up:
$ oc get authentication.config cluster -o yaml
... spec: oauthMetadata: name: tested-oauth-meta oidcProviders: - claimMappings: groups: claim: groups prefix: 'oidc-groups-test:' username: claim: sub prefix: prefixString: 'oidc-user-test:' prefixPolicy: Prefix issuer: audiences: - 76863fb1-xxxxxx issuerCertificateAuthority: name: "" issuerURL: https://login.microsoftonline.com/xxxxxxxx/v2.0 name: microsoft-entra-id oidcClients: - clientID: 76863fb1-xxxxxx clientSecret: name: console-secret componentName: console componentNamespace: openshift-console serviceAccountIssuer: https://xxxxxx.s3.us-east-2.amazonaws.com/hypershift-ci-267402 type: OIDC status: oidcClients: - componentName: console componentNamespace: openshift-console conditions: - lastTransitionTime: "2024-02-28T10:51:17Z" message: "" reason: OIDCConfigAvailable status: "False" type: Degraded - lastTransitionTime: "2024-02-28T10:51:17Z" message: "" reason: OIDCConfigAvailable status: "False" type: Progressing - lastTransitionTime: "2024-02-28T10:51:17Z" message: "" reason: OIDCConfigAvailable status: "True" type: Available currentOIDCClients: - clientID: 76863fb1-xxxxxx issuerURL: https://login.microsoftonline.com/xxxxxxxx/v2.0 oidcProviderName: microsoft-entra-id
4. Check pods again:
$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp
... kube-apiserver-596dcb97f-n5nqn 5/5 Running 0 108m kube-apiserver-596dcb97f-7cn9f 5/5 Running 0 106m kube-apiserver-596dcb97f-2rskz 5/5 Running 0 104m openshift-apiserver-c9455455c-t7prz 3/3 Running 0 102m openshift-apiserver-c9455455c-jrwdf 3/3 Running 0 101m openshift-apiserver-c9455455c-npvn5 3/3 Running 0 100m konnectivity-agent-7bfc7cb9db-bgrsv 1/1 Running 0 100m cluster-version-operator-675745c9d6-5mv8m 1/1 Running 0 100m hosted-cluster-config-operator-559644d45b-4vpkq 1/1 Running 0 100m konnectivity-agent-7bfc7cb9db-hjqlf 1/1 Running 0 99m konnectivity-agent-7bfc7cb9db-gl9b7 1/1 Running 0 99m
No new pods renewed.
5. Check login again; it does not use "sub", it still uses "email":
$ rm -rf ~/.kube/cache/
$ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
Please visit the following URL in your browser: http://localhost:8080
Logged into "https://xxxxxxx.elb.ap-southeast-1.amazonaws.com:6443" as "oidc-user-test:xxia@redhat.com" from an external oidc issuer. You don't have any projects. Contact your system administrator to request a project.
$ cat ~/.kube/cache/oc/* | jq -r '.id_token' | jq -R 'split(".") | .[] | @base64d | fromjson'
... { ... "email": "xxia@redhat.com", "groups": [ ... ], ... "sub": "EEFGfgPXr0YFw_ZbMphFz6UvCwkdFS20MUjDDLdTZ_M", ...
Actual results:
Steps 4 ~ 5: after editing the HC field value from "claim: email" to "claim: sub", even though `oc get authentication cluster -o yaml` shows the edited change is propagated: 1) pods like kube-apiserver are not renewed; 2) after cleaning up ~/.kube/cache, an `oc login ...` relogin still prints 'Logged into "https://xxxxxxx.elb.ap-southeast-1.amazonaws.com:6443" as "oidc-user-test:xxia@redhat.com" from an external oidc issuer', i.e. it still uses the old claim "email" as the user name instead of the new claim "sub".
Expected results:
Steps 4 ~ 5: Pods like kube-apiserver should be renewed after the HC edit that changes the user claim, and the login should show that the new claim is used as the user name.
Additional info:
Description of problem:
HCP does not honor the oauthMetadata field of hc.spec.configuration.authentication, causing the console to crash and oc login to fail.
Version-Release number of selected component (if applicable):
HyperShift management cluster: 4.16.0-0.nightly-2024-01-29-233218 HyperShift hosted cluster: 4.16.0-0.nightly-2024-01-29-233218
How reproducible:
Always
Steps to Reproduce:
1. Install HCP env. Export KUBECONFIG:
$ export KUBECONFIG=/path/to/hosted-cluster/kubeconfig
2. Create keycloak applications. Then get the route:
$ KEYCLOAK_HOST=https://$(oc get -n keycloak route keycloak --template='{{ .spec.host }}')
$ echo $KEYCLOAK_HOST
https://keycloak-keycloak.apps.hypershift-ci-18556.xxx
$ curl -sSk "$KEYCLOAK_HOST/realms/master/.well-known/openid-configuration" > oauthMetadata
$ cat oauthMetadata
{"issuer":"https://keycloak-keycloak.apps.hypershift-ci-18556.xxx/realms/master"
$ oc create configmap oauth-meta --from-file ./oauthMetadata -n clusters --kubeconfig /path/to/management-cluster/kubeconfig
...
3. Set hc.spec.configuration.authentication:
$ CLIENT_ID=openshift-test-aud
$ oc patch hc hypershift-ci-18556 -n clusters --kubeconfig /path/to/management-cluster/kubeconfig --type=merge -p=" spec: configuration: authentication: oauthMetadata: name: oauth-meta oidcProviders: - claimMappings: ... issuer: audiences: - $CLIENT_ID issuerCertificateAuthority: name: keycloak-oidc-ca issuerURL: $KEYCLOAK_HOST/realms/master name: keycloak-oidc-test type: OIDC "
Check KAS indeed already picks up the setting:
$ oc logs -c kube-apiserver kube-apiserver-5c976d59f5-zbrwh -n clusters-hypershift-ci-18556 --kubeconfig /path/to/management-cluster/kubeconfig | grep "oidc-"
... I0130 08:07:24.266247 1 flags.go:64] FLAG: --oidc-ca-file="/etc/kubernetes/certs/oidc-ca/ca.crt" I0130 08:07:24.266251 1 flags.go:64] FLAG: --oidc-client-id="openshift-test-aud" ... I0130 08:07:24.266261 1 flags.go:64] FLAG: --oidc-issuer-url="https://keycloak-keycloak.apps.hypershift-ci-18556.xxx/realms/master" ...
Wait about 15 mins.
4. Check COs and check oc login. Both show the same error:
$ oc get co | grep -v 'True.*False.*False'
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE console 4.16.0-0.nightly-2024-01-29-233218 True True False 4h57m SyncLoopRefreshProgressing: Working toward version 4.16.0-0.nightly-2024-01-29-233218, 1 replicas available
$ oc get po -n openshift-console
NAME READY STATUS RESTARTS AGE console-547cf6bdbb-l8z9q 1/1 Running 0 4h55m console-54f88749d7-cv7ht 0/1 CrashLoopBackOff 9 (3m18s ago) 14m console-54f88749d7-t7x96 0/1 CrashLoopBackOff 9 (3m32s ago) 14m
$ oc logs console-547cf6bdbb-l8z9q -n openshift-console
I0130 03:23:36.788951 1 metrics.go:156] usage.Metrics: Update console users metrics: 0 kubeadmin, 0 cluster-admins, 0 developers, 0 unknown/errors (took 406.059196ms) E0130 06:48:32.745179 1 asynccache.go:43] failed a caching attempt: request to OAuth issuer endpoint https://:0/oauth/token failed: Head "https://:0": dial tcp :0: connect: connection refused E0130 06:53:32.757881 1 asynccache.go:43] failed a caching attempt: request to OAuth issuer endpoint https://:0/oauth/token failed: Head "https://:0": dial tcp :0: connect: connection refused ...
$ oc login --exec-plugin=oc-oidc --client-id=openshift-test-aud --extra-scopes=email,profile --callback-port=8080
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1
5. Check the root cause; the configured oauthMetadata is not picked up well:
$ curl -k https://a6e149f24f8xxxxxx.elb.ap-east-1.amazonaws.com:6443/.well-known/oauth-authorization-server
{ "issuer": "https://:0", "authorization_endpoint": "https://:0/oauth/authorize", "token_endpoint": "https://:0/oauth/token", ... }
Actual results:
As in steps 4 and 5 above, the configured oauthMetadata is not picked up correctly, causing the console and oc login to hit the error.
Expected results:
The configured oauthMetadata should be picked up correctly, with no errors.
Additional info:
For oc, if I manually use `oc config set-credentials oidc --exec-api-version=client.authentication.k8s.io/v1 --exec-command=oc --exec-arg=get-token --exec-arg="--issuer-url=$KEYCLOAK_HOST/realms/master" ...` instead of using `oc login --exec-plugin=oc-oidc ...`, oc authentication works well. This means my configuration is correct. $ oc whoami Please visit the following URL in your browser: http://localhost:8080 oidc-user-test:xxia@redhat.com
A guest cluster can use an external OIDC token issuer. This will allow machine-to-machine authentication workflows.
A guest cluster can configure OIDC providers to support the current capability: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#openid-connect-tokens and the future capability: https://github.com/kubernetes/kubernetes/blob/2b5d2cf910fd376a42ba9de5e4b52a53b58f9397/staging/src/k8s.io/apiserver/pkg/apis/apiserver/types.go#L164 with an API that
When the internal oauth-server and oauth-apiserver are removed and replaced with an external OIDC issuer (like azure AD), the console must work for human users of the external OIDC issuer.
An end user can use the OpenShift console without a notable difference in experience. This must eventually work on both HyperShift and standalone, but HyperShift is the first priority if it impacts delivery.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
The web console should behave like a generic OIDC client when requesting tokens from an OIDC provider.
The User API may not always be available. Kubernetes now has a stable API to query for user information - https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/3325-self-subject-attributes-review-api. Evaluate whether it can be used and replace all `user/~` calls with it.
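For illustration only, a minimal way to exercise that stable API from the CLI, assuming a recent oc/kubectl that ships SelfSubjectReview support (no cluster-specific names are needed):
$ oc auth whoami
$ oc create -f - -o yaml <<EOF
apiVersion: authentication.k8s.io/v1
kind: SelfSubjectReview
EOF
The returned status.userInfo carries the username and groups that are currently fetched via `user/~`.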
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
The Console needs to be able to authenticate against an external OIDC IDP. For that, the console-operator needs to configure it accordingly.
AC:
Enable a "Break Glass Mechanism" in ROSA (Red Hat OpenShift Service on AWS) and other OpenShift cloud-services in the future (e.g., ARO and OSD) to provide customers with an alternative method of cluster access via short-lived certificate-based kubeconfig when the primary IDP (Identity Provider) is unavailable.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Customers need to be able to request the revocation of the signer cert for their break-glass credentials and know when it's been done. See design here: https://docs.google.com/document/d/19l48HB7-4_8p96b2kvlpYFpbZXCh5eXN_tEmM8QhaIc/edit#heading=h.1a08yt9sno82 and diagram here: https://miro.com/app/board/uXjVNUPFcQA=/
Users of the OpenShift Console leverage a streamlined, visual experience when discovering and installing OLM-managed operators in clusters that run on cloud providers with support for short-lived token authentication enabled. Users intuitively become aware when this is the case and are put on the happy path to configure OLM-managed operators with the necessary information to support GCP Workload Identity Federation (WIF).
Customers do not need to re-learn how to enable GCP WIF authentication support for each and every OLM-managed operator that supports it. The experience is standardized and repeatable, so customers spend less time on initial configuration and more time implementing business value. The process is so easy that OpenShift is perceived as an enabler for an increased security posture.
The OpenShift Console today provides little to no support for configuring OLM-managed operators for short-lived token authentication. Users are generally unaware whether their cluster runs on a cloud provider and is set up to use short-lived tokens for its core functionality, and they are not aware which operators support this by implementing the respective flows defined in OCPSTRAT-922.
Customers may or may not be aware of short-lived token authentication support. They need proper context and pointers to follow-up documentation that explain the general concept and the specific configuration flow the Console supports. It needs to become clear that the Console cannot 100% automate the overall process and that some steps need to be run outside of the cluster/Console using cloud-provider-specific tooling.
https://issues.redhat.com/browse/PORTENABLE-525 adds new annotations, including features.operators.openshift.io/token-auth-gcp, but console does not yet support this new annotation. We need to determine what changes are necessary to support this new annotation and implement them.
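For illustration, an annotation like this is typically surfaced on the operator's ClusterServiceVersion so the Console can detect it; the CSV name and the "true" value format shown here are assumptions, not confirmed by this story:
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: example-operator.v1.0.0
  annotations:
    features.operators.openshift.io/token-auth-gcp: "true"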
AC:
https://issues.redhat.com/browse/PORTENABLE-525 adds new annotations, including features.operators.openshift.io/tls-profiles, but console does not yet support this new annotation. We need to determine what changes are necessary to support this new annotation and implement them.
AC:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
While monitoring the payload job failures, open a parallel openshift/origin bump.
Note: There is a high chance of job failures in openshift/origin bump until the openshift/kubernetes PR merges as we only update the test and not the actual kube.
Benefit of opening this PR before ocp/k8s merge is to identify and fix the issues beforehand.
Follow the rebase doc[1] and update the spreadsheet[2] that tracks the required commits to be cherry-picked. Rebase the o/k repo with the "merge=ours" strategy as mentioned in the rebase doc.
Save the last commit id in the spreadsheet for future references.
Update the rebase doc if required.
[1] https://github.com/openshift/kubernetes/blob/master/REBASE.openshift.md
[2] https://docs.google.com/spreadsheets/d/10KYptJkDB1z8_RYCQVBYDjdTlRfyoXILMa0Fg8tnNlY/edit#gid=1957024452
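A rough sketch of the "merge=ours" step, assuming the usual git commands; the upstream tag is illustrative and the authoritative procedure is the rebase doc [1]:
$ git remote add upstream https://github.com/kubernetes/kubernetes.git
$ git fetch upstream --tags
# record the merge commit without taking upstream changes, then cherry-pick the carried patches tracked in the spreadsheet [2]
$ git merge -s ours v1.29.0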
Prev. Ref:
https://github.com/openshift/kubernetes/pull/1646
Bump the following libraries, in order, against the latest kube and the dependent libraries:
Prev Ref:
https://github.com/openshift/api/pull/1534
https://github.com/openshift/client-go/pull/250
https://github.com/openshift/library-go/pull/1557
https://github.com/openshift/apiserver-library-go/pull/118
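A minimal sketch of the bump in each repository, assuming standard Go module tooling; the versions shown are illustrative only:
$ go get k8s.io/api@v0.29.1 k8s.io/apimachinery@v0.29.1 k8s.io/client-go@v0.29.1
$ go mod tidy
$ go mod vendor   # only for repositories that vendor their dependencies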
This epic will own all of the usual update, rebase and release chores which must be done during the OpenShift 4.16 timeframe for Custom Metrics Autoscaler, Vertical Pod Autoscaler and Cluster Resource Override Operator
Shepherd and merge operator automation PR https://github.com/openshift/vertical-pod-autoscaler-operator/pull/151
Update operator like was done in https://github.com/openshift/vertical-pod-autoscaler-operator/pull/146
Shepherd and merge operand automation PR https://github.com/openshift/kubernetes-autoscaler/pull/277
Coordinate with cluster autoscaler team on upstream rebase as in https://github.com/openshift/kubernetes-autoscaler/pull/250
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
As a Cluster Administrator, I want to be able to perform an unassisted z-stream rollback, starting from OCP 4.12 and 4.14 standalone OpenShift.
See also:
This feature has no dependency on OpenShift docs. The feature details will only be shared in a KCS.
This feature doesn't block the release of 4.16.
Identify and fill gaps related to CI/CD for HyperShift-ARO integration.
Currently, the resource group created during cluster creation is deleted after 24 hours. We can change this by setting an expiry tag on the resource group once it is created.
The ability to set this expiry tag when creating the resource group, as part of the cluster creation command, would be nice.
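A hedged example of what setting such an expiry tag could look like with the Azure CLI; the tag key, value format, and resource group name are assumptions:
$ az group create --name example-hcp-rg --location eastus --tags expiry=2024-12-31
$ az group update --name example-hcp-rg --set tags.expiry=2024-12-31   # tag an existing resource group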
We want to make sure ARO/HCP development happens while satisfying e2e expectations
There's a running, blocking test for azure in presubmits
Detail about what is specifically not being delivered in the story
This does not require a design proposal.
This does not require a feature gate.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
OCP 4 clusters still maintain pinned boot images. We have numerous clusters installed that have boot media pinned to first-boot images as early as 4.1. In the future these boot images may not be certified by the OEM and may fail to boot on updated datacenter or cloud hardware platforms. These "pinned" boot images should be updatable so that customers can avoid this problem and, better still, scale out nodes with boot media that matches the running cluster version.
This will pick up stories left off from the initial Tech Preview(Phase 1): https://issues.redhat.com/browse/MCO-589
This will be implemented via a global knob in the API. This is required in addition to the feature gate, as we expect customers to still want to toggle this feature when it leaves Tech Preview.
Done when:
This feature is dedicated to enhancing data security and implementing encryption best practices across control planes, etcd, and nodes for HyperShift with Azure. The objective is to ensure that all sensitive data, including secrets, is encrypted, thereby safeguarding against unauthorized access and ensuring compliance with data protection regulations.
As a user of HCP on Azure, I would like to be able to pass a customer-managed key when creating a HC so that the disks for the VMs in the NodePool are encrypted.
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a user of HCP on Azure, I want to be able to provide a DiskEncryptionSet ID to encrypt the OS disks for the VMs in the NodePool so that the data on the OS disks will be protected by encryption.
Description of criteria:
N/A
This feature focuses on the optimization of resource allocation and image management within NodePools. This will include enabling users to specify resource groups at NodePool creation, integrating external DNS support, ensuring Cluster API (CAPI) and other images are sourced from the payload, and utilizing Image Galleries for Azure VM creation.
As ARO/HCP provider, I want to be able to:
so that I can achieve
Description of criteria:
This requires a design proposal so OCM knows where to specify the resource group.
This might require a feature gate in case we don't want it for self-managed.
As ARO/HCP provider, I want to be able to:
so that I can achieve
Description of criteria:
This requires a design proposal so OCM knows where to specify the resource group.
This might require a feature gate in case we don't want it for self-managed.
As HyperShift, I want to use the service principal as the identity type for CAPZ, so that the warning message about using a manual service principal in the capi-provider pod goes away.
Description of criteria:
This does not require a design proposal.
This does not require a feature gate.
As a user of HyperShift, I want to be able to create Azure VMs with ephemeral disks, so that I can achieve higher IOPS.
Note: AKS also defaults to using them.
Description of criteria:
N/A
This does not require a design proposal.
This does not require a feature gate.
Creating an Azure HCP has been broken since the k8s v1.29 bump.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a hosted cluster deployer, I want to be able to:
so that I can achieve
Detail about what is specifically not being delivered in the story
This does not require a design proposal.
This does not require a feature gate.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
The current HCP implementation lets the Azure CCM pod create a load balancer (LB) and public IP address for guest cluster egress. The outbound SNAT is using default port allocation.
ARO HCP needs more control over how the LB is created and set up. Ideally, it would be nice to have CAPZ create and manage the LB. ARO HCP would also like the ability to utilize an LB with user-defined routing (UDR).
Utilizing an LB for guest cluster egress is the better option cost-wise and availability-wise compared to a NAT Gateway. NAT Gateways are more expensive and also zonal.
Investigate the possibility of CAPZ creating and managing a LB for guest cluster egress.
Telecommunications providers look to displace Physical Network Functions (PNFs) with modern Virtual Network Functions (VNFs) at the Far Edge. Single Node OpenShift, as the CaaS layer in the vRAN vDU architecture, must achieve a higher standard with regard to OpenShift upgrade speed and efficiency in comparison to PNFs.
Telecommunications providers currently deploy Firmware-based Physical Network Functions (PNFs) in their RAN solutions. These PNFs can be upgraded quickly due to their monolithic nature and image-based download-and-reboot upgrades. Furthermore they often have the ability to retry upgrades and to rollback to the previous image if the new image fails. These Telcos are looking to displace PNFs with virtual solutions, but will not do so unless the virtual solutions have comparable operational KPIs to the PNFs.
Service (vDU) Downtime is the time when the CNF is not operational and therefore no traffic is passing through the vDU. This has a significant impact as it degrades the customer's service (5G->4G) or there's an outright service outage. These disruptions are scheduled into Maintenance Windows (MW), but the Telecommunications Operators' primary goal is to keep service running, so getting vRAN solutions with OpenShift to near PNF-like Service Downtime is and always will be a primary requirement.
Upgrading OpenShift is only one of many operations that occur during a Maintenance Window. Reducing the CaaS upgrade duration is meaningful to many teams within a Telecommunications Operator's organization, as this duration fits into a larger set of activities that put pressure on the duration time for Red Hat software. OpenShift must reduce the upgrade duration time significantly to compete with existing PNF solutions.
As mentioned above, the Service Downtime disruption duration must be as small as possible, and this includes when there are failures. Hardware failures fall into a category called Break+Fix and are covered by TELCOSTRAT-165. In the case of software, failures must be detected and remediation must occur.
Detection includes monitoring the upgrade for stalls and failures and remediation would require the ability to rollback to the previously well-known-working version, prior to the failed upgrade.
The OpenShift product support terms are too short for Telco use cases, in particular vRAN deployments. The risk of Service Downtime drives Telecommunications Operators to a certify-deploy-and-then-don’t-touch model. One specific request from our largest Telco Edge customer is for 4 years of support.
These longer support needs drive a misalignment with the EUS->EUS upgrade path and drive the requirement that the Single Node OpenShift deployment can be upgraded from OCP X.y.z to any future [X+1].[y+1].[z+1], where X+1 and y+1 are decided by the Telecommunications Operator depending on timing and the desired feature set, and z+1 is determined through Red Hat, vDU vendor, and customer maintenance and engineering validation.
Red Hat is challenged with improving multiple OpenShift Operational KPIs by our telecommunications partners and customers. Improved Break+Fix is tracked in TELCOSTRAT-165 and improved Installation is tracked in TELCOSTRAT-38.
Whatever methodology achieves the above requirements must ensure that the customer has a pleasant experience via RHACM and Red Hat GitOps. Red Hat’s current install and upgrade methodology is via RHACM and any new technologies used to improve Operational KPIs must retain the seamless experience from the cluster management solution. For example, after a cluster is upgraded it must look the same to a RHACM Operator.
Whatever methodology achieves the above requirements must ensure that a technician troubleshooting a Single Node OpenShift deployment has a pleasant experience. All commands issued on the node must return output as it would before performing an upgrade.
Webhooks created during bootstrap-in-place will lead to failure in applying their admission subject resources. A permanent solution will be provided by resolving https://issues.redhat.com/browse/OCPBUGS-28550
We need a temporary workaround to unblock the development. This is easiest to do by changing the validating webhook failure policy to "Ignore".
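A sketch of that temporary workaround, assuming the webhook configuration is patched directly; the configuration name and webhook index are placeholders:
$ oc patch validatingwebhookconfiguration <webhook-config-name> --type=json -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'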
Run tuneD in a container in one-shot mode and read the output kernel arguments, to apply them using a MachineConfig (MC).
This would be run in the bootstrap procedure of the OpenShift Installer, just before the MachineConfigOperator (MCO) procedure here.
Initial considerations: https://docs.google.com/document/d/1zUpcpFUp4D5IM4GbM4uWbzbjr57h44dS0i4zP-hek2E/edit
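For illustration, the kernel arguments produced by the one-shot tuneD run could be rendered into a MachineConfig along these lines; the role, name, and argument values are placeholders, not the actual output:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-tuned-kargs
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  kernelArguments:
  - hugepagesz=1G
  - hugepages=16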
A systemd service that runs on a golden image's first boot and configures the following (see the sketch after this list):
1. networking ( the internal IP address require special attention)
2. Update the hostname (MGMT-15775)
3. Execute recert (regenerate certs, cluster name and base domain, MGMT-15533)
4. Start kubelet
5. Apply the personalization info:
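A rough sketch of what such a first-boot unit could look like; the unit name, script path, and ordering are assumptions, not the actual implementation:
[Unit]
Description=Golden image first-boot reconfiguration (sketch)
ConditionFirstBoot=yes
Before=kubelet.service

[Service]
Type=oneshot
# hypothetical script covering networking, hostname, recert, and personalization
ExecStart=/usr/local/bin/first-boot-reconfigure.sh

[Install]
WantedBy=multi-user.target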
If the answer is "yes", please make sure to check the corresponding option.
The following features depend on this functionality:
In IBI and IBU flows we need a way to change the nodeip-configuration hint file without a reboot and before MCO even starts. In order for MCO to be happy we need to remove this file from its management; to do that we will stop using a machine config and move to ignition.
DPDK applications require dedicated CPUs, isolated from any preemption (other processes, kernel threads, interrupts), and this can be achieved with the "static" policy of the CPU manager: the container resources need to include an integer number of CPUs of equal value in "limits" and "requests". For instance, to get six exclusive CPUs:
spec:
  containers:
  - name: CNF
    image: myCNF
    resources:
      limits:
        cpu: "6"
      requests:
        cpu: "6"
The six CPUs are dedicated to that container. However, non-trivial (meaning real) DPDK applications do not use all of those CPUs, as there is always at least one CPU running a slow path, processing configuration, printing logs (among DPDK coding rules: no syscall in PMD threads, or you are in trouble). Even the DPDK PMD drivers and core libraries include pthreads which are intended to sleep; they are infrastructure pthreads processing link change interrupts, for instance.
Can we envision going with two processes, one with isolated cores and one with the slow-path ones, so we can have two containers? Unfortunately no: going with a multi-process design, where only dedicated pthreads would run in one process, is not an option, as DPDK multi-process is being deprecated upstream and never took off because it never properly worked. Fixing it and changing the DPDK architecture to systematically have two processes is absolutely not possible within a year, and would require all DPDK applications to be re-written. Knowing that the first and current multi-process implementation is a failure, nothing guarantees that a second one would be successful.
The slow-path CPUs only consume a fraction of a real CPU and can safely run on the "shared" CPU pool of the CPU Manager; however, container specifications do not allow requesting two kinds of CPUs, for instance:
spec:
  containers:
  - name: CNF
    image: myCNF
    resources:
      limits:
        cpu_dedicated: "4"
        cpu_shared: "20m"
      requests:
        cpu_dedicated: "4"
        cpu_shared: "20m"
Why do we care about allocating one extra CPU per container?
Let's take a realistic example, based on a real RAN CNF: running 6 containers with dedicated CPUs on a worker node, each with a slow path requiring 0.1 CPUs, means that we waste about 5 CPUs, i.e. 3 physical cores. With real-life numbers:
Intel public CPU price per core is around 150 US$, not even taking into account the ecological aspect of the waste of (rare) materials and the electricity and cooling…
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This issue has been addressed lately by OpenStack.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
N/A
TBD
But this should probably go under NTO as a separate lane.
Feature Overview
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
Support additional AWS Security Group IDs when creating a machine pool in an existing ROSA HCP cluster.
Goals (aka. expected user outcomes)
See the above Goal entry.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
The above requirements will be addressed by the linked HOSTEDCP EPIC so that the OCM API, and by extension clients of the OCM API, can use these capabilities.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
1. Early customers will need this by 03/15 - especially the API to attach 10 additional SG IDs when creating a machine pool.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
1. Scale to Zero (SD-ADR-030) will impact this feature especially creating a node pool during HCP cluster creation.
Currently the behavior is to only use the default security group if no security groups were specified for a NodePool. This makes it difficult to implement additional security groups in ROSA because there is no way to know the default security group on cluster creation. By always appending the default security group, any security groups specified on the NodePool become additional security groups.
As a ROSA HyperShift customer I want to enforce that IMDSv2 is always the default, to ensure that I have the most secure setting by default.
Integration Testing:
Beta:
GA:
GREEN | YELLOW | RED
GREEN = On track, minimal risk to target date.
YELLOW = Moderate risk to target date.
RED = High risk to target date, or blocked and need to highlight potential
risk to stakeholders.
Links to Gdocs, github, and any other relevant information about this epic.
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
The goal of this epic is to guarantee that all pods running within the ACM (Advanced Cluster Management) cluster adhere to Kubernetes Security Context Constraints (SCC). The implementation of a comprehensive SCC compliance checking system will proactively maintain a secure and compliant environment, mitigating security risks.
Ensuring SCC compliance is critical for the security and stability of a Kubernetes cluster.
A customer who is responsible for overseeing the operations of their cluster faces the challenge of maintaining a secure and compliant Kubernetes environment. The organization relies on the ACM cluster to run a variety of critical workloads across multiple namespaces. Security and compliance are top priorities, especially considering the sensitive nature of the data and applications hosted in the cluster.
As an ACM admin, I want to add Kubernetes Security Context Constraints (SCC) V2 options to the component's resource YAML configuration to ensure that the Pod runs with the 'readonlyrootfilesystem' and 'privileged' settings, in order to enhance the security and functionality of our application.
In the resource config YAML, we need to add the follow context:
securityContext:
  privileged: false
  readOnlyRootFilesystem: true
Affected resources:
This epic tracks any part of our codebase / solutions we implemented taking shortcuts.
Whenever a shortcut is taken, we should add a story here so we don't forget to improve it in a safer and more maintainable way.
Maintainability and debuggability, and in general fighting technical debt, are critical to keep velocity and ensure overall high quality.
https://issues.redhat.com/browse/CNF-796
https://issues.redhat.com/browse/CNF-1479
https://issues.redhat.com/browse/CNF-2134
https://issues.redhat.com/browse/CNF-6745
https://issues.redhat.com/browse/CNF-8036
Set of API updates, best practices adoption and cleanup to the e2e suite:
Building CI Images has recently increased in duration, sometimes hitting 2 hours, which causes multiple problems:
More importantly, the build times have gotten to a point where OSBS is failing to build the installer due to timeouts, which is making it impossible for ART to deliver the product or critical fixes.
Create a new repo for the providers, build them into a container image and import the image in the installer container image.
Hopefully this will save resources and decrease build times for CI jobs in Installer PRs.
The external platform was created to allow cloud providers to supply their own integration components (cloud controller manager, etc.) without prior integration into openshift release artifacts. We need to support this new platform in assisted-installer in order to provide a user friendly way to enable such clusters, and to enable new-to-openshift cloud providers to quickly establish an installation process that is robust and will guide them toward success.
Please describe what conditions must be met in order to mark this feature as "done".
If the answer is "yes", please make sure to check the corresponding option.
deprecate
platform:
  type: oci
in favor of
platform:
  type: external
  external:
    platformName: oci
    CloudControllerManager: External
Currently, the UI shows a badge saying dev preview for OCI. We need it to be tech preview for 4.15.
GA date for 4.15 is Feb 27th, but releasing this change a week or so earlier is also fine. We're flexible.
CC Ju Lim Montse Ortega Gallart Tomas Jelinek liat gamliel Adrien Gentil Marcos Entenza Garcia
Allow using late binding via BMH, to support late binding via ZTP, part of RFE-4769
Please describe what conditions must be met in order to mark this feature as "done".
If the answer is "yes", please make sure to check the corresponding option.
In order to allow declarative association of a host with a cluster in the late binding flow, annotation support will be added to the BMH.
The annotation is
bmac.agent-install.openshift.io/cluster-reference.
The annotation value is a JSON-encoded string containing a dictionary with the keys [name, namespace] that correspond to the name and namespace of the cluster deployment.
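Based on the description above, an annotated BareMetalHost would look roughly like this; the host and cluster names are placeholders:
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0
  namespace: my-infraenv-namespace
  annotations:
    bmac.agent-install.openshift.io/cluster-reference: '{"name": "my-cluster", "namespace": "my-cluster-namespace"}'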
Please describe what this feature is going to do.
Please describe what conditions must be met in order to mark this feature as "done".
If the answer is "yes", please make sure to check the corresponding option.
Manage the effort for adding jobs for release-ocm-2.10 on assisted installer
https://docs.google.com/document/d/1WXRr_-HZkVrwbXBFo4gGhHUDhSO4-VgOPHKod1RMKng
Merge order:
Update the BUNDLE_CHANNELS in the Makefile in assisted-service and run bundle generation.
Make sure the statefulsets are not recreated after upgrade.
Recreating a statefulset should be an exception.
In CMO we have static YAML assets that are added to our CMO container image and applied by CMO. We read them once from the file system and after that they are cached in memory.
Currently we use a loose unmarshal, i.e. fields that are superfluous (not part of the type unmarshaled into) are silently dropped.
We should use strict unmarshaling in order to catch config mistakes like https://issues.redhat.com//browse/OCPBUGS-24630.
Additionally we could consider adding the static assets to our CMO binary via Go's embed.
Discussed, probably not worth it.
sigs.k8s.io/yaml offers a k8s compatible UnmarshalStrict.
Go's yaml implementation does not work well with optional fields. See http://web.archive.org/web/20190603050330/http://ghodss.com/2014/the-right-way-to-handle-yaml-in-golang/ for more details.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
NOTE:
Epic Goal
Why is this important?
Scenarios
1. …
Acceptance Criteria
Dependencies (internal and external)
1. …
Previous Work (Optional):
1. …
Open questions::
1. …
Done Checklist
Description of the problem:
On s390x with DHCP, the user should not set a static IP for IPL nodes.
This would lead to duplicate ip settings in the case of the coreos installer.
E.g.: --installer-args '[\"--append-karg\",\"ip=encbdd0:dhcp\",\"--append-karg\",\"rd.neednet=1\",\"--append-karg\",\"ip=10.14.6.3::10.14.6.1:255.255.255.0:master-0.boea3e06.lnxero1.boe:encbdd0:none\",\"--append-karg\",\"nameserver=10.14.6.1\",\"--append-karg\",\"ip=[fd00::3]::[fd00::1]:64::encbdd0:none\",\"--append-karg\",\"nameserver=[fd00::1]\"
How reproducible:
Create a cluster with DHCP and IPL node(s) using a parm line containing static IP settings (IPv4 and/or IPv6).
Steps to reproduce:
1.
2.
3.
Actual results:
The ip karg for the same device is set twice (DHCP and static IP).
Expected results:
If DHCP is used, then no static IP should be present in the parm line.
In the case of an LPAR installation it's possible that the command systemd-detect-virt --vm returns an exit code < 0, even though, when logged into the booted VM and executed from the command line, it returns "none" without an error code.
This leads to an abort of further processing in system_vendor.go, and Manufacturer will not be set to IBM/S390.
To fix this glitch, the code that determines the Manufacturer should be moved before the call to the systemd-detect-virt --vm command.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
Description
This epic tracks install server errors which require investigation into whether the RP or ARO-Installer can be more resilient or return a User Error instead of a Server Error.
This lives under shift improvement as it will reduce the number of incidents we get from customers due to `Internal Server Error` being returned.
How to Create Stories Under Epic
During the weekly SLO/SLA/SLI meeting, we will examine install failures on the cluster installation failure dashboard. We will aggregate the top occurring items which show up as Server Errors, and each story underneath will be an investigation required to figure out root cause and how we can either prevent it, be resilient to failures, or return a User Error.
AS AN ARO SRE
I WANT either
SO THAT
1. Decorate the error in the ARO installer to return a more informative message as to why the SKU was not found.
2. Ensure that the OpenShift installer and the RP are validating the SKU in the same manner
3. If the validation is the same between the Installer and ARO installer, we have the option to remove the ARO installer validation step
Acceptance Criteria
Given: The RP and the ARO Installer validates the SKU in the same manner
When: The RP validates
Then: The ARO Installer does not
Given: The ARO Installer validates additional or improved information validation than the RP
When: The ARO Installer validation fails due to missing SKU (failed validation)
Then: Enhance the log to include the SKU that was not found, providing us with more information to troubleshoot
Breadcrumbs
In the current version, the router does not support loading secrets directly and uses the route resource to load the private key and certificates, exposing the security artifacts.
Acceptance criteria :
Update cluster-ingress-operator to bootstrap router with required featuregates
The cluster-ingress-operator will propagate the relevant Tech-Preview feature gate down to the router. This feature gate will be added as a command-line argument called ROUTER_EXTERNAL_CERTIFICATE to the router and will not be user configurable.
Refer:
Acceptance criteria
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Some changes are needed in the NodePool controller to enable running the Performance Profile controller as part of Node Tuning Operator in the HyperShift hosted control planes to manage node tuning of hosted nodes.
PerformanceProfile objects will be created in the management cluster, embedded into a ConfigMap, and referenced in a field of the NodePool API; the NodePool controller will then handle these objects and create a ConfigMap in the hosted cluster namespace for the Performance Profile controller to read.
More information in the enhancement proposal
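A hedged sketch of that flow; the NodePool field name (tuningConfig) and the ConfigMap data key are assumptions drawn from the existing node-tuning wiring and are not confirmed by this epic:
apiVersion: v1
kind: ConfigMap
metadata:
  name: perfprofile-1
  namespace: clusters
data:
  tuning: |
    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
      name: perfprofile-1
    spec: {}   # trimmed for brevity
---
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: nodepool-1
  namespace: clusters
spec:
  tuningConfig:
  - name: perfprofile-1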
This epic is to track any stories for 4.16 hypershift kubevirt development that do not fit cleanly within a larger effort.
Here are some examples of tasks that this "catch all" epic can capture
This is a followup for CNV-29003 in which we've enabled the NodePoolUpgrade test, but just for replace strategy, in which a node pool release image update triggers a machine rollout resulting in creation of new nodes with the updated RHCOS/kubelet version.
With InPlace strategy, the machines are not getting recreated, but are getting upgraded only with soft-reboot. We've noticed it doesn't work for kubevirt as expected, as the updated node is getting stuck at SchedulingDisabled and the nodepool.status.version is not updated.
This task is to find the root cause, fix it, and make the InPlace nodepool test pass on the presubmit CI.
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
After upgrading Cypress to v13, switching from the Chrome to the Electron browser, and publishing the deprecation of the Chrome browser, we should remove Chrome from the builder image:
https://github.com/openshift/console/blob/master/Dockerfile.builder#L59
Find instances of chrome: https://github.com/search?q=org%3Aopenshift+--browser+%24%7BBRIDGE_E2E_BROWSER_NAME%3A%3Dchrome%7D&type=code
AC:
Once PR #12983 gets merged, the Console application itself will use PatternFly v5 while also providing the following PatternFly v4 packages as shared modules to existing dynamic plugins:
The above-mentioned PR will allow dynamic plugins to bring in their own PatternFly code if the relevant Console-provided shared modules (v4) are not sufficient.
Let's say we have a dynamic plugin that uses PatternFly v5 - since Console only provides v4 implementations of the above shared modules, the plugin would bring in the whole v5 package(s) listed above.
There are two main issues here:
1. CSS consistency
2. Optimal code sharing
This story should address both issues mentioned above.
Acceptance criteria:
AC:
A hasty bug fix resulted in another bug. We should add integration tests that utilize cy.intercept in order to prevent such bugs from happening in the future.
AC:
Recent regressions in the EventStream component have exposed some error-prone patterns. We should refactor the component to remove these patterns. This will harden the component and make it easier to maintain.
AC:
#13679 - how to setup devel env. with Console and plugin servers running locally
#13521 - improve docs on shared modules (CONSOLE-3328)
#13586 - ensure that Console vs. SDK compat table is up-to-date
#13637 - how to migrate from PatternFly 4 to 5 (CONSOLE-3908)
#13637 - using correct MIME types when serving plugin assets - use "text/javascript" for all JS assets to ensure that "X-Content-Type-Options: nosniff" security header doesn't cause JS scripts to be blocked in the browser
#13637 - disable caching of plugin manifest JSON resource
The Lightspeed team wants to integrate with the console via events; Ben Parees is leading the effort. For that we need to update the Events page so the Lightspeed dynamic plugin can create a button and pass a callback to it, which would explain the given event.
AC:
The observability team is doing a similar integration with Alerts:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
A working group has begun migrating common components to @patternfly/react-component-groups. We need to determine that the recently migrated CloseButton component can easily be swapped for the existing one and, if not, identify any issues.
AC:
Rebase openshift/etcd to latest 3.5.11 upstream release.
Rebase openshift/etcd to latest 3.5.12 upstream release.
After Layering and Hypershift GAs in 4.12, we need to remove the code and builds that are no longer associated with mainline OpenShift.
This describes non-customer-facing work.
Links to the effort:
The MCO and ART bits can be done ahead of time (except for the final config knob flip). Then we would merge the openshift/os, openshift/driver-toolkit PRs and do the ART config knob flip at roughly the same time.
This is not a user-visible change.
METAL-119 provides the upstream ironic functionality
They're hardcoding URLs :5050/v1/continue. The easiest way is probably to add this as an alias to both httpd and ironic-proxy.
In SaaS, allow users of assisted-installer UI or API, to install any published OCP version out of a supported list of x.y options.
The feature probably originates from our own team. It will enhance the current workflow we're following, to allow users to selectively install versions in the assisted-installer SaaS.
Until now we had to be contacted by individual users to allow a specific version (usually, it was replaced by us with a newer version). In this case, we would add this version to the relevant configuration file.
It's not possible to quantify the relevant numbers here, because users might be missing certain versions in assisted and just give up on using it. In addition, it's not possible to know if users intended to use a certain "old" version, or if it's just an arbitrary decision.
Osher De Paz can we know how many requests we had for "out-of-supported-list"?
It's essential to include this feature in the UI. Otherwise, users will get very confused about the feature parity between API and UI.
Osher De Paz there will always be features that exist in the API and not in the UI. We usually show in the UI features that are more common and we know that users will be interacting with them.
When the Infrastructure Operator is enabled, automatically import the cluster and enable users to add nodes to the local cluster via the Infrastructure Operator.
Yes, it's new functionality that will need to be documented.
Description of the problem:
Imported local cluster doesn't inherit proxy settings from AI.
How reproducible:
100%
Steps to reproduce:
1. Install 4.15 hub cluster with proxy enabled (IPv6)
2. Install 2.10 ACM
Actual results:
No proxy settings in `local-cluster` ACI.
Expected results:
The `local-cluster` ACI inherits the proxy settings from the AI pod.
I created a SNO cluster through the SaaS. A minor issue prevented one of the ClusterOperators from reporting "Available = true", and after one hour, the install process hit a timeout and was marked as failed. When I found the failed install, I was able to easily resolve the issue and get both the ClusterOperator and ClusterVersion to report "Available = true", but the SaaS no longer cared; as far as it was concerned, the installation failed and was permanently marked as such. It also would not give me the kubeadmin password, which is an important feature of the install experience and tricky to obtain otherwise. A hard timeout can cause more harm than good, especially when applied to a system (openshift) that continuously tries to get to desired state without an absolute concept of an operation succeeding or failing; desired state just hasn't been achieved yet. We should consider softening this timeout to be a warning that installation hasn't achieved completion as quickly as expected, without actively preventing a successful outcome.
Late binding scenario (kube-api):
A user tries to install a cluster with the late binding feature enabled (deleting the cluster will return the hosts to the InfraEnv); the installation times out and the cluster goes into an error state; the user connects to the cluster and fixes the issue.
AI will still think that there is an error in the cluster. If the user tries to perform day-2 operations on a cluster in the error state, it will fail; the only option is to delete the cluster and create another one that is marked as installed, but that will cause the hosts to boot from the discovery ISO.
assisted-service should take care of timeouts.
Then, the API can be removed.
Add an annotation (with one sentence) that explains what the Kube resource is used for; sometimes we cannot or don't want to browse the code to know what the resource is used for, and most of the time the name is not self-explanatory.
I don't think maintaining this will be a big deal as resources' purposes don't change a lot.
There is a standard annotation for this, kubernetes.io/description: https://kubernetes.io/docs/reference/labels-annotations-taints/#description
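For example, using the standard annotation mentioned above (the resource name is a placeholder):
$ oc annotate configmap <name> kubernetes.io/description="One sentence describing what this resource is used for"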
See https://issues.redhat.com/browse/MON-1634?focusedId=22170770&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22170770 for more context.
We don't clearly document which "3rd-party" monitoring APIs are supported. We should compile an exhaustive list of the API services that we officially support:
See RHDEVDOCS-4830 for the context.
https://github.com/rhobs/handbook/pull/59/files
https://github.com/openshift/cluster-monitoring-operator/pull/1631
https://github.com/openshift/origin/pull/27031
https://github.com/openshift/cluster-monitoring-operator/pull/1580
https://github.com/openshift/cluster-monitoring-operator/pull/1552
Require read-only access to Alertmanager in developer view.
https://issues.redhat.com/browse/RFE-4125
Common user should not see alerts in UWM.
https://issues.redhat.com/browse/OCPBUGS-17850
Related ServiceAccounts.
Interconnection diagram in monitoring stack.
https://docs.google.com/drawings/d/16TOFOZZLuawXMQkWl3T9uV2cDT6btqcaAwtp51dtS9A/edit?usp=sharing
None.
In CMO, Prometheus pods have an Oauth-proxy on port 9091 for web access on all paths.
We are going to replace it with kube-rbac-proxy and constrain the access to /api/v1 paths.
The current behavior is to allow access to the Prometheus web server for any user having "get" access to "namespace" resources. We do not have to keep the same logic but have to make sure no regression happens. We may need to use a stub custom resource to authorize both "post" and "get" HTTP requests from certain users.
The Insights component is using this port; figure out how to keep its access after replacing the OAuth proxy: https://github.com/openshift/insights-operator/blob/master/pkg/controller/const.go
Its service accounts "gather" and "operator" should use the Prometheus endpoint. https://redhat-internal.slack.com/archives/CLABA9CHY/p1701345127689009
In CMO, Alertmanager pods in the openshift-monitoring namespace have an Oauth-proxy on port 9095 for web access on all paths.
We are going to replace it with kube-rbac-proxy and constrain the access to /api/v2 paths.
The current behavior is to allow access to the Alertmanager web server for any user having "get" access to "namespace" resources. We do not have to keep the same logic but have to make sure no regression happens. We may need to use a stubbed custom resource to authorize both "post" and "get" HTTP requests from certain users.
There is a request to allow read-only access to alerts in Developer view. kube-rbac-proxy can facilitate this functionality.
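For reference, a kube-rbac-proxy static authorization stanza matching the current behavior ("get" on "namespace" resources) could look roughly like this; the exact config layout used by CMO is an assumption:
authorization:
  resourceAttributes:
    apiVersion: v1
    resource: namespaces
    verb: get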
Allow OpenShift users to configure audit logs for Metrics Server
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Migrate:
See https://docs.google.com/document/d/1fm2SZs8HroexPQnqI0Ua85Y31-lars8gCsdwSJC3ngo/edit?usp=sharing for more details.
Epic Goal
Why is this important?
Scenarios
1. …
Acceptance Criteria
Dependencies (internal and external)
1. …
Previous Work (Optional):
1. …
Open questions::
1. …
Done Checklist
Currently, various ServiceIDs are created by the create infra command to be used by various cluster operators, and the storage operator's ServiceID is exposed in the guest cluster, which should not happen. We need to reduce the scope of all the ServiceIDs to be specific to the infra resources created for that cluster alone.
Currently a cloud connection is used to connect the PowerVS DHCP private network and the VPC network, in order to create a load balancer in the VPC for the ingress controller.
Cloud connections have a limitation of only 2 per zone.
With Transit Gateway, 5 instances can be created in a zone; it is also a lot faster than a cloud connection since it uses PER.
https://cloud.ibm.com/docs/power-iaas?topic=power-iaas-per
https://cloud.ibm.com/docs/transit-gateway?topic=transit-gateway-getting-started&interface=ui
Epic Goal
Running doc to describe terminologies and concepts which are specific to Power VS - https://docs.google.com/document/d/1Kgezv21VsixDyYcbfvxZxKNwszRK6GYKBiTTpEUubqw/edit?usp=sharing
Epic Goal
`sao04`,`wdc07`,`eu-de-1`, and `eu-de-2` have all added PER capability. Add them to the installer
Networking Definition of Planned
Epic Template descriptions and documentation
Bump the OpenShift router's HAProxy from 2.6 to 2.8.
As a cluster administrator, I want OpenShift to include a recent HAProxy version, so that I have the latest available performance and security fixes.
Additional information on each of the above items can be found here: Networking Definition of Planned
...
1. Perf & Scale
...
1. …
1. …
Push the bump and build the new HAProxy RPM after it has been Perf & Scale tested and reviewed.
Networking Definition of Planned
Epic Template descriptions and documentation
Bump the openshift-router container image to RHEL9 with help from the ART Team.
OpenShift is transitioning to RHEL9 base images and we must move to RHEL9 by 4.16. RHEL9 introduces OpenSSL 3.0. OpenSSL 3.0 has shown
Additional information on each of the above items can be found here: Networking Definition of Planned
...
...
1. …
The origin test "when FIPS is disabled the HAProxy router should serve routes when configured with a 1024-bit RSA key" is failing due to a hardcoded certificate that uses SHA1 as the hash algorithm.
The certificate needs to be updated to use SHA256, as SHA1 is no longer supported.
More details: https://github.com/openshift/router/pull/538#issuecomment-1831925290
Work with the ART team to get https://github.com/openshift-eng/ocp-build-data/pull/3895 merged, verify that everything is working appropriately, and keep our openshift-router CI jobs up to date by using RHEL9, since we disabled automatic base image definition from ART.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Setting up the distgit (for production brew builds) depends on the ART team. This should be tackled early.
The PR to ocp-build-data should also be prioritised, as it blocks the PR to openshift/os. There is a separate CI Mirror used to run CI for openshift/os in order to merge, which can take a day to sync.
As an OpenShift maintainer, I want our build tooling to produce the gcr credential provider plugin so that it can be distributed in RHCOS and used by the kubelet.
We need to ship the gcr credential provider via an RPM so it is available to the kubelet when it first starts.
To ship an RPM we must create a .spec file that describes how the package should be built.
A working example for AWS is provided in this PR: https://github.com/openshift/cloud-provider-aws/pull/63
As an OpenShift maintainer, I want our build tooling to produce the acr credential provider plugin so that it can be distributed in RHCOS and used by the kubelet.
We need to ship the acr credential provider via an RPM so it is available to the kubelet when it first starts.
To ship an RPM we must create a .spec file that describes how the package should be built.
A working example for AWS is provided in this PR: https://github.com/openshift/cloud-provider-aws/pull/63
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Code in library-go currently uses feature gates to determine if Azure and GCP clusters should be external or not. They have been promoted for at least one release and we do not see ourselves going back.
In 4.17 the code is expected to be deleted completely.
We should remove the reliance on the feature gate from this part of the code and clean up references to feature gate access at the call sites.
As part of the migration to external cloud providers, the CCMO and KCMO used a CloudControllerOwner condition to indicate which operator owned the cloud controllers.
This is no longer required and can be removed.
Here is our overall tech debt backlog: ODC-6711
See the included tickets; we want to clean these up in 4.16.
Increase and improve the automation health of ODC
Improve automation coverage for ODC
Provide the ability to export data in a CSV format from the various Observability pages in the OpenShift console.
Initially this will include exporting data from any tables that we use.
Product Requirements:
A user will have the ability to click a button which will download the data in the current table in CSV format. The user will then be able to take this downloaded file and import it into their system.
Provide the ability for users to export CSV data from the dashboard and metrics tables.
Goal:
Description of problem:
Failed to install OCP on the below LZ/WLZ. The common point in the below regions is that all of them have only one type of zone, LZ or WLZ; e.g. in af-south-1 only LZ is available (no WLZ), and in ap-northeast-2 only WLZ is available (no LZ). Failed regions/zones:
af-south-1 ['af-south-1-los-1a']
failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in af-south-1
ap-south-1 ['ap-south-1-ccu-1a', 'ap-south-1-del-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in ap-south-1
ap-southeast-1 ['ap-southeast-1-bkk-1a', 'ap-southeast-1-mnl-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in ap-southeast-1
me-south-1 ['me-south-1-mct-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in me-south-1
ap-southeast-2 ['ap-southeast-2-akl-1a', 'ap-southeast-2-per-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in ap-southeast-2
eu-north-1 ['eu-north-1-cph-1a', 'eu-north-1-hel-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in eu-north-1
ap-northeast-2 ['ap-northeast-2-wl1-cjj-wlz-1', 'ap-northeast-2-wl1-sel-wlz-1']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Local Zone names: no zones with type local-zone in ap-northeast-2
ca-central-1 ['ca-central-1-wl1-yto-wlz-1']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Local Zone names: no zones with type local-zone in ca-central-1
eu-west-2 ['eu-west-2-wl1-lon-wlz-1', 'eu-west-2-wl1-man-wlz-1', 'eu-west-2-wl2-man-wlz-1']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Local Zone names: no zones with type local-zone in eu-west-2
Version-Release number of selected component (if applicable):
4.15.0-rc.3-x86_64
How reproducible:
Steps to Reproduce:
1) install OCP on above regions/zones
Actual results:
See description.
Expected results:
Don't check LZ availability while installing OCP in a WLZ. Don't check WLZ availability while installing OCP in an LZ.
Additional info:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
The installer is using a very old reference to the machine-config project for its APIs.
Those APIs have been moved to openshift/api. Update the imports and unit tests.
This will make an import error go away and make it easier to update in the future.
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
After https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/89, ibm-vpc-node-label-updater is no longer deployed.
Once we're certain there is no risk of reverting to an older version of the driver/operator that requires the node labeler, we should remove it from the payload.
We get too many false positive bugs like https://issues.redhat.com/browse/OCPBUGS-25333 from SAST scans, especially from the vendor directory. Add a .snyk file like https://github.com/openshift/oc/blob/master/.snyk to each repo to ignore them.
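As a rough sketch (assuming the policy file has the same shape as the openshift/oc reference above; the exact keys and version string should be copied from that file rather than from here), the per-repo .snyk could exclude vendored code from scanning:

version: v1.25.0          # illustrative; copy the version used in the openshift/oc file
exclude:
  global:
    - vendor/**           # ignore findings in vendored dependencies
    - "**/*_test.go"      # hypothetical: also skip test fixtures if desired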
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
EOL, do not upgrade:
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
This includes ibm-vpc-node-label-updater!
(Using separate cards for each driver because these updates can be more complicated)
Today we manually catch some regressions by eyeballing disruption graphs.
There are two focuses for this Epic, first are updates to the existing disruption logic to fix and tune the existing logic and second is to consider new methods for collecting and analyzing disruption.
For the second part considerations are:
Design some automation to detect these kinds of regressions and alert TRT.
Would this be in sippy or something new? (too openshift specific?)
Bear in mind we'll soon need it for component memory and cpu usage as well.
Alerts should eventually be targeting the SLO we discussed in Infra arch call on Feb 7: https://docs.google.com/document/d/1QOXh7Me0w-4ad-c8HaTuQPvpG5cddUyA2b1j00H-MXQ/edit?usp=sharing
Make sure we gain testing over metal and vsphere, which typically do not have the minimum 100 runs; how can we test these more broadly?
Today we fail the job if you're over the P99 for the last 3 weeks, as determined by a weekly PR to origin. The mechanism for creating that PR, reading its data, and running these tests breaks repeatedly without anyone realizing, and often does things we don't expect.
Disruption ebbs and flows constantly, especially at the 99th percentile, and the test being run is not technically the same from week to week.
We do still want to at least attempt to fail a job run if disruption was significant.
Examples we would not want to fail:
P99 2s, job got 4s. We don't care about a few seconds of disruption at this level. This happens all the time, and it's not a regression. Stop failing the test.
P99 60s, job got 65s. There's already huge disruption possible, a few seconds over is largely irrelevant week to week. Stop failing the test.
Layer 3 disruption monitoring is now our main focus for catching more subtle regressions; this is just a first line of defence, a best attempt at telling a PR author that their change may have caused a drastic disruption problem.
Changes proposed (see details in comments below for the data as to why):
Certain repositories (ovn/sdn, router, sometimes MCO) are prone to causing disruption regressions. We need to give engineers on these teams better visibility into when they're about to merge something that might break payloads.
We could request /payload on all PRs but this is expensive and a manual task that could still easily be forgotten.
In our epic TRT-787 we will likely soon have a little data on disruption in sippy, enough to know what we expect is normal, and how we're doing the last few days.
This card proposes a similar approach to risk analysis, or actually plugging right into risk analysis. A spyglass panel should be visible with data on disruption:
We could then enhance the PR commenter to drop this information right in front of the user if anything looks out of the ordinary.
1. Proposed title of this feature request
sosreport (sos RPM) command to be included in the tools imagestream.
2. What is the nature and description of the request?
There is an imagestream called tools which is used for oc debug node/<node>. There are several tools for debugging the node, but sos is not one of them. The sos report command is the most important command for getting support from Red Hat, and it should be included for debugging the node.
3. Why does the customer need this? (List the business requirements here)
Telco operators build their systems in disconnected environments, so it is hard for them to get additional RPMs or images. If the sos command were included in the OCP platform via the tools imagestream, it would be very useful for them. There is a toolbox image for sosreport, but it is not included in the OpenShift release manifests, and some telco operators do not allow bringing in additional packages or images other than the OpenShift platform itself.
4. List any affected packages or components.
tools imagestream in Openshift release manifests
1. Proposed title of this feature request
sosreport (sos RPM) command to be included in the tools imagestream.
2. What is the nature and description of the request?
There is an imagestream called tools which is used for oc debug node/<node>. There are several tools for debugging the node, but sos is not one of them. The sos report command is the most important command for getting support from Red Hat, and it should be included for debugging the node.
3. Why does the customer need this? (List the business requirements here)
Telco operators build their systems in disconnected environments, so it is hard for them to get additional RPMs or images. If the sos command were included in the OCP platform via the tools imagestream, it would be very useful for them. There is a toolbox image for sosreport, but it is not included in the OpenShift release manifests, and some telco operators do not allow bringing in additional packages or images other than the OpenShift platform itself.
4. List any affected packages or components.
tools imagestream in Openshift release manifests
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
Description of the problem:
It is impossible to create an extra partition on the main disk at installation time with OCP 4.15. It works perfectly with 4.14 and under.
I supply a custom MachineConfig manifest to do so, and the behavior is that during installation, after reboot, the screen is blank and the host has no networking (no route to host).
A Slack thread explaining the issue with further debugging can be consulted at https://redhat-internal.slack.com/archives/C999USB0D/p1707991107757299
The bug seems to have been introduced in https://github.com/openshift/assisted-installer/pull/713, which allows for one less reboot at installation time; to do that, it implements part of the post-reboot code. This code runs BEFORE the extra partition is created, and this creates the problem.
How reproducible:
Always
Steps to reproduce:
1. Create a 4.15 cluster with an extra manifest that creates an extra partition at the end of the main disk
Example of machineconfig (change device to match installation disk):
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 98-extra-partition
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      disks:
      - device: /dev/vda
        partitions:
        - label: javi
          startMiB: 110000 # space left for the CoreOS partition.
          sizeMiB: 0 # Use all available space
2. Proceed with the installation
Actual results:
After reboot, node never comes back up
Expected results:
Cluster installs without problem
Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-29384.
Description of problem:
The checkbox should be displayed on a single row, e.g. for 'Deny all ingress traffic' & 'Deny all egress traffic' in the Create NetworkPolicy page, and for 'Secure Route' in the Create Route page.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-11-14-082209
How reproducible:
Always
Steps to Reproduce:
1. Go to the Networking -> NetworkPolicies page, click the 'Create NetworkPolicy' button
2. Check the Policy type section, check if the checkboxes for 'Deny all ingress traffic' & 'Deny all egress traffic' are displayed in a single row
3. Check the same things in the 'Create route' page
Actual results:
not in a single row
Expected results:
in a single row
Additional info:
https://drive.google.com/file/d/1xgEe-CuuRYrY9tBFmIa-7o5Rcn7iCr1e/view?usp=drive_link
All the apiservers:
Expose both `apiserver_request_slo_duration_seconds` and `apiserver_request_sli_duration_seconds`. The SLI metric was introduced in Kubernetes 1.26 as a replacement of `apiserver_request_slo_duration_seconds` which was deprecated in Kubernetes 1.27. This change is only a renaming so both metrics expose the same data. To avoid storing duplicated data in Prometheus, we need to drop the `apiserver_request_slo_duration_seconds` in favor of `apiserver_request_sli_duration_seconds`.
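For illustration, one way to drop the deprecated series at scrape time is a metric relabeling rule in the relevant ServiceMonitor; the resource name below is a placeholder, and the actual CMO change may instead live in the generated scrape configuration:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-apiserver            # placeholder name for illustration
  namespace: openshift-monitoring
spec:
  endpoints:
  - port: https
    metricRelabelings:
    # Drop the deprecated SLO histogram series; the SLI equivalents remain.
    - action: drop
      sourceLabels: [__name__]
      regex: apiserver_request_slo_duration_seconds(_bucket|_sum|_count)?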
Description of problem:
Issue - Profiles are degraded [1]even after applied due to below [2]error:
[1]
$ oc get profile -A
NAMESPACE                                NAME       TUNED                APPLIED   DEGRADED   AGE
openshift-cluster-node-tuning-operator   master0    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   master1    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   master2    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   worker0    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker1    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker10   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker11   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker12   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker13   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker14   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker15   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker2    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker3    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker4    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker5    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker6    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker7    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker8    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker9    rdpmc-patch-worker   True      True       5d
[2]
- lastTransitionTime: "2023-12-05T22:43:12Z"
  message: TuneD daemon issued one or more sysctl override message(s) during profile application. Use reapply_sysctl=true or remove conflicting sysctl net.core.rps_default_mask
  reason: TunedSysctlOverride
  status: "True"
If we see in rdpmc-patch-master tuned:
NAMESPACE                                NAME      TUNED                APPLIED   DEGRADED   AGE
openshift-cluster-node-tuning-operator   master0   rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   master1   rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   master2   rdpmc-patch-master   True      True       5d
We are configuring below in rdpmc-patch-master tuned:
$ oc get tuned rdpmc-patch-master -n openshift-cluster-node-tuning-operator -oyaml |less
spec:
  profile:
  - data: |
      [main]
      include=performance-patch-master
      [sysfs]
      /sys/devices/cpu/rdpmc = 2
    name: rdpmc-patch-master
  recommend:
Below is performance-patch-master, which is included in the above Tuned profile:
spec:
  profile:
  - data: |
      [main]
      summary=Custom tuned profile to adjust performance
      include=openshift-node-performance-master-profile
      [bootloader]
      cmdline_removeKernelArgs=-nohz_full=${isolated_cores}
Below is the setting (which appears in the error) from openshift-node-performance-master-profile, which is included in the above Tuned profile:
net.core.rps_default_mask=${not_isolated_cpumask}
A RHEL bug has been raised for the same issue: https://issues.redhat.com/browse/RHEL-18972
Version-Release number of selected component (if applicable):
4.14
Please review the following PR: https://github.com/openshift/origin/pull/28452
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/179
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
If the user specifies baselineCapabilitySet: None in the install-config and does not specifically enable the capability baremetal, yet still uses platform: baremetal then the install will reliably fail.
This failure takes the form of a timeout with the bootkube logs (not easily accessible to the user) full of errors like:
bootkube.sh[46065]: "99_baremetal-provisioning-config.yaml": unable to get REST mapping for "99_baremetal-provisioning-config.yaml": no matches for kind "Provisioning" in version "metal3.io/v1alpha1"
bootkube.sh[46065]: "99_openshift-cluster-api_hosts-0.yaml": unable to get REST mapping for "99_openshift-cluster-api_hosts-0.yaml": no matches for kind "BareMetalHost" in version "metal3.io/v1alpha1"
Since the installer can tell when processing the install-config if the baremetal capability is missing, we should detect this and error out immediately to save the user an hour of their life and us a support case.
Although this was found on an agent install, I believe the same will apply to a baremetal IPI install.
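For illustration, a minimal install-config.yaml fragment (VIP values are placeholders) showing the combination that reproduces the failure, and the kind of field the proposed validation would tell the user to set:

platform:
  baremetal:
    apiVIPs:
    - 192.168.111.5        # placeholder VIPs
    ingressVIPs:
    - 192.168.111.4
capabilities:
  baselineCapabilitySet: None
  # The proposed validation should fail fast here and point the user at
  # explicitly enabling the capability, e.g.:
  # additionalEnabledCapabilities:
  # - baremetal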
Please review the following PR: https://github.com/openshift/ironic-static-ip-manager/pull/41
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/630
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/aws-pod-identity-webhook/pull/180
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The story is to track the i18n upload/download routine tasks which are performed every sprint.
A.C.
- Upload strings to Memsource at the start of the sprint and reach out to the localization team
- Download translated strings from Memsource when it is ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
Description of problem:
Operation cannot be fulfilled on networks.operator.openshift.io during OVN live migration
Version-Release number of selected component (if applicable):
How reproducible:
Not always
Steps to Reproduce:
1. Enable the features egressfirewall, externalIP, multicast, multus, network-policy, service-idle.
2. Start migrating the cluster from SDN to OVN.
Actual results:
[weliang@weliang ~]$ oc delete validatingwebhookconfigurations.admissionregistration.k8s.io/sre-techpreviewnoupgrade-validation validatingwebhookconfiguration.admissionregistration.k8s.io "sre-techpreviewnoupgrade-validation" deleted [weliang@weliang ~]$ oc edit featuregate cluster featuregate.config.openshift.io/cluster edited [weliang@weliang ~]$ oc get node NAME STATUS ROLES AGE VERSION ip-10-0-20-154.ec2.internal Ready control-plane,master 86m v1.28.5+9605db4 ip-10-0-45-93.ec2.internal Ready worker 80m v1.28.5+9605db4 ip-10-0-49-245.ec2.internal Ready worker 74m v1.28.5+9605db4 ip-10-0-57-37.ec2.internal Ready infra,worker 60m v1.28.5+9605db4 ip-10-0-60-0.ec2.internal Ready infra,worker 60m v1.28.5+9605db4 ip-10-0-62-121.ec2.internal Ready control-plane,master 86m v1.28.5+9605db4 ip-10-0-62-56.ec2.internal Ready control-plane,master 86m v1.28.5+9605db4 [weliang@weliang ~]$ for f in $(oc get nodes -o jsonpath='{.items[*].metadata.name}') ; do oc debug node/"${f}" -- chroot /host cat /etc/kubernetes/kubelet.conf | grep NetworkLiveMigration ; done Starting pod/ip-10-0-20-154ec2internal-debug-9wvd8 ... To use host binaries, run `chroot /host`Removing debug pod ... "NetworkLiveMigration": true, Starting pod/ip-10-0-45-93ec2internal-debug-rwvls ... To use host binaries, run `chroot /host` "NetworkLiveMigration": true,Removing debug pod ... Starting pod/ip-10-0-49-245ec2internal-debug-rp9dt ... To use host binaries, run `chroot /host`Removing debug pod ... "NetworkLiveMigration": true, Starting pod/ip-10-0-57-37ec2internal-debug-q5thk ... To use host binaries, run `chroot /host`Removing debug pod ... "NetworkLiveMigration": true, Starting pod/ip-10-0-60-0ec2internal-debug-zp78h ... To use host binaries, run `chroot /host`Removing debug pod ... "NetworkLiveMigration": true, Starting pod/ip-10-0-62-121ec2internal-debug-42k2g ... To use host binaries, run `chroot /host`Removing debug pod ... "NetworkLiveMigration": true, Starting pod/ip-10-0-62-56ec2internal-debug-s99ls ... To use host binaries, run `chroot /host`Removing debug pod ... 
"NetworkLiveMigration": true, [weliang@weliang ~]$ oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/live-migration":""}},"spec":{"networkType":"OVNKubernetes"}}' network.config.openshift.io/cluster patched [weliang@weliang ~]$ [weliang@weliang ~]$ oc get co network NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE network 4.15.0-0.nightly-2024-01-06-062415 True False True 4h1m Internal error while updating operator configuration: could not apply (/, Kind=) /cluster, err: failed to apply / update (operator.openshift.io/v1, Kind=Network) /cluster: Operation cannot be fulfilled on networks.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again [weliang@weliang ~]$ oc get node NAME STATUS ROLES AGE VERSION ip-10-0-2-52.ec2.internal Ready worker 3h54m v1.28.5+9605db4 ip-10-0-26-16.ec2.internal Ready control-plane,master 4h2m v1.28.5+9605db4 ip-10-0-32-116.ec2.internal Ready worker 3h54m v1.28.5+9605db4 ip-10-0-32-67.ec2.internal Ready infra,worker 3h38m v1.28.5+9605db4 ip-10-0-35-11.ec2.internal Ready infra,worker 3h39m v1.28.5+9605db4 ip-10-0-39-125.ec2.internal Ready control-plane,master 4h2m v1.28.5+9605db4 ip-10-0-6-117.ec2.internal Ready control-plane,master 4h2m v1.28.5+9605db4 [weliang@weliang ~]$ oc get Network.operator.openshift.io/cluster -o json { "apiVersion": "operator.openshift.io/v1", "kind": "Network", "metadata": { "creationTimestamp": "2024-01-08T13:28:07Z", "generation": 417, "name": "cluster", "resourceVersion": "236888", "uid": "37fb36f0-c13c-476d-aea1-6ebc1c87abe8" }, "spec": { "clusterNetwork": [ { "cidr": "10.128.0.0/14", "hostPrefix": 23 } ], "defaultNetwork": { "openshiftSDNConfig": { "enableUnidling": true, "mode": "NetworkPolicy", "mtu": 8951, "vxlanPort": 4789 }, "ovnKubernetesConfig": { "egressIPConfig": {}, "gatewayConfig": { "ipv4": {}, "ipv6": {}, "routingViaHost": false }, "genevePort": 6081, "mtu": 8901, "policyAuditConfig": { "destination": "null", "maxFileSize": 50, "maxLogFiles": 5, "rateLimit": 20, "syslogFacility": "local0" } }, "type": "OVNKubernetes" }, "deployKubeProxy": false, "disableMultiNetwork": false, "disableNetworkDiagnostics": false, "kubeProxyConfig": { "bindAddress": "0.0.0.0" }, "logLevel": "Normal", "managementState": "Managed", "migration": { "mode": "Live", "networkType": "OVNKubernetes" }, "observedConfig": null, "operatorLogLevel": "Normal", "serviceNetwork": [ "172.30.0.0/16" ], "unsupportedConfigOverrides": null, "useMultiNetworkPolicy": false }, "status": { "conditions": [ { "lastTransitionTime": "2024-01-08T13:28:07Z", "status": "False", "type": "ManagementStateDegraded" }, { "lastTransitionTime": "2024-01-08T17:29:52Z", "status": "False", "type": "Degraded" }, { "lastTransitionTime": "2024-01-08T13:28:07Z", "status": "True", "type": "Upgradeable" }, { "lastTransitionTime": "2024-01-08T17:26:38Z", "status": "False", "type": "Progressing" }, { "lastTransitionTime": "2024-01-08T13:28:20Z", "status": "True", "type": "Available" } ], "readyReplicas": 0, "version": "4.15.0-0.nightly-2024-01-06-062415" } } [weliang@weliang ~]$
Expected results:
OVN live migration pass
Additional info:
must-gather: https://people.redhat.com/~weliang/must-gather1.tar.gz
Please review the following PR: https://github.com/openshift/apiserver-network-proxy/pull/46
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
4.15 control plane can't create a 4.14 node pool due to an issue with payload
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a Hosted Cluster in 4.15
2. Create a Node Pool in 4.14
3. The node pool is stuck in provisioning
Actual results:
No node pool is created
Expected results:
Node pool is created as we support N-2 version there
Additional info:
Possibly linked to OCPBUGS-26757
Not sure which component this bug should be associated with.
I am not even sure if importing respects ImageTagMirrorSet.
We could not figure it out in the Slack conversation.
https://redhat-internal.slack.com/archives/C013VBYBJQH/p1709583648013199
Description of problem:
The expected behaviour of ImageTagMirrorSet, redirecting pulls from the proxy to quay.io, did not work.
Version-Release number of selected component (if applicable):
oc --context build02 get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-ec.3   True        False         7d4h
Steps to Reproduce:
oc --context build02 get ImageTagMirrorSet quay-proxy -o yaml apiVersion: config.openshift.io/v1 kind: ImageTagMirrorSet metadata: annotations: kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"config.openshift.io/v1","kind":"ImageTagMirrorSet","metadata":{"annotations":{},"name":"quay-proxy"},"spec":{"imageTagMirrors":[{"mirrors":["quay.io/openshift/ci"],"source":"quay-proxy.ci.openshift.org/openshift/ci"}]}} creationTimestamp: "2024-03-05T03:49:59Z" generation: 1 name: quay-proxy resourceVersion: "4895378740" uid: 69fb479e-85bd-4a16-a38f-29b08f2636c3 spec: imageTagMirrors: - mirrors: - quay.io/openshift/ci source: quay-proxy.ci.openshift.org/openshift/ci oc --context build02 tag --source docker quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest hongkliu-test/proxy-test-2:011 --as system:admin Tag proxy-test-2:011 set to quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest. oc --context build02 get is proxy-test-2 -o yaml apiVersion: image.openshift.io/v1 kind: ImageStream metadata: annotations: openshift.io/image.dockerRepositoryCheck: "2024-03-05T20:03:02Z" creationTimestamp: "2024-03-05T20:03:02Z" generation: 2 name: proxy-test-2 namespace: hongkliu-test resourceVersion: "4898915153" uid: f60b3142-1f5f-42ae-a936-a9595e794c05 spec: lookupPolicy: local: false tags: - annotations: null from: kind: DockerImage name: quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest generation: 2 importPolicy: importMode: Legacy name: "011" referencePolicy: type: Source status: dockerImageRepository: image-registry.openshift-image-registry.svc:5000/hongkliu-test/proxy-test-2 publicDockerImageRepository: registry.build02.ci.openshift.org/hongkliu-test/proxy-test-2 tags: - conditions: - generation: 2 lastTransitionTime: "2024-03-05T20:03:02Z" message: 'Internal error occurred: quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest: Get "https://quay-proxy.ci.openshift.org/v2/": EOF' reason: InternalError status: "False" type: ImportSuccess items: null tag: "011"
Actual results:
The status of the stream shows that it still tries to connect to quay-proxy.
Expected results:
The request goes to quay.io directly.
Additional info:
The proxy has been shut down completely just to simplify the case. If it was on, there are access logs showing the proxy getting the requests for the image.
oc scale deployment qci-appci -n ci --replicas 0
deployment.apps/qci-appci scaled
I also checked the pull secret in the namespace and it has correct pull credentials for both the proxy and quay.io.
Today these are in isPodLog in the JavaScript; we'd like them in their own section, preferably charted very close to the node update section.
OCP 4.14.5
multicluster-engine.v2.4.1
advanced-cluster-management.v2.9.0
Attempt to create a spoke cluster:
apiVersion: extensions.hive.openshift.io/v1beta1 kind: AgentClusterInstall metadata: creationTimestamp: "2023-12-08T16:59:25Z" finalizers: - agentclusterinstall.agent-install.openshift.io/ai-deprovision generation: 1 name: infraenv-spoke namespace: infraenv-spoke ownerReferences: - apiVersion: hive.openshift.io/v1 kind: ClusterDeployment name: infraenv-spoke uid: 34f1fe43-2af2-4880-b4ca-fb9ab8df13df resourceVersion: "3468594" uid: 79a42bdf-db1f-4500-b689-8b3813bd27a6 spec: clusterDeploymentRef: name: infraenv-spoke imageSetRef: name: 4.14-test networking: clusterNetwork: - cidr: 10.128.0.0/14 hostPrefix: 23 serviceNetwork: - 172.30.0.0/16 userManagedNetworking: true provisionRequirements: controlPlaneAgents: 3 workerAgents: 2 status: conditions: - lastProbeTime: "2023-12-08T16:59:30Z" lastTransitionTime: "2023-12-08T16:59:30Z" message: SyncOK reason: SyncOK status: "True" type: SpecSynced - lastProbeTime: "2023-12-08T16:59:30Z" lastTransitionTime: "2023-12-08T16:59:30Z" message: The cluster is not ready to begin the installation reason: ClusterNotReady status: "False" type: RequirementsMet - lastProbeTime: "2023-12-08T16:59:30Z" lastTransitionTime: "2023-12-08T16:59:30Z" message: 'The cluster''s validations are failing: ' reason: ValidationsFailing status: "False" type: Validated - lastProbeTime: "2023-12-08T16:59:30Z" lastTransitionTime: "2023-12-08T16:59:30Z" message: The installation has not yet started reason: InstallationNotStarted status: "False" type: Completed - lastProbeTime: "2023-12-08T16:59:30Z" lastTransitionTime: "2023-12-08T16:59:30Z" message: The installation has not failed reason: InstallationNotFailed status: "False" type: Failed - lastProbeTime: "2023-12-08T16:59:30Z" lastTransitionTime: "2023-12-08T16:59:30Z" message: The installation is waiting to start or in progress reason: InstallationNotStopped status: "False" type: Stopped debugInfo: eventsURL: https://assisted-service-rhacm.apps.sno-0.qe.lab.redhat.com/api/assisted-install/v2/events?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiZjU4MjVmMTctNTg0OS00OTljLWE1NDctNjJmMDc4ZDU3MDJiIn0.qpSeZuqLwZ3cr3qn6AZo665o1ANp45YVE6IWUv7Gdn1RmapG4HZaxsUUY4iswkRMiqIfka_pLHFnBeVzXSTbrg&cluster_id=f5825f17-5849-499c-a547-62f078d5702b logsURL: "" state: insufficient stateInfo: Cluster is not ready for install platformType: None progress: totalPercentage: 0 userManagedNetworking: true apiVersion: agent-install.openshift.io/v1beta1 kind: InfraEnv metadata: creationTimestamp: "2023-12-08T16:59:26Z" finalizers: - infraenv.agent-install.openshift.io/ai-deprovision generation: 1 name: infraenv-spoke namespace: infraenv-spoke resourceVersion: "3468794" uid: 6254bbb3-5531-4665-bb78-f073b439b023 spec: clusterRef: name: infraenv-spoke namespace: infraenv-spoke cpuArchitecture: s390x ipxeScriptType: "" nmStateConfigLabelSelector: {} pullSecretRef: name: infraenv-spoke-pull-secret status: agentLabelSelector: matchLabels: infraenvs.agent-install.openshift.io: infraenv-spoke bootArtifacts: initrd: "" ipxeScript: "" kernel: "" rootfs: "" conditions: - lastTransitionTime: "2023-12-08T16:59:51Z" message: 'Failed to create image: cannot use Minimal ISO because it''s not compatible with the s390x architecture on version 4.14.6 of OpenShift' reason: ImageCreationError status: "False" type: ImageCreated debugInfo: eventsURL: "" oc get clusterimagesets.hive.openshift.io 4.14-test -o yaml apiVersion: hive.openshift.io/v1 kind: ClusterImageSet metadata: creationTimestamp: "2023-12-08T18:11:29Z" generation: 1 name: 4.14-test resourceVersion: 
"3514589" uid: 32e2ba8d-6bb7-4e4b-b3a5-63fa8224d144 spec: releaseImage: registry.ci.openshift.org/ocp-s390x/release-s390x@sha256:f024a617c059bf2cbf4a669c2a19ab4129e78a007c6863b64dd73a413c0bdf46 oc get agentserviceconfigs.agent-install.openshift.io agent -o yaml apiVersion: agent-install.openshift.io/v1beta1 kind: AgentServiceConfig metadata: creationTimestamp: "2023-12-08T18:10:42Z" finalizers: - agentserviceconfig.agent-install.openshift.io/ai-deprovision generation: 1 name: agent resourceVersion: "3514534" uid: ef204896-25f1-4ff3-ae60-c80c2f45cd30 spec: databaseStorage: accessModes: - ReadWriteOnce resources: requests: storage: 10Gi filesystemStorage: accessModes: - ReadWriteOnce resources: requests: storage: 20Gi imageStorage: accessModes: - ReadWriteOnce resources: requests: storage: 10Gi mirrorRegistryRef: name: mirror-registry-ca osImages: - cpuArchitecture: x86_64 openshiftVersion: "4.14" rootFSUrl: "" url: https://mirror.openshift.com/pub/openshift-v4/amd64/dependencies/rhcos/4.14/latest/rhcos-live.x86_64.iso version: "4.14" - cpuArchitecture: arm64 openshiftVersion: "4.14" rootFSUrl: "" url: https://mirror.openshift.com/pub/openshift-v4/aarch64/dependencies/rhcos/4.14/latest/rhcos-live.aarch64.iso version: "4.14" - cpuArchitecture: ppc64le openshiftVersion: "4.14" rootFSUrl: "" url: https://mirror.openshift.com/pub/openshift-v4/ppc64le/dependencies/rhcos/4.14/latest/rhcos-live.ppc64le.iso version: "4.14" - cpuArchitecture: s390x openshiftVersion: "4.14" rootFSUrl: "" url: https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/4.14/latest/rhcos-live.s390x.iso version: "4.14" status: conditions: - lastTransitionTime: "2023-12-08T18:10:42Z" message: AgentServiceConfig reconcile completed without error. reason: ReconcileSucceeded status: "True" type: ReconcileCompleted - lastTransitionTime: "2023-12-08T18:11:23Z" message: All the deployments managed by Infrastructure-operator are healthy. reason: DeploymentSucceeded status: "True" type: DeploymentsHealthy
Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/94
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cloud-credential-operator/pull/642
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
A user may provide a DNS domain that lives outside GCP; once custom DNS is enabled, the installer should skip DNS zone validation:
level=fatal msg="failed to fetch Terraform Variables: failed to generate asset \"Terraform Variables\": failed to get GCP public zone: no matching public DNS Zone found"
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-02-03-192446 4.16.0-0.nightly-2024-02-03-221256
How reproducible:
Always
Steps to Reproduce:
1. Enable custom DNS on GCP: platform.gcp.userProvisionedDNS: Enabled and featureSet: TechPreviewNoUpgrade
2. Configure a baseDomain which does not exist on GCP.
Actual results:
See description.
Expected results:
Installer should skip the validation, as the custom domain may not exist on GCP
Additional info:
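For illustration, an install-config.yaml fragment matching the reproducer above; the project, region and domain are placeholders, while the userProvisionedDNS and featureSet fields are the ones named in the steps:

featureSet: TechPreviewNoUpgrade
baseDomain: example-not-hosted-in-gcp.com   # domain managed outside GCP
platform:
  gcp:
    projectID: my-project                   # placeholder
    region: us-central1                     # placeholder
    userProvisionedDNS: Enabled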
Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/206
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In https://github.com/openshift/release/pull/47618 there are quite a few warnings from snyk in the presubmit rehearsal jobs that have not been reported in the bugs filed against storage. We need to go through each one and either fix it (in the case of legit bugs) or ignore it (false positives / test cases) to avoid having a presubmit job that always fails.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
run 'security' presubmit rehearsal jobs in https://github.com/openshift/release/pull/47618
Actual results:
snyk issues reported
Expected results:
clean test runs
Additional info:
Description of problem:
When destroying an HCP KubeVirt cluster using the CLI with --destroy-cloud-resources, PVCs are not cleaned up within the guest cluster because the CLI does not properly honor the --destroy-cloud-resources option for KubeVirt.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
1. destroy an hcp kubevirt cluster using cli and --destroy-cloud-resources 2. 3.
Actual results:
the hosted cluster does not have the hypershift.openshift.io/cleanup-cloud-resources: "true" annotation added, which is what ensures the hosted cluster config controller cleans up PVCs
Expected results:
the hypershift.openshift.io/cleanup-cloud-resources: "true" annotation should be added to the hosted cluster during teardown when the --destroy-cloud-resources CLI option is used
Additional info:
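For reference, a sketch of the expected result on the HostedCluster once the CLI sets the annotation; the name and namespace are placeholders, the annotation key and value are the ones named in this report:

apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example              # placeholder
  namespace: clusters        # placeholder
  annotations:
    hypershift.openshift.io/cleanup-cloud-resources: "true"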
Description of problem:
Manifests will be removed from the CCO image, so we have to start using the CCA (cluster-config-api) image for bootstrap.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
KAS bootstrap container fails
Expected results:
KAS bootstrap container succeeds
Additional info:
Description of problem:
There is a new zone in PowerVS called dal12. We need to add this zone to the list of supported zones in the installer.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Deploy OpenShift cluster to the zone 2. 3.
Actual results:
Fails
Expected results:
Works
Additional info:
Description of the problem:
non-lowercase hostname in DHCP breaks assisted installation
How reproducible:
100%
Steps to reproduce:
Actual results:
bootkube fails
Expected results:
bootkube should succeed
Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/183
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1187
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/66
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/kube-state-metrics/pull/108
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The two tasks will always "UPDATE" the underlying "APIService" resource even when no changes are to be made. This behavior significantly elevates the likelihood of encountering conflicts, especially during upgrades, with other controllers that are concurrently monitoring the same resources (the CA controller, for example). Moreover, this consumes resources for unnecessary work. The tasks rely on CreateOrUpdateAPIService: https://github.com/openshift/cluster-monitoring-operator/blob/5e394dd9de305cb6927a23c31b3f9651aa806fb8/pkg/client/client.go#L1725-L1745 which always updates the APIService resource.
Version-Release number of selected component (if applicable):
How reproducible:
Keep a cluster running, then take a look at the audit logs concerning the APIService v1beta1.metrics.k8s.io. You could use: oc adm must-gather -- /usr/bin/gather_audit_logs
Steps to Reproduce:
1. 2. 3.
Actual results:
You would see that every "get" is followed by an "update". You can also take a look at the code taking care of that: https://github.com/openshift/cluster-monitoring-operator/blob/5e394dd9de305cb6927a23c31b3f9651aa806fb8/pkg/client/client.go#L1725-L1745
Expected results:
"Updates" should be avoided if no changes are to be made.
Additional info:
it'd be even better if we could avoid the "get"s, but that would be another subject to discuss.
Description of problem:
Facing an error while creating manifests:
./openshift-install create manifests --dir openshift-config
FATAL failed to fetch Master Machines: failed to generate asset "Master Machines": failed to create master machine objects: failed to create provider: unexpected end of JSON input
Using the below document: https://docs.openshift.com/container-platform/4.14/installing/installing_gcp/installing-gcp-vpc.html#installation-gcp-config-yaml_installing-gcp-vpc
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/279
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/1979
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Sometimes the prometheus-operator's informer will be stuck because it receives objects that can't be converted to *v1.PartialObjectMetadata.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Not always
Steps to Reproduce:
1. Unknown 2. 3.
Actual results:
prometheus-operator logs show errors like 2024-02-09T08:29:35.478550608Z level=warn ts=2024-02-09T08:29:35.478491797Z caller=klog.go:108 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:110: failed to list *v1.PartialObjectMetadata: Get \"https://172.30.0.1:443/api/v1/secrets?resourceVersion=29022\": dial tcp 172.30.0.1:443: connect: connection refused" 2024-02-09T08:29:35.478592909Z level=error ts=2024-02-09T08:29:35.478541608Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:110: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Get \"https://172.30.0.1:443/api/v1/secrets?resourceVersion=29022\": dial tcp 172.30.0.1:443: connect: connection refused"
Expected results:
No error
Additional info:
The bug has been introduced in v0.70.0 by https://github.com/prometheus-operator/prometheus-operator/pull/5993 so it only affects 4.16 and 4.15.
Description of problem:
In accounts with a large number of resources, the destroy code will fail to list all resources. This has revealed some changes that need to be made to the destroy code to handle these situations.
Version-Release number of selected component (if applicable):
How reproducible:
Difficult - but we have an account where we can reproduce it consistently
Steps to Reproduce:
1. Try to destroy a cluster in an account with a large amount of resources. 2. Fail. 3.
Actual results:
Fail to destroy
Expected results:
Destroy succeeds
Additional info:
Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/87
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/197
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/whereabouts-cni/pull/242
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api-provider-alibaba/pull/46
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-baremetal-operator/pull/393
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
BuildRun logs cannot be displayed in the console, and the following error is shown:
The BuildRun is created and started using the shp CLI (similar behavior is observed when the build is created and started via the console/YAML too):
shp build create goapp-buildah \ --strategy-name="buildah" \ --source-url="https://github.com/shipwright-io/sample-go" \ --source-context-dir="docker-build" \ --output-image="image-registry.openshift-image-registry.svc:5000/demo/go-app"
The issue occurs on OCP 4.14.6. Investigation showed that this works correctly on OCP 4.14.5.
Description of problem:
Panic thrown by origin-tests
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create aws or rosa 4.15 cluster 2. run origin tests 3.
Actual results:
time="2024-03-07T17:03:50Z" level=info msg="resulting interval message" message="{RegisteredNode Node ip-10-0-8-83.ec2.internal event: Registered Node ip-10-0-8-83.ec2.internal in Controller map[reason:RegisteredNode roles:worker]}" E0307 17:03:50.319617 71 runtime.go:79] Observed a panic: runtime.boundsError{x:24, y:23, signed:true, code:0x3} (runtime error: slice bounds out of range [24:23]) goroutine 310 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x84c6f20?, 0xc006fdc588}) k8s.io/apimachinery@v0.29.0/pkg/util/runtime/runtime.go:75 +0x99 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc008c38120?}) k8s.io/apimachinery@v0.29.0/pkg/util/runtime/runtime.go:49 +0x75 panic({0x84c6f20, 0xc006fdc588}) runtime/panic.go:884 +0x213 github.com/openshift/origin/pkg/monitortests/testframework/watchevents.nodeRoles(0x0?) github.com/openshift/origin/pkg/monitortests/testframework/watchevents/event.go:251 +0x1e5 github.com/openshift/origin/pkg/monitortests/testframework/watchevents.recordAddOrUpdateEvent({0x96bcc00, 0xc0076e3310}, {0x7f2a0e47a1b8, 0xc007732330}, {0x281d36d?, 0x0?}, {0x9710b50, 0xc000c5e000}, {0x9777af, 0xedd7be6b7, ...}, ...) github.com/openshift/origin/pkg/monitortests/testframework/watchevents/event.go:116 +0x41b github.com/openshift/origin/pkg/monitortests/testframework/watchevents.startEventMonitoring.func2({0x8928f00?, 0xc00b528c80}) github.com/openshift/origin/pkg/monitortests/testframework/watchevents/event.go:65 +0x185 k8s.io/client-go/tools/cache.(*FakeCustomStore).Add(0x8928f00?, {0x8928f00?, 0xc00b528c80?}) k8s.io/client-go@v0.29.0/tools/cache/fake_custom_store.go:35 +0x31 k8s.io/client-go/tools/cache.watchHandler({0x0?, 0x0?, 0xe16d020?}, {0x9694a10, 0xc006b00180}, {0x96d2780, 0xc0078afe00}, {0x96f9e28?, 0x8928f00}, 0x0, ...) k8s.io/client-go@v0.29.0/tools/cache/reflector.go:756 +0x603 k8s.io/client-go/tools/cache.(*Reflector).watch(0xc0005dcc40, {0x0?, 0x0?}, 0xc005cdeea0, 0xc005bf8c40?) k8s.io/client-go@v0.29.0/tools/cache/reflector.go:437 +0x53b k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc0005dcc40, 0xc005cdeea0) k8s.io/client-go@v0.29.0/tools/cache/reflector.go:357 +0x453 k8s.io/client-go/tools/cache.(*Reflector).Run.func1() k8s.io/client-go@v0.29.0/tools/cache/reflector.go:291 +0x26 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x10?) k8s.io/apimachinery@v0.29.0/pkg/util/wait/backoff.go:226 +0x3e k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc007974ec0?, {0x9683f80, 0xc0078afe50}, 0x1, 0xc005cdeea0) k8s.io/apimachinery@v0.29.0/pkg/util/wait/backoff.go:227 +0xb6 k8s.io/client-go/tools/cache.(*Reflector).Run(0xc0005dcc40, 0xc005cdeea0) k8s.io/client-go@v0.29.0/tools/cache/reflector.go:290 +0x17d created by github.com/openshift/origin/pkg/monitortests/testframework/watchevents.startEventMonitoring github.com/openshift/origin/pkg/monitortests/testframework/watchevents/event.go:83 +0x6a5 panic: runtime error: slice bounds out of range [24:23] [recovered] panic: runtime error: slice bounds out of range [24:23]
Expected results:
execution of tests
Additional info:
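For context, a minimal sketch (assuming the panic comes from slicing a node label key past the "node-role.kubernetes.io/" prefix, as the [24:23] bounds error suggests) of a bounds-safe way to derive roles from labels; this is not the origin fix itself and the function name is illustrative.

```go
package example

import "strings"

// nodeRolesFromLabels extracts role names from node labels such as
// "node-role.kubernetes.io/worker" without ever indexing out of range.
func nodeRolesFromLabels(labels map[string]string) []string {
	const prefix = "node-role.kubernetes.io/"
	var roles []string
	for key := range labels {
		// CutPrefix reports whether the prefix was present and returns the
		// remainder, so a bare "node-role.kubernetes.io" label cannot panic.
		if role, ok := strings.CutPrefix(key, prefix); ok && role != "" {
			roles = append(roles, role)
		}
	}
	return roles
}
```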
The story is to track i18n upload/download routine tasks which are perform every sprint.
A.C.
- Upload strings to Memosource at the start of the sprint and reach out to localization team
- Download translated strings from Memsource when it is ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
E1213 07:18:34.291004 1 run.go:74] "command failed" err="error while building transformers: KMSv1 is deprecated and will only receive security updates going forward. Use KMSv2 instead. Set --feature-gates=KMSv1=true to use the deprecated KMSv1 feature."
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When console with custom route is disabled before cluster upgrade, and re-enabled after cluster upgrade, console could not be accessed successfully.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-02-15-111607
How reproducible:
Always
Steps to Reproduce:
1. Launch a cluster with an available update.
2. Create a custom route for console in the ingress configuration:
# oc edit ingresses.config.openshift.io cluster
spec:
  componentRoutes:
  - hostname: console-openshift-custom.apps.qe-413-0216.qe.devcluster.openshift.com
    name: console
    namespace: openshift-console
  - hostname: openshift-downloads-custom.apps.qe-413-0216.qe.devcluster.openshift.com
    name: downloads
    namespace: openshift-console
  domain: apps.qe-413-0216.qe.devcluster.openshift.com
3. After the custom route is created, access the console with the custom route.
4. Remove the console by setting managementState to Removed in the console operator:
# oc edit consoles.operator.openshift.io cluster
spec:
  logLevel: Normal
  managementState: Removed
  operatorLogLevel: Normal
5. Upgrade the cluster to a target version.
6. Enable the console by setting managementState to Managed in the console operator:
# oc edit consoles.operator.openshift.io cluster
spec:
  logLevel: Normal
  managementState: Managed
  operatorLogLevel: Normal
7. After the console resources are created, access the console URL.
Actual results:
3. Console could be accessed through the custom route.
4. Console resources are removed, and all cluster operators are in normal status:
# oc get all -n openshift-console
No resources found in openshift-console namespace.
5. Upgrade succeeds, all cluster operators are in normal status
6. Console resources are created:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/console ClusterIP 172.30.226.112 <none> 443/TCP 3m50s
service/console-redirect ClusterIP 172.30.147.151 <none> 8444/TCP 3m50s
service/downloads ClusterIP 172.30.251.248 <none> 80/TCP 3m50s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/console 2/2 2 2 3m47s
deployment.apps/downloads 2/2 2 2 3m50s
NAME DESIRED CURRENT READY AGE
replicaset.apps/console-69d88985b 2 2 2 3m42s
replicaset.apps/console-6dbdd487d 0 0 0 3m47s
replicaset.apps/downloads-6b6b555d8d 2 2 2 3m50s
NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD
route.route.openshift.io/console console-openshift-console.apps.qe-413-0216.qe.devcluster.openshift.com console-redirect custom-route-redirect edge/Redirect None
route.route.openshift.io/console-custom console-openshift-custom.apps.qe-413-0216.qe.devcluster.openshift.com console https reencrypt/Redirect None
route.route.openshift.io/downloads downloads-openshift-console.apps.qe-413-0216.qe.devcluster.openshift.com downloads http edge/Redirect None
route.route.openshift.io/downloads-custom openshift-downloads-custom.apps.qe-413-0216.qe.devcluster.openshift.com downloads http edge/Redirect None
7. Could not open console url successfully. There is error info for console operator:
Expected results:
7. Should be able to access console successfully.
Additional info:
Please review the following PR: https://github.com/openshift/installer/pull/7816
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Copying BZ https://bugzilla.redhat.com/show_bug.cgi?id=2250911 to the OCP side (as the fix is needed in the console). [UI] In the openshift-storage-client namespace, an RBD PVC with 'RWX' access mode and volume mode 'Filesystem' can be created from the Client. However, this is an invalid combination for RBD PVC creation: in the ODF Operator UI on other platforms, the volume mode selector is not shown when the ceph-rbd storage class and RWX access mode are selected. It is still visible in the client operator view, and the resulting PVC gets stuck in Pending state.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Deploy a Provider/Client setup. 2. From the UI create a PVC: select the ceph-rbd storage class and RWX access mode; in the case of this bug both 'Filesystem' and 'Block' volume modes are visible in the UI. Select volume mode 'Filesystem' and create the PVC.
Actual results:
PVC is created but stuck in Pending status. The PVC event shows an error like: Generated from openshift-storage-client.rbd.csi.ceph.com_csi-rbdplugin-provisioner-6d9dcb9fc7-vjj22_2bd4ede5-9418-4c8e-80ae-169b5cb4fa8012 times in the last 13 minutes failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = InvalidArgument desc = multi node access modes are only supported on rbd `block` type volumes
Expected results:
Volume mode should not be visible on the page when a PVC with RWX access mode and the RBD storage class is selected.
Additional info:
Screenshots are attached to the BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2250911 https://bugzilla.redhat.com/show_bug.cgi?id=2250911#c3
Please review the following PR: https://github.com/openshift/thanos/pull/142
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
Per the latest decision, RH is not going to support installing an OCP cluster on Nutanix with nested virtualization. Thus the "Install OpenShift Virtualization" checkbox on the "Operators" page should be disabled when the "Nutanix" platform is selected on the "Cluster Details" page.
Slack discussion thread
https://redhat-internal.slack.com/archives/C0211848DBN/p1706640683120159
Nutanix
https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000XeiHCAS
Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/61
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
During the cluster destroy operation, unexpected results from the IBM Cloud API calls for Disks can cause panics when response data (or the responses themselves) are missing, leading to unexpected failures during destroy.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Unknown, dependent on IBM Cloud API responses
Steps to Reproduce:
1. Successfully create IPI cluster on IBM Cloud 2. Attempt to cleanup (destroy) the cluster
Actual results:
Golang panic attempting to parse a HTTP response that is missing or lacking data. level=info msg=Deleted instance "ci-op-97fkzvv2-e6ed7-5n5zg-master-0" E0918 18:03:44.787843 33 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 228 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x6a3d760?, 0x274b5790}) /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xfffffffe?}) /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75 panic({0x6a3d760, 0x274b5790}) /usr/lib/golang/src/runtime/panic.go:884 +0x213 github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).waitForDiskDeletion.func1() /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/disk.go:84 +0x12a github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).Retry(0xc000791ce0, 0xc000573700) /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:99 +0x73 github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).waitForDiskDeletion(0xc000791ce0, {{0xc00160c060, 0x29}, {0xc00160c090, 0x28}, {0xc0016141f4, 0x9}, {0x82b9f0d, 0x4}, {0xc00160c060, ...}}) /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/disk.go:78 +0x14f github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).destroyDisks(0xc000791ce0) /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/disk.go:118 +0x485 github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).executeStageFunction.func1() /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:201 +0x3f k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1({0x7f7801e503c8, 0x18}) /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:109 +0x1b k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext({0x227a2f78?, 0xc00013c000?}, 0xc000a9b690?) /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:154 +0x57 k8s.io/apimachinery/pkg/util/wait.poll({0x227a2f78, 0xc00013c000}, 0xd0?, 0x146fea5?, 0x7f7801e503c8?) /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:245 +0x38 k8s.io/apimachinery/pkg/util/wait.PollImmediateInfiniteWithContext({0x227a2f78, 0xc00013c000}, 0x4136e7?, 0x28?) /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:229 +0x49 k8s.io/apimachinery/pkg/util/wait.PollImmediateInfinite(0x100000000000000?, 0x806f00?) /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:214 +0x46 github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).executeStageFunction(0xc000791ce0, {{0x82bb9a3?, 0xc000a9b7d0?}, 0xc000111de0?}, 0x840366?, 0xc00054e900?) /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:198 +0x108 created by github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).destroyCluster /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:172 +0xa87 panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference
Expected results:
Destroy IBM Cloud Disks during cluster destroy, or provide a useful error message to follow up on.
Additional info:
The ability to reproduce is relatively low, as it requires the IBM Cloud APIs to return specific data (or lack thereof); it is currently unknown why the HTTP response and/or data is missing. IBM Cloud already has a PR to attempt to mitigate this issue, as was done with other destroy resource calls; potential follow-up for additional resources as necessary. https://github.com/openshift/installer/pull/7515
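For illustration, a minimal sketch of the defensive checks described above, assuming a hypothetical SDK call shape (the type and function names are placeholders, not the actual IBM Cloud SDK): treat a nil result or nil field as an error to retry or report rather than dereferencing it.

```go
package example

import "fmt"

// diskResult mirrors the shape of an SDK call that returns a typed result;
// the field is a pointer, so it may legitimately be nil.
type diskResult struct {
	Status *string
}

// safeDiskStatus returns the disk status only after verifying every pointer
// on the path, so missing response data surfaces as an error, not a panic.
func safeDiskStatus(result *diskResult, err error) (string, error) {
	if err != nil {
		return "", fmt.Errorf("disk lookup failed: %w", err)
	}
	if result == nil || result.Status == nil {
		return "", fmt.Errorf("disk lookup returned no data")
	}
	return *result.Status, nil
}
```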
Description of problem:
Sample job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-azure-4.15-nightly-x86-data-path-9nodes/1760228008968327168
Version-Release number of selected component (if applicable):
How reproducible:
Anytime there is an error from the move-blobs command
Steps to Reproduce:
1. 2. 3.
Actual results:
A panic is shown, followed by the error message
Expected results:
An error message is shown, without a panic
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-provider-alibaba/pull/49
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/277
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When extracting `oc` and `openshift-install` from a release payload, the warnings below are shown, which might be confusing for the user. To make this clear, please update the warning to add the image names to the kubectl version mismatch message in addition to the version list.
Version-Release number of selected component (if applicable):
always
How reproducible:
Always
Steps to Reproduce:
1. Run command to extract oc & openshift-install using `oc adm extract` 2. Run oc adm release info --commits <payload> 3.
Actual results:
$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.16.0-0.ci-2024-03-05-032119
warning: multiple versions reported for the kubectl: 1.29.1,1.28.2,1.29.0
Expected results:
Show the image names which need a kubernetes bump along with the kubectl version list.
Additional info:
Thread here: https://redhat-internal.slack.com/archives/GK58XC2G2/p1709565188855519
Description of problem:
Kube apiserver pod keeps crashing when tested against v1.29 rebase
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. Run hypershift e2e against the v1.29 rebase
Actual results:
Fails https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1810-periodics-e2e-aws-ovn/1732734494688940032
Expected results:
Succeeds
Additional info:
Kube apiserver pod is crashlooping with: E1208 21:17:06.619997 1 run.go:74] "command failed" err="group version flowcontrol.apiserver.k8s.io/v1alpha1 that has not been registered"
This is a follow-up for https://issues.redhat.com/browse/OCPBUGS-14829 and https://issues.redhat.com/browse/OCPBUGS-21821: let's add an alert when vSphere users use usernames without a domain, since that breaks storage and should be surfaced to the administrator.
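For illustration, a minimal sketch of the condition such an alert could key on, assuming vSphere credentials are expected in user@domain form; the function name is illustrative and this is not the operator's actual check.

```go
package example

import "strings"

// hasDomain reports whether a vSphere username carries a domain part;
// usernames without one are the case the proposed alert should flag.
func hasDomain(username string) bool {
	return strings.Contains(username, "@")
}
```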
Description of problem:
This was found when testing OCP-71263 and regression OCP-35770 for 4.15. For GCP in Mint mode, the root credential can be removed after cluster installation. But after removing the root credential, CCO became degraded.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-25-051548 4.15.0-rc.3
How reproducible:
Always
Steps to Reproduce:
1.Install a GCP cluster with Mint mode 2.After install, remove the root credential jianpingshu@jshu-mac ~ % oc delete secret -n kube-system gcp-credentials secret "gcp-credentials" deleted 3.Wait some time(about 1/2h to 1h), CCO became degrade jianpingshu@jshu-mac ~ % oc get co cloud-credential NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE cloud-credential 4.15.0-rc.3 True True True 6h45m 6 of 7 credentials requests are failing to sync. jianpingshu@jshu-mac ~ % oc -n openshift-cloud-credential-operator get -o json credentialsrequests | jq -r '.items[] | select(tostring | contains("InfrastructureMismatch") | not) | .metadata.name as $n | .status.conditions // [{type: "NoConditions"}] | .[] | .type + "=" + .status + " " + $n + " " + .reason + ": " + .message' | sort CredentialsProvisionFailure=False openshift-cloud-network-config-controller-gcp CredentialsProvisionSuccess: successfully granted credentials request CredentialsProvisionFailure=True cloud-credential-operator-gcp-ro-creds CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found CredentialsProvisionFailure=True openshift-gcp-ccm CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found CredentialsProvisionFailure=True openshift-gcp-pd-csi-driver-operator CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found CredentialsProvisionFailure=True openshift-image-registry-gcs CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found CredentialsProvisionFailure=True openshift-ingress-gcp CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found CredentialsProvisionFailure=True openshift-machine-api-gcp CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found openshift-cloud-network-config-controller-gcp has no failure because it doesn't has customized role in 4.15.0.rc3
Actual results:
CCO became degraded
Expected results:
CCO should not be degraded; only the "Upgradeable" condition should be updated to reflect the missing root credential
Additional info:
Tested the same case on 4.14.10, no issue
Description of problem:
On the service details page, under the Revision and Route tabs, the user sees a "No resource found" message although a Revision and a Route are created for that service
Version-Release number of selected component (if applicable):
4.15.z
How reproducible:
Always
Steps to Reproduce:
1.Install serverless operator 2.Create serving instance 3.Create knative service/ function 4.Go to details page
Actual results:
User is not able to see Revision and Route created for the service
Expected results:
User should be able to see Revision and Route created for the service
Additional info:
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource-operator/pull/101
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When executing oc mirror using an oci path, you can end up in an error state when the destination is a file://<path> destination (i.e. mirror to disk).
Version-Release number of selected component (if applicable):
4.14.2
How reproducible:
always
Steps to Reproduce:
At IBM we use the ibm-pak tool to generate an OCI catalog, but this bug is reproducible using a simple skopeo copy. Once you've copied the image locally you can move it around using file system copy commands to test this in different ways.
1. Make a directory structure like this to simulate how ibm-pak creates its own catalogs. The problem seems to be related to the path you use, so this represents the failure case:
mkdir -p /root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list
2. Make a location where the local storage will live:
mkdir -p /root/.ibm-pak/oc-mirror-storage
3. Next, copy the image locally using skopeo:
skopeo copy docker://icr.io/cpopen/ibm-zcon-zosconnect-catalog@sha256:8d28189637b53feb648baa6d7e3dd71935656a41fd8673292163dd750ef91eec oci:///root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list --all --format v2s2
4. You can copy the OCI catalog content to a location where things will work properly so you can see a working example:
cp -r /root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list /root/ibm-zcon-zosconnect-catalog
5. You'll need an ISC... I've included both oci references in the example (the commented-out one works, but the oci:///root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list reference fails).
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
  - catalog: oci:///root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list
  #- catalog: oci:///root/ibm-zcon-zosconnect-catalog
    packages:
    - name: ibm-zcon-zosconnect
      channels:
      - name: v1.0
    full: true
    targetTag: 27ba8e
    targetCatalog: ibm-catalog
storageConfig:
  local:
    path: /root/.ibm-pak/oc-mirror-storage
6. Run oc mirror (remember the ISC has oci refs for good and bad scenarios). You may want to change your working directory to different locations between running the good/bad examples.
oc mirror --config /root/.ibm-pak/data/publish/latest/image-set-config.yaml file://zcon --dest-skip-tls --max-per-registry=6
Actual results:
Logging to .oc-mirror.log
Found: zcon/oc-mirror-workspace/src/publish
Found: zcon/oc-mirror-workspace/src/v2
Found: zcon/oc-mirror-workspace/src/charts
Found: zcon/oc-mirror-workspace/src/release-signatures
error: ".ibm-pak/data/publish/latest/catalog-oci/manifest-list/kubebuilder/kube-rbac-proxy@sha256:db06cc4c084dd0253134f156dddaaf53ef1c3fb3cc809e5d81711baa4029ea4c" is not a valid image reference: invalid reference format
Expected results:
Simple example where things were working with the oci:///root/ibm-zcon-zosconnect-catalog reference (this was executed in the same workspace so no new images were detected).
Logging to .oc-mirror.log
Found: zcon/oc-mirror-workspace/src/publish
Found: zcon/oc-mirror-workspace/src/v2
Found: zcon/oc-mirror-workspace/src/charts
Found: zcon/oc-mirror-workspace/src/release-signatures
3 related images processed in 668.063974ms
Writing image mapping to zcon/oc-mirror-workspace/operators.1700092336/manifests-ibm-zcon-zosconnect-catalog/mapping.txt
No new images detected, process stopping
Additional info:
I debugged the error that happened and captured one of the instances where the ParseReference call fails. This is only for reference to help narrow down the issue.
github.com/openshift/oc/pkg/cli/image/imagesource.ParseReference (/root/go/src/openshift/oc-mirror/vendor/github.com/openshift/oc/pkg/cli/image/imagesource/reference.go:111)
github.com/openshift/oc-mirror/pkg/image.ParseReference (/root/go/src/openshift/oc-mirror/pkg/image/image.go:79)
github.com/openshift/oc-mirror/pkg/cli/mirror.(*MirrorOptions).addRelatedImageToMapping (/root/go/src/openshift/oc-mirror/pkg/cli/mirror/fbc_operators.go:194)
github.com/openshift/oc-mirror/pkg/cli/mirror.(*OperatorOptions).plan.func3 (/root/go/src/openshift/oc-mirror/pkg/cli/mirror/operator.go:575)
golang.org/x/sync/errgroup.(*Group).Go.func1 (/root/go/src/openshift/oc-mirror/vendor/golang.org/x/sync/errgroup/errgroup.go:75)
runtime.goexit (/usr/local/go/src/runtime/asm_amd64.s:1594)
Also, I wanted to point out that because we use a period in the path (i.e. .ibm-pak) I wonder if that's causing the issue? This is just a guess and something to consider.
*FOLLOWUP* ... I just removed the period from ".ibm-pak" and that seemed to make the error go away.
Please review the following PR: https://github.com/openshift/cluster-kube-scheduler-operator/pull/515
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Run OLM on 4.15 cluster 2. 3.
Actual results:
OLM pod will panic
Expected results:
Should run just fine
Additional info:
This issue is due to a failure to initialize a new map when it is nil.
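For illustration, a minimal sketch of the nil-map pattern described above (type and field names are illustrative, not the OLM code): writing into a nil map panics at runtime, so the map has to be initialized before the first assignment.

```go
package example

// catalogCache holds a lazily created map; its zero value has a nil map.
type catalogCache struct {
	entries map[string]string
}

// set initializes the map on first use; without this guard the assignment
// below would panic with "assignment to entry in nil map".
func (c *catalogCache) set(key, value string) {
	if c.entries == nil {
		c.entries = make(map[string]string)
	}
	c.entries[key] = value
}
```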
Description of problem:
The shutdown-delay-duration argument for the openshift-oauth-apiserver is set to 3s in hypershift, but set to 15s in core openshift. Hypershift should update the value to match.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
If you allow the installer to provision a Power VS Workspace instead of bringing your own, it can sometimes fail when creating a network. This is because the Power Edge Router can take up to a minute to configure.
Version-Release number of selected component (if applicable):
How reproducible:
Infrequent, but will probably hit it within 50-100 runs
Steps to Reproduce:
1. Install on Power VS with IPI with serviceInstanceGUID not set in the install-config.yaml 2. Occasionally you'll observe a failure due to the workspace not being ready for networks
Actual results:
Failure
Expected results:
Success
Additional info:
Not consistently reproducible
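For illustration, a minimal sketch (not the installer's actual code) of polling for workspace readiness before creating networks; checkReady is a placeholder for the real workspace status call, and the interval and timeout are illustrative.

```go
package example

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForWorkspaceReady retries the readiness check until the Power Edge
// Router has finished configuring or the timeout expires.
func waitForWorkspaceReady(ctx context.Context, checkReady func(context.Context) (bool, error)) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 2*time.Minute, true, checkReady)
}
```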
Please review the following PR: https://github.com/openshift/prometheus/pull/195
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/247
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1006
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
It seems the issue is still present; tested on 4.15.0-0.nightly-2023-12-04-223539, there is a status message for each zone, but there is no summarized status, so moving this back to Assigned.
apbexternalroute yaml file is:
apiVersion: k8s.ovn.org/v1
kind: AdminPolicyBasedExternalRoute
metadata:
  name: default-route-policy
spec:
  from:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: test
  nextHops:
    static:
    - ip: "172.18.0.8"
    - ip: "172.18.0.9"
and Status section as below:
% oc get apbexternalroute
NAME                   LAST UPDATE   STATUS
default-route-policy   12s                     <--- still empty
% oc describe apbexternalroute default-route-policy | tail -n 10
Status:
  Last Transition Time: 2023-12-06T02:12:11Z
  Messages:
    qiowang-120620-gtt85-master-2.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    qiowang-120620-gtt85-master-0.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    qiowang-120620-gtt85-worker-a-55fzx.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    qiowang-120620-gtt85-master-1.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    qiowang-120620-gtt85-worker-b-m98ms.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    qiowang-120620-gtt85-worker-c-vtl8q.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
Events: <none>
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Add the possibility to have other search filters in the resource list toolbar
For now, using the props and some hacks, we were able to change the Name search into an IP search, but we would like to have both.
Description of problem:
Nothing happens when the user clicks the 'Configure' button next to the AlertmanagerReceiversNotConfigured alert.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-03-11-041450
How reproducible:
Always
Steps to Reproduce:
1. navigate to Home -> Overview, locate the AlertmanagerReceiversNotConfigured alert in 'Status' card 2. click the 'Configure' button next to AlertmanagerReceiversNotConfigured alert
Actual results:
nothing happens
Expected results:
The user should be taken to the Alertmanager configuration page /monitoring/alertmanagerconfig
Additional info:
Seen in 4.15-related update CI:
$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/console.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False[^:]*: \(.*\)|\1 \2 \3|' | sed 's|[.]apps[.][^ /]*|.apps...|g' | sort | uniq -c | sort -n 1 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... dial tcp 52.158.160.194:443: connect: connection refused 1 console RouteHealth_StatusError route not yet available, https://console-openshift-console.apps... returns '503 Service Unavailable' 2 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... dial tcp: lookup console-openshift-console.apps... on 172.30.0.10:53: no such host 2 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... EOF 8 console RouteHealth_RouteNotAdmitted console route is not admitted 16 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... context deadline exceeded (Client.Timeout exceeded while awaiting headers)
For example this 4.14 to 4.15 run had:
: [bz-Management Console] clusteroperator/console should not change condition/Available Run #0: Failed 1h25m23s { 1 unexpected clusteroperator state transitions during e2e test run Nov 28 03:42:41.207 - 1s E clusteroperator/console condition/Available reason/RouteHealth_FailedGet status/False RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-op-d2qsp1gp-2a31d.aws-2.ci.openshift.org): Get "https://console-openshift-console.apps.ci-op-d2qsp1gp-2a31d.aws-2.ci.openshift.org": context deadline exceeded (Client.Timeout exceeded while awaiting headers)}
While a timeout for the console Route isn't fantastic, an issue that only persists for 1s is not long enough to warrant immediate admin intervention. Teaching the console operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required.
At least 4.15. Possibly other versions; I haven't checked.
How reproducible:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/console.*condition/Available.*status/False' | grep 'periodic.*failures match' | sort periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 12 runs, 17% failed, 50% of failures match = 8% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-ppc64le (all) - 5 runs, 20% failed, 100% of failures match = 20% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 4 runs, 100% failed, 25% of failures match = 25% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 12 runs, 17% failed, 100% of failures match = 17% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 7 runs, 29% failed, 50% of failures match = 14% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 12 runs, 25% failed, 33% of failures match = 8% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 23% failed, 28% of failures match = 6% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 28% failed, 23% of failures match = 6% impact periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade (all) - 63 runs, 38% failed, 8% of failures match = 3% impact periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade (all) - 60 runs, 73% failed, 11% of failures match = 8% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 70 runs, 7% failed, 20% of failures match = 1% impact
Seems like it's primarily minor-version updates that trip this, and in jobs with high run counts, the impact percentage is single-digits.
There may be a way to reliable trigger these hiccups, but as a reproducer floor, running days of CI and checking to see whether impact percentages decrease would be a good way to test fixes post-merge.
Lots of console ClusterOperator going Available=False blips in 4.15 update CI.
Console goes Available=False if and only if immediate admin intervention is appropriate.
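For illustration, a minimal sketch (assuming a hypothetical health-check loop; this is not the console operator's actual code) of how a short grace period could keep Available=True across one-second route blips while still reporting failures that persist:

```go
package example

import "time"

// routeHealthTracker debounces transient route check failures: Available only
// flips to false after failures have persisted longer than the grace period.
// The field names and grace value are illustrative.
type routeHealthTracker struct {
	firstFailure time.Time
	grace        time.Duration // e.g. 30 * time.Second
}

// available records the latest check result and reports whether the operator
// should still consider the route available.
func (t *routeHealthTracker) available(checkErr error, now time.Time) bool {
	if checkErr == nil {
		t.firstFailure = time.Time{} // healthy again, reset the window
		return true
	}
	if t.firstFailure.IsZero() {
		t.firstFailure = now
	}
	return now.Sub(t.firstFailure) < t.grace
}
```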
Please review the following PR: https://github.com/openshift/openshift-apiserver/pull/409
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
1. The TaskRuns list page loads constantly when all projects are selected
2. The Archive icon is not displayed for some tasks on the TaskRun list page
3. On changing the namespace to All Projects, PipelineRuns and TaskRuns do not load properly
Version-Release number of selected component (if applicable):
4.15.z
How reproducible:
Always
Steps to Reproduce:
1. Create some TaskRuns 2. Go to the TaskRun list page 3. Select "All Projects" in the project dropdown
Actual results:
The screen keeps loading
Expected results:
Should load TaskRuns from all projects
Additional info:
Description of problem:
In a recently installed cluster running 4.13.29, after configuring the cluster-wide proxy, the "vsphere-problem-detector" is not picking up the proxy configuration. As the pod cannot reach vSphere, it fails to run checks:
2024-02-01T09:28:00.150332407Z E0201 09:28:00.150292 1 operator.go:199] failed to run checks: failed to connect to vsphere.local: Post "https://vsphere.local/sdk": dial tcp 172.16.1.3:443: i/o timeout
The pod doesn't get the cluster proxy settings as expected:
- name: HTTPS_PROXY
  value: http://proxy.local:3128
- name: HTTP_PROXY
  value: http://proxy.local:3128
Other storage-related pods get the expected configuration as above. This causes the vsphere-problem-detector to fail connections to vSphere, hence failing the health checks.
Version-Release number of selected component (if applicable):
4.13.29
How reproducible:
Always
Steps to Reproduce:
1.Configure cluster-wide proxy in the environment. 2. Wait for the change 3. Check the pod configuration
Actual results:
vSphere health checks failing
Expected results:
vSphere health checks working through the cluster proxy
Additional info:
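For illustration, a minimal sketch of the pattern implied by the expected result, assuming an operator injects the cluster proxy settings into its operand's Deployment as environment variables (the function name and values are placeholders, not the actual vsphere-problem-detector code):

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
)

// proxyEnvVars builds the environment variables a controller would typically
// add to its operand's container so outbound calls honor the cluster proxy.
func proxyEnvVars(httpProxy, httpsProxy, noProxy string) []corev1.EnvVar {
	return []corev1.EnvVar{
		{Name: "HTTP_PROXY", Value: httpProxy},
		{Name: "HTTPS_PROXY", Value: httpsProxy},
		{Name: "NO_PROXY", Value: noProxy},
	}
}
```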
Description of problem:
CAPI E2Es fail to start in some CAPI providers' release branches.
They fail with the following error: `go: errors parsing go.mod: /tmp/tmp.ssf1LXKrim/go.mod:5: unknown directive: toolchain` https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-api/199/pull-ci-openshift-cluster-api-master-e2e-aws-capi-techpreview/1765512397532958720#1:build-log.txt%3A91-95
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is because the script launching the e2e is launching it from the `main` branch of the cluster-capi-operator (which has some backward-incompatible go toolchain changes), rather than the correctly matching release branch.
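For context, a hedged illustration (not the actual repository contents) of the kind of go.mod stanza involved: the `toolchain` directive only exists since Go 1.21, so an older `go` binary parsing such a file fails with exactly the "unknown directive: toolchain" error quoted above.

```go
// go.mod (illustrative module path, not a real repo)
module example.com/hypothetical

go 1.21

// Go 1.20 and earlier do not understand this directive and abort parsing.
toolchain go1.21.5
```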
Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/273
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
To support external OIDC on hypershift, but not on self-managed, we need different schemas for the authentication CRD on a default-hypershift versus a default-self-managed. This requires us to change rendering so that it honors the clusterprofile.
Then we have to update the installer to match, then update hypershift, then update the manifests.
Please review the following PR: https://github.com/openshift/cloud-network-config-controller/pull/129
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/router/pull/547
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Reproducer:
1. On a GCP cluster, create an ingress controller with internal load balancer scope, like this:
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: foo
  namespace: openshift-ingress-operator
spec:
  domain: foo.<cluster-domain>
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      dnsManagementPolicy: Managed
      scope: Internal
2. Wait for load balancer service to complete rollout
$ oc -n openshift-ingress get service router-foo
NAME         TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-foo   LoadBalancer   172.30.101.233   10.0.128.5    80:32019/TCP,443:32729/TCP   81s
3. Edit ingress controller to set spec.endpointPublishingStrategy.loadBalancer.scope to External
the load balancer service (router-foo in this case) should get an external IP address, but currently it keeps the 10.x.x.x address that was already assigned.
Description of the problem:
Until the latest release we had a test which set the base DNS name to 11.11.11 and expected the BE to throw an exception; it was succeeding (the BE was throwing the exception).
Since the last release this is no longer the case and the DNS name is being accepted.
How reproducible:
Steps to reproduce:
1. Create a cluster and set the base DNS name to 11.11.11
2.
3.
Actual results:
No exception is thrown and the DNS name is accepted.
Expected results:
According to the discussion in the thread, a DNS name must start with a letter; in that case we expect the BE to throw an exception.
See TODOs in https://github.com/openshift/cluster-monitoring-operator/pull/2160/files
Remove the DeleteConfigMapByNamespaceAndName util (if not used by others )
Description of problem:
Original issue reported here: https://issues.redhat.com/browse/ACM-6189 reported by QE and customer.
Using ACM/hive, customers can deploy OpenShift on vSphere. In the upcoming release of ACM 2.9, we support customers on OCP 4.12 - 4.15. The ACM UI updates the install config as users add configuration details.
This has worked for several releases over the last few years. However in OCP 4.13+ the format has changed and there is now additional validation to check if the datastore is a full path.
As per https://issues.redhat.com/browse/SPLAT-1093, removal of the legacy fields should not happen until later, so any legacy configurations such as relative paths should still work.
Version-Release number of selected component (if applicable):
ACM 2.9.0-DOWNSTREAM-2023-10-24-01-06-09 OpenShift 4.14.0-rc.7 OpenShift 4.13.18 OpenShift 4.12.39
How reproducible:
Always
Steps to Reproduce:
1. Deploy OCP 4.12 on vSphere using the legacy field and a relative path without folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS)
2. Installer passes.
3. Deploy OCP 4.12 on vSphere using the legacy field and a relative path WITH folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS-Folder/WORKLOAD-DS)
4. Installer fails.
5. Deploy OCP 4.12 on vSphere using the legacy field and the FULL path (e.g. platform.vsphere.defaultDatastore: /Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS)
6. Installer fails.
7. Deploy OCP 4.13 on vSphere using the legacy field and a relative path without folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS)
8. Installer fails.
9. Deploy OCP 4.13 on vSphere using the legacy field and a relative path WITH folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS-Folder/WORKLOAD-DS)
10. Installer passes.
11. Deploy OCP 4.13 on vSphere using the legacy field and the FULL path (e.g. platform.vsphere.defaultDatastore: /Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS)
12. Installer fails.
Actual results:
Default Datastore Value | OCP 4.12 | OCP 4.13 | OCP 4.14 |
---|---|---|---|
/Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS | No | Yes | Yes |
WORKLOAD-DS-Folder/WORKLOAD-DS | No | Yes | Yes |
WORKLOAD-DS | Yes | No | No |
For OCP 4.12.z managed cluster deployments, the name-only path is the only one that works as expected.
For OCP 4.13.z+ managed cluster deployments, only the full path and the relative path with folder work as expected.
Expected results:
OCP 4.13.z+ should accept a relative path without specifying the folder, like OCP 4.12.z does.
Additional info:
Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/62
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/158
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/network-tools/pull/108
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When creating a HostedCluster with the 'NodePort' service publishing strategy, the VMs (guest nodes) try to contact HCP services such as ignition and oauth. If these services are colocated on the same infra node, they can't be reached via NodePort because the 'virt-launcher' NetworkPolicy blocks the traffic. We need to explicitly add access to the oauth and ignition-server-proxy pods so they can be reached from virtual machines on the same node.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Always, if conditions are met
Steps to Reproduce:
1. As described above 2. 3.
Actual results:
VMs are not joining the cluster as nodes if the ignition server is running on the same infra node as the VM.
Expected results:
All VMs are joining the cluster as nodes, and the HostedCluster is eventually Completed and Available
Additional info:
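For illustration, a minimal sketch of the kind of ingress rule described above, allowing traffic from virt-launcher pods; the pod selector labels are an assumption for this sketch and not necessarily the HyperShift manifests' actual selectors.

```go
package example

import (
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// allowFromVirtLauncher returns an ingress rule that admits traffic from pods
// labeled as virt-launcher, which is what the guest VMs run inside.
func allowFromVirtLauncher() networkingv1.NetworkPolicyIngressRule {
	return networkingv1.NetworkPolicyIngressRule{
		From: []networkingv1.NetworkPolicyPeer{
			{
				PodSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"kubevirt.io": "virt-launcher"},
				},
			},
		},
	}
}
```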
Component Readiness has found a potential regression in [sig-arch] events should not repeat pathologically for ns/openshift-monitoring.
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.15
Start Time: 2024-01-04T00:00:00Z
End Time: 2024-01-10T23:59:59Z
Success Rate: 42.31%
Successes: 11
Failures: 15
Flakes: 0
Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 100.00%
Successes: 151
Failures: 0
Flakes: 0
Description of problem:
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc get co/image-registry
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry             False       True          True       50m     Available: The deployment does not exist...
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc describe co/image-registry
...
Message: Progressing: Unable to apply resources: unable to sync storage configuration: cos region corresponding to a powervs region wdc not found
...
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-ppc64le-2024-01-10-083055
How reproducible:
Always
Steps to Reproduce:
1. Deploy a PowerVS cluster in wdc06 zone
Actual results:
See above error message
Expected results:
Cluster deploys
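For illustration, a minimal sketch of the lookup implied by the error message, assuming a simple Power VS-to-COS region map; the map contents and names are illustrative, not an authoritative region table or the operator's actual code.

```go
package example

import "fmt"

// cosRegionForPowerVS maps a Power VS region to an object-storage (COS)
// region and returns an error when the mapping is missing, which is the
// failure reported for the wdc06 zone above.
func cosRegionForPowerVS(region string) (string, error) {
	regions := map[string]string{
		"dal": "us-south",
		"wdc": "us-east",
	}
	cos, ok := regions[region]
	if !ok {
		return "", fmt.Errorf("cos region corresponding to a powervs region %s not found", region)
	}
	return cos, nil
}
```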
Description of problem:
When using the modal dialogs in a hook as part of the actions hook (i.e. useApplicationsActionsProvider), the console will throw an error since the console framework passes null objects as part of the render cycle. According to Jon Jackson, the console should be safe from null objects, but it looks like the code for useDeleteModal and getGroupVersionKindForResource is not.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Use one of the modal APIs in an actions provider hook 2. 3.
Actual results:
Caught error in a child component: TypeError: Cannot read properties of undefined (reading 'split') at i (main-chunk-9fbeef79a…d3a097ed.min.js:1:1) at u (main-chunk-9fbeef79a…d3a097ed.min.js:1:1) at useApplicationActionsProvider (useApplicationActionsProvider.tsx:23:43) at ApplicationNavPage (ApplicationDetails.tsx:38:67) at na (vendors~main-chunk-8…87b.min.js:174297:1) at Hs (vendors~main-chunk-8…87b.min.js:174297:1) at Sc (vendors~main-chunk-8…87b.min.js:174297:1) at Cc (vendors~main-chunk-8…87b.min.js:174297:1) at _c (vendors~main-chunk-8…87b.min.js:174297:1) at pc (vendors~main-chunk-8…87b.min.js:174297:1)
Expected results:
Works with no error
Additional info:
Please review the following PR: https://github.com/openshift/router/pull/546
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/56
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Based on this and this component readiness data comparing success rates for those two particular tests, we are regressing ~7-10% between the current 4.15 master and 4.14.z (in other words, we made the product ~10% worse).
These jobs and their failures are all caused by increased etcd leader elections disrupting seemingly unrelated test cases across the VSphere AMD64 platform.
Since this particular platform's business significance is high, I'm setting this as "Critical" severity.
Please get in touch with me or Dean West if more teams need to be pulled into investigation and mitigation.
Version-Release number of selected component (if applicable):
4.15 / master
How reproducible:
Component Readiness Board
Actual results:
The etcd leader elections are elevated. Some jobs indicate it is due to disk i/o throughput OR network overload.
Expected results:
1. We NEED to understand what is causing this problem.
2. If we can mitigate this, we should.
3. If we cannot mitigate this, we need to document this or work with the VSphere infrastructure provider to fix this problem.
4. We optionally need a way to measure how often this happens in our fleet so we can evaluate how bad it is.
Additional info:
4.16 payloads are failing because of multiple issues. One of them is due to a missing python module on RHEL9.
Here is a slack thread on it: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1707744121181479
PR for the fix:
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource-operator/pull/94
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The SAST scans keep coming up with false-positive results from test and vendor files. This bug is just a placeholder to allow us to backport the change that ignores those files.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
A change to how Power VS Workspaces are queried is not compatible with the version of terraform-provider-ibm
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy with Power VS
2. Fail with an error stating: [ERROR] Error retrieving service offering: ServiceDoesnotExist: Given service : "power-iaas" doesn't exist
Actual results:
Fail with [ERROR] Error retrieving service offering: ServiceDoesnotExist: Given service : "power-iaas" doesn't exist
Expected results:
Install should succeed.
Additional info:
Description of problem:
CAPI manifests have the TechPreviewNoUpgrade annotation but are missing the CustomNoUpgrade annotation
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The ValidatingAdmissionPolicy admission plugin is set in OpenShift 4.14+ kube-apiserver config, but is missing from the HyperShift config. It should be set.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
4.15: https://github.com/openshift/hypershift/blob/release-4.15/control-plane-operator/controllers/hostedcontrolplane/kas/config.go#L293-L341
4.14: https://github.com/openshift/hypershift/blob/release-4.14/control-plane-operator/controllers/hostedcontrolplane/kas/config.go#L283-L331
Expected results:
Expect to see ValidatingAdmissionPolicy
Additional info:
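For illustration only, a minimal sketch of what enabling the plugin could look like in an apiServerArguments-style config fragment; the exact field names and the surrounding structure of the HyperShift kube-apiserver config are assumptions and should be checked against the linked files:

apiServerArguments:
  enable-admission-plugins:
  # ... existing plugins ...
  - ValidatingAdmissionPolicy    # assumed missing entry to add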
Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/69
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The bubble box is shown with a wrong layout
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-11-16-110328
How reproducible:
Always
Steps to Reproduce:
1. Make sure there is no pod under the project you are using
2. Navigate to the Networking -> NetworkPolicies -> Create NetworkPolicy page and click 'affected pods' in the Pod selector section
3. Check the layout in the bubble component
Actual results:
The layout is incorrect (shared file: https://drive.google.com/file/d/1I8e2ZkiFO2Gu4nSt9kJ6JmRG3LdvkE-u/view?usp=drive_link)
Expected results:
The layout should be correct
Additional info:
Description of problem:
The installer does not precheck whether the node architecture and VM type are consistent on AWS and GCP; the check works on Azure.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-multi-2023-12-06-195439
How reproducible:
Always
Steps to Reproduce:
1. Configure the compute architecture field to arm64 but choose an amd64 instance type in install-config
2. Create the cluster
3. Check the installation
Actual results:
Azure prechecks whether the architecture is consistent with the instance type when creating manifests, for example:
12-07 11:18:24.452 [INFO] Generating manifests files.....
12-07 11:18:24.452 level=info msg=Credentials loaded from file "/home/jenkins/ws/workspace/ocp-common/Flexy-install/flexy/workdir/azurecreds20231207-285-jd7gpj"
12-07 11:18:56.474 level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: controlPlane.platform.azure.type: Invalid value: "Standard_D4ps_v5": instance type architecture 'Arm64' does not match install config architecture amd64
But AWS and GCP have no such precheck; installation fails later, after many resources have already been created. The case is more likely to happen in a multiarch cluster.
Expected results:
The installer should precheck architecture and VM type, especially for platforms that support heterogeneous clusters (AWS, GCP, Azure)
Additional info:
Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/250
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
[sig-arch][Late] collect certificate data [Suite:openshift/conformance/parallel]
This test is currently making the Unknown component red, but it should be aligned to the kube-apiserver component. It looks like two others in the same file should be as well.
Description of problem:
1) Customer tags an image with a tag name that includes # (hashtag)
uk302-img-app-j:v0.6.12-build0000#000
2) When the customer uses OADP to back up images, they get the error below
error excuting custom action(groupResource=imagestream.image.openshift.io namespace=dbp-p0010001, name=uk302-image-app-j): rpc error: code= Unknown= Invalid destination name udistribution-s3-c9814a92-67a4-4251-bd0d-142dfc4d3c80://dbp-p0010001/uk302-image-app-j:v0.6.12-build0000#00: invalid reference format
3) When checking the source code below, we found that there are checks on the tag name; it seems # (hashtag) is not allowed by the regexp check
func copyImage(log logr.Logger, src, dest string, copyOptions *copy.Options) ([]byte, error) {
	policyContext, err := getPolicyContext()
	if err != nil {
		return []byte{}, fmt.Errorf("Error loading trust policy: %v", err)
	}
	defer policyContext.Destroy()
	srcRef, err := alltransports.ParseImageName(src)
	if err != nil {
		return []byte{}, fmt.Errorf("Invalid source name %s: %v", src, err)
	}
	destRef, err := alltransports.ParseImageName(dest)
	if err != nil {
		return []byte{}, fmt.Errorf("Invalid destination name %s: %v", dest, err)
	}
https://github.com/containers/image/blob/main/docker/reference/regexp.go#L111
const (
	// alphaNumeric defines the alpha numeric atom, typically a
	// component of names. This only allows lower case characters and digits.
	alphaNumeric = `[a-z0-9]+`
	// separator defines the separators allowed to be embedded in name
	// components. This allow one period, one or two underscore and multiple
	// dashes. Repeated dashes and underscores are intentionally treated
	// differently. In order to support valid hostnames as name components,
	// supporting repeated dash was added. Additionally double underscore is
	// now allowed as a separator to loosen the restriction for previously
	// supported names.
	separator = `(?:[._]|__|[-]*)`
	// repository name to start with a component as defined by DomainRegexp
	// and followed by an optional port.
	domainComponent = `(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9])`
	// The string counterpart for TagRegexp.
	tag = `[\w][\w.-]{0,127}`
	// The string counterpart for DigestRegexp.
	digestPat = `[A-Za-z][A-Za-z0-9]*(?:[-_+.][A-Za-z][A-Za-z0-9]*)*[:][[:xdigit:]]{32,}`
	// The string counterpart for IdentifierRegexp.
	identifier = `([a-f0-9]{64})`
	// The string counterpart for ShortIdentifierRegexp.
	shortIdentifier = `([a-f0-9]{6,64})`
Expected results:
The customer wants to know whether this should be considered a bug. When doing oc tag, we should have some validation of the tag name to prevent # (hashtag) or other disallowed characters from being set in the image tag, since they cause unexpected issues when using OADP or other tools.
Please have a check, thank you!
Regards
Jacob
The Observe -> Alerting, Metrics, and Targets pages do not load as expected; a blank page is shown
4.15.0-0.nightly-2023-12-07-041003
Always
1. Navigate to the Observe -> Alerting, Metrics, and Targets pages directly
Blank page, no data is loaded
Works as normal
Failed to load resource: the server responded with a status of 404 (Not Found) /api/accounts_mgmt/v1/subscriptions?page=1&search=external_cluster_id%3D%2715ace915-53d3-4455-b7e3-b7a5a4796b5c%27:1
Failed to load resource: the server responded with a status of 403 (Forbidden) main-chunk-bb9ed989a7f7c65da39a.min.js:1
API call to get support level has failed r: Access denied due to cluster policy. at https://console-openshift-console.apps.ci-ln-9fl1l5t-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-bb9ed989a7f7c65da39a.min.js:1:95279 (anonymous) @ main-chunk-bb9ed989a7f7c65da39a.min.js:1
/api/kubernetes/apis/operators.coreos.com/v1alpha1/namespaces/#ALL_NS#/clusterserviceversions?:1 Failed to load resource: the server responded with a status of 404 (Not Found)
vendor-patternfly-5~main-chunk-95cb256d9fa7738d2c46.min.js:1 Modal: When using hasNoBodyWrapper or setting a custom header, ensure you assign an accessible name to the modal container with aria-label or aria-labelledby.
Description of problem:
The node-network-identity deployment should be configured to assist in a controlled rollout of the microservice pods. The general goal is to have a microservice pod report to Kubernetes as ready only when it has completed initialization and is stable enough to complete tasks.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/103
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
As an ODC Helm backend developer, I would like to be able to bump the version of Helm to 3.13 to stay synced with the version we will ship with OCP 4.15
Normal activity we do every time a new OCP version is released to stay current
NA
NA
Bump the version of Helm to 3.13; run, build, and unit test, and make sure everything is working as expected. Last time we had a conflict with the DevFile backend.
Might have dependencies on the DevFile team to move some dependencies forward
NA
Console Helm dependency is moved to 3.13
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Unknown
Verified
Unsatisfied
As part of the PatternFly update from 5.1.0 to 5.1.1 it was required to disable some dev console e2e tests.
See https://github.com/openshift/console/pull/13380
We need to re-enable and adapt at least these tests:
In frontend/packages/dev-console/integration-tests/features/addFlow/create-from-devfile.feature and in frontend/packages/dev-console/integration-tests/features/e2e/add-flow-ci.feature
In frontend/packages/helm-plugin/integration-tests/features/helm-release.feature and frontend/packages/helm-plugin/integration-tests/features/helm/actions-on-helm-release.feature:
Can we please also check why we have both broken tests in two feature files? 🤷
Please review the following PR: https://github.com/openshift/coredns/pull/111
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/321
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Only customers have a break-glass certificate signer.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Always
Steps to Reproduce:
1. Create a CSR with any other signer chosen
2. It does not work
Actual results:
does not work
Expected results:
should work
Additional info:
Description of problem:
A non-functional filter component should not be shown in the resources section of the Search page in phone view
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2023-12-15-211129
How reproducible:
Always
Steps to Reproduce:
1. Switch the browser to phone view (Browser -> F12 -> Toggle device toolbar), e.g. iPhone 14 Pro Max
2. Navigate to the Home -> Search page and select one resource, e.g. APIRequestCounts
3. Check the component in the resources panel
Actual results:
There is a non-functional filter icon under the 'Create APIRequestCount' button
Expected results:
Remove the filter component in phone view, or make sure the filter works in phone view if customers need it
Additional info:
https://drive.google.com/file/d/1Fwb8EGznWkA1z3cVVzGcJJjMFkjkuUhK/view?usp=drive_link
Description of problem:
Similar to https://bugzilla.redhat.com/show_bug.cgi?id=1996624, when an AWS root credential (which must possess the "iam:SimulatePrincipalPolicy" permission) exists on a BM cluster, the CCO Pod crashes when running the secretannotator controller.
Steps to Reproduce:
1. Install a BM cluster
fxie-mac:cloud-credential-operator fxie$ oc get infrastructures.config.openshift.io cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2024-01-28T19:50:05Z"
  generation: 1
  name: cluster
  resourceVersion: "510"
  uid: 45bc2a29-032b-4c74-8967-83c73b0141c4
spec:
  cloudConfig:
    name: ""
  platformSpec:
    type: None
status:
  apiServerInternalURI: https://api-int.fxie-bm1.qe.devcluster.openshift.com:6443
  apiServerURL: https://api.fxie-bm1.qe.devcluster.openshift.com:6443
  controlPlaneTopology: SingleReplica
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: fxie-bm1-x74wn
  infrastructureTopology: SingleReplica
  platform: None
  platformStatus:
    type: None
2. Create an AWS user with IAMReadOnlyAccess permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iam:GenerateCredentialReport",
        "iam:GenerateServiceLastAccessedDetails",
        "iam:Get*",
        "iam:List*",
        "iam:SimulateCustomPolicy",
        "iam:SimulatePrincipalPolicy"
      ],
      "Resource": "*"
    }
  ]
}
3. Create AWS root credentials with a set of access keys of the user above
4. Trigger a reconcile of the secretannotator controller, e.g. via editing cloudcredential/cluster
Logs:
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:CreateAccessKey" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:CreateUser" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:DeleteAccessKey" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:DeleteUser" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:DeleteUserPolicy" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:PutUserPolicy" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:TagUser" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Tested creds not able to perform all requested actions" controller=secretannotator
I0129 04:47:27.988535 1 reflector.go:289] Starting reflector *v1.Infrastructure (10h37m20.569091933s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233
I0129 04:47:27.988546 1 reflector.go:325] Listing and watching *v1.Infrastructure from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233
I0129 04:47:27.989503 1 reflector.go:351] Caches populated for *v1.Infrastructure from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1a964a0]
goroutine 341 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:115 +0x1e5
panic({0x3fe72a0?, 0x809b9e0?})
/usr/lib/golang/src/runtime/panic.go:914 +0x21f
github.com/openshift/cloud-credential-operator/pkg/operator/utils/aws.LoadInfrastructureRegion({0x562e1c0?, 0xc002c99a70?}, {0x5639ef0, 0xc0001b6690})
/go/src/github.com/openshift/cloud-credential-operator/pkg/operator/utils/aws/utils.go:72 +0x40
github.com/openshift/cloud-credential-operator/pkg/operator/secretannotator/aws.(*ReconcileCloudCredSecret).validateCloudCredsSecret(0xc0008c2000, 0xc002586000)
/go/src/github.com/openshift/cloud-credential-operator/pkg/operator/secretannotator/aws/reconciler.go:206 +0x1a5
github.com/openshift/cloud-credential-operator/pkg/operator/secretannotator/aws.(*ReconcileCloudCredSecret).Reconcile(0xc0008c2000, {0x30?, 0xc000680c00?}, {0x4f38a3d?, 0x0?}, {0x4f33a20?, 0x416325?})
/go/src/github.com/openshift/cloud-credential-operator/pkg/operator/secretannotator/aws/reconciler.go:166 +0x605
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x561ff20?, {0x561ff20?, 0xc002ff3b00?}, {0x4f38a3d?, 0x3b180c0?}, {0x4f33a20?, 0x55eea08?})
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000189360, {0x561ff58, 0xc0007e5040}, {0x4589f00?, 0xc000570b40?})
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314 +0x365
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000189360, {0x561ff58, 0xc0007e5040})
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265 +0x1c9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 183
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:222 +0x565
Actual results:
CCO Pod crashes and restarts in a loop:
fxie-mac:cloud-credential-operator fxie$ oc get po -n openshift-cloud-credential-operator -w
NAME                                         READY   STATUS    RESTARTS        AGE
cloud-credential-operator-657bdffdff-9wzrs   2/2     Running   3 (2m35s ago)   8h
Please review the following PR: https://github.com/openshift/bond-cni/pull/60
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/274
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
We have detected several bugs in Console dynamic plugin SDK v1 as part of Kubevirt plugin PR #1804
These bugs affect dynamic plugins which target Console 4.15+
ERROR in [entry] [initial] kubevirt-plugin.494371abc020603eb01f.hot-update.js Missing call to loadPluginEntry
LOG from @openshift-console/dynamic-plugin-sdk-webpack/lib/webpack/loaders/dynamic-module-import-loader ../node_modules/ts-loader/index.js??ruleSet[1].rules[0].use[0]!./utils/hooks/useKubevirtWatchResource.ts
<w> Detected parse errors in /home/vszocs/work/kubevirt-plugin/src/utils/hooks/useKubevirtWatchResource.ts
WARNING in shared module @patternfly/react-core No required version specified and unable to automatically determine one. Unable to find required version for "@patternfly/react-core" in description file (/home/vszocs/work/kubevirt-plugin/node_modules/@openshift-console/dynamic-plugin-sdk/package.json). It need to be in dependencies, devDependencies or peerDependencies.
1. git clone Kubevirt plugin repo
2. switch to commit containing changes from PR #1804
3. yarn install && yarn dev to update dependencies and start local dev server
In https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-02-06-031624, I see several PR's involving moving crio metrics. This payload is being rejected on TargetAlerts alerts on AWS minor upgrades.
Example job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-[…]e-from-stable-4.15-e2e-aws-ovn-upgrade/1754707749708500992
[sig-node][invariant] alert/TargetDown should not be at or above info in ns/kube-system expand_less 0s { TargetDown was at or above info for at least 1m58s on platformidentification.JobType
(maxAllowed=1s): pending for 15m0s, firing for 1m58s: Feb 06 04:48:42.698 - 118s W namespace/kube-system alert/TargetDown alertstate/firing severity/warning ALERTS{alertname="TargetDown", alertstate="firing", job="crio", namespace="kube-system", prometheus="openshift-monitoring/k8s", service="kubelet", severity="warning"}}
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/109
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
`ensureSigningCertKeyPair` and `ensureTargetCertKeyPair` always update the secret type. If the secret requires a metadata update, its previous content will not be retained.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Install a 4.6 cluster (or make sure installer-generated secrets have `type: SecretTypeTLS` instead of `type: kubernetes.io/tls`)
2. Run secret sync
3. Check secret contents
Actual results:
Secret was regenerated with new content
Expected results:
Existing content should be preserved, content is not modified
Additional info:
This causes api-int CA update for clusters born in 4.6 or earlier.
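For context, a minimal sketch of the two secret shapes involved; names and data values are placeholders. Clusters born on old releases carry the literal type string mentioned in step 1, while the sync loop writes the standard TLS type:

# secret as created on an old (4.6-era) cluster
apiVersion: v1
kind: Secret
type: SecretTypeTLS
metadata:
  name: example-signing-key   # hypothetical name
data:
  tls.crt: <base64>
  tls.key: <base64>
---
# secret shape written after the type is updated
apiVersion: v1
kind: Secret
type: kubernetes.io/tls
metadata:
  name: example-signing-key
data:
  tls.crt: <base64>
  tls.key: <base64>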
User Story:
This story is used to merge a PR in the openshift/origin repository
https://github.com/openshift/origin/pull/28382
Acceptance criteria:
Description of problem:
The agent-based installer does not support the TechPreviewNoUpgrade featureSet, and by extension nor does it support any of the features gated by it. Because of this, there is no warning about one of these features being specified - we expect the TechPreviewNoUpgrade feature gate to error out when any of them are used.
However, we don't warn about TechPreviewNoUpgrade itself being ignored, so if the user does specify it then they can use some of these non-supported features without being warned that their configuration is ignored.
We should fail with an error when TechPreviewNoUpgrade is specified, until such time as AGENT-554 is implemented.
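For reference, this is the kind of install-config fragment that currently slips through silently; a hedged sketch with example values only:

apiVersion: v1
baseDomain: example.com          # example value
metadata:
  name: example-cluster          # example value
featureSet: TechPreviewNoUpgrade # the agent-based installer should error out when this is present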
Description of problem:
In a UPI cluster there are no MachineSet and Machine resources. When a user visits the Machines or MachineSets list page, we show the simple text 'Not found'.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-16-113018
How reproducible:
Always
Steps to Reproduce:
1. Set up a UPI cluster
2. Go to the MachineSets and Machines list pages and check the empty state message
Actual results:
2. We simply show the 'Not found' text
Expected results:
2. For other resources, we show richer text 'No <resourcekind> found', so we should also show 'No Machines found' and 'No MachineSets found' on these pages
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
4.14
How reproducible:
1. oc patch svc <svc> --type merge --patch '{"spec":{"sessionAffinity":"ClientIP"}}'
2. curl <svc>:<port>
3. oc scale --replicas=3 deploy/<deploy>
4. oc scale --replicas=0 deploy/<deploy>
5. oc scale --replicas=3 deploy/<deploy>
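For reference, a minimal Service matching step 1 after the sessionAffinity patch; the name, selector, and port are hypothetical:

apiVersion: v1
kind: Service
metadata:
  name: tcp-svc                 # hypothetical name
spec:
  selector:
    app: tcp                    # hypothetical selector
  ports:
  - port: 8080
    targetPort: 8080
    protocol: TCP
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800     # default affinity timeout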
Actual results:
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 54850
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 46668
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 46682
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60144
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60150
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 60160
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 51720
Expected results:
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46914
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46928
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46944
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40510
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40520
Additional info:
See the hostname in the server log output for each command.
$ oc patch svc <svc> --type merge --patch '{"spec":{"sessionAffinity":"ClientIP"}}'
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46914
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46928
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46944
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40510
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40520
$ oc scale --replicas=1 deploy/<deploy>
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 47082
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 47088
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 54832
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 54848
$ oc scale --replicas=3 deploy/<deploy>
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 54850
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 46668
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 46682
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60144
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60150
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 60160
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 51720
Description of problem:
The problem was that the namespace handler on initial sync would delete all ports (because the logical port cache it got lsp UUIDs from wasn't populated) and all ACLs (they were just set to nil). Even though both ports and ACLs will be re-added by the corresponding handlers, it may cause disruption.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. create a namespace with at least 1 pod and egress firewall in it
2. pick any ovnkube-node pod, find namespace port group UUID in nbdb by external_ids["name"]=<namespace name>, e.g. for "test" namespace
_uuid        : 6142932d-4084-4bc3-bdcb-1990fc71891b
acls         : [ab2be619-1266-41c2-bb1d-1052cb4e1e97, b90a4b4a-ceee-41ee-a801-08c37a9bf3e7, d314fa8d-7b5a-40a5-b3d4-31091d7b9eae]
external_ids : {name=test}
name         : a18007334074686647077
ports        : [55b700e4-8176-42e7-97a6-8b32a82fefe5, cb71739c-ad6c-4436-8fd6-0643a5417c7d, d8644bf1-6bed-4db7-abf8-7aaab0625324]
3. restart chosen ovn-k pod
4. Check the logs on restart for an update that sets the chosen port group to zero ports and zero ACLs
Update operations generated as: [{Op:update Table:Port_Group Row:map[acls:{GoSet:[]} external_ids:{GoMap:map[name:test]} ports:{GoSet:[]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {6142932d-4084-4bc3-bdcb-1990fc71891b}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUID: UUIDName:}]
Actual results:
Expected results:
On restart the port group stays the same; no extra update with empty ports and ACLs is generated
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
Please review the following PR: https://github.com/openshift/prometheus/pull/187
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/271
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
In CNO, we need to change the cni-sysctl-allowlist daemonset to hostNetwork: true so that the multus daemonset and the cni-sysctl-allowlist daemonset upgrade successfully.
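A minimal sketch of the intended change to the daemonset pod spec; only the relevant field is shown, surrounding fields (selector, labels, containers) are omitted and the namespace is an assumption:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cni-sysctl-allowlist-ds   # daemonset referenced above
  namespace: openshift-multus      # assumed namespace
spec:
  template:
    spec:
      hostNetwork: true            # the change described above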
Please review the following PR: https://github.com/openshift/cluster-baremetal-operator/pull/394
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.15.0-rc.4: 701 of 873 done (80% complete), waiting on operator-lifecycle-manager
Upstream: https://api.openshift.com/api/upgrades_info/v1/graph
Channel: candidate-4.15 (available channels: candidate-4.15, candidate-4.16)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                      READY   STATUS             RESTARTS        AGE
catalog-operator-db86b7466-gdp4g          1/1     Running            0               9h
collect-profiles-28443465-9zzbk           0/1     Completed          0               34m
collect-profiles-28443480-kkgtk           0/1     Completed          0               19m
collect-profiles-28443495-shvs7           0/1     Completed          0               4m10s
olm-operator-56cb759d88-q2gr7             0/1     CrashLoopBackOff   8 (3m27s ago)   20m
package-server-manager-7cf46947f6-sgnlk   2/2     Running            0               9h
packageserver-7b795b79f-thxfw             1/1     Running            1               14d
packageserver-7b795b79f-w49jj             1/1     Running            0               4d17h
Version-Release number of selected component (if applicable):
How reproducible:
Unknown
Steps to Reproduce:
Upgrade from 4.15.0-rc.2 to 4.15.0-rc.4
Actual results:
The upgrade is unable to proceed
Expected results:
The upgrade can proceed
Additional info:
Please review the following PR: https://github.com/openshift/must-gather/pull/395
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/machine-api-provider-powervs/pull/70
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
While testing oc adm upgrade status against b02, I noticed some COs do not have any annotations, while I expected them to have the include/exclude.release.openshift.io/* ones (to recognize COs that come from the payload).
$ b02 get clusteroperator etcd -o jsonpath={.metadata.annotations}
$ ota-stage get clusteroperator etcd -o jsonpath={.metadata.annotations}
{"exclude.release.openshift.io/internal-openshift-hosted":"true","include.release.openshift.io/self-managed-high-availability":"true","include.release.openshift.io/single-node-developer":"true"}
CVO only precreates CO resources and does not reconcile them once they exist. Build02 does not have COs with reconciled metadata because it was born as 4.2, which (AFAIK) is before OCP started to use the exclude/include annotations.
4.16 (development branch)
deterministic
1. delete an annotation on a ClusterOperator resource
The annotation won't be recreated
The annotation should be recreated
Jose added a few other rendering tests that utilize different inputs and outputs. make render-sync should be able to prepare them.
Multi-arch compute clusters have an issue where the cluster version's image ref is single arch, so this change resolves the image ref without spinning up a pod.
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/102
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The node-network-identity deployment should conform to hypershift control plane expectations that all applicable containers should have a liveness probe, and a readiness probe if it is an endpoint for a service.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
No liveness or readiness probes
Expected results:
Additional info:
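A minimal sketch of the kind of probes expected on the container; the container name, endpoint paths, and port are assumptions:

containers:
- name: node-network-identity      # container name assumed
  livenessProbe:
    httpGet:
      path: /healthz               # assumed endpoint
      port: 9743                   # assumed port
    initialDelaySeconds: 10
    periodSeconds: 10
  readinessProbe:
    httpGet:
      path: /readyz                # assumed endpoint
      port: 9743
    periodSeconds: 5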
Description of problem:
Currently the console frontend and backend use the OpenShift-centric UserKind type. In order for the console to work without the OAuth server, i.e. with an external OIDC provider, it needs to use the k8s UserInfo type, which is retrieved by querying the SelfSubjectReview API.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Console is not working with external OIDC provider
Expected results:
Console will be working with external OIDC provider
Additional info:
This is mainly an API change.
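For context, a hedged sketch of the SelfSubjectReview resource involved; it is created with an empty body (e.g. via oc create) and the response carries the authenticated user's info, with illustrative values shown here:

apiVersion: authentication.k8s.io/v1
kind: SelfSubjectReview
# response status populated by the API server:
status:
  userInfo:
    username: alice@example.com    # illustrative value
    groups:
    - system:authenticated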
Description of problem:
The sdn image inherits from the cli image to get the oc binary. Change this to install the openshift-clients rpm instead.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
Description of problem:
Following signing-key deletion, there is a service CA rotation process which might temporarily disrupt cluster operators, but eventually all should regenerate. In recent 4.14 nightlies, however, this is no longer the case. Following a deletion of the signing key using oc delete secret/signing-key -n openshift-service-ca, operators will progress for a while, but eventually console as well as monitoring end up Available=False and Degraded=True, which is only recoverable by manually deleting all the pods in the cluster.
console 4.14.0-0.nightly-2023-06-30-131338 False False True 159m RouteHealthAvailable: route not yet available, https://console-openshift-console.apps.evakhoni-0412.qe.gcp.devcluster.openshift.com returns '503 Service Unavailable'
monitoring 4.14.0-0.nightly-2023-06-30-131338 False True True 161m reconciling Console Plugin failed: retrieving ConsolePlugin object failed: conversion webhook for console.openshift.io/v1alpha1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority
The same deletion in previous versions (4.14-ec.2 or earlier) doesn't have this issue and is able to recover eventually without any manual pod deletion. I believe this to be a regression bug.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-30-131338 and other recent 4.14 nightlies
How reproducible:
100%
Steps to Reproduce:
1. oc delete secret/signing-key -n openshift-service-ca
2. Wait at least 30+ minutes
3. Observe oc get co
Actual results:
console and monitoring degraded and not recovering
Expected results:
able to recover eventually as in previous versions
Additional info:
Using manual deletion of all pods it is possible to recover the cluster from this state as follows:
for I in $(oc get ns -o jsonpath='{range .items[*]} {.metadata.name}{"\n"} {end}'); \
  do oc delete pods --all -n $I; \
  sleep 1; \
done
must-gather:
https://drive.google.com/file/d/1Y3RrYZlz0EncG-Iqt8USFPsTd-br36Zt/view?usp=sharing
Description of problem:
When specifying a control plane operator image for dev purposes, the control plane operator pod fails to come up with an InvalidImageRef status.
Version-Release number of selected component (if applicable):
Mgmt cluster is 4.15, HyperShift control plane operator is latest from main
How reproducible:
Always
Steps to Reproduce:
1. Create a hosted cluster with an annotation to override control plane operator image and point it to a non-digest image ref.
Actual results:
The cluster fails to come up with the CPO pod failing with InvalidImageRef
Expected results:
The cluster comes up fine.
Additional info:
Description of problem:
gatewayConfig.ipForwarding accepts arbitrary invalid values, but it should enforce "", "Restricted", or "Global".
You can currently even do really funky stuff with that:
oc edit network.operator/cluster
(...)
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  - cidr: fd01::/48
    hostPrefix: 64
  defaultNetwork:
    ovnKubernetesConfig:
      egressIPConfig: {}
      gatewayConfig:
        ipForwarding: $(echo 'Im injected'; lscpu)
$ oc get pods -n openshift-ovn-kubernetes ovnkube-node-24628 -o yaml | grep sysctl -C5
  fi
  # If IP Forwarding mode is global set it in the host here.
  ip_forwarding_flag=
  if [ "$(echo 'Im injected'; lscpu)" == "Global" ]; then
    sysctl -w net.ipv4.ip_forward=1
    sysctl -w net.ipv6.conf.all.forwarding=1
  else
    ip_forwarding_flag="--disable-forwarding"
  fi
  NETWORK_NODE_IDENTITY_ENABLE=
$ oc logs -n openshift-ovn-kubernetes ovnkube-node-24628 -c ovnkube-controller | grep inje -A5
++ echo 'Im injected'
++ lscpu
+ '[' 'Im injected
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Address sizes:       46 bits physical, 57 bits virtual
Byte Order:          Little Endian
CPU(s):              112
I wouldn't consider this a security issue, because I have to be the admin to do that, and as the admin I can also simply modify the pod, but it's not very elegant to allow for some sort of code injection, even by the admin
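For reference, only the documented values should validate; a minimal valid fragment of the operator config would look like:

spec:
  defaultNetwork:
    ovnKubernetesConfig:
      gatewayConfig:
        ipForwarding: Global    # or Restricted, or omitted/empty ("")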
Description of problem:
Usernames can contain all kinds of characters that are not allowed in resource names. Hash the name instead and use hex representation of the result to get a usable identifier.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. log in to the web console configured with a login to a 3rd party OIDC provider 2. go to the User Preferences page / check the logs in the javascript console
Actual results:
The User Preferences page shows empty values instead of defaults. The javascript console reports things like:
consoleFetch failed for url /api/kubernetes/api/v1/namespaces/openshift-console-user-settings/configmaps/user-settings-kubeadmin r: configmaps "user-settings-kubeadmin" not found
Expected results:
I am able to persist my user preferences.
Additional info:
Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/45
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Running the command coreos-installer iso kargs show no longer works with the 4.13 Agent ISO. Instead we get this error:
$ coreos-installer iso kargs show agent.x86_64.iso
Writing manifest to image destination
Storing signatures
Error: No karg embed areas found; old or corrupted CoreOS ISO image.
This is almost certainly due to the way we repack the ISO as part of embedding the agent-tui binary in it.
It worked fine in 4.12. I have tested both with every version of coreos-installer from 0.14 to 0.17
UDP packets are subject to SNAT in a self-managed OCP 4.13.13 cluster on Azure (OVN-K as CNI) using a Load Balancer Service with `externalTrafficPolicy: Local`. UDP packets correctly arrive at the Node hosting the Pod, but the source IP seen by the Pod is the OVN GW Router of the Node.
I've reproduced the customer scenario with the following steps:
This issue is very critical because it is blocking the customer's business.
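For reference, a minimal Service of the kind described above; the name, selector, and port are hypothetical:

apiVersion: v1
kind: Service
metadata:
  name: udp-lb                   # hypothetical name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # client source IP is expected to be preserved
  selector:
    app: udp-server              # hypothetical selector
  ports:
  - port: 5000
    targetPort: 5000
    protocol: UDP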
Description of problem:
Following signing-key deletion, there is a service CA rotation process which might temporarily disrupt platform components, but eventually all should use the updated certificates.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-30-131338 and other recent 4.14 nightlies
How reproducible:
100%
Steps to Reproduce:
1. oc delete secret/signing-key -n openshift-service-ca
2. Reload the management console
Actual results:
The Observe tab disappears from the menu bar and the monitoring-plugin shows as unavailable.
Expected results:
No disruption
Additional info:
Manually deleting the monitoring-plugin pods makes it possible to recover the situation
Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-29441.
Please review the following PR: https://github.com/openshift/baremetal-runtimecfg/pull/291
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
For certain operators, warning info is shown on the operator detail modal and the installation page when it's an Azure WI/FI cluster. The warning info titles are not consistent across these two pages.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-21-155123
How reproducible:
Always
Steps to Reproduce:
1. Prepare a special operator which would show warning info on an Azure WI/FI cluster.
2. Log in to the console of the Azure WI/FI cluster and check the warning info title on the operator detail item modal and the installation page.
Actual results:
2. On the operator detail item modal, the warning title is "Cluster in Azure Workload Identity / Federated Identity Mode", while on the installation page the warning title is "Cluster in Workload Identity / Federated Identity Mode". The word "Azure" is missing on the second page.
Expected results:
2. The warning title should be consistent.
Additional info:
screenshot: https://drive.google.com/drive/folders/1alFBEtO1gN4q5_mAtHCNzuLTOe5zXp0K?usp=drive_link
Component Readiness has found a potential regression in [sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel].
Probability of significant regression: 98.46%
Sample (being evaluated) Release: 4.15
Start Time: 2023-12-29T00:00:00Z
End Time: 2024-01-04T23:59:59Z
Success Rate: 83.33%
Successes: 15
Failures: 3
Flakes: 0
Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 98.36%
Successes: 120
Failures: 2
Flakes: 0
Description of problem:
oc-mirror with v2 will create the idms file as output, but the source looks like:
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  creationTimestamp: null
  name: idms-2024-01-08t04-19-04z
spec:
  imageDigestMirrors:
  - mirrors:
    - ec2-3-144-29-184.us-east-2.compute.amazonaws.com:5000/ocp2/openshift
    source: localhost:55000/openshift
  - mirrors:
    - ec2-3-144-29-184.us-east-2.compute.amazonaws.com:5000/ocp2/openshift-release-dev
    source: quay.io/openshift-release-dev
status: {}
The source should always be the origin registry, like quay.io/openshift-release-dev.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Run the command with v2:
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  platform:
    channels:
    - name: stable-4.14
      minVersion: 4.14.3
      maxVersion: 4.14.3
    graph: true
`oc-mirror --config config.yaml file://out --v2`
`oc-mirror --config config.yaml --from file://out --v2 docker://xxxx:5000/ocp2`
2. Check the idms file
Actual results:
2. cat idms-2024-01-08t04-19-04z.yaml
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  creationTimestamp: null
  name: idms-2024-01-08t04-19-04z
spec:
  imageDigestMirrors:
  - mirrors:
    - xxxx.com:5000/ocp2/openshift
    source: localhost:55000/openshift
  - mirrors:
    - xxxx.com:5000/ocp2/openshift-release-dev
    source: quay.io/openshift-release-dev
Expected results:
The source should not be localhost:55000; it should be the origin registry.
Additional info:
Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/303
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-driver-nfs/pull/136
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
'Oh no! Something went wrong' is shown on the Image Manifest Vulnerability page after creating an IMV via the CLI
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-02-03-192446
How reproducible:
Always
Steps to Reproduce:
1. Install the 'Red Hat Quay Container Security Operator'
2. Use the command line to create the IMV:
$ oc create -f imv.yaml
imagemanifestvuln.secscan.quay.redhat.com/example created
$ cat IMV.yaml
apiVersion: secscan.quay.redhat.com/v1alpha1
kind: ImageManifestVuln
metadata:
  name: example
  namespace: openshift-operators
spec: {}
3. Navigate to page /k8s/ns/openshift-operators/operators.coreos.com~v1alpha1~ClusterServiceVersion/container-security-operator.v3.10.3/secscan.quay.redhat.com~v1alpha1~ImageManifestVuln
Actual results:
Oh no! Something went wrong. will be shown
Description: Cannot read properties of undefined (reading 'replace')
Component trace:
at T (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/container-security-chunk-c75b48f176a6a5981ee2.min.js:1:3465)
at https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:631947
at tr
at x (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:630876)
at t (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendors~main-chunk-e839d85039c974dbb9bb.min.js:82:73479)
at tbody
at table
at g (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-69dd6dd4312fdf07fedf.min.js:6:199268)
at l (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-69dd6dd4312fdf07fedf.min.js:10:88631)
at D (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:632038)
at div
at div
at t (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendors~main-chunk-e839d85039c974dbb9bb.min.js:50:39294)
at t (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendors~main-chunk-e839d85039c974dbb9bb.min.js:49:16122)
at o (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:642088)
at div
at M (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:631697)
at div
Expected results:
No error should be shown.
Additional info:
Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/85
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
We would like to include the CEL IP and CIDR validations in 4.16. They have been merged upstream and can be backported into OpenShift to improve our validation downstream. Upstream PR: https://github.com/kubernetes/kubernetes/pull/121912
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
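For illustration, a minimal sketch of where the backported validations would be used. The type and field names below are hypothetical, and it assumes the upstream library exposes the isIP()/isCIDR() CEL helpers described in the linked PR:
```
package v1alpha1

// ExampleNetworkSpec is a hypothetical API type, shown only to illustrate how
// the backported CEL IP/CIDR library could be used in API validation rules
// (assuming the upstream PR lands the isIP()/isCIDR() helpers as named here).
type ExampleNetworkSpec struct {
	// address must parse as an IP address.
	// +kubebuilder:validation:XValidation:rule="isIP(self)",message="address must be a valid IP"
	Address string `json:"address"`

	// machineNetwork must parse as a CIDR block.
	// +kubebuilder:validation:XValidation:rule="isCIDR(self)",message="machineNetwork must be a valid CIDR"
	MachineNetwork string `json:"machineNetwork"`
}
```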
Add documentation on how to debug Azure nodes
Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/39
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
On February 27th, endpoints that were being queried for account details were turned off. The check is not vital, so we are fine with removing it; however, it is currently blocking all Power VS installs.
Version-Release number of selected component (if applicable):
4.13.0 - 4.16.0
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy with Power VS
2. Fail at the platform credentials check
Actual results:
Check fails
Expected results:
Check should succeed
Additional info:
Please review the following PR: https://github.com/openshift/telemeter/pull/497
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/operator-framework-catalogd/pull/36
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/65
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Tested https://github.com/openshift/cluster-monitoring-operator/pull/2187 by launching a cluster with the PR:
launch 4.15,openshift/cluster-monitoring-operator#2187 aws
The "scrape.timestamp-tolerance" setting is not found in the Prometheus resource or the prometheus pod; there is no result for the commands below:
$ oc -n openshift-monitoring get prometheus k8s -oyaml | grep -i "scrape.timestamp-tolerance"
$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep -i "scrape.timestamp-tolerance"
$ oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep -i "scrape.timestamp-tolerance"
It is not in the Prometheus configuration file either:
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | head
global:
  evaluation_interval: 30s
  scrape_interval: 30s
  external_labels:
    prometheus: openshift-monitoring/k8s
    prometheus_replica: prometheus-k8s-0
rule_files:
- /etc/prometheus/rules/prometheus-k8s-rulefiles-0/*.yaml
scrape_configs:
- job_name: serviceMonitor/openshift-apiserver-operator/openshift-apiserver-operator/0
Description of problem:
The Repositories list page breaks with a TypeError: cannot read properties of undefined (reading `pipelinesascode.tekton.dev/repository`).
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
https://drive.google.com/file/d/1TpH_PTyBxNX0b9SPZ2yS8b-q-tbvp6Ok/view?usp=sharing
The CVO-managed manifests that CMO ships lack the capability annotations defined in https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#manifest-annotations.
The dashboards should be tied to the console capability so that when CMO deploys on a cluster without the Console capability, CVO doesn't deploy the dashboards configmap.
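For illustration only, a hedged sketch of the two pieces involved: the capability annotation a CVO-managed manifest would carry, and how a component could ask whether the Console capability is enabled. This is not CMO's actual code; the client wiring is assumed:
```
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
)

// The annotation a CVO-managed manifest carries to be tied to a capability,
// per the component-selection enhancement; shown here as plain data.
var dashboardsConfigMapAnnotations = map[string]string{
	"capability.openshift.io/name": "Console",
}

// consoleCapabilityEnabled is a hedged sketch (not CMO's actual code) of
// checking the ClusterVersion status for the Console capability.
func consoleCapabilityEnabled(ctx context.Context, client configclient.Interface) (bool, error) {
	cv, err := client.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, enabled := range cv.Status.Capabilities.EnabledCapabilities {
		if enabled == configv1.ClusterVersionCapabilityConsole {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	fmt.Println(dashboardsConfigMapAnnotations)
	_ = consoleCapabilityEnabled // referenced so the sketch compiles standalone
}
```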
Description of problem:
When the user clicks 'Cancel' on any Secret creation page, it doesn't return to the Secrets list page.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-06-062415
How reproducible:
Always
Steps to Reproduce:
1. Go to the Create Key/value secret | Image pull secret | Source secret | Webhook secret | From YAML page, e.g. /k8s/ns/default/secrets/~new/generic
2. Click the Cancel button
3.
Actual results:
The page does not go back to the Secrets list page, e.g. /k8s/ns/default/core~v1~Secret
Expected results:
The page should go back to the Secrets list page
Additional info:
Description of the problem:
Prepare the cluster for installation and add an applied configuration via the API.
When we try to install the cluster it goes back to Ready as expected, but if we fix the configuration and try to install again it ALWAYS fails on the first attempt (not related to timing).
Only the second install attempt works, without changing the configuration.
How reproducible:
Always
Steps to reproduce:
1.Prepare cluster for installation
2.Create invalid applied configuration to verify that preparing-for-installation returns to ready state as expected.
invalid_override_config = {
"capabilities":
}
3. Start installation; it goes back to Ready as expected.
4. Fix the applied configuration and try to install again -> fails
Actual results:
On the first attempt after the config change, installation fails.
It works only on the second try.
Expected results:
Description of the problem:
When trying to install a 4.15 cluster with LVMS, CNV, and the MCE operator,
on the operators page I am unable to continue with the installation,
since host discovery indicates that the hosts require more resources (as if ODF had also been selected).
How reproducible:
Steps to reproduce:
1. Create a multi-node 4.15 cluster
2. Make sure to have enough resources for the LVMS, CNV, and MCE operators
3. Select the CNV, LVMS, and MCE operators
Actual results:
On the operators page it shows that CPU and RAM resources related to ODF (which is not selected) are also required, and the user is unable to start the installation.
Expected results:
Should be able to start installation
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/100
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Hypershift management clusters are using a network load balancer (NLB) to route to their own openshift-ingress router pods for cluster ingress. These NLBs are provisioned by https://github.com/openshift/cloud-provider-aws. The cloud-provider-aws uses the cluster tag on the subnets to select the correct subnets for the NLB and the SecurityGroup adjustments. On management clusters *all* subnets are tagged with the MC's cluster ID. This can lead to the cloud-provider-aws possibly selecting the incorrect subnet, because ties between multiple subnets in an AZ are broken using lexicographical comparison: https://github.com/openshift/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L3626
This can lead to a situation where a SecurityGroup will only allow ingress from a subnet that is not actually part of the NLB; in this case the TargetGroup will not be able to correctly perform a health check in that AZ. In certain cases this can lead to all targets reporting unhealthy, as the nodes hosting the ingress pods have the incorrect SecurityGroup rules. In that case, routing to nodes that are part of the target group can select nodes that should not be chosen as they are not ready yet/anymore, leading to problems when attempting to access management cluster services (e.g. the console).
Version-Release number of selected component (if applicable):
4.14.z & 4.15.z
How reproducible:
Most MCs that are using NLBs will have some of the SecurityGroups misconfigured.
Steps to Reproduce:
1. Have the cloud-provider-aws update the NLB while there are subnets in an AZ with lexicographically smaller names than the MC's default subnets - this can lead to the other subnets being chosen instead.
2. Check the SecurityGroups to see if the source CIDRs are incorrect.
Actual results:
SecurityGroups can have incorrect source CIDRs used for the MC's own NLB.
Expected results:
The MC should only tag its own subnets with the cluster ID of the MC, so subnet selection by the cloud-provider-aws is not affected by the HCP subnets in the same availability zones.
Additional info:
Related OHSS ticket from SREP: https://issues.redhat.com/browse/OSD-20289
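To make the tie-breaking described above concrete, a simplified, hypothetical Go sketch (not the cloud provider's actual code) of how lexicographical ordering lets a foreign HCP subnet win over the MC's default subnet in the same AZ:
```
package main

import (
	"fmt"
	"sort"
)

// pickSubnetPerAZ is a simplified illustration of the described behaviour:
// when several tagged subnets exist in the same availability zone, the tie
// is effectively broken by lexicographical comparison, so a subnet can win
// purely because of its identifier.
func pickSubnetPerAZ(subnetsByAZ map[string][]string) map[string]string {
	chosen := map[string]string{}
	for az, subnets := range subnetsByAZ {
		sort.Strings(subnets) // lexicographical order decides the "winner"
		chosen[az] = subnets[0]
	}
	return chosen
}

func main() {
	// Both subnets carry the MC's cluster tag; the HCP subnet sorts first.
	fmt.Println(pickSubnetPerAZ(map[string][]string{
		"us-east-1a": {"subnet-0aaa-hcp", "subnet-0bbb-mc-default"},
	}))
}
```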
Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/218
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-operator/pull/115
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
For certain operations the CEO will check the etcd member health by creating a client directly and waiting for its status report. Under a situation of any member not being reachable for a longer period, we found the CEO was constantly getting stuck / deadlocked and couldn't move certain controllers forward. In OCPBUGS-12475 we introduced a health-check that would dump stack and automatically restart with the operator deployment health probe. In a more recent upgrade run we could find the culprit [1] to be a missing context during client initialization to etcd, making it stuck infinitely:
W0229 02:55:46.820529 1 aliveness_checker.go:33] Controller [EtcdEndpointsController] didn't sync for a long time, declaring unhealthy and dumping stack
goroutine 1426 [select]:
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth({0x3272768?, 0xc002090310}, {0xc0000a6880, 0x3, 0xc001c98360?})
  github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:64 +0x330
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.(*etcdClientGetter).MemberHealth(0xc000c24540, {0x3272688, 0x4c20080})
  github.com/openshift/cluster-etcd-operator/pkg/etcdcli/etcdcli.go:412 +0x18c
github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers.CheckSafeToScaleCluster({0x324ccd0?, 0xc000b6d5f0?}, {0x3284250?, 0xc0008dda10?}, {0x324e6c0, 0xc000ed4fb0}, {0x3250560, 0xc000ed4fd0}, {0x32908d0, 0xc000c24540})
  github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers/bootstrap.go:149 +0x28e
github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers.(*QuorumCheck).IsSafeToUpdateRevision(0x2893020?)
  github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers/qourum_check.go:37 +0x46
github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller.(*EtcdEndpointsController).syncConfigMap(0xc0002e28c0, {0x32726f8, 0xc0008e60a0}, {0x32801b0, 0xc001198540})
  github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go:146 +0x5d8
github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller.(*EtcdEndpointsController).sync(0xc0002e28c0, {0x32726f8, 0xc0008e60a0}, {0x325d240, 0xc003569e90})
  github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go:66 +0x71
github.com/openshift/cluster-etcd-operator/pkg/operator/health.(*CheckingSyncWrapper).Sync(0xc000f21bc0, {0x32726f8?, 0xc0008e60a0?}, {0x325d240?, 0xc003569e90?})
  github.com/openshift/cluster-etcd-operator/pkg/operator/health/checking_sync_wrapper.go:22 +0x43
github.com/openshift/library-go/pkg/controller/factory.(*baseController).reconcile(0xc00113cd80, {0x32726f8, 0xc0008e60a0}, {0x325d240?, 0xc003569e90?})
  github.com/openshift/library-go@v0.0.0-20240124134907-4dfbf6bc7b11/pkg/controller/factory/base_controller.go:201 +0x43
goroutine 11640 [select]:
google.golang.org/grpc.(*ClientConn).WaitForStateChange(0xc003707000, {0x3272768, 0xc002091260}, 0x3)
  google.golang.org/grpc@v1.58.3/clientconn.go:724 +0xb1
google.golang.org/grpc.DialContext({0x3272768, 0xc002091260}, {0xc003753740, 0x3c}, {0xc00355a880, 0x7, 0xc0023aa360?})
  google.golang.org/grpc@v1.58.3/clientconn.go:295 +0x128e
go.etcd.io/etcd/client/v3.(*Client).dial(0xc000895180, {0x32754a0?, 0xc001785670?}, {0xc0017856b0?, 0x28f6a80?, 0x28?})
  go.etcd.io/etcd/client/v3@v3.5.10/client.go:303 +0x407
go.etcd.io/etcd/client/v3.(*Client).dialWithBalancer(0xc000895180, {0x0, 0x0, 0x0})
  go.etcd.io/etcd/client/v3@v3.5.10/client.go:281 +0x1a9
go.etcd.io/etcd/client/v3.newClient(0xc002484e70?)
  go.etcd.io/etcd/client/v3@v3.5.10/client.go:414 +0x91c
go.etcd.io/etcd/client/v3.New(...)
  go.etcd.io/etcd/client/v3@v3.5.10/client.go:81
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.newEtcdClientWithClientOpts({0xc0017853d0, 0x1, 0x1}, 0x0, {0x0, 0x0, 0x0?})
  github.com/openshift/cluster-etcd-operator/pkg/etcdcli/etcdcli.go:127 +0x77d
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.checkSingleMemberHealth({0x32726f8, 0xc00318ac30}, 0xc002090460)
  github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:103 +0xc5
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1()
  github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0x6c
created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth in goroutine 1426
  github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2a5
[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/
Version-Release number of selected component (if applicable):
any currently supported OCP version
How reproducible:
Always
Steps to Reproduce:
1. Create a healthy cluster
2. Make sure one etcd member never responds, but the node is still there (i.e. kubelet shut down, or the etcd ports blocked on a firewall)
3. Wait for the CEO to restart the pod on the failing health probe and dump its stack (similar to the one above)
Actual results:
CEO controllers are getting deadlocked, but the operator will restart eventually after some time due to health probes failing
Expected results:
CEO should mark the member as unhealthy and continue its service without getting deadlocked and should not restart its pod by failing the health probe
Additional info:
clientv3.New doesn't take any timeout context, but tries to establish a connection forever https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/etcdcli/etcdcli.go#L127-L130 There's a way to pass the "default context" via the client config, which is slightly misleading.
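A minimal sketch of the kind of fix hinted at above, assuming a short-lived client such as the per-member health check; this is not the CEO's actual helper:
```
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// newBoundedEtcdClient is a minimal sketch of giving the etcd client
// construction a deadline: DialTimeout bounds the initial dial, and the
// Context stored in the config cancels the client (including a blocking
// dial) once the deadline passes, so an unreachable member produces an
// error instead of an infinite wait.
func newBoundedEtcdClient(endpoints []string) (*clientv3.Client, context.CancelFunc, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
		Context:     ctx, // also cancels all client operations when it expires
	})
	if err != nil {
		cancel()
		return nil, nil, err
	}
	return cli, cancel, nil
}

func main() {
	// Intended for a short-lived client, e.g. a single member health check.
	cli, cancel, err := newBoundedEtcdClient([]string{"https://10.0.0.1:2379"})
	if err != nil {
		log.Fatalf("creating etcd client: %v", err)
	}
	defer cancel()
	defer cli.Close()
	log.Println("client created")
}
```
Note that because the config context also governs the whole client lifetime, this pattern fits per-check clients; a long-lived client would need its lifetime managed differently.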
: [bz-Routing] clusteroperator/ingress should not change |
Has been failing for over a month in the e2e-metal-ipi-sdn-bm-upgrade jobs
I think this is because there are only two worker nodes in the BM environment and some HA services lose redundancy when one of the workers is rebooted.
In the medium term I hope to add another node to each cluster, but in the short term we should skip the test.
https://github.com/openshift/release/pull/48835
The scatter chart is very slow to load for the last week; while there are a few hits here and there over the last y days, it looks like this got a lot more common yesterday around noon and has continued ever since.
Suspicious PR: https://github.com/openshift/origin/pull/28587
Please review the following PR: https://github.com/openshift/cluster-api-provider-alibaba/pull/48
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Seen in this 4.15 to 4.16 CI run:
: [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers 0s { event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 26 times event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 51 times}
The operator recovered, and the update completed, but it's still probably worth cleaning up whatever's happening to avoid alarming anyone.
Seems like all recent CI runs that match this string touch 4.15, 4.16, or development branches:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=Back-off+restarting+failed+container+cluster-baremetal-operator+in+pod+cluster-baremetal-operator' | grep 'failures match'
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway (all) - 11 runs, 36% failed, 25% of failures match = 9% impact
periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 15 runs, 20% failed, 33% of failures match = 7% impact
pull-ci-openshift-kubernetes-master-e2e-aws-ovn-downgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 15 runs, 27% failed, 25% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade (all) - 32 runs, 91% failed, 7% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 40 runs, 25% failed, 20% of failures match = 5% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade (all) - 40 runs, 8% failed, 33% of failures match = 3% impact
pull-ci-openshift-azure-file-csi-driver-operator-main-e2e-azure-ovn-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade (all) - 10 runs, 30% failed, 33% of failures match = 10% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-gcp-ovn-arm64 (all) - 6 runs, 33% failed, 50% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 11 runs, 18% failed, 50% of failures match = 9% impact
Looks like ~8% impact.
Steps to Reproduce:
1. Run ~20 exposed job types.
2. Check for ": [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers" failures with "Back-off restarting failed container cluster-baremetal-operator" messages.
Actual results:
~8% impact.
Expected results:
~0% impact.
Additional info:
Dropping into Loki for the run I'd picked:
{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1737335551998038016"} | unpack | pod="cluster-baremetal-operator-574577fbcb-z8nd4" container="cluster-baremetal-operator" |~ "220 06:0"
includes:
E1220 06:04:18.794548 1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning"
I1220 06:05:40.753364 1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
I1220 06:05:40.766200 1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
I1220 06:05:40.780426 1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
E1220 06:05:40.795555 1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning"
I1220 06:08:21.730591 1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
I1220 06:08:21.747466 1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
I1220 06:08:21.768138 1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
E1220 06:08:21.781058 1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning"
So some kind of ClusterOperator-modification race?
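If it is such a race, the conventional remedy is to re-read the object and retry the status update on conflict rather than fail the controller. A minimal, hypothetical sketch (not the cluster-baremetal-operator's actual code):
```
package operatorhelpers

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
)

// setCondition re-reads the ClusterOperator and retries the status update on
// conflict, so "the object has been modified" responses do not bubble up as
// controller failures.
func setCondition(ctx context.Context, client configclient.Interface, name string, cond configv1.ClusterOperatorStatusCondition) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		co, err := client.ConfigV1().ClusterOperators().Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// Simplified: a real helper would replace an existing condition of the
		// same type rather than blindly appending.
		co.Status.Conditions = append(co.Status.Conditions, cond)
		_, err = client.ConfigV1().ClusterOperators().UpdateStatus(ctx, co, metav1.UpdateOptions{})
		return err
	})
}
```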
Problem Description:
Installed the Red Hat Quay Container Security Operator on the 4.13.25 cluster.
Below are my test results:
```
[sasakshi@sasakshi ~]$ oc version
Client Version: 4.12.7
Kustomize Version: v4.5.7
Server Version: 4.13.25
Kubernetes Version: v1.26.9+aa37255
[sasakshi@sasakshi ~]$ oc get csv -A | grep -i "quay" | tail -1
openshift container-security-operator.v3.10.2 Red Hat Quay Container Security Operator 3.10.2 container-security-operator.v3.10.1 Succeeded
[sasakshi@sasakshi ~]$ oc get subs -A
NAMESPACE NAME PACKAGE SOURCE CHANNEL
openshift-operators container-security-operator container-security-operator redhat-operators stable-3.10
[sasakshi@sasakshi ~]$ oc get imagemanifestvuln -A | wc -l
82
[sasakshi@sasakshi ~]$ oc get vuln --all-namespaces | wc -l
82
Console -> Administration -> Image Vulnerabilities: 82
Home -> Overview -> Status -> Image Vulnerabilities: 66
```
Observations from my testing:
Kindly refer to the attached screenshots for reference.
Documentation link referred:
Please review the following PR: https://github.com/openshift/cloud-provider-gcp/pull/52
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
When trying to create cluster with s390x architecture, an error occurs that stops cluster creation. The error is "cannot use Skip MCO reboot because it's not compatible with the s390x architecture on version 4.15.0-ec.3 of OpenShift"
How reproducible:
Always
Steps to reproduce:
Create cluster with architecture s390x
Actual results:
Create failed
Expected results:
Create should succeed
Please review the following PR: https://github.com/openshift/images/pull/158
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/66
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/65
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/sdn/pull/600
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When running the 4.15 installer full function test, the following three instance families were detected and verified; they need to be appended to the installer doc [1]:
- standardHBv4Family
- standardMSMediumMemoryv3Family
- standardMDSMediumMemoryv3Family
[1] https://github.com/openshift/installer/blob/master/docs/user/azure/tested_instance_types_x86_64.md
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
OVN br-int OVS flows do not get updated on other nodes when a node's bond MAC address is changed to the other slave interface after a reboot. This causes network traffic destined for the SDN of one node to get dropped when it hits the node that changed MAC addresses on its bond interface.
Version-Release number of selected component (if applicable): 4.12+
How reproducible: 100% of the time after rebooting if the MAC changes; the MAC does not always change.
Steps to Reproduce:
1. Capture the bond0 MAC before reboot
2. Reboot the host
3. Confirm the MAC change
4. oc run --rm -it test-pod-sdn --image=registry.redhat.io/openshift4/network-tools-rhel8 --overrides='{"spec": {"tolerations": [{"operator": "Exists"}],"nodeSelector":{"kubernetes.io/hostname":"nodeb-not-rebooted"}}}' /bin/bash
5. Ping the rebooted node
Actual results:
The ping hits the rebooted node but is dropped because the MAC address belongs to the other slave interface and not the one the bond is using.
Expected results:
OVS flows should update on all nodes after a reboot if the MAC changes.
Additional info:
If we restart NetworkManager a couple of times, this triggers the OVS flows to get updated; not sure why. Possible workarounds:
- https://access.redhat.com/solutions/6972925
- Statically set the MAC of bond0 to one of the slave interfaces.
Please review the following PR: https://github.com/openshift/csi-driver-manila-operator/pull/227
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
RHOCP installation on RHOSP fails with an error
~~~
$ ansible-playbook -i inventory.yaml security-groups.yaml
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Incompatible openstacksdk library found: Version MUST be >=1.0 and <=None, but 0.36.5 is smaller than minimum version 1.0."}
~~~
Packages Installed :
ansible-2.9.27-1.el8ae.noarch Fri Oct 13 06:56:05 2023
python3-netaddr-0.7.19-8.el8.noarch Fri Oct 13 06:55:44 2023
python3-openstackclient-4.0.2-2.20230404115110.54bf2c0.el8ost.noarch Tue Nov 21 01:38:32 2023
python3-openstacksdk-0.36.5-2.20220111021051.feda828.el8ost.noarch Fri Oct 13 06:55:52 2023
Document followed :
https://docs.openshift.com/container-platform/4.13/installing/installing_openstack/installing-openstack-user.html#installation-osp-downloading-modules_installing-openstack-user
Description of problem:
"Oh no! Something went wrong" shown in Topology -> Observe tab
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-14-115151
How reproducible:
Always
Steps to Reproduce:
1. Navigate to Topology -> click one deployment and go to the Observe tab 2. 3.
Actual results:
The page crashed.
Error Description:
Component trace:
at te (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-plugins-shared~main-chunk-b3bd2b20c770a4e73b50.min.js:31:9773)
at j (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-plugins-shared~main-chunk-b3bd2b20c770a4e73b50.min.js:12:3324)
at div
at s (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:60:70124)
at div
at g (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:6:11163)
at div
at d (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:1:174472)
at t.a (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/dev-console/code-refs/topology-chunk-769d28af48dd4b29136f.min.js:1:487478)
at t.a (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/dev-console/code-refs/topology-chunk-769d28af48dd4b29136f.min.js:1:486390)
at div
at l (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:60:106304)
at div
Expected results: {code:none} not crush
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
We've moved to using BigQuery for this; stop pushing and free up some Loki cycles.
This is done with a command in origin: https://github.com/openshift/origin/blob/24e011ba3adf2767b88619351895bb878de3d62a/pkg/cmd/openshift-tests/dev/dev.go#L211
So all this code and probably some libraries could be removed.
But first, remove the invocation of this command (upload-intervals) in the release repo.
Sometimes user manifests could be the source of problems, and right now they're not included in the logs archive downloaded for an Assisted cluster from the UI.
Currently we can only see the feature usage "Custom manifest" in metadata.json, but that only tells us the user has custom manifests, not which manifests they are or what their values are.
Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/2027
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
ovnkube-node doesn't issue a CSR to get new certificates when node is suspended for 30 days
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Setup a libvirt cluster on a machine
2. Disable chronyd on all nodes and the host machine
3. Suspend nodes
4. Change the time on the host 30 days forward
5. Resume nodes
6. Wait for the API server to come up
7. Wait for all operators to become ready
Actual results:
ovnkube-node would attempt to use expired certs: 2024-01-21T01:24:41.576365431+00:00 stderr F I0121 01:24:41.573615 8852 master.go:740] Adding or Updating Node "test-infra-cluster-4832ebf8-master-0" 2024-04-20T01:25:08.519622252+00:00 stderr F I0420 01:25:08.516550 8852 services_controller.go:567] Deleting service openshift-operator-lifecycle-manager/packageserver-service 2024-04-20T01:25:08.900228370+00:00 stderr F I0420 01:25:08.898580 8852 services_controller.go:567] Deleting service openshift-operator-lifecycle-manager/packageserver-service 2024-04-20T01:25:17.137956433+00:00 stderr F I0420 01:25:17.137891 8852 obj_retry.go:296] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp 2024-04-20T01:25:17.137956433+00:00 stderr F I0420 01:25:17.137933 8852 obj_retry.go:358] Adding new object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp 2024-04-20T01:25:17.137997952+00:00 stderr F I0420 01:25:17.137979 8852 obj_retry.go:370] Retry add failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp, will try again later: failed to obtain IPs to add remote pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: suppressed error logged: pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: no pod IPs found 2024-04-20T01:25:19.099635059+00:00 stderr F I0420 01:25:19.099057 8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-master-1 2024-04-20T01:25:19.099635059+00:00 stderr F I0420 01:25:19.099080 8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-master-1: 35.077µs 2024-04-20T01:25:22.245550966+00:00 stderr F W0420 01:25:22.242774 8852 base_network_controller_namespace.go:458] Unable to remove remote zone pod's openshift-controller-manager/controller-manager-5485d88c84-xztxq IP address from the namespace address-set, err: pod openshift-controller-manager/controller-manager-5485d88c84-xztxq: no pod IPs found 2024-04-20T01:25:22.262446336+00:00 stderr F W0420 01:25:22.261351 8852 base_network_controller_namespace.go:458] Unable to remove remote zone pod's openshift-route-controller-manager/route-controller-manager-6b5868f887-n6jj9 IP address from the namespace address-set, err: pod openshift-route-controller-manager/route-controller-manager-6b5868f887-n6jj9: no pod IPs found 2024-04-20T01:25:27.154790226+00:00 stderr F I0420 01:25:27.154744 8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-worker-0 2024-04-20T01:25:27.154790226+00:00 stderr F I0420 01:25:27.154770 8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-worker-0: 31.72µs 2024-04-20T01:25:27.172301639+00:00 stderr F I0420 01:25:27.168666 8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-master-2 2024-04-20T01:25:27.172301639+00:00 stderr F I0420 01:25:27.168692 8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-master-2: 34.346µs 2024-04-20T01:25:27.196078022+00:00 stderr F I0420 01:25:27.194311 8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-master-0 2024-04-20T01:25:27.196078022+00:00 stderr F I0420 01:25:27.194339 8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-master-0: 
40.027µs 2024-04-20T01:25:27.196078022+00:00 stderr F I0420 01:25:27.194582 8852 master.go:740] Adding or Updating Node "test-infra-cluster-4832ebf8-master-0" 2024-04-20T01:25:27.215435944+00:00 stderr F I0420 01:25:27.215387 8852 master.go:740] Adding or Updating Node "test-infra-cluster-4832ebf8-master-0" 2024-04-20T01:25:35.789830706+00:00 stderr F I0420 01:25:35.789782 8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-worker-1 2024-04-20T01:25:35.790044794+00:00 stderr F I0420 01:25:35.790025 8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-worker-1: 250.227µs 2024-04-20T01:25:37.596875642+00:00 stderr F I0420 01:25:37.596834 8852 iptables.go:358] "Running" command="iptables-save" arguments=["-t","nat"] 2024-04-20T01:25:47.138312366+00:00 stderr F I0420 01:25:47.138266 8852 obj_retry.go:296] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp 2024-04-20T01:25:47.138382299+00:00 stderr F I0420 01:25:47.138370 8852 obj_retry.go:358] Adding new object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp 2024-04-20T01:25:47.138453866+00:00 stderr F I0420 01:25:47.138440 8852 obj_retry.go:370] Retry add failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp, will try again later: failed to obtain IPs to add remote pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: suppressed error logged: pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: no pod IPs found 2024-04-20T01:26:17.138583468+00:00 stderr F I0420 01:26:17.138544 8852 obj_retry.go:296] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp 2024-04-20T01:26:17.138640587+00:00 stderr F I0420 01:26:17.138629 8852 obj_retry.go:358] Adding new object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp 2024-04-20T01:26:17.138708817+00:00 stderr F I0420 01:26:17.138696 8852 obj_retry.go:370] Retry add failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp, will try again later: failed to obtain IPs to add remote pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: suppressed error logged: pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: no pod IPs found 2024-04-20T01:26:39.474787436+00:00 stderr F I0420 01:26:39.474744 8852 reflector.go:790] k8s.io/client-go/informers/factory.go:159: Watch close - *v1.EndpointSlice total 130 items received 2024-04-20T01:26:39.475670148+00:00 stderr F E0420 01:26:39.475653 8852 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.EndpointSlice: the server has asked for the client to provide credentials (get endpointslices.discovery.k8s.io) 2024-04-20T01:26:40.786339334+00:00 stderr F I0420 01:26:40.786255 8852 reflector.go:325] Listing and watching *v1.EndpointSlice from k8s.io/client-go/informers/factory.go:159 2024-04-20T01:26:40.806238387+00:00 stderr F W0420 01:26:40.804542 8852 reflector.go:535] k8s.io/client-go/informers/factory.go:159: failed to list *v1.EndpointSlice: Unauthorized 2024-04-20T01:26:40.806238387+00:00 stderr F E0420 01:26:40.804571 8852 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Unauthorized
Expected results:
ovnkube-node detects that cert is expired, requests new certs via CSR flow and reloads them
Additional info:
CI periodic to check this flow: https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ovn-sno-cert-rotation-suspend-30d (artifacts contain an sosreport). Applies to SNO and HA clusters; works as expected when nodes are properly shut down instead of suspended.
Description of problem:
After upgrading from 4.13.x to 4.14.10, the workload images that the customer stored inside the internal registry are lost, resulting in the application pods erroring with "Back-off pulling image". Even when manually pulling with podman, it fails with "manifest unknown" because the image can no longer be found in the registry.
- This behavior was found and reproduced 100% on ARO clusters, where the internal registry is by default backed by the Storage Account created by the ARO RP service principal, which is the Containers blob service.
- I do not know if the same behavior is found in non-managed Azure clusters or any other architecture.
Version-Release number of selected component (if applicable):
4.14.10
How reproducible:
100% with an ARO cluster (Managed cluster)
Steps to Reproduce: Attached.
The workaround found so far is to rebuild the apps or re-import the images, but those tasks are lengthy and costly, especially if it is a production cluster.
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/774
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/installer/pull/7819
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When there is only one server CSR pending approval, we still show two records (one saying a client CSR requires approval, which is already several hours old, and the other saying a server CSR requires approval).
Version-Release number of selected component (if applicable):
pre-merge testing of https://github.com/openshift/console/pull/13493
How reproducible:
Always
Steps to Reproduce:
1. Select one node which is joining the cluster, approve the client CSR and do not approve the server CSR, then wait for some time => we can see only one node is pending on server CSR approval:
$ oc get csr | grep Pending | grep system:node
csr-54sn4   142m   kubernetes.io/kubelet-serving   system:node:ip-10-0-49-55.us-east-2.compute.internal   <none>   Pending
csr-7nhb9   65m    kubernetes.io/kubelet-serving   system:node:ip-10-0-49-55.us-east-2.compute.internal   <none>   Pending
csr-9g22f   4m4s   kubernetes.io/kubelet-serving   system:node:ip-10-0-49-55.us-east-2.compute.internal   <none>   Pending
csr-bgrdq   35m    kubernetes.io/kubelet-serving   system:node:ip-10-0-49-55.us-east-2.compute.internal   <none>   Pending
csr-chqnf   50m    kubernetes.io/kubelet-serving   system:node:ip-10-0-49-55.us-east-2.compute.internal   <none>   Pending
csr-f4sbl   127m   kubernetes.io/kubelet-serving   system:node:ip-10-0-49-55.us-east-2.compute.internal   <none>   Pending
csr-msnml   157m   kubernetes.io/kubelet-serving   system:node:ip-10-0-49-55.us-east-2.compute.internal   <none>   Pending
csr-p9qrp   19m    kubernetes.io/kubelet-serving   system:node:ip-10-0-49-55.us-east-2.compute.internal   <none>   Pending
csr-qp2pw   112m   kubernetes.io/kubelet-serving   system:node:ip-10-0-49-55.us-east-2.compute.internal   <none>   Pending
csr-qrlnv   96m    kubernetes.io/kubelet-serving   system:node:ip-10-0-49-55.us-east-2.compute.internal   <none>   Pending
csr-tk7j4   81m    kubernetes.io/kubelet-serving   system:node:ip-10-0-49-55.us-east-2.compute.internal   <none>   Pending
Actual results:
1. on nodes list page, we can see two rows shown for node ip-10-0-49-55.us-east-2.compute.internal
Expected results:
Since the pending client CSR has been there for several hours and the node is now actually waiting for server CSR approval, we should only show one record/row to indicate to the user that it requires server CSR approval. The pending client CSR associated with ip-10-0-49-55.us-east-2.compute.internal is already 3 hours old:
$ oc get csr csr-4d628
NAME        AGE   SIGNERNAME                                    REQUESTOR                                                                   REQUESTEDDURATION   CONDITION
csr-4d628   3h    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
Additional info:
Description of problem:
Converting IPv6-primary dual stack to IPv6 single stack causes control plane failures. OVN masters are periodically in CLBO state. OVN master logs: http://shell.lab.bos.redhat.com/~anusaxen/convert/ MG is not working as the cluster lands in a bad shape. Happy to share the cluster if needed for debugging.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-04-21-084440
How reproducible:
Always
Steps to Reproduce:
1. Bring the cluster up with IPv6-primary dual stack.
2. Edit network.config.openshift.io from dual stack to single stack, as follows:
spec:
  clusterNetwork:
  - cidr: fd01::/48
    hostPrefix: 64
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  externalIP:
    policy: {}
  networkType: OVNKubernetes
  serviceNetwork:
  - fd02::/112
  - 172.30.0.0/16
status:
  clusterNetwork:
  - cidr: fd01::/48
    hostPrefix: 64
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  clusterNetworkMTU: 1400
  networkType: OVNKubernetes
  serviceNetwork:
  - fd02::/112
  - 172.30.0.0/16
TO
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Network
  metadata:
    creationTimestamp: "2023-04-27T14:11:37Z"
    generation: 3
    name: cluster
    resourceVersion: "81045"
    uid: 28f15675-e739-4262-9acc-4c2c0df4b38d
  spec:
    clusterNetwork:
    - cidr: fd01::/48
      hostPrefix: 64
    externalIP:
      policy: {}
    networkType: OVNKubernetes
    serviceNetwork:
    - fd02::/112
  status:
    clusterNetwork:
    - cidr: fd01::/48
      hostPrefix: 64
    clusterNetworkMTU: 1400
    networkType: OVNKubernetes
    serviceNetwork:
    - fd02::/112
kind: List
metadata:
  resourceVersion: ""
3. Wait for the control plane components to roll out successfully.
Actual results:
Cluster fails with network, ETCD, Kube API and ingress failures
Expected results:
Cluster should convert to IPv6 single without any issues
Additional info:
MGs are not working due to various control plane issues restricting them.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/operator-framework-catalogd/pull/36
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
We need to reenable the e2e integration tests as soon as the operator is available again.
Please review the following PR: https://github.com/openshift/operator-framework-operator-controller/pull/51
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Looking at the telemetry data for Nutanix I noticed that the “host_type” for clusters installed with platform nutanix shows as “virt-unknown”. Do you know what needs to happen in the code to tell telemetry about host type being Nutanix? The problem is that we can’t track those installations with platform none, just IPI. Refer to the slack thread https://redhat-internal.slack.com/archives/C0211848DBN/p1687864857228739.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
Create an OCP Nutanix cluster
Actual results:
The telemetry data for Nutanix shows the “host_type” for the nutanix cluster as “virt-unknown”.
Expected results:
The telemetry data for Nutanix shows the “host_type” for the nutanix cluster as "nutanix".
Additional info:
Please review the following PR: https://github.com/openshift/aws-pod-identity-webhook/pull/183
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
iam:TagInstanceProfile is not listed in the official document [1]. An IPI install would fail if the iam:TagInstanceProfile permission is missing:
level=error msg=Error: creating IAM Instance Profile (ci-op-4hw2rz1v-49c30-zt9vx-worker-profile): AccessDenied: User: arn:aws:iam::301721915996:user/ci-op-4hw2rz1v-49c30-minimal-perm is not authorized to perform: iam:TagInstanceProfile on resource: arn:aws:iam::301721915996:instance-profile/ci-op-4hw2rz1v-49c30-zt9vx-worker-profile because no identity-based policy allows the iam:TagInstanceProfile action
level=error msg= status code: 403, request id: bb0641f5-d01c-4538-b333-261a804ddb59
[1] https://docs.openshift.com/container-platform/4.14/installing/installing_aws/installing-aws-account.html#installation-aws-permissions_installing-aws-account
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-14-115151
How reproducible:
Always
Steps to Reproduce:
1. Install a common IPI cluster with the minimal permissions provided in the official document 2. 3.
Actual results:
Install failed.
Expected results:
Additional info:
The install does a precheck for iam:TagInstanceProfile.
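One way such a precheck could look, sketched with the AWS SDK's policy simulation API; the principal ARN is hypothetical and this is not the installer's actual validation code:
```
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/iam"
)

// A hedged sketch of a preflight check that asks IAM whether the installing
// principal is allowed to call iam:TagInstanceProfile before any resources
// are created.
func main() {
	sess := session.Must(session.NewSession())
	svc := iam.New(sess)

	principalARN := "arn:aws:iam::123456789012:user/installer" // hypothetical
	out, err := svc.SimulatePrincipalPolicy(&iam.SimulatePrincipalPolicyInput{
		PolicySourceArn: aws.String(principalARN),
		ActionNames:     []*string{aws.String("iam:TagInstanceProfile")},
	})
	if err != nil {
		log.Fatalf("simulating policy: %v", err)
	}
	for _, r := range out.EvaluationResults {
		fmt.Printf("%s => %s\n", aws.StringValue(r.EvalActionName), aws.StringValue(r.EvalDecision))
	}
}
```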
Description of problem:
It would make debugging easier if we included the namespace in the message for these alerts: https://github.com/openshift/cluster-ingress-operator/blob/master/manifests/0000_90_ingress-operator_03_prometheusrules.yaml#L69
Version-Release number of selected component (if applicable):
4.12.x
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
No namespace in the alert message
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/csi-operator/pull/78
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/baremetal-operator/pull/328
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-operator/pull/81
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/129
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Ran into a problem with our testing this morning on a newly created ROKS cluster:
```
Error running /usr/bin/oc --namespace=e2e-test-oc-service-p4fz2 --kubeconfig=/tmp/configfile2694323048 create service nodeport mynodeport --tcp=8080:7777 --node-port=30000:
StdOut>
error: failed to create NodePort service: Service "mynodeport" is invalid: spec.ports[0].nodePort: Invalid value: 30000: provided port is already allocated
StdErr>
error: failed to create NodePort service: Service "mynodeport" is invalid: spec.ports[0].nodePort: Invalid value: 30000: provided port is already allocated
exit status 1
```
The port was already used by a different service. We would like to request that the test make the port number dynamic, so that if that port is taken it can choose an available one.
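One possible shape for the requested change, sketched with plain oc commands: omit --node-port so the API server allocates a free port, then read the allocated port back for use in the rest of the test.

```
$ oc --namespace=e2e-test-oc-service-p4fz2 create service nodeport mynodeport --tcp=8080:7777
$ oc --namespace=e2e-test-oc-service-p4fz2 get service mynodeport -o jsonpath='{.spec.ports[0].nodePort}'
```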
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
A runbook for the alert TargetDown would be useful for OpenShift users.
This runbook should explain:
1. how to identify which targets are down (see the example query sketched after this list)
2. how to investigate why the target went offline
3. how to resolve the common causes that bring the target down
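As a minimal sketch for item 1 (assuming the standard up metric exposed by the in-cluster Prometheus), a query like the following lists the scrape targets that are currently down; it can be run from the console under Observe -> Metrics:

```
max by (namespace, job, service) (up == 0)
```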
Related cases:
Description of problem:
Creating a BuildConfig in the Dev console gets a [the server does not allow this method on the requested resource] error when metadata.namespace is not set
How reproducible:
Test case is shown below
Steps to Reproduce:
Use the following to create a BuildConfig in the OpenShift console GUI: Developer -> Builds -> Create BuildConfig -> YAML view
~~~
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: mywebsite
  labels:
    name: mywebsite
spec:
  triggers:
    - type: ImageChange
      imageChange: {}
    - type: ConfigChange
  source:
    type: Git
    git:
      uri: https://github.com/monodot/container-up
    contextDir: httpd-hello-world
  strategy:
    type: Docker
    dockerStrategy:
      dockerfilePath: Dockerfile
      from:
        kind: ImageStreamTag
        name: httpd:latest
        namespace: testbuild
  output:
    to:
      kind: ImageStreamTag
      name: mywebsite:latest
~~~
Actual results:
Get [the server does not allow this method on the requested resource] error
Expected results:
Creating the same BuildConfig without metadata.namespace works in CLI mode, and per the customer the 4.11 GUI console does not trigger this error either. Does that mean the code changed in 4.13?
Description of problem:
Core CAPI CRDs are not deployed on unsupported platforms even when they are explicitly needed by other operators. An example of this is vSphere clusters: CAPI is not yet supported on vSphere, but the CAPI IPAM CRDs are needed by operators other than the usual consumers (cluster-capi-operator and the CAPI controllers).
Version-Release number of selected component (if applicable):
How reproducible:
Launch a techpreview cluster for an unsupported platform (e.g. vsphere/azure). Check that the Core CAPI CRDs are not present.
Steps to Reproduce:
$ oc get crds | grep cluster.x-k8s.io
Actual results:
Core CAPI CRDs are not present (only the metal ones)
Expected results:
Core CAPI CRDs should be present
Additional info:
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/55
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1020
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The HCP CSR flow allows any CN in the incoming CSR.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Using the CSR flow, any name you add to the CN in the CSR will be your username against the Kubernetes API server - check your username using the SelfSubjectRequest API (kubectl auth whoami)
Steps to Reproduce:
1. Create a CSR with CN=whatever 2. Get the CSR signed and create a kubeconfig from the issued certificate 3. Using that kubeconfig, kubectl auth whoami shows whatever CN was used (see the sketch under Additional info below)
Actual results:
any CN in CSR is the username against the cluster
Expected results:
we should only allow CNs with some known prefix (system:customer-break-glass:...)
Additional info:
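A rough sketch of the flow with hypothetical file names; the signing step is elided because it goes through the HCP CSR flow:

```
$ openssl req -new -newkey rsa:2048 -nodes -keyout client.key -out client.csr -subj "/CN=whatever"
# ...submit client.csr through the HCP CSR flow, get it signed, and build client.kubeconfig from the issued certificate...
$ kubectl --kubeconfig=client.kubeconfig auth whoami
ATTRIBUTE   VALUE
Username    whatever
```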
While implementing MON-3669, we realized that none of the recording rules running on the telemeter server side are tested. Given how complex these rules can be, it's important for us to be confident that future changes won't introduce regressions.
Even though not perfect, it's possible to unit test Prometheus rules with the promtool binary (example in CMO: https://github.com/openshift/cluster-monitoring-operator/blob/2ca7067a4d1fc86b31f7a4816c85da6abc0c8abf/Makefile#L218-L221).
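A minimal sketch of such a test file, using hypothetical rule and metric names; the real telemeter rules would replace rules.yaml and the expressions:

```yaml
# test.yaml - run with: promtool test rules test.yaml
rule_files:
  - rules.yaml                      # hypothetical file holding the recording rules under test
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'some_metric{cluster="a"}'
        values: '1 1 1 1'
    promql_expr_test:
      - expr: cluster:some_metric:sum     # hypothetical recording rule name
        eval_time: 3m
        exp_samples:
          - labels: 'cluster:some_metric:sum{cluster="a"}'
            value: 1
```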
DoD
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/99
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
Installation of a cluster using OCP image 4.15.0-rc.0 with an HTTP Proxy configuration failed with:
"3/3 control plane nodes failed to install. Please check the installation logs for more information."
and
"error Host master-0-1: updated status from installing-in-progress to error (Host failed to install because its installation stage Waiting for control plane took longer than expected 1h0m0s)"
After looking at the master-0-1 node journalctl log, I found the error:
"Dec 21 00:54:29 master-0-1 kubelet.sh[5111]: E1221 00:54:29.290568 5111 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=etcd pod=etcd-bootstrap-member-master-0-1_openshift-etcd(97cad44a9feb70b1091eaa3fb1e565ca)\"" pod="openshift-etcd/etcd-bootstrap-member-master-0-1" podUID="97cad44a9feb70b1091eaa3fb1e565ca"
"
HTTP Proxy configuration works fine with OCP images 4.14.5 and 4.13.26
Steps to reproduce:
1. Setup HTTP Proxy server on hypervisor using quay.io/sameersbn/squid:3.5.27-2
2. Create a cluster and get to Host Discovery
3. Press Add Host. Fill out SSH public key
4. Select Show proxy settings and fill out
HTTP proxy URL and No proxy domains
5. Generate the ISO image, download it, and boot 3 master and 2 worker nodes.
6. Continue regular cluster installation
Actual results:
Installation failed at 69%
Expected results:
Installation passed
Installed some operators. After some time, the ResolutionFailed condition shows up:
$ kubectl get subscription.operators -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,ResolutionFailed:.status.conditions[?(@.type=="ResolutionFailed")].status,MSG:.status.conditions[?(@.type=="ResolutionFailed")].message' NAMESPACE NAME ResolutionFailed MSG infra-sso rhbk-operator True [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"] metallb-system metallb-operator-sub True [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"] multicluster-engine multicluster-engine True [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"] open-cluster-management acm-operator-subscription True [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"] openshift-cnv kubevirt-hyperconverged True [failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused", failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.202.255:50051: connect: connection refused"] openshift-gitops-operator openshift-gitops-operator True [failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused", failed to populate resolver cache from source 
certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.202.255:50051: connect: connection refused"] openshift-local-storage local-storage-operator True [failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused", failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.202.255:50051: connect: connection refused"] openshift-nmstate kubernetes-nmstate-operator <none> <none> openshift-operators devworkspace-operator-fast-redhat-operators-openshift-marketplace <none> <none> openshift-operators external-secrets-operator <none> <none> openshift-operators web-terminal <none> <none> openshift-storage lvms <none> <none> openshift-storage mcg-operator-stable-4.14-redhat-operators-openshift-marketplace <none> <none> openshift-storage ocs-operator-stable-4.14-redhat-operators-openshift-marketplace <none> <none> openshift-storage odf-csi-addons-operator-stable-4.14-redhat-operators-openshift-marketplace <none> <none> openshift-storage odf-operator <none> <none>
In the package server logs you can see that at one point the catalog source is not available; after a while the catalog source is available again, but the error doesn't disappear from the subscription.
Package server logs:
time="2023-12-05T14:27:09Z" level=warning msg="error getting bundle stream" action="refresh cache" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.30.37.69:50051: connect: connection refused\"" source="{redhat-operators openshift-marketplace}" time="2023-12-05T14:27:09Z" level=info msg="updating PackageManifest based on CatalogSource changes: {community-operators openshift-marketplace}" action="sync catalogsource" address="community-operators.openshift-marketplace.svc:50051" name=community-operators namespace=openshift-marketplace time="2023-12-05T14:28:26Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-marketplace openshift-marketplace}" action="sync catalogsource" address="redhat-marketplace.openshift-marketplace.svc:50051" name=redhat-marketplace namespace=openshift-marketplace time="2023-12-05T14:30:23Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="certified-operators.openshift-marketplace.svc:50051" name=certified-operators namespace=openshift-marketplace time="2023-12-05T14:35:56Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="certified-operators.openshift-marketplace.svc:50051" name=certified-operators namespace=openshift-marketplace time="2023-12-05T14:37:28Z" level=info msg="updating PackageManifest based on CatalogSource changes: {community-operators openshift-marketplace}" action="sync catalogsource" address="community-operators.openshift-marketplace.svc:50051" name=community-operators namespace=openshift-marketplace time="2023-12-05T14:37:28Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-operators openshift-marketplace}" action="sync catalogsource" address="redhat-operators.openshift-marketplace.svc:50051" name=redhat-operators namespace=openshift-marketplace time="2023-12-05T14:39:40Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-marketplace openshift-marketplace}" action="sync catalogsource" address="redhat-marketplace.openshift-marketplace.svc:50051" name=redhat-marketplace namespace=openshift-marketplace time="2023-12-05T14:46:07Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="certified-operators.openshift-marketplace.svc:50051" name=certified-operators namespace=openshift-marketplace time="2023-12-05T14:47:37Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-operators openshift-marketplace}" action="sync catalogsource" address="redhat-operators.openshift-marketplace.svc:50051" name=redhat-operators namespace=openshift-marketplace time="2023-12-05T14:48:21Z" level=info msg="updating PackageManifest based on CatalogSource changes: {community-operators openshift-marketplace}" action="sync catalogsource" address="community-operators.openshift-marketplace.svc:50051" name=community-operators namespace=openshift-marketplace time="2023-12-05T14:49:53Z" level=info msg="updating
4.14.3
1. Install an operator, for example metallb 2. Wait until the catalog pod is unavailable for some time 3. ResolutionFailed doesn't disappear anymore
ResolutionFailed doesn't disappear from the subscription anymore.
ResolutionFailed disappears from the subscription.
Please review the following PR: https://github.com/openshift/multus-admission-controller/pull/78
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/130
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The default channel of 4.15, 4.16 clusters is stable-4.14.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-01-03-193825
How reproducible:
Always
Steps to Reproduce:
1. Install a 4.16 cluster
2. Check the default channel:
# oc adm upgrade
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.16.0-0.nightly-2024-01-03-193825 not found in the "stable-4.14" channel

Cluster version is 4.16.0-0.nightly-2024-01-03-193825

Upgradeable=False
  Reason: MissingUpgradeableAnnotation
  Message: Cluster operator cloud-credential should not be upgraded between minor versions: Upgradeable annotation cloudcredential.openshift.io/upgradeable-to on cloudcredential.operator.openshift.io/cluster object needs updating before upgrade. See Manually Creating IAM documentation for instructions on preparing a cluster for upgrade.

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.14
Actual results:
Default channel is stable-4.14 in a 4.16 cluster
Expected results:
Default channel should be stable-4.16 in a 4.16 cluster
Additional info:
4.15 cluster has the issue as well.
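As a possible workaround sketch until the default is fixed, the channel can be set explicitly on the affected cluster (assuming stable-4.16 is the intended channel for a 4.16 cluster):

```
$ oc adm upgrade channel stable-4.16
```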
Description of the problem:
Per the latest decision, Red Hat is not going to support installing an OCP cluster on Nutanix with nested virtualization. Thus the "Install OpenShift Virtualization" check box on the "Operators" page should be disabled when the "Nutanix" platform is selected on the "Cluster Details" page.
Slack discussion thread
https://redhat-internal.slack.com/archives/C0211848DBN/p1706640683120159
Nutanix
https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000XeiHCAS
Please review the following PR: https://github.com/openshift/kubevirt-csi-driver/pull/26
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Automate E2E tests of Dynamic OVS Pinning. This bug is created for merging
https://github.com/openshift/cluster-node-tuning-operator/pull/746
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/88
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The kube-apiserver operator is trying to delete a prometheus rule that does not exist, leading to a huge amount of unwanted audit logs. With the introduction of the change made as part of BUG-2004585, the kube-apiserver SLO rules are split into 2 groups, kube-apiserver-slos-basic and kube-apiserver-slos-extended, and kube-apiserver-operator keeps trying to delete /apis/monitoring.coreos.com/v1/namespaces/openshift-kube-apiserver/prometheusrules/kube-apiserver-slos, which no longer exists in the cluster.
Version-Release number of selected component (if applicable):
4.12 4.13 4.14
How reproducible:
It's easy to reproduce
Steps to Reproduce:
1. Install a cluster with 4.12
2. Enable cluster logging
3. Forward the audit logs to an internal or external log store using the config below:
```yaml
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  pipelines:
    - name: all-to-default
      inputRefs:
        - infrastructure
        - application
        - audit
      outputRefs:
        - default
```
4. Check the audit logs in Kibana; it will show logs like the image below
Actual results:
Kube-apiserver-operator is trying to delete a prometheus rule that does not exist in the cluster
Expected results:
If the rule is not in the cluster, the operator should not attempt to delete it
Additional info:
Description of problem:
The security-groups.yaml playbook runs the IPv6 security group rule creation tasks regardless of the os_subnet6 value. The when clause does not consider the os_subnet6 [1] value, so the tasks are always executed.
It works with:
```yaml
- name: 'Create security groups for IPv6'
  block:
    - name: 'Create master-sg IPv6 rule "OpenShift API"'
      [...]
  when: os_subnet6 is defined
```
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-11-033133
How reproducible:
Always
Steps to Reproduce:
1. Don't set the os_subnet6 in the inventory file [2] (so it's not dual-stack) 2. Deploy 4.15 UPI by running the UPI playbooks
Actual results:
IPv6 security group rules are created
Expected results:
IPv6 security group rules shouldn't be created
Additional info:
[1] https://github.com/openshift/installer/blob/46fd66272538c350327880e1ed261b70401b406e/upi/openstack/security-groups.yaml#L375
[2] https://github.com/openshift/installer/blob/46fd66272538c350327880e1ed261b70401b406e/upi/openstack/inventory.yaml#L77
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Reported in https://issues.redhat.com/browse/MGMT-14527
resourcepool-path key needs to be added to config ini
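A sketch of where the key would live in the vSphere cloud-provider config INI; all values here are illustrative placeholders:

```ini
[Workspace]
server            = vcenter.example.com
datacenter        = DC1
default-datastore = datastore1
folder            = /DC1/vm/my-cluster
resourcepool-path = /DC1/host/Cluster-1/Resources/my-resource-pool
```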
Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/91
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Steps to Reproduce:
1. Install a cluster using Azure Workload Identity 2. Check the value of the cco_credentials_mode metric
Actual results:
mode = manual
Expected results:
mode = manualpodidentity
Additional info:
The cco_credentials_mode metric reports manualpodidentity mode for an AWS STS cluster.
Description of problem:
[vSphere-CSI-Driver-Operator] does not update the VSphereCSIDriverOperatorCRAvailable status timely
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-04-162702
How reproducible:
Always
Steps to Reproduce:
1. Set up a vSphere cluster with a 4.15 nightly
2. Back up the secret/vmware-vsphere-cloud-credentials to "vmware-cc.yaml" (see the command sketch below)
3. Change the secret/vmware-vsphere-cloud-credentials password to an invalid value in ns/openshift-cluster-csi-drivers using oc edit
4. Wait for the cluster storage operator to degrade and the driver controller pods to go into CrashLoopBackOff, then restore the backup secret "vmware-cc.yaml" with oc apply
5. Observe that the driver controller pods come back to Running and the cluster storage operator returns to healthy
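A rough sketch of steps 2-4 with oc commands (file name and edit are illustrative):

```
$ oc -n openshift-cluster-csi-drivers get secret vmware-vsphere-cloud-credentials -o yaml > vmware-cc.yaml
$ oc -n openshift-cluster-csi-drivers edit secret vmware-vsphere-cloud-credentials   # set the password to an invalid value
$ oc apply -f vmware-cc.yaml                                                         # restore the original credentials
```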
Actual results:
In Step5 : The driver controller pods back to Running but the cluster storage operator stuck at Degrade: True status for almost 1 hour$ oc get po NAME READY STATUS RESTARTS AGE vmware-vsphere-csi-driver-controller-664db7d497-b98vt 13/13 Running 0 16s vmware-vsphere-csi-driver-controller-664db7d497-rtj49 13/13 Running 0 23s vmware-vsphere-csi-driver-node-2krg6 3/3 Running 1 (3h4m ago) 3h5m vmware-vsphere-csi-driver-node-2t928 3/3 Running 2 (3h16m ago) 3h16m vmware-vsphere-csi-driver-node-45kb8 3/3 Running 2 (3h16m ago) 3h16m vmware-vsphere-csi-driver-node-8vhg9 3/3 Running 1 (3h16m ago) 3h16m vmware-vsphere-csi-driver-node-9fh9l 3/3 Running 1 (3h4m ago) 3h5m vmware-vsphere-csi-driver-operator-5954476ddc-rkpqq 1/1 Running 2 (3h10m ago) 3h17m vmware-vsphere-csi-driver-webhook-7b6b5d99f6-rxdt8 1/1 Running 0 3h16m vmware-vsphere-csi-driver-webhook-7b6b5d99f6-skcbd 1/1 Running 0 3h16m $ oc get co/storage -w NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE storage 4.15.0-0.nightly-2023-12-04-162702 False False True 8m39s VSphereCSIDriverOperatorCRAvailable: VMwareVSphereControllerAvailable: error logging into vcenter: ServerFaultCode: Cannot complete login due to an incorrect user name or password. storage 4.15.0-0.nightly-2023-12-04-162702 True False False 0s $ oc get co/storage NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE storage 4.15.0-0.nightly-2023-12-04-162702 True False False 3m41s
Expected results:
In Step 5: after the driver controller pods are back to Running, the cluster storage operator should recover to a healthy status immediately
Additional info:
Comparing with previous CI results, it seems this issue started after 4.15.0-0.nightly-2023-11-25-110147
Please review the following PR: https://github.com/openshift/cluster-api-provider-vsphere/pull/29
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/router/pull/550
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In OCP 4.14 the catalog pods in openshift-marketplace were defined as ($ oc get pods -n openshift-marketplace redhat-operators-4bnz4 -o yaml):
```yaml
apiVersion: v1
kind: Pod
metadata:
  ...
  labels:
    olm.catalogSource: redhat-operators
    olm.pod-spec-hash: 658b699dc
  name: redhat-operators-4bnz4
  namespace: openshift-marketplace
  ...
spec:
  containers:
  - image: registry.redhat.io/redhat/redhat-operator-index:v4.14
    imagePullPolicy: Always
```
Now on OCP 4.15 they are defined as:
```yaml
apiVersion: v1
kind: Pod
metadata:
  ...
  name: redhat-operators-44wxs
  namespace: openshift-marketplace
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: true
    kind: CatalogSource
    name: redhat-operators
    uid: 3b41ac7b-7ad1-4d58-a62f-4a9e667ae356
  resourceVersion: "877589"
  uid: 65ad927c-3764-4412-8d34-82fd856a4cbc
spec:
  containers:
  - args:
    - serve
    - /extracted-catalog/catalog
    - --cache-dir=/extracted-catalog/cache
    command:
    - /bin/opm
    ...
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7259b65d8ae04c89cf8c4211e4d9ddc054bb8aebc7f26fac6699b314dc40dbe3
    imagePullPolicy: Always
    ...
  initContainers:
  ...
  - args:
    - --catalog.from=/configs
    - --catalog.to=/extracted-catalog/catalog
    - --cache.from=/tmp/cache
    - --cache.to=/extracted-catalog/cache
    command:
    - /utilities/copy-content
    image: registry.redhat.io/redhat/redhat-operator-index:v4.15
    imagePullPolicy: IfNotPresent
    ...
```
And due to `imagePullPolicy: IfNotPresent` on the initContainer used to extract the index image (referenced by tag) content, the catalogs are never really updated.
Version-Release number of selected component (if applicable):
OCP 4.15.0
How reproducible:
100%
Steps to Reproduce:
1. wait for the next version of a released operator on OCP 4.15 2. 3.
Actual results:
Operator catalogs are never really refreshed due to imagePullPolicy: IfNotPresent for the index image
Expected results:
Operator catalogs are periodically (every 10 minutes by default) refreshed
Additional info:
Description of problem:
We need to make the controllerAvailabilityPolicy field immutable in the HostedCluster spec to ensure the customer cannot switch between SingleReplica and HighAvailability.
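One way such immutability can be expressed at the API level is a CEL validation rule on the CRD schema; this is only a sketch, not necessarily how HyperShift ends up enforcing it (a validating webhook is another option):

```yaml
# Fragment of a CRD schema: x-kubernetes-validations rejects updates that change the field.
controllerAvailabilityPolicy:
  type: string
  x-kubernetes-validations:
    - rule: "self == oldSelf"
      message: "controllerAvailabilityPolicy is immutable"
```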
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
If there are TaskRuns with the same name in 2 different namespaces, the TaskRuns list page for All Projects shows only one record because the two rows share the same name
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Create a TaskRun using https://gist.github.com/karthikjeeyar/eb1bbdf9157431f5c875eb55ce47580c in 2 different namespaces 2. Go to the TaskRun list page 3. Select All Projects
Actual results:
Only one entry is shown
Expected results:
Both entries should be visible
Additional info:
Please review the following PR: https://github.com/openshift/ovirt-csi-driver-operator/pull/129
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
W0109 17:47:02.340203 1 builder.go:109] graceful termination failed, controllers failed with error: failed to get infrastructure name: infrastructureName not set in infrastructure 'cluster'
Description of problem:
Trying to create a second cluster using the same cluster name and base domain as the first cluster fails, as expected, because of DNS record-set conflicts. But deleting the second cluster makes the first cluster inaccessible, which is unexpected.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-14-100410
How reproducible:
Always
Steps to Reproduce:
1. Create the first cluster and make sure it succeeds
2. Try to create a second cluster with the same cluster name, base domain, and region, and make sure it fails
3. Destroy the second cluster, which failed due to "Platform Provisioning Check"
4. Check whether the first cluster is still healthy
Actual results:
The first cluster turns unhealthy because its DNS record-sets are deleted by step 3
Expected results:
The DNS record-sets of the first cluster stay untouched during step 3, and the first cluster stays healthy after step 3.
Additional info:
(1) the first cluster is by Flexy-install job https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/257549/, and it's healthy initially $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.15.0-0.nightly-2024-01-14-100410 True False 54m Cluster version is 4.15.0-0.nightly-2024-01-14-100410 $ oc get nodes NAME STATUS ROLES AGE VERSION jiwei-0115y-lgns8-master-0.c.openshift-qe.internal Ready control-plane,master 73m v1.28.5+c84a6b8 jiwei-0115y-lgns8-master-1.c.openshift-qe.internal Ready control-plane,master 73m v1.28.5+c84a6b8 jiwei-0115y-lgns8-master-2.c.openshift-qe.internal Ready control-plane,master 74m v1.28.5+c84a6b8 jiwei-0115y-lgns8-worker-a-gqq96.c.openshift-qe.internal Ready worker 62m v1.28.5+c84a6b8 jiwei-0115y-lgns8-worker-b-2h9xd.c.openshift-qe.internal Ready worker 63m v1.28.5+c84a6b8 $ (2) try to create the second cluster and expect failing due to dns record already exists $ openshift-install version openshift-install 4.15.0-0.nightly-2024-01-14-100410 built from commit b6f320ab7eeb491b2ef333a16643c140239de0e5 release image registry.ci.openshift.org/ocp/release@sha256:385d84c803c776b44ce77b80f132c1b6ed10bd590f868c97e3e63993b811cc2d release architecture amd64 $ mkdir test1 $ cp install-config.yaml test1 $ yq-3.3.0 r test1/install-config.yaml baseDomain qe.gcp.devcluster.openshift.com $ yq-3.3.0 r test1/install-config.yaml metadata creationTimestamp: null name: jiwei-0115y $ yq-3.3.0 r test1/install-config.yaml platform gcp: projectID: openshift-qe region: us-central1 $ openshift-install create cluster --dir test1 INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" INFO Consuming Install Config from target directory FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": metadata.name: Invalid value: "jiwei-0115y": record(s) ["api.jiwei-0115y.qe.gcp.devcluster.openshift.com."] already exists in DNS Zone (openshift-qe/qe) and might be in use by another cluster, please remove it to continue $ (3) delete the second cluster $ openshift-install destroy cluster --dir test1 INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" INFO Deleted 2 recordset(s) in zone qe INFO Deleted 3 recordset(s) in zone jiwei-0115y-lgns8-private-zone WARNING Skipping deletion of DNS Zone jiwei-0115y-lgns8-private-zone, not created by installer INFO Time elapsed: 37s INFO Uninstallation complete! $ (4) check the first cluster status and the dns record-sets $ oc get clusterversion Unable to connect to the server: dial tcp: lookup api.jiwei-0115y.qe.gcp.devcluster.openshift.com on 10.11.5.160:53: no such host $ $ gcloud dns managed-zones describe jiwei-0115y-lgns8-private-zone cloudLoggingConfig: kind: dns#managedZoneCloudLoggingConfig creationTime: '2024-01-15T07:22:55.199Z' description: Created By OpenShift Installer dnsName: jiwei-0115y.qe.gcp.devcluster.openshift.com. id: '9193862213315831261' kind: dns#managedZone labels: kubernetes-io-cluster-jiwei-0115y-lgns8: owned name: jiwei-0115y-lgns8-private-zone nameServers: - ns-gcp-private.googledomains.com. 
privateVisibilityConfig: kind: dns#managedZonePrivateVisibilityConfig networks: - kind: dns#managedZonePrivateVisibilityConfigNetwork networkUrl: https://www.googleapis.com/compute/v1/projects/openshift-qe/global/networks/jiwei-0115y-lgns8-network visibility: private $ gcloud dns record-sets list --zone jiwei-0115y-lgns8-private-zone NAME TYPE TTL DATA jiwei-0115y.qe.gcp.devcluster.openshift.com. NS 21600 ns-gcp-private.googledomains.com. jiwei-0115y.qe.gcp.devcluster.openshift.com. SOA 21600 ns-gcp-private.googledomains.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300 $ gcloud dns record-sets list --zone qe --filter='name~jiwei-0115y' Listed 0 items. $
This wasn't supposed to have a junit associated, but it looks like it did and is now killing payloads. It started failing because the LB was reaped, and we do not yet have confirmation that the new one will be preserved.
This should be pulled out until we've got confirmation that (a) there is no junit for the backend, and (b) the LB is on the preserve whitelist.
Please review the following PR: https://github.com/openshift/image-customization-controller/pull/114
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
A hostedcluster/hostedcontrolplane was stuck uninstalling. Inspecting the CPO logs showed: "error": "failed to delete AWS default security group: failed to delete security group sg-04abe599e5567b025: DependencyViolation: resource sg-04abe599e5567b025 has a dependent object\n\tstatus code: 400, request id: f776a43f-8750-4f04-95ce-457659f59095" Unfortunately, I do not have enough access to the AWS account to inspect this security group, though I know it is the default worker security group because it is recorded in the hostedcluster .status.platform.aws.defaultWorkerSecurityGroupID
Version-Release number of selected component (if applicable):
4.14.1
How reproducible:
I haven't tried to reproduce it yet, but can do so and update this ticket when I do. My theory is:
Steps to Reproduce:
1. Create an AWS HostedCluster, wait for it to create/populate defaultWorkerSecurityGroupID 2. Attach the defaultWorkerSecurityGroupID to anything else in the AWS account unrelated to the HCP cluster 3. Attempt to delete the HostedCluster
Actual results:
CPO logs: "error": "failed to delete AWS default security group: failed to delete security group sg-04abe599e5567b025: DependencyViolation: resource sg-04abe599e5567b025 has a dependent object\n\tstatus code: 400, request id: f776a43f-8750-4f04-95ce-457659f59095"
HostedCluster Status Condition - lastTransitionTime: "2023-11-09T22:18:09Z" message: "" observedGeneration: 3 reason: StatusUnknown status: Unknown type: CloudResourcesDestroyed
Expected results:
I would expect that the CloudResourcesDestroyed status condition on the hostedcluster would reflect this security group as holding up the deletion instead of having to parse through logs.
Additional info:
Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/10
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
DRA plugins can be installed, but do not really work because the required scheduler plugin DynamicResources isn't enabled.
Version-Release number of selected component (if applicable):
4.14.3
How reproducible:
Always
Steps to Reproduce:
1. Install an OpenShift cluster and enable the TechPreviewNoUpgrade feature set either during installation or post-install. The feature set includes the DynamicResourceAllocation feature gate.
2. Install a DRA plugin by any vendor, e.g. by NVIDIA (requires at least one GPU worker with NVIDIA GPU drivers installed on the node, and a few tweaks to allow the plugin to run on OpenShift).
3. Create a resource claim.
4. Create a pod that consumes the resource claim.
Actual results:
The pod remains in ContainerCreating state, the claim in WaitingForFirstConsumer state forever, without any meaningful event or error message.
Expected results:
A resource is allocated according to the resource claim, and assigned to the pod.
Additional info:
The problem is caused by the DynamicResources scheduler plugin not being automatically enabled when the feature flag is turned on. This makes DRA plugins run without issues (the right APIs are available), but do nothing.
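For illustration, this is how the plugin can be enabled explicitly in an upstream KubeSchedulerConfiguration; a sketch only, as the mechanism the OpenShift scheduler operator uses to wire this up may differ:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      multiPoint:
        enabled:
          - name: DynamicResources
```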
Description of problem:
PipelineRun logs page navigation is broken when navigating through the tasks on the PipelineRun Logs tab.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to the PipelineRun details page and select the Logs tab. 2. Navigate through the tasks of the PipelineRun
Actual results:
- The Details tab becomes active on selection of any task
- The Logs page becomes empty when selecting the Logs tab again
- The last task is not selected for completed PipelineRuns
Expected results:
- The Logs tab should stay active while the user navigates through tasks
- The last task should be selected for completed PipelineRuns
Additional info:
It is a regression after change in logic of tab selection in HorizontalNav component.
Video- https://drive.google.com/file/d/15fx9GWO2dRh4uaibRmZ4VTk4HFxQ7NId/view?usp=sharing
Description of problem:
After upgrading to 4.16.0-0.nightly-2024-02-23-013505 from 4.15.0-rc.8 (gcp-ipi-disc-priv-oidc-f14), openshift-cloud-network-config-controller goes into CrashLoopBackOff with "Error building cloud provider client, err: error: cannot initialize google client"; a must-gather is available. The other job (gcp-ipi-oidc-rt-fips-f14) failed with the same error.
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-gcp-ipi-disc-priv-oidc-f14/1761337726575054848
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-02-23-013505
How reproducible:
Steps to Reproduce:
Upgrade from 4.15.0-rc.8 to 4.16.0-0.nightly-2024-02-23-013505 on a GCP OIDC cluster (gcp-ipi-disc-priv-oidc-f14 or gcp-ipi-oidc-rt-fips-f14 profile) and observe openshift-cloud-network-config-controller after the upgrade.
Actual results:
containerStatuses: - containerID: cri-o://b7dc826c4004583a4195f953bb7c858f3645b3ba864db65c69282fc8b7a9a9e8 image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:02a0ea00865bda78b3b04056dc9e4f596dae74996ecc1fcdee7fbe8d603e33f1 imageID: 9dfa10971dce332900b111bbe6a28df76e1d6e0c5b9c132c3abfff80ea0afa9c lastState: terminated: containerID: cri-o://b7dc826c4004583a4195f953bb7c858f3645b3ba864db65c69282fc8b7a9a9e8 exitCode: 255 finishedAt: "2024-02-24T15:20:08Z" message: | r,UID:,APIVersion:apps/v1,ResourceVersion:,FieldPath:,},Reason:FeatureGatesInitialized,Message:FeatureGates updated to featuregates.Features{Enabled:[]v1.FeatureGateName{\"AlibabaPlatform\", \"AzureWorkloadIdentity\", \"BuildCSIVolumes\", \"CloudDualStackNodeIPs\", \"ExternalCloudProvider\", \"ExternalCloudProviderAzure\", \"ExternalCloudProviderExternal\", \"ExternalCloudProviderGCP\", \"KMSv1\", \"NetworkLiveMigration\", \"OpenShiftPodSecurityAdmission\", \"PrivateHostedZoneAWS\", \"VSphereControlPlaneMachineSet\"}, Disabled:[]v1.FeatureGateName{\"AdminNetworkPolicy\", \"AutomatedEtcdBackup\", \"CSIDriverSharedResource\", \"ClusterAPIInstall\", \"DNSNameResolver\", \"DisableKubeletCloudCredentialProviders\", \"DynamicResourceAllocation\", \"EventedPLEG\", \"GCPClusterHostedDNS\", \"GCPLabelsTags\", \"GatewayAPI\", \"InsightsConfigAPI\", \"InstallAlternateInfrastructureAWS\", \"MachineAPIOperatorDisableMachineHealthCheckController\", \"MachineAPIProviderOpenStack\", \"MachineConfigNodes\", \"ManagedBootImages\", \"MaxUnavailableStatefulSet\", \"MetricsServer\", \"MixedCPUsAllocation\", \"NodeSwap\", \"OnClusterBuild\", \"PinnedImages\", \"RouteExternalCertificate\", \"SignatureStores\", \"SigstoreImageVerification\", \"TranslateStreamCloseWebsocketRequests\", \"UpgradeStatus\", \"VSphereStaticIPs\", \"ValidatingAdmissionPolicy\", \"VolumeGroupSnapshot\"}},Source:EventSource{Component:cloud-network-config-controller-86bc6cf968-54kkg,Host:,},FirstTimestamp:2024-02-24 15:20:07.457325229 +0000 UTC m=+0.107570685,LastTimestamp:2024-02-24 15:20:07.457325229 +0000 UTC m=+0.107570685,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:cloud-network-config-controller-86bc6cf968-54kkg,ReportingInstance:,}" F0224 15:20:08.633010 1 main.go:138] Error building cloud provider client, err: error: cannot initialize google client, err: Get "http://169.254.169.254/computeMetadata/v1/universe/universe_domain": dial tcp 169.254.169.254:80: connect: connection refused reason: Error startedAt: "2024-02-24T15:20:07Z" name: controller ready: false restartCount: 12 started: false state: waiting: message: back-off 5m0s restarting failed container=controller pod=cloud-network-config-controller-86bc6cf968-54kkg_openshift-cloud-network-config-controller(95a0c264-ad8b-4fb0-9218-5b2b84fb8194) reason: CrashLoopBackOff
Expected results:
CNCC won't crash after upgrade
Additional info:
Please review the following PR: https://github.com/openshift/configmap-reload/pull/58
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
If GloballyDisableIrqLoadBalancing is disabled in the performance profile, then IRQs should be balanced across all CPUs minus the CPUs that are explicitly removed by CRI-O via the pod annotation irq-load-balancing.crio.io: "disable". There's an issue where the scheduler plugin in tuned attempts to affine all IRQs to the non-isolated cores. Isolated here means non-reserved, not truly isolated cores. This is directly at odds with the user intent, so now we have tuned fighting with crio/irqbalance, each trying to do different things. Scenarios:
- If a pod gets launched with the annotation after tuned has started, at runtime or after a reboot - ok
- On a reboot, if tuned recovers after the guaranteed pod has been launched - broken
- If tuned restarts at runtime for any reason - broken
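For context, a sketch of a guaranteed pod that opts its CPUs out of IRQ load balancing via the CRI-O annotation; the pod name, image, and runtime class are illustrative, and the annotation requires a guaranteed QoS pod plus a runtime class created from the performance profile:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: irq-sensitive-app                     # illustrative name
  annotations:
    irq-load-balancing.crio.io: "disable"
spec:
  runtimeClassName: performance-my-profile    # illustrative, derived from a PerformanceProfile named "my-profile"
  containers:
    - name: app
      image: registry.example.com/app:latest  # illustrative image
      resources:
        requests:
          cpu: "2"
          memory: 1Gi
        limits:
          cpu: "2"
          memory: 1Gi
```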
Version-Release number of selected component (if applicable):
4.14 and likely earlier
How reproducible:
See description
Steps to Reproduce:
1.See description 2. 3.
Actual results:
Expected results:
Additional info:
The Cloud Credential operator was made optional in OCP 4.15, see https://issues.redhat.com/browse/OCPEDGE-69. The CloudCredential cap was added as a new capability.
However, for OCP 4.15 the disablement of CCO is only supported on BareMetal platforms, see https://issues.redhat.com/browse/OCPEDGE-69?focusedId=23595076&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-23595076.
We propose to guard against installations on non-BareMetal platforms without the CloudCredential cap, which could be implemented similarly to https://issues.redhat.com/browse/OCPBUGS-15659.
Please review the following PR: https://github.com/openshift/cluster-monitoring-operator/pull/2190
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem: Missing `useEffect` hook dependency warning in the console UI dev env
Version-Release number of selected component (if applicable): 4.15.0-0.ci.test-2024-02-28-133709-ci-ln-svyfg32-latest
How reproducible:
To reproduce:
Run `yarn lint --fix`
Red Hat OpenShift Container Platform subscriptions are often measured against underlying cores. However, the metrics for cores are unreliable, with some known edge cases. Namely, when virtualization is used, depending on a variety of factors, the hypervisor doesn't report the underlying cores and instead reports a core per "cpu", where "cpu" is a schedulable executor (possibly backed by a single hyperthreaded executor). In order to address this, we assume a ratio of 2 vCPUs to 1 core and divide the "cores" value by 2 to normalize when we detect that hyperthreading information was not reported, when we're on the x86-64 CPU architecture, and when the cluster is not a bare-metal cluster.
At this time, x86-64 virtualized clusters are the ones affected.
Please review the following PR: https://github.com/openshift/cluster-version-operator/pull/1004
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/270
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
The event showing how often a host has been rebooted is now shown only for some of the nodes
11/22/2023, 12:15:10 AM Node test-infra-cluster-57eb6989-master-1 has been rebooted 2 times before completing installation
11/22/2023, 12:00:01 AM Host: test-infra-cluster-57eb6989-master-1, reached installation stage Rebooting
11/21/2023, 11:53:14 PM Host: test-infra-cluster-57eb6989-worker-0, reached installation stage Rebooting
11/21/2023, 11:53:13 PM Host: test-infra-cluster-57eb6989-worker-1, reached installation stage Rebooting
11/21/2023, 11:34:56 PM Host: test-infra-cluster-57eb6989-master-0, reached installation stage Rebooting
11/21/2023, 11:34:26 PM Host: test-infra-cluster-57eb6989-master-2, reached installation stage Rebooting
in this cluster 4 events are missing
11/21/2023, 3:49:34 PM Node test-infra-cluster-164a0f73-master-0 has been rebooted 2 times before completing installation
11/21/2023, 3:49:32 PM Node test-infra-cluster-164a0f73-worker-0 has been rebooted 2 times before completing installation
11/21/2023, 3:49:32 PM Node test-infra-cluster-164a0f73-worker-1 has been rebooted 2 times before completing installation
11/21/2023, 3:37:15 PM Host: test-infra-cluster-164a0f73-master-0, reached installation stage Rebooting
11/21/2023, 3:27:34 PM Host: test-infra-cluster-164a0f73-worker-0, reached installation stage Rebooting
11/21/2023, 3:27:30 PM Host: test-infra-cluster-164a0f73-worker-1, reached installation stage Rebooting
11/21/2023, 3:09:40 PM Host: test-infra-cluster-164a0f73-master-2, reached installation stage Rebooting
11/21/2023, 3:09:35 PM Host: test-infra-cluster-164a0f73-master-1, reached installation stage Rebooting
in this cluster 2 events are missing
How reproducible:
Steps to reproduce:
1. create cluster
2. start installation
3.
Actual results:
some of the events indicating how often a host was rebooted are missing
Expected results:
for each host there should be an indication event
Single line execute markdown reference is not working.
The inline code is not getting rendered properly specifically for single line execute syntax.
The inline code should show a code block with a small execute icon to run the commands in web terminal
As an OCP user, I want storage operators to restart quickly and a newly started operator to start leading immediately, without a ~3 minute wait.
This means that the old operator should release its leadership after it receives SIGTERM and before it exits. Right now, storage operators fail to release the leadership in ~50% of cases.
Steps to reproduce:
This is hack'n'hustle work, not tied to any Epic; I'm using it just to get proper QE and to track which operators are being updated (see linked GitHub PRs).
Description of problem:
4.15 nightly payloads have been affected by this test multiple times:

[sig-arch] events should not repeat pathologically for ns/openshift-kube-scheduler
{ 1 events happened too frequently
event happened 21 times, something is wrong: namespace/openshift-kube-scheduler node/ci-op-2gywzc86-aa265-5skmk-master-1 pod/openshift-kube-scheduler-guard-ci-op-2gywzc86-aa265-5skmk-master-1 hmsg/2652c73da5 - reason/ProbeError Readiness probe error: Get "https://10.0.0.7:10259/healthz": dial tcp 10.0.0.7:10259: connect: connection refused result=reject body: From: 08:41:08Z To: 08:41:09Z}

In each of the 10 jobs aggregated, 2 to 3 jobs failed with this test. Historically this test passed 100%, but with the past two days' test data the passing rate has dropped to 97%, and the aggregator started allowing this in the latest payload: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-azure-ovn-upgrade-4.15-micro-release-openshift-release-analysis-aggregator/1732295947339173888

The first payload this started appearing in is https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.nightly/release/4.15.0-0.nightly-2023-12-05-071627. All the events happened while cluster-operator/kube-scheduler was progressing.

For comparison, here is a passed job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1731936539870498816
Here is a failed one: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1731936538192777216

They both have the same set of probe error events. For the passing jobs the frequency is lower than 20, while for the failed job one of those events repeated more than 20 times and therefore fails the test.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When OLMPlacement is set to management and the cluster comes up with disableAllDefaultSources set to true, removing that setting from the HostedCluster CR has no effect: in the guest cluster, disableAllDefaultSources is not removed and remains set to true.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/65
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
No QA required, updating approvers across releases
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
In a CI run of etcd-operator-e2e I've found the following panic in the operator logs:
E0125 11:04:58.158222 1 health.go:135] health check for member (ip-10-0-85-12.us-west-2.compute.internal) failed: err(context deadline exceeded)
panic: send on closed channel

goroutine 15608 [running]:
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1()
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0xd2
created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2a5
which unfortunately is an incomplete log file. The operator recovered itself by restarting, we should fix the panic nonetheless.
Job run for reference:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1186/pull-ci-openshift-cluster-etcd-operator-master-e2e-operator/1750466468031500288
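For context, a hedged sketch of the race pattern behind "send on closed channel" (illustrative only, not the actual getMemberHealth code): if a collector closes its results channel after a timeout while a slow health check is still trying to send, the late send panics. Buffering the channel to the number of senders, and never closing it, avoids the crash:

package main

import (
	"context"
	"fmt"
	"time"
)

func checkMembers(ctx context.Context, members []string) []error {
	// Buffered to len(members): a late sender never blocks and never hits a closed channel.
	results := make(chan error, len(members))
	for _, m := range members {
		m := m
		go func() {
			select {
			case results <- healthCheck(ctx, m):
			case <-ctx.Done(): // give up cleanly if the collector stopped waiting
			}
		}()
	}

	var errs []error
	for range members {
		select {
		case err := <-results:
			if err != nil {
				errs = append(errs, err)
			}
		case <-ctx.Done():
			return append(errs, ctx.Err())
		}
	}
	return errs
}

func healthCheck(ctx context.Context, member string) error {
	// Placeholder for the real etcd member health probe.
	time.Sleep(10 * time.Millisecond)
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(checkMembers(ctx, []string{"m-0", "m-1", "m-2"}))
}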
Description of problem:
If a ROSA HCP customer uses the default worker security group that the CPO creates for some other purpose (e.g. creates their own VPC Endpoint or EC2 instance using this security group) and then starts an uninstallation, the uninstallation will hang indefinitely because the CPO is unable to delete the security group. https://github.com/openshift/hypershift/blob/9e6255e5e44c8464da0850f8c19dc085bdbaf8cb/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go#L317-L331
Version-Release number of selected component (if applicable):
4.14.8
How reproducible:
100%
Steps to Reproduce:
1. Create a ROSA HCP cluster 2. Attach the default worker security group to some other object unrelated to the cluster, like an EC2 instance or VPC Endpoint 3. Uninstall the ROSA HCP cluster
Actual results:
The uninstall hangs without much feedback to the customer
Expected results:
Either that the uninstall gives up and moves on eventually, or that clear feedback is provided to the customer, so that they know that the uninstall is held up because of an inability to delete a specific security group id. If this feedback mechanism is already in place, but not wired through to OCM, this may not be an OCPBUGS and could just be an OCM bug instead!
Additional info:
Description of problem:
When the IPI installer creates a service instance for the user, PowerVS now reports the type as composite_instance rather than service_instance. Fix up cluster deletion to account for this change.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create cluster 2. Destroy cluster 3.
Actual results:
The newly created service instance does not delete.
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-provider-ibmcloud/pull/70
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/208
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When we run opm on RHEL 8, we hit the following errors:
./opm: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./opm)
Note: this happened for 4.15.0-ec.3. I tried the 4.14 binary and it works. I also compiled it from the latest code, and that also works.
Version-Release number of selected component (if applicable):
4.15.0-ec.3
How reproducible:
always
Steps to Reproduce:
[root@preserve-olm-env2 slavecontainer]# curl -s -k -L https://mirror2.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/candidate/opm-linux-4.15.0-ec.3.tar.gz -o opm.tar.gz && tar -xzvf opm.tar.gz opm
[root@preserve-olm-env2 slavecontainer]# ./opm version
./opm: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./opm)
[root@preserve-olm-env2 slavecontainer]# curl -s -l -L https://mirror2.openshift.com/pub/openshift-v4/x86_64/clients/ocp/latest-4.14/opm-linux-4.14.5.tar.gz -o opm.tar.gz && tar -xzvf opm.tar.gz opm
[root@preserve-olm-env2 slavecontainer]# opm version
Version: version.Version{OpmVersion:"639fc1203", GitCommit:"639fc12035292dec74a16b306226946c8da404a2", BuildDate:"2023-11-21T08:03:15Z", GoOs:"linux", GoArch:"amd64"}
[root@preserve-olm-env2 kuiwang]# cd operator-framework-olm/
[root@preserve-olm-env2 operator-framework-olm]# git branch
  gs
* master
  release-4.10
  release-4.11
  release-4.12
  release-4.13
  release-4.8
  release-4.9
[root@preserve-olm-env2 operator-framework-olm]# git pull origin master
remote: Enumerating objects: 1650, done.
remote: Counting objects: 100% (1650/1650), done.
remote: Compressing objects: 100% (831/831), done.
remote: Total 1650 (delta 727), reused 1617 (delta 711), pack-reused 0
Receiving objects: 100% (1650/1650), 2.03 MiB | 12.81 MiB/s, done.
Resolving deltas: 100% (727/727), completed with 468 local objects.
From github.com:openshift/operator-framework-olm
 * branch            master -> FETCH_HEAD
   639fc1203..85c579f9b  master -> origin/master
Updating 639fc1203..85c579f9b
Fast-forward
 go.mod | 120 +-
 go.sum | 240 ++--
 manifests/0000_50_olm_00-pprof-secret.yaml
 ...
 create mode 100644 vendor/google.golang.org/protobuf/types/dynamicpb/types.go
[root@preserve-olm-env2 operator-framework-olm]# rm -fr bin/opm
[root@preserve-olm-env2 operator-framework-olm]# make build/opm
make bin/opm
make[1]: Entering directory '/data/kuiwang/operator-framework-olm'
go build -ldflags "-X 'github.com/operator-framework/operator-registry/cmd/opm/version.gitCommit=85c579f9be61aaea11e90b6c870452c72107300a' -X 'github.com/operator-framework/operator-registry/cmd/opm/version.opmVersion=85c579f9b' -X 'github.com/operator-framework/operator-registry/cmd/opm/version.buildDate=2023-12-11T06:12:50Z'" -mod=vendor -tags "json1" -o bin/opm github.com/operator-framework/operator-registry/cmd/opm
make[1]: Leaving directory '/data/kuiwang/operator-framework-olm'
[root@preserve-olm-env2 operator-framework-olm]# which opm
/data/kuiwang/operator-framework-olm/bin/opm
[root@preserve-olm-env2 operator-framework-olm]# opm version
Version: version.Version{OpmVersion:"85c579f9b", GitCommit:"85c579f9be61aaea11e90b6c870452c72107300a", BuildDate:"2023-12-11T06:12:50Z", GoOs:"linux", GoArch:"amd64"}
Actual results:
Expected results:
Additional info:
In the "request-serving" deployment model for HCPs, the request-serving nodes are being memory starved by the k8s API server. This has an observed impact on limiting the number of nodes a guest HCP cluster can provision, especially during upgrade events.
This is a spike card to investigate setting the API Priority and Fairness [1] configuration, and to determine exactly what configuration would be necessary.
[1] https://kubernetes.io/docs/concepts/cluster-administration/flow-control/
Description of problem:
When the replica count for a nodepool is set to 0, the message for the nodepool is "NotFound". This message should not be displayed if the desired replica count is 0.
Version-Release number of selected component (if applicable):
How reproducible:
Create a nodepool and set the replica to 0
Steps to Reproduce:
1. Create a hosted cluster 2. Set the replica for the nodepool to 0 3.
Actual results:
NodePool message is "NotFound"
Expected results:
NodePool message to be empty
Additional info:
Description of problem:
Configure vm type as Standard_NP10s in install-config, which only supports Generation V1.
--------------
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    azure:
      type: Standard_NP10s
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      type: Standard_NP10s
  replicas: 3
Continue installation, installer failed when provisioning bootstrap node.
--------------
ERROR
ERROR Error: creating Linux Virtual Machine: (Name "jima1211test-rqfhm-bootstrap" / Resource Group "jima1211test-rqfhm-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="BadRequest" Message="The selected VM size 'Standard_NP10s' cannot boot Hypervisor Generation '2'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '2' VM Size. For more information, see https://aka.ms/azuregen2vm"
ERROR
ERROR   with azurerm_linux_virtual_machine.bootstrap,
ERROR   on main.tf line 193, in resource "azurerm_linux_virtual_machine" "bootstrap":
ERROR   193: resource "azurerm_linux_virtual_machine" "bootstrap" {
ERROR
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failure applying terraform for "bootstrap" stage: error applying Terraform configs: failed to apply Terraform: exit status 1
ERROR
ERROR Error: creating Linux Virtual Machine: (Name "jima1211test-rqfhm-bootstrap" / Resource Group "jima1211test-rqfhm-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="BadRequest" Message="The selected VM size 'Standard_NP10s' cannot boot Hypervisor Generation '2'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '2' VM Size. For more information, see https://aka.ms/azuregen2vm"
ERROR
ERROR   with azurerm_linux_virtual_machine.bootstrap,
ERROR   on main.tf line 193, in resource "azurerm_linux_virtual_machine" "bootstrap":
ERROR   193: resource "azurerm_linux_virtual_machine" "bootstrap" {
ERROR
ERROR
seems that issue is introduced by https://github.com/openshift/installer/pull/7642/
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-09-012410
How reproducible:
Always
Steps to Reproduce:
1. configure vm type to Standard_NP10s on control-plane in install-config.yaml 2. install cluster 3.
Actual results:
installer failed when provisioning bootstrap node
Expected results:
installation get successful
Additional info:
When upgrading an HC from 4.13 to 4.14, after admin-acking the API deprecation check, the upgrade is still blocked because the ClusterVersionUpgradeable condition on the HC is Unknown. This is because the CVO in the guest cluster no longer has an Upgradeable condition.
Description of problem:
When an OpenShift Container Platform cluster is installed on Bare Metal with RHACM, the "metal3-plugin" for the OpenShift Console is installed automatically. The "Nodes" view (`<console>/k8s/cluster/core~v1~Node`) uses the `BareMetalNodesTable`, which has very limited columns. In the meantime, OCP improved its Nodes table and added more features (like metrics), but we haven't done any corresponding work in metal3. Customers are missing information like metrics or Pods, which are present in the standard Node view.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.12.33
How reproducible:
Always
Steps to Reproduce:
1. Install a cluster using RHACM on Bare Metal 2. Ensure the "metal3-plugin" is enabled 3. Navigate to the "Nodes" view in the OpenShift Container Platform Console (`<console>/k8s/cluster/core~v1~Node`)
Actual results:
Only a limited set of columns (Name, Status, Role, Machine, Management Address) is visible. Columns like Memory, CPU, Pods, Filesystem, and Instance Type are missing.
Expected results:
All the columns from the standard view are visible, plus the "Management Address" column
Additional info:
* Issue was discussed here: https://redhat-internal.slack.com/archives/C027TN14SGJ/p1702552957981989 * Screenshot of non-metal3 cluster: https://redhat-internal.slack.com/archives/C027TN14SGJ/p1702552980878029?thread_ts=1702552957.981989&cid=C027TN14SGJ * Screenshot of metal3 cluster: https://redhat-internal.slack.com/archives/C027TN14SGJ/p1702552995363389?thread_ts=1702552957.981989&cid=C027TN14SGJ
Description of problem:
There are several test cases in the conformance test suite that are failing due to the openshift-multus configuration.
We run the conformance test suite as part of our OpenShift on OpenStack CI, just to confirm correct functionality of the cluster. The command we use to run the test suite is:
openshift-tests run --provider '{\"type\":\"openstack\"}' openshift/conformance/parallel
The name of the tests that failed are:
1. [sig-arch] Managed cluster should ensure platform components have system-* priority class associated [Suite:openshift/conformance/parallel]
Reason is:
6 pods found with invalid priority class (should be openshift-user-critical or begin with system-): openshift-multus/whereabouts-reconciler-6q6h7 (currently "") openshift-multus/whereabouts-reconciler-87dwn (currently "") openshift-multus/whereabouts-reconciler-fvhwv (currently "") openshift-multus/whereabouts-reconciler-h68h5 (currently "") openshift-multus/whereabouts-reconciler-nlz59 (currently "") openshift-multus/whereabouts-reconciler-xsch6 (currently "")
2. [sig-arch] Managed cluster should only include cluster daemonsets that have maxUnavailable or maxSurge update of 10 percent or maxUnavailable of 33 percent [Suite:openshift/conformance/parallel]
Reason is:
fail [github.com/openshift/origin/test/extended/operators/daemon_set.go:105]: Sep 23 16:12:15.283: Daemonsets found that do not meet platform requirements for update strategy: expected daemonset openshift-multus/whereabouts-reconciler to have maxUnavailable 10% or 33% (see comment) instead of 1, or maxSurge 10% instead of 0 Ginkgo exit error 1: exit with code 1
3. [sig-arch] Managed cluster should set requests but not limits [Suite:openshift/conformance/parallel]
Reason is:
fail [github.com/openshift/origin/test/extended/operators/resources.go:196]: Sep 23 16:12:17.489: Pods in platform namespaces are not following resource request/limit rules or do not have an exception granted: apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts defines a limit on cpu of 50m which is not allowed (rule: "apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts/limit[cpu]") apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts defines a limit on memory of 100Mi which is not allowed (rule: "apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts/limit[memory]") Ginkgo exit error 1: exit with code 1
4. [sig-node][apigroup:config.openshift.io] CPU Partitioning cluster platform workloads should be annotated correctly for DaemonSets [Suite:openshift/conformance/parallel]
Reason is:
fail [github.com/openshift/origin/test/extended/cpu_partitioning/pods.go:159]: Expected <[]error | len:1, cap:1>: [ <*errors.errorString | 0xc0010fa380>{ s: "daemonset (whereabouts-reconciler) in openshift namespace (openshift-multus) must have pod templates annotated with map[target.workload.openshift.io/management:{\"effect\": \"PreferredDuringScheduling\"}]", }, ] to be empty
How reproducible: Always
Steps to Reproduce: Run conformance testsuite:
https://github.com/openshift/origin/blob/master/test/extended/README.md
Actual results: Testcases failing
Expected results: Testcases passing
Description of problem:
https://github.com/openshift/installer/pull/7778 introduced a bug where an error is always returned while retrieving a marketplace image.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Configure marketplace image in the install-config 2. openshift-install create manifests 3.
Actual results:
$ ./openshift-install create manifests --dir ipi1 --log-level debug
DEBUG OpenShift Installer 4.16.0-0.test-2023-12-12-020559-ci-ln-xkqmlqk-latest
DEBUG Built from commit 456ae720a83e39dffd9918c5a71388ad873b6a38
DEBUG Fetching Master Machines...
DEBUG Loading Master Machines...
DEBUG Loading Cluster ID...
DEBUG Loading Install Config...
DEBUG Loading SSH Key...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Cluster Name...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Pull Secret...
DEBUG Loading Platform...
INFO Credentials loaded from file "/home/fedora/.azure/osServicePrincipal.json"
ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [controlPlane.platform.azure.osImage: Invalid value: azure.OSImage{Plan:"", Publisher:"redhat", Offer:"rh-ocp-worker", SKU:"rh-ocp-worker", Version:"413.92.2023101700"}: could not get marketplace image: %!w(<nil>), compute[0].platform.azure.osImage: Invalid value: azure.OSImage{Plan:"", Publisher:"redhat", Offer:"rh-ocp-worker", SKU:"rh-ocp-worker", Version:"413.92.2023101700"}: could not get marketplace image: %!w(<nil>)]
Expected results:
Success
Additional info:
When errors.Wrap(err, ...) was replaced by fmt.Errorf(...), a slight difference in behavior was introduced: errors.Wrap returns nil if err is nil, but fmt.Errorf always returns a non-nil error.
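A small illustration of the difference, assuming github.com/pkg/errors for Wrap; the trailing helper restores the old nil-propagating behavior when switching to fmt.Errorf:

package main

import (
	"fmt"

	"github.com/pkg/errors"
)

func main() {
	var err error // nil: the marketplace lookup succeeded

	fmt.Println(errors.Wrap(err, "could not get marketplace image"))                  // <nil>
	fmt.Println(fmt.Errorf("could not get marketplace image: %w", err) == nil)        // false: always a new error
	fmt.Println(fmt.Errorf("could not get marketplace image: %w", err))               // "could not get marketplace image: %!w(<nil>)"

	// Equivalent of errors.Wrap's nil check when moving to fmt.Errorf:
	wrap := func(err error, msg string) error {
		if err == nil {
			return nil
		}
		return fmt.Errorf("%s: %w", msg, err)
	}
	fmt.Println(wrap(err, "could not get marketplace image")) // <nil>
}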
Description of problem:
Go to a PVC's "VolumeSnapshots" tab; it shows the error "Oh no! Something went wrong."
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-03-140457
How reproducible:
Always
Steps to Reproduce:
1.Create a pvc in project. Go to the pvc's "VolumeSnapshots" tab. 2. 3.
Actual results:
1. The error "Oh no! Something went wrong." shows up on the page.
Expected results:
1. The VolumeSnapshots related to the PVC should be shown without an error.
Additional info:
screenshot: https://drive.google.com/file/d/1l0i0DCFh_q9mvFHxnftVJL0AM1LaKFOO/view?usp=sharing
Description of problem:
Based on the Azure doc [1], NCv2-series Azure virtual machines (VMs) were retired on September 6, 2023, and VMs can no longer be provisioned on those instance types. So remove standardNCSv2Family from the azure doc tested_instance_types_x86_64 on 4.13+. [1] https://learn.microsoft.com/en-us/azure/virtual-machines/ncv2-series
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Cluster installation fails on NCv2-series instance types 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/8
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Aggregator claims these tests only ran 4 times out of what looks like 10 jobs that ran to normal completion:
[sig-network-edge] Application behind service load balancer with PDB remains available using new connections
[sig-network-edge] Application behind service load balancer with PDB remains available using reused connections
However looking at one of the jobs not in the list of passes, we can see these tests ran:
Why is the aggregator missing this result somehow?
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/96
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Description of problem:
From OCPBUGS-237, we discussed removing the explicit enablement of memory-trim-on-compaction once it became enabled by default.
On OCP with 4.15.0-0.nightly-2023-12-19-033450 we are at
Red Hat Enterprise Linux CoreOS 415.92.202312132107-0 (Plow) openvswitch3.1-3.1.0-59.el9fdp.x86_64
This should have memory-trim-on-compaction enabled by default
v3.0.0 - 15 Aug 2022
--------------------
* Returning unused memory to the OS after the database compaction is now enabled by default. Use 'ovsdb-server/memory-trim-on-compaction off' unixctl command to disable.
https://github.com/openvswitch/ovs/commit/e773140ec3f6d296e4a3877d709fb26fb51bc6ee
If it is enabled by default, we should remove the enable loop:
# set trim-on-compaction
if ! retry 60 "trim-on-compaction" "ovn-appctl -t ${nbdb_ctl} --timeout=5 ovsdb-server/memory-trim-on-compaction on"; then
  exit 1
fi
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-19-033450
How reproducible:
Always
Steps to Reproduce:
1. check if memory-trim-on-compaction is enabled by default in OVS
2. check the nbdb log files for the message shown below
Actual results:
2023-12-20T18:12:47.053444489Z 2023-12-20T18:12:47.053Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 3.1.2
2023-12-20T18:12:49.001580092Z 2023-12-20T18:12:49.001Z|00003|ovsdb_server|INFO|memory trimming after compaction enabled.
Expected results:
memory-trim-on-compaction should be enabled by default, we don't need to re-enable it.
Affected Platforms:
All
After the fix for OCPBUGSM-44759, we put timeouts on payload retrieval operations (verification and download); previously they were uncapped and under certain network circumstances could take hours to terminate. Testing the fix uncovered a problem where, after CVO passes the path with the timeouts, CVO starts logging errors for the core manifest reconciliation loop:
I0208 11:22:57.107819 1 sync_worker.go:993] Running sync for role "openshift-marketplace/marketplace-operator" (648 of 834)
I0208 11:22:57.107887 1 task_graph.go:474] Canceled worker 1 while waiting for work
I0208 11:22:57.107900 1 sync_worker.go:1013] Done syncing for configmap "openshift-apiserver-operator/trusted-ca-bundle" (444 of 834)
I0208 11:22:57.107911 1 task_graph.go:474] Canceled worker 0 while waiting for work
I0208 11:22:57.107918 1 task_graph.go:523] Workers finished
I0208 11:22:57.107925 1 task_graph.go:546] Result of work: [update context deadline exceeded at 8 of 834 Could not update role "openshift-marketplace/marketplace-operator" (648 of 834)]
I0208 11:22:57.107938 1 sync_worker.go:1169] Summarizing 1 errors
I0208 11:22:57.107947 1 sync_worker.go:1173] Update error 648 of 834: UpdatePayloadFailed Could not update role "openshift-marketplace/marketplace-operator" (648 of 834) (context.deadlineExceededError: context deadline exceeded)
E0208 11:22:57.107966 1 sync_worker.go:654] unable to synchronize image (waiting 3m39.457405047s): Could not update role "openshift-marketplace/marketplace-operator" (648 of 834)
This is caused by locks. The SyncWorker.Update method acquires its lock for its whole duration. The payloadRetriever.RetrievePayload method is called inside SyncWorker.Update, on the following call chain:
SyncWorker.Update -> SyncWorker.loadUpdatedPayload -> SyncWorker.syncPayload -> payloadRetriever.RetrievePayload
RetrievePayload can take 2 or 4 minutes before it times out, so CVO holds the lock for this whole wait.
The manifest reconciliation loop is implemented in the apply method. The whole apply method is bounded by a timeout context set to 2*minimum reconcile interval so it will be set to a value between 4 and 8 minutes. While in the reconciling mode, the manifest graph is split into multiple "tasks" where smaller sequences of these tasks are applied in parallel. Individual tasks in these series are iterated over and each iteration uses a consistentReporter to report status via its Update method, which also acquires the lock on the following call sequence:
SyncWorker.apply -> (for _, task := range tasks) -> consistentReporter.Update -> statusWrapper.Report -> SyncWorker.updateApplyStatus
This leads to the following sequence:
1. apply is called with a timeout between 4 and 8 minutes
2. in parallel, SyncWorker.Update starts and acquires the lock
3. tasks under apply wait on the reporter to acquire lock
4. after 2 or 4 minutes, RetrievePayload under SyncWorker.Update times out and terminates; SyncWorker.Update terminates and releases the lock
5. tasks under apply report results after briefly acquiring the lock, start to do their thing
6. in parallel, SyncWorker.Update starts again and acquires the lock
7. further iterations over tasks under apply wait on the reporter to acquire lock
8. context passed to apply times out
9. Canceled worker 0 while waiting for work... errors
Version-Release number of selected component (if applicable):
4.13.0-0.ci.test-2023-02-06-062603 with https://github.com/openshift/cluster-version-operator/pull/896
How reproducible:
always in certain cluster configuration
Steps to Reproduce:
1. in a disconnected cluster, upgrade to an unreachable payload image with --force
2. observe the CVO log
Actual results:
CVO starts to fail reconciling manifests
Expected results:
no failures; the cluster continues to try retrieving the image but without interfering with manifest reconciliation
This problem was discovered by Evgeni Vakhonin while testing fix for OCPBUGSM-44759: https://bugzilla.redhat.com/show_bug.cgi?id=2090680#c22
https://github.com/openshift/cluster-version-operator/pull/896 uncovers this issue but still gets CVO into a better shape - previously the RetrievePayload could be running for a much longer time (hours), preventing the CVO from working at all.
When the cluster gets into this buggy state, the solution is to abort the upgrade that fails to verify or download.
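A stripped-down sketch of the locking pattern described above (illustrative only, not the actual CVO code): a long network wait performed while holding the worker mutex starves the status reporting that the apply loop depends on, so the apply context can expire before the tasks make progress.

package main

import (
	"context"
	"sync"
	"time"
)

type syncWorker struct {
	mu     sync.Mutex
	status string
}

// Update mirrors SyncWorker.Update: it holds the lock across payload retrieval,
// which can block for minutes on verification/download timeouts.
func (w *syncWorker) Update(ctx context.Context) {
	w.mu.Lock()
	defer w.mu.Unlock()
	retrievePayload(ctx) // long network wait happens under the lock
}

// report mirrors consistentReporter.Update -> updateApplyStatus: every task
// iteration in apply() has to take the same lock just to record progress.
func (w *syncWorker) report(s string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.status = s
}

func retrievePayload(ctx context.Context) {
	select {
	case <-time.After(4 * time.Minute): // simulated retrieval timeout
	case <-ctx.Done():
	}
}

func main() {
	w := &syncWorker{}
	ctx, cancel := context.WithTimeout(context.Background(), 8*time.Minute)
	defer cancel()

	go w.Update(ctx)              // payload retrieval grabs the lock for minutes
	w.report("task 1 of n done")  // the reconcile loop's reporting now waits behind it
}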
Description of problem:
Check the OperatorHub page: a long catalogsource display name will overflow the operator item tile.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-19-033450
How reproducible:
Always
Steps to Reproduce:
1. Create a catalogsource with a long display name. 2. Check operator items supplied by the created catalogsource on OperatorHub page 3.
Actual results:
2. The catalogsource display name overflows from the item tile
Expected results:
2. The catalogsource display name should be shown in the item tile dynamically, without overflowing.
Additional info:
screenshot: https://drive.google.com/file/d/1GOHJOxoBmtZX3QWDsIvc2RT5a2inkpzM/view?usp=sharing
Description of problem:
When using a custom CNI plugin in a hosted cluster, multus requires some CSRs to be approved. The component approving these CSRs is network-node-identity. This component only gets the proper RBAC rules configured when networkType is set to Calico: in the current implementation, there is a condition that applies the required RBAC only if networkType is Calico[1]. When using other CNI plugins, like Cilium, you're supposed to set networkType to Other; with the current implementation, you won't get the required RBAC in place and, as such, the required CSRs won't be approved automatically. [1] https://github.com/openshift/hypershift/blob/release-4.14/control-plane-operator/controllers/hostedcontrolplane/cno/clusternetworkoperator.go#L139
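A hedged sketch of the direction a fix could take (illustrative, not the hypershift implementation): gate the CSR-approval RBAC on "any third-party CNI" rather than on Calico specifically.

package controlplane // illustrative package name

// needsThirdPartyCNIRBAC is a hypothetical helper: the network-node-identity
// CSR-approval RBAC should be reconciled for any CNI that OpenShift does not
// manage itself, i.e. for Calico as well as for networkType "Other" (Cilium, ...).
func needsThirdPartyCNIRBAC(networkType string) bool {
	switch networkType {
	case "OVNKubernetes", "OpenShiftSDN":
		return false
	default:
		return true
	}
}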
Version-Release number of selected component (if applicable):
Latest
How reproducible:
Always
Steps to Reproduce:
1. Set hostedcluster.spec.networking.networkType to Other 2. Wait for the HC to start deploying and for the Nodes to join the cluster 3. The nodes will remain NotReady; Multus pods will complain about certificates not being ready. 4. If you list CSRs you will find pending CSRs.
Actual results:
RBAC not properly configured when networkType set to Other
Expected results:
RBAC properly configured when networkType set to Other
Additional info:
Slack discussion: https://redhat-internal.slack.com/archives/C01C8502FMM/p1704824277049609
Description of problem:
If the dhcp-daemon pods are evicted, they lose all knowledge of existing DHCP leases, and any pods using DHCP IPAM will fail to renew their lease, even after the pod is re-created.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Use a NAD with ipam: dhcp. 2. Delete the dhcp daemon pod on the same node as your workload. 3. Observe the lease expire on the DHCP server / get reissued to a different pod, causing a network outage from duplicate addresses.
Actual results:
dhcp-daemon
Expected results:
The dhcp-daemon pod does not get evicted before workloads, because of system-node-critical priority.
Additional info:
All other Multus components use system-node-critical (priority: 2000001000, priorityClassName: system-node-critical).
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
MachineAutoscaler resources with a minimum replica size of zero will display a hyphen ("-") instead of zero on the list and detail pages.
Version-Release number of selected component (if applicable):
4.14.7
How reproducible:
always
Steps to Reproduce:
1. Create a MachineAutoscaler with the "min: 0" field 2. Save the record 3. Navigate to the MachineAutoscalers page under the Compute tab
Actual results:
the min replicas indicates "-"
Expected results:
min replicas indicates "0"
Additional info:
attaching a screenshot
Description of problem:
On a hybrid cluster with Windows nodes and CoreOS nodes mixed, egressIP can no longer be applied to CoreOS nodes. QE testing profile: 53_IPI on AWS & OVN & WindowsContainer
Version-Release number of selected component (if applicable):
4.14.3
How reproducible:
Always
Steps to Reproduce:
1. Setup cluster with template aos-4_14/ipi-on-aws/versioned-installer-ovn-winc-ci 2. Label on coreOS node as egress node % oc describe node ip-10-0-59-132.us-east-2.compute.internal Name: ip-10-0-59-132.us-east-2.compute.internal Roles: worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/instance-type=m6i.xlarge beta.kubernetes.io/os=linux failure-domain.beta.kubernetes.io/region=us-east-2 failure-domain.beta.kubernetes.io/zone=us-east-2b k8s.ovn.org/egress-assignable= kubernetes.io/arch=amd64 kubernetes.io/hostname=ip-10-0-59-132.us-east-2.compute.internal kubernetes.io/os=linux node-role.kubernetes.io/worker= node.kubernetes.io/instance-type=m6i.xlarge node.openshift.io/os_id=rhcos topology.ebs.csi.aws.com/zone=us-east-2b topology.kubernetes.io/region=us-east-2 topology.kubernetes.io/zone=us-east-2b Annotations: cloud.network.openshift.io/egress-ipconfig: [{"interface":"eni-0c661bbdbb0dde54a","ifaddr":{"ipv4":"10.0.32.0/19"},"capacity":{"ipv4":14,"ipv6":15}}] csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0629862832fff4ae3"} k8s.ovn.org/host-cidrs: ["10.0.59.132/19"] k8s.ovn.org/hybrid-overlay-distributed-router-gateway-ip: 10.129.2.13 k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac: 0a:58:0a:81:02:0d k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-59-132.us-east-2.compute.internal","mac-address":"06:06:e2:7b:9c:45","ip-address... k8s.ovn.org/network-ids: {"default":"0"} k8s.ovn.org/node-chassis-id: fa1ac464-5744-40e9-96ca-6cdc74ffa9be k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.7/16"} k8s.ovn.org/node-id: 7 k8s.ovn.org/node-mgmt-port-mac-address: a6:25:4e:55:55:36 k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.59.132/19"} k8s.ovn.org/node-subnets: {"default":["10.129.2.0/23"]} k8s.ovn.org/node-transit-switch-port-ifaddr: {"ipv4":"100.88.0.7/16"} k8s.ovn.org/remote-zone-migrated: ip-10-0-59-132.us-east-2.compute.internal k8s.ovn.org/zone-name: ip-10-0-59-132.us-east-2.compute.internal machine.openshift.io/machine: openshift-machine-api/wduan-debug-1120-vtxkp-worker-us-east-2b-z6wlc machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable machineconfiguration.openshift.io/currentConfig: rendered-worker-5a29871efb344f7e3a3dc51c42c21113 machineconfiguration.openshift.io/desiredConfig: rendered-worker-5a29871efb344f7e3a3dc51c42c21113 machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-5a29871efb344f7e3a3dc51c42c21113 machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-5a29871efb344f7e3a3dc51c42c21113 machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 22806 machineconfiguration.openshift.io/reason: machineconfiguration.openshift.io/state: Done volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Mon, 20 Nov 2023 09:46:53 +0800 Taints: <none> Unschedulable: false Lease: HolderIdentity: ip-10-0-59-132.us-east-2.compute.internal AcquireTime: <unset> RenewTime: Mon, 20 Nov 2023 14:01:05 +0800 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- MemoryPressure False Mon, 20 Nov 2023 13:57:33 +0800 Mon, 20 Nov 2023 09:46:53 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available DiskPressure False Mon, 20 Nov 2023 13:57:33 +0800 Mon, 20 Nov 2023 09:46:53 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure PIDPressure False Mon, 20 Nov 2023 13:57:33 +0800 Mon, 20 Nov 2023 09:46:53 +0800 
KubeletHasSufficientPID kubelet has sufficient PID available Ready True Mon, 20 Nov 2023 13:57:33 +0800 Mon, 20 Nov 2023 09:47:34 +0800 KubeletReady kubelet is posting ready status Addresses: InternalIP: 10.0.59.132 InternalDNS: ip-10-0-59-132.us-east-2.compute.internal Hostname: ip-10-0-59-132.us-east-2.compute.internal Capacity: cpu: 4 ephemeral-storage: 125238252Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 16092956Ki pods: 250 Allocatable: cpu: 3500m ephemeral-storage: 114345831029 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 14941980Ki pods: 250 System Info: Machine ID: ec21151a2a80230ce1e1926b4f8a902c System UUID: ec21151a-2a80-230c-e1e1-926b4f8a902c Boot ID: cf4b2e39-05ad-4aea-8e53-be669b212c4f Kernel Version: 5.14.0-284.41.1.el9_2.x86_64 OS Image: Red Hat Enterprise Linux CoreOS 414.92.202311150705-0 (Plow) Operating System: linux Architecture: amd64 Container Runtime Version: cri-o://1.27.1-13.1.rhaos4.14.git956c5f7.el9 Kubelet Version: v1.27.6+b49f9d1 Kube-Proxy Version: v1.27.6+b49f9d1 ProviderID: aws:///us-east-2b/i-0629862832fff4ae3 Non-terminated Pods: (21 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age --------- ---- ------------ ---------- --------------- ------------- --- openshift-cluster-csi-drivers aws-ebs-csi-driver-node-tlw5h 30m (0%) 0 (0%) 150Mi (1%) 0 (0%) 4h14m openshift-cluster-node-tuning-operator tuned-4fvgv 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 4h14m openshift-dns dns-default-z89zl 60m (1%) 0 (0%) 110Mi (0%) 0 (0%) 11m openshift-dns node-resolver-v9stn 5m (0%) 0 (0%) 21Mi (0%) 0 (0%) 4h14m openshift-image-registry image-registry-67b88dc677-76hfn 100m (2%) 0 (0%) 256Mi (1%) 0 (0%) 4h14m openshift-image-registry node-ca-hw62n 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 4h14m openshift-ingress-canary ingress-canary-9r9f8 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 4h13m openshift-ingress router-default-5957f4f4c6-tl9gs 100m (2%) 0 (0%) 256Mi (1%) 0 (0%) 4h18m openshift-machine-config-operator machine-config-daemon-h7fx4 40m (1%) 0 (0%) 100Mi (0%) 0 (0%) 4h14m openshift-monitoring alertmanager-main-1 9m (0%) 0 (0%) 120Mi (0%) 0 (0%) 4h12m openshift-monitoring monitoring-plugin-68995cb674-w2wr9 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 4h13m openshift-monitoring node-exporter-kbq8z 9m (0%) 0 (0%) 47Mi (0%) 0 (0%) 4h13m openshift-monitoring prometheus-adapter-54fc7b9c87-sg4vt 1m (0%) 0 (0%) 40Mi (0%) 0 (0%) 4h13m openshift-monitoring prometheus-k8s-1 75m (2%) 0 (0%) 1104Mi (7%) 0 (0%) 4h12m openshift-monitoring prometheus-operator-admission-webhook-84b7fffcdc-x8hsz 5m (0%) 0 (0%) 30Mi (0%) 0 (0%) 4h18m openshift-monitoring thanos-querier-59cbd86d58-cjkxt 15m (0%) 0 (0%) 92Mi (0%) 0 (0%) 4h13m openshift-multus multus-7gjnt 10m (0%) 0 (0%) 65Mi (0%) 0 (0%) 4h14m openshift-multus multus-additional-cni-plugins-gn7x9 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 4h14m openshift-multus network-metrics-daemon-88tf6 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 4h14m openshift-network-diagnostics network-check-target-kpv5v 10m (0%) 0 (0%) 15Mi (0%) 0 (0%) 4h14m openshift-ovn-kubernetes ovnkube-node-74nl9 80m (2%) 0 (0%) 1630Mi (11%) 0 (0%) 3h51m Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 619m (17%) 0 (0%) memory 4296Mi (29%) 0 (0%) ephemeral-storage 0 (0%) 0 (0%) hugepages-1Gi 0 (0%) 0 (0%) hugepages-2Mi 0 (0%) 0 (0%) Events: <none> % oc get node -l k8s.ovn.org/egress-assignable= NAME STATUS ROLES AGE VERSION ip-10-0-59-132.us-east-2.compute.internal Ready worker 4h14m v1.27.6+b49f9d1 3. Create egressIP object
Actual results:
% oc get egressip
NAME         EGRESSIPS     ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-1   10.0.59.101
% oc get cloudprivateipconfig
No resources found
Expected results:
The egressIP should be applied to egress node
Additional info:
Please review the following PR: https://github.com/openshift/coredns/pull/107
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/308
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
We deprecated the default field manager in CNO, but it is still used by default calls like Patch(). We need to update all calls to use explicit fieldManager, and add a test to verify deprecated managers are not used.
Since 4.14 https://github.com/openshift/cluster-network-operator/commit/db57a477b10f517bc4ae501d95cc7b8398a8755c#diff-33ef32bf6c23acb95f5902d7097b7a1d5128ca061167ec0716715b0b9eeaa5f6R31 (more specifically, since the sigs.k8s.io/controller-runtime bump) we have been exposed to this bug https://github.com/kubernetes-sigs/controller-runtime/pull/2435/commits/a6b9c0b672c77a79fff4d5bc03221af1e1fe21fa, which made the default fieldManager "Go-http-client" instead of "cluster-network-operator".
This means that the "cluster-network-operator" deprecation doesn't really work, since the manager is named differently. The manager name, when unset, comes from https://github.com/kubernetes/kubernetes/blob/b85c9bbf1ac911a2a2aed2d5c1f5eaf5956cc199/staging/src/k8s.io/client-go/rest/config.go#L498 and is then cropped at https://github.com/openshift/cluster-network-operator/blob/5f18e4231f291bf5a01812974b0b4dff19c77f2c/vendor/k8s.io/apiserver/pkg/endpoints/handlers/create.go#L253-L260.
Identified changes needed (may be more):
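For example, a hedged sketch using controller-runtime client options (the object and call site are illustrative): every Patch/Apply call can pass an explicit field owner instead of relying on the default manager name derived from the user agent.

package cno // illustrative package name

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// applyWithOwner is a hypothetical helper: server-side apply with an explicit
// fieldManager, so the manager is recorded as "cluster-network-operator"
// instead of the client-go default derived from the user agent ("Go-http-client").
func applyWithOwner(ctx context.Context, c client.Client, cm *corev1.ConfigMap) error {
	return c.Patch(ctx, cm, client.Apply,
		client.FieldOwner("cluster-network-operator"),
		client.ForceOwnership)
}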
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Check CNO logs for deprecated field manager logs 2. oc logs -l name=network-operator --tail=-1 -n openshift-network-operator|grep "Depreciated field manager" 3.
Actual results:
Expected results:
Additional info:
The assisted installer agent's api_vip check doesn't accept multiple headers (src). This poses an issue when there are different ignition servers (e.g. hypershift) that expect different headers.
Latest use case: Hypershift's ignition server expects these headers: NodePool name and TargetConfigVersionHash.
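A generic illustration of the change needed on the agent side (hypothetical helper; only the header names come from the report): build the probe request with an arbitrary set of headers rather than a single one.

package agent // illustrative package name

import "net/http"

// newIgnitionProbe is a hypothetical helper: it builds the api_vip-style probe
// request with an arbitrary set of headers, so ignition servers that require
// several of them (e.g. hypershift's NodePool name plus TargetConfigVersionHash)
// can all be satisfied by the same check.
func newIgnitionProbe(url string, headers map[string]string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	for k, v := range headers {
		req.Header.Set(k, v)
	}
	return req, nil
}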
Description of problem:
When running the 4.15 installer full-function test, the following arm64 instance family was detected and verified, and needs to be appended to the installer doc [1]: - standardBpsv2Family [1] https://github.com/openshift/installer/blob/master/docs/user/azure/tested_instance_types_aarch64.md
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Hosted control plane kube scheduler pods crashloop on clusters created with Kube 1.29 rebase
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. Create a hosted cluster using 4.16 kube rebase code base 2. Wait for the cluster to come up
Actual results:
Cluster never comes up because kube scheduler pod crashloops
Expected results:
Cluster comes up
Additional info:
The kube scheduler configuration generated by the control plane operator uses the v1beta3 version of the configuration. That version is no longer included in Kubernetes v1.29, so the generated configuration needs to move to kubescheduler.config.k8s.io/v1.
Description of problem:
A user noticed on cluster deletion that the IPI-generated service instance was not cleaned up. Add more debugging statements to find out why.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create cluster 2. Delete cluster
Actual results:
Expected results:
Additional info:
Description of problem:
CCO reports credsremoved mode in metrics when the cluster is actually in the default mode. See https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/47349/rehearse-47349-pull-ci-openshift-cloud-credential-operator-release-4.16-e2e-aws-qe/1744240905512030208 (OCP-31768).
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always.
Steps to Reproduce:
1. Creates an AWS cluster with CCO in the default mode (ends up in mint) 2. Get the value of the cco_credentials_mode metric
Actual results:
credsremoved
Expected results:
mint
Root cause:
The controller-runtime client used in metrics calculator (https://github.com/openshift/cloud-credential-operator/blob/77a68ad01e75162bfa04097b22f80d305c192439/pkg/operator/metrics/metrics.go#L77) is unable to GET the root credentials Secret (https://github.com/openshift/cloud-credential-operator/blob/77a68ad01e75162bfa04097b22f80d305c192439/pkg/operator/metrics/metrics.go#L184) since it is backed by a cache which only contains target Secrets requested by other operators (https://github.com/openshift/cloud-credential-operator/blob/77a68ad01e75162bfa04097b22f80d305c192439/pkg/cmd/operator/cmd.go#L164-L168).
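A hedged sketch of one way to avoid the cache for this particular read (names are illustrative): fetch the root Secret through the manager's API reader, which goes straight to the API server instead of the informer cache.

package metrics // illustrative package name

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// getRootSecret is a hypothetical helper: mgr.GetAPIReader() returns a reader
// that bypasses the informer cache, so this GET succeeds even though the cache
// only holds the target Secrets requested by other operators.
func getRootSecret(ctx context.Context, mgr manager.Manager, namespace, name string) (*corev1.Secret, error) {
	secret := &corev1.Secret{}
	key := types.NamespacedName{Namespace: namespace, Name: name}
	if err := mgr.GetAPIReader().Get(ctx, key, secret); err != nil {
		return nil, err
	}
	return secret, nil
}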
Description of problem:
ART is moving the container images to be built by Golang 1.21. We should do the same to keep our build config in sync with ART.
Version-Release number of selected component (if applicable):
4.16/master
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
There's a typo in the openssl commands within the ovn-ipsec-containerized/ovn-ipsec-host daemonsets. The correct parameter is "-checkend", not "-checkedn".
Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.10   True        False         7s      Cluster version is 4.14.10
How reproducible:
Steps to Reproduce:
1. Enable IPsec encryption
# oc patch networks.operator.openshift.io cluster --type=merge -p '{"spec": {"defaultNetwork":{"ovnKubernetesConfig":{"ipsecConfig":{ }}}}}'
Actual results:
Examining the initContainer (ovn-keys) logs
# oc logs ovn-ipsec-containerized-7bcd2 -c ovn-keys
...
+ openssl x509 -noout -dates -checkedn 15770000 -in /etc/openvswitch/keys/ipsec-cert.pem
x509: Use -help for summary.
# oc get ds
NAME                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
ovn-ipsec-containerized   1         1         0       1            0           beta.kubernetes.io/os=linux   159m
ovn-ipsec-host            1         1         1       1            1           beta.kubernetes.io/os=linux   159m
ovnkube-node              1         1         1       1            1           beta.kubernetes.io/os=linux   3h44m
# oc get ds ovn-ipsec-containerized -o yaml | grep edn
          if ! openssl x509 -noout -dates -checkedn 15770000 -in $cert_pem; then
# oc get ds ovn-ipsec-host -o yaml | grep edn
          if ! openssl x509 -noout -dates -checkedn 15770000 -in $cert_pem; then
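For reference, -checkend N asks whether the certificate expires within the next N seconds. A hedged Go equivalent of the intended check (the certificate path comes from the logs above):

package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
	"time"
)

// expiresWithin mirrors `openssl x509 -checkend <seconds>`: true means the
// certificate will expire within the given window and should be rotated.
func expiresWithin(cert *x509.Certificate, window time.Duration) bool {
	return time.Now().Add(window).After(cert.NotAfter)
}

func main() {
	data, err := os.ReadFile("/etc/openvswitch/keys/ipsec-cert.pem")
	if err != nil {
		panic(err)
	}
	block, _ := pem.Decode(data)
	if block == nil {
		panic("no PEM data found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		panic(err)
	}
	fmt.Println(expiresWithin(cert, 15770000*time.Second))
}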
Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/59
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/631
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cloud-provider-alibaba-cloud/pull/44
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/route-override-cni/pull/51
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem: The "[sig-arch] events should not repeat pathologically for ns/openshift-dns" test is permafailing in the periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.15-e2e-openstack-ovn-serial job
{ 1 events happened too frequently event happened 114 times, something is wrong: namespace/openshift-dns hmsg/d0c68b9435 service/dns-default - reason/TopologyAwareHintsDisabled Insufficient Node information: allocatable CPU or zone not specified on one or more nodes, addressType: IPv4 From: 17:11:03Z To: 17:11:04Z result=reject }
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Port_Security has been overridden although it is set to false in the worker MachineSet configuration
Version-Release number of selected component (if applicable):
OCP=4.14.14 RHOSP=17.1
How reproducible:
NFV Perf lab ShiftonStack Deployment mode = IPI
Steps to Reproduce:
1.Network configuration resources for Worker node $ oc get machinesets.machine.openshift.io -n openshift-machine-api | grep worker 5kqfbl3y0rhocpnfv-wj2jj-worker-0 1 1 1 1 5d23h $ oc describe machinesets.machine.openshift.io -n openshift-machine-api 5kqfbl3y0rhocpnfv-wj2jj-worker-0 Name: 5kqfbl3y0rhocpnfv-wj2jj-worker-0 Namespace: openshift-machine-api Labels: machine.openshift.io/cluster-api-cluster=5kqfbl3y0rhocpnfv-wj2jj machine.openshift.io/cluster-api-machine-role=worker machine.openshift.io/cluster-api-machine-type=worker Annotations: machine.openshift.io/memoryMb: 47104 machine.openshift.io/vCPU: 26 API Version: machine.openshift.io/v1beta1 Kind: MachineSet Metadata: Creation Timestamp: 2024-03-07T05:24:07Z Generation: 3 Resource Version: 226098 UID: 8cb06872-9b62-4c2c-b66b-bf91a03efa2d Spec: Replicas: 1 Selector: Match Labels: machine.openshift.io/cluster-api-cluster: 5kqfbl3y0rhocpnfv-wj2jj machine.openshift.io/cluster-api-machineset: 5kqfbl3y0rhocpnfv-wj2jj-worker-0 Template: Metadata: Labels: machine.openshift.io/cluster-api-cluster: 5kqfbl3y0rhocpnfv-wj2jj machine.openshift.io/cluster-api-machine-role: worker machine.openshift.io/cluster-api-machine-type: worker machine.openshift.io/cluster-api-machineset: 5kqfbl3y0rhocpnfv-wj2jj-worker-0 Spec: Lifecycle Hooks: Metadata: Provider Spec: Value: API Version: machine.openshift.io/v1alpha1 Availability Zone: worker Cloud Name: openstack Clouds Secret: Name: openstack-cloud-credentials Namespace: openshift-machine-api Config Drive: true Flavor: sos-worker Image: 5kqfbl3y0rhocpnfv-wj2jj-rhcos Kind: OpenstackProviderSpec Metadata: Networks: Filter: Subnets: Filter: Id: 7fb7d2d6-325d-49e1-b3f8-b4dbb1197e34 Ports: Fixed I Ps: Subnet ID: 1a892dcf-bf93-46ef-bf37-bda6cf923471 Name Suffix: provider3p1 Network ID: 50a557b5-34c2-4c47-b539-963688f7167c Port Security: false Tags: sriov Trunk: false Vnic Type: direct Fixed I Ps: Subnet ID: 76430b9e-302f-428d-916a-77482d9cfb19 Name Suffix: provider4p1 Network ID: e2106b16-8f83-4e2e-bdbd-20e2c12ec279 Port Security: false Tags: sriov Trunk: false Vnic Type: direct Fixed I Ps: Subnet ID: 1a892dcf-bf93-46ef-bf37-bda6cf923471 Name Suffix: provider3p2 Network ID: 50a557b5-34c2-4c47-b539-963688f7167c Port Security: false Tags: sriov Trunk: false Vnic Type: direct Fixed I Ps: Subnet ID: 76430b9e-302f-428d-916a-77482d9cfb19 Name Suffix: provider4p2 Network ID: e2106b16-8f83-4e2e-bdbd-20e2c12ec279 Port Security: false Tags: sriov Trunk: false Vnic Type: direct Fixed I Ps: Subnet ID: 1a892dcf-bf93-46ef-bf37-bda6cf923471 Name Suffix: provider3p3 Network ID: 50a557b5-34c2-4c47-b539-963688f7167c Port Security: false Tags: sriov Trunk: false Vnic Type: direct Fixed I Ps: Subnet ID: 76430b9e-302f-428d-916a-77482d9cfb19 Name Suffix: provider4p3 Network ID: e2106b16-8f83-4e2e-bdbd-20e2c12ec279 Port Security: false Tags: sriov Trunk: false Vnic Type: direct Fixed I Ps: Subnet ID: 1a892dcf-bf93-46ef-bf37-bda6cf923471 Name Suffix: provider3p4 Network ID: 50a557b5-34c2-4c47-b539-963688f7167c Port Security: false Tags: sriov Trunk: false Vnic Type: direct Fixed I Ps: Subnet ID: 76430b9e-302f-428d-916a-77482d9cfb19 Name Suffix: provider4p4 Network ID: e2106b16-8f83-4e2e-bdbd-20e2c12ec279 Port Security: false Tags: sriov Trunk: false Vnic Type: direct Primary Subnet: 7fb7d2d6-325d-49e1-b3f8-b4dbb1197e34 Security Groups: Filter: Name: 5kqfbl3y0rhocpnfv-wj2jj-worker Server Group Name: 5kqfbl3y0rhocpnfv-wj2jj-worker-worker Server Metadata: Name: 5kqfbl3y0rhocpnfv-wj2jj-worker Openshift Cluster ID: 5kqfbl3y0rhocpnfv-wj2jj 
Tags: openshiftClusterID=5kqfbl3y0rhocpnfv-wj2jj Trunk: true User Data Secret: Name: worker-user-data Status: Available Replicas: 1 Fully Labeled Replicas: 1 Observed Generation: 3 Ready Replicas: 1 Replicas: 1 Events: <none> $ oc get nodes NAME STATUS ROLES AGE VERSION 5kqfbl3y0rhocpnfv-wj2jj-master-0 Ready control-plane,master 5d23h v1.27.10+28ed2d7 5kqfbl3y0rhocpnfv-wj2jj-master-1 Ready control-plane,master 5d23h v1.27.10+28ed2d7 5kqfbl3y0rhocpnfv-wj2jj-master-2 Ready control-plane,master 5d23h v1.27.10+28ed2d7 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr Ready worker 5d22h v1.27.10+28ed2d7 $ oc describe nodes 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr Name: 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr Roles: worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/instance-type=sos-worker beta.kubernetes.io/os=linux failure-domain.beta.kubernetes.io/region=regionOne failure-domain.beta.kubernetes.io/zone=worker feature.node.kubernetes.io/network-sriov.capable=true kubernetes.io/arch=amd64 kubernetes.io/hostname=5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr kubernetes.io/os=linux node-role.kubernetes.io/worker= node.kubernetes.io/instance-type=sos-worker node.openshift.io/os_id=rhcos topology.cinder.csi.openstack.org/zone=worker topology.kubernetes.io/region=regionOne topology.kubernetes.io/zone=worker Annotations: alpha.kubernetes.io/provided-node-ip: 192.168.0.91 csi.volume.kubernetes.io/nodeid: {"cinder.csi.openstack.org":"aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879"} machine.openshift.io/machine: openshift-machine-api/5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable machineconfiguration.openshift.io/currentConfig: rendered-worker-8c613531f97974a9561f8b0ada0c2cd0 machineconfiguration.openshift.io/desiredConfig: rendered-worker-8c613531f97974a9561f8b0ada0c2cd0 machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-8c613531f97974a9561f8b0ada0c2cd0 machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-8c613531f97974a9561f8b0ada0c2cd0 machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 505735 machineconfiguration.openshift.io/reason: machineconfiguration.openshift.io/state: Done sriovnetwork.openshift.io/state: Idle tuned.openshift.io/bootcmdline: skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on rcu_nocbs=10-25 tuned.non_isolcpus=000003ff systemd.cpu_affinity=0,1,2,3... 
volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Thu, 07 Mar 2024 06:09:31 +0000 Taints: <none> Unschedulable: false Lease: HolderIdentity: 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr AcquireTime: <unset> RenewTime: Wed, 13 Mar 2024 04:55:28 +0000 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- MemoryPressure False Wed, 13 Mar 2024 04:55:33 +0000 Thu, 07 Mar 2024 15:18:00 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available DiskPressure False Wed, 13 Mar 2024 04:55:33 +0000 Thu, 07 Mar 2024 15:18:00 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure PIDPressure False Wed, 13 Mar 2024 04:55:33 +0000 Thu, 07 Mar 2024 15:18:00 +0000 KubeletHasSufficientPID kubelet has sufficient PID available Ready True Wed, 13 Mar 2024 04:55:33 +0000 Thu, 07 Mar 2024 15:18:05 +0000 KubeletReady kubelet is posting ready status Addresses: InternalIP: 192.168.0.91 Hostname: 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr Capacity: cpu: 26 ephemeral-storage: 104266732Ki hugepages-1Gi: 20Gi hugepages-2Mi: 0 memory: 47264764Ki openshift.io/intl_provider3: 4 openshift.io/intl_provider4: 4 pods: 250 Allocatable: cpu: 16 ephemeral-storage: 95018478229 hugepages-1Gi: 20Gi hugepages-2Mi: 0 memory: 25166844Ki openshift.io/intl_provider3: 4 openshift.io/intl_provider4: 4 pods: 250 System Info: Machine ID: aa5cfdcbeb4646d88ac25bb6f0c0d879 System UUID: aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879 Boot ID: 77573755-0d27-4717-80fe-4579692d9c2c Kernel Version: 5.14.0-284.54.1.el9_2.x86_64 OS Image: Red Hat Enterprise Linux CoreOS 414.92.202402201520-0 (Plow) Operating System: linux Architecture: amd64 Container Runtime Version: cri-o://1.27.3-6.rhaos4.14.git7eb2281.el9 Kubelet Version: v1.27.10+28ed2d7 Kube-Proxy Version: v1.27.10+28ed2d7 ProviderID: openstack:///aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879 Non-terminated Pods: (19 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age --------- ---- ------------ ---------- --------------- ------------- --- crucible-rickshaw testpmd-host-device-e810-sriov 10 (62%) 10 (62%) 10000Mi (40%) 10000Mi (40%) 3d13h openshift-cluster-csi-drivers openstack-cinder-csi-driver-node-hnv49 30m (0%) 0 (0%) 150Mi (0%) 0 (0%) 5d22h openshift-cluster-node-tuning-operator tuned-fcjfp 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 5d22h openshift-dns dns-default-v7s59 60m (0%) 0 (0%) 110Mi (0%) 0 (0%) 5d22h openshift-dns node-resolver-gkz8b 5m (0%) 0 (0%) 21Mi (0%) 0 (0%) 5d22h openshift-image-registry node-ca-p5dn5 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 5d22h openshift-ingress-canary ingress-canary-fk59t 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 5d22h openshift-machine-config-operator machine-config-daemon-9qw8z 40m (0%) 0 (0%) 100Mi (0%) 0 (0%) 5d22h openshift-monitoring node-exporter-czcmj 9m (0%) 0 (0%) 47Mi (0%) 0 (0%) 5d22h openshift-monitoring prometheus-adapter-7696787779-vj5wk 1m (0%) 0 (0%) 40Mi (0%) 0 (0%) 5d4h openshift-multus multus-additional-cni-plugins-l7rpv 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 5d22h openshift-multus multus-nxr6k 10m (0%) 0 (0%) 65Mi (0%) 0 (0%) 5d22h openshift-multus network-metrics-daemon-tb7sq 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 5d22h openshift-network-diagnostics network-check-target-pqtp9 10m (0%) 0 (0%) 15Mi (0%) 0 (0%) 5d22h openshift-openstack-infra coredns-5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr 200m (1%) 0 (0%) 400Mi (1%) 0 (0%) 5d22h openshift-openstack-infra keepalived-5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr 200m (1%) 0 (0%) 400Mi (1%) 0 (0%) 
5d22h openshift-sdn sdn-9mdnb 110m (0%) 0 (0%) 220Mi (0%) 0 (0%) 5d22h openshift-sriov-network-operator sriov-device-plugin-tr68w 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 5d13h openshift-sriov-network-operator sriov-network-config-daemon-dtf95 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 5d22h Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 10845m (67%) 10 (62%) memory 11928Mi (48%) 10000Mi (40%) ephemeral-storage 0 (0%) 0 (0%) hugepages-1Gi 8Gi (40%) 8Gi (40%) hugepages-2Mi 0 (0%) 0 (0%) openshift.io/intl_provider3 4 4 openshift.io/intl_provider4 4 4 Events: <none> 2. OpenStack Network resource for Worker node $ openstack server list --all --fit-width +--------------------------------------+----------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------+-------------------------------+------------+ | ID | Name | Status | Networks | Image | Flavor | +--------------------------------------+----------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------+-------------------------------+------------+ | aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr | ACTIVE | management=192.168.0.91; provider-3=192.168.177.197, 192.168.177.59, 192.168.177.66, 192.168.177.83; | 5kqfbl3y0rhocpnfv-wj2jj-rhcos | sos-worker | | | | | provider-4=192.168.178.108, 192.168.178.121, 192.168.178.144, 192.168.178.18 | | | | 1a24baf3-acde-49a0-ab8e-4f4afcc9d3cc | 5kqfbl3y0rhocpnfv-wj2jj-master-2 | ACTIVE | management=192.168.0.62 | 5kqfbl3y0rhocpnfv-wj2jj-rhcos | sos-master | | 3e545ab5-6e28-4189-8d94-9272dfa1cd05 | 5kqfbl3y0rhocpnfv-wj2jj-master-1 | ACTIVE | management=192.168.0.78 | 5kqfbl3y0rhocpnfv-wj2jj-rhcos | sos-master | | 97e5c382-0fb0-4a70-b58e-0469d3869a4e | 5kqfbl3y0rhocpnfv-wj2jj-master-0 | ACTIVE | management=192.168.0.93 | 5kqfbl3y0rhocpnfv-wj2jj-rhcos | sos-master | +--------------------------------------+----------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------+-------------------------------+------------+$ openstack port list --server aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879 +--------------------------------------+----------------------------------------------------+-------------------+--------------------------------------------------------------------------------+--------+ | ID | Name | MAC Address | Fixed IP Addresses | Status | +--------------------------------------+----------------------------------------------------+-------------------+--------------------------------------------------------------------------------+--------+ | 0a562c29-4ddc-41c4-82e8-13934d3ee273 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-0 | fa:16:3e:16:9a:c3 | ip_address='192.168.0.91', subnet_id='7fb7d2d6-325d-49e1-b3f8-b4dbb1197e34' | ACTIVE | | 0c1814db-cd4f-4f6a-a0c6-4f8e569b6767 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p4 | fa:16:3e:15:88:d7 | ip_address='192.168.178.108', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19' | ACTIVE | | 1778cb62-5fbf-42be-8847-53a7b092bdf5 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p2 | fa:16:3e:2a:64:e4 | ip_address='192.168.177.197', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471' | ACTIVE | | 557f205b-2674-4f6e-91a2-643fe1702be2 | 
5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p1 | fa:16:3e:56:a3:48 | ip_address='192.168.177.83', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471' | ACTIVE | | 721b5f15-2dc9-4509-a4ba-09f364ae8771 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p3 | fa:16:3e:dd:c3:28 | ip_address='192.168.177.59', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471' | ACTIVE | | 9da4b1be-27d7-4428-a194-9eb4b02f6ac5 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p3 | fa:16:3e:fb:06:1b | ip_address='192.168.178.144', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19' | ACTIVE | | a72fcbd2-83d3-4fa9-be3d-e9fbde27d4bf | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p4 | fa:16:3e:a9:28:0e | ip_address='192.168.177.66', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471' | ACTIVE | | ba5cd10f-c6bc-4bed-b978-3b8a3560ad5c | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p1 | fa:16:3e:33:e4:c4 | ip_address='192.168.178.18', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19' | ACTIVE | | bf2ce123-76fc-4e5c-9e4f-0473febbdeac | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p2 | fa:16:3e:ce:91:10 | ip_address='192.168.178.121', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19' | ACTIVE | +--------------------------------------+----------------------------------------------------+-------------------+--------------------------------------------------------------------------------+--------+ $ openstack port show --fit-width 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p4 +-------------------------+---------------------------------------------------------------------------------------------------------------------------+ | Field | Value | +-------------------------+---------------------------------------------------------------------------------------------------------------------------+ | admin_state_up | UP | | allowed_address_pairs | | | binding_host_id | nfv-intel-11.perflab.com | | binding_profile | pci_slot='0000:b1:11.2', pci_vendor_info='8086:1889', physical_network='provider4' | | binding_vif_details | connectivity='l2', port_filter='False', vlan='178' | | binding_vif_type | hw_veb | | binding_vnic_type | direct | | created_at | 2024-03-07T06:03:43Z | | data_plane_status | None | | description | Created by cluster-api-provider-openstack cluster openshift-machine-api-5kqfbl3y0rhocpnfv-wj2jj | | device_id | aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879 | | device_owner | compute:worker | | device_profile | None | | dns_assignment | fqdn='host-192-168-178-108.openstacklocal.', hostname='host-192-168-178-108', ip_address='192.168.178.108' | | dns_domain | | | dns_name | | | extra_dhcp_opts | | | fixed_ips | ip_address='192.168.178.108', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19' | | id | 0c1814db-cd4f-4f6a-a0c6-4f8e569b6767 | | ip_allocation | None | | mac_address | fa:16:3e:15:88:d7 | | name | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p4 | | network_id | e2106b16-8f83-4e2e-bdbd-20e2c12ec279 | | numa_affinity_policy | None | | port_security_enabled | True | | project_id | 927450d0f06647a99d86214acd822679 | | propagate_uplink_status | None | | qos_network_policy_id | None | | qos_policy_id | None | | resource_request | None | | revision_number | 6 | | security_group_ids | f0df9265-c7fd-4f47-875f-d346e5cb5074 | | status | ACTIVE | | tags | cluster-api-provider-openstack, openshift-machine-api-5kqfbl3y0rhocpnfv-wj2jj, openshiftClusterID=5kqfbl3y0rhocpnfv-wj2jj | | trunk_details | None | | updated_at | 2024-03-07T06:04:10Z | 
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+$ openstack port show --fit-width 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p2 +-------------------------+---------------------------------------------------------------------------------------------------------------------------+ | Field | Value | +-------------------------+---------------------------------------------------------------------------------------------------------------------------+ | admin_state_up | UP | | allowed_address_pairs | | | binding_host_id | nfv-intel-11.perflab.com | | binding_profile | pci_slot='0000:b1:01.1', pci_vendor_info='8086:1889', physical_network='provider3' | | binding_vif_details | connectivity='l2', port_filter='False', vlan='177' | | binding_vif_type | hw_veb | | binding_vnic_type | direct | | created_at | 2024-03-07T06:03:41Z | | data_plane_status | None | | description | Created by cluster-api-provider-openstack cluster openshift-machine-api-5kqfbl3y0rhocpnfv-wj2jj | | device_id | aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879 | | device_owner | compute:worker | | device_profile | None | | dns_assignment | fqdn='host-192-168-177-197.openstacklocal.', hostname='host-192-168-177-197', ip_address='192.168.177.197' | | dns_domain | | | dns_name | | | extra_dhcp_opts | | | fixed_ips | ip_address='192.168.177.197', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471' | | id | 1778cb62-5fbf-42be-8847-53a7b092bdf5 | | ip_allocation | None | | mac_address | fa:16:3e:2a:64:e4 | | name | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p2 | | network_id | 50a557b5-34c2-4c47-b539-963688f7167c | | numa_affinity_policy | None | | port_security_enabled | True | | project_id | 927450d0f06647a99d86214acd822679 | | propagate_uplink_status | None | | qos_network_policy_id | None | | qos_policy_id | None | | resource_request | None | | revision_number | 9 | | security_group_ids | f0df9265-c7fd-4f47-875f-d346e5cb5074 | | status | ACTIVE | | tags | cluster-api-provider-openstack, openshift-machine-api-5kqfbl3y0rhocpnfv-wj2jj, openshiftClusterID=5kqfbl3y0rhocpnfv-wj2jj | | trunk_details | None | | updated_at | 2024-03-07T06:10:42Z | +-------------------------+---------------------------------------------------------------------------------------------------------------------------+ $ openstack network list +--------------------------------------+-------------+--------------------------------------+ | ID | Name | Subnets | +--------------------------------------+-------------+--------------------------------------+ | 50a557b5-34c2-4c47-b539-963688f7167c | provider-3 | 1a892dcf-bf93-46ef-bf37-bda6cf923471 | | e2106b16-8f83-4e2e-bdbd-20e2c12ec279 | provider-4 | 76430b9e-302f-428d-916a-77482d9cfb19 | | 5fdddf1c-3a71-4752-94bd-bdb5b9674500 | management | 7fb7d2d6-325d-49e1-b3f8-b4dbb1197e34 | +--------------------------------------+-------------+--------------------------------------+$ openstack network show provider-3 +---------------------------+--------------------------------------+ | Field | Value | +---------------------------+--------------------------------------+ | admin_state_up | UP | | availability_zone_hints | | | availability_zones | | | created_at | 2024-03-01T16:45:48Z | | description | | | dns_domain | | | id | 50a557b5-34c2-4c47-b539-963688f7167c | | ipv4_address_scope | None | | ipv6_address_scope | None | | is_default | None | | is_vlan_transparent | None | | mtu | 9216 | | name | provider-3 | | 
port_security_enabled | True | | project_id | ad4b9a972ac64bd9916ad7ee80288353 | | provider:network_type | vlan | | provider:physical_network | provider3 | | provider:segmentation_id | 177 | | qos_policy_id | None | | revision_number | 2 | | router:external | Internal | | segments | None | | shared | True | | status | ACTIVE | | subnets | 1a892dcf-bf93-46ef-bf37-bda6cf923471 | | tags | | | updated_at | 2024-03-01T16:45:52Z | +---------------------------+--------------------------------------+$ openstack network show provider-4 +---------------------------+--------------------------------------+ | Field | Value | +---------------------------+--------------------------------------+ | admin_state_up | UP | | availability_zone_hints | | | availability_zones | | | created_at | 2024-03-01T16:45:57Z | | description | | | dns_domain | | | id | e2106b16-8f83-4e2e-bdbd-20e2c12ec279 | | ipv4_address_scope | None | | ipv6_address_scope | None | | is_default | None | | is_vlan_transparent | None | | mtu | 9216 | | name | provider-4 | | port_security_enabled | True | | project_id | ad4b9a972ac64bd9916ad7ee80288353 | | provider:network_type | vlan | | provider:physical_network | provider4 | | provider:segmentation_id | 178 | | qos_policy_id | None | | revision_number | 2 | | router:external | Internal | | segments | None | | shared | True | | status | ACTIVE | | subnets | 76430b9e-302f-428d-916a-77482d9cfb19 | | tags | | | updated_at | 2024-03-01T16:46:01Z | +---------------------------+--------------------------------------+ 3.
Actual results:
Expected results:
Additional info:
In PowerVS, when I try and deploy a 4.16 cluster, I see the following:
Description of problem:
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc get pods -n openshift-cloud-controller-manager NAME READY STATUS RESTARTS AGE powervs-cloud-controller-manager-6b6fbcc9db-9rhtj 0/1 CrashLoopBackOff 4 (10s ago) 2m47s powervs-cloud-controller-manager-6b6fbcc9db-wnvck 0/1 CrashLoopBackOff 3 (49s ago) 2m46s [inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc logs pod/powervs-cloud-controller-manager-6b6fbcc9db-9rhtj -n openshift-cloud-controller-manager Error from server: no preferred addresses found; known addresses: [] [inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc logs pod/powervs-cloud-controller-manager-6b6fbcc9db-wnvck -n openshift-cloud-controller-manager Error from server: no preferred addresses found; known addresses: []
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-ppc64le-2024-01-07-111144
How reproducible:
Always
Steps to Reproduce:
1. Deploy OpenShift cluster
On the master-0 node, I see:
[core@rdr-hamzy-test-wdc06-fs5m2-master-0 ~]$ sudo crictl ps -a CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD a048556553827 ec3035a371e09312254a277d5eb9affba2930adbd4018f7557899a2f3d76bc88 18 seconds ago Exited kube-rbac-proxy 7 0381a589d57cd cluster-cloud-controller-manager-operator-94dd5b468-kxqw5 a326f7ec83ddb 60f5c9455518c79a9797cfbeab0b3530dae1bf77554eccc382ff12d99053efd1 11 minutes ago Running config-sync-controllers 0 0381a589d57cd cluster-cloud-controller-manager-operator-94dd5b468-kxqw5 ddaa6999b5b86 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:60eff87ed56ee4761fd55caa4712e6bea47dccaa11c59ba53a6d5697eacc7d32 11 minutes ago Running cluster-cloud-controller-manager 0 0381a589d57cd cluster-cloud-controller-manager-operator-94dd5b468-kxqw5
The failing pod has this as its log:
[core@rdr-hamzy-test-wdc06-fs5m2-master-0 ~]$ sudo crictl logs a048556553827 Flag --logtostderr has been deprecated, will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components I0108 18:09:12.320332 1 flags.go:64] FLAG: --add-dir-header="false" I0108 18:09:12.320401 1 flags.go:64] FLAG: --allow-paths="[]" I0108 18:09:12.320413 1 flags.go:64] FLAG: --alsologtostderr="false" I0108 18:09:12.320420 1 flags.go:64] FLAG: --auth-header-fields-enabled="false" I0108 18:09:12.320427 1 flags.go:64] FLAG: --auth-header-groups-field-name="x-remote-groups" I0108 18:09:12.320435 1 flags.go:64] FLAG: --auth-header-groups-field-separator="|" I0108 18:09:12.320441 1 flags.go:64] FLAG: --auth-header-user-field-name="x-remote-user" I0108 18:09:12.320447 1 flags.go:64] FLAG: --auth-token-audiences="[]" I0108 18:09:12.320454 1 flags.go:64] FLAG: --client-ca-file="" I0108 18:09:12.320460 1 flags.go:64] FLAG: --config-file="/etc/kube-rbac-proxy/config-file.yaml" I0108 18:09:12.320467 1 flags.go:64] FLAG: --help="false" I0108 18:09:12.320473 1 flags.go:64] FLAG: --http2-disable="false" I0108 18:09:12.320479 1 flags.go:64] FLAG: --http2-max-concurrent-streams="100" I0108 18:09:12.320486 1 flags.go:64] FLAG: --http2-max-size="262144" I0108 18:09:12.320492 1 flags.go:64] FLAG: --ignore-paths="[]" I0108 18:09:12.320500 1 flags.go:64] FLAG: --insecure-listen-address="" I0108 18:09:12.320506 1 flags.go:64] FLAG: --kubeconfig="" I0108 18:09:12.320512 1 flags.go:64] FLAG: --log-backtrace-at=":0" I0108 18:09:12.320520 1 flags.go:64] FLAG: --log-dir="" I0108 18:09:12.320526 1 flags.go:64] FLAG: --log-file="" I0108 18:09:12.320531 1 flags.go:64] FLAG: --log-file-max-size="1800" I0108 18:09:12.320537 1 flags.go:64] FLAG: --log-flush-frequency="5s" I0108 18:09:12.320543 1 flags.go:64] FLAG: --logtostderr="true" I0108 18:09:12.320550 1 flags.go:64] FLAG: --oidc-ca-file="" I0108 18:09:12.320556 1 flags.go:64] FLAG: --oidc-clientID="" I0108 18:09:12.320564 1 flags.go:64] FLAG: --oidc-groups-claim="groups" I0108 18:09:12.320570 1 flags.go:64] FLAG: --oidc-groups-prefix="" I0108 18:09:12.320576 1 flags.go:64] FLAG: --oidc-issuer="" I0108 18:09:12.320581 1 flags.go:64] FLAG: --oidc-sign-alg="[RS256]" I0108 18:09:12.320590 1 flags.go:64] FLAG: --oidc-username-claim="email" I0108 18:09:12.320595 1 flags.go:64] FLAG: --one-output="false" I0108 18:09:12.320601 1 flags.go:64] FLAG: --proxy-endpoints-port="0" I0108 18:09:12.320608 1 flags.go:64] FLAG: --secure-listen-address="0.0.0.0:9258" I0108 18:09:12.320614 1 flags.go:64] FLAG: --skip-headers="false" I0108 18:09:12.320620 1 flags.go:64] FLAG: --skip-log-headers="false" I0108 18:09:12.320626 1 flags.go:64] FLAG: --stderrthreshold="2" I0108 18:09:12.320631 1 flags.go:64] FLAG: --tls-cert-file="/etc/tls/private/tls.crt" I0108 18:09:12.320637 1 flags.go:64] FLAG: --tls-cipher-suites="[TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305]" I0108 18:09:12.320654 1 flags.go:64] FLAG: --tls-min-version="VersionTLS12" I0108 18:09:12.320661 1 flags.go:64] FLAG: --tls-private-key-file="/etc/tls/private/tls.key" I0108 18:09:12.320667 1 flags.go:64] FLAG: --tls-reload-interval="1m0s" I0108 18:09:12.320674 1 flags.go:64] FLAG: --upstream="http://127.0.0.1:9257/" I0108 18:09:12.320681 1 flags.go:64] FLAG: 
--upstream-ca-file="" I0108 18:09:12.320686 1 flags.go:64] FLAG: --upstream-client-cert-file="" I0108 18:09:12.320692 1 flags.go:64] FLAG: --upstream-client-key-file="" I0108 18:09:12.320697 1 flags.go:64] FLAG: --upstream-force-h2c="false" I0108 18:09:12.320703 1 flags.go:64] FLAG: --v="3" I0108 18:09:12.320709 1 flags.go:64] FLAG: --version="false" I0108 18:09:12.320719 1 flags.go:64] FLAG: --vmodule="" I0108 18:09:12.320735 1 kube-rbac-proxy.go:578] Reading config file: /etc/kube-rbac-proxy/config-file.yaml I0108 18:09:12.321427 1 kube-rbac-proxy.go:285] Valid token audiences: I0108 18:09:12.321473 1 kube-rbac-proxy.go:399] Reading certificate files E0108 18:09:12.321519 1 run.go:74] "command failed" err="failed to initialize certificate reloader: error loading certificates: error loading certificate: open /etc/tls/private/tls.crt: no such file or directory"
When I describe the pod, I see:
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc describe pod/powervs-cloud-controller-manager-6b6fbcc9db-9rhtj -n openshift-cloud-controller-manager Name: powervs-cloud-controller-manager-6b6fbcc9db-9rhtj Namespace: openshift-cloud-controller-manager Priority: 2000000000 Priority Class Name: system-cluster-critical Service Account: cloud-controller-manager Node: rdr-hamzy-test-wdc06-fs5m2-master-2/ Start Time: Mon, 08 Jan 2024 11:57:45 -0600 Labels: infrastructure.openshift.io/cloud-controller-manager=PowerVS k8s-app=powervs-cloud-controller-manager pod-template-hash=6b6fbcc9db Annotations: operator.openshift.io/config-hash: 09205e81b4dc20086c29ddbdd3fccc29a675be94b2779756a0e748dd9ba91e40 Status: Running IP: IPs: <none> Controlled By: ReplicaSet/powervs-cloud-controller-manager-6b6fbcc9db Containers: cloud-controller-manager: Container ID: cri-o://4365a326d05ecaac8e4114efabb4a46e01a308459ad30438d742b4829c24a717 Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dd2cf78ddeed971d38731d27ce293501547b960cefc3aadaa220186eded8a09 Image ID: 65401afa73528f9a425a9d7f5dee8a9de8d9d3d82c8fd84cd653b16409093836 Port: 10258/TCP Host Port: 10258/TCP Command: /bin/bash -c #!/bin/bash set -o allexport if [[ -f /etc/kubernetes/apiserver-url.env ]]; then source /etc/kubernetes/apiserver-url.env fi exec /bin/ibm-cloud-controller-manager \ --bind-address=$(POD_IP_ADDRESS) \ --use-service-account-credentials=true \ --configure-cloud-routes=false \ --cloud-provider=ibm \ --cloud-config=/etc/ibm/cloud.conf \ --profiling=false \ --leader-elect=true \ --leader-elect-lease-duration=137s \ --leader-elect-renew-deadline=107s \ --leader-elect-retry-period=26s \ --leader-elect-resource-namespace=openshift-cloud-controller-manager \ --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_AES_128_GCM_SHA256,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_256_GCM_SHA384 \ --v=2 State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Exit Code: 1 Started: Mon, 08 Jan 2024 12:35:12 -0600 Finished: Mon, 08 Jan 2024 12:35:12 -0600 Ready: False Restart Count: 12 Requests: cpu: 75m memory: 60Mi Liveness: http-get https://:10258/healthz delay=300s timeout=160s period=10s #success=1 #failure=3 Environment: POD_IP_ADDRESS: (v1:status.podIP) VPCCTL_CLOUD_CONFIG: /etc/ibm/cloud.conf ENABLE_VPC_PUBLIC_ENDPOINT: true Mounts: /etc/ibm from cloud-conf (rw) /etc/kubernetes from host-etc-kube (ro) /etc/pki/ca-trust/extracted/pem from trusted-ca (ro) /etc/vpc from ibm-cloud-credentials (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z5xdm (ro) Conditions: Type Status PodReadyToStartContainers True Initialized True Ready False ContainersReady False PodScheduled True Volumes: trusted-ca: Type: ConfigMap (a volume populated by a ConfigMap) Name: ccm-trusted-ca Optional: false host-etc-kube: Type: HostPath (bare host directory volume) Path: /etc/kubernetes HostPathType: Directory cloud-conf: Type: ConfigMap (a volume populated by a ConfigMap) Name: cloud-conf Optional: false ibm-cloud-credentials: Type: Secret (a volume populated by a Secret) SecretName: ibm-cloud-credentials Optional: false kube-api-access-z5xdm: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: 
<nil> DownwardAPI: true ConfigMapName: openshift-service-ca.crt ConfigMapOptional: <nil> QoS Class: Burstable Node-Selectors: node-role.kubernetes.io/master= Tolerations: node-role.kubernetes.io/master:NoSchedule op=Exists node.cloudprovider.kubernetes.io/uninitialized:NoSchedule op=Exists node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists for 120s node.kubernetes.io/not-ready:NoSchedule op=Exists node.kubernetes.io/unreachable:NoExecute op=Exists for 120s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 38m default-scheduler Successfully assigned openshift-cloud-controller-manager/powervs-cloud-controller-manager-6b6fbcc9db-9rhtj to rdr-hamzy-test-wdc06-fs5m2-master-2 Normal Pulling 38m kubelet Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dd2cf78ddeed971d38731d27ce293501547b960cefc3aadaa220186eded8a09" Normal Pulled 37m kubelet Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dd2cf78ddeed971d38731d27ce293501547b960cefc3aadaa220186eded8a09" in 36.694s (36.694s including waiting) Normal Started 36m (x4 over 37m) kubelet Started container cloud-controller-manager Normal Created 35m (x5 over 37m) kubelet Created container cloud-controller-manager Normal Pulled 35m (x4 over 37m) kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dd2cf78ddeed971d38731d27ce293501547b960cefc3aadaa220186eded8a09" already present on machine Warning BackOff 2m57s (x166 over 37m) kubelet Back-off restarting failed container cloud-controller-manager in pod powervs-cloud-controller-manager-6b6fbcc9db-9rhtj_openshift-cloud-controller-manager(bf58b824-b1a2-4d2e-8735-22723642a24a)
The AWS EBS, Azure Disk and Azure File operators are now built from cmd/ and pkg/; no code from the legacy/ directory is used, so we should remove it.
There are still test manifests in the legacy/ directory that are in use! They need to be moved somewhere else, and Dockerfile.*.test and the CI steps must be updated!
Technically, this is a copy of STOR-1797, but we need a bug to be able to backport aws-ebs changes to 4.15 and not use legacy/ directory there too.
Description of problem:
Documentation for using Red Hat subscriptions in builds is missing a few important steps, especially for customers that have not turned on the Tech Preview feature for the Shared Resource CSI driver. These are the following:
1. The customer needs Simple Content Access import enabled in the Insights Operator: https://docs.openshift.com/container-platform/4.12/support/remote_health_monitoring/insights-operator-simple-access.html
2. The customer needs to copy the secret data from openshift-config-managed/etc-pki-entitlement to the workspace the build is running in. We should provide oc commands that a cluster admin/platform team can execute (a sketch follows below).
For builds that run in a network-restricted environment and access RHEL content through Satellite, the documentation must also provide instructions on how to obtain an `rhsm.conf` file for the Satellite instance and mount it into the build container.
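A minimal sketch of the oc commands referred to in item 2 above, assuming a hypothetical target namespace my-builds and that jq is available on the admin's workstation; this is illustrative, not the wording proposed for the docs:
# copy the entitlement secret into the namespace where the build runs
$ oc get secret etc-pki-entitlement -n openshift-config-managed -o json \
    | jq 'del(.metadata.namespace, .metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.managedFields)' \
    | oc apply -n my-builds -f -
# the build then references the copied secret, e.g. as a secret mount in the BuildConfig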
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
Read the documentation for https://docs.openshift.com/container-platform/4.12/cicd/builds/running-entitled-builds.html#builds-create-imagestreamtag_running-entitled-builds and execute the commands as is.
Actual results:
Build probably won't run because the required secret is not created.
Expected results:
Customers should be able to run a build that requires RHEL entitlements following the exact steps as described in the doc.
Additional info:
https://docs.openshift.com/container-platform/4.12/cicd/builds/running-entitled-builds.html
Description of problem:
Navigation: Workloads -> Deployments -> (select any Deployment from list) -> Details -> Volumes -> Remove volume
Issue: The message "Are you sure you want to remove volume audit-policies from Deployment: apiserver?" is in English.
Observation: The translation is present in branch release-4.15 in the file frontend/public/locales/ja/public.json
Version-Release number of selected component (if applicable):
4.15.0-rc.3
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Content is in English
Expected results:
Content should be in selected language
Additional info:
Reference screenshot attached.
Description of problem:
The installer now errors when attempting to use networkType: OpenShiftSDN, but the message still says "deprecated".
Version-Release number of selected component (if applicable):
4.15+
How reproducible:
100%
Steps to Reproduce:
1. Attempt to install 4.15+ with networkType: OpenShiftSDN Observe error in logs: time="2024-03-01T14:37:25Z" level=error msg="failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: networking.networkType: Invalid value: \"OpenShiftSDN\": networkType OpenShiftSDN is deprecated, please use OVNKubernetes"
Actual results:
Observe error in logs: time="2024-03-01T14:37:25Z" level=error msg="failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: networking.networkType: Invalid value: \"OpenShiftSDN\": networkType OpenShiftSDN is deprecated, please use OVNKubernetes"
Expected results:
A message more like: Observe error in logs: time="2024-03-01T14:37:25Z" level=error msg="failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: networking.networkType: Invalid value: \"OpenShiftSDN\": networkType OpenShiftSDN is not supported, please use OVNKubernetes"
Additional info:
See thread
Please review the following PR: https://github.com/openshift/ovirt-csi-driver/pull/132
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a duplicate of https://issues.redhat.com/browse/ART-8361; since we are not able to set `target` on ART bugs, the issue is created here.
Description of problem:
Based on the discussion in https://issues.redhat.com/browse/OCPBUGS-24044
and the discussion in this Slack thread (https://redhat-internal.slack.com/archives/CBWMXQJKD/p1700510945375019), we need to update our CI and some of the work done for mutable scope in NE-621.
Specifically, we need to
Version-Release number of selected component (if applicable):
4.15
How reproducible:
100%
Steps to Reproduce:
1. Run CI TestUnmanagedDNSToManagedDNSInternalIngressController 2. Observe failure in unmanaged-migrated-internal
Actual results:
CI tests fail.
Expected results:
CI tests shouldn't fail.
Additional info:
This is a change from past behavior, as reported in https://issues.redhat.com/browse/OCPBUGS-24044. Further discussion revealed that the new behavior is currently expected, but the old behavior could be restored in the future. Notes to SRE and release notes are needed for this change in behavior.
Description of problem:
Customer is trying to create an RHOCP cluster with the domain name 123mydomain.com. In the Red Hat Hybrid Cloud Console the customer is getting the error below:
~~~
Failed to update the cluster
DNS format mismatch: 123mydomain.com domain name is not valid
~~~
As per the regex check described in KCS https://access.redhat.com/solutions/5517531, a domain name starting with a numeric character, e.g. 123mydomain.com, is valid. The regex used there to check domain-name validity is:
[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*
From the validations the assisted installer does, as per https://github.com/openshift/assisted-service/blob/master/pkg/validations/validations.go, the below regexps are applied:
baseDomainRegex = `^[a-z]([\-]*[a-z\d]+)+$`
dnsNameRegex = `^([a-z]([\-]*[a-z\d]+)*\.)+[a-z\d]+([\-]*[a-z\d]+)+$`
wildCardDomainRegex = `^(validateNoWildcardDNS\.).+\.?$`
hostnameRegex = `^[a-z0-9][a-z0-9\-\.]{0,61}[a-z0-9]$`
installerArgsValuesRegex = `^[A-Za-z0-9@!#$%*()_+-=//.,";':{}\[\]]+$`
This means the domain name must start with a letter [a-z].
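A quick way to sanity-check the dnsNameRegex above outside of Go (a sketch: \d rewritten as 0-9 for POSIX ERE; this is not the assisted-service code itself):
$ regex='^([a-z]([-]*[a-z0-9]+)*\.)+[a-z0-9]+([-]*[a-z0-9]+)+$'
$ echo "mydomain123.com" | grep -Eq "$regex" && echo valid || echo invalid
valid
$ echo "123mydomain.com" | grep -Eq "$regex" && echo valid || echo invalid
invalid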
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Open RedHat Hybrid Cloud Console 2. Go to Clusters 3. Create Cluster 4. Go to Datacenter 5. Under Assisted Installer -> Create Cluster 6. Enter Cluster Name mytestcluster and enter Domain Name 123mydomain.com 7. Click on Next
Actual results:
A domain name with a numeric character first and then letters, e.g. 123mydomain.com, is shown as invalid in the Red Hat Hybrid Cloud Console Assisted Installer, throwing the error: Failed to create new cluster DNS format mismatch: 123mydomain.com domain name is not valid
Expected results:
A domain name with a numeric character first and then letters, e.g. 123mydomain.com, must be accepted as valid in the OpenShift Red Hat Hybrid Cloud Console Assisted Installer
When clicking on the output image link on a Shipwright BuildRun details page, the link leads to the imagestream details page but shows 404 error.
The image link is:
https://console-openshift-console.apps...openshiftapps.com/k8s/ns/buildah-example/imagestreams/sample-kotlin-spring%3A1.0-shipwright
The BuildRun spec
apiVersion: shipwright.io/v1beta1
kind: BuildRun
metadata:
  generateName: sample-spring-kotlin-build-
  name: sample-spring-kotlin-build-xh2dq
  namespace: buildah-example
  labels:
    build.shipwright.io/generation: '2'
    build.shipwright.io/name: sample-spring-kotlin-build
spec:
  build:
    name: sample-spring-kotlin-build
status:
  buildSpec:
    output:
      image: 'image-registry.openshift-image-registry.svc:5000/buildah-example/sample-kotlin-spring:1.0-shipwright'
    paramValues:
      - name: run-image
        value: 'paketocommunity/run-ubi-base:latest'
      - name: cnb-builder-image
        value: 'paketobuildpacks/builder-jammy-tiny:0.0.176'
      - name: app-image
        value: 'image-registry.openshift-image-registry.svc:5000/buildah-example/sample-kotlin-spring:1.0-shipwright'
    source:
      git:
        url: 'https://github.com/piomin/sample-spring-kotlin-microservice.git'
      type: Git
    strategy:
      kind: ClusterBuildStrategy
      name: buildpacks
  completionTime: '2024-02-12T12:15:03Z'
  conditions:
    - lastTransitionTime: '2024-02-12T12:15:03Z'
      message: All Steps have completed executing
      reason: Succeeded
      status: 'True'
      type: Succeeded
  output:
    digest: 'sha256:dc3d44bd4d43445099ab92bbfafc43d37e19cfaf1cac48ae91dca2f4ec37534e'
  source:
    git:
      branchName: master
      commitAuthor: Piotr Mińkowski
      commitSha: aeb03d60a104161d6fd080267bf25c89c7067f61
  startTime: '2024-02-12T12:13:21Z'
  taskRunName: sample-spring-kotlin-build-xh2dq-j47ql
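A diagnostic sketch (not from the original report) to check whether the tag the console link points at actually exists as an ImageStreamTag in the namespace:
$ oc get imagestreamtag sample-kotlin-spring:1.0-shipwright -n buildah-example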
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/58
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/apiserver-network-proxy/pull/47
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The Hypershift CLI requires a security group ID when creating a nodepool; otherwise it fails to create the nodepool.
jiezhao-mac:hypershift jiezhao$ ./bin/hypershift create nodepool aws --name=test --cluster-name=jie-test --node-count=3
2024-02-20T11:29:19-05:00 ERROR Failed to create nodepool {"error": "security group ID was not specified and cannot be determined from default nodepool"}
github.com/openshift/hypershift/cmd/nodepool/core.(*CreateNodePoolOptions).CreateRunFunc.func1
  /Users/jiezhao/hypershift-test/hypershift/cmd/nodepool/core/create.go:39
github.com/spf13/cobra.(*Command).execute
  /Users/jiezhao/hypershift-test/hypershift/vendor/github.com/spf13/cobra/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
  /Users/jiezhao/hypershift-test/hypershift/vendor/github.com/spf13/cobra/command.go:1115
github.com/spf13/cobra.(*Command).Execute
  /Users/jiezhao/hypershift-test/hypershift/vendor/github.com/spf13/cobra/command.go:1039
github.com/spf13/cobra.(*Command).ExecuteContext
  /Users/jiezhao/hypershift-test/hypershift/vendor/github.com/spf13/cobra/command.go:1032
main.main
  /Users/jiezhao/hypershift-test/hypershift/main.go:78
runtime.main
  /usr/local/Cellar/go/1.20.4/libexec/src/runtime/proc.go:250
Error: security group ID was not specified and cannot be determined from default nodepool
security group ID was not specified and cannot be determined from default nodepool
jiezhao-mac:hypershift jiezhao$
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
nodepool creation should succeed without security group specified in hypershift cli
Additional info:
Description of the problem:
In a deployment with bonding and a VLAN, during the booting of the provisioning image the system loses connectivity (even ping) as soon as two new network interfaces appear on the node. They are created by the ironic-python-agent and are the slave interfaces with the VLAN added.
Version-Release number of selected component (if applicable):
4.12.48
How reproducible:
Always
Steps to Reproduce:
1. Deploy a cluster with bonding + vlan
Actual results:
After investigation from the OpenStack team, it looks like having the option "enable_vlan_interfaces = all" enabled in "/etc/ironic-python-agent.conf" is what triggers the creation of the VLAN interfaces. These new interfaces are what cut the communication.
Expected results:
No extra vlan interfaces created, communication is not lost and installation succeeds.
Additional info:
How the customer crafted the test: as soon as the node starts pinging, they connect with SSH and set a password for the core user. Once communication is lost (~1 min after pinging starts), they connect through the KVM interface with the core password. If the ironic-python-agent is disabled and the created VLAN interfaces are removed manually, the communication is restored. Installation works if LLDP is turned off at the switch. This issue was supposed to be fixed in these versions, according to the original JIRA which I have linked here. The team lead from that JIRA suggested the issue has to be fixed by re-vendoring ICC in the assisted-service, hence this JIRA creation.
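To confirm on the node whether the option that triggers the extra VLAN interfaces is set (a diagnostic sketch, run on the booted provisioning image; output per the investigation above):
$ grep enable_vlan_interfaces /etc/ironic-python-agent.conf
enable_vlan_interfaces = all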
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/222
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Component Readiness has found a potential regression in [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers.
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.15
Start Time: 2024-02-08T00:00:00Z
End Time: 2024-02-14T23:59:59Z
Success Rate: 91.30%
Successes: 63
Failures: 6
Flakes: 0
Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 100.00%
Successes: 735
Failures: 0
Flakes: 0
Note: When you look at the link above you will notice some of the failures mention the bare metal operator. That's being investigated as part of https://issues.redhat.com/browse/OCPBUGS-27760. There have been 3 cases in the last week where the console was in a fail loop. Here's an example:
We need help understanding why this is happening and what needs to be done to avoid it.
Description of problem:
No detail failure on signature verification while failing to validate signature of the target release payload during upgrade. It's unclear for user to know which action could be taken for the failure. For example, checking if any wrong configmap set, or default store is not available or any issue on custom store? # ./oc adm upgrade Cluster version is 4.15.0-0.nightly-2023-12-08-202155 Upgradeable=False Reason: FeatureGates_RestrictedFeatureGates_TechPreviewNoUpgrade Message: Cluster operator config-operator should not be upgraded between minor versions: FeatureGatesUpgradeable: "TechPreviewNoUpgrade" does not allow updates ReleaseAccepted=False Reason: RetrievePayload Message: Retrieving payload failed version="4.15.0-0.nightly-2023-12-09-012410" image="registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7" failure=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat Upstream: https://amd64.ocp.releases.ci.openshift.org/graph Channel: stable-4.15 Recommended updates: VERSION IMAGE 4.15.0-0.nightly-2023-12-09-012410 registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 # ./oc -n openshift-cluster-version logs cluster-version-operator-6b7b5ff598-vxjrq|grep "verified"|tail -n4 I1211 09:28:22.755834 1 sync_worker.go:434] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat I1211 09:28:22.755974 1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' Retrieving payload failed version="4.15.0-0.nightly-2023-12-09-012410" image="registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7" failure=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat I1211 09:28:37.817102 1 sync_worker.go:434] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat I1211 09:28:37.817488 1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' Retrieving payload failed version="4.15.0-0.nightly-2023-12-09-012410" image="registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7" failure=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-08-202155
How reproducible:
always
Steps to Reproduce:
1. trigger an fresh installation with tp enabled(no spec.signaturestores property set by default) 2.trigger an upgrade against a nightly build(no signature available in default signature store) 3.
Actual results:
no detail log on signature verification failure
Expected results:
include detail failure on signature verification in the cvo log
Additional info:
https://github.com/openshift/cluster-version-operator/pull/1003
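While debugging, a quick way to pull just the ReleaseAccepted condition out of the ClusterVersion (a sketch using standard oc jsonpath; not part of the proposed fix):
$ oc get clusterversion version \
    -o jsonpath='{.status.conditions[?(@.type=="ReleaseAccepted")].message}{"\n"}'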
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/101
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Missing Source column header in PVC > VolumeSnapshots tab
Version-Release number of selected component (if applicable):
Cluster 4.10, 4.14, 4.16
How reproducible:
Yes
Steps to Reproduce:
1. Create a PVC, e.g. "my-pvc" 2. Create a Pod and bind it to "my-pvc" 3. Create a VolumeSnapshot and associate it with "my-pvc" (a sketch of this step follows under Additional info) 4. Go to the PVC detail > VolumeSnapshots tab
Actual results:
The Source column header is not displayed
Expected results:
the Source column header should be displayed
Additional info:
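For step 3 in the reproduction, a minimal sketch of a VolumeSnapshot bound to the example PVC; the snapshot name, namespace and VolumeSnapshotClass are placeholders, not from the report:
$ cat <<'EOF' | oc apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-pvc-snapshot
  namespace: default
spec:
  volumeSnapshotClassName: csi-snapshot-class   # placeholder; use a class available in the cluster
  source:
    persistentVolumeClaimName: my-pvc
EOF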
Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/90
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The upstream project reorganized the config directory and we need to adapt it for downstream. Until then, upstream->downstream syncing is blocked.
As a ROSA customer, I want to enforce that my workloads follow AWS best-practices by using AWS Regionalized STS Endpoints instead of the global one.
As Red Hat, I would like to follow AWS best-practices by using AWS Regionalized STS Endpoints instead of the global one.
Per AWS docs:
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_enable-regions.html
AWS recommends using Regional AWS STS endpoints instead of the global endpoint to reduce latency, build in redundancy, and increase session token validity.
https://docs.aws.amazon.com/sdkref/latest/guide/feature-sts-regionalized-endpoints.html
All new SDK major versions releasing after July 2022 will default to regional. New SDK major versions might remove this setting and use regional behavior. To reduce future impact regarding this change, we recommend you start using regional in your application when possible.
Areas where HyperShift creates STS credentials should use regionalized STS endpoints, e.g. https://github.com/openshift/hypershift/blob/ae1caa00ff3a2c2bfc1129f0168efc1e786d1d12/control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go#L1225-L1228
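Per the AWS documentation above, clients opt in to the regional endpoint through the shared config file or an environment variable; a sketch with a placeholder region (these are AWS SDK/CLI settings, not HyperShift code):
# ~/.aws/config
# [default]
# region = us-east-2
# sts_regional_endpoints = regional

# or, for a process using the AWS SDK/CLI:
$ export AWS_REGION=us-east-2
$ export AWS_STS_REGIONAL_ENDPOINTS=regional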
Description of problem:
During a pod deletion, the whereabouts reconciler correctly detects the pod deletion but errors out claiming that the IPPool is not found. However, when checking the audit logs, we can see no deletion, no re-creation, and we can even see successful "patch" and "get" requests to the same IPPool. This means that the IPPool was never deleted and was properly accessible at the time of the issue, so the error in the reconciler looks like it made some mistake while retrieving the IPPool.
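A quick way to confirm the IPPool object is present and readable while the reconciler reports it missing (a diagnostic sketch; on OpenShift the pools normally live in openshift-multus, but the namespace may differ, and <pool-name> is a placeholder):
$ oc get ippools.whereabouts.cni.cncf.io -n openshift-multus
$ oc get ippools.whereabouts.cni.cncf.io -n openshift-multus <pool-name> -o yaml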
Version-Release number of selected component (if applicable):
4.12.22
How reproducible:
Sometimes
Steps to Reproduce:
1.Delete pod 2. 3.
Actual results:
Error in the whereabouts reconciler. New pods using additional networks with the Whereabouts IPAM plugin cannot have IPs allocated due to the incorrect cleanup.
Expected results:
Additional info:
Description of problem:
Logs for PipelineRuns fetched from the Tekton Results API is not loading
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to the Log tab of PipelineRun fetched from the Tekton Results 2. 3.
Actual results:
Logs window is empty with a loading indicator
Expected results:
Logs should be shown
Additional info:
Description of problem:
manifests are duplicated with cluster-config-api image
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
capi-based installer failing with missing openshift-cluster-api namespace
Version-Release number of selected component (if applicable):
How reproducible:
Always in CustomNoUpgrade
Steps to Reproduce:
1. 2. 3.
Actual results:
Install failure
Expected results:
The namespace is created and the install succeeds, or the installer does not error on the missing namespace.
Additional info:
Description of problem:
No IPsec on the cluster after deletion of the NS MachineConfigs while ipsecConfig mode is `Full`, on a cluster upgraded from a 4.14 to a 4.15 build.
Version-Release number of selected component (if applicable):
bot build on https://github.com/openshift/cluster-network-operator/pull/2191
How reproducible:
Always
Steps to Reproduce:
Steps: 1. Cluster with EW+NS IPsec (4.14), upgraded to the above bot build to check ipsecConfig modes 2. ipsecConfig mode changed to Full 3. Deleted the NS MCs 4. New MCs spawned as `80-ipsec-master-extensions` and `80-ipsec-worker-extensions` 5. Cluster settled with no IPsec at all (no ovn-ipsec-host DaemonSet) 6. Mode still Full
Actual results:
Mode Full actually replicated the Disabled state in the above steps.
Expected results:
Just the NS IPsec should have gone away; EW should have persisted.
Additional info:
spec:
  configuration:
    featureGate:
      featureSet: TechPreviewNoUpgrade
$ oc get pod NAME READY STATUS RESTARTS AGE capi-provider-bd4858c47-sf5d5 0/2 Init:0/1 0 9m33s cluster-api-85f69c8484-5n9ql 1/1 Running 0 9m33s control-plane-operator-78c9478584-xnjmd 2/2 Running 0 9m33s etcd-0 3/3 Running 0 9m10s kube-apiserver-55bb575754-g4694 4/5 CrashLoopBackOff 6 (81s ago) 8m30s $ oc logs kube-apiserver-55bb575754-g4694 -c kube-apiserver --tail=5 E0105 16:49:54.411837 1 controller.go:145] while syncing ConfigMap "kube-system/kube-apiserver-legacy-service-account-token-tracking", err: namespaces "kube-system" not found I0105 16:49:54.415074 1 trace.go:236] Trace[236726897]: "Create" accept:application/vnd.kubernetes.protobuf, */*,audit-id:71496035-d1fe-4ee1-bc12-3b24022ea39c,client:::1,api-group:scheduling.k8s.io,api-version:v1,name:,subresource:,namespace:,protocol:HTTP/2.0,resource:priorityclasses,scope:resource,url:/apis/scheduling.k8s.io/v1/priorityclasses,user-agent:kube-apiserver/v1.29.0 (linux/amd64) kubernetes/9368fcd,verb:POST (05-Jan-2024 16:49:44.413) (total time: 10001ms): Trace[236726897]: ---"Write to database call failed" len:174,err:priorityclasses.scheduling.k8s.io "system-node-critical" is forbidden: not yet ready to handle request 10001ms (16:49:54.415) Trace[236726897]: [10.001615835s] [10.001615835s] END F0105 16:49:54.415382 1 hooks.go:203] PostStartHook "scheduling/bootstrap-system-priority-classes" failed: unable to add default system priority classes: priorityclasses.scheduling.k8s.io "system-node-critical" is forbidden: not yet ready to handle request
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/59
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
There is currently no way to specify AgentLabelSelector when creating a nodepool via the hypershift CLI
Description of problem:
The cluster operator "machine-config" degraded due to MachineConfigPool master is not ready, which tells error like "rendered-master-${hash} not found".
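When the operator reports this, a quick check is whether the rendered config the master pool points at actually exists (a diagnostic sketch, not taken from the CI run):
$ oc get mcp master -o jsonpath='{.spec.configuration.name}{"\n"}'
$ oc get machineconfig | grep rendered-master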
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-11-033133
How reproducible:
Always. We met the issue in 2 CI profiles, Flexy template "functionality-testing/aos-4_15/upi-on-gcp/versioned-installer-ovn-ipsec-ew-ns-ci", and PROW CI test "periodic-ci-openshift-openshift-tests-private-release-4.15-multi-ec-gcp-ipi-disc-priv-oidc-arm-mixarch-f14".
Steps to Reproduce:
The Flexy template brief steps: 1. "create install-config" and then "create manifests" 2. add manifest file to config ovnkubernetes network for IPSec (please refer to https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blame/master/functionality-testing/aos-4_15/hosts/create_ign_files.sh#L517-530) 3. (optional) "create ignition-config" 4. UPI installation steps (see OCP doc https://docs.openshift.com/container-platform/4.14/installing/installing_gcp/installing-gcp-user-infra.html#installation-gcp-user-infra-exporting-common-variables)
Actual results:
Installation failed, with the machine-config operator degraded
Expected results:
Installation should succeed.
Additional info:
The must-gather is at https://drive.google.com/file/d/12xbjWUknDL_DRNSS8T_Z3u4d1KrNlJgT/view?usp=drive_link
Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/102
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/152
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
A recent [PR](https://github.com/openshift/hypershift/commit/c030ab66d897815e16d15c987456deab8d0d6da0) updated the kube-apiserver service port to `6443`. That change causes a small outage when upgrading from a 4.13 cluster in IBMCloud. We need to keep the service port as 2040 for IBM Cloud Provider to avoid the outage.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Related with https://issues.redhat.com/browse/OCPBUGS-23000
The cluster autoscaler by default evicts all pods on a node being scaled down, including those coming from DaemonSets. In the case of the EFS CSI driver pods, which mount NFS volumes, this causes stale NFS mounts, and application workloads are not terminated gracefully.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
- While scaling down a node via the cluster-autoscaler-operator, the DaemonSet pods are being evicted.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
CSI pods should not be evicted by the cluster autoscaler (at least not prior to workload termination), as it might produce data corruption
Additional info:
It is possible to disable CSI pod eviction by adding the following annotation on the CSI driver pod: cluster-autoscaler.kubernetes.io/enable-ds-eviction: "false"
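A minimal sketch of where that annotation goes; the DaemonSet name, labels and image below are illustrative, not taken from the actual EFS CSI driver manifests:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: efs-csi-node                  # illustrative name
spec:
  selector:
    matchLabels:
      app: efs-csi-node
  template:
    metadata:
      labels:
        app: efs-csi-node
      annotations:
        # Tells the cluster autoscaler not to evict this DaemonSet pod during scale-down.
        cluster-autoscaler.kubernetes.io/enable-ds-eviction: "false"
    spec:
      containers:
      - name: csi-driver
        image: example.com/efs-csi-driver:latest   # placeholder image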
Description of problem:
After enabling user-defined monitoring on a HyperShift hosted cluster, PrometheusOperatorRejectedResources starts firing.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Start a HyperShift-hosted cluster with cluster-bot 2. Enable user-defined monitoring 3.
Actual results:
The PrometheusOperatorRejectedResources alert starts firing
Expected results:
No alert firing
Additional info:
Need to reach out to the HyperShift folks as the fix should probably be in their code base.
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/159
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The convention is a format like node-role.kubernetes.io/role: "", not node-role.kubernetes.io: role; however, ROSA uses the latter format to indicate the infra role. This changes the node watch code to ignore it, as well as other potential variations like node-role.kubernetes.io/.
The current code panics when run against a ROSA cluster:
E0209 18:10:55.533265 78 runtime.go:79] Observed a panic: runtime.boundsError{x:24, y:23, signed:true, code:0x3} (runtime error: slice bounds out of range [24:23])
goroutine 233 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x7a71840?, 0xc0018e2f48})
k8s.io/apimachinery@v0.27.2/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x1000251f9fe?})
k8s.io/apimachinery@v0.27.2/pkg/util/runtime/runtime.go:49 +0x75
panic({0x7a71840, 0xc0018e2f48})
runtime/panic.go:884 +0x213
github.com/openshift/origin/pkg/monitortests/node/watchnodes.nodeRoles(0x7ecd7b3?)
github.com/openshift/origin/pkg/monitortests/node/watchnodes/node.go:187 +0x1e5
github.com/openshift/origin/pkg/monitortests/node/watchnodes.startNodeMonitoring.func1(0
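A hedged sketch (not the actual origin code) of label parsing that avoids the slice-bounds panic: skip the bare node-role.kubernetes.io key that ROSA uses and the trailing-slash variant, and only treat a non-empty suffix as a role.

package main

import (
	"fmt"
	"strings"
)

const rolePrefix = "node-role.kubernetes.io/"

// nodeRoles collects role names from node labels, ignoring the bare
// "node-role.kubernetes.io" key and the degenerate "node-role.kubernetes.io/" key.
func nodeRoles(labels map[string]string) []string {
	var roles []string
	for key := range labels {
		if !strings.HasPrefix(key, rolePrefix) {
			continue // also skips the bare "node-role.kubernetes.io" form used by ROSA
		}
		if role := strings.TrimPrefix(key, rolePrefix); role != "" {
			roles = append(roles, role)
		}
	}
	return roles
}

func main() {
	fmt.Println(nodeRoles(map[string]string{
		"node-role.kubernetes.io/worker": "",      // conventional form
		"node-role.kubernetes.io":        "infra", // ROSA form, ignored here
		"node-role.kubernetes.io/":       "",      // degenerate variant, ignored here
	}))
}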
We should add a Dockerfile that is optimized for running the tests locally. (The current Dockerfile assumes it is running with the CI setup.)
For years, the TechPreviewNoUpgrade alert has used:
cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0
But recently, testing 4.12.19, I saw the alert pending with name="LatencySensitive". Alerting on that not-useful-since-4.6 feature set is fine (OCPBUGS-14497), but TechPreviewNoUpgrade isn't a great name when the actual feature set is LatencySensitive, and the summary and description don't apply to LatencySensitive either.
The buggy expr / alertname pair shipped in 4.3.0.
All the time.
1. Install a cluster like 4.12.19.
2. Set the LatencySensitive feature set:
$ oc patch featuregate cluster --type=json --patch='[{"op":"add","path":"/spec/featureSet","value":"LatencySensitive"}]'
3. Check alerts with /monitoring/alerts?rowFilter-alert-source=platform&resource-list-text=TechPreviewNoUpgrade in the web console.
TechPreviewNoUpgrade is pending or firing.
Something appropriate to LatencySensitive, like a generic alert that covers all non-default feature sets, is pending or firing.
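A sketch of what a more generic rule could look like, in PrometheusRule style and reusing the existing cluster_feature_set expression; the alert name and annotation wording here are illustrative, not the shipped rule:

- alert: NonDefaultFeatureSetEnabled   # illustrative name
  expr: cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: A non-default feature set is enabled on the cluster.
    description: The "{{ $labels.name }}" feature set is enabled on this cluster.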
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
‘Oh no! Something went wrong’ is shown when a user goes to MultiClusterEngine details -> YAML tab
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-20-215234
How reproducible:
Always
Steps to Reproduce:
1. Install 'multicluster engine for Kubernetes' operator in the cluster 2. Use the default value to create a new MultiClusterEngine 3. Navigate to the MultiClusterEngine details -> Yaml Tab
Actual results: the ‘Oh no! Something went wrong.’ error is shown with the below details: TypeErrorDescription: Cannot read properties of null (reading 'editor')
Expected results:
no error
Additional info:
This bug fix is in conjunction with https://issues.redhat.com/browse/OCPBUGS-22778
Please review the following PR: https://github.com/openshift/operator-framework-rukpak/pull/68
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
VolumeSnapshots data is not displayed in PVC > VolumeSnapshots tab
Version-Release number of selected component (if applicable):
4.16.0-0.ci-2024-01-05-050911
How reproducible:
Steps to Reproduce:
1. Create a PVC i.e. "my-pvc" 2. Create a Pod and bind it to the "my-pvc" 3. Create a VolumeSnapshots and associate it with the "my-pvc" 4. Goto to PVC detail > VolumeSnapshots tab
Actual results:
VolumeSnapshots data is not displayed in PVC > VolumeSnapshots tab
Expected results:
VolumeSnapshots data should be displayed in PVC > VolumeSnapshots tab
Additional info:
Please review the following PR: https://github.com/openshift/whereabouts-cni/pull/223
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
ccoctl consumes CredentialsRequests extracted from OpenShift releases and manages secrets associated with those requests for the cluster. Over time, ccoctl has grown a number of CredentialRequest filters, including deletion annotations in CCO-175 and tech-preview annotations in cco#444.
But with OTA-559, 4.14 and later oc adm release extract ... learned about an --included parameter, which allows oc to perform that "will the cluster need this credential?" filtering, and there is no longer a need for ccoctl to perform that filtering, or for ccoctl callers to have to think through "do I need to enable tech-preview CRs for this cluster or not?".
4.14 and later.
100%.
$ cat <<EOF >install-config.yaml
> apiVersion: v1
> platform:
>   gcp:
>     dummy: data
> featureSet: TechPreviewNoUpgrade
> EOF
$ oc adm release extract --included --credentials-requests --install-config install-config.yaml --to credentials-requests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.2-x86_64
$ ccoctl gcp create-all --dry-run --name=test --region=test --project=test --credentials-requests-dir=credentials-requests
ccoctl doesn't dry-run create the TechPreviewNoUpgrade openshift-cluster-api-gcp CredentialsRequest unless you pass it --enable-tech-preview.
ccoctl does dry-run create the TechPreviewNoUpgrade openshift-cluster-api-gcp CredentialsRequest unless you pass it --enable-tech-preview=false.
Longer-term, we likely want to go through some phases of deprecating and maybe eventually removing --enable-tech-preview and the ccoctl-side filtering. But for now, I think we want to pivot to defaulting to true, so that anyone with existing flows that do not include the new --included extraction has an easy way to keep their workflow going (they can set --enable-tech-preview=false). And I think we should backport that to 4.14's ccoctl to simplify OSDOCS-4158's docs#62148. But we're close enough to 4.14's expected GA, that it's worth some consensus-building and alternative consideration, before trying to rush changes back to 4.14 branches.
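Under the proposed default flip, a caller that still wants the old behaviour could opt out explicitly; a sketch reusing the flags already shown above:

$ ccoctl gcp create-all --dry-run --name=test --region=test --project=test \
    --credentials-requests-dir=credentials-requests --enable-tech-preview=false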
Description of problem:
In the self-managed HCP use case, if the on-premise baremetal management cluster does not have nodes labeled with the "topology.kubernetes.io/zone" key, then all HCP pods for a High Available cluster are scheduled to a single mgmt cluster node. This is a result of the way the affinity rules are constructed. Take the pod affinity/antiAffinity example below, which is generated for a HA HCP cluster. If the "topology.kubernetes.io/zone" label does not exist on the mgmt cluster nodes, then the pod will still get scheduled but that antiAffinity rule is effectively ignored. That seems odd due to the usage of the "requiredDuringSchedulingIgnoredDuringExecution" value, but I have tested this and the rule truly is ignored if the topologyKey is not present.
podAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - podAffinityTerm:
      labelSelector:
        matchLabels:
          hypershift.openshift.io/hosted-control-plane: clusters-vossel1
      topologyKey: kubernetes.io/hostname
    weight: 100
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchLabels:
        app: kube-apiserver
        hypershift.openshift.io/control-plane-component: kube-apiserver
    topologyKey: topology.kubernetes.io/zone
In the event that no "zones" are configured for the baremetal mgmt cluster, then the only other pod affinity rule is one that actually colocates the pods together. This results in a HA HCP having all the etcd, apiservers, etc... scheduled to a single node.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. Create a self-managed HA HCP cluster on a mgmt cluster with nodes that lack the "topology.kubernetes.io/zone" label
Actual results:
all HCP pods are scheduled to a single node.
Expected results:
HCP pods should always be spread across multiple nodes.
Additional info:
A way to address this is to add another anti-affinity rule which prevents every component from being scheduled on the same node as its replicas
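A hedged sketch of such an additional rule, reusing the component labels from the example above; spreading replicas by hostname keeps the rule meaningful even when no zone labels exist on the management cluster nodes:

podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchLabels:
        app: kube-apiserver                                       # illustrative component label
        hypershift.openshift.io/control-plane-component: kube-apiserver
    # kubernetes.io/hostname is present on every node, so replicas of the same
    # component can never land on one node even without zone labels.
    topologyKey: kubernetes.io/hostname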
Please review the following PR: https://github.com/openshift/cluster-storage-operator/pull/435
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
There was a glitch with the prometheus-adapter image after the 4.16 branching.
https://redhat-internal.slack.com/archives/C01CQA76KMX/p1702041610708869
Allow eviction of unhealthy (not ready) pods even if there are no disruptions allowed on a PodDisruptionBudget. This can help to drain/maintain a node and recover without manual intervention when multiple instances of nodes or pods are misbehaving.
This is to prevent possible issues similar to https://issues.redhat.com/browse/OCPBUGS-23796
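One way this behaviour is commonly expressed in Kubernetes is the PodDisruptionBudget unhealthyPodEvictionPolicy field; a sketch with illustrative names and selector:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb                  # illustrative name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: example                   # illustrative selector
  # Unhealthy (not ready) pods may be evicted even when the budget allows no
  # disruptions, so a node drain can proceed without manual intervention.
  unhealthyPodEvictionPolicy: AlwaysAllow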
[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers for ns/openshift-kni-infra
Test is now passing near 0% on metal-ipi.
Started around Feb 10th.
event [namespace/openshift-kni-infra node/master-1.ostest.test.metalkube.org pod/haproxy-master-1.ostest.test.metalkube.org hmsg/64785a22cf - Back-off restarting failed container haproxy in pod haproxy-master-1.ostest.test.metalkube.org_openshift-kni-infra(336080d8c1b455c151170524132c026d)] happened 295 times
Possible relation to haproxy 2.8 merge?
Logs indicate the error is:
/bin/bash: line 47: socat: command not found
Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver/pull/68
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
Some requests of this type:
{"x_request_id":"05e66411-7612-46bb-86c2-69bf7096b6da","protocol":"HTTP/1.1","authority":"zoscaru4s08w1mz.api.openshift.com","user_agent":"Go-http-client/2.0","method":"GET","response_flags":"UC","x_forwarded_for":"163.244.72.2,10.128.10.16,23.21.192.204","bytes_rx":0,"duration":13,"bytes_tx":95,"response_code":503,"timestamp":"2024-01-31T15:52:41.418Z","upstream_duration":null,"path":"/api/assisted-install/v2/infra-envs/84596f7d-0138-4f57-ada4-be72aea031a5/hosts/62148a92-8588-b591-5f7e-046bf1136b3b/instructions?timestamp=1706716359"} |
are returning 503 because the application crashes. Fortunately it is only a goroutine that crashes, so the main loop keeps running and other requests seem unaffected.
How reproducible:
Not sure what conditions we need to meet but plenty of requests of this type can be found from prod logs
Steps to reproduce:
1.
2.
3.
Actual results:
2024/01/31 16:41:31 http: panic serving 127.0.0.1:39486: runtime error: invalid memory address or nil pointer dereference goroutine 575931 [running]: net/http.(*conn).serve.func1() /usr/local/go/src/net/http/server.go:1854 +0xbf panic({0x4369120, 0x6bee680}) /usr/local/go/src/runtime/panic.go:890 +0x263 github.com/openshift/assisted-service/internal/host/hostcommands.(*tangConnectivityCheckCmd).getTangServersFromHostIgnition(0x34?, 0x484da60?) /assisted-service/internal/host/hostcommands/tang_connectivity_check_cmd.go:41 +0x3e github.com/openshift/assisted-service/internal/host/hostcommands.(*tangConnectivityCheckCmd).GetSteps(0xc000a5c440, {0xc00057c1c8?, 0x48adb49?}, 0xc000f5d180) /assisted-service/internal/host/hostcommands/tang_connectivity_check_cmd.go:104 +0x112 github.com/openshift/assisted-service/internal/host/hostcommands.(*InstructionManager).GetNextSteps(0xc000ad0780, {0x50cf558, 0xc00444b4a0}, 0xc000f5d180) /assisted-service/internal/host/hostcommands/instruction_manager.go:178 +0xa2f github.com/openshift/assisted-service/internal/host.(*Manager).GetNextSteps(0xc0003e4990?, {0x50cf558?, 0xc00444b4a0?}, 0x0?) /assisted-service/internal/host/host.go:548 +0x48 github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).V2GetNextSteps(0xc000da4800, {0x50cf558, 0xc00444b4a0}, {0xc00203b500, 0xc002b9f460, {0xc001daa690, 0x24}, {0xc001daa6f0, 0x24}, 0xc0015c8158}) /assisted-service/internal/bminventory/inventory.go:5357 +0x1b8 github.com/openshift/assisted-service/restapi.HandlerAPI.func54({0xc00203b500, 0xc002b9f460, {0xc001daa690, 0x24}, {0xc001daa6f0, 0x24}, 0xc0015c8158}, {0x408d980?, 0xc000fb67e0?}) /assisted-service/restapi/configure_assisted_install.go:654 +0xf4 github.com/openshift/assisted-service/restapi/operations/installer.V2GetNextStepsHandlerFunc.Handle(0xc00203b300?, {0xc00203b500, 0xc002b9f460, {0xc001daa690, 0x24}, {0xc001daa6f0, 0x24}, 0xc0015c8158}, {0x408d980, 0xc000fb67e0}) /assisted-service/restapi/operations/installer/v2_get_next_steps.go:19 +0x7a github.com/openshift/assisted-service/restapi/operations/installer.(*V2GetNextSteps).ServeHTTP(0xc00169e468, {0x50bfe00, 0xc001399d20}, 0xc00203b500) /assisted-service/restapi/operations/installer/v2_get_next_steps.go:66 +0x2dd github.com/go-openapi/runtime/middleware.NewOperationExecutor.func1({0x50bfe00, 0xc001399d20}, 0xc00203b500) /assisted-service/vendor/github.com/go-openapi/runtime/middleware/operation.go:28 +0x59 net/http.HandlerFunc.ServeHTTP(0x0?, {0x50bfe00?, 0xc001399d20?}, 0x17334b7?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/openshift/assisted-service/internal/metrics.Handler.func1.1() /assisted-service/internal/metrics/reporter.go:37 +0x31 github.com/slok/go-http-metrics/middleware.Middleware.Measure({{{0x50bfdd0, 0xc0014ce0e0}, {0x48922ea, 0x12}, 0x0, 0x0, 0x0}}, {0x0, 0x0}, {0x50d23c0, ...}, ...) /assisted-service/vendor/github.com/slok/go-http-metrics/middleware/middleware.go:117 +0x30e github.com/openshift/assisted-service/internal/metrics.Handler.func1({0x50cdea0?, 0xc0005ec770}, 0xc00203b500) /assisted-service/internal/metrics/reporter.go:36 +0x35f net/http.HandlerFunc.ServeHTTP(0x50cf558?, {0x50cdea0?, 0xc0005ec770?}, 0xc002b9ee80?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/openshift/assisted-service/pkg/context.ContextHandler.func1.1({0x50cdea0, 0xc0005ec770}, 0xc00203b400) /assisted-service/pkg/context/param.go:95 +0xc8 net/http.HandlerFunc.ServeHTTP(0xc001735430?, {0x50cdea0?, 0xc0005ec770?}, 0xc0026625f8?) 
/usr/local/go/src/net/http/server.go:2122 +0x2f github.com/go-openapi/runtime/middleware.NewRouter.func1({0x50cdea0, 0xc0005ec770}, 0xc00203b200) /assisted-service/vendor/github.com/go-openapi/runtime/middleware/router.go:77 +0x257 net/http.HandlerFunc.ServeHTTP(0x7fc8dc2c3820?, {0x50cdea0?, 0xc0005ec770?}, 0xc00009f000?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/go-openapi/runtime/middleware.Redoc.func1({0x50cdea0, 0xc0005ec770}, 0x418b480?) /assisted-service/vendor/github.com/go-openapi/runtime/middleware/redoc.go:72 +0x242 net/http.HandlerFunc.ServeHTTP(0x1?, {0x50cdea0?, 0xc0005ec770?}, 0x0?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/go-openapi/runtime/middleware.Spec.func1({0x50cdea0, 0xc0005ec770}, 0x486ac6f?) /assisted-service/vendor/github.com/go-openapi/runtime/middleware/spec.go:46 +0x18c net/http.HandlerFunc.ServeHTTP(0xc001136380?, {0x50cdea0?, 0xc0005ec770?}, 0xc00203b200?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/rs/cors.(*Cors).Handler.func1({0x50cdea0, 0xc0005ec770}, 0xc00203b200) /assisted-service/vendor/github.com/rs/cors/cors.go:281 +0x1c4 net/http.HandlerFunc.ServeHTTP(0x0?, {0x50cdea0?, 0xc0005ec770?}, 0x4?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1({0x50cd990, 0xc0012e8540}, 0xc0017358f0?) /assisted-service/vendor/github.com/NYTimes/gziphandler/gzip.go:336 +0x24e net/http.HandlerFunc.ServeHTTP(0x100c0017359e8?, {0x50cd990?, 0xc0012e8540?}, 0x10?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/openshift/assisted-service/pkg/app.WithMetricsResponderMiddleware.func1({0x50cd990?, 0xc0012e8540?}, 0x16f0a19?) /assisted-service/pkg/app/middleware.go:32 +0xb0 net/http.HandlerFunc.ServeHTTP(0xc000a26900?, {0x50cd990?, 0xc0012e8540?}, 0xc00444a7e0?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/openshift/assisted-service/pkg/app.WithHealthMiddleware.func1({0x50cd990?, 0xc0012e8540?}, 0x5094901?) /assisted-service/pkg/app/middleware.go:55 +0x162 net/http.HandlerFunc.ServeHTTP(0x50cf4b0?, {0x50cd990?, 0xc0012e8540?}, 0x5094980?) /usr/local/go/src/net/http/server.go:2122 +0x2f github.com/openshift/assisted-service/pkg/requestid.handler.ServeHTTP({{0x50aa9c0?, 0xc0008a4eb0?}}, {0x50cd990, 0xc0012e8540}, 0xc00203b100) /assisted-service/pkg/requestid/requestid.go:69 +0x1ad github.com/openshift/assisted-service/internal/spec.WithSpecMiddleware.func1({0x50cd990?, 0xc0012e8540?}, 0xc00203b100?) /assisted-service/internal/spec/spec.go:38 +0x9b net/http.HandlerFunc.ServeHTTP(0xc00124ec35?, {0x50cd990?, 0xc0012e8540?}, 0x170a0ce?) /usr/local/go/src/net/http/server.go:2122 +0x2f net/http.serverHandler.ServeHTTP({0xc002cb8f30?}, {0x50cd990, 0xc0012e8540}, 0xc00203b100) /usr/local/go/src/net/http/server.go:2936 +0x316 net/http.(*conn).serve(0xc00189d320, {0x50cf558, 0xc0011fc240}) /usr/local/go/src/net/http/server.go:1995 +0x612 created by net/http.(*Server).Serve /usr/local/go/src/net/http/server.go:3089 +0x5ed
Expected results:
Description of problem:
Update OWNERS file in route-controller-manager repository.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
n/a
Steps to Reproduce:
n/a
Actual results:
n/a
Expected results:
n/a
Additional info:
Description of problem:
[Azuredisk-csi-driver] allocatable volumes count incorrect in csinode for Standard_B4as_v2 instance types
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-02-132842
How reproducible:
Always
Steps to Reproduce:
1. Install an Azure OpenShift cluster using the Standard_B4as_v2 instance type 2. Check the csinode object allocatable volumes count 3. Create a pod with the max allocatable volumes count PVCs (provisioned by azuredisk-csi-driver)
Actual results:
In step 2 the allocatable volumes count is 16.
$ oc get csinode pewang-0908s-r6lwd-worker-southcentralus3-tvwwr -ojsonpath='{.spec.drivers[?(@.name=="disk.csi.azure.com")].allocatable.count}'
16
In step 3 the pod is stuck at ContainerCreating, caused by an attach volume failure: 09-07 22:38:28.758 "message": "The maximum number of data disks allowed to be attached to a VM of this size is 8."
Expected results:
In step 2 the allocatable volumes count should be 8. In step 3 the pod should be Running well and all volumes could be read and written data
Additional info:
$ az vm list-skus -l eastus --query "[?name=='Standard_B4as_v2']" | jq -r '.[0].capabilities[] | select(.name =="MaxDataDiskCount")'
{
  "name": "MaxDataDiskCount",
  "value": "8"
}
Currently in 4.14 we use the v1.28.1 driver. I checked the upstream issues and PRs; the issue is fixed in v1.28.2: https://github.com/kubernetes-sigs/azuredisk-csi-driver/releases/tag/v1.28.2
Please review the following PR: https://github.com/openshift/cluster-policy-controller/pull/144
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Tired of scrolling through alerts and pod states that are seldom useful to get to things that we need every day.
Description of problem:
To operate HyperShift at high scale, we need an option to disable dedicated request serving isolation, if not used.
Version-Release number of selected component (if applicable):
4.16, 4.15, 4.14, 4.13
How reproducible:
100%
Steps to Reproduce:
1. Install hypershift operator for versions 4.16, 4.15, 4.14, or 4.13 2. Observe start-up logs 3. Dedicated request serving isolation controllers are started
Actual results:
Dedicated request serving isolation controllers are started
Expected results:
Dedicated request serving isolation controllers should not start if unneeded
Additional info:
Description of problem:
Snapshots taken to gather deprecation information from bundles are from the Subscription namespace instead of the CatalogSource namespace. That means that if the Subscription is in a different namespace then no bundles will be present in the snapshot.
How reproducible:
100%
Steps to Reproduce:
1. Create a CatalogSource with olm.deprecation entries 2. Create a Subscription, in a different namespace, targeting a package with deprecations.
Actual results:
No Deprecation Conditions will be present.
Expected results:
Deprecation Conditions should be present.
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/53
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
On SNO with the DU profile (RT kernel), the tuned profile is always degraded because the net.core.busy_read, net.core.busy_poll and kernel.numa_balancing sysctls do not exist in the RT kernel
Version-Release number of selected component (if applicable):
4.14.1
How reproducible:
100%
Steps to Reproduce:
1. Deploy SNO with DU profile(RT kernel) 2. Check tuned profile
Actual results:
oc -n openshift-cluster-node-tuning-operator get profile -o yaml apiVersion: v1 items: - apiVersion: tuned.openshift.io/v1 kind: Profile metadata: creationTimestamp: "2023-11-09T18:26:34Z" generation: 2 name: sno.kni-qe-1.lab.eng.rdu2.redhat.com namespace: openshift-cluster-node-tuning-operator ownerReferences: - apiVersion: tuned.openshift.io/v1 blockOwnerDeletion: true controller: true kind: Tuned name: default uid: 4e7c05a2-537e-4212-9009-e2724938dec9 resourceVersion: "287891" uid: 5f4d5819-8f84-4b3b-9340-3d38c41501ff spec: config: debug: false tunedConfig: {} tunedProfile: performance-patch status: conditions: - lastTransitionTime: "2023-11-09T18:26:39Z" message: TuneD profile applied. reason: AsExpected status: "True" type: Applied - lastTransitionTime: "2023-11-09T18:26:39Z" message: 'TuneD daemon issued one or more error message(s) during profile application. TuneD stderr: net.core.rps_default_mask' reason: TunedError status: "True" type: Degraded tunedProfile: performance-patch kind: List metadata: resourceVersion: ""
Expected results:
Not degraded
Additional info:
Looking at the tuned log the following errors show up which are probably causing the profile to get into degraded state: 2023-11-09 18:30:49,287 ERROR tuned.plugins.plugin_sysctl: Failed to read sysctl parameter 'net.core.busy_read', the parameter does not exist 2023-11-09 18:30:49,287 ERROR tuned.plugins.plugin_sysctl: sysctl option net.core.busy_read will not be set, failed to read the original value. 2023-11-09 18:30:49,287 ERROR tuned.plugins.plugin_sysctl: Failed to read sysctl parameter 'net.core.busy_poll', the parameter does not exist 2023-11-09 18:30:49,287 ERROR tuned.plugins.plugin_sysctl: sysctl option net.core.busy_poll will not be set, failed to read the original value. 2023-11-09 18:30:49,287 ERROR tuned.plugins.plugin_sysctl: Failed to read sysctl parameter 'kernel.numa_balancing', the parameter does not exist 2023-11-09 18:30:49,287 ERROR tuned.plugins.plugin_sysctl: sysctl option kernel.numa_balancing will not be set, failed to read the original value. These sysctl parameters seem not to be available with RT kernel.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/270
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/95
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/11
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
While upgrading a loaded 250 node ROSA cluster from 4.13.13 to 4.14.rc2, the cluster failed to upgrade and was stuck while the network operator was trying to upgrade.
Around 20 multus pods were in CrashLoopBackOff state with the log
oc logs multus-4px8t
2023-10-10T00:54:34+00:00 [cnibincopy] Successfully copied files in /usr/src/multus-cni/rhel9/bin/ to /host/opt/cni/bin/upgrade_6dcb644a-4164-42a5-8f1e-4ae2c04dc315
2023-10-10T00:54:34+00:00 [cnibincopy] Successfully moved files in /host/opt/cni/bin/upgrade_6dcb644a-4164-42a5-8f1e-4ae2c04dc315 to /host/opt/cni/bin/
2023-10-10T00:54:34Z [verbose] multus-daemon started
2023-10-10T00:54:34Z [verbose] Readiness Indicator file check
2023-10-10T00:55:19Z [error] have you checked that your default network is ready? still waiting for readinessindicatorfile @ /host/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
Description of problem:
In ROSA/OCP 4.14.z, attaching the AmazonEC2ContainerRegistryReadOnly policy to the worker nodes (in ROSA's case, this was attached to the ManagedOpenShift-Worker-Role, which is assigned by the installer to all the worker nodes) has no effect on ECR image pulls. The user gets an authentication error. Attaching the policy should ideally avoid the need to provide an image pull secret; however, the error is resolved only if the user also provides an image pull secret. This is proven to work correctly in 4.12.z. It seems something has changed in recent OCP versions.
Version-Release number of selected component (if applicable):
4.14.2 (ROSA)
How reproducible:
The issue is reproducible using the below steps.
Steps to Reproduce:
1. Create a deployment in ROSA or OCP on AWS, pointing at a private ECR repository 2. The image pulling will fail with Error: ErrImagePull & authentication required errors 3.
Actual results:
The image pull fails with "Error: ErrImagePull" & "authentication required" errors. However, the image pull is successful only if the user provides an image-pull-secret to the deployment.
Expected results:
The image should be pulled successfully by virtue of the ECR-read-only policy attached to the worker node role; without needing an image-pull-secret.
Additional info:
In other words:
in OCP 4.13 (and below) if a user adds the ECR:* permissions to the worker instance profile, then the user can specify ECR images and authentication of the worker node to ECR is done using the instance profile. In 4.14 this no longer works.
It is not sufficient, as an alternative, to provide a pull secret in a deployment, because AWS rotates ECR tokens every 12 hours. That is not a viable solution for customers that, until OCP 4.13, did not have to rotate pull secrets constantly.
The experience in 4.14 should be the same as in 4.13 with ECR.
The current AWS policy that's used is this one: `arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly`
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "ecr:GetAuthorizationToken", "ecr:BatchCheckLayerAvailability", "ecr:GetDownloadUrlForLayer", "ecr:GetRepositoryPolicy", "ecr:DescribeRepositories", "ecr:ListImages", "ecr:DescribeImages", "ecr:BatchGetImage", "ecr:GetLifecyclePolicy", "ecr:GetLifecyclePolicyPreview", "ecr:ListTagsForResource", "ecr:DescribeImageScanFindings" ], "Resource": "*" } ] }
Description of problem:
In a local setup, this error appears when creating a Deployment with scaling in the Git form page: `Deployment in version "v1" cannot be handled as a Deployment: json: cannot unmarshal string into Go struct field DeploymentSpec.spec.replicas of type int32`
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-01-05-154400
How reproducible:
Everytime
Steps to Reproduce:
1. In the local setup go to the git form page 2. Enter a git repo and select deployment as the resource type 3. In scaling enter the value as '5' and click on Create button
Actual results:
Got this error: "Deployment in version "v1" cannot be handled as a Deployment: json: cannot unmarshal string into Go struct field DeploymentSpec.spec.replicas of type int32"
Expected results:
Deployment should be created
Additional info:
Happening with Deployment-config creation as well
CI is flaky because the TestHostNetworkPort test fails:
=== NAME TestAll/serial/TestHostNetworkPortBinding operator_test.go:1034: Expected conditions: map[Admitted:True Available:True DNSManaged:False DeploymentReplicasAllAvailable:True LoadBalancerManaged:False] Current conditions: map[Admitted:True Available:True DNSManaged:False Degraded:False DeploymentAvailable:True DeploymentReplicasAllAvailable:False DeploymentReplicasMinAvailable:True DeploymentRollingOut:True EvaluationConditionsDetected:False LoadBalancerManaged:False LoadBalancerProgressing:False Progressing:True Upgradeable:True] operator_test.go:1034: Ingress Controller openshift-ingress-operator/samehost status: { "availableReplicas": 0, "selector": "ingresscontroller.operator.openshift.io/deployment-ingresscontroller=samehost", "domain": "samehost.ci-op-xlwngvym-43abb.origin-ci-int-aws.dev.rhcloud.com", "endpointPublishingStrategy": { "type": "HostNetwork", "hostNetwork": { "protocol": "TCP", "httpPort": 9080, "httpsPort": 9443, "statsPort": 9936 } }, "conditions": [ { "type": "Admitted", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "Valid" }, { "type": "DeploymentAvailable", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "DeploymentAvailable", "message": "The deployment has Available status condition set to True" }, { "type": "DeploymentReplicasMinAvailable", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "DeploymentMinimumReplicasMet", "message": "Minimum replicas requirement is met" }, { "type": "DeploymentReplicasAllAvailable", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "DeploymentReplicasNotAvailable", "message": "0/1 of replicas are available" }, { "type": "DeploymentRollingOut", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "DeploymentRollingOut", "message": "Waiting for router deployment rollout to finish: 0 of 1 updated replica(s) are available...\n" }, { "type": "LoadBalancerManaged", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "EndpointPublishingStrategyExcludesManagedLoadBalancer", "message": "The configured endpoint publishing strategy does not include a managed load balancer" }, { "type": "LoadBalancerProgressing", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "LoadBalancerNotProgressing", "message": "LoadBalancer is not progressing" }, { "type": "DNSManaged", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "UnsupportedEndpointPublishingStrategy", "message": "The endpoint publishing strategy doesn't support DNS management." }, { "type": "Available", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z" }, { "type": "Progressing", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "IngressControllerProgressing", "message": "One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 0 of 1 updated replica(s) are available...\n)" }, { "type": "Degraded", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z" }, { "type": "Upgradeable", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "Upgradeable", "message": "IngressController is upgradeable." }, { "type": "EvaluationConditionsDetected", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "NoEvaluationCondition", "message": "No evaluation condition is detected." 
} ], "tlsProfile": { "ciphers": [ "ECDHE-ECDSA-AES128-GCM-SHA256", "ECDHE-RSA-AES128-GCM-SHA256", "ECDHE-ECDSA-AES256-GCM-SHA384", "ECDHE-RSA-AES256-GCM-SHA384", "ECDHE-ECDSA-CHACHA20-POLY1305", "ECDHE-RSA-CHACHA20-POLY1305", "DHE-RSA-AES128-GCM-SHA256", "DHE-RSA-AES256-GCM-SHA384", "TLS_AES_128_GCM_SHA256", "TLS_AES_256_GCM_SHA384", "TLS_CHACHA20_POLY1305_SHA256" ], "minTLSVersion": "VersionTLS12" }, "observedGeneration": 1 } operator_test.go:1036: failed to observe expected conditions for the second ingresscontroller: timed out waiting for the condition operator_test.go:1059: deleted ingresscontroller samehost operator_test.go:1059: deleted ingresscontroller hostnetworkportbinding
This particular failure comes from https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1017/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1762147882179235840. Search.ci shows another failure: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/48873/rehearse-48873-pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi/1762576595890999296. The test has failed sporadically in the past, beyond what search.ci is able to search.
TestHostNetworkPort is marked as a serial test in TestAll and marked with t.Parallel() in the test itself. Not sure if this is what is causing a new failure seen in this test, but something is incorrect.
The test failures have been observed recently on 4.16 as well as on 4.12 (https://github.com/openshift/cluster-ingress-operator/pull/828#issuecomment-1292888086) and 4.11 (https://github.com/openshift/cluster-ingress-operator/pull/914#issuecomment-1526808286). The logic error was introduced in 4.11 (https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a22322b25569059c61e1973f37f0a4b49e9407bc).
The logic error is self-evident. The test failure is very rare. The failure has been observed sporadically over the past couple years. Presently, search.ci shows two failures, with the following impact, for the past 14 days:
rehearse-48873-pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator (all) - 16 runs, 25% failed, 25% of failures match = 6% impact
N/A.
The TestHostNetworkPort test fails. The test is marked as both serial and parallel.
Test should be marked as either serial or parallel, and it should pass consistently.
When TestAll was introduced, TestHostNetworkPortBinding was initially marked parallel in https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a22322b25569059c61e1973f37f0a4b49e9407bc. After some discussion, it was moved to the serial list in https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a449e497e35fafeecbee9ea656e0631393182f70, but the commit to remove t.Parallel() evidently got inadvertently dropped.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Chinese translation in topology was invalid, see https://github.com/openshift/console/pull/13458
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/128
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/kube-state-metrics/pull/109
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The azure csi driver operator cannot run in a HyperShift control plane because it has this selector: node-role.kubernetes.io/master: ""
Version-Release number of selected component (if applicable):
4.16 ci latest
How reproducible:
always
Steps to Reproduce:
1. Install hypershift 2. Create azure hosted cluster
Actual results:
azure-disk-csi-driver-operator pod remains in Pending state
Expected results:
all control plane pods run
Additional info:
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/264
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The degradation of the storage operator occurred because it couldn't locate the node by UUID. I noticed that the providerID was present for node 0, but it was blank for the other nodes. A successful installation can be achieved on day 2 by executing step 4 after step 7 from this document: https://access.redhat.com/solutions/6677901. Additionally, if we provide credentials from the install-config, it's necessary to add the uninitialized taint to the node (oc adm taint node "$NODE" node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule) after the bootstrap completes.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
100%
Steps to Reproduce:
1. Create an agent ISO image 2. Boot the created ISO on vSphere VM
Actual results:
Installation is failing due to storage operator unable to find the node by UUID.
Expected results:
Storage operator should be installed without any issue.
Additional info:
Slack discussion: https://redhat-internal.slack.com/archives/C02SPBZ4GPR/p1702893456002729
Description of problem:
Hosted control plane clusters of OCP 4.16 are using default catalog sources (redhat-operators, certified-operators, community-operators and redhat-marketplace) pointing to 4.14, so 4.16 operators are not available, and this can't be updated from within the guest cluster.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Always
Steps to Reproduce:
1. check the .spec.image of the default catalog sources in openshift-marketplace namespace.
Actual results:
the default catalogs are pointing to :v4.14
Expected results:
they should point to :v4.16 instead
Additional info:
Description of problem:
The CanaryChecksRepetitiveFailures message "Canary route checks for the default ingress controller are failing"
doesn't give much detail or suggest next steps. Expanding it to include at least a more detailed error message would make it easier for the admin to figure out how to resolve the issue.
Version-Release number of selected component (if applicable):
It's in the dev branch, and probably dates back to whenever the canary system was added.
How reproducible:
100%
Steps to Reproduce:
1. Break ingress. FIXME: Maybe by deleting the cloud load balancer, or dropping a firewall in the way, or something.
2. See the canary pods start failing.
3. Ingress operator sets CanaryChecksRepetitiveFailures with a message.
Actual results:
Canary route checks for the default ingress controller are failing
Expected results:
Canary route checks for the default ingress controller are failing: ${ERROR_MESSAGE}. ${POSSIBLY_ALSO_MORE_TROUBLESHOOTING_IDEAS?}
Additional info:
Plumbing the error message through might be as straightforward as passing probeRouteEndpoint's err through to setCanaryFailingStatusCondition for formatting. Or maybe it's more complicated than that?
Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/203
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When creating hypershift cluster in disconnected env, the worker node cannot pass the validation of the assisted service due to an ignition error.
Version-Release number of selected component (if applicable):
4.14.z
How reproducible:
100 %
Steps to Reproduce:
1. Steps to install HCP cluster is mentioned in documentation: https://hypershift-docs.netlify.app/labs/dual/mce/agentserviceconfig/#assisted-service-customization 2. 3.
Actual results:
Node addition fails
Expected results:
Node should get added to the cluster
Additional info:
Please review the following PR: https://github.com/openshift/cloud-credential-operator/pull/639
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As part of our investigation into GCP disruption we want an endpoint separate from the cluster under test but inside GCP to monitor for connectivity.
One approach is to use a GCP Cloud Function with a HTTP Trigger
Another alternative is to stand up our own server and collect logging
We need to consider the cost of implementation, the cost of maintenance, and how well the implementation lines up with our overall test scenario (we want to use this as a control to compare with reaching a pod within a cluster under test)
We may want to also consider standing up similar endpoints in AWS and Azure in the future.
A separate story will cover monitoring the endpoint from within Origin
Description of problem:
HyperShift-managed components use the default RevisionHistoryLimit of 10. This significantly impacts etcd load and scalability on the management cluster.
Version-Release number of selected component (if applicable):
4.9, 4.10, 4.11, 4.12, 4.13, 4.14, 4.15, 4.16
How reproducible:
100% (may vary depending on resource availability on the management cluster)
Steps to Reproduce:
1. Create 375+ HostedCluster 2. Observe etcd performance on management cluster 3.
Actual results:
etcd hitting storage space limits
Expected results:
Able to manage HyperShift control planes at scale (375+ HostedClusters)
Additional info:
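A minimal sketch of the kind of change involved: setting a small revisionHistoryLimit on HyperShift-managed Deployments so old ReplicaSets are pruned; the component name, labels and image below are illustrative.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-apiserver                 # illustrative hosted-control-plane component
spec:
  # Keep only a couple of old ReplicaSets instead of the default 10,
  # reducing the number of objects etcd has to store per HostedCluster.
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: kube-apiserver
  template:
    metadata:
      labels:
        app: kube-apiserver
    spec:
      containers:
      - name: kube-apiserver
        image: example.com/kube-apiserver:latest   # placeholder image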
Description of problem:
The Clear button in the Upload JAR file form is not working; the user needs to close the form in order to remove the previously selected JAR file.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Open Upload Jar File form from Add Page 2. Upload a JAR file 3. Remove the JAR the file by using clear button
Actual results:
The selected JAR file is not removed even after using "Clear" button
Expected results:
The "Clear" button should remove the selected file from the form.
Additional info:
Description of problem:
Not able to reproduce it manually, but it frequently happens when running automated scripts.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-05-195247
How reproducible:
Steps to Reproduce:
1. Label the worker-0 node as an egress node and create an egressIP object; the egressIP was assigned to worker-0 successfully on the secondary NIC 2. Block port 9107 on worker-0 and label worker-1 as an egress node 3.
Actual results:
EgressIP was not moved to the second node.
% oc get egressip
NAME             EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-66330   172.22.0.196
40m Warning EgressIPConflict egressip/egressip-66330 Egress IP egressip-66330 with IP 172.22.0.196 is conflicting with a host (worker-0) IP address and will not be assigned
sh-4.4# ip a show enp1s0
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:1c:cf:40:5d:25 brd ff:ff:ff:ff:ff:ff
    inet 172.22.0.109/24 brd 172.22.0.255 scope global dynamic noprefixroute enp1s0
       valid_lft 76sec preferred_lft 76sec
    inet6 fe80::21c:cfff:fe40:5d25/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
Expected results:
EgressIP should move to the second egress node
Additional info:
Workaround: deleting and recreating it works.
% oc get egressip
NAME             EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-66330   172.22.0.196
% oc delete egressip --all
egressip.k8s.ovn.org "egressip-66330" deleted
% oc create -f ../data/egressip/config1.yaml
egressip.k8s.ovn.org/egressip-3 created
% oc get egressip
NAME         EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-3   172.22.0.196   worker-1        172.22.0.196
Description of problem:
The Tekton Results API endpoint failed to fetch data on an air-gapped cluster.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
https://github.com/openshift/origin/pull/28522 removes two tests related to http2 testing with the default certificate that are know known to fail with HAProxy 2.8. We are reworking the tests as part of NE-1444 (HAProxy 2.8 bump).
This bug is a reminder that come OCP 4.16 GA we need to have reworked the tests so that they now pass with HAProxy 2.8 or, if not fixed, revert https://github.com/openshift/origin/pull/28522 which is why I'm marking this bug as a blocker. We do not want to ship 4.16 without reinstating the two tests.
The goal of removing the two tests in https://github.com/openshift/origin/pull/28522 is to allow us to make additional progress in https://github.com/openshift/router/pull/551 (which is our HAProxy 2.8 bump). With all tests passing in router#551 we can continue our assessment of HAProxy 2.8 by a) running the payload tests and b) creating a HAproxy 2.8 image that QE can use with their reliability test suite.
Description of problem:
When deploying a cluster on Power VS, you need to wait for a short period after the workspace is created to facilitate the network configuration. This period is ignored by the DHCP service.
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Deploy a cluster on Power VS with an installer provisioned workspace 2. Observe that the terraform logs ignore the waiting period
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/kubernetes-metrics-server/pull/21
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/56
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When cloning a PVC of 60 GiB size, the system autofills the size of the clone to 8192 PiB. This size cannot be changed in the UI before starting the clone.
Version-Release number of selected component (if applicable):
CNV - 4.14.3
How reproducible:
always
Steps to Reproduce:
1. Create a VM with a PVC of 60 GiB 2. Power off the VM 3. As a cluster admin, clone the 60 GiB PVC (Storage -> PersistentVolumeClaims -> Kebab menu next to the PVC)
Actual results:
The system tries to clone the 60 GiB PVC as an 8192 PiB PVC
Expected results:
A new PVC of 60 GiB
Additional info:
This seems like the closed BZ 2177979.I will upload a screenshot of the UI. Here is the yaml for the original pvc. apiVersion: v1 kind: PersistentVolumeClaim metadata: annotations: cdi.kubevirt.io/storage.bind.immediate.requested: "true" cdi.kubevirt.io/storage.contentType: kubevirt cdi.kubevirt.io/storage.pod.phase: Succeeded cdi.kubevirt.io/storage.populator.progress: 100.0% cdi.kubevirt.io/storage.preallocation.requested: "false" cdi.kubevirt.io/storage.usePopulator: "true" pv.kubernetes.io/bind-completed: "yes" pv.kubernetes.io/bound-by-controller: "yes" volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com creationTimestamp: "2023-12-05T17:34:19Z" finalizers:kubernetes.io/pvc-protectionprovisioner.storage.kubernetes.io/cloning-protection labels: app: containerized-data-importer app.kubernetes.io/component: storage app.kubernetes.io/managed-by: cdi-controller app.kubernetes.io/part-of: hyperconverged-cluster app.kubernetes.io/version: 4.14.0 kubevirt.io/created-by: 60f46f91-2db3-4118-aaba-b1697b29c496 name: win2k19-base namespace: base-images ownerReferences:apiVersion: cdi.kubevirt.io/v1beta1 blockOwnerDeletion: true controller: true kind: DataVolume name: win2k19-base uid: 8980e7b7-ce0b-47b4-a7e4-f4c79e984ebe resourceVersion: "697047" uid: fccb0aa9-8541-4b51-b49e-ddceaa22b68c spec: accessModes:ReadWriteMany dataSource: apiGroup: cdi.kubevirt.io kind: VolumeImportSource name: volume-import-source-8980e7b7-ce0b-47b4-a7e4-f4c79e984ebe dataSourceRef: apiGroup: cdi.kubevirt.io kind: VolumeImportSource name: volume-import-source-8980e7b7-ce0b-47b4-a7e4-f4c79e984ebe resources: requests: storage: "64424509440" storageClassName: ocs-storagecluster-ceph-rbd volumeMode: Block volumeName: pvc-dbfc9fe9-5677-469d-9402-c2f3a22dab3f status: accessModes:ReadWriteMany capacity: storage: 60Gi phase: Bound Here is the yaml for the cloning pvc. apiVersion: v1 kind: PersistentVolumeClaim metadata: annotations: volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com creationTimestamp: "2023-12-06T14:24:07Z" finalizers:kubernetes.io/pvc-protection name: win2k19-base-clone namespace: base-images resourceVersion: "1551054" uid: f72665c3-6408-4129-82a2-e663d8ecc0cc spec: accessModes:ReadWriteMany dataSource: apiGroup: "" kind: PersistentVolumeClaim name: win2k19-base dataSourceRef: apiGroup: "" kind: PersistentVolumeClaim name: win2k19-base resources: requests: storage: "9223372036854775807" storageClassName: ocs-storagecluster-ceph-rbd volumeMode: Block status: phase: Pending
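For comparison, a clone request that preserves the source size would be expected to look roughly like this; a sketch reusing the names above, not what the console currently generates:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: win2k19-base-clone
  namespace: base-images
spec:
  accessModes:
  - ReadWriteMany
  dataSourceRef:
    kind: PersistentVolumeClaim
    name: win2k19-base
  resources:
    requests:
      storage: 60Gi                    # same size as the source PVC, not 8192 PiB
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Block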
Description of problem:
OKD/FCOS uses FCOS for its boot image, which lacks several tools and services, such as oc and crio, that the rendezvous host of the Agent-based Installer needs to set up a bootstrap control plane.
Version-Release number of selected component (if applicable):
4.13.0, 4.14.0, 4.15.0
We added a carry patch to change the healthcheck behaviour in the Azure CCM: https://github.com/openshift/cloud-provider-azure/pull/72 and whilst we opened an upstream twin PR for that https://github.com/kubernetes-sigs/cloud-provider-azure/pull/3887 it got closed in favour of a different approach https://github.com/kubernetes-sigs/cloud-provider-azure/pull/4891 .
As such, in the next rebase we need to drop the commit introduced in PR 72 in favour of bringing in the change from PR 4891 through the rebase. While doing that, we need to explicitly set the new probe behaviour, as the default is still the classic behaviour, which doesn't work with our cluster architecture setup on Azure.
For the steps on how to do this, we can follow this comment: https://github.com/openshift/cloud-provider-azure/pull/88#issuecomment-1803832076
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cloud-provider-powervs/pull/62
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/80
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The customer has a custom apiserver certificate.
This error can be found while trying to uninstall any operator by console:
openshift-console/pods/console-56494b7977-d7r76/console/console/logs/current.log:
2023-10-24T14:13:21.797447921+07:00 E1024 07:13:21.797400 1 operands_handler.go:67] Failed to get new client for listing operands: Get "https://api.<cluster>.<domain>:6443/api?timeout=32s": x509: certificate signed by unknown authority
When trying the same request from the console pod, we see no issue.
We see the root CA that signs the apiserver certificate, and this CA is trusted in the pod.
The code that appears to provoke this issue is:
https://github.com/openshift/console/blob/master/pkg/server/operands_handler.go#L62-L70
Currently the konnectivity agent has the following update strategy:
```
updateStrategy:
rollingUpdate:
maxUnavailable: 1
maxSurge: 0
```
We (IBM) suggest updating it to the following:
```
updateStrategy:
rollingUpdate:
maxUnavailable: 10%
type: RollingUpdate
```
In a large cluster this would speed up the konnectivity-agent update, and since the agents are independent it would not hurt the service.
Description of problem:
Starting with OpenShift 4.8 (https://docs.openshift.com/container-platform/4.8/release_notes/ocp-4-8-release-notes.html#ocp-4-8-notable-technical-changes), all pods get bound SA tokens. Currently, instead of expiring the token, we use the `service-account-extend-token-expiration` option that extends a bound token's validity to 1yr and warns on use of a token that would otherwise have expired. We want to disable this behavior in a future OpenShift release, which would break the OpenShift web console.
Version-Release number of selected component (if applicable):
4.8 - 4.14
How reproducible:
100%
Steps to Reproduce:
1. Install a fresh cluster. 2. Wait ~1hr after the console pods were deployed for the token rotation to occur. 3. Log in to the console and click around. 4. Check the kube-apiserver audit log events for the "authentication.k8s.io/stale-token" annotation.
Actual results:
Many occurrences (I doubt I'll be able to upload a text file, so I'll show a few audit events in the first comment).
Expected results:
The web-console re-reads the SA token regularly so that it never uses an expired token
Additional info:
In a theoretical case where a console pod lasts for a year, it's going to break and won't be able to authenticate to the kube-apiserver. We are planning on disallowing the use of stale tokens in a future release and we need to make sure that the core platform is not broken so that the metrics we collect from the clusters in the wild are not polluted.
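A minimal sketch of the fix direction, assuming the console builds its Kubernetes client with client-go (the paths are the standard in-cluster defaults; nothing here is taken from the actual console code): pointing rest.Config at the projected token file instead of caching the token string lets client-go periodically re-read the file as the kubelet rotates the bound token.
```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newInClusterClient builds a client whose bearer token is re-read from the
// projected service-account token file, so rotated bound tokens are picked up
// without restarting the pod. rest.InClusterConfig() sets this up the same way.
func newInClusterClient() (*kubernetes.Clientset, error) {
	cfg := &rest.Config{
		Host:            "https://kubernetes.default.svc",
		BearerTokenFile: "/var/run/secrets/kubernetes.io/serviceaccount/token",
		TLSClientConfig: rest.TLSClientConfig{
			CAFile: "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
		},
	}
	return kubernetes.NewForConfig(cfg)
}
```
The key point is that the token must come from BearerTokenFile (re-read by the transport) rather than a BearerToken string loaded once at startup.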
Description of problem:
Reviewing 4.15 install failures ("install should succeed: overall"), there are a number of variants impacted by recent install failures.
search.ci: Cluster operator console is not available
Jobs like periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-serial show installation failures due to the console-operator that appear to start with 4.15.0-0.nightly-2023-12-07-225558:
ConsoleOperator reconciliation failed: Operation cannot be fulfilled on consoles.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again
4.15.0-0.nightly-2023-12-07-225558 contains console-operator/pull/814; noting it here in case it is related.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Steps to Reproduce:
1. Review link to install failures above 2. 3.
Actual results:
Expected results:
Additional info:
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-sdn
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade
Description of problem:
Set up a cluster on vSphere with a user-managed ELB, with install-config.yaml containing: apiVIPs: - 10.38.153.2 ingressVIPs: - 10.38.153.3 loadBalancer: type: UserManaged networking: machineNetwork: - cidr: "10.38.153.0/25" featureSet: TechPreviewNoUpgrade
After the cluster started, keepalived was found to still be running on the worker nodes:
omc get pod -n openshift-vsphere-infra
NAME READY STATUS RESTARTS AGE
coredns-ci-op-2kch7ldp-72b07-7l4vs-master-0 2/2 Running 0 1h
coredns-ci-op-2kch7ldp-72b07-7l4vs-master-1 2/2 Running 0 59m
coredns-ci-op-2kch7ldp-72b07-7l4vs-master-2 2/2 Running 0 59m
coredns-ci-op-2kch7ldp-72b07-7l4vs-worker-0-tqc74 2/2 Running 0 39m
coredns-ci-op-2kch7ldp-72b07-7l4vs-worker-1-s654k 2/2 Running 0 37m
keepalived-ci-op-2kch7ldp-72b07-7l4vs-worker-0-tqc74 2/2 Running 0 39m
keepalived-ci-op-2kch7ldp-72b07-7l4vs-worker-1-s654k 2/2 Running 0 37m
Version-Release number of selected component (if applicable):
4.15
How reproducible:
always
Steps to Reproduce:
1. setup vsphere on multi-subnet network with ELB, job https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/46458/rehearse-46458-periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-vsphere-ipi-zones-multisubnets-external-lb-usermanaged-f28/1732560150155235328 2. 3.
Actual results:
Expected results:
keepalived should not be running on worker nodes.
Additional info:
must-gather logs: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/pr-logs/pull/openshift_release/46458/rehearse-46458-periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-vsphere-ipi-zones-multisubnets-external-lb-usermanaged-f28/1732560150155235328/artifacts/vsphere-ipi-zones-multisubnets-external-lb-usermanaged-f28/gather-must-gather/artifacts/must-gather.tar
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/oc/pull/1627
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/658
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/k8s-prometheus-adapter/pull/98
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Clicking on any node's status popover causes the popover dialog to always jump to the first row.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-14-100410
How reproducible:
Always
Steps to Reproduce:
1. Go to the Nodes list page and mark any worker as unschedulable. 2. Click on the 'Ready/Scheduling disabled' text; a popover dialog is opened. 3.
Actual results:
2. The popover dialog always jumps to the first row; it looks like the popover dialog is showing the wrong status/info, and it is also difficult for the user to perform node actions.
Expected results:
2. The popover dialog should be shown directly next to the correct node.
Additional info:
Please review the following PR: https://github.com/openshift/image-registry/pull/390
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This ticket is to satisfy the bot on GH
Please review the following PR: https://github.com/openshift/cluster-api-provider-azure/pull/293
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/135
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
A recently introduced validation within the hypershift operator's webhook conflicts with the UI's ability to create HCP clusters. Previously the pull secret was not required to be posted before an HC or NP, but with a recent change it is now required, because it is used to validate the release image payload. This issue is isolated to 4.15.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
100%. Attempt to post an HC before the pull secret is posted and the HC will be rejected. The expected outcome is that it should be possible to post the pull secret for an HC after the HC is posted, with the controller being eventually consistent with this change.
Description of problem:
On the customer feedback modal there are 3 links for the user to give feedback to Red Hat; the third link lacks a title.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-21-155123
How reproducible:
Always
Steps to Reproduce:
1. Log in to the admin console. Click on "?" -> "Share Feedback" and check the links on the modal. 2. 3.
Actual results:
1. The third link lacks a link title (the link for "Learn about opportunities to ……").
Expected results:
1. There is a link title "Inform the direction of Red Hat" in 4.14; it should also exist in 4.15.
Additional info:
screenshot for 4.14 page: https://drive.google.com/file/d/19AnPlE0h9WwvIjxV0gLuf5x27jLN7TLS/view?usp=drive_link screenshot for 4.15 page: https://drive.google.com/file/d/19MRjzNGRWfYnK-zcoMozh7Z7eaDDG2L-/view?usp=drive_link
Creating a Serverless Deployment with the "Scaling" "Min Pods"/"Max Pods" options set uses the deprecated Knative annotations "autoscaling.knative.dev/minScale" / "maxScale";
the correct current ones are "autoscaling.knative.dev/min-scale" / "max-scale".
The same problem exists with "autoscaling.knative.dev/targetUtilizationPercentage", which should be "autoscaling.knative.dev/target-utilization-percentage".
Serverless operator
The created ksvc resource has
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: "3"
        autoscaling.knative.dev/minScale: "2"
        autoscaling.knative.dev/targetUtilizationPercentage: "70"
The created ksvc should have
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/max-scale: "3"
        autoscaling.knative.dev/min-scale: "2"
        autoscaling.knative.dev/target-utilization-percentage: "70"
4.14.8
None required ATM; the current Serverless release still supports the deprecated "minScale"/"maxScale" annotations.
https://issues.redhat.com/browse/SRVKS-910
Description of problem:
- Observed that after upgrading to 4.13.30 (from 4.13.24), on all nodes/projects (replicated on two clusters that underwent the same upgrade), traffic routed from host-networked pods (router-default) calling to backends intermittently times out / fails to reach its destination.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress
  namespace: testing
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          policy-group.network.openshift.io/ingress: ""
  podSelector: {}
  policyTypes:
  - Ingress
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Upgrade cluster to 4.13.30
2. Apply a test pod running a basic HTTP instance on a random port.
3. Apply the allow-from-openshift-ingress NetworkPolicy and begin a curl loop against the target pod directly from the ingress node (or another worker node) at the host chroot level (node IP).
4. Observe that curls time out intermittently --> the reproducer curl loop is below (note the inclusion of the --connect-timeout flag, which allows the loop to continue more rapidly without waiting for the full 2m connect timeout on a typical SYN failure).
$ while true; do curl --connect-timeout 5 --noproxy '*' -k -w "dnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect: %{time_appconnect} | pretransfer: %{time_pretransfer} | starttransfer: %{time_starttransfer} | total: %{time_total} | size: %{size_download} | response: %{response_code}\n" -o /dev/null -s https://<POD>:<PORT>; done
Actual results:
- Traffic to all backends is dropped/degraded; this intermittent failure marks valid/healthy pods as unavailable due to the connection failures to the backends.
Expected results:
- Traffic should not be impeded, especially when a NetworkPolicy allowing said traffic has been applied.
Additional info:
– additional required template details in first comment below.
RCA UPDATE:
So the problem is that the host-network namespace is not labeled by the ingress controller, and if router pods are host-networked, a network policy with the `policy-group.network.openshift.io/ingress: ""` selector won't allow incoming connections. To reproduce, run the ingress controller with `EndpointPublishingStrategy=HostNetwork` https://docs.openshift.com/container-platform/4.14/networking/nw-ingress-controller-endpoint-publishing-strategies.html and then check the host-network namespace labels with
oc get ns openshift-host-network --show-labels
# expected this
kubernetes.io/metadata.name=openshift-host-network,network.openshift.io/policy-group=ingress,policy-group.network.openshift.io/host-network=,policy-group.network.openshift.io/ingress=
# but before the fix you will see
kubernetes.io/metadata.name=openshift-host-network,policy-group.network.openshift.io/host-network=
Another way to verify that this is the same problem (disruptive, only recommended for test environments) is to make the CNO unmanaged:
oc scale deployment cluster-version-operator -n openshift-cluster-version --replicas=0
oc scale deployment network-operator -n openshift-network-operator --replicas=0
and then label the openshift-host-network namespace manually based on the expected labels above and see if the problem disappears.
Potentially affected versions (may need to reproduce to confirm)
4.16.0, 4.15.0, 4.14.0 since https://issues.redhat.com//browse/OCPBUGS-8070
4.13.30 https://issues.redhat.com/browse/OCPBUGS-22293
4.12.48 https://issues.redhat.com/browse/OCPBUGS-24039
Mitigation/support KCS:
https://access.redhat.com/solutions/7055050
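For illustration only (the real fix belongs in the network/ingress operators), here is a hedged client-go sketch of applying the missing ingress policy-group labels to the openshift-host-network namespace, equivalent to labelling it manually while the CNO is unmanaged; the function name is hypothetical:
```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// labelHostNetworkNamespace adds the labels that network policies using the
// policy-group.network.openshift.io/ingress selector match on, mirroring the
// expected label set shown above.
func labelHostNetworkNamespace(ctx context.Context, cs kubernetes.Interface) error {
	ns, err := cs.CoreV1().Namespaces().Get(ctx, "openshift-host-network", metav1.GetOptions{})
	if err != nil {
		return err
	}
	if ns.Labels == nil {
		ns.Labels = map[string]string{}
	}
	ns.Labels["network.openshift.io/policy-group"] = "ingress"
	ns.Labels["policy-group.network.openshift.io/ingress"] = ""
	_, err = cs.CoreV1().Namespaces().Update(ctx, ns, metav1.UpdateOptions{})
	return err
}
```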
Please review the following PR: https://github.com/openshift/images/pull/154
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/779
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The 4.16 OCP payload now only contains oc.rhel9; oc.rhel8 can't be found. oc adm release extract --command='oc.rhel8' registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-01-24-133352 --command-os=linux/amd64 -a tmp/config.json --to octest/ error: image did not contain usr/share/openshift/linux_amd64/oc.rhel8
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. oc adm release extract --command='oc.rhel8' registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-01-24-133352 --command-os=linux/amd64 -a tmp/config.json --to octest/ 2. 3.
Actual results:
failed to extract oc.rhel8
Expected results:
The OCP payload should contain oc.rhel8.
Additional info:
Description of problem:
For high scalability, we need an option to disable unused machine management control plane components.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create HostedCluster/HostedControlPlane 2. 3.
Actual results:
Machine management components (cluster-api, machine-approver, auto-scaler, etc) are deployed
Expected results:
There should be an option to disable them, as in some use cases they provide no utility.
Additional info:
Description of problem:
When the cluster is upgrading, scrolling the sidebar to the bottom of the Cluster Settings page shows blank space at the bottom.
Version-Release number of selected component (if applicable):
upgrade 4.14.0-0.nightly-2023-09-09-164123 to 4.14.0-0.nightly-2023-09-10-184037
How reproducible:
Always
Steps to Reproduce:
1. Launch a 4.14 cluster and trigger an upgrade. 2. Go to the "Cluster Settings" -> "Details" page during the upgrade and scroll the right sidebar down to the bottom. 3.
Actual results:
2. It's blank at the bottom
Expected results:
2. Should not show blank.
Additional info:
screenshot: https://drive.google.com/drive/folders/1DenrQTX7K0chbs9hG9ZbSZyY-viRRy1k?ths=true
https://drive.google.com/drive/folders/10dgToTxZf7gOfmL2Mp5gAMVQnM06XvAf?usp=sharing
Please review the following PR: https://github.com/openshift/containernetworking-plugins/pull/142
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The `oc adm inspect` command should not collect previous.log files that do not correspond to the --since/--since-time window.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. `oc adm inspect --since-time="2024-01-25T01:35:27Z" ns/openshift-multus`
Actual results:
previous.log files that do not correspond to the specified time are also collected.
Expected results:
Only the logs after --since/--since-time should be collected.
Additional info:
We want to use the latest version of CAPO in MAPO. We need to revendor CAPO to version 0.9 before the 4.16 FF.
There are several API changes that might require Matt's help.
Description of problem:
The oauthclients degraded condition never gets removed, meaning that once it is set due to an issue on a cluster, it won't be unset.
Version-Release number of selected component (if applicable):
How reproducible:
Sporadically, when the AuthStatusHandlerFailedApply condition is set on the console operator status conditions.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
[Multi-NIC] Egress traffic connections time out after removing another pod's label in the same namespace.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-08-024357
How reproducible:
Always
Steps to Reproduce:
1. Label one node as egress node 2. Create an egressIP object, egressIP was assigned to egress node secondary interface # oc get egressip -o yaml apiVersion: v1 items: - apiVersion: k8s.ovn.org/v1 kind: EgressIP metadata: annotations: kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"k8s.ovn.org/v1","kind":"EgressIP","metadata":{"annotations":{},"name":"egressip-66293"},"spec":{"egressIPs":["172.22.0.190"],"namespaceSelector":{"matchLabels":{"org":"qe"}},"podSelector":{"matchLabels":{"color":"pink"}}}} creationTimestamp: "2023-10-08T07:28:04Z" generation: 2 name: egressip-66293 resourceVersion: "461590" uid: f1ca3483-63f1-4f31-99b0-e6a55161c285 spec: egressIPs: - 172.22.0.190 namespaceSelector: matchLabels: org: qe podSelector: matchLabels: color: pink status: items: - egressIP: 172.22.0.190 node: worker-0 kind: List metadata: resourceVersion: "" 3. Created a namespace and two pod under it. % oc get pods -n hrw -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES hello-pod 1/1 Running 0 6m46s 10.129.2.7 worker-1 <none> <none> hello-pod1 1/1 Running 0 6s 10.131.0.14 worker-0 <none> <none> 4. Add label org=qe to namespace hrw # oc get ns hrw --show-labels NAME STATUS AGE LABELS hrw Active 21m kubernetes.io/metadata.name=hrw,*org=qe,*pod-security.kubernetes.io/audit-version=v1.24,pod-security.kubernetes.io/audit=restricted,pod-security.kubernetes.io/warn-version=v1.24,pod-security.kubernetes.io/warn=restricted 5. At this time, from both pods to access external endpoint, succeeded. % oc rsh -n hrw hello-pod ~ $ curl 172.22.0.1 --connect-timeout 5 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>404 Not Found</title> </head><body> <h1>Not Found</h1> <p>The requested URL was not found on this server.</p> </body></html> ~ $ exit % oc rsh -n hrw hello-pod1 ~ $ curl 172.22.0.1 --connect-timeout 5 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>404 Not Found</title> </head><body> <h1>Not Found</h1> <p>The requested URL was not found on this server.</p> </body></html> 6. Add label color=pink to both pods % oc label pod hello-pod color=pink -n hrw pod/hello-pod labeled % oc label pod hello-pod1 color=pink -n hrw pod/hello-pod1 labeled 7. Both pods can access external endpoint. 8. Remove label color=pink from pod hello-pod % oc label pod hello-pod color- -n hrw pod/hello-pod unlabeled
Actual results:
Accessing the external endpoint from the pod that keeps the label gets a connection timeout: % oc rsh -n hrw hello-pod1 ~ $ curl 172.22.0.1 --connect-timeout 5 curl: (28) Connection timeout after 5000 ms ~ $ ~ $ ~ $ curl 172.22.0.1 --connect-timeout 5 curl: (28) Connection timeout after 5000 ms Note the label was removed from hello-pod, but the access attempt is from the other pod, hello-pod1, which should still use the egressIP and be able to reach the endpoint.
Expected results:
Should be able to access external endpoint
Additional info:
Description of problem:
When creating an IAM role with a "path" (https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_identifiers.html#identifiers-friendly-names), its "principal", when applied to trust policies or VPC Endpoint Service allowed principals, confusingly does not include the path. That is, for the following rolesRef on a hostedcluster:
spec:
  platform:
    aws:
      rolesRef:
        controlPlaneOperatorARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-control-plane-operator
        imageRegistryARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-openshift-image-registry-installer-cloud-crede
        ingressARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-openshift-ingress-operator-cloud-credentials
        kubeCloudControllerARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-kube-controller-manager
        networkARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-openshift-cloud-network-config-controller-clou
        nodePoolManagementARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-capa-controller-manager
        storageARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-openshift-cluster-csi-drivers-ebs-cloud-creden
The actual valid principal that should be added to the VPC Endpoint Service's allowed principals is:
arn:aws:iam::765374464689:role/ad-int-path1-y4y2-kube-system-control-plane-operator
instead of
arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-control-plane-operator
However, for all other cases, the full ARN including the path should be used, e.g. https://github.com/openshift/hypershift/blob/082e880d0a492a357663d620fa58314a4a477730/hypershift-operator/controllers/hostedcluster/internal/platform/aws/aws.go#L237-L273
Version-Release number of selected component (if applicable):
4.14.1
How reproducible:
100%
Steps to Reproduce:
ROSA HCP-specific steps:
1. rosa create account-roles --path /anything/ -m auto -y
2. rosa create cluster --hosted-cp
3. ...etc
4a. Observe on the hosted cluster AWS Account that the VPC Endpoint cannot be created with the error: 'failed to create vpc endpoint: InvalidServiceName'
4b. Observe on the management cluster that CPO is failing to update the VPC Endpoint Service's allowed principals with the error: Client.InvalidPrincipal
5. If the contents of .spec.platform.aws.rolesRef.controlPlaneOperatorARN are manually applied to the additional allowed principals with the path component removed, then the problems are largely fixed on the hosted cluster side. VPC Endpoint is created, worker nodes can spin up, etc.
Actual results:
The VPC Endpoint Service is attempting and failing to get this applied to its additional allowed principals: arn:aws:iam::${ACCOUNT_ID}:role/path/name
Expected results:
The VPC Endpoint Service gets this applied to its additional allowed principals: arn:aws:iam::${ACCOUNT_ID}:role/name
Additional info:
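As an illustration of the two ARN forms involved, here is a small hedged helper (the function name is hypothetical and not taken from hypershift) that strips the IAM path from a role ARN to produce the path-less principal form the VPC Endpoint Service expects:
```go
package main

import (
	"fmt"
	"strings"
)

// principalARNWithoutPath turns arn:aws:iam::<account>:role/<path>/<name>
// into arn:aws:iam::<account>:role/<name>. Error handling for malformed
// ARNs is intentionally minimal; this is a sketch, not production code.
func principalARNWithoutPath(arn string) string {
	const marker = ":role/"
	i := strings.Index(arn, marker)
	if i < 0 {
		return arn
	}
	prefix, rest := arn[:i+len(marker)], arn[i+len(marker):]
	parts := strings.Split(rest, "/")
	return prefix + parts[len(parts)-1]
}

func main() {
	fmt.Println(principalARNWithoutPath(
		"arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-control-plane-operator"))
	// Output: arn:aws:iam::765374464689:role/ad-int-path1-y4y2-kube-system-control-plane-operator
}
```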
Description of problem:
openshift-install is unable to generate an aarch64 iso: FATAL failed to write asset (Agent Installer ISO) to disk: missing boot.catalog file
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100 %
Steps to Reproduce:
1. Create an install_config.yaml with controlplane.architecture and compute.architecture = arm64 2. openshift-install agent create image --log-level debug
Actual results:
DEBUG Generating Agent Installer ISO... INFO Consuming Install Config from target directory DEBUG Purging asset "Install Config" from disk INFO Consuming Agent Config from target directory DEBUG Purging asset "Agent Config" from disk DEBUG initDisk(): start DEBUG initDisk(): regular file FATAL failed to write asset (Agent Installer ISO) to disk: missing boot.catalog file
Expected results:
agent.aarch64.iso is created
Additional info:
Seems to be related to this PR: https://github.com/openshift/installer/pull/7896 boot.catalog is also referenced in the assisted-image-service here: https://github.com/openshift/installer/blob/master/vendor/github.com/openshift/assisted-image-service/pkg/isoeditor/isoutil.go#L155
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
These tests look to have been mistakenly deleted in 4.14 during the big monitortest refactor. We could use them right now to identify gcp jobs with spatter disruption.
[sig-network] there should be nearly zero single second disruptions for _
[sig-network] there should be reasonably few single second disruptions for _
Find out what happened and get them restored. Code is there but it looks like there are assumptions about extracting the backend name that may have been broken somewhere.
Description of problem:
When expanding a PVC of unit-less size (e.g., '2147483648'), the Expand PersistentVolumeClaim modal populates the spinner with a unit-less value (e.g., 2147483648) instead of a meaningful value.
Version-Release number of selected component (if applicable):
CNV - 4.14.3
How reproducible:
always
Steps to Reproduce:
1. Create a PVC and a Pod using the following YAML.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: task-pv-claim
spec:
  storageClassName: gp3-csi
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: "2147483648"
apiVersion: v1
kind: Pod
metadata:
  name: task-pv-pod
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - name: task-pv-storage
    persistentVolumeClaim:
      claimName: task-pv-claim
  containers:
  - name: task-pv-container
    image: nginx
    ports:
    - containerPort: 80
      name: "http-server"
    volumeMounts:
    - mountPath: "/usr/share/nginx/html"
      name: task-pv-storage
2. From the newly created PVC details page, click Actions > Expand PVC. 3. Note the value in the spinner input.
See https://drive.google.com/file/d/1toastX8rCBtUzx5M-83c9Xxe5iPA8fNQ/view for a demo
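As additional context, the two size forms are the same quantity to Kubernetes; a hedged sketch (not how the console actually computes it, just an illustration with the apimachinery resource package) of normalizing the unit-less request into the value the expand modal could pre-fill:
```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// The PVC spec stores the request as the unit-less decimal string "2147483648".
	q := resource.MustParse("2147483648")
	fmt.Println(q.Value()) // 2147483648 (bytes)

	// Re-expressing the same value in binary SI yields the human-friendly size
	// the Expand PersistentVolumeClaim modal could show instead.
	normalized := resource.NewQuantity(q.Value(), resource.BinarySI)
	fmt.Println(normalized.String()) // 2Gi
}
```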
Description of problem:
Install IPI cluster against 4.15 nightly build on Azure MAG and Azure Stack Hub or with Azure workload identity, image-registry co is degraded with different errors. On MAG: $ oc get co image-registry NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE image-registry 4.15.0-0.nightly-2024-02-16-235514 True False True 5h44m AzurePathFixControllerDegraded: Migration failed: panic: Get "https://imageregistryjima41xvvww.blob.core.windows.net/jima415a-hfxfh-image-registry-vbibdmawmsvqckhvmmiwisebryohfbtm?comp=list&prefix=docker&restype=container": dial tcp: lookup imageregistryjima41xvvww.blob.core.windows.net on 172.30.0.10:53: no such host... $ oc get pod -n openshift-image-registry NAME READY STATUS RESTARTS AGE azure-path-fix-ssn5w 0/1 Error 0 5h47m cluster-image-registry-operator-86cdf775c7-7brn6 1/1 Running 1 (5h50m ago) 5h58m image-registry-5c6796b86d-46lvx 1/1 Running 0 5h47m image-registry-5c6796b86d-9st5d 1/1 Running 0 5h47m node-ca-48lsh 1/1 Running 0 5h44m node-ca-5rrsl 1/1 Running 0 5h47m node-ca-8sc92 1/1 Running 0 5h47m node-ca-h6trz 1/1 Running 0 5h47m node-ca-hm7s2 1/1 Running 0 5h47m node-ca-z7tv8 1/1 Running 0 5h44m $ oc logs azure-path-fix-ssn5w -n openshift-image-registry panic: Get "https://imageregistryjima41xvvww.blob.core.windows.net/jima415a-hfxfh-image-registry-vbibdmawmsvqckhvmmiwisebryohfbtm?comp=list&prefix=docker&restype=container": dial tcp: lookup imageregistryjima41xvvww.blob.core.windows.net on 172.30.0.10:53: no such hostgoroutine 1 [running]: main.main() /go/src/github.com/openshift/cluster-image-registry-operator/cmd/move-blobs/main.go:49 +0x125 The blob storage endpoint seems not correct, should be: $ az storage account show -n imageregistryjima41xvvww -g jima415a-hfxfh-rg --query primaryEndpoints { "blob": "https://imageregistryjima41xvvww.blob.core.usgovcloudapi.net/", "dfs": "https://imageregistryjima41xvvww.dfs.core.usgovcloudapi.net/", "file": "https://imageregistryjima41xvvww.file.core.usgovcloudapi.net/", "internetEndpoints": null, "microsoftEndpoints": null, "queue": "https://imageregistryjima41xvvww.queue.core.usgovcloudapi.net/", "table": "https://imageregistryjima41xvvww.table.core.usgovcloudapi.net/", "web": "https://imageregistryjima41xvvww.z2.web.core.usgovcloudapi.net/" } On Azure Stack Hub: $ oc get co image-registry NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE image-registry 4.15.0-0.nightly-2024-02-16-235514 True False True 3h32m AzurePathFixControllerDegraded: Migration failed: panic: open : no such file or directory... $ oc get pod -n openshift-image-registry NAME READY STATUS RESTARTS AGE azure-path-fix-8jdg7 0/1 Error 0 3h35m cluster-image-registry-operator-86cdf775c7-jwnd4 1/1 Running 1 (3h38m ago) 3h54m image-registry-658669fbb4-llv8z 1/1 Running 0 3h35m image-registry-658669fbb4-lmfr6 1/1 Running 0 3h35m node-ca-2jkjx 1/1 Running 0 3h35m node-ca-dcg2v 1/1 Running 0 3h35m node-ca-q6xmn 1/1 Running 0 3h35m node-ca-r46r2 1/1 Running 0 3h35m node-ca-s8jkb 1/1 Running 0 3h35m node-ca-ww6ql 1/1 Running 0 3h35m $ oc logs azure-path-fix-8jdg7 -n openshift-image-registry panic: open : no such file or directorygoroutine 1 [running]: main.main() /go/src/github.com/openshift/cluster-image-registry-operator/cmd/move-blobs/main.go:36 +0x145 On cluster with Azure workload identity: Some operator's PROGRESSING is True image-registry 4.15.0-0.nightly-2024-02-16-235514 True True False 43m Progressing: The deployment has not completed... pod azure-path-fix is in CreateContainerConfigError status, and get error in its Event. 
"state": { "waiting": { "message": "couldn't find key REGISTRY_STORAGE_AZURE_ACCOUNTKEY in Secret openshift-image-registry/image-registry-private-configuration", "reason": "CreateContainerConfigError" } }
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-02-16-235514
How reproducible:
Always
Steps to Reproduce:
1. Install IPI cluster on MAG or Azure Stack Hub or config Azure workload identity 2. 3.
Actual results:
Installation failed and image-registry operator is degraded
Expected results:
Installation is successful.
Additional info:
The issue seems to be related to https://github.com/openshift/image-registry/pull/393
Description of problem:
Today, egress firewall rules with 'nodeSelector' only use the node IP in the OVN ACL rule, but a node may have secondary IPs besides the node IP. We should create the ACL with all possible IPs of the selected node.
Version-Release number of selected component (if applicable):
How reproducible:
Create an egress firewall rule with 'nodeSelector'.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Last payload showed several occurrences of this problem seemingly surfacing the same way:
Example jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-sdn-techpreview-serial/1736680969496170496 (blocked the payload)
Looking at sippy, both tests took a dip in pass rate on the 16th, meaning a regression may have merged late on the 15th (Friday) or somewhere on the 16th (less likely).
The problem kills the install and thus we are getting no intervals charts to help debug.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When the user clicks on the perspective switcher after a hard refresh, a flicker appears.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-25-100326
How reproducible:
Always, after the user refreshes the console.
Steps to Reproduce:
1. Log in to the OCP console. 2. Refresh the whole console, then click the perspective switcher. 3.
Actual results:
There is a flicker when clicking on the perspective switcher.
Expected results:
No flicker.
Additional info:
screen recording https://drive.google.com/file/d/1_2tPZ0DXNTapFP9sSz27vKbnwxxdWZSV/view?usp=drive_link
Description of problem:
Upgrading OCP from 4.14.7 to 4.15.0 nightly build failed on Provider cluster which is part of provider-client setup. Platform: IBM Cloud Bare Metal cluster. Steps done: Step 1. $ oc patch clusterversions/version -p '{"spec":{"channel":"stable-4.15"}}' --type=merge clusterversion.config.openshift.io/version patched Step 2: $ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-18-050837 --allow-explicit-upgrade --force warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures. Requesting update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-18-050837 The cluster was not upgraded successfully. $ oc get clusteroperator | grep -v "4.15.0-0.nightly-2024-01-18-050837 True False False" NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.15.0-0.nightly-2024-01-18-050837 True False True 111s APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()... console 4.15.0-0.nightly-2024-01-18-050837 False False False 111s RouteHealthAvailable: console route is not admitted dns 4.15.0-0.nightly-2024-01-18-050837 True True False 12d DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5.\nHave 5 available node-resolver pods, want 6." etcd 4.15.0-0.nightly-2024-01-18-050837 True False True 12d EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:14147288297306253147 name:"baremetal2-06.qe.rh-ocs.com" peerURLs:"https://52.116.161.167:2380" clientURLs:"https://52.116.161.167:2379" Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://52.116.161.167:2379]: context deadline exceeded} {Member:ID:15369339084089827159 name:"baremetal2-03.qe.rh-ocs.com" peerURLs:"https://52.116.161.164:2380" clientURLs:"https://52.116.161.164:2379" Healthy:true Took:9.617293ms Error:<nil>} {Member:ID:17481226479420161008 name:"baremetal2-04.qe.rh-ocs.com" peerURLs:"https://52.116.161.165:2380" clientURLs:"https://52.116.161.165:2379" Healthy:true Took:9.090133ms Error:<nil>}]... image-registry 4.15.0-0.nightly-2024-01-18-050837 True True False 12d Progressing: All registry resources are removed... 
machine-config 4.14.7 True True True 7d22h Unable to apply 4.15.0-0.nightly-2024-01-18-050837: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, MachineConfigPool master has not progressed to latest configuration: controller version mismatch for rendered-master-9b7e02d956d965d0906def1426cb03b5 expected eaab8f3562b864ef0cc7758a6b19cc48c6d09ed8 has 7649b9274cde2fb50a61a579e3891c8ead2d79c5: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-34b4781f1a0fe7119765487c383afbb3, retrying]] monitoring 4.15.0-0.nightly-2024-01-18-050837 False True True 7m54s UpdatingUserWorkloadPrometheus: client rate limiter Wait returned an error: context deadline exceeded, UpdatingUserWorkloadThanosRuler: waiting for ThanosRuler object changes failed: waiting for Thanos Ruler openshift-user-workload-monitoring/user-workload: context deadline exceeded network 4.15.0-0.nightly-2024-01-18-050837 True True False 12d DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 2 nodes)... node-tuning 4.15.0-0.nightly-2024-01-18-050837 True True False 98m Working towards "4.15.0-0.nightly-2024-01-18-050837" $ oc get machineconfigpool NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-9b7e02d956d965d0906def1426cb03b5 False True True 3 0 0 1 12d worker rendered-worker-4f54b43e9f934f0659761929f55201a1 False True True 3 1 1 1 12d $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.7 True True 120m Unable to apply 4.15.0-0.nightly-2024-01-18-050837: an unknown error has occurred: MultipleErrors $ oc get nodes NAME STATUS ROLES AGE VERSION baremetal2-01.qe.rh-ocs.com Ready worker 12d v1.27.8+4fab27b baremetal2-02.qe.rh-ocs.com Ready worker 12d v1.27.8+4fab27b baremetal2-03.qe.rh-ocs.com Ready control-plane,master,worker 12d v1.27.8+4fab27b baremetal2-04.qe.rh-ocs.com Ready control-plane,master,worker 12d v1.27.8+4fab27b baremetal2-05.qe.rh-ocs.com Ready worker 12d v1.28.5+c84a6b8 baremetal2-06.qe.rh-ocs.com Ready,SchedulingDisabled control-plane,master,worker 12d v1.27.8+4fab27b ---------------------------------------------------- During the efforts to bring the cluster back to a good state, these steps were done: The node baremetal2-06.qe.rh-ocs.com was uncordoned. Tried to upgrade to using the command $ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-22-051500 --allow-explicit-upgrade --force --allow-upgrade-with-warnings=true warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures. 
warning: --allow-upgrade-with-warnings is bypassing: the cluster is already upgrading: Reason: ClusterOperatorsDegraded Message: Unable to apply 4.15.0-0.nightly-2024-01-18-050837: wait has exceeded 40 minutes for these operators: etcd, kube-apiserverRequesting update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-22-051500 Upgrade to 4.15.0-0.nightly-2024-01-22-051500 also was not successful. Node baremetal2-01.qe.rh-ocs.com was drained manually to see if that works. Some clusteroperators stayed on the previous version. Some moved to Degraded state. $ oc get machineconfigpool NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-9b7e02d956d965d0906def1426cb03b5 False True False 3 1 1 0 13d worker rendered-worker-4f54b43e9f934f0659761929f55201a1 False True True 3 1 1 1 13d $ oc get pdb -n openshift-storage NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE rook-ceph-mds-ocs-storagecluster-cephfilesystem 1 N/A 1 11d rook-ceph-mon-pdb N/A 1 1 11d rook-ceph-osd N/A 1 1 3h17m $ oc rsh rook-ceph-tools-57fd4d4d68-p2psh ceph osd tree ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF -1 5.23672 root default -5 1.74557 host baremetal2-01-qe-rh-ocs-com 1 ssd 0.87279 osd.1 up 1.00000 1.00000 4 ssd 0.87279 osd.4 up 1.00000 1.00000 -7 1.74557 host baremetal2-02-qe-rh-ocs-com 3 ssd 0.87279 osd.3 up 1.00000 1.00000 5 ssd 0.87279 osd.5 up 1.00000 1.00000 -3 1.74557 host baremetal2-05-qe-rh-ocs-com 0 ssd 0.87279 osd.0 up 1.00000 1.00000 2 ssd 0.87279 osd.2 up 1.00000 1.00000 OCP must-gather logs - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/hcp414-aaa/hcp414-aaa_20240112T084548/logs/must-gather-ibm-bm2-provider/must-gather.local.1079362865726528648/
Version-Release number of selected component (if applicable):
Initial version: OCP 4.14.7 ODF 4.14.4-5.fusion-hci OpenShift Virtualization: kubevirt-hyperconverged-operator.4.16.0-380 Local Storage: local-storage-operator.v4.14.0-202312132033 OpenShift Data Foundation Client : ocs-client-operator.v4.14.4-5.fusion-hci
How reproducible:
Reporting the first occurrence of the issue.
Steps to Reproduce:
1. On a Provider-client HCI setup, upgrade the provider cluster to a nightly build of OCP.
Actual results:
OCP upgrade not successful. Some operators become degraded. The worker machineconfigpool has 1 degraded machine.
Expected results:
OCP upgrade from 4.14.7 to the nightly build should succeed.
Additional info:
There are 3 hosted clients present
oc-mirror - maxVersion of the imageset config is ignored for operators
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create 2 imageset that we are using: _imageset-config-test1-1.yaml:_ ~~~ kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v1alpha2 storageConfig: local: path: /local/oc-mirror/test1/metadata mirror: platform: architectures: - amd64 graph: true channels: - name: stable-4.12 type: ocp minVersion: 4.12.1 maxVersion: 4.12.1 shortestPath: true operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12 packages: - name: cincinnati-operator channels: - name: v1 minVersion: 5.0.1 maxVersion: 5.0.1 ~~~ _imageset-config-test1-2.yaml:_ ~~~ kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v1alpha2 storageConfig: local: path: /local/oc-mirror/test1/metadata mirror: platform: architectures: - amd64 graph: true channels: - name: stable-4.12 type: ocp minVersion: 4.12.1 maxVersion: 4.12.1 shortestPath: true operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12 packages: - name: cincinnati-operator channels: - name: v1 minVersion: 5.0.1 maxVersion: 5.0.1 - name: local-storage-operator channels: - name: stable minVersion: 4.12.0-202305262042 maxVersion: 4.12.0-202305262042 - name: odf-operator channels: - name: stable-4.12 minVersion: 4.12.4-rhodf maxVersion: 4.12.4-rhodf - name: rhsso-operator channels: - name: stable minVersion: 7.6.4-opr-002 maxVersion: 7.6.4-opr-002 - catalog: registry.redhat.io/redhat/redhat-marketplace-index:v4.12 packages: - name: k10-kasten-operator-rhmp channels: - name: stable minVersion: 6.0.6 maxVersion: 6.0.6 additionalImages: - name: registry.redhat.io/rhel8/postgresql-13:1-125 ~~~ 2. Generate a first .tar file from the first imageset-config file (imageset-config-test1-1.yaml) oc mirror --config=imageset-config-test1-1.yaml file:///local/oc-mirror/test1 3. Use the first .tar file to populate our registry oc mirror --from=/root/oc-mirror/test1/mirror_seq1_000000.tar docker://registry-url/oc-mirror1 4.Generate a second .tar file from the second imageset-config file (imageset-config-test1-2.yaml) oc mirror --config=imageset-config-test1-2.yaml file:///local/oc-mirror/test1 5. Populate the private registry named `oc-mirror1` with the second .tar file: oc mirror --from=/root/oc-mirror/test1/mirror_seq2_000000.tar docker://registry-url/oc-mirror1 6. Check the catalog index for **odf** and **rhsso** operators [root@test ~]# oc-mirror list operators --package odf-operator --catalog=registry-url/oc-mirror1/redhat/redhat-operator-index:v4.12 --channel stable-4.12 VERSIONS 4.12.7-rhodf 4.12.8-rhodf 4.12.4-rhodf 4.12.5-rhodf 4.12.6-rhodf [root@test ~]# oc-mirror list operators --package rhsso-operator --catalog=registry-url/oc-mirror1/redhat/redhat-operator-index:v4.12 --channel stable VERSIONS 7.6.4-opr-002 7.6.4-opr-003 7.6.5-opr-001 7.6.5-opr-002
Actual results:
Check the catalog index for **odf** and **rhsso** operators. oc-mirror is not respecting the minVersion & maxVersion [root@test ~]# oc-mirror list operators --package odf-operator --catalog=registry-url/oc-mirror1/redhat/redhat-operator-index:v4.12 --channel stable-4.12 VERSIONS 4.12.7-rhodf 4.12.8-rhodf 4.12.4-rhodf 4.12.5-rhodf 4.12.6-rhodf [root@test ~]# oc-mirror list operators --package rhsso-operator --catalog=registry-url/oc-mirror1/redhat/redhat-operator-index:v4.12 --channel stable VERSIONS 7.6.4-opr-002 7.6.4-opr-003 7.6.5-opr-001 7.6.5-opr-002
Expected results:
oc-mirror should respect the minVersion & maxVersion [root@test ~]# oc-mirror list operators --package odf-operator --catalog=registry-url/oc-mirror2/redhat/redhat-operator-index:v4.12 --channel stable-4.12 VERSIONS 4.12.4-rhodf [root@test ~]# oc-mirror list operators --package rhsso-operator --catalog=registry-url/oc-mirror2/redhat/redhat-operator-index:v4.12 --channel stable VERSIONS 7.6.4-opr-002
Additional info:
Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/153
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The node selector for the console deployment requires deploying it on the master nodes, while the replica count is determined by the infrastructureTopology, which primarily tracks the workers' setup. When an OpenShift cluster is installed with a single master node and multiple workers, this leads the console deployment to request 2 replicas, since infrastructureTopology is set to HighlyAvailable while controlPlaneTopology is set to SingleReplica as expected.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Install an openshift cluster with 1 master and 2 workers
Actual results:
The installation fails as the replicas for the console deployment is set to 2. apiVersion: config.openshift.io/v1 kind: Infrastructure metadata: creationTimestamp: "2024-01-18T08:34:47Z" generation: 1 name: cluster resourceVersion: "517" uid: d89e60b4-2d9c-4867-a2f8-6e80207dc6b8 spec: cloudConfig: key: config name: cloud-provider-config platformSpec: aws: {} type: AWS status: apiServerInternalURI: https://api-int.adstefa-a12.qe.devcluster.openshift.com:6443 apiServerURL: https://api.adstefa-a12.qe.devcluster.openshift.com:6443 controlPlaneTopology: SingleReplica cpuPartitioning: None etcdDiscoveryDomain: "" infrastructureName: adstefa-a12-6wlvm infrastructureTopology: HighlyAvailable platform: AWS platformStatus: aws: region: us-east-2 type: AWS apiVersion: apps/v1 kind: Deployment metadata: annotations: .... creationTimestamp: "2024-01-18T08:54:23Z" generation: 3 labels: app: console component: ui name: console namespace: openshift-console spec: progressDeadlineSeconds: 600 replicas: 2
Expected results:
The replica count is set to 1, tracking the ControlPlaneTopology value instead of the infrastructureTopology.
Additional info:
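A hedged sketch of the expected behaviour, using the openshift/api config types (the function name is hypothetical, not the console-operator's actual code): derive the console replica count from ControlPlaneTopology, which describes the master nodes the deployment is pinned to, rather than from InfrastructureTopology.
```go
package main

import (
	configv1 "github.com/openshift/api/config/v1"
)

// consoleReplicas picks the replica count from the control-plane topology,
// since the console deployment's node selector targets the master nodes.
func consoleReplicas(infra *configv1.Infrastructure) int32 {
	if infra.Status.ControlPlaneTopology == configv1.SingleReplicaTopologyMode {
		return 1
	}
	return 2
}
```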
Description of problem:
If the authentication.config/cluster Type=="" but the OAuth/User APIs are already missing, the console-operator won't update the authentication.config/cluster status with its own client, as it crashes on being unable to retrieve OAuthClients.
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
100%
Steps to Reproduce:
1. scale oauth-apiserver to 0 2. set featuregates to TechPreviewNotUpgradable 3. watch the authentication.config/cluster .status.oidcClients
Actual results:
The client for the console does not appear.
Expected results:
The client for the console should appear.
Additional info:
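A hedged illustration of one fix direction (the helper name is hypothetical; this is not the console-operator's actual code): tolerate the OAuth API being absent instead of crashing, treating a missing OAuthClients resource as an empty result so the operator can still update the authentication status.
```go
package main

import (
	"context"

	oauthv1 "github.com/openshift/api/oauth/v1"
	oauthclient "github.com/openshift/client-go/oauth/clientset/versioned"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// listOAuthClientsTolerant returns an empty list when the OAuthClients API is
// not served, instead of propagating a fatal error that stops reconciliation.
func listOAuthClientsTolerant(ctx context.Context, c oauthclient.Interface) ([]oauthv1.OAuthClient, error) {
	list, err := c.OauthV1().OAuthClients().List(ctx, metav1.ListOptions{})
	if meta.IsNoMatchError(err) || apierrors.IsNotFound(err) {
		return nil, nil
	}
	if err != nil {
		return nil, err
	}
	return list.Items, nil
}
```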
Please review the following PR: https://github.com/openshift/cloud-provider-aws/pull/59
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-network-operator/pull/2308
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
In PSI, BE master ~2.30 - a massive amount of the following message appears in cluster events: "Cluster was updated with api-vip <IP ADDRESS>, ingress-vip <IP ADDRESS>".
This message repeats itself every minute x 5 (I guess this is related to the number of hosts?).
The installation was started, but was aborted due to network connection issues.
I've tried to reproduce in staging, but couldn't.
How reproducible:
Still checking
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
This message should not be shown more than once
Currently the openshift-baremetal-install binary is dynamically linked to libvirt-client, meaning that it is only possible to run it on a RHEL system with libvirt installed.
A new version of the libvirt bindings, v1.8010.0, allows the library to be loaded only on demand, so that users who do not execute any libvirt code can run the rest of the installer without needing to install libvirt. (See this comment from Dan Berrangé.) In practice, the "rest of the installer" is everything except the baremetal destroy cluster command (which destroys the bootstrap storage pool - though only if the bootstrap itself has already been successfully destroyed - and has probably never been used by anybody ever). The Terraform providers all run in a separate binary.
There is also a pure-go libvirt library that can be used even within a statically-linked binary on any platform, even when interacting with libvirt. The libvirt terraform provider that does almost all of our interaction with libvirt already uses this library.
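As a rough illustration of the pure-Go option, the sketch below talks to the local libvirt daemon without any linked C library. It assumes the github.com/digitalocean/go-libvirt client and the default local socket path; the exact constructor and method set may differ between versions of that library, so treat this as an assumption rather than the installer's actual code.

// Sketch only: querying libvirt through the pure-Go RPC client, so the binary
// can be statically linked and still interact with libvirt when needed.
package main

import (
	"fmt"
	"log"
	"net"
	"time"

	"github.com/digitalocean/go-libvirt"
)

func main() {
	// Dial the local libvirt socket directly; no libvirt-client shared library is involved.
	c, err := net.DialTimeout("unix", "/var/run/libvirt/libvirt-sock", 2*time.Second)
	if err != nil {
		log.Fatalf("failed to dial libvirt: %v", err)
	}

	l := libvirt.New(c)
	if err := l.Connect(); err != nil {
		log.Fatalf("failed to connect: %v", err)
	}
	defer l.Disconnect()

	// A trivial call to show the connection works; real code would manage the
	// bootstrap storage pool here.
	v, err := l.Version()
	if err != nil {
		log.Fatalf("failed to retrieve libvirt version: %v", err)
	}
	fmt.Println("libvirt version:", v)
}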
Please review the following PR: https://github.com/openshift/operator-framework-rukpak/pull/67
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Release controller > 4.14.2 > HyperShift conformance run > gathered assets:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.user.username == "system:admin" and .verb == "create" and .requestURI == "/apis/operator.openshift.io/v1/storages") | .userAgent' | sort | uniq -c
  65 hosted-cluster-config-operator-manager

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.user.username == "system:admin" and .verb == "create" and .requestURI == "/apis/operator.openshift.io/v1/storages") | .requestReceivedTimestamp + " " + (.responseStatus | (.code | tostring) + " " + .reason)' | head -n5
2023-11-09T17:17:15.130454Z 409 AlreadyExists
2023-11-09T17:17:15.163256Z 409 AlreadyExists
2023-11-09T17:17:15.198908Z 409 AlreadyExists
2023-11-09T17:17:15.230532Z 409 AlreadyExists
2023-11-09T17:17:22.899579Z 409 AlreadyExists
That's banging away pretty hard with creation attempts that keep getting 409ed, presumably because an earlier creation attempt succeeded. If the controller needs very quick latency in re-creation, perhaps an informing watch? If the controller can handle some re-creation latency, perhaps a quieter poll?
4.14.2. I haven't checked other releases.
Likely 100%. I saw similar behavior in an unrelated dump, and confirmed the busy 409s in the first CI run I checked.
1. Dump a hosted cluster.
2. Inspect its audit logs for hosted-cluster-config-operator-manager create activity.
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.userAgent == "hosted-cluster-config-operator-manager" and .verb == "create") | .verb + " " + (.responseStatus.code | tostring)' | sort | uniq -c
 130 create 409
Zero or rare 409 creation requests from this user agent.
The user agent seems to be defined here, so likely the fix will involve changes to that manager.
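As a minimal sketch of the quieter behaviour suggested above (not the actual hosted-cluster-config-operator code), a reconciler can tolerate AlreadyExists instead of retrying creates in a tight loop; the ensureStorage helper name and the use of a controller-runtime client are assumptions for illustration.

// Hypothetical sketch: create the object once and treat AlreadyExists as success,
// so the reconciler stops hammering the API server with creates that can only be 409ed.
package reconcile

import (
	"context"

	operatorv1 "github.com/openshift/api/operator/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureStorage creates the cluster-scoped Storage object if it does not exist
// and returns nil when another actor already created it.
func ensureStorage(ctx context.Context, c client.Client) error {
	storage := &operatorv1.Storage{
		ObjectMeta: metav1.ObjectMeta{Name: "cluster"},
		Spec: operatorv1.StorageSpec{
			OperatorSpec: operatorv1.OperatorSpec{ManagementState: operatorv1.Managed},
		},
	}
	if err := c.Create(ctx, storage); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return nil
}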
Please review the following PR: https://github.com/openshift/csi-driver-manila-operator/pull/213
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/160
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
A table in a dashboard relies on the order of the metric labels to merge results
Create a dashboard with a table including this query:
label_replace(sort_desc(sum(sum_over_time(ALERTS{alertstate="firing"}[24h])) by ( alertstate, alertname)), "aaa", "$1", "alertstate", "(.+)")
A single row will be displayed as the query is simulating that the first label `aaa` has a single value.
Expected result:
The table should not rely on a single metric label to merge results but consider all the labels so the expected rows are displayed.
Add an e2e check that the SA token is not mounted on management-side workloads unless necessary. This could be an extension of the needManamgementKasAccess check.
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/60
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The e2e-aws-ovn-shared-to-local-gateway-mode-migration and e2e-aws-ovn-local-to-shared-gateway-mode-migration jobs fail about 50% of the time with:

+ oc patch Network.operator.openshift.io cluster --type=merge --patch '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"gatewayConfig":{"routingViaHost":false}}}}}'
network.operator.openshift.io/cluster patched
+ oc wait co network --for=condition=PROGRESSING=True --timeout=60s
error: timed out waiting for the condition on clusteroperators/network
The ci/prow/test.local pipeline is currently broken because the build04 cluster assigned to it in the build farm is a bit slow, pushing github.com/thanos-io/thanos/pkg/store over the default 900s timeout.
panic: test timed out after 15m0s
running tests:
	TestTSDBStoreSeries (4m3s)
	TestTSDBStoreSeries/1SeriesWith10000000Samples (2m58s)
Extending the timeout makes the test pass:
ok github.com/thanos-io/thanos/pkg/store 984.344s
We'll be addressing this alongside a follow-up issue to address this with an env var in upstream Thanos.
Description of problem:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.20   True        False         43h     Cluster version is 4.11.20

$ oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: "2023-01-11T13:16:47Z"
  name: system:openshift:kube-controller-manager:gce-cloud-provider
  resourceVersion: "6079"
  uid: 82a81635-4535-4a51-ab83-d2a1a5b9a473
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:kube-controller-manager:gce-cloud-provider
subjects:
- kind: ServiceAccount
  name: cloud-provider
  namespace: kube-system

$ oc get sa cloud-provider -n kube-system
Error from server (NotFound): serviceaccounts "cloud-provider" not found

The serviceAccount cloud-provider does not exist, neither in kube-system nor in any other namespace. It is therefore not clear what this ClusterRoleBinding does, which use case it fulfills, and why it references a non-existing serviceAccount. From a security point of view, it is recommended to remove references to non-existing serviceAccounts from ClusterRoleBindings, as a potential attacker could abuse the current state by creating the necessary serviceAccount and gaining undesired permissions.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4 (all version from what we have found)
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4
2. Run oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml
Actual results:
$ oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: "2023-01-11T13:16:47Z"
  name: system:openshift:kube-controller-manager:gce-cloud-provider
  resourceVersion: "6079"
  uid: 82a81635-4535-4a51-ab83-d2a1a5b9a473
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:kube-controller-manager:gce-cloud-provider
subjects:
- kind: ServiceAccount
  name: cloud-provider
  namespace: kube-system

$ oc get sa cloud-provider -n kube-system
Error from server (NotFound): serviceaccounts "cloud-provider" not found
Expected results:
The serviceAccount called cloud-provider to exist or otherwise the ClusterRoleBinding to be removed.
Additional info:
Finding related to a Security review done on the OpenShift Container Platform 4 - Platform
Description of the problem:
BE master ~2.30 - in feature support api - VIP_AUTO_ALLOC is dev_preview for 4.15 - should be unavailable
How reproducible:
100%
Steps to reproduce:
1. GET https://<SERVICE_ADDRESS>/api/assisted-install/v2/support-levels/features?openshift_version=4.15&cpu_architecture=x86_64
2. Check the support levels in the BE response
3.
Actual results:
VIP_AUTO_ALLOC is dev_preview for 4.15
Expected results:
VIP_AUTO_ALLOC should be unavailable
Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/70
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/102
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cloud-provider-alibaba-cloud/pull/40
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/node_exporter/pull/141
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/api/pull/1700
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/sdn/pull/596
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-dns-operator/pull/397
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/thanos/pull/134
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Recent periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade failure caused by
[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers
{ event [namespace/openshift-machine-api node/ci-op-j666c60n-23cd9-nb7wr-master-1 pod/cluster-baremetal-operator-79b78c4548-n5vrt hmsg/b7cb271b13 - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-79b78c4548-n5vrt_openshift-machine-api(32835332-fc25-4ddf-84ce-d3aa447d3ce0)] happened 25 times}
Shows up in Component Readiness as an unknown component:
ovn upgrade-minor amd64 gcp rt > Unknown > [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers
We should update testBackoffStartingFailedContainer to use known namespaces, creating a junit per known namespace indicating pass, fail, or flake.
The code already handles this for e2e namespaces via testBackoffStartingFailedContainerForE2ENamespaces.
It looks as though we only check for known or e2e namespaces; we need to double-check that, if that is the case, we are OK with events from unknown namespaces potentially getting through.
We should also review Sippy pathological tests for results that don't contain `for ns/namespace` format and review if they need to be broken out as well.
For each test that we break out, we need to map the new namespace-specific test to the correct component in the test mapping repository (see the sketch below).
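The following is only a hypothetical sketch of the idea described above, not the origin monitor-test code: it turns per-namespace Back-off event counts into one junit-style result per known namespace, with a catch-all result for everything else so unknown-namespace events still surface. All names here (knownNamespaces, resultsForBackoffEvents, threshold) are illustrative.

// Hypothetical sketch: emit a namespaced test per known namespace so failures
// map to a component, plus one catch-all test for unknown namespaces.
package pathological

import "fmt"

var knownNamespaces = map[string]bool{
	"openshift-machine-api": true,
	"openshift-etcd":        true,
	// ... remaining platform namespaces
}

type result struct {
	testName string
	failed   bool
}

// resultsForBackoffEvents produces one result per known namespace and a single
// catch-all result covering namespaces without their own test.
func resultsForBackoffEvents(countsByNamespace map[string]int, threshold int) []result {
	var results []result
	catchAllFailed := false
	for ns, count := range countsByNamespace {
		if knownNamespaces[ns] {
			results = append(results, result{
				testName: fmt.Sprintf("pathological event should not see excessive Back-off restarting failed containers for ns/%s", ns),
				failed:   count > threshold,
			})
			continue
		}
		if count > threshold {
			catchAllFailed = true
		}
	}
	results = append(results, result{
		testName: "pathological event should not see excessive Back-off restarting failed containers",
		failed:   catchAllFailed,
	})
	return results
}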
Enable websockets (https://github.com/kubernetes/enhancements/issues/4006) in 4.16 so https://issues.redhat.com/browse/OCPBUGS-20515 can test whether websockets allow idle connections to be timed out.
Description of problem:
apbexternalroute and egressfirewall status shows empty on hypershift hosted cluster
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-17-173511
How reproducible:
always
Steps to Reproduce:
1. Set up hypershift and log in to the hosted cluster:
% oc get node
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-128-55.us-east-2.compute.internal    Ready    worker   125m   v1.28.4+7aa0a74
ip-10-0-129-197.us-east-2.compute.internal   Ready    worker   125m   v1.28.4+7aa0a74
ip-10-0-135-106.us-east-2.compute.internal   Ready    worker   125m   v1.28.4+7aa0a74
ip-10-0-140-89.us-east-2.compute.internal    Ready    worker   125m   v1.28.4+7aa0a74
2. Create a new project test:
% oc new-project test
3. Create an apbexternalroute and an egressfirewall on the hosted cluster.
apbexternalroute yaml file:
---
apiVersion: k8s.ovn.org/v1
kind: AdminPolicyBasedExternalRoute
metadata:
  name: apbex-route-policy
spec:
  from:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: test
  nextHops:
    static:
    - ip: "172.18.0.8"
    - ip: "172.18.0.9"
% oc apply -f apbexroute.yaml
adminpolicybasedexternalroute.k8s.ovn.org/apbex-route-policy created
egressfirewall yaml file:
---
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
spec:
  egress:
  - type: Allow
    to:
      cidrSelector: 0.0.0.0/0
% oc apply -f egressfw.yaml
egressfirewall.k8s.ovn.org/default created
4. oc get apbexternalroute and oc get egressfirewall
Actual results:
The status shows empty:

% oc get apbexternalroute
NAME                 LAST UPDATE   STATUS
apbex-route-policy   49s                       <--- status is empty

% oc describe apbexternalroute apbex-route-policy | tail -n 8
Status:
  Last Transition Time:  2023-12-19T06:54:17Z
  Messages:
    ip-10-0-135-106.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    ip-10-0-129-197.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    ip-10-0-128-55.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    ip-10-0-140-89.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
Events: <none>

% oc get egressfirewall
NAME      EGRESSFIREWALL STATUS
default                            <--- status is empty

% oc describe egressfirewall default | tail -n 8
    Type:  Allow
Status:
  Messages:
    ip-10-0-129-197.us-east-2.compute.internal: EgressFirewall Rules applied
    ip-10-0-128-55.us-east-2.compute.internal: EgressFirewall Rules applied
    ip-10-0-140-89.us-east-2.compute.internal: EgressFirewall Rules applied
    ip-10-0-135-106.us-east-2.compute.internal: EgressFirewall Rules applied
Events: <none>
Expected results:
The status should be shown correctly.
Additional info:
Rukpak – an alpha tech preview API – has pushed a breaking change upstream. This bug tracks the need for us to disable and then reenable the cluster-olm-operator and platform-operators components which both depend on rukpak in order to push the breaking API change. This bug can be closed once those components are all updated and available on the cluster again.
Description of problem:
In a dualstack cluster, we use IPv4 URLs for callbacks to Ironic and Inspector. If the host only has IPv6 networking, the provisioning will fail. This issue affects both the normal IPI and ZTP with the converged flow.
A similar issue has been fixed as part of METAL-163 where we use the BMC's address family to determine which URL to send to it. This bug is somewhat simpler: we can provide IPA with several URLs and let it decide which one it can use. This way, only small changes to IPA itself, ICC and CBO are required.
The fix will only affect virtual media deployments without provisioning network or with virtualMediaViaExternalNetwork:true. We don't have a good dualstack story around provisioning networks anyway.
Upstream IPA request: https://bugs.launchpad.net/ironic/+bug/2045548
Description of problem:
Configuration files applied via the API do not have any effect on the configuration of the bare metal host.
Version-Release number of selected component (if applicable):
OpenShift 4.12.42 with ACM 2.8
How reproducible:
Reproducible.
Steps to Reproduce:
1. Applied nmstateconfig via 'oc apply'; realized after it booted that the subnet prefix was incorrect.
2. Deleted bmh and nmstateconfig.
3. Applied correct config via 'oc apply'; machine boots with the 1st config still.
4. Deleted bmh and nmstateconfig.
5. Created host via BMC form in GUI with correct config. Machine boots with correct config.
6. Tested deleting bmh and nmstateconfig, and creating a new machine by just applying the bmh file with zero network config; machine boots again with networking from step 5.
Actual results:
Bare metal host does not get config via 'oc apply'.
Expected results:
'oc apply -f nmstateconfig.yaml' should work to apply networking configuration.
Additional info:
HP Synergy 480 Gen10 (871942-B21), UEFI boot with Redfish virtual media, static IP with bonding.
Recently a user was attempting to change the Virtual Machine Folder for a cluster installed on vSphere. The user used the configuration panel "vSphere Connection Configuration" to complete this process. Upon updating the path and clicking "Save Configuration" cluster wide issues emerged including nodes not coming back online after a reboot.
OpenShift nodes eventually crashed with an error caused by an incorrectly parsed folder path, as the literal space (" ") characters were dropped.
While this was exhibited on OCP 4.13, other versions may be affected.
Description of problem:
The HyperShift operator applies control-plane-pki-operator RBAC resources regardless of whether PKI reconciliation is disabled for the HostedCluster.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
100%
Steps to Reproduce:
1. Create a 4.15 HostedCluster with PKI reconciliation disabled
2. Observe that unused RBAC resources for control-plane-pki-operator are created
Actual results:
Unused RBAC resources for control-plane-pki-operator are created.
Expected results:
RBAC resources for control-plane-pki-operator should not be created if the deployment for control-plane-pki-operator itself is not created.
Additional info:
Description of problem:
The installer does not precheck whether the node architecture and the VM type are consistent on AWS and GCP; the check works on Azure.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-multi-2023-12-06-195439
How reproducible:
Always
Steps to Reproduce:
1. Configure the compute architecture field to arm64 but choose an amd64 instance type in install-config
2. Create the cluster
3. Check the installation
Actual results:
Azure prechecks whether the architecture is consistent with the instance type when creating manifests, for example:

12-07 11:18:24.452 [INFO] Generating manifests files.....
12-07 11:18:24.452 level=info msg=Credentials loaded from file "/home/jenkins/ws/workspace/ocp-common/Flexy-install/flexy/workdir/azurecreds20231207-285-jd7gpj"
12-07 11:18:56.474 level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: controlPlane.platform.azure.type: Invalid value: "Standard_D4ps_v5": instance type architecture 'Arm64' does not match install config architecture amd64

But AWS and GCP have no such precheck; the installation fails later, after many resources have already been created. This case is more likely to happen in multi-arch clusters.
Expected results:
The installer should precheck architecture against VM type, especially for platforms that support heterogeneous clusters (AWS, GCP, Azure).
Additional info:
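A hypothetical sketch of the kind of precheck requested above follows; it is not the installer's actual validation code. The instanceTypeArch lookup table is an assumption for illustration (a real implementation would resolve it via the cloud API, e.g. DescribeInstanceTypes on AWS).

// Hypothetical sketch: reject the install config before any cloud resources
// are created when the instance type's architecture disagrees with the
// configured compute architecture.
package validation

import "fmt"

// instanceTypeArch is illustrative only; real code would query the cloud provider.
var instanceTypeArch = map[string]string{
	"m6i.xlarge": "amd64",
	"m6g.xlarge": "arm64",
}

func validateInstanceArchitecture(configArch, instanceType string) error {
	arch, ok := instanceTypeArch[instanceType]
	if !ok {
		// Unknown type: leave validation to the cloud provider.
		return nil
	}
	if arch != configArch {
		return fmt.Errorf("instance type %s architecture %q does not match install config architecture %q",
			instanceType, arch, configArch)
	}
	return nil
}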
Please review the following PR: https://github.com/openshift/image-customization-controller/pull/116
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In https://github.com/openshift/release/pull/47648, ecr-credentials-provider is built in CI and later included in RHCOS. In order to make it work on OKD, it needs to be included in the payload so that OKD machine-os can extract the RPM and install it on the host.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Ref: OCPBUGS-25662
Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/178
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The following tests fail:
1. [sig-network][Feature:EgressFirewall] egressFirewall should have no impact outside its namespace [Suite:openshift/conformance/parallel]
2. [sig-network][Feature:EgressFirewall] when using openshift ovn-kubernetes should ensure egressfirewall is created [Suite:openshift/conformance/parallel]

The issue arises during the execution of the above tests and appears to be related to the image in use, specifically the image located at https://quay.io/repository/redhat-developer/nfs-server?tab=tags&tag=1.1 (quay.io/redhat-developer/nfs-server:1.1). This image does not include the 'ping' executable for the s390x architecture, leading to the following error in the prow job logs:

... msg: "Error running /usr/bin/oc --namespace=e2e-test-no-egress-firewall-e2e-6mg9v --kubeconfig=/tmp/configfile3768380277 exec dummy -- ping -c 1 8.8.8.8:\nStdOut>\ntime=\"2023-10-11T19:04:52Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"ping\\\": executable file not found in $PATH\"\ncommand terminated with exit code 255\nStdErr>\ntime=\"2023-10-11T19:04:52Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"ping\\\": executable file not found in $PATH\"\ncommand terminated with exit code 255\nexit status 255\n" ...

Our suggested fix: build a new s390x image that contains the ping binary.
Version-Release number of selected component (if applicable):
How reproducible:
The issue is reproducible when the test container (quay.io/redhat-developer/nfs-server:1.1) is scheduled on an s390x node, leading to test failures.
Steps to Reproduce:
1. Have a multi-arch cluster (x86 + s390x day-2 worker node attached)
2. Execute the two tests
3. Try a few times to get the pod assigned to an s390x node
Actual results from prow job:
Run #0: Failed (30s)
{ fail [github.com/openshift/origin/test/extended/networking/egress_firewall.go:70]: Unexpected error:
<*fmt.wrapError | 0xc005924300>:
Error running /usr/bin/oc --namespace=e2e-test-no-egress-firewall-e2e-6r9zh --kubeconfig=/tmp/configfile3961753222 exec dummy -- ping -c 1 8.8.8.8:
StdOut>
time="2023-10-12T07:17:02Z" level=error msg="exec failed: unable to start container process: exec: \"ping\": executable file not found in $PATH"
command terminated with exit code 255
StdErr>
time="2023-10-12T07:17:02Z" level=error msg="exec failed: unable to start container process: exec: \"ping\": executable file not found in $PATH"
command terminated with exit code 255
exit status 255
{
  msg: "Error running /usr/bin/oc --namespace=e2e-test-no-egress-firewall-e2e-6r9zh --kubeconfig=/tmp/configfile3961753222 exec dummy -- ping -c 1 8.8.8.8:\nStdOut>\ntime=\"2023-10-12T07:17:02Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"ping\\\": executable file not found in $PATH\"\ncommand terminated with exit code 255\nStdErr>\ntime=\"2023-10-12T07:17:02Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"ping\\\": executable file not found in $PATH\"\ncommand terminated with exit code 255\nexit status 255\n",
  err: <*exec.ExitError | 0xc0059242e0>{
    ProcessState: {
      pid: 78611,
      status: 65280,
      rusage: {Utime: {Sec: 0, Usec: 168910}, Stime: {Sec: 0, Usec: 60897}, Maxrss: 206428, Ixrss: 0, Idrss: 0, Isrss: 0, Minflt: 4199, Majflt: 0, Nswap: 0, Inblock: 0, Oublock: 0, Msgsnd: 0, Msgrcv: 0, Nsignals: 0, Nvcsw: 753, Nivcsw: 149},
    },
    Stderr: nil,
  },
}
occurred
Ginkgo exit error 1: exit with code 1}
Expected results:
Passed
Additional info:
This issue pertains to a specific bug on the s390x architecture and additionally impacts the libvirt-s390x prow job.
Please review the following PR: https://github.com/openshift/installer/pull/7818
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/127
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/154
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
When platform-specific passwords are included in the install-config.yaml, they are stored in the generated agent-cluster-install.yaml, which is included in the output of the agent-gather command. These passwords should be redacted (a sketch of one possible approach follows).
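This is an illustrative sketch only, not the agent-gather implementation: it blanks out password-like keys in a YAML document before it is written into a support bundle. The key names in sensitiveKeys are assumptions.

// Illustrative sketch: replace the values of known sensitive keys with "REDACTED"
// before the YAML lands in the gathered artifacts.
package redact

import (
	"sigs.k8s.io/yaml"
)

var sensitiveKeys = map[string]bool{
	"password":        true,
	"vcenterPassword": true, // hypothetical platform-specific key
}

// RedactYAML parses the document, redacts sensitive keys recursively, and
// re-marshals it.
func RedactYAML(in []byte) ([]byte, error) {
	var doc map[string]interface{}
	if err := yaml.Unmarshal(in, &doc); err != nil {
		return nil, err
	}
	redactMap(doc)
	return yaml.Marshal(doc)
}

func redactMap(m map[string]interface{}) {
	for k, v := range m {
		if sensitiveKeys[k] {
			m[k] = "REDACTED"
			continue
		}
		if child, ok := v.(map[string]interface{}); ok {
			redactMap(child)
		}
	}
}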
Description of the problem:
Setting up OCP with ODF (compact mode) using AI (stage). I have 3 hosts, each with an install disk (120 GB) and a data disk (500 GB, disk type: Multipath). Even though we have a non-bootable disk (500 GB), the host status is "Insufficient". Could not proceed as the "Next" button is disabled.
Steps to reproduce:
1. Create a new cluster
2. Select "Install OpenShift Data Foundation" in Operators page
3. Take 3 hosts with 1 installation disk and 1 non-installation disk on each.
4. Add hosts by booting hosts with downloaded iso
Actual results:
Status of hosts is "Insufficient" and "Next" button is disabled
Expected results:
Status of hosts should be "Ready" and "Next" button should be enabled to proceed with installation
This shows tls: bad certificate errors from the kube-apiserver operator, for example https://reportportal-openshift.apps.ocp-c1.prod.psi.redhat.com/ui/#prow/launches/all/470214; checked its must-gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-aws-ipi-imdsv2-fips-f14/1726036030588456960/artifacts/aws-ipi-imdsv2-fips-f14/gather-must-gather/artifacts/
MacBook-Pro:~ jianzhang$ omg logs prometheus-operator-admission-webhook-6bbdbc47df-jd5mb | grep "TLS handshake"
2023-11-27 10:11:50.687 | WARNING | omg.utils.load_yaml:<module>:10 - yaml.CSafeLoader failed to load, using SafeLoader
2023-11-19T00:57:08.318983249Z ts=2023-11-19T00:57:08.318923708Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.129.0.35:48334: remote error: tls: bad certificate"
2023-11-19T00:57:10.336569986Z ts=2023-11-19T00:57:10.336505695Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.129.0.35:48342: remote error: tls: bad certificate"
...
MacBook-Pro:~ jianzhang$ omg get pods -A -o wide | grep "10.129.0.35"
2023-11-27 10:12:16.382 | WARNING | omg.utils.load_yaml:<module>:10 - yaml.CSafeLoader failed to load, using SafeLoader
openshift-kube-apiserver-operator   kube-apiserver-operator-f78c754f9-rbhw9   1/1   Running   2   5h27m   10.129.0.35   ip-10-0-107-238.ec2.internal
for more information slack - https://redhat-internal.slack.com/archives/CC3CZCQHM/p1700473278471309
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/101
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/53
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/etcd/pull/236
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Many jobs are failing because route53 is throttling us during cluster creation.
We need to make external-dns make fewer calls (see the sketch after this list).
The theoretical minimum is:
list zones - 1 call
list zone records - (# of records / 100) calls
create 3 records per HC - 1-3 calls depending on how they are batched
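The sketch below illustrates the batching point at the end of the list: the records for a hosted cluster are sent in one ChangeResourceRecordSets call instead of one call per record. This is not the actual external-dns or HyperShift code; the CNAME record type, TTL, and helper names are assumptions, but the aws-sdk-go route53 calls are real.

// Sketch: batch all records for a hosted cluster into a single Route53 change.
package dns

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/route53"
	"github.com/aws/aws-sdk-go/service/route53/route53iface"
)

// upsertRecords sends one batched change for all record names, so creating the
// three records per HostedCluster costs a single API call rather than three.
func upsertRecords(client route53iface.Route53API, zoneID, target string, names []string) error {
	changes := make([]*route53.Change, 0, len(names))
	for _, name := range names {
		changes = append(changes, &route53.Change{
			Action: aws.String(route53.ChangeActionUpsert),
			ResourceRecordSet: &route53.ResourceRecordSet{
				Name: aws.String(name),
				Type: aws.String(route53.RRTypeCname),
				TTL:  aws.Int64(30),
				ResourceRecords: []*route53.ResourceRecord{
					{Value: aws.String(target)},
				},
			},
		})
	}
	_, err := client.ChangeResourceRecordSets(&route53.ChangeResourceRecordSetsInput{
		HostedZoneId: aws.String(zoneID),
		ChangeBatch:  &route53.ChangeBatch{Changes: changes},
	})
	return err
}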
Description of problem:
The image registry CO never finishes progressing on Azure Hosted Control Planes.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Create an Azure HCP
2. Create a kubeconfig for the guest cluster
3. Check the image-registry CO
Actual results:
image-registry co's message is Progressing: The registry is ready...
Expected results:
image-registry finishes progressing
Additional info:
I let it go for about 34m:

% oc get co | grep -i image
image-registry   4.16.0-0.nightly-multi-2024-02-26-105325   True   True   False   34m   Progressing: The registry is ready...

% oc get co/image-registry -oyaml
...
  - lastTransitionTime: "2024-02-28T19:10:30Z"
    message: |-
      Progressing: The registry is ready
      NodeCADaemonProgressing: The daemon set node-ca is deployed
      AzurePathFixProgressing: The job does not exist
    reason: AzurePathFixNotFound::Ready
    status: "True"
    type: Progressing
Description of problem:
The PodStartupStorageOperationsFailing alert is not raised when zero successful mount/attach operations happen on a node.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-10-25-185510
How reproducible:
Always
Steps to Reproduce:
1. Install a cluster on any platform.
2. Create an sc, pvc, and dep.
3. Check that the dep pod reaches the ContainerCreating state and check for the alert.
Actual results:
The alert is not raised when there are 0 successful mount/attach operations.
Expected results:
The alert should be raised when no successful mount/attach operations happen.
Additional info:
Discussion: https://redhat-internal.slack.com/archives/GK0DA0JR5/p1697793500890839 When we take same alerting expression from 4.12, we can observe the alert in ocp web console page.
Description of problem:
When deploying an HCP KubeVirt cluster using the hcp CLI's --node-selector arg, that node selector is not applied to the "kubevirt-cloud-controller-manager" pods within the HCP namespace. This makes it impossible to pin all of the HCP pods to specific nodes.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. deploy an hcp kubevirt cluster with the --node-selector cli option 2. 3.
Actual results:
the node selector is not applied to cloud provider kubevirt pod
Expected results:
the node selector should be applied to cloud provider kubevirt pod.
Additional info:
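As a small sketch of the missing behaviour (not the actual HyperShift reconciler code), the node selector from the HostedCluster would simply be copied onto the cloud-provider-kubevirt Deployment's pod template, the same way it is for the other control-plane components; the applyNodeSelector helper name is an assumption.

// Sketch: propagate the HCP node selector to a control-plane Deployment's pods.
package kubevirt

import appsv1 "k8s.io/api/apps/v1"

// applyNodeSelector sets the node selector on a Deployment's pod template so
// its pods are pinned to the same nodes as the rest of the hosted control plane.
func applyNodeSelector(deployment *appsv1.Deployment, nodeSelector map[string]string) {
	if len(nodeSelector) == 0 {
		return
	}
	if deployment.Spec.Template.Spec.NodeSelector == nil {
		deployment.Spec.Template.Spec.NodeSelector = map[string]string{}
	}
	for k, v := range nodeSelector {
		deployment.Spec.Template.Spec.NodeSelector[k] = v
	}
}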
Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/1980
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/101
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
A customer pentest shows that the Server header is returned by the admin console when browsing
https://console-openshift-console$domain/locales/resource.json?lng=en&ns=plugin__odf-console
This could give a potential attacker version information that maps to known CVEs.
Response header:
Server: nginx/1.20.1
Please review the following PR: https://github.com/openshift/cluster-api-provider-aws/pull/488
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-storage-operator/pull/432
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
[Jira:"Test Framework"] monitor test azure-metrics-collector collection failure in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/28395/pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd/1724427658311241728
It looks like Azure is throttling our requests. We should probably add some retry mechanism.
Relevant thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1699977299650309
Work to set up the endpoint will be handled in another card.
For this one we want to set up a new disruption backend similar to the cluster-network-liveness-probe. We'll poll, submit request IDs, and possibly job identifiers (see the sketch below).
This approach gets us free disruption intervals in BigQuery, per-job charting, and graphing capabilities from the disruption dashboard.
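A minimal sketch of the kind of poller described above follows; it is not the origin disruption framework code. The endpoint URL, the X-Request-Id header, and the Sample shape are assumptions for illustration.

// Minimal sketch: hit the target once per interval, tag each attempt with a
// request ID, and record one sample per attempt for later conversion into
// disruption intervals.
package disruption

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

type Sample struct {
	Time      time.Time
	RequestID string
	Err       error
}

// Poll runs until the context is cancelled, emitting one Sample per attempt.
func Poll(ctx context.Context, url string, interval time.Duration, out chan<- Sample) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	client := &http.Client{Timeout: interval}
	for i := 0; ; i++ {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
		reqID := fmt.Sprintf("disruption-%d-%d", time.Now().Unix(), i)
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			out <- Sample{Time: time.Now(), RequestID: reqID, Err: err}
			continue
		}
		req.Header.Set("X-Request-Id", reqID) // hypothetical header used to correlate server-side
		resp, err := client.Do(req)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode >= 500 {
				err = fmt.Errorf("unexpected status %d", resp.StatusCode)
			}
		}
		out <- Sample{Time: time.Now(), RequestID: reqID, Err: err}
	}
}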
Observed during testing of candidate-4.15 image as of 2024-02-08.
This is an incomplete report as I haven't verified the reproducer yet or attempted to get a must-gather. I have observed this multiple times now, so I am confident it's a thing. I can't be confident that the procedure described here reliably reproduces it, or that all the described steps are required.
I have been using MCO to apply machine config to masters. This involves a rolling reboot of all masters.
During a rolling reboot I applied an update to CPMS. I observed the following sequence of events:
At this point there were only 2 nodes in the cluster:
and machines provisioning:
Currently, when creating an Azure cluster, only the first node of the nodePool becomes ready and joins the cluster; all other Azure machines are stuck in the `Creating` state.
Please review the following PR: https://github.com/openshift/whereabouts-cni/pull/212
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api/pull/191
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/272
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/84
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The YAML sidebar is occupying too much space on some pages
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-03-140457
How reproducible:
Always
Steps to Reproduce:
1. Go to the Deployment/DeploymentConfig creation page
2. Choose 'YAML view'
3. (for comparison) Go to other resources' YAML pages and open the sidebar
Actual results:
The sidebar occupies too much of the screen compared with other resources' YAML pages.
Expected results:
We should reduce the space the sidebar occupies.
Additional info:
Description of problem:
- Investigate why the `name` and `namespace` properties are passed as arguments to `k8sCreate` in the Create YAML editor function.
- Remove the `name` and `namespace` arguments from `k8sCreate` in the Create YAML editor function if it does not require a big change.

Problem: If consoleFetchCommon takes an additional option (argument) and returns a response based on that option, as proposed in the "Add support for returning response.header in consoleFetchCommon function" story (https://issues.redhat.com/browse/CONSOLE-3949), the wrong and unused arguments in k8sCreate would cause consoleFetchCommon to return the entire response instead of the response body, which would break the Create Resource YAML functionality.

Code: https://github.com/openshift/console/blob/master/frontend/public/components/edit-yaml.jsx#L334
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/149
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Currently, when attempting installs, our Infrastructure "cluster" CR contains a resource with a path such as:
Workspace.ResourcePool: /DEVQEdatacenter/host/DEVQEcluster//Resources
This causes an issue with CPMSO, resulting in a rollout of new masters after the initial control plane is established:
I0219 07:46:32.076950 1 updates.go:478] "msg"="Machine requires an update" "controller"="controlplanemachineset" "diff"=["Workspace.ResourcePool: /DEVQEdatacenter/host/DEVQEcluster//Resources != /DEVQEdatacenter/host/DEVQEcluster/Resources"] "index"=2 "name"="sgao-devqe-vblw8-master-2" "namespace"="openshift-machine-api" "reconcileID"="5f47f5a5-0a90-4168-bfcc-dae0fad9b953" "updateStrategy"="RollingUpdate"
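The duplicate separator is what makes CPMSO see the two specs as different; normalizing the path removes it. A small sketch using Go's path.Clean, purely illustrative of the normalization:

// Small sketch: normalizing the resource pool path collapses the "//" that
// makes the machine spec comparison report a diff.
package main

import (
	"fmt"
	"path"
)

func main() {
	p := "/DEVQEdatacenter/host/DEVQEcluster//Resources"
	fmt.Println(path.Clean(p)) // /DEVQEdatacenter/host/DEVQEcluster/Resources
}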
Please review the following PR: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1597
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In LGW (local gateway mode), when a pod is selected by an EIP that's hosted by an interface that isn't the default interface, the connection to the node IP fails.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Currently, the plugin template gives you instructions for running the console using a container image, which is a lightweight way to do development and avoids the need to build the console source code from scratch. The image we reference uses a production build of React, however. This means that you aren't able to use the React browser plugin to debug your application.
We should look at alternatives that allow you to use React Developer Tools. Perhaps we can publish a different image that uses a development build. Or at least we need to better document building console locally instead of using an image to allow development builds.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
NodePools fail to provision if no subnet is specified.
The Subnet field (https://github.com/openshift/hypershift/blob/main/api/hypershift/v1beta1/nodepool_types.go#L760) should be required (see the sketch below).
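For illustration only, this is how an API field is typically made mandatory with kubebuilder validation markers; the struct and field names below are assumptions, not the exact content of nodepool_types.go.

// Illustrative only: a required field declaration as it would appear in a
// kubebuilder-generated API type.
package v1beta1

// ExamplePlatformSpec shows a subnet reference that the API server rejects
// when omitted, so a NodePool cannot be created without it.
type ExamplePlatformSpec struct {
	// SubnetID is the subnet the NodePool's machines are placed in.
	// +kubebuilder:validation:Required
	SubnetID string `json:"subnetID"`
}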
Background:
CCO was made optional in OCP 4.15, see https://issues.redhat.com/browse/OCPEDGE-69. CloudCredential was introduced as a new capability in openshift/api. We need to bump the openshift/api dependency in oc to include the CloudCredential capability so that oc adm release extract works correctly.
Description of problem:
Some relevant CredentialsRequests are not extracted by the following command:

oc adm release extract --credentials-requests --included --install-config=install-config.yaml

where install-config.yaml looks like the following:

...
capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
  - MachineAPI
  - CloudCredential
platform:
  aws:
...
Logs:
...
I1209 19:57:25.968783 79037 extract.go:418] Found manifest 0000_50_cloud-credential-operator_05-iam-ro-credentialsrequest.yaml
I1209 19:57:25.968902 79037 extract.go:429] Excluding Group: "cloudcredential.openshift.io" Kind: "CredentialsRequest" Namespace: "openshift-cloud-credential-operator" Name: "cloud-credential-operator-iam-ro": unrecognized capability names: CloudCredential
...
Description of problem:
When deploying with a service ID, the installer is unable to query resource groups.
Version-Release number of selected component (if applicable):
4.13-4.16
How reproducible:
Easily
Steps to Reproduce:
1. Create a service ID with seemingly enough permissions to do an IPI install
2. Deploy to Power VS with IPI
3. Fail
Actual results:
Fail to deploy a cluster with service ID
Expected results:
cluster create should succeed
Additional info:
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/99
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
If the CCM capability is disabled on a cloud such as AWS, the installation continues until it fails with the ingress LoadBalancer pending.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Build an image with PRs openshift/cluster-cloud-controller-manager-operator#284, openshift/installer#7546, openshift/cluster-version-operator#979, openshift/machine-config-operator#3999
2. Install a cluster on AWS with "baselineCapabilitySet: v4.14"
3.
Actual results:
Installation failed, ingress LoadBalancerPending.

$ oc get node
NAME                                        STATUS   ROLES                  AGE   VERSION
ip-10-0-25-230.us-east-2.compute.internal   Ready    control-plane,master   86m   v1.28.3+20a5764
ip-10-0-3-101.us-east-2.compute.internal    Ready    worker                 78m   v1.28.3+20a5764
ip-10-0-46-198.us-east-2.compute.internal   Ready    control-plane,master   87m   v1.28.3+20a5764
ip-10-0-48-220.us-east-2.compute.internal   Ready    worker                 80m   v1.28.3+20a5764
ip-10-0-79-203.us-east-2.compute.internal   Ready    control-plane,master   86m   v1.28.3+20a5764
ip-10-0-95-83.us-east-2.compute.internal    Ready    worker                 78m   v1.28.3+20a5764

$ oc get co
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   False   False   True   85m   OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.zhsun-aws1.qe.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.zhsun-aws1.qe.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
baremetal   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   84m
cloud-credential   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   86m
cluster-autoscaler   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   84m
config-operator   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   85m
console   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   False   True   False   79m   DeploymentAvailable: 0 replicas available for console deployment...
control-plane-machine-set   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   81m
csi-snapshot-controller   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   85m
dns   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   84m
etcd   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   83m
image-registry   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   78m
ingress   False   True   True   78m   The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending)
insights   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   78m
kube-apiserver   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   71m
kube-controller-manager   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   82m
kube-scheduler   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   82m
kube-storage-version-migrator   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   85m
machine-api   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   77m
machine-approver   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   84m
machine-config   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   84m
marketplace   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   84m
monitoring   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   73m
network   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   86m
node-tuning   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   78m
openshift-apiserver   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   71m
openshift-controller-manager   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   75m
openshift-samples   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   78m
service-ca   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False   False   85m
storage   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True   False
Expected results:
Tell users not to turn CCM off for cloud.
Additional info:
As a maintainer of the HyperShift repo, I would like to remove unused functions from the code base to reduce the code footprint of the repo.
Please review the following PR: https://github.com/openshift/network-tools/pull/105
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The ConsolePluginComponents CMO task depends on the availability of a conversion service that is part of the console-operator pod. That pod is not replicated, so when it restarts due to a cluster upgrade or for any other reason, the conversion webhook becomes unavailable and all the ConsolePlugin API queries from that CMO task fail.
Version-Release number of selected component (if applicable):
How reproducible:
Create a 4.14 cluster, make the console-operator unmanaged and bring it down, watch the ConsolePluginComponents tasks fail instantly after they're run.
Steps to Reproduce:
1. 2. 3.
Actual results:
The ConsolePluginComponents tasks fail instantly after they're run.
Expected results:
The tasks should be more resilient and retry. The long term solution is for that ConsolePlugin conversion service to be duplicated.
Additional info:
For OCP >= 4.15, CMO's v1 ConsolePlugin queries no longer rely on the conversion webhook because of https://github.com/openshift/api/pull/1477. But the retries will keep the task future proof, and we'll be able to backport the fix.
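A resilient version of the task could wrap its ConsolePlugin call in a bounded retry, so that a brief conversion-webhook outage does not fail it outright. Below is a minimal Go sketch using client-go's retry helper; fetchConsolePlugins and the backoff values are hypothetical placeholders, not CMO's actual code.

```go
package main

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// fetchConsolePlugins is a hypothetical stand-in for the CMO task's
// ConsolePlugin query, which can fail while the conversion webhook behind
// the console-operator pod is briefly unavailable.
func fetchConsolePlugins(ctx context.Context) error {
	// ... call the ConsolePlugin API here ...
	return nil
}

func fetchConsolePluginsWithRetry(ctx context.Context) error {
	backoff := wait.Backoff{
		Steps:    5,
		Duration: 2 * time.Second,
		Factor:   2.0,
		Jitter:   0.1,
	}
	// Retry only on errors typical of a webhook restart window.
	retriable := func(err error) bool {
		return apierrors.IsServiceUnavailable(err) ||
			apierrors.IsInternalError(err) ||
			apierrors.IsTimeout(err)
	}
	return retry.OnError(backoff, retriable, func() error {
		return fetchConsolePlugins(ctx)
	})
}

func main() {
	if err := fetchConsolePluginsWithRetry(context.Background()); err != nil {
		fmt.Println("ConsolePlugin query failed after retries:", err)
	}
}
```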
Description of problem:
The job [sig-node] [Conformance] Prevent openshift node labeling on update by the node TestOpenshiftNodeLabeling [Suite:openshift/conformance/parallel/minimal] uses the `oc debug` command [1]. Occasionally we find that the command fails to run, which ends up failing the test.
CI search results - https://search.ci.openshift.org/?search=TestOpenshiftNodeLabeling&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Observed in CI jobs mentioned above.
Steps to Reproduce:
1. 2. 3.
Actual results:
oc debug command occasionally exits with error
Expected results:
oc debug command should not occasionally exit with error
Additional info:
Several oc examples are incorrect. These are used in the CLI reference docs, but would also appear in the oc CLI help.
The commands that don't work have been removed manually from the CLI reference docs via this update: https://github.com/openshift/openshift-docs/compare/9907074162999c982a8a97c45665c98913d848c9..441f3419ef460d9863a45e4c2d6914b1c019e1d1
List of commands:
For more information, see the feedback on these PRs:
Description of problem:
Following https://issues.redhat.com/browse/CNV-28040: on CNV, when a virtual machine with secondary interfaces connected via the bridge CNI is live migrated, we observe disruption of the VM's inbound traffic. The root cause is that the migration target's bridge interface advertises itself before the migration is completed: when the migration destination pod is created, IPv6 NS (Neighbor Solicitation) and NA (Neighbor Advertisement) packets are sent automatically by the kernel, the switches at the endpoints (e.g. the migration destination node) update their tables, and traffic is forwarded to the migration destination before the migration is completed [1]. The solution is to have the bridge CNI create the pod interface in "link-down" state [2] so that the IPv6 NS/NA packets are avoided; CNV, in turn, sets the pod interface to "link-up" [3]. CNV depends on a bridge CNI with the [2] bits, which is deployed by cluster-network-operator.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=2186372#c6
[2] https://github.com/kubevirt/kubevirt/pull/11069
[3] https://github.com/containernetworking/plugins/pull/997
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
100%
Steps to Reproduce:
1. 2. 3.
Actual results:
CNO deploys CNI bridge w/o an option to set the bridge interface down.
Expected results:
CNO to deploy bridge CNI with [1] changes, from release-4.16 branch. [1] https://github.com/containernetworking/plugins/pull/997
Additional info:
More https://issues.redhat.com/browse/CNV-28040
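For illustration only, here is a minimal Go sketch of the link-down/link-up mechanism described in this report, using the netlink package. The interface name, the split into two functions, and the assumption of running inside the pod's network namespace with NET_ADMIN are all assumptions for the sketch, not the actual bridge CNI or CNV code.

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

// createLinkDown keeps the freshly created pod interface down so the kernel
// does not emit IPv6 NS/NA and the switches do not learn the destination
// before migration finishes.
func createLinkDown(ifname string) error {
	link, err := netlink.LinkByName(ifname)
	if err != nil {
		return err
	}
	return netlink.LinkSetDown(link)
}

// activateAfterMigration brings the interface up once live migration has
// completed, at which point advertising the new location is desired.
func activateAfterMigration(ifname string) error {
	link, err := netlink.LinkByName(ifname)
	if err != nil {
		return err
	}
	return netlink.LinkSetUp(link)
}

func main() {
	if err := createLinkDown("eth0"); err != nil {
		log.Fatal(err)
	}
	// ... live migration completes ...
	if err := activateAfterMigration("eth0"); err != nil {
		log.Fatal(err)
	}
}
```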
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Standalone OCP encrypts various resources at rest in etcd: https://docs.openshift.com/container-platform/4.14/security/encrypting-etcd.html HyperShift control planes are only encrypting secrets. We should have parity with standalone.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Create a HyperShift hosted control plane
2. Check whether configmaps, routes, OAuth access tokens, or OAuth authorize tokens are encrypted
Actual results:
Those resources are not encrypted
Expected results:
Those resources are encrypted
Additional info:
Resources to be encrypted are configured here: https://github.com/openshift/hypershift/blob/main/control-plane-operator/controllers/hostedcontrolplane/kas/kms/aws.go#L121-L126
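For parity, that list would grow to cover the same resources standalone OCP encrypts. A minimal sketch of the expanded resource list follows; the group/resource strings and the plain string slice are assumptions for illustration, not hypershift's actual types.

```go
package main

import "fmt"

// encryptedResources sketches the resource set a HyperShift KAS encryption
// configuration could cover to match standalone OCP, which encrypts secrets,
// configmaps, routes, and OAuth tokens at rest.
var encryptedResources = []string{
	"secrets",
	"configmaps",
	"routes.route.openshift.io",
	"oauthaccesstokens.oauth.openshift.io",
	"oauthauthorizetokens.oauth.openshift.io",
}

func main() {
	for _, r := range encryptedResources {
		fmt.Println("encrypt at rest:", r)
	}
}
```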
Description of problem:
After running ./openshift-install destroy cluster, the TagCategory still exists:

# ./openshift-install destroy cluster --dir cluster --log-level debug
DEBUG OpenShift Installer 4.15.0-0.nightly-2023-12-18-220750
DEBUG Built from commit 2b894776f1653ab818e368fa625019a6de82a8c7
DEBUG Power Off Virtual Machines
DEBUG Powered off VirtualMachine=sgao-devqe-spn2w-master-2
DEBUG Powered off VirtualMachine=sgao-devqe-spn2w-master-1
DEBUG Powered off VirtualMachine=sgao-devqe-spn2w-master-0
DEBUG Powered off VirtualMachine=sgao-devqe-spn2w-worker-0-kpg46
DEBUG Powered off VirtualMachine=sgao-devqe-spn2w-worker-0-w5rrn
DEBUG Delete Virtual Machines
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-rhcos-generated-region-generated-zone
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-master-2
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-master-1
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-master-0
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-worker-0-kpg46
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-worker-0-w5rrn
DEBUG Delete Folder
INFO Destroyed Folder=sgao-devqe-spn2w
DEBUG Delete StoragePolicy=openshift-storage-policy-sgao-devqe-spn2w
INFO Destroyed StoragePolicy=openshift-storage-policy-sgao-devqe-spn2w
DEBUG Delete Tag=sgao-devqe-spn2w
INFO Deleted Tag=sgao-devqe-spn2w
DEBUG Delete TagCategory=openshift-sgao-devqe-spn2w
INFO Deleted TagCategory=openshift-sgao-devqe-spn2w
DEBUG Purging asset "Metadata" from disk
DEBUG Purging asset "Master Ignition Customization Check" from disk
DEBUG Purging asset "Worker Ignition Customization Check" from disk
DEBUG Purging asset "Terraform Variables" from disk
DEBUG Purging asset "Kubeconfig Admin Client" from disk
DEBUG Purging asset "Kubeadmin Password" from disk
DEBUG Purging asset "Certificate (journal-gatewayd)" from disk
DEBUG Purging asset "Cluster" from disk
INFO Time elapsed: 29s
INFO Uninstallation complete!

# govc tags.category.ls | grep sgao
openshift-sgao-devqe-spn2w
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-18-220750
How reproducible:
always
Steps to Reproduce:
1. IPI install OCP on vSphere
2. Destroy the installed cluster, then check the TagCategory
Actual results:
TagCategory still exist
Expected results:
TagCategory should be deleted
Additional info:
Also reproduced with openshift-install-linux-4.14.0-0.nightly-2023-12-20-184526 and 4.13.0-0.nightly-2023-12-21-194724, while 4.12.0-0.nightly-2023-12-21-162946 does not have this issue.
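A fix would presumably remove the category through the vCenter tagging API once the tag itself is gone. Below is a minimal govmomi sketch, assuming an authenticated vAPI REST session; the vCenter URL and category name are placeholders, and this is an illustration rather than the installer's actual destroy code.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/url"

	"github.com/vmware/govmomi"
	"github.com/vmware/govmomi/vapi/rest"
	"github.com/vmware/govmomi/vapi/tags"
)

// deleteTagCategory removes the cluster's tag category (for example
// "openshift-<infra-id>") by name after the tag has been deleted.
func deleteTagCategory(ctx context.Context, rc *rest.Client, name string) error {
	m := tags.NewManager(rc)
	cats, err := m.GetCategories(ctx)
	if err != nil {
		return err
	}
	for i := range cats {
		if cats[i].Name == name {
			return m.DeleteCategory(ctx, &cats[i])
		}
	}
	return fmt.Errorf("tag category %q not found", name)
}

func main() {
	ctx := context.Background()
	u, _ := url.Parse("https://user:pass@vcenter.example.com/sdk") // placeholder vCenter
	c, err := govmomi.NewClient(ctx, u, true)
	if err != nil {
		log.Fatal(err)
	}
	rc := rest.NewClient(c.Client)
	if err := rc.Login(ctx, u.User); err != nil {
		log.Fatal(err)
	}
	if err := deleteTagCategory(ctx, rc, "openshift-example-infra-id"); err != nil {
		log.Fatal(err)
	}
}
```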
Description of problem:
The `aws-ebs-csi-driver-node-` pods appear to be failing to deploy far too often in CI recently.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
in a statistically significant pattern
Steps to Reproduce:
1. run OCP test suite many times for it to matter
Actual results:
fail [github.com/openshift/origin/test/extended/authorization/scc.go:76]: 1 pods failed before test on SCC errors Error creating: pods "aws-ebs-csi-driver-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, provider restricted-v2: .containers[0].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[0].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[0].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[1].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[1].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[1].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[2].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[2].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/aws-ebs-csi-driver-node -n openshift-cluster-csi-drivers happened 4 times
Expected results:
Test pass
Additional info:
[sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel]
Please review the following PR: https://github.com/openshift/cluster-network-operator/pull/2156
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Since the singular variant of APIVIP/IngressVIP has been removed as part of https://github.com/openshift/installer/pull/7574, the appliance disk image e2e job is now failing: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-appliance-master-e2e-compact-ipv4-static The job fails since the appliance supports only 4.14, which still requires the singular variant of the VIP properties.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Invoke appliance e2e job on master: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-appliance-master-e2e-compact-ipv4-static
Actual results:
The job fails with the following validation error: "the Machine Network CIDR can be defined by setting either the API or Ingress virtual IPs", due to the missing apiVIP and ingressVIP in AgentClusterInstall.
Expected results:
AgentClusterInstall should also include the singular 'apiVIP' and 'ingressVIP', and the e2e job should complete successfully.
Additional info:
This is a clone of issue OCPBUGS-22293. The following is the description of the original issue:
—
Description of problem:
Upgrading from 4.13.5 to 4.13.17 fails at network operator upgrade
Version-Release number of selected component (if applicable):
How reproducible:
Not sure since we only had one cluster on 4.13.5.
Steps to Reproduce:
1. Have a cluster on version 4.13.5 with OVN-Kubernetes
2. Set the desired update image to quay.io/openshift-release-dev/ocp-release@sha256:c1f2fa2170c02869484a4e049132128e216a363634d38abf292eef181e93b692
3. Wait until the upgrade reaches the network operator
Actual results:
Error message: Error while updating operator configuration: could not apply (apps/v1, Kind=DaemonSet) openshift-ovn-kubernetes/ovnkube-master: failed to apply / update (apps/v1, Kind=DaemonSet) openshift-ovn-kubernetes/ovnkube-master: DaemonSet.apps "ovnkube-master" is invalid: [spec.template.spec.containers[1].lifecycle.preStop: Required value: must specify a handler type, spec.template.spec.containers[3].lifecycle.preStop: Required value: must specify a handler type]
Expected results:
Network operator upgrades successfully
Additional info:
Since I'm not able to attach files please gather all required debug data from https://access.redhat.com/support/cases/#/case/03645170
Upon debugging, nodes are stuck in NotReady state and CNI is not initialised on them.
Seeing the following error log in cluster network operator
failed parsing certificate data from ConfigMap "openshift-service-ca.crt": failed to parse certificate PEM
CNO operator logs: https://docs.google.com/document/d/1hor1r9ue4gnetkXm9mh8AKa7vm8zNBPhUQqWCbbnnUc/edit?usp=sharing
This is happening on a management cluster that is configured to use legacy service CA's:
$ oc get kubecontrollermanager/cluster -o yaml --as system:admin apiVersion: operator.openshift.io/v1 kind: KubeControllerManager metadata: name: cluster spec: logLevel: Normal managementState: Managed operatorLogLevel: Normal unsupportedConfigOverrides: null useMoreSecureServiceCA: false
In newer clusters, useMoreSecureServiceCA is set to true.
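The failure mode suggests the certificate data in that ConfigMap did not parse as a clean PEM certificate bundle. Below is a minimal Go sketch of tolerant PEM-bundle parsing that shows where such an error would be raised; the function name and the decision to skip non-certificate blocks are assumptions for illustration, not CNO's actual parser.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
)

// parseCertBundle parses every CERTIFICATE block in a PEM bundle such as the
// data of the "openshift-service-ca.crt" ConfigMap. Rejecting the whole
// bundle when no certificate parses reproduces the kind of failure quoted in
// the CNO log above; a more tolerant parser skips unknown block types.
func parseCertBundle(data []byte) ([]*x509.Certificate, error) {
	var certs []*x509.Certificate
	for len(data) > 0 {
		var block *pem.Block
		block, data = pem.Decode(data)
		if block == nil {
			break // no more PEM blocks
		}
		if block.Type != "CERTIFICATE" {
			continue // ignore keys or other block types
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			return nil, fmt.Errorf("failed to parse certificate PEM: %w", err)
		}
		certs = append(certs, cert)
	}
	if len(certs) == 0 {
		return nil, fmt.Errorf("failed to parse certificate PEM: no certificates found")
	}
	return certs, nil
}

func main() {
	// A real caller would read the ConfigMap data; this placeholder bundle
	// intentionally fails to parse and exercises the error path.
	certs, err := parseCertBundle([]byte("-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("parsed", len(certs), "certificates")
}
```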
Description of problem:
Observing the following test case failure in 4.14 to 4.15 and 4.15 to 4.16 upgrade CI runs continuously. [bz-Image Registry] clusteroperator/image-registry should not change condition/Available
4.14 Image: registry.ci.openshift.org/ocp-ppc64le/release-ppc64le:4.14.0-0.nightly-ppc64le-2024-01-15-085349
4.15 Image: registry.ci.openshift.org/ocp-ppc64le/release-ppc64le:4.15.0-0.nightly-ppc64le-2024-01-15-042536
I was identifying what remains with:
cat e2e-events_20231204-183144.json| jq '.items[] | select(has("tempSource") | not)'
I think I've cleared all the difficult ones, hopefully these are just simple stragglers.
Description of problem:
A CAPI machine cannot be deleted by the installer during cluster destroy. Checking on the GCP console, this machine lacks the label (kubernetes-io-cluster-clusterid: owned); if that label is added manually on the GCP console, the machine can be deleted by the installer during cluster destroy.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-05-053337
How reproducible:
Always
Steps to Reproduce:
1. Follow the steps here https://bugzilla.redhat.com/show_bug.cgi?id=2107999#c9 to create a CAPI machine
liuhuali@Lius-MacBook-Pro huali-test % oc get machine.cluster.x-k8s.io -n openshift-cluster-api
NAME            CLUSTER            NODENAME   PROVIDERID                                                          PHASE         AGE   VERSION
capi-ms-mtchm   huliu-gcpx-c55vm              gce://openshift-qe/us-central1-a/capi-gcp-machine-template-gcw9t    Provisioned   51m
2. Destroy the cluster. The cluster destroyed successfully, but checking on the GCP console, the CAPI machine is still there.
labels of capi machine
labels of mapi machine
Actual results:
capi machine cannot be deleted by installer during cluster destroy
Expected results:
capi machine should be deleted by installer during cluster destroy
Additional info:
Also checked on AWS: that case worked well, and there is a tag (kubernetes.io/cluster/clusterid: owned) on the CAPI machines.
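Until the machine carries that label, the manual console workaround can also be scripted. A hedged Go sketch using the GCP Compute API to apply the label the destroy code keys on; the project, zone, instance, and infra ID values are taken from this report as placeholders, and application default credentials are assumed.

```go
package main

import (
	"context"
	"log"

	compute "google.golang.org/api/compute/v1"
)

// addClusterOwnedLabel applies the "kubernetes-io-cluster-<infra-id>: owned"
// label, mirroring the manual GCP console workaround described above.
func addClusterOwnedLabel(ctx context.Context, project, zone, instance, infraID string) error {
	svc, err := compute.NewService(ctx)
	if err != nil {
		return err
	}
	inst, err := svc.Instances.Get(project, zone, instance).Context(ctx).Do()
	if err != nil {
		return err
	}
	labels := inst.Labels
	if labels == nil {
		labels = map[string]string{}
	}
	labels["kubernetes-io-cluster-"+infraID] = "owned"
	req := &compute.InstancesSetLabelsRequest{
		Labels:           labels,
		LabelFingerprint: inst.LabelFingerprint, // required by the API to avoid races
	}
	_, err = svc.Instances.SetLabels(project, zone, instance, req).Context(ctx).Do()
	return err
}

func main() {
	if err := addClusterOwnedLabel(context.Background(),
		"openshift-qe", "us-central1-a", "capi-gcp-machine-template-gcw9t", "huliu-gcpx-c55vm"); err != nil {
		log.Fatal(err)
	}
}
```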
Description of problem: [Worker CloudFormation Template] Installing a cluster on AWS using CloudFormation templates - Installing on AWS | Installing | OpenShift Container Platform 4.13: https://docs.openshift.com/container-platform/4.13/installing/installing_aws/installing-aws-user-infra.html#installation-cloudformation-worker_installing-aws-user-infra
In the OpenShift documentation, under the manual AWS CloudFormation templates, the CloudFormation template for worker nodes describes the Subnet and WorkerSecurityGroupId parameters as referring to the master nodes. Based on the variable names, the descriptions should refer to worker nodes instead.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api/pull/190
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/106
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
As a developer I want to remove the NoUpgrade annotation from the CAPI IPAM CRDs so that I can promote them to General Availability
The SPLAT team is planning to have the CAPI IPAM CRDs promoted to GA because they need them in a component they are promoting to GA.
Description of problem:
Trying to run without the --node-upgrade-type parameter fails with "spec.management.upgradeType: Unsupported value: \"\": supported values: \"Replace\", \"InPlace\"", although --help documents it as having a default value of 'InPlace'.
Version-Release number of selected component (if applicable):
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp -v
hcp version openshift/hypershift: af9c0b3ce9c612ec738762a8df893c7598cbf157. Latest supported OCP: 4.15.0
How reproducible:
happens all the time
Steps to Reproduce:
1. On a hosted cluster setup run:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 --node-upgrade-type Replace --help
Creates basic functional NodePool resources for Agent platform
Usage:
  hcp create nodepool agent [flags]
Flags:
  -h, --help   help for agent
Global Flags:
      --cluster-name string             The name of the HostedCluster nodes in this pool will join. (default "example")
      --name string                     The name of the NodePool.
      --namespace string                The namespace in which to create the NodePool. (default "clusters")
      --node-count int32                The number of nodes to create in the NodePool. (default 2)
      --node-upgrade-type UpgradeType   The NodePool upgrade strategy for how nodes should behave when upgraded. Supported options: Replace, InPlace (default )
      --release-image string            The release image for nodes; if this is empty, defaults to the same release image as the HostedCluster.
      --render                          Render output as YAML to stdout instead of applying.
2. Try to run with the default value of --node-upgrade-type:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2
Actual results:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 2024-02-06T19:57:03+02:00 ERROR Failed to create nodepool {"error": "NodePool.hypershift.openshift.io \"nodepool-of-extra1\" is invalid: spec.management.upgradeType: Unsupported value: \"\": supported values: \"Replace\", \"InPlace\""} github.com/openshift/hypershift/cmd/nodepool/core.(*CreateNodePoolOptions).CreateRunFunc.func1 /home/kni/hypershift_working/hypershift/cmd/nodepool/core/create.go:39 github.com/spf13/cobra.(*Command).execute /home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:983 github.com/spf13/cobra.(*Command).ExecuteC /home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:1115 github.com/spf13/cobra.(*Command).Execute /home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:1039 github.com/spf13/cobra.(*Command).ExecuteContext /home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:1032 main.main /home/kni/hypershift_working/hypershift/product-cli/main.go:60 runtime.main /home/kni/hypershift_working/go/src/runtime/proc.go:250 Error: NodePool.hypershift.openshift.io "nodepool-of-extra1" is invalid: spec.management.upgradeType: Unsupported value: "": supported values: "Replace", "InPlace" NodePool.hypershift.openshift.io "nodepool-of-extra1" is invalid: spec.management.upgradeType: Unsupported value: "": supported values: "Replace", "InPlace"
Expected results:
Should pass, as it does when adding the parameter explicitly:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 --node-upgrade-type InPlace
NodePool nodepool-of-extra1 created
[kni@ocp-edge119 ~]$
Additional info:
A related issue is that the --help output differs depending on whether other parameters are passed:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 --node-upgrade-type Replace --help > long.help.out
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --help > short.help.out
[kni@ocp-edge119 ~]$ diff long.help.out short.help.out
14c14
<       --node-upgrade-type UpgradeType   The NodePool upgrade strategy for how nodes should behave when upgraded. Supported options: Replace, InPlace (default )
---
>       --node-upgrade-type UpgradeType   The NodePool upgrade strategy for how nodes should behave when upgraded. Supported options: Replace, InPlace
[kni@ocp-edge119 ~]$
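One way to keep the documented default and the server-side validation consistent is to default the value on the client before the NodePool spec is built. A minimal cobra sketch follows; the plain string flag is an illustration only (the real hcp flag uses a typed UpgradeType value), so treat this as a sketch of the defaulting idea rather than the actual fix.

```go
package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

func main() {
	var upgradeType string

	cmd := &cobra.Command{
		Use: "create nodepool agent",
		RunE: func(cmd *cobra.Command, args []string) error {
			// Defaulting at submission time keeps --help and behaviour in sync:
			// an unset flag becomes "InPlace" rather than "".
			if upgradeType == "" {
				upgradeType = "InPlace"
			}
			if upgradeType != "Replace" && upgradeType != "InPlace" {
				return fmt.Errorf("unsupported --node-upgrade-type %q", upgradeType)
			}
			fmt.Println("creating NodePool with upgrade type:", upgradeType)
			return nil
		},
	}
	cmd.Flags().StringVar(&upgradeType, "node-upgrade-type", "InPlace",
		"NodePool upgrade strategy: Replace or InPlace")

	if err := cmd.Execute(); err != nil {
		fmt.Println(err)
	}
}
```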
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/vsphere-problem-detector/pull/144
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Add a test case verifying that exceeding the openshift.io/image-tags quota prevents creating new image references in the project.
Version-Release number of selected component (if applicable):
4.16
pr - https://github.com/openshift/origin/pull/28464
In v2 there is an issue when mirroring multiple catalogs in the same mirroring flow.
Complement https://hypershift-docs.netlify.app/how-to/distribute-hosted-cluster-workloads/ to include the share-everything / share-nothing / dedicated behaviour and requirements. -> https://docs.google.com/document/d/1eSaqR7rUwelq0PRC_trL3d5vxLMlJmLytHtFZyFOpvg/edit
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/266
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
OCPBUGS-11856 added a patch to change the termination log permissions manually. Since 4.15.z this is no longer necessary, as it is fixed by the lumberjack dependency bump.
This bug tracks reverting that carry patch.
Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/79
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/274
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/136
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-operator/pull/126
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Client side throttling observed when running the metrics controller.
Steps to Reproduce:
1. Install an AWS cluster in mint mode
2. Enable debug logging by editing cloudcredential/cluster
3. Wait for the metrics loop to run a few times
4. Check the CCO logs
Actual results:
// 7s consumed by metrics loop which is caused by client-side throttling
time="2024-01-20T19:43:56Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
I0120 19:43:56.251278 1 request.go:629] Waited for 176.161298ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials
I0120 19:43:56.451311 1 request.go:629] Waited for 197.182213ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials
I0120 19:43:56.651313 1 request.go:629] Waited for 197.171082ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials
I0120 19:43:56.850631 1 request.go:629] Waited for 195.251487ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials
...
time="2024-01-20T19:44:03Z" level=info msg="reconcile complete" controller=metrics elapsed=7.231061324s
Expected results:
No client-side throttling when running the metrics controller.
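One mitigation, sketched below, is to raise the rate-limiter settings on the client used by the metrics loop; the client-go default of roughly 5 QPS / burst 10 is easily exhausted when the loop GETs one secret per CredentialsRequest. Serving the loop from an informer cache would be the other option (not shown). The values here are illustrative, not the chosen fix.

```go
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newMetricsClient builds a client with a larger rate-limiter budget so the
// periodic metrics calculation does not trip client-side throttling.
func newMetricsClient() (kubernetes.Interface, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	cfg.QPS = 50   // illustrative: well above the per-loop request rate
	cfg.Burst = 100
	return kubernetes.NewForConfig(cfg)
}

func main() {
	if _, err := newMetricsClient(); err != nil {
		log.Fatal(err)
	}
}
```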
Description of problem:
When using the oc cli to query information about release images it is not possible to use the --certificate-authority option to specify an alternative CA bundle for verifying connections to the target registry.
Version-Release number of selected component (if applicable): 4.14.5
How reproducible: 100%
Steps to Reproduce:
1. oc adm release info --registry-config ./auth.json --certificate-authority ./tls-ca-bundle.pem quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64
Actual results:
error: unable to read image quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64: Get "https://quay.io/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority
Expected results:
Something beginning with:
Name:           4.14.9
Digest:         sha256:f5eaf0248779a0478cfd83f055d56dc7d755937800a68ad55f6047c503977c44
Created:        2024-01-12T06:48:42Z
OS/Arch:        linux/amd64
Manifests:      680
Metadata files: 1
Pull From:      quay.io/openshift-release-dev/ocp-release@sha256:f5eaf0248779a0478cfd83f055d56dc7d755937800a68ad55f6047c503977c44
Release Metadata:
Additional info:
To fully verify that this was an issue I went through the following steps, which should show that the oc command is not using the CA bundle in the provided file and that the command would have worked if oc was using the provided bundle:

// show the command works with the system CA bundle
# oc adm release info --registry-config ./auth.json quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64 | head
Name:           4.14.9
Digest:         sha256:f5eaf0248779a0478cfd83f055d56dc7d755937800a68ad55f6047c503977c44
Created:        2024-01-12T06:48:42Z
OS/Arch:        linux/amd64
Manifests:      680
Metadata files: 1
Pull From:      quay.io/openshift-release-dev/ocp-release@sha256:f5eaf0248779a0478cfd83f055d56dc7d755937800a68ad55f6047c503977c44
Release Metadata:

// move the system CA bundle to the local directory
# mv /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem .

// show the same command now fails without that bundle file
# oc adm release info --registry-config ./auth.json quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64 | head
error: unable to read image quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64: Get "https://quay.io/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority

// show using that same bundle file with --certificate-authority doesn't work
# oc adm release info --registry-config ./auth.json --certificate-authority ./tls-ca-bundle.pem quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64 | head
error: unable to read image quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64: Get "https://quay.io/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority

Additionally, this also seems to be a problem for at least the following commands as well:
oc image info
oc adm release extract
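Honoring the flag essentially means building the registry transport from the supplied bundle instead of the system pool. A minimal Go sketch of that behaviour follows; it illustrates the expectation and is not oc's actual transport wiring.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"log"
	"net/http"
	"os"
)

// newRegistryClient builds an HTTP client that trusts only the CAs in the
// supplied bundle, which is what honoring --certificate-authority for
// registry connections amounts to.
func newRegistryClient(caBundlePath string) (*http.Client, error) {
	pemData, err := os.ReadFile(caBundlePath)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemData) {
		return nil, fmt.Errorf("no certificates found in %s", caBundlePath)
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}, nil
}

func main() {
	client, err := newRegistryClient("./tls-ca-bundle.pem")
	if err != nil {
		log.Fatal(err)
	}
	resp, err := client.Get("https://quay.io/v2/")
	if err != nil {
		log.Fatal(err) // with the right bundle this should not be an x509 error
	}
	defer resp.Body.Close()
	fmt.Println("registry responded:", resp.Status)
}
```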
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The PKI operator runs even when the annotation to turn off PKI is set on the hosted control plane.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/k8s-prometheus-adapter/pull/100
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/633
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Install cluster with azure workload identity against 4.16 nightly build, failed as some co are degraded. $ oc get co | grep -v "True False False" NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.16.0-0.nightly-2024-02-07-200316 False False True 153m OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.jima416a1.qe.azure.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.jima416a1.qe.azure.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server) console 4.16.0-0.nightly-2024-02-07-200316 False True True 141m DeploymentAvailable: 0 replicas available for console deployment... ingress False True True 137m The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending) Ingress LB public IP is pending to be created $ oc get svc -n openshift-ingress NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE router-default LoadBalancer 172.30.199.169 <pending> 80:32007/TCP,443:30229/TCP 154m router-internal-default ClusterIP 172.30.112.167 <none> 80/TCP,443/TCP,1936/TCP 154m Detected that CCM pod is CrashLoopBackOff with error $ oc get pod -n openshift-cloud-controller-manager NAME READY STATUS RESTARTS AGE azure-cloud-controller-manager-555cf5579f-hz6gl 0/1 CrashLoopBackOff 21 (2m55s ago) 160m azure-cloud-controller-manager-555cf5579f-xv2rn 0/1 CrashLoopBackOff 21 (15s ago) 160m error in ccm pod: I0208 04:40:57.141145 1 azure.go:931] Azure cloudprovider using try backoff: retries=6, exponent=1.500000, duration=6, jitter=1.000000 I0208 04:40:57.141193 1 azure_auth.go:86] azure: using workload identity extension to retrieve access token I0208 04:40:57.141290 1 azure_diskclient.go:68] Azure DisksClient using API version: 2022-07-02 I0208 04:40:57.141380 1 azure_blobclient.go:73] Azure BlobClient using API version: 2021-09-01 F0208 04:40:57.141471 1 controllermanager.go:314] Cloud provider azure could not be initialized: could not init cloud provider azure: no token file specified. Check pod configuration or set TokenFilePath in the options
Version-Release number of selected component (if applicable):
4.16 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Install cluster with azure workload identity 2. 3.
Actual results:
Installation failed because some operators are degraded.
Expected results:
Installation is successful.
Additional info:
Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/66
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The package that we use for Power VS has recently been revealed to be unmaintained. We should remove it in favor of maintained solutions.
Version-Release number of selected component (if applicable):
4.13.0 onward
How reproducible:
It's always used
Steps to Reproduce:
1. Deploy with IPI on Power VS
2. Use bluemix-go
3.
Actual results:
bluemix-go is used
Expected results:
bluemix-go should be avoided
Additional info:
Description of the problem:
Non-Nutanix node was successfully added to Nutanix day1 cluster
How reproducible:
100%
Steps to reproduce:
1. Deploy Nutanix day1 cluster
4. Try to add non-Nutanix day-2 node to Nutanix cluster
Actual results:
Day-2 node installation started and host installed
Expected results:
Day-2 node doesn't pass pre-installation checks
Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1175
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/47
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/prom-label-proxy/pull/359
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/561
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When deploying to a Power VS workspace created after February 14th 2024, it will not be found by the installer.
Version-Release number of selected component (if applicable):
How reproducible:
Easily.
Steps to Reproduce:
1. Create a Power VS Workspace
2. Specify it in the install config
3. Attempt to deploy
4. Fail with a "...is not a valid guid" error.
Actual results:
Failure to deploy to service instance
Expected results:
Should deploy to service instance
Additional info:
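A plausible direction for a fix, sketched here only as an illustration, is to accept either a workspace GUID or a workspace name and resolve names through a lookup. The regexp and the lookup helper below are hypothetical, not the installer's actual code.

```go
package main

import (
	"fmt"
	"regexp"
)

// guidRE matches the canonical 8-4-4-4-12 hexadecimal GUID form.
var guidRE = regexp.MustCompile(`^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$`)

// lookupWorkspaceGUIDByName is a hypothetical stand-in for a Power VS /
// Resource Controller query that resolves a workspace name to its GUID.
func lookupWorkspaceGUIDByName(name string) (string, error) {
	return "", fmt.Errorf("workspace %q not found", name)
}

// resolveWorkspace sketches accepting either form of serviceInstance
// reference instead of failing with "is not a valid guid" when given a name.
func resolveWorkspace(ref string) (string, error) {
	if guidRE.MatchString(ref) {
		return ref, nil
	}
	return lookupWorkspaceGUIDByName(ref)
}

func main() {
	guid, err := resolveWorkspace("my-powervs-workspace")
	fmt.Println(guid, err)
}
```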
PR https://github.com/openshift/monitoring-plugin/pull/83 was intended to just modify the images built for local testing, but accidentally changed the default Dockerfile, leading to a mismatch between the nginx config and the Dockerfile used in CI. This causes the monitoring-plugin to fail to load in CI builds.
Please review the following PR: https://github.com/openshift/csi-operator/pull/114
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Must-gather link
long snippet from e2e log
external internet 09/01/23 07:26:09.624 Sep 1 07:26:09.624: INFO: Running 'oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 http://www.google.com:80' STEP: creating an egressfirewall object 09/01/23 07:26:09.903 STEP: calling oc create -f /tmp/fixture-testdata-dir978363556/test/extended/testdata/egress-firewall/ovnk-egressfirewall-test.yaml 09/01/23 07:26:09.903 Sep 1 07:26:09.904: INFO: Running 'oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/root/.kube/config create -f /tmp/fixture-testdata-dir978363556/test/extended/testdata/egress-firewall/ovnk-egressfirewall-test.yaml' egressfirewall.k8s.ovn.org/default createdSTEP: sending traffic to control plane nodes should work 09/01/23 07:26:22.122 Sep 1 07:26:22.130: INFO: Running 'oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443' Sep 1 07:26:23.358: INFO: Error running /usr/local/bin/oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443: StdOut> command terminated with exit code 28 StdErr> command terminated with exit code 28[AfterEach] [sig-network][Feature:EgressFirewall] github.com/openshift/origin/test/extended/util/client.go:180 STEP: Collecting events from namespace "e2e-test-egress-firewall-e2e-2vvzx". 09/01/23 07:26:23.358 STEP: Found 4 events. 09/01/23 07:26:23.361 Sep 1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {multus } AddedInterface: Add eth0 [10.131.0.89/23] from ovn-kubernetes Sep 1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {kubelet lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com} Pulled: Container image "quay.io/openshift/community-e2e-images:e2e-quay-io-redhat-developer-nfs-server-1-1-dlXGfzrk5aNo8EjC" already present on machine Sep 1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {kubelet lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com} Created: Created container egressfirewall-container Sep 1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {kubelet lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com} Started: Started container egressfirewall-container Sep 1 07:26:23.363: INFO: POD NODE PHASE GRACE CONDITIONS Sep 1 07:26:23.363: INFO: egressfirewall lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:07 -0400 EDT } {Ready True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:09 -0400 EDT } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:09 -0400 EDT } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:07 -0400 EDT }] Sep 1 07:26:23.363: INFO: Sep 1 07:26:23.367: INFO: skipping dumping cluster info - cluster too large Sep 1 07:26:23.383: INFO: Deleted {user.openshift.io/v1, Resource=users e2e-test-egress-firewall-e2e-2vvzx-user}, err: <nil> Sep 1 07:26:23.398: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients e2e-client-e2e-test-egress-firewall-e2e-2vvzx}, err: <nil> Sep 1 07:26:23.414: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens sha256~X_2HPGEj3O9hpd-3XKTckrp9bO23s_7zlJ3Tkn7ncBE}, err: <nil> [AfterEach] [sig-network][Feature:EgressFirewall] github.com/openshift/origin/test/extended/util/client.go:180 STEP: Collecting events from namespace 
"e2e-test-no-egress-firewall-e2e-84f48". 09/01/23 07:26:23.414 STEP: Found 0 events. 09/01/23 07:26:23.416 Sep 1 07:26:23.417: INFO: POD NODE PHASE GRACE CONDITIONS Sep 1 07:26:23.417: INFO: Sep 1 07:26:23.421: INFO: skipping dumping cluster info - cluster too large Sep 1 07:26:23.446: INFO: Deleted {user.openshift.io/v1, Resource=users e2e-test-no-egress-firewall-e2e-84f48-user}, err: <nil> Sep 1 07:26:23.451: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients e2e-client-e2e-test-no-egress-firewall-e2e-84f48}, err: <nil> Sep 1 07:26:23.457: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens sha256~2Lk8-jWfwpdyo59E9YF7kQFKH2LBUSvnbJdKj7rOzn4}, err: <nil> [DeferCleanup (Each)] [sig-network][Feature:EgressFirewall] dump namespaces | framework.go:196 STEP: dump namespace information after failure 09/01/23 07:26:23.457 [DeferCleanup (Each)] [sig-network][Feature:EgressFirewall] tear down framework | framework.go:193 STEP: Destroying namespace "e2e-test-no-egress-firewall-e2e-84f48" for this suite. 09/01/23 07:26:23.457 [DeferCleanup (Each)] [sig-network][Feature:EgressFirewall] dump namespaces | framework.go:196 STEP: dump namespace information after failure 09/01/23 07:26:23.462 [DeferCleanup (Each)] [sig-network][Feature:EgressFirewall] tear down framework | framework.go:193 STEP: Destroying namespace "e2e-test-egress-firewall-e2e-2vvzx" for this suite. 09/01/23 07:26:23.463 fail [github.com/openshift/origin/test/extended/networking/egress_firewall.go:155]: Unexpected error: <*fmt.wrapError | 0xc001dd50a0>: { msg: "Error running /usr/local/bin/oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443:\nStdOut>\ncommand terminated with exit code 28\nStdErr>\ncommand terminated with exit code 28\nexit status 28\n", err: <*exec.ExitError | 0xc001dd5080>{ ProcessState: { pid: 140483, status: 7168, rusage: { Utime: {Sec: 0, Usec: 149480}, Stime: {Sec: 0, Usec: 19930}, Maxrss: 222592, Ixrss: 0, Idrss: 0, Isrss: 0, Minflt: 1536, Majflt: 0, Nswap: 0, Inblock: 0, Oublock: 0, Msgsnd: 0, Msgrcv: 0, Nsignals: 0, Nvcsw: 596, Nivcsw: 173, }, }, Stderr: nil, }, } Error running /usr/local/bin/oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443: StdOut> command terminated with exit code 28 StdErr> command terminated with exit code 28 exit status 28 occurred Ginkgo exit error 1: exit with code 1failed: (18.7s) 2023-09-01T11:26:23 "[sig-network][Feature:EgressFirewall] when using openshift ovn-kubernetes should ensure egressfirewall is created [Suite:openshift/conformance/parallel]"
Version-Release number of selected component (if applicable):
4.13.11
How reproducible:
This e2e failure is not consistently reproduceable.
Steps to Reproduce:
1.Start a Z stream Job via Jenkins 2.monitor e2e
Actual results:
e2e is getting failed
Expected results:
e2e should pass
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/274
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
OTA team wants to rename the `supported but not recommended` update edges to `known issues`
Version-Release number of selected component (if applicable):
openshift-4.16
Expected results:
`supported but not recommended` edges are renamed to `known issues`
Additional info:
https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219/frontend/public/components/modals/cluster-update-modal.tsx#L191 https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219/frontend/public/components/modals/cluster-update-modal.tsx#L216 https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219/frontend/public/components/modals/cluster-update-modal.tsx#L219 https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219/frontend/public/components/modals/cluster-update-modal.tsx#L234 https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219[…]rontend/public/components/cluster-settings/cluster-settings.tsx
Description of problem:
documentationBaseURL still points to 4.14
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Check documentationBaseURL on a 4.16 cluster:
# oc get configmap console-config -n openshift-console -o yaml | grep documentationBaseURL
documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/
2.
3.
Actual results:
1. documentationBaseURL is still pointing to 4.14
Expected results:
1. documentationBaseURL should point to 4.16
Additional info:
Changes made for faster risk cache-warming (the OCPBUGS-19512 series) introduced an unfortunate cycle:
1. Cincinnati serves vulnerable PromQL, like graph-data#4524.
2. Clusters pick up that broken PromQL, try to evaluate, and fail. Re-eval-and-fail loop continues.
3. Cincinnati PromQL fixed, like graph-data#4528.
4. Cases:
The regression went back via:
Updates from those releases (and later in their 4.y, until this bug lands a fix) to later releases are exposed.
Likely very reproducible for exposed releases, but only when clusters are served PromQL risks that will consistently fail evaluation.
1. Launch a cluster.
2. Point it at dummy Cincinnati data, as described in OTA-520. Initially declare a risk with broken PromQL in that data, like cluster_operator_conditions.
3. Wait until the cluster is reporting Recommended=Unknown for those risks (oc adm upgrade --include-not-recommended).
4. Update the risk to working PromQL, like group(cluster_operator_conditions). Alternatively, update anything about the update-service data (e.g. adding a new update target with a path from the cluster's version).
5. Wait 10 minutes for the CVO to have plenty of time to pull that new Cincinnati data.
6. oc get -o json clusterversion version | jq '.status.conditionalUpdates[].risks[].matchingRules[].promql.promql' | sort | uniq | jq -r .
Exposed releases will still have the broken PromQL in their output (or will lack the new update target you added, or whatever the Cincinnati data change was).
Fixed releases will have picked up the fixed PromQL in their output (or will have the new update target you added, or whatever the Cincinnati data change was).
To detect exposure in collected Insights, look for EvaluationFailed conditionalUpdates like:
$ oc get -o json clusterversion version | jq -r '.status.conditionalUpdates[].conditions[] | select(.type == "Recommended" and .status == "Unknown" and .reason == "EvaluationFailed" and (.message | contains("invalid PromQL")))' { "lastTransitionTime": "2023-12-15T22:00:45Z", "message": "Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34\nAdding a new worker node will fail for clusters running on ARO. https://issues.redhat.com/browse/MCO-958", "reason": "EvaluationFailed", "status": "Unknown", "type": "Recommended" }
To confirm in-cluster vs. other EvaluationFailed invalid PromQL issues, you can look for Cincinnati retrieval attempts in CVO logs. Example from a healthy cluster:
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from\|PromQL' | tail I1221 20:36:39.783530 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?... I1221 20:36:39.831358 1 promql.go:118] evaluate PromQL cluster condition: "(\n group(cluster_operator_conditions{name=\"aro\"})\n or\n 0 * group(cluster_operator_conditions)\n)\n" I1221 20:40:19.674925 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?... I1221 20:40:19.727998 1 promql.go:118] evaluate PromQL cluster condition: "(\n group(cluster_operator_conditions{name=\"aro\"})\n or\n 0 * group(cluster_operator_conditions)\n)\n" I1221 20:43:59.567369 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?... I1221 20:43:59.620315 1 promql.go:118] evaluate PromQL cluster condition: "(\n group(cluster_operator_conditions{name=\"aro\"})\n or\n 0 * group(cluster_operator_conditions)\n)\n" I1221 20:47:39.457582 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?... I1221 20:47:39.509505 1 promql.go:118] evaluate PromQL cluster condition: "(\n group(cluster_operator_conditions{name=\"aro\"})\n or\n 0 * group(cluster_operator_conditions)\n)\n" I1221 20:51:19.348286 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?... I1221 20:51:19.401496 1 promql.go:118] evaluate PromQL cluster condition: "(\n group(cluster_operator_conditions{name=\"aro\"})\n or\n 0 * group(cluster_operator_conditions)\n)\n"
showing fetch lines every few minutes. And from an exposed cluster, only showing PromQL eval lines:
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from\|PromQL' | tail I1221 20:50:10.165101 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:11.166170 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:12.166314 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:13.166517 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:14.166847 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:15.167737 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:16.168486 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:17.169417 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:18.169576 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 I1221 20:50:19.170544 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34 $ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from' | tail ...no hits...
If bitten, the remediation is to address the invalid PromQL. For example, we fixed that AROBrokenDNSMasq expression in graph-data#4528. After that, the local cluster administrator should restart their CVO, such as with:
$ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pods
Bump Go to v1.21 in the main and hack/tools go.mod files.
Yesterday a major DPCR outage (and thus a Loki outage) took the system down entirely. One test failed as a result:
[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[ { "metric": { "__name__": "ALERTS", "alertname": "KubeDaemonSetRolloutStuck", "alertstate": "firing", "container": "kube-rbac-proxy-main", "daemonset": "loki-promtail", "endpoint": "https-main", "job": "kube-state-metrics", "namespace": "openshift-e2e-loki", "prometheus": "openshift-monitoring/k8s", "service": "kube-state-metrics", "severity": "warning" }, "value": [ 1709071917.851, "1" ] } ]
The query this test uses should be adapted to omit everything in openshift-e2e-loki.
Ideally we would backport the fix, but fixing it only going forward is also acceptable if backports are too cumbersome.
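A minimal sketch of the direction suggested above, assuming the conformance test assembles its firing-alerts check as a PromQL string over the ALERTS metric; the helper name and label filter here are illustrative, not the actual openshift/origin code:

```go
package main

import "fmt"

// firingAlertsQuery is a hypothetical helper: it builds the "no unexpected
// firing alerts" query while ignoring everything in the CI-only
// openshift-e2e-loki namespace. The allow-list mirrors the test description
// above; the real query in openshift/origin may be shaped differently.
func firingAlertsQuery(ignoredNamespaces string) string {
	return fmt.Sprintf(
		`ALERTS{alertstate="firing",alertname!~"Watchdog|AlertmanagerReceiversNotConfigured",namespace!~"%s"}`,
		ignoredNamespaces)
}

func main() {
	// Example usage: exclude the loki-promtail namespace used only by CI tooling.
	fmt.Println(firingAlertsQuery("openshift-e2e-loki"))
}
```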
Please review the following PR: https://github.com/openshift/image-customization-controller/pull/122
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The customer points out that the maximum memory value used for scale-up is interpreted as GB, while the minimum memory used for scale-down is interpreted as GiB.
So setting both values in GB makes scale-down fail:
Skipping ocgc4preplatgt-98fwh-worker-c-sk2sz - minimal limit exceeded for [memory]
While setting both values in GiB makes scale-up fail.
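For reference, a minimal Go sketch of the unit mismatch using resource.Quantity from k8s.io/apimachinery; it does not show the autoscaler internals, only why the same number in GB and in GiB diverges:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// "4G" is decimal gigabytes, "4Gi" is binary gibibytes; they differ by roughly 7%.
	gb := resource.MustParse("4G")   // 4000000000 bytes
	gib := resource.MustParse("4Gi") // 4294967296 bytes
	fmt.Println(gb.Value(), gib.Value())

	// If scale-up compares against the GB value while scale-down compares
	// against the GiB value, a node sized between the two trips exactly one
	// of the checks, which matches the "minimal limit exceeded" message above.
}
```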
Description of the problem:
The ClusterImageSetRef specified in the SiteConfig CR on the hub cluster mismatches the image actually pulled from quay when trying to install 4.12.x managed SNO clusters. This behavior has not been detected when installing SNO 4.14.x.
How reproducible:
Install 4.12.19 from ACM using clusterImageSetNameRef: "img4.12.19-x86-64-appsub"
Actual results:
Image pulled is actually img4.12.19-multi-x86-64-appsub
# oc adm release info 4.12.19 | more
Name: 4.12.19
Digest: sha256:4db028028aae12cb82b784ae54aaca6888f224b46565406f1a89831a67f42030
Created: 2023-05-24T06:58:32Z
OS/Arch: linux/amd64
Manifests: 647
Metadata files: 1
Pull From: quay.io/openshift-release-dev/ocp-release@sha256:4db028028aae12cb82b784ae54aaca6888f224b46565406f1a89831a67f42030
Metadata:
release.openshift.io/architecture: multi
url: https://access.redhat.com/errata/RHSA-2023:3287
Expected results:
Image pulled is img4.12.19-x86-64-appsub
# oc adm release info 4.12.19 | more
Name: 4.12.19
Digest: sha256:41fd42cc8b9f86fc86cc8763dcf27e976299ff632a336d393b8e643bd8a5f967
Pull From: quay.io/openshift-release-dev/ocp-release@sha256:41fd42cc8b9f86fc86cc8763dcf27e976299ff632a336d393b8e643bd8a5f967
Metadata:
url: https://access.redhat.com/errata/RHSA-2023:3287
Please review the following PR: https://github.com/openshift/image-customization-controller/pull/113
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cloud-provider-vsphere/pull/59
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
With 4.15, resource watch completes whenever it is interrupted. But for 4.16 jobs, it does not complete until the 1h grace period kicks in and the job is terminated by ci-operator. This means:
This was discovered when investigating this slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1705578623724259
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/98
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The current description of HighOverallControlPlaneCPU is wrong for SNO clusters and can mislead users. We need to add information about SNO clusters to the alert's description.
Description of problem:
Pod capi-ibmcloud-controller-manager stuck in ContainerCreating on IBM cloud
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Build a cluster on IBM Cloud and enable TechPreviewNoUpgrade
Actual results:
4.16 cluster
$ oc get po
NAME                                                READY   STATUS              RESTARTS      AGE
capi-controller-manager-6bccdc844-jsm4s             1/1     Running             9 (24m ago)   175m
capi-ibmcloud-controller-manager-75d55bfd7d-6qfxh   0/2     ContainerCreating   0             175m
cluster-capi-operator-768c6bd965-5tjl5              1/1     Running             0             3h
Warning FailedMount 5m15s (x87 over 166m) kubelet MountVolume.SetUp failed for volume "credentials" : secret "capi-ibmcloud-manager-bootstrap-credentials" not found
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-01-21-154905   True        False         156m    Cluster version is 4.16.0-0.nightly-2024-01-21-154905
4.15 cluster
$ oc get po
NAME                                                READY   STATUS              RESTARTS        AGE
capi-controller-manager-6b67f7cff4-vxtpg            1/1     Running             6 (9m51s ago)   35m
capi-ibmcloud-controller-manager-54887589c6-6plt2   0/2     ContainerCreating   0               35m
cluster-capi-operator-7b7f48d898-9r6nn              1/1     Running             1 (17m ago)     39m
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2024-01-22-160236   True        False         11m     Cluster version is 4.15.0-0.nightly-2024-01-22-160236
Expected results:
No pod is in ContainerCreating status
Additional info:
must-gather: https://drive.google.com/file/d/1F5xUVtW-vGizAYgeys0V5MMjp03zkSEH/view?usp=sharing
When the metal3-plugin is enabled it adds Disks and NICs tabs to the Node details page.
We would like to remove these tabs because:
This is an extension of https://issues.redhat.com/browse/HOSTEDCP-190, in which we are adding container resource preservation to more hosted control plane components.
Description of problem:
Unable to view the alerts and metrics pages; getting a blank page.
Version-Release number of selected component (if applicable):
4.15.0-nightly
How reproducible:
Always
Steps to Reproduce:
Click on any alert under "Notification Panel" to view more, and you will be redirected to the alert page.
Actual results:
User is unable to view any alerts, metrics.
Expected results:
User should be able to view all/individual alerts, metrics.
Additional info:
N.A
Description of problem:
Found in QE's CI (with the vsphere-agent profile): the storage CO is not available and the vsphere-problem-detector-operator pod is in CrashLoopBackOff with a panic. (Find the must-gather here: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-vsphere-agent-disconnected-ha-f14/1734850632575094784/artifacts/vsphere-agent-disconnected-ha-f14/gather-must-gather/)
The storage CO reports "unable to find VM by UUID":
- lastTransitionTime: "2023-12-13T09:15:27Z"
  message: "VSphereCSIDriverOperatorCRAvailable: VMwareVSphereControllerAvailable: unable to find VM ci-op-782gwsbd-b3d4e-master-2 by UUID \nVSphereProblemDetectorDeploymentControllerAvailable: Waiting for Deployment"
  reason: VSphereCSIDriverOperatorCR_VMwareVSphereController_vcenter_api_error::VSphereProblemDetectorDeploymentController_Deploying
  status: "False"
  type: Available
(But I did not see "unable to find VM by UUID" in the vsphere-problem-detector-operator log in the must-gather.)
The vsphere-problem-detector-operator log:
2023-12-13T10:10:56.620216117Z I1213 10:10:56.620159 1 vsphere_check.go:149] Connected to vcenter.devqe.ibmc.devcluster.openshift.com as ci_user_01@devqe.ibmc.devcluster.openshift.com
2023-12-13T10:10:56.625161719Z I1213 10:10:56.625108 1 vsphere_check.go:271] CountVolumeTypes passed
2023-12-13T10:10:56.625291631Z I1213 10:10:56.625258 1 zones.go:124] Checking tags for multi-zone support.
2023-12-13T10:10:56.625449771Z I1213 10:10:56.625433 1 zones.go:202] No FailureDomains configured. Skipping check.
2023-12-13T10:10:56.625497726Z I1213 10:10:56.625487 1 vsphere_check.go:271] CheckZoneTags passed
2023-12-13T10:10:56.625531795Z I1213 10:10:56.625522 1 info.go:44] vCenter version is 8.0.2, apiVersion is 8.0.2.0 and build is 22617221
2023-12-13T10:10:56.625562833Z I1213 10:10:56.625555 1 vsphere_check.go:271] ClusterInfo passed
2023-12-13T10:10:56.625603236Z I1213 10:10:56.625594 1 datastore.go:312] checking datastore /DEVQEdatacenter/datastore/vsanDatastore for permissions
2023-12-13T10:10:56.669205822Z panic: runtime error: invalid memory address or nil pointer dereference
2023-12-13T10:10:56.669338411Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x23096cb]
2023-12-13T10:10:56.669565413Z
2023-12-13T10:10:56.669591144Z goroutine 550 [running]:
2023-12-13T10:10:56.669838383Z github.com/openshift/vsphere-problem-detector/pkg/operator.getVM(0xc0005da6c0, 0xc0002d3b80)
2023-12-13T10:10:56.669991749Z github.com/openshift/vsphere-problem-detector/pkg/operator/vsphere_check.go:319 +0x3eb
2023-12-13T10:10:56.670212441Z github.com/openshift/vsphere-problem-detector/pkg/operator.(*vSphereChecker).enqueueSingleNodeChecks.func1()
2023-12-13T10:10:56.670289644Z github.com/openshift/vsphere-problem-detector/pkg/operator/vsphere_check.go:238 +0x55
2023-12-13T10:10:56.670490453Z github.com/openshift/vsphere-problem-detector/pkg/operator.(*CheckThreadPool).worker.func1(0xc000c88760?, 0x0?)
2023-12-13T10:10:56.670702592Z github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:40 +0x55
2023-12-13T10:10:56.671142070Z github.com/openshift/vsphere-problem-detector/pkg/operator.(*CheckThreadPool).worker(0xc000c78660, 0xc000c887a0?)
2023-12-13T10:10:56.671331852Z github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:41 +0xe7
2023-12-13T10:10:56.671529761Z github.com/openshift/vsphere-problem-detector/pkg/operator.NewCheckThreadPool.func1()
2023-12-13T10:10:56.671589925Z github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:28 +0x25
2023-12-13T10:10:56.671776328Z created by github.com/openshift/vsphere-problem-detector/pkg/operator.NewCheckThreadPool
2023-12-13T10:10:56.671847478Z github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:27 +0x73
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-11-033133
How reproducible:
Steps to Reproduce:
1. See description
Actual results:
vpd panics
Expected results:
vpd should not panic
Additional info:
I guess it is a privileges issue, but our pod should not panic.
Looks like we're accidentally passing the JavaScript `window.alert()` method instead of the Prometheus alert object.
Description of problem:
The cluster-network-operator in hypershift, when templating in-cluster resources, does not use the node-local address of the client-side haproxy load balancer that runs on all nodes. This bypasses a level of health checking of the redundant backend apiserver addresses that is performed by the local kube-apiserver-proxy pods running on every node in a hypershift environment. In environments where the backend API servers are not fronted by an additional cloud load balancer, this leads to a percentage of request failures from in-cluster components when a control plane endpoint goes down, even if other endpoints are available.
Version-Release number of selected component (if applicable):
4.16 4.15 4.14
How reproducible:
100%
Steps to Reproduce:
1. Setup a hypershift cluster in a baremetal/non cloud environment where there are redundant API servers behind a DNS that point directly to the node IPs. 2. Power down one of the control plane nodes 3. Schedule workload into cluster that depends on kube-proxy and/or multus to setup networking configuration 4. You will see errors like the following ``` add): Multus: [openshiftai/moe-8b-cmisale-master-0/9c1fd369-94f5-481c-a0de-ba81a3ee3583]: error getting pod: Get "https://[p9d81ad32fcdb92dbb598-6b64a6ccc9c596bf59a86625d8fa2202-c000.us-east.satellite.appdomain.cloud]:30026/api/v1/namespaces/openshiftai/pods/moe-8b-cmisale-master-0?timeout=1m0s": dial tcp 192.168.98.203:30026: connect: timeout ```
Actual results:
When a control plane node fails, intermittent timeouts occur when kube-proxy/multus resolve the DNS and the failed control plane node's IP is returned.
Expected results:
No requests fail (which will be the case if all traffic is routed through the node-local load balancer instance).
Additional info:
Additionally: control plane components in the management cluster that live next to the apiserver are adding unneeded dependencies by using an external DNS entry to talk to the kube-apiserver, when they could use the local kube-apiserver address and have it all go over cluster-local networking.
Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/305
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Since OCP 4.15 we see an issue with an OLM-deployed operator being unable to operate in multiple watched namespaces. It works fine with a single watched namespace (subscription). The same test also passes if we deploy the operator from files instead of via OLM. Based on the operator log it appears to be a permission issue. The same test works fine on OCP 4.14 and older.
Version-Release number of selected component (if applicable):
Server Version: 4.15.0-ec.3
Kubernetes Version: v1.28.3+20a5764
How reproducible:
Always
Steps to Reproduce:
0. oc login OCP4.15 1. git clone https://gitlab.cee.redhat.com/amq-broker/claire 2. make -f Makefile.downstream build ARTEMIS_VERSION=7.11.4 RELEASE_TYPE=released 3. make -f Makefile.downstream operator_test OLM_IIB=registry-proxy.engineering.redhat.com/rh-osbs/iib:636350 OLM_CHANNEL=7.11.x TESTS=ClusteredOperatorSmokeTests TEST_LOG_LEVEL=debug DISABLE_RANDOM_NAMESPACES=true
Actual results:
Can't deploy artemis broker custom resource in given namespace (permission issue - see details below)
Expected results:
Successfully deployed broker on watched namespaces
Additional info:
Log from AMQ Broker operator - seems like some permission issues since 4.15
E0103 10:04:54.425202 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1beta1.ActiveMQArtemis: failed to list *v1beta1.ActiveMQArtemis: activemqartemises.broker.amq.io is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "activemqartemises" in API group "broker.amq.io" in the namespace "cluster-testsa"
E0103 10:04:54.425207 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1beta1.ActiveMQArtemisSecurity: failed to list *v1beta1.ActiveMQArtemisSecurity: activemqartemissecurities.broker.amq.io is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "activemqartemissecurities" in API group "broker.amq.io" in the namespace "cluster-testsa"
E0103 10:04:54.425221 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "pods" in API group "" in the namespace "cluster-testsa"
W0103 10:04:54.425296 1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: failed to list *v1beta1.ActiveMQArtemisScaledown: activemqartemisscaledowns.broker.amq.io is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "activemqartemisscaledowns" in API group "broker.amq.io" in the namespace "cluster-testsa"
Description of problem:
CPMS is supported in 4.15 on the vSphere platform when TechPreviewNoUpgrade is enabled. But after building a cluster with no failure domain (or a single failure domain) set in install-config, there were three duplicated failure domains.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-11-033133
How reproducible:
Install a cluster with TP enabled and don't set a failure domain (or set a single failure domain) in install-config.
Steps to Reproduce:
1. Do not configure a failure domain in install-config (or set a single failure domain). 2. Install a cluster with TP enabled 3. Check CPMS with the command: oc get controlplanemachineset -oyaml
Actual results:
Duplicated failure domains:
  failureDomains:
    platform: VSphere
    vsphere:
    - name: generated-failure-domain
    - name: generated-failure-domain
    - name: generated-failure-domain
  metadata:
    labels:
Expected results:
Failure domains should not be duplicated when a single failure domain is set in install-config. A failure domain should not exist when no failure domain is set in install-config.
Additional info:
Description of problem:
Cluster install fails on IBMCloud, nodes tainted with node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Version-Release number of selected component (if applicable):
from 4.16.0-0.nightly-2023-12-22-210021 last PASS version: 4.16.0-0.nightly-2023-12-20-061023
How reproducible:
Always
Steps to Reproduce:
1. Install a cluster on IBMCloud, we use auto flexy template: aos-4_16/ipi-on-ibmcloud/versioned-installer liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version False True 92m Unable to apply 4.16.0-0.nightly-2023-12-25-200355: an unknown error has occurred: MultipleErrors liuhuali@Lius-MacBook-Pro huali-test % oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication baremetal cloud-controller-manager 4.16.0-0.nightly-2023-12-25-200355 True False False 89m cloud-credential cluster-autoscaler config-operator console control-plane-machine-set csi-snapshot-controller dns etcd image-registry ingress insights kube-apiserver kube-controller-manager kube-scheduler kube-storage-version-migrator machine-api machine-approver machine-config marketplace monitoring network node-tuning openshift-apiserver openshift-controller-manager openshift-samples operator-lifecycle-manager operator-lifecycle-manager-catalog operator-lifecycle-manager-packageserver service-ca storage liuhuali@Lius-MacBook-Pro huali-test % oc get node NAME STATUS ROLES AGE VERSION huliu-ibma-qbg48-master-0 NotReady control-plane,master 89m v1.29.0+b0d609f huliu-ibma-qbg48-master-1 NotReady control-plane,master 89m v1.29.0+b0d609f huliu-ibma-qbg48-master-2 NotReady control-plane,master 89m v1.29.0+b0d609f liuhuali@Lius-MacBook-Pro huali-test % oc describe node huliu-ibma-qbg48-master-0 Name: huliu-ibma-qbg48-master-0 Roles: control-plane,master Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/os=linux kubernetes.io/arch=amd64 kubernetes.io/hostname=huliu-ibma-qbg48-master-0 kubernetes.io/os=linux node-role.kubernetes.io/control-plane= node-role.kubernetes.io/master= node.openshift.io/os_id=rhcos Annotations: volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Wed, 27 Dec 2023 18:02:21 +0800 Taints: node-role.kubernetes.io/master:NoSchedule node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule node.kubernetes.io/not-ready:NoSchedule Unschedulable: false Lease: HolderIdentity: huliu-ibma-qbg48-master-0 AcquireTime: <unset> RenewTime: Wed, 27 Dec 2023 19:32:24 +0800 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- MemoryPressure False Wed, 27 Dec 2023 19:32:21 +0800 Wed, 27 Dec 2023 18:02:21 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available DiskPressure False Wed, 27 Dec 2023 19:32:21 +0800 Wed, 27 Dec 2023 18:02:21 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure PIDPressure False Wed, 27 Dec 2023 19:32:21 +0800 Wed, 27 Dec 2023 18:02:21 +0800 KubeletHasSufficientPID kubelet has sufficient PID available Ready False Wed, 27 Dec 2023 19:32:21 +0800 Wed, 27 Dec 2023 18:02:21 +0800 KubeletNotReady container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started? 
Addresses: Capacity: cpu: 4 ephemeral-storage: 104266732Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 16391716Ki pods: 250 Allocatable: cpu: 3500m ephemeral-storage: 95018478229 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 15240740Ki pods: 250 System Info: Machine ID: 0ae21a012be844f18c5871f6eaefb85b System UUID: 0ae21a01-2be8-44f1-8c58-71f6eaefb85b Boot ID: fbe619e2-8ff5-4cdb-b6a4-cd6830ccc568 Kernel Version: 5.14.0-284.45.1.el9_2.x86_64 OS Image: Red Hat Enterprise Linux CoreOS 416.92.202312250319-0 (Plow) Operating System: linux Architecture: amd64 Container Runtime Version: cri-o://1.28.2-9.rhaos4.15.git6d902a3.el9 Kubelet Version: v1.29.0+b0d609f Kube-Proxy Version: v1.29.0+b0d609f Non-terminated Pods: (0 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age --------- ---- ------------ ---------- --------------- ------------- --- Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 0 (0%) 0 (0%) memory 0 (0%) 0 (0%) ephemeral-storage 0 (0%) 0 (0%) hugepages-1Gi 0 (0%) 0 (0%) hugepages-2Mi 0 (0%) 0 (0%) Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal NodeHasNoDiskPressure 90m (x7 over 90m) kubelet Node huliu-ibma-qbg48-master-0 status is now: NodeHasNoDiskPressure Normal NodeHasSufficientPID 90m (x7 over 90m) kubelet Node huliu-ibma-qbg48-master-0 status is now: NodeHasSufficientPID Normal NodeHasSufficientMemory 90m (x7 over 90m) kubelet Node huliu-ibma-qbg48-master-0 status is now: NodeHasSufficientMemory Normal RegisteredNode 90m node-controller Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller Normal RegisteredNode 73m node-controller Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller Normal RegisteredNode 53m node-controller Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller Normal RegisteredNode 32m node-controller Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller Normal RegisteredNode 12m node-controller Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller liuhuali@Lius-MacBook-Pro huali-test % oc get pod -n openshift-cloud-controller-manager NAME READY STATUS RESTARTS AGE ibm-cloud-controller-manager-787645668b-djqnr 0/1 CrashLoopBackOff 22 (2m29s ago) 90m ibm-cloud-controller-manager-787645668b-pgkh2 0/1 Error 15 (5m8s ago) 52m liuhuali@Lius-MacBook-Pro huali-test % oc describe pod ibm-cloud-controller-manager-787645668b-pgkh2 -n openshift-cloud-controller-manager Name: ibm-cloud-controller-manager-787645668b-pgkh2 Namespace: openshift-cloud-controller-manager Priority: 2000000000 Priority Class Name: system-cluster-critical Node: huliu-ibma-qbg48-master-2/ Start Time: Wed, 27 Dec 2023 18:41:23 +0800 Labels: infrastructure.openshift.io/cloud-controller-manager=IBMCloud k8s-app=ibm-cloud-controller-manager pod-template-hash=787645668b Annotations: operator.openshift.io/config-hash: 82a75c6ff86a490b0dac9c8c9b91f1987da0e646a42d72c33c54cbde3c29395b Status: Running IP: IPs: <none> Controlled By: ReplicaSet/ibm-cloud-controller-manager-787645668b Containers: cloud-controller-manager: Container ID: cri-o://c56e246f64c770146c30b7a894f6a4d974159551dbb9d1ea31c238e516a0f854 Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218 Image ID: 
e494d0d4b28e31170a4a2792bb90701c7f1e81c78c03e3686c5f0e601801937e Port: 10258/TCP Host Port: 10258/TCP Command: /bin/bash -c #!/bin/bash set -o allexport if [[ -f /etc/kubernetes/apiserver-url.env ]]; then source /etc/kubernetes/apiserver-url.env fi exec /bin/ibm-cloud-controller-manager \ --bind-address=$(POD_IP_ADDRESS) \ --use-service-account-credentials=true \ --configure-cloud-routes=false \ --cloud-provider=ibm \ --cloud-config=/etc/ibm/cloud.conf \ --profiling=false \ --leader-elect=true \ --leader-elect-lease-duration=137s \ --leader-elect-renew-deadline=107s \ --leader-elect-retry-period=26s \ --leader-elect-resource-namespace=openshift-cloud-controller-manager \ --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_AES_128_GCM_SHA256,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_256_GCM_SHA384 \ --v=2 State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Exit Code: 1 Started: Wed, 27 Dec 2023 19:33:23 +0800 Finished: Wed, 27 Dec 2023 19:33:23 +0800 Ready: False Restart Count: 15 Requests: cpu: 75m memory: 60Mi Liveness: http-get https://:10258/healthz delay=300s timeout=160s period=10s #success=1 #failure=3 Environment: POD_IP_ADDRESS: (v1:status.podIP) VPCCTL_CLOUD_CONFIG: /etc/ibm/cloud.conf VPCCTL_PUBLIC_ENDPOINT: false Mounts: /etc/ibm from cloud-conf (rw) /etc/kubernetes from host-etc-kube (ro) /etc/pki/ca-trust/extracted/pem from trusted-ca (ro) /etc/vpc from ibm-cloud-credentials (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cbd4b (ro) Conditions: Type Status PodReadyToStartContainers True Initialized True Ready False ContainersReady False PodScheduled True Volumes: trusted-ca: Type: ConfigMap (a volume populated by a ConfigMap) Name: ccm-trusted-ca Optional: false host-etc-kube: Type: HostPath (bare host directory volume) Path: /etc/kubernetes HostPathType: Directory cloud-conf: Type: ConfigMap (a volume populated by a ConfigMap) Name: cloud-conf Optional: false ibm-cloud-credentials: Type: Secret (a volume populated by a Secret) SecretName: ibm-cloud-credentials Optional: false kube-api-access-cbd4b: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true ConfigMapName: openshift-service-ca.crt ConfigMapOptional: <nil> QoS Class: Burstable Node-Selectors: node-role.kubernetes.io/master= Tolerations: node-role.kubernetes.io/master:NoSchedule op=Exists node.cloudprovider.kubernetes.io/uninitialized:NoSchedule op=Exists node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists for 120s node.kubernetes.io/not-ready:NoSchedule op=Exists node.kubernetes.io/unreachable:NoExecute op=Exists for 120s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 52m default-scheduler Successfully assigned openshift-cloud-controller-manager/ibm-cloud-controller-manager-787645668b-pgkh2 to huliu-ibma-qbg48-master-2 Normal Pulling 52m kubelet Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218" Normal Pulled 52m kubelet Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218" 
in 3.431s (3.431s including waiting) Normal Created 50m (x5 over 52m) kubelet Created container cloud-controller-manager Normal Started 50m (x5 over 52m) kubelet Started container cloud-controller-manager Normal Pulled 50m (x4 over 52m) kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218" already present on machine Warning BackOff 2m19s (x240 over 52m) kubelet Back-off restarting failed container cloud-controller-manager in pod ibm-cloud-controller-manager-787645668b-pgkh2_openshift-cloud-controller-manager(d7f93ecf-cd14-450e-a986-028559a775b3) liuhuali@Lius-MacBook-Pro huali-test %
Actual results:
cluster install failed on IBMCloud
Expected results:
cluster install succeed on IBMCloud
Additional info:
The funky job renaming done in these is breaking risk analysis. The disruption checking actually ran if you look in the logs, but we don't get far enough to generate the html template, I suspect because of the error returned by risk analysis.
Would be nice if both would work but I'd be happy just to get the disruption portion populated for now. I don't think the overall risk analysis will be easy.
Description of problem:
After a manual crash of an OCP node, the OSPD VM running on that node is stuck in a terminating state.
Version-Release number of selected component (if applicable):
OCP 4.12.15 osp-director-operator.v1.3.0 kubevirt-hyperconverged-operator.v4.12.5
How reproducible:
Log in to an OCP 4.12.15 node running a VM, then manually crash the master node. After reboot the VM stays in a terminating state.
Steps to Reproduce:
1. ssh core@masterX 2. sudo su 3. echo c > /proc/sysrq-trigger
Actual results:
After reboot the VM stay in terminating state $ omc get node|sed -e 's/modl4osp03ctl/model/g' | sed -e 's/telecom.tcnz.net/aaa.bbb.ccc/g' NAME STATUS ROLES AGE VERSION model01.aaa.bbb.ccc Ready control-plane,master,worker 91d v1.25.8+37a9a08 model02.aaa.bbb.ccc Ready control-plane,master,worker 91d v1.25.8+37a9a08 model03.aaa.bbb.ccc Ready control-plane,master,worker 91d v1.25.8+37a9a08 $ omc get pod -n openstack NAME READY STATUS RESTARTS AGE openstack-provision-server-7b79fcc4bd-x8kkz 2/2 Running 0 8h openstackclient 1/1 Running 0 7h osp-director-operator-controller-manager-5896b5766b-sc7vm 2/2 Running 0 8h osp-director-operator-index-qxxvw 1/1 Running 0 8h virt-launcher-controller-0-9xpj7 1/1 Running 0 20d virt-launcher-controller-1-5hj9x 1/1 Running 0 20d virt-launcher-controller-2-vhd69 0/1 NodeAffinity 0 43d $ omc describe pod virt-launcher-controller-2-vhd69 |grep Status: Status: Terminating (lasts 37h) $ xsos sosreport-xxxx/|grep time ... Boot time: Wed Nov 22 01:44:11 AM UTC 2023 Uptime: 8:27, 0 users
Expected results:
VM restart automatically OR does not stay in Terminating state
Additional info:
The issue has been seen two times. The first time, a kernel crash occurred and the associated VM on the node ended up in a terminating state. The second time we tried to reproduce the issue by manually crashing the kernel and got the same result: the VM running on the OCP node stays in a terminating state.
Description of problem:
The etcd team has introduced an e2e test that exercises a full etcd backup and restore cycle in OCP [1]. We run those tests as part of our PR builds and since 4.15 [2] (also 4.16 [3]), we have failed runs with the catalogd-controller-manager crash looping: 1 events happened too frequently event [namespace/openshift-catalogd node/ip-10-0-25-29.us-west-2.compute.internal pod/catalogd-controller-manager-768bb57cdb-nwbhr hmsg/47b381d71b - Back-off restarting failed container manager in pod catalogd-controller-manager-768bb57cdb-nwbhr_openshift-catalogd(aa38d084-ecb7-4588-bd75-f95adb4f5636)] happened 44 times} I assume something in that controller doesn't really deal gracefully with the restoration process of etcd, or the apiserver being down for some time. [1] https://github.com/openshift/origin/blob/master/test/extended/dr/recovery.go#L97 [2] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1205/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-etcd-recovery/1757443629380538368 [3] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1191/pull-ci-openshift-cluster-etcd-operator-release-4.15-e2e-aws-etcd-recovery/1752293248543494144
Version-Release number of selected component (if applicable):
> 4.15
How reproducible:
always by running the test
Steps to Reproduce:
Run the test: [sig-etcd][Feature:DisasterRecovery][Suite:openshift/etcd/recovery][Timeout:2h] [Feature:EtcdRecovery][Disruptive] Recover with snapshot with two unhealthy nodes and lost quorum [Serial] and observe the event invariant failing on it crash looping
Actual results:
catalogd-controller-manager crash loops and causes our CI jobs to fail
Expected results:
our e2e job is green again and catalogd-controller-manager doesn't crash loop
Additional info:
Please review the following PR: https://github.com/openshift/images/pull/155
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is failing on hypershift:
[sig-operator] an end user can use OLM can subscribe to the operator [apigroup:config.openshift.io]
Description of problem:
From a test run in [1] we can't be sure whether the call to etcd was really deadlocked or just waiting for a result. Currently the CheckingSyncWrapper only defines "alive" as a sync func that has not returned an error. This can be wrong in scenarios where a member is down and perpetually not reachable. Instead, we wanted to detect deadlock situations where the sync loop is just stuck for a prolonged period of time. [1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/
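A minimal sketch of the direction described above, assuming liveness is tied to when the sync function last returned rather than whether it returned an error; the type and field names are illustrative, not the actual cluster-etcd-operator CheckingSyncWrapper:

```go
package main

import (
	"sync"
	"time"
)

// syncWrapper tracks when the wrapped sync function last *returned*, not
// whether it returned an error. A perpetually unreachable etcd member then
// produces errors but keeps the health probe passing; only a sync that never
// returns (a real deadlock) eventually fails the probe.
type syncWrapper struct {
	mu         sync.Mutex
	lastReturn time.Time
	maxStall   time.Duration
	syncFn     func() error
}

func newSyncWrapper(syncFn func() error, maxStall time.Duration) *syncWrapper {
	return &syncWrapper{lastReturn: time.Now(), maxStall: maxStall, syncFn: syncFn}
}

func (w *syncWrapper) Sync() error {
	err := w.syncFn()
	w.mu.Lock()
	w.lastReturn = time.Now()
	w.mu.Unlock()
	return err
}

// Alive is what the liveness probe would consult.
func (w *syncWrapper) Alive() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return time.Since(w.lastReturn) < w.maxStall
}

func main() {
	w := newSyncWrapper(func() error { return nil }, 10*time.Minute)
	_ = w.Sync()
	_ = w.Alive()
}
```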
Version-Release number of selected component (if applicable):
>4.14
How reproducible:
Always
Steps to Reproduce:
1. create a healthy cluster 2. make sure one etcd member never responds, but the node is still there (ie kubelet shutdown, blocking the etcd ports on a firewall) 3. wait for the CEO to restart pod on failing health probe and dump its stack
Actual results:
CEO controllers are returning errors, but might not deadlock, which currently results in a restart
Expected results:
CEO should mark the member as unhealthy and continue its service without getting deadlocked and should not restart its pod by failing the health probe
Additional info:
highly related to OCPBUGS-30169
The story is to track the routine i18n upload/download tasks which are performed every sprint.
A.C.
Conditional risks have looser naming restrictions:
// +kubebuilder:validation:Required
// +kubebuilder:validation:MinLength=1
// +required
Name string `json:"name"`
...than condition Reason field:
// +required
// +kubebuilder:validation:Required
// +kubebuilder:validation:MaxLength=1024
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:Pattern=`^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$`
Reason string `json:"reason" protobuf:"bytes,5,opt,name=reason"`
CVO uses a risk name as a condition reason, so when the name of an applying risk is invalid as a reason, CVO starts trying to update the ClusterVersion resource with an invalid value.
4.14
always
Make the cluster consume update graph data containing a conditional edge with a risk whose name does not follow the Condition.Reason restriction, e.g. one that uses a - character. The risk needs to apply to the cluster. For example:
{ "nodes": [ {"version": "CLUSTER-BOT-VERSION", "payload": "CLUSTER-BOT-PAYLOAD"}, {"version": "4.12.22", "payload": "quay.io/openshift-release-dev/ocp-release@sha256:1111111111111111111111111111111111111111111111111111111111111111"} ], "conditionalEdges": [ { "edges": [{"from": "CLUSTER-BOT-VERSION", "to": "4.12.22"}], "risks": [ { "url": "https://github.com/openshift/api/blob/8891815aa476232109dccf6c11b8611d209445d9/vendor/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1515C4-L1520", "name": "OCPBUGS-9050", "message": "THere is no validation that risk name is a valid Condition.Reason so let's just use a - character in it.", "matchingRules": [{"type": "PromQL", "promql": { "promql": "group by (type) (cluster_infrastructure_provider)"}}] } ] } ] }
Then, observe the ClusterVersion status field after the cluster has a chance to evaluate the risk:
$ oc get clusterversion version -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="Progressing" or .type=="Available" or .type=="Failing")'
{ "lastTransitionTime": "2023-09-01T13:21:49Z", "status": "False", "type": "Available" } { "lastTransitionTime": "2023-09-01T13:21:49Z", "message": "ClusterVersion.config.openshift.io \"version\" is invalid: status.conditionalUpdates[0].conditions[0].reason: Invalid value: \"OCPBUGS-9050\": status.conditionalUpdates[0].conditions[0].reason in body should match '^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$'", "status": "True", "type": "Failing" } { "lastTransitionTime": "2023-09-01T13:14:34Z", "message": "Error ensuring the cluster version is up to date: ClusterVersion.config.openshift.io \"version\" is invalid: status.conditionalUpdates[0].conditions[0].reason: Invalid value: \"OCPBUGS-9050\": status.conditionalUpdates[0].conditions[0].reason in body should match '^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$'", "status": "False", "type": "Progressing" }
No errors, CVO continues to work and either uses some sanitized version of the name as the reason, or maybe uses something generic, like RiskApplies.
CVO does not get stuck after consuming data from external source
1. We should run CI on PRs to o/cincinnati-graph-data so we never create invalid data
2. We should sanitize the field in CVO code so that CVO never attempts to submit an invalid ClusterVersion.status.conditionalUpdates.condition.reason (see the sketch after this list)
3. We should further restrict the conditional update risk name in the CRD so it is guaranteed compatible with Condition.Reason
4. We should sanitize the field in CVO code after it is read from OSUS so that CVO never attempts to submit an invalid (after we do 3) ClusterVersion.conditionalUpdates.name
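For options 2 and 4, a minimal sketch of what such sanitization could look like, using the Condition.Reason pattern quoted above; the function name and the RiskApplies fallback are illustrative, not the actual CVO code:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// validReason is the Condition.Reason pattern from the API snippet above.
	validReason = regexp.MustCompile(`^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$`)
	badChars    = regexp.MustCompile(`[^A-Za-z0-9_,:]`)
)

// reasonFromRiskName maps an arbitrary risk name (e.g. "OCPBUGS-9050") to a
// string that satisfies the Condition.Reason pattern, falling back to a
// generic reason when nothing usable remains.
func reasonFromRiskName(name string) string {
	if validReason.MatchString(name) {
		return name
	}
	s := badChars.ReplaceAllString(name, "_")
	// The pattern requires a leading letter and a trailing [A-Za-z0-9_].
	s = strings.TrimLeft(s, "0123456789_,:")
	s = strings.TrimRight(s, ",:")
	if !validReason.MatchString(s) {
		return "RiskApplies" // generic fallback, as suggested in the expected results above
	}
	return s
}

func main() {
	fmt.Println(reasonFromRiskName("OCPBUGS-9050")) // prints OCPBUGS_9050
}
```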
Description of problem:
When bootstrap logs are collected (e.g. as part of a CI run when bootstrapping fails), they no longer contain most of the Ironic services. These used to run in standalone pods, but after a recent refactoring they are systemd services.
Description of problem:
The sha256 sum for "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.14.9/openshift-install-mac-arm64-4.14.9.tar.gz" does not match what it should be:
# sha256sum openshift-install-mac-arm64-4.14.9.tar.gz
61cccc282f39456b7db730a0625d0a04cd6c1c2ac0f945c4c15724e4e522a073 openshift-install-mac-arm64-4.14.9.tar.gz
That does not match what is posted at https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.14.9/sha256sum.txt. It should be:
c765c90a32b8a43bc62f2ba8bd59dc8e620b972bcc2a2e217c36ce139d517e29 openshift-install-mac-arm64-4.14.9.tar.gz
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
Description of problem:
There are multiple dashboards, so this page title should be "Dashboards" rather than "Dashboard". The admin console version of this page is already titled "Dashboards".
Version-Release number of selected component (if applicable):
4.16
Steps to Reproduce:
1. Open "Developer" View > Observe
Actual results:
See the tab title is "Dashboard"
Expected results:
The tab title should be "Dashboards"
Please review the following PR: https://github.com/openshift/cluster-api-operator/pull/32
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This update is compatible with recent kube-openapi changes so that we can drop our replace k8s.io/kube-openapi => k8s.io/kube-openapi v0.0.0-20230928195430-ce36a0c3bb67 introduced in 53b387f4f54c8426526478afd0fd3e2b4e7aec66.
Currently, importing the HyperShift API packages in other projects also pulls in the dependencies and conflicts of the rest of the non-API packages. This is a request to create a Go submodule containing only the API packages.
Once we cut beta and the API is stable we should move it into its own repo
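A minimal sketch of what such a submodule could look like, assuming an api/ directory with its own go.mod; the module path and pinned versions are illustrative:

```
// api/go.mod (hypothetical API-only submodule; module path and versions are illustrative)
module github.com/openshift/hypershift/api

go 1.21

require (
	k8s.io/api v0.29.0
	k8s.io/apimachinery v0.29.0
)
```

Consumers could then depend on the API module alone without pulling in the operator's full dependency graph.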
Description of problem:
The following test started to fail frequently in the periodic tests: External Storage [Driver: pd.csi.storage.gke.io] [Testpattern: Dynamic PV (block volmode)] provisioning should provision storage with pvc data source in parallel
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Sometimes, but way too often in the CI
Steps to Reproduce:
1. Run the periodic-ci-openshift-release-master-nightly-X.X-e2e-gcp-ovn-csi test
Actual results:
Provisioning of some volumes fails with time="2024-01-05T02:30:07Z" level=info msg="resulting interval message" message="{ProvisioningFailed failed to provision volume with StorageClass \"e2e-provisioning-9385-e2e-scw2z8q\": rpc error: code = Internal desc = CreateVolume failed to create single zonal disk pvc-35b558d6-60f0-40b1-9cb7-c6bdfa9f28e7: failed to insert zonal disk: unknown Insert disk operation error: rpc error: code = Internal desc = operation operation-1704421794626-60e299f9dba08-89033abf-3046917a failed (RESOURCE_OPERATION_RATE_EXCEEDED): Operation rate exceeded for resource 'projects/XXXXXXXXXXXXXXXXXXXXXXXX/zones/us-central1-a/disks/pvc-501347a5-7d6f-4a32-b0e0-cf7a896f316d'. Too frequent operations from the source resource. map[reason:ProvisioningFailed]}"
Expected results:
Test passes
Additional info:
Looks like we're hitting the API quota limits with the test
Failed test run example:
Link to Sippy:
Description of problem:
The status controller of CCO reconciles 500+ times/h on average on a resting 6-node mint-mode OCP cluster on AWS.
Steps to Reproduce:
1. Install a 6-node mint-mode OCP cluster on AWS 2. Do nothing with it and wait for a couple of hours 3. Plot the following metric in the metrics dashboard of OCP console: rate(controller_runtime_reconcile_total{controller="status"}[1h]) * 3600
Actual results:
500+ reconciles/h on a resting cluster
Expected results:
12-50 reconciles/h on a resting cluster Note: the reconcile() function always requeues after 5min so the theoretical minimum is 12 reconciles/h
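For context, the 12/h floor follows from a controller-runtime reconciler that always requeues after five minutes; a minimal sketch, not the actual CCO status controller:

```go
package main

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// statusReconciler is illustrative only: a reconciler that always requeues
// after 5 minutes runs at least 60/5 = 12 times per hour even when nothing
// changes, so sustained rates far above that point to extra watch events or
// error-driven requeues.
type statusReconciler struct{}

func (r *statusReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... recompute and write the operator status here ...
	return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}

func main() {
	_ = &statusReconciler{}
}
```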
Description of problem:
Results of the -hypershift-aws-e2e-external CI jobs do not contain an obvious reason why a test failed. For example, this TestCreateCluster is listed as failed, but all failures in TestCreateCluster look like errors from dumping the cluster after the failure.
It should show that "storage operator did not become Available=true", or even say that "pod cluster-storage-operator-6f6d69bf89-fx2d2 in the hosted control plane XYZ is in CrashLoopBackOff".
The PR under test had a simple typo leading to a crashloop, and it should be more obvious what went wrong.
Version-Release number of selected component (if applicable):
4.15.0-0.ci.test-2023-10-03-040803
Please review the following PR: https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/160
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
This issue was discovered while trying to verify a bug opened by Javipolo:
MGMT-16966
It seems that the extra-reboot avoidance is not working properly for a 4.15 cluster with a partition on the installation disk.
Here are 3 clusters:
1) test-infra-cluster-7a4cb4cc OCP 4.15 with partition on installation disk
2) test-infra-cluster-066749e2 OCP 4.14 with partition on installation disk
3) test-infra-cluster-f9051a36 OCP 4.15 without partition on installation disk
I see these indications:
1) test-infra-cluster-7a4cb4cc
2/19/2024, 9:19:53 PM Node test-infra-cluster-7a4cb4cc-worker-0 has been rebooted 1 times before completing installation
2/19/2024, 9:19:51 PM Node test-infra-cluster-7a4cb4cc-master-2 has been rebooted 2 times before completing installation
2/19/2024, 9:19:08 PM Node test-infra-cluster-7a4cb4cc-worker-1 has been rebooted 1 times before completing installation
2/19/2024, 8:49:59 PM Node test-infra-cluster-7a4cb4cc-master-1 has been rebooted 2 times before completing installation
2/19/2024, 8:49:55 PM Node test-infra-cluster-7a4cb4cc-master-0 has been rebooted 2 times before completing installation
2) test-infra-cluster-066749e2
2/19/2024, 8:32:36 PM Node test-infra-cluster-066749e2-master-2 has been rebooted 2 times before completing installation
2/19/2024, 8:32:35 PM Node test-infra-cluster-066749e2-worker-1 has been rebooted 2 times before completing installation
2/19/2024, 8:32:31 PM Node test-infra-cluster-066749e2-worker-0 has been rebooted 2 times before completing installation
2/19/2024, 8:05:26 PM Node test-infra-cluster-066749e2-master-0 has been rebooted 2 times before completing installation
2/19/2024, 8:05:25 PM Node test-infra-cluster-066749e2-master-1 has been rebooted 2 times before completing installation
3) test-infra-cluster-f9051a36
2/18/2024, 5:13:49 PM Node test-infra-cluster-f9051a36-worker-1 has been rebooted 1 times before completing installation
2/18/2024, 5:10:10 PM Node test-infra-cluster-f9051a36-worker-0 has been rebooted 1 times before completing installation
2/18/2024, 5:08:46 PM Node test-infra-cluster-f9051a36-worker-2 has been rebooted 1 times before completing installation
2/18/2024, 5:03:12 PM Node test-infra-cluster-f9051a36-master-1 has been rebooted 1 times before completing installation
2/18/2024, 4:33:39 PM Node test-infra-cluster-f9051a36-master-2 has been rebooted 1 times before completing installation
2/18/2024, 4:33:38 PM Node test-infra-cluster-f9051a36-master-0 has been rebooted 1 times before completing installation
According to Ori's analysis, it seems the MCO reboot skip didn't happen for the masters. The ignition was not accessible:
Feb 19 18:38:19 test-infra-cluster-7a4cb4cc-master-0 installer[3403]: time="2024-02-19T18:38:19Z" level=warning msg="failed getting encapsulated machine config. Continuing installation without skipping MCO reboot" error="failed after 240 attempts, last error: unexpected end of JSON input"
How reproducible:
Steps to reproduce:
1. Create a cluster with 4.15
2. Add a custom manifest which modifies the ignition to create a partition on the disk
3. Start the installation
Actual results:
It seems that the reboot avoidance did not work properly.
Expected results:
Please review the following PR: https://github.com/openshift/route-override-cni/pull/54
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Seen in 4.14 to 4.15 update CI:
: [bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should not change condition/Available
Run #0: Failed 1h34m55s
{ 1 unexpected clusteroperator state transitions during e2e test run
Nov 22 21:48:41.624 - 56ms E clusteroperator/operator-lifecycle-manager-packageserver condition/Available reason/ClusterServiceVersionNotSucceeded status/False ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: APIServiceInstallFailed, message: APIService install failed: forbidden: User "system:anonymous" cannot get path "/apis/packages.operators.coreos.com/v1"}
While a brief auth failure isn't fantastic, an issue that only persists for 56ms is not long enough to warrant immediate admin intervention. Teaching the operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is actually required. It's also possible that this is an incoming-RBAC vs. outgoing-RBAC race of some sort, and that shifting manifest filenames around could avoid the hiccup entirely.
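A minimal sketch of the kind of debouncing described above, assuming the operator recomputes availability in a periodic sync; the type name and the grace period are illustrative, not the actual OLM code:

```go
package main

import (
	"fmt"
	"time"
)

// availabilityDebouncer only reports Available=False once a failure has
// persisted past a grace period, so a 56ms blip never flips the condition
// while a genuinely broken packageserver still does.
type availabilityDebouncer struct {
	failingSince time.Time
	grace        time.Duration
}

// Observe records the latest health check and returns what the Available
// condition should say at time now.
func (d *availabilityDebouncer) Observe(healthy bool, now time.Time) bool {
	if healthy {
		d.failingSince = time.Time{}
		return true
	}
	if d.failingSince.IsZero() {
		d.failingSince = now
	}
	return now.Sub(d.failingSince) < d.grace // still "Available" inside the grace window
}

func main() {
	d := &availabilityDebouncer{grace: 2 * time.Minute}
	now := time.Now()
	fmt.Println(d.Observe(false, now))                    // true: blip tolerated
	fmt.Println(d.Observe(false, now.Add(3*time.Minute))) // false: persistent failure
}
```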
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/operator-lifecycle-manager-packageserver+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort periodic-ci-openshift-cluster-etcd-operator-release-4.15-periodics-e2e-aws-etcd-recovery (all) - 2 runs, 100% failed, 50% of failures match = 50% impact periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 8 runs, 38% failed, 33% of failures match = 13% impact periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 5 runs, 20% failed, 400% of failures match = 80% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-ppc64le (all) - 6 runs, 67% failed, 75% of failures match = 50% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 6 runs, 100% failed, 33% of failures match = 33% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-sdn-arm64 (all) - 5 runs, 20% failed, 300% of failures match = 60% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 5 runs, 40% failed, 100% of failures match = 40% impact periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 5 runs, 20% failed, 100% of failures match = 20% impact periodic-ci-openshift-release-master-ci-4.15-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 43 runs, 51% failed, 36% of failures match = 19% impact periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 5 runs, 20% failed, 300% of failures match = 60% impact periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 44% failed, 17% of failures match = 8% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 30% failed, 63% of failures match = 19% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-uwm (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 8 runs, 25% failed, 200% of failures match = 50% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 43% failed, 50% of failures match = 21% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 50 runs, 16% failed, 50% of failures match = 8% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-vsphere-ovn-upgrade 
(all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-from-stable-4.13-e2e-aws-sdn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 5 runs, 100% failed, 80% of failures match = 80% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-upgrade-rollback-oldest-supported (all) - 4 runs, 25% failed, 100% of failures match = 25% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 50 runs, 18% failed, 178% of failures match = 32% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-sdn-bm-upgrade (all) - 6 runs, 83% failed, 20% of failures match = 17% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 6 runs, 83% failed, 60% of failures match = 50% impact periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.13-e2e-aws-ovn-upgrade-paused (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 6 runs, 17% failed, 100% of failures match = 17% impact periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-sdn-bm-upgrade (all) - 5 runs, 100% failed, 40% of failures match = 40% impact periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 6 runs, 100% failed, 50% of failures match = 50% impact periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 19 runs, 63% failed, 33% of failures match = 21% impact periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 15 runs, 47% failed, 57% of failures match = 27% impact
I'm not sure if all of those are from this system:anonymous issue, or if some of them are other mechanisms. Ideally we fix all of the Available=False noise, while, again, still going Available=False when it is worth summoning an admin immediately. Checking for different reason and message strings in recent 4.15-touching update runs:
$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/operator-lifecycle-manager-packageserver.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*message: \(.*\)|\1 \2 \3|' | sort | uniq -c | sort -n
      3 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded APIService install failed: Unauthorized
      3 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded install timeout
      4 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded install strategy failed: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1.packages.operators.coreos.com": the object has been modified; please apply your changes to the latest version and try again
      9 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded apiServices not installed
     23 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded install strategy failed: could not create service packageserver-service: services "packageserver-service" already exists
     82 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded APIService install failed: forbidden: User "system:anonymous" cannot get path "/apis/packages.operators.coreos.com/v1"
Lots of hits in the above CI search. Running one of the 100% impact flavors has a good chance at reproducing.
1. Install 4.14
2. Update to 4.15
3. Keep an eye on operator-lifecycle-manager-packageserver's ClusterOperator Available.
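For step 3, one way to watch the condition during the update (an illustrative sketch, not part of the original report):
$ oc get clusteroperator operator-lifecycle-manager-packageserver -w
$ oc get clusteroperator operator-lifecycle-manager-packageserver -o jsonpath='{.status.conditions[?(@.type=="Available")]}{"\n"}'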
Available=False blips.
Available=True the whole time, or any Available=False blip corresponds to a serious issue where summoning an admin immediately would have been appropriate.
This also causes the following test cases to fail (mentioned here so Sippy can link to this bug from relevant component-readiness failures):
Description of problem:
Test case: [sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster should start and expose a secured proxy and unsecured metrics [apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
Example Z Job Link: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-ovn-remote-libvirt-s390x/1754481543524388864
Z must-gather Link: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-ovn-remote-libvirt-s390x/1754481543524388864/artifacts/ocp-e2e-ovn-remote-libvirt-s390x/gather-libvirt/artifacts/must-gather.tar
Example P Job Link: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-ovn-remote-libvirt-ppc64le/1754481543436308480
P must-gather Link: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-ovn-remote-libvirt-ppc64le/1754481543436308480/artifacts/ocp-e2e-ovn-remote-libvirt-ppc64le/gather-libvirt/artifacts/must-gather.tar
JSON body of error:
{ fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:383]: Unexpected error: <*fmt.wrapError | 0xc001d9c000>: https://thanos-querier.openshift-monitoring.svc:9091: request failed: Get "https://thanos-querier.openshift-monitoring.svc:9091": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.38.188:53: no such host { msg: "https://thanos-querier.openshift-monitoring.svc:9091: request failed: Get \"https://thanos-querier.openshift-monitoring.svc:9091\": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.38.188:53: no such host", err: <*url.Error | 0xc0020e02d0>{ Op: "Get", URL: "https://thanos-querier.openshift-monitoring.svc:9091", Err: <*net.OpError | 0xc000b8f770>{ Op: "dial", Net: "tcp", Source: nil, Addr: nil, Err: <*net.DNSError | 0xc0020df700>{ Err: "no such host", Name: "thanos-querier.openshift-monitoring.svc", Server: "172.30.38.188:53", IsTimeout: false, IsTemporary: false, IsNotFound: true, }, }, }, }
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Observe Nightlies on P and/or Z
Actual results:
Test failing
Expected results:
Test passing
Additional info:
Description of problem:
Test case failure: "OpenShift alerting rules [apigroup:image.openshift.io] should have description and summary annotations". The obtained response seems to have unmarshalling errors: "Failed to fetch alerting rules: unable to parse response invalid character 's' after object key".
Expected output: the response should be well-formed and the unmarshalling should succeed.
Openshift Version- 4.13 & 4.14
Cloud Provider/Platform- PowerVS
Prow Job Link/Must gather path- https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.13-ocp-e2e-ovn-ppc64le-powervs/1700992665824268288/artifacts/ocp-e2e-ovn-ppc64le-powervs/
Description of problem:
A regression was identified creating LoadBalancer services in ARO on new 4.14 clusters (handled for new installations in OCPBUGS-24191). The same regression has also been confirmed in ARO clusters upgraded to 4.14.
Version-Release number of selected component (if applicable):
4.14.z
How reproducible:
On any ARO cluster upgraded to 4.14.z
Steps to Reproduce:
1. Install an ARO cluster
2. Upgrade to 4.14 from fast channel
3. oc create svc loadbalancer test-lb -n default --tcp 80:8080
Actual results:
# External-IP stuck in Pending
$ oc get svc test-lb -n default
NAME      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
test-lb   LoadBalancer   172.30.104.200   <pending>     80:30062/TCP   15m

# Errors in cloud-controller-manager being unable to map VM to nodes
$ oc logs -l infrastructure.openshift.io/cloud-controller-manager=Azure -n openshift-cloud-controller-manager
I1215 19:34:51.843715 1 azure_loadbalancer.go:1533] reconcileLoadBalancer for service(default/test-lb) - wantLb(true): started
I1215 19:34:51.844474 1 event.go:307] "Event occurred" object="default/test-lb" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I1215 19:34:52.253569 1 azure_loadbalancer_repo.go:73] LoadBalancerClient.List(aro-r5iks3dh) success
I1215 19:34:52.253632 1 azure_loadbalancer.go:1557] reconcileLoadBalancer for service(default/test-lb): lb(aro-r5iks3dh/mabad-test-74km6) wantLb(true) resolved load balancer name
I1215 19:34:52.528579 1 azure_vmssflex_cache.go:162] Could not find node () in the existing cache. Forcely freshing the cache to check again...
E1215 19:34:52.714678 1 azure_vmssflex.go:379] fs.GetNodeNameByIPConfigurationID(/subscriptions/fe16a035-e540-4ab7-80d9-373fa9a3d6ae/resourceGroups/aro-r5iks3dh/providers/Microsoft.Network/networkInterfaces/mabad-test-74km6-master0-nic/ipConfigurations/pipConfig) failed. Error: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0
E1215 19:34:52.714888 1 azure_loadbalancer.go:126] reconcileLoadBalancer(default/test-lb) failed: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0
I1215 19:34:52.714956 1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=0.871261893 request="services_ensure_loadbalancer" resource_group="aro-r5iks3dh" subscription_id="fe16a035-e540-4ab7-80d9-373fa9a3d6ae" source="default/test-lb" result_code="failed_ensure_loadbalancer"
E1215 19:34:52.715005 1 controller.go:291] error processing service default/test-lb (will retry): failed to ensure load balancer: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0
Expected results:
# The LoadBalancer gets an External-IP assigned
$ oc get svc test-lb -n default
NAME      TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)        AGE
test-lb   LoadBalancer   172.30.193.159   20.242.180.199   80:31475/TCP   14s
Additional info:
In the cloud-provider-config ConfigMap in the openshift-config namespace, vmType="". When vmType is explicitly changed to "standard", the provisioning of the LoadBalancer completes and an External-IP gets assigned without errors.
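As a rough sketch of the workaround described above (the key layout inside the embedded cloud config is an assumption, and on ARO this ConfigMap may be reconciled back by the managed service):
$ oc -n openshift-config edit configmap cloud-provider-config
# in the embedded Azure cloud config, set "vmType": "standard",
# then let cloud-controller-manager reconcile the Service again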
Description of problem:
During automation test execution for dev-console package it is observed that cypress intensely fails the ongoing test due to "uncaught:exception : ResizeObserver limit exceed", but there is no visible failure from UI.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info: Screenshot
On https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-gcp-ovn-upgrade-4.15-micro-release-openshift-release-analysis-aggregator/1756168710529224704 we failed three netpol tests on just one result, a failure, with no successes. However, the other 9 jobs seemed to run fine.
[sig-network] Netpol NetworkPolicy between server and client should allow ingress access from namespace on one named port [Feature:NetworkPolicy] [Skipped:Network/OVNKubernetes] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Netpol NetworkPolicy between server and client should allow egress access on one named port [Feature:NetworkPolicy] [Skipped:Network/OVNKubernetes] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Netpol NetworkPolicy between server and client should allow ingress access on one named port [Feature:NetworkPolicy] [Skipped:Network/OVNKubernetes] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]
Something seems funky with these tests; they're barely running and don't seem to consistently report results. 4.15 has just 3 runs in the last two weeks; 4.16 has just 1, but it passed.
Whatever's going on, it's capable of taking out a payload, though it's not happening 100% of the time. https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.ci/release/4.15.0-0.ci-2024-02-10-040857
Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/150
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
CNV upgrades from v4.14.1 to v4.15.0 (unreleased) are not starting due to an out-of-sync OperatorCondition.
We see:
$ oc get csv
NAME                                       DISPLAY                    VERSION   REPLACES                                   PHASE
kubevirt-hyperconverged-operator.v4.14.1   OpenShift Virtualization   4.14.1    kubevirt-hyperconverged-operator.v4.14.0   Replacing
kubevirt-hyperconverged-operator.v4.15.0   OpenShift Virtualization   4.15.0    kubevirt-hyperconverged-operator.v4.14.1   Pending
And on the v4.15.0 CSV:
$ oc get csv kubevirt-hyperconverged-operator.v4.15.0 -o yaml .... status: cleanup: {} conditions: - lastTransitionTime: "2023-12-19T01:50:48Z" lastUpdateTime: "2023-12-19T01:50:48Z" message: requirements not yet checked phase: Pending reason: RequirementsUnknown - lastTransitionTime: "2023-12-19T01:50:48Z" lastUpdateTime: "2023-12-19T01:50:48Z" message: 'operator is not upgradeable: the operatorcondition status "Upgradeable"="True" is outdated' phase: Pending reason: OperatorConditionNotUpgradeable lastTransitionTime: "2023-12-19T01:50:48Z" lastUpdateTime: "2023-12-19T01:50:48Z" message: 'operator is not upgradeable: the operatorcondition status "Upgradeable"="True" is outdated' phase: Pending reason: OperatorConditionNotUpgradeable
and if we check the pending operator condition (v4.14.1) we see:
$ oc get operatorcondition kubevirt-hyperconverged-operator.v4.14.1 -o yaml apiVersion: operators.coreos.com/v2 kind: OperatorCondition metadata: creationTimestamp: "2023-12-16T17:10:17Z" generation: 18 labels: operators.coreos.com/kubevirt-hyperconverged.openshift-cnv: "" name: kubevirt-hyperconverged-operator.v4.14.1 namespace: openshift-cnv ownerReferences: - apiVersion: operators.coreos.com/v1alpha1 blockOwnerDeletion: false controller: true kind: ClusterServiceVersion name: kubevirt-hyperconverged-operator.v4.14.1 uid: 7db79d4b-e69e-4af8-9335-6269cf004440 resourceVersion: "4116127" uid: 347306c9-865a-42b8-b2c9-69192b0e350a spec: conditions: - lastTransitionTime: "2023-12-18T18:47:23Z" message: "" reason: Upgradeable status: "True" type: Upgradeable deployments: - hco-operator - hco-webhook - hyperconverged-cluster-cli-download - cluster-network-addons-operator - virt-operator - ssp-operator - cdi-operator - hostpath-provisioner-operator - mtq-operator serviceAccounts: - hyperconverged-cluster-operator - cluster-network-addons-operator - kubevirt-operator - ssp-operator - cdi-operator - hostpath-provisioner-operator - mtq-operator - cluster-network-addons-operator - kubevirt-operator - ssp-operator - cdi-operator - hostpath-provisioner-operator - mtq-operator status: conditions: - lastTransitionTime: "2023-12-18T09:41:06Z" message: "" observedGeneration: 11 reason: Upgradeable status: "True" type: Upgradeable
where metadata.generation (18) is not in sync with status.conditions[*].observedGeneration (11).
Even manually editing spec.conditions.lastTransitionTime causes a change in metadata.generation (as expected), but this doesn't trigger any reconciliation in OLM, so status.conditions[*].observedGeneration remains at 11.
$ oc get operatorcondition kubevirt-hyperconverged-operator.v4.14.1 -o yaml apiVersion: operators.coreos.com/v2 kind: OperatorCondition metadata: creationTimestamp: "2023-12-16T17:10:17Z" generation: 19 labels: operators.coreos.com/kubevirt-hyperconverged.openshift-cnv: "" name: kubevirt-hyperconverged-operator.v4.14.1 namespace: openshift-cnv ownerReferences: - apiVersion: operators.coreos.com/v1alpha1 blockOwnerDeletion: false controller: true kind: ClusterServiceVersion name: kubevirt-hyperconverged-operator.v4.14.1 uid: 7db79d4b-e69e-4af8-9335-6269cf004440 resourceVersion: "4147472" uid: 347306c9-865a-42b8-b2c9-69192b0e350a spec: conditions: - lastTransitionTime: "2023-12-18T18:47:25Z" message: "" reason: Upgradeable status: "True" type: Upgradeable deployments: - hco-operator - hco-webhook - hyperconverged-cluster-cli-download - cluster-network-addons-operator - virt-operator - ssp-operator - cdi-operator - hostpath-provisioner-operator - mtq-operator serviceAccounts: - hyperconverged-cluster-operator - cluster-network-addons-operator - kubevirt-operator - ssp-operator - cdi-operator - hostpath-provisioner-operator - mtq-operator - cluster-network-addons-operator - kubevirt-operator - ssp-operator - cdi-operator - hostpath-provisioner-operator - mtq-operator status: conditions: - lastTransitionTime: "2023-12-18T09:41:06Z" message: "" observedGeneration: 11 reason: Upgradeable status: "True" type: Upgradeable
since its observedGeneration is out of sync, this check:
https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/olm/operatorconditions.go#L44C1-L48
fails and the upgrade never starts.
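A quick way to see the mismatch that the linked check trips over (illustrative command, not from the original report; the numbers match the YAML captured above):
$ oc -n openshift-cnv get operatorcondition kubevirt-hyperconverged-operator.v4.14.1 \
    -o jsonpath='generation={.metadata.generation} observedGeneration={.status.conditions[?(@.type=="Upgradeable")].observedGeneration}{"\n"}'
generation=18 observedGeneration=11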
I suspect (I'm only guessing) that it could be a regression introduced with the memory optimization for https://issues.redhat.com/browse/OCPBUGS-17157 .
Version-Release number of selected component (if applicable):
OCP 4.15.0-ec.3
How reproducible:
- Not reproducible (with the same CNV bundles) on OCP v4.14.z.
- Pretty high (but not 100%) on OCP 4.15.0-ec.3
Steps to Reproduce:
1. Try triggering a CNV v4.14.1 -> v4.15.0 on OCP 4.15.0-ec.3 2. 3.
Actual results:
OLM is not reacting to changes to spec.conditions on the pending OperatorCondition, so metadata.generation is constantly out of sync with status.conditions[*].observedGeneration, and the CSV is reported with message: 'operator is not upgradeable: the operatorcondition status "Upgradeable"="True" is outdated', phase: Pending, reason: OperatorConditionNotUpgradeable.
Expected results:
OLM correctly reconciles the OperatorCondition and the upgrade starts.
Additional info:
Not reproducible with exactly the same bundle (origin and target) on OCP v4.14.z
Description of problem:
The test implementation in https://github.com/openshift/origin/commit/5487414d8f5652c301a00617ee18e5ca8f339cb4#L56 assumes there is just one kubelet service, or at least that it is always the first one in the MCP. This just changed in https://github.com/openshift/machine-config-operator/pull/4124, so the test is failing.
Version-Release number of selected component (if applicable):
master branch of 4.16
How reproducible:
always during test
Steps to Reproduce:
1. Test with https://github.com/openshift/machine-config-operator/pull/4124 applied
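One way to see why the "first kubelet service" assumption breaks (an illustrative sketch; the rendered config name is a placeholder) is to list the systemd units that mention kubelet in the rendered MachineConfig:
$ oc get mcp worker -o jsonpath='{.spec.configuration.name}{"\n"}'
$ oc get mc <rendered-worker-name> -o jsonpath='{range .spec.config.systemd.units[*]}{.name}{"\n"}{end}' | grep -i kubelet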
Actual results:
Test detects a wrong service and fails
Expected results:
Test finds the proper kubelet.service and passes
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
The command does not honor Windows path separators.
Related to https://issues.redhat.com//browse/OCPBUGS-28864 (access restricted and not publicly visible). This report serves as a target issue for the fix and its backport to older OCP versions. Please see more details in https://issues.redhat.com//browse/OCPBUGS-28864.
Description of problem:
I've noticed that 'agent-cluster-install.yaml' and 'journal.export' from the agent gather process contain passwords. It's important not to expose password information in any of these generated files.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Generate an agent ISO by utilising agent-config and install-config, including platform credentials
2. Boot the ISO that was created
3. Run the agent-gather command on the node 0 machine to generate files.
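An illustrative way to confirm the leak after extracting the gathered archive (file names taken from the report; the extraction directory is a placeholder):
$ grep -il password <agent-gather-dir>/agent-cluster-install.yaml <agent-gather-dir>/journal.export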
Actual results:
The 'agent-cluster-install.yaml' and 'journal.export' files contain the password information.
Expected results:
Password should be redacted.
Additional info:
Description of problem:
On an IPv6-primary dual-stack cluster, creating an IPv6 egressIP following this procedure
is not working. ovnkube-cluster-manager shows the error below:
2024-01-16T14:48:18.156140746Z I0116 14:48:18.156053 1 obj_retry.go:358] Adding new object: *v1.EgressIP egress-dualstack-ipv6
2024-01-16T14:48:18.161367817Z I0116 14:48:18.161269 1 obj_retry.go:370] Retry add failed for *v1.EgressIP egress-dualstack-ipv6, will try again later: cloud add request failed for CloudPrivateIPConfig: fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]
2024-01-16T14:48:18.161416023Z I0116 14:48:18.161357 1 event.go:298] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egress-dualstack-ipv6", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'CloudAssignmentFailed' egress IP: fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333 for object EgressIP: egress-dualstack-ipv6 could not be created, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]
2024-01-16T14:49:37.714410622Z I0116 14:49:37.714342 1 reflector.go:790] k8s.io/client-go/informers/factory.go:159: Watch close - *v1.Service total 8 items received
2024-01-16T14:49:48.155826915Z I0116 14:49:48.155330 1 obj_retry.go:296] Retry object setup: *v1.EgressIP egress-dualstack-ipv6
2024-01-16T14:49:48.156172766Z I0116 14:49:48.155899 1 obj_retry.go:358] Adding new object: *v1.EgressIP egress-dualstack-ipv6
2024-01-16T14:49:48.168795734Z I0116 14:49:48.168520 1 obj_retry.go:370] Retry add failed for *v1.EgressIP egress-dualstack-ipv6, will try again later: cloud add request failed for CloudPrivateIPConfig: fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]
2024-01-16T14:49:48.169400971Z I0116 14:49:48.168937 1 event.go:298] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egress-dualstack-ipv6", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'CloudAssignmentFailed' egress IP: fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333 for object EgressIP: egress-dualstack-ipv6 could not be created, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]
Same is observed with ipv6 subnet on slaac mode.
Version-Release number of selected component (if applicable):
How reproducible: Always.
Steps to Reproduce:
Applying below:
$ oc label node/ostest-8zrlf-worker-0-4h78l k8s.ovn.org/egress-assignable=""
$ cat egressip_ipv4.yaml && cat egressip_ipv6.yaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-dualstack-ipv4
spec:
  egressIPs:
  - 192.168.192.111
  namespaceSelector:
    matchLabels:
      app: egress
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-dualstack-ipv6
spec:
  egressIPs:
  - fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333
  namespaceSelector:
    matchLabels:
      app: egress
$ oc apply -f egressip_ipv4.yaml
$ oc apply -f egressip_ipv6.yaml
But it only shows info about the IPv4 egressIP. The IPv6 port is not even created in OpenStack:
oc logs -n openshift-cloud-network-config-controller cloud-network-config-controller-67cbc4bc84-786jm
I0116 13:15:48.914323 1 controller.go:182] Assigning key: 192.168.192.111 to cloud-private-ip-config workqueue
I0116 13:15:48.928927 1 cloudprivateipconfig_controller.go:357] CloudPrivateIPConfig: "192.168.192.111" will be added to node: "ostest-8zrlf-worker-0-4h78l"
I0116 13:15:48.942260 1 cloudprivateipconfig_controller.go:381] Adding finalizer to CloudPrivateIPConfig: "192.168.192.111"
I0116 13:15:48.943718 1 controller.go:182] Assigning key: 192.168.192.111 to cloud-private-ip-config workqueue
I0116 13:15:49.758484 1 openstack.go:760] Getting port lock for portID 8854b2e9-3139-49d2-82dd-ee576b0a0cce and IP 192.168.192.111
I0116 13:15:50.547268 1 cloudprivateipconfig_controller.go:439] Added IP address to node: "ostest-8zrlf-worker-0-4h78l" for CloudPrivateIPConfig: "192.168.192.111"
I0116 13:15:50.602277 1 controller.go:160] Dropping key '192.168.192.111' from the cloud-private-ip-config workqueue
I0116 13:15:50.614413 1 controller.go:160] Dropping key '192.168.192.111' from the cloud-private-ip-config workqueue

$ openstack port list --network network-dualstack | grep -e 192.168.192.111 -e 6f44:5dd8:c956:f816:3eff:fef0:3333
| 30fe8d9a-c1c6-46c3-a873-9a02e1943cb7 | egressip-192.168.192.111 | fa:16:3e:3c:23:2a | ip_address='192.168.192.111', subnet_id='ae8a4c1f-d3e4-4ea2-bc14-ef1f6f5d0bbe' | DOWN |
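Listing the CloudPrivateIPConfig objects is another easy way to confirm that only the IPv4 address was ever created (illustrative command, not from the original report):
$ oc get cloudprivateipconfig
# only the IPv4 CloudPrivateIPConfig (192.168.192.111) is expected to show up here;
# the IPv6 one is rejected by the metadata.name validation quoted above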
Actual results: ipv6 egressIP object is ignored.
Expected results: ipv6 egressIP is created and can be attached to a pod.
Additional info: must-gather linked in private comment.
The customer is asking for this flag so they can keep using the v1 code even when v2 becomes the default.
Description of the problem:
LVMS multi node
requires an additional disk for the operator.
However, I was able to create a 4.15 multi-node cluster, select LVMS, and without adding the additional disk I see that the LVM requirement passes and I am able to continue and start the installation.
How reproducible:
Steps to reproduce:
1. Create a 4.15 multi-node cluster
2. Select the LVMS operator
3. Do not attach an additional disk
Actual results:
It is possible to continue to the installation page and start the installation.
The LVM requirement is marked as successful.
Expected results:
The LVM requirement should show as failed.
It should not be possible to proceed to installation before attaching the disk.
Please review the following PR: https://github.com/openshift/node_exporter/pull/140
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Adding image configuration for a HyperShift hosted cluster is not working as expected.
Version-Release number of selected component (if applicable):
# oc get clusterversions.config.openshift.io
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-rc.8   True        False         6h46m   Cluster version is 4.13.0-rc.8
How reproducible:
Always
Steps to Reproduce:
1. Get hypershift hosted cluster detail from management cluster.
# hostedcluster=$( oc get -n clusters hostedclusters -o json | jq -r '.items[].metadata.name')
2. Apply image setting for hypershift hosted cluster.
# oc patch hc/$hostedcluster -p '{"spec":{"configuration":{"image":{"registrySources":{"allowedRegistries":["quay.io","registry.redhat.io","image-registry.openshift-image-registry.svc:5000","insecure.com"],"insecureRegistries":["insecure.com"]}}}}}' --type=merge -n clusters
hostedcluster.hypershift.openshift.io/85ea85757a5a14355124 patched
# oc get HostedCluster $hostedcluster -n clusters -ojson | jq .spec.configuration.image
{
  "registrySources": {
    "allowedRegistries": [
      "quay.io",
      "registry.redhat.io",
      "image-registry.openshift-image-registry.svc:5000",
      "insecure.com"
    ],
    "insecureRegistries": [
      "insecure.com"
    ]
  }
}
3. Check Pod or operator restart to apply configuration changes.
# oc get pods -l app=kube-apiserver -n clusters-${hostedcluster}
NAME                              READY   STATUS    RESTARTS   AGE
kube-apiserver-67b6d4556b-9nk8s   5/5     Running   0          49m
kube-apiserver-67b6d4556b-v4fnj   5/5     Running   0          47m
kube-apiserver-67b6d4556b-zldpr   5/5     Running   0          51m
# oc get pods -l app=kube-apiserver -n clusters-${hostedcluster} -l app=openshift-apiserver
NAME                                   READY   STATUS    RESTARTS   AGE
openshift-apiserver-7c69d68f45-4xj8c   3/3     Running   0          136m
openshift-apiserver-7c69d68f45-dfmk9   3/3     Running   0          135m
openshift-apiserver-7c69d68f45-r7dqn   3/3     Running   0          136m
4. Check image.config in hosted cluster.
# oc get image.config -o yaml
...
spec:
  allowedRegistriesForImport: []
status:
  externalRegistryHostnames:
  - default-route-openshift-image-registry.apps.hypershift-ci-32506.qe.devcluster.openshift.com
  internalRegistryHostname: image-registry.openshift-image-registry.svc:5000
# oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-128-61.us-east-2.compute.internal    Ready    worker   6h42m   v1.26.3+b404935
ip-10-0-130-68.us-east-2.compute.internal    Ready    worker   6h42m   v1.26.3+b404935
ip-10-0-134-89.us-east-2.compute.internal    Ready    worker   6h42m   v1.26.3+b404935
ip-10-0-138-169.us-east-2.compute.internal   Ready    worker   6h42m   v1.26.3+b404935
# oc debug node/ip-10-0-128-61.us-east-2.compute.internal
Temporary namespace openshift-debug-mtfcw is created for debugging node...
Starting pod/ip-10-0-128-61us-east-2computeinternal-debug-mctvr ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.61
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# cat /etc/containers/registries.conf
unqualified-search-registries = ["registry.access.redhat.com", "docker.io"]
short-name-mode = ""
[[registry]]
prefix = ""
location = "registry-proxy.engineering.redhat.com"
[[registry.mirror]]
location = "brew.registry.redhat.io"
pull-from-mirror = "digest-only"
[[registry]]
prefix = ""
location = "registry.redhat.io"
[[registry.mirror]]
location = "brew.registry.redhat.io"
pull-from-mirror = "digest-only"
[[registry]]
prefix = ""
location = "registry.stage.redhat.io"
[[registry.mirror]]
location = "brew.registry.redhat.io"
pull-from-mirror = "digest-only"
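For step 4, an illustrative way to check from the hosted cluster whether the registry settings were propagated (the kubeconfig path is a placeholder):
$ oc --kubeconfig <hosted-cluster-kubeconfig> get image.config.openshift.io/cluster -o jsonpath='{.spec.registrySources}{"\n"}'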
Actual results:
Config changes are not applied in the backend. Neither the operator nor the pods restart.
Expected results:
Configuration should be applied, and the pods and operator should restart after config changes.
Additional info:
Please review the following PR: https://github.com/openshift/baremetal-operator/pull/327
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/54
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/246
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Build02, a years old cluster currently running 4.15.0-ec.2 with TechPreviewNoUpgrade, has been Available=False for days:
$ oc get -o json clusteroperator monitoring | jq '.status.conditions[] | select(.type == "Available")'
{
  "lastTransitionTime": "2024-01-14T04:09:52Z",
  "message": "UpdatingMetricsServer: reconciling MetricsServer Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/metrics-server: context deadline exceeded",
  "reason": "UpdatingMetricsServerFailed",
  "status": "False",
  "type": "Available"
}
Both pods had been having CA trust issues. We deleted one pod, and its replacement is happy:
$ oc -n openshift-monitoring get -l app.kubernetes.io/component=metrics-server pods
NAME                             READY   STATUS    RESTARTS   AGE
metrics-server-9cc8bfd56-dd5tx   1/1     Running   0          136m
metrics-server-9cc8bfd56-k2lpv   0/1     Running   0          36d
The young, happy pod has occasional node-removed noise, which is expected in this cluster with high levels of compute-node autoscaling:
$ oc -n openshift-monitoring logs --tail 3 metrics-server-9cc8bfd56-dd5tx
E0117 17:16:13.492646 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.33:10250/metrics/resource\": dial tcp 10.0.32.33:10250: connect: connection refused" node="build0-gstfj-ci-builds-worker-b-srjk5"
E0117 17:16:28.611052 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.33:10250/metrics/resource\": dial tcp 10.0.32.33:10250: connect: connection refused" node="build0-gstfj-ci-builds-worker-b-srjk5"
E0117 17:16:56.898453 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.33:10250/metrics/resource\": context deadline exceeded" node="build0-gstfj-ci-builds-worker-b-srjk5"
While the old, sad pod is complaining about unknown authorities:
$ oc -n openshift-monitoring logs --tail 3 metrics-server-9cc8bfd56-k2lpv
E0117 17:19:09.612161 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.0.3:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate signed by unknown authority" node="build0-gstfj-m-2.c.openshift-ci-build-farm.internal"
E0117 17:19:09.620872 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.90:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate signed by unknown authority" node="build0-gstfj-ci-prowjobs-worker-b-cg7qd"
I0117 17:19:14.538837 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
More details in the Additional details section, but the timeline seems to have been something like:
So addressing the metrics-server /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt change detection should resolve this use-case. And triggering a container or pod restart would be an aggressive-but-sufficient mechanism, although loading the new data without rolling the process would be less invasive.
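As a sketch of the "aggressive-but-sufficient" option mentioned above (and matching the manual pod deletion that already fixed one replica), forcing the pods to restart makes them re-read the mounted CA bundle; note the monitoring operator may subsequently reconcile the Deployment:
$ oc -n openshift-monitoring rollout restart deployment/metrics-server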
4.15.0-ec.3, which has fast CA rotation, see discussion in API-1687.
Unclear.
Unclear.
metrics-server pods having trouble with CA trust when attempting to scrape nodes.
metrics-server pods successfully trusting kubelets when scraping nodes.
The monitoring operator sets up the metrics server with --kubelet-certificate-authority=/etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt, which is the "Path to the CA to use to validate the Kubelet's serving certificates" and is mounted from the kubelet-serving-ca-bundle ConfigMap. But that mount point only contains openshift-kube-controller-manager-operator_csr-signer-signer@... CAs:
$ oc --as system:admin -n openshift-monitoring debug pod/metrics-server-9cc8bfd56-k2lpv -- cat /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt | while openssl x509 -noout -text; do :; done | grep '^Certificate:\|Issuer\|Subject:\|Not ' Starting pod/metrics-server-9cc8bfd56-k2lpv-debug-gtctn ... Removing debug pod ... Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Not Before: Dec 3 14:42:33 2023 GMT Not After : Feb 1 14:42:34 2024 GMT Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Not Before: Dec 20 03:16:35 2023 GMT Not After : Jan 19 03:16:36 2024 GMT Subject: CN = kube-csr-signer_@1703042196 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 Not Before: Jan 4 03:16:35 2024 GMT Not After : Feb 3 03:16:36 2024 GMT Subject: CN = kube-csr-signer_@1704338196 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 Not Before: Jan 2 14:42:34 2024 GMT Not After : Mar 2 14:42:35 2024 GMT Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 unable to load certificate 137730753918272:error:0909006C:PEM routines:get_name:no start line:../crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE
While actual kubelets seem to be using certs signed by kube-csr-signer_@1704338196 (which is one of the Subjects in /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt):
$ oc get -o wide -l node-role.kubernetes.io/master= nodes NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME build0-gstfj-m-0.c.openshift-ci-build-farm.internal Ready master 3y240d v1.28.3+20a5764 10.0.0.4 <none> Red Hat Enterprise Linux CoreOS 415.92.202311271112-0 (Plow) 5.14.0-284.41.1.el9_2.x86_64 cri-o://1.28.2-2.rhaos4.15.gite7be4e1.el9 build0-gstfj-m-1.c.openshift-ci-build-farm.internal Ready master 3y240d v1.28.3+20a5764 10.0.0.5 <none> Red Hat Enterprise Linux CoreOS 415.92.202311271112-0 (Plow) 5.14.0-284.41.1.el9_2.x86_64 cri-o://1.28.2-2.rhaos4.15.gite7be4e1.el9 build0-gstfj-m-2.c.openshift-ci-build-farm.internal Ready master 3y240d v1.28.3+20a5764 10.0.0.3 <none> Red Hat Enterprise Linux CoreOS 415.92.202311271112-0 (Plow) 5.14.0-284.41.1.el9_2.x86_64 cri-o://1.28.2-2.rhaos4.15.gite7be4e1.el9 $ oc --as system:admin -n openshift-monitoring debug pod/metrics-server-9cc8bfd56-k2lpv -- openssl s_client -connect 10.0.0.3:10250 -showcerts </dev/null Starting pod/metrics-server-9cc8bfd56-k2lpv-debug-ksl2k ... Can't use SSL_get_servername depth=0 O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal verify error:num=20:unable to get local issuer certificate verify return:1 depth=0 O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal verify error:num=21:unable to verify the first certificate verify return:1 depth=0 O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal verify return:1 CONNECTED(00000003) --- Certificate chain 0 s:O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal i:CN = kube-csr-signer_@1704338196 -----BEGIN CERTIFICATE----- MIIC5DCCAcygAwIBAgIQAbKVl+GS6s2H20EHAWl4WzANBgkqhkiG9w0BAQsFADAm MSQwIgYDVQQDDBtrdWJlLWNzci1zaWduZXJfQDE3MDQzMzgxOTYwHhcNMjQwMTE3 MDMxNDMwWhcNMjQwMjAzMDMxNjM2WjBhMRUwEwYDVQQKEwxzeXN0ZW06bm9kZXMx SDBGBgNVBAMTP3N5c3RlbTpub2RlOmJ1aWxkMC1nc3Rmai1tLTIuYy5vcGVuc2hp ZnQtY2ktYnVpbGQtZmFybS5pbnRlcm5hbDBZMBMGByqGSM49AgEGCCqGSM49AwEH A0IABFqT+UgohFAxJrGYQUeYsEhNB+ufFo14xYDedKBCeNzMhaC+5/I4UN1e1u2X PH7J4ncmH+M/LXI7v+YfEIG7cH+jgZ0wgZowDgYDVR0PAQH/BAQDAgeAMBMGA1Ud JQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwHwYDVR0jBBgwFoAU394ABuS2 9i0qss9AKk/mQ9lhJ88wRAYDVR0RBD0wO4IzYnVpbGQwLWdzdGZqLW0tMi5jLm9w ZW5zaGlmdC1jaS1idWlsZC1mYXJtLmludGVybmFshwQKAAADMA0GCSqGSIb3DQEB CwUAA4IBAQCiKelqlgK0OHFqDPdIR+RRdjXoCfFDa0JGCG0z60LYJV6Of5EPv0F/ vGZdM/TyGnPT80lnLCh2JGUvneWlzQEZ7LEOgXX8OrAobijiFqDZFlvVwvkwWNON rfucLQWDFLHUf/yY0EfB0ZlM8Sz4XE8PYB6BXYvgmUIXS1qkV9eGWa6RPLsOnkkb q/dTLE/tg8cz24IooDC8lmMt/wCBPgsq9AnORgNdZUdjCdh9DpDWCw0E4csSxlx2 H1qlH5TpTGKS8Ox9JAfdAU05p/mEhY9PEPSMfdvBZep1xazrZyQIN9ckR2+11Syw JlbEJmapdSjIzuuKBakqHkDgoq4XN0KM -----END CERTIFICATE----- --- Server certificate subject=O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal issuer=CN = kube-csr-signer_@1704338196 --- Acceptable client certificate CA names OU = openshift, CN = admin-kubeconfig-signer CN = openshift-kube-controller-manager-operator_csr-signer-signer@1699022534 CN = kube-csr-signer_@1700450189 CN = kube-csr-signer_@1701746196 CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 CN = openshift-kube-apiserver-operator_kube-apiserver-to-kubelet-signer@1691004449 CN = openshift-kube-apiserver-operator_kube-control-plane-signer@1702234292 CN = openshift-kube-apiserver-operator_kube-control-plane-signer@1699642292 OU = openshift, CN = kubelet-bootstrap-kubeconfig-signer CN = 
openshift-kube-apiserver-operator_node-system-admin-signer@1678905372 Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512:RSA+SHA1:ECDSA+SHA1 Shared Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512 Peer signing digest: SHA256 Peer signature type: ECDSA Server Temp Key: X25519, 253 bits --- SSL handshake has read 1902 bytes and written 383 bytes Verification error: unable to verify the first certificate --- New, TLSv1.3, Cipher is TLS_AES_128_GCM_SHA256 Server public key is 256 bit Secure Renegotiation IS NOT supported Compression: NONE Expansion: NONE No ALPN negotiated Early data was not sent Verify return code: 21 (unable to verify the first certificate) --- DONE Removing debug pod ... $ openssl x509 -noout -text <<EOF 2>/dev/null > -----BEGIN CERTIFICATE----- MIIC5DCCAcygAwIBAgIQAbKVl+GS6s2H20EHAWl4WzANBgkqhkiG9w0BAQsFADAm MSQwIgYDVQQDDBtrdWJlLWNzci1zaWduZXJfQDE3MDQzMzgxOTYwHhcNMjQwMTE3 MDMxNDMwWhcNMjQwMjAzMDMxNjM2WjBhMRUwEwYDVQQKEwxzeXN0ZW06bm9kZXMx SDBGBgNVBAMTP3N5c3RlbTpub2RlOmJ1aWxkMC1nc3Rmai1tLTIuYy5vcGVuc2hp ZnQtY2ktYnVpbGQtZmFybS5pbnRlcm5hbDBZMBMGByqGSM49AgEGCCqGSM49AwEH A0IABFqT+UgohFAxJrGYQUeYsEhNB+ufFo14xYDedKBCeNzMhaC+5/I4UN1e1u2X PH7J4ncmH+M/LXI7v+YfEIG7cH+jgZ0wgZowDgYDVR0PAQH/BAQDAgeAMBMGA1Ud JQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwHwYDVR0jBBgwFoAU394ABuS2 9i0qss9AKk/mQ9lhJ88wRAYDVR0RBD0wO4IzYnVpbGQwLWdzdGZqLW0tMi5jLm9w ZW5zaGlmdC1jaS1idWlsZC1mYXJtLmludGVybmFshwQKAAADMA0GCSqGSIb3DQEB CwUAA4IBAQCiKelqlgK0OHFqDPdIR+RRdjXoCfFDa0JGCG0z60LYJV6Of5EPv0F/ vGZdM/TyGnPT80lnLCh2JGUvneWlzQEZ7LEOgXX8OrAobijiFqDZFlvVwvkwWNON rfucLQWDFLHUf/yY0EfB0ZlM8Sz4XE8PYB6BXYvgmUIXS1qkV9eGWa6RPLsOnkkb q/dTLE/tg8cz24IooDC8lmMt/wCBPgsq9AnORgNdZUdjCdh9DpDWCw0E4csSxlx2 H1qlH5TpTGKS8Ox9JAfdAU05p/mEhY9PEPSMfdvBZep1xazrZyQIN9ckR2+11Syw JlbEJmapdSjIzuuKBakqHkDgoq4XN0KM -----END CERTIFICATE----- > EOF ... Issuer: CN = kube-csr-signer_@1704338196 Validity Not Before: Jan 17 03:14:30 2024 GMT Not After : Feb 3 03:16:36 2024 GMT Subject: O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal ...
The monitoring operator populates the openshift-monitoring kubelet-serving-ca-bundle ConfigMap using data from the openshift-config-managed kubelet-serving-ca ConfigMap, and that propagation is working, but it does not contain the kube-csr-signer_ CA:
$ oc -n openshift-config-managed get -o json configmap kubelet-serving-ca | jq -r '.data["ca-bundle.crt"]' | while openssl x509 -noout -text; do :; done | grep '^Certificate:\|Issuer\|Subject:\|Not ' Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Not Before: Dec 3 14:42:33 2023 GMT Not After : Feb 1 14:42:34 2024 GMT Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Not Before: Dec 20 03:16:35 2023 GMT Not After : Jan 19 03:16:36 2024 GMT Subject: CN = kube-csr-signer_@1703042196 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 Not Before: Jan 4 03:16:35 2024 GMT Not After : Feb 3 03:16:36 2024 GMT Subject: CN = kube-csr-signer_@1704338196 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 Not Before: Jan 2 14:42:34 2024 GMT Not After : Mar 2 14:42:35 2024 GMT Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 unable to load certificate 140531510617408:error:0909006C:PEM routines:get_name:no start line:../crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE $ oc -n openshift-config-managed get -o json configmap kubelet-serving-ca | jq -r '.data["ca-bundle.crt"]' | sha1sum a32ab44dff8030c548087d70fea599b0d3fab8af - $ oc -n openshift-monitoring get -o json configmap kubelet-serving-ca-bundle | jq -r '.data["ca-bundle.crt"]' | sha1sum a32ab44dff8030c548087d70fea599b0d3fab8af -
Flipping over to the kubelet side, nothing in the machine-config operator's template is jumping out at me as a key/cert pair for serving on 10250. The kubelet seems to set up server certs via serverTLSBootstrap: true. But we don't seem to set the beta RotateKubeletServerCertificate, so I'm not clear on how these are supposed to rotate on the kubelet side. But there are CSRs from kubelets requesting serving certs:
$ oc get certificatesigningrequests | grep 'NAME\|kubelet-serving' NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION csr-8stgd 51m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-xkdw2 <none> Approved,Issued csr-blbjx 9m1s kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-longtests-worker-b-5w9dz <none> Approved,Issued csr-ghxh5 64m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-sdwdn <none> Approved,Issued csr-hng85 33m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-longtests-worker-d-7d7h2 <none> Approved,Issued csr-hvqxz 24m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-fp6wb <none> Approved,Issued csr-vc52m 50m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-xlmt6 <none> Approved,Issued csr-vflcm 40m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-djpgq <none> Approved,Issued csr-xfr7d 51m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-8v4vk <none> Approved,Issued csr-zhzbs 51m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-rqr68 <none> Approved,Issued $ oc get -o json certificatesigningrequests csr-blbjx { "apiVersion": "certificates.k8s.io/v1", "kind": "CertificateSigningRequest", "metadata": { "creationTimestamp": "2024-01-17T19:20:43Z", "generateName": "csr-", "name": "csr-blbjx", "resourceVersion": "4719586144", "uid": "5f12d236-3472-485f-8037-3896f51a809c" }, "spec": { "groups": [ "system:nodes", "system:authenticated" ], "request": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQlh6Q0NBUVFDQVFBd1ZqRVZNQk1HQTFVRUNoTU1jM2x6ZEdWdE9tNXZaR1Z6TVQwd093WURWUVFERXpSegplWE4wWlcwNmJtOWtaVHBpZFdsc1pEQXRaM04wWm1vdFkya3RiRzl1WjNSbGMzUnpMWGR2Y210bGNpMWlMVFYzCk9XUjZNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUV5Y0dhSDMvZ3F4ZHNZWkdmQXovTEpoZVgKd1o0Z1VRbjB6TlZUenJncHpvd1VPOGR6NTN4UUZTOTRibm40NldlZFg3Q2xidUpVSUpUN2pCblV1WEdnZktCTQpNRW9HQ1NxR1NJYjNEUUVKRGpFOU1Ec3dPUVlEVlIwUkJESXdNSUlvWW5WcGJHUXdMV2R6ZEdacUxXTnBMV3h2CmJtZDBaWE4wY3kxM2IzSnJaWEl0WWkwMWR6bGtlb2NFQ2dBZ0F6QUtCZ2dxaGtqT1BRUURBZ05KQURCR0FpRUEKMHlRVzZQOGtkeWw5ZEEzM3ppQTJjYXVJdlhidTVhczNXcUZLYWN2bi9NSUNJUURycEQyVEtScHJOU1I5dExKTQpjZ0ZpajN1dVNieVJBcEJ5NEE1QldEZm02UT09Ci0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=", "signerName": "kubernetes.io/kubelet-serving", "usages": [ "digital signature", "server auth" ], "username": "system:node:build0-gstfj-ci-longtests-worker-b-5w9dz" }, "status": { "certificate": 
"LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUN6ekNDQWJlZ0F3SUJBZ0lSQUlGZ1NUd0ovVUJLaE1hWlE4V01KcEl3RFFZSktvWklodmNOQVFFTEJRQXcKSmpFa01DSUdBMVVFQXd3YmEzVmlaUzFqYzNJdGMybG5ibVZ5WDBBeE56QTBNek00TVRrMk1CNFhEVEkwTURFeApOekU1TVRVME0xb1hEVEkwTURJd016QXpNVFl6Tmxvd1ZqRVZNQk1HQTFVRUNoTU1jM2x6ZEdWdE9tNXZaR1Z6Ck1UMHdPd1lEVlFRREV6UnplWE4wWlcwNmJtOWtaVHBpZFdsc1pEQXRaM04wWm1vdFkya3RiRzl1WjNSbGMzUnoKTFhkdmNtdGxjaTFpTFRWM09XUjZNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUV5Y0dhSDMvZwpxeGRzWVpHZkF6L0xKaGVYd1o0Z1VRbjB6TlZUenJncHpvd1VPOGR6NTN4UUZTOTRibm40NldlZFg3Q2xidUpVCklKVDdqQm5VdVhHZ2ZLT0JrakNCanpBT0JnTlZIUThCQWY4RUJBTUNCNEF3RXdZRFZSMGxCQXd3Q2dZSUt3WUIKQlFVSEF3RXdEQVlEVlIwVEFRSC9CQUl3QURBZkJnTlZIU01FR0RBV2dCVGYzZ0FHNUxiMkxTcXl6MEFxVCtaRAoyV0VuenpBNUJnTlZIUkVFTWpBd2dpaGlkV2xzWkRBdFozTjBabW90WTJrdGJHOXVaM1JsYzNSekxYZHZjbXRsCmNpMWlMVFYzT1dSNmh3UUtBQ0FETUEwR0NTcUdTSWIzRFFFQkN3VUFBNElCQVFBRE5ad0pMdkp4WWNta2RHV08KUm5ocC9rc3V6akJHQnVHbC9VTmF0RjZScml3eW9mdmpVNW5Kb0RFbGlLeHlDQ2wyL1d5VXl5a2hMSElBK1drOQoxZjRWajIrYmZFd0IwaGpuTndxQThudFFabS90TDhwalZ5ZzFXM0VwR2FvRjNsZzRybDA1cXBwcjVuM2l4WURJClFFY2ZuNmhQUnlKN056dlFCS0RwQ09lbU8yTFllcGhqbWZGY2h5VGRZVGU0aE9IOW9TWTNMdDdwQURIM2kzYzYKK3hpMDhhV09LZmhvT3IybTVBSFBVN0FkTjhpVUV0M0dsYzI0SGRTLzlLT05tT2E5RDBSSk9DMC8zWk5sKzcvNAoyZDlZbnYwaTZNaWI3OGxhNk5scFB0L2hmOWo5TlNnMDN4OFZYRVFtV21zN29xY1FWTHMxRHMvWVJ4VERqZFphCnEwMnIKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=", "conditions": [ { "lastTransitionTime": "2024-01-17T19:20:43Z", "lastUpdateTime": "2024-01-17T19:20:43Z", "message": "This CSR was approved by the Node CSR Approver (cluster-machine-approver)", "reason": "NodeCSRApprove", "status": "True", "type": "Approved" } ] } } $ oc get -o json certificatesigningrequests csr-blbjx | jq -r '.status.certificate | @base64d' | openssl x509 -noout -text | grep '^Certificate:\|Issuer\|Subject:\|Not ' Certificate: Issuer: CN = kube-csr-signer_@1704338196 Not Before: Jan 17 19:15:43 2024 GMT Not After : Feb 3 03:16:36 2024 GMT Subject: O = system:nodes, CN = system:node:build0-gstfj-ci-longtests-worker-b-5w9dz
So that's approved by cluster-machine-approver, but signerName: kubernetes.io/kubelet-serving is an upstream Kubernetes component documented here, and the signer is implemented by kube-controller-manager.
Description of problem:
The default value of the --parallelism option cannot be parsed to an int.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-02-02-002725
How reproducible:
reproduce with cmd copy-to-node
Steps to Reproduce:
Cmd: "oc --namespace=e2e-test-mco-4zb88 --kubeconfig=/tmp/kubeconfig-3071675436 adm copy-to-node node/ip-10-0-17-85.ec2.internal --copy=/tmp/fetch-w637bgyv=/etc/mco-compressed-test-file", StdErr: "error: --parallelism must be either N or N%: strconv.ParseInt: parsing \"10%%\": invalid syntax",
Actual results:
default value of --parallelism cannot be parsed
Expected results:
no error
Additional info:
There is hack code that appends % to the default value, ref: https://github.com/openshift/oc/blob/79dc671bdaeafa74b92f14ad9f6d84e344608034/pkg/cli/admin/pernodepod/command.go#L75-L79 and https://github.com/openshift/oc/blob/79dc671bdaeafa74b92f14ad9f6d84e344608034/pkg/cli/admin/pernodepod/command.go#L94; the error variable percentParseErr should be used there.
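Until the error-variable mix-up is fixed, a hedged workaround sketch is to pass the parallelism explicitly as a plain integer so the percentage-parsing path is bypassed (node, paths, and flag value below are examples, not from the original report):
$ oc adm copy-to-node node/<node-name> --copy=<local-file>=<remote-path> --parallelism=10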
Description of problem:
The OpenShift Console shows "Info alert:Non-printable file detected. File contains non-printable characters. Preview is not available." while editing an XML-file ConfigMap.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create configmap from file:
# oc create cm test-cm --from-file=server.xml=server.xml
configmap/test-cm created
2. If we try to edit the configmap in the OCP console we see the following error:
Info alert:Non-printable file detected. File contains non-printable characters. Preview is not available.
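An illustrative way to check whether the stored data really contains non-printable characters (command is an assumption, not from the original report):
$ oc get cm test-cm -o jsonpath='{.data.server\.xml}' | LC_ALL=C grep -c '[^[:print:][:space:]]'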
Actual results:
Expected results:
Additional info:
Maxim Patlasov pointed this out in STOR-1453 but still somehow we missed it. I tested this on 4.15.0-0.ci-2023-11-29-021749.
It is possible to set a custom TLSSecurityProfile without minTLSversion:
$ oc edit apiserver cluster
...
spec:
tlsSecurityProfile:
type: Custom
custom:
ciphers:
- ECDHE-ECDSA-CHACHA20-POLY1305
- ECDHE-ECDSA-AES128-GCM-SHA256
This causes the controller to crash loop:
$ oc get pods -n openshift-cluster-csi-drivers
NAME READY STATUS RESTARTS AGE
aws-ebs-csi-driver-controller-589c44468b-gjrs2 6/11 CrashLoopBackOff 10 (18s ago) 37s
...
because the `${TLS_MIN_VERSION}` placeholder is never replaced:
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
The observed config in the ClusterCSIDriver shows an empty string:
$ oc get clustercsidriver ebs.csi.aws.com -o json | jq .spec.observedConfig
{
"targetcsiconfig": {
"servingInfo":
}
}
which means minTLSVersion is empty when we get to this line, and the string replacement is not done:
So it seems we have a couple of options:
1) completely omit the --tls-min-version arg if minTLSVersion is empty, or
2) set --tls-min-version to the same default value we would use if TLSSecurityProfile is not present in the apiserver object
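Until one of those options lands, a hedged workaround sketch is to set minTLSVersion explicitly in the custom profile so the placeholder gets substituted (cipher list taken from the example above; the version value is only an example):
$ oc patch apiserver cluster --type=merge -p '{"spec":{"tlsSecurityProfile":{"type":"Custom","custom":{"minTLSVersion":"VersionTLS12","ciphers":["ECDHE-ECDSA-CHACHA20-POLY1305","ECDHE-ECDSA-AES128-GCM-SHA256"]}}}}'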
Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/202
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The monitoring operator may be down or disabled, and the components it manages may be unavailable or degraded.
Upon quick check I've noticed an error:
oc get co -o json | jq -r '.items[].status | select (.conditions) '.conditions | jq -r '.[] | select( (.type == "Degraded") and (.status == "True") )'
{
  "lastTransitionTime": "2023-12-19T10:25:24Z",
  "message": "syncing Thanos Querier trusted CA bundle ConfigMap failed: reconciling trusted CA bundle ConfigMap failed: updating ConfigMap object failed: Timeout: request did not complete within requested timeout - context deadline exceeded, syncing Thanos Querier trusted CA bundle ConfigMap failed: deleting old trusted CA bundle configmaps failed: error listing configmaps in namespace openshift-monitoring with label selector monitoring.openshift.io/name=alertmanager,monitoring.openshift.io/hash!=2ua4n9ob5qr8o: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps), syncing Prometheus trusted CA bundle ConfigMap failed: deleting old trusted CA bundle configmaps failed: error listing configmaps in namespace openshift-monitoring with label selector monitoring.openshift.io/name=prometheus,monitoring.openshift.io/hash!=2ua4n9ob5qr8o: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)",
  "reason": "MultipleTasksFailed",
  "status": "True",
  "type": "Degraded"
}
i.e. updating ConfigMap object failed: Timeout: request did not complete within requested timeout - context deadline exceeded
I ran oc get co again and everything looked fine; it seems this timeout condition could be handled better to avoid alerting SRE.
operator degraded
operator retries operation
Please review the following PR: https://github.com/openshift/cloud-provider-vsphere/pull/60
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility setup test is a frequent offender in the OpenStack CSI jobs. We're seeing it fail on 4.14 up to 4.16.
Example of failed job.
Example of successful job.
It seems like the 1 min timeout is too short and does not give enough time for the pods backing the service to come up.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
This is a continuation of OCPBUGS-23342; now the vmware-vsphere-csi-driver-operator cannot connect to vCenter at all. Tested using invalid credentials.
The operator ends up with no Progressing condition during upgrade from 4.11 to 4.12, and cluster-storage-operator interprets it as Progressing=true.
Please review the following PR: https://github.com/openshift/openshift-apiserver/pull/415
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/117
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/62
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/223
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
Per the latest decision, RH is not going to support installing OCP clusters on vSphere with
nested virtualization. Thus the "Install OpenShift Virtualization" checkbox on the "Operators" page should be disabled when the "vSphere" platform is selected on the "Cluster Details" page.
Slack discussion thread
https://redhat-internal.slack.com/archives/C0211848DBN/p1706640683120159
Nutanix
https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000XeiHCAS
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
These two tests are permafailing on some metal jobs:
[sig-arch][Late] all tls artifacts must be registered [Suite:openshift/conformance/parallel]
[sig-arch][Late] all registered tls artifacts must have no metadata violation regressions [Suite:openshift/conformance/parallel]
It was previously passing 100%, maybe it's tied to recent changes?
https://github.com/openshift/origin/pull/28444
Additional context here:
hypershift#1614 gave us the router Deployment (descended from the private-router Deployment), but it lacks PDB coverage. For example:
$ git --no-pager log -1 --oneline origin/main
f3f421bc7 (origin/release-4.16, origin/release-4.15, origin/main, origin/HEAD) Merge pull request #3183 from muraee/azure-kms
$ git --no-pager grep 'func [^(]*\(Deployment\|PodDisruptionBudget\)' f3f421bc7 -- control-plane-operator/controllers/hostedcontrolplane/{ingress,kas}
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/ingress/router.go:func ReconcileRouterDeployment(deployment *appsv1.Deployment, ownerRef config.OwnerRef, deploymentConfig config.DeploymentConfig, image string, config *corev1.ConfigMap) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/deployment.go:func ReconcileKubeAPIServerDeployment(deployment *appsv1.Deployment,
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/pdb.go:func ReconcilePodDisruptionBudget(pdb *policyv1.PodDisruptionBudget, p *KubeAPIServerParams) error {
Both the ingress and kas packages have Reconcile*Deployment methods. Only kas has a ReconcilePodDisruptionBudget method.
This bug is asking for router to get a covering PDB too, because being able to evict all router-* pods simultaneously (for the cluster flavors that have replicas > 1 on that Deployment) can make incoming traffic unreachable. And some of that Route traffic looks like stuff that folks would want to be reliably reachable:
$ git --no-pager grep 'func Reconcile[^(]*Route(' f3f421bc7 -- control-plane-operator/controllers/hostedcontrolplane/{ingress,kas}
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileExternalPublicRoute(route *routev1.Route, owner *metav1.OwnerReference, hostname string) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileExternalPrivateRoute(route *routev1.Route, owner *metav1.OwnerReference, hostname string) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileInternalRoute(route *routev1.Route, owner *metav1.OwnerReference) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileKonnectivityExternalRoute(route *routev1.Route, ownerRef config.OwnerRef, hostname string, defaultIngressDomain string) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileKonnectivityInternalRoute(route *routev1.Route, ownerRef config.OwnerRef) error {
Test plan:
1. Install a hosted cluster.
2. Log into the management cluster, and find the namespace of the hosted cluster $NAMESPACE.
3. Evict both router pods (using a raw create, because there isn't more convenient syntax yet):
oc -n "${NAMESPACE}" get -l app=private-router -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | while read NAME do oc create -f - <<EOF --raw "/api/v1/namespaces/${NAMESPACE}/pods/${NAME}/eviction" {"apiVersion": "policy/v1", "kind": "Eviction", "metadata": {"name": "${NAME}"}} EOF done
If that clears out both router pods one right after the other, ingress will probably hiccup. And with the PDB in place, I'd expect the second eviction to fail.
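For illustration, here is a minimal sketch of the sort of PDB reconcile function being requested, modelled loosely on the existing kas ReconcilePodDisruptionBudget. The function name, label selector, and minAvailable value are assumptions for the sketch, not hypershift's actual code.

```
package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// reconcileRouterPodDisruptionBudget fills in a PDB so that at most one
// router pod can be evicted at a time.
func reconcileRouterPodDisruptionBudget(pdb *policyv1.PodDisruptionBudget) {
	minAvailable := intstr.FromInt(1)
	pdb.Spec = policyv1.PodDisruptionBudgetSpec{
		MinAvailable: &minAvailable,
		Selector: &metav1.LabelSelector{
			MatchLabels: map[string]string{"app": "private-router"},
		},
	}
}

func main() {
	pdb := &policyv1.PodDisruptionBudget{}
	reconcileRouterPodDisruptionBudget(pdb)
	fmt.Println(pdb.Spec.MinAvailable.IntValue(), pdb.Spec.Selector.MatchLabels)
}
```

With such a PDB in place, the second eviction in the test plan above should be rejected while only one router replica remains available.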
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When there is a new update available for the cluster, clicking "Select a version" on the Cluster Settings page does nothing.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-19-033450
How reproducible:
Always
Steps to Reproduce:
1. Prepare a cluster with an available update.
2. Go to the Cluster Settings page and choose a version by clicking on the "Select a version" button.
3.
Actual results:
2. There is no response when clicking the button; the user cannot select a version from the page.
Expected results:
2. A modal should show up for user to select version after clicking on "Select a version" button
Additional info:
screenshot: https://drive.google.com/file/d/1Kpyu0kUKFEQczc5NVEcQFbf_uly_S60Y/view?usp=sharing
Description of problem:
Node Overview Pane not displaying
Version-Release number of selected component (if applicable):
How reproducible:
In the openshift console, under Compute > Node > Node Details > the Overview tab does not display
Steps to Reproduce:
In the openshift console, under Compute > Node > Node Details > the Overview tab does not display
Actual results:
Overview tab does not display
Expected results:
Overview tab should display
Additional info:
Please review the following PR: https://github.com/openshift/sdn/pull/599
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/oauth-proxy/pull/270
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Update hash creation to use sha512 instead of sha1
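A minimal illustration of the change, assuming the hash is computed over arbitrary input bytes; the actual call site is not shown in this item.

```
package main

import (
	"crypto/sha512"
	"fmt"
)

func main() {
	data := []byte("example input")

	// Previously the hash would have been sha1.Sum(data); SHA-1 is
	// considered weak for this purpose, so SHA-512 is used instead.
	sum := sha512.Sum512(data)

	fmt.Printf("%x\n", sum)
}
```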
Links
We frequently receive inquiries regarding the versions of monitoring components (such as Prometheus, Alertmanager, etc.) that are used in a given OCP version.
Currently, obtaining this information requires several manual steps on our part, e.g.:
What if we automate this?
How about a view that displays the versions of all components for all recent OCP versions?
At 17:26:09, the cluster is happily upgrading nodes:
An update is in progress for 57m58s: Working towards 4.14.1: 734 of 859 done (85% complete), waiting on machine-config
At 17:26:54, the upgrade starts to reboot master nodes and COs get noisy (this one specifically is OCPBUGS-20061)
An update is in progress for 58m50s: Unable to apply 4.14.1: the cluster operator control-plane-machine-set is not available
~Two minutes later, at 17:29:07, CVO starts to shout about waiting on operators for over 40 minutes despite not indicating anything was wrong earlier:
An update is in progress for 1h1m2s: Unable to apply 4.14.1: wait has exceeded 40 minutes for these operators: etcd, kube-apiserver
This is only because these operators go briefly degraded during master reboots (which they shouldn't, but that is a different story). CVO computes its 40 minutes against the time when it first started to upgrade the given operator (see the sketch after this list), so it:
1. Upgrades etcd / KAS very early in the upgrade, noting the time when it started to do that
2. These two COs upgrade successfully and the upgrade proceeds
3. Eventually cluster starts rebooting masters and etcd/KAS go degraded
4. CVO compares the current time against the noted time, discovers it's more than 40 minutes, and starts warning about it.
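A simplified sketch of the timing logic described above; the names are illustrative, not the actual CVO code. The point is that the deadline is anchored to the time the operator was first upgraded, so a brief degradation much later trips the 40-minute warning immediately.

```
package main

import (
	"fmt"
	"time"
)

// operatorStart records when the CVO first began updating each operator.
// It lives only in memory, which is why a CVO restart loses the history.
var operatorStart = map[string]time.Time{}

func checkOperator(name string, degraded bool, now time.Time) string {
	start, ok := operatorStart[name]
	if !ok {
		operatorStart[name] = now
		return ""
	}
	// The comparison is against the original start time, not against when the
	// operator most recently became unhealthy, so a short degradation during a
	// later master reboot can immediately exceed the 40-minute budget.
	if degraded && now.Sub(start) > 40*time.Minute {
		return fmt.Sprintf("wait has exceeded 40 minutes for these operators: %s", name)
	}
	return ""
}

func main() {
	t0 := time.Now()
	checkOperator("etcd", false, t0)                            // upgraded early, healthy
	msg := checkOperator("etcd", true, t0.Add(62*time.Minute))  // briefly degraded during reboot
	fmt.Println(msg)
}
```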
all
Not entirely deterministic:
1. the upgrade must go for 40m+ between upgrading etcd and upgrading nodes
2. the upgrade must reboot a master that is not running CVO (otherwise there will be a new CVO instance without the saved times, they are only saved in memory)
1. Watch oc adm upgrade during the upgrade
Spurious "waiting for over 40m" message pops out of the blue
CVO simply says "waiting up to 40m on" and this eventually goes away as the node comes back up and etcd stops being degraded.
Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/155
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/60
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
In the `DoHTTPProbe` function located at `github.com/openshift/router/pkg/router/metrics/probehttp/probehttp.go`, logging of the HTTP response object at verbosity level 4 results in a serialisation error due to non-serialisable fields within the `http.Response` object. The error logged is `<<error: json: unsupported type: func() (io.ReadCloser, error)>>`, pointing towards an inability to serialise the `Body` field, which is of type `io.ReadCloser`.
This function is designed to check if a GET request to the specified URL succeeds, logging detailed response information at higher verbosity levels for diagnostic purposes.
Steps to Reproduce:
1. Increase the logging level to 4.
2. Perform an operation that triggers the `DoHTTPProbe` function.
3. Review the logging output for the error message.
Expected Behaviour:
The logger should gracefully handle or exclude non-serialisable fields like `Body`, ensuring clean and informative logging output that aids in diagnostics without encountering serialisation errors.
Actual Behaviour:
Non-serialisable fields in the `http.Response` object lead to the error `<<error: json: unsupported type: func() (io.ReadCloser, error)>>` being logged. This diminishes the utility of logs for debugging at higher verbosity levels.
Impact:
The issue is considered of low severity since it only appears at logging level 4, which is beyond standard operational levels (level 2) used in production. Nonetheless, it could hinder effective diagnostics and clutter logs with errors when high verbosity levels are employed for troubleshooting.
Suggested Fix:
Modify the logging functionality within `DoHTTPProbe` to either filter out non-serialisable fields from the `http.Response` object or implement a custom serialisation approach that allows these fields to be logged in a more controlled and error-free manner.
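A minimal sketch of one possible approach: log only serialisable parts of the response rather than the whole http.Response. The field selection and helper names here are illustrative, not the router's actual code.

```
package main

import (
	"net/http"

	"k8s.io/klog/v2"
)

// loggableResponse holds only response fields that marshal cleanly.
type loggableResponse struct {
	Status        string      `json:"status"`
	StatusCode    int         `json:"statusCode"`
	Proto         string      `json:"proto"`
	ContentLength int64       `json:"contentLength"`
	Header        http.Header `json:"header"`
}

func logResponse(resp *http.Response) {
	if klog.V(4).Enabled() {
		klog.InfoS("probe response", "response", loggableResponse{
			Status:        resp.Status,
			StatusCode:    resp.StatusCode,
			Proto:         resp.Proto,
			ContentLength: resp.ContentLength,
			Header:        resp.Header,
		})
	}
}

func main() {
	resp, err := http.Get("http://localhost:1936/healthz")
	if err != nil {
		klog.ErrorS(err, "probe failed")
		return
	}
	defer resp.Body.Close()
	logResponse(resp)
}
```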
Bump the sigs.k8s dependencies and update dependabot groupings
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Try to deploy in mad02 or mad04 with powervs
2. Cannot import boot image
3. Fail
Actual results:
Fail
Expected results:
Cluster comes up
Additional info:
Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/96
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Trying to define multiple receivers in a single user-defined AlertmanagerConfig
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
#### Monitoring for user-defined projects is enabled

```
oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml | head -4
```
```
apiVersion: v1
data:
  config.yaml: |
    enableUserWorkload: true
```

#### separate Alertmanager instance for user-defined alert routing is Enabled and Configured

```
oc -n openshift-user-workload-monitoring get configmap user-workload-monitoring-config -o yaml | head -6
```
```
apiVersion: v1
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true
```

Create the testing namespace:

```
oc new-project libor-alertmanager-testing
```

## TESTING - MULTIPLE RECEIVERS IN ALERTMANAGERCONFIG

Single AlertmanagerConfig `alertmanager_config_webhook_and_email_rootDefault.yaml`

```
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: libor-alertmanager-testing-email-webhook
  namespace: libor-alertmanager-testing
spec:
  receivers:
  - name: 'libor-alertmanager-testing-webhook'
    webhookConfigs:
    - url: 'http://prometheus-msteams.internal-monitoring.svc:2000/occ-alerts'
  - name: 'libor-alertmanager-testing-email'
    emailConfigs:
    - to: USER@USER.CO
      requireTLS: false
      sendResolved: true
  - name: Default
  route:
    groupBy:
    - namespace
    receiver: Default
    groupInterval: 60s
    groupWait: 60s
    repeatInterval: 12h
    routes:
    - matchers:
      - name: severity
        value: critical
        matchType: '='
      continue: true
      receiver: 'libor-alertmanager-testing-webhook'
    - matchers:
      - name: severity
        value: critical
        matchType: '='
      receiver: 'libor-alertmanager-testing-email'
```

Once saved, the continue statement is removed from the object.

The configuration applied to alertmanager contains continue: false statements:

```
oc exec -n openshift-user-workload-monitoring alertmanager-user-workload-0 -- amtool config show --alertmanager.url http://localhost:9093
```
```
route:
  receiver: Default
  group_by:
  - namespace
  continue: false
  routes:
  - receiver: libor-alertmanager-testing/libor-alertmanager-testing-email-webhook/Default
    group_by:
    - namespace
    matchers:
    - namespace="libor-alertmanager-testing"
    continue: true
    routes:
    - receiver: libor-alertmanager-testing/libor-alertmanager-testing-email-webhook/libor-alertmanager-testing-webhook
      matchers:
      - severity="critical"
      continue: false   <----
    - receiver: libor-alertmanager-testing/libor-alertmanager-testing-email-webhook/libor-alertmanager-testing-email
      matchers:
      - severity="critical"
      continue: false   <-----
```

If I update the statements to read `continue: true` and test here: https://prometheus.io/webtools/alerting/routing-tree-editor/ then I get the desired results. The workaround is to use 2 separate AlertmanagerConfig files; in that case the continue statement is added.
Actual results:
Once saved the continue statement is removed from the object.
Expected results:
The continue: true statement is retained and applied to alertmanager
Additional info:
revert: https://github.com/openshift/origin/pull/28603
output looks like:
INFO[2024-02-16T00:25:04Z] time="2024-02-16T00:20:41Z" level=error msg="disruption sample failed: error running request: 403 Forbidden: http://www.w3.org/TR/html4/strict.dtd\">\n\n\n\n\n\n\n \n ERROR \n The requested URL could not be retrieved \n \n \n\n \n The following error was encountered while trying to retrieve the URL: http://35.212.33.188/health\">http://35.212.33.188/health \n\n \n Access Denied. \n \n\n Access control configuration prevents your request from being allowed at this time. Please contact your service provider if you feel this is incorrect. \n\n Your cache administrator is root. \n \n \n\n \n \n Generated Fri, 16 Feb 2024 00:20:41 GMT by ofcir-3329d9226457452fb2040e269776e3a5 (squid/5.2) \n\n \n\n" auditID=facdfd31-51e5-4812-a356-6e4b0e30cd38 backend=gcp-network-liveness-reused-connections this-instance="{Disruption map[backend-disruption-name:gcp-network-liveness-reused-connections connection:reused disruption:openshift-tests]}" type=reused time="2024-02-16T00:20:41Z" level=error msg="disruption sample failed: error running request: 403 Forbidden: http://www.w3.org/TR/html4/strict.dtd\">\n\n\n\n\n\n\n \n ERROR
Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/112
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
I noticed this today when looking at component readiness. A ~5% drop in the pass rate may seem minor, but these can certainly add up. This test passed 713 times in a row on 4.14. You can see today's failure here.
Details below:
-------
Component Readiness has found a potential regression in [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers.
Probability of significant regression: 99.96%
Sample (being evaluated) Release: 4.15
Start Time: 2024-01-17T00:00:00Z
End Time: 2024-01-23T23:59:59Z
Success Rate: 94.83%
Successes: 55
Failures: 3
Flakes: 0
Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 100.00%
Successes: 713
Failures: 0
Flakes: 4
Description of problem:
The pipeline operator has been removed from the operator hub so CI has been failing since https://search.ci.openshift.org/?search=Entire+pipeline+flow+from+Builder+page+%22before+all%22+hook+for+%22Background+Steps%22&maxAge=336h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
When baselineCapabilitySet is set to None, still see an SA with name `deployer-controller` in the cluster.
Steps to Reproduce:
=================
1. Install 4.15 cluster with baselineCapabilitySet to None
2. Run command `oc get sa -A | grep deployer`
Actual Results:
================
[knarra@knarra openshift-tests-private]$ oc get sa -A | grep deployer
openshift-infra deployer-controller 0 63m
Expected Results:
==================
No SA related to deployer should be returned
Please review the following PR: https://github.com/openshift/telemeter/pull/522
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
https://github.com/openshift/origin/pull/28484
Broken for 3 weeks, possibly for different reasons.
Unit test failing now, looks like 4.16 data not getting picked up, and it probably should be.
Description of problem:
NodeLogQuery e2e tests are failing with Kubernetes 1.28 bump. Example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/1646/pull-ci-openshift-kubernetes-master-k8s-e2e-gcp-ovn/1683472309211369472
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/42
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/platform-operators/pull/105
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/94
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Agent CI jobs have started to fail on a pretty regular basis, especially the compact ones. Jobs time out due to either the console or authentication operators remaining in a degraded state. From the log analysis, they are not able to get a route. Both apiserver and etcd component logs report connection refused messages, possibly indicating an underlying network problem.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Pretty frequently
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Recently, the passing rate for the test "static pods should start after being created" has dropped significantly for some platforms: https://sippy.dptools.openshift.org/sippy-ng/tests/4.15/analysis?test=%5Bsig-node%5D%20static%20pods%20should%20start%20after%20being%20created&filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22%5Bsig-node%5D%20static%20pods%20should%20start%20after%20being%20created%22%7D%5D%2C%22linkOperator%22%3A%22and%22%7D
Take a look at this example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-techpreview/1712803313642115072
The test failed with the following message:
{ static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 6 on node: "ci-op-2z99zzqd-7f99c-rfp4q-master-0" didn't show up, waited: 3m0s}
Seemingly revision 6 was never reached. But if we look at the log from kube-controller-manager-operator, it jumps from revision 5 to revision 7: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-techpreview/1712803313642115072/artifacts/e2e-azure-sdn-techpreview/gather-extra/artifacts/pods/openshift-kube-controller-manager-operator_kube-controller-manager-operator-7cd978d745-bcvkm_kube-controller-manager-operator.log
The log also indicates that there is a possibility of a race:
W1013 12:59:17.775274 1 staticpod.go:38] revision 7 is unexpectedly already the latest available revision. This is a possible race!
This might be a static pod controller issue, but I am starting with the kube-controller-manager component for this case. Feel free to reassign. Here is a Slack thread related to this: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1697472297510279
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
In-cluster clients should be able to talk directly to the node-local apiserver IP address, and as a best practice should all be configured to use it. This load balancer provides the added benefit in cloud environments of healthchecking the path from the machine to the load balancer fronting the kube-apiserver. It becomes more crucial in baremetal/on-prem environments where there may not be a load balancer and instead just 3 unique endpoints pointing directly to redundant kube-apiservers. In that case: if using just DNS, intermittent traffic failures will be experienced if a control plane instance goes down; using the node-local load balancer, there will be no traffic disruption.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
1. Schedule a pod on any hypershift cluster node
2. In the pod run curl -v -k https://172.20.0.1:6443
3. The verbose output will show that the kube-apiserver cert does not have the node-local client load balancer IP address in its IPs section and therefore will not allow valid HTTPS requests on that address
Actual results:
Secure HTTPS requests cannot be made to the kube-apiserver
Expected results:
Secure HTTPS requests can be made to the kube-apiserver (no need to run -k when specifying proper CA bundle)
Additional info:
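For additional context, a minimal sketch that connects without verification and reports whether the serving certificate carries the node-local IP as a SAN. The address 172.20.0.1:6443 is taken from the reproduce steps above; the rest is illustrative, not hypershift code.

```
package main

import (
	"crypto/tls"
	"fmt"
	"net"
)

func main() {
	addr := "172.20.0.1:6443" // node-local kube-apiserver address from the steps above
	wantIP := net.ParseIP("172.20.0.1")

	// InsecureSkipVerify is used only so the presented certificate can be inspected.
	conn, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()

	cert := conn.ConnectionState().PeerCertificates[0]
	found := false
	for _, ip := range cert.IPAddresses {
		if ip.Equal(wantIP) {
			found = true
			break
		}
	}
	fmt.Printf("certificate %q contains %s as an IP SAN: %v\n", cert.Subject.CommonName, wantIP, found)
}
```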
Please review the following PR: https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/145
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The number of control plane replicas defined in install-config.yaml (or agent-cluster-install.yaml) should be validated to check that it is set to 3, or 1 in the case of SNO. If set to another value, the "create image" command should fail.
We recently had a case where the number of replicas was set to 2 and the installation failed. It would be good to catch this misconfiguration prior to the install.
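A minimal sketch of the kind of validation described above; the function name is illustrative, not the agent installer's actual code.

```
package main

import (
	"errors"
	"fmt"
)

// validateControlPlaneReplicas mirrors the rule described above: 3 replicas
// for a regular cluster, or 1 for single-node OpenShift (SNO).
func validateControlPlaneReplicas(replicas int) error {
	if replicas != 3 && replicas != 1 {
		return errors.New("control plane replicas must be 3, or 1 for single-node OpenShift")
	}
	return nil
}

func main() {
	for _, r := range []int{1, 2, 3} {
		if err := validateControlPlaneReplicas(r); err != nil {
			fmt.Printf("replicas=%d: invalid: %v\n", r, err)
			continue
		}
		fmt.Printf("replicas=%d: ok\n", r)
	}
}
```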
Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/207
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When a customer certificate and sre certificate are configured and approved, revocation of customer certificate causes access to the cluster using kubeconfig with sre cert to be denied
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create a cluster 2. Configure a customer cert and a sre cert, they are approved 3. Revoke a customer cert, access to the cluster using kubeconfig with sre cert gets denied
Actual results:
Revoke a customer cert, access to the cluster using kubeconfig with sre cert gets denied
Expected results:
Revoke a customer cert, access to the cluster using kubeconfig with sre cert succeeds
Additional info:
Please review the following PR: https://github.com/openshift/machine-api-provider-powervs/pull/67
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The following binaries need to get extracted from the release payload for both rhel8 and rhel9: oc, ccoctl, opm, openshift-install, oc-mirror. The images that contain these should produce artifacts of both kinds in some location, and probably make the artifact of their own architecture available under a normal location in the path. Example:
/usr/share/<binary>.rhel8
/usr/share/<binary>.rhel9
/usr/bin/<binary>
This ticket is about getting "oc adm release extract" to do the right thing in a backwards-compatible way: if both binaries are available, get those; if not, get the binary from the old location.
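A hedged sketch of the backwards-compatible lookup described above; the paths follow the example layout, and the helper name is an assumption.

```
package main

import (
	"fmt"
	"os"
)

// resolveBinaryPath prefers the per-RHEL artifact when the image provides it
// and falls back to the traditional location otherwise.
func resolveBinaryPath(binary, rhelSuffix string) string {
	candidates := []string{
		fmt.Sprintf("/usr/share/%s.%s", binary, rhelSuffix), // e.g. /usr/share/oc.rhel9
		fmt.Sprintf("/usr/bin/%s", binary),                  // old single-artifact location
	}
	for _, p := range candidates {
		if _, err := os.Stat(p); err == nil {
			return p
		}
	}
	return ""
}

func main() {
	fmt.Println(resolveBinaryPath("oc", "rhel9"))
}
```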
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
During live OVN migration, the network operator shows the error message: Not applying unsafe configuration change: invalid configuration: [cannot change default network type when not doing migration]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a 4.15 nightly SDN ROSA cluster
2. oc delete validatingwebhookconfigurations.admissionregistration.k8s.io/sre-techpreviewnoupgrade-validation
3. oc edit featuregate cluster to enable featuregates
4. Wait for all nodes to reboot and come back to normal
5. oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
Actual results:
[weliang@weliang ~]$ oc delete validatingwebhookconfigurations.admissionregistration.k8s.io/sre-techpreviewnoupgrade-validation
[weliang@weliang ~]$ oc edit featuregate cluster
[weliang@weliang ~]$ oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
network.config.openshift.io/cluster patched
[weliang@weliang ~]$ oc get co network
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
network   4.15.0-0.nightly-2023-12-18-220750   True        False         True       105m    Not applying unsafe configuration change: invalid configuration: [cannot change default network type when not doing migration]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.
[weliang@weliang ~]$ oc describe Network.config.openshift.io cluster
Name:         cluster
Namespace:
Labels:       <none>
Annotations:  network.openshift.io/network-type-migration:
API Version:  config.openshift.io/v1
Kind:         Network
Metadata:
  Creation Timestamp:  2023-12-20T15:13:39Z
  Generation:          3
  Resource Version:    119899
  UID:                 6a621b88-ac4f-4918-a7f6-98dba7df222c
Spec:
  Cluster Network:
    Cidr:         10.128.0.0/14
    Host Prefix:  23
  External IP:
    Policy:
  Network Type:  OVNKubernetes
  Service Network:
    172.30.0.0/16
Status:
  Cluster Network:
    Cidr:         10.128.0.0/14
    Host Prefix:  23
  Cluster Network MTU:  8951
  Network Type:         OpenShiftSDN
  Service Network:
    172.30.0.0/16
Events:  <none>
[weliang@weliang ~]$ oc describe Network.operator.openshift.io cluster
Name:         cluster
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  operator.openshift.io/v1
Kind:         Network
Metadata:
  Creation Timestamp:  2023-12-20T15:15:37Z
  Generation:          275
  Resource Version:    120026
  UID:                 278bd491-ac88-4038-887f-d1defc450740
Spec:
  Cluster Network:
    Cidr:         10.128.0.0/14
    Host Prefix:  23
  Default Network:
    Openshift SDN Config:
      Enable Unidling:  true
      Mode:             NetworkPolicy
      Mtu:              8951
      Vxlan Port:       4789
    Type:               OVNKubernetes
  Deploy Kube Proxy:            false
  Disable Multi Network:        false
  Disable Network Diagnostics:  false
  Kube Proxy Config:
    Bind Address:        0.0.0.0
  Log Level:             Normal
  Management State:      Managed
  Observed Config:       <nil>
  Operator Log Level:    Normal
  Service Network:
    172.30.0.0/16
  Unsupported Config Overrides:  <nil>
  Use Multi Network Policy:      false
Status:
  Conditions:
    Last Transition Time:  2023-12-20T15:15:37Z
    Status:                False
    Type:                  ManagementStateDegraded
    Last Transition Time:  2023-12-20T16:58:58Z
    Message:               Not applying unsafe configuration change: invalid configuration: [cannot change default network type when not doing migration]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.
    Reason:                InvalidOperatorConfig
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2023-12-20T15:15:37Z
    Status:                True
    Type:                  Upgradeable
    Last Transition Time:  2023-12-20T16:52:11Z
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2023-12-20T15:15:45Z
    Status:                True
    Type:                  Available
  Ready Replicas:          0
  Version:                 4.15.0-0.nightly-2023-12-18-220750
Events:  <none>
[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2023-12-18-220750   True        False         84m     Error while reconciling 4.15.0-0.nightly-2023-12-18-220750: the cluster operator network is degraded
[weliang@weliang ~]$
Expected results:
Migration success
Additional info:
Get same error message from ROSA and GCP cluster.
Some templating code can be simplified now.
Please review the following PR: https://github.com/openshift/csi-operator/pull/87
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cloud-provider-gcp/pull/48
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The users are experiencing an issue with NodePort traffic forwarding: TCP traffic continues to be directed to pods that are in a terminating state, so connections cannot be created successfully. As the customer mentioned, this issue is causing connection disruptions in their business transactions.
Version-Release number of selected component (if applicable):
On the OpenShift 4.12.13 with RHEL8.6 workers and OVN environment.
How reproducible:
Here is the relevant code:
https://github.com/openshift/ovn-kubernetes/blob/dd3c7ed8c1f41873168d3df26084ecbfd3d9a36b/go-controller/pkg/util/kube.go#L360;
—
func IsEndpointServing(endpoint discovery.Endpoint) bool {
	if endpoint.Conditions.Serving != nil {
		return *endpoint.Conditions.Serving
	} else {
		return IsEndpointReady(endpoint)
	}
}
// IsEndpointValid takes as input an endpoint from an endpoint slice and a boolean that indicates whether to include
// all terminating endpoints, as per the PublishNotReadyAddresses feature in kubernetes service spec. It always returns true
// if includeTerminating is true and falls back to IsEndpointServing otherwise.
func IsEndpointValid(endpoint discovery.Endpoint, includeTerminating bool) bool {
	return includeTerminating || IsEndpointServing(endpoint)
}
—
It looks like the 'IsEndpointValid' function will return serving=true endpoints; it is not checking for ready=true endpoints.
I see that the code in this section was changed recently (the check on Ready=true was changed to Serving=true):
[Check the "Serving" field for endpoints]
https://github.com/openshift/ovn-kubernetes/commit/aceef010daf0697fe81dba91a39ed0fdb6563dea#diff-daf9de695e0ff81f9173caf83cb88efa138e92a9b35439bd7044aa012ff931c0
https://github.com/openshift/ovn-kubernetes/blob/release-4.12/go-controller/pkg/util/kube.go#L326-L386
—
out.Port = *port.Port
for _, endpoint := range slice.Endpoints {
	// Skip endpoint if it's not valid
	if !IsEndpointValid(endpoint, includeTerminating) {
		continue
	}
	for _, ip := range endpoint.Addresses {
		klog.V(4).Infof("Adding slice %s endpoint: %v, port: %d", slice.Name, endpoint.Addresses, *port.Port)
		ipStr := utilnet.ParseIPSloppy(ip).String()
		switch slice.AddressType {
		// ...
		}
	}
}
—
Steps to Reproduce:
Here are the customer's sample pods for reference:
mbgateway-st-8576f6f6f8-5jc75 1/1 Running 0 104m 172.30.195.124 appn01-100.app.paas.example.com <none> <none>
mbgateway-st-8576f6f6f8-q8j6k 1/1 Running 0 5m51s 172.31.2.97 appn01-202.app.paas.example.com <none> <none>
pod yaml:
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 40
    periodSeconds: 10
    successThreshold: 1
    tcpSocket:
      port: 9190
    timeoutSeconds: 5
  name: mbgateway-st
  ports:
  - containerPort: 9190
    protocol: TCP
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 40
    periodSeconds: 10
    successThreshold: 1
    tcpSocket:
      port: 9190
    timeoutSeconds: 5
  resources:
    limits:
      cpu: "2"
      ephemeral-storage: 10Gi
      memory: 2G
    requests:
      cpu: 50m
      ephemeral-storage: 100Mi
      memory: 1111M
When deleting pod mbgateway-st-8576f6f6f8-5jc75, check the EndpointSlice status:
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:
Wait a little moment, then check the OVN service LB; the endpoint information has not been updated to the latest:
9349d703-1f28-41fe-b505-282e8abf4c40 Service_lb59-10- tcp 172.35.0.185:31693 172.30.195.124:9190,172.31.2.97:9190
dca65745-fac4-4e73-b412-2c7530cf4a91 Service_lb59-10- tcp 172.35.0.170:31693 172.30.195.124:9190,172.31.2.97:9190
a5a65766-b0f2-4ac6-8f7c-cdebeea303e3 Service_lb59-10- tcp 172.35.0.89:31693 172.30.195.124:9190,172.31.2.97:9190
a36517c5-ecaa-4a41-b686-37c202478b98 Service_lb59-10- tcp 172.35.0.213:31693 172.30.195.124:9190,172.31.2.97:9190
16d997d1-27f0-41a3-8a9f-c63c8872d7b8 Service_lb59-10- tcp 172.35.0.92:31693 172.30.195.124:9190,172.31.2.97:9190
Wait for a little moment,
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:
Check the OVN service LB; the deleted pod's endpoint information is still there:
fceeaf8f-e747-4290-864c-ba93fb565a8a Service_lb59-10- tcp 172.35.0.56:31693 172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
bef42efd-26db-4df3-b99d-370791988053 Service_lb59-10- tcp 172.35.1.26:31693 172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
84172e2c-081c-496a-afec-25ebcb83cc60 Service_lb59-10- tcp 172.35.0.118:31693 172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
34412ddd-ab5c-4b6b-95a3-6e718dd20a4f Service_lb59-10- tcp 172.35.1.14:31693 172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
Actual results:
Service LB endpoint membership is determined by the POD.status.condition[type=Serving] status.
Expected results:
Service LB endpoint membership should be determined by the POD.status.condition[type=Ready] status.
Additional info:
The ovn-controller determines whether an endpoint should be added to the service load balancer (service LB) based on condition.serving. The current issue is that when a pod is in the terminating state, condition.serving remains true; its value is derived from the pod's Ready condition (POD.status.condition[type=Ready]) having been true.
However, when a pod is deleted, the EndpointSlice condition.serving state remains unchanged, and the backend pool of the service LB still includes the IP of the deleted pod. Why doesn't ovn-controller use the condition.ready status to decide whether the pod's IP should be added to the service LB backend pool?
Could the shift-networking experts confirm whether this is an OpenShift OVN service LB bug or not?
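To make the serving-vs-ready distinction concrete, here is a minimal sketch using the upstream discovery/v1 types; it is illustrative only, not a proposed patch to ovn-kubernetes.

```
package main

import (
	"fmt"

	discovery "k8s.io/api/discovery/v1"
)

func boolPtr(b bool) *bool { return &b }

// servingBased mirrors the current behaviour: a terminating pod whose
// condition.serving is still true remains a load-balancer backend.
func servingBased(ep discovery.Endpoint) bool {
	return ep.Conditions.Serving != nil && *ep.Conditions.Serving
}

// readyBased is the behaviour the report asks about: Ready is forced to
// false for terminating endpoints, so they would be dropped earlier.
func readyBased(ep discovery.Endpoint) bool {
	return ep.Conditions.Ready != nil && *ep.Conditions.Ready
}

func main() {
	terminating := discovery.Endpoint{
		Conditions: discovery.EndpointConditions{
			Ready:       boolPtr(false),
			Serving:     boolPtr(true),
			Terminating: boolPtr(true),
		},
	}
	fmt.Println("serving-based keeps endpoint:", servingBased(terminating)) // true
	fmt.Println("ready-based keeps endpoint:  ", readyBased(terminating))  // false
}
```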
Description of problem:
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Create a CredentialsRequest including the spec.providerSpec.stsIAMRoleARN string.
2. Cloud Credential Operator could not populate a Secret based on the CredentialsRequest.
$ oc get secret -A | grep test-mihuang #Secret not found.
$ oc get CredentialsRequest -n openshift-cloud-credential-operator
NAME           AGE
...
test-mihuang   44s
3.
Actual results:
The Secret is not created successfully.
Expected results:
Successfully created the secret on the hosted cluster.
Additional info:
Description of problem:
Before:
Warning FailedCreatePodSandBox 8s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "82187d55b1379aad1e6c02b3394df7a8a0c84cc90902af413c1e0d9d56ddafb0": plugin type="multus" name="multus-cni-network" failed (add): [default/netshoot-deployment-59898b5dd9-hhvfn/89e6349b-9797-4e03-8828-ebafe224dfaf:whereaboutsexample]: error adding container to network "whereaboutsexample": error at storage engine: Could not allocate IP in range: ip: 2000::1 / - 2000::ffff:ffff:ffff:fffe / range: net.IPNet{IP:net.IP{0x20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, Mask:net.IPMask{0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}
After:
Type     Reason                  Age   From               Message
----     ------                  ----  ----               -------
Normal   Scheduled               6s    default-scheduler  Successfully assigned default/netshoot-deployment-59898b5dd9-kk2zm to whereabouts-worker
Normal   AddedInterface          6s    multus             Add eth0 [10.244.2.2/24] from kindnet
Warning  FailedCreatePodSandBox  6s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "23dd45e714db09380150b5df74be37801bf3caf73a5262329427a5029ef44db1": plugin type="multus" name="multus-cni-network" failed (add): [default/netshoot-deployment-59898b5dd9-kk2zm/142de5eb-9f8a-4818-8c5c-6c7c85fe575e:whereaboutsexample]: error adding container to network "whereaboutsexample": error at storage engine: Could not allocate IP in range: ip: 2000::1 / - 2000::ffff:ffff:ffff:fffe / range: 2000::/64 / excludeRanges: [2000::/32]
Fixed upstream in #366 https://github.com/k8snetworkplumbingwg/whereabouts/pull/366
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/102
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#tabledata includes a reference to `pf-c-table__action`, but v1+ of console-dynamic-plugin-sdk requires PatternFly 5, so the reference should be updated to `pf-v5-c-table__action`.
Description of problem:
Various jobs are failing in e2e-gcp-operator due to the LoadBalancer-Type Service not going "ready", which most likely means it is not getting an IP address. Tests affected so far are:
- TestUnmanagedDNSToManagedDNSInternalIngressController
- TestScopeChange
- TestInternalLoadBalancerGlobalAccessGCP
- TestInternalLoadBalancer
- TestAllowedSourceRanges
For example, in TestInternalLoadBalancer, the Load Balancer never comes back ready:
operator_test.go:1454: Expected conditions: map[Admitted:True Available:True DNSManaged:True DNSReady:True LoadBalancerManaged:True LoadBalancerReady:True] Current conditions: map[Admitted:True Available:False DNSManaged:True DNSReady:False Degraded:True DeploymentAvailable:True DeploymentReplicasAllAvailable:True DeploymentReplicasMinAvailable:True DeploymentRollingOut:False EvaluationConditionsDetected:False LoadBalancerManaged:True LoadBalancerProgressing:False LoadBalancerReady:False Progressing:False Upgradeable:True]
Where DNSReady:False and LoadBalancerReady:False.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
10% of the time
Steps to Reproduce:
1. Run e2e-gcp-operator many times until you see one of these failures
Actual results:
Test Failure
Expected results:
Not failure
Additional info:
Search.CI Links:
TestScopeChange
TestInternalLoadBalancerGlobalAccessGCP & TestInternalLoadBalancer
This does not seem related to https://issues.redhat.com/browse/OCPBUGS-6013. The DNS E2E tests actually pass this same condition check.
Please review the following PR: https://github.com/openshift/route-controller-manager/pull/36
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/57
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/63
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-telemeter-master-integration shows that the job fails a lot while there was no recent change which could explain this.
It blocks merges to the telemeter repository for no valid reason.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
There is an 'Unhealthy Conditions' table on the MachineHealthCheck details page; currently the first column is 'Status', but users care more about the Type than its Status.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-16-113018
How reproducible:
Always
Steps to Reproduce:
1. go to any MHC details page and check 'Unhealthy Conditions' table 2. 3.
Actual results:
in the table, 'Type' is the last column
Expected results:
We should put 'Type' as the first column since this is the most important factor users care about for comparison. See the 'Conditions' table on the ClusterOperators details page, where the order is Type -> Status -> other info, which is very user friendly.
Additional info:
Description of problem:
When no release image is provided on a HostedCluster, the backend hypershift operator picks the latest OCP release image within the operator's support window. Today this fails because of how the operator selects this default image. For example, hypershift operator 4.14 does not support 4.15, but 4.15.0-rc.3 is picked as a default release image today. This is a result of not anticipating that a release candidate could be picked up ahead of the latest stable release. The filter used to pick the latest release needs to consider patch-level releases before the next y-stream release.
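A rough sketch of the kind of filtering described above, using a semver library such as blang/semver for illustration; the function name and constants are assumptions, not the actual hypershift code.

```
package main

import (
	"fmt"

	"github.com/blang/semver/v4"
)

// latestSupported returns the newest version that does not exceed the
// operator's supported minor, so a 4.15 release candidate is not chosen
// as the default by a 4.14 operator.
func latestSupported(versions []string, maxMinor uint64) (string, error) {
	var best *semver.Version
	for _, raw := range versions {
		v, err := semver.ParseTolerant(raw)
		if err != nil {
			continue
		}
		if v.Major != 4 || v.Minor > maxMinor {
			continue // outside the support window
		}
		if best == nil || v.GT(*best) {
			vv := v
			best = &vv
		}
	}
	if best == nil {
		return "", fmt.Errorf("no release within the support window")
	}
	return best.String(), nil
}

func main() {
	candidates := []string{"4.14.1", "4.14.9", "4.15.0-rc.3"}
	v, err := latestSupported(candidates, 14)
	fmt.Println(v, err) // 4.14.9 <nil>
}
```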
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. Create a self-managed HCP cluster and do not specify a release image
Actual results:
the hcp will be rejected because the default release image picked does not fall within the support window
Expected results:
hcp should be created with the latest release image in the support window
Additional info:
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/128
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
NHC failed to watch Metal3 remediation template
Version-Release number of selected component (if applicable):
OCP4.13 and higher
How reproducible:
100%
Steps to Reproduce:
1. Create a Metal3RemediationTemplate
2. Install NHC v0.7.0
3. Create an NHC with the Metal3RemediationTemplate
Actual results:
E0131 14:07:51.603803 1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3RemediationTemplate: failed to list infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3RemediationTemplate: metal3remediationtemplates.infrastructure.cluster.x-k8s.io is forbidden: User "system:serviceaccount:openshift-workload-availability:node-healthcheck-controller-manager" cannot list resource "metal3remediationtemplates" in API group "infrastructure.cluster.x-k8s.io" at the cluster scope
E0131 14:07:59.912283 1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3Remediation: unknown
W0131 14:08:24.831958 1 reflector.go:539] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: failed to list infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3RemediationTemplate: metal3remediationtemplates.infrastructure.cluster.x-k8s.io is forbidden: User "system:serviceaccount:openshift-workload-availability:node-healthcheck-controller-manager" cannot list resource
Expected results:
No errors
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-provider-gcp/pull/215
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Use the new CRD when available, as the current one is being deprecated
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When upgrading clusters to any 4.13 version (from either 4.12 or 4.13), clusters with Hybrid Networking enabled appear to have a few DNS pods (not all) falling into CrashLoopBackOff status. Notably, the pods are failing readiness probes, and when deleted, they work without incident. Nearly all other aspects of the upgrade continue as expected.
Version-Release number of selected component (if applicable):
4.13.x
How reproducible:
Always for systems in use
Steps to Reproduce:
1. Configure initial cluster installation with OVNKubernetes and enable Hybrid Networking on 4.12 or 4.13, e.g. 4.13.13 2. Upgrade cluster to 4.13.z, e.g. 4.13.14
Actual results:
dns-default pods in CrashLoopBackOff status, failing Readiness probes
Expected results:
dns-default pods are rolled out without incident
Additional info:
Appears strongly related to OCPBUGS-13172. CU has kept an affected cluster with the DNS pod issue ongoing for additional investigating, if needed.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/265
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Pre-test greenboot checks fail during the scenario run due to OVN-K pods reporting a "failed" status.
Version-Release number of selected component (if applicable):
I believe this is only affecting `periodic-ci-openshift-microshift-main-ocp-metal-nightly` jobs.
How reproducible:
Unsure. Has occurred 2 times in consecutive daily-periodic jobs.
Steps to Reproduce:
n/a
Actual results:
- https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-microshift-main-ocp-metal-nightly/1753221880392716288 - https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-microshift-main-ocp-metal-nightly/1753403067350388736
Expected results:
OVN-K Pods should deploy into a healthy state
Additional info:
Hypershift's ignition server expects certain headers. Assisted needs to gather all of this data and pass it when fetching the ignition from hypershift's ignition server.
Work items:
Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/1978
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
See https://issues.redhat.com/browse/OCPBUGS-26053
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create an ETP=Local LB Service on LGW for some v6 workload (assign IP to lb with MetalLB or manually) 2. Set static routes to a node hosting a pod on the client 3. Attempt reaching the IPv6 Service fails
Actual results:
Expected results:
Additional info:
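For context, a minimal hedged sketch of the kind of Service described in step 1 (names, selector, and ports are illustrative; the load balancer IP can be assigned by MetalLB or manually as noted above):
```
apiVersion: v1
kind: Service
metadata:
  name: v6-workload            # hypothetical name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  ipFamilyPolicy: SingleStack
  ipFamilies:
  - IPv6
  selector:
    app: v6-workload           # hypothetical selector
  ports:
  - port: 80
    targetPort: 8080
```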
Please review the following PR: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/268
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When installing a new vSphere cluster with static IPs, control plane machine sets (CPMS) are also enabled in TechPreviewNoUpgrade and the installer applies the incorrect config to the CPMS resulting in masters being recreated.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
always
Steps to Reproduce:
1. create install-config.yaml with static IPs following documentation 2. run `openshift-install create cluster` 3. as install progresses, watch the machines definitions
Actual results:
new master machines are created
Expected results:
all machines are the same as what was created by the installer.
Additional info:
Please review the following PR: https://github.com/openshift/installer/pull/7817
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api-provider-ovirt/pull/176
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
When a MachineAutoscaler references a currently-zero-Machine MachineSet that includes spec.template.spec.taints, the autoscaler fails to deserialize that MachineSet, which causes it to fail to autoscale that MachineSet. The autoscaler's deserialization logic should be improved to avoid failing on the presence of taints.
Reproduced on 4.14.10 and 4.16.0-ec.1. Expected to be every release going back to at least 4.12, based on code inspection.
Always.
With a launch 4.14.10 gcp Cluster Bot cluster (logs):
$ oc adm upgrade Cluster version is 4.14.10 Upstream: https://api.integration.openshift.com/api/upgrades_info/graph Channel: candidate-4.14 (available channels: candidate-4.14, candidate-4.15) No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available. $ oc -n openshift-machine-api get machinesets.machine.openshift.io NAME DESIRED CURRENT READY AVAILABLE AGE ci-ln-s48f02k-72292-5z2hn-worker-a 1 1 1 1 29m ci-ln-s48f02k-72292-5z2hn-worker-b 1 1 1 1 29m ci-ln-s48f02k-72292-5z2hn-worker-c 1 1 1 1 29m ci-ln-s48f02k-72292-5z2hn-worker-f 0 0 29m
Pick that set with 0 nodes. They don't come with taints by default:
$ oc -n openshift-machine-api get -o json machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f | jq '.spec.template.spec.taints'
null
So patch one in:
$ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "add", "path": "/spec/template/spec/taints", "value": [{"effect":"NoSchedule","key":"node-role.kubernetes.io/ci","value":"ci"} ]}]'
machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched
And set up autoscaling:
$ cat cluster-autoscaler.yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  maxNodeProvisionTime: 30m
  scaleDown:
    enabled: true
$ oc apply -f cluster-autoscaler.yaml
clusterautoscaler.autoscaling.openshift.io/default created
I'm not all that familiar with autoscaling. Maybe the ClusterAutoscaler doesn't matter, and you need a MachineAutoscaler aimed at the chosen MachineSet?
$ cat machine-autoscaler.yaml
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: test
  namespace: openshift-machine-api
spec:
  maxReplicas: 2
  minReplicas: 1
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: ci-ln-s48f02k-72292-5z2hn-worker-f
$ oc apply -f machine-autoscaler.yaml
machineautoscaler.autoscaling.openshift.io/test created
Checking the autoscaler's logs:
$ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail -1 | grep taint W0122 19:18:47.246369 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci] W0122 19:18:58.474000 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci] W0122 19:19:09.703748 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci] W0122 19:19:20.929617 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci] ...
And the MachineSet is failing to scale:
$ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             50m
While if I remove the taint:
$ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "remove", "path": "/spec/template/spec/taints"}]'
machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched
The autoscaler... well, it's not scaling up new Machines like I'd expected, but at least it seems to have calmed down about the taint deserialization issue:
$ oc -n openshift-machine-api get machines.machine.openshift.io NAME PHASE TYPE REGION ZONE AGE ci-ln-s48f02k-72292-5z2hn-master-0 Running e2-custom-6-16384 us-central1 us-central1-a 53m ci-ln-s48f02k-72292-5z2hn-master-1 Running e2-custom-6-16384 us-central1 us-central1-b 53m ci-ln-s48f02k-72292-5z2hn-master-2 Running e2-custom-6-16384 us-central1 us-central1-c 53m ci-ln-s48f02k-72292-5z2hn-worker-a-fwskf Running e2-standard-4 us-central1 us-central1-a 45m ci-ln-s48f02k-72292-5z2hn-worker-b-qkwlt Running e2-standard-4 us-central1 us-central1-b 45m ci-ln-s48f02k-72292-5z2hn-worker-c-rlw4m Running e2-standard-4 us-central1 us-central1-c 45m $ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f NAME DESIRED CURRENT READY AVAILABLE AGE ci-ln-s48f02k-72292-5z2hn-worker-f 0 0 53m $ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail 50 I0122 19:23:17.284762 1 static_autoscaler.go:552] No unschedulable pods I0122 19:23:17.687036 1 legacy.go:296] No candidates for scale down W0122 19:23:27.924167 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci] I0122 19:23:28.510701 1 static_autoscaler.go:552] No unschedulable pods I0122 19:23:28.909507 1 legacy.go:296] No candidates for scale down W0122 19:23:39.148266 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci] I0122 19:23:39.737359 1 static_autoscaler.go:552] No unschedulable pods I0122 19:23:40.135580 1 legacy.go:296] No candidates for scale down W0122 19:23:50.376616 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci] I0122 19:23:50.963064 1 static_autoscaler.go:552] No unschedulable pods I0122 19:23:51.364313 1 legacy.go:296] No candidates for scale down W0122 19:24:01.601764 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci] I0122 19:24:02.191330 1 static_autoscaler.go:552] No unschedulable pods I0122 19:24:02.589766 1 legacy.go:296] No candidates for scale down I0122 19:24:13.415183 1 static_autoscaler.go:552] No unschedulable pods I0122 19:24:13.815851 1 legacy.go:296] No candidates for scale down I0122 19:24:24.641190 1 static_autoscaler.go:552] No unschedulable pods I0122 19:24:25.040894 1 legacy.go:296] No candidates for scale down I0122 19:24:35.867194 1 static_autoscaler.go:552] No unschedulable pods I0122 19:24:36.266400 1 legacy.go:296] No candidates for scale down I0122 19:24:47.097656 1 static_autoscaler.go:552] No unschedulable pods I0122 19:24:47.498099 1 legacy.go:296] No candidates for scale down I0122 19:24:58.326025 1 static_autoscaler.go:552] No unschedulable pods I0122 19:24:58.726034 1 legacy.go:296] No candidates for scale down I0122 19:25:04.927980 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache I0122 19:25:04.938213 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 10.036399ms I0122 19:25:09.552086 1 static_autoscaler.go:552] No unschedulable pods I0122 19:25:09.952094 1 legacy.go:296] No candidates for scale down I0122 19:25:20.778317 1 static_autoscaler.go:552] No unschedulable pods I0122 19:25:21.178062 1 legacy.go:296] No candidates for scale down I0122 19:25:32.005246 1 static_autoscaler.go:552] No unschedulable pods I0122 19:25:32.404966 1 legacy.go:296] 
No candidates for scale down I0122 19:25:43.233637 1 static_autoscaler.go:552] No unschedulable pods I0122 19:25:43.633889 1 legacy.go:296] No candidates for scale down I0122 19:25:54.462009 1 static_autoscaler.go:552] No unschedulable pods I0122 19:25:54.861513 1 legacy.go:296] No candidates for scale down I0122 19:26:05.688410 1 static_autoscaler.go:552] No unschedulable pods I0122 19:26:06.088972 1 legacy.go:296] No candidates for scale down I0122 19:26:16.915156 1 static_autoscaler.go:552] No unschedulable pods I0122 19:26:17.315987 1 legacy.go:296] No candidates for scale down I0122 19:26:28.143877 1 static_autoscaler.go:552] No unschedulable pods I0122 19:26:28.543998 1 legacy.go:296] No candidates for scale down I0122 19:26:39.369085 1 static_autoscaler.go:552] No unschedulable pods I0122 19:26:39.770386 1 legacy.go:296] No candidates for scale down I0122 19:26:50.596923 1 static_autoscaler.go:552] No unschedulable pods I0122 19:26:50.997262 1 legacy.go:296] No candidates for scale down I0122 19:27:01.823577 1 static_autoscaler.go:552] No unschedulable pods I0122 19:27:02.223290 1 legacy.go:296] No candidates for scale down I0122 19:27:04.938943 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache I0122 19:27:04.947353 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 8.319938ms
Scale-from-zero MachineAutoscaler fails on taint-deserialization when the referenced MachineSet contains spec.template.spec.taints.
Scale-from-zero MachineAutoscaler works, even when the referenced MachineSet contains spec.template.spec.taints.
Please review the following PR: https://github.com/openshift/egress-router-cni/pull/79
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Our test that watches for alerts we've never seen before has picked something up on https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-02-02-031544
[sig-trt][invariant] No new alerts should be firing 0s
{ Found alerts firing which are new or less than two weeks old, which should not be firing: PrometheusOperatorRejectedResources has no test data, this alert appears new and should not be firing}
It hit about 3-4 out of 10 on both azure and aws agg jobs. Could be a regression, could be something really rare.
Description of problem:
Unable to run oc commands in a FIPS-enabled OCP cluster on PowerVS
Version-Release number of selected component (if applicable):
4.15.0-ec2
How reproducible:
Deploy OCP cluster with FIPS enabled
Steps to Reproduce:
1. Enable the var in var.tfvars - fips_compliant = true 2. Deploy the cluster 3. run oc commands
Actual results:
[root@rdr-swap-fips-syd05-bastion-0 ~]# oc version FIPS mode is enabled, but the required OpenSSL library is not available [root@rdr-swap-fips-syd05-bastion-0 ~]# oc debug node/syd05-master-0.rdr-swap-fips.ibm.com FIPS mode is enabled, but the required OpenSSL library is not available [root@rdr-swap-fips-syd05-bastion-0 ~]# fips-mode-setup --check FIPS mode is enabled.
Expected results:
# oc debug node/syd05-master-0.rdr-swap-fips1.ibm.com Temporary namespace openshift-debug-dns7d is created for debugging node... Starting pod/syd05-master-0rdr-swap-fips1ibmcom-debug-hs4dr ... To use host binaries, run `chroot /host` Pod IP: 193.168.200.9
Additional info:
Not able to collect must gather logs due to the issue links - https://access.redhat.com/solutions/7034387
CNO-managed component (network-node-identity) should conform to the hypershift control plane expectation that all secrets are mounted without global read: change the mode from 420 (0644) to 416 (0640).
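As a hedged illustration of that change (Kubernetes expresses the mode as a decimal defaultMode on the secret volume; the volume and secret names below are illustrative, not taken from the actual manifests):
```
volumes:
- name: serving-cert                            # hypothetical volume name
  secret:
    secretName: network-node-identity-cert      # hypothetical secret name
    defaultMode: 416                            # 0640; previously 420 (0644)
```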
Please review the following PR: https://github.com/openshift/machine-api-provider-openstack/pull/100
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/101
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/console-operator/pull/823
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The whereabouts reconciler is responsible for reclaiming dangling IPs and freeing them to be allocated to new pods. This is crucial for scenarios where the number of addresses is limited and dangling IPs prevent whereabouts from successfully allocating new IPs to new pods. The reconciliation schedule is currently hard-coded to run once a day, without a user-friendly way to configure it.
Version-Release number of selected component (if applicable):
How reproducible:
Create a Whereabouts reconciler daemon set; the reconciler schedule cannot be configured.
Steps to Reproduce:
1. Create a Whereabouts reconciler daemonset, instructions: https://docs.openshift.com/container-platform/4.14/networking/multiple_networks/configuring-additional-network.html#nw-multus-creating-whereabouts-reconciler-daemon-set_configuring-additional-network 2. Run `oc get pods -n openshift-multus | grep whereabouts-reconciler` 3. Run `oc logs whereabouts-reconciler-xxxxx`
Actual results:
You can't configure the cron-schedule of the reconciler.
Expected results:
Be able to modify the reconciler cron schedule.
Additional info:
The fix for this bug is in two places: whereabouts and cluster-network-operator. For this reason, in order to verify correctly we need to use both fixed components. Please read below for more details about how to apply the new configuration.
How to Verify:
Create a whereabouts-config ConfigMap with a custom value, and check in the whereabouts-reconciler pods' logs that it is updated, and triggering the clean up.
Steps to Verify:
1. Create a Whereabouts reconciler daemonset 2. Wait for the whereabouts-reconciler pods to be running. (takes time for the daemonset to get created). 3. See in logs: "[error] could not read file: <nil>, using expression from flatfile: 30 4 * * *" This means it uses the hardcoded default value. (Because no ConfigMap yet) 4. Run: oc create configmap whereabouts-config -n openshift-multus --from-literal=reconciler_cron_expression="*/2 * * * *" 5. Check in the logs for: "successfully updated CRON configuration" 6. Check that in the next 2 minutes the reconciler runs: "[verbose] starting reconciler run"
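The same verification flow, collected as commands (taken directly from the steps above; the pod name suffix is a placeholder):
```
# Create the override and watch the reconciler pick it up
oc create configmap whereabouts-config -n openshift-multus \
  --from-literal=reconciler_cron_expression="*/2 * * * *"
oc get pods -n openshift-multus | grep whereabouts-reconciler
oc logs -n openshift-multus whereabouts-reconciler-xxxxx | grep "successfully updated CRON configuration"
oc logs -n openshift-multus whereabouts-reconciler-xxxxx | grep "starting reconciler run"
```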
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled
Please review the following PR: https://github.com/openshift/console/pull/13434
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
In order to use hostPath volumes, containers in kubernetes must be started with the privileged flag set. This is because this flag toggles an SELinux boolean that cannot be toggled by enabling any particular capability. (Empirical testing shows the same restriction does not apply to emptyDir volumes.)
Since the baremetal components rely on hostPath volumes for a number of purposes, this prevents many of them from running unprivileged.
However, there are a number of containers that do not use any hostPath volumes and need only an added capability, if anything. These should be specified explicitly instead of just setting privileged mode to enable everything.
Slack discussion here: https://redhat-internal.slack.com/archives/C02F1J9UJJD/p1702394712492839
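A hedged sketch of the intended direction: containers that only need a capability would declare it explicitly rather than running fully privileged (the capability shown is illustrative; hostPath-dependent containers would still need privileged mode per the SELinux constraint above):
```
# Before (hedged illustration): everything enabled
securityContext:
  privileged: true
# After (hedged illustration): only the capability the container actually needs
securityContext:
  privileged: false
  capabilities:
    add:
    - NET_ADMIN    # illustrative capability
```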
Repo: openshift/kubernetes/pkg/controller/podautoscaler MISSING: jkyros, joelsmith
Repo: openshift/kubernetes-autoscaler/vertical-pod-autoscaler MISSING: jkyros
Repo: openshift/vertical-pod-autoscaler-operator/ MISSING: jkyros
The openshift/kubernetes one was the only really weird one where there might not be a precedent. Looking at the kubernetes repo, it looks like the convention is to add a DOWNSTREAM_APPROVERS file as a carry patch for downstream approvers?
Included the following regions
Description of problem:
It was noticed that the openshift-hyperkube RPM, which is primarily (perhaps exclusively) used to install the kubelet in RHCOS or other environments, includes the kube-apiserver, kube-controller-manager, and kube-scheduler binaries. Those binaries are all built and used via container images, which as far as I can tell don't make use of the RPM.
Version-Release number of selected component (if applicable):
4.12 - 4.16
How reproducible:
100%
Steps to Reproduce:
1. rpm -ql openshift-hyperkube on any node 2. 3.
Actual results:
# rpm -ql openshift-hyperkube /usr/bin/hyperkube /usr/bin/kube-apiserver /usr/bin/kube-controller-manager /usr/bin/kube-scheduler /usr/bin/kubelet /usr/bin/kubensenter # ls -lah /usr/bin/kube-apiserver /usr/bin/kube-controller-manager /usr/bin/kube-scheduler /usr/bin/hyperkube /usr/bin/kubensenter /usr/bin/kubelet -rwxr-xr-x. 2 root root 945 Jan 1 1970 /usr/bin/hyperkube -rwxr-xr-x. 2 root root 129M Jan 1 1970 /usr/bin/kube-apiserver -rwxr-xr-x. 2 root root 114M Jan 1 1970 /usr/bin/kube-controller-manager -rwxr-xr-x. 2 root root 54M Jan 1 1970 /usr/bin/kube-scheduler -rwxr-xr-x. 2 root root 105M Jan 1 1970 /usr/bin/kubelet -rwxr-xr-x. 2 root root 3.5K Jan 1 1970 /usr/bin/kubensenter
Expected results:
Just the kubelet and deps on the host OS, that's all that's necessary
Additional info:
My proposed change would be for people who care about making this slim to install `openshift-hyperkube-kubelet` instead.
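In other words, under this proposal, a host that only needs the kubelet would install the slimmer subpackage (a hedged sketch; openshift-hyperkube-kubelet is the proposed, not yet existing, subpackage name):
```
dnf install -y openshift-hyperkube-kubelet
rpm -ql openshift-hyperkube-kubelet   # would list only the kubelet, kubensenter, and their deps
```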
primary_ipv4_address is deprecated in favor of primary_ip[*].address. Replace it with the new attribute.
Description of problem:
During the control plane upgrade e2e test, it seems that the openshift apiserver becomes unavailable during the upgrade process. The test is run on an HA control plane, and this should not happen.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Often
Steps to Reproduce:
1. Create a hosted cluster with HA control plane and wait for it to become available 2. Upgrade the hosted cluster to a newer release 3. While upgrading, monitor whether the openshift apiserver is available by either querying APIService resources or resources served by the openshift apiserver.
Actual results:
The openshift apiserver is unavailable at some point during the upgrade
Expected results:
The openshift apiserver is available throughout the upgrade
Additional info:
Description of problem:
The installer supports pre-rendering of the PerformanceProfile related manifests. However the MCO render is executed after the PerfProfile render and so the master and worker MachineConfigPools are created too late. This causes the installation process to fail with: Oct 18 18:05:25 localhost.localdomain bootkube.sh[537963]: I1018 18:05:25.968719 1 render.go:73] Rendering files into: /assets/node-tuning-bootstrap Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.008421 1 render.go:133] skipping "/assets/manifests/99_feature-gate.yaml" [1] manifest because of unhandled *v1.FeatureGate Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.013043 1 render.go:133] skipping "/assets/manifests/cluster-dns-02-config.yml" [1] manifest because of unhandled *v1.DNS Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.021978 1 render.go:133] skipping "/assets/manifests/cluster-ingress-02-config.yml" [1] manifest because of unhandled *v1.Ingress Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.023016 1 render.go:133] skipping "/assets/manifests/cluster-network-02-config.yml" [1] manifest because of unhandled *v1.Network Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.023160 1 render.go:133] skipping "/assets/manifests/cluster-proxy-01-config.yaml" [1] manifest because of unhandled *v1.Proxy Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.023445 1 render.go:133] skipping "/assets/manifests/cluster-scheduler-02-config.yml" [1] manifest because of unhandled *v1.Scheduler Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.024475 1 render.go:133] skipping "/assets/manifests/cvo-overrides.yaml" [1] manifest because of unhandled *v1.ClusterVersion Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: F1018 18:05:26.037467 1 cmd.go:53] no MCP found that matches performance profile node selector "node-role.kubernetes.io/master="
Version-Release number of selected component (if applicable):
4.14.0-rc.6
How reproducible:
Always
Steps to Reproduce:
1. Add an SNO PerformanceProfile to extra manifest in the installer. Node selector should be: "node-role.kubernetes.io/master=" 2. 3.
Actual results:
no MCP found that matches performance profile node selector "node-role.kubernetes.io/master="
Expected results:
Installation completes
Additional info:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-workload-partitioning-sno
spec:
  cpu:
    isolated: 4-X   # must match the topology of the node
    reserved: 0-3
  nodeSelector:
    node-role.kubernetes.io/master: ""
Following up from OCPBUGS-16357, we should enable health check of stale registration sockets in our operators.
We will need - https://github.com/kubernetes-csi/node-driver-registrar/pull/322 and we will have to enable healthcheck for registration sockets - https://github.com/kubernetes-csi/node-driver-registrar#example
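A hedged sketch of what enabling the health check could look like, based on the upstream node-driver-registrar README example linked above (the port, probe timings, and registration path are illustrative and may differ from what our operators end up rendering):
```
- name: csi-node-driver-registrar
  args:
  - --csi-address=/csi/csi.sock
  - --kubelet-registration-path=/var/lib/kubelet/plugins/<driver>/csi.sock
  - --health-port=9809                # illustrative port
  ports:
  - containerPort: 9809
    name: healthz
  livenessProbe:
    httpGet:
      path: /healthz
      port: healthz
    initialDelaySeconds: 10
    periodSeconds: 30
```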
Hosted cluster time to provision is outside the SLO of 99% of cluster provisions completing in less than 360s.
When collecting onprem events, we want to be able to distinguish among the various onprem deployments:
We should also make sure to forward this info when collecting events
We should also define a human-friendly version for each
Slack thread about the supported deployment types
https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1706209886659329
Environment variables for setting the deployment type (one of: podman, operator, ACM, MCE, ABI) and for setting the release version (if applicable)
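A hedged illustration of what setting these could look like (the variable names are hypothetical; the real names are to be defined by this card):
```
# Hypothetical variable names -- the actual names are to be defined by this work
export DEPLOYMENT_TYPE="ACM"        # one of: podman, operator, ACM, MCE, ABI
export DEPLOYMENT_VERSION="2.10.0"  # release version, if applicable
```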
The egress-router implementations under https://github.com/openshift/images/tree/master/egress have unit tests alongside the implementations, within the same repository, but the repository does not have a CI job to run those unit tests. We do not have any tests for egress-router in https://github.com/openshift/origin. This means that we are effectively lacking CI test coverage for egress-router.
All versions.
100%.
1. Open a PR in https://github.com/openshift/images and check which CI jobs are run on it.
2. Check the job definitions in https://github.com/openshift/release/blob/master/ci-operator/jobs/openshift/images/openshift-images-master-presubmits.yaml.
There are "ci/prow/e2e-aws", "ci/prow/e2e-aws-upgrade", and "ci/prow/images" jobs defined, but no "ci/prow/unit" job.
There should be a "ci/prow/unit" job, and this job should run the unit tests that are defined in the repository.
The lack of a CI job came up on https://github.com/openshift/images/pull/162.
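For illustration, a hedged sketch of the kind of ci-operator test entry that would add a "ci/prow/unit" job for openshift/images (the make target is an assumption about that repository's tooling):
```
tests:
- as: unit
  commands: make test
  container:
    from: src
```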
Document HostedCluster and HostedControlPlane manifests used by IBM Cloud.
Description of problem:
We see failures in this test: [Jira:"Networking / router"] monitor test service-type-load-balancer-availability setup 15m1s
{ failed during setup error waiting for load balancer: timed out waiting for service "service-test" to have a load balancer: timed out waiting for the condition}
See this https://search.ci.openshift.org/?search=error+waiting+for+load+balancer&maxAge=168h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job to find recent ones.
Example job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade/1754402739040817152
This has failed payloads like:
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-02-01-211543
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.ci/release/4.15.0-0.ci-2024-02-02-061913
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.ci/release/4.15.0-0.ci-2024-02-02-001913
Version-Release number of selected component (if applicable):
4.15 and 4.16
How reproducible:
intermittent as shown in the search.ci query above
Steps to Reproduce:
1. run the e2e tests on 4.15 and 4.16 2. 3.
Actual results:
timeouts on getting load balancer
Expected results:
no timeout and successful load balancer
Additional info:
https://issues.redhat.com/browse/TRT-1486 has more info thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1707142256956139
Description of problem:
Icons which were formerly blue are no longer blue.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
Description of problem:
Since the golang.org/x/oauth2 package has been upgraded, GCP installs have been failing with level=info msg=Credentials loaded from environment variable "GOOGLE_CLOUD_KEYFILE_JSON", file "/var/run/secrets/ci.openshift.io/cluster-profile/gce.json" level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [platform.gcp.project: Internal error: failed to create cloud resource service: Get "http://169.254.169.254/computeMetadata/v1/universe/universe_domain": dial tcp 169.254.169.254:80: connect: connection refused, : Internal error: failed to create compute service: Get "http://169.254.169.254/computeMetadata/v1/universe/universe_domain": dial tcp 169.254.169.254:80: connect: connection refused]
Version-Release number of selected component (if applicable):
4.16/master
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The bump has been introduced by https://github.com/openshift/installer/pull/8020
Description of problem:
Due to RHEL9 incorporating OpenSSL 3.0, HaProxy will refuse to start if provided with a cert using SHA1-based signature algorithm. RHEL9 is being introduced in 4.16. This means customers updating from 4.15 to 4.16 with a SHA1 cert will find their router in a failure state. My Notes from experimenting with various ways of using a cert in ingress: - Routes with SHA1 spec.tls.certificate WILL prevent HaProxy from reloading/starting - It is NOT limited to FIPs, I broke a non-FIPs cluster with this - Routes with SHA1 spec.tls.caCertificate will NOT prevent HaProxy starting, but route is rejected, due to extended route validation failure: - lastTransitionTime: "2024-01-04T20:18:01Z" message: 'spec.tls.certificate: Invalid value: "redacted certificate data": error verifying certificate: x509: certificate signed by unknown authority (possibly because of "x509: cannot verify signature: insecure algorithm SHA1-RSA (temporarily override with GODEBUG=x509sha1=1)" while trying to verify candidate authority certificate "www.exampleca.com")' - Routes with SHA1 spec.tls.destinationCACertificate will NOT prevent HaProxy from starting. It actually seems to work as expected - IngressController with SHA1 spec.defaultCertificate WILL prevent HaProxy from starting. - IngressController with SHA1 spec.clientTLS.clientCA will NOT prevent HaProxy from starting.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
1. Create a Ingress Controller with spec.defaultCertificate or a Route with spec.tls.certificate as a SHA1 cert 2. Roll out the router
Actual results:
Router fails to start
Expected results:
Router should start
Additional info:
We've previously documented via story in RHEL9 epic: https://issues.redhat.com/browse/NE-1449 The initial fix for this issue was merged as [https://github.com/openshift/router/pull/555]. This issue is currently causing some issues, notably causing the openshift/cluster-ingress-operator repository's {{TestRouteAdmissionPolicy}} E2E test to fail intermittently, which causes the e2e-azure, e2e-gcp-operator, and e2e-aws-operator CI jobs to fail intermittently.
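For reproducing step 1, a hedged sketch of generating a SHA1-signed certificate and pointing the default IngressController at it (the subject and secret name are illustrative):
```
openssl req -x509 -newkey rsa:2048 -sha1 -nodes -days 365 \
  -subj "/CN=apps.example.test" -keyout tls.key -out tls.crt
oc -n openshift-ingress create secret tls sha1-default-cert --cert=tls.crt --key=tls.key
oc -n openshift-ingress-operator patch ingresscontroller/default --type merge \
  -p '{"spec":{"defaultCertificate":{"name":"sha1-default-cert"}}}'
```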
In TRT-1476, we created a VM that served as an endpoint where we can test connectivity in gcp.
We want one for AWS.
In TRT-1477, we created some code in origin to send HTTP GETs to that endpoint as a test to ensure connectivity remains working. Do this also for AWS.
TRT members already have an AWS account so we don't need to request one.
Updating github.com/IBM/networking-go-sdk to 0.45.0 causes issues with PowerVS infra create, as the API has changed.
Description of problem:
When using oc-mirror with the v2 format, it saves the .oc-mirror dir to the default user directory, and the data is very large. There is currently no flag to specify the path for the logs; they should be saved to the working directory instead.
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.15.0-202312011230.p0.ge4022d0.assembly.stream-e4022d0", GitCommit:"e4022d08586406f3a0f92bab1d3ea6cb8856b4fa", GitTreeState:"clean", BuildDate:"2023-12-01T12:48:12Z", GoVersion:"go1.20.10 X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
run command : oc-mirror --from file://out docker://localhost:5000/ocptest --v2 --config config.yaml
Actual results:
The logs are saved to the user directory.
Expected results:
It would be better to have a flag to specify where to save the logs, or to use the working directory.
[sig-cluster-lifecycle][Feature:Machines] Managed cluster should [sig-scheduling][Early] control plane machine set operator should not have any events [Suite:openshift/conformance/parallel]
Looks like this test is permafailing on 4.16 and 4.15 AWS UPI jobs - does this need to be skipped on UPI?
{ fail [github.com/openshift/origin/test/extended/machines/machines.go:191]: Unexpected error: <*errors.StatusError | 0xc0031b8f00>: controlplanemachinesets.machine.openshift.io "cluster" not found { ErrStatus: code: 404 details: group: machine.openshift.io kind: controlplanemachinesets name: cluster message: controlplanemachinesets.machine.openshift.io "cluster" not found metadata: {} reason: NotFound status: Failure, } occurred Ginkgo exit error 1: exit with code 1}
Adam Kaplan will be assuming the role of "Staff Engineer" for core OpenShift for the team, taking over the role from Ben Parees. To expedite reviews and other OCP processes, Adam needs to be added back as an approver in the following repositories:
Adam may also need to be added to the OWNERS files in the following repos:
This looks very much like a 'downstream a thing' process, but only making a modification to an existing one.
Currently, the operator-framework-olm monorepo generates a self-hosting catalog from operator-registry.Dockerfile. This image also contains cross-compiled opm binaries for windows and mac, and joins the payload as ose-operator-registry.
To separate concerns, this introduces a new operator-framework-cli image which will be based on scratch, not self-hosting in any way, and just a container to convey repeatably produced o-f CLIs. Right now, this will focus on opm for olm v0 only, but others can be added in future.
Add `madhu-pillai` to the `coreos-approvers` and `coreos-reviewers` lists.
Description of problem:
tlsSecurityProfile definitions do not align with documentation. When using `oc explain` the field descriptions note that certain values are unsupported, but the same values are supported in the OpenShift Documentation. This needs to be clarified and the spacing should be fixed in the descriptions as they are hard to understand.
Version-Release number of selected component (if applicable):
4.14.1
How reproducible:
⇒ oc explain ingresscontroller.spec.tlsSecurityProfile.modern
Steps to Reproduce:
1. Check the `oc explain` output
Actual results:
⇒ oc explain ingresscontroller.spec.tlsSecurityProfile.modern
KIND:     IngressController
VERSION:  operator.openshift.io/v1

DESCRIPTION:
     modern is a TLS security profile based on:
     https://wiki.mozilla.org/Security/Server_Side_TLS#Modern_compatibility
     and looks like this (yaml):
       ciphers:
         - TLS_AES_128_GCM_SHA256
         - TLS_AES_256_GCM_SHA384
         - TLS_CHACHA20_POLY1305_SHA256
       minTLSVersion: TLSv1.3
     NOTE: Currently unsupported.
Expected results:
An output that aligns with the documentation regarding supported/unsupported TLS versions. Additionally, fixing the output format would be useful, as it is very hard to understand/read in its current form. Here in the 4.14 Documentation, it states: ``` The HAProxy Ingress Controller image supports TLS 1.3 and the Modern profile. ```
Additional info:
The `apiserver` CR should also be checked for the same thing.
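For reference, a minimal hedged sketch of selecting the Modern profile on the IngressController, which the documentation says HAProxy supports but `oc explain` still labels unsupported (that mismatch is exactly what this bug asks to clarify):
```
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  tlsSecurityProfile:
    type: Modern
```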
Please review the following PR: https://github.com/openshift/cluster-api-provider-aws/pull/491
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
While debugging a problem, I noticed some containers lack FallbackToLogsOnError. This is important for debugging via the API. Found via https://github.com/openshift/origin/pull/28547
As a HyperShift Engineer, I want to be able to:
so that I can achieve
As a HyperShift Engineer, I want to be able to:
so that I can achieve
As a HyperShift Engineer, I want to be able to:
so that I can achieve
Description of criteria:
This does not require a design proposal.
This does not require a feature gate.
The flowcontrol manifests in the following operators (kas, oas, etcd, openshift controller manager, auth, and network) should use v1.
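A hedged illustration of the intended change in those manifests, with everything else unchanged (the FlowSchema name is illustrative):
```
# Hedged sketch: only the apiVersion changes in the rendered manifests
apiVersion: flowcontrol.apiserver.k8s.io/v1   # previously flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: openshift-example    # illustrative name
spec:
  priorityLevelConfiguration:
    name: workload-high
```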
Description of problem:
Bootstrap process failed due to API_URL and API_INT_URL are not resolvable: Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'. Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time. Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 1. Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster. Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time. Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster. Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Check if API and API-Int URLs are resolvable during bootstrap Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_URL is resolvable Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-url Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_URL api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_INT_URL is resolvable Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-int-url Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_INT_URL api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8905]: https://localhost:2379 is healthy: successfully committed proposal: took = 7.880477ms Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting cluster-bootstrap... Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Starting temporary bootstrap control plane... Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Waiting up to 20m0s for the Kubernetes API Feb 06 06:42:00 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: API is up install logs: ... time="2024-02-06T06:54:28Z" level=debug msg="Unable to connect to the server: dial tcp: lookup api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com on 169.254.169.254:53: no such host" time="2024-02-06T06:54:28Z" level=debug msg="Log bundle written to /var/home/core/log-bundle-20240206065419.tar.gz" time="2024-02-06T06:54:29Z" level=error msg="Bootstrap failed to complete: timed out waiting for the condition" time="2024-02-06T06:54:29Z" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane." ...
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-02-05-184957,openshift/machine-config-operator#4165
How reproducible:
Always.
Steps to Reproduce:
1. Enable custom DNS on gcp: platform.gcp.userProvisionedDNS:Enabled and featureSet:TechPreviewNoUpgrade 2. Create cluster 3.
Actual results:
Failed to complete bootstrap process.
Expected results:
See description.
Additional info:
I believe 4.15 is affected as well once https://github.com/openshift/machine-config-operator/pull/4165 backport to 4.15, currently, it failed at an early phase, see https://issues.redhat.com/browse/OCPBUGS-28969
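For reference, the configuration from step 1 expressed as an install-config fragment (a hedged sketch; all other required fields are omitted):
```
featureSet: TechPreviewNoUpgrade
platform:
  gcp:
    userProvisionedDNS: Enabled
```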
When installation fails, status_info is reporting an incorrect status
Most likely it is one of these two scenarios:
David mention this issue here: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1702312628947029
In duplicated_event_patterns, I think it's creating a blackout range (events are OK during time X) and then checking the time range itself, but it doesn't appear to exclude the change in counts.
The count of the last event within the allowed range should be subtracted from the first event that is outside of the allowed time range for pathological event test calculation.
David has a demonstration of the count here: https://github.com/openshift/origin/pull/28456 but to fix it you have to invert testDuplicatedEvents to iterate through the event registry, not the events
Slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1705425516419799
A revision controller is spinning up too many revisions.
Goal: update the revision controller code to temporarily log config changes to validate that the newly created revisions are valid. Or, prove that some new revisions are unnecessary.
Remove the github handle of Haoyu Sun from OWNER files of monitoring components.
Please review the following PR: https://github.com/openshift/cluster-authentication-operator/pull/644
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Invalid CN is not bubbled up in the CR
Version-Release number of selected component (if applicable):
4.15.0-rc7
How reproducible:
always
Steps to Reproduce:
# generate a key with invalid CN openssl genrsa -out myuser4.key 2048 openssl req -new -key myuser4.key -out myuser4.csr -subj "/CN=baduser/O=system:masters" # get cert in the CSR # apply the CSR # Status remains in Accepted, but it is not Issued % oc get csr | grep 29ecg6n5bkugrh6io4his24ser3bt16n-5-customer-break-glass-csr 29ecg6n5bkugrh6io4his24ser3bt16n-5-customer-break-glass-csr 4m29s hypershift.openshift.io/ocm-integration-29ecg6n5bkugrh6io4his24ser3bt16n-ad-int1.customer-break-glass system:admin 60m Approved # No status in the CSR status: conditions: - lastTransitionTime: "2024-02-16T14:06:41Z" lastUpdateTime: "2024-02-16T14:06:41Z" message: The requisite approval resource exists. reason: ApprovalPresent status: "True" type: Approved # pki controller shows the error oc logs control-plane-pki-operator-bf6d75d5f-h95rf -n ocm-integration-29ecg6n5bkugrh6io4his24ser3bt16n-ad-int1 | grep "29ecg6n5bkugrh6io4his24ser3bt16n-5-customer-break-glass-csr" I0216 14:06:41.842414 1 event.go:298] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"ocm-integration-29ecg6n5bkugrh6io4his24ser3bt16n-ad-int1", Name:"control-plane-pki-operator", UID:"b63dbaa9-18f7-4ee6-8473-8a38bdb6f2df", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CertificateSigningRequestApproved' "29ecg6n5bkugrh6io4his24ser3bt16n-5-customer-break-glass-csr" in is approved I0216 14:06:41.848623 1 event.go:298] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"ocm-integration-29ecg6n5bkugrh6io4his24ser3bt16n-ad-int1", Name:"control-plane-pki-operator", UID:"b63dbaa9-18f7-4ee6-8473-8a38bdb6f2df", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CertificateSigningRequestInvalid' "29ecg6n5bkugrh6io4his24ser3bt16n-5-customer-break-glass-csr" is invalid: invalid certificate request: subject CommonName must begin with "system:customer-break-glass:"
Actual results:
Expected results:
The status in the CR should show the failure and the error.
Additional info: