OpenShift 4.13
https://github.com/openshift/installer/pull/6770 https://github.com/openshift/installer/pull/6782 https://github.com/openshift/installer/pull/6750 https://github.com/openshift/installer/pull/6738 https://github.com/openshift/installer/pull/6612 https://github.com/openshift/installer/pull/6327 https://github.com/openshift/api/pull/1388 https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/224 https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/218 https://github.com/openshift/openshift-docs/pull/54788 https://github.com/openshift/installer/pull/6905

Who	What	Reference
DEV	Upstream roadmap issue (or individual upstream PRs)	<link to GitHub Issue>
DEV	Upstream documentation merged	<link to meaningful PR>
DEV	gap doc updated	<name sheet and cell>
DEV	Upgrade consideration	<link to upgrade-related test or design doc>
DEV	CEE/PX summary presentation	label epic with cee-training and add a <link to your support-facing preso>
QE	Test plans in Polarion	<link or reference to Polarion>
QE	Automated tests merged	<link or reference to automated tests>
DOC	Downstream documentation merged	<link to meaningful PR>

Feature CNV-11934: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic CNV-29411: [CNV must-gather] one click solution for getting all the debug logs

Goal

We want to add a new CLI option named `--all-images` to the `oc adm must-gather` command.
The oc client will scan all the CSVs on the cluster for the `operators.openshift.io/must-gather-image=pullUrlOfTheImage` annotation and build a list of must-gather images covering all the products installed on the cluster. The client will then execute all of them and aggregate the collection results.
Operator authors who want to opt in to this mechanism should explicitly annotate their CSV with the `operators.openshift.io/must-gather-image` annotation.
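
The scan described above can be sketched as follows. This is a minimal illustration, not the oc implementation: it assumes the CSVs have already been fetched and decoded into plain dictionaries, and the helper names `collect_must_gather_images` and `build_must_gather_command` are hypothetical. Only the annotation key and the existing repeatable `--image` flag come from the epic text and the current CLI.

```python
from typing import Iterable, List, Mapping

# Annotation key taken from the epic text.
MUST_GATHER_ANNOTATION = "operators.openshift.io/must-gather-image"


def collect_must_gather_images(csvs: Iterable[Mapping]) -> List[str]:
    """Return the unique must-gather image pull URLs, in first-seen order."""
    images: List[str] = []
    for csv in csvs:
        # CSVs that did not opt in simply have no such annotation.
        annotations = csv.get("metadata", {}).get("annotations") or {}
        image = annotations.get(MUST_GATHER_ANNOTATION, "").strip()
        if image and image not in images:
            images.append(image)
    return images


def build_must_gather_command(images: List[str]) -> List[str]:
    """Build the equivalent explicit command line: one --image flag per image."""
    return ["oc", "adm", "must-gather"] + ["--image=" + i for i in images]
```

Deduplicating keeps a must-gather image from running twice when several CSVs point at the same pull URL.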

User Stories

As a Cluster Admin, I want to execute `oc adm must-gather` only once and collect all the logs for all the products deployed on my cluster, without having to explicitly specify a long (and error-prone) list of images.
As a Red Hat support engineer, I want to be able to suggest to customers a single, simple command that collects must-gather logs for different products in a single pass, reducing the request ping-pong game and therefore the case resolution time.

Story CNV-37284: [contx2] Convince OCP maintainers this is relevant and should be maintained in oc client

https://github.com/openshift/oc/pull/1633

Feature OBSDA-553: Add support to OpenShift Telemetry to report the provider that has been added via "platform: external"

Proposed title of this feature request

Add support to OpenShift Telemetry to report the provider that has been added via "platform: external"

What is the nature and description of the request?

OpenShift 4.14 added support for a new platform called "external", which lets partners enable and support their own integrations with OpenShift rather than requiring Red Hat to develop and support them.

When deploying OpenShift using "platform: external", we currently cannot identify the provider where the platform has been deployed, which is key for the product team to analyze demand and other metrics.

Why does the customer need this? (List the business requirements)

OpenShift Product Management needs this information to analyze adoption of these new platforms as well as other metrics specifically for these platforms to help us to make decisions for the product development.

List any affected packages or components.

Telemetry for OpenShift

There is some additional information in the following Slack thread --> https://redhat-internal.slack.com/archives/CEG5ZJQ1G/p1698758270895639

https://github.com/openshift/cluster-kube-apiserver-operator/pull/1638

Feature OBSDA-666: CCX OCP core maintenance 2024

Placeholder feature for ccx-ocp-core maintenance tasks.

Epic CCXDEV-12234: IO maintenance OCP 4.16

This epic tracks "business as usual" requirements, enhancements, and bug fixes for Insights Operator.

Bug OCPBUGS-30515: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/insights-operator/pull/918

Feature OCPPLAN-4577: The details of this Jira Card are restricted (Only Red Hat employees and contractors)

Epic MULTIARCH-4268: Autoscaling from zero on MA-compute - other providers and upstream work

Filing a ticket based on this conversation here: https://github.com/openshift/enhancements/pull/1014#discussion_r798674314

Basically, the tl;dr is that we need a way to ensure that machinesets properly advertise the architecture that their nodes will eventually have. The autoscaler needs this to predict the correct pool to scale up or down. This could be accomplished through user-driven means, such as adding node arch labels to machinesets; if we have to do it automatically, we need to do more research and figure out a way.
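
To illustrate why the advertised architecture matters, here is a rough sketch of the selection the autoscaler has to make when a pool is at zero replicas. It is not autoscaler code: the machineset dictionaries and their `name` and `node_labels` fields are hypothetical simplifications, and only the well-known `kubernetes.io/arch` node label is taken from Kubernetes itself.

```python
from typing import Dict, List

# Well-known Kubernetes node label carrying the CPU architecture.
ARCH_LABEL = "kubernetes.io/arch"


def machinesets_for_arch(machinesets: List[Dict], arch: str) -> List[str]:
    """Return the names of machinesets that declare nodes of the given arch.

    Each machineset is a simplified dict (hypothetical shape):
      {"name": "...", "node_labels": {"kubernetes.io/arch": "arm64"}}
    A machineset with no declared arch label cannot be matched, which is
    exactly the gap described above for scale-from-zero.
    """
    matches = []
    for ms in machinesets:
        labels = ms.get("node_labels", {})
        if labels.get(ARCH_LABEL) == arch:
            matches.append(ms["name"])
    return matches
```

With no running node to inspect, a pool that does not declare its architecture is invisible to this kind of lookup, so a pending arm64 pod could never trigger a scale-up of the right pool.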

Story MULTIARCH-4302: CAPI cmdline flag -> env var.

Feature OCPSTRAT-1001: Remove Terraform from the PowerVS installer

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision PowerVS infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

The PowerVS IPI Installer no longer contains or uses Terraform.
The new provider should aim to provide the same results and have parity with the existing PowerVS Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Epic MULTIARCH-4037: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story MULTIARCH-4098: Installer changes

Sub-task MULTIARCH-4158: Provision PowerVS with CAPI

Implement an initial PowerVS provider for CAPI.

https://github.com/openshift/installer/pull/8060

Sub-task MULTIARCH-4095: Generate machine manifests

Generate the machine manifests. Follow how it's done for AWS [0] and the PoC code [1]

[0] https://github.com/openshift/installer/blob/master/pkg/asset/machines/aws/awsmachines.go

[1] https://github.com/Karthik-K-N/installer/commit/2ee454c8aab89bb9e50c3fed30825439f5773fac#diff-40e07ddf8cfbe74c687ceb18e8cf354792d09e210bbf9e7dda0bf632b762b3e2

https://github.com/openshift/installer/pull/8020

Feature OCPSTRAT-1002: Remove Terraform from the OpenStack installer

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for OpenStack deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision OpenStack infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

The OpenStack Installer no longer contains or uses Terraform.
The new provider should aim to provide the same results and have parity with the existing OpenStack Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Epic OSASINFRA-3332: Integrate CAPO in Installer

Goal

Create cluster and OpenStackCluster resource for the install-config.yaml
Create OpenStackMachine
Remove terraform dependency for OpenStack

Why is this important?

To have a CAPO cluster functionally equivalent to the installer

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement

Task OSASINFRA-3363: Merge current capo-integration branch to installer

We need to get CI on this PR in good shape https://github.com/openshift/installer/pull/7939 so we can look for reviews

https://github.com/openshift/installer/pull/7939

Task OSASINFRA-3362: Ensure bootstrap machine is deleted

Right now, when trying an installation with this work (https://github.com/openshift/installer/pull/7939), the bootstrap machine is not deleted. We need to ensure it is gone once bootstrapping is finalized.

https://github.com/openshift/installer/pull/8104

Task OSASINFRA-3371: Pass rhcosImage and manifests to the PreProvision hook

This is needed to identify whether masters are schedulable and to upload the RHCOS image to Glance.

https://github.com/openshift/installer/pull/7967

Feature OCPSTRAT-1003: Remove Terraform from the IBM Cloud VPC IPI installer

Feature Overview (aka. Goal Summary)

To avoid an increased support overhead once the license changes at the end of the year, we want to provision IBM Cloud VPC infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

The IBM Cloud VPC IPI Installer no longer contains or uses Terraform.
The new provider should aim to provide the same results and have parity with the existing IBM Cloud VPC Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Epic CORS-3278: Replace Terraform with CAPI Provider for IBM Cloud

Epic Goal

Replace Terraform infrastructure provisioning with CAPI-based approach

Story CORS-3281: Provision CAPI Controller

https://github.com/openshift/installer/pull/8090

Feature OCPSTRAT-1006: Remove Terraform from the GCP IPI installer

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer - specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision GCP infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

The GCP IPI Installer no longer contains or uses Terraform.
The new provider should aim to provide the same results and have parity with the existing GCP Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Epic CORS-3196: Provision GCP with CAPI (no mgmt cluster)

Epic Goal

Provision GCP infrastructure without the use of Terraform

Why is this important?

Removing Terraform from Installer

Scenarios

The new provider should aim to provide the same results as the existing GCP Terraform provider.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

https://github.com/openshift/installer/pull/7968

Story CORS-3259: create DNS records

User Story:

I want to create the public and private DNS records using one of the CAPI interface SDK hooks.

Acceptance Criteria:

Description of criteria:

Vanilla install: create public DNS record, private zone, private record
Shared VPC: check for existing private DNS zone
Publish Internal: create private zone, private record
Custom DNS scenario: create no records, or zone
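
The four scenarios above amount to a small decision table, sketched here for illustration. This is not installer code: `DNSPlan` and `plan_dns` are hypothetical names, and skipping zone creation when a Shared VPC private zone already exists is an assumption about what "check for existing private DNS zone" implies. `External` and `Internal` are the install-config publish strategies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DNSPlan:
    create_public_record: bool
    create_private_zone: bool
    create_private_record: bool


def plan_dns(publish: str,
             shared_vpc_zone_exists: bool = False,
             custom_dns: bool = False) -> DNSPlan:
    """Decide which DNS resources the hook should create."""
    if custom_dns:
        # Custom DNS scenario: create no records and no zone.
        return DNSPlan(False, False, False)
    return DNSPlan(
        # Only externally published clusters get a public record.
        create_public_record=(publish == "External"),
        # Assumption: reuse a pre-existing private zone in a Shared VPC.
        create_private_zone=not shared_vpc_zone_exists,
        create_private_record=True,
    )
```

For example, `plan_dns("Internal")` yields a private zone and a private record but no public record, matching the "Publish Internal" criterion.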

Engineering Details:

Value for DNS record for capi-provisioned LB, can be grabbed from cluster manifest
The value for the SDK provisioned LB will need to be handled within our hook

Bug OCPBUGS-30057: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/installer/pull/8083

Story CORS-3207: Create GCP machines for CAPI

Machines for GCP need to be generated for use in CAPI. This will be similar to the AWS machine implementation
(https://github.com/openshift/installer/blob/master/pkg/asset/machines/aws/awsmachines.go) added in
https://github.com/openshift/installer/pull/7771

https://github.com/openshift/installer/pull/7944

Story CORS-3220: Allow CAPG installation

User Story:

I want the installer to choose the correct provisioning path when a cluster should be installed with CAPI, that is, to stop using Terraform for GCP installations when the user has selected the featureSet for CAPI, so that I can achieve a CAPI-based GCP installation.

https://github.com/openshift/installer/pull/8011

Story CORS-3213: Create Cluster Manifest

User Story:

I want to create the GCP cluster manifest for CAPI installs, so that the manifests in <assets-dir>/cluster-api are applied to bootstrap ignition and the files find their way to the machines.

https://github.com/openshift/installer/pull/7917

Story CORS-3257: assign service accounts to vms

User Story:

I want the installer to create the service accounts that would be assigned to control plane and compute machines, similar to what is done in terraform now.

Acceptance Criteria:

Description of criteria:

Control plane and compute service accounts are created with appropriate permissions (see details)
Service accounts are attached to machines
Skip creation of control-plane service accounts when they are specified in the install config

Engineering Details:

Service accounts are needed by the machines, so they could be created in either PreProvision or InfraReady
Compute node service account permissions are captured here: https://github.com/openshift/installer/blob/master/data/data/gcp/cluster/iam/main.tf
Control plane IAM is here: https://github.com/openshift/installer/blob/master/data/data/gcp/cluster/master/main.tf#L5-L38
The service accounts should be specified in the machine spec.

https://github.com/openshift/installer/pull/8066

Story CORS-3212: Ignite control-plane machines

When installing on GCP, I want control-plane (including bootstrap) machines to bootstrap using ignition.

I want bootstrap ignition to be secured so that sensitive data is not publicly available.

Acceptance Criteria:

Description of criteria:

Control-plane machines pull ignition (boot successfully)
Bootstrap ignition is not public (typically signed url)
Service account is not required for signed url (stretch goal)
Should be labeled (with owned and user tags)

(optional) Out of Scope:

Destroying bootstrap ignition can be handled separately.

Engineering Details:

CAPG does not support ignition, so we will need to determine what to pass in Bootstrap to allow passing ignition stub in the user data.
terraform: https://github.com/openshift/installer/blob/master/data/data/gcp/cluster/master/main.tf#L85
- create storage bucket
- upload bootstrap ignition to object in storage bucket
- create signed url for object
- pass signed url in ignition stub
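
The last step above, passing the signed URL in an ignition stub, can be sketched as below. This only illustrates the shape of the stub, assuming the bucket, object upload, and signed URL have already been produced by other code; `ignition_stub` is a hypothetical helper, and the spec version `3.2.0` is an assumed default rather than the version the installer uses. The `ignition.config.replace.source` field is part of the Ignition v3 configuration specification.

```python
import json


def ignition_stub(signed_url: str, version: str = "3.2.0") -> str:
    """Return a pointer ignition config that replaces itself with the
    full bootstrap ignition fetched from the signed URL.

    The stub stays small enough for instance user data, and the real
    (sensitive) ignition is only reachable through the signed URL.
    """
    stub = {
        "ignition": {
            "version": version,  # assumed spec version for illustration
            "config": {"replace": {"source": signed_url}},
        }
    }
    return json.dumps(stub)
```

Because the stub carries only a URL, the bucket object itself can stay private, which is what the "Bootstrap ignition is not public" criterion asks for.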

https://github.com/openshift/installer/pull/8027

Story CORS-3208: Create GCP Infrastructure controller

Create the GCP Infrastructure controller in /pkg/clusterapi/system.go.
It will be based on the AWS controller in that file, which was added in https://github.com/openshift/installer/pull/7630.

https://github.com/openshift/installer/pull/7940

Bug OCPBUGS-30058: CAPG Bootstrap machine never creates public ip

Description of problem:

The bootstrap machine never gets a public IP address. When the publish strategy is set to External, the bootstrap machine should have a public IP address.

How reproducible:

always

https://github.com/openshift/installer/pull/8079

Feature OCPSTRAT-1007: Remove Terraform from the AWS IPI installer

Feature Overview (aka. Goal Summary)

To avoid an increased support overhead once the license changes at the end of the year, we want to provision AWS infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

The AWS IPI Installer no longer contains or uses Terraform.
The new provider should aim to provide the same results and have parity with the existing AWS Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Epic CORS-2890: Provision AWS with CAPI (no mgmt cluster)

Epic Goal

Provision AWS infrastructure without the use of Terraform

Why is this important?

This is a key piece in producing a terraform-free binary for ROSA. See parent epic for more details.

Scenarios

The new provider should aim to provide the same results as the existing AWS terraform provider.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

Bug CORS-3288: [capi][aws][bug] Load Balancer reconciliation must check scheme for correct LB

The scheme check [1] in the LB reconciliation is hardcoded to check the primary load balancer only. As a result, subnets are always filtered according to the primary load balancer's scheme, ignoring additional load balancers ("SecondaryControlPlaneLoadBalancer").

How to reproduce:

Create a cluster with a secondary load balancer whose scheme is internet-facing
Check the subnets of the secondary load balancer, that is, the "ext" (public) API load balancer

Actual results:

Private subnets attached to the SecondaryControlPlaneLoadBalancer

Expected results:

Public subnets attached to the SecondaryControlPlaneLoadBalancer
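
The expected behaviour can be sketched as a subnet filter driven by the scheme of the load balancer being reconciled, rather than by the primary one. This is an illustrative model only, not the CAPA code referenced in [1]: `subnets_for_load_balancer` and the subnet dictionaries with `id` and `public` keys are hypothetical. `internet-facing` and `internal` are the AWS load balancer scheme values.

```python
from typing import Dict, List


def subnets_for_load_balancer(lb_scheme: str, subnets: List[Dict]) -> List[str]:
    """Pick subnet IDs matching the scheme of the given load balancer.

    Each subnet is a simplified dict (hypothetical shape):
      {"id": "subnet-...", "public": True}
    An internet-facing LB gets public subnets, an internal LB private ones.
    """
    want_public = lb_scheme == "internet-facing"
    return [s["id"] for s in subnets if s["public"] == want_public]
```

The reported bug corresponds to always calling such a filter with the primary load balancer's scheme, so an internet-facing secondary load balancer ends up with the private subnets of an internal primary.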

References:

[1] https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/pkg/cloud/services/elb/loadbalancer.go#L253-L255

https://github.com/openshift/installer/pull/8114

Story CORS-2892: userTags are propagated to all installer created resources

When installconfig.platform.aws.userTags is specified, all taggable resources should have the specified user tags.

Manifests:
- cluster
- machines
Non-capi provisioned resources:
- IAM roles
- Load balancer resources
- DNS resources (private zone is tagged)
Check compatibility of how CAPA provider is tagging bootstrap S3 bucket
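
A minimal sketch of the tag set each taggable resource would end up with is shown below. It is an assumption-laden illustration, not installer code: `resource_tags` is a hypothetical helper, and letting the cluster ownership tag take precedence over a conflicting user tag is an assumed policy. The `kubernetes.io/cluster/<infraID>: owned` tag is the ownership tag the installer conventionally applies to resources it creates.

```python
from typing import Dict


def resource_tags(infra_id: str, user_tags: Dict[str, str]) -> Dict[str, str]:
    """Merge userTags with the cluster ownership tag for one resource."""
    tags = dict(user_tags)  # copy so the install config data is not mutated
    # Assumption: the ownership tag wins if a user tag uses the same key.
    tags["kubernetes.io/cluster/" + infra_id] = "owned"
    return tags
```

Applying one helper like this to manifests and to the non-CAPI resources listed above is one way to keep the tag set consistent across both provisioning paths.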

https://github.com/openshift/installer/pull/8150

Story CORS-2900: iamRole is set for control plane

The IAM role is correctly attached to control plane nodes when installconfig.controlPlane.platform.aws.iamRole is specified.

https://github.com/openshift/installer/pull/8031

Feature OCPSTRAT-1010: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic MULTIARCH-4116: Agent Based Installer CI using existing yellow-zone hardware

Epic Goal
Through this epic, we will update our CI to use an agent-based workflow instead of the libvirt openshift-installer, allowing us to eliminate the use of Terraform in our deployments.

Why is this important?
There is an active initiative in OpenShift to remove Terraform from the OpenShift installer.

Acceptance Criteria

All tasks within the epic are completed.

Done Checklist

CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

Story MULTIARCH-4159: Evaluate if multi-arch 'yq4' image can replace jq in CI

As a CI job author, I would like to be able to reference a yaml/json parsing tool that works across architectures and doesn't need to be downloaded for each unique step.

Rafael pointed out that Alessandro added multi-arch containers for yq for the UPI installer:
https://github.com/openshift/release/pull/49036#discussion_r1499870554

yq should have the ability to parse json.

We should evaluate if this can be added to the libvirt-installer image as well, and then used by all of our libvirt CI steps.

https://github.com/openshift/installer/pull/8098

Feature OCPSTRAT-1011: Ports used by OCP should have TLS support - phase 2

The customer has escalated the following issues where ports don't have TLS support. This feature request lists all the component ports raised by the customer.

Details here https://docs.google.com/document/d/1zB9vUGB83xlQnoM-ToLUEBtEGszQrC7u-hmhCnrhuXM/edit

https://access.redhat.com/solutions/5437491

Epic OCPNODE-1881: secure port 9537 with TLS

Currently, we serve the metrics over plain HTTP on port 9537; we need to upgrade this to use TLS.

Story OCPNODE-2099: Origin tolerations test change for kube-rbac-proxy-crio

https://github.com/openshift/origin/pull/28636

Story OCPNODE-2029: Migrate crio metrics in openshift/origin to https

https://github.com/openshift/origin/pull/28536

Story OCPNODE-2022: Modify OpenShift Monitoring Operator to use the secured rbac proxy port

https://github.com/openshift/cluster-monitoring-operator/pull/2229

Feature OCPSTRAT-1019: Improve capabilities in Pipelines Dynamic Plugin and Static pages

Feature Overview (aka. Goal Summary)

Enhance Dynamic plugin with similar capabilities as Static page. Add new control and security related enhancements to Static page.

Goals (aka. expected user outcomes)

The Dynamic plugin should list pipelines similar to the current static page.

The Static page should allow users to override task and sidecar task parameters.

The Static page should allow users to control tasks that are setup for manual approval.

The TSSC security and compliance policies should be visible in Dynamic plugin.

Epic ODC-7446: Integrate pipelines manual approval gate with UI (devconsole)

View the Description

As a DevOps Engineer, I want to add manual approval points in my pipeline so that the pipeline pauses at that point and waits for a manual approval before continuing execution. Manual approvals are commonly used for approving deployments to production or for modeling activities that are not automated (e.g. manual testing) in the pipeline.

Acceptance Criteria

User defines a manual approval point in their pipeline
PipelineRuns for the user's pipeline pause at the point where manual approval is defined and wait for approval
Once user approves the PipelineRun, the PipelineRun continues execution
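
The three acceptance criteria above can be illustrated with a small model. This is only a sketch of the pause-and-resume behavior, not how Tekton or the console implements it; the class name, the status strings, and the idea of modeling the approval point as a named task in the task list are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineRunModel:
    """Toy model of a PipelineRun with manual approval points (hypothetical)."""
    tasks: list                      # task names in execution order
    approval_points: set             # task names that require manual approval
    approved: set = field(default_factory=set)
    executed: list = field(default_factory=list)

    def approve(self, task: str) -> None:
        # Record a manual approval for the given approval point.
        self.approved.add(task)

    def advance(self) -> str:
        # Run the remaining tasks until the run finishes or reaches an
        # approval point that has not been approved yet.
        for task in self.tasks[len(self.executed):]:
            if task in self.approval_points and task not in self.approved:
                return "Paused"
            self.executed.append(task)
        return "Succeeded"


run = PipelineRunModel(
    tasks=["build", "approve-prod", "deploy"],
    approval_points={"approve-prod"},
)
status_before = run.advance()   # stops at "approve-prod"
run.approve("approve-prod")
status_after = run.advance()    # continues to the end
```

In this model the run pauses exactly at the approval point and resumes from the same position once approval is recorded, which mirrors the second and third criteria.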

Bug OCPBUGS-29513: Update the Pipeline List and Details Pages to acknowledge Custom Task

View the Description View the linked PRs

Description of problem:

Update the Pipeline/PipelineRun List and Details Pages to acknowledge Custom Task

Version-Release number of selected component (if applicable):

4.16.0

How reproducible:

Always when Custom Task is used

Steps to Reproduce:

    1. Create a Pipeline with a CustomTask like the Approval Task
    2. Check the Tasks List in the Pipeline/Run Details Page
    3. Check the Progress Bar in the PipelineRun List page

Actual results:

The CustomTask is not recognized. The pages either throw undefined or always show the Task as pending.

Expected results:

Like normal Tasks, Custom Tasks (CT) should be fully integrated into the pages mentioned above.

Additional info:

https://github.com/openshift/console/pull/13614

Epic ODC-7447: Move to a dynamic pipelines plugin Phase 2: Migrate all List and Detail pages

View the Description

Problem:

With the OpenShift Pipelines operator 1.2x we added support for a dynamic console plugin to the operator. In the first version it is only responsible for the Dashboard and Pipeline/Repository Metrics tab. We want to move more and more code to the dynamic plugin and remove it from the console repository later.

Goal:

In this 2nd phase, we want to move all "easy movable" components from the Dev Console static pipeline-plugin https://github.com/openshift/console/tree/master/frontend/packages/pipelines-plugin into the new dynamic pipeline-plugin github.com/openshift-pipelines/console-plugin
While moving the Resources, we should check if we can clean up and simplify the code, for example by using new features like the option to extend the detail pages with the "console.resource/details-item" extension. See https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/console-extensions.md#consoleresourcedetails-item

Non-goal

We don't plan to change the "Import from Git" flow as part of this Epic.
We don't want to remove the code from the console repository yet, so that the console still supports the Pipeline feature with an older Pipelines Operator (one that doesn't have the Pipeline Plugin).

Why is it important?

We want to migrate to dynamic plugins so that it's easier to align changes in the console with changes in the operator.
In the mid-term this will also reduce the code size and build times of the console project.

Acceptance criteria:

The console support for Pipelines (Tekton resources) should still work without and with the new dynamic Pipelines Plugin
The following detail pages should be copied* to the dynamic plugin
1. Pipeline detail page (support for v1beta1 and v1)
2. PipelineRun detail page (support for v1beta1 and v1)
3. ClusterTask detail page (support for v1beta1 and v1)
4. Task detail page (support for v1beta1 and v1)
5. TaskRun detail page (support for v1beta1 and v1)
6. EventListener (support only v1alpha1 in the static plugin, should support v1beta1 in the dynamic plugin as well)
7. ClusterTriggerBinding (support only v1alpha1 in the static plugin, should support v1beta1 in the dynamic plugin as well)
8. TriggerBinding (support only v1alpha1 in the static plugin, should support v1beta1 in the dynamic plugin as well)
The following list pages should be copied to the dynamic plugin:
1. Pipeline
2. PipelineRun
3. ClusterTask
4. Task
5. TaskRun
We don't want to copy these deprecated resources:
1. PipelineResource
2. Condition
E2e tests for all list pages should run on the new dynamic pipeline plugin

Dependencies (External/Internal):

Changes will happen in the new dynamic pipeline-plugin github.com/openshift-pipelines/console-plugin, but we don't see any external dependencies

Design Artifacts:

Exploration:

Note:

Story ODC-7490: Add TaskRuns tab in PLR details page using plugin

View the Description View the linked PRs

Add TaskRun tab in PLR details page using plugin

https://github.com/openshift/console/pull/13527

Story ODC-7482: Add flags to hide Pipeline list pages from static plugin

View the Description View the linked PRs

Description

As a user,

Acceptance Criteria

Create a flag for list pages that hides a page if it is present in the dynamic plugin

Additional Details:

https://github.com/openshift/console/pull/13512

Feature OCPSTRAT-1024: Improve User experience in Dev Console

View the Description

Feature Overview (aka. Goal Summary)

Enable developers to perform actions in a faster and better way than before.

Goals (aka. expected user outcomes)

Developers will be able to reduce the time or clicks spent on the UI to perform a specific set of actions.

Requirements (aka. Acceptance Criteria):

Search->Resources option should show 5 of the recently searched resources across all sessions by the user.

The recently searched resources should be clearly visible and separated from the rest.

Pinning resources capability should be removed.

Getting Started menu on Add page can be collapsed and expanded

Use Cases (Optional):

Questions to Answer (Optional):

Out of Scope

Background

Customer Considerations

Documentation Considerations

Interoperability Considerations

Epic ODC-7453: Improved Search page for Resources

View the Description

Problem:

Developers have to repeatedly search for resources and pin them separately in order to view details of a resource that they have seen in the past.

Goal:

Provide developers with the ability to see the last 5 resources that they have seen in the past so they can quickly view their details without any further actions.

Why is it important?

Provides a better user experience for developers when using the console.

Use cases:

<case>

Acceptance criteria:

Same as listed in OCPSTRAT-1024

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Story ODC-7480: Add a New Recently Used Resources Section in the Search Resource page Dropdown

View the Description View the linked PRs

Description

As a user, I want the console to remember the resources I have recently searched so that I don't have to type the names of the same resources I use frequently in the Search Page.

Acceptance Criteria

Modify the Search dropdown to add a new section to display the recently searched items in the order they were searched.

Additional Details:

https://github.com/openshift/console/pull/13389
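
A recently-used list of this kind can be sketched in a few lines. This is an illustrative sketch, not the console's implementation: the function name is hypothetical, the limit of 5 comes from the OCPSTRAT-1024 requirement, and showing the most recent search first is an assumption about what "in the order they were searched" means.

```python
def record_search(recent: list, resource: str, limit: int = 5) -> list:
    """Return a new recently-searched list with `resource` at the front.

    A resource that is searched again moves to the front instead of being
    listed twice, and the list is capped at `limit` entries.
    """
    deduped = [r for r in recent if r != resource]
    return ([resource] + deduped)[:limit]


recent = []
for kind in ["Pod", "Deployment", "Service", "Pod", "Route", "Secret", "ConfigMap"]:
    recent = record_search(recent, kind)
```

Persisting this list in user settings rather than in page state would be one way to keep it "across all sessions", but where it is stored is outside this sketch.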

Epic ODC-7459: Improve Getting Started menu on Add page

View the Description

Problem:

The Getting Started menu on the Add page cannot be restored once users click the "X" symbol while it is hidden.

Goal:

Users should be able to collapse and expand the Getting Started menu on Add Page.

Why is it important?

The current behavior causes confusion.

Use cases:

<case>

Acceptance criteria:

The Getting Started menu on Add page can be collapsed by clicking on the screen.
The collapsed Getting Started page can be expanded by clicking on the screen.
When collapsed, it is visibly available for someone to expand again.

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Story ODC-7465: Update Getting started section in Openshift UI to use expandable section

View the Description View the linked PRs

Description

Currently, the getting started section can be hidden and re-enabled using a button, but the user can close that button, after which the section cannot be shown again. This is confusing for users. Replace the hide-and-show button with an expandable section, similar to the Functions List page.

Acceptance Criteria

Update getting Started section to use expandable section in Add page
Update getting Started section to use expandable section in Cluster tab in overview page in Admin perspective
Update the test cases

Additional Details:

Check Functions list page in Dev perspective for the design

https://github.com/openshift/console/pull/13466

Feature OCPSTRAT-1027: Remove Terraform from OpenShift Installer. Cross-platform preparation work

View the Description

Feature Overview (aka. Goal Summary)

To avoid an increased support overhead once the license changes at the end of the year, we want to provision OpenShift on the existing supported providers' infrastructure without the use of Terraform.

This feature will be used to track all the CAPI preparation work that is common for all the supported providers

Use Cases (Optional):

Questions to Answer (Optional):

Out of Scope

Background

Customer Considerations

Documentation Considerations

Interoperability Considerations

Epic CORS-2840: CAPI install without a management cluster

View the Description View the linked PRs

Epic Goal

Day 0 Cluster Provisioning
Compatibility with existing workflows that do not require a container runtime on the host

Why is this important?

This epic would maintain compatibility with existing customer workflows that do not have access to a management cluster and do not have the dependency of a container runtime

Scenarios

openshift-install running in customer automation

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

https://github.com/openshift/installer/pull/7823

Story CORS-3191: Handle etcd and kube-apiserver dependencies

View the Description View the linked PRs

User Story:

I want hack/build.sh to embed the kube-apiserver and etcd dependencies in openshift-install without making external network calls so that ART/OSBS can build the installer with CAPI dependencies.

Acceptance Criteria:

Description of criteria:

dependencies are not obtained over the internet
gated by OPENSHIFT_INSTALL_CLUSTER_API env var
should work when building for various architectures

(optional) Out of Scope:

Engineering Details:

Currently the dependencies are obtained through the sync_envtest function in build-cluster-api.sh
Cluster API provider dependencies are vendored and built here

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Story CORS-2852: Run CAPI control plane binaries locally

View the Description View the linked PRs

PoC & design for running CAPI control plane using binaries.

Story CORS-3139: Fit CAPI Provisioning to infrastructure.Provider interface

View the Description View the linked PRs

Fit provisioning via the CAPI system into the infrastructure.Provider interface

so that:

code path to selecting an infrastructure provider is simplified
enabling/disabling infrastructure providers via feature gates is centralized in platform package

https://github.com/openshift/installer/pull/7824

Feature OCPSTRAT-1030: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic SDN-4541: post-merge testing: Custom configuring the transit switch subnet and allowing users to modify this on day2 with minimal disruption

View the Description

The 100.88.0.0/14 IPv4 subnet is currently reserved for the transit switch in OVN-Kubernetes for east-west traffic in the OVN Interconnect architecture. We need to make this value configurable so that users can avoid conflicts with their local infrastructure. We need to support this config both prior to installation and post installation (day 2).

This epic will include stories for the upstream ovn-org work, getting that work downstream, an API change, and a CNO change to consume the new API.
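
To make the kind of conflict described above concrete, the following sketch checks whether a transit switch subnet overlaps any local infrastructure subnet. It uses only Python's standard `ipaddress` module; the function name and the example local subnets are assumptions for illustration and are not part of OVN-Kubernetes or the cluster network operator.

```python
import ipaddress

# Default transit switch subnet named in the epic description.
DEFAULT_TRANSIT_SWITCH_SUBNET = "100.88.0.0/14"


def conflicting_subnets(transit: str, local_subnets: list) -> list:
    """Return the local subnets that overlap the transit switch subnet."""
    transit_net = ipaddress.ip_network(transit)
    conflicts = []
    for cidr in local_subnets:
        net = ipaddress.ip_network(cidr)
        # Only compare networks of the same IP version.
        if net.version == transit_net.version and net.overlaps(transit_net):
            conflicts.append(cidr)
    return conflicts


# Hypothetical local infrastructure subnets.
local = ["10.0.0.0/16", "100.90.0.0/16", "192.168.1.0/24"]
with_default = conflicting_subnets(DEFAULT_TRANSIT_SWITCH_SUBNET, local)
with_custom = conflicting_subnets("100.64.0.0/14", local)
```

With these example values, the default subnet collides with `100.90.0.0/16`, while the alternative `100.64.0.0/14` does not, which is the situation a configurable value is meant to resolve.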

Story SDN-4542: Transit switch config downstream

View the Description View the linked PRs

After the upstream PR merges, it needs to get into OpenShift ovn-k via a downstream merge.

https://github.com/openshift/ovn-kubernetes/pull/2089

Feature OCPSTRAT-1035: Allow setting optional tags to machines in vSphere

View the Description

Feature Overview

Allow setting custom tags to machines created during the installation of an OpenShift cluster on vSphere.

Why is this important

Just as labeling is important in Kubernetes for organizing API objects and compute workloads (pods/containers), the same is true for the Kube/OCP node VMs running on the underlying infrastructure in any hosted or cloud platform.

Reporting, auditing, troubleshooting and internal organization processes all require ways of easily filtering on and referencing servers by naming, labels or tags. Adding appropriate tags to all OCP nodes in VMware ensures that those troubleshooting, reporting or auditing can easily identify and filter OpenShift node VMs.

For example we can use tags for prod vs. non-prod, VMs that should have backup snapshots vs. those that shouldn't, VMs that fall under certain regulatory constraints, etc.

Epic SPLAT-1342: Enable the assignment of additional tags to machines in vSphere

View the Description

Epic Goal

Customers need to assign additional tags to machines that are reconciled via the machine API.

Why is this important?

Just as labeling is important in Kubernetes for organizing API objects and compute workloads (pods/containers), the same is true for the Kube/OCP node VMs running on the underlying infrastructure in any hosted or cloud platform.

For example we can use tags for prod vs. non-prod, VMs that should have backup snapshots vs. those that shouldn't, VMs that fall under certain regulatory constraints, etc.

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Task SPLAT-1385: add tagIDs to the openshift API

View the linked PRs

https://github.com/openshift/api/pull/1697

Task SPLAT-1387: add support for additional tags in the installer

View the linked PRs

https://github.com/openshift/installer/pull/7905

Feature OCPSTRAT-1039: [GA] Add ControlPlaneMachineSet for vSphere

View the Description

Feature Overview (aka. Goal Summary)

Add ControlPlaneMachineSet for vSphere

Epic SPLAT-1388: [GA] Add ControlPlaneMachineSet for vSphere

View the Description

Epic Goal

Promote vSphere control plane machinesets from tech preview to GA

Why is this important?

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
Promotion PRs collectively pass payload testing

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Task SPLAT-1390: installer: promote control plane machineset from techpreview

View the linked PRs

https://github.com/openshift/installer/pull/7908

Task SPLAT-1441: drop templates from tech preview

View the Description View the linked PRs

see: https://github.com/openshift/api/blob/master/config/v1/types_infrastructure.go#L1243

https://github.com/openshift/api/pull/1759

Task SPLAT-1440: add unit tests to the cpmso

View the Description View the linked PRs

validate failure domains

https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/275

Task SPLAT-1391: control plane machineset: promote control plane machineset from techpreview

View the linked PRs

https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/272

Task SPLAT-1396: drop tech preview from vsphere failure domains in the cpms CRD

View the linked PRs

https://github.com/openshift/api/pull/1733

Task SPLAT-1400: bump config operator

View the linked PRs

https://github.com/openshift/cluster-config-operator/pull/400

Feature OCPSTRAT-1040: Experimental in 4.16-->Provide option to collect must-gather based on time stamp (--since and --until)

View the Description

Feature Overview (aka. Goal Summary)

As an OpenShift administrator, when I run must-gather in a large cluster, the output tends to be gigabytes in size, which fills up my master node. I want the ability to define the time window in which the problem happened, so that only targeted logs are gathered and they take less space.

Goals (aka. expected user outcomes)

Have "--since" and "--until" options in must-gather

Epic WRKLDS-994: must-gather: pass envs into must-gather images to enhance control over data collection

View the Description

Epic Goal*

Why is this important? (mandatory)
Reduce must-gather size

Scenarios (mandatory)

Must-gather running over a cluster with many logs can produce tens of GBs of data in cases where only a few MBs are needed. Such a huge must-gather archive takes too long to collect and too long to upload, which makes the process impractical.

Dependencies (internal and external) (mandatory)

The default must-gather images need to implement this functionality. Custom images will be asked to implement the same.

Contributing Teams(and contacts) (mandatory)

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

Development - workloads team
Documentation - docs team
QE - workloads QE team
PX -
Others -

Acceptance Criteria (optional)

Must-gather contains only logs from the requested time interval

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

CI Testing - Basic e2e automation tests are merged and completing successfully
Documentation - Content development is complete.
QE - Test scenarios are written and executed successfully.
Technical Enablement - Slides are complete (if requested by PLM)
Engineering Stories Merged
All associated work items with the Epic are closed
Epic status should be “Release Pending”

Story WRKLDS-950: must-gather: pass envs into must-gather images to enhance control over data collection

View the Description View the linked PRs

Must-gather currently does not allow a customer to limit the amount of data collected. This can lead to the collection of tens of GBs even when only a limited set of data (e.g. for the last 2 days) is required, and in some cases it ends with a master node down.
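
The idea of limiting collection to a time window can be sketched as follows. This is not the must-gather implementation: the environment variable names `EXAMPLE_GATHER_SINCE` and `EXAMPLE_GATHER_UNTIL` are hypothetical placeholders, the log format (an RFC 3339 timestamp as the first space-separated field) is an assumption, and dropping lines without a parseable timestamp is a simplification chosen for the example.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    # Accept a trailing "Z" by converting it to an explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def window_from_env(env: dict):
    """Read the requested time window from (hypothetical) env variables."""
    since = env.get("EXAMPLE_GATHER_SINCE")
    until = env.get("EXAMPLE_GATHER_UNTIL")
    return (parse_ts(since) if since else None,
            parse_ts(until) if until else None)


def filter_log_lines(lines, since=None, until=None):
    """Keep only lines whose leading timestamp falls inside the window."""
    kept = []
    for line in lines:
        stamp = line.partition(" ")[0]
        try:
            ts = parse_ts(stamp)
        except ValueError:
            continue  # assumption: lines without a timestamp are dropped
        if since is not None and ts < since:
            continue
        if until is not None and ts > until:
            continue
        kept.append(line)
    return kept


env = {"EXAMPLE_GATHER_SINCE": "2024-03-01T10:00:00Z",
       "EXAMPLE_GATHER_UNTIL": "2024-03-01T12:00:00Z"}
logs = ["2024-03-01T09:59:59Z too early",
        "2024-03-01T10:30:00Z in window",
        "2024-03-01T12:00:01Z too late",
        "no timestamp here"]
since, until = window_from_env(env)
selected = filter_log_lines(logs, since, until)
```

Passing the window through the environment keeps the command-line client generic: each must-gather image decides how to apply the bounds to the data it collects.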

Feature Overview (aka. Goal Summary)

We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview, for the following reasons:

(1) Low customer interest in using OpenShift on Alibaba Cloud

(2) Removal of Terraform usage

(3) MAPI to CAPI migration

(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)

Goals (aka. expected user outcomes)

We want to remove official UPI and IPI support for the Alibaba Cloud provider, which is in Tech Preview in OpenShift 4.15 and earlier. Going forward, we recommend installing on Alibaba Cloud with either the external platform or the agnostic platform installation method.
There will be no upgrades from OpenShift 4.14 / OpenShift 4.15 to OpenShift 4.16.
For OpenShift 4.16, we will instead offer customers the ability to install OpenShift on Alibaba Cloud with the agnostic platform (platform=none) using the Assisted Installer as Tech Preview - see OCPSTRAT-1149 for details. Note: we cannot remove IPI (Tech Preview) install method until we provide the Assisted Installer (Tech Preview) solution.

Impacted areas based on CI:

alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
machine-config-operator/openshift-machine-config-operator-release-4.16.yaml

Requirements (aka. Acceptance Criteria):

Update cloud.redhat.com (Hybrid Cloud Console) to remove Alibaba support.
Update the installer to remove the Alibaba IPI code.
Update release notes with removal information and a statement that no updates/upgrades are offered from OpenShift 4.14 or OpenShift 4.15.
Remove Alibaba installation instructions from OpenShift documentation, including updating OpenShift Container Platform 4.x Tested Integrations.

Deployment considerations	List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both	Self-managed
Classic (standalone cluster)	Classic (standalone)
Hosted control planes	N/A
Multi node, Compact (three node), or Single node (SNO), or all	All
Connected / Restricted Network	All
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)	All
Operator compatibility	N/A
Backport needed (list applicable versions)	N/A
Other (please specify)	N/A

Done - Checklist (mandatory)

Remove any Alibaba-related operator
Remove Alibaba-related driver
Remove Alibaba related code from CSO, CCM, and other repos
Remove container images
Remove related CI jobs
Remove documentation
Stop building images in ART pipeline
Archive github repositories

Use Cases (Optional):

Questions to Answer (Optional):

Out of Scope

Background

Customer Considerations

Documentation Considerations

Interoperability Considerations

Epic STOR-1354: Alicloud Disk CSI driver removal

View the Description View the linked PRs

Epic Goal*

Per OCPSTRAT-1042, we are removing Alicloud IPI/UPI support and the related code in 4.16. This epic tracks the necessary actions to remove the Alicloud Disk CSI driver.

Why is this important? (mandatory)

Since we are removing support of Alicloud as a supported provider we need to clean up all the storage artefacts related to Alicloud.

Dependencies (internal and external) (mandatory)

Alicloud IPI/UPI support removal must be confirmed. IPI/UPI code should be removed.

Contributing Teams(and contacts) (mandatory)

Development -
Documentation -
QE -
PX -
Others -

Done - Checklist (mandatory)

Remove operator
Remove driver
Remove Alibaba related code from CSO.
Remove container images
Remove related CI jobs
Remove documentation
Stop building images in ART pipeline
Archive github repositories

https://github.com/openshift/cluster-storage-operator/pull/440

Epic SPLAT-1343: Remove IPI/UPI support of Alibaba cloud from OpenShift

View the Description

Epic Goal

We want to remove official UPI and IPI support for the Alibaba Cloud provider. Going forward, we recommend installing on Alibaba Cloud with either the external platform or the agnostic platform installation method.

Why is this important?

We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview, for the following reasons:

(1) Low customer interest in using OpenShift on Alibaba Cloud

(2) Removal of Terraform usage

(3) MAPI to CAPI migration

(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)

Scenarios

Impacted areas based on CI:

alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
machine-config-operator/openshift-machine-config-operator-release-4.16.yaml

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

CI - CI jobs are removed
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story SPLAT-1345: remove alibaba from the installer

View the linked PRs

https://github.com/openshift/installer/pull/7832

Feature OCPSTRAT-1066: Enable seamless collection of OpenShift console user telemetry

View the Description

Feature Overview (aka. Goal Summary)

Enable the OCP Console to send back user analytics to our existing endpoints in console.redhat.com. Please refer to doc for details of what we want to capture in the future:

Analytics Doc

Collect desired telemetry of user actions within OpenShift console to improve knowledge of user behavior.

Goals (aka. expected user outcomes)

OpenShift console should be able to send telemetry to a pre-configured Red Hat proxy that can be forwarded to 3rd party services for analysis.

Requirements (aka. Acceptance Criteria):

User analytics should respect the existing telemetry mechanism used to disable data being sent back

Need to update existing documentation with what user data we track from the OCP Console: https://docs.openshift.com/container-platform/4.14/support/remote_health_monitoring/about-remote-health-monitoring.html

Capture and send desired user analytics from OpenShift console to Red Hat proxy

Red Hat proxy to forward telemetry events to appropriate Segment workspace and Amplitude destination

Use existing setting to opt out of sending telemetry: https://docs.openshift.com/container-platform/4.14/support/remote_health_monitoring/opting-out-of-remote-health-reporting.html#opting-out-remote-health-reporting

Also, allow disabling just user analytics without affecting the rest of telemetry: add an annotation to the Console to disable only user analytics

Update docs to show this method as well.

We will require a mechanism to store all the segment values
We need to be able to pass back orgID that we receive from the OCM subscription API call
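
The two opt-out requirements above combine into a simple decision, sketched here. The annotation key `example.com/disable-user-analytics` is a hypothetical placeholder, since the text does not name the actual annotation, and the function is an illustration of the intended precedence rather than console code.

```python
# Hypothetical annotation key; the real key is not given in this document.
DISABLE_USER_ANALYTICS_ANNOTATION = "example.com/disable-user-analytics"


def should_send_user_analytics(cluster_telemetry_enabled: bool,
                               console_annotations: dict) -> bool:
    """Decide whether the console may send user analytics.

    The cluster-wide telemetry opt-out always wins. Otherwise the
    console-level annotation can disable user analytics on its own,
    leaving the rest of telemetry untouched.
    """
    if not cluster_telemetry_enabled:
        return False
    value = console_annotations.get(DISABLE_USER_ANALYTICS_ANNOTATION, "false")
    return value.lower() != "true"


default_case = should_send_user_analytics(True, {})
opted_out_cluster = should_send_user_analytics(False, {})
analytics_only_off = should_send_user_analytics(
    True, {DISABLE_USER_ANALYTICS_ANNOTATION: "true"})
```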

Use Cases (Optional):

Questions to Answer (Optional):

Out of Scope

Sending telemetry from OpenShift cluster nodes

Background

Console already has support for sending analytics to segment.io in Dev Sandbox and OSD environments. We should reuse this existing capability, but default to http://console.redhat.com/connections/api for analytics and http://console.redhat.com/connections/cdn to load the JavaScript in other environments. We must continue to allow Dev Sandbox and OSD clusters a way to configure their own segment key, whether telemetry is enabled, segment API host, and other options currently set as annotations on the console operator configuration resource.

Console will need a way to determine the org-id to send with telemetry events. Likely the console operator will need to read this from the cluster pull secret.
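
The defaulting behaviour described in this Background section might look roughly like the sketch below. The two default URLs are quoted from the text above; the override keys "api_host" and "js_host" and the function name are hypothetical and stand in for whatever annotations Dev Sandbox and OSD clusters actually set.

```python
# Defaults quoted from the Background text above.
DEFAULT_API_HOST = "http://console.redhat.com/connections/api"
DEFAULT_JS_HOST = "http://console.redhat.com/connections/cdn"


def resolve_telemetry_endpoints(overrides: dict) -> dict:
    """Merge per-cluster overrides over the defaults.

    The keys "api_host" and "js_host" are hypothetical names.
    Empty or missing overrides fall back to the defaults.
    """
    return {
        "api_host": overrides.get("api_host") or DEFAULT_API_HOST,
        "js_host": overrides.get("js_host") or DEFAULT_JS_HOST,
    }
```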

Customer Considerations

Documentation Considerations

Interoperability Considerations

Epic ODC-7460: Configure URL and destination of console telemetry for non-Sandbox clusters

View the Description

Problem:

The console telemetry plugin needs to send data to a new Red Hat ingress point that will then forward it to Segment for analysis.

Goal:

Update console telemetry plugin to send data to the appropriate ingress point.

Why is it important?

Use cases:

<case>

Acceptance criteria:

Update the URL to the Ingress point created for console.redhat.com
Ensure telemetry data is flowing to the ingress point.

Dependencies (External/Internal):

Ingress point created for console.redhat.com

Design Artifacts:

Exploration:

Note:

Bug OCPBUGS-25722: [4.16] Console telemetry base URL (to load JS and make API calls to Segment) couldn't be configured

View the Description View the linked PRs

The console telemetry plugin needs to send data to a new Red Hat ingress point that will then forward it to Segment for analysis.

For that the telemetry-console-plugin must have options to configure where it loads the analytics.js and where to send the API calls (analytics events).

https://github.com/openshift/console/pull/13459

Feature OCPSTRAT-1080: ReadWriteOncePod access mode (GA)

View the Description

Feature Overview (aka. Goal Summary)

Graduate the new PV access mode ReadWriteOncePod to GA.

Such a PV/PVC can be used only by a single pod on a single node, in contrast to the traditional ReadWriteOnce access mode, where a PV/PVC can be used on a single node by many pods.
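
The difference between the two access modes can be illustrated with a toy model. This sketch only restates the rule described above; it is not how Kubernetes enforces the mode, and the function and tuple layout are made up for illustration.

```python
def can_use_volume(access_mode: str, current_users: list, new_pod: tuple) -> bool:
    """Decide whether new_pod may use a volume.

    current_users and new_pod are (pod_name, node_name) tuples.
    """
    if not current_users:
        return True
    if access_mode == "ReadWriteOncePod":
        # Only one pod in the whole cluster may use the volume.
        return False
    if access_mode == "ReadWriteOnce":
        # Many pods may share the volume, but only on a single node.
        return all(node == new_pod[1] for _, node in current_users)
    raise ValueError(f"unsupported access mode: {access_mode}")
```

A second pod on the same node is accepted under ReadWriteOnce and rejected under ReadWriteOncePod.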

Goals (aka. expected user outcomes)

The customers can start using the new ReadWriteOncePod access mode.

This new mode allows customers to provision and attach a PV and get the guarantee that it cannot be attached to another local pod.

Requirements (aka. Acceptance Criteria):

This new mode should support the same operations as regular ReadWriteOnce PVs therefore it should pass the regression tests. We should also ensure that this PV can't be accessed by another local-to-node pod.

Use Cases (Optional):

As a user I want to attach a PV to a pod and ensure that it can't be accessed by another local pod.

Background

We are getting this feature from upstream as GA. We need to test it and fully support it.

Enhancement issue: https://github.com/kubernetes/enhancements/issues/2485
KEP: https://github.com/kubernetes/enhancements/pull/2489

Customer Considerations

Check that there are no limitations or regressions.

Documentation Considerations

Remove tech preview warning. No additional change.

Interoperability Considerations

N/A

Epic STOR-1464: Upstream GA Tracking: New RWOP access mode

View the Description View the linked PRs

Epic Goal

Support upstream feature "New RWO access mode" in OCP as GA, i.e. test it and have docs for it.

This is a continuation of ~~STOR-1171~~ (Beta/Tech Preview in 4.14); now we just need to mark it as GA and remove all TechPreview notes from docs.

Why is this important?

We get this upstream feature through Kubernetes rebase. We should ensure it works well in OCP and we have docs for it.

Upstream links

Enhancement issue: https://github.com/kubernetes/enhancements/issues/2485
KEP: https://github.com/kubernetes/enhancements/pull/2489

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

External: the feature is currently scheduled for GA in Kubernetes 1.29, i.e. OCP 4.16, but it may change before Kubernetes 1.29 GA.

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Feature OCPSTRAT-1092: Allow modification of BMC address after installation

View the Description

Feature Overview

Currently it's not possible to modify the BMC address of a BareMetalHost after it has been set.

Users need to be able to update the BMC entries in the BareMetalHost objects after they have been created.

Use Cases

There are at least two scenarios in which this causes problems:

1. When deploying baremetal clusters with the Assisted Installer, Metal3 is deployed and BareMetalHost objects are created but with empty BMC entries. We can modify these BMC entries after the cluster has come up to bring the nodes "under management", but if you make a mistake like I did with the URI then it's not possible to fix that mistake.

2. BMC addresses may change over time - whilst it may appear unlikely, network readdressing, or changes in DNS names may mean that BMC addresses will need to be updated.

Epic METAL-359: post-merge testing: Allow modification of BMC address after installation

View the Description

Currently it's not possible to modify the BMC address of a BareMetalHost after it has been set. It's understood that this was an initial design choice, and there's a webhook that prevents any modification of the object once it has been set for the first time. There are at least two scenarios in which this causes problems:

2. BMC addresses may change over time - whilst it may appear unlikely, network readdressing, or changes in DNS names may mean that BMC addresses will need to be updated.

Thanks!

Story METAL-866: Allow BMC address update in BMO

View the Description View the linked PRs

Currently, the baremetal operator does not allow the BMC address of a node to be updated after BMH creation. This ability needs to be added in BMO.
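
Once BMO permits the update, changing the address amounts to patching a single field on the BareMetalHost. The sketch below builds a JSON merge patch for spec.bmc.address. The helper name and the example address are hypothetical, and this card does not say whether any additional steps are required around the update.

```python
import json


def bmc_address_patch(new_address: str) -> str:
    """Build a JSON merge patch that changes only spec.bmc.address."""
    if not new_address:
        raise ValueError("new_address must not be empty")
    return json.dumps({"spec": {"bmc": {"address": new_address}}})


# Hypothetical address, for illustration only.
patch = bmc_address_patch("redfish://192.0.2.10/redfish/v1/Systems/1")
print(patch)
```

A patch of this shape could be applied with a merge patch, for example through `oc patch` with `--type merge`.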

https://github.com/openshift/baremetal-operator/pull/336

Feature OCPSTRAT-1093: Support for VolumeGroup Snapshots (TP)

View the Description

Feature Overview (aka. Goal Summary)

Volume Group Snapshots is a key new Kubernetes storage feature that allows multiple PVs to be grouped together and snapshotted at the same time. This enables customers to take consistent snapshots of applications that span multiple PVs.

This is also a key requirement for backup and DR solutions.

https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/

https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3476-volume-group-snapshot
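
As a rough illustration of grouping PVs for a snapshot, the sketch below builds a VolumeGroupSnapshot object that selects PVCs by label. The shape follows the alpha API as described in the Kubernetes blog post linked above; the API version shipped with a given OCP release may differ, and all names and labels here are hypothetical.

```python
import json


def volume_group_snapshot(name, namespace, snapshot_class, match_labels):
    """Return a VolumeGroupSnapshot manifest selecting PVCs by label."""
    return {
        # Alpha API group/version; assumed, may differ per release.
        "apiVersion": "groupsnapshot.storage.k8s.io/v1alpha1",
        "kind": "VolumeGroupSnapshot",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "volumeGroupSnapshotClassName": snapshot_class,
            # Every PVC carrying these labels is snapshotted together.
            "source": {"selector": {"matchLabels": dict(match_labels)}},
        },
    }


manifest = volume_group_snapshot("db-group-snap", "demo", "csi-group-snapclass", {"app": "db"})
print(json.dumps(manifest, indent=2))
```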

Goals (aka. expected user outcomes)

Productise the volume group snapshots feature as tech preview, with docs, testing, and a feature gate to enable it, so that customers and partners can test it in advance.

Requirements (aka. Acceptance Criteria):

The feature should be graduated to beta upstream to become TP in OCP. Tests and CI must pass and a feature gate should allow customers and partners to easily enable it. We should identify all OCP shipped CSI drivers that support this feature and configure them accordingly.

Use Cases (Optional):

As a storage vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my driver support.
As a backup vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my backup solution.
As a customer I want early access to test the VolumeGroupSnapshot feature in order to take consistent snapshots of my workloads that are relying on multiple PVs.

Out of Scope

CSI drivers development/support of this feature.

Background

Customer Considerations

Drivers must support this feature and enable it. Partners may need to change their operator and/or doc to support it.

Documentation Considerations

Document how to enable the feature, what this feature does and how to use it. Update the OCP driver's table to include this capability.

Interoperability Considerations

Can be leveraged by ODF and OCP virt, especially around backup and DR scenarios.

Epic STOR-1700: OCP feature gate for VolumeGroupSnapshot API

View the Description View the linked PRs

Epic Goal*

Create an OCP feature gate that allows customers and partners to use the VolumeGroupSnapshot feature while the feature is in alpha and beta upstream.

Why is this important? (mandatory)

Volume group snapshot is an important feature for ODF, OCP virt and backup partners. It requires driver support so partners need early access to the feature to confirm their driver works as expected before GA. The same applies to backup partners.

Scenarios (mandatory)

As a storage vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my driver support.
As a backup vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my backup solution.

Dependencies (internal and external) (mandatory)

This depends on the driver's support; the feature gate will enable it in the drivers that support it (OCP shipped drivers).

The feature gate should

Configure the snapshotter to start with the right parameter to enable VolumeGroupSnapshot
Create the necessary CRDs
Configure the OCP shipped CSI driver

Contributing Teams(and contacts) (mandatory)

Development - STOR
Documentation - N/A
QE - STOR
PX -
Others -

Acceptance Criteria (optional)

By enabling the feature gate partners should be able to use the VolumeGroupSnapshot API. Non OCP shipped drivers may need to be configured.

Drawbacks or Risk (optional)

Done - Checklist (mandatory)

CI Testing - Basic e2e automation tests are merged and completing successfully
Documentation - Content development is complete.
QE - Test scenarios are written and executed successfully.
Technical Enablement - Slides are complete (if requested by PLM)
Engineering Stories Merged
All associated work items with the Epic are closed
Epic status should be “Release Pending”

Feature OCPSTRAT-1096: SMB CSI Driver Operator (TP)

View the Description

Feature Overview (aka. Goal Summary)

Support the SMB CSI driver through an OLM operator as tech preview. The SMB CSI driver allows OCP to consume SMB/CIFS storage with a dynamic CSI driver. This enables customers to leverage their existing storage infrastructure in either Samba or Microsoft environments.

https://github.com/kubernetes-csi/csi-driver-smb
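
For orientation, a storage class that consumes an SMB share with login/password credentials might be assembled as below. The provisioner name matches the upstream CSIDriver manifest; the parameter keys follow the upstream driver's examples as best understood and should be verified against the upstream documentation. The share path, secret name, and namespace are hypothetical.

```python
def smb_storage_class(name, source, secret_name, secret_namespace):
    """Return a StorageClass manifest for the SMB CSI driver."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "smb.csi.k8s.io",
        "parameters": {
            # SMB share to provision volumes on, e.g. //server/share.
            "source": source,
            # The referenced Secret is assumed to hold the login and password.
            "csi.storage.k8s.io/provisioner-secret-name": secret_name,
            "csi.storage.k8s.io/provisioner-secret-namespace": secret_namespace,
            "csi.storage.k8s.io/node-stage-secret-name": secret_name,
            "csi.storage.k8s.io/node-stage-secret-namespace": secret_namespace,
        },
    }


sc = smb_storage_class("smb", "//smb.example.test/share", "smbcreds", "default")
```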

Goals (aka. expected user outcomes)

Customers can start testing connecting OCP to their backend exposing CIFS. This allows them to consume net-new volumes or existing data produced outside OCP.

Requirements (aka. Acceptance Criteria):

The driver already exists and is under the storage SIG umbrella. We need to make sure the driver meets OCP quality requirements and, if so, develop an operator to deploy and maintain it.

Review and clearly define all driver limitations and corner cases.

Use Cases (Optional):

As an OCP admin, I want OCP to consume storage exposed via SMB/CIFS to capitalise on my existing infrastructure.
As a user, I want to consume external data stored on an SMB/CIFS backend.

Questions to Answer (Optional):

Review the different authentication methods.

Out of Scope

Windows containers support.

Authentication methods other than storage class login/password. Other methods can be reviewed and considered for GA.

Background

Customers expect to consume storage and possibly existing data via SMB/CIFS. As of today, vendor driver support for CIFS is very limited, whereas this protocol is widely used on premises, especially by MS/AD customers.

Customer Considerations

Need to understand what customers expect in terms of authentication.

How to extend this feature to windows containers.

Documentation Considerations

Document the operator and driver installation, usage capabilities and limitations.

Interoperability Considerations

Future: How to manage interoperability with windows containers (not for TP)

Epic STOR-309: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story STOR-1726: Implement operator for SMB csi driver

View the linked PRs

Story STOR-1724: API change to add SMB csi driver provider

View the Description View the linked PRs

Add a new provider here:
https://github.com/openshift/api/blob/6d48d55c0598ec78adacdd847dcf934035ec2e1b/operator/v1/types_csi_cluster_driver.go#L86

It should be "smb.csi.k8s.io", from the CSIDriver manifest:
https://github.com/kubernetes-csi/csi-driver-smb/blob/master/deploy/csi-smb-driver.yaml

https://github.com/openshift/api/pull/1741

Feature OCPSTRAT-110: Hypershift-enablement for short-lived token authentication flows with OLM-managed operators with CCO

View the Description

Feature Overview:

Hypershift-provisioned clusters, regardless of the cloud provider, support the proposed integration for OLM-managed operators outlined in ~~OCPBU-559~~ and ~~OCPBU-560~~.

Goals

There is no degradation in capability or coverage for OLM-managed operators that support short-lived token authentication on clusters that are lifecycled via Hypershift.

Requirements:

the flows in ~~OCPBU-559~~ and ~~OCPBU-560~~ need to work unchanged on Hypershift-managed clusters
most likely this means that Hypershift needs to adopt the CloudCredentialOperator
all operators enabled as part of ~~OCPBU-563~~, OCPBU-564, ~~OCPBU-566~~ and OCPBU-568 need to be able to leverage short-lived authentication on Hypershift-managed clusters without being aware that they are on Hypershift-managed clusters
also OCPBU-569 and OCPBU-570 should be achievable on Hypershift-managed clusters

Background

Currently, Hypershift lacks support for CCO.

Customer Considerations

Currently, Hypershift will be limited to deploying clusters in which the cluster core operators are leveraging short-lived token authentication exclusively.

Documentation Considerations

If we are successful, no special documentation should be needed for this.

Epic CCO-386: HyperShift Integration

View the Description

Outcome Overview

Operators on guest clusters can take advantage of the new tokenized authentication workflow that depends on CCO.

Success Criteria

CCO is included in HyperShift and its footprint is minimal while meeting the above outcome.

Expected Results (what, how, when)

Post Completion Review – Actual Results

After completing the work (as determined by the "when" in Expected Results above), list the actual results observed / measured during Post Completion review(s).

Story CCO-388: Add CCO to HyperShift Management plane

View the Description View the linked PRs

Every guest cluster should have a running CCO pod with its kubeconfig attached to it.

Enhancement doc: https://github.com/openshift/enhancements/blob/master/enhancements/cloud-integration/tokenized-auth-enablement-operators-on-cloud.md

https://github.com/openshift/hypershift/pull/2794

Feature OCPSTRAT-1104: [etcd] rotation of etcd signer certs when the cluster is still online

View the Description

Feature Overview (aka. Goal Summary)

The etcd CA must be rotatable both on demand and automatically when expiry approaches.

Goals (aka. expected user outcomes)

Have a tested path for customers to rotate certs manually
We must have a tested path for auto rotation of certificates when certs need rotation due to age
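
The "rotate when expiry approaches" goal can be pictured as a simple age check. This is a sketch under assumptions, not the operator's logic: the 80% refresh threshold and the function name are illustrative choices that this card does not specify.

```python
from datetime import datetime, timedelta, timezone


def needs_rotation(not_before, not_after, now, refresh_fraction=0.8):
    """Return True once refresh_fraction of the validity period has elapsed.

    refresh_fraction=0.8 is an assumed threshold for illustration.
    """
    validity = not_after - not_before
    refresh_at = not_before + validity * refresh_fraction
    return now >= refresh_at


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=100)
print(needs_rotation(issued, expires, issued + timedelta(days=90)))
```

An on-demand rotation would bypass this check entirely and issue a new signer immediately.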

Requirements (aka. Acceptance Criteria):

Deliver rotation and recovery requirements from OCPSTRAT-714

~~Deployment considerations~~	~~List applicable specific needs (N/A = not applicable)~~
~~Self-managed, managed, or both~~
~~Classic (standalone cluster)~~
~~Hosted control planes~~
~~Multi node, Compact (three node), or Single node (SNO), or all~~
~~Connected / Restricted Network~~
~~Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)~~
~~Operator compatibility~~
~~Backport needed (list applicable versions)~~
~~UI need (e.g. OpenShift Console, dynamic plugin, OCM)~~
~~Other (please specify)~~

Use Cases (Optional):

~~Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.~~

~~<your text here>~~

Questions to Answer (Optional):

~~Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.~~

~~<your text here>~~

Out of Scope

~~High-level list of items that are out of scope. Initial completion during Refinement status.~~

~~<your text here>~~

Background

~~Provide any additional context is needed to frame the feature. Initial completion during Refinement status.~~

~~<your text here>~~

Customer Considerations

~~Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.~~

~~<your text here>~~

Documentation Considerations

~~<your text here>~~

Interoperability Considerations

~~<your text here>~~

Epic ETCD-445: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Spike ETCD-512: Use the library-go cert rotation controller in etcd-operator

View the Description View the linked PRs

This spike explores using the library-go cert rotation utils in the etcd-operator to replace or augment the existing etcdcertsigner controller.

https://github.com/openshift/library-go/blob/master/pkg/operator/certrotation/client_cert_rotation_controller.go
https://github.com/openshift/cluster-etcd-operator/pull/1177

The goal of this spike is to evaluate if the library-go cert rotation util gives us rotation capabilities for the signer cert along with the peer and server certs.

There are a couple of issues to explore with the use of the library-go cert signer controller:

The etcd cluster is currently configured with a single CA for etcd's peer and server certs, whereas the library-go controller would require using different CAs for the peer and server certs.
We also need to consider how upgrades would be handled, i.e., if we change to using two new CAs, would our new certsignercontroller handle that transparently?

Story ETCD-518: Refactored CertSignerController needs to garbage collect unused node certs

View the Description View the linked PRs

Refactoring in ~~ETCD-512~~ does not clean up certificates that are dynamically generated. Imagine you're recreating all your master nodes every day: we would create new peer/serving/metrics certificates for each node and never clean them up.

We should try to be conservative when cleaning them up, so keep them around for a certain retention period (7-10 days?) after the node went away.

AC:

CEO should clean up old-enough "node" certificates
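
The conservative cleanup described above could be sketched as follows. The data structures, the function name, and the 10-day default are hypothetical; the card only suggests a retention period of roughly 7-10 days after the node went away.

```python
from datetime import datetime, timedelta, timezone


def certs_to_delete(cert_nodes, live_nodes, node_removed_at, now, retention=timedelta(days=10)):
    """Select node certificate secrets that are safe to delete.

    cert_nodes maps secret name -> node name the cert was issued for.
    node_removed_at maps node name -> time the node went away.
    """
    doomed = []
    for secret, node in sorted(cert_nodes.items()):
        if node in live_nodes:
            continue  # node still exists: keep its certificates
        removed = node_removed_at.get(node)
        if removed is None:
            continue  # removal time unknown: be conservative and keep
        if now - removed >= retention:
            doomed.append(secret)
    return doomed
```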

https://github.com/openshift/cluster-etcd-operator/pull/1209

Story ETCD-516: CEO needs to react on changing CA cert bundle and client certs

View the Description View the linked PRs

Testing in ~~ETCD-512~~ revealed that CEO does not react to changes in the CA bundle or the client certificates.

The current mounts are defined here:
https://github.com/openshift/cluster-etcd-operator/blob/60b7665a26610a095722d3b12b2bb08dcae6965f/manifests/0000_20_etcd-operator_06_deployment.yaml#L90-L106

A simple fix would be to watch the respective resources in a controller and exit the container on changes. This is how we did it with feature gates as well: (https://github.com/openshift/cluster-etcd-operator/blob/60b7665a26610a095722d3b12b2bb08dcae6965f/pkg/operator/starter.go#L174C1-L174C1)

If hot-reload would be feasible we should take a look at it, but it seems a larger refactoring.

AC:

CEO needs to react (restart) when it detects changes in certificate related secrets
add an e2e testcase for it
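
The "watch and exit on change" approach can be reduced to comparing a fingerprint taken at startup with the current content of the certificate secrets. This is a language-neutral sketch of the idea, not the operator's code; the function names and the secret layout are assumptions.

```python
import hashlib


def fingerprint(secret_data: dict) -> str:
    """Stable digest over secret keys and their byte values."""
    h = hashlib.sha256()
    for key in sorted(secret_data):
        h.update(key.encode())
        h.update(b"\0")
        h.update(secret_data[key])
        h.update(b"\0")
    return h.hexdigest()


def should_restart(startup_fingerprint: str, current_data: dict) -> bool:
    """True when the watched secret no longer matches what was loaded at startup."""
    return fingerprint(current_data) != startup_fingerprint
```

A controller would call should_restart on every watch event and exit the container when it returns True, so that the replacement process loads the new certificates.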

https://github.com/openshift/cluster-etcd-operator/pull/1202

Story ETCD-517: CEO render should reuse library-go rotation CAs, bundles and certs

View the Description View the linked PRs

Refactoring in ~~ETCD-512~~ creates a rotation due to incompatible certificate creation processes. We should update the render [1] to create certificates the same way the controller manages them. Keep in mind that important information is always stored in annotations, which means we also need to update the manifest template itself (just exchanging file bytes isn't enough).

AC:

CEO should render the same certificates it would otherwise when the refactored CertSignerController runs
test that a fresh installation avoids re-creating certificates after the bootstrap phase

[1] https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/cmd/render/render.go#L347-L365

https://github.com/openshift/cluster-etcd-operator/pull/1186

Feature OCPSTRAT-1114: Include the version of the oc binary used, and the logs generated by the command into the must-gather directory

View the Description

Include the version of the oc binary used, and the logs generated by the command into the must-gather directory when running the oc adm must-gather command.

The version of the oc binary can help to identify if an issue could be caused because the oc version is different than the version of the cluster.
The logs generated by the oc adm must-gather command will help to identify if some information could not be collected, and also the exact image used (while the image used is currently in the directory name, I think it could be better to use a short directory name, especially for customers running oc on Windows, as an overly long directory name can cause issues there).

The version of the oc binary could be included in the oc adm must-gather output [1], and if it differs from the running cluster by 2 or more minor versions, a warning should be shown.
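
The proposed skew warning could be computed roughly as below. This is a sketch of the check only, assuming both versions share the same major version and use a "major.minor.patch" layout; the function names and message wording are made up.

```python
def minor_version(version: str) -> int:
    """Extract the minor component from strings such as '4.16.3' or 'v4.14.0'."""
    parts = version.lstrip("v").split(".")
    return int(parts[1])


def skew_warning(client_version: str, cluster_version: str, max_skew: int = 1):
    """Return a warning string when the minor versions differ by more than max_skew."""
    skew = abs(minor_version(client_version) - minor_version(cluster_version))
    if skew > max_skew:
        return (
            f"warning: oc {client_version} differs from cluster "
            f"{cluster_version} by {skew} minor versions"
        )
    return None
```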

Epic WRKLDS-1011: Include additional info into must-gather directory

View the Description

1. Proposed title of this feature request

Include additional info into must-gather directory

2. What is the nature and description of the request?

Include the version of the oc binary used, and the logs generated by the command into the must-gather directory when running the oc adm must-gather command.

The version of the oc binary can help to identify if an issue could be caused because the oc version is different than the version of the cluster.
The logs generated by the oc adm must-gather command will help to identify if some information could not be collected, and also the exact image used

4. List any affected packages or components.

[1] https://github.com/openshift/oc/blob/1d4a9afe8cf066b4c34018b3e0f4919d0157f091/pkg/cli/admin/mustgather/summary.go#L16-L45

Sub-task WRKLDS-1012: oc adm must-gather: pull gather container logs

View the Description View the linked PRs

Include logs generated by the command into the must-gather directory when running the oc adm must-gather command.

https://github.com/openshift/oc/pull/1641

Feature OCPSTRAT-1140: HCP KubeVirt Advanced Multiple Network Integration

View the Description

Feature Overview (aka. Goal Summary)

Add support for standalone secondary networks for HCP kubevirt.

Advanced multus integration involves the following scenarios

1. Secondary network as single interface for VM
2. Multiple Secondary Networks as multiple interfaces for VM

Goals (aka. expected user outcomes)

Users of HCP KubeVirt should be able to create a guest cluster that is completely isolated on a secondary network outside of the default pod network.

Requirements (aka. Acceptance Criteria):

Deployment considerations	List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both	self-managed
Classic (standalone cluster)	na
Hosted control planes	yes
Multi node, Compact (three node), or Single node (SNO), or all	na
Connected / Restricted Network	yes
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)	x86
Operator compatibility	na
Backport needed (list applicable versions)	na
UI need (e.g. OpenShift Console, dynamic plugin, OCM)	na
Other (please specify)	na

Documentation Considerations

ACM documentation should include how to configure secondary standalone networks.

Epic CNV-37390: Hypershift/KubeVirt Advanced Multus Integration (standalone secondary networks)

View the Description

This is a continuation of ~~CNV-33392~~.

Multus Integration for HCP KubeVirt has three scenarios.

1. Secondary network as single interface for VM
2. Multiple Secondary Networks as multiple interfaces for VM
3. Secondary network + pod network (default for kubelet) as multiple interfaces for VM

Item 3 is the simplest use case because it does not require any additional considerations for ingress and load balancing. This scenario [item 3] is covered by ~~CNV-33392~~.

Items [1,2] are what this epic is tracking, which we are considering advanced use cases.

Task HOSTEDCP-1364: HS/KV multus implement ingress service endpoint generator

View the Description View the linked PRs

When creating a cluster with a secondary network, the ingress is broken since the created service cannot access the VMs' secondary addresses.

We need to create a controller that manually creates and updates the service endpoints so they always point to the VMs' IPs.
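
The core of such a controller is a reconcile step that compares the addresses currently published in the endpoints with the VMs' secondary addresses. The sketch below shows only that diff; the function name and inputs are hypothetical, and it is not the HyperShift implementation.

```python
def reconcile_endpoints(current_ips: set, vm_secondary_ips: set):
    """Return (to_add, to_remove) so the endpoints match the VM IPs."""
    to_add = sorted(vm_secondary_ips - current_ips)
    to_remove = sorted(current_ips - vm_secondary_ips)
    return to_add, to_remove


# Hypothetical addresses: one VM was replaced and got a new IP.
print(reconcile_endpoints({"192.0.2.10", "192.0.2.11"}, {"192.0.2.11", "192.0.2.12"}))
```

Running the step repeatedly is safe: once the endpoints match, both lists are empty.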

https://github.com/openshift/hypershift/pull/3343

Feature OCPSTRAT-1154: AWS Public IPv4 Address cost mitigation

View the Description

Feature Overview (aka. Goal Summary)

The default OpenShift installation on AWS uses multiple IPv4 public IPs which Amazon will start charging for starting in February 2024. As a result, there is a requirement to find an alternative path for OpenShift to reduce the overall cost of a public cluster while this is deployed on AWS public cloud.

Goals (aka. expected user outcomes)

Provide an alternative path to reduce the new costs associated with public IPv4 addresses when deploying OpenShift on AWS public cloud.

Requirements (aka. Acceptance Criteria):

There is a new path for "external" OpenShift deployments on AWS public cloud where the new costs associated with public IPv4 addresses have a minimum impact on the total cost of the required infrastructure on AWS.

Background

Ongoing discussions on this topic are happening in Slack in the #wg-aws-ipv4-cost-mitigation private channel

Documentation Considerations

Usual documentation will be required in case there are any new user-facing options available as a result of this feature.

Epic SPLAT-1432: [aws][BYO-IPv4] Implement BYO Public IPv4 Pool support on Installer (Terraform)

View the Description

Epic Goal

Implement support in the install config to receive a Public IPv4 Pool ID and create resources* that consume public IPs when the publish strategy is "External".
The implementation must cover the installer changes in Terraform to provision the infrastructure

*Resources that consume public IPv4 addresses: bootstrap, API public NLB, NAT gateways
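
The relationship between the pool ID and the publish strategy can be expressed as a small validation step. This is a sketch under assumptions: the field name `publicIpv4Pool` under `platform.aws`, the "External" default for `publish`, and the example pool ID are illustrative and are not confirmed by this card.

```python
def validate_public_ipv4_pool(install_config: dict) -> list:
    """Return validation errors for the (assumed) publicIpv4Pool field."""
    errors = []
    aws = install_config.get("platform", {}).get("aws", {})
    pool = aws.get("publicIpv4Pool")  # assumed field name
    publish = install_config.get("publish", "External")  # assumed default
    if pool and publish != "External":
        errors.append("publicIpv4Pool is only used when publish is External")
    return errors


# Hypothetical pool ID, for illustration only.
config = {
    "publish": "External",
    "platform": {"aws": {"region": "us-east-1", "publicIpv4Pool": "ipv4pool-ec2-0123456789abcdef0"}},
}
print(validate_public_ipv4_pool(config))
```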

Why is this important?

The default OpenShift installation on AWS uses multiple IPv4 public IPs which Amazon will start charging for starting in February 2024.

Scenarios

As a customer with BYO Public IPv4 pools in AWS, I would like to install an OpenShift cluster on AWS consuming public IPs from my own CIDR blocks, so I can control the IPs used by the services I provide and will not be impacted by AWS Public IPv4 charges

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Will the terraform implementation be backported to supported releases to be used in CI for previous e2e test infra?
Is there a method to use the pool by default for all Public IPv4 claims from a given VPC/workload? So the implementation doesn't need to create EIP and associations for each resource and subnet/zone.

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story SPLAT-1434: [aws] Create support of Public IPv4 Pool on installer (Terraform)

View the Description View the linked PRs

USER STORY:

As a customer with custom Public IPv4 blocks presented in AWS, I would like to install OpenShift clusters on AWS with publish strategy Public consuming public IPv4 blocks from my own pool, so I will not be impacted by additional Public IPv4 charges

DESCRIPTION:

Required:

e2e workflow with a presubmit job testing the new configuration

Nice to have:

...

ACCEPTANCE CRITERIA:

Installer PR reviewed, accepted and merged
e2e step testing the PR
- Presubmit job testing that flow (frequency should be defined by Test platform team)

ENGINEERING DETAILS:

https://github.com/openshift/installer/pull/7983

Feature OCPSTRAT-117: Control Plane Machine Set Testing(4.15)

View the Description

This Feature covers the pending tasks from ~~OCPBU-16~~ to be covered in openshift-4.14.

Goal: Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.

Problem: There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.

Why is this important: Increased operational simplicity and scale flexibility of the cluster's control plane deployment.

See slack working group: #wg-ctrl-plane-resize

Epic OCPCLOUD-1978: ControlPlaneMachineSet Technical Debt

View the Description

Epic Goal

Resolve the outstanding technical debt from the ControlPlaneMachineSet project

Why is this important?

We need to make sure the project is tested, documented and maintained going forward

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

~~OCPCLOUD-1372~~

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story OCPCLOUD-1894: CPMSO: WebHooks add warning for missing TargetPools on GCP

View the Description View the linked PRs

User Story

As a user I'd like to be warned when I'm setting up a Control Plane Machine Set for my control plane machines on GCP and I don't have the necessary TargetPools requirement for it to work correctly.

Background

We previously had a validating webhook on the CPMSO for GCP that checked whether the control plane machine provider config set on the CPMS had TargetPools, and returned an error if it did not.

This unfortunately collides with GCP Private clusters, see https://issues.redhat.com/browse/OCPBUGS-6760.

As such we decided to remove the check until warnings are supported in controller-runtime.
Once those land we can re-add the check and throw a warning in those situations, to still inform the user without disrupting normal operations where that's not an issue.

See: https://redhat-internal.slack.com/archives/C68TNFWA2/p1675105892589279.

Webhook warnings are available as of the 0.16.0 release of controller-runtime.
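
The intended behaviour, warning instead of rejecting, can be illustrated with a minimal Python sketch. This is not the CPMSO webhook code (which is written in Go against controller-runtime); the `ValidationResult` type, the `targetPools` key, and the warning text are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    """Outcome of a validating webhook: warnings never block the request."""
    allowed: bool = True
    warnings: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)


def validate_gcp_provider_config(provider_config: dict) -> ValidationResult:
    """Warn (rather than reject) when a GCP provider config has no target pools.

    `targetPools` is a hypothetical key used only for this illustration.
    """
    result = ValidationResult()
    if not provider_config.get("targetPools"):
        # A missing value is legitimate for private clusters, so only warn.
        result.warnings.append(
            "no TargetPools set; this may be expected for private clusters"
        )
    return result


private_cluster = validate_gcp_provider_config({"region": "us-central1"})
public_cluster = validate_gcp_provider_config(
    {"region": "us-central1", "targetPools": ["example-pool"]}
)
```

In this sketch both requests are admitted; only the one without target pools carries a warning back to the user.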

https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/269

Feature OCPSTRAT-1185: Support migration to Microsoft Entra Workload ID (formerly known as Azure AD Workload Identity)

Feature Overview (aka. Goal Summary)

As a cluster administrator, I would like to migrate my existing cluster, which does not currently use Azure AD Workload Identity, to use Azure AD Workload Identity.

Goals (aka. expected user outcomes)

Many customers would like to migrate to Azure AD Workload Identity with minimal downtime but have numerous existing clusters and an aversion to supporting two concurrent operational requirements. Therefore they would like to migrate existing Azure clusters to Azure AD Workload Identity in a safe manner after they have been upgraded to a version of OCP supporting that feature (4.14+).

Requirements (aka. Acceptance Criteria):

Provide a documented method for migration to Azure AD Workload Identity for OpenShift 4.14+ with minimal downtime, without customers having to start over with a new cluster using Azure AD Workload Identity and migrate their workloads over. If there is a risk of workload disruption or downtime, we will inform customers of this risk and have them accept it.

Deployment considerations	List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both	Self-managed
Classic (standalone cluster)	Classic
Hosted control planes	N/A
Multi node, Compact (three node), or Single node (SNO), or all	All
Connected / Restricted Network, or all	All
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)	All applicable architectures
Operator compatibility
Backport needed (list applicable versions)	4.14+
UI need (e.g. OpenShift Console, dynamic plugin, OCM)	N/A
Other (please specify)

Epic CCO-456: Support migration to Azure Managed Identity

Goal

Spike to evaluate if we can provide an automated way to support migration to Azure Managed Identity (preferred), or alternatively a manual method (second option) for customers to perform the migration themselves that is documented and supported, or not at all.

This spike will evaluate, scope the level of effort (sizing), and make recommendation on next steps.

Feature request

Support migration to Azure Managed Identity

Feature description

Many customers would like to migrate to Azure Managed Identity but have numerous existing clusters and an aversion to supporting two concurrent operational requirements. Therefore they would like to migrate existing Azure clusters to Managed Identity in a safe manner after they have been upgraded to a version of OCP supporting that feature (4.14+).

Why?

Provide a uniform operational experience for all clusters running versions which support Azure Managed Identity without having to decommission long running clusters

Other considerations

Disruption to customer's workload.
Has to be closely coordinated with update effort to minimize disruption.
Tokenized operators and other layered products - work not yet done (OCP 4.15/4.16 plans) and has to be manually done for now and may not cover the full set.
If we grant this for Azure MI/WI, we will likely need to also do this for STS and GCP WIF.
If we grant this, would we do this for self-managed and managed OpenShift (ARO)?

Story CCO-525: Update repo documentation for Azure workload identity

The cloud-credential-operator repository documentation can be updated for installing and/or migrating a cluster with Azure workload identity integration.

The document currently uses manually numbered lists, which creates more work when we want to add/remove steps. Migrating the document to use dynamically numbered lists where each number is (1.) will remove this extra work in future changes.
During installation, we can take advantage of the new(ish) `--included` parameter of the `oc adm release extract` command by creating an install-config prior to executing the command.
During migration, we can take advantage of the new(ish) `reboot-machine-config-pool` sub-command of `oc adm` to restart all of the pods on a cluster to reduce the risks. The sub-command restarts each node in series resulting in highly available services being restarted one pod at a time.

https://github.com/openshift/cloud-credential-operator/pull/680

Feature OCPSTRAT-1197: Add support for Johannesburg, South Africa (africa-south1) in GCP

Feature Overview (aka. Goal Summary)

Add support for Johannesburg, South Africa (africa-south1) in GCP

Goals (aka. expected user outcomes)

As a user I'm able to deploy OpenShift in Johannesburg, South Africa (africa-south1) in GCP and this region is fully supported

Requirements (aka. Acceptance Criteria):

A user can deploy OpenShift in GCP Johannesburg, South Africa (africa-south1) using all the supported installation tools for self-managed customers.

Support for this region is backported to the previous OpenShift EUS release.

Background

Google Cloud has added a new region to its public cloud offering, and this region needs to be supported for OpenShift deployments like the other regions.

Documentation Considerations

Information about the new region needs to be added to the documentation so that it is listed as supported.

Epic CORS-3274: post-merge testing: Add support for Johannesburg, South Africa (africa-south1) in GCP

Epic Goal

Validate the Installer can deploy OpenShift on Johannesburg, South Africa (africa-south1) in GCP
Create the corresponding OCPBUGs if there is any issue while validating this

https://github.com/openshift/installer/pull/8055

Feature OCPSTRAT-1221: Add support for Dammam, Saudi Arabia, Middle East (me-central2) region in GCP

Feature Overview (aka. Goal Summary)

Add support for Dammam, Saudi Arabia, Middle East (me-central2) region in GCP

Goals (aka. expected user outcomes)

As a user I'm able to deploy OpenShift in Dammam, Saudi Arabia, Middle East (me-central2) region in GCP and this region is fully supported

Requirements (aka. Acceptance Criteria):

A user can deploy OpenShift in GCP Dammam, Saudi Arabia, Middle East (me-central2) region using all the supported installation tools for self-managed customers.

Support for this region is backported to the previous OpenShift EUS release.

Background

Google Cloud has added a new region to its public cloud offering, and this region needs to be supported for OpenShift deployments like the other regions.

Documentation Considerations

Information about the new region needs to be added to the documentation so that it is listed as supported.

Epic CORS-3303: post-merge testing: Add support for Dammam, Saudi Arabia, Middle East (me-central2) region in GCP

Epic Goal

Validate the Installer can deploy OpenShift on Dammam, Saudi Arabia, Middle East (me-central2) region in GCP
Create the corresponding OCPBUGs if there is any issue while validating this

https://github.com/openshift/installer/pull/8132

Feature OCPSTRAT-1251: HCP: Guest clusters should support up to 500 worker nodes

Feature Overview (aka. Goal Summary)

Enable Hosted Control Planes guest clusters to support up to 500 worker nodes. This enables customers to run clusters with a large number of worker nodes.

Goals (aka. expected user outcomes)

Max cluster size 250+ worker nodes (mainly about control plane). See XCMSTRAT-371 for additional information.
Service components should not be overwhelmed by additional customer workloads: they should use larger cloud instances, and smaller cloud instances when the number of worker nodes is below the threshold.

Requirements (aka. Acceptance Criteria):

Deployment considerations	List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both	Managed
Classic (standalone cluster)	N/A
Hosted control planes	Yes
Multi node, Compact (three node), or Single node (SNO), or all	N/A
Connected / Restricted Network	Connected
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)	x86_64, ARM
Operator compatibility	N/A
Backport needed (list applicable versions)	N/A
UI need (e.g. OpenShift Console, dynamic plugin, OCM)	N/A
Other (please specify)

Questions to Answer (Optional):

Check with OCM and CAPI requirements to expose larger worker node count.

Documentation:

Design document detailing the autoscaling mechanism and configuration options
User documentation explaining how to configure and use the autoscaling feature.

Acceptance Criteria

Configure max-node size from CAPI
Management cluster nodes automatically scale up and down based on the hosted cluster's size.
Scaling occurs without manual intervention.
A set of "warm" nodes are maintained for immediate hosted cluster creation.
Resizing nodes should not cause significant downtime for the control plane.
Scaling operations should be efficient and have minimal impact on cluster performance.

Epic HOSTEDCP-1413: Support 500 worker nodes in ROSA

Goal

Dynamically scale the serving components of control planes

Why is this important?

To be able to have clusters with a large number of worker nodes

Scenarios

When a hosted cluster's worker node count increases past a threshold (X), the serving components are moved to larger cloud instances.
When a hosted cluster's worker node count falls below a threshold, the serving components are moved to smaller cloud instances.

Acceptance Criteria

Dev - Has a valid enhancement if necessary
CI - MUST be running successfully with tests automated
QE - covered in Polarion test plan and tests implemented

Done Checklist

CI - CI is running, tests are automated and merged.
Release Technical Enablement <link to Feature Enablement Presentation>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story HOSTEDCP-1428: Expose Node Count on HostedControlPlanes

User Story:

As a HyperShift controller in the management plane, I want to be able to:

read the capacity of a cluster measured in the count of nodes from a HostedControlPlane or HostedCluster

so that I can achieve

automation that reacts to the number of cluster workers changing at runtime

Acceptance Criteria:

Description of criteria:

the count of nodes can be read from the HostedCluster or HostedControlPlane status

Engineering Details:

design in https://docs.google.com/document/d/1VV1NZgrimSNFjJEH3PeZmUquey_YX3lm2l0FDwP8WDc/edit

This requires a design proposal.
This does not require a feature gate.

https://github.com/openshift/hypershift/pull/3557

Story HOSTEDCP-1429: Expose T-Shirt Size On HostedClusters

User Story:

As the HyperShift scheduler, I want to be able to:

see the size of each HostedCluster as translated to "t-shirt" sizes

so that I can achieve

correct scheduling outcomes and warm-node reserve

As the HyperShift administrator, I want to be able to:

configure how the size of each HostedCluster is translated to "t-shirt" sizes
configure how often the size of each HostedCluster is translated to "t-shirt" sizes, both as the size gets larger and as it gets smaller

so that I can achieve

tuning for auto-scaling of my management cluster

Acceptance Criteria:

Description of criteria:

allow configuration for sizing buckets and sizing change throughput
document the configuration above
expose t-shirt sizes of hosted clusters
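
As a rough illustration of the sizing buckets described above, the following Python sketch maps a node count to a "t-shirt" size. The bucket boundaries and size names are hypothetical and are not taken from the HyperShift design; the sketch also leaves out the configurable rate at which size changes are applied.

```python
from bisect import bisect_right

# Hypothetical buckets: (minimum node count, size name), sorted by minimum.
SIZE_BUCKETS = [(0, "small"), (11, "medium"), (101, "large")]


def tshirt_size(node_count: int, buckets=SIZE_BUCKETS) -> str:
    """Return the name of the last bucket whose minimum is <= node_count."""
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    minimums = [minimum for minimum, _ in buckets]
    # bisect_right gives the number of minimums <= node_count.
    index = bisect_right(minimums, node_count) - 1
    return buckets[index][1]
```

Keeping the buckets in a plain sorted list is one way to make them configurable, since an administrator-supplied list can be passed as the `buckets` argument.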

Engineering Details:

design: https://docs.google.com/document/d/1VV1NZgrimSNFjJEH3PeZmUquey_YX3lm2l0FDwP8WDc/edit

This requires a design proposal.
This does not require a feature gate.

https://github.com/openshift/hypershift/pull/3686

Story HOSTEDCP-1478: Hosted Cluster request serving pods should be scheduled on nodes according to size label

User Story:

When a HostedCluster is given a size label, it needs to ensure that request serving nodes exist for that size label and when they do, reschedule request serving pods to the appropriate nodes.

Acceptance Criteria:

Description of criteria:

HostedClusters reconcile the content of the hypershift.openshift.io/request-serving-node-additional-selector label
Reschedule the request serving pods to run on instances matching the label
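
The selection step in the criteria above can be sketched in Python as follows. The `key=value` format assumed for the label's content, the `size` label key, and the node inventory are all assumptions made for illustration; the actual HyperShift implementation may differ.

```python
def parse_selector(value: str) -> tuple[str, str]:
    """Split a 'key=value' selector string (assumed format) into its parts."""
    key, sep, val = value.partition("=")
    if not sep or not key:
        raise ValueError(f"expected key=value, got {value!r}")
    return key, val


def matching_nodes(nodes: dict[str, dict[str, str]], selector: str) -> list[str]:
    """Return the sorted names of nodes whose labels satisfy the selector."""
    key, val = parse_selector(selector)
    return sorted(name for name, labels in nodes.items() if labels.get(key) == val)


# Hypothetical node inventory: node name -> labels.
NODES = {
    "node-a": {"size": "small"},
    "node-b": {"size": "large"},
    "node-c": {"size": "large"},
}
```

An empty result from `matching_nodes` corresponds to the case where request serving nodes for the size do not exist yet and must be provided before pods can be rescheduled.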

This does not require a design proposal.
This does not require a feature gate.

https://github.com/openshift/hypershift/pull/3729

Story HOSTEDCP-1469: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/hypershift/pull/3708

Feature OCPSTRAT-1286: [Tech Preview] Cluster API Provider for vSphere

Feature Overview

Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone OpenShift

Prerequisite work: goals completed in ~~OCPSTRAT-1122~~
Complete the design of the Cluster API (CAPI) architecture and build the core operator logic needed for Phase-1, incorporating the assets from different repositories to simplify asset management.

Phase 1 & 2 covers implementing base functionality for CAPI.

Background, and strategic fit

Initially CAPI did not meet OCP's requirements for cluster/machine management. The project has since moved on, and CAPI is now a better fit and also has better community involvement.
CAPI has much better community interaction than MAPI.
Other projects are considering using CAPI and it would be cleaner to have one solution
Long term it will allow us to add new features more easily in one place vs. doing this in multiple places.

Acceptance Criteria

There must be no negative effect on customers/users of the MAPI. This API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.

Epic OCPCLOUD-1841: CAPI providers: vSphere Tech Preview

sets up CAPI ecosystem for vSphere

Spike OCPCLOUD-1609: Investigate what is required to run CAPI vSphere MachineSets

So far we haven't tested this provider at all. We have to run it and spot if there are any issues with it.

Steps:

Try to run the vSphere provider on OCP
Try to create MachineSets
Try various features out and note down bugs
Create stories for resolving issues upstream and downstream

Outcome:

Create stories in epic of items for vSphere that need to be resolved

Story OCPCLOUD-1610: Add vsphere to downstream CAPI operator

The vSphere provider is not present in the downstream operator and has to be added there.

This will include adding a tech preview e2e job in release repo and going through the process described here https://github.com/openshift/cluster-capi-operator/blob/main/docs/provideronboarding.md

https://github.com/openshift/cluster-capi-operator/pull/149

Feature OCPSTRAT-145: Perf/Scale improvements and testing of OVN-Kubernetes

Continue scale testing and performance improvements for ovn-kubernetes

Epic SDN-4152: [4.15] Perf-scale improvements and testing of OVN-Kubernetes

Epic Goal

Manage OpenShift Virtual Machines IP addresses from within the SDN solution provided by OVN-Kubernetes.

Why is this important?

Customers want to offload IPAM from their custom solutions (e.g. custom DHCP server running on their cluster network) to SDN.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

Priority is set by engineering
Epic must be linked to a Parent Feature
Target version must be set
Assignee must be set
Enhancement Proposal is Implementable
No outstanding questions about major work breakdown
Are all Stakeholders known? Have they all been notified about this item?
Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story SDN-4194: Define alerts for high OVS CPU usage

Investigate options to identify OVN scale problems, which usually cause high CPU usage in ovn-controller and vswitchd

https://docs.google.com/document/d/15PLDLKB9tGnbGYMhdHjlOsvlvXzT9TV7VfKmTJCwMRk/edit

Feature OCPSTRAT-178: Allow adding ipv6 VIPs to existing dual stack clusters via Mutable Infrastructure

Goal:
Support enabling dual-stack VIPs on existing clusters that were created as dual-stack at a time when it was not possible to have both v4 and v6 VIPs.

Why is this important?
This is a followup to SDN-2213 ("Support dual ipv4 and ipv6 ingress and api VIPs").

We expect that customers with existing dual stack clusters will want to make use of the new dual stack VIPs fixes/enablement, but it's unclear how this will work because we've never supported modifying on-prem networking configuration after initial deployment. Once we have dual stack VIPs enabled, we will need to investigate how to alter the configuration to add VIPs to an existing cluster.

We will need to make changes to the VIP fields in the Infrastructure and/or ControllerConfig objects. Infrastructure would be the first option since that would make all of the fields consistent, but that relies on the ability to change that object and have the changes persist and be propagated to the ControllerConfig. If that's not possible, we may need to make changes just in ControllerConfig.

Epic OPNET-340: Mechanism to change Infrastructure values

For epics https://issues.redhat.com/browse/OPNET-14 and https://issues.redhat.com/browse/OPNET-80 we need a mechanism to change configuration values related to our static pods. Today that is not possible because all of the values are put in the status field of the Infrastructure object.

We had previously discussed this as part of https://issues.redhat.com/browse/OPNET-21 because there was speculation that people would want to move from internal LB to external, which would require mutating a value in Infrastructure. In fact, there was a proposal to put that value in the spec directly and skip the status field entirely, but that was discarded because a migration would be needed in that case and we need separate fields to indicate what was requested and what the current state actually is.

There was some followup discussion about that with Joel Speed from the API team (which unfortunately I have not been able to find a record of yet) where it was concluded that if/when we want to modify Infrastructure values we would add them to the Infrastructure spec and when a value was changed it would trigger a reconfiguration of the affected services, after which the status would be updated.

This means we will need new logic in MCO to look at the spec field (currently there are only fields in the status, so spec is ignored completely) and determine the correct behavior when they do not match. This will mean the values in ControllerConfig will not always match those in Infrastructure.Status. That's about as far as the design has gone so far, but we should keep the three use cases we know of (internal/external LB, VIP addition, and DNS record overrides) in mind as we design the underlying functionality to allow mutation of Infrastructure status values.

Depending on how the design works out, we may only track the design phase in this epic and do the implementation as part of one of the other epics. If there is common logic that is needed by all and can be implemented independently we could do that under this epic though.

Story OPNET-358: Make CNO support modifying Infrastructure.Spec

Infrastructure.Spec will be modified by the end user. CNO needs to validate those changes and, if they are valid, propagate them to Infrastructure.Status.
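
The validate-then-propagate flow can be sketched in Python as below. This is not the CNO implementation: the objects are plain dictionaries, `apiServerInternalIPs` is used only as an example field, and the validation rule (one VIP, or two VIPs of different IP families) is an assumption made for illustration.

```python
import ipaddress

# Example field name used for illustration only.
FIELD = "apiServerInternalIPs"


def validate_vips(vips: list[str]) -> None:
    """Accept one VIP, or two VIPs of different IP families (assumed rule)."""
    if not 1 <= len(vips) <= 2:
        raise ValueError("expected one or two VIPs")
    families = [ipaddress.ip_address(vip).version for vip in vips]
    if len(vips) == 2 and families[0] == families[1]:
        raise ValueError("two VIPs must be of different IP families")


def reconcile(spec: dict, status: dict) -> dict:
    """Return the new status; requested values are copied only if valid."""
    requested = spec.get(FIELD)
    if requested is None or requested == status.get(FIELD):
        return status  # nothing requested, or already applied
    validate_vips(requested)  # raises ValueError on an invalid request
    return {**status, FIELD: list(requested)}
```

An invalid request leaves the status untouched because the exception is raised before a new status is built.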

https://github.com/openshift/cluster-network-operator/pull/2130

Story OPNET-357: Make o/installer populate new Spec fields at install-time

https://github.com/openshift/installer/pull/7604

Story OPNET-360: Make CNO fill new fields at upgrade-time

For clusters freshly installed as 4.15, o/installer will populate Infrastructure.Spec and Infrastructure.Status based on the install-config. For upgraded clusters, however, this code in o/installer will never run.

To have a consistent state after an upgrade, we will make CNO propagate Status back to Spec when a cluster is upgraded to OCP 4.15.

We already did this when introducing multiple VIPs (an API change that created a plural field next to the singular one), so all the necessary code scaffolding is in place.
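
A minimal Python sketch of such an upgrade-time backfill is shown below. The dictionaries stand in for the Infrastructure object, the field names are examples, and the choice to leave an already-set Spec value untouched is an assumption rather than documented CNO behaviour.

```python
def backfill_spec(spec: dict, status: dict, fields: list[str]) -> dict:
    """Copy each listed field from status into spec unless spec already sets it."""
    updated = dict(spec)  # do not mutate the caller's spec
    for name in fields:
        if name not in updated and name in status:
            updated[name] = status[name]
    return updated


# Hypothetical state of an upgraded cluster: values exist only in status.
upgraded_spec = backfill_spec(
    spec={},
    status={"apiServerInternalIPs": ["192.0.2.10"], "ingressIPs": ["192.0.2.11"]},
    fields=["apiServerInternalIPs", "ingressIPs"],
)
```

Running the backfill a second time on its own output changes nothing, which makes it safe to repeat on every reconcile.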

https://github.com/openshift/cluster-network-operator/pull/2130

Story OPNET-415: Revert the API revert for 4.16

Tasks to do here

Revert the revert of API extension
Revendor API in client-go
Revendor API in installer

Feature OCPSTRAT-210: [Azure] Ensure CSI Stack is running on management clusters with hosted control planes (GA)

Feature Overview:

Ensure CSI Stack for Azure is running on management clusters with hosted control planes, allowing customers to associate a cluster as "Infrastructure only" and move the following parts of the stack:

Azure Disk CSI driver
Azure File CSI driver
Azure File CSI driver operator

Value Statement:

This feature enables customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers should care most about their workloads, not the management stack needed to operate their clusters. This feature gets us closer to that goal.

Goals:

Ability for customers to associate a cluster as "Infrastructure only" and pack control planes on role=infra nodes.
Ability to run cluster-storage-operator (CSO) + Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods in the management cluster.
Ability to run the driver DaemonSet in the hosted cluster.

Requirements:

The feature must ensure that the CSI Stack for Azure is installed and running on management clusters with hosted control planes.
The feature must allow customers to associate a cluster as "Infrastructure only" and pack control planes on role=infra nodes.
The feature must enable the Azure Disk CSI driver, Azure File CSI driver, and Azure File CSI driver operator to run on the appropriate clusters.
The feature must enable the cluster-storage-operator (CSO) + Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods to run in the management cluster.
The feature must enable the driver DaemonSet to run in the hosted cluster.
The feature must ensure security, reliability, performance, maintainability, scalability, and usability.

Use Cases:

A customer wants to run their Azure infrastructure using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. They use this feature to associate a cluster as "Infrastructure only" and pack control planes on role=infra nodes.
A customer wants to use Azure storage without having to see/manage its stack, especially on a managed service. This would mean that we need to run the cluster-storage-operator (CSO) + Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods in the management cluster and the driver DaemonSet in the hosted cluster.

Questions to Answer:

What Azure-specific considerations need to be made when designing and delivering this feature?
How can we ensure the security, reliability, performance, maintainability, scalability, and usability of this feature?

Out of Scope:

Functionality unrelated to the CSI Stack for Azure is out of scope for this feature.

Workload identity authentication is not covered by this feature - see STOR-1748

Background

This feature is designed to enable customers to run their Azure infrastructure more efficiently and cost-effectively by using HyperShift control planes and supporting infrastructure without incurring additional charges from Red Hat.

Documentation Considerations:

Documentation for this feature should provide clear instructions on how to enable the CSI Stack for Azure on management clusters with hosted control planes and associate a cluster as "Infrastructure only." It should also include instructions on how to move the Azure Disk CSI driver, Azure File CSI driver, and Azure File CSI driver operator to the appropriate clusters.

Interoperability Considerations:

This feature impacts the CSI Stack for Azure and any layered products that interact with it. Interoperability test scenarios should be factored in by the layered products.

Epic STOR-1697: Azure File CSI on Hosted control planes GA

Epic Goal*

What is our purpose in implementing this? What new capability will be available to customers?

Run Azure File CSI driver operator + Azure File CSI driver control-plane Pods in the management cluster, run the driver DaemonSet in the hosted cluster allowing customers to associate a cluster as "Infrastructure only".

Why is this important? (mandatory)

This allows customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers should care most about their workloads, not the management stack needed to operate their clusters. This feature gets us closer to that goal.

Scenarios (mandatory)

When leveraging Hosted control planes, the Azure File CSI driver operator + Azure File CSI driver control-plane Pods should run in the management cluster. The driver DaemonSet should run in the hosted cluster. This deployment model should provide the same feature set as the regular OCP deployment.

Dependencies (internal and external) (mandatory)

Hosted control plane on Azure.

Contributing Teams(and contacts) (mandatory)

Development - STOR
Documentation -
QE -
PX -
Others -

Done - Checklist (mandatory)

CI Testing - Basic e2e automation tests are merged and completing successfully
Documentation - Content development is complete.
QE - Test scenarios are written and executed successfully.
Technical Enablement - Slides are complete (if requested by PLM)
Engineering Stories Merged
All associated work items with the Epic are closed
Epic status should be “Release Pending”

Story STOR-1758: Allow CSO to start and run azure-file-operator on hypershift

We need to make changes to CSO so that it can run azure-file-operator on hypershift.

https://github.com/openshift/cluster-storage-operator/pull/454

Story STOR-1750: Move building and CI of existing operator to csi-operator

As part of this story, we will simply move building and CI of the existing code to the combined csi-operator.

Story STOR-1762: Modify csi-operator to run azure-file on hypershift

We need to modify csi-operator so that it can be run as the azure-file operator on hypershift and standalone clusters.

Epic STOR-1696: Azure Disk CSI on Hosted control planes GA

Epic Goal*

What is our purpose in implementing this? What new capability will be available to customers?

Run Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods in the management cluster, run the driver DaemonSet in the hosted cluster allowing customers to associate a cluster as "Infrastructure only".

Why is this important? (mandatory)

Scenarios (mandatory)

When leveraging Hosted control planes, the Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods should run in the management cluster. The driver DaemonSet should run on the managed cluster. This deployment model should provide the same feature set as the regular OCP deployment.

Dependencies (internal and external) (mandatory)

Hosted control plane on Azure.

Contributing Teams(and contacts) (mandatory)

Development - STOR
Documentation -
QE -
PX -
Others -

Done - Checklist (mandatory)

As part of this epic, Engineers working on Azure Hypershift should be able to build and use Azure Disk storage on hypershift guests via developer preview custom build images.

Story STOR-1722: Deploy Azure Disk driver and operator by default in hypershift environments

View the Description View the linked PRs

For this story, we are going to enable deployment of azure disk driver and operator by default in hypershift environment.

Story STOR-1713: Build azure disk operator from csi-operator repo

View the linked PRs

Feature OCPSTRAT-231: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic SDN-3516: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story SDN-3708: Add knob in cno

View the Description View the linked PRs

Add a knob in CNO to allow users to modify the changes made in ovn-k

https://github.com/openshift/cluster-network-operator/pull/1807

Feature OCPSTRAT-346: Make Ingress Operator optional

View the Description

Make it possible to entirely disable the Ingress Operator by leveraging the OCPPLAN-9638 Composable OpenShift capability.

Why is this important?

For Managed OpenShift on AWS (ROSA), we use the AWS load balancer and don't need the Ingress operator. Disabling the Ingress Operator will reduce our resource consumption on infra nodes for running OpenShift on AWS.
Customers want to be able to disable the Ingress Operator and use their own component.

Scenarios

This feature must consider the different deployment footprints including self-managed and managed OpenShift, connected vs. disconnected (restricted and air-gapped), implications for auto scaling, chargeback/showback use scenarios, etc.
Disabled configuration must persist throughout cluster lifecycle including upgrades

Epic NE-1129: Composable OpenShift: Make ingress optional on HyperShift

View the Description

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Links:

RFE: https://issues.redhat.com/browse/RFE-3395

Enhancement PR: https://github.com/openshift/enhancements/pull/1415

API PR: https://github.com/openshift/api/pull/1516

Ingress Operator PR: https://github.com/openshift/cluster-ingress-operator/pull/950

Background

Feature Goal: Make it possible to entirely disable the Ingress Operator by leveraging the Composable OpenShift capability.

Epic Goal

Implement the ingress capability focusing on the HyperShift users.

Non-Goals

Fully implement the ingress capability on the standalone OpenShift.

Design

As described in the EP PR.

Why is this important?

For Managed OpenShift on AWS (ROSA), we use the AWS load balancer and don't need the Ingress operator. Disabling the Ingress Operator will reduce our resource consumption on infra nodes for running OpenShift on AWS.
Customers want to be able to disable the Ingress Operator and use their own component.

Scenarios

# ...

Acceptance Criteria

* Release Technical Enablement - Provide necessary release enablement details and documents.
* Ingress Operator can be disabled on HyperShift.

Dependent operators and OpenShift components can tolerate the disabled ingress operator on HyperShift.

Dependencies (internal and external)

# The install-config and ClusterVersion API have been updated with the capability feature.
# The console operator.

Previous Work (Optional):

Open questions:

Done Checklist

* CI - CI is running, tests are automated and merged.
* Release Enablement <link to Feature Enablement Presentation>
* DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
* DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
* DEV - Downstream build attached to advisory: <link to errata>
* QE - Test plans in Polarion: <link or reference to Polarion>
* QE - Automated tests merged: <link or reference to automated tests>
* DOC - Downstream documentation merged: <link to meaningful PR>

Story NE-1303: Add new capability for the ingress operator to OpenShift API

View the Description View the linked PRs

Context
HyperShift uses the cluster-version-operator (CVO) to manage part of the ingress operator's payload, namely CRDs and RBACs. The ingress operator's deployment, though, is reconciled directly by the control plane operator. That's why HyperShift projects the ClusterVersion resource into the hosted cluster, and the enabled capabilities have to be set properly to enable or disable the ingress operator's payload and to make users and the other operators aware of the state of the capabilities.
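As an illustration of how a dependent component could read the capability state from the projected ClusterVersion resource, here is a minimal sketch. It assumes the capability is named "Ingress" and is reported under `status.capabilities.enabledCapabilities`; the sample object is hand-written, not captured from a cluster.

```python
def ingress_enabled(cluster_version: dict) -> bool:
    """Return True when "Ingress" is listed among the enabled capabilities."""
    caps = cluster_version.get("status", {}).get("capabilities", {})
    return "Ingress" in caps.get("enabledCapabilities", [])


# Hand-written sample for illustration; the capability names are assumptions.
sample = {
    "kind": "ClusterVersion",
    "status": {
        "capabilities": {
            "knownCapabilities": ["Console", "Ingress", "Storage"],
            "enabledCapabilities": ["Console", "Storage"],
        }
    },
}

if not ingress_enabled(sample):
    print("ingress capability is disabled; skip ingress-dependent setup")
```

An operator that tolerates a disabled ingress capability would branch on a check like this instead of assuming the ingress resources exist.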

Goal
The goal of this user story is to implement a new capability in the OpenShift API repository. Use this procedure as an example.

https://github.com/openshift/api/pull/1724

Feature OCPSTRAT-427: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic NP-41: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View Demos

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-29075: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/api/pull/1755

Story NP-877: Add suite to openshift origin

View the Description View the linked PRs

We need to create CI tests to ensure that, during live migration, when a cluster has both OVN-K and OpenShift-SDN deployed on some nodes, the cluster network still works as normal. We need to run the disruptive tests as we do for upgrades.

https://github.com/openshift/origin/pull/28462

Story NP-903: Block users from triggering live migration on clusters not managed by SD team

View the linked PRs

https://github.com/openshift/cluster-network-operator/pull/2236

Story NP-874: Promote sdnLiveMigration featureGate to GA

View the linked PRs

Story SDN-4179: Add multicast metrics for SDN

View the linked PRs

https://github.com/openshift/sdn/pull/603

Feature OCPSTRAT-453: Sigstore runtime validation support in OpenShift

View the Description

Executive Summary

Image and artifact signing is a key part of a DevSecOps model. The Red Hat-sponsored sigstore project aims to simplify signing of cloud-native artifacts and sees increasing interest and uptake in the Kubernetes community. This document proposes to incrementally invest in OpenShift support for sigstore-style signed images and be public about it. The goal is to give customers a practical and scalable way to establish content trust. It will strengthen OpenShift’s security philosophy and value-add in the light of the recent supply chain security crisis.

CRIO

Support customer image validation
Support OpenShift release image validation

https://docs.google.com/document/d/12ttMgYdM6A7-IAPTza59-y2ryVG-UUHt-LYvLw4Xmq8/edit#

Epic OCPNODE-1628: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story OCPNODE-1671: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-config-operator/pull/376

Feature OCPSTRAT-460: Optimized HyperShift Operator Deployment on AKS and Adaptive Environment Detection

View the Description

Goal

The goals of this feature are:

Optimize and streamline the operations of HyperShift Operator (HO) on Azure Kubernetes Service (AKS) clusters
Enable auto-detection of the underlying environment (managed or self-managed) to optimize the HO accordingly.

Epic HOSTEDCP-1425: Create HCPs with an externalDNS Setup on AKS

View the Description

Goal

We need to be able to install the HO with external DNS and create HCPs on AKS clusters

Why is this important?

AKS clusters will serve as the management clusters on ARO HCP.

Scenarios

Acceptance Criteria

Dev - Has a valid enhancement if necessary
CI - MUST be running successfully with tests automated
QE - covered in Polarion test plan and tests implemented
Release Technical Enablement - Must have TE slides
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Technical Enablement <link to Feature Enablement Presentation>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story STOR-1805: CSI pods are failing to create on HCP on AKS

View the Description View the linked PRs

CSI pods are failing to create on HCP on an AKS cluster.

% k get pods | grep -v Running
NAME                                                  READY   STATUS                       RESTARTS   AGE
csi-snapshot-controller-cfb96bff7-7tc94               0/1     CreateContainerConfigError   0          17h
csi-snapshot-webhook-57f9799848-mlh8k                 0/1     CreateContainerConfigError   0          17h

The issue is

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  34m                    default-scheduler  Successfully assigned clusters-brcox-hypershift-arm/csi-snapshot-controller-cfb96bff7-7tc94 to aks-nodepool1-24902778-vmss000001
  Normal   Pulling    34m                    kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:805280104b2cc8d8799b14b2da0bd1751074a39c129c8ffe5fc1b370671ecb83"
  Normal   Pulled     34m                    kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:805280104b2cc8d8799b14b2da0bd1751074a39c129c8ffe5fc1b370671ecb83" in 3.036610768s (10.652709193s including waiting)
  Warning  Failed     32m (x12 over 34m)     kubelet            Error: container has runAsNonRoot and image will run as root (pod: "csi-snapshot-controller-cfb96bff7-7tc94_clusters-brcox-hypershift-arm(45ab89f2-9c00-4afa-bece-7846505edbfc)", container: snapshot-controller)
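
The failing check in the last event can be approximated as follows. This is a simplified sketch of the runAsNonRoot rule, not the kubelet's actual code; the function name and the returned messages are illustrative.

```python
from typing import Optional


def run_as_non_root_error(run_as_non_root: bool,
                          run_as_user: Optional[int],
                          image_user: str) -> Optional[str]:
    """Return an error string if the container would violate runAsNonRoot, else None.

    image_user is the USER recorded in the image config ("" when unset),
    optionally in "uid:gid" form.
    """
    if not run_as_non_root:
        return None
    if run_as_user is not None:
        # An explicit runAsUser takes precedence over the image's USER.
        return "runAsUser is 0 (root)" if run_as_user == 0 else None
    uid = image_user.split(":", 1)[0]
    if uid in ("", "0"):
        return "container has runAsNonRoot and image will run as root"
    if not uid.isdigit():
        return "image has a non-numeric user; cannot verify it is non-root"
    return None


# The situation from the events above: runAsNonRoot set, no runAsUser, image without USER.
print(run_as_non_root_error(True, None, ""))
```

Under these assumptions, either setting a non-zero runAsUser on the container or using an image with a numeric non-root USER would clear the error.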

https://github.com/openshift/hypershift/pull/3701

Story HOSTEDCP-1464: Add pull-secret flag to HyperShift install CLI

View the Description View the linked PRs

User Story:

As a HyperShift CLI user, I want to be able to:

include my pull secret when installing the HyperShift operator with an externalDNS setup

so that I can achieve

the ability to pull the Red Hat version of the externalDNS image.

Acceptance Criteria:

Description of criteria:

The ability to include the pull secret on a HyperShift operator installation is available in the HyperShift CLI

Out of Scope:

Any additional changes needed to install HCPs on AKS

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

https://github.com/openshift/hypershift/pull/3682

Epic HOSTEDCP-1286: Detect when running in managed services vs self-managed

View the Description

User Story:

As a hub cluster admin, I want to be able to:

Detect when running in managed vs self-managed mode in Azure

so that I can prevent

Impossible usage of managed service components in self-managed use cases

Acceptance Criteria:

Description of criteria:

HyperShift controllers are aware of when they run in managed environment
HyperShift components have a single programmatic way to check in which mode they run
HyperShift deployment takes the mode in consideration when setting up its components
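
One way to give components "a single programmatic way" to check the mode is a single helper that every caller uses. The sketch below is illustrative only: the environment variable name and value are hypothetical and are not taken from the HyperShift code base.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable name and value, chosen for illustration only.
MANAGED_SERVICE_ENV = "MANAGED_SERVICE"
ARO_HCP = "ARO-HCP"


def is_managed_azure(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the process is configured to run as a managed Azure service."""
    env = os.environ if environ is None else environ
    return env.get(MANAGED_SERVICE_ENV, "") == ARO_HCP


if is_managed_azure():
    print("managed mode: enable managed-only components")
else:
    print("self-managed mode: skip managed-only components")
```

Keeping the decision in one function means controllers and deployment code cannot disagree about which mode they are in.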

(optional) Out of Scope:

Implementing managed-only configs is out of scope for this story

Engineering Details:

This does not require a design proposal.
This does not require a feature gate.

Story HOSTEDCP-1487: Detect when running in managed services vs self-managed

View the Description View the linked PRs

User Story:

As a hub cluster admin, I want to be able to:

Detect when running in managed vs self-managed mode in Azure

so that I can prevent

Impossible usage of managed service components in self-managed use cases

Acceptance Criteria:

Description of criteria:

HyperShift controllers are aware of when they run in managed environment
HyperShift components have a single programmatic way to check in which mode they run
HyperShift deployment takes the mode in consideration when setting up its components

(optional) Out of Scope:

Implementing managed-only configs is out of scope for this story

Engineering Details:

This does not require a design proposal.
This does not require a feature gate.

https://github.com/openshift/hypershift/pull/3736

Feature OCPSTRAT-462: Control plane nodes on separate subnets for on-prem IPI [Phase 2]

View the Description

Feature Overview

As of OpenShift 4.14, this functionality is Tech Preview for all platforms but OpenStack, where it is GA. This Feature is to bring the functionality to GA for all remaining platforms.

Feature Description

Allow configuring control plane nodes across multiple subnets for on-premise IPI deployments. With nodes separated into subnets, also allow using an external load balancer instead of the built-in one (keepalived/haproxy) that the IPI workflow installs, so that the customer can configure their own load balancer with the ingress and API VIPs pointing to nodes in the separate subnets.

Goals

I want to install OpenShift with IPI on an on-premise platform (high priority for bare metal and vSphere) and I need to distribute my control plane and nodes across multiple subnets.

I want to use IPI automation but I will configure an external load balancer for the API and Ingress VIPs, instead of using the built-in keepalived/haproxy-based load balancer that comes with the on-prem platforms.

Background, and strategic fit

Customers require using multiple logical availability zones to define their architecture and topology for their datacenter. OpenShift clusters are expected to fit in this architecture for the high availability and disaster recovery plans of their datacenters.

Customers want the benefits of IPI and automated installations (and avoid UPI) and at the same time when they expect high traffic in their workloads they will design their clusters with external load balancers that will have the VIPs of the OpenShift clusters.

Load balancers can distribute incoming traffic across multiple subnets, which is something our built-in load balancers aren't able to do and which represents a big limitation for the topologies customers are designing.

While this is possible with IPI AWS, this isn't available with on-premise platforms installed with IPI (for the control plane nodes specifically), and customers see this as a gap in OpenShift for on-premise platforms.

Functionalities per Epic

Epic	Control Plane with Multiple Subnets	Compute with Multiple Subnets	Doesn't need external LB	Built-in LB
NE-1069 (all-platforms)	✓	✓	✓	✓
NE-905 (all-platforms)	✓	✓	✓	✕
~~NE-1086~~ (vSphere)	✓	✓	✓	✓
~~NE-1087~~ (Bare Metal)	✓	✓	✓	✓
~~OSASINFRA-2999~~ (OSP)	✓	✓	✓
~~SPLAT-860~~ (vSphere)	✓	✓	✓	✕
NE-905 (all platforms)	✓	✓	✓	✕
~~OPNET-133~~ (vSphere/Bare Metal for AI/ZTP)	✓	✓	✓	✓
~~OSASINFRA-2087~~ (OSP)	✕	✓	✓	✓
~~KNIDEPLOY-4421~~ (Bare Metal workaround)	✕	✓	✓	✓
~~SPLAT-409~~ (vSphere)	✕	✓	✓	✓

Previous Work

Workers on separate subnets with IPI documentation

We can already deploy compute nodes on separate subnets by preventing the built-in LBs from running on the compute nodes. This is documented for bare metal only for the Remote Worker Nodes use case: https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#configure-network-components-to-run-on-the-control-plane_ipi-install-installation-workflow

This procedure works on vSphere too, although it has no QE CI coverage and is not documented.

External load balancer with IPI documentation

Scenarios

vSphere: I can define 3 or more networks in vSphere and distribute my masters and workers across them. I can configure an external load balancer for the VIPs.
Bare metal: I can configure the IPI installer and the agent-based installer to place my control plane nodes and compute nodes on 3 or more subnets at installation time. I can configure an external load balancer for the VIPs.

Acceptance Criteria

Can place compute nodes on multiple subnets with IPI installations
Can place control plane nodes on multiple subnets with IPI installations
Can configure external load balancers for clusters deployed with IPI with control plane and compute nodes on multiple subnets
Can configure VIPs in an external load balancer that are routed to nodes on separate subnets and VLANs
Documentation exists for all the above cases

Epic OPNET-305: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story OPNET-476: Soften o/installer validations for ELB

View the Description View the linked PRs

Currently o/installer validates that ELB is used only in TechPreview clusters. This validation needs to be removed so that ELB can be consumed from GA.
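
The change can be pictured with a small sketch of the validation before and after GA. The function, the `ga` switch, and the error text are hypothetical; the load balancer type "UserManaged" and the feature set "TechPreviewNoUpgrade" are assumed values, and the real installer code is written in Go and structured differently.

```python
from typing import List

TECH_PREVIEW = "TechPreviewNoUpgrade"  # assumed feature set name
USER_MANAGED = "UserManaged"           # assumed load balancer type for an external LB


def validate_load_balancer(lb_type: str, feature_set: str, ga: bool) -> List[str]:
    """Return validation errors for the requested load balancer type.

    With ga=False the external load balancer is gated behind Tech Preview;
    with ga=True the gate is removed.
    """
    errors = []
    if lb_type == USER_MANAGED and not ga and feature_set != TECH_PREVIEW:
        errors.append(
            "load balancer type UserManaged requires the TechPreviewNoUpgrade feature set"
        )
    return errors


print(validate_load_balancer(USER_MANAGED, "", ga=False))  # gated: one error
print(validate_load_balancer(USER_MANAGED, "", ga=True))   # gate removed: no errors
```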

https://github.com/openshift/installer/pull/8102

Story OPNET-466: Upgrade [...]PlatformLoadBalancer from TP to GA in o/api

View the linked PRs

https://github.com/openshift/api/pull/1757

Feature OCPSTRAT-473: Multus CNI for Microshift MVP

View the Description View Demos

Goal:

Enable and support Multus CNI for microshift.

Background:

Customers with advanced networking requirements need to be able to attach additional networks to a pod, e.g. for high-performance requirements using SR-IOV or complex VLAN setups etc.

Requirements:

opt-in approach: customers can add multus if needed, e.g. by installing/adding "microshift-networking-multus" rpm package to their installation.
if possible, it would be good to be able to add multus to an existing installation. If that requires a restart/reboot, that is acceptable. If not possible, it has to be clearly documented.
it is acceptable that once multus has been added to an installation, it can not be removed. If removal can be implemented easily, that would be good. If not possible, then it has to be clearly documented.
Regarding additional networks:
1. As part of the MVP, the Bridge plugin must be fully supported
2. As stretch goal, macvlan and ipvlan plugins should be supported
3. Other plugins, esp. host device and sr-iov are out of scope for the MVP, but will be added with a later version.
4. Multiple additional networks need to be configurable, e.g. two different bridges leading to two different networks, each consumed by different pods.
IP V6 with bridge plugin. Secondary NICs passed to a container via the bridge plugin should work with IP V6, if the consuming pod does support V6. See also "out of scope"
Regarding IPAM CNI support for IP address provisioning, static and DHCP must be supported.
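
To make the bridge and IPAM requirements concrete, the following sketch builds two NetworkAttachmentDefinition objects, one with static and one with DHCP address provisioning. The network names, bridge names, and addresses are made up for illustration, and the CNI version shown is an assumption.

```python
import json


def bridge_nad(name: str, bridge: str, ipam: dict) -> dict:
    """Build a NetworkAttachmentDefinition that uses the bridge CNI plugin."""
    cni_config = {
        "cniVersion": "0.4.0",  # assumed; use the version your plugins support
        "type": "bridge",
        "bridge": bridge,
        "ipam": ipam,
    }
    return {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {"name": name},
        # Multus expects the CNI configuration as a JSON string.
        "spec": {"config": json.dumps(cni_config)},
    }


# Two different bridges leading to two different networks (hypothetical names).
# A fixed address in the network config is only suitable for a single pod.
static_net = bridge_nad(
    "bridge-static", "br-a",
    {"type": "static", "addresses": [{"address": "10.10.0.10/24"}]},
)
dhcp_net = bridge_nad("bridge-dhcp", "br-b", {"type": "dhcp"})

print(json.dumps(static_net, indent=2))
```

A pod would then select one of these networks through the `k8s.v1.cni.cncf.io/networks` annotation, for example `bridge-static`.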

Documentation:

In the existing "networking" book, we need a new chapter "4. Multiple networks". It can re-use a lot of content from the OCP doc "https://docs.openshift.com/container-platform/4.14/networking/multiple_networks/understanding-multiple-networks.html", but needs an extra chapter in the beginning, "Installing support for multiple networks".

Testing:

A simple "smoke test" that multus can be added, and a 2nd nic added to a pod (e.g. using host device) is sufficient. No need to replicate all the multus tests from OpenShift, as we assume that if it works there, it works with MicroShift.

Customer Considerations:

This document contains the MVP requirements of a MicroShift EAP customer that need to be considered.

Out of scope:

Other plugins, esp. host device and sr-iov are out of scope for the MVP, but will be added with a later version.
IP V6 support with OVN-K. That is in the scope of feature OCPSTRAT-385

https://drive.google.com/file/d/18O2-TfhryQgd_oOYPjhl-DfPHNzctafB/view

Epic USHIFT-711: Multus CNI

View the Description

Epic Goal

Provide optional Multus CNI for MicroShift

Why is this important?

Customers need to add extra interfaces directly to Pods

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Task USHIFT-2219: Prepare Container Network Plugins Dockerfile for MicroShift

View the Description View the linked PRs

It should include all CNIs (bridge, macvlan, ipvlan, etc.) - if we decide to support something, we'll just update scripting to copy those CNIs

It needs to include IPAMs: static, dynamic (DHCP), host-local (we might just not copy it to host)

RHEL9 binaries only to save space

https://github.com/openshift/containernetworking-plugins/pull/153

Feature OCPSTRAT-487: Pod Security Admission Integration - Restricted Enforcement

View the Description

Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (See here for the motivations for deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.

With OpenShift 4.11, we turned on the Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt-in their namespaces to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.

With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn".
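
The switch in the label synchronization mechanism can be sketched as below. The label keys are the standard Pod Security Admission namespace labels; the function and its `enforce_mode` switch are hypothetical and only illustrate which labels each mode writes.

```python
PSA_PREFIX = "pod-security.kubernetes.io/"


def synced_labels(profile: str, enforce_mode: bool) -> dict:
    """Return the namespace labels the synchronization would set for a profile.

    enforce_mode=False models the current behavior ("audit" and "warn");
    enforce_mode=True models the planned behavior ("enforce").
    """
    modes = ["enforce"] if enforce_mode else ["audit", "warn"]
    return {PSA_PREFIX + mode: profile for mode in modes}


print(synced_labels("restricted", enforce_mode=False))
print(synced_labels("restricted", enforce_mode=True))
```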

Epic AUTH-262: Pod Security Admission Integration - Restricted Enforcement

View the Description

Epic Goal

Get Pod Security admission to run in "restricted" mode globally by default alongside SCC admission.

Story AUTH-481: PSa Review: add PSa labels to openshift prefixed namespaces where missing

View the Description View the linked PRs

Openshift prefixed namespaces should all define their required PSa labels. Currently, the namespaces that are missing some or all PSa labels are the following:

namespace	in review	merged
openshift
openshift-apiserver-operator	PR
openshift-cloud-credential-operator	PR
openshift-cloud-network-config-controller	PR
openshift-cluster-samples-operator	PR
openshift-cluster-storage-operator	PR
openshift-config OCPBUGS-28621	PR
openshift-config-managed	PR
openshift-config-operator	PR
openshift-console	PR
openshift-console-operator	PR
openshift-console-user-settings	PR
openshift-controller-manager	PR
openshift-controller-manager-operator	PR
openshift-dns-operator	PR
openshift-etcd-operator	PR
openshift-host-network	PR
openshift-ingress-canary	PR
openshift-ingress-operator	PR
openshift-insights	PR
openshift-kube-apiserver-operator	PR
openshift-kube-controller-manager-operator	PR
openshift-kube-scheduler-operator	PR
openshift-kube-storage-version-migrator	PR
openshift-kube-storage-version-migrator-operator	PR
openshift-network-diagnostics	PR
openshift-node
openshift-operator-lifecycle-manager	PR
openshift-operators	PR
openshift-route-controller-manager	PR
openshift-service-ca	PR
openshift-service-ca-operator	PR
openshift-user-workload-monitoring	PR

Bug OCPBUGS-28621: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story AUTH-482: SCC pinning for all workloads in platform namespaces

View the Description View the linked PRs

When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.

To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).

Each workload should pin the SCC with the least-privilege, except workloads in runlevel 0 namespaces that should pin the "privileged" SCC (SCC admission is not enabled on these namespaces, but we should pin an SCC for tracking purposes).
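
As a sketch of what pinning looks like on a workload manifest, the helper below adds an annotation to the pod template. The annotation key `openshift.io/required-scc` and the SCC name `restricted-v2` are assumptions used for illustration; the helper itself is hypothetical.

```python
import copy

REQUIRED_SCC_ANNOTATION = "openshift.io/required-scc"  # assumed annotation key


def pin_scc(workload: dict, scc: str) -> dict:
    """Return a copy of a Deployment-like manifest with the SCC pinned on its pod template."""
    pinned = copy.deepcopy(workload)
    metadata = pinned["spec"]["template"].setdefault("metadata", {})
    metadata.setdefault("annotations", {})[REQUIRED_SCC_ANNOTATION] = scc
    return pinned


deployment = {
    "kind": "Deployment",
    "metadata": {"name": "example-operator", "namespace": "openshift-example"},
    "spec": {"template": {"spec": {"containers": []}}},
}

pinned = pin_scc(deployment, "restricted-v2")
print(pinned["spec"]["template"]["metadata"]["annotations"])
```

Because the annotation sits on the pod template, every pod created by the workload carries the pinned SCC regardless of any higher-priority custom SCC on the cluster.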

The following table tracks progress:

namespace	in review	merged
openshift-apiserver-operator	PR
openshift-authentication	PR
openshift-authentication-operator	PR
openshift-catalogd	PR
openshift-cloud-controller-manager
openshift-cloud-controller-manager-operator
openshift-cloud-credential-operator	PR
openshift-cloud-network-config-controller	PR
openshift-cluster-csi-drivers	PR1, PR2
openshift-cluster-machine-approver
openshift-cluster-node-tuning-operator	PR
openshift-cluster-olm-operator	PR
openshift-cluster-samples-operator	PR
openshift-cluster-storage-operator	PR1, PR2
openshift-cluster-version	PR
openshift-config-operator	PR
openshift-console	PR
openshift-console-operator	PR
openshift-controller-manager	PR
openshift-controller-manager-operator	PR
openshift-dns
openshift-dns-operator
openshift-etcd
openshift-etcd-operator
openshift-image-registry	PR
openshift-ingress	PR
openshift-ingress-canary	PR
openshift-ingress-operator	PR
openshift-insights	PR
openshift-kube-apiserver
openshift-kube-apiserver-operator
openshift-kube-controller-manager
openshift-kube-controller-manager-operator
openshift-kube-scheduler
openshift-kube-scheduler-operator
openshift-kube-storage-version-migrator	PR
openshift-kube-storage-version-migrator-operator	PR
openshift-machine-api	PR1, PR2, PR3, PR4, PR5, PR6
openshift-machine-config-operator	PR
openshift-marketplace	PR
openshift-monitoring	PR
openshift-multus
openshift-network-diagnostics	PR
openshift-network-node-identity	PR
openshift-network-operator
openshift-oauth-apiserver	PR
openshift-operator-controller	PR
openshift-operator-lifecycle-manager	PR
openshift-ovn-kubernetes
openshift-route-controller-manager	PR
openshift-service-ca	PR
openshift-service-ca-operator	PR
openshift-user-workload-monitoring	PR

https://github.com/openshift/service-ca-operator/pull/235

Feature OCPSTRAT-648: (Tech Preview) 'oc adm upgrade status' command

View the Description

As a customer of self-managed OpenShift or an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a cli-status command and a status API which can be used by a cluster-admin to monitor the progress. The status command/API should also contain data to alert users about potential issues that can make updates problematic.

Feature Overview (aka. Goal Summary)

Here are common update improvement requests from customer interactions about the update experience:

Show nodes where pod draining is taking more time.
Customers have to dig deeper often to find the nodes for further debugging.
The ask has been to bubble up this on the update progress window.
oc update status ?
From the UI we can see the progress of the update. From oc cli we can see this from "oc get cvo"
But the ask is to show more details in a human-readable format.
Know where the update has stopped. Consider adding at what run level it has stopped.
```
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
```
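
For automation that currently scrapes this output, the STATUS column can be parsed as in the sketch below. It only handles the message format shown in the example above; other status messages return None.

```python
import re

STATUS_RE = re.compile(
    r"Working towards (?P<target>\S+): "
    r"(?P<done>\d+) of (?P<total>\d+) done \((?P<pct>\d+)% complete\)"
)


def parse_progress(status: str):
    """Extract target version and manifest counts from a ClusterVersion status message."""
    match = STATUS_RE.search(status)
    if match is None:
        return None
    return {
        "target": match.group("target"),
        "done": int(match.group("done")),
        "total": int(match.group("total")),
        "percent": int(match.group("pct")),
    }


print(parse_progress("Working towards 4.12.4: 9 of 829 done (1% complete)"))
```

Parsing free-form text like this is fragile, which is part of the motivation for a dedicated status command backed by an API.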

Documentation Considerations

Update docs for UX and CLI changes

Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22

Epic OTA-1114: (Tech preview) Adding oc adm upgrade status command (Phase-2)

View the Description

Epic Goal*

Add a new `oc adm upgrade status` command which is backed by an API. The mock output of the command is attached to this card.

Why is this important? (mandatory)

From the UI we can see the progress of the update. Using the oc CLI we can see some of the information with "oc get clusterversion", but the output is not readable and it is a lot of extra information to process.
Customers are asking us to show more details in a human-readable format as well as provide an API which they can use for automation.

Scenarios (mandatory)

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic.

Contributing Teams(and contacts) (mandatory)

Development -
Documentation -
QE -
PX -
Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.

Drawbacks or Risk (optional)

Done - Checklist (mandatory)

CI Testing - Tests are merged and completing successfully
Documentation - Content development is complete.
QE - Test scenarios are written and executed successfully.
Technical Enablement - Slides are complete (if requested by PLM)
Other

Story OTA-1165: Add worker node status to the "upgrade status" output

View the Description View the linked PRs

Add node status as mentioned in the sample output to "oc adm upgrade status" in OpenShift Update Concepts

With this output, users will be able to see the state of the nodes that are not part of the master machine config pool and what version of RHCOS they are currently on. I am not sure if it is possible to also show corresponding OCP version. If possible we should display that as well.

=Worker Upgrade=

Worker Pool: openshift-machine-api/machineconfigpool/worker
Assessment: Admin required
Completion: 25% (Est Time Remaining: N/A - Manual action required)
Mode: Manual | Assisted [- No Disruption][- Disruption Permitted][- Scaling Permitted]
Worker Status: 4 Total, 4 Available, 0 Progressing, 3 Outdated, 0 Draining, 0 Excluded, 0 Degraded

Worker Pool Node(s)
NAME					ASSESSMENT	PHASE		VERSION		EST		MESSAGE
ip-10-0-134-165.ec2.internal		Manual		N/A		4.12.1		?		Awaiting manual reboot
ip-10-0-135-128.ec2.internal		Complete	Updated		4.12.16		-
ip-10-0-138-184.ec2.internal		Manual		N/A		4.12.1		?		Awaiting manual reboot
ip-10-0-139-31.ec2.internal		Manual		N/A		4.12.1		?		Awaiting manual reboot

Definition of done:

Add code to produce output similar to that listed above. If a particular field/column needs more work, create separate Jira cards.
This output should be shown even when the control plane is already updated but the workers are not (as of Jan 2024, the status command will say that the cluster is not upgrading in such a case).
The output should take into account that there can be multiple worker pools.
We should capture fixtures for the following two scenarios:
cluster that finished control plane update but is still upgrading worker nodes
cluster that finished control plane update but is still upgrading worker nodes and there’s a restrictive PDB that prohibits draining
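
The summary lines in the sample output above can be derived from per-node data. The following is a minimal sketch in Python, not the oc implementation: the `Node` type, the field names, and the rule "a node is updated when its version equals the target version" are assumptions made for illustration. It handles multiple worker pools by summarizing each pool separately.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical per-node record; field names are illustrative only."""
    name: str
    version: str
    draining: bool = False
    degraded: bool = False


def summarize_pool(nodes, target_version):
    """Count node states for one pool and compute a completion percentage."""
    total = len(nodes)
    updated = sum(1 for n in nodes if n.version == target_version)
    return {
        "total": total,
        "outdated": total - updated,
        "draining": sum(1 for n in nodes if n.draining),
        "degraded": sum(1 for n in nodes if n.degraded),
        # An empty pool has nothing left to update.
        "completion_pct": round(100 * updated / total) if total else 100,
    }


def summarize_pools(pools, target_version):
    """Summarize every worker pool; `pools` maps pool name to a node list."""
    return {name: summarize_pool(nodes, target_version)
            for name, nodes in pools.items()}


# Nodes taken from the sample output above.
worker = [
    Node("ip-10-0-134-165.ec2.internal", "4.12.1"),
    Node("ip-10-0-135-128.ec2.internal", "4.12.16"),
    Node("ip-10-0-138-184.ec2.internal", "4.12.1"),
    Node("ip-10-0-139-31.ec2.internal", "4.12.1"),
]
summary = summarize_pools({"worker": worker}, "4.12.16")
```

With the four sample nodes, one of which is on 4.12.16, this yields 3 outdated nodes and 25% completion, matching the sample.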

https://github.com/openshift/oc/pull/1689

Story OTA-1080: Proof-of-concept oc access to firing alerts

View the Description View the linked PRs

Implementing ~~RFE-928~~ would help a number of update-team use-cases:

OTA-368, ~~OSDOCS-2427~~, and other tickets that are mulling over rendering update-related alerts when folks are trying to decide whether to launch an update.
~~OTA-1021~~'s update status subcommand could use these to help admins discover and respond to update-related issues in their updating clusters.

The updates team is not well positioned to maintain oc access long-term; that seems like a better fit for the monitoring team (who maintain Alertmanager) or the workloads team (who maintain the bulk of oc). But we can probably hack together a proof-of-concept which we could later hand off to those teams, and in the meantime it would unblock our work on tech-preview commands consuming the firing-alert information.

The proof-of-concept could work as follows:

Get alertmanager URL from route alertmanager-main in openshift-monitoring namespace
Use the $ALERTMANAGER/api/v1/alerts endpoint to get data about alerts (see https://github.com/prometheus/alertmanager#api)
The endpoint is authenticated via bearer token, same as against apiserver (possible Role for this mentioned in ~~MON-3396~~ OBSDA-530)
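
The steps above can be sketched as follows. This is an illustrative Python sketch, not the oc code: the function names are hypothetical, the response shape (a "data" list whose items carry `status.state`) is an assumption, and newer Alertmanager releases may expose a different API version than the v1 path named above. The sketch only builds the request and filters an already-decoded payload, so it performs no network access.

```python
import urllib.request


def build_alerts_request(alertmanager_url, token):
    """Build an authenticated request for the alerts endpoint.

    `alertmanager_url` is the URL resolved from the alertmanager-main route;
    `token` is the same bearer token that would be used against the apiserver.
    """
    url = alertmanager_url.rstrip("/") + "/api/v1/alerts"
    req = urllib.request.Request(url)
    req.add_header("Authorization", "Bearer " + token)
    return req


def firing_alerts(payload):
    """Return the alerts that are currently active.

    Assumes `payload` is either a list of alerts or a dict with a "data"
    list, and that each alert has a status.state field.
    """
    alerts = payload.get("data", []) if isinstance(payload, dict) else payload
    return [a for a in alerts if a.get("status", {}).get("state") == "active"]


req = build_alerts_request("https://alertmanager.example.com/", "sha256~example")
sample = {"data": [
    {"labels": {"alertname": "ClusterOperatorDown"}, "status": {"state": "active"}},
    {"labels": {"alertname": "Watchdog"}, "status": {"state": "suppressed"}},
]}
active = firing_alerts(sample)
```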

Definition of done:

$ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts
...dump of firing alerts...

and a backing Go function that other subcommands like oc adm upgrade status can consume internally.

https://github.com/openshift/oc/pull/1674

Task OTA-1169: Get a feature gate for server-side upgrade status work

View the linked PRs

Story OTA-1159: Expose machine-readable CVO "result of work" information

View the Description View the linked PRs

Expose 'Result of work' as structured JSON in a ClusterVersion condition (ResourceReconciliationIssues?). This would allow oc to explain what the CVO is having trouble with, without needing to reach around the CVO and look at ClusterOperators directly (avoiding the risk of latency/skew between what the CVO thinks it's waiting on and the current ClusterOperator content). And we could replace the current Progressing string with more information, and iterate quickly on the oc side. Although if we find ways we like more to stringify this content, we probably want to push that back up to the CVO so all clients can benefit.

We want to do this work behind a feature gate ~~OTA-1169~~.

Definition of Done

All code added for this effort must be conditional on the UpgradeStatus feature gate (added in https://github.com/openshift/api/pull/1725 so we will need to bump o/api in CVO).
CVO will have a new .status.conditions item with type==ResourceReconciliationIssues (provisional name, feel free to rename to something better fitting)
When CVO encounters an "issue" (more details on issues below) while applying a resource, the condition value is True and message is a JSON with all issues encountered
Reason values can be invented as necessary
When no issue is encountered, condition is False
Reconciliation issue is when CVO cannot reach desired state for the given resource in the given sync loop - this will differ between upgrading and reconciling mode. In upgrading mode CVO needs to report things like waiting on ClusterOperators going out of Available=False/Degraded=True, waiting for deployments to go ready etc. In reconciling mode we should report failures to apply.
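
A minimal sketch of the condition described above, written in Python for illustration only (the CVO itself is not written in Python). The condition type comes from this story; the reason strings and the keys inside the JSON message are hypothetical, since the story leaves them open.

```python
import json

CONDITION_TYPE = "ResourceReconciliationIssues"  # provisional name from the story


def reconciliation_condition(issues):
    """Build the condition for one sync loop.

    `issues` is a list of dicts describing resources the CVO could not bring
    to the desired state; the dict keys are illustrative.
    """
    if not issues:
        return {"type": CONDITION_TYPE, "status": "False",
                "reason": "NoIssues", "message": ""}
    return {
        "type": CONDITION_TYPE,
        "status": "True",
        "reason": "IssuesFound",  # hypothetical reason value
        # The message carries machine-readable JSON instead of prose.
        "message": json.dumps({"issues": issues}, sort_keys=True),
    }


cond = reconciliation_condition([
    {"kind": "ClusterOperator", "name": "etcd", "detail": "Degraded=True"},
])
decoded = json.loads(cond["message"])
```

A client such as oc would decode the message with a JSON parser and render the issues itself.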

Note that a condition whose Message contains JSON instead of a human-readable message is against apimachinery conventions and is a VERY poor API design in general. The purpose of this story is simply to experiment with what such an API could look like, and it will inform how we build it in the future. Do not worry too much about tech debt; this is exploratory code.

https://github.com/openshift/cluster-version-operator/pull/1031

Story OTA-1087: Add Upgrade Health section to oc adm upgrade status command

View the Description View the linked PRs

As a PoC (proof of concept), warn about Available=False and Degraded=True ClusterOperators, smoothing flapping conditions (as discussed on a refinement call on Nov 29).

Follow Justin's direction from design ideas, make this easily pluggable for more "warnings"

Example Output

=Update Health= 
SINCE     LEVEL    IMPACT            MESSAGE
3h        Warning  API Availability  High control plane CPU during upgrade
Resolved  Info     Update Speed      Pod is slow to terminate
4h        Info     Update Speed      Pod was slow to terminate
30m       Info     None              Worker node version skew
1h        Info     None              Update worker node
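
One way to smooth flapping conditions and keep the checks pluggable is sketched below in Python. This is an assumption about the approach, not the design from the refinement call: a condition only produces a warning once it has been in the bad state for at least a tolerance period, and each check is a plain function registered in a list. The data layout and the five-minute tolerance are hypothetical.

```python
from datetime import datetime, timedelta


def sustained(is_bad, since, now, tolerance):
    """True when a bad condition has lasted at least `tolerance`."""
    return is_bad and (now - since) >= tolerance


def check_operators(operators, now, tolerance=timedelta(minutes=5)):
    """Warn about ClusterOperators that stay Available=False or Degraded=True."""
    warnings = []
    for op in operators:
        for cond_type, bad_status in (("Available", "False"), ("Degraded", "True")):
            cond = op["conditions"].get(cond_type)
            if cond and sustained(cond["status"] == bad_status,
                                  cond["since"], now, tolerance):
                warnings.append(f'{op["name"]}: {cond_type}={bad_status}')
    return warnings


# Additional "warnings" plug in by appending more check functions here.
CHECKS = [check_operators]


def update_health(operators, now):
    """Run every registered check and collect the warnings."""
    return [w for check in CHECKS for w in check(operators, now)]


now = datetime(2024, 1, 1, 12, 0)
operators = [
    {"name": "etcd", "conditions": {
        "Degraded": {"status": "True", "since": now - timedelta(minutes=10)}}},
    {"name": "dns", "conditions": {
        "Degraded": {"status": "True", "since": now - timedelta(minutes=1)}}},
]
health = update_health(operators, now)
```

Here only etcd is reported, because dns has been degraded for less than the tolerance.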

https://github.com/openshift/oc/pull/1636

Feature OCPSTRAT-698: [GA] Support static IP assignments with vSphere IPI

View the Description

Goal

As an OpenShift on vSphere administrator, I want to specify static IP assignments to my VMs.

As an OpenShift on vSphere administrator, I want to completely avoid using a DHCP server for the VMs of my OpenShift cluster.

Why is this important?

Customers want the convenience of IPI deployments for vSphere without having to use DHCP. As in bare metal, where ~~METAL-1~~ added this capability, some of the reasons are the security implications of DHCP (customers report that, depending on configuration, it can allow any device to get onto the network). At the same time, IPI deployments only require our OpenShift installation software, while with UPI customers would need automation software that, in secure environments, they would have to certify along with OpenShift.

Acceptance Criteria

I can specify static IPs for node VMs at install time with IPI

Previous Work

Bare metal related work:

CoreOS Afterburn:

https://github.com/coreos/afterburn/blob/main/src/providers/vmware/amd64.rs#L28

https://github.com/openshift/installer/blob/master/upi/vsphere/vm/main.tf#L34

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Epic SPLAT-1143: [GA] Support static IP assignments with vSphere IPI

View the Description

Goal

As an OpenShift on vSphere administrator, I want to specify static IP assignments to my VMs.

As an OpenShift on vSphere administrator, I want to completely avoid using a DHCP server for the VMs of my OpenShift cluster.

Why is this important?

Acceptance Criteria

I can specify static IPs for node VMs at install time with IPI

Previous Work

Bare metal related work:

CoreOS Afterburn:

https://github.com/coreos/afterburn/blob/main/src/providers/vmware/amd64.rs#L28

https://github.com/openshift/installer/blob/master/upi/vsphere/vm/main.tf#L34

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story SPLAT-1173: Enhance vSphere Installer to use IPAddressClaims for static IP

View the Description View the linked PRs

USER STORY:

As a system admin, I would like the static IP support for vSphere to use IPAddressClaims to provide IP addresses during installation so that after the install, the machines are defined in a way that is intended for use with IPAM controllers.

DESCRIPTION:

Currently the installer for vSphere sets the static IPs directly in the machine object yaml files. We would like to enhance the installer to create an IPAddress and an IPAddressClaim for each machine, and to update the machinesets to use addressesFromPools to request the IPAddress. We should also create a custom CRD that is the basis for the pool defined in the addressesFromPools field.
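
The objects described above could be generated roughly as follows. This Python sketch is illustrative only and is not the installer code: the API version, the exact spec fields, the claim naming scheme, and the pool reference (group `ipam.example.com`, kind `IPPool`, name `default-0`) are all assumptions chosen for the example.

```python
# Hypothetical pool reference; the real group/kind/name come from the installer.
POOL_REF = {"apiGroup": "ipam.example.com", "kind": "IPPool", "name": "default-0"}
IPAM_API_VERSION = "ipam.cluster.x-k8s.io/v1beta1"  # assumed API version


def static_ip_manifests(machine_name, address, prefix, gateway, pool_ref=POOL_REF):
    """Return (IPAddressClaim, IPAddress, addressesFromPools entry) for a machine."""
    claim_name = f"{machine_name}-claim-0"  # hypothetical naming scheme
    claim = {
        "apiVersion": IPAM_API_VERSION,
        "kind": "IPAddressClaim",
        "metadata": {"name": claim_name},
        "spec": {"poolRef": dict(pool_ref)},
    }
    ip_address = {
        "apiVersion": IPAM_API_VERSION,
        "kind": "IPAddress",
        "metadata": {"name": claim_name},
        "spec": {
            "address": address,
            "prefix": prefix,
            "gateway": gateway,
            # Ties the pre-assigned address back to the claim it fulfils.
            "claimRef": {"name": claim_name},
            "poolRef": dict(pool_ref),
        },
    }
    # Entry the machine or machineset would carry instead of a literal IP.
    addresses_from_pools = [{
        "group": pool_ref["apiGroup"],
        "resource": pool_ref["kind"],
        "name": pool_ref["name"],
    }]
    return claim, ip_address, addresses_from_pools


claim, ip_address, from_pools = static_ip_manifests(
    "cluster-abc12-master-0", "192.168.100.10", 24, "192.168.100.1")
```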

ACCEPTANCE CRITERIA:

After installing static IP for vSphere IPI, the cluster should contain machines, machinesets, crd, ipaddresses and ipaddressclaims related to static IP assignment.

ENGINEERING DETAILS:

These changes should all be contained in the installer project. We will need to be sure to cover static IP for zonal and non-zonal installs. Additionally, we need to have this work for all control-plane and compute machines.

https://github.com/openshift/installer/pull/7501

Task SPLAT-1423: integrate with capv installer

View the linked PRs

https://github.com/openshift/installer/pull/8081

Task SPLAT-1293: Install IPAM CRDs without requiring tech preview

View the linked PRs

https://github.com/openshift/api/pull/1729

Task SPLAT-1410: Static IP Support with CPMS

View the Description View the linked PRs

When a vSphere static IP IPI install is performed, we need to make sure that the generated masters are treated as valid machines and are not recreated by the CPMS operator.

Feature OCPSTRAT-731: OpenShift Optional Capabilities (Phase 6)

View the Description

Feature Overview

As a Cluster Administrator, I want to opt-out of certain operators at deployment time using any of the supported installation methods (UPI, IPI, Assisted Installer, Agent-based Installer) from UI (e.g. OCP Console, OCM, Assisted Installer), CLI (e.g. oc, rosa), and API.

As a Cluster Administrator, I want to opt-in to previously-disabled operators (at deployment time) from UI (e.g. OCP Console, OCM, Assisted Installer), CLI (e.g. oc, rosa), and API.
As a ROSA service administrator, I want to exclude/disable Cluster Monitoring when I deploy OpenShift with HyperShift (using any of the supported installation methods, including the ROSA wizard in OCM and rosa cli), since I get cluster metrics from the control plane. This configuration should persist not only through initial deployment but also through cluster lifecycle operations like upgrades.
As a ROSA service administrator, I want to exclude/disable Ingress Operator when I deploy OpenShift with HyperShift (using any of the supported installation methods, including the ROSA wizard in OCM and rosa cli), as I want to use my preferred load balancer (i.e. AWS load balancer). This configuration should persist not only through initial deployment but also through cluster lifecycle operations like upgrades.

Goals

Make it possible for customers and Red Hat teams producing OCP distributions/topologies/experiences to enable/disable some CVO components while still keeping their cluster supported.

Scenarios

This feature must consider the different deployment footprints including self-managed and managed OpenShift, connected vs. disconnected (restricted and air-gapped), supported topologies (standard HA, compact cluster, SNO), etc.
Enabled/disabled configuration must persist throughout cluster lifecycle including upgrades.
If there is any risk of data loss or service unavailability (for Day 2 operations), the system must provide guidance on what the risks are and let the user decide whether the risk is worth taking.
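
The persistence requirement above can be expressed as a simple check. The following Python sketch is only an illustration of the rule; the component names and the idea of modelling the configuration as sets are assumptions, not the CVO's data model.

```python
def effective_components(baseline, opted_out=(), opted_in=()):
    """Components enabled after applying opt-outs and later opt-ins."""
    return (set(baseline) - set(opted_out)) | set(opted_in)


def config_persisted(enabled_before, enabled_after, known_components):
    """True when every known component kept its enabled/disabled state."""
    return all((c in enabled_before) == (c in enabled_after)
               for c in known_components)


# Hypothetical component names, for illustration only.
known = {"Console", "Insights", "Ingress", "Monitoring"}
before = effective_components(known, opted_out={"Monitoring", "Ingress"})
after_upgrade = set(before)             # nothing changed during the upgrade
regressed = before | {"Monitoring"}     # an upgrade silently re-enabled one
```

A lifecycle test could capture the enabled set before an upgrade and assert `config_persisted` afterwards.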

Requirements

This section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement	Notes	isMvp?
CI - MUST be running successfully with test automation	This is a requirement for ALL features.	YES
Release Technical Enablement	Provide necessary release enablement details and documents.	YES

(Optional) Use Cases

This Section:

Main success scenarios - high-level user stories
Alternate flow/scenarios - high-level user stories
...

Questions to answer…

Out of Scope

Background, and strategic fit

This is part of the overall multi-release Composable OpenShift effort (OCPPLAN-9638), which is being delivered in multiple phases:

Phase 1 (OpenShift 4.11): ~~OCPPLAN-7589~~ Provide a way with CVO to allow disabling and enabling of operators

~~CORS-1873~~ Installer to allow users to select OpenShift components to be included/excluded
~~OTA-555~~ Provide a way with CVO to allow disabling and enabling of operators
~~OLM-2415~~ Make the marketplace operator optional
~~SO-11~~ Make samples operator optional
~~METAL-162~~ Make cluster baremetal operator optional
~~OCPPLAN-8286~~ CI Job for disabled optional capabilities

Phase 2 (OpenShift 4.12): ~~OCPPLAN-7589~~ Provide a way with CVO to allow disabling and enabling of operators

~~CONSOLE-3160~~ Make console operator optional
~~CCXDEV-8079~~ Make Insights operator optional
~~CNF-5645~~ Make storage operator optional
~~CNF-5646~~ Make csi-snapshot-controller optional

Phase 3 (OpenShift 4.13): ~~OCPBU-117~~

~~OTA-554~~ Make oc aware of cluster capabilities
~~PSAP-741~~ Make Node Tuning Operator (including PAO controllers) optional

Phase 4 (OpenShift 4.14): ~~OCPSTRAT-36~~ (formerly ~~OCPBU-236~~)

~~CCO-186~~ ccoctl support for credentialing optional capabilities
~~MCO-499~~ MCD should manage certificates via a separate, non-MC path (formerly ~~IR-230~~ Make node-ca managed by CVO)
~~CNF-5642~~ Make cluster autoscaler optional
~~CNF-5643~~ - Make machine-api operator optional
~~WRKLDS-695~~ - Make DeploymentConfig API + controller optional
~~CNV-16274 OpenShift Virtualization on the Red Hat Application Cloud (not applicable)~~
~~CNF-9115~~ - Leverage Composable OpenShift feature to make control-plane-machine-set optional
~~BUILD-565~~ - Make Build v1 API + controller optional
~~CNF-5647~~ Leverage Composable OpenShift feature to make image-registry optional (replaces ~~IR-351~~ - Make Image Registry Operator optional)

Phase 5 (OpenShift 4.15): ~~OCPSTRAT-421~~ (formerly ~~OCPBU-519~~)

~~OCPVE-634~~ - Leverage Composable OpenShift feature to make olm optional
~~CCO-419~~ (~~OCPVE-629~~) - Leverage Composable OpenShift feature to make cloud-credential optional

Phase 6 (OpenShift 4.16): OCPSTRAT-731

OCPBU-352 Make Ingress Operator optional
~~OCPCLOUD-2151~~ (OCPVE-628) - Leverage Composable OpenShift feature to make cloud-controller-manager optional
MON-3152 (OBSDA-242) Optional built-in monitoring

Phase 7 (OpenShift 4.17): OCPSTRAT-1308

IR-400 - Remove node-ca from CIRO*
CCO-493 Make Cloud Credential Operator optional for remaining providers and topologies (non-SNO topologies)
CNF-9116 Leverage Composable OpenShift feature to make machine-auto-approver optional

References

Assumptions

Customer Considerations

Documentation Considerations

Questions to be addressed:

What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
Does this feature have doc impact?
New Content, Updates to existing content, Release Note, or No Doc Impact
If unsure and no Technical Writer is available, please contact Content Strategy.
What concepts do customers need to understand to be successful in [action]?
How do we expect customers will use the feature? For what purpose(s)?
What reference material might a customer want/need to complete [action]?
Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic OCPEDGE-20: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story OCPEDGE-427: Annotate cloud-controller-manager-operator manifests with cap name

Story OCPEDGE-500: Bump api at CVO and fix tests

Story OCPEDGE-485: Add CloudController capability name to API

Story OCPEDGE-582: Bump api at Installer repo with new cap name

Feature OCPSTRAT-748: On Cluster Layering: Phase 2 (tech preview)

View the Description View Demos

Note: phase 2 target is tech preview.

Feature Overview

In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.

The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.

On-cluster, automated RHCOS Layering builds are important for multiple reasons:

One-click/one-command upgrades of OCP are very popular. Many customers may want to make one or just a few customizations but also want to keep that simplified upgrade experience.
Customers who only need to customize RHCOS temporarily (hotfix, driver test package, etc) will find off-cluster builds to be too much friction for one driver.
One of OCP's virtues is that the platform and OS are developed, tested, and versioned together. Off-cluster building breaks that connection and leaves it up to the user to keep the OS up-to-date with the platform containers. We must make it easy for customers to add what they need and keep the OS image matched to the platform containers.

Goals & Requirements

The goal of this feature is primarily to bring the 4.14 progress (~~OCPSTRAT-35~~) to a Tech Preview or GA level of support.
Customers should be able to specify a Containerfile with their customizations and "forget it" as long as the automated builds succeed. If they fail, the admin should be alerted and pointed to the logs from the failed build.
- The admin should then be able to correct the build and resume the upgrade.
Intersect with the Custom Boot Images such that a required custom software component can be present on every boot of every node throughout the installation process including the bootstrap node sequence (example: out-of-box storage driver needed for root disk).
Users can return a pool to an unmodified image easily.
RHEL entitlements should be wired in or at least simple to set up (once).
Parity with current features – including the current drain/reboot suppression list, CoreOS Extensions, and config drift monitoring.

https://drive.google.com/file/d/1dFtAjrBJ7wyTxJz54elVeBxAAImtbwjp/

Epic MCO-665: On-Cluster Layering Tech Preview

View the Description

This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.

Bug OCPBUGS-18991: Sometimes the OCB machine-os-builder pod is not restarted when the imageBuilderType is updated

View the Description View the linked PRs

Description of problem:


In clusters with OCB functionality enabled, sometimes the machine-os-builder pod is not restarted when we update the imageBuilderType.

What we have observed is that the pod is restarted if a build is running, but it is not restarted if we are not building anything.

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-09-12-195514   True        False         88m     Cluster version is 4.14.0-0.nightly-2023-09-12-195514

How reproducible:

Always

Steps to Reproduce:

1. Create the configuration resources needed by the OCB functionality.

To reproduce this issue we use an on-cluster-build-config configmap with an empty imageBuilderType

 oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": ""}}'


2. Create an infra pool and label it so that it can use OCB functionality

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""


 oc label mcp/infra machineconfiguration.openshift.io/layering-enabled=

3. Wait for the build pod to finish.

4. Once the build has finished and has been cleaned up, update the imageBuilderType so that the "custom-pod-builder" type is used.

oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": "custom-pod-builder"}}'

Actual results:


We waited for one hour, but the pod is never restarted.

$ oc get pods |grep build
machine-os-builder-6cfbd8d5d-xk6c5           1/1     Running   0          56m

$ oc logs machine-os-builder-6cfbd8d5d-xk6c5 |grep Type
I0914 08:40:23.910337       1 helpers.go:330] imageBuilderType empty, defaulting to "openshift-image-builder"

$ oc get cm on-cluster-build-config -o yaml |grep Type
  imageBuilderType: custom-pod-builder

Expected results:


When we update the imageBuilderType value, the machine-os-builder pod should be restarted.
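
The expected behavior can be stated as a check that does not depend on whether a build is running. The Python sketch below is illustrative and is not the MCO code; the function names are hypothetical. The default value is taken from the log line above.

```python
DEFAULT_BUILDER = "openshift-image-builder"  # default reported in the pod log above


def effective_builder_type(configmap_data):
    """Resolve the builder type, treating a missing or empty value as the default."""
    return configmap_data.get("imageBuilderType") or DEFAULT_BUILDER


def needs_restart(running_builder_type, configmap_data):
    """True when the running pod uses a different builder than the ConfigMap asks for.

    Deliberately takes no "build in progress" input: the restart decision
    should be the same whether or not a build is running.
    """
    return effective_builder_type(configmap_data) != running_builder_type


# State from the reproduction steps: pod started with the default builder,
# then the ConfigMap was patched to "custom-pod-builder".
restart_expected = needs_restart("openshift-image-builder",
                                 {"imageBuilderType": "custom-pod-builder"})
```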

Additional info:

https://github.com/openshift/api/pull/1752

Feature OCPSTRAT-763: [TechPreview]Disconnected Cluster Update and Boot without local image registry

View the Description

Feature Overview

Note: This feature will be a TechPreview in 4.16 since the newly introduced API must graduate to v1.

Overarching Goal

Customers should be able to update and boot a cluster without a container registry in disconnected environments. This feature is for bare-metal disconnected clusters.

Background

For a single node cluster effectively cut off from all other networking, update the cluster despite the lack of access to image registries, local or remote.
For multi-node clusters that could have a complete power outage, recover smoothly from that kind of disruption, despite the lack of access to image registries, local or remote.
Allow cluster node(s) to boot without any access to a registry in case all the required images are pinned

Epic MCO-838: Pin and pre-load images

View the Description View the linked PRs

Given enhancement - https://github.com/openshift/enhancements/pull/1481

Design Review Doc: https://docs.google.com/document/d/1-XuHN6_jvJMLULFwwAThfIcHqY32s32lU6m4bx7BiBE/edit

We want to allow the relevant APIs to pin images and make sure those don't get garbage collected.

Here is a summary of what will be required:

CRI-O will need to be changed so that it doesn't remove pinned images, regardless of the version_file_persist setting.
Add the new PinnedImageSet custom resource definition to the API.
Initial proposal: #1609
Add a new PinnedImageSetController to the machine-config-controller.
Add the logic to pin and pull the images to the machine-config-daemon.
Update the documentation of recovery procedures to explain that pinned images shouldn't be removed.
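
The two properties this epic relies on, that pinned images are never garbage collected and that a node can boot offline only when everything it needs is pinned, can be sketched as follows. This Python sketch is an illustration under assumptions; it is not CRI-O's or the machine-config-daemon's logic, and the image references are made up.

```python
def removable_images(stored_images, pinned_images):
    """Images that garbage collection may remove: everything not pinned."""
    pinned = set(pinned_images)
    return [img for img in stored_images if img not in pinned]


def can_boot_without_registry(required_images, pinned_images):
    """True when every image needed at boot is pinned on the node."""
    return set(required_images) <= set(pinned_images)


# Hypothetical image references, for illustration only.
stored = ["quay.example.com/ocp/release@sha256:aaa",
          "quay.example.com/ocp/etcd@sha256:bbb",
          "quay.example.com/app/old@sha256:ccc"]
pinned = stored[:2]
gc_candidates = removable_images(stored, pinned)
```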

https://github.com/openshift/api/pull/1703

Feature OCPSTRAT-768: Apply user defined tags to all resources created by OpenShift (GCP) GA

View the Description

Feature Overview

Create a GCP cloud specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in GCP) on any openshift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet and once the tags in the infrastructure CRD are changed all the resources should be updated accordingly.

Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
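
The create/update-but-never-delete behavior described above amounts to a one-way merge. The Python sketch below illustrates that rule only; the function names and tag keys are hypothetical and no GCP API is called.

```python
def reconcile_tags(existing, desired):
    """Merge desired tags into the existing ones without deleting anything.

    Keys in `desired` are added or overwritten; keys only present in
    `existing` (for example custom tags set by the customer) are kept.
    """
    merged = dict(existing)
    merged.update(desired)
    return merged


def needs_update(existing, desired):
    """True when at least one desired tag is missing or has a different value."""
    return any(existing.get(key) != value for key, value in desired.items())


existing = {"team": "storage", "owner": "alice"}       # "team" set by the customer
desired = {"owner": "platform", "cost-center": "1234"}  # from the infrastructure CRD
result = reconcile_tags(existing, desired)
```

Running the merge again on its own result changes nothing, so the reconcile is idempotent.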

Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").

Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.

Goals

Functionality on GCP GA
inclusion in the cluster backups
flexibility of changing tags during cluster lifetime, without recreating the whole cluster

Requirements

This section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement	Notes	isMvp?
CI - MUST be running successfully with test automation	This is a requirement for ALL features.	YES
Release Technical Enablement	Provide necessary release enablement details and documents.	YES

List any affected packages or components.

Installer
Cluster Infrastructure
Storage
Node
NetworkEdge
Internal Registry
CCO

Epic CORS-2783: Apply user defined tags to all resources created by OpenShift (GCP) GA

View the Description

This is a continuation of the CORS-2455 / CFE-719 work, in which support for GCP tags and labels was delivered as TechPreview in 4.14; the goal is to make it GA in 4.15. This involves removing any reference to TechPreview in code and docs and incorporating any feedback received from users.

Story CFE-857: User defined tags should be applied on all gcp resources created by installer

View the Description View the linked PRs

The installer creates the GCP resources listed below during the create cluster phase, and the user-defined tags should be applied to these resources.

Resources List

Resource	Terraform API
VM Instance	google_compute_instance
Storage Bucket	google_storage_bucket

Acceptance Criteria:

Code linting, validation and best practices adhered to
The GCP resources created by the installer should have the user-defined labels as well as the default OCP label.

Feature OCPSTRAT-784: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic AGENT-682: Enable day2 add node using agent-install

View the Description

Epic Goal

Enable users to install a host as day2 using agent based installer

Why is this important?

Enable easy day2 installation without requiring additional knowledge from the user
Unified experience for day1 and day2 installation for the agent based installer
Unified experience for day1 and day2 installation for appliance workflow
Eliminate the requirement of installing MCE, which has high resource requirements (requires 4 cores and 16GB RAM for a multi-node cluster, and if the infrastructure operator is included then it will require storage as well)
Eliminate the requirement of nodes having a BMC available to expand bare metal clusters (see docs).
Simplify adding compute nodes based on the UPI method or other methods implemented in the field, such as ~~WKLD-433~~ or other automations that try to solve this problem

Scenarios

A user installed a day1 cluster with the agent-based installer and wants to add workers or replace failed nodes. Currently the alternative is to install MCE or, if connected, to use SaaS.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

Previous Work (Optional):

ABI is using assisted installer + kube-api or any other client that communicate with the service, all the building blocks related to day2 installation exist in those components

Assisted installer can create installed cluster and use it to perform day2 operations

A doc that explains how it's done with kube-api

Parameters that are required from the user:

Cluster name, domain name - used to get the URL that will be used to pull ignition
Kubeconfig to day1 cluster
OpenShift version
Network configuration

Actions required from the user

Post reboot the user will need to manually approve the node on the day1 cluster

Implementation suggestion:

To keep a similar flow between day1 and day2, I suggest running the service on each node that the user is trying to add. It will create the cluster definition and start the installation; after the first reboot it will pull the ignition from the day1 cluster.

Open questions:

How should ABI detect an existing day1 cluster? Yes
Using a provided kubeconfig file? Yes
If so, should it be added to the config-image API (i.e. included in the ISO) - question to the agent team
Should we add a new command for day2 in agent-installer-client? E.g. question to the agent team
So this command would import the existing day1 cluster and add the day2 node. It means that every day2 node should run an assisted-service first. Answer: yes, those instances do not depend on each other.

Story AGENT-850: Add-nodes-image command

View the Description

Deploy a command to generate a suitable ISO for adding a node to an existing cluster

Sub-task AGENT-853: Create ClusterInfo asset

View the Description View the linked PRs

Not all of the required information is provided by the user (in fact, we want to minimize the amount of configuration the user has to provide). Some of the required information needs to be extracted from the existing cluster or from the existing kubeconfig. A dedicated asset could be useful for this.
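
As an illustration of extracting information from the kubeconfig, the Python sketch below derives the cluster name and base domain from the API server URL. It is not the ClusterInfo asset from the linked PR: the function name is hypothetical, the kubeconfig is assumed to be already parsed into a dict, and the `api.<cluster name>.<base domain>` host layout is an assumption that may not hold for every cluster.

```python
from urllib.parse import urlparse


def cluster_info_from_kubeconfig(kubeconfig):
    """Derive basic cluster details from the first cluster entry.

    `kubeconfig` is the parsed kubeconfig (a dict with a "clusters" list).
    Assumes the API server host looks like api.<cluster name>.<base domain>.
    """
    server = kubeconfig["clusters"][0]["cluster"]["server"]
    host = urlparse(server).hostname or ""
    labels = host.split(".")
    if len(labels) < 3 or labels[0] != "api":
        raise ValueError(f"unexpected API server host: {host!r}")
    return {
        "api_server": server,
        "cluster_name": labels[1],
        "base_domain": ".".join(labels[2:]),
    }


info = cluster_info_from_kubeconfig({
    "clusters": [{"name": "mycluster",
                  "cluster": {"server": "https://api.mycluster.example.com:6443"}}],
})
```

Details that cannot be derived this way, such as the OpenShift version, would have to be read from the running cluster.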

https://github.com/openshift/installer/pull/7997

Sub-task AGENT-860: Modify assisted-service agentbased cli

View the Description View the linked PRs

A new workflow will be required to talk to assisted-service to import an existing cluster and add the node. New services could be required in the ignition assets to handle that properly.

https://github.com/openshift/assisted-service/pull/6004

Story AGENT-848: New cli tool to expose the add / monitor node commands

View the Description View the linked PRs

The two commands, one for adding the nodes (ISO generation) and the other for monitoring the process, should be exposed by a new cli tool (name to be defined) built using the installer source. This task will be used to add the main entry point of the cli tool and the two (empty) command entry points.
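
The shape of such a tool, a main entry point with two empty subcommands, is sketched below. The real tool is built from the installer source; this Python version only illustrates the structure, and the tool name `node-tool` and the subcommand names are placeholders because the story leaves the naming open.

```python
import argparse


def cmd_add_nodes(args):
    """Placeholder for ISO generation; intentionally empty for now."""
    return 0


def cmd_monitor(args):
    """Placeholder for monitoring the add-node process; intentionally empty."""
    return 0


def build_parser():
    """Create the top-level parser with the two subcommands."""
    parser = argparse.ArgumentParser(prog="node-tool")  # placeholder name
    sub = parser.add_subparsers(dest="command", required=True)
    add = sub.add_parser("add-nodes", help="generate an ISO for adding nodes")
    add.set_defaults(func=cmd_add_nodes)
    mon = sub.add_parser("monitor", help="monitor nodes being added")
    mon.set_defaults(func=cmd_monitor)
    return parser


def main(argv=None):
    """Entry point: parse arguments and dispatch to the chosen subcommand."""
    args = build_parser().parse_args(argv)
    return args.func(args)


exit_code = main(["add-nodes"])
```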

https://github.com/openshift/installer/pull/7958

Feature OCPSTRAT-854: Support Heterogeneous NodePools Within HyperShift

View the Description

DoD

We need to ensure we have parity with OCP and support heterogeneous clusters

https://github.com/openshift/enhancements/pull/1014

Define the UX for multi-arch NodePool input, e.g. always enforcing a multi-arch image. This might not be possible because of the impact on the image registry in disconnected clusters: https://github.com/openshift/enhancements/pull/1014#discussion_r798444099.

Goal

Provide a way to install with varied architecture NodePools with the ability to autoscale
Define the UX for multi-arch NodePool input, e.g. always enforcing a multi-arch image. This might not be possible because of the impact on the image registry in disconnected clusters: https://github.com/openshift/enhancements/pull/1014#discussion_r798444099.

Why is this important?

Necessary to enable workloads with different architectures in the same Hosted Clusters.
Cost savings brought by more cost-effective ARM instances

Scenarios

I have an x86 hosted cluster and I want to have at least one NodePool running ARM workloads
I have an ARM hosted cluster and I want to have at least one NodePool running x86 workloads

Acceptance Criteria

Dev - Has a valid enhancement if necessary
CI - MUST be running successfully with tests automated
QE - covered in Polarion test plan and tests implemented
Release Technical Enablement - Must have TE slides

Dependencies (internal and external)

The management cluster must use a multi architecture payload image.
The target architecture is in the OCP payload
MCE has builds for the architecture used by the worker nodes of the management cluster

Done Checklist

CI - CI is running, tests are automated and merged.
Release Technical Enablement <link to Feature Enablement Presentation>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Epic HOSTEDCP-1129: Define UX and failures for HC with multi-arch NodePools.

View the Description

User Story:

Using a multi-arch Node requires the HC to be multi-arch as well. This is a good recipe for letting users shoot themselves in the foot. We need to automate, via the CLI, the input required for a multi-arch NodePool to work, e.g. by enabling a multi-arch flag on HC creation that sets the right release image.

Acceptance Criteria:

Description of criteria:

Upstream documentation

Story HOSTEDCP-1262: Add CEL to the arch field in NodePool to prevent error when the cloud platform isn't a supported one

View the Description View the linked PRs

User Story:

As a user of multi-arch HyperShift, I would like a CEL validation to be added to the NodePool types to prevent the arch field from being changed from `amd64` when the platform is not supported (AWS is currently the only supported platform).

Acceptance Criteria:

Description of criteria:

CEL added to the arch field for NodePools, which will error if a user tries to change the arch field from `amd64` on any platform other than AWS.
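
The rule described above can be sketched in plain Python. This is only an illustration of the validation logic, not the CEL expression from the linked PR; the function name, the constants, and the error text are hypothetical.

```python
from typing import Optional

# Assumption taken from the story text: AWS is the only platform that
# currently supports a NodePool arch other than amd64.
MULTI_ARCH_PLATFORMS = {"AWS"}
DEFAULT_ARCH = "amd64"


def validate_nodepool_arch(platform: str, arch: str) -> Optional[str]:
    """Return an error message if arch is not allowed on platform, else None."""
    if arch == DEFAULT_ARCH:
        # amd64 is always accepted, whatever the platform.
        return None
    if platform in MULTI_ARCH_PLATFORMS:
        # Non-amd64 architectures are accepted only on supported platforms.
        return None
    return (
        f"arch {arch!r} is not supported on platform {platform!r}; "
        f"only {DEFAULT_ARCH!r} is allowed"
    )
```

For example, `validate_nodepool_arch("AWS", "arm64")` returns `None`, while the same arch on any other platform returns an error string.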

https://github.com/openshift/hypershift/pull/3333

Story HOSTEDCP-1220: Define UX and failures for HC with multi-arch NodePools.

View the Description View the linked PRs

User Story:

Acceptance Criteria:

Upstream documentation
HyperShift CLI accomplishes design specified in the document in the Engineering Details section
HCP CLI accomplishes design specified in the document in the Engineering Details section

Out of Scope:

See Define UX & failures for HC w/ multi-arch NodePools

Engineering Details:

See Define UX & failures for HC w/ multi-arch NodePools

https://github.com/openshift/hypershift/pull/3689

Feature OCPSTRAT-876: [Tech Preview] OC mirror multiple internet disconnected enclaves

View the Description

Technology Preview of the oc mirror enclaves support (Dev Preview was ~~OCPSTRAT-765~~ in OpenShift 4.15).

Feature description

oc-mirror already focuses on mirroring content to disconnected environments for installing and upgrading OCP clusters.

This specific feature addresses use cases where mirroring is needed for several enclaves (disconnected environments), that are secured behind at least one intermediate disconnected network.

In this context, enclave users are interested in:

being able to mirror content for several enclaves, and centralizing it in a single internal registry. Some customers are interested in running security checks on the mirrored content, vetting it before allowing mirroring to downstream enclaves.
being able to mirror contents directly from the internal centralized registry to enclaves without having to restart the mirroring from internet for each enclave
keeping the volume of data transferred from one network stage to the other to a strict minimum, avoiding transferring a blob or an image more than once from one stage to another.
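
The last point (never transferring the same blob twice between stages) can be illustrated with a small sketch. It assumes blobs are identified by their digest strings and that the set of digests already sent to the next stage is known; the function name is hypothetical and this is not oc-mirror's implementation.

```python
from typing import Iterable, List, Set


def blobs_to_transfer(wanted: Iterable[str], already_transferred: Set[str]) -> List[str]:
    """Return the digests that still need to be sent, in order, without repeats."""
    seen = set(already_transferred)  # copy so the caller's set is not modified
    pending: List[str] = []
    for digest in wanted:
        if digest not in seen:
            seen.add(digest)  # also drops duplicates within this run
            pending.append(digest)
    return pending
```

Called with `["sha256:a", "sha256:b", "sha256:a"]` and `{"sha256:b"}`, only `sha256:a` remains to be transferred, and only once.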

Epic CLID-5: [TP] Enclave Support for oc-mirror

View the Description

This epic covers the work for ~~RFE-3800~~ (includes ~~RFE-3393~~ and ~~RFE-3733~~) for mirroring operators and additional images

The full description / overview of the enclave support is best described here

The design document can be found here

Architecture Overview (diagram)

User Stories

All user stories are in the form:

Role (As a ...)
Goal (I want ..)
Reason (So that ..)

Story CLID-27: As a user I want to be able to use oc-mirror in both partial and fully disconnected install scenarios so that I can filter channels and bundles (versions) for OCI On disk catalog images according to the requirements for complete enclave support

View the Description View the linked PRs

Acceptance Criteria

Compatible with current V1

OCI for ibm

https://github.com/openshift/oc-mirror/pull/800

Story CLID-7: As a user I want to be able to use oc-mirror in both partial and fully disconnected install scenarios so that I can mirror catalog and additional images according to the requirements for enclave support

View the Description View the linked PRs

Acceptance Criteria

All tests pass (80% coverage)
Documentation approved by docs team
All tests QE approved
Well documented README/HOWTO/OpenShift documentation

Tasks

Implement the scenario where the operator catalog is only one file instead of multiple files (catalog.json) see comment here
Investigate if the operator catalog for IBM is all in one json only or multiple
Implement unit tests

https://github.com/openshift/oc-mirror/pull/799

Story CLID-14: As a user I want to be able to use oc-mirror in both partial and fully disconnected install scenarios so that I can filter channels and bundles (versions) for catalog images according to the requirements for complete enclave support

View the Description View the linked PRs

Acceptance Criteria

All tests pass (80% coverage)
Documentation approved by docs team
All tests QE approved
Well documented README/HOWTO/OpenShift documentation
Best effort (but better than v1)
Take into account targetCatalog, targetTag
Catalogs mirrored by tag should be pulled when a minor version is detected (we should not rely only on existence of the folder in the cache)

Tasks

Implement V2 versioning as discussed in the EP document
Implement code to filter catalogs, channels and bundles according to ImageSetConfig criteria.
Implement unit tests

Story CLID-10: As a user I want to be able to use oc-mirror in both partial and fully disconnected install scenarios so that I can generate relevant metadata useful for validation and implementation for catalog images according to the requirements for enclave support

View the Description View the linked PRs

Acceptance Criteria

All tests pass (80% coverage)
Documentation approved by docs team
All tests QE approved
Well documented README/HOWTO/OpenShift documentation

Tasks

Implement V2 versioning as discussed in the EP document
Implement code to generate artifacts IDMS ITMS and audit data (~~CFE-817~~)
keep in mind that IDMS generation might be needed for v1
Implement code to generate CatalogSource
Implement unit tests

Story CLID-9: As a user I want to ensure incremental mirroring using archives by date, chunks and other, so that I can seamlessly mirror the tar contents

View the linked PRs

Sub-task CLID-24: select diff by date

View the Description View the linked PRs

I need an interface and implementation that can read the history file and present it in a map of digests
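
A minimal sketch of such a reader follows. The history file format is not described here, so the sketch assumes one digest per line with blank lines ignored; the function name and that format are assumptions, not oc-mirror's actual layout.

```python
from typing import Dict, Iterable


def load_history(lines: Iterable[str]) -> Dict[str, bool]:
    """Build a digest map from history lines (assumed format: one digest per line)."""
    digests: Dict[str, bool] = {}
    for line in lines:
        digest = line.strip()
        if digest:  # skip blank lines
            digests[digest] = True
    return digests
```

Because an open file iterates over its lines, the function can be used as `load_history(open(path))`, and membership checks such as `"sha256:a" in history` are then constant time.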

https://github.com/openshift/oc-mirror/pull/795

Sub-task CLID-21: After extraction is successful, archive should be destroyed?

View the Description View the linked PRs

Clear out all tars before starting the next mirrorToDisk, so that newly generated archives are not mixed with archives generated in previous runs.

https://github.com/openshift/oc-mirror/pull/789

Sub-task CLID-22: Generated archives should respect the max size specified by user

View the Description View the linked PRs

If the archived content exceeds the max size specified, oc-mirror shall generate as many archive chunks as needed.
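
One way to picture the chunking is a simple sequential packing of files into archives. The sketch below is an illustration only: the function name is hypothetical, sizes are plain integers in whatever unit the user specifies, and a single file larger than the limit is assumed to get a chunk of its own.

```python
from typing import List, Tuple


def plan_chunks(files: List[Tuple[str, int]], max_size: int) -> List[List[str]]:
    """Group (name, size) entries into chunks whose total size stays within max_size."""
    chunks: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Start a new chunk when adding this file would exceed the limit.
        if current and current_size + size > max_size:
            chunks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

With three files of size 4 and a limit of 8, the plan is two chunks: the first two files together, then the third alone.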

https://github.com/openshift/oc-mirror/pull/789

Story CLID-30: As a developer I want to be able to ensure all unit tests pass with 80% coverage so that I can have confidence in the quality of the oc-mirror code base

View the Description View the linked PRs

Acceptance Criteria

Ensure all unit tests pass
Ensure code coverage is above 80%
Ensure golangci-lint reports no errors
All tests QE approved
Show the results in code base README

Tasks

Spike to assess current MVP gaps
Implement unit tests for v2

https://github.com/openshift/oc-mirror/pull/783

Feature OCPSTRAT-908: Remove Terraform from the bare metal IPI installer

View the Description

Feature Overview

In order to avoid an increased support overhead once the license changes at the end of the year, we should replace the instances in which metal IPI uses Terraform.

Epic METAL-722: Replace the use of Terraform by the metal platform

View the Description

In order to avoid an increased support overhead once the license changes at the end of the year, we should replace the instances in which metal IPI uses Terraform.

Story METAL-871: Remove terraform-ironic-provider from installer

View the linked PRs

Story METAL-835: Read secret from mounted file in image-customization-controller

View the Description View the linked PRs

What we do in static mode, we need to do in k8s mode.

https://github.com/openshift/image-customization-controller/pull/115

Story METAL-834: Allow metrics port override in image-customization-controller

View the linked PRs

https://github.com/openshift/image-customization-controller/pull/110

Feature OCPSTRAT-913: Remove Terraform from the vSphere IPI installer

View the Description

Feature Overview (aka. Goal Summary)

To avoid an increased support overhead once the license changes at the end of the year, we want to provision vSphere infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

The vSphere IPI Installer no longer contains or uses Terraform.
The new provider should aim to provide the same results and have parity with the existing vSphere Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Epic CORS-3052: Provision vSphere with CAPI (no mgmt cluster)

View the Description

Epic Goal

Provision vSphere infrastructure without the use of Terraform

Why is this important?

Removing Terraform from Installer

Scenarios

The new provider should aim to provide the same results as the existing vSphere Terraform provider.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Bug CORS-3273: infrastructure provider envs may contain sensitive data

View the Description View the linked PRs

level=info msg=Running process: vsphere infrastructure provider with args [-v=2 --metrics-bind-addr=0 --health-addr=127.0.0.1:37167 --webhook-port=38521 --webhook-cert-dir=/tmp/envtest-serving-certs-445481834 --leader-elect=false] and env [...]

The env list logged in the line above may contain sensitive data (passwords, logins, etc.). It should be filtered.
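
A filter of this kind could look like the sketch below, which masks the values of environment variables whose names look sensitive before they are logged. The name pattern, the placeholder text, and the function name are assumptions chosen for illustration; this is not the change made in the linked PR.

```python
import re
from typing import Dict, List

# Assumption: a variable is treated as sensitive if its name contains one of
# these words. A real list would need to be agreed on per provider.
SENSITIVE_NAME = re.compile(
    r"(PASSWORD|PASSWD|SECRET|TOKEN|USERNAME|LOGIN|CREDENTIAL)", re.IGNORECASE
)


def redact_env(env: Dict[str, str]) -> List[str]:
    """Return NAME=value strings that are safe to log, sorted by name."""
    redacted: List[str] = []
    for name, value in sorted(env.items()):
        shown = "<redacted>" if SENSITIVE_NAME.search(name) else value
        redacted.append(f"{name}={shown}")
    return redacted
```

For a hypothetical environment `{"VSPHERE_PASSWORD": "hunter2", "HOME": "/root"}`, only `HOME` keeps its value in the logged list.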

https://github.com/openshift/installer/pull/8084

Epic SPLAT-1198: post-merge testing: Remove Terraform from the vSphere installer

View the Description View the linked PRs

Epic Goal

In order to avoid an increased support overhead once the license changes at the end of the year, we should replace the instances in which the vSphere IPI installer uses Terraform.

https://github.com/openshift/installer/pull/7962

Task SPLAT-1459: Full clone only on capv machines

View the linked PRs

https://github.com/openshift/installer/pull/8042

Task SPLAT-1465: Add missing extra config to capv machines

View the Description View the linked PRs

The missing extra config covers the guestinfo domain required by OKD, and stealclock accounting.

https://github.com/openshift/installer/pull/8054

Bug OCPBUGS-29860: vsphere capi downloads the rhcos image multiple times with failure domains

View the Description View the linked PRs

Description of problem:

    The installer downloads the RHCOS image to the local cache multiple times when using failure domains.

Version-Release number of selected component (if applicable):

4.16

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

time="2024-02-22T14:34:42-05:00" level=debug msg="Generating Cluster..."
time="2024-02-22T14:34:42-05:00" level=warning msg="FeatureSet \"CustomNoUpgrade\" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster."
time="2024-02-22T14:34:42-05:00" level=info msg="Creating infrastructure resources..."
time="2024-02-22T14:34:43-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'"
time="2024-02-22T14:34:43-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..."
time="2024-02-22T14:36:02-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'"
time="2024-02-22T14:36:02-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..."
time="2024-02-22T14:37:22-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'"
time="2024-02-22T14:37:22-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..."
time="2024-02-22T14:38:39-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'"
time="2024-02-22T14:38:39-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..."
time="2024-02-22T14:39:33-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'"
time="2024-02-22T14:39:33-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..."
time="2024-02-22T14:39:33-05:00" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: unable to initialize folders and templates: failed to import ova: failed to lease wait: The name 'ngirard-dev-pr89z-rhcos-us-west-us-west-1a' already exists."

Expected results:

    should only download once
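
The expectation can be sketched as fetching the image once and reusing the resulting local path for every failure domain. This only illustrates the "once" behavior and is not the installer's fix; the function name is hypothetical and `download` stands in for whatever routine obtains the image.

```python
from typing import Callable, Dict, List


def cache_image_once(
    failure_domains: List[str],
    image_url: str,
    download: Callable[[str], str],
) -> Dict[str, str]:
    """Map each failure domain to the local image path, downloading at most once."""
    cached: Dict[str, str] = {}
    result: Dict[str, str] = {}
    for domain in failure_domains:
        if image_url not in cached:
            # Only the first failure domain triggers the download.
            cached[image_url] = download(image_url)
        result[domain] = cached[image_url]
    return result
```

With two or more failure domains sharing one image URL, `download` is invoked a single time and every domain receives the same cached path.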

Additional info:

https://github.com/openshift/installer/pull/8059

Task SPLAT-1468: Add notification that the ova is being uploaded

View the Description View the linked PRs

WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster.
INFO Creating infrastructure resources...
DEBUG Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'
DEBUG The file was found in cache: /home/jcallen/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing...

https://github.com/openshift/installer/pull/8097

Task SPLAT-1462: Remove auth environmental variables

View the Description View the linked PRs

Causes password leaking in CI.

https://github.com/openshift/installer/pull/8053

Feature OCPSTRAT-914: Remove Terraform from the Azure IPI installer

View the Description

Feature Overview (aka. Goal Summary)

To avoid an increased support overhead once the license changes at the end of the year, we want to provision Azure infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

The Azure IPI Installer no longer contains or uses Terraform.
The new provider should aim to provide the same results and have parity with the existing Azure Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Epic CORS-3061: Provision Azure with CAPI (no mgmt cluster)

View the Description

Epic Goal

Provision Azure infrastructure without the use of Terraform

Why is this important?

Removing Terraform from Installer

Scenarios

The new provider should aim to provide the same results as the existing Azure Terraform provider.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story CORS-3251: Create CAPZ machine manifests

View the Description View the linked PRs

User Story:

As a developer, I want to:

Create machine manifests for Azure CAPI implementation

so that I can achieve

Create control plane and bootstrap machines using CAPZ

Acceptance Criteria:

Description of criteria:

Control plane and bootstrap machines are created using CAPI

https://github.com/openshift/installer/pull/7969

Feature OCPSTRAT-924: Console: Customer Happiness (RFEs) for 4.16

View the Description

Feature Overview

Console enhancements based on customer RFEs that improve customer user experience.

Goals

This Section: Provide high-level goal statement, providing user context and expected user outcome(s) for this feature

Requirements

This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement	Notes	isMvp?
CI - MUST be running successfully with test automation	This is a requirement for ALL features.	YES
Release Technical Enablement	Provide necessary release enablement details and documents.	YES

Epic CONSOLE-3782: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-29365: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/13612

Feature OCPSTRAT-933: Hypershift guest cluster can use external OIDC token issuer

View the Description

Feature Overview (aka. Goal Summary)

A guest cluster can use an external OIDC token issuer. This will allow machine-to-machine authentication workflows.

Goals (aka. expected user outcomes)

A guest cluster can configure OIDC providers to support the current capability: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#openid-connect-tokens and the future capability: https://github.com/kubernetes/kubernetes/blob/2b5d2cf910fd376a42ba9de5e4b52a53b58f9397/staging/src/k8s.io/apiserver/pkg/apis/apiserver/types.go#L164 with an API that

allows fixing mistakes
alerts the owner of the configuration that it's likely that there is a misconfiguration (self-service)
makes the distinction between product failure (expressed configuration not applied) and configuration failure (the expressed configuration was wrong) easy to determine
makes cluster recovery possible in cases where the external token issuer is permanently gone
allows (might not require) removal of the existing oauth server

Requirements (aka. Acceptance Criteria):

Epic HOSTEDCP-1246: hypershift control plane wired with external oidc

View the Description

Goal

Provide API for configuring external OIDC to management cluster components
Stop creating oauth server deployment
Stop creating oauth-apiserver
Stop registering oauth-apiserver backed apiservices
See what breaks next

Why is this important?

need API starting point for ROSA CLI and OCM
need cluster that demonstrates what breaks next for us to fix

Scenarios

Acceptance Criteria

Dev - Has a valid enhancement if necessary
CI - MUST be running successfully with tests automated
QE - covered in Polarion test plan and tests implemented
Release Technical Enablement - Must have TE slides
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Technical Enablement <link to Feature Enablement Presentation>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Bug OCPBUGS-30124: If not required to set the oauthMetadata, oc login can fail with: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused

View the Description View the linked PRs

Description of problem:
In https://issues.redhat.com/browse/OCPBUGS-28625?focusedId=24056681&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24056681 , Seth Jennings states "It is not required to set the oauthMetadata to enable external OIDC".

When trying this today without setting oauthMetadata, oc login fails with the error:

$ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1

Console login can succeed, though.

Note: OCM QE also encounters this when using the ocm CLI to test ROSA HCP external OIDC. Whether the fix belongs in oc, in HCP, or elsewhere (as a tester I'm not sure, TBH), it is worth fixing; otherwise oc login is affected.

Version-Release number of selected component (if applicable):

[xxia@2024-03-01 21:03:30 CST my]$ oc version --client
Client Version: 4.16.0-0.ci-2024-03-01-033249
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
[xxia@2024-03-01 21:03:50 CST my]$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.ci-2024-02-29-213249   True        False         8h      Cluster version is 4.16.0-0.ci-2024-02-29-213249

How reproducible:

Always

Steps to Reproduce:

1. Launch fresh HCP cluster.

2. Login to https://entra.microsoft.com. Register application and set properly.

3. Prepare variables.
HC_NAME=hypershift-ci-267920
MGMT_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/kubeconfig
HOSTED_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/hypershift-ci-267920.kubeconfig
AUDIENCE=7686xxxxxx
ISSUER_URL=https://login.microsoftonline.com/64dcxxxxxxxx/v2.0
CLIENT_ID=7686xxxxxx
CLIENT_SECRET_VALUE="xxxxxxxx"
CLIENT_SECRET_NAME=console-secret

4. Configure HC without oauthMetadata.
[xxia@2024-03-01 20:29:21 CST my]$ oc create secret generic console-secret -n clusters --from-literal=clientSecret=$CLIENT_SECRET_VALUE --kubeconfig $MGMT_KUBECONFIG

[xxia@2024-03-01 20:34:05 CST my]$ oc patch hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG --type=merge -p="
spec:
  configuration: 
    authentication: 
      oauthMetadata:
        name: ''
      oidcProviders:
      - claimMappings:
          groups:
            claim: groups
            prefix: 'oidc-groups-test:'
          username:
            claim: email
            prefixPolicy: Prefix
            prefix:
              prefixString: 'oidc-user-test:'
        issuer:
          audiences:
          - $AUDIENCE
          issuerURL: $ISSUER_URL
        name: microsoft-entra-id
        oidcClients:
        - clientID: $CLIENT_ID
          clientSecret:
            name: $CLIENT_SECRET_NAME
          componentName: console
          componentNamespace: openshift-console
      type: OIDC
"

Wait for the pods to renew:
[xxia@2024-03-01 20:52:41 CST my]$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp
...
certified-operators-catalog-7ff9cffc8f-z5dlg          1/1     Running   0          5h44m
kube-apiserver-6bd9f7ccbd-kqzm7                       5/5     Running   0          17m
kube-apiserver-6bd9f7ccbd-p2fw7                       5/5     Running   0          15m
kube-apiserver-6bd9f7ccbd-fmsgl                       5/5     Running   0          13m
openshift-apiserver-7ffc9fd764-qgd4z                  3/3     Running   0          11m
openshift-apiserver-7ffc9fd764-vh6x9                  3/3     Running   0          10m
openshift-apiserver-7ffc9fd764-b7znk                  3/3     Running   0          10m
konnectivity-agent-577944765c-qxq75                   1/1     Running   0          9m42s
hosted-cluster-config-operator-695c5854c-dlzwh        1/1     Running   0          9m42s
cluster-version-operator-7c99cf68cd-22k84             1/1     Running   0          9m42s
konnectivity-agent-577944765c-kqfpq                   1/1     Running   0          9m40s
konnectivity-agent-577944765c-7t5ds                   1/1     Running   0          9m37s

5. Check console login and oc login.
$ export KUBECONFIG=$HOSTED_KUBECONFIG
$ curl -ksS $(oc whoami --show-server)/.well-known/oauth-authorization-server
{
"issuer": "https://:0",
"authorization_endpoint": "https://:0/oauth/authorize",
"token_endpoint": "https://:0/oauth/token",
...
}
Check console login: it succeeds, and the upper right of the console correctly shows the user name oidc-user-test:xxia@redhat.com.

Check oc login:
$ rm -rf ~/.kube/cache/oc/
$ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1

Actual results:

Console login succeeds. oc login fails.

Expected results:

oc login should also succeed.

Additional info:

https://github.com/openshift/hypershift/pull/3678

Story HOSTEDCP-1374: Copy OIDC OAuth client secrets from Authentication config into hosted cluster

https://github.com/openshift/hypershift/pull/3373

Bug OCPBUGS-30030: Updating oidcProviders does not take effect

Description of problem:

Updating oidcProviders does not take effect. See details below.

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-02-26-155043

How reproducible:

Always

Steps to Reproduce:

1. Install a fresh HCP env and configure external OIDC as in steps 1 ~ 4 of https://issues.redhat.com/browse/OCPBUGS-29154 (referenced here rather than repeated).

2. Pods renewed:
$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp
...
network-node-identity-68b7b8dd48-4pvvq               3/3     Running   0          170m
oauth-openshift-57cbd9c797-6hgzx                     2/2     Running   0          170m
kube-controller-manager-66f68c8bd8-tknvc             1/1     Running   0          164m
kube-controller-manager-66f68c8bd8-wb2x9             1/1     Running   0          164m
kube-controller-manager-66f68c8bd8-kwxxj             1/1     Running   0          163m
kube-apiserver-596dcb97f-n5nqn                       5/5     Running   0          29m
kube-apiserver-596dcb97f-7cn9f                       5/5     Running   0          27m
kube-apiserver-596dcb97f-2rskz                       5/5     Running   0          25m
openshift-apiserver-c9455455c-t7prz                  3/3     Running   0          22m
openshift-apiserver-c9455455c-jrwdf                  3/3     Running   0          22m
openshift-apiserver-c9455455c-npvn5                  3/3     Running   0          21m
konnectivity-agent-7bfc7cb9db-bgrsv                  1/1     Running   0          20m
cluster-version-operator-675745c9d6-5mv8m            1/1     Running   0          20m
hosted-cluster-config-operator-559644d45b-4vpkq      1/1     Running   0          20m
konnectivity-agent-7bfc7cb9db-hjqlf                  1/1     Running   0          20m
konnectivity-agent-7bfc7cb9db-gl9b7                  1/1     Running   0          20m

3. oc login can succeed:
$ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
Please visit the following URL in your browser: http://localhost:8080
Logged into "https://a4af9764....elb.ap-southeast-1.amazonaws.com:6443" as "oidc-user-test:xxia@redhat.com" from an external oidc issuer.

You don't have any projects. Contact your system administrator to request a project.

4. Update HC by changing claim: email to claim: sub:
$ oc edit hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG
...
          username:
            claim: sub
...

Update is picked up:
$ oc get authentication.config cluster -o yaml
...
spec:
  oauthMetadata:
    name: tested-oauth-meta
  oidcProviders:
  - claimMappings:
      groups:
        claim: groups
        prefix: 'oidc-groups-test:'
      username:
        claim: sub
        prefix:
          prefixString: 'oidc-user-test:'
        prefixPolicy: Prefix
    issuer:
      audiences:
      - 76863fb1-xxxxxx
      issuerCertificateAuthority:
        name: ""
      issuerURL: https://login.microsoftonline.com/xxxxxxxx/v2.0
    name: microsoft-entra-id
    oidcClients:
    - clientID: 76863fb1-xxxxxx
      clientSecret:
        name: console-secret
      componentName: console
      componentNamespace: openshift-console
  serviceAccountIssuer: https://xxxxxx.s3.us-east-2.amazonaws.com/hypershift-ci-267402
  type: OIDC
status:
  oidcClients:
  - componentName: console
    componentNamespace: openshift-console
    conditions:
    - lastTransitionTime: "2024-02-28T10:51:17Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "False"
      type: Degraded
    - lastTransitionTime: "2024-02-28T10:51:17Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "False"
      type: Progressing
    - lastTransitionTime: "2024-02-28T10:51:17Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "True"
      type: Available
    currentOIDCClients:
    - clientID: 76863fb1-xxxxxx
      issuerURL: https://login.microsoftonline.com/xxxxxxxx/v2.0
      oidcProviderName: microsoft-entra-id

4. Check pods again:
$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp
...
kube-apiserver-596dcb97f-n5nqn                       5/5     Running   0          108m
kube-apiserver-596dcb97f-7cn9f                       5/5     Running   0          106m
kube-apiserver-596dcb97f-2rskz                       5/5     Running   0          104m
openshift-apiserver-c9455455c-t7prz                  3/3     Running   0          102m
openshift-apiserver-c9455455c-jrwdf                  3/3     Running   0          101m
openshift-apiserver-c9455455c-npvn5                  3/3     Running   0          100m
konnectivity-agent-7bfc7cb9db-bgrsv                  1/1     Running   0          100m
cluster-version-operator-675745c9d6-5mv8m            1/1     Running   0          100m
hosted-cluster-config-operator-559644d45b-4vpkq      1/1     Running   0          100m
konnectivity-agent-7bfc7cb9db-hjqlf                  1/1     Running   0          99m
konnectivity-agent-7bfc7cb9db-gl9b7                  1/1     Running   0          99m

No new pods renewed.
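
The "no new pods" observation above can also be checked mechanically by comparing two pod listings. The following is a minimal sketch, assuming the listings are captured as plain text in the column layout shown above; the helper names `pod_names` and `new_pods` are hypothetical and not part of oc or HyperShift.

```python
def pod_names(listing: str, prefix: str) -> set[str]:
    """Collect pod names starting with `prefix` from `oc get po` text output."""
    names = set()
    for line in listing.splitlines():
        fields = line.split()
        # A pod row has at least NAME, READY, STATUS, RESTARTS, AGE.
        if len(fields) >= 5 and fields[0].startswith(prefix):
            names.add(fields[0])
    return names


def new_pods(before: str, after: str, prefix: str = "kube-apiserver-") -> set[str]:
    """Return pods present in `after` but not in `before` (empty if no rollout)."""
    return pod_names(after, prefix) - pod_names(before, prefix)
```

An empty result for the `kube-apiserver-` prefix corresponds to the behavior reported here: the names are unchanged and only the AGE column has advanced.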

5. Check login again: it does not use "sub" and still uses "email":
$ rm -rf ~/.kube/cache/
$ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
Please visit the following URL in your browser: http://localhost:8080
Logged into "https://xxxxxxx.elb.ap-southeast-1.amazonaws.com:6443" as "oidc-user-test:xxia@redhat.com" from an external oidc issuer.
  
You don't have any projects. Contact your system administrator to request a project.
$ cat ~/.kube/cache/oc/* | jq -r '.id_token' | jq -R 'split(".") | .[] | @base64d | fromjson'
...
{
...
  "email": "xxia@redhat.com",
  "groups": [
...
  ],
...
  "sub": "EEFGfgPXr0YFw_ZbMphFz6UvCwkdFS20MUjDDLdTZ_M",
...
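
To see which user name each claim would produce, the cached ID token can also be decoded offline. This is a minimal sketch that reads only the payload segment of the JWT and does not verify the signature; the function names and the way the prefix is joined to the claim value are illustrative assumptions based on the login output above.

```python
import base64
import json


def decode_jwt_payload(id_token: str) -> dict:
    """Decode the payload (second segment) of a JWT without verifying it."""
    segments = id_token.split(".")
    if len(segments) != 3:
        raise ValueError("expected header.payload.signature")
    payload = segments[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def username_for(claims: dict, claim: str, prefix: str) -> str:
    """Build the user name that a given claim mapping would produce."""
    return f"{prefix}{claims[claim]}"
```

With the token above, `username_for(claims, "sub", "oidc-user-test:")` would give the name expected after the edit, while the observed login still matches `username_for(claims, "email", "oidc-user-test:")`.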

Actual results:

Steps 4 ~ 5: after changing the HC field value from "claim: email" to "claim: sub", even though `oc get authentication cluster -o yaml` shows that the change has propagated:
1> Pods such as kube-apiserver are not renewed.
2> After cleaning up ~/.kube/cache, logging in again with `oc login ...` still prints 'Logged into "https://xxxxxxx.elb.ap-southeast-1.amazonaws.com:6443" as "oidc-user-test:xxia@redhat.com" from an external oidc issuer', i.e. it still uses the old claim "email" as the user name instead of the new claim "sub".

Expected results:

Steps 4 ~ 5: Pods such as kube-apiserver should be renewed after an HC edit that changes the user claim. The login should show that the new claim is used as the user name.

Additional info:

https://github.com/openshift/hypershift/pull/3647

Bug OCPBUGS-28625: HCP .well-known/oauth-authorization-server shows "https://:0" even OIDC oauthMetadata is set in hc.spec.configuration.authentication

Description of problem:

HCP does not honor the oauthMetadata field of hc.spec.configuration.authentication, causing the console to crash and oc login to fail.

Version-Release number of selected component (if applicable):

HyperShift management cluster: 4.16.0-0.nightly-2024-01-29-233218
HyperShift hosted cluster: 4.16.0-0.nightly-2024-01-29-233218

How reproducible:

Always

Steps to Reproduce:

1. Install HCP env. Export KUBECONFIG:
$ export KUBECONFIG=/path/to/hosted-cluster/kubeconfig

2. Create keycloak applications. Then get the route:
$ KEYCLOAK_HOST=https://$(oc get -n keycloak route keycloak --template='{{ .spec.host }}')
$ echo $KEYCLOAK_HOST
https://keycloak-keycloak.apps.hypershift-ci-18556.xxx
$ curl -sSk "$KEYCLOAK_HOST/realms/master/.well-known/openid-configuration" > oauthMetadata

$ cat oauthMetadata 
{"issuer":"https://keycloak-keycloak.apps.hypershift-ci-18556.xxx/realms/master"

$ oc create configmap oauth-meta --from-file ./oauthMetadata -n clusters --kubeconfig /path/to/management-cluster/kubeconfig
...

3. Set hc.spec.configuration.authentication:
$ CLIENT_ID=openshift-test-aud
$ oc patch hc hypershift-ci-18556 -n clusters --kubeconfig /path/to/management-cluster/kubeconfig --type=merge -p="
spec:
  configuration:
    authentication:
      oauthMetadata:
        name: oauth-meta
      oidcProviders:
      - claimMappings:
          ...
        issuer:
          audiences:
          - $CLIENT_ID
          issuerCertificateAuthority:
            name: keycloak-oidc-ca
          issuerURL: $KEYCLOAK_HOST/realms/master
        name: keycloak-oidc-test
      type: OIDC
"

Check that KAS has indeed picked up the setting:
$ oc logs -c kube-apiserver kube-apiserver-5c976d59f5-zbrwh -n clusters-hypershift-ci-18556 --kubeconfig /path/to/management-cluster/kubeconfig | grep "oidc-"
...
I0130 08:07:24.266247       1 flags.go:64] FLAG: --oidc-ca-file="/etc/kubernetes/certs/oidc-ca/ca.crt"
I0130 08:07:24.266251       1 flags.go:64] FLAG: --oidc-client-id="openshift-test-aud"
...
I0130 08:07:24.266261       1 flags.go:64] FLAG: --oidc-issuer-url="https://keycloak-keycloak.apps.hypershift-ci-18556.xxx/realms/master"
...

Wait about 15 mins.
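
The log check in step 3 can be scripted. The sketch below derives the `--oidc-*` flag values that would be expected from one oidcProviders entry and looks for them in the KAS log text. The field-to-flag mapping is an assumption inferred from the log lines above (first audience as `--oidc-client-id`, `issuerURL` as `--oidc-issuer-url`, and the CA path as logged); it is not taken from the HyperShift source, and the function names are hypothetical.

```python
def expected_oidc_flags(
    provider: dict,
    ca_path: str = "/etc/kubernetes/certs/oidc-ca/ca.crt",  # path as seen in the log above
) -> dict:
    """Derive the kube-apiserver OIDC flags expected for one provider entry."""
    issuer = provider["issuer"]
    flags = {
        "--oidc-issuer-url": issuer["issuerURL"],
        "--oidc-client-id": issuer["audiences"][0],  # assumed: first audience
    }
    # Only expect a CA file flag when a CA config map is referenced.
    if issuer.get("issuerCertificateAuthority", {}).get("name"):
        flags["--oidc-ca-file"] = ca_path
    return flags


def flags_in_log(flags: dict, log_text: str) -> dict:
    """Report, per flag, whether `FLAG: name="value"` style text appears in the log."""
    return {name: f'{name}="{value}"' in log_text for name, value in flags.items()}
```

If every value in the result of `flags_in_log` is true, KAS has picked up the provider settings, which points the investigation at the oauthMetadata handling rather than at the provider configuration.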

4. Check COs and check oc login. Both show the same error:
$ oc get co | grep -v 'True.*False.*False'
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console                                    4.16.0-0.nightly-2024-01-29-233218   True        True          False      4h57m   SyncLoopRefreshProgressing: Working toward version 4.16.0-0.nightly-2024-01-29-233218, 1 replicas available
$ oc get po -n openshift-console
NAME                        READY   STATUS             RESTARTS         AGE
console-547cf6bdbb-l8z9q    1/1     Running            0                4h55m
console-54f88749d7-cv7ht    0/1     CrashLoopBackOff   9 (3m18s ago)    14m
console-54f88749d7-t7x96    0/1     CrashLoopBackOff   9 (3m32s ago)    14m

$ oc logs console-547cf6bdbb-l8z9q -n openshift-console
I0130 03:23:36.788951       1 metrics.go:156] usage.Metrics: Update console users metrics: 0 kubeadmin, 0 cluster-admins, 0 developers, 0 unknown/errors (took 406.059196ms)
E0130 06:48:32.745179       1 asynccache.go:43] failed a caching attempt: request to OAuth issuer endpoint https://:0/oauth/token failed: Head "https://:0": dial tcp :0: connect: connection refused
E0130 06:53:32.757881       1 asynccache.go:43] failed a caching attempt: request to OAuth issuer endpoint https://:0/oauth/token failed: Head "https://:0": dial tcp :0: connect: connection refused
...

$ oc login --exec-plugin=oc-oidc --client-id=openshift-test-aud --extra-scopes=email,profile --callback-port=8080
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1

5. Check the root cause: the configured oauthMetadata is not picked up:
$ curl -k https://a6e149f24f8xxxxxx.elb.ap-east-1.amazonaws.com:6443/.well-known/oauth-authorization-server
{
"issuer": "https://:0",
"authorization_endpoint": "https://:0/oauth/authorize",
"token_endpoint": "https://:0/oauth/token",
...
}
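
The symptom in step 5 is that every endpoint in the served metadata has an empty host and port 0. A minimal sketch for flagging such placeholder endpoints in a fetched metadata document follows; the function name `unusable_endpoints` is hypothetical, and only the three keys shown above are inspected.

```python
from urllib.parse import urlsplit


def unusable_endpoints(metadata: dict) -> list[str]:
    """Return the metadata keys whose URL has no host or uses port 0."""
    bad = []
    for key in ("issuer", "authorization_endpoint", "token_endpoint"):
        parts = urlsplit(metadata.get(key, ""))
        try:
            port = parts.port
        except ValueError:
            # A non-numeric or out-of-range port is also unusable.
            bad.append(key)
            continue
        if not parts.hostname or port == 0:
            bad.append(key)
    return bad
```

For the output above, all three keys would be reported; for metadata copied from a working issuer, the list would be empty.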

Actual results:

As shown in steps 4 and 5 above, the configured oauthMetadata is not picked up, causing the console and oc login to hit the error.

Expected results:

The configured oauthMetadata is picked up, with no errors.

Additional info:

For oc, if I manually use `oc config set-credentials oidc --exec-api-version=client.authentication.k8s.io/v1 --exec-command=oc --exec-arg=get-token --exec-arg="--issuer-url=$KEYCLOAK_HOST/realms/master" ...` instead of using `oc login --exec-plugin=oc-oidc ...`, oc authentication works well. This means my configuration is correct.
$ oc whoami  
Please visit the following URL in your browser: http://localhost:8080
oidc-user-test:xxia@redhat.com

https://github.com/openshift/hypershift/pull/3511

Epic OSDOCS-8106: Docs for OCPSTRAT-933 Guest cluster can use external OIDC token issuer

A guest cluster can use an external OIDC token issuer. This will allow machine-to-machine authentication workflows.

Goals (aka. expected user outcomes)

allows fixing mistakes
alerts the owner of the configuration that it's likely that there is a misconfiguration (self-service)
makes the distinction between product failure (expressed configuration not applied) and configuration failure (the expressed configuration was wrong) easy to determine
makes cluster recovery possible in cases where the external token issuer is permanently gone
allows (but might not require) removal of the existing oauth server

Bug OCPBUGS-29435: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Feature OCPSTRAT-942: Console needs to be functional with external oidc token issuer

Feature Overview (aka. Goal Summary)

When the internal oauth-server and oauth-apiserver are removed and replaced with an external OIDC issuer (like Azure AD), the console must work for human users of the external OIDC issuer.

Goals (aka. expected user outcomes)

An end user can use the OpenShift console without a notable difference in experience. This must eventually work on both hypershift and standalone, but hypershift is the first priority if it impacts delivery.

Requirements (aka. Acceptance Criteria):

User can log in and use the console.
User can get a kubeconfig that functions on the CLI with matching oc.
Both of those work on hypershift.
Both of those work on standalone.

Epic CONSOLE-3805: console operator must accept clientID and secret

Epic Goal

When installed with external OIDC, the clientID and clientSecret need to be configurable to match the external (and unmanaged) OIDC server.

Why is this important?

Without a configurable clientID and secret, I don't think the console can identify the user.
There must be a mechanism to do this on both hypershift and openshift, though the API may be very similar.

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Task AUTH-439: create an API for configuring an OAuth client for OpenShift webconsole

Story AUTH-440: Make the webconsole accept more fine-grained OIDC configuration

The web console should behave like a generic OIDC client when requesting tokens from an OIDC provider.

Story CONSOLE-3829: Use k8s SelfSubjectReview to retrieve user info

The User API may not always be available. Kubernetes now has a stable API to query for user information: https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/3325-self-subject-attributes-review-api. See if it can be used to replace all `user/~` calls.

https://github.com/openshift/console/pull/13321

Epic CONSOLE-3804: console operator must stop creating oauthclients when API is not present

Epic Goal

When the oauthclient API is not present, the operator must stop creating the oauthclient.

Why is this important?

This is preventing the operator from creating its deployment.

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story CONSOLE-3912: Configure console in external OIDC IDP environments

Console needs to be able to authenticate against an external OIDC IDP. For that, console-operator needs to configure it accordingly.

AC:

bump console-operator API pkg
sync oauth client secret from OIDC configuration for auth type OIDC
add OIDC config to the console configmap
add auth server CA to the deployment annotations and volumes when auth type OIDC
consume OIDC configuration in the console configmap and deployment
fix roles for oauthclients and authentications watching

Feature OCPSTRAT-956: Enable Break Glass access Mechanism for Cloud Services (ROSA as MVP)

Feature Overview (aka. Goal Summary)

Enable a "Break Glass Mechanism" in ROSA (Red Hat OpenShift Service on AWS) and other OpenShift cloud-services in the future (e.g., ARO and OSD) to provide customers with an alternative method of cluster access via short-lived certificate-based kubeconfig when the primary IDP (Identity Provider) is unavailable.

Goals (aka. expected user outcomes)

Enhance cluster reliability and operational flexibility.
Minimize downtime due to IDP unavailability or misconfiguration.
The primary personas here are OpenShift Cloud Services Admins and SREs as part of the shared responsibility.
This will be an addition to the existing ROSA IDP capabilities.

Requirements (aka. Acceptance Criteria)

Enable the generation of short-lived client certificates for emergency cluster access.
Ensure certificates are secure and conform to industry standards.
Functionality to invalidate short-lived certificates in case of an exploit.

Better UX

User Interface within OCM to facilitate the process.
SHOULD have audit capabilities.
Minimal latency when generating and using certificates (to reduce time without access to cluster).

Use Cases (Optional)

A customer's IDP is down, but they successfully use the break-glass feature to gain cluster access.
SREs use their own break-glass feature to perform critical operations on a customer's cluster.

Questions to Answer (Optional)

What is the lifetime of generated certificates? 7 days life and 1 day rotation?
What security measures are in place for certificate generation and storage?
What are the audit requirements?

Out of Scope

Replacement of primary IDP functionality.
Use of the break-glass mechanism for routine operations (i.e., this is an emergency/contingency mechanism).

Customer Considerations

The feature is not a replacement for the primary IDP.
Customers must understand the security implications of using short-lived certificates.

Documentation Considerations

How-to guides for using the break-glass mechanism.
FAQs addressing common concerns and troubleshooting.
Update existing ROSA IDP documentation to include this new feature.

Interoperability Considerations

Compatibility with existing ROSA, OSD (OpenShift Dedicated), and ARO (Azure Red Hat OpenShift) features.
Interoperability tests should include scenarios where both IDP and break-glass mechanism are engaged simultaneously for access.

Epic HOSTEDCP-1255: Hypershift issues cluster-admin kubeconfig usable by the customer

Goal

Be able to provide the customer of managed OpenShift with a cluster-admin kubeconfig that allows them to access the cluster in the event that their identity provider (IdP or external OIDC) becomes unavailable or misconfigured.

Why is this important?

Managed OpenShift customers need to be able to access/repair their clusters without RH intervention (i.e. opening a ticket) in the case of an identity provider outage or misconfiguration.

Scenarios

Acceptance Criteria

Dev - Has a valid enhancement if necessary
CI - MUST be running successfully with tests automated
QE - covered in Polarion test plan and tests implemented
Release Technical Enablement - Must have TE slides
...

Dependencies (internal and external)

Coordination with OCM to make the kubeconfig available to customers upon request

Previous Work (Optional):

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Technical Enablement <link to Feature Enablement Presentation>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story HOSTEDCP-1257: Create controller for approving and issuing signed customer certs

https://github.com/openshift/hypershift/pull/3267

Story HOSTEDCP-1344: Create controller for rotating customer signer cert

Customers need to be able to request the revocation of the signer cert for their break-glass credentials and know when it's been done. See design here: https://docs.google.com/document/d/19l48HB7-4_8p96b2kvlpYFpbZXCh5eXN_tEmM8QhaIc/edit#heading=h.1a08yt9sno82 and diagram here: https://miro.com/app/board/uXjVNUPFcQA=/

Feature OCPSTRAT-962: OCP Console support for short-lived token enablement of OLM-managed operators using GCP WIF

Feature Overview

Users of the OpenShift Console leverage a streamlined, visual experience when discovering and installing OLM-managed operators in clusters that run on cloud providers with support for short-lived token authentication enabled. Users intuitively become aware when this is the case and are put on the happy path to configure OLM-managed operators with the necessary information to support GCP Workload Identity Federation (WIF).

Goals:

Customers do not need to re-learn how to enable GCP WIF authentication support for each and every OLM-managed operator that supports it. The experience is standardized and repeatable so customers spend less time with initial configuration and more time implementing business value. The process is so easy that OpenShift is perceived as an enabler for an increased security posture.

Requirements:

based on OCPSTRAT-922, the installation and configuration experience for any OLM-managed operator using short-lived token authentication is streamlined using the OCP console in the form of a guided process that avoids misconfiguration or unexpected behavior of the operators in question
the OCP Console helps in detecting when the cluster itself is already using GCP WIF for core functionality
the OCP Console helps discover operators capable of GCP WIF authentication and their IAM permission requirements
the OCP Console drives the collection of the required information for GCP WIF authentication at the right stages of the installation process and stops the process when the information is not provided
the OCP Console implements this process with minimal differences across different cloud providers and is capable of adjusting the terminology depending on the cloud provider that the cluster is running on

Use Cases:

A cluster admin browses the OperatorHub catalog and looks at the details view of a particular operator, there they discover that the cluster is configured for GCP WIF
A cluster admin browsing the OperatorHub catalog content can filter for operators that support the GCP WIF flow described in OCPSTRAT-922
A cluster admin reviewing the details of a particular operator in the OperatorHub view can discover that this operator supports GCP WIF authentication
A cluster admin installing a particular operator can get information about the GCP IAM permission requirements the operator has
A cluster admin installing a particular operator is asked to provide GCP ServiceAccount that is required for GCP WIF prior to the actual installation step and is prevented from continuing without this information
A cluster admin reviewing an installed operator with support for GCP WIF can discover the related CredentialRequest object that the operator created in an intuitive way (not generically via related objects that have an ownership reference or as part of the InstallPlan)

Out of Scope

update handling and blocking in case of increased permission requirements in the next / new version of the operator
more complex scenarios with multiple IAM roles/service principals resulting in multiple CredentialRequest objects used by a single operator

Background

The OpenShift Console today provides little to no support for configuring OLM-managed operators for short-lived token authentication. Users are generally unaware if their cluster runs on a cloud provider and is set up to use short-lived tokens for its core functionality and users are not aware which operators have support for that by implementing the respective flows defined in OCPSTRAT-922.

Customer Considerations

Customers may or may not be aware of short-lived token authentication support. They need proper context and pointers to follow-up documentation to explain the general concept and the specific configuration flow the Console supports. It needs to become clear that the Console cannot 100% automate the overall process and some steps need to be run outside of the cluster/Console using cloud-provider-specific tooling.

Epic CONSOLE-3851: Support for new operator annotations format 4.16

Epic Goal

Transparently support old and new infrastructure annotations format delivered by OLM-packaged operators

Why is this important?

As part of OCPSTRAT-288 we are looking to improve the metadata quality of Red Hat operators in OpenShift
Via PORTENABLE-525 we are defining a new metadata format that supports the aforementioned initiative with more robust detection of individual infrastructure features via boolean data types

Scenarios

A user can use the OCP console to browse through the OperatorHub catalog and filter for all the existing and new annotations defined in PORTENABLE-525
A user reviewing an operator's details can see the supported infrastructures transparently, regardless of whether the operator uses the new or the existing annotations format

Acceptance Criteria

the new annotation format is supported in operatorhub filtering and operator details pages
the old annotation format keeps being supported in operatorhub filtering and operator details pages
the console will respect both the old and the new annotations format
when, for a particular feature, the operator denotes data in both the old and the new annotation format, the annotations in the newer format take precedence
the newer infrastructure features from PORTENABLE-525, tls-profiles and token-auth/*, do not have equivalents in the old annotation format, so their evaluation does not need the fallback described in the previous point
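
The precedence rule in the criteria above can be sketched as follows. This is a minimal illustration, not the console's implementation: the function name is hypothetical, and both the old-format key (assumed here to be operators.openshift.io/infrastructure-features holding a JSON array of feature names) and the "true"/"false" string values of the new per-feature keys are assumptions of this sketch. Feature names are also assumed to match between the two formats, which may not hold for every feature.

```python
import json

# Assumed annotation keys; only the new-format prefix is taken from the text above.
OLD_KEY = "operators.openshift.io/infrastructure-features"
NEW_PREFIX = "features.operators.openshift.io/"


def resolve_infrastructure_features(annotations):
    """Return {feature: bool}, letting new-format annotations override old ones."""
    features = {}

    # Old format: a single annotation holding a JSON array of supported features.
    raw = annotations.get(OLD_KEY)
    if raw:
        try:
            for name in json.loads(raw):
                features[str(name).lower()] = True
        except (ValueError, TypeError):
            pass  # malformed old-format data is simply ignored in this sketch

    # New format: one boolean-valued annotation per feature. It is applied
    # last, so it wins whenever both formats describe the same feature.
    for key, value in annotations.items():
        if key.startswith(NEW_PREFIX):
            features[key[len(NEW_PREFIX):]] = str(value).strip().lower() == "true"

    return features
```

Features that exist only in the new format, such as token-auth-gcp or tls-profiles, are picked up by the second loop alone, so no fallback to the old format is involved for them.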

Dependencies (internal and external)

none

Open Questions

due to the non-intrusive nature of this feature, can we ship it in a 4.14.z patch release?

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story CONSOLE-3776: Add support for features.operators.openshift.io/token-auth-gcp annotation

https://issues.redhat.com/browse/PORTENABLE-525 adds new annotations, including features.operators.openshift.io/token-auth-gcp, but console does not yet support this new annotation. We need to determine what changes are necessary to support this new annotation and implement them.

AC:

Uncomment auth token GCP annotation related code, which has been added in https://github.com/openshift/console/pull/13183/files
Add "Auth Token GCP" filter (TBD) to Infrastructure features filter section in the operator hub

https://github.com/openshift/console/pull/13559

Story CONSOLE-3774: Add support for features.operators.openshift.io/tls-profiles annotation

https://issues.redhat.com/browse/PORTENABLE-525 adds new annotations, including features.operators.openshift.io/tls-profiles, but console does not yet support this new annotation. We need to determine what changes are necessary to support this new annotation and implement them.

AC:

Uncomment TLS annotation related code, which has been added in https://github.com/openshift/console/pull/13183/files
Add "Configurable TLS ciphers" filter to Infrastructure features filter section in the operator hub

https://github.com/openshift/console/pull/13555

Feature OCPSTRAT-974: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic OCPNODE-1886: [OCP Rebase] Rebase OCP control plane with Kubernetes v1.29

Epic Goal

The goal of this epic is to capture all the work and effort required to update the OpenShift control plane to upstream Kubernetes v1.29

Why is this important?

A rebase is required for every OCP release to pick up the new features implemented upstream

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

The following epic captured the previous rebase work for k8s v1.28:
https://issues.redhat.com/browse/STOR-1425

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story OCPNODE-1895: Bump the openshift/origin repository

While monitoring the payload job failures, open a parallel openshift/origin bump.

Note: There is a high chance of job failures in the openshift/origin bump until the openshift/kubernetes PR merges, as we only update the tests and not the actual kube.

The benefit of opening this PR before the ocp/k8s merge is that issues can be identified and fixed beforehand.

Prev Ref: https://github.com/openshift/origin/pull/28097

Story OCPNODE-1892: Rebase openshift/kubernetes with upstream kubernetes 1.29

Follow the rebase doc[1] and update the spreadsheet[2] that tracks the commits that need to be cherry-picked. Rebase the o/k repo with the "merge=ours" strategy as described in the rebase doc.
Save the last commit ID in the spreadsheet for future reference.

Update the rebase doc if required.

[1] https://github.com/openshift/kubernetes/blob/master/REBASE.openshift.md
[2] https://docs.google.com/spreadsheets/d/10KYptJkDB1z8_RYCQVBYDjdTlRfyoXILMa0Fg8tnNlY/edit#gid=1957024452

Prev. Ref:
https://github.com/openshift/kubernetes/pull/1646

Story OCPNODE-1890: Bump api, dependent libraries

Bump the following libraries, in this order, to the latest kube and dependent libraries:

openshift/api
openshift/client-go
openshift/library-go
openshift/apiserver-library-go
kube-openapi (if required)

Prev Ref:
https://github.com/openshift/api/pull/1534
https://github.com/openshift/client-go/pull/250
https://github.com/openshift/library-go/pull/1557
https://github.com/openshift/apiserver-library-go/pull/118

https://github.com/openshift/kube-openapi/pull/7

Epic PODAUTO-79: Pod Autoscaling 4.16 release chores

This epic will own all of the usual update, rebase and release chores which must be done during the OpenShift 4.16 timeframe for Custom Metrics Autoscaler, Vertical Pod Autoscaler and Cluster Resource Override Operator

Story PODAUTO-99: Vertical Pod Autoscaler deps & toolchain updates for 4.16

Shepherd and merge operator automation PR https://github.com/openshift/vertical-pod-autoscaler-operator/pull/151

Update the operator as was done in https://github.com/openshift/vertical-pod-autoscaler-operator/pull/146

Shepherd and merge operand automation PR https://github.com/openshift/kubernetes-autoscaler/pull/277

Coordinate with cluster autoscaler team on upstream rebase as in https://github.com/openshift/kubernetes-autoscaler/pull/250

https://github.com/openshift/kubernetes-autoscaler/pull/286

Epic SDN-4397: Rebase Kube version to 1.29 in repos that the SDN team maintains

https://docs.google.com/document/d/1h1XsEt1Iug-W9JRheQas7YRsUJ_NQ8ghEMVmOZ4X-0s/edit --> this is the link for rebase help

Story SDN-4405: Rebase CNCC to 1.29

https://github.com/openshift/cloud-network-config-controller/pull/131

Epic WRKLDS-1016: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic CCO-454: Upgrade to Kubernetes 1.29

Epic Goal

The goal of this epic is to upgrade all OpenShift and Kubernetes components that cloud-credential-operator uses to v1.29, which keeps it on par with the rest of the OpenShift components and the underlying cluster version.

https://github.com/openshift/cloud-credential-operator/pull/655

Epic OCPCLOUD-2276: Rebase Cluster Infrastructure Components onto 1.29

Epic Goal

Cluster Infrastructure owned components should be running on Kubernetes 1.29
This includes
- The cluster autoscaler (+operator)
- Machine API operator
  - Machine API controllers for:
    - AWS
    - Azure
    - GCP
    - vSphere
    - OpenStack
    - IBM
    - Nutanix
- Cloud Controller Manager Operator
  - Cloud controller managers for:
    - AWS
    - Azure
    - GCP
    - vSphere
    - OpenStack
    - IBM
    - Nutanix
- Cluster Machine Approver
- Cluster API Actuator Package
- Control Plane Machine Set Operator

Why is this important?

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Task OCPCLOUD-2423: Rebase/update to K8s 1.29 for Cloud Provider AWS

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/cloud-provider-aws/pull/53

Task OCPCLOUD-2431: Rebase/update to K8s 1.29 for Cluster Autoscaler

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/kubernetes-autoscaler/pull/285

Task OCPCLOUD-2415: Rebase/update to K8s 1.29 for Control Plane Machine Set Operator

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/271

Task OCPCLOUD-2432: Rebase/update to K8s 1.29 for Cluster Autoscaler Operator

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/cluster-autoscaler-operator/pull/310

Task OCPCLOUD-2417: Rebase/update to K8s 1.29 for Cluster Machine Approver

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/cluster-machine-approver/pull/224

Task OCPCLOUD-2424: Rebase/update to K8s 1.29 for Cluster Cloud Controller Manager Operator

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/325

Task OCPCLOUD-2418: Rebase/update to K8s 1.29 for Cloud Provider Nutanix

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/cloud-provider-nutanix/pull/21

Task OCPCLOUD-2420: Rebase/update to K8s 1.29 for Cloud Provider vSphere

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/cloud-provider-vsphere/pull/53

Task OCPCLOUD-2419: Rebase/update to K8s 1.29 for Cloud Provider IBM

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/cloud-provider-ibm/pull/59

Task OCPCLOUD-2422: Rebase/update to K8s 1.29 for Cloud Provider Azure

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/cloud-provider-azure/pull/88

Epic OCPCLOUD-2435: Rebase Cluster API Components onto 1.28

Epic Goal

Cluster Infrastructure owned CAPI components should be running on Kubernetes 1.28
The target is 4.16, since CAPI is always a release behind upstream

Why is this important?

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Task OCPCLOUD-2443: Rebase/update to K8s 1.28 for Cluster API Provider Azure

To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/cluster-api-provider-azure/pull/297

Task OCPCLOUD-2441: Rebase/update to K8s 1.28 for Cluster API Provider AWS

To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/cluster-api-provider-aws/pull/499

Task OCPCLOUD-2449: Rebase/update to K8s 1.28 for Core Cluster API

To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/cluster-api/pull/192

Task OCPCLOUD-2447: Rebase/update to K8s 1.28 for Cluster API Provider vSphere

To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/cluster-api-provider-vsphere/pull/34

Task OCPCLOUD-2451: Rebase/update to K8s 1.28 for Cluster CAPI Operator

To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/cluster-capi-operator/pull/164

Task OCPCLOUD-2454: Rebase/update to K8s 1.28 for Cluster API Provider IBM

To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/cluster-api-provider-ibmcloud/pull/74

Task OCPCLOUD-2445: Rebase/update to K8s 1.28 for Cluster API Provider GCP

To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository

https://github.com/openshift/cluster-api-provider-gcp/pull/221

Feature OCPSTRAT-975: Support z-rollback for OCP in 4.12, 4.14

Feature Overview

As a Cluster Administrator, I want to be able to perform an unassisted z-stream rollback, starting from OCP 4.12 and 4.14 standalone OpenShift.

Epic Goal

Validate z-stream rollbacks in CI starting with 4.10 by ensuring that a rollback completes unassisted and the e2e test suite passes
Provide internal documentation (private KCS article) that explains when this is the best course of action versus working around a specific issue
Provide internal documentation (private KCS article) that explains the expected cluster degradation until the rollback is complete
Provide internal documentation (private KCS article) outlining the process and any post rollback validation

Why is this important?

Even if upgrade success is 100% there's some chance that we've introduced a change which is incompatible with a customer's needs and they desire to roll back to the previous z-stream
Previously we've relied on backup and restore here; however, due to many problems with time travel, that's only appropriate for disaster recovery scenarios where the cluster is either completely shut down already or it's acceptable to do so while also accepting loss of any workload state change (PVs that were attached after the backup was taken, etc.)
We believe that we can reasonably roll back to a previous z-stream

Scenarios

Upgrade from 4.10.z to 4.10.z+n
oc adm upgrade rollback-z-stream: initially a hidden command, this looks at the clusterversion history and rolls back to the previous version if and only if that version is a z-stream away
Rollback from 4.10.z+n to exactly 4.10.z, during which the cluster may experience degraded service and/or periods of service unavailability but must eventually complete with no further admin action
Must pass 4.10.z e2e testsuite
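
The "if and only if that version is a z-stream away" check in the scenario above can be sketched as below. This is an illustration only, not the oc implementation: the function names are hypothetical, the history is assumed to be a list of version strings ordered newest first, and versions are assumed to be plain major.minor.patch without pre-release suffixes.

```python
def parse_version(version):
    """Split 'major.minor.patch' into a tuple of ints (no pre-release suffixes)."""
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)


def rollback_target(history):
    """Return the previous version if it is only a z-stream away, else None.

    history: version strings ordered newest first (assumption of this sketch).
    """
    if len(history) < 2:
        return None  # nothing to roll back to

    current = parse_version(history[0])
    previous = parse_version(history[1])

    # Same major.minor and an older patch level means a z-stream rollback.
    if current[:2] == previous[:2] and previous[2] < current[2]:
        return history[1]
    return None
```

For example, a history of 4.14.8 preceded by 4.14.6 yields 4.14.6 as the target, while a history of 4.14.0 preceded by 4.13.20 yields no target because the previous version is a minor version away.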

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
Fix all bugs listed here
project = "OpenShift Bugs" AND affectedVersion in (4.12, 4.14, 4.15) AND labels = rollback AND status not in (Closed) ORDER BY status DESC

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

At least for now, we intend to surface this process only internally and work through it with customers actively engaged with support. Where do we put that?

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story OTA-492: Add OC command for rollback-z-stream

The oc command should start an update to the previous z-stream version from the history.
This should be a hidden command at this point in time.

https://github.com/openshift/oc/pull/1642

Bug OCPBUGS-24535: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-version-operator/pull/996

Feature OCPSTRAT-976: Ensure CI/Conformance flows are in place for HyperShift-Azure integration

Feature Overview (Goal Summary):

Identify and fill gaps related to CI/CD for HyperShift-ARO integration.

Goals (Expected User Outcomes):

Establish a consistent and automated testing pipeline that reduces the risk of regressions and improves code quality.
Enhance the user experience by ensuring that HyperShift's integration with Azure is thoroughly tested and reliable.

Requirements (Acceptance Criteria):

Develop and integrate an Azure-specific testing suite that can be run as part of the presubmit checks.
Implement a schedule for periodic conformance tests on ARO.
Maintain documentation that guides contributors on how to write and run tests within the ARO ecosystem.

Use Cases

A developer submits a pull request to HyperShift’s codebase, which automatically triggers the Azure-specific presubmit tests.
Scheduled conformance tests run automatically at predetermined intervals, providing ongoing assurance of ARO's integration with Azure.

Epic HOSTEDCP-1239: Enable periodic conformance tests for Azure

Acceptance Criteria:

Description of criteria:

Conformance tests run periodically for Azure

This does not require a design proposal.
This does not require a feature gate.

Story HOSTEDCP-1460: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/hypershift/pull/3725

Story HOSTEDCP-1411: Allow setting custom tags on created azure resource group

Currently, a newly created resource group is deleted after 24 hours. We can change this by setting an expiry tag on the resource group once it is created.

It would be nice to be able to set this expiry tag when the resource group is created as part of the cluster creation command.

https://github.com/openshift/hypershift/pull/3490

Epic HOSTEDCP-1235: Enable back presubmit for Azure

We want to make sure ARO/HCP development happens while satisfying e2e expectations

Acceptance Criteria:

There's a running, blocking test for azure in presubmits

Engineering Details:

We discussed creating an Azure HC on our root CI cluster, which would be used as our Azure mgmt cluster. We would need to:
- Manually create the Azure HC on the root CI
- Capture the manifests for that and put them in contrib
- Configure the IDP
- Setup credentials

This does not require a design proposal.
This does not require a feature gate.

Story HOSTEDCP-1407: Create Azure mgmt cluster

User Story:

As a (user persona), I want to be able to:

Create ARO HCs off an ARO MGMT cluster

so that I can achieve

Better Dev flow
Enable pre-submit e2e testing for ARO dev work

Acceptance Criteria:

Description of criteria:

ARO mgmt cluster exists.
Documents to reproduce mgmt cluster env exist.

https://github.com/openshift/hypershift/pull/3545

Feature OCPSTRAT-98: Updated boot images: Phase 1 - GCP Tech Preview

Feature Overview

OCP 4 clusters still maintain pinned boot images. We have numerous clusters installed that have boot media pinned to first boot images as early as 4.1. In the future these boot images may not be certified by the OEM and may fail to boot on updated datacenter or cloud hardware platforms. These "pinned" boot images should be updateable so that customers can avoid this problem and better still scale out nodes with boot media that matches the running cluster version.

Phase one: GCP tech preview

Epic MCO-994: Update Boot Images - Phase 1 GA

This will pick up stories left off from the initial Tech Preview(Phase 1): https://issues.redhat.com/browse/MCO-589

Story MCO-820: Create an opt-in knob for updating boot images in openshift/api

This will be implemented via a global knob in the API. This is required in addition to the feature gate, as we expect customers to still want to toggle this feature when it leaves tech preview.

Done when:

a new API object is created for opting in machine resources

https://github.com/openshift/api/pull/1672

Feature OCPSTRAT-980: Enforce Data/Secret Encryption for the Control-Planes, Etcd, and Nodes

Feature Overview (Goal Summary)

This feature is dedicated to enhancing data security and implementing encryption best practices across control planes, etcd, and nodes for HyperShift with Azure. The objective is to ensure that all sensitive data, including secrets, is encrypted, thereby safeguarding against unauthorized access and ensuring compliance with data protection regulations.

Epic HOSTEDCP-1293: CAPI: Have worker nodes backed by an encrypted disk

User Story:

As a user of HCP on Azure, I would like to be able to pass a customer-managed key when creating a HC so that the disks for the VMs in the NodePool are encrypted.

Acceptance Criteria:

Description of criteria:

Upstream documentation

Story HOSTEDCP-1328: Azure: Have worker nodes backed by an encrypted disk

User Story:

As a user of HCP on Azure, I want to be able to provide a DiskEncryptionSet ID to encrypt the OS disks for the VMs in the NodePool so that the data on the OS disks will be protected by encryption.

Acceptance Criteria:

Description of criteria:

Upstream documentation added on what is needed for the Azure Key Vault and how to encrypt the OS disks through both the CLI and the CR spec.
HyperShift CLI lets a user provide a DiskEncryptionSet ID to encrypt the OS disk.
Ability to encrypt the OS disks through the HyperShift CLI.
Ability to encrypt the OS disks through the HC CR.
Any applicable unit tests.

Out of Scope:

N/A

Engineering Details:

BYO resource group is required for this story.
This story programmatically implements this Azure demo of using customer keys to encrypt the OS disk - https://learn.microsoft.com/en-us/azure/virtual-machines/disks-enable-host-based-encryption-portal?tabs=azure-powershell#deploy-a-vm-with-customer-managed-keys

https://github.com/openshift/hypershift/pull/3281

Feature OCPSTRAT-981: Implement Lifecycle & Image Management for NodePools

Feature Overview (Goal Summary)

This feature focuses on the optimization of resource allocation and image management within NodePools. This will include enabling users to specify resource groups at NodePool creation, integrating external DNS support, ensuring Cluster API (CAPI) and other images are sourced from the payload, and utilizing Image Galleries for Azure VM creation.

Epic HOSTEDCP-1290: Allow user to specify subnet ID for NodePool Creation

User Story:

As an ARO/HCP provider, I want to be able to:

receive a customer's subnet id

so that I can achieve

parse out the resource group name, vnet name, subnet name for use in CAPZ (creating VMs) and Azure CCM (setting up Azure cloud provider).
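
The parsing step described above can be sketched as follows, assuming the subnet ID follows the standard Azure resource ID layout (/subscriptions/.../resourceGroups/.../providers/Microsoft.Network/virtualNetworks/.../subnets/...). The function name and the example ID are hypothetical, and this is an illustration rather than the HyperShift implementation.

```python
import re

# Standard Azure resource ID layout for a subnet. Segment names are matched
# case-insensitively because Azure treats resource IDs that way.
SUBNET_ID_RE = re.compile(
    r"^/subscriptions/(?P<subscription>[^/]+)"
    r"/resourceGroups/(?P<resource_group>[^/]+)"
    r"/providers/Microsoft\.Network"
    r"/virtualNetworks/(?P<vnet>[^/]+)"
    r"/subnets/(?P<subnet>[^/]+)/?$",
    re.IGNORECASE,
)


def parse_subnet_id(subnet_id):
    """Return (resource group, vnet, subnet) names encoded in a subnet ID."""
    match = SUBNET_ID_RE.match(subnet_id.strip())
    if match is None:
        raise ValueError(f"not a subnet resource ID: {subnet_id!r}")
    return match.group("resource_group"), match.group("vnet"), match.group("subnet")


# Hypothetical example ID, for illustration only.
example = (
    "/subscriptions/0000-1111/resourceGroups/customer-rg"
    "/providers/Microsoft.Network/virtualNetworks/customer-vnet/subnets/workers"
)
print(parse_subnet_id(example))
```

Rejecting anything that does not match the full layout keeps a malformed ID from silently producing the wrong resource group name.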

Acceptance Criteria:

Description of criteria:

Customer hosted cluster resources get created in a managed resource group.
Customer vnet is used in setting up the VMs.
Customer resource group remains unchanged.

Engineering Details:

Benjamin Vesel said the managed resource group and customer resource group would be under the same subscription id.
We are only supporting BYO VNET at the moment; we are not supporting the VNET being created with the other cloud resources in the managed resource group.

This requires a design proposal so OCM knows where to specify the resource group.
This might require a feature gate in case we don't want it for self-managed.

Story HOSTEDCP-1329: Azure: Allow user to specify resource group for NodePool resources

User Story:

As an ARO/HCP provider, I want to be able to:

specify which resource group is going to be used for the resources on the customer side

so that I can achieve

control over where resources get allocated instead of getting defaults.

Acceptance Criteria:

Description of criteria:

Customer side resources get created in the API specified resource group.

Engineering Details:

This requires a design proposal so OCM knows where to specify the resource group.
This might require a feature gate in case we don't want it for self-managed.

https://github.com/openshift/hypershift/pull/3279

Epic HOSTEDCP-1297: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story HOSTEDCP-1408: Update CAPZ Identity Type to Service Principal

User Story:

As a HyperShift, I want to use the service principal as the identity type for CAPZ, so that the warning message about using manual service principal in the capi-provider pod goes away.

Acceptance Criteria:

Description of criteria:

The identity type for CAPZ is set to service principal.
The warning messages about manual service principal in the capi-provider pod are gone.

Out of Scope:

Engineering Details:

Manual service principal was deprecated and will be removed in the future.

This does not require a design proposal.
This does not require a feature gate.

https://github.com/openshift/hypershift/pull/3501

Story HOSTEDCP-1373: ARO: Add capability for Azure VMs to be created with ephemeral disks

User Story:

As a user of HyperShift, I want to be able to create Azure VMs with ephemeral disks, so that I can achieve higher IOPS.

Note: AKS also defaults to using them.

Acceptance Criteria:

Description of criteria:

Verify upstream documentation exists on how to set up Azure VMs with ephemeral disks
HyperShift documentation added on how to set up Azure VMs with ephemeral disks
Capability to create clusters with Azure VMs with ephemeral disks exists
Capability to create NodePools with Azure VMs with ephemeral disks exists

Out of Scope:

N/A

Engineering Details:

Upstream documentation exists in CAPZ.

This does not require a design proposal.
This does not require a feature gate.

https://github.com/openshift/hypershift/pull/3483

Sub-task HOSTEDCP-1382: Restore capability to create Azure HCP from k8s v1.29 bump

View the Description View the linked PRs

The ability to create an Azure HCP has been broken since the k8s v1.29 bump.

https://github.com/openshift/hypershift/pull/3404

Epic HOSTEDCP-1326: Remove Old Azure SDKs from Azure Infra

View the Description

Story HOSTEDCP-1327: Remove Old Azure SDKs from Azure Infra

View the Description View the linked PRs

https://github.com/openshift/hypershift/pull/3274

Feature OCPSTRAT-982: Network Optimization and Management Enhancements

View the Description

Feature Overview (aka. Goal Summary)

The feature is specifically designed to concentrate on network optimizations, particularly targeting improvements in how network is configured and how access is managed using Cluster API (CAPI) for Azure (Potentially running the control-plane on AKS).

Epic HOSTEDCP-1294: Allow BYO network security group

View the Description

User Story:

As a hosted cluster deployer, I want to be able to:

specify network security groups

so that I can achieve

Network flexibility to run my workloads

Acceptance Criteria:

Able to specify network security group when creating Azure cluster
Able to specify network security group when creating Azure infra

Engineering Details:

https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/44a4d9a25922d2086258228c7692cc326f0eea56/docs/book/src/topics/custom-vnet.md?plain=1#L23

This does not require a design proposal.
This does not require a feature gate.

Story HOSTEDCP-1401: Enable BYO NSG

View the Description View the linked PRs

User Story:

As a (user persona), I want to be able to:

Use a pre-existing network security group when creating an Azure cluster

so that I can achieve

more network flexibility in my Hosted Control planes deployment

Acceptance Criteria:

Description of criteria:

Able to specify network security group when creating Azure cluster
Able to specify network security group when creating Azure infra

https://github.com/openshift/hypershift/pull/3455

Epic HOSTEDCP-1423: Enable more control over load balancer setup for guest cluster egress

View the Description

Goal

The current HCP implementation lets the Azure CCM pod create a load balancer (LB) and public IP address for guest cluster egress. The outbound SNAT is using default port allocation.

ARO HCP needs more control over how the LB is created and set up. Ideally, CAPZ would create and manage the LB. ARO HCP would also like the ability to use an LB with user-defined routing (UDR).

Why is this important?

Using an LB for guest cluster egress is the better option in terms of both cost and availability compared to a NAT Gateway. NAT Gateways are more expensive and are also zonal.

Story HOSTEDCP-1424: ARO HCP: Investigate CAPZ creating and managing a LB for guest cluster egress

View the Description View the linked PRs

Investigate the possibility of CAPZ creating and managing a LB for guest cluster egress.

https://github.com/openshift/hypershift/pull/3583

Feature TELCOSTRAT-160: Expeditious SNO Upgrade and Rollback To Meet Telco Far Edge KPIs

View the Description

Feature Overview

Telecommunications providers look to displace Physical Network Functions (PNFs) with modern Virtual Network Functions (VNFs) at the Far Edge. Single Node OpenShift, as the CaaS layer in the vRAN vDU architecture, must achieve a higher standard of OpenShift upgrade speed and efficiency to compare with PNFs.

Telecommunications providers currently deploy Firmware-based Physical Network Functions (PNFs) in their RAN solutions. These PNFs can be upgraded quickly due to their monolithic nature and image-based download-and-reboot upgrades. Furthermore they often have the ability to retry upgrades and to rollback to the previous image if the new image fails. These Telcos are looking to displace PNFs with virtual solutions, but will not do so unless the virtual solutions have comparable operational KPIs to the PNFs.

Goals

Service Downtime

Service (vDU) Downtime is the time when the CNF is not operational and therefore no traffic is passing through the vDU. This has a significant impact, as it degrades the customer’s service (5G->4G) or causes an outright service outage. These disruptions are scheduled into Maintenance Windows (MW), but the Telecommunications Operator's primary goal is to keep service running, so getting vRAN solutions with OpenShift to near PNF-like Service Downtime is and always will be a primary requirement.

Upgrade Duration

Upgrading OpenShift is only one of many operations that occur during a Maintenance Window. Reducing the CaaS upgrade duration is meaningful to many teams within a Telecommunications Operator's organization, as this duration fits into a larger set of activities that put pressure on the time available for Red Hat software. OpenShift must reduce the upgrade duration significantly to compete with existing PNF solutions.

Failure Detection and Remediation

As mentioned above, the Service Downtime disruption duration must be as small as possible, including when there are failures. Hardware failures fall into a category called Break+Fix and are covered by TELCOSTRAT-165. Software failures must be detected and remediated.

Detection includes monitoring the upgrade for stalls and failures. Remediation requires the ability to roll back to the previous known-working version that was running prior to the failed upgrade.

Implicit Requirements

Upgrade To Any Release

The OpenShift product support terms are too short for Telco use cases, in particular vRAN deployments. The risk of Service Downtime drives Telecommunications Operators to a certify-deploy-and-then-don’t-touch model. One specific request from our largest Telco Edge customer is for 4 years of support.

These longer support needs drive a misalignment with the EUS->EUS upgrade path and drive the requirement that the Single Node OpenShift deployment can be upgraded from OCP X.y.z to any future [X+1].[y+1].[z+1], where X+1 and y+1 are decided by the Telecommunications Operator depending on timing and the desired feature-set, and z+1 is determined through Red Hat, vDU vendor and custom maintenance and engineering validation.

Alignment with Related Break+Fix and Installation Requirements

Our telecommunications partners and customers are challenging Red Hat to improve multiple OpenShift Operational KPIs. Improved Break+Fix is tracked in TELCOSTRAT-165 and improved Installation is tracked in TELCOSTRAT-38.

Seamless Management within RHACM

Whatever methodology achieves the above requirements must ensure that the customer has a pleasant experience via RHACM and Red Hat GitOps. Red Hat’s current install and upgrade methodology is via RHACM and any new technologies used to improve Operational KPIs must retain the seamless experience from the cluster management solution. For example, after a cluster is upgraded it must look the same to a RHACM Operator.

Seamless Management when On-Node Troubleshooting

Whatever methodology achieves the above requirements must ensure that a technician troubleshooting a Single Node OpenShift deployment has a pleasant experience. All commands issued on the node must return output as it would before performing an upgrade.

Requirements

Y-Stream

CaaS Upgrade must complete in <2 hrs (single MW) (20 mins for PNF)
Minimize service disruption (customer impact) < 30 mins [cumulative] (5 mins for PNF)
Support In-Place Upgrade without vDU redeployment
Support backup/restore and rollback to known working state in case of unexpected upgrade failures

Z-Stream

CaaS Patch Release (z-release) Upgrade Requirements:
CaaS Upgrade must complete <15 mins
Minimize service disruption (customer impact) < 5 mins
Support In-Place Upgrade without vDU redeployment
Support backup/restore and rollback to known working state in case of unexpected upgrade failures

Rollback

Restore and Rollback functionality must complete in 30 minutes or less.
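
The time budgets listed above can be captured in a small check. This is only a sketch: the names `UpgradeBudget`, `BUDGETS`, `meets_kpis` and `rollback_ok` are hypothetical, and the numbers are copied from the Y-Stream, Z-Stream and Rollback requirements in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpgradeBudget:
    """Upper bounds, in minutes, taken from the requirements above."""
    max_upgrade_min: int
    max_disruption_min: int


# Hypothetical table of budgets; values come from the requirement lists.
BUDGETS = {
    "y-stream": UpgradeBudget(max_upgrade_min=120, max_disruption_min=30),
    "z-stream": UpgradeBudget(max_upgrade_min=15, max_disruption_min=5),
}
MAX_ROLLBACK_MIN = 30


def meets_kpis(stream: str, upgrade_min: float, disruption_min: float) -> bool:
    """True when both durations are strictly below the budget ("<" in the text)."""
    budget = BUDGETS[stream]
    return (upgrade_min < budget.max_upgrade_min
            and disruption_min < budget.max_disruption_min)


def rollback_ok(rollback_min: float) -> bool:
    """Rollback must complete in 30 minutes or less, so the bound is inclusive."""
    return rollback_min <= MAX_ROLLBACK_MIN
```

For example, `meets_kpis("z-stream", 14, 4)` holds, while a z-stream upgrade with exactly 5 minutes of disruption does not.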

Upgrade Path

Allow for upgrades to any future supported OCP release.

References

Epic CNF-9121: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story CNF-11091: Change performance validating webhook failure policy to "Ignore"

View the Description View the linked PRs

Webhooks created during bootstrap-in-place will lead to failure in applying their admission subject resources. A permanent solution will be provided by resolving https://issues.redhat.com/browse/OCPBUGS-28550

We need a temporary workaround to unblock the development. This is easiest to do by changing the validating webhook failure policy to "Ignore"
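
As an illustration of the workaround, the sketch below sets `failurePolicy` to "Ignore" on every webhook of a ValidatingWebhookConfiguration manifest held as a Python dict. `failurePolicy` is a standard Kubernetes field; the manifest name, the webhook name and the helper `relax_failure_policy` are placeholders and do not reflect how the linked PR implements the change in the operator.

```python
import copy


def relax_failure_policy(webhook_config: dict) -> dict:
    """Return a copy of the manifest with every webhook set to failurePolicy "Ignore"."""
    patched = copy.deepcopy(webhook_config)
    for webhook in patched.get("webhooks", []):
        webhook["failurePolicy"] = "Ignore"
    return patched


# Hypothetical manifest: the names below are placeholders.
example = {
    "apiVersion": "admissionregistration.k8s.io/v1",
    "kind": "ValidatingWebhookConfiguration",
    "metadata": {"name": "example-performance-webhook"},
    "webhooks": [
        {"name": "validate.example.openshift.io", "failurePolicy": "Fail"},
    ],
}

patched = relax_failure_policy(example)
```

With "Ignore", an unreachable webhook no longer blocks admission of the resources it validates, which is why this is suitable only as a temporary measure.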

https://github.com/openshift/cluster-node-tuning-operator/pull/933

Story CNF-10170: Run TuneD one-shot during bootstrap

View the Description View the linked PRs

Run TuneD in a container in one-shot mode and read the output kernel arguments to apply them using a MachineConfig (MC).
This would be run in the bootstrap procedure of the OpenShift installer, just before the Machine Config Operator (MCO) procedure here
Initial considerations: https://docs.google.com/document/d/1zUpcpFUp4D5IM4GbM4uWbzbjr57h44dS0i4zP-hek2E/edit

Epic MGMT-15783: Cluster Reconfiguration

View the Description

Feature goal (what are we trying to solve here?)

A systemd service that runs on the first boot of a golden image and configures the following:

1. Networking (the internal IP address requires special attention)

2. Update the hostname (~~MGMT-15775~~)

3. Execute recert (regenerate certs, cluster name and base domain ~~MGMT-15533~~)

4. Start kubelet

5. Apply the personalization info:

Pull Secret
Proxy
ICSP
DNS server
SSH keys

DoD (Definition of Done)

The API for configuring the networking, hostname, personalization manifests and executing recert is well defined
The service configures the required attributes and starts a functional OCP cluster
CI job that reconfigures a golden image and forms a functional cluster

Does it need documentation support?

If the answer is "yes", please make sure to check the corresponding option.

Feature origin (who asked for this feature?)

- Internal request
  - This is required as part of https://issues.redhat.com/browse/OCPSTRAT-620
- Catching up with OpenShift

Reasoning (why it’s important?)

The following features depend on this functionality:

Task MGMT-16494: Move nodeip-configuration file creation from mc to ignition

View the Description View the linked PRs

In the IBI and IBU flows, we need a way to change the nodeip-configuration hint file without a reboot and before the MCO even starts. For the MCO to accept this, the file must be removed from its management. To do so, we will stop creating the file through a machine config and move it to Ignition.

https://github.com/openshift/assisted-service/pull/5861

Feature TELCOSTRAT-18: CPU Manager: mix of exclusive and shared CPUs for a container

View the Description

Problem statement

DPDK applications require dedicated CPUs that are isolated from any preemption (other processes, kernel threads, interrupts). This can be achieved with the “static” policy of the CPU Manager: the container resources need to include an integer number of CPUs of equal value in “limits” and “requests”. For instance, to get six exclusive CPUs:

spec:
  containers:
  - name: CNF
    image: myCNF
    resources:
      limits:
        cpu: "6"
      requests:
        cpu: "6"
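
The CPU condition described above (an integer CPU count, equal in limits and requests) can be expressed as a small check. This is a sketch only: `parse_cpu` and `gets_exclusive_cpus` are hypothetical helpers, they look at the CPU fields alone, and in Kubernetes the pod as a whole must also be in the Guaranteed QoS class for the static policy to grant exclusive CPUs.

```python
from fractions import Fraction


def parse_cpu(quantity: str) -> Fraction:
    """Parse a Kubernetes CPU quantity such as "6", "0.5" or "500m" into CPUs."""
    if quantity.endswith("m"):
        return Fraction(int(quantity[:-1]), 1000)
    return Fraction(quantity)


def gets_exclusive_cpus(resources: dict) -> bool:
    """Check only the CPU condition: integer count, identical in requests and limits.

    This sketch requires both values to be spelled out, as in the example above.
    """
    requests = resources.get("requests", {}).get("cpu")
    limits = resources.get("limits", {}).get("cpu")
    if requests is None or limits is None:
        return False
    req, lim = parse_cpu(requests), parse_cpu(limits)
    return req == lim and req.denominator == 1 and req >= 1


six_exclusive = {"limits": {"cpu": "6"}, "requests": {"cpu": "6"}}
fractional = {"limits": {"cpu": "500m"}, "requests": {"cpu": "500m"}}
```

Here `gets_exclusive_cpus(six_exclusive)` is true, while the fractional request stays in the shared pool.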

The six CPUs are dedicated to that container. However, non-trivial (that is, real) DPDK applications do not use all of those CPUs, as there is always at least one CPU running a slow path: processing configuration, printing logs (among DPDK coding rules: no syscall in PMD threads, or you are in trouble). Even the DPDK PMD drivers and core libraries include pthreads that are intended to sleep; they are infrastructure pthreads that process link change interrupts, for instance.

Can we envision going with two processes, one with the isolated cores and one with the slow-path ones, so that we can have two containers? Unfortunately not: a multi-process design, where only dedicated pthreads would run in a process, is not an option because DPDK multi-process is being deprecated upstream and never gained adoption, as it never worked properly. Fixing it and changing the DPDK architecture to systematically have two processes is not possible within a year and would require all DPDK applications to be rewritten. Given that the first and current multi-process implementation is a failure, nothing guarantees that a second one would be successful.

The slow-path CPUs consume only a fraction of a real CPU and can safely run on the “shared” CPU pool of the CPU Manager. However, container specifications do not allow requesting two kinds of CPUs, for instance:

spec:
  containers:
  - name: CNF
    image: myCNF
    resources:
      limits:
        cpu_dedicated: "4"
        cpu_shared: "20m"
      requests:
        cpu_dedicated: "4"
        cpu_shared: "20m"

Why do we care about allocating one extra CPU per container?

Allocating one extra CPU means allocating an additional physical core: the CPUs running a DPDK application should run on a dedicated physical core to get maximum and deterministic performance, because caches and CPU units are shared between the two hyperthreads.

CNFs are built with a minimum of CPUs per container. Today this is still between 10 and 20, sometimes more, but the intent is to decrease the number of CPUs and increase the number of containers, as this is the “cloud native” way to avoid wasting resources by having containers that are too large to schedule, as in the VNF days (Tetris effect).

Let’s take a realistic example, based on a real RAN CNF: running 6 containers with dedicated CPUs on a worker node, with a slow path requiring 0.1 CPU, means that we waste 5 CPUs, that is, 3 physical cores. With real-life numbers:

For a single datacenter composed of 100 nodes, we waste 300 physical cores
For a single datacenter composed of 500 nodes, we waste 1500 physical cores
For Single Node OpenShift deployed on 1 million nodes, we waste 3 million physical cores
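
The figures above can be reproduced with a short calculation. This is a sketch under two assumptions that the text implies but does not state: each physical core exposes two hyperthreads, and the wasted CPU count is rounded down (5.4 becomes 5) before being converted to cores. The function name `wasted_cores_per_node` is hypothetical.

```python
import math
from fractions import Fraction

HYPERTHREADS_PER_CORE = 2  # assumption: two hyperthreads per physical core


def wasted_cores_per_node(containers: int, slow_path_cpu: Fraction) -> int:
    """Physical cores wasted when each container pins a full CPU for its slow path."""
    # Each container uses only `slow_path_cpu` of the dedicated slow-path CPU.
    wasted_cpus = math.floor(containers * (1 - slow_path_cpu))
    # A partially used core is still unavailable, so round up to whole cores.
    return math.ceil(Fraction(wasted_cpus, HYPERTHREADS_PER_CORE))


per_node = wasted_cores_per_node(containers=6, slow_path_cpu=Fraction(1, 10))
fleet_waste = {nodes: nodes * per_node for nodes in (100, 500, 1_000_000)}
```

With 6 containers and a slow path of 0.1 CPU, `per_node` is 3, and `fleet_waste` gives 300, 1500 and 3,000,000 cores for the three fleet sizes listed.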

Intel public CPU price per core is around 150 US$, not even taking into account the ecological aspect of the waste of (rare) materials and the electricity and cooling…

Goals

Implement an equivalent of Nokia CPU pooler, meaning a way to allocate dedicated and shared CPUs to a given container, and provide a way within the container to know which CPUs belong to which pool, so the CNF running in the container can properly pin its pthreads on the available CPUs.

Requirements

Requirement	Notes	isMvp?
CI - MUST be running successfully with test automation	This is a requirement for ALL features.	YES
Release Technical Enablement	Provide necessary release enablement details and documents.	YES

Questions to answer…

Would an implementation based on annotations be possible rather than an implementation requiring a container (so pod) definition change, like the CPU pooler does?

Out of Scope

Background, and strategic fit

This issue has been addressed lately by OpenStack.

Assumptions

Customer Considerations

Documentation Considerations

The feature needs documentation on how to configure OCP, create pods, and troubleshoot

Epic CNF-11005: GA Mixed CPUs

View the Description

Epic Goal

Making the MixedCPUs feature GA during the 4.16 release time frame
Having a fully functional, comprehensive test suite exercising the feature on both cgroup versions (v1 and v2)
Having official OCP docs explaining how to use this feature

Why is this important?

This would unblock lots of options including mixed cpu workloads where some CPUs could be shared among containers / pods ~~CNF-3706~~
This would also allow further research on dynamic (simulated) hyper threading ~~CNF-3743~~

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

Previous Work (Optional):

https://issues.redhat.com/browse/CNF-7603
https://issues.redhat.com/browse/CNF-9117 - Dev Preview epic

Open questions:

N/A

Story CNF-9173: ci: provide e2e testing for the mixed cpus feature

View the Description View the linked PRs

TBD
But this should probably go under NTO as a separate lane.

Feature TELCOSTRAT-97: BIOS and Firmware Upgrade/Downgrade prior to Zero Touch Provisioning

View the Description

Feature Overview

Telco customers delivering servers to far edge sites (D-RAN DU) need the ability to upgrade or downgrade the server's BIOS and firmware to specific versions to ensure the server is configured as it was in their validated pattern. After delivering the bare metal to the site, this should be done prior to using ZTP to provision the node.

Goals

Allow the operator, via GitOps, to specify the BIOS image and any firmware images to be installed prior to ZTP
Seamless transition from pre-provisioning to existing ZTP solution
Integrate BIOS/Firmware upgrades/downgrades into TALM
(consider) integration with backup/restore recovery feature
Firmware could include: NICs, accelerators, GPUs/DPUs/IPUs

Requirements

Requirement	Notes	isMvp?
CI - MUST be running successfully with test automation	This is a requirement for ALL features.	YES
Release Technical Enablement	Provide necessary release enablement details and documents.	YES

Questions to answer…

Can we provide the ability to fail back to a well known good firmware image if the upgrade/reboot fails?

Out of Scope

Background, and strategic fit

Assumptions

Customer Considerations

Documentation Considerations

Feature XCMSTRAT-320: ROSA HCP: Additional Security Group(s) on Machine Pools

View the Description

Feature Overview (aka. Goal Summary)

Support additional AWS Security Group IDs while creating machine pool in an existing ROSA HCP cluster.

Goals (aka. expected user outcomes)

Allow customers to add up to 10 optional Security Group IDs for an OCM machine pool in Hosted Control Plane (HCP) topology

Requirements (aka. Acceptance Criteria):

See the above Goal entry.

Support adding additional AWS Security Group IDs while creating a machine pool on an existing ROSA HCP cluster, i.e., CREATE MACHINEPOOL
Support viewing additional AWS Security Group IDs as part of a machine pool, i.e., the Describe machine pool command
Validate for a soft limit of 5 SG-IDs. If quota permits, allow up to 10 SG-IDs
Validate the SG-IDs are present in the VPC.
Retain the default worker SG with rules for communication within cluster components.
ROSA CLI, OCM UI and Terraform support
Support for up to 10 security groups.
~~Ability to attach to all nodes (existing and new) part of node pool (OCM machine pool in HCP topology)~~
~~Support for day-1 - creation of cluster or day-one machine pool (called 'worker')~~
~~Support for day-2 - creation of day-2 machine pools~~
~~Support for day-2 changes : add, remove SG IDs on all machine pools (both day-one and day-two)~~
~~Change SG IDs attached to node pools w/o machine or node restart~~
~~OCM API that's similar in UX between Classic and HCP topology~~
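
The validation requirements above (soft limit of 5, up to 10 when quota permits, and every SG-ID present in the VPC) could look roughly like the following. This is an illustrative sketch, not the OCM or ROSA CLI implementation; the function name, its parameters and the error wording are all hypothetical.

```python
SOFT_LIMIT = 5   # default number of additional SG-IDs
HARD_LIMIT = 10  # allowed only when the account quota permits


def validate_additional_security_groups(sg_ids, vpc_sg_ids, quota_allows_more=False):
    """Return a list of validation errors; an empty list means the request is valid."""
    errors = []
    limit = HARD_LIMIT if quota_allows_more else SOFT_LIMIT
    if len(sg_ids) > limit:
        errors.append(f"{len(sg_ids)} security groups requested, limit is {limit}")
    # Every requested SG-ID must exist in the cluster's VPC.
    missing = [sg for sg in sg_ids if sg not in vpc_sg_ids]
    if missing:
        errors.append("not found in VPC: " + ", ".join(missing))
    return errors


# Hypothetical IDs, for illustration only.
vpc_groups = {f"sg-{i:02d}" for i in range(12)}
six_groups = [f"sg-{i:02d}" for i in range(6)]
```

Six groups fail validation under the soft limit but pass when `quota_allows_more=True`; the default worker SG is retained separately and is not part of this list.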

Questions to Answer (Optional):

In HCP clusters, the default SG is shared between the VPCE ENI and the worker node ENI. This means ingress/egress rules are cumulative, which is not ideal. When can we separate the SGs of the ENIs? Answer: The HyperShift project will separate the Security Groups for the ENIs associated with the VPC Endpoint and the Worker Nodes, respectively.
In HCP clusters, the default SG is used on all the workers of all the machine pools. This means there are ingress rules referencing several ports required by HCP components like Ingress that are applied on all worker nodes. There is also no API or other way to remove these rules if a customer were to ask the ROSA HCP service to do so. How can we address this? Answer: The HyperShift project will limit the default Security Group attached to the Worker Nodes to only the ingress/egress rules necessary for the cluster to function. Additional ports such as SSH/22 or 30000-32767 will be removed from the default worker SG, allowing customers to add additional SGs (a use case for this feature) when they need them.
Will it be possible to make changes to the Security Group(s) attached to the Machine Pool, i.e., remove an attached SG or add a new SG to the machine pool? Answer: The HyperShift project does not allow changing the SGs without restarting the existing nodes, as defined by MaxUnavailable.

The above requirements will be addressed by the linked HOSTEDCP EPIC so that the OCM API and then by extension clients of OCM API can use these capabilities.

Out of Scope

Adding additional Security Groups at the time of cluster creation
Adding additional SG's to cluster's VPCE
Updating list of additional SG-IDs on existing machine pool.
Making changes/patches to default security group created by the cluster

Customer Considerations

1. Early customers will need this by 03/15, especially the API to attach 10 additional SG-IDs when creating a machine pool.

Documentation Considerations

Prerequisites section where Security Group is referenced must be updated to call out the default ports that will be enabled for an HCP cluster. Please note this will be different from OCP or ROSA Classic.
Managing Nodes through Machinepool section must be updated to call out the differences between adding Additional SG to ROSA Classic and ROSA HCP.
There are no control plane or infrastructure nodes in ROSA HCP but there will be a default SG for the VPC Endpoint that allows clients to access cluster's API Server from within the VPC.

Interoperability Considerations

1. Scale to Zero (SD-ADR-030) will impact this feature especially creating a node pool during HCP cluster creation.

References:

Epic HOSTEDCP-1433: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story HOSTEDCP-1419: Always include default security group in ROSA clusters

View the Description View the linked PRs

Currently the behavior is to only use the default security group if no security groups were specified for a NodePool. This makes it difficult to implement additional security groups in ROSA because there is no way to know the default security group on cluster creation. By always appending the default security group, any security groups specified on the NodePool become additional security groups.
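
The behavior change can be sketched as follows. The actual change lives in HyperShift (see the linked PR); the function `effective_security_groups` and the example IDs are hypothetical and only illustrate the "always append the default" rule described above.

```python
def effective_security_groups(nodepool_sgs, default_sg):
    """Security groups applied to a NodePool: those specified plus the default.

    The default security group is always appended, so anything specified on the
    NodePool acts as an additional security group. Duplicates are dropped.
    """
    result = []
    for sg in list(nodepool_sgs) + [default_sg]:
        if sg not in result:
            result.append(sg)
    return result


# Hypothetical IDs, for illustration only.
no_extra = effective_security_groups([], "sg-default")
with_extra = effective_security_groups(["sg-a", "sg-b"], "sg-default")
```

Under the previous behavior, `with_extra` would have contained only the two specified groups; with the default always appended, the caller does not need to know the default group's ID at cluster creation.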

https://github.com/openshift/hypershift/pull/3527

Feature XCMSTRAT-480: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic SDE-3219: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story HOSTEDCP-1207: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/hypershift/pull/3034

Feature XCMSTRAT-567: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic HOSTEDCP-929: Allow configuration of IMDSv2/v1 in HyperShift

View the Description View the linked PRs

User Story

As a ROSA HyperShift customer I want to enforce that IMDSv2 is always the default, to ensure that I have the most secure setting by default.

Acceptance Criteria

IMDSv2 should be configured in HyperShift clusters by default
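
At the EC2 level, the IMDS version is selected through the instance metadata option `HttpTokens`: "required" enforces IMDSv2, while "optional" also permits IMDSv1. The sketch below maps a setting to that option with IMDSv2 as the default. The function name and the "v1"/"v2" strings are hypothetical and are not the HyperShift API.

```python
def metadata_options(imds_version: str = "v2") -> dict:
    """Build EC2 metadata options for the requested IMDS setting (default: IMDSv2)."""
    if imds_version == "v2":
        # Session tokens are mandatory, so only IMDSv2 requests are accepted.
        return {"HttpTokens": "required"}
    if imds_version == "v1":
        # Tokens are optional, so both IMDSv1 and IMDSv2 requests are accepted.
        return {"HttpTokens": "optional"}
    raise ValueError(f"unsupported IMDS setting: {imds_version!r}")


default_options = metadata_options()
```

Calling `metadata_options()` with no argument yields the IMDSv2-only setting, matching the acceptance criterion above.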

Default Done Criteria

All existing/affected SOPs have been updated.
New SOPs have been written.
Internal training has been developed and delivered.
The feature has both unit and end to end tests passing in all test pipelines and through upgrades.
If the feature requires QE involvement, QE has signed off.
The feature exposes metrics necessary to manage it (VALET/RED).
The feature has had a security review.
Contract impact assessment.
Service Definition is updated if needed.
Documentation is complete.
Product Manager signed off on staging/beta implementation.

https://github.com/openshift/hypershift/pull/3584

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

Bug MGMT-16966: Problem creating extra partition on main disk in 4.15+

View the Description View the linked PRs

Description of the problem:

It is impossible to create an extra partition on the main disk at installation time with OCP 4.15. It works perfectly with 4.14 and earlier.

I supply a custom MachineConfig manifest to do so. During installation, after the reboot, the screen is blank and the host has no networking (no route to host).

A Slack thread explaining the issue, with further debugging, can be consulted at https://redhat-internal.slack.com/archives/C999USB0D/p1707991107757299

The bug seems to have been introduced in https://github.com/openshift/assisted-installer/pull/713, which allows for one less reboot at installation time; to do that, it implements part of the post-reboot code. This code runs BEFORE the extra partition is created, which causes the problem.

How reproducible:

Always

Steps to reproduce:

1. Create a 4.15 cluster with an extra manifest that creates an extra partition at the end of the main disk

Example of machineconfig (change device to match installation disk):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 98-extra-partition
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      disks:
        - device: /dev/vda
          partitions:
            - label: javi
              startMiB: 110000 # space left for the CoreOS partition.
              sizeMiB: 0 # Use all available space

2. Proceed with the installation
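
As the example in step 1 notes, the device has to match the installation disk. A minimal Python sketch of generating the same manifest for a given device is shown here. The function name `extra_partition_machineconfig` and its parameters are hypothetical and exist only for illustration; the output is JSON, which is also valid YAML.

```python
import json


def extra_partition_machineconfig(device, label, start_mib, role="master"):
    """Build a MachineConfig dict mirroring the example manifest in step 1.

    Hypothetical helper for illustration; not part of any OpenShift tool.
    """
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "labels": {"machineconfiguration.openshift.io/role": role},
            "name": "98-extra-partition",
        },
        "spec": {
            "config": {
                "ignition": {"version": "3.2.0"},
                "storage": {
                    "disks": [
                        {
                            "device": device,
                            "partitions": [
                                {
                                    "label": label,
                                    # Space left for the CoreOS partition.
                                    "startMiB": start_mib,
                                    # 0 means: use all available space.
                                    "sizeMiB": 0,
                                }
                            ],
                        }
                    ]
                },
            }
        },
    }


if __name__ == "__main__":
    manifest = extra_partition_machineconfig("/dev/vda", "javi", 110000)
    print(json.dumps(manifest, indent=2))
```

Only the device, label, start offset, and role vary; everything else is copied from the manifest in the report.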

Actual results:

After reboot, node never comes back up

Expected results:

Cluster installs without problem

https://github.com/openshift/assisted-installer/pull/787

Bug OCPBUGS-29441: [4.16] Bootimage bump tracker

Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.

The previous bump was ~~OCPBUGS-29384~~.

https://github.com/openshift/installer/pull/8015

Bug OCPBUGS-23299: the checkbox should be displayed on a single row

Description of problem:

Each checkbox should be displayed on a single row, e.g.:
for 'Deny all ingress traffic' & 'Deny all egress traffic' on the Create NetworkPolicy page
for 'Secure Route' on the Create route page

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-11-14-082209

How reproducible:

Always

Steps to Reproduce:

1. Go to the Networking -> NetworkPolicies page and click the 'Create NetworkPolicy' button
2. In the Policy type section, check whether the checkboxes for 'Deny all ingress traffic' & 'Deny all egress traffic' are each displayed on a single row
3. Check the same thing on the 'Create route' page

Actual results:

not in a single row

Expected results:

in a single row

Additional info:

https://drive.google.com/file/d/1xgEe-CuuRYrY9tBFmIa-7o5Rcn7iCr1e/view?usp=drive_link

https://github.com/openshift/console/pull/13397

Bug OCPBUGS-18939: The apiservers shouldn't expose both SLO and SLI latency metrics to Prometheus

All the apiservers:

kube-apiserver
openshift-apiserver
oauth-apiserver

Expose both `apiserver_request_slo_duration_seconds` and `apiserver_request_sli_duration_seconds`. The SLI metric was introduced in Kubernetes 1.26 as a replacement for `apiserver_request_slo_duration_seconds`, which was deprecated in Kubernetes 1.27. This change is only a renaming, so both metrics expose the same data. To avoid storing duplicated data in Prometheus, we need to drop `apiserver_request_slo_duration_seconds` in favor of `apiserver_request_sli_duration_seconds`.
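
The intended effect can be illustrated with a small Python sketch that filters scraped series by metric name. This is only an illustration, under the assumption that each series is a dict carrying its name in a `__name__` label; the helper `drop_deprecated_slo` is hypothetical, and in a cluster the drop would presumably be expressed in the scrape configuration rather than in code.

```python
# Metric name taken from the description above.
DEPRECATED_PREFIX = "apiserver_request_slo_duration_seconds"


def drop_deprecated_slo(series):
    """Return the series whose name does not belong to the deprecated metric.

    A prefix match is used so that suffixed series of the same metric
    (for example "_bucket", "_sum", "_count") are dropped as well.
    """
    return [s for s in series if not s["__name__"].startswith(DEPRECATED_PREFIX)]


if __name__ == "__main__":
    scraped = [
        {"__name__": "apiserver_request_slo_duration_seconds_bucket", "le": "0.1"},
        {"__name__": "apiserver_request_sli_duration_seconds_bucket", "le": "0.1"},
    ]
    for kept in drop_deprecated_slo(scraped):
        print(kept["__name__"])
```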

Bug OCPBUGS-24638: Tuned Profiles going degraded due to the extra net.core.rps_default_mask configuration in openshift-node-performance-xxx-profile

Description of problem:
Issue: profiles are degraded [1] even after being applied, due to the error [2] below:

[1]

$ oc get profile -A
NAMESPACE                                NAME       TUNED                APPLIED   DEGRADED   AGE
openshift-cluster-node-tuning-operator   master0    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   master1    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   master2    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   worker0    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker1    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker10   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker11   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker12   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker13   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker14   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker15   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker2    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker3    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker4    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker5    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker6    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker7    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker8    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker9    rdpmc-patch-worker   True      True       5d

[2]

  lastTransitionTime: "2023-12-05T22:43:12Z"
    message: TuneD daemon issued one or more sysctl override message(s) during profile
      application. Use reapply_sysctl=true or remove conflicting sysctl net.core.rps_default_mask
    reason: TunedSysctlOverride
    status: "True"
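
To pick out only the degraded profiles from output like [1], a short Python sketch can parse the table. It assumes the six-column layout shown in [1] with whitespace-separated fields; `degraded_profiles` is a hypothetical helper, not part of `oc`, and the second sample row is invented purely to show a non-degraded case.

```python
SAMPLE = """\
NAMESPACE                                NAME       TUNED                APPLIED   DEGRADED   AGE
openshift-cluster-node-tuning-operator   master0    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   worker0    rdpmc-patch-worker   True      False      5d
"""  # The worker0 row is a hypothetical healthy profile.


def degraded_profiles(table_text):
    """Return (name, tuned) pairs for rows whose DEGRADED column is True."""
    rows = [line.split() for line in table_text.strip().splitlines()]
    header, body = rows[0], rows[1:]
    name_i = header.index("NAME")
    tuned_i = header.index("TUNED")
    degraded_i = header.index("DEGRADED")
    return [(r[name_i], r[tuned_i]) for r in body if r[degraded_i] == "True"]


if __name__ == "__main__":
    for name, tuned in degraded_profiles(SAMPLE):
        print(name, tuned)
```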

Looking at the profiles that use the rdpmc-patch-master tuned:

NAMESPACE                                NAME       TUNED                APPLIED   DEGRADED   AGE
openshift-cluster-node-tuning-operator   master0    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   master1    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   master2    rdpmc-patch-master   True      True       5d

The rdpmc-patch-master tuned is configured as follows:

$ oc get tuned rdpmc-patch-master -n openshift-cluster-node-tuning-operator -oyaml |less
spec:
  profile:
  - data: |
      [main]
      include=performance-patch-master
      [sysfs]
      /sys/devices/cpu/rdpmc = 2
    name: rdpmc-patch-master
  recommend:

The performance-patch-master tuned, which is included by the tuned above, contains:

spec:
  profile:
  - data: |
      [main]
      summary=Custom tuned profile to adjust performance
      include=openshift-node-performance-master-profile
      [bootloader]
      cmdline_removeKernelArgs=-nohz_full=${isolated_cores}

The openshift-node-performance-master-profile, which is included by the tuned above, sets the sysctl that appears in the error:

net.core.rps_default_mask=${not_isolated_cpumask}

A RHEL bug has been raised for the same issue: https://issues.redhat.com/browse/RHEL-18972

Version-Release number of selected component (if applicable):

4.14

https://github.com/openshift/cluster-node-tuning-operator/pull/868

Bug OCPBUGS-25045: Update 4.16 openshift-enterprise-tests-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/origin/pull/28452

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/origin/pull/28452

Bug OCPBUGS-25149: Update 4.16 ose-cluster-csi-snapshot-controller-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/179

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/179

Bug OCPBUGS-25835: Installer should validate that 'baremetal' capability is enabled for baremetal platform

If the user specifies baselineCapabilitySet: None in the install-config and does not specifically enable the baremetal capability, yet still uses platform: baremetal, then the install will reliably fail.

This failure takes the form of a timeout with the bootkube logs (not easily accessible to the user) full of errors like:

bootkube.sh[46065]: "99_baremetal-provisioning-config.yaml": unable to get REST mapping for "99_baremetal-provisioning-config.yaml": no matches for kind "Provisioning" in version "metal3.io/v1alpha1"
bootkube.sh[46065]: "99_openshift-cluster-api_hosts-0.yaml": unable to get REST mapping for "99_openshift-cluster-api_hosts-0.yaml": no matches for kind "BareMetalHost" in version "metal3.io/v1alpha1"

Since the installer can tell when processing the install-config if the baremetal capability is missing, we should detect this and error out immediately to save the user an hour of their life and us a support case.

Although this was found on an agent install, I believe the same will apply to a baremetal IPI install.
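
The proposed check can be sketched in Python as follows. This is an illustration of the logic only, not the installer's implementation (the installer is written in Go). It assumes the install-config has already been parsed into a dict, and that the extra capabilities are listed under `capabilities.additionalEnabledCapabilities`; the function name `validate_baremetal_capability` is hypothetical.

```python
def validate_baremetal_capability(install_config):
    """Return a list of error strings; an empty list means the check passed.

    Hypothetical sketch of the validation requested in this bug.
    """
    platform = install_config.get("platform") or {}
    if "baremetal" not in platform:
        return []  # The check only applies to the baremetal platform.

    caps = install_config.get("capabilities") or {}
    baseline = caps.get("baselineCapabilitySet")
    extra = caps.get("additionalEnabledCapabilities") or []

    # Only the "None" baseline is handled here. Other baseline sets are
    # assumed to include the capability, which this sketch does not verify.
    if baseline == "None" and "baremetal" not in extra:
        return [
            "platform baremetal requires the baremetal capability: add it to "
            "capabilities.additionalEnabledCapabilities"
        ]
    return []


if __name__ == "__main__":
    config = {
        "platform": {"baremetal": {}},
        "capabilities": {"baselineCapabilitySet": "None"},
    }
    for error in validate_baremetal_capability(config):
        print(error)
```

Failing at this point, while the install-config is being processed, surfaces the problem immediately instead of after a bootstrap timeout.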

https://github.com/openshift/installer/pull/7901

Bug OCPBUGS-22749: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/13568

Bug OCPBUGS-24790: Update 4.16 ironic-static-ip-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ironic-static-ip-manager/pull/41

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ironic-static-ip-manager/pull/41

Bug OCPBUGS-24942: Update 4.16 operator-registry-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/630

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/operator-framework-olm/pull/630

Bug OCPBUGS-25002: Update 4.16 ose-aws-pod-identity-webhook-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/aws-pod-identity-webhook/pull/180

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Story CONSOLE-3909: i18n upload/download routine task

This story tracks the i18n upload/download routine tasks, which are performed every sprint.

A.C.

- Upload strings to Memsource at the start of the sprint and reach out to the localization team

- Download translated strings from Memsource when it is ready

- Review the translated strings and open a pull request

- Open a followup story for next sprint

https://github.com/openshift/console/pull/13519

Task HOSTEDCP-1314: e2e to make sure that e2e clusters uses NLB

Modify e2e to ensure all HCs use NLB for the ingress controller.
Modify the private cluster e2e to ensure it uses NLB with scope internal for the ingress controller.
This is to mimic what we do in ROSA:
https://redhat-internal.slack.com/archives/C01C8502FMM/p1701254572808179

https://github.com/openshift/hypershift/pull/3293

Bug OCPBUGS-26492: Operation cannot be fulfilled on networks.operator.openshift.io during OVN live migration

Description of problem:

Operation cannot be fulfilled on networks.operator.openshift.io during OVN live migration

Version-Release number of selected component (if applicable):

How reproducible:

Not always

Steps to Reproduce:

1. Enable the features egressfirewall, externalIP, multicast, multus, network-policy, and service-idle.
2. Start migrating the cluster from SDN to OVN

Actual results:

[weliang@weliang ~]$ oc delete validatingwebhookconfigurations.admissionregistration.k8s.io/sre-techpreviewnoupgrade-validation
validatingwebhookconfiguration.admissionregistration.k8s.io "sre-techpreviewnoupgrade-validation" deleted
[weliang@weliang ~]$ oc edit featuregate cluster
featuregate.config.openshift.io/cluster edited
[weliang@weliang ~]$ oc get node
NAME                          STATUS   ROLES                  AGE   VERSION
ip-10-0-20-154.ec2.internal   Ready    control-plane,master   86m   v1.28.5+9605db4
ip-10-0-45-93.ec2.internal    Ready    worker                 80m   v1.28.5+9605db4
ip-10-0-49-245.ec2.internal   Ready    worker                 74m   v1.28.5+9605db4
ip-10-0-57-37.ec2.internal    Ready    infra,worker           60m   v1.28.5+9605db4
ip-10-0-60-0.ec2.internal     Ready    infra,worker           60m   v1.28.5+9605db4
ip-10-0-62-121.ec2.internal   Ready    control-plane,master   86m   v1.28.5+9605db4
ip-10-0-62-56.ec2.internal    Ready    control-plane,master   86m   v1.28.5+9605db4
[weliang@weliang ~]$ for f in $(oc get nodes -o jsonpath='{.items[*].metadata.name}') ; do oc debug node/"${f}" --  chroot /host cat /etc/kubernetes/kubelet.conf | grep NetworkLiveMigration ; done
Starting pod/ip-10-0-20-154ec2internal-debug-9wvd8 ...
To use host binaries, run `chroot /host`Removing debug pod ...
    "NetworkLiveMigration": true,
Starting pod/ip-10-0-45-93ec2internal-debug-rwvls ...
To use host binaries, run `chroot /host`
    "NetworkLiveMigration": true,Removing debug pod ...
Starting pod/ip-10-0-49-245ec2internal-debug-rp9dt ...
To use host binaries, run `chroot /host`Removing debug pod ...
    "NetworkLiveMigration": true,
Starting pod/ip-10-0-57-37ec2internal-debug-q5thk ...
To use host binaries, run `chroot /host`Removing debug pod ...
    "NetworkLiveMigration": true,
Starting pod/ip-10-0-60-0ec2internal-debug-zp78h ...
To use host binaries, run `chroot /host`Removing debug pod ...
    "NetworkLiveMigration": true,
Starting pod/ip-10-0-62-121ec2internal-debug-42k2g ...
To use host binaries, run `chroot /host`Removing debug pod ...
    "NetworkLiveMigration": true,
Starting pod/ip-10-0-62-56ec2internal-debug-s99ls ...
To use host binaries, run `chroot /host`Removing debug pod ...
    "NetworkLiveMigration": true,
[weliang@weliang ~]$ oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/live-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
network.config.openshift.io/cluster patched
[weliang@weliang ~]$ 
[weliang@weliang ~]$ oc get co network
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
network   4.15.0-0.nightly-2024-01-06-062415   True        False         True       4h1m    Internal error while updating operator configuration: could not apply (/, Kind=) /cluster, err: failed to apply / update (operator.openshift.io/v1, Kind=Network) /cluster: Operation cannot be fulfilled on networks.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again
[weliang@weliang ~]$ oc get node
NAME                          STATUS   ROLES                  AGE     VERSION
ip-10-0-2-52.ec2.internal     Ready    worker                 3h54m   v1.28.5+9605db4
ip-10-0-26-16.ec2.internal    Ready    control-plane,master   4h2m    v1.28.5+9605db4
ip-10-0-32-116.ec2.internal   Ready    worker                 3h54m   v1.28.5+9605db4
ip-10-0-32-67.ec2.internal    Ready    infra,worker           3h38m   v1.28.5+9605db4
ip-10-0-35-11.ec2.internal    Ready    infra,worker           3h39m   v1.28.5+9605db4
ip-10-0-39-125.ec2.internal   Ready    control-plane,master   4h2m    v1.28.5+9605db4
ip-10-0-6-117.ec2.internal    Ready    control-plane,master   4h2m    v1.28.5+9605db4
[weliang@weliang ~]$ oc get Network.operator.openshift.io/cluster -o json
{
    "apiVersion": "operator.openshift.io/v1",
    "kind": "Network",
    "metadata": {
        "creationTimestamp": "2024-01-08T13:28:07Z",
        "generation": 417,
        "name": "cluster",
        "resourceVersion": "236888",
        "uid": "37fb36f0-c13c-476d-aea1-6ebc1c87abe8"
    },
    "spec": {
        "clusterNetwork": [
            {
                "cidr": "10.128.0.0/14",
                "hostPrefix": 23
            }
        ],
        "defaultNetwork": {
            "openshiftSDNConfig": {
                "enableUnidling": true,
                "mode": "NetworkPolicy",
                "mtu": 8951,
                "vxlanPort": 4789
            },
            "ovnKubernetesConfig": {
                "egressIPConfig": {},
                "gatewayConfig": {
                    "ipv4": {},
                    "ipv6": {},
                    "routingViaHost": false
                },
                "genevePort": 6081,
                "mtu": 8901,
                "policyAuditConfig": {
                    "destination": "null",
                    "maxFileSize": 50,
                    "maxLogFiles": 5,
                    "rateLimit": 20,
                    "syslogFacility": "local0"
                }
            },
            "type": "OVNKubernetes"
        },
        "deployKubeProxy": false,
        "disableMultiNetwork": false,
        "disableNetworkDiagnostics": false,
        "kubeProxyConfig": {
            "bindAddress": "0.0.0.0"
        },
        "logLevel": "Normal",
        "managementState": "Managed",
        "migration": {
            "mode": "Live",
            "networkType": "OVNKubernetes"
        },
        "observedConfig": null,
        "operatorLogLevel": "Normal",
        "serviceNetwork": [
            "172.30.0.0/16"
        ],
        "unsupportedConfigOverrides": null,
        "useMultiNetworkPolicy": false
    },
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2024-01-08T13:28:07Z",
                "status": "False",
                "type": "ManagementStateDegraded"
            },
            {
                "lastTransitionTime": "2024-01-08T17:29:52Z",
                "status": "False",
                "type": "Degraded"
            },
            {
                "lastTransitionTime": "2024-01-08T13:28:07Z",
                "status": "True",
                "type": "Upgradeable"
            },
            {
                "lastTransitionTime": "2024-01-08T17:26:38Z",
                "status": "False",
                "type": "Progressing"
            },
            {
                "lastTransitionTime": "2024-01-08T13:28:20Z",
                "status": "True",
                "type": "Available"
            }
        ],
        "readyReplicas": 0,
        "version": "4.15.0-0.nightly-2024-01-06-062415"
    }
}
[weliang@weliang ~]$

Expected results:

OVN live migration passes
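
The message reported by `oc get co network` is an optimistic-concurrency conflict: the update was sent against a version of the object that had since been modified. A common way for a client to handle this is to re-read the object and retry. The Python sketch below illustrates that pattern against an in-memory stand-in. `FakeStore`, `Conflict`, and `update_with_retry` are hypothetical names, and this is not a description of the fix in the linked PR.

```python
class Conflict(Exception):
    """Raised when an update carries a stale resource version."""


class FakeStore:
    """In-memory stand-in for an API object guarded by a resource version."""

    def __init__(self, spec):
        self.version = 1
        self.spec = dict(spec)

    def get(self):
        return self.version, dict(self.spec)

    def update(self, version, spec):
        if version != self.version:
            raise Conflict("the object has been modified")
        self.spec = dict(spec)
        self.version += 1


def update_with_retry(store, mutate, attempts=5):
    """Apply mutate() to a fresh copy of the spec, retrying on conflict."""
    for _ in range(attempts):
        version, spec = store.get()  # Always start from the latest version.
        mutate(spec)
        try:
            store.update(version, spec)
            return True
        except Conflict:
            continue  # Someone else wrote first; re-read and try again.
    return False


if __name__ == "__main__":
    store = FakeStore({"networkType": "OpenShiftSDN"})
    update_with_retry(store, lambda s: s.update(networkType="OVNKubernetes"))
    print(store.spec)
```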

Additional info:

must-gather: https://people.redhat.com/~weliang/must-gather1.tar.gz

https://github.com/openshift/cluster-network-operator/pull/2192

Bug OCPBUGS-25004: Update 4.16 ose-apiserver-network-proxy-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/apiserver-network-proxy/pull/46

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/apiserver-network-proxy/pull/46

Bug OCPBUGS-30242: 4.15 Control plane won't allow the creation of a 4.14 and lower node pool

Description of problem:

    4.15 control plane can't create a 4.14 node pool due to an issue with payload

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1. Create a Hosted Cluster in 4.15
    2. Create a Node Pool in 4.14
    3. Node pool stuck in provisioning

Actual results:

    No node pool is created

Expected results:

    The node pool is created, since N-2 versions are supported there
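
The N-2 expectation can be written down as a small Python check. This is a sketch under stated assumptions: versions are compared by major and minor number only, a node pool newer than the control plane is treated as unsupported (an assumption not stated in this report), and `nodepool_version_supported` is a hypothetical name.

```python
def parse_major_minor(version):
    """Extract (major, minor) from strings such as '4.15.0' or '4.14'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def nodepool_version_supported(control_plane, node_pool, max_skew=2):
    """Return True if the node pool is at most max_skew minors behind."""
    cp_major, cp_minor = parse_major_minor(control_plane)
    np_major, np_minor = parse_major_minor(node_pool)
    if cp_major != np_major:
        return False
    # Assumption: a node pool newer than the control plane is not allowed.
    return 0 <= cp_minor - np_minor <= max_skew


if __name__ == "__main__":
    print(nodepool_version_supported("4.15.0", "4.14.6"))
```

Under this rule the 4.15 control plane with a 4.14 node pool from the reproduction steps is a supported combination.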

Additional info:

Possibly linked to OCPBUGS-26757

Bug OCPBUGS-30279: Do imports on imagestreams respect ImageTagMirrorSet?

Not sure which component this bug should be associated with.

I am not even sure if importing respects ImageTagMirrorSet.

We could not figure it out in the Slack conversation:

https://redhat-internal.slack.com/archives/C013VBYBJQH/p1709583648013199

Description of problem:

The expected behaviour of ImageTagMirrorSet, redirecting pulls from the proxy to quay.io, did not take effect.

Version-Release number of selected component (if applicable):

oc --context build02 get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-ec.3   True        False         7d4h

Steps to Reproduce:

oc --context build02 get ImageTagMirrorSet quay-proxy -o yaml
apiVersion: config.openshift.io/v1
kind: ImageTagMirrorSet
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"config.openshift.io/v1","kind":"ImageTagMirrorSet","metadata":{"annotations":{},"name":"quay-proxy"},"spec":{"imageTagMirrors":[{"mirrors":["quay.io/openshift/ci"],"source":"quay-proxy.ci.openshift.org/openshift/ci"}]}}
  creationTimestamp: "2024-03-05T03:49:59Z"
  generation: 1
  name: quay-proxy
  resourceVersion: "4895378740"
  uid: 69fb479e-85bd-4a16-a38f-29b08f2636c3
spec:
  imageTagMirrors:
  - mirrors:
    - quay.io/openshift/ci
    source: quay-proxy.ci.openshift.org/openshift/ci


oc --context build02 tag --source docker quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest hongkliu-test/proxy-test-2:011 --as system:admin
Tag proxy-test-2:011 set to quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest.

oc --context build02 get is proxy-test-2 -o yaml
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  annotations:
    openshift.io/image.dockerRepositoryCheck: "2024-03-05T20:03:02Z"
  creationTimestamp: "2024-03-05T20:03:02Z"
  generation: 2
  name: proxy-test-2
  namespace: hongkliu-test
  resourceVersion: "4898915153"
  uid: f60b3142-1f5f-42ae-a936-a9595e794c05
spec:
  lookupPolicy:
    local: false
  tags:
  - annotations: null
    from:
      kind: DockerImage
      name: quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest
    generation: 2
    importPolicy:
      importMode: Legacy
    name: "011"
    referencePolicy:
      type: Source
status:
  dockerImageRepository: image-registry.openshift-image-registry.svc:5000/hongkliu-test/proxy-test-2
  publicDockerImageRepository: registry.build02.ci.openshift.org/hongkliu-test/proxy-test-2
  tags:
  - conditions:
    - generation: 2
      lastTransitionTime: "2024-03-05T20:03:02Z"
      message: 'Internal error occurred: quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest:
        Get "https://quay-proxy.ci.openshift.org/v2/": EOF'
      reason: InternalError
      status: "False"
      type: ImportSuccess
    items: null
    tag: "011"

Actual results:

The status of the stream shows that it still tries to connect to quay-proxy.

Expected results:

The request goes to quay.io directly.
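
The expected redirection can be sketched in Python as a rewrite of the image reference using the `imageTagMirrors` entries from the ImageTagMirrorSet above. This is an illustration of the expectation, not of how the import code resolves mirrors. It assumes mirrors are tried before the source and that falling back to the source is allowed; `candidate_pull_locations` is a hypothetical name.

```python
def candidate_pull_locations(image, tag_mirrors):
    """Return mirror references first, followed by the original reference."""
    candidates = []
    for entry in tag_mirrors:
        source = entry["source"]
        # Match the source repository exactly, or as a prefix that ends at
        # a tag separator or a path separator.
        if (
            image == source
            or image.startswith(source + ":")
            or image.startswith(source + "/")
        ):
            rest = image[len(source):]
            candidates.extend(mirror + rest for mirror in entry.get("mirrors", []))
    # Assumption: contacting the original source is still allowed as a fallback.
    candidates.append(image)
    return candidates


if __name__ == "__main__":
    tag_mirrors = [
        {
            "source": "quay-proxy.ci.openshift.org/openshift/ci",
            "mirrors": ["quay.io/openshift/ci"],
        }
    ]
    image = "quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest"
    for location in candidate_pull_locations(image, tag_mirrors):
        print(location)
```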

Additional info:

The proxy has been shut down completely just to simplify the case. When it was on, its access logs showed the proxy getting the requests for the image.
oc scale deployment qci-appci -n ci --replicas 0
deployment.apps/qci-appci scaled

I also checked the pull secret in the namespace and it has correct pull credentials to both proxy and quay.io.

https://github.com/openshift/openshift-apiserver/pull/419

Story TRT-1370: Break etcd leadership intervals out of pod logs section in chart

Today these are handled by isPodLog in the JavaScript; we'd like them in their own section, preferably charted very close to the node update section.

https://github.com/openshift/origin/pull/28441

Bug MGMT-16373: Minimal ISO s390x architecture - there is no URL to change to Full ISO (type=full-iso), the discovery ISO fails to create

OCP 4.14.5
multicluster-engine.v2.4.1
advanced-cluster-management.v2.9.0

Attempt to create a spoke cluster:

apiVersion: extensions.hive.openshift.io/v1beta1
kind: AgentClusterInstall
metadata:
  creationTimestamp: "2023-12-08T16:59:25Z"
  finalizers:
  - agentclusterinstall.agent-install.openshift.io/ai-deprovision
  generation: 1
  name: infraenv-spoke
  namespace: infraenv-spoke
  ownerReferences:
  - apiVersion: hive.openshift.io/v1
    kind: ClusterDeployment
    name: infraenv-spoke
    uid: 34f1fe43-2af2-4880-b4ca-fb9ab8df13df
  resourceVersion: "3468594"
  uid: 79a42bdf-db1f-4500-b689-8b3813bd27a6
spec:
  clusterDeploymentRef:
    name: infraenv-spoke
  imageSetRef:
    name: 4.14-test
  networking:
    clusterNetwork:
    - cidr: 10.128.0.0/14
      hostPrefix: 23
    serviceNetwork:
    - 172.30.0.0/16
    userManagedNetworking: true
  provisionRequirements:
    controlPlaneAgents: 3
    workerAgents: 2
status:
  conditions:
  - lastProbeTime: "2023-12-08T16:59:30Z"
    lastTransitionTime: "2023-12-08T16:59:30Z"
    message: SyncOK
    reason: SyncOK
    status: "True"
    type: SpecSynced
  - lastProbeTime: "2023-12-08T16:59:30Z"
    lastTransitionTime: "2023-12-08T16:59:30Z"
    message: The cluster is not ready to begin the installation
    reason: ClusterNotReady
    status: "False"
    type: RequirementsMet
  - lastProbeTime: "2023-12-08T16:59:30Z"
    lastTransitionTime: "2023-12-08T16:59:30Z"
    message: 'The cluster''s validations are failing: '
    reason: ValidationsFailing
    status: "False"
    type: Validated
  - lastProbeTime: "2023-12-08T16:59:30Z"
    lastTransitionTime: "2023-12-08T16:59:30Z"
    message: The installation has not yet started
    reason: InstallationNotStarted
    status: "False"
    type: Completed
  - lastProbeTime: "2023-12-08T16:59:30Z"
    lastTransitionTime: "2023-12-08T16:59:30Z"
    message: The installation has not failed
    reason: InstallationNotFailed
    status: "False"
    type: Failed
  - lastProbeTime: "2023-12-08T16:59:30Z"
    lastTransitionTime: "2023-12-08T16:59:30Z"
    message: The installation is waiting to start or in progress
    reason: InstallationNotStopped
    status: "False"
    type: Stopped
  debugInfo:
    eventsURL: https://assisted-service-rhacm.apps.sno-0.qe.lab.redhat.com/api/assisted-install/v2/events?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiZjU4MjVmMTctNTg0OS00OTljLWE1NDctNjJmMDc4ZDU3MDJiIn0.qpSeZuqLwZ3cr3qn6AZo665o1ANp45YVE6IWUv7Gdn1RmapG4HZaxsUUY4iswkRMiqIfka_pLHFnBeVzXSTbrg&cluster_id=f5825f17-5849-499c-a547-62f078d5702b
    logsURL: ""
    state: insufficient
    stateInfo: Cluster is not ready for install
  platformType: None
  progress:
    totalPercentage: 0
  userManagedNetworking: true

apiVersion: agent-install.openshift.io/v1beta1
kind: InfraEnv
metadata:
  creationTimestamp: "2023-12-08T16:59:26Z"
  finalizers:
  - infraenv.agent-install.openshift.io/ai-deprovision
  generation: 1
  name: infraenv-spoke
  namespace: infraenv-spoke
  resourceVersion: "3468794"
  uid: 6254bbb3-5531-4665-bb78-f073b439b023
spec:
  clusterRef:
    name: infraenv-spoke
    namespace: infraenv-spoke
  cpuArchitecture: s390x
  ipxeScriptType: ""
  nmStateConfigLabelSelector: {}
  pullSecretRef:
    name: infraenv-spoke-pull-secret
status:
  agentLabelSelector:
    matchLabels:
      infraenvs.agent-install.openshift.io: infraenv-spoke
  bootArtifacts:
    initrd: ""
    ipxeScript: ""
    kernel: ""
    rootfs: ""
  conditions:
  - lastTransitionTime: "2023-12-08T16:59:51Z"
    message: 'Failed to create image: cannot use Minimal ISO because it''s not compatible
      with the s390x architecture on version 4.14.6 of OpenShift'
    reason: ImageCreationError
    status: "False"
    type: ImageCreated
  debugInfo:
    eventsURL: ""

oc get clusterimagesets.hive.openshift.io 4.14-test -o yaml
apiVersion: hive.openshift.io/v1
kind: ClusterImageSet
metadata:
  creationTimestamp: "2023-12-08T18:11:29Z"
  generation: 1
  name: 4.14-test
  resourceVersion: "3514589"
  uid: 32e2ba8d-6bb7-4e4b-b3a5-63fa8224d144
spec:
  releaseImage: registry.ci.openshift.org/ocp-s390x/release-s390x@sha256:f024a617c059bf2cbf4a669c2a19ab4129e78a007c6863b64dd73a413c0bdf46


oc get agentserviceconfigs.agent-install.openshift.io agent -o yaml
apiVersion: agent-install.openshift.io/v1beta1
kind: AgentServiceConfig
metadata:
  creationTimestamp: "2023-12-08T18:10:42Z"
  finalizers:
  - agentserviceconfig.agent-install.openshift.io/ai-deprovision
  generation: 1
  name: agent
  resourceVersion: "3514534"
  uid: ef204896-25f1-4ff3-ae60-c80c2f45cd30
spec:
  databaseStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
  filesystemStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 20Gi
  imageStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
  mirrorRegistryRef:
    name: mirror-registry-ca
  osImages:
  - cpuArchitecture: x86_64
    openshiftVersion: "4.14"
    rootFSUrl: ""
    url: https://mirror.openshift.com/pub/openshift-v4/amd64/dependencies/rhcos/4.14/latest/rhcos-live.x86_64.iso
    version: "4.14"
  - cpuArchitecture: arm64
    openshiftVersion: "4.14"
    rootFSUrl: ""
    url: https://mirror.openshift.com/pub/openshift-v4/aarch64/dependencies/rhcos/4.14/latest/rhcos-live.aarch64.iso
    version: "4.14"
  - cpuArchitecture: ppc64le
    openshiftVersion: "4.14"
    rootFSUrl: ""
    url: https://mirror.openshift.com/pub/openshift-v4/ppc64le/dependencies/rhcos/4.14/latest/rhcos-live.ppc64le.iso
    version: "4.14"
  - cpuArchitecture: s390x
    openshiftVersion: "4.14"
    rootFSUrl: ""
    url: https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/4.14/latest/rhcos-live.s390x.iso
    version: "4.14"
status:
  conditions:
  - lastTransitionTime: "2023-12-08T18:10:42Z"
    message: AgentServiceConfig reconcile completed without error.
    reason: ReconcileSucceeded
    status: "True"
    type: ReconcileCompleted
  - lastTransitionTime: "2023-12-08T18:11:23Z"
    message: All the deployments managed by Infrastructure-operator are healthy.
    reason: DeploymentSucceeded
    status: "True"
    type: DeploymentsHealthy

https://github.com/openshift/assisted-service/pull/5825

Bug OCPBUGS-24788: Update 4.16 ironic-rhcos-downloader-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/94

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ironic-rhcos-downloader/pull/94

Bug OCPBUGS-25533: Update 4.16 ose-cloud-credential-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cloud-credential-operator/pull/642

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-credential-operator/pull/642

Bug OCPBUGS-29068: [Custom DNS] installer should skip DNS zone validation

Description of problem:

A user may provide a DNS domain that is hosted outside GCP. Once custom DNS is enabled, the installer should skip DNS zone validation. Instead, it fails with:

level=fatal msg="failed to fetch Terraform Variables: failed to generate asset \"Terraform Variables\": failed to get GCP public zone: no matching public DNS Zone found"

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-02-03-192446
4.16.0-0.nightly-2024-02-03-221256

How reproducible:

 Always

Steps to Reproduce:

1. Enable custom DNS on GCP: platform.gcp.userProvisionedDNS: Enabled and featureSet: TechPreviewNoUpgrade
2. Configure a baseDomain that does not exist on GCP.

Actual results:

See description.

Expected results:

The installer should skip the validation, because the custom domain may not exist on GCP.
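
The expected behavior can be sketched in Go as a guard in front of the zone lookup. This is only an illustration, not the installer's code: `gcpPlatform`, `zoneLookup`, `publicZoneFor`, and `failingLookup` are hypothetical names, and only the `userProvisionedDNS` setting and the error text come from the report.

```go
package main

import (
	"errors"
	"fmt"
)

// gcpPlatform is a hypothetical, trimmed-down stand-in for the GCP section
// of an install config; only the setting quoted in the report is modelled.
type gcpPlatform struct {
	UserProvisionedDNS string // "Enabled" or "Disabled"
}

// zoneLookup is a hypothetical callback that finds the public DNS zone
// for a base domain.
type zoneLookup func(baseDomain string) (string, error)

// failingLookup simulates a base domain that has no public zone on GCP.
func failingLookup(string) (string, error) {
	return "", errors.New("no matching public DNS Zone found")
}

// publicZoneFor returns the public zone name. When the user provisions DNS
// themselves, it returns "" without calling lookup at all.
func publicZoneFor(p gcpPlatform, baseDomain string, lookup zoneLookup) (string, error) {
	if p.UserProvisionedDNS == "Enabled" {
		return "", nil // custom DNS: the zone may legitimately not exist on GCP
	}
	zone, err := lookup(baseDomain)
	if err != nil {
		return "", fmt.Errorf("failed to get GCP public zone: %w", err)
	}
	return zone, nil
}

func main() {
	_, err := publicZoneFor(gcpPlatform{UserProvisionedDNS: "Enabled"}, "example.com", failingLookup)
	fmt.Println("custom DNS enabled, error:", err)
	_, err = publicZoneFor(gcpPlatform{UserProvisionedDNS: "Disabled"}, "example.com", failingLookup)
	fmt.Println("custom DNS disabled, error:", err)
}
```

With the guard in place, the lookup error only surfaces when the installer itself is expected to manage the zone.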

Additional info:

https://github.com/openshift/installer/pull/7986

Bug OCPBUGS-24919: Update 4.16 baremetal-machine-controller-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/206

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api-provider-baremetal/pull/206

Bug OCPBUGS-27242: fix or ignore snyk errors for ocp storage repos

Description of problem:

In https://github.com/openshift/release/pull/47618 there are quite a few warnings from snyk in the presubmit rehearsal jobs that have not been reported in the bugs filed against storage

We need to go through each one and either fix (in the case of legit bugs) or ignore (false positives / test cases) to avoid having a presubmit job that always fails

Version-Release number of selected component (if applicable):

4.16

How reproducible:

always

Steps to Reproduce:

run 'security' presubmit rehearsal jobs in https://github.com/openshift/release/pull/47618

Actual results:

snyk issues reported

Expected results:

clean test runs

Additional info:

Bug OCPBUGS-28763: --destroy-cloud-resources does not work for HCP KubeVirt platform

Description of problem:

    When destroying an HCP KubeVirt cluster using the CLI with the --destroy-cloud-resources option, PVCs are not cleaned up within the guest cluster because the CLI does not properly honor the --destroy-cloud-resources option for KubeVirt.

Version-Release number of selected component (if applicable):

4.16

How reproducible:

    100%

Steps to Reproduce:

    1. Destroy an HCP KubeVirt cluster using the CLI with --destroy-cloud-resources.

Actual results:

The hosted cluster does not get the hypershift.openshift.io/cleanup-cloud-resources: "true" annotation, which is what ensures that the hosted cluster config controller cleans up PVCs.

Expected results:

The hypershift.openshift.io/cleanup-cloud-resources: "true" annotation should be added to the hosted cluster during teardown when the --destroy-cloud-resources CLI option is used.
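
As a rough Go sketch of the expected behavior: the annotation key and value are taken from the report, while the helper `markForCloudCleanup` and its plain-map input are hypothetical and do not reflect the actual HyperShift CLI code.

```go
package main

import "fmt"

// Annotation key quoted in the report.
const cleanupCloudResourcesAnnotation = "hypershift.openshift.io/cleanup-cloud-resources"

// markForCloudCleanup is a hypothetical helper. It returns a copy of the
// hosted cluster's annotations and sets the cleanup annotation to "true"
// only when the --destroy-cloud-resources option was given.
func markForCloudCleanup(annotations map[string]string, destroyCloudResources bool) map[string]string {
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	if destroyCloudResources {
		out[cleanupCloudResourcesAnnotation] = "true"
	}
	return out
}

func main() {
	existing := map[string]string{"example.com/owner": "team-a"} // sample data
	fmt.Println(markForCloudCleanup(existing, true))
	fmt.Println(markForCloudCleanup(existing, false))
}
```

The point is that the flag must translate into the annotation on every platform, KubeVirt included, before the delete request is issued.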

Additional info:

https://github.com/openshift/hypershift/pull/3494

Bug OCPBUGS-26757: Switch to using new image for KAS container bootstrap

Description of problem:

Manifests will be removed from the CCO image, so we have to start using the CCA (cluster-config-api) image for bootstrap.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

  KAS bootstrap container fails

Expected results:

    KAS bootstrap container succeeds

Additional info:

https://github.com/openshift/hypershift/pull/3400

Bug OCPBUGS-28643: PowerVS: Add dal12 region

Description of problem:

There is a new zone in PowerVS called dal12.  We need to add this zone to the list of supported zones in the installer.

Version-Release number of selected component (if applicable):

How reproducible:

Always

Steps to Reproduce:

    1. Deploy OpenShift cluster to the zone

Actual results:

Fails

Expected results:

Works

Additional info:

https://github.com/openshift/installer/pull/7956

Bug MGMT-16843: non-lowercase hostname in DHCP breaks assisted installation

Description of the problem:

non-lowercase hostname in DHCP breaks assisted installation

How reproducible:

100%

Steps to reproduce:

https://issues.redhat.com/browse/AITRIAGE-10248
User did ask for a valid requested_hostname

Actual results:

bootkube fails

Expected results:

bootkube should succeed

slack thread

https://github.com/openshift/assisted-installer/pull/788

Bug OCPBUGS-25996: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/installer/pull/7898

Bug OCPBUGS-25558: Update 4.16 ose-cluster-csi-snapshot-controller-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/183

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/183

Bug OCPBUGS-27926: Update 4.16 cluster-etcd-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1187

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-etcd-operator/pull/1189

Task MGMT-16452: Change MCE subscription to use default channel

https://github.com/openshift/assisted-service/pull/5843

Bug OCPBUGS-24966: Update 4.16 csi-attacher-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/66

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-attacher/pull/66

Bug OCPBUGS-24993: Update 4.16 kube-state-metrics-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/kube-state-metrics/pull/108

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/kube-state-metrics/pull/108

Bug OCPBUGS-25849: PrometheusAdapter and MetricsServer tasks are conflict prone

Description of problem:

The two tasks always issue an "UPDATE" on the underlying "APIService" resource, even when there are no changes to make.
This significantly raises the likelihood of conflicts, especially during upgrades, with other controllers that concurrently monitor the same resources (the CA controller, for example). It also consumes resources on unnecessary work.

The tasks rely on CreateOrUpdateAPIService: https://github.com/openshift/cluster-monitoring-operator/blob/5e394dd9de305cb6927a23c31b3f9651aa806fb8/pkg/client/client.go#L1725-L1745, which always issues an "UPDATE" on the APIService resource.

Version-Release number of selected component (if applicable):

How reproducible:

Keep a cluster running, then inspect the audit logs for the APIService v1beta1.metrics.k8s.io. To collect them, you can use: oc adm must-gather -- /usr/bin/gather_audit_logs

Steps to Reproduce:

Actual results:

You would see that every "get" is followed by an "update"

You can also take a look at the code taking care of that: https://github.com/openshift/cluster-monitoring-operator/blob/5e394dd9de305cb6927a23c31b3f9651aa806fb8/pkg/client/client.go#L1725-L1745

Expected results:

"Updates" should be avoided if no changes are to be made.

Additional info:

It would be even better to avoid the "get"s as well, but that is a separate subject to discuss.
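
The expected behavior amounts to comparing before writing. The Go sketch below illustrates that idea with a hypothetical in-memory client: `apiServiceSpec`, `fakeClient`, and `createOrUpdate` are stand-ins for illustration and are not the cluster-monitoring-operator API.

```go
package main

import (
	"fmt"
	"reflect"
)

// apiServiceSpec is a hypothetical, reduced stand-in for an APIService spec.
type apiServiceSpec struct {
	Service string
	Group   string
	Version string
}

// fakeClient is a hypothetical in-memory client that counts writes.
type fakeClient struct {
	stored  map[string]apiServiceSpec
	creates int
	updates int
}

// createOrUpdate creates the object when it is missing and updates it only
// when the desired spec differs from the stored one.
func (c *fakeClient) createOrUpdate(name string, desired apiServiceSpec) {
	existing, found := c.stored[name]
	if !found {
		c.stored[name] = desired
		c.creates++
		return
	}
	if reflect.DeepEqual(existing, desired) {
		return // nothing changed: skip the write and avoid a possible conflict
	}
	c.stored[name] = desired
	c.updates++
}

func main() {
	c := &fakeClient{stored: map[string]apiServiceSpec{}}
	spec := apiServiceSpec{Service: "metrics-server", Group: "metrics.k8s.io", Version: "v1beta1"}
	for i := 0; i < 3; i++ {
		c.createOrUpdate("v1beta1.metrics.k8s.io", spec)
	}
	fmt.Printf("creates=%d updates=%d\n", c.creates, c.updates)
}
```

Three identical reconciliations produce one create and no updates. A real implementation would also have to decide which fields to compare, since fields written by other controllers should presumably not trigger an update.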

https://github.com/openshift/cluster-monitoring-operator/pull/2218

Bug OCPBUGS-28590: GCP: unhelpful error message when using env credentials

Description of problem:

 Creating manifests fails with the following error:

./openshift-install create manifests --dir openshift-config
FATAL failed to fetch Master Machines: failed to generate asset "Master Machines": failed to create master machine objects: failed to create provider: unexpected end of JSON input

The installation followed this document:

https://docs.openshift.com/container-platform/4.14/installing/installing_gcp/installing-gcp-vpc.html#installation-gcp-config-yaml_installing-gcp-vpc

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

https://github.com/openshift/installer/pull/8002

Bug OCPBUGS-26504: Update 4.16 atomic-openshift-cluster-autoscaler-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/279

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/kubernetes-autoscaler/pull/279

Bug OCPBUGS-28558: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/2057

Bug OCPBUGS-25030: Update 4.16 ovn-kubernetes-microshift-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/1979

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ovn-kubernetes/pull/1979

Bug OCPBUGS-29304: "Failed to watch *v1.PartialObjectMetadata" errors in prometheus-operator logs

Description of problem:

    Sometimes the prometheus-operator's informer will be stuck because it receives objects that can't be converted to *v1.PartialObjectMetadata.

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    Not always

Steps to Reproduce:

    1. Unknown

Actual results:

    prometheus-operator logs show errors like:

2024-02-09T08:29:35.478550608Z level=warn ts=2024-02-09T08:29:35.478491797Z caller=klog.go:108 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:110: failed to list *v1.PartialObjectMetadata: Get \"https://172.30.0.1:443/api/v1/secrets?resourceVersion=29022\": dial tcp 172.30.0.1:443: connect: connection refused"
2024-02-09T08:29:35.478592909Z level=error ts=2024-02-09T08:29:35.478541608Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:110: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Get \"https://172.30.0.1:443/api/v1/secrets?resourceVersion=29022\": dial tcp 172.30.0.1:443: connect: connection refused"

Expected results:

    No error

Additional info:

    The bug was introduced in v0.70.0 by https://github.com/prometheus-operator/prometheus-operator/pull/5993, so it only affects 4.16 and 4.15.

https://github.com/openshift/prometheus-operator/pull/277

Bug OCPBUGS-29425: Power VS: Destroy code needs to account for edge case of lists

Description of problem:

    In accounts with a large number of resources, the destroy code fails to list all resources. This has revealed changes that need to be made to the destroy code to handle these situations.

Version-Release number of selected component (if applicable):

How reproducible:

    Difficult - but we have an account where we can reproduce it consistently

Steps to Reproduce:

    1. Try to destroy a cluster in an account with a large number of resources.
    2. Observe that the destroy fails.

Actual results:

Fail to destroy

Expected results:

Destroy succeeds
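
One common reason a list call returns only part of a large collection is pagination. The report does not say that this is the cause here, so the Go sketch below is only an illustration of draining a paginated list under that assumption. All names (`page`, `listFunc`, `listAll`, `fakeList`) and the token scheme are hypothetical and do not correspond to the Power VS API.

```go
package main

import "fmt"

// page is a hypothetical list response: some items plus a token for the
// next page (empty when there are no more pages).
type page struct {
	Items []string
	Next  string
}

// listFunc is a hypothetical paginated list call.
type listFunc func(start string) (page, error)

// listAll follows Next tokens until the collection is exhausted and guards
// against a service that keeps returning a token it already handed out.
func listAll(list listFunc) ([]string, error) {
	var all []string
	seen := map[string]bool{}
	start := ""
	for {
		p, err := list(start)
		if err != nil {
			return nil, fmt.Errorf("listing from token %q: %w", start, err)
		}
		all = append(all, p.Items...)
		if p.Next == "" {
			return all, nil
		}
		if seen[p.Next] {
			return nil, fmt.Errorf("pagination loop detected at token %q", p.Next)
		}
		seen[p.Next] = true
		start = p.Next
	}
}

// fakePages is sample data: three pages chained by token.
var fakePages = map[string]page{
	"":   {Items: []string{"vm-1", "vm-2"}, Next: "p2"},
	"p2": {Items: []string{"vm-3", "vm-4"}, Next: "p3"},
	"p3": {Items: []string{"vm-5"}},
}

// fakeList serves fakePages and fails on unknown tokens.
func fakeList(start string) (page, error) {
	p, ok := fakePages[start]
	if !ok {
		return page{}, fmt.Errorf("unknown page token %q", start)
	}
	return p, nil
}

func main() {
	items, err := listAll(fakeList)
	fmt.Println(len(items), err)
}
```

A destroy routine that only looks at the first page would miss `vm-3` through `vm-5` in this sample data, which is the kind of partial listing the report describes.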

Additional info:

https://github.com/openshift/installer/pull/8010

Bug OCPBUGS-29969: ART requests updates to 4.16 image golang-github-prometheus-alertmanager-container

Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/87

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/prometheus-alertmanager/pull/88

Bug OCPBUGS-24838: Update 4.16 ose-vmware-vsphere-csi-driver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/197

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/197

Bug OCPBUGS-28556: Update 4.16 ose-multus-whereabouts-ipam-cni-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/whereabouts-cni/pull/242

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/whereabouts-cni/pull/242

Bug OCPBUGS-24124: Update 4.15 ose-alibaba-machine-controllers-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api-provider-alibaba/pull/46

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api-provider-alibaba/pull/46

Bug OCPBUGS-24946: Update 4.16 ose-cluster-baremetal-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-baremetal-operator/pull/393

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-baremetal-operator/pull/393

Bug OCPBUGS-27473: Error in displaying BuildRun logs in Console

Description of problem:

BuildRun logs cannot be displayed in the console; an error is shown instead.

The BuildRun is created and started using the shp CLI (the same behavior is observed when the build is created and started via the console or YAML):

shp build create goapp-buildah \
    --strategy-name="buildah" \
    --source-url="https://github.com/shipwright-io/sample-go" \
    --source-context-dir="docker-build" \
    --output-image="image-registry.openshift-image-registry.svc:5000/demo/go-app"

The issue occurs on OCP 4.14.6. Investigation showed that this works correctly on OCP 4.14.5.

https://github.com/openshift/console/pull/13583

Bug OCPBUGS-30604: Misformatted node labels causing origin-tests to panic

Description of problem:

    Panic thrown by origin-tests

Version-Release number of selected component (if applicable):

How reproducible:

    always

Steps to Reproduce:

    1. Create an AWS or ROSA 4.15 cluster.
    2. Run the origin tests.

Actual results:

    time="2024-03-07T17:03:50Z" level=info msg="resulting interval message" message="{RegisteredNode  Node ip-10-0-8-83.ec2.internal event: Registered Node ip-10-0-8-83.ec2.internal in Controller map[reason:RegisteredNode roles:worker]}"
  E0307 17:03:50.319617      71 runtime.go:79] Observed a panic: runtime.boundsError{x:24, y:23, signed:true, code:0x3} (runtime error: slice bounds out of range [24:23])
  goroutine 310 [running]:
  k8s.io/apimachinery/pkg/util/runtime.logPanic({0x84c6f20?, 0xc006fdc588})
  	k8s.io/apimachinery@v0.29.0/pkg/util/runtime/runtime.go:75 +0x99
  k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc008c38120?})
  	k8s.io/apimachinery@v0.29.0/pkg/util/runtime/runtime.go:49 +0x75
  panic({0x84c6f20, 0xc006fdc588})
  	runtime/panic.go:884 +0x213
  github.com/openshift/origin/pkg/monitortests/testframework/watchevents.nodeRoles(0x0?)
  	github.com/openshift/origin/pkg/monitortests/testframework/watchevents/event.go:251 +0x1e5
  github.com/openshift/origin/pkg/monitortests/testframework/watchevents.recordAddOrUpdateEvent({0x96bcc00, 0xc0076e3310}, {0x7f2a0e47a1b8, 0xc007732330}, {0x281d36d?, 0x0?}, {0x9710b50, 0xc000c5e000}, {0x9777af, 0xedd7be6b7, ...}, ...)
  	github.com/openshift/origin/pkg/monitortests/testframework/watchevents/event.go:116 +0x41b
  github.com/openshift/origin/pkg/monitortests/testframework/watchevents.startEventMonitoring.func2({0x8928f00?, 0xc00b528c80})
  	github.com/openshift/origin/pkg/monitortests/testframework/watchevents/event.go:65 +0x185
  k8s.io/client-go/tools/cache.(*FakeCustomStore).Add(0x8928f00?, {0x8928f00?, 0xc00b528c80?})
  	k8s.io/client-go@v0.29.0/tools/cache/fake_custom_store.go:35 +0x31
  k8s.io/client-go/tools/cache.watchHandler({0x0?, 0x0?, 0xe16d020?}, {0x9694a10, 0xc006b00180}, {0x96d2780, 0xc0078afe00}, {0x96f9e28?, 0x8928f00}, 0x0, ...)
  	k8s.io/client-go@v0.29.0/tools/cache/reflector.go:756 +0x603
  k8s.io/client-go/tools/cache.(*Reflector).watch(0xc0005dcc40, {0x0?, 0x0?}, 0xc005cdeea0, 0xc005bf8c40?)
  	k8s.io/client-go@v0.29.0/tools/cache/reflector.go:437 +0x53b
  k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc0005dcc40, 0xc005cdeea0)
  	k8s.io/client-go@v0.29.0/tools/cache/reflector.go:357 +0x453
  k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
  	k8s.io/client-go@v0.29.0/tools/cache/reflector.go:291 +0x26
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x10?)
  	k8s.io/apimachinery@v0.29.0/pkg/util/wait/backoff.go:226 +0x3e
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc007974ec0?, {0x9683f80, 0xc0078afe50}, 0x1, 0xc005cdeea0)
  	k8s.io/apimachinery@v0.29.0/pkg/util/wait/backoff.go:227 +0xb6
  k8s.io/client-go/tools/cache.(*Reflector).Run(0xc0005dcc40, 0xc005cdeea0)
  	k8s.io/client-go@v0.29.0/tools/cache/reflector.go:290 +0x17d
  created by github.com/openshift/origin/pkg/monitortests/testframework/watchevents.startEventMonitoring
  	github.com/openshift/origin/pkg/monitortests/testframework/watchevents/event.go:83 +0x6a5
panic: runtime error: slice bounds out of range [24:23] [recovered]
	panic: runtime error: slice bounds out of range [24:23]

Expected results:

    execution of tests
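
The message `slice bounds out of range [24:23]` is what Go reports when a 23-byte string is sliced starting at offset 24. The prefix `node-role.kubernetes.io/` is 24 bytes long and the same string without the trailing slash is 23 bytes, so one plausible reading is that a label key lacking the slash was sliced at a fixed offset. This is an assumption: the report shows neither the offending label nor the code at event.go:251. The Go sketch below shows a bounds-safe way to extract roles; `rolesFromLabels` is a hypothetical helper, not the function in openshift/origin.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// roleLabelPrefix is the conventional node-role label prefix (24 bytes).
const roleLabelPrefix = "node-role.kubernetes.io/"

// rolesFromLabels is a hypothetical helper that extracts role names from
// node labels without slicing at a fixed offset.
func rolesFromLabels(labels map[string]string) []string {
	var roles []string
	for key := range labels {
		if !strings.HasPrefix(key, roleLabelPrefix) {
			// Also skips the misformatted key "node-role.kubernetes.io",
			// which is one byte shorter than the prefix.
			continue
		}
		role := strings.TrimPrefix(key, roleLabelPrefix)
		if role == "" {
			continue // prefix with nothing after the slash
		}
		roles = append(roles, role)
	}
	sort.Strings(roles) // map iteration order is random; sort for stable output
	return roles
}

func main() {
	labels := map[string]string{ // sample data, not taken from the report
		"node-role.kubernetes.io/worker": "",
		"node-role.kubernetes.io":        "",
		"kubernetes.io/os":               "linux",
	}
	fmt.Println(rolesFromLabels(labels))
}
```

Checking the full prefix, slash included, before trimming means a malformed key is ignored rather than crashing the event monitor.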

Additional info:

https://github.com/openshift/origin/pull/28650

Story CONSOLE-3924: i18n upload/download routine task

This story tracks the routine i18n upload/download tasks that are performed every sprint.

A.C.

- Upload strings to Memsource at the start of the sprint and reach out to the localization team

- Download translated strings from Memsource when they are ready

- Review the translated strings and open a pull request

- Open a follow-up story for the next sprint

https://github.com/openshift/console/pull/13608

Task MON-3664: Avoid issues with std.set* functions in jsonnet

https://github.com/openshift/cluster-monitoring-operator/pull/2231

Bug OCPBUGS-24359: oc-mirror with v2 will create more data compared with v1 format

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

https://github.com/openshift/oc-mirror/pull/753

Bug OCPBUGS-25332: HyperShift failing on kube 1.29 rebase due to KMSv1

Description of problem:

E1213 07:18:34.291004 1 run.go:74] "command failed" err="error while building transformers: KMSv1 is deprecated and will only receive security updates going forward. Use KMSv2 instead. Set --feature-gates=KMSv1=true to use the deprecated KMSv1 feature."

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

https://github.com/openshift/hypershift/pull/3318

Bug OCPBUGS-7656: Console with custom route can not be accessed when the console is enabled after cluster upgrade

Description of problem:

When a console with a custom route is disabled before a cluster upgrade and re-enabled after the upgrade, the console cannot be accessed.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-15-111607

How reproducible:

Always

Steps to Reproduce:

1. Launch a cluster with available update.
2. Create custom route for console in ingress configuration:
# oc edit ingresses.config.openshift.io cluster
spec:
  componentRoutes:
  - hostname: console-openshift-custom.apps.qe-413-0216.qe.devcluster.openshift.com
    name: console
    namespace: openshift-console
  - hostname: openshift-downloads-custom.apps.qe-413-0216.qe.devcluster.openshift.com
    name: downloads
    namespace: openshift-console
  domain: apps.qe-413-0216.qe.devcluster.openshift.com
3. After custom route is created, access console with custom route.
4. Remove console by setting managementState as Removed in console operator:
# oc edit consoles.operator.openshift.io cluster
spec:
  logLevel: Normal
  managementState: Removed
  operatorLogLevel: Normal
5. Upgrade cluster to a target version.
6. Enable console by setting managementState as Managed in console operator:
# oc edit consoles.operator.openshift.io cluster
spec:
  logLevel: Normal
  managementState: Managed
  operatorLogLevel: Normal
7. After console resources are created, access console url.

Actual results:

3. Console could be accessed through custom route.
4. Console resources are removed. And all cluster operators are in normal status
# oc get all -n openshift-console
No resources found in openshift-console namespace.

5. Upgrade succeeds, all cluster operators are in normal status
6. Console resources are created:

oc get all -n openshift-console
NAME                             READY   STATUS    RESTARTS   AGE
pod/console-69d88985b-bvh46      1/1     Running   0          3m41s
pod/console-69d88985b-fwhjf      1/1     Running   0          3m41s
pod/downloads-6b6b555d8d-kn822   1/1     Running   0          3m49s
pod/downloads-6b6b555d8d-wp6zc   1/1     Running   0          3m49s

NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/console            ClusterIP   172.30.226.112   <none>        443/TCP    3m50s
service/console-redirect   ClusterIP   172.30.147.151   <none>        8444/TCP   3m50s
service/downloads          ClusterIP   172.30.251.248   <none>        80/TCP     3m50s

NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/console     2/2     2            2           3m47s
deployment.apps/downloads   2/2     2            2           3m50s

NAME                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/console-69d88985b      2         2         2       3m42s
replicaset.apps/console-6dbdd487d      0         0         0       3m47s
replicaset.apps/downloads-6b6b555d8d   2         2         2       3m50s

NAME                                        HOST/PORT                                                                  PATH   SERVICES           PORT                    TERMINATION          WILDCARD
route.route.openshift.io/console            console-openshift-console.apps.qe-413-0216.qe.devcluster.openshift.com            console-redirect   custom-route-redirect   edge/Redirect        None
route.route.openshift.io/console-custom     console-openshift-custom.apps.qe-413-0216.qe.devcluster.openshift.com             console            https                   reencrypt/Redirect   None
route.route.openshift.io/downloads          downloads-openshift-console.apps.qe-413-0216.qe.devcluster.openshift.com          downloads          http                    edge/Redirect        None
route.route.openshift.io/downloads-custom   openshift-downloads-custom.apps.qe-413-0216.qe.devcluster.openshift.com           downloads          http                    edge/Redirect        None

7. Could not open console url successfully. There is error info for console operator:

oc get co | grep console
console 4.13.0-0.nightly-2023-02-15-202607 False False False 42s RouteHealthAvailable: route not yet available, https://console-openshift-custom.apps.qe-413-0216.qe.devcluster.openshift.com returns '503 Service Unavailable'
oc get clusterversions.config.openshift.io
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.13.0-0.nightly-2023-02-15-202607 True False 4h48m Error while reconciling 4.13.0-0.nightly-2023-02-15-202607: the cluster operator console is not available

Expected results:

7. Should be able to access console successfully.

Additional info:

https://github.com/openshift/console-operator/pull/826

Task MON-3693: Update Prometheus owners in openshift/origin

https://github.com/openshift/origin/pull/28552

Bug OCPBUGS-24814: Update 4.16 ose-installer-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/installer/pull/7816

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/installer/pull/7816

Bug OCPBUGS-28538: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-25881: [UI] in Openshift-storage-client namespace, 'RWX' access mode RBD PVC with Volume mode 'Filesystem' is not blocked, it attempt to create and stuck in pending state

Description of problem:

Copy of BZ https://bugzilla.redhat.com/show_bug.cgi?id=2250911 on the OCP side (the fix is needed in the console).

[UI] In the openshift-storage-client namespace, an RBD PVC with 'RWX' access mode and volume mode 'Filesystem' can be created from the client. However, this is an invalid combination for RBD PVC creation. In the ODF Operator UI on other platforms, the volume mode option is not available when the Ceph RBD storage class and RWX access mode are selected, but it is visible in the client operator view. The attempt to create the PVC gets stuck in the Pending state.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1. Deploy Provider Client setup.
2. From the UI, create a PVC: select the ceph-rbd storage class and the RWX access mode, then check the volume mode. With this bug, both the 'Filesystem' and 'Block' volume modes are visible in the UI. Select volume mode 'Filesystem' and create the PVC.

Actual results:

The PVC is created and gets stuck in Pending status.
The PVC event shows an error like:
Generated from openshift-storage-client.rbd.csi.ceph.com_csi-rbdplugin-provisioner-6d9dcb9fc7-vjj22_2bd4ede5-9418-4c8e-80ae-169b5cb4fa80
12 times in the last 13 minutes
failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = InvalidArgument desc = multi node access modes are only supported on rbd `block` type volumes

Expected results:

The volume mode option should not be visible on the page when RWX access mode and an RBD storage class are selected for the PVC.
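
The rule behind the provisioner error can be sketched as a small Go function that decides which volume modes a form should offer. This is an illustration only: `volumeModes`, its boolean `isRBD` input, and the choice to return only `Block` (rather than hide the field) are assumptions, and the sketch covers only the RWX case from the report. The console itself is not written in Go.

```go
package main

import "fmt"

// volumeModes is a hypothetical helper that returns the volume modes a PVC
// form could offer. It encodes the rule from the provisioner error in the
// report: multi-node access on RBD is only supported for block volumes.
func volumeModes(isRBD bool, accessMode string) []string {
	if isRBD && accessMode == "ReadWriteMany" {
		return []string{"Block"}
	}
	return []string{"Filesystem", "Block"}
}

func main() {
	fmt.Println(volumeModes(true, "ReadWriteMany"))
	fmt.Println(volumeModes(true, "ReadWriteOnce"))
	fmt.Println(volumeModes(false, "ReadWriteMany"))
}
```

With a rule like this applied in the form, the 'Filesystem' option would never be selectable for an RWX RBD PVC, so the request that ends up Pending could not be submitted.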

Additional info:

Screenshots are attached to the BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2250911

https://bugzilla.redhat.com/show_bug.cgi?id=2250911#c3

https://github.com/openshift/console/pull/13418

Bug OCPBUGS-29982: ART requests updates to 4.16 image ose-thanos-container

Please review the following PR: https://github.com/openshift/thanos/pull/142

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/thanos/pull/142

Bug MGMT-16739: [STG][BE][Nutanix] CNV and MCE should be disabled when select platform Nutanix

Description of the problem:

Per the latest decision, Red Hat will not support installing an OCP cluster on Nutanix with nested virtualization. The "Install OpenShift Virtualization" check box on the "Operators" page should therefore be disabled when the "Nutanix" platform is selected on the "Cluster Details" page.

Slack discussion thread

https://redhat-internal.slack.com/archives/C0211848DBN/p1706640683120159

Nutanix
https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000XeiHCAS

https://github.com/openshift/assisted-service/pull/5941

Bug OCPBUGS-24921: Update 4.16 ose-ibm-cloud-controller-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/61

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-provider-ibm/pull/61

Bug OCPBUGS-20085: [IBMCloud] Unhandled response during destroy disks

Description of problem:

During the destroy cluster operation, unexpected results from the IBM Cloud API calls for Disks can result in panics when response data (or responses) are missing, resulting in unexpected failures during destroy.

Version-Release number of selected component (if applicable):

4.15

How reproducible:

Unknown, dependent on IBM Cloud API responses

Steps to Reproduce:

1. Successfully create IPI cluster on IBM Cloud
2. Attempt to cleanup (destroy) the cluster

Actual results:

Golang panic while attempting to parse an HTTP response that is missing or lacking data.


level=info msg=Deleted instance "ci-op-97fkzvv2-e6ed7-5n5zg-master-0"
E0918 18:03:44.787843      33 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 228 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x6a3d760?, 0x274b5790})
	/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xfffffffe?})
	/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x6a3d760, 0x274b5790})
	/usr/lib/golang/src/runtime/panic.go:884 +0x213
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).waitForDiskDeletion.func1()
	/go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/disk.go:84 +0x12a
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).Retry(0xc000791ce0, 0xc000573700)
	/go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:99 +0x73
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).waitForDiskDeletion(0xc000791ce0, {{0xc00160c060, 0x29}, {0xc00160c090, 0x28}, {0xc0016141f4, 0x9}, {0x82b9f0d, 0x4}, {0xc00160c060, ...}})
	/go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/disk.go:78 +0x14f
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).destroyDisks(0xc000791ce0)
	/go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/disk.go:118 +0x485
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).executeStageFunction.func1()
	/go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:201 +0x3f
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1({0x7f7801e503c8, 0x18})
	/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:109 +0x1b
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext({0x227a2f78?, 0xc00013c000?}, 0xc000a9b690?)
	/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:154 +0x57
k8s.io/apimachinery/pkg/util/wait.poll({0x227a2f78, 0xc00013c000}, 0xd0?, 0x146fea5?, 0x7f7801e503c8?)
	/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:245 +0x38
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfiniteWithContext({0x227a2f78, 0xc00013c000}, 0x4136e7?, 0x28?)
	/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:229 +0x49
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfinite(0x100000000000000?, 0x806f00?)
	/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:214 +0x46
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).executeStageFunction(0xc000791ce0, {{0x82bb9a3?, 0xc000a9b7d0?}, 0xc000111de0?}, 0x840366?, 0xc00054e900?)
	/go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:198 +0x108
created by github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).destroyCluster
	/go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:172 +0xa87
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference

Expected results:

Destroy IBM Cloud Disks during cluster destroy, or provide a useful error message to follow up on.
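
The expected behavior (keep going, or report a useful message, when a response is missing) can be illustrated with a small guard. This is a Python sketch of the idea only. It is not the installer's Go code and not the content of the linked PR; `disk_deletion_done` and the `status_code` key are hypothetical names standing in for a "get disk" call and its HTTP response.

```python
from typing import Optional, Tuple


def disk_deletion_done(
    response: Optional[dict], error: Optional[Exception]
) -> Tuple[bool, Optional[str]]:
    """Decide whether to stop polling, without assuming a response exists.

    Returns (done, message). `response` may be None when the API call
    returned nothing.
    """
    if response is None:
        # Missing response: report it instead of dereferencing it.
        detail = str(error) if error else "no error returned"
        return False, f"no response from disk lookup ({detail}); will retry"
    status = response.get("status_code")
    if status == 404:
        return True, None  # the disk is gone
    if status is None:
        return False, "response is missing a status code; will retry"
    return False, f"disk still present (HTTP {status})"
```

The point of the design is that every branch that touches the response runs only after the response has been checked, so a missing response becomes a retry with a message rather than a crash.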

Additional info:

The ability to reproduce is relatively low, as it requires the IBM Cloud APIs to return specific data (or lack thereof); it is currently unknown why the HTTP response and/or data is missing.

IBM Cloud already has a PR that attempts to mitigate this issue, as was done for other destroy resource calls. Follow-ups for additional resources may be needed.
https://github.com/openshift/installer/pull/7515

https://github.com/openshift/installer/pull/7515

Bug OCPBUGS-23631: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-kube-scheduler-operator/pull/524

Bug OCPBUGS-23888: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/196

Bug OCPBUGS-27201: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/monitoring-plugin/pull/92

Bug OCPBUGS-29932: image registry operator displays panic in status from move-blobs command

Description of problem:

    Sample job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-azure-4.15-nightly-x86-data-path-9nodes/1760228008968327168

Version-Release number of selected component (if applicable):

How reproducible:

    Anytime there is an error from the move-blobs command

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

    A panic is shown followed by the error message

Expected results:

    An error message is shown

Additional info:

https://github.com/openshift/cluster-image-registry-operator/pull/1006

Bug OCPBUGS-25571: Update 4.16 ose-alibaba-machine-controllers-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api-provider-alibaba/pull/49

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api-provider-alibaba/pull/49

Bug OCPBUGS-25855: Update 4.16 ose-vertical-pod-autoscaler-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/277

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/kubernetes-autoscaler/pull/277

Bug OCPBUGS-30244: Update warning to add image names into the kubectl version mismatch message in addition to the version list

Description of problem:

When extracting `oc` and `openshift-install` from the release payload, the warnings below are shown, which might be confusing for the user. To make this clear, please update the warning to add image names to the kubectl version mismatch message in addition to the version list.

Version-Release number of selected component (if applicable):

always

How reproducible:

Always

Steps to Reproduce:

    1. Run command to extract oc & openshift-install using `oc adm extract`
    2. Run oc adm release info --commits <payload>
    3.

Actual results:

    $ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.16.0-0.ci-2024-03-05-032119
warning: multiple versions reported for the kubectl: 1.29.1,1.28.2,1.29.0

Expected results:

     Show the names of the images that need a Kubernetes bump along with the kubectl versions
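
One way to picture the requested message is to group images by the kubectl version they report. The sketch below is a Python illustration under assumptions: `kubectl_mismatch_warning`, the input mapping, and the exact output layout are hypothetical, and only the leading warning text is taken from the output above. It is not the `oc` implementation.

```python
from collections import defaultdict


def kubectl_mismatch_warning(image_versions: dict[str, str]) -> str:
    """Build a warning listing each kubectl version with the images reporting it."""
    by_version: dict[str, list[str]] = defaultdict(list)
    for image, version in image_versions.items():
        by_version[version].append(image)
    if len(by_version) <= 1:
        return ""  # all images agree, nothing to warn about
    parts = [
        f"{version} ({', '.join(sorted(images))})"
        for version, images in sorted(by_version.items())
    ]
    return "warning: multiple versions reported for the kubectl: " + "; ".join(parts)
```

Grouping by version keeps the message short while still pointing at the images that need a bump.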

Additional info:

   Thread here:  https://redhat-internal.slack.com/archives/GK58XC2G2/p1709565188855519

https://github.com/openshift/oc/pull/1695

Bug OCPBUGS-25103: HyperShift failing on kube 1.29 rebase

Description of problem:

    Kube apiserver pod keeps crashing when tested against v1.29 rebase

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    always

Steps to Reproduce:

    1. Run hypershift e2e against v1.29 rebase
    2.
    3.

Actual results:

    Fails https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1810-periodics-e2e-aws-ovn/1732734494688940032

Expected results:

    Succeeds

Additional info:

    Kube apiserver pod is crashlooping with:
E1208 21:17:06.619997 1 run.go:74] "command failed" err="group version flowcontrol.apiserver.k8s.io/v1alpha1 that has not been registered"

https://github.com/openshift/hypershift/pull/3304

Bug OCPBUGS-22994: Add alert to warn vSphere users about using usernames without domain

This is a follow-up for https://issues.redhat.com/browse/OCPBUGS-14829 and https://issues.redhat.com/browse/OCPBUGS-21821. Let's add an alert for vSphere users who use usernames without a domain: this prevents storage from working, so users should be alerted.
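
The condition behind such an alert could be checked roughly as follows. This is a Python sketch, not the operator's code; the helper name `username_has_domain` is hypothetical, and it assumes a domain is supplied either as `user@domain` or as `DOMAIN\user`.

```python
def username_has_domain(username: str) -> bool:
    """Return True if the username carries a domain part.

    Assumption: a domain is given either as user@domain or DOMAIN\\user.
    """
    if "@" in username:
        user, _, domain = username.rpartition("@")
        return bool(user) and bool(domain)
    if "\\" in username:
        domain, _, user = username.partition("\\")
        return bool(domain) and bool(user)
    return False
```

An alert would fire when this returns False for the configured vCenter username.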

Bug OCPBUGS-28787: CCO degrade when remove root credential for GCP cluster in Mint mode

Description of problem:

This was found when testing OCP-71263 and regression OCP-35770 for 4.15.
For GCP in Mint mode, the root credential can be removed after cluster installation.
But after the root credential was removed, CCO became degraded.

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-01-25-051548

4.15.0-rc.3

How reproducible:

    
Always

Steps to Reproduce:

    1. Install a GCP cluster in Mint mode.

    2. After installation, remove the root credential:
jianpingshu@jshu-mac ~ % oc delete secret -n kube-system gcp-credentials
secret "gcp-credentials" deleted

    3. Wait some time (about 30 minutes to 1 hour); CCO becomes degraded:

jianpingshu@jshu-mac ~ % oc get co cloud-credential
NAME               VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
cloud-credential   4.15.0-rc.3   True        True          True       6h45m   6 of 7 credentials requests are failing to sync.

jianpingshu@jshu-mac ~ % oc -n openshift-cloud-credential-operator get -o json credentialsrequests | jq -r '.items[] | select(tostring | contains("InfrastructureMismatch") | not) | .metadata.name as $n | .status.conditions // [{type: "NoConditions"}] | .[] | .type + "=" + .status + " " + $n + " " + .reason + ": " + .message' | sort
CredentialsProvisionFailure=False openshift-cloud-network-config-controller-gcp CredentialsProvisionSuccess: successfully granted credentials request
CredentialsProvisionFailure=True cloud-credential-operator-gcp-ro-creds CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found
CredentialsProvisionFailure=True openshift-gcp-ccm CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found
CredentialsProvisionFailure=True openshift-gcp-pd-csi-driver-operator CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found
CredentialsProvisionFailure=True openshift-image-registry-gcs CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found
CredentialsProvisionFailure=True openshift-ingress-gcp CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found
CredentialsProvisionFailure=True openshift-machine-api-gcp CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found

openshift-cloud-network-config-controller-gcp has no failure because it doesn't have a customized role in 4.15.0-rc.3
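
For readers less familiar with jq, the filter used above can be mirrored in Python. This is a sketch for illustration: `summarize_credentials_requests` is a hypothetical name, the input is assumed to be the parsed output of `oc ... get -o json credentialsrequests`, and missing fields are rendered as empty strings (jq treats `null` the same way when concatenating). Sorting here is by code point, which may differ from a locale-aware `sort`.

```python
import json


def summarize_credentials_requests(doc: dict) -> list[str]:
    """Mirror the jq filter: one sorted line per condition of each request."""
    lines = []
    for item in doc.get("items", []):
        # jq: select(tostring | contains("InfrastructureMismatch") | not)
        if "InfrastructureMismatch" in json.dumps(item):
            continue
        name = item["metadata"]["name"]
        # jq: .status.conditions // [{type: "NoConditions"}]
        conditions = (item.get("status") or {}).get("conditions") or [
            {"type": "NoConditions"}
        ]
        for cond in conditions:
            lines.append(
                f'{cond.get("type", "")}={cond.get("status", "")} {name} '
                f'{cond.get("reason", "")}: {cond.get("message", "")}'
            )
    return sorted(lines)
```

Each returned line has the same `type=status name reason: message` shape as the output listed above.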

Actual results:

 CCO became degraded

Expected results:

 CCO is not degraded; only the "upgradeable" condition is updated to report the missing root credential

Additional info:

Tested the same case on 4.14.10, no issue

https://github.com/openshift/cloud-credential-operator/pull/685

Bug OCPBUGS-28718: Revision tab and routes tab in serving details page showing no resource found

Description of problem:

    On the service details page, the Revision and Route tabs show a "No resource found" message even though a Revision and a Route have been created for that service

Version-Release number of selected component (if applicable):

    4.15.z

How reproducible:

    Always

Steps to Reproduce:

    1. Install serverless operator
    2. Create serving instance
    3. Create knative service/function
    4. Go to details page

Actual results:

    User is not able to see Revision and Route created for the service

Expected results:

     User should be able to see Revision and Route created for the service

Additional info:

https://github.com/openshift/console/pull/13556

Bug OCPBUGS-24581: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-node-tuning-operator/pull/877

Bug OCPBUGS-28941: ART requests updates to 4.16 image ose-csi-driver-shared-resource-operator-container

Please review the following PR: https://github.com/openshift/csi-driver-shared-resource-operator/pull/101

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-driver-shared-resource-operator/pull/101

Bug OCPBUGS-23327: file path used for oci images can result in an error

Description of problem:

When executing oc mirror using an oci path, you can end up in an error state when the destination is a file://<path> destination (i.e. mirror to disk).

Version-Release number of selected component (if applicable):

4.14.2

How reproducible:

always

Steps to Reproduce:

At IBM we use the ibm-pak tool to generate an OCI catalog, but this bug is reproducible using a simple skopeo copy. Once you've copied the image locally, you can move it around using file system copy commands to test this in different ways.

1. Make a directory structure like this to simulate how ibm-pak creates its own catalogs. The problem seems to be related to the path you use, so this represents the failure case:

mkdir -p /root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list

2. make a location where the local storage will live:

mkdir -p /root/.ibm-pak/oc-mirror-storage

3. Next, copy the image locally using skopeo:

skopeo copy docker://icr.io/cpopen/ibm-zcon-zosconnect-catalog@sha256:8d28189637b53feb648baa6d7e3dd71935656a41fd8673292163dd750ef91eec oci:///root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list --all --format v2s2

4. You can copy the OCI catalog content to a location where things will work properly so you can see a working example:

cp -r /root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list /root/ibm-zcon-zosconnect-catalog

5. You'll need an ISC... I've included both the oci references in the example (the commented out one works, but the oci:///root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list reference fails).

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
  - catalog: oci:///root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list
  #- catalog: oci:///root/ibm-zcon-zosconnect-catalog
    packages:
    - name: ibm-zcon-zosconnect
      channels:
      - name: v1.0
    full: true
    targetTag: 27ba8e
    targetCatalog: ibm-catalog
storageConfig:
  local:
    path: /root/.ibm-pak/oc-mirror-storage

6. run oc mirror (remember the ISC has oci refs for good and bad scenarios). You may want to change your working directory to different locations between running the good/bad examples.

oc mirror --config /root/.ibm-pak/data/publish/latest/image-set-config.yaml file://zcon --dest-skip-tls --max-per-registry=6

Actual results:


Logging to .oc-mirror.log
Found: zcon/oc-mirror-workspace/src/publish
Found: zcon/oc-mirror-workspace/src/v2
Found: zcon/oc-mirror-workspace/src/charts
Found: zcon/oc-mirror-workspace/src/release-signatures
error: ".ibm-pak/data/publish/latest/catalog-oci/manifest-list/kubebuilder/kube-rbac-proxy@sha256:db06cc4c084dd0253134f156dddaaf53ef1c3fb3cc809e5d81711baa4029ea4c" is not a valid image reference: invalid reference format

Expected results:


Simple example where things were working with the oci:///root/ibm-zcon-zosconnect-catalog reference (this was executed in the same workspace so no new images were detected).

Logging to .oc-mirror.log
Found: zcon/oc-mirror-workspace/src/publish
Found: zcon/oc-mirror-workspace/src/v2
Found: zcon/oc-mirror-workspace/src/charts
Found: zcon/oc-mirror-workspace/src/release-signatures
3 related images processed in 668.063974ms
Writing image mapping to zcon/oc-mirror-workspace/operators.1700092336/manifests-ibm-zcon-zosconnect-catalog/mapping.txt
No new images detected, process stopping

Additional info:


I debugged the error that happened and captured one of the instances where the ParseReference call fails. This is only for reference to help narrow down the issue.

github.com/openshift/oc/pkg/cli/image/imagesource.ParseReference (/root/go/src/openshift/oc-mirror/vendor/github.com/openshift/oc/pkg/cli/image/imagesource/reference.go:111)
github.com/openshift/oc-mirror/pkg/image.ParseReference (/root/go/src/openshift/oc-mirror/pkg/image/image.go:79)
github.com/openshift/oc-mirror/pkg/cli/mirror.(*MirrorOptions).addRelatedImageToMapping (/root/go/src/openshift/oc-mirror/pkg/cli/mirror/fbc_operators.go:194)
github.com/openshift/oc-mirror/pkg/cli/mirror.(*OperatorOptions).plan.func3 (/root/go/src/openshift/oc-mirror/pkg/cli/mirror/operator.go:575)
golang.org/x/sync/errgroup.(*Group).Go.func1 (/root/go/src/openshift/oc-mirror/vendor/golang.org/x/sync/errgroup/errgroup.go:75)
runtime.goexit (/usr/local/go/src/runtime/asm_amd64.s:1594)

Also, I wanted to point out that because we use a period in the path (i.e. .ibm-pak) I wonder if that's causing the issue? This is just a guess and something to consider. *FOLLOWUP* ... I just removed the period from ".ibm-pak" and that seemed to make the error go away.
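
The guess about the leading period is consistent with how container image references are generally validated: each repository path component must start with a lowercase letter or digit. The Python sketch below uses a simplified form of that rule to show which component of the failing reference is rejected. It is an illustration only and not the oc-mirror or `ParseReference` code; `invalid_components` is a hypothetical helper, and real parsers may additionally treat the first component as a registry host.

```python
import re

# Simplified repository path-component rule for container image references:
# lowercase alphanumerics, optionally joined by '.', '_', '__' or dashes.
_COMPONENT = re.compile(r"[a-z0-9]+(?:(?:[._]|__|-+)[a-z0-9]+)*")


def invalid_components(repo_path: str) -> list[str]:
    """Return the path components that cannot appear in an image reference."""
    return [c for c in repo_path.split("/") if not _COMPONENT.fullmatch(c)]
```

Applied to the path from the error message (without the digest), only `.ibm-pak` is flagged, while the working `root/ibm-zcon-zosconnect-catalog/...` path has no invalid components.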

https://github.com/openshift/oc-mirror/pull/748

Bug OCPBUGS-24915: Update 4.16 ose-cluster-kube-scheduler-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-kube-scheduler-operator/pull/515

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-25339: OLM pod panics when EnsureSecretOwnershipAnnotations runs

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1. Run OLM on 4.15 cluster
    2.
    3.

Actual results:

    OLM pod will panic

Expected results:

    Should run just fine

Additional info:

    This issue is caused by a failure to initialize a new map when it is nil
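
In Go, writing to a nil map panics, so the map has to be created before the first write. The same guard can be shown in Python, where the analogous failure is assigning into `None`. This is an illustrative sketch, not the OLM code; `set_annotation` and the dictionary layout are hypothetical.

```python
def set_annotation(metadata: dict, key: str, value: str) -> dict:
    """Set an annotation, creating the annotations map when it is missing."""
    if metadata.get("annotations") is None:
        metadata["annotations"] = {}  # initialize before the first write
    metadata["annotations"][key] = value
    return metadata
```

Without the initialization step, an object that has no annotations yet would fail on the assignment instead of gaining its first annotation.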

https://github.com/openshift/operator-framework-olm/pull/634

Bug OCPBUGS-27180: Sync openshift-oauth-apiserver's shutdown-delay-duration with core offering

Description of problem:

The shutdown-delay-duration argument for the openshift-oauth-apiserver is set to 3s in hypershift, but set to 15s in core openshift. Hypershift should update the value to match.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/hypershift/pull/3608

Bug OCPBUGS-28203: Power VS: Installer create workspace is not instantly ready for PER configuration

Description of problem:

    If you allow the installer to provision a Power VS Workspace instead of bringing your own, it can sometimes fail when creating a network. This is because Power Edge Router can sometimes take up to a minute to configure.

Version-Release number of selected component (if applicable):

How reproducible:

    Infrequent, but will probably hit it within 50-100 runs

Steps to Reproduce:

    1. Install on Power VS with IPI with serviceInstanceGUID not set in the install-config.yaml
    2. Occasionally you'll observe a failure due to the workspace not being ready for networks

Actual results:

    Failure

Expected results:

    Success

Additional info:

    Not consistently reproducible
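
A common way to handle a resource that needs time to become ready is to poll for readiness with a deadline before using it. The sketch below shows that pattern in Python under assumptions: `wait_until_ready`, the default timeout and interval, and the `is_ready` callback are all hypothetical, and this is not the installer's implementation.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 120.0,   # assumed budget; the report mentions "up to a minute"
    interval_s: float = 5.0,    # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_ready until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised with a fake clock; a caller would normally pass just the readiness check and create the network once it returns True.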

https://github.com/openshift/installer/pull/7889

Bug OCPBUGS-29981: ART requests updates to 4.16 image golang-github-prometheus-prometheus-container

Please review the following PR: https://github.com/openshift/prometheus/pull/195

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/prometheus/pull/197

Bug OCPBUGS-25126: Update 4.16 ose-aws-ebs-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/247

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/aws-ebs-csi-driver/pull/247

Bug OCPBUGS-25006: Update 4.16 ose-cluster-ingress-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1006

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-ingress-operator/pull/1006

Bug OCPBUGS-17207: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/2081

Bug OCPBUGS-24650: AdminPolicyBasedExternalRoute status.Status is not updated

Description of problem:

https://issues.redhat.com/browse/OCPBUGS-22710?focusedId=23594559&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-23594559

The issue still seems to be present. Tested on 4.15.0-0.nightly-2023-12-04-223539: there is a status message for each zone, but there is no summarized status, so moving this to Assigned.
The apbexternalroute YAML file is:

apiVersion: k8s.ovn.org/v1
kind: AdminPolicyBasedExternalRoute
metadata:
  name: default-route-policy
spec:
  from:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: test
  nextHops:
    static:
    - ip: "172.18.0.8"
    - ip: "172.18.0.9"

and the Status section is as follows:

% oc get apbexternalroute
NAME                   LAST UPDATE   STATUS
default-route-policy   12s <--- still empty
% oc describe apbexternalroute default-route-policy | tail -n 10
Status:
Last Transition Time: 2023-12-06T02:12:11Z
Messages:
qiowang-120620-gtt85-master-2.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
qiowang-120620-gtt85-master-0.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
qiowang-120620-gtt85-worker-a-55fzx.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
qiowang-120620-gtt85-master-1.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
qiowang-120620-gtt85-worker-b-m98ms.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
qiowang-120620-gtt85-worker-c-vtl8q.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
Events: <none>
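
The missing piece is a single summarized status derived from the per-node messages. A minimal Python sketch of such an aggregation follows. It is an assumption-laden illustration, not the OVN-Kubernetes logic: `summarize_status` is hypothetical, success is inferred from the message text shown above, and "Success"/"Fail" are placeholder values.

```python
def summarize_status(messages: list[str]) -> str:
    """Collapse per-node messages into one status string.

    Assumption: a node succeeded if its message contains
    "configured external gateway IPs"; "Success"/"Fail" are placeholders.
    """
    if not messages:
        return ""  # nothing reported yet
    ok = all("configured external gateway IPs" in m for m in messages)
    return "Success" if ok else "Fail"
```

With the six node messages listed above, this would yield a non-empty summary instead of the empty STATUS column.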

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-network-operator/pull/2149

Bug OCPBUGS-26203: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/azure-workload-identity/pull/12

Bug OCPBUGS-30077: Add search filters in the toolbar

Add the possibility to have other search filters in the resource list toolbar

Why is this important?

This will let plugins define additional search filters on top of Name and Label in a resource list page

Scenarios

In the nmstate-console-plugin I would like to define other filters, such as an IP search in the NMState list. This search will let the user filter the nmstate resources that are inside a subnet or that match specific IPs.

For now, using the props and some hacks we were able to change the Name search into an IP search but we would like to have both.

https://issues.redhat.com/browse/CNV-36247

https://github.com/openshift/console/pull/13233

Bug OCPBUGS-30805: no response when clicking on 'Configure' button for AlertmanagerReceiversNotConfigured alert

Description of problem:

Nothing happens when the user clicks the 'Configure' button next to the AlertmanagerReceiversNotConfigured alert.

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-03-11-041450

How reproducible:

 Always

Steps to Reproduce:

1. navigate to Home -> Overview, locate the AlertmanagerReceiversNotConfigured alert in 'Status' card
2. click the 'Configure' button next to AlertmanagerReceiversNotConfigured alert

Actual results:

nothing happens

Expected results:

The user should be taken to the Alertmanager configuration page, /monitoring/alertmanagerconfig

Additional info:

https://github.com/openshift/console/pull/13666

Bug OCPBUGS-24041: Console blips Available=False with RouteHealth_FailedGet and such

Description

Seen in 4.15-related update CI:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/console.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False[^:]*: \(.*\)|\1 \2 \3|' | sed 's|[.]apps[.][^ /]*|.apps...|g' | sort | uniq -c | sort -n
      1 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... dial tcp 52.158.160.194:443: connect: connection refused
      1 console RouteHealth_StatusError route not yet available, https://console-openshift-console.apps... returns '503 Service Unavailable'
      2 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... dial tcp: lookup console-openshift-console.apps... on 172.30.0.10:53: no such host
      2 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... EOF
      8 console RouteHealth_RouteNotAdmitted console route is not admitted
     16 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... context deadline exceeded (Client.Timeout exceeded while awaiting headers)

For example this 4.14 to 4.15 run had:

: [bz-Management Console] clusteroperator/console should not change condition/Available 
Run #0: Failed 	1h25m23s
{  1 unexpected clusteroperator state transitions during e2e test run 

Nov 28 03:42:41.207 - 1s    E clusteroperator/console condition/Available reason/RouteHealth_FailedGet status/False RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-op-d2qsp1gp-2a31d.aws-2.ci.openshift.org): Get "https://console-openshift-console.apps.ci-op-d2qsp1gp-2a31d.aws-2.ci.openshift.org": context deadline exceeded (Client.Timeout exceeded while awaiting headers)}

While a timeout for the console Route isn't fantastic, an issue that only persists for 1s is not long enough to warrant immediate admin intervention. Teaching the console operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required.
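
One way to picture "stay Available=True for a brief hiccup" is a grace period on consecutive failures. The following is a minimal Python sketch of that idea only. It is not the console operator's implementation (the operator is written in Go), and the `DebouncedAvailability` class and its 60-second default are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DebouncedAvailability:
    """Report Available=False only after failures persist for grace_seconds.

    Hypothetical helper for illustration; the 60-second default is an
    arbitrary assumption, not a value taken from the console operator.
    """
    grace_seconds: float = 60.0
    first_failure_at: Optional[float] = None

    def observe(self, healthy: bool, now: float) -> bool:
        """Record one health probe and return the Available status to report."""
        if healthy:
            # Any success resets the failure window
            self.first_failure_at = None
            return True
        if self.first_failure_at is None:
            # First failure of a streak: start the clock, stay Available
            self.first_failure_at = now
        # Flip to unavailable only once the streak outlasts the grace period
        return (now - self.first_failure_at) < self.grace_seconds
```

With this shape, the 1s RouteHealth_FailedGet blip from the example run would never surface as Available=False, while an outage that lasts longer than the grace period still would.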

Version-Release number of selected component

At least 4.15. Possibly other versions; I haven't checked.

How reproducible

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/console.*condition/Available.*status/False' | grep 'periodic.*failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 12 runs, 17% failed, 50% of failures match = 8% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-ppc64le (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 12 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 12 runs, 25% failed, 33% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 23% failed, 28% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 28% failed, 23% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade (all) - 63 runs, 38% failed, 8% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade (all) - 60 runs, 73% failed, 11% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 70 runs, 7% failed, 20% of failures match = 1% impact

Seems like it's primarily minor-version updates that trip this, and in jobs with high run counts, the impact percentage is single-digits.
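
The summary lines above can be parsed mechanically to rank jobs by impact. The sketch below is a hedged Python example: the regular expression is written against the line format shown in this report, and `estimated_impact` simply multiplies the two rounded percentages on the line. It is an assumption that the tool derives its reported impact from raw run counts, which would explain why the recomputed value can differ from the reported one by a point or so.

```python
import re

# Matches lines such as:
#   <job> (all) - 12 runs, 17% failed, 50% of failures match = 8% impact
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def parse_impact(line):
    """Parse one summary line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    row = {k: int(v) for k, v in m.groupdict().items() if k != "job"}
    row["job"] = m.group("job")
    # Approximate impact recomputed from the rounded percentages on the line
    row["estimated_impact"] = row["failed"] * row["match"] / 100
    return row


def rank_by_impact(lines):
    """Return parsed rows sorted from highest to lowest reported impact."""
    rows = [r for r in (parse_impact(line) for line in lines) if r is not None]
    return sorted(rows, key=lambda r: r["impact"], reverse=True)
```

For the 80-run aws-ovn job, 23% failed and 28% of failures matching gives an estimate of about 6.4%, in line with the reported 6% impact.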

Steps to reproduce

There may be a way to reliably trigger these hiccups, but as a reproducer floor, running days of CI and checking whether impact percentages decrease would be a good way to test fixes post-merge.

Actual results

Lots of console ClusterOperator going Available=False blips in 4.15 update CI.

Expected results

Console goes Available=False if and only if immediate admin intervention is appropriate.

https://github.com/openshift/console-operator/pull/834

Bug OCPBUGS-24895: Update 4.16 ose-openshift-apiserver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/openshift-apiserver/pull/409

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/openshift-apiserver/pull/409

Bug OCPBUGS-29363: TaskRuns list page is loading constantly for all projects

Description of problem:

    1. TaskRuns list page is loading constantly for all projects
    2. Archive icon is not displayed for some tasks in TaskRun list page
    3. On change of ns to All Projects, PipelineRuns and TaskRuns are not loading properly

Version-Release number of selected component (if applicable):

    4.15.z

How reproducible:

    Always

Steps to Reproduce:

    1. Create some TaskRuns
    2. Go to the TaskRun list page
    3. Select All Projects in the project dropdown

Actual results:

The screen keeps loading

Expected results:

     Should load TaskRuns from all projects

Additional info:

https://github.com/openshift/console/pull/13604

Bug OCPBUGS-30498: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-storage-operator/pull/463

Bug OCPBUGS-28879: vsphere-problem-detector-operator does not respect cluster wide proxy

Description of problem:

In a recently installed cluster running 4.13.29, after configuring the cluster-wide proxy, the "vsphere-problem-detector" does not pick up the proxy configuration.
Because the pod cannot reach vSphere, it fails to run checks:
2024-02-01T09:28:00.150332407Z E0201 09:28:00.150292       1 operator.go:199] failed to run checks: failed to connect to vsphere.local: Post "https://vsphere.local/sdk": dial tcp 172.16.1.3:443: i/o timeout  

The pod doesn't get the expected cluster proxy settings:
   - name: HTTPS_PROXY
     value: http://proxy.local:3128
   - name: HTTP_PROXY
     value: http://proxy.local:3128

Other storage-related pods get the expected configuration shown above.

This causes the vsphere-problem-detector to fail connections to vSphere, hence failing the health checks.
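
The "check the pod configuration" step can be automated with a small script. The following is a minimal Python sketch, assuming the pod spec has already been loaded as a dict (for example, from the JSON output of `oc get pod <name> -o json`). The `missing_proxy_env` helper is hypothetical, and it checks only the two variables quoted in this report.

```python
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY")


def missing_proxy_env(pod_spec, required=PROXY_VARS):
    """Map each container name to the required proxy variables it lacks.

    pod_spec: a pod's `spec` as a plain dict. Containers that define every
    required variable are omitted from the result.
    """
    missing = {}
    for container in pod_spec.get("containers", []):
        # `env` may be absent or null for containers with no variables
        defined = {env.get("name") for env in container.get("env") or []}
        lacking = [var for var in required if var not in defined]
        if lacking:
            missing[container.get("name", "<unnamed>")] = lacking
    return missing
```

An empty result means every container carries both variables; any entry points at a container that would bypass the proxy.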

Version-Release number of selected component (if applicable):

  4.13.29

How reproducible:

   Always

Steps to Reproduce:

    1. Configure the cluster-wide proxy in the environment.
    2. Wait for the change to be applied
    3. Check the pod configuration

Actual results:

    vSphere health checks failing

Expected results:

    vSphere health checks working through the cluster proxy

Additional info:

https://github.com/openshift/cluster-storage-operator/pull/457

Bug OCPBUGS-30586: CAPI E2Es failing to start in some CAPI provider's release branches

Description of problem:

CAPI E2Es failing to start in some CAPI provider's release branches.

Failing with the following error:

`go: errors parsing go.mod:94/tmp/tmp.ssf1LXKrim/go.mod:5: unknown directive: toolchain`

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-api/199/pull-ci-openshift-cluster-api-master-e2e-aws-capi-techpreview/1765512397532958720#1:build-log.txt%3A91-95

Version-Release number of selected component (if applicable):

4.15

How reproducible:

Always

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

  This is because the script that launches the e2e tests runs them from the `main` branch of the cluster-capi-operator (which has some backward-incompatible Go toolchain changes), rather than from the matching release branch.
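
The branch-matching idea can be sketched in a few lines. This Python snippet is illustrative only: `operator_branch_for` is a hypothetical helper, and it assumes release branches are named `release-X.Y` in both repositories. The actual launch script may select the branch differently.

```python
def operator_branch_for(provider_branch, default="main"):
    """Pick the cluster-capi-operator branch that matches the branch under test.

    Hypothetical helper: assumes release branches are named "release-X.Y"
    in both repositories, and that anything else should track `default`.
    """
    if provider_branch.startswith("release-"):
        # Use the same release branch so the Go toolchain expectations match
        return provider_branch
    return default
```

For example, a job running against `release-4.15` would check out `release-4.15` of the operator rather than `main`, avoiding the `unknown directive: toolchain` parse error on older Go versions.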

https://github.com/openshift/cluster-api/pull/200

Bug OCPBUGS-24833: Update 4.16 ose-libvirt-machine-controllers-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/273

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api-provider-libvirt/pull/273

Bug OCPBUGS-27366: hypershift needs different default APIs

To support external OIDC on hypershift, but not on self-managed, we need different schemas for the authentication CRD on a default-hypershift versus a default-self-managed. This requires us to change rendering so that it honors the clusterprofile.

Then we have to update the installer to match, then update hypershift, then update the manifests.

Bug OCPBUGS-24823: Update 4.16 ose-cloud-network-config-controller-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cloud-network-config-controller/pull/129

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-network-config-controller/pull/129

Bug OCPBUGS-25029: Update 4.16 openshift-enterprise-haproxy-router-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/router/pull/547

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/router/pull/547

Bug OCPBUGS-23811: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/csi-node-driver-registrar/pull/58

Bug OCPBUGS-24044: Converting load balancer service from internal scope to external keeps internal load balancer IP on GCP

Reproducer:
1. On a GCP cluster, create an ingress controller with internal load balancer scope, like this:

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: foo
  namespace: openshift-ingress-operator
spec:
  domain: foo.<cluster-domain>
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      dnsManagementPolicy: Managed
      scope: Internal

2. Wait for load balancer service to complete rollout

$ oc -n openshift-ingress get service router-foo
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
router-foo LoadBalancer 172.30.101.233 10.0.128.5 80:32019/TCP,443:32729/TCP 81s

3. Edit ingress controller to set spec.endpointPublishingStrategy.loadBalancer.scope to External

The load balancer service (router-foo in this case) should get an external IP address, but currently it keeps the 10.x.x.x address that was already assigned.
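
A quick way to verify the symptom is to test whether the service's EXTERNAL-IP is still a private address after the scope change. The sketch below uses Python's standard `ipaddress` module; the `looks_internal` helper is a hypothetical name chosen for this example.

```python
import ipaddress


def looks_internal(external_ip):
    """Return True if the address is private or otherwise not globally routable.

    `is_private` covers the RFC 1918 ranges (such as 10.0.0.0/8) as well as
    other special-purpose ranges like loopback and link-local.
    """
    return ipaddress.ip_address(external_ip).is_private
```

With the address from the reproducer, `looks_internal("10.0.128.5")` is true, which is the state that should no longer hold once the scope is External.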

https://github.com/openshift/cloud-provider-gcp/pull/40

Bug MGMT-15886: [STG] Numeric dns name 11.11.11 is now accepted in BE as valid dns name

Description of the problem:

Until the latest release, we had a test that tried to set the DNS name to 11.11.11 and expected the BE to throw an exception; the test was succeeding (the BE was throwing the exception).
Since the last release, this is no longer the case, and the DNS name is being accepted as valid.

How reproducible:

Steps to reproduce:

1. Create a cluster and set the base DNS name to 11.11.11

Actual results:

No exception is thrown; the numeric DNS name is accepted.

Expected results:
According to the discussion in the thread, a DNS name must start with a letter, so we expect the BE to throw an exception.
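
The rule mentioned in the thread, that a DNS name must start with a letter, can be sketched as a small validator. This Python example is an assumption-laden illustration and not the assisted-service validation: it applies the letter-first rule to every dot-separated label (RFC 1035 style, which is stricter than RFC 1123), and the `is_valid_base_dns_name` name is hypothetical.

```python
import re

# One DNS label: starts with a letter, ends with a letter or digit,
# may contain hyphens in between, at most 63 characters.
LABEL_RE = re.compile(r"^[a-z]([a-z0-9-]{0,61}[a-z0-9])?$")


def is_valid_base_dns_name(name):
    """Return True if every dot-separated label is valid and starts with a letter."""
    if not name or len(name) > 253:
        return False
    # An empty label (from "a..b" or a trailing dot) fails the regex as well
    return all(LABEL_RE.match(label) is not None for label in name.lower().split("."))
```

Under this rule `11.11.11` is rejected because each label starts with a digit, while a name such as `example.com` passes.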

https://github.com/openshift/assisted-service/pull/5801

Task MON-3552: Get rid of temp code for kubelet custom SM

See TODOs in https://github.com/openshift/cluster-monitoring-operator/pull/2160/files

Remove the DeleteConfigMapByNamespaceAndName util (if not used by others)

https://github.com/openshift/cluster-monitoring-operator/pull/2210

Bug OCPBUGS-22410: Install on vSphere using relative path for datastore is not backwards compatible on OCP 4.13+

Description of problem:

Original issue reported here: https://issues.redhat.com/browse/ACM-6189 reported by QE and customer.

Using ACM/hive, customers can deploy OpenShift on vSphere. In the upcoming release of ACM 2.9, we support customers on OCP 4.12 - 4.15. The ACM UI updates the install config as users add configuration details.

This has worked for several releases over the last few years. However in OCP 4.13+ the format has changed and there is now additional validation to check if the datastore is a full path.

As per https://issues.redhat.com/browse/SPLAT-1093, removal of the legacy fields should not happen until later, so any legacy configurations such as relative paths should still work.

Version-Release number of selected component (if applicable):

ACM 2.9.0-DOWNSTREAM-2023-10-24-01-06-09
OpenShift 4.14.0-rc.7
OpenShift 4.13.18
OpenShift 4.12.39

How reproducible:

Always

Steps to Reproduce:

1. Deploy OCP 4.12 on vSphere using legacy field and relative path without folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS)
2. Installer passes.
3. Deploy OCP 4.12 on vSphere using legacy field and relative path WITH folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS-Folder/WORKLOAD-DS)
4. Installer fails.
5. Deploy OCP 4.12 on vSphere using legacy field and FULL path (e.g. platform.vsphere.defaultDatastore: /Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS)
6. Installer fails.

7. Deploy OCP 4.13 on vSphere using legacy field and relative path without folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS)
8. Installer fails.
9. Deploy OCP 4.13 on vSphere using legacy field and relative path WITH folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS-Folder/WORKLOAD-DS)
10. Installer passes.
11. Deploy OCP 4.13 on vSphere using legacy field and FULL path (e.g. platform.vsphere.defaultDatastore: /Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS)
12. Installer fails.

Actual results:

Default Datastore Value	OCP 4.12	OCP 4.13	OCP 4.14
`/Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS`	No	Yes	Yes
`WORKLOAD-DS-Folder/WORKLOAD-DS`	No	Yes	Yes
`WORKLOAD-DS`	Yes	No	No

For OCP 4.12.z managed cluster deployments, the name-only path is the only one that works as expected.
For OCP 4.13.z+ managed cluster deployments, only the full path and the relative path with folder work as expected.

Expected results:

OCP 4.13.z+ takes relative path without specifying the folder like OCP 4.12.z does.
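
The three datastore forms and the behaviour recorded in the Actual results table can be captured in a small lookup, which is handy when writing a regression test for the fix. This is a minimal Python sketch: `classify_datastore` and `observed_to_work` are hypothetical helpers, and the `OBSERVED` table is transcribed from this report rather than derived from installer code.

```python
def classify_datastore(value):
    """Classify a defaultDatastore value into one of the three reported forms."""
    if value.startswith("/"):
        return "full-path"
    if "/" in value:
        return "relative-with-folder"
    return "name-only"


# Observed behaviour, transcribed from the Actual results table (True = works).
OBSERVED = {
    "full-path":            {"4.12": False, "4.13": True,  "4.14": True},
    "relative-with-folder": {"4.12": False, "4.13": True,  "4.14": True},
    "name-only":            {"4.12": True,  "4.13": False, "4.14": False},
}


def observed_to_work(value, version):
    """Look up whether this form of value was observed to install on version."""
    return OBSERVED[classify_datastore(value)][version]
```

After the fix, the expectation in this report is that the `name-only` row would read true for 4.13 and later as well.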

Additional info:

https://github.com/openshift/installer/pull/7931

Bug OCPBUGS-25544: Update 4.16 csi-node-driver-registrar-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/62

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-node-driver-registrar/pull/62

Bug OCPBUGS-24903: Update 4.16 ose-csi-driver-shared-resource-webhook-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/158

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-driver-shared-resource/pull/158

Bug OCPBUGS-22556: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-azure/pull/88

Bug OCPBUGS-26490: Update 4.16 ose-network-tools-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/network-tools/pull/108

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/network-tools/pull/108

Bug OCPBUGS-30301: Guest nodes can't join the cluster with NodePort publish strategy

Description of problem:

When creating a HostedCluster with the 'NodePort' service publishing strategy, the VMs (guest nodes) try to contact HCP services, such as ignition and oauth. If these services are colocated on the same infra node, they can't be reached via NodePort because the 'virt-launcher' NetworkPolicy blocks the traffic.
We need to explicitly allow access to the oauth and ignition-server-proxy pods so they can be reached from the virtual machines on the same node.

Version-Release number of selected component (if applicable):

4.16.0

How reproducible:

Always, if conditions are met

Steps to Reproduce:

    1. As described above
    2.
    3.

Actual results:

VMs are not joining the cluster as nodes if the ignition server is running on the same infra node as the VM.

Expected results:

All VMs are joining the cluster as nodes, and the HostedCluster is eventually Completed and Available

Additional info:

https://github.com/openshift/hypershift/pull/3680

Bug OCPBUGS-26594: potential regression: [sig-arch] events should not repeat pathologically for ns/openshift-monitoring

Component Readiness has found a potential regression in [sig-arch] events should not repeat pathologically for ns/openshift-monitoring.

Probability of significant regression: 100.00%

Sample (being evaluated) Release: 4.15
Start Time: 2024-01-04T00:00:00Z
End Time: 2024-01-10T23:59:59Z
Success Rate: 42.31%
Successes: 11
Failures: 15
Flakes: 0

Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 100.00%
Successes: 151
Failures: 0
Flakes: 0
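
The success rates above follow directly from the run counts. As a worked check in Python (a sketch; how flakes are weighted is not stated here, and both windows report zero flakes, so the hypothetical `success_rate` helper ignores them):

```python
def success_rate(successes, failures):
    """Percentage of runs that passed, rounded to two decimals."""
    total = successes + failures
    return round(successes / total * 100, 2)


sample = success_rate(11, 15)   # 4.15 sample window
base = success_rate(151, 0)     # 4.14 base window
print(sample, base)
```

The sample window has 11 passes out of 26 runs, which matches the reported 42.31%, against 151 out of 151 for the base.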

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2023-10-31%2023%3A59%3A59&baseRelease=4.14&baseStartTime=2023-10-04%2000%3A00%3A00&capability=Other&component=Monitoring&confidence=95&environment=sdn%20no-upgrade%20amd64%20gcp%20serial&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=sdn&network=sdn&pity=5&platform=gcp&platform=gcp&sampleEndTime=2024-01-10%2023%3A59%3A59&sampleRelease=4.15&sampleStartTime=2024-01-04%2000%3A00%3A00&testId=openshift-tests%3A567152bb097fa9ce13dd2fb6885e094a&testName=%5Bsig-arch%5D%20events%20should%20not%20repeat%20pathologically%20for%20ns%2Fopenshift-monitoring&upgrade=no-upgrade&upgrade=no-upgrade&variant=serial&variant=serial

Bug OCPBUGS-26767: Image registry operator does not support new PowerVS regions

Description of problem:

[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc get co/image-registry
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry             False       True          True       50m     Available: The deployment does not exist...
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc describe co/image-registry
...
    Message:               Progressing: Unable to apply resources: unable to sync storage configuration: cos region corresponding to a powervs region wdc not found
...

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-ppc64le-2024-01-10-083055

How reproducible:

Always

Steps to Reproduce:

    1. Deploy a PowerVS cluster in wdc06 zone

Actual results:

See above error message

Expected results:

Cluster deploys
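
The error message suggests a lookup from PowerVS region to COS region that has no entry for `wdc`. The Python sketch below illustrates that failure mode only. The `POWERVS_TO_COS_REGION` table and its values are assumptions for this example and are not copied from the image registry operator.

```python
# Illustrative, incomplete mapping; the entries are assumptions for this sketch.
POWERVS_TO_COS_REGION = {
    "dal": "us-south",
}


def cos_region_for(powervs_region, mapping=POWERVS_TO_COS_REGION):
    """Return the COS region for a PowerVS region, or raise if it is unknown."""
    try:
        return mapping[powervs_region]
    except KeyError:
        raise ValueError(
            f"cos region corresponding to a powervs region {powervs_region} not found"
        ) from None
```

Calling `cos_region_for("wdc")` with this table raises the same kind of error seen in the operator status; supporting a new region then amounts to adding its entry to the mapping.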

https://github.com/openshift/cluster-image-registry-operator/pull/987

Bug OCPBUGS-28856: Modal dialogs expose code that is not null object safe

Description of problem:

When using the modal dialogs in a hook as part of the actions hook (i.e. useApplicationsActionsProvider), the console throws an error because the console framework passes null objects as part of the render cycle. According to Jon Jackson, the console should be safe from null objects, but it looks like the code for useDeleteModal and getGroupVersionKindForResource is not safe.

Version-Release number of selected component (if applicable):

How reproducible:

   Always

Steps to Reproduce:

    1. Use one of the modal APIs in an actions provider hook
    2.
    3.

Actual results:

    Caught error in a child component: TypeError: Cannot read properties of undefined (reading 'split')
    at i (main-chunk-9fbeef79a…d3a097ed.min.js:1:1)
    at u (main-chunk-9fbeef79a…d3a097ed.min.js:1:1)
    at useApplicationActionsProvider (useApplicationActionsProvider.tsx:23:43)
    at ApplicationNavPage (ApplicationDetails.tsx:38:67)
    at na (vendors~main-chunk-8…87b.min.js:174297:1)
    at Hs (vendors~main-chunk-8…87b.min.js:174297:1)
    at Sc (vendors~main-chunk-8…87b.min.js:174297:1)
    at Cc (vendors~main-chunk-8…87b.min.js:174297:1)
    at _c (vendors~main-chunk-8…87b.min.js:174297:1)
    at pc (vendors~main-chunk-8…87b.min.js:174297:1)

Expected results:

    Works with no error

Additional info:

https://github.com/openshift/console/pull/13574

Bug OCPBUGS-22438: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/origin/pull/28421

Bug OCPBUGS-29196: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-24859: Update 4.16 ose-haproxy-router-base-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/router/pull/546

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/router/pull/546

Bug OCPBUGS-28676: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/2238

Bug OCPBUGS-23550: Unable to use oc-mirror on RHEL9 Host with FIPS enabled OCP cluster

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

Bug OCPBUGS-28383: Activity card expanded content is not left aligned correctly

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/console/pull/13547

Bug OCPBUGS-25553: Update 4.16 ose-gcp-pd-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/56

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/gcp-pd-csi-driver/pull/56

Bug OCPBUGS-27094: [regression] increased etcd leader elections significantly impacting vsphere amd64 platform

Description of problem:

Based on this and this component readiness data that compares success rates for those two particular tests, we are regressing ~7-10% between the current 4.15 master and 4.14.z (in other words, we made the product ~10% worse).

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-upi-serial/1720630313664647168

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-serial/1719915053026643968

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-upi-serial/1721475601161785344

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-serial/1724202075631390720

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-upi-serial/1721927613917696000

These jobs and their failures are all caused by increased etcd leader elections disrupting seemingly unrelated test cases across the vSphere AMD64 platform.

Since this particular platform's business significance is high, I'm setting this as "Critical" severity.

Please get in touch with me or Dean West if more teams need to be pulled into investigation and mitigation.

Version-Release number of selected component (if applicable):

4.15 / master

How reproducible:

Component Readiness Board

Actual results:

The etcd leader elections are elevated. Some jobs indicate it is due to disk i/o throughput OR network overload.

Expected results:

1. We NEED to understand what is causing this problem.
2. If we can mitigate this, we should.
3. If we cannot mitigate this, we need to document this or work with the vSphere infrastructure provider to fix this problem.
4. We optionally need a way to measure how often this happens in our fleet so we can evaluate how bad it is.

Additional info:

Story TRT-1503: gcp jobs failing with /tmp/google-cloud-sdk/bin/gcloud: line 129: exec: python: not

4.16 payloads are failing because of multiple issues. One of them is due to a missing python module on RHEL9.

Here is a slack thread on it: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1707744121181479

PR for the fix:

https://github.com/openshift/oc/pull/1682

https://github.com/openshift/oc/pull/1682

Bug OCPBUGS-24855: Update 4.16 ose-csi-driver-shared-resource-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-driver-shared-resource-operator/pull/94

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-driver-shared-resource-operator/pull/94

Bug OCPBUGS-26765: Frequent SAST false positives

Description of problem:

The SAST scans keep coming up with bogus positive results from test and vendor files. This bug is just a placeholder to allow us to backport the change to ignore those files.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/baremetal-runtimecfg/pull/292

Bug OCPBUGS-27397: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/coredns/pull/109

Bug OCPBUGS-29482: Power VS: All deploys are failing due to terraform-provider-ibm

Description of problem:

    A change to how Power VS Workspaces are queried is not compatible with the version of terraform-provider-ibm

Version-Release number of selected component (if applicable):

How reproducible:

    Easily

Steps to Reproduce:

    1. Try to deploy with Power VS
    2. Fail with an error stating that [ERROR] Error retrieving service offering: ServiceDoesnotExist: Given service : "power-iaas" doesn't exist

Actual results:

    Fail with [ERROR] Error retrieving service offering: ServiceDoesnotExist: Given service : "power-iaas" doesn't exist

Expected results:

    Install should succeed.

Additional info:

https://github.com/openshift/installer/pull/8023

Bug OCPBUGS-29519: CAPI manifests missing CustomNoUpgrade annotation

Description of problem:

CAPI manifests have the TechPreviewNoUpgrade annotation but are missing the CustomNoUpgrade annotation

Bug OCPBUGS-28659: HyperShift KAS config should set ValidatingAdmissionPolicy plugin

Description of problem:

    The ValidatingAdmissionPolicy admission plugin is set in OpenShift 4.14+ kube-apiserver config, but is missing from the HyperShift config. It should be set.

Actual results:

    4.15: https://github.com/openshift/hypershift/blob/release-4.15/control-plane-operator/controllers/hostedcontrolplane/kas/config.go#L293-L341

    4.14: https://github.com/openshift/hypershift/blob/release-4.14/control-plane-operator/controllers/hostedcontrolplane/kas/config.go#L283-L331

Expected results:

    Expect to see ValidatingAdmissionPolicy
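
A minimal sketch of the expected behavior, assuming the enabled admission plugins are held as a plain string slice. The helper name `ensurePlugin` and the sample plugin list are hypothetical and are not taken from the HyperShift code linked above:

```go
package main

import "fmt"

// ensurePlugin returns plugins with name appended if it is not already present.
// Hypothetical helper; the real HyperShift config code may be structured differently.
func ensurePlugin(plugins []string, name string) []string {
	for _, p := range plugins {
		if p == name {
			return plugins
		}
	}
	return append(plugins, name)
}

func main() {
	// Illustrative plugin list only; not the actual HyperShift defaults.
	enabled := []string{"NamespaceLifecycle", "ValidatingAdmissionWebhook"}
	enabled = ensurePlugin(enabled, "ValidatingAdmissionPolicy")
	fmt.Println(enabled)
}
```

Calling the helper twice leaves the list unchanged the second time, so it is safe to apply on every reconcile.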

Additional info:

https://github.com/openshift/hypershift/pull/3488

Bug OCPBUGS-23004: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-alibaba-cloud/pull/37

Bug OCPBUGS-25576: Update 4.16 csi-attacher-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/69

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-attacher/pull/69

Bug OCPBUGS-23378: PF5 bubble component with wrong layout in Create NetworkPolicy page

Description of problem:

The bubble box has the wrong layout.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-11-16-110328

How reproducible:

Always

Steps to Reproduce:

1. Make sure there are no pods in the project you are using.
2. Navigate to Networking -> NetworkPolicies -> Create NetworkPolicy page and click 'affected pods' in the Pod selector section.
3. Check the layout in the bubble component.

Actual results:

The layout is incorrect (shared file: https://drive.google.com/file/d/1I8e2ZkiFO2Gu4nSt9kJ6JmRG3LdvkE-u/view?usp=drive_link )

Expected results:

The layout should be correct.

Additional info:

https://github.com/openshift/console/pull/13429

Bug OCPBUGS-24382: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/2262

Bug OCPBUGS-24575: GCP: The installer doesn’t precheck if node architecture and vm type are consistent

Description of problem:

The installer does not precheck whether the node architecture and VM type are consistent on AWS and GCP; the check works on Azure.

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-multi-2023-12-06-195439

How reproducible:

   Always

Steps to Reproduce:

    1. Set the compute architecture field to arm64 but choose an amd64 instance type as the VM type in install-config
    2. Create the cluster
    3. Check the installation

Actual results:

Azure prechecks whether the architecture is consistent with the instance type when creating manifests, for example:
12-07 11:18:24.452 [INFO] Generating manifests files.....
12-07 11:18:24.452 level=info msg=Credentials loaded from file "/home/jenkins/ws/workspace/ocp-common/Flexy-install/flexy/workdir/azurecreds20231207-285-jd7gpj"
12-07 11:18:56.474 level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: controlPlane.platform.azure.type: Invalid value: "Standard_D4ps_v5": instance type architecture 'Arm64' does not match install config architecture amd64

AWS and GCP have no such precheck, so the installation fails partway through, after many resources have already been created. This is more likely to happen in multi-arch clusters.

Expected results:

The installer should precheck the architecture against the VM type, especially on platforms that support heterogeneous clusters (AWS, GCP, Azure).
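
A rough sketch of such a precheck, modeled on the Azure error message quoted above. The function name `checkArch`, the lookup table, the instance type names and the field path are all hypothetical; a real installer would obtain the instance type architecture from the cloud provider rather than a hard-coded map:

```go
package main

import "fmt"

// checkArch compares the architecture requested in the install config with the
// architecture of the chosen instance type. typeArch is a caller-supplied lookup
// table standing in for a cloud provider query.
func checkArch(field, instanceType, wantArch string, typeArch map[string]string) error {
	gotArch, ok := typeArch[instanceType]
	if !ok {
		return fmt.Errorf("%s: Invalid value: %q: unknown instance type", field, instanceType)
	}
	if gotArch != wantArch {
		return fmt.Errorf("%s: Invalid value: %q: instance type architecture '%s' does not match install config architecture %s",
			field, instanceType, gotArch, wantArch)
	}
	return nil
}

func main() {
	// Hypothetical instance types, for illustration only.
	typeArch := map[string]string{"example-amd64-type": "amd64", "example-arm64-type": "arm64"}
	err := checkArch("compute[0].platform.gcp.type", "example-amd64-type", "arm64", typeArch)
	fmt.Println(err)
}
```

Running the check while the install config is loaded means the mismatch is reported before any cloud resources are created.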

Additional info:

https://github.com/openshift/installer/pull/7850

Bug OCPBUGS-25549: Update 4.16 ose-aws-ebs-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/250

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/aws-ebs-csi-driver/pull/250

Story TRT-1557: Attribute unknown cert tests to kube-apiserver

[sig-arch][Late] collect certificate data [Suite:openshift/conformance/parallel]

Test is currently making the Unknown component red, but this test should be aligned to the kube-apiserver component. Looks like two others in the same file should be as well.

https://github.com/openshift/origin/pull/28649

Bug OCPBUGS-25703: oc tag command not having regexp check towards tag names cause OADP backup image fail

Description of problem:

1) The customer tagged an image with a tag name that includes # (hash):

uk302-img-app-j:v0.6.12-build0000#000

2) When the customer used OADP to back up images, they got the error below:

error excuting custom action(groupResource=imagestream.image.openshift.io namespace=dbp-p0010001, name=uk302-image-app-j): rpc error: code= Unknown= Invalid destination name udistribution-s3-c9814a92-67a4-4251-bd0d-142dfc4d3c80://dbp-p0010001/uk302-image-app-j:v0.6.12-build0000#00: invalid reference format

3) Checking the source code below, we found that the tag name is validated against a regexp, and # (hash) does not appear to be allowed by it:

https://github.com/openshift/openshift-velero-plugin/blob/83f5067b1e04d740cd79ee0046e24283a8d7a184/velero-plugins/imagecopy/imagestream.go#L138

func copyImage(log logr.Logger, src, dest string, copyOptions *copy.Options) ([]byte, error) {
    policyContext, err := getPolicyContext()
    if err != nil {
        return []byte{}, fmt.Errorf("Error loading trust policy: %v", err)
    }
    defer policyContext.Destroy()
    srcRef, err := alltransports.ParseImageName(src)
    if err != nil {
        return []byte{}, fmt.Errorf("Invalid source name %s: %v", src, err)
    }
    destRef, err := alltransports.ParseImageName(dest)
    if err != nil {
        return []byte{}, fmt.Errorf("Invalid destination name %s: %v", dest, err)
    }

https://github.com/containers/image/blob/main/docker/reference/regexp.go#L111

const (
    // alphaNumeric defines the alpha numeric atom, typically a
    // component of names. This only allows lower case characters and digits.
    alphaNumeric = `[a-z0-9]+`

    // separator defines the separators allowed to be embedded in name
    // components. This allow one period, one or two underscore and multiple
    // dashes. Repeated dashes and underscores are intentionally treated
    // differently. In order to support valid hostnames as name components,
    // supporting repeated dash was added. Additionally double underscore is
    // now allowed as a separator to loosen the restriction for previously
    // supported names.
    separator = `(?:[._]|__|[-]*)`

    // repository name to start with a component as defined by DomainRegexp
    // and followed by an optional port.
    domainComponent = `(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9])`

    // The string counterpart for TagRegexp.
    tag = `[\w][\w.-]{0,127}`

    // The string counterpart for DigestRegexp.
    digestPat = `[A-Za-z][A-Za-z0-9]*(?:[-_+.][A-Za-z][A-Za-z0-9]*)*[:][[:xdigit:]]{32,}`

    // The string counterpart for IdentifierRegexp.
    identifier = `([a-f0-9]{64})`

    // The string counterpart for ShortIdentifierRegexp.
    shortIdentifier = `([a-f0-9]{6,64})`

Expected results:

The customer wants to know whether this should be considered a bug: `oc tag` should validate the tag name so that # (hash) or other disallowed characters cannot be set in an image tag, where they cause unexpected failures in OADP and other tools.
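
As an illustration of the requested check, the tag expression quoted above can be anchored and applied to a candidate tag. This is only a sketch: the function name `validTag` is hypothetical, and it does not reproduce how `oc tag` or the linked fix actually validates input.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagPattern anchors the quoted tag expression so the whole string must match.
var tagPattern = regexp.MustCompile(`^[\w][\w.-]{0,127}$`)

// validTag reports whether tag is accepted by the anchored pattern.
func validTag(tag string) bool {
	return tagPattern.MatchString(tag)
}

func main() {
	for _, tag := range []string{"v0.6.12-build0000", "v0.6.12-build0000#000"} {
		fmt.Printf("%q valid: %v\n", tag, validTag(tag))
	}
}
```

Because `#` is neither a word character, a period, nor a dash, the customer's tag is rejected while the same tag without the `#000` suffix passes.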

Please take a look, thank you!

Regards
Jacob

https://github.com/openshift/oc/pull/1643

Bug OCPBUGS-26147: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/prometheus-operator/pull/269

Bug OCPBUGS-24515: Observer -> Alerting, Metrics and Targets page does not load

Description of problem:

    The Observe -> Alerting, Metrics, and Targets pages do not load as expected; a blank page is shown

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2023-12-07-041003

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate directly to the Observe -> Alerting, Metrics, and Targets pages

Actual results:

    Blank page; no data is loaded

Expected results:

    The pages load normally

Additional info:

 Failed to load resource: the server responded with a status of 404 (Not Found)
/api/accounts_mgmt/v1/subscriptions?page=1&search=external_cluster_id%3D%2715ace915-53d3-4455-b7e3-b7a5a4796b5c%27:1

Failed to load resource: the server responded with a status of 403 (Forbidden)
main-chunk-bb9ed989a7f7c65da39a.min.js:1 API call to get support level has failed r: Access denied due to cluster policy.
    at https://console-openshift-console.apps.ci-ln-9fl1l5t-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-bb9ed989a7f7c65da39a.min.js:1:95279
(anonymous) @ main-chunk-bb9ed989a7f7c65da39a.min.js:1
/api/kubernetes/apis/operators.coreos.com/v1alpha1/namespaces/#ALL_NS#/clusterserviceversions?:1
        
        
       Failed to load resource: the server responded with a status of 404 (Not Found)
vendor-patternfly-5~main-chunk-95cb256d9fa7738d2c46.min.js:1 Modal: When using hasNoBodyWrapper or setting a custom header, ensure you assign an accessible name to the the modal container with aria-label or aria-labelledby.

https://github.com/openshift/monitoring-plugin/pull/84

Bug OCPBUGS-24601: Add minReadySeconds to network-node-identity

Description of problem:

    The network-node-identity deployment should set minReadySeconds to assist in a controlled rollout of the microservice pods. The general goal is to have a microservice pod report to Kubernetes as ready only when it has completed initialization and is stable enough to complete tasks.
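
As background for the `minReadySeconds` setting named in the title: a pod counts as available only after it has stayed ready for longer than that many seconds, which slows a rollout down enough to catch pods that crash shortly after starting. A simplified sketch of that rule follows; the function name and the integer-second inputs are assumptions made for illustration, not the Kubernetes implementation:

```go
package main

import "fmt"

// available reports whether a pod that has been continuously ready for
// readyForSeconds counts as available under the given minReadySeconds.
func available(readyForSeconds, minReadySeconds int) bool {
	if minReadySeconds == 0 {
		return true // no extra waiting period requested
	}
	return readyForSeconds > minReadySeconds
}

func main() {
	fmt.Println(available(5, 10))  // still inside the waiting period
	fmt.Println(available(15, 10)) // ready for longer than minReadySeconds
}
```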

https://github.com/openshift/cluster-network-operator/pull/2151

Bug OCPBUGS-24968: Update 4.16 ose-vmware-vsphere-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/103

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/vmware-vsphere-csi-driver/pull/103

Bug OCPBUGS-26937: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/insights-operator/pull/899

Story HELM-522: Bump Helm version to 3.13 in ODC

Owner: Architect:

Story (Required)

As an ODC Helm backend developer, I would like to bump the version of Helm to 3.13 to stay in sync with the version we will ship with OCP 4.15.

Background (Required)

Normal activity we do every time a new OCP version is released, to stay current.

Glossary

Out of scope

Approach(Required)

Bump the version of Helm to 3.13; run, build, and unit test; and make sure everything works as expected. Last time we had a conflict with the DevFile backend.

Dependencies

Might have dependencies on the DevFile team to move some dependencies forward.

Edge Case

Acceptance Criteria

Console Helm dependency is moved to 3.13

INVEST Checklist

Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated

https://github.com/openshift/console/pull/13410

Bug OCPBUGS-23544: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-machine-approver/pull/222

Bug OCPBUGS-24592: Re-enable and fix broken-tests after PatternFly 5 update

As part of the PatternFly update from 5.1.0 to 5.1.1 it was required to disable some dev console e2e tests.

See https://github.com/openshift/console/pull/13380

We need to re-enable and adapt at least these tests:

In frontend/packages/dev-console/integration-tests/features/addFlow/create-from-devfile.feature and in frontend/packages/dev-console/integration-tests/features/e2e/add-flow-ci.feature

Scenario: Deploy git workload with devfile from topology page: A-04-TC01

In frontend/packages/helm-plugin/integration-tests/features/helm-release.feature and frontend/packages/helm-plugin/integration-tests/features/helm/actions-on-helm-release.feature:

Scenario: Context menu options of helm release: HR-01-TC01

Can we please also check why both broken tests exist in two feature files? 🤷

https://github.com/openshift/console/pull/13464

Bug OCPBUGS-27928: Update 4.16 coredns-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/coredns/pull/111

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/coredns/pull/111

Bug OCPBUGS-24888: Update 4.16 ose-cluster-openshift-controller-manager-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/321

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/321

Bug OCPBUGS-29104: HCP Has No Signer For SRE Break-Glass Access

Description of problem:

    Only customers have a break-glass certificate signer.

Version-Release number of selected component (if applicable):

    4.16.0

How reproducible:

    Always

Steps to Reproduce:

    1. Create a CSR with any other signer chosen
    2. It does not work

Actual results:

    does not work

Expected results:

    should work

Additional info:

https://github.com/openshift/hypershift/pull/3542

Bug OCPBUGS-25530: The filter component should not exist in resource section on the search page with Phone View

Description of problem:

    The non-functional filter component should not appear in the resource section of the search page in phone view

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2023-12-15-211129

How reproducible:

    Always

Steps to Reproduce:

    1. Change to phone view for the browser (Browser -> F12 - Toggle device toolbar)
       eg: iPhone 14 Pro Max
    2. Navigate to Home -> Search page, select one resource
       eg: APIRequestCounts
    3. Check the component in the resources panel

Actual results:

   There is a non-functional filter icon under the 'Create APIRequestCount' button

Expected results:

    Remove the filter component in phone view,
    or make sure the filter works in phone view if customers need it

Additional info:

    https://drive.google.com/file/d/1Fwb8EGznWkA1z3cVVzGcJJjMFkjkuUhK/view?usp=drive_link

https://github.com/openshift/console/pull/13456

Bug OCPBUGS-28535: CCO Pod crashes on BM cluster when AWS Root Credential exists

Description of problem:

Similar to https://bugzilla.redhat.com/show_bug.cgi?id=1996624, when the AWS root credential (which must possess the "iam:SimulatePrincipalPolicy" permission) exists on a BM cluster, the CCO Pod crashes when running the secretannotator controller.

Steps to Reproduce:

1. Install a BM cluster
fxie-mac:cloud-credential-operator fxie$ oc get infrastructures.config.openshift.io cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2024-01-28T19:50:05Z"
  generation: 1
  name: cluster
  resourceVersion: "510"
  uid: 45bc2a29-032b-4c74-8967-83c73b0141c4
spec:
  cloudConfig:
    name: ""
  platformSpec:
    type: None
status:
  apiServerInternalURI: https://api-int.fxie-bm1.qe.devcluster.openshift.com:6443
  apiServerURL: https://api.fxie-bm1.qe.devcluster.openshift.com:6443
  controlPlaneTopology: SingleReplica
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: fxie-bm1-x74wn
  infrastructureTopology: SingleReplica
  platform: None
  platformStatus:
    type: None 

2. Create an AWS user with IAMReadOnlyAccess permissions:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "iam:GenerateCredentialReport",
                "iam:GenerateServiceLastAccessedDetails",
                "iam:Get*",
                "iam:List*",
                "iam:SimulateCustomPolicy",
                "iam:SimulatePrincipalPolicy"
            ],
            "Resource": "*"
        }
    ]
}

3. Create AWS root credentials with a set of access keys of the user above
4. Trigger a reconcile of the secretannotator controller, e.g. by editing cloudcredential/cluster

Logs:

time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:CreateAccessKey" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:CreateUser" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:DeleteAccessKey" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:DeleteUser" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:DeleteUserPolicy" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:PutUserPolicy" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:TagUser" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Tested creds not able to perform all requested actions" controller=secretannotator
I0129 04:47:27.988535 1 reflector.go:289] Starting reflector *v1.Infrastructure (10h37m20.569091933s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233
I0129 04:47:27.988546 1 reflector.go:325] Listing and watching *v1.Infrastructure from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233
I0129 04:47:27.989503 1 reflector.go:351] Caches populated for *v1.Infrastructure from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1a964a0]

goroutine 341 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:115 +0x1e5
panic({0x3fe72a0?, 0x809b9e0?})
/usr/lib/golang/src/runtime/panic.go:914 +0x21f
github.com/openshift/cloud-credential-operator/pkg/operator/utils/aws.LoadInfrastructureRegion({0x562e1c0?, 0xc002c99a70?}, {0x5639ef0, 0xc0001b6690})
/go/src/github.com/openshift/cloud-credential-operator/pkg/operator/utils/aws/utils.go:72 +0x40
github.com/openshift/cloud-credential-operator/pkg/operator/secretannotator/aws.(*ReconcileCloudCredSecret).validateCloudCredsSecret(0xc0008c2000, 0xc002586000)
/go/src/github.com/openshift/cloud-credential-operator/pkg/operator/secretannotator/aws/reconciler.go:206 +0x1a5
github.com/openshift/cloud-credential-operator/pkg/operator/secretannotator/aws.(*ReconcileCloudCredSecret).Reconcile(0xc0008c2000, {0x30?, 0xc000680c00?}, {0x4f38a3d?, 0x0?}, {0x4f33a20?, 0x416325?})
/go/src/github.com/openshift/cloud-credential-operator/pkg/operator/secretannotator/aws/reconciler.go:166 +0x605
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x561ff20?, {0x561ff20?, 0xc002ff3b00?}, {0x4f38a3d?, 0x3b180c0?}, {0x4f33a20?, 0x55eea08?})
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000189360, {0x561ff58, 0xc0007e5040}, {0x4589f00?, 0xc000570b40?})
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314 +0x365
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000189360, {0x561ff58, 0xc0007e5040})
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265 +0x1c9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 183
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:222 +0x565

Actual results:

CCO Pod crashes and restarts in a loop:
fxie-mac:cloud-credential-operator fxie$ oc get po -n openshift-cloud-credential-operator -w
NAME                                         READY   STATUS    RESTARTS        AGE
cloud-credential-operator-657bdffdff-9wzrs   2/2     Running   3 (2m35s ago)   8h
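
The stack trace points at `LoadInfrastructureRegion`, and the Infrastructure object above has `platformStatus.type: None` with no AWS section. A plausible reading, which is an assumption since the trace does not show the failing expression, is that AWS-specific status is dereferenced without a nil check. The sketch below shows the defensive pattern with simplified stand-in types; the type and function names are illustrative and do not reproduce the openshift/api definitions or the actual fix:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the Infrastructure status types.
type AWSPlatformStatus struct{ Region string }

type PlatformStatus struct {
	Type string
	AWS  *AWSPlatformStatus // nil on non-AWS platforms such as "None"
}

type Infrastructure struct{ PlatformStatus *PlatformStatus }

// regionOf returns the AWS region, or an error instead of panicking when the
// AWS-specific status is absent.
func regionOf(infra *Infrastructure) (string, error) {
	if infra == nil || infra.PlatformStatus == nil || infra.PlatformStatus.AWS == nil {
		return "", errors.New("infrastructure has no AWS platform status")
	}
	return infra.PlatformStatus.AWS.Region, nil
}

func main() {
	bm := &Infrastructure{PlatformStatus: &PlatformStatus{Type: "None"}}
	_, err := regionOf(bm)
	fmt.Println("bare-metal cluster:", err)
}
```

Returning an error lets the controller log the condition and requeue rather than crash the whole operator pod.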

https://github.com/openshift/cloud-credential-operator/pull/667

Bug OCPBUGS-14481: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/13491

Bug OCPBUGS-24835: Update 4.16 ose-network-interface-bond-cni-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/bond-cni/pull/60

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/bond-cni/pull/60

Bug OCPBUGS-25173: Update 4.16 ose-libvirt-machine-controllers-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/274

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api-provider-libvirt/pull/275

Bug OCPBUGS-30762: Fix build issues in Console dynamic plugin SDK v1

We have detected several bugs in Console dynamic plugin SDK v1 as part of Kubevirt plugin PR #1804

These bugs affect dynamic plugins which target Console 4.15+

1. Build errors related to Hot Module Replacement chunk files

ERROR in [entry] [initial] kubevirt-plugin.494371abc020603eb01f.hot-update.js
Missing call to loadPluginEntry

2. Build warnings issued by `dynamic-module-import-loader`

LOG from @openshift-console/dynamic-plugin-sdk-webpack/lib/webpack/loaders/dynamic-module-import-loader ../node_modules/ts-loader/index.js??ruleSet[1].rules[0].use[0]!./utils/hooks/useKubevirtWatchResource.ts
<w> Detected parse errors in /home/vszocs/work/kubevirt-plugin/src/utils/hooks/useKubevirtWatchResource.ts

3. Build warnings related to PatternFly shared modules

WARNING in shared module @patternfly/react-core
No required version specified and unable to automatically determine one. Unable to find required version for "@patternfly/react-core" in description file (/home/vszocs/work/kubevirt-plugin/node_modules/@openshift-console/dynamic-plugin-sdk/package.json). It need to be in dependencies, devDependencies or peerDependencies.

How to reproduce

1. git clone Kubevirt plugin repo
2. switch to commit containing changes from PR #1804
3. yarn install && yarn dev to update dependencies and start local dev server

https://github.com/openshift/console/pull/13657

Story TRT-1489: AWS minor jobs failing on TargetDown for crio metrics

In https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-02-06-031624, I see several PRs that involve moving crio metrics. This payload is being rejected on TargetDown alerts on AWS minor upgrades.

Example job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-[…]e-from-stable-4.15-e2e-aws-ovn-upgrade/1754707749708500992

[sig-node][invariant] alert/TargetDown should not be at or above info in ns/kube-system

{ TargetDown was at or above info for at least 1m58s on platformidentification.JobType{Release:"4.16", FromRelease:"4.15", Platform:"aws", Architecture:"amd64", Network:"ovn", Topology:"ha"} (maxAllowed=1s): pending for 15m0s, firing for 1m58s: Feb 06 04:48:42.698 - 118s W namespace/kube-system alert/TargetDown alertstate/firing severity/warning ALERTS{alertname="TargetDown", alertstate="firing", job="crio", namespace="kube-system", prometheus="openshift-monitoring/k8s", service="kubelet", severity="warning"}}

https://github.com/openshift/cluster-monitoring-operator/pull/2255

Bug OCPBUGS-24701: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-kube-apiserver-operator/pull/1599

Bug OCPBUGS-28938: ART requests updates to 4.16 image ose-ibm-vpc-block-csi-driver-operator-container

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/109

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/109

Bug OCPBUGS-30119: cert-syncer is forcibly changing secret type without retaining content

Description of problem:

`ensureSigningCertKeyPair` and `ensureTargetCertKeyPair` always update the secret type. If the secret requires a metadata update, its previous content is not retained.

Steps to Reproduce:

    1. Install a 4.6 cluster (or make sure installer-generated secrets have `type: SecretTypeTLS` instead of `type: kubernetes.io/tls`)
    2. Run secret sync
    3. Check secret contents

Actual results:

    Secret was regenerated with new content

Expected results:

Existing content should be preserved, content is not modified

Additional info:

    This causes api-int CA update for clusters born in 4.6 or earlier.
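
The expected behavior can be sketched as follows, assuming the type change is made by building a replacement object. `Secret` here is a simplified stand-in for the Kubernetes type and `withType` is a hypothetical helper, not the operator's actual code:

```go
package main

import "fmt"

// Secret is a simplified stand-in for a Kubernetes Secret.
type Secret struct {
	Type string
	Data map[string][]byte
}

// withType returns a copy of s with the type changed and every data key preserved.
func withType(s Secret, newType string) Secret {
	out := Secret{Type: newType, Data: make(map[string][]byte, len(s.Data))}
	for k, v := range s.Data {
		out.Data[k] = append([]byte(nil), v...) // copy so the original is not aliased
	}
	return out
}

func main() {
	old := Secret{
		Type: "SecretTypeTLS",
		Data: map[string][]byte{"tls.crt": []byte("cert"), "tls.key": []byte("key")},
	}
	updated := withType(old, "kubernetes.io/tls")
	fmt.Println(updated.Type, len(updated.Data))
}
```

Carrying the existing certificate and key across the type change is what avoids regenerating the CA for clusters born in 4.6 or earlier.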

https://github.com/openshift/cluster-kube-apiserver-operator/pull/1651

Task MON-3689: Update downstream prometheus-operator to v0.71.2

Story PSAP-1210: Automate for the bug SNO installation does not finish due to machine-config

User Story:
This story is used to merge a PR in the openshift/origin repository

https://github.com/openshift/origin/pull/28382

Acceptance criteria:

https://github.com/openshift/origin/pull/28382

Bug OCPBUGS-28282: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-network-config-controller/pull/130

Bug OCPBUGS-19054: No warning that TechPreview is not supported by agent installer

View the Description View the linked PRs

Description of problem:

The agent-based installer does not support the TechPreviewNoUpgrade featureSet, and by extension it does not support any of the features gated by it. For this reason there is no warning when one of these features is specified: we expect the TechPreviewNoUpgrade feature gate check to error out when any of them is used.

However, we do not warn that TechPreviewNoUpgrade itself is ignored, so a user who specifies it can configure some of these unsupported features without being told that the configuration is ignored.

We should fail with an error when TechPreviewNoUpgrade is specified, until such time as AGENT-554 is implemented.
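
A minimal sketch of the requested check, assuming the install config has been parsed into a dictionary. `validate_feature_set` and the error wording are hypothetical and do not reflect the installer's actual Go validation code.

```python
UNSUPPORTED_FEATURE_SET = "TechPreviewNoUpgrade"


def validate_feature_set(install_config):
    """Return a list of validation errors for the agent-based install flow."""
    errors = []
    # An absent or empty featureSet means the default feature set is used.
    feature_set = install_config.get("featureSet", "")
    if feature_set == UNSUPPORTED_FEATURE_SET:
        errors.append(
            "featureSet: %s is not supported by the agent-based installer"
            % feature_set
        )
    return errors
```

Returning a list rather than raising lets the caller report this alongside any other validation failures.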

https://github.com/openshift/installer/pull/7825

Bug OCPBUGS-27247: improve empty state message for Machines and MachineSets page

View the Description View the linked PRs

Description of problem:

In a UPI cluster there are no MachineSets or Machines resources. When the user visits the Machines or MachineSets list page, only the plain text 'Not found' is shown.

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-01-16-113018

How reproducible:

Always

Steps to Reproduce:

1. Set up a UPI cluster
2. Go to the MachineSets and Machines list pages and check the empty state message

Actual results:

2. Only the plain 'Not found' text is shown

Expected results:

2. For other resources we show the richer text 'No <resourcekind> found', so these pages should also show 'No Machines found' and 'No MachineSets found'

Additional info:

https://github.com/openshift/console/pull/13577

Bug OCPBUGS-26023: SessionAffinity does not work after scaling down the Pods

View the Description View the linked PRs

Description of problem:

Version-Release number of selected component (if applicable):

4.14

How reproducible:

1. oc patch svc <svc> --type merge --patch '{"spec":{"sessionAffinity":"ClientIP"}}'

2. curl <svc>:<port>

3. oc scale --replicas=3 deploy/<deploy>

4. oc scale --replicas=0 deploy/<deploy>

5. oc scale --replicas=3 deploy/<deploy>

Actual results:

Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 54850
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 46668
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 46682
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60144
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60150
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 60160
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 51720

Expected results:

Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46914
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46928
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46944
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40510
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40520
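
The difference between the two outputs can be checked mechanically. The sketch below assumes the server log format shown above (`Hostname: <pod>, SourceIP: ..., SourcePort: ...`); `backends_seen` and `affinity_holds` are hypothetical helper names.

```python
import re

# Captures the pod name that follows "Hostname:" on each log line.
HOSTNAME_RE = re.compile(r"Hostname:\s*([^,\s]+)")


def backends_seen(output):
    """Return the set of backend pod names that answered the requests."""
    return set(HOSTNAME_RE.findall(output))


def affinity_holds(output):
    """With sessionAffinity=ClientIP, one client should reach exactly one pod."""
    return len(backends_seen(output)) == 1
```

Run against the actual results this reports two backends, while the expected results contain only one.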

Additional info:

See the hostname in the server log output for each command.

$ oc patch svc <svc> --type merge --patch '{"spec":{"sessionAffinity":"ClientIP"}}'

$ oc scale --replicas=1 deploy/<deploy>

Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 47082
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 47088
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 54832
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 54848

$ oc scale --replicas=3 deploy/<deploy>

https://github.com/openshift/ovn-kubernetes/pull/2018

Bug OCPBUGS-25394: namespace port group is cleaned up on restart

View the Description View the linked PRs

Description of problem:

The problem was that the namespace handler, on initial sync, would delete all ports (because the logical port cache from which it gets LSP UUIDs wasn't populated yet) and all ACLs (they were simply set to nil). Even though both ports and ACLs are re-added by the corresponding handlers, this may cause disruption.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1. create a namespace with at least 1 pod and egress firewall in it

2. pick any ovnkube-node pod, find namespace port group UUID in nbdb by external_ids["name"]=<namespace name>, e.g. for "test" namespace

_uuid               : 6142932d-4084-4bc3-bdcb-1990fc71891b
acls                : [ab2be619-1266-41c2-bb1d-1052cb4e1e97, b90a4b4a-ceee-41ee-a801-08c37a9bf3e7, d314fa8d-7b5a-40a5-b3d4-31091d7b9eae]
external_ids        : {name=test}
name                : a18007334074686647077
ports               : [55b700e4-8176-42e7-97a6-8b32a82fefe5, cb71739c-ad6c-4436-8fd6-0643a5417c7d, d8644bf1-6bed-4db7-abf8-7aaab0625324]

3. restart chosen ovn-k pod

4. Check the logs after the restart for an update that sets the chosen port group to zero ports and zero ACLs

Update operations generated as: [{Op:update Table:Port_Group Row:map[acls:{GoSet:[]} external_ids:{GoMap:map[name:test]} ports:{GoSet:[]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {6142932d-4084-4bc3-bdcb-1990fc71891b}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUID: UUIDName:}]

Actual results:

Expected results:

On restart port group stays the same, no extra update with empty ports and acls is generated
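
One way to picture the expected behaviour is a sync step that treats "not known yet" differently from "empty" and emits no update when nothing changed. This is a sketch under that assumption and not the actual ovn-kubernetes fix; `plan_port_group_update` is a hypothetical name and the port group is modelled as a dictionary.

```python
def plan_port_group_update(existing, cached_ports=None, cached_acls=None):
    """Return the columns to update, or None when no update should be sent.

    A cached value of None means the cache is not populated yet, so the
    corresponding column of the existing port group is left untouched.
    """
    changes = {}
    if cached_ports is not None and set(cached_ports) != set(existing["ports"]):
        changes["ports"] = sorted(cached_ports)
    if cached_acls is not None and set(cached_acls) != set(existing["acls"]):
        changes["acls"] = sorted(cached_acls)
    # An empty dict means the port group already matches: emit nothing.
    return changes or None
```

On a restart where the caches are still empty, the plan is `None` and the port group keeps its ports and ACLs.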

Additional info:

https://github.com/openshift/ovn-kubernetes/pull/1990

Bug OCPBUGS-24745: Update 4.16 golang-github-prometheus-prometheus-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/prometheus/pull/187

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/prometheus/pull/187

Bug OCPBUGS-24783: Update 4.16 atomic-openshift-cluster-autoscaler-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/271

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/kubernetes-autoscaler/pull/271

Bug OCPBUGS-29099: Add hostNetwork:true to cni-sysctl-allowlist ds

View the Description View the linked PRs

In CNO, the cni-sysctl-allowlist daemonset needs to be changed to hostNetwork: true so that the multus daemonset and the cni-sysctl-allowlist daemonset upgrade successfully.

https://github.com/openshift/cluster-network-operator/pull/2237

Bug OCPBUGS-25572: Update 4.16 ose-cluster-baremetal-operator-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/cluster-baremetal-operator/pull/394

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-baremetal-operator/pull/394

Bug OCPBUGS-29565: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/793

Bug OCPBUGS-28744: Upgrade blocked due to the OLM operator stuck in CrashLoopBackOff

View the Description View the linked PRs

Description of problem:

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.15.0-rc.4: 701 of 873 done (80% complete), waiting on operator-lifecycle-manager

Upstream: https://api.openshift.com/api/upgrades_info/v1/graph
Channel: candidate-4.15 (available channels: candidate-4.15, candidate-4.16)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.


$ oc get pods -n openshift-operator-lifecycle-manager 
NAME                                      READY   STATUS             RESTARTS        AGE
catalog-operator-db86b7466-gdp4g          1/1     Running            0               9h
collect-profiles-28443465-9zzbk           0/1     Completed          0               34m
collect-profiles-28443480-kkgtk           0/1     Completed          0               19m
collect-profiles-28443495-shvs7           0/1     Completed          0               4m10s
olm-operator-56cb759d88-q2gr7             0/1     CrashLoopBackOff   8 (3m27s ago)   20m
package-server-manager-7cf46947f6-sgnlk   2/2     Running            0               9h
packageserver-7b795b79f-thxfw             1/1     Running            1               14d
packageserver-7b795b79f-w49jj             1/1     Running            0               4d17h

Version-Release number of selected component (if applicable):

How reproducible:

Unknown

Steps to Reproduce:

Upgrade from 4.15.0-rc.2 to 4.15.0-rc.4

Actual results:

The upgrade is unable to proceed

Expected results:

The upgrade can proceed

Additional info:

https://github.com/openshift/operator-framework-olm/pull/679

Bug OCPBUGS-25054: Update 4.16 ose-must-gather-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/must-gather/pull/395

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/must-gather/pull/395

Bug OCPBUGS-25555: Update 4.16 ose-powervs-machine-controllers-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/machine-api-provider-powervs/pull/70

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/machine-api-provider-powervs/pull/70

Bug OCPBUGS-26014: CVO does not reconcile metadata on ClusterOperators

View the Description View the linked PRs

Description of problem:

While testing oc adm upgrade status against b02, I noticed some COs do not have any annotations, while I expected them to have the include/exclude.release.openshift.io/* ones (to recognize COs that come from the payload).

$ b02 get clusteroperator etcd -o jsonpath={.metadata.annotations}
$ ota-stage get clusteroperator etcd -o jsonpath={.metadata.annotations}
{"exclude.release.openshift.io/internal-openshift-hosted":"true","include.release.openshift.io/self-managed-high-availability":"true","include.release.openshift.io/single-node-developer":"true"}

CVO only precreates CO resources and does not reconcile them once they exist. Build02 does not have COs with reconciled metadata because it was born as 4.2, which (AFAIK) predates OCP's use of the exclude/include annotations.

Version-Release number of selected component (if applicable):

4.16 (development branch)

How reproducible:

deterministic

Steps to Reproduce:

1. delete an annotation on a ClusterOperator resource

Actual results:

The annotation won't be recreated

Expected results:

The annotation should be recreated
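
The expected reconciliation can be sketched as a merge of the annotations required by the release payload into whatever the live object carries. This is an illustration, not the CVO implementation; `reconcile_annotations` is a hypothetical helper working on plain dictionaries.

```python
def reconcile_annotations(existing, required):
    """Merge required annotations into existing ones.

    Returns (merged, changed). Annotations that are not in `required`
    are preserved, so values set by other actors are not removed.
    """
    merged = dict(existing or {})
    changed = False
    for key, value in required.items():
        if merged.get(key) != value:
            merged[key] = value
            changed = True
    return merged, changed
```

When `changed` is false the object already matches and no update call is needed.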

https://github.com/openshift/cluster-version-operator/pull/1012

Bug CNF-11145: NTO make render-sync does not support the bootstrap scenarios

View the Description View the linked PRs

Jose added a few other rendering tests that use different inputs and outputs. make render-sync should be able to prepare them.

https://github.com/openshift/cluster-node-tuning-operator/pull/932

Task MULTIARCH-4160: Stop using the DetermineFromRelease function

View the Description View the linked PRs

Multi-arch compute clusters have an issue where the cluster version's image ref is single arch, so this change resolves the image ref without spinning up a pod.

https://github.com/openshift/origin/pull/28598

Bug OCPBUGS-25590: Update 4.16 ose-azure-cloud-controller-manager-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/102

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-provider-azure/pull/102

Bug OCPBUGS-24436: Add probes to node-network-operator

View the Description View the linked PRs

Description of problem:

    The node-network-identity deployment should conform to hypershift control plane expectations that all applicable containers should have a liveness probe, and a readiness probe if it is an endpoint for a service.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

    No liveness or readiness probes

Expected results:

Additional info:

https://github.com/openshift/cluster-network-operator/pull/2148

Bug OCPBUGS-29701: Console should be using SelfSubjectReview API from frontend

View the Description View the linked PRs

Description of problem:

Currently the console frontend and backend use the OpenShift-centric UserKind type. For the console to work without the OAuth server, i.e. with external OIDC, it needs to use the k8s UserInfo type, which is retrieved by querying the SelfSubjectReview API.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Console is not working with external OIDC provider

Expected results:

Console works with an external OIDC provider

Additional info:

This is mainly an API change.

https://github.com/openshift/console/pull/13605

Bug OCPBUGS-25504: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/azure-file-csi-driver/pull/49

Bug OCPBUGS-25036: sdn: Decouple from cli image

View the Description View the linked PRs

Description of problem:

The sdn image inherits from the cli image to get the oc binary. Change this to install the openshift-clients rpm instead.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

https://github.com/openshift/sdn/pull/593

Bug OCPBUGS-15827: console operator degraded following service CA rotation by deleting the signing-key

View the Description View the linked PRs

Description of problem:

Following signing-key deletion, there is a service CA rotation process which might temporarily disrupt cluster operators, but eventually all of them should recover. In recent 4.14 nightlies, however, this is no longer the case. Following a deletion of the signing-key using
oc delete secret/signing-key -n openshift-service-ca
operators will progress for a while, but eventually console as well as monitoring end up in available=false and degraded=true, which is only recoverable by manually deleting all the pods in the cluster.

console                                    4.14.0-0.nightly-2023-06-30-131338   False       False         True       159m    RouteHealthAvailable: route not yet available, https://console-openshift-console.apps.evakhoni-0412.qe.gcp.devcluster.openshift.com returns '503 Service Unavailable'

monitoring                                 4.14.0-0.nightly-2023-06-30-131338   False       True          True       161m    reconciling Console Plugin failed: retrieving ConsolePlugin object failed: conversion webhook for console.openshift.io/v1alpha1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority

The same deletion in 4.14-ec.2 or earlier does not have this issue, and the cluster recovers eventually without any manual pod deletion. I believe this is a regression.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-30-131338 and other recent 4.14 nightlies

How reproducible:

100%

Steps to Reproduce:

1. oc delete secret/signing-key -n openshift-service-ca
2. wait at least 30 minutes
3. observe oc get co

Actual results:

console and monitoring degraded and not recovering

Expected results:

able to recover eventually as in previous versions

Additional info:

Manually deleting all pods recovers the cluster from this state:
for ns in $(oc get ns -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'); do
  oc delete pods --all -n "$ns"
  sleep 1
done

must-gather:
https://drive.google.com/file/d/1Y3RrYZlz0EncG-Iqt8USFPsTd-br36Zt/view?usp=sharing

Bug OCPBUGS-25974: Specifying a control-plane-operator image without a digest results in InvalidImageRef

View the Description View the linked PRs

Description of problem:

    When specifying a control plane operator image for dev purposes, the control plane operator pod fails to come up with an InvalidImageRef status.

Version-Release number of selected component (if applicable):

    Mgmt cluster is 4.15, HyperShift control plane operator is latest from main

How reproducible:

Always

Steps to Reproduce:

    1. Create a hosted cluster with an annotation to override control plane operator image and point it to a non-digest image ref.

Actual results:

    The cluster fails to come up with the CPO pod failing with InvalidImageRef

Expected results:

The cluster comes up fine.
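
To illustrate the two kinds of reference involved, the sketch below splits an image reference into name, tag and digest. It is a simplified parser written for this note, not the HyperShift code, and the image names in the usage example are hypothetical.

```python
def split_image_ref(ref):
    """Split an image reference into (name, tag, digest).

    The tag or digest is an empty string when absent. A colon only counts
    as a tag separator when it comes after the last "/", so a registry
    port such as "registry.local:5000" is not mistaken for a tag.
    """
    name, _, digest = ref.partition("@")
    tag = ""
    colon = name.rfind(":")
    if colon > name.rfind("/"):
        name, tag = name[:colon], name[colon + 1:]
    return name, tag, digest


def has_digest(ref):
    """True when the reference pins the image by digest."""
    return split_image_ref(ref)[2] != ""
```

For example, `split_image_ref("quay.io/example/cpo:latest")` yields a tag and no digest, which is the non-digest case described in the reproduction step.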

Additional info:

https://github.com/openshift/hypershift/pull/3361

Bug OCPBUGS-23788: gatewayConfig.ipForwarding allows any invalid value

View the Description View the linked PRs

Description of problem:
gatewayConfig.ipForwarding allows any invalid value but it should enforce "", "Restricted" or "Global"

You can currently even do really funky stuff with that:

oc edit network.operator/cluster
(...)
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  - cidr: fd01::/48
    hostPrefix: 64
  defaultNetwork:
    ovnKubernetesConfig:
      egressIPConfig: {}
      gatewayConfig:
        ipForwarding: $(echo 'Im injected'; lscpu)

$ oc get pods -n openshift-ovn-kubernetes ovnkube-node-24628 -o yaml | grep sysctl -C5
      fi

      # If IP Forwarding mode is global set it in the host here.
      ip_forwarding_flag=
      if [ "$(echo 'Im injected'; lscpu)" == "Global" ]; then
        sysctl -w net.ipv4.ip_forward=1
        sysctl -w net.ipv6.conf.all.forwarding=1
      else
        ip_forwarding_flag="--disable-forwarding"
      fi

      NETWORK_NODE_IDENTITY_ENABLE=

$ oc logs -n openshift-ovn-kubernetes ovnkube-node-24628 -c ovnkube-controller | grep inje -A5
++ echo 'Im injected'
++ lscpu
+ '[' 'Im injected
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             112

I wouldn't consider this a security issue, because I have to be the admin to do that, and as the admin I can also simply modify the pod, but it's not very elegant to allow for some sort of code injection, even by the admin
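
The enforcement asked for above amounts to an allow-list check before the value is used anywhere. The following is a sketch of that idea, with the three permitted values taken from the description; `validate_ip_forwarding` is a hypothetical name and the real fix may instead live in API schema validation.

```python
# Values the description says the field should accept.
ALLOWED_IP_FORWARDING = ("", "Restricted", "Global")


def validate_ip_forwarding(value):
    """Return the value if it is allowed, otherwise raise ValueError."""
    if value not in ALLOWED_IP_FORWARDING:
        raise ValueError(
            "gatewayConfig.ipForwarding must be one of %r, got %r"
            % (ALLOWED_IP_FORWARDING, value)
        )
    return value
```

With such a check, the injected string from the example is rejected before it can reach the rendered shell script.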

Bug OCPBUGS-28836: Fix usersettings identifier creation

View the Description View the linked PRs

Description of problem:

Usernames can contain all kinds of characters that are not allowed in resource names. Hash the name instead and use hex representation of the result to get a usable identifier.
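
A minimal sketch of the hashing approach described above. The choice of SHA-256 and the `user-settings-` prefix are assumptions made for illustration (the prefix mirrors the ConfigMap name in the error message below); the console's actual implementation may differ.

```python
import hashlib


def user_settings_name(username):
    """Derive a resource-name-safe identifier from an arbitrary username.

    The hex digest contains only the characters 0-9 and a-f, so the result
    is usable as a resource name regardless of what the username contains.
    """
    digest = hashlib.sha256(username.encode("utf-8")).hexdigest()
    return "user-settings-" + digest
```

The same username always maps to the same identifier, so previously stored preferences can be found again on the next login.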

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Always

Steps to Reproduce:

    1. log in to the web console configured with a login to a 3rd party OIDC provider
    2. go to the User Preferences page / check the logs in the javascript console

Actual results:

The User Preferences page shows empty values instead of defaults.
The javascript console reports things like
```
consoleFetch failed for url /api/kubernetes/api/v1/namespaces/openshift-console-user-settings/configmaps/user-settings-kubeadmin r: configmaps "user-settings-kubeadmin" not found
```

Expected results:

   I am able to persist my user preferences.

Additional info:

https://github.com/openshift/console/pull/13557

Bug OCPBUGS-23812: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/csi-external-provisioner/pull/81

Bug OCPBUGS-24874: Update 4.16 ose-azure-file-csi-driver-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/45

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/azure-file-csi-driver/pull/45

Bug OCPBUGS-14257: coreos-installer iso kargs show broken on Agent ISO

View the Description View the linked PRs

Running the command coreos-installer iso kargs show no longer works with the 4.13 Agent ISO. Instead we get this error:

$ coreos-installer iso kargs show agent.x86_64.iso
Writing manifest to image destination
Storing signatures
Error: No karg embed areas found; old or corrupted CoreOS ISO image.

This is almost certainly due to the way we repack the ISO as part of embedding the agent-tui binary in it.

It worked fine in 4.12. I have tested both with every version of coreos-installer from 0.14 to 0.17

https://github.com/openshift/installer/pull/7896

Bug OCPBUGS-24219: Azure - OCP IPI Installation UDP packets are subject to SNAT with LB Service using ETP equals to Local (OVN-Kubernetes as CNI)

View the Description View the linked PRs

UDP packets are subject to SNAT in a self-managed OCP 4.13.13 cluster on Azure (OVN-K as CNI) using a Load Balancer Service with `externalTrafficPolicy: Local`. UDP packets correctly arrive at the Node hosting the Pod, but the source IP seen by the Pod is that of the OVN GW Router of the Node.

I've reproduced the customer scenario with the following steps:

Deploy a blank environment on Azure using the demo lab (https://demo.redhat.com/catalog?item=babylon-catalog-prod/azure-gpte.open-environment-azure-subscription.prod&utm_source=webapp&utm_medium=share-link)
Install an OCP 4.13.13 cluster with the installer
Use this GitHub repo to deploy a simple UDP server
Run `oc apply -f server.yaml` to deploy the Deployment and Service resources
The application listens on port 10001 for UDP traffic, while the load balancer listens on port 10001 for UDP traffic and forwards it to pod port 10001
Use nc from the bastion host to connect to the external load balancer IP and start writing something
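
The repository referenced above is not reproduced here. As a stand-in, the following sketch shows the kind of UDP server that makes the problem visible: it answers every datagram with the source address it observed. The function names and the reply format are hypothetical; only the port number 10001 comes from the steps above.

```python
import socket


def format_reply(peer):
    """Build the reply for a datagram received from peer, an (ip, port) pair."""
    ip, port = peer[0], peer[1]
    return "SourceIP: {}, SourcePort: {}\n".format(ip, port).encode("ascii")


def serve(port=10001, max_datagrams=None):
    """Answer each UDP datagram with the source address seen by this server.

    Runs until max_datagrams have been handled, or forever when it is None.
    """
    handled = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("0.0.0.0", port))
        while max_datagrams is None or handled < max_datagrams:
            _data, peer = sock.recvfrom(4096)
            sock.sendto(format_reply(peer), peer)
            handled += 1
```

With `externalTrafficPolicy: Local` the reported source IP is expected to be the client's address; in the scenario described here it is the address of the node's OVN gateway router instead.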

This issue is very critical because it is blocking the customer's business.

https://github.com/openshift/ovn-kubernetes/pull/2038

Bug OCPBUGS-26983: monitoring-plugin becomes unavailable after forcing the rotation of the service's certificate

Description of problem:

Following signing-key deletion, there is a service CA rotation process which might temporarily disrupt platform components, but eventually all of them should use the updated certificates.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-30-131338 and other recent 4.14 nightlies

How reproducible:

100%

Steps to Reproduce:

1. oc delete secret/signing-key -n openshift-service-ca
2. reload the management console
3.

Actual results:

The Observe tab disappears from the menu bar and the monitoring-plugin shows as unavailable.

Expected results:

No disruption

Additional info:

Manually deleting the monitoring-plugin pods recovers the situation.

https://github.com/openshift/cluster-monitoring-operator/pull/2233

Bug OCPBUGS-30600: [4.16] Bootimage bump tracker

Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-29441.

https://github.com/openshift/installer/pull/8121

Bug OCPBUGS-27871: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-aws/pull/498

Bug OCPBUGS-24865: Update 4.16 ose-baremetal-runtimecfg-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/baremetal-runtimecfg/pull/291

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/baremetal-runtimecfg/pull/291

Bug OCPBUGS-25840: Warning info title on operator page for Azure WI/FI cluster are not completely consistent

Description of problem:

For special operators, a warning is shown on the operator detail modal and on the installation page when the cluster is an Azure WI/FI cluster. The warning titles are not consistent between these two pages.

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-21-155123

How reproducible:

Always

Steps to Reproduce:

    1. Prepare a special operator which would show warning info on an Azure WI/FI cluster.
    2. Log in to the console of the Azure WI/FI cluster and check the warning info title on the operator detail item modal and on the installation page.
    3.

Actual results:

2. On the operator detail item modal, the warning title is "Cluster in Azure Workload Identity / Federated Identity Mode", and on the installation page, the warning title is "Cluster in Workload Identity / Federated Identity Mode". The word "Azure" is missing on the second page.

Expected results:

2. The warning titles should be consistent.

Additional info:

screenshot: https://drive.google.com/drive/folders/1alFBEtO1gN4q5_mAtHCNzuLTOe5zXp0K?usp=drive_link

https://github.com/openshift/console/pull/13472

Bug OCPBUGS-26069: cluster-monitoring-operator watches on metal-ipi are higher

Component Readiness has found a potential regression in [sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel].

Probability of significant regression: 98.46%

Sample (being evaluated) Release: 4.15
Start Time: 2023-12-29T00:00:00Z
End Time: 2024-01-04T23:59:59Z
Success Rate: 83.33%
Successes: 15
Failures: 3
Flakes: 0

Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 98.36%
Successes: 120
Failures: 2
Flakes: 0
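
The reported probability can be reproduced from the counts above. A minimal Python sketch, assuming the figure is one minus the p-value of a one-sided Fisher's exact test on the pass/fail counts (the report does not state the method, and `fisher_one_sided` is a hypothetical helper name):

```python
from math import comb

def fisher_one_sided(sample_fail, sample_pass, base_fail, base_pass):
    """P-value for seeing at least sample_fail failures in the sample,
    given the combined failure count (hypergeometric upper tail)."""
    n_sample = sample_fail + sample_pass
    n_base = base_fail + base_pass
    total_fail = sample_fail + base_fail
    denom = comb(n_sample + n_base, total_fail)
    upper = min(n_sample, total_fail)
    tail = sum(
        comb(n_sample, k) * comb(n_base, total_fail - k)
        for k in range(sample_fail, upper + 1)
    )
    return tail / denom

# Counts taken from the report: 4.15 sample and 4.14 base.
p_value = fisher_one_sided(sample_fail=3, sample_pass=15, base_fail=2, base_pass=120)
probability = 1 - p_value

print(f"sample success rate: {15 / 18:.2%}")
print(f"base success rate:   {120 / 122:.2%}")
print(f"probability of significant regression: {probability:.2%}")
```

Under that assumption the computed value rounds to the 98.46% shown in the report, and the two success rates round to 83.33% and 98.36%.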

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2023-10-31%2023%3A59%3A59&baseRelease=4.14&baseStartTime=2023-10-04%2000%3A00%3A00&capability=Other&component=Unknown&confidence=95&environment=sdn%20no-upgrade%20amd64%20metal-ipi%20serial&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=sdn&network=sdn&pity=5&platform=metal-ipi&platform=metal-ipi&sampleEndTime=2024-01-04%2023%3A59%3A59&sampleRelease=4.15&sampleStartTime=2023-12-29%2000%3A00%3A00&testId=openshift-tests%3A9ff4e9b171ea809e0d6faf721b2fe737&testName=%5Bsig-arch%5D%5BLate%5D%20operators%20should%20not%20create%20watch%20channels%20very%20often%20%5Bapigroup%3Aapiserver.openshift.io%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D&upgrade=no-upgrade&upgrade=no-upgrade&variant=serial&variant=serial

https://github.com/openshift/origin/pull/28498

Bug OCPBUGS-26513: The source of idms should not be localhost:55000/openshift

Description of problem:

oc-mirror v2 creates the IDMS file as output, but the source looks like this:
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  creationTimestamp: null
  name: idms-2024-01-08t04-19-04z
spec:
  imageDigestMirrors:
  - mirrors:
    - ec2-3-144-29-184.us-east-2.compute.amazonaws.com:5000/ocp2/openshift
    source: localhost:55000/openshift
  - mirrors:
    - ec2-3-144-29-184.us-east-2.compute.amazonaws.com:5000/ocp2/openshift-release-dev
    source: quay.io/openshift-release-dev
status: {}

The source should always be the origin registry, for example quay.io/openshift-release-dev

Version-Release number of selected component (if applicable):

How reproducible:

always

Steps to Reproduce:

   1. Run the commands with v2, using this config.yaml:
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  platform:
    channels:
      - name: stable-4.14
        minVersion: 4.14.3
        maxVersion: 4.14.3
    graph: true

`oc-mirror --config config.yaml file://out --v2` 
`oc-mirror --config config.yaml --from file://out  --v2 docker://xxxx:5000/ocp2`    
2. check the idms file

Actual results:

    2. cat idms-2024-01-08t04-19-04z.yaml 
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  creationTimestamp: null
  name: idms-2024-01-08t04-19-04z
spec:
  imageDigestMirrors:
  - mirrors:
    - xxxx.com:5000/ocp2/openshift
    source: localhost:55000/openshift
  - mirrors:
    - xxxx.com:5000/ocp2/openshift-release-dev
    source: quay.io/openshift-release-dev

Expected results:

The source should not be localhost:55000; it should be the origin registry.
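
The expectation can be expressed as a small check over the generated manifest. The Python sketch below is only an illustration: it works on the manifest as an already-parsed dictionary (the two entries are copied from the output above), and `local_sources` is a hypothetical helper, not part of oc-mirror.

```python
def local_sources(idms):
    """Return every imageDigestMirrors source whose registry host is localhost."""
    entries = idms.get("spec", {}).get("imageDigestMirrors", [])
    flagged = []
    for entry in entries:
        registry = entry["source"].split("/")[0]  # host[:port] part
        host = registry.split(":")[0]
        if host == "localhost":
            flagged.append(entry["source"])
    return flagged

idms = {
    "apiVersion": "config.openshift.io/v1",
    "kind": "ImageDigestMirrorSet",
    "spec": {
        "imageDigestMirrors": [
            {
                "mirrors": ["xxxx.com:5000/ocp2/openshift"],
                "source": "localhost:55000/openshift",
            },
            {
                "mirrors": ["xxxx.com:5000/ocp2/openshift-release-dev"],
                "source": "quay.io/openshift-release-dev",
            },
        ]
    },
}

bad = local_sources(idms)
print(bad)
```

An empty list would mean that no entry points at the local cache registry.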

Additional info:

https://github.com/openshift/oc-mirror/pull/772

Task OU-310: Add an extension point for alerts, to be able to add an inspect link

Background

Lightspeed needs to add a link to the kebab menu of each alert item so that a panel can be opened from it

Outcomes

An extension point is added to the alert table so new elements can be added to each alert kebab menu

https://github.com/openshift/monitoring-plugin/pull/95

Bug OCPBUGS-24911: Update 4.16 ose-cluster-autoscaler-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/303

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-autoscaler-operator/pull/303

Bug OCPBUGS-24952: Update 4.16 csi-driver-nfs-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-driver-nfs/pull/136

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-driver-nfs/pull/136

Bug OCPBUGS-28967: 'Oh no somthing went wrong' shown on Image Manifest Vulnerability page after create IMV via CLI

Description of problem:

    'Oh no! Something went wrong.' is shown on the Image Manifest Vulnerability page after creating an IMV via the CLI

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2024-02-03-192446

How reproducible:

    Always

Steps to Reproduce:

    1. Install the 'Red Hat Quay Container Security Operator'
    2. Use the command line to create the IMV
       $ oc create -f imv.yaml 
        imagemanifestvuln.secscan.quay.redhat.com/example created
       $ cat imv.yaml
         apiVersion: secscan.quay.redhat.com/v1alpha1
         kind: ImageManifestVuln
         metadata:
           name: example
           namespace: openshift-operators
         spec: {}
     3. Navigate to page /k8s/ns/openshift-operators/operators.coreos.com~v1alpha1~ClusterServiceVersion/container-security-operator.v3.10.3/secscan.quay.redhat.com~v1alpha1~ImageManifestVuln

Actual results:

    "Oh no! Something went wrong." is shown.
Description: Cannot read properties of undefined (reading 'replace')
Component trace:
    at T (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/container-security-chunk-c75b48f176a6a5981ee2.min.js:1:3465)
    at https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:631947
    at tr
    at x (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:630876)
    at t (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendors~main-chunk-e839d85039c974dbb9bb.min.js:82:73479)
    at tbody
    at table
    at g (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-69dd6dd4312fdf07fedf.min.js:6:199268)
    at l (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-69dd6dd4312fdf07fedf.min.js:10:88631)
    at D (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:632038)
    at div
    at div
    at t (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendors~main-chunk-e839d85039c974dbb9bb.min.js:50:39294)
    at t (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendors~main-chunk-e839d85039c974dbb9bb.min.js:49:16122)
    at o (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:642088)
    at div
    at M (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:631697)
    at div

Expected results:

    No error is shown

Additional info:

https://github.com/openshift/console/pull/13578

Bug OCPBUGS-29590: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-ibmcloud/pull/77

Bug OCPBUGS-24718: Update 4.16 golang-github-prometheus-alertmanager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/85

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-26067: IP and CIDR CEL validation for OpenShift 4.16

Description of problem:

We would like to include the CEL IP and CIDR validations in 4.16. They have been merged upstream and can be backported into OpenShift to improve our validation downstream.

Upstream PR: https://github.com/kubernetes/kubernetes/pull/121912
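
To illustrate what kind of input such validations accept or reject, here is a rough Python analogue built on the standard `ipaddress` module. It is not the CEL library from the upstream PR; `is_ip` and `is_cidr` are hypothetical helper names, and Python's parsing rules may differ from the upstream ones in edge cases.

```python
import ipaddress

def is_ip(value):
    """True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False

def is_cidr(value):
    """True if value is an address followed by a valid prefix length."""
    if "/" not in value:
        return False  # a bare address is not treated as a CIDR here
    try:
        # strict=False accepts host bits being set, e.g. 10.0.0.1/16
        ipaddress.ip_network(value, strict=False)
        return True
    except ValueError:
        return False

for candidate in ["192.168.0.1", "2001:db8::1", "256.0.0.1"]:
    print(candidate, is_ip(candidate))
for candidate in ["10.0.0.0/16", "10.0.0.0/33", "10.0.0.1"]:
    print(candidate, is_cidr(candidate))
```

The point of moving such checks into CEL rules is that malformed values are rejected when the object is admitted, instead of being discovered later by a controller.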

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/kubernetes/pull/1828

Task HOSTEDCP-1397: Add documentation on how to debug Azure nodes

Add documentation on how to debug Azure nodes

https://github.com/openshift/hypershift/pull/3452

Bug OCPBUGS-24794: Update 4.16 ose-cluster-olm-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/39

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-olm-operator/pull/39

Bug OCPBUGS-30005: Power VS: PlatformCredsCheck relies on endpoint that has been removed.

Description of problem:

    On February 27th endpoints were turned off that were being queried for account details. The check is not vital so we are fine with removing it, however it is currently blocking all Power VS installs.

Version-Release number of selected component (if applicable):

    4.13.0 - 4.16.0

How reproducible:

    Easily

Steps to Reproduce:

    1. Try to deploy with Power VS
    2. Fail at the platform credentials check

Actual results:

    Check fails

Expected results:

    Check should succeed

Additional info:

https://github.com/openshift/installer/pull/8072

Bug OCPBUGS-24983: Update 4.16 telemeter-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/telemeter/pull/497

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-24806: Update 4.16 ose-olm-catalogd-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/operator-framework-catalogd/pull/36

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/operator-framework-catalogd/pull/36

Bug OCPBUGS-25050: Update 4.16 ose-azure-disk-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/65

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/azure-disk-csi-driver/pull/67

Bug OCPBUGS-24630: don't find "scrape.timestamp-tolerance" setting in prometheus

Description of problem:

tested https://github.com/openshift/cluster-monitoring-operator/pull/2187 with PR

launch 4.15,openshift/cluster-monitoring-operator#2187 aws

The "scrape.timestamp-tolerance" setting is not found in the prometheus resource or the prometheus pod; the commands below return no result

$ oc -n openshift-monitoring get prometheus k8s -oyaml | grep -i "scrape.timestamp-tolerance"
$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep -i "scrape.timestamp-tolerance"
$ oc -n openshift-monitoring get sts  prometheus-k8s  -oyaml | grep -i "scrape.timestamp-tolerance"

It is not in the Prometheus configuration file either

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | head
global:
  evaluation_interval: 30s
  scrape_interval: 30s
  external_labels:
    prometheus: openshift-monitoring/k8s
    prometheus_replica: prometheus-k8s-0
rule_files:
- /etc/prometheus/rules/prometheus-k8s-rulefiles-0/*.yaml
scrape_configs:
- job_name: serviceMonitor/openshift-apiserver-operator/openshift-apiserver-operator/0

https://github.com/openshift/cluster-monitoring-operator/pull/2189

Bug OCPBUGS-30052: PAC: Repositories list page breaks with a TypeError

Description of problem:

    Repositories list page breaks with a TypeError 
cannot read properties of undefined (reading `pipelinesascode.tekton.dev/repository`)

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://drive.google.com/file/d/1TpH_PTyBxNX0b9SPZ2yS8b-q-tbvp6Ok/view?usp=sharing

https://github.com/openshift/console/pull/13659

Bug OCPBUGS-18326: CMO manifest lack capability annotations

The CVO-managed manifests that CMO ships lack capability annotations as defined in https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#manifest-annotations.

The dashboards should be tied to the console capability so that when CMO deploys on a cluster without the Console capability, CVO doesn't deploy the dashboards configmap.

https://github.com/openshift/cluster-monitoring-operator/pull/2254

Bug OCPBUGS-26566: Page fails to return to the Secrets list after clicking 'Cancel' on any Secret creation page

Description of problem:

  When the user clicks 'Cancel' on any Secret creation page, the page doesn't return to the Secrets list page

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2024-01-06-062415

How reproducible:

    Always

Steps to Reproduce:

    1. Go to Create Key/value secret|Image pull secret|Source secret|Webhook secret|FromYaml page
       e.g.: /k8s/ns/default/secrets/~new/generic
    2. Click Cancel button
    3.

Actual results:

    The page does not go back to Secrets list page
    eg: /k8s/ns/default/core~v1~Secret

Expected results:

    The page should go back to the Secrets list page

Additional info:

https://github.com/openshift/console/pull/13504

Bug MGMT-16241: preparing-for-installation fails when modify applied-config

Description of the problem:

Prepare a cluster for installation and add an applied configuration through the API code.

When we try to install the cluster, it goes back to ready as expected. However, after we fix the configuration and try to install again, it ALWAYS fails on the first attempt (not related to time).

Only the second install attempt works, without changing the configuration.

How reproducible:

Always

Steps to reproduce:

1. Prepare a cluster for installation.

2. Create an invalid applied configuration to verify that preparing-for-installation returns to the ready state as expected.

invalid_override_config = {
    "capabilities": {
        "additionalEnabledCapabilities": ["123454baremetal"],
    }
}

3. Start the installation; the cluster goes back to ready as expected.

4. Fix the applied configuration and try to install again -> fails.

Actual results:

On the first attempt after the config change, the installation fails.

It works only on the second try.

Expected results:

https://github.com/openshift/assisted-service/pull/5811

Bug MGMT-17008: [STG] Unable to install cluster with lvms cnv and MCE

Description of the problem:
When trying to install a 4.15 cluster with the LVMS, CNV, and MCE operators,
I am unable to continue with the installation from the operators page,
because host discovery reports that the hosts require more resources (as if I had also selected ODF).

How reproducible:

Steps to reproduce:

1. Create a multi-node 4.15 cluster.

2. Make sure to have enough resources for the LVMS, CNV, and MCE operators.

3. Select the CNV, LVMS, and MCE operators.

Actual results:

The operators page shows that CPU and RAM resources related to ODF (which is not selected) are also required, and the user is unable to start the installation.

Expected results:

Should be able to start installation

https://github.com/openshift/assisted-service/pull/6021

Bug OCPBUGS-25005: Update 4.16 ose-azure-cloud-controller-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/100

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-provider-azure/pull/100

Bug OCPBUGS-30260: Correctly handle HCP subnet tagging for management clusters to not break cloud-provider-aws subnet selection

Description of problem:

Hypershift management clusters are using a network-load-balancer to route to their own openshift-ingress router pods for cluster ingress.
These NLBs are provisioned by the https://github.com/openshift/cloud-provider-aws.

The cloud-provider-aws uses the cluster-tag on the subnets to select the correct subnets for the NLB and the SecurityGroup adjustments.

On management clusters *all* subnets are tagged with the MC's cluster-id.
This can lead cloud-provider-aws to select the incorrect subnet, because it breaks ties between multiple subnets in an AZ using lexicographical comparison: https://github.com/openshift/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L3626

This can lead to a situation, where a SecurityGroup will only allow Ingress from a subnet that is not actually part of the NLB - in this case the TargetGroup will not be able to correctly perform a HealthCheck in that AZ.

In certain cases this can lead to all targets reporting unhealthy as the nodes hosting the ingress pods have the incorrect SecurityGroup rules.

In that case routing to nodes that are part of the target group can select nodes that should not be chosen as they are not ready yet/anymore leading to problems when attempting to access management cluster services (e.g. the console).
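
The tie-breaking problem described above can be modeled in a few lines. The following Python sketch is a deliberately simplified model, not the cloud-provider-aws code: the subnet identifiers, the `owner` field, and `pick_subnet_per_az` are all hypothetical, and the real selection considers more criteria than shown here.

```python
def pick_subnet_per_az(subnets):
    """Keep one subnet per AZ; ties are broken by the lexicographically
    smallest identifier (simplified model of the described behavior)."""
    chosen = {}
    for subnet in subnets:
        current = chosen.get(subnet["az"])
        if current is None or subnet["id"] < current["id"]:
            chosen[subnet["az"]] = subnet
    return chosen

# Both subnets carry the management cluster's tag, as in the bug report.
tagged = [
    {"id": "subnet-mc-default", "az": "us-east-1a", "owner": "mc"},
    {"id": "subnet-hcp-guest", "az": "us-east-1a", "owner": "hcp"},
]

# "subnet-hcp-guest" sorts before "subnet-mc-default", so the HCP subnet wins.
before_fix = pick_subnet_per_az(tagged)["us-east-1a"]["id"]

# If only the MC's own subnets carry the tag, only they are candidates.
only_mc = [s for s in tagged if s["owner"] == "mc"]
after_fix = pick_subnet_per_az(only_mc)["us-east-1a"]["id"]

print(before_fix, after_fix)
```

In this model, restricting the tag to the MC's own subnets removes the HCP subnet from the candidate set, so the lexicographic tie-break can no longer pick it.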

Version-Release number of selected component (if applicable):

4.14.z & 4.15.z

How reproducible:

Most MCs that are using NLBs will have some of the SecurityGroups misconfigured.

Steps to Reproduce:

1. Have the cloud-provider-aws update the NLB while there are subnets in an AZ with lexicographically smaller names than the MC's default subnets - this can lead to the other subnets being chosen instead.
2. Check the securitygroups to see if the source CIDRs are incorrect.

Actual results:

SecurityGroups can have incorrect source CIDRs used for the MC's own NLB.

Expected results:

The MC should only tag its own subnets with the MC's cluster-id, so that subnet selection by the cloud-provider-aws is not affected by the HCP subnets in the same availability zones.

Additional info:

Related OHSS ticket from SREP: https://issues.redhat.com/browse/OSD-20289

https://github.com/openshift/hypershift/pull/3746

Bug OCPBUGS-24985: Update 4.16 ose-cluster-machine-approver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/218

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-machine-approver/pull/218

Bug OCPBUGS-28658: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/oauth-apiserver/pull/98

Bug OCPBUGS-22648: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/vmware-vsphere-csi-driver/pull/108

Bug OCPBUGS-27056: Update 4.16 ose-aws-ebs-csi-driver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-operator/pull/115

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug MGMT-14226: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-30169: CEO deadlocks on health checking a downed member

Description of problem:

For certain operations the CEO will check the etcd member health by creating a client directly and waiting for its status report.

When any member was unreachable for a longer period, we found that the CEO was constantly getting stuck (deadlocked) and couldn't move certain controllers forward.

In OCPBUGS-12475 we introduced a health-check that would dump stack and automatically restart with the operator deployment health probe.

In a more recent upgrade run we found the culprit [1] to be a missing context during client initialization to etcd, which makes it hang indefinitely:


W0229 02:55:46.820529       1 aliveness_checker.go:33] Controller [EtcdEndpointsController] didn't sync for a long time, declaring unhealthy and dumping stack

goroutine 1426 [select]:
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth({0x3272768?, 0xc002090310}, {0xc0000a6880, 0x3, 0xc001c98360?})
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:64 +0x330
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.(*etcdClientGetter).MemberHealth(0xc000c24540, {0x3272688, 0x4c20080})
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/etcdcli.go:412 +0x18c
github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers.CheckSafeToScaleCluster({0x324ccd0?, 0xc000b6d5f0?}, {0x3284250?, 0xc0008dda10?}, {0x324e6c0, 0xc000ed4fb0}, {0x3250560, 0xc000ed4fd0}, {0x32908d0, 0xc000c24540})
	github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers/bootstrap.go:149 +0x28e
github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers.(*QuorumCheck).IsSafeToUpdateRevision(0x2893020?)
	github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers/qourum_check.go:37 +0x46
github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller.(*EtcdEndpointsController).syncConfigMap(0xc0002e28c0, {0x32726f8, 0xc0008e60a0}, {0x32801b0, 0xc001198540})
	github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go:146 +0x5d8
github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller.(*EtcdEndpointsController).sync(0xc0002e28c0, {0x32726f8, 0xc0008e60a0}, {0x325d240, 0xc003569e90})
	github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go:66 +0x71
github.com/openshift/cluster-etcd-operator/pkg/operator/health.(*CheckingSyncWrapper).Sync(0xc000f21bc0, {0x32726f8?, 0xc0008e60a0?}, {0x325d240?, 0xc003569e90?})
	github.com/openshift/cluster-etcd-operator/pkg/operator/health/checking_sync_wrapper.go:22 +0x43
github.com/openshift/library-go/pkg/controller/factory.(*baseController).reconcile(0xc00113cd80, {0x32726f8, 0xc0008e60a0}, {0x325d240?, 0xc003569e90?})
	github.com/openshift/library-go@v0.0.0-20240124134907-4dfbf6bc7b11/pkg/controller/factory/base_controller.go:201 +0x43



goroutine 11640 [select]:
google.golang.org/grpc.(*ClientConn).WaitForStateChange(0xc003707000, {0x3272768, 0xc002091260}, 0x3)
	google.golang.org/grpc@v1.58.3/clientconn.go:724 +0xb1
google.golang.org/grpc.DialContext({0x3272768, 0xc002091260}, {0xc003753740, 0x3c}, {0xc00355a880, 0x7, 0xc0023aa360?})
	google.golang.org/grpc@v1.58.3/clientconn.go:295 +0x128e
go.etcd.io/etcd/client/v3.(*Client).dial(0xc000895180, {0x32754a0?, 0xc001785670?}, {0xc0017856b0?, 0x28f6a80?, 0x28?})
	go.etcd.io/etcd/client/v3@v3.5.10/client.go:303 +0x407
go.etcd.io/etcd/client/v3.(*Client).dialWithBalancer(0xc000895180, {0x0, 0x0, 0x0})
	go.etcd.io/etcd/client/v3@v3.5.10/client.go:281 +0x1a9
go.etcd.io/etcd/client/v3.newClient(0xc002484e70?)
	go.etcd.io/etcd/client/v3@v3.5.10/client.go:414 +0x91c
go.etcd.io/etcd/client/v3.New(...)
	go.etcd.io/etcd/client/v3@v3.5.10/client.go:81
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.newEtcdClientWithClientOpts({0xc0017853d0, 0x1, 0x1}, 0x0, {0x0, 0x0, 0x0?})
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/etcdcli.go:127 +0x77d
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.checkSingleMemberHealth({0x32726f8, 0xc00318ac30}, 0xc002090460)
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:103 +0xc5
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1()
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0x6c
created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth in goroutine 1426
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2a5

  

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/

Version-Release number of selected component (if applicable):

any currently supported OCP version

How reproducible:

Always

Steps to Reproduce:

    1. create a healthy cluster
    2. make sure one etcd member never responds, but the node is still there (ie kubelet shutdown, blocking the etcd ports on a firewall)
    3. wait for the CEO to restart pod on failing health probe and dump its stack (similar to the one above)

Actual results:

CEO controllers deadlock; the operator eventually restarts because its health probes fail.

Expected results:

CEO should mark the member as unhealthy and continue serving without deadlocking, and it should not restart its pod because of a failed health probe.

Additional info:

clientv3.New doesn't take any timeout context, but tries to establish a connection forever

https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/etcdcli/etcdcli.go#L127-L130

There's a way to pass the "default context" via the client config, which is slightly misleading.
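
The hang described above, and the expected behavior of marking the member unhealthy instead of blocking, can be illustrated with a small sketch. This is Python rather than the operator's Go, and it is only an illustration: `probe_with_deadline` and the `dial` callable are hypothetical names that stand in for a client constructor that may never return. They are not part of cluster-etcd-operator or the etcd client.

```python
import threading


def probe_with_deadline(dial, timeout_s):
    """Return True if dial() finishes without error within timeout_s seconds.

    dial is a hypothetical blocking callable standing in for a client
    constructor that may never return when a member is unreachable.
    """
    outcome = {}

    def worker():
        try:
            dial()
            outcome["healthy"] = True
        except Exception:
            # A dial that fails fast (e.g. connection refused) is also unhealthy.
            outcome["healthy"] = False

    # Daemon thread: a dial that hangs forever cannot keep the process alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    # If the worker is still blocked, no outcome was recorded: report unhealthy.
    return outcome.get("healthy", False)
```

A caller would treat a `False` result as "member unhealthy" and carry on with the remaining members. How the operator itself bounds the dial is up to the linked PR; this sketch does not reproduce it.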

https://github.com/openshift/cluster-etcd-operator/pull/1215

Bug OCPBUGS-27341: ingress operator appears to be reporting unavailable in error

: [bz-Routing] clusteroperator/ingress should not change

Has been failing for over a month in the e2e-metal-ipi-sdn-bm-upgrade jobs

I think this is because there are only two worker nodes in the BM environment and some HA services lose redundancy when one of the workers is rebooted.

In the medium term I hope to add another node to each cluster, but in the short term we should skip the test.

https://github.com/openshift/origin/pull/28480

Story TRT-1515: openshift-tests OOMing

https://github.com/openshift/release/pull/48835

The scatter chart has been very slow to load for the last week. While there are a few hits here and there over the last y days, this became much more common yesterday around noon and has continued ever since.

Suspicious PR: https://github.com/openshift/origin/pull/28587

https://github.com/openshift/origin/pull/28601

Bug OCPBUGS-24945: Update 4.16 ose-alibaba-machine-controllers-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api-provider-alibaba/pull/48

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api-provider-alibaba/pull/48

Bug OCPBUGS-29816: Switch to service to get the data from the Tekton results summary API

https://github.com/openshift/console/pull/13624

Bug OCPBUGS-26135: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-ibmcloud/pull/75

Bug OCPBUGS-29095: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/baremetal-runtimecfg/pull/298

Bug OCPBUGS-25766: Cluster Baremetal operator should use a leader lock

Description of problem:

Seen in this 4.15 to 4.16 CI run:

: [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers	0s
{  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 26 times
event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 51 times}

The operator recovered, and the update completed, but it's still probably worth cleaning up whatever's happening to avoid alarming anyone.

Version-Release number of selected component (if applicable):

Seems like all recent CI runs that match this string touch 4.15, 4.16, or development branches:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=Back-off+restarting+failed+container+cluster-baremetal-operator+in+pod+cluster-baremetal-operator' | grep 'failures match'
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway (all) - 11 runs, 36% failed, 25% of failures match = 9% impact
periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 15 runs, 20% failed, 33% of failures match = 7% impact
pull-ci-openshift-kubernetes-master-e2e-aws-ovn-downgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 15 runs, 27% failed, 25% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade (all) - 32 runs, 91% failed, 7% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 40 runs, 25% failed, 20% of failures match = 5% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade (all) - 40 runs, 8% failed, 33% of failures match = 3% impact
pull-ci-openshift-azure-file-csi-driver-operator-main-e2e-azure-ovn-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade (all) - 10 runs, 30% failed, 33% of failures match = 10% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-gcp-ovn-arm64 (all) - 6 runs, 33% failed, 50% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 11 runs, 18% failed, 50% of failures match = 9% impact

How reproducible:


Looks like ~8% impact.
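
The per-job "impact" figures in the search output above are the failure rate multiplied by the share of failures that match (for example, 36% failed x 25% match = 9%). As a rough sketch of how an overall figure could be derived, the Python below parses lines in that format and computes a run-weighted mean. The run-weighting is an assumption; the report does not say how the ~8% was estimated, and the function names are made up for illustration.

```python
import re

# Matches one summary line of the search output quoted above.
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def parse_line(line):
    """Return (job, runs, failed_pct, match_pct), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return m["job"], int(m["runs"]), int(m["failed"]), int(m["match"])


def job_impact(failed_pct, match_pct):
    """Per-job impact in percent: share of runs that failed and matched."""
    return failed_pct * match_pct / 100.0


def weighted_impact(lines):
    """Run-weighted mean impact in percent over all parseable lines."""
    total_runs = 0
    weighted = 0.0
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip anything that is not a summary line
        _, runs, failed_pct, match_pct = parsed
        total_runs += runs
        weighted += runs * job_impact(failed_pct, match_pct)
    return weighted / total_runs if total_runs else 0.0
```

Feeding the fourteen summary lines above to `weighted_impact` should land near the ~8% quoted in the report, though the percentages in the source are already rounded.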

Steps to Reproduce:

1. Run ~20 exposed job types.
2. Check for `: [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers` failures with `Back-off restarting failed container cluster-baremetal-operator` messages.

Actual results:

~8% impact.

Expected results:

~0% impact.

Additional info:

Dropping into Loki for the run I'd picked:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1737335551998038016"} | unpack | pod="cluster-baremetal-operator-574577fbcb-z8nd4" container="cluster-baremetal-operator" |~ "220 06:0"

includes:

E1220 06:04:18.794548       1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning"
I1220 06:05:40.753364       1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
I1220 06:05:40.766200       1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
I1220 06:05:40.780426       1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
E1220 06:05:40.795555       1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning"
I1220 06:08:21.730591       1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
I1220 06:08:21.747466       1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
I1220 06:08:21.768138       1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
E1220 06:08:21.781058       1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning"

So some kind of ClusterOperator-modification race?
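
The "the object has been modified" error in the log is what the Kubernetes API returns when an update carries a stale resource version, which fits the race suspected above. The sketch below illustrates that optimistic-concurrency conflict and a read-modify-write retry loop in Python. `VersionedStore` is a hypothetical in-memory stand-in for the API server, not a real client, and the retry loop is shown only to explain the error; the bug title proposes a leader lock as the remedy.

```python
class Conflict(Exception):
    """Raised when an update carries a stale resource version."""


class VersionedStore:
    """Hypothetical in-memory stand-in for a single API object."""

    def __init__(self, status):
        self.status = status
        self.version = 1

    def get(self):
        return self.status, self.version

    def update(self, status, version):
        # Reject writes based on an outdated read, as the API server does.
        if version != self.version:
            raise Conflict("the object has been modified")
        self.status = status
        self.version += 1


def update_with_retry(store, mutate, attempts=5):
    """Re-read and re-apply mutate() until the write succeeds or attempts run out."""
    for _ in range(attempts):
        status, version = store.get()
        try:
            store.update(mutate(status), version)
            return True
        except Conflict:
            continue  # someone else wrote first; start over from a fresh read
    return False
```

A process that writes once and exits on the first conflict, as the log suggests, ends up restarting instead of retrying; with two instances running at the same time, that can repeat until one of them wins.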

https://github.com/openshift/cluster-baremetal-operator/pull/395

Bug OCPBUGS-27455: Count mismatch in Image vulnerabilities reported in the OpenShift Console

Problem Description:

Installed the Red Hat Quay Container Security Operator on a 4.13.25 cluster.

Below are my test results:

```

[sasakshi@sasakshi ~]$ oc version
Client Version: 4.12.7
Kustomize Version: v4.5.7
Server Version: 4.13.25
Kubernetes Version: v1.26.9+aa37255

[sasakshi@sasakshi ~]$ oc get csv -A | grep -i "quay" | tail -1
openshift container-security-operator.v3.10.2 Red Hat Quay Container Security Operator 3.10.2 container-security-operator.v3.10.1 Succeeded

[sasakshi@sasakshi ~]$ oc get subs -A
NAMESPACE NAME PACKAGE SOURCE CHANNEL
openshift-operators container-security-operator container-security-operator redhat-operators stable-3.10

[sasakshi@sasakshi ~]$ oc get imagemanifestvuln -A | wc -l
82
[sasakshi@sasakshi ~]$ oc get vuln --all-namespaces | wc -l
82

Console -> Administration -> Image Vulnerabilities : 82

Home -> Overview -> Status -> Image Vulnerabilities : 66
```

Observations from my testing:

`oc get vuln --all-namespaces` reports the same count as `oc get imagemanifestvuln -A`.

The two console views report different counts:
```
Console -> Administration -> Image Vulnerabilities : 82
Home -> Overview -> Status -> Image Vulnerabilities : 66
```
Why do these two entries report different counts?

Please refer to the attached screenshots.

Documentation link referred:

https://docs.openshift.com/container-platform/4.14/security/pod-vulnerability-scan.html#security-pod-scan-query-cli_pod-vulnerability-scan

https://github.com/openshift/console/pull/13529

Bug OCPBUGS-25581: Update 4.16 ose-gcp-cloud-controller-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cloud-provider-gcp/pull/52

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-provider-gcp/pull/52

Bug MGMT-16414: When trying to create cluster with s390x architecture, an error occurs that stops cluster creation

Description of the problem:

When trying to create cluster with s390x architecture, an error occurs that stops cluster creation. The error is "cannot use Skip MCO reboot because it's not compatible with the s390x architecture on version 4.15.0-ec.3 of OpenShift"

How reproducible:

Always

Steps to reproduce:

Create cluster with architecture s390x

Actual results:

Create failed

Expected results:

Create should succeed

https://github.com/openshift/assisted-service/pull/5822

Bug OCPBUGS-24861: Update 4.16 openshift-enterprise-egress-dns-proxy-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/images/pull/158

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/images/pull/158

Bug OCPBUGS-28933: ART requests updates to 4.16 image ose-powervs-block-csi-driver-operator-container

Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/66

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/66

Bug OCPBUGS-24997: Update 4.16 ose-cluster-image-registry-operator-container image to be consistent with ART

Bug OCPBUGS-24938: Update 4.16 ose-azure-disk-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/65

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/azure-disk-csi-driver/pull/65

Bug OCPBUGS-25779: Update 4.16 kube-proxy-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/sdn/pull/600

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/sdn/pull/600

Bug OCPBUGS-28705: add new tested azure instance types in installer doc

Description of problem:

While running the 4.15 installer full-function test, the following three instance families were detected and verified; they need to be added to the installer doc [1]:
- standardHBv4Family
- standardMSMediumMemoryv3Family
- standardMDSMediumMemoryv3Family

[1] https://github.com/openshift/installer/blob/master/docs/user/azure/tested_instance_types_x86_64.md

Version-Release number of selected component (if applicable):

    4.15

https://github.com/openshift/installer/pull/7961

Bug OCPBUGS-16181: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-credential-operator/pull/514

Bug OCPBUGS-18716: OVN br-int flows do not get updated on other nodes when a node's bond MACADDR is changed to other slave interface after reboot.

Description of problem:

OVN br-int OVS flows are not updated on other nodes when a node's bond MAC address changes to that of another slave interface after a reboot. As a result, traffic coming from the SDN of one node is dropped when it reaches the node whose bond interface changed MAC address.

Version-Release number of selected component (if applicable): 4.12+

How reproducible: 100% of the time after a reboot if the MAC changes. The MAC does not always change.

Steps to Reproduce:

1. Capture bond0 mac before reboot
2. Reboot host
3. Confirm mac change
4.  oc run --rm -it test-pod-sdn --image=registry.redhat.io/openshift4/network-tools-rhel8  --overrides='{"spec": {"tolerations": [{"operator": "Exists"}],"nodeSelector":{"kubernetes.io/hostname":"nodeb-not-rebooted"}}}' /bin/bash
5. Ping rebooted node

Actual results:

The ping reaches the rebooted node but is dropped because the MAC address used is that of the other slave interface, not the one the bond is using.

Expected results:

OVS flows should update on all nodes after a reboot if the MAC changes.

Additional info:

Restarting NetworkManager a couple of times triggers the OVS flows to update; it is not clear why.

Possible workarounds:
 - https://access.redhat.com/solutions/6972925
 - Statically set the MAC of bond0 to that of one of the slave interfaces.

https://github.com/openshift/ovn-kubernetes/pull/2010

Bug OCPBUGS-28934: ART requests updates to 4.16 image csi-driver-manila-operator-container

Please review the following PR: https://github.com/openshift/csi-driver-manila-operator/pull/227

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-driver-manila-operator/pull/227

Bug OCPBUGS-22637: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/vmware-vsphere-csi-driver/pull/108

Bug OCPBUGS-24241: RHOCP installation on RHOSP fails with an error "Incompatible openstacksdk library found"

RHOCP installation on RHOSP fails with an error

~~~

$ ansible-playbook -i inventory.yaml security-groups.yaml
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Incompatible openstacksdk library found: Version MUST be >=1.0 and <=None, but 0.36.5 is smaller than minimum version 1.0."}

~~~

Packages Installed :

ansible-2.9.27-1.el8ae.noarch Fri Oct 13 06:56:05 2023
python3-netaddr-0.7.19-8.el8.noarch Fri Oct 13 06:55:44 2023
python3-openstackclient-4.0.2-2.20230404115110.54bf2c0.el8ost.noarch Tue Nov 21 01:38:32 2023
python3-openstacksdk-0.36.5-2.20220111021051.feda828.el8ost.noarch Fri Oct 13 06:55:52 2023
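
The error message above says the installed openstacksdk (0.36.5) is below the required minimum (1.0). A minimal Python sketch of that comparison follows, applied to the RPM name from the package list. The helper names are hypothetical, the parser handles only dotted numeric versions, and this is not how the Ansible module itself performs the check.

```python
import re


def parse_version(text):
    """Turn a dotted numeric version such as '0.36.5' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def sdk_version_from_rpm(package):
    """Extract the upstream version from an RPM name like the one listed above."""
    m = re.match(r"python3-openstacksdk-(\d+(?:\.\d+)*)-", package)
    return m.group(1) if m else None


def meets_minimum(installed, minimum="1.0"):
    """True when the installed version is at least the required minimum.

    Simplified: tuples compare element by element, so '1.0' and '1.0.0'
    are not treated as equal.
    """
    return parse_version(installed) >= parse_version(minimum)
```

With the package shown above, `sdk_version_from_rpm` yields `0.36.5`, which `meets_minimum` rejects against the default minimum of `1.0`.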

Document followed :
https://docs.openshift.com/container-platform/4.13/installing/installing_openstack/installing-openstack-user.html#installation-osp-downloading-modules_installing-openstack-user

https://github.com/openshift/installer/pull/7821

Bug OCPBUGS-25441: "Oh no! Something went wrong" in Topology -> Observe tab

Description of problem:

    "Oh no! Something went wrong" in Topology -> Observe tab

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2023-12-14-115151

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate to Topology, click a deployment, and open the Observe tab.

Actual results:

    The page crashed with the following component trace:
    at te (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-plugins-shared~main-chunk-b3bd2b20c770a4e73b50.min.js:31:9773)
    at j (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-plugins-shared~main-chunk-b3bd2b20c770a4e73b50.min.js:12:3324)
    at div
    at s (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:60:70124)
    at div
    at g (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:6:11163)
    at div
    at d (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:1:174472)
    at t.a (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/dev-console/code-refs/topology-chunk-769d28af48dd4b29136f.min.js:1:487478)
    at t.a (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/dev-console/code-refs/topology-chunk-769d28af48dd4b29136f.min.js:1:486390)
    at div
    at l (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:60:106304)
    at div

Expected results:

    The page should not crash.

Additional info:

https://github.com/openshift/console/pull/13451

Bug OCPBUGS-28665: openshift/openshift-controller-manager - replace 'coreydaley' with 'sayan-biswas' in OWNERS file

https://github.com/openshift/openshift-controller-manager/pull/284

Story TRT-1441: Stop pushing intervals to Loki

We've moved to using BigQuery for this, so stop pushing and free up some Loki cycles.

This is done with a command in origin: https://github.com/openshift/origin/blob/24e011ba3adf2767b88619351895bb878de3d62a/pkg/cmd/openshift-tests/dev/dev.go#L211

So all this code and probably some libraries could be removed.

But first, remove the invocation of this command in the release repo. (upload-intervals)

https://github.com/openshift/origin/pull/28525

Task MGMT-14633: Include custom manifests in cluster logs archive

Sometimes user manifests could be the source of problems, and right now they're not included in the logs archive downloaded for an Assisted cluster from the UI.

Currently we can only see the feature usage "Custom manifest" in metadata.json, but that only tells us the user has custom manifests, not which manifests they are or what they contain.

https://github.com/openshift/assisted-service/pull/5777

Bug OCPBUGS-27933: Update 4.16 ose-ovn-kubernetes-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/2027

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ovn-kubernetes/pull/2027

Bug OCPBUGS-28742: ovnkube-node doesn't refresh certificates after node was suspended for 30 days

Description of problem:

ovnkube-node doesn't issue a CSR to get new certificates when node is suspended for 30 days

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1. Setup a libvirt cluster on machine
    2. Disable chronyd on all nodes and host machine
    3. Suspend nodes
    4. Change time on host 30 days forward
    5. Resume nodes
    6. Wait for API server to come up
    7. Wait for all operators to become ready

Actual results:

ovnkube-node attempts to use expired certs:

2024-01-21T01:24:41.576365431+00:00 stderr F I0121 01:24:41.573615    8852 master.go:740] Adding or Updating Node "test-infra-cluster-4832ebf8-master-0"
2024-04-20T01:25:08.519622252+00:00 stderr F I0420 01:25:08.516550    8852 services_controller.go:567] Deleting service openshift-operator-lifecycle-manager/packageserver-service
2024-04-20T01:25:08.900228370+00:00 stderr F I0420 01:25:08.898580    8852 services_controller.go:567] Deleting service openshift-operator-lifecycle-manager/packageserver-service
2024-04-20T01:25:17.137956433+00:00 stderr F I0420 01:25:17.137891    8852 obj_retry.go:296] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp
2024-04-20T01:25:17.137956433+00:00 stderr F I0420 01:25:17.137933    8852 obj_retry.go:358] Adding new object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp
2024-04-20T01:25:17.137997952+00:00 stderr F I0420 01:25:17.137979    8852 obj_retry.go:370] Retry add failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp, will try again later: failed to obtain IPs to add remote pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: suppressed error logged: pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: no pod IPs found 
2024-04-20T01:25:19.099635059+00:00 stderr F I0420 01:25:19.099057    8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-master-1
2024-04-20T01:25:19.099635059+00:00 stderr F I0420 01:25:19.099080    8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-master-1: 35.077µs
2024-04-20T01:25:22.245550966+00:00 stderr F W0420 01:25:22.242774    8852 base_network_controller_namespace.go:458] Unable to remove remote zone pod's openshift-controller-manager/controller-manager-5485d88c84-xztxq IP address from the namespace address-set, err: pod openshift-controller-manager/controller-manager-5485d88c84-xztxq: no pod IPs found 
2024-04-20T01:25:22.262446336+00:00 stderr F W0420 01:25:22.261351    8852 base_network_controller_namespace.go:458] Unable to remove remote zone pod's openshift-route-controller-manager/route-controller-manager-6b5868f887-n6jj9 IP address from the namespace address-set, err: pod openshift-route-controller-manager/route-controller-manager-6b5868f887-n6jj9: no pod IPs found 
2024-04-20T01:25:27.154790226+00:00 stderr F I0420 01:25:27.154744    8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-worker-0
2024-04-20T01:25:27.154790226+00:00 stderr F I0420 01:25:27.154770    8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-worker-0: 31.72µs
2024-04-20T01:25:27.172301639+00:00 stderr F I0420 01:25:27.168666    8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-master-2
2024-04-20T01:25:27.172301639+00:00 stderr F I0420 01:25:27.168692    8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-master-2: 34.346µs
2024-04-20T01:25:27.196078022+00:00 stderr F I0420 01:25:27.194311    8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-master-0
2024-04-20T01:25:27.196078022+00:00 stderr F I0420 01:25:27.194339    8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-master-0: 40.027µs
2024-04-20T01:25:27.196078022+00:00 stderr F I0420 01:25:27.194582    8852 master.go:740] Adding or Updating Node "test-infra-cluster-4832ebf8-master-0"
2024-04-20T01:25:27.215435944+00:00 stderr F I0420 01:25:27.215387    8852 master.go:740] Adding or Updating Node "test-infra-cluster-4832ebf8-master-0"
2024-04-20T01:25:35.789830706+00:00 stderr F I0420 01:25:35.789782    8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-worker-1
2024-04-20T01:25:35.790044794+00:00 stderr F I0420 01:25:35.790025    8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-worker-1: 250.227µs
2024-04-20T01:25:37.596875642+00:00 stderr F I0420 01:25:37.596834    8852 iptables.go:358] "Running" command="iptables-save" arguments=["-t","nat"]
2024-04-20T01:25:47.138312366+00:00 stderr F I0420 01:25:47.138266    8852 obj_retry.go:296] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp
2024-04-20T01:25:47.138382299+00:00 stderr F I0420 01:25:47.138370    8852 obj_retry.go:358] Adding new object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp
2024-04-20T01:25:47.138453866+00:00 stderr F I0420 01:25:47.138440    8852 obj_retry.go:370] Retry add failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp, will try again later: failed to obtain IPs to add remote pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: suppressed error logged: pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: no pod IPs found 
2024-04-20T01:26:17.138583468+00:00 stderr F I0420 01:26:17.138544    8852 obj_retry.go:296] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp
2024-04-20T01:26:17.138640587+00:00 stderr F I0420 01:26:17.138629    8852 obj_retry.go:358] Adding new object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp
2024-04-20T01:26:17.138708817+00:00 stderr F I0420 01:26:17.138696    8852 obj_retry.go:370] Retry add failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp, will try again later: failed to obtain IPs to add remote pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: suppressed error logged: pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: no pod IPs found 
2024-04-20T01:26:39.474787436+00:00 stderr F I0420 01:26:39.474744    8852 reflector.go:790] k8s.io/client-go/informers/factory.go:159: Watch close - *v1.EndpointSlice total 130 items received
2024-04-20T01:26:39.475670148+00:00 stderr F E0420 01:26:39.475653    8852 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.EndpointSlice: the server has asked for the client to provide credentials (get endpointslices.discovery.k8s.io)
2024-04-20T01:26:40.786339334+00:00 stderr F I0420 01:26:40.786255    8852 reflector.go:325] Listing and watching *v1.EndpointSlice from k8s.io/client-go/informers/factory.go:159
2024-04-20T01:26:40.806238387+00:00 stderr F W0420 01:26:40.804542    8852 reflector.go:535] k8s.io/client-go/informers/factory.go:159: failed to list *v1.EndpointSlice: Unauthorized
2024-04-20T01:26:40.806238387+00:00 stderr F E0420 01:26:40.804571    8852 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Unauthorized

Expected results:

ovnkube-node detects that cert is expired, requests new certs via CSR flow and reloads them
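
The expected behavior above amounts to checking the certificate's expiry against the current wall-clock time, so that a clock jump after a resume is noticed. A minimal Python sketch of that decision follows. The function name, the state labels and the renewal window are assumptions made for illustration; they do not describe the ovnkube-node implementation or its CSR flow.

```python
from datetime import datetime, timedelta, timezone


def cert_state(not_after, now, renew_before=timedelta(days=5)):
    """Classify a certificate by comparing its expiry with the current time.

    Returns "expired" once the expiry has passed, "renew" inside the
    (hypothetical) renewal window before expiry, and "valid" otherwise.
    """
    if now >= not_after:
        return "expired"
    if now >= not_after - renew_before:
        return "renew"
    return "valid"


def should_request_new_cert(not_after, now):
    """A new certificate is needed in both the "renew" and "expired" states."""
    return cert_state(not_after, now) != "valid"
```

Evaluating this on each use, or right after a resume, rather than relying only on a timer scheduled before the suspend, would catch the case where the clock moves past the expiry while the process is paused.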

Additional info:

CI periodic to check this flow: https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ovn-sno-cert-rotation-suspend-30d
artifacts contain sosreport

Applies to both SNO and HA clusters. The flow works as expected when nodes are properly shut down instead of suspended.

https://github.com/openshift/ovn-kubernetes/pull/2081

Bug OCPBUGS-29003: Default Internal Registry cleans custom images stored on it from 4.13 to 4.14

Description of problem:

After an upgrade from 4.13.x to 4.14.10, the workload images that the customer stored in the internal registry are lost, leaving the application pods in the error state "Back-off pulling image".

Even a manual pull with podman fails with "manifest unknown", because the image can no longer be found in the registry.


- This behavior was found and reproduced 100% of the time on ARO clusters, where the internal registry is backed by default by the Storage Account created by the ARO RP service principal, which is the Containers blob service.

- I do not know whether the same behavior occurs in non-managed Azure clusters or on any other architecture.

Version-Release number of selected component (if applicable):

4.14.10

How reproducible:

100% with an ARO cluster (Managed cluster)

Steps to Reproduce: Attached.

The workaround found so far is to rebuild the apps or re-import the images, but those tasks are lengthy and costly, especially on a production cluster.

Bug OCPBUGS-25001: Update 4.16 ose-cluster-kube-controller-manager-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/774

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-24862: Update 4.16 ose-installer-altinfra-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/installer/pull/7819

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/installer/pull/7872

Bug OCPBUGS-27399: duplicate pending CSR info shown on Nodes list page

Description of problem:

When only one server CSR is pending approval, the Nodes list page still shows two records: one for a client CSR that requires approval and is already several hours old, and one for the server CSR that requires approval.

Version-Release number of selected component (if applicable):

pre-merge testing of https://github.com/openshift/console/pull/13493

How reproducible:

Always

Steps to Reproduce:

1. Select one node that is joining the cluster, approve its client CSR but not its server CSR, and wait for some time.

=> Only one node is pending server CSR approval:
$ oc get csr | grep Pending | grep system:node
csr-54sn4   142m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-7nhb9   65m    kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-9g22f   4m4s   kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-bgrdq   35m    kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-chqnf   50m    kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-f4sbl   127m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-msnml   157m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-p9qrp   19m    kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-qp2pw   112m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-qrlnv   96m    kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-tk7j4   81m    kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending

Actual results:

1. on nodes list page, we can see two rows shown for node ip-10-0-49-55.us-east-2.compute.internal

Expected results:

Because the pending client CSR has been there for several hours and the node is now actually waiting for server CSR approval, we should show only one record/row, telling the user that server CSR approval is required.

The pending client CSR associated with ip-10-0-49-55.us-east-2.compute.internal is already 3 hours old
$ oc get csr csr-4d628
NAME        AGE   SIGNERNAME                                    REQUESTOR                                                                   REQUESTEDDURATION   CONDITION
csr-4d628   3h    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
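
The expected behavior can be illustrated with a minimal sketch. It is not the console's implementation; the `PendingCSR` type, its pre-resolved `node` field, and the `one_row_per_node` helper are hypothetical names introduced only for this example. The idea is to keep a single pending CSR per node, preferring a server CSR over a stale client CSR, and the newest request within the same kind.

```python
from dataclasses import dataclass

CLIENT_SIGNER = "kubernetes.io/kube-apiserver-client-kubelet"
SERVER_SIGNER = "kubernetes.io/kubelet-serving"


@dataclass(frozen=True)
class PendingCSR:
    name: str
    node: str          # hypothetical field: node name already resolved by the caller
    signer: str
    age_minutes: int


def _rank(csr):
    # Server CSRs sort before client CSRs; within one kind, the newest sorts first.
    return (0 if csr.signer == SERVER_SIGNER else 1, csr.age_minutes)


def one_row_per_node(csrs):
    """Map each node name to the single pending CSR that should be displayed."""
    rows = {}
    for csr in csrs:
        shown = rows.get(csr.node)
        if shown is None or _rank(csr) < _rank(shown):
            rows[csr.node] = csr
    return rows


node = "ip-10-0-49-55.us-east-2.compute.internal"
pending = [
    PendingCSR("csr-4d628", node, CLIENT_SIGNER, 180),
    PendingCSR("csr-54sn4", node, SERVER_SIGNER, 142),
    PendingCSR("csr-9g22f", node, SERVER_SIGNER, 4),
]
rows = one_row_per_node(pending)
print(rows[node].name, rows[node].signer)
```

With the sample data taken from the listings above, the node ends up with exactly one row, and that row points at a server CSR.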

Additional info:

https://github.com/openshift/console/pull/13522

Bug OCPBUGS-27860: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/csi-driver-shared-resource/pull/164

Bug OCPBUGS-12876: Cluster fails to convert from IPv6 Primary Dualstack to IPv6 single stack

Description of problem:

Converting an IPv6-primary dual-stack cluster to IPv6 single stack causes control plane failures. The OVN masters periodically enter the CLBO (CrashLoopBackOff) state.

OVN masters logs: http://shell.lab.bos.redhat.com/~anusaxen/convert/

must-gather (MG) does not work because the cluster ends up in a bad state. Happy to share the cluster if needed for debugging.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-04-21-084440

How reproducible:

Always

Steps to Reproduce:

1. Bring up a cluster with IPv6-primary dual stack

2. Edit network.config.openshift.io to go from dual stack to single stack, as follows:


  spec:
    clusterNetwork:
    - cidr: fd01::/48
      hostPrefix: 64
    - cidr: 10.128.0.0/14
      hostPrefix: 23
    externalIP:
      policy: {}
    networkType: OVNKubernetes
    serviceNetwork:
    - fd02::/112
    - 172.30.0.0/16
  status:
    clusterNetwork:
    - cidr: fd01::/48
      hostPrefix: 64
    - cidr: 10.128.0.0/14
      hostPrefix: 23
    clusterNetworkMTU: 1400
    networkType: OVNKubernetes
    serviceNetwork:
    - fd02::/112
    - 172.30.0.0/16


to:

apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Network
  metadata:
    creationTimestamp: "2023-04-27T14:11:37Z"
    generation: 3
    name: cluster
    resourceVersion: "81045"
    uid: 28f15675-e739-4262-9acc-4c2c0df4b38d
  spec:
    clusterNetwork:
    - cidr: fd01::/48
      hostPrefix: 64
    externalIP:
      policy: {}
    networkType: OVNKubernetes
    serviceNetwork:
    - fd02::/112
  status:
    clusterNetwork:
    - cidr: fd01::/48
      hostPrefix: 64
    clusterNetworkMTU: 1400
    networkType: OVNKubernetes
    serviceNetwork:
    - fd02::/112
kind: List
metadata:
  resourceVersion: ""

3. Wait for control plane components to roll out successfully
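
As a sanity check on the edit in step 2, the IP families of a Network spec can be classified before and after the change. The following is a minimal sketch using only the Python standard library; `ip_families` and `stack_kind` are hypothetical helper names, and the dictionaries simply mirror the `spec` sections shown above.

```python
import ipaddress


def ip_families(cidrs):
    """Return the set of IP versions (4 and/or 6) used by the given CIDRs."""
    return {ipaddress.ip_network(cidr).version for cidr in cidrs}


def stack_kind(spec):
    """Classify a Network spec as dual-stack or single-stack."""
    cluster = ip_families(entry["cidr"] for entry in spec["clusterNetwork"])
    service = ip_families(spec["serviceNetwork"])
    if not cluster or cluster != service:
        raise ValueError(
            f"clusterNetwork families {cluster} do not match serviceNetwork families {service}"
        )
    if cluster == {4, 6}:
        return "dual-stack"
    return "ipv6-single-stack" if cluster == {6} else "ipv4-single-stack"


# Spec before the edit (dual stack) and after the edit (IPv6 only).
before = {
    "clusterNetwork": [
        {"cidr": "fd01::/48", "hostPrefix": 64},
        {"cidr": "10.128.0.0/14", "hostPrefix": 23},
    ],
    "serviceNetwork": ["fd02::/112", "172.30.0.0/16"],
}
after = {
    "clusterNetwork": [{"cidr": "fd01::/48", "hostPrefix": 64}],
    "serviceNetwork": ["fd02::/112"],
}

print(stack_kind(before), "->", stack_kind(after))
```

The check only validates that the cluster and service networks agree on their IP families; it says nothing about whether the cluster can actually complete the conversion, which is the subject of this bug.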

Actual results:

Cluster fails with network, ETCD, Kube API and ingress failures

Expected results:

Cluster should convert to IPv6 single without any issues

Additional info:

must-gather does not work because of the various control plane failures.

https://github.com/openshift/ovn-kubernetes/pull/2073

Bug OCPBUGS-19429: oc-mirror failed with a ImageSetConfiguration yaml containing two EUS channels

https://github.com/openshift/oc-mirror/pull/752

Bug OCPBUGS-25172: Update 4.16 ose-olm-catalogd-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/operator-framework-catalogd/pull/36

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/operator-framework-catalogd/pull/37

Bug OCPBUGS-25457: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-alibaba-cloud/pull/42

Bug OCPBUGS-25492: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/csi-external-provisioner/pull/85

Bug OCPBUGS-25206: Dev console: Pipelines integration tests was disabled because the operator wasn't available on 4.15

We need to re-enable the e2e integration tests as soon as the operator is available again.

https://github.com/openshift/console/pull/13438

Bug OCPBUGS-24795: Update 4.16 ose-olm-operator-controller-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/operator-framework-operator-controller/pull/51

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/operator-framework-operator-controller/pull/51

Bug OCPBUGS-16397: Nutanix: Telemetry “host_type” shows “virt-unknown” in an OCP Nutanix cluster

Description of problem:

Looking at the telemetry data for Nutanix I noticed that the “host_type” for clusters installed with platform nutanix shows as “virt-unknown”. Do you know what needs to happen in the code to tell telemetry about host type being Nutanix? The problem is that we can’t track those installations with platform none, just IPI.

Refer to the slack thread https://redhat-internal.slack.com/archives/C0211848DBN/p1687864857228739.

Version-Release number of selected component (if applicable):

How reproducible:

Always

Steps to Reproduce:

Create an OCP Nutanix cluster

Actual results:

The telemetry data for Nutanix shows the “host_type” for the nutanix cluster as “virt-unknown”.

Expected results:

The telemetry data for Nutanix shows the “host_type” for the nutanix cluster as "nutanix".

Additional info:

https://github.com/openshift/telemeter/pull/504

Bug OCPBUGS-23859: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/gcp-pd-csi-driver-operator/pull/98

Bug OCPBUGS-25588: Update 4.16 ose-aws-pod-identity-webhook-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/aws-pod-identity-webhook/pull/183

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/aws-pod-identity-webhook/pull/183

Bug OCPBUGS-25440: [AWS] iam:TagInstanceProfile permission is required for ipi install

Description of problem:

iam:TagInstanceProfile is not listed in the official documentation [1], and an IPI install fails if the iam:TagInstanceProfile permission is missing:

level=error msg=Error: creating IAM Instance Profile (ci-op-4hw2rz1v-49c30-zt9vx-worker-profile): AccessDenied: User: arn:aws:iam::301721915996:user/ci-op-4hw2rz1v-49c30-minimal-perm is not authorized to perform: iam:TagInstanceProfile on resource: arn:aws:iam::301721915996:instance-profile/ci-op-4hw2rz1v-49c30-zt9vx-worker-profile because no identity-based policy allows the iam:TagInstanceProfile action
level=error msg=    status code: 403, request id: bb0641f5-d01c-4538-b333-261a804ddb59

[1] https://docs.openshift.com/container-platform/4.14/installing/installing_aws/installing-aws-account.html#installation-aws-permissions_installing-aws-account

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-14-115151

How reproducible:

Always

Steps to Reproduce:

    1. Install a standard IPI cluster using the minimal permissions listed in the official documentation

Actual results:

Install failed.

Expected results:

Additional info:

install does a precheck for iam:TagInstanceProfile

https://github.com/openshift/installer/pull/7843

Bug OCPBUGS-15253: Add namespace to IngressWithoutClassName and UnmanagedRoutes alert message

Description of problem:

Debugging would be easier if we included the namespace in the message for these alerts: https://github.com/openshift/cluster-ingress-operator/blob/master/manifests/0000_90_ingress-operator_03_prometheusrules.yaml#L69

Version-Release number of selected component (if applicable):

4.12.x

How reproducible:

Always

Steps to Reproduce:


Actual results:

No namespace in the alert message

Expected results:

Additional info:

https://github.com/openshift/cluster-ingress-operator/pull/980

Bug OCPBUGS-17249: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/kubernetes/pull/1787

Bug OCPBUGS-24905: Update 4.16 ose-aws-ebs-csi-driver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-operator/pull/78

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-25567: Update 4.16 ose-baremetal-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/baremetal-operator/pull/328

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/baremetal-operator/pull/328

Bug OCPBUGS-29566: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-kube-scheduler-operator/pull/535

Bug OCPBUGS-25147: Update 4.16 ose-aws-ebs-csi-driver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-operator/pull/81

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-24383: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/2171

Bug OCPBUGS-24996: Update 4.16 ose-csi-snapshot-validation-webhook-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/129

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-25142: OCP conformance - oc service creates and deletes services - provided port is already allocated

Description of problem:

    Ran into a problem with our testing this morning on a newly created ROKS cluster:
```
    Error running /usr/bin/oc --namespace=e2e-test-oc-service-p4fz2 --kubeconfig=/tmp/configfile2694323048 create service nodeport mynodeport --tcp=8080:7777 --node-port=30000:    StdOut>    error: failed to create NodePort service: Service "mynodeport" is invalid: spec.ports[0].nodePort: Invalid value: 30000: provided port is already allocated    StdErr>    error: failed to create NodePort service: Service "mynodeport" is invalid: spec.ports[0].nodePort: Invalid value: 30000: provided port is already allocated    exit status 1
```

The port was already in use by a different service. We would like to request that the test make the port number dynamic, so that if the port is already taken, the test can choose an available one.
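
The requested behavior could look roughly like the following sketch. It is not the change made in the linked PR; `pick_node_port` is a hypothetical helper, and the set of already allocated ports is assumed to be gathered by the caller from the existing Services. The range 30000-32767 is the Kubernetes default NodePort range; a cluster configured with a different range would need other bounds.

```python
import random

# Kubernetes default NodePort range; clusters may be configured differently.
NODE_PORT_MIN = 30000
NODE_PORT_MAX = 32767


def pick_node_port(allocated, rng=None):
    """Return a NodePort from the default range that is not in `allocated`."""
    taken = set(allocated)
    free = [p for p in range(NODE_PORT_MIN, NODE_PORT_MAX + 1) if p not in taken]
    if not free:
        raise RuntimeError("no free NodePort left in the configured range")
    # Choosing at random lowers the chance that parallel tests pick the same port.
    return (rng or random.Random()).choice(free)


# Port 30000 is taken by another service, as in the failure above.
port = pick_node_port({30000})
print(f"--node-port={port}")
```

A random pick still leaves a small race window between choosing the port and creating the Service, so a test using this approach would probably also want to retry on the "provided port is already allocated" error.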

https://github.com/openshift/origin/pull/28495

Task MON-2853: Write Runbook for Alert TargetDown

A runbook for the TargetDown alert will be useful for OpenShift users.
This runbook should explain:
1. how to identify which targets are down
2. how to investigate the reason why the target goes offline
3. resolution of common causes bringing down the target

Related cases:

TLS certificate expiration, see discussion

https://github.com/openshift/cluster-monitoring-operator/pull/2237

Bug OCPBUGS-21800: Dev console buildconfig got [the server does not allow this method on the requested resource] error when not setting metadate.namespace

Description of problem:

Creating a BuildConfig in the Dev console fails with the error [the server does not allow this method on the requested resource] when metadata.namespace is not set.

How reproducible:

The test case is shown below.

Steps to Reproduce:

Use the following YAML to create a BuildConfig in the OpenShift console:
Developer -> Builds -> Create BuildConfig -> YAML view


  
~~~
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: mywebsite
  labels:
    name: mywebsite
spec:
  triggers:
  - type: ImageChange
    imageChange: {}
  - type: ConfigChange
  source:
    type: Git
    git:
      uri: https://github.com/monodot/container-up
    contextDir: httpd-hello-world
  strategy:
    type: Docker
    dockerStrategy:
      dockerfilePath: Dockerfile
      from:
        kind: ImageStreamTag
        name: httpd:latest 
        namespace: testbuild
  output:
    to:
      kind: ImageStreamTag
      name: mywebsite:latest 
~~~

Actual results:

The error [the server does not allow this method on the requested resource] is returned.

Expected results:

Omitting metadata.namespace does not trigger this error when the BuildConfig is created from the CLI, and according to the customer it did not trigger the error in the 4.11 GUI console either. Does that mean the code changed in 4.13?
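
One way to picture the expected behavior is a client that fills in the active project when the manifest omits metadata.namespace. This is only a sketch of the idea, not the console's code; `with_default_namespace` and the `testbuild` project name used as the active namespace are assumptions made for the example.

```python
import copy


def with_default_namespace(manifest, active_namespace):
    """Return a copy of `manifest` whose metadata.namespace is always set."""
    result = copy.deepcopy(manifest)
    metadata = result.setdefault("metadata", {})
    if not metadata.get("namespace"):
        # Fall back to the namespace the user is currently working in.
        metadata["namespace"] = active_namespace
    return result


build_config = {
    "apiVersion": "build.openshift.io/v1",
    "kind": "BuildConfig",
    "metadata": {"name": "mywebsite", "labels": {"name": "mywebsite"}},
}
resolved = with_default_namespace(build_config, "testbuild")
print(resolved["metadata"]["namespace"])
```

An explicitly set namespace is left untouched, so the defaulting only affects manifests like the one in this report.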

https://github.com/openshift/console/pull/13544

Bug OCPBUGS-27450: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-azure/pull/299

Bug OCPBUGS-29476: Core CAPI CRDs not deployed on unsupported platforms even when explicitly needed by other operators

Description of problem:

Core CAPI CRDs not deployed on unsupported platforms even when explicitly needed by other operators.

An example of this is vSphere clusters. CAPI is not yet supported on vSphere clusters, but the CAPI IPAM CRDs are needed by operators other than the usual consumers, cluster-capi-operator and the CAPI controllers.

Version-Release number of selected component (if applicable):

How reproducible:

    Launch a techpreview cluster for an unsupported platform (e.g. vsphere/azure). Check that the Core CAPI CRDs are not present.

Steps to Reproduce:

    $ oc get crds | grep cluster.x-k8s.io

Actual results:

    Core CAPI CRDs are not present (only the metal ones)

Expected results:

    Core CAPI CRDs should be present

Additional info:

Bug OCPBUGS-24815: Update 4.16 csi-livenessprobe-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/55

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-livenessprobe/pull/55

Bug OCPBUGS-28596: Update 4.16 ose-cluster-ingress-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1020

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-ingress-operator/pull/1020

Bug OCPBUGS-29103: HCP CSR Allows Invalid CNs

Description of problem:

    The HCP CSR flow allows any CN in the incoming CSR.

Version-Release number of selected component (if applicable):

    4.16.0

How reproducible:

    With the CSR flow, any name you put in the CN of the CSR becomes your username against the Kubernetes API server. You can check your username with the SelfSubjectReview API (kubectl auth whoami).

Steps to Reproduce:

    1. Create a CSR with CN=whatever
    2. Get the CSR signed and create a kubeconfig
    3. Using the kubeconfig, run kubectl auth whoami; it shows whatever CN was set in the CSR

Actual results:

    any CN in CSR is the username against the cluster

Expected results:

    we should only allow CNs with some known prefix (system:customer-break-glass:...)
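
The expected check can be sketched as a simple prefix validation. This is an illustration only, not the HyperShift implementation: `validate_common_name` is a hypothetical helper, it takes the CN as an already extracted string, and the prefix is the one quoted above.

```python
# Prefix taken from the expected results above; treat it as an assumption.
ALLOWED_CN_PREFIX = "system:customer-break-glass:"


def validate_common_name(common_name):
    """Return the CN if it carries the allowed prefix, otherwise raise ValueError."""
    if not common_name.startswith(ALLOWED_CN_PREFIX):
        raise ValueError(
            f"CN {common_name!r} must start with {ALLOWED_CN_PREFIX!r}"
        )
    if len(common_name) == len(ALLOWED_CN_PREFIX):
        raise ValueError("CN must contain a name after the prefix")
    return common_name


for cn in ("system:customer-break-glass:alice", "whatever", "system:admin"):
    try:
        validate_common_name(cn)
        print(f"{cn}: accepted")
    except ValueError as err:
        print(f"{cn}: rejected ({err})")
```

A signer applying a check of this kind would refuse to approve or sign the CSR from step 1, so an arbitrary CN could no longer become a username on the cluster.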

Additional info:

https://github.com/openshift/hypershift/pull/3538

Story RHOBS-995: Add unit testing for telemeter server rules

While implementing MON-3669, we realized that none of the recording rules running on the telemeter server side are tested. Given how complex these rules can be, it's important for us to be confident that future changes won't bring regressions.

Even though it is not perfect, it's possible to unit test Prometheus rules with the promtool binary (example in CMO: https://github.com/openshift/cluster-monitoring-operator/blob/2ca7067a4d1fc86b31f7a4816c85da6abc0c8abf/Makefile#L218-L221).

DoD

Add a unit test for "cluster:capacity_effective_cpu_cores" recording rule (added by https://github.com/openshift/telemeter/pull/501).
Rule unit testing added as a merge requirement to openshift/telemeter.

https://github.com/openshift/telemeter/pull/506

Bug OCPBUGS-24847: Update 4.16 ose-gcp-pd-csi-driver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/99

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/gcp-pd-csi-driver-operator/pull/99

Bug MGMT-16459: [STG] Cluster 4.15.0-rc.0 with HTTP Proxy failed on timeout due the failed to StartContainer for etcd with CrashLoopBackOff

Description of the problem:

Installing a cluster with OCP image 4.15.0-rc.0 and an HTTP Proxy configuration failed with

"Control plane was not installed

3/3 control plane nodes failed to install. Please check the installation logs for more information."
and
"error Host master-0-1: updated status from installing-in-progress to error (Host failed to install because its installation stage Waiting for control plane took longer than expected 1h0m0s)"

The journalctl log on node master-0-1 shows the error:
"Dec 21 00:54:29 master-0-1 kubelet.sh[5111]: E1221 00:54:29.290568 5111 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=etcd pod=etcd-bootstrap-member-master-0-1_openshift-etcd(97cad44a9feb70b1091eaa3fb1e565ca)\"" pod="openshift-etcd/etcd-bootstrap-member-master-0-1" podUID="97cad44a9feb70b1091eaa3fb1e565ca""

The HTTP Proxy configuration works fine with OCP images 4.14.5 and 4.13.26.

Steps to reproduce:

1. Set up an HTTP Proxy server on the hypervisor using quay.io/sameersbn/squid:3.5.27-2

2. Create a cluster and go to Host Discovery

3. Press Add Host and fill in the SSH public key

4. Select Show proxy settings and fill in HTTP proxy URL and No proxy domains

5. Generate the ISO image, download it, and boot 3 master nodes and 2 worker nodes

6. Continue with the regular cluster installation

Actual results:

Installation failed at 69%

Expected results:

Installation passed

https://github.com/openshift/assisted-installer-agent/pull/664

Bug OCPBUGS-24587: ResolutionFailed doesn't clear after recovery

Description of problem:

Install some operators. After some time, the ResolutionFailed condition shows up:

$ kubectl get subscription.operators -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,ResolutionFailed:.status.conditions[?(@.type=="ResolutionFailed")].status,MSG:.status.conditions[?(@.type=="ResolutionFailed")].message'
NAMESPACE                   NAME                                                                         ResolutionFailed   MSG
infra-sso                   rhbk-operator                                                                True               [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"]
metallb-system              metallb-operator-sub                                                         True               [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"]
multicluster-engine         multicluster-engine                                                          True               [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"]
open-cluster-management     acm-operator-subscription                                                    True               [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"]
openshift-cnv               kubevirt-hyperconverged                                                      True               [failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused", failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.202.255:50051: connect: connection refused"]
openshift-gitops-operator   openshift-gitops-operator                                                    True               [failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused", failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.202.255:50051: connect: connection refused"]
openshift-local-storage     local-storage-operator                                                       True               [failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused", failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.202.255:50051: connect: connection refused"]
openshift-nmstate           kubernetes-nmstate-operator                                                  <none>             <none>
openshift-operators         devworkspace-operator-fast-redhat-operators-openshift-marketplace            <none>             <none>
openshift-operators         external-secrets-operator                                                    <none>             <none>
openshift-operators         web-terminal                                                                 <none>             <none>
openshift-storage           lvms                                                                         <none>             <none>
openshift-storage           mcg-operator-stable-4.14-redhat-operators-openshift-marketplace              <none>             <none>
openshift-storage           ocs-operator-stable-4.14-redhat-operators-openshift-marketplace              <none>             <none>
openshift-storage           odf-csi-addons-operator-stable-4.14-redhat-operators-openshift-marketplace   <none>             <none>
openshift-storage           odf-operator                                                                 <none>             <none>

The package server logs show that the catalog source was unavailable at one point. It became available again after a while, but the error did not disappear from the subscription.

Package server logs:

time="2023-12-05T14:27:09Z" level=warning msg="error getting bundle stream" action="refresh cache" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.30.37.69:50051: connect: connection refused\"" source="{redhat-operators openshift-marketplace}"
time="2023-12-05T14:27:09Z" level=info msg="updating PackageManifest based on CatalogSource changes: {community-operators openshift-marketplace}" action="sync catalogsource" address="community-operators.openshift-marketplace.svc:50051" name=community-operators namespace=openshift-marketplace
time="2023-12-05T14:28:26Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-marketplace openshift-marketplace}" action="sync catalogsource" address="redhat-marketplace.openshift-marketplace.svc:50051" name=redhat-marketplace namespace=openshift-marketplace
time="2023-12-05T14:30:23Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="certified-operators.openshift-marketplace.svc:50051" name=certified-operators namespace=openshift-marketplace
time="2023-12-05T14:35:56Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="certified-operators.openshift-marketplace.svc:50051" name=certified-operators namespace=openshift-marketplace
time="2023-12-05T14:37:28Z" level=info msg="updating PackageManifest based on CatalogSource changes: {community-operators openshift-marketplace}" action="sync catalogsource" address="community-operators.openshift-marketplace.svc:50051" name=community-operators namespace=openshift-marketplace
time="2023-12-05T14:37:28Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-operators openshift-marketplace}" action="sync catalogsource" address="redhat-operators.openshift-marketplace.svc:50051" name=redhat-operators namespace=openshift-marketplace
time="2023-12-05T14:39:40Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-marketplace openshift-marketplace}" action="sync catalogsource" address="redhat-marketplace.openshift-marketplace.svc:50051" name=redhat-marketplace namespace=openshift-marketplace
time="2023-12-05T14:46:07Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="certified-operators.openshift-marketplace.svc:50051" name=certified-operators namespace=openshift-marketplace
time="2023-12-05T14:47:37Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-operators openshift-marketplace}" action="sync catalogsource" address="redhat-operators.openshift-marketplace.svc:50051" name=redhat-operators namespace=openshift-marketplace
time="2023-12-05T14:48:21Z" level=info msg="updating PackageManifest based on CatalogSource changes: {community-operators openshift-marketplace}" action="sync catalogsource" address="community-operators.openshift-marketplace.svc:50051" name=community-operators namespace=openshift-marketplace
time="2023-12-05T14:49:53Z" level=info msg="updating
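
The log lines above use a key=value layout, so the sequence "source fails, source is synced again later" can be checked mechanically. The following Python sketch is an illustration only. The helper names are hypothetical, and it assumes that a later "sync catalogsource" entry for a source means the source became reachable again, which the report implies but the logs do not state outright. Truncated lines (such as the last one above) are skipped.

```python
import shlex


def parse_log_line(line):
    """Parse one key=value log line into a dict; return None if the line is truncated."""
    try:
        tokens = shlex.split(line)
    except ValueError:
        # An unterminated quote (a cut-off line) cannot be parsed reliably.
        return None
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def recovered_sources(lines):
    """Return (name, namespace) pairs that logged a warning and were synced afterwards."""
    failed, recovered = set(), set()
    for line in lines:
        fields = parse_log_line(line)
        if not fields:
            continue
        if fields.get("level") == "warning" and "source" in fields:
            # The source field looks like "{redhat-operators openshift-marketplace}".
            failed.add(tuple(fields["source"].strip("{}").split()))
        elif fields.get("action") == "sync catalogsource":
            key = (fields.get("name"), fields.get("namespace"))
            if key in failed:
                recovered.add(key)
    return recovered
```

A source returned by recovered_sources whose subscription still reports ResolutionFailed would match the behaviour described in this bug.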

Version-Release number of selected component (if applicable):

4.14.3

How reproducible:

Steps to Reproduce:

    1. Install an operator, for example metallb.
    2. Wait until the catalog pod has been unavailable at least once.
    3. ResolutionFailed no longer disappears from the subscription.

Actual results:

ResolutionFailed never disappears from the subscription.

Expected results:

ResolutionFailed disappears from the subscription.

Additional info:

https://github.com/openshift/operator-framework-olm/pull/679

Bug OCPBUGS-24864: Update 4.16 ose-multus-admission-controller-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/multus-admission-controller/pull/78

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-16801: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/oc-mirror/pull/773

Bug OCPBUGS-25124: Update 4.16 ose-csi-snapshot-controller-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/130

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-snapshotter/pull/130

Bug OCPBUGS-26048: The default channel is not correct

Description of problem:

The default channel of 4.15, 4.16 clusters is stable-4.14.

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-01-03-193825

How reproducible:

Always

Steps to Reproduce:

    1. Install a 4.16 cluster
    2. Check default channel
# oc adm upgrade 
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.16.0-0.nightly-2024-01-03-193825 not found in the "stable-4.14" channel

Cluster version is 4.16.0-0.nightly-2024-01-03-193825

Upgradeable=False

  Reason: MissingUpgradeableAnnotation
  Message: Cluster operator cloud-credential should not be upgraded between minor versions: Upgradeable annotation cloudcredential.openshift.io/upgradeable-to on cloudcredential.operator.openshift.io/cluster object needs updating before upgrade. See Manually Creating IAM documentation for instructions on preparing a cluster for upgrade.

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.14

    3.

Actual results:

Default channel is stable-4.14 in a 4.16 cluster

Expected results:

Default channel should be stable-4.16 in a 4.16 cluster

Additional info:

4.15 cluster has the issue as well.
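
The expected behaviour can be stated as a small rule: the default channel should be "stable-" followed by the major.minor of the cluster version. The Python sketch below only illustrates that rule. The function name is hypothetical and this is not the installer's actual implementation.

```python
import re


def expected_default_channel(version):
    """Derive the stable channel name that matches a cluster version string."""
    match = re.match(r"(\d+)\.(\d+)\.", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    major, minor = match.groups()
    return f"stable-{major}.{minor}"
```

For the nightly in this report, expected_default_channel("4.16.0-0.nightly-2024-01-03-193825") yields "stable-4.16", not the "stable-4.14" shown by oc adm upgrade.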

https://github.com/openshift/installer/pull/7867

Bug MGMT-16716: [STG][UI][Nutanix] Operator Install OpenShift Virtualization should be disabled when select platform Nutanix

Description of the problem:

Per the latest decision, Red Hat will not support installing an OCP cluster on Nutanix with nested virtualization. Therefore, the check box "Install OpenShift Virtualization" on the "Operators" page should be disabled when the platform "Nutanix" is selected on the "Cluster Details" page.

Slack discussion thread

https://redhat-internal.slack.com/archives/C0211848DBN/p1706640683120159

Nutanix
https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000XeiHCAS

https://github.com/openshift/assisted-service/pull/5991

Bug OCPBUGS-24805: Update 4.16 ose-kubevirt-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/kubevirt-csi-driver/pull/26

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/kubevirt-csi-driver/pull/26

Bug OCPBUGS-20368: [4.15] E2E Automation of Dynamic OVS Pinning

Description of problem:

Automate E2E tests of Dynamic OVS Pinning. This bug is created for merging

https://github.com/openshift/cluster-node-tuning-operator/pull/746

Version-Release number of selected component (if applicable):

4.15.0

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-node-tuning-operator/pull/746

Bug OCPBUGS-24818: Update 4.16 kube-rbac-proxy-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/88

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/kube-rbac-proxy/pull/88

Bug OCPBUGS-25894: Kube-apiserver operator is trying to delete prometheus rule that does not exists

Description of problem:

The kube-apiserver operator is trying to delete a Prometheus rule that does not exist, which leads to a huge amount of unwanted audit logs.

With the change introduced as part of BUG-2004585, the kube-apiserver SLO rules are split into 2 groups, kube-apiserver-slos-basic and kube-apiserver-slos-extended. The kube-apiserver-operator is still trying to delete /apis/monitoring.coreos.com/v1/namespaces/openshift-kube-apiserver/prometheusrules/kube-apiserver-slos, which no longer exists in the cluster.

Version-Release number of selected component (if applicable):

4.12
4.13
4.14

How reproducible:

    It's easy to reproduce

Steps to Reproduce:

    1. Install a cluster with 4.12.
    2. Enable cluster logging.
    3. Forward the audit log to an internal or external log store using the config below:

apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  pipelines: 
  - name: all-to-default
    inputRefs:
    - infrastructure
    - application
    - audit
    outputRefs:
    - default     

    4. Check the audit logs in Kibana. They look like the image below.

Actual results:

    The kube-apiserver-operator is trying to delete a Prometheus rule that does not exist in the cluster.

Expected results:

If the rule does not exist in the cluster, the operator should not attempt to delete it.
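
The expected behaviour amounts to guarding the delete with an existence check. The Python sketch below illustrates the idea with an in-memory stand-in. FakeRuleStore and remove_obsolete_rule are hypothetical names, and this is not the operator's actual code. In a real controller the existence check would typically be served from a local cache, so that it does not add API requests of its own.

```python
class FakeRuleStore:
    """In-memory stand-in for the cluster's PrometheusRule objects (hypothetical)."""

    def __init__(self, names):
        self.names = set(names)
        self.delete_requests = 0

    def exists(self, name):
        return name in self.names

    def delete(self, name):
        # Each call here stands for one DELETE request that would be audited.
        self.delete_requests += 1
        self.names.discard(name)


def remove_obsolete_rule(store, name):
    """Delete the rule only when it is present; return True if a delete was issued."""
    if not store.exists(name):
        return False
    store.delete(name)
    return True
```

With only kube-apiserver-slos-basic and kube-apiserver-slos-extended in the store, repeated calls for kube-apiserver-slos issue no delete requests at all.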

Additional info:

https://github.com/openshift/cluster-kube-apiserver-operator/pull/1642

Bug OCPBUGS-25132: dual-stack UPI: IPv6 security group rules created for single-stack cluster

Description of problem:

The security-groups.yaml playbook runs the IPv6 security group rule creation tasks regardless of the os_subnet6 value.
The when clause does not consider the os_subnet6 [1] value, so the tasks are always executed.

It works with:

  - name: 'Create security groups for IPv6'
    block:
    - name: 'Create master-sg IPv6 rule "OpenShift API"'
    [...]
    when: os_subnet6 is defined
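
The effect of that guard can be mirrored in a few lines of Python. This is only an illustration of the condition: the function name and the IPv4 rule name are hypothetical, and the inventory is modelled as a plain dict of variables.

```python
def planned_security_group_rules(inventory):
    """Return the rule names to create for a given inventory (a dict of variables)."""
    rules = ['master-sg IPv4 rule "OpenShift API"']  # hypothetical IPv4 counterpart
    # Mirrors `when: os_subnet6 is defined`: the variable only has to be present.
    if "os_subnet6" in inventory:
        rules.append('master-sg IPv6 rule "OpenShift API"')
    return rules
```

An inventory without os_subnet6 (single-stack) yields no IPv6 rule, which is the expected result described in this report.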

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-11-033133

How reproducible:

Always

Steps to Reproduce:

1. Don't set the os_subnet6 in the inventory file [2] (so it's not dual-stack)
2. Deploy 4.15 UPI by running the UPI playbooks

Actual results:

IPv6 security group rules are created

Expected results:

IPv6 security group rules shouldn't be created

Additional info:
[1] https://github.com/openshift/installer/blob/46fd66272538c350327880e1ed261b70401b406e/upi/openstack/security-groups.yaml#L375
[2] https://github.com/openshift/installer/blob/46fd66272538c350327880e1ed261b70401b406e/upi/openstack/inventory.yaml#L77

https://github.com/openshift/installer/pull/7833

Bug OCPBUGS-23925: VPAs from different projects are shown under one deployment "Resources" tab

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

Bug OCPBUGS-25927: vSphere config ini does not have resourcepool-path key

Reported in https://issues.redhat.com/browse/MGMT-14527

resourcepool-path key needs to be added to config ini

https://github.com/openshift/console/pull/13210

Bug OCPBUGS-27507: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/installer/pull/7942

Bug OCPBUGS-29974: ART requests updates to 4.16 image kube-rbac-proxy-container

Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/91

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/kube-rbac-proxy/pull/91

Bug OCPBUGS-27446: CCO reports manual instead of manualpodidentity mode in metrics for an Azure Workload Identity cluster

Steps to Reproduce:

1. Install a cluster using Azure Workload Identity
2. Check the value of the cco_credentials_mode metric

Actual results:

mode = manual

Expected results:

mode = manualpodidentity

Additional info:

The cco_credentials_mode metric reports manualpodidentity mode for an AWS STS cluster.

https://github.com/openshift/cloud-credential-operator/pull/657

Task OPRUN-3182: Clean up Downstream OLM go.mod

https://github.com/openshift/operator-framework-olm/pull/654

Bug OCPBUGS-24421: [vSphere-CSI-Driver-Operator] does not update the VSphereCSIDriverOperatorCRAvailable status timely

Description of problem:

[vSphere-CSI-Driver-Operator] does not update the VSphereCSIDriverOperatorCRAvailable status timely

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-04-162702

How reproducible:

Always

Steps to Reproduce:

1. Set up a vSphere cluster with a 4.15 nightly.
2. Back up the secret/vmware-vsphere-cloud-credentials to "vmware-cc.yaml".
3. Change the secret/vmware-vsphere-cloud-credentials password to an invalid value under ns/openshift-cluster-csi-drivers using oc edit.
4. Wait for the cluster storage operator to become degraded and the driver controller pods to enter CrashLoopBackOff, then restore the backup secret "vmware-cc.yaml" by applying it.
5. Observe that the driver controller pods return to Running; the cluster storage operator should return to healthy.

Actual results:

In step 5, the driver controller pods return to Running, but the cluster storage operator stays stuck at Degraded: True for almost 1 hour.

$ oc get po
NAME                                                    READY   STATUS    RESTARTS        AGE
vmware-vsphere-csi-driver-controller-664db7d497-b98vt   13/13   Running   0               16s
vmware-vsphere-csi-driver-controller-664db7d497-rtj49   13/13   Running   0               23s
vmware-vsphere-csi-driver-node-2krg6                    3/3     Running   1 (3h4m ago)    3h5m
vmware-vsphere-csi-driver-node-2t928                    3/3     Running   2 (3h16m ago)   3h16m
vmware-vsphere-csi-driver-node-45kb8                    3/3     Running   2 (3h16m ago)   3h16m
vmware-vsphere-csi-driver-node-8vhg9                    3/3     Running   1 (3h16m ago)   3h16m
vmware-vsphere-csi-driver-node-9fh9l                    3/3     Running   1 (3h4m ago)    3h5m
vmware-vsphere-csi-driver-operator-5954476ddc-rkpqq     1/1     Running   2 (3h10m ago)   3h17m
vmware-vsphere-csi-driver-webhook-7b6b5d99f6-rxdt8      1/1     Running   0               3h16m
vmware-vsphere-csi-driver-webhook-7b6b5d99f6-skcbd      1/1     Running   0               3h16m
$ oc get co/storage -w
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
storage   4.15.0-0.nightly-2023-12-04-162702   False       False         True       8m39s   VSphereCSIDriverOperatorCRAvailable: VMwareVSphereControllerAvailable: error logging into vcenter: ServerFaultCode: Cannot complete login due to an incorrect user name or password.
storage   4.15.0-0.nightly-2023-12-04-162702   True        False         False      0s
$  oc get co/storage
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
storage   4.15.0-0.nightly-2023-12-04-162702   True        False         False      3m41s

Expected results:

In step 5, after the driver controller pods return to Running, the cluster storage operator should recover its healthy status immediately.

Additional info:

Compared with previous CI results, this issue seems to have started after 4.15.0-0.nightly-2023-11-25-110147.

https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/215

Bug OCPBUGS-24873: Update 4.16 ose-vsphere-cluster-api-controllers-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api-provider-vsphere/pull/29

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api-provider-vsphere/pull/29

Bug OCPBUGS-25546: Update 4.16 ose-haproxy-router-base-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/router/pull/550

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/router/pull/550

Bug OCPBUGS-30132: OCP 4.15.0 is not correctly refreshing operator catalogs (imagePullPolicy: IfNotPresent)

Description of problem:

In OCP 4.14, the catalog pods in openshift-marketplace were defined as:

$ oc get pods -n openshift-marketplace redhat-operators-4bnz4 -o yaml
apiVersion: v1
kind: Pod
metadata:
...
  labels:
    olm.catalogSource: redhat-operators
    olm.pod-spec-hash: 658b699dc
  name: redhat-operators-4bnz4
  namespace: openshift-marketplace
...
spec:
  containers:
  - image: registry.redhat.io/redhat/redhat-operator-index:v4.14
    imagePullPolicy: Always



Now on OCP 4.15 they are defined as:
apiVersion: v1
kind: Pod
metadata:
...
  name: redhat-operators-44wxs
  namespace: openshift-marketplace
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: true
    kind: CatalogSource
    name: redhat-operators
    uid: 3b41ac7b-7ad1-4d58-a62f-4a9e667ae356
  resourceVersion: "877589"
  uid: 65ad927c-3764-4412-8d34-82fd856a4cbc
spec:
  containers:
  - args:
    - serve
    - /extracted-catalog/catalog
    - --cache-dir=/extracted-catalog/cache
    command:
    - /bin/opm
...
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7259b65d8ae04c89cf8c4211e4d9ddc054bb8aebc7f26fac6699b314dc40dbe3
    imagePullPolicy: Always
...
  initContainers:
...
  - args:
    - --catalog.from=/configs
    - --catalog.to=/extracted-catalog/catalog
    - --cache.from=/tmp/cache
    - --cache.to=/extracted-catalog/cache
    command:
    - /utilities/copy-content
    image: registry.redhat.io/redhat/redhat-operator-index:v4.15
    imagePullPolicy: IfNotPresent
...



Because the initContainer that extracts the content of the index image (referenced by tag) uses `imagePullPolicy: IfNotPresent`, the catalogs are never really updated.
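
With IfNotPresent, the kubelet reuses a locally cached image and does not pull again, so a moving tag such as v4.15 is not refreshed on a node that already has it. The Python sketch below flags that combination in a pod manifest loaded as a dict. The function name is hypothetical and the check is a simplification: any image without an @sha256: digest is treated as tag-referenced.

```python
def tag_images_with_if_not_present(pod):
    """List (section, image) pairs that reference a tag and use IfNotPresent."""
    findings = []
    spec = pod.get("spec", {})
    for section in ("initContainers", "containers"):
        for container in spec.get(section, []):
            image = container.get("image", "")
            pinned_by_digest = "@sha256:" in image
            if not pinned_by_digest and container.get("imagePullPolicy") == "IfNotPresent":
                findings.append((section, image))
    return findings
```

Applied to the 4.15 pod above, this reports only the initContainer that uses registry.redhat.io/redhat/redhat-operator-index:v4.15. The digest-pinned opm container is not flagged.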

Version-Release number of selected component (if applicable):

    OCP 4.15.0

How reproducible:

    100%

Steps to Reproduce:

    1. Wait for a new version of an operator to be released on OCP 4.15.
    2.
    3.

Actual results:

    Operator catalogs are never really refreshed because of imagePullPolicy: IfNotPresent on the index image.

Expected results:

    Operator catalogs are periodically (every 10 minutes by default) refreshed

Additional info:

https://github.com/openshift/operator-framework-olm/pull/709

Bug OCPBUGS-25750: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/313

Bug OCPBUGS-27282: Make controllerAvailabilityPolicy field immutable

Description of problem:

    We need to make the controllerAvailabilityPolicy field immutable in the HostedCluster spec section to ensure the customer cannot switch between SingleReplica and HighAvailability.
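
As an illustration of what immutability means here, the Python sketch below rejects any update that changes the field. The function name and error text are hypothetical, and the sketch says nothing about how the linked PR enforces the rule.

```python
def validate_hosted_cluster_update(old_spec, new_spec):
    """Return a list of validation errors for an update to a HostedCluster spec."""
    errors = []
    field = "controllerAvailabilityPolicy"
    if old_spec.get(field) != new_spec.get(field):
        errors.append(
            f"spec.{field} is immutable: "
            f"{old_spec.get(field)!r} -> {new_spec.get(field)!r}"
        )
    return errors
```

An update from SingleReplica to HighAvailability (or the reverse) returns one error, while an update that leaves the field unchanged returns an empty list.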

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/hypershift/pull/3513

Bug OCPBUGS-30297: TaskRun with same name in different project don't show 2 entries when listing in all namespace

Description of problem:

    If TaskRuns with the same name exist in 2 different namespaces, the TaskRuns list page for All Projects shows only one record because the names are identical.

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    Always

Steps to Reproduce:

    1. Create a TaskRun using https://gist.github.com/karthikjeeyar/eb1bbdf9157431f5c875eb55ce47580c in 2 different namespaces
    2. Go to TaskRun list page
    3. Select All Projects

Actual results:

    Only one entry is shown

Expected results:

    Both entries should be visible
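
A common way for this symptom to arise is a list keyed by name alone, so entries from different namespaces overwrite each other. The Python sketch below contrasts that with a key built from namespace and name. It is an illustration only: the function names are hypothetical and the console's actual code is in the linked PR.

```python
def index_by_name(task_runs):
    """Key by name only: same-named items from other namespaces collide."""
    return {tr["metadata"]["name"]: tr for tr in task_runs}


def index_by_namespace_and_name(task_runs):
    """Key by (namespace, name): every item keeps its own entry."""
    return {
        (tr["metadata"]["namespace"], tr["metadata"]["name"]): tr
        for tr in task_runs
    }
```

For two TaskRuns that share a name in two namespaces, the first index holds one entry and the second holds two.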

Additional info:

https://github.com/openshift/console/pull/13650

Bug OCPBUGS-24852: Update 4.16 ose-cluster-ovirt-csi-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ovirt-csi-driver-operator/pull/129

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ovirt-csi-driver-operator/pull/129

Bug OCPBUGS-27027: CNO pod restart during kubevirt e2e

W0109 17:47:02.340203       1 builder.go:109] graceful termination failed, controllers failed with error: failed to get infrastructure name: infrastructureName not set in infrastructure 'cluster'

https://github.com/openshift/hypershift/pull/3409

Bug OCPBUGS-27156: [gcp] destroying the problem cluster unexpectedly deletes the dns record-sets not created by the installer

Description of problem:

   Trying to create a second cluster with the same cluster name and base domain as the first cluster fails, as expected, because of DNS record-set conflicts. But destroying the second cluster makes the first cluster inaccessible, which is unexpected.

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2024-01-14-100410

How reproducible:

    Always

Steps to Reproduce:

1. Create the first cluster and make sure it succeeds.
2. Try to create a second cluster with the same cluster name, base domain, and region, and make sure it fails.
3. Destroy the second cluster, which failed due to "Platform Provisioning Check".
4. Check whether the first cluster is still healthy.

Actual results:

    The first cluster becomes unhealthy because its DNS record-sets are deleted in step 3.

Expected results:

    The DNS record-sets of the first cluster stay untouched during step 3, and the first cluster stays healthy after step 3.

Additional info:

(1) The first cluster was created by Flexy-install job https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/257549/, and it is healthy initially

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2024-01-14-100410   True        False         54m     Cluster version is 4.15.0-0.nightly-2024-01-14-100410
$ oc get nodes
NAME                                                       STATUS   ROLES                  AGE   VERSION
jiwei-0115y-lgns8-master-0.c.openshift-qe.internal         Ready    control-plane,master   73m   v1.28.5+c84a6b8
jiwei-0115y-lgns8-master-1.c.openshift-qe.internal         Ready    control-plane,master   73m   v1.28.5+c84a6b8
jiwei-0115y-lgns8-master-2.c.openshift-qe.internal         Ready    control-plane,master   74m   v1.28.5+c84a6b8
jiwei-0115y-lgns8-worker-a-gqq96.c.openshift-qe.internal   Ready    worker                 62m   v1.28.5+c84a6b8
jiwei-0115y-lgns8-worker-b-2h9xd.c.openshift-qe.internal   Ready    worker                 63m   v1.28.5+c84a6b8
$ 

(2) Try to create the second cluster and expect it to fail because the DNS record already exists

$ openshift-install version
openshift-install 4.15.0-0.nightly-2024-01-14-100410
built from commit b6f320ab7eeb491b2ef333a16643c140239de0e5
release image registry.ci.openshift.org/ocp/release@sha256:385d84c803c776b44ce77b80f132c1b6ed10bd590f868c97e3e63993b811cc2d
release architecture amd64
$ mkdir test1
$ cp install-config.yaml test1
$ yq-3.3.0 r test1/install-config.yaml baseDomain
qe.gcp.devcluster.openshift.com
$ yq-3.3.0 r test1/install-config.yaml metadata
creationTimestamp: null
name: jiwei-0115y
$ yq-3.3.0 r test1/install-config.yaml platform
gcp:
  projectID: openshift-qe
  region: us-central1
$ openshift-install create cluster --dir test1
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
INFO Consuming Install Config from target directory 
FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": metadata.name: Invalid value: "jiwei-0115y": record(s) ["api.jiwei-0115y.qe.gcp.devcluster.openshift.com."] already exists in DNS Zone (openshift-qe/qe) and might be in use by another cluster, please remove it to continue 
$ 

(3) delete the second cluster

$ openshift-install destroy cluster --dir test1
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
INFO Deleted 2 recordset(s) in zone qe            
INFO Deleted 3 recordset(s) in zone jiwei-0115y-lgns8-private-zone 
WARNING Skipping deletion of DNS Zone jiwei-0115y-lgns8-private-zone, not created by installer 
INFO Time elapsed: 37s                            
INFO Uninstallation complete!                     
$ 

(4) check the first cluster status and the dns record-sets

$ oc get clusterversion
Unable to connect to the server: dial tcp: lookup api.jiwei-0115y.qe.gcp.devcluster.openshift.com on 10.11.5.160:53: no such host
$
$ gcloud dns managed-zones describe jiwei-0115y-lgns8-private-zone
cloudLoggingConfig:
  kind: dns#managedZoneCloudLoggingConfig
creationTime: '2024-01-15T07:22:55.199Z'
description: Created By OpenShift Installer
dnsName: jiwei-0115y.qe.gcp.devcluster.openshift.com.
id: '9193862213315831261'
kind: dns#managedZone
labels:
  kubernetes-io-cluster-jiwei-0115y-lgns8: owned
name: jiwei-0115y-lgns8-private-zone
nameServers:
- ns-gcp-private.googledomains.com.
privateVisibilityConfig:
  kind: dns#managedZonePrivateVisibilityConfig
  networks:
  - kind: dns#managedZonePrivateVisibilityConfigNetwork
    networkUrl: https://www.googleapis.com/compute/v1/projects/openshift-qe/global/networks/jiwei-0115y-lgns8-network
visibility: private
$ gcloud dns record-sets list --zone jiwei-0115y-lgns8-private-zone
NAME                                          TYPE  TTL    DATA
jiwei-0115y.qe.gcp.devcluster.openshift.com.  NS    21600  ns-gcp-private.googledomains.com.
jiwei-0115y.qe.gcp.devcluster.openshift.com.  SOA   21600  ns-gcp-private.googledomains.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300
$ gcloud dns record-sets list --zone qe --filter='name~jiwei-0115y'
Listed 0 items.
$
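
The private zone above carries the label kubernetes-io-cluster-jiwei-0115y-lgns8: owned, which identifies the first cluster. One possible guard, sketched below in Python, is to clean record-sets only in zones whose ownership label matches the infrastructure ID of the cluster being destroyed. This is a hypothetical illustration, not the change made in the linked PR. The function names are invented, the zone is modelled as a dict shaped like the gcloud describe output, and record-sets in the shared public zone would need a comparable ownership check.

```python
def owned_by_cluster(zone, infra_id):
    """True if the zone carries the ownership label for this infrastructure ID."""
    label = f"kubernetes-io-cluster-{infra_id}"
    return zone.get("labels", {}).get(label) == "owned"


def zones_to_clean(zones, infra_id):
    """Return names of zones whose record-sets this destroy run may delete."""
    return [zone["name"] for zone in zones if owned_by_cluster(zone, infra_id)]
```

With such a guard, a destroy run for a cluster with a different infrastructure ID would leave jiwei-0115y-lgns8-private-zone and its record-sets alone.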

https://github.com/openshift/installer/pull/7932

Bug TRT-1499: gcp-network-liveness has taken out payloads

This wasn't supposed to have a junit associated but it looks like it did and is now killing payloads. It started failing because the LB was reaped, and we do not yet have confirmation the new one will be preserved.

This should be pulled out until we've got confirmation that (a) there is no junit for the backend, and (b) the LB is on the preserve whitelist.

https://github.com/openshift/origin/pull/28586

Bug OCPBUGS-24381: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/2262

Bug OCPBUGS-25099: Update 4.16 ose-image-customization-controller-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/image-customization-controller/pull/114

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/image-customization-controller/pull/114

Task MON-3649: Upgrade prom-label-proxy to v0.8.0

Bug OCPBUGS-23362: CPO Failing to delete default worker security group, but not reflected in HostedCluster status condition

A hostedcluster/hostedcontrolplane pair was stuck uninstalling. The CPO logs showed:

"error": "failed to delete AWS default security group: failed to delete security group sg-04abe599e5567b025: DependencyViolation: resource sg-04abe599e5567b025 has a dependent object\n\tstatus code: 400, request id: f776a43f-8750-4f04-95ce-457659f59095"

Unfortunately, I do not have enough access to the AWS account to inspect this security group, but I know it is the default worker security group because it is recorded in the hostedcluster's .status.platform.aws.defaultWorkerSecurityGroupID field.

Version-Release number of selected component (if applicable):

4.14.1

How reproducible:

I haven't tried to reproduce it yet, but can do so and update this ticket when I do. My theory is:

Steps to Reproduce:

1. Create an AWS HostedCluster, wait for it to create/populate defaultWorkerSecurityGroupID
2. Attach the defaultWorkerSecurityGroupID to anything else in the AWS account unrelated to the HCP cluster
3. Attempt to delete the HostedCluster

Actual results:

CPO logs:
"error": "failed to delete AWS default security group: failed to delete security group sg-04abe599e5567b025: DependencyViolation: resource sg-04abe599e5567b025 has a dependent object\n\tstatus code: 400, request id: f776a43f-8750-4f04-95ce-457659f59095"

HostedCluster Status Condition
  - lastTransitionTime: "2023-11-09T22:18:09Z"
    message: ""
    observedGeneration: 3
    reason: StatusUnknown
    status: Unknown
    type: CloudResourcesDestroyed

Expected results:

I would expect that the CloudResourcesDestroyed status condition on the hostedcluster would reflect this security group as holding up the deletion instead of having to parse through logs.
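
One way to meet this expectation is to copy the deletion error into the condition's message instead of leaving it empty. The following is a minimal sketch, not the HyperShift implementation: the function name, the `reason` value, and the choice of `status: "False"` are assumptions made for illustration, and only the field names are taken from the condition shown above.

```python
from datetime import datetime, timezone


def destroy_failed_condition(error: str, observed_generation: int) -> dict:
    """Build a CloudResourcesDestroyed condition that carries the error text."""
    return {
        "lastTransitionTime": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "message": error,  # surface the CPO error instead of an empty string
        "observedGeneration": observed_generation,
        "reason": "ResourceDeletionFailed",  # hypothetical reason name
        "status": "False",  # assumption: a failed deletion is reported as False
        "type": "CloudResourcesDestroyed",
    }
```

With a condition like this, the blocking security group ID would be visible in the HostedCluster status without parsing the CPO logs.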

Additional info:

https://github.com/openshift/hypershift/pull/3307

Bug OCPBUGS-25543: Update 4.16 ose-azure-workload-identity-webhook-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/10

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/azure-workload-identity/pull/10

Task MON-1047: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-monitoring-operator/pull/2228

Bug OCPBUGS-23900: DynamicResources plugin is disabled even though DynamicResourceAllocation feature is enabled

Description of problem:

DRA plugins can be installed, but do not really work because the required scheduler plugin DynamicResources isn't enabled.

Version-Release number of selected component (if applicable):

4.14.3

How reproducible:

Always

Steps to Reproduce:

1. Install an OpenShift cluster and enable the TechPreviewNoUpgrade feature set either during installation or post-install. The feature set includes the DynamicResourceAllocation feature gate.
2. Install a DRA plugin by any vendor, e.g. by NVIDIA (requires at least one GPU worker with NVIDIA GPU drivers installed on the node, and a few tweaks to allow the plugin to run on OpenShift).
3. Create a resource claim.
4. Create a pod that consumes the resource claim.

Actual results:

The pod remains in ContainerCreating state, the claim in WaitingForFirstConsumer state forever, without any meaningful event or error message.

Expected results:

A resource is allocated according to the resource claim, and assigned to the pod.

Additional info:

The problem is caused by the DynamicResources scheduler plugin not being automatically enabled when the feature flag is turned on. This makes DRA plugins run without issues (the right APIs are available), but do nothing.

Bug OCPBUGS-24322: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/1986

Bug OCPBUGS-25898: PipelineRuns details page get active on Task selection on logs page and logs page get empty on logs tab selection

Description of problem:

Navigation on the PipelineRun Logs tab breaks when stepping through the tasks.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1. Navigate to PipelineRuns details page and select the Logs tab.
    2. Navigate through the tasks of the PipelineRun.

Actual results:

- The Details tab becomes active when any task is selected
- The Logs page is empty when the Logs tab is selected again
- The last task is not selected for completed PipelineRuns

Expected results:

- The Logs tab should stay active while the user is on the Logs tab
- The last task should be selected for completed PipelineRuns

Additional info:

  This is a regression introduced by the change to the tab selection logic in the HorizontalNav component.

https://github.com/openshift/console/pull/13216/files#diff-267d61f330ad6cd9b0f2d743d9ff27929fbe7001780d73e1ec88599d3778eb96R177-R190

Video: https://drive.google.com/file/d/15fx9GWO2dRh4uaibRmZ4VTk4HFxQ7NId/view?usp=sharing

https://github.com/openshift/console/pull/13470

Bug OCPBUGS-29915: CNCC got crashed when upgrade from 4.15 to 4.16 for gcp-ipi-disc-priv-oidc ci

Description of problem:
After upgrading from 4.15.0-rc.8 to 4.16.0-0.nightly-2024-02-23-013505 (gcp-ipi-disc-priv-oidc-f14), the openshift-cloud-network-config-controller pod enters CrashLoopBackOff with "Error building cloud provider client, err: error: cannot initialize google client". A must-gather is available. Another job (gcp-ipi-oidc-rt-fips-f14) failed with the same error.
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-gcp-ipi-disc-priv-oidc-f14/1761337726575054848

https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-gcp-ipi-oidc-rt-fips-f14/1760520933212164096

must-gather:
https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-gcp-ipi-disc-priv-oidc-f14/1761337726575054848/artifacts/gcp-ipi-disc-priv-oidc-f14/gather-must-gather/artifacts/

Version-Release number of selected component (if applicable):

 4.16.0-0.nightly-2024-02-23-013505

How reproducible:

Steps to Reproduce:

Upgrade a gcp-ipi-disc-priv-oidc-f14 cluster from 4.15.0-rc.8 to 4.16.0-0.nightly-2024-02-23-013505. The openshift-cloud-network-config-controller pod then enters CrashLoopBackOff with "Error building cloud provider client, err: error: cannot initialize google client". The gcp-ipi-oidc-rt-fips-f14 job failed with the same error.

Actual results:


containerStatuses:
  - containerID: cri-o://b7dc826c4004583a4195f953bb7c858f3645b3ba864db65c69282fc8b7a9a9e8
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:02a0ea00865bda78b3b04056dc9e4f596dae74996ecc1fcdee7fbe8d603e33f1
    imageID: 9dfa10971dce332900b111bbe6a28df76e1d6e0c5b9c132c3abfff80ea0afa9c
    lastState:
      terminated:
        containerID: cri-o://b7dc826c4004583a4195f953bb7c858f3645b3ba864db65c69282fc8b7a9a9e8
        exitCode: 255
        finishedAt: "2024-02-24T15:20:08Z"
        message: |
          r,UID:,APIVersion:apps/v1,ResourceVersion:,FieldPath:,},Reason:FeatureGatesInitialized,Message:FeatureGates updated to featuregates.Features{Enabled:[]v1.FeatureGateName{\"AlibabaPlatform\", \"AzureWorkloadIdentity\", \"BuildCSIVolumes\", \"CloudDualStackNodeIPs\", \"ExternalCloudProvider\", \"ExternalCloudProviderAzure\", \"ExternalCloudProviderExternal\", \"ExternalCloudProviderGCP\", \"KMSv1\", \"NetworkLiveMigration\", \"OpenShiftPodSecurityAdmission\", \"PrivateHostedZoneAWS\", \"VSphereControlPlaneMachineSet\"}, Disabled:[]v1.FeatureGateName{\"AdminNetworkPolicy\", \"AutomatedEtcdBackup\", \"CSIDriverSharedResource\", \"ClusterAPIInstall\", \"DNSNameResolver\", \"DisableKubeletCloudCredentialProviders\", \"DynamicResourceAllocation\", \"EventedPLEG\", \"GCPClusterHostedDNS\", \"GCPLabelsTags\", \"GatewayAPI\", \"InsightsConfigAPI\", \"InstallAlternateInfrastructureAWS\", \"MachineAPIOperatorDisableMachineHealthCheckController\", \"MachineAPIProviderOpenStack\", \"MachineConfigNodes\", \"ManagedBootImages\", \"MaxUnavailableStatefulSet\", \"MetricsServer\", \"MixedCPUsAllocation\", \"NodeSwap\", \"OnClusterBuild\", \"PinnedImages\", \"RouteExternalCertificate\", \"SignatureStores\", \"SigstoreImageVerification\", \"TranslateStreamCloseWebsocketRequests\", \"UpgradeStatus\", \"VSphereStaticIPs\", \"ValidatingAdmissionPolicy\", \"VolumeGroupSnapshot\"}},Source:EventSource{Component:cloud-network-config-controller-86bc6cf968-54kkg,Host:,},FirstTimestamp:2024-02-24 15:20:07.457325229 +0000 UTC m=+0.107570685,LastTimestamp:2024-02-24 15:20:07.457325229 +0000 UTC m=+0.107570685,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:cloud-network-config-controller-86bc6cf968-54kkg,ReportingInstance:,}"
          F0224 15:20:08.633010       1 main.go:138] Error building cloud provider client, err: error: cannot initialize google client, err: Get "http://169.254.169.254/computeMetadata/v1/universe/universe_domain": dial tcp 169.254.169.254:80: connect: connection refused
        reason: Error
        startedAt: "2024-02-24T15:20:07Z"
    name: controller
    ready: false
    restartCount: 12
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=controller pod=cloud-network-config-controller-86bc6cf968-54kkg_openshift-cloud-network-config-controller(95a0c264-ad8b-4fb0-9218-5b2b84fb8194)
        reason: CrashLoopBackOff

Expected results:

   CNCC does not crash after the upgrade.

Additional info:

https://github.com/openshift/cloud-network-config-controller/pull/132

Bug OCPBUGS-24912: Update 4.16 configmap-reload-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/configmap-reload/pull/58

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/configmap-reload/pull/58

Bug OCPBUGS-26400: tuned: tuned breaks dynamic IRQ affinity

Description of problem:

If GloballyDisableIrqLoadBalancing is disabled in the performance profile, IRQs should be balanced across all CPUs minus the CPUs that crio explicitly removes via the pod annotation irq-load-balancing.crio.io: "disable"

The scheduler plugin in tuned, however, attempts to affine all IRQs to the non-isolated cores. "Isolated" here means non-reserved, not truly isolated, cores. This is directly at odds with the user's intent, so tuned ends up fighting with crio/irqbalance, each trying to do something different.

Scenarios
- A pod is launched with the annotation after tuned has started, at runtime or after a reboot: OK
- After a reboot, tuned recovers after the guaranteed pod has been launched: broken
- tuned restarts at runtime for any reason: broken
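
The conflict can be illustrated with plain CPU sets. This is a minimal sketch with hypothetical function names and an invented 8-CPU layout; it models only the set arithmetic described above, not tuned, crio, or irqbalance themselves.

```python
def expected_irq_cpus(all_cpus: set, irq_disabled_cpus: set) -> set:
    """CPUs that should serve IRQs: everything except the CPUs crio removed
    for pods annotated with irq-load-balancing.crio.io: "disable"."""
    return all_cpus - irq_disabled_cpus


def tuned_irq_cpus(all_cpus: set, isolated_cpus: set) -> set:
    """CPUs the tuned scheduler plugin affines IRQs to: the non-isolated
    (that is, reserved) CPUs only."""
    return all_cpus - isolated_cpus


# Invented example layout: CPUs 0-1 reserved, 2-7 isolated,
# one guaranteed pod with the annotation pinned to CPUs 4-5.
all_cpus = set(range(8))
isolated = {2, 3, 4, 5, 6, 7}
annotated_pod_cpus = {4, 5}

wanted = expected_irq_cpus(all_cpus, annotated_pod_cpus)
applied = tuned_irq_cpus(all_cpus, isolated)
print(sorted(wanted))   # CPUs the user expects to handle IRQs
print(sorted(applied))  # CPUs left once tuned re-applies its affinity
```

In the broken scenarios, whichever component writes last wins, so a tuned restart replaces the wider set with the reserved-only set.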

Version-Release number of selected component (if applicable):

   4.14 and likely earlier

How reproducible:

    See description

Steps to Reproduce:

    1. See description

https://github.com/openshift/cluster-node-tuning-operator/pull/976

Bug OCPBUGS-24956: Installer should have a pre-check which prevents installation on non-BareMetal platforms without the CloudCredential cap

The Cloud Credential operator was made optional in OCP 4.15, see https://issues.redhat.com/browse/OCPEDGE-69. The CloudCredential cap was added as a new capability.

However, for OCP 4.15 the disablement of CCO is only supported on BareMetal platforms, see https://issues.redhat.com/browse/OCPEDGE-69?focusedId=23595076&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-23595076.

We propose to guard against installations on non-BareMetal platforms without the CloudCredential cap. This could be implemented similarly to https://issues.redhat.com/browse/OCPBUGS-15659.
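
The proposed guard could look roughly like the sketch below. It is an illustration only, not the installer's validation code: the function name and the `"baremetal"` platform identifier are assumptions, while the `CloudCredential` capability name comes from the description above.

```python
def check_cloud_credential_capability(platform: str, enabled_capabilities: set) -> list:
    """Return a list of validation errors (empty when the config is allowed)."""
    errors = []
    # Assumption: "baremetal" is the platform identifier used in the install config.
    if platform != "baremetal" and "CloudCredential" not in enabled_capabilities:
        errors.append(
            f"platform {platform!r} requires the CloudCredential capability to be enabled"
        )
    return errors
```

For example, `check_cloud_credential_capability("aws", {"Console"})` returns one error, while the same capability set on `"baremetal"` passes.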

Bug OCPBUGS-24977: Update 4.16 cluster-monitoring-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-monitoring-operator/pull/2190

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-30046: Missing dependency warning error in console UI dev env

Description of problem:

Missing `useEffect` hook dependency warning error in the console UI dev environment.

Version-Release number of selected component (if applicable): 4.15.0-0.ci.test-2024-02-28-133709-ci-ln-svyfg32-latest

How reproducible:

To reproduce:

Run `yarn lint --fix`

https://github.com/openshift/console/pull/13640

Story MON-3669: Provide a telemeter metric that tracks effective cluster size in cores for subscription usage

Red Hat OpenShift Container Platform subscriptions are often measured against underlying cores. However, the core metrics are unreliable in some known edge cases. In particular, when virtualization is used, the hypervisor may not report the underlying cores, depending on a variety of factors, and instead reports one core per "cpu", where a "cpu" is a schedulable executor (possibly backed by a single hyperthreaded executor). To address this, we assume a ratio of 2 vCPUs to 1 core and divide the "cores" value by 2 when all of the following hold: hyperthreading information was not reported, the CPU architecture is x86-64, and the cluster is not a bare-metal cluster.

At this time, x86-64 virtualized clusters are the ones affected.
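
The normalization rule can be sketched as a small function. This is an illustration of the arithmetic only, not the actual telemeter recording rule; the function and parameter names are hypothetical.

```python
def effective_cores(reported_cores: int,
                    hyperthreading_reported: bool,
                    is_x86_64: bool,
                    is_bare_metal: bool) -> float:
    """Return the core count used for subscription purposes.

    The reported value is halved (2 vCPUs : 1 core) only when hyperthreading
    information is missing, the architecture is x86-64, and the cluster is
    not bare metal. Otherwise the reported value is used as-is.
    """
    needs_normalizing = (
        not hyperthreading_reported and is_x86_64 and not is_bare_metal
    )
    return reported_cores / 2 if needs_normalizing else float(reported_cores)
```

Under these assumptions, a virtualized x86-64 cluster reporting 16 "cores" without hyperthreading information would count as 8 effective cores.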

Bug OCPBUGS-24963: Update 4.16 cluster-version-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-version-operator/pull/1004

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-25577: Update 4.16 ose-cluster-control-plane-machine-set-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/270

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/270

Bug MGMT-16266: Staging: Indication event showing how often host has been rebooted missing on some nodes

Description of the problem:
The event showing how often a host has been rebooted now appears for only some of the nodes.

11/22/2023, 12:15:10 AM Node test-infra-cluster-57eb6989-master-1 has been rebooted 2 times before completing installation
11/22/2023, 12:00:01 AM Host: test-infra-cluster-57eb6989-master-1, reached installation stage Rebooting
11/21/2023, 11:53:14 PM Host: test-infra-cluster-57eb6989-worker-0, reached installation stage Rebooting
11/21/2023, 11:53:13 PM Host: test-infra-cluster-57eb6989-worker-1, reached installation stage Rebooting
11/21/2023, 11:34:56 PM Host: test-infra-cluster-57eb6989-master-0, reached installation stage Rebooting
11/21/2023, 11:34:26 PM Host: test-infra-cluster-57eb6989-master-2, reached installation stage Rebooting

In this cluster, 4 events are missing.

11/21/2023, 3:49:34 PM Node test-infra-cluster-164a0f73-master-0 has been rebooted 2 times before completing installation
11/21/2023, 3:49:32 PM Node test-infra-cluster-164a0f73-worker-0 has been rebooted 2 times before completing installation
11/21/2023, 3:49:32 PM Node test-infra-cluster-164a0f73-worker-1 has been rebooted 2 times before completing installation
11/21/2023, 3:37:15 PM Host: test-infra-cluster-164a0f73-master-0, reached installation stage Rebooting
11/21/2023, 3:27:34 PM Host: test-infra-cluster-164a0f73-worker-0, reached installation stage Rebooting
11/21/2023, 3:27:30 PM Host: test-infra-cluster-164a0f73-worker-1, reached installation stage Rebooting
11/21/2023, 3:09:40 PM Host: test-infra-cluster-164a0f73-master-2, reached installation stage Rebooting
11/21/2023, 3:09:35 PM Host: test-infra-cluster-164a0f73-master-1, reached installation stage Rebooting

In this cluster, 2 events are missing.
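
The missing events can be found mechanically by comparing the hosts that reached the Rebooting stage with the hosts that have a reboot-count event. The sketch below assumes the two event message formats shown above; the function name is hypothetical.

```python
import re

REBOOTING = re.compile(r"Host: ([^,\s]+), reached installation stage Rebooting")
REBOOT_COUNT = re.compile(r"Node (\S+) has been rebooted \d+ times")


def hosts_missing_reboot_event(event_lines):
    """Return hosts that reached the Rebooting stage but have no reboot-count event."""
    rebooting, reported = set(), set()
    for line in event_lines:
        match = REBOOTING.search(line)
        if match:
            rebooting.add(match.group(1))
        match = REBOOT_COUNT.search(line)
        if match:
            reported.add(match.group(1))
    return sorted(rebooting - reported)
```

Applied to the first event list above, this returns the four hosts other than master-1; applied to the second, it returns master-1 and master-2.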

How reproducible:

Steps to reproduce:

1. create cluster

2. start installation

Actual results:
The reboot-count indication event is missing for some hosts.

Expected results:
Each host should have an indication event.

https://github.com/openshift/assisted-installer/pull/757

Bug OCPBUGS-25514: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-vsphere/pull/33

Bug OCPBUGS-25931: Singleline syntax for inline code snippet

Description of problem:

Single line execute markdown reference is not working.

Steps to Reproduce

1. Install the Web Terminal Operator.
2. Create a ConsoleQuickStart resource using the "copy-execute-demo.yaml" file.
3. Open "Sample Quick Start" from the quick start catalog page.

Actual results:

The inline code is not rendered properly, specifically for the single-line execute syntax.

Expected results:

The inline code should render as a code block with a small execute icon that runs the command in the web terminal.

https://github.com/openshift/console/pull/13580

Bug OCPBUGS-23519: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/2073

Story STOR-1714: Release leader election on operator shutdown

As an OCP user, I want storage operators to restart quickly and the newly started operator to start leading immediately, without a ~3 minute wait.

This means that the old operator should release its leadership after it receives SIGTERM and before it exits. Right now, storage operators fail to release the leadership in ~50% of cases.

Steps to reproduce:

1. Delete an operator Pod (`oc delete pod xyz`).
2. Wait for a replacement Pod to be created.
3. Check the logs of the replacement Pod. They should contain "successfully acquired lease XYZ" relatively quickly after the Pod starts (+/- 1 second?).
4. Go to step 1 and retry a few times.
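
Why releasing the lease matters can be shown with a toy lease model. This is a simplified in-memory sketch, not the operators' actual leader-election code: the `Lease` class is hypothetical, and the 180-second duration is an assumption chosen to mirror the ~3 minute wait mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: Optional[str] = None
    expires_at: float = 0.0

    def try_acquire(self, candidate: str, now: float, duration: float) -> bool:
        """Acquire (or renew) the lease if it is free, expired, or already ours."""
        if self.holder is None or now >= self.expires_at or self.holder == candidate:
            self.holder = candidate
            self.expires_at = now + duration
            return True
        return False

    def release(self, holder: str) -> None:
        """Give up the lease, as the old operator should do on SIGTERM."""
        if self.holder == holder:
            self.holder = None


lease = Lease()
lease.try_acquire("old-pod", now=0.0, duration=180.0)
# Without a release, the replacement must wait until the lease expires.
blocked = lease.try_acquire("new-pod", now=1.0, duration=180.0)
lease.release("old-pod")
# After a release, the replacement can lead immediately.
acquired = lease.try_acquire("new-pod", now=1.0, duration=180.0)
```

In this model, `blocked` is `False` and `acquired` is `True`, which corresponds to the difference between waiting out the lease and seeing "successfully acquired lease" shortly after the Pod starts.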

This is hack'n'hustle "work", not tied to any Epic. I'm using it just to get proper QE and to track which operators are being updated (see linked GitHub PRs).

Bug OCPBUGS-24537: pathological events test failed multiple times for ns/openshift-kube-scheduler

Description of problem:

    4.15 nightly payloads have been affected by this test multiple times:

: [sig-arch] events should not repeat pathologically for ns/openshift-kube-scheduler 0s{ 1 events happened too frequently

event happened 21 times, something is wrong: namespace/openshift-kube-scheduler node/ci-op-2gywzc86-aa265-5skmk-master-1 pod/openshift-kube-scheduler-guard-ci-op-2gywzc86-aa265-5skmk-master-1 hmsg/2652c73da5 - reason/ProbeError Readiness probe error: Get "https://10.0.0.7:10259/healthz": dial tcp 10.0.0.7:10259: connect: connection refused result=reject
body:
 From: 08:41:08Z To: 08:41:09Z}

In each set of 10 aggregated jobs, 2 to 3 jobs failed this test. Historically this test passed 100% of the time, but in the past two days of test data the pass rate has dropped to 97%, and the aggregator started allowing this in the latest payload: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-azure-ovn-upgrade-4.15-micro-release-openshift-release-analysis-aggregator/1732295947339173888

The first payload in which this appeared is https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.nightly/release/4.15.0-0.nightly-2023-12-05-071627.

All the events happened during cluster-operator/kube-scheduler progressing.

For comparison, here is a passed job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1731936539870498816

Here is a failed one: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1731936538192777216

Both have the same set of probe error events. In the passing job each event repeats fewer than 20 times, while in the failed job one of those events repeated more than 20 times, which causes the test failure.
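
The threshold behaviour can be sketched as a simple count over event keys. This is not the origin test implementation; the function name and the idea of keying events by a single string are assumptions, and the threshold of 20 is taken from the description above.

```python
from collections import Counter


def pathological_events(event_keys, threshold=20):
    """Return the events that repeated more than `threshold` times."""
    counts = Counter(event_keys)
    return {key: n for key, n in counts.items() if n > threshold}
```

An event seen 21 times, as in the failure above, is flagged, while one seen exactly 20 times is not, which is why jobs with nearly identical event sets can land on either side of the line.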

https://github.com/openshift/origin/pull/28448

Bug OCPBUGS-30299: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/2299

Bug OCPBUGS-26940: If OLMPlacement is set to management, disableAllDefaultSources doesn't get updated in the guest cluster after it is removed in the HostedCluster CR

Description of problem:

With OLMPlacement set to management and the cluster up with disableAllDefaultSources set to true, removing disableAllDefaultSources from the HostedCluster CR does not remove it in the guest cluster, where it remains set to true.

How reproducible:

    always

https://github.com/openshift/hypershift/pull/3454

Bug OCPBUGS-29389: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/2089

Bug OCPBUGS-29987: ART requests updates to 4.16 image csi-node-driver-registrar-container

Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/65

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-node-driver-registrar/pull/68

Bug OCPBUGS-25810: Update downstream OWNERS to include Surya

No QA required, updating approvers across releases

https://github.com/openshift/ovn-kubernetes/pull/2000

Bug OCPBUGS-24245: seLinuxMount is missed after changing to csi-operator

https://github.com/openshift/csi-operator/blob/master/assets/overlays/aws-ebs/base/csidriver.yaml

The file above is missing "seLinuxMount: true", which was merged in https://github.com/bertinatto/aws-ebs-csi-driver-operator-1/blob/0a9642cff6d2a7f9aea940ce89b65fc189cba6b6/assets/csidriver.yaml#L14

Bug OCPBUGS-25018: PipelineRun List page list PipelineRuns from all namespace

https://github.com/openshift/console/pull/13432

Bug OCPBUGS-27959: Panic: send on closed channel

In a CI run of etcd-operator-e2e I've found the following panic in the operator logs:

E0125 11:04:58.158222       1 health.go:135] health check for member (ip-10-0-85-12.us-west-2.compute.internal) failed: err(context deadline exceeded)
panic: send on closed channel

goroutine 15608 [running]:
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1()
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0xd2
created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2a5

Unfortunately, the log file is incomplete. The operator recovered by restarting, but we should fix the panic nonetheless.

Job run for reference:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1186/pull-ci-openshift-cluster-etcd-operator-master-e2e-operator/1750466468031500288

https://github.com/openshift/cluster-etcd-operator/pull/1190

Bug OCPBUGS-25079: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/2208

Bug OCPBUGS-28370: HCP deletion can get stuck if CPO is unable to delete the default worker security group

Description of problem:

If a ROSA HCP customer uses the default worker security group that the CPO creates for some other purpose (e.g. creates their own VPC Endpoint or EC2 instance using this security group) and then starts an uninstallation, the uninstallation hangs indefinitely because the CPO is unable to delete the security group.

https://github.com/openshift/hypershift/blob/9e6255e5e44c8464da0850f8c19dc085bdbaf8cb/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go#L317-L331

Version-Release number of selected component (if applicable):

4.14.8

How reproducible:

100%

Steps to Reproduce:

    1. Create a ROSA HCP cluster
    2. Attach the default worker security group to some other object unrelated to the cluster, like an EC2 instance or VPC Endpoint
    3. Uninstall the ROSA HCP cluster

Actual results:

The uninstall hangs without much feedback to the customer

Expected results:

Either that the uninstall gives up and moves on eventually, or that clear feedback is provided to the customer, so that they know that the uninstall is held up because of an inability to delete a specific security group id. If this feedback mechanism is already in place, but not wired through to OCM, this may not be an OCPBUGS and could just be an OCM bug instead!

Additional info:

Bug OCPBUGS-28724: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/2073

Bug OCPBUGS-29584: PowerVS: handle composite_instance

Description of problem:

When the IPI installer creates a service instance for the user, PowerVS now reports its type as composite_instance rather than service_instance. Fix up cluster deletion to account for this change.

Version-Release number of selected component (if applicable):

How reproducible:

Always

Steps to Reproduce:

    1. Create cluster
    2. Destroy cluster
    3.

Actual results:

The newly created service instance is not deleted.

Expected results:

Additional info:

https://github.com/openshift/installer/pull/8029

Bug OCPBUGS-24924: Update 4.16 ose-ibmcloud-cluster-api-controllers-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api-provider-ibmcloud/pull/70

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api-provider-ibmcloud/pull/70

Bug OCPBUGS-25561: Update 4.16 baremetal-machine-controller-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/208

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api-provider-baremetal/pull/208

Bug OCPBUGS-29568: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-machine-approver/pull/229

Bug OCPBUGS-25019: opm 4.15 does not work for RHEL8, but RHEL 9

Description of problem:

    Running opm on RHEL 8 fails with the following error:
./opm: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./opm)

Note: this happens with 4.15.0-ec.3.
The 4.14 build works.
An opm binary compiled from the latest code also works.

Version-Release number of selected component (if applicable):

    4.15.0-ec.3

How reproducible:

    always

Steps to Reproduce:

[root@preserve-olm-env2 slavecontainer]# curl -s -k -L https://mirror2.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/candidate/opm-linux-4.15.0-ec.3.tar.gz -o opm.tar.gz && tar -xzvf opm.tar.gz
opm
[root@preserve-olm-env2 slavecontainer]# ./opm version
./opm: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./opm)
[root@preserve-olm-env2 slavecontainer]# curl -s -l -L https://mirror2.openshift.com/pub/openshift-v4/x86_64/clients/ocp/latest-4.14/opm-linux-4.14.5.tar.gz -o opm.tar.gz && tar -xzvf opm.tar.gz
opm
[root@preserve-olm-env2 slavecontainer]# opm version
Version: version.Version{OpmVersion:"639fc1203", GitCommit:"639fc12035292dec74a16b306226946c8da404a2", BuildDate:"2023-11-21T08:03:15Z", GoOs:"linux", GoArch:"amd64"}
[root@preserve-olm-env2 kuiwang]# cd operator-framework-olm/
[root@preserve-olm-env2 operator-framework-olm]# git branch
  gs
* master
  release-4.10
  release-4.11
  release-4.12
  release-4.13
  release-4.8
  release-4.9
[root@preserve-olm-env2 operator-framework-olm]# git pull origin master
remote: Enumerating objects: 1650, done.
remote: Counting objects: 100% (1650/1650), done.
remote: Compressing objects: 100% (831/831), done.
remote: Total 1650 (delta 727), reused 1617 (delta 711), pack-reused 0
Receiving objects: 100% (1650/1650), 2.03 MiB | 12.81 MiB/s, done.
Resolving deltas: 100% (727/727), completed with 468 local objects.
From github.com:openshift/operator-framework-olm
 * branch master -> FETCH_HEAD
   639fc1203..85c579f9b master -> origin/master
Updating 639fc1203..85c579f9b
Fast-forward
 go.mod | 120 +-
 go.sum | 240 ++--
 manifests/0000_50_olm_00-pprof-secret.yaml
...
 create mode 100644 vendor/google.golang.org/protobuf/types/dynamicpb/types.go
[root@preserve-olm-env2 operator-framework-olm]# rm -fr bin/opm
[root@preserve-olm-env2 operator-framework-olm]# make build/opm
make bin/opm
make[1]: Entering directory '/data/kuiwang/operator-framework-olm'
go build -ldflags "-X 'github.com/operator-framework/operator-registry/cmd/opm/version.gitCommit=85c579f9be61aaea11e90b6c870452c72107300a' -X 'github.com/operator-framework/operator-registry/cmd/opm/version.opmVersion=85c579f9b' -X 'github.com/operator-framework/operator-registry/cmd/opm/version.buildDate=2023-12-11T06:12:50Z'" -mod=vendor -tags "json1" -o bin/opm github.com/operator-framework/operator-registry/cmd/opm
make[1]: Leaving directory '/data/kuiwang/operator-framework-olm'
[root@preserve-olm-env2 operator-framework-olm]# which opm
/data/kuiwang/operator-framework-olm/bin/opm
[root@preserve-olm-env2 operator-framework-olm]# opm version
Version: version.Version{OpmVersion:"85c579f9b", GitCommit:"85c579f9be61aaea11e90b6c870452c72107300a", BuildDate:"2023-12-11T06:12:50Z", GoOs:"linux", GoArch:"amd64"}

Actual results:

Expected results:

Additional info:

https://github.com/openshift/operator-framework-olm/pull/704

Spike HOSTEDCP-1309: Spike on Priority/Fairness configuration for k8s API server

In the "request-serving" deployment model for HCPs, the request-serving nodes are being memory starved by the k8s API server. This has an observed impact on limiting the number of nodes a guest HCP cluster can provision, especially during upgrade events.

This is a spike card to investigate setting the API Priority and Fairness [1] configuration, and exactly what configuration would be necessary.

[1] https://kubernetes.io/docs/concepts/cluster-administration/flow-control/

https://github.com/openshift/hypershift/pull/3384

Bug OCPBUGS-23852: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-storage-operator/pull/431

Bug OCPBUGS-27161: Nodepool has message NotFound when replica is set to 0

Description of problem:

When the replica for a nodepool is set to 0, the message for the nodepool is "NotFound". This message should not be displayed if the desired replica is 0.

Version-Release number of selected component (if applicable):

How reproducible:

    Create a nodepool and set the replica to 0

Steps to Reproduce:

    1. Create a hosted cluster
    2. Set the replica for the nodepool to 0

Actual results:

NodePool message is "NotFound"

Expected results:

NodePool message to be empty

Additional info:

https://github.com/openshift/hypershift/pull/3472

Bug OCPBUGS-24995: [azure] bootstrap failed to be provisioned when vm type is set to Standard_NP10s

Description of problem:

In install-config, configure the VM type as Standard_NP10s, which supports only Hyper-V Generation V1.
--------------
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    azure:
      type: Standard_NP10s
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      type: Standard_NP10s
  replicas: 3

Continue the installation; the installer fails when provisioning the bootstrap node.
--------------
ERROR                                              
ERROR Error: creating Linux Virtual Machine: (Name "jima1211test-rqfhm-bootstrap" / Resource Group "jima1211test-rqfhm-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="BadRequest" Message="The selected VM size 'Standard_NP10s' cannot boot Hypervisor Generation '2'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '2' VM Size. For more information, see https://aka.ms/azuregen2vm" 
ERROR                                              
ERROR   with azurerm_linux_virtual_machine.bootstrap, 
ERROR   on main.tf line 193, in resource "azurerm_linux_virtual_machine" "bootstrap": 
ERROR  193: resource "azurerm_linux_virtual_machine" "bootstrap" { 
ERROR                                              
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failure applying terraform for "bootstrap" stage: error applying Terraform configs: failed to apply Terraform: exit status 1 
ERROR                                              
ERROR Error: creating Linux Virtual Machine: (Name "jima1211test-rqfhm-bootstrap" / Resource Group "jima1211test-rqfhm-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="BadRequest" Message="The selected VM size 'Standard_NP10s' cannot boot Hypervisor Generation '2'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '2' VM Size. For more information, see https://aka.ms/azuregen2vm" 
ERROR                                              
ERROR   with azurerm_linux_virtual_machine.bootstrap, 
ERROR   on main.tf line 193, in resource "azurerm_linux_virtual_machine" "bootstrap": 
ERROR  193: resource "azurerm_linux_virtual_machine" "bootstrap" { 
ERROR                                              
ERROR                                              

The issue appears to have been introduced by https://github.com/openshift/installer/pull/7642/
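
The mismatch in the error above can be sketched as a generation-selection step. This is only an illustration, not the installer's code: the function name `pickGeneration`, the comma-separated input format, and the preference for V2 when both generations are available are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// pickGeneration is a hypothetical helper. It takes the Hyper-V generations a
// VM size supports as a comma-separated string (assumed format, e.g. "V1,V2")
// and returns the generation the boot image should use.
func pickGeneration(supported string) (string, error) {
	gens := map[string]bool{}
	for _, g := range strings.Split(supported, ",") {
		if g = strings.TrimSpace(g); g != "" {
			gens[g] = true
		}
	}
	switch {
	case gens["V2"]: // assumption: prefer V2 when the size allows it
		return "V2", nil
	case gens["V1"]: // V1-only sizes, such as Standard_NP10s per the report
		return "V1", nil
	}
	return "", fmt.Errorf("no known Hyper-V generation in %q", supported)
}

func main() {
	for _, supported := range []string{"V1", "V1,V2"} {
		gen, err := pickGeneration(supported)
		fmt.Println(supported, "->", gen, err)
	}
}
```

Selecting the image generation from what the VM size supports, rather than assuming Generation 2, avoids the "cannot boot Hypervisor Generation '2'" rejection for V1-only sizes.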

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-09-012410

How reproducible:

Always

Steps to Reproduce:

    1. configure vm type to Standard_NP10s on control-plane in install-config.yaml
    2. install cluster

Actual results:

    installer failed when provisioning bootstrap node

Expected results:

    The installation succeeds

Additional info:

https://github.com/openshift/installer/pull/7822

Bug OCPBUGS-25627: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-vsphere/pull/31

Bug OCPBUGS-19635: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/2070

Bug OCPBUGS-27796: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-23518: y-stream upgrade fails because CVO has no Upgradeable condition

When upgrading an HC from 4.13 to 4.14, after admin-acking the API deprecation check, the upgrade is still blocked because the ClusterVersionUpgradeable condition on the HC is Unknown. This is because the CVO in the guest cluster no longer has an Upgradeable condition.
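
A minimal sketch of how an absent condition can surface as Unknown and block an upgrade. The `condition` type and the `statusOf` and `blocksUpgrade` helpers are hypothetical and do not reproduce the HyperShift or CVO code; the strict gate is an assumption made for illustration.

```go
package main

import "fmt"

// condition is a hypothetical, reduced form of a status condition.
type condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// statusOf returns the status of the named condition, and "Unknown" when the
// condition is absent from the list.
func statusOf(conds []condition, condType string) string {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status
		}
	}
	return "Unknown"
}

// blocksUpgrade is an assumed strict gate: anything but "True" blocks.
func blocksUpgrade(status string) bool {
	return status != "True"
}

func main() {
	// A guest cluster CVO that no longer reports Upgradeable at all.
	guest := []condition{{Type: "Available", Status: "True"}}

	status := statusOf(guest, "Upgradeable")
	fmt.Println(status, blocksUpgrade(status))
}
```

With a gate like this, a condition that is simply missing is indistinguishable from one that is genuinely Unknown, which matches the blocked upgrade described above.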

https://github.com/openshift/hypershift/pull/3239

Bug OCPBUGS-25788: Console Node view is missing information when using metal3-plugin

Description of problem:

When an OpenShift Container Platform cluster is installed on Bare Metal with RHACM, the "metal3-plugin" for the OpenShift Console is installed automatically.

The "Nodes" view (`<console>/k8s/cluster/core~v1~Node`) uses the `BareMetalNodesTable`, which has very limited columns. In the meantime, OCP improved its Nodes table and added more features (like metrics), but we haven't done any corresponding work in metal3. Customers are missing information like metrics or Pods, which is present in the standard Node view.

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.12.33

How reproducible:

Always

Steps to Reproduce:

    1. Install a cluster using RHACM on Bare Metal
    2. Ensure the "metal3-plugin" is enabled
    3. Navigate to the "Nodes" view in the OpenShift Container Platform Console (`<console>/k8s/cluster/core~v1~Node`)

Actual results:

Only a limited set of columns (Name, Status, Role, Machine, Management Address) is visible. Columns like Memory, CPU, Pods, Filesystem, and Instance Type are missing.

Expected results:

All the columns from the standard view are visible, plus the "Management Address" column

Additional info:

* Issue was discussed here: https://redhat-internal.slack.com/archives/C027TN14SGJ/p1702552957981989
* Screenshot of non-metal3 cluster: https://redhat-internal.slack.com/archives/C027TN14SGJ/p1702552980878029?thread_ts=1702552957.981989&cid=C027TN14SGJ
* Screenshot of metal3 cluster: https://redhat-internal.slack.com/archives/C027TN14SGJ/p1702552995363389?thread_ts=1702552957.981989&cid=C027TN14SGJ

https://github.com/openshift/console/pull/13506

Bug OCPBUGS-19830: conformance tests failing due to openshift-multus config

Description of problem:

Several test cases in the conformance test suite are failing due to the openshift-multus configuration.

We run the conformance test suite as part of our OpenShift on OpenStack CI, just to confirm correct functionality of the cluster. The command we use to run the test suite is:

openshift-tests run  --provider '{\"type\":\"openstack\"}' openshift/conformance/parallel

The names of the tests that failed are:
1. [sig-arch] Managed cluster should ensure platform components have system-* priority class associated [Suite:openshift/conformance/parallel]

Reason is:

6 pods found with invalid priority class (should be openshift-user-critical or begin with system-):
openshift-multus/whereabouts-reconciler-6q6h7 (currently "")
openshift-multus/whereabouts-reconciler-87dwn (currently "")
openshift-multus/whereabouts-reconciler-fvhwv (currently "")
openshift-multus/whereabouts-reconciler-h68h5 (currently "")
openshift-multus/whereabouts-reconciler-nlz59 (currently "")
openshift-multus/whereabouts-reconciler-xsch6 (currently "")

2. [sig-arch] Managed cluster should only include cluster daemonsets that have maxUnavailable or maxSurge update of 10 percent or maxUnavailable of 33 percent [Suite:openshift/conformance/parallel]
Reason is:

fail [github.com/openshift/origin/test/extended/operators/daemon_set.go:105]: Sep 23 16:12:15.283: Daemonsets found that do not meet platform requirements for update strategy:
  expected daemonset openshift-multus/whereabouts-reconciler to have maxUnavailable 10% or 33% (see comment) instead of 1, or maxSurge 10% instead of 0
Ginkgo exit error 1: exit with code 1

3. [sig-arch] Managed cluster should set requests but not limits [Suite:openshift/conformance/parallel]

Reason is:

fail [github.com/openshift/origin/test/extended/operators/resources.go:196]: Sep 23 16:12:17.489: Pods in platform namespaces are not following resource request/limit rules or do not have an exception granted:
  apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts defines a limit on cpu of 50m which is not allowed (rule: "apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts/limit[cpu]")
  apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts defines a limit on memory of 100Mi which is not allowed (rule: "apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts/limit[memory]")
Ginkgo exit error 1: exit with code 1

4. [sig-node][apigroup:config.openshift.io] CPU Partitioning cluster platform workloads should be annotated correctly for DaemonSets [Suite:openshift/conformance/parallel]

Reason is:

fail [github.com/openshift/origin/test/extended/cpu_partitioning/pods.go:159]: Expected
    <[]error | len:1, cap:1>: [
        <*errors.errorString | 0xc0010fa380>{
            s: "daemonset (whereabouts-reconciler) in openshift namespace (openshift-multus) must have pod templates annotated with map[target.workload.openshift.io/management:{\"effect\": \"PreferredDuringScheduling\"}]",
        },
    ]
to be empty
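
The first two rules quoted in the failure messages can be sketched as simple checks. This is an illustration built only from the wording of those messages, not the origin test code; the helper names are hypothetical and the real tests have additional exceptions (the message itself says "see comment").

```go
package main

import (
	"fmt"
	"strings"
)

// validPriorityClass follows the first failure message: the class must be
// openshift-user-critical or begin with "system-".
func validPriorityClass(name string) bool {
	return name == "openshift-user-critical" || strings.HasPrefix(name, "system-")
}

// validUpdateStrategy follows the second failure message: maxUnavailable of
// 10% or 33%, or maxSurge of 10%. Values are compared as plain strings.
func validUpdateStrategy(maxUnavailable, maxSurge string) bool {
	return maxUnavailable == "10%" || maxUnavailable == "33%" || maxSurge == "10%"
}

func main() {
	// Values reported for the whereabouts-reconciler daemonset.
	fmt.Println(validPriorityClass(""))        // empty priority class
	fmt.Println(validUpdateStrategy("1", "0")) // maxUnavailable 1, maxSurge 0
}
```

Both checks return false for the reported values, which is consistent with the failures listed above.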

How reproducible: Always
Steps to Reproduce: Run conformance testsuite:
https://github.com/openshift/origin/blob/master/test/extended/README.md

Actual results: Testcases failing
Expected results: Testcases passing

https://github.com/openshift/cluster-network-operator/pull/2103

Bug OCPBUGS-30531: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-powervs/pull/64

Bug OCPBUGS-25191: [azure] using marketplace image fails while retrieving the image

Description of problem:

    https://github.com/openshift/installer/pull/7778 introduced a bug where an error is always returned while retrieving a marketplace image.

Version-Release number of selected component (if applicable):

How reproducible:

    always

Steps to Reproduce:

    1. Configure marketplace image in the install-config
    2. openshift-install create manifests

Actual results:

    $ ./openshift-install create manifests --dir ipi1 --log-level debug
DEBUG OpenShift Installer 4.16.0-0.test-2023-12-12-020559-ci-ln-xkqmlqk-latest 
DEBUG Built from commit 456ae720a83e39dffd9918c5a71388ad873b6a38 
DEBUG Fetching Master Machines...                  
DEBUG Loading Master Machines...                   
DEBUG   Loading Cluster ID...                      
DEBUG     Loading Install Config...                
DEBUG       Loading SSH Key...                     
DEBUG       Loading Base Domain...                 
DEBUG         Loading Platform...                  
DEBUG       Loading Cluster Name...                
DEBUG         Loading Base Domain...               
DEBUG         Loading Platform...                  
DEBUG       Loading Pull Secret...                 
DEBUG       Loading Platform...                    
INFO Credentials loaded from file "/home/fedora/.azure/osServicePrincipal.json" 
ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [controlPlane.platform.azure.osImage: Invalid value: azure.OSImage{Plan:"", Publisher:"redhat", Offer:"rh-ocp-worker", SKU:"rh-ocp-worker", Version:"413.92.2023101700"}: could not get marketplace image: %!w(<nil>), compute[0].platform.azure.osImage: Invalid value: azure.OSImage{Plan:"", Publisher:"redhat", Offer:"rh-ocp-worker", SKU:"rh-ocp-worker", Version:"413.92.2023101700"}: could not get marketplace image: %!w(<nil>)]

Expected results:

    Success

Additional info:

    When `errors.Wrap(err, ...)` was replaced by `fmt.Errorf(...)`, a slight difference in behavior was introduced: `errors.Wrap` returns `nil` if `err` is `nil`, but `fmt.Errorf` always returns an error.
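
The difference can be reproduced with the standard library alone. In this sketch `wrapNilSafe` is a hypothetical stand-in that mimics the nil-returning behavior of `errors.Wrap` from github.com/pkg/errors, and `wrapAlways` mimics the replacement; neither is the installer's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// wrapNilSafe is a hypothetical stand-in for errors.Wrap from
// github.com/pkg/errors: it returns nil when err is nil.
func wrapNilSafe(err error, msg string) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("%s: %w", msg, err)
}

// wrapAlways mirrors the problematic replacement: fmt.Errorf builds a new,
// non-nil error value even when the wrapped err is nil.
func wrapAlways(err error, msg string) error {
	return fmt.Errorf("%s: %w", msg, err)
}

func main() {
	var err error // nil: the image lookup succeeded

	fmt.Println(wrapNilSafe(err, "could not get marketplace image") == nil)
	fmt.Println(wrapAlways(err, "could not get marketplace image") == nil)
	fmt.Println(wrapAlways(err, "could not get marketplace image"))

	// With a real failure, both helpers keep the cause reachable.
	cause := errors.New("image not found")
	fmt.Println(errors.Is(wrapAlways(cause, "could not get marketplace image"), cause))
}
```

The third line printed carries the same `%!w(<nil>)` marker that appears in the installer error above. Guarding the wrap with an explicit `if err != nil` check is one way to restore the earlier behavior.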

https://github.com/openshift/installer/pull/7826

Bug OCPBUGS-24606: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/kubernetes/pull/1821

Bug OCPBUGS-26049: Tab "VolumeSnapshots" crashed on PVC page.

Description of problem:

Go to the "VolumeSnapshots" tab of a PVC; it shows the error "Oh no! Something went wrong."

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2024-01-03-140457

How reproducible:

Always

Steps to Reproduce:

    1. Create a PVC in a project. Go to the PVC's "VolumeSnapshots" tab.

Actual results:

1. The error "Oh no! Something went wrong." shows up on the page.

Expected results:

1. The VolumeSnapshots related to the PVC are shown without error.

Additional info:

screenshot: https://drive.google.com/file/d/1l0i0DCFh_q9mvFHxnftVJL0AM1LaKFOO/view?usp=sharing

https://github.com/openshift/console/pull/13481

Bug OCPBUGS-27192: Remove NCv2 series from azure doc tested_instance_types_x86_64

Description of problem:

Based on the Azure doc [1], NCv2 series Azure virtual machines (VMs) were retired on September 6, 2023. VMs can no longer be provisioned on those instance types.

So remove standardNCSv2Family from azure doc tested_instance_types_x86_64 on 4.13+.

[1] https://learn.microsoft.com/en-us/azure/virtual-machines/ncv2-series

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1. Cluster installation fails on an NCv2 series instance type

Actual results:

Expected results:

Additional info:

https://github.com/openshift/installer/pull/7911

Bug OCPBUGS-24854: Update 4.16 ose-azure-workload-identity-webhook-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/8

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/azure-workload-identity/pull/8

Bug OCPBUGS-28216: Warning icon in Cluster > Overview > Activity card is clipped

See https://drive.google.com/file/d/1lsUgaaLf2YKUOEum6pGesl-ZMn7J3O08/view?usp=sharing

https://github.com/openshift/console/pull/13546

Bug TRT-1512: Aggregator not seeing test results?

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-azure-sdn-upgrade-4.15-minor-release-openshift-release-analysis-aggregator/1757905312053989376

Aggregator claims these tests only ran 4 times out of what looks like 10 jobs that ran to normal completion:

[sig-network-edge] Application behind service load balancer with PDB remains available using new connections
[sig-network-edge] Application behind service load balancer with PDB remains available using reused connections

However looking at one of the jobs not in the list of passes, we can see these tests ran:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade/1757905303602466816

Why is the aggregator missing this result somehow?

https://github.com/openshift/origin/pull/28604

Bug OCPBUGS-24845: Update 4.16 ose-ibm-vpc-block-csi-driver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/96

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/96

Bug OCPBUGS-25337: [OVN][IPSEC] ovn-ipsec-host pods got deleted when there is a NotReady node

https://github.com/openshift/cluster-network-operator/pull/2161

Bug OCPBUGS-25764: OVN remove explicit memory-trim-on-compaction enable

Description of problem:

In OCPBUGS-237 we discussed dropping the explicit memory-trim-on-compaction enable once it became enabled by default.

On OCP with 4.15.0-0.nightly-2023-12-19-033450 we are at

Red Hat Enterprise Linux CoreOS 415.92.202312132107-0 (Plow)
openvswitch3.1-3.1.0-59.el9fdp.x86_64

This should have memory-trim-on-compaction enabled by default

v3.0.0 - 15 Aug 2022
--------------------
     * Returning unused memory to the OS after the database compaction is now
       enabled by default.  Use 'ovsdb-server/memory-trim-on-compaction off'
       unixctl command to disable.

https://github.com/openvswitch/ovs/commit/e773140ec3f6d296e4a3877d709fb26fb51bc6ee

If it is enabled by default, we should remove the enable loop.

      # set trim-on-compaction
      if ! retry 60 "trim-on-compaction" "ovn-appctl -t ${nbdb_ctl} --timeout=5 ovsdb-server/memory-trim-on-compaction on"; then
        exit 1
      fi

https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/ovn-kubernetes/common/008-script-lib.yaml#L314

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-19-033450

How reproducible:

Always

Steps to Reproduce:

1. check if memory-trim-on-compaction is enabled by default in OVS

2. check the nbdb log files for the "memory trimming after compaction enabled" message

Actual results:

2023-12-20T18:12:47.053444489Z 2023-12-20T18:12:47.053Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 3.1.2
2023-12-20T18:12:49.001580092Z 2023-12-20T18:12:49.001Z|00003|ovsdb_server|INFO|memory trimming after compaction enabled.

Expected results:

memory-trim-on-compaction should be enabled by default, we don't need to re-enable it.

Affected Platforms:

All

https://github.com/openshift/cluster-network-operator/pull/2260

Bug OCPBUGS-7714: Holding lock during slow payload verification/download interferes with reconciling manifests

Description of problem:

After the fix for OCPBUGSM-44759, we put timeouts on payload retrieval operations (verification and download); previously they were uncapped and under certain network circumstances could take hours to terminate. Testing the fix uncovered a problem where, after CVO passes the path with the timeouts, CVO starts logging errors for the core manifest reconciliation loop:

I0208 11:22:57.107819       1 sync_worker.go:993] Running sync for role "openshift-marketplace/marketplace-operator" (648 of 834)
I0208 11:22:57.107887       1 task_graph.go:474] Canceled worker 1 while waiting for work
I0208 11:22:57.107900       1 sync_worker.go:1013] Done syncing for configmap "openshift-apiserver-operator/trusted-ca-bundle" (444 of 834)
I0208 11:22:57.107911       1 task_graph.go:474] Canceled worker 0 while waiting for work
I0208 11:22:57.107918       1 task_graph.go:523] Workers finished
I0208 11:22:57.107925       1 task_graph.go:546] Result of work: [update context deadline exceeded at 8 of 834 Could not update role "openshift-marketplace/marketplace-operator" (648 of 834)]
I0208 11:22:57.107938       1 sync_worker.go:1169] Summarizing 1 errors
I0208 11:22:57.107947       1 sync_worker.go:1173] Update error 648 of 834: UpdatePayloadFailed Could not update role "openshift-marketplace/marketplace-operator" (648 of 834) (context.deadlineExceededError: context deadline exceeded)
E0208 11:22:57.107966       1 sync_worker.go:654] unable to synchronize image (waiting 3m39.457405047s): Could not update role "openshift-marketplace/marketplace-operator" (648 of 834)

This is caused by locks. The SyncWorker.Update method acquires its lock for its whole duration. The payloadRetriever.RetrievePayload method is called inside SyncWorker.Update, on the following call chain:

SyncWorker.Update ->
  SyncWorker.loadUpdatedPayload ->
    SyncWorker.syncPayload ->
      payloadRetriever.RetrievePayload

RetrievePayload can take 2 or 4 minutes before it times out, so CVO holds the lock for this whole wait.

The manifest reconciliation loop is implemented in the apply method. The whole apply method is bounded by a timeout context set to 2*minimum reconcile interval so it will be set to a value between 4 and 8 minutes. While in the reconciling mode, the manifest graph is split into multiple "tasks" where smaller sequences of these tasks are applied in parallel. Individual tasks in these series are iterated over and each iteration uses a consistentReporter to report status via its Update method, which also acquires the lock on the following call sequence:

SyncWorker.apply ->
  { for _, task := range tasks ... ->
    consistentReporter.Update ->
      statusWrapper.Report ->
        SyncWorker.updateApplyStatus ->

This leads to the following sequence:

1. apply is called with a timeout between 4 and 8 minutes
2. in parallel, SyncWorker.Update starts and acquires the lock
3. tasks under apply wait on the reporter to acquire lock
4. after 2 or 4 minutes RetrievePayload under SyncWorker.Update times out and terminates, SyncWorker.Update terminates and releases the lock
5. tasks under apply report results after briefly acquiring the lock, start to do their thing
6. in parallel, SyncWorker.Update starts again and acquires the lock
7. further iterations over tasks under apply wait on the reporter to acquire lock
8. context passed to apply times out
9. Canceled worker 0 while waiting for work... errors

Version-Release number of selected component (if applicable):

4.13.0-0.ci.test-2023-02-06-062603 with https://github.com/openshift/cluster-version-operator/pull/896

How reproducible:

always in certain cluster configuration

Steps to Reproduce:

1. in a disconnected cluster, upgrade to an unreachable payload image with --force
2. observe the CVO log

Actual results:

CVO starts to fail reconciling manifests

Expected results:

no failures, cluster continues to try retrieving the image but no interference with manifest reconciliation

Additional info:

This problem was discovered by Evgeni Vakhonin while testing the fix for OCPBUGSM-44759: https://bugzilla.redhat.com/show_bug.cgi?id=2090680#c22

https://github.com/openshift/cluster-version-operator/pull/896 uncovers this issue but still gets CVO into a better shape - previously the RetrievePayload could be running for a much longer time (hours), preventing the CVO from working at all.

When the cluster gets into this buggy state, the solution is to abort the upgrade that fails to verify or download.

Bug OCPBUGS-9714: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/13411

Bug OCPBUGS-25771: Long catalog source display name will flow out of tile on operatorhub page

Description of problem:

On the OperatorHub page, a long catalogsource display name overflows the operator item tile.

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-19-033450

How reproducible:

Always

Steps to Reproduce:

    1. Create a catalogsource with a long display name.
    2. Check operator items supplied by the created catalogsource on OperatorHub page

Actual results:

2. The catalogsource display name overflows from the item tile

Expected results:

2. Show the catalogsource display name in the item tile dynamically without overflow.

Additional info:

screenshot: https://drive.google.com/file/d/1GOHJOxoBmtZX3QWDsIvc2RT5a2inkpzM/view?usp=sharing

https://github.com/openshift/console/pull/13474

Bug OCPBUGS-26977: Required RBAC for network-node-identity is not created when hosted cluster networkType is set to Other.

Description of problem:

When using a custom CNI plugin in a hostedcluster, multus requires some CSRs to be approved. The component approving these CSRs is the network-node-identity. This component only gets the proper RBAC rules configured when networkType is set to Calico.

In the current implementation, there is a condition that applies the required RBAC only if the networkType is set to Calico[1].

When using other CNI plugins, like Cilium, you're supposed to set networkType to Other. With the current implementation, you won't get the required RBAC in place and, as a result, the required CSRs won't be approved automatically.


[1] https://github.com/openshift/hypershift/blob/release-4.14/control-plane-operator/controllers/hostedcontrolplane/cno/clusternetworkoperator.go#L139
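
The condition described above can be sketched as follows. This is a simplified illustration, not the HyperShift code: the function names are hypothetical, and treating "Other" the same as "Calico" is the behavior this report asks for rather than a confirmed description of the fix.

```python
def needs_node_identity_rbac_current(network_type: str) -> bool:
    # Behavior described in the report: only Calico gets the RBAC rules.
    return network_type == "Calico"


# Assumption for this sketch: every networkType that relies on a
# third-party CNI should get the network-node-identity RBAC.
THIRD_PARTY_NETWORK_TYPES = {"Calico", "Other"}


def needs_node_identity_rbac_expected(network_type: str) -> bool:
    # Expected behavior: "Other" (for example Cilium) is covered as well.
    return network_type in THIRD_PARTY_NETWORK_TYPES


for network_type in ("Calico", "Other"):
    print(
        network_type,
        needs_node_identity_rbac_current(network_type),
        needs_node_identity_rbac_expected(network_type),
    )
```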

Version-Release number of selected component (if applicable):

Latest

How reproducible:

Always

Steps to Reproduce:

    1. Set hostedcluster.spec.networking.networkType to Other
    2. Wait for the HC to start deploying and for the Nodes to join the cluster
    3. The nodes will remain in NotReady. Multus pods will complain about certificates not being ready.
    4. If you list CSRs you will find pending CSRs.

Actual results:

RBAC not properly configured when networkType set to Other

Expected results:

RBAC properly configured when networkType set to Other

Additional info:

Slack discussion:

https://redhat-internal.slack.com/archives/C01C8502FMM/p1704824277049609

https://github.com/openshift/hypershift/pull/3403

Bug OCPBUGS-29341: dhcp-daemon pods have priority: 0 and no priority class

Description of problem:

If these pods are evicted, they lose all knowledge of existing DHCP leases, and any pods using DHCP IPAM will fail to renew their DHCP leases, even after the pod is re-created.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1. Use a NAD with ipam: dhcp.
    2. Delete the dhcp-daemon pod on the same node as your workload.
    3. Observe the lease expire on the DHCP server or get reissued to a different pod, causing a network outage from duplicate addresses.

Actual results:

dhcp-daemon

Expected results:

The dhcp-daemon pod does not get evicted before workloads, because it uses system-node-critical.

Additional info:

All other multus components use system-node-critical:
  priority: 2000001000
  priorityClassName: system-node-critical
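
A quick way to spot pods that lack the priority class is to scan the pod list. The sketch below assumes dictionaries shaped like the items from `oc get pods -o json`; the helper name and the sample pod names are hypothetical.

```python
from typing import Iterable, List, Mapping


def pods_missing_priority_class(
    pods: Iterable[Mapping],
    expected: str = "system-node-critical",
) -> List[str]:
    """Return the names of pods whose spec does not use `expected`."""
    missing = []
    for pod in pods:
        spec = pod.get("spec", {})
        if spec.get("priorityClassName") != expected:
            missing.append(pod["metadata"]["name"])
    return missing


# Hypothetical sample data mirroring the situation in this report.
sample_pods = [
    {
        "metadata": {"name": "multus-example"},
        "spec": {"priority": 2000001000, "priorityClassName": "system-node-critical"},
    },
    {
        "metadata": {"name": "dhcp-daemon-example"},
        "spec": {"priority": 0},
    },
]

print(pods_missing_priority_class(sample_pods))
```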

https://github.com/openshift/cluster-network-operator/pull/2280

Bug OCPBUGS-24716: 4.16 [vSphere CSI Driver] [zonal] Volume provisioning failed with: No compatible datastores found for accessibility requirements

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

Bug OCPBUGS-25626: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-azure/pull/295

Bug OCPBUGS-27498: MachineAutoscaler with a minimum value of zero display a hyphen instead

Description of problem:

MachineAutoscaler resources with a minimum replica size of zero will display a hyphen ("-") instead of zero on the list and detail pages.
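
One common cause of this kind of display problem is a truthiness check that treats 0 like a missing value. Whether that is the actual root cause in the console is an assumption; the sketch below (with hypothetical function names) only illustrates the pattern and the usual fix.

```python
from typing import Optional


def display_min_buggy(min_replicas: Optional[int]) -> str:
    # Truthiness check: 0 is falsy, so it falls through to the placeholder.
    return str(min_replicas) if min_replicas else "-"


def display_min_fixed(min_replicas: Optional[int]) -> str:
    # Only a truly missing value gets the placeholder.
    return "-" if min_replicas is None else str(min_replicas)


for value in (None, 0, 3):
    print(repr(value), display_min_buggy(value), display_min_fixed(value))
```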

Version-Release number of selected component (if applicable):

    4.14.7

How reproducible:

    always

Steps to Reproduce:

    1. Create a MachineAutoscaler with a "min: 0" field
    2. Save the record
    3. Navigate to the MachineAutoscalers page under the Compute tab

Actual results:

    the min replicas indicates "-"

Expected results:

    min replicas indicates "0"

Additional info:

    attaching a screenshot

https://github.com/openshift/console/pull/13533

Bug OCPBUGS-23430: EgressIP cannot be applied to egress node(rhcos) on clusters with Windows nodes existing

Description of problem:

On a hybrid cluster with Windows nodes and coreOS nodes mixed, egressIP cannot be applied to coreOS anymore. 
QE testing profile: 53_IPI on AWS & OVN & WindowsContainer

Version-Release number of selected component (if applicable):

4.14.3

How reproducible:

Always

Steps to Reproduce:

1.  Setup cluster with template aos-4_14/ipi-on-aws/versioned-installer-ovn-winc-ci
2.  Label a coreOS node as an egress node
% oc describe node ip-10-0-59-132.us-east-2.compute.internal
Name:               ip-10-0-59-132.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m6i.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2b
                    k8s.ovn.org/egress-assignable=
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-59-132.us-east-2.compute.internal
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m6i.xlarge
                    node.openshift.io/os_id=rhcos
                    topology.ebs.csi.aws.com/zone=us-east-2b
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2b
Annotations:        cloud.network.openshift.io/egress-ipconfig:
                      [{"interface":"eni-0c661bbdbb0dde54a","ifaddr":{"ipv4":"10.0.32.0/19"},"capacity":{"ipv4":14,"ipv6":15}}]
                    csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0629862832fff4ae3"}
                    k8s.ovn.org/host-cidrs: ["10.0.59.132/19"]
                    k8s.ovn.org/hybrid-overlay-distributed-router-gateway-ip: 10.129.2.13
                    k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac: 0a:58:0a:81:02:0d
                    k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-59-132.us-east-2.compute.internal","mac-address":"06:06:e2:7b:9c:45","ip-address...
                    k8s.ovn.org/network-ids: {"default":"0"}
                    k8s.ovn.org/node-chassis-id: fa1ac464-5744-40e9-96ca-6cdc74ffa9be
                    k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.7/16"}
                    k8s.ovn.org/node-id: 7
                    k8s.ovn.org/node-mgmt-port-mac-address: a6:25:4e:55:55:36
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.59.132/19"}
                    k8s.ovn.org/node-subnets: {"default":["10.129.2.0/23"]}
                    k8s.ovn.org/node-transit-switch-port-ifaddr: {"ipv4":"100.88.0.7/16"}
                    k8s.ovn.org/remote-zone-migrated: ip-10-0-59-132.us-east-2.compute.internal
                    k8s.ovn.org/zone-name: ip-10-0-59-132.us-east-2.compute.internal
                    machine.openshift.io/machine: openshift-machine-api/wduan-debug-1120-vtxkp-worker-us-east-2b-z6wlc
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-5a29871efb344f7e3a3dc51c42c21113
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-5a29871efb344f7e3a3dc51c42c21113
                    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-5a29871efb344f7e3a3dc51c42c21113
                    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-5a29871efb344f7e3a3dc51c42c21113
                    machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 22806
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 20 Nov 2023 09:46:53 +0800
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-59-132.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Mon, 20 Nov 2023 14:01:05 +0800
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 20 Nov 2023 13:57:33 +0800   Mon, 20 Nov 2023 09:46:53 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 20 Nov 2023 13:57:33 +0800   Mon, 20 Nov 2023 09:46:53 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 20 Nov 2023 13:57:33 +0800   Mon, 20 Nov 2023 09:46:53 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Mon, 20 Nov 2023 13:57:33 +0800   Mon, 20 Nov 2023 09:47:34 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.0.59.132
  InternalDNS:  ip-10-0-59-132.us-east-2.compute.internal
  Hostname:     ip-10-0-59-132.us-east-2.compute.internal
Capacity:
  cpu:                4
  ephemeral-storage:  125238252Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16092956Ki
  pods:               250
Allocatable:
  cpu:                3500m
  ephemeral-storage:  114345831029
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             14941980Ki
  pods:               250
System Info:
  Machine ID:                             ec21151a2a80230ce1e1926b4f8a902c
  System UUID:                            ec21151a-2a80-230c-e1e1-926b4f8a902c
  Boot ID:                                cf4b2e39-05ad-4aea-8e53-be669b212c4f
  Kernel Version:                         5.14.0-284.41.1.el9_2.x86_64
  OS Image:                               Red Hat Enterprise Linux CoreOS 414.92.202311150705-0 (Plow)
  Operating System:                       linux
  Architecture:                           amd64
  Container Runtime Version:              cri-o://1.27.1-13.1.rhaos4.14.git956c5f7.el9
  Kubelet Version:                        v1.27.6+b49f9d1
  Kube-Proxy Version:                     v1.27.6+b49f9d1
ProviderID:                               aws:///us-east-2b/i-0629862832fff4ae3
Non-terminated Pods:                      (21 in total)
  Namespace                               Name                                                      CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                               ----                                                      ------------  ----------  ---------------  -------------  ---
  openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-tlw5h                             30m (0%)      0 (0%)      150Mi (1%)       0 (0%)         4h14m
  openshift-cluster-node-tuning-operator  tuned-4fvgv                                               10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         4h14m
  openshift-dns                           dns-default-z89zl                                         60m (1%)      0 (0%)      110Mi (0%)       0 (0%)         11m
  openshift-dns                           node-resolver-v9stn                                       5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         4h14m
  openshift-image-registry                image-registry-67b88dc677-76hfn                           100m (2%)     0 (0%)      256Mi (1%)       0 (0%)         4h14m
  openshift-image-registry                node-ca-hw62n                                             10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         4h14m
  openshift-ingress-canary                ingress-canary-9r9f8                                      10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         4h13m
  openshift-ingress                       router-default-5957f4f4c6-tl9gs                           100m (2%)     0 (0%)      256Mi (1%)       0 (0%)         4h18m
  openshift-machine-config-operator       machine-config-daemon-h7fx4                               40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         4h14m
  openshift-monitoring                    alertmanager-main-1                                       9m (0%)       0 (0%)      120Mi (0%)       0 (0%)         4h12m
  openshift-monitoring                    monitoring-plugin-68995cb674-w2wr9                        10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         4h13m
  openshift-monitoring                    node-exporter-kbq8z                                       9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         4h13m
  openshift-monitoring                    prometheus-adapter-54fc7b9c87-sg4vt                       1m (0%)       0 (0%)      40Mi (0%)        0 (0%)         4h13m
  openshift-monitoring                    prometheus-k8s-1                                          75m (2%)      0 (0%)      1104Mi (7%)      0 (0%)         4h12m
  openshift-monitoring                    prometheus-operator-admission-webhook-84b7fffcdc-x8hsz    5m (0%)       0 (0%)      30Mi (0%)        0 (0%)         4h18m
  openshift-monitoring                    thanos-querier-59cbd86d58-cjkxt                           15m (0%)      0 (0%)      92Mi (0%)        0 (0%)         4h13m
  openshift-multus                        multus-7gjnt                                              10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         4h14m
  openshift-multus                        multus-additional-cni-plugins-gn7x9                       10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         4h14m
  openshift-multus                        network-metrics-daemon-88tf6                              20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         4h14m
  openshift-network-diagnostics           network-check-target-kpv5v                                10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         4h14m
  openshift-ovn-kubernetes                ovnkube-node-74nl9                                        80m (2%)      0 (0%)      1630Mi (11%)     0 (0%)         3h51m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                619m (17%)    0 (0%)
  memory             4296Mi (29%)  0 (0%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
Events:              <none>

 % oc get node -l k8s.ovn.org/egress-assignable=             
NAME                                        STATUS   ROLES    AGE     VERSION
ip-10-0-59-132.us-east-2.compute.internal   Ready    worker   4h14m   v1.27.6+b49f9d1
3.  Create egressIP object

Actual results:

% oc get egressip        
NAME         EGRESSIPS     ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-1   10.0.59.101        

% oc get cloudprivateipconfig
No resources found

Expected results:

The egressIP should be applied to egress node

Additional info:

https://github.com/openshift/ovn-kubernetes/pull/2018

Bug OCPBUGS-28225: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-image-registry-operator/pull/993

Bug OCPBUGS-24957: Update 4.16 coredns-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/coredns/pull/107

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/coredns/pull/107

Bug OCPBUGS-24948: Update 4.16 ose-cluster-cloud-controller-manager-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/308

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/308

Bug OCPBUGS-29288: Ensure proper deprecation for the default field manager in CNO

Description of problem:

We deprecated the default field manager in CNO, but it is still used by default calls like Patch(). We need to update all calls to use explicit fieldManager, and add a test to verify deprecated managers are not used.

Since 4.14 https://github.com/openshift/cluster-network-operator/commit/db57a477b10f517bc4ae501d95cc7b8398a8755c#diff-33ef32bf6c23acb95f5902d7097b7a1d5128ca061167ec0716715b0b9eeaa5f6R31 (more specifically, since the sigs.k8s.io/controller-runtime bump) we have been exposed to this bug https://github.com/kubernetes-sigs/controller-runtime/pull/2435/commits/a6b9c0b672c77a79fff4d5bc03221af1e1fe21fa, which makes the default fieldManager "Go-http-client" instead of "cluster-network-operator".
This means that the "cluster-network-operator" deprecation doesn't really work, since the manager has a different name. When unset, the manager name comes from https://github.com/kubernetes/kubernetes/blob/b85c9bbf1ac911a2a2aed2d5c1f5eaf5956cc199/staging/src/k8s.io/client-go/rest/config.go#L498 and is then cropped by https://github.com/openshift/cluster-network-operator/blob/5f18e4231f291bf5a01812974b0b4dff19c77f2c/vendor/k8s.io/apiserver/pkg/endpoints/handlers/create.go#L253-L260.
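
The cropping step can be approximated like this. It is a rough Python model of the linked helper, not a copy of it: the function name, the 128-byte limit, and the sample user agent strings are assumptions for illustration.

```python
# Assumed upper bound on the field manager name length (in bytes).
FIELD_MANAGER_MAX_LENGTH = 128


def field_manager_from_user_agent(user_agent: str) -> str:
    """Derive a default field manager name from a User-Agent string."""
    # Keep only the part before the first "/" (the product name).
    prefix = user_agent.split("/")[0]
    name = ""
    for ch in prefix:
        # Skip non-printable characters.
        if not ch.isprintable():
            continue
        # Skip characters that would push the name past the limit.
        if len((name + ch).encode("utf-8")) > FIELD_MANAGER_MAX_LENGTH:
            continue
        name += ch
    return name


# Go's default HTTP User-Agent yields the unexpected manager name.
print(field_manager_from_user_agent("Go-http-client/2.0"))
# A User-Agent that starts with the binary name yields the expected one.
print(field_manager_from_user_agent("cluster-network-operator/v0.0.0 (linux/amd64)"))
```

This is why an explicit fieldManager on every call is the safer option: the default depends on whatever User-Agent the client happens to send.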

Identified changes needed (may be more):

(status *StatusManager) setAnnotation
mysterious message for the newly created clusters only (didn't see on CI runs, only seen once)
`Depreciated field manager cluster-network-operator for object "operator.openshift.io/v1, Kind=Network" cluster`

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1. Check CNO logs for deprecated field manager logs
    2. oc logs -l name=network-operator --tail=-1 -n openshift-network-operator|grep "Depreciated field manager"     

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-network-operator/pull/2274

Task MGMT-16501: Refactor API_VIP check to pass additional headers

Assisted installer agent's api_vip check doesn't accept multiple headers (src). This poses an issue when there are different ignition servers (e.g. hypershift) that expect different headers.

Latest use case: Hypershift's ignition server expects this header: Nodepool name and targetconfigversionhash
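
A check that accepts an arbitrary set of headers could look roughly like the sketch below. The agent itself is not written in Python; the function name, URL, header names, and values here are placeholders, not the agent's real API or the exact headers HyperShift expects.

```python
import urllib.request
from typing import Mapping, Optional


def build_api_vip_request(
    url: str,
    extra_headers: Optional[Mapping[str, str]] = None,
) -> urllib.request.Request:
    """Build a GET request carrying any number of caller-supplied headers."""
    req = urllib.request.Request(url, method="GET")
    for name, value in (extra_headers or {}).items():
        req.add_header(name, value)
    return req


# Placeholder endpoint and header values for illustration only.
req = build_api_vip_request(
    "https://api-vip.example.invalid/ignition",
    {"NodePool": "example-pool", "TargetConfigVersionHash": "abc123"},
)
# Note: urllib stores header names capitalized (e.g. "Nodepool").
print(req.header_items())
```

Passing the headers as a mapping keeps the check independent of which ignition server is on the other end.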

Bug OCPBUGS-28708: add new arm64 tested azure instance types in installer doc

Description of problem:

While running the 4.15 installer full function test, the following arm64 instance family was detected and verified; it needs to be appended to the installer doc[1]:
- standardBpsv2Family

[1] https://github.com/openshift/installer/blob/master/docs/user/azure/tested_instance_types_aarch64.md

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/installer/pull/7965

Bug OCPBUGS-25306: HyperShift failing conformance tests on Kubernetes 1.29 rebase

Description of problem:

Hosted control plane kube scheduler pods crashloop on clusters created with Kube 1.29 rebase

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    always

Steps to Reproduce:

    1. Create a hosted cluster using 4.16 kube rebase code base
    2. Wait for the cluster to come up

Actual results:

    Cluster never comes up because kube scheduler pod crashloops

Expected results:

    Cluster comes up

Additional info:

    The kube scheduler configuration generated by the control plane operator is using the v1beta3 version of the configuration. That version is no longer included in Kubernetes v1.29
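
In the simplest case the change amounts to bumping the apiVersion of the generated configuration. The sketch below works on a plain dictionary and uses a hypothetical helper name; a real migration may also need field-level changes, which are not covered here.

```python
REMOVED_VERSION = "kubescheduler.config.k8s.io/v1beta3"
SUPPORTED_VERSION = "kubescheduler.config.k8s.io/v1"


def migrate_scheduler_config(config: dict) -> dict:
    """Return a copy of the config that uses the v1 API version."""
    migrated = dict(config)
    if migrated.get("apiVersion") == REMOVED_VERSION:
        migrated["apiVersion"] = SUPPORTED_VERSION
    return migrated


old_config = {
    "apiVersion": "kubescheduler.config.k8s.io/v1beta3",
    "kind": "KubeSchedulerConfiguration",
}
print(migrate_scheduler_config(old_config)["apiVersion"])
```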

https://github.com/openshift/hypershift/pull/3313

Bug OCPBUGS-29469: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-29895: ServiceInstanceNameToGUID needs more debugging statements

Description of problem:

A user noticed that on cluster deletion the IPI-generated service instance was not cleaned up. Add more debugging statements to find out why.

Version-Release number of selected component (if applicable):

How reproducible:

Always

Steps to Reproduce:

    1. Create cluster
    2. Delete cluster

Actual results:

Expected results:

Additional info:

https://github.com/openshift/installer/pull/8058

Task MON-3698: Update downstream prom-label-proxy to v0.8.1

https://github.com/openshift/prom-label-proxy/pull/363

Bug OCPBUGS-26488: CCO reports wrong credentials mode in metrics

Description of problem:

CCO reports credsremoved mode in metrics when the cluster is actually in the default mode. 
See https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/47349/rehearse-47349-pull-ci-openshift-cloud-credential-operator-release-4.16-e2e-aws-qe/1744240905512030208 (OCP-31768).

Version-Release number of selected component (if applicable):

4.16

How reproducible:

Always.

Steps to Reproduce:

1. Create an AWS cluster with CCO in the default mode (ends up in mint)
2. Get the value of the cco_credentials_mode metric

Actual results:

credsremoved

Expected results:

mint

Root cause:

The controller-runtime client used in metrics calculator (https://github.com/openshift/cloud-credential-operator/blob/77a68ad01e75162bfa04097b22f80d305c192439/pkg/operator/metrics/metrics.go#L77) is unable to GET the root credentials Secret (https://github.com/openshift/cloud-credential-operator/blob/77a68ad01e75162bfa04097b22f80d305c192439/pkg/operator/metrics/metrics.go#L184) since it is backed by a cache which only contains target Secrets requested by other operators (https://github.com/openshift/cloud-credential-operator/blob/77a68ad01e75162bfa04097b22f80d305c192439/pkg/cmd/operator/cmd.go#L164-L168).
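
The effect of reading through a filtered cache can be shown with a toy model. Everything in this sketch is simplified and hypothetical: the class, the label key, the secret names, and the two-way mode decision are stand-ins, not the CCO implementation.

```python
from typing import Callable, Dict, Tuple

SecretKey = Tuple[str, str]  # (namespace, name)


class NotFound(Exception):
    pass


class FilteredCacheClient:
    """Toy client whose cache only holds secrets carrying a given label."""

    def __init__(self, api_secrets: Dict[SecretKey, dict], label_key: str) -> None:
        self._cache = {
            key: secret
            for key, secret in api_secrets.items()
            if label_key in secret.get("labels", {})
        }

    def get(self, namespace: str, name: str) -> dict:
        try:
            return self._cache[(namespace, name)]
        except KeyError:
            raise NotFound(f"{namespace}/{name}") from None


def direct_get(api_secrets: Dict[SecretKey, dict]) -> Callable[[str, str], dict]:
    """Toy uncached reader that sees every secret in the API."""
    def get(namespace: str, name: str) -> dict:
        try:
            return api_secrets[(namespace, name)]
        except KeyError:
            raise NotFound(f"{namespace}/{name}") from None
    return get


def reported_mode(get_secret: Callable[[str, str], dict]) -> str:
    # Simplified: a missing root secret is reported as "credsremoved".
    try:
        get_secret("kube-system", "root-creds-example")
    except NotFound:
        return "credsremoved"
    return "mint"


api = {
    ("kube-system", "root-creds-example"): {"labels": {}},
    ("openshift-example", "target-secret"): {"labels": {"example.io/target": "true"}},
}
cached = FilteredCacheClient(api, label_key="example.io/target")
print(reported_mode(cached.get), reported_mode(direct_get(api)))
```

The root secret exists in both cases; only the reader that bypasses the filtered cache can see it.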

https://github.com/openshift/cloud-credential-operator/pull/645

Bug OCPBUGS-27263: Bump Golang to 1.21 to be consistent with ART

Description of problem:

    ART is moving the container images to be built by Golang 1.21. We should do the same to keep our build config in sync with ART.

Version-Release number of selected component (if applicable):

    4.16/master

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/installer/pull/7925

Bug OCPBUGS-385: oc-mirror requires that the default channel of an operator is mirrored

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/oc-mirror/pull/749

Bug OCPBUGS-29305: IPSec - ovn-ipsec-containerized ds typo

Description of problem:

There's a typo in the openssl commands within the ovn-ipsec-containerized/ovn-ipsec-host daemonsets. The correct parameter is "-checkend", not "-checkedn".
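
For context, -checkend N asks whether the certificate expires within the next N seconds. The sketch below models only that comparison, with a hypothetical function name and made-up dates; it does not parse certificates or call openssl.

```python
from datetime import datetime, timedelta, timezone


def expires_within(not_after: datetime, seconds: int, now: datetime) -> bool:
    """Return True if a certificate valid until `not_after` expires
    within `seconds` of `now`."""
    return not_after <= now + timedelta(seconds=seconds)


# 15770000 seconds is roughly half a year (about 182.5 days).
CHECK_WINDOW_SECONDS = 15770000

now = datetime(2024, 1, 1, tzinfo=timezone.utc)
fresh_cert_expiry = now + timedelta(days=365)
aging_cert_expiry = now + timedelta(days=30)

print(expires_within(fresh_cert_expiry, CHECK_WINDOW_SECONDS, now))
print(expires_within(aging_cert_expiry, CHECK_WINDOW_SECONDS, now))
```

With the misspelled flag, openssl rejects the command line before any such comparison happens, which matches the "Use -help for summary" output in the logs below.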

Version-Release number of selected component (if applicable):

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.10   True        False         7s      Cluster version is 4.14.10

How reproducible:

Steps to Reproduce:

1. Enable IPsec encryption

# oc patch networks.operator.openshift.io cluster --type=merge -p '{"spec": 
 {"defaultNetwork":{"ovnKubernetesConfig":{"ipsecConfig":{ }}}}}'

Actual results:

Examining the initContainer (ovn-keys) logs

# oc logs ovn-ipsec-containerized-7bcd2 -c ovn-keys
...
+ openssl x509 -noout -dates -checkedn 15770000 -in /etc/openvswitch/keys/ipsec-cert.pem
x509: Use -help for summary.

# oc get ds
NAME                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
ovn-ipsec-containerized   1         1         0       1            0           beta.kubernetes.io/os=linux   159m
ovn-ipsec-host            1         1         1       1            1           beta.kubernetes.io/os=linux   159m
ovnkube-node              1         1         1       1            1           beta.kubernetes.io/os=linux   3h44m

# oc get ds ovn-ipsec-containerized -o yaml | grep edn
if ! openssl x509 -noout -dates -checkedn 15770000 -in $cert_pem; then     

# oc get ds ovn-ipsec-host -o yaml | grep edn
if ! openssl x509 -noout -dates -checkedn 15770000 -in $cert_pem; then

https://github.com/openshift/cluster-network-operator/pull/2269

Bug OCPBUGS-25117: Update 4.16 csi-node-driver-registrar-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/59

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-node-driver-registrar/pull/59

Bug OCPBUGS-25670: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/2038

Bug OCPBUGS-25503: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/azure-disk-csi-driver/pull/71

Bug OCPBUGS-22623: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-nutanix/pull/29

Bug OCPBUGS-24994: Update 4.16 operator-lifecycle-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/631

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/operator-framework-olm/pull/631

Bug OCPBUGS-25563: Update 4.16 ose-alibaba-cloud-controller-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cloud-provider-alibaba-cloud/pull/44

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-provider-alibaba-cloud/pull/44

Bug OCPBUGS-24721: Update 4.16 ose-multus-route-override-cni-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/route-override-cni/pull/51

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Task MON-3175: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/telemeter/pull/512

Bug OCPBUGS-24602: [sig-arch] events should not repeat pathologically for ns/openshift-dns failing

Description of problem: The [sig-arch] events should not repeat pathologically for ns/openshift-dns test is permafailing in the periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.15-e2e-openstack-ovn-serial job

{ 1 events happened too frequently

event happened 114 times, something is wrong: namespace/openshift-dns hmsg/d0c68b9435 service/dns-default - reason/TopologyAwareHintsDisabled Insufficient Node information: allocatable CPU or zone not specified on one or more nodes, addressType: IPv4 From: 17:11:03Z To: 17:11:04Z result=reject }

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.15-e2e-openstack-ovn-serial

Example job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.15-e2e-openstack-ovn-serial/1732612509958934528

https://github.com/openshift/cluster-dns-operator/pull/398

Bug OCPBUGS-30855: OCP4.16 - Port_Security flag doesn't work in ShiftonStack Sriov Worker node deployment

Description of problem:

    Port security is overridden (it is enabled on the created ports) even though it is set to false in the worker MachineSet configuration.
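
The mismatch can be checked mechanically by comparing the MachineSet port definitions with the Neutron ports. The following is a minimal sketch under stated assumptions: the MachineSet JSON keys `portSecurity` and `nameSuffix` are assumed to correspond to the "Port Security" and "Name Suffix" fields in the `oc describe` output, ports are assumed to be named `<machine name>-<name suffix>` as in the `openstack port list` output in this report, and the function name `find_port_security_mismatches` is hypothetical. The input dictionaries stand in for JSON obtained from the cluster and from OpenStack.

```python
def find_port_security_mismatches(machineset, machine_name, openstack_ports):
    """Compare requested and actual port security per port.

    machineset: MachineSet object as a dict (assumed JSON layout).
    machine_name: name of the Machine created from the MachineSet.
    openstack_ports: dict mapping port name to a dict that carries
        the port_security_enabled field of the OpenStack port.
    Returns a list of (port_name, requested, actual) tuples.
    """
    ports = (
        machineset.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("providerSpec", {})
        .get("value", {})
        .get("ports", [])
    )
    mismatches = []
    for port in ports:
        requested = port.get("portSecurity")
        if requested is None:
            continue  # nothing requested explicitly, nothing to compare
        # Assumed naming convention: <machine name>-<name suffix>
        name = f"{machine_name}-{port['nameSuffix']}"
        actual = openstack_ports.get(name, {}).get("port_security_enabled")
        if actual is not None and actual != requested:
            mismatches.append((name, requested, actual))
    return mismatches


# Sample data reduced to two of the ports shown in this report.
MACHINE = "5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr"
MACHINESET = {
    "spec": {"template": {"spec": {"providerSpec": {"value": {"ports": [
        {"nameSuffix": "provider4p4", "portSecurity": False},
        {"nameSuffix": "provider3p2", "portSecurity": False},
    ]}}}}}
}
OPENSTACK_PORTS = {
    MACHINE + "-provider4p4": {"port_security_enabled": True},
    MACHINE + "-provider3p2": {"port_security_enabled": True},
}

if __name__ == "__main__":
    for name, requested, actual in find_port_security_mismatches(
        MACHINESET, MACHINE, OPENSTACK_PORTS
    ):
        print(f"{name}: requested={requested} actual={actual}")
```

With the values from this report, both sampled ports are reported as mismatched: false was requested, but the ports have port security enabled.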

Version-Release number of selected component (if applicable):

    OCP=4.14.14
    RHOSP=17.1

How reproducible:

    NFV Perf lab 
    ShiftonStack Deployment mode = IPI

Steps to Reproduce:

    1. Network configuration resources for the worker node
$ oc get machinesets.machine.openshift.io -n openshift-machine-api | grep worker
5kqfbl3y0rhocpnfv-wj2jj-worker-0   1         1         1       1           5d23h
$ oc describe machinesets.machine.openshift.io -n openshift-machine-api 5kqfbl3y0rhocpnfv-wj2jj-worker-0
Name:         5kqfbl3y0rhocpnfv-wj2jj-worker-0
Namespace:    openshift-machine-api
Labels:       machine.openshift.io/cluster-api-cluster=5kqfbl3y0rhocpnfv-wj2jj
              machine.openshift.io/cluster-api-machine-role=worker
              machine.openshift.io/cluster-api-machine-type=worker
Annotations:  machine.openshift.io/memoryMb: 47104
              machine.openshift.io/vCPU: 26
API Version:  machine.openshift.io/v1beta1
Kind:         MachineSet
Metadata:
  Creation Timestamp:  2024-03-07T05:24:07Z
  Generation:          3
  Resource Version:    226098
  UID:                 8cb06872-9b62-4c2c-b66b-bf91a03efa2d
Spec:
  Replicas:  1
  Selector:
    Match Labels:
      machine.openshift.io/cluster-api-cluster:     5kqfbl3y0rhocpnfv-wj2jj
      machine.openshift.io/cluster-api-machineset:  5kqfbl3y0rhocpnfv-wj2jj-worker-0
  Template:
    Metadata:
      Labels:
        machine.openshift.io/cluster-api-cluster:       5kqfbl3y0rhocpnfv-wj2jj
        machine.openshift.io/cluster-api-machine-role:  worker
        machine.openshift.io/cluster-api-machine-type:  worker
        machine.openshift.io/cluster-api-machineset:    5kqfbl3y0rhocpnfv-wj2jj-worker-0
    Spec:
      Lifecycle Hooks:
      Metadata:
      Provider Spec:
        Value:
          API Version:        machine.openshift.io/v1alpha1
          Availability Zone:  worker
          Cloud Name:         openstack
          Clouds Secret:
            Name:        openstack-cloud-credentials
            Namespace:   openshift-machine-api
          Config Drive:  true
          Flavor:        sos-worker
          Image:         5kqfbl3y0rhocpnfv-wj2jj-rhcos
          Kind:          OpenstackProviderSpec
          Metadata:
          Networks:
            Filter:
            Subnets:
              Filter:
                Id:  7fb7d2d6-325d-49e1-b3f8-b4dbb1197e34
          Ports:
            Fixed I Ps:
              Subnet ID:    1a892dcf-bf93-46ef-bf37-bda6cf923471
            Name Suffix:    provider3p1
            Network ID:     50a557b5-34c2-4c47-b539-963688f7167c
            Port Security:  false
            Tags:
              sriov
            Trunk:      false
            Vnic Type:  direct
            Fixed I Ps:
              Subnet ID:    76430b9e-302f-428d-916a-77482d9cfb19
            Name Suffix:    provider4p1
            Network ID:     e2106b16-8f83-4e2e-bdbd-20e2c12ec279
            Port Security:  false
            Tags:
              sriov
            Trunk:      false
            Vnic Type:  direct
            Fixed I Ps:
              Subnet ID:    1a892dcf-bf93-46ef-bf37-bda6cf923471
            Name Suffix:    provider3p2
            Network ID:     50a557b5-34c2-4c47-b539-963688f7167c
            Port Security:  false
            Tags:
              sriov
            Trunk:      false
            Vnic Type:  direct
            Fixed I Ps:
              Subnet ID:    76430b9e-302f-428d-916a-77482d9cfb19
            Name Suffix:    provider4p2
            Network ID:     e2106b16-8f83-4e2e-bdbd-20e2c12ec279
            Port Security:  false
            Tags:
              sriov
            Trunk:      false
            Vnic Type:  direct
            Fixed I Ps:
              Subnet ID:    1a892dcf-bf93-46ef-bf37-bda6cf923471
            Name Suffix:    provider3p3
            Network ID:     50a557b5-34c2-4c47-b539-963688f7167c
            Port Security:  false
            Tags:
              sriov
            Trunk:      false
            Vnic Type:  direct
            Fixed I Ps:
              Subnet ID:    76430b9e-302f-428d-916a-77482d9cfb19
            Name Suffix:    provider4p3
            Network ID:     e2106b16-8f83-4e2e-bdbd-20e2c12ec279
            Port Security:  false
            Tags:
              sriov
            Trunk:      false
            Vnic Type:  direct
            Fixed I Ps:
              Subnet ID:    1a892dcf-bf93-46ef-bf37-bda6cf923471
            Name Suffix:    provider3p4
            Network ID:     50a557b5-34c2-4c47-b539-963688f7167c
            Port Security:  false
            Tags:
              sriov
            Trunk:      false
            Vnic Type:  direct
            Fixed I Ps:
              Subnet ID:    76430b9e-302f-428d-916a-77482d9cfb19
            Name Suffix:    provider4p4
            Network ID:     e2106b16-8f83-4e2e-bdbd-20e2c12ec279
            Port Security:  false
            Tags:
              sriov
            Trunk:         false
            Vnic Type:     direct
          Primary Subnet:  7fb7d2d6-325d-49e1-b3f8-b4dbb1197e34
          Security Groups:
            Filter:
            Name:             5kqfbl3y0rhocpnfv-wj2jj-worker
          Server Group Name:  5kqfbl3y0rhocpnfv-wj2jj-worker-worker
          Server Metadata:
            Name:                  5kqfbl3y0rhocpnfv-wj2jj-worker
            Openshift Cluster ID:  5kqfbl3y0rhocpnfv-wj2jj
          Tags:
            openshiftClusterID=5kqfbl3y0rhocpnfv-wj2jj
          Trunk:  true
          User Data Secret:
            Name:  worker-user-data
Status:
  Available Replicas:      1
  Fully Labeled Replicas:  1
  Observed Generation:     3
  Ready Replicas:          1
  Replicas:                1
Events:                    <none>
$ oc get nodes
NAME                                     STATUS   ROLES                  AGE     VERSION
5kqfbl3y0rhocpnfv-wj2jj-master-0         Ready    control-plane,master   5d23h   v1.27.10+28ed2d7
5kqfbl3y0rhocpnfv-wj2jj-master-1         Ready    control-plane,master   5d23h   v1.27.10+28ed2d7
5kqfbl3y0rhocpnfv-wj2jj-master-2         Ready    control-plane,master   5d23h   v1.27.10+28ed2d7
5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr   Ready    worker                 5d22h   v1.27.10+28ed2d7
$ oc describe nodes 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr
Name:               5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=sos-worker
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=regionOne
                    failure-domain.beta.kubernetes.io/zone=worker
                    feature.node.kubernetes.io/network-sriov.capable=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=sos-worker
                    node.openshift.io/os_id=rhcos
                    topology.cinder.csi.openstack.org/zone=worker
                    topology.kubernetes.io/region=regionOne
                    topology.kubernetes.io/zone=worker
Annotations:        alpha.kubernetes.io/provided-node-ip: 192.168.0.91
                    csi.volume.kubernetes.io/nodeid: {"cinder.csi.openstack.org":"aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879"}
                    machine.openshift.io/machine: openshift-machine-api/5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-8c613531f97974a9561f8b0ada0c2cd0
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-8c613531f97974a9561f8b0ada0c2cd0
                    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-8c613531f97974a9561f8b0ada0c2cd0
                    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-8c613531f97974a9561f8b0ada0c2cd0
                    machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 505735
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    sriovnetwork.openshift.io/state: Idle
                    tuned.openshift.io/bootcmdline:
                      skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on rcu_nocbs=10-25 tuned.non_isolcpus=000003ff systemd.cpu_affinity=0,1,2,3...
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 07 Mar 2024 06:09:31 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr
  AcquireTime:     <unset>
  RenewTime:       Wed, 13 Mar 2024 04:55:28 +0000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 13 Mar 2024 04:55:33 +0000   Thu, 07 Mar 2024 15:18:00 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 13 Mar 2024 04:55:33 +0000   Thu, 07 Mar 2024 15:18:00 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 13 Mar 2024 04:55:33 +0000   Thu, 07 Mar 2024 15:18:00 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 13 Mar 2024 04:55:33 +0000   Thu, 07 Mar 2024 15:18:05 +0000   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  192.168.0.91
  Hostname:    5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr
Capacity:
  cpu:                          26
  ephemeral-storage:            104266732Ki
  hugepages-1Gi:                20Gi
  hugepages-2Mi:                0
  memory:                       47264764Ki
  openshift.io/intl_provider3:  4
  openshift.io/intl_provider4:  4
  pods:                         250
Allocatable:
  cpu:                          16
  ephemeral-storage:            95018478229
  hugepages-1Gi:                20Gi
  hugepages-2Mi:                0
  memory:                       25166844Ki
  openshift.io/intl_provider3:  4
  openshift.io/intl_provider4:  4
  pods:                         250
System Info:
  Machine ID:                             aa5cfdcbeb4646d88ac25bb6f0c0d879
  System UUID:                            aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879
  Boot ID:                                77573755-0d27-4717-80fe-4579692d9c2c
  Kernel Version:                         5.14.0-284.54.1.el9_2.x86_64
  OS Image:                               Red Hat Enterprise Linux CoreOS 414.92.202402201520-0 (Plow)
  Operating System:                       linux
  Architecture:                           amd64
  Container Runtime Version:              cri-o://1.27.3-6.rhaos4.14.git7eb2281.el9
  Kubelet Version:                        v1.27.10+28ed2d7
  Kube-Proxy Version:                     v1.27.10+28ed2d7
ProviderID:                               openstack:///aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879
Non-terminated Pods:                      (19 in total)
  Namespace                               Name                                                 CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                               ----                                                 ------------  ----------  ---------------  -------------  ---
  crucible-rickshaw                       testpmd-host-device-e810-sriov                       10 (62%)      10 (62%)    10000Mi (40%)    10000Mi (40%)  3d13h
  openshift-cluster-csi-drivers           openstack-cinder-csi-driver-node-hnv49               30m (0%)      0 (0%)      150Mi (0%)       0 (0%)         5d22h
  openshift-cluster-node-tuning-operator  tuned-fcjfp                                          10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         5d22h
  openshift-dns                           dns-default-v7s59                                    60m (0%)      0 (0%)      110Mi (0%)       0 (0%)         5d22h
  openshift-dns                           node-resolver-gkz8b                                  5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         5d22h
  openshift-image-registry                node-ca-p5dn5                                        10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         5d22h
  openshift-ingress-canary                ingress-canary-fk59t                                 10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         5d22h
  openshift-machine-config-operator       machine-config-daemon-9qw8z                          40m (0%)      0 (0%)      100Mi (0%)       0 (0%)         5d22h
  openshift-monitoring                    node-exporter-czcmj                                  9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         5d22h
  openshift-monitoring                    prometheus-adapter-7696787779-vj5wk                  1m (0%)       0 (0%)      40Mi (0%)        0 (0%)         5d4h
  openshift-multus                        multus-additional-cni-plugins-l7rpv                  10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         5d22h
  openshift-multus                        multus-nxr6k                                         10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         5d22h
  openshift-multus                        network-metrics-daemon-tb7sq                         20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         5d22h
  openshift-network-diagnostics           network-check-target-pqtp9                           10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         5d22h
  openshift-openstack-infra               coredns-5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr       200m (1%)     0 (0%)      400Mi (1%)       0 (0%)         5d22h
  openshift-openstack-infra               keepalived-5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr    200m (1%)     0 (0%)      400Mi (1%)       0 (0%)         5d22h
  openshift-sdn                           sdn-9mdnb                                            110m (0%)     0 (0%)      220Mi (0%)       0 (0%)         5d22h
  openshift-sriov-network-operator        sriov-device-plugin-tr68w                            10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         5d13h
  openshift-sriov-network-operator        sriov-network-config-daemon-dtf95                    100m (0%)     0 (0%)      100Mi (0%)       0 (0%)         5d22h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                     Requests       Limits
  --------                     --------       ------
  cpu                          10845m (67%)   10 (62%)
  memory                       11928Mi (48%)  10000Mi (40%)
  ephemeral-storage            0 (0%)         0 (0%)
  hugepages-1Gi                8Gi (40%)      8Gi (40%)
  hugepages-2Mi                0 (0%)         0 (0%)
  openshift.io/intl_provider3  4              4
  openshift.io/intl_provider4  4              4
Events:                        <none>

    2. OpenStack Network resource for Worker node
$ openstack server list --all --fit-width
+--------------------------------------+----------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------+-------------------------------+------------+
| ID                                   | Name                                   | Status | Networks                                                                                                             | Image                         | Flavor     |
+--------------------------------------+----------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------+-------------------------------+------------+
| aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr | ACTIVE | management=192.168.0.91; provider-3=192.168.177.197, 192.168.177.59, 192.168.177.66, 192.168.177.83;                 | 5kqfbl3y0rhocpnfv-wj2jj-rhcos | sos-worker |
|                                      |                                        |        | provider-4=192.168.178.108, 192.168.178.121, 192.168.178.144, 192.168.178.18                                         |                               |            |
| 1a24baf3-acde-49a0-ab8e-4f4afcc9d3cc | 5kqfbl3y0rhocpnfv-wj2jj-master-2       | ACTIVE | management=192.168.0.62                                                                                              | 5kqfbl3y0rhocpnfv-wj2jj-rhcos | sos-master |
| 3e545ab5-6e28-4189-8d94-9272dfa1cd05 | 5kqfbl3y0rhocpnfv-wj2jj-master-1       | ACTIVE | management=192.168.0.78                                                                                              | 5kqfbl3y0rhocpnfv-wj2jj-rhcos | sos-master |
| 97e5c382-0fb0-4a70-b58e-0469d3869a4e | 5kqfbl3y0rhocpnfv-wj2jj-master-0       | ACTIVE | management=192.168.0.93                                                                                              | 5kqfbl3y0rhocpnfv-wj2jj-rhcos | sos-master |
+--------------------------------------+----------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------+-------------------------------+------------+
$ openstack port list --server aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879
+--------------------------------------+----------------------------------------------------+-------------------+--------------------------------------------------------------------------------+--------+
| ID                                   | Name                                               | MAC Address       | Fixed IP Addresses                                                             | Status |
+--------------------------------------+----------------------------------------------------+-------------------+--------------------------------------------------------------------------------+--------+
| 0a562c29-4ddc-41c4-82e8-13934d3ee273 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-0           | fa:16:3e:16:9a:c3 | ip_address='192.168.0.91', subnet_id='7fb7d2d6-325d-49e1-b3f8-b4dbb1197e34'    | ACTIVE |
| 0c1814db-cd4f-4f6a-a0c6-4f8e569b6767 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p4 | fa:16:3e:15:88:d7 | ip_address='192.168.178.108', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19' | ACTIVE |
| 1778cb62-5fbf-42be-8847-53a7b092bdf5 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p2 | fa:16:3e:2a:64:e4 | ip_address='192.168.177.197', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471' | ACTIVE |
| 557f205b-2674-4f6e-91a2-643fe1702be2 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p1 | fa:16:3e:56:a3:48 | ip_address='192.168.177.83', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471'  | ACTIVE |
| 721b5f15-2dc9-4509-a4ba-09f364ae8771 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p3 | fa:16:3e:dd:c3:28 | ip_address='192.168.177.59', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471'  | ACTIVE |
| 9da4b1be-27d7-4428-a194-9eb4b02f6ac5 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p3 | fa:16:3e:fb:06:1b | ip_address='192.168.178.144', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19' | ACTIVE |
| a72fcbd2-83d3-4fa9-be3d-e9fbde27d4bf | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p4 | fa:16:3e:a9:28:0e | ip_address='192.168.177.66', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471'  | ACTIVE |
| ba5cd10f-c6bc-4bed-b978-3b8a3560ad5c | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p1 | fa:16:3e:33:e4:c4 | ip_address='192.168.178.18', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19'  | ACTIVE |
| bf2ce123-76fc-4e5c-9e4f-0473febbdeac | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p2 | fa:16:3e:ce:91:10 | ip_address='192.168.178.121', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19' | ACTIVE |
+--------------------------------------+----------------------------------------------------+-------------------+--------------------------------------------------------------------------------+--------+
$ openstack port show --fit-width 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p4
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Field                   | Value                                                                                                                     |
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
| admin_state_up          | UP                                                                                                                        |
| allowed_address_pairs   |                                                                                                                           |
| binding_host_id         | nfv-intel-11.perflab.com                                                                                                  |
| binding_profile         | pci_slot='0000:b1:11.2', pci_vendor_info='8086:1889', physical_network='provider4'                                        |
| binding_vif_details     | connectivity='l2', port_filter='False', vlan='178'                                                                        |
| binding_vif_type        | hw_veb                                                                                                                    |
| binding_vnic_type       | direct                                                                                                                    |
| created_at              | 2024-03-07T06:03:43Z                                                                                                      |
| data_plane_status       | None                                                                                                                      |
| description             | Created by cluster-api-provider-openstack cluster openshift-machine-api-5kqfbl3y0rhocpnfv-wj2jj                           |
| device_id               | aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879                                                                                      |
| device_owner            | compute:worker                                                                                                            |
| device_profile          | None                                                                                                                      |
| dns_assignment          | fqdn='host-192-168-178-108.openstacklocal.', hostname='host-192-168-178-108', ip_address='192.168.178.108'                |
| dns_domain              |                                                                                                                           |
| dns_name                |                                                                                                                           |
| extra_dhcp_opts         |                                                                                                                           |
| fixed_ips               | ip_address='192.168.178.108', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19'                                            |
| id                      | 0c1814db-cd4f-4f6a-a0c6-4f8e569b6767                                                                                      |
| ip_allocation           | None                                                                                                                      |
| mac_address             | fa:16:3e:15:88:d7                                                                                                         |
| name                    | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p4                                                                        |
| network_id              | e2106b16-8f83-4e2e-bdbd-20e2c12ec279                                                                                      |
| numa_affinity_policy    | None                                                                                                                      |
| port_security_enabled   | True                                                                                                                      |
| project_id              | 927450d0f06647a99d86214acd822679                                                                                          |
| propagate_uplink_status | None                                                                                                                      |
| qos_network_policy_id   | None                                                                                                                      |
| qos_policy_id           | None                                                                                                                      |
| resource_request        | None                                                                                                                      |
| revision_number         | 6                                                                                                                         |
| security_group_ids      | f0df9265-c7fd-4f47-875f-d346e5cb5074                                                                                      |
| status                  | ACTIVE                                                                                                                    |
| tags                    | cluster-api-provider-openstack, openshift-machine-api-5kqfbl3y0rhocpnfv-wj2jj, openshiftClusterID=5kqfbl3y0rhocpnfv-wj2jj |
| trunk_details           | None                                                                                                                      |
| updated_at              | 2024-03-07T06:04:10Z                                                                                                      |
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
$ openstack port show --fit-width 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p2
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Field                   | Value                                                                                                                     |
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
| admin_state_up          | UP                                                                                                                        |
| allowed_address_pairs   |                                                                                                                           |
| binding_host_id         | nfv-intel-11.perflab.com                                                                                                  |
| binding_profile         | pci_slot='0000:b1:01.1', pci_vendor_info='8086:1889', physical_network='provider3'                                        |
| binding_vif_details     | connectivity='l2', port_filter='False', vlan='177'                                                                        |
| binding_vif_type        | hw_veb                                                                                                                    |
| binding_vnic_type       | direct                                                                                                                    |
| created_at              | 2024-03-07T06:03:41Z                                                                                                      |
| data_plane_status       | None                                                                                                                      |
| description             | Created by cluster-api-provider-openstack cluster openshift-machine-api-5kqfbl3y0rhocpnfv-wj2jj                           |
| device_id               | aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879                                                                                      |
| device_owner            | compute:worker                                                                                                            |
| device_profile          | None                                                                                                                      |
| dns_assignment          | fqdn='host-192-168-177-197.openstacklocal.', hostname='host-192-168-177-197', ip_address='192.168.177.197'                |
| dns_domain              |                                                                                                                           |
| dns_name                |                                                                                                                           |
| extra_dhcp_opts         |                                                                                                                           |
| fixed_ips               | ip_address='192.168.177.197', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471'                                            |
| id                      | 1778cb62-5fbf-42be-8847-53a7b092bdf5                                                                                      |
| ip_allocation           | None                                                                                                                      |
| mac_address             | fa:16:3e:2a:64:e4                                                                                                         |
| name                    | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p2                                                                        |
| network_id              | 50a557b5-34c2-4c47-b539-963688f7167c                                                                                      |
| numa_affinity_policy    | None                                                                                                                      |
| port_security_enabled   | True                                                                                                                      |
| project_id              | 927450d0f06647a99d86214acd822679                                                                                          |
| propagate_uplink_status | None                                                                                                                      |
| qos_network_policy_id   | None                                                                                                                      |
| qos_policy_id           | None                                                                                                                      |
| resource_request        | None                                                                                                                      |
| revision_number         | 9                                                                                                                         |
| security_group_ids      | f0df9265-c7fd-4f47-875f-d346e5cb5074                                                                                      |
| status                  | ACTIVE                                                                                                                    |
| tags                    | cluster-api-provider-openstack, openshift-machine-api-5kqfbl3y0rhocpnfv-wj2jj, openshiftClusterID=5kqfbl3y0rhocpnfv-wj2jj |
| trunk_details           | None                                                                                                                      |
| updated_at              | 2024-03-07T06:10:42Z                                                                                                      |
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
$ openstack network list
+--------------------------------------+-------------+--------------------------------------+
| ID                                   | Name        | Subnets                              |
+--------------------------------------+-------------+--------------------------------------+
| 50a557b5-34c2-4c47-b539-963688f7167c | provider-3  | 1a892dcf-bf93-46ef-bf37-bda6cf923471 |
| e2106b16-8f83-4e2e-bdbd-20e2c12ec279 | provider-4  | 76430b9e-302f-428d-916a-77482d9cfb19 |
| 5fdddf1c-3a71-4752-94bd-bdb5b9674500 | management  | 7fb7d2d6-325d-49e1-b3f8-b4dbb1197e34 |
+--------------------------------------+-------------+--------------------------------------+
$ openstack network show provider-3
+---------------------------+--------------------------------------+
| Field                     | Value                                |
+---------------------------+--------------------------------------+
| admin_state_up            | UP                                   |
| availability_zone_hints   |                                      |
| availability_zones        |                                      |
| created_at                | 2024-03-01T16:45:48Z                 |
| description               |                                      |
| dns_domain                |                                      |
| id                        | 50a557b5-34c2-4c47-b539-963688f7167c |
| ipv4_address_scope        | None                                 |
| ipv6_address_scope        | None                                 |
| is_default                | None                                 |
| is_vlan_transparent       | None                                 |
| mtu                       | 9216                                 |
| name                      | provider-3                           |
| port_security_enabled     | True                                 |
| project_id                | ad4b9a972ac64bd9916ad7ee80288353     |
| provider:network_type     | vlan                                 |
| provider:physical_network | provider3                            |
| provider:segmentation_id  | 177                                  |
| qos_policy_id             | None                                 |
| revision_number           | 2                                    |
| router:external           | Internal                             |
| segments                  | None                                 |
| shared                    | True                                 |
| status                    | ACTIVE                               |
| subnets                   | 1a892dcf-bf93-46ef-bf37-bda6cf923471 |
| tags                      |                                      |
| updated_at                | 2024-03-01T16:45:52Z                 |
+---------------------------+--------------------------------------+
$ openstack network show provider-4
+---------------------------+--------------------------------------+
| Field                     | Value                                |
+---------------------------+--------------------------------------+
| admin_state_up            | UP                                   |
| availability_zone_hints   |                                      |
| availability_zones        |                                      |
| created_at                | 2024-03-01T16:45:57Z                 |
| description               |                                      |
| dns_domain                |                                      |
| id                        | e2106b16-8f83-4e2e-bdbd-20e2c12ec279 |
| ipv4_address_scope        | None                                 |
| ipv6_address_scope        | None                                 |
| is_default                | None                                 |
| is_vlan_transparent       | None                                 |
| mtu                       | 9216                                 |
| name                      | provider-4                           |
| port_security_enabled     | True                                 |
| project_id                | ad4b9a972ac64bd9916ad7ee80288353     |
| provider:network_type     | vlan                                 |
| provider:physical_network | provider4                            |
| provider:segmentation_id  | 178                                  |
| qos_policy_id             | None                                 |
| revision_number           | 2                                    |
| router:external           | Internal                             |
| segments                  | None                                 |
| shared                    | True                                 |
| status                    | ACTIVE                               |
| subnets                   | 76430b9e-302f-428d-916a-77482d9cfb19 |
| tags                      |                                      |
| updated_at                | 2024-03-01T16:46:01Z                 |
+---------------------------+--------------------------------------+
     3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/machine-api-provider-openstack/pull/107

Bug OCPBUGS-26494: error loading certificate: open /etc/tls/private/tls.crt: no such file or directory

On PowerVS, when I try to deploy a 4.16 cluster, I see the following:

Description of problem:

[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc get pods -n openshift-cloud-controller-manager
NAME                                                READY   STATUS             RESTARTS      AGE
powervs-cloud-controller-manager-6b6fbcc9db-9rhtj   0/1     CrashLoopBackOff   4 (10s ago)   2m47s
powervs-cloud-controller-manager-6b6fbcc9db-wnvck   0/1     CrashLoopBackOff   3 (49s ago)   2m46s
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc logs pod/powervs-cloud-controller-manager-6b6fbcc9db-9rhtj -n openshift-cloud-controller-manager
Error from server: no preferred addresses found; known addresses: []
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc logs pod/powervs-cloud-controller-manager-6b6fbcc9db-wnvck -n openshift-cloud-controller-manager
Error from server: no preferred addresses found; known addresses: []

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-ppc64le-2024-01-07-111144

How reproducible:

Always

Steps to Reproduce:

    1. Deploy OpenShift cluster

On the master-0 node, I see:

[core@rdr-hamzy-test-wdc06-fs5m2-master-0 ~]$ sudo crictl ps -a
CONTAINER           IMAGE                                                                                                                    CREATED             STATE               NAME                               ATTEMPT             POD ID              POD
a048556553827       ec3035a371e09312254a277d5eb9affba2930adbd4018f7557899a2f3d76bc88                                                         18 seconds ago      Exited              kube-rbac-proxy                    7                   0381a589d57cd       cluster-cloud-controller-manager-operator-94dd5b468-kxqw5
a326f7ec83ddb       60f5c9455518c79a9797cfbeab0b3530dae1bf77554eccc382ff12d99053efd1                                                         11 minutes ago      Running             config-sync-controllers            0                   0381a589d57cd       cluster-cloud-controller-manager-operator-94dd5b468-kxqw5
ddaa6999b5b86       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:60eff87ed56ee4761fd55caa4712e6bea47dccaa11c59ba53a6d5697eacc7d32   11 minutes ago      Running             cluster-cloud-controller-manager   0                   0381a589d57cd       cluster-cloud-controller-manager-operator-94dd5b468-kxqw5

The failing pod has this as its log:

[core@rdr-hamzy-test-wdc06-fs5m2-master-0 ~]$ sudo crictl logs a048556553827
Flag --logtostderr has been deprecated, will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components
I0108 18:09:12.320332       1 flags.go:64] FLAG: --add-dir-header="false"
I0108 18:09:12.320401       1 flags.go:64] FLAG: --allow-paths="[]"
I0108 18:09:12.320413       1 flags.go:64] FLAG: --alsologtostderr="false"
I0108 18:09:12.320420       1 flags.go:64] FLAG: --auth-header-fields-enabled="false"
I0108 18:09:12.320427       1 flags.go:64] FLAG: --auth-header-groups-field-name="x-remote-groups"
I0108 18:09:12.320435       1 flags.go:64] FLAG: --auth-header-groups-field-separator="|"
I0108 18:09:12.320441       1 flags.go:64] FLAG: --auth-header-user-field-name="x-remote-user"
I0108 18:09:12.320447       1 flags.go:64] FLAG: --auth-token-audiences="[]"
I0108 18:09:12.320454       1 flags.go:64] FLAG: --client-ca-file=""
I0108 18:09:12.320460       1 flags.go:64] FLAG: --config-file="/etc/kube-rbac-proxy/config-file.yaml"
I0108 18:09:12.320467       1 flags.go:64] FLAG: --help="false"
I0108 18:09:12.320473       1 flags.go:64] FLAG: --http2-disable="false"
I0108 18:09:12.320479       1 flags.go:64] FLAG: --http2-max-concurrent-streams="100"
I0108 18:09:12.320486       1 flags.go:64] FLAG: --http2-max-size="262144"
I0108 18:09:12.320492       1 flags.go:64] FLAG: --ignore-paths="[]"
I0108 18:09:12.320500       1 flags.go:64] FLAG: --insecure-listen-address=""
I0108 18:09:12.320506       1 flags.go:64] FLAG: --kubeconfig=""
I0108 18:09:12.320512       1 flags.go:64] FLAG: --log-backtrace-at=":0"
I0108 18:09:12.320520       1 flags.go:64] FLAG: --log-dir=""
I0108 18:09:12.320526       1 flags.go:64] FLAG: --log-file=""
I0108 18:09:12.320531       1 flags.go:64] FLAG: --log-file-max-size="1800"
I0108 18:09:12.320537       1 flags.go:64] FLAG: --log-flush-frequency="5s"
I0108 18:09:12.320543       1 flags.go:64] FLAG: --logtostderr="true"
I0108 18:09:12.320550       1 flags.go:64] FLAG: --oidc-ca-file=""
I0108 18:09:12.320556       1 flags.go:64] FLAG: --oidc-clientID=""
I0108 18:09:12.320564       1 flags.go:64] FLAG: --oidc-groups-claim="groups"
I0108 18:09:12.320570       1 flags.go:64] FLAG: --oidc-groups-prefix=""
I0108 18:09:12.320576       1 flags.go:64] FLAG: --oidc-issuer=""
I0108 18:09:12.320581       1 flags.go:64] FLAG: --oidc-sign-alg="[RS256]"
I0108 18:09:12.320590       1 flags.go:64] FLAG: --oidc-username-claim="email"
I0108 18:09:12.320595       1 flags.go:64] FLAG: --one-output="false"
I0108 18:09:12.320601       1 flags.go:64] FLAG: --proxy-endpoints-port="0"
I0108 18:09:12.320608       1 flags.go:64] FLAG: --secure-listen-address="0.0.0.0:9258"
I0108 18:09:12.320614       1 flags.go:64] FLAG: --skip-headers="false"
I0108 18:09:12.320620       1 flags.go:64] FLAG: --skip-log-headers="false"
I0108 18:09:12.320626       1 flags.go:64] FLAG: --stderrthreshold="2"
I0108 18:09:12.320631       1 flags.go:64] FLAG: --tls-cert-file="/etc/tls/private/tls.crt"
I0108 18:09:12.320637       1 flags.go:64] FLAG: --tls-cipher-suites="[TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305]"
I0108 18:09:12.320654       1 flags.go:64] FLAG: --tls-min-version="VersionTLS12"
I0108 18:09:12.320661       1 flags.go:64] FLAG: --tls-private-key-file="/etc/tls/private/tls.key"
I0108 18:09:12.320667       1 flags.go:64] FLAG: --tls-reload-interval="1m0s"
I0108 18:09:12.320674       1 flags.go:64] FLAG: --upstream="http://127.0.0.1:9257/"
I0108 18:09:12.320681       1 flags.go:64] FLAG: --upstream-ca-file=""
I0108 18:09:12.320686       1 flags.go:64] FLAG: --upstream-client-cert-file=""
I0108 18:09:12.320692       1 flags.go:64] FLAG: --upstream-client-key-file=""
I0108 18:09:12.320697       1 flags.go:64] FLAG: --upstream-force-h2c="false"
I0108 18:09:12.320703       1 flags.go:64] FLAG: --v="3"
I0108 18:09:12.320709       1 flags.go:64] FLAG: --version="false"
I0108 18:09:12.320719       1 flags.go:64] FLAG: --vmodule=""
I0108 18:09:12.320735       1 kube-rbac-proxy.go:578] Reading config file: /etc/kube-rbac-proxy/config-file.yaml
I0108 18:09:12.321427       1 kube-rbac-proxy.go:285] Valid token audiences: 
I0108 18:09:12.321473       1 kube-rbac-proxy.go:399] Reading certificate files
E0108 18:09:12.321519       1 run.go:74] "command failed" err="failed to initialize certificate reloader: error loading certificates: error loading certificate: open /etc/tls/private/tls.crt: no such file or directory"
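
When triaging a log like the one above, it can help to pull the `FLAG:` lines into a dictionary so the certificate paths the proxy expects are easy to read. The following is a minimal sketch only; `parse_flags` and `SAMPLE` are hypothetical names introduced here, and the sample lines are copied from the log above.

```python
import re

# Matches klog lines of the form:
#   I0108 18:09:12.320631  1 flags.go:64] FLAG: --tls-cert-file="/etc/tls/private/tls.crt"
FLAG_LINE = re.compile(r'FLAG: --(?P<name>[\w-]+)="(?P<value>.*)"\s*$')


def parse_flags(log_text: str) -> dict:
    """Collect every `FLAG: --name="value"` line into a {name: value} dict."""
    flags = {}
    for line in log_text.splitlines():
        match = FLAG_LINE.search(line)
        if match:
            flags[match.group("name")] = match.group("value")
    return flags


# Three lines taken from the kube-rbac-proxy log above (hypothetical variable name).
SAMPLE = """\
I0108 18:09:12.320631       1 flags.go:64] FLAG: --tls-cert-file="/etc/tls/private/tls.crt"
I0108 18:09:12.320661       1 flags.go:64] FLAG: --tls-private-key-file="/etc/tls/private/tls.key"
I0108 18:09:12.320735       1 kube-rbac-proxy.go:578] Reading config file: /etc/kube-rbac-proxy/config-file.yaml
"""

flags = parse_flags(SAMPLE)
# These are the files the container must find at startup.
print(flags["tls-cert-file"], flags["tls-private-key-file"])
```

The printed paths are the ones to check on the node; in this report the first of them is the file that the proxy could not open.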

When I describe the pod, I see:

[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc describe pod/powervs-cloud-controller-manager-6b6fbcc9db-9rhtj -n openshift-cloud-controller-manager
Name:                 powervs-cloud-controller-manager-6b6fbcc9db-9rhtj
Namespace:            openshift-cloud-controller-manager
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      cloud-controller-manager
Node:                 rdr-hamzy-test-wdc06-fs5m2-master-2/
Start Time:           Mon, 08 Jan 2024 11:57:45 -0600
Labels:               infrastructure.openshift.io/cloud-controller-manager=PowerVS
                      k8s-app=powervs-cloud-controller-manager
                      pod-template-hash=6b6fbcc9db
Annotations:          operator.openshift.io/config-hash: 09205e81b4dc20086c29ddbdd3fccc29a675be94b2779756a0e748dd9ba91e40
Status:               Running
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/powervs-cloud-controller-manager-6b6fbcc9db
Containers:
  cloud-controller-manager:
    Container ID:  cri-o://4365a326d05ecaac8e4114efabb4a46e01a308459ad30438d742b4829c24a717
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dd2cf78ddeed971d38731d27ce293501547b960cefc3aadaa220186eded8a09
    Image ID:      65401afa73528f9a425a9d7f5dee8a9de8d9d3d82c8fd84cd653b16409093836
    Port:          10258/TCP
    Host Port:     10258/TCP
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -o allexport
      if [[ -f /etc/kubernetes/apiserver-url.env ]]; then
        source /etc/kubernetes/apiserver-url.env
      fi
      exec /bin/ibm-cloud-controller-manager \
      --bind-address=$(POD_IP_ADDRESS) \
      --use-service-account-credentials=true \
      --configure-cloud-routes=false \
      --cloud-provider=ibm \
      --cloud-config=/etc/ibm/cloud.conf \
      --profiling=false \
      --leader-elect=true \
      --leader-elect-lease-duration=137s \
      --leader-elect-renew-deadline=107s \
      --leader-elect-retry-period=26s \
      --leader-elect-resource-namespace=openshift-cloud-controller-manager \
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_AES_128_GCM_SHA256,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_256_GCM_SHA384 \
      --v=2
      
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 08 Jan 2024 12:35:12 -0600
      Finished:     Mon, 08 Jan 2024 12:35:12 -0600
    Ready:          False
    Restart Count:  12
    Requests:
      cpu:     75m
      memory:  60Mi
    Liveness:  http-get https://:10258/healthz delay=300s timeout=160s period=10s #success=1 #failure=3
    Environment:
      POD_IP_ADDRESS:               (v1:status.podIP)
      VPCCTL_CLOUD_CONFIG:         /etc/ibm/cloud.conf
      ENABLE_VPC_PUBLIC_ENDPOINT:  true
    Mounts:
      /etc/ibm from cloud-conf (rw)
      /etc/kubernetes from host-etc-kube (ro)
      /etc/pki/ca-trust/extracted/pem from trusted-ca (ro)
      /etc/vpc from ibm-cloud-credentials (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z5xdm (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  trusted-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ccm-trusted-ca
    Optional:  false
  host-etc-kube:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes
    HostPathType:  Directory
  cloud-conf:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cloud-conf
    Optional:  false
  ibm-cloud-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ibm-cloud-credentials
    Optional:    false
  kube-api-access-z5xdm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              node-role.kubernetes.io/master=
Tolerations:                 node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.cloudprovider.kubernetes.io/uninitialized:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                             node.kubernetes.io/not-ready:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  38m                    default-scheduler  Successfully assigned openshift-cloud-controller-manager/powervs-cloud-controller-manager-6b6fbcc9db-9rhtj to rdr-hamzy-test-wdc06-fs5m2-master-2
  Normal   Pulling    38m                    kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dd2cf78ddeed971d38731d27ce293501547b960cefc3aadaa220186eded8a09"
  Normal   Pulled     37m                    kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dd2cf78ddeed971d38731d27ce293501547b960cefc3aadaa220186eded8a09" in 36.694s (36.694s including waiting)
  Normal   Started    36m (x4 over 37m)      kubelet            Started container cloud-controller-manager
  Normal   Created    35m (x5 over 37m)      kubelet            Created container cloud-controller-manager
  Normal   Pulled     35m (x4 over 37m)      kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dd2cf78ddeed971d38731d27ce293501547b960cefc3aadaa220186eded8a09" already present on machine
  Warning  BackOff    2m57s (x166 over 37m)  kubelet            Back-off restarting failed container cloud-controller-manager in pod powervs-cloud-controller-manager-6b6fbcc9db-9rhtj_openshift-cloud-controller-manager(bf58b824-b1a2-4d2e-8735-22723642a24a)

https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/322

Bug OCPBUGS-30620: Remove csi-operator legacy/ directory

AWS EBS, Azure Disk and Azure File operators are now built from cmd/ and pkg/, there is no code used from legacy/ dir and we should remove it.

There are still test manifests in legacy/ directory that are still used! They need to be moved somewhere else + Dockerfile.*.test and CI steps must be updated!

Technically, this is a copy of ~~STOR-1797~~, but we need a bug to be able to backport aws-ebs changes to 4.15 and not use legacy/ directory there too.

Bug OCPBUGS-30575: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/13659

Bug OCPBUGS-23115: Doc for Builds with Red Hat Subscriptions Missing Steps

Description of problem:


Documentation for using Red Hat subscriptions in builds is missing a few important steps, especially for customers that have not turned on the tech preview feature for the Shared Resource CSI driver.

The missing steps are:

1. Customer needs Simple Content Access import enabled in the Insights Operator: https://docs.openshift.com/container-platform/4.12/support/remote_health_monitoring/insights-operator-simple-access.html
2. Customer needs to copy the secret data from openshift-config-managed/etc-pki-entitlement to the workspace the build is running in. We should provide oc commands that a cluster admin/platform team can execute.
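
As a rough illustration of step 2, the sketch below rewrites a Secret manifest so that it can be created in another namespace. It assumes that "workspace" above means the namespace the build runs in, and that the manifest was exported with `oc get secret etc-pki-entitlement -n openshift-config-managed -o json`. The function name `retarget_secret` and the namespace `my-build-namespace` are hypothetical.

```python
import copy

# Metadata fields that belong to the source object and should not be copied.
SERVER_ASSIGNED_FIELDS = (
    "namespace",
    "resourceVersion",
    "uid",
    "creationTimestamp",
    "ownerReferences",
    "managedFields",
)


def retarget_secret(secret: dict, namespace: str) -> dict:
    """Return a copy of a Secret manifest that can be created in `namespace`."""
    out = copy.deepcopy(secret)  # leave the caller's manifest untouched
    metadata = out.setdefault("metadata", {})
    for field in SERVER_ASSIGNED_FIELDS:
        metadata.pop(field, None)
    metadata["namespace"] = namespace
    return out


# Trimmed example manifest; the data value is a placeholder, not real entitlement data.
exported = {
    "apiVersion": "v1",
    "kind": "Secret",
    "metadata": {
        "name": "etc-pki-entitlement",
        "namespace": "openshift-config-managed",
        "resourceVersion": "12345",
        "uid": "00000000-0000-0000-0000-000000000000",
    },
    "data": {"entitlement.pem": "cGxhY2Vob2xkZXI="},
}

copied = retarget_secret(exported, "my-build-namespace")
print(copied["metadata"])
```

The resulting manifest could then be serialized to JSON and piped to `oc create -f -`. The documentation fix itself may well prefer a pure `oc` pipeline; this only shows which fields need to change.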

For builds that are running in a network-restricted environment and access RHEL content through Satellite, the documentation must also provide instructions on how to obtain an `rhsm.conf` file for the Satellite instance and mount it into the build container.

Version-Release number of selected component (if applicable):


4.12

How reproducible:


Always

Steps to Reproduce:


Read the documentation for https://docs.openshift.com/container-platform/4.12/cicd/builds/running-entitled-builds.html#builds-create-imagestreamtag_running-entitled-builds and execute the commands as is.

Actual results:


The build probably won't run because the required secret has not been created.

Expected results:


Customers should be able to run a build that requires RHEL entitlements following the exact steps as described in the doc.

Additional info:


https://docs.openshift.com/container-platform/4.12/cicd/builds/running-entitled-builds.html

https://github.com/openshift/origin/pull/28560

Bug OCPBUGS-24635: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/2153

Bug OCPBUGS-27908: Workloads -> Deployments -> Deployment -> Details -> Volumes -> Remove volume : Translation missing

Description of problem:

    Navigation:
    Workloads -> Deployments -> (select any Deployment from list) -> Details -> Volumes -> Remove volume

    Issue:
    Message "Are you sure you want to remove volume audit-policies from Deployment: apiserver?" is in English.

    Observation:
    Translation is present in branch release-4.15 file...
    frontend/public/locales/ja/public.json

Version-Release number of selected component (if applicable):

    4.15.0-rc.3

How reproducible:

    Always

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

    Content is in English

Expected results:

    Content should be in selected language

Additional info:

    Reference screenshot attached.

https://github.com/openshift/console/pull/13550

Bug OCPBUGS-27879: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-aws/pull/78

Bug OCPBUGS-30135: OpenShiftSDN error should say "unsupported" rather than "deprecated"

Description of problem:

    Installer now errors when attempting to use networkType: OpenShiftSDN; but the message still says "deprecated".

Version-Release number of selected component (if applicable):

4.15+

How reproducible:

100%

Steps to Reproduce:

    1. Attempt to install 4.15+ with networkType: OpenShiftSDN
    2. Observe error in logs: time="2024-03-01T14:37:25Z" level=error msg="failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: networking.networkType: Invalid value: \"OpenShiftSDN\": networkType OpenShiftSDN is deprecated, please use OVNKubernetes"

Actual results:

Observe error in logs:

time="2024-03-01T14:37:25Z" level=error msg="failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: networking.networkType: Invalid value: \"OpenShiftSDN\": networkType OpenShiftSDN is deprecated, please use OVNKubernetes"

Expected results:

A message more like:

Observe error in logs: time="2024-03-01T14:37:25Z" level=error msg="failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: networking.networkType: Invalid value: \"OpenShiftSDN\": networkType OpenShiftSDN is not supported, please use OVNKubernetes"

Additional info:
See thread

https://github.com/openshift/installer/pull/8092

Bug OCPBUGS-22542: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-aws/pull/53

Bug OCPBUGS-24868: Update 4.16 ose-ovirt-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ovirt-csi-driver/pull/132

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ovirt-csi-driver/pull/132

Bug OCPBUGS-26046: ART-8361: Replace genisoimage with xorriso in 4.15

This is a duplicate of https://issues.redhat.com/browse/ART-8361. It was created here because `target` cannot be set on ART bugs.

https://github.com/openshift/cluster-api-provider-libvirt/pull/281

Task SPLAT-1209: [vsphere] log known datacenters in the problem detector when datacenter not found

https://github.com/openshift/vsphere-problem-detector/pull/154

Bug OCPBUGS-24531: Changes required for regression in mutable scope

Description of problem:
Based on the discussion in https://issues.redhat.com/browse/OCPBUGS-24044
and the discussion in this Slack thread (https://redhat-internal.slack.com/archives/CBWMXQJKD/p1700510945375019), we need to update our CI and some of the work done for mutable scope in ~~NE-621~~.

Specifically, we need to

modify TestScopeChange and TestUnmanagedDNSToManagedDNSInternalIngressController to delete the service on all platforms, as toggling scope is no longer recommended.
modify any special behavior added for platformsWithMutableScope.

Version-Release number of selected component (if applicable):

4.15

How reproducible:

100%

Steps to Reproduce:

  1. Run CI TestUnmanagedDNSToManagedDNSInternalIngressController
  2. Observe failure in unmanaged-migrated-internal

Actual results:

CI tests fail.

Expected results:

CI tests shouldn't fail.

Additional info:

This is a change from past behavior, as reported in https://issues.redhat.com/browse/OCPBUGS-24044. Further discussion revealed that the new behavior is currently expected but could be restored in the future. Notes to SRE and release notes are needed for this change to behavior.

Bug OCPBUGS-27376: Domain name starting with numeric character is invalid for Assisted Installer

Description of problem:

The customer is trying to create an RHOCP cluster with the domain name 123mydomain.com.

In the Red Hat Hybrid Cloud Console, the customer gets the following error:
~~~
Failed to update the cluster
DNS format mismatch: 123mydomain.com domain name is not valid
~~~

*** According to the regex check described in KCS https://access.redhat.com/solutions/5517531, a domain name starting with a numeric character is valid, e.g. 123mydomain.com.

The regex used to check domain name validity is:
 [a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*

*** The validations that the Assisted Installer performs are defined in https://github.com/openshift/assisted-service/blob/master/pkg/validations/validations.go

The following regexes are applied:
baseDomainRegex          = `^[a-z]([\-]*[a-z\d]+)+$`
dnsNameRegex             = `^([a-z]([\-]*[a-z\d]+)*\.)+[a-z\d]+([\-]*[a-z\d]+)+$`
wildCardDomainRegex      = `^(validateNoWildcardDNS\.).+\.?$`
hostnameRegex            = `^[a-z0-9][a-z0-9\-\.]{0,61}[a-z0-9]$`
installerArgsValuesRegex = `^[A-Za-z0-9@!#$%*()_+-=//.,";':{}\[\]]+$`
  
This means the domain name must start with a letter [a-z].
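
The mismatch between the two checks can be reproduced with a few lines of Python. This is only a sketch: both patterns are copied from the text above, the constant and function names are hypothetical, and `re.ASCII` is set so that `\d` means `[0-9]`, as it does in Go.

```python
import re

# Regex quoted from the KCS article above; it must match the whole string.
KCS_REGEX = r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"

# dnsNameRegex as quoted from assisted-service's validations.go above.
ASSISTED_DNS_NAME_REGEX = r"^([a-z]([\-]*[a-z\d]+)*\.)+[a-z\d]+([\-]*[a-z\d]+)+$"


def valid_per_kcs(domain: str) -> bool:
    """True if the whole domain matches the regex from the KCS article."""
    return re.fullmatch(KCS_REGEX, domain, flags=re.ASCII) is not None


def valid_per_assisted(domain: str) -> bool:
    """True if the domain matches the Assisted Installer dnsNameRegex."""
    return re.match(ASSISTED_DNS_NAME_REGEX, domain, flags=re.ASCII) is not None


for domain in ("123mydomain.com", "mydomain.com"):
    print(domain, valid_per_kcs(domain), valid_per_assisted(domain))
```

For `123mydomain.com` the first check accepts the name and the second rejects it, because every label before the final one must begin with `[a-z]` in `dnsNameRegex`.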

Version-Release number of selected component (if applicable):

How reproducible:

100%

Steps to Reproduce:

1. Open Red Hat Hybrid Cloud Console
2. Go to Clusters
3. Create Cluster
4. Go to Datacenter 
5. Under Assisted Installer -> Create Cluster
6. Enter Cluster Name mytestcluster and enter Domain Name 123mydomain.com
7. Click on Next

Actual results:

A domain name that starts with digits followed by letters, e.g. 123mydomain.com, is reported as invalid by the Assisted Installer in the Red Hat Hybrid Cloud Console, with the error:
Failed to create new cluster
DNS format mismatch: 123mydomain.com domain name is not valid

Expected results:

A domain name that starts with digits followed by letters, e.g. 123mydomain.com, should be accepted as valid by the Assisted Installer in the Red Hat Hybrid Cloud Console.

https://github.com/openshift/assisted-service/pull/5914

Bug OCPBUGS-29355: Output image url link leads to 404 for Shipwright Builds

When clicking on the output image link on a Shipwright BuildRun details page, the link leads to the imagestream details page but shows 404 error.

The image link is:

https://console-openshift-console.apps...openshiftapps.com/k8s/ns/buildah-example/imagestreams/sample-kotlin-spring%3A1.0-shipwright

The BuildRun spec

apiVersion: shipwright.io/v1beta1
kind: BuildRun
metadata: 
  generateName: sample-spring-kotlin-build-
  name: sample-spring-kotlin-build-xh2dq
  namespace: buildah-example
  labels: 
    build.shipwright.io/generation: '2'
    build.shipwright.io/name: sample-spring-kotlin-build
spec: 
  build: 
    name: sample-spring-kotlin-build
status: 
  buildSpec: 
    output: 
      image: 'image-registry.openshift-image-registry.svc:5000/buildah-example/sample-kotlin-spring:1.0-shipwright'
    paramValues: 
      - name: run-image
        value: 'paketocommunity/run-ubi-base:latest'
      - name: cnb-builder-image
        value: 'paketobuildpacks/builder-jammy-tiny:0.0.176'
      - name: app-image
        value: 'image-registry.openshift-image-registry.svc:5000/buildah-example/sample-kotlin-spring:1.0-shipwright'
    source: 
      git: 
        url: 'https://github.com/piomin/sample-spring-kotlin-microservice.git'
      type: Git
    strategy: 
      kind: ClusterBuildStrategy
      name: buildpacks
  completionTime: '2024-02-12T12:15:03Z'
  conditions: 
    - lastTransitionTime: '2024-02-12T12:15:03Z'
      message: All Steps have completed executing
      reason: Succeeded
      status: 'True'
      type: Succeeded
  output: 
    digest: 'sha256:dc3d44bd4d43445099ab92bbfafc43d37e19cfaf1cac48ae91dca2f4ec37534e'
  source: 
    git: 
      branchName: master
      commitAuthor: Piotr Mińkowski
      commitSha: aeb03d60a104161d6fd080267bf25c89c7067f61
  startTime: '2024-02-12T12:13:21Z'
  taskRunName: sample-spring-kotlin-build-xh2dq-j47ql
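
One way to read the 404: the link in the report uses the whole `name:tag` string as the ImageStream name. The following is a minimal sketch, assuming the details page should be addressed by the ImageStream name alone (an assumption, not confirmed by the report). `split_image_ref` and `imagestream_path` are hypothetical helpers, and the path layout is taken from the URL shown above.

```python
from urllib.parse import quote


def split_image_ref(image: str):
    """Split 'registry/namespace/name:tag' into (namespace, name, tag)."""
    # The first path segment is the registry host; it may contain a port colon.
    _registry, _, rest = image.partition("/")
    namespace, _, name_tag = rest.partition("/")
    name, _, tag = name_tag.partition(":")
    return namespace, name, tag


def imagestream_path(image: str) -> str:
    """Build a console path that points at the ImageStream, without the tag."""
    namespace, name, _tag = split_image_ref(image)
    return f"/k8s/ns/{quote(namespace)}/imagestreams/{quote(name)}"


image = "image-registry.openshift-image-registry.svc:5000/buildah-example/sample-kotlin-spring:1.0-shipwright"
print(imagestream_path(image))
# Encoding name and tag together reproduces the last segment of the failing link.
print(quote("sample-kotlin-spring:1.0-shipwright"))
```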

https://github.com/openshift/console/pull/13602

Bug OCPBUGS-30303: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-kubevirt/pull/35

Bug OCPBUGS-25534: Update 4.16 csi-livenessprobe-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/58

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-livenessprobe/pull/58

Bug OCPBUGS-27932: Update 4.16 ose-apiserver-network-proxy-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/apiserver-network-proxy/pull/47

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/apiserver-network-proxy/pull/47

Bug OCPBUGS-29723: Hypershift CLI requires security group id when creating a nodepool

Description of problem:

 Hypershift CLI requires security group id when creating a nodepool, otherwise it fails to create a nodepool.

jiezhao-mac:hypershift jiezhao$ ./bin/hypershift create nodepool aws --name=test --cluster-name=jie-test --node-count=3
2024-02-20T11:29:19-05:00 ERROR Failed to create nodepool {"error": "security group ID was not specified and cannot be determined from default nodepool"}
github.com/openshift/hypershift/cmd/nodepool/core.(*CreateNodePoolOptions).CreateRunFunc.func1
    /Users/jiezhao/hypershift-test/hypershift/cmd/nodepool/core/create.go:39
github.com/spf13/cobra.(*Command).execute
    /Users/jiezhao/hypershift-test/hypershift/vendor/github.com/spf13/cobra/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
    /Users/jiezhao/hypershift-test/hypershift/vendor/github.com/spf13/cobra/command.go:1115
github.com/spf13/cobra.(*Command).Execute
    /Users/jiezhao/hypershift-test/hypershift/vendor/github.com/spf13/cobra/command.go:1039
github.com/spf13/cobra.(*Command).ExecuteContext
    /Users/jiezhao/hypershift-test/hypershift/vendor/github.com/spf13/cobra/command.go:1032
main.main
    /Users/jiezhao/hypershift-test/hypershift/main.go:78
runtime.main
    /usr/local/Cellar/go/1.20.4/libexec/src/runtime/proc.go:250
Error: security group ID was not specified and cannot be determined from default nodepool
security group ID was not specified and cannot be determined from default nodepool
jiezhao-mac:hypershift jiezhao$

Version-Release number of selected component (if applicable):

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

    nodepool creation should succeed without security group specified in hypershift cli

Additional info:

https://github.com/openshift/hypershift/pull/3614

Bug MGMT-17106: Network disconnection due to vlan interface creation

Description of the problem:

In a deployment with bonding and VLAN, during the booting of the provision image, the system loses connectivity (even ping) as soon as two new network interfaces appear on the node. These interfaces are created by the ironic-python-agent; they are the bond slave interfaces with the VLAN added.

Version-Release number of selected component (if applicable):

4.12.48

How reproducible:

Always

Steps to Reproduce:

1. Deploy a cluster with bonding + vlan
2.
3.

Actual results:

After investigation by the OpenStack team, it looks like having the option "enable_vlan_interfaces = all" enabled in "/etc/ironic-python-agent.conf" is what triggers the creation of the VLAN interfaces. These new interfaces are what cut the communication.
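
To check whether a given agent configuration carries the option named in the investigation, something like the following sketch could be used. It assumes the file is INI-style and that the option sits in the `[DEFAULT]` section; both are assumptions for this example, and `vlan_interfaces_setting` is a hypothetical helper.

```python
import configparser


def vlan_interfaces_setting(conf_text: str) -> str:
    """Return the value of enable_vlan_interfaces from [DEFAULT], or '' if unset."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.defaults().get("enable_vlan_interfaces", "")


# Sample content mirroring the setting quoted in the report.
sample = "[DEFAULT]\nenable_vlan_interfaces = all\n"
print(vlan_interfaces_setting(sample))
```

On a real node the text would come from reading /etc/ironic-python-agent.conf instead of the inline sample.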

Expected results:

No extra vlan interfaces created, communication is not lost and installation succeeds.

Additional info:

How customer crafted the test:

As soon as the node started responding to ping, the customer connected over ssh and set a password for the core user.
Once communication is lost (~1 min after the node started responding to ping), the customer connects through the KVM interface with the core password.
If we disable the ironic-python-agent and manually remove the vlan interfaces it created, the communication is restored. Installation works if LLDP is turned off at the switch.

This issue was supposed to be fixed in these versions, according to the original JIRA which I have linked here.

Team lead from that JIRA suggested the issue has to be fixed by re-vendoring ICC in the assisted-service, hence this JIRA creation.

https://github.com/openshift/assisted-service/pull/6055

Bug OCPBUGS-28936: ART requests updates to 4.16 image ose-vmware-vsphere-csi-driver-operator-container

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/222

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/222

Bug OCPBUGS-29479: excessive Back-off restarting failed container console

Component Readiness has found a potential regression in [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers.

Probability of significant regression: 100.00%

Sample (being evaluated) Release: 4.15
Start Time: 2024-02-08T00:00:00Z
End Time: 2024-02-14T23:59:59Z
Success Rate: 91.30%
Successes: 63
Failures: 6
Flakes: 0

Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 100.00%
Successes: 735
Failures: 0
Flakes: 0
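
The success rates above follow directly from the counts. The report does not say how the regression probability is derived; the sketch below assumes a one-sided Fisher's exact test purely for illustration, and `success_rate` and `fisher_one_sided` are hypothetical helper names.

```python
from math import comb


def success_rate(successes: int, failures: int) -> float:
    """Success rate in percent."""
    return 100.0 * successes / (successes + failures)


def fisher_one_sided(sample_fail: int, sample_total: int, base_fail: int, base_total: int) -> float:
    """P(at least sample_fail of all failures fall in the sample) under the hypergeometric null."""
    total = sample_total + base_total
    fails = sample_fail + base_fail
    numerator = 0
    for k in range(sample_fail, min(fails, sample_total) + 1):
        numerator += comb(fails, k) * comb(total - fails, sample_total - k)
    return numerator / comb(total, sample_total)


# Sample: 63 successes, 6 failures. Base: 735 successes, 0 failures.
print(round(success_rate(63, 6), 2))
print(fisher_one_sided(6, 69, 0, 735))
```

A very small p-value here means that six failures in 69 runs would be highly unlikely if the sample and base releases shared the same failure rate.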

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2023-10-31%2023%3A59%3A59&baseRelease=4.14&baseStartTime=2023-10-04%2000%3A00%3A00&capability=Other&component=Unknown&confidence=95&environment=ovn%20upgrade-micro%20amd64%20azure%20standard&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&network=ovn&pity=5&platform=azure&platform=azure&sampleEndTime=2024-02-14%2023%3A59%3A59&sampleRelease=4.15&sampleStartTime=2024-02-08%2000%3A00%3A00&testId=openshift-tests-upgrade%3A37f1600d4f8d75c47fc5f575025068d2&testName=%5Bsig-cluster-lifecycle%5D%20pathological%20event%20should%20not%20see%20excessive%20Back-off%20restarting%20failed%20containers&upgrade=upgrade-micro&upgrade=upgrade-micro&variant=standard&variant=standard

Note: When you look at the link above you will notice some of the failures mention the bare metal operator. That's being investigated as part of https://issues.redhat.com/browse/OCPBUGS-27760. There have been 3 cases in the last week where the console was in a fail loop. Here's an example:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1757637415561859072

We need help understanding why this is happening and what needs to be done to avoid it.

https://github.com/openshift/console-operator/pull/869

Bug OCPBUGS-25055: no detail log on signature verification failure

Description of problem:

    No detailed failure is logged when signature verification of the target release payload fails during an upgrade. It is unclear to the user which action could be taken for the failure, for example checking whether a wrong configmap is set, whether the default store is unavailable, or whether there is an issue with a custom store.
 
# ./oc adm upgrade
Cluster version is 4.15.0-0.nightly-2023-12-08-202155
Upgradeable=False  

  Reason: FeatureGates_RestrictedFeatureGates_TechPreviewNoUpgrade
  Message: Cluster operator config-operator should not be upgraded between minor versions: FeatureGatesUpgradeable: "TechPreviewNoUpgrade" does not allow updates

ReleaseAccepted=False  
  Reason: RetrievePayload
  Message: Retrieving payload failed version="4.15.0-0.nightly-2023-12-09-012410" image="registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7" failure=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat

Upstream: https://amd64.ocp.releases.ci.openshift.org/graph
Channel: stable-4.15
Recommended updates:  
  VERSION                            IMAGE
  4.15.0-0.nightly-2023-12-09-012410 registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7
 
# ./oc -n openshift-cluster-version logs cluster-version-operator-6b7b5ff598-vxjrq|grep "verified"|tail -n4
I1211 09:28:22.755834       1 sync_worker.go:434] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat
I1211 09:28:22.755974       1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' Retrieving payload failed version="4.15.0-0.nightly-2023-12-09-012410" image="registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7" failure=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat
I1211 09:28:37.817102       1 sync_worker.go:434] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat
I1211 09:28:37.817488       1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' Retrieving payload failed version="4.15.0-0.nightly-2023-12-09-012410" image="registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7" failure=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2023-12-08-202155

How reproducible:

    always

Steps to Reproduce:

    1. trigger a fresh installation with tp enabled (no spec.signaturestores property set by default)

    2. trigger an upgrade against a nightly build (no signature available in default signature store)

    3.

Actual results:

    no detail log on signature verification failure

Expected results:

    include detailed failure information on signature verification in the CVO log

Additional info:

    https://github.com/openshift/cluster-version-operator/pull/1003

https://github.com/openshift/cluster-version-operator/pull/1003

Bug OCPBUGS-25589: Update 4.16 ose-azure-cloud-node-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/101

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-provider-azure/pull/101

Bug OCPBUGS-27310: Source column header not displayed in PVC > VolumeSnapshots tab

Description of problem:

    Missing Source column header in PVC > VolumeSnapshots tab

Version-Release number of selected component (if applicable):

    Cluster 4.10, 4.14, 4.16

How reproducible:

Yes

Steps to Reproduce:

    1. Create a PVC i.e. "my-pvc"
    2. Create a Pod and bind it to the "my-pvc"
    3. Create a VolumeSnapshots and associate it with the "my-pvc"
    4. Go to PVC detail > VolumeSnapshots tab

Actual results:

    The Source column header is not displayed

Expected results:

    The Source column header should be displayed

Additional info:

https://github.com/openshift/console/pull/13565

Bug OCPBUGS-29971: ART requests updates to 4.16 image csi-provisioner-container

Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/90

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-provisioner/pull/93

Bug OCPBUGS-23862: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/100

Bug OCPBUGS-27825: Incompatible upstream manifests

The upstream project reorganized the config directory and we need to adapt it for downstream. Until then, upstream->downstream syncing is blocked.

https://github.com/openshift/baremetal-operator/pull/329

Story HOSTEDCP-1488: Use AWS Regionalized STS Endpoints

User Story:

As a ROSA customer, I want to enforce that my workloads follow AWS best-practices by using AWS Regionalized STS Endpoints instead of the global one.

As Red Hat, I would like to follow AWS best-practices by using AWS Regionalized STS Endpoints instead of the global one.

Per AWS docs:

https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_enable-regions.html

AWS recommends using Regional AWS STS endpoints instead of the global endpoint to reduce latency, build in redundancy, and increase session token validity.

https://docs.aws.amazon.com/sdkref/latest/guide/feature-sts-regionalized-endpoints.html

All new SDK major versions releasing after July 2022 will default to regional. New SDK major versions might remove this setting and use regional behavior. To reduce future impact regarding this change, we recommend you start using regional in your application when possible.

Acceptance Criteria:

Areas where HyperShift creates STS credentials use regionalized STS endpoints, e.g. https://github.com/openshift/hypershift/blob/ae1caa00ff3a2c2bfc1129f0168efc1e786d1d12/control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go#L1225-L1228
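
For reference, a regional STS endpoint differs from the global one only in the host name. A minimal sketch, assuming the standard AWS partition (other partitions, such as the China regions, use a different domain suffix and are not handled here); `regional_sts_endpoint` is a hypothetical helper and not HyperShift code.

```python
GLOBAL_STS_ENDPOINT = "https://sts.amazonaws.com"


def regional_sts_endpoint(region: str) -> str:
    """Return the regional STS endpoint URL for a region in the standard AWS partition."""
    if not region:
        raise ValueError("region is required for a regional STS endpoint")
    return f"https://sts.{region}.amazonaws.com"


print(regional_sts_endpoint("us-east-1"))
```

SDKs that still default to the global endpoint can typically be switched with the `AWS_STS_REGIONAL_ENDPOINTS=regional` setting described in the AWS SDK reference linked above.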

Engineering Details:

https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_enable-regions.html
https://docs.aws.amazon.com/sdkref/latest/guide/feature-sts-regionalized-endpoints.html
This was done in ROSA Classic via ~~SDA-7313~~

https://github.com/openshift/hypershift/pull/3742

Bug OCPBUGS-23199: Whereabouts reconciler errors with "IPPool not found" on pod deletion although the IPPool exists

Description of problem:

During a pod deletion, the whereabouts reconciler correctly detects the pod deletion but errors out, claiming that the IPPool is not found. However, when checking the audit logs, we can see no deletion and no re-creation, and we can even see successful "patch" and "get" requests to the same IPPool. This means that the IPPool was never deleted and was properly accessible at the time of the issue, so it looks like the reconciler made a mistake while retrieving the IPPool.

Version-Release number of selected component (if applicable):

4.12.22

How reproducible:

Sometimes

Steps to Reproduce:

    1.Delete pod
    2.
    3.

Actual results:

Error in whereabouts reconciler. New pods using additional networks with the whereabouts IPAM plugin cannot have IPs allocated due to wrong cleanup.

Expected results:

Additional info:

https://github.com/openshift/cluster-network-operator/pull/2160

Bug OCPBUGS-25612: Logs for PipelineRuns fetched from the Tekton Results API is not loading

Description of problem:

    Logs for PipelineRuns fetched from the Tekton Results API are not loading

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1. Navigate to the Log tab of PipelineRun fetched from the Tekton Results
    2.
    3.

Actual results:

    Logs window is empty with a loading indicator

Expected results:

Logs should be shown

Additional info:

https://github.com/openshift/console/pull/13455

Bug OCPBUGS-26541: cleanup cluster-config-operator image

Description of problem:

    manifests are duplicated with cluster-config-api image

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

Bug OCPBUGS-29417: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-etcd-operator/pull/1201

Bug OCPBUGS-29676: namespace "openshift-cluster-api" not found in CustomNoUpgrade

Description of problem:

    capi-based installer failing with missing openshift-cluster-api namespace

Version-Release number of selected component (if applicable):

How reproducible:

Always in CustomNoUpgrade

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

    Install failure

Expected results:

    The namespace is created and the install succeeds, or the install does not error on a missing namespace

Additional info:

https://github.com/openshift/cluster-capi-operator/pull/163

Bug OCPBUGS-26952: no ipsec on cluster post NS mc's deletion during ipsecConfig mode `Full`

Description of problem:

 No IPsec on the cluster after the NS MCs are deleted while ipsecConfig mode is `Full`, on a cluster upgraded from 4.14 to a 4.15 build

Version-Release number of selected component (if applicable):

 bot build on https://github.com/openshift/cluster-network-operator/pull/2191

How reproducible:

    Always

Steps to Reproduce:

1. Start with an EW+NS cluster (4.14) and upgrade it to the above bot build to check ipsecConfig modes
2. ipsecConfig mode changed to Full
3. Deleted NS MCs 
4. new MCs spawned up as `80-ipsec-master-extensions` and `80-ipsec-worker-extensions`
5. cluster settled with no ipsec at all (no ovn-ipsec-host ds)
6. mode still Full

Actual results:

Mode Full actually behaved like the Disabled state after the above steps

Expected results:

Just NS IPsec should have gone away. EW should have persisted

Additional info:

https://github.com/openshift/cluster-network-operator/pull/2201

Bug OCPBUGS-26232: HCP fails to deploy with TechPreviewNoUpgrade feature set

spec:
  configuration:
    featureGate:
      featureSet: TechPreviewNoUpgrade

$ oc get pod
NAME                                      READY   STATUS             RESTARTS      AGE
capi-provider-bd4858c47-sf5d5             0/2     Init:0/1           0             9m33s
cluster-api-85f69c8484-5n9ql              1/1     Running            0             9m33s
control-plane-operator-78c9478584-xnjmd   2/2     Running            0             9m33s
etcd-0                                    3/3     Running            0             9m10s
kube-apiserver-55bb575754-g4694           4/5     CrashLoopBackOff   6 (81s ago)   8m30s

$ oc logs kube-apiserver-55bb575754-g4694 -c kube-apiserver --tail=5
E0105 16:49:54.411837       1 controller.go:145] while syncing ConfigMap "kube-system/kube-apiserver-legacy-service-account-token-tracking", err: namespaces "kube-system" not found
I0105 16:49:54.415074       1 trace.go:236] Trace[236726897]: "Create" accept:application/vnd.kubernetes.protobuf, */*,audit-id:71496035-d1fe-4ee1-bc12-3b24022ea39c,client:::1,api-group:scheduling.k8s.io,api-version:v1,name:,subresource:,namespace:,protocol:HTTP/2.0,resource:priorityclasses,scope:resource,url:/apis/scheduling.k8s.io/v1/priorityclasses,user-agent:kube-apiserver/v1.29.0 (linux/amd64) kubernetes/9368fcd,verb:POST (05-Jan-2024 16:49:44.413) (total time: 10001ms):
Trace[236726897]: ---"Write to database call failed" len:174,err:priorityclasses.scheduling.k8s.io "system-node-critical" is forbidden: not yet ready to handle request 10001ms (16:49:54.415)
Trace[236726897]: [10.001615835s] [10.001615835s] END
F0105 16:49:54.415382       1 hooks.go:203] PostStartHook "scheduling/bootstrap-system-priority-classes" failed: unable to add default system priority classes: priorityclasses.scheduling.k8s.io "system-node-critical" is forbidden: not yet ready to handle request

https://github.com/openshift/hypershift/pull/3377

Bug OCPBUGS-29972: ART requests updates to 4.16 image csi-livenessprobe-container

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/59

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-livenessprobe/pull/62

Task ACM-8917: Add agent options to HyperShift CLI nodepool create command

There is currently no way to specify AgentLabelSelector when creating a nodepool via the hypershift CLI

https://github.com/openshift/hypershift/pull/3285

Bug OCPBUGS-25207: machine-config degraded with error "rendered-master-${hash}" not found

Description of problem:

    The cluster operator "machine-config" is degraded because the MachineConfigPool master is not ready, which reports an error like "rendered-master-${hash} not found".

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2023-12-11-033133

How reproducible:

    Always. We met the issue in 2 CI profiles, Flexy template "functionality-testing/aos-4_15/upi-on-gcp/versioned-installer-ovn-ipsec-ew-ns-ci", and PROW CI test "periodic-ci-openshift-openshift-tests-private-release-4.15-multi-ec-gcp-ipi-disc-priv-oidc-arm-mixarch-f14".

Steps to Reproduce:

The Flexy template brief steps:
1. "create install-config" and then "create manifests"
2. add manifest file to config ovnkubernetes network for IPSec (please refer to https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blame/master/functionality-testing/aos-4_15/hosts/create_ign_files.sh#L517-530)
3. (optional) "create ignition-config"
4. UPI installation steps (see OCP doc https://docs.openshift.com/container-platform/4.14/installing/installing_gcp/installing-gcp-user-infra.html#installation-gcp-user-infra-exporting-common-variables)

Actual results:

    Installation failed, with the machine-config operator degraded

Expected results:

    Installation should succeed.

Additional info:

    The must-gather is at https://drive.google.com/file/d/12xbjWUknDL_DRNSS8T_Z3u4d1KrNlJgT/view?usp=drive_link

https://github.com/openshift/cluster-network-operator/pull/2187

Bug OCPBUGS-27930: Update 4.16 ose-cluster-kube-storage-version-migrator-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/102

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/103

Bug OCPBUGS-30489: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-kube-scheduler-operator/pull/538

Bug OCPBUGS-27792: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-24846: Update 4.16 ose-csi-external-resizer-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/152

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-resizer/pull/152

Bug OCPBUGS-29586: Control Plane Kube Apiserver Service Port should remain as 2040 for IBM Cloud Provider

Description of problem:

    A recent [PR](https://github.com/openshift/hypershift/commit/c030ab66d897815e16d15c987456deab8d0d6da0) updated the kube-apiserver service port to `6443`. That change causes a small outage when upgrading from a 4.13 cluster in IBMCloud. We need to keep the service port as 2040 for IBM Cloud Provider to avoid the outage.
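
The requested behaviour amounts to a per-platform choice of service port. A minimal sketch of that rule, assuming the platform is identified by the string "IBMCloud" (an assumption for this example); `kas_service_port` is a hypothetical helper and not the HyperShift implementation.

```python
IBM_CLOUD_KAS_SERVICE_PORT = 2040  # port to keep for IBM Cloud, per the report
DEFAULT_KAS_SERVICE_PORT = 6443    # port introduced by the referenced change


def kas_service_port(platform: str) -> int:
    """Pick the kube-apiserver service port for the given platform."""
    if platform == "IBMCloud":
        return IBM_CLOUD_KAS_SERVICE_PORT
    return DEFAULT_KAS_SERVICE_PORT


print(kas_service_port("IBMCloud"), kas_service_port("AWS"))
```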

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/hypershift/pull/3569

Bug OCPBUGS-23306: Avoid eviction of CSI driver daemonsets pods from the cluster-autoscaler

By default, the cluster-autoscaler evicts all pods, including those coming from daemon sets.
In the case of EFS CSI drivers, whose volumes are mounted as NFS volumes, this causes stale NFS mounts, and application workloads are not terminated gracefully.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

- While scaling down a node from the cluster-autoscaler-operator, the DS pods are being evicted.

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

CSI pods should not be evicted by the cluster autoscaler (at least prior to workload termination), as eviction might produce data corruption

Additional info:

It is possible to disable CSI pod eviction by adding the following annotation to the CSI driver pod:
cluster-autoscaler.kubernetes.io/enable-ds-eviction: "false"
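
One way to get that annotation onto the driver pods is to set it in the DaemonSet pod template. The sketch below only builds such a merge-patch body; patching the DaemonSet directly is an assumption for this example (an operator that manages the DaemonSet may revert manual changes), and `ds_eviction_patch` is a hypothetical helper.

```python
import json

ANNOTATION = "cluster-autoscaler.kubernetes.io/enable-ds-eviction"


def ds_eviction_patch(enabled: bool = False) -> dict:
    """Build a merge patch that sets the eviction annotation on a DaemonSet pod template."""
    value = "true" if enabled else "false"
    return {"spec": {"template": {"metadata": {"annotations": {ANNOTATION: value}}}}}


print(json.dumps(ds_eviction_patch(), indent=2))
```

The annotation value is a string, so the patch carries "false" rather than a JSON boolean.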

Bug OCPBUGS-25513: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ibm-powervs-block-csi-driver/pull/73

Bug OCPBUGS-29594: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-node-tuning-operator/pull/957

Bug OCPBUGS-18115: PrometheusOperatorRejectedResources alert fires on Hypershift clusters with user-defined monitoring

Description of problem:

After enabling user-defined monitoring on a HyperShift hosted cluster, PrometheusOperatorRejectedResources starts firing.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Start a hypershift-hosted cluster with cluster-bot
2. Enable user-defined monitoring
3.

Actual results:

PrometheusOperatorRejectedResources alert starts firing

Expected results:

No alert firing

Additional info:

Need to reach out to the HyperShift folks as the fix should probably be in their code base.

Bug OCPBUGS-24984: Update 4.16 ose-csi-driver-shared-resource-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/159

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-driver-shared-resource/pull/159

Bug OCPBUGS-26181: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-aws/pull/495

Bug OCPBUGS-29858: origin needs workaround for ROSA's infra labels

The convention is a format like node-role.kubernetes.io/role: "", not node-role.kubernetes.io: role, however ROSA uses the latter format to indicate the infra role. This changes the node watch code to ignore it, as well as other potential variations like node-role.kubernetes.io/.

The current code panics when run against a ROSA cluster:
E0209 18:10:55.533265 78 runtime.go:79] Observed a panic: runtime.boundsError{x:24, y:23, signed:true, code:0x3} (runtime error: slice bounds out of range [24:23])
goroutine 233 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x7a71840?, 0xc0018e2f48})
k8s.io/apimachinery@v0.27.2/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x1000251f9fe?})
k8s.io/apimachinery@v0.27.2/pkg/util/runtime/runtime.go:49 +0x75
panic({0x7a71840, 0xc0018e2f48})
runtime/panic.go:884 +0x213
github.com/openshift/origin/pkg/monitortests/node/watchnodes.nodeRoles(0x7ecd7b3?)
github.com/openshift/origin/pkg/monitortests/node/watchnodes/node.go:187 +0x1e5
github.com/openshift/origin/pkg/monitortests/node/watchnodes.startNodeMonitoring.func1(0
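
The label handling described above can be sketched as follows. This is a minimal Python illustration, not the origin code (which is written in Go); the function name node_roles and the sample labels are assumptions made for the example. It takes roles only from keys of the conventional form node-role.kubernetes.io/<role> and skips the ROSA-style bare key and the empty-role variant instead of slicing past the end of the key.

```python
# Hypothetical helper: extract node roles from a node's label map.
NODE_ROLE_PREFIX = "node-role.kubernetes.io/"


def node_roles(labels):
    """Return the sorted roles found in keys like node-role.kubernetes.io/<role>."""
    roles = []
    for key in labels:
        # The ROSA-style key "node-role.kubernetes.io" has no trailing slash,
        # so it does not start with the prefix and is ignored here.
        if not key.startswith(NODE_ROLE_PREFIX):
            continue
        role = key[len(NODE_ROLE_PREFIX):]
        # "node-role.kubernetes.io/" carries an empty role; skip it as well.
        if role:
            roles.append(role)
    return sorted(roles)


example_labels = {
    "node-role.kubernetes.io/worker": "",
    "node-role.kubernetes.io": "infra",   # ROSA-style infra label
    "node-role.kubernetes.io/": "",       # degenerate variation
    "kubernetes.io/hostname": "node-a",
}
print(node_roles(example_labels))
```

Checking the prefix (including the slash) before slicing is what avoids the out-of-range slice seen in the panic above.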

https://github.com/openshift/origin/pull/28585

Task OU-311: monitoring-plugin: Add Dockerfile for local testing

We should add a Dockerfile that is optimized for running the tests locally. (The current Dockerfile assumes it is running with the CI setup.)

https://github.com/openshift/monitoring-plugin/pull/91

Bug OCPBUGS-14496: TechPreviewNoUpgrade alert should only trigger on the TechPreviewNoUpgrade feature set

Description of problem:

For years, the TechPreviewNoUpgrade alert has used:

cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0

But recently, while testing 4.12.19, I saw the alert pending with name="LatencySensitive". Alerting on that not-useful-since-4.6 feature set is fine (OCPBUGS-14497), but TechPreviewNoUpgrade isn't a great name when the actual feature set is LatencySensitive. And the summary and description don't apply to LatencySensitive either.
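
To see why the expression also selects LatencySensitive, the label matchers can be modelled in a few lines. This is a simplified Python sketch, not PromQL evaluation: only the = and != matchers are modelled, the == 0 value comparison is left out, and the sample series are made up. The narrowed matcher is one possible fix shown for comparison; it is an assumption, not necessarily what the linked PR does.

```python
# Simplified model of Prometheus label matching (only "=" and "!=").
NAMESPACE = "openshift-kube-apiserver-operator"


def matches(labels, matchers):
    """Return True if every (label, op, value) matcher holds for the series labels."""
    for label, op, value in matchers:
        actual = labels.get(label, "")  # a missing label behaves like an empty string
        if op == "=" and actual != value:
            return False
        if op == "!=" and actual == value:
            return False
    return True


# Matchers from the alert expression quoted above.
current = [("name", "!=", ""), ("namespace", "=", NAMESPACE)]
# Hypothetical narrowing to the TechPreviewNoUpgrade feature set only.
narrowed = [("name", "=", "TechPreviewNoUpgrade"), ("namespace", "=", NAMESPACE)]

# Made-up sample series, one per feature set.
series = [
    {"name": "", "namespace": NAMESPACE},
    {"name": "LatencySensitive", "namespace": NAMESPACE},
    {"name": "TechPreviewNoUpgrade", "namespace": NAMESPACE},
]


def selected(matchers):
    """Names of the sample series selected by the given matchers."""
    return [s["name"] for s in series if matches(s, matchers)]


print(selected(current))
print(selected(narrowed))
```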

Version-Release number of selected component (if applicable):

The buggy expr / alertname pair shipped in 4.3.0.

How reproducible:

All the time.

Steps to Reproduce:

1. Install a cluster like 4.12.19.
2. Set the LatencySensitive feature set:

$ oc patch featuregate cluster --type=json --patch='[{"op":"add","path":"/spec/featureSet","value":"LatencySensitive"}]'

3. Check alerts with /monitoring/alerts?rowFilter-alert-source=platform&resource-list-text=TechPreviewNoUpgrade in the web console.

Actual results:

TechPreviewNoUpgrade is pending or firing.

Expected results:

Something appropriate to LatencySensitive, like a generic alert that covers all non-default feature sets, is pending or firing.

https://github.com/openshift/cluster-kube-apiserver-operator/pull/1512

Bug OCPBUGS-24214: Network TLS artifacts should have ownership annotations

https://github.com/openshift/cluster-network-operator/pull/2120

Bug OCPBUGS-24975: Update 4.16 ose-cluster-bootstrap-container image to be consistent with ART

Bug OCPBUGS-16736: ‘Oh no! Something went wrong’ will be shown when user go to MultiClusterEngine details -> Yaml tab

Description of problem:

    ‘Oh no! Something went wrong’ is shown when the user goes to MultiClusterEngine details -> YAML tab

Version-Release number of selected component (if applicable):

    4.14.0-0.nightly-2023-07-20-215234

How reproducible:

    Always

Steps to Reproduce:

1. Install 'multicluster engine for Kubernetes' operator in the cluster
2. Use the default value to create a new MultiClusterEngine
3. Navigate to the MultiClusterEngine details -> Yaml Tab

Actual results:

    The ‘Oh no! Something went wrong.’ error is shown with the following details:
    TypeError
    Description: Cannot read properties of null (reading 'editor')

Expected results:

    no error

Additional info:

    This bug fix is in conjunction with https://issues.redhat.com/browse/OCPBUGS-22778

https://github.com/openshift/console/pull/13510

Bug OCPBUGS-25535: Update 4.16 ose-olm-rukpak-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/operator-framework-rukpak/pull/68

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/operator-framework-rukpak/pull/68

Bug OCPBUGS-26236: VolumeSnapshots data is not displayed in PVC > VolumeSnapshots tab

Description of problem:

    VolumeSnapshots data is not displayed in PVC > VolumeSnapshots tab

Version-Release number of selected component (if applicable):

    4.16.0-0.ci-2024-01-05-050911

Steps to Reproduce:

    1. Create a PVC, for example "my-pvc"
    2. Create a Pod and bind it to "my-pvc"
    3. Create a VolumeSnapshot and associate it with "my-pvc"
    4. Go to PVC details > VolumeSnapshots tab

Actual results:

  VolumeSnapshots data is not displayed in PVC > VolumeSnapshots tab

Expected results:

 VolumeSnapshots data should be displayed in PVC > VolumeSnapshots tab

https://github.com/openshift/console/pull/13485

Bug OCPBUGS-26017: Update 4.16 ose-multus-whereabouts-ipam-cni-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/whereabouts-cni/pull/223

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/whereabouts-cni/pull/223

Bug OCPBUGS-19807: 'oc adm release extract --include ...' obsoletes ccoctl's --enable-tech-preview

Description of problem:

ccoctl consumes CredentialsRequests extracted from OpenShift releases and manages secrets associated with those requests for the cluster. Over time, ccoctl has grown a number of CredentialsRequest filters, including deletion annotations in CCO-175 and tech-preview annotations in cco#444.

But with OTA-559, oc adm release extract ... in 4.14 and later learned an --included parameter, which lets oc perform that "will the cluster need this credential?" filtering. ccoctl no longer needs to perform that filtering, and ccoctl callers no longer have to think through "do I need to enable tech-preview CRs for this cluster or not?".

Version-Release number of selected component (if applicable):

4.14 and later.

How reproducible:

100%.

Steps to Reproduce:

$ cat <<EOF >install-config.yaml 
> apiVersion: v1
> platform:
>   gcp:
>     dummy: data
> featureSet: TechPreviewNoUpgrade
> EOF
$ oc adm release extract --included --credentials-requests --install-config install-config.yaml --to credentials-requests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.2-x86_64
$ ccoctl gcp create-all --dry-run --name=test --region=test --project=test --credentials-requests-dir=credentials-requests

Actual results:

ccoctl doesn't dry-run create the TechPreviewNoUpgrade openshift-cluster-api-gcp CredentialsRequest unless you pass it --enable-tech-preview.

Expected results:

ccoctl does dry-run create the TechPreviewNoUpgrade openshift-cluster-api-gcp CredentialsRequest unless you pass it --enable-tech-preview=false.

Additional info:

Longer-term, we likely want to go through some phases of deprecating and maybe eventually removing --enable-tech-preview and the ccoctl-side filtering. But for now, I think we want to pivot to defaulting to true, so that anyone with existing flows that do not include the new --included extraction has an easy way to keep their workflow going (they can set --enable-tech-preview=false). And I think we should backport that to 4.14's ccoctl to simplify the docs#62148 change for OSDOCS-4158. But we're close enough to 4.14's expected GA that it's worth some consensus-building and alternative consideration before trying to rush changes back to 4.14 branches.
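
The proposed default can be illustrated with a small filter. This is a hedged Python sketch, not ccoctl code (ccoctl is written in Go): the function name, the dict-shaped manifests, and the annotation key release.openshift.io/feature-set are assumptions for the example. It shows how defaulting enable_tech_preview to true keeps tech-preview CredentialsRequests that were already selected by --included, while =false restores the old filtering.

```python
# Assumed annotation key that marks a manifest as feature-set specific.
FEATURE_SET_ANNOTATION = "release.openshift.io/feature-set"


def filter_credentials_requests(requests, enable_tech_preview=True):
    """Drop TechPreviewNoUpgrade requests only when tech preview is explicitly disabled."""
    kept = []
    for cr in requests:
        annotations = cr.get("metadata", {}).get("annotations", {})
        raw = annotations.get(FEATURE_SET_ANNOTATION, "")
        feature_sets = [f for f in raw.split(",") if f]
        if "TechPreviewNoUpgrade" in feature_sets and not enable_tech_preview:
            continue
        kept.append(cr)
    return kept


# Made-up manifests: one default request and one tech-preview request.
extracted = [
    {"metadata": {"name": "example-default-request", "annotations": {}}},
    {
        "metadata": {
            "name": "openshift-cluster-api-gcp",
            "annotations": {FEATURE_SET_ANNOTATION: "TechPreviewNoUpgrade"},
        }
    },
]

names_default = [cr["metadata"]["name"] for cr in filter_credentials_requests(extracted)]
names_disabled = [
    cr["metadata"]["name"]
    for cr in filter_credentials_requests(extracted, enable_tech_preview=False)
]
print(names_default, names_disabled)
```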

https://github.com/openshift/oc/pull/1551

Bug OCPBUGS-22899: Self-managed HCP pods are scheduled on single mgmt cluster node when no zones are in use

Description of problem:


In the self-managed HCP use case, if the on-premises bare-metal management cluster does not have nodes labeled with the "topology.kubernetes.io/zone" key, then all HCP pods for a highly available (HA) cluster are scheduled to a single mgmt cluster node.

This is a result of the way the affinity rules are constructed.

Take the pod affinity/antiAffinity example below, which is generated for an HA HCP cluster. If the "topology.kubernetes.io/zone" label does not exist on the mgmt cluster nodes, the pod still gets scheduled, but the antiAffinity rule is effectively ignored. That seems odd given the use of "requiredDuringSchedulingIgnoredDuringExecution", but I have tested this and the rule truly is ignored if the topologyKey label is not present on the nodes.

        podAffinity: 
          preferredDuringSchedulingIgnoredDuringExecution: 
          - podAffinityTerm: 
              labelSelector: 
                matchLabels: 
                  hypershift.openshift.io/hosted-control-plane: clusters-vossel1
              topologyKey: kubernetes.io/hostname
            weight: 100
        podAntiAffinity: 
          requiredDuringSchedulingIgnoredDuringExecution: 
          - labelSelector: 
              matchLabels: 
                app: kube-apiserver
                hypershift.openshift.io/control-plane-component: kube-apiserver
            topologyKey: topology.kubernetes.io/zone

If no "zones" are configured for the bare-metal mgmt cluster, the only other pod affinity rule is one that actually colocates the pods. This results in an HA HCP having all of its etcd pods, API servers, and so on scheduled to a single node.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

100%

Steps to Reproduce:

1. Create a self-managed HA HCP cluster on a mgmt cluster with nodes that lack the "topology.kubernetes.io/zone" label

Actual results:

All HCP pods are scheduled to a single node.

Expected results:

HCP pods should always be spread across multiple nodes.

Additional info:


One way to address this is to add another anti-affinity rule that prevents every component from being scheduled on the same node as its replicas.
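
Such a rule could look like the structure built below. This is a sketch under assumptions, not the content of the linked PR: the helper name is hypothetical, and whether the real fix uses a required or a preferred term is not stated in this report. It reuses the hypershift.openshift.io/control-plane-component label from the example above and switches the topologyKey to kubernetes.io/hostname, a label that is normally present on every node.

```python
# Hypothetical helper that builds a per-node anti-affinity stanza for one component.
def hostname_anti_affinity(component):
    """Return a podAntiAffinity stanza keeping replicas of a component on different nodes."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {
                        "matchLabels": {
                            "hypershift.openshift.io/control-plane-component": component,
                        }
                    },
                    # Spread by node instead of by zone, so the rule still
                    # applies when no zone labels exist.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


affinity = hostname_anti_affinity("kube-apiserver")
term = affinity["podAntiAffinity"]["requiredDuringSchedulingIgnoredDuringExecution"][0]
print(term["topologyKey"])
```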

https://github.com/openshift/hypershift/pull/3286

Bug OCPBUGS-22923: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/2048

Bug OCPBUGS-25564: Update 4.16 ose-cluster-storage-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-storage-operator/pull/435

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-storage-operator/pull/435

Bug OCPBUGS-23838: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/177

Bug MON-3584: prometheus-adapter image fails to start

There was a glitch with the prometheus-adapter image after the 4.16 branching.

https://redhat-internal.slack.com/archives/C01CQA76KMX/p1702041610708869

https://github.com/openshift/k8s-prometheus-adapter/pull/97

Task MON-3650: Upgrade Thanos to v0.33.0

Bug OCPBUGS-26081: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/csi-driver-shared-resource/pull/161

Story WRKLDS-1004: components should use AlwaysAllow UnhealthyPodEvictionPolicy in PDBs

Allow eviction of unhealthy (not ready) pods even if no disruptions are allowed on a PodDisruptionBudget. This helps to drain or maintain a node and recover without manual intervention when multiple nodes or pods are misbehaving.

This should prevent issues similar to https://issues.redhat.com//browse/OCPBUGS-23796
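
The difference between the two policies can be sketched as a small decision function. This is a simplified Python model of the behavior described in the Kubernetes PodDisruptionBudget documentation, not the actual eviction API code; the function and parameter names are assumptions for the example.

```python
# Simplified model of the PDB eviction decision for a single running pod.
def can_evict(pod_ready, disruptions_allowed, current_healthy, desired_healthy,
              policy="IfHealthyBudget"):
    """Return True if the eviction of the pod would be allowed under the given policy."""
    if pod_ready:
        # Healthy pods always consume the disruption budget.
        return disruptions_allowed > 0
    if policy == "AlwaysAllow":
        # Unhealthy pods may be evicted regardless of the budget.
        return True
    # IfHealthyBudget: unhealthy pods may be evicted only while the
    # application still has enough healthy pods.
    return current_healthy >= desired_healthy


# A not-ready pod on a node being drained, with the budget already exhausted.
print(can_evict(False, 0, 1, 2, policy="IfHealthyBudget"))
print(can_evict(False, 0, 1, 2, policy="AlwaysAllow"))
```

With AlwaysAllow, the drain in this situation can proceed without manual intervention, which is the case the story targets.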

Bug TRT-1507: Metal jobs permafailing 4.16 nightly: pathological event should not see excessive Back-off restarting failed containers for ns/openshift-kni-infra

[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers for ns/openshift-kni-infra

Test is now passing near 0% on metal-ipi.

Started around Feb 10th.

event [namespace/openshift-kni-infra node/master-1.ostest.test.metalkube.org pod/haproxy-master-1.ostest.test.metalkube.org hmsg/64785a22cf - Back-off restarting failed container haproxy in pod haproxy-master-1.ostest.test.metalkube.org_openshift-kni-infra(336080d8c1b455c151170524132c026d)] happened 295 times

Possible relation to haproxy 2.8 merge?

Logs indicate the error is:

/bin/bash: line 47: socat: command not found

https://github.com/openshift/router/pull/561

Bug OCPBUGS-24950: Update 4.16 ose-powervs-block-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver/pull/68

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ibm-powervs-block-csi-driver/pull/68

Bug MGMT-16721: Some instructions request crash assisted-service

Description of the problem:

Some requests of this type:

{"x_request_id":"05e66411-7612-46bb-86c2-69bf7096b6da","protocol":"HTTP/1.1","authority":"zoscaru4s08w1mz.api.openshift.com","user_agent":"Go-http-client/2.0","method":"GET","response_flags":"UC","x_forwarded_for":"163.244.72.2,10.128.10.16,23.21.192.204","bytes_rx":0,"duration":13,"bytes_tx":95,"response_code":503,"timestamp":"2024-01-31T15:52:41.418Z","upstream_duration":null,"path":"/api/assisted-install/v2/infra-envs/84596f7d-0138-4f57-ada4-be72aea031a5/hosts/62148a92-8588-b591-5f7e-046bf1136b3b/instructions?timestamp=1706716359"}

cause 503 responses because the application panics. Fortunately, only the goroutine serving the request crashes, so the main loop keeps running and other requests seem unaffected.

How reproducible:

Not sure which conditions trigger it, but plenty of requests of this type can be found in the prod logs.

Actual results:

2024/01/31 16:41:31 http: panic serving 127.0.0.1:39486: runtime error: invalid memory address or nil pointer dereference
goroutine 575931 [running]:
net/http.(*conn).serve.func1()
	/usr/local/go/src/net/http/server.go:1854 +0xbf
panic({0x4369120, 0x6bee680})
	/usr/local/go/src/runtime/panic.go:890 +0x263
github.com/openshift/assisted-service/internal/host/hostcommands.(*tangConnectivityCheckCmd).getTangServersFromHostIgnition(0x34?, 0x484da60?)
	/assisted-service/internal/host/hostcommands/tang_connectivity_check_cmd.go:41 +0x3e
github.com/openshift/assisted-service/internal/host/hostcommands.(*tangConnectivityCheckCmd).GetSteps(0xc000a5c440, {0xc00057c1c8?, 0x48adb49?}, 0xc000f5d180)
	/assisted-service/internal/host/hostcommands/tang_connectivity_check_cmd.go:104 +0x112
github.com/openshift/assisted-service/internal/host/hostcommands.(*InstructionManager).GetNextSteps(0xc000ad0780, {0x50cf558, 0xc00444b4a0}, 0xc000f5d180)
	/assisted-service/internal/host/hostcommands/instruction_manager.go:178 +0xa2f
github.com/openshift/assisted-service/internal/host.(*Manager).GetNextSteps(0xc0003e4990?, {0x50cf558?, 0xc00444b4a0?}, 0x0?)
	/assisted-service/internal/host/host.go:548 +0x48
github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).V2GetNextSteps(0xc000da4800, {0x50cf558, 0xc00444b4a0}, {0xc00203b500, 0xc002b9f460, {0xc001daa690, 0x24}, {0xc001daa6f0, 0x24}, 0xc0015c8158})
	/assisted-service/internal/bminventory/inventory.go:5357 +0x1b8
github.com/openshift/assisted-service/restapi.HandlerAPI.func54({0xc00203b500, 0xc002b9f460, {0xc001daa690, 0x24}, {0xc001daa6f0, 0x24}, 0xc0015c8158}, {0x408d980?, 0xc000fb67e0?})
	/assisted-service/restapi/configure_assisted_install.go:654 +0xf4
github.com/openshift/assisted-service/restapi/operations/installer.V2GetNextStepsHandlerFunc.Handle(0xc00203b300?, {0xc00203b500, 0xc002b9f460, {0xc001daa690, 0x24}, {0xc001daa6f0, 0x24}, 0xc0015c8158}, {0x408d980, 0xc000fb67e0})
	/assisted-service/restapi/operations/installer/v2_get_next_steps.go:19 +0x7a
github.com/openshift/assisted-service/restapi/operations/installer.(*V2GetNextSteps).ServeHTTP(0xc00169e468, {0x50bfe00, 0xc001399d20}, 0xc00203b500)
	/assisted-service/restapi/operations/installer/v2_get_next_steps.go:66 +0x2dd
github.com/go-openapi/runtime/middleware.NewOperationExecutor.func1({0x50bfe00, 0xc001399d20}, 0xc00203b500)
	/assisted-service/vendor/github.com/go-openapi/runtime/middleware/operation.go:28 +0x59
net/http.HandlerFunc.ServeHTTP(0x0?, {0x50bfe00?, 0xc001399d20?}, 0x17334b7?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/openshift/assisted-service/internal/metrics.Handler.func1.1()
	/assisted-service/internal/metrics/reporter.go:37 +0x31
github.com/slok/go-http-metrics/middleware.Middleware.Measure({{{0x50bfdd0, 0xc0014ce0e0}, {0x48922ea, 0x12}, 0x0, 0x0, 0x0}}, {0x0, 0x0}, {0x50d23c0, ...}, ...)
	/assisted-service/vendor/github.com/slok/go-http-metrics/middleware/middleware.go:117 +0x30e
github.com/openshift/assisted-service/internal/metrics.Handler.func1({0x50cdea0?, 0xc0005ec770}, 0xc00203b500)
	/assisted-service/internal/metrics/reporter.go:36 +0x35f
net/http.HandlerFunc.ServeHTTP(0x50cf558?, {0x50cdea0?, 0xc0005ec770?}, 0xc002b9ee80?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/openshift/assisted-service/pkg/context.ContextHandler.func1.1({0x50cdea0, 0xc0005ec770}, 0xc00203b400)
	/assisted-service/pkg/context/param.go:95 +0xc8
net/http.HandlerFunc.ServeHTTP(0xc001735430?, {0x50cdea0?, 0xc0005ec770?}, 0xc0026625f8?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/go-openapi/runtime/middleware.NewRouter.func1({0x50cdea0, 0xc0005ec770}, 0xc00203b200)
	/assisted-service/vendor/github.com/go-openapi/runtime/middleware/router.go:77 +0x257
net/http.HandlerFunc.ServeHTTP(0x7fc8dc2c3820?, {0x50cdea0?, 0xc0005ec770?}, 0xc00009f000?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/go-openapi/runtime/middleware.Redoc.func1({0x50cdea0, 0xc0005ec770}, 0x418b480?)
	/assisted-service/vendor/github.com/go-openapi/runtime/middleware/redoc.go:72 +0x242
net/http.HandlerFunc.ServeHTTP(0x1?, {0x50cdea0?, 0xc0005ec770?}, 0x0?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/go-openapi/runtime/middleware.Spec.func1({0x50cdea0, 0xc0005ec770}, 0x486ac6f?)
	/assisted-service/vendor/github.com/go-openapi/runtime/middleware/spec.go:46 +0x18c
net/http.HandlerFunc.ServeHTTP(0xc001136380?, {0x50cdea0?, 0xc0005ec770?}, 0xc00203b200?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/rs/cors.(*Cors).Handler.func1({0x50cdea0, 0xc0005ec770}, 0xc00203b200)
	/assisted-service/vendor/github.com/rs/cors/cors.go:281 +0x1c4
net/http.HandlerFunc.ServeHTTP(0x0?, {0x50cdea0?, 0xc0005ec770?}, 0x4?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1({0x50cd990, 0xc0012e8540}, 0xc0017358f0?)
	/assisted-service/vendor/github.com/NYTimes/gziphandler/gzip.go:336 +0x24e
net/http.HandlerFunc.ServeHTTP(0x100c0017359e8?, {0x50cd990?, 0xc0012e8540?}, 0x10?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/openshift/assisted-service/pkg/app.WithMetricsResponderMiddleware.func1({0x50cd990?, 0xc0012e8540?}, 0x16f0a19?)
	/assisted-service/pkg/app/middleware.go:32 +0xb0
net/http.HandlerFunc.ServeHTTP(0xc000a26900?, {0x50cd990?, 0xc0012e8540?}, 0xc00444a7e0?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/openshift/assisted-service/pkg/app.WithHealthMiddleware.func1({0x50cd990?, 0xc0012e8540?}, 0x5094901?)
	/assisted-service/pkg/app/middleware.go:55 +0x162
net/http.HandlerFunc.ServeHTTP(0x50cf4b0?, {0x50cd990?, 0xc0012e8540?}, 0x5094980?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/openshift/assisted-service/pkg/requestid.handler.ServeHTTP({{0x50aa9c0?, 0xc0008a4eb0?}}, {0x50cd990, 0xc0012e8540}, 0xc00203b100)
	/assisted-service/pkg/requestid/requestid.go:69 +0x1ad
github.com/openshift/assisted-service/internal/spec.WithSpecMiddleware.func1({0x50cd990?, 0xc0012e8540?}, 0xc00203b100?)
	/assisted-service/internal/spec/spec.go:38 +0x9b
net/http.HandlerFunc.ServeHTTP(0xc00124ec35?, {0x50cd990?, 0xc0012e8540?}, 0x170a0ce?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
net/http.serverHandler.ServeHTTP({0xc002cb8f30?}, {0x50cd990, 0xc0012e8540}, 0xc00203b100)
	/usr/local/go/src/net/http/server.go:2936 +0x316
net/http.(*conn).serve(0xc00189d320, {0x50cf558, 0xc0011fc240})
	/usr/local/go/src/net/http/server.go:1995 +0x612
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:3089 +0x5ed

https://github.com/openshift/assisted-service/pull/5940

Bug OCPBUGS-25625: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/kubernetes-autoscaler/pull/276

Bug OCPBUGS-30048: Update OWNERS file in route-controller-manager

Description of problem:

Update OWNERS file in route-controller-manager repository.

Version-Release number of selected component (if applicable):

4.15

How reproducible:

n/a

Steps to Reproduce:

n/a

Actual results:

n/a

Expected results:

n/a

https://github.com/openshift/route-controller-manager/pull/41

Bug OCPBUGS-18701: [Azuredisk-csi-driver] allocatable volumes count incorrect in csinode for Standard_B4as_v2 instance types

Description of problem:

[Azuredisk-csi-driver] allocatable volumes count incorrect in csinode for Standard_B4as_v2 instance types

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-09-02-132842

How reproducible:

Always

Steps to Reproduce:

1. Install an OpenShift cluster on Azure using the Standard_B4as_v2 instance type
2. Check the allocatable volume count in the csinode object
3. Create a pod with as many PVCs as the allocatable volume count (provisioned by azuredisk-csi-driver)

Actual results:

In step 2 the allocatable volumes count is 16.
$ oc get csinode pewang-0908s-r6lwd-worker-southcentralus3-tvwwr -ojsonpath='{.spec.drivers[?(@.name=="disk.csi.azure.com")].allocatable.count}'
16

In step 3, the pod is stuck in ContainerCreating because attaching a volume fails with:
09-07 22:38:28.758        "message": "The maximum number of data disks allowed to be attached to a VM of this size is 8.",\r

Expected results:

In step 2 the allocatable volumes count should be 8.
In step 3, the pod should be Running and all volumes should be readable and writable.

Additional info:

$ az vm list-skus -l eastus --query "[?name=='Standard_B4as_v2']"| jq -r '.[0].capabilities[] | select(.name =="MaxDataDiskCount")'
{
  "name": "MaxDataDiskCount",
  "value": "8"
}

4.14 currently uses the v1.28.1 driver. According to the upstream issues and PRs, the issue is fixed in v1.28.2:
https://github.com/kubernetes-sigs/azuredisk-csi-driver/releases/tag/v1.28.2
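
The comparison made in this report can be scripted. The following Python sketch is an illustration only: the function names are hypothetical, and the JSON sample mimics the shape of the az vm list-skus output shown above rather than a captured response. It reads MaxDataDiskCount from the SKU capabilities and flags a csinode allocatable count that exceeds it.

```python
import json


def max_data_disks(sku_json):
    """Return MaxDataDiskCount from the first SKU in a JSON list of SKUs."""
    skus = json.loads(sku_json)
    for capability in skus[0]["capabilities"]:
        if capability["name"] == "MaxDataDiskCount":
            return int(capability["value"])  # the value is a string in the output above
    raise KeyError("MaxDataDiskCount")


def allocatable_exceeds_limit(csinode_count, sku_json):
    """True when the csinode allocatable count is higher than the VM size allows."""
    return csinode_count > max_data_disks(sku_json)


# Made-up sample in the shape of the az output for Standard_B4as_v2.
sample = json.dumps([
    {
        "name": "Standard_B4as_v2",
        "capabilities": [
            {"name": "vCPUs", "value": "4"},
            {"name": "MaxDataDiskCount", "value": "8"},
        ],
    }
])

print(max_data_disks(sample), allocatable_exceeds_limit(16, sample))
```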

https://github.com/openshift/azure-disk-csi-driver/pull/49

Bug OCPBUGS-24877: Update 4.16 ose-cluster-policy-controller-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-policy-controller/pull/144

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-policy-controller/pull/144

Story TRT-1491: Reorder Intervals groups to put more relevant info at top

Tired of scrolling through alerts and pod states that are seldom useful to get to things that we need every day.

https://github.com/openshift/origin/pull/28576

Bug OCPBUGS-24225: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/oc/pull/1619

Bug OCPBUGS-29855: Need option to disable dedicated request serving isolation

Description of problem:

    To operate HyperShift at high scale, we need an option to disable dedicated request serving isolation, if not used.

Version-Release number of selected component (if applicable):

    4.16, 4.15, 4.14, 4.13

How reproducible:

    100%

Steps to Reproduce:

    1. Install hypershift operator for versions 4.16, 4.15, 4.14, or 4.13
    2. Observe start-up logs
    3. Dedicated request serving isolation controllers are started

Actual results:

    Dedicated request serving isolation controllers are started

Expected results:

    Dedicated request serving isolation controllers do not start if they are not needed

https://github.com/openshift/hypershift/pull/3601

Bug OCPBUGS-22598: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-gcp/pull/53

Bug OCPBUGS-24526: Bundle Snapshot taken from wrong namespace for Deprecation Conditions

Description of problem:

Snapshots taken to gather deprecation information from bundles are taken from the Subscription namespace instead of the CatalogSource namespace. That means that if the Subscription is in a different namespace than the CatalogSource, no bundles will be present in the snapshot.

How reproducible:

100%

Steps to Reproduce:

1. Create a CatalogSource with olm.deprecation entries
2. Create a Subscription targeting a package with deprecations in a different namespace.

Actual results:

No Deprecation Conditions will be present.

Expected results:

Deprecation Conditions should be present.
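
The namespace mix-up can be shown with a toy lookup. This Python sketch is not OLM code: the snapshot store keyed by (namespace, catalog name), the function name, and the sample objects are assumptions for illustration. It relies only on the Subscription fields spec.source and spec.sourceNamespace.

```python
# Toy model: snapshots are keyed by (catalog namespace, catalog name).
def bundles_for_subscription(snapshots, subscription, use_catalog_namespace=True):
    """Look up the bundle snapshot for the catalog a Subscription points at."""
    spec = subscription["spec"]
    if use_catalog_namespace:
        namespace = spec["sourceNamespace"]  # where the CatalogSource lives
    else:
        namespace = subscription["metadata"]["namespace"]  # the buggy lookup
    return snapshots.get((namespace, spec["source"]), [])


# Made-up objects: the catalog and the Subscription live in different namespaces.
snapshots = {("catalogs", "my-catalog"): ["example-operator.v1.0.0"]}
subscription = {
    "metadata": {"name": "example-sub", "namespace": "team-a"},
    "spec": {"source": "my-catalog", "sourceNamespace": "catalogs", "name": "example-operator"},
}

print(bundles_for_subscription(snapshots, subscription, use_catalog_namespace=False))
print(bundles_for_subscription(snapshots, subscription))
```

With the Subscription namespace, the lookup finds no bundles, so no Deprecation Conditions can be derived; with the CatalogSource namespace, the bundle is found.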

https://github.com/openshift/operator-framework-olm/pull/654

Bug OCPBUGS-24886: Update 4.16 ose-gcp-pd-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/53

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/gcp-pd-csi-driver/pull/53

Bug OCPBUGS-23167: On SNO with DU profile(RT kernel) tuned profile is always degraded due to net.core.busy_read, net.core.busy_poll and kernel.numa_balancing sysctl not existing in RT kernel

Description of problem:

On SNO with DU profile(RT kernel) tuned profile is always degraded due to net.core.busy_read, net.core.busy_poll and kernel.numa_balancing sysctl not existing in RT kernel

Version-Release number of selected component (if applicable):

4.14.1

How reproducible:

100%

Steps to Reproduce:

1. Deploy SNO with DU profile(RT kernel)
2. Check tuned profile

Actual results:

$ oc -n openshift-cluster-node-tuning-operator get profile -o yaml
apiVersion: v1
items:
- apiVersion: tuned.openshift.io/v1
  kind: Profile
  metadata:
    creationTimestamp: "2023-11-09T18:26:34Z"
    generation: 2
    name: sno.kni-qe-1.lab.eng.rdu2.redhat.com
    namespace: openshift-cluster-node-tuning-operator
    ownerReferences:
    - apiVersion: tuned.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: Tuned
      name: default
      uid: 4e7c05a2-537e-4212-9009-e2724938dec9
    resourceVersion: "287891"
    uid: 5f4d5819-8f84-4b3b-9340-3d38c41501ff
  spec:
    config:
      debug: false
      tunedConfig: {}
      tunedProfile: performance-patch
  status:
    conditions:
    - lastTransitionTime: "2023-11-09T18:26:39Z"
      message: TuneD profile applied.
      reason: AsExpected
      status: "True"
      type: Applied
    - lastTransitionTime: "2023-11-09T18:26:39Z"
      message: 'TuneD daemon issued one or more error message(s) during profile application.
        TuneD stderr: net.core.rps_default_mask'
      reason: TunedError
      status: "True"
      type: Degraded
    tunedProfile: performance-patch
kind: List
metadata:
  resourceVersion: ""

Expected results:

Not degraded

Additional info:

The TuneD log shows the following errors, which are probably what puts the profile into the degraded state:

2023-11-09 18:30:49,287 ERROR    tuned.plugins.plugin_sysctl: Failed to read sysctl parameter 'net.core.busy_read', the parameter does not exist
2023-11-09 18:30:49,287 ERROR    tuned.plugins.plugin_sysctl: sysctl option net.core.busy_read will not be set, failed to read the original value.
2023-11-09 18:30:49,287 ERROR    tuned.plugins.plugin_sysctl: Failed to read sysctl parameter 'net.core.busy_poll', the parameter does not exist
2023-11-09 18:30:49,287 ERROR    tuned.plugins.plugin_sysctl: sysctl option net.core.busy_poll will not be set, failed to read the original value.
2023-11-09 18:30:49,287 ERROR    tuned.plugins.plugin_sysctl: Failed to read sysctl parameter 'kernel.numa_balancing', the parameter does not exist
2023-11-09 18:30:49,287 ERROR    tuned.plugins.plugin_sysctl: sysctl option kernel.numa_balancing will not be set, failed to read the original value.

These sysctl parameters do not seem to be available with the RT kernel.
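
Whether a given kernel exposes these parameters can be checked by looking for the matching files under /proc/sys. The following is a minimal Python sketch with hypothetical function names; it assumes the usual mapping in which the dots of a sysctl name become path separators, which holds for the three parameters named in the log.

```python
import os


def sysctl_path(name, proc_sys_root="/proc/sys"):
    """Map a sysctl name such as net.core.busy_read to its file under /proc/sys."""
    return os.path.join(proc_sys_root, *name.split("."))


def missing_sysctls(names, proc_sys_root="/proc/sys"):
    """Return the sysctl names that have no corresponding file on this kernel."""
    return [n for n in names if not os.path.exists(sysctl_path(n, proc_sys_root))]


# The parameters reported as missing in the TuneD log above.
reported = ["net.core.busy_read", "net.core.busy_poll", "kernel.numa_balancing"]
print(missing_sysctls(reported))
```

Running it on the affected node would list the parameters that a profile should skip or guard on that kernel.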

https://github.com/openshift/cluster-node-tuning-operator/pull/954

Bug OCPBUGS-25560: Update 4.16 prometheus-config-reloader-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/prometheus-operator/pull/270

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/prometheus-operator/pull/270

Bug OCPBUGS-26111: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api/pull/194

Bug OCPBUGS-25526: Update 4.16 ironic-rhcos-downloader-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/95

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ironic-rhcos-downloader/pull/95

Bug OCPBUGS-25777: Update 4.16 ose-azure-workload-identity-webhook-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/11

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/azure-workload-identity/pull/11

Bug OCPBUGS-29863: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-autoscaler-operator/pull/314

Bug OCPBUGS-20336: Upgrade from 4.13.13 to 4.14rc2 failed at 250 nodes.

Description of problem:
While upgrading a loaded 250-node ROSA cluster from 4.13.13 to 4.14.rc2, the upgrade failed: the cluster got stuck while the network operator was trying to upgrade.
Around 20 multus pods were in CrashLoopBackOff state with the following log:

oc logs multus-4px8t
2023-10-10T00:54:34+00:00 [cnibincopy] Successfully copied files in /usr/src/multus-cni/rhel9/bin/ to /host/opt/cni/bin/upgrade_6dcb644a-4164-42a5-8f1e-4ae2c04dc315
2023-10-10T00:54:34+00:00 [cnibincopy] Successfully moved files in /host/opt/cni/bin/upgrade_6dcb644a-4164-42a5-8f1e-4ae2c04dc315 to /host/opt/cni/bin/
2023-10-10T00:54:34Z [verbose] multus-daemon started
2023-10-10T00:54:34Z [verbose] Readiness Indicator file check
2023-10-10T00:55:19Z [error] have you checked that your default network is ready? still waiting for readinessindicatorfile @ /host/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
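
The log above shows the daemon waiting for the readiness indicator file and giving up about 45 seconds later (00:54:34 to 00:55:19). The following is a minimal sketch of that kind of poll-with-timeout check, not the multus implementation; the function name and parameters are hypothetical.

```python
import os
import time


def wait_for_file(path, timeout_s, interval_s=1.0):
    """Poll until path exists; return False once timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With the path from the log, wait_for_file("/host/run/multus/cni/net.d/10-ovn-kubernetes.conf", 45) would return False whenever the default network never writes its configuration file in time.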

https://github.com/openshift/ovn-kubernetes/pull/2057

Bug OCPBUGS-23666: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/604

Bug OCPBUGS-25662: ECR Image pull fails in spite of attaching AmazonEC2ContainerRegistryReadOnly policy to the worker nodes.

Description of problem:

In ROSA/OCP 4.14.z, attaching the AmazonEC2ContainerRegistryReadOnly policy to the worker nodes (in ROSA's case, it was attached to the ManagedOpenShift-Worker-Role, which the installer assigns to all worker nodes) has no effect on ECR image pulls. The user gets an authentication error. Attaching the policy should ideally remove the need for an image-pull-secret, but the error is resolved only if the user also provides one.
This works correctly in 4.12.z, so something seems to have changed in recent OCP versions.

Version-Release number of selected component (if applicable):

4.14.2 (ROSA)

How reproducible:

The issue is reproducible using the below steps.

Steps to Reproduce:

    1. Create a deployment in ROSA or OCP on AWS, pointing at a private ECR repository
    2. The image pull fails with "Error: ErrImagePull" and "authentication required" errors

Actual results:

The image pull fails with "Error: ErrImagePull" and "authentication required" errors. It succeeds only if the user provides an image-pull-secret to the deployment.

Expected results:

The image should be pulled successfully by virtue of the ECR read-only policy attached to the worker node role, without needing an image-pull-secret.

Additional info:

In other words:

In OCP 4.13 and earlier, if a user adds the ECR:* permissions to the worker instance profile, the user can specify ECR images, and the worker node authenticates to ECR using the instance profile. In 4.14 this no longer works.

Providing a pull secret in a deployment is not a sufficient alternative, because AWS rotates ECR tokens every 12 hours. That is not a viable solution for customers who, up to OCP 4.13, did not have to rotate pull secrets constantly.

The experience in 4.14 should be the same as in 4.13 with ECR.

The current AWS policy that's used is this one: `arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly`

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:GetRepositoryPolicy",
                "ecr:DescribeRepositories",
                "ecr:ListImages",
                "ecr:DescribeImages",
                "ecr:BatchGetImage",
                "ecr:GetLifecyclePolicy",
                "ecr:GetLifecyclePolicyPreview",
                "ecr:ListTagsForResource",
                "ecr:DescribeImageScanFindings"
            ],
            "Resource": "*"
        }
    ]
}
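
To rule out a gap in the policy itself while triaging, the policy document can be checked for the actions an image pull relies on. This is a rough sketch under stated assumptions: the set of four pull actions below is this example's assumption, the helper names are hypothetical, and wildcard actions such as ecr:* are not expanded.

```python
# Assumed minimal set of actions needed to pull an image from ECR.
PULL_ACTIONS = {
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
}


def allowed_actions(policy: dict) -> set:
    """Collect every action named in an Allow statement (no wildcard expansion)."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM permits a single statement object
        statements = [statements]
    actions = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        listed = statement.get("Action", [])
        if isinstance(listed, str):
            listed = [listed]
        actions.update(listed)
    return actions


def missing_pull_actions(policy: dict) -> set:
    """Return the assumed pull actions that the policy does not allow."""
    return PULL_ACTIONS - allowed_actions(policy)
```

For the policy shown above, missing_pull_actions returns an empty set, which is consistent with the report that the regression lies in how the node authenticates rather than in the policy.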

Bug OCPBUGS-26415: Application creation fails when manually entering input scaling value in local setup

Description of problem:

In a local setup, this error appears when creating a deployment with a scaling value from the Git form page:
`Deployment in version "v1" cannot be handled as a Deployment: json: cannot unmarshal string into Go struct field DeploymentSpec.spec.replicas of type int32`

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-01-05-154400

How reproducible:

Every time

Steps to Reproduce:

    1. In the local setup go to the git form page
    2. Enter a git repo and select deployment as the resource type
    3. In scaling enter the value as '5' and click on Create button

Actual results:

Got this error:
"Deployment in version "v1" cannot be handled as a Deployment: json: cannot unmarshal string into Go struct field DeploymentSpec.spec.replicas of type int32"

Expected results:

Deployment should be created

Additional info:

This also happens with DeploymentConfig creation.
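
The error means the replicas value reached the API server as a JSON string ("5") instead of a number (5). The sketch below illustrates the kind of coercion a form handler needs before building the manifest. It is not the console's code (the console is not written in Python), and the function names are hypothetical.

```python
import json

INT32_MAX = 2**31 - 1


def parse_replicas(raw) -> int:
    """Convert a form value such as "5" to an int that fits spec.replicas (int32)."""
    if isinstance(raw, bool):
        raise ValueError("replicas must be a whole number")
    if isinstance(raw, int):
        value = raw
    else:
        text = str(raw).strip()
        if not (text.isascii() and text.isdigit()):
            raise ValueError(f"replicas must be a non-negative whole number, got {raw!r}")
        value = int(text)
    if value < 0 or value > INT32_MAX:
        raise ValueError(f"replicas out of range: {value}")
    return value


def build_spec(form_values: dict) -> dict:
    """Build the part of the Deployment spec that carries the replica count."""
    return {"replicas": parse_replicas(form_values["replicas"])}
```

Serializing build_spec({"replicas": "5"}) with json.dumps emits the replica count as a number, which is what the int32 field expects.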

https://github.com/openshift/console/pull/13487

Bug OCPBUGS-27861: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-azure/pull/107

Bug OCPBUGS-30091: TestHostNetworkPort is half serial and half parallel

Description of problem

CI is flaky because the TestHostNetworkPort test fails:

=== NAME  TestAll/serial/TestHostNetworkPortBinding
    operator_test.go:1034: Expected conditions: map[Admitted:True Available:True DNSManaged:False DeploymentReplicasAllAvailable:True LoadBalancerManaged:False]
         Current conditions: map[Admitted:True Available:True DNSManaged:False Degraded:False DeploymentAvailable:True DeploymentReplicasAllAvailable:False DeploymentReplicasMinAvailable:True DeploymentRollingOut:True EvaluationConditionsDetected:False LoadBalancerManaged:False LoadBalancerProgressing:False Progressing:True Upgradeable:True]
    operator_test.go:1034: Ingress Controller openshift-ingress-operator/samehost status: {
          "availableReplicas": 0,
          "selector": "ingresscontroller.operator.openshift.io/deployment-ingresscontroller=samehost",
          "domain": "samehost.ci-op-xlwngvym-43abb.origin-ci-int-aws.dev.rhcloud.com",
          "endpointPublishingStrategy": {
            "type": "HostNetwork",
            "hostNetwork": {
              "protocol": "TCP",
              "httpPort": 9080,
              "httpsPort": 9443,
              "statsPort": 9936
            }
          },
          "conditions": [
            {
              "type": "Admitted",
              "status": "True",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "Valid"
            },
            {
              "type": "DeploymentAvailable",
              "status": "True",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "DeploymentAvailable",
              "message": "The deployment has Available status condition set to True"
            },
            {
              "type": "DeploymentReplicasMinAvailable",
              "status": "True",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "DeploymentMinimumReplicasMet",
              "message": "Minimum replicas requirement is met"
            },
            {
              "type": "DeploymentReplicasAllAvailable",
              "status": "False",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "DeploymentReplicasNotAvailable",
              "message": "0/1 of replicas are available"
            },
            {
              "type": "DeploymentRollingOut",
              "status": "True",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "DeploymentRollingOut",
              "message": "Waiting for router deployment rollout to finish: 0 of 1 updated replica(s) are available...\n"
            },
            {
              "type": "LoadBalancerManaged",
              "status": "False",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "EndpointPublishingStrategyExcludesManagedLoadBalancer",
              "message": "The configured endpoint publishing strategy does not include a managed load balancer"
            },
            {
              "type": "LoadBalancerProgressing",
              "status": "False",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "LoadBalancerNotProgressing",
              "message": "LoadBalancer is not progressing"
            },
            {
              "type": "DNSManaged",
              "status": "False",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "UnsupportedEndpointPublishingStrategy",
              "message": "The endpoint publishing strategy doesn't support DNS management."
            },
            {
              "type": "Available",
              "status": "True",
              "lastTransitionTime": "2024-02-26T17:25:39Z"
            },
            {
              "type": "Progressing",
              "status": "True",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "IngressControllerProgressing",
              "message": "One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 0 of 1 updated replica(s) are available...\n)"
            },
            {
              "type": "Degraded",
              "status": "False",
              "lastTransitionTime": "2024-02-26T17:25:39Z"
            },
            {
              "type": "Upgradeable",
              "status": "True",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "Upgradeable",
              "message": "IngressController is upgradeable."
            },
            {
              "type": "EvaluationConditionsDetected",
              "status": "False",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "NoEvaluationCondition",
              "message": "No evaluation condition is detected."
            }
          ],
          "tlsProfile": {
            "ciphers": [
              "ECDHE-ECDSA-AES128-GCM-SHA256",
              "ECDHE-RSA-AES128-GCM-SHA256",
              "ECDHE-ECDSA-AES256-GCM-SHA384",
              "ECDHE-RSA-AES256-GCM-SHA384",
              "ECDHE-ECDSA-CHACHA20-POLY1305",
              "ECDHE-RSA-CHACHA20-POLY1305",
              "DHE-RSA-AES128-GCM-SHA256",
              "DHE-RSA-AES256-GCM-SHA384",
              "TLS_AES_128_GCM_SHA256",
              "TLS_AES_256_GCM_SHA384",
              "TLS_CHACHA20_POLY1305_SHA256"
            ],
            "minTLSVersion": "VersionTLS12"
          },
          "observedGeneration": 1
        }
    operator_test.go:1036: failed to observe expected conditions for the second ingresscontroller: timed out waiting for the condition
    operator_test.go:1059: deleted ingresscontroller samehost
    operator_test.go:1059: deleted ingresscontroller hostnetworkportbinding

This particular failure comes from https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1017/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1762147882179235840. Search.ci shows another failure: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/48873/rehearse-48873-pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi/1762576595890999296. The test has failed sporadically in the past, beyond what search.ci is able to search.

TestHostNetworkPort is marked as a serial test in TestAll and marked with t.Parallel() in the test itself. Not sure if this is what is causing a new failure seen in this test, but something is incorrect.

Version-Release number of selected component (if applicable)

The test failures have been observed recently on 4.16 as well as on 4.12 (https://github.com/openshift/cluster-ingress-operator/pull/828#issuecomment-1292888086) and 4.11 (https://github.com/openshift/cluster-ingress-operator/pull/914#issuecomment-1526808286). The logic error was introduced in 4.11 (https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a22322b25569059c61e1973f37f0a4b49e9407bc).

How reproducible

The logic error is self-evident. The test failure is very rare. The failure has been observed sporadically over the past couple years. Presently, search.ci shows two failures, with the following impact, for the past 14 days:

rehearse-48873-pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi (all) - 3 runs, 33% failed, 100% of failures match = 33% impact

pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator (all) - 16 runs, 25% failed, 25% of failures match = 6% impact

Steps to Reproduce

N/A.

Actual results

The TestHostNetworkPort test fails. The test is marked as both serial and parallel.

Expected results

Test should be marked as either serial or parallel, and it should pass consistently.

Additional info

When TestAll was introduced, TestHostNetworkPortBinding was initially marked parallel in https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a22322b25569059c61e1973f37f0a4b49e9407bc. After some discussion, it was moved to the serial list in https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a449e497e35fafeecbee9ea656e0631393182f70, but the commit to remove t.Parallel() evidently got inadvertently dropped.

https://github.com/openshift/cluster-ingress-operator/pull/1032

Bug OCPBUGS-29772: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/assisted-service/pull/6010

Bug OCPBUGS-25396: Archived in Tekton Results icon is not shown in list and details page for PipelineRuns imported from Tekton Results db

https://github.com/openshift/console/pull/13450

Bug OCPBUGS-25730: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/origin/pull/28478

Bug OCPBUGS-26554: Topology: Chinese translation was broken

Chinese translation in topology was invalid, see https://github.com/openshift/console/pull/13458

https://github.com/openshift/console/pull/13458

Bug OCPBUGS-25052: Update 4.16 ose-csi-external-snapshotter-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/128

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-snapshotter/pull/131

Bug OCPBUGS-25585: Update 4.16 kube-state-metrics-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/kube-state-metrics/pull/109

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/kube-state-metrics/pull/109

Bug OCPBUGS-28765: Azure CSI Driver operator has incorrect node selector for HyperShift

Description of problem:

  The azure csi driver operator cannot run in a HyperShift control plane because it has this selector: node-role.kubernetes.io/master: ""
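
A nodeSelector only matches nodes that carry every listed label with the same value, so a pod that requires the master label stays Pending on a cluster with no such nodes. The sketch below illustrates that matching rule; the node names and labels are hypothetical, and it assumes the cluster hosting the control plane pods has no nodes labeled node-role.kubernetes.io/master.

```python
def selector_matches(node_selector: dict, node_labels: dict) -> bool:
    """True when the node carries every selector key with an identical value."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


def schedulable_nodes(node_selector: dict, nodes: dict) -> list:
    """Return the names of nodes (name -> labels) that satisfy the selector."""
    return [name for name, labels in nodes.items() if selector_matches(node_selector, labels)]


# Hypothetical management cluster with worker nodes only.
nodes = {
    "worker-a": {"node-role.kubernetes.io/worker": ""},
    "worker-b": {"node-role.kubernetes.io/worker": ""},
}
print(schedulable_nodes({"node-role.kubernetes.io/master": ""}, nodes))
```

An empty result corresponds to the Pending pod described in this bug: no node satisfies the selector, so the scheduler has nowhere to place it.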

Version-Release number of selected component (if applicable):

    4.16 ci latest

How reproducible:

always

Steps to Reproduce:

    1. Install hypershift
    2. Create azure hosted cluster

Actual results:

    azure-disk-csi-driver-operator pod remains in Pending state

Expected results:

    all control plane pods run

https://github.com/openshift/cluster-storage-operator/pull/448

Task MON-3703: Update downstream thanos to v0.34.0

https://github.com/openshift/thanos/pull/139

Bug OCPBUGS-24872: Update 4.16 prometheus-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/prometheus-operator/pull/264

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/prometheus-operator/pull/264

Bug OCPBUGS-25718: vSphere ABI failed due to storage operator degraded

Description of problem:

The storage operator became degraded because it couldn't locate the node by UUID. The providerID was present for node 0 but blank for the other nodes. A successful installation can be achieved on day 2 by executing step 4 after step 7 from this document: https://access.redhat.com/solutions/6677901. Additionally, if credentials are provided in the install-config, it's necessary to add the uninitialized taint to the node (oc adm taint node "$NODE" node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule) after the bootstrap has completed.

Version-Release number of selected component (if applicable):

4.15

How reproducible:

100%

Steps to Reproduce:

    1. Create an agent ISO image
    2. Boot the created ISO on vSphere VM

Actual results:

Installation fails because the storage operator is unable to find the node by UUID.

Expected results:

Storage operator should be installed without any issue.

Additional info:

Slack discussion: https://redhat-internal.slack.com/archives/C02SPBZ4GPR/p1702893456002729

https://github.com/openshift/assisted-installer/pull/778

Bug OCPBUGS-30574: 4.16+ HCP clusters are using default catalog sources of v4.14

Description of problem:

Hosted control plane clusters of OCP 4.16 use default catalog sources (redhat-operators, certified-operators, community-operators and redhat-marketplace) that point to 4.14. As a result, 4.16 operators are not available, and this can't be updated from within the guest.

Version-Release number of selected component (if applicable):

4.16.0

How reproducible:

Always

Steps to Reproduce:

1. Check the .spec.image of the default catalog sources in the openshift-marketplace namespace.

Actual results:

the default catalogs are pointing to :v4.14

Expected results:

they should point to :v4.16 instead
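
The check in the reproduction step can be scripted against the JSON form of the catalog source list (for example, the output of oc get catalogsource -n openshift-marketplace -o json). This is a sketch only; the helper names are hypothetical and the image references used with it are illustrative.

```python
DEFAULT_CATALOGS = (
    "redhat-operators",
    "certified-operators",
    "community-operators",
    "redhat-marketplace",
)


def image_tag(image: str):
    """Return the tag of an image reference, or None if it has no tag."""
    reference = image.split("@", 1)[0]        # drop any digest
    last_part = reference.rsplit("/", 1)[-1]  # skip a registry host:port
    if ":" not in last_part:
        return None
    return last_part.rsplit(":", 1)[1]


def stale_catalogs(catalog_list: dict, expected_tag: str) -> dict:
    """Map each default catalog whose image tag differs from expected_tag to its tag."""
    stale = {}
    for item in catalog_list.get("items", []):
        name = item["metadata"]["name"]
        if name not in DEFAULT_CATALOGS:
            continue
        tag = image_tag(item.get("spec", {}).get("image", ""))
        if tag != expected_tag:
            stale[name] = tag
    return stale
```

On an affected 4.16 hosted cluster, stale_catalogs(catalogs, "v4.16") would report all four default catalogs with the tag v4.14.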

https://github.com/openshift/hypershift/pull/3707

Bug OCPBUGS-3522: Improve CanaryChecksRepetitiveFailures actionability

Description of problem:

CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing

doesn't give much detail or suggest next-steps. Expanding it to include at least a more detailed error message would make it easier for the admin to figure out how to resolve the issue.

Version-Release number of selected component (if applicable):

It's in the dev branch, and probably dates back to whenever the canary system was added.

How reproducible:

100%

Steps to Reproduce:

1. Break ingress. FIXME: Maybe by deleting the cloud load balancer, or dropping a firewall in the way, or something.
2. See the canary pods start failing.
3. Ingress operator sets CanaryChecksRepetitiveFailures with a message.

Actual results:

Canary route checks for the default ingress controller are failing

Expected results:

Canary route checks for the default ingress controller are failing: ${ERROR_MESSAGE}. ${POSSIBLY_ALSO_MORE_TROUBLESHOOTING_IDEAS?}

Additional info:

Plumbing the error message through might be as straightforward as passing probeRouteEndpoint's err through to setCanaryFailingStatusCondition for formatting. Or maybe it's more complicated than that?
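
The message shape requested under "Expected results" can be sketched as follows. This is only an illustration of appending the probe error to the existing text; it is not the operator's Go code, and the function name is hypothetical.

```python
BASE_MESSAGE = "Canary route checks for the default ingress controller are failing"


def canary_failing_message(err=None) -> str:
    """Append the probe error, when there is one, to the base condition message."""
    if err is None:
        return BASE_MESSAGE
    detail = str(err).strip()
    return f"{BASE_MESSAGE}: {detail}" if detail else BASE_MESSAGE
```

Falling back to the base message when no error text is available keeps the current behavior for callers that have nothing more specific to report.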

https://github.com/openshift/cluster-ingress-operator/pull/865

Bug OCPBUGS-27929: Update 4.16 ose-kube-storage-version-migrator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/203

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/203

Bug OCPBUGS-28634: Unable to add node to HCP cluster in a disconnected environment

Description of problem:

When creating a HyperShift cluster in a disconnected environment, the worker node cannot pass the assisted service validation due to an ignition error.

Version-Release number of selected component (if applicable):

4.14.z

How reproducible:

  100 %

Steps to Reproduce:

    1. Follow the HCP cluster installation steps in the documentation: https://hypershift-docs.netlify.app/labs/dual/mce/agentserviceconfig/#assisted-service-customization

Actual results:

Node addition fails

Expected results:

Node should get added to the cluster

https://github.com/openshift/hypershift/pull/3687

Bug OCPBUGS-25638: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-ibm/pull/63

Bug OCPBUGS-24807: Update 4.16 ose-cloud-credential-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cloud-credential-operator/pull/639

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-credential-operator/pull/639

Bug OCPBUGS-28664: openshift/csi-driver-shared-resource-operator - replace 'coreydaley' with 'sayan-biswas' in OWNERS file

https://github.com/openshift/csi-driver-shared-resource-operator/pull/99

Story TRT-1477: Standup GCP Liveness Monitoring Endpoint

As part of our investigation into GCP disruption we want an endpoint separate from the cluster under test but inside GCP to monitor for connectivity.

One approach is to use a GCP Cloud Function with an HTTP trigger.

Another alternative is to stand up our own server and collect logging.

We need to consider the cost of implementation, the cost of maintenance, and how well the implementation lines up with our overall test scenario (we want to use this as a control to compare with reaching a pod within a cluster under test).

We may want to also consider standing up similar endpoints in AWS and Azure in the future.

A separate story will cover monitoring the endpoint from within Origin

We want to capture the audit ID and log it when we receive an incoming request.
The audit ID could include the build ID, or another field could be used to correlate back to the job instance.
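
As a rough illustration of the "own server" alternative, the sketch below answers GET requests and logs the audit ID of each one. It makes several assumptions: the header name Audit-ID, the port, and all function and class names are hypothetical, and a Cloud Function implementation would look different.

```python
import json
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer

AUDIT_HEADER = "Audit-ID"  # assumed header name; the story does not specify one


def request_record(headers, path: str) -> dict:
    """Build the log record for one incoming request."""
    return {"audit_id": headers.get(AUDIT_HEADER, ""), "path": path}


class LivenessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Log the audit ID so the request can be correlated with a job run.
        logging.info(json.dumps(request_record(self.headers, self.path)))
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # suppress the default access log; request_record is logged instead


def serve(port: int = 8080) -> None:
    """Run the endpoint until interrupted."""
    logging.basicConfig(level=logging.INFO)
    HTTPServer(("", port), LivenessHandler).serve_forever()
```

Calling serve() starts the endpoint. If the audit ID embeds the build ID, the logged records can be joined back to the job instance.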

https://github.com/openshift/origin/pull/28583

Bug OCPBUGS-28607: Excessive revision history stored for HyperShift control plane components

Description of problem:

HyperShift-managed components use the default RevisionHistoryLimit of 10. This significantly impacts etcd load and scalability on the management cluster.
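
To get a feel for the scale, the steady-state upper bound on retained ReplicaSets can be estimated from the history limit. This is back-of-the-envelope arithmetic under stated assumptions: the number of Deployments per hosted control plane (30) and the lower limit (2) are hypothetical values chosen for illustration, not figures from this bug.

```python
def retained_replicasets(hosted_clusters: int, deployments_per_cluster: int,
                         revision_history_limit: int) -> int:
    """Upper bound: each Deployment keeps its current ReplicaSet plus the old ones."""
    return hosted_clusters * deployments_per_cluster * (1 + revision_history_limit)


# 375 hosted clusters, an assumed 30 Deployments per control plane.
print(retained_replicasets(375, 30, 10))  # default limit of 10
print(retained_replicasets(375, 30, 2))   # hypothetical lower limit
```

Under these assumptions the default limit allows up to 123750 ReplicaSet objects in the management cluster's etcd, against 33750 with a limit of 2.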

Version-Release number of selected component (if applicable):

4.9, 4.10, 4.11, 4.12, 4.13, 4.14, 4.15, 4.16

How reproducible:

100% (may vary depending on resource availability on management cluster)

Steps to Reproduce:

    1. Create 375+ HostedClusters
    2. Observe etcd performance on management cluster

Actual results:

etcd hitting storage space limits

Expected results:

Able to manage HyperShift control planes at scale (375+ HostedClusters)

https://github.com/openshift/hypershift/pull/3477

Bug OCPBUGS-30073: Upload Jar form's Clear button is not functioning

Description of problem:

The Clear button in the Upload JAR form is not working; the user needs to close the form in order to remove the previously selected JAR file.

How reproducible:

    Always

Steps to Reproduce:

    1. Open Upload Jar File form from Add Page
    2. Upload a JAR file
    3. Remove the JAR file by using the Clear button

Actual results:

    The selected JAR file is not removed even after using "Clear" button

Expected results:

    The "Clear" button should remove the selected file from the form.

https://github.com/openshift/console/pull/13644

Bug OCPBUGS-30483: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/198

Bug OCPBUGS-20209: [Multi-NIC]EgressIP was not moved to second egress node after first egress node unavailable

Description of problem:

Not reproducible manually, but it happens frequently when running automated scripts.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-10-05-195247

Steps to Reproduce:

1. Label the worker-0 node as an egress node and create an egressIP object. The egressIP is assigned to the worker-0 node successfully on the secondary NIC.

2. Block port 9107 on the worker-0 node and label worker-1 as an egress node.

Actual results:

EgressIP was not moved to second node
 % oc get egressip
NAME             EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-66330   172.22.0.196


40m         Warning   EgressIPConflict          egressip/egressip-66330       Egress IP egressip-66330 with IP 172.22.0.196 is conflicting with a host (worker-0) IP address and will not be assigned
sh-4.4# ip a show enp1s0
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:1c:cf:40:5d:25 brd ff:ff:ff:ff:ff:ff
    inet 172.22.0.109/24 brd 172.22.0.255 scope global dynamic noprefixroute enp1s0
       valid_lft 76sec preferred_lft 76sec
    inet6 fe80::21c:cfff:fe40:5d25/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

Expected results:

EgressIP should move to second egress node

Additional info:

Workaround: deleting and recreating the EgressIP object works.
% oc get egressip
NAME             EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-66330   172.22.0.196                   
% oc delete egressip --all
egressip.k8s.ovn.org "egressip-66330" deleted
 % oc create -f ../data/egressip/config1.yaml 
egressip.k8s.ovn.org/egressip-3 created
% oc get egressip
NAME         EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-3   172.22.0.196   worker-1        172.22.0.196

https://github.com/openshift/ovn-kubernetes/pull/2048

Bug OCPBUGS-29757: Devconsole internet proxy failed on airgapped cluster

Description of problem:

    Tekton Results API endpoint failed to fetch data on airgapped cluster.

https://github.com/openshift/console/pull/13619

Bug OCPBUGS-25628: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-gcp/pull/216

Bug OCPBUGS-27210: Reminder to revert openshift/origin/pull#28522 (http2 test skips) before OCP 4.16 goes GA

https://github.com/openshift/origin/pull/28522 removes two tests related to http2 testing with the default certificate that are now known to fail with HAProxy 2.8. We are reworking the tests as part of ~~NE-1444~~ (HAProxy 2.8 bump).

This bug is a reminder that come OCP 4.16 GA we need to have reworked the tests so that they now pass with HAProxy 2.8 or, if not fixed, revert https://github.com/openshift/origin/pull/28522 which is why I'm marking this bug as a blocker. We do not want to ship 4.16 without reinstating the two tests.

The goal of removing the two tests in https://github.com/openshift/origin/pull/28522 is to allow us to make additional progress in https://github.com/openshift/router/pull/551 (which is our HAProxy 2.8 bump). With all tests passing in router#551 we can continue our assessment of HAProxy 2.8 by a) running the payload tests and b) creating a HAProxy 2.8 image that QE can use with their reliability test suite.

https://github.com/openshift/origin/pull/28540

Bug OCPBUGS-27881: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/machine-api-provider-powervs/pull/74

Bug OCPBUGS-25610: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-30836: Power VS: DHCP service is not dependent on wait_for_workspace

Description of problem:

    When deploying a cluster on Power VS, you need to wait for a short period after the workspace is created to facilitate the network configuration. This period is ignored by the DHCP service.

Version-Release number of selected component (if applicable):

How reproducible:

    Easily

Steps to Reproduce:

    1. Deploy a cluster on Power VS with an installer provisioned workspace
    2. Observe that the terraform logs ignore the waiting period

Actual results:

Expected results:

Additional info:

https://github.com/openshift/installer/pull/8145

Bug OCPBUGS-24280: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/13421

Bug OCPBUGS-24858: Update 4.16 ose-kube-metrics-server-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/kubernetes-metrics-server/pull/21

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/kubernetes-metrics-server/pull/21

Bug OCPBUGS-25093: Update 4.16 csi-livenessprobe-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/56

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-livenessprobe/pull/56

Bug OCPBUGS-26125: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-vsphere/pull/61

Bug OCPBUGS-26772: Wrong disk size filled in and cannot be changed when cloning a pvc in the UI

Description of problem:

When cloning a PVC of 60 GiB, the system autofills the remote size as 8192 PiB. This size cannot be changed in the UI before starting the clone.

Version-Release number of selected component (if applicable):

CNV - 4.14.3

How reproducible:

always

Steps to Reproduce:

1. Create a VM with a PVC of 60 GiB
2. Power off the VM
3. As a cluster admin, clone the 60 GiB PVC (Storage -> PersistentVolumeClaims -> Kebab menu next to the PVC)

Actual results:

The system tries to clone the 60 GiB PVC as an 8192 PiB PVC.

Expected results:

A new PVC of 60 GiB.

Additional info:

This seems like the closed BZ 2177979. I will upload a screenshot of the UI.
Here is the yaml for the original pvc.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
    cdi.kubevirt.io/storage.contentType: kubevirt
    cdi.kubevirt.io/storage.pod.phase: Succeeded
    cdi.kubevirt.io/storage.populator.progress: 100.0%
    cdi.kubevirt.io/storage.preallocation.requested: "false"
    cdi.kubevirt.io/storage.usePopulator: "true"
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
    volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
  creationTimestamp: "2023-12-05T17:34:19Z"
  finalizers:
  - kubernetes.io/pvc-protection
  - provisioner.storage.kubernetes.io/cloning-protection
  labels:
    app: containerized-data-importer
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.14.0
    kubevirt.io/created-by: 60f46f91-2db3-4118-aaba-b1697b29c496
  name: win2k19-base
  namespace: base-images
  ownerReferences:
  - apiVersion: cdi.kubevirt.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: DataVolume
    name: win2k19-base
    uid: 8980e7b7-ce0b-47b4-a7e4-f4c79e984ebe
  resourceVersion: "697047"
  uid: fccb0aa9-8541-4b51-b49e-ddceaa22b68c
spec:
  accessModes:
  - ReadWriteMany
  dataSource:
    apiGroup: cdi.kubevirt.io
    kind: VolumeImportSource
    name: volume-import-source-8980e7b7-ce0b-47b4-a7e4-f4c79e984ebe
  dataSourceRef:
    apiGroup: cdi.kubevirt.io
    kind: VolumeImportSource
    name: volume-import-source-8980e7b7-ce0b-47b4-a7e4-f4c79e984ebe
  resources:
    requests:
      storage: "64424509440"
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Block
  volumeName: pvc-dbfc9fe9-5677-469d-9402-c2f3a22dab3f
status:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 60Gi
  phase: Bound



Here is the yaml for the cloning pvc.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
    volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
  creationTimestamp: "2023-12-06T14:24:07Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: win2k19-base-clone
  namespace: base-images
  resourceVersion: "1551054"
  uid: f72665c3-6408-4129-82a2-e663d8ecc0cc
spec:
  accessModes:
  - ReadWriteMany
  dataSource:
    apiGroup: ""
    kind: PersistentVolumeClaim
    name: win2k19-base
  dataSourceRef:
    apiGroup: ""
    kind: PersistentVolumeClaim
    name: win2k19-base
  resources:
    requests:
      storage: "9223372036854775807"
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Block
status:
  phase: Pending
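
The two `storage` requests in the manifests above can be checked with a little arithmetic. The sketch below is illustrative only (the helper name `to_binary_units` is hypothetical and is not taken from the console source). It converts a byte count into binary units, which suggests that the source request is exactly 60 GiB while the clone request is the largest signed 64-bit integer, which displays as 8192 when expressed in pebibytes.

```python
# Hypothetical helper for illustration; not console code.
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]


def to_binary_units(num_bytes: int) -> str:
    """Render a byte count using the largest binary unit, capped at PiB."""
    value = float(num_bytes)
    unit = UNITS[0]
    for unit in UNITS:
        # Stop once the value fits the current unit, or at the last unit.
        if value < 1024 or unit == UNITS[-1]:
            break
        value /= 1024
    return f"{value:g} {unit}"


source_request = 64424509440          # storage request of win2k19-base
clone_request = 9223372036854775807   # storage request of win2k19-base-clone

print(to_binary_units(source_request))
print(to_binary_units(clone_request))
print(clone_request == 2**63 - 1)
```

The clone value being exactly `2**63 - 1` hints at an overflow or a "maximum" sentinel somewhere in the size calculation rather than a plain unit mix-up, though the report itself does not state the root cause.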

https://github.com/openshift/console/pull/13503

Bug OCPBUGS-25197: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/aws-ebs-csi-driver/pull/249

Bug OCPBUGS-19303: OKD: Agent-based Installer is broken on OKD/FCOS

Description of problem:

OKD/FCOS uses FCOS for its bootimage which lacks several tools and services such as oc and crio that the rendezvous host of the Agent-based Installer needs to set up a bootstrap control plane.

Version-Release number of selected component (if applicable):

4.13.0
4.14.0
4.15.0

https://github.com/openshift/installer/pull/7484

Bug OCPBUGS-24486: cloud-provider-azure: drop temporary HealthProbe downstream carry

We added a carry patch to change the healthcheck behaviour in the Azure CCM: https://github.com/openshift/cloud-provider-azure/pull/72 and whilst we opened an upstream twin PR for that https://github.com/kubernetes-sigs/cloud-provider-azure/pull/3887 it got closed in favour of a different approach https://github.com/kubernetes-sigs/cloud-provider-azure/pull/4891 .

As such in the next rebase we need to drop the commit introduced in 72, in favour of downstreaming through the rebase the change in 4891. While doing that we need to explicitly set the new probe behaviour, as the default is still the classic behaviour, which doesn't work with our cluster architecture setup on Azure.

For the steps on how to do this, we can follow this comment: https://github.com/openshift/cloud-provider-azure/pull/88#issuecomment-1803832076

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/326

Bug OCPBUGS-25013: Update 4.16 ose-powervs-cloud-controller-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cloud-provider-powervs/pull/62

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-provider-powervs/pull/62

Bug OCPBUGS-25082: Update 4.16 csi-provisioner-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/80

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-provisioner/pull/80

Bug OCPBUGS-27892: panic in poller

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/origin/pull/28544

Bug OCPBUGS-22487: certificate signed by unknown authority while uninstalling operators from console.

Description of problem:

The customer has a custom apiserver certificate.

This error can be found while trying to uninstall any operator by console:

openshift-console/pods/console-56494b7977-d7r76/console/console/logs/current.log:

2023-10-24T14:13:21.797447921+07:00 E1024 07:13:21.797400 1 operands_handler.go:67] Failed to get new client for listing operands: Get "https://api.<cluster>.<domain>:6443/api?timeout=32s": x509: certificate signed by unknown authority

when trying the same request from the console pod we can see no issue.

We see the root ca that signs apiserver certificate and this CA is trusted in the pod.

It seems the code that provokes this issue is:

https://github.com/openshift/console/blob/master/pkg/server/operands_handler.go#L62-L70

https://github.com/openshift/console/pull/13632

Bug OCPBUGS-24261: Use better update strategy for konnectivity daemonset

Currently the konnectivity agent has the following update strategy:

```
updateStrategy:
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 0
```

We (IBM) suggest updating it to the following:
```
updateStrategy:
  rollingUpdate:
    maxUnavailable: 10%
  type: RollingUpdate
```

In a big cluster, it would speed up the konnectivity-agent update. As the agents are independent, this would not hurt the service.
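
As a rough illustration of the speed-up, the sketch below counts how many update batches a DaemonSet rollout would need for a given `maxUnavailable`. It assumes that every batch uses the full allowance and that a percentage is rounded up to a whole number of pods, which is how the Kubernetes documentation describes DaemonSet `maxUnavailable`. The function names and the 500-node cluster are made up for this example.

```python
def max_unavailable_pods(setting: str, desired_pods: int) -> int:
    """Resolve an absolute or percentage maxUnavailable to a pod count."""
    if setting.endswith("%"):
        percent = int(setting[:-1])
        # Integer ceiling of desired_pods * percent / 100.
        pods = (desired_pods * percent + 99) // 100
    else:
        pods = int(setting)
    # Guard against a zero batch size in this sketch.
    return max(pods, 1)


def rollout_batches(setting: str, desired_pods: int) -> int:
    """Number of batches needed if every batch replaces the full allowance."""
    batch = max_unavailable_pods(setting, desired_pods)
    return (desired_pods + batch - 1) // batch


nodes = 500  # hypothetical cluster size
print(rollout_batches("1", nodes))    # one agent at a time
print(rollout_batches("10%", nodes))  # a tenth of the agents at a time
```

Under these assumptions the number of batches no longer grows with cluster size once a percentage is used, which is the point of the proposal.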

https://github.com/openshift/hypershift/pull/3294

Bug OCPBUGS-25634: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-operator/pull/33

Bug OCPBUGS-16814: The web console's JS client is using stale tokens

Description of problem:

Starting with OpenShift 4.8 (https://docs.openshift.com/container-platform/4.8/release_notes/ocp-4-8-release-notes.html#ocp-4-8-notable-technical-changes), all pods get bound SA tokens.

Currently, instead of expiring the token, we use the `service-account-extend-token-expiration` option, which extends a bound token's validity to 1 year and warns when a token is used that would otherwise have expired.

We want to disable this behavior in a future OpenShift release, which would break the OpenShift web console.

Version-Release number of selected component (if applicable):

4.8 - 4.14

How reproducible:

100%

Steps to Reproduce:

1. install a fresh cluster
2. wait ~1hr since console pods were deployed for the token rotation to occur
3. log in to the console and click around
4. check the kube-apiserver audit logs events for the "authentication.k8s.io/stale-token" annotation

Actual results:

Many occurrences (I doubt I'll be able to upload a text file, so I'll show a few audit events in the first comment).

Expected results:

The web-console re-reads the SA token regularly so that it never uses an expired token

Additional info:

In a theoretical case where a console pod lasts for a year, it's going to break and won't be able to authenticate to the kube-apiserver.

We are planning on disallowing the use of stale tokens in a future release and we need to make sure that the core platform is not broken so that the metrics we collect from the clusters in the wild are not polluted.
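
One way to meet the expected result is to re-read the projected token file whenever the cached copy is older than some interval, instead of reading it once at startup. The Python sketch below only illustrates that idea; it is not the console's implementation, and the class name, the 60-second interval and the injectable clock are assumptions made for this example. The default path is the standard in-pod service account token location.

```python
import time
from pathlib import Path

DEFAULT_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


class RefreshingToken:
    """Caches a token file and re-reads it once the cache is too old."""

    def __init__(self, path=DEFAULT_TOKEN_PATH, max_age_seconds=60.0,
                 clock=time.monotonic):
        self._path = Path(path)
        self._max_age = max_age_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._cached = None
        self._read_at = None

    def get(self) -> str:
        now = self._clock()
        stale = self._cached is None or now - self._read_at >= self._max_age
        if stale:
            # The kubelet rotates the file in place; pick up the new content.
            self._cached = self._path.read_text().strip()
            self._read_at = now
        return self._cached
```

A caller would build the `Authorization: Bearer ...` header from `get()` on every request, so a rotated token is picked up within one refresh interval.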

https://github.com/openshift/console/pull/13321

Bug OCPBUGS-25484: Install failure for console operator

Description of problem:

Reviewing 4.15 Install failures (install should succeed: overall) there are a number of variants impacted by recent install failures.

search.ci: Cluster operator console is not available

Jobs like periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-serial show installation failures due to console-operator that appear to start with 4.15.0-0.nightly-2023-12-07-225558.

ConsoleOperator reconciliation failed: Operation cannot be fulfilled on consoles.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again

4.15.0-0.nightly-2023-12-07-225558 contains console-operator/pull/814, noting in case it is related

Version-Release number of selected component (if applicable):

 4.15

How reproducible:

Steps to Reproduce:

    1. Review link to install failures above
    2.
    3.

Actual results:

Expected results:

Additional info:
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-sdn
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade

Bug OCPBUGS-24573: [vsphere]keepalived still running on worker nodes even using usermanaged ELB

Description of problem:

 Set up a cluster on vSphere with a user-managed ELB, using this install-config.yaml:

    apiVIPs:
      - 10.38.153.2
    ingressVIPs:
      - 10.38.153.3
    loadBalancer:
      type: UserManaged
networking:
  machineNetwork:
    - cidr: "10.38.153.0/25"
featureSet: TechPreviewNoUpgrade

After the cluster started, keepalived was still running on the worker nodes.

omc get pod -n openshift-vsphere-infra
NAME                                                   READY   STATUS    RESTARTS   AGE
coredns-ci-op-2kch7ldp-72b07-7l4vs-master-0            2/2     Running   0          1h
coredns-ci-op-2kch7ldp-72b07-7l4vs-master-1            2/2     Running   0          59m
coredns-ci-op-2kch7ldp-72b07-7l4vs-master-2            2/2     Running   0          59m
coredns-ci-op-2kch7ldp-72b07-7l4vs-worker-0-tqc74      2/2     Running   0          39m
coredns-ci-op-2kch7ldp-72b07-7l4vs-worker-1-s654k      2/2     Running   0          37m
keepalived-ci-op-2kch7ldp-72b07-7l4vs-worker-0-tqc74   2/2     Running   0          39m
keepalived-ci-op-2kch7ldp-72b07-7l4vs-worker-1-s654k   2/2     Running   0          37m

Version-Release number of selected component (if applicable):

4.15

How reproducible:

always

Steps to Reproduce:

    1. Set up vSphere on a multi-subnet network with ELB, job
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/46458/rehearse-46458-periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-vsphere-ipi-zones-multisubnets-external-lb-usermanaged-f28/1732560150155235328
    2.
    3.

Actual results:

Expected results:

    keepalived should not be running on worker node.

Additional info:

    must-gather logs:  https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/pr-logs/pull/openshift_release/46458/rehearse-46458-periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-vsphere-ipi-zones-multisubnets-external-lb-usermanaged-f28/1732560150155235328/artifacts/vsphere-ipi-zones-multisubnets-external-lb-usermanaged-f28/gather-must-gather/artifacts/must-gather.tar

https://github.com/openshift/api/pull/1698

Bug OCPBUGS-25021: Update 4.16 ose-cli-artifacts-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/oc/pull/1627

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/oc/pull/1628

Bug OCPBUGS-27140: Update 4.16 operator-lifecycle-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/658

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/operator-framework-olm/pull/658

Bug OCPBUGS-24986: Update 4.16 ose-prometheus-adapter-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/k8s-prometheus-adapter/pull/98

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-27152: node status popover dialog always jumps to the first line

Description of problem:

Clicking any node status popover always moves the popover dialog to the first line.

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-01-14-100410

How reproducible:

Always

Steps to Reproduce:

1. Go to the Nodes list page and mark any worker as unschedulable
2. Click on the 'Ready/Scheduling disabled' text; a popover dialog opens
3.

Actual results:

2. the popover dialog will always jump to the first line, it looks like popover dialog is showing wrong status/info, also it's difficult for user to perform node actions

Expected results:

2. popover dialog should be shown exactly next to the correct node

Additional info:

https://github.com/openshift/console/pull/13524

Bug OCPBUGS-26173: Update 4.16 openshift-enterprise-registry-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/image-registry/pull/390

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/image-registry/pull/390

Bug OCPBUGS-29773: The hypershift installer does not set the cipher suites for konnectivity-server

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/hypershift/pull/3618

Story PODAUTO-91: Add joelsmith to OWNERS of openshift/kubernetes-autoscaler

This ticket is to satisfy the bot on GH

https://github.com/openshift/kubernetes-autoscaler/pull/273

Bug OCPBUGS-27732: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/oc/pull/1671

Bug OCPBUGS-24896: Update 4.16 ose-azure-cluster-api-controllers-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api-provider-azure/pull/293

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api-provider-azure/pull/293

Bug OCPBUGS-25548: Update 4.16 ose-csi-snapshot-controller-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/135

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-snapshotter/pull/135

Bug OCPBUGS-28601: webhook release payload validation introduces resource ordering error

Description of problem:

    A recently introduced validation in the HyperShift operator's webhook conflicts with the UI's ability to create HCP clusters. Previously, the pull secret did not have to be posted before an HC or NP, but it is now required first because it is used to validate the release image payload.

This issue is isolated to 4.15

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    100%. Attempt to post an HC before the pull secret is posted and the HC will be rejected.

The expected outcome is that it should be possible to post the pull secret for an HC after the HC is posted, and the controller should be eventually consistent with this change.

https://github.com/openshift/hypershift/pull/3484

Bug OCPBUGS-25843: The third link title doesn't show up on feedback modal

Description of problem:

On the customer feedback modal, there are 3 links for users to give feedback to Red Hat; the third link lacks a title.

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-21-155123

How reproducible:

Always

Steps to Reproduce:

    1. Log in to the admin console. Click on "?" -> "Share Feedback" and check the links on the modal
    2.
    3.

Actual results:

1. The third link lacks a link title (the link for "Learn about opportunities to ……").

Expected results:

1. The link title "Inform the direction of Red Hat" exists in 4.14; it should also exist in 4.15.

Additional info:

screenshot for 4.14 page: https://drive.google.com/file/d/19AnPlE0h9WwvIjxV0gLuf5x27jLN7TLS/view?usp=drive_link
screenshot for 4.15 page: https://drive.google.com/file/d/19MRjzNGRWfYnK-zcoMozh7Z7eaDDG2L-/view?usp=drive_link

https://github.com/openshift/console/pull/13483

Bug OCPBUGS-27469: "Deploy Image" with "Serverless Deployment", Scaling "Min Pods"/"Max Pods" should set "autoscaling.knative.dev/min-scale"/max-scale not minScale/maxScale

Description of problem:

Creating a Serverless Deployment with "Scaling" "Min Pods"/"Max Pods" options set, uses deprecated knative annotations "autoscaling.knative.dev/minScale" / "maxScale",

the correct current ones are "autoscaling.knative.dev/min-scale" / "max-scale"

The same problem with "autoscaling.knative.dev/targetUtilizationPercentage" , which should be "autoscaling.knative.dev/target-utilization-percentage"

Prerequisites (if any, like setup, operators/versions):

Serverless operator

Steps to Reproduce

Install serverless operator
Create KnativeServing in knative-serving namespace
create a test "foobar" namespace
Go to <console>/deploy-image/ns/foobar
Use gcr.io/knative-samples/helloworld-go as "Image name from external registry" (or any webserver image listening on :8080)
Choose "Serverless Deployment" for the "resource type"
Click on "Scaling" in "Click on the names to access advanced options for ..."
Set "2" for "Min Pods" and "3" for "Max Pods"
Create

Actual results:

The created ksvc resource has

spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: "3"
        autoscaling.knative.dev/minScale: "2"
        autoscaling.knative.dev/targetUtilizationPercentage: "70"

Expected results:

The created ksvc should have

spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/max-scale: "3"
        autoscaling.knative.dev/min-scale: "2"
        autoscaling.knative.dev/target-utilization-percentage: "70"
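
The rename from the deprecated camelCase keys to the current kebab-case keys can be expressed as a small mapping. The sketch below is illustrative Python, not the console's own code; it covers only the three annotations named in this report, and the function name `migrate_annotations` is hypothetical.

```python
# Deprecated key -> current key, limited to the annotations in this report.
RENAMED = {
    "autoscaling.knative.dev/minScale":
        "autoscaling.knative.dev/min-scale",
    "autoscaling.knative.dev/maxScale":
        "autoscaling.knative.dev/max-scale",
    "autoscaling.knative.dev/targetUtilizationPercentage":
        "autoscaling.knative.dev/target-utilization-percentage",
}


def migrate_annotations(annotations: dict) -> dict:
    """Return a copy of the annotations using only the current key names."""
    migrated = {}
    for key, value in annotations.items():
        new_key = RENAMED.get(key, key)
        # If both spellings are present, keep the value set under the current key.
        if new_key != key and new_key in annotations:
            continue
        migrated[new_key] = value
    return migrated


actual = {
    "autoscaling.knative.dev/maxScale": "3",
    "autoscaling.knative.dev/minScale": "2",
    "autoscaling.knative.dev/targetUtilizationPercentage": "70",
}
print(migrate_annotations(actual))
```

Annotations that are not in the mapping pass through unchanged, so the helper is safe to apply to a full annotation set.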

Reproducibility (Always/Intermittent/Only Once): Always

Build Details:

4.14.8

Workaround:

none required ATM, current serverless still supports the deprecated "minScale"/"maxScale" annotations.

Additional info:

https://docs.openshift.com/serverless/1.31/knative-serving/autoscaling/serverless-autoscaling-developer-scale-bounds.html

https://issues.redhat.com/browse/SRVKS-910

https://github.com/openshift/console/pull/13534

Bug OCPBUGS-28920: OCP 4.13.30 - allow-from-ingress NetworkPolicy does not consistently allow traffic from HostNetworked pods or from node IP's (packet timeout)

Description of problem:

- After upgrading to 4.13.30 (from 4.13.24), on all nodes/projects (replicated on two clusters that underwent the same upgrade), traffic routed from HostNetworked pods (router-default) to backends intermittently times out or fails to reach its destination.

This manifests as the router pods marking backends as DOWN and dropping traffic, but the behavior can be replicated with curl outside of the HAProxy pods by entering a debug shell on a host node (or via SSH) and curling the pod IP directly. A significant percentage of packets time out to the target backend on intermittent subsequent calls.
We narrowed the behavior down to the moment we applied the NetworkPolicy for `allow-from-ingress` as outlined below: the namespace immediately began to drop packets on a curl loop running from an infra node directly against the pod IP (some 2-3% of all calls timed out).

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress
  namespace: testing
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          policy-group.network.openshift.io/ingress: ""
  podSelector: {}
  policyTypes:
  - Ingress

Version-Release number of selected component (if applicable):

How reproducible:

every time, all namespaces with this network policy on this clusterversion (replicated on two clusters that underwent the same upgrade).

Steps to Reproduce:

1. Upgrade cluster to 4.13.30

2. Apply test pod running basic HTTP instance at random port

3. Apply networkpolicy to allow-from-ingress and begin curl loop against target pod directly from ingressnode (or other worker node) at host chroot level (nodeIP).

4. Observe that curls time out intermittently --> replicator curl loop is below (note inclusion of --connect-timeout flag to help allow loop to continue more rapidly without waiting for full 2m connect timeout on typical syn failure).

$ while true; do curl --connect-timeout 5 --noproxy '*' -k -w "dnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect: %{time_appconnect} | pretransfer: %{time_pretransfer} | starttransfer: %{time_starttransfer} | total: %{time_total} | size: %{size_download} | response: %{response_code}\n" -o /dev/null -s https://<POD>:<PORT>; done
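
To put a number on the intermittent failures, a bounded variant of the loop above can count how many connection attempts fail. The Python sketch below is a simplification and an assumption on our part: it only attempts a TCP connect with a timeout (no TLS or HTTP request, unlike the curl command), and the function name `failure_rate` and its defaults are hypothetical.

```python
import socket


def failure_rate(host: str, port: int, attempts: int = 100,
                 timeout: float = 5.0) -> float:
    """Return the fraction of TCP connection attempts that failed."""
    failures = 0
    for _ in range(attempts):
        try:
            # Mirrors curl's --connect-timeout: give up after `timeout` seconds.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            # Covers timeouts, refused connections and unreachable hosts.
            failures += 1
    return failures / attempts
```

Run from the node against the pod IP and port, a result in the region of 0.02 to 0.03 would correspond to the 2-3% of timed-out calls described in this report.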

Actual results:

- Traffic to all backends is dropped/degraded as a result of this intermittent failure marking valid/healthy pods as unavailable due to the connection failure to the backends.

Expected results:

- Traffic should not be impeded, especially when a NetworkPolicy that allows said traffic is applied.

Additional info:

This behavior began immediately after completed upgrade from 4.13.24 to 4.13.30 and has been replicated on two separate clusters.
The customer has been forced to reinstall a cluster at a downgraded version to ensure stability/deliverables for their user base, and this is a critical-impact outage scenario for them.

– additional required template details in first comment below.

RCA UPDATE:
The problem is that the host-network namespace is not labeled by the ingress controller, so if router pods are hostNetworked, a network policy with the `policy-group.network.openshift.io/ingress: ""` selector won't allow incoming connections. To reproduce, run the ingress controller with `EndpointPublishingStrategy=HostNetwork` (https://docs.openshift.com/container-platform/4.14/networking/nw-ingress-controller-endpoint-publishing-strategies.html) and then check the host-network namespace labels with

oc get ns openshift-host-network --show-labels
# expected this
kubernetes.io/metadata.name=openshift-host-network,network.openshift.io/policy-group=ingress,policy-group.network.openshift.io/host-network=,policy-group.network.openshift.io/ingress=

# but before the fix you will see 
kubernetes.io/metadata.name=openshift-host-network,policy-group.network.openshift.io/host-network=
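
The comparison between the expected and the pre-fix label sets can be scripted. The sketch below parses the comma-separated LABELS column printed by `--show-labels` and reports which expected labels are absent or have a different value. The expected set is copied from the output quoted above; the function names are hypothetical.

```python
# Expected labels on openshift-host-network, taken from the output above.
EXPECTED = {
    "network.openshift.io/policy-group": "ingress",
    "policy-group.network.openshift.io/host-network": "",
    "policy-group.network.openshift.io/ingress": "",
}


def parse_labels(column: str) -> dict:
    """Parse 'k1=v1,k2=,k3=v3' into a dict; empty values are kept."""
    labels = {}
    for item in column.strip().split(","):
        if not item:
            continue
        key, _, value = item.partition("=")
        labels[key] = value
    return labels


def missing_labels(column: str, expected=EXPECTED) -> dict:
    """Return the expected labels that are absent or differ in value."""
    actual = parse_labels(column)
    return {k: v for k, v in expected.items() if actual.get(k) != v}


before_fix = ("kubernetes.io/metadata.name=openshift-host-network,"
              "policy-group.network.openshift.io/host-network=")
print(missing_labels(before_fix))
```

An empty result means the namespace carries every label the NetworkPolicy selector relies on.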

Another way to verify this is the same problem (disruptive, only recommended for test environments) is to make CNO unmanaged

oc scale deployment cluster-version-operator -n openshift-cluster-version --replicas=0
oc scale deployment network-operator -n openshift-network-operator --replicas=0

and then label the openshift-host-network namespace manually with the expected labels shown above, and check whether the problem disappears.

Potentially affected versions (may need to reproduce to confirm)

4.16.0, 4.15.0, 4.14.0 since https://issues.redhat.com/browse/OCPBUGS-8070

4.13.30 https://issues.redhat.com/browse/OCPBUGS-22293

4.12.48 https://issues.redhat.com/browse/OCPBUGS-24039

Mitigation/support KCS:
https://access.redhat.com/solutions/7055050

https://github.com/openshift/cluster-network-operator/pull/2259

Bug OCPBUGS-24723: Update 4.16 openshift-enterprise-base-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/images/pull/154

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/images/pull/154

Bug OCPBUGS-25587: Update 4.16 ose-cluster-kube-controller-manager-operator-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/779

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/780

Bug OCPBUGS-28540: Should contain oc.rhel8 for 4.16 ocp payload

View the Description View the linked PRs

Description of problem:

The 4.16 OCP payload contains only oc.rhel9; oc.rhel8 cannot be found.
oc adm release extract --command='oc.rhel8'  registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-01-24-133352 --command-os=linux/amd64  -a tmp/config.json --to octest/
error: image did not contain usr/share/openshift/linux_amd64/oc.rhel8

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

always

Steps to Reproduce:

    1.oc adm release extract --command='oc.rhel8'  registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-01-24-133352 --command-os=linux/amd64  -a tmp/config.json --to octest/
    2.
    3.

Actual results:

 failed to extract oc.rhel8

Expected results:

 The OCP payload should contain oc.rhel8.

Additional info:

https://github.com/openshift/oc/pull/1669

Bug OCPBUGS-30102: Add support to disable machine management components

View the Description View the linked PRs

Description of problem:

    For high scalability, we need an option to disable unused machine management control plane components.

Version-Release number of selected component (if applicable):

How reproducible:

    100%

Steps to Reproduce:

    1. Create HostedCluster/HostedControlPlane
    2. 
    3.

Actual results:

    Machine management components (cluster-api, machine-approver, auto-scaler, etc) are deployed

Expected results:

    There should be an option to disable them, because in some use cases they provide no utility.

Additional info:

https://github.com/openshift/hypershift/pull/3570

Bug OCPBUGS-18844: There is blank space at the bottom of the cluster settings page when an upgrade is in progress

View the Description View the linked PRs

Description of problem:

When a cluster upgrade is in progress, scrolling the sidebar to the bottom of the Cluster Settings page shows blank space at the bottom.

Version-Release number of selected component (if applicable):

upgrade 4.14.0-0.nightly-2023-09-09-164123 to 4.14.0-0.nightly-2023-09-10-184037

How reproducible:

Always

Steps to Reproduce:

1. Launch a 4.14 cluster, trigger an upgrade.
2. Go to "Cluster Settings"->"Details" page during upgrade, scroll down the right sidebar to the bottom.
3.

Actual results:

2. It's blank at the bottom

Expected results:

2. Should not show blank.

Additional info:

screenshot: ~~https://drive.google.com/drive/folders/1DenrQTX7K0chbs9hG9ZbSZyY-viRRy1k?ths=true~~
https://drive.google.com/drive/folders/10dgToTxZf7gOfmL2Mp5gAMVQnM06XvAf?usp=sharing

https://github.com/openshift/console/pull/13453

Bug OCPBUGS-22894: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/13634

Bug OCPBUGS-24810: Update 4.16 ose-containernetworking-plugins-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/containernetworking-plugins/pull/142

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-27957: Should not collect the previous.log which not corresponding with the --since/--since-time

View the Description View the linked PRs

Description of problem:

The `oc adm inspect` command should not collect a previous.log that does not correspond to the --since/--since-time value.

Version-Release number of selected component (if applicable):

  4.16

How reproducible:

    always

Steps to Reproduce:

    1.  `oc adm inspect --since-time="2024-01-25T01:35:27Z"  ns/openshift-multus`

Actual results:

    The previous.log is also collected even though it does not correspond to the specified time.

Expected results:

    Only collect the logs after the --since/--since-time.
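The expected behavior can be illustrated with a small sketch. This is not the `oc` implementation; it assumes, purely for illustration, that each log file is summarized by the timestamp of its newest entry, and the file timestamps below are hypothetical. Only the `--since-time` value comes from the reproduction step.

```python
from datetime import datetime, timezone


def parse_rfc3339(ts):
    """Parse an RFC 3339 UTC timestamp such as 2024-01-25T01:35:27Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def should_collect(newest_entry_time, since_time):
    """Collect a log only if its newest entry is at or after since_time."""
    return parse_rfc3339(newest_entry_time) >= parse_rfc3339(since_time)


since = "2024-01-25T01:35:27Z"  # value from the reproduction step

# Hypothetical newest-entry timestamps for a container's log files.
logs = {
    "current.log": "2024-01-25T02:00:00Z",
    "previous.log": "2024-01-24T23:10:00Z",
}

collected = [name for name, newest in logs.items() if should_collect(newest, since)]
print(collected)
```

With these assumed timestamps, previous.log predates the requested window and is skipped.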

Additional info:

https://github.com/openshift/oc/pull/1666

Story OSASINFRA-3368: Revendor CAPO in MAPO

View the Description View the linked PRs

We want to use the latest version of CAPO in MAPO. We need to revendor CAPO to version 0.9 before the 4.16 FF.

There are several API changes that might require Matt's help.

https://github.com/openshift/machine-api-provider-openstack/pull/103

Bug OCPBUGS-28982: oauthclients degraded condition never gets removed

View the Description View the linked PRs

Description of problem:

The oauthclients degraded condition never gets removed: once it is set due to an issue on a cluster, it won't be unset.

Version-Release number of selected component (if applicable):

How reproducible:

Sporadically, when the AuthStatusHandlerFailedApply condition is set on the console operator status conditions.

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/console-operator/pull/855

Bug OCPBUGS-20220: [Multi-NIC]Egress traffic connection got timeout after remove another pod label

View the Description View the linked PRs

Description of problem:

[Multi-NIC] Egress traffic connections time out after the label is removed from another pod in the same namespace.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-10-08-024357

How reproducible:

Always

Steps to Reproduce:

1. Label one node as egress node
2. Create an egressIP object, egressIP was assigned to egress node secondary interface
# oc get egressip -o yaml
apiVersion: v1
items:
- apiVersion: k8s.ovn.org/v1
  kind: EgressIP
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"k8s.ovn.org/v1","kind":"EgressIP","metadata":{"annotations":{},"name":"egressip-66293"},"spec":{"egressIPs":["172.22.0.190"],"namespaceSelector":{"matchLabels":{"org":"qe"}},"podSelector":{"matchLabels":{"color":"pink"}}}}
    creationTimestamp: "2023-10-08T07:28:04Z"
    generation: 2
    name: egressip-66293
    resourceVersion: "461590"
    uid: f1ca3483-63f1-4f31-99b0-e6a55161c285
  spec:
    egressIPs:
    - 172.22.0.190
    namespaceSelector:
      matchLabels:
        org: qe
    podSelector:
      matchLabels:
        color: pink
  status:
    items:
    - egressIP: 172.22.0.190
      node: worker-0
kind: List
metadata:
  resourceVersion: ""
3. Create a namespace and two pods under it.
% oc get pods -n hrw -o wide
NAME         READY   STATUS    RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
hello-pod    1/1     Running   0          6m46s   10.129.2.7    worker-1   <none>           <none>
hello-pod1   1/1     Running   0          6s      10.131.0.14   worker-0   <none>           <none>

4. Add label org=qe to namespace hrw
# oc get ns hrw --show-labels
NAME   STATUS   AGE   LABELS
hrw    Active   21m   kubernetes.io/metadata.name=hrw,org=qe,pod-security.kubernetes.io/audit-version=v1.24,pod-security.kubernetes.io/audit=restricted,pod-security.kubernetes.io/warn-version=v1.24,pod-security.kubernetes.io/warn=restricted

5. At this point, access to the external endpoint succeeds from both pods.
% oc rsh -n hrw hello-pod 
~ $ curl 172.22.0.1 --connect-timeout 5
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL was not found on this server.</p>
</body></html>
~ $ exit
 % oc rsh -n hrw hello-pod1 
~ $ curl 172.22.0.1 --connect-timeout 5
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL was not found on this server.</p>
</body></html>

6. Add label color=pink to both pods
 % oc label pod hello-pod color=pink -n hrw
pod/hello-pod labeled
 % oc label pod hello-pod1 color=pink -n hrw
pod/hello-pod1 labeled

7. Both pods can access external endpoint.
8. Remove label color=pink from pod hello-pod
% oc label pod hello-pod color- -n hrw     
pod/hello-pod unlabeled

Actual results:


Accessing the external endpoint from the pod that kept the label results in a connect timeout
 % oc rsh -n hrw hello-pod1            
~ $ curl 172.22.0.1 --connect-timeout 5
curl: (28) Connection timeout after 5000 ms
~ $ 
~ $ 
~ $ curl 172.22.0.1 --connect-timeout 5
curl: (28) Connection timeout after 5000 ms

Note that the label was removed from hello-pod, but the timeout occurs when accessing the external endpoint from the other pod, hello-pod1, which should still use the egressIP and be able to reach the endpoint.

Expected results:

Should be able to access external endpoint
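The expectation follows from how the EgressIP selectors in step 2 are evaluated per pod. A minimal sketch of that matching, assuming simple `matchLabels` semantics; the function names are hypothetical and this is not the OVN-Kubernetes code:

```python
def matches(selector, labels):
    """True if every matchLabels entry is present in labels with the same value."""
    return all(labels.get(key) == value for key, value in selector.items())


def egressip_pods(namespace_selector, pod_selector, namespace_labels, pods):
    """Return the names of pods served by the EgressIP, evaluated pod by pod."""
    if not matches(namespace_selector, namespace_labels):
        return []
    return sorted(
        name for name, labels in pods.items() if matches(pod_selector, labels)
    )


ns_selector = {"org": "qe"}        # namespaceSelector from the EgressIP object
pod_selector = {"color": "pink"}   # podSelector from the EgressIP object
ns_labels = {"kubernetes.io/metadata.name": "hrw", "org": "qe"}

# State after step 8: color=pink was removed from hello-pod only.
pods = {"hello-pod": {}, "hello-pod1": {"color": "pink"}}

print(egressip_pods(ns_selector, pod_selector, ns_labels, pods))
```

Removing the label from hello-pod changes only that pod's match result; hello-pod1 still matches both selectors, so its egress traffic should keep working.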

Additional info:

https://github.com/openshift/ovn-kubernetes/pull/2048

Bug OCPBUGS-23511: CPO needs to filter out IAM Role paths when modifying the allowed principals of a VPC Endpoint Service

View the Description View the linked PRs

Description of problem:

When creating an IAM role with a "path" (https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_identifiers.html#identifiers-friendly-names), its "principal" when applied to trust policies or VPC Endpoint Service allowed principals confusingly does not include the path. That is, for the following rolesRef on a hostedcluster:

spec:
  platform:
    aws:
      rolesRef:
        controlPlaneOperatorARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-control-plane-operator
        imageRegistryARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-openshift-image-registry-installer-cloud-crede
        ingressARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-openshift-ingress-operator-cloud-credentials
        kubeCloudControllerARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-kube-controller-manager
        networkARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-openshift-cloud-network-config-controller-clou
        nodePoolManagementARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-capa-controller-manager
        storageARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-openshift-cluster-csi-drivers-ebs-cloud-creden

The actual valid principal that should be added to the VPC Endpoint Service's allowed principals is:

arn:aws:iam::765374464689:role/ad-int-path1-y4y2-kube-system-control-plane-operator

instead of

arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-control-plane-operator

However, for all other cases, the full ARN including the path should be used, e.g. https://github.com/openshift/hypershift/blob/082e880d0a492a357663d620fa58314a4a477730/hypershift-operator/controllers/hostedcluster/internal/platform/aws/aws.go#L237-L273
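The transformation described above, dropping the path only when building the allowed principal, can be sketched as follows. This is an illustrative Python helper with a hypothetical name, not the Go code in hypershift:

```python
def strip_role_path(role_arn):
    """Drop the IAM path from a role ARN, keeping only the role name."""
    prefix, sep, resource = role_arn.partition(":role/")
    if not sep:
        raise ValueError(f"not an IAM role ARN: {role_arn}")
    # The role name is the last segment; anything before it is the path.
    role_name = resource.rsplit("/", 1)[-1]
    return f"{prefix}:role/{role_name}"


with_path = ("arn:aws:iam::765374464689:role/adecorte/"
             "ad-int-path1-y4y2-kube-system-control-plane-operator")
print(strip_role_path(with_path))
```

An ARN that has no path is returned unchanged, so the helper is safe to apply to every `controlPlaneOperatorARN` before it is used as an allowed principal.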

Version-Release number of selected component (if applicable):

4.14.1

How reproducible:

100%

Steps to Reproduce:

ROSA HCP-specific steps:

1. rosa create account-roles --path /anything/ -m auto -y
2. rosa create cluster --hosted-cp
3. ...etc
4a. Observe on the hosted cluster AWS Account that the VPC Endpoint cannot be created with the error: 'failed to create vpc endpoint: InvalidServiceName'
4b. Observe on the management cluster that CPO is failing to update the VPC Endpoint Service's allowed principals with the error: Client.InvalidPrincipal
5. If the contents of .spec.platform.aws.rolesRef.controlPlaneOperatorARN are manually applied to the additional allowed principals with the path component removed, then the problems are largely fixed on the hosted cluster side. VPC Endpoint is created, worker nodes can spin up, etc.

Actual results:

The VPC Endpoint Service is attempting and failing to get this applied to its additional allowed principals:

arn:aws:iam::${ACCOUNT_ID}:role/path/name

Expected results:

The VPC Endpoint Service gets this applied to its additional allowed principals:

arn:aws:iam::${ACCOUNT_ID}:role/name

Additional info:

https://github.com/openshift/hypershift/pull/3215

Bug OCPBUGS-28827: Agent can't find 'boot.catalog' when creating agent.aarch64.iso

View the Description View the linked PRs

Description of problem:

openshift-install is unable to generate an aarch64 iso:
              
FATAL failed to write asset (Agent Installer ISO) to disk: missing boot.catalog file

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    100 %

Steps to Reproduce:

    1. Create an install_config.yaml with controlplane.architecture and compute.architecture = arm64 
    2. openshift-install agent create image  --log-level debug

Actual results:

DEBUG Generating Agent Installer ISO... 
INFO Consuming Install Config from target directory 
DEBUG Purging asset "Install Config" from disk 
INFO Consuming Agent Config from target directory 
DEBUG Purging asset "Agent Config" from disk 
DEBUG initDisk(): start 
DEBUG initDisk(): regular file 
FATAL failed to write asset (Agent Installer ISO) to disk: missing boot.catalog file

Expected results:

    agent.aarch64.iso is created

Additional info:

Seems to be related to this PR:
https://github.com/openshift/installer/pull/7896

boot.catalog is also referenced in the assisted-image-service here:
https://github.com/openshift/installer/blob/master/vendor/github.com/openshift/assisted-image-service/pkg/isoeditor/isoutil.go#L155

https://github.com/openshift/installer/pull/7972

Bug OCPBUGS-24473: when set a custom endpoint, the private IAM url would be overrode together for installing a ibmcloud cluster

View the Description View the linked PRs

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/installer/pull/7805

Bug OCPBUGS-25632: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-alibaba-cloud/pull/47

Bug TRT-1506: Find out what happened to single second disruption tests

View the Description View the linked PRs

These tests look to have been mistakenly deleted in 4.14 during the big monitortest refactor. We could use them right now to identify gcp jobs with spatter disruption.

[sig-network] there should be nearly zero single second disruptions for _
[sig-network] there should be reasonably few single second disruptions for _

Find out what happened and get them restored. Code is there but it looks like there are assumptions about extracting the backend name that may have been broken somewhere.

https://github.com/openshift/origin/pull/28592

Bug OCPBUGS-26143: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-gcp/pull/219

Bug OCPBUGS-27779: Wrong disk size filled in when expanding a pvc in the UI

View the Description View the linked PRs

Description of problem:

When expanding a PVC of unit-less size (e.g., '2147483648'), the Expand PersistentVolumeClaim modal populates the spinner with a unit-less value (e.g., 2147483648) instead of a meaningful value.

Version-Release number of selected component (if applicable):

CNV - 4.14.3

How reproducible:

always

Steps to Reproduce:

1.Create a PVC using the following YAML.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:   
  name: task-pv-claim
spec: 
  storageClassName: gp3-csi
  accessModes:     
    - ReadWriteOnce
  resources: 
    requests:       
      storage: "2147483648"
---
apiVersion: v1
kind: Pod
metadata:   
  name: task-pv-pod
spec:   
  securityContext:     
    runAsNonRoot: true
    seccompProfile:       
      type: RuntimeDefault
  volumes:     
    - name: task-pv-storage
      persistentVolumeClaim:         
        claimName: task-pv-claim
  containers:     
    - name: task-pv-container
      image: nginx
      ports:         
        - containerPort: 80
          name: "http-server"
      volumeMounts:         
        - mountPath: "/usr/share/nginx/html"
          name: task-pv-storage

2. From the newly created PVC details page, Click Actions > Expand PVC.
3. Note the value in the spinner input.

See https://drive.google.com/file/d/1toastX8rCBtUzx5M-83c9Xxe5iPA8fNQ/view for a demo
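To illustrate what a "meaningful value" could look like, here is a minimal sketch that converts a unit-less byte count into a binary unit. The helper name and the choice of units are assumptions; what the console actually displays is determined by the linked PR.

```python
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]


def to_binary_unit(size_bytes):
    """Convert a unit-less byte count to the largest binary unit that divides it evenly."""
    value = int(size_bytes)
    index = 0
    while value >= 1024 and value % 1024 == 0 and index < len(UNITS) - 1:
        value //= 1024
        index += 1
    return value, UNITS[index]


print(to_binary_unit("2147483648"))
```

For the PVC in step 1, this yields 2 GiB rather than the raw 2147483648.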

https://github.com/openshift/console/pull/13532

Bug OCPBUGS-29637: image-registry co is degraded on Azure MAG, Azure Stack Hub cloud or with azure workload identity

View the Description View the linked PRs

Description of problem:

When installing an IPI cluster from a 4.15 nightly build on Azure MAG or Azure Stack Hub, or with Azure workload identity, the image-registry cluster operator is degraded, with a different error in each case.

On MAG:
$ oc get co image-registry
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry   4.15.0-0.nightly-2024-02-16-235514   True        False         True       5h44m   AzurePathFixControllerDegraded: Migration failed: panic: Get "https://imageregistryjima41xvvww.blob.core.windows.net/jima415a-hfxfh-image-registry-vbibdmawmsvqckhvmmiwisebryohfbtm?comp=list&prefix=docker&restype=container": dial tcp: lookup imageregistryjima41xvvww.blob.core.windows.net on 172.30.0.10:53: no such host...

$ oc get pod -n openshift-image-registry
NAME                                               READY   STATUS    RESTARTS        AGE
azure-path-fix-ssn5w                               0/1     Error     0               5h47m
cluster-image-registry-operator-86cdf775c7-7brn6   1/1     Running   1 (5h50m ago)   5h58m
image-registry-5c6796b86d-46lvx                    1/1     Running   0               5h47m
image-registry-5c6796b86d-9st5d                    1/1     Running   0               5h47m
node-ca-48lsh                                      1/1     Running   0               5h44m
node-ca-5rrsl                                      1/1     Running   0               5h47m
node-ca-8sc92                                      1/1     Running   0               5h47m
node-ca-h6trz                                      1/1     Running   0               5h47m
node-ca-hm7s2                                      1/1     Running   0               5h47m
node-ca-z7tv8                                      1/1     Running   0               5h44m

$ oc logs azure-path-fix-ssn5w -n openshift-image-registry
panic: Get "https://imageregistryjima41xvvww.blob.core.windows.net/jima415a-hfxfh-image-registry-vbibdmawmsvqckhvmmiwisebryohfbtm?comp=list&prefix=docker&restype=container": dial tcp: lookup imageregistryjima41xvvww.blob.core.windows.net on 172.30.0.10:53: no such host
goroutine 1 [running]:
main.main()
    /go/src/github.com/openshift/cluster-image-registry-operator/cmd/move-blobs/main.go:49 +0x125

The blob storage endpoint appears to be incorrect; it should be:
$ az storage account show -n imageregistryjima41xvvww -g jima415a-hfxfh-rg --query primaryEndpoints
{
  "blob": "https://imageregistryjima41xvvww.blob.core.usgovcloudapi.net/",
  "dfs": "https://imageregistryjima41xvvww.dfs.core.usgovcloudapi.net/",
  "file": "https://imageregistryjima41xvvww.file.core.usgovcloudapi.net/",
  "internetEndpoints": null,
  "microsoftEndpoints": null,
  "queue": "https://imageregistryjima41xvvww.queue.core.usgovcloudapi.net/",
  "table": "https://imageregistryjima41xvvww.table.core.usgovcloudapi.net/",
  "web": "https://imageregistryjima41xvvww.z2.web.core.usgovcloudapi.net/"
}
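The mismatch is between the public-cloud storage suffix and the US Government one. A minimal sketch of selecting the suffix by cloud name, assuming the usual Azure cloud environment names; the function name is hypothetical, and Azure Stack Hub is not covered because its suffix is specific to each deployment:

```python
# Storage endpoint suffixes for well-known Azure clouds.
STORAGE_ENDPOINT_SUFFIX = {
    "AzurePublicCloud": "core.windows.net",
    "AzureUSGovernmentCloud": "core.usgovcloudapi.net",
    "AzureChinaCloud": "core.chinacloudapi.cn",
}


def blob_endpoint(account_name, cloud_name):
    """Build the blob endpoint URL for a storage account in the given cloud."""
    try:
        suffix = STORAGE_ENDPOINT_SUFFIX[cloud_name]
    except KeyError:
        raise ValueError(f"unknown cloud (custom suffix needed): {cloud_name}")
    return f"https://{account_name}.blob.{suffix}/"


print(blob_endpoint("imageregistryjima41xvvww", "AzureUSGovernmentCloud"))
```

For the MAG cluster above, this produces the `blob.core.usgovcloudapi.net` endpoint reported by `az storage account show`, not the `blob.core.windows.net` host that the migration job tried to resolve.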

On Azure Stack Hub:
$ oc get co image-registry
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry   4.15.0-0.nightly-2024-02-16-235514   True        False         True       3h32m   AzurePathFixControllerDegraded: Migration failed: panic: open : no such file or directory...

$ oc get pod -n openshift-image-registry
NAME                                               READY   STATUS    RESTARTS        AGE
azure-path-fix-8jdg7                               0/1     Error     0               3h35m
cluster-image-registry-operator-86cdf775c7-jwnd4   1/1     Running   1 (3h38m ago)   3h54m
image-registry-658669fbb4-llv8z                    1/1     Running   0               3h35m
image-registry-658669fbb4-lmfr6                    1/1     Running   0               3h35m
node-ca-2jkjx                                      1/1     Running   0               3h35m
node-ca-dcg2v                                      1/1     Running   0               3h35m
node-ca-q6xmn                                      1/1     Running   0               3h35m
node-ca-r46r2                                      1/1     Running   0               3h35m
node-ca-s8jkb                                      1/1     Running   0               3h35m
node-ca-ww6ql                                      1/1     Running   0               3h35m

$ oc logs azure-path-fix-8jdg7 -n openshift-image-registry
panic: open : no such file or directory
goroutine 1 [running]:
main.main()
    /go/src/github.com/openshift/cluster-image-registry-operator/cmd/move-blobs/main.go:36 +0x145

On cluster with Azure workload identity:
Some operator's PROGRESSING is True
image-registry                             4.15.0-0.nightly-2024-02-16-235514   True        True          False      43m     Progressing: The deployment has not completed...

The azure-path-fix pod is in CreateContainerConfigError status, and its events show the following error.

"state": {
    "waiting": {
        "message": "couldn't find key REGISTRY_STORAGE_AZURE_ACCOUNTKEY in Secret openshift-image-registry/image-registry-private-configuration",
        "reason": "CreateContainerConfigError"
    }
}

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-02-16-235514

How reproducible:

    Always

Steps to Reproduce:

    1. Install an IPI cluster on MAG or Azure Stack Hub, or configure Azure workload identity
    2.
    3.

Actual results:

    Installation failed and image-registry operator is degraded

Expected results:

    Installation is successful.

Additional info:

    This issue seems to be related to https://github.com/openshift/image-registry/pull/393

https://github.com/openshift/cluster-image-registry-operator/pull/1003

Bug OCPBUGS-13665: Egress firewall rules with 'nodeSelector' shall include all the IPs of the node

View the Description View the linked PRs

Description of problem:

Today, egress firewall rules with 'nodeSelector' use only the nodeIP in the OVN ACL rule. However, a node may have secondary IPs other than the nodeIP. The ACL should be created with all possible IPs of the selected node.
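As a rough illustration of the requested behavior, the sketch below gathers a node's primary and secondary IPs into one de-duplicated set for the ACL. The function name and the input shape are hypothetical, the addresses are documentation-range examples, and this is not how OVN-Kubernetes is implemented.

```python
import ipaddress


def acl_addresses(node_ip, secondary_ips):
    """Return the node IP plus all secondary IPs, de-duplicated and ordered."""
    addresses = {ipaddress.ip_address(node_ip)}
    addresses.update(ipaddress.ip_address(ip) for ip in secondary_ips)
    # Sort by IP version first so IPv4 and IPv6 addresses can be mixed safely.
    ordered = sorted(addresses, key=lambda addr: (addr.version, int(addr)))
    return [str(addr) for addr in ordered]


print(acl_addresses("192.0.2.10", ["198.51.100.7", "192.0.2.10", "2001:db8::1"]))
```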

Version-Release number of selected component (if applicable):

How reproducible:

Create an egress firewall rule with 'nodeSelector'

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/ovn-kubernetes/pull/2073

Story TRT-1420: console and auth operators failing to install in 4.16 intermittently

View the Description View the linked PRs

Last payload showed several occurrences of this problem seemingly surfacing the same way:

https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.nightly/release/4.16.0-0.nightly-2023-12-18-092716

Example jobs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-sdn-techpreview-serial/1736680969496170496

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-sdn-techpreview-serial/1736680969496170496 (blocked the payload)

Looking at sippy, both tests took a dip in pass rate on the 16th, meaning a regression may have merged late on the 15th (Friday) or sometime on the 16th (less likely).

console operator test 4.16

auth operator test 4.16

The problem kills the install and thus we are getting no intervals charts to help debug.

Bug OCPBUGS-28661: openshift/builder - replace 'coreydaley' with 'sayan-biswas' in OWNERS file

View the Description View the linked PRs

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/builder/pull/377

Bug OCPBUGS-24379: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/2170

Bug OCPBUGS-25890: there is flicker when clicking on perspective switcher after hard refresh

View the Description View the linked PRs

Description of problem:

When the user clicks the perspective switcher after a hard refresh, a flicker appears.

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-25-100326

How reproducible:

Always, after the user refreshes the console

Steps to Reproduce:

1. Log in to the OCP console
2. Refresh the whole console, then click the perspective switcher
3.

Actual results:

There is a flicker when clicking the perspective switcher

Expected results:

No flicker

Additional info:

screen recording https://drive.google.com/file/d/1_2tPZ0DXNTapFP9sSz27vKbnwxxdWZSV/view?usp=drive_link

https://github.com/openshift/console/pull/13500

Bug OCPBUGS-27853: OCP upgrade to nightly build failed on provider cluster - OVN-K fails to process annotation on live-migratable VM pods

View the Description View the linked PRs

Description of problem:

Upgrading OCP from 4.14.7 to 4.15.0 nightly build failed on Provider cluster which is part of provider-client setup.
Platform: IBM Cloud Bare Metal cluster.

Steps done:

Step 1.

$ oc patch clusterversions/version -p '{"spec":{"channel":"stable-4.15"}}' --type=merge
clusterversion.config.openshift.io/version patched

Step 2:
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-18-050837 --allow-explicit-upgrade --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requesting update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-18-050837

The cluster was not upgraded successfully.

 
$ oc get clusteroperator | grep -v "4.15.0-0.nightly-2024-01-18-050837   True        False         False"
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.15.0-0.nightly-2024-01-18-050837   True        False         True       111s    APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
console                                    4.15.0-0.nightly-2024-01-18-050837   False       False         False      111s    RouteHealthAvailable: console route is not admitted
dns                                        4.15.0-0.nightly-2024-01-18-050837   True        True          False      12d     DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5.\nHave 5 available node-resolver pods, want 6."
etcd                                       4.15.0-0.nightly-2024-01-18-050837   True        False         True       12d     EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:14147288297306253147 name:"baremetal2-06.qe.rh-ocs.com" peerURLs:"https://52.116.161.167:2380" clientURLs:"https://52.116.161.167:2379"  Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://52.116.161.167:2379]: context deadline exceeded} {Member:ID:15369339084089827159 name:"baremetal2-03.qe.rh-ocs.com" peerURLs:"https://52.116.161.164:2380" clientURLs:"https://52.116.161.164:2379"  Healthy:true Took:9.617293ms Error:<nil>} {Member:ID:17481226479420161008 name:"baremetal2-04.qe.rh-ocs.com" peerURLs:"https://52.116.161.165:2380" clientURLs:"https://52.116.161.165:2379"  Healthy:true Took:9.090133ms Error:<nil>}]...
image-registry                             4.15.0-0.nightly-2024-01-18-050837   True        True          False      12d     Progressing: All registry resources are removed...
machine-config                             4.14.7                               True        True          True       7d22h   Unable to apply 4.15.0-0.nightly-2024-01-18-050837: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, MachineConfigPool master has not progressed to latest configuration: controller version mismatch for rendered-master-9b7e02d956d965d0906def1426cb03b5 expected eaab8f3562b864ef0cc7758a6b19cc48c6d09ed8 has 7649b9274cde2fb50a61a579e3891c8ead2d79c5: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-34b4781f1a0fe7119765487c383afbb3, retrying]]
monitoring                                 4.15.0-0.nightly-2024-01-18-050837   False       True          True       7m54s   UpdatingUserWorkloadPrometheus: client rate limiter Wait returned an error: context deadline exceeded, UpdatingUserWorkloadThanosRuler: waiting for ThanosRuler object changes failed: waiting for Thanos Ruler openshift-user-workload-monitoring/user-workload: context deadline exceeded
network                                    4.15.0-0.nightly-2024-01-18-050837   True        True          False      12d     DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 2 nodes)...
node-tuning                                4.15.0-0.nightly-2024-01-18-050837   True        True          False      98m     Working towards "4.15.0-0.nightly-2024-01-18-050837"


$ oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-9b7e02d956d965d0906def1426cb03b5   False     True       True       3              0                   0                     1                      12d
worker   rendered-worker-4f54b43e9f934f0659761929f55201a1   False     True       True       3              1                   1                     1                      12d


$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.7    True        True          120m    Unable to apply 4.15.0-0.nightly-2024-01-18-050837: an unknown error has occurred: MultipleErrors


$ oc get nodes
NAME                          STATUS                     ROLES                         AGE   VERSION
baremetal2-01.qe.rh-ocs.com   Ready                      worker                        12d   v1.27.8+4fab27b
baremetal2-02.qe.rh-ocs.com   Ready                      worker                        12d   v1.27.8+4fab27b
baremetal2-03.qe.rh-ocs.com   Ready                      control-plane,master,worker   12d   v1.27.8+4fab27b
baremetal2-04.qe.rh-ocs.com   Ready                      control-plane,master,worker   12d   v1.27.8+4fab27b
baremetal2-05.qe.rh-ocs.com   Ready                      worker                        12d   v1.28.5+c84a6b8
baremetal2-06.qe.rh-ocs.com   Ready,SchedulingDisabled   control-plane,master,worker   12d   v1.27.8+4fab27b
----------------------------------------------------
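
The etcd condition in the output above reports a quorum of 2 with only 2 healthy members. As a rough illustration of why that is "not fault tolerant", the following is a minimal Python sketch of the usual majority-quorum arithmetic. The function names are hypothetical and are not taken from the etcd operator.

```python
def quorum(member_count: int) -> int:
    """Majority quorum for a cluster with member_count voting members."""
    return member_count // 2 + 1


def tolerable_failures(member_count: int, healthy: int) -> int:
    """How many more healthy members can fail before quorum is lost."""
    return max(healthy - quorum(member_count), 0)


# Three control-plane members, one of which is unhealthy.
print(quorum(3))                 # 2
print(tolerable_failures(3, 2))  # 0
print(tolerable_failures(3, 3))  # 1
```

With three members and two healthy, losing one more member would drop the cluster below quorum, which matches the degraded condition reported by the operator.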

During the efforts to bring the cluster back to a good state, these steps were done:
The node baremetal2-06.qe.rh-ocs.com was uncordoned.

Tried to upgrade using the command:

$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-22-051500 --allow-explicit-upgrade --force --allow-upgrade-with-warnings=true
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
warning: --allow-upgrade-with-warnings is bypassing: the cluster is already upgrading:  Reason: ClusterOperatorsDegraded
  Message: Unable to apply 4.15.0-0.nightly-2024-01-18-050837: wait has exceeded 40 minutes for these operators: etcd, kube-apiserver
Requesting update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-22-051500


The upgrade to 4.15.0-0.nightly-2024-01-22-051500 was also unsuccessful.
Node baremetal2-01.qe.rh-ocs.com was drained manually to see whether that would help.

Some clusteroperators stayed on the previous version. Some moved to the Degraded state.

$ oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-9b7e02d956d965d0906def1426cb03b5   False     True       False      3              1                   1                     0                      13d
worker   rendered-worker-4f54b43e9f934f0659761929f55201a1   False     True       True       3              1                   1                     1                      13d


$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     11d
rook-ceph-mon-pdb                                 N/A             1                 1                     11d
rook-ceph-osd                                     N/A             1                 1                     3h17m


$ oc rsh rook-ceph-tools-57fd4d4d68-p2psh ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                             STATUS  REWEIGHT  PRI-AFF
-1         5.23672  root default                                                   
-5         1.74557      host baremetal2-01-qe-rh-ocs-com                           
 1    ssd  0.87279          osd.1                             up   1.00000  1.00000
 4    ssd  0.87279          osd.4                             up   1.00000  1.00000
-7         1.74557      host baremetal2-02-qe-rh-ocs-com                           
 3    ssd  0.87279          osd.3                             up   1.00000  1.00000
 5    ssd  0.87279          osd.5                             up   1.00000  1.00000
-3         1.74557      host baremetal2-05-qe-rh-ocs-com                           
 0    ssd  0.87279          osd.0                             up   1.00000  1.00000
 2    ssd  0.87279          osd.2                             up   1.00000  1.00000


OCP must-gather logs - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/hcp414-aaa/hcp414-aaa_20240112T084548/logs/must-gather-ibm-bm2-provider/must-gather.local.1079362865726528648/

Version-Release number of selected component (if applicable):

Initial version:
OCP 4.14.7
ODF 4.14.4-5.fusion-hci
OpenShift Virtualization: kubevirt-hyperconverged-operator.4.16.0-380
Local Storage: local-storage-operator.v4.14.0-202312132033
OpenShift Data Foundation Client : ocs-client-operator.v4.14.4-5.fusion-hci

How reproducible:

Reporting the first occurrence of the issue.

Steps to Reproduce:

    1. On a provider-client HCI setup, upgrade the provider cluster to a nightly build of OCP.

Actual results:

    The OCP upgrade is not successful. Some operators become degraded. The worker machineconfigpool reports a degraded machine count of 1.

Expected results:

The OCP upgrade from 4.14.7 to the nightly build should succeed.

Additional info:

    There are 3 hosted clients present

https://github.com/openshift/ovn-kubernetes/pull/2063

Bug OCPBUGS-21865: oc-mirror doesn't respect the minVersion & maxVersion

oc-mirror - maxVersion of the imageset config is ignored for operators

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1. Create the two imageset configuration files used in this test:

_imageset-config-test1-1.yaml:_
~~~
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: /local/oc-mirror/test1/metadata
mirror:
  platform:
    architectures:
      - amd64
    graph: true
    channels:
      - name: stable-4.12
        type: ocp
        minVersion: 4.12.1
        maxVersion: 4.12.1
        shortestPath: true
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
      packages:
        - name: cincinnati-operator
          channels:
            - name: v1
              minVersion: 5.0.1
              maxVersion: 5.0.1
~~~

_imageset-config-test1-2.yaml:_
~~~
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: /local/oc-mirror/test1/metadata
mirror:
  platform:
    architectures:
      - amd64
    graph: true
    channels:
      - name: stable-4.12
        type: ocp
        minVersion: 4.12.1
        maxVersion: 4.12.1
        shortestPath: true
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
      packages:
        - name: cincinnati-operator
          channels:
            - name: v1
              minVersion: 5.0.1
              maxVersion: 5.0.1
        - name: local-storage-operator
          channels:
            - name: stable
              minVersion: 4.12.0-202305262042
              maxVersion: 4.12.0-202305262042
        - name: odf-operator
          channels:
            - name: stable-4.12
              minVersion: 4.12.4-rhodf
              maxVersion: 4.12.4-rhodf
        - name: rhsso-operator
          channels:
            - name: stable
              minVersion: 7.6.4-opr-002
              maxVersion: 7.6.4-opr-002
    - catalog: registry.redhat.io/redhat/redhat-marketplace-index:v4.12
      packages:
        - name: k10-kasten-operator-rhmp
          channels:
            - name: stable
              minVersion: 6.0.6
              maxVersion: 6.0.6
  additionalImages:
    - name: registry.redhat.io/rhel8/postgresql-13:1-125
~~~

2. Generate a first .tar file from the first imageset-config file (imageset-config-test1-1.yaml)
oc mirror --config=imageset-config-test1-1.yaml file:///local/oc-mirror/test1

3. Use the first .tar file to populate our registry
oc mirror --from=/root/oc-mirror/test1/mirror_seq1_000000.tar docker://registry-url/oc-mirror1

4. Generate a second .tar file from the second imageset-config file (imageset-config-test1-2.yaml)
oc mirror --config=imageset-config-test1-2.yaml file:///local/oc-mirror/test1

5. Populate the private registry named `oc-mirror1` with the second .tar file:

oc mirror --from=/root/oc-mirror/test1/mirror_seq2_000000.tar docker://registry-url/oc-mirror1

6. Check the catalog index for **odf** and **rhsso** operators

[root@test ~]# oc-mirror list operators --package odf-operator --catalog=registry-url/oc-mirror1/redhat/redhat-operator-index:v4.12 --channel stable-4.12
    VERSIONS
    4.12.7-rhodf
    4.12.8-rhodf
    4.12.4-rhodf
    4.12.5-rhodf
    4.12.6-rhodf

[root@test ~]# oc-mirror list operators --package rhsso-operator --catalog=registry-url/oc-mirror1/redhat/redhat-operator-index:v4.12 --channel stable
    VERSIONS
    7.6.4-opr-002
    7.6.4-opr-003
    7.6.5-opr-001
    7.6.5-opr-002

Actual results:

The catalog index for the **odf** and **rhsso** operators lists versions outside the configured range, so oc-mirror is not respecting minVersion and maxVersion:

[root@test ~]# oc-mirror list operators --package odf-operator --catalog=registry-url/oc-mirror1/redhat/redhat-operator-index:v4.12 --channel stable-4.12
    VERSIONS
    4.12.7-rhodf
    4.12.8-rhodf
    4.12.4-rhodf
    4.12.5-rhodf
    4.12.6-rhodf

[root@test ~]# oc-mirror list operators --package rhsso-operator --catalog=registry-url/oc-mirror1/redhat/redhat-operator-index:v4.12 --channel stable
    VERSIONS
    7.6.4-opr-002
    7.6.4-opr-003
    7.6.5-opr-001
    7.6.5-opr-002

Expected results:

oc-mirror should respect the minVersion & maxVersion

[root@test ~]# oc-mirror list operators --package odf-operator --catalog=registry-url/oc-mirror2/redhat/redhat-operator-index:v4.12 --channel stable-4.12
    VERSIONS
    4.12.4-rhodf

[root@test ~]# oc-mirror list operators --package rhsso-operator --catalog=registry-url/oc-mirror2/redhat/redhat-operator-index:v4.12 --channel stable
    VERSIONS
    7.6.4-opr-002
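
To make the expected behavior concrete, the following is a minimal Python sketch of inclusive minVersion/maxVersion filtering over the version lists shown above. It is not oc-mirror's implementation: `version_key` and `filter_versions` are hypothetical helpers, and the suffix after the dash is compared as a plain string rather than by semantic-versioning precedence, which is an assumption made only to keep the example short.

```python
import re

_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-(.+))?$")


def version_key(version: str):
    """Split 'X.Y.Z-suffix' into a sortable (X, Y, Z, suffix) tuple."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, patch, suffix = match.groups()
    return (int(major), int(minor), int(patch), suffix or "")


def filter_versions(versions, min_version, max_version):
    """Keep only versions inside the inclusive [min_version, max_version] range."""
    low, high = version_key(min_version), version_key(max_version)
    return [v for v in versions if low <= version_key(v) <= high]


odf = ["4.12.7-rhodf", "4.12.8-rhodf", "4.12.4-rhodf", "4.12.5-rhodf", "4.12.6-rhodf"]
rhsso = ["7.6.4-opr-002", "7.6.4-opr-003", "7.6.5-opr-001", "7.6.5-opr-002"]

print(filter_versions(odf, "4.12.4-rhodf", "4.12.4-rhodf"))       # ['4.12.4-rhodf']
print(filter_versions(rhsso, "7.6.4-opr-002", "7.6.4-opr-002"))   # ['7.6.4-opr-002']
```

When minVersion and maxVersion are equal, only that exact version remains, which is the result the expected output above describes.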

Additional info:

https://github.com/openshift/oc-mirror/pull/761

Bug OCPBUGS-25778: Update 4.16 ose-cluster-capi-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/153

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-capi-operator/pull/153

Bug OCPBUGS-27335: Installation fails with 1 master and 2 workers as the console deployment set the number of replicas based on the InfrastructureTopology rather than the ControlPlaneTopology

Description of problem:

The node selector for the console deployment requires deploying it on the master nodes, while the replica count is determined by the infrastructureTopology, which primarily tracks the workers' setup.

When an OpenShift cluster is installed with a single master node and multiple workers, the console deployment requests 2 replicas because infrastructureTopology is set to HighlyAvailable, even though controlPlaneTopology is set to SingleReplica as expected.

Version-Release number of selected component (if applicable):

4.16

How reproducible:

Always

Steps to Reproduce:

    1. Install an openshift cluster with 1 master and 2 workers

Actual results:

The installation fails because the replica count for the console deployment is set to 2.

  apiVersion: config.openshift.io/v1
  kind: Infrastructure
  metadata:
    creationTimestamp: "2024-01-18T08:34:47Z"
    generation: 1
    name: cluster
    resourceVersion: "517"
    uid: d89e60b4-2d9c-4867-a2f8-6e80207dc6b8
  spec:
    cloudConfig:
      key: config
      name: cloud-provider-config
    platformSpec:
      aws: {}
      type: AWS
  status:
    apiServerInternalURI: https://api-int.adstefa-a12.qe.devcluster.openshift.com:6443
    apiServerURL: https://api.adstefa-a12.qe.devcluster.openshift.com:6443
    controlPlaneTopology: SingleReplica
    cpuPartitioning: None
    etcdDiscoveryDomain: ""
    infrastructureName: adstefa-a12-6wlvm
    infrastructureTopology: HighlyAvailable
    platform: AWS
    platformStatus:
      aws:
        region: us-east-2
      type: AWS


apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
   .... 
  creationTimestamp: "2024-01-18T08:54:23Z"
  generation: 3
  labels:
    app: console
    component: ui
  name: console
  namespace: openshift-console
spec:
  progressDeadlineSeconds: 600
  replicas: 2

Expected results:

The replica count is set to 1, tracking the controlPlaneTopology value instead of the infrastructureTopology.
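
The expected behavior can be sketched as follows. This is a minimal Python illustration, not the console-operator code (which is written in Go): `console_replicas` is a hypothetical helper, and treating every topology other than SingleReplica as two replicas is an assumption based only on the values shown in this report.

```python
def console_replicas(infrastructure_status: dict) -> int:
    """Pick the console replica count from the control plane topology.

    The console pods are scheduled on master nodes, so the control plane
    topology, not the infrastructure (worker) topology, decides how many
    replicas can actually be placed.
    """
    topology = infrastructure_status.get("controlPlaneTopology")
    return 1 if topology == "SingleReplica" else 2


# Status values taken from the Infrastructure resource shown above.
status = {
    "controlPlaneTopology": "SingleReplica",
    "infrastructureTopology": "HighlyAvailable",
}
print(console_replicas(status))  # 1
```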

Additional info:

Bug OCPBUGS-29532: console-operator is unable to add its OIDC client info

Description of problem:

    If the authentication.config/cluster Type=="" but the OAuth/User APIs are already missing, the console-operator won't update the authentication.config/cluster status with its own client, because it crashes when it is unable to retrieve OAuthClients.

Version-Release number of selected component (if applicable):

    4.15.0

How reproducible:

    100%

Steps to Reproduce:

    1. Scale oauth-apiserver to 0
    2. Set featuregates to TechPreviewNoUpgrade
    3. Watch the authentication.config/cluster .status.oidcClients

Actual results:

    The client for the console does not appear.

Expected results:

    The client for the console should appear.

Additional info:

https://github.com/openshift/console-operator/pull/861

Bug OCPBUGS-24959: Update 4.16 ose-aws-cloud-controller-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cloud-provider-aws/pull/59

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-provider-aws/pull/59

Bug OCPBUGS-30831: ART requests updates to 4.16 image cluster-network-operator-container

Please review the following PR: https://github.com/openshift/cluster-network-operator/pull/2308

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-network-operator/pull/2308

Bug MGMT-16505: [PSI] [BE] - Huge amount of "Cluster was updated with api-vip , ingress-vip " in cluster events

Description of the problem:

In PSI, BE master ~2.30: a massive number of the following message appears in cluster events: "Cluster was updated with api-vip <IP ADDRESS>, ingress-vip <IP ADDRESS>".

This message repeats five times every minute (possibly related to the number of hosts).

The installation was started, but was aborted due to network connection issues.

I've tried to reproduce in staging, but couldn't.

How reproducible:

Still checking

Steps to reproduce:

Actual results:

Expected results:
This message should not be shown more than once

https://github.com/openshift/assisted-service/pull/5872

Bug OCPBUGS-15861: openshift-baremetal-installer should not link against libvirt

Currently the openshift-baremetal-install binary is dynamically linked to libvirt-client, meaning that it is only possible to run it on a RHEL system with libvirt installed.

A new version of the libvirt bindings, v1.8010.0, allows the library to be loaded only on demand, so that users who do not execute any libvirt code can run the rest of the installer without needing to install libvirt. (See this comment from Dan Berrangé.) In practice, the "rest of the installer" is everything except the baremetal destroy cluster command (which destroys the bootstrap storage pool - though only if the bootstrap itself has already been successfully destroyed - and has probably never been used by anybody ever). The Terraform providers all run in a separate binary.

There is also a pure-go libvirt library that can be used even within a statically-linked binary on any platform, even when interacting with libvirt. The libvirt terraform provider that does almost all of our interaction with libvirt already uses this library.

https://github.com/openshift/installer/pull/7252

Bug OCPBUGS-26117: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/784

Bug OCPBUGS-24825: Update 4.16 ose-olm-rukpak-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/operator-framework-rukpak/pull/67

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/operator-framework-rukpak/pull/67

Bug OCPBUGS-23228: hosted-cluster-config-operator-manager should throttle creation attempts

Description of problem:

Release controller > 4.14.2 > HyperShift conformance run > gathered assets:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.user.username == "system:admin" and .verb == "create" and .requestURI == "/apis/operator.openshift.io/v1/storages") | .userAgent' | sort | uniq -c
     65 hosted-cluster-config-operator-manager
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.user.username == "system:admin" and .verb == "create" and .requestURI == "/apis/operator.openshift.io/v1/storages") | .requestReceivedTimestamp + " " + (.responseStatus | (.code | tostring) + " " + .reason)' | head -n5
2023-11-09T17:17:15.130454Z 409 AlreadyExists
2023-11-09T17:17:15.163256Z 409 AlreadyExists
2023-11-09T17:17:15.198908Z 409 AlreadyExists
2023-11-09T17:17:15.230532Z 409 AlreadyExists
2023-11-09T17:17:22.899579Z 409 AlreadyExists

That's banging away pretty hard with creation attempts that keep getting 409ed, presumably because an earlier creation attempt succeeded. If the controller needs very quick latency in re-creation, perhaps an informing watch? If the controller can handle some re-creation latency, perhaps a quieter poll?

Version-Release number of selected component (if applicable):

4.14.2. I haven't checked other releases.

How reproducible:

Likely 100%. I saw similar behavior in an unrelated dump, and confirmed the busy 409s in the first CI run I checked.

Steps to Reproduce:

1. Dump a hosted cluster.
2. Inspect its audit logs for hosted-cluster-config-operator-manager create activity.

Actual results:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.userAgent == "hosted-cluster-config-operator-manager" and .verb == "create") | .verb + " " + (.responseStatus.code | tostring)' | sort | uniq -c
    130 create 409

Expected results:

Zero or only rare 409 responses to creation requests from this user agent.
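
For readers who prefer a script over the `curl | grep | jq` pipeline above, the same tally can be sketched in Python. This is only an illustration: `count_create_responses` is a hypothetical helper, and it assumes each audit log line is one JSON event with the `userAgent`, `verb`, and `responseStatus.code` fields used in the jq filters.

```python
import json
from collections import Counter


def count_create_responses(lines, user_agent):
    """Count create requests per HTTP response code for one user agent."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines, such as log rotation notices
        if not isinstance(event, dict):
            continue
        if event.get("userAgent") != user_agent or event.get("verb") != "create":
            continue
        code = (event.get("responseStatus") or {}).get("code")
        counts[code] += 1
    return counts


# Made-up sample events, for illustration only.
sample = [
    '{"userAgent": "hosted-cluster-config-operator-manager", "verb": "create", "responseStatus": {"code": 409}}',
    '{"userAgent": "hosted-cluster-config-operator-manager", "verb": "get", "responseStatus": {"code": 200}}',
    "not json",
]
print(count_create_responses(sample, "hosted-cluster-config-operator-manager"))
```

A healthy controller would show few or no entries under code 409 in the resulting counter.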

Additional info:

The user agent seems to be defined here, so likely the fix will involve changes to that manager.

https://github.com/openshift/hypershift/pull/3396

Bug OCPBUGS-24824: Update 4.16 csi-driver-manila-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-driver-manila-operator/pull/213

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-driver-manila-operator/pull/213

Bug OCPBUGS-29832: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-capi-operator/pull/164

Bug OCPBUGS-25035: Update 4.16 ose-csi-driver-shared-resource-mustgather-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/160

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-driver-shared-resource/pull/160

Bug OU-309: Tables in dashboards are not merging multiple query responses correctly

A table in a dashboard relies on the order of the metric labels to merge results

How to Reproduce:

Create a dashboard with a table including this query:

label_replace(sort_desc(sum(sum_over_time(ALERTS{alertstate="firing"}[24h])) by ( alertstate, alertname)), "aaa", "$1", "alertstate", "(.+)")

Only a single row is displayed, because the query simulates a first label, `aaa`, that has a single value.

Expected result:

The table should not rely on a single metric label to merge results but consider all the labels so the expected rows are displayed.
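
A minimal Python sketch of the expected merge behavior follows. It is not the monitoring-plugin code (which is TypeScript): `merge_query_results`, the `Value #N` column names, and the series shape (`metric` label map plus `value`) are assumptions made for illustration.

```python
def merge_query_results(results_per_query):
    """Merge series from several queries into table rows.

    Rows are keyed on the full, order-independent label set, so two series
    share a row only when every label matches.
    """
    rows = {}
    for query_index, series_list in enumerate(results_per_query):
        for series in series_list:
            key = tuple(sorted(series["metric"].items()))
            row = rows.setdefault(key, dict(series["metric"]))
            row[f"Value #{query_index + 1}"] = series["value"]
    return list(rows.values())


# Two series that share the first label "aaa" but differ in alertname.
query_a = [
    {"metric": {"aaa": "firing", "alertname": "Watchdog", "alertstate": "firing"}, "value": 5760},
    {"metric": {"aaa": "firing", "alertname": "AlertmanagerReceiversNotConfigured", "alertstate": "firing"}, "value": 5760},
]
rows = merge_query_results([query_a])
print(len(rows))  # 2
```

Keying on only the first label would collapse both series into one row, which is the behavior described in this bug.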

https://github.com/openshift/monitoring-plugin/pull/96

Story HOSTEDCP-1308: Add an e2e check for sa token not being mounted on management side workloads

Add an e2e check for sa token not being mounted on management side workloads unless necessary. This could be an extension of the needManamgementKasAccess check

Bug OCPBUGS-29576: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-config-operator/pull/408

Bug OCPBUGS-24930: Update 4.16 ose-ibm-vpc-block-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/60

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ibm-vpc-block-csi-driver/pull/60

Bug OCPBUGS-27264: flakiness in local/shared gateway migration jobs

Description of problem:

The e2e-aws-ovn-shared-to-local-gateway-mode-migration and e2e-aws-ovn-local-to-shared-gateway-mode-migration jobs fail about 50% of the time with:

+ oc patch Network.operator.openshift.io cluster --type=merge --patch '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"gatewayConfig":{"routingViaHost":false}}}}}'
network.operator.openshift.io/cluster patched
+ oc wait co network --for=condition=PROGRESSING=True --timeout=60s
error: timed out waiting for the condition on clusteroperators/network

https://github.com/openshift/cluster-network-operator/pull/2206

Bug ACM-9504: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Task MON-3753: Fix thanos make-local pipeline

The ci/prow/test.local pipeline is currently broken because the build04 cluster assigned to it in the build farm is somewhat slow, which makes the github.com/thanos-io/thanos/pkg/store tests exceed the default 900s timeout.

panic: test timed out after 15m0s
running tests:
	TestTSDBStoreSeries (4m3s)
	TestTSDBStoreSeries/1SeriesWith10000000Samples (2m58s)

Extending the timeout makes the test pass:

 ok github.com/thanos-io/thanos/pkg/store 984.344s

We'll address this alongside a follow-up issue to handle it with an env var in upstream Thanos.

https://github.com/openshift/thanos/pull/144

Bug OCPBUGS-27823: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/2213

Bug OCPBUGS-29871: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/csi-driver-shared-resource/pull/171

Bug OCPBUGS-5825: system:openshift:kube-controller-manager:gce-cloud-provider referencing non existing serviceAccount

Description of problem:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.20   True        False         43h     Cluster version is 4.11.20

$ oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: "2023-01-11T13:16:47Z"
  name: system:openshift:kube-controller-manager:gce-cloud-provider
  resourceVersion: "6079"
  uid: 82a81635-4535-4a51-ab83-d2a1a5b9a473
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:kube-controller-manager:gce-cloud-provider
subjects:
- kind: ServiceAccount
  name: cloud-provider
  namespace: kube-system

$ oc get sa cloud-provider -n kube-system
Error from server (NotFound): serviceaccounts "cloud-provider" not found

The serviceAccount cloud-provider does not exist, neither in kube-system nor in any other namespace.

It's therefore not clear what this ClusterRoleBinding does, which use case it fulfills, and why it references a non-existent serviceAccount.

From a security point of view, it's recommended to remove non-existent serviceAccounts from ClusterRoleBindings, as a potential attacker could abuse the current state by creating the missing serviceAccount and gaining undesired permissions.

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4 (all versions, from what we have found)

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift Container Platform 4
2. Run oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml

Actual results:

$ oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: "2023-01-11T13:16:47Z"
  name: system:openshift:kube-controller-manager:gce-cloud-provider
  resourceVersion: "6079"
  uid: 82a81635-4535-4a51-ab83-d2a1a5b9a473
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:kube-controller-manager:gce-cloud-provider
subjects:
- kind: ServiceAccount
  name: cloud-provider
  namespace: kube-system

$ oc get sa cloud-provider -n kube-system
Error from server (NotFound): serviceaccounts "cloud-provider" not found

Expected results:

The serviceAccount called cloud-provider should exist; otherwise, the ClusterRoleBinding should be removed.
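
A check for this kind of dangling reference can be sketched in Python. This is a hedged illustration, not an existing tool: `dangling_service_accounts` is a hypothetical helper, and it assumes the inputs were already loaded, for example from the `items` of `oc get clusterrolebinding -o json` and from a set of (namespace, name) pairs built from `oc get sa -A -o json`.

```python
def dangling_service_accounts(bindings, existing_service_accounts):
    """Return (binding, namespace, name) for ServiceAccount subjects that do not exist."""
    missing = []
    for binding in bindings:
        for subject in binding.get("subjects") or []:
            if subject.get("kind") != "ServiceAccount":
                continue
            ref = (subject.get("namespace"), subject.get("name"))
            if ref not in existing_service_accounts:
                missing.append((binding["metadata"]["name"], *ref))
    return missing


# The binding from this report, reduced to the relevant fields.
bindings = [
    {
        "metadata": {"name": "system:openshift:kube-controller-manager:gce-cloud-provider"},
        "subjects": [
            {"kind": "ServiceAccount", "name": "cloud-provider", "namespace": "kube-system"}
        ],
    }
]
existing = {("kube-system", "default")}  # illustrative set; cloud-provider is absent
print(dangling_service_accounts(bindings, existing))
```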

Additional info:

Finding related to a Security review done on the OpenShift Container Platform 4 - Platform

Bug MGMT-16791: [BE] - in feature support api - VIP_AUTO_ALLOC is dev_preview for 4.15 - should be unavailable

Description of the problem:

BE master ~2.30 - in feature support api - VIP_AUTO_ALLOC is dev_preview for 4.15 - should be unavailable

How reproducible:

100%

Steps to reproduce:

1. GET https://<SERVICE_ADDRESS>/api/assisted-install/v2/support-levels/features?openshift_version=4.15&cpu_architecture=x86_64

2. Check the support level in the BE response

Actual results:

VIP_AUTO_ALLOC is dev_preview for 4.15

Expected results:
VIP_AUTO_ALLOC should be unavailable
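
The check in the steps above could be scripted roughly as follows. This Python sketch makes two assumptions that are not confirmed by this report: that the response body has the shape `{"features": {"<FEATURE>": "<level>"}}`, and that a missing feature can be reported with a hypothetical `"unknown"` fallback. `feature_support_level` is an illustrative helper, not part of assisted-service.

```python
def feature_support_level(response_body: dict, feature: str) -> str:
    """Look up one feature's support level in a decoded response body."""
    return (response_body.get("features") or {}).get(feature, "unknown")


# Decoded body mirroring the actual result reported above (assumed shape).
actual = {"features": {"VIP_AUTO_ALLOC": "dev_preview"}}

level = feature_support_level(actual, "VIP_AUTO_ALLOC")
print(level)                   # dev_preview
print(level == "unavailable")  # False, which is the bug described here
```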

https://github.com/openshift/assisted-service/pull/5959

Bug OCPBUGS-25569: Update 4.16 ose-azure-disk-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/70

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/azure-disk-csi-driver/pull/70

Bug OCPBUGS-29493: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/origin/pull/28595

Bug OCPBUGS-24961: Update 4.16 vmware-vsphere-syncer-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/102

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-24926: Update 4.16 ose-alibaba-cloud-controller-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cloud-provider-alibaba-cloud/pull/40

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-provider-alibaba-cloud/pull/40

Bug OCPBUGS-29976: ART requests updates to 4.16 image golang-github-prometheus-node_exporter-container

Please review the following PR: https://github.com/openshift/node_exporter/pull/141

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/node_exporter/pull/142

Bug OCPBUGS-25699: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-node-tuning-operator/pull/899

Task MON-3697: use maximumStartupDurationSeconds instead of container patch

https://github.com/openshift/cluster-monitoring-operator/pull/2251

Bug OCPBUGS-24955: Update 4.16 ose-cluster-config-api-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/api/pull/1700

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-25015: Update 4.16 kube-proxy-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/sdn/pull/596

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/sdn/pull/596

Bug OCPBUGS-29641: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-node-tuning-operator/pull/972

Bug OCPBUGS-24884: Update 4.16 ose-cluster-dns-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-dns-operator/pull/397

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-dns-operator/pull/397

Bug OCPBUGS-27285: Bump version of OVS to openvswitch3.1-3.1.0-73

Bump OVS version to fix https://bugzilla.redhat.com/show_bug.cgi?id=2228464

https://github.com/openshift/ovn-kubernetes/pull/1995

Bug OCPBUGS-24840: Update 4.16 ose-thanos-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/thanos/pull/134

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Story TRT-1452: Breakout Additional Pathological Event Tests by Namespace

Recent periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade failure caused by

: [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers 0s
{  event [namespace/openshift-machine-api node/ci-op-j666c60n-23cd9-nb7wr-master-1 pod/cluster-baremetal-operator-79b78c4548-n5vrt hmsg/b7cb271b13 - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-79b78c4548-n5vrt_openshift-machine-api(32835332-fc25-4ddf-84ce-d3aa447d3ce0)] happened 25 times}

Shows in Component Readiness as unknown component

/ ovn upgrade-minor amd64 gcp rt > Unknown> [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers

We should update testBackoffStartingFailedContainer to check for known namespaces and create a junit for each known namespace indicating pass, fail, or flake.

The code already handles testBackoffStartingFailedContainerForE2ENamespaces

It looks as though we only check for known or e2e namespaces. If that is the case, we need to double-check that we are OK with events from unknown namespaces potentially getting through.

We should also review Sippy pathological tests for results that don't contain `for ns/namespace` format and review if they need to be broken out as well.

For each test that we break out we need to map the new namespace specific test to the correct component in the test mapping repository
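
The breakout described above can be pictured as grouping events by the namespace in their locator and emitting one test name per namespace. The following Python sketch is illustrative only: the real change lives in the Go code of openshift/origin, and the KNOWN_NAMESPACES set, the "unknown" bucket, and both function names are assumptions.

```python
import re
from collections import defaultdict

# Hypothetical list; the real set of known namespaces lives in openshift/origin.
KNOWN_NAMESPACES = {"openshift-machine-api", "openshift-etcd"}

BASE_TEST_NAME = (
    "[sig-cluster-lifecycle] pathological event should not see excessive "
    "Back-off restarting failed containers"
)

_NAMESPACE_RE = re.compile(r"namespace/(\S+)")


def group_events_by_namespace(events):
    """Bucket event strings by namespace; unrecognized namespaces go to 'unknown'."""
    grouped = defaultdict(list)
    for event in events:
        match = _NAMESPACE_RE.search(event)
        namespace = match.group(1) if match else ""
        bucket = namespace if namespace in KNOWN_NAMESPACES else "unknown"
        grouped[bucket].append(event)
    return dict(grouped)


def junit_names(grouped):
    """One test name per bucket, using the 'for ns/<namespace>' suffix format."""
    return sorted(f"{BASE_TEST_NAME} for ns/{namespace}" for namespace in grouped)
```

Keeping an explicit "unknown" bucket is one way to make sure events from unexpected namespaces are still reported instead of slipping through.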

https://github.com/openshift/origin/pull/28543

Bug OCPBUGS-27468: Enable TranslateStreamCloseWebsocketRequests through TechPreviewNoUpgrade

Enable websockets (https://github.com/kubernetes/enhancements/issues/4006) in 4.16 so that https://issues.redhat.com/browse/OCPBUGS-20515 can test whether websockets allow idle connections to time out.

Bug OCPBUGS-25669: apbexternalroute and egressfirewall status shows empty on hypershift hosted cluster

Description of problem:

apbexternalroute and egressfirewall status shows empty on hypershift hosted cluster

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-17-173511

How reproducible:

always

Steps to Reproduce:

1. setup hypershift, login hosted cluster
% oc get node
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-128-55.us-east-2.compute.internal    Ready    worker   125m   v1.28.4+7aa0a74
ip-10-0-129-197.us-east-2.compute.internal   Ready    worker   125m   v1.28.4+7aa0a74
ip-10-0-135-106.us-east-2.compute.internal   Ready    worker   125m   v1.28.4+7aa0a74
ip-10-0-140-89.us-east-2.compute.internal    Ready    worker   125m   v1.28.4+7aa0a74


2. create new project test
% oc new-project test


3. create apbexternalroute and egressfirewall on hosted cluster
apbexternalroute yaml file:
---
apiVersion: k8s.ovn.org/v1
kind: AdminPolicyBasedExternalRoute
metadata:
  name: apbex-route-policy
spec:
  from:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: test
  nextHops:
    static:
    - ip: "172.18.0.8"
    - ip: "172.18.0.9"
% oc apply -f apbexroute.yaml 
adminpolicybasedexternalroute.k8s.ovn.org/apbex-route-policy created

egressfirewall yaml file:
---
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
spec:
  egress:
  - type: Allow
    to: 
      cidrSelector: 0.0.0.0/0
% oc apply -f egressfw.yaml 
egressfirewall.k8s.ovn.org/default created


4. oc get apbexternalroute and oc get egressfirewall

Actual results:

The status is shown as empty:
% oc get apbexternalroute
NAME                 LAST UPDATE   STATUS
apbex-route-policy   49s                     <--- status is empty
% oc describe apbexternalroute apbex-route-policy | tail -n 8
Status:
  Last Transition Time:  2023-12-19T06:54:17Z
  Messages:
    ip-10-0-135-106.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    ip-10-0-129-197.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    ip-10-0-128-55.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    ip-10-0-140-89.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
Events:  <none>

% oc get egressfirewall
NAME      EGRESSFIREWALL STATUS
default                           <--- status is empty 
% oc describe egressfirewall default | tail -n 8
    Type:             Allow
Status:
  Messages:
    ip-10-0-129-197.us-east-2.compute.internal: EgressFirewall Rules applied
    ip-10-0-128-55.us-east-2.compute.internal: EgressFirewall Rules applied
    ip-10-0-140-89.us-east-2.compute.internal: EgressFirewall Rules applied
    ip-10-0-135-106.us-east-2.compute.internal: EgressFirewall Rules applied
Events:  <none>

Expected results:

The status should be shown correctly.

Additional info:

https://github.com/openshift/cluster-network-operator/pull/2169

Bug OCPBUGS-28597: Reenable cluster-olm-operator, platform-operators

Rukpak – an alpha tech preview API – has pushed a breaking change upstream. This bug tracks the need for us to disable and then reenable the cluster-olm-operator and platform-operators components which both depend on rukpak in order to push the breaking API change. This bug can be closed once those components are all updated and available on the cluster again.

Bug OCPBUGS-24579: Cannot deploy v6-only hosts from a v4-primary dualstack cluster

Description of problem:

In a dualstack cluster, we use IPv4 URLs for callbacks to Ironic and Inspector. If the host only has IPv6 networking, the provisioning will fail. This issue affects both the normal IPI and ZTP with the converged flow.

A similar issue has been fixed as part of METAL-163, where we use the BMC's address family to determine which URL to send to it. This bug is somewhat simpler: we can provide IPA with several URLs and let it decide which one it can use. This way, only small changes to IPA itself, ICC and CBO are required.

The fix will only affect virtual media deployments without provisioning network or with virtualMediaViaExternalNetwork:true. We don't have a good dualstack story around provisioning networks anyway.

Upstream IPA request: https://bugs.launchpad.net/ironic/+bug/2045548
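
The idea of handing IPA several callback URLs and letting it pick a usable one can be sketched as follows. This is not IPA's actual implementation: the function name, the use of IP-literal URLs, and the way the host's address families are passed in are all assumptions made for illustration.

```python
import ipaddress
from urllib.parse import urlsplit


def pick_callback_url(candidate_urls, host_ip_versions):
    """Return the first URL whose IP-literal host matches a usable address family.

    candidate_urls: URLs in order of preference (assumed to use IP literals).
    host_ip_versions: set of IP versions the host can use, e.g. {6}.
    """
    for url in candidate_urls:
        host = urlsplit(url).hostname
        if host is None:
            continue
        try:
            version = ipaddress.ip_address(host).version
        except ValueError:
            continue  # DNS names are not resolved in this sketch
        if version in host_ip_versions:
            return url
    return None


# Documentation-range addresses, used purely as placeholders.
urls = ["https://192.0.2.10:6385", "https://[2001:db8::10]:6385"]
```

For a v6-only host, pick_callback_url(urls, {6}) skips the IPv4 URL and returns the IPv6 one, which is the behaviour a v4-primary dualstack cluster needs.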

Bug OCPBUGS-25434: nmstateconfig file not getting applied to bare metal host

Description of problem:

Configuration files applied via the API do not have any effect on the configuration of the bare metal host.

Version-Release number of selected component (if applicable):

OpenShift 4.12.42 with ACM 2.8

How reproducible:

Reproducible.

Steps to Reproduce:

1. Applied nmstateconfig via 'oc apply'; realized after the machine booted that the subnet prefix was incorrect.
2. Deleted bmh and nmstateconfig.
3. Applied the correct config via 'oc apply'; the machine still boots with the first config.
4. Deleted bmh and nmstateconfig.
5. Created the host via the BMC form in the GUI with the correct config. The machine boots with the correct config.
6. Tested deleting bmh and nmstateconfig, then creating a new machine by applying only the bmh file with zero network config; the machine again boots with the networking from step 5.

Actual results:

Bare metal host does not get config via 'oc apply'.

Expected results:

'oc apply -f nmstateconfig.yaml' should work to apply networking configuration.

Additional info:

HP Synergy 480 Gen10 (871942-B21)
UEFI boot with redfish virtual media
Static IP with bonding.

https://github.com/openshift/assisted-service/pull/5844

Bug OCPBUGS-25942: OpenShift vSphere Connection Configuration Does Not Appropriately Insert Escaped Strings

Recently a user attempted to change the Virtual Machine Folder for a cluster installed on vSphere, using the "vSphere Connection Configuration" panel. After the user updated the path and clicked "Save Configuration", cluster-wide issues emerged, including nodes not coming back online after a reboot.

OpenShift nodes eventually crashed with an error resulting from an incorrectly parsed folder, because the enclosing string-literal quote characters (" ") were missing.

While this was exhibited on OCP 4.13, other versions may be affected.
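
To illustrate the class of bug, the sketch below writes a folder path into an INI-style line with the surrounding quotes and escaping in place. It assumes an INI-like configuration with a quoted folder key; the helper names are hypothetical and this is not the console's actual fix.

```python
def quote_ini_value(value):
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def render_folder_line(folder):
    """Render the folder setting as a quoted INI-style line (assumed key name)."""
    return f"folder = {quote_ini_value(folder)}"
```

For a path such as /dc1/vm/my folder, the rendered line keeps the whole path inside one quoted string, so a parser does not split it or misread its special characters.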

https://github.com/openshift/console/pull/13477

Bug OCPBUGS-29209: HyperShift operator should not apply PKI operator RBAC if PKI disabled

Description of problem:

    HyperShift operator is applying control-plane-pki-operator RBAC resources regardless of whether PKI reconciliation is disabled for the HostedCluster.

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    100%

Steps to Reproduce:

    1. Create 4.15 HostedCluster with PKI reconciliation disabled
    2. Unused RBAC resources for control-plane-pki-operator are created

Actual results:

    Unused RBAC resources for control-plane-pki-operator are created

Expected results:

RBAC resources for control-plane-pki-operator should not be created if deployment for control-plane-pki-operator itself is not created.

Additional info:

https://github.com/openshift/hypershift/pull/3544

Bug OCPBUGS-25600: AWS: The installer doesn’t precheck if node architecture and vm type are consistent

Description of problem:

The installer does not precheck whether the node architecture and vm type are consistent on aws and gcp; the precheck works on azure.

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-multi-2023-12-06-195439

How reproducible:

   Always

Steps to Reproduce:

    1. In install-config, set the compute architecture field to arm64 but choose an amd64 instance type as the vm type
    2. Create cluster
    3. Check installation

Actual results:

Azure will precheck if architecture is consistent with instance type when creating manifests, like:
12-07 11:18:24.452 [INFO] Generating manifests files.....
12-07 11:18:24.452 level=info msg=Credentials loaded from file "/home/jenkins/ws/workspace/ocp-common/Flexy-install/flexy/workdir/azurecreds20231207-285-jd7gpj"
12-07 11:18:56.474 level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: controlPlane.platform.azure.type: Invalid value: "Standard_D4ps_v5": instance type architecture 'Arm64' does not match install config architecture amd64

But aws and gcp have no such precheck, so the installation fails partway through, after many resources have already been created. This case is more likely to happen in multiarch clusters.

Expected results:

The installer should precheck that the architecture and vm type are consistent, especially on platforms that support heterogeneous clusters (aws, gcp, azure).
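
A precheck of this kind amounts to comparing the architecture of the chosen instance type with the architecture in the install config before any resources are created. The sketch below is illustrative only: the installer is written in Go, the lookup table is a hypothetical stand-in for a cloud API query, and the error text merely imitates the Azure message quoted above.

```python
# Hypothetical lookup table; a real check would query the cloud provider.
INSTANCE_TYPE_ARCH = {
    "m6g.xlarge": "arm64",
    "m6i.xlarge": "amd64",
}


def validate_instance_architecture(field_path, instance_type, config_arch,
                                   catalog=INSTANCE_TYPE_ARCH):
    """Return an error string on a mismatch, or None if consistent or unknown."""
    actual_arch = catalog.get(instance_type)
    if actual_arch is None:
        return None  # unknown instance type: nothing to compare against
    if actual_arch != config_arch:
        return (
            f'{field_path}: Invalid value: "{instance_type}": '
            f"instance type architecture '{actual_arch}' does not match "
            f"install config architecture {config_arch}"
        )
    return None
```

Running such a check while generating manifests would surface the mismatch before the installation creates any cloud resources.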

Additional info:

https://github.com/openshift/installer/pull/7835

Bug OCPBUGS-25538: Update 4.16 ose-image-customization-controller-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/image-customization-controller/pull/116

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/image-customization-controller/pull/116

Bug OCPBUGS-28548: Promote ecr-credential-provider image with RPM

Description of problem:

In https://github.com/openshift/release/pull/47648 ecr-credentials-provider is built in CI and later included in RHCOS.

For this to work on OKD, it needs to be included in the payload, so that OKD machine-os can extract the RPM and install it on the host.

Additional info:

Ref: OCPBUGS-25662

Bug OCPBUGS-24908: Update 4.16 ose-cluster-csi-snapshot-controller-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/178

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/178

Bug OCPBUGS-20487: Missing 'ping' executable file on s390x node in origin tests:[sig-network][Feature:EgressFirewall]

Description of problem:

1. [sig-network][Feature:EgressFirewall] egressFirewall should have no impact outside its namespace [Suite:openshift/conformance/parallel] 
2. [sig-network][Feature:EgressFirewall] when using openshift ovn-kubernetes should ensure egressfirewall is created [Suite:openshift/conformance/parallel]

The issue arises during the execution of the above tests and appears to be related to the image in use, specifically, the image located at https://quay.io/repository/redhat-developer/nfs-server?tab=tags&tag=1.1 (quay.io/redhat-developer/nfs-server:1.1). 
This image does not include the 'ping' executable for the s390x architecture, leading to the following error in the prow job logs:
...
msg: "Error running /usr/bin/oc --namespace=e2e-test-no-egress-firewall-e2e-6mg9v --kubeconfig=/tmp/configfile3768380277 exec dummy -- ping -c 1 8.8.8.8:\nStdOut>\ntime=\"2023-10-11T19:04:52Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"ping\\\": executable file not found in $PATH\"\ncommand terminated with exit code 255\nStdErr>\ntime=\"2023-10-11T19:04:52Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"ping\\\": executable file not found in $PATH\"\ncommand terminated with exit code 255\nexit status 255\n"
...

Our suggested fix: build a new s390x image that contains the ping binary.

Version-Release number of selected component (if applicable):

How reproducible:

The issue is reproducible when the test container (quay.io/redhat-developer/nfs-server:1.1) is scheduled on an s390x node, leading to test failures.

Steps to Reproduce:

1. Have a multi-arch cluster (x86 + s390x day2 worker node attached)
2. Execute the two tests
3. Retry a few times until the pod is scheduled on the s390x node

Actual results from prow job:

Run #0: Failed 30s{  fail [github.com/openshift/origin/test/extended/networking/egress_firewall.go:70]: Unexpected error:
    <*fmt.wrapError | 0xc005924300>: 
    Error running /usr/bin/oc --namespace=e2e-test-no-egress-firewall-e2e-6r9zh --kubeconfig=/tmp/configfile3961753222 exec dummy -- ping -c 1 8.8.8.8:
    StdOut>
    time="2023-10-12T07:17:02Z" level=error msg="exec failed: unable to start container process: exec: \"ping\": executable file not found in $PATH"
    command terminated with exit code 255
    StdErr>
    time="2023-10-12T07:17:02Z" level=error msg="exec failed: unable to start container process: exec: \"ping\": executable file not found in $PATH"
    command terminated with exit code 255
    exit status 255
    
    {
        msg: "Error running /usr/bin/oc --namespace=e2e-test-no-egress-firewall-e2e-6r9zh --kubeconfig=/tmp/configfile3961753222 exec dummy -- ping -c 1 8.8.8.8:\nStdOut>\ntime=\"2023-10-12T07:17:02Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"ping\\\": executable file not found in $PATH\"\ncommand terminated with exit code 255\nStdErr>\ntime=\"2023-10-12T07:17:02Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"ping\\\": executable file not found in $PATH\"\ncommand terminated with exit code 255\nexit status 255\n",
        err: <*exec.ExitError | 0xc0059242e0>{
            ProcessState: {
                pid: 78611,
                status: 65280,
                rusage: {
                    Utime: {Sec: 0, Usec: 168910},
                    Stime: {Sec: 0, Usec: 60897},
                    Maxrss: 206428,
                    Ixrss: 0,
                    Idrss: 0,
                    Isrss: 0,
                    Minflt: 4199,
                    Majflt: 0,
                    Nswap: 0,
                    Inblock: 0,
                    Oublock: 0,
                    Msgsnd: 0,
                    Msgrcv: 0,
                    Nsignals: 0,
                    Nvcsw: 753,
                    Nivcsw: 149,
                },
            },
            Stderr: nil,
        },
    }
occurred
Ginkgo exit error 1: exit with code 1}

Expected results:

Passed

Additional info:

This issue pertains to a specific bug on the s390x architecture and additionally impacts the libvirt-s390x prow job.

https://github.com/openshift/origin/pull/28408

Bug OCPBUGS-24853: Update 4.16 ose-installer-artifacts-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/installer/pull/7818

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/installer/pull/7818

Bug OCPBUGS-24866: Update 4.16 ose-csi-snapshot-controller-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/127

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-snapshotter/pull/127

Bug OCPBUGS-25540: Update 4.16 ose-csi-external-resizer-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/154

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-resizer/pull/154

Bug OCPBUGS-26434: Ensure Passwords are Redacted in Agent Gather manifest Files

When platform-specific passwords are included in the install-config.yaml, they are stored in the generated agent-cluster-install.yaml, which is included in the output of the agent-gather command. These passwords should be redacted.
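
One way to picture the redaction is a recursive walk over the parsed manifest that masks any string value stored under a key containing "password". This is a sketch under that assumption only; the function name, the placeholder text, and the key-matching rule are hypothetical and do not describe the installer's actual implementation.

```python
def redact_passwords(node, placeholder="<redacted>"):
    """Return a copy of a parsed manifest with password-like string values masked."""
    if isinstance(node, dict):
        redacted = {}
        for key, value in node.items():
            if "password" in str(key).lower() and isinstance(value, str):
                redacted[key] = placeholder
            else:
                redacted[key] = redact_passwords(value, placeholder)
        return redacted
    if isinstance(node, list):
        return [redact_passwords(item, placeholder) for item in node]
    return node  # scalars other than matched strings pass through unchanged
```

Because the function builds a new structure instead of mutating its input, the original manifest stays usable for the installation while only the gathered copy is masked.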

https://github.com/openshift/installer/pull/7873

Bug MGMT-14380: While setting up OCP with ODF using AI, host status is shown as insufficient though additional multipath data disk is attached

Description of the problem:

Setting up OCP with ODF (compact mode) using AI (stage). I have 3 hosts, each with an install disk (120GB) and a data disk (500GB, disk type: Multipath). Although each host has a non-bootable disk (500GB), the host status is "Insufficient". I could not proceed because the "Next" button is disabled.

Steps to reproduce:

1. Create a new cluster

2. Select "Install OpenShift Data Foundation" in Operators page

3. Take 3 hosts with 1 installation disk and 1 non-installation disk on each.

4. Add hosts by booting hosts with downloaded iso

Actual results:

Status of hosts is "Insufficient" and "Next" button is disabled

data.json

Expected results:

Status of hosts should be "Ready" and "Next" button should be enabled to proceed with installation

https://github.com/openshift/assisted-service/pull/6072

Bug OCPBUGS-24005: tls: bad certificate from kube-apiserver-operator

This shows tls: bad certificate errors coming from kube-apiserver-operator. For example, see https://reportportal-openshift.apps.ocp-c1.prod.psi.redhat.com/ui/#prow/launches/all/470214; its must-gather is at https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-aws-ipi-imdsv2-fips-f14/1726036030588456960/artifacts/aws-ipi-imdsv2-fips-f14/gather-must-gather/artifacts/

MacBook-Pro:~ jianzhang$ omg logs prometheus-operator-admission-webhook-6bbdbc47df-jd5mb | grep "TLS handshake"
2023-11-27 10:11:50.687 | WARNING  | omg.utils.load_yaml:<module>:10 - yaml.CSafeLoader failed to load, using SafeLoader
2023-11-19T00:57:08.318983249Z ts=2023-11-19T00:57:08.318923708Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.129.0.35:48334: remote error: tls: bad certificate"
2023-11-19T00:57:10.336569986Z ts=2023-11-19T00:57:10.336505695Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.129.0.35:48342: remote error: tls: bad certificate"
...
MacBook-Pro:~ jianzhang$ omg get pods -A -o wide | grep "10.129.0.35"
2023-11-27 10:12:16.382 | WARNING  | omg.utils.load_yaml:<module>:10 - yaml.CSafeLoader failed to load, using SafeLoader
openshift-kube-apiserver-operator                 kube-apiserver-operator-f78c754f9-rbhw9                          1/1    Running    2         5h27m  10.129.0.35   ip-10-0-107-238.ec2.internal

For more information, see the Slack thread: https://redhat-internal.slack.com/archives/CC3CZCQHM/p1700473278471309

Bug OCPBUGS-25113: Update 4.16 ose-gcp-pd-csi-driver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/101

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/gcp-pd-csi-driver-operator/pull/101

Bug OCPBUGS-24819: Update 4.16 ose-powervs-block-csi-driver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/53

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-30530: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ibm-powervs-block-csi-driver/pull/76

Bug OCPBUGS-30537: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/thanos/pull/143

Bug OCPBUGS-24988: Update 4.16 ose-etcd-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/etcd/pull/236

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/etcd/pull/236

Bug OCPBUGS-26547: external-dns causing route53 throttling

Many jobs are failing because route53 is throttling us during cluster creation.
We need to make external-dns issue fewer calls.

The theoretical minimum is:
list zones - 1 call
list zone records - (# of records / 100) calls
create 3 records per HC - 1-3 calls depending on how they are batched
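
The arithmetic above can be sketched as a small helper. This is only an illustration of the estimate, not code from external-dns: the function name and parameters are hypothetical, the page size of 100 is taken from the text, and the sketch assumes at least one list call is always needed and that record creation costs either 1 call (batched) or 3 calls (unbatched) per hosted cluster.

```python
import math


def min_route53_calls(num_records, hosted_clusters=1, page_size=100, batched=True):
    """Estimate the theoretical minimum number of Route 53 calls.

    Hypothetical helper: mirrors the three terms listed in the text.
    """
    list_zones = 1
    # One call per page of records; assume at least one call even for an empty zone.
    list_records = max(1, math.ceil(num_records / page_size))
    # 3 records per hosted cluster: 1 call if batched, 3 if sent one by one.
    create_records = hosted_clusters * (1 if batched else 3)
    return list_zones + list_records + create_records


print(min_route53_calls(250))                 # batched creation
print(min_route53_calls(250, batched=False))  # unbatched creation
```

With 250 existing records and one hosted cluster, the estimate is 5 calls when the three records are batched and 7 when they are not.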

Bug OCPBUGS-26140: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-operator/pull/36

Bug OCPBUGS-30060: Image Registry ClusterOperator not progressing on Azure Hosted Control Planes

Description of problem:

  The image registry CO is not progressing on Azure Hosted Control Planes

Version-Release number of selected component (if applicable):

How reproducible:

    Every time

Steps to Reproduce:

    1. Create an Azure HCP
    2. Create a kubeconfig for the guest cluster
    3. Check the image-registry CO

Actual results:

    image-registry co's message is Progressing: The registry is ready...

Expected results:

    image-registry finishes progressing

Additional info:

    I let it go for about 34m

% oc get co | grep -i image
image-registry                             4.16.0-0.nightly-multi-2024-02-26-105325   True        True          False      34m     Progressing: The registry is ready...

% oc get co/image-registry -oyaml
...
  - lastTransitionTime: "2024-02-28T19:10:30Z"
    message: |-
      Progressing: The registry is ready
      NodeCADaemonProgressing: The daemon set node-ca is deployed
      AzurePathFixProgressing: The job does not exist
    reason: AzurePathFixNotFound::Ready
    status: "True"
    type: Progressing

https://github.com/openshift/hypershift/pull/3667

Bug OCPBUGS-22422: [4.13+] Alert PodStartupStorageOperationsFailing not getting raised

Description of problem:

The PodStartupStorageOperationsFailing alert is not raised when zero mount/attach operations succeed on a node.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-10-25-185510

How reproducible:

Always

Steps to Reproduce:

1. Install any platform cluster.
2. Create a StorageClass (sc), a PVC, and a Deployment (dep).
3. Check that the deployment pod reaches the ContainerCreating state, then check for the alert.

Actual results:

The alert is not raised when zero mount/attach operations succeed.

Expected results:

The alert should be raised when no mount/attach operations succeed.

Additional info:

Discussion: https://redhat-internal.slack.com/archives/GK0DA0JR5/p1697793500890839 

When we use the same alerting expression from 4.12, the alert shows up in the OCP web console.

https://github.com/openshift/cluster-storage-operator/pull/436

Bug OCPBUGS-25696: HCP does not deploy cloud provider kubevirt with configured node selectors

Description of problem:

    When deploying a HCP KubeVirt cluster using the hcp's --node-selector cli arg, that node selector is not applied to the "kubevirt-cloud-controller-manager" pods within the HCP namespace. 

This makes it impossible to pin all of the HCP pods to specific nodes.

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

    100%

Steps to Reproduce:

    1. deploy an hcp kubevirt cluster with the --node-selector cli option
    2.
    3.

Actual results:

    the node selector is not applied to cloud provider kubevirt pod

Expected results:

    the node selector should be applied to cloud provider kubevirt pod.

Additional info:

https://github.com/openshift/hypershift/pull/3382

Bug OCPBUGS-25032: Update 4.16 ose-ovn-kubernetes-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/1980

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ovn-kubernetes/pull/1980

Bug OCPBUGS-24989: Update 4.16 ose-cluster-kube-storage-version-migrator-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/101

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/101

Task MON-3673: Bump downstream Prometheus to v2.49.1

Bug OCPBUGS-24186: ODF Dynamic plugin should not expose Server header

A customer pentest shows that the Server header is returned by the admin console when browsing:

https://console-openshift-console$domain/locales/resource.json?lng=en&ns=plugin__odf-console

This could disclose version information that helps a potential attacker identify applicable CVEs.

Response header:

Server: nginx/1.20.1

https://github.com/openshift/console/pull/13404

Bug OCPBUGS-24928: Update 4.16 ose-aws-cluster-api-controllers-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api-provider-aws/pull/488

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api-provider-aws/pull/488

Bug OCPBUGS-24927: Update 4.16 ose-cluster-storage-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-storage-operator/pull/432

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-storage-operator/pull/432

Bug TRT-1359: monitor test azure-metrics-collector failed with throttle error

[Jira:"Test Framework"] monitor test azure-metrics-collector collection failure in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/28395/pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd/1724427658311241728

Looks like Azure is throttling our request. We should probably try some retry mechanism.
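
One possible shape for such a retry mechanism is exponential backoff with jitter. The sketch below is only an illustration and is not the change made in the linked PR: `ThrottledError`, `retry_on_throttle`, and all default values are hypothetical names and numbers chosen for the example.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling response (for example HTTP 429)."""


def retry_on_throttle(fetch, attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fetch(), retrying with exponential backoff when it is throttled."""
    for attempt in range(attempts):
        try:
            return fetch()
        except ThrottledError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the throttling error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add jitter so that parallel jobs do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter is injectable so the helper can be exercised in tests without actually waiting.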

Relevant thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1699977299650309

https://github.com/openshift/origin/pull/28459

Story TRT-1476: Add a GCP liveness probe to origin

Work to set up the endpoint will be handled in another card.

For this one we want to set up a new disruption backend similar to the cluster-network-liveness-probe. We'll poll, submit request ids and possibly job identifiers.

This approach gets us free disruption intervals in BigQuery, charting per job, and graphing capabilities from the disruption dashboard.

Bug OCPBUGS-29249: CPMS leaves only 2 masters during update

Observed during testing of candidate-4.15 image as of 2024-02-08.

This is an incomplete report as I haven't verified the reproducer yet or attempted to get a must-gather. I have observed this multiple times now, so I am confident it's a thing. I can't be confident that the procedure described here reliably reproduces it, or that all the described steps are required.

I have been using MCO to apply machine config to masters. This involves a rolling reboot of all masters.

During a rolling reboot I applied an update to CPMS. I observed the following sequence of events:

master-1 was NotReady as it was rebooting
I modified CPMS
CPMS immediately started provisioning a new master-0
CPMS immediately started deleting master-1
CPMS started provisioning a new master-1

At this point there were only 2 nodes in the cluster:

old master-0
old master-2

and machines provisioning:

new master-0
new master-1

https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/278

Bug OCPBUGS-30544: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/telemeter/pull/516

Story HOSTEDCP-1283: Investigate and fix why some azure nodes not joining the cluster

Currently, when creating an Azure cluster, only the first node of the nodePool becomes ready and joins the cluster. All other Azure machines are stuck in the `Creating` state.

https://github.com/openshift/hypershift/pull/3445

Bug OCPBUGS-24720: Update 4.16 ose-multus-whereabouts-ipam-cni-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/whereabouts-cni/pull/212

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-25586: Update 4.16 ose-cluster-api-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api/pull/191

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api/pull/191

Bug OCPBUGS-26190: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-policy-controller/pull/145

Bug OCPBUGS-22604: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-ibmcloud/pull/74

Bug OCPBUGS-24974: Update 4.16 ose-vertical-pod-autoscaler-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/272

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/kubernetes-autoscaler/pull/272

Bug OCPBUGS-25532: Update 4.16 csi-provisioner-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/84

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-provisioner/pull/84

Bug OCPBUGS-26127: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/277

Bug OCPBUGS-10996: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/kubernetes/pull/1899

Bug OCPBUGS-26039: sidebar on deployment/deploymentconfig creation YAML view page is occupying too much screen width

Description of problem:

The YAML sidebar is occupying too much space on some pages

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-01-03-140457

How reproducible:

Always

Steps to Reproduce:

1. Go to Deployment/DeploymentConfig creation page
2. Choose 'YAML view'
3. (for comparison) Go to other resources YAML page, open the sidebar

Actual results:

We can see the sidebar is occupying too much screen compared with other resources YAML page

Expected results:

We should reduce the space sidebar occupies

Additional info:

https://github.com/openshift/console/pull/13495

Bug OCPBUGS-29567: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/106

Bug OCPBUGS-30567: Remove wrong arguments in K8sCreate method instance for Create YAML editor

Description of problem:

- Investigate why the `name` and `namespace` properties are passed as arguments to the `k8sCreate` call in the Create YAML editor.

- Remove the `name` and `namespace` arguments from that `k8sCreate` call if doing so does not require a big change.

Problem:
If consoleFetchCommon takes an additional option (argument) and returns a response based on that option, as proposed in the "[Add support for returning response.header in consoleFetchCommon function|https://issues.redhat.com/browse/CONSOLE-3949]" story, the wrong and unused arguments in k8sCreate would cause consoleFetchCommon to return the entire response instead of the response body, which would break the Create Resource YAML functionality.

Code: https://github.com/openshift/console/blob/master/frontend/public/components/edit-yaml.jsx#L334

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/console/pull/13658

Bug OCPBUGS-25340: Update 4.16 ose-openstack-cinder-csi-driver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/149

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Task SPLAT-1460: Fix vSphere installer to not provide double slash in resourcepool path

Currently, when attempting installs, the infrastructure "cluster" CR contains a resource pool with a path such as:

Workspace.ResourcePool: /DEVQEdatacenter/host/DEVQEcluster//Resources

This causes an issue with CPMSO, which rolls out new masters after the initial control plane is established:

I0219 07:46:32.076950       1 updates.go:478] "msg"="Machine requires an update" "controller"="controlplanemachineset" "diff"=["Workspace.ResourcePool: /DEVQEdatacenter/host/DEVQEcluster//Resources != /DEVQEdatacenter/host/DEVQEcluster/Resources"] "index"=2 "name"="sgao-devqe-vblw8-master-2" "namespace"="openshift-machine-api" "reconcileID"="5f47f5a5-0a90-4168-bfcc-dae0fad9b953" "updateStrategy"="RollingUpdate"
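
The mismatch in the diff above comes down to an empty path segment. As a rough illustration only (the installer itself is written in Go, and this is not its actual fix), collapsing repeated slashes could look like the following; `collapse_slashes` is a hypothetical helper name.

```python
def collapse_slashes(path):
    """Drop empty segments so that 'a//b' becomes 'a/b' (hypothetical helper)."""
    leading = "/" if path.startswith("/") else ""
    return leading + "/".join(segment for segment in path.split("/") if segment)


generated = "/DEVQEdatacenter/host/DEVQEcluster//Resources"
expected = "/DEVQEdatacenter/host/DEVQEcluster/Resources"
print(collapse_slashes(generated) == expected)
```

Normalizing the generated path this way makes it compare equal to the path CPMSO expects, so no update would be triggered by the double slash alone.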

https://github.com/openshift/installer/pull/8044

Bug OCPBUGS-24055: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/1990

Bug OCPBUGS-24907: Update 4.16 ose-cluster-kube-apiserver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1597

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-kube-apiserver-operator/pull/1606

Bug OCPBUGS-26979: [EIP multi NIC] In LGW mode, when pod is selected, connection to node IP fails

Description of problem:

In local gateway (LGW) mode, when a pod is selected by an EIP that is hosted by an interface other than the default interface, connections to the node IP fail.

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/ovn-kubernetes/pull/2048

Bug OCPBUGS-26124: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-azure/pull/107

Bug OCPBUGS-30540: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/vsphere-problem-detector/pull/156

Bug TRT-1400: Update disruption cronjob to use NO-ISSUE

Placeholder so we can merge.

https://github.com/openshift/origin/pull/28416

Bug OCPBUGS-10851: Dynamic Plugin Template: Unable to use React Developer Tools when running console as an image

Currently, the plugin template gives you instructions for running the console using a container image, which is a lightweight way to do development and avoids the need to build the console source code from scratch. The image we reference uses a production version of React, however. This means that you aren't able to use the React browser plugin to debug your application.

We should look at alternatives that allow you to use React Developer Tools. Perhaps we can publish a different image that uses a development build. Or at least we need to better document building console locally instead of using an image to allow development builds.

https://github.com/openshift/console/pull/13497

Bug OCPBUGS-11624: ose-cluster-image-registry-operator-container: cluster-image-registry-operator: Minimize wildcard/privilege Usage in Cluster and Local Roles [openshift-4]

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-image-registry-operator/pull/964

Story HOSTEDCP-1436: NodePool Subnet field should be required

NodePools fail to provision if no subnet is specified.
Subnet field https://github.com/openshift/hypershift/blob/main/api/hypershift/v1beta1/nodepool_types.go#L760 should be required.

https://github.com/openshift/hypershift/pull/3581

Bug OCPBUGS-24834: Need to bump api at oc to include the CloudCredential capability

Background:

CCO was made optional in OCP 4.15, see https://issues.redhat.com/browse/OCPEDGE-69. CloudCredential was introduced as a new capability to openshift/api. We need to bump api at oc to include the CloudCredential capability so oc adm release extract works correctly.

Description of problem:

Some relevant CredentialsRequests are not extracted by the following command: oc adm release extract --credentials-requests --included --install-config=install-config.yaml ...
where install-config.yaml looks like the following:
...
capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
  - MachineAPI
  - CloudCredential
platform:
  aws:
...

Logs:

...
I1209 19:57:25.968783 79037 extract.go:418] Found manifest 0000_50_cloud-credential-operator_05-iam-ro-credentialsrequest.yaml
I1209 19:57:25.968902 79037 extract.go:429] Excluding Group: "cloudcredential.openshift.io" Kind: "CredentialsRequest" Namespace: "openshift-cloud-credential-operator" Name: "cloud-credential-operator-iam-ro": unrecognized capability names: CloudCredential
...
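
The exclusion in the log above can be modelled roughly as follows. This is a simplified sketch of the filtering idea, not the actual `oc` implementation: the function and variable names are hypothetical, and the "disabled capabilities" message is invented for the example (only the "unrecognized capability names" text appears in the log).

```python
def include_manifest(required, known, enabled):
    """Decide whether a manifest is extracted, returning (included, reason).

    required: capability names the manifest is annotated with
    known:    capability names the client's vendored API recognizes
    enabled:  capability names enabled by the install-config
    """
    unknown = sorted(set(required) - set(known))
    if unknown:
        # Matches the failure mode in the log: the client predates the capability.
        return False, "unrecognized capability names: " + ", ".join(unknown)
    disabled = sorted(set(required) - set(enabled))
    if disabled:
        # Hypothetical message, for illustration only.
        return False, "disabled capabilities: " + ", ".join(disabled)
    return True, ""


enabled = {"MachineAPI", "CloudCredential"}
old_client = {"MachineAPI"}                      # API not yet bumped
new_client = {"MachineAPI", "CloudCredential"}   # API bumped

print(include_manifest(["CloudCredential"], old_client, enabled))
print(include_manifest(["CloudCredential"], new_client, enabled))
```

The point of the sketch is that the manifest is dropped even though the capability is enabled, because the older client cannot recognize the name; bumping the API adds the name to the known set.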

https://github.com/openshift/oc/pull/1622

Bug OCPBUGS-30641: Power VS: Cannot deploy with service IDs

Description of problem:

    When deploying with a service ID, the installer is unable to query resource groups.

Version-Release number of selected component (if applicable):

    4.13-4.16

How reproducible:

    Easily

Steps to Reproduce:

    1. Create a service ID with seemingly enough permissions to do an IPI install
    2. Deploy to power vs with IPI
    3. Fail

Actual results:

    Fail to deploy a cluster with service ID

Expected results:

    cluster create should succeed

Additional info:

https://github.com/openshift/installer/pull/8111

Story HOSTEDCP-1420: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/hypershift/pull/3525

Bug OCPBUGS-25003: Update 4.16 ose-azure-cloud-node-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/99

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-provider-azure/pull/99

Bug OCPBUGS-24415: Installer will fail if disable CloudControllerManager capabilities for cloud

Description of problem:

If CCM is disabled on a cloud platform such as AWS, installation continues until it fails with the ingress LoadBalancerPending condition.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1. Build image with pr openshift/cluster-cloud-controller-manager-operator#284,openshift/installer#7546,openshift/cluster-version-operator#979,openshift/machine-config-operator#3999 
2. Install cluster on aws with "baselineCapabilitySet: v4.14"
3.

Actual results:

Installation failed, ingress LoadBalancerPending.
$ oc get node              
NAME                                        STATUS   ROLES                  AGE   VERSION
ip-10-0-25-230.us-east-2.compute.internal   Ready    control-plane,master   86m   v1.28.3+20a5764
ip-10-0-3-101.us-east-2.compute.internal    Ready    worker                 78m   v1.28.3+20a5764
ip-10-0-46-198.us-east-2.compute.internal   Ready    control-plane,master   87m   v1.28.3+20a5764
ip-10-0-48-220.us-east-2.compute.internal   Ready    worker                 80m   v1.28.3+20a5764
ip-10-0-79-203.us-east-2.compute.internal   Ready    control-plane,master   86m   v1.28.3+20a5764
ip-10-0-95-83.us-east-2.compute.internal    Ready    worker                 78m   v1.28.3+20a5764
 
$ oc get co          
NAME                            VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                  4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   False       False         True       85m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.zhsun-aws1.qe.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.zhsun-aws1.qe.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
baremetal                       4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      84m
cloud-credential                4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      86m
cluster-autoscaler              4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      84m
config-operator                 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      85m
console                         4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   False       True          False      79m     DeploymentAvailable: 0 replicas available for console deployment...
control-plane-machine-set       4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      81m
csi-snapshot-controller         4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      85m
dns                             4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      84m
etcd                            4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      83m
image-registry                  4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      78m
ingress                                                                                   False       True          True       78m     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending)
insights                        4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      78m
kube-apiserver                  4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      71m
kube-controller-manager         4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      82m
kube-scheduler                  4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      82m
kube-storage-version-migrator   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      85m
machine-api                     4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      77m
machine-approver                4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      84m
machine-config                  4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      84m
marketplace                     4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      84m
monitoring                      4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      73m
network                         4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      86m
node-tuning                     4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      78m
openshift-apiserver             4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      71m
openshift-controller-manager    4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      75m
openshift-samples               4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      78m
service-ca                      4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      85m
storage                         4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False

Expected results:

Tell users not to turn CCM off for cloud.
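
An early validation along these lines might look like the sketch below. It is an illustration only and not the installer's code (which is written in Go): the function name, the error wording, and the set of platforms are assumptions. The text above only names AWS explicitly; azure and gcp are added purely as examples of other cloud platforms.

```python
# Assumption: platforms that need the cloud controller manager. Only "aws" comes from the text.
CLOUD_PLATFORMS = {"aws", "azure", "gcp"}


def validate_ccm_capability(platform, enabled_capabilities):
    """Return an error message if CCM is disabled on a cloud platform, else None."""
    if platform in CLOUD_PLATFORMS and "CloudControllerManager" not in enabled_capabilities:
        return (
            f"platform {platform!r} requires the CloudControllerManager "
            "capability to be enabled"
        )
    return None


print(validate_ccm_capability("aws", {"MachineAPI"}))
print(validate_ccm_capability("aws", {"MachineAPI", "CloudControllerManager"}))
```

Failing fast at install-config validation time would surface the problem before the cluster gets stuck waiting on the ingress load balancer.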

Additional info:

https://github.com/openshift/installer/pull/7546

Task HOSTEDCP-1355: Remove Unused Functions from Repo

As a maintainer of the HyperShift repo, I would like to remove unused functions from the code base to reduce the code footprint of the repo.

https://github.com/openshift/hypershift/pull/3325

Bug OCPBUGS-25040: Update 4.16 ose-network-tools-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/network-tools/pull/105

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/network-tools/pull/105

Bug OCPBUGS-25803: ConsolePluginComponents task is not resilient

Description of problem:

The ConsolePluginComponents CMO task depends on the availability of a conversion service that is part of the console-operator Pod. That Pod is not duplicated, so when it restarts (due to a cluster upgrade or for any other reason), the conversion webhook becomes unavailable and all the ConsolePlugin API queries from that CMO task fail.

Version-Release number of selected component (if applicable):

How reproducible:

Create a 4.14 cluster, make the console-operator unmanaged and bring it down, watch the ConsolePluginComponents tasks fail instantly after they're run.

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

The ConsolePluginComponents tasks fail instantly after they're run.

Expected results:

The tasks should be more resilient and retry.

The long term solution is for that ConsolePlugin conversion service to be duplicated.

Additional info:

For OCP >=4.15, CMO v1 ConsolePlugin queries no longer rely on the conversion webhook because of https://github.com/openshift/api/pull/1477.

However, the retries will keep the task future-proof, and we'll be able to backport the fix.
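
The retry behaviour requested above can be sketched as follows. This is a minimal illustration in Python, not the CMO implementation; `retry_with_backoff`, its parameters, and the simulated `flaky_query` are hypothetical names introduced only for this example.

```python
import time


def retry_with_backoff(operation, attempts=5, initial_delay=0.01, factor=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are exhausted.

    Hypothetical helper: waits initial_delay after the first failure and
    multiplies the delay by factor after each further failure. The last
    error is re-raised if every attempt fails.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= factor


# Simulated query that fails twice (conversion webhook down), then succeeds.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("conversion webhook unavailable")
    return ["plugin-a"]


# Record the delays instead of sleeping so the example runs instantly.
delays = []
result = retry_with_backoff(flaky_query, sleep=delays.append)
```

Injecting the `sleep` function keeps the sketch testable; a real task would likely also bound the total time spent retrying.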

https://github.com/openshift/cluster-monitoring-operator/pull/2193

Bug OCPBUGS-26025: oc command fails to execute occasionally in some CI jobs

Description of problem:

The job [sig-node] [Conformance] Prevent openshift node labeling on update by the node TestOpenshiftNodeLabeling [Suite:openshift/conformance/parallel/minimal] uses the `oc debug` command [1]. Occasionally, that command fails to run, which ends up failing the test.

Failing job example - https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/2220/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-ovn-single-node/1742564553356480512

CI search results - https://search.ci.openshift.org/?search=TestOpenshiftNodeLabeling&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

[1] https://github.com/openshift/origin/pull/28296/files#diff-001ed6507a42a0a1689e86c05512e6186c3483488ec96bf5f4354a8f7fa79261R39

Version-Release number of selected component (if applicable):

How reproducible:

Observed in CI jobs mentioned above.

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

   oc debug command occasionally exits with error

Expected results:

    oc debug command should not occasionally exit with error

Additional info:

https://github.com/openshift/origin/pull/28499

Bug OCPBUGS-27214: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-credential-operator/pull/654

Bug OCPBUGS-29210: Several CLI example commands do not work

Several oc examples are incorrect. These are used in the CLI reference docs, but would also appear in the oc CLI help.

The commands that don't work have been removed manually from the CLI reference docs via this update: https://github.com/openshift/openshift-docs/compare/9907074162999c982a8a97c45665c98913d848c9..441f3419ef460d9863a45e4c2d6914b1c019e1d1

List of commands:

oc adm copy-to-node --copy=new-bootstrap-kubeconfig=/etc/kubernetes/kubeconfig
oc adm copy-to-node --copy=new-bootstrap-kubeconfig=/etc/kubernetes/kubeconfig -l node-role.kubernetes.io/master
oc adm certificates regenerate-leaf -A --all
oc adm ocp-certificates regenerate-leaf -n openshift-config-managed kube-controller-manager-client-cert
oc adm certificates regenerate-machine-config-server-serving-cert --update-ignition=false
oc adm certificates update-ignition-ca-bundle-for-machine-config-server
oc adm certificates regenerate-signers -A --all
oc adm certificates regenerate-signers -n openshift-kube-apiserver-operator loadbalancer-serving-signer
oc adm ocp-certificates remove-old-trust configmaps -A --all
oc adm ocp-certificates remove-old-trust -n openshift-config-managed configmaps/kube-apiserver-aggregator-client-ca

For more information, see the feedback on these PRs:

4.15 CLI doc PR: https://github.com/openshift/openshift-docs/pull/71213
4.14 CLI doc PR: https://github.com/openshift/openshift-docs/pull/65709

https://github.com/openshift/oc/pull/1686

Bug OCPBUGS-29888: [2186372] Packet drops during the initial phase of VM live migration

Description of problem:

Following https://issues.redhat.com/browse/CNV-28040
On CNV, when a virtual machine with secondary interfaces connected through the bridge CNI is live migrated, we observe disruption of the VM's inbound traffic.

The root cause is that the bridge interface on the migration target is advertised before the migration is completed.

When the migration destination pod is created an IPv6 NS (Neighbor Solicitation)
and NA (Neighbor Advertisement) are sent automatically by the kernel.
The switches at the endpoints (e.g.: migration destination node) tables
get updated and the traffic is forwarded to the migration destination before
the migration is completed [1].

The solution is to have the bridge CNI create the pod interface in "link-down" state [2] so that the IPv6 NS/NA packets are avoided; CNV, in turn, sets the pod interface to "link-up" [3].

CNV depends on bridge CNI with [2] bits, which is deployed by cluster-network-operator.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2186372#c6
[2] https://github.com/containernetworking/plugins/pull/997
[3] https://github.com/kubevirt/kubevirt/pull/11069

Version-Release number of selected component (if applicable):

4.16.0

How reproducible:

100%

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

CNO deploys CNI bridge w/o an option to set the bridge interface down.

Expected results:

CNO to deploy bridge CNI with [1] changes, from release-4.16 branch. 

[1] https://github.com/containernetworking/plugins/pull/997

Additional info:

More details: https://issues.redhat.com/browse/CNV-28040

https://github.com/openshift/containernetworking-plugins/pull/154

Bug OCPBUGS-24902: Update 4.16 openshift-enterprise-registry-container image to be consistent with ART

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

Bug OCPBUGS-25342: HyperShift should encrypt the same resources that OCP standalone encrypts in etcd

Description of problem:

    Standalone OCP encrypts various resources at rest in etcd:
https://docs.openshift.com/container-platform/4.14/security/encrypting-etcd.html
HyperShift control planes are only encrypting secrets. We should have parity with standalone.

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

    Always

Steps to Reproduce:

    1. Create HyperShift standalone control plane
    2. Check that configmaps, routes, oauth access tokens or oauth authorize tokens are encrypted

Actual results:

    Those resources are not encrypted

Expected results:

    Those resources are encrypted

Additional info:

Resources to be encrypted are configured here:
https://github.com/openshift/hypershift/blob/main/control-plane-operator/controllers/hostedcontrolplane/kas/kms/aws.go#L121-L126
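
The parity check described in this report can be sketched as follows. This is an illustrative Python sketch, not the HyperShift code: the dictionary follows the shape of a Kubernetes EncryptionConfiguration, the `resource.group` spellings of the OpenShift resources are assumptions, and the `example-kms` provider with its socket path is a hypothetical placeholder.

```python
# Resources named in the report. The "resource.group" spelling follows the
# EncryptionConfiguration convention and is an assumption of this sketch.
RESOURCES_TO_ENCRYPT = [
    "secrets",
    "configmaps",
    "routes.route.openshift.io",
    "oauthaccesstokens.oauth.openshift.io",
    "oauthauthorizetokens.oauth.openshift.io",
]


def build_encryption_config(resources, providers):
    """Return an EncryptionConfiguration-shaped dict for the given resources."""
    return {
        "apiVersion": "apiserver.config.k8s.io/v1",
        "kind": "EncryptionConfiguration",
        "resources": [{"resources": list(resources), "providers": list(providers)}],
    }


def unencrypted_resources(config, required=RESOURCES_TO_ENCRYPT):
    """List required resources that no encrypting rule covers.

    The first provider of a rule is the one used for writes, so a rule whose
    first provider is "identity" is treated as not encrypting.
    """
    covered = set()
    for rule in config.get("resources", []):
        providers = rule.get("providers", [])
        if providers and "identity" not in providers[0]:
            covered.update(rule.get("resources", []))
    return [name for name in required if name not in covered]


# Hypothetical placeholder provider, followed by the identity fallback for reads.
providers = [
    {"kms": {"name": "example-kms", "endpoint": "unix:///tmp/example.sock"}},
    {"identity": {}},
]

secrets_only = build_encryption_config(["secrets"], providers)
full_parity = build_encryption_config(RESOURCES_TO_ENCRYPT, providers)

missing_before = unencrypted_resources(secrets_only)
missing_after = unencrypted_resources(full_parity)
```

With the secrets-only configuration, `missing_before` holds the four resources from the reproduction steps; with the full list, `missing_after` is empty.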

https://github.com/openshift/hypershift/pull/3341

Bug OCPBUGS-25841: [vsphere] IPI destroy cluster failed to delete TagCategory

Description of problem:

After running ./openshift-install destroy cluster, the TagCategory still exists

# ./openshift-install destroy cluster --dir cluster --log-level debug
DEBUG OpenShift Installer 4.15.0-0.nightly-2023-12-18-220750
DEBUG Built from commit 2b894776f1653ab818e368fa625019a6de82a8c7
DEBUG Power Off Virtual Machines
DEBUG Powered off                                   VirtualMachine=sgao-devqe-spn2w-master-2
DEBUG Powered off                                   VirtualMachine=sgao-devqe-spn2w-master-1
DEBUG Powered off                                   VirtualMachine=sgao-devqe-spn2w-master-0
DEBUG Powered off                                   VirtualMachine=sgao-devqe-spn2w-worker-0-kpg46
DEBUG Powered off                                   VirtualMachine=sgao-devqe-spn2w-worker-0-w5rrn
DEBUG Delete Virtual Machines
INFO Destroyed                                     VirtualMachine=sgao-devqe-spn2w-rhcos-generated-region-generated-zone
INFO Destroyed                                     VirtualMachine=sgao-devqe-spn2w-master-2
INFO Destroyed                                     VirtualMachine=sgao-devqe-spn2w-master-1
INFO Destroyed                                     VirtualMachine=sgao-devqe-spn2w-master-0
INFO Destroyed                                     VirtualMachine=sgao-devqe-spn2w-worker-0-kpg46
INFO Destroyed                                     VirtualMachine=sgao-devqe-spn2w-worker-0-w5rrn
DEBUG Delete Folder
INFO Destroyed                                     Folder=sgao-devqe-spn2w
DEBUG Delete                                        StoragePolicy=openshift-storage-policy-sgao-devqe-spn2w
INFO Destroyed                                     StoragePolicy=openshift-storage-policy-sgao-devqe-spn2w
DEBUG Delete                                        Tag=sgao-devqe-spn2w
INFO Deleted                                       Tag=sgao-devqe-spn2w
DEBUG Delete                                        TagCategory=openshift-sgao-devqe-spn2w
INFO Deleted                                       TagCategory=openshift-sgao-devqe-spn2w
DEBUG Purging asset "Metadata" from disk
DEBUG Purging asset "Master Ignition Customization Check" from disk
DEBUG Purging asset "Worker Ignition Customization Check" from disk
DEBUG Purging asset "Terraform Variables" from disk
DEBUG Purging asset "Kubeconfig Admin Client" from disk
DEBUG Purging asset "Kubeadmin Password" from disk
DEBUG Purging asset "Certificate (journal-gatewayd)" from disk
DEBUG Purging asset "Cluster" from disk
INFO Time elapsed: 29s
INFO Uninstallation complete!

# govc tags.category.ls | grep sgao
openshift-sgao-devqe-spn2w

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2023-12-18-220750

How reproducible:

    always

Steps to Reproduce:

    1. IPI install OCP on vSphere
    2. Destroy cluster installed, check TagCategory

Actual results:

    TagCategory still exists

Expected results:

    TagCategory should be deleted

Additional info:

    Also reproduced in openshift-install-linux-4.14.0-0.nightly-2023-12-20-184526 and 4.13.0-0.nightly-2023-12-21-194724, while 4.12.0-0.nightly-2023-12-21-162946 does not have this issue

https://github.com/openshift/installer/pull/7876

Bug OCPBUGS-25125: regression - aws-ebs-csi-driver-node- fails to deploy too many times because of SCCs

Description of problem:

 The `aws-ebs-csi-driver-node-` appears to be failing to deploy way too often in the CI recently

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

  in a statistically significant pattern

Steps to Reproduce:

    1. run OCP test suite many times for it to matter

Actual results:

    fail [github.com/openshift/origin/test/extended/authorization/scc.go:76]: 1 pods failed before test on SCC errors
Error creating: pods "aws-ebs-csi-driver-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, provider restricted-v2: .containers[0].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[0].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[0].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[1].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[1].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[1].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[2].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[2].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider 
"hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/aws-ebs-csi-driver-node -n openshift-cluster-csi-drivers happened 4 times

Expected results:

Test pass

Additional info:

Link to the regression dashboard - https://sippy.dptools.openshift.org/sippy-ng/component_readiness/capability?baseEndTime=2023-10-31%2023%3A59%3A59&baseRelease=4.14&baseStartTime=2023-10-04%2000%3A00%3A00&capability=SCC&component=oauth-apiserver&confidence=95&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&pity=5&sampleEndTime=2023-12-11%2023%3A59%3A59&sampleRelease=4.15&sampleStartTime=2023-12-05%2000%3A00%3A00

[sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel]

Bug OCPBUGS-24976: Update 4.16 cluster-network-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-network-operator/pull/2156

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-26399: Missing support for singular VIP in ACI for BareMetal

Description of problem:

Since the singular variant of APIVIP/IngressVIP has been removed as part of https://github.com/openshift/installer/pull/7574, the appliance disk image e2e job is now failing: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-appliance-master-e2e-compact-ipv4-static

The job fails since the appliance supports only 4.14, which still requires the singular variant of the VIP properties.

Version-Release number of selected component (if applicable):

4.16

How reproducible:

Always

Steps to Reproduce:

1. Invoke appliance e2e job on master: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-appliance-master-e2e-compact-ipv4-static

Actual results:

Job fails with the following validation error:
"the Machine Network CIDR can be defined by setting either the API or Ingress virtual IPs"
Due to missing apiVIP and ingressVIP in AgentClusterInstall.

Expected results:

AgentClusterInstall should also include the singular 'apiVIP' and 'ingressVIP', and the e2e job should complete successfully
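
The expected behaviour can be sketched as a back-fill of the singular fields from the plural ones. This is a minimal Python illustration, not the installer's code: the function name `add_singular_vips`, the plural key names `apiVIPs` and `ingressVIPs`, and the sample addresses are assumptions made for the example.

```python
def add_singular_vips(spec):
    """Return a copy of spec with apiVIP/ingressVIP filled from the plural lists.

    Hypothetical helper: an existing singular value is never overwritten, and
    nothing is added when the plural list is missing or empty.
    """
    result = dict(spec)
    for plural, singular in (("apiVIPs", "apiVIP"), ("ingressVIPs", "ingressVIP")):
        vips = result.get(plural) or []
        if vips and not result.get(singular):
            result[singular] = vips[0]
    return result


# Sample input with only the plural variants set (addresses are made up).
plural_only = {"apiVIPs": ["192.168.111.5"], "ingressVIPs": ["192.168.111.4"]}
with_singular = add_singular_vips(plural_only)
```

Keeping both variants in the generated manifest lets a consumer that still requires the singular properties, such as the 4.14-based appliance mentioned above, pass validation.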

Additional info:

https://github.com/openshift/installer/pull/7859

Bug OCPBUGS-25516: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/vmware-vsphere-csi-driver/pull/106

Bug OCPBUGS-24036: [4.15] CNO fails to apply ovnkube-master daemonset during upgrade

This is a clone of issue ~~OCPBUGS-22293~~. The following is the description of the original issue:
—
Description of problem:

Upgrading from 4.13.5 to 4.13.17 fails at network operator upgrade

Version-Release number of selected component (if applicable):

How reproducible:

Not sure since we only had one cluster on 4.13.5.

Steps to Reproduce:

1. Have a cluster on version 4.13.5 with OVN-Kubernetes
2. Set desired update image to quay.io/openshift-release-dev/ocp-release@sha256:c1f2fa2170c02869484a4e049132128e216a363634d38abf292eef181e93b692
3. Wait until it reaches network operator

Actual results:

Error message: Error while updating operator configuration: could not apply (apps/v1, Kind=DaemonSet) openshift-ovn-kubernetes/ovnkube-master: failed to apply / update (apps/v1, Kind=DaemonSet) openshift-ovn-kubernetes/ovnkube-master: DaemonSet.apps "ovnkube-master" is invalid: [spec.template.spec.containers[1].lifecycle.preStop: Required value: must specify a handler type, spec.template.spec.containers[3].lifecycle.preStop: Required value: must specify a handler type]

Expected results:

Network operator upgrades successfully
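
The validation failure reported above can be reproduced in miniature with the following Python sketch. It is not the API server's validation code: `prestop_errors` is a hypothetical helper, the set of accepted handler kinds is an assumption, and the sample DaemonSet is made up to mirror the container indexes in the error message.

```python
# Handler kinds accepted for a lifecycle hook in this sketch (an assumption;
# newer Kubernetes versions may accept additional kinds).
HANDLER_TYPES = ("exec", "httpGet", "tcpSocket")


def prestop_errors(daemonset):
    """Return one message per container whose preStop hook has no handler."""
    errors = []
    containers = daemonset["spec"]["template"]["spec"].get("containers", [])
    for index, container in enumerate(containers):
        lifecycle = container.get("lifecycle") or {}
        if "preStop" not in lifecycle:
            continue  # no hook declared, nothing to validate
        pre_stop = lifecycle["preStop"] or {}
        if not any(pre_stop.get(kind) is not None for kind in HANDLER_TYPES):
            errors.append(
                f"spec.template.spec.containers[{index}].lifecycle.preStop: "
                "must specify a handler type"
            )
    return errors


# Made-up DaemonSet: containers 1 and 3 declare an empty preStop hook.
example = {"spec": {"template": {"spec": {"containers": [
    {"name": "c0"},
    {"name": "c1", "lifecycle": {"preStop": {}}},
    {"name": "c2", "lifecycle": {"preStop": {"exec": {"command": ["/bin/true"]}}}},
    {"name": "c3", "lifecycle": {"preStop": None}},
]}}}}

errors = prestop_errors(example)
```

A check of this kind, run against the rendered manifest before applying it, would flag the same two containers that the upgrade error lists.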

Additional info:

Since I'm not able to attach files please gather all required debug data from https://access.redhat.com/support/cases/#/case/03645170

https://github.com/openshift/cluster-network-operator/pull/2114

Bug OCPBUGS-24176: unable to create hypershift-hosted cluster via cluster-bot

Upon debugging, nodes are stuck in NotReady state and CNI is not initialised on them.

Seeing the following error log in cluster network operator

failed parsing certificate data from ConfigMap "openshift-service-ca.crt": failed to parse certificate PEM

CNO operator logs: https://docs.google.com/document/d/1hor1r9ue4gnetkXm9mh8AKa7vm8zNBPhUQqWCbbnnUc/edit?usp=sharing

This is happening on a management cluster that is configured to use legacy service CAs:

$ oc get kubecontrollermanager/cluster -o yaml --as system:admin
apiVersion: operator.openshift.io/v1
kind: KubeControllerManager
metadata:
  name: cluster
spec:
  logLevel: Normal
  managementState: Managed
  operatorLogLevel: Normal
  unsupportedConfigOverrides: null
  useMoreSecureServiceCA: false

In newer clusters, useMoreSecureServiceCA is set to true.

https://github.com/openshift/cluster-network-operator/pull/2178

Bug OCPBUGS-25630: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-libvirt/pull/276

Bug OCPBUGS-27323: Test failure in upgrade jobs- [bz-Image Registry] clusteroperator/image-registry should not change condition/Available

Description of problem:

Observing the following test case failure in 4.14 to 4.15 and 4.15 to 4.16 upgrade CI runs continuously.
[bz-Image Registry] clusteroperator/image-registry should not change condition/Available

Job link: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-ppc64le/1746834772249808896

4.14 Image: registry.ci.openshift.org/ocp-ppc64le/release-ppc64le:4.14.0-0.nightly-ppc64le-2024-01-15-085349
4.15 Image: registry.ci.openshift.org/ocp-ppc64le/release-ppc64le:4.15.0-0.nightly-ppc64le-2024-01-15-042536

https://github.com/openshift/origin/pull/28605

Story TRT-1403: Port remaining intervals to structured

I was identifying what remains with:

cat e2e-events_20231204-183144.json| jq '.items[] | select(has("tempSource") | not)'

I think I've cleared all the difficult ones, hopefully these are just simple stragglers.
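
For readers who prefer a script over jq, the same selection can be sketched in Python. The function name `intervals_without_temp_source` and the sample records are hypothetical; only the `items` array and the `tempSource` key come from the jq filter above.

```python
import json


def intervals_without_temp_source(document):
    """Mirror `.items[] | select(has("tempSource") | not)`.

    Returns every entry of document["items"] that lacks a "tempSource" key.
    """
    return [item for item in document.get("items", []) if "tempSource" not in item]


# Made-up sample standing in for an e2e-events JSON file.
sample = json.loads("""
{
  "items": [
    {"message": "ported interval", "tempSource": "example"},
    {"message": "legacy interval"},
    {"message": "another legacy interval"}
  ]
}
""")

remaining = intervals_without_temp_source(sample)
```

Like jq's `has`, the check looks only at whether the key exists, so an item with an empty `tempSource` value is still treated as present.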

https://github.com/openshift/origin/pull/28463

Bug OCPBUGS-2135: [GCP]capi machine cannot be deleted by installer during cluster destroy

Description of problem:

A capi machine cannot be deleted by the installer during cluster destroy. Checking on the GCP console shows that this machine lacks the label (kubernetes-io-cluster-clusterid: owned). If this label is added manually on the GCP console for the machine, the machine can then be deleted by the installer during cluster destroy.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-05-053337

How reproducible:

Always

Steps to Reproduce:

1. Follow the steps here https://bugzilla.redhat.com/show_bug.cgi?id=2107999#c9 to create a capi machine

liuhuali@Lius-MacBook-Pro huali-test % oc get machine.cluster.x-k8s.io      -n openshift-cluster-api
NAME             CLUSTER            NODENAME   PROVIDERID                                                         PHASE         AGE   VERSION
capi-ms-mtchm    huliu-gcpx-c55vm              gce://openshift-qe/us-central1-a/capi-gcp-machine-template-gcw9t   Provisioned   51m   

2. Destroy the cluster
The cluster was destroyed successfully, but checking on the GCP console showed that the capi machine is still there.

labels of capi machine

labels of mapi machine

Actual results:

capi machine cannot be deleted by installer during cluster destroy

Expected results:

capi machine should be deleted by installer during cluster destroy

Additional info:

Also checked on AWS: the case worked well there, and the capi machines have the tag (kubernetes.io/cluster/clusterid:owned).
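
The label-based selection that this report relies on can be sketched as follows. This is a Python illustration, not the installer's destroy code: `owned_by_cluster`, `instances_to_destroy`, the instance dictionaries, and the `unrelated-vm` entry are hypothetical; the label key pattern and the `owned` value are taken from the description above.

```python
def owned_by_cluster(labels, cluster_id):
    """True when the labels mark the resource as owned by the given cluster."""
    return (labels or {}).get(f"kubernetes-io-cluster-{cluster_id}") == "owned"


def instances_to_destroy(instances, cluster_id):
    """Return the names of instances that carry the cluster's ownership label."""
    return [
        instance["name"]
        for instance in instances
        if owned_by_cluster(instance.get("labels"), cluster_id)
    ]


cluster_id = "huliu-gcpx-c55vm"
instances = [
    # Machine labeled as owned by the cluster: selected for deletion.
    {"name": "mapi-machine", "labels": {f"kubernetes-io-cluster-{cluster_id}": "owned"}},
    # Machine without the label, as in this report: skipped by the selection.
    {"name": "capi-gcp-machine-template-gcw9t", "labels": {}},
    # Machine belonging to some other cluster (made-up entry).
    {"name": "unrelated-vm", "labels": {"kubernetes-io-cluster-other": "owned"}},
]

selected = instances_to_destroy(instances, cluster_id)
```

Under this selection rule, a machine that lacks the ownership label is never considered, which matches the observation that adding the label manually makes the deletion succeed.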

https://github.com/openshift/installer/pull/7907

Bug OCPBUGS-26145: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/csi-driver-shared-resource/pull/161

Bug OCPBUGS-29388: OpenShift Document for AWS Cloudformation Template on Worker Nodes needs updated description

Description of problem:

Worker CloudFormation template: Installing a cluster on AWS using CloudFormation templates - Installing on AWS | Installing | OpenShift Container Platform 4.13 (https://docs.openshift.com/container-platform/4.13/installing/installing_aws/installing-aws-user-infra.html#installation-cloudformation-worker_installing-aws-user-infra)

    In the OpenShift documentation for manual AWS CloudFormation templates, within the CloudFormation template for worker nodes, the descriptions for Subnet and WorkerSecurityGroupId refer to the master nodes. Based on the variable names, the descriptions should refer to worker nodes instead.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

https://github.com/openshift/installer/pull/8112

Bug OCPBUGS-25000: Update 4.16 ose-cluster-api-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api/pull/190

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api/pull/190

Bug OCPBUGS-25541: Update 4.16 ose-gcp-pd-csi-driver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/106

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/gcp-pd-csi-driver-operator/pull/106

Story OCPCLOUD-2517: Promote CAPI IPAM CRDs to GA

User Story

As a developer I want to remove the NoUpgrade annotation from the CAPI IPAM CRDs so that I can promote them to General Availability

Background

The SPLAT team is planning to have the CAPI IPAM CRDs promoted to GA because they need them in a component they are promoting to GA.

Steps

remove the NoUpgrade annotation from the CAPI IPAM CRDs

Stakeholders

SPLAT
Cluster Infra Team

Definition of Done

manifests-generator PR merged
cluster-api repo PR merged

Bug OCPBUGS-29115: hcp create nodepool agent '--node-upgrade-type' param is mandatory although in --help it has default value

Description of problem:

Trying to run without --node-upgrade-type param fails for "spec.management.upgradeType: Unsupported value: \"\": supported values: \"Replace\", \"InPlace\""


although in --help it is documented to have a default value of 'InPlace'

Version-Release number of selected component (if applicable):

 [kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp -v
hcp version openshift/hypershift: af9c0b3ce9c612ec738762a8df893c7598cbf157. Latest supported OCP: 4.15.0

How reproducible:

  happens all the time

Steps to Reproduce:

    1. On a hosted cluster setup, run:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 --node-upgrade-type Replace --help
Creates basic functional NodePool resources for Agent platform

Usage:
  hcp create nodepool agent [flags]

Flags:
  -h, --help   help for agent

Global Flags:
      --cluster-name string             The name of the HostedCluster nodes in this pool will join. (default "example")
      --name string                     The name of the NodePool.
      --namespace string                The namespace in which to create the NodePool. (default "clusters")
      --node-count int32                The number of nodes to create in the NodePool. (default 2)
      --node-upgrade-type UpgradeType   The NodePool upgrade strategy for how nodes should behave when upgraded. Supported options: Replace, InPlace (default )
      --release-image string            The release image for nodes; if this is empty, defaults to the same release image as the HostedCluster.
      --render                          Render output as YAML to stdout instead of applying.
     

2. Try to run with the default value of --node-upgrade-type:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2

Actual results:

[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2
2024-02-06T19:57:03+02:00       ERROR   Failed to create nodepool       {"error": "NodePool.hypershift.openshift.io \"nodepool-of-extra1\" is invalid: spec.management.upgradeType: Unsupported value: \"\": supported values: \"Replace\", \"InPlace\""}
github.com/openshift/hypershift/cmd/nodepool/core.(*CreateNodePoolOptions).CreateRunFunc.func1
        /home/kni/hypershift_working/hypershift/cmd/nodepool/core/create.go:39
github.com/spf13/cobra.(*Command).execute
        /home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
        /home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:1115
github.com/spf13/cobra.(*Command).Execute
        /home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:1039
github.com/spf13/cobra.(*Command).ExecuteContext
        /home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:1032
main.main
        /home/kni/hypershift_working/hypershift/product-cli/main.go:60
runtime.main
        /home/kni/hypershift_working/go/src/runtime/proc.go:250
Error: NodePool.hypershift.openshift.io "nodepool-of-extra1" is invalid: spec.management.upgradeType: Unsupported value: "": supported values: "Replace", "InPlace"
NodePool.hypershift.openshift.io "nodepool-of-extra1" is invalid: spec.management.upgradeType: Unsupported value: "": supported values: "Replace", "InPlace"

Expected results:

   Should pass as if you were adding the param:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 --node-upgrade-type InPlace
NodePool nodepool-of-extra1 created
[kni@ocp-edge119 ~]$

Additional info:

A related issue is that the --help output differs depending on whether it is used with other parameters:

[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 --node-upgrade-type Replace --help > long.help.out
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --help > short.help.out
[kni@ocp-edge119 ~]$ diff long.help.out short.help.out 
14c14
<       --node-upgrade-type UpgradeType   The NodePool upgrade strategy for how nodes should behave when upgraded. Supported options: Replace, InPlace (default )
---
>       --node-upgrade-type UpgradeType   The NodePool upgrade strategy for how nodes should behave when upgraded. Supported options: Replace, InPlace
[kni@ocp-edge119 ~]$
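
The expected flag behaviour, where omitting the option falls back to a valid default that also appears in the help text, can be sketched with Python's argparse. This is only an illustration of the pattern, not the hcp CLI (which is built on cobra): the program name `create-nodepool-sketch` is made up, and choosing `InPlace` as the default follows the report's expectation rather than any confirmed fix.

```python
import argparse

SUPPORTED_UPGRADE_TYPES = ("Replace", "InPlace")


def build_parser(default_upgrade_type="InPlace"):
    """Build a parser whose --node-upgrade-type has a real, documented default."""
    if default_upgrade_type not in SUPPORTED_UPGRADE_TYPES:
        raise ValueError(f"unsupported default: {default_upgrade_type!r}")
    parser = argparse.ArgumentParser(prog="create-nodepool-sketch")
    parser.add_argument(
        "--node-upgrade-type",
        choices=SUPPORTED_UPGRADE_TYPES,
        default=default_upgrade_type,
        # %(default)s makes the help text show the value that is actually used.
        help="NodePool upgrade strategy. Supported options: Replace, InPlace "
             "(default %(default)s)",
    )
    return parser


parser = build_parser()
omitted = parser.parse_args([]).node_upgrade_type
explicit = parser.parse_args(["--node-upgrade-type", "Replace"]).node_upgrade_type
help_text = parser.format_help()
```

Deriving the help text from the same default that the parser applies avoids the mismatch reported here, where the help shows an empty default and the empty value is then rejected by validation.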

https://github.com/openshift/hypershift/pull/3572

Task HOSTEDCP-1372: Bump k8s to v0.29

Bump k8s to v0.29

https://github.com/openshift/hypershift/pull/3360

Bug OCPBUGS-5113: Date&Time values are not showing as per browser default language

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/console/pull/12506

Bug OCPBUGS-25011: Update 4.16 ose-vsphere-problem-detector-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/vsphere-problem-detector/pull/144

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-25943: Adding test case when exceed openshift.io/image-tags will ban to create new image references in the project

Description of problem:

Add a test case verifying that exceeding the openshift.io/image-tags limit prevents new image references from being created in the project.

Version-Release number of selected component (if applicable):

    4.16

pr - https://github.com/openshift/origin/pull/28464

https://github.com/openshift/origin/pull/28464

Bug CLID-70: v2 is not mirroring multiple catalogs

In v2, there is an issue with mirroring multiple catalogs in the same mirroring flow.

https://github.com/openshift/oc-mirror/pull/807

Story HOSTEDCP-1188: Include docs for our control plane scheduling topologies

Extend https://hypershift-docs.netlify.app/how-to/distribute-hosted-cluster-workloads/ to cover the behaviour and requirements of the share-everything, share-nothing, and dedicated topologies. See https://docs.google.com/document/d/1eSaqR7rUwelq0PRC_trL3d5vxLMlJmLytHtFZyFOpvg/edit

e.g. https://docs.google.com/presentation/d/1gfvr2dqcfdJ3UdOmgy09E8wCZYIIsvagva7RA-w4-s0/edit#slide=id.g2519af5cbc8_0_54

https://github.com/openshift/hypershift/pull/3434

Bug OCPBUGS-24947: Update 4.16 prometheus-operator-admission-webhook-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/prometheus-operator/pull/266

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/prometheus-operator/pull/266

Bug OCPBUGS-26058: Revert termination log permissions carry

~~OCPBUGS-11856~~ added a patch to change termination log permissions manually. Since 4.15.z this is no longer necessary, as it is fixed by the lumberjack dependency bump.

This bug tracks the revert of that carry.

https://github.com/openshift/kubernetes/pull/1841

Bug OCPBUGS-24804: Update 4.16 csi-provisioner-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/79

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-provisioner/pull/79

Bug OCPBUGS-25579: Update 4.16 ose-vertical-pod-autoscaler-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/274

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/kubernetes-autoscaler/pull/274

Bug OCPBUGS-25575: Update 4.16 ose-csi-external-snapshotter-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/136

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-snapshotter/pull/136

Bug OCPBUGS-27267: Update 4.16 ose-azure-disk-csi-driver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-operator/pull/126

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-27445: Client side throttling when running the metrics controller

Description of problem:

Client side throttling observed when running the metrics controller.

Steps to Reproduce:

1. Install an AWS cluster in mint mode
2. Enable debug log by editing cloudcredential/cluster
3. Wait for the metrics loop to run for a few times
4. Check CCO logs

Actual results:

// 7s consumed by metrics loop which is caused by client-side throttling 
time="2024-01-20T19:43:56Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
I0120 19:43:56.251278       1 request.go:629] Waited for 176.161298ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials
I0120 19:43:56.451311       1 request.go:629] Waited for 197.182213ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials
I0120 19:43:56.651313       1 request.go:629] Waited for 197.171082ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials
I0120 19:43:56.850631       1 request.go:629] Waited for 195.251487ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials
...
time="2024-01-20T19:44:03Z" level=info msg="reconcile complete" controller=metrics elapsed=7.231061324s

Expected results:

No client-side throttling when running the metrics controller.
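
The waits of roughly 200 ms in the log above are consistent with a client-side token-bucket limiter that refills at 5 requests per second (client-go defaults to QPS 5 with a burst of 10; whether CCO runs with those defaults is an assumption here). The following Python sketch, with hypothetical names and a simulated clock, illustrates how such a limiter delays a run of back-to-back requests. It is an illustration only, not the CCO or client-go code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenBucket:
    """Simulated token-bucket limiter; time is tracked as a number, nothing sleeps."""
    qps: float          # tokens added per second
    burst: int          # bucket capacity
    tokens: float = 0.0
    now: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = float(self.burst)  # start with a full bucket

    def acquire(self) -> float:
        """Take one token and return how long the caller had to wait, in seconds."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 0.0
        wait = (1.0 - self.tokens) / self.qps  # time until one token is available
        self.now += wait
        self.tokens = 0.0                      # the refilled token is consumed at once
        return wait


def simulate(requests: int, qps: float = 5.0, burst: int = 10) -> List[float]:
    """Return the per-request wait for a run of back-to-back requests."""
    bucket = TokenBucket(qps=qps, burst=burst)
    return [bucket.acquire() for _ in range(requests)]


# Illustrative request count, not taken from the bug report.
waits = simulate(requests=45)
print(f"throttled requests: {sum(1 for w in waits if w > 0)}, total wait: {sum(waits):.1f}s")
```

In this model the first 10 requests pass immediately and every later one waits about 0.2 s, so a loop that issues a few dozen sequential GETs can spend several seconds doing nothing but waiting.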

https://github.com/openshift/cloud-credential-operator/pull/656

Bug OCPBUGS-29623: Some oc cli commands don't respect --certificate-authority

Description of problem:

When using the oc cli to query information about release images it is not possible to use the --certificate-authority option to specify an alternative CA bundle for verifying connections to the target registry.

Version-Release number of selected component (if applicable): 4.14.5

How reproducible: 100%

Steps to Reproduce:

    1. oc adm release info --registry-config ./auth.json --certificate-authority ./tls-ca-bundle.pem quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64

Actual results:

error: unable to read image quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64: Get "https://quay.io/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority

Expected results:

Something beginning with:

Name:           4.14.9
Digest:         sha256:f5eaf0248779a0478cfd83f055d56dc7d755937800a68ad55f6047c503977c44
Created:        2024-01-12T06:48:42Z
OS/Arch:        linux/amd64
Manifests:      680
Metadata files: 1

Pull From: quay.io/openshift-release-dev/ocp-release@sha256:f5eaf0248779a0478cfd83f055d56dc7d755937800a68ad55f6047c503977c44

Release Metadata:

Additional info:

To fully verify that this was an issue, I went through the following steps. They show that oc is not using the CA bundle in the provided file, and that the command would have worked if oc had used it.

// show the command works with the system CA bundle

# oc adm release info --registry-config ./auth.json quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64 | head
Name:           4.14.9
Digest:         sha256:f5eaf0248779a0478cfd83f055d56dc7d755937800a68ad55f6047c503977c44
Created:        2024-01-12T06:48:42Z
OS/Arch:        linux/amd64
Manifests:      680
Metadata files: 1

Pull From: quay.io/openshift-release-dev/ocp-release@sha256:f5eaf0248779a0478cfd83f055d56dc7d755937800a68ad55f6047c503977c44

Release Metadata:

// move the system CA bundle to the local directory

# mv /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem .

// show the same command now fails without that bundle file

# oc adm release info --registry-config ./auth.json quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64 | head
error: unable to read image quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64: Get "https://quay.io/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority

// show using that same bundle file with --certificate-authority doesn't work

# oc adm release info --registry-config ./auth.json --certificate-authority ./tls-ca-bundle.pem quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64 | head
error: unable to read image quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64: Get "https://quay.io/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority


This also seems to be a problem for at least the following commands:
oc image info
oc adm release extract
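
For illustration only, the behaviour the reporter expects from `--certificate-authority` can be sketched in Python: when a CA file is supplied it becomes the trust source for the registry connection, otherwise the system bundle is used. The function name is hypothetical, and this is not how oc is implemented.

```python
import ssl
from typing import Optional


def build_tls_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Return a verifying TLS context that trusts ca_file if given, else system CAs."""
    if ca_file:
        # Trust only the certificates in the supplied PEM bundle.
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        ctx.load_verify_locations(cafile=ca_file)
        return ctx
    # No explicit bundle: fall back to the platform's default trust store.
    return ssl.create_default_context()
```

With this shape, `build_tls_context("./tls-ca-bundle.pem")` would keep working after the system bundle is moved away, which is the outcome the reproduction steps above look for.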

https://github.com/openshift/oc/pull/1693

Bug OCPBUGS-30551: Switch to service to get the PLR and TR logs from the Tekton results summary API

https://github.com/openshift/console/pull/13654

Bug OCPBUGS-26197: PKI Operator Starts Even When Hosted Cluster Is Annotated To Turn Off PKI

Description of problem:

    The PKI operator runs even when the annotation to turn off PKI is set on the hosted control plane.

https://github.com/openshift/hypershift/pull/3368

Bug OCPBUGS-25583: Update 4.16 ose-prometheus-adapter-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/k8s-prometheus-adapter/pull/100

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/k8s-prometheus-adapter/pull/100

Bug OCPBUGS-26061: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-aws/pull/53

Bug OCPBUGS-25262: Update 4.16 operator-registry-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/633

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/operator-framework-olm/pull/657

Bug OCPBUGS-29220: cluster install failed with azure workload identity

Description of problem:

Installing a cluster with Azure workload identity against a 4.16 nightly build failed because some cluster operators (co) are degraded.
$ oc get co | grep -v "True        False         False"
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.16.0-0.nightly-2024-02-07-200316   False       False         True       153m    OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.jima416a1.qe.azure.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.jima416a1.qe.azure.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
console                                    4.16.0-0.nightly-2024-02-07-200316   False       True          True       141m    DeploymentAvailable: 0 replicas available for console deployment...
ingress                                                                         False       True          True       137m    The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending)

The ingress LB public IP is still pending creation:
$ oc get svc -n openshift-ingress
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.199.169   <pending>     80:32007/TCP,443:30229/TCP   154m
router-internal-default   ClusterIP      172.30.112.167   <none>        80/TCP,443/TCP,1936/TCP      154m


The CCM pods are in CrashLoopBackOff:
$ oc get pod -n openshift-cloud-controller-manager
NAME                                              READY   STATUS             RESTARTS         AGE
azure-cloud-controller-manager-555cf5579f-hz6gl   0/1     CrashLoopBackOff   21 (2m55s ago)   160m
azure-cloud-controller-manager-555cf5579f-xv2rn   0/1     CrashLoopBackOff   21 (15s ago)     160m

Error in the CCM pod:
I0208 04:40:57.141145       1 azure.go:931] Azure cloudprovider using try backoff: retries=6, exponent=1.500000, duration=6, jitter=1.000000
I0208 04:40:57.141193       1 azure_auth.go:86] azure: using workload identity extension to retrieve access token
I0208 04:40:57.141290       1 azure_diskclient.go:68] Azure DisksClient using API version: 2022-07-02
I0208 04:40:57.141380       1 azure_blobclient.go:73] Azure BlobClient using API version: 2021-09-01
F0208 04:40:57.141471       1 controllermanager.go:314] Cloud provider azure could not be initialized: could not init cloud provider azure: no token file specified. Check pod configuration or set TokenFilePath in the options

Version-Release number of selected component (if applicable):

4.16 nightly build

How reproducible:

Always

Steps to Reproduce:

    1. Install cluster with azure workload identity

Actual results:

    Installation failed because some operators are degraded.

Expected results:

    Installation is successful.

Additional info:

https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/332

Bug OCPBUGS-25053: Update 4.16 csi-attacher-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/66

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-attacher/pull/67

Bug OCPBUGS-30200: Power VS: Removal of unmaintained package (bluemix-go)

Description of problem:

    A package that we use for Power VS (bluemix-go) has recently been revealed to be unmaintained. We should remove it in favor of maintained solutions.

Version-Release number of selected component (if applicable):

    4.13.0 onward

How reproducible:

It's always used

Steps to Reproduce:

    1. Deploy with IPI on Power VS
    2. Use bluemix-go

Actual results:

    bluemix-go is used

Expected results:

bluemix-go should be avoided

Additional info:

https://github.com/openshift/installer/pull/8025

Story CORS-3182: 4.16 Release Branching Checklist

User Story:

This is a checklist of tasks for when we break off a new feature branch.

Acceptance Criteria:

Description of criteria:

Complete checklist items
Clone this card to the next release

Sub-task CORS-3185: Update default release image

e.g. https://github.com/openshift/installer/pull/5774/files

https://github.com/openshift/installer/pull/7874

Sub-task CORS-3188: Update k8s dependencies (api, etc)

https://github.com/openshift/installer/pull/7970

Bug MGMT-16588: [BE] Day2 - Non-Nutanix node successfully added to Nutanix Day2 cluster

Description of the problem:

Non-Nutanix node was successfully added to Nutanix day1 cluster

How reproducible:

100%

Steps to reproduce:

1. Deploy Nutanix day1 cluster

2. Try to add a non-Nutanix day-2 node to the Nutanix cluster

Actual results:

Day-2 node installation started and host installed

Expected results:

Day-2 node doesn't pass pre-installation checks

https://github.com/openshift/assisted-service/pull/5946

Bug OCPBUGS-24889: Update 4.16 cluster-etcd-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1175

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-etcd-operator/pull/1175

Bug OCPBUGS-25550: Update 4.16 ose-azure-file-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/47

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-24742: Update 4.16 prom-label-proxy-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/prom-label-proxy/pull/359

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-24972: Update 4.16 ose-cluster-openshift-apiserver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/561

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-openshift-apiserver-operator/pull/561

Bug OCPBUGS-29587: Power VS: Handle composite_instance for cluster create

Description of problem:

    A Power VS workspace created after February 14th 2024 is not found by the installer when deploying to it.

Version-Release number of selected component (if applicable):

How reproducible:

    Easily.

Steps to Reproduce:

    1. Create a Power VS Workspace
    2. Specify it in the install config
    3. Attempt to deploy
    4. Fail with "...is not a valid guid" error.

Actual results:

    Failure to deploy to service instance

Expected results:

    Should deploy to service instance

Additional info:

https://github.com/openshift/installer/pull/8033

Bug OU-308: monitoring-plugin failing in CI

PR https://github.com/openshift/monitoring-plugin/pull/83 was intended to modify only the images built for local testing, but it accidentally changed the default Dockerfile, leading to a mismatch between the nginx config and the Dockerfile used in CI. This causes the monitoring-plugin to fail to load in CI builds.

https://github.com/openshift/monitoring-plugin/pull/90

Bug OCPBUGS-27050: Update 4.16 ose-azure-disk-csi-driver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-operator/pull/114

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-23827: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-24711: Last visited tab not get selected on Pipelines page in dev perspective

https://github.com/openshift/console/pull/13430

Story TRT-1456: hypershift broken on 4.16 ci payloads

https://redhat-internal.slack.com/archives/C01C8502FMM/p1706104538539689

https://github.com/openshift/hypershift/pull/3460

Bug OCPBUGS-18577: e2e test failure: [sig-network][Feature:EgressFirewall] when using openshift ovn-kubernetes should ensure egressfirewall is created

Description of problem:

Job link: powervs » zstreams » zstream-ocp4x-powervs-london06-p9-current-upgrade #282 Console [Jenkins] (ibm.com)

Must-gather link

long snippet from e2e log

external internet 09/01/23 07:26:09.624
Sep  1 07:26:09.624: INFO: Running 'oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 http://www.google.com:80'
STEP: creating an egressfirewall object 09/01/23 07:26:09.903
STEP: calling oc create -f /tmp/fixture-testdata-dir978363556/test/extended/testdata/egress-firewall/ovnk-egressfirewall-test.yaml 09/01/23 07:26:09.903
Sep  1 07:26:09.904: INFO: Running 'oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/root/.kube/config create -f /tmp/fixture-testdata-dir978363556/test/extended/testdata/egress-firewall/ovnk-egressfirewall-test.yaml'
egressfirewall.k8s.ovn.org/default createdSTEP: sending traffic to control plane nodes should work 09/01/23 07:26:22.122
Sep  1 07:26:22.130: INFO: Running 'oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443'
Sep  1 07:26:23.358: INFO: Error running /usr/local/bin/oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443:
StdOut>
command terminated with exit code 28
StdErr>
command terminated with exit code 28[AfterEach] [sig-network][Feature:EgressFirewall]
  github.com/openshift/origin/test/extended/util/client.go:180
STEP: Collecting events from namespace "e2e-test-egress-firewall-e2e-2vvzx". 09/01/23 07:26:23.358
STEP: Found 4 events. 09/01/23 07:26:23.361
Sep  1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {multus } AddedInterface: Add eth0 [10.131.0.89/23] from ovn-kubernetes
Sep  1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {kubelet lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com} Pulled: Container image "quay.io/openshift/community-e2e-images:e2e-quay-io-redhat-developer-nfs-server-1-1-dlXGfzrk5aNo8EjC" already present on machine
Sep  1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {kubelet lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com} Created: Created container egressfirewall-container
Sep  1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {kubelet lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com} Started: Started container egressfirewall-container
Sep  1 07:26:23.363: INFO: POD             NODE                                           PHASE    GRACE  CONDITIONS
Sep  1 07:26:23.363: INFO: egressfirewall  lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:07 -0400 EDT  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:09 -0400 EDT  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:09 -0400 EDT  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:07 -0400 EDT  }]
Sep  1 07:26:23.363: INFO: 
Sep  1 07:26:23.367: INFO: skipping dumping cluster info - cluster too large
Sep  1 07:26:23.383: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-egress-firewall-e2e-2vvzx-user}, err: <nil>
Sep  1 07:26:23.398: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-egress-firewall-e2e-2vvzx}, err: <nil>
Sep  1 07:26:23.414: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  sha256~X_2HPGEj3O9hpd-3XKTckrp9bO23s_7zlJ3Tkn7ncBE}, err: <nil>
[AfterEach] [sig-network][Feature:EgressFirewall]
  github.com/openshift/origin/test/extended/util/client.go:180
STEP: Collecting events from namespace "e2e-test-no-egress-firewall-e2e-84f48". 09/01/23 07:26:23.414
STEP: Found 0 events. 09/01/23 07:26:23.416
Sep  1 07:26:23.417: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
Sep  1 07:26:23.417: INFO: 
Sep  1 07:26:23.421: INFO: skipping dumping cluster info - cluster too large
Sep  1 07:26:23.446: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-no-egress-firewall-e2e-84f48-user}, err: <nil>
Sep  1 07:26:23.451: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-no-egress-firewall-e2e-84f48}, err: <nil>
Sep  1 07:26:23.457: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  sha256~2Lk8-jWfwpdyo59E9YF7kQFKH2LBUSvnbJdKj7rOzn4}, err: <nil>
[DeferCleanup (Each)] [sig-network][Feature:EgressFirewall]
  dump namespaces | framework.go:196
STEP: dump namespace information after failure 09/01/23 07:26:23.457
[DeferCleanup (Each)] [sig-network][Feature:EgressFirewall]
  tear down framework | framework.go:193
STEP: Destroying namespace "e2e-test-no-egress-firewall-e2e-84f48" for this suite. 09/01/23 07:26:23.457
[DeferCleanup (Each)] [sig-network][Feature:EgressFirewall]
  dump namespaces | framework.go:196
STEP: dump namespace information after failure 09/01/23 07:26:23.462
[DeferCleanup (Each)] [sig-network][Feature:EgressFirewall]
  tear down framework | framework.go:193
STEP: Destroying namespace "e2e-test-egress-firewall-e2e-2vvzx" for this suite. 09/01/23 07:26:23.463
fail [github.com/openshift/origin/test/extended/networking/egress_firewall.go:155]: Unexpected error:
    <*fmt.wrapError | 0xc001dd50a0>: {
        msg: "Error running /usr/local/bin/oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443:\nStdOut>\ncommand terminated with exit code 28\nStdErr>\ncommand terminated with exit code 28\nexit status 28\n",
        err: <*exec.ExitError | 0xc001dd5080>{
            ProcessState: {
                pid: 140483,
                status: 7168,
                rusage: {
                    Utime: {Sec: 0, Usec: 149480},
                    Stime: {Sec: 0, Usec: 19930},
                    Maxrss: 222592,
                    Ixrss: 0,
                    Idrss: 0,
                    Isrss: 0,
                    Minflt: 1536,
                    Majflt: 0,
                    Nswap: 0,
                    Inblock: 0,
                    Oublock: 0,
                    Msgsnd: 0,
                    Msgrcv: 0,
                    Nsignals: 0,
                    Nvcsw: 596,
                    Nivcsw: 173,
                },
            },
            Stderr: nil,
        },
    }
    Error running /usr/local/bin/oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443:
    StdOut>
    command terminated with exit code 28
    StdErr>
    command terminated with exit code 28
    exit status 28
    
occurred
Ginkgo exit error 1: exit with code 1failed: (18.7s) 2023-09-01T11:26:23 "[sig-network][Feature:EgressFirewall] when using openshift ovn-kubernetes should ensure egressfirewall is created  [Suite:openshift/conformance/parallel]"
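
The `status: 7168` field in the error dump above is a raw POSIX wait status, not the exit code itself. A short Python sketch of the conventional decoding (exit code in bits 8 to 15, terminating signal in the low 7 bits) shows that it matches the `exit code 28` reported by the test. curl documents exit code 28 as an operation timeout, which fits the one-second `-m1` limit in the command. The helper name is hypothetical.

```python
def decode_wait_status(status: int) -> tuple:
    """Split a raw POSIX wait status into (exit_code, signal_number)."""
    signal_number = status & 0x7F      # low 7 bits: terminating signal, 0 if none
    exit_code = (status >> 8) & 0xFF   # next byte: exit code for a normal exit
    return exit_code, signal_number


exit_code, signal_number = decode_wait_status(7168)
print(exit_code, signal_number)  # prints "28 0"
```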

Version-Release number of selected component (if applicable):

4.13.11

How reproducible:

This e2e failure is not consistently reproducible.

Steps to Reproduce:

1. Start a Z-stream job via Jenkins
2. Monitor e2e

Actual results:

The e2e test fails

Expected results:

e2e should pass

Additional info:

https://github.com/openshift/origin/pull/28300

Bug OCPBUGS-25101: Update 4.16 ose-libvirt-machine-controllers-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/274

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api-provider-libvirt/pull/274

Story CONSOLE-3948: Rename "supported but not recommended" to "known issues"

Description of problem:

The OTA team wants to rename the `supported but not recommended` update edges to `known issues`.

Version-Release number of selected component (if applicable):

    openshift-4.16

Expected results:

`supported but not recommended` edges are renamed to `known issues`

Additional info:

https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219/frontend/public/components/modals/cluster-update-modal.tsx#L191

https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219/frontend/public/components/modals/cluster-update-modal.tsx#L216

https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219/frontend/public/components/modals/cluster-update-modal.tsx#L219

https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219/frontend/public/components/modals/cluster-update-modal.tsx#L234

https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219[…]rontend/public/components/cluster-settings/cluster-settings.tsx

https://github.com/openshift/console/pull/13622

Bug OCPBUGS-25618: Bump documentationBaseURL to 4.16

Description of problem:

documentationBaseURL still points to 4.14

Version-Release number of selected component (if applicable):

4.16

How reproducible:

Always

Steps to Reproduce:

1. Check documentationBaseURL on a 4.16 cluster:
# oc get configmap console-config -n openshift-console -o yaml | grep documentationBaseURL
      documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/

Actual results:

1. documentationBaseURL still points to 4.14

Expected results:

1. documentationBaseURL should point to 4.16
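
The check in the reproduction step can also be scripted. The following is a minimal sketch, assuming the URL layout shown in the step above; the helper name `docs_version` is hypothetical and is not part of the console operator.

```python
import re
from typing import Optional


def docs_version(url: str) -> Optional[str]:
    """Extract the x.y version segment from a documentationBaseURL value.

    Assumes the URL contains '/openshift_container_platform/<x.y>/',
    as in the value shown in the reproduction step.
    """
    match = re.search(r"/openshift_container_platform/(\d+\.\d+)(?:/|$)", url)
    return match.group(1) if match else None


url = "https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/"
print(docs_version(url))            # 4.14
print(docs_version(url) == "4.16")  # False on an affected cluster
```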

Additional info:

https://github.com/openshift/console-operator/pull/824

Bug OCPBUGS-25708: CVO should continue to periodically fetch upstream Cincinnati despite Recommended=Unknown risks

Description of problem:

Changes made for faster risk cache-warming (the ~~OCPBUGS-19512~~ series) introduced an unfortunate cycle:

1. Cincinnati serves vulnerable PromQL, like graph-data#4524.
2. Clusters pick up that broken PromQL, try to evaluate, and fail. Re-eval-and-fail loop continues.
3. Cincinnati PromQL fixed, like graph-data#4528.
4. Cases:

- (a) Before the cache-warming changes, and also after this bug's fix, Clusters pick up the fixed PromQL, try to evaluate, and start succeeding. Hooray!
- (b) Clusters with the cache-warming changes but without this bug's fix say "it's been a long time since we pulled fresh Cincinnati information, but it has not been long since my last attempt to eval this broken PromQL, so let me skip the Cincinnati pull and re-eval that old PromQL", which fails. Re-eval-and-fail loop continues.
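
The difference between cases (a) and (b) can be illustrated with a small decision sketch. This is not the CVO's actual code: the `State` fields, the function names and the `min_fetch_interval` threshold are assumptions chosen only to show how giving priority to re-evaluation keeps a cluster on stale PromQL.

```python
from dataclasses import dataclass


@dataclass
class State:
    """Hypothetical snapshot of what the updater knows at decision time."""
    seconds_since_last_fetch: float  # time since Cincinnati was last pulled
    has_unknown_risks: bool          # some risk is Recommended=Unknown


def should_fetch_exposed(state: State, min_fetch_interval: float) -> bool:
    # Exposed behavior (assumed shape): a pending risk re-evaluation takes
    # priority, so the upstream pull is skipped for as long as any risk
    # keeps failing evaluation.
    if state.has_unknown_risks:
        return False
    return state.seconds_since_last_fetch >= min_fetch_interval


def should_fetch_fixed(state: State, min_fetch_interval: float) -> bool:
    # Fixed behavior (assumed shape): staleness of the upstream data alone
    # decides, so corrected PromQL is eventually picked up.
    return state.seconds_since_last_fetch >= min_fetch_interval


stale_and_failing = State(seconds_since_last_fetch=3600, has_unknown_risks=True)
print(should_fetch_exposed(stale_and_failing, min_fetch_interval=300))  # False
print(should_fetch_fixed(stale_and_failing, min_fetch_interval=300))    # True
```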

Version-Release number of selected component (if applicable):

The regression went back via:

Updates from those releases (and later in their 4.y, until this bug lands a fix) to later releases are exposed.

How reproducible:

Likely very reproducible for exposed releases, but only when clusters are served PromQL risks that will consistently fail evaluation.

Steps to Reproduce:

1. Launch a cluster.
2. Point it at dummy Cincinnati data, as described in ~~OTA-520~~. Initially declare a risk with broken PromQL in that data, like cluster_operator_conditions.
3. Wait until the cluster is reporting Recommended=Unknown for those risks (oc adm upgrade --include-not-recommended).
4. Update the risk to working PromQL, like group(cluster_operator_conditions). Alternatively, update anything about the update-service data (e.g. adding a new update target with a path from the cluster's version).
5. Wait 10 minutes for the CVO to have plenty of time to pull that new Cincinnati data.
6. oc get -o json clusterversion version | jq '.status.conditionalUpdates[].risks[].matchingRules[].promql.promql' | sort | uniq | jq -r .

Actual results:

Exposed releases will still have the broken PromQL in their output (or will lack the new update target you added, or whatever the Cincinnati data change was).

Expected results:

Fixed releases will have picked up the fixed PromQL in their output (or will have the new update target you added, or whatever the Cincinnati data change was).

Additional info:

Identification

To detect exposure in collected Insights, look for EvaluationFailed conditionalUpdates like:

$ oc get -o json clusterversion version | jq -r '.status.conditionalUpdates[].conditions[] | select(.type == "Recommended" and .status == "Unknown" and .reason == "EvaluationFailed" and (.message | contains("invalid PromQL")))'
{
  "lastTransitionTime": "2023-12-15T22:00:45Z",
  "message": "Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34\nAdding a new worker node will fail for clusters running on ARO. https://issues.redhat.com/browse/MCO-958",
  "reason": "EvaluationFailed",
  "status": "Unknown",
  "type": "Recommended"
}
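
When the ClusterVersion object is already available as JSON (for example from an archive) and jq is not at hand, the same selection can be expressed in Python. This is a sketch under the assumption that the input has the `status.conditionalUpdates[].conditions[]` shape used by the jq filter above; the function name is hypothetical.

```python
def evaluation_failed_conditions(cluster_version: dict) -> list:
    """Return Recommended=Unknown conditions caused by invalid PromQL.

    Mirrors the jq filter above, but tolerates missing keys instead of
    raising an error.
    """
    hits = []
    status = cluster_version.get("status") or {}
    for update in status.get("conditionalUpdates") or []:
        for cond in update.get("conditions") or []:
            if (
                cond.get("type") == "Recommended"
                and cond.get("status") == "Unknown"
                and cond.get("reason") == "EvaluationFailed"
                and "invalid PromQL" in cond.get("message", "")
            ):
                hits.append(cond)
    return hits


# Trimmed example input based on the condition shown above.
sample = {
    "status": {
        "conditionalUpdates": [
            {
                "conditions": [
                    {
                        "type": "Recommended",
                        "status": "Unknown",
                        "reason": "EvaluationFailed",
                        "message": "Exposure to AROBrokenDNSMasq is unknown due to "
                                   "an evaluation failure: invalid PromQL result "
                                   "length must be one, but is 34",
                    }
                ]
            }
        ]
    }
}
print(len(evaluation_failed_conditions(sample)))  # 1
```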

To confirm in-cluster vs. other EvaluationFailed invalid PromQL issues, you can look for Cincinnati retrieval attempts in CVO logs. Example from a healthy cluster:

$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from\|PromQL' | tail
I1221 20:36:39.783530       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:36:39.831358       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
I1221 20:40:19.674925       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:40:19.727998       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
I1221 20:43:59.567369       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:43:59.620315       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
I1221 20:47:39.457582       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:47:39.509505       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
I1221 20:51:19.348286       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:51:19.401496       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"

showing fetch lines every few minutes. And from an exposed cluster, only showing PromQL eval lines:

$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from\|PromQL' | tail
I1221 20:50:10.165101       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:11.166170       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:12.166314       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:13.166517       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:14.166847       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:15.167737       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:16.168486       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:17.169417       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:18.169576       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:19.170544       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from' | tail
...no hits...
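
The log comparison above can be reduced to a rough heuristic: an exposed cluster shows PromQL evaluation lines but no Cincinnati fetch lines in the same window. The sketch below assumes the log lines have already been collected (for example with the `oc logs` command above); the function name and the rule itself are illustrative, not an official diagnostic.

```python
def looks_exposed(log_lines: list) -> bool:
    """Heuristic: PromQL activity without any upstream fetch attempt."""
    fetches = sum("request updates from" in line for line in log_lines)
    promql = sum("PromQL" in line for line in log_lines)
    return fetches == 0 and promql > 0


healthy = [
    "I1221 20:36:39.783530       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...",
    "I1221 20:36:39.831358       1 promql.go:118] evaluate PromQL cluster condition: ...",
]
exposed = [
    'I1221 20:50:10.165101       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34',
]
print(looks_exposed(healthy))  # False
print(looks_exposed(exposed))  # True
```

An empty log window returns False, so a short `--since` value can hide exposure; the 30-minute window used above is a safer choice.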

Recovery

If bitten, the remediation is to address the invalid PromQL. For example, we fixed that AROBrokenDNSMasq expression in graph-data#4528. After that, the local cluster administrator should restart their CVO, such as with:

$ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pods

https://github.com/openshift/cluster-version-operator/pull/1009

Task HOSTEDCP-1371: Bump Golang to v1.21

Bump Golang to v1.21 in main and hack/tools go.mod's.

https://github.com/openshift/hypershift/pull/3359

Task MON-3633: Upgrade Prometheus to v2.48.1

Bug OCPBUGS-23624: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/325

Bug TRT-1539: Loki outages should not fail tests

Yesterday a major DPCR and thus Loki outage took the system down entirely. One test would fail as a result:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1762573177050894336

[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

    [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "KubeDaemonSetRolloutStuck",
          "alertstate": "firing",
          "container": "kube-rbac-proxy-main",
          "daemonset": "loki-promtail",
          "endpoint": "https-main",
          "job": "kube-state-metrics",
          "namespace": "openshift-e2e-loki",
          "prometheus": "openshift-monitoring/k8s",
          "service": "kube-state-metrics",
          "severity": "warning"
        },
        "value": [
          1709071917.851,
          "1"
        ]
      }
    ]

The query this test uses should be adapted to omit everything in openshift-e2e-loki.
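
The intended effect can be sketched as a filter over the query result. The actual change belongs in the test's query in openshift/origin and is not reproduced here; the function below is a hypothetical post-filter that only shows which alerts would remain once the `openshift-e2e-loki` namespace is ignored.

```python
IGNORED_NAMESPACES = {"openshift-e2e-loki"}


def unexpected_firing_alerts(results: list) -> list:
    """Keep firing alerts that are not in an ignored namespace."""
    return [
        r for r in results
        if r["metric"].get("alertstate") == "firing"
        and r["metric"].get("namespace") not in IGNORED_NAMESPACES
    ]


results = [
    {   # the alert from the failed run above (labels trimmed)
        "metric": {
            "alertname": "KubeDaemonSetRolloutStuck",
            "alertstate": "firing",
            "namespace": "openshift-e2e-loki",
        },
        "value": [1709071917.851, "1"],
    },
    {   # hypothetical alert in another namespace, still reported
        "metric": {
            "alertname": "ExampleAlert",
            "alertstate": "firing",
            "namespace": "openshift-example",
        },
        "value": [1709071917.851, "1"],
    },
]
print([r["metric"]["alertname"] for r in unexpected_firing_alerts(results)])
```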

Ideally, backports would be good here, but we could just fix it going forward also if this is too cumbersome.

https://github.com/openshift/origin/pull/28627

Bug OCPBUGS-25631: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-ibmcloud/pull/71

Bug OCPBUGS-28728: Update 4.16 ose-image-customization-controller-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/image-customization-controller/pull/122

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/image-customization-controller/pull/122

Bug OCPBUGS-28207: Not clear if the Memory limits must be set in GiB or GB

The customer points out that the maximum memory limit used for scale-up is evaluated in GB, while the minimum memory limit used for scale-down is evaluated in GiB.

So, setting both values in GB makes scale down fail:

Skipping ocgc4preplatgt-98fwh-worker-c-sk2sz - minimal limit exceeded for [memory]

While setting both values in GiB makes the scale up fail.

API reference says that GBs must be used for memory min/max limits:

https://docs.openshift.com/container-platform/4.12/rest_api/autoscale_apis/clusterautoscaler-autoscaling-openshift-io-v1.html#spec-resourcelimits

While OCP Cluster Autoscaler documentation points to GiBs:

https://docs.openshift.com/container-platform/4.12/machine_management/applying-autoscaling.html#cluster-autoscaler-cr_applying-autoscaling
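
The size of the discrepancy is easy to quantify. The sketch below uses only the standard definitions (1 GB = 10^9 bytes, 1 GiB = 2^30 bytes); the limit value 64 is a hypothetical example, not a value from the customer case.

```python
GB = 10**9    # decimal gigabyte, in bytes
GIB = 2**30   # binary gibibyte, in bytes


def to_bytes(value: int, unit: str) -> int:
    """Convert a memory limit to bytes for the given unit ('GB' or 'GiB')."""
    return value * {"GB": GB, "GiB": GIB}[unit]


limit = 64  # hypothetical limit taken from a ClusterAutoscaler spec
as_gb = to_bytes(limit, "GB")
as_gib = to_bytes(limit, "GiB")

print(as_gb)           # 64000000000
print(as_gib)          # 68719476736
print(as_gib - as_gb)  # 4719476736 bytes of disagreement
```

The same number is therefore read about 7.4% larger when treated as GiB, which is enough for a cluster sized near a limit to pass one check and fail the other.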

https://github.com/openshift/cluster-autoscaler-operator/pull/313

Bug MGMT-16813: ClusterImageSetReference pulls wrong image

Description of the problem:

The ClusterImageSetRef specified in the SiteConfig CR on the hub cluster does not match the image actually pulled from Quay when installing 4.12.x managed SNO clusters. This behavior has not been observed when installing SNO 4.14.x.

How reproducible:

Install 4.12.19 from ACM using clusterImageSetNameRef: "img4.12.19-x86-64-appsub"

Actual results:

Image pulled is actually img4.12.19-multi-x86-64-appsub

# oc adm release info 4.12.19 | more
Name:           4.12.19
Digest:         sha256:4db028028aae12cb82b784ae54aaca6888f224b46565406f1a89831a67f42030
Created:        2023-05-24T06:58:32Z
OS/Arch:        linux/amd64
Manifests:      647
Metadata files: 1
 
Pull From: quay.io/openshift-release-dev/ocp-release@sha256:4db028028aae12cb82b784ae54aaca6888f224b46565406f1a89831a67f42030
  Metadata:
    release.openshift.io/architecture: multi
    url: https://access.redhat.com/errata/RHSA-2023:3287

Expected results:

Image pulled is img4.12.19-x86-64-appsub

# oc adm release info 4.12.19 | more

Name:           4.12.19
Digest:         sha256:41fd42cc8b9f86fc86cc8763dcf27e976299ff632a336d393b8e643bd8a5f967

Pull From: quay.io/openshift-release-dev/ocp-release@sha256:41fd42cc8b9f86fc86cc8763dcf27e976299ff632a336d393b8e643bd8a5f967
  Metadata:
    url: https://access.redhat.com/errata/RHSA-2023:3287
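
In the two outputs above, the multi-arch payload is the one that carries a `release.openshift.io/architecture` metadata entry. A minimal sketch for telling them apart from captured `oc adm release info` text follows; it assumes text shaped like the examples above, and the helper name is hypothetical.

```python
import re
from typing import Optional


def release_architecture(release_info_text: str) -> Optional[str]:
    """Return the release.openshift.io/architecture value, if present."""
    match = re.search(
        r"release\.openshift\.io/architecture:\s*(\S+)", release_info_text
    )
    return match.group(1) if match else None


actual = """\
Name:           4.12.19
OS/Arch:        linux/amd64
  Metadata:
    release.openshift.io/architecture: multi
    url: https://access.redhat.com/errata/RHSA-2023:3287
"""
expected = """\
Name:           4.12.19
  Metadata:
    url: https://access.redhat.com/errata/RHSA-2023:3287
"""
print(release_architecture(actual))    # multi
print(release_architecture(expected))  # None
```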

https://github.com/openshift/assisted-service/pull/6066

Bug OCPBUGS-24832: Update 4.16 ose-image-customization-controller-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/image-customization-controller/pull/113

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/image-customization-controller/pull/113

Bug OCPBUGS-25710: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/50

Bug OCPBUGS-24991: Update 4.16 ose-vsphere-cloud-controller-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cloud-provider-vsphere/pull/59

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-provider-vsphere/pull/59

Bug TRT-1445: Resource Watch on 4.16 payload jobs takes one extra hour to finish

With 4.15, resource watch completes when it is interrupted. For 4.16 jobs, it does not complete until the 1h grace period expires and ci-operator terminates the job. This means:

- All 4.16 jobs take 1h longer than normal.
- If an upgrade pushes the overall time close to the limit, sub-jobs from the aggregator will fail at the 4h mark.

This was discovered when investigating this slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1705578623724259

https://github.com/openshift/origin/pull/28580

Bug OCPBUGS-25105: Update 4.16 ose-ibm-vpc-block-csi-driver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/98

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/98

Bug OCPBUGS-25357: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/1986

Bug OCPBUGS-29573: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-storage-operator/pull/458

Bug OCPBUGS-27842: Add SNO to HighOverallControlPlaneCPU alert description

The current description of HighOverallControlPlaneCPU is wrong for SNO cases and can mislead users. We need to add information about SNO clusters to the alert description.

https://github.com/openshift/cluster-kube-apiserver-operator/pull/1633

Bug OCPBUGS-28539: capi-ibmcloud-controller-manager ContainerCreating shouldn't happen on IBMCloud

Description of problem:

The capi-ibmcloud-controller-manager pod is stuck in ContainerCreating on IBM Cloud.

Version-Release number of selected component (if applicable):

How reproducible:

Always

Steps to Reproduce:

    1. Build a cluster on IBM Cloud and enable TechPreviewNoUpgrade.

Actual results:

4.16 cluster
$ oc get po                        
NAME                                                READY   STATUS              RESTARTS      AGE
capi-controller-manager-6bccdc844-jsm4s             1/1     Running             9 (24m ago)   175m
capi-ibmcloud-controller-manager-75d55bfd7d-6qfxh   0/2     ContainerCreating   0             175m
cluster-capi-operator-768c6bd965-5tjl5              1/1     Running             0             3h

  Warning  FailedMount       5m15s (x87 over 166m)  kubelet            MountVolume.SetUp failed for volume "credentials" : secret "capi-ibmcloud-manager-bootstrap-credentials" not found

$ oc get clusterversion               
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-01-21-154905   True        False         156m    Cluster version is 4.16.0-0.nightly-2024-01-21-154905

4.15 cluster
$ oc get po                           
NAME                                                READY   STATUS              RESTARTS        AGE
capi-controller-manager-6b67f7cff4-vxtpg            1/1     Running             6 (9m51s ago)   35m
capi-ibmcloud-controller-manager-54887589c6-6plt2   0/2     ContainerCreating   0               35m
cluster-capi-operator-7b7f48d898-9r6nn              1/1     Running             1 (17m ago)     39m
$ oc get clusterversion           
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2024-01-22-160236   True        False         11m     Cluster version is 4.15.0-0.nightly-2024-01-22-160236

Expected results:

No pod is in ContainerCreating status

Additional info:

must-gather: https://drive.google.com/file/d/1F5xUVtW-vGizAYgeys0V5MMjp03zkSEH/view?usp=sharing

https://github.com/openshift/cluster-capi-operator/pull/157

Bug OCPBUGS-26116: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-machine-approver/pull/225

Story CONSOLE-3921: Removal of NICs/Disks page from Node Details page

When metal3-plugin is enabled, it adds Disks and NICs tabs to the Node details page.

We would like to remove these tabs because:

- The same tabs are added to the Bare Metal Hosts page, so removing them eliminates the duplication.
- Not every Node is backed by a Bare Metal Host resource. When it is not, the Disks and NICs tabs show nothing.

RFE: https://issues.redhat.com/browse/RFE-5061

https://github.com/openshift/console/pull/13552

Bug OCPBUGS-30477: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-baremetal/pull/213

Story HOSTEDCP-1438: Preserve container resources for more hosted control plane components

This is an extension of https://issues.redhat.com/browse/HOSTEDCP-190, in which we are adding container resource preservation to more hosted control plane components.

https://github.com/openshift/hypershift/pull/3120

Bug OCPBUGS-22559: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/azure-disk-csi-driver/pull/66

Bug OCPBUGS-25313: Alert, Metrics page not loading in OCP Console

Description of problem:

Unable to view the Alerts and Metrics pages; a blank page is shown instead.

Version-Release number of selected component (if applicable):

4.15.0-nightly

How reproducible:

Always

Steps to Reproduce:

Click on any alert under "Notification Panel" to view more, and you will be redirected to the alert page.

Actual results:

User is unable to view any alerts, metrics.

Expected results:

User should be able to view all/individual alerts, metrics.

Additional info:

N.A

https://github.com/openshift/monitoring-plugin/pull/88

Bug OCPBUGS-25372: vsphere-problem-detector-operator pod CrashLoopBackOff with panic

Description of problem:

Found in QE's CI (with the vsphere-agent profile): the storage CO is not available and the vsphere-problem-detector-operator pod is in CrashLoopBackOff with a panic.
(The must-gather is here: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-vsphere-agent-disconnected-ha-f14/1734850632575094784/artifacts/vsphere-agent-disconnected-ha-f14/gather-must-gather/)


The storage CO reports "unable to find VM by UUID":
  - lastTransitionTime: "2023-12-13T09:15:27Z"
    message: "VSphereCSIDriverOperatorCRAvailable: VMwareVSphereControllerAvailable:
      unable to find VM ci-op-782gwsbd-b3d4e-master-2 by UUID \nVSphereProblemDetectorDeploymentControllerAvailable:
      Waiting for Deployment"
    reason: VSphereCSIDriverOperatorCR_VMwareVSphereController_vcenter_api_error::VSphereProblemDetectorDeploymentController_Deploying
    status: "False"
    type: Available
(But I did not see the "unable to find VM by UUID" from vsphere-problem-detector-operator log in must-gather)


The vsphere-problem-detector-operator log:
2023-12-13T10:10:56.620216117Z I1213 10:10:56.620159       1 vsphere_check.go:149] Connected to vcenter.devqe.ibmc.devcluster.openshift.com as ci_user_01@devqe.ibmc.devcluster.openshift.com
2023-12-13T10:10:56.625161719Z I1213 10:10:56.625108       1 vsphere_check.go:271] CountVolumeTypes passed
2023-12-13T10:10:56.625291631Z I1213 10:10:56.625258       1 zones.go:124] Checking tags for multi-zone support.
2023-12-13T10:10:56.625449771Z I1213 10:10:56.625433       1 zones.go:202] No FailureDomains configured.  Skipping check.
2023-12-13T10:10:56.625497726Z I1213 10:10:56.625487       1 vsphere_check.go:271] CheckZoneTags passed
2023-12-13T10:10:56.625531795Z I1213 10:10:56.625522       1 info.go:44] vCenter version is 8.0.2, apiVersion is 8.0.2.0 and build is 22617221
2023-12-13T10:10:56.625562833Z I1213 10:10:56.625555       1 vsphere_check.go:271] ClusterInfo passed
2023-12-13T10:10:56.625603236Z I1213 10:10:56.625594       1 datastore.go:312] checking datastore /DEVQEdatacenter/datastore/vsanDatastore for permissions
2023-12-13T10:10:56.669205822Z panic: runtime error: invalid memory address or nil pointer dereference
2023-12-13T10:10:56.669338411Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x23096cb]
2023-12-13T10:10:56.669565413Z 
2023-12-13T10:10:56.669591144Z goroutine 550 [running]:
2023-12-13T10:10:56.669838383Z github.com/openshift/vsphere-problem-detector/pkg/operator.getVM(0xc0005da6c0, 0xc0002d3b80)
2023-12-13T10:10:56.669991749Z     github.com/openshift/vsphere-problem-detector/pkg/operator/vsphere_check.go:319 +0x3eb
2023-12-13T10:10:56.670212441Z github.com/openshift/vsphere-problem-detector/pkg/operator.(*vSphereChecker).enqueueSingleNodeChecks.func1()
2023-12-13T10:10:56.670289644Z     github.com/openshift/vsphere-problem-detector/pkg/operator/vsphere_check.go:238 +0x55
2023-12-13T10:10:56.670490453Z github.com/openshift/vsphere-problem-detector/pkg/operator.(*CheckThreadPool).worker.func1(0xc000c88760?, 0x0?)
2023-12-13T10:10:56.670702592Z     github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:40 +0x55
2023-12-13T10:10:56.671142070Z github.com/openshift/vsphere-problem-detector/pkg/operator.(*CheckThreadPool).worker(0xc000c78660, 0xc000c887a0?)
2023-12-13T10:10:56.671331852Z     github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:41 +0xe7
2023-12-13T10:10:56.671529761Z github.com/openshift/vsphere-problem-detector/pkg/operator.NewCheckThreadPool.func1()
2023-12-13T10:10:56.671589925Z     github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:28 +0x25
2023-12-13T10:10:56.671776328Z created by github.com/openshift/vsphere-problem-detector/pkg/operator.NewCheckThreadPool
2023-12-13T10:10:56.671847478Z     github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:27 +0x73

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-11-033133

How reproducible:

Steps to Reproduce:

    1. See description

Actual results:

   vpd panics

Expected results:

   vpd should not panic

Additional info:

   I guess it is a privileges issue, but our pod should not panic.

https://github.com/openshift/vsphere-problem-detector/pull/148

Bug OCPBUGS-28738: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/68

Bug OU-321: Alerts extension is not passing the alert object correctly

Looks like we're accidentally passing the JavaScript `window.alert()` method instead of the Prometheus alert object.

https://github.com/openshift/monitoring-plugin/pull/101

Bug OCPBUGS-30103: Cluster-network-operator doesn't use node local kube-apiserver loadbalancer when templating in cluster resources

Description of problem:

The cluster-network-operator in hypershift, when templating in-cluster resources, does not use the node-local address of the client-side haproxy load balancer that runs on all nodes. This bypasses a level of health checks for the redundant backend apiserver addresses, which the local kube-apiserver-proxy pods perform on every node in a hypershift environment. In environments where the backend API servers are not fronted by an additional cloud load balancer, a percentage of requests from in-cluster components fail when a control plane endpoint goes down, even if other endpoints are available.

Version-Release number of selected component (if applicable):

  4.16 4.15 4.14

How reproducible:

    100%

Steps to Reproduce:

    1. Set up a hypershift cluster in a bare-metal/non-cloud environment where redundant API servers sit behind a DNS name that points directly to the node IPs.
    2. Power down one of the control plane nodes.
    3. Schedule a workload into the cluster that depends on kube-proxy and/or multus to set up networking configuration.
    4. You will see errors like the following:
```
add): Multus: [openshiftai/moe-8b-cmisale-master-0/9c1fd369-94f5-481c-a0de-ba81a3ee3583]: error getting pod: Get "https://[p9d81ad32fcdb92dbb598-6b64a6ccc9c596bf59a86625d8fa2202-c000.us-east.satellite.appdomain.cloud]:30026/api/v1/namespaces/openshiftai/pods/moe-8b-cmisale-master-0?timeout=1m0s": dial tcp 192.168.98.203:30026: connect: timeout
```

Actual results:

    When a control plane node fails, intermittent timeouts occur whenever kube-proxy/multus resolve the DNS name and the IP of the failed control plane node is returned.

Expected results:

    No requests fail (which will be the case if all traffic is routed through the node-local load balancer instance).
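
A toy model makes the difference concrete. It assumes that, without the node-local proxy, a client picks uniformly among all addresses returned by DNS and does not retry, and that the health-checking proxy detects a dead endpoint immediately. Both assumptions are simplifications, and the function names are hypothetical.

```python
def failure_fraction_dns(endpoints_up: list) -> float:
    """Expected fraction of failed requests when every DNS-returned
    address is equally likely to be chosen, dead or alive."""
    down = sum(1 for up in endpoints_up if not up)
    return down / len(endpoints_up)


def failure_fraction_health_checked(endpoints_up: list) -> float:
    """A health-checking proxy forwards only to live endpoints, so
    requests fail only when no endpoint is left."""
    return 0.0 if any(endpoints_up) else 1.0


# Three control plane endpoints, one powered down.
endpoints = [True, True, False]
print(failure_fraction_dns(endpoints))             # one third of requests
print(failure_fraction_health_checked(endpoints))  # 0.0
```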

Additional info:

    Additionally, control plane components in the management cluster that live next to the apiserver add unneeded dependencies by using an external DNS entry to talk to the kube-apiserver, when they could use the local kube-apiserver address and keep all traffic on cluster-local networking.

https://github.com/openshift/cluster-network-operator/pull/2288

Bug OCPBUGS-25559: Update 4.16 ose-cluster-autoscaler-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/305

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-autoscaler-operator/pull/305

Bug OCPBUGS-25989: [AMQ Broker Operator] OLM deployed operator with watching multiple namespaces can't deploy its resources

Description of problem:

    Since OCP 4.15, an OLM-deployed operator is unable to operate when it watches multiple namespaces. It works fine with a single watched namespace (subscription). The same test also passes if we deploy the operator from files instead of through OLM.
Based on the operator log, it looks like a permission issue. The same test works fine on OCP 4.14 and older.

Version-Release number of selected component (if applicable):

Server Version: 4.15.0-ec.3
Kubernetes Version: v1.28.3+20a5764

How reproducible:

Always

Steps to Reproduce:

    0. oc login OCP4.15
    1. git clone https://gitlab.cee.redhat.com/amq-broker/claire
    2. make -f Makefile.downstream build ARTEMIS_VERSION=7.11.4 RELEASE_TYPE=released
    3. make -f Makefile.downstream operator_test OLM_IIB=registry-proxy.engineering.redhat.com/rh-osbs/iib:636350 OLM_CHANNEL=7.11.x  TESTS=ClusteredOperatorSmokeTests TEST_LOG_LEVEL=debug DISABLE_RANDOM_NAMESPACES=true

Actual results:

    Can't deploy artemis broker custom resource in given namespace (permission issue - see details below)

Expected results:

    Successfully deployed broker on watched namespaces

Additional info:

Log from AMQ Broker operator - seems like some permission issues since 4.15

    E0103 10:04:54.425202       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1beta1.ActiveMQArtemis: failed to list *v1beta1.ActiveMQArtemis: activemqartemises.broker.amq.io is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "activemqartemises" in API group "broker.amq.io" in the namespace "cluster-testsa"
E0103 10:04:54.425207       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1beta1.ActiveMQArtemisSecurity: failed to list *v1beta1.ActiveMQArtemisSecurity: activemqartemissecurities.broker.amq.io is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "activemqartemissecurities" in API group "broker.amq.io" in the namespace "cluster-testsa"
E0103 10:04:54.425221       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "pods" in API group "" in the namespace "cluster-testsa"
W0103 10:04:54.425296       1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: failed to list *v1beta1.ActiveMQArtemisScaledown: activemqartemisscaledowns.broker.amq.io is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "activemqartemisscaledowns" in API group "broker.amq.io" in the namespace "cluster-testsa"

https://github.com/openshift/operator-framework-olm/pull/663

Bug OCPBUGS-25453: duplicate failure domains in CPMS

Description of problem:

 CPMS is supported in 4.15 on the vSphere platform when TechPreviewNoUpgrade is enabled. However, after I built a cluster with no failure domain (or a single failure domain) set in install-config, the CPMS contained three duplicated failure domains.

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2023-12-11-033133

How reproducible:

    Install a cluster with TP enabled and do not set a failure domain (or set a single failure domain) in install-config.

Steps to Reproduce:

    1. Do not configure a failure domain in install-config (or set a single failure domain).
    2. Install a cluster with TP enabled.
    3. Check the CPMS with the command:
       oc get controlplanemachineset -oyaml

Actual results:

Duplicated failure domains:
    failureDomains:
      platform: VSphere
      vsphere:
      - name: generated-failure-domain
      - name: generated-failure-domain
      - name: generated-failure-domain
    metadata:
      labels:

Expected results:

 The failure domain should not be duplicated when a single failure domain is set in install-config.
 No failure domain should exist when none is set in install-config.
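
The first expectation can be illustrated with a small Go sketch. This is not the installer's actual fix (see the linked PR for that); uniqueFailureDomains is a hypothetical helper that only shows the deduplicated result one would expect for the names in the output above.

```go
package main

import "fmt"

// uniqueFailureDomains is a hypothetical helper: it drops repeated failure
// domain names while preserving the order in which they first appear.
func uniqueFailureDomains(names []string) []string {
	seen := make(map[string]struct{}, len(names))
	unique := make([]string, 0, len(names))
	for _, name := range names {
		if _, ok := seen[name]; ok {
			continue // already recorded, skip the duplicate
		}
		seen[name] = struct{}{}
		unique = append(unique, name)
	}
	return unique
}

func main() {
	// The three identical entries from the "Actual results" output.
	generated := []string{
		"generated-failure-domain",
		"generated-failure-domain",
		"generated-failure-domain",
	}
	fmt.Println(uniqueFailureDomains(generated)) // [generated-failure-domain]
}
```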

Additional info:

https://github.com/openshift/installer/pull/7860

Bug OCPBUGS-25887: [IBMCloud] cluster install fails nodes stuck in node.cloudprovider.kubernetes.io/uninitialized

Description of problem:

Cluster install fails on IBMCloud, nodes tainted with node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule

Version-Release number of selected component (if applicable):

from 4.16.0-0.nightly-2023-12-22-210021

last PASS version: 4.16.0-0.nightly-2023-12-20-061023

How reproducible:

Always

Steps to Reproduce:

    1. Install a cluster on IBMCloud, we use auto flexy template: aos-4_16/ipi-on-ibmcloud/versioned-installer

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          92m     Unable to apply 4.16.0-0.nightly-2023-12-25-200355: an unknown error has occurred: MultipleErrors
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                                                                                                               
baremetal                                                                                                                    
cloud-controller-manager                   4.16.0-0.nightly-2023-12-25-200355   True        False         False      89m     
cloud-credential                                                                                                             
cluster-autoscaler                                                                                                           
config-operator                                                                                                              
console                                                                                                                      
control-plane-machine-set                                                                                                    
csi-snapshot-controller                                                                                                      
dns                                                                                                                          
etcd                                                                                                                         
image-registry                                                                                                               
ingress                                                                                                                      
insights                                                                                                                     
kube-apiserver                                                                                                               
kube-controller-manager                                                                                                      
kube-scheduler                                                                                                               
kube-storage-version-migrator                                                                                                
machine-api                                                                                                                  
machine-approver                                                                                                             
machine-config                                                                                                               
marketplace                                                                                                                  
monitoring                                                                                                                   
network                                                                                                                      
node-tuning                                                                                                                  
openshift-apiserver                                                                                                          
openshift-controller-manager                                                                                                 
openshift-samples                                                                                                            
operator-lifecycle-manager                                                                                                   
operator-lifecycle-manager-catalog                                                                                           
operator-lifecycle-manager-packageserver                                                                                     
service-ca                                                                                                                   
storage                                                                                                                       
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                        STATUS     ROLES                  AGE   VERSION
huliu-ibma-qbg48-master-0   NotReady   control-plane,master   89m   v1.29.0+b0d609f
huliu-ibma-qbg48-master-1   NotReady   control-plane,master   89m   v1.29.0+b0d609f
huliu-ibma-qbg48-master-2   NotReady   control-plane,master   89m   v1.29.0+b0d609f
liuhuali@Lius-MacBook-Pro huali-test % oc describe node huliu-ibma-qbg48-master-0
Name:               huliu-ibma-qbg48-master-0
Roles:              control-plane,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=huliu-ibma-qbg48-master-0
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node-role.kubernetes.io/master=
                    node.openshift.io/os_id=rhcos
Annotations:        volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 27 Dec 2023 18:02:21 +0800
Taints:             node-role.kubernetes.io/master:NoSchedule
                    node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
                    node.kubernetes.io/not-ready:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  huliu-ibma-qbg48-master-0
  AcquireTime:     <unset>
  RenewTime:       Wed, 27 Dec 2023 19:32:24 +0800
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 27 Dec 2023 19:32:21 +0800   Wed, 27 Dec 2023 18:02:21 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 27 Dec 2023 19:32:21 +0800   Wed, 27 Dec 2023 18:02:21 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 27 Dec 2023 19:32:21 +0800   Wed, 27 Dec 2023 18:02:21 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Wed, 27 Dec 2023 19:32:21 +0800   Wed, 27 Dec 2023 18:02:21 +0800   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
Addresses:
Capacity:
  cpu:                4
  ephemeral-storage:  104266732Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16391716Ki
  pods:               250
Allocatable:
  cpu:                3500m
  ephemeral-storage:  95018478229
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             15240740Ki
  pods:               250
System Info:
  Machine ID:                 0ae21a012be844f18c5871f6eaefb85b
  System UUID:                0ae21a01-2be8-44f1-8c58-71f6eaefb85b
  Boot ID:                    fbe619e2-8ff5-4cdb-b6a4-cd6830ccc568
  Kernel Version:             5.14.0-284.45.1.el9_2.x86_64
  OS Image:                   Red Hat Enterprise Linux CoreOS 416.92.202312250319-0 (Plow)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.28.2-9.rhaos4.15.git6d902a3.el9
  Kubelet Version:            v1.29.0+b0d609f
  Kube-Proxy Version:         v1.29.0+b0d609f
Non-terminated Pods:          (0 in total)
  Namespace                   Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
  hugepages-1Gi      0 (0%)    0 (0%)
  hugepages-2Mi      0 (0%)    0 (0%)
Events:
  Type    Reason                   Age                From             Message
  ----    ------                   ----               ----             -------
  Normal  NodeHasNoDiskPressure    90m (x7 over 90m)  kubelet          Node huliu-ibma-qbg48-master-0 status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     90m (x7 over 90m)  kubelet          Node huliu-ibma-qbg48-master-0 status is now: NodeHasSufficientPID
  Normal  NodeHasSufficientMemory  90m (x7 over 90m)  kubelet          Node huliu-ibma-qbg48-master-0 status is now: NodeHasSufficientMemory
  Normal  RegisteredNode           90m                node-controller  Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller
  Normal  RegisteredNode           73m                node-controller  Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller
  Normal  RegisteredNode           53m                node-controller  Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller
  Normal  RegisteredNode           32m                node-controller  Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller
  Normal  RegisteredNode           12m                node-controller  Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller 
liuhuali@Lius-MacBook-Pro huali-test % oc get pod -n openshift-cloud-controller-manager
NAME                                            READY   STATUS             RESTARTS         AGE
ibm-cloud-controller-manager-787645668b-djqnr   0/1     CrashLoopBackOff   22 (2m29s ago)   90m
ibm-cloud-controller-manager-787645668b-pgkh2   0/1     Error              15 (5m8s ago)    52m
liuhuali@Lius-MacBook-Pro huali-test % oc describe pod ibm-cloud-controller-manager-787645668b-pgkh2 -n openshift-cloud-controller-manager
Name:                 ibm-cloud-controller-manager-787645668b-pgkh2
Namespace:            openshift-cloud-controller-manager
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 huliu-ibma-qbg48-master-2/
Start Time:           Wed, 27 Dec 2023 18:41:23 +0800
Labels:               infrastructure.openshift.io/cloud-controller-manager=IBMCloud
                      k8s-app=ibm-cloud-controller-manager
                      pod-template-hash=787645668b
Annotations:          operator.openshift.io/config-hash: 82a75c6ff86a490b0dac9c8c9b91f1987da0e646a42d72c33c54cbde3c29395b
Status:               Running
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/ibm-cloud-controller-manager-787645668b
Containers:
  cloud-controller-manager:
    Container ID:  cri-o://c56e246f64c770146c30b7a894f6a4d974159551dbb9d1ea31c238e516a0f854
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218
    Image ID:      e494d0d4b28e31170a4a2792bb90701c7f1e81c78c03e3686c5f0e601801937e
    Port:          10258/TCP
    Host Port:     10258/TCP
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -o allexport
      if [[ -f /etc/kubernetes/apiserver-url.env ]]; then
        source /etc/kubernetes/apiserver-url.env
      fi
      exec /bin/ibm-cloud-controller-manager \
      --bind-address=$(POD_IP_ADDRESS) \
      --use-service-account-credentials=true \
      --configure-cloud-routes=false \
      --cloud-provider=ibm \
      --cloud-config=/etc/ibm/cloud.conf \
      --profiling=false \
      --leader-elect=true \
      --leader-elect-lease-duration=137s \
      --leader-elect-renew-deadline=107s \
      --leader-elect-retry-period=26s \
      --leader-elect-resource-namespace=openshift-cloud-controller-manager \
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_AES_128_GCM_SHA256,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_256_GCM_SHA384 \
      --v=2
      
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 27 Dec 2023 19:33:23 +0800
      Finished:     Wed, 27 Dec 2023 19:33:23 +0800
    Ready:          False
    Restart Count:  15
    Requests:
      cpu:     75m
      memory:  60Mi
    Liveness:  http-get https://:10258/healthz delay=300s timeout=160s period=10s #success=1 #failure=3
    Environment:
      POD_IP_ADDRESS:           (v1:status.podIP)
      VPCCTL_CLOUD_CONFIG:     /etc/ibm/cloud.conf
      VPCCTL_PUBLIC_ENDPOINT:  false
    Mounts:
      /etc/ibm from cloud-conf (rw)
      /etc/kubernetes from host-etc-kube (ro)
      /etc/pki/ca-trust/extracted/pem from trusted-ca (ro)
      /etc/vpc from ibm-cloud-credentials (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cbd4b (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  trusted-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ccm-trusted-ca
    Optional:  false
  host-etc-kube:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes
    HostPathType:  Directory
  cloud-conf:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cloud-conf
    Optional:  false
  ibm-cloud-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ibm-cloud-credentials
    Optional:    false
  kube-api-access-cbd4b:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              node-role.kubernetes.io/master=
Tolerations:                 node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.cloudprovider.kubernetes.io/uninitialized:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                             node.kubernetes.io/not-ready:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  52m                    default-scheduler  Successfully assigned openshift-cloud-controller-manager/ibm-cloud-controller-manager-787645668b-pgkh2 to huliu-ibma-qbg48-master-2
  Normal   Pulling    52m                    kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218"
  Normal   Pulled     52m                    kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218" in 3.431s (3.431s including waiting)
  Normal   Created    50m (x5 over 52m)      kubelet            Created container cloud-controller-manager
  Normal   Started    50m (x5 over 52m)      kubelet            Started container cloud-controller-manager
  Normal   Pulled     50m (x4 over 52m)      kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218" already present on machine
  Warning  BackOff    2m19s (x240 over 52m)  kubelet            Back-off restarting failed container cloud-controller-manager in pod ibm-cloud-controller-manager-787645668b-pgkh2_openshift-cloud-controller-manager(d7f93ecf-cd14-450e-a986-028559a775b3)
liuhuali@Lius-MacBook-Pro huali-test %

Actual results:

    Cluster install fails on IBMCloud.

Expected results:

    Cluster install succeeds on IBMCloud.

Additional info:

https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/323

Bug TRT-1480: Disruption risk analysis empty on /payload jobs

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-router-551-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1752661032133726208

The funky job renaming done in these is breaking risk analysis. The disruption checking actually ran if you look in the logs, but we don't get far enough to generate the html template, I suspect because of the error returned by risk analysis.

Would be nice if both would work but I'd be happy just to get the disruption portion populated for now. I don't think the overall risk analysis will be easy.

https://github.com/openshift/origin/pull/28568

Bug OCPBUGS-23896: [OCP 4.16] VM stuck in terminating state after OCP node crash

Description of problem:

After a manual crash of an OCP node, the OSPD VM running on that node is stuck in the Terminating state.

Version-Release number of selected component (if applicable):

OCP 4.12.15 
osp-director-operator.v1.3.0
kubevirt-hyperconverged-operator.v4.12.5

How reproducible:

Log in to an OCP 4.12.15 node running a VM.
Manually crash the master node.
After the reboot, the VM stays in the Terminating state.

Steps to Reproduce:

    1. ssh core@masterX 
    2. sudo su
    3. echo c > /proc/sysrq-trigger

Actual results:

After the reboot, the VM stays in the Terminating state.


$ omc get node|sed -e 's/modl4osp03ctl/model/g' | sed -e 's/telecom.tcnz.net/aaa.bbb.ccc/g'
NAME                               STATUS   ROLES                         AGE   VERSION
model01.aaa.bbb.ccc   Ready    control-plane,master,worker   91d   v1.25.8+37a9a08
model02.aaa.bbb.ccc   Ready    control-plane,master,worker   91d   v1.25.8+37a9a08
model03.aaa.bbb.ccc   Ready    control-plane,master,worker   91d   v1.25.8+37a9a08


$ omc get pod -n openstack 
NAME                                                        READY   STATUS         RESTARTS   AGE
openstack-provision-server-7b79fcc4bd-x8kkz                 2/2     Running        0          8h
openstackclient                                             1/1     Running        0          7h
osp-director-operator-controller-manager-5896b5766b-sc7vm   2/2     Running        0          8h
osp-director-operator-index-qxxvw                           1/1     Running        0          8h
virt-launcher-controller-0-9xpj7                            1/1     Running        0          20d
virt-launcher-controller-1-5hj9x                            1/1     Running        0          20d
virt-launcher-controller-2-vhd69                            0/1     NodeAffinity   0          43d

$ omc describe  pod virt-launcher-controller-2-vhd69 |grep Status:
Status:                    Terminating (lasts 37h)

$ xsos sosreport-xxxx/|grep time
...
  Boot time: Wed Nov 22 01:44:11 AM UTC 2023
  Uptime:    8:27,  0 users

Expected results:

The VM restarts automatically, or at least does not stay in the Terminating state.

Additional info:

The issue has been seen twice.

The first time, a kernel crash occurred and the associated VM on the node was left in the Terminating state.

The second time, we tried to reproduce the issue by manually crashing the kernel and got the same result:
the VM running on the OCP node stays in the Terminating state.

https://github.com/openshift/kubernetes/pull/1829

Bug OCPBUGS-27213: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-monitoring-operator/pull/2234

Bug OCPBUGS-29453: catalogd crash loops after etcd restore

Description of problem:

The etcd team has introduced an e2e test that exercises a full etcd backup and restore cycle in OCP [1].

We run those tests as part of our PR builds and since 4.15 [2] (also 4.16 [3]), we have failed runs with the catalogd-controller-manager crash looping:

1 events happened too frequently
event [namespace/openshift-catalogd node/ip-10-0-25-29.us-west-2.compute.internal pod/catalogd-controller-manager-768bb57cdb-nwbhr hmsg/47b381d71b - Back-off restarting failed container manager in pod catalogd-controller-manager-768bb57cdb-nwbhr_openshift-catalogd(aa38d084-ecb7-4588-bd75-f95adb4f5636)] happened 44 times}


I assume something in that controller doesn't really deal gracefully with the restoration process of etcd, or the apiserver being down for some time.


[1] https://github.com/openshift/origin/blob/master/test/extended/dr/recovery.go#L97

[2] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1205/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-etcd-recovery/1757443629380538368

[3] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1191/pull-ci-openshift-cluster-etcd-operator-release-4.15-e2e-aws-etcd-recovery/1752293248543494144

Version-Release number of selected component (if applicable):

> 4.15

How reproducible:

always by running the test

Steps to Reproduce:

Run the test:

[sig-etcd][Feature:DisasterRecovery][Suite:openshift/etcd/recovery][Timeout:2h] [Feature:EtcdRecovery][Disruptive] Recover with snapshot with two unhealthy nodes and lost quorum [Serial]     

and observe the event invariant failing because catalogd-controller-manager is crash looping

Actual results:

catalogd-controller-manager crash loops and causes our CI jobs to fail

Expected results:

our e2e job is green again and catalogd-controller-manager doesn't crash loop

Additional info:

https://github.com/openshift/operator-framework-catalogd/pull/42

Bug OCPBUGS-24751: Update 4.16 openshift-enterprise-base-rhel9-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/images/pull/155

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/images/pull/155

Story TRT-1406: hypershift e2e conformance failing OLM test on 4.16 nightlies

This is failing on hypershift:

[sig-operator] an end user can use OLM can subscribe to the operator [apigroup:config.openshift.io]

https://redhat-internal.slack.com/archives/C01CQA76KMX/p1702325839977229?thread_ts=1702316018.298589&cid=C01CQA76KMX

https://github.com/openshift/hypershift/pull/3306

Bug OCPBUGS-15755: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/coredns/pull/105

Bug OCPBUGS-30873: CEO aliveness check should only detect deadlocks

Description of problem:

From a test run in [1] we can't be sure whether the call to etcd was really deadlocked or just waiting for a result.

Currently the CheckingSyncWrapper only defines "alive" as a sync func that has not returned an error. This can be wrong in scenarios where a member is down and perpetually unreachable.
Instead, we want to detect deadlock situations where the sync loop is stuck for a prolonged period of time.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/

Version-Release number of selected component (if applicable):

>4.14

How reproducible:

Always

Steps to Reproduce:

    1. create a healthy cluster
    2. make sure one etcd member never responds, but the node is still there (e.g. shut down the kubelet, or block the etcd ports on a firewall)
    3. wait for the CEO to restart its pod on the failing health probe and dump its stack

Actual results:

CEO controllers return errors but might not be deadlocked; this currently still results in a restart

Expected results:

CEO should mark the member as unhealthy and continue its service without getting deadlocked and should not restart its pod by failing the health probe

Additional info:

Highly related to OCPBUGS-30169

https://github.com/openshift/cluster-etcd-operator/pull/1223

Story CONSOLE-3965: i18n upload/download routine task for version-4.16/ sprint-248

This story tracks the routine i18n upload/download tasks that are performed every sprint.

A.C.

Upload strings to Memsource at the start of the sprint and reach out to the localization team

Download translated strings from Memsource when they are ready

Review the translated strings and open a pull request

Open a followup story for next sprint

https://github.com/openshift/console/pull/13652

Bug OCPBUGS-18454: CVO chokes on ConditionalUpdateRisk with name that is not a valid condition reason

Description of problem:

Conditional risks have looser naming restrictions:

	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MinLength=1
	// +required
	Name string `json:"name"`

...than condition Reason field:

	// +required
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MaxLength=1024
	// +kubebuilder:validation:MinLength=1
	// +kubebuilder:validation:Pattern=`^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$`
	Reason string `json:"reason" protobuf:"bytes,5,opt,name=reason"`

CVO can use a risk name as a condition reason, so when the name of an applying risk is not a valid reason, CVO will keep trying to update the ClusterVersion resource with an invalid value.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

always

Steps to Reproduce:

Make the cluster consume update graph data containing a conditional edge whose risk has a name that does not satisfy the Condition.Reason restriction, e.g. one that uses a - character. The risk must apply to the cluster. For example:

{
  "nodes": [
    {"version": "CLUSTER-BOT-VERSION", "payload": "CLUSTER-BOT-PAYLOAD"},
    {"version": "4.12.22", "payload": "quay.io/openshift-release-dev/ocp-release@sha256:1111111111111111111111111111111111111111111111111111111111111111"}
  ],
  "conditionalEdges": [
    {
      "edges": [{"from": "CLUSTER-BOT-VERSION", "to": "4.12.22"}],
      "risks": [
        {
          "url": "https://github.com/openshift/api/blob/8891815aa476232109dccf6c11b8611d209445d9/vendor/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1515C4-L1520",
          "name": "OCPBUGS-9050",
          "message": "THere is no validation that risk name is a valid Condition.Reason so let's just use a - character in it.",
          "matchingRules": [{"type": "PromQL", "promql": { "promql": "group by (type) (cluster_infrastructure_provider)"}}]
        }
      ]
    }
  ]
}

Then, observe the ClusterVersion status field after the cluster has a chance to evaluate the risk:

$ oc get clusterversion version -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="Progressing" or .type=="Available" or .type=="Failing")'

Actual results:

{
  "lastTransitionTime": "2023-09-01T13:21:49Z",
  "status": "False",
  "type": "Available"
}
{
  "lastTransitionTime": "2023-09-01T13:21:49Z",
  "message": "ClusterVersion.config.openshift.io \"version\" is invalid: status.conditionalUpdates[0].conditions[0].reason: Invalid value: \"OCPBUGS-9050\": status.conditionalUpdates[0].conditions[0].reason in body should match '^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$'",
  "status": "True",
  "type": "Failing"
}
{
  "lastTransitionTime": "2023-09-01T13:14:34Z",
  "message": "Error ensuring the cluster version is up to date: ClusterVersion.config.openshift.io \"version\" is invalid: status.conditionalUpdates[0].conditions[0].reason: Invalid value: \"OCPBUGS-9050\": status.conditionalUpdates[0].conditions[0].reason in body should match '^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$'",
  "status": "False",
  "type": "Progressing"
}

Expected results:

No errors, CVO continues to work and either uses some sanitized version of the name as the reason, or maybe uses something generic, like RiskApplies.

CVO does not get stuck after consuming data from external source

Additional info:

1. We should CI PRs to o/cincinnati-graph-data so we never create invalid data
2. We should sanitize the field in CVO code so that CVO never attempts to submit an invalid ClusterVersion.status.conditionalUpdates.condition.reason
3. We should further restrict the conditional update risk name in the CRD so it is guaranteed compatible with Condition.Reason
4. We should sanitize the field in CVO code after it is read from OSUS so that CVO never attempts to submit an invalid (after we do 3) ClusterVersion.conditionalUpdates.name
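
One possible shape for the sanitization in items 2 and 4 is sketched below in Python. This is an illustration under stated assumptions, not the change made in the linked PR: the function name sanitize_reason, the choice of _ as the replacement character, and the RiskApplies fallback (taken from the expected results above) are all hypothetical.

```python
import re

# Pattern copied from the Condition.Reason kubebuilder marker quoted earlier.
REASON_PATTERN = re.compile(r"^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$")
FALLBACK_REASON = "RiskApplies"  # generic reason suggested in the expected results


def sanitize_reason(name: str, max_length: int = 1024) -> str:
    """Map an arbitrary risk name to a string that is a valid Condition.Reason."""
    # Replace every character the pattern does not allow with an underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_,:]", "_", name)[:max_length]
    # The pattern does not allow a trailing ',' or ':'.
    cleaned = cleaned.rstrip(",:")
    # Anything still invalid (empty, or not starting with a letter) gets the fallback.
    if REASON_PATTERN.fullmatch(cleaned) is None:
        return FALLBACK_REASON
    return cleaned


if __name__ == "__main__":
    print(sanitize_reason("OCPBUGS-9050"))
```

Note that this mapping is lossy: two different risk names can sanitize to the same reason, so the original name would still need to be reported elsewhere (for example, in the condition message).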

https://github.com/openshift/cluster-version-operator/pull/962

Bug OCPBUGS-27092: Baremetal bootstrap logs no longer contain all services

Description of problem:

When bootstrap logs are collected (e.g. as part of a CI run when bootstrapping fails), they no longer contain most of the Ironic services. These services used to run in standalone pods, but after a recent refactoring they run as systemd services.

https://github.com/openshift/installer/pull/7854

Bug OCPBUGS-27489: sha256sum mismatch for openshift-install-mac-arm64-4.14.9.tar.gz

Description of problem:

The sha256 sum for "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.14.9/openshift-install-mac-arm64-4.14.9.tar.gz" does not match what it should be:

# sha256sum openshift-install-mac-arm64-4.14.9.tar.gz
61cccc282f39456b7db730a0625d0a04cd6c1c2ac0f945c4c15724e4e522a073  openshift-install-mac-arm64-4.14.9.tar.gz

This does not match what is posted here: https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.14.9/sha256sum.txt

It should be:
c765c90a32b8a43bc62f2ba8bd59dc8e620b972bcc2a2e217c36ce139d517e29  openshift-install-mac-arm64-4.14.9.tar.gz
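
The comparison described above can be automated. The following Python sketch computes the digest of a downloaded file and compares it with the entry for the same file name in the text of a sha256sum-style listing. The function names are hypothetical, and fetching sha256sum.txt is left to the caller.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of the file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_digest(checksum_text: str, filename: str) -> Optional[str]:
    """Find the digest listed for filename in 'digest  filename' formatted text."""
    for line in checksum_text.splitlines():
        parts = line.split()
        # sha256sum may prefix the name with '*' when run in binary mode.
        if len(parts) == 2 and parts[1].lstrip("*") == filename:
            return parts[0].lower()
    return None


def matches_published(path: Path, checksum_text: str) -> bool:
    """Return True only if the file is listed and its digest matches."""
    expected = expected_digest(checksum_text, path.name)
    return expected is not None and sha256_of_file(path) == expected
```

Calling matches_published with the downloaded tarball and the contents of sha256sum.txt returns False both for a digest mismatch and for a file that is missing from the listing.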

https://github.com/openshift/oc/pull/1659

Bug OCPBUGS-29760: Dev console: Observe > Dashboard page should be called "Dashboards"

Description of problem:

There are multiple dashboards, so this page title should be "Dashboards" rather than "Dashboard". The admin console version of this page is already titled "Dashboards".

Version-Release number of selected component (if applicable):

4.16

Steps to Reproduce:

    1. Open "Developer" View > Observe

Actual results:

    See the tab title is "Dashboard"

Expected results:

    The tab title should be "Dashboards"

https://github.com/openshift/console/pull/13597

Bug OCPBUGS-24887: Update 4.16 ose-cluster-kube-cluster-api-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api-operator/pull/32

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api-operator/pull/32

Bug OCPBUGS-25403: update apiserver et.al. to v0.29.0

This update is compatible with recent kube-openapi changes so that we can drop our replace k8s.io/kube-openapi => k8s.io/kube-openapi v0.0.0-20230928195430-ce36a0c3bb67 introduced in 53b387f4f54c8426526478afd0fd3e2b4e7aec66.

https://github.com/openshift/cluster-monitoring-operator/pull/2205

Story HOSTEDCP-336: clean up API package deps to enable clean import

Currently, importing the HyperShift API packages into other projects also brings in the dependencies (and conflicts) of the non-API packages. This is a request to create a Go submodule containing only the API packages.

Once we cut beta and the API is stable, we should move it into its own repo.

Bug OCPBUGS-24715: PipelineRuns is not loaded on repository details page

https://github.com/openshift/console/pull/13432

Bug OCPBUGS-26486: [Driver: pd.csi.storage.gke.io] [Testpattern: Dynamic PV (block volmode)] provisioning should provision storage with pvc data source in parallel [Slow] failing

Description of problem:

The following test started to fail frequently in the periodic tests:

External Storage [Driver: pd.csi.storage.gke.io] [Testpattern: Dynamic PV (block volmode)] provisioning should provision storage with pvc data source in parallel

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    Sometimes, but way too often in the CI

Steps to Reproduce:

    1. Run the periodic-ci-openshift-release-master-nightly-X.X-e2e-gcp-ovn-csi test

Actual results:

    Provisioning of some volumes fails with

time="2024-01-05T02:30:07Z" level=info msg="resulting interval message" message="{ProvisioningFailed  failed to provision volume with StorageClass \"e2e-provisioning-9385-e2e-scw2z8q\": rpc error: code = Internal desc = CreateVolume failed to create single zonal disk pvc-35b558d6-60f0-40b1-9cb7-c6bdfa9f28e7: failed to insert zonal disk: unknown Insert disk operation error: rpc error: code = Internal desc = operation operation-1704421794626-60e299f9dba08-89033abf-3046917a failed (RESOURCE_OPERATION_RATE_EXCEEDED): Operation rate exceeded for resource 'projects/XXXXXXXXXXXXXXXXXXXXXXXX/zones/us-central1-a/disks/pvc-501347a5-7d6f-4a32-b0e0-cf7a896f316d'. Too frequent operations from the source resource. map[reason:ProvisioningFailed]}"

Expected results:

    Test passes

Additional info:

    Looks like we're hitting the API quota limits with the test

Failed test run example:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-ovn-csi/1743082616304701440

Link to Sippy:

https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2023-10-31%2023%3A59%3A59&baseRelease=4.14&baseStartTime=2023-10-04%2000%3A00%3A00&capability=Dynamic%20PV%20%28block%20volmode%29&component=Storage%20%2F%20Kubernetes%20External%20Components&confidence=95&environment=ovn%20no-upgrade%20amd64%20gcp%20standard&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&network=ovn&pity=5&platform=gcp&platform=gcp&sampleEndTime=2024-01-08%2023%3A59%3A59&sampleRelease=4.15&sampleStartTime=2024-01-02%2000%3A00%3A00&testId=openshift-tests%3A7845229f6a2c8faee6573878f566d2f3&testName=External%20Storage%20%5BDriver%3A%20pd.csi.storage.gke.io%5D%20%5BTestpattern%3A%20Dynamic%20PV%20%28block%20volmode%29%5D%20provisioning%20should%20provision%20storage%20with%20pvc%20data%20source%20in%20parallel%20%5BSlow%5D&upgrade=no-upgrade&upgrade=no-upgrade&variant=standard&variant=standard

https://github.com/openshift/gcp-pd-csi-driver-operator/pull/112

Bug OCPBUGS-28388: Redundant reconciles by CCO's status controller

Description of problem:

The status controller of CCO reconciles 500+ times/h on average on a resting 6-node mint-mode OCP cluster on AWS.

Steps to Reproduce:

1. Install a 6-node mint-mode OCP cluster on AWS
2. Do nothing with it and wait for a couple of hours
3. Plot the following metric in the metrics dashboard of OCP console:
rate(controller_runtime_reconcile_total{controller="status"}[1h]) * 3600

Actual results:

500+ reconciles/h on a resting cluster

Expected results:

12-50 reconciles/h on a resting cluster
Note: the reconcile() function always requeues after 5min so the theoretical minimum is 12 reconciles/h
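
The arithmetic behind these numbers can be written out as a short Python sketch. The helper names are hypothetical; the 5-minute requeue interval and the factor of 3600 come from the note and the PromQL expression above.

```python
SECONDS_PER_HOUR = 3600
REQUEUE_INTERVAL_SECONDS = 5 * 60  # reconcile() always requeues after 5 minutes


def reconciles_per_hour(per_second_rate: float) -> float:
    """Convert a per-second rate, as returned by PromQL rate(), to reconciles/h."""
    return per_second_rate * SECONDS_PER_HOUR


def minimum_reconciles_per_hour(requeue_interval_seconds: float = REQUEUE_INTERVAL_SECONDS) -> float:
    """Lower bound on reconciles/h when every reconcile requeues after a fixed interval."""
    return SECONDS_PER_HOUR / requeue_interval_seconds


if __name__ == "__main__":
    print(minimum_reconciles_per_hour())       # theoretical minimum
    print(reconciles_per_hour(500 / 3600))     # the observed 500+ reconciles/h as a rate
```

With these figures, the observed rate is roughly 40 times the theoretical minimum of 12 reconciles/h.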

https://github.com/openshift/cloud-credential-operator/pull/665

Bug OCPBUGS-20035: CI does not provide failure details

Description of problem:

Results of -hypershift-aws-e2e-external CI jobs do not contain an obvious reason why a test failed. For example, this TestCreateCluster is listed as failed, but all the failures in TestCreateCluster look like errors from dumping the cluster after the failure.

It should show that "storage operator did not become Available=true", or even report that "pod cluster-storage-operator-6f6d69bf89-fx2d2 in the hosted control plane XYZ is in CrashloopBackoff".

The PR under test had a simple typo leading to a crash loop, and it should be more obvious what went wrong.

Version-Release number of selected component (if applicable):

4.15.0-0.ci.test-2023-10-03-040803

https://github.com/openshift/hypershift/pull/3190

Bug OCPBUGS-28937: ART requests updates to 4.16 image ose-openstack-cinder-csi-driver-operator-container

Please review the following PR: https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/160

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/160

Bug MGMT-16993: [STG] avoid reboot not working correctly when there is a partition on installation disk

Description of the problem:
This issue was discovered while trying to verify MGMT-16966, a bug opened by Javipolo.

It seems that avoiding the extra reboot does not work properly for a 4.15 cluster with a partition on the installation disk.

Here are 3 clusters:
1) test-infra-cluster-7a4cb4cc OCP 4.15 with partition on installation disk
2) test-infra-cluster-066749e2 OCP 4.14 with partition on installation disk
3) test-infra-cluster-f9051a36 OCP 4.15 without partition on installation disk

The events show the following:
1) test-infra-cluster-7a4cb4cc

2/19/2024, 9:19:53 PM	Node test-infra-cluster-7a4cb4cc-worker-0 has been rebooted 1 times before completing installation
2/19/2024, 9:19:51 PM	Node test-infra-cluster-7a4cb4cc-master-2 has been rebooted 2 times before completing installation
2/19/2024, 9:19:08 PM	Node test-infra-cluster-7a4cb4cc-worker-1 has been rebooted 1 times before completing installation
2/19/2024, 8:49:59 PM	Node test-infra-cluster-7a4cb4cc-master-1 has been rebooted 2 times before completing installation
2/19/2024, 8:49:55 PM	Node test-infra-cluster-7a4cb4cc-master-0 has been rebooted 2 times before completing installation

2) test-infra-cluster-066749e2

2/19/2024, 8:32:36 PM	Node test-infra-cluster-066749e2-master-2 has been rebooted 2 times before completing installation
2/19/2024, 8:32:35 PM	Node test-infra-cluster-066749e2-worker-1 has been rebooted 2 times before completing installation
2/19/2024, 8:32:31 PM	Node test-infra-cluster-066749e2-worker-0 has been rebooted 2 times before completing installation
2/19/2024, 8:05:26 PM	Node test-infra-cluster-066749e2-master-0 has been rebooted 2 times before completing installation
2/19/2024, 8:05:25 PM	Node test-infra-cluster-066749e2-master-1 has been rebooted 2 times before completing installation

3) test-infra-cluster-f9051a36

2/18/2024, 5:13:49 PM	Node test-infra-cluster-f9051a36-worker-1 has been rebooted 1 times before completing installation
2/18/2024, 5:10:10 PM	Node test-infra-cluster-f9051a36-worker-0 has been rebooted 1 times before completing installation
2/18/2024, 5:08:46 PM	Node test-infra-cluster-f9051a36-worker-2 has been rebooted 1 times before completing installation
2/18/2024, 5:03:12 PM	Node test-infra-cluster-f9051a36-master-1 has been rebooted 1 times before completing installation
2/18/2024, 4:33:39 PM	Node test-infra-cluster-f9051a36-master-2 has been rebooted 1 times before completing installation
2/18/2024, 4:33:38 PM	Node test-infra-cluster-f9051a36-master-0 has been rebooted 1 times before completing installation

According to Ori's analysis, it seems that skipping the MCO reboot didn't happen for the masters. The ignition was not accessible:

Feb 19 18:38:19 test-infra-cluster-7a4cb4cc-master-0 installer[3403]: time="2024-02-19T18:38:19Z" level=warning msg="failed getting encapsulated machine config. Continuing installation without skipping MCO reboot" error="failed after 240 attempts, last error: unexpected end of JSON input"

How reproducible:

Steps to reproduce:

1. Create a cluster with 4.15

2. Add a custom manifest that modifies the ignition to create a partition on the disk

3. Start the installation

Actual results:

It seems that avoiding the reboot did not work properly

Expected results:

https://github.com/openshift/assisted-installer/pull/798

Bug OCPBUGS-28557: Update 4.16 ose-multus-route-override-cni-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/route-override-cni/pull/54

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/route-override-cni/pull/54

Bug OCPBUGS-23744: operator-lifecycle-manager-packageserver ClusterOperator should not blip Available=False on 4.14 to 4.15 updates

Description of problem:

Seen in 4.14 to 4.15 update CI:

: [bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should not change condition/Available
Run #0: Failed	1h34m55s
{  1 unexpected clusteroperator state transitions during e2e test run 

Nov 22 21:48:41.624 - 56ms  E clusteroperator/operator-lifecycle-manager-packageserver condition/Available reason/ClusterServiceVersionNotSucceeded status/False ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: APIServiceInstallFailed, message: APIService install failed: forbidden: User "system:anonymous" cannot get path "/apis/packages.operators.coreos.com/v1"}

While a brief auth failure isn't fantastic, an issue that only persists for 56ms is not long enough to warrant immediate admin intervention. Teaching the operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required. It's also possible that this is an incoming-RBAC vs. outgoing-RBAC race of some sort, and that shifting manifest filenames around could avoid the hiccup entirely.
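
The "stay Available=True through brief hiccups" behaviour can be pictured as a debounce on the reported condition. The Python sketch below is only an illustration of that idea and is not what OLM or the linked PR implements: the class name, the observe method, and the 60-second grace period are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical grace period; the report above does not propose a specific value.
GRACE_PERIOD_SECONDS = 60.0


@dataclass
class AvailabilityDebouncer:
    """Report Available=False only after a failure has persisted for grace_period."""

    grace_period: float = GRACE_PERIOD_SECONDS
    failing_since: Optional[float] = None

    def observe(self, healthy: bool, now: float) -> bool:
        """Return the Available status to report at time now (in seconds)."""
        if healthy:
            # Any healthy observation resets the failure window.
            self.failing_since = None
            return True
        if self.failing_since is None:
            self.failing_since = now
        # Keep reporting Available=True until the failure outlasts the grace period.
        return (now - self.failing_since) < self.grace_period
```

Under this scheme, a 56ms failure like the one in the CI output above would never surface as Available=False, while a failure lasting longer than the grace period still would.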

Version-Release number of selected component (if applicable):

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/operator-lifecycle-manager-packageserver+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
periodic-ci-openshift-cluster-etcd-operator-release-4.15-periodics-e2e-aws-etcd-recovery (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 8 runs, 38% failed, 33% of failures match = 13% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 5 runs, 20% failed, 400% of failures match = 80% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-ppc64le (all) - 6 runs, 67% failed, 75% of failures match = 50% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 6 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-sdn-arm64 (all) - 5 runs, 20% failed, 300% of failures match = 60% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 5 runs, 40% failed, 100% of failures match = 40% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 43 runs, 51% failed, 36% of failures match = 19% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 5 runs, 20% failed, 300% of failures match = 60% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 44% failed, 17% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 30% failed, 63% of failures match = 19% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-uwm (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 8 runs, 25% failed, 200% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 43% failed, 50% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 50 runs, 16% failed, 50% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-vsphere-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-from-stable-4.13-e2e-aws-sdn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 5 runs, 100% failed, 80% of failures match = 80% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-upgrade-rollback-oldest-supported (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 50 runs, 18% failed, 178% of failures match = 32% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-sdn-bm-upgrade (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 6 runs, 83% failed, 60% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.13-e2e-aws-ovn-upgrade-paused (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-sdn-bm-upgrade (all) - 5 runs, 100% failed, 40% of failures match = 40% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 6 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 19 runs, 63% failed, 33% of failures match = 21% impact
periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 15 runs, 47% failed, 57% of failures match = 27% impact
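
Because several of the 100% impact entries above come from jobs with a single run, it can help to parse this output and rank jobs by impact and run count before picking one to reproduce with. The following Python sketch assumes the exact line format shown above; the names JobImpact, parse_impact_line, and rank_by_impact are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Matches lines such as:
# "<job> (all) - 5 runs, 20% failed, 400% of failures match = 80% impact"
LINE_PATTERN = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


class JobImpact(NamedTuple):
    job: str
    runs: int
    failed_percent: int
    match_percent: int
    impact_percent: int


def parse_impact_line(line: str) -> Optional[JobImpact]:
    """Parse one line of the search output; return None if it does not match."""
    m = LINE_PATTERN.match(line.strip())
    if m is None:
        return None
    return JobImpact(
        m["job"], int(m["runs"]), int(m["failed"]), int(m["match"]), int(m["impact"])
    )


def rank_by_impact(lines: Iterable[str]) -> List[JobImpact]:
    """Sort parsed jobs by impact, breaking ties in favour of more runs."""
    parsed = [p for p in map(parse_impact_line, lines) if p is not None]
    return sorted(parsed, key=lambda p: (p.impact_percent, p.runs), reverse=True)
```

Ties are broken by run count so that, among jobs with equal impact, those with more runs (and therefore a more trustworthy percentage) come first.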

I'm not sure if all of those are from this system:anonymous issue, or if some of them are other mechanisms. Ideally we fix all of the Available=False noise, while, again, still going Available=False when it is worth summoning an admin immediately. Checking for different reason and message strings in recent 4.15-touching update runs:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/operator-lifecycle-manager-packageserver.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*message: \(.*\)|\1 \2 \3|' | sort | uniq -c | sort -n
      3 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded APIService install failed: Unauthorized
      3 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded install timeout
      4 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded install strategy failed: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1.packages.operators.coreos.com": the object has been modified; please apply your changes to the latest version and try again
      9 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded apiServices not installed
     23 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded install strategy failed: could not create service packageserver-service: services "packageserver-service" already exists
     82 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded APIService install failed: forbidden: User "system:anonymous" cannot get path "/apis/packages.operators.coreos.com/v1"

How reproducible:

Lots of hits in the above CI search. Running one of the 100% impact flavors has a good chance at reproducing.

Steps to Reproduce:

1. Install 4.14
2. Update to 4.15
3. Keep an eye on operator-lifecycle-manager-packageserver's ClusterOperator Available.

Actual results:

Available=False blips.

Expected results:

Available=True the whole time, or any Available=False looks like a serious issue where summoning an admin would have been appropriate.

Additional info:

This also causes the following test cases to fail (mentioned here so that Sippy links to this bug on relevant component readiness failures):

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]

https://github.com/openshift/operator-framework-olm/pull/708

Bug OCPBUGS-29028: 4.16 Test failing on Power trying to connect to thanos-querier

Description of problem:

Test case:
[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster should start and expose a secured proxy and unsecured metrics [apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

Example Z Job Link:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-ovn-remote-libvirt-s390x/1754481543524388864

Z must-gather Link:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-ovn-remote-libvirt-s390x/1754481543524388864/artifacts/ocp-e2e-ovn-remote-libvirt-s390x/gather-libvirt/artifacts/must-gather.tar

Example P Job Link:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-ovn-remote-libvirt-ppc64le/1754481543436308480

P must-gather Link:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-ovn-remote-libvirt-ppc64le/1754481543436308480/artifacts/ocp-e2e-ovn-remote-libvirt-ppc64le/gather-libvirt/artifacts/must-gather.tar

JSON body of error:
{  fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:383]: Unexpected error:
    <*fmt.wrapError | 0xc001d9c000>: 
    https://thanos-querier.openshift-monitoring.svc:9091: request failed: Get "https://thanos-querier.openshift-monitoring.svc:9091": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.38.188:53: no such host
    {
        msg: "https://thanos-querier.openshift-monitoring.svc:9091: request failed: Get \"https://thanos-querier.openshift-monitoring.svc:9091\": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.38.188:53: no such host",
        err: <*url.Error | 0xc0020e02d0>{
            Op: "Get",
            URL: "https://thanos-querier.openshift-monitoring.svc:9091",
            Err: <*net.OpError | 0xc000b8f770>{
                Op: "dial",
                Net: "tcp",
                Source: nil,
                Addr: nil,
                Err: <*net.DNSError | 0xc0020df700>{
                    Err: "no such host",
                    Name: "thanos-querier.openshift-monitoring.svc",
                    Server: "172.30.38.188:53",
                    IsTimeout: false,
                    IsTemporary: false,
                    IsNotFound: true,
                },
            },
        },
    }

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1. Observe Nightlies on P and/or Z

Actual results:

    Test failing

Expected results:

    Test passing

Additional info:

https://github.com/openshift/origin/pull/28621

Bug OCPBUGS-18776: Test case failure- OpenShift alerting rules [apigroup:image.openshift.io] should have description and summary annotations

Description of problem:

Test case failure: OpenShift alerting rules [apigroup:image.openshift.io] should have description and summary annotations
The obtained response seems to have unmarshalling errors.

Failed to fetch alerting rules: unable to parse response

invalid character 's' after object key

Expected output: the response should be well-formed and the unmarshalling should succeed

OpenShift version: 4.13 and 4.14

Cloud provider/platform: PowerVS

Prow Job Link/Must gather path- https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.13-ocp-e2e-ovn-ppc64le-powervs/1700992665824268288/artifacts/ocp-e2e-ovn-ppc64le-powervs/

https://github.com/openshift/origin/pull/28546

Bug OCPBUGS-25483: LB not getting External-IP

Description of problem:

A regression was identified creating LoadBalancer services in ARO in new 4.14 clusters (handled for new installations in OCPBUGS-24191)

The same regression has also been confirmed in ARO clusters upgraded to 4.14.

Version-Release number of selected component (if applicable):

4.14.z

How reproducible:

On any ARO cluster upgraded to 4.14.z

Steps to Reproduce:

    1. Install an ARO cluster
    2. Upgrade to 4.14 from fast channel
    3. oc create svc loadbalancer test-lb -n default --tcp 80:8080

Actual results:

# External-IP stuck in Pending
$ oc get svc test-lb -n default
NAME      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
test-lb   LoadBalancer   172.30.104.200   <pending>     80:30062/TCP   15m


# Errors in cloud-controller-manager being unable to map VM to nodes
$ oc logs -l infrastructure.openshift.io/cloud-controller-manager=Azure  -n openshift-cloud-controller-manager
I1215 19:34:51.843715       1 azure_loadbalancer.go:1533] reconcileLoadBalancer for service(default/test-lb) - wantLb(true): started
I1215 19:34:51.844474       1 event.go:307] "Event occurred" object="default/test-lb" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I1215 19:34:52.253569       1 azure_loadbalancer_repo.go:73] LoadBalancerClient.List(aro-r5iks3dh) success
I1215 19:34:52.253632       1 azure_loadbalancer.go:1557] reconcileLoadBalancer for service(default/test-lb): lb(aro-r5iks3dh/mabad-test-74km6) wantLb(true) resolved load balancer name
I1215 19:34:52.528579       1 azure_vmssflex_cache.go:162] Could not find node () in the existing cache. Forcely freshing the cache to check again...
E1215 19:34:52.714678       1 azure_vmssflex.go:379] fs.GetNodeNameByIPConfigurationID(/subscriptions/fe16a035-e540-4ab7-80d9-373fa9a3d6ae/resourceGroups/aro-r5iks3dh/providers/Microsoft.Network/networkInterfaces/mabad-test-74km6-master0-nic/ipConfigurations/pipConfig) failed. Error: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0
E1215 19:34:52.714888       1 azure_loadbalancer.go:126] reconcileLoadBalancer(default/test-lb) failed: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0
I1215 19:34:52.714956       1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=0.871261893 request="services_ensure_loadbalancer" resource_group="aro-r5iks3dh" subscription_id="fe16a035-e540-4ab7-80d9-373fa9a3d6ae" source="default/test-lb" result_code="failed_ensure_loadbalancer"
E1215 19:34:52.715005       1 controller.go:291] error processing service default/test-lb (will retry): failed to ensure load balancer: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0

Expected results:

# The LoadBalancer gets an External-IP assigned
$ oc get svc test-lb -n default 
NAME         TYPE           CLUSTER-IP       EXTERNAL-IP                            PORT(S)        AGE 
test-lb      LoadBalancer   172.30.193.159   20.242.180.199                         80:31475/TCP   14s

Additional info:

In the cloud-provider-config ConfigMap in the openshift-config namespace, vmType is set to an empty string ("").

When vmType gets changed to "standard" explicitly, the provisioning of the LoadBalancer completes and an ExternalIP gets assigned without errors.
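The workaround above can be sketched as a small helper that fills in an empty vmType before the config is used. This is only an illustration, not the operator's implementation: the function name normalize_vm_type is hypothetical, and it assumes the cloud config is a JSON object with a top-level "vmType" key.

```python
import json


def normalize_vm_type(cloud_config_json: str, default: str = "standard") -> str:
    """Return the cloud config with an empty or missing vmType set to `default`.

    Assumption: the config is a JSON object with a top-level "vmType" key.
    """
    config = json.loads(cloud_config_json)
    if not config.get("vmType"):
        # An empty string is treated the same as a missing key.
        config["vmType"] = default
    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(normalize_vm_type('{"vmType": ""}'))
```

An explicitly set value is left untouched, so only the empty case described in this bug is rewritten.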

https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/316

Bug OCPBUGS-30496: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-policy-controller/pull/147

Bug OCPBUGS-27016: ResizeObserver limit exceed in dev-console

Description of problem:

    During automated test execution for the dev-console package, Cypress fails the running test with "uncaught:exception : ResizeObserver limit exceed", although no failure is visible in the UI.

Version-Release number of selected component (if applicable):

How reproducible:

    Always

Steps to Reproduce:

    1. 
    2.
    3.

Actual results:

Expected results:

Additional info: Screenshot

https://github.com/openshift/console/pull/13502

Bug TRT-1501: NetPol tests may not be recording results correctly?

On https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-gcp-ovn-upgrade-4.15-micro-release-openshift-release-analysis-aggregator/1756168710529224704 we failed three netpol tests on the basis of a single result, a failure, with no successes. However, the other 9 jobs seemed to run fine.

[sig-network] Netpol NetworkPolicy between server and client should allow ingress access from namespace on one named port [Feature:NetworkPolicy] [Skipped:Network/OVNKubernetes] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]

[sig-network] Netpol NetworkPolicy between server and client should allow egress access on one named port [Feature:NetworkPolicy] [Skipped:Network/OVNKubernetes] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]

[sig-network] Netpol NetworkPolicy between server and client should allow ingress access on one named port [Feature:NetworkPolicy] [Skipped:Network/OVNKubernetes] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]

Something seems funky with these tests: they're barely running, and they don't seem to report results consistently. 4.15 has just 3 runs in the last two weeks, whereas 4.16 has just 1, which passed.

Whatever's going on, it's capable of taking out a payload, though it's not happening 100% of the time. https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.ci/release/4.15.0-0.ci-2024-02-10-040857
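To make the symptom concrete, the following sketch counts how many jobs in an aggregated run actually reported a given test. It is not the aggregator's logic; the function name and the input shape (a mapping of job ID to per-test outcome) are hypothetical and only illustrate how one reported failure out of ten jobs leaves a test with no passing results.

```python
def summarize(results_by_job: dict, test: str) -> dict:
    """Summarize one test across jobs.

    Hypothetical input shape: job id -> {test name: "pass" | "fail"}.
    Jobs that did not report the test simply omit it.
    """
    outcomes = [r[test] for r in results_by_job.values() if test in r]
    passes = outcomes.count("pass")
    return {
        "jobs": len(results_by_job),
        "reported": len(outcomes),
        "passes": passes,
        "failures": len(outcomes) - passes,
    }


if __name__ == "__main__":
    jobs = {f"job-{i}": {} for i in range(9)}
    jobs["job-9"] = {"netpol-named-port": "fail"}
    print(summarize(jobs, "netpol-named-port"))
```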

https://github.com/openshift/origin/pull/28593

Bug OCPBUGS-24979: Update 4.16 ose-cluster-capi-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/150

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-capi-operator/pull/150

Task MON-3661: Update downstream prometheus-operator to v0.71.0

Bug OCPBUGS-25673: CNV upgrades from v4.14.1 to v4.15.0 (unreleased) are not starting due to out of sync operatorCondition

Description of problem:

CNV upgrades from v4.14.1 to v4.15.0 (unreleased) are not starting due to out of sync operatorCondition.

We see:

$ oc get csv
NAME                                       DISPLAY                    VERSION               REPLACES                                   PHASE
kubevirt-hyperconverged-operator.v4.14.1   OpenShift Virtualization   4.14.1                kubevirt-hyperconverged-operator.v4.14.0   Replacing
kubevirt-hyperconverged-operator.v4.15.0   OpenShift Virtualization   4.15.0                kubevirt-hyperconverged-operator.v4.14.1   Pending

And on the v4.15.0 CSV:

$ oc get csv kubevirt-hyperconverged-operator.v4.15.0 -o yaml
....
status:
  cleanup: {}
  conditions:
  - lastTransitionTime: "2023-12-19T01:50:48Z"
    lastUpdateTime: "2023-12-19T01:50:48Z"
    message: requirements not yet checked
    phase: Pending
    reason: RequirementsUnknown
  - lastTransitionTime: "2023-12-19T01:50:48Z"
    lastUpdateTime: "2023-12-19T01:50:48Z"
    message: 'operator is not upgradeable: the operatorcondition status "Upgradeable"="True"
      is outdated'
    phase: Pending
    reason: OperatorConditionNotUpgradeable
  lastTransitionTime: "2023-12-19T01:50:48Z"
  lastUpdateTime: "2023-12-19T01:50:48Z"
  message: 'operator is not upgradeable: the operatorcondition status "Upgradeable"="True"
    is outdated'
  phase: Pending
  reason: OperatorConditionNotUpgradeable

and if we check the pending operator condition (v4.14.1) we see:

$ oc get operatorcondition kubevirt-hyperconverged-operator.v4.14.1 -o yaml
apiVersion: operators.coreos.com/v2
kind: OperatorCondition
metadata:
  creationTimestamp: "2023-12-16T17:10:17Z"
  generation: 18
  labels:
    operators.coreos.com/kubevirt-hyperconverged.openshift-cnv: ""
  name: kubevirt-hyperconverged-operator.v4.14.1
  namespace: openshift-cnv
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: true
    kind: ClusterServiceVersion
    name: kubevirt-hyperconverged-operator.v4.14.1
    uid: 7db79d4b-e69e-4af8-9335-6269cf004440
  resourceVersion: "4116127"
  uid: 347306c9-865a-42b8-b2c9-69192b0e350a
spec:
  conditions:
  - lastTransitionTime: "2023-12-18T18:47:23Z"
    message: ""
    reason: Upgradeable
    status: "True"
    type: Upgradeable
  deployments:
  - hco-operator
  - hco-webhook
  - hyperconverged-cluster-cli-download
  - cluster-network-addons-operator
  - virt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
  serviceAccounts:
  - hyperconverged-cluster-operator
  - cluster-network-addons-operator
  - kubevirt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
  - cluster-network-addons-operator
  - kubevirt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
status:
  conditions:
  - lastTransitionTime: "2023-12-18T09:41:06Z"
    message: ""
    observedGeneration: 11
    reason: Upgradeable
    status: "True"
    type: Upgradeable

where metadata.generation (18) is not in sync with status.conditions[*].observedGeneration (11).

Even manually editing spec.conditions.lastTransitionTime causes a change in metadata.generation (as expected), but this doesn't trigger any reconciliation in OLM, so status.conditions[*].observedGeneration remains at 11.

$ oc get operatorcondition kubevirt-hyperconverged-operator.v4.14.1 -o yaml
apiVersion: operators.coreos.com/v2
kind: OperatorCondition
metadata:
  creationTimestamp: "2023-12-16T17:10:17Z"
  generation: 19
  labels:
    operators.coreos.com/kubevirt-hyperconverged.openshift-cnv: ""
  name: kubevirt-hyperconverged-operator.v4.14.1
  namespace: openshift-cnv
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: true
    kind: ClusterServiceVersion
    name: kubevirt-hyperconverged-operator.v4.14.1
    uid: 7db79d4b-e69e-4af8-9335-6269cf004440
  resourceVersion: "4147472"
  uid: 347306c9-865a-42b8-b2c9-69192b0e350a
spec:
  conditions:
  - lastTransitionTime: "2023-12-18T18:47:25Z"
    message: ""
    reason: Upgradeable
    status: "True"
    type: Upgradeable
  deployments:
  - hco-operator
  - hco-webhook
  - hyperconverged-cluster-cli-download
  - cluster-network-addons-operator
  - virt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
  serviceAccounts:
  - hyperconverged-cluster-operator
  - cluster-network-addons-operator
  - kubevirt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
  - cluster-network-addons-operator
  - kubevirt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
status:
  conditions:
  - lastTransitionTime: "2023-12-18T09:41:06Z"
    message: ""
    observedGeneration: 11
    reason: Upgradeable
    status: "True"
    type: Upgradeable

since its observedGeneration is out of sync, this check:
https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/olm/operatorconditions.go#L44C1-L48

fails and the upgrade never starts.
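The staleness condition described above can be sketched as a comparison between metadata.generation and each status condition's observedGeneration. This is a minimal illustration, not OLM's code: the function name is hypothetical, and the use of a strict "less than" comparison is an assumption about how the linked check behaves.

```python
def outdated_conditions(operator_condition: dict) -> list:
    """Return the types of status conditions whose observedGeneration
    lags behind metadata.generation.

    Assumption: a condition counts as outdated when its
    observedGeneration is lower than the object's generation.
    """
    generation = operator_condition["metadata"]["generation"]
    conditions = operator_condition.get("status", {}).get("conditions", [])
    return [
        cond["type"]
        for cond in conditions
        if cond.get("observedGeneration", 0) < generation
    ]


if __name__ == "__main__":
    # Values taken from the OperatorCondition shown above.
    example = {
        "metadata": {"generation": 18},
        "status": {
            "conditions": [
                {"type": "Upgradeable", "status": "True", "observedGeneration": 11}
            ]
        },
    }
    print(outdated_conditions(example))
```

With the values from this report (generation 18, observedGeneration 11) the Upgradeable condition is flagged, which matches the "is outdated" message on the CSV.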

I suspect (I'm only guessing) that it could be a regression introduced with the memory optimization for https://issues.redhat.com/browse/OCPBUGS-17157 .

Version-Release number of selected component (if applicable):

    OCP 4.15.0-ec.3

How reproducible:

- Not reproducible (with the same CNV bundles) on OCP v4.14.z.
- Pretty high (but not 100%) on OCP 4.15.0-ec.3

Steps to Reproduce:

    1. Try triggering a CNV v4.14.1 -> v4.15.0 on OCP 4.15.0-ec.3
    2.
    3.

Actual results:

    The OLM is not reacting to changes on spec.conditions on the pending operator condition, so metadata.generation is constantly out of sync with status.conditions[*].observedGeneration and so the CSV is reported as 

    message: 'operator is not upgradeable: the operatorcondition status "Upgradeable"="True"
      is outdated'
    phase: Pending
    reason: OperatorConditionNotUpgradeable

Expected results:

    OLM correctly reconciles the operatorCondition and the upgrade starts.

Additional info:

    Not reproducible with exactly the same bundle (origin and target) on OCP v4.14.z

https://github.com/openshift/operator-framework-olm/pull/641

Bug OCPBUGS-27465: origin tests fail when two systemd service files are present in the kubelet machineconfig

Description of problem:

The test implementation in https://github.com/openshift/origin/commit/5487414d8f5652c301a00617ee18e5ca8f339cb4#L56 assumes there is just one kubelet service, or at least that it is always the first one in the MCP. That assumption no longer holds after https://github.com/openshift/machine-config-operator/pull/4124, and the test is failing.

Version-Release number of selected component (if applicable):

master branch of 4.16

How reproducible:

always during test

Steps to Reproduce:

    1. Test with https://github.com/openshift/machine-config-operator/pull/4124 applied

Actual results:

Test detects a wrong service and fails

Expected results:

Test finds the proper kubelet.service and passes
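The expected behavior amounts to selecting the unit by name instead of by position. A minimal sketch, assuming the units are available as a list of dicts with a "name" key; the helper name find_unit and the second unit name in the example are hypothetical, and this is not the origin test code.

```python
def find_unit(units: list, name: str = "kubelet.service"):
    """Return the first systemd unit dict whose name matches, or None.

    Looking the unit up by name avoids relying on it being first in the list.
    """
    for unit in units:
        if unit.get("name") == name:
            return unit
    return None


if __name__ == "__main__":
    units = [
        {"name": "other-kubelet-helper.service", "contents": "..."},  # hypothetical
        {"name": "kubelet.service", "contents": "..."},
    ]
    print(find_unit(units)["name"])
```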

Additional info:

https://github.com/openshift/origin/pull/28529

Bug OCPBUGS-25688: microshift is not managed by CNO, remove the folder

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

internal CI failure
customer issue / SD
internal RedHat testing failure

If it is an internal RedHat testing failure:

Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
If it's a connectivity issue,
What is the srcNode, srcIP and srcNamespace and srcPodName?
What is the dstNode, dstIP and dstNamespace and dstPodName?
What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
Don’t presume that Engineering has access to Salesforce.
Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment. The format should be: https://access.redhat.com/support/cases/#/case/<case number>/discussion?attachmentId=<attachment id>
Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
- If the issue is in a customer namespace then provide a namespace inspect.
- If it is a connectivity issue:
  - What is the srcNode, srcNamespace, srcPodName and srcPodIP?
  - What is the dstNode, dstNamespace, dstPodName and dstPodIP?
  - What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
  - Please provide the UTC timestamp networking outage window from must-gather
  - Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
- If it is not a connectivity issue:
  - Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.

For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”

https://github.com/openshift/cluster-network-operator/pull/2173

Bug OCPBUGS-30212: oc adm catalog mirror does not work on windows

The command does not honor Windows path separators.

Related to https://issues.redhat.com//browse/OCPBUGS-28864 (access restricted and not publicly visible). This report serves as a target issue for the fix and its backport to older OCP versions. Please see more details in https://issues.redhat.com//browse/OCPBUGS-28864.
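The class of problem can be illustrated with a short sketch: a path split with POSIX rules keeps a Windows path as one opaque component, while splitting with Windows rules recovers the individual directories. This is only an illustration in Python (oc itself is not written in Python), and the example path is hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath


def path_parts(path: str, windows: bool) -> tuple:
    """Split a path into components using the separator rules of the target OS."""
    flavour = PureWindowsPath if windows else PurePosixPath
    return flavour(path).parts


if __name__ == "__main__":
    p = r"C:\mirror\manifests"  # hypothetical path
    print(path_parts(p, windows=False))  # backslashes are not separators here
    print(path_parts(p, windows=True))   # drive, then each directory
```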

https://github.com/openshift/oc/pull/1680

Bug OCPBUGS-24428: Ensure Passwords are Redacted in Agent Log Files

Description of problem:

I've noticed that 'agent-cluster-install.yaml' and 'journal.export' from the agent gather process contain passwords. It's important not to expose password information in any of these generated files.

Version-Release number of selected component (if applicable):

4.15

How reproducible:

Always

Steps to Reproduce:

    1. Generate an agent ISO by utilising agent-config and install-config, including platform credentials
    2. Boot the ISO that was created
    3. Run the agent-gather command on the node 0 machine to generate files.

Actual results:

The 'agent-cluster-install.yaml' and 'journal.export' files contain password information.

Expected results:

Passwords should be redacted.
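One way to picture the expected redaction is a line-based filter that replaces the value of any key ending in "password". This is a rough sketch, not the assisted-service fix: the regular expression, the placeholder text, and the sample keys are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a key whose name ends in "password" (case-insensitive),
# followed by ":" or "=", then a non-empty value on the same line.
_PASSWORD_LINE = re.compile(r"(?im)^([ \t]*[\w.-]*password[ \t]*[:=][ \t]*).+$")


def redact_passwords(text: str, placeholder: str = "<redacted>") -> str:
    """Replace password values with a placeholder, keeping the key visible."""
    return _PASSWORD_LINE.sub(lambda m: m.group(1) + placeholder, text)


if __name__ == "__main__":
    sample = "username: admin\npassword: s3cret\n"  # hypothetical content
    print(redact_passwords(sample))
```

Keeping the key while masking the value preserves the structure of the gathered file, which is usually what is wanted for debugging.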

Additional info:

https://github.com/openshift/assisted-service/pull/5868

Bug OCPBUGS-27222: egressIP with IPv6 not working on dualstack cluster on openstack

Description of problem:

On an IPv6-primary dual-stack cluster, creating an IPv6 egressIP by following this procedure:

https://docs.openshift.com/container-platform/4.14/networking/ovn_kubernetes_network_provider/configuring-egress-ips-ovn.html

does not work. ovnkube-cluster-manager shows the following error:

2024-01-16T14:48:18.156140746Z I0116 14:48:18.156053       1 obj_retry.go:358] Adding new object: *v1.EgressIP egress-dualstack-ipv6
2024-01-16T14:48:18.161367817Z I0116 14:48:18.161269       1 obj_retry.go:370] Retry add failed for *v1.EgressIP egress-dualstack-ipv6, will try again later: cloud add request failed for CloudPrivateIPConfig: fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]
2024-01-16T14:48:18.161416023Z I0116 14:48:18.161357       1 event.go:298] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egress-dualstack-ipv6", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'CloudAssignmentFailed' egress IP: fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333 for object EgressIP: egress-dualstack-ipv6 could not be created, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]
2024-01-16T14:49:37.714410622Z I0116 14:49:37.714342       1 reflector.go:790] k8s.io/client-go/informers/factory.go:159: Watch close - *v1.Service total 8 items received
2024-01-16T14:49:48.155826915Z I0116 14:49:48.155330       1 obj_retry.go:296] Retry object setup: *v1.EgressIP egress-dualstack-ipv6
2024-01-16T14:49:48.156172766Z I0116 14:49:48.155899       1 obj_retry.go:358] Adding new object: *v1.EgressIP egress-dualstack-ipv6
2024-01-16T14:49:48.168795734Z I0116 14:49:48.168520       1 obj_retry.go:370] Retry add failed for *v1.EgressIP egress-dualstack-ipv6, will try again later: cloud add request failed for CloudPrivateIPConfig: fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]
2024-01-16T14:49:48.169400971Z I0116 14:49:48.168937       1 event.go:298] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egress-dualstack-ipv6", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'CloudAssignmentFailed' egress IP: fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333 for object EgressIP: egress-dualstack-ipv6 could not be created, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]

The same is observed with an IPv6 subnet in SLAAC mode.
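The error in the logs above can be reproduced in miniature: the object name is the IPv6 address with ":" replaced by ".", and a validation that only accepts IPv4-formatted names rejects it. The sketch below is an illustration under assumptions, not the controller's code: both function names are hypothetical, and expanding the IPv6 address fully before substituting the separator is an assumption (the address in this report has no compressible groups, so the logs do not show it either way).

```python
import ipaddress


def cloud_private_ip_config_name(ip: str) -> str:
    """Derive an object name from an IP address.

    IPv4 addresses are used as-is. IPv6 addresses are expanded and ":" is
    replaced with "." (assumed naming scheme, matching the name in the logs).
    """
    addr = ipaddress.ip_address(ip)
    if addr.version == 4:
        return str(addr)
    return addr.exploded.replace(":", ".")


def is_ipv4_name(name: str) -> bool:
    """Mimic a validation that accepts only IPv4-formatted names."""
    try:
        ipaddress.IPv4Address(name)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    name = cloud_private_ip_config_name("fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333")
    print(name, is_ipv4_name(name))
```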

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-01-06-062415
RHOS-16.2-RHEL-8-20230510.n.1

How reproducible: Always.
Steps to Reproduce:

Apply the following:

$ oc label node/ostest-8zrlf-worker-0-4h78l k8s.ovn.org/egress-assignable=""

$ cat egressip_ipv4.yaml && cat egressip_ipv6.yaml 
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-dualstack-ipv4
spec:
  egressIPs:
    - 192.168.192.111
  namespaceSelector:
    matchLabels: 
      app: egress
      
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-dualstack-ipv6
spec:
  egressIPs:
    - fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333
  namespaceSelector:
    matchLabels: 
      app: egress

$ oc apply -f egressip_ipv4.yaml
$ oc apply -f egressip_ipv6.yaml

But the controller logs show only information about the IPv4 egressIP. The IPv6 port is not even created in OpenStack:

$ oc logs -n openshift-cloud-network-config-controller cloud-network-config-controller-67cbc4bc84-786jm 
I0116 13:15:48.914323       1 controller.go:182] Assigning key: 192.168.192.111 to cloud-private-ip-config workqueue
I0116 13:15:48.928927       1 cloudprivateipconfig_controller.go:357] CloudPrivateIPConfig: "192.168.192.111" will be added to node: "ostest-8zrlf-worker-0-4h78l"
I0116 13:15:48.942260       1 cloudprivateipconfig_controller.go:381] Adding finalizer to CloudPrivateIPConfig: "192.168.192.111"
I0116 13:15:48.943718       1 controller.go:182] Assigning key: 192.168.192.111 to cloud-private-ip-config workqueue
I0116 13:15:49.758484       1 openstack.go:760] Getting port lock for portID 8854b2e9-3139-49d2-82dd-ee576b0a0cce and IP 192.168.192.111
I0116 13:15:50.547268       1 cloudprivateipconfig_controller.go:439] Added IP address to node: "ostest-8zrlf-worker-0-4h78l" for CloudPrivateIPConfig: "192.168.192.111"
I0116 13:15:50.602277       1 controller.go:160] Dropping key '192.168.192.111' from the cloud-private-ip-config workqueue
I0116 13:15:50.614413       1 controller.go:160] Dropping key '192.168.192.111' from the cloud-private-ip-config workqueue

$ openstack port list --network network-dualstack | grep -e 192.168.192.111 -e 6f44:5dd8:c956:f816:3eff:fef0:3333
| 30fe8d9a-c1c6-46c3-a873-9a02e1943cb7 | egressip-192.168.192.111      | fa:16:3e:3c:23:2a | ip_address='192.168.192.111', subnet_id='ae8a4c1f-d3e4-4ea2-bc14-ef1f6f5d0bbe'                         | DOWN   |

Actual results: ipv6 egressIP object is ignored.
Expected results: ipv6 egressIP is created and can be attached to a pod.
Additional info: must-gather linked in private comment.

Story CLID-67: Add --v1 flag

A customer is asking for this flag so they can keep using the v1 code even after v2 becomes the default.

https://github.com/openshift/oc-mirror/pull/806

Bug MGMT-16926: [STG] LVMS Multi Node - lvm requirments success although no attached disk added

Description of the problem:

LVMS on a multi-node cluster requires an additional disk for the operator.

However, I was able to create a 4.15 multi-node cluster, select LVMS, and, without adding the additional disk, see the LVM requirement pass. I was then able to continue and start the installation.

How reproducible:

Steps to reproduce:

1. Create a 4.15 multi-node cluster

2. Select the LVMS operator

3. Do not attach an additional disk

Actual results:

It is possible to continue to the installation page and start the installation.
The LVM requirement is reported as successful.

Expected results:

The LVM requirement should be reported as failed.
It should not be possible to proceed to installation before attaching a disk.
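The expected validation can be sketched as a check that every host has at least one disk besides its installation disk. This is not the assisted-service implementation; the function name and the host data model (an "installation_disk" name plus a "disks" list) are hypothetical.

```python
def lvm_requirement_satisfied(hosts: list) -> bool:
    """Return True only if every host has a disk other than its install disk.

    Hypothetical data model: each host is a dict with an "installation_disk"
    name and a "disks" list of disk names.
    """
    if not hosts:
        return False
    for host in hosts:
        install_disk = host.get("installation_disk")
        extra = [d for d in host.get("disks", []) if d != install_disk]
        if not extra:
            return False
    return True


if __name__ == "__main__":
    hosts = [{"installation_disk": "sda", "disks": ["sda"]}]
    print(lvm_requirement_satisfied(hosts))
```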

https://github.com/openshift/assisted-service/pull/5998

Bug OCPBUGS-21593: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-23891: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/vmware-vsphere-csi-driver/pull/108

Task MON-3699: Unify how fields are hidden from CMO doc

https://github.com/openshift/cluster-monitoring-operator/pull/2247

Bug OCPBUGS-30547: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/installer/pull/8131

Bug OCPBUGS-24722: Update 4.16 golang-github-prometheus-node_exporter-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/node_exporter/pull/140

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/node_exporter/pull/140

Bug OCPBUGS-26114: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/csi-driver-shared-resource/pull/161

Bug OCPBUGS-13726: Hypershift image configuration not working for Hypershift HostedCluster

Description of problem:

Adding image configuration to a HyperShift HostedCluster does not work as expected.

Version-Release number of selected component (if applicable):

# oc get clusterversions.config.openshift.io
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-rc.8   True        False         6h46m   Cluster version is 4.13.0-rc.8

How reproducible:

Always

Steps to Reproduce:

1. Get hypershift hosted cluster detail from management cluster. 

# hostedcluster=$( oc get -n clusters hostedclusters -o json | jq -r '.items[].metadata.name')  

2. Apply image setting for hypershift hosted cluster. 
#  oc patch hc/$hostedcluster -p '{"spec":{"configuration":{"image":{"registrySources":{"allowedRegistries":["quay.io","registry.redhat.io","image-registry.openshift-image-registry.svc:5000","insecure.com"],"insecureRegistries":["insecure.com"]}}}}}' --type=merge -n clusters     
hostedcluster.hypershift.openshift.io/85ea85757a5a14355124 patched 

# oc get HostedCluster $hostedcluster -n clusters -ojson | jq .spec.configuration.image
{
  "registrySources": {
    "allowedRegistries": [
      "quay.io",
      "registry.redhat.io",
      "image-registry.openshift-image-registry.svc:5000",
      "insecure.com"
    ],
    "insecureRegistries": [
      "insecure.com"
    ]
  }
}

3. Check Pod or operator restart to apply configuration changes. 

# oc get pods -l app=kube-apiserver  -n clusters-${hostedcluster}
NAME                              READY   STATUS    RESTARTS   AGE
kube-apiserver-67b6d4556b-9nk8s   5/5     Running   0          49m
kube-apiserver-67b6d4556b-v4fnj   5/5     Running   0          47m
kube-apiserver-67b6d4556b-zldpr   5/5     Running   0          51m

#oc get pods -l app=kube-apiserver  -n clusters-${hostedcluster} -l app=openshift-apiserver
NAME                                   READY   STATUS    RESTARTS   AGE
openshift-apiserver-7c69d68f45-4xj8c   3/3     Running   0          136m
openshift-apiserver-7c69d68f45-dfmk9   3/3     Running   0          135m
openshift-apiserver-7c69d68f45-r7dqn   3/3     Running   0          136m  

4. Check image.config in hosted cluster.
# oc get image.config -o yaml
...
  spec:
    allowedRegistriesForImport: []
  status:
    externalRegistryHostnames:
    - default-route-openshift-image-registry.apps.hypershift-ci-32506.qe.devcluster.openshift.com
    internalRegistryHostname: image-registry.openshift-image-registry.svc:5000  

#oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-128-61.us-east-2.compute.internal    Ready    worker   6h42m   v1.26.3+b404935
ip-10-0-130-68.us-east-2.compute.internal    Ready    worker   6h42m   v1.26.3+b404935
ip-10-0-134-89.us-east-2.compute.internal    Ready    worker   6h42m   v1.26.3+b404935
ip-10-0-138-169.us-east-2.compute.internal   Ready    worker   6h42m   v1.26.3+b404935

# oc debug node/ip-10-0-128-61.us-east-2.compute.internal
Temporary namespace openshift-debug-mtfcw is created for debugging node...
Starting pod/ip-10-0-128-61us-east-2computeinternal-debug-mctvr ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.61
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# cat /etc/containers/registries.conf
unqualified-search-registries = ["registry.access.redhat.com", "docker.io"]
short-name-mode = ""

[[registry]]
  prefix = ""
  location = "registry-proxy.engineering.redhat.com"

  [[registry.mirror]]
    location = "brew.registry.redhat.io"
    pull-from-mirror = "digest-only"

[[registry]]
  prefix = ""
  location = "registry.redhat.io"

  [[registry.mirror]]
    location = "brew.registry.redhat.io"
    pull-from-mirror = "digest-only"

[[registry]]
  prefix = ""
  location = "registry.stage.redhat.io"

  [[registry.mirror]]
    location = "brew.registry.redhat.io"
    pull-from-mirror = "digest-only"

Actual results:

Config changes are not applied in the backend. No operator or pod restarts.

Expected results:

The configuration should be applied, and the pods and operator should restart after the config changes.
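The mismatch shown in steps 2 and 4 can be expressed as a comparison between the registrySources set on the HostedCluster and those present in the guest cluster's image.config spec. A minimal sketch, with a hypothetical function name; it only compares the two dictionaries and does not talk to a cluster.

```python
def unpropagated_registry_sources(hosted_cluster_image: dict, guest_image_spec: dict) -> dict:
    """Return the registrySources entries set on the HostedCluster that are
    missing or different in the guest cluster's image.config spec."""
    expected = hosted_cluster_image.get("registrySources", {})
    actual = guest_image_spec.get("registrySources", {})
    return {key: value for key, value in expected.items() if actual.get(key) != value}


if __name__ == "__main__":
    # Values taken from the command output in the steps above.
    hc_image = {
        "registrySources": {
            "allowedRegistries": [
                "quay.io",
                "registry.redhat.io",
                "image-registry.openshift-image-registry.svc:5000",
                "insecure.com",
            ],
            "insecureRegistries": ["insecure.com"],
        }
    }
    guest_spec = {"allowedRegistriesForImport": []}
    print(sorted(unpropagated_registry_sources(hc_image, guest_spec)))
```

An empty result would indicate that the guest cluster reflects what was set on the HostedCluster.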

Additional info:

https://github.com/openshift/hypershift/pull/3714

Bug OCPBUGS-24935: Update 4.16 ose-baremetal-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/baremetal-operator/pull/327

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/baremetal-operator/pull/327

Bug OCPBUGS-26539: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/router/pull/553

Bug OCPBUGS-25138: Update 4.16 ose-gcp-pd-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/54

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/gcp-pd-csi-driver/pull/54

Bug OCPBUGS-24303: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/installer/pull/7893

Bug OCPBUGS-24867: Update 4.16 ose-aws-ebs-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/246

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/aws-ebs-csi-driver/pull/246

Bug OCPBUGS-27289: metrics-server should handle kubelet server CA rotation

Description of problem

Build02, a years-old cluster currently running 4.15.0-ec.2 with TechPreviewNoUpgrade, has been Available=False for days:

$ oc get -o json clusteroperator monitoring | jq '.status.conditions[] | select(.type == "Available")'
{
  "lastTransitionTime": "2024-01-14T04:09:52Z",
  "message": "UpdatingMetricsServer: reconciling MetricsServer Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/metrics-server: context deadline exceeded",
  "reason": "UpdatingMetricsServerFailed",
  "status": "False",
  "type": "Available"
}

Both pods had been having CA trust issues. We deleted one pod, and its replacement is happy:

$ oc -n openshift-monitoring get -l app.kubernetes.io/component=metrics-server pods
NAME                             READY   STATUS    RESTARTS   AGE
metrics-server-9cc8bfd56-dd5tx   1/1     Running   0          136m
metrics-server-9cc8bfd56-k2lpv   0/1     Running   0          36d

The young, happy pod has occasional node-removed noise, which is expected in this cluster with high levels of compute-node autoscaling:

$ oc -n openshift-monitoring logs --tail 3 metrics-server-9cc8bfd56-dd5tx
E0117 17:16:13.492646       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.33:10250/metrics/resource\": dial tcp 10.0.32.33:10250: connect: connection refused" node="build0-gstfj-ci-builds-worker-b-srjk5"
E0117 17:16:28.611052       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.33:10250/metrics/resource\": dial tcp 10.0.32.33:10250: connect: connection refused" node="build0-gstfj-ci-builds-worker-b-srjk5"
E0117 17:16:56.898453       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.33:10250/metrics/resource\": context deadline exceeded" node="build0-gstfj-ci-builds-worker-b-srjk5"

While the old, sad pod is complaining about unknown authorities:

$ oc -n openshift-monitoring logs --tail 3 metrics-server-9cc8bfd56-k2lpv
E0117 17:19:09.612161       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.0.3:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate signed by unknown authority" node="build0-gstfj-m-2.c.openshift-ci-build-farm.internal"
E0117 17:19:09.620872       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.90:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate signed by unknown authority" node="build0-gstfj-ci-prowjobs-worker-b-cg7qd"
I0117 17:19:14.538837       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"

More details in the Additional details section, but the timeline seems to have been something like:

2023-12-11, metrics-server-* pods come up, and are running happily, scraping kubelets with a CA trust store descended from openshift-config-managed's kubelet-serving-ca ConfigMap.
2024-01-02, a new openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 is created.
2024-01-04, kubelets rotate their serving CA. Not entirely clear how this works yet outside of bootstrapping, but at least for bootstrapping it uses a CertificateSigningRequest, approved by cluster-machine-approver, and signed by the kubernetes.io/kubelet-serving signing component in the kube-controller-manager-* pods in the openshift-kube-controller-manager namespace.
2024-01-04, the csr-signer Secret in openshift-kube-controller-manager has the new openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 issuing a certificate for kube-csr-signer_@1704338196.
The kubelet-serving-ca ConfigMap gets updated to include a CA for the new kube-csr-signer_@1704338196, signed by the new openshift-kube-controller-manager-operator_csr-signer-signer@1704206554.
Local /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt updated in metrics-server-* containers.
But metrics-server-* pods fail to notice the file change and reload /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt, so the existing pods do not trust the new kubelet server certs.
Mysterious time delay. Perhaps the monitoring operator does not notice sad metrics-server-* pods outside of things that trigger DeploymentRollout?
2024-01-14, monitoring ClusterOperator goes Available=False on UpdatingMetricsServerFailed.
2024-01-17, deleting one metrics-server-* pod triggers replacement-pod creation, and the replacement pod comes up fine.

So addressing the metrics-server /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt change detection should resolve this use-case. And triggering a container or pod restart would be an aggressive-but-sufficient mechanism, although loading the new data without rolling the process would be less invasive.
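
The change detection discussed above could look roughly like the following Go sketch. It is an illustration under stated assumptions, not the metrics-server implementation: bundleWatcher, update and Reload are hypothetical names, and the sketch simply compares a SHA-256 digest of the bundle and rebuilds the trust pool only when the contents differ.

```go
package main

import (
	"crypto/sha256"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// bundleWatcher (hypothetical) remembers the digest of the last CA bundle it
// applied so that it can tell when the file on disk has changed.
type bundleWatcher struct {
	path   string
	loaded bool
	digest [sha256.Size]byte
	apply  func(pemData []byte) error // called only when the contents change
}

// update applies data if it differs from the last applied bundle and
// reports whether a change was applied.
func (w *bundleWatcher) update(data []byte) (bool, error) {
	sum := sha256.Sum256(data)
	if w.loaded && sum == w.digest {
		return false, nil // same contents, nothing to do
	}
	if err := w.apply(data); err != nil {
		return false, err // keep the old digest so the next call retries
	}
	w.digest, w.loaded = sum, true
	return true, nil
}

// Reload re-reads the bundle from disk; a caller would run it on a timer
// or in response to a file-system notification.
func (w *bundleWatcher) Reload() (bool, error) {
	data, err := os.ReadFile(w.path)
	if err != nil {
		return false, err
	}
	return w.update(data)
}

func main() {
	var pool *x509.CertPool
	w := &bundleWatcher{
		path: "/etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt", // path from the report
		apply: func(pemData []byte) error {
			p := x509.NewCertPool()
			if !p.AppendCertsFromPEM(pemData) {
				return errors.New("no certificates found in bundle")
			}
			pool = p // swap in the new trust store
			return nil
		},
	}
	changed, err := w.Reload()
	fmt.Println("changed:", changed, "pool loaded:", pool != nil, "err:", err)
}
```

Swapping the pool in place corresponds to the less invasive option mentioned above; restarting the container on a detected change would be the aggressive alternative.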

Version-Release number of selected component (if applicable)

4.15.0-ec.3, which has fast CA rotation, see discussion in API-1687.

How reproducible

Unclear.

Steps to Reproduce

Unclear.

Actual results

metrics-server pods having trouble with CA trust when attempting to scrape nodes.

Expected results

metrics-server pods successfully trusting kubelets when scraping nodes.

Additional details

The monitoring operator sets up the metrics server with --kubelet-certificate-authority=/etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt, which is the "Path to the CA to use to validate the Kubelet's serving certificates" and is mounted from the kubelet-serving-ca-bundle ConfigMap. But that mount point only contains openshift-kube-controller-manager-operator_csr-signer-signer@... CAs:

$ oc --as system:admin -n openshift-monitoring debug pod/metrics-server-9cc8bfd56-k2lpv -- cat /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt | while openssl x509 -noout -text; do :; done | grep '^Certificate:\|Issuer\|Subject:\|Not '
Starting pod/metrics-server-9cc8bfd56-k2lpv-debug-gtctn ...

Removing debug pod ...
Certificate:
        Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554
            Not Before: Dec  3 14:42:33 2023 GMT
            Not After : Feb  1 14:42:34 2024 GMT
        Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554
Certificate:
        Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554
            Not Before: Dec 20 03:16:35 2023 GMT
            Not After : Jan 19 03:16:36 2024 GMT
        Subject: CN = kube-csr-signer_@1703042196
Certificate:
        Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554
            Not Before: Jan  4 03:16:35 2024 GMT
            Not After : Feb  3 03:16:36 2024 GMT
        Subject: CN = kube-csr-signer_@1704338196
Certificate:
        Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554
            Not Before: Jan  2 14:42:34 2024 GMT
            Not After : Mar  2 14:42:35 2024 GMT
        Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554
unable to load certificate
137730753918272:error:0909006C:PEM routines:get_name:no start line:../crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE

While actual kubelets seem to be using certs signed by kube-csr-signer_@1704338196 (which is one of the Subjects in /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt):

$ oc get -o wide -l node-role.kubernetes.io/master= nodes
NAME                                                  STATUS   ROLES    AGE      VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
build0-gstfj-m-0.c.openshift-ci-build-farm.internal   Ready    master   3y240d   v1.28.3+20a5764   10.0.0.4      <none>        Red Hat Enterprise Linux CoreOS 415.92.202311271112-0 (Plow)   5.14.0-284.41.1.el9_2.x86_64   cri-o://1.28.2-2.rhaos4.15.gite7be4e1.el9
build0-gstfj-m-1.c.openshift-ci-build-farm.internal   Ready    master   3y240d   v1.28.3+20a5764   10.0.0.5      <none>        Red Hat Enterprise Linux CoreOS 415.92.202311271112-0 (Plow)   5.14.0-284.41.1.el9_2.x86_64   cri-o://1.28.2-2.rhaos4.15.gite7be4e1.el9
build0-gstfj-m-2.c.openshift-ci-build-farm.internal   Ready    master   3y240d   v1.28.3+20a5764   10.0.0.3      <none>        Red Hat Enterprise Linux CoreOS 415.92.202311271112-0 (Plow)   5.14.0-284.41.1.el9_2.x86_64   cri-o://1.28.2-2.rhaos4.15.gite7be4e1.el9
$ oc --as system:admin -n openshift-monitoring debug pod/metrics-server-9cc8bfd56-k2lpv -- openssl s_client -connect 10.0.0.3:10250 -showcerts </dev/null
Starting pod/metrics-server-9cc8bfd56-k2lpv-debug-ksl2k ...
Can't use SSL_get_servername
depth=0 O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal
verify error:num=21:unable to verify the first certificate
verify return:1
depth=0 O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal
verify return:1
CONNECTED(00000003)
---
Certificate chain
 0 s:O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal
   i:CN = kube-csr-signer_@1704338196
-----BEGIN CERTIFICATE-----
MIIC5DCCAcygAwIBAgIQAbKVl+GS6s2H20EHAWl4WzANBgkqhkiG9w0BAQsFADAm
MSQwIgYDVQQDDBtrdWJlLWNzci1zaWduZXJfQDE3MDQzMzgxOTYwHhcNMjQwMTE3
MDMxNDMwWhcNMjQwMjAzMDMxNjM2WjBhMRUwEwYDVQQKEwxzeXN0ZW06bm9kZXMx
SDBGBgNVBAMTP3N5c3RlbTpub2RlOmJ1aWxkMC1nc3Rmai1tLTIuYy5vcGVuc2hp
ZnQtY2ktYnVpbGQtZmFybS5pbnRlcm5hbDBZMBMGByqGSM49AgEGCCqGSM49AwEH
A0IABFqT+UgohFAxJrGYQUeYsEhNB+ufFo14xYDedKBCeNzMhaC+5/I4UN1e1u2X
PH7J4ncmH+M/LXI7v+YfEIG7cH+jgZ0wgZowDgYDVR0PAQH/BAQDAgeAMBMGA1Ud
JQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwHwYDVR0jBBgwFoAU394ABuS2
9i0qss9AKk/mQ9lhJ88wRAYDVR0RBD0wO4IzYnVpbGQwLWdzdGZqLW0tMi5jLm9w
ZW5zaGlmdC1jaS1idWlsZC1mYXJtLmludGVybmFshwQKAAADMA0GCSqGSIb3DQEB
CwUAA4IBAQCiKelqlgK0OHFqDPdIR+RRdjXoCfFDa0JGCG0z60LYJV6Of5EPv0F/
vGZdM/TyGnPT80lnLCh2JGUvneWlzQEZ7LEOgXX8OrAobijiFqDZFlvVwvkwWNON
rfucLQWDFLHUf/yY0EfB0ZlM8Sz4XE8PYB6BXYvgmUIXS1qkV9eGWa6RPLsOnkkb
q/dTLE/tg8cz24IooDC8lmMt/wCBPgsq9AnORgNdZUdjCdh9DpDWCw0E4csSxlx2
H1qlH5TpTGKS8Ox9JAfdAU05p/mEhY9PEPSMfdvBZep1xazrZyQIN9ckR2+11Syw
JlbEJmapdSjIzuuKBakqHkDgoq4XN0KM
-----END CERTIFICATE-----
---
Server certificate
subject=O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal

issuer=CN = kube-csr-signer_@1704338196

---
Acceptable client certificate CA names
OU = openshift, CN = admin-kubeconfig-signer
CN = openshift-kube-controller-manager-operator_csr-signer-signer@1699022534
CN = kube-csr-signer_@1700450189
CN = kube-csr-signer_@1701746196
CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554
CN = openshift-kube-apiserver-operator_kube-apiserver-to-kubelet-signer@1691004449
CN = openshift-kube-apiserver-operator_kube-control-plane-signer@1702234292
CN = openshift-kube-apiserver-operator_kube-control-plane-signer@1699642292
OU = openshift, CN = kubelet-bootstrap-kubeconfig-signer
CN = openshift-kube-apiserver-operator_node-system-admin-signer@1678905372
Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512:RSA+SHA1:ECDSA+SHA1
Shared Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512
Peer signing digest: SHA256
Peer signature type: ECDSA
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 1902 bytes and written 383 bytes
Verification error: unable to verify the first certificate
---
New, TLSv1.3, Cipher is TLS_AES_128_GCM_SHA256
Server public key is 256 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 21 (unable to verify the first certificate)
---
DONE

Removing debug pod ...
$ openssl x509 -noout -text <<EOF 2>/dev/null
> -----BEGIN CERTIFICATE-----
MIIC5DCCAcygAwIBAgIQAbKVl+GS6s2H20EHAWl4WzANBgkqhkiG9w0BAQsFADAm
MSQwIgYDVQQDDBtrdWJlLWNzci1zaWduZXJfQDE3MDQzMzgxOTYwHhcNMjQwMTE3
MDMxNDMwWhcNMjQwMjAzMDMxNjM2WjBhMRUwEwYDVQQKEwxzeXN0ZW06bm9kZXMx
SDBGBgNVBAMTP3N5c3RlbTpub2RlOmJ1aWxkMC1nc3Rmai1tLTIuYy5vcGVuc2hp
ZnQtY2ktYnVpbGQtZmFybS5pbnRlcm5hbDBZMBMGByqGSM49AgEGCCqGSM49AwEH
A0IABFqT+UgohFAxJrGYQUeYsEhNB+ufFo14xYDedKBCeNzMhaC+5/I4UN1e1u2X
PH7J4ncmH+M/LXI7v+YfEIG7cH+jgZ0wgZowDgYDVR0PAQH/BAQDAgeAMBMGA1Ud
JQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwHwYDVR0jBBgwFoAU394ABuS2
9i0qss9AKk/mQ9lhJ88wRAYDVR0RBD0wO4IzYnVpbGQwLWdzdGZqLW0tMi5jLm9w
ZW5zaGlmdC1jaS1idWlsZC1mYXJtLmludGVybmFshwQKAAADMA0GCSqGSIb3DQEB
CwUAA4IBAQCiKelqlgK0OHFqDPdIR+RRdjXoCfFDa0JGCG0z60LYJV6Of5EPv0F/
vGZdM/TyGnPT80lnLCh2JGUvneWlzQEZ7LEOgXX8OrAobijiFqDZFlvVwvkwWNON
rfucLQWDFLHUf/yY0EfB0ZlM8Sz4XE8PYB6BXYvgmUIXS1qkV9eGWa6RPLsOnkkb
q/dTLE/tg8cz24IooDC8lmMt/wCBPgsq9AnORgNdZUdjCdh9DpDWCw0E4csSxlx2
H1qlH5TpTGKS8Ox9JAfdAU05p/mEhY9PEPSMfdvBZep1xazrZyQIN9ckR2+11Syw
JlbEJmapdSjIzuuKBakqHkDgoq4XN0KM
-----END CERTIFICATE-----
> EOF
...
        Issuer: CN = kube-csr-signer_@1704338196
        Validity
            Not Before: Jan 17 03:14:30 2024 GMT
            Not After : Feb  3 03:16:36 2024 GMT
        Subject: O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal
...
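
The verification failure in the s_client output above is what any X.509 verifier reports when the issuing CA is absent from the trust pool it holds in memory. The following Go sketch reproduces that situation with throwaway certificates. Everything in it is illustrative: newCert and trusts are hypothetical helpers, and the common names merely stand in for the old and new signers in the report.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCert issues a certificate for cn. When parent is nil the certificate is
// a self-signed CA; otherwise it is a serving leaf signed by the parent.
func newCert(cn string, serial int64, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
	}
	signer, signerKey := parent, parentKey
	if parent == nil {
		tmpl.IsCA = true
		tmpl.BasicConstraintsValid = true
		tmpl.KeyUsage = x509.KeyUsageCertSign
		signer, signerKey = tmpl, key // self-signed
	} else {
		tmpl.KeyUsage = x509.KeyUsageDigitalSignature
		tmpl.ExtKeyUsage = []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth}
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, signer, &key.PublicKey, signerKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// trusts returns nil when leaf chains to one of the given CAs.
func trusts(leaf *x509.Certificate, cas ...*x509.Certificate) error {
	pool := x509.NewCertPool()
	for _, ca := range cas {
		pool.AddCert(ca)
	}
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	})
	return err
}

func main() {
	oldCA, _, err := newCert("old-signer", 1, nil, nil)
	if err != nil {
		panic(err)
	}
	newCA, newKey, err := newCert("new-signer", 2, nil, nil)
	if err != nil {
		panic(err)
	}
	leaf, _, err := newCert("system:node:example", 3, newCA, newKey)
	if err != nil {
		panic(err)
	}
	// A process that loaded its pool before the rotation only knows oldCA.
	fmt.Println("stale pool:    ", trusts(leaf, oldCA))
	fmt.Println("refreshed pool:", trusts(leaf, oldCA, newCA))
}
```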

The monitoring operator populates the openshift-monitoring kubelet-serving-ca-bundle ConfigMap using data from the openshift-config-managed kubelet-serving-ca ConfigMap, and that propagation is working, but does not contain the kube-csr-signer_ CA:

$ oc -n openshift-config-managed get -o json configmap kubelet-serving-ca | jq -r '.data["ca-bundle.crt"]' | while openssl x509 -noout -text; do :; done | grep '^Certificate:\|Issuer\|Subject:\|Not '
Certificate:
        Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554
            Not Before: Dec  3 14:42:33 2023 GMT
            Not After : Feb  1 14:42:34 2024 GMT
        Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554
Certificate:
        Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554
            Not Before: Dec 20 03:16:35 2023 GMT
            Not After : Jan 19 03:16:36 2024 GMT
        Subject: CN = kube-csr-signer_@1703042196
Certificate:
        Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554
            Not Before: Jan  4 03:16:35 2024 GMT
            Not After : Feb  3 03:16:36 2024 GMT
        Subject: CN = kube-csr-signer_@1704338196
Certificate:
        Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554
            Not Before: Jan  2 14:42:34 2024 GMT
            Not After : Mar  2 14:42:35 2024 GMT
        Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554
unable to load certificate
140531510617408:error:0909006C:PEM routines:get_name:no start line:../crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE
$ oc -n openshift-config-managed get -o json configmap kubelet-serving-ca | jq -r '.data["ca-bundle.crt"]' | sha1sum 
a32ab44dff8030c548087d70fea599b0d3fab8af  -
$ oc -n openshift-monitoring get -o json configmap kubelet-serving-ca-bundle | jq -r '.data["ca-bundle.crt"]' | sha1sum 
a32ab44dff8030c548087d70fea599b0d3fab8af  -

Flipping over to the kubelet side, nothing in the machine-config operator's template is jumping out at me as a key/cert pair for serving on 10250. The kubelet seems to set up server certs via serverTLSBootstrap: true. But we don't seem to set the beta RotateKubeletServerCertificate, so I'm not clear on how these are supposed to rotate on the kubelet side. But there are CSRs from kubelets requesting serving certs:

$ oc get certificatesigningrequests | grep 'NAME\|kubelet-serving'
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   REQUESTEDDURATION   CONDITION
csr-8stgd   51m     kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-builds-worker-b-xkdw2                           <none>              Approved,Issued
csr-blbjx   9m1s    kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-longtests-worker-b-5w9dz                        <none>              Approved,Issued
csr-ghxh5   64m     kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-builds-worker-b-sdwdn                           <none>              Approved,Issued
csr-hng85   33m     kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-longtests-worker-d-7d7h2                        <none>              Approved,Issued
csr-hvqxz   24m     kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-builds-worker-b-fp6wb                           <none>              Approved,Issued
csr-vc52m   50m     kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-builds-worker-b-xlmt6                           <none>              Approved,Issued
csr-vflcm   40m     kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-builds-worker-b-djpgq                           <none>              Approved,Issued
csr-xfr7d   51m     kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-builds-worker-b-8v4vk                           <none>              Approved,Issued
csr-zhzbs   51m     kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-builds-worker-b-rqr68                           <none>              Approved,Issued
$ oc get -o json certificatesigningrequests csr-blbjx
{
    "apiVersion": "certificates.k8s.io/v1",
    "kind": "CertificateSigningRequest",
    "metadata": {
        "creationTimestamp": "2024-01-17T19:20:43Z",
        "generateName": "csr-",
        "name": "csr-blbjx",
        "resourceVersion": "4719586144",
        "uid": "5f12d236-3472-485f-8037-3896f51a809c"
    },
    "spec": {
        "groups": [
            "system:nodes",
            "system:authenticated"
        ],
        "request": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQlh6Q0NBUVFDQVFBd1ZqRVZNQk1HQTFVRUNoTU1jM2x6ZEdWdE9tNXZaR1Z6TVQwd093WURWUVFERXpSegplWE4wWlcwNmJtOWtaVHBpZFdsc1pEQXRaM04wWm1vdFkya3RiRzl1WjNSbGMzUnpMWGR2Y210bGNpMWlMVFYzCk9XUjZNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUV5Y0dhSDMvZ3F4ZHNZWkdmQXovTEpoZVgKd1o0Z1VRbjB6TlZUenJncHpvd1VPOGR6NTN4UUZTOTRibm40NldlZFg3Q2xidUpVSUpUN2pCblV1WEdnZktCTQpNRW9HQ1NxR1NJYjNEUUVKRGpFOU1Ec3dPUVlEVlIwUkJESXdNSUlvWW5WcGJHUXdMV2R6ZEdacUxXTnBMV3h2CmJtZDBaWE4wY3kxM2IzSnJaWEl0WWkwMWR6bGtlb2NFQ2dBZ0F6QUtCZ2dxaGtqT1BRUURBZ05KQURCR0FpRUEKMHlRVzZQOGtkeWw5ZEEzM3ppQTJjYXVJdlhidTVhczNXcUZLYWN2bi9NSUNJUURycEQyVEtScHJOU1I5dExKTQpjZ0ZpajN1dVNieVJBcEJ5NEE1QldEZm02UT09Ci0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=",
        "signerName": "kubernetes.io/kubelet-serving",
        "usages": [
            "digital signature",
            "server auth"
        ],
        "username": "system:node:build0-gstfj-ci-longtests-worker-b-5w9dz"
    },
    "status": {
        "certificate": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUN6ekNDQWJlZ0F3SUJBZ0lSQUlGZ1NUd0ovVUJLaE1hWlE4V01KcEl3RFFZSktvWklodmNOQVFFTEJRQXcKSmpFa01DSUdBMVVFQXd3YmEzVmlaUzFqYzNJdGMybG5ibVZ5WDBBeE56QTBNek00TVRrMk1CNFhEVEkwTURFeApOekU1TVRVME0xb1hEVEkwTURJd016QXpNVFl6Tmxvd1ZqRVZNQk1HQTFVRUNoTU1jM2x6ZEdWdE9tNXZaR1Z6Ck1UMHdPd1lEVlFRREV6UnplWE4wWlcwNmJtOWtaVHBpZFdsc1pEQXRaM04wWm1vdFkya3RiRzl1WjNSbGMzUnoKTFhkdmNtdGxjaTFpTFRWM09XUjZNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUV5Y0dhSDMvZwpxeGRzWVpHZkF6L0xKaGVYd1o0Z1VRbjB6TlZUenJncHpvd1VPOGR6NTN4UUZTOTRibm40NldlZFg3Q2xidUpVCklKVDdqQm5VdVhHZ2ZLT0JrakNCanpBT0JnTlZIUThCQWY4RUJBTUNCNEF3RXdZRFZSMGxCQXd3Q2dZSUt3WUIKQlFVSEF3RXdEQVlEVlIwVEFRSC9CQUl3QURBZkJnTlZIU01FR0RBV2dCVGYzZ0FHNUxiMkxTcXl6MEFxVCtaRAoyV0VuenpBNUJnTlZIUkVFTWpBd2dpaGlkV2xzWkRBdFozTjBabW90WTJrdGJHOXVaM1JsYzNSekxYZHZjbXRsCmNpMWlMVFYzT1dSNmh3UUtBQ0FETUEwR0NTcUdTSWIzRFFFQkN3VUFBNElCQVFBRE5ad0pMdkp4WWNta2RHV08KUm5ocC9rc3V6akJHQnVHbC9VTmF0RjZScml3eW9mdmpVNW5Kb0RFbGlLeHlDQ2wyL1d5VXl5a2hMSElBK1drOQoxZjRWajIrYmZFd0IwaGpuTndxQThudFFabS90TDhwalZ5ZzFXM0VwR2FvRjNsZzRybDA1cXBwcjVuM2l4WURJClFFY2ZuNmhQUnlKN056dlFCS0RwQ09lbU8yTFllcGhqbWZGY2h5VGRZVGU0aE9IOW9TWTNMdDdwQURIM2kzYzYKK3hpMDhhV09LZmhvT3IybTVBSFBVN0FkTjhpVUV0M0dsYzI0SGRTLzlLT05tT2E5RDBSSk9DMC8zWk5sKzcvNAoyZDlZbnYwaTZNaWI3OGxhNk5scFB0L2hmOWo5TlNnMDN4OFZYRVFtV21zN29xY1FWTHMxRHMvWVJ4VERqZFphCnEwMnIKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=",
        "conditions": [
            {
                "lastTransitionTime": "2024-01-17T19:20:43Z",
                "lastUpdateTime": "2024-01-17T19:20:43Z",
                "message": "This CSR was approved by the Node CSR Approver (cluster-machine-approver)",
                "reason": "NodeCSRApprove",
                "status": "True",
                "type": "Approved"
            }
        ]
    }
}
$ oc get -o json certificatesigningrequests csr-blbjx | jq -r '.status.certificate | @base64d' | openssl x509 -noout -text | grep '^Certificate:\|Issuer\|Subject:\|Not '
Certificate:
        Issuer: CN = kube-csr-signer_@1704338196
            Not Before: Jan 17 19:15:43 2024 GMT
            Not After : Feb  3 03:16:36 2024 GMT
        Subject: O = system:nodes, CN = system:node:build0-gstfj-ci-longtests-worker-b-5w9dz

So that's approved by cluster-machine-approver, but signerName: kubernetes.io/kubelet-serving is an upstream Kubernetes component documented here, and the signer is implemented by kube-controller-manager.

Bug OCPBUGS-29039: default value of option parallelism cannot be parsed to int

Description of problem:

default value of option --parallelism cannot be parsed to int

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2024-02-02-002725

How reproducible:

    reproduce with cmd copy-to-node

Steps to Reproduce:

    Cmd: "oc --namespace=e2e-test-mco-4zb88 --kubeconfig=/tmp/kubeconfig-3071675436 adm copy-to-node node/ip-10-0-17-85.ec2.internal --copy=/tmp/fetch-w637bgyv=/etc/mco-compressed-test-file",
            StdErr: "error: --parallelism must be either N or N%: strconv.ParseInt: parsing \"10%%\": invalid syntax",

Actual results:

default value of --parallelism cannot be parsed

Expected results:

no error

Additional info:

There is hack code that appends % to the default value:

ref: https://github.com/openshift/oc/blob/79dc671bdaeafa74b92f14ad9f6d84e344608034/pkg/cli/admin/pernodepod/command.go#L75-L79

https://github.com/openshift/oc/blob/79dc671bdaeafa74b92f14ad9f6d84e344608034/pkg/cli/admin/pernodepod/command.go#L94

The error variable percentParseErr should be used at that line.
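
The "N or N%" parsing that the error message describes can be sketched as follows. This is an illustration, not the code in openshift/oc: parseParallelism is a hypothetical name, and the sketch only shows why a value such as "10%%" (a default that already ends in % with a second % appended) fails integer parsing.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseParallelism (hypothetical) accepts "N" or "N%" and returns the number
// together with a flag saying whether it is a percentage.
func parseParallelism(value string) (n int, percent bool, err error) {
	digits := value
	if strings.HasSuffix(value, "%") {
		percent = true
		digits = strings.TrimSuffix(value, "%") // removes exactly one "%"
	}
	n, err = strconv.Atoi(digits)
	if err != nil {
		return 0, false, fmt.Errorf("--parallelism must be either N or N%%: %w", err)
	}
	return n, percent, nil
}

func main() {
	for _, v := range []string{"4", "10%", "10%%"} {
		n, percent, err := parseParallelism(v)
		fmt.Printf("%q -> n=%d percent=%v err=%v\n", v, n, percent, err)
	}
}
```

Stripping a single trailing % and parsing the remainder keeps both accepted forms on one code path, so a doubled suffix surfaces as an ordinary parse error.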

https://github.com/openshift/oc/pull/1679

Bug OCPBUGS-18699: Web Console Shows Non-printable file detected

Description of problem:

The OpenShift console shows "Info alert:Non-printable file detected. File contains non-printable characters. Preview is not available." when editing a ConfigMap created from an XML file.

Steps to Reproduce:

1. Create configmap from file:
# oc create cm test-cm --from-file=server.xml=server.xml
configmap/test-cm created

2. If we try to edit the configmap in the OCP console we see the following error:

Info alert:Non-printable file detected.
File contains non-printable characters. Preview is not available.

https://github.com/openshift/console/pull/13496

Bug OCPBUGS-24226: setting TLSSecurityProfile with no minTLSVersion crashes controller

Maxim Patlasov pointed this out in STOR-1453, but we still somehow missed it. I tested this on 4.15.0-0.ci-2023-11-29-021749.

It is possible to set a custom TLSSecurityProfile without minTLSVersion:

$ oc edit apiserver cluster
...
spec:
  tlsSecurityProfile:
    type: Custom
    custom:
      ciphers:
      - ECDHE-ECDSA-CHACHA20-POLY1305
      - ECDHE-ECDSA-AES128-GCM-SHA256

This causes the controller to crash loop:

$ oc get pods -n openshift-cluster-csi-drivers
NAME                                             READY   STATUS             RESTARTS       AGE
aws-ebs-csi-driver-controller-589c44468b-gjrs2   6/11    CrashLoopBackOff   10 (18s ago)   37s
...

because the `${TLS_MIN_VERSION}` placeholder is never replaced:

- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}

The observed config in the ClusterCSIDriver shows an empty string:

$ oc get clustercsidriver ebs.csi.aws.com -o json | jq .spec.observedConfig
{
  "targetcsiconfig": {
    "servingInfo": {
      "cipherSuites": [
        "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256",
        "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256"
      ],
      "minTLSVersion": ""
    }
  }
}

which means minTLSVersion is empty when we get to this line, and the string replacement is not done:

https://github.com/openshift/library-go/blob/c7f15dcc10f5d0b89e8f4c5d50cd313ae158de20/pkg/operator/csi/csidrivercontrollerservicecontroller/helpers.go#L234

So it seems we have a couple of options:

1) completely omit the --tls-min-version arg if minTLSVersion is empty, or
2) set --tls-min-version to the same default value we would use if TLSSecurityProfile is not present in the apiserver object
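
Both options can be sketched in one small Go function. This is an illustration under stated assumptions, not the library-go code: fillTLSMinVersion is a hypothetical name, "--v=2" is an illustrative extra argument, and "VersionTLS12" stands in for whatever default the operator would apply when no TLSSecurityProfile is set.

```go
package main

import (
	"fmt"
	"strings"
)

// placeholder is the token left unreplaced in the crashing containers.
const placeholder = "${TLS_MIN_VERSION}"

// fillTLSMinVersion replaces the placeholder in every argument. When
// minTLSVersion is empty it uses fallback (option 2); when fallback is also
// empty, arguments containing the placeholder are dropped (option 1).
func fillTLSMinVersion(args []string, minTLSVersion, fallback string) []string {
	value := minTLSVersion
	if value == "" {
		value = fallback
	}
	out := make([]string, 0, len(args))
	for _, arg := range args {
		if !strings.Contains(arg, placeholder) {
			out = append(out, arg)
			continue
		}
		if value == "" {
			continue // option 1: omit the flag entirely
		}
		out = append(out, strings.ReplaceAll(arg, placeholder, value))
	}
	return out
}

func main() {
	args := []string{"--v=2", "--tls-min-version=" + placeholder}
	fmt.Println(fillTLSMinVersion(args, "", ""))             // option 1
	fmt.Println(fillTLSMinVersion(args, "", "VersionTLS12")) // option 2, assumed default
}
```

Either way, the literal placeholder never reaches the container command line.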

Bug OCPBUGS-24969: Update 4.16 ose-kube-storage-version-migrator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/202

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/202

Bug OCPBUGS-25676: monitoring ClusterOperator should better handle timeouts

Description of problem:

The monitoring operator may be down or disabled, and the components it manages may be unavailable or degraded.
On a quick check, I noticed an error:

oc get co -o json | jq -r '.items[].status | select (.conditions) '.conditions | jq -r '.[] | select( (.type == "Degraded") and (.status == "True") )'

    {
      "lastTransitionTime": "2023-12-19T10:25:24Z",
      "message": "syncing Thanos Querier trusted CA bundle ConfigMap failed: reconciling trusted CA bundle ConfigMap failed: updating ConfigMap object failed: Timeout: request did not complete within requested timeout - context deadline exceeded, syncing Thanos Querier trusted CA bundle ConfigMap failed: deleting old trusted CA bundle configmaps failed: error listing configmaps in namespace openshift-monitoring with label selector monitoring.openshift.io/name=alertmanager,monitoring.openshift.io/hash!=2ua4n9ob5qr8o: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps), syncing Prometheus trusted CA bundle ConfigMap failed: deleting old trusted CA bundle configmaps failed: error listing configmaps in namespace openshift-monitoring with label selector monitoring.openshift.io/name=prometheus,monitoring.openshift.io/hash!=2ua4n9ob5qr8o: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)",
      "reason": "MultipleTasksFailed",
      "status": "True",
      "type": "Degraded"
    }

i.e. updating ConfigMap object failed: Timeout: request did not complete within requested timeout - context deadline exceeded

I ran oc get co again and everything looked fine. It seems this timeout condition could be handled better to avoid alerting SRE.

Actual results:

operator degraded

Expected results:

operator retries operation
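
The expected retry behaviour might be sketched like this in Go. It is an assumption-laden illustration rather than the cluster-monitoring-operator change: retryOnTimeout and failNTimes are hypothetical helpers, and treating only context.DeadlineExceeded as transient is a simplification of how a real operator would classify API errors.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryOnTimeout runs op up to attempts times, retrying only when the error
// wraps context.DeadlineExceeded. Other errors are returned immediately.
func retryOnTimeout(attempts int, delay time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil || !errors.Is(err, context.DeadlineExceeded) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2 // simple exponential backoff
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// failNTimes returns an operation that times out n times and then succeeds;
// it exists only to exercise the retry loop.
func failNTimes(n int) func() error {
	return func() error {
		if n > 0 {
			n--
			return fmt.Errorf("updating ConfigMap object failed: %w", context.DeadlineExceeded)
		}
		return nil
	}
}

func main() {
	err := retryOnTimeout(3, 10*time.Millisecond, failNTimes(2))
	fmt.Println("result:", err)
}
```

With a loop like this, a single slow API response would be absorbed by a retry, and only a persistent timeout would surface as a Degraded condition.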

https://github.com/openshift/cluster-monitoring-operator/pull/2219

Bug OCPBUGS-25584: Update 4.16 ose-vsphere-cloud-controller-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cloud-provider-vsphere/pull/60

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-provider-vsphere/pull/60

Bug OCPBUGS-25687: [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility setup fails frequently for OpenStack CSI jobs

Description of problem:

The [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility setup test is a frequent offender in the OpenStack CSI jobs. We're seeing it fail on 4.14 up to 4.16.

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.14-e2e-openstack-csi-cinder

Example of failed job.
Example of successful job.

It seems like the 1 min timeout is too short and does not give enough time for the pods backing the service to come up.

https://github.com/openshift/origin/blob/1e6c579/pkg/monitortests/network/disruptionpodnetwork/monitortest.go#L191

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

internal CI failure
customer issue / SD
internal RedHat testing failure

If it is an internal RedHat testing failure:

Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
If it's a connectivity issue,
What is the srcNode, srcIP and srcNamespace and srcPodName?
What is the dstNode, dstIP and dstNamespace and dstPodName?
What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

Provide enough information in the bug description that Engineering doesn't need to read the entire case history.
Don't presume that Engineering has access to Salesforce.
Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment. The format should be: https://access.redhat.com/support/cases/#/case/<case number>/discussion?attachmentId=<attachment id>
Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
- If the issue is in a customer namespace then provide a namespace inspect.
- If it is a connectivity issue:
  - What is the srcNode, srcNamespace, srcPodName and srcPodIP?
  - What is the dstNode, dstNamespace, dstPodName and dstPodIP?
  - What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
  - Please provide the UTC timestamp of the networking outage window from must-gather
  - Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
- If it is not a connectivity issue:
  - Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure, etc.) and the actual component where the issue was seen, based on the attached must-gather. Please attach snippets of relevant logs, if any, from around the window when the problem happened.

For OCPBUGS in which the issue has been identified, label with "sbr-triaged"
For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with "sbr-untriaged"
Note: bugs that do not meet these minimum standards will be closed with label "SDN-Jira-template"

https://github.com/openshift/origin/pull/28474

Bug OCPBUGS-27159: Storage is Progressing when it can't connect to vCenter

This is a continuation of ~~OCPBUGS-23342~~; now the vmware-vsphere-csi-driver-operator cannot connect to vCenter at all. Tested using invalid credentials.

The operator ends up with no Progressing condition during upgrade from 4.11 to 4.12, and cluster-storage-operator interprets it as Progressing=true.
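
The interpretation problem described above can be sketched as follows. This is only an illustration in Python, not the cluster-storage-operator code: the function names and the condition dictionaries are hypothetical, and the "Unknown" variant is just one possible alternative, not necessarily the fix that was merged.

```python
from typing import Optional


def find_condition(conditions: list, cond_type: str) -> Optional[dict]:
    """Return the first condition of the given type, or None if it is absent."""
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond
    return None


def progressing_missing_means_true(conditions: list) -> bool:
    """Behaviour described in the bug: an absent condition counts as Progressing."""
    cond = find_condition(conditions, "Progressing")
    return cond is None or cond.get("status") == "True"


def progressing_status(conditions: list) -> str:
    """Alternative: report an absent condition explicitly as 'Unknown'."""
    cond = find_condition(conditions, "Progressing")
    return "Unknown" if cond is None else cond.get("status", "Unknown")


# Hypothetical operator status: vCenter is unreachable and no Progressing
# condition was published at all.
no_progressing = [{"type": "Available", "status": "False"}]
print(progressing_missing_means_true(no_progressing))
print(progressing_status(no_progressing))
```

The first function reports the operator as progressing even though it never said so; the second keeps "no information" distinguishable from "still rolling out".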

Bug OCPBUGS-23809: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/csi-driver-manila-operator/pull/220

Bug OCPBUGS-26088: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/route-controller-manager/pull/37

Bug OCPBUGS-27927: Update 4.16 ose-openshift-apiserver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/openshift-apiserver/pull/415

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/openshift-apiserver/pull/415

Bug OCPBUGS-28939: ART requests updates to 4.16 image ose-gcp-pd-csi-driver-operator-container

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/117

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/gcp-pd-csi-driver-operator/pull/117

Bug OCPBUGS-25562: Update 4.16 ose-ibm-cloud-controller-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/62

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-provider-ibm/pull/62

Bug OCPBUGS-25582: Update 4.16 ose-cluster-machine-approver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/223

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-machine-approver/pull/223

Bug OCPBUGS-30469: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug MGMT-16757: [STG][BE][vSphere] CNV should be disabled when select platform vSphere

Description of the problem:

Per the latest decision, Red Hat is not going to support installing an OCP cluster on vSphere with nested virtualization. Thus the "Install OpenShift Virtualization" check box on the "Operators" page should be disabled when the "vSphere" platform is selected on the "Cluster Details" page.

Slack discussion thread

https://redhat-internal.slack.com/archives/C0211848DBN/p1706640683120159

Nutanix
https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000XeiHCAS

https://github.com/openshift/assisted-service/pull/5945

Bug OCPBUGS-25754: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-gcp/pull/54

Bug OCPBUGS-28334: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/kubernetes-autoscaler/pull/287

Bug OCPBUGS-28662: openshift/origin - replace 'coreydaley' with 'sayan-biswas' in OWNERS file

https://github.com/openshift/origin/pull/28564

Bug OCPBUGS-27502: "all tls artifacts must be registered" failing on some metal jobs

These two tests are permafailing on some metal jobs:

[sig-arch][Late] all tls artifacts must be registered [Suite:openshift/conformance/parallel]

[sig-arch][Late] all registered tls artifacts must have no metadata violation regressions [Suite:openshift/conformance/parallel]

Example run: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-upgrade-ovn-ipv6/1749394708653674496

It was previously passing 100%, maybe it's tied to recent changes?

https://github.com/openshift/origin/pull/28444

Additional context here:

https://sippy.dptools.openshift.org/sippy-ng/tests/4.16/analysis?test=%5Bsig-arch%5D%5BLate%5D%20all%20tls%20artifacts%20must%20be%20registered%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D&filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22%5Bsig-arch%5D%5BLate%5D%20all%20tls%20artifacts%20must%20be%20registered%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D%22%7D%2C%7B%22columnField%22%3A%22variants%22%2C%22not%22%3Atrue%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22never-stable%22%7D%2C%7B%22columnField%22%3A%22variants%22%2C%22not%22%3Atrue%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22aggregated%22%7D%5D%2C%22linkOperator%22%3A%22and%22%7D

https://github.com/openshift/origin/pull/28541

Bug OCPBUGS-24308: Ingress Router should have a PodDisruptionBudget

hypershift#1614 gave us the router Deployment (descended from the private-router Deployment), but it lacks PDB coverage. For example:

$ git --no-pager log -1 --oneline origin/main
f3f421bc7 (origin/release-4.16, origin/release-4.15, origin/main, origin/HEAD) Merge pull request #3183 from muraee/azure-kms
$ git --no-pager grep 'func [^(]*\(Deployment\|PodDisruptionBudget\)' f3f421bc7 -- control-plane-operator/controllers/hostedcontrolplane/{ingress,kas}
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/ingress/router.go:func ReconcileRouterDeployment(deployment *appsv1.Deployment, ownerRef config.OwnerRef, deploymentConfig config.DeploymentConfig, image string, config *corev1.ConfigMap) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/deployment.go:func ReconcileKubeAPIServerDeployment(deployment *appsv1.Deployment,
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/pdb.go:func ReconcilePodDisruptionBudget(pdb *policyv1.PodDisruptionBudget, p *KubeAPIServerParams) error {

Both the ingress and kas packages have Reconcile*Deployment methods. Only kas has a ReconcilePodDisruptionBudget method.

This bug asks for the router to get a covering PDB too, because being able to evict all router-* pods simultaneously (for the cluster flavors that have replicas > 1 on that Deployment) can make the incoming traffic unreachable. And some of that Route traffic looks like stuff that folks would want to be reliably reachable:

$ git --no-pager grep 'func Reconcile[^(]*Route(' f3f421bc7 -- control-plane-operator/controllers/hostedcontrolplane/{ingress,kas}
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileExternalPublicRoute(route *routev1.Route, owner *metav1.OwnerReference, hostname string) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileExternalPrivateRoute(route *routev1.Route, owner *metav1.OwnerReference, hostname string) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileInternalRoute(route *routev1.Route, owner *metav1.OwnerReference) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileKonnectivityExternalRoute(route *routev1.Route, ownerRef config.OwnerRef, hostname string, defaultIngressDomain string) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileKonnectivityInternalRoute(route *routev1.Route, ownerRef config.OwnerRef) error {

Test plan:

1. Install a hosted cluster.
2. Log into the management cluster, and find the namespace of the hosted cluster $NAMESPACE.
3. Evict both router pods (using a raw create, because there isn't more convenient syntax yet):

oc -n "${NAMESPACE}" get -l app=private-router -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | while read NAME
do
  oc create -f - <<EOF --raw "/api/v1/namespaces/${NAMESPACE}/pods/${NAME}/eviction"
{"apiVersion": "policy/v1", "kind": "Eviction", "metadata": {"name": "${NAME}"}}
EOF
done

If that clears out both router pods one right after the other, ingress will probably hiccup. And with the PDB in place, I'd expect the second eviction to fail.
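
The expected outcome of the test plan can be modelled with a small simulation. This is a sketch under stated assumptions, not HyperShift or Kubernetes code: the `Budget` class, the `evict_all` helper, the pod names and the `min_available=1` value are all hypothetical, and the model assumes no replacement pod becomes ready between the two evictions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    """Hypothetical stand-in for a PodDisruptionBudget with minAvailable set."""
    min_available: int


def evict_all(pods: list, budget: Optional[Budget]) -> tuple:
    """Try to evict every pod in order; return (evicted, refused) name lists."""
    running = list(pods)
    evicted, refused = [], []
    for name in pods:
        # An eviction is refused when it would drop the number of running
        # pods below the budget's minimum.
        if budget is not None and len(running) - 1 < budget.min_available:
            refused.append(name)
            continue
        running.remove(name)
        evicted.append(name)
    return evicted, refused


routers = ["router-a", "router-b"]  # hypothetical pod names
print(evict_all(routers, None))       # no PDB: both evictions go through
print(evict_all(routers, Budget(1)))  # with a PDB: the second one is refused
```

Without a budget both router pods are removed back to back, which is the ingress hiccup described above; with the budget the second request is rejected until a replacement is available.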

https://github.com/openshift/hypershift/pull/3337

Bug OCPBUGS-25025: did not find "trackTimestampsStaleness: true" setting for kubelet/kubelet-minimal servicemonitor

https://github.com/openshift/cluster-monitoring-operator/pull/2191

Bug OCPBUGS-25780: There is no response when clicking on button "Select a version" when there is new update

Description of problem:

When a new update is available for the cluster, clicking "Select a version" on the Cluster Settings page does nothing.

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-19-033450

How reproducible:

Always

Steps to Reproduce:

    1. Prepare a cluster with an available update.
    2. Go to the Cluster Settings page and choose a version by clicking the "Select a version" button.

Actual results:

2. There is no response when clicking the button; the user cannot select a version from the page.

Expected results:

2. A modal should show up for the user to select a version after clicking the "Select a version" button.

Additional info:

screenshot: https://drive.google.com/file/d/1Kpyu0kUKFEQczc5NVEcQFbf_uly_S60Y/view?usp=sharing

Bug OCPBUGS-24408: Node Overview Pane not displaying

Description of problem:

Node Overview Pane not displaying

Version-Release number of selected component (if applicable):

How reproducible:

    In the OpenShift console, under Compute > Node > Node Details, the Overview tab does not display

Steps to Reproduce:

    In the OpenShift console, under Compute > Node > Node Details, the Overview tab does not display

Actual results:

    Overview tab does not display

Expected results:

    Overview tab should display

https://github.com/openshift/console/pull/13422

Bug OCPBUGS-28246: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story NE-1490: Update go to version 1.20 in cluster-ingress-operator

Split go version update out of PR https://github.com/openshift/cluster-ingress-operator/pull/999.

https://github.com/openshift/cluster-ingress-operator/pull/1012

Bug OCPBUGS-25740: Update 4.16 ose-node-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/sdn/pull/599

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/sdn/pull/599

Bug OCPBUGS-24791: Update 4.16 golang-github-openshift-oauth-proxy-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/oauth-proxy/pull/270

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/oauth-proxy/pull/270

Bug OCPBUGS-27865: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-vsphere/pull/63

Task HOSTEDCP-1480: Improve security posture of TLS cert hash

Update hash creation to use sha512 instead of sha1
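
As a rough illustration of what the change amounts to (HyperShift itself is written in Go; this Python sketch uses hypothetical function names and a placeholder certificate payload), swapping the digest algorithm looks like this:

```python
import hashlib


def cert_hash_sha1(cert_pem: bytes) -> str:
    """Old approach: SHA-1 digest of the certificate bytes (40 hex characters)."""
    return hashlib.sha1(cert_pem).hexdigest()


def cert_hash_sha512(cert_pem: bytes) -> str:
    """New approach: SHA-512 digest of the certificate bytes (128 hex characters)."""
    return hashlib.sha512(cert_pem).hexdigest()


# Placeholder bytes, not a real certificate.
cert = b"-----BEGIN CERTIFICATE-----\nplaceholder\n-----END CERTIFICATE-----\n"
print(len(cert_hash_sha1(cert)), len(cert_hash_sha512(cert)))
```

Besides the stronger algorithm, the visible difference is the digest length, so anything that stores or compares the hash has to accept the longer value.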

Links

https://app.snyk.io/org/hypershift/project/4c9cffe4-d47c-473d-bea4-4daf4266d02d#issue-aa63e0f3-bed0-44e9-bf14-515127af79d7

https://github.com/openshift/hypershift/pull/3718

Spike MON-3644: Ease the tracking of monitoring components versions.

We frequently receive inquiries regarding the versions of monitoring components (such as Prometheus, Alertmanager, etc.) that are used in a given OCP version.
Currently, obtaining this information requires several manual steps on our part, e.g.:

Identify the relevant GitHub repository.
Check out the appropriate branch.
Locate the file that contains the version.

What if we automate this?

How about a view that displays the versions of all components for all recent OCP versions?

https://github.com/openshift/cluster-monitoring-operator/pull/2220

Bug OCPBUGS-25862: Spurious "wait has exceeded 40 minutes" when etcd operator briefly goes degraded in late upgrade

Description of problem:

At 17:26:09, the cluster is happily upgrading nodes:

An update is in progress for 57m58s: Working towards 4.14.1: 734 of 859 done (85% complete), waiting on machine-config

At 17:26:54, the upgrade starts to reboot master nodes and COs get noisy (this one specifically is OCPBUGS-20061)

An update is in progress for 58m50s: Unable to apply 4.14.1: the cluster operator control-plane-machine-set is not available

~Two minutes later, at 17:29:07, CVO starts to shout about waiting on operators for over 40 minutes despite not indicating anything was wrong earlier:

An update is in progress for 1h1m2s: Unable to apply 4.14.1: wait has exceeded 40 minutes for these operators: etcd, kube-apiserver

This is only because these operators go briefly degraded during master reboot (which they shouldn't but that is a different story). CVO computes its 40 minutes against the time when it first started to upgrade the given operator so it:

1. Upgrades etcd / KAS very early in the upgrade, noting the time when it started to do that
2. These two COs upgrade successfully and the upgrade proceeds
3. Eventually the cluster starts rebooting masters and etcd/KAS go degraded
4. CVO compares the current time against the noted time, discovers it's more than 40 minutes and starts warning about it.
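
The four steps above can be sketched as a timing comparison. This is an illustration only, not the cluster-version-operator code: the function names are hypothetical, the calendar date and the 16:30:00 start time for the etcd upgrade are assumptions (the report only says it happened "very early"), and measuring from the moment the operator went degraded is one possible alternative, not necessarily the merged fix.

```python
from datetime import datetime, timedelta

SLOW_THRESHOLD = timedelta(minutes=40)


def exceeded_since_first_update(first_started: datetime, now: datetime) -> bool:
    """Behaviour described in the bug: measure from when the operator's update began."""
    return now - first_started > SLOW_THRESHOLD


def exceeded_since_degraded(degraded_since: datetime, now: datetime) -> bool:
    """Alternative: measure from when the operator last became unhealthy."""
    return now - degraded_since > SLOW_THRESHOLD


day = (2024, 1, 1)                                 # arbitrary date for the example
etcd_update_started = datetime(*day, 16, 30, 0)    # assumed, early in the upgrade
etcd_degraded_since = datetime(*day, 17, 26, 54)   # masters start rebooting
now = datetime(*day, 17, 29, 7)                    # CVO starts warning

print(exceeded_since_first_update(etcd_update_started, now))  # True
print(exceeded_since_degraded(etcd_degraded_since, now))      # False
```

With the first measurement the warning fires about two minutes after the operator goes degraded; with the second it would only fire if the degraded state itself lasted more than 40 minutes.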

Version-Release number of selected component (if applicable):

all

How reproducible:

Not entirely deterministic:

1. the upgrade must go for 40m+ between upgrading etcd and upgrading nodes
2. the upgrade must reboot a master that is not running CVO (otherwise there will be a new CVO instance without the saved times, they are only saved in memory)

Steps to Reproduce:

1. Watch oc adm upgrade during the upgrade

Actual results:

Spurious "waiting for over 40m" message pops out of the blue

Expected results:

CVO simply says "waiting up to 40m on" and this eventually goes away as the node goes up and etcd goes out of degraded.

https://github.com/openshift/cluster-version-operator/pull/1011

Bug OCPBUGS-29984: ART requests updates to 4.16 image ose-csi-external-resizer-container

Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/155

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-resizer/pull/158

Bug OCPBUGS-25049: Update 4.16 ose-ibm-vpc-block-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/60

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ibm-vpc-block-csi-driver/pull/61

Bug OCPBUGS-27215: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/2048

Bug OCPBUGS-30314: Serialisation error in DoHTTPProbe function logging at verbosity level 4 in probehttp.go

In the `DoHTTPProbe` function located at `github.com/openshift/router/pkg/router/metrics/probehttp/probehttp.go`, logging of the HTTP response object at verbosity level 4 results in a serialisation error due to non-serialisable fields within the `http.Response` object. The error logged is `<<error: json: unsupported type: func() (io.ReadCloser, error)>>`, pointing towards an inability to serialise the `Body` field, which is of type `io.ReadCloser`.

This function is designed to check if a GET request to the specified URL succeeds, logging detailed response information at higher verbosity levels for diagnostic purposes.

Steps to Reproduce:
1. Increase the logging level to 4.
2. Perform an operation that triggers the `DoHTTPProbe` function.
3. Review the logging output for the error message.

Expected Behaviour:

The logger should gracefully handle or exclude non-serialisable fields like `Body`, ensuring clean and informative logging output that aids in diagnostics without encountering serialisation errors.

Actual Behaviour:

Non-serialisable fields in the `http.Response` object lead to the error `<<error: json: unsupported type: func() (io.ReadCloser, error)>>` being logged. This diminishes the utility of logs for debugging at higher verbosity levels.

Impact:

The issue is considered of low severity since it only appears at logging level 4, which is beyond standard operational levels (level 2) used in production. Nonetheless, it could hinder effective diagnostics and clutter logs with errors when high verbosity levels are employed for troubleshooting.

Suggested Fix:

Modify the logging functionality within `DoHTTPProbe` to either filter out non-serialisable fields from the `http.Response` object or implement a custom serialisation approach that allows these fields to be logged in a more controlled and error-free manner.
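
The first option in the suggested fix (log only serialisable fields) can be illustrated with a Python analogue. The router is written in Go, so this is not the actual code: the `loggable` helper and the dictionary standing in for the response object are hypothetical, and the callable field merely plays the role of the function-typed field that the Go JSON encoder rejects.

```python
import json


def loggable(resp: dict) -> dict:
    """Keep only the fields that are safe to serialise into a log line."""
    return {"status": resp["status"], "headers": dict(resp["headers"])}


# Hypothetical response object with one field that cannot be serialised.
response = {
    "status": 200,
    "headers": {"Content-Type": "text/plain"},
    "get_body": lambda: b"",
}

try:
    json.dumps(response)
except TypeError as err:
    print("naive logging fails:", err)

print(json.dumps(loggable(response), sort_keys=True))
```

Selecting the fields explicitly keeps the diagnostic value (status and headers) while avoiding the encoder error at high verbosity.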

https://github.com/openshift/router/pull/566

Task HOSTEDCP-1376: Bump sigs.k8s and update dependabot groupings

Bump the sigs.k8s dependencies and update dependabot groupings

https://github.com/openshift/hypershift/pull/3392

Bug OCPBUGS-24404: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-24693: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/2166

Bug OCPBUGS-27788: Power VS: Cannot deploy to mad

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1. Try to deploy in mad02 or mad04 with powervs
    2. Cannot import boot image
    3. fail

Actual results:

Fail

Expected results:

Cluster comes up

Additional info:

https://github.com/openshift/installer/pull/7941

Bug OCPBUGS-25775: Update 4.16 ironic-rhcos-downloader-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/96

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ironic-rhcos-downloader/pull/96

Bug OCPBUGS-28251: unable to use `continue: true` in user-defined AlertmanagerConfig

Description of problem:

Trying to define multiple receivers in a single user-defined AlertmanagerConfig

Version-Release number of selected component (if applicable):

How reproducible:

always

Steps to Reproduce:

#### Monitoring for user-defined projects is enabled
```
oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml | head -4
```
```
apiVersion: v1
data:
  config.yaml: |
    enableUserWorkload: true
```

#### separate Alertmanager instance for user-defined alert routing is Enabled and Configured
```
oc -n openshift-user-workload-monitoring get configmap user-workload-monitoring-config -o yaml | head -6
```
```
apiVersion: v1
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true
```
create testing namespace
```
oc new-project libor-alertmanager-testing
```
## TESTING - MULTIPLE RECEIVERS IN ALERTMANAGERCONFIG
Single AlertmanagerConfig
`alertmanager_config_webhook_and_email_rootDefault.yaml`
```
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: libor-alertmanager-testing-email-webhook
  namespace: libor-alertmanager-testing
spec:
  receivers:
  - name: 'libor-alertmanager-testing-webhook'
    webhookConfigs:
      - url: 'http://prometheus-msteams.internal-monitoring.svc:2000/occ-alerts'
  - name: 'libor-alertmanager-testing-email'
    emailConfigs:
      - to: USER@USER.CO
        requireTLS: false
        sendResolved: true
  - name: Default
  route:
    groupBy:
    - namespace
    receiver: Default
    groupInterval: 60s
    groupWait: 60s
    repeatInterval: 12h
    routes:
    - matchers:
      - name: severity
        value: critical
        matchType: '='
        continue: true
      receiver: 'libor-alertmanager-testing-webhook'
    - matchers:
      - name: severity
        value: critical
        matchType: '='
      receiver: 'libor-alertmanager-testing-email'
```
Once saved the continue statement is removed from the object.

the configuration applied to alertmanager contains continue false statements
```
oc exec -n openshift-user-workload-monitoring alertmanager-user-workload-0 -- amtool config show --alertmanager.url http://localhost:9093
```
```
route:
  receiver: Default
  group_by:
  - namespace
  continue: false
  routes:
  - receiver: libor-alertmanager-testing/libor-alertmanager-testing-email-webhook/Default
    group_by:
    - namespace
    matchers:
    - namespace="libor-alertmanager-testing"
    continue: true
    routes:
    - receiver: libor-alertmanager-testing/libor-alertmanager-testing-email-webhook/libor-alertmanager-testing-webhook
      matchers:
      - severity="critical"
      continue: false  <----
    - receiver: libor-alertmanager-testing/libor-alertmanager-testing-email-webhook/libor-alertmanager-testing-email
      matchers:
      - severity="critical"
      continue: false <-----
```
If I update the statements to read `continue: true` and test them here: https://prometheus.io/webtools/alerting/routing-tree-editor/ then I get the desired results.

A workaround is to use 2 separate files; in that case the continue statement is added.
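
Why the missing `continue: true` matters can be shown with a simplified model of sibling-route matching. This is a sketch, not Alertmanager's implementation: `matches` and `receivers_for` are hypothetical helpers, only equality matchers and a single level of routes are modelled, and the receiver names are taken from the configuration above.

```python
def matches(labels: dict, matchers: dict) -> bool:
    """True when every matcher label has the expected value on the alert."""
    return all(labels.get(name) == value for name, value in matchers.items())


def receivers_for(labels: dict, routes: list, default_receiver: str) -> list:
    """Walk sibling routes in order; stop at the first match unless it continues."""
    hit = []
    for route in routes:
        if matches(labels, route["matchers"]):
            hit.append(route["receiver"])
            if not route.get("continue", False):
                break
    return hit or [default_receiver]


alert = {"severity": "critical", "namespace": "libor-alertmanager-testing"}


def routes(first_continue: bool) -> list:
    return [
        {"matchers": {"severity": "critical"},
         "receiver": "libor-alertmanager-testing-webhook",
         "continue": first_continue},
        {"matchers": {"severity": "critical"},
         "receiver": "libor-alertmanager-testing-email"},
    ]


print(receivers_for(alert, routes(False), "Default"))  # webhook only
print(receivers_for(alert, routes(True), "Default"))   # webhook and email
```

With `continue: false` on the first route the email receiver is never reached, which matches the behaviour reported here once the statement is dropped from the saved object.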

Actual results:

Once saved the continue statement is removed from the object.

Expected results:

The `continue: true` statement is retained and applied to Alertmanager.

Additional info:

https://github.com/openshift/prometheus-operator/pull/275

Bug OCPBUGS-29429: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/installer/pull/8004

Bug TRT-1522: gcp-network-liveness test looks broken on metal

revert: https://github.com/openshift/origin/pull/28603

output looks like:

INFO[2024-02-16T00:25:04Z] time="2024-02-16T00:20:41Z" level=error msg="disruption sample failed: error running request: 403 Forbidden: http://www.w3.org/TR/html4/strict.dtd\">\n\n\n\n\n\n\n
\n
ERROR
\n
The requested URL could not be retrieved
\n
\n
\n\n
\n

The following error was encountered while trying to retrieve the URL: http://35.212.33.188/health\">http://35.212.33.188/health

\n\n
\n

Access Denied.

\n
\n\n

Access control configuration prevents your request from being allowed at this time. Please contact your service provider if you feel this is incorrect.

\n\n

Your cache administrator is root.

\n
\n
\n\n
\n
\n

Generated Fri, 16 Feb 2024 00:20:41 GMT by ofcir-3329d9226457452fb2040e269776e3a5 (squid/5.2)

\n\n
\n\n" auditID=facdfd31-51e5-4812-a356-6e4b0e30cd38 backend=gcp-network-liveness-reused-connections this-instance="{Disruption map[backend-disruption-name:gcp-network-liveness-reused-connections connection:reused disruption:openshift-tests]}" type=reused
time="2024-02-16T00:20:41Z" level=error msg="disruption sample failed: error running request: 403 Forbidden: http://www.w3.org/TR/html4/strict.dtd\">\n\n\n\n\n\n\n
\n
ERROR

https://github.com/openshift/origin/pull/28603

Task MON-3589: Add a test for Prometheus timestamp-tolerance

Introduced in https://github.com/openshift/cluster-monitoring-operator/pull/2187

https://github.com/openshift/cluster-monitoring-operator/pull/2194

Bug OCPBUGS-24878: Update 4.16 openshift-state-metrics-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/112

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/openshift-state-metrics/pull/112

Bug OCPBUGS-25500: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/aws-ebs-csi-driver/pull/252

Bug OCPBUGS-27760: excessive Back-off restarting failed containers

I noticed this today when looking at component readiness. A ~5% decrease in pass rate may seem minor, but these can certainly add up. This test passed 713 times in a row on 4.14. You can see today's failure here.

Details below:

-------

Component Readiness has found a potential regression in [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers.

Probability of significant regression: 99.96%

Sample (being evaluated) Release: 4.15
Start Time: 2024-01-17T00:00:00Z
End Time: 2024-01-23T23:59:59Z
Success Rate: 94.83%
Successes: 55
Failures: 3
Flakes: 0

Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 100.00%
Successes: 713
Failures: 0
Flakes: 4
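
The success rates quoted above can be reproduced with simple arithmetic. The helper below is a sketch, not Sippy's code: `success_rate` is a hypothetical name, and counting flakes as passes is an assumption (the reported figures are consistent with any treatment in which flakes are not failures).

```python
def success_rate(successes: int, failures: int, flakes: int = 0) -> float:
    """Percentage of runs that passed, counting flakes as passes (assumption)."""
    total = successes + failures + flakes
    return 100.0 * (successes + flakes) / total


sample = success_rate(successes=55, failures=3, flakes=0)   # 4.15 sample
base = success_rate(successes=713, failures=0, flakes=4)    # 4.14 base

print(round(sample, 2))         # 94.83
print(round(base, 2))           # 100.0
print(round(base - sample, 2))  # 5.17
```

The drop of roughly five percentage points between the base and the sample is the "~5%" mentioned at the top of this report.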

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2023-10-31%2023%3A59%3A59&baseRelease=4.14&baseStartTime=2023-10-04%2000%3A00%3A00&capability=Other&component=Unknown&confidence=95&environment=ovn%20upgrade-minor%20amd64%20gcp%20rt&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&network=ovn&pity=5&platform=gcp&platform=gcp&sampleEndTime=2024-01-23%2023%3A59%3A59&sampleRelease=4.15&sampleStartTime=2024-01-17%2000%3A00%3A00&testId=openshift-tests-upgrade%3A37f1600d4f8d75c47fc5f575025068d2&testName=%5Bsig-cluster-lifecycle%5D%20pathological%20event%20should%20not%20see%20excessive%20Back-off%20restarting%20failed%20containers&upgrade=upgrade-minor&upgrade=upgrade-minor&variant=rt&variant=rt

https://github.com/openshift/cluster-baremetal-operator/pull/403

Bug OCPBUGS-29601: [CI issue]Pipeline tests are failing in CI

Description of problem:

The pipeline operator has been removed from OperatorHub, so CI has been failing ever since:
https://search.ci.openshift.org/?search=Entire+pipeline+flow+from+Builder+page+%22before+all%22+hook+for+%22Background+Steps%22&maxAge=336h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

https://github.com/openshift/console/pull/13611

Bug OCPBUGS-24380: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/2262

Bug OCPBUGS-24190: When baselineCapabilitySet is set to None, an SA named `deployer-controller` is still present in the cluster

When baselineCapabilitySet is set to None, an SA named `deployer-controller` is still present in the cluster.

Steps to Reproduce:

=================

1. Install a 4.15 cluster with baselineCapabilitySet set to None

2. Run command `oc get sa -A | grep deployer`

Actual Results:

================

[knarra@knarra openshift-tests-private]$ oc get sa -A | grep deployer
openshift-infra deployer-controller 0 63m

Expected Results:

==================

No SA related to deployer should be returned

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/320

Bug OCPBUGS-30832: ART requests updates to 4.16 image telemeter-container

Please review the following PR: https://github.com/openshift/telemeter/pull/522

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/telemeter/pull/521

Bug OCPBUGS-25378: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-monitoring-operator/pull/2225

Story TRT-1440: Weekly disruption data update in origin is broken... again

https://github.com/openshift/origin/pull/28484

Broken for 3 weeks, possibly for different reasons.

Unit test failing now, looks like 4.16 data not getting picked up, and it probably should be.

Bug OCPBUGS-16760: NodeLogQuery e2e tests are failing with Kubernetes 1.28 bump

Description of problem:

NodeLogQuery e2e tests are failing with Kubernetes 1.28 bump. Example:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/1646/pull-ci-openshift-kubernetes-master-k8s-e2e-gcp-ovn/1683472309211369472

Version-Release number of selected component (if applicable):

4.15

How reproducible:

Always

https://github.com/openshift/kubernetes/pull/1792

Bug OCPBUGS-25542: Update 4.16 ose-cluster-olm-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/42

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-olm-operator/pull/42

Bug OCPBUGS-22528: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/csi-external-provisioner/pull/81

Bug OCPBUGS-24876: Update 4.16 ose-cluster-platform-operators-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/platform-operators/pull/105

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/platform-operators/pull/105

Bug OCPBUGS-24954: Update 4.16 ose-oauth-apiserver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/94

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/oauth-apiserver/pull/94

Bug OCPBUGS-29350: Agent CI jobs failing due to console/authentication operator degraded

Description of problem:

    Agent CI jobs started to fail on a pretty regular basis, especially the compact ones.
Jobs time out because either the console or the authentication operator remains in a degraded state.
From the log analysis, they are not able to get a route. Both the apiserver and etcd component logs report connection refused messages, possibly indicating an underlying network problem.

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Pretty frequently

https://github.com/openshift/ovn-kubernetes/pull/2089

Bug OCPBUGS-21846: Test "static pods should start after being created" failed

Description of problem:

Recently, the passing rate for test "static pods should start after being created" has dropped significantly for some platforms: 

https://sippy.dptools.openshift.org/sippy-ng/tests/4.15/analysis?test=%5Bsig-node%5D%20static%20pods%20should%20start%20after%20being%20created&filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22%5Bsig-node%5D%20static%20pods%20should%20start%20after%20being%20created%22%7D%5D%2C%22linkOperator%22%3A%22and%22%7D

Take a look at this example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-techpreview/1712803313642115072

The test failed with the following message:
{  static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 6 on node: "ci-op-2z99zzqd-7f99c-rfp4q-master-0" didn't show up, waited: 3m0s}

Seemingly revision 6 was never reached. But if we look at the log from kube-controller-manager-operator, it jumps from revision 5 to revision 7: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-techpreview/1712803313642115072/artifacts/e2e-azure-sdn-techpreview/gather-extra/artifacts/pods/openshift-kube-controller-manager-operator_kube-controller-manager-operator-7cd978d745-bcvkm_kube-controller-manager-operator.log

The log also indicates that there is a possibility of a race:

W1013 12:59:17.775274       1 staticpod.go:38] revision 7 is unexpectedly already the latest available revision. This is a possible race!

This might be a static controller issue, but I am starting with the kube-controller-manager component for this case. Feel free to reassign.

Here is a slack thread related to this:
https://redhat-internal.slack.com/archives/C01CQA76KMX/p1697472297510279

Bug OCPBUGS-28663: openshift/csi-driver-shared-resource - replace 'coreydaley' with 'sayan-biswas' in OWNERS file

https://github.com/openshift/csi-driver-shared-resource/pull/165

Bug OCPBUGS-30769: Kube-Apiserver server certificates should be signed with node local client load balancer address

Description of problem:

   In-cluster clients should be able to talk directly to the node-local apiserver IP address and, as a best practice, should all be configured to use it. In cloud environments, this load balancer adds the benefit of health-checking the path from the machine to the load balancer fronting the kube-apiserver. It becomes more crucial in bare-metal/on-prem environments, where there may be no load balancer and instead just 3 unique endpoints directly to redundant kube-apiservers. In that case, clients using only DNS will experience intermittent traffic failures if a control plane instance goes down, whereas clients using the node-local load balancer will see no traffic disruption.

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    100%

Steps to Reproduce:

    1. Schedule a pod on any hypershift cluster node
    2. In the pod run curl -v -k https://172.20.0.1:6443
    3. The verbose output will show that the kube-apiserver cert does not have the node-local client load balancer IP address in its IPs section and therefore will not allow valid HTTPS requests on that address

Actual results:

    Secure HTTPS requests cannot be made to the kube-apiserver

Expected results:

     Secure HTTPS requests can be made to the kube-apiserver (no need to run -k when specifying proper CA bundle)

Additional info:

https://github.com/openshift/hypershift/pull/3699

Bug OCPBUGS-24844: Update 4.16 ose-openstack-cinder-csi-driver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/145

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/145

Bug MGMT-16216: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/assisted-service/pull/5836

Bug OCPBUGS-24649: Private endpoint creation does not work on cluster created with minimal permissions

https://github.com/openshift/cluster-image-registry-operator/pull/971

Bug OCPBUGS-25462: Number of configured control plane replicas should be validated

The number of control plane replicas defined in install-config.yaml (or agent-cluster-install.yaml) should be validated to check that it is set to 3, or 1 in the case of SNO. If it is set to another value, the "create image" command should fail.

We recently had a case where the number of replicas was set to 2 and the installation failed. It would be good to catch this misconfiguration prior to the install.
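
A minimal sketch of such a check, in Go, assuming the rule is exactly as stated above (3 replicas, or 1 for SNO). The function name `validateControlPlaneReplicas` and the error text are hypothetical and are not taken from the installer PR.

```go
package main

import "fmt"

// validateControlPlaneReplicas accepts 3 (standard HA control plane) or
// 1 (single-node OpenShift) and rejects every other value.
// Hypothetical name; the real installer validation may differ.
func validateControlPlaneReplicas(replicas int64) error {
	switch replicas {
	case 1, 3:
		return nil
	default:
		return fmt.Errorf("control plane replicas must be 3, or 1 for SNO; got %d", replicas)
	}
}

func main() {
	for _, n := range []int64{1, 2, 3} {
		if err := validateControlPlaneReplicas(n); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("valid:", n)
	}
}
```

With a check like this wired into "create image", the 2-replica configuration mentioned above would be rejected before the install starts.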

https://github.com/openshift/installer/pull/8082

Bug OCPBUGS-25159: Update 4.16 baremetal-machine-controller-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/207

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api-provider-baremetal/pull/207

Bug OCPBUGS-29645: Revocation of customer certificate causes access to the cluster using kubeconfig with sre cert to be denied

Description of problem:

When a customer certificate and sre certificate are configured and approved, revocation of customer certificate causes access to the cluster using kubeconfig with sre cert to be denied

Version-Release number of selected component (if applicable):

How reproducible:

    always

Steps to Reproduce:

    1. Create a cluster
    2. Configure a customer cert and a sre cert, they are approved
    3. Revoke a customer cert, access to the cluster using kubeconfig with sre cert gets denied

Actual results:

   Revoke a customer cert, access to the cluster using kubeconfig with sre cert gets denied

Expected results:

   Revoke a customer cert, access to the cluster using kubeconfig with sre cert succeeds

Additional info:

https://github.com/openshift/hypershift/pull/3615

Bug OCPBUGS-25047: Update 4.16 ose-powervs-machine-controllers-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/machine-api-provider-powervs/pull/67

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/machine-api-provider-powervs/pull/68

Bug OCPBUGS-25461: Implement multi-rhel artifact extraction

Description of problem:


The following binaries need to get extracted from the release payload for both rhel8 and rhel9:

oc
ccoctl
opm
openshift-install
oc-mirror

The images that contain these should produce artifacts of both kinds in some location, and probably make the artifact of their architecture available under a normal location in path. Example:

/usr/share/<binary>.rhel8
/usr/share/<binary>.rhel9
/usr/bin/<binary>

This ticket is about getting "oc adm release extract" to do the right thing in a backwards compatible way. If both binaries are available get those. If not, get from the old location.
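
The backwards-compatible selection described above could look roughly like this in Go. It is a sketch under assumptions: `extractionPaths` is a hypothetical helper, the `exists` callback stands in for inspecting the image contents, and the old location is assumed to be `/usr/bin/<binary>` as in the example layout. It is not the actual `oc adm release extract` implementation.

```go
package main

import "fmt"

// extractionPaths returns the in-image paths to extract for a binary.
// If both the rhel8 and rhel9 variants exist, both are returned;
// otherwise the old single location is used.
// Hypothetical helper; exists abstracts the image file lookup.
func extractionPaths(binary string, exists func(path string) bool) []string {
	rhel8 := "/usr/share/" + binary + ".rhel8"
	rhel9 := "/usr/share/" + binary + ".rhel9"
	if exists(rhel8) && exists(rhel9) {
		return []string{rhel8, rhel9}
	}
	return []string{"/usr/bin/" + binary}
}

func main() {
	// Assumed image contents: oc ships both variants, opm only the old path.
	files := map[string]bool{
		"/usr/share/oc.rhel8": true,
		"/usr/share/oc.rhel9": true,
		"/usr/bin/oc":         true,
		"/usr/bin/opm":        true,
	}
	exists := func(p string) bool { return files[p] }

	for _, b := range []string{"oc", "opm"} {
		fmt.Println(b, "->", extractionPaths(b, exists))
	}
}
```

Requiring both variants before switching keeps older payloads, which only have the single binary, working unchanged.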

https://github.com/openshift/oc/pull/1647

Bug OCPBUGS-25760: Cannot change default network type when not doing migration

Description of problem:

During live OVN migration, the network operator shows the error message: Not applying unsafe configuration change: invalid configuration: [cannot change default network type when not doing migration]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.

Version-Release number of selected component (if applicable):

How reproducible:

Always

Steps to Reproduce:

1. Create 4.15 nightly SDN ROSA cluster
2. oc delete validatingwebhookconfigurations.admissionregistration.k8s.io/sre-techpreviewnoupgrade-validation
3. oc edit featuregate cluster to enable featuregates 
4. Wait for all nodes to reboot and return to normal
5. oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'

Actual results:

[weliang@weliang ~]$ oc delete validatingwebhookconfigurations.admissionregistration.k8s.io/sre-techpreviewnoupgrade-validation
[weliang@weliang ~]$ oc edit featuregate cluster
[weliang@weliang ~]$ oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
network.config.openshift.io/cluster patched
[weliang@weliang ~]$
[weliang@weliang ~]$ oc get co network
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
network   4.15.0-0.nightly-2023-12-18-220750   True        False         True       105m    Not applying unsafe configuration change: invalid configuration: [cannot change default network type when not doing migration]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.
[weliang@weliang ~]$ oc describe Network.config.openshift.io cluster
Name:         cluster
Namespace:
Labels:       <none>
Annotations:  network.openshift.io/network-type-migration:
API Version:  config.openshift.io/v1
Kind:         Network
Metadata:
  Creation Timestamp:  2023-12-20T15:13:39Z
  Generation:          3
  Resource Version:    119899
  UID:                 6a621b88-ac4f-4918-a7f6-98dba7df222c
Spec:
  Cluster Network:
    Cidr:         10.128.0.0/14
    Host Prefix:  23
  External IP:
    Policy:
  Network Type:  OVNKubernetes
  Service Network:
    172.30.0.0/16
Status:
  Cluster Network:
    Cidr:               10.128.0.0/14
    Host Prefix:        23
  Cluster Network MTU:  8951
  Network Type:         OpenShiftSDN
  Service Network:
    172.30.0.0/16
Events:  <none>
[weliang@weliang ~]$ oc describe Network.operator.openshift.io cluster
Name:         cluster
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  operator.openshift.io/v1
Kind:         Network
Metadata:
  Creation Timestamp:  2023-12-20T15:15:37Z
  Generation:          275
  Resource Version:    120026
  UID:                 278bd491-ac88-4038-887f-d1defc450740
Spec:
  Cluster Network:
    Cidr:         10.128.0.0/14
    Host Prefix:  23
  Default Network:
    Openshift SDN Config:
      Enable Unidling:          true
      Mode:                     NetworkPolicy
      Mtu:                      8951
      Vxlan Port:               4789
    Type:                       OVNKubernetes
  Deploy Kube Proxy:            false
  Disable Multi Network:        false
  Disable Network Diagnostics:  false
  Kube Proxy Config:
    Bind Address:      0.0.0.0
  Log Level:           Normal
  Management State:    Managed
  Observed Config:     <nil>
  Operator Log Level:  Normal
  Service Network:
    172.30.0.0/16
  Unsupported Config Overrides:  <nil>
  Use Multi Network Policy:      false
Status:
  Conditions:
    Last Transition Time:  2023-12-20T15:15:37Z
    Status:                False
    Type:                  ManagementStateDegraded
    Last Transition Time:  2023-12-20T16:58:58Z
    Message:               Not applying unsafe configuration change: invalid configuration: [cannot change default network type when not doing migration]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.
    Reason:                InvalidOperatorConfig
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2023-12-20T15:15:37Z
    Status:                True
    Type:                  Upgradeable
    Last Transition Time:  2023-12-20T16:52:11Z
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2023-12-20T15:15:45Z
    Status:                True
    Type:                  Available
  Ready Replicas:          0
  Version:                 4.15.0-0.nightly-2023-12-18-220750
Events:                    <none>
[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2023-12-18-220750   True        False         84m     Error while reconciling 4.15.0-0.nightly-2023-12-18-220750: the cluster operator network is degraded
[weliang@weliang ~]$

Expected results:

Migration success

Additional info:

The same error message appears on both ROSA and GCP clusters.

https://github.com/openshift/cluster-network-operator/pull/2179

Bug OCPBUGS-19551: Clean up northd templating

Some templating code can be simplified now.

https://github.com/openshift/cluster-network-operator/pull/2234

Bug OCPBUGS-25557: Update 4.16 ose-aws-ebs-csi-driver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-operator/pull/87

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-24981: Update 4.16 ose-gcp-cloud-controller-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cloud-provider-gcp/pull/48

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cloud-provider-gcp/pull/48

Bug OCPBUGS-24363: ovnkube-controller bug: ovn service lb still has the endpoint when pod is in terminating state

Description of problem:

Users are experiencing an issue with NodePort traffic forwarding: TCP traffic continues to be directed to pods that are in the terminating state, so connections cannot be established successfully. According to the customer, this issue is causing connection disruptions in business transactions.

Version-Release number of selected component (if applicable):

OpenShift 4.12.13 with RHEL 8.6 workers in an OVN environment.

How reproducible:

here is the code found.
https://github.com/openshift/ovn-kubernetes/blob/dd3c7ed8c1f41873168d3df26084ecbfd3d9a36b/go-controller/pkg/util/kube.go#L360
—
func IsEndpointServing(endpoint discovery.Endpoint) bool {
	if endpoint.Conditions.Serving != nil {
		return *endpoint.Conditions.Serving
	} else {
		return IsEndpointReady(endpoint)
	}
}

// IsEndpointValid takes as input an endpoint from an endpoint slice and a boolean that indicates whether to include
// all terminating endpoints, as per the PublishNotReadyAddresses feature in kubernetes service spec. It always returns true
// if includeTerminating is true and falls back to IsEndpointServing otherwise.
func IsEndpointValid(endpoint discovery.Endpoint, includeTerminating bool) bool {
	return includeTerminating || IsEndpointServing(endpoint)
}

—

It looks like the 'IsEndpointValid' function accepts any endpoint with serving=true and does not check for ready=true.
It seems the code in this section was changed recently (the check for Ready=true was changed to Serving=true)?

[Check the "Serving" field for endpoints]
https://github.com/openshift/ovn-kubernetes/commit/aceef010daf0697fe81dba91a39ed0fdb6563dea#diff-daf9de695e0ff81f9173caf83cb88efa138e92a9b35439bd7044aa012ff931c0

https://github.com/openshift/ovn-kubernetes/blob/release-4.12/go-controller/pkg/util/kube.go#L326-L386
—
out.Port = *port.Port
for _, endpoint := range slice.Endpoints {
	// Skip endpoint if it's not valid
	if !IsEndpointValid(endpoint, includeTerminating) {
		klog.V(4).Infof("Slice endpoint not valid")
		continue
	}

	for _, ip := range endpoint.Addresses {
		klog.V(4).Infof("Adding slice %s endpoint: %v, port: %d", slice.Name, endpoint.Addresses, *port.Port)
		ipStr := utilnet.ParseIPSloppy(ip).String()
		switch slice.AddressType {
		case discovery.AddressTypeIPv4:
			v4ips.Insert(ipStr)
		case discovery.AddressTypeIPv6:
			v6ips.Insert(ipStr)
		default:
			klog.V(5).Infof("Skipping FQDN slice %s/%s", slice.Namespace, slice.Name)
		}
	}
}
—

Steps to Reproduce:

Here are the customer's sample pods for reference.
mbgateway-st-8576f6f6f8-5jc75 1/1 Running 0 104m 172.30.195.124 appn01-100.app.paas.example.com <none> <none>
mbgateway-st-8576f6f6f8-q8j6k 1/1 Running 0 5m51s 172.31.2.97 appn01-202.app.paas.example.com <none> <none>

pod yaml:
livenessProbe:
  failureThreshold: 3
  initialDelaySeconds: 40
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    port: 9190
  timeoutSeconds: 5
name: mbgateway-st
ports:
- containerPort: 9190
  protocol: TCP
readinessProbe:
  failureThreshold: 3
  initialDelaySeconds: 40
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    port: 9190
  timeoutSeconds: 5
resources:
  limits:
    cpu: "2"
    ephemeral-storage: 10Gi
    memory: 2G
  requests:
    cpu: 50m
    ephemeral-storage: 100Mi
    memory: 1111M

After deleting the pod (mbgateway-st-8576f6f6f8-5jc75), check the EndpointSlice status:
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:
- addresses:
  - 172.30.195.124
  conditions:
    ready: false
    serving: true
    terminating: true
  nodeName: appn01-100.app.paas.example.com
  targetRef:
    kind: Pod
    name: mbgateway-st-8576f6f6f8-5jc75
    namespace: lb59-10-st-unigateway
    uid: 5e8a375d-ba56-4894-8034-0009d0ab8ebe
  zone: AZ61QEBIZ_AZ61QEM02_FD3
- addresses:
  - 172.31.2.97
  conditions:
    ready: true
    serving: true
    terminating: false
  nodeName: appn01-202.app.paas.example.com
  targetRef:
    kind: Pod
    name: mbgateway-st-8576f6f6f8-q8j6k
    namespace: lb59-10-st-unigateway
    uid: 5bd195b7-e342-4b34-b165-12988a48e445
  zone: AZ61QEBIZ_AZ61QEM02_FD1

After waiting a short while, check the OVN service LB: the endpoint information has not been updated to the latest state.
9349d703-1f28-41fe-b505-282e8abf4c40 Service_lb59-10- tcp 172.35.0.185:31693 172.30.195.124:9190,172.31.2.97:9190
dca65745-fac4-4e73-b412-2c7530cf4a91 Service_lb59-10- tcp 172.35.0.170:31693 172.30.195.124:9190,172.31.2.97:9190
a5a65766-b0f2-4ac6-8f7c-cdebeea303e3 Service_lb59-10- tcp 172.35.0.89:31693 172.30.195.124:9190,172.31.2.97:9190
a36517c5-ecaa-4a41-b686-37c202478b98 Service_lb59-10- tcp 172.35.0.213:31693 172.30.195.124:9190,172.31.2.97:9190
16d997d1-27f0-41a3-8a9f-c63c8872d7b8 Service_lb59-10- tcp 172.35.0.92:31693 172.30.195.124:9190,172.31.2.97:9190

After waiting a little longer, the EndpointSlice shows:
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:
- addresses:
  - 172.30.195.124
  conditions:
    ready: false
    serving: true
    terminating: true
  nodeName: appn01-100.app.paas.example.com
  targetRef:
    kind: Pod
    name: mbgateway-st-8576f6f6f8-5jc75
    namespace: lb59-10-st-unigateway
    uid: 5e8a375d-ba56-4894-8034-0009d0ab8ebe
  zone: AZ61QEBIZ_AZ61QEM02_FD3
- addresses:
  - 172.31.2.97
  conditions:
    ready: true
    serving: true
    terminating: false
  nodeName: appn01-202.app.paas.example.com
  targetRef:
    kind: Pod
    name: mbgateway-st-8576f6f6f8-q8j6k
    namespace: lb59-10-st-unigateway
    uid: 5bd195b7-e342-4b34-b165-12988a48e445
  zone: AZ61QEBIZ_AZ61QEM02_FD1
- addresses:
  - 172.30.132.78
  conditions:
    ready: false
    serving: false
    terminating: false
  nodeName: appn01-089.app.paas.example.com
  targetRef:
    kind: Pod
    name: mbgateway-st-8576f6f6f8-8lp4s
    namespace: lb59-10-st-unigateway
    uid: 755cbd49-792b-4527-b96a-087be2178e9d
  zone: AZ61QEBIZ_AZ61QEM02_FD3

Check the OVN service LB again: the terminating pod's endpoint information is still present:
fceeaf8f-e747-4290-864c-ba93fb565a8a Service_lb59-10- tcp 172.35.0.56:31693 172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
bef42efd-26db-4df3-b99d-370791988053 Service_lb59-10- tcp 172.35.1.26:31693 172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
84172e2c-081c-496a-afec-25ebcb83cc60 Service_lb59-10- tcp 172.35.0.118:31693 172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
34412ddd-ab5c-4b6b-95a3-6e718dd20a4f Service_lb59-10- tcp 172.35.1.14:31693 172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190

Actual results:

Service LB endpoints are determined by the POD.status.condition[type=Serving] status.

Expected results:

Service LB endpoints should be determined by the POD.status.condition[type=Ready] status.

Additional info:

The ovn-controller determines whether an endpoint should be added to the Service Load Balancer (serviceLB) based on condition.serving. The current issue is that when a pod is in the terminating state, condition.serving remains true. Its value depends on the POD.status.condition[type=Ready] status being true.

However, when a pod is deleted, the endpointslice condition.serving state remains unchanged, and the backend pool of the service LB still includes the IP information of the deleted pod. Why doesn't ovn-controller use the condition.ready status to decide whether the pod's IP should be added to the service LB backend pool?

Could the shift-networking experts confirm whether this is an OpenShift OVN service LB bug or not?
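
To make the difference concrete, here is a small self-contained Go sketch that evaluates the terminating endpoint from the EndpointSlice above under both rules. The `endpointConditions` type is a local stand-in for `discovery.EndpointConditions`, and `isReady` assumes that a nil `ready` is treated as true; both are assumptions made for illustration and are not copied from ovn-kubernetes.

```go
package main

import "fmt"

// endpointConditions is a local stand-in for discovery.EndpointConditions,
// so the sketch compiles without Kubernetes dependencies.
type endpointConditions struct {
	Ready       *bool
	Serving     *bool
	Terminating *bool
}

func boolPtr(b bool) *bool { return &b }

// isReady assumes an unset Ready condition counts as ready.
func isReady(c endpointConditions) bool {
	return c.Ready == nil || *c.Ready
}

// isServing follows the shape of the IsEndpointServing snippet quoted in
// this report: use Serving when set, otherwise fall back to Ready.
func isServing(c endpointConditions) bool {
	if c.Serving != nil {
		return *c.Serving
	}
	return isReady(c)
}

func main() {
	// Conditions reported for the deleted pod 172.30.195.124.
	terminating := endpointConditions{
		Ready:       boolPtr(false),
		Serving:     boolPtr(true),
		Terminating: boolPtr(true),
	}

	fmt.Println("kept by serving rule:", isServing(terminating))
	fmt.Println("kept by ready rule:  ", isReady(terminating))
}
```

Under the serving rule the terminating endpoint stays in the backend list, which matches the load balancer entries shown above; under the ready rule it would be dropped.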

https://github.com/openshift/ovn-kubernetes/pull/2018

Bug OCPBUGS-25897: Failed to create secret on HyperShift Hosted Cluster when short-lived token is enabled by CCO.

Description of problem:

The hosted cluster's credentialsMode is not manual, so secrets cannot be created.
Currently the control plane credentialsMode is the same as the management cluster's, but for this feature it should be manual mode on the hosted cluster no matter what the credentialsMode of the management cluster is.

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Always

Steps to Reproduce:

   
    1. Create a CredentialsRequest that includes the spec.providerSpec.stsIAMRoleARN string.
    2. Observe that the Cloud Credential Operator does not populate a Secret based on the CredentialsRequest.

$ oc get secret -A | grep test-mihuang
#Secret not found.

$ oc get CredentialsRequest -n openshift-cloud-credential-operator
NAME                                                  AGE
...
test-mihuang                                               44s

Actual results:

    The secret is not created.

Expected results:

    Successfully created the secret on the hosted cluster.

Additional info:

https://github.com/openshift/hypershift/pull/3375

Bug OCPBUGS-24608: Whereabouts assignment error

Description of problem:

    Before: Warning  FailedCreatePodSandBox  8s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "82187d55b1379aad1e6c02b3394df7a8a0c84cc90902af413c1e0d9d56ddafb0": plugin type="multus" name="multus-cni-network" failed (add): [default/netshoot-deployment-59898b5dd9-hhvfn/89e6349b-9797-4e03-8828-ebafe224dfaf:whereaboutsexample]: error adding container to network "whereaboutsexample": error at storage engine: Could not allocate IP in range: ip: 2000::1 / - 2000::ffff:ffff:ffff:fffe / range: net.IPNet{IP:net.IP{0x20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, Mask:net.IPMask{0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}

After:
  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               6s    default-scheduler  Successfully assigned default/netshoot-deployment-59898b5dd9-kk2zm to whereabouts-worker
  Normal   AddedInterface          6s    multus             Add eth0 [10.244.2.2/24] from kindnet
  Warning  FailedCreatePodSandBox  6s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "23dd45e714db09380150b5df74be37801bf3caf73a5262329427a5029ef44db1": plugin type="multus" name="multus-cni-network" failed (add): [default/netshoot-deployment-59898b5dd9-kk2zm/142de5eb-9f8a-4818-8c5c-6c7c85fe575e:whereaboutsexample]: error adding container to network "whereaboutsexample": error at storage engine: Could not allocate IP in range: ip: 2000::1 / - 2000::ffff:ffff:ffff:fffe / range: 2000::/64 / excludeRanges: [2000::/32]
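The improved message makes the cause readable: the excluded range 2000::/32 contains the entire 2000::/64 allocation range, so no address is left to assign. A minimal sketch of that containment check with Python's standard `ipaddress` module follows. The helper name is hypothetical, it only detects the case where a single excluded CIDR covers the whole range, and it is not the Whereabouts implementation.

```python
import ipaddress


def covered_by_single_exclude(range_cidr, exclude_cidrs):
    """Return True when one excluded CIDR contains the whole allocation range."""
    pool = ipaddress.ip_network(range_cidr)
    return any(
        pool.subnet_of(ipaddress.ip_network(excluded))
        for excluded in exclude_cidrs
    )


# Values taken from the "After" event above.
print(covered_by_single_exclude("2000::/64", ["2000::/32"]))
```

For the values in the event, the check reports that the range is fully excluded, which is why allocation cannot succeed with that configuration.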

Fixed upstream in #366 https://github.com/k8snetworkplumbingwg/whereabouts/pull/366

https://github.com/openshift/whereabouts-cni/pull/211

Bug OCPBUGS-25539: Update 4.16 ose-ibm-vpc-block-csi-driver-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/102

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/102

Bug OCPBUGS-29423: https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#tabledata references PF4 classname

Description of problem:

https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#tabledata includes a reference to `pf-c-table__action`, but v1+ of console-dynamic-plugin-sdk requires PatternFly 5, so the reference should be updated to `pf-v5-c-table__action`.

https://github.com/openshift/console/pull/13606

Bug OCPBUGS-13106: e2e-gcp-operator loadbalancer service not going ready CI flakes

Description of problem:

Various jobs are failing in e2e-gcp-operator because the LoadBalancer-type Service does not become "ready", which most likely means it is not getting an IP address.

Tests so far affected are:
- TestUnmanagedDNSToManagedDNSInternalIngressController
- TestScopeChange
- TestInternalLoadBalancerGlobalAccessGCP
- TestInternalLoadBalancer
- TestAllowedSourceRanges

For example, in TestInternalLoadBalancer, the Load Balancer never comes back ready:

operator_test.go:1454: Expected conditions: map[Admitted:True Available:True DNSManaged:True DNSReady:True LoadBalancerManaged:True LoadBalancerReady:True]
         Current conditions: map[Admitted:True Available:False DNSManaged:True DNSReady:False Degraded:True DeploymentAvailable:True DeploymentReplicasAllAvailable:True DeploymentReplicasMinAvailable:True DeploymentRollingOut:False EvaluationConditionsDetected:False LoadBalancerManaged:True LoadBalancerProgressing:False LoadBalancerReady:False Progressing:False Upgradeable:True]

Here Available, DNSReady, and LoadBalancerReady are False where True was expected.
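When reading long condition maps like the ones in this test output, it can help to compute the mismatches rather than scan for them. The sketch below parses the `map[Key:Value ...]` text as printed in the log and lists the expected conditions that are not met. Both function names are hypothetical helpers for reading logs and are not part of the operator's test suite.

```python
def parse_conditions(text):
    """Parse text such as 'map[A:True B:False]' into a dict of strings."""
    inner = text.strip()
    if inner.startswith("map[") and inner.endswith("]"):
        inner = inner[len("map["):-1]
    return dict(pair.split(":", 1) for pair in inner.split())


def unmet_conditions(expected, current):
    """Return the expected conditions whose current status differs."""
    return {
        name: current.get(name)
        for name, status in expected.items()
        if current.get(name) != status
    }


expected = parse_conditions(
    "map[Admitted:True Available:True DNSManaged:True DNSReady:True "
    "LoadBalancerManaged:True LoadBalancerReady:True]"
)
current = parse_conditions(
    "map[Admitted:True Available:False DNSManaged:True DNSReady:False "
    "Degraded:True LoadBalancerManaged:True LoadBalancerReady:False]"
)
print(sorted(unmet_conditions(expected, current)))
```

The `current` map here is abbreviated from the log line above; conditions that are not in the expected set are ignored by the comparison.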

Version-Release number of selected component (if applicable):

4.14

How reproducible:

10% of the time

Steps to Reproduce:

1. Run e2e-gcp-operator many times until you see one of these failures

Actual results:

Test Failure

Expected results:

No test failures

Additional info:

Search.CI Links:
TestScopeChange
TestInternalLoadBalancerGlobalAccessGCP & TestInternalLoadBalancer

This does not seem related to https://issues.redhat.com/browse/OCPBUGS-6013. The DNS E2E tests actually pass this same condition check.

https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/329

Bug OCPBUGS-24271: Egress IP multi NIC: ipv6 does not work

https://github.com/openshift/ovn-kubernetes/pull/2048

Bug OCPBUGS-24839: Update 4.16 ose-route-controller-manager-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/route-controller-manager/pull/36

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/route-controller-manager/pull/36

Bug OCPBUGS-24856: Update 4.16 csi-node-driver-registrar-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/57

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-node-driver-registrar/pull/57

Bug OCPBUGS-29577: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/194

Bug OCPBUGS-25566: Update 4.16 ose-ibm-vpc-block-csi-driver-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/63

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ibm-vpc-block-csi-driver/pull/63

Task MON-3708: Fix integration job for telemeter

https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-telemeter-master-integration shows that the job fails frequently, although no recent change explains the failures.

It blocks merges to the telemeter repository for no valid reason.

https://github.com/openshift/telemeter/pull/509

Bug OCPBUGS-24848: Update 4.16 monitoring-plugin-container image to be consistent with ART

https://github.com/openshift/monitoring-plugin/pull/86

Bug OCPBUGS-27246: Unhealthy conditions table should put Type as first column on MachineHealthCheck details page

Description of problem:

The MachineHealthCheck details page has an 'Unhealthy Conditions' table whose first column is currently 'Status', but users care more about the Type than its Status.

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-01-16-113018

How reproducible:

Always

Steps to Reproduce:

1. Go to any MHC details page and check the 'Unhealthy Conditions' table.

Actual results:

In the table, 'Type' is the last column.

Expected results:

'Type' should be the first column, since it is the factor users care about most.

For comparison, the 'Conditions' table on the ClusterOperators details page uses the order Type -> Status -> other info, which is very user friendly.

Additional info:

https://github.com/openshift/console/pull/13576

Bug OCPBUGS-28741: Default release image does not fall within operator support window

Description of problem:

    When no release image is provided on a HostedCluster, the backend hypershift operator picks the latest OCP release image within the operator's support window.

Today this fails because of how the operator selects the default image. For example, hypershift operator 4.14 does not support 4.15, but 4.15.0.rc.3 is picked as the default release image today.

This results from assuming that release candidates would not be reported as the latest stable release. The filter used to pick the latest release needs to consider patch-level releases that precede the next y-stream release.
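The selection rule described above can be sketched as a filter over candidate versions: drop pre-releases, drop anything newer than the highest supported minor, and take the highest remaining patch release. This is a minimal illustration, not the HyperShift code. The function names are hypothetical, the support window is reduced to a major version plus a maximum minor (a lower bound is omitted), and both the `4.15.0-rc.3` and `4.15.0.rc.3` spellings are treated as release candidates.

```python
import re

# major.minor.patch, optionally followed by a pre-release suffix.
_VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:[-.](.+))?$")


def parse_version(version):
    """Split a version string into (major, minor, patch, prerelease-or-None)."""
    match = _VERSION.match(version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    major, minor, patch, prerelease = match.groups()
    return int(major), int(minor), int(patch), prerelease


def default_release(candidates, major, max_minor):
    """Pick the newest stable release that is not beyond the supported minor."""
    stable = []
    for version in candidates:
        vmajor, vminor, vpatch, prerelease = parse_version(version)
        if prerelease is not None:
            continue  # release candidates are never a default
        if vmajor != major or vminor > max_minor:
            continue  # outside the supported y-stream
        stable.append(((vmajor, vminor, vpatch), version))
    return max(stable)[1] if stable else None


releases = ["4.14.9", "4.14.10", "4.15.0-rc.3"]
print(default_release(releases, major=4, max_minor=14))
```

Comparing parsed integer tuples rather than raw strings matters here, because a plain string comparison would rank 4.14.9 above 4.14.10.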

Version-Release number of selected component (if applicable):

4.14

How reproducible:

100%

Steps to Reproduce:

    1. Create a self-managed HCP cluster and do not specify a release image.

Actual results:

    The HCP is rejected because the default release image that was picked does not fall within the support window.

Expected results:

    The HCP should be created with the latest release image in the support window.

Additional info:

https://github.com/openshift/hypershift/pull/3450

Bug OCPBUGS-24964: Update 4.16 ose-csi-external-snapshotter-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/128

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-snapshotter/pull/128

Bug OCPBUGS-25629: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-nutanix/pull/26

Bug OCPBUGS-25633: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-autoscaler-operator/pull/306

Task MON-3705: Bump jsonnet dependencies

https://github.com/openshift/cluster-monitoring-operator/pull/2208

Bug OCPBUGS-28835: Failed to watch Metal3Remediation template

Description of problem:

NHC failed to watch Metal3 remediation template

Version-Release number of selected component (if applicable):

OCP 4.13 and higher

How reproducible:

    100%

Steps to Reproduce:

    1. Create Metal3RemediationTemplate
    2. Install NHC v0.7.0
    3. Create NHC with Metal3RemediationTemplate

Actual results:

E0131 14:07:51.603803 1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3RemediationTemplate: failed to list infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3RemediationTemplate: metal3remediationtemplates.infrastructure.cluster.x-k8s.io is forbidden: User "system:serviceaccount:openshift-workload-availability:node-healthcheck-controller-manager" cannot list resource "metal3remediationtemplates" in API group "infrastructure.cluster.x-k8s.io" at the cluster scope

E0131 14:07:59.912283 1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3Remediation: unknown

W0131 14:08:24.831958 1 reflector.go:539] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: failed to list infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3RemediationTemplate: metal3remediationtemplates.infrastructure.cluster.x-k8s.io is forbidden: User "system:serviceaccount:openshift-workload-availability:node-healthcheck-controller-manager" cannot list resource

Expected results:

    No errors

Additional info:

https://github.com/openshift/cluster-api-provider-baremetal/pull/209

Bug OCPBUGS-24875: Update 4.16 ose-gcp-cluster-api-controllers-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api-provider-gcp/pull/215

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api-provider-gcp/pull/215

Bug OCPBUGS-29546: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-node-tuning-operator/pull/958

Task MGMT-14159: Migrate ImageContentSourcePolicy to ImageDigestMirrorSet

Use the new CRD when available, as the current one is being deprecated

https://github.com/openshift/assisted-service/pull/5799

Bug OCPBUGS-24894: Update 4.16 ose-powervs-machine-controllers-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/machine-api-provider-powervs/pull/67

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/machine-api-provider-powervs/pull/67

Bug OCPBUGS-28666: openshift/openshift-controller-manager-operator - replace 'coreydaley' with 'sayan-biswas' in OWNERS file

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/326

Bug OCPBUGS-22221: DNS pods stuck in CrashLoopBackOff on upgrade of cluster using Hybrid Networking

Description of problem:

When upgrading clusters to any 4.13 version (from either 4.12 or 4.13), clusters with Hybrid Networking enabled appear to have a few DNS pods (not all) falling into CrashLoopBackOff status. Notably, pods are failing Readiness probes, and when deleted, work without incident. Nearly all other aspects of upgrade continue as expected

Version-Release number of selected component (if applicable):

4.13.x

How reproducible:

Always for systems in use

Steps to Reproduce:

1. Configure initial cluster installation with OVNKubernetes and enable Hybrid Networking on 4.12 or 4.13, e.g. 4.13.13
2. Upgrade cluster to 4.13.z, e.g. 4.13.14

Actual results:

dns-default pods in CrashLoopBackOff status, failing Readiness probes

Expected results:

dns-default pods are rolled out without incident

Additional info:

Appears strongly related to OCPBUGS-13172. The customer has kept an affected cluster, with the DNS pod issue still present, available for additional investigation if needed.

https://github.com/openshift/ovn-kubernetes/pull/2038

Bug OCPBUGS-24914: Update 4.16 prometheus-config-reloader-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/prometheus-operator/pull/265

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug USHIFT-2256: OVN-K is flaking in periodic nightlies

Description of problem:

Pre-test greenboot checks fail during the scenario run because OVN-K pods report a "failed" status.

Version-Release number of selected component (if applicable):

I believe this is only affecting `periodic-ci-openshift-microshift-main-ocp-metal-nightly` jobs.

How reproducible:

Unsure.  Has occurred 2 times in consecutive daily-periodic jobs.

Steps to Reproduce:

n/a

Actual results:

- https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-microshift-main-ocp-metal-nightly/1753221880392716288
- https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-microshift-main-ocp-metal-nightly/1753403067350388736

Expected results:

OVN-K Pods should deploy into a healthy state

Additional info:

https://github.com/openshift/ovn-kubernetes/pull/2051

Story MGMT-16649: Assisted should pass correct headers when pulling ignition from hypershift

Hypershift's ignition server expects certain headers. Assisted needs to gather all of this data and pass it when fetching the ignition from hypershift's ignition server.

Work items:

- Modify the DB host to contain ignition information
- Modify the agent kube API and agent controller to get the additional information needed (targetconfighash and nodepool name) and store it in the DB host
- Use the headers when fetching the ignition

Bug OCPBUGS-24965: Update 4.16 ose-ovn-kubernetes-base-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/1978

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/ovn-kubernetes/pull/1978

Bug OCPBUGS-30488: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/798

Bug OCPBUGS-27211: IPv6 ETP=Local Services broken on LGW

Description of problem:

    See https://issues.redhat.com/browse/OCPBUGS-26053

Version-Release number of selected component (if applicable):

How reproducible:

    Always

Steps to Reproduce:

    1. Create an ETP=Local LB Service on LGW for some v6 workload (assign IP to lb with MetalLB or manually)
    2. Set static routes to a node hosting a pod on the client
    3. Attempt to reach the IPv6 Service; the attempt fails

Actual results:

Expected results:

Additional info:

https://github.com/openshift/ovn-kubernetes/pull/2018

Bug OCPBUGS-28261: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/csi-driver-shared-resource/pull/164

Task MON-3667: remove outdated documentation in the CMO repository

https://github.com/openshift/cluster-monitoring-operator/pull/2232

Bug OCPBUGS-24967: Update 4.16 ose-cluster-control-plane-machine-set-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/268

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/268

Bug OCPBUGS-23729: Multus admission controller should honour Hypershift controllerAvailabilityPolicy

https://github.com/openshift/cluster-network-operator/pull/2159

Bug OCPBUGS-26087: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/csi-driver-shared-resource-operator/pull/95

Bug OCPBUGS-29114: Installer creates CPMS incorrectly for vSphere IPI when static IPs are configured

Description of problem:

When installing a new vSphere cluster with static IPs, control plane machine sets (CPMS) are also enabled in TechPreviewNoUpgrade, and the installer applies an incorrect config to the CPMS, which results in masters being recreated.

Version-Release number of selected component (if applicable):

4.15

How reproducible:

always

Steps to Reproduce:

1. Create install-config.yaml with static IPs, following the documentation
2. Run `openshift-install create cluster`
3. As the install progresses, watch the machine definitions

Actual results:

new master machines are created

Expected results:

all machines are the same as what was created by the installer.

Additional info:

Bug OCPBUGS-24820: Update 4.16 ose-baremetal-installer-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/installer/pull/7817

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/installer/pull/7817

Bug OCPBUGS-24925: Update 4.16 ose-ovirt-machine-controllers-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/cluster-api-provider-ovirt/pull/176

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-api-provider-ovirt/pull/176

Bug OCPBUGS-26115: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-azure/pull/103

Bug OCPBUGS-26121: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-capi-operator/pull/156

Bug OCPBUGS-27509: Autoscaler should scale-from zero MachineSets that declare taints

Description of problem

When a MachineAutoscaler references a currently-zero-Machine MachineSet that includes spec.template.spec.taints, the autoscaler fails to deserialize that MachineSet, which causes it to fail to autoscale that MachineSet. The autoscaler's deserialization logic should be improved to avoid failing on the presence of taints.
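One way to make such deserialization tolerant is to read each taint from its generic key/value mapping, which is the shape the warning in the logs prints (`map[effect:NoSchedule key:... value:ci]`), and to skip only the entries that are actually malformed. The sketch below illustrates the idea in Python for brevity; the autoscaler itself is written in Go, and the `Taint` class, the function name, and the validation rules here are assumptions for illustration rather than the fix that was merged.

```python
from dataclasses import dataclass

# Taint effects defined by Kubernetes.
VALID_EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


def taints_from_unstructured(raw):
    """Convert decoded taint mappings into Taint objects, skipping bad entries."""
    taints = []
    for item in raw or []:
        if not isinstance(item, dict):
            continue  # not a mapping, cannot be a taint
        key = item.get("key")
        effect = item.get("effect")
        if not isinstance(key, str) or not isinstance(effect, str):
            continue  # key and effect are required
        if effect not in VALID_EFFECTS:
            continue
        value = item.get("value", "")
        taints.append(Taint(key, effect, value if isinstance(value, str) else ""))
    return taints


# The taint patched into the MachineSet in the reproduction steps.
raw = [{"effect": "NoSchedule", "key": "node-role.kubernetes.io/ci", "value": "ci"}]
print(taints_from_unstructured(raw))
```

A missing `taints` field, which the reproduction shows as `null`, yields an empty list instead of an error, so MachineSets without taints behave as before.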

Version-Release number of selected component

Reproduced on 4.14.10 and 4.16.0-ec.1. Expected to be every release going back to at least 4.12, based on code inspection.

How reproducible

Always.

Steps to Reproduce

With a `launch 4.14.10 gcp` Cluster Bot cluster (logs):

$ oc adm upgrade
Cluster version is 4.14.10

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.14 (available channels: candidate-4.14, candidate-4.15)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
$ oc -n openshift-machine-api get machinesets.machine.openshift.io
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-s48f02k-72292-5z2hn-worker-a   1         1         1       1           29m
ci-ln-s48f02k-72292-5z2hn-worker-b   1         1         1       1           29m
ci-ln-s48f02k-72292-5z2hn-worker-c   1         1         1       1           29m
ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             29m

Pick that set with 0 nodes. They don't come with taints by default:

$ oc -n openshift-machine-api get -o json machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f | jq '.spec.template.spec.taints'
null

So patch one in:

$ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "add", "path": "/spec/template/spec/taints", "value": [{"effect":"NoSchedule","key":"node-role.kubernetes.io/ci","value":"ci"}]}]'
machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched

And set up autoscaling:

$ cat cluster-autoscaler.yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  maxNodeProvisionTime: 30m
  scaleDown:
    enabled: true
$ oc apply -f cluster-autoscaler.yaml 
clusterautoscaler.autoscaling.openshift.io/default created

I'm not all that familiar with autoscaling. Maybe the ClusterAutoscaler doesn't matter, and you need a MachineAutoscaler aimed at the chosen MachineSet?

$ cat machine-autoscaler.yaml 
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: test
  namespace: openshift-machine-api
spec:
  maxReplicas: 2
  minReplicas: 1
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: ci-ln-s48f02k-72292-5z2hn-worker-f
$ oc apply -f machine-autoscaler.yaml 
machineautoscaler.autoscaling.openshift.io/test created

Checking the autoscaler's logs:

$ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail -1 | grep taint
W0122 19:18:47.246369       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
W0122 19:18:58.474000       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
W0122 19:19:09.703748       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
W0122 19:19:20.929617       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
...

And the MachineSet is failing to scale:

$ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             50m

While if I remove the taint:

$ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "remove", "path": "/spec/template/spec/taints"}]'
machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched

The autoscaler... well, it's not scaling up new Machines like I'd expected, but at least it seems to have calmed down about the taint deserialization issue:

$ oc -n openshift-machine-api get machines.machine.openshift.io
NAME                                       PHASE     TYPE                REGION        ZONE            AGE
ci-ln-s48f02k-72292-5z2hn-master-0         Running   e2-custom-6-16384   us-central1   us-central1-a   53m
ci-ln-s48f02k-72292-5z2hn-master-1         Running   e2-custom-6-16384   us-central1   us-central1-b   53m
ci-ln-s48f02k-72292-5z2hn-master-2         Running   e2-custom-6-16384   us-central1   us-central1-c   53m
ci-ln-s48f02k-72292-5z2hn-worker-a-fwskf   Running   e2-standard-4       us-central1   us-central1-a   45m
ci-ln-s48f02k-72292-5z2hn-worker-b-qkwlt   Running   e2-standard-4       us-central1   us-central1-b   45m
ci-ln-s48f02k-72292-5z2hn-worker-c-rlw4m   Running   e2-standard-4       us-central1   us-central1-c   45m
$ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             53m
$ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail 50
I0122 19:23:17.284762       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:23:17.687036       1 legacy.go:296] No candidates for scale down
W0122 19:23:27.924167       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
I0122 19:23:28.510701       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:23:28.909507       1 legacy.go:296] No candidates for scale down
W0122 19:23:39.148266       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
I0122 19:23:39.737359       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:23:40.135580       1 legacy.go:296] No candidates for scale down
W0122 19:23:50.376616       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
I0122 19:23:50.963064       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:23:51.364313       1 legacy.go:296] No candidates for scale down
W0122 19:24:01.601764       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
I0122 19:24:02.191330       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:02.589766       1 legacy.go:296] No candidates for scale down
I0122 19:24:13.415183       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:13.815851       1 legacy.go:296] No candidates for scale down
I0122 19:24:24.641190       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:25.040894       1 legacy.go:296] No candidates for scale down
I0122 19:24:35.867194       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:36.266400       1 legacy.go:296] No candidates for scale down
I0122 19:24:47.097656       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:47.498099       1 legacy.go:296] No candidates for scale down
I0122 19:24:58.326025       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:58.726034       1 legacy.go:296] No candidates for scale down
I0122 19:25:04.927980       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0122 19:25:04.938213       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 10.036399ms
I0122 19:25:09.552086       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:09.952094       1 legacy.go:296] No candidates for scale down
I0122 19:25:20.778317       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:21.178062       1 legacy.go:296] No candidates for scale down
I0122 19:25:32.005246       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:32.404966       1 legacy.go:296] No candidates for scale down
I0122 19:25:43.233637       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:43.633889       1 legacy.go:296] No candidates for scale down
I0122 19:25:54.462009       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:54.861513       1 legacy.go:296] No candidates for scale down
I0122 19:26:05.688410       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:06.088972       1 legacy.go:296] No candidates for scale down
I0122 19:26:16.915156       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:17.315987       1 legacy.go:296] No candidates for scale down
I0122 19:26:28.143877       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:28.543998       1 legacy.go:296] No candidates for scale down
I0122 19:26:39.369085       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:39.770386       1 legacy.go:296] No candidates for scale down
I0122 19:26:50.596923       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:50.997262       1 legacy.go:296] No candidates for scale down
I0122 19:27:01.823577       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:27:02.223290       1 legacy.go:296] No candidates for scale down
I0122 19:27:04.938943       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0122 19:27:04.947353       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 8.319938ms

Actual results

Scale-from-zero MachineAutoscaler fails on taint-deserialization when the referenced MachineSet contains spec.template.spec.taints.

Expected results

Scale-from-zero MachineAutoscaler works, even when the referenced MachineSet contains spec.template.spec.taints.
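
As a rough illustration of the expected behaviour, the sketch below converts the kind of unstructured taint map shown in the warning above (effect, key, value) into a typed record. It is not the autoscaler's Go code: the `Taint` class and the `taint_from_unstructured` helper are hypothetical names used only for this example, and the validation rules are an assumption based on key and effect being required taint fields.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class Taint:
    """Hypothetical typed view of one spec.template.spec.taints entry."""
    key: str
    effect: str
    value: str = ""


def taint_from_unstructured(data: Mapping[str, Any]) -> Taint:
    """Convert an unstructured taint map into a Taint.

    Raises ValueError when a required field is missing or is not a string.
    """
    for field in ("key", "effect"):
        if not isinstance(data.get(field), str) or not data[field]:
            raise ValueError(f"taint is missing required field {field!r}: {dict(data)}")
    value = data.get("value", "")
    if not isinstance(value, str):
        raise ValueError(f"taint value must be a string: {dict(data)}")
    return Taint(key=data["key"], effect=data["effect"], value=value)


# The taint map that appears in the warning in the log above.
ci_taint = taint_from_unstructured(
    {"effect": "NoSchedule", "key": "node-role.kubernetes.io/ci", "value": "ci"}
)
```

A conversion along these lines accepts the map from the log instead of rejecting it, which is what scale-from-zero needs in order to account for the taints of a node it has not created yet.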

https://github.com/openshift/kubernetes-autoscaler/pull/281

Task MON-3671: update OWNERS file for configmap-reloader

https://github.com/openshift/configmap-reload/pull/55

Bug OCPBUGS-24860: Update 4.16 ose-egress-router-cni-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/egress-router-cni/pull/79

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/egress-router-cni/pull/79

Bug OCPBUGS-22602: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cloud-provider-ibm/pull/59

Story TRT-1485: 4.16 CI payloads failed on an Alert we've never seen before

Our test that watches for alerts we have never seen before has picked something up on https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-02-02-031544

[sig-trt][invariant] No new alerts should be firing 0s

{ Found alerts firing which are new or less than two weeks old, which should not be firing: PrometheusOperatorRejectedResources has no test data, this alert appears new and should not be firing}

It hit about 3-4 out of 10 runs on both the Azure and AWS aggregated jobs. It could be a regression, or it could be something really rare.

https://github.com/openshift/cluster-etcd-operator/pull/1193

Bug OCPBUGS-23386: Unable to run oc commands on RHEL9 Host with FIPS enabled OCP cluster

Description of problem:

Unable to run oc commands in a FIPS-enabled OCP cluster on PowerVS.

Version-Release number of selected component (if applicable):

4.15.0-ec2

How reproducible:

Deploy OCP cluster with FIPS enabled

Steps to Reproduce:

1. Set the variable in var.tfvars: fips_compliant = true
2. Deploy the cluster
3. Run oc commands

Actual results:

[root@rdr-swap-fips-syd05-bastion-0 ~]# oc version
FIPS mode is enabled, but the required OpenSSL library is not available

[root@rdr-swap-fips-syd05-bastion-0 ~]# oc debug node/syd05-master-0.rdr-swap-fips.ibm.com
FIPS mode is enabled, but the required OpenSSL library is not available

[root@rdr-swap-fips-syd05-bastion-0 ~]# fips-mode-setup --check
FIPS mode is enabled.

Expected results:

# oc debug node/syd05-master-0.rdr-swap-fips1.ibm.com
Temporary namespace openshift-debug-dns7d is created for debugging node...
Starting pod/syd05-master-0rdr-swap-fips1ibmcom-debug-hs4dr ...
To use host binaries, run `chroot /host`
Pod IP: 193.168.200.9

Additional info:

Unable to collect must-gather logs because of this issue.

Links: https://access.redhat.com/solutions/7034387

Bug OCPBUGS-24299: Add permission to network-node-identity secrets

The CNO-managed component (network-node-identity) must conform to the HyperShift control plane expectation that all secrets are mounted without global read permission. Change the mode from 420 (0644) to 416 (0640).
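
The two numbers are the same file modes written in decimal and in octal; JSON has no octal literals, so the modes appear in decimal form there. The short check below, a sketch with the hypothetical helper `world_readable`, confirms the arithmetic and shows that only the "other" read bit changes.

```python
import stat

# Octal file modes and their decimal equivalents.
OLD_MODE = 0o644  # rw-r--r-- : owner read/write, group read, others read
NEW_MODE = 0o640  # rw-r----- : owner read/write, group read, others nothing


def world_readable(mode: int) -> bool:
    """Return True if the 'others' read bit is set in the mode."""
    return bool(mode & stat.S_IROTH)


print(OLD_MODE, NEW_MODE)  # prints "420 416"
print(world_readable(OLD_MODE), world_readable(NEW_MODE))  # prints "True False"
```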

https://github.com/openshift/cluster-network-operator/pull/2142

Bug OCPBUGS-24904: Update 4.16 ose-machine-api-provider-openstack-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/machine-api-provider-openstack/pull/100

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/machine-api-provider-openstack/pull/100

Bug OCPBUGS-29176: ART requests updates to 4.16 image ose-oauth-apiserver-container

Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/101

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/oauth-apiserver/pull/101

Bug OCPBUGS-24913: Update 4.16 openshift-enterprise-console-operator-container image to be consistent with ART

Please review the following PR: https://github.com/openshift/console-operator/pull/823

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-26986: whereabouts reconciler schedule is not configurable

Description of problem:

The whereabouts reconciler is responsible for reclaiming dangling IPs and freeing them for allocation to new pods.
This is crucial for scenarios where the number of addresses is limited and dangling IPs prevent whereabouts from allocating IPs to new pods.

The reconciliation schedule is currently hard-coded to run once a day, with no user-friendly way to configure it.

Version-Release number of selected component (if applicable):

How reproducible:

    Create a Whereabouts reconciler daemon set; the reconciler schedule cannot be configured.

Steps to Reproduce:

    1. Create a Whereabouts reconciler daemonset
       Instructions: https://docs.openshift.com/container-platform/4.14/networking/multiple_networks/configuring-additional-network.html#nw-multus-creating-whereabouts-reconciler-daemon-set_configuring-additional-network

    2. Run `oc get pods -n openshift-multus | grep whereabouts-reconciler`

    3. Run `oc logs whereabouts-reconciler-xxxxx`

Actual results:

    You can't configure the cron-schedule of the reconciler.

Expected results:

    Be able to modify the reconciler cron schedule.

Additional info:

    The fix for this bug is in two places: whereabouts and cluster-network-operator.
    For this reason, correct verification requires both fixed components.
    See below for details on how to apply the new configuration.

How to Verify:

    Create a whereabouts-config ConfigMap with a custom value, then check the
    whereabouts-reconciler pods' logs to confirm that the schedule is updated and that the cleanup is triggered.

Steps to Verify:

    1. Create a Whereabouts reconciler daemonset.
    2. Wait for the whereabouts-reconciler pods to be running (the daemonset takes some time to be created).
    3. Look in the logs for: "[error] could not read file: <nil>, using expression from flatfile: 30 4 * * *"
       This means the reconciler is using the hard-coded default value, because no ConfigMap exists yet.
    4. Run: oc create configmap whereabouts-config -n openshift-multus --from-literal=reconciler_cron_expression="*/2 * * * *"
    5. Check the logs for: "successfully updated CRON configuration"
    6. Check that the reconciler runs within the next 2 minutes: "[verbose] starting reconciler run"
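
To make the two cron expressions in the steps above concrete, here is a minimal sketch that checks whether a schedule fires at a given minute. It is not the reconciler's implementation: `field_matches` and `fires_at` are hypothetical helpers, and they only understand `*`, `*/n`, and a single integer in the minute and hour fields, with the three date fields left as `*`.

```python
from datetime import datetime

DEFAULT_SCHEDULE = "30 4 * * *"  # default reported in the reconciler log
CUSTOM_SCHEDULE = "*/2 * * * *"  # value written to the ConfigMap in step 4


def field_matches(field: str, value: int) -> bool:
    """Match a minute or hour field: '*', '*/n', or a single integer."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value == int(field)


def fires_at(expression: str, when: datetime) -> bool:
    """Return True if the five-field cron expression fires at this minute."""
    minute, hour, day, month, weekday = expression.split()
    if (day, month, weekday) != ("*", "*", "*"):
        raise ValueError("this sketch only handles '* * *' date fields")
    return field_matches(minute, when.minute) and field_matches(hour, when.hour)


def runs_per_day(expression: str) -> int:
    """Count how many minutes of one day the expression fires on."""
    return sum(
        fires_at(expression, datetime(2024, 1, 22, hour, minute))
        for hour in range(24)
        for minute in range(60)
    )
```

With these helpers, `runs_per_day(DEFAULT_SCHEDULE)` is 1 (the once-a-day run at 04:30) and `runs_per_day(CUSTOM_SCHEDULE)` is 720, which is why a reconciler run should appear in the logs within two minutes of the ConfigMap being picked up.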

Requirements	Notes	IS MVP
Discover new offerings in Home Dashboard		Y
Access details outlining value of offerings		Y
Access step-by-step guide to install offering		N
Allow developers to easily find and use newly installed offerings		Y
Support air-gapped clusters		Y

4.16.0-0.okd-2024-03-23-225022

Changes from 4.15.0-0.okd-scos-2024-04-25-183708

Complete Features
