Feature OCPPLAN-4333: The details of this Jira Card are restricted (Only Red Hat employees and contractors)

View the Description

The details of this Jira Card are restricted (Only Red Hat employees and contractors)

Feature OCPPLAN-5083: The details of this Jira Card are restricted (Only Red Hat employees and contractors)

View the Description

The details of this Jira Card are restricted (Only Red Hat employees and contractors)

Epic NE-414: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story NE-425: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-ingress-operator/pull/477

Feature OCPPLAN-7784: Storage Maintainability

View the Description

We need to continue to maintain specific areas within storage, this is to capture that effort and track it across releases.

Goals

To allow OCP users and cluster admins to detect problems early and with as little interaction with Red Hat as possible.
When Red Hat is involved, make sure we have all the information we need from the customer, i.e. in metrics / telemetry / must-gather.
Reduce storage test flakiness so we can spot real bugs in our CI.

Requirements

Requirement	Notes	isMvp?
Telemetry		No
Certification		No
API metrics		No

Out of Scope

n/a

Background, and strategic fit
With the expected scale of our customer base, we want to keep load of customer tickets / BZs low

Assumptions

Customer Considerations

Documentation Considerations

Target audience: internal
Updated content: none at this time.

Notes

In progress:

CSI certification flakes a lot. We should fix it before we start testing migration.
- In progress (API server restarts...) https://bugzilla.redhat.com/show_bug.cgi?id=1865857

Get local-storage-operator and AWS EBS CSI driver operator logs in must-gather (OLM-managed operators are not included there)
- In progress for LSO (must-gather script being included in image) https://bugzilla.redhat.com/show_bug.cgi?id=1756096

CI flakes:
- Configurable timeouts for e2e tests
  - Azure is slow and times out often
  - Cinder times out formatting volumes
  - AWS resize test times out

High prio:

Env. check tool for VMware - users often mis-configure permissions there and blame OpenShift. If we had a tool they could run, it might report better errors.
- Should it be part of the installer?
- Spike exists

Add / use cloud API call metrics

- Helps customers to understand why things are slow
- Helps build cop to understand a flake
  - With a post-install step that filters data from Prometheus that’s still running in the CI job.
- Ideas:
  - Cloud is throttling X% of API calls longer than Y seconds
  - Attach / detach / provisioning / deletion / mount / unmount / resize takes longer than X seconds?
- Capture metrics of operations that are stuck and won’t finish.
  - Sweep operation map from executioner???
  - Report operation metric into the highest bucket after the bucket threshold (i.e. if 10minutes is the last bucket, report an operation into this bucket after 10 minutes and don’t wait for its completion)?
  - Ask the monitoring team?
- Include in CSI drivers too.
  - With alerts too

Report events for cloud issues
- E.g. cloud API reports weird attach/provision error (e.g. due to outage)
What volume plugins actually users use the most? https://issues.redhat.com/browse/STOR-324

Unsorted

As the number of storage operators grows, it would be grafana board for storage operators
- CSI driver metrics (from CSI sidecars + the driver itself + its operator?)
- CSI migration?

Get aggregated logs in cluster
- They're rotated too soon
- No logs from dead / restarted pods
- No tools to combine logs from multiple pods (e.g. 3 controller managers)

What storage issues customers have? it was 22% of all issues.
- Insufficient docs?
- Probably garbage

Document basic storage troubleshooting for our supports
- What logs are useful when, what log level to use
- This has been discussed during the GSS weekly team meeting; however, it would be beneficial to have this documented.

Common vSphere errors, their debugging and fixing.

Document sig-storage flake handling - not all failed [sig-storage] tests are ours

Epic STOR-857: OCP 4.12 release chores

View the Description

Epic Goal

Update all images that we ship with OpenShift to the latest upstream releases and libraries.
Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF). Trying no-feature-freeze in 4.12. We will try to do as much as we can before FF, but we're quite sure something will slip past FF as usual.

Why is this important?

We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

Story STOR-858: Chore: update libraries in all operators

View the Description View the linked PRs

Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.

This includes (but is not limited to):

Kubernetes:
- client-go
- controller-runtime
OCP:
- library-go
- openshift/api
- openshift/client-go
- operator-sdk

Operators:

aws-ebs-csi-driver-operator
aws-efs-csi-driver-operator
azure-disk-csi-driver-operator
azure-file-csi-driver-operator
openstack-cinder-csi-driver-operator
gcp-pd-csi-driver-operator
gcp-filestore-csi-driver-operator
manila-csi-driver-operator
ovirt-csi-driver-operator
vmware-vsphere-csi-driver-operator
alibaba-disk-csi-driver-operator
ibm-vpc-block-csi-driver-operator
csi-driver-shared-resource-operator

cluster-storage-operator
csi-snapshot-controller-operator
local-storage-operator
vsphere-problem-detector

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/161

Story STOR-859: Chore: update CSI sidecars

View the Description View the linked PRs

Update all CSI sidecars to the latest upstream release.

external-attacher
external-provisioner
external-resizer
external-snapshotter
node-driver-registrar
livenessprobe

This includes update of VolumeSnapshot CRDs in https://github.com/openshift/cluster-csi-snapshot-controller-operator/tree/master/assets

https://github.com/openshift/csi-external-resizer/pull/132

Feature OCPPLAN-7786: CSI Migration

View the Description

Epic Goal

Enable the migration from a storage intree driver to a CSI based driver with minimal impact to the end user, applications and cluster
These migrations would include, but are not limited to:
- CSI driver for AWS EBS
- CSI driver for GCP
- CSI driver for Azure (file and disk)
- CSI driver for VMware vSphere

Why is this important?

OpenShift needs to maintain it's ability to enable PVCs and PVs of the main storage types
CSI Migration is getting close to GA, we need to have the feature fully tested and enabled in OpenShift
Upstream intree drivers are being deprecated to make way for the CSI drivers prior to intree driver removal

Scenarios

User initiated move to from intree to CSI driver
Upgrade initiated move from intree to CSI driver
Upgrade from EUS to EUS

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Epic STOR-575: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story STOR-764: Change the default StorageClass to the CSI one

View the Description View the linked PRs

On new installations, we should make the StorageClass created by the CSI operator the default one.

However, we shouldn't do that on an upgrade scenario. The main reason is that users might have set a different quota on the CSI driver Storage Class.

Exit criteria:

New clusters get the CSI Storage Class as the default one.
Existing clusters don't get their default Storage Classes changed.

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/160

Feature OCPPLAN-8108: The details of this Jira Card are restricted (Only Red Hat employees and contractors)

View the Description

The details of this Jira Card are restricted (Only Red Hat employees and contractors)

Epic OCPBUILD-30: Build Rebases OCP 4.11

View the Description

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

Rebase OpenShift components to k8s v1.24

Why is this important?

Rebasing ensures components work with the upcoming release of Kubernetes
Address tech debt related to upstream deprecations and removals.

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

k8s 1.24 release

Previous Work (Optional):

Open questions::

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story BUILD-417: Rebase openshift-controller-manager to k8s 1.24

View the Description View the linked PRs

Rebase openshift-controller-manager to k8s 1.24

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/250

Story BUILD-418: Rebase openshift-controller-manager-operator to k8s 1.24

View the linked PRs

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/242

Feature OCPSTRAT-103: Ensuring the Control Plane is Fully Decoupled for Hosted Control Planes

View the Description

Why?

Decouple control and data plane.
- Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.
Improve security
- Shift credentials out of cluster that support the operation of core platform vs workload
Improve cost
- Allow a user to toggle what they don’t need.
- Ensure a smooth path to scale to 0 workers and upgrade with 0 workers.

Assumption

A customer will be able to associate a cluster as “Infrastructure only”
E.g. one option: management cluster has role=master, and role=infra nodes only, control planes are packed on role=infra nodes
OR the entire cluster is labeled infrastructure , and node roles are ignored.

Anything that runs on a master node by default in Standalone that is present in HyperShift MUST be hosted and not run on a customer worker node.

Doc: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit

Epic STOR-960: Ensure openshift cluster-storage-operator & aws-ebs-csi-driver-operator are running on hosted control planes

View the Description

Overview

Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.

Assumption

A customer will be able to associate a cluster as “Infrastructure only”
E.g. one option: management cluster has role=master, and role=infra nodes only, control planes are packed on role=infra nodes
OR the entire cluster is labeled infrastructure, and node roles are ignored.
Anything that runs on a master node by default in Standalone that is present in HyperShift MUST be hosted and not run on a customer worker node.

DoD

Run cluster-storage-operator (CSO) + AWS EBS CSI driver operator + AWS EBS CSI driver control-plane Pods in the management cluster, run the driver DaemonSet in the hosted cluster.

More information here: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit

Story STOR-1040: Update AWS EBS CSI driver operator for HyperShift

View the Description View the linked PRs

As HyperShift Cluster Instance Admin, I want to run AWS EBS CSI driver operator + control plane of the CSI driver in the management cluster, so the guest cluster runs just my applications.

Add a new cmdline option for the guest cluster kubeconfig file location

Parse both kubeconfigs:
- One from projected service account, which leads to the management cluster.
- Second from the new cmdline option introduced above. This one leads to the guest cluster.

Only on HyperShift:

- When interacting with Kubernetes API, carefully choose the right kubeconfig to watch / create / update objects in the right cluster.

- Replace namespaces in all Deployments and other objects that are created in the management cluster. They must be created in the same namespace as the operator.

- Pass only the guest kubeconfig to the operand (control-plane Deployment of the CSI driver).

Exit criteria:

Control plane Deployment of AWS EBS CSI driver runs in the management cluster in HyperShift.
Storage works in the guest cluster.

No regressions in standalone OCP.

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/159

Feature OCPSTRAT-158: Default Storage Class Management

View the Description

Feature Goal*

What is our purpose in implementing this? What new capability will be available to customers?

The goal of this feature is to provide a consistent, predictable and deterministic approach on how the default storage class(es) is managed.

Why is this important? (mandatory)

The current default storage class implementation has corner cases which can result in PVs staying in pending because there is either no default storage class OR multiple storage classes are defined

Scenarios (mandatory)

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.

No default storage class

In some cases there is no default SC defined, this can happen during OCP deployment where components such as the registry request a PV whereas the SC are not been defined yet. This can also happen during a change in default SC, there won't be any between the admin unset the current one and set the new on.

The admin marks the current default SC1 as non-default.

Another user creates PVC requesting a default SC, by leaving pvc.spec.storageClassName=nil. The default SC does not exist at this point, therefore the admission plugin leaves the PVC untouched with pvc.spec.storageClassName=nil.
The admin marks SC2 as default.
PV controller, when reconciling the PVC, updates pvc.spec.storageClassName=nil to the new SC2.
PV controller uses the new SC2 when binding / provisioning the PVC.

The installer creates PVC for the image registry first, requesting the default storage class by leaving pvc.spec.storageClassName=nil.

The installer creates a default SC.
PV controller, when reconciling the PVC, updates pvc.spec.storageClassName=nil to the new default SC.
PV controller uses the new default SC when binding / provisioning the PVC.

Multiple Storage Classes

In some cases there are multiple default SC, this can be an admin mistake (forget to unset the old one) or during the period where a new default SC is created but the old one is still present.

New behavior:

Create a default storage class A
Create a default storage class B
Create PVC with pvc.spec.storageCLassName = nil

-> the PVC will get the default storage class with the newest CreationTimestamp (i.e. B) and no error should show.

-> admin will get an alert that there are multiple default storage classes and they should do something about it.

CSI that are shipped as part of OCP

The CSI drivers we ship as part of OCP are deployed and managed by RH operators. These operators automatically create a default storage class. Some customers don't like this approach and prefer to:

Create their own default storage class
Have no default storage class in order to disable dynamic provisioning

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic.

No external dependencies.

Contributing Teams(and contacts) (mandatory)

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

Development - STOR
Documentation - STOR
QE - STOR
PX -
Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.

Drawbacks or Risk (optional)

Can bring confusion to customer as there is a change in the default behavior customer are used to. This needs to be carefully documented.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

CI Testing - Basic e2e automationTests are merged and completing successfully
Documentation - Content development is complete.
QE - Test scenarios are written and executed successfully.
Technical Enablement - Slides are complete (if requested by PLM)
Engineering Stories Merged
All associated work items with the Epic are closed
Epic status should be “Release Pending”

Epic STOR-743: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story STOR-947: Update all CSI driver operators to support new API

View the linked PRs

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/173

Feature OCPSTRAT-31: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic STOR-1007: OCP 4.13 release chores

View the Description

Epic Goal

Update all images that we ship with OpenShift to the latest upstream releases and libraries.
Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Why is this important?

We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

Story STOR-1019: Chore: update libraries in all operators

View the Description View the linked PRs

Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.

This includes (but is not limited to):

Kubernetes:
- client-go
- controller-runtime
OCP:
- library-go
- openshift/api
- openshift/client-go
- operator-sdk

Operators:

aws-ebs-csi-driver-operator
aws-efs-csi-driver-operator
azure-disk-csi-driver-operator
azure-file-csi-driver-operator
cinder-csi-driver-operator
gcp-pd-csi-driver-operator
gcp-filestore-csi-driver-operator
manila-csi-driver-operator
ovirt-csi-driver-operator
vmware-vsphere-csi-driver-operator
alibaba-disk-csi-driver-operator
ibm-vpc-block-csi-driver-operator
csi-driver-shared-resource-operator

cluster-storage-operator
csi-snapshot-controller-operator
local-storage-operator
vsphere-problem-detector

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/179

Story STOR-1020: Chore: update CSI sidecars

View the Description View the linked PRs

Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories

external-attacher
external-provisioner
external-resizer
external-snapshotter
node-driver-registrar
livenessprobe

Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.

This includes update of VolumeSnapshot CRDs in cluster-csi-snapshot-controller- operator assets and client API in go.mod. I.e. copy all snapshot CRDs from upstream to the operator assets + go get -u github.com/kubernetes-csi/external-snapshotter/client/v6 in the operator repo.

Epic WRKLDS-594: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/277

Feature OCPSTRAT-339: [Phase 2] Add a new platform type ("external") to identify clusters with non-integrated partner components enabled

View the Description

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Feature Goal

Enable platform=external to support onboarding new partners, e.g. Oracle Cloud Infrastructure and VCSP partners.
Create a new platform type, working name "External", that will signify when a cluster is deployed on a partner infrastructure where core cluster components have been replaced by the partner. “External” is different from our current platform types in that it will signal that the infrastructure is specifically not “None” or any of the known providers (eg AWS, GCP, etc). This will allow infrastructure partners to clearly designate when their OpenShift deployments contain components that replace the core Red Hat components.

This work will require updates to the core OpenShift API repository to add the new platform type, and then a distribution of this change to all components that use the platform type information. For components that partners might replace, per-component action will need to be taken, with the project team's guidance, to ensure that the component properly handles the "External" platform. These changes will look slightly different for each component.

To integrate these changes more easily into OpenShift, it is possible to take a multi-phase approach which could be spread over a release boundary (eg phase 1 is done in 4.X, phase 2 is done in 4.X+1).

~~OCPBU-5~~: Phase 1

Write platform “External” enhancement.
Evaluate changes to cluster capability annotations to ensure coverage for all replaceable components.
Meet with component teams to plan specific changes that will allow for supplement or replacement under platform "External".
Start implementing changes towards Phase 2.

~~OCPBU-510~~: Phase 2

Update OpenShift API with new platform and ensure all components have updated dependencies.
Update capabilities API to include coverage for all replaceable components.
Ensure all Red Hat operators tolerate the "External" platform and treat it the same as "None" platform.

OCPBU-329: Phase.Next

Why is this important?

As partners begin to supplement OpenShift's core functionality with their own platform specific components, having a way to recognize clusters that are in this state helps Red Hat created components to know when they should expect their functionality to be replaced or supplemented. Adding a new platform type is a significant data point that will allow Red Hat components to understand the cluster configuration and make any specific adjustments to their operation while a partner's component may be performing a similar duty.
The new platform type also helps with support to give a clear signal that a cluster has modifications to its core components that might require additional interaction with the partner instead of Red Hat. When combined with the cluster capabilities configuration, the platform "External" can be used to positively identify when a cluster is being supplemented by a partner, and which components are being supplemented or replaced.

Scenarios

A partner wishes to replace the Machine controller with a custom version that they have written for their infrastructure. Setting the platform to "External" and advertising the Machine API capability gives a clear signal to the Red Hat created Machine API components that they should start the infrastructure generic controllers but not start a Machine controller.
A partner wishes to add their own Cloud Controller Manager (CCM) written for their infrastructure. Setting the platform to "External" and advertising the CCM capability gives a clear to the Red Hat created CCM operator that the cluster should be configured for an external CCM that will be managed outside the operator. Although the Red Hat operator will not provide this functionality, it will configure the cluster to expect a CCM.

Acceptance Criteria

Phase 1

Partners can read "External" platform enhancement and plan for their platform integrations.
Teams can view jira cards for component changes and capability updates and plan their work as appropriate.

Phase 2

Components running in cluster can detect the “External” platform through the Infrastructure config API
Components running in cluster react to “External” platform as if it is “None” platform
Partners can disable any of the platform specific components through the capabilities API

Phase 3

Components running in cluster react to the “External” platform based on their function.
- for example, the Machine API Operator needs to run a set of controllers that are platform agnostic when running in platform “External” mode.
- the specific component reactions are difficult to predict currently, this criteria could change based on the output of phase 1.

Dependencies (internal and external)

Previous Work (Optional):

Identifying OpenShift Components for Install Flexibility

Open questions::

Phase 1 requires talking with several component teams, the specific action that will be needed will depend on the needs of the specific component. At the least the components need to treat platform "External" as "None", but there could be more changes depending on the component (eg Machine API Operator running non-platform specific controllers).

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Epic OCPCLOUD-2011: Update External platform with CCM settings

View the Description

Epic Goal

Empower External platform type user to specify when they will run their own CCM

Why is this important?

For partners wishing to use components that require zonal awareness provided by the infrastructure (for example CSI drivers), they will need to exercise their own cloud controller managers. This epic is about adding the proper configuration to OpenShift to allow users of External platform types to run their own CCMs.

Scenarios

As a Red Hat partner, I would like to deploy OpenShift with my own CSI driver. To do this I need my CCM deployed as well. Having a way to instruct OpenShift to expect an external CCM deployment would allow me to do this.

Acceptance Criteria

CI - A new periodic test based on the External platform test would be ideal
Release Technical Enablement - Provide necessary release enablement details and documents.
- Update docs.ci.openshift.org with CCM docs

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story OCPCLOUD-2010: Add CCM support to External platform API

View the Description View the linked PRs

User Story

As a Red Hat Partner installing OpenShift using the External platform type, I would like to install my own Cloud Controller Manager(CCM). Having a field in the Infrastructure configuration object to signal that I will install my own CCM and that Kubernetes should be configured to expect an external CCM will allow me to run my own CCM on new OpenShift deployments.

Background

This work has been defined in the External platform enhancement , and had previously been part of openshift/api . The CCM API pieces were removed for the 4.13 release of OpenShift to ensure that we did not ship unused portions of the API.

In addition to the API changes, library-go will need to have an update to the IsCloudProviderExternal function to detect the if the External platform is selected and if the CCM should be enabled for external mode.

We will also need to check the ObserveCloudVolumePlugin function to ensure that it is not affected by the external changes and that it continues to use the external volume plugin.

After updating openshift/library-go, it will need to be re-vendored into the MCO , KCMO , and CCCMO (although this is not as critical as the other 2).

Steps

update openshift/api with new CCM fields (re-revert #1409)
revendor api to library-go
update IsCloudProviderExternal in library-go to observe the new API fields
investigate ObserveCloudVolumePlugin to see if it requires changes
revendor library-go to MCO, KCMO, and CCCMO
update enhancement doc to reflect state

Stakeholders

openshift eng
oracle cloud install effort

Definition of Done

openshift can be installed with External platform type with kubelet, and related components, using the external cloud provider flags.

Docs

this will need to be documented in the API and as part of ~~OCPCLOUD-1581~~

Testing

this will need validation through unit test, integration testing may be difficult as we will need a new e2e built off the external platform with a ccm

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/736

Feature OCPSTRAT-410: BYOK for encryption should encrypt the default storageclass with the same key

View the Description

1. Proposed title of this feature request
BYOK encrypts root vols AND default storageclass

2. What is the nature and description of the request?
User story
As a customer spinning up managed OpenShift clusters, if I pass a custom AWS KMS key to the installer, I expect it (installer and cluster-storage-operator) to not only encrypt the root volumes for the nodes in the cluster, but also be applied to encrypt the first/default (gp2 in current case) StorageClass, so that my assumptions around passing a custom key are met.
In current state, if I pass a KMS key to the installer, only root volumes are encrypted with it, and the default AWS managed key is used for the default StorageClass.
Perhaps this could be offered as a flag to set in the installer to further pass the key to the storage class, or not.

3. Why does the customer need this? (List the business requirements here)
To satisfy that customers wish to encrypt their owned volumes with their selected key instead of the AWS default account key, by accident.

4. List any affected packages or components.

uncertain.

Note: this implementation should take effect on AWS, GCP and Azure (any cloud provider) equally.

Epic STOR-870: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story STOR-875: Implement custom keys in AWS EBS CSI driver operator

View the Description View the linked PRs

User Story:

As a cluster admin, I want OCP to provision new volumes with my custom encryption key that I specified during cluster installation in install-config.yaml so all OCP assets (PVs, VMs & their root disks) use the same encryption key.

Acceptance Criteria:

Description of criteria:

Check that dynamically provisioned PVs use the key specified in install-config.yaml
Check that the key can be changed in TBD API and all volumes newly provisioned after the key change use the new key. (Exact API is not defined yet, probably a new field in `Infrastructure`, calling it TBD API now).

(optional) Out of Scope:

Re-encryption of existing PVs with a new key. Only newly provisioned PVs will use the new key.

Engineering Details:

Enhancement (incl. TBD API with encryption key reference) will be provided as part of https://issues.redhat.com/browse/CORS-2080.

"Raw meat" of this story is translation of the key reference in TBD API to StorageClass.Parameters. AWS EBS CSi driver operator should update both the StorageClass it manages (managed-csi) with:

Parameters:
encrypted: "true"

kmsKeyId: "arn:aws:kms:us-east-1:012345678910:key/abcd1234-a123-456a-a12b-a123b4cd56ef"

Upstream docs: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/parameters.md

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/185

Feature OCPSTRAT-590: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic STOR-1263: Update Control Plane Kubernetes Version to 1.27

View the Description View the linked PRs

Epic Goal*

What is our purpose in implementing this? What new capability will be available to customers?

Why is this important? (mandatory)

What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?

Scenarios (mandatory)

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic.

Contributing Teams(and contacts) (mandatory)

Development -
Documentation -
QE -
PX -
Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

CI Testing - Basic e2e automationTests are merged and completing successfully
Documentation - Content development is complete.
QE - Test scenarios are written and executed successfully.
Technical Enablement - Slides are complete (if requested by PLM)
Engineering Stories Merged
All associated work items with the Epic are closed
Epic status should be “Release Pending”

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/713

Epic STOR-1155: OCP 4.14 release chores

View the Description

Epic Goal

Update all images that we ship with OpenShift to the latest upstream releases and libraries.
Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Why is this important?

We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

Story STOR-1168: Chore: update libraries in all operators

View the Description View the linked PRs

Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.

This includes (but is not limited to):

Kubernetes:
- client-go
- controller-runtime
OCP:
- library-go
- openshift/api
- openshift/client-go
- operator-sdk

Operators:

aws-ebs-csi-driver-operator
aws-efs-csi-driver-operator
azure-disk-csi-driver-operator
azure-file-csi-driver-operator
openstack-cinder-csi-driver-operator
gcp-pd-csi-driver-operator
gcp-filestore-csi-driver-operator
csi-driver-manila-operator
vmware-vsphere-csi-driver-operator
alibaba-disk-csi-driver-operator
ibm-vpc-block-csi-driver-operator
csi-driver-shared-resource-operator
ibm-powervs-block-csi-driver-operator

cluster-storage-operator
cluster-csi-snapshot-controller-operator
local-storage-operator
vsphere-problem-detector

EOL, do not upgrade:

github.com/oVirt/csi-driver-operator

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/222

Story STOR-1169: Chore: update CSI sidecars

View the Description View the linked PRs

Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories

external-attacher
external-provisioner
external-resizer
external-snapshotter
node-driver-registrar
livenessprobe

Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.

Story STOR-1167: Chore: Update aws-ebs-csi-driver to the latest release

View the Description View the linked PRs

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/223

Feature CNV-24729: Dynamic Resource Allocation (DRA) for VMs

View the Description

Goal: Resources provided via the Dynamic Resource Allocation Kubernetes mechanism can be consumed by VMs.

Details: Dynamic Resource Allocation

Epic CNV-24730: Design: Dynamic Resource Allocation (k8s feature) for VMs

View the Description

Goal

Come up with a design of how resources provided by Dynamic Resource Allocation can be consumed by KubeVirt VMs.

Description

The Dynamic Resource Allocation (DRA) feature is an alpha API in Kubernetes 1.26, which is the base for OpenShift 4.13.
This feature provides the ability to create ResourceClaim and ResourceClasse to request access to Resources. This is similar to the dynamic provisioning of PersistentVolume via PersistentVolumeClaim and StorageClasse.

NVIDIA has been a lead contributor to the KEP and has already an initial implementation of a DRA driver and plugin, with a nice demo recording. NVIDIA is expecting to have this DRA driver available in CY23 Q3 or Q4, so likely in NVIDIA GPU Operator v23.9, around OpenShift 4.14.

When asked about the availability of MIG-backed vGPU for Kubernetes, NVIDIA said that the timeframe is not decided yet, because it will likely use DRA for the MIG devices creation and their registration with the vGPU host driver. The MIG-base vGPU feature for OpenShift Virtualization will then likely require support of DRA to request vGPU resources for the VMs.

Not having MIG-backed vGPU is a risk for OpenShift Virtualization adoption in GPU use cases, such as virtual workstations for rendering with Windows-only softwares. Customers who want to have a mix of passthrough, time-based vGPU and MIG-backed vGPU will prefer competitors who offer the full range of options. And the certification of NVIDIA solutions like NVIDIA Omniverse will be blocked, despite a great potential to increase the OpenShift consumption, as it uses RTX/A40 GPU for virtual workstations (not certified by NVIDIA on OpenShift Virtualization yet) and A100/H100 for physics simulation, both use cases probably leveraring vGPUs [7]. There's a lot of necessary conditions for that to happen and MIG-backed vGPU support is one of them.

User Stories

GPU consumption optimization
"As an Admin, I want to let NVIDIA GPU DRA driver provision vGPUs for OpenShift Virtualization, so that it optimizes the allocation with dynamic provisioning of time or MIG backed vGPUs"
GPU mixed types per server
"As an Admin, I want to be able to mix different types of GPU to collocate different types of workloads on the same host, in order to improve multi-pod/stack performance.

Non-Requirements

List of things not included in this epic, to alleviate any doubt raised during the grooming process.

Notes

Any additional details or decisions made/needed

References

Done Checklist

Who	What	Reference
DEV	Upstream roadmap issue (or individual upstream PRs)	<link to GitHub Issue>
DEV	Upstream documentation merged	<link to meaningful PR>
DEV	gap doc updated	<name sheet and cell>
DEV	Upgrade consideration	<link to upgrade-related test or design doc>
DEV	CEE/PX summary presentation	label epic with cee-training and add a <link to your support-facing preso>
QE	Test plans in Polarion	<link or reference to Polarion>
QE	Automated tests merged	<link or reference to automated tests>
DOC	Downstream documentation merged	<link to meaningful PR>

Story WRKLDS-705: Enable DynamicResourceAllocation feature gate with TPNoUpgrade in 4.13

View the Description View the linked PRs

Part of making https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation available for early adoption.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/701

Feature OCPPLAN-3342: The details of this Jira Card are restricted (Only Red Hat employees and contractors)

View the Description

The details of this Jira Card are restricted (Only Red Hat employees and contractors)

Feature OCPPLAN-4333: The details of this Jira Card are restricted (Only Red Hat employees and contractors)

View the Description

The details of this Jira Card are restricted (Only Red Hat employees and contractors)

Epic CORS-1512: AWS: Support for C2S Region

View the Description

Goal:

As an administrator, I would like to deploy OpenShift 4 clusters to AWS C2S region

Problem:

Customers were able to deploy to AWS C2S region in OCP 3.11, but our global configuration in OCP 4.1 doesn't support this.

Why is this important:

Many of our public sector customers would like to move off 3.11 and on to 4.1, but missing support for AWS C2S region will prevent them from being able to migrate their environments..

Lifecycle Information:

Core

Previous Work:**

Here are the relevant PRs from OCP 3.11. You can see that these endpoints are not part of the standard SDK (they use an entirely separate SDK). To support these regions the endpoints had to be configured explicitly.

Kube: https://github.com/kubernetes/kubernetes/pull/72245
Autoscaler: https://github.com/kubernetes/autoscaler/pull/1717
Installer: https://github.com/openshift/openshift-ansible/pull/11277/

Seth Jennings has put together a highly customized POC.

Dependencies:

Custom API endpoint support w/ CA

- Cloud / Machine API
- Image Registry
- Ingress
- Kube Controller Manager
- Cloud Credential Operator
- others?
Require access to local/private/hidden AWS environment

Prioritized epics + deliverables (in scope / not in scope):

Allow AWS C2S region to be specified for OpenShift cluster deployment
Enable customers to use their own managed internal/cluster DNS solutions due to provider and operational restrictions
Document deploying OpenShift to AWS C2S region
Enable CI for the AWS C2S region

Estimate (XS, S, M, L, XL, XXL): L

Customers: North America Public Sector and Government Agencies

Open Questions:

Story CORS-1584: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-image-registry-operator/pull/638

Feature OCPSTRAT-120: Implement RWOP SELinux context mounts (TechPreview)

View the Description

Epic Goal*

Provide a long term solution to SELinux context labeling in OCP.

Why is this important? (mandatory)

As of today when selinux is enabled, the PV's files are relabeled when attaching the PV to the pod, this can cause timeout when the PVs contains lot of files as well as overloading the storage backend.

https://access.redhat.com/solutions/6221251 provides few workarounds until the proper fix is implemented. Unfortunately these workaround are not perfect and we need a long term seamless optimised solution.

This feature tracks the long term solution where the PV FS will be mounted with the right selinux context thus avoiding to relabel every file.

Scenarios (mandatory)

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.

Apply new context when there is none
Change context of all files/folders when changing context
RWO & RWX PVs
1. ReadWriteOncePod PVs first
2. RWX PV in a second phase

As we are relying on mount context there should not be any relabeling (chcon) because all files / folders will inherit the context from the mount context

More on design & scenarios in the KEP and related epic ~~STOR-1173~~

Dependencies (internal and external) (mandatory)

None for the core feature

However the driver will have to set SELinuxMountSupported to true in the CSIDriverSpec to enable this feature.

Contributing Teams(and contacts) (mandatory)

Development - STOR
Documentation - STOR
QE - STOR
PX -
Others -

Done - Checklist (mandatory)

CI Testing - Basic e2e automationTests are merged and completing successfully
Documentation - Content development is complete.
QE - Test scenarios are written and executed successfully.
Technical Enablement - Slides are complete (if requested by PLM)
Engineering Stories Merged
All associated work items with the Epic are closed
Epic status should be “Release Pending”

Epic STOR-1173: Upstream Beta: SELinux relabeling using mount options (TP)

View the Description

Epic Goal

Support upstream feature "SELinux relabeling using mount options (CSIDriver API change)"" in OCP as Beta, i.e. test it and have docs for it (unless it's Alpha upstream).

Summary: If Pod has defined SELinux context (e.g. it uses "resticted" SCC) and it uses ReadWriteOncePod PVC and CSI driver responsible for the volume supports this feature, kubelet + the CSI driver will mount the volume directly with the correct SELinux labels. Therefore CRI-O does not need to recursive relabel the volume and pod startup can be significantly faster. We will need a thorough documentation for this.

This upstream epic actually will be implemented by us!

Why is this important?

We get this upstream feature through Kubernetes rebase. We should ensure it works well in OCP and we have docs for it.

Upstream links

Enhancement issue: [1710]
KEP: https://github.com/kubernetes/enhancements/pull/3172

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

External: the feature is currently scheduled for Beta in Kubernetes 1.27, i.e. OCP 4.14, but it may change before Kubernetes 1.27 GA.

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story STOR-1276: Update all OCP CSIDriver instances

View the Description View the linked PRs

As a cluster user, I want to use mounting with SELinux context without any configuration.

This means OCP ships CSIDriver objects with "SELinuxMount: true" for CSI drivers that support mounting with "-o context". I.e. all CSI drivers that are based on block volumes and use ext4/xfs should have this enabled.

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/264

Epic STOR-966: Storage: [UPSTREAM Alpha] Recursive Permissions (SELinux) - Alpha 2/2

View the Description

This Epic is to track upstream work in the Storage SIG community

This Epic is to track the SELinux specific work required. fsGroup work is not included here.

Goal:

Continue contributing to and help move along the upstream efforts to enable recursive permissions functionality.

Finish current SELinuxMountReadWriteOncePod feature upstream:

Implement it in all volume plugins (current alpha has just iSCSI and CSI
Add e2e test + fixing all tests that don't work well with SELinux
Implement necessary changes in volume reconstruction to reconstruct also SELinux context.

The feature is probably going to stay alpha upstream.

Problem:

Recursive permission change takes very long for fsGroup and SELinux. For volumes with many small files Kubernetes currently does a chown for every file on the volume (due to fsGroup). Similarly for container runtimes (such as CRI-O) a chcon of every file on the volume is performed due to SCC's SELinux context. Data on the volume may already have the correct GID/SELinux context so Kubernetes needs way to detect this automatically to avoid the long delay.

Why is this important:

A user wants to bring their pod online quickly and efficiently.

Dependencies (internal and external):

Prioritized epics + deliverables (in scope / not in scope):

Estimate (XS, S, M, L, XL, XXL):

Previous Work:

Customers:

Open questions:

Notes:

Story STOR-1078: Update CSI drivers operators in OCP to support mount with SELinux

View the Description View the linked PRs

As OCP developer (and as OCP user in the future), I want all CSI drivers shipped as part of OCP to support mounting with -o context=XYZ, so I can test with CSIDriver.SELinuxMount: true (or my pods are running without CRI-O recursively relabeling my volume).

In detail:

For CSI drivers based on block devices, pass host's /etc/selinux and /sys/fs/ to the CSI drvier container on the node as HostPath volumes
For CSI drivers based on NFS / CIFS: do the same as for block volumes (it won't harm the driver in any way), but investigate if these drivers can actually run with CSIDriver.SELinuxMount: true.

Details: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling#selinux-support-in-volumes

Exit criteria:

Verify that CSI drivers shipped by OCP based on block volumes mount volumes with -o context=xyz instead of relabeling the volumes by CRI-O. That should happen when all these conditions are satisfied:
- SELinuxMountReadWriteOncePod and ReadWriteOncePod feature gates are enabled
- CSIDriver.SELinuxMount is set to true manually for the CSI driver. OCP will not do it by default in 4.13, because it requires the alpha feature gates from the previous bullet.
- PVC has AccessMode: [ReadWriteOncePod]
- Pod has SELinux context explicitly assigned, i.e. pod.spec.securityContext (or pod.spec.containers[*].securityContext) has seLinuxOptions set, incl. {{level }}(based on SCC, OCP might do it automatically)
This is alpha / dev preview feature, so QE might done when graduating to Beta / tech preview.

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/174

Feature OCPSTRAT-1230: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic WRKLDS-1258: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story WRKLDS-1292: Update kubernetes version to v1.30 (oc & workloads operator)

View the Description View the linked PRs

[Vu Dinh] cluster-kube-controller-manager operator: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/803
[Jan] cluster-policy-controller: https://github.com/openshift/cluster-policy-controller/pull/151
[Vu Dinh] cluster-kube-scheduler operator: https://github.com/openshift/cluster-kube-scheduler-operator/pull/540
[Jan] secondary-scheduler-operator: https://github.com/openshift/secondary-scheduler-operator/pull/142
[Jan] cluster-capacity: https://github.com/openshift/cluster-capacity/pull/93
[Jan] run-once-duration-override-operator: https://github.com/openshift/run-once-duration-override-operator/pull/57
[Jan] run-once-duration-override: https://github.com/openshift/run-once-duration-override/pull/35

Other teams:

[Jan] route-controller-manager: https://github.com/openshift/route-controller-manager/pull/44

If needed this card can be broken down into more cards with sublists, each card assigned to a different assignee.

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/352

Epic API-1783: Upgrade to Kubernetes 1.30

View the Description View the linked PRs

Epic Goal*

Drive the technical part of the Kubernetes 1.29 upgrade, including rebasing openshift/kubernetes repositiry and coordination across OpenShift organization to get e2e tests green for the OCP release.

Why is this important? (mandatory)

OpenShift 4.17 cannot be released without Kubernetes 1.30

Scenarios (mandatory)

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic.

Contributing Teams(and contacts) (mandatory)

Development -
Documentation -
QE -
PX -
Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.

Drawbacks or Risk (optional)

Done - Checklist (mandatory)

CI Testing - Basic e2e automationTests are merged and completing successfully
Documentation - Content development is complete.
QE - Test scenarios are written and executed successfully.
Technical Enablement - Slides are complete (if requested by PLM)
Engineering Stories Merged
All associated work items with the Epic are closed
Epic status should be “Release Pending”

PRs:

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/803

Feature OCPSTRAT-1352: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic WRKLDS-1432: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story WRKLDS-1492: Update kubernetes version to v1.31 (oc & workloads operator)

View the Description View the linked PRs

Goal:
Update team owned repositories to Kubernetes v1.31

?? is the 1.31 freeze
?? is the 1.31 GA

Problem:<please update links for 1.31>
The following repository must be rebased onto the latest version of Kubernetes:

oc: https://github.com/openshift/oc/pull/1877

The following repositories should be rebased onto the latest version of Kubernetes:

cluster-kube-controller-manager operator: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/816
cluster-policy-controller: https://github.com/openshift/cluster-policy-controller/pull/156
cluster-kube-scheduler operator: https://github.com/openshift/cluster-kube-scheduler-operator/pull/547
secondary-scheduler-operator: https://github.com/openshift/secondary-scheduler-operator/pull/225
cluster-capacity: https://github.com/openshift/cluster-capacity/pull/97
run-once-duration-override-operator: https://github.com/openshift/run-once-duration-override-operator/pull/68
run-once-duration-override: https://github.com/openshift/run-once-duration-override/pull/36
cluster-openshift-controller-manager-operator: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/368
openshift-controller-manager: https://github.com/openshift/openshift-controller-manager/pull/345
cli-manager-operator: https://github.com/openshift/cli-manager-operator/pull/358
cli-manager: https://github.com/openshift/cli-manager/pull/144
cluster-kube-descheduler-operator: https://github.com/openshift/cluster-kube-descheduler-operator/pull/384
descheduler:

Entirely remove dependencies on k/k repository inside oc.

Why is this important:

Customers demand we provide the latest stable version of Kubernetes.
The rebase and upstream participation represents a significant portion of the Workloads team's activity.

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/368

Epic WRKLDS-1449: Upgrade to Kubernetes 1.31

View the Description View the linked PRs

Epic Goal*

Drive the technical part of the Kubernetes 1.31 upgrade, including rebasing openshift/kubernetes repositiry and coordination across OpenShift organization to get e2e tests green for the OCP release.

Why is this important? (mandatory)

OpenShift 4.18 cannot be released without Kubernetes 1.31

Scenarios (mandatory)

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic.

Contributing Teams(and contacts) (mandatory)

Development -
Documentation -
QE -
PX -
Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.

Drawbacks or Risk (optional)

Done - Checklist (mandatory)

CI Testing - Basic e2e automationTests are merged and completing successfully
Documentation - Content development is complete.
QE - Test scenarios are written and executed successfully.
Technical Enablement - Slides are complete (if requested by PLM)
Engineering Stories Merged
All associated work items with the Epic are closed
Epic status should be “Release Pending”

PRs:

Retro: Kube 1.31 Rebase Retrospective Timeline (OCP 4.18)

Retro recording: https://drive.google.com/file/d/1htU-AglTJjd-VgFfwE3z_dH5tKXT1Tes/view?usp=drive_web

Feature OCPSTRAT-1733: MultiOperatorManager Phase 1

View the Description

Feature Overview (aka. Goal Summary)

An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations	List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both
Classic (standalone cluster)
Hosted control planes
Multi node, Compact (three node), or Single node (SNO), or all
Connected / Restricted Network
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)
Operator compatibility
Backport needed (list applicable versions)
UI need (e.g. OpenShift Console, dynamic plugin, OCM)
Other (please specify)

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

Epic API-1835: Produce MultiOperatorManager POC

View the Description View the linked PRs

link back to OCPSTRAT-1644 somehow

Epic Goal*

What is our purpose in implementing this? What new capability will be available to customers?

Why is this important? (mandatory)

What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?

Scenarios (mandatory)

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic.

Contributing Teams(and contacts) (mandatory)

Development -
Documentation -
QE -
PX -
Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.

Drawbacks or Risk (optional)

Done - Checklist (mandatory)

CI Testing - Basic e2e automationTests are merged and completing successfully
Documentation - Content development is complete.
QE - Test scenarios are written and executed successfully.
Technical Enablement - Slides are complete (if requested by PLM)
Engineering Stories Merged
All associated work items with the Epic are closed
Epic status should be “Release Pending”

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/823

Feature OCPSTRAT-193: Automatically restart storage operators pods when the CA certificates are updated

View the Description

Feature Overview (aka. Goal Summary)

The storage operators need to be automatically restarted after the certificates are renewed.

From OCP doc "The service CA certificate, which issues the service certificates, is valid for 26 months and is automatically rotated when there is less than 13 months validity left."

Since OCP is now offering an 18 months lifecycle per release, the storage operator pods need to be automatically restarted after the certificates are renewed.

Goals (aka. expected user outcomes)

The storage operators will be transparently restarted. The customer benefit should be transparent, it avoids manually restart of the storage operators.

Requirements (aka. Acceptance Criteria):

The administrator should not need to restart the storage operator when certificates are renew.

This should apply to all relevant operators with a consistent experience.

Use Cases (Optional):

As an administrator I want the storage operators to be automatically restarted when certificates are renewed.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

This feature request is triggered by the new extended OCP lifecycle. We are moving from 12 to 18 months support per release.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

No doc is required

Interoperability Considerations

This feature only cover storage but the same behavior should be applied to every relevant components.

Epic STOR-991: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story STOR-1300: Automatically restart `aws-ebs-csi-driver-controller` pods when the secret `aws-ebs-csi-driver-controller-metrics-serving-cert` is updated

View the Description View the linked PRs

The pod `aws-ebs-csi-driver-controller` mounts the secret:

$ oc get po -n openshift-cluster-csi-drivers aws-ebs-csi-driver-controller-559f74d7cd-5tk4p -o yaml
...
    name: driver-kube-rbac-proxy
    name: provisioner-kube-rbac-proxy
	name: attacher-kube-rbac-proxy
	name: resizer-kube-rbac-proxy
	name: snapshotter-kube-rbac-proxy

    volumeMounts:
    - mountPath: /etc/tls/private
      name: metrics-serving-cert

  volumes:
  - name: metrics-serving-cert
    secret:
      defaultMode: 420
      secretName: aws-ebs-csi-driver-controller-metrics-serving-cert

Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/216

Feature OCPSTRAT-242: No auto-generated secrets for SA when Registry is disabled

View the Description

Feature Overview (aka. Goal Summary)

Description of problem:

Even though in 4.11 we introduced LegacyServiceAccountTokenNoAutoGeneration to be compatible with upstream K8s to not generate secrets with tokens when service accounts are created, today OpenShift still creates secrets and tokens that are used for legacy usage of openshift-controller as well as the image-pull secrets.

Customer issues:

Customers see auto-generated secrets for service accounts which is flagged as a security risk.

This Feature is to track the implementation for removing legacy usage and image-pull secret generation as well so that NO secrets are auto-generated when a Service Account is created on OpenShift cluster.

Goals (aka. expected user outcomes)

NO Secrets to be auto-generated when creating service accounts

Requirements (aka. Acceptance Criteria):

Following *secrets need to NOT be generated automatically with every Serivce account creation:*

ImagePullSecrets : This is needed for Kubelet to fetch registry credentials directly. Implementation needed for the following upstream feature.
https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2133-kubelet-credential-providers/README.md
Dockerconfig secrets: The openshift-controller-manager relies on the old token secrets and it creates them so that it's able to generate registry credentials for the SAs. There is a PR that was created to remove this https://github.com/openshift/openshift-controller-manager/pull/223.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Concerns/Risks: Replacing functionality of one of the openshift-controller used for controllers that's been in the code for a long time may impact behaviors that w

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Existing documentation needs to be clear on where we are today and why we are providing the above 2 credentials. Related Tracker: https://issues.redhat.com/browse/OCPBUGS-13226

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

Epic API-1642: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Feature OCPSTRAT-487: Pod Security Admission Integration - Restricted Enforcement

View the Description

Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (See here for the motivations for deprecation).] There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.

With OpenShift 4.11, we are turned on the Pod Security Admission with global "privileged" enforcement. Additionally we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt-in their namespaces to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.

With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn".

Epic AUTH-262: Pod Security Admission Integration - Restricted Enforcement

View the Description

Epic Goal

Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.

Story AUTH-482: SCC pinning for all workloads in platform namespaces

View the Description View the linked PRs

When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.

To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).

Each workload should pin the SCC with the least-privilege, except workloads in runlevel 0 namespaces that should pin the "privileged" SCC (SCC admission is not enabled on these namespaces, but we should pin an SCC for tracking purposes).

The following tables track progress.

Progress summary

# namespaces	4.18	4.17	4.16	4.15
monitored	82	82	82	82
fix needed	69	69	69	69
fixed	38	34	30	39
remaining	31	35	39	30
~ remaining non-runlevel	9	13	17	8
~ remaining runlevel (low-prio)	22	22	22	22
~ untested	2	2	2	82

Progress breakdown

#	namespace	4.18	4.17	4.16	4.15
1	oc debug node pods		#1763	#1816	#1818
2	openshift-apiserver-operator			#573	#581
3	openshift-authentication			#656	#675
4	openshift-authentication-operator			#656	#675
5	openshift-catalogd			#50	#58
6	openshift-cloud-credential-operator			#681	#736
7	openshift-cloud-network-config-controller	#2282	#2490	#2496
8	openshift-cluster-csi-drivers	#524 #131 #6 #127 #108 #118 #306 #265 #75		#170 #459	#484
9	openshift-cluster-node-tuning-operator			#968	#1117
10	openshift-cluster-olm-operator			#54	n/a
11	openshift-cluster-samples-operator			#535	#548
12	openshift-cluster-storage-operator	#516		#459 #196	#484 #211
13	openshift-cluster-version			#1038	#1068
14	openshift-config-operator			#410	#420
15	openshift-console		#871	#908	#924
16	openshift-console-operator		#871	#908	#924
17	openshift-controller-manager			#336	#361
18	openshift-controller-manager-operator			#336	#361
19	openshift-e2e-loki	#56579	#56579	#56579	#56579
20	openshift-image-registry			#1008	#1067
21	openshift-ingress	#1031
22	openshift-ingress-canary	#1031
23	openshift-ingress-operator	#1031
24	openshift-insights	#1026		#915	#967
25	openshift-kni-infra	#4504	#4542	#4539	#4540
26	openshift-kube-storage-version-migrator			#107	#112
27	openshift-kube-storage-version-migrator-operator			#107	#112
28	openshift-machine-api	#1308	#407	#315 #282 #1220 #73 #50 #433	#332 #326 #1288 #81 #57 #443
29	openshift-machine-config-operator	#4636	#4219	#4384	#4393
30	openshift-manila-csi-driver		#234	#235	#236
31	openshift-marketplace	#578		#561	#570
32	openshift-metallb-system	#238	#240	#241
33	openshift-monitoring	#2498		#2335	#2420
34	openshift-network-console	#2545
35	openshift-network-diagnostics	#2282	#2490	#2496
36	openshift-network-node-identity	#2282	#2490	#2496
37	openshift-nutanix-infra	#4504	#4504	#4539	#4540
38	openshift-oauth-apiserver			#656	#675
39	openshift-openstack-infra	#4504	#4504	#4539	#4540
40	openshift-operator-controller			#100	#120
41	openshift-operator-lifecycle-manager			#703	#828
42	openshift-route-controller-manager			#336	#361
43	openshift-service-ca			#235	#243
44	openshift-service-ca-operator			#235	#243
45	openshift-sriov-network-operator		#754 #995	#999	#1003
46	openshift-user-workload-monitoring			#2335	#2420
47	openshift-vsphere-infra	#4504	#4542	#4539	#4540
48	(runlevel) default
49	(runlevel) kube-system
50	(runlevel) openshift-cloud-controller-manager
51	(runlevel) openshift-cloud-controller-manager-operator
52	(runlevel) openshift-cluster-api
53	(runlevel) openshift-cluster-machine-approver
54	(runlevel) openshift-dns
55	(runlevel) openshift-dns-operator
56	(runlevel) openshift-etcd
57	(runlevel) openshift-etcd-operator
58	(runlevel) openshift-kube-apiserver
59	(runlevel) openshift-kube-apiserver-operator
60	(runlevel) openshift-kube-controller-manager
61	(runlevel) openshift-kube-controller-manager-operator
62	(runlevel) openshift-kube-proxy
63	(runlevel) openshift-kube-scheduler
64	(runlevel) openshift-kube-scheduler-operator
65	(runlevel) openshift-multus
66	(runlevel) openshift-network-operator
67	(runlevel) openshift-ovn-kubernetes
68	(runlevel) openshift-sdn
69	(runlevel) openshift-storage

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/336

Feature OCPSTRAT-519: Unify and Instrument Hosted Control Planes Storage Operators

View the Description

Feature Overview (aka. Goal Summary)

Unify and update hosted control planes storage operators so that they have similar code patterns and can run properly in both standalone OCP and HyperShift's control plane.

Goals (aka. expected user outcomes)

Simplify the operators with a unified code pattern
Expose metrics from control-plane components
Use proper RBACs in the guest cluster
Scale the pods according to HostedControlPlane's AvailabilityPolicy
Add proper node selector and pod affinity for mgmt cluster pods

Requirements (aka. Acceptance Criteria):

OCP regression tests work in both standalone OCP and HyperShift
Code in the operators looks the same
Metrics from control-plane components are exposed
Proper RBACs are used in the guest cluster
Pods scale according to HostedControlPlane's AvailabilityPolicy
Proper node selector and pod affinity is added for mgmt cluster pods

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.

Interoperability Considerations

Epic STOR-1437: Refactor AWS EBS CSI driver operators to support hypershift

View the Description View the linked PRs

Epic Goal*

Our current design of EBS driver operator to support Hypershift does not scale well to other drivers. Existing design will lead to more code duplication between driver operators and possibility of errors.

Why is this important? (mandatory)

An improved design will allow more storage drivers and their operators to be added to hypershift without requiring significant changes in the code internals.

Scenarios (mandatory)

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic.

Contributing Teams(and contacts) (mandatory)

Development -
Documentation -
QE -
PX -
Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.

Drawbacks or Risk (optional)

Done - Checklist (mandatory)

CI Testing - Basic e2e automationTests are merged and completing successfully
Documentation - Content development is complete.
QE - Test scenarios are written and executed successfully.
Technical Enablement - Slides are complete (if requested by PLM)
Engineering Stories Merged
All associated work items with the Epic are closed
Epic status should be “Release Pending”

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/297

Story STOR-1461: Add ebs-operator to legacy folder and build and test image from it

View the Description View the linked PRs

Until final structure is finalized we should be able to build and test aws-ebs image from legacy folder.

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/294

Feature OCPSTRAT-709: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic API-1800: Migrate secrets of "SecretTypeTLS" type

View the Description View the linked PRs

TBD

Known affected components:

etcd-o ( [1] )

kcm-o ( [2], [3])

[1] https://github.com/openshift/cluster-etcd-operator/blob/eeef8033eb058b0f9899c7b16aead237d5b84462/manifests/0000_20_etcd-operator_03_secret.yaml#L3

[2] https://github.com/openshift/cluster-kube-controller-manager-operator/blob/351c1193c7eebb49054a289a17fc25dfc0e0cd73/bindata/bootkube/manifests/secret-csr-signer-signer.yaml#L10
[3] https://github.com/openshift/cluster-etcd-operator/pull/1234

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/804

Feature OCPSTRAT-715: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic STOR-1425: Update Control Plane Kubernetes Version to 1.28

View the Description View the linked PRs

Epic Goal*

What is our purpose in implementing this? What new capability will be available to customers?

Why is this important? (mandatory)

What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?

Scenarios (mandatory)

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic.

Contributing Teams(and contacts) (mandatory)

Development -
Documentation -
QE -
PX -
Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.

Drawbacks or Risk (optional)

Done - Checklist (mandatory)

CI Testing - Basic e2e automationTests are merged and completing successfully
Documentation - Content development is complete.
QE - Test scenarios are written and executed successfully.
Technical Enablement - Slides are complete (if requested by PLM)
Engineering Stories Merged
All associated work items with the Epic are closed
Epic status should be “Release Pending”

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/742

Epic STOR-1383: OCP 4.15 release chores

View the Description

Epic Goal

Update all images that we ship with OpenShift to the latest upstream releases and libraries.
Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).

Why is this important?

We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

Story STOR-1404: Chore: update CSI sidecars

View the Description View the linked PRs

Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories

external-attacher
external-provisioner
external-resizer
external-snapshotter
node-driver-registrar
livenessprobe

Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.

Epic WRKLDS-806: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/305

Feature OCPSTRAT-790: prevent workload to be schedule in master node

View the Description

Feature Overview (aka. Goal Summary)

As an openshift admin ,who wants to make my OCP more secure and stable . I want to prevent anyone to schedule their workload in master node so that master node only run OCP management related workload .

Goals (aka. expected user outcomes)

secure OCP master node by preventing scheduling of customer workload in master node

Epic WRKLDS-1015: prevent application workload to be schedule to master nodes

View the Description

Anyone applying toleration(s) in a pod spec can unintentionally tolerate master taints which protect master nodes from receiving application workload when master nodes are configured to repel application workload. An admission plugin needs to be configured to protect master nodes from this scenario. Besides the taint/toleration, users can also set spec.NodeName directly, which this plugin should also consider protecting master nodes against.

Story WRKLDS-1169: kube-controller-manager pods need to tolerate node-role.kubernetes.io/control-plane:NoExecute

View the Description View the linked PRs

Needed so we can provide this workflow to customers following the proposal at https://github.com/openshift/enhancements/pull/1583

Reference https://issues.redhat.com/browse/WRKLDS-1015

kube-controller-manager pods are created by code residing in controllers provided by the kube-controler-manager operator. So changes are required in that repo to add a toleration to the node-role.kubernetes.io/control-plane:NoExecute taint.

https://github.com/openshift/cluster-kube-controller-manager-operator/blob/bdb42fd60a64adf6f98cc7be4d90a76ea30662d6/manifests/0000_25_kube-controller-manager-operator_06_deployment.yaml#L108

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/802

Feature OCPSTRAT-974: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic WRKLDS-1016: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/324

Epic OCPNODE-1886: [OCP Rebase] Rebase OCP control plane with Kubernetes v1.29

View the Description

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

Goal of this epic is to capture all the amount of required work and efforts that take to update the openshift control plane with the upstream kubernetes v1.29

Why is this important?

Rebase is a must process for every ocp release to leverage all the new features implemented upstream

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Following epic captured the previous rebase work of k8s v1.28
https://issues.redhat.com/browse/STOR-1425

Open questions::

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story OCPNODE-1890: Bump api, dependent libraries

View the Description View the linked PRs

Bump the following libraries in an order with the latest kube and the dependent libraries

openshift/api
openshift/client-go
openshift/library-go
openshift/apiserver-library-go
kube-openapi (if required)

Prev Ref:
https://github.com/openshift/api/pull/1534
https://github.com/openshift/client-go/pull/250
https://github.com/openshift/library-go/pull/1557
https://github.com/openshift/apiserver-library-go/pull/118

https://github.com/openshift/kube-openapi/pull/7

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/770

Feature SECFLOWOTL-22: OCP Capabilities: Disable Builder Service Account

View the Description

Executive Summary

Provide mechanisms for the builder service account to be made optional in core OpenShift.

Goals

< Who benefits from this feature, and how? What is the difference between today’s current state and a world with this feature? >

Let cluster administrators disable the automatic creation of the "builder" service account when the Build capability is disabled on the cluster. This reduces potential attack vectors for clusters that do not run build or other CI/CD workloads. Example - fleets for mission-critical applications, edge deployments, security-sensitive environments.
Let cluster administrators enable/disable the generation of the "builder" service account at will. Applies to new installations with the "Build" capability enabled as well as upgraded clusters. This helps customers who are not able to easily provision new OpenShift clusters and block usage of the Build system through other means (ex: RBAC, 3rd party admission controllers (ex OPA, Kyverno)).

Requirements

Requirements	Notes	IS MVP
Disable service account controller related to Build/BuildConfig when Build capability is disabled	When the API is marked as removed or disabled, stop creating the "builder" service account and its associated RBAC	Yes
Option to disable the "builder" service account	Even if the Build capability is enabled, allow admins to disable the "builder" service account generation. Admins will need to bring their own service accounts/RBAC for builds to work	Yes

(Optional) Use Cases

< What are we making, for who, and why/what problem are we solving?>

Build as an installation capability - see ~~WRKLDS-695~~
Disabling the Build system through RBAC or admission controllers. The "builder" service account is the only thing that RBAC and admission control cannot block without significant cluster impact.

Out of scope

<Defines what is not included in this story>

Disabling the Build API separately from the capabilities feature

Dependencies

< Link or at least explain any known dependencies. >

Build capability: ~~WRKLDS-695~~
Separate controllers for default service accounts: API-1651

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

In OCP 4.14, "Build" was introduced as an optional installation capability. This means that the BuildConfig API and subsystems are not guaranteed to be present on new clusters.
The "builder" service account is granted permission to push images to the OpenShift internal registry via a controller. There is risk that the service account can be used as an attack vector to overwrite images in the internal registry.
OpenShift has an existing API to configure the build system. See OCP documentation on the current supported options. The current OCP build test suite includes checks for these global settings. Source code.
Customers with larger footprints typically separate "CI/CD clusters" from "application clusters" that run production workloads. This is because CI/CD workloads (and building container images in particular) can have "noisy" consumption of resources that risk destabilizing running applications.

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

Must work for new installations as well as upgraded clusters.

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

Update the "Build configurations" doc so admins can understand the new feature.
Potential updates to "Understanding BuildConfig" doc doc to include references to the serviceAccount option in the spec, as well as a section describing the permissions granted to the "builder" service account.

What does success look like?

< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>

QE Contact

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Impact

< If the feature is ordered with other work, state the impact of this feature on the other work>

Related Architecture/Technical Documents

Disabling OCM Controllers (slides). Note that the controller names may be a bit out of date once API-1651 is done.
Install capabilities - OCP docs

Done Checklist

Acceptance criteria are met
Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
User Journey automation is delivered
Support and SRE teams are provided with enough skills to support the feature in production environment

Epic OCPBUILD-7: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-34077: Default Rolebindings Created on OCP 4.16 with No Capabilities

View the Description View the linked PRs

In OCP 4.16.0, the default role bindings for image puller, image pusher, and deployer are created, even if the respective capabilities are disabled on the cluster.

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/346

Story OCPBUILD-9: Disable Build/Deployer/Image Registry RBAC Controllers with Capabilities

View the Description View the linked PRs

Story (Required)

As a cluster admin trying to disable the Build, DeploymentConfig, and Image Registry capabilities I want the RBAC controllers for the builder and deployer service accounts and default image-registry rolebindings disabled when their respective capability is disabled.

<Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it? How does it improve the customer's experience?>

Background (Required)

<Describes the context or background related to this story>

In ~~WRKLDS-695~~, ocm-o was enhanced to disable the Build and DeploymentConfig controllers when the respective capability was disabled. This logic should be extended to include the controllers that set up the service accounts and role bindings for these respective features.

Out of scope

<Defines what is not included in this story>

Approach (Required)

<Description of the general technical path on how to achieve the goal of the story. Include details like json schema, class definitions>

Ensure that all the controllers (builder, deployer, image-puller rolebinding controllers) are specified in the api/*/types.go as well as in controllerInitializers, so that the controller is initiated when capability is enabled.
Map the controllers introduced with ~~BUILD-725~~ to respective capabilities in ocm-operator.
- This helps to avoid running controllers when capabilities are disabled.
- The relevant clusterVersionCapability is available here
The OpenShift CI has no check for cluster deployment with capabilities disabled. Looked through https://github.com/openshift/release/tree/master/ci-operator/step-registry/ipi/conf/capability

- Needs manual testing (OpenShift cluster deployed with all/some capabilities disabled).

Dependencies

<Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>

Acceptance Criteria (Mandatory)

Build and DeploymentConfig systems remain functional when the respective capability is enabled.
Build, DeploymentConfig, and Image-Puller RoleBinding controllers are not started when the respective capability is disabled.

INVEST Checklist

Dependencies identified

Blockers noted and expected delivery timelines set

Design is implementable

Acceptance criteria agreed upon

Story estimated

Engineering: 5
QE: 2
Doc: 2

Legend

Unknown

Verified

Unsatisfied

Done Checklist

Code is completed, reviewed, documented and checked in
Unit and integration test automation have been delivered and running cleanly in continuous integration/staging/canary environment
Continuous Delivery pipeline(s) is able to proceed with new code included
Customer facing documentation, API docs etc. are produced/updated, reviewed and published
Acceptance criteria are met

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/335

Bug OCPBUGS-34395: Build Cluster Config API Not Present when Build Capability Enabled

View the Description View the linked PRs

Description of problem:


When a cluster is deployed with no capabilities enabled, and the Build capability is later enabled, it's related cluster configuration CRD is not installed. This prevents admins from fine-tuning builds and ocm-o from fully reconciling its state.

Version-Release number of selected component (if applicable):

4.16.0

How reproducible:

Always

Steps to Reproduce:

    1. Launch a cluster with no capabilities enabled (via cluster-bot: launch 4.16.0.0-ci aws,no-capabilities 
    2. Edit the clusterversion to enable the build capability: oc patch clusterversion/version --type merge -p '{"spec":{"capabilities":{"additionalEnabledCapabilities":["Build"]}}}'
    3. Wait for the openshift-apiserver and openshift-controller-manager to roll out

Actual results:

APIs for BuildConfig (build.openshift.io) are enabled.
Cluster configuration API for build system is not:

$ oc api-resources | grep "build"
buildconfigs  bc  build.openshift.io/v1 true         BuildConfig
builds                   build.openshift.io/v1 true         Build

Expected results:

Cluster configuration API is enabled.

$ oc api-resources | grep "build"
buildconfigs  bc  build.openshift.io/v1    true  BuildConfig
builds                   build.openshift.io/v1    true  Build
builds                   config.openshift.io/v1  true  Build

Additional info:

This causes list errors in openshift-controller-manager-operator, breaking the controller that reconciles state for builds and the image registry.

W0523 18:23:38.551022       1 reflector.go:539] k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: failed to list *v1.Build: the server could not find the requested resource (get builds.config.openshift.io)
E0523 18:23:38.551334       1 reflector.go:147] k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: Failed to watch *v1.Build: failed to list *v1.Build: the server could not find the requested resource (get builds.config.openshift.io)

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/351

Feature TELCOSTRAT-38: Pre-Install SNO DU Deployments

View the Description

Feature Overview

Telecommunications providers continue to deploy OpenShift at the Far Edge. The acceleration of this adoption and the nature of existing Telecommunication infrastructure and processes drive the need to improve OpenShift provisioning speed at the Far Edge site and the simplicity of preparation and deployment of Far Edge clusters, at scale.

Goals

Simplicity The folks preparing and installing OpenShift clusters (typically SNO) at the Far Edge range in technical expertise from technician to barista. The preparation and installation phases need to be reduced to a human-readable script that can be utilized by a variety of non-technical operators. There should be as few steps as possible in both the preparation and installation phases.
Minimize Deployment Time A telecommunications provider technician or brick-and-mortar employee who is installing an OpenShift cluster, at the Far Edge site, needs to be able to do it quickly. The technician has to wait for the node to become in-service (CaaS and CNF provisioned and running) before they can move on to installing another cluster at a different site. The brick-and-mortar employee has other job functions to fulfill and can't stare at the server for 2 hours. The install time at the far edge site should be in the order of minutes, ideally less than 20m.
Utilize Telco Facilities Telecommunication providers have existing Service Depots where they currently prepare SW/HW prior to shipping servers to Far Edge sites. They have asked RH to provide a simple method to pre-install OCP onto servers in these facilities. They want to do parallelized batch installation to a set of servers so that they can put these servers into a pool from which any server can be shipped to any site. They also would like to validate and update servers in these pre-installed server pools, as needed.
Validation before Shipment Telecommunications Providers incur a large cost if forced to manage software failures at the Far Edge due to the scale and physical disparate nature of the use case. They want to be able to validate the OCP and CNF software before taking the server to the Far Edge site as a last minute sanity check before shipping the platform to the Far Edge site.
IPSec Support at Cluster Boot Some far edge deployments occur on an insecure network and for that reason access to the host’s BMC is not allowed, additionally an IPSec tunnel must be established before any traffic leaves the cluster once its at the Far Edge site. It is not possible to enable IPSec on the BMC NIC and therefore even OpenShift has booted the BMC is still not accessible.

Requirements

Factory Depot: Install OCP with minimal steps
- Telecommunications Providers don't want an installation experience, just pick a version and hit enter to install
- Configuration w/ DU Profile (PTP, SR-IOV, see telco engineering for details) as well as customer-specific addons (Ignition Overrides, MachineConfig, and other operators: ODF, FEC SR-IOV, for example)
- The installation cannot increase in-service OCP compute budget (don't install anything other that what is needed for DU)
- Provide ability to validate previously installed OCP nodes
- Provide ability to update previously installed OCP nodes
- 100 parallel installations at Service Depot
Far Edge: Deploy OCP with minimal steps
- Provide site specific information via usb/file mount or simple interface
- Minimize time spent at far edge site by technician/barista/installer
- Register with desired RHACM Hub cluster for ongoing LCM
Minimal ongoing maintenance of solution
- Some, but not all telco operators, do not want to install and maintain an OCP / ACM cluster at Service Depot
The current IPSec solution requires a libreswan container to run on the host so that all N/S OCP traffic is encrypted. With the current IPSec solution this feature would need to support provisioning host-based containers.

A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

requirement	Notes	isMvp?

Describe Use Cases (if needed)

Telecommunications Service Provider Technicians will be rolling out OCP w/ a vDU configuration to new Far Edge sites, at scale. They will be working from a service depot where they will pre-install/pre-image a set of Far Edge servers to be deployed at a later date. When ready for deployment, a technician will take one of these generic-OCP servers to a Far Edge site, enter the site specific information, wait for confirmation that the vDU is in-service/online, and then move on to deploy another server to a different Far Edge site.

Retail employees in brick-and-mortar stores will install SNO servers and it needs to be as simple as possible. The servers will likely be shipped to the retail store, cabled and powered by a retail employee and the site-specific information needs to be provided to the system in the simplest way possible, ideally without any action from the retail employee.

Out of Scope

Q: how challenging will it be to support multi-node clusters with this feature?

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

< What does success look like?>

< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact>

< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>

<What concepts do customers need to understand to be successful in [action]?>

<How do we expect customers will use the feature? For what purpose(s)?>

<What reference material might a customer want/need to complete [action]?>

<Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available. >

<What is the doc impact (New Content, Updates to existing content, or Release Note)?>

Interoperability Considerations

< Which other products and versions in our portfolio does this feature impact?>

< What interoperability test scenarios should be factored by the layered product(s)?>

Questions

Question	Outcome

Epic MGMT-13418: Shorten SNO installation duration

View the Description

Epic Goal

Install SNO within 10 minutes

Why is this important?

SNO installation takes around 40+ minutes.
This makes SNO less appealing when compared to k3s/microshift.
We should analyze the SNO installation, figure our why it takes so long and come up with ways to optimize it

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

https://docs.google.com/document/d/1ULmKBzfT7MibbTS6Sy3cNtjqDX1o7Q0Rek3tAe1LSGA/edit?usp=sharing

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Bug OCPBUGS-7440: kube-controller-manager cluster operator is degraded due connection refused while querying rules

View the Description View the linked PRs

Description of problem:

while trying to figure out why it takes so long to install Single node OpenShift I noticed that the kube-controller-manager cluster operator is degraded for ~5 minutes due to:
GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.119.108:9091: connect: connection refused
I don't understand how the prometheusClient is successfully initialized, but we get a connection refused once we try to query the rules.
Note that if the client initialization fails the kube-controller-manger won't set the  GarbageCollectorDegraded to true.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. install SNO with bootstrap in place (https://github.com/eranco74/bootstrap-in-place-poc)

2. monitor the cluster operators staus

Actual results:

GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.119.108:9091: connect: connection refused

Expected results:

Expected the GarbageCollectorDegraded status to be false

Additional info:

It seems that for PrometheusClient to be successfully initialised it needs to successfully create a connection but we get connection refused once we make the query.
Note that installing SNO with this patch (https://github.com/eranco74/cluster-kube-controller-manager-operator/commit/26e644503a8f04aa6d116ace6b9eb7b9b9f2f23f) reduces the installation time by 3 minutes

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/706

Epic AUTH-20: Client Cert based Metrics Scraping

View the Description

Monitoring needs to be reliable and is the very useful when trying to debug clusters in an already degraded state. We want to ensure that metrics scraping can always work if the scraper can reach the target, even if the kube-apiserver is unavailable or unreachable. To do this, we will combine a local authorizer (already merged in many binaries and the rbac-proxy) and client-cert based authentication to have a fully local authentication and authorization path for scraper targets.

If networking (or part of networking) is down and a scraper target cannot reach the kube-apiserver to verify a token and a subjectaccessreview, then the metrics scraper can be rejected. The subjectaccessreview (authorization) is already largely addressed, but service account tokens are still used for scraping targets. Tokens require an external network call that we can avoid by using client certificates. Gathering metrics, especially client metrics, from partially functionally clusters helps narrow the search area between kube-apiserver, etcd, kubelet, and SDN considerably.

In addition, this will significantly reduce the load on the kube-apiserver. We have observed in the CI cluster that token and subject access reviews are a significant percentage of all kube-apiserver traffic.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story AUTH-26: cluster-policy-controller: approve CSRs issued by monitoring

View the Description View the linked PRs

User story:

As cluster-policy-controller I automatically approve cert signing requests issued by monitoring.

DoD:

cert signing requests issued by the cluster-monitoring-operator service account are approved automatically.

Implementation hints: leverage approving logic implemented in https://github.com/openshift/library-go/pull/1083.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/535

Epic BUILD-144: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story BUILD-145: Fix usability with mirroring sample images in disconnected environments

View the Description View the linked PRs

User Story

As an cluster administrator of a disconnected OCP cluster,
I want a list of possible sample images to mirror
So that I can configure my image mirror prior to installing OCP in a disconnected environment.

Acceptance Criteria

Publish the list of the sample images to mirror as a ConfigMap in the samples operator namespace.
Provide instructions on how to obtain the current image SHAs from the list above (via podman or skopeo).
Reference the ConfigMap name in our "import failing" alert.
[optional] Reference the ConfigMap name in our "Removed" condition message.

Notes

it is too onerous to find a connected cluster in order to obtain the list of possible samples images to mirror using the current documented procedures.
I need a list make available to me in my disconnected cluster that I can reference after initial install.

https://github.com/openshift/cluster-samples-operator/pull/321

Story BUILD-125: Improve recording of imagestream import results

View the Description View the linked PRs

User Story

Sample operator use of the reason field in its config object to track imagestream import completion has resulted in that singleton being a bottleneck and source of update conflicts (we are talking 60 or 70 imagestreams potentially updating that once field concurrently).

Acceptance Criteria

Reduction in reconciliation errors when imagestream imports complete (success or failure).
ConfigMaps containing failing imagstream imports are added to must-gather
After all imagestreams successfully import, there are no error-recording ConfigMaps in the samples-operator namespace.

Notes

See this hackday PR

for an alternative approach which uses a configmap per imagestream

https://github.com/openshift/cluster-samples-operator/pull/313

Epic BUILD-529: Kubernetes 1.25 Rebases

View the Description

Epic Goal

Update OpenShift components that are owned by the Builds + Jenkins Team to use Kubernetes 1.25

Why is this important?

Our components need to be updated to ensure that they are using the latest bug/CVE fixes, features, and that they are API compatible with other OpenShift components.

Acceptance Criteria

Existing CI/CD tests must be passing

Story OCPBUILD-146: Rebase openshift/cluster-openshift-controller-manager-operator to k8s 1.25

Epic IR-115: OCI Support

View the Description

Goal: Support OCI images.

Problem: Buildah and podman use OCI format by default and OpenShift Image Registry and ImageStream API doesn't understand it.

Why is this important: OCI images are supposed to replace Docker schema 2 images, OpenShift should be ready when OCI images become widely adopted.

Dependencies (internal and external):

Prioritized epics + deliverables (in scope / not in scope):

Estimate (XS, S, M, L, XL, XXL): XL

Previous Work:

Customers:

Open Questions:

Story IR-112: Add support of OCI images to image pruner

View the Description View the linked PRs

User Story

As a user of OpenShift
I want the image pruner to be aware of OCI images
So that it doesn't delete their layers/configs

Acceptance Criteria

- When
  - an OCI image is pushed/mirrored to the registry
  - an schema 2 image is pushes/mirrored to the registry and share its layers/config with the OCI image
  - the schema 2 image is eligible to be pruned
  - the shared layers/config are not shared with other images
- the pruner
  - will delete the schema 2 image
  - will NOT delete the OCI image and its layers/config

Launch Checklist

Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated

Notes

Add pertinent notes here:

Enhancement proposal link
Previous product docs
Best practices
Known issues

Guiding Questions

User Story

Is this intended for an administrator, application developer, or other type of OpenShift user?
What experience level is this intended for? New, experienced, etc.?
Why is this story important? What problems does this solve? What benefit(s) will the customer experience?
Is this part of a larger epic or initiative? If so, ensure that the story is linked to the appropriate epic and/or initiative.

Acceptance Criteria

How should a customer use and/or configure the feature?
Are there any prerequisites for using/enabling the feature?

Notes

Is this a new feature, or an enhancement of an existing feature? If the latter, list the feature and docs reference.
Are there any new terms, abbreviations, or commands introduced with this story? Ex: a new command line argument, a new custom resource.
Are there any recommended best practices when using this feature?
On feature completion, are there any known issues that customers should be aware of?

https://github.com/openshift/oc/pull/617

Story IR-114: Add support of OCI images to image registry

View the Description View the linked PRs

User Story

As a user of OpenShift
I want to push OCI images to the registry
So that I can use buildah and podman with their defaults to push images

Acceptance Criteria

An OCI image can be pushed to the registry by buildah or podman
An imported OCI image can be pulled from the registry
The registry should be able to pull-through OCI images from other registries

Launch Checklist

Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated

Notes

Add pertinent notes here:

Enhancement proposal link
Previous product docs
Best practices
Known issues

Guiding Questions

User Story

Is this intended for an administrator, application developer, or other type of OpenShift user?
What experience level is this intended for? New, experienced, etc.?
Why is this story important? What problems does this solve? What benefit(s) will the customer experience?
Is this part of a larger epic or initiative? If so, ensure that the story is linked to the appropriate epic and/or initiative.

Acceptance Criteria

How should a customer use and/or configure the feature?
Are there any prerequisites for using/enabling the feature?

Notes

Is this a new feature, or an enhancement of an existing feature? If the latter, list the feature and docs reference.
Are there any new terms, abbreviations, or commands introduced with this story? Ex: a new command line argument, a new custom resource.
Are there any recommended best practices when using this feature?
On feature completion, are there any known issues that customers should be aware of?

https://github.com/openshift/image-registry/pull/255

Epic IR-138: Continuous Improvement of Maintainability 4.7

View the Description

Epic Goal

Improve CI testing of the image registry components.

Why is this important?

The image registry, image API and the image pruner had a lot of tests removed during transition 4.0. This may make the platform less stable and/or slow down the team.

Scenarios

Acceptance Criteria

CI - tests should be more stable and have broader coverage

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

CI - CI is running, tests are automated and merged.

Story IR-137: Make ISI tests non-disruptive

View the Description View the linked PRs

https://github.com/openshift/origin/pull/25475/files marked our tests for ISI as Disruptive.

Tests should wait until operators become stable, otherwise other tests will be run on an unstable cluster and it'll cause flakes.

Acceptance Criteria

The tests wait until operators stable after image.config changes.
The tests is no longer [Disruptive].
If tests are slow (it depends on other operator, but MCO tends to be slow), they should be [Slow].

https://github.com/openshift/origin/pull/25497

Story IR-152: Monitor availability of registry during upgrade tests

View the Description View the linked PRs

An the registry developer
I want e2e-upgrade jobs to monitor availability of the registry during upgrades
So that I can be sure that clients can use the registry without disruptions.

Acceptance Criteria

A new test in openshift/origin repo.

Notes

https://github.com/openshift/origin/blob/e6b3d1ece61d7c3ab5a23151c9875e1f9ad36838/test/extended/util/disruption/controlplane/controlplane.go#L69

https://bugzilla.redhat.com/show_bug.cgi?id=1884380

https://github.com/openshift/origin/pull/25679

Task IR-172: Convert image-registry integration tests into e2e tests

View the Description View the linked PRs

The integration tests for the image registry expect that OpenShift and tests are run on the same machine (i.e. OpenShift can connect to sockets that the tests listen). This is not the case with e2e tests.

Acceptance Criteria

every integration test is converted into an e2e test or a techdebt story
image-registry tests are green

https://github.com/openshift/image-registry/pull/258

Epic IR-85: Rebase Registry v2.7.1

View the Description

Goal: Rebase registry to Docker Distribution

Problem: The registry is currently based on an outdated version of the upstream docker/distribution project. The base does not even have a version associated with it - DevEx last rebased on an untagged commit.

Why is this important: Update the registry with improvements and bug fixes from the upstream community.

Dependencies (internal and external):

Prioritized epics + deliverables (in scope / not in scope):

~~DEVEXP-505~~ Rebase on Docker Distribution v2.7.1

Estimate (XS, S, M, L, XL, XXL): M

Previous Work:

Customers:

Open questions:

Story IR-52: Rebase Image Registry to v2.7.1

View the Description View the linked PRs

User Story

As a user of OpenShift
I want the image registry to be rebased on the latest docker/distribution release (v2.7.1)
So that the image registry has the latest upstream bugfixes and enhancements

Acceptance Criteria

Image registry is based on docker/distribution v2.7.1

Launch Checklist

Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated

Notes

Add pertinent notes here:

Enhancement proposal link
Previous product docs
Best practices
Known issues

Guiding Questions

User Story

Is this intended for an administrator, application developer, or other type of OpenShift user?
What experience level is this intended for? New, experienced, etc.?
Why is this story important? What problems does this solve? What benefit(s) will the customer experience?
Is this part of a larger epic or initiative? If so, ensure that the story is linked to the appropriate epic and/or initiative.

Acceptance Criteria

How should a customer use and/or configure the feature?
Are there any prerequisites for using/enabling the feature?

Notes

Is this a new feature, or an enhancement of an existing feature? If the latter, list the feature and docs reference.
Are there any new terms, abbreviations, or commands introduced with this story? Ex: a new command line argument, a new custom resource.
Are there any recommended best practices when using this feature?
On feature completion, are there any known issues that customers should be aware of?

https://github.com/openshift/image-registry/pull/252

Epic OCPBUILD-26: Kubernetes 1.27 Rebases

View the Description

Epic Goal

Update OpenShift components that are owned by the Builds + Jenkins Team to use Kubernetes 1.27

Why is this important?

Our components need to be updated to ensure that they are using the latest bug/CVE fixes, features, and that they are API compatible with other OpenShift components.

Acceptance Criteria

Existing CI/CD tests must be passing

Story OCPBUILD-145: Rebase openshift/cluster-openshift-controller-manager-operator to k8s 1.27

Epic OCPCLOUD-2512: Clean up Cloud Provider code CCM migration

View the Description

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

Stop setting `-~~cloud-provider` and `~~-cloud-config` arguments on KAS, KCM and MCO
Remove `CloudControllerOwner` condition from CCM and KCM ClusterOperators
Remove feature gating reliance in library-go IsCloudProviderExternal
Remove CloudProvider feature gates from openshift/api

Why is this important?

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story OCPCLOUD-2513: Clear --cloud-provider and --cloud-config flags from KCM/KAS

View the Description View the linked PRs

Background

KCM and KAS previously relied on having the `{}cloud-provider` and `{-}-cloud-config` flags set. However, these are no longer required as there is no cloud provider code in either binary.

Both operators rely on the config observer in library go to set these flags.

In the future, if these values are set, even to the empty string, then startup will fail.

The config observer sets keys and values for a map (see), we need to make sure the keys for these two flags are deleted rather than set to a specific value.

Steps

Update the logic in the config observer to remove the `{}cloud-config` and `{-}-cloud-provider` flag, neither should be set going forward
Update KAS and KCM operators to include the new logic.

Stakeholders

Cluster Infra
API team
Workloads team

Definition of Done

Clusters do not set `{}cloud-provider` or `{-}-cloud-config` on KAS and KCM

Docs

<Add docs requirements for this card>

Testing

<Explain testing that will be added>

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/806

Story OCPCLOUD-2515: Remove CloudControllerOwner condition from CCMO and KCMO

View the Description View the linked PRs

Background

As part of the migration to external cloud providers, the CCMO and KCMO used a CloudControllerOwner condition to show which controller owned the controllers.

This is no longer required and can be removed.

Steps

Remove code from CCMO that looks for and gates on the KCMO condition
Ensure CCMO clears the condition
Ensure KCMO clears the condition

Stakeholders

Cluster Infra
Workloads team

Definition of Done

Clusters upgraded to 4.16 do not have a CloudControllerOwner condition set on the KCMO or CCMO ClusterOperators

Docs

<Add docs requirements for this card>

Testing

<Explain testing that will be added>

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/795

Story OCPCLOUD-2514: External cloud providers should not rely on feature gates

View the Description View the linked PRs

Background

Code in library-go currently uses feature gates to determine if Azure and GCP clusters should be external or not. They have been promoted for at least one release and we do not see ourselves going back.

In 4.17 the code is expected to be deleted completely.

We should remove the reliance on the feature gate from this part of the code and clean up references to feature gate access at the call sites.

Steps

Update library go to remove reliance on feature gates
Update callers to no longer rely on feature gate accessor (KCMO, KASO, MCO, CCMO)
Remove feature gates from API repo

Stakeholders

Cluster Infra
MCO team
Workloads team
API server team

Definition of Done

Feature gates for external cloud providers are removed from the product

Docs

<Add docs requirements for this card>

Testing

<Explain testing that will be added>

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/794

Epic STOR-1565: OCP 4.16 release chores

View the Description

Epic Goal

Update all images that we ship with OpenShift to the latest upstream releases and libraries.
Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).

Why is this important?

We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

Story STOR-1688: Chore: add .snyk file to ignore false positives

View the Description View the linked PRs

We get too many false positive bugs like https://issues.redhat.com/browse/OCPBUGS-25333 from SAST scans, especially from the vendor directory. Add a .snyk file like https://github.com/openshift/oc/blob/master/.snyk to each repo to ignore them.

Story STOR-1573: Chore: update CSI sidecars

View the Description View the linked PRs

Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories

external-attacher
external-provisioner
external-resizer
external-snapshotter
node-driver-registrar
livenessprobe

Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.

Epic STOR-1576: OCP 4.17 release chores

View the Description

Epic Goal

Update all images that we ship with OpenShift to the latest upstream releases and libraries.
Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).

Why is this important?

We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

Story STOR-1593: Chore: update CSI sidecars

View the Description View the linked PRs

Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories

external-attacher
external-provisioner
external-resizer
external-snapshotter
node-driver-registrar
livenessprobe

Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.

https://github.com/openshift/csi-external-resizer/pull/164

Epic STOR-2005: OCP 4.18 release chores

View the Description

Epic Goal

Update all images that we ship with OpenShift to the latest upstream releases and libraries.
Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).

Why is this important?

We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

Story STOR-2018: Chore: update CSI sidecars

View the Description View the linked PRs

Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories

external-attacher
external-provisioner
external-resizer
external-snapshotter
node-driver-registrar
livenessprobe

Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.

Bug OCPBUGS-722: Undiagnosed panic detected in pod: openshift-controller-manager-operator_openshift-controller-manager-operator invalid memory address or nil pointer dereference

View the Description View the linked PRs

Description of problem:

In looking at jobs on an accepted payload at https://amd64.ocp.releases.ci.openshift.org/releasestream/4.12.0-0.ci/release/4.12.0-0.ci-2022-08-30-122201 , I observed this job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-serial/1564589538850902016 with "Undiagnosed panic detected in pod" "pods/openshift-controller-manager-operator_openshift-controller-manager-operator-74bf985788-8v9qb_openshift-controller-manager-operator.log.gz:E0830 12:41:48.029165       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)"

Version-Release number of selected component (if applicable):

4.12

How reproducible:

probably relatively easy to reproduce (but not consistently) given it's happened several times according to this search: https://search.ci.openshift.org/?search=Observed+a+panic%3A+%22invalid+memory+address+or+nil+pointer+dereference%22&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Steps to Reproduce:

1. let nightly payloads run or run one of the presubmit jobs mentioned in the search above
2.
3.

Actual results:

Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)}

Expected results:

no panics

Additional info:

Bug OCPBUGS-6259: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/691

Bug OCPBUGS-13926: OCM-o does not support obtaining verbosity through OpenShiftControllerManager.operatorLogLevel objec

View the Description View the linked PRs

Description of problem:

OCM-o does not support obtaining verbosity through OpenShiftControllerManager.operatorLogLevel object

Version-Release number of selected component (if applicable):

How reproducible:

modify the OpenShiftControllerManager.operatorLogLevel, and the OCM-o operator will not display the correspond logs

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

Bug OCPBUGS-14323: Change static manifest pod files permissions to 0600 to conform with CIS benchmarks

View the Description View the linked PRs

Refer to the CIS RedHat OpenShift Container Platform Benchmark PDF: https://drive.google.com/file/d/12o6O-M2lqz__BgmtBrfeJu1GA2SJ352c/view
1.1.7 Ensure that the etcd pod specification file permissions are set to 600 or more restrictive (Manual)
======================================================================================================
As per CIS v1.3 PDF permissions should be 600 with the following statement:
"The pod specification file is created on control plane nodes at /etc/kubernetes/manifests/etcd-member.yaml with permissions 644. Verify that the permissions are 600 or more restrictive."
But when I ran the following command it was showing 644 permissions

for i in $(oc get pods -n openshift-etcd -l app=etcd -o name | grep etcd )
do
echo "check pod $i"
oc rsh -n openshift-etcd $i \
stat -c %a /etc/kubernetes/manifests/etcd-pod.yaml
done

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/739

Bug OCPBUGS-16254: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/562

Bug OCPBUGS-4491: hypershift: aws-ebs-csi-driver-operator uses wrong kubeconfig

View the Description View the linked PRs

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/169

Bug OCPBUGS-15823: Adjust CSI rpc call timeouts from Sidecar for AWS and GCP-PD driver

View the Description View the linked PRs

We should adjust CSI RPC call timeout from sidecars to CSI driver. We seem to be using default values which are just too short and hence can cause unintended side-effects.

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/248

Bug OCPBUGS-2862: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-aws/pull/451

Bug OCPBUGS-11882: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-12133: Update 4.14 ose-cluster-kube-controller-manager-operator image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/726

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/727

Bug OCPBUGS-13017: aws-ebs-csi-driver-controller-sa ServiceAccount does not include the HCP pull-secret in its imagePullSecrets

View the Description View the linked PRs

aws-ebs-csi-driver-controller-ca ServiceAccount does not include the HCP pull-secret in its imagePullSecrets. Thus, if a HostedCluster is created with a `pullSecret` that contains creds that the management cluster pull secret does not have, the image pull fails.

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/219

Bug OCPBUGS-14810: Update OWNERS and OWNERS_ALIASES in livenessprobe repo

View the Description View the linked PRs

Sanitize OWNERS/OWNER_ALIASES:

1) OWNERS must have:

component: "Storage / Kubernetes External Components"

2) OWNER_ALIASES must have all team members of Storage team.

https://github.com/openshift/csi-livenessprobe/pull/45

Bug OCPBUGS-16072: Updating Kubernetes and associated dependencies

View the Description View the linked PRs

Description of problem:

Kubernetes and other associated dependencies need to be updated to protect against potential vulnerabilities.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/296

Bug OCPBUGS-3041: Guard Pod Hostnames Too Long and Truncated Down Into Collisions With Other Masters

View the Description View the linked PRs

Discovered in the must gather kubelet_service.log from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade/1586093220087992320

It appears the guard pod names are too long, and being truncated down to where they will collide with those from the other masters.

From kubelet logs in this run:

❯ grep openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste kubelet_service.log
Oct 28 23:58:55.693391 ci-op-3hj6pnwf-4f6ab-lv57z-master-1 kubenswrapper[1657]: E1028 23:58:55.693346    1657 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-1" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"
Oct 28 23:59:03.735726 ci-op-3hj6pnwf-4f6ab-lv57z-master-0 kubenswrapper[1670]: E1028 23:59:03.735671    1670 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-0" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"
Oct 28 23:59:11.168082 ci-op-3hj6pnwf-4f6ab-lv57z-master-2 kubenswrapper[1667]: E1028 23:59:11.168041    1667 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-2" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"

This also looks to be happening for openshift-kube-scheduler-guard, kube-controller-manager-guard, possibly others.

Looks like they should be truncated further to make room for random suffixes in https://github.com/openshift/library-go/blame/bd9b0e19121022561dcd1d9823407cd58b2265d0/pkg/operator/staticpod/controller/guard/guard_controller.go#L97-L98

Unsure of the implications here, it looks a little scary.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/664

Bug OCPBUGS-3990: HyperShift control plane operators have wrong priorityClass

View the Description View the linked PRs

Description of problem:

This PR fails HyperShift CI fails with:

=== RUN TestAutoscaling/EnsureNoPodsWithTooHighPriority
util.go:411: pod csi-snapshot-controller-7bb4b877b4-q5457 with priorityClassName system-cluster-critical has a priority of 2000000000 with exceeds the max allowed of 100002000
util.go:411: pod csi-snapshot-webhook-644b6dbfb-v4lj7 with priorityClassName system-cluster-critical has a priority of 2000000000 with exceeds the max allowed of 100002000

How reproducible:

always

Steps to Reproduce:

Install HyperShift + create a guest cluster with CSI Snapshot Controller and/or Cluster Storage Operator / AWS EBS CSI driver operator running in the HyperShift managed cluster
Check priorityClass of the guest control plane pods in the hosted cluster.

Alternatively, ci/prow/e2e-aws in https://github.com/openshift/hypershift/pull/1698 and https://github.com/openshift/hypershift/pull/1748 must pass.

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/167

Bug OCPBUGS-7837: hypershift: aws-ebs-csi-driver-operator uses guest cluster proxy causing PV provisioning failure

View the Description View the linked PRs

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/186

Bug OCPBUGS-5275: remove unnecessary RBAC in OCM

View the Description View the linked PRs

Description of problem:

system:openshift:openshift-controller-manager:leader-locking-ingress-to-route-controller role and role-binding should not be present in openshift-route-controller-manager namespace. Not needed since the leader locking responsibility was moved to route-controller-manager which is managed by leader-locking-openshift-route-controller-manager

This was added in and used by https://github.com/openshift/openshift-controller-manager/pull/230/files#diff-2ddbbe8d5a13b855786852e6dc0c6213953315fd6e6b813b68dbdf9ffebcf112R20

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/276

Bug OCPBUGS-8691: Operands running management side missing affinity, tolerations, node selector and priority rules than the operator

View the Description View the linked PRs

Description of problem:

In hypershift context:
Operands managed by Operators running in the hosted control plane namespace in the management cluster do not honour affinity opinions https://hypershift-docs.netlify.app/how-to/distribute-hosted-cluster-workloads/
https://github.com/openshift/hypershift/blob/main/support/config/deployment.go#L263-L265

These operands running management side should honour the same affinity, tolerations, node selector and priority rules than the operator.
This could be done by looking at the operator deployment itself or at the HCP resource.

aws-ebs-csi-driver-controller
aws-ebs-csi-driver-operator
csi-snapshot-controller
csi-snapshot-webhook

Version-Release number of selected component (if applicable):

How reproducible:

Always

Steps to Reproduce:

1. Create a hypershift cluster.
2. Check affinity rules and node selector of the operands above.
3.

Actual results:

Operands missing affinity rules and node selecto

Expected results:

Operands have same affinity rules and node selector than the operator

Additional info:

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/205

Task ETCD-178: refactor discover-etcd-initial-cluster and add tests

View the Description View the linked PRs

discover-etcd-initial-cluster was written very early on in the cluster-etcd-operator life cycle. We have observed at least one bug in this code and in order to validate logical correctness it needs to be rewritten with unit tests.

PR: https://github.com/openshift/etcd/pull/73

https://github.com/openshift/etcd/pull/74

Task MON-1208: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-4347: set TLS cipher suites in Kube RBAC sidecars

View the Description View the linked PRs

See the following:
https://issues.redhat.com/browse/OCPBUGS-2083
https://github.com/openshift/library-go/pull/1413
https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/117

This was fixed for vsphere, but we need the same change for the other storage operators. Bump library-go and add --tls-cipher-suites=${TLS_CIPHER_SUITES} to the kube RBAC sidecars.

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/168

Bug OCPBUGS-4343: Use flowcontrol/v1beta3 for apf manifests in 4.13

View the Description View the linked PRs

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/273

Bug OCPBUGS-10568: migrate to using Lease for leader election

View the Description View the linked PRs

Description of problem:

library-go should use Lease for leader election by default. 
In 4.10 we switched from configmaps to configmapsleases, now we can switch to leases

change library-go to use lease by default, we already have an open pr for that: https://github.com/openshift/library-go/pull/1448 

once the pr merges, we should revendor library-go for:
- kas operator
- oas operator
- etcd operator
- kcm operator
- openshift controller manager operator
- scheduler operator
- auth operator
- cluster policy controller

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

Bug OCPBUGS-3283: remove unnecessary RBAC in KCM

View the Description View the linked PRs

Description of problem:

We discovered that we are shipping unnecesary RBAC in https://coreos.slack.com/archives/CC3CZCQHM/p1667571136730989 .

This RBAC was only used 4.2 and 4.3 for

for making a switch from configMaps to leases in leader election

and we should remove it

Version-Release number of selected component (if applicable):{code:none}

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/661

Bug OCPBUGS-15256: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/740

Bug OCPBUGS-16209: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/1777

Bug OCPBUGS-3985: Allow PSa enforcement in 4.13 by using featuresets

View the Description View the linked PRs

Allow users to turn PodSecurity admission in enforcement mode in 4.13 as TechPreviewNoUpgrade in order to be able to test the feature with their workloads and see if there is anything that needs fixing.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/663

Bug OCPBUGS-12709: --external-cloud-volume-plugin for out-of tree providers

View the Description View the linked PRs

Due to removal of in-tree AWS provider https://github.com/kubernetes/kubernetes/pull/115838 we need to ensure that KCM is setting --external-cloud-volume-plugin flag accordingly, especially that the CSI migration was GA-ed in 4.12/1.25.

The original PR that fixed this (https://github.com/openshift/cluster-kube-controller-manager-operator/pull/721) got reverted by mistake. We need to bring it back to unblock the kube rebase.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/729

Bug OCPBUGS-13895: route-controller-manager configuration changes for a refactoring

View the Description View the linked PRs

Description of problem:

The following changes are required for openshift/route-controller-manager#22 refactoring.

add POD_NAME to route-controller-manager deployment
introduce route-controller-defaultconfig and customize lease name openshift-route-controllers to override the default one supplied by library-go
add RBAC for infrastructures which is used by library-go for configuring leader election

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/288

Bug OCPBUGS-8752: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-14812: Update OWNERS and OWNERS_ALIASES in external-resizer repo

View the Description View the linked PRs

Sanitize OWNERS/OWNER_ALIASES:

1) OWNERS must have:

component: "Storage / Kubernetes External Components"

2) OWNER_ALIASES must have all team members of Storage team.

https://github.com/openshift/csi-external-resizer/pull/142

Bug OCPBUGS-14824: CSI Driver Operators should not update the default storageclass annotation back after customers set the default storageclass annotation to false

View the Description View the linked PRs

Description of problem:

[AWS EBS CSI Driver Operator] should not update the default storageclass annotation back after customers remove the default storageclass annotation

Version-Release number of selected component (if applicable):

Server Version: 4.14.0-0.nightly-2023-06-08-102710

How reproducible:

Always

Steps to Reproduce:

1. Install an aws openshift cluster
2. Create 6 extra storage classes(any sc is ok)
3. Overwriter all the sc with the storageclass.kubernetes.io/is-default-class=false and check all the sc are set as undefault 
4. Overwriter all the sc with the storageclass.kubernetes.io/is-default-class=true 
5. loop step4-5 several times

Actual results:

Overwriter all the sc with the storageclass.kubernetes.io/is-default-class=false, sometimes recovered by the driver operator

Expected results:

Overwriter all the sc with the storageclass.kubernetes.io/is-default-class=false should always succeed

Additional info:

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/247

Bug OCPBUGS-14638: Bump Kubernetes to 0.27.1

View the Description View the linked PRs

Description of problem:

Bump Kubernetes to 0.27.1 and bump dependencies

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/294

Bug OCPBUGS-13579: Rebase components to k8s v0.27.*

View the Description View the linked PRs

Some repositories require bugzilla/valid-bug label present. Complement to https://issues.redhat.com/browse/WRKLDS-700.

Bug OCPBUGS-11352: --external-cloud-volume-plugin for out-of tree providers

View the Description View the linked PRs

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/721

Bug OCPBUGS-3978: AWS EBS CSI driver operator is degraded without "CSISnapshot" capability

View the Description View the linked PRs

With CSISnapshot capability disabled, the CSI driver operator are Degraded. For example:

18:12:16.895: Some cluster operators are not ready: storage (Degraded=True AWSEBSCSIDriverOperatorCR_AWSEBSDriverStaticResourcesController_SyncError: AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverStaticResourcesControllerDegraded: "volumesnapshotclass.yaml" (string): the server could not find the requested resource
AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverStaticResourcesControllerDegraded: )
Ginkgo exit error 1: exit with code 1}

Version-Release number of selected component (if applicable):
4.12.nightly

The reason is that cluster-csi-snapshot-controller-operator does not create VolumeSnapshotClass CRD, which AWS EBS CSI driver operator expects to exist.

CSI driver operators must skip VolumeSnapshotClass creation if the CRD does not exist.

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/165

Task MON-1302: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story OCPCLOUD-914: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/kubernetes/pull/447

Bug OCPBUGS-1904: CSI driver operators are degraded without "CSISnapshot" capability

View the Description View the linked PRs

With CSISnapshot capability is disabled, all CSI driver operators are Degraded. For example AWS EBS CSI driver operator during installation:

18:12:16.895: Some cluster operators are not ready: storage (Degraded=True AWSEBSCSIDriverOperatorCR_AWSEBSDriverStaticResourcesController_SyncError: AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverStaticResourcesControllerDegraded: "volumesnapshotclass.yaml" (string): the server could not find the requested resource
AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverStaticResourcesControllerDegraded: )
Ginkgo exit error 1: exit with code 1}

Version-Release number of selected component (if applicable):
4.12.nightly

The reason is that cluster-csi-snapshot-controller-operator does not create VolumeSnapshotClass CRD, which AWS EBS CSI driver operator expects to exist.

CSI driver operators must skip VolumeSnapshotClass creation if the CRD does not exist.

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/164

Bug OCPBUGS-5269: remove unnecessary RBAC in KCM: file removal

View the Description View the linked PRs

Description of problem:

We discovered that we are shipping unnecesary RBAC in https://coreos.slack.com/archives/CC3CZCQHM/p1667571136730989 .

This RBAC was only used 4.2 and 4.3 for

for making a switch from configMaps to leases in leader election

and we should remove it

followup to https://issues.redhat.com/browse/OCPBUGS-3283 - the RBACs are not applied anymore, but we just need to remove the actual files from the repo. No behavioral change should occur with the file removal.

Version-Release number of selected component (if applicable):{code:none}

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/681

Bug OCPBUGS-4401: limit cluster-policy-controller RBAC permissions

View the Description View the linked PRs

Description of problem:

cluster-policy-controller has  unnecessary permissions and is able to operate on all leases in KCM namespace. This also applies to namespace-security-allocation-controller that was moved some time ago and does not need lock mechanism.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/670

Bug OCPBUGS-16891: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-16783: Chore: Update OWNERS and OWNERS_ALIASES in CSI driver and operator repos

View the Description View the linked PRs

Sanitize OWNERS/OWNER_ALIASES in all CSI driver and operator repos.

For driver repos:

1) OWNERS must have `component`:

component: "Storage / Kubernetes External Components"

2) OWNER_ALIASES must have all team members of Storage team.

For operator repos:

1) OWNERS must have:

all team members of Storage team as `approvers`
`component`:
```
component: "Storage / Operators"
```

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/251

Bug OCPBUGS-16686: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/1917

Bug OCPBUGS-212: co/kube-controller-manager degraded: GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.153.28:9091: connect: cannot assign requested address

View the Description View the linked PRs

Description of problem:

oc --context build02 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-ec.1   True        False         45h     Error while reconciling 4.12.0-ec.1: the cluster operator kube-controller-manager is degraded

oc --context build02 get co kube-controller-manager
NAME                      VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.12.0-ec.1   True        False         True       2y87d   GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.153.28:9091: connect: cannot assign requested address

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

build02 is a build farm cluster in CI production.
I can provide credentials to access the cluster if needed.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/649

Bug OCPBUGS-3929: Use flowcontrol/v1beta2 for apf manifests in 4.13

View the Description View the linked PRs

In 4.12.0-rc.0 some API-server components declare flowcontrol/v1beta1 release manifests:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.12.0-rc.0-x86_64
$ grep -r flowcontrol.apiserver.k8s.io manifests
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_etcd-operator_10_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-controller-manager-operator_10_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1

The APIs are scheduled for removal in Kube 1.26, which will ship with OpenShift 4.13. We want the 4.12 CVO to move to modern APIs in 4.12, so the APIRemovedInNext.*ReleaseInUse alerts are not firing on 4.12. This ticket tracks removing those manifests, or replacing them with a more modern resource type, or some such. Definition of done is that new 4.13 (and with backports, 4.12) nightlies no longer include flowcontrol.apiserver.k8s.io/v1beta1 manifests.

This can be noticed in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/27560/pull-ci-openshift-origin-master-e2e-gcp-ovn/1593697975584952320/artifacts/e2e-gcp-ovn/openshift-e2e-test/build-log.txt:

[It] clients should not use APIs that are removed in upcoming releases [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/apiserver/api_requests.go:27
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
Nov 18 21:59:06.261: INFO: api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
Nov 18 21:59:06.261: INFO: api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
Nov 18 21:59:06.261: INFO: user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
[AfterEach] [sig-arch][Late]
  github.com/openshift/origin/test/extended/util/client.go:158
[AfterEach] [sig-arch][Late]
  github.com/openshift/origin/test/extended/util/client.go:159
flake: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Ginkgo exit error 4: exit with code 4

This is required to unblock https://github.com/openshift/origin/pull/27561

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/272

Bug OCPBUGS-34281: ART requests updates to 4.17 image ose-cluster-kube-controller-manager-operator-container

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/809

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/813

Story WRKLDS-1071: [R&D] revision controller spinning 30+ revisions

View the Description View the linked PRs

Slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1705425516419799

A revision controller is spinning to many revisions.

Goal: update the revision controller code to temporarily log config changes to validate the newly created revisions are valid. Or, proof some new revisions are unnecessary.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/796

Bug OCPBUGS-29565: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/793

Bug OCPBUGS-21830: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/308

Bug OCPBUGS-24846: Update 4.16 ose-csi-external-resizer-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/152

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-resizer/pull/152

Bug OCPBUGS-19157: Update 4.15 ose-aws-ebs-csi-driver-operator image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver-operator/pull/268

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/268

Bug OCPBUGS-21846: Test "static pods should start after being created" failed

View the Description View the linked PRs

Description of problem:

Recently, the passing rate for test "static pods should start after being created" has dropped significantly for some platforms: 

https://sippy.dptools.openshift.org/sippy-ng/tests/4.15/analysis?test=%5Bsig-node%5D%20static%20pods%20should%20start%20after%20being%20created&filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22%5Bsig-node%5D%20static%20pods%20should%20start%20after%20being%20created%22%7D%5D%2C%22linkOperator%22%3A%22and%22%7D

Take a look at this example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-techpreview/1712803313642115072

The test failed with the following message:
{  static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 6 on node: "ci-op-2z99zzqd-7f99c-rfp4q-master-0" didn't show up, waited: 3m0s}

Seemingly revision 6 was never reached. But if we look at the log from kube-controller-manager-operator, it jumps from revision 5 to revision 7: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-techpreview/1712803313642115072/artifacts/e2e-azure-sdn-techpreview/gather-extra/artifacts/pods/openshift-kube-controller-manager-operator_kube-controller-manager-operator-7cd978d745-bcvkm_kube-controller-manager-operator.log

The log also indicates that there is a possibility of race:

W1013 12:59:17.775274       1 staticpod.go:38] revision 7 is unexpectedly already the latest available revision. This is a possible race!

This might be a static controller issue. But I am starting with kube-controller-manager component for the case. Feel free to reassign. 

Here is a slack thread related to this:
https://redhat-internal.slack.com/archives/C01CQA76KMX/p1697472297510279

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/786

Bug OCPBUGS-30502: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/csi-external-resizer/pull/159

Bug OCPBUGS-5006: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/678

Bug OCPBUGS-18498: spec.containers.image is empty when use 'oc new-app' created deploy when build/deploymentconfig are not installed

View the Description View the linked PRs

Description of problem:

If not installed capability operator build and deploymentconfig, when use `oc new-app registry.redhat.io/<namespace>/<image>:<tag>` , the created deployment emptied spec.containers[0].image. The deploy will fail to start pod.

Version-Release number of selected component (if applicable):

oc version
Client Version: 4.14.0-0.nightly-2023-08-22-221456
Kustomize Version: v5.0.1
Server Version: 4.14.0-0.nightly-2023-09-02-132842
Kubernetes Version: v1.27.4+2c83a9f

How reproducible:

Always

Steps to Reproduce:

1. Installed cluster without build/deploymentconfig function
Set "baselineCapabilitySet: None" in install-config
2.Create a deploy using 'new-app' cmd
oc new-app registry.redhat.io/ubi8/httpd-24:latest
3.

Actual results:

2.
$oc new-app registry.redhat.io/ubi8/httpd-24:latest
--> Found container image c412709 (11 days old) from registry.redhat.io for "registry.redhat.io/ubi8/httpd-24:latest"    Apache httpd 2.4
    ----------------
    Apache httpd 2.4 available as container, is a powerful, efficient, and extensible web server. Apache supports a variety of features, many implemented as compiled modules which extend the core functionality. These can range from server-side programming language support to authentication schemes. Virtual hosting allows one Apache installation to serve many different Web sites.    Tags: builder, httpd, httpd-24    * An image stream tag will be created as "httpd-24:latest" that will track this image--> Creating resources ...
    imagestream.image.openshift.io "httpd-24" created
    deployment.apps "httpd-24" created
    service "httpd-24" created
--> Success
    Application is not exposed. You can expose services to the outside world by executing one or more of the commands below:
     'oc expose service/httpd-24'
    Run 'oc status' to view your app

3. oc get deploy -o yaml
 apiVersion: v1
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "1"
      image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"httpd-24:latest"},"fieldPath":"spec.template.spec.containers[?(@.name==\"httpd-24\")].image"}]'
      openshift.io/generated-by: OpenShiftNewApp
    creationTimestamp: "2023-09-04T07:44:01Z"
    generation: 1
    labels:
      app: httpd-24
      app.kubernetes.io/component: httpd-24
      app.kubernetes.io/instance: httpd-24
    name: httpd-24
    namespace: wxg
    resourceVersion: "115441"
    uid: 909d0c4e-180c-4f88-8fb5-93c927839903
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        deployment: httpd-24
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        annotations:
          openshift.io/generated-by: OpenShiftNewApp
        creationTimestamp: null
        labels:
          deployment: httpd-24
      spec:
        containers:
        - image: ' '
          imagePullPolicy: IfNotPresent
          name: httpd-24
          ports:
          - containerPort: 8080
            protocol: TCP
          - containerPort: 8443
            protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        terminationGracePeriodSeconds: 30
  status:
    conditions:
    - lastTransitionTime: "2023-09-04T07:44:01Z"
      lastUpdateTime: "2023-09-04T07:44:01Z"
      message: Created new replica set "httpd-24-7f6b55cc85"
      reason: NewReplicaSetCreated
      status: "True"
      type: Progressing
    - lastTransitionTime: "2023-09-04T07:44:01Z"
      lastUpdateTime: "2023-09-04T07:44:01Z"
      message: Deployment does not have minimum availability.
      reason: MinimumReplicasUnavailable
      status: "False"
      type: Available
    - lastTransitionTime: "2023-09-04T07:44:01Z"
      lastUpdateTime: "2023-09-04T07:44:01Z"
      message: 'Pod "httpd-24-7f6b55cc85-pvvgt" is invalid: spec.containers[0].image:
        Invalid value: " ": must not have leading or trailing whitespace'
      reason: FailedCreate
      status: "True"
      type: ReplicaFailure
    observedGeneration: 1
    unavailableReplicas: 1
kind: List
metadata:

Expected results:

Should set spec.containers[0].image to registry.redhat.io/ubi8/httpd-24:latest

Additional info:

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/299

Bug OCPBUGS-34275: ART requests updates to 4.17 image csi-livenessprobe-container

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/65

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Story TRT-1420: console and auth operators failing to install in 4.16 intermittently

View the Description View the linked PRs

Last payload showed several occurrences of this problem seemingly surfacing the same way:

https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.nightly/release/4.16.0-0.nightly-2023-12-18-092716

Example jobs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-sdn-techpreview-serial/1736680969496170496

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-sdn-techpreview-serial/1736680969496170496 (blocked the payload)

Looking at sippy, both tests took a dip in pass rate on the 16th, meaning a regression may have merged late on the 15th (friday) or somewhere on the 16th (less likely)

console operator test 4.16

auth operator test 4.16

The problem kills the install and thus we are getting no intervals charts to help debug.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/781

Bug OCPBUGS-19263: Update 4.15 ose-cluster-kube-controller-manager-operator image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/747

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/747

Bug OCPBUGS-22357: Fix and bump library-go for storage operators

View the Description View the linked PRs

We need to fix and bump library-go for http2 vulnerability CVE-2023-44487. This effectively turns off HTTP/2 in library-go http endpoints, i.e. metrics and health.

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/290

Bug OCPBUGS-19136: Update 4.15 ose-cluster-openshift-controller-manager-operator image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/304

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/304

Bug OCPBUGS-21722: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-22544: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/291

Bug OCPBUGS-34859: Missing component in csr-signer-signer secret

View the Description View the linked PRs

Description of problem:

    After https://github.com/openshift/cluster-kube-controller-manager-operator/pull/804 was merged the controller no longer updates secret type and this no longer adds owner label. This PR would ensure the secret is created with this label

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/812

Bug OCPBUGS-31384: api-int Certificate Authority rotation during 4.14.17 to 4.15.3 update

View the Description View the linked PRs

Description of problem:

In a cluster updating from 4.5.11 through many intermediate versions to 4.14.17 and on to 4.15.3 (initiated 2024-03-18T07:33:11Z), multus pods are sad about api-int X.509:

$ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver/core/events.yaml <hivei01ue1.inspect.local.5020316083985214391.gz | yaml2json | jq -r '[.items[] | select(.reason == "FailedCreatePodSandBox")][0].message'
(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-928-ip-10-164-221-242.ec2.internal_openshift-kube-apiserver_9e87f20b-471a-447e-9679-edce26b4ef78_0(8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c): error adding pod openshift-kube-apiserver_installer-928-ip-10-164-221-242.ec2.internal to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c Netns:/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78 Path: StdinData:[REDACTED]} ContainerID:"8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c" Netns:"/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78" Path:"" ERRORED: error configuring pod [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal] networking: Multus: [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal/9e87f20b-471a-447e-9679-edce26b4ef78]: error waiting for pod: Get "https://api-int.REDACTED:6443/api/v1/namespaces/openshift-kube-apiserver/pods/installer-928-ip-10-164-221-242.ec2.internal?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority

Version-Release number of selected component (if applicable)

4.15.3, so we have 4.15.2's ~~OCPBUGS-30304~~ but not 4.15.5's ~~OCPBUGS-30237~~.

How reproducible

Seen in two clusters after updating from 4.14 to 4.15.3.

Steps to Reproduce

Unclear.

Actual results

Sad multus pods.

Expected results

Happy cluster.

Additional info

$ openssl s_client -showcerts -connect api-int.REDACTED:6443 < /dev/null
...
Certificate chain
 0 s:CN = api-int.REDACTED
   i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Mar 25 19:35:55 2024 GMT; NotAfter: Apr 24 19:35:56 2024 GMT
...
 1 s:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
   i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Mar 18 07:33:47 2024 GMT; NotAfter: Mar 16 07:33:48 2034 GMT
...

So that's created seconds after the update was initiated. We have inspect logs for some namespaces, but they don't go back quite that far, because the machine-config roll at the end of the update into 4.15.3 rolled all the pods:

$ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver-operator/pods/kube-apiserver-operator-6cbfdd467c-4ctq7/kube-apiserver-operator/kube-apiserver-operator/logs/current.log <hivei01ue1.inspect.local.5020316083985214391.gz | head -n2
2024-03-18T08:22:05.058253904Z I0318 08:22:05.056255       1 cmd.go:241] Using service-serving-cert provided certificates
2024-03-18T08:22:05.058253904Z I0318 08:22:05.056351       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.

We were able to recover individual nodes via:

oc config new-kubelet-bootstrap-kubeconfig > bootstrap.kubeconfig from any machine with an admin kubeconfig
copy to all nodes as /etc/kubernetes/kubeconfig
on each node rm /var/lib/kubelet/kubeconfig
restart each node
approve each kubelet CSR
delete the node's multus-* pod.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/800

Bug OCPBUGS-41116: ART requests updates to 4.18 image csi-livenessprobe-container

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/68

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-livenessprobe/pull/68

Story STOR-1432: Allow separate images to the specified for Hosted Control Plane components

View the Description View the linked PRs

Hypershift needs to be able to specify a different release payload for control plane components without redeploying anything in the hosted cluster.

csi-driver-node DaemonSet pods in the hosted cluster and the csi-driver-controller Deployment that runs in the control plane both use the AWS_EBS_DRIVER_IMAGE and LIVENESS_PROBE_IMAGE

https://github.com/openshift/hypershift/blob/fc42313fc93125799f7eba5361190043cc2f6561/control-plane-operator/controllers/hostedcontrolplane/storage/envreplace.go#L9-L48

We need a way to specify these images separately for csi-driver-node and csi-driver-controller.

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/252

Bug OCPBUGS-25001: Update 4.16 ose-cluster-kube-controller-manager-operator-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/774

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-5825: system:openshift:kube-controller-manager:gce-cloud-provider referencing non existing serviceAccount

View the Description View the linked PRs

Description of problem:

$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.20 True False 43h Cluster version is 4.11.20

$ oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: "2023-01-11T13:16:47Z"
  name: system:openshift:kube-controller-manager:gce-cloud-provider
  resourceVersion: "6079"
  uid: 82a81635-4535-4a51-ab83-d2a1a5b9a473
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:kube-controller-manager:gce-cloud-provider
subjects:
- kind: ServiceAccount
  name: cloud-provider
  namespace: kube-system

$ oc get sa cloud-provider -n kube-system
Error from server (NotFound): serviceaccounts "cloud-provider" not found

The serviceAccount cloud-provider does not exist. Neither in kube-system nor in any other namespace.

It's therefore not clear what this ClusterRoleBinding does, what use-case it does fulfill and why it references non existing serviceAccount.

From Security point of view, it's recommended to remove non serviceAccounts from ClusterRoleBindings as a potential attacker could abuse the current state by creating the necessary serviceAccount and gain undesired permissions.

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4 (all version from what we have found)

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift Container Platform 4
2. Run oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml

Actual results:

$ oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: "2023-01-11T13:16:47Z"
  name: system:openshift:kube-controller-manager:gce-cloud-provider
  resourceVersion: "6079"
  uid: 82a81635-4535-4a51-ab83-d2a1a5b9a473
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:kube-controller-manager:gce-cloud-provider
subjects:
- kind: ServiceAccount
  name: cloud-provider
  namespace: kube-system

$ oc get sa cloud-provider -n kube-system
Error from server (NotFound): serviceaccounts "cloud-provider" not found

Expected results:

The serviceAccount called cloud-provider to exist or otherwise the ClusterRoleBinding to be removed.

Additional info:

Finding related to a Security review done on the OpenShift Container Platform 4 - Platform

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/778

Bug OCPBUGS-25093: Update 4.16 csi-livenessprobe-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/56

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-livenessprobe/pull/56

Bug OCPBUGS-25534: Update 4.16 csi-livenessprobe-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/58

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-livenessprobe/pull/58

Bug OCPBUGS-24162: Update 4.15 ose-cluster-kube-controller-manager-operator-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/772

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/772

Bug OCPBUGS-21593: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-25587: Update 4.16 ose-cluster-kube-controller-manager-operator-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/779

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/780

Bug OCPBUGS-18932: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/302

Bug OCPBUGS-29972: ART requests updates to 4.16 image csi-livenessprobe-container

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/59

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-livenessprobe/pull/62

Bug OCPBUGS-34356: ART requests updates to 4.17 image ose-csi-external-resizer-container

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/162

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Bug OCPBUGS-41227: ART requests updates to 4.18 image ose-cluster-openshift-controller-manager-operator-container

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/364

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/364

Bug OCPBUGS-30488: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/798

Bug OCPBUGS-41244: ART requests updates to 4.18 image ose-csi-external-resizer-container

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/165

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-resizer/pull/165

Bug OCPBUGS-19224: Update 4.15 ose-csi-external-resizer image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/144

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-resizer/pull/144

Bug OCPBUGS-18115: PrometheusOperatorRejectedResources alert fires on Hypershift clusters with user-defined monitoring

View the Description View the linked PRs

Description of problem:

After enabling user-defined monitoring on an HyperShift hosted cluster, PrometheusOperatorRejectedResources starts firing.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Start an hypershift-hosted cluster with cluster-bot
2. Enable user-defined monitoring
3.

Actual results:

PrometheusOperatorRejectedResources alert becomes firing

Expected results:

No alert firing

Additional info:

Need to reach out to the HyperShift folks as the fix should probably be in their code base.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/757

Bug OCPBUGS-29581: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/334

Bug OCPBUGS-29973: ART requests updates to 4.16 image ose-cluster-openshift-controller-manager-operator-container

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/337

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/337

Bug OCPBUGS-24815: Update 4.16 csi-livenessprobe-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/55

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-livenessprobe/pull/55

Bug OCPBUGS-26117: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/784

Bug OCPBUGS-28666: openshift/openshift-controller-manager-operator - replace 'coreydaley' with 'sayan-biswas' in OWNERS file

View the Description View the linked PRs

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/326

Bug OCPBUGS-23848: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/341

Bug OCPBUGS-24190: when baselinecapabiliity set is set to None, still see SA with name `deployer-controller` being present in the cluster

View the Description View the linked PRs

When baselineCapabilitySet is set to None, still see an SA with name `deployer-controller` in the cluster.

steps to Reproduce:

=================

1. Install 4.15 cluster with baselineCapabilitySet to None

2. Run command `oc get sa -A | grep deployer`

Actual Results:

================

[knarra@knarra openshift-tests-private]$ oc get sa -A | grep deployer
openshift-infra deployer-controller 0 63m

Expected Results:

==================

No SA related to deployer should be returned

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/320

Bug OCPBUGS-35801: ocm-operator: panic detected in pod

View the Description View the linked PRs

Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
at:
github.com/openshift/cluster-openshift-controller-manager-operator/pkg/operator/internalimageregistry/cleanup_controller.go:146 +0xd65

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/355

Bug OCPBUGS-24215: kube-controller-manager TLS artifacts should have ownership annotations

View the linked PRs

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/769

Bug OCPBUGS-24888: Update 4.16 ose-cluster-openshift-controller-manager-operator-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/321

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/321

Bug OCPBUGS-22969: Use v1 for flowcontrol API objects

View the Description View the linked PRs

flowcontrol v1beta3 is deprecated from 1.29, and will be removed in 1.32
update the OpenShift specific APF manifests to use v1

The flowcontrol manifests in the following operators (kas, oas, etcd, openshift controller manager, auth, and network) should use v1.

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/316

Bug OCPBUGS-25540: Update 4.16 ose-csi-external-resizer-container image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/154

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-resizer/pull/154

Bug OCPBUGS-16794: The file permission of the controller manager pod specification file should be set to 600 to conform with CIS benchmarks

View the Description View the linked PRs

Description of problem:

Observation from CISv1.4 pdf:
1.1.3 Ensure that the controller manager pod specification file



When I checked I found description of the controller manager pod specification file in CIS v1.4 PDF is as follows:
"Ensure that the controller manager pod specification file has permissions of 600 or more
restrictive.
 
OpenShift 4 deploys two API servers: the OpenShift API server and the Kube API server. The OpenShift API server delegates requests for Kubernetes objects to the Kube API server.
The OpenShift API server is managed as a deployment. The pod specification yaml for openshift-apiserver is stored in etcd.
The Kube API Server is managed as a static pod. The pod specification file for the kube-apiserver is created on the control plane nodes at /etc/kubernetes/manifests/kube-apiserver-pod.yaml. The kube-apiserver is mounted via hostpath to the kube-apiserver pods via /etc/kubernetes/static-pod-resources/kube-apiserver-pod.yaml with permissions 600."
 
To conform with CIS benchmarks, the controller manager pod specification file should be updated to 600.

$ for i in $( oc get pods -n openshift-kube-controller-manager -o name -l app=kube-controller-manager)
do                          
oc exec -n openshift-kube-controller-manager $i -- stat -c %a /etc/kubernetes/static-pod-resources/kube-controller-manager-pod.yaml  
done                                                                    
644
644
644

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-07-20-215234

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

The controller manager pod specification file for the kube-apiserver is 644.

Expected results:

The controller manager pod specification file for the kube-apiserver is 644.

Additional info:

https://github.com/openshift/library-go/commit/19a42d2bae8ba68761cfad72bf764e10d275ad6e

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/750

Bug OCPBUGS-34054: lots of churn during image registry managed/removed transition

View the Description View the linked PRs

The OCM-operator's imagePullSecretCleanupController attempts to prevent new pods from using an image pull secret that needs to be deleted, but this results in the OCM creating a new image pull secret in the meantime.

The overlap occurs when OCM-operator has detected the registry is removed, simultaneously triggering the imagePullSecretCleanup controller to start deleting and updating the OCM config to stop creating, but the OCM behavior change is delayed until its pods are restarted.

In 4.16 this churn is minimized due to the OCM naming the image pull secrets consistently, but the churn can occur during an upgrade given that the OCM-operator is updated first.

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/347

Bug OCPBUGS-41121: ART requests updates to 4.18 image ose-cluster-kube-controller-manager-operator-container

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/819

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/819

Bug OCPBUGS-30440: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/csi-livenessprobe/pull/63

Bug OCPBUGS-23624: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/325

Bug OCPBUGS-20164: builds.config.openshift.io CRD is available in a cluster with baselineCapabilitySet None

View the Description View the linked PRs

Description of problem:

a cluster installed with baselineCapabilitySet: None have build available while the build capability is disabled


❯ oc get -o json clusterversion version | jq '.spec.capabilities'                      
{
  "baselineCapabilitySet": "None"
}

❯ oc get -o json clusterversion version | jq '.status.capabilities.enabledCapabilities'
null

❯ oc get build -A                   
NAME      AGE
cluster   5h23m

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-10-04-143709

How reproducible:

100%

Steps to Reproduce:

1.install a cluster with baselineCapabilitySet: None

Actual results:

❯ oc get build -A                   
NAME      AGE
cluster   5h23m

Expected results:

❯ oc get -A build
error: the server doesn't have a resource type "build"

slack thread with more info: https://redhat-internal.slack.com/archives/CF8SMALS1/p1696527133380269

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/306

Bug OCPBUGS-36456: [csi-livenessprobe] Rename Dockerfile

View the Description View the linked PRs

Refactor name to Dockerfile.ocp as a better, version independent alternative

https://github.com/openshift/csi-livenessprobe/pull/67

Bug OCPBUGS-19132: Update 4.15 csi-livenessprobe image to be consistent with ART

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/47

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-livenessprobe/pull/47

Bug OCPBUGS-23462: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

View the Description View the linked PRs

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-kube-controller-manager-operator/pull/771

Bug OCPBUGS-22956: When build capability is disabled, ConfigObserver controller does not run

View the Description View the linked PRs

Description of problem:

ConfigObserver controller waits until the all given informers are marked as synced including the build informer. However, when build capability is disabled, that causes ConfigObserver's blockage and never runs.

This is likely only happening on 4.15 because capability watching mechanism was bound to ConfigObserver in 4.15.

Version-Release number of selected component (if applicable):

4.15

How reproducible:

Launch cluster-bot cluster via "launch 4.15.0-0.nightly-2023-11-05-192858,openshift/cluster-openshift-controller-manager-operator#315 no-capabilities"

Steps to Reproduce:

1.
2.
3.

Actual results:

ConfigObserver controller stuck in failure

Expected results:

ConfigObserver controller runs and successfully clear all deployer service accounts when deploymentconfig capability is disabled.

Additional info:

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/315

Bug OCPBUGS-29984: ART requests updates to 4.16 image ose-csi-external-resizer-container

View the Description View the linked PRs

Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/155

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://github.com/openshift/csi-external-resizer/pull/158

4.7.0-0.okd-2023-09-13-174737

Changes from 4.6.0-0.okd-2023-10-03-141935

Complete Features

Epic Goal

Why is this important?

Acceptance Criteria

Epic Goal

Why is this important?

Scenarios

Acceptance Criteria

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

Epic Goal

Why is this important?

Scenarios

Acceptance Criteria

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

Why?

Overview

DoD

Epic Goal

Why is this important?

Acceptance Criteria

Feature Goal

Why is this important?

Scenarios

Acceptance Criteria

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

Epic Goal

Why is this important?

Scenarios

Acceptance Criteria

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

Done Checklist

User Story

Background

Steps

Stakeholders

Definition of Done

User Story:

Acceptance Criteria:

(optional) Out of Scope:

Engineering Details:

Epic Goal

Why is this important?

Acceptance Criteria

Incomplete Features

Goal

Description

User Stories

Non-Requirements

Notes

References

Done Checklist

Epic Goal

Why is this important?

Upstream links

Acceptance Criteria

Dependencies (internal and external)

Done Checklist

Feature Overview (aka. Goal Summary)

Goals (aka. expected user outcomes)

Requirements (aka. Acceptance Criteria):

Use Cases (Optional):

Questions to Answer (Optional):

Out of Scope

Background

Customer Considerations

Documentation Considerations

Interoperability Considerations

Executive Summary