Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
We need to continue to maintain specific areas within storage; this feature captures that effort and tracks it across releases.
Goals
Requirements
Requirement | Notes | isMvp? |
---|---|---|
Telemetry | | No |
Certification | | No |
API metrics | | No |
Out of Scope
n/a
Background and strategic fit
With the expected scale of our customer base, we want to keep the load of customer tickets/BZs low
Assumptions
Customer Considerations
Documentation Considerations
Notes
In progress:
High prio:
Unsorted
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF). Trying no-feature-freeze in 4.12. We will try to do as much as we can before FF, but we're quite sure something will slip past FF as usual.
Update all OCP and Kubernetes libraries in storage operators to the appropriate version for the OCP release.
This includes (but is not limited to):
Operators:
Update all CSI sidecars to the latest upstream release.
This includes update of VolumeSnapshot CRDs in https://github.com/openshift/cluster-csi-snapshot-controller-operator/tree/master/assets
On new installations, we should make the StorageClass created by the CSI operator the default one.
However, we shouldn't do that in an upgrade scenario. The main reason is that users might have set a different quota on the CSI driver's StorageClass.
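For illustration, making a StorageClass the default is just the standard annotation; a minimal sketch (the class name and provisioner are examples, not the operator's actual assets):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-csi                    # example name
  annotations:
    # this annotation is what marks the class as the cluster default
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com       # example CSI driver
```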
Exit criteria:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Rebase openshift-controller-manager to k8s 1.24
Assumption
Doc: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit
Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.
Assumption
Run cluster-storage-operator (CSO) + AWS EBS CSI driver operator + AWS EBS CSI driver control-plane Pods in the management cluster, and run the driver DaemonSet in the hosted cluster.
More information here: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit
As a HyperShift Cluster Instance Admin, I want to run the AWS EBS CSI driver operator and the control plane of the CSI driver in the management cluster, so that the guest cluster runs just my applications.
Exit criteria:
Feature Goal*
What is our purpose in implementing this? What new capability will be available to customers?
The goal of this feature is to provide a consistent, predictable and deterministic approach to how the default storage class(es) are managed.
Why is this important? (mandatory)
The current default storage class implementation has corner cases which can result in PVCs staying in Pending, because either no default storage class exists or multiple default storage classes are defined.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
No default storage class
In some cases no default SC is defined. This can happen during OCP deployment, where components such as the registry request a PV before the SCs have been defined. It can also happen during a change of the default SC: there is no default between the moment the admin unsets the current one and sets the new one.
Another user creates a PVC requesting the default SC by leaving pvc.spec.storageClassName=nil. The default SC does not exist at this point, therefore the admission plugin leaves the PVC untouched with pvc.spec.storageClassName=nil (see the PVC sketch after this scenario).
The admin marks SC2 as default.
PV controller, when reconciling the PVC, updates pvc.spec.storageClassName from nil to the new SC2.
PV controller uses the new SC2 when binding / provisioning the PVC.
The installer creates a default SC.
PV controller, when reconciling the PVC, updates pvc.spec.storageClassName from nil to the new default SC.
PV controller uses the new default SC when binding / provisioning the PVC.
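For reference, this is the kind of PVC the scenario describes; leaving out storageClassName is equivalent to pvc.spec.storageClassName=nil (names and sizes are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc   # hypothetical name
spec:
  # storageClassName is omitted (nil): admission leaves it untouched if no
  # default SC exists; the PV controller later fills in the default SC
  # retroactively once one appears.
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```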
Multiple Storage Classes
In some cases there are multiple default SCs. This can be an admin mistake (forgetting to unset the old one) or happen during the period when a new default SC has been created but the old one is still present (see the sketch below).
New behavior:
-> the PVC will get the default storage class with the newest CreationTimestamp (i.e. B) and no error should show.
-> admin will get an alert that there are multiple default storage classes and they should do something about it.
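A sketch of the misconfigured state (class names and ages are hypothetical); under the new behavior, a PVC created now gets sc-new because it has the newest CreationTimestamp, while the admin gets the alert:

```shell
$ oc get storageclass
NAME               PROVISIONER       AGE
sc-old (default)   ebs.csi.aws.com   200d
sc-new (default)   ebs.csi.aws.com   1h
```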
CSI drivers that are shipped as part of OCP
The CSI drivers we ship as part of OCP are deployed and managed by RH operators. These operators automatically create a default storage class. Some customers don't like this approach and prefer to:
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
No external dependencies.
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
This can confuse customers, as it changes the default behavior customers are used to. It needs to be carefully documented.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF). Trying no-feature-freeze in 4.12. We will try to do as much as we can before FF, but we're quite sure something will slip past FF as usual.
Update all OCP and Kubernetes libraries in storage operators to the appropriate version for the OCP release.
This includes (but is not limited to):
Operators:
Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories
Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.
This includes updating the VolumeSnapshot CRDs in the cluster-csi-snapshot-controller-operator assets and the client API in go.mod, i.e. copy all snapshot CRDs from upstream to the operator assets and run go get -u github.com/kubernetes-csi/external-snapshotter/client/v6 in the operator repo (sketched below).
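A sketch of those two steps, assuming the usual upstream repo layout (the asset path is an assumption; the go get command is taken from the card above):

```shell
# copy the upstream snapshot CRDs into the operator assets
cp "$UPSTREAM_EXTERNAL_SNAPSHOTTER/client/config/crd/"*.yaml assets/
# bump the snapshot client API in go.mod and refresh the vendor tree
go get -u github.com/kubernetes-csi/external-snapshotter/client/v6
go mod tidy && go mod vendor
```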
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
This work will require updates to the core OpenShift API repository to add the new platform type, and then a distribution of this change to all components that use the platform type information. For components that partners might replace, per-component action will need to be taken, with the project team's guidance, to ensure that the component properly handles the "External" platform. These changes will look slightly different for each component.
To integrate these changes more easily into OpenShift, it is possible to take a multi-phase approach which could be spread over a release boundary (eg phase 1 is done in 4.X, phase 2 is done in 4.X+1).
OCPBU-5: Phase 1
OCPBU-510: Phase 2
OCPBU-329: Phase.Next
Phase 1
Phase 2
Phase 3
As a Red Hat Partner installing OpenShift using the External platform type, I would like to install my own Cloud Controller Manager (CCM). Having a field in the Infrastructure configuration object to signal that I will install my own CCM, and that Kubernetes should be configured to expect an external CCM, will allow me to run my own CCM on new OpenShift deployments.
This work has been defined in the External platform enhancement, and had previously been part of openshift/api. The CCM API pieces were removed for the 4.13 release of OpenShift to ensure that we did not ship unused portions of the API.
In addition to the API changes, library-go will need an update to the IsCloudProviderExternal function to detect if the External platform is selected and whether the CCM should be enabled for external mode.
We will also need to check the ObserveCloudVolumePlugin function to ensure that it is not affected by the external changes and that it continues to use the external volume plugin.
After updating openshift/library-go, it will need to be re-vendored into the MCO, KCMO, and CCCMO (although this is not as critical as the other two).
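For context, the signal described above would live in the Infrastructure config object; a rough sketch (field names follow the External platform enhancement and are illustrative, since this API evolved across releases):

```yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
spec:
  platformSpec:
    type: External
    external:
      platformName: example-partner-cloud   # hypothetical partner platform
status:
  platformStatus:
    type: External
    external:
      # "External" signals that the partner runs their own CCM, so
      # kubelet/KCM are configured for an external cloud provider
      cloudControllerManager:
        state: External
```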
1. Proposed title of this feature request
BYOK encrypts root vols AND default storageclass
2. What is the nature and description of the request?
User story
As a customer spinning up managed OpenShift clusters, if I pass a custom AWS KMS key to the installer, I expect it (the installer and cluster-storage-operator) to not only encrypt the root volumes for the nodes in the cluster, but also to apply that key to the first/default StorageClass (gp2 in the current case), so that my assumptions around passing a custom key are met.
In the current state, if I pass a KMS key to the installer, only the root volumes are encrypted with it, and the default AWS-managed key is used for the default StorageClass.
Perhaps this could be offered as an installer flag that controls whether the key is also passed on to the storage class.
3. Why does the customer need this? (List the business requirements here)
Customers wish to encrypt the volumes they own with the key they selected, rather than having the default AWS account key applied by accident.
4. List any affected packages or components.
Note: this implementation should take effect on AWS, GCP and Azure (any cloud provider) equally.
As a cluster admin, I want OCP to provision new volumes with my custom encryption key that I specified during cluster installation in install-config.yaml so all OCP assets (PVs, VMs & their root disks) use the same encryption key.
Description of criteria:
Re-encryption of existing PVs with a new key is not included; only newly provisioned PVs will use the new key.
Enhancement (incl. TBD API with encryption key reference) will be provided as part of https://issues.redhat.com/browse/CORS-2080.
"Raw meat" of this story is translation of the key reference in TBD API to StorageClass.Parameters. AWS EBS CSi driver operator should update both the StorageClass it manages (managed-csi) with:
Parameters:
encrypted: "true"
kmsKeyId: "arn:aws:kms:us-east-1:012345678910:key/abcd1234-a123-456a-a12b-a123b4cd56ef"
Upstream docs: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/parameters.md
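Putting the two parameters together, the operator-managed StorageClass would end up looking roughly like this (the key ARN is the example from the story; the provisioner name is the standard AWS EBS CSI driver):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi
provisioner: ebs.csi.aws.com
parameters:
  encrypted: "true"
  kmsKeyId: "arn:aws:kms:us-east-1:012345678910:key/abcd1234-a123-456a-a12b-a123b4cd56ef"
```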
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF). Trying no-feature-freeze in 4.12. We will try to do as much as we can before FF, but we're quite sure something will slip past FF as usual.
Update all OCP and Kubernetes libraries in storage operators to the appropriate version for the OCP release.
This includes (but is not limited to):
Operators:
EOL, do not upgrade:
Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories
Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.
This includes updating the VolumeSnapshot CRDs in the cluster-csi-snapshot-controller-operator assets and the client API in go.mod, i.e. copy all snapshot CRDs from upstream to the operator assets and run go get -u github.com/kubernetes-csi/external-snapshotter/client/v6 in the operator repo.
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
When this image was assembled, these features were not yet completed. Therefore, only the Jira cards included here are part of this release.
Goal: Resources provided via the Dynamic Resource Allocation Kubernetes mechanism can be consumed by VMs.
Details: Dynamic Resource Allocation
Come up with a design of how resources provided by Dynamic Resource Allocation can be consumed by KubeVirt VMs.
The Dynamic Resource Allocation (DRA) feature is an alpha API in Kubernetes 1.26, which is the base for OpenShift 4.13.
This feature provides the ability to create ResourceClaims and ResourceClasses to request access to resources. This is similar to the dynamic provisioning of PersistentVolumes via PersistentVolumeClaims and StorageClasses.
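A rough sketch of that analogy (the DRA API is alpha and its shape has changed between releases, so treat the group/version, fields, and names as illustrative):

```yaml
apiVersion: resource.k8s.io/v1alpha1    # alpha group/version as of k8s 1.26
kind: ResourceClaim
metadata:
  name: gpu-claim                       # hypothetical
spec:
  resourceClassName: example-gpu-class  # analogous to naming a StorageClass
---
apiVersion: v1
kind: Pod
metadata:
  name: consumer-pod                    # e.g. a KubeVirt virt-launcher pod
spec:
  resourceClaims:
  - name: gpu
    source:
      resourceClaimName: gpu-claim      # analogous to referencing a PVC
  containers:
  - name: compute
    image: quay.io/example/compute:latest
    resources:
      claims:
      - name: gpu
```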
NVIDIA has been a lead contributor to the KEP and already has an initial implementation of a DRA driver and plugin, with a nice demo recording. NVIDIA expects to have this DRA driver available in CY23 Q3 or Q4, so likely in NVIDIA GPU Operator v23.9, around OpenShift 4.14.
When asked about the availability of MIG-backed vGPU for Kubernetes, NVIDIA said that the timeframe is not decided yet, because it will likely use DRA for MIG device creation and registration with the vGPU host driver. The MIG-backed vGPU feature for OpenShift Virtualization will then likely require DRA support to request vGPU resources for the VMs.
Not having MIG-backed vGPU is a risk for OpenShift Virtualization adoption in GPU use cases, such as virtual workstations for rendering with Windows-only software. Customers who want a mix of passthrough, time-sliced vGPU and MIG-backed vGPU will prefer competitors who offer the full range of options. And the certification of NVIDIA solutions like NVIDIA Omniverse will be blocked, despite a great potential to increase OpenShift consumption, as it uses RTX/A40 GPUs for virtual workstations (not yet certified by NVIDIA on OpenShift Virtualization) and A100/H100 for physics simulation, both use cases probably leveraging vGPUs [7]. There are many necessary conditions for that to happen, and MIG-backed vGPU support is one of them.
Who | What | Reference |
---|---|---|
DEV | Upstream roadmap issue (or individual upstream PRs) | <link to GitHub Issue> |
DEV | Upstream documentation merged | <link to meaningful PR> |
DEV | gap doc updated | <name sheet and cell> |
DEV | Upgrade consideration | <link to upgrade-related test or design doc> |
DEV | CEE/PX summary presentation | label epic with cee-training and add a <link to your support-facing preso> |
QE | Test plans in Polarion | <link or reference to Polarion> |
QE | Automated tests merged | <link or reference to automated tests> |
DOC | Downstream documentation merged | <link to meaningful PR> |
Part of making https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation available for early adoption.
Goal:
As an administrator, I would like to deploy OpenShift 4 clusters to AWS C2S region
Problem:
Customers were able to deploy to the AWS C2S region in OCP 3.11, but our global configuration in OCP 4.1 doesn't support this.
Why is this important:
Lifecycle Information:
Previous Work:
Here are the relevant PRs from OCP 3.11. You can see that these endpoints are not part of the standard SDK (they use an entirely separate SDK). To support these regions the endpoints had to be configured explicitly.
Seth Jennings has put together a highly customized POC.
Dependencies:
Prioritized epics + deliverables (in scope / not in scope):
Related : https://jira.coreos.com/browse/CORS-1271
Estimate (XS, S, M, L, XL, XXL): L
Customers: North America Public Sector and Government Agencies
Open Questions:
Epic Goal*
Provide a long term solution to SELinux context labeling in OCP.
Why is this important? (mandatory)
As of today, when SELinux is enabled, a PV's files are relabeled when the PV is attached to the pod. This can cause timeouts when the PV contains many files, as well as overload the storage backend.
https://access.redhat.com/solutions/6221251 provides a few workarounds until the proper fix is implemented. Unfortunately these workarounds are not perfect, and we need a seamless, optimized long-term solution.
This feature tracks the long-term solution, where the PV filesystem will be mounted with the right SELinux context, thus avoiding relabeling every file.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
As we are relying on the mount context, there should not be any relabeling (chcon), because all files/folders will inherit the context from the mount context.
More on design & scenarios in the KEP and related epic STOR-1173
Dependencies (internal and external) (mandatory)
None for the core feature
However, the driver will have to set SELinuxMountSupported to true in the CSIDriverSpec to enable this feature.
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Support the upstream feature "SELinux relabeling using mount options (CSIDriver API change)" in OCP as Beta, i.e. test it and have docs for it (unless it's Alpha upstream).
Summary: If a Pod has a defined SELinux context (e.g. it uses the "restricted" SCC), uses a ReadWriteOncePod PVC, and the CSI driver responsible for the volume supports this feature, kubelet + the CSI driver will mount the volume directly with the correct SELinux labels. Therefore CRI-O does not need to recursively relabel the volume, and pod startup can be significantly faster. We will need thorough documentation for this.
This upstream epic will actually be implemented by us!
As a cluster user, I want to use mounting with SELinux context without any configuration.
This means OCP ships CSIDriver objects with "SELinuxMount: true" for CSI drivers that support mounting with "-o context", i.e. all CSI drivers that are based on block volumes and use ext4/xfs should have this enabled; a sketch follows.
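A sketch of such a shipped object (the serialized field name is seLinuxMount; the driver name is just an example of a block-volume-based driver):

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: ebs.csi.aws.com
spec:
  # opt the driver in to mounting volumes with -o context=... so CRI-O
  # does not have to recursively relabel them
  seLinuxMount: true
```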
This Epic is to track upstream work in the Storage SIG community
This Epic is to track the SELinux specific work required. fsGroup work is not included here.
Goal:
Continue contributing to and help move along the upstream efforts to enable recursive permissions functionality.
Finish current SELinuxMountReadWriteOncePod feature upstream:
The feature is probably going to stay alpha upstream.
Problem:
Recursive permission change takes very long for fsGroup and SELinux. For volumes with many small files, Kubernetes currently does a chown of every file on the volume (due to fsGroup). Similarly, container runtimes (such as CRI-O) perform a chcon of every file on the volume due to the SCC's SELinux context. Data on the volume may already have the correct GID/SELinux context, so Kubernetes needs a way to detect this automatically and avoid the long delay.
Why is this important:
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
Estimate (XS, S, M, L, XL, XXL):
Previous Work:
Customers:
Open questions:
Notes:
As an OCP developer (and as an OCP user in the future), I want all CSI drivers shipped as part of OCP to support mounting with -o context=XYZ, so I can test with CSIDriver.SELinuxMount: true (or so my pods run without CRI-O recursively relabeling my volume).
In detail:
Exit criteria:
Other teams:
If needed this card can be broken down into more cards with sublists, each card assigned to a different assignee.
Epic Goal*
Drive the technical part of the Kubernetes 1.29 upgrade, including rebasing the openshift/kubernetes repository and coordination across the OpenShift organization to get e2e tests green for the OCP release.
Why is this important? (mandatory)
OpenShift 4.17 cannot be released without Kubernetes 1.30
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
PRs:
Goal:
Update team owned repositories to Kubernetes v1.31
?? is the 1.31 freeze
?? is the 1.31 GA
Problem: <please update links for 1.31>
The following repository must be rebased onto the latest version of Kubernetes:
The following repositories should be rebased onto the latest version of Kubernetes:
Entirely remove dependencies on k/k repository inside oc.
Why is this important:
Epic Goal*
Drive the technical part of the Kubernetes 1.31 upgrade, including rebasing the openshift/kubernetes repository and coordination across the OpenShift organization to get e2e tests green for the OCP release.
Why is this important? (mandatory)
OpenShift 4.18 cannot be released without Kubernetes 1.31
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
PRs:
Retro: Kube 1.31 Rebase Retrospective Timeline (OCP 4.18)
Retro recording: https://drive.google.com/file/d/1htU-AglTJjd-VgFfwE3z_dH5tKXT1Tes/view?usp=drive_web
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
<your text here>
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
link back to OCPSTRAT-1644 somehow
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
The storage operators need to be automatically restarted after the certificates are renewed.
From the OCP docs: "The service CA certificate, which issues the service certificates, is valid for 26 months and is automatically rotated when there is less than 13 months validity left."
Since OCP now offers an 18-month lifecycle per release, the storage operator pods need to be automatically restarted after the certificates are renewed.
The storage operators will be transparently restarted. The benefit to the customer is that it avoids manual restarts of the storage operators.
The administrator should not need to restart the storage operators when certificates are renewed.
This should apply to all relevant operators with a consistent experience.
As an administrator I want the storage operators to be automatically restarted when certificates are renewed.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
This feature request is triggered by the new extended OCP lifecycle. We are moving from 12 to 18 months of support per release.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
No doc is required
This feature only covers storage, but the same behavior should be applied to every relevant component.
The pod `aws-ebs-csi-driver-controller` mounts the secret:
```
$ oc get po -n openshift-cluster-csi-drivers aws-ebs-csi-driver-controller-559f74d7cd-5tk4p -o yaml
...
    name: driver-kube-rbac-proxy
...
    name: provisioner-kube-rbac-proxy
...
    name: attacher-kube-rbac-proxy
...
    name: resizer-kube-rbac-proxy
...
    name: snapshotter-kube-rbac-proxy
    volumeMounts:
    - mountPath: /etc/tls/private
      name: metrics-serving-cert
  volumes:
  - name: metrics-serving-cert
    secret:
      defaultMode: 420
      secretName: aws-ebs-csi-driver-controller-metrics-serving-cert
```
Hence, if the secret is updated (e.g. as a result of a CA cert update), the pod must be restarted.
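One common pattern for this (an assumption here, not necessarily what the operator actually implements) is to stamp a hash of the mounted secret onto the pod template, so a cert rotation changes the template and the Deployment rolls automatically:

```yaml
# hypothetical excerpt of the controller Deployment's pod template
spec:
  template:
    metadata:
      annotations:
        # recomputed by the operator from the secret's data; any rotation
        # changes the annotation and therefore triggers a rollout
        checksum/metrics-serving-cert: "<sha256 of secret data>"
```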
Description of problem:
Even though in 4.11 we introduced LegacyServiceAccountTokenNoAutoGeneration to be compatible with upstream K8s, which no longer generates token secrets when service accounts are created, today OpenShift still creates secrets and tokens that are used for legacy usage of the openshift-controller as well as for image-pull secrets.
Customer issues:
Customers see auto-generated secrets for service accounts which is flagged as a security risk.
This Feature is to track the implementation for removing legacy usage and image-pull secret generation as well, so that NO secrets are auto-generated when a Service Account is created on an OpenShift cluster.
NO Secrets to be auto-generated when creating service accounts
The following secrets need to NOT be generated automatically with every service account creation:
Use Cases (Optional):
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
Concerns/Risks: Replacing functionality of one of the openshift-controller used for controllers that's been in the code for a long time may impact behaviors that w
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Existing documentation needs to be clear on where we are today and why we are providing the above 2 credentials. Related Tracker: https://issues.redhat.com/browse/OCPBUGS-13226
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.
With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.
With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn".
Epic Goal
Get Pod Security Admission to run in "restricted" mode globally by default, alongside SCC admission.
When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.
To protect platform workloads from such an effect (which, combined with PSA, might result in rejecting the workload once we start enforcing the "restricted" profile), we must pin the required SCC to all workloads in platform namespaces (openshift-*, kube-*, default).
Each workload should pin the SCC with the least privilege, except workloads in runlevel 0 namespaces, which should pin the "privileged" SCC (SCC admission is not enabled in these namespaces, but we should pin an SCC for tracking purposes). A sketch of the pinning annotation follows.
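Pinning is done with the openshift.io/required-scc annotation on the workload's pod template (the SCC value here is illustrative; each workload picks its least-privilege SCC):

```yaml
# excerpt of a platform Deployment/DaemonSet pod template
spec:
  template:
    metadata:
      annotations:
        # pin the workload to a specific SCC so a higher-priority custom
        # SCC can never mutate it unexpectedly
        openshift.io/required-scc: restricted-v2
```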
The following tables track progress.
# namespaces | 4.18 | 4.17 | 4.16 | 4.15 |
---|---|---|---|---|
monitored | 82 | 82 | 82 | 82 |
fix needed | 69 | 69 | 69 | 69 |
fixed | 38 | 34 | 30 | 39 |
remaining | 31 | 35 | 39 | 30 |
~ remaining non-runlevel | 9 | 13 | 17 | 8 |
~ remaining runlevel (low-prio) | 22 | 22 | 22 | 22 |
~ untested | 2 | 2 | 2 | 82 |
# | namespace | 4.18 | 4.17 | 4.16 | 4.15 |
---|---|---|---|---|---|
1 | oc debug node pods | #1763 | #1816 | #1818 | |
2 | openshift-apiserver-operator | #573 | #581 | ||
3 | openshift-authentication | #656 | #675 | ||
4 | openshift-authentication-operator | #656 | #675 | ||
5 | openshift-catalogd | #50 | #58 | ||
6 | openshift-cloud-credential-operator | #681 | #736 | ||
7 | openshift-cloud-network-config-controller | #2282 | #2490 | #2496 | |
8 | openshift-cluster-csi-drivers | #524 #131 #6 #127 #108 #118 #306 #265 #75 | #170 #459 | #484 | |
9 | openshift-cluster-node-tuning-operator | #968 | #1117 | ||
10 | openshift-cluster-olm-operator | #54 | n/a | ||
11 | openshift-cluster-samples-operator | #535 | #548 | ||
12 | openshift-cluster-storage-operator | #516 | #459 #196 | #484 #211 | |
13 | openshift-cluster-version | #1038 | #1068 | ||
14 | openshift-config-operator | #410 | #420 | ||
15 | openshift-console | #871 | #908 | #924 | |
16 | openshift-console-operator | #871 | #908 | #924 | |
17 | openshift-controller-manager | #336 | #361 | ||
18 | openshift-controller-manager-operator | #336 | #361 | ||
19 | openshift-e2e-loki | #56579 | #56579 | #56579 | #56579 |
20 | openshift-image-registry | #1008 | #1067 | ||
21 | openshift-ingress | #1031 | |||
22 | openshift-ingress-canary | #1031 | |||
23 | openshift-ingress-operator | #1031 | |||
24 | openshift-insights | #1026 | #915 | #967 | |
25 | openshift-kni-infra | #4504 | #4542 | #4539 | #4540 |
26 | openshift-kube-storage-version-migrator | #107 | #112 | ||
27 | openshift-kube-storage-version-migrator-operator | #107 | #112 | ||
28 | openshift-machine-api | #1308 | #407 | #315 #282 #1220 #73 #50 #433 | #332 #326 #1288 #81 #57 #443 |
29 | openshift-machine-config-operator | #4636 | #4219 | #4384 | #4393 |
30 | openshift-manila-csi-driver | #234 | #235 | #236 | |
31 | openshift-marketplace | #578 | #561 | #570 | |
32 | openshift-metallb-system | #238 | #240 | #241 | |
33 | openshift-monitoring | #2498 | #2335 | #2420 | |
34 | openshift-network-console | #2545 | |||
35 | openshift-network-diagnostics | #2282 | #2490 | #2496 | |
36 | openshift-network-node-identity | #2282 | #2490 | #2496 | |
37 | openshift-nutanix-infra | #4504 | #4504 | #4539 | #4540 |
38 | openshift-oauth-apiserver | #656 | #675 | ||
39 | openshift-openstack-infra | #4504 | #4504 | #4539 | #4540 |
40 | openshift-operator-controller | #100 | #120 | ||
41 | openshift-operator-lifecycle-manager | #703 | #828 | ||
42 | openshift-route-controller-manager | #336 | #361 | ||
43 | openshift-service-ca | #235 | #243 | ||
44 | openshift-service-ca-operator | #235 | #243 | ||
45 | openshift-sriov-network-operator | #754 #995 | #999 | #1003 | |
46 | openshift-user-workload-monitoring | #2335 | #2420 | ||
47 | openshift-vsphere-infra | #4504 | #4542 | #4539 | #4540 |
48 | (runlevel) default | ||||
49 | (runlevel) kube-system | ||||
50 | (runlevel) openshift-cloud-controller-manager | ||||
51 | (runlevel) openshift-cloud-controller-manager-operator | ||||
52 | (runlevel) openshift-cluster-api | ||||
53 | (runlevel) openshift-cluster-machine-approver | ||||
54 | (runlevel) openshift-dns | ||||
55 | (runlevel) openshift-dns-operator | ||||
56 | (runlevel) openshift-etcd | ||||
57 | (runlevel) openshift-etcd-operator | ||||
58 | (runlevel) openshift-kube-apiserver | ||||
59 | (runlevel) openshift-kube-apiserver-operator | ||||
60 | (runlevel) openshift-kube-controller-manager | ||||
61 | (runlevel) openshift-kube-controller-manager-operator | ||||
62 | (runlevel) openshift-kube-proxy | ||||
63 | (runlevel) openshift-kube-scheduler | ||||
64 | (runlevel) openshift-kube-scheduler-operator | ||||
65 | (runlevel) openshift-multus | ||||
66 | (runlevel) openshift-network-operator | ||||
67 | (runlevel) openshift-ovn-kubernetes | ||||
68 | (runlevel) openshift-sdn | ||||
69 | (runlevel) openshift-storage |
Unify and update hosted control planes storage operators so that they have similar code patterns and can run properly in both standalone OCP and HyperShift's control plane.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal*
Our current design of the EBS driver operator to support HyperShift does not scale well to other drivers. The existing design will lead to more code duplication between driver operators and more opportunities for errors.
Why is this important? (mandatory)
An improved design will allow more storage drivers and their operators to be added to HyperShift without requiring significant changes in the code internals.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Until the final structure is finalized, we should be able to build and test the aws-ebs image from the legacy folder.
TBD
Known affected components:
[2] https://github.com/openshift/cluster-kube-controller-manager-operator/blob/351c1193c7eebb49054a289a17fc25dfc0e0cd73/bindata/bootkube/manifests/secret-csr-signer-signer.yaml#L10
[3] https://github.com/openshift/cluster-etcd-operator/pull/1234
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories
Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.
This includes updating the VolumeSnapshot CRDs in the cluster-csi-snapshot-controller-operator assets and the client API in go.mod, i.e. copy all snapshot CRDs from upstream to the operator assets and run go get -u github.com/kubernetes-csi/external-snapshotter/client/v6 in the operator repo.
As an OpenShift admin who wants to make my OCP more secure and stable, I want to prevent anyone from scheduling their workloads on master nodes, so that master nodes only run OCP management-related workloads.
Secure OCP master nodes by preventing customer workloads from being scheduled on them.
Anyone applying toleration(s) in a pod spec can unintentionally tolerate the master taints that protect master nodes from receiving application workloads when master nodes are configured to repel them. An admission plugin needs to be configured to protect master nodes from this scenario. Besides the taint/toleration, users can also set spec.NodeName directly, which this plugin should also protect master nodes against.
Needed so we can provide this workflow to customers following the proposal at https://github.com/openshift/enhancements/pull/1583
Reference https://issues.redhat.com/browse/WRKLDS-1015
kube-controller-manager pods are created by code residing in controllers provided by the kube-controller-manager operator, so changes are required in that repo to add a toleration for the node-role.kubernetes.io/control-plane:NoExecute taint (a sketch follows).
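The change itself is small; the pod manifests in the operator's bindata would gain a toleration along these lines (the exact file location in the repo is an assumption):

```yaml
tolerations:
- key: node-role.kubernetes.io/control-plane
  operator: Exists
  effect: NoExecute
```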
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Bump the following libraries, in order, to the latest kube and the dependent libraries (a sketch of the bump follows the reference links below).
Prev Ref:
https://github.com/openshift/api/pull/1534
https://github.com/openshift/client-go/pull/250
https://github.com/openshift/library-go/pull/1557
https://github.com/openshift/apiserver-library-go/pull/118
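A sketch of the mechanical part of such a bump, run in each consuming repo in the order above (the refs/versions are assumptions):

```shell
go get github.com/openshift/api@master
go get github.com/openshift/client-go@master
go get github.com/openshift/library-go@master
go get github.com/openshift/apiserver-library-go@master
go mod tidy && go mod vendor
```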
Provide mechanisms for the builder service account to be made optional in core OpenShift.
< Who benefits from this feature, and how? What is the difference between today’s current state and a world with this feature? >
Requirements | Notes | IS MVP |
Disable service account controller related to Build/BuildConfig when Build capability is disabled | When the API is marked as removed or disabled, stop creating the "builder" service account and its associated RBAC | Yes |
Option to disable the "builder" service account | Even if the Build capability is enabled, allow admins to disable the "builder" service account generation. Admins will need to bring their own service accounts/RBAC for builds to work | Yes |
< What are we making, for who, and why/what problem are we solving?>
<Defines what is not included in this story>
< Link or at least explain any known dependencies. >
Background, and strategic fit
< What does the person writing code, testing, documenting need to know? >
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>
< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >
< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< If the feature is ordered with other work, state the impact of this feature on the other work>
In OCP 4.16.0, the default role bindings for image puller, image pusher, and deployer are created even if the respective capabilities are disabled on the cluster.
As a cluster admin trying to disable the Build, DeploymentConfig, and Image Registry capabilities, I want the RBAC controllers for the builder and deployer service accounts and the default image-registry rolebindings disabled when their respective capability is disabled.
<Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it? How does it improve the customer's experience?>
<Describes the context or background related to this story>
In WRKLDS-695, ocm-o was enhanced to disable the Build and DeploymentConfig controllers when the respective capability was disabled. This logic should be extended to include the controllers that set up the service accounts and role bindings for these respective features.
<Defines what is not included in this story>
<Description of the general technical path on how to achieve the goal of the story. Include details like json schema, class definitions>
<Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Unknown
Verified
Unsatisfied
Description of problem:
When a cluster is deployed with no capabilities enabled and the Build capability is later enabled, its related cluster configuration CRD is not installed. This prevents admins from fine-tuning builds and prevents ocm-o from fully reconciling its state.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Always
Steps to Reproduce:
1. Launch a cluster with no capabilities enabled (via cluster-bot: launch 4.16.0.0-ci aws,no-capabilities)
2. Edit the clusterversion to enable the build capability: oc patch clusterversion/version --type merge -p '{"spec":{"capabilities":{"additionalEnabledCapabilities":["Build"]}}}'
3. Wait for the openshift-apiserver and openshift-controller-manager to roll out
Actual results:
APIs for BuildConfig (build.openshift.io) are enabled. Cluster configuration API for build system is not:
$ oc api-resources | grep "build"
buildconfigs   bc   build.openshift.io/v1   true   BuildConfig
builds              build.openshift.io/v1   true   Build
Expected results:
Cluster configuration API is enabled:
$ oc api-resources | grep "build"
buildconfigs   bc   build.openshift.io/v1    true   BuildConfig
builds              build.openshift.io/v1    true   Build
builds              config.openshift.io/v1   true   Build
Additional info:
This causes list errors in openshift-controller-manager-operator, breaking the controller that reconciles state for builds and the image registry.
W0523 18:23:38.551022       1 reflector.go:539] k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: failed to list *v1.Build: the server could not find the requested resource (get builds.config.openshift.io)
E0523 18:23:38.551334       1 reflector.go:147] k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: Failed to watch *v1.Build: failed to list *v1.Build: the server could not find the requested resource (get builds.config.openshift.io)
Telecommunications providers continue to deploy OpenShift at the Far Edge. The acceleration of this adoption and the nature of existing Telecommunication infrastructure and processes drive the need to improve OpenShift provisioning speed at the Far Edge site and the simplicity of preparation and deployment of Far Edge clusters, at scale.
A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement | Notes | isMvp? |
Telecommunications Service Provider Technicians will be rolling out OCP w/ a vDU configuration to new Far Edge sites, at scale. They will be working from a service depot where they will pre-install/pre-image a set of Far Edge servers to be deployed at a later date. When ready for deployment, a technician will take one of these generic-OCP servers to a Far Edge site, enter the site specific information, wait for confirmation that the vDU is in-service/online, and then move on to deploy another server to a different Far Edge site.
Retail employees in brick-and-mortar stores will install SNO servers and it needs to be as simple as possible. The servers will likely be shipped to the retail store, cabled and powered by a retail employee and the site-specific information needs to be provided to the system in the simplest way possible, ideally without any action from the retail employee.
Q: how challenging will it be to support multi-node clusters with this feature?
< What does the person writing code, testing, documenting need to know? >
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>
< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>
<Does the Feature introduce data that could be gathered and used for Insights purposes?>
< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >
< What does success look like?>
< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact>
< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>
< Which other products and versions in our portfolio does this feature impact?>
< What interoperability test scenarios should be factored by the layered product(s)?>
Question | Outcome |
Description of problem:
While trying to figure out why it takes so long to install Single Node OpenShift, I noticed that the kube-controller-manager cluster operator is degraded for ~5 minutes due to:
GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.119.108:9091: connect: connection refused
I don't understand how the prometheusClient is successfully initialized, but we get a connection refused once we try to query the rules. Note that if the client initialization fails, the kube-controller-manager won't set GarbageCollectorDegraded to true.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Install SNO with bootstrap in place (https://github.com/eranco74/bootstrap-in-place-poc)
2. Monitor the cluster operators status
Actual results:
GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.119.108:9091: connect: connection refused
Expected results:
Expected the GarbageCollectorDegraded status to be false
Additional info:
It seems that for the PrometheusClient to be successfully initialized, it needs to successfully create a connection, but we get connection refused once we make the query. Note that installing SNO with this patch (https://github.com/eranco74/cluster-kube-controller-manager-operator/commit/26e644503a8f04aa6d116ace6b9eb7b9b9f2f23f) reduces the installation time by 3 minutes.
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
Monitoring needs to be reliable and is very useful when trying to debug clusters in an already degraded state. We want to ensure that metrics scraping can always work if the scraper can reach the target, even if the kube-apiserver is unavailable or unreachable. To do this, we will combine a local authorizer (already merged in many binaries and the rbac-proxy) and client-cert based authentication to have a fully local authentication and authorization path for scraper targets.
If networking (or part of networking) is down and a scraper target cannot reach the kube-apiserver to verify a token and a subjectaccessreview, then the metrics scraper can be rejected. The subjectaccessreview (authorization) is already largely addressed, but service account tokens are still used for scraping targets. Tokens require an external network call that we can avoid by using client certificates. Gathering metrics, especially client metrics, from partially functional clusters helps narrow the search area between kube-apiserver, etcd, kubelet, and SDN considerably.
In addition, this will significantly reduce the load on the kube-apiserver. We have observed in the CI cluster that token and subject access reviews are a significant percentage of all kube-apiserver traffic.
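To make the flow concrete, below is a minimal, illustrative Go sketch of the fully local path: the serving side verifies the scraper's client certificate against a locally mounted CA bundle and authorizes on the verified identity, with no TokenReview or SubjectAccessReview round-trip to the kube-apiserver. The file paths and the allowed identity are assumptions for the example, not the actual wiring.

```go
// Minimal sketch of a metrics endpoint that authenticates scrapers with
// client certificates against a locally mounted CA bundle, so no
// TokenReview/SubjectAccessReview call to the kube-apiserver is needed.
// Paths and the allowed common name are illustrative assumptions.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	// CA that signs the scrapers' client certificates (e.g. distributed
	// via a ConfigMap); verification happens entirely in-process.
	caPEM, err := os.ReadFile("/etc/tls/client-ca/ca-bundle.crt")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		log.Fatal("no CA certificates found in bundle")
	}

	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		// Authorization is also local: only a known scraper identity
		// (taken from the verified client cert) may read metrics.
		if len(r.TLS.PeerCertificates) == 0 ||
			r.TLS.PeerCertificates[0].Subject.CommonName != "system:serviceaccount:openshift-monitoring:prometheus-k8s" {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		w.Write([]byte("# metrics payload\n"))
	})

	srv := &http.Server{
		Addr:    ":8443",
		Handler: mux,
		TLSConfig: &tls.Config{
			ClientCAs:  pool,
			ClientAuth: tls.RequireAndVerifyClientCert,
		},
	}
	log.Fatal(srv.ListenAndServeTLS("/etc/tls/serving/tls.crt", "/etc/tls/serving/tls.key"))
}
```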
User story:
As the cluster-policy-controller, I automatically approve cert signing requests issued by monitoring.
DoD:
Implementation hints: leverage approving logic implemented in https://github.com/openshift/library-go/pull/1083.
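For orientation, a rough client-go sketch of such an approval loop; the "issued by monitoring" predicate below is an illustrative stand-in for the real matching and validation logic that the library-go PR provides:

```go
// Hand-wavy sketch of the approval loop: approve pending CSRs whose
// signer and requestor match what the monitoring client-cert flow uses.
// A real approver must also validate the CSR contents (subject, usages)
// before approving; this predicate is an illustrative stand-in.
package csrapprover

import (
	"context"

	certv1 "k8s.io/api/certificates/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func approveMonitoringCSRs(ctx context.Context, client kubernetes.Interface) error {
	csrs, err := client.CertificatesV1().CertificateSigningRequests().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i := range csrs.Items {
		csr := &csrs.Items[i]
		if csr.Spec.SignerName != certv1.KubeAPIServerClientSignerName ||
			csr.Spec.Username != "system:serviceaccount:openshift-monitoring:prometheus-k8s" {
			continue
		}
		// Approval is recorded as a condition and persisted via the
		// dedicated approval subresource.
		csr.Status.Conditions = append(csr.Status.Conditions, certv1.CertificateSigningRequestCondition{
			Type:    certv1.CertificateApproved,
			Status:  "True",
			Reason:  "AutoApproved",
			Message: "approved by cluster-policy-controller",
		})
		if _, err := client.CertificatesV1().CertificateSigningRequests().UpdateApproval(ctx, csr.Name, csr, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```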
As a cluster administrator of a disconnected OCP cluster,
I want a list of possible sample images to mirror
So that I can configure my image mirror prior to installing OCP in a disconnected environment.
It is too onerous to find a connected cluster in order to obtain the list of possible sample images to mirror using the current documented procedures.
I need a list made available to me in my disconnected cluster that I can reference after the initial install.
The samples operator's use of the reason field in its config object to track imagestream import completion has resulted in that singleton being a bottleneck and a source of update conflicts (we are talking 60 or 70 imagestreams potentially updating that one field concurrently).
See this hackday PR for an alternative approach, which uses a configmap per imagestream.
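A hedged sketch of that alternative with client-go; the ConfigMap naming scheme and data keys are illustrative assumptions, not the hackday PR's actual code:

```go
// Sketch of the configmap-per-imagestream approach: record import
// completion in one ConfigMap per imagestream instead of a single
// shared config object, so 60-70 concurrent imports no longer fight
// over one resource. Name scheme and data keys are illustrative.
package importstatus

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func markImportComplete(ctx context.Context, client kubernetes.Interface, isNamespace, isName string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "imagestream-import-" + isName, // one ConfigMap per imagestream
			Namespace: isNamespace,
		},
		Data: map[string]string{"status": "complete"},
	}
	_, err := client.CoreV1().ConfigMaps(isNamespace).Create(ctx, cm, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		// Updates touch only this imagestream's ConfigMap, so there is
		// no cross-imagestream update conflict to retry.
		_, err = client.CoreV1().ConfigMaps(isNamespace).Update(ctx, cm, metav1.UpdateOptions{})
	}
	return err
}
```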
Goal: Support OCI images.
Problem: Buildah and podman use the OCI format by default, and the OpenShift image registry and ImageStream API don't understand it.
Why is this important: OCI images are expected to replace Docker schema 2 images; OpenShift should be ready when OCI images become widely adopted.
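At the API level, support largely comes down to accepting the OCI manifest media types wherever the Docker schema 2 types are accepted today. A small illustrative Go sketch (the media type constants are the published upstream values; the helper is an assumption for the example):

```go
// Sketch of what "understanding OCI" mostly comes down to for the
// registry and ImageStream API: treating the OCI manifest media types
// as first-class alongside Docker schema 2.
package ociimages

const (
	dockerManifestSchema2 = "application/vnd.docker.distribution.manifest.v2+json"
	dockerManifestList    = "application/vnd.docker.distribution.manifest.list.v2+json"
	ociManifest           = "application/vnd.oci.image.manifest.v1+json"
	ociImageIndex         = "application/vnd.oci.image.index.v1+json"
)

// isSupportedManifest reports whether a pushed manifest should be
// accepted instead of rejected as an unknown payload.
func isSupportedManifest(mediaType string) bool {
	switch mediaType {
	case dockerManifestSchema2, dockerManifestList, ociManifest, ociImageIndex:
		return true
	default:
		return false
	}
}
```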
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
Estimate (XS, S, M, L, XL, XXL): XL
Previous Work:
Customers:
Open Questions:
As a user of OpenShift
I want the image pruner to be aware of OCI images
So that it doesn't delete their layers/configs
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
As a user of OpenShift
I want to push OCI images to the registry
So that I can use buildah and podman with their defaults to push images
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
https://github.com/openshift/origin/pull/25475/files marked our tests for ISI as Disruptive.
Tests should wait until operators become stable, otherwise other tests will be run on an unstable cluster and it'll cause flakes.
The integration tests for the image registry expect that OpenShift and the tests run on the same machine (i.e., OpenShift can connect to sockets that the tests listen on). This is not the case with e2e tests.
Goal: Rebase registry to Docker Distribution
Problem: The registry is currently based on an outdated version of the upstream docker/distribution project. The base does not even have a version associated with it - DevEx last rebased on an untagged commit.
Why is this important: Update the registry with improvements and bug fixes from the upstream community.
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
Estimate (XS, S, M, L, XL, XXL): M
Previous Work:
Customers:
Open questions:
As a user of OpenShift
I want the image registry to be rebased on the latest docker/distribution release (v2.7.1)
So that the image registry has the latest upstream bugfixes and enhancements
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
KCM and KAS previously relied on having the `--cloud-provider` and `--cloud-config` flags set. However, these are no longer required as there is no cloud provider code in either binary.
Both operators rely on the config observer in library go to set these flags.
In the future, if these values are set, even to the empty string, then startup will fail.
The config observer sets keys and values in a map; we need to make sure the keys for these two flags are deleted rather than set to a specific value.
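A minimal sketch of the intended observer behavior, assuming KCM's extendedArguments layout (the field path is illustrative):

```go
// Sketch of the required behavior: the observer must drop the
// cloud-provider flags from observed config entirely rather than set
// them (even to ""), since KCM/KAS will refuse to start if they are
// present. The field path is illustrative of KCM's extendedArguments.
package cloudobserver

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

func pruneCloudProviderFlags(observedConfig map[string]interface{}) {
	// Deleting the keys is not the same as setting them to an empty
	// value: an empty string would still be rendered as a flag.
	unstructured.RemoveNestedField(observedConfig, "extendedArguments", "cloud-provider")
	unstructured.RemoveNestedField(observedConfig, "extendedArguments", "cloud-config")
}
```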
As part of the migration to external cloud providers, the CCMO and KCMO used a CloudControllerOwner condition to show which operator owned the cloud controllers.
This is no longer required and can be removed.
Code in library-go currently uses feature gates to determine if Azure and GCP clusters should be external or not. They have been promoted for at least one release and we do not see ourselves going back.
In 4.17 the code is expected to be deleted completely.
We should remove the reliance on the feature gate from this part of the code and clean up references to feature gate access at the call sites.
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
We get too many false positive bugs like https://issues.redhat.com/browse/OCPBUGS-25333 from SAST scans, especially from the vendor directory. Add a .snyk file like https://github.com/openshift/oc/blob/master/.snyk to each repo to ignore them.
Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories
Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.
This includes update of VolumeSnapshot CRDs in cluster-csi-snapshot-controller-operator assets and client API in go.mod. I.e. copy all snapshot CRDs from upstream to the operator assets + go get -u github.com/kubernetes-csi/external-snapshotter/client/v6 in the operator repo.
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
Description of problem:
In looking at jobs on an accepted payload at https://amd64.ocp.releases.ci.openshift.org/releasestream/4.12.0-0.ci/release/4.12.0-0.ci-2022-08-30-122201 , I observed this job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-serial/1564589538850902016 with "Undiagnosed panic detected in pod" "pods/openshift-controller-manager-operator_openshift-controller-manager-operator-74bf985788-8v9qb_openshift-controller-manager-operator.log.gz:E0830 12:41:48.029165 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)"
Version-Release number of selected component (if applicable):
4.12
How reproducible:
probably relatively easy to reproduce (but not consistently) given it's happened several times according to this search: https://search.ci.openshift.org/?search=Observed+a+panic%3A+%22invalid+memory+address+or+nil+pointer+dereference%22&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Steps to Reproduce:
1. Let nightly payloads run, or run one of the presubmit jobs mentioned in the search above.
Actual results:
Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Expected results:
no panics
Additional info:
Description of problem:
OCM-o does not support obtaining verbosity through the OpenShiftControllerManager.operatorLogLevel object.
Version-Release number of selected component (if applicable):
How reproducible:
Modify OpenShiftControllerManager.operatorLogLevel; the OCM-o operator will not display the corresponding logs.
Refer to the CIS RedHat OpenShift Container Platform Benchmark PDF: https://drive.google.com/file/d/12o6O-M2lqz__BgmtBrfeJu1GA2SJ352c/view
1.1.7 Ensure that the etcd pod specification file permissions are set to 600 or more restrictive (Manual)
======================================================================================================
As per CIS v1.3 PDF permissions should be 600 with the following statement:
"The pod specification file is created on control plane nodes at /etc/kubernetes/manifests/etcd-member.yaml with permissions 644. Verify that the permissions are 600 or more restrictive."
But when I ran the following command it was showing 644 permissions
for i in $(oc get pods -n openshift-etcd -l app=etcd -o name | grep etcd); do
  echo "check pod $i"
  oc rsh -n openshift-etcd $i \
    stat -c %a /etc/kubernetes/manifests/etcd-pod.yaml
done
We should adjust CSI RPC call timeout from sidecars to CSI driver. We seem to be using default values which are just too short and hence can cause unintended side-effects.
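For context, each sidecar bounds its CSI gRPC calls with a context deadline driven by its timeout flag; a simplified Go sketch of the pattern, with an illustrative two-minute value:

```go
// Sketch of what the sidecar timeout knob controls: every CSI gRPC call
// is bounded by a context deadline, and defaults that are too short make
// the sidecar give up and retry while the driver may still be working,
// causing the unintended side-effects mentioned above. The value here
// is illustrative.
package csitimeout

import (
	"context"
	"time"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

func createVolume(client csi.ControllerClient, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	// Raising the deadline gives slow storage backends time to finish
	// instead of aborting the operation mid-flight.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	return client.CreateVolume(ctx, req)
}
```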
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/726
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
aws-ebs-csi-driver-controller-ca ServiceAccount does not include the HCP pull-secret in its imagePullSecrets. Thus, if a HostedCluster is created with a `pullSecret` that contains creds that the management cluster pull secret does not have, the image pull fails.
Sanitize OWNERS/OWNER_ALIASES:
1) OWNERS must have:
component: "Storage / Kubernetes External Components"
2) OWNER_ALIASES must have all team members of Storage team.
Description of problem:
Kubernetes and other associated dependencies need to be updated to protect against potential vulnerabilities.
Discovered in the must gather kubelet_service.log from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade/1586093220087992320
It appears the guard pod names are too long, and being truncated down to where they will collide with those from the other masters.
From kubelet logs in this run:
❯ grep openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste kubelet_service.log
Oct 28 23:58:55.693391 ci-op-3hj6pnwf-4f6ab-lv57z-master-1 kubenswrapper[1657]: E1028 23:58:55.693346 1657 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-1" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"
Oct 28 23:59:03.735726 ci-op-3hj6pnwf-4f6ab-lv57z-master-0 kubenswrapper[1670]: E1028 23:59:03.735671 1670 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-0" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"
Oct 28 23:59:11.168082 ci-op-3hj6pnwf-4f6ab-lv57z-master-2 kubenswrapper[1667]: E1028 23:59:11.168041 1667 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-2" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"
This also looks to be happening for openshift-kube-scheduler-guard, kube-controller-manager-guard, possibly others.
Looks like they should be truncated further to make room for random suffixes in https://github.com/openshift/library-go/blame/bd9b0e19121022561dcd1d9823407cd58b2265d0/pkg/operator/staticpod/controller/guard/guard_controller.go#L97-L98
Unsure of the implications here, it looks a little scary.
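For illustration only, one possible shape of the extra truncation (an assumed sketch, not the merged library-go fix): shorten the shared name prefix rather than the tail, so the per-master suffix survives the kubelet's 63-character hostname cap.

```go
// Assumed sketch: keep the unique node-name tail (e.g. "-master-1")
// intact by shortening the shared guard-name prefix instead of
// truncating the end of the combined name, which is what currently
// makes all masters collide on the same hostname.
package guardname

import "strings"

const hostnameMaxLen = 63

func truncateGuardPodName(prefix, nodeName string) string {
	name := prefix + nodeName
	if len(name) <= hostnameMaxLen {
		return name
	}
	// Budget for the distinguishing node suffix, then cut the shared
	// prefix down to whatever space remains.
	keep := hostnameMaxLen - len(nodeName)
	if keep < 0 {
		keep = 0
	}
	return strings.TrimSuffix(name[:keep], "-") + nodeName
}
```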
Description of problem:
This PR fails HyperShift CI with:
=== RUN TestAutoscaling/EnsureNoPodsWithTooHighPriority
util.go:411: pod csi-snapshot-controller-7bb4b877b4-q5457 with priorityClassName system-cluster-critical has a priority of 2000000000 with exceeds the max allowed of 100002000
util.go:411: pod csi-snapshot-webhook-644b6dbfb-v4lj7 with priorityClassName system-cluster-critical has a priority of 2000000000 with exceeds the max allowed of 100002000
How reproducible:
always
Steps to Reproduce:
Alternatively, ci/prow/e2e-aws in https://github.com/openshift/hypershift/pull/1698 and https://github.com/openshift/hypershift/pull/1748 must pass.
Description of problem:
The system:openshift:openshift-controller-manager:leader-locking-ingress-to-route-controller role and role-binding should not be present in the openshift-route-controller-manager namespace. They are not needed since the leader-locking responsibility was moved to route-controller-manager, which is managed by leader-locking-openshift-route-controller-manager. This was added in and used by https://github.com/openshift/openshift-controller-manager/pull/230/files#diff-2ddbbe8d5a13b855786852e6dc0c6213953315fd6e6b813b68dbdf9ffebcf112R20
Description of problem:
In the HyperShift context: operands managed by operators running in the hosted control plane namespace in the management cluster do not honour the affinity opinions in https://hypershift-docs.netlify.app/how-to/distribute-hosted-cluster-workloads/ and https://github.com/openshift/hypershift/blob/main/support/config/deployment.go#L263-L265. These operands running management side should honour the same affinity, tolerations, node selector and priority rules as the operator. This could be done by looking at the operator deployment itself or at the HCP resource. Affected operands:
aws-ebs-csi-driver-controller
aws-ebs-csi-driver-operator
csi-snapshot-controller
csi-snapshot-webhook
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a hypershift cluster.
2. Check affinity rules and node selector of the operands above.
Actual results:
Operands are missing affinity rules and node selector.
Expected results:
Operands have the same affinity rules and node selector as the operator.
Additional info:
discover-etcd-initial-cluster was written very early on in the cluster-etcd-operator life cycle. We have observed at least one bug in this code and in order to validate logical correctness it needs to be rewritten with unit tests.
See the following:
https://issues.redhat.com/browse/OCPBUGS-2083
https://github.com/openshift/library-go/pull/1413
https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/117
This was fixed for vsphere, but we need the same change for the other storage operators. Bump library-go and add --tls-cipher-suites=${TLS_CIPHER_SUITES} to the kube RBAC sidecars.
Description of problem:
library-go should use Lease for leader election by default. In 4.10 we switched from configmaps to configmapsleases; now we can switch to leases. Change library-go to use Lease by default; we already have an open PR for that: https://github.com/openshift/library-go/pull/1448. Once the PR merges, we should revendor library-go for:
- kas operator
- oas operator
- etcd operator
- kcm operator
- openshift controller manager operator
- scheduler operator
- auth operator
- cluster policy controller
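A sketch of the target state after the bump: leader election backed directly by a coordination.k8s.io Lease. The namespace, lock name, and identity below are illustrative; the durations mirror the leader-election tuning OpenShift commonly uses.

```go
// Sketch of Lease-backed leader election with client-go, the end state
// after the configmaps -> configmapsleases -> leases migration.
package leaderelect

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func run(ctx context.Context, client kubernetes.Interface, id string, onLead func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Namespace: "openshift-kube-controller-manager", // illustrative
			Name:      "cluster-policy-controller-lock",    // illustrative
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   137 * time.Second,
		RenewDeadline:   107 * time.Second,
		RetryPeriod:     26 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: onLead,
			OnStoppedLeading: func() { /* step down or exit */ },
		},
	})
}
```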
Description of problem:
We discovered that we are shipping unnecessary RBAC in https://coreos.slack.com/archives/CC3CZCQHM/p1667571136730989 .
This RBAC was only used in 4.2 and 4.3, and we should remove it.
Allow users to turn on PodSecurity admission enforcement in 4.13 as TechPreviewNoUpgrade in order to be able to test the feature with their workloads and see if there is anything that needs fixing.
Due to the removal of the in-tree AWS provider (https://github.com/kubernetes/kubernetes/pull/115838), we need to ensure that KCM is setting the --external-cloud-volume-plugin flag accordingly, especially since the CSI migration went GA in 4.12/1.25.
The original PR that fixed this (https://github.com/openshift/cluster-kube-controller-manager-operator/pull/721) got reverted by mistake. We need to bring it back to unblock the kube rebase.
Description of problem:
The following changes are required for the openshift/route-controller-manager#22 refactoring:
add POD_NAME to the route-controller-manager deployment
introduce route-controller-defaultconfig and customize the lease name (openshift-route-controllers) to override the default one supplied by library-go
add RBAC for infrastructures, which is used by library-go for configuring leader election
Description of problem:
[AWS EBS CSI Driver Operator] should not update the default storageclass annotation back after customers remove the default storageclass annotation
Version-Release number of selected component (if applicable):
Server Version: 4.14.0-0.nightly-2023-06-08-102710
How reproducible:
Always
Steps to Reproduce:
1. Install an AWS OpenShift cluster
2. Create 6 extra storage classes (any SC is OK)
3. Overwrite all the SCs with storageclass.kubernetes.io/is-default-class=false and check that all the SCs are set as non-default
4. Overwrite all the SCs with storageclass.kubernetes.io/is-default-class=true
5. Loop steps 3-4 several times
Actual results:
Overwriting all the SCs with storageclass.kubernetes.io/is-default-class=false sometimes gets reverted by the driver operator.
Expected results:
Overwriting all the SCs with storageclass.kubernetes.io/is-default-class=false should always stick.
Additional info:
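An illustrative sketch of the expected reconcile behavior (assumed, not the operator's actual code): only inject the default-class annotation when the key is absent, and leave an explicit "false" set by the admin alone.

```go
// Assumed sketch of the fix's intent: the operator may set the default
// annotation only when the key is missing; an explicit admin choice
// (including "false") must never be flipped back.
package defaultsc

const defaultSCAnnotation = "storageclass.kubernetes.io/is-default-class"

// desiredAnnotationValue returns the value to apply and whether an
// update is needed at all.
func desiredAnnotationValue(existing map[string]string) (string, bool) {
	if v, ok := existing[defaultSCAnnotation]; ok {
		return v, false // admin already decided; do not update
	}
	return "true", true // annotation absent; operator may set the default
}
```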
Description of problem:
Bump Kubernetes to 0.27.1 and bump dependencies
Some repositories require the bugzilla/valid-bug label to be present. Complement to https://issues.redhat.com/browse/WRKLDS-700.
With the CSISnapshot capability disabled, the CSI driver operators are Degraded. For example:
18:12:16.895: Some cluster operators are not ready: storage (Degraded=True AWSEBSCSIDriverOperatorCR_AWSEBSDriverStaticResourcesController_SyncError: AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverStaticResourcesControllerDegraded: "volumesnapshotclass.yaml" (string): the server could not find the requested resource AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverStaticResourcesControllerDegraded: ) Ginkgo exit error 1: exit with code 1}
Version-Release number of selected component (if applicable):
4.12.nightly
The reason is that cluster-csi-snapshot-controller-operator does not create VolumeSnapshotClass CRD, which AWS EBS CSI driver operator expects to exist.
CSI driver operators must skip VolumeSnapshotClass creation if the CRD does not exist.
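A minimal sketch of that guard; the CRD name is the real one, while the surrounding wiring is illustrative:

```go
// Sketch of the required guard: before syncing volumesnapshotclass.yaml,
// check that the VolumeSnapshotClass CRD exists and skip the asset when
// the CSISnapshot capability (and thus the CRD) is absent, instead of
// going Degraded.
package snapshotguard

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	apiextclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
)

func shouldSyncVolumeSnapshotClass(ctx context.Context, client apiextclient.Interface) (bool, error) {
	_, err := client.ApiextensionsV1().CustomResourceDefinitions().Get(
		ctx, "volumesnapshotclasses.snapshot.storage.k8s.io", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil // capability disabled: skip the asset
	}
	return err == nil, err
}
```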
Description of problem:
Follow-up to https://issues.redhat.com/browse/OCPBUGS-3283: the RBACs are not applied anymore, but we still need to remove the actual files from the repo. No behavioral change should occur with the file removal.
Description of problem:
cluster-policy-controller has unnecessary permissions and is able to operate on all leases in the KCM namespace. This also applies to the namespace-security-allocation-controller, which was moved some time ago and does not need the lock mechanism.
Sanitize OWNERS/OWNER_ALIASES in all CSI driver and operator repos.
For driver repos:
1) OWNERS must have `component`:
component: "Storage / Kubernetes External Components"
2) OWNER_ALIASES must have all team members of Storage team.
For operator repos:
1) OWNERS must have:
component: "Storage / Operators"
Description of problem:
oc --context build02 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-ec.1   True        False         45h     Error while reconciling 4.12.0-ec.1: the cluster operator kube-controller-manager is degraded

oc --context build02 get co kube-controller-manager
NAME                      VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.12.0-ec.1   True        False         True       2y87d   GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.153.28:9091: connect: cannot assign requested address
Additional info:
build02 is a build farm cluster in CI production.
I can provide credentials to access the cluster if needed.
In 4.12.0-rc.0 some API-server components declare flowcontrol/v1beta1 release manifests:
$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.12.0-rc.0-x86_64
$ grep -r flowcontrol.apiserver.k8s.io manifests
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_etcd-operator_10_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-controller-manager-operator_10_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
The APIs are scheduled for removal in Kube 1.26, which will ship with OpenShift 4.13. We want the 4.12 CVO to move to modern APIs in 4.12, so the APIRemovedInNext.*ReleaseInUse alerts are not firing on 4.12. This ticket tracks removing those manifests, or replacing them with a more modern resource type, or some such. Definition of done is that new 4.13 (and with backports, 4.12) nightlies no longer include flowcontrol.apiserver.k8s.io/v1beta1 manifests.
[It] clients should not use APIs that are removed in upcoming releases [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
github.com/openshift/origin/test/extended/apiserver/api_requests.go:27
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
Nov 18 21:59:06.261: INFO: api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
Nov 18 21:59:06.261: INFO: api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
Nov 18 21:59:06.261: INFO: user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
(the raw output repeats the same access summary two more times)
[AfterEach] [sig-arch][Late]
github.com/openshift/origin/test/extended/util/client.go:158
[AfterEach] [sig-arch][Late]
github.com/openshift/origin/test/extended/util/client.go:159
flake: the same access summary again, ending with: Ginkgo exit error 4: exit with code 4
This is required to unblock https://github.com/openshift/origin/pull/27561
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/809
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1705425516419799
A revision controller is spinning out too many revisions.
Goal: update the revision controller code to temporarily log config changes to validate that the newly created revisions are valid, or prove that some new revisions are unnecessary.
Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/152
Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver-operator/pull/268
Description of problem:
Recently, the passing rate for the test "static pods should start after being created" has dropped significantly for some platforms: https://sippy.dptools.openshift.org/sippy-ng/tests/4.15/analysis?test=%5Bsig-node%5D%20static%20pods%20should%20start%20after%20being%20created&filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22%5Bsig-node%5D%20static%20pods%20should%20start%20after%20being%20created%22%7D%5D%2C%22linkOperator%22%3A%22and%22%7D

Take a look at this example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-techpreview/1712803313642115072

The test failed with the following message:
{ static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 6 on node: "ci-op-2z99zzqd-7f99c-rfp4q-master-0" didn't show up, waited: 3m0s}

Seemingly revision 6 was never reached. But if we look at the log from kube-controller-manager-operator, it jumps from revision 5 to revision 7: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-techpreview/1712803313642115072/artifacts/e2e-azure-sdn-techpreview/gather-extra/artifacts/pods/openshift-kube-controller-manager-operator_kube-controller-manager-operator-7cd978d745-bcvkm_kube-controller-manager-operator.log

The log also indicates that there is a possibility of a race:
W1013 12:59:17.775274 1 staticpod.go:38] revision 7 is unexpectedly already the latest available revision. This is a possible race!

This might be a static pod controller issue, but I am starting with the kube-controller-manager component for the case. Feel free to reassign. Here is a slack thread related to this: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1697472297510279
Description of problem:
If the Build and DeploymentConfig capabilities are not installed, then when using `oc new-app registry.redhat.io/<namespace>/<image>:<tag>`, the created deployment has an empty spec.containers[0].image. The deployment will fail to start its pod.
Version-Release number of selected component (if applicable):
oc version
Client Version: 4.14.0-0.nightly-2023-08-22-221456
Kustomize Version: v5.0.1
Server Version: 4.14.0-0.nightly-2023-09-02-132842
Kubernetes Version: v1.27.4+2c83a9f
How reproducible:
Always
Steps to Reproduce:
1. Install a cluster without the build/deploymentconfig function: set "baselineCapabilitySet: None" in the install-config
2. Create a deployment using the 'new-app' command: oc new-app registry.redhat.io/ubi8/httpd-24:latest
Actual results:
2. $ oc new-app registry.redhat.io/ubi8/httpd-24:latest
--> Found container image c412709 (11 days old) from registry.redhat.io for "registry.redhat.io/ubi8/httpd-24:latest"

    Apache httpd 2.4
    ----------------
    Apache httpd 2.4 available as container, is a powerful, efficient, and extensible web server. Apache supports a variety of features, many implemented as compiled modules which extend the core functionality. These can range from server-side programming language support to authentication schemes. Virtual hosting allows one Apache installation to serve many different Web sites.

    Tags: builder, httpd, httpd-24

    * An image stream tag will be created as "httpd-24:latest" that will track this image

--> Creating resources ...
    imagestream.image.openshift.io "httpd-24" created
    deployment.apps "httpd-24" created
    service "httpd-24" created
--> Success
    Application is not exposed. You can expose services to the outside world by executing one or more of the commands below:
     'oc expose service/httpd-24'
    Run 'oc status' to view your app

3. oc get deploy -o yaml
apiVersion: v1
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "1"
      image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"httpd-24:latest"},"fieldPath":"spec.template.spec.containers[?(@.name==\"httpd-24\")].image"}]'
      openshift.io/generated-by: OpenShiftNewApp
    creationTimestamp: "2023-09-04T07:44:01Z"
    generation: 1
    labels:
      app: httpd-24
      app.kubernetes.io/component: httpd-24
      app.kubernetes.io/instance: httpd-24
    name: httpd-24
    namespace: wxg
    resourceVersion: "115441"
    uid: 909d0c4e-180c-4f88-8fb5-93c927839903
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        deployment: httpd-24
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        annotations:
          openshift.io/generated-by: OpenShiftNewApp
        creationTimestamp: null
        labels:
          deployment: httpd-24
      spec:
        containers:
        - image: ' '
          imagePullPolicy: IfNotPresent
          name: httpd-24
          ports:
          - containerPort: 8080
            protocol: TCP
          - containerPort: 8443
            protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        terminationGracePeriodSeconds: 30
  status:
    conditions:
    - lastTransitionTime: "2023-09-04T07:44:01Z"
      lastUpdateTime: "2023-09-04T07:44:01Z"
      message: Created new replica set "httpd-24-7f6b55cc85"
      reason: NewReplicaSetCreated
      status: "True"
      type: Progressing
    - lastTransitionTime: "2023-09-04T07:44:01Z"
      lastUpdateTime: "2023-09-04T07:44:01Z"
      message: Deployment does not have minimum availability.
      reason: MinimumReplicasUnavailable
      status: "False"
      type: Available
    - lastTransitionTime: "2023-09-04T07:44:01Z"
      lastUpdateTime: "2023-09-04T07:44:01Z"
      message: 'Pod "httpd-24-7f6b55cc85-pvvgt" is invalid: spec.containers[0].image: Invalid value: " ": must not have leading or trailing whitespace'
      reason: FailedCreate
      status: "True"
      type: ReplicaFailure
    observedGeneration: 1
    unavailableReplicas: 1
kind: List
metadata:
Expected results:
Should set spec.containers[0].image to registry.redhat.io/ubi8/httpd-24:latest
Additional info:
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/65
Last payload showed several occurrences of this problem seemingly surfacing the same way:
Example jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-sdn-techpreview-serial/1736680969496170496 (blocked the payload)
Looking at Sippy, both tests took a dip in pass rate on the 16th, meaning a regression may have merged late on the 15th (Friday) or somewhere on the 16th (less likely).
The problem kills the install and thus we are getting no intervals charts to help debug.
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/747
We need to fix and bump library-go for http2 vulnerability CVE-2023-44487. This effectively turns off HTTP/2 in library-go http endpoints, i.e. metrics and health.
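For reference, the usual Go-level mitigation that such a bump applies to plain net/http endpoints is a non-nil, empty TLSNextProto map, which prevents HTTP/2 from ever being negotiated; a sketch:

```go
// Sketch of the standard Go mitigation for CVE-2023-44487 on metrics
// and health endpoints: an empty (but non-nil) TLSNextProto map stops
// net/http from offering the "h2" upgrade, so connections fall back to
// HTTP/1.1.
package noh2

import (
	"crypto/tls"
	"net/http"
)

func newServerWithoutHTTP2(addr string, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:    addr,
		Handler: handler,
		// Empty map: HTTP/2 is never negotiated on this server.
		TLSNextProto: map[string]func(*http.Server, *tls.Conn, http.Handler){},
	}
}
```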
Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/304
Description of problem:
After https://github.com/openshift/cluster-kube-controller-manager-operator/pull/804 was merged, the controller no longer updates the secret type and, as a result, no longer adds the owner label. This PR ensures the secret is created with this label.
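An illustrative sketch of the intent (the label key and values are assumptions): stamp the owner label at creation time, since there is no later update pass to add it.

```go
// Assumed sketch: set the owner label when the secret is first built,
// because the controller no longer rewrites the secret afterwards.
package secretowner

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func newSigningSecret(namespace, name string) *corev1.Secret {
	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Namespace: namespace,
			Name:      name,
			// Set at creation; there is no later update pass to add it.
			Labels: map[string]string{
				"auth.openshift.io/managed-certificate-type": "signer", // illustrative
			},
		},
		Type: corev1.SecretTypeTLS,
	}
}
```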
In a cluster updating from 4.5.11 through many intermediate versions to 4.14.17 and on to 4.15.3 (initiated 2024-03-18T07:33:11Z), multus pods are sad about api-int X.509:
$ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver/core/events.yaml <hivei01ue1.inspect.local.5020316083985214391.gz | yaml2json | jq -r '[.items[] | select(.reason == "FailedCreatePodSandBox")][0].message'
(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-928-ip-10-164-221-242.ec2.internal_openshift-kube-apiserver_9e87f20b-471a-447e-9679-edce26b4ef78_0(8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c): error adding pod openshift-kube-apiserver_installer-928-ip-10-164-221-242.ec2.internal to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c Netns:/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78 Path: StdinData:[REDACTED]} ContainerID:"8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c" Netns:"/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78" Path:"" ERRORED: error configuring pod [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal] networking: Multus: [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal/9e87f20b-471a-447e-9679-edce26b4ef78]: error waiting for pod: Get "https://api-int.REDACTED:6443/api/v1/namespaces/openshift-kube-apiserver/pods/installer-928-ip-10-164-221-242.ec2.internal?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
4.15.3, so we have 4.15.2's OCPBUGS-30304 but not 4.15.5's OCPBUGS-30237.
Seen in two clusters after updating from 4.14 to 4.15.3.
Unclear.
Sad multus pods.
Happy cluster.
$ openssl s_client -showcerts -connect api-int.REDACTED:6443 < /dev/null
...
Certificate chain
 0 s:CN = api-int.REDACTED
   i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Mar 25 19:35:55 2024 GMT; NotAfter: Apr 24 19:35:56 2024 GMT
...
 1 s:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
   i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Mar 18 07:33:47 2024 GMT; NotAfter: Mar 16 07:33:48 2034 GMT
...
So that's created seconds after the update was initiated. We have inspect logs for some namespaces, but they don't go back quite that far, because the machine-config roll at the end of the update into 4.15.3 rolled all the pods:
$ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver-operator/pods/kube-apiserver-operator-6cbfdd467c-4ctq7/kube-apiserver-operator/kube-apiserver-operator/logs/current.log <hivei01ue1.inspect.local.5020316083985214391.gz | head -n2
2024-03-18T08:22:05.058253904Z I0318 08:22:05.056255 1 cmd.go:241] Using service-serving-cert provided certificates
2024-03-18T08:22:05.058253904Z I0318 08:22:05.056351 1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
We were able to recover individual nodes via:
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/68
Hypershift needs to be able to specify a different release payload for control plane components without redeploying anything in the hosted cluster.
csi-driver-node DaemonSet pods in the hosted cluster and the csi-driver-controller Deployment that runs in the control plane both use the AWS_EBS_DRIVER_IMAGE and LIVENESS_PROBE_IMAGE
We need a way to specify these images separately for csi-driver-node and csi-driver-controller.
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/774
Description of problem:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.20   True        False         43h     Cluster version is 4.11.20

$ oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: "2023-01-11T13:16:47Z"
  name: system:openshift:kube-controller-manager:gce-cloud-provider
  resourceVersion: "6079"
  uid: 82a81635-4535-4a51-ab83-d2a1a5b9a473
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:kube-controller-manager:gce-cloud-provider
subjects:
- kind: ServiceAccount
  name: cloud-provider
  namespace: kube-system

$ oc get sa cloud-provider -n kube-system
Error from server (NotFound): serviceaccounts "cloud-provider" not found

The serviceAccount cloud-provider does not exist, neither in kube-system nor in any other namespace. It is therefore not clear what this ClusterRoleBinding does, what use case it fulfills, and why it references a non-existing serviceAccount. From a security point of view, it is recommended to remove non-existing serviceAccounts from ClusterRoleBindings, as a potential attacker could abuse the current state by creating the necessary serviceAccount and gaining undesired permissions.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4 (all versions from what we have found)
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4
2. Run oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml
Actual results:
Same output as shown in the description: the ClusterRoleBinding exists and references the cloud-provider ServiceAccount in kube-system, while `oc get sa cloud-provider -n kube-system` returns NotFound.
Expected results:
The serviceAccount called cloud-provider should exist, or otherwise the ClusterRoleBinding should be removed.
Additional info:
Finding related to a Security review done on the OpenShift Container Platform 4 - Platform
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/56
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/58
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/772
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/779
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/59
Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/162
Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/364
Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/165
Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/144
Description of problem:
After enabling user-defined monitoring on a HyperShift hosted cluster, PrometheusOperatorRejectedResources starts firing.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Start a HyperShift-hosted cluster with cluster-bot
2. Enable user-defined monitoring (a sketch follows under Additional info)
Actual results:
The PrometheusOperatorRejectedResources alert starts firing
Expected results:
No alert firing
Additional info:
We need to reach out to the HyperShift folks, as the fix should probably be in their code base.
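For reference, user-defined (user workload) monitoring is enabled through the cluster-monitoring-config ConfigMap; a minimal sketch of step 2, assuming the standard openshift-monitoring namespace and config.yaml key:
$ oc apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Deploys the user workload monitoring stack alongside platform monitoring
    enableUserWorkload: true
EOF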
Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/337
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/55
When baselineCapabilitySet is set to None, an SA named `deployer-controller` is still present in the cluster.
Steps to Reproduce:
=================
1. Install a 4.15 cluster with baselineCapabilitySet set to None
2. Run the command `oc get sa -A | grep deployer`
Actual Results:
================
[knarra@knarra openshift-tests-private]$ oc get sa -A | grep deployer
openshift-infra        deployer-controller        0         63m
Expected Results:
==================
No SA related to the deployer should be returned
Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
at:
github.com/openshift/cluster-openshift-controller-manager-operator/pkg/operator/internalimageregistry/cleanup_controller.go:146 +0xd65
Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/321
The flowcontrol manifests in the following operators should use the v1 API (flowcontrol.apiserver.k8s.io/v1): kube-apiserver (kas), openshift-apiserver (oas), etcd, openshift-controller-manager, auth, and network.
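For illustration, moving a manifest to the v1 API is often just a matter of changing the apiVersion; a minimal FlowSchema sketch on flowcontrol.apiserver.k8s.io/v1 (the names below are hypothetical, not taken from the operators listed above):
$ oc apply -f - <<'EOF'
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: example-flowschema            # hypothetical name
spec:
  priorityLevelConfiguration:
    name: workload-low                # must reference an existing PriorityLevelConfiguration
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: example-sa              # hypothetical service account
        namespace: example-ns
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"]
EOF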
Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/154
Description of problem:
Observation from the CIS v1.4 PDF, check 1.1.3 ("Ensure that the controller manager pod specification file"). The description of this check in the CIS v1.4 PDF is as follows:
"Ensure that the controller manager pod specification file has permissions of 600 or more restrictive. OpenShift 4 deploys two API servers: the OpenShift API server and the Kube API server. The OpenShift API server delegates requests for Kubernetes objects to the Kube API server. The OpenShift API server is managed as a deployment. The pod specification yaml for openshift-apiserver is stored in etcd. The Kube API Server is managed as a static pod. The pod specification file for the kube-apiserver is created on the control plane nodes at /etc/kubernetes/manifests/kube-apiserver-pod.yaml. The kube-apiserver is mounted via hostpath to the kube-apiserver pods via /etc/kubernetes/static-pod-resources/kube-apiserver-pod.yaml with permissions 600."
To conform with the CIS benchmark, the controller manager pod specification file should be updated to 600. Currently it is 644:
$ for i in $(oc get pods -n openshift-kube-controller-manager -o name -l app=kube-controller-manager); do
    oc exec -n openshift-kube-controller-manager $i -- stat -c %a /etc/kubernetes/static-pod-resources/kube-controller-manager-pod.yaml
  done
644
644
644
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-20-215234
How reproducible:
Always
Actual results:
The kube-controller-manager pod specification file has permissions 644.
Expected results:
The kube-controller-manager pod specification file has permissions 600 or more restrictive.
Additional info:
https://github.com/openshift/library-go/commit/19a42d2bae8ba68761cfad72bf764e10d275ad6e
The OCM operator's imagePullSecretCleanupController tries to prevent new pods from using an image pull secret that is slated for deletion, but the openshift-controller-manager (OCM) keeps creating new image pull secrets in the meantime.
The overlap occurs when the OCM operator detects that the registry has been removed: the imagePullSecretCleanup controller starts deleting the secrets and the operator updates the OCM config to stop creating them, but the OCM's behavior change is delayed until its pods restart.
In 4.16 this churn is minimized because the OCM names the image pull secrets consistently, but it can still occur during an upgrade, since the OCM operator is updated first.
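One way to observe the churn described above while it happens (a sketch, assuming the OCM-managed image pull secrets use the default kubernetes.io/dockercfg secret type):
$ oc get secrets -A --field-selector type=kubernetes.io/dockercfg -w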
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/819
Description of problem:
A cluster installed with baselineCapabilitySet: None has the build resource available while the Build capability is disabled:
❯ oc get -o json clusterversion version | jq '.spec.capabilities'
{
  "baselineCapabilitySet": "None"
}
❯ oc get -o json clusterversion version | jq '.status.capabilities.enabledCapabilities'
null
❯ oc get build -A
NAME      AGE
cluster   5h23m
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-04-143709
How reproducible:
100%
Steps to Reproduce:
1. Install a cluster with baselineCapabilitySet: None
Actual results:
❯ oc get build -A
NAME      AGE
cluster   5h23m
Expected results:
❯ oc get -A build
error: the server doesn't have a resource type "build"
Slack thread with more info: https://redhat-internal.slack.com/archives/CF8SMALS1/p1696527133380269
Refactor the Dockerfile name to Dockerfile.ocp as a better, version-independent alternative.
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/47
Description of problem:
The ConfigObserver controller waits until all of the given informers are marked as synced, including the build informer. However, when the Build capability is disabled the build informer never syncs, so ConfigObserver blocks and never runs. This likely only happens on 4.15, because the capability-watching mechanism was bound to ConfigObserver in 4.15.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Launch a cluster-bot cluster via "launch 4.15.0-0.nightly-2023-11-05-192858,openshift/cluster-openshift-controller-manager-operator#315 no-capabilities"
Actual results:
The ConfigObserver controller is stuck in a failed state and never runs.
Expected results:
The ConfigObserver controller runs and successfully clears all deployer service accounts when the DeploymentConfig capability is disabled.
Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/155