[WIP] Parsing current RHEL Streams from Product Pages #424

Open

mruprich wants to merge 12 commits into packit:main from mruprich:product-pages

Conversation

mruprich (Collaborator) commented Apr 27, 2026

TODO:

  • Write new tests or update the old ones to cover new functionality.
  • Update doc-strings where appropriate.
  • Update or write new documentation in packit/packit.dev.
  • ‹fill in›

The Product Pages should be the source of truth for this, but getting current and upcoming z-streams and current y-streams is tricky since they overlap during the DevDocTest phases. My logic is as follows:

  1. If there is more than one active y-stream branch per major RHEL at the moment, the lower stream is currently in the very late phase of DevDocTest and is being prepared for GA. That means it will very soon become an active z-stream, but right now it is the upcoming z-stream (see the sketch after this list).
  2. I did not find a way to get active z-streams other than to look for "(GA/ZStream)" in the long name, which marks the current z-stream.
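
A minimal sketch of the grouping logic in point 1. The shortname format rhel-X-Y and the helper name are assumptions for illustration only; the real Product Pages shortnames may differ:

import re
from collections import defaultdict

# Assumed shortname format, purely for illustration.
_SHORTNAME_RE = re.compile(r"^rhel-(?P<major>\d+)-(?P<minor>\d+)$")

def split_active_streams(active_releases: list[dict]) -> tuple[list[str], list[str]]:
    """Split active releases into (current_y_streams, upcoming_z_streams)."""
    by_major: dict[int, list[tuple[int, str]]] = defaultdict(list)
    for item in active_releases:
        match = _SHORTNAME_RE.match(item.get("shortname", ""))
        if match:
            by_major[int(match["major"])].append((int(match["minor"]), item["shortname"]))

    current_y_streams: list[str] = []
    upcoming_z_streams: list[str] = []
    for streams in by_major.values():
        streams.sort()  # ascending by minor version
        # The highest active minor is the main y-stream; any lower active
        # minor is finishing DevDocTest and about to become a z-stream.
        current_y_streams.append(streams[-1][1])
        upcoming_z_streams.extend(name for _, name in streams[:-1])
    return current_y_streams, upcoming_z_streams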

Anyway, the fetch_rhel_streams_snapshot() function should return a structure similar to what the load_rhel_config() function returns now, except for the package_instructions part. It will only work with a valid Kerberos ticket; otherwise it will raise a ToolError.
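
For reference, the snapshot would look something like this; the three keys come from the code in this PR, while the stream values here are invented:

# Hypothetical return value of fetch_rhel_streams_snapshot(); values invented.
{
    "current_y_streams": ["rhel-9-7", "rhel-10-1"],
    "current_z_streams": ["rhel-9-6", "rhel-10-0"],
    "upcoming_z_streams": [],
}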

I am marking this as WIP for now, so let me have it. Let me know if I chose a good place for the code.

RELEASE NOTES BEGIN

Packit now supports automatic parsing of all the current RHEL streams.

RELEASE NOTES END

gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces a new module product_pages.py to fetch and process RHEL release stream data from the internal Product Pages API. The implementation includes logic for identifying current y-streams, current z-streams, and upcoming z-streams. Feedback was provided to improve the reliability and efficiency of the API interaction by adding timeouts, using a context manager for the session, consolidating multiple API calls into a single request, and enhancing error handling.

Comment thread: ymir/common/product_pages.py (Outdated)
Comment on lines +169 to +216
s = requests.Session()
auth = requests_gssapi.HTTPSPNEGOAuth(mutual_authentication=requests_gssapi.OPTIONAL)
auth_resp = s.post(_OIDC_AUTHENTICATE_URL, auth=auth)
_require_ok(auth_resp, "OIDC authenticate")

# Multiple active releases per major: lower stream is finishing; higher is main y-stream.
response_active = s.get(
    _RELEASES_API_URL,
    params={
        "fields": "shortname",
        "active": "",
        "product__shortname": "rhel",
    },
)
_require_ok(response_active, "active releases")
active_data = response_active.json()

current_y_streams = _build_current_y_streams(active_data)
upcoming_z_streams = _build_upcoming_z_streams(active_data)

response_zstream = s.get(
    _RELEASES_API_URL,
    params={
        "fields": "shortname,name_incl_maint,name",
        "product__shortname": "rhel",
    },
)
_require_ok(response_zstream, "releases for z-stream filtering")
z_data = response_zstream.json()

fields = [
    "shortname",
    "name_incl_maint",
    "name",
]
filtered = [
    {k: item[k] for k in fields}
    for item in z_data
    if _GA_ZSTREAM_RE.search(item.get("name_incl_maint") or "")
]

current_z_streams = _build_current_z_streams_ga_zstream(filtered)

return {
    "current_y_streams": current_y_streams,
    "current_z_streams": current_z_streams,
    "upcoming_z_streams": upcoming_z_streams,
}
gemini-code-assist (Bot) commented:

Severity: medium

The current implementation of _fetch_rhel_streams_snapshot_sync can be improved for reliability, efficiency, and resource management:

  1. Timeouts: Network requests using requests should always specify a timeout (e.g., 30 seconds) to prevent the application from hanging indefinitely if the server is unresponsive.
  2. Efficiency: Instead of making two separate calls to the releases API, you can fetch all RHEL releases in a single call by requesting the active field and then filtering the results in memory. This reduces network overhead and load on the Product Pages API.
  3. Resource Management: Using requests.Session as a context manager (with requests.Session() as s:) ensures that the underlying TCP connections are properly closed when the function exits.
  4. Error Handling: Wrapping the network calls and JSON parsing in a try-except block (catching requests.RequestException and ValueError) allows for more robust error reporting by wrapping these low-level errors into a ToolError with a descriptive message.
    with requests.Session() as s:
        auth = requests_gssapi.HTTPSPNEGOAuth(mutual_authentication=requests_gssapi.OPTIONAL)
        try:
            # Always specify a timeout for network requests to avoid hanging indefinitely.
            auth_resp = s.post(_OIDC_AUTHENTICATE_URL, auth=auth, timeout=30)
            _require_ok(auth_resp, "OIDC authenticate")

            # Fetch all RHEL releases in a single call and filter in memory to reduce network overhead.
            response = s.get(
                _RELEASES_API_URL,
                params={
                    "fields": "shortname,name_incl_maint,name,active",
                    "product__shortname": "rhel",
                },
                timeout=30,
            )
            _require_ok(response, "fetch releases")
            all_data = response.json()
        except (requests.RequestException, ValueError) as e:
            raise ToolError(f"Failed to fetch RHEL streams from Product Pages: {e}") from e

        active_data = [item for item in all_data if item.get("active")]
        current_y_streams = _build_current_y_streams(active_data)
        upcoming_z_streams = _build_upcoming_z_streams(active_data)

        filtered_z = [
            item for item in all_data
            if _GA_ZSTREAM_RE.search(item.get("name_incl_maint") or "")
        ]
        current_z_streams = _build_current_z_streams_ga_zstream(filtered_z)

        return {
            "current_y_streams": current_y_streams,
            "current_z_streams": current_z_streams,
            "upcoming_z_streams": upcoming_z_streams,
        }

mruprich (Collaborator, Author) replied:

  1. added
  2. I don't think that is possible. I did not find a way to get all that data in a single API call. The active API call gives very different results from the second one that grabs everything.
  3. added
  4. added

TomasTomecek (Member) left a comment

very nice!! it would be lovely to get at least a single test case that has mocked data with what PP returns (fake data obviously so we're not leaking anything here)
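
For what it's worth, a minimal sketch of such a test, assuming the module lives at ymir/common/product_pages.py as in this PR. The fake payload values and the exact shortname/name formats are guesses, and the helpers would have to tolerate them:

from unittest import mock

from ymir.common import product_pages  # module path taken from this PR

# Entirely fake Product Pages payload; the field names match the queries
# in the diff above, the values are made up.
FAKE_RELEASES = [
    {"shortname": "rhel-9-7", "name": "RHEL 9.7",
     "name_incl_maint": "RHEL 9.7", "active": True},
    {"shortname": "rhel-9-6", "name": "RHEL 9.6",
     "name_incl_maint": "RHEL 9.6 (GA/ZStream)", "active": False},
]

def test_fetch_rhel_streams_snapshot(monkeypatch):
    fake_response = mock.MagicMock(status_code=200)
    fake_response.json.return_value = FAKE_RELEASES
    session = mock.MagicMock()
    session.post.return_value = mock.MagicMock(status_code=200)
    session.get.return_value = fake_response
    session_cls = mock.MagicMock()
    session_cls.return_value.__enter__.return_value = session
    # Stub the network layer; if the code also initializes Kerberos,
    # that call would need patching here as well.
    monkeypatch.setattr(product_pages.requests, "Session", session_cls)

    snapshot = product_pages._fetch_rhel_streams_snapshot_sync()

    # The three keys come from the function's return statement in the diff.
    assert set(snapshot) == {
        "current_y_streams",
        "current_z_streams",
        "upcoming_z_streams",
    }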

timeout = _PRODUCT_PAGES_TIMEOUT
try:
    with requests.Session() as s:
        auth = requests_gssapi.HTTPSPNEGOAuth(mutual_authentication=requests_gssapi.OPTIONAL)
TomasTomecek (Member):

we need to make sure this works okay with our Kerberos setup

mruprich (Collaborator, Author) replied:

This should work with an active and valid krb ticket on your system, but I realize that packit might have a different way to get a ticket, right?

A collaborator replied:

Take a look at the init_kerberos_ticket function. That is the one used to obtain a Kerberos ticket or to verify that one already exists.
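
Something along these lines, presumably; the import path and the signature of init_kerberos_ticket are assumptions here:

# Hypothetical call site; the real import path and signature may differ.
from ymir.utils import init_kerberos_ticket  # assumed location

init_kerberos_ticket()  # obtains a ticket, or verifies one already exists
snapshot = fetch_rhel_streams_snapshot()  # now safe to hit Product Pages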

mruprich (Collaborator, Author) replied:

Pushing another round of changes. I used init_kerberos_ticket to get a ticket. I hope I am using it right.

I also added a unit test for the product pages.

TomasTomecek (Member) commented:

Just a heads-up, a related commit was merged recently: a0c211a

mruprich (Collaborator, Author) commented May 5, 2026

One question: I see internal_repos_host being pulled from rhel-config.json in ymir/tools/privileged/copr.py. What should I do about that? That is not possible to pull from Product Pages...

Maybe @nforro or @opohorel are aware of that option? Any idea whether this is actually used somewhere?

opohorel (Collaborator) commented May 5, 2026

> One question: I see internal_repos_host being pulled from rhel-config.json in ymir/tools/privileged/copr.py. What should I do about that? That is not possible to pull from Product Pages...
>
> Maybe @nforro or @opohorel are aware of that option? Any idea whether this is actually used somewhere?

AFAIK it is used for z-stream builds, where the repo is added into the Copr project. If I remember correctly, we had some issues with Copr not having all the latest packages that were in the z-stream repo. With internal_repos_host defined, you always get the latest z-stream packages to build against.

@nforro knows more about it, as he implemented it in #347.

nforro (Member) commented May 5, 2026

It's the location of Brew repos, and it has nothing to do with Product Pages; it's in rhel-config.json only because it's an internal URL (and yes, it changes over time, so it can't be hardcoded).

mruprich (Collaborator, Author) commented May 5, 2026

Ok, thanks. So one solution would be to keep rhel-config.json, at least for copr.py to use. That would require minimal changes in copr.py.

Maybe rhel-config.json could be used in the future as some sort of cache, so that you don't have to contact PP every time you do something? It might save some bandwidth, and you would be able to work even when PP is unavailable. There would have to be a mechanism to make sure it is up to date, though...
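
To illustrate the idea (not part of this PR): a simple mtime-based cache around fetch_rhel_streams_snapshot(), falling back to the stale copy when PP is unreachable. The cache path and TTL below are made up:

import json
import time
from pathlib import Path

# Made-up location and TTL, purely for illustration.
_CACHE_PATH = Path("~/.cache/ymir/rhel-streams.json").expanduser()
_CACHE_TTL_SECONDS = 24 * 3600  # refresh at most once a day

def cached_rhel_streams() -> dict:
    """Return the streams snapshot, contacting Product Pages only when stale."""
    if _CACHE_PATH.exists():
        age = time.time() - _CACHE_PATH.stat().st_mtime
        if age < _CACHE_TTL_SECONDS:
            return json.loads(_CACHE_PATH.read_text())
    try:
        snapshot = fetch_rhel_streams_snapshot()  # the function from this PR
    except Exception:
        # PP unavailable: fall back to the stale copy if we have one.
        if _CACHE_PATH.exists():
            return json.loads(_CACHE_PATH.read_text())
        raise
    _CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
    _CACHE_PATH.write_text(json.dumps(snapshot))
    return snapshot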

nforro (Member) left a comment

Shouldn't this be a privileged MCP tool as it accesses internal data?

nforro (Member) commented May 5, 2026

> Ok, thanks. So one solution would be to keep rhel-config.json, at least for copr.py to use. That would require minimal changes in copr.py.

Yes, we can rename the file later or use an environment variable for the URL instead, but that's out of scope of this PR.

> Maybe rhel-config.json could be used in the future as some sort of cache, so that you don't have to contact PP every time you do something? It might save some bandwidth, and you would be able to work even when PP is unavailable. There would have to be a mechanism to make sure it is up to date, though...

Some caching mechanism would definitely be welcome, but I don't think it should be a file in .secrets.

mruprich (Collaborator, Author) commented May 5, 2026

@nforro the placement of product_pages.py is really up to you guys; I will move it if necessary.

nforro (Member) commented May 6, 2026

> @nforro the placement of product_pages.py is really up to you guys; I will move it if necessary.

It's more about not calling fetch_rhel_streams_snapshot() from anywhere except a privileged MCP tool than about the location of the file. There are plenty of examples of such tools in the code, showing how they are registered and how they are called.

Perhaps it would be easier for you to contribute just product_pages.py in the form of a standalone "library" and leave the rest to us. Kerberos ticket initialization would be the tool's responsibility, and raising ToolErrors from fetch_rhel_streams_snapshot() seems wrong anyway.
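
A sketch of the split being suggested: the library raises its own exception and knows nothing about MCP, while the privileged tool owns the ticket and the ToolError translation. Everything here except fetch_rhel_streams_snapshot() is a placeholder name:

# Library side (product_pages.py): no MCP concepts, no Kerberos handling.
class ProductPagesError(Exception):
    """Raised on Product Pages network/auth failures."""

def fetch_rhel_streams_snapshot() -> dict:
    ...  # performs the requests shown above, raising ProductPagesError

# Tool side (a privileged MCP tool): owns the ticket and the error mapping.
def rhel_streams_tool() -> dict:
    init_kerberos_ticket()  # ticket handling is the tool's responsibility
    try:
        return fetch_rhel_streams_snapshot()
    except ProductPagesError as e:
        raise ToolError(f"Failed to query Product Pages: {e}") from e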
