autopilot-dashboard#61

Open
Mete4 wants to merge 217 commits into IBM:main from Mete4:main

Conversation

Mete4 commented Nov 18, 2024

Summary

Introduces a fully functional UI dashboard for Autopilot. This dashboard integrates with GPU-equipped OpenShift/Kubernetes clusters, providing administrators with a user-friendly interface to monitor cluster health, initiate diagnostic tests, and view real-time results via an embedded terminal. The system aims to enhance usability and efficiency in cluster management while streamlining health checks and monitoring.

Scope and Impact

API Changes

  • No changes to existing Autopilot APIs
  • New frontend routes added for dashboard functionality:
    • /login - Authentication interface
    • /monitor - Cluster monitoring view
    • /testing - Test execution interface

Breaking Changes

No breaking changes. The dashboard is an additive feature and does not change existing Autopilot behavior.

Key Features

  1. Monitor Page

    • Real-time cluster status monitoring
    • Filterable node list with health status indicators
    • Detailed node information including GPU status
    • DCGM diagnostics results display
  2. Testing Page

    • Interactive test selection and configuration
    • Support for all 9 Autopilot health checks
    • Real-time test results in an embedded terminal
    • Batch test execution capabilities
  3. UI Components

    • Carbon Design System integration
    • Responsive layout
    • Real-time data updates
    • Filterable tables and searchable content
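
As an illustration of the "filterable node list with health status indicators" feature above, here is a minimal sketch, not the PR's actual code. It assumes health is derived from the standard Kubernetes `Ready` node condition; the names `NodeSummary`, `nodeHealth`, and `filterNodes` are hypothetical, and the dashboard may use other signals (for example Autopilot's own results).

```typescript
// Minimal slice of a Kubernetes Node object; real nodes carry many more fields.
interface NodeCondition {
  type: string;
  status: string; // "True", "False", or "Unknown"
}

interface NodeSummary {
  metadata: { name: string };
  status: { conditions: NodeCondition[] };
}

type Health = "Ready" | "NotReady" | "Unknown";

// Derive a health indicator from the node's "Ready" condition.
function nodeHealth(node: NodeSummary): Health {
  const ready = node.status.conditions.filter((c) => c.type === "Ready")[0];
  if (!ready || ready.status === "Unknown") {
    return "Unknown";
  }
  return ready.status === "True" ? "Ready" : "NotReady";
}

// Keep nodes whose name contains the query and, optionally, match a health value.
function filterNodes(nodes: NodeSummary[], query: string, health?: Health): NodeSummary[] {
  const needle = query.trim().toLowerCase();
  return nodes.filter((node) => {
    const matchesName = node.metadata.name.toLowerCase().indexOf(needle) !== -1;
    const matchesHealth = health === undefined || nodeHealth(node) === health;
    return matchesName && matchesHealth;
  });
}

// Hypothetical sample data for illustration only.
const sampleNodes: NodeSummary[] = [
  { metadata: { name: "gpu-worker-1" }, status: { conditions: [{ type: "Ready", status: "True" }] } },
  { metadata: { name: "gpu-worker-2" }, status: { conditions: [{ type: "Ready", status: "False" }] } },
  { metadata: { name: "cpu-worker-1" }, status: { conditions: [] } },
];

console.log(filterNodes(sampleNodes, "gpu", "Ready").map((n) => n.metadata.name));
```

An empty query matches every node, so the same function can back both the search box and the health filter.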

Technical Implementation

  1. Frontend Stack

    • React 18.3.1
    • Carbon Design System
    • Vite for development and building
    • NGINX for production serving
  2. Key Dependencies

    • @carbon/react and @carbon/styles for UI components
    • Kubernetes watch API
    • styled-components for styling
    • react-router-dom for routing
  3. Deployment

    • Containerized with Docker
    • Helm chart for Kubernetes deployment
    • NGINX configuration for production serving
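
The "Kubernetes watch API" dependency listed above streams changes as newline-delimited JSON events, each with a `type` and an `object`. The following is a rough sketch of how such events could be folded into a node list; it is an illustration under assumptions, not the dashboard's implementation. The function names are hypothetical, and the sketch assumes each chunk contains only complete lines, which a real streaming reader would have to ensure by buffering.

```typescript
// Minimal slice of a watch event; "BOOKMARK" and "ERROR" events are ignored below.
interface NodeWatchEvent {
  type: string; // e.g. "ADDED", "MODIFIED", "DELETED"
  object: { metadata?: { name?: string } };
}

// Parse newline-delimited JSON, skipping blank lines.
function parseWatchChunk(chunk: string): NodeWatchEvent[] {
  return chunk
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as NodeWatchEvent);
}

// Apply events to a list of known node names and return a new sorted list.
function applyNodeEvents(known: string[], events: NodeWatchEvent[]): string[] {
  let next = known.slice();
  for (const ev of events) {
    const nodeName = ev.object.metadata && ev.object.metadata.name;
    if (!nodeName) {
      continue; // nothing to track without a name
    }
    if (ev.type === "DELETED") {
      next = next.filter((n) => n !== nodeName);
    } else if (ev.type === "ADDED" || ev.type === "MODIFIED") {
      if (next.indexOf(nodeName) === -1) {
        next.push(nodeName);
      }
    }
  }
  return next.sort();
}

// Hypothetical sample stream for illustration only.
const sampleChunk = [
  '{"type":"ADDED","object":{"metadata":{"name":"gpu-node-b"}}}',
  '{"type":"ADDED","object":{"metadata":{"name":"gpu-node-a"}}}',
  '{"type":"DELETED","object":{"metadata":{"name":"gpu-node-b"}}}',
].join("\n");

console.log(applyNodeEvents([], parseWatchChunk(sampleChunk)));
```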

How was this Pull-Request Tested and Validated?

  1. Development Testing
     cd app
     npm install
     npm run dev
  2. Build Validation
     npm run build
     docker build -t autopilot-dashboard .
  3. Functional Testing
    • Verified all health check operations
    • Tested real-time monitoring functionality
    • Validated filter and search operations
    • Confirmed responsive design across devices
  4. Integration Testing
    • Tested with live Autopilot instance on a real cluster environment
    • Verified Kubernetes API interactions
    • Validated real-time data updates

Pull-Request Reminders

  • Does the Autopilot Readme require updates?

    • Yes - Added documentation for the dashboard feature and usage instructions in the autopilot-dashboard directory
  • Are there any new software dependencies introduced?

    • Yes:
      • React and related dependencies
      • Carbon Design System
      • Vite build tool
      • NGINX for production serving

juehlin and others added 30 commits September 16, 2024 11:41
Added users for IBM#2
Added IBM#6 Release Planning
Added Sprint 1 demo videos and slides

# This sets the container image. More information can be found here: https://kubernetes.io/docs/concepts/containers/images/
image:
  repository: quay.io/anish2sinha/autopilot-dashboard

Can you push the image to quay.io/autopilot?

Anish701 commented Dec 13, 2024

Fixed in ba6a113. The image is pushed with tag "dashboard". The yaml now points to quay.io/autopilot/autopilot with tag "dashboard".

This would be better placed in the gh-pages branch, and it should be built with the Autopilot container registry. I think we can delete this entirely and let the existing GitHub Action handle that (with appropriate changes, of course). This can be done separately from this PR.

Anish701 commented Dec 13, 2024

Deleted in ba6a113. I think we pushed the binary by accident. It should be deleted now.

What is this doing?

Deleted in b08b38a. We originally used that workflow file for our class JIRA board. We removed it now as it is no longer used.

Hm, this might be an issue: we use the Apache License. Is it possible to change this to Apache?

Fixed in ba6a113. We changed the license to Apache, copied from the main IBM/autopilot repository for consistency.

cmisale commented Dec 12, 2024

@Mete4 @Anish701 @ryanliao296 @juehlin @eburhansjah
Not sure who to reach out to, so I'm reaching out to everyone.

I suspect the Helm chart is missing some RBAC, because the dashboard doesn't show the list of nodes.
Can you help clarify?

Anish701 commented Dec 13, 2024

@cmisale RBAC is not integrated in this branch, so it should not affect the Helm chart (RBAC is handled separately with a Keycloak server). The image that this Helm chart pulls assumes a local Kubernetes cluster (with existing read permissions). Two likely reasons why no nodes are showing:

  1. kubectl proxy is not running locally
  2. CORS is not enabled on the browser

Are there any error messages appearing in the browser console when on the Monitor page?

cmisale commented Jan 6, 2025

Hm, we'd need some RBAC to see the nodes, because we can't rely on external tooling.
I think it all works, it's just that I see this:
[Screenshot 2025-01-06 at 2 48 12 PM]

Not sure what CORS is, and it's probably not enabled.

Anish701 commented Jan 14, 2025

@cmisale I updated the code so that it can work without CORS. Please also make sure to have the following environment variables set in a .env file in the "app" directory (use 127.0.0.1 as shown below instead of localhost):

VITE_AUTOPILOT_ENDPOINT=http://127.0.0.1:3333
VITE_KUBERNETES_ENDPOINT=http://127.0.0.1:8001

Also, after running git pull to retrieve the updated changes, please make sure to run npm install as I added a dependency. Please let me know if this works and allows the nodes to be shown when running kubectl proxy. If not, let me know what errors appear in right click -> inspect -> console.
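
To show how the two variables above could be consumed, here is a minimal sketch under stated assumptions, not the dashboard's actual code. In a Vite app, `VITE_`-prefixed variables are exposed on `import.meta.env`; the sketch takes a plain object instead so that it runs outside Vite. The names `DashboardEnv`, `joinEndpoint`, and `nodeListUrl` are hypothetical. `/api/v1/nodes` is the standard Kubernetes node-list path, which `kubectl proxy` serves on port 8001 by default.

```typescript
// Hypothetical shape of the two variables from the .env file.
interface DashboardEnv {
  VITE_AUTOPILOT_ENDPOINT?: string;
  VITE_KUBERNETES_ENDPOINT?: string;
}

// Join a base URL and a path without producing a double slash.
function joinEndpoint(base: string, path: string): string {
  return base.replace(/\/+$/, "") + "/" + path.replace(/^\/+/, "");
}

// Build the node-list URL, failing loudly when the variable is missing.
function nodeListUrl(env: DashboardEnv): string {
  const base = env.VITE_KUBERNETES_ENDPOINT;
  if (!base) {
    throw new Error("VITE_KUBERNETES_ENDPOINT is not set");
  }
  return joinEndpoint(base, "/api/v1/nodes");
}

// Values copied from the comment above (with a trailing slash added to exercise the join).
const exampleEnv: DashboardEnv = {
  VITE_AUTOPILOT_ENDPOINT: "http://127.0.0.1:3333",
  VITE_KUBERNETES_ENDPOINT: "http://127.0.0.1:8001/",
};

console.log(nodeListUrl(exampleEnv)); // prints "http://127.0.0.1:8001/api/v1/nodes"
```

Failing early on a missing variable makes a misplaced or absent .env file easier to spot than an empty node list.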

cmisale commented Jan 15, 2025

What if I install through Helm? Where is this .env file supposed to be? I've just pulled the code and built a new container

Anish701 commented

@cmisale Sorry, I updated the container image just now to reflect the new changes. Please let me know if it works after running kubectl proxy with the following flags:

kubectl proxy --address=0.0.0.0 --accept-hosts='.*'

Let me know if you see any errors which appear in the console or in the pod logs.

cmisale commented Jan 16, 2025

Thank you for the update, but unfortunately it didn't get far. It errors out with:

$ k logs autopilot-dashboard-7dc46f897b-6xr7p 
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: ipv6 not available
/docker-entrypoint.sh: Sourcing /docker-entrypoint.d/15-local-resolvers.envsh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2025/01/16 19:46:08 [emerg] 1#1: host not found in upstream "host.containers.internal" in /etc/nginx/nginx.conf:17
nginx: [emerg] host not found in upstream "host.containers.internal" in /etc/nginx/nginx.conf:17

Anish701 commented Jan 17, 2025

@cmisale Thank you for the logs! It seems as though this issue is related to the environment being used for the dashboard.

For production environments, like the one on NERC/MOC that we deployed to, we need an NGINX server to forward requests meant for the Kubernetes and Autopilot APIs to their respective endpoints. For NERC these were https://kubernetes.default.svc/ and http://autopilot-healthchecks.autopilot.svc:3333/. To deploy the dashboard to a different production environment, we just need to change these URLs in the nginx.conf file and rebuild the image. If this is your goal, please let me know the Kubernetes and Autopilot service endpoints and I can update the image.
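
The forwarding described above can be modeled as a prefix-based route table. This is only a sketch of the idea, not the contents of the actual nginx.conf, which is not shown in this thread. The upstream URLs are the NERC endpoints quoted above; the `/kubernetes/` and `/autopilot/` prefixes, the `/autopilot/status` example path, and the names `ProxyRoute` and `resolveUpstream` are assumptions made for illustration.

```typescript
// One forwarding rule: requests under `prefix` go to `upstream`, with the prefix stripped.
interface ProxyRoute {
  prefix: string;
  upstream: string;
}

// Prefixes are hypothetical; upstreams are the NERC endpoints from the comment above.
const nercRoutes: ProxyRoute[] = [
  { prefix: "/kubernetes/", upstream: "https://kubernetes.default.svc/" },
  { prefix: "/autopilot/", upstream: "http://autopilot-healthchecks.autopilot.svc:3333/" },
];

// Return the upstream URL for a request path, or null when no rule matches
// (for example, a frontend route that is served as static content).
function resolveUpstream(routes: ProxyRoute[], requestPath: string): string | null {
  for (const route of routes) {
    if (requestPath.indexOf(route.prefix) === 0) {
      return route.upstream + requestPath.slice(route.prefix.length);
    }
  }
  return null;
}

console.log(resolveUpstream(nercRoutes, "/kubernetes/api/v1/nodes"));
console.log(resolveUpstream(nercRoutes, "/autopilot/status"));
console.log(resolveUpstream(nercRoutes, "/monitor"));
```

Under this model, moving to another production environment only changes the `upstream` values, which mirrors the point that only the URLs in nginx.conf need to change.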

For local/dev environments, the situation is a little trickier, as a podman/docker container running on your local machine may not have access to the host IP. Usually host.containers.internal points to the host machine, which is why I set the nginx.conf to forward to that hostname (with appropriate port numbers). As shown in the logs, the container cannot resolve this hostname. I face the same issue on my laptop even when I explicitly set the IP address of my host in the NGINX configuration.

Thus, for local development/testing, I think the best option is to revert to the original implementation, in which the browser itself sends requests directly to the locally running Kubernetes API (localhost:8001) and Autopilot (localhost:3333). This of course results in a CORS error on some browsers like Chrome, but this can easily be bypassed by using an Allow-CORS extension. The following extension is the one I use and is commonly used in local/dev environments: https://chromewebstore.google.com/detail/allow-cors-access-control/lhobafahddgcelffkeicbaginigeejlf?hl=en. Another option is to disable this CORS security feature in the browser settings.

Please let me know if anything is unclear or how you would like to move forward with this. If you are using the dashboard locally and are fine with using a CORS extension or changing browser settings, I can revert to the original implementation and update the image on quay. If you would like to try a different approach or are running the dashboard in a production environment, please let me know and I can make the appropriate updates to the NGINX server configuration.
