-
Notifications
You must be signed in to change notification settings - Fork 14
Added feature to have a persistent MLC index #197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
|
MLCommons CLA bot All contributors have signed the MLCommons CLA βοΈ β |
|
Hi @sujik18 , Thanks for working on the pull request. I just had a couple of questions: 1.β β With the new changes, what happens if a new user downloads the MLCFlow framework and simply runs a script? Also I think it would be better to keep the logging regarding indexing in DEBUG level rather than at INFO level so that it only gets printed when user runs the command in verbose mode. |
|
Hi @anandhu-eng ,
Example Logs: (mlcflow) sujith@ideapad-g3:~$ mlc search script --tags=gate
[2025-11-10 18:32:01,537 index.py:190 INFO] - Script is modified, index getting updated
[2025-11-10 18:32:01,540 index.py:228 INFO] - Deleting and updating index entry for the script parse-gate-question with UID 8fe2944512654e81
[2025-11-10 18:32:01,557 index.py:61 INFO] - Saving modified times to /home/sujith/MLC/repos/modified_times.json
[2025-11-10 18:32:01,566 index.py:212 INFO] - Index updated (changes detected).
[2025-11-10 18:32:01,584 main.py:109 INFO] - Item path: /home/sujith/MLC/repos/llm-gate-exam-evaluation/script/app-llm-evaluation
[2025-11-10 18:32:01,584 main.py:109 INFO] - Item path: /home/sujith/MLC/repos/llm-gate-exam-evaluation/script/parse-gate-question
(mlcflow) sujith@ideapad-g3:~$ when no script was modified (mlcflow) sujith@ideapad-g3:~$ mlc search script --tags=gate
[2025-11-10 18:42:56,748 main.py:109 INFO] - Item path: /home/sujith/MLC/repos/llm-gate-exam-evaluation/script/app-llm-evaluation
[2025-11-10 18:42:56,748 main.py:109 INFO] - Item path: /home/sujith/MLC/repos/llm-gate-exam-evaluation/script/parse-gate-question
(mlcflow) sujith@ideapad-g3:~$ |
β¦als in index.py
amd-arsuresh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have an estimation of the speedup due to this change?
mlc/index.py
Outdated
| config_file = p | ||
| break | ||
|
|
||
| if not config_file: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when can we have this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When there is no meta.yaml or .json file in the script directory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we are using meta.json anymore. Also we should return error if config file is not found as it is mandatory, I think its handled in line 283
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The meta.json check was included in the current implementation itself, here at #L117. Yes, if the config file is not found, the script wonβt be taken into account for indexing, this is handled in line #145. Line 283 is only checking the unique ID field to see whether the uid field is present in the scriptβs meta file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meta.json is used for cache entries.
|
Currently in my local setup with two repos llm-gate-exam-evaluation and mlcommons@mlperf-automations real 0m3.456s
user 0m3.412s
sys 0m0.042sWith persistent MLC index time observed was: real 0m0.182s
user 0m0.143s
sys 0m0.039sSo speedup for this process is approx And in the scenario when there is a change in one of the script; real 0m3.520s
user 0m3.406s
sys 0m0.048sWith persistent MLC index time observed was: real 0m0.190s
user 0m0.140s
sys 0m0.050sSo speedup for this process is approx |
|
|
||
| # Validate and add to indices | ||
| if unique_id: | ||
| self._delete_by_uid(folder_type, unique_id, alias) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this add extra overhead? how about we use set to avoid duplication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently deleting older index entry function is called only when a particular script is modified, and while deleting, it just needs to loop through the n number of uid entry, which might takes less than a millisecond to complete, not sure it will be adding any noticeable overhead.
| "experiment": os.path.join(repos_path, "index_experiment.json") | ||
| } | ||
| self.indices = {key: [] for key in self.index_files.keys()} | ||
| self.modified_times_file = os.path.join(repos_path, "modified_times.json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @sujik18 , how about we use corresponding target index file like index_script.json for also storing information about the modified time rather than creating another file separately? Do you think there would be a significant performance loss due to size of those files? If not, we could prevent keeping an additional file for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently modified_times.json stores a simple dictionary, whereas index_script.json file is list of object which will be slower to parse. So I think loading mtimes would be simpler and cheaper this way as it can be completed even before index processing begins.
mlc/index.py
Outdated
| if os.path.isfile(config_path): | ||
| key = f"{repo_path}/{folder_type}/{automation_dir}" | ||
| current_script_keys.add(key) | ||
| mtime = self.get_script_mtime(automation_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is executed for all targets like script, cache and experiment right? So, get_item_mtime may be a better name
mlc/index.py
Outdated
| continue | ||
|
|
||
| # update mtime | ||
| logger.debug("Script is modified, index getting updated") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not only script, cache or experiment entries are also possible here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some tests are failing now - I believe the problem is that we are not resetting the index on a repo update. When we register and unregister repos, we need to force reindexing - probably the easiest solution is to check the modified time of MLC/repos/repos.json and clear the index.
That's great. But when I tried the same |
Was this only happening on the first run? If so, thatβs expected as on the first run with the changes will recreate |
Hi @anandhu-eng, I'm facing difficulty seeing the debug logs. I thought using the |
To ignore cached scripts and avoid waiting for input
With the new changes (repeated runs, first run took around 10s) I think the overhead is coming from more cache entries - here I have 695 cache folders. |
@arjunsuresh I think we should skip the data folder while checking Also, I'm not sure why the Should we add another Action run that tests the same script without the |
Actually we only need to check Regarding the test failure, I could replicate it once but not again. When we use |
Run mlcr detect,cpu -j -s --quiet
[2025-11-27 13:48:18,707 index.py:61 INFO] - Saving modified times to /home/runner/MLC/repos/modified_times.json
[2025-11-27 13:48:18,715 index.py:240 INFO] - Index updated (changes detected). I shall change these to DEBUG, since they shouldnβt appear when -s is used. |
When we use "-s" mlcflow shouldn't output info logs. That's how it's configured in main.py. If this is happening it's a bug and that needs to be fixed. But the main problem is performance - I tried restricting the modified time check to just meta.yaml and meta.json and then there is no performance drop but no performance gain also. |
@arjunsuresh I think we can further optimize the get_item_mtime function to skip folders like data and outputs. I believe this might reduce the performance drop, as I didn't observe the slowdown locally because I don't have any dataset files in my cache. |
- Optimized check for meta.json file - fix bug where changes in multiple scripts where not taken into account
mlc/index.py
Outdated
| def get_item_mtime(self,folder): | ||
| # logger.info(f"Getting latest modified time for folder: {folder}") | ||
| latest = 0 | ||
| for root, _, files in os.walk(folder): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @sujik18 , how about we use os.listdir instead of os.walk in order to get the modified time of folder/files as os.walk might add recursive overhead when some assets like the inference repo is cloned in a cache.
I think its safe as changes in subfolders make the parent folder appear modified even when its own contents havenβt changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @anandhu-eng , thanks for the suggestion. Actually, I have modified the code, and the function get_item_mtime will now only be called for the meta file of each script. Therefore, we don't need to use either os.listdir or os.walk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sujik18 that change works great and we do see greater than 10X improvement in mlc performance with the change. But there's some bug if the modified times are absent in the index - most likely the tests are failing due to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arjunsuresh Glad to know the performance improved!
Iβll debug and reproduce the error to fix the missing-mtime bug and ensure all tests pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sujik18 LGTM now. The startup performance is easily improved by 10X. Will do further testing too as this is a core change. We can merge this by the weekend.
@anandhu-eng please test the corner cases like manually removing the script/cache entries and updating the script, cache tags in the meta.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got this error when I manually removed the corresponding cache folder.
No meta file found in /home/arsuresh/MLC/repos/local/cache/detect-os_356490ac for None
Traceback (most recent call last):
File "/home/arsuresh/mlcflow/bin/mlc", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/arsuresh/mlcflow/lib/python3.12/site-packages/mlc/main.py", line 342, in main
res = method(run_args)
^^^^^^^^^^^^^^^^
File "/home/arsuresh/mlcflow/lib/python3.12/site-packages/mlc/cache_action.py", line 125, in show
res = self.search(run_args)
^^^^^^^^^^^^^^^^^^^^^
File "/home/arsuresh/mlcflow/lib/python3.12/site-packages/mlc/cache_action.py", line 62, in search
expiration_time = item_meta.get('expiration_time')
^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @amd-arsuresh , it would be fixed in latest commit, as earlier while deleting from index entry it was looking for path of meta file instead of the directory name where the meta file exists, therefore index entry didn't got deleted for deleted scripts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sujik18 Safety wise now we are exactly same as without persistent index right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arjunsuresh with persistent index it only changes the recreation of index every time a mlc command is run, so I think safety aspects remains unchanged.
Regarding testing, I have tested for various cases locally like removing/modifying the meta file, repos.json file, modified_times.json and none of those had a issue.
However, there is one issue that I am not sure requires a fix or not. Since we are not monitoring the mtime for index script, if the index file is manually removed/modified without deleting the modified times, the index wont be rebuilded again, because the modified times currently wont be able to deduct it.
- To delete index entry for the scripts which doesnt has config file
β PR Checklist
devπ Note: PRs must be raised against
dev. Do not commit directly tomain.β Testing & CI
π Documentation
π File Hygiene & Output Handling
π‘οΈ Safety & Security
π Contribution Hygiene
Fixes #orCloses #.Fixes #37
Have added a feature to maintain a modified_times json file to keep track of last modified time for each script and whenever a MLC command is executed to check whether any files inside the script folders has been changed.
If any changes is detected, by using uid of the script, only that scripts entry is updated in both modified_times and index_script json file.
Before making changes to the parse-gate-question script:
index_script.json
modified_times.json
After making changes to script files
index_script.json
modified_times.json
Logs: