Base operator for HugeCTR serving support #129
jperez999 wants to merge 10 commits into NVIDIA-Merlin:main
CI Results
GitHub pull request #129 of commit 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb, no merge conflicts.
Running as SYSTEM
Setting status of 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb to PENDING with url https://10.20.13.93:8080/job/merlin_systems/125/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_systems
using credential fce1c729-5d7c-48e8-90cb-b0c314b1076e
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url https://github.com/NVIDIA-Merlin/systems # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/systems
> git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems user + githubtoken
> git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/systems +refs/pull/129/*:refs/remotes/origin/pr/129/* # timeout=10
> git rev-parse 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb^{commit} # timeout=10
Checking out Revision 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb # timeout=10
Commit message: "remove common folder in tests and remove unneeded lines in test hugectr"
> git rev-list --no-walk 088570474e008fa0580cb7ae6de1c4a2bceadf4e # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins11789234233452956815.sh
PYTHONPATH=/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 48 items
rerun tests
CI Results
GitHub pull request #129 of commit 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb, no merge conflicts.
Setting status of 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb to PENDING with url https://10.20.13.93:8080/job/merlin_systems/126/console and message: 'Pending'
> git rev-parse 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb^{commit} # timeout=10
Checking out Revision 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb # timeout=10
Commit message: "remove common folder in tests and remove unneeded lines in test hugectr"
> git rev-list --no-walk 1bbda7b9aedf11d2bc56b4542a26f7a3db8872fb # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins13412966895579345381.sh
PYTHONPATH=/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 48 items
CI Results
GitHub pull request #129 of commit ac56b79d882d571f189c2aa3db3d5dc2f3d71083, no merge conflicts.
Setting status of ac56b79d882d571f189c2aa3db3d5dc2f3d71083 to PENDING with url https://10.20.13.93:8080/job/merlin_systems/140/console and message: 'Pending'
> git rev-parse ac56b79d882d571f189c2aa3db3d5dc2f3d71083^{commit} # timeout=10
Checking out Revision ac56b79d882d571f189c2aa3db3d5dc2f3d71083 (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f ac56b79d882d571f189c2aa3db3d5dc2f3d71083 # timeout=10
Commit message: "Merge branch 'main' into hugectr-base"
> git rev-list --no-walk 74b88a50a8974327d917509b551a08015f5c7c81 # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins13320333107056980916.sh
PYTHONPATH=:/usr/local/lib/python3.8/dist-packages/:/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 49 items
karlhigley left a comment
Left some style suggestions, but nothing that would block this PR once the tests pass.
merlin/systems/dag/ops/hugectr.py (outdated)
            if "opt" not in path.name
        ]

        config_dict = dict()
I think the linter suggestion for this is to use {} instead of dict().
merlin/systems/dag/ops/hugectr.py (outdated)
    model = dict()
    model["model"] = model_name
    model["slot_num"] = num_cat_columns
    model["sparse_files"] = sparse_paths
    model["dense_file"] = dense_path
    model["maxnum_des_feature_per_sample"] = data_layer["dense"]["dense_dim"]
    model["network_file"] = network_file
    model["num_of_worker_buffer_in_pool"] = 4
    model["num_of_refresher_buffer_in_pool"] = 1
    model["deployed_device_list"] = self.device_list
    model["max_batch_size"] = self.max_batch_size
    model["default_value_for_each_table"] = [0.0] * len(sparse_layers)
    model["hit_rate_threshold"] = 0.9
    model["gpucacheper"] = self.hugectr_params["gpucacheper"]
    model["gpucache"] = True
    model["cache_refresh_percentage_per_iteration"] = 0.2
    model["maxnum_catfeature_query_per_table_per_sample"] = [
        len(x["sparse_embedding_hparam"]["slot_size_array"]) for x in sparse_layers
    ]
    model["embedding_vecsize_per_table"] = vec_size
    model["embedding_table_names"] = [x["top"] for x in sparse_layers]
I wonder if it might be worthwhile to extract a helper function to construct this dictionary.
merlin/systems/dag/ops/hugectr.py (outdated)
        return config


    def _hugectr_config(name, hugectr_params, max_batch_size=None):
It seems like there's a fair amount of repetition in this method. Maybe some of this can be done with a for loop?
I'm pretty sure we implemented this as a for loop, @jperez999. It must have been lost during the splitting of commits from #125.
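The body of `_hugectr_config` is not shown in this thread, so the following is only a sketch of the loop idea, assuming the repetition is one near-identical assignment per entry in `hugectr_params`. It builds a plain dict rather than the Triton config object the real method presumably fills in; the function name `hugectr_config_sketch`, the `"backend"` value, and the `string_value` layout are assumptions for illustration.

```python
def hugectr_config_sketch(name, hugectr_params, max_batch_size=None):
    """Hypothetical stand-in for _hugectr_config that returns a plain dict."""
    config = {"name": name, "backend": "hugectr", "parameters": {}}

    if max_batch_size is not None:
        config["max_batch_size"] = max_batch_size

    # One loop replaces a separate assignment per parameter.
    for key, value in hugectr_params.items():
        config["parameters"][key] = {"string_value": str(value)}

    return config


# Illustrative parameters only; the real set of keys may differ.
sketch = hugectr_config_sketch(
    "example_model",
    {"gpucacheper": 0.5, "slots": 3},
    max_batch_size=64,
)
```

With this shape, supporting a new parameter means adding a key to `hugectr_params` rather than another line in the method.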
CI Results
GitHub pull request #129 of commit 92070d02437d7679280097b7eaf495c1f5b19541, no merge conflicts.
Setting status of 92070d02437d7679280097b7eaf495c1f5b19541 to PENDING with url https://10.20.13.93:8080/job/merlin_systems/146/console and message: 'Pending'
> git rev-parse 92070d02437d7679280097b7eaf495c1f5b19541^{commit} # timeout=10
Checking out Revision 92070d02437d7679280097b7eaf495c1f5b19541 (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 92070d02437d7679280097b7eaf495c1f5b19541 # timeout=10
Commit message: "Merge branch 'main' into hugectr-base"
> git rev-list --no-walk b2f89fe1c8f53060270d0483dcccc04b46b29164 # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins4798257444405123681.sh
PYTHONPATH=:/usr/local/lib/python3.8/dist-packages/:/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 49 items
Force-pushed from 92070d0 to 221c35c
CI Results
GitHub pull request #129 of commit 221c35c040eb96d183e8302fb1cae4d8542d514e, no merge conflicts.
Setting status of 221c35c040eb96d183e8302fb1cae4d8542d514e to PENDING with url https://10.20.13.93:8080/job/merlin_systems/310/console and message: 'Pending'
> git rev-parse 221c35c040eb96d183e8302fb1cae4d8542d514e^{commit} # timeout=10
Checking out Revision 221c35c040eb96d183e8302fb1cae4d8542d514e (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 221c35c040eb96d183e8302fb1cae4d8542d514e # timeout=10
Commit message: "Split out model and dataset creation into conftest"
> git rev-list --no-walk 4269cf90c507f051348b5b63ad6236b3638e05ba # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins5580325469944632981.sh
PYTHONPATH=:/usr/local/lib/python3.8/dist-packages/:/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 72 items
Force-pushed from 221c35c to 4d99847
CI Results
GitHub pull request #129 of commit 4d99847a4d45afb83050acc2c99235edc09ac0eb, no merge conflicts.
Setting status of 4d99847a4d45afb83050acc2c99235edc09ac0eb to PENDING with url https://10.20.13.93:8080/job/merlin_systems/311/console and message: 'Pending'
> git rev-parse 4d99847a4d45afb83050acc2c99235edc09ac0eb^{commit} # timeout=10
Checking out Revision 4d99847a4d45afb83050acc2c99235edc09ac0eb (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 4d99847a4d45afb83050acc2c99235edc09ac0eb # timeout=10
Commit message: "Add slot_sizes parameter"
> git rev-list --no-walk 221c35c040eb96d183e8302fb1cae4d8542d514e # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins15556838871102646320.sh
PYTHONPATH=:/usr/local/lib/python3.8/dist-packages/:/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 71 items
CI Results
GitHub pull request #129 of commit 027f495e62b6030a2cd712f532280b54c3b54a5a, no merge conflicts.
Setting status of 027f495e62b6030a2cd712f532280b54c3b54a5a to PENDING with url https://10.20.13.93:8080/job/merlin_systems/312/console and message: 'Pending'
> git rev-parse 027f495e62b6030a2cd712f532280b54c3b54a5a^{commit} # timeout=10
Checking out Revision 027f495e62b6030a2cd712f532280b54c3b54a5a (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 027f495e62b6030a2cd712f532280b54c3b54a5a # timeout=10
Commit message: "Extract config to methods and extend with all known params"
> git rev-list --no-walk 4d99847a4d45afb83050acc2c99235edc09ac0eb # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins10890944273429405522.sh
PYTHONPATH=:/usr/local/lib/python3.8/dist-packages/:/usr/local/hugectr/lib:/var/jenkins_home/workspace/merlin_systems/systems
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 71 items
@jperez999 @oliverholworthy Could you resolve the conflicts on this when you get a chance? I'd still like to add this support, and maybe we can nudge the HugeCTR team to help us out again.
This PR introduces the initial HugeCTR operator. The operator works on its own and will need a wrapper operator to handle inputs coming from a dataframe. The PR lays the foundation for using HugeCTR in systems: you can pass a model, or a path to a model, and the operator loads it, extracts the relevant information, and creates the necessary artifacts (ps.json, model files, model.json, and config.pbtxt for Triton).
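As a rough illustration of the artifact step described above, the sketch below writes a `ps.json` file from a list of model entries. It is not the operator's actual export code: the function name `write_ps_json`, the top-level `"models"` key, and the choice of output directory are assumptions, and the other artifacts (model files, model.json, config.pbtxt) are left out.

```python
import json
import tempfile
from pathlib import Path


def write_ps_json(model_entries, output_dir):
    """Write a ps.json containing the given model entries (hypothetical helper)."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Assumed layout: a single top-level "models" list.
    ps_config = {"models": list(model_entries)}

    ps_path = output_dir / "ps.json"
    ps_path.write_text(json.dumps(ps_config, indent=2))
    return ps_path


# Write into a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    ps_path = write_ps_json([{"model": "example_model", "max_batch_size": 64}], tmp)
    loaded = json.loads(ps_path.read_text())
```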