While nova already cleaned up its original image cache, our patches also added images as templates into the mix; these were not cleaned up prior to this patch and thus kept filling up the datastores. Change-Id: I2fc631d6ce0a9339be9237d63ae7e86d94779dcc (cherry picked from commit 97780bb) (cherry picked from commit ce5f026)
If we sort the events by creation time, we can stop looping over tasks once we find an event that is too old. Any VM we have not yet found an event for must be expired anyway. Change-Id: I196bb9ab48867f314d9a5f3b7566384fa72778df (cherry picked from commit 642bcc7) (cherry picked from commit 2190382)
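The early-exit loop described above can be sketched as follows; all names (`find_expired_vms`, the event dict layout) are hypothetical stand-ins, not the driver's actual API:

```python
from datetime import datetime, timedelta

def find_expired_vms(vms_without_events, events, max_age):
    """Illustrative sketch: sorting events newest-first lets us stop early.

    `vms_without_events` starts as the set of all candidate VM names; any VM
    with a recent-enough event is removed from it.
    """
    cutoff = datetime.utcnow() - max_age
    remaining = set(vms_without_events)
    for event in sorted(events, key=lambda e: e["created_at"], reverse=True):
        if event["created_at"] < cutoff:
            # Events are ordered, so every following event is older, too:
            # all VMs still in `remaining` must be expired. Stop looping.
            break
        remaining.discard(event["vm"])
    return remaining
```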
Since we want to keep images in the owner's folder again instead of uploading them to all projects using them, we also need to make sure we clean up everywhere - even in projects that have no instances. Change-Id: I9344a9bb8c0436cb5c87f3328fa3b9d31dedbdbc (cherry picked from commit 943babe) (cherry picked from commit b1af943)
Since every nova-compute node will be running the image-cache cleanup, we have to make sure this works in parallel. Therefore, we limit the cleanup of image-cache templates to the datastores the nova-compute node is responsible for - the ones configured in the datastore_regex config option. Change-Id: I5fae822a08bcc06565c64f959553cf7082bb2423 (cherry picked from commit 0d8d760) (cherry picked from commit 5f21c7d)
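The per-node datastore restriction can be sketched as a simple regex filter; the helper name is hypothetical, only `datastore_regex` is the real config option named above:

```python
import re

def owned_datastores(datastore_names, datastore_regex):
    """Sketch: restrict image-cache cleanup to the datastores this
    nova-compute node is responsible for, i.e. those matching the
    node's datastore_regex config option."""
    pattern = re.compile(datastore_regex)
    return [name for name in datastore_names if pattern.match(name)]
```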
Even with image-as-template disabled we have to run the image-template cleanup to ensure removal after the setting is switched off and because we switched all our image-upload to using those image-templates. Change-Id: I924fbc42f014ecf4c0342246d48f27b6ba1d5c77 (cherry picked from commit 44b8620) (cherry picked from commit 7331446)
Every image we upload automatically ends up as a template-VM. We can leverage this in normal deployments and copy the VMDK from the image-template, if it still exists. If it no longer exists, we can try to copy an image-template from another datastore and use its VMDK instead - that should still be faster than copying the image from glance again. If all of that fails, we fall back to the original approach of loading the image from glance. Change-Id: I659d22e494a86fe4e567306062784313432d11ee (cherry picked from commit cfa3eb8) (cherry picked from commit a0d9db9)
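The three-step fallback chain can be sketched like this; the callables and `LookupError` are hypothetical stand-ins for the driver's actual copy operations and faults:

```python
def fetch_vmdk_source(local_template, other_templates, glance_fetch):
    """Sketch of the fallback chain:
    1. copy the VMDK from the image-template on the local datastore,
    2. else copy a template from another datastore,
    3. else fall back to downloading the image from glance again.
    Each candidate is a callable returning a path or raising LookupError.
    """
    for candidate in [local_template, *other_templates]:
        if candidate is None:
            continue
        try:
            return candidate()
        except LookupError:
            continue  # template vanished meanwhile; try the next source
    return glance_fetch()
```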
Until now, the logs during a clean-up showed "Destroyed VM" messages, but one could not find out which VM was destroyed. Therefore, we add an additional log message stating which image-template VM we're going to destroy. Change-Id: I7429fca0175ec4593689466be6dcc0cb2482cb9f (cherry picked from commit 557e34e) (cherry picked from commit 2d6f122)
There are cases where we don't want to blindly try to delete a VM, but instead want to know the outcome for certain types of errors. To support this, vm_util.destroy_vm() has to re-raise certain types of exceptions - in the case we're looking for, the "InvalidArgument" fault. For this to work like before, code calling destroy_vm() needs to catch these exceptions, too. Change-Id: I28d2f1b94b8439cfea88146746ae6e59d61f087c (cherry picked from commit e112c5e) (cherry picked from commit 73a9a27)
While we don't know why yet, we've seen image-cache templates missing their directory on the datastore. For the image-template cleanup this means that Nova cannot destroy those templates, as this raises an InvalidArgument fault on VMware's side, telling us that ConfigSpec.files.vmPathName is invalid. Since we need to clean those templates up anyway to create a new, usable one, we catch those errors and call UnregisterVM as a fallback. For getting templates from other datastores or re-using an already existing template to copy its disk, we have to catch FileNotFoundException. If this situation occurs, we have to clean up the broken template and also let nova continue the search for a source of the disk for the new VM. Change-Id: Id6d1cc1cd7a50958c77a1417e3f2aed7b9672e15 (cherry picked from commit 1b342cd) (cherry picked from commit 13940c5)
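The destroy-then-unregister fallback can be sketched as below; the `InvalidArgument` class and the session methods are simplified stand-ins for the vCenter fault and the driver's actual calls:

```python
class InvalidArgument(Exception):
    """Stand-in for the vCenter fault of the same name."""

def destroy_template(session, vm_ref):
    """Sketch: Destroy fails with InvalidArgument for templates whose
    datastore directory is missing, so unregister the VM instead."""
    try:
        session.destroy_vm(vm_ref)
    except InvalidArgument:
        session.unregister_vm(vm_ref)
```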
If the DRS policy is not set to "fullyAutomated", i.e. if DRS will not take action on existing, running VMs, we have no chance to free up a host and thus don't have to try to find one. Instead, we error out to tell nova-bigvm to search for another cluster. Change-Id: Idbdfe82b4057844401e710fb9d87141478bb3353 (cherry picked from commit e98d6ae) (cherry picked from commit 0e8f17e)
Support for the E1000 driver is being phased out in guest operating systems, so we switch to E1000e, which is also faster and has more hardware-offloading capabilities. Change-Id: I08ac32f914a57d3eb7328351a07a20a2ef212cb8 (cherry picked from commit 5e7556d) fix unit tests for vmware: Switch default VIF model (cherry picked from commit 425070f) (cherry picked from commit a3b173a)
Original code can be found at: https://opendev.org/x/nova-mksproxy Integrate the code into nova as a builtin command. Subclass `base.SecurityProxy` into `MksSecurityProxy`. Because MKS' `connect_info.internal_access_path` contains more than just a `path` (namely a JSON object with "ticket", "cfgFile" & "thumbprint"), add a new `parse_internal_access_path()` method in `NovaProxyRequestHandler`. This method tries to parse `internal_access_path` as JSON and, if that fails, puts the contents in the "path" key of a new `internal_access_path_data` dict. Co-authored-by: Johannes Kulik <johannes.kulik@sap.com>
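A minimal sketch of the JSON-or-path parsing described above (standalone function here, rather than the actual method on `NovaProxyRequestHandler`):

```python
import json

def parse_internal_access_path(internal_access_path):
    """Sketch: MKS consoles pass a JSON object ("ticket", "cfgFile",
    "thumbprint"); other consoles pass a plain path, which we wrap
    under the "path" key of the resulting dict."""
    try:
        return json.loads(internal_access_path)
    except (ValueError, TypeError):
        # Not JSON (or None): treat it as a plain access path.
        return {"path": internal_access_path}
```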
Has to be "nova-{NAME}proxy".
The old script had a --verbose flag that we use in the k8s deployment definition for all regions, which includes both Xena and older Rocky deployments. The new name of this flag is --mks-verbose. Instead of adding an image-version check in the helm-chart, add an alias for the flag for the duration of the update, and revert this after the Xena upgrade is through everywhere.
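Purely as an illustration of the aliasing idea (the nova proxy itself is configured via oslo.config, not argparse), a flag alias can be expressed by registering both spellings against one destination:

```python
import argparse

# Illustrative only: accept both the old --verbose spelling and the
# new --mks-verbose one, mapping both onto the same destination so
# old deployment definitions keep working during the upgrade window.
parser = argparse.ArgumentParser()
parser.add_argument("--mks-verbose", "--verbose", action="store_true",
                    dest="mks_verbose")
```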
Physical CPUs are faster than hyperthreaded CPUs. So by default, VMware spreads a VM's CPU cores onto physical cores, spanning multiple NUMA nodes if needed, even though the vCPUs would fit onto a smaller number of NUMA nodes. HANA VMs prefer low latency (and thus NUMA-locality) over raw performance, so they set the VMware config `preferHT` to not get spread out. Previously, this setting was only applied if the VM's flavor qualified as a "Big VM" flavor (i.e. >1GiB in size). This excluded smaller HANA flavors, and a single-NUMA-node VM got spread over two nodes. Make `preferHT` depend on whether the flavor requires one of our `CUSTOM_NUMASIZE_*` traits instead.
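The new condition boils down to a trait-prefix check; this helper is a hypothetical sketch of the decision, not the driver's actual function:

```python
def wants_prefer_ht(required_traits):
    """Sketch: enable VMware's preferHT whenever the flavor requires one
    of the (site-specific) CUSTOM_NUMASIZE_* traits, instead of keying
    off a memory-size threshold as before."""
    return any(t.startswith("CUSTOM_NUMASIZE_") for t in required_traits)
```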
Another best-practice recommendation for HANA-on-VMware. The code pretty much duplicates the one for setting the DRS partially-automated behavior. I decided against deduplication and abstraction in favor of readability, at the cost of LOC.
Testing showed that single- and half-numa-node VMs get spawned across multiple NUMA nodes due to `numa.autosize.vcpu.maxPerVirtualNode` being too low. Set the value explicitly to the same value that's used for the `CoresPerSocket` cluster-vm setting.
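As a sketch, the resulting extraConfig entry could be built like this (the helper is hypothetical; only the `numa.autosize.vcpu.maxPerVirtualNode` key is the real VMware setting named above):

```python
def numa_extra_config(cores_per_socket):
    """Sketch: pin numa.autosize.vcpu.maxPerVirtualNode to the same
    value as the CoresPerSocket cluster-vm setting, so autosizing
    cannot split a VM that fits one NUMA node across two."""
    return {"numa.autosize.vcpu.maxPerVirtualNode": str(cores_per_socket)}
```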
Introduced in 'vmware: Restart BigVMs with "high" priority', the function "update_cluster_das_vm_override()" did not work because it created a "ClusterDasVmConfigInfo" instead of a "ClusterDasVmConfigSpec" object. This led to us not being able to spawn BigVMs with the following error: Exception in ReconfigureComputeResource_Task. Cause: Type not found: 'operation' Change-Id: If9acf9ee07e373b7b24c14c642d0d99fe2a41db1
Copy the relevant code from the libvirt driver. This reintroduces the code that was changed (presumably by accident) in fd4f43b "VmWare: Refactored resource and inventory collection". Fixup that commit with this one when forward-porting!
`remove_unused_original_minimum_age_seconds`, along with several other config options related to image-cache management, was moved to its own config group `image_cache` in Ussuri (nova 21), but the cherry-pick in 9de18bd "vmware: Update expired image-template-VM handling" failed to account for this.
In daeeafd "baremetal quota handling", the `_build_per_flavor_limits()` method was incorrectly rewritten to match the surrounding code, esp. `_build_absolute_limits()`, and used not only the limit, but both limit and in_use, as the value in the resulting limits dict. The original commit, 16857ed "reintroduce baremetal quota handling", used `absolute_limits` directly as the method parameter.
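The bug and its fix can be illustrated with a simplified sketch; the function name follows the commit message, but the data shapes are hypothetical:

```python
def build_per_flavor_limits(absolute_limits):
    """Sketch of the fix: for per-flavor entries like "instances_<flavor>"
    the limits payload must contain only the limit, not the
    (limit, in_use) pair that _build_absolute_limits-style code emits."""
    limits = {}
    for name, (limit, in_use) in absolute_limits.items():
        if name.startswith("instances_"):
            limits[name] = limit  # bug was: limits[name] = (limit, in_use)
    return limits
```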
For historic reasons, there is a mapping between the volume uuid and device[].backing.uuid in VMware through extraConfig. In most cases they are the same, but sometimes they are not. And apparently, the extraConfig can even hold wrong information. Since the risk of a uuid collision is quite low, and the lookup is an iteration over a small list of volumes, we first try it that way, avoiding possibly incorrect information in the extraConfig. Failing that, we still do the additional API call to get the mapping and try that. Also work around type errors mypy raises in the changed file. Change-Id: Ifcdf96cfc6d00473299c1f2a7cb9d23d03294027
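The two-step lookup can be sketched as follows; the dict-based device objects and the helper name are simplified stand-ins for the driver's managed-object structures:

```python
def find_volume_device(devices, volume_id, extra_config_mapping):
    """Sketch: first scan the (small) device list for a backing uuid
    equal to the volume uuid; only fall back to the extraConfig mapping
    (an extra API call in the real driver) if that fails."""
    # Step 1: optimistic direct match - usually the uuids are the same.
    for dev in devices:
        if dev.get("backing_uuid") == volume_id:
            return dev
    # Step 2: consult the (possibly wrong) extraConfig mapping.
    backing_uuid = extra_config_mapping.get(volume_id)
    for dev in devices:
        if dev.get("backing_uuid") == backing_uuid:
            return dev
    return None
```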
The ironic-python-agent uses qemu-img to convert various image formats to the raw disk format. QCow2 is already correctly marked accordingly, but vmdk was missing. Change-Id: Ifd868e6951452b291184bd848d4244d37649f1ed
We need to separate the arguments for tox from the ones for the linter. Change-Id: I6d7ee1d95f0ca17fa6c04a3546a7f907cc6393f1
Nested `ClusterDasVmSettings` inside new `ClusterDasVmConfigInfo` objects don't get created properly. This seems to have worked on Rocky but fails on Xena.
Otherwise, read() will try to decode the content as UTF-8-encoded text.
We have added custom filters and weighers for our common use-case, VMware, and have already excluded baremetal there. With libvirt as an additional hypervisor, we also want to skip those filters/weighers, so we now test whether the request is scheduled for vmware, either due to a flavor extra_spec or an image property. If nothing is set, we default to vmware for backwards compatibility. Change-Id: I3223aee433ba9009d2cd6387eeda8e70dbbb3cde
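The decision above can be sketched as a small predicate; the property key names here are hypothetical illustrations, not necessarily the ones the scheduler actually inspects:

```python
def is_vmware_request(flavor_extra_specs, image_properties):
    """Sketch: a request counts as a VMware request if the flavor
    extra_specs or the image properties say so; with no hypervisor hint
    at all, default to vmware for backwards compatibility."""
    hv = (flavor_extra_specs.get("capabilities:hypervisor_type")
          or image_properties.get("hypervisor_type"))
    return hv is None or hv.lower() == "vmware"
```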
The code assumes that the libvirt version reported by the hypervisor matches the version of the python library installed. Since we run the agent in a container, that is not correct, and the attribute will not be in the library, as it has been built for an older version. Change-Id: I156047b1f9829a49b429d51ca7f7777606b10a56
Change-Id: I3edf0d3c6deb8385adc1095c7e3c09526e2d4d65
This command detects changes in the flavors and syncs the instances' flavor-info with the new changes: `nova-manage sap sync_instances_flavor`
The vcenter doesn't like a reconfiguration call with an empty spec, and raises an error. So we skip that and save ourselves from doing some work on top. Change-Id: I1da3309500a2cd384c5d7cd431e71297ef5d5a3a
You can now configure [vmware] hypervisor_mode=<cluster,esxi_to_cluster,cluster_to_esxi> to choose to expose the individual ESXi hosts instead of the cluster. It will only show the ESXi host as the hypervisor node, but will not honour the node parameter for placing the instance during instance creation, migration, etc. Change-Id: I264fd242c0de01ae8442c03bc726a0abfbe176ef
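As a configuration fragment, choosing the ESXi-hosts view would look like this (the option and its values are exactly those named above; which mode fits depends on the deployment):

```ini
[vmware]
# one of: cluster, esxi_to_cluster, cluster_to_esxi
hypervisor_mode = esxi_to_cluster
```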
The code assumes that there is a single compute-node per host, which is incorrect for ironic. By summing over all of a host's compute-nodes, we get a correct response also for compute-hosts with more than one compute-node. Change-Id: Iaf6e2a72f6649e234660de47bec8b1da1ea1571e
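The aggregation can be sketched like this; the node dicts and resource keys are simplified stand-ins for the actual ComputeNode objects:

```python
from collections import defaultdict

def summarize_per_host(compute_nodes):
    """Sketch: aggregate resources over *all* compute-nodes of a host
    (ironic exposes many nodes per host) instead of assuming exactly
    one node per host."""
    totals = defaultdict(lambda: {"vcpus": 0, "memory_mb": 0})
    for node in compute_nodes:
        totals[node["host"]]["vcpus"] += node["vcpus"]
        totals[node["host"]]["memory_mb"] += node["memory_mb"]
    return dict(totals)
```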
Reviewer: that's quite few commits. is this meant for xena?
Author: The PR was meant for rocky, but I've updated the branch for xena.
Reviewer: fyi, it should be possible to change the target branch within the existing PR -- but all good, let's continue in #423.
Currently, the code only collects stats for the different nodes... Now I have to look at how to represent them to placement.
After that, the question is how to migrate VMs between the hypervisor nodes.