Doing a rollback via CLI on an LVM storage with 'saferemove' and
'snapshot-as-volume-chain' would run into a locking issue, because
the forked zero-out worker would try to acquire the lock while the
main CLI task is still inside the locked section for
volume_snapshot_rollback_locked(). The same issue does not happen when
the rollback is done via UI. The reason for this can be found in the
note regarding fork_worker():
> we simulate running in foreground if ($self->{type} eq 'cli')
So the worker will be awaited synchronously in CLI context, resulting
in the deadlock, while via API/UI, the main task would move on and
release the lock allowing the zero-out worker to acquire it.
To fix the issue, avoid calling fork_cleanup_worker() inside the
locked section.
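Roughly, the fix restructures the flow as in the following sketch,
where $lock_storage stands in for the actual locking helper and only
fork_cleanup_worker() is taken from the real code:

    my $cleanup_volname;
    $lock_storage->(sub {
        # perform the rollback and only *record* the volume that needs
        # the zero-out, instead of forking the worker under the lock
        $cleanup_volname = $old_volname;
    });
    # fork only after the lock is released; in CLI context, the worker
    # is awaited synchronously and must acquire the lock itself
    $class->fork_cleanup_worker($cleanup_volname)
        if defined($cleanup_volname);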
Fixes: 8eabcc7 ("lvm plugin: snapshot-as-volume-chain: use locking for snapshot operations")
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Link: https://lore.proxmox.com/20251120101742.24843-1-f.ebner@proxmox.com
There are multiple reasons to disallow disabling
'snapshot-as-volume-chain' while a qcow2 image exists:
1. The list of allowed formats depends on 'snapshot-as-volume-chain'.
2. Snapshot functionality is broken. This includes creating snapshots,
but also rollback, which removes the current volume and then fails.
3. There already is coupling between having qcow2 on LVM and having
'snapshot-as-volume-chain' enabled. For example, the
'discard-no-unref' option is required for qcow2 on LVM, but
qemu-server only checks for 'snapshot-as-volume-chain' to avoid
hard-coding LVM. Another one is that volume_qemu_snapshot_method()
returns 'mixed' when the format is qcow2 even when
'snapshot-as-volume-chain' is disabled. Hunting down these corner
cases just to make disabling easier does not seem worth it,
considering reasons 1. and 2. already apply anyway.
4. There might be other similar issues that have not surfaced yet,
because disabling the feature while qcow2 is present is essentially
untested and very uncommon.
For file-based storages, the 'snapshot-as-volume-chain' property is
already fixed, i.e. it is determined upon storage creation and cannot
be changed afterwards.
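A rough sketch of how such a guard can look in the update hook; the
list_images()-based check and the error message are illustrative, not
the actual implementation:

    sub on_update_hook {
        my ($class, $storeid, $scfg, %param) = @_;

        if (exists($param{'snapshot-as-volume-chain'})
            && !$param{'snapshot-as-volume-chain'})
        {
            my $images = $class->list_images($storeid, $scfg);
            for my $image ($images->@*) {
                die "cannot disable 'snapshot-as-volume-chain':"
                    . " qcow2 image '$image->{volid}' exists\n"
                    if ($image->{format} // 'raw') eq 'qcow2';
            }
        }

        return;
    }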
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
As reported by a user in the enterprise support in a ticket handled by
Friedrich, concurrent snapshot operations could lead to metadata
corruption of the volume group with unlucky timing. Add the missing
locking for operations modifying the metadata, i.e. allocation,
renaming, and removal. Since volume_snapshot() and
volume_snapshot_rollback() only do those, use a wrapper for the whole
function. Because volume_snapshot_delete() can do longer-running
commit or rebase operations, only lock the necessary sections there.
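Conceptually, the whole-function wrapper looks like the sketch below;
cluster_lock_storage() is the existing helper from the plugin base
class, while volume_snapshot_unlocked() is just an illustrative name
for the original implementation:

    sub volume_snapshot {
        my ($class, $scfg, $storeid, $volname, $snap) = @_;

        # serialize the lvrename/lvcreate metadata operations with
        # other operations on the same (shared) volume group
        return $class->cluster_lock_storage($storeid, $scfg->{shared}, undef, sub {
            return volume_snapshot_unlocked($class, $scfg, $storeid, $volname, $snap);
        });
    }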
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Tested-by: Friedrich Weber <f.weber@proxmox.com>
Link: https://lore.proxmox.com/20251103162330.112603-5-f.ebner@proxmox.com
With a shared LVM storage, parallel imports, which might be done in
the context of remote migration, could lead to metadata corruption
with unlucky timing, because of missing locking. Add locking around
allocation and removal, which are the sections that modify LVM
metadata. Note that other plugins suffer from missing locking here as
well, but only regarding naming conflicts. Adding locking around the
full call to volume_import() would mean locking for much too long.
Other plugins could follow the approach here, or there could be a
reservation approach like proposed in [0].
[0]: https://lore.proxmox.com/pve-devel/20240403150712.262773-1-h.duerr@proxmox.com/
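As a sketch, only the metadata-modifying steps inside volume_import()
are wrapped, along these lines (write_data_into() is a hypothetical
stand-in for the actual data transfer):

    # allocate the target volume under the lock ...
    my $volname = $class->cluster_lock_storage($storeid, $scfg->{shared}, undef, sub {
        return $class->alloc_image($storeid, $scfg, $vmid, $format, $name, $size_kib);
    });
    # ... then stream the data into it without holding the lock
    write_data_into($scfg, $storeid, $volname, $fh);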
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Tested-by: Friedrich Weber <f.weber@proxmox.com>
Link: https://lore.proxmox.com/20251103162330.112603-4-f.ebner@proxmox.com
The current cstream implementation is pretty slow, even without
throttling. Use blkdiscard --zeroout instead when the storage supports
it, which is a few orders of magnitude faster.
Another benefit is that blkdiscard skips already-zeroed blocks, so it
is very fast for empty temporary images like snapshots.
blkdiscard does not support throttling like cstream does, but the step
size of the zeroing batches pushed to the storage can be tuned. A step
size of 32 MiB is used by default, like in oVirt, where it seems to be
the best balance between speed and load:
79f1d79058
It can be reduced with the 'saferemove_stepsize' option. The step size
is also automatically reduced to the sysfs write_zeroes_max_bytes
value, which is the maximum zeroing batch size supported by the
storage.
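The resulting invocation is roughly the following sketch; the exact
handling of the 'saferemove_stepsize' option is simplified:

    use PVE::Tools qw(run_command);

    # configured step size, defaulting to 32 MiB like oVirt
    my $step = $scfg->{saferemove_stepsize} // 32 * 1024 * 1024;

    run_command(
        ['blkdiscard', '--zeroout', '--step', $step, '-v', $path],
        errmsg => "zero-out of '$path' failed",
    );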
test with a 100G volume (empty):
time /usr/bin/cstream -i /dev/zero -o /dev/test/vm-100-disk-0.qcow2 -T 10 -v 1 -b 1048576
13561233408 B 12.6 GB 10.00 s 1356062979 B/s 1.26 GB/s
26021462016 B 24.2 GB 20.00 s 1301029969 B/s 1.21 GB/s
38585499648 B 35.9 GB 30.00 s 1286135343 B/s 1.20 GB/s
50998542336 B 47.5 GB 40.00 s 1274925312 B/s 1.19 GB/s
63702765568 B 59.3 GB 50.00 s 1274009877 B/s 1.19 GB/s
76721885184 B 71.5 GB 60.00 s 1278640698 B/s 1.19 GB/s
89126539264 B 83.0 GB 70.00 s 1273178488 B/s 1.19 GB/s
101666459648 B 94.7 GB 80.00 s 1270779024 B/s 1.18 GB/s
107390959616 B 100.0 GB 84.39 s 1272531142 B/s 1.19 GB/s
write: No space left on device
real 1m24.394s
user 0m0.171s
sys 1m24.052s
time blkdiscard --zeroout /dev/test/vm-100-disk-0.qcow2 -v
/dev/test/vm-100-disk-0.qcow2: Zero-filled 107390959616 bytes from the offset 0
real 0m3.641s
user 0m0.001s
sys 0m3.433s
test with a 100G volume with random data:
time blkdiscard --zeroout /dev/test/vm-100-disk-0.qcow2 -v
/dev/test/vm-112-disk-1: Zero-filled 4764729344 bytes from the offset 0
/dev/test/vm-112-disk-1: Zero-filled 4664066048 bytes from the offset 4764729344
/dev/test/vm-112-disk-1: Zero-filled 4831838208 bytes from the offset 9428795392
/dev/test/vm-112-disk-1: Zero-filled 4831838208 bytes from the offset 14260633600
/dev/test/vm-112-disk-1: Zero-filled 4831838208 bytes from the offset 19092471808
/dev/test/vm-112-disk-1: Zero-filled 4865392640 bytes from the offset 23924310016
/dev/test/vm-112-disk-1: Zero-filled 4596957184 bytes from the offset 28789702656
/dev/test/vm-112-disk-1: Zero-filled 4731174912 bytes from the offset 33386659840
/dev/test/vm-112-disk-1: Zero-filled 4294967296 bytes from the offset 38117834752
/dev/test/vm-112-disk-1: Zero-filled 4664066048 bytes from the offset 42412802048
/dev/test/vm-112-disk-1: Zero-filled 4697620480 bytes from the offset 47076868096
/dev/test/vm-112-disk-1: Zero-filled 4664066048 bytes from the offset 51774488576
/dev/test/vm-112-disk-1: Zero-filled 4261412864 bytes from the offset 56438554624
/dev/test/vm-112-disk-1: Zero-filled 4362076160 bytes from the offset 60699967488
/dev/test/vm-112-disk-1: Zero-filled 4127195136 bytes from the offset 65062043648
/dev/test/vm-112-disk-1: Zero-filled 4328521728 bytes from the offset 69189238784
/dev/test/vm-112-disk-1: Zero-filled 4731174912 bytes from the offset 73517760512
/dev/test/vm-112-disk-1: Zero-filled 4026531840 bytes from the offset 78248935424
/dev/test/vm-112-disk-1: Zero-filled 4194304000 bytes from the offset 82275467264
/dev/test/vm-112-disk-1: Zero-filled 4664066048 bytes from the offset 86469771264
/dev/test/vm-112-disk-1: Zero-filled 4395630592 bytes from the offset 91133837312
/dev/test/vm-112-disk-1: Zero-filled 3623878656 bytes from the offset 95529467904
/dev/test/vm-112-disk-1: Zero-filled 4462739456 bytes from the offset 99153346560
/dev/test/vm-112-disk-1: Zero-filled 3758096384 bytes from the offset 103616086016
real 0m23.969s
user 0m0.030s
sys 0m0.144s
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
Link: https://lore.proxmox.com/mailman.253.1761222252.362.pve-devel@lists.proxmox.com
[FE: Minor language improvements
Use more common style for importing with qw()
Don't specify full path to blkdiscard binary for run_command()]
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Without untainting, offline-deleting a volume-chain snapshot on an LVM
storage via the GUI can fail with an "Insecure dependency in exec
[...]" error, because volume_snapshot_delete uses the filename in its
qemu-img invocation.
Commit 93f0dfb ("plugin: volume snapshot info: untaint snapshot
filename") fixed this already for the volume_snapshot_info
implementation of the Plugin base class, but missed that the LVM
plugin overrides the method and was still missing the untaint.
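The fix boils down to the usual Perl untaint pattern; the regex below
is illustrative, not the one from the patch:

    # untaint the snapshot filename before it reaches exec via
    # run_command; only accept the expected naming scheme
    ($filename) = $filename =~ m/^(snap_\S+\.qcow2)$/
        or die "internal error: unexpected snapshot filename '$filename'\n";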
Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
Link: https://lore.proxmox.com/20250731071306.11777-1-f.weber@proxmox.com
Volume names are allowed to contain underscores, so it is impossible
to determine the snapshot name from just the volume name, e.g.:
snap_vm-100-disk_with_underscore_here_s_some_more.qcow2
Therefore, pass along the short volume name too and match against it.
Note that none of the variables from the result of parse_volname()
were actually used previously.
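With the short volume name available, the split becomes unambiguous; a
simplified sketch:

    # $volname is the short volume name,
    # e.g. 'vm-100-disk_with_underscore'
    my ($snapname) = $file =~ m/^snap_\Q$volname\E_(.+)\.qcow2$/
        or die "cannot parse snapshot name from '$file'\n";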
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Tested-by: Max R. Carrara <m.carrara@proxmox.com>
Reviewed-by: Max R. Carrara <m.carrara@proxmox.com>
Link: https://lore.proxmox.com/20250730162117.160498-3-f.ebner@proxmox.com
As the alloc_lvm_image() helper asserts, qcow2 cannot be used as a
format without the 'snapshot-as-volume-chain' configuration option.
Therefore it is necessary to implement get_formats() and distinguish
based on the storage configuration.
In case the 'snapshot-as-volume-chain' option is set, qcow2 is even
preferred and thus declared the default format.
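A sketch of the shape this takes; note that the exact signature and
return structure of get_formats() are assumed here, not verified
against the final API:

    sub get_formats {
        my ($class, $scfg, $storeid) = @_;

        if ($scfg->{'snapshot-as-volume-chain'}) {
            # qcow2 is preferred and becomes the default format
            return { default => 'qcow2', valid => { qcow2 => 1, raw => 1 } };
        }
        return { default => 'raw', valid => { raw => 1 } };
    }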
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Not perfect, but it's still easy to rename now, and the new variant
fits the actual design and implementation a bit better.
Add best-effort migration for storage.cfg; this has never been
publicly released after all.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
This also changes the return values, since their meanings are rather
weird from the storage point of view. For instance, "internal" meant
it is *not* the storage which does the snapshot, while "external"
meant a mixture of storage and qemu-server side actions. `undef` meant
the storage does it all...
┌────────────┬───────────┐
│ previous │ new │
├────────────┼───────────┤
│ "internal" │ "qemu" │
│ "external" │ "mixed" │
│ undef │ "storage" │
└────────────┴───────────┘
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Just to be safe, as this is already checked higher up in the stack.
Technically, it's possible to implement snapshot file renaming and to
update the backing_file info with "qemu-img rebase -u".
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
by moving the preallocation handling to the call site, and preparing
them to take further options like cluster size in the future.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
We format the LVM logical volume with qcow2 to handle the snapshot
chain. Like for qcow2 files, when a snapshot is taken, the current LVM
volume is renamed to the snapshot volname, and a new current LVM
volume is created with the snapshot volume as backing file.
The snapshot volname is similar to lvmthin:
snap_${volname}_${snapname}.qcow2
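The flow for taking a snapshot then looks roughly like this (names and
sizes simplified):

    use PVE::Tools qw(run_command);

    # 1. rename the current volume to the snapshot volume
    run_command(['lvrename', $vg, 'vm-100-disk-0.qcow2',
        'snap_vm-100-disk-0_snap1.qcow2']);

    # 2. allocate a fresh LV for the new current volume
    run_command(['lvcreate', '-n', 'vm-100-disk-0.qcow2', '-L', '10g', $vg]);

    # 3. format it as qcow2 with the snapshot volume as backing file
    run_command([
        'qemu-img', 'create', '-f', 'qcow2', '-F', 'qcow2',
        '-b', "/dev/$vg/snap_vm-100-disk-0_snap1.qcow2",
        "/dev/$vg/vm-100-disk-0.qcow2",
    ]);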
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
Returns whether the volume supports qemu snapshots:
'internal' : do the snapshot with qemu internal snapshot
'external' : do the snapshot with qemu external snapshot
undef : does not support qemu snapshot
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
This adds a $running param to volume_snapshot, which can be used if
some extra actions need to be done at the storage layer when the
snapshot has already been taken at the qemu level.
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
When discovering a new volume group (VG), for example on boot, LVM
triggers autoactivation. With the default settings, this activates all
logical volumes (LVs) in the VG. Activating an LV creates a
device-mapper device and a block device under /dev/mapper.
This is not necessarily problematic for local LVM VGs, but it is
problematic for VGs on top of a shared LUN used by multiple cluster
nodes (accessed via e.g. iSCSI/Fibre Channel/direct-attached SAS).
Concretely, in a cluster with a shared LVM VG where an LV is active on
nodes 1 and 2, deleting the LV on node 1 will not clean up the
device-mapper device on node 2. If an LV with the same name is
recreated later, the leftover device-mapper device will cause
activation of that LV on node 2 to fail with:
> device-mapper: create ioctl on [...] failed: Device or resource busy
Hence, certain combinations of guest removal (and thus LV removals)
and node reboots can cause guest creation or VM live migration (which
both entail LV activation) to fail with the above error message for
certain VMIDs, see bug #4997 for more information [1].
To avoid this issue in the future, disable autoactivation when
creating new LVs using the `--setautoactivation` flag. With this
setting, LVM autoactivation will not activate these LVs, and the
storage stack will take care of activating/deactivating the LV (only)
on the correct node when needed.
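Concretely, LV creation gains the flag, roughly as in this simplified
call (the existing option handling is omitted):

    run_command([
        'lvcreate', '--setautoactivation', 'n',
        '-n', $name, '-L', "${size}k", $vg,
    ]);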
This additionally fixes an issue with multipath on FC/SAS-attached
LUNs where LVs would be activated too early after boot when multipath
is not yet available, see [3] for more details and the current
workaround.
The `--setautoactivation` flag was introduced with LVM 2.03.12 [2], so
it is available since Bookworm/PVE 8, which ships 2.03.16. Nodes with
older LVM versions ignore the flag and remove it on metadata updates,
which is why PVE 8 could not use the flag reliably, since there may
still be PVE 7 nodes in the cluster that reset it on metadata updates.
The flag is only set for newly created LVs, so LVs created before this
patch can still trigger #4997. To avoid this, users will be advised to
run a script to disable autoactivation for existing LVs.
[1] https://bugzilla.proxmox.com/show_bug.cgi?id=4997
[2] https://gitlab.com/lvmteam/lvm2/-/blob/main/WHATS_NEW
[3] https://pve.proxmox.com/mediawiki/index.php?title=Multipath&oldid=12039#FC/SAS-specific_configuration
Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
Link: https://lore.proxmox.com/20250709141034.169726-2-f.weber@proxmox.com
using the new top-level `make tidy` target, which calls perltidy via
our wrapper to enforce the desired style as closely as possible.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Hard-coding a list of sensitive properties means that custom plugins
cannot define their own sensitive properties for the on_add/on_update
hooks.
Have plugins declare the list of their sensitive properties in the
plugin data. For backwards compatibility, return the previously
hard-coded list if no such declaration is present.
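For a custom plugin, the declaration could look like the following
sketch; the 'sensitive-properties' key name is assumed from this
series, and the property names are just examples:

    sub plugindata {
        return {
            content => [{ images => 1 }, { images => 1 }],
            # properties the on_add/on_update hooks must treat as
            # sensitive (key name as assumed from this series)
            'sensitive-properties' => {
                password => 1,
                'encryption-key' => 1,
            },
        };
    }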
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Tested-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Reviewed-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Link: https://lore.proxmox.com/20250404133204.239783-6-f.ebner@proxmox.com
Previously, the size was rounded down, which, in case of an image with
a non-1KiB-aligned size (only possible for external plugins or
manually created images), would lead to errors when attempting to
write beyond the end of the too-small allocated target image.
For image allocation, the size is already rounded up to the
granularity of the storage. Do the same for import.
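I.e., for a 1 KiB granularity, the allocation size is computed by
rounding up, as in this sketch:

    # round the incoming image size *up* to the next KiB instead of
    # down, so writes near the end of the source cannot overrun the
    # target
    my $size_kib = int(($size_bytes + 1023) / 1024);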
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
The $storeid param is missing and $snapname is used as the third
param. This actually seems to work, because $snapname is always empty
for LVM.
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
dd supports a 'status' flag, which enables it to show the copied
bytes, duration, and transfer rate; these get printed to stderr.
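For example ('status=progress' as used here is an assumption, not
necessarily the exact flag value from the patch):

    # report bytes copied, duration and transfer rate on stderr
    run_command(['dd', "if=$src_path", "of=$dst_path", 'bs=1M',
        'status=progress']);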
Signed-off-by: Leo Nunner <l.nunner@proxmox.com>