This patch introduces support for Ceph's RBD namespaces.
A new storage config parameter 'namespace' defines the namespace to be
used for the RBD storage.
The namespace must already exist in the Ceph cluster as it is not
automatically created.
The main intention is to use this for external Ceph clusters. With
namespaces, each PVE cluster can get its own namespace and will not
conflict with other PVE clusters.
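For illustration, a storage.cfg entry using the new parameter could look
roughly like this (the storage id, monitor addresses and namespace name are
placeholders), with the namespace created beforehand on the Ceph side, e.g.
via `rbd namespace create <pool>/<namespace>`:

    rbd: ceph-external
        monhost 192.168.10.1 192.168.10.2 192.168.10.3
        pool rbd
        namespace pve-cluster-a
        content images
        username admin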
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
The <pool>/<image> paths are needed in quite a lot of places. Having one
single place where they are created helps to reduce duplicate code and
makes it easier to introduce new features.
The 'add_pool_to_disk' sub was already doing that, but the name did not
really fit. This commit renames it to the more general
'get_rbd_path' and changes the second parameter to the more widely used
$volume instead of $disk.
Furthermore, all occurrences where "$pool/$volume" was concatenated manually
have been replaced with a call to get_rbd_path.
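A minimal sketch of such a helper (the actual implementation may differ in
details; the config key names follow this series):

    my $get_rbd_path = sub {
        my ($scfg, $volume) = @_;

        my $path = $scfg->{pool} // 'rbd';
        $path .= "/$scfg->{namespace}" if defined($scfg->{namespace});
        $path .= "/$volume" if defined($volume);

        return $path;
    };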
Plus some minor code style cleanups for long function calls that were
touched.
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
by relying on archive_info's vmid first. archive_info is already used to
determine if it's a standard name, and in that case the vmid is certainly set.
Also add asserts to make sure we got what we expected.
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
it is optional after all, and missing (/None) for files stored in the
snapshot dir but not referenced in the manifest for whatever reason.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
This reverts commit a44c18925d and adds a reminder
comment.
The mentioned commit is actually a backwards-incompatible change that leads to
slightly different behavior when migrating a VM with volumes on a misconfigured
storage. For example, unreferenced volumes on a misconfigured storage won't be
picked up, even though they were before. And for referenced volumes on a
misconfigured storage, the disk size would not be updated on migration anymore.
We should wait until the next major release for this change and then also
re-evaluate the migration behavior with misconfigured disks.
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
use DirPlugin's get/update_volume_notes implementation (which all the
other supported file systems use)
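The delegation itself is a one-liner per sub; a sketch of the pattern
(assuming PVE::Storage::DirPlugin is already loaded by the plugin):

    sub get_volume_notes {
        my $class = shift;
        return PVE::Storage::DirPlugin::get_volume_notes($class, @_);
    }

    sub update_volume_notes {
        my $class = shift;
        return PVE::Storage::DirPlugin::update_volume_notes($class, @_);
    }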
Signed-off-by: Dylan Whyte <d.whyte@proxmox.com>
Only these storages are activated in the first place, and it's bad behavior to
list images when no appropriate content type is set.
For example, on VM destruction, this avoids unreferenced images being deleted
from a storage with only 'backup' content type set, which is supposedly what
happened in this[0] forum thread.
(Some) callers expect all keys to be present and valid array references in the
result, so initialization is needed.
Now, the enabled check is already done by the preceding code for every element
that is iterated over, and thus isn't needed in the main loop anymore.
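Roughly, the intended shape of the loop is the following (variable and helper
names are only illustrative, not the exact code):

    my $res = {};
    for my $storeid (@$storage_list) {
        my $scfg = PVE::Storage::storage_config($cfg, $storeid);
        # callers expect every key to be a valid array reference
        $res->{$storeid} = [];

        # skip storages without an appropriate content type
        next if !$scfg->{content}->{images} && !$scfg->{content}->{rootdir};

        my $plugin = PVE::Storage::Plugin->lookup($scfg->{type});
        push @{$res->{$storeid}}, @{$plugin->list_images($storeid, $scfg, $vmid)};
    }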
[0]: https://forum.proxmox.com/threads/erasing-all-vm-disks-after-a-failed-vm-migration-task.85068
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
as that seems to be the more natural permission path for listing a node's local
disks. For backwards compatibility, the old permission check has to be kept
(relevant with propagate=0).
This API call was originally part of the Ceph API and got copied here later,
which might explain the current permission check.
In the UI, the Disk panel is visible with a node audit permission, but the API
call itself failed without the '/' audit permission.
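The combined check can be expressed along these lines (the concrete privilege
names are an assumption here, only the structure matters):

    permissions => {
        check => ['or',
            ['perm', '/nodes/{node}', ['Sys.Audit']],
            # kept for backwards compatibility (relevant with propagate=0)
            ['perm', '/', ['Sys.Audit']],
        ],
    },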
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
Return early when the mounted heuristic returns true; that allows getting
rid of an indentation level.
Moving the heuristic out makes the activate method smaller and easier
to grasp.
Best viewed with ignoring whitespace changes (`git show -w`).
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
highly unlikely to fail in our setups; the most realistic case is when
procfs is not mounted at /proc, which breaks much else anyway and is
a requirement
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
this was mistakenly done because the procfs code uses it, and it was
assumed we need to decode this value too to get both into the same
encoding space and thus allow a correct comparison.
But only procfs has that encoding; we don't have it for pool values
in the storage config, so we must not decode that value, as doing so
could potentially break things.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
This commit is a small performance optimization to the previous one:
`zpool list` is cheaper than `zpool import -d /dev..` (the latter
scans the disks in the provided directory for zfs signatures,
unconditionally)
Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
This patch addresses an issue we recently saw on a production machine:
* after booting, a ZFS pool failed to get imported (due to an empty
/etc/zfs/zpool.cache)
* pvestatd/guest-startall eventually tried to import the pool
* the pool was imported, yet the datasets of the pool remained
unmounted
A bit of debugging showed that `zpool import <poolname>` is not
atomic; in fact, it fork+execs `mount` with appropriate parameters.
If an import ran longer than the hardcoded timeout of 15s, it could
happen that the pool got imported, but the zpool command (and its
forks) got terminated due to timing out.
reproducing this is straightforward by setting (drastic) bw+iops
limits on a guest's disk (which contains a zpool) - e.g.:
`qm set 100 -scsi1 wd:vm-100-disk-1,iops_rd=10,iops_rd_max=20,\
iops_wr=15,iops_wr_max=20,mbps_rd=10,mbps_rd_max=15,mbps_wr=10,\
mbps_wr_max=15`
afterwards running `timeout 15 zpool import <poolname>` resulted in
that situation in the guest on my machine
The patch changes the check in activate_storage for the ZFSPoolPlugin to
check whether any dataset below the 'pool' (which can itself be a sub-dataset)
is mounted, by parsing /proc/mounts:
* this is cheaper than running `zfs get` or `zpool list`
* it catches a properly imported and mounted pool in case the
root-dataset has 'canmount' set to off (or noauto), as long
as any dataset below is mounted
After trying to import the pool, we also run `zfs mount -a` (in case
another check of /proc/mounts fails).
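A minimal sketch of the mounted check (using PVE::ProcFSTools to read
/proc/mounts; helper name and details are illustrative only):

    use PVE::ProcFSTools;

    my $dataset_mounted = sub {
        my ($pool) = @_;

        my $mounts = PVE::ProcFSTools::parse_proc_mounts();
        for my $mount (@$mounts) {
            my ($what, $dir, $fstype) = @$mount;
            next if $fstype ne 'zfs';
            # matches the pool itself or any dataset below it
            return 1 if $what eq $pool || $what =~ m!^\Q$pool\E/!;
        }
        return 0;
    };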
Potential for regression:
* running `zfs mount -a` is problematic if a dataset is manually
unmounted after booting (without setting 'canmount')
* a pool without any mounted dataset (no mountpoint property set and
only zvols) will result in repeated calls to `zfs mount -a`
both of the above seem unlikely and should not occur when using our
tooling.
Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
and squash the __no_lock-variant into it.
This lock is not broad enough, because for a caller that plans to do or not do
some storage operation based on the result of the check, the following could
happen:
1. volume_is_base_and_used is called and the result is used to enter a branch
2. situation on the storage changes in the meantime
3. the branch chosen in 1. might not be the one that should be taken anymore
This means that callers are responsible for locking, and luckily the existing
callers do use their own locks already:
1. vdisk_free used the __no_lock-variant with a broader lock also covering
the free operation.
2. vdisk_clone is not a caller, but is relevant and it does lock the storage
3. the calls during VM migration and VM destruction happen in the context of a
locked VM config. Because the clone operation also locks the VM config, it
cannot happen that a linked clone is created while the template VM is
migrated away or destroyed or vice versa. And even if that were the case,
the base disk would not be freed, because of what vdisk_free/vdisk_clone do.
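For illustration, the caller-side pattern is roughly the following (simplified
from what vdisk_free does):

    $plugin->cluster_lock_storage($storeid, $scfg->{shared}, undef, sub {
        # check and operation happen under the same storage lock
        die "base volume '$volname' is still in use\n"
            if PVE::Storage::volume_is_base_and_used($cfg, $volid);

        $plugin->free_image($storeid, $scfg, $volname, $isBase, $format);
    });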
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
and have a parent key for partitions, to be able to see the associated disk in
the result without having to rely on naming heuristics (just adding a number at
the end doesn't work for NVMes).
The disk's usage will not be based on the partitions' usage if the flag is
set, but will simply be 'partitions'.
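A partition entry in the result could then look roughly like this (all keys
except 'parent' are just assumed here for illustration):

    my $partition = {
        devpath => '/dev/nvme0n1p2',
        parent => '/dev/nvme0n1',
        used => 'LVM',
    };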
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
so it can be re-used for partitions.
Also changes the regular expression in get_ceph_volume_info to match the full
device/partition name the LV is on. Not only is this needed for partitions,
especially if there are multiple partitions with an OSD, but it also fixes
handling of NVMe devices with an OSD as a side effect. Previously, those were
not detected here, because of the digits in the name, e.g. /dev/nvme0n1.
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
Note that this is a slight behavior change, because now the first
partition usage that is not simply 'partition' will become the disk's
usage. Previously, if any partition was 'mounted', that would become the
disk's usage, then 'LVM', 'ZFS', etc.
A partition's usage defaults to 'partition' if nothing more specific can be
found, and it is never treated as unused for now.
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
in preparation for also querying the file system type from lsblk. Note that the
result now also includes devices without a parttype, so a definedness check in
get_devices_by_partuuid is needed. This will be useful when the whole device
contains a filesystem.
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
Previously, any GPT-initialized disk without an osdid (i.e. equal to -1) would
be included in the list of journal disk candidates, for example a ZFS disk. But
the OSD creation API call will fail for those. To fix it, re-use the condition
from the corresponding check in that API call (in PVE/API2/Ceph/OSD.pm).
Now, included disks are unused disks, those with usage 'partitions' and GPT, and
those with usage 'LVM'.
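Expressed as a condition (the 'used' and 'gpt' keys are assumed here to match
the disk list result; treat this as a sketch, not the exact check):

    my $is_journal_candidate = sub {
        my ($disk) = @_;
        my $used = $disk->{used} // 'unused';

        return $used eq 'unused'
            || ($used eq 'partitions' && $disk->{gpt})
            || $used eq 'LVM';
    };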
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
This way, the property will get added when parsing the storage configuration
and PBS storages will correctly show up as shared storages in API results.
AFAICT the only affected PBS operation is free_image via vdisk_free, which will
now be protected by a cluster-wide lock, and that shouldn't hurt.
Another issue this fixes, which is the reason this patch exists, was reported
in the forum[0]. The free space from PBS storages was counted once for each node
that had access to the storage.
[0]: https://forum.proxmox.com/threads/pve-6-3-the-storage-size-was-displayed-incorrectly.83136/
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
LVM RAID logical volumes (including mirrors) can be valid disk images, so they
should show up in storage content listings (for example pvesm list).
Including LV types is safer than excluding, especially because of possible
additional types in the future.
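For illustration, such an allow-list can be expressed via the first character
of the lv_attr field reported by lvs ('-' plain, m/M mirror, r/R RAID); this
is a sketch, not necessarily the exact check used:

    my $lv_type_is_image = sub {
        my ($lv_attr) = @_;
        my $lv_type = substr($lv_attr, 0, 1);
        return $lv_type =~ m/^[-mMrR]$/;
    };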
Co-developed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Dominic Jäger <d.jaeger@proxmox.com>
check_connection is done by querying the exports of the nfs server
in question. With nfs v4 those exports aren't listed anymore, since nfs
v4 employs a pseudo-filesystem starting from root (/).
rpcinfo allows querying for the existence of an nfs v4 service.
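For example, something along the lines of `rpcinfo -T tcp <server> nfs 4` can
be used; a zero exit status indicates that an NFS v4 service is registered on
the server.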
Signed-off-by: Alwin Antreich <a.antreich@proxmox.com>
as described in the zfs bug https://github.com/openzfs/zfs/issues/10931
the kernel keeps around cached data from mmaps after a rollback, which leaves
invalid data in files that were allegedly rolled back
to work around this (until a real fix comes along), we unmount the subvol,
which invalidates the kernel cache anyway
the dataset gets mounted again on the next 'activate_volume'
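for example, something like `zfs unmount <pool>/<subvol>` after the rollback is
enough to drop the stale cached pages (the dataset name is a placeholder here)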
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
the compat symlink from bin to sbin has been dropped with bullseye, and
we rely on PATH being set properly in our daemons/CLI tools anyway.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>