PVE is based on the famous Debian Linux distribution. That means that you have access to the whole world of Debian packages, and the base system is well documented. The Debian Administrator's Handbook is available online, and provides a comprehensive introduction to the Debian operating system (see [Hertzog13]).
A standard PVE installation uses the default repositories from Debian, so you get bug fixes and security updates through that channel. In addition, we provide our own package repository to roll out all PVE related packages. This includes updates to some Debian packages when necessary.
We also deliver a specially optimized Linux kernel, where we enable all required virtualization and container features. That kernel includes drivers for ZFS, and several hardware drivers. For example, we ship Intel network card drivers to support their newest hardware.
The following sections will concentrate on virtualization related topics. They either explain things which are different on PVE, or tasks which are commonly done on PVE. For other topics, please refer to the standard Debian documentation.
System Software Updates
We provide regular package updates on all repositories. You can install those updates using the GUI, or you can directly run the CLI command apt-get:
apt-get update
apt-get dist-upgrade
|
The apt package management system is extremely flexible and provides countless features - see man apt-get or [Hertzog13] for additional information. |
You should do such updates at regular intervals, or when we release versions with security related fixes. Major system upgrades are announced on the PVE Community Forum. Those announcements also contain detailed upgrade instructions.
|
We recommend running upgrades regularly, because it is important to get the latest security updates. |
Network Configuration
Network configuration can be done either via the GUI, or by manually editing the file /etc/network/interfaces, which contains the whole network configuration. The interfaces(5) manual page contains the complete format description. All PVE tools try hard to preserve direct user modifications, but using the GUI is still preferable, because it protects you from errors.
Once the network is configured, you can use the traditional Debian tools ifup and ifdown to bring interfaces up and down.
|
PVE does not write changes directly to /etc/network/interfaces. Instead, we write into a temporary file called /etc/network/interfaces.new, and commit those changes when you reboot the node. |
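For example, after changing the network settings in the GUI, you can review the staged configuration before it gets applied, or re-apply a manual edit of /etc/network/interfaces to a single interface. A small sketch (vmbr0 is just an example interface name):
cat /etc/network/interfaces.new    # review the staged changes; they are committed on the next reboot
ifdown vmbr0 && ifup vmbr0         # re-apply a manually edited interface definition without rebooting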
Naming Conventions
We currently use the following naming conventions for device names:
-
Ethernet devices: en*, systemd network interface names. This naming scheme is used for new PVE installations since version 5.0.
-
Ethernet devices: eth[N], where 0 ≤ N (eth0, eth1, …). This naming scheme is used for PVE hosts which were installed before the 5.0 release. When upgrading to 5.0, the names are kept as-is.
-
Bridge names: vmbr[N], where 0 ≤ N ≤ 4094 (vmbr0 - vmbr4094)
-
Bonds: bond[N], where 0 ≤ N (bond0, bond1, …)
-
VLANs: Simply add the VLAN number to the device name, separated by a period (eno1.50, bond1.30)
This makes it easier to debug network problems, because the device name implies the device type.
Systemd Network Interface Names
Systemd uses the two-character prefix en for Ethernet network devices. The next characters depend on the device driver and which schema matches first.
-
o<index>[n<phys_port_name>|d<dev_port>] — devices on board
-
s<slot>[f<function>][n<phys_port_name>|d<dev_port>] — device by hotplug id
-
[P<domain>]p<bus>s<slot>[f<function>][n<phys_port_name>|d<dev_port>] — devices by bus id
-
x<MAC> — device by MAC address
The most common patterns are:
-
eno1 — the first on-board NIC
-
enp3s0f1 — the NIC on PCI bus 3, slot 0, using NIC function 1
For more information see Predictable Network Interface Names.
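To check which names were assigned on a particular host, you can list the devices with iproute2 and inspect the properties udev derived for them. A quick sketch (eno1 is just an example name):
ip -br link                                      # brief list of all network interfaces and their state
udevadm info -q property -p /sys/class/net/eno1  # shows the ID_NET_NAME_* properties used for naming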
Choosing a network configuration
Depending on your current network organization and your resources you can choose either a bridged, routed, or masquerading networking setup.
PVE server in a private LAN, using an external gateway to reach the internet
The Bridged model makes the most sense in this case, and this is also the default mode on new PVE installations. Each of your guest systems will have a virtual interface attached to the PVE bridge. This is similar in effect to having the guest network card directly connected to a new switch on your LAN, with the PVE host playing the role of the switch.
PVE server at hosting provider, with public IP ranges for Guests
For this setup, you can use either a Bridged or Routed model, depending on what your provider allows.
PVE server at hosting provider, with a single public IP address
In that case the only way to get outgoing network access for your guest systems is to use Masquerading. For incoming network access to your guests, you will need to configure Port Forwarding.
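As a sketch of such a port forwarding rule, an iptables DNAT entry could look like the following. The interface name, ports and guest address are only examples and have to be adapted to your setup:
# forward TCP port 2222 on the host to SSH (port 22) on a guest with a private address
iptables -t nat -A PREROUTING -i eno1 -p tcp --dport 2222 -j DNAT --to-destination 10.10.10.10:22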
For further flexibility, you can configure VLANs (IEEE 802.1q) and network bonding, also known as "link aggregation". That way it is possible to build complex and flexible virtual networks.
Default Configuration using a Bridge
Bridges are like physical network switches implemented in software. All VMs can share a single bridge, or you can create multiple bridges to separate network domains. Each host can have up to 4094 bridges.
The installation program creates a single bridge named vmbr0, which is connected to the first Ethernet card. The corresponding configuration in /etc/network/interfaces might look like this:
auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.10.2
        netmask 255.255.255.0
        gateway 192.168.10.1
        bridge_ports eno1
        bridge_stp off
        bridge_fd 0
Virtual machines behave as if they were directly connected to the physical network. The network, in turn, sees each virtual machine as having its own MAC, even though there is only one network cable connecting all of these VMs to the network.
Routed Configuration
Most hosting providers do not support the above setup. For security reasons, they disable networking as soon as they detect multiple MAC addresses on a single interface.
|
Some providers allow you to register additional MACs on their management interface. This avoids the problem, but is clumsy to configure because you need to register a MAC for each of your VMs. |
You can avoid the problem by “routing” all traffic via a single interface. This makes sure that all network packets use the same MAC address.
A common scenario is that you have a public IP (assume 198.51.100.5 for this example), and an additional IP block for your VMs (203.0.113.16/29). We recommend the following setup for such situations:
auto lo
iface lo inet loopback

auto eno1
iface eno1 inet static
        address 198.51.100.5
        netmask 255.255.255.0
        gateway 198.51.100.1
        post-up echo 1 > /proc/sys/net/ipv4/ip_forward
        post-up echo 1 > /proc/sys/net/ipv4/conf/eno1/proxy_arp

auto vmbr0
iface vmbr0 inet static
        address 203.0.113.17
        netmask 255.255.255.248
        bridge_ports none
        bridge_stp off
        bridge_fd 0
Masquerading (NAT) with iptables
Masquerading allows guests having only a private IP address to access the network by using the host IP address for outgoing traffic. Each outgoing packet is rewritten by iptables to appear as originating from the host, and responses are rewritten accordingly to be routed to the original sender.
auto lo
iface lo inet loopback

auto eno1
#real IP address
iface eno1 inet static
        address 198.51.100.5
        netmask 255.255.255.0
        gateway 198.51.100.1

auto vmbr0
#private sub network
iface vmbr0 inet static
        address 10.10.10.1
        netmask 255.255.255.0
        bridge_ports none
        bridge_stp off
        bridge_fd 0
        post-up echo 1 > /proc/sys/net/ipv4/ip_forward
        post-up iptables -t nat -A POSTROUTING -s '10.10.10.0/24' -o eno1 -j MASQUERADE
        post-down iptables -t nat -D POSTROUTING -s '10.10.10.0/24' -o eno1 -j MASQUERADE
Linux Bond
Bonding (also called NIC teaming or Link Aggregation) is a technique for binding multiple NICs to a single network device. It can be used to achieve different goals, like making the network fault-tolerant, increasing the performance, or both.
High-speed hardware like Fibre Channel and the associated switching hardware can be quite expensive. By doing link aggregation, two NICs can appear as one logical interface, resulting in double speed. This is a native Linux kernel feature that is supported by most switches. If your nodes have multiple Ethernet ports, you can distribute your points of failure by running network cables to different switches, and the bonded connection will fail over to one cable or the other in case of network trouble.
Aggregated links can reduce live-migration delays and improve the speed of data replication between PVE Cluster nodes.
There are 7 modes for bonding:
-
Round-robin (balance-rr): Transmit network packets in sequential order from the first available network interface (NIC) slave through the last. This mode provides load balancing and fault tolerance.
-
Active-backup (active-backup): Only one NIC slave in the bond is active. A different slave becomes active if, and only if, the active slave fails. The single logical bonded interface’s MAC address is externally visible on only one NIC (port) to avoid distortion in the network switch. This mode provides fault tolerance.
-
XOR (balance-xor): Transmit network packets based on [(source MAC address XOR’d with destination MAC address) modulo NIC slave count]. This selects the same NIC slave for each destination MAC address. This mode provides load balancing and fault tolerance.
-
Broadcast (broadcast): Transmit network packets on all slave network interfaces. This mode provides fault tolerance.
-
IEEE 802.3ad Dynamic link aggregation (802.3ad)(LACP): Creates aggregation groups that share the same speed and duplex settings. Utilizes all slave network interfaces in the active aggregator group according to the 802.3ad specification.
-
Adaptive transmit load balancing (balance-tlb): Linux bonding driver mode that does not require any special network-switch support. The outgoing network packet traffic is distributed according to the current load (computed relative to the speed) on each network interface slave. Incoming traffic is received by one currently designated slave network interface. If this receiving slave fails, another slave takes over the MAC address of the failed receiving slave.
-
Adaptive load balancing (balance-alb): Includes balance-tlb plus receive load balancing (rlb) for IPV4 traffic, and does not require any special network switch support. The receive load balancing is achieved by ARP negotiation. The bonding driver intercepts the ARP Replies sent by the local system on their way out and overwrites the source hardware address with the unique hardware address of one of the NIC slaves in the single logical bonded interface such that different network-peers use different MAC addresses for their network packet traffic.
If your switch supports the LACP (IEEE 802.3ad) protocol, then we recommend using the corresponding bonding mode (802.3ad). Otherwise you should generally use the active-backup mode.
If you intend to run your cluster network on the bonding interfaces, then you have to use active-passive mode (active-backup) on the bonding interfaces; other modes are unsupported.
The following bond configuration can be used as a distributed/shared storage network. The benefit is that you get more speed and the network will be fault-tolerant.
auto lo
iface lo inet loopback

iface eno1 inet manual
iface eno2 inet manual

auto bond0
iface bond0 inet static
        slaves eno1 eno2
        address 192.168.1.2
        netmask 255.255.255.0
        bond_miimon 100
        bond_mode 802.3ad
        bond_xmit_hash_policy layer2+3

auto vmbr0
iface vmbr0 inet static
        address 10.10.10.2
        netmask 255.255.255.0
        gateway 10.10.10.1
        bridge_ports eno1
        bridge_stp off
        bridge_fd 0
Another possibility is to use the bond directly as a bridge port. This can be used to make the guest network fault-tolerant.
auto lo
iface lo inet loopback

iface eno1 inet manual
iface eno2 inet manual

auto bond0
iface bond0 inet manual
        slaves eno1 eno2
        bond_miimon 100
        bond_mode 802.3ad
        bond_xmit_hash_policy layer2+3

auto vmbr0
iface vmbr0 inet static
        address 10.10.10.2
        netmask 255.255.255.0
        gateway 10.10.10.1
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0
VLAN 802.1Q
A virtual LAN (VLAN) is a broadcast domain that is partitioned and isolated in the network at layer two. So it is possible to have multiple networks (4096) in a physical network, each independent of the other ones.
Each VLAN network is identified by a number, often called a tag. Network packets are then tagged to identify which virtual network they belong to.
VLAN for Guest Networks
PVE supports this setup out of the box. You can specify the VLAN tag when you create a VM. The VLAN tag is part of the guest network configuration. The networking layer supports different modes to implement VLANs, depending on the bridge configuration:
-
VLAN awareness on the Linux bridge: In this case, each guest’s virtual network card is assigned to a VLAN tag, which is transparently supported by the Linux bridge. Trunk mode is also possible, but that requires configuration inside the guest (see the configuration sketch after this list).
-
"traditional" VLAN on the Linux bridge: In contrast to the VLAN awareness method, this method is not transparent and creates a VLAN device with associated bridge for each VLAN. That is, if e.g. in our default network, a guest VLAN 5 is used to create eno1.5 and vmbr0v5, which remains until rebooting.
-
Open vSwitch VLAN: This mode uses the OVS VLAN feature.
-
Guest configured VLAN: VLANs are assigned inside the guest. In this case, the setup is completely done inside the guest and can not be influenced from the outside. The benefit is that you can use more than one VLAN on a single virtual NIC.
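As an illustration of the first option, the default bridge from the earlier example could be made VLAN aware roughly as follows. This is only a sketch: the exact option name depends on the ifupdown version in use (bridge_vlan_aware with the traditional scripts, bridge-vlan-aware with ifupdown2).
auto vmbr0
iface vmbr0 inet static
        address 192.168.10.2
        netmask 255.255.255.0
        gateway 192.168.10.1
        bridge_ports eno1
        bridge_stp off
        bridge_fd 0
        bridge_vlan_aware yes
Guests attached to such a bridge only need the VLAN tag set in their virtual network device configuration; no extra per-VLAN bridges are required.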
VLAN on the Host
To allow host communication with an isolated network, it is possible to apply VLAN tags to any network device (NIC, bond, bridge). In general, you should configure the VLAN on the interface with the least abstraction layers between itself and the physical NIC.
The following examples show a default configuration where the host management address is placed on a separate VLAN (VLAN 5 in this case).
|
In the examples we use the VLAN at bridge level to ensure the correct function of VLAN 5 in the guest network, but in combination with a VLAN aware bridge this will not work for guest network VLAN 5. The downside of this setup is more CPU usage. |
auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno1.5 inet manual

auto vmbr0v5
iface vmbr0v5 inet static
        address 10.10.10.2
        netmask 255.255.255.0
        gateway 10.10.10.1
        bridge_ports eno1.5
        bridge_stp off
        bridge_fd 0

auto vmbr0
iface vmbr0 inet manual
        bridge_ports eno1
        bridge_stp off
        bridge_fd 0
The next example is the same setup but a bond is used to make this network fail-safe.
auto lo
iface lo inet loopback

iface eno1 inet manual
iface eno2 inet manual

auto bond0
iface bond0 inet manual
        slaves eno1 eno2
        bond_miimon 100
        bond_mode 802.3ad
        bond_xmit_hash_policy layer2+3

iface bond0.5 inet manual

auto vmbr0v5
iface vmbr0v5 inet static
        address 10.10.10.2
        netmask 255.255.255.0
        gateway 10.10.10.1
        bridge_ports bond0.5
        bridge_stp off
        bridge_fd 0

auto vmbr0
iface vmbr0 inet manual
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0
Time Synchronization
The PVE cluster stack itself relies heavily on the fact that all the nodes have precisely synchronized time. Some other components, like Ceph, also refuse to work properly if the local time on nodes is not in sync.
Time synchronization between nodes can be achieved with the “Network Time Protocol” (NTP). PVE uses systemd-timesyncd as NTP client by default, preconfigured to use a set of public servers. This setup works out of the box in most cases.
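To verify that a node is actually synchronizing its clock, you can query systemd, for example:
timedatectl status                    # shows whether NTP is enabled and the clock is synchronized
systemctl status systemd-timesyncd    # shows the state of the NTP client service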
Using Custom NTP Servers
In some cases, it might be desired not to use the default NTP servers. For example, if your PVE nodes do not have access to the public internet (e.g., because of restrictive firewall rules), you need to set up local NTP servers and tell systemd-timesyncd to use them in /etc/systemd/timesyncd.conf:
[Time]
Servers=ntp1.example.com ntp2.example.com ntp3.example.com ntp4.example.com
After restarting the synchronization service (systemctl restart systemd-timesyncd) you should verify that your newly configured NTP servers are used by checking the journal (journalctl --since -1h -u systemd-timesyncd):
...
Oct 07 14:58:36 node1 systemd[1]: Stopping Network Time Synchronization...
Oct 07 14:58:36 node1 systemd[1]: Starting Network Time Synchronization...
Oct 07 14:58:36 node1 systemd[1]: Started Network Time Synchronization.
Oct 07 14:58:36 node1 systemd-timesyncd[13514]: Using NTP server 10.0.0.1:123 (ntp1.example.com).
Oct 07 14:58:36 node1 systemd-timesyncd[13514]: interval/delta/delay/jitter/drift 64s/-0.002s/0.020s/0.000s/-31ppm
...
External Metric Server
Starting with PVE 4.0, you can define external metric servers, to which PVE will periodically send various statistics about your hosts, virtual machines and storages.
Currently supported are:
-
graphite (see http://graphiteapp.org )
-
influxdb (see https://www.influxdata.com/time-series-platform/influxdb/ )
The server definitions are saved in /etc/pve/status.cfg.
Graphite server configuration
The definition of a server is:
graphite:
        server your-server
        port your-port
        path your-path
where your-port defaults to 2003 and your-path defaults to proxmox.
PVE sends the data over UDP, so the graphite server has to be configured to accept UDP traffic.
Influxdb plugin configuration
The definition is:
influxdb:
        server your-server
        port your-port
PVE sends the data over UDP, so the influxdb server has to be configured to accept UDP traffic.
Here is an example configuration for influxdb (on your influxdb server):
[[udp]]
   enabled = true
   bind-address = "0.0.0.0:8089"
   database = "proxmox"
   batch-size = 1000
   batch-timeout = "1s"
With this configuration, your server listens on all IP addresses on port 8089, and writes the data to the proxmox database.
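Putting both plugins together, a complete /etc/pve/status.cfg could look like the following sketch; the host names are placeholders for your own metric servers:
graphite:
        server graphite.example.com
        port 2003
        path proxmox

influxdb:
        server influxdb.example.com
        port 8089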
Disk Health Monitoring
Although a robust and redundant storage is recommended, it can be very helpful to monitor the health of your local disks.
Starting with PVE 4.3, the package smartmontools (see the smartmontools homepage, https://www.smartmontools.org) is installed and required. This is a set of tools to monitor and control the S.M.A.R.T. system for local hard disks.
You can get the status of a disk by issuing the following command:
# smartctl -a /dev/sdX
where /dev/sdX is the path to one of your local disks.
If the output says:
SMART support is: Disabled
you can enable it with the command:
# smartctl -s on /dev/sdX
For more information on how to use smartctl, please see man smartctl.
By default, the smartmontools daemon smartd is active and enabled, and scans the disks under /dev/sdX and /dev/hdX every 30 minutes for errors and warnings, and sends an e-mail to root if it detects a problem.
For more information about how to configure smartd, please see man smartd and man smartd.conf.
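For instance, /etc/smartd.conf usually contains a single DEVICESCAN directive. A minimal sketch that monitors all detected devices and mails warnings to root could look like this (see man smartd.conf for the full syntax):
# monitor all devices, check all SMART attributes, mail warnings to root
DEVICESCAN -a -m root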
If you use your hard disks with a hardware raid controller, there are most likely tools to monitor the disks in the raid array and the array itself. For more information about this, please refer to the vendor of your raid controller.
Logical Volume Manager (LVM)
Most people install PVE directly on a local disk. The PVE installation CD offers several options for local disk management, and the current default setup uses LVM. The installer lets you select a single disk for such a setup, and uses that disk as physical volume for the Volume Group (VG) pve. The following output is from a test installation using a small 8GB disk:
# pvs
  PV         VG   Fmt  Attr PSize PFree
  /dev/sda3  pve  lvm2 a--  7.87g 876.00m

# vgs
  VG   #PV #LV #SN Attr   VSize VFree
  pve    1   3   0 wz--n- 7.87g 876.00m
The installer allocates three Logical Volumes (LV) inside this VG:
# lvs
  LV   VG   Attr       LSize   Pool Origin Data%  Meta%
  data pve  twi-a-tz--   4.38g             0.00   0.63
  root pve  -wi-ao----   1.75g
  swap pve  -wi-ao---- 896.00m
root
    Formatted as ext4, and contains the operating system.

swap
    Swap partition.

data
    This volume uses LVM-thin, and is used to store VM images. LVM-thin is preferable for this task, because it offers efficient support for snapshots and clones.
For PVE versions up to 4.1, the installer creates a standard logical volume called “data”, which is mounted at /var/lib/vz.
Starting from version 4.2, the logical volume “data” is an LVM-thin pool, used to store block based guest images, and /var/lib/vz is simply a directory on the root file system.
Hardware
We highly recommend using a hardware RAID controller (with BBU) for such setups. This increases performance, provides redundancy, and makes disk replacements easier (hot-pluggable).
LVM itself does not need any special hardware, and memory requirements are very low.
Bootloader
We install two boot loaders by default. The first partition contains the standard GRUB boot loader. The second partition is an EFI System Partition (ESP), which makes it possible to boot on EFI systems.
Creating a Volume Group
Let’s assume we have an empty disk /dev/sdb, onto which we want to create a volume group named “vmdata”.
|
Please note that the following commands will destroy all existing data on /dev/sdb. |
First create a partition.
# sgdisk -N 1 /dev/sdb
Create a Physical Volume (PV) without confirmation and with a metadata size of 250K.
# pvcreate --metadatasize 250k -y -ff /dev/sdb1
Create a volume group named “vmdata” on /dev/sdb1.
# vgcreate vmdata /dev/sdb1
Creating an extra LV for /var/lib/vz
This can be easily done by creating a new thin LV.
# lvcreate -n <Name> -V <Size[M,G,T]> <VG>/<LVThin_pool>
A real world example:
# lvcreate -n vz -V 10G pve/data
Now a filesystem must be created on the LV.
# mkfs.ext4 /dev/pve/vz
Finally, the new file system must be mounted.
|
Be sure that /var/lib/vz is empty. On a default installation it’s not. |
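A manual mount of the new volume would then look like this (assuming the vz LV created above):
# mount /dev/pve/vz /var/lib/vz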
To make it always accessible, add the following line to /etc/fstab.
# echo '/dev/pve/vz /var/lib/vz ext4 defaults 0 2' >> /etc/fstab
Resizing the thin pool
Resizing the LV and the metadata pool can be achieved with the following command.
# lvresize --size +<size[M,G,T]> --poolmetadatasize +<size[M,G]> <VG>/<LVThin_pool>
|
When extending the data pool, the metadata pool must also be extended. |
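For example, to grow the data pool of a default installation by 50G and its metadata pool by 16M (the values are only illustrative):
# lvresize --size +50G --poolmetadatasize +16M pve/data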
Create a LVM-thin pool
A thin pool has to be created on top of a volume group. See the LVM section above for how to create a volume group.
# lvcreate -L 80G -T -n vmstore vmdata
ZFS on Linux
ZFS is a combined file system and logical volume manager designed by Sun Microsystems. Starting with PVE 3.4, the native Linux kernel port of the ZFS file system is introduced as optional file system and also as an additional selection for the root file system. There is no need to manually compile ZFS modules - all packages are included.
By using ZFS, it is possible to achieve maximum enterprise features with low budget hardware, but also high performance systems by leveraging SSD caching or even SSD only setups. ZFS can replace costly hardware RAID cards with moderate CPU and memory load, combined with easy management.
-
Easy configuration and management with PVE GUI and CLI.
-
Reliable
-
Protection against data corruption
-
Data compression on file system level
-
Snapshots
-
Copy-on-write clone
-
Various raid levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2 and RAIDZ-3
-
Can use SSD for cache
-
Self healing
-
Continuous integrity checking
-
Designed for high storage capacities
-
Asynchronous replication over network
-
Open Source
-
Encryption
-
…
Hardware
ZFS depends heavily on memory, so you need at least 8GB to start. In practice, use as much as you can get for your hardware/budget. To prevent data corruption, we recommend the use of high quality ECC RAM.
If you use a dedicated cache and/or log disk, you should use an enterprise class SSD (e.g. Intel SSD DC S3700 Series). This can increase the overall performance significantly.
|
Do not use ZFS on top of a hardware RAID controller which has its own cache management. ZFS needs to communicate directly with the disks. An HBA adapter is the way to go, or something like an LSI controller flashed in “IT” mode. |
If you are experimenting with an installation of PVE inside a VM (Nested Virtualization), don’t use virtio for disks of that VM, since they are not supported by ZFS. Use IDE or SCSI instead (works also with virtio SCSI controller type).
Installation as Root File System
When you install using the PVE installer, you can choose ZFS for the root file system. You need to select the RAID type at installation time:
RAID0
|
Also called “striping”. The capacity of such a volume is the sum of the capacities of all disks. But RAID0 does not add any redundancy, so the failure of a single drive makes the volume unusable. |
RAID1
|
Also called “mirroring”. Data is written identically to all disks. This mode requires at least 2 disks with the same size. The resulting capacity is that of a single disk. |
RAID10
|
A combination of RAID0 and RAID1. Requires at least 4 disks. |
RAIDZ-1
|
A variation on RAID-5, single parity. Requires at least 3 disks. |
RAIDZ-2
|
A variation on RAID-5, double parity. Requires at least 4 disks. |
RAIDZ-3
|
A variation on RAID-5, triple parity. Requires at least 5 disks. |
The installer automatically partitions the disks, creates a ZFS pool called rpool, and installs the root file system on the ZFS subvolume rpool/ROOT/pve-1.
Another subvolume called rpool/data is created to store VM images. In order to use that with the PVE tools, the installer creates the following configuration entry in /etc/pve/storage.cfg:
zfspool: local-zfs
        pool rpool/data
        sparse
        content images,rootdir
After installation, you can view your ZFS pool status using the zpool command:
# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors
The zfs command is used to configure and manage your ZFS file systems. The following command lists all file systems after installation:
# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool             4.94G  7.68T    96K  /rpool
rpool/ROOT         702M  7.68T    96K  /rpool/ROOT
rpool/ROOT/pve-1   702M  7.68T   702M  /
rpool/data          96K  7.68T    96K  /rpool/data
rpool/swap        4.25G  7.69T    64K  -
Bootloader
The default ZFS disk partitioning scheme does not use the first 2048 sectors. This gives enough room to install a GRUB boot partition. The PVE installer automatically allocates that space, and installs the GRUB boot loader there. If you use a redundant RAID setup, it installs the boot loader on all disks required for booting. So you can boot even if some disks fail.
|
It is not possible to use ZFS as root file system with UEFI boot. |
ZFS Administration
This section gives you some usage examples for common tasks. ZFS itself is really powerful and provides many options. The main commands to manage ZFS are zfs and zpool. Both commands come with great manual pages, which can be read with:
# man zpool
# man zfs
To create a new pool, at least one disk is needed. The ashift value should match or exceed the sector size of the underlying disk, where the sector size is 2^ashift; for example, ashift=12 corresponds to 4K sectors (2^12 = 4096 bytes).
zpool create -f -o ashift=12 <pool> <device>
To activate compression
zfs set compression=lz4 <pool>
Create a new pool with RAID-0 (minimum 1 disk):

zpool create -f -o ashift=12 <pool> <device1> <device2>

Create a new pool with RAID-1 (minimum 2 disks):

zpool create -f -o ashift=12 <pool> mirror <device1> <device2>

Create a new pool with RAID-10 (minimum 4 disks):

zpool create -f -o ashift=12 <pool> mirror <device1> <device2> mirror <device3> <device4>

Create a new pool with RAIDZ-1 (minimum 3 disks):

zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>

Create a new pool with RAIDZ-2 (minimum 4 disks):

zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>
Create a new pool with cache (L2ARC)

It is possible to use a dedicated cache drive partition to increase the performance (use an SSD). As <device> it is possible to use more devices, as shown in "Create a new pool with RAID*".

zpool create -f -o ashift=12 <pool> <device> cache <cache_device>
Create a new pool with log (ZIL)

It is possible to use a dedicated log drive partition to increase the performance (use an SSD). As <device> it is possible to use more devices, as shown in "Create a new pool with RAID*".

zpool create -f -o ashift=12 <pool> <device> log <log_device>
Add cache and log to an existing pool

If you have a pool without cache and log, first partition the SSD into two partitions with parted or gdisk.
|
Always use GPT partition tables. |
The maximum size of a log device should be about half the size of physical memory, so this is usually quite small. The rest of the SSD can be used as cache.
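For example, on a machine with 8GB of RAM you could split a spare SSD (here assumed to be /dev/sdf; adapt the device name and sizes to your system) into a 4GB log partition and use the remainder as cache:
sgdisk -n 1:0:+4G /dev/sdf    # first partition for the log (about half the RAM size)
sgdisk -n 2:0:0 /dev/sdf      # second partition uses the rest of the disk for the cache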
zpool add -f <pool> log <device-part1> cache <device-part2>
Changing a failed device

zpool replace -f <pool> <old-device> <new-device>
Activate E-Mail Notification
ZFS comes with an event daemon, which monitors events generated by the ZFS kernel module. The daemon can also send emails on ZFS events like pool errors. Newer ZFS packages ship the daemon in a separate package, which you can install with apt-get:
# apt-get install zfs-zed
To activate the daemon it is necessary to edit /etc/zfs/zed.d/zed.rc with your favourite editor, and uncomment the ZED_EMAIL_ADDR setting:
ZED_EMAIL_ADDR="root"
Please note that PVE forwards mails sent to root to the email address configured for the root user.
|
The only setting that is required is ZED_EMAIL_ADDR. All other settings are optional. |
Limit ZFS Memory Usage
It is good to use at most 50 percent (which is the default) of the system memory for the ZFS ARC, to prevent performance degradation of the host. Use your preferred editor to change the configuration in /etc/modprobe.d/zfs.conf and insert:
options zfs zfs_arc_max=8589934592
This example setting limits the usage to 8GB (8 × 2^30 = 8589934592 bytes).
|
If your root file system is ZFS you must update your initramfs every time this value changes: update-initramfs -u |
SWAP on ZFS

Swap on ZFS on Linux may cause trouble, like blocking the server or generating a high IO load, often seen when starting a backup to an external storage.
We strongly recommend to use enough memory, so that you normally do not run into low memory situations. Additionally, you can lower the “swappiness” value. A good value for servers is 10:
sysctl -w vm.swappiness=10
To make the swappiness persistent, open /etc/sysctl.conf with an editor of your choice and add the following line:
vm.swappiness = 10
Value               | Strategy
--------------------|-----------------------------------------------------------
vm.swappiness = 0   | The kernel will swap only to avoid an out of memory condition.
vm.swappiness = 1   | Minimum amount of swapping without disabling it entirely.
vm.swappiness = 10  | This value is sometimes recommended to improve performance when sufficient memory exists in a system.
vm.swappiness = 60  | The default value.
vm.swappiness = 100 | The kernel will swap aggressively.
Certificate Management
Certificates for communication within the cluster
Each PVE cluster creates its own internal Certificate Authority (CA) and generates a self-signed certificate for each node. These certificates are used for encrypted communication with the cluster’s pveproxy service and the Shell/Console feature if SPICE is used.
The CA certificate and key are stored in the pmxcfs (see the pmxcfs(8) manpage).
Certificates for API and web GUI
The REST API and web GUI are provided by the pveproxy service, which runs on each node.
You have the following options for the certificate used by pveproxy:
-
By default the node-specific certificate in /etc/pve/nodes/NODENAME/pve-ssl.pem is used. This certificate is signed by the cluster CA and therefore not trusted by browsers and operating systems by default.
-
use an externally provided certificate (e.g. signed by a commercial CA).
-
use ACME (e.g., Let’s Encrypt) to get a trusted certificate with automatic renewal.
For options 2 and 3 the file /etc/pve/local/pveproxy-ssl.pem (and /etc/pve/local/pveproxy-ssl.key, which needs to be without password) is used.
Certificates are managed with the PVE Node management command (see the pvenode(1) manpage).
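For option 2, after copying the externally provided files to the node, you can upload them with pvenode; a sketch (the file names are examples):
pvenode cert set certificate.crt certificate.key --force --restart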
|
Do not replace or manually modify the automatically generated node certificate files in /etc/pve/local/pve-ssl.pem and /etc/pve/local/pve-ssl.key or the cluster CA files in /etc/pve/pve-root-ca.pem and /etc/pve/priv/pve-root-ca.key. |
Getting trusted certificates via ACME
PVE includes an implementation of the Automatic Certificate Management Environment (ACME) protocol, allowing PVE admins to interface with Let’s Encrypt for easy setup of trusted TLS certificates which are accepted out of the box on most modern operating systems and browsers.
Currently the two ACME endpoints implemented are Let’s Encrypt (LE) and its staging environment (see https://letsencrypt.org), both using the standalone HTTP challenge.
Because of rate-limits you should use LE staging for experiments.
There are a few prerequisites to use Let’s Encrypt:
-
Port 80 of the node needs to be reachable from the internet.
-
There must be no other listener on port 80.
-
The requested (sub)domain needs to resolve to a public IP of the Node.
-
You have to accept the ToS of Let’s Encrypt.
At the moment the GUI uses only the default ACME account.
root@proxmox:~# pvenode acme account register default mail@example.invalid
Directory endpoints:
0) Let's Encrypt V2 (https://acme-v02.api.letsencrypt.org/directory)
1) Let's Encrypt V2 Staging (https://acme-staging-v02.api.letsencrypt.org/directory)
2) Custom
Enter selection: 1

Attempting to fetch Terms of Service from 'https://acme-staging-v02.api.letsencrypt.org/directory'..
Terms of Service: https://letsencrypt.org/documents/LE-SA-v1.2-November-15-2017.pdf
Do you agree to the above terms? [y|N]y

Attempting to register account with 'https://acme-staging-v02.api.letsencrypt.org/directory'..
Generating ACME account key..
Registering ACME account..
Registration successful, account URL: 'https://acme-staging-v02.api.letsencrypt.org/acme/acct/xxxxxxx'
Task OK

root@proxmox:~# pvenode acme account list
default

root@proxmox:~# pvenode config set --acme domains=example.invalid

root@proxmox:~# pvenode acme cert order
Loading ACME account details
Placing ACME order
Order URL: https://acme-staging-v02.api.letsencrypt.org/acme/order/xxxxxxxxxxxxxx

Getting authorization details from 'https://acme-staging-v02.api.letsencrypt.org/acme/authz/xxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxxxx-xxxxxxx'
... pending!
Setting up webserver
Triggering validation
Sleeping for 5 seconds
Status is 'valid'!

All domains validated!

Creating CSR
Finalizing order
Checking order status
valid!

Downloading certificate
Setting pveproxy certificate and key
Restarting pveproxy
Task OK
Automatic renewal of ACME certificates
If a node has been successfully configured with an ACME-provided certificate (either via pvenode or via the GUI), the certificate will be automatically renewed by the pve-daily-update.service. Currently, renewal will be attempted if the certificate has expired or will expire in the next 30 days.
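A renewal can also be triggered manually with pvenode, for example to test the setup; the --force flag requests a renewal even if the certificate is not yet about to expire:
pvenode acme cert renew --force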