online gluster and ovirt expansion
Core problem:
The existing ovirt cluster has only two hosts that can run VMs, and a single host cannot run all of the desired VMs because of inadequate memory (and possibly other resource constraints). As a result, we do not have the intended redundancy to keep all VMs running if there is a server hardware failure. (In late 2024, the desired VMs were dean, a dashboard1 replacement, and ovirt-engine. The list has since expanded to include additional upgraded online database servers in anticipation of STAR end-of-life and the need to migrate services to SDCC, possibly as soon as early 2026.)
Basic structure: Three servers provide the core services for the ovirt cluster: ovirt1, ovirt2, and ovirt4 (not a typo - there was no ovirt3 when this got underway). Ovirt1 and ovirt2 act as hosts for the VMs and are part of the gluster pool, while ovirt4 only provides gluster bricks. Each of ovirt[124] has three gluster bricks, for a total of nine, of which three are arbiters (see the gluster volume info below). This configuration is designated "replica 3 arbiter 1", though in practice there are only two full copies of the data, since an arbiter brick holds only metadata.
Attempting to expand the ovirt cluster to include additional hosts has hit some snags, one of them being expansion of the gluster pool. Apparently it is not possible to expand directly from the current 3 x (2 + 1) layout: having arbiters prevents adding additional full copies of the data.
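To be concrete about what kind of expansion remains possible, here is a rough sketch (not yet run; ovirt5/ovirt6/ovirt7 are hypothetical new hosts, the brick paths just mirror the existing layout, and the syntax should be checked against the installed gluster version). Adding capacity, i.e. new distribute subvolumes, only requires adding bricks in multiples of the replica-set size, which is three here (two data bricks plus an arbiter; if I read the docs correctly, every third brick in the added set becomes the arbiter for its subvolume):
[root@ovirt1 ~]# gluster volume add-brick ovirt \
    ovirt5:/data/brick1/gv0 ovirt6:/data/brick1/gv0 ovirt7:/data/brick1/gv0
What the arbiters block is raising the number of full data copies: as far as I can tell, gluster will not change the replica count of an arbiter volume with a plain add-brick, so getting to three real copies means converting the arbiters first (see the conversion question below).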
Options / questions:
- Was there a reason to build the gluster volume with arbiters instead of pure replicas? I see no benefit to using arbiters in our setup (or really in any setup - only a potential for a small cost savings as far as I can tell, with greater risk and possibly worse performance).
--- Should we convert the arbiters to full replicas? See https://lists.gluster.org/pipermail/gluster-users/2018-July/034385.html (a command sketch is below, after this list).
- Do we have enough space in the glusterfs? Is there value in expanding the space if that's possible? (Usage checks are sketched below.)
- Are we keeping adequate snapshots (of dean, at this point) and storing copies outside of the ovirt cluster for emergency recovery? (A way to list the existing snapshots is sketched below.)
- Can ovirt4 serve as a VM host? Possibly not, because of a hardware (processor) incompatibility with ovirt1 and ovirt2, but this is not confirmed as I write this on April 11, 2025. (A way to compare the CPUs is sketched below.)
- Build a whole new ovirt cluster? Or some other virtual machine infrastructure entirely?
- The gluster data distribution appears to be rather unbalanced: of the three distribute subvolumes ("spans"), one is 14% full while the other two are only 5% full. It might be worthwhile to attempt a rebalance (sketched below).
- On ovirt1, in /data/brick1, there is what appears to be an extraneous directory, "gv1". Can this be removed safely? (A quick check is sketched below.)
- Why is the akanugant user not showing up in the user list in the ovirt-image web interface? An "akanugant" user account was created with ovirt-aaa-jdbc-tool, yet no such user appears in the web interface.
--- Update April 28, 2025: This appears to have been a misunderstanding of the web interface on my part. The procedure is to create the user at the command line with ovirt-aaa-jdbc-tool, then go to the web interface, use the search to find the newly created user, select it from the results, and hit "Add". (A command sketch is below.)
- Are messages such as the following a concern?
--- Apr 20 04:37:47 ovirt-engine platform-python[1146]: ::ffff:130.199.162.134 - - [20/Apr/2025 00:32:42] code 502, message Bad Gateway
--- Apr 20 04:37:47 ovirt-engine platform-python[1146]: ::ffff:130.199.162.134 - - [20/Apr/2025 00:32:42] "POST /v2.0/tokens HTTP/1.1" 502 -
- What about the daily error events like this:
--- Failed to check for available updates on host ovirt1.star.bnl.gov with message 'Failed to run check-update of host 'ovirt1.star.bnl.gov'. Error: null'.
----- Is this an ssh failure? (A connectivity check is sketched below.)
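Command sketches for the items above follow; none of these have been run here. First, the arbiter-to-full-replica conversion, following the approach in the linked gluster-users thread. The removed bricks are the three arbiters from the volume info below; HOSTA/HOSTB/HOSTC and the /new/... paths are placeholders for wherever the full-sized replacement bricks would live (they must be empty directories, one per subvolume, added in subvolume order). The volume should show zero entries in heal info before starting:
[root@ovirt1 ~]# gluster volume heal ovirt info
[root@ovirt1 ~]# gluster volume remove-brick ovirt replica 2 \
    ovirt4:/data/brick1/gv0 ovirt2:/data/brick2/gv0 ovirt1:/data/brick3/gv0 force
[root@ovirt1 ~]# gluster volume add-brick ovirt replica 3 \
    HOSTA:/new/brick1/gv0 HOSTB:/new/brick2/gv0 HOSTC:/new/brick3/gv0
[root@ovirt1 ~]# gluster volume heal ovirt info
The remove-brick step leaves a plain replica 2 volume, the add-brick raises it to replica 3 with real data bricks, and the final heal info is polled until the self-heal onto the new bricks finishes.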
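For the space question, two read-only checks (run the df on each of ovirt[124]):
[root@ovirt1 ~]# gluster volume status ovirt detail
[root@ovirt1 ~]# df -h /data/brick1 /data/brick2 /data/brick3
The detail output reports free and total disk per brick, and the df shows the underlying filesystems, so together they show both the overall headroom and how evenly the subvolumes are filled.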
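To see what snapshots exist for dean, the engine REST API can be queried. This is only a sketch: the engine URL and the admin@internal account are assumptions, and PASSWORD / VM_ID are placeholders. Note that oVirt snapshots live in the same gluster storage domain as the VMs, so for emergency recovery the thing to store outside the cluster is an export (e.g. an OVA) or a guest-level backup, not just snapshots:
[root@ovirt-engine ~]# curl -s -k -u 'admin@internal:PASSWORD' \
    'https://ovirt-engine.star.bnl.gov/ovirt-engine/api/vms?search=name%3Ddean'
[root@ovirt-engine ~]# curl -s -k -u 'admin@internal:PASSWORD' \
    'https://ovirt-engine.star.bnl.gov/ovirt-engine/api/vms/VM_ID/snapshots'
The first call returns the VM id to plug into the second.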
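For the ovirt4 processor question, comparing what each machine reports should settle it. lscpu works on all three machines; the vdsm-client call only works where vdsm is installed (ovirt1/ovirt2), and the grep keys are the usual vdsm capability names, which may differ by version. The cluster's required CPU type is shown in the engine UI under Compute > Clusters:
[root@ovirt1 ~]# lscpu | grep 'Model name'
[root@ovirt1 ~]# vdsm-client Host getCapabilities | grep -E '"cpuModel"|"cpuFlags"'
If ovirt4's CPU lacks flags required by the cluster CPU type, it cannot join as a VM host without lowering the cluster CPU type (or putting it in a separate cluster).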
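Rebalance sketch for the imbalance noted above. It should only be attempted when the volume is fully healed, and since features.shard is on for this volume it is worth checking the release notes of the installed gluster version for any shard-plus-rebalance caveats first:
[root@ovirt1 ~]# gluster volume heal ovirt info
[root@ovirt1 ~]# gluster volume rebalance ovirt start
[root@ovirt1 ~]# gluster volume rebalance ovirt status
The status command is polled until every node reports completed.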
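A quick, read-only sanity check before touching the stray gv1 directory: confirm that no volume references it and see whether it ever was a brick (a brick root carries a trusted.glusterfs.volume-id extended attribute):
[root@ovirt1 ~]# gluster volume list
[root@ovirt1 ~]# gluster volume info | grep -i gv1
[root@ovirt1 ~]# getfattr -d -m . -e hex /data/brick1/gv1 | grep -i volume-id
[root@ovirt1 ~]# du -sh /data/brick1/gv1
If no volume mentions gv1, it carries no volume-id attribute, and it is empty or tiny, removing it is probably safe, but that is a judgment call.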
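For the record, the command-line half of that user-creation procedure looks roughly like this on the engine host (the attribute values and the expiration date are placeholders):
[root@ovirt-engine ~]# ovirt-aaa-jdbc-tool user add akanugant \
    --attribute=firstName=First --attribute=lastName=Last
[root@ovirt-engine ~]# ovirt-aaa-jdbc-tool user password-reset akanugant \
    --password-valid-to="2026-01-01 12:00:00-0500"
[root@ovirt-engine ~]# ovirt-aaa-jdbc-tool user show akanugant
After that, the user only appears in the web interface via the search-and-Add step described above (searching in the internal domain, usually named internal-authz).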
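To test the ssh theory for the daily check-update failures: the engine runs those checks over ssh (via ansible) using its own deployment key, so trying that path directly from the engine host is the most direct test. The key and log locations below are the usual defaults and may differ on this installation:
[root@ovirt-engine ~]# ssh -i /etc/pki/ovirt-engine/keys/engine_id_rsa root@ovirt1.star.bnl.gov 'echo ok'
[root@ovirt-engine ~]# grep -i 'check-update' /var/log/ovirt-engine/engine.log | tail
[root@ovirt-engine ~]# ls -lrt /var/log/ovirt-engine/host-deploy/ | tail
If the ssh step succeeds, the engine.log entries around the 'Error: null' events (and the most recent files under host-deploy/) should narrow down whether the ansible run itself is failing.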
[root@ovirt1 gv0]# gluster volume info
Volume Name: ovirt
Type: Distributed-Replicate
Volume ID: 1356094c-8397-46a9-bf7c-72564eee4318
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: ovirt1:/data/brick1/gv0
Brick2: ovirt2:/data/brick1/gv0
Brick3: ovirt4:/data/brick1/gv0 (arbiter)
Brick4: ovirt4:/data/brick2/gv0
Brick5: ovirt1:/data/brick2/gv0
Brick6: ovirt2:/data/brick2/gv0 (arbiter)
Brick7: ovirt2:/data/brick3/gv0
Brick8: ovirt4:/data/brick3/gv0
Brick9: ovirt1:/data/brick3/gv0 (arbiter)
Options Reconfigured:
cluster.lookup-optimize: off
server.keepalive-count: 5
server.keepalive-interval: 2
server.keepalive-time: 10
server.tcp-user-timeout: 20
server.event-threads: 4
client.event-threads: 4
cluster.choose-local: off
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.eager-lock: enable
performance.strict-o-direct: on
network.remote-dio: disable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.client-io-threads: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
cluster.granular-entry-heal: on
cluster.quorum-type: auto
network.ping-timeout: 20
auth.allow: *
storage.owner-uid: 36
storage.owner-gid: 36