Hello SlapOS users,
We have been working on the resilient stack for some time now, and we have just reached our goals: a working resilient stack, with monitoring support and with “erp5-in-webrunner” support.
Let me explain all of this.
1. Monitoring
A “working” resiliency stack was released last year, but if anything went wrong in the backup or clone instances, no one knew there was a problem unless they looked into the instances themselves, which created a total lack of confidence in the system.
Fortunately, there have been tremendous improvements on the monitoring side, both in SlapOS Master and in the monitoring stack, greatly improving the experience of deploying and maintaining a resilient instance.
The last piece of this puzzle is knowing the state of the different clone instances. This is now possible thanks to a set of promises that check whether a backup has been successfully done in the last 24 hours: you can now know it from the SlapOS Master (the instance turns red) and/or from the RSS feed of the monitoring stack of the clone instances.
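To make the idea concrete, here is a minimal sketch of such a freshness promise, assuming backups land in ~/srv/backup (the path and the script shape are illustrative; the real SlapOS promise scripts differ in detail, but the core check is the same: "was a backup completed in the last 24 hours?"):

```python
import os
import time

BACKUP_DIR = os.path.expanduser('~/srv/backup')  # assumed backup location
MAX_AGE = 24 * 3600  # a backup older than 24 hours means the promise fails

def latest_backup_age(directory):
    """Return the age in seconds of the newest file, or None if no file exists."""
    newest = 0
    for root, _, files in os.walk(directory):
        for name in files:
            newest = max(newest, os.path.getmtime(os.path.join(root, name)))
    if newest == 0:
        return None  # no backup found at all
    return time.time() - newest

# Usage inside a promise script (illustrative): a promise reports failure
# through its exit code, and SlapOS Master then shows the instance as red.
#   age = latest_backup_age(BACKUP_DIR)
#   sys.exit(0 if age is not None and age <= MAX_AGE else 1)
```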
2. Resiliency of resiliency
The resiliency stack was known to have many bugs in various edge cases. This has been improved, and there should now be far fewer problems than before. As always, the VIFIB team is here to help in case of problems!
Moreover, data corruption is now checked for every file backed up, and the scripts will raise an error if a problem is detected, causing the promises to report the problem if it persists for more than 24 hours.
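The per-file corruption check boils down to comparing a digest computed at export time against one computed at import time. A hedged sketch of that idea (the actual resiliency scripts work differently in detail; the function names here are illustrative):

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def verify_backup(path, expected_digest):
    """Raise if a backed-up file does not match its recorded digest."""
    if file_digest(path) != expected_digest:
        raise ValueError('corrupted backup: %s' % path)
```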
3. ERP5 and resiliency
The standard deployment of ERP5 today is done through a Web Runner. Even though the Web Runner itself, thanks to 1. and 2., is resilient, what runs inside it was not, because of the lack of communication between the Web Runner and its contents.
So we modified the Web Runner and ERP5 so that it is possible to define a blacklist of files NOT to back up (for example, copying kumo data may not be helpful and can be really large), and to trigger a set of export and import scripts in the ERP5 instances, called by the main and clone instances of the Web Runner.
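To illustrate how such a blacklist can be applied, here is a sketch that turns exclusion patterns into rsync --exclude options; the function name and paths are hypothetical, not the actual Web Runner code:

```python
def backup_command(source, destination, excluded=()):
    """Build an rsync command line that skips the excluded patterns."""
    cmd = ['rsync', '-a', '--delete']
    for pattern in excluded:
        cmd.append('--exclude=%s' % pattern)
    cmd += [source + '/', destination + '/']
    return cmd

# Example: skip the kumofs data directory during backup, e.g. via
# subprocess.check_call(backup_command('~/srv/runner', '~/srv/backup',
#                                      excluded=['srv/kumofs/**'])).
```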
One thing to know: in ERP5 instances containing data (tidstorage, zeo, mariadb), this creates a ~/srv/runner-import-restore script that erases the content of the instance, imports it from ~/srv/backup and checks whether the backup is consistent (this part may be extended later to check more and more things).
These scripts are called by the CLONE instance of the Web Runner to correctly import the data. They are NEVER run on the main instance of the Web Runner and will RAISE if called manually while zeo/mariadb/etc. processes are running.
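A hedged sketch of what such an import-restore step does, with illustrative names (the generated script is assembled by the Software Release itself): refuse to run while database processes are alive, erase the instance data, then restore it from the backup directory.

```python
import shutil
import subprocess

def processes_running(names):
    """Return True if any of the given process names is currently running."""
    output = subprocess.check_output(['ps', '-eo', 'comm']).decode()
    running = set(output.split())
    return any(name in running for name in names)

def restore(data_dir, backup_dir, service_names=('mysqld', 'runzeo')):
    """Wipe data_dir and repopulate it from backup_dir.

    The guard matters: restoring over live zeo/mariadb services would
    corrupt the data, hence the RAISE described above.
    """
    if processes_running(service_names):
        raise RuntimeError('refusing to restore while services are running')
    shutil.rmtree(data_dir, ignore_errors=True)
    shutil.copytree(backup_dir, data_dir)
```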
Please note that the (non-working) mariadb backup system of ERP5 in the erp5 branch of slapos.git has been changed to use the one from the erp5-cluster branch.
4. Known issues
The promises mentioned in 1. raise an error if no backup has been done. When you successfully deploy a resilient instance, all backups run by default every hour, but this can be customized (many projects may want once per day, depending on the load of the machine and the amount of data to sync). For the user this is no problem, but for the VIFIB team it means that many instances can be red because of “no backup yet” during the first ~12 hours, until the next midnight.
5. How to use
For the user (a.k.a. the ERP5 developer), the changes are transparent: all you have to do is upgrade to the latest version of the Web Runner (currently slapos-0.258) and upgrade your ERP5 Software Release to its latest version (using whichever of the two flavours you prefer: the erp5 or erp5-cluster branch of slapos.git). The order doesn’t matter and you can upgrade one without upgrading the other (but you won’t have resiliency). No special parameter is required. Bonus: no erp5.git upgrade is required.