VIFIB DECENTRALIZED CLOUD COMPUTING

SlapOS is a decentralized Cloud Computing technology that can automate the deployment and configuration of applications in a heterogeneous environment.

The goal of this document is to describe how to write promises and how partitions and servers are monitored in SlapOS.

Introducing SlapOS Architecture

SlapOS is a distributed, open source, Edge Cloud system. It is based on a Master and Slave design. The Master assigns services to Slave Nodes; the Nodes process the list of services using buildout and send connection information as well as their monitoring status back to the Master. The monitoring status of services is based on promises.

Monitoring Crimes

There are four monitoring crimes that every developer should keep in mind:

  • Buildout runs all the time without ever going to sleep
  • Running all promises every minute
  • Ever failing promises
  • Buildout taking too long to process a computer partition

Buildout runs all the time without ever going to sleep

If buildout runs all the time, it consumes too many resources and can overload the server. Instead, one should take care that all promises of the Software Release can be fulfilled.

Running all promises every minute

It is not required to run all promises in monitor every minute. Instead, the frequency should be configurable and set for each promise.

Ever failing promises

If a promise never reaches a passing state, the SR (Software Release) is badly implemented and should be reviewed.

Buildout taking too long to process a computer partition

Buildout should process a computer partition quickly, otherwise it prevents responsive provisioning of other partitions. The time to process a computer partition should be less than one minute (< 1 min).

Monitoring Goals

The goal of monitoring is to provide a good quality of service by knowing about problems before customers report them.

This is done by making sure that:

  • Servers are alive
  • Partitions are fulfilling all promises

Alive servers

Servers should contact the master periodically to notify it that they are alive. The master shows the state of each server according to a colour. A server is Green if it contacted the master within the last 5 minutes. If it contacted the master within the last hour, the server is Orange, else it is Red. From a monitoring point of view, the server contacts the master whenever Slapgrid connects to the SlapOS master, no matter what for.
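The colour rule above can be sketched as a small function. This is only an illustration: the function name, the constants and the choice to treat the 5 minute and 1 hour boundaries as inclusive are assumptions, not part of the SlapOS master code.

```python
from datetime import datetime, timedelta

# Thresholds taken from the text; the names are hypothetical.
GREEN_WINDOW = timedelta(minutes=5)
ORANGE_WINDOW = timedelta(hours=1)


def server_colour(last_contact, now):
    """Return 'Green', 'Orange' or 'Red' from the time of the last contact."""
    elapsed = now - last_contact
    if elapsed <= GREEN_WINDOW:
        return "Green"
    if elapsed <= ORANGE_WINDOW:
        return "Orange"
    return "Red"


now = datetime(2024, 1, 1, 12, 0, 0)
print(server_colour(now - timedelta(minutes=2), now))
print(server_colour(now - timedelta(minutes=30), now))
print(server_colour(now - timedelta(hours=3), now))
```

The three calls correspond to a server seen 2 minutes, 30 minutes and 3 hours ago.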

Fulfilled promises

The master shows the state of each requested partition according to a colour. A partition is Green if all of the following hold:

  • the latest result sent by Slapgrid for that partition is OK (meaning that all promises succeeded and there were no other failures)
  • that message was sent less than one day ago
  • that message was sent more recently than the buildout run frequency defined by the software release
  • no bang was triggered after that message

Otherwise the partition is Red.
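As a rough sketch, the partition rule can be written as a single function. All names and parameters here are hypothetical; the one day default stands for the computer configurable frequency, and the software release frequency is treated as optional because it is seldom configured.

```python
from datetime import datetime, timedelta


def partition_colour(last_result_ok, last_result_time, now,
                     computer_period=timedelta(days=1),
                     software_release_period=None,
                     last_bang_time=None):
    """Return 'Green' only if every condition from the text holds."""
    if not last_result_ok:
        return "Red"
    age = now - last_result_time
    # The result must be more recent than the computer frequency ...
    if age >= computer_period:
        return "Red"
    # ... and than the software release frequency, when one is defined.
    if software_release_period is not None and age >= software_release_period:
        return "Red"
    # A bang issued after the last result invalidates it.
    if last_bang_time is not None and last_bang_time > last_result_time:
        return "Red"
    return "Green"


now = datetime(2024, 1, 2, 12, 0, 0)
two_hours_ago = now - timedelta(hours=2)
print(partition_colour(True, two_hours_ago, now))
print(partition_colour(True, two_hours_ago, now,
                       last_bang_time=now - timedelta(hours=1)))
```

The second call returns Red because a bang was issued after the last OK result.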

Note 1: buildout on a partition in SlapOS will be executed at least once per computer configurable frequency (usually one day) and at least once per software release configurable frequency (seldom configured).

Note 2: the computer configurable frequency of buildout run must be stored on the Computer in SlapOS master at registration time and updated, else it is impossible to check promise fulfillment.

How Slapgrid checks partition status

In normal conditions:

  • Instantiation runs periodically (at least once in an interval of computer configurable frequency which is usually 24 hours), running promises and posting to master, hence showing signs of life.
  • Slapgrid periodically runs a set of promise sensors; when an anomaly is detected in a promise sensor value, bang is called on the partition.
  • Upon call of bang, a run of partition instantiation is scheduled by SlapOS Master on all partitions that belong to the same software instance tree.

Running buildout on all partitions after a bang is supposed to converge to a stable state with all promises passing. 

Slapgrid is configured to run promises at some interval of time which can be configured differently for each promise sensor. SlapOS knows nothing about the results of running promise sensors. The only thing the master knows is that a bang was issued due to anomaly detection.

We want to promote a simple, easy and standardised way of writing promise scripts that verify the state of the system. These scripts can be launched by cron and are configurable for each Software Release. A promise has three parts:

  • promise sensor
  • promise test
  • promise anomaly detector

The promise sensor collects the value of some monitoring aspects such as "if server is supposed to be started, get the response of an http request, else return 'server stopped' and in case of timeout return empty string". 

The promise test is Green if the result of the promise sensor of the previous example is not empty, else Red. This ensures that a server that is started actually responds to http requests. There is no margin of tolerance for promise tests.

The promise anomaly detector is Green if one of the three last promise sensor values was not empty, else it is Red. This ensures that we call bang only if the server is really stopped, not if an Internet glitch happened.
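The three parts of the HTTP example can be sketched as plain functions, independently of any SlapOS class. The function names and the fetch callable are hypothetical; fetch stands for whatever performs the HTTP request and is assumed to raise TimeoutError on timeout.

```python
def promise_sensor(server_should_run, fetch):
    """Collect one sensor value, following the example in the text."""
    if not server_should_run:
        return "server stopped"
    try:
        return fetch()
    except TimeoutError:
        # A timeout is recorded as an empty string.
        return ""


def promise_test(sensor_value):
    """No tolerance: a single empty value is Red."""
    return "Green" if sensor_value != "" else "Red"


def promise_anomaly_detector(sensor_values):
    """Green if at least one of the three last values is not empty."""
    last_three = sensor_values[-3:]
    return "Green" if any(value != "" for value in last_three) else "Red"


history = ["<html>ok</html>", "", ""]
print(promise_test(history[-1]))
print(promise_anomaly_detector(history))
```

With this history the test is Red while the anomaly detector is still Green, so no bang would be called yet. A third consecutive empty value would turn the detector Red.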

Note: promises are what buildout launches at the end. They return True or False. True means that one aspect of the partition is OK. Cron does not launch promises, but anomaly detectors. Very often, anomaly detector and promises are the same executable with the same result, but not always. Therefore, the two concepts are different. What they have in common is that they often sense the same thing. But detecting an anomaly is not the same as detecting that a promise is initially met.

Watchdog

Watchdog is a simple SlapOS Node feature that watches any process managed by supervisord. All process scripts in the PARTITION_DIRECTORY/etc/service directory are watched. They are automatically configured in supervisord with an on-watch suffix added to their process_name. Whenever one of them exits, watchdog triggers an alert (bang) that is sent to the master. Bang forces slapgrid to reprocess all instances of the service. This also forces a recheck of all promises and posts the result to the master, letting the master decide whether the partition state is Green or Red.

A partition should be able to call bang as many times as needed in a day; the limitation that exists today should be removed, otherwise it is not possible to adapt dynamically. A Master protection against recurring bang calls should be considered, using a kind of daily quota that might depend on price or be defined in the software release. If the bang quota of the day is reached, the master will reject all further calls until the next day.
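The daily quota idea could look like the following sketch. It is purely illustrative: the class, its method and the in-memory counter are assumptions, and nothing here reflects an existing SlapOS master implementation.

```python
from collections import defaultdict
from datetime import date


class BangQuota:
    """Hypothetical per-partition, per-day limit on accepted bang calls."""

    def __init__(self, max_bangs_per_day):
        self.max_bangs_per_day = max_bangs_per_day
        # (partition_id, day) -> number of bangs accepted that day
        self._counts = defaultdict(int)

    def accept(self, partition_id, today):
        """Return True and count the bang, or False once the quota is reached."""
        key = (partition_id, today)
        if self._counts[key] >= self.max_bangs_per_day:
            return False
        self._counts[key] += 1
        return True


quota = BangQuota(max_bangs_per_day=2)
day = date(2024, 1, 1)
print([quota.accept("slappart0", day) for _ in range(3)])
print(quota.accept("slappart0", date(2024, 1, 2)))
```

The third call on the same day is rejected, and the counter starts again on the next day because the day is part of the key.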

How to write a monitoring Python promise

The script below is an example of a promise in Python. Writing a promise consists of defining a class called RunPromise which inherits from the GenericPromise class, and defining the methods anomaly(), sense() and test(). Python promises should be placed in the folder etc/plugin of the computer partition.

cat << EOF > etc/plugin/check-my-site.py

from zope.interface import implementer
from slapos.grid.promise import interface
from slapos.grid.promise.generic import GenericPromise, TestResult, AnomalyResult

# The class decorator form works on both Python 2 and Python 3
@implementer(interface.IPromise)
class RunPromise(GenericPromise):

  def __init__(self, config):
    GenericPromise.__init__(self, config)
    # run the promise every 2 minutes
    self.setPeriodicity(minute=2)

  def anomaly(self):
    """
      Called to detect if there is an anomaly.
      Return AnomalyResult or TestResult object
      # When AnomalyResult has failure bang is called if another promise didn't bang
    """

    # Example
    promise_result_list = self.getLastPromiseResultList(result_count=3, only_failure=True)
    if len(promise_result_list) > 2:
      return AnomalyResult(problem=True, message=promise_result_list[0][0]['message'])
    return AnomalyResult(problem=False, message="")

    # It's possible to use Generic helper methods
    # return self._anomaly(result_count=3, failure_amount=3)

  def sense(self):
    """
      Run the promise code and store the result
        raise error, log error message, ... for failure
    """

    # DO SOMETHING...
    failed = True
    raised = False
    if failed:
      self.logger.error("ERROR while checking instance http server")
    else:
      self.logger.info("http server is OK")
    if raised:
      raise ValueError("Server URL is not correct")

  def test(self):
    """
      Test promise and say if problem is detected or not
      Return TestResult object
    """

    # Example
    promise_result_list = self.getLastPromiseResultList(result_count=1)[0]
    problem = False
    message = ""
    for result in promise_result_list:
      if result['status'] == 'ERROR' and not problem:
        problem = True
      message += "\n%s" % result['message']

    return TestResult(problem=problem, message=message)

    # It's possible to use Generic helper methods
    # return self._test(result_count=1, failure_amount=1)

EOF

sense() runs the promise at the given frequency, collects data for the promise whenever it makes sense and appends it to a log file.

test() returns a TestResult object describing the actual promise state. The test method is called when buildout processes a partition; a partition is marked as correctly processed if there are no buildout failures and the test() of every promise passes.

anomaly() returns an AnomalyResult object describing the promise state. The anomaly method is called by slapgrid when the partition is correctly processed, to check that the partition has no anomaly. If AnomalyResult.hasFailed() is True, bang is called unless another promise of the same instance has already called bang.

The GenericPromise class contains the base implementation of a promise. It provides a method run() which reads the option 'check_anomaly' to enforce a call of anomaly() instead of test(). By default, running a promise script calls sense() and test(). The option check_anomaly is used by buildout for the periodic promise check, when the partition is already well deployed.
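The choice between test() and anomaly() can be pictured with a simplified stand-in class. This is not the slapos.grid.promise code: the class below is hypothetical, it keeps its options in a plain dict, and it assumes that sense() is called first in both cases.

```python
class SketchPromise:
    """Simplified stand-in for GenericPromise, for illustration only."""

    def __init__(self, config):
        self.config = config
        self.calls = []  # records which methods were invoked

    def sense(self):
        self.calls.append("sense")

    def test(self):
        self.calls.append("test")
        return "TestResult"

    def anomaly(self):
        self.calls.append("anomaly")
        return "AnomalyResult"

    def run(self):
        # Collect data first, then evaluate it one way or the other.
        self.sense()
        if self.config.get("check_anomaly", False):
            return self.anomaly()
        return self.test()


default_run = SketchPromise({})
periodic_run = SketchPromise({"check_anomaly": True})
print(default_run.run(), default_run.calls)
print(periodic_run.run(), periodic_run.calls)
```

The first object follows the default path (sense then test), the second one the periodic check path (sense then anomaly).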

In the future, GenericPromise will be improved to provide more methods that can be used in sense() to store promise graph data. This graph data will be used on the monitor interface to plot a chart of promise result progression.

Methods available in Promise class (inherited from GenericPromise) are:

  • self.getTitle(): return the promise title, ex: my_promise
  • self.getName(): return the name of the promise, ex: my_promise.py
  • self.getPromiseFile(): return the promise file path
  • self.getPeriodicity(): return the current promise periodicity
  • self.setPeriodicity(minute=XX): set the promise periodicity in minutes in __init__()
  • self.getLogFile(): return path to the file where output is written
  • self.getLogFolder(): return folder where monitoring logs are written
  • self.getPartitionFolder(): return base partition folder
  • self.getConfig(key, default=None): return the configuration sent to the promise class.
    Default configuration keys available are:
    -  self.getConfig('partition-id')
    -  self.getConfig('computer-id')
    -  self.getConfig('partition-key')
    -  self.getConfig('partition-cert')
    -  self.getConfig('master-url')
  • self.getLastPromiseResultList(latest_minute=0, result_count=COUNT, only_failure=False): read the promise log result groups from the latest promise executions, COUNT being the number of executions to read. Set latest_minute to specify the maximum promise execution time to search. If only_failure is True, only failure messages are returned.
  • self._test(result_count=COUNT, failure_amount=XX, latest_minute=0): return TestResult from the latest promise result messages.
  • self._anomaly(result_count=COUNT, failure_amount=XX, latest_minute=0): return AnomalyResult from the latest promise result messages.

Where to commit my python promise code

Promise code is committed to the slapos.toolbox repository. Please put your promise into the folder slapos/promise/plugin; you can then import it from a file in the etc/plugin folder.

cat << EOF > etc/plugin/check-my-site.py

from slapos.promise.plugin.my_promise_check_site import RunPromise

EOF

 

How to add a promise from buildout

The recipe slapos.cookbook:promise.plugin can be used to generate promise scripts. Adding a promise looks like this:

[promise-check-site]
recipe = slapos.cookbook:promise.plugin
eggs =
  slapos.toolbox
output = ${directory:plugins}/check-my-site.py
content = 
  from slapos.promise.plugin.check_site_state import RunPromise
config-site-url = ${publish:site-url}
config-connection-timeout = 20
config-foo = bar
mode = 600

Then add the promise-check-site section to the buildout parts so that it is installed.

In your promise code, you will be able to call self.getConfig('site-url'), self.getConfig('connection-timeout') and self.getConfig('foo'). The value returned by self.getConfig(KEY) is None if the config parameter KEY is not set.
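Because a missing parameter comes back as None, promise code usually needs a fallback. The sketch below uses a plain dict and its get method as a stand-in for self.getConfig; the helper name and the default of 10 seconds are hypothetical, and it assumes that values coming from the buildout section arrive as text.

```python
def read_timeout(get_config, default=10):
    """Return the connection timeout as an integer, or a default if unset."""
    raw = get_config('connection-timeout')
    if raw is None:
        return default
    # Assumption: buildout options are passed as strings, so convert.
    return int(raw)


# dict.get mimics getConfig here: it returns None for a missing key.
configured = {'site-url': 'https://example.com', 'connection-timeout': '20'}
print(read_timeout(configured.get))
print(read_timeout({}.get))
```

Inside a real promise the call would be read_timeout(self.getConfig).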

Monitor Promise launcher script

The monitor promise script added by monitor can be used to test promise execution without using slapgrid, when writing new promises or for debugging. The script is exposed in the bin directory of the software release.

To run promises, the command should be:

SR_DIRECTORY/bin/monitor.runpromise --config etc/monitor.conf --console --dry-run [ARG, ...]

 

How to call something other than a Python script as a promise

Legacy promises are promises placed in PARTITION_DIRECTORY/etc/promise; they can be bash or other executable scripts. The promise launcher uses a special wrapper to call them as a subprocess, and the success or failure state is based on the process return code (0 = success, non-zero = failure).
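The wrapper idea can be illustrated with the standard subprocess module. This is a minimal sketch and not the actual SlapOS wrapper: the function name and the 20 second timeout are assumptions, and it assumes that any non-zero return code counts as a failure.

```python
import subprocess
import sys


def run_legacy_promise(command, timeout=20):
    """Run an executable promise and return (success, output)."""
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "promise timed out"
    output = completed.stdout.decode(errors="replace")
    return completed.returncode == 0, output


# Two throwaway commands standing in for scripts in etc/promise.
passing, _ = run_legacy_promise([sys.executable, "-c", "raise SystemExit(0)"])
failing, _ = run_legacy_promise([sys.executable, "-c", "raise SystemExit(2)"])
print(passing, failing)
```

Only the return code decides the state; the captured output is kept so that it can be logged.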

How do I set the frequency of buildout run of a software release?

To set the frequency of buildout runs, the software release should write a file named periodicity in the software release folder. The file should contain the time period in seconds. For example, to process the partition every 12 hours, the file /opt/slapgrid/SR_MD5SUM/periodicity should contain 43200 (12 h).
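The conversion from hours to the content of the periodicity file is simple arithmetic: 12 h × 3600 s/h = 43200 s. A small sketch, with hypothetical helper names and a temporary directory standing in for the software release folder:

```python
import os
import tempfile


def write_periodicity(software_release_dir, hours):
    """Write the buildout run period, in seconds, to the periodicity file."""
    seconds = int(hours * 3600)
    path = os.path.join(software_release_dir, "periodicity")
    with open(path, "w") as periodicity_file:
        periodicity_file.write(str(seconds))
    return path


def read_periodicity(software_release_dir):
    """Read the period back as an integer number of seconds."""
    path = os.path.join(software_release_dir, "periodicity")
    with open(path) as periodicity_file:
        return int(periodicity_file.read().strip())


# Stand-in for /opt/slapgrid/SR_MD5SUM
software_release_dir = tempfile.mkdtemp()
write_periodicity(software_release_dir, hours=12)
print(read_periodicity(software_release_dir))
```

In a real software release the file would typically be generated by the buildout profile rather than by hand.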

How to choose a proper promise style?

Failure bangs the partition; the promise runs during instantiation and periodically

The promise shall return AnomalyResult and not be TestLess.

Failure bangs the partition; the promise runs only periodically

The promise shall return AnomalyResult and be TestLess.

Failure does not bang the partition; the promise runs during instantiation and periodically

The promise shall return TestResult and not be TestLess.

Failure does not bang the partition; the promise runs only periodically

The promise shall return TestResult and be TestLess.
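The four cases reduce to two independent choices, which the following sketch makes explicit. The function and its parameter names are hypothetical; it only returns the name of the result type and whether the promise should be TestLess.

```python
def choose_promise_style(bang_on_failure, run_during_instantiation):
    """Map the two requirements to (result type name, TestLess flag)."""
    # Banging the partition requires an AnomalyResult.
    result_type = "AnomalyResult" if bang_on_failure else "TestResult"
    # A promise that only runs periodically is TestLess.
    test_less = not run_during_instantiation
    return result_type, test_less


for bang in (True, False):
    for at_instantiation in (True, False):
        print(bang, at_instantiation,
              choose_promise_style(bang, at_instantiation))
```

The loop prints the four combinations in the same order as the cases listed above.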