
    General SlapOS Monitoring Specifications

    This document details how partitions and servers are monitored in SlapOS
    • Last Update:2020-04-16
    • Version:004
    • Language:en

    The goal of this document is to describe how to write promises and how partitions and servers are monitored in SlapOS.

    Introducing SlapOS Architecture

    SlapOS is a distributed, open source, Edge Cloud system. It is based on a Master and Slave design. The Master assigns services to Slave Nodes. Nodes process the list of services using buildout, then send connection information and their monitoring status to the Master. The monitoring status of services is based on promises.

    Monitoring Crimes

    There are four monitoring crimes that every developer should keep in mind:

    • Buildout runs all the time without ever going to sleep
    • Run all promises every minute
    • Ever failing promises
    • Buildout taking too long to process a computer partition

    Buildout runs all the time without ever going to sleep

    If buildout runs all the time, it consumes too many resources and can overload the server. Instead, one should take care that all promises of the Software Release can be fulfilled.

    Run all promises every minute

    It is not required to run all promises in monitor every minute. Instead, the frequency should be configurable for each promise.

    Ever failing promises

    If a promise never reaches a passing state, it means that the Software Release (SR) is badly implemented and should be reviewed.

    Buildout taking too long to process a computer partition

    Buildout should process a computer partition in a short time, otherwise it prevents responsive provisioning of other partitions. The time to process a computer partition should be less than one minute (< 1 min).

    Monitoring Goals

    The goal of monitoring is to provide good quality of service by detecting problems before customers report them.

    This is done by making sure that:

    • Servers are alive
    • Partitions are fulfilling all promises

    Alive servers

    Servers should contact the master periodically to notify that they are alive. The master shows the state of each server as a colour. A server is Green if it contacted the master within the last 5 minutes. If it contacted the master within the last hour, the server is Orange; otherwise it is Red. From a monitoring point of view, the server contacts the master whenever Slapgrid connects to the SlapOS master, no matter what for.
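
    The colour rule above can be sketched as a small function. This is only an illustration: the function and constant names are hypothetical, and treating the 5 minute and 1 hour limits as inclusive is an assumption, not the SlapOS master implementation.

```python
from datetime import datetime, timedelta

# Thresholds taken from the text; names are hypothetical.
GREEN_WINDOW = timedelta(minutes=5)
ORANGE_WINDOW = timedelta(hours=1)


def server_colour(last_contact, now):
    """Return 'Green', 'Orange' or 'Red' from the time of the last contact."""
    elapsed = now - last_contact
    if elapsed <= GREEN_WINDOW:
        return "Green"
    if elapsed <= ORANGE_WINDOW:
        return "Orange"
    return "Red"


now = datetime(2020, 4, 16, 12, 0, 0)
print(server_colour(now - timedelta(minutes=2), now))   # Green
print(server_colour(now - timedelta(minutes=30), now))  # Orange
print(server_colour(now - timedelta(hours=3), now))     # Red
```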

    Fulfilled promises

    The master shows the state of each requested partition as a colour. A partition is Green if all of the following hold: the latest result sent by Slapgrid for that partition is OK (meaning that all promises succeeded and there were no other failures), that result was sent less than one day ago and within the buildout run frequency defined by the software release, and no bang was triggered after it. Otherwise the partition is Red.

    Note 1: buildout on a partition in SlapOS will be executed at least once per computer configurable frequency (usually one day) and at least once per software release configurable frequency (seldom configured).

    Note 2: the computer configurable frequency of buildout run must be stored on the Computer in SlapOS master at registration time and updated, else it is impossible to check promise fulfillment.
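
    As a rough sketch of the Green/Red rule for partitions described in this section, the conditions can be combined as below. The function name, its parameters and the way the inputs are represented are all assumptions made for illustration; the master's real data model is not described here.

```python
from datetime import datetime, timedelta


def partition_colour(last_ok, last_report, now, sr_frequency, last_bang=None):
    """Hypothetical sketch of the partition colour rule.

    last_ok      -- True if the latest result sent by Slapgrid was OK
    last_report  -- datetime at which that latest result was sent
    sr_frequency -- timedelta, buildout run frequency of the software release
    last_bang    -- datetime of the most recent bang, or None
    """
    age = now - last_report
    # The result must be younger than one day AND than the SR frequency.
    fresh = age < timedelta(days=1) and age < sr_frequency
    banged_since = last_bang is not None and last_bang > last_report
    return "Green" if (last_ok and fresh and not banged_since) else "Red"


now = datetime(2020, 4, 16, 12, 0, 0)
report = now - timedelta(hours=2)
print(partition_colour(True, report, now, timedelta(hours=12)))
print(partition_colour(True, report, now, timedelta(hours=12),
                       last_bang=now - timedelta(hours=1)))
```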

    How Slapgrid checks partitions Status

    In normal conditions:

    • Instantiation runs periodically (at least once in an interval of computer configurable frequency which is usually 24 hours), running promises and posting to master, hence showing signs of life.
    • Slapgrid periodically runs a set of promise sensors and, when an anomaly is detected on a promise sensor value, bang is called on the partition.
    • Upon call of bang, a run of partition instantiation is scheduled by SlapOS Master on all partitions that belong to the same software instance tree.

    Running buildout on all partitions after a bang is supposed to converge to a stable state with all promises passing. 

    Slapgrid is configured to run promises at some interval of time which can be configured differently for each promise sensor. SlapOS knows nothing about the results of running promise sensors. The only thing the master knows is that a bang was issued due to anomaly detection.
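
    To illustrate what a per-promise interval means in practice, the sketch below selects the promises that are due at a given minute. It is not Slapgrid's scheduler: the function name, the dictionaries and the use of plain minute counters are assumptions chosen to keep the example self-contained.

```python
def due_promises(periodicity_by_promise, last_run_by_promise, now_minute):
    """Return the sorted names of promises whose periodicity has elapsed.

    periodicity_by_promise -- {promise name: periodicity in minutes}
    last_run_by_promise    -- {promise name: minute of the last run}
    """
    due = []
    for name, periodicity in periodicity_by_promise.items():
        last_run = last_run_by_promise.get(name)
        # A promise that never ran is always due.
        if last_run is None or now_minute - last_run >= periodicity:
            due.append(name)
    return sorted(due)


periodicity = {"check-my-site.py": 2, "check-disk.py": 60}
last_run = {"check-my-site.py": 100, "check-disk.py": 100}
print(due_promises(periodicity, last_run, now_minute=103))
```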

    We want to promote a simple, easy and standardised way of writing promise scripts that verify the state of the system. These scripts can be launched by cron and are configurable for each Software Release. A promise has three parts:

    • promise sensor
    • promise test
    • promise anomaly detector

    The promise sensor collects the value of some monitoring aspects such as "if server is supposed to be started, get the response of an http request, else return 'server stopped' and in case of timeout return empty string". 

    The promise test is Green if the result of the promise sensor of the previous example is not empty, else Red. This ensures that a server that is started actually responds to http requests. There is no margin of tolerance for promise tests.

    The promise anomaly detector is Green if at least one of the last three promise sensor values was not empty, else it is Red. This ensures that we call bang only if the server is really stopped, not if an Internet glitch happened.
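
    The difference between the promise test and the promise anomaly detector in this example can be sketched as two functions over the history of sensor values. The function names and the list representation of the sensor history are hypothetical; only the Green/Red rules come from the text.

```python
def promise_test(sensor_values):
    """Green if the latest sensor value is not empty, else Red (no tolerance)."""
    return "Green" if sensor_values and sensor_values[-1] != "" else "Red"


def anomaly_detector(sensor_values, window=3):
    """Green if at least one of the last `window` sensor values is not empty."""
    recent = sensor_values[-window:]
    return "Green" if any(value != "" for value in recent) else "Red"


# Two timeouts after a good response: the test fails, but no bang is needed yet.
glitch = ["200 OK", "", ""]
print(promise_test(glitch), anomaly_detector(glitch))    # Red Green

# Three timeouts in a row: the anomaly detector turns Red as well.
outage = ["200 OK", "", "", ""]
print(promise_test(outage), anomaly_detector(outage))    # Red Red
```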

    Note: promises are what buildout launches at the end. They return True or False. True means that one aspect of the partition is OK. Cron does not launch promises, but anomaly detectors. Very often, anomaly detector and promises are the same executable with the same result, but not always. Therefore, the two concepts are different. What they have in common is that they often sense the same thing. But detecting an anomaly is not the same as detecting that a promise is initially met.

    Watchdog

    Watchdog is a simple SlapOS Node feature that watches any process managed by supervisord. All process scripts in the PARTITION_DIRECTORY/etc/service directory are watched. They are automatically configured in supervisord with an added on-watch suffix on their process_name. Whenever one of them exits, watchdog triggers an alert (bang) that is sent to the master. Bang forces slapgrid to reprocess all instances of the service. This also forces a recheck of all promises and posts the result to the master, letting the master decide whether the partition state is Green or Red.

    A partition should be able to call bang as often as needed in a day. The limitation that exists today should be removed, otherwise it is not possible to adapt dynamically. A Master protection against recurring bang calls should be considered, using a kind of daily quota that might depend on price or be defined in the software release. If the bang quota of the day is reached, the master rejects all further calls until the next day.
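
    The daily quota idea can be sketched as a simple counter keyed by partition and day. This is purely illustrative: the class, its method and the in-memory storage are hypothetical, and nothing here describes an existing SlapOS Master feature.

```python
from collections import defaultdict
from datetime import date


class BangQuota:
    """Hypothetical per-partition daily bang quota."""

    def __init__(self, limit_per_day):
        self.limit_per_day = limit_per_day
        self._count = defaultdict(int)  # (partition_id, day) -> accepted bangs

    def accept(self, partition_id, day):
        """Count the bang and return True, or return False once the quota is reached."""
        key = (partition_id, day)
        if self._count[key] >= self.limit_per_day:
            return False
        self._count[key] += 1
        return True


quota = BangQuota(limit_per_day=2)
today = date(2020, 4, 16)
print([quota.accept("slappart0", today) for _ in range(3)])  # [True, True, False]
print(quota.accept("slappart0", date(2020, 4, 17)))          # True
```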

    How to write a monitoring Python promise

    The script below is an example of a promise in Python. Writing a promise consists of defining a class called RunPromise, which inherits from the GenericPromise class, and defining the methods anomaly(), sense() and test(). Python promises should be placed in the etc/plugin folder of the computer partition.

    cat << EOF > etc/plugin/check-my-site.py
    
    from zope import interface as zope_interface
    from slapos.grid.promise import interface
    from slapos.grid.promise.generic import GenericPromise, TestResult, AnomalyResult
    
    class RunPromise(GenericPromise):
      
      zope_interface.implements(interface.IPromise)
    
      def __init__(self, config):
        GenericPromise.__init__(self, config)
        # run the promise every 2 minutes
        self.setPeriodicity(minute=2)
    
      def anomaly(self):
        """
          Called to detect if there is an anomaly.
          Return AnomalyResult or TestResult object
          # When AnomalyResult has failure bang is called if another promise didn't bang
        """
    
        # Example
        promise_result_list = self.getLastPromiseResultList(result_count=3, only_failure=True)
        if len(promise_result_list) > 2:
          return AnomalyResult(problem=True, message=promise_result_list[0][0]['message'])
        return AnomalyResult(problem=False, message="")
    
        # It's possible to use Generic helper methods
        # return self._anomaly(result_count=3, failure_amount=3)
    
      def sense(self):
        """
          Run the promise code and store the result
            raise error, log error message, ... for failure
        """
    
        # DO SOMETHING...
        failed = True
        raised = False
        if failed:
          self.logger.error("ERROR while checking instance http server")
        else:
          self.logger.info("http server is OK")
        if raised:
          raise ValueError("Server URL is not correct")
    
      def test(self):
        """
          Test promise and say if problem is detected or not
          Return TestResult object
        """
    
        # Example
        promise_result_list = self.getLastPromiseResultList(result_count=1)[0]
        problem = False
        message = ""
        for result in promise_result_list:
          if result['status'] == 'ERROR' and not problem:
            problem = True
          message += "\n%s" % result['message']
    
        return TestResult(problem=problem, message=message)
    
        # It's possible to use Generic helper methods
        # return self._test(result_count=1, failure_amount=1)
    
    EOF

    sense() runs the promise code at the configured frequency, collects data for the promise whenever it makes sense, and appends it to a log file.

    test() returns a TestResult object describing the actual promise state. The test method is called when buildout processes a partition. A partition is marked as correctly processed if there are no buildout failures and the test() of every promise passes.

    anomaly() returns an AnomalyResult object describing the promise state. The anomaly method is called by slapgrid once the partition is correctly processed, to check that the partition has no anomaly. If AnomalyResult.hasFailed() is True, bang is called unless another promise of the same instance has already called bang.

    The GenericPromise class contains the base implementation of a promise. It provides a run() method, which reads the option 'check_anomaly' to enforce a call of anomaly() instead of test(). By default, running a promise script calls sense() and test(). The check_anomaly option is used by buildout for periodic promise checks, when the partition is already well deployed.
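
    The dispatch performed by run() can be pictured with the stand-in class below. It is not the real GenericPromise: the class name, the string return values and the assumption that sense() is also called when check_anomaly is set are all illustrative.

```python
class SketchPromise:
    """Stand-in for a promise class; NOT the real GenericPromise."""

    def __init__(self):
        self.calls = []  # records the order in which methods are called

    def sense(self):
        self.calls.append("sense")

    def test(self):
        self.calls.append("test")
        return "TestResult"

    def anomaly(self):
        self.calls.append("anomaly")
        return "AnomalyResult"

    def run(self, check_anomaly=False):
        # sense() collects data first; the option only selects which of
        # test() or anomaly() evaluates the collected data.
        self.sense()
        return self.anomaly() if check_anomaly else self.test()


default_run = SketchPromise()
print(default_run.run(), default_run.calls)

periodic_run = SketchPromise()
print(periodic_run.run(check_anomaly=True), periodic_run.calls)
```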

    In the future, GenericPromise will be improved to provide more methods that can be used in sense() to store promise graph data. This graph data will be used on the monitor interface to plot a chart of promise result progression.

    Methods available in Promise class (inherited from GenericPromise) are:

    • self.getTitle(): return the promise title, ex: my_promise
    • self.getName(): return the name of the promise, ex: my_promise.py
    • self.getPromiseFile(): return the promise file path
    • self.getPeriodicity(): return the current promise periodicity
    • self.setPeriodicity(minute=XX): set the promise periodicity in minutes in __init__()
    • self.getLogFile(): return path to the file where output is written
    • self.getLogFolder(): return folder where monitoring logs are written
    • self.getPartitionFolder(): return base partition folder
    • self.getConfig(key, default=None): return the configuration sent to the promise class.
      The default configuration keys available are:
      -  self.getConfig('partition-id')
      -  self.getConfig('computer-id')
      -  self.getConfig('partition-key')
      -  self.getConfig('partition-cert')
      -  self.getConfig('master-url')
    • self.getLastPromiseResultList(latest_minute=0, result_count=COUNT, only_failure=False): read the promise log result group from the latest promise execution specified by COUNT. Set latest_minute to specify the maximum promise execution time to search. If only_failure is True, only failure messages are returned.
    • self._test(result_count=COUNT, failure_amount=XX, latest_minute=0): return TestResult from the latest promise result messages.
    • self._anomaly(result_count=COUNT, failure_amount=XX, latest_minute=0): return AnomalyResult from the latest promise result messages.

    Where to commit my python promise code

    Promise code is committed in the slapos.toolbox repository. Please put your promise into the folder slapos/promise/plugin; you can then import it from a file in the etc/plugin folder.

    cat << EOF > etc/plugin/check-my-site.py
    
    from slapos.promise.plugin.my_promise_check_site import RunPromise
    
    EOF

     

    How to add a promise from buildout

    The recipe slapos.cookbook:promise.plugin can be used to generate promise scripts. Adding a promise looks like this:

    [promise-check-site]
    recipe = slapos.cookbook:promise.plugin
    eggs =
      slapos.toolbox
    output = ${directory:plugins}/check-my-site.py
    content = 
      from slapos.promise.plugin.check_site_state import RunPromise
    config-site-url = ${publish:site-url}
    config-connection-timeout = 20
    config-foo = bar
    mode = 600

    Then you will have to add the promise-check-site section to buildout parts so that it is installed.

    In your promise code, you will be able to call self.getConfig('site-url'), self.getConfig('connection-timeout') and self.getConfig('foo'). The value returned by self.getConfig(KEY) is None if the config parameter KEY is not set.
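
    Because a missing parameter comes back as None, promise code usually needs a fallback. The helper below is a hypothetical sketch: it takes any getConfig-like callable, and it assumes that configuration values arrive as strings, which is not stated in this document.

```python
def read_timeout(get_config, default=20.0):
    """Return 'connection-timeout' as a float, or `default` if it is not set.

    get_config -- a callable behaving like self.getConfig(KEY)
    """
    raw = get_config('connection-timeout')
    if raw is None:
        return default
    return float(raw)  # assumption: values are passed as strings


# A plain dict stands in for the promise configuration in this sketch.
configured = {'site-url': 'https://example.com', 'connection-timeout': '5'}
print(read_timeout(configured.get))   # 5.0
print(read_timeout({}.get))           # 20.0
```

    Inside a promise, the same helper would be called as read_timeout(self.getConfig).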

    Monitor Promise launcher script

    The monitor promise script added by monitor can be used to test promise execution without using slapgrid, when writing new promises or debugging. The script is exposed in the bin directory of the software release.

    To run promises, the command should be:

    SR_DIRECTORY/bin/monitor.runpromise --config etc/monitor.conf --console --dry-run [ARG, ...]

     

    How to call something other than a Python promise script

    Legacy promises are promises placed in PARTITION_DIRECTORY/etc/promise. They can be bash or other executable scripts. The promise launcher uses a special wrapper to call them as a subprocess; the success or failure state is based on the process return code (0 = success, non-zero = failure).
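
    A wrapper of this kind can be approximated with the standard subprocess module. The sketch below is not the SlapOS wrapper: the function name and the timeout handling are assumptions, and any non-zero return code is treated as a failure, following the usual Unix convention.

```python
import subprocess
import sys


def run_legacy_promise(command, timeout=20):
    """Run an executable promise and return True when it exits with code 0."""
    try:
        completed = subprocess.run(command, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Assumption: a promise that does not finish in time has failed.
        return False
    return completed.returncode == 0


# Two tiny child processes stand in for legacy promise scripts.
ok = run_legacy_promise([sys.executable, "-c", "raise SystemExit(0)"])
failed = run_legacy_promise([sys.executable, "-c", "raise SystemExit(2)"])
print(ok, failed)  # True False
```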

    How do I set the frequency of buildout runs for a software release?

    To set the frequency of buildout runs, the software release should write a file named periodicity in the software release folder. The file should contain the time period in seconds. For example, to process the partition every 12 hours, the file /opt/slapgrid/SR_MD5SUM/periodicity should contain 43200 (12 hours).
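
    The conversion from hours to the value stored in the periodicity file can be checked with a few lines of Python. The helper names are hypothetical, and a temporary directory stands in for the software release folder so that the sketch can run anywhere.

```python
import os
import tempfile


def write_periodicity(software_release_dir, hours):
    """Write the buildout period, in seconds, to <dir>/periodicity."""
    seconds = int(hours * 3600)
    path = os.path.join(software_release_dir, "periodicity")
    with open(path, "w") as periodicity_file:
        periodicity_file.write(str(seconds))
    return path


def read_periodicity(software_release_dir):
    """Read the period in seconds back from <dir>/periodicity."""
    with open(os.path.join(software_release_dir, "periodicity")) as f:
        return int(f.read().strip())


with tempfile.TemporaryDirectory() as sr_dir:
    write_periodicity(sr_dir, 12)
    print(read_periodicity(sr_dir))  # 43200
```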

    How to choose a proper promise style?

    Bangs the partition, and runs during instantiation and periodically

    Promise shall return AnomalyResult and not be TestLess

    Bangs the partition, and runs only periodically

    Promise shall return AnomalyResult and be TestLess

    Does not bang the partition, and runs during instantiation and periodically

    Promise shall return TestResult and not be TestLess

    Does not bang the partition, and runs only periodically

    Promise shall return TestResult and be TestLess
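
    The four cases above reduce to two independent choices, which the following sketch makes explicit. The function name and its boolean parameters are hypothetical; the returned class names are given as strings so that the example runs without slapos installed.

```python
def promise_style(bangs_partition, runs_at_instantiation):
    """Return (result class name, is_testless) for the wanted behaviour."""
    # Banging is decided by the type of result the promise returns.
    result_type = "AnomalyResult" if bangs_partition else "TestResult"
    # A promise that must not run during instantiation is TestLess.
    testless = not runs_at_instantiation
    return result_type, testless


for bangs in (True, False):
    for at_instantiation in (True, False):
        print(bangs, at_instantiation, promise_style(bangs, at_instantiation))
```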