#pylint: disable-msg=C0111

"""
Prejob tasks.

Prejob tasks _usually_ run before a job and verify the state of a machine.
Cleanup and repair are exceptions: cleanup can also run after a job, while
repair will run any time the host needs a repair, which could be pre or post
job. Most of the work specific to this module is achieved through the prolog
and epilog of each task.

All prejob tasks must have a host, though they may not have an HQE. If a
prejob task has an HQE, it will activate the HQE through its on_pending
method on successful completion. A row in afe_special_tasks with values:
    host=C1, unlocked, is_active=0, is_complete=0, type=Verify
will indicate to the scheduler that it needs to schedule a new special task
of type=Verify against the C1 host. While the special task is running,
the scheduler only monitors it through the Agent, and its is_active bit is 1.
Once a special task finishes, we set its is_active=0, is_complete=1 and
success bits, so the scheduler ignores it.
HQE.on_pending:
    Host, HQE -> Pending, Starting
    This status is acted upon in the scheduler, to assign an AgentTask.
PreJobTask:
    epilog:
        failure:
            requeue hqe
            repair the host
Children PreJobTasks:
    prolog:
        set Host, HQE status
    epilog:
        success:
            on_pending
        failure:
            repair through PreJobTask
            set Host, HQE status

Failing a prejob task affects both the Host and the HQE, as follows:

- Host: PreJob failure will result in a Repair job getting queued against
the host, if we haven't already tried repairing it more than the
max_repair_limit. When this happens, the host will remain in whatever status
the prejob task left it in, until the Repair job puts it into 'Repairing'. This
way the host_scheduler won't pick bad hosts and assign them to jobs.

If we have already tried repairing the host too many times, the PreJobTask
will flip the host to 'RepairFailed' in its epilog, and it will remain in this
state until it is recovered and reverified.

- HQE: Is either requeued or failed. Requeuing the HQE involves putting it
in the Queued state and setting its host_id to None, so it gets a new host
in the next scheduler tick. Failing the HQE results in either a Parsing
or Archiving postjob task, and an eventual Failed status for the HQE.
"""

import logging
import os

from autotest_lib.client.common_lib import host_protections
from autotest_lib.frontend.afe import models
from autotest_lib.scheduler import agent_task, scheduler_config
from autotest_lib.server import autoserv_utils
from autotest_lib.server.cros import provision

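# Illustrative sketch, not part of the scheduler itself: per the module
# docstring, a pending special task is just a row in afe_special_tasks that
# the scheduler picks up on a later tick. Something along these lines would
# queue a Verify against host C1; the exact keyword arguments are an
# assumption based on the create() calls used further down in this module.
#
#     models.SpecialTask.objects.create(
#             host=models.Host.objects.get(hostname='C1'),
#             task=models.SpecialTask.Task.VERIFY)
#
# The new row starts out with is_active=0 and is_complete=0; while the task
# runs its is_active bit is 1, and on completion is_complete and success are
# set so the scheduler ignores it from then on.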

class PreJobTask(agent_task.SpecialAgentTask):
    def _copy_to_results_repository(self):
        if not self.queue_entry or self.queue_entry.meta_host:
            return

        self.queue_entry.set_execution_subdir()
        log_name = os.path.basename(self.task.execution_path())
        source = os.path.join(self.task.execution_path(), 'debug',
                              'autoserv.DEBUG')
        destination = os.path.join(
                self.queue_entry.execution_path(), log_name)

        self.monitor.try_copy_to_results_repository(
                source, destination_path=destination)


    def epilog(self):
        super(PreJobTask, self).epilog()

        if self.success:
            return

        if self.host.protection == host_protections.Protection.DO_NOT_VERIFY:
            # effectively ignore failure for these hosts
            self.success = True
            return

        if self.queue_entry:
            # If we requeue an HQE, we should cancel any remaining pre-job
            # tasks against this host, otherwise we'll be left in a state
            # where a queued HQE has special tasks to run against a host.
            models.SpecialTask.objects.filter(
                    queue_entry__id=self.queue_entry.id,
                    host__id=self.host.id,
                    is_complete=0).update(is_complete=1, success=0)

            previous_provisions = models.SpecialTask.objects.filter(
                    task=models.SpecialTask.Task.PROVISION,
                    queue_entry_id=self.queue_entry.id).count()
            if (previous_provisions >
                scheduler_config.config.max_provision_retries):
                self._actually_fail_queue_entry()
                # This abort will mark the aborted bit on the HQE itself, to
                # signify that we're killing it.  Technically it also will do
                # the recursive aborting of all child jobs, but that shouldn't
                # matter here, as only suites have children, and those are
                # hostless and thus don't have provisioning.
                # TODO(milleral) http://crbug.com/188217
                # However, we can't actually do this yet, as if we set the
                # abort bit the FinalReparseTask will set the status of the HQE
                # to ABORTED, which then means that we don't show the status in
                # run_suite.  So in the meantime, don't mark the HQE as
                # aborted.
                # queue_entry.abort()
            else:
                # requeue() must come after handling provision retries, since
                # _actually_fail_queue_entry needs an execution subdir.
                # We also don't want to requeue if we hit the provision retry
                # limit, since then we overwrite the PARSING state of the HQE.
                self.queue_entry.requeue()

            # Limit the repair on a host when a prejob task fails, e.g., reset,
            # verify, etc. The number of repair jobs is limited to the specific
            # HQE and host.
            previous_repairs = models.SpecialTask.objects.filter(
                    task=models.SpecialTask.Task.REPAIR,
                    queue_entry_id=self.queue_entry.id,
                    host_id=self.queue_entry.host_id).count()
            if previous_repairs >= scheduler_config.config.max_repair_limit:
                self.host.set_status(models.Host.Status.REPAIR_FAILED)
                self._fail_queue_entry()
                return

            queue_entry = models.HostQueueEntry.objects.get(
                    id=self.queue_entry.id)
        else:
            queue_entry = None

        models.SpecialTask.objects.create(
                host=models.Host.objects.get(id=self.host.id),
                task=models.SpecialTask.Task.REPAIR,
                queue_entry=queue_entry,
                requested_by=self.task.requested_by)

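    # The retry ceilings used in epilog() come from the scheduler config:
    # scheduler_config.config.max_provision_retries bounds provision reruns
    # per HQE, and scheduler_config.config.max_repair_limit bounds repairs per
    # HQE/host pair. A minimal sketch of the same check, reusing the query
    # pattern from epilog() with hypothetical hqe_id/host_id values:
    #
    #     repairs = models.SpecialTask.objects.filter(
    #             task=models.SpecialTask.Task.REPAIR,
    #             queue_entry_id=hqe_id, host_id=host_id).count()
    #     can_repair = repairs < scheduler_config.config.max_repair_limit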

    def _should_pending(self):
        """
        Decide if we should call the host queue entry's on_pending method.
        We should if:
        1) There exists an associated host queue entry.
        2) The current special task completed successfully.
        3) There do not exist any more special tasks to be run before the
           host queue entry starts.

        @returns: True if we should call on_pending, False if not.

        """
        if not self.queue_entry or not self.success:
            return False

        # We know if this is the last one when we create it, so we could add
        # another column to the database to keep track of this information, but
        # I expect the overhead of querying here to be minimal.
        queue_entry = models.HostQueueEntry.objects.get(id=self.queue_entry.id)
        queued = models.SpecialTask.objects.filter(
                host__id=self.host.id, is_active=False,
                is_complete=False, queue_entry=queue_entry)
        queued = queued.exclude(id=self.task.id)
        return queued.count() == 0


class VerifyTask(PreJobTask):
    TASK_TYPE = models.SpecialTask.Task.VERIFY


    def __init__(self, task):
        args = ['-v']
        if task.queue_entry:
            args.extend(self._generate_autoserv_label_args(task))
        super(VerifyTask, self).__init__(task, args)
        self._set_ids(host=self.host, queue_entries=[self.queue_entry])


    def prolog(self):
        super(VerifyTask, self).prolog()

        logging.info("starting verify on %s", self.host.hostname)
        if self.queue_entry:
            self.queue_entry.set_status(models.HostQueueEntry.Status.VERIFYING)
        self.host.set_status(models.Host.Status.VERIFYING)

        # Delete any queued manual reverifies for this host.  One verify will do
        # and there's no need to keep records of other requests.
        self.remove_special_tasks(models.SpecialTask.Task.VERIFY,
                                  keep_last_one=True)


    def epilog(self):
        super(VerifyTask, self).epilog()
        if self.success:
            if self._should_pending():
                self.queue_entry.on_pending()
            else:
                self.host.set_status(models.Host.Status.READY)


class CleanupTask(PreJobTask):
    # note this can also run post-job, but when it does, it's running standalone
    # against the host (not related to the job), so it's not considered a
    # PostJobTask

    TASK_TYPE = models.SpecialTask.Task.CLEANUP


    def __init__(self, task, recover_run_monitor=None):
        args = ['--cleanup']
        if task.queue_entry:
            args.extend(self._generate_autoserv_label_args(task))
        super(CleanupTask, self).__init__(task, args)
        self._set_ids(host=self.host, queue_entries=[self.queue_entry])


    def prolog(self):
        super(CleanupTask, self).prolog()
        logging.info("starting cleanup task for host: %s", self.host.hostname)
        self.host.set_status(models.Host.Status.CLEANING)
        if self.queue_entry:
            self.queue_entry.set_status(models.HostQueueEntry.Status.CLEANING)


    def _finish_epilog(self):
        if not self.queue_entry or not self.success:
            return

        do_not_verify_protection = host_protections.Protection.DO_NOT_VERIFY
        should_run_verify = (
                self.queue_entry.job.run_verify
                and self.host.protection != do_not_verify_protection)
        if should_run_verify:
            entry = models.HostQueueEntry.objects.get(id=self.queue_entry.id)
            models.SpecialTask.objects.create(
                    host=models.Host.objects.get(id=self.host.id),
                    queue_entry=entry,
                    task=models.SpecialTask.Task.VERIFY)
        else:
            if self._should_pending():
                self.queue_entry.on_pending()


    def epilog(self):
        super(CleanupTask, self).epilog()

        if self.success:
            self.host.update_field('dirty', 0)
            self.host.set_status(models.Host.Status.READY)

        self._finish_epilog()


class ResetTask(PreJobTask):
    """Task to reset a DUT, including cleanup and verify."""
    # note this can also run post-job, but when it does, it's running standalone
    # against the host (not related to the job), so it's not considered a
    # PostJobTask

    TASK_TYPE = models.SpecialTask.Task.RESET


    def __init__(self, task, recover_run_monitor=None):
        args = ['--reset']
        if task.queue_entry:
            args.extend(self._generate_autoserv_label_args(task))
        super(ResetTask, self).__init__(task, args)
        self._set_ids(host=self.host, queue_entries=[self.queue_entry])


    def prolog(self):
        super(ResetTask, self).prolog()
        logging.info('starting reset task for host: %s',
                     self.host.hostname)
        self.host.set_status(models.Host.Status.RESETTING)
        if self.queue_entry:
            self.queue_entry.set_status(models.HostQueueEntry.Status.RESETTING)

        # Delete any queued cleanups for this host.
        self.remove_special_tasks(models.SpecialTask.Task.CLEANUP,
                                  keep_last_one=False)

        # Delete any queued reverifies for this host.
        self.remove_special_tasks(models.SpecialTask.Task.VERIFY,
                                  keep_last_one=False)

        # Only one reset is needed.
        self.remove_special_tasks(models.SpecialTask.Task.RESET,
                                  keep_last_one=True)


    def epilog(self):
        super(ResetTask, self).epilog()

        if self.success:
            self.host.update_field('dirty', 0)

            if self._should_pending():
                self.queue_entry.on_pending()
            else:
                self.host.set_status(models.Host.Status.READY)


class ProvisionTask(PreJobTask):
    TASK_TYPE = models.SpecialTask.Task.PROVISION

    def __init__(self, task):
        # Provisioning requires that we be associated with a job/queue entry
        assert task.queue_entry, "No HQE associated with provision task!"
        # task.queue_entry is an afe model HostQueueEntry object.
        # self.queue_entry is a scheduler models HostQueueEntry object, but
        # it gets constructed and assigned in __init__, so it's not available
        # yet.  Therefore, we're stuck pulling labels off of the afe model
        # so that we can pass the --provision args into the __init__ call.
        labels = {x.name for x in task.queue_entry.job.labels}
        _, provisionable = provision.filter_labels(labels)
        extra_command_args = ['--provision',
                              '--job-labels', ','.join(provisionable)]
        super(ProvisionTask, self).__init__(task, extra_command_args)
        self._set_ids(host=self.host, queue_entries=[self.queue_entry])

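    # For illustration: provision.filter_labels() splits the job's label set
    # into plain dependency labels and provisionable ones; only the latter are
    # handed to autoserv. Assuming the usual Chrome OS label naming (an
    # assumption, not something this module defines), a set like
    # {'board:lumpy', 'cros-version:lumpy-release/R30-4262.0.0'} would leave
    # just the cros-version label in provisionable, producing
    # --provision --job-labels cros-version:lumpy-release/R30-4262.0.0.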

    def _command_line(self):
        # If we give queue_entry to _autoserv_command_line, then it will append
        # -c for this invocation if the queue_entry is a client side test. We
        # don't want that, as it messes with provisioning, so we just drop it
        # from the arguments here.
        # Note that we also don't verify job_repo_url as provisioning tasks are
        # required to stage whatever content we need, and the job itself will
        # force autotest to be staged if it isn't already.
        return autoserv_utils._autoserv_command_line(self.host.hostname,
                                                     self._extra_command_args,
                                                     in_lab=True)


    def prolog(self):
        super(ProvisionTask, self).prolog()
        # add a check for a previous provision task and abort if one exists.
        logging.info("starting provision task for host: %s", self.host.hostname)
        self.queue_entry.set_status(
                models.HostQueueEntry.Status.PROVISIONING)
        self.host.set_status(models.Host.Status.PROVISIONING)


    def epilog(self):
        super(ProvisionTask, self).epilog()

        # If we were not successful in provisioning the machine, leave the DUT
        # in whatever status was set in the PreJobTask's epilog. If this task
        # was successful the host status will get set appropriately as a
        # fallout of the hqe's on_pending. If we don't call on_pending, it can
        # only be because:
        #   1. This task was not successful:
        #       a. Another repair is queued: this repair job will set the host
        #       status, and it will remain in 'Provisioning' till then.
        #       b. We have hit the max_repair_limit: in which case the host
        #       status is set to 'RepairFailed' in the epilog of PreJobTask.
        #   2. The task was successful, but there are other special tasks:
        #      Those special tasks will set the host status appropriately.
        if self._should_pending():
            self.queue_entry.on_pending()


class RepairTask(agent_task.SpecialAgentTask):
    TASK_TYPE = models.SpecialTask.Task.REPAIR


    def __init__(self, task):
        """\
        queue_entry: queue entry to mark failed if this repair fails.
        """
        protection = host_protections.Protection.get_string(
                task.host.protection)
        # normalize the protection name
        protection = host_protections.Protection.get_attr_name(protection)

        args = ['-R', '--host-protection', protection]
        if task.queue_entry:
            args.extend(self._generate_autoserv_label_args(task))

        super(RepairTask, self).__init__(task, args)

        # *don't* include the queue entry in IDs -- if the queue entry is
        # aborted, we want to leave the repair task running
        self._set_ids(host=self.host)

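    # For illustration: Protection.get_string() maps the host's stored
    # protection level to its display string, and get_attr_name() maps that
    # back to the attribute-style name passed to --host-protection above
    # (e.g., assuming the usual host_protections values, 'Repair software
    # only' becomes 'REPAIR_SOFTWARE_ONLY'). The exact value pairs are an
    # assumption; see host_protections for the authoritative list.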

    def prolog(self):
        super(RepairTask, self).prolog()
        logging.info("repair_task starting")
        self.host.set_status(models.Host.Status.REPAIRING)


    def epilog(self):
        super(RepairTask, self).epilog()

        if self.success:
            self.host.set_status(models.Host.Status.READY)
        else:
            self.host.set_status(models.Host.Status.REPAIR_FAILED)
            if self.queue_entry:
                self._fail_queue_entry()
    415