Spark Job Scheduling

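In standalone mode, Spark schedules applications in FIFO order by default. Use spark.cores.max to cap the number of cores an application may take, and spark.executor.memory to set the memory of each executor.

In YARN mode, use --num-workers to set the number of workers, --worker-memory to set each worker's memory, and --worker-cores to set each worker's core count. (From Spark 1.0 onward these options are named --num-executors, --executor-memory, and --executor-cores.)

The rest of this post shows how to switch the scheduling of jobs inside one application to FAIR mode.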

Set spark.scheduler.mode before the SparkContext is instantiated:

System.setProperty("spark.scheduler.mode", "FAIR")

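The fair scheduler lets you group jobs into pools and give each pool its own scheduling options, such as a weight. To choose the pool for jobs submitted from the current thread: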

context.setLocalProperty("spark.scheduler.pool", "pool1")

This sends the jobs submitted from this thread to pool1. When the pool is no longer needed, clear the setting by passing null:

context.setLocalProperty("spark.scheduler.pool", null)

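By default every pool gets an equal share of the cluster, but jobs inside a pool run in FIFO order. If you give each user a separate pool, every user gets an equal share, and a user's later jobs cannot take resources away from that same user's earlier jobs.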
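To make the effect of per-user pools concrete, here is a toy model in plain Java. It is not Spark code: the class PoolSharingSketch, the users alice and bob, and the job names are all made up for illustration, every pool has the same weight, and tasks never finish. Each free slot goes to the pool with the fewest running tasks, and inside a pool tasks leave in submission order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model (an assumption, not Spark's scheduler): equal-weight pools, FIFO inside each pool.
class PoolSharingSketch {
    static final class Pool {
        // One entry per pending task, labelled with its job name, in submission order.
        final Deque<String> pending = new ArrayDeque<>();
        int running = 0;
    }

    private final Map<String, Pool> pools = new LinkedHashMap<>();

    // Queue `tasks` tasks of `job` at the tail of the given pool.
    void submit(String poolName, String job, int tasks) {
        Pool pool = pools.computeIfAbsent(poolName, k -> new Pool());
        for (int i = 0; i < tasks; i++) {
            pool.pending.addLast(job);
        }
    }

    // Hand out `slots` free slots and return the job name of each launched task, in launch order.
    List<String> launch(int slots) {
        List<String> order = new ArrayList<>();
        for (int s = 0; s < slots; s++) {
            Pool next = null;
            for (Pool p : pools.values()) {
                // Pick the pool with the fewest running tasks; ties go to the pool created first.
                if (!p.pending.isEmpty() && (next == null || p.running < next.running)) {
                    next = p;
                }
            }
            if (next == null) {
                break; // nothing left to launch
            }
            order.add(next.pending.pollFirst()); // FIFO inside the pool
            next.running++;
        }
        return order;
    }

    public static void main(String[] args) {
        PoolSharingSketch cluster = new PoolSharingSketch();
        cluster.submit("alice", "a1", 2);
        cluster.submit("alice", "a2", 2);
        cluster.submit("bob", "b1", 3);
        System.out.println(cluster.launch(6)); // prints [a1, b1, a1, b1, a2, b1]
    }
}
```

In this run alice's second job a2 only starts once every task of a1 has been launched, while bob's job still receives every other slot.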

Below is a sample pool configuration; see conf/fairscheduler.xml.template for details.

<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
</allocations>

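schedulingMode: FAIR or FIFO. This controls how jobs inside the pool are ordered.

weight: the pool's share of the cluster relative to other pools. The default is 1; a pool with weight 2 gets twice the resources of a pool with the default weight. A very high weight such as 1000 effectively gives the pool priority: it launches tasks first whenever it has active jobs.

minShare: the minimum number of CPU cores the pool should receive. The default is 0. The scheduler tries to satisfy every active pool's minShare before it distributes the remaining resources by weight, so between pools of equal weight the one that is still below a larger minShare is served first.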
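How weight and minShare interact is easier to see as a comparison between two pools. The sketch below is plain Java modelled on the ordering rule of Spark's fair scheduler as I understand it: pools below their minShare come first, then pools are ranked by running tasks divided by weight. FairOrderSketch and PoolState are hypothetical names, not Spark classes, and the running-task counts are invented inputs. The weights and minShare values are the ones from the sample file above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simplified sketch of a fair ordering between pools; an illustration, not Spark's implementation.
class FairOrderSketch {
    // Hypothetical snapshot of one pool's settings and current load.
    static final class PoolState {
        final String name;
        final int weight;
        final int minShare;
        final int runningTasks;

        PoolState(String name, int weight, int minShare, int runningTasks) {
            this.name = name;
            this.weight = weight;
            this.minShare = minShare;
            this.runningTasks = runningTasks;
        }
    }

    // Pools that sort first are offered free cores first.
    static final Comparator<PoolState> FAIR_ORDER = (a, b) -> {
        boolean aNeedy = a.runningTasks < a.minShare;
        boolean bNeedy = b.runningTasks < b.minShare;
        if (aNeedy != bNeedy) {
            return aNeedy ? -1 : 1; // a pool below its minShare goes first
        }
        double aRatio;
        double bRatio;
        if (aNeedy) {
            // Both below minShare: the lower fraction of minShare in use goes first.
            aRatio = a.runningTasks / Math.max(a.minShare, 1.0);
            bRatio = b.runningTasks / Math.max(b.minShare, 1.0);
        } else {
            // Both satisfied: fewer running tasks per unit of weight goes first.
            aRatio = a.runningTasks / (double) a.weight;
            bRatio = b.runningTasks / (double) b.weight;
        }
        int byRatio = Double.compare(aRatio, bRatio);
        return byRatio != 0 ? byRatio : a.name.compareTo(b.name);
    };

    // Return the pool names in the order they would be offered resources.
    static List<String> order(PoolState... pools) {
        List<PoolState> sorted = new ArrayList<>(Arrays.asList(pools));
        sorted.sort(FAIR_ORDER);
        List<String> names = new ArrayList<>();
        for (PoolState p : sorted) {
            names.add(p.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // production is below its minShare of 2, test has reached its minShare of 3.
        System.out.println(order(new PoolState("production", 1, 2, 1),
                                 new PoolState("test", 2, 3, 3))); // prints [production, test]
        // Both satisfied: 4/1 = 4.0 for production, 6/2 = 3.0 for test.
        System.out.println(order(new PoolState("production", 1, 2, 4),
                                 new PoolState("test", 2, 3, 6))); // prints [test, production]
    }
}
```

The second case shows the weight at work: test already runs more tasks than production, yet it is still served first because its weight is twice as large.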

Use the spark.scheduler.allocation.file property to tell Spark where the allocation file is:

System.setProperty("spark.scheduler.allocation.file", "/path/to/file")
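Before pointing Spark at an allocation file it can help to check what the file actually declares. The sketch below is plain Java using the JDK's DOM parser, not Spark's own loader; AllocationFileSketch, readPools, and childText are hypothetical names. It assumes that a pool which omits a setting falls back to schedulingMode FIFO, weight 1, and minShare 0.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch: list the pools declared in a fair scheduler allocation file (not Spark's loader).
class AllocationFileSketch {
    // The sample configuration from this post.
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>"
        + "<allocations>"
        + "<pool name=\"production\"><schedulingMode>FAIR</schedulingMode>"
        + "<weight>1</weight><minShare>2</minShare></pool>"
        + "<pool name=\"test\"><schedulingMode>FIFO</schedulingMode>"
        + "<weight>2</weight><minShare>3</minShare></pool>"
        + "</allocations>";

    // Map each pool name to a one-line summary of its settings, in file order.
    static Map<String, String> readPools(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList pools = doc.getElementsByTagName("pool");
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < pools.getLength(); i++) {
                Element pool = (Element) pools.item(i);
                // Assumed defaults for omitted settings: FIFO, weight 1, minShare 0.
                String mode = childText(pool, "schedulingMode", "FIFO");
                int weight = Integer.parseInt(childText(pool, "weight", "1"));
                int minShare = Integer.parseInt(childText(pool, "minShare", "0"));
                result.put(pool.getAttribute("name"),
                           mode + " weight=" + weight + " minShare=" + minShare);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read allocation file", e);
        }
    }

    // Text of the first child element named `tag`, or `fallback` when it is absent.
    static String childText(Element parent, String tag, String fallback) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? fallback : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        System.out.println(readPools(SAMPLE));
        // prints {production=FAIR weight=1 minShare=2, test=FIFO weight=2 minShare=3}
    }
}
```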