Articles published in August 2015

[Linux] Using the noatime option to improve filesystem read performance

Whenever a file is created, modified, or accessed, Linux records the corresponding timestamps. Because every access updates the access time (atime), this becomes a noticeable overhead when files are read frequently. Today we use bonnie++ to get a rough measure of how much performance we gain from mounting with noatime. First, download the latest bonnie++:

# tar xf bonnie++-1.97.tgz
# cd bonnie++-1.97.1
# make

Once compiled, it is ready to use.

Note: the test data size should ideally be about twice the amount of physical RAM.

Before enabling noatime, first benchmark the filesystem as-is:

./bonnie++ -s 31896 -d /export/ -u root -q >> file.csv

The results are as follows:

Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
localhost    31896M   458  99 189663  52 82909  21  2487  98 214994  26 823.4  56
Latency             32591us     566ms     705ms   11924us     252ms     122ms
Version  1.97       ------Sequential Create------ --------Random Create--------
localhost           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 16300  79 +++++ +++ +++++ +++ 14745  74 +++++ +++ 18007  32
Latency             10929us     478us     521us     493us     134us     374us

Next, change the mount options for /export and run the test again.

# vim /etc/fstab
UUID=d41182b5-5092-4f2f-88a3-be619feef512 /export                 ext4    defaults,noatime        1 2

To make the change take effect immediately:

mount -o remount /export

Run the benchmark again:

./bonnie++ -s 31896 -d /export/ -u root -q >> file.csv

The results:

Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
localhost    31896M   497  99 171760  35 93152  21  2276  97 240294  28 755.6  45
Latency             18716us     661ms     539ms   29368us     263ms   79468us
Version  1.97       ------Sequential Create------ --------Random Create--------
localhost           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 18605  93 +++++ +++ +++++ +++ 20520  96 +++++ +++ +++++ +++
Latency              1186us     379us    1297us    1288us     127us    1443us

The raw output is not very readable, so we can convert it to HTML:

cat file.csv | ./bon_csv2html > result.html

Opened in a browser, it looks like this:

[Figure: noatime_test1 — result.html opened in a browser]

We can see the sequential block read rate improved from about 214 MB/s to 240 MB/s. This is only a single run, but in theory noatime should still bring a performance gain, and in a cluster environment it helps overall cluster throughput.

References:

Using the Bonnie++ benchmarking tool (测试工具Bonnie++的使用)

[HADOOP issue] Fixing NodeManager OOM crashes

After upgrading the cluster from JDK 1.6.0_25 to JDK 1.7.0_45, NodeManagers started dying frequently with errors like the following:

2015-08-12 16:35:06,662 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[process reaper,10,system] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.UNIXProcess$ProcessPipeInputStream.drainInputStream(UNIXProcess.java:267)
at java.lang.UNIXProcess$ProcessPipeInputStream.processExited(UNIXProcess.java:280)
at java.lang.UNIXProcess.processExited(UNIXProcess.java:187)
at java.lang.UNIXProcess$3.run(UNIXProcess.java:175)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

and, similarly:

2015-08-12 16:37:56,893 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[process reaper,10,system] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: Java heap space
at java.lang.UNIXProcess$ProcessPipeInputStream.drainInputStream(UNIXProcess.java:267)
at java.lang.UNIXProcess$ProcessPipeInputStream.processExited(UNIXProcess.java:280)
at java.lang.UNIXProcess.processExited(UNIXProcess.java:187)
at java.lang.UNIXProcess$3.run(UNIXProcess.java:175)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Searching Google for the keywords hadoop UNIXProcess drainInputStream turned up several JDK 7 bugs that can cause OOM when the NodeManager is under heavy load. See HADOOP-10146 for details,

along with some related explanations:

JDK-8027348

JDK-8024521

After switching to JDK 1.7.0_67, the OOM problem no longer occurred.

[YARN] How the MRAppMaster heartbeat works

Recently the cluster hit a problem: while jobs were running, an AM would time out after 10 minutes and be killed, yet the job succeeded when rerun. The problem appeared at random, so the initial suspicion was that either the AM's heartbeat reporting broke, or the RM was so busy that it hung and some mechanism left the AM waiting 10 minutes without reporting a heartbeat. So let's first understand how the AM reports heartbeats to the RM.

Inside MRAppMaster, ContainerAllocatorRouter is responsible for requesting resources from the RM (i.e. sending heartbeats).

[Figure: RMAM]

The ultimate parent class of RMContainerAllocator is RMCommunicator, which implements the RMHeartbeatHandler interface:

public interface RMHeartbeatHandler {
  long getLastHeartbeatTime(); // get the time of the last heartbeat
  void runOnNextHeartbeat(Runnable callback); // queue a callback to run after the next heartbeat
}

Each time a heartbeat returns, the callbacks registered in heartbeatCallbacks are executed:

allocatorThread = new Thread(new Runnable() {
  @Override
  public void run() {
    while (!stopped.get() && !Thread.currentThread().isInterrupted()) {
      ......
      heartbeat();
      lastHeartbeatTime = context.getClock().getTime(); // record the last heartbeat time
      executeHeartbeatCallbacks();                      // run the registered callbacks
      ....
    }
  }
});

In the RMCommunicator class:

private void executeHeartbeatCallbacks() {
    Runnable callback = null;
    while ((callback = heartbeatCallbacks.poll()) != null) {
      callback.run();
    }
  }
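
To see the consumer side of this mechanism, here is a small hypothetical example (not code from MRAppMaster; the class and method names are made up for illustration) of how a component holding a reference to the RMHeartbeatHandler could queue work to run right after the next heartbeat returns:

import org.apache.hadoop.mapreduce.v2.app.rm.RMHeartbeatHandler;

// Hypothetical helper, for illustration only.
public class HeartbeatCallbackExample {
  // Queue a Runnable so that executeHeartbeatCallbacks() runs it
  // right after the next heartbeat comes back from the RM.
  public static void logAfterNextHeartbeat(final RMHeartbeatHandler handler) {
    final long previous = handler.getLastHeartbeatTime();
    handler.runOnNextHeartbeat(new Runnable() {
      @Override
      public void run() {
        System.out.println("Heartbeat completed; previous heartbeat was at " + previous);
      }
    });
  }
}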

When RMCommunicator starts, it first registers with the RM, telling it the AM's host and port, and then starts a thread (startAllocatorThread) that periodically calls the heartbeat method implemented in RMContainerAllocator, which requests resources from the RM and at the same time tells the RM that the AM is still alive.

When the AM initializes, it also initializes RMCommunicator:

protected void serviceStart() throws Exception {
  scheduler = createSchedulerProxy(); // get a proxy to the RM
  register();                         // register with the RM
  startAllocatorThread();             // start the heartbeat thread
  ....
}

The event handling flow of the AM's ContainerAllocatorRouter is shown in the figure below:

[Figure: RMALLO — ContainerAllocatorRouter event handling flow]

Registration flow:

RMCommunicator makes a remote call to the registerApplicationMaster method of ApplicationMasterService, which sets up and maintains the responseId and then adds the attempt to AMLivelinessMonitor, which uses a map to record the last heartbeat time in order to detect AMs that have gone too long without a heartbeat. If an AM fails to send heartbeat updates for too long, the RM notifies the NodeManager to remove the AM.
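
To make the expiry mechanism concrete, below is a minimal sketch of the idea behind AMLivelinessMonitor (the real class builds on org.apache.hadoop.yarn.util.AbstractLivelinessMonitor; this simplified version and its names are made up for illustration): a map from attempt id to last heartbeat time is scanned periodically, and any entry older than the expiry interval is expired.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified illustration of a liveness monitor; not the actual YARN implementation.
public class SimpleLivenessMonitor {
  private final Map<String, Long> lastHeard = new ConcurrentHashMap<String, Long>();
  private final long expireIntervalMs;

  public SimpleLivenessMonitor(long expireIntervalMs) {
    this.expireIntervalMs = expireIntervalMs; // e.g. 600000 ms (10 min) for AMs
  }

  // called when the AM registers with the RM
  public void register(String attemptId) {
    lastHeard.put(attemptId, System.currentTimeMillis());
  }

  // called on every heartbeat to refresh the timestamp
  public void receivedPing(String attemptId) {
    lastHeard.put(attemptId, System.currentTimeMillis());
  }

  // run periodically by a background thread: expire AMs that have gone silent
  public void checkExpired() {
    long now = System.currentTimeMillis();
    for (Map.Entry<String, Long> e : lastHeard.entrySet()) {
      if (now - e.getValue() > expireIntervalMs) {
        lastHeard.remove(e.getKey());
        expire(e.getKey());
      }
    }
  }

  // in YARN, expiry is what eventually leads the RM to fail the attempt
  protected void expire(String attemptId) {
    System.out.println("AM attempt expired: " + attemptId);
  }
}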

Heartbeat thread:

Sending a heartbeat is also the process of acquiring resources:

@Override
protected synchronized void heartbeat() throws Exception {
  scheduleStats.updateAndLogIfChanged("Before Scheduling: ");
  List<Container> allocatedContainers = getResources(); // the key call
  if (allocatedContainers.size() > 0) {
    scheduledRequests.assign(allocatedContainers);
  }
  ......
}

Fetching resources:

private List<Container> getResources() throws Exception {
  ...
  response = makeRemoteRequest(); // talk to the RM
  ...
  // handle any command sent back by the RM first
  if (response.getAMCommand() != null) {
    switch (response.getAMCommand()) {
      case AM_RESYNC:
      case AM_SHUTDOWN:
        eventHandler.handle(new JobEvent(this.getJob().getID(),
            JobEventType.JOB_AM_REBOOT));
        throw new YarnRuntimeException("Resource Manager doesn't recognize AttemptId: " +
            this.getContext().getApplicationID());
      default:
        ....
    }
  }
  // ... followed by a series of further processing steps
}

Building the request:

protected AllocateResponse makeRemoteRequest() throws IOException {
  AllocateRequest allocateRequest =
      AllocateRequest.newInstance(lastResponseID,
          super.getApplicationProgress(), new ArrayList<ResourceRequest>(ask),
          new ArrayList<ContainerId>(release), blacklistRequest);
  AllocateResponse allocateResponse;
  allocateResponse = scheduler.allocate(allocateRequest); // RPC call to ApplicationMasterService.allocate
  .....
}

Every heartbeat call refreshes the AM's timestamp in AMLivelinessMonitor, signalling that the AM is still alive (the expiry interval is controlled by yarn.am.liveness-monitor.expiry-interval-ms, which defaults to 600000 ms, i.e. the 10 minutes mentioned at the start).

From the code we can also see that resource requests are packed into an ask, i.e. an ArrayList of ResourceRequest objects. For example:

priority:20 host:host9 capability:<memory:2048, vCores:1>
priority:20 host:host2 capability:<memory:2048, vCores:1>
priority:20 host:host10 capability:<memory:2048, vCores:1>
priority:20 host:/rack/rack3203 capability:<memory:2048, vCores:1>
priority:20 host:/rack/rack3202 capability:<memory:2048, vCores:1>
priority:20 host:* capability:<memory:2048, vCores:1>

So how is the ask built?

The contents of the ask are modified by the addMap, addReduce, and assign methods in RMContainerAllocator, through the call chain:

addContainerReq --> addResourceRequest --> addResourceRequestToAsk;

By adding our own log statements to the code, we can see that each request is expanded to the node-local, rack-local, and ANY levels, as the sketch below illustrates.
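
As a rough sketch (not the actual RMContainerAllocator code; the host and rack names are just examples), assembling the requests for a single map task at those three levels might look like this:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class AskSketch {
  public static void main(String[] args) {
    Priority mapPriority = Priority.newInstance(20);      // priority 20 is used for map tasks
    Resource capability = Resource.newInstance(2048, 1);  // <memory:2048, vCores:1>

    List<ResourceRequest> ask = new ArrayList<ResourceRequest>();
    // node-local: the host that holds the input split (example host name)
    ask.add(ResourceRequest.newInstance(mapPriority, "host9", capability, 1));
    // rack-local: the rack that host belongs to (example rack name)
    ask.add(ResourceRequest.newInstance(mapPriority, "/rack/rack3203", capability, 1));
    // ANY ("*"): allow the RM to fall back to any node in the cluster
    ask.add(ResourceRequest.newInstance(mapPriority, ResourceRequest.ANY, capability, 1));

    // on the next heartbeat this list is wrapped in an AllocateRequest (see makeRemoteRequest above)
    for (ResourceRequest rr : ask) {
      System.out.println(rr);
    }
  }
}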

These eventually become an ask list that is sent to the RM:

 ask Capability:<memory:2048, vCores:1> ResourceName:* NumContainers:384 Priority:20 RelaxLocality:true
 ask Capability:<memory:2048, vCores:1> ResourceName:/rack/rack3201 NumContainers:227 Priority:20 RelaxLocality:true
 ask Capability:<memory:2048, vCores:1> ResourceName:/rack/rack3202 NumContainers:231 Priority:20 RelaxLocality:true
 ask Capability:<memory:2048, vCores:1> ResourceName:/rack/rack3203 NumContainers:152 Priority:20 RelaxLocality:true
 ask Capability:<memory:2048, vCores:1> ResourceName:/rack/rack3204 NumContainers:158 Priority:20 RelaxLocality:true
 ask Capability:<memory:2048, vCores:1> ResourceName:host1 NumContainers:46 Priority:20 RelaxLocality:true
 ask Capability:<memory:2048, vCores:1> ResourceName:host5 NumContainers:52 Priority:20 RelaxLocality:true
 ask Capability:<memory:2048, vCores:1> ResourceName:host6 NumContainers:38 Priority:20 RelaxLocality:true

The corresponding log line looks like:

getResources() for application_1438330253091_0004: ask=29 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:0, vCores:0> knownNMs=24

Summary:

Besides understanding the heartbeat mechanism, I also learned a lot about how Map and Reduce containers are allocated, which was well worth the effort.