Java教程:9种常见的CMS GC问题详解(三)

开课吧开课吧锤锤2021-03-22 10:13

    Java编程语言是一种简单、面向对象、分布式、解释型、健壮安全、与系统无关、可移植、高性能、多线程和动态的语言。如今Java已经广泛应用于各个领域的编程开发。

Java

    3.GC问题判断

    在做GC问题排查和优化之前,我们需要先来明确下到底是不是GC直接导致的问题,或者应用代码导致的GC异常,最终出现问题。

    3.1判断GC有没有问题?

    3.1.1设定评价标准

    评判GC的两个核心指标:

    延迟(Latency):也可以理解为最大停顿时间,即垃圾收集过程中一次STW的最长时间,越短越好,一定程度上可以接受频次的增大,GC技术的主要发展方向。

    吞吐量(Throughput):应用系统的生命周期内,由于GC线程会占用Mutator当前可用的CPU时钟周期,吞吐量即为Mutator有效花费的时间占系统总运行时间的百分比,例如系统运行了100min,GC耗时1min,则系统吞吐量为99%,吞吐量优先的收集器可以接受较长的停顿。

    目前各大互联网公司的系统基本都更追求低延时,避免一次GC停顿的时间过长对用户体验造成损失,衡量指标需要结合一下应用服务的SLA,主要如下两点来判断:

Java

    简而言之,即为一次停顿的时间不超过应用服务的TP9999,GC的吞吐量不小于99.99%。举个例子,假设某个服务A的TP9999为80ms,平均GC停顿为30ms,那么该服务的最大停顿时间最好不要超过80ms,GC频次控制在5min以上一次。如果满足不了,那就需要调优或者通过更多资源来进行并联冗余。(大家可以先停下来,看看监控平台上面的gc.meantime分钟级别指标,如果超过了6ms那单机GC吞吐量就达不到4个9了。)

    备注:除了这两个指标之外还有Footprint(资源量大小测量)、反应速度等指标,互联网这种实时系统追求低延迟,而很多嵌入式系统则追求Footprint。

    3.1.2读懂GCCause

    拿到GC日志,我们就可以简单分析GC情况了,通过一些工具,我们可以比较直观地看到Cause的分布情况,如下图就是使用gceasy绘制的图表:

Java

    如上图所示,我们很清晰的就能知道是什么原因引起的GC,以及每次的时间花费情况,但是要分析GC的问题,先要读懂GCCause,即JVM什么样的条件下选择进行GC操作,具体Cause的分类可以看一下Hotspot源码:src/share/vm/gc/shared/gcCause.hpp和src/share/vm/gc/shared/gcCause.cpp中。

const char* GCCause::to_string(GCCause::Cause cause) {
  switch (cause) {
    case _java_lang_system_gc:
      return "System.gc()";

    case _full_gc_alot:
      return "FullGCAlot";

    case _scavenge_alot:
      return "ScavengeAlot";

    case _allocation_profiler:
      return "Allocation Profiler";

    case _jvmti_force_gc:
      return "JvmtiEnv ForceGarbageCollection";

    case _gc_locker:
      return "GCLocker Initiated GC";

    case _heap_inspection:
      return "Heap Inspection Initiated GC";

    case _heap_dump:
      return "Heap Dump Initiated GC";

    case _wb_young_gc:
      return "WhiteBox Initiated Young GC";

    case _wb_conc_mark:
      return "WhiteBox Initiated Concurrent Mark";

    case _wb_full_gc:
      return "WhiteBox Initiated Full GC";

    case _no_gc:
      return "No GC";

    case _allocation_failure:
      return "Allocation Failure";

    case _tenured_generation_full:
      return "Tenured Generation Full";

    case _metadata_GC_threshold:
      return "Metadata GC Threshold";

    case _metadata_GC_clear_soft_refs:
      return "Metadata GC Clear Soft References";

    case _cms_generation_full:
      return "CMS Generation Full";

    case _cms_initial_mark:
      return "CMS Initial Mark";

    case _cms_final_remark:
      return "CMS Final Remark";

    case _cms_concurrent_mark:
      return "CMS Concurrent Mark";

    case _old_generation_expanded_on_last_scavenge:
      return "Old Generation Expanded On Last Scavenge";

    case _old_generation_too_full_to_scavenge:
      return "Old Generation Too Full To Scavenge";

    case _adaptive_size_policy:
      return "Ergonomics";

    case _g1_inc_collection_pause:
      return "G1 Evacuation Pause";

    case _g1_humongous_allocation:
      return "G1 Humongous Allocation";

    case _dcmd_gc_run:
      return "Diagnostic Command";

    case _last_gc_cause:
      return "ILLEGAL VALUE - last gc cause - ILLEGAL VALUE";

    default:
      return "unknown GCCause";
  }
  ShouldNotReachHere();
}

    重点需要关注的几个GCCause:

    System.gc():手动触发GC操作。

    CMS:CMSGC在执行过程中的一些动作,重点关注CMSInitialMark和CMSFinalRemark两个STW阶段。

    PromotionFailure:Old区没有足够的空间分配给Young区晋升的对象(即使总可用内存足够大)。

    ConcurrentModeFailure:CMSGC运行期间,Old区预留的空间不足以分配给新的对象,此时收集器会发生退化,严重影响GC性能,下面的一个案例即为这种场景。

    GCLockerInitiatedGC:如果线程执行在JNI临界区时,刚好需要进行GC,此时GCLocker将会阻止GC的发生,同时阻止其他线程进入JNI临界区,直到最后一个线程退出临界区时触发一次GC。

    什么时机使用这些Cause触发回收,大家可以看一下CMS的代码,这里就不讨论了,具体在/src/hotspot/share/gc/cms/concurrentMarkSweepGeneration.cpp中。

bool CMSCollector::shouldConcurrentCollect() {
  LogTarget(Trace, gc) log;

  if (_full_gc_requested) {
    log.print("CMSCollector: collect because of explicit  gc request (or GCLocker)");
    return true;
  }

  FreelistLocker x(this);
  // ------------------------------------------------------------------
  // Print out lots of information which affects the initiation of
  // a collection.
  if (log.is_enabled() && stats().valid()) {
    log.print("CMSCollector shouldConcurrentCollect: ");

    LogStream out(log);
    stats().print_on(&out);

    log.print("time_until_cms_gen_full %3.7f", stats().time_until_cms_gen_full());
    log.print("free=" SIZE_FORMAT, _cmsGen->free());
    log.print("contiguous_available=" SIZE_FORMAT, _cmsGen->contiguous_available());
    log.print("promotion_rate=%g", stats().promotion_rate());
    log.print("cms_allocation_rate=%g", stats().cms_allocation_rate());
    log.print("occupancy=%3.7f", _cmsGen->occupancy());
    log.print("initiatingOccupancy=%3.7f", _cmsGen->initiating_occupancy());
    log.print("cms_time_since_begin=%3.7f", stats().cms_time_since_begin());
    log.print("cms_time_since_end=%3.7f", stats().cms_time_since_end());
    log.print("metadata initialized %d", MetaspaceGC::should_concurrent_collect());
  }
  // ------------------------------------------------------------------

  // If the estimated time to complete a cms collection (cms_duration())
  // is less than the estimated time remaining until the cms generation
  // is full, start a collection.
  if (!UseCMSInitiatingOccupancyOnly) {
    if (stats().valid()) {
      if (stats().time_until_cms_start() == 0.0) {
        return true;
      }
    } else {
   
      if (_cmsGen->occupancy() >= _bootstrap_occupancy) {
        log.print(" CMSCollector: collect for bootstrapping statistics: occupancy = %f, boot occupancy = %f",
                  _cmsGen->occupancy(), _bootstrap_occupancy);
        return true;
      }
    }
  }
  if (_cmsGen->should_concurrent_collect()) {
    log.print("CMS old gen initiated");
    return true;
  }

  CMSHeap* heap = CMSHeap::heap();
  if (heap->incremental_collection_will_fail(true /* consult_young */)) {
    log.print("CMSCollector: collect because incremental collection will fail ");
    return true;
  }

  if (MetaspaceGC::should_concurrent_collect()) {
    log.print("CMSCollector: collect for metadata allocation ");
    return true;
  }

  // CMSTriggerInterval starts a CMS cycle if enough time has passed.
  if (CMSTriggerInterval >= 0) {
    if (CMSTriggerInterval == 0) {
      // Trigger always
      return true;
    }

    // Check the CMS time since begin (we do not check the stats validity
    // as we want to be able to trigger the first CMS cycle as well)
    if (stats().cms_time_since_begin() >= (CMSTriggerInterval / ((double) MILLIUNITS))) {
      if (stats().valid()) {
        log.print("CMSCollector: collect because of trigger interval (time since last begin %3.7f secs)",
                  stats().cms_time_since_begin());
      } else {
        log.print("CMSCollector: collect because of trigger interval (first collection)");
      }
      return true;
    }
  }

  return false;
}

    以上内容由开课吧老师、新宇、湘铭、祥璞提供,更多Java教程尽在开课吧广场Java教程频道。

有用
分享