Why is locking in Go much slower than in Java? Lots of time spent in Mutex.Lock() and Mutex.Unlock()

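I wrote a small Go library (go-patan) that collects a running min/max/avg/stddev of certain variables. I compared it to an equivalent Java implementation (patan), and to my surprise the Java implementation is much faster. I would like to understand why.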

The library basically consists of a simple data store and a lock that serializes reads and writes. This is a snippet of the code:

type Store struct {
  durations map[string]*Distribution
  counters  map[string]int64
  samples   map[string]*Distribution
  lock      *sync.Mutex
}

func (store *Store) addSample(key string, value int64) {
  store.addToStore(store.samples, key, value)
}

func (store *Store) addDuration(key string, value int64) {
  store.addToStore(store.durations, key, value)
}

func (store *Store) addToCounter(key string, value int64) {
  store.lock.Lock()
  defer store.lock.Unlock()
  store.counters[key] = store.counters[key] + value
}

func (store *Store) addToStore(destination map[string]*Distribution, key string, value int64) {
  store.lock.Lock()
  defer store.lock.Unlock()
  distribution, exists := destination[key]
  if !exists {
    distribution = NewDistribution()
    destination[key] = distribution
  }
  distribution.addSample(value)
}

I have benchmarked both the Go and the Java implementation (go-benchmark-gist, java-benchmark-gist). Java wins by a wide margin, but I don't understand why:

Go Results:
10 threads with 20000 items took 133 millis
100 threads with 20000 items took 1809 millis
1000 threads with 20000 items took 17576 millis
10 threads with 200000 items took 1228 millis
100 threads with 200000 items took 17900 millis
Java Results:
10 threads with 20000 items takes 89 millis
100 threads with 20000 items takes 265 millis
1000 threads with 20000 items takes 2888 millis  
10 threads with 200000 items takes 311 millis
100 threads with 200000 items takes 3067 millis

I profiled the program with Go's pprof and generated a call graph. It shows that the program spends essentially all of its time in sync.(*Mutex).Lock() and sync.(*Mutex).Unlock().
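For readers who want to reproduce this kind of profile, here is a minimal sketch using the standard runtime/pprof package. The original benchmark lives in a gist that is not shown here, so hotLoop is only a hypothetical stand-in (goroutines contending on a single mutex), and cpu.prof is an assumed output file name.

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"sync"
)

// hotLoop is a hypothetical stand-in for the benchmark: several goroutines
// contend on one mutex, so lock traffic dominates the resulting profile.
func hotLoop(goroutines, items int) int64 {
	var (
		mu    sync.Mutex
		total int64
		wg    sync.WaitGroup
	)
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < items; i++ {
				mu.Lock()
				total++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return total
}

func main() {
	// cpu.prof is an assumed file name; any writable path works.
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Sample the CPU until StopCPUProfile runs at the end of main.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	log.Println("total:", hotLoop(100, 20000))
}
```

After running the program, `go tool pprof cpu.prof` opens the interactive prompt where `top20` prints a listing like the one in this question.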

The top 20 calls according to the profiler:

(pprof) top20
59110ms of 73890ms total (80.00%)
Dropped 22 nodes (cum <= 369.45ms)
Showing top 20 nodes out of 65 (cum >= 50220ms)
      flat  flat%   sum%        cum   cum%
    8900ms 12.04% 12.04%     8900ms 12.04%  runtime.futex
    7270ms  9.84% 21.88%     7270ms  9.84%  runtime/internal/atomic.Xchg
    7020ms  9.50% 31.38%     7020ms  9.50%  runtime.procyield
    4560ms  6.17% 37.56%     4560ms  6.17%  sync/atomic.CompareAndSwapUint32
    4400ms  5.95% 43.51%     4400ms  5.95%  runtime/internal/atomic.Xadd
    4210ms  5.70% 49.21%    22040ms 29.83%  runtime.lock
    3650ms  4.94% 54.15%     3650ms  4.94%  runtime/internal/atomic.Cas
    3260ms  4.41% 58.56%     3260ms  4.41%  runtime/internal/atomic.Load
    2220ms  3.00% 61.56%    22810ms 30.87%  sync.(*Mutex).Lock
    1870ms  2.53% 64.10%     1870ms  2.53%  runtime.osyield
    1540ms  2.08% 66.18%    16740ms 22.66%  runtime.findrunnable
    1430ms  1.94% 68.11%     1430ms  1.94%  runtime.freedefer
    1400ms  1.89% 70.01%     1400ms  1.89%  sync/atomic.AddUint32
    1250ms  1.69% 71.70%     1250ms  1.69%  github.com/toefel18/go-patan/statistics/lockbased.(*Distribution).addSample
    1240ms  1.68% 73.38%     3140ms  4.25%  runtime.deferreturn
    1070ms  1.45% 74.83%     6520ms  8.82%  runtime.systemstack
    1010ms  1.37% 76.19%     1010ms  1.37%  runtime.newdefer
    1000ms  1.35% 77.55%     1000ms  1.35%  runtime.mapaccess1_faststr
     950ms  1.29% 78.83%    15660ms 21.19%  runtime.semacquire
     860ms  1.16% 80.00%    50220ms 67.97%  main.Benchmrk.func1

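Can someone explain why locking in Go seems to be so much slower than in Java, and what I am doing wrong? I also wrote a channel-based implementation in Go, but it is even slower.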
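The channel-based implementation is not shown in the question. As a rough sketch of what such a design usually looks like, assuming a single goroutine that owns the map and receives updates over a buffered channel (ChanStore, update, and the buffer size are all hypothetical names and values, not taken from go-patan):

```go
package main

import "fmt"

// update is a hypothetical message: add value to the counter named key.
type update struct {
	key   string
	value int64
}

// ChanStore is an illustrative channel-based counter store. One goroutine
// owns the map, so no mutex is needed; every write is a channel send.
type ChanStore struct {
	updates chan update
	done    chan map[string]int64
}

func NewChanStore() *ChanStore {
	s := &ChanStore{
		updates: make(chan update, 1024), // buffer size is an arbitrary choice
		done:    make(chan map[string]int64),
	}
	go func() {
		counters := make(map[string]int64)
		for u := range s.updates {
			counters[u.key] += u.value
		}
		// updates was closed: hand the final state back to the caller.
		s.done <- counters
	}()
	return s
}

// AddToCounter queues an update for the owner goroutine.
func (s *ChanStore) AddToCounter(key string, value int64) {
	s.updates <- update{key, value}
}

// Close stops the owner goroutine and returns the final counters.
func (s *ChanStore) Close() map[string]int64 {
	close(s.updates)
	return <-s.done
}

func main() {
	s := NewChanStore()
	for i := 0; i < 1000; i++ {
		s.AddToCounter("requests", 1)
	}
	fmt.Println(s.Close()["requests"]) // prints 1000
}
```

Go channels are themselves guarded by a lock inside the runtime, so each send still pays for synchronization plus a handoff to another goroutine. That is one plausible reason a design like this does not beat a plain mutex for such a small critical section.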

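It is best to avoid defer in tiny functions that need high performance, since it is expensive. In most other cases there is no need to avoid it, because the cost of defer is small compared to the code around it.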
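As a minimal sketch of that advice, here is addToCounter with and without defer. counterStore is a trimmed-down, hypothetical version of the Store from the question, and newCounterStore is a helper added only for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// counterStore is a trimmed-down, hypothetical version of Store.
type counterStore struct {
	lock     *sync.Mutex
	counters map[string]int64
}

func newCounterStore() *counterStore {
	return &counterStore{lock: &sync.Mutex{}, counters: make(map[string]int64)}
}

// addToCounterDefer mirrors the original: Unlock runs via defer.
func (s *counterStore) addToCounterDefer(key string, value int64) {
	s.lock.Lock()
	defer s.lock.Unlock()
	s.counters[key] += value
}

// addToCounter unlocks explicitly. That is safe here because the critical
// section is a single statement with no early return.
func (s *counterStore) addToCounter(key string, value int64) {
	s.lock.Lock()
	s.counters[key] += value
	s.lock.Unlock()
}

func main() {
	s := newCounterStore()
	var wg sync.WaitGroup
	for g := 0; g < 10; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done() // defer is fine here: it runs once per goroutine
			for i := 0; i < 1000; i++ {
				s.addToCounter("hits", 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println(s.counters["hits"]) // prints 10000
}
```

The explicit Unlock only pays off in hot, tiny functions like this one. Where a function has several return paths or can panic while holding the lock, defer remains the safer choice.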

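I would also suggest declaring the field as lock sync.Mutex instead of using a pointer. The pointer creates a small amount of extra work for the programmer (initialization, nil bugs) and a small amount of extra work for the garbage collector.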
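A short sketch of the value-typed mutex, under the assumption that the store is always handled through a pointer. Store is simplified to the counters map here, and NewStore, AddToCounter, and Counter are illustrative names rather than the library's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// Store holds the mutex by value. The zero value of sync.Mutex is an
// unlocked mutex, so nothing has to be allocated and nothing can be nil.
type Store struct {
	lock     sync.Mutex
	counters map[string]int64
}

// NewStore only has to initialize the map. Returning *Store matters:
// a struct that contains a sync.Mutex must not be copied after first use.
func NewStore() *Store {
	return &Store{counters: make(map[string]int64)}
}

func (s *Store) AddToCounter(key string, value int64) {
	s.lock.Lock()
	s.counters[key] += value
	s.lock.Unlock()
}

// Counter reads a value under the same lock.
func (s *Store) Counter(key string) int64 {
	s.lock.Lock()
	v := s.counters[key]
	s.lock.Unlock()
	return v
}

func main() {
	s := NewStore()
	s.AddToCounter("hits", 2)
	s.AddToCounter("hits", 3)
	fmt.Println(s.Counter("hits")) // prints 5
}
```

Keeping pointer receivers on every method is what makes this safe. `go vet` reports code that copies a struct containing a mutex by value.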

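I also posted this question on the golang-nuts group. The reply from Jesper Louis Andersen explains quite well that Java uses synchronization optimization techniques such as lock escape analysis/lock elision and lock coarsening.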

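The Java JIT might be taking the lock once and allowing multiple updates within that single lock to improve performance. I ran the Java benchmark with -Djava.compiler=NONE, which disables the JIT compiler. That changed the performance dramatically, but it is not a fair comparison.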
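As far as I know, the Go compiler does not coarsen locks automatically, but the same effect can be had by hand when the caller already has several values available. The sketch below assumes a simplified Store with only a counters map; AddManyToCounter is a hypothetical method, not part of go-patan.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a simplified stand-in for the store in the question.
type Store struct {
	lock     sync.Mutex
	counters map[string]int64
}

func NewStore() *Store {
	return &Store{counters: make(map[string]int64)}
}

// AddToCounter takes the lock once per update, as in the question.
func (s *Store) AddToCounter(key string, value int64) {
	s.lock.Lock()
	s.counters[key] += value
	s.lock.Unlock()
}

// AddManyToCounter is a hypothetical batched variant: a single Lock/Unlock
// pair covers the whole slice, which is lock coarsening done manually.
func (s *Store) AddManyToCounter(key string, values []int64) {
	s.lock.Lock()
	for _, v := range values {
		s.counters[key] += v
	}
	s.lock.Unlock()
}

// Counter reads a value under the lock.
func (s *Store) Counter(key string) int64 {
	s.lock.Lock()
	defer s.lock.Unlock()
	return s.counters[key]
}

func main() {
	s := NewStore()
	batch := make([]int64, 1000)
	for i := range batch {
		batch[i] = 1
	}

	var wg sync.WaitGroup
	for g := 0; g < 10; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.AddManyToCounter("hits", batch) // 1 lock acquisition instead of 1000
		}()
	}
	wg.Wait()
	fmt.Println(s.Counter("hits")) // prints 10000
}
```

The trade-off is latency: other goroutines wait longer while a large batch holds the lock, so the batch size has to be chosen with that in mind.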

I assume that many of these optimization techniques have less impact in a production environment.
