
2024-03-07 20:31:31

The Ultimate Performance Optimization Technique: Profile-Guided Optimization (PGO) - Zhihu

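Author: koka (Tencent Technical Engineering)

When we tune performance we usually reach for a familiar toolbox: lowering algorithmic complexity (for example from O(N) to O(log N)), refining lock granularity or going lock-free, pooling (memory pools, connection pools, thread pools, coroutine pools), compression, prefetching, caching, batching, SIMD, memory alignment, and so on. Once those are exhausted there is still one more technique: Profile-Guided Optimization (PGO). This article covers how PGO works and how to apply it in Go and C++.

Profile-guided optimization (PGO), also called feedback-directed optimization (FDO), recompiles a program using profile data collected while it runs, so that the compiler can optimize for the behavior actually observed. It is a general technique and is not tied to any one language.

1. How Profile-Guided Optimization (PGO) works

PGO first profiles the program: it collects data from real runs into a profile file and then uses that file to guide optimization, for example by shrinking code size, reducing branch mispredictions, and reorganizing code layout to reduce instruction-cache problems. The profile tells the compiler which regions of code execute most often, so the compiler can optimize those regions specifically.

PGO generally consists of the following three steps; details differ slightly between toolchains, as discussed later.

Step 1: Compile with extra compiler or linker options so that running the binary in step 2 can produce a profile. With clang these include -fprofile-instr-generate, -fdebug-info-for-profiling, and -funique-internal-linkage-names.

Step 2: Run the executable built in step 1 to produce the profile. There are two common approaches. The first is instrumentation, as with clang's -fprofile-instr-generate: counters are inserted at build time and the program writes the profile when it runs. The second, known as AutoFDO, samples the program while it runs. C++ programs can use perf; Go makes this even easier, because runtime/pprof or net/http/pprof can collect the profile directly.

Step 3: Recompile using the profile from step 2, removing the options added in step 1 where necessary, to produce the new executable.

1.1 Reducing branch mispredictions

A simple if statement shows why fewer branch mispredictions means better performance. Consider the following code:

if condition {
    // logic 1
} else {
    // logic 2
}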

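At compile time the compiler cannot assume how likely condition is to be true or false, so it lays the code out in source order: if condition is true, execution falls through into logic 1; otherwise it jumps to the else block and runs logic 2. A CPU executes instructions in order, and its pipeline fetches and speculatively executes the instructions that immediately follow the current one. If condition is true, the pipeline flows straight through with no jump. If condition is false, the speculative work done on logic 1 is discarded, instructions must be reloaded from the else target, and logic 2 runs from scratch, which noticeably reduces throughput. So this snippet is efficient when condition is usually true and inefficient otherwise.

By collecting a pprof profile at run time, we learn how often the if branch and the else branch are actually taken. With that data, if the else branch is taken more often (the larger the gap, the larger the benefit), a PGO build can rearrange the generated machine code to favor logic 2. In the simplest case this amounts to replacing a je (jump if equal) with an equivalent jne (jump if not equal).

2. PGO in practice with Go

Go has supported PGO since Go 1.20, where it was off by default (-pgo=off); since Go 1.21 the default is -pgo=auto. In the cases I tested, Go 1.20 showed little benefit and Go 1.21 showed more. Now that Go 1.21 has been released, I recommend using Go 1.21 or later.

2.1 Collecting a profile

Go's PGO takes a CPU pprof profile as input. Conveniently, profile generation is built into the runtime (runtime/pprof and net/http/pprof), so a profile can be collected directly. Profiles in other formats, such as the Linux perf data mentioned above, can also be used if they meet some basic prerequisites and are converted to the pprof format.

The simplest approach is to fetch 30 seconds of data from any instance of the service:

curl -o cpu.pprof "http://localhost:8080/debug/pprof/profile?seconds=30"

A single 30-second profile may not be representative, for reasons such as these:

- The instance may be idle while it is being profiled, even though it is usually busy.
- The instance's traffic may have changed on a given day, which changes its behavior.
- The service may perform different kinds of operations at different times, so a 30-second window may cover only one kind.
- The instance may be receiving abnormal traffic.

A more robust approach is to collect profiles from different instances at different times and merge them into one file for PGO, which limits the influence of any single profile:

go tool pprof -proto a.pprof b.pprof > merged.pprof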
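For programs that do not expose the net/http/pprof endpoint, the same kind of CPU profile can be written in-process with runtime/pprof. The following is a minimal sketch of that idea, not a production collector: busyWork, collectCPUProfile, profileSize, and the cpu.pprof file name are hypothetical names chosen for illustration, and the workload is a placeholder for real traffic.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime/pprof"
)

// busyWork is a hypothetical stand-in for real request handling; it only burns CPU.
func busyWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// collectCPUProfile records a CPU profile to path while work runs.
func collectCPUProfile(path string, work func()) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		f.Close()
		return err
	}
	work()
	pprof.StopCPUProfile() // stops sampling and flushes the profile to f
	return f.Close()
}

// profileSize reports the size in bytes of the file at path.
func profileSize(path string) (int64, error) {
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	path := filepath.Join(os.TempDir(), "cpu.pprof")
	err := collectCPUProfile(path, func() { busyWork(200_000_000) })
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	size, _ := profileSize(path)
	fmt.Printf("wrote %s (%d bytes)\n", path, size)
}
```

A file produced this way is in the same pprof format as the one fetched with curl, so it can be merged with go tool pprof -proto in the same way.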

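Note that profiles should be collected from the production environment so that they reflect real behavior; that is where PGO works best. Unit tests and narrow benchmarks are poor inputs for PGO, because they exercise only a small part of the program and the gains are correspondingly small.

2.2 Iterative PGO builds

As noted above, Go 1.21 or later is recommended. The standard workflow is to place a default.pgo file in the main package directory; the Go toolchain detects default.pgo and enables PGO automatically. Alternatively, pass the profile path explicitly:

go build -pgo=/pprof/main.pprof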

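Because the program keeps evolving, steps 2 and 3 form a loop. The profile from step 2 may have been collected from source code that differs from the latest source; Go's PGO implementation is robust to this, a property called source stability. Likewise, after one iteration the binary being profiled has itself already been optimized with the previous profile; Go's PGO is robust to this as well, a property called iterative stability.

2.2.1 Source stability

Source stability is achieved by using heuristics to match samples in the profile against the source being compiled. As a result, many source changes, such as adding new features, have no effect on matching existing code. When the compiler cannot match changed code, some optimizations are lost, but this is a graceful degradation: a single function that fails to match may miss an optimization, while the overall PGO benefit is usually spread across many functions.

Go's PGO makes a best effort to keep matching samples from an old profile to the current source. Specifically, Go uses line offsets within a function (for example, a call on the 10th line of the function). Changes fall into two groups.

Many common changes do not break matching:

- Changes in a file outside the hot function (adding or changing code above or below it).
- Moving a function to another file in the same package (the compiler ignores source file names entirely).

Some changes do break matching:

- Changes inside a hot function (these may shift line offsets).
- Renaming a function, or the type of a method (this changes the symbol name).
- Moving a function to another package (this changes the symbol name).

If the profile is fairly recent, the differences probably affect only a few hot functions, which limits the cost of missed optimizations. Over time, however, the profile ages and the losses accumulate, because code is rarely refactored back to its old form. It is therefore important to collect new profiles regularly to limit source drift from production. One situation where matching can degrade sharply is a large refactoring that renames many functions or moves them between packages. In that case you may see a short-term performance hit until a build with a fresh profile takes effect.

2.2.2 Iterative stability

Iterative stability prevents performance from oscillating across successive PGO builds (for example, build 1 is fast, build 2 is slow, build 3 is fast, and so on). CPU profiles are used to identify the hot function calls to optimize. In theory, PGO could speed up a hot function so much that it no longer appears hot in the next profile, is not optimized in the next build, and becomes slow again. The Go compiler takes a conservative approach to PGO optimizations, which the Go team believes prevents significant variance.

2.2.3 Summary

If Go PGO did not provide source stability and iterative stability, we would need a two-stage build to release a service: first build a version without PGO, roll it out to part of production, and collect a profile from it; then rebuild with PGO using that profile and roll the result out to all of production.

2.3 Results

After enabling PGO with Go 1.21 on our sidecar program, we saw roughly a 5% performance improvement; the figure given by the Go team is roughly 2-7%. Some business services have started to use PGO as well. Go's PGO will keep evolving and is worth following.

2.4 The future of Go PGO

In a PGO issue, Go team member @aclements gave a non-exhaustive list of what PGO could optimize:

- Inlining (already standard practice).
- Basic block ordering: order the blocks within a function to cluster hot blocks and improve branch prediction.
- Register allocation: the allocator currently uses heuristics to determine the hot path and where to spill; PGO can tell it the true hot path.
- Function ordering: order and cluster functions across the whole binary for better locality.
- Global block ordering: a step beyond function ordering; its likely form is hot/cold splitting, though it could be more aggressive than that.
- Indirect call devirtualization: similar to the C++ case described in detail later.
- Stenciling: stencil hot generic functions based on the profile.
- Pre-sizing of maps and slices.
- Lifetime allocation: place allocations with similar lifetimes together.

3. PGO in practice with C++

With a profile the compiler can improve register allocation, loop vectorization (a loop with only a few iterations is not worth vectorizing, since vectorization adds extra run-time cost), branch prediction accuracy, and more. Speculative devirtualization of C++ virtual functions depends on accurate branch prediction and is the focus of the following sections.

3.1 The cost of virtual functions

C++ virtual functions are convenient and give clean abstractions, but they cost more than ordinary functions. In performance-critical code that uses them heavily, the impact shows up in the following ways:

- Space overhead: every class with virtual functions needs a virtual function table, which increases binary size. Every instance of such a class also carries a pointer to that table, so each instance grows by one pointer (4 bytes on 32-bit systems, 8 bytes on 64-bit systems). This extra space can be cache-unfriendly and affect performance to some degree.
- Virtual table lookup: a virtual call adds one memory access to reach the table through the vtable pointer. The cost is small but not zero.
- Indirect call overhead: the address of the actual function is only assigned at run time, so the compiler can do little more than emit an indirect call instruction. A direct call involves no prediction of the target, because the target address is fixed at build time and the CPU simply fetches the instructions at that address, so the pipeline is not interrupted. For an indirect call the target is unknown, so several targets are possible and the branch predictor must guess. A wrong guess flushes the pipeline and restarts fetch and decode, which is costly.
- No inlining: a virtual function is polymorphic, so the compiler cannot tell which implementation will run and cannot inline it. Often a caller needs only part of the return value or side effects while the callee also performs extra work that inlining could have eliminated; without inlining, an indirect call performs more wasted computation.
- Blocked follow-on optimizations: an indirect call acts as a barrier in the instruction stream. Because its target is known only at run time, control-flow analysis and code expansion across it fail at compile time, which limits further compile-time and link-time optimization.

3.2 Basic devirtualization

The following example shows how a compiler devirtualizes a call:

class A {
public:
    virtual int foo() { return 1; }  // return value is missing in the source; 1 is a placeholder
};

class B : public A {
public:
    int foo() { return 2; }
};

int test(B* b) {
    return b->foo() + 3;  // +3 matches the "addl $3, %eax" in the assembly listings
}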

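When test(B *b) calls b->foo(), the compiler does not know whether b is really a B or a subclass of B, so the generated code contains an indirect call for the virtual call b->foo() (line 19). The assembly generated by gcc 9, trimmed, is:

 12 subq $16, %rsp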

13 movq %rdi, -8(%rbp)

14 movq -8(%rbp), %rax

15 movq (%rax), %rax

16 movq (%rax), %rdx

17 movq -8(%rbp), %rax

18 movq %rax, %rdi

19 call *%rdx

20 addl $3, %eax

Now change class B by adding the final keyword to foo:

class B : public A {
public:
    int foo() final { return 2; }
};

Now the compiler knows that B::foo cannot be overridden by any subclass of B, so it can devirtualize the call (-fdevirtualize). The assembly becomes:

_ZN1B3fooEv:
.LFB1:
    .cfi_startproc
    pushq %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq %rsp, %rbp
    .cfi_def_cfa_register 6
    movq %rdi, -8(%rbp)
    movl $2, %eax
    popq %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE1:
    .size _ZN1B3fooEv, .-_ZN1B3fooEv
    .text
    .globl _Z4testP1B
    .type _Z4testP1B, @function
_Z4testP1B:
.LFB2:
    .cfi_startproc
    pushq %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq %rsp, %rbp
    .cfi_def_cfa_register 6
    subq $16, %rsp
    movq %rdi, -8(%rbp)
    movq -8(%rbp), %rax
    movq %rax, %rdi
    call _ZN1B3fooEv
    addl $3, %eax
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc

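The indirect call has become a direct call. With the callee inlined and the constants folded, the body of test can be reduced further to a single instruction:

.LFB2:
    .cfi_startproc
    movl $5, %eax
    ret
    .cfi_endproc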

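3.3 Speculative devirtualization

Speculative devirtualization infers the likely target from actual run-time behavior. A simple example: in the call ptr->foo(), where ptr has type A*, the object may be an A, a B, or one of their subclasses, and the compiler cannot determine which at compile time. Suppose that in production ptr almost always points to an A object rather than a B or another subclass. Speculative devirtualization (gcc option -fdevirtualize-speculatively) then attempts the following transformation:

if (ptr->foo == A::foo)
    A::foo();      // direct call
else
    ptr->foo();    // indirect call as fallback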
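The same guard-then-direct-call shape can be written by hand in Go, where interface method calls play the role of C++ virtual calls. The sketch below is an illustration only: Shape, Square, Rect, sumIndirect, and sumSpeculative are hypothetical names, and the assumption that Square is the dominant type stands in for what a profile would report. A compiler applying this transformation would produce something with this structure automatically.

```go
package main

import "fmt"

// Shape is a hypothetical interface; calling Area through it is an indirect call.
type Shape interface{ Area() float64 }

type Square struct{ Side float64 }

func (s Square) Area() float64 { return s.Side * s.Side }

type Rect struct{ W, H float64 }

func (r Rect) Area() float64 { return r.W * r.H }

// sumIndirect dispatches through the interface for every element.
func sumIndirect(shapes []Shape) float64 {
	total := 0.0
	for _, s := range shapes {
		total += s.Area()
	}
	return total
}

// sumSpeculative guards on the type assumed to be most common (Square).
// On a hit the call target is known statically, so it can be inlined;
// on a miss it falls back to the ordinary interface call.
func sumSpeculative(shapes []Shape) float64 {
	total := 0.0
	for _, s := range shapes {
		if sq, ok := s.(Square); ok {
			total += sq.Area() // direct call on the concrete type
		} else {
			total += s.Area() // indirect call as fallback
		}
	}
	return total
}

func main() {
	shapes := []Shape{Square{2}, Square{3}, Rect{2, 5}}
	fmt.Println(sumIndirect(shapes), sumSpeculative(shapes))
}
```

Both functions return the same result; only the dispatch differs. The guard pays off only when the guessed type really is the common case, which is exactly the information a profile supplies.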

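After this transformation the indirect call becomes a direct call on the common path, which opens the door to direct-call optimizations such as inlining.

3.4 Results

We are currently running performance tests on Envoy; the results will be added here when they are available.

4. Overall practice and plans

Our observability platform already collects and stores pprof files for Go programs automatically, and the build pipeline pulls the pprof file for each service and uses it for an optimized build. Automatic collection and building for C++ programs, and integration with our governance platform, are being planned. Our goal is to provide this as a systematic platform capability that requires no involvement from business teams. Feedback and discussion are welcome.

5. Other techniques

LTO (Link-Time Optimization) and BOLT (Binary Optimization and Layout Tool) are two other optimization techniques. LTO optimizes the whole program; in LLVM it performs cross-module optimization at link time. BOLT rearranges an executable after profiling and can deliver performance beyond what the compiler's LTO and PGO achieve. They are not covered here and may get their own article.

6. References

https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-8/profile-guided-optimization-pgo.html
https://go.dev/doc/pgo#alternative-sources

Published 2023-09-08 17:30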

What Is PGO, and How Does It Make Go Faster? - Zhihu

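Author: 陈煎鱼 (first published in the column 跟煎鱼精通 Go 语言)

Hi everyone, this is 煎鱼. Go 1.20 is about to be released, and recently many people have been mentioning a keyword, PGO, saying it brings a big improvement. That left me wondering what it actually is, so let's look at it together.

Quick overview

What is PGO

Profile-guided optimization (PGO) means optimization guided by a profile. It is also known as:

- profile-directed feedback (PDF)
- feedback-directed optimization (FDO)

PGO is a compiler optimization technique that uses profiling to improve a program's run-time performance; in our case, the run-time performance of Go programs. It is a general technique, not specific to one language. For example:

- Chrome has enabled PGO since version 53 for 64-bit builds and since version 54 for 32-bit builds.
- Microsoft Visual C++ uses it as well.
- AutoFDO applied PGO to C/C++ programs in a data center and improved their performance by 5-15% without changes to application code.

Results like that are encouraging.

How PGO optimizes

The Intel Developer Guide and Reference [1] gives a basic introduction to PGO and its workflow, summarized here.

PGO improves application performance by shrinking code size, reducing branch mispredictions, and reorganizing code layout to reduce instruction-cache problems. It gives the compiler information about the most frequently executed areas of the application. Knowing these areas, the compiler can be more selective and specific when optimizing.

PGO consists of three phases:

1. Instrument the program. The compiler creates and links an instrumented program from your source code and special compiler code.
2. Run the instrumented executable. Each time the instrumented code runs, it generates a dynamic information file that is used in the final compilation.
3. Final compilation. When you compile a second time, the dynamic information files are merged into a summary file. Using the profile summary in this file, the compiler tries to optimize the most frequently executed paths in the program.

That is the basic PGO process.

The new proposal

Background

The proposal authors (Cherry Mui, Austin Clements, Michael Pratt) propose adding profile-guided optimization (PGO) support to the Go gc toolchain, so that the toolchain can perform application-specific and workload-specific optimizations based on run-time information. Put simply, the aim is better performance without changes to application code.

What to build it on

PGO needs the user's participation to collect profiles and feed them back into the build. That is a significant hurdle, and the tool that best fits the requirement is pprof.

In the end the Go team settled on runtime/pprof as the source of profiles for PGO, because it has low collection overhead, works across many systems, and is the standard, widely used profiler in Go. In other words, if you have a profile generated by runtime/pprof, you can use PGO.

Level of support

The first version of PGO supports pprof CPU profiles and reads a pprof CPU profile file directly. A preview is expected in Go 1.20.

In the Go toolchain, the go build subcommand gains a -pgo= flag that explicitly specifies the profile file to use for a PGO build. If specifying it explicitly sounds tedious, the Go team thought of that too: with -pgo=auto the toolchain looks for a profile in the main package directory. To turn PGO off entirely, use -pgo=off.

The Go 1.20 preview defaults to off; once the feature matures the default will become auto.

Where the work starts

The Go team is focusing first on the Go compiler, with some simple support in cmd/go to follow.

The first optimization PGO targets is function inlining, which is considered the best value for the effort.

Looking ahead, the plans include devirtualization (a compiler optimization strategy), stenciling of specific generic functions, basic block ordering, and function layout. Later it may even be used to improve memory behavior, such as escape behavior and memory allocation.

Take a look at the outlook for PGO [2]; it is an ambitious roadmap, and a distant one.

Trying it early

In "Exploring Go's Profile-Guided Optimizations" [3], Frederic Branczyk tried the function inlining work that the Go team had already implemented. The steps are as follows.

First fetch the Go source with the change applied, then build it and put it on the PATH:

git clone https://go.googlesource.com/go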

cd go

git fetch https://go.googlesource.com/go refs/changes/63/429863/3 && git checkout -b change-429863 FETCH_HEAD

cd src

./all.bash

cd ..

export PATH="$(pwd)/bin:$PATH" # or add the path to your bashrc/zshrc

Change into the PGO inlining test directory:

cd src/cmd/compile/internal/test/testdata/pgo/inline

Generate a pprof CPU profile in preparation:

go test -o inline_hot.test -bench=. -cpuprofile inline_hot.pprof

With that in place, run the test twice for comparison, once without PGO and once with PGO. Without PGO:

go test -run=none -tags='' -timeout=9m0s -gcflags="-m -m" 2>&1 | grep "can inline"

./inline_hot.go:15:6: can inline D with cost 7 as: func(uint) int { return int((i + (wSize - 1)) >> lWSize) }

./inline_hot.go:19:6: can inline N with cost 20 as: func(uint) *BS { bs = &BS{...}; return bs }

...

With PGO:

go test -run=none -tags='' -timeout=9m0s -gcflags="-m -m -pgoprofile inline_hot.pprof"

The following commands produce the comparison:

go test -o inline_hot.test -bench=. -cpuprofile inline_hot.pprof -count=100 > without_pgo.txt

go test -o inline_hot.test -bench=. -gcflags="-pgoprofile inline_hot.pprof" -count=100 > with_pgo.txt

benchstat without_pgo.txt with_pgo.txt

name old time/op new time/op delta

A-10 960µs ± 2% 950µs ± 1% -1.05% (p=0.000 n=98+83)

The result shows roughly a 1% improvement with PGO. This is only a small piece of test code, and results will differ from program to program.

Summary

PGO is a compiler optimization technique that can bring some performance improvement to your application without changes to application code. Go's PGO relies on profiles generated by runtime/pprof (with some modifications), which ties the pieces together neatly.

Judging by where the requirement came from, this optimization seems driven more by the developers' own interest; the official issues do not point to a specific user pain point behind it. Still, if you later run into Go programs that need further optimization, PGO is a good option, since it needs no changes to application code.

References

[1] Intel Developer Guide and Reference: https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming/profile-guided-optimization-pgo.html
[2] Outlook for PGO: https://github.com/golang/go/issues/55022#issuecomment-1245605666
[3] Exploring Go's Profile-Guided Optimizations: https://www.polarsignals.com/blog/posts/2022/09/exploring-go-profile-guided-optimizations/

Published 2022-11-22 12:32

Go PGO Quick Start: 2~4% Better Performance - Zhihu

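Author: 陈煎鱼 (first published in the column 跟煎鱼精通 Go 语言)

Hi everyone, this is 煎鱼. In early 2023, Go 1.20 shipped a preview of PGO. With the Go 1.21 release and a range of fixes, PGO has been officially declared ready for production use. This article follows the official example to try out the optimization and its workflow.

PGO recap

Profile-guided optimization (PGO) is a compiler optimization technique that uses a profile to guide compilation and improve a program's run-time performance. It is also known as:

- profile-directed feedback (PDF)
- feedback-directed optimization (FDO)

It is a general technique, not specific to one language, and many familiar applications use it. For example:

- Chrome has enabled PGO since version 53 for 64-bit builds and since version 54 for 32-bit builds.
- AutoFDO applied PGO to C/C++ programs in a data center and improved their performance by 5-15% without changes to application code.

How Go reads the profile

The first version of PGO supports pprof CPU profiles and reads a pprof CPU profile file directly. There are two ways to supply one:

- Manually: the go build subcommand has a -pgo= flag that explicitly specifies the profile file to use for the PGO build.
- Automatically: when the Go toolchain finds a file named default.pgo in the main package directory, it enables PGO automatically.

Quick demo

Initializing the application

First create a demo directory for the experiments:

$ mkdir pgo-demo && cd pgo-demo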

Initialize the module path and fetch the dependency the program needs:

$ go mod init example.com/markdown

go: creating new go.mod: module example.com/markdown

$ go get gitlab.com/golang-commonmark/markdown@bf3e522c626a

Create main.go with the following content:

package main

import (

"bytes"

"io"

"log"

"net/http"

_ "net/http/pprof"

"gitlab.com/golang-commonmark/markdown"

)

func render(w http.ResponseWriter, r *http.Request) {

if r.Method != "POST" {

http.Error(w, "Only POST allowed", http.StatusMethodNotAllowed)

return

}

src, err := io.ReadAll(r.Body)

if err != nil {

log.Printf("error reading body: %v", err)

http.Error(w, "Internal Server Error", http.StatusInternalServerError)

return

}

md := markdown.New(

markdown.XHTMLOutput(true),

markdown.Typographer(true),

markdown.Linkify(true),

markdown.Tables(true),

)

var buf bytes.Buffer

if err := md.Render(&buf, src); err != nil {

log.Printf("error converting markdown: %v", err)

http.Error(w, "Malformed markdown", http.StatusBadRequest)

return

}

if _, err := io.Copy(w, &buf); err != nil {

log.Printf("error writing response: %v", err)

http.Error(w, "Internal Server Error", http.StatusInternalServerError)

return

}

}

func main() {

http.HandleFunc("/render", render)

log.Printf("Serving on port 8080...")

log.Fatal(http.ListenAndServe(":8080", nil))

}

Build and run the application:

$ go build -o markdown.nopgo

$ ./markdown.nopgo

2023/10/02 13:55:40 Serving on port 8080...

Once it is running, verify it. The application converts Markdown to HTML, so fetch a Markdown file from GitHub and send it to the program:

$ curl -o README.md -L "https://raw.githubusercontent.com/golang/go/c16c2c49e2fa98ae551fc6335215fadd62d33542/README.md"

$ curl --data-binary @README.md http://localhost:8080/render

The Go Programming Language

Go is an open source programming language that makes it easy to build simple,

reliable, and efficient software.

...

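If the HTML comes back, the server is working.

Collecting the profile for PGO

Normally we would collect the profile with pprof from a production or test environment. This example has no production environment, so for the demo the Go team provides a simple load generator. With the pgo-demo server from the previous section still running, start the load generator:

$ go run github.com/prattmic/markdown-pgo/load@latest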

While it is running, collect the profile:

$ curl -o cpu.pprof "http://localhost:8080/debug/pprof/profile?seconds=30"

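This produces a cpu.pprof file for the next step.

Building the application with PGO

As mentioned earlier, the Go toolchain applies PGO automatically when the main package directory contains default.pgo. We only need to rename cpu.pprof:

$ mv cpu.pprof default.pgo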

$ go build -o markdown.withpgo

After the build succeeds, check that PGO was applied:

$ go version -m markdown.withpgo

markdown.withpgo: go1.21.1

path example.com/markdown

mod example.com/markdown (devel)

...

build GOOS=darwin

build GOAMD64=v1

build -pgo=/Users/eddycjy/app/go/pgo-demo/default.pgo

The final line, build -pgo=..., shows that the binary was built with our default.pgo file, so PGO is enabled.

Summary

PGO is a useful performance aid in recent Go releases. In this example the official write-up reports that the program performs about 2~4% better with PGO, and the Go 1.21 documentation cites improvements of around 2~7% across a representative set of programs. It is not free: builds may take somewhat longer, and the compiled binary may be slightly larger.

For some workloads PGO is a good optimization option, but it is a trade-off that depends on the type of application.

Published 2023-10-13 12:39
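The check done with go version -m can also be made from inside the program, since the same build settings are available through runtime/debug. The sketch below assumes that the profile path is recorded under the key "-pgo", matching the line printed by go version -m; pgoProfileFrom and buildSettings are hypothetical helper names.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// pgoProfileFrom returns the value recorded for the "-pgo" build setting, if any.
func pgoProfileFrom(settings map[string]string) (string, bool) {
	v, ok := settings["-pgo"]
	return v, ok && v != ""
}

// buildSettings flattens the build settings embedded in the running binary.
// It returns an empty map when the binary carries no build information.
func buildSettings() map[string]string {
	out := map[string]string{}
	info, ok := debug.ReadBuildInfo()
	if !ok {
		return out
	}
	for _, s := range info.Settings {
		out[s.Key] = s.Value
	}
	return out
}

func main() {
	if profile, ok := pgoProfileFrom(buildSettings()); ok {
		fmt.Println("built with PGO profile:", profile)
	} else {
		fmt.Println("no -pgo setting recorded in this binary")
	}
}
```

Logging this at startup is one way to confirm that a deployed binary was actually built with the intended profile.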


Without PGO, What Advantages Does JIT Compilation Have over AOT Compilation? - Zhihu

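Topics: programming languages, Java Virtual Machine (JVM), just-in-time compilation (JIT), compiler theory, compilers

Question: It is said that JIT compilation has access to more run-time information than AOT compilation. But for a pure JIT (compile on first execution, no profiling feedback), what exactly can it obtain…

Answer by RednaxelaFX

First, when discussing this question we must fix the topic as "in which respects can a JIT do better than AOT", and not fall into the trap of "is JIT-compiled code better overall than AOT-compiled code". The former is a discussion of specific points; the latter drags in much more.

On the latter: the biggest advantage of AOT compilation is that it can analyze and optimize code almost regardless of cost, so it can afford heavier optimizations and produce better code. A JIT, even with adaptive dynamic compilation or tiered compilation to spread the initial cost, compiles while the application is running, so every analysis and optimization must account for time and space overhead. It cannot compete head-on with AOT's traditional strengths.

=======================================

JIT with PGO

Back to the topic: what interesting things can a JIT do?

The question starts by ruling out what JITs are best at, namely combining JIT with PGO. In practice the biggest advantage of JIT compilation is that it can optimize through FDO (feedback-directed optimization), also called PGO (profile-guided optimization). For a small initial run-time cost it obtains run-time information that would otherwise require heavyweight static analysis, or that static analysis cannot obtain at all, and optimizing on that basis gets far more for the effort.

For the relationship between the related terms, see: "JIT compilation, dynamic compilation, and adaptive dynamic compilation" in the Zhihu column 编程语言与高级语言虚拟机杂谈（仮）.

Typical FDO / PGO optimizations in a JIT include:

(1) Type-feedback optimization: aimed mainly at polymorphic object-oriented programs. Using the receiver type information collected by the profile, it devirtualizes polymorphic virtual method call sites or property access sites by type.

(2) Single-value profiling: less common. The idea is that some parameters or return values take only one concrete value during a run. If so, that value can be recorded and treated as a constant during JIT compilation, so the usual constant optimizations (constant folding, conditional constant propagation, and so on) apply to a value that is not statically a constant.

(3) Branch-profile-based code scheduling: the main goal is to place hot (frequently executed) code paths together and move cold (rarely executed) paths elsewhere. An AOT compiler often guesses which paths are hot from static heuristics or lets the user say so (for example with likely() / unlikely() macros), whereas a JIT with PGO has fairly accurate path heat information, so its optimizations fit actual execution better.

(4) Profile-guided inlining heuristics: the profile gives the heat of each call site, which informs the inlining decision of whether it is worth inlining the target function at that site.

(5) Implicit exceptions: for example the null pointer checks and divide-by-zero checks in Java and C#. If such an exception has never occurred in a piece of code, it can be implemented in a faster way without generating an explicit check. If it occurs often there, an explicit check is faster. For more discussion, see the Zhihu question "如何评价《王垠：C 编译器优化过程中的 Bug》？".

For (1) and (2) in the JIT-with-PGO setting, the generated code usually carries a guard that checks whether the value seen at run time still matches what the profile recorded. The optimized code runs only when it matches; otherwise an unoptimized fallback runs.

Such optimizations work even better when combined with some static analysis. Take the following Java pseudocode, assuming an interface IFoo and a class Foo that implements it.

void func(IFoo obj) {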

obj.bar(); // call site 1

obj.bar(); // call site 2

}

If we apply only the type-feedback optimization from (1), the profile may record that the receiver type at both bar() call sites is Foo, and a rather naive JIT might generate code like this:

void func(IFoo obj) {

// call site 1

if (obj.klass == Foo) { // guard

Foo.bar(obj); // devirtualized

} else {

obj.bar(); // virtual call as fallback

}

// call site 2

if (obj.klass == Foo) { // guard

Foo.bar(obj); // devirtualized

} else {

obj.bar(); // virtual call as fallback

}

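}

This JIT uses the profile but does not do enough static analysis: it fails to notice that both call sites use the same reference, so the type must be the same. A less naive JIT compiler would propagate the type information established by the guard and might generate this instead:

void func(IFoo obj) {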

if (obj.klass == Foo) { // guard

Foo.bar(obj); // devirtualized call site 1

Foo.bar(obj); // devirtualized call site 2

} else {

obj.bar(); // virtual call as fallback

}

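}

That was the case where static information is not enough to determine the run-time type of obj. Now change the example slightly:

void func() {

IFoo obj = new Foo();

obj.bar(); // call site 1

obj.bar(); // call site 2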

}

A naive JIT compiler that uses only type-feedback optimization still generates code like before:

void func() {

IFoo obj = new Foo();

// call site 1

if (obj.klass == Foo) { // guard

Foo.bar(obj); // devirtualized

} else {

obj.bar(); // virtual call as fallback

}

// call site 2

if (obj.klass == Foo) { // guard

Foo.bar(obj); // devirtualized

} else {

obj.bar(); // virtual call as fallback

}

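}

A JIT compiler that propagates type information notices that new Foo() is an expression with an exactly known type. Propagating that fact shows that both bar() call sites must call Foo.bar(). It can therefore ignore the collected profile and rely on static analysis to generate this:

void func() {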

Foo obj = new Foo();

Foo.bar(obj); // devirtualized call site 1

Foo.bar(obj); // devirtualized call site 2

}

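An AOT compiler that optimizes well enough would generate exactly the same code for this example without any profile.

These two pairs of examples are a reminder that not every JIT optimizes to the same degree, so do not take JIT behavior for granted. How much a profile helps depends on the input program and on the optimizations the compiler itself performs. Many real JITs are a trade-off between optimization cost and target performance, and different design goals lead to very different implementations.

=======================================

JIT without PGO

With the preliminaries done, on to the main course.

The biggest advantage of JIT compilation (or, more broadly, dynamic code generation) is its use of run-time information. There are many kinds of run-time information, not all of which count as a "profile", and there are many ways to use them. Keep an open mind.

It is also worth noting that many JIT compilers are hard to convert into AOT compilers largely because they were designed only as JITs: they unconditionally embed, or depend on, many run-time values. Converting them means digging out all of those deep-rooted dependencies, and the effort is often large enough to make people give up.

Here is a short list first; I will add more or expand on some items later.

(1) Many addresses and pointer values that are only known at run time can be treated as constants.

(2) Code can be generated to fit the parameters configured for one particular launch of the program, which reduces run-time conditionals.

(3) Machine code can be generated to fit the actual machine it is running on. This beats AOT output compiled for the general case, and it avoids the run-time checks of AOT output that tests machine features (for example using cpuid to detect whether certain instructions are available).

(4) With dynamic linking, optimization can cross module boundaries (for example inlining a call across modules).

(5) Only frequently executed code needs to be compiled, which reduces the memory taken by compiled code. This has particular uses in severely resource-constrained embedded settings.

(6) Some features are too expensive to compute statically, whereas JIT compiling at run time based on concrete values computes only the cases that arise, balancing performance against cost.

(7) Code patching is often allowed: code is specialized for the values it actually encounters and adjusted when those values change.

(8) Code can be instrumented dynamically, with the level of detail adjusted according to benefit and cost. Once enough information has been collected, the instrumentation can be removed dynamically to restore the original performance.

(9) Last but perhaps most important, a JIT can often perform aggressive speculative optimization and fall back flexibly to safe, less optimized code when a prediction is wrong.
    - For example, a method may be compiled only for the parts that have executed or are predicted to execute. If execution reaches a path that was not compiled, it is compiled on the spot.
    - For example, where code can be loaded dynamically, static compilation must optimize conservatively under an open-world assumption, while a JIT can optimize aggressively under a closed-world assumption and, when newly loaded code invalidates a prediction, discard the compiled code and recompile.

---------------------------------------------

Some examples for (1).

Microsoft's CLR JIT compiler currently compiles a method only once in normal operation, so it cannot use PGO. But it can use many run-time values, for example:

void Bar() {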

}

void Foo() {

Bar();

}

void Goo() {

Bar();

}

void Main() {

Foo();

Goo();

}

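Suppose the program starts at Main() and none of the methods has been precompiled with NGen, so each takes the normal path of being JIT compiled on first call. If no inlining happens, the JIT compilation order is a depth-first traversal of the call tree:

Main() -> Foo() -> Bar() -> Goo()

When Main() is compiled, the methods it calls, Foo() and Goo(), have not been compiled and their entry addresses are unknown. The calls in Main() are therefore generated as calls to their prestubs, which trigger JIT compilation and patch the call site into a direct call to the compiled method's entry address. See: "什么是桩代码（Stub）？" (answer by RednaxelaFX on Zhihu).

Compiling Foo() is similar: Bar() has not been compiled, so a call to its prestub is generated first, and when the prestub is called it triggers JIT compilation and patches the call site into a direct call.

When Goo() is JIT compiled, Bar() has already been compiled and its entry address is known, so a direct call to that address can be generated.

Whichever modules (in the .NET Assembly sense) Main(), Foo(), Goo(), and Bar() live in, at run time they are all mixed together. Cross-module calls carry no extra cost, and a call does not have to go through something like a GOT just because Foo() and Bar() are in different modules.

As another example, in the HotSpot JVM, references to "constant objects" (such as the Klass of a Java object, or String constants) can be embedded directly in generated code. These addresses too are only known at run time.

One more: take a runtime with a JIT and a GC, where the GC uses a contiguous virtual address range split into two generations. To check whether a reference points into the young or the old generation, it is enough to compare the reference with the boundary address between the two. That address is a run-time value; a JIT can embed it directly in the generated code, while AOT-compiled code usually needs a memory load for it.

---------------------------------------------

An example for (2).

The HotSpot JVM interpreter is itself "JITted": it is generated dynamically during VM startup. Based on the parameters configured for that launch, such as whether the interpreter should do profiling, it generates code selectively and omits whatever that run does not need. This makes the interpreter's code layout in memory more compact and improves code locality.

---------------------------------------------

For (3), no example. This is probably the most discussed scenario, and most people already know what it means.

---------------------------------------------

For (4), the CLR example above already touches on it. More interesting than generating direct calls is that a JIT can usually ignore module boundaries and inline calls across modules.

Using the CLR example again: if it were a C++ program with Main(), Foo(), Goo(), and Bar() each in its own exe or dll, calls between them generally could not be inlined. For the CLR, if it is a C# program, it does not matter which modules these methods come from; they can all be inlined just the same.

---------------------------------------------

For (5), there are far too many interesting examples, and the application scenarios vary widely.

Take instantiation of C++ templates versus C# generics. For an AOT-compiled C++ program, every instantiation of a template class or function that the compiler sees must have its code and metadata generated, and if only a few are used at run time, the rest is wasted. For a C# program on the CLR, code and metadata are generated only for the instantiations of a generic type that are actually used at run time, and part of the code can be shared, so memory overhead is much lower. Even more interesting, this allows new instantiations to be created dynamically at run time through reflection, which C++ templates under the AOT model cannot do.

Another example is low-end Java ME. Devices in this space may have very little memory and persistent storage (both RAM and ROM are small), so both must be used sparingly. Java bytecode can be seen as a "compressed form" of the program; compiled to machine code it may grow 3 to 10 times. If all the code of a Java application were AOT compiled to machine code, it might not fit on the device's ROM at all. In this scenario it makes sense to store Java programs as bytecode in ROM, using very little space. A JVM implementation such as Monty VM (also called the CLDC HotSpot Implementation) then configures a very small JIT code cache that keeps only the most frequently executed recent JIT-compiled code, with everything else interpreted. If a new JIT compilation is triggered when the code cache is full, the coldest code is evicted to make room. This strikes a dynamic balance between memory footprint and performance.

Trace-based compilation also follows the "compile only frequently executed paths" idea, but I will not expand on it here.

---------------------------------------------

Two brief examples for (6).

The first is the Fortress language once developed at Sun Labs. Its compiler implementation mixed static compilation with dynamic code generation, although what it generated dynamically was Java bytecode. That is also a form of JIT. A Fortress talk by Christine Flood at the JVM Language Summit 2011 made this point:

Interface Injection

Because recursive types could potentially require an infinite number of methods the entire type hierarchy can't be generated at compile time, and some classes must be generated on demand at run time. Interface injection would save us a whole lot of complicated dispatch code.

Fortress's recursive type design means some types cannot be fully generated at static compile time (the compiler itself would never terminate), so they are generated at run time according to how they are actually used.

The second example comes from an interview with Patrick Dussud, one of the leads of the Microsoft CLR GC. Early in his career he worked on run-time optimization for an APL implementation that involved JIT compilation. APL's ⍳ (iota) function generates the integer sequence from 1 to n, and adding 1 to its result adds 1 to every element. A typical APL runtime at the time would build the entire array from 1 to n in memory when iota executed and then really add 1 to each element. Patrick's project tried to make these operations symbolic (symbolic representation plus lazy computation): when the actual values are not needed, only the operations are recorded, and values are materialized only when they are really used. During materialization, simple computations are interpreted, while complex ones get specialized code generated dynamically (JIT compiled) and then executed.

See: Patrick Dussud: Managing Garbage Collection | Behind The Code | Channel 9 (a short segment starting at 10:00)

---------------------------------------------

Small examples for (7), code patching.

CLR v2 uses what it calls "virtual stub dispatch" (VSD) for interface methods, which is really a monomorphic inline cache call. On first execution it records the receiver type passed in and specializes itself into a form like this:

if (obj.MethodTable == expected_MT_of_Foo) {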

call Foo.bar(); // direct call, fastest

} else {

failure_counter++;

if (failure_counter < 1000) {

lookup_target_and_call(); // reflective lookup, slowest

} else {

patch_self_to_generic(); // patch to generic version, mediocre

}

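}

Subsequent calls with the same receiver type take the fast path and make a direct call. If a different receiver type appears, the failure count is incremented, and once it exceeds a threshold the stub patches itself into the generic slow form.

This scenario involves feedback but needs no full profiling mechanism, and the JIT compiler itself does not use profile information to generate the specialized code. Instead other parts of the runtime, such as stub management, adjust the code as feedback arrives. So this does not count as PGO compilation.

Another example: HotSpot VM's Client Compiler (C1) can emit placeholder code in several kinds of situations and fill it in at run time when enough information is available. For instance, for a field access on a class that is not yet loaded, the field offset is unknown, so final code cannot be generated. C1 can emit some nops and a runtime call as a placeholder. The first time execution reaches that point, it calls into the runtime. By then the class involved must already be loaded, so the runtime looks up the offset and patches the placeholder instructions into the actual field access.

One interesting detail: if the runtime finds that the field is an ordinary field, the flow above applies. If it finds that the field is volatile, the C1-compiled code for the current method may not have respected the necessary reordering constraints, so that version must be discarded and the method compiled by C1 again.

---------------------------------------------

For (8), see the CLR's ReJIT feature. Two pointers:

CLR 4.5: David Broman - Inside Re-JIT | Going Deep | Channel 9
ReJIT: A How-To Guide

---------------------------------------------

For (9), this is where it gets really interesting. The JIT compilers of modern high-performance JVMs depend heavily on this kind of optimization, known as assumption-based speculative optimization. A placeholder pointer for now: the Zhihu question "HotSpot VM有没有对invokeinterface指令的方法表搜索进行优化？". I will come back and expand with examples.

Edited 2016-11-16 09:24

Answer by an anonymous Zhihu user

To add to @vczh's answer: a problem AOT cannot avoid is deciding which code will be used.

One example is virtual methods on generic types. If a call to a virtual method is found, all of its overrides may be used and must be AOT compiled. In C#, ToString() is the typical case: it is a virtual method at the object level, and the framework contains a "take an object, call ToString" operation that is almost certain to be reached, so the ToString method of every type must be AOT compiled, even though you will never call ToString on many of those types.

The second example is reflection. When compiling with .NET Native you must supply a configuration file stating, for each type or group of types, what level of reflection you need: name only, constructors, serialization, or method invocation. Generating support for reflection operations that are never used is also an overhead. (Scanning every code path could determine this as well, but the cost is unacceptably high.)

Published 2016-11-14 00:37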

Profile-guided optimization in Go 1.21 - The Go Programming Language


The Go Blog

Profile-guided optimization in Go 1.21

Michael Pratt

5 September 2023

Earlier in 2023, Go 1.20 shipped a preview of profile-guided optimization (PGO) for users to test.

After addressing known limitations in the preview, and with additional refinements thanks to community feedback and contributions, PGO support in Go 1.21 is ready for general production use!

See the profile-guided optimization user guide for complete documentation.

Below we will run through an example of using PGO to improve the performance of an application.

Before we get to that, what exactly is “profile-guided optimization”?

When you build a Go binary, the Go compiler performs optimizations to try to generate the best performing binary it can.

For example, constant propagation can evaluate constant expressions at compile time, avoiding runtime evaluation cost.

Escape analysis avoids heap allocations for locally-scoped objects, avoiding GC overheads.

Inlining copies the body of simple functions into callers, often enabling further optimization in the caller (such as additional constant propagation or better escape analysis).

Devirtualization converts indirect calls on interface values whose type can be determined statically into direct calls to the concrete method (which often enables inlining of the call).

Go improves optimizations from release to release, but doing so is no easy task.

Some optimizations are tunable, but the compiler can’t just “turn it up to 11” on every optimization because overly aggressive optimizations can actually hurt performance or cause excessive build times.

Other optimizations require the compiler to make a judgment call about what the “common” and “uncommon” paths in a function are.

The compiler must make a best guess based on static heuristics because it can’t know which cases will be common at run time.

Or can it?

With no definitive information about how the code is used in a production environment, the compiler can operate only on the source code of packages.

But we do have a tool to evaluate production behavior: profiling.

If we provide a profile to the compiler, it can make more informed decisions: more aggressively optimizing the most frequently used functions, or more accurately selecting common cases.

Using profiles of application behavior for compiler optimization is known as Profile-Guided Optimization (PGO) (also known as Feedback-Directed Optimization (FDO)).

Example

Let’s build a service that converts Markdown to HTML: users upload Markdown source to /render, which returns the HTML conversion.

We can use gitlab.com/golang-commonmark/markdown to implement this easily.

Set up

$ go mod init example.com/markdown

$ go get gitlab.com/golang-commonmark/markdown@bf3e522c626a

In main.go:

package main

import (

"bytes"

"io"

"log"

"net/http"

_ "net/http/pprof"

"gitlab.com/golang-commonmark/markdown"

)

func render(w http.ResponseWriter, r *http.Request) {

if r.Method != "POST" {

http.Error(w, "Only POST allowed", http.StatusMethodNotAllowed)

return

}

src, err := io.ReadAll(r.Body)

if err != nil {

log.Printf("error reading body: %v", err)

http.Error(w, "Internal Server Error", http.StatusInternalServerError)

return

}

md := markdown.New(

markdown.XHTMLOutput(true),

markdown.Typographer(true),

markdown.Linkify(true),

markdown.Tables(true),

)

var buf bytes.Buffer

if err := md.Render(&buf, src); err != nil {

log.Printf("error converting markdown: %v", err)

http.Error(w, "Malformed markdown", http.StatusBadRequest)

return

}

if _, err := io.Copy(w, &buf); err != nil {

log.Printf("error writing response: %v", err)

http.Error(w, "Internal Server Error", http.StatusInternalServerError)

return

}

}

func main() {

http.HandleFunc("/render", render)

log.Printf("Serving on port 8080...")

log.Fatal(http.ListenAndServe(":8080", nil))

}

Build and run the server:

$ go build -o markdown.nopgo.exe

$ ./markdown.nopgo.exe

2023/08/23 03:55:51 Serving on port 8080...

Let’s try sending some Markdown from another terminal.

We can use the README.md from the Go project as a sample document:

$ curl -o README.md -L "https://raw.githubusercontent.com/golang/go/c16c2c49e2fa98ae551fc6335215fadd62d33542/README.md"

$ curl --data-binary @README.md http://localhost:8080/render

The Go Programming Language

Go is an open source programming language that makes it easy to build simple,

reliable, and efficient software.

...

Profiling

Now that we have a working service, let’s collect a profile and rebuild with PGO to see if we get better performance.

In main.go, we imported net/http/pprof which automatically adds a /debug/pprof/profile endpoint to the server for fetching a CPU profile.

Normally you want to collect a profile from your production environment so that the compiler gets a representative view of behavior in production.

Since this example doesn’t have a “production” environment, I have created a simple program to generate load while we collect a profile.

Fetch and start the load generator (make sure the server is still running!):

$ go run github.com/prattmic/markdown-pgo/load@latest

While that is running, download a profile from the server:

$ curl -o cpu.pprof "http://localhost:8080/debug/pprof/profile?seconds=30"

Once this completes, kill the load generator and the server.

Using the profile

The Go toolchain will automatically enable PGO when it finds a profile named default.pgo in the main package directory.

Alternatively, the -pgo flag to go build takes a path to a profile to use for PGO.

We recommend committing default.pgo files to your repository.

Storing profiles alongside your source code ensures that users automatically have access to the profile simply by fetching the repository (either via the version control system, or via go get) and that builds remain reproducible.

Let’s build:

$ mv cpu.pprof default.pgo

$ go build -o markdown.withpgo.exe

We can check that PGO was enabled in the build with go version:

$ go version -m markdown.withpgo.exe

./markdown.withpgo.exe: go1.21.0

...

build -pgo=/tmp/pgo121/default.pgo

Evaluation

We will use a Go benchmark version of the load generator to evaluate the effect of PGO on performance.

First, we will benchmark the server without PGO.

Start that server:

$ ./markdown.nopgo.exe

While that is running, run several benchmark iterations:

$ go get github.com/prattmic/markdown-pgo@latest

$ go test github.com/prattmic/markdown-pgo/load -bench=. -count=40 -source $(pwd)/README.md > nopgo.txt

Once that completes, kill the original server and start the version with PGO:

$ ./markdown.withpgo.exe

While that is running, run several benchmark iterations:

$ go test github.com/prattmic/markdown-pgo/load -bench=. -count=40 -source $(pwd)/README.md > withpgo.txt

Once that completes, let’s compare the results:

$ go install golang.org/x/perf/cmd/benchstat@latest

$ benchstat nopgo.txt withpgo.txt

goos: linux

goarch: amd64

pkg: github.com/prattmic/markdown-pgo/load

cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz

│ nopgo.txt │ withpgo.txt │

│ sec/op │ sec/op vs base │

Load-12 374.5µ ± 1% 360.2µ ± 0% -3.83% (p=0.000 n=40)

The new version is around 3.8% faster!

In Go 1.21, workloads typically get between 2% and 7% CPU usage improvements from enabling PGO.

Profiles contain a wealth of information about application behavior and Go 1.21 just begins to scratch the surface by using this information for a limited set of optimizations.

Future releases will continue improving performance as more parts of the compiler take advantage of PGO.

Next steps

In this example, after collecting a profile, we rebuilt our server using the exact same source code used in the original build.

In a real-world scenario, there is always ongoing development.

So we may collect a profile from production, which is running last week’s code, and use it to build with today’s source code.

That is perfectly fine!

PGO in Go can handle minor changes to source code without issue.

Of course, over time source code will drift more and more, so it is still important to update the profile occasionally.

For much more information on using PGO, best practices and caveats to be aware of, please see the profile-guided optimization user guide.

If you are curious about what is going on under the hood, keep reading!

Under the hood

To get a better understanding of what made this application faster, let’s take a look under the hood to see how performance has changed.

We are going to take a look at two different PGO-driven optimizations.

Inlining

To observe inlining improvements, let’s analyze this markdown application both with and without PGO.

I will compare this using a technique called differential profiling, where we collect two profiles (one with PGO and one without) and compare them.

For differential profiling, it’s important that both profiles represent the same amount of work, not the same amount of time, so I’ve adjusted the server to automatically collect profiles, and the load generator to send a fixed number of requests and then exit the server.

The changes I have made to the server as well as the profiles collected can be found at https://github.com/prattmic/markdown-pgo.

The load generator was run with -count=300000 -quit.

As a quick consistency check, let’s take a look at the total CPU time required to handle all 300k requests:

$ go tool pprof -top cpu.nopgo.pprof | grep "Total samples"

Duration: 116.92s, Total samples = 118.73s (101.55%)

$ go tool pprof -top cpu.withpgo.pprof | grep "Total samples"

Duration: 113.91s, Total samples = 115.03s (100.99%)

CPU time dropped from ~118s to ~115s, or about 3%.

This is in line with our benchmark results, which is a good sign that these profiles are representative.
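The arithmetic behind that comparison can be checked with a few lines of Go. This is only a sketch; percentChange is a hypothetical helper, and the benchmark figures are the rounded medians printed by benchstat, so the last digit may differ slightly from what benchstat computes from unrounded data.

```go
package main

import "fmt"

// percentChange returns the relative change from base to next, in percent.
// Negative values mean next is smaller (faster) than base.
func percentChange(base, next float64) float64 {
	return (next - base) / base * 100
}

func main() {
	// Total CPU samples reported by pprof for the two builds, in seconds.
	fmt.Printf("CPU time: %.2f%%\n", percentChange(118.73, 115.03))
	// Rounded sec/op medians (in microseconds) from the benchstat table.
	fmt.Printf("sec/op:   %.2f%%\n", percentChange(374.5, 360.2))
}
```

Both figures come out as reductions of a few percent, which is why the profiles can be treated as consistent with the benchmark.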

Now we can open a differential profile to look for savings:

$ go tool pprof -diff_base cpu.nopgo.pprof cpu.withpgo.pprof

File: markdown.profile.withpgo.exe

Type: cpu

Time: Aug 28, 2023 at 10:26pm (EDT)

Duration: 230.82s, Total samples = 118.73s (51.44%)

Entering interactive mode (type "help" for commands, "o" for options)

(pprof) top -cum

Showing nodes accounting for -0.10s, 0.084% of 118.73s total

Dropped 268 nodes (cum <= 0.59s)

Showing top 10 nodes out of 668

flat flat% sum% cum cum%

-0.03s 0.025% 0.025% -2.56s 2.16% gitlab.com/golang-commonmark/markdown.ruleLinkify

0.04s 0.034% 0.0084% -2.19s 1.84% net/http.(*conn).serve

0.02s 0.017% 0.025% -1.82s 1.53% gitlab.com/golang-commonmark/markdown.(*Markdown).Render

0.02s 0.017% 0.042% -1.80s 1.52% gitlab.com/golang-commonmark/markdown.(*Markdown).Parse

-0.03s 0.025% 0.017% -1.71s 1.44% runtime.mallocgc

-0.07s 0.059% 0.042% -1.62s 1.36% net/http.(*ServeMux).ServeHTTP

0.04s 0.034% 0.0084% -1.58s 1.33% net/http.serverHandler.ServeHTTP

-0.01s 0.0084% 0.017% -1.57s 1.32% main.render

0.01s 0.0084% 0.0084% -1.56s 1.31% net/http.HandlerFunc.ServeHTTP

-0.09s 0.076% 0.084% -1.25s 1.05% runtime.newobject

(pprof) top

Showing nodes accounting for -1.41s, 1.19% of 118.73s total

Dropped 268 nodes (cum <= 0.59s)

Showing top 10 nodes out of 668

flat flat% sum% cum cum%

-0.46s 0.39% 0.39% -0.91s 0.77% runtime.scanobject

-0.40s 0.34% 0.72% -0.40s 0.34% runtime.nextFreeFast (inline)

0.36s 0.3% 0.42% 0.36s 0.3% gitlab.com/golang-commonmark/markdown.performReplacements

-0.35s 0.29% 0.72% -0.37s 0.31% runtime.writeHeapBits.flush

0.32s 0.27% 0.45% 0.67s 0.56% gitlab.com/golang-commonmark/markdown.ruleReplacements

-0.31s 0.26% 0.71% -0.29s 0.24% runtime.writeHeapBits.write

-0.30s 0.25% 0.96% -0.37s 0.31% runtime.deductAssistCredit

0.29s 0.24% 0.72% 0.10s 0.084% gitlab.com/golang-commonmark/markdown.ruleText

-0.29s 0.24% 0.96% -0.29s 0.24% runtime.(*mspan).base (inline)

-0.27s 0.23% 1.19% -0.42s 0.35% bytes.(*Buffer).WriteRune

When specifying pprof -diff_base, the values displayed in pprof are the difference between the two profiles.

So, for instance, runtime.scanobject used 0.46s less CPU time with PGO than without.

On the other hand, gitlab.com/golang-commonmark/markdown.performReplacements used 0.36s more CPU time.

In a differential profile, we typically want to look at the absolute values (flat and cum columns), as the percentages aren’t meaningful.
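For readers new to these two columns: flat is the time spent in a function's own code, while cum also includes everything that function calls. The sketch below models this on a tiny call tree. The frame type and all of the numbers are invented for illustration, and real pprof profiles are built from sampled call stacks rather than an explicit tree.

```go
package main

import "fmt"

// frame is a node in a simplified call tree.
type frame struct {
	name     string
	flat     float64 // seconds spent in this function's own code
	children []*frame
}

// cum returns the time spent in f plus all of its transitive callees.
func (f *frame) cum() float64 {
	total := f.flat
	for _, c := range f.children {
		total += c.cum()
	}
	return total
}

func main() {
	// Hypothetical numbers, chosen only to illustrate the two columns.
	parse := &frame{name: "parse", flat: 1.5}
	alloc := &frame{name: "alloc", flat: 0.5}
	render := &frame{name: "render", flat: 0.25, children: []*frame{parse, alloc}}

	for _, f := range []*frame{render, parse, alloc} {
		fmt.Printf("%-7s flat=%.2fs cum=%.2fs\n", f.name, f.flat, f.cum())
	}
}
```

An outer function such as render has a small flat value but a large cum value, whereas leaf functions have equal flat and cum values. This is why sorting by cum surfaces outer frames and sorting by flat surfaces inner ones.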

top -cum shows the top differences by cumulative change.

That is, the difference in CPU of a function and all transitive callees from that function.

This will generally show the outermost frames in our program’s call graph, such as main or another goroutine entry point.

Here we can see most savings are coming from the ruleLinkify portion of handling HTTP requests.

top shows the top differences limited only to changes in the function itself.

This will generally show inner frames in our program’s call graph, where most of the actual work is happening.

Here we can see that individual savings are coming mostly from runtime functions.

What are those? Let’s peek up the call stack to see where they come from:

(pprof) peek scanobject$

Showing nodes accounting for -3.72s, 3.13% of 118.73s total

----------------------------------------------------------+-------------

flat flat% sum% cum cum% calls calls% + context

----------------------------------------------------------+-------------

-0.86s 94.51% | runtime.gcDrain

-0.09s 9.89% | runtime.gcDrainN

0.04s 4.40% | runtime.markrootSpans

-0.46s 0.39% 0.39% -0.91s 0.77% | runtime.scanobject

-0.19s 20.88% | runtime.greyobject

-0.13s 14.29% | runtime.heapBits.nextFast (inline)

-0.08s 8.79% | runtime.heapBits.next

-0.08s 8.79% | runtime.spanOfUnchecked (inline)

0.04s 4.40% | runtime.heapBitsForAddr

-0.01s 1.10% | runtime.findObject

----------------------------------------------------------+-------------

(pprof) peek gcDrain$

Showing nodes accounting for -3.72s, 3.13% of 118.73s total

----------------------------------------------------------+-------------

flat flat% sum% cum cum% calls calls% + context

----------------------------------------------------------+-------------

-1s 100% | runtime.gcBgMarkWorker.func2

0.15s 0.13% 0.13% -1s 0.84% | runtime.gcDrain

-0.86s 86.00% | runtime.scanobject

-0.18s 18.00% | runtime.(*gcWork).balance

-0.11s 11.00% | runtime.(*gcWork).tryGet

0.09s 9.00% | runtime.pollWork

-0.03s 3.00% | runtime.(*gcWork).tryGetFast (inline)

-0.03s 3.00% | runtime.markroot

-0.02s 2.00% | runtime.wbBufFlush

0.01s 1.00% | runtime/internal/atomic.(*Bool).Load (inline)

-0.01s 1.00% | runtime.gcFlushBgCredit

-0.01s 1.00% | runtime/internal/atomic.(*Int64).Add (inline)

----------------------------------------------------------+-------------

So runtime.scanobject is ultimately coming from runtime.gcBgMarkWorker.

The Go GC Guide tells us that runtime.gcBgMarkWorker is part of the garbage collector, so runtime.scanobject savings must be GC savings.

What about nextFreeFast and other runtime functions?

(pprof) peek nextFreeFast$

Showing nodes accounting for -3.72s, 3.13% of 118.73s total

----------------------------------------------------------+-------------

flat flat% sum% cum cum% calls calls% + context

----------------------------------------------------------+-------------

-0.40s 100% | runtime.mallocgc (inline)

-0.40s 0.34% 0.34% -0.40s 0.34% | runtime.nextFreeFast

----------------------------------------------------------+-------------

(pprof) peek writeHeapBits

Showing nodes accounting for -3.72s, 3.13% of 118.73s total

----------------------------------------------------------+-------------

flat flat% sum% cum cum% calls calls% + context

----------------------------------------------------------+-------------

-0.37s 100% | runtime.heapBitsSetType

0 0% | runtime.(*mspan).initHeapBits

-0.35s 0.29% 0.29% -0.37s 0.31% | runtime.writeHeapBits.flush

-0.02s 5.41% | runtime.arenaIndex (inline)

----------------------------------------------------------+-------------

-0.29s 100% | runtime.heapBitsSetType

-0.31s 0.26% 0.56% -0.29s 0.24% | runtime.writeHeapBits.write

0.02s 6.90% | runtime.arenaIndex (inline)

----------------------------------------------------------+-------------

(pprof) peek heapBitsSetType$

Showing nodes accounting for -3.72s, 3.13% of 118.73s total

----------------------------------------------------------+-------------

flat flat% sum% cum cum% calls calls% + context

----------------------------------------------------------+-------------

-0.82s 100% | runtime.mallocgc

-0.12s 0.1% 0.1% -0.82s 0.69% | runtime.heapBitsSetType

-0.37s 45.12% | runtime.writeHeapBits.flush

-0.29s 35.37% | runtime.writeHeapBits.write

-0.03s 3.66% | runtime.readUintptr (inline)

-0.01s 1.22% | runtime.writeHeapBitsForAddr (inline)

----------------------------------------------------------+-------------

(pprof) peek deductAssistCredit$

Showing nodes accounting for -3.72s, 3.13% of 118.73s total

----------------------------------------------------------+-------------

flat flat% sum% cum cum% calls calls% + context

----------------------------------------------------------+-------------

-0.37s 100% | runtime.mallocgc

-0.30s 0.25% 0.25% -0.37s 0.31% | runtime.deductAssistCredit

-0.07s 18.92% | runtime.gcAssistAlloc

----------------------------------------------------------+-------------

Looks like nextFreeFast and some of the others in the top 10 are ultimately coming from runtime.mallocgc, which the GC Guide tells us is the memory allocator.

Reduced costs in the GC and allocator imply that we are allocating less overall.

Let’s take a look at the heap profiles for insight:

$ go tool pprof -sample_index=alloc_objects -diff_base heap.nopgo.pprof heap.withpgo.pprof

File: markdown.profile.withpgo.exe

Type: alloc_objects

Time: Aug 28, 2023 at 10:28pm (EDT)

Entering interactive mode (type "help" for commands, "o" for options)

(pprof) top

Showing nodes accounting for -12044903, 8.29% of 145309950 total

Dropped 60 nodes (cum <= 726549)

Showing top 10 nodes out of 58

flat flat% sum% cum cum%

-4974135 3.42% 3.42% -4974135 3.42% gitlab.com/golang-commonmark/mdurl.Parse

-4249044 2.92% 6.35% -4249044 2.92% gitlab.com/golang-commonmark/mdurl.(*URL).String

-901135 0.62% 6.97% -977596 0.67% gitlab.com/golang-commonmark/puny.mapLabels

-653998 0.45% 7.42% -482491 0.33% gitlab.com/golang-commonmark/markdown.(*StateInline).PushPending

-557073 0.38% 7.80% -557073 0.38% gitlab.com/golang-commonmark/linkify.Links

-557073 0.38% 8.18% -557073 0.38% strings.genSplit

-436919 0.3% 8.48% -232152 0.16% gitlab.com/golang-commonmark/markdown.(*StateBlock).Lines

-408617 0.28% 8.77% -408617 0.28% net/textproto.readMIMEHeader

401432 0.28% 8.49% 499610 0.34% bytes.(*Buffer).grow

291659 0.2% 8.29% 291659 0.2% bytes.(*Buffer).String (inline)

The -sample_index=alloc_objects option is showing us the count of allocations, regardless of size.

This is useful since we are investigating a decrease in CPU usage, which tends to correlate more with allocation count rather than size.

There are quite a few reductions here, but let’s focus on the biggest reduction, mdurl.Parse.

For reference, let’s look at the total allocation counts for this function without PGO:

$ go tool pprof -sample_index=alloc_objects -top heap.nopgo.pprof | grep mdurl.Parse

4974135 3.42% 68.60% 4974135 3.42% gitlab.com/golang-commonmark/mdurl.Parse

The total count before was 4974135, meaning that mdurl.Parse has eliminated 100% of allocations!

Back in the differential profile, let’s gather a bit more context:

(pprof) peek mdurl.Parse

Showing nodes accounting for -12257184, 8.44% of 145309950 total

----------------------------------------------------------+-------------

flat flat% sum% cum cum% calls calls% + context

----------------------------------------------------------+-------------

-2956806 59.44% | gitlab.com/golang-commonmark/markdown.normalizeLink

-2017329 40.56% | gitlab.com/golang-commonmark/markdown.normalizeLinkText

-4974135 3.42% 3.42% -4974135 3.42% | gitlab.com/golang-commonmark/mdurl.Parse

----------------------------------------------------------+-------------

The calls to mdurl.Parse are coming from markdown.normalizeLink and markdown.normalizeLinkText.

(pprof) list mdurl.Parse

Total: 145309950

ROUTINE ======================== gitlab.com/golang-commonmark/mdurl.Parse in /usr/local/google/home/mpratt/go/pkg/mod/gitlab.com/golang-commonmark/mdurl@v0.0.0-20191124015652-932350d1cb84/parse.go

-4974135 -4974135 (flat, cum) 3.42% of Total

. . 60:func Parse(rawurl string) (*URL, error) {

. . 61: n, err := findScheme(rawurl)

. . 62: if err != nil {

. . 63: return nil, err

. . 64: }

. . 65:

-4974135 -4974135 66: var url URL

. . 67: rest := rawurl

. . 68: hostless := false

. . 69: if n > 0 {

. . 70: url.RawScheme = rest[:n]

. . 71: url.Scheme, rest = strings.ToLower(rest[:n]), rest[n+1:]

Full source for these functions and callers can be found at:

mdurl.Parse

markdown.normalizeLink

markdown.normalizeLinkText

So what happened here? In a non-PGO build, mdurl.Parse is considered too large to be eligible for inlining.

However, because our PGO profile indicated that the calls to this function were hot, the compiler did inline them.

We can see this from the “(inline)” annotation in the profiles:

$ go tool pprof -top cpu.nopgo.pprof | grep mdurl.Parse

0.36s 0.3% 63.76% 2.75s 2.32% gitlab.com/golang-commonmark/mdurl.Parse

$ go tool pprof -top cpu.withpgo.pprof | grep mdurl.Parse

0.55s 0.48% 58.12% 2.03s 1.76% gitlab.com/golang-commonmark/mdurl.Parse (inline)

mdurl.Parse creates a URL as a local variable on line 66 (var url URL), and then returns a pointer to that variable on line 145 (return &url, nil).

Normally this requires the variable to be allocated on the heap, as a reference to it lives beyond function return.

However, once mdurl.Parse is inlined into markdown.normalizeLink, the compiler can observe that the variable does not escape normalizeLink, which allows the compiler to allocate it on the stack.

markdown.normalizeLinkText is similar to markdown.normalizeLink.

The second largest reduction shown in the profile, from mdurl.(*URL).String is a similar case of eliminating an escape after inlining.

In these cases, we got improved performance through fewer heap allocations.
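The interaction between inlining and escape analysis can be reproduced in miniature. The sketch below is not taken from mdurl; the point type and both constructor functions are invented for illustration. It uses the //go:noinline directive to force the non-inlined case and testing.AllocsPerRun to count heap allocations, assuming a default build with optimizations enabled.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y int }

// newPointNoInline returns a pointer to a local variable. The directive
// forbids inlining, so p must live on the heap to outlive the call.
//
//go:noinline
func newPointNoInline(x, y int) *point {
	p := point{x, y}
	return &p
}

// newPointInline has the same body but is small enough to be inlined,
// which lets escape analysis see how the caller uses the pointer.
func newPointInline(x, y int) *point {
	p := point{x, y}
	return &p
}

var sink int // keeps the computed results live

// allocsNoInline reports the average heap allocations per call.
func allocsNoInline() float64 {
	return testing.AllocsPerRun(1000, func() {
		p := newPointNoInline(1, 2)
		sink = p.x + p.y
	})
}

// allocsInline does the same for the inlinable constructor.
func allocsInline() float64 {
	return testing.AllocsPerRun(1000, func() {
		p := newPointInline(1, 2)
		sink = p.x + p.y
	})
}

func main() {
	fmt.Printf("allocs per call: noinline=%.0f inline=%.0f\n",
		allocsNoInline(), allocsInline())
}
```

With a default build, the non-inlined version should report one allocation per call and the inlinable version should typically report none, since the pointer never leaves the calling function. Building with inlining disabled (for example with -gcflags=-l) would be expected to remove that difference.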

Part of the power of PGO and compiler optimizations in general is that effects on allocations are not part of the compiler’s PGO implementation at all.

The only change that PGO made was to allow inlining of these hot function calls.

All of the effects on escape analysis and heap allocation were standard optimizations that apply to any build.

Improved escape behavior is a great downstream effect of inlining, but it is not the only effect.

Many optimizations can take advantage of inlining.

For example, constant propagation may be able to simplify the code in a function after inlining when some of the inputs are constants.

Devirtualization

In addition to inlining, which we saw in the example above, PGO can also drive conditional devirtualization of interface calls.

Before getting to PGO-driven devirtualization, let’s step back and define “devirtualization” in general.

Suppose you have code that looks something like this:

f, _ := os.Open("foo.txt")

var r io.Reader = f

r.Read(b)

Here we have a call to the io.Reader interface method Read.

Since interfaces can have multiple implementations, the compiler generates an indirect function call, meaning it looks up the correct method to call at run time from the type in the interface value.

Indirect calls have a small additional runtime cost compared to direct calls, but more importantly they preclude some compiler optimizations.

For example, the compiler can’t perform escape analysis on an indirect call since it doesn’t know the concrete method implementation.

But in the example above, we do know the concrete method implementation.

It must be os.(*File).Read, since *os.File is the only type that could possibly be assigned to r.

In this case, the compiler will perform devirtualization, where it replaces the indirect call to io.Reader.Read with a direct call to os.(*File).Read, thus allowing other optimizations.

(You are probably thinking “that code is useless, why would anyone write it that way?” This is a good point, but note that code like above could be the result of inlining.

Suppose f is passed into a function that takes an io.Reader argument.

Once the function is inlined, now the io.Reader becomes concrete.)

PGO-driven devirtualization extends this concept to situations where the concrete type is not statically known, but profiling can show that, for example, an io.Reader.Read call targets os.(*File).Read most of the time.

In this case, PGO can replace r.Read(b) with something like:

if f, ok := r.(*os.File); ok {

f.Read(b)

} else {

r.Read(b)

}

That is, we add a runtime check for the concrete type that is most likely to appear, and if so use a concrete call, or otherwise fall back to the standard indirect call.

The advantage here is that the common path (using *os.File) can be inlined and have additional optimizations applied, but we still maintain a fallback path because a profile is not a guarantee that this will always be the case.

In our analysis of the markdown server we didn’t see PGO-driven devirtualization, but we also only looked at the top impacted areas.

PGO (and most compiler optimizations) generally yield their benefit in the aggregate of very small improvements in lots of different places, so there is likely more happening than just what we looked at.

Inlining and devirtualization are the two PGO-driven optimizations available in Go 1.21, but as we’ve seen, these often unlock additional optimizations.

In addition, future versions of Go will continue to improve PGO with additional optimizations.


Profile-guided optimization - Wikipedia


From Wikipedia, the free encyclopedia

Compiler optimization technique

Profile-guided optimization (PGO, sometimes pronounced as pogo[1]), also known as profile-directed feedback (PDF),[2] and feedback-directed optimization (FDO)[3] is a compiler optimization technique in computer programming that uses profiling to improve program runtime performance.

Method

Optimization techniques based on static program analysis of the source code consider code performance improvements without actually executing the program. No dynamic program analysis is performed. The analysis may even consider code within loops including the number of times the loop will execute, for example in loop unrolling. In the absence of all the run time information, static program analysis cannot take into account how frequently that code section is actually executed.

The first high-level compiler, introduced as the Fortran Automatic Coding System in 1957, broke the code into blocks and devised a table of the frequency each block is executed via a simulated execution of the code in a Monte Carlo fashion in which the outcome of conditional transfers (as via IF-type statements) is determined by a random number generator suitably weighted by whatever FREQUENCY statements were provided by the programmer.[4]

Rather than programmer-supplied frequency information, profile-guided optimization uses the results of profiling test runs of the instrumented program to optimize the final generated code.[5][6][7] The compiler accesses profile data from a sample run of the program across a representative input set. The results indicate which areas of the program are executed more frequently, and which areas are executed less frequently. All optimizations benefit from profile-guided feedback because they are less reliant on heuristics when making compilation decisions. The caveat, however, is that the sample of data fed to the program during the profiling stage must be statistically representative of the typical usage scenarios; otherwise, profile-guided feedback has the potential to harm the overall performance of the final build instead of improving it.

Just-in-time compilation can make use of runtime information to dynamically recompile parts of the executed code to generate a more efficient native code. If the dynamic profile changes during execution, it can deoptimize the previous native code, and generate a new code optimized with the information from the new profile.

Adoption

There is support for building Firefox using PGO.[8] Even though PGO is effective, it has not been widely adopted by software projects, due to its tedious dual-compilation model.[9] It is also possible to perform PGO without instrumentation by collecting a profile using hardware performance counters.[9] This sampling-based approach has a much lower overhead and does not require a special compilation.

The HotSpot Java virtual machine (JVM) uses profile-guided optimization to dynamically generate native code. As a consequence, a software binary is optimized for the actual load it is receiving. If the load changes, adaptive optimization can dynamically recompile the running software to optimize it for the new load. This means that all software executed on the HotSpot JVM effectively makes use of profile-guided optimization.[10]

PGO has been adopted in the Microsoft Windows version of Google Chrome. PGO was enabled in the 64-bit edition of Chrome starting with version 53 and version 54 for the 32-bit edition.[11]

Google published a paper[12] describing a tool that uses production profiles to guide builds, resulting in up to a 10% performance improvement.

Implementations

Examples of compilers that implement PGO are:

Intel C++ Compiler and Fortran compilers[6]

GNU Compiler Collection compilers

Oracle Solaris Studio (formerly called Sun Studio)

Microsoft Visual C++ compiler[1][13]

Clang[14]

IBM XL C/C++[15]

GraalVM[16] Enterprise Edition

.NET JIT compiler[17]

Go[18]

Cloud-Native PostgreSQL Clusters - PGO: Getting Started in 5 Minutes - Tencent Cloud Developer Community

Author: 为少. Published 2022-03-31 19:33:22. Collected in the column 黑客下午茶.

Prerequisites

Make sure the following utilities are installed on your host:

kubectl

git

Installation

Step 1: Download the examples

First, go to GitHub and fork the Postgres Operator examples repository:

https://github.com/CrunchyData/postgres-operator-examples/fork

Once you have forked the repo, you can download it to your working environment with a command similar to the following:

YOUR_GITHUB_UN=""

git clone --depth 1 "git@github.com:${YOUR_GITHUB_UN}/postgres-operator-examples.git"

cd postgres-operator-examples

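Step 2: Install PGO, the Postgres Operator

You can install PGO, the Postgres Operator from Crunchy Data, with the following command:

kubectl apply -k kustomize/install

This creates a namespace named postgres-operator and all of the objects required to deploy PGO.

To check the installation status, run the following command:

kubectl -n postgres-operator get pods \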

--selector=postgres-operator.crunchydata.com/control-plane=postgres-operator \

--field-selector=status.phase=Running

If the PGO Pod is healthy, you should see output similar to the following:

NAME READY STATUS RESTARTS AGE

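postgres-operator-9dd545d64-t4h8d 1/1 Running 0 3s

Create a Postgres cluster

Let's create a simple Postgres cluster. You can do this by running the following command:

kubectl apply -k kustomize/postgres

Note: make sure your cluster already has a default Storage Class.

This creates a Postgres cluster named hippo in the postgres-operator namespace. You can track the cluster's progress with the following command:

kubectl -n postgres-operator describe postgresclusters.postgres-operator.crunchydata.com hippo

Connect to the Postgres cluster

As part of creating a Postgres cluster, the Postgres Operator creates a PostgreSQL user account. The credentials for this account are stored in a Secret named -pguser- (hippo-pguser-hippo in the commands below). The attributes in this Secret provide the information you need to log in to the PostgreSQL cluster. They include:

user: The name of the user account.

password: The password for the user account.

dbname: The name of the database that the user has access to by default.

host: The name of the database host. This references the Service of the primary Postgres instance.

port: The port that the database is listening on.

uri: A PostgreSQL connection URI that provides all the information needed to log in to the Postgres database.

jdbc-uri: A PostgreSQL JDBC connection URI that provides all the information needed to log in to the Postgres database via the JDBC driver.

If you deploy your Postgres cluster with the PgBouncer connection pooler, additional values are populated in the user Secret, including:

pgbouncer-host: The host name of the PgBouncer connection pooler. This references the Service of the PgBouncer connection pooler.

pgbouncer-port: The port that the PgBouncer connection pooler is listening on.

pgbouncer-uri: A PostgreSQL connection URI that provides all the information needed to log in to the Postgres database via the PgBouncer connection pooler.

pgbouncer-jdbc-uri: A PostgreSQL JDBC connection URI that provides all the information needed to log in to the Postgres database via the PgBouncer connection pooler using the JDBC driver.

Note that all connections use TLS. PGO sets up a PKI for your Postgres cluster. You can also choose to bring your own PKI / certificate authority; this is covered later in the documentation.

PgBouncer: https://www.pgbouncer.org/

Connect via psql in the terminal

Connect directly

If you are on the same network as your PostgreSQL cluster, you can connect to it directly with the following command:

psql $(kubectl -n postgres-operator get secrets hippo-pguser-hippo -o go-template='{{.data.uri | base64decode}}')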

Connect using a port-forward

In a new terminal, create a port-forward:

PG_CLUSTER_PRIMARY_POD=$(kubectl get pod -n postgres-operator -o name \

-l postgres-operator.crunchydata.com/cluster=hippo,postgres-operator.crunchydata.com/role=master)

kubectl -n postgres-operator port-forward "${PG_CLUSTER_PRIMARY_POD}" 5432:5432

Establish a connection to the PostgreSQL cluster:

PG_CLUSTER_USER_SECRET_NAME=hippo-pguser-hippo

PGPASSWORD=$(kubectl get secrets -n postgres-operator "${PG_CLUSTER_USER_SECRET_NAME}" -o go-template='{{.data.password | base64decode}}') \

PGUSER=$(kubectl get secrets -n postgres-operator "${PG_CLUSTER_USER_SECRET_NAME}" -o go-template='{{.data.user | base64decode}}') \

PGDATABASE=$(kubectl get secrets -n postgres-operator "${PG_CLUSTER_USER_SECRET_NAME}" -o go-template='{{.data.dbname | base64decode}}') \

psql -h localhost

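Hands-on: connecting Keycloak to the PostgreSQL cluster

The information provided in the user Secret allows you to connect an application directly to your PostgreSQL database. For example, let's connect Keycloak. Keycloak is a popular open source identity management tool backed by a PostgreSQL database. Using the hippo cluster we created, we can deploy the following manifest:

Keycloak: https://www.keycloak.org/

cat <<EOF >> keycloak.yaml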

apiVersion: apps/v1
kind: Deployment
metadata:
  name: keycloak
  namespace: postgres-operator
  labels:
    app.kubernetes.io/name: keycloak
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: keycloak
  template:
    metadata:
      labels:
        app.kubernetes.io/name: keycloak
    spec:
      containers:
      - image: quay.io/keycloak/keycloak:latest
        name: keycloak
        args: ["start-dev"]
        env:
        - name: DB_VENDOR
          value: "postgres"
        - name: DB_ADDR
          valueFrom: { secretKeyRef: { name: hippo-pguser-hippo, key: host } }
        - name: DB_PORT
          valueFrom: { secretKeyRef: { name: hippo-pguser-hippo, key: port } }
        - name: DB_DATABASE
          valueFrom: { secretKeyRef: { name: hippo-pguser-hippo, key: dbname } }
        - name: DB_USER
          valueFrom: { secretKeyRef: { name: hippo-pguser-hippo, key: user } }
        - name: DB_PASSWORD
          valueFrom: { secretKeyRef: { name: hippo-pguser-hippo, key: password } }
        - name: KEYCLOAK_ADMIN
          value: "admin"
        - name: KEYCLOAK_ADMIN_PASSWORD
          value: "admin"
        - name: PROXY_ADDRESS_FORWARDING
          value: "true"
        ports:
        - name: http
          containerPort: 8080
        - name: https
          containerPort: 8443
      restartPolicy: Always

EOF

kubectl apply -f keycloak.yaml

kubectl -n postgres-operator port-forward ${KEYCLOAK_POD} 8086:8080 --address='0.0.0.0'

# Forwarding from 0.0.0.0:8086 -> 8080

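Go to http://127.0.0.1:8086.

The kustomize/keycloak folder contains a complete example of how to deploy Keycloak with the Postgres Operator.

Note: pulling quay.io/keycloak/keycloak:latest may require a proxy or a registry mirror. The keycloak.yaml shown above already includes the author's modifications.

Congratulations, your Postgres cluster is up and running, with an application connected to it! You can learn more about the postgresclusters custom resource definition through the documentation and through kubectl explain:

kubectl explain postgresclusters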

postgresclusters custom resource definition: https://access.crunchydata.com/documentation/postgres-operator/5.0.4/references/crd/

Originally published on 2022-02-23 by the WeChat public account 黑客下午茶 and shared through the Tencent Cloud self-media sharing program.

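Using Profile Guided Optimization to improve application performance - Zhihu

First published in the column 不闹的编程. Author: 不闹不休.

What is Profile Guided Optimization

At compile time, choosing an optimization level already lets the compiler perform suitable optimizations for us, such as inlining short functions. Now consider this scenario: there is a slightly longer function, just long enough that the compiler does not inline calls to it, yet it is in fact a hot call that is invoked a great many times at run time. Wouldn't it be nice if the compiler could optimize this case as well? But how would the compiler know that this "slightly longer function" is hot? This is where Profile Guided Optimization (PGO) comes in.

PGO is a technique that optimizes based on run-time profiling data. If an application were used with no particular pattern, we could assume that calls in the code have no bias. In practice, however, operating an application usually follows a fixed workflow, and this is especially visible at program startup. By collecting profiling data for this "typical operation flow" and letting the compiler recompile the code with that data, knowledge gained at run time is applied at compile time, which yields some performance gain.

It is worth pointing out that the gain obtained this way is not dramatic, usually only 5-10%. PGO is worth trying once the other options have been exhausted.

Experience using PGO

The following describes methods and lessons learned from doing PGO on macOS. The core knowledge carries over to other platforms, as long as the compiler is Clang.

First, Xcode already provides a UI for PGO (for details see https://developer.apple.com/library/archive/documentation/DeveloperTools/Conceptual/xcode_profile_guided_optimization/Introduction/Introduction.html#//apple_ref/doc/uid/TP40014459-CH1-SW1), so for a simple application the UI is the easy route. The UI has some drawbacks, however:

The control granularity is coarse: either PGO is off, or it is applied to all code. If the project contains Swift code, this approach cannot be used, because Swift does not support PGO (at least in the Xcode 9.3 environment used here).

Only two ways of collecting profiling data are supported. The first is to run the application manually each time; after the application quits, Xcode produces an xxx.profdata file, and subsequent builds use it for optimization. If the code changes, Xcode reports that the profdata file is out of date. The second is to collect profiling data through XCTest. This gives some automation capability, but it also ties the hands of the automation team, who may be using tools they find better than XCTest.

In a real development environment we generally use automation tests as the training set rather than manual runs, and automated testing tools are usually hard to integrate into XCTest.

Xcode's backend compiler is Clang, and the PGO UI feature also comes from Clang, so working directly from the command line may overcome these drawbacks. With that idea in mind I did some research. In this Q&A (https://stackoverflow.com/questions/35582268/clang-pgo-empty-profraw) the author describes the approach he uses on the command line:

I compile with -fprofile-instr-generate: clang++ -o test -fprofile-instr-generate dummy.cpp
The executable "test", when launched, generates a default.profraw file
I can merge the profiles with llvm-profdata merge
At the end I can compile with the profiles integration, with -fprofile-instr-use on the .profdata

So the usage is roughly as follows. First compile with -fprofile-instr-generate, then run the application to obtain profraw files (for example through automation tests, so the files are obtained reproducibly). Next, use the llvm-profdata tool to merge the profraw files into one profdata file; this step is needed even for a single profraw file, because it converts the raw profile into the indexed format that the compiler reads. Finally, compile again with -fprofile-instr-use=xxx.profdata.

After a series of experiments I arrived at the following findings, which I believe help both in understanding PGO and in using it correctly (environment: macOS 10.13, Xcode 9.3):

Different object files in the same binary do not force instrumentation on each other. Take two source files, A.cpp and B.cpp; A is compiled with -fprofile-instr-generate and B without. A.o contains the function calls inserted by clang (___llvm_xxx) and B.o does not; the linked executable contains the inserted calls. Running that executable produces default.profraw.

An instrumented library does not force instrumentation on its clients. A is compiled with -fprofile-instr-generate into a dylib; B is compiled without the flag into an executable that uses the dylib. The dylib contains the inserted functions and B's executable does not; running the executable produces default.profraw.

These two findings show that the profraw file is produced correctly even when only part of the code is built with profiling.

On the Mac, the executable must be launched directly for a profraw to be produced; launching with open XXX.app produces no profraw file.

The generated profraw is named default.profraw by default and is written to the directory from which the launch command was run, that is, the current directory.

The name default.profraw can be changed in two ways: by passing -fprofile-instr-generate=XXX.profraw, or by setting the environment variable LLVM_PROFILE_FILE. Since the goal is automation, it is also perfectly feasible to move the profraw out of the current directory and rename it after each automation test, with mv $(pwd)/default.profraw followed by the destination path.

On the Mac, running llvm-profdata directly usually reports that the command is not found. The first instinct may be to download or install it, but that is unnecessary: Xcode itself uses Clang as its compiler, so the tool is already installed, and it can be run through xcrun: xcrun llvm-profdata merge /*.profraw -output pgo.profdata.

-fprofile-instr-generate/use are not only compiler options but also linker flags, so when configuring an Xcode project both OTHER_CPLUSPLUSFLAGS and OTHER_LDFLAGS must be set. With environment variables and an xcconfig file, the whole process is easy to automate.

Summary

On top of compile-time optimization, PGO lets us use run-time information to improve performance by a further 5-10%. For applications of ordinary size, the UI provided by Xcode is sufficient. For larger programs that need to be integrated into an automated pipeline, the two Clang options -fprofile-instr-generate and -fprofile-instr-use give a more flexible solution.

Published 2018-11-01 17:20. Topics: compilation, performance optimization, Clang.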
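The command-line workflow described in this article (instrumented build, training run, llvm-profdata merge, profile-guided rebuild) can be sketched as a small script. This is only an illustration under stated assumptions: the file names dummy.cpp, app_instr, app_pgo and pgo.profdata, the helper functions pgo_flag and run_pgo_demo, and the PROFDATA variable are all hypothetical, and the demo is skipped when clang++ or llvm-profdata cannot be found. The flags themselves and the %p pattern in LLVM_PROFILE_FILE (replaced by the process ID) are standard Clang features.

```shell
#!/bin/sh
# Sketch of the instrument -> train -> merge -> rebuild loop with Clang.
# File names and helper names are illustrative, not taken from the article.

# Print the clang flag for a PGO stage: "generate", or "use <profdata-file>".
pgo_flag() {
  case "$1" in
    generate) printf '%s\n' '-fprofile-instr-generate' ;;
    use)      printf '%s\n' "-fprofile-instr-use=$2" ;;
    *)        echo "unknown stage: $1" >&2; return 1 ;;
  esac
}

run_pgo_demo() {
  workdir=$(mktemp -d) || return 1
  (
    cd "$workdir" || exit 1
    # A tiny training program with a branch that is almost always taken.
    cat > dummy.cpp <<'EOF'
#include <cstdio>
static long step(long x) { return (x % 1000 != 0) ? x + 1 : x / 2; }
int main() {
  long v = 1;
  for (long i = 0; i < 1000000; ++i) v = step(v);
  std::printf("%ld\n", v);
  return 0;
}
EOF
    # 1. Instrumented build; the flag applies to both compiling and linking.
    clang++ -O2 "$(pgo_flag generate)" -o app_instr dummy.cpp || exit 1
    # 2. Training runs; %p keeps the runs from overwriting each other's profile.
    LLVM_PROFILE_FILE="run-%p.profraw" ./app_instr >/dev/null || exit 1
    LLVM_PROFILE_FILE="run-%p.profraw" ./app_instr >/dev/null || exit 1
    # 3. Merge the raw profiles into one indexed profile.
    $PROFDATA merge -output=pgo.profdata run-*.profraw || exit 1
    # 4. Rebuild, letting the compiler read the profile.
    clang++ -O2 "$(pgo_flag use pgo.profdata)" -o app_pgo dummy.cpp || exit 1
    ./app_pgo
  )
  status=$?
  rm -rf "$workdir"
  return "$status"
}

# Locate llvm-profdata; on macOS it is usually reachable through xcrun.
if command -v llvm-profdata >/dev/null 2>&1; then
  PROFDATA="llvm-profdata"
elif command -v xcrun >/dev/null 2>&1; then
  PROFDATA="xcrun llvm-profdata"
else
  PROFDATA=""
fi

if command -v clang++ >/dev/null 2>&1 && [ -n "$PROFDATA" ]; then
  run_pgo_demo || echo "PGO demo failed" >&2
else
  echo "clang++ or llvm-profdata not found; skipping the PGO demo"
fi
```

In a real project the training step would be replaced by the automation tests mentioned above, and the two flags would go into both the compiler and linker settings rather than a single clang++ invocation.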


