Intro
I have a set of fairly large and complex static libraries for an embedded target that I want to optimize with LTO at the boundary of their public API. For this question, I use a project with the following layout:
The public.h header declares the public API symbols, which have default visibility:
#pragma once
#pragma GCC visibility push(default)
void public_function_1(int a);
void public_function_2(void);
int public_function_3(int a, int b, int* c);
#pragma GCC visibility pop
The internal.h header declares internal symbols that are never used outside the library:
#pragma once
int internal_function_1(int a);
void internal_function_2(int a, int b);
int internal_function_3(int a, int b);
The content of public.c is not that important, but it's included here for a fuller picture:
#include "public.h"
#include "internal.h"
static int some_state = 1;

void public_function_1(int a)
{
    some_state *= a;
    some_state++;
}

void public_function_2(void)
{
    internal_function_2(10, 20);
    some_state = internal_function_1(some_state);
}

int public_function_3(int a, int b, int* c)
{
    internal_function_2(20, 50);
    *c = internal_function_3(a, b);
    return *c + some_state;
}
The content of internal.c is irrelevant.
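Since internal.c is not shown, anyone reproducing this needs some stand-in for it. The sketch below is hypothetical and is not the real file: the bodies are guesses chosen only to be consistent with the disassembly further down (for example, public_function_2 ends up storing the constant 300 regardless of its input, which suggests that internal_function_1 ignores its argument). The file-scope some_state is assumed to have a nonzero initializer, which would place it in .data, matching the two some_state.lto_priv entries that nm reports as d.

```c
/* Hypothetical stand-in for internal.c (the real file is not shown).
   The bodies are guesses consistent with the disassembly, nothing more. */

/* Nonzero initializer so the variable lands in .data (assumption). */
static int some_state = 1;

int internal_function_1(int a)
{
    (void)a;              /* the disassembly suggests the argument is unused */
    some_state *= 10;     /* multiply the file-local state by 10 ... */
    return some_state;    /* ... and return the new value */
}

void internal_function_2(int a, int b)
{
    some_state = a + b;   /* overwrite the file-local state */
}

int internal_function_3(int a, int b)
{
    return a + b;         /* pure function, no state involved */
}
```

With these bodies, public_function_2 always leaves 300 in both state variables (10 + 20, then multiplied by 10), which matches the mov.w r3, #300 followed by two stores in its disassembly.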
Building the library
I know that static libraries are merely archives of object files, so here is my plan to optimize them with LTO:
- Compile public.c and internal.c with the following commands:
arm-zephyr-eabi-gcc -ffunction-sections -fdata-sections -Os -flto -g3 -fvisibility=internal -c internal.c -o internal.c.obj <arch flags>
arm-zephyr-eabi-gcc -ffunction-sections -fdata-sections -Os -flto -g3 -fvisibility=internal -c public.c -o public.c.obj <arch flags>
- Partially link the object files with -r -flinker-output=nolto-rel:
arm-zephyr-eabi-gcc -ffunction-sections -fdata-sections -Os -flto -g3 -r -flinker-output=nolto-rel internal.c.obj public.c.obj -o lib
- Put the resulting object file into a library with ar.
The -flinker-output=nolto-rel option instructs GCC to emit machine code instead of GIMPLE IR.
I'm using the arm-zephyr-eabi toolchain, but it should work the same for plain system gcc as well. The GCC version is 12.1.0.
I ended up with an object file that more or less contains what I need, with one caveat. Here's its disassembly:
lib: file format elf32-littlearm
Disassembly of section .text.internal_function_1:
00000000 <internal_function_1>:
0: 220a movs r2, #10
2: 4b02 ldr r3, [pc, #8] ; (c <internal_function_1+0xc>)
4: 6818 ldr r0, [r3, #0]
6: 4350 muls r0, r2
8: 6018 str r0, [r3, #0]
a: 4770 bx lr
c: 00000000 .word 0x00000000
Disassembly of section .text.internal_function_2:
00000000 <internal_function_2>:
0: 4b01 ldr r3, [pc, #4] ; (8 <internal_function_2+0x8>)
2: 4408 add r0, r1
4: 6018 str r0, [r3, #0]
6: 4770 bx lr
8: 00000000 .word 0x00000000
Disassembly of section .text.internal_function_3:
00000000 <internal_function_3>:
0: 4408 add r0, r1
2: 4770 bx lr
Disassembly of section .text.public_function_1:
00000000 <public_function_1>:
0: 4b02 ldr r3, [pc, #8] ; (c <public_function_1+0xc>)
2: 681a ldr r2, [r3, #0]
4: 4350 muls r0, r2
6: 3001 adds r0, #1
8: 6018 str r0, [r3, #0]
a: 4770 bx lr
c: 00000000 .word 0x00000000
Disassembly of section .text.public_function_2:
00000000 <public_function_2>:
0: f44f 7396 mov.w r3, #300 ; 0x12c
4: 4a02 ldr r2, [pc, #8] ; (10 <public_function_2+0x10>)
6: 6013 str r3, [r2, #0]
8: 4a02 ldr r2, [pc, #8] ; (14 <public_function_2+0x14>)
a: 6013 str r3, [r2, #0]
c: 4770 bx lr
e: bf00 nop
...
Disassembly of section .text.public_function_3:
00000000 <public_function_3>:
0: b510 push {r4, lr}
2: 2446 movs r4, #70 ; 0x46
4: 4b03 ldr r3, [pc, #12] ; (14 <public_function_3+0x14>)
6: 4408 add r0, r1
8: 601c str r4, [r3, #0]
a: 4b03 ldr r3, [pc, #12] ; (18 <public_function_3+0x18>)
c: 6010 str r0, [r2, #0]
e: 681b ldr r3, [r3, #0]
10: 4418 add r0, r3
12: bd10 pop {r4, pc}
...
The problem
As you can see in the disassembly, the internal_function_xyz functions have been inlined into the bodies of the public functions, so LTO itself works correctly. What I'm not happy about is that the internal_function_xyz symbols, along with their machine code, are still present in the object file. I expected the linker to discard them, since they were compiled with internal (or hidden) visibility. The output of nm shows the following:
00000000 T internal_function_1
00000000 T internal_function_2
00000000 T internal_function_3
00000000 T public_function_1
00000000 T public_function_2
00000000 T public_function_3
00000000 d some_state.lto_priv.0
00000000 d some_state.lto_priv.1
So although these symbols had internal or hidden visibility, they are still global (uppercase T) in the symbol table. My suspicion is that this is what caused the linker to keep them.
I tried to get rid of those symbols with objcopy and strip, like so:
arm-zephyr-eabi-objcopy --localize-hidden lib localized_symbols
The symbol table now looks like this:
00000000 t internal_function_1
00000000 t internal_function_2
00000000 t internal_function_3
00000000 T public_function_1
00000000 T public_function_2
00000000 T public_function_3
00000000 d some_state.lto_priv.0
00000000 d some_state.lto_priv.1
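The switch from T to t is only a change of symbol binding. In ELF, a symbol's binding (STB_LOCAL, STB_GLOBAL, ...) and its visibility (STV_DEFAULT, STV_HIDDEN, STV_INTERNAL, ...) are separate fields of the symbol table entry, and nm's upper/lower case reflects the binding only, which is why the hidden globals were listed as T before; readelf -s prints both fields. Below is a small model of this, and of what --localize-hidden does to a single entry. It assumes a host with glibc's <elf.h>, and the helper names are hypothetical.

```c
#include <elf.h>

/* nm letter for a symbol defined in a .text section.
   nm picks the case from the binding alone: local -> 't', global -> 'T'.
   Visibility is not consulted, so a hidden global still shows as 'T'.
   (Simplification: weak symbols, which nm shows as 'W', are not modelled.) */
static char nm_text_letter(unsigned char st_info)
{
    return ELF32_ST_BIND(st_info) == STB_LOCAL ? 't' : 'T';
}

/* True if st_other carries hidden or internal visibility. */
static int is_hidden_or_internal(unsigned char st_other)
{
    unsigned char vis = ELF32_ST_VISIBILITY(st_other);
    return vis == STV_HIDDEN || vis == STV_INTERNAL;
}

/* Rough model of objcopy --localize-hidden on one symbol:
   hidden/internal symbols get STB_LOCAL binding; the symbol type is kept. */
static unsigned char localize_hidden(unsigned char st_info, unsigned char st_other)
{
    if (is_hidden_or_internal(st_other))
        return ELF32_ST_INFO(STB_LOCAL, ELF32_ST_TYPE(st_info));
    return st_info;
}
```

The model only rewrites the symbol table entry; nothing in it touches the .text.internal_function_* sections the symbols point into.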
After the following command:
arm-zephyr-eabi-strip --strip-unneeded localized_symbols
I end up with:
00000000 T public_function_1
00000000 T public_function_2
00000000 T public_function_3
However, the machine code is still present in the disassembly:
localized_symbols: file format elf32-littlearm
Disassembly of section .text.internal_function_1:
00000000 <.text.internal_function_1>:
0: 220a movs r2, #10
2: 4b02 ldr r3, [pc, #8] ; (c <.text.internal_function_1+0xc>)
4: 6818 ldr r0, [r3, #0]
6: 4350 muls r0, r2
8: 6018 str r0, [r3, #0]
a: 4770 bx lr
c: 00000000 .word 0x00000000
Disassembly of section .text.internal_function_2:
00000000 <.text.internal_function_2>:
0: 4b01 ldr r3, [pc, #4] ; (8 <.text.internal_function_2+0x8>)
2: 4408 add r0, r1
4: 6018 str r0, [r3, #0]
6: 4770 bx lr
8: 00000000 .word 0x00000000
Disassembly of section .text.internal_function_3:
00000000 <.text.internal_function_3>:
0: 4408 add r0, r1
2: 4770 bx lr
Disassembly of section .text.public_function_1:
00000000 <public_function_1>:
0: 4b02 ldr r3, [pc, #8] ; (c <public_function_1+0xc>)
2: 681a ldr r2, [r3, #0]
4: 4350 muls r0, r2
6: 3001 adds r0, #1
8: 6018 str r0, [r3, #0]
a: 4770 bx lr
c: 00000000 .word 0x00000000
Disassembly of section .text.public_function_2:
00000000 <public_function_2>:
0: f44f 7396 mov.w r3, #300 ; 0x12c
4: 4a02 ldr r2, [pc, #8] ; (10 <public_function_2+0x10>)
6: 6013 str r3, [r2, #0]
8: 4a02 ldr r2, [pc, #8] ; (14 <public_function_2+0x14>)
a: 6013 str r3, [r2, #0]
c: 4770 bx lr
e: bf00 nop
...
Disassembly of section .text.public_function_3:
00000000 <public_function_3>:
0: b510 push {r4, lr}
2: 2446 movs r4, #70 ; 0x46
4: 4b03 ldr r3, [pc, #12] ; (14 <public_function_3+0x14>)
6: 4408 add r0, r1
8: 601c str r4, [r3, #0]
a: 4b03 ldr r3, [pc, #12] ; (18 <public_function_3+0x18>)
c: 6010 str r0, [r2, #0]
e: 681b ldr r3, [r3, #0]
10: 4418 add r0, r3
12: bd10 pop {r4, pc}
...
Question
Is there any way to optimize the library with LTO, keep only the public API symbols in the symbol table, and not end up with any redundant internal symbols or machine code?
Additional concerns
Since in my case GCC ends up generating machine code for the internal functions and putting them in the symbol table, I suspect that the optimization might not be done to the fullest extent. Let's say that internal_function_1 were much larger, and that it is referenced only once within the library. If the linker sees more than one potential reference to the symbol (one from the library code, plus an unknown number from externally linked code, because the symbol is present in the symbol table), the optimizer may be more reluctant to inline such an internal function and may not optimize it aggressively. I haven't confirmed this hypothesis yet, but if it's true, then I think the only reasonable solution would involve preventing those internal symbols from ever being generated and inserted into the symbol table at the relocatable link stage.
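For comparison with the hypothesis above, here is a minimal sketch of the opposite situation: a single translation unit (a unity build) in which the internal helpers have internal linkage (static) instead of hidden visibility, so no other object can ever reference them. GCC's documentation for -finline-functions-called-once (enabled at -Os) says that static functions called once are considered for inlining and, once the call is integrated, are not output as code in their own right; helpers with several callers, like internal_function_2 here, are left to the usual size heuristics. The helper bodies and the internal_state variable are hypothetical stand-ins, not the real internal.c.

```c
/* Hypothetical single-TU variant of the library (unity build).
   The internal_* helpers are static, so nothing outside this file can
   reference them and GCC is free to drop their standalone copies after
   inlining. In the real library, public.h's visibility pragma would keep
   the public_* functions at default visibility; omitted here. */

static int some_state = 1;      /* from public.c */
static int internal_state = 1;  /* hypothetical stand-in for internal.c's state */

static int internal_function_1(int a)
{
    (void)a;                    /* illustrative body: argument unused */
    internal_state *= 10;
    return internal_state;
}

static void internal_function_2(int a, int b)
{
    internal_state = a + b;
}

static int internal_function_3(int a, int b)
{
    return a + b;
}

void public_function_1(int a)
{
    some_state *= a;
    some_state++;
}

void public_function_2(void)
{
    internal_function_2(10, 20);
    some_state = internal_function_1(some_state);
}

int public_function_3(int a, int b, int *c)
{
    internal_function_2(20, 50);
    *c = internal_function_3(a, b);
    return *c + some_state;
}
```

Whether the static helpers actually disappear from the resulting object file can be checked with nm after compiling this file with the same -Os flags; the sketch makes no claim about what the arm-zephyr-eabi toolchain produces.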
