---
layout: post
author: Pascal Cuoq
date: 2012-01-07 11:00 +0200
categories: OCaml
format: xhtml
title: "Making OCaml native code 0.5% shorter on Mac OS X"
summary: 
---
{% raw %}
<h2>Mac OS X and assembly labels</h2> 
<p>A few months ago, I was moving things around in <a href="http://forge.ocamlcore.org/projects/zarith/">Zarith</a>. It's a good way to 
relax, not unlike gardening. And then I noticed something strange.</p> 
<ul> 
<li>On Mac OS X, a label in assembly code is marked as local by prefixing it with "L", unlike another common convention of using ".".</li> 
<li>On Mac OS X, the assembler has some sort of strange limitation that makes it generate long variants of jump instructions if the destination label is not local, even if the label is in the same compilation unit (and its relative position is known).</li> 
</ul> 
<p>Here is a hand-written assembly snippet to show this:</p> 
<pre>	testl	%eax, %eax 
	jne	L6 
	testl	%ebx, %ebx 
	jne	.L7 
	movl	$0, %ecx 
L6: 
	movl	$0, %eax 
.L7: 
	movl	$0, %ebx 
</pre> 
<p>Together, these two facts mean that on Mac OS X, the snippet is compiled into object code that can then be disassembled as:</p> 
<pre>0x0000000000000057 &lt;main+24&gt;:	test   %eax,%eax 
0x0000000000000059 &lt;main+26&gt;:	jne    0x68 &lt;main+41&gt; 
0x000000000000005b &lt;main+28&gt;:	test   %ebx,%ebx 
0x000000000000005d &lt;main+30&gt;:	jne    0x63 &lt;main+36&gt; 
0x0000000000000063 &lt;main+36&gt;:	mov    $0x0,%ecx 
0x0000000000000068 &lt;main+41&gt;:	mov    $0x0,%eax 
0x000000000000006d &lt;.L7+0&gt;:	mov    $0x0,%ebx 
</pre> 
<p>You may notice that since <code>.L7</code> is not a local label, <code>gdb</code> considers that it may be the name you want to see at address 0x6d. This is just a heuristic of little importance. The second conditional jump at &lt;main+30&gt; appears to be going to &lt;main+36&gt;, but this is just because we are looking at an unlinked object file. The destination has been left blank in the object file, and since it is expressed as a relative offset, the default value 0 makes it look like the destination is the instruction that immediately follows. More to the point, the first conditional jump at &lt;main+26&gt; occupies 2 bytes, because the assembler sees that the destination is close and that the relative offset fits into one byte, whereas the second conditional jump at &lt;main+30&gt; occupies 6 bytes, leaving room for a 4-byte encoding of the relative offset.</p> 
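<p>The two sizes can be reproduced with a minimal Python sketch of the two encodings of <code>jne</code>. The opcodes (0x75 for the short form, 0x0F 0x85 for the long form) are the standard x86 ones. The function names, and the rule "short form only for labels starting with L", are assumptions that merely model the behavior described above, not the assembler's actual logic.</p>

```python
# Sketch of the two x86 encodings of `jne` discussed above.
# Assumption: the "local label" rule models the Mac OS X assembler
# behavior described in this post, not a documented specification.

def is_local_on_macosx(label):
    """A label is local on Mac OS X when it starts with "L" (".L" is not)."""
    return label.startswith("L")

def encode_jne(label, jump_addr, target_addr):
    """Return the bytes for a `jne` located at jump_addr going to target_addr."""
    short_rel = target_addr - (jump_addr + 2)  # offset from the end of the 2-byte form
    if is_local_on_macosx(label) and short_rel in range(-128, 128):
        # Short form: opcode 0x75 followed by a signed 8-bit offset.
        return bytes([0x75]) + short_rel.to_bytes(1, "little", signed=True)
    # Long form: opcode 0x0F 0x85 followed by a 32-bit offset, left at 0
    # here because the object file leaves it blank, to be filled in later.
    return bytes([0x0F, 0x85]) + (0).to_bytes(4, "little", signed=True)

# Addresses taken from the disassembly above.
short_form = encode_jne("L6", 0x59, 0x68)
long_form = encode_jne(".L7", 0x5D, 0x6D)
print(short_form.hex(), len(short_form))
print(long_form.hex(), len(long_form))
```

<p>With the addresses from the disassembly, the jump to <code>L6</code> fits in 2 bytes, while the jump to <code>.L7</code> takes 6 bytes even though its target is only a few instructions away.</p>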
<h2>Enter OCaml. The plot thickens.</h2> 
<p>Antoine Miné and Xavier Leroy then respectively contributed the following additional facts:</p> 
<ul> 
<li>OCaml generates labels intended to be local with a ".L" prefix on Mac OS X. This at first sight seems inefficient, since it leads the assembler to use the long encoding of jumps all the time, even when the destination is nearby.</li> 
<li>But in fact, Mac OS X's linker complains when you have subtracted local labels in the file being linked, so that if you intend to subtract some of the labels you are generating, you shouldn't make them local anyway.</li> 
</ul> 
<p>Subtracting addresses indicated by local labels is something that OCaml does in the process of generating meta-data accompanying the code in the assembly file. Thus, on Mac OS X, OCaml is prevented from using proper local labels with an "L" prefix.</p> 
<p>Mac OS X's compilation chain is, of course, being obtuse. It could generate the short jump variants when the non-local destination label happens to be known and nearby. It could as well compute whatever subtractions between local labels occur in an assembly file while it is being assembled, instead of leaving them for later and then complaining that it's impossible. The usual GNU compilation suite on a modern Linux distribution gets both of these features right, and Mac OS X's gets both of them wrong, leaving the OCaml native compiler no choice but to generate the inefficiently compiled non-local labels. Mac OS X deserves all the blame here.</p> 
<h2>Solution: a hack</h2> 
<p>Are we destined to keep ugly 6-byte jumps in our OCaml-generated native code on Mac OS X then? No, because I made a hack. In Frama-C's Makefile, I changed the rule to 
compile .ml files into the .cmx, .o, ..., natively compiled versions thus:</p> 
<pre>--- share/Makefile.common	(revision 16792) 
+++ share/Makefile.common	(working copy) 
@@ -306,7 +306,10 @@ 
 %.cmx: %.ml 
 	$(PRINT_OCAMLOPT) $@ 
-	$(OCAMLOPT) -c $(OFLAGS) $&lt; 
+	$(OCAMLOPT) -S -c $(OFLAGS) $&lt; 
+	sed -f /Users/pascal/ppc/sed_asm \ 
+            &lt; $(patsubst %.ml,%.s,$&lt;) &gt; $(patsubst %.ml,%.1.S,$&lt;) 
+	gcc -c $(patsubst %.ml,%.1.S,$&lt;) -o $(patsubst %.ml,%.o,$&lt;) 
 # .o are generated together with .cmx, but %.o %.cmx: %.ml only confuses 
 # make when computing dependencies... 
</pre> 
<p>This uses <code>ocamlopt</code>'s -S option to generate the assembly file from the .ml source code file. Then, a <a href="http://en.wikipedia.org/wiki/Sed">sed</a> script is applied to the assembly file to modify it a little. And finally, the modified assembly file is compiled (gcc can be used for this).</p> 
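<p>To make the file-name bookkeeping of the rule explicit, here is a rough Python sketch of the three steps for a single source file. It executes nothing and only returns what would be run. The function name, the dictionary layout, the plain <code>ocamlopt</code> command name and the default <code>sed_asm</code> script path are all placeholders for illustration, not part of Frama-C's Makefile.</p>

```python
# Sketch of the three steps of the modified rule for one source file.
# Nothing is executed; the function only returns what would be run.
# The sed script path is a placeholder, not the one from the Makefile.

def native_compile_steps(ml_file, oflags=(), sed_script="sed_asm"):
    if not ml_file.endswith(".ml"):
        raise ValueError("expected a .ml file: " + ml_file)
    stem = ml_file[: -len(".ml")]
    asm, patched, obj = stem + ".s", stem + ".1.S", stem + ".o"
    return [
        # 1. ocamlopt -S keeps the generated assembly file (foo.s).
        {"argv": ["ocamlopt", "-S", "-c", *oflags, ml_file]},
        # 2. sed reads the assembly on stdin and writes the patched copy.
        {"argv": ["sed", "-f", sed_script], "stdin": asm, "stdout": patched},
        # 3. gcc assembles the patched file, overwriting the .o from step 1.
        {"argv": ["gcc", "-c", patched, "-o", obj]},
    ]

for step in native_compile_steps("src/foo.ml"):
    print(step)
```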
<p>The sed commands to transform the assembly file are these:</p> 
<pre>s/^[.]\(L[0-9]*\):/.\1: \1:/g 
s/\([[:space:]]*j.*[[:space:]]*\)[.]\(L[0-9]*\)$/\1\2/g 
</pre> 
<p>The first command transforms all label declarations (e.g. <code>.L100:</code>) into a double declaration <code>.L100: L100:</code>. The two labels thus indicate the same location, but the second one is local whereas the first one isn't.</p> 
<p>The second command transforms labels when used inside jump instructions, so that <code>jne .L100</code> is transformed into <code>jne L100</code>. Crucially, it does not transform labels elsewhere, for instance when referenced in the meta-data that the OCaml compiler generates, where the compiler may have to subtract one label's address from another's.</p> 
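<p>For readers who want to experiment without sed, the two substitutions can be approximated in Python with the <code>re</code> module. This is a sketch: <code>\s</code> stands in for <code>[[:space:]]</code>, the names <code>DECLARATION</code>, <code>JUMP_OPERAND</code> and <code>patch_line</code> are made up for the example, and the <code>.long</code> line is a hypothetical stand-in for compiler-generated meta-data, not actual <code>ocamlopt</code> output.</p>

```python
import re

# Python counterparts of the two sed commands (sed processes one line at a time).
DECLARATION = re.compile(r"^[.](L[0-9]*):")
JUMP_OPERAND = re.compile(r"(\s*j.*\s*)[.](L[0-9]*)$")

def patch_line(line):
    # First command: `.L100:` becomes `.L100: L100:`.
    line = DECLARATION.sub(r".\1: \1:", line)
    # Second command: `jne .L100` becomes `jne L100`.
    return JUMP_OPERAND.sub(r"\1\2", line)

sample = [
    ".L100:",
    "\tjne\t.L100",
    "\t.long\t.L101 - .L100",  # hypothetical meta-data line, for illustration
]
for line in sample:
    print(repr(patch_line(line)))
```

<p>The declaration gains a local twin, the jump operand loses its dot, and the line that subtracts two labels is left alone because it contains no jump mnemonic.</p>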
<h2>Results</h2> 
<p>On Mac OS X, the described trick makes OCaml-generated native code smaller:</p> 
<pre>-rwxr-xr-x  1 pascal  staff  11479984 Jan  6 23:24 bin/toplevel.old 
-rwxr-xr-x  1 pascal  staff  11414512 Jan  7 00:12 bin/toplevel.opt 
</pre> 
<p>The 65472 bytes of difference between the two versions, about 0.57% of the original size, represent the accumulation of all the inefficiently assembled jumps in a large piece of software such as Frama-C.</p> 
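<p>The arithmetic behind the figure in the title is short enough to check by hand, or with a few lines of Python. The sizes are copied from the listing above; the variable names are only for illustration.</p>

```python
# Sizes in bytes, copied from the listing above.
old_size = 11479984  # bin/toplevel.old
new_size = 11414512  # bin/toplevel.opt

saved = old_size - new_size
percent = 100 * saved / old_size
print(saved, "bytes saved,", round(percent, 2), "percent of the original size")
```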
<h2>Conclusion</h2> 
<p>If you play with the above compilation rule and sed script, do not expect much in terms of speed improvements: changing an inefficient encoding into an efficient one of the same instruction helps the processor, but only marginally. I guess it would be measurable, but not with my usual protocol of launching three of each and keeping the median measurement. I would have to learn about confidence intervals, which does not sound fun (not like gardening at all). Instead, I will avoid making the claim that this hack improves execution speed, and I will just postpone a bit more the moment I have to learn about statistics.</p>
{% endraw %}