2011-01-31-On-memcpy-part-2-the-OCaml-source-code-approach.html

---
layout: post
author: Pascal Cuoq
date: 2011-01-31 10:12 +0200
categories: OCaml memcpy value value-builtins
format: xhtml
title: "On memcpy (part 2: the OCaml source code approach)"
summary: 
---
{% raw %}
<p>When picking up the title for <a href="/index.php?post/2011/01/27/On-memcpy-1" hreflang="en">the previous post on function memcpy</a>  I anticipated this second part would describe the <code>Frama_C_memcpy</code> built-in function. The subtitle "part 1: the source code approach" seemed a good idea  since the first part was about using C source code to tell the analyzer what this function <code>memcpy</code> it often encounters is.</p> 
<p>I did not think further  and did not realize that I would spend this sequel post explaining that a built-in function is <strong>an OCaml function that directly transforms an abstract state into another abstract state</strong> without going through the tedious  instruction-by-instruction interpretation of C code. So this second post is still about source code.</p> 
<p>It is not source code that it is recommended  or even documented how  to write yourself. But it is a convenient way for value analysis experts to introduce certain new features.</p> 
<p><strong>What new features?</strong> Smoking hot  blazing fast new features.</p> 
<p>Also  features that work completely right neither the first nor the second time. The <strong>Formula One</strong> of features  if you will. In fact  you could be forgiven for calling them Virgin Racing features  although I prefer to think of them as Ferrari features.</p> 
<p>As just stated  builtins are about using OCaml functions to describe the effect of an unavailable C function. For instance  calls to <code>Frama_C_memcpy(d s l)</code> are interpreted with an OCaml function that  in a single step  copies a slice of memory of length <code>l*8</code> bits from <code>s</code> to <code>d</code>. And it is fast. Behold:</p> 
<pre>char s[10000]  d[10000]; 
main() 
{ 
  // Make problem representative 
  for (int i=0; i&lt;100; i++) 
    s[100*i] = i; 
  // Time one or the other 
  //  c_memcpy(d  s  10000); 
  Frama_C_memcpy(d  s  10000); 
  // Check that d contains the right thing 
  Frama_C_dump_each(); 
  return 0; 
} 
</pre> 
<p><a href="/assets/img/blog/imported-posts/memcpytest.c">(download the entire program)</a></p> 
<p>Using the C version of <code>memcpy</code> and sufficient unrolling  you get the expected contents for <code>d</code>:</p> 
<pre>$ time bin/toplevel.opt -val -slevel 20000 memcpytest.c -no-results 
... 
        d[0..99] ∈ {0; } 
         [100] ∈ {1; } 
         [101..199] ∈ {0; } 
         [200] ∈ {2; } 
         ... 
user   0m23.155s 
</pre> 
<p>Using the <code>Frama_C_memcpy</code> built-in  you get the same precise result for <code>d</code>:</p> 
<pre>$ time bin/toplevel.opt -val -slevel 20000 memcpytest.c -no-results 
... 
        d[0..99] ∈ {0; } 
         [100] ∈ {1; } 
         [101..199] ∈ {0; } 
         [200] ∈ {2; } 
         ... 
user   0m0.402s 
</pre> 
<p>You also get nearly 23 seconds of your life back.</p> 
<p>So what are the things that can go wrong with a builtin replacement function?</p> 
<ol> 
<li>If the C code that is not analyzed would have caused an alarm to be emitted  the built-in should emit that alarm too  or at the very least be provided with a header file that offers an ACSL pre-condition along with a prototype for the builtin. Both alternatives work  but it is easy to forget to do either.</li> 
<li>All cases of imprecise arguments should be handled. Support for an imprecise length was added recently. Very imprecise source or destination pointers may still cause the analysis to abort (it is being done and I am not sure where we were the last time we were working on this).</li> 
<li>The builtin function calls the value analysis' internal functions in contexts that differ from the usual  well tested circumstances occurring during the analysis of normal C programs. Strange behaviors or bugs may be uncovered by the user. As an example  I was just preparing the next section of this post  where I would have shown how imprecise lengths were now handled. I was going to suggest modifying the above test program  thus:</li> 
</ol> 
<pre>  Frama_C_memcpy(d  s  Frama_C_nondet(150 10000)); 
</pre> 
<p>This brought up the following two remarks: first  there is an internal function for moving slices of the abstract state around that is slower than it should be  and it shows on this test. Secondly  the value analysis in Carbon will tell you that bits 1200 to 79999 of <code>d</code> contain either <code>0</code> or a very large number  a number with about 23400 decimal digits that start with <code>2164197</code> and end with <code>85824</code>. The value analysis is right  of course. If the length argument to <code>Frama_C_memcpy</code> was <code>150</code>  then that slice of memory contains zero. If the length was <code>10000</code>  the numbers in <code>d</code> that were copied from <code>s</code> can be interpreted as representing exactly that 78800-bit number. This is the value analysis' way of telling you that it's either one or the other: either the length was <code>150</code> or it was <code>10000</code>. This is more informative than telling you cell by cell that the cell contains zero or another number  because then you wouldn't see that it's either all one or all the other. The value analysis is rather cute that way.</p> 
<blockquote><p>Note: if the 78800-bit number explanation seems terribly complicated  do not worry  there will be a better explanation for this issue in a future post. We just need another new feature in order to be able to show you.</p> 
</blockquote> 
<p>In conclusion  not everyone may want a Formula One car. But <code>Frama_C_memcpy</code> should at least work for completely unrolled deterministic programs  where it can speed up analysis quite a bit compared to the interpretation of a C memcpy.</p> 
<p>Thanks to Julien for help with motorsports metaphors  and to Boris for help in improving <code>Frama_C_memcpy</code> (it used to be worse).</p>
{% endraw %}