Precise and Practical Mode Analysis with the CHINA Analyzer

CHINA is the name of a data-flow analyzer for CLP programs we have been developing during the last 5 years. The name CHINA stands for CLP(H, N) Analyzer, where `H' stands for Herbrand (a domain of finite or rational trees), `N' stands for any numeric domain, and the `I' in CHINA stands for the Integration between H and N. In other words, the leafs of the trees can be numerical. CHINA, which is written in C++, performs bottom-up analysis deriving information on both call-patterns and success-patterns by means of program transformations and optimized fixpoint computation techniques.

In this paper we will briefly describe some recent developments that allow to perform precise and practical mode analysis with the CHINA analyzer. The emphasized words are important. On one hand, precision means that we use the most precise domains and domain combinations known to date. On the other hand, the issue of practicality has many facets. Space limitations do not allow to discuss them here, but they can be synthesized as follows. The analyzer should deal with real programs designed to run on real, implemented languages, it should never crash (the only limitation being the amount of available virtual memory), and it must terminate in reasonable time for any input program. What enables CHINA to be able to achieve this goal for mode analysis is a particular combination of domains and the techniques that have been adopted for their implementation. These domains capture information about groundness, compoundness, pair-sharing, freeness, and linearity.

The Groundness Domain

The abstract domain employed in CHINA for detecting the property of definite groundness is based on Pos, the domain of positive Boolean functions. Pos functions are expressive enough to capture the kind of dependencies that are implicit in unification constraints such as Z=f(g(X),Y): the corresponding Boolean function is  (X and Y) iff Z, meaning that Z is ground if and only if X and Y are so. They also capture dependencies arising from other constraint domains: for instance, the CLP(R) constraint X+2Y+Z=4 can be abstracted as  ((X and Y) implies Z) and ((X and Z) implies Y) and ((Y and Z) implies X) , indicating that determining any two variables is sufficient to determine the third.

This ability to express dependencies makes analysis based on Pos very precise, but also makes it relatively expensive, as many operations on Boolean formulas have exponential worst case complexity. Reduced Ordered Binary Decision Diagrams (ROBDDs) also known as Bryant graphs, are a representation for Boolean functions supporting many efficient operations. Because of this, they have often been used to implement Pos for abstract interpretation. However, the observation of many (constraint) logic programs shows that the percentage of variables that are found to be ground during the analysis, for typical invocations, is as high as 80%. This suggests that representing Pos elements simply by means of ROBDDs, is probably not the best we can do [Bag96].

It is also the case that ROBDDs generated during program analysis often contain many pairs of variables that are equivalent. In the context of groundness analysis, this means that either both are ground, or neither is. Such equivalent variables appear, of course, for a program goal of the form X=Y, but they also frequently appear naturally during the analysis process. For example, for a goal X=[Y|Z], where it can be established that Y is ground, the analyzer will deduce that X and Z are equivalent. Such equivalent pairs can greatly increase the size of ROBDDs, which in turn makes ROBDD operations much more expensive [BS99].

The frequency with which definitely ground variables and ground-equivalent pairs of variables occur in the analysis of real programs suggested to us to factorize apart this information [BS99]. In the GER representation, an element of Pos is represented by means of a triple: the G component (from "ground") records the set of ground variables; the E component (from "equivalent") records a transitively closed set of pairs of equivalent variables; the R component (from "ROBDD") records a residual groundness dependency. There are elements of  Pos  that can be represented by several such triples and, in the GER representation, we need to make a choice among those. This choice must be canonical and economical. Economy can be explained as follows: true variables are most efficiently represented in the first component (a bit-vector at the implementation level) and should not occur anywhere else in the representation. Equivalent variables are best represented in the second component of the GER representation (implemented as a vector of integers). As equivalent variables partition the space of variables into equivalence classes, only one variable per equivalence class must occur in the ROBDD constituting the third component of the representation. If we choose, say, the least variable (with respect to a predefined total ordering on variables) of each equivalence class as the representative of the class, we have also ensured canonicity.

The GER representation has several advantages. The ROBDD generated during the analysis are kept as small as possible, and thus several important operations (which are of exponential worst case complexity) are speeded up considerably. Of course, ensuring reduction of the GER representation has a non negligible cost, but this overhead is greatly repayed: only the analysis of tiny programs shows a very modest slowdown (which is measured in hundredth of a second) while on the analysis of real programs the GER representation gives rise to speedups of up to an order of magnitude. Another advantage of the GER representation (which, incidentally, was the very reason why this line of research was started) is that the information about ground variables is readily available, since all and only the ground variables are kept in the G component. Obtaining the set of ground variables or testing whether a certain variable is ground are thus very efficient operations. As we will see, this characteristic allows to improve the efficiency of other analyses.

When using the GER representation of Pos, it is quite rare to find real programs for which the analysis becomes too complex. In other words, the factorization of ground and ground-equivalent variables is such that the exponential nature of Pos seldom manifests itself. However, programs that make the GER representation explode do exist and, moreover, any serious domain of analysis whose complexity is, say, more than linear in the number of variables, needs a widening. The GER representation lends itself to a particularly simple and effective widening, both in terms of precision and in terms of efficiency of the implementation. Since the G and E components have a linear size representation they do not need any widening. In contrast, the R component can grow exponentially and this is where widening takes place. The widening imposes a limit on the maximum number of ROBDD nodes allocated by any operation used by the analysis. When this limit is reached an exception is thrown and a sound approximation of the operation is returned instead of the real result. For instance, here is how conjunction is implemented in the ROBDD class upon which the GER representation for Pos is based (BNode is the class of ROBDD nodes, the unsafe_and operation is the unguarded version of conjunction, and emergency_cleanup() is a static method ensuring that all the relevant ROBDD structures are garbage-collected before resuming the operation).

const BNode* BNode::and_approximate_up(const BNode* y) const
{
  try {
    return unsafe_and(this, y);
  }
  catch(ROBDD_Too_Complex) {
    emergency_cleanup();
    return this; // x is a sound up-approximation of x & y.
  }
}

Theoretically speaking, another source of potential difficulties for a Pos-based analysis is due to the fact that the longest increasing chain in Pos has exponential length in the number of variables. This means that the chaotic iteration could traverse a very long chain of Pos descriptions before converging to a (post-) fixpoint. While programs exhibiting this behavior can easily be written on purpose, no real program we have access to gives rise to this phenomenon. Nonetheless, the "always terminate in reasonable time" rule demands this possibility to be taken into account. Again, the G and E components of the GER representation do not pose any problems: the longest chains in G and E have linear length in the number of variables. Given that the problem rarely manifests itself (to say the least) the simple widening ever proposed suffices: allow the ROBDD components to change only a certain number of times, then approximate them with True (which, in terms of Pos is the "don't know", description).

The Domain for Compoundness

Compoundness here means `non-variable'; that is, a variable is compound if it is bound to a term that is definitely not free. Compoundness information is very important to optimize clause indexing.

In the CHINA analyzer, the domain used for compoundness is, again, Pos. This is because compoundness and compoundness dependencies are well represented by positive Boolean functions. Thus the global domain employed by CHINA includes a second GER representation for compoundness. As any ground variables is clearly compound, the G component of the groundness domain is always propagated to the GER representation used to track compoundness. This is just an example where the ready availability of the set of ground variables is used to speed up other analyses. Note, however, that the combination with groundness information is an essential ingredient for achieving reasonable precision on compoundness. In fact, the compoundness Pos description for the equation Z=f(g(X),Y), the Boolean function Z meaning that Z is bound, fails to capture the fact that grounding Z will certainly bind both X and Y. In this case, the groundness Pos description for the above equation will sanction the groundness of X and Y. Propagation from the groundness component to the compoundness component will then deduce that X and Y are bound.

The Domain for Pair-Sharing, Freeness, and Linearity

The abstract domain used in the CHINA analyzer to capture pair-sharing is based on the set-sharing domain Sharing of Jacobs and Langen. In 1997, when we first incorporated the Sharing domain into the analyzer, a number of theoretical and practical questions were still open. The most important issue was the one of correctness: up to 1998 there seemed to be no published proofs of the soundness of the domain. In [HBZ98] we have proved the soundness, idempotence, and commutativity of Sharing. Most importantly, these results have been established, for the first time, without assuming that the analyzed language performs the occur-check in the unification procedure. This closed a long-standing gap, as all the works on the use of Sharing for the analysis of Prolog programs had always disregarded this problem.

The goal of sharing analysis is to detect which pairs of variables are definitely independent (namely, they cannot be bound to terms having one or more variables in common). In fact, in the literature we can find no reference to the "independence of a set of variables". All the proposed applications of sharing analysis (compile-time optimizations, occur-check reduction and so on) are based on information about the independence of pairs of variables. It it thus natural to ask whether the set-sharing domain Sharing is adequate with respect to the property of interest, that is, pair-sharing. In [BHZ97] we have proved that Sharing is redundant for pair-sharing and we have identified the weakest abstraction of Sharing that can capture pair-sharing with the same degree of precision. One notable advantage of this abstraction is that the costly star-union operator, whose complexity is exponential in the number of sharing-groups, is no longer necessary and can be safely replaced by the binary-union operator, which has quadratic complexity. This line of work has thus had an important impact on the issue of practicality of sharing analysis. In fact, the non-redundant domain described in [BHZ97] gives rise to speedups of up to three order of magnitudes on the analysis of real programs.

The complexity of the Sharing domain is exponential in the number of program variables. Its non-redundant version has also exponential complexity, even though, in practice, the non-redundant domain is much better behaved than the original. In any case, a stable behavior, and thus a truly practical domain, can only be obtained by means of widening operators. The problem of scalability of non-redundant Sharing while still retaining as much precision as possible has been tackled in [ZBH99], where a family of widenings is presented that allows to achieve the desired goal. We moved from the observation that, when the sharing-sets become large, then they are at the same time heavy to manipulate and, at least for a subset of the variables involved, light as far as information content is concerned. We have thus introduced a new representation for set-sharing made of two components. They are both sharing-sets, that is, sets of sets of variables. However, while the second component is interpreted as in the standard Sharing domain, the first component records worst-case sharing assumptions of sets of variables. Let us consider, for instance, the sharing-group {X,Y,Z}. While its presence on the second component indicates that variables X, Y, and Z can share a common variable, the same sharing-group in the first component means that X, Y, and Z can share variables in all possible ways. This new representation supports a wide range of widenings. These widenings allow to trade precision for efficiency by moving information from the second component to the first one. What differentiates one widening from another concerns what to move and when. The advantage of this new representation is that the first component is able to represent possible sharing in a very compact, though imprecise way. Descriptions constituted by hundreds of thousands of sharing-groups can be collapsed into descriptions consisting of a few tens of such groups, often with very little precision loss. This phenomenon has been actually observed in the analysis of real programs. Of course, some precision is lost, but experimentation showed that the loss concerns almost exclusively ground dependencies, and these can be recovered by coupling the new sharing domain with Pos. This brings us to the topic of combining sharing domains with other domains.

Practically speaking, a sharing domain is never used in isolation. The first combination was suggested by Langen himself in his PhD thesis: linearity (the property of all variables that can only be bound to terms without multiple occurrences of variables) can greatly improve the accuracy of sharing analysis. The synergy attainable from the integration between aliasing and freeness information (freeness is the property of all variables that can only be bound to other variables) has been pointed out, for the first time, by Muthukumar and Hermenegildo. Besides that, freeness is a very useful property in itself. These standard combinations are now widely accepted and nobody would seriously think of performing sharing analysis without them. However, a number of other proposals for refined domain combinations have been circulating more or less clandestinely for years. One feature that is common to these proposals is that they do not seem to have undergone experimental evaluation. We have tackled this issue in [BZH99], where we have evaluated several proposals for precise domain combinations. Apart from the combination with structural information the results of this work were disappointing: only very small precision gains seem to be attainable by the more refined domain combinations. The combination with Pos, also studied in [BZH99], deserves further discussion.

It is well known that Sharing keeps track of ground dependencies. More precisely, Sharing contains Def, the domain of definite Boolean functions, as a proper subdomain. However, there are several good reasons to couple Sharing with Pos: (1) this combination is essential for the powerful widening techniques on non-redundant Sharing mentioned above to be applied. This is very important, since analysis based on (non-redundant) Sharing without a widening is not practical. (2) Def is not expressive enough to capture all the ground dependencies of Prolog programs. Moreover, Def cannot even capture the dependencies induced by the primitive constraints of some CLP languages, and we target the analysis at Prolog and CLP programs. (3) In the context of the analysis of CLP programs, the notions of "ground variable" and the notion of "variable that cannot share a common variable with other variables" are distinct. A numeric variable in, say, CLP(R), cannot share with other variables but is not ground unless it has been constrained to a unique value. Thus the analysis of CLP programs with Sharing alone will either lose precision on pair-sharing (if numerical variables are allowed to "share" in order to compute their groundness) or be unable to compute the groundness of numerical variables (if numerical variables are excluded from the sharing-sets). In the first alternative, as we have already noted, the precision with which groundness of numerical variables can be tracked will also be limited. Since groundness of numerical variables is important for a number of applications (e.g., compiling equality constraints down to assignments or tests in some circumstances), we advocate the use of Pos and Sharing at the same time. (4) Detecting definitely ground variables through Pos and exploiting them to simplify the operations on Sharing is very worthwhile as far as efficiency is concerned, if the set of ground variables is readily available. This is the case, for instance, with the GER implementation of Pos described above. Using the GER representation, the propagation of ground variables to the sharing domain allows to obtain additional speedups of up to two orders of magnitude in the analysis of real programs. (5) Knowing the set of ground variables in advance, not only reduces the complexity of the operations of the sharing domain. It also improves precision when the domain keeps track of linearity information, as in our case. In fact, while it has been proved that Sharing and its derivatives are commutative, meaning that the result of the analysis does not depend on the ordering in which the bindings are executed [HBZ98], Sharing plus linearity does not enjoy this property. In particular, it is known since the thesis of Langen that best results are obtained if the grounding unifications are considered before the others. Again, the combination with Pos, since it allows the analyzer to know the set of all definitely ground variables in advance, is clearly advantageous in this respect.

Bibliography

Bag96
R. Bagnara.  A reactive implementation of Pos using ROBDDs.  In H. Kuchen and S. D. Swierstra, editors, Programming Languages: Implementations, Logics and Programs, Proceedings of the Eighth International Symposium, volume 1140 of Lecture Notes in Computer Science, pages 107-121, Aachen, Germany, 1996. Springer-Verlag, Berlin.

BHZ97
R. Bagnara, P. M. Hill, and E. Zaffanella.  Set-sharing is redundant for pair-sharing.  In P. Van Hentenryck, editor, Static Analysis: Proceedings of the 4th International Symposium, volume 1302 of Lecture Notes in Computer Science, pages 53-67, Paris, France, 1997. Springer-Verlag, Berlin.  An extended version will appear on Theoretical Computer Science.

BS99
R. Bagnara and P. Schachte.  Factorizing equivalent variable pairs in ROBDD-based implementations of Pos.  In A. M. Haeberer, editor, Proceedings of the ``Seventh International Conference on Algebraic Methodology and Software Technology (AMAST'98)'', volume 1548 of Lecture Notes in Computer Science, pages 471-485, Amazonia, Brazil, 1999. Springer-Verlag, Berlin.

BZH99
R. Bagnara, E. Zaffanella, and P. M. Hill.  Enhancing Sharing for precision.  In M. C. Meo and E. Omodeo, editors, Proceedings of the ``1999 Joint Conference on Declarative Programming (APPIA-GULP-PRODE'99)'', L'Aquila, Italy, 1999.

HBZ98
P. M. Hill, R. Bagnara, and E. Zaffanella.  The correctness of set-sharing.  In G. Levi, editor, Static Analysis: Proceedings of the 5th International Symposium, volume 1503 of Lecture Notes in Computer Science, pages 99-114, Pisa, Italy, 1998. Springer-Verlag, Berlin.

ZBH99
E. Zaffanella, R. Bagnara, and P. M. Hill.  Widening set-sharing.  In Principles and Practice of Declarative Programming, Proceedings of the First International Conference, Lecture Notes in Computer Science, Paris, France, September 1999. Springer-Verlag, Berlin.  To appear.
Roberto Bagnara
Dipartimento di Matematica
Università di Parma
Italy
bagnara@cs.unipr.it