In this paper we will briefly describe some recent developments
that allow to perform *precise* and *practical* mode analysis
with the CHINA analyzer.
The emphasized words are important.
On one hand, precision means that we use the most precise domains and
domain combinations known to date.
On the other hand, the issue of practicality has many facets.
Space limitations do not allow to discuss them here, but they can
be synthesized as follows.
The analyzer should deal with real programs designed to run on real,
implemented languages, it should never crash (the only limitation being
the amount of available virtual memory), and it must terminate in reasonable
time for any input program.
What enables CHINA to be able to achieve this goal for mode
analysis is a particular combination of domains and the techniques
that have been adopted for their implementation.
These domains capture information about groundness, compoundness, pair-sharing,
freeness, and linearity.

`Z=f(g(X),Y)`

:
the corresponding Boolean function
is `Z`

is ground if and only if
`X`

and `Y`

are so.
They also capture dependencies arising from other constraint domains:
for instance, the CLP(R) constraint `X+2Y+Z=4`

can be abstracted as
This ability to express dependencies makes analysis based on *Pos*
very precise, but also makes it relatively expensive, as many
operations on Boolean formulas have exponential worst case complexity.
*Reduced Ordered Binary Decision Diagrams* (ROBDDs)
also known as *Bryant graphs*, are a representation for
Boolean functions supporting many efficient operations.
Because of this, they have often been used to implement
*Pos* for abstract interpretation.
However, the observation of many (constraint) logic programs
shows that the percentage
of variables that are found to be ground during the analysis, for typical
invocations, is as high as 80%. This suggests that representing *Pos*
elements simply by means of ROBDDs, is probably not the best we can do
[Bag96].

It is also the case that ROBDDs generated during program analysis
often contain many pairs of variables that are equivalent. In the
context of groundness analysis, this means that either both are
ground, or neither is. Such equivalent variables appear, of course,
for a program goal of the form `X=Y`

, but they also frequently
appear naturally during the analysis process. For example, for a goal
`X=[Y|Z]`

, where it can be established that `Y`

is ground, the analyzer will deduce that `X`

and `Z`

are equivalent.
Such equivalent pairs can greatly increase the size of ROBDDs, which
in turn makes ROBDD operations much more expensive
[BS99].

The frequency with which definitely ground variables
and ground-equivalent pairs of variables occur in the analysis
of real programs suggested to us to factorize apart this information
[BS99].
In the GER representation, an element of *Pos* is represented
by means of a triple:
the G component (from "ground") records the set of ground variables;
the E component (from "equivalent") records a transitively closed
set of pairs of equivalent variables;
the R component (from "ROBDD") records a residual groundness dependency.
There are elements of * Pos* that can be represented by
several such triples and, in the GER representation, we need
to make a choice among those.
This choice must be

The GER representation has several advantages. The ROBDD generated during the analysis are kept as small as possible, and thus several important operations (which are of exponential worst case complexity) are speeded up considerably. Of course, ensuring reduction of the GER representation has a non negligible cost, but this overhead is greatly repayed: only the analysis of tiny programs shows a very modest slowdown (which is measured in hundredth of a second) while on the analysis of real programs the GER representation gives rise to speedups of up to an order of magnitude. Another advantage of the GER representation (which, incidentally, was the very reason why this line of research was started) is that the information about ground variables is readily available, since all and only the ground variables are kept in the G component. Obtaining the set of ground variables or testing whether a certain variable is ground are thus very efficient operations. As we will see, this characteristic allows to improve the efficiency of other analyses.

When using the GER representation of *Pos*, it is quite
rare to find real programs for which the analysis becomes too complex.
In other words, the factorization of ground and ground-equivalent
variables is such that the exponential nature of *Pos*
seldom manifests itself.
However, programs that make the GER representation explode do exist
and, moreover, any serious domain of analysis whose complexity is, say,
more than linear in the number of variables, needs a widening.
The GER representation lends itself to a particularly simple
and effective widening, both in terms of precision and in terms
of efficiency of the implementation.
Since the G and E components have a linear size representation
they do not need any widening.
In contrast, the R component can grow exponentially and this is where
widening takes place.
The widening imposes a limit on the maximum number of ROBDD nodes
allocated by any operation used by the analysis.
When this limit is reached an exception is thrown and a sound
approximation of the operation is returned instead of the real result.
For instance, here is how conjunction is implemented
in the ROBDD class upon which the GER representation for *Pos* is based
(`BNode`

is the class of ROBDD nodes,
the `unsafe_and`

operation is the
unguarded version of conjunction,
and `emergency_cleanup()`

is a static method ensuring that all the
relevant ROBDD structures are garbage-collected before resuming the operation).

const BNode* BNode::and_approximate_up(const BNode* y) const { try { return unsafe_and(this, y); } catch(ROBDD_Too_Complex) { emergency_cleanup(); return this; // x is a sound up-approximation of x & y. } }

Theoretically speaking, another source of potential difficulties
for a *Pos*-based analysis is due to the fact that the longest
increasing chain in *Pos* has exponential length in the number
of variables.
This means that the chaotic iteration could traverse a very long
chain of *Pos* descriptions before converging to a (post-) fixpoint.
While programs exhibiting this behavior can easily be written on purpose,
no real program we have access to gives rise to this phenomenon.
Nonetheless, the "always terminate in reasonable time" rule demands this possibility
to be taken into account.
Again, the G and E components of the GER representation do not pose
any problems: the longest chains in G and E have linear length
in the number of variables.
Given that the problem rarely manifests itself (to say the least)
the simple widening ever proposed suffices: allow the ROBDD components
to change only a certain number of times, then approximate them
with *True* (which, in terms of *Pos* is the "don't know",
description).

In the CHINA analyzer, the domain used
for compoundness is, again, *Pos*.
This is because compoundness and compoundness dependencies
are well represented by positive Boolean functions.
Thus the global domain employed by CHINA
includes a second GER representation for compoundness.
As any ground variables is clearly compound, the G component
of the groundness domain is always propagated to the GER representation
used to track compoundness.
This is just an example where the ready availability of the set
of ground variables is used to speed up other analyses.
Note, however, that the combination with groundness information
is an essential ingredient for achieving reasonable precision
on compoundness.
In fact, the compoundness *Pos* description for the
equation `Z=f(g(X),Y)`

,
the Boolean function *Z* meaning that `Z`

is bound,
fails to capture the fact that grounding `Z`

will certainly
bind both `X`

and `Y`

.
In this case, the groundness *Pos* description for the above equation
will sanction the groundness of `X`

and `Y`

.
Propagation from the groundness component to the compoundness component
will then deduce that `X`

and `Y`

are bound.

The goal of sharing analysis is to detect
which *pairs* of variables are definitely independent
(namely, they cannot be bound to terms having one or more variables
in common).
In fact, in the literature we can find no reference to the
"independence of a *set* of variables".
All the proposed applications of sharing analysis
(compile-time optimizations, occur-check reduction and so on)
are based on information about the independence of *pairs* of variables.
It it thus natural to ask whether the *set-sharing* domain *Sharing*
is adequate with respect to the property of interest, that is,
*pair-sharing*.
In [BHZ97] we have proved that *Sharing*
is redundant for pair-sharing and we have identified the weakest abstraction
of *Sharing* that can capture pair-sharing with the same degree of precision.
One notable advantage of this abstraction is that the costly star-union
operator, whose complexity is exponential in the number of sharing-groups,
is no longer necessary and can be safely replaced by the binary-union operator,
which has quadratic complexity.
This line of work has thus had an important impact on the issue of practicality
of sharing analysis.
In fact, the non-redundant domain described in [BHZ97]
gives rise to speedups of up to three order of magnitudes on the analysis
of real programs.

The complexity of the *Sharing* domain is exponential in the
number of program variables.
Its non-redundant version has also exponential complexity,
even though, in practice, the non-redundant domain is much better
behaved than the original.
In any case, a stable behavior, and thus a truly practical domain,
can only be obtained by means of widening operators.
The problem of scalability of non-redundant *Sharing*
while still retaining as much precision as possible has been tackled in
[ZBH99], where a family of widenings
is presented that allows to achieve the desired goal.
We moved from the observation that, when the sharing-sets become large,
then they are at the same time heavy to manipulate and,
at least for a subset of the variables involved, light as far as
information content is concerned.
We have thus introduced a new representation for set-sharing made of two
components. They are both sharing-sets, that is, sets of sets of variables.
However, while the second component is interpreted as in the standard *Sharing*
domain, the first component records worst-case sharing assumptions of sets
of variables.
Let us consider, for instance, the sharing-group *{X,Y,Z}*.
While its presence on the second component indicates that
variables *X*, *Y*, and *Z* can share a common variable,
the same sharing-group in the first component means that
*X*, *Y*, and *Z* can share variables in all possible ways.
This new representation supports a wide range of widenings.
These widenings allow to trade precision for efficiency by moving
information from the second component to the first one.
What differentiates one widening from another concerns what to move and when.
The advantage of this new representation is that the first component
is able to represent possible sharing in a very compact, though imprecise way.
Descriptions constituted by hundreds of thousands of sharing-groups
can be collapsed into descriptions consisting of a few tens of such groups,
often with very little precision loss.
This phenomenon has been actually observed in the analysis of real programs.
Of course, some precision is lost, but experimentation showed that
the loss concerns almost exclusively ground dependencies,
and these can be recovered by coupling the new sharing domain with *Pos*.
This brings us to the topic of combining sharing domains with other domains.

Practically speaking, a sharing domain is never used in isolation.
The first combination was suggested by Langen himself in his
PhD thesis: *linearity* (the property of all variables that
can only be bound to terms without multiple occurrences of variables)
can greatly improve the accuracy of sharing analysis.
The synergy attainable from the integration between aliasing and
*freeness* information
(freeness is the property of all variables that
can only be bound to other variables)
has been pointed out, for the first time, by Muthukumar and Hermenegildo.
Besides that, freeness is a very useful property in itself.
These standard combinations are now widely accepted and nobody
would seriously think of performing sharing analysis without them.
However, a number of other proposals for refined domain combinations
have been circulating more or less clandestinely for years.
One feature that is common to these proposals is that they do
not seem to have undergone experimental evaluation.
We have tackled this issue in [BZH99],
where we have evaluated several proposals for precise domain
combinations.
Apart from the combination with structural information
the results of this work were disappointing: only very small precision
gains seem to be attainable by the more refined domain combinations.
The combination with *Pos*,
also studied in [BZH99],
deserves further discussion.

It is well known that *Sharing* keeps track of ground dependencies.
More precisely, *Sharing* contains *Def*, the domain of definite
Boolean functions, as a proper subdomain.
However, there are several good reasons to couple *Sharing* with *Pos*:
(1)
this combination is essential for the powerful widening techniques
on non-redundant *Sharing* mentioned above to be applied.
This is very important,
since analysis based on (non-redundant) *Sharing*
without a widening is not practical.
(2)
*Def* is not expressive enough to capture all the ground dependencies
of Prolog programs.
Moreover, *Def* cannot even capture the dependencies
induced by the primitive constraints of some CLP languages,
and we target the analysis at Prolog *and* CLP programs.
(3)
In the context of the analysis of CLP programs, the notions
of "ground variable" and the notion of "variable that cannot share
a common variable with other variables" are distinct.
A numeric variable in, say, CLP(*R*), cannot share with other variables
but is not ground unless it has been constrained to a unique value.
Thus the analysis of CLP programs with *Sharing* alone will either lose
precision on pair-sharing (if numerical variables are allowed to "share"
in order to compute their groundness) or be unable to compute
the groundness of numerical variables (if numerical variables are excluded
from the sharing-sets).
In the first alternative, as we have already noted, the precision
with which groundness of numerical variables can be tracked will
also be limited.
Since groundness of numerical variables is important for a number
of applications (e.g., compiling equality constraints down to
assignments or tests in some circumstances), we advocate the use
of *Pos* and *Sharing* at the same time.
(4)
Detecting definitely ground variables through *Pos* and exploiting
them to simplify the operations on *Sharing* is very worthwhile
as far as efficiency is concerned, if the set of ground variables
is readily available.
This is the case, for instance, with the GER implementation of *Pos*
described above.
Using the GER representation, the propagation of ground variables to
the sharing domain allows to obtain additional speedups of up to two orders
of magnitude in the analysis of real programs.
(5)
Knowing the set of ground variables in advance, not only reduces the
complexity of the operations of the sharing domain.
It also improves precision when the domain keeps track of linearity
information, as in our case.
In fact, while it has been proved that *Sharing* and its derivatives
are commutative, meaning that the result of the analysis does not depend
on the ordering in which the bindings are executed
[HBZ98],
*Sharing* plus linearity does not enjoy this property.
In particular, it is known since the thesis of Langen
that best results are obtained if the grounding unifications
are considered before the others.
Again, the combination with *Pos*, since it allows the analyzer
to know the set of all definitely ground variables in advance,
is clearly advantageous in this respect.

- Bag96
- R. Bagnara. A reactive implementation of
*Pos*using ROBDDs. In H. Kuchen and S. D. Swierstra, editors,*Programming Languages: Implementations, Logics and Programs, Proceedings of the Eighth International Symposium*, volume 1140 of*Lecture Notes in Computer Science*, pages 107-121, Aachen, Germany, 1996. Springer-Verlag, Berlin. - BHZ97
- R. Bagnara, P. M. Hill, and E. Zaffanella. Set-sharing is redundant
for pair-sharing. In P. Van Hentenryck, editor,
*Static Analysis: Proceedings of the 4th International Symposium*, volume 1302 of*Lecture Notes in Computer Science*, pages 53-67, Paris, France, 1997. Springer-Verlag, Berlin. An extended version will appear on*Theoretical Computer Science*. - BS99
- R. Bagnara and P. Schachte. Factorizing equivalent variable pairs
in ROBDD-based implementations of
*Pos*. In A. M. Haeberer, editor,*Proceedings of the ``Seventh International Conference on Algebraic Methodology and Software Technology (AMAST'98)''*, volume 1548 of*Lecture Notes in Computer Science*, pages 471-485, Amazonia, Brazil, 1999. Springer-Verlag, Berlin. - BZH99
- R. Bagnara, E. Zaffanella, and P. M. Hill. Enhancing Sharing for
precision. In M. C. Meo and E. Omodeo, editors,
*Proceedings of the ``1999 Joint Conference on Declarative Programming (APPIA-GULP-PRODE'99)''*, L'Aquila, Italy, 1999. - HBZ98
- P. M. Hill, R. Bagnara, and E. Zaffanella. The correctness of set-sharing.
In G. Levi, editor,
*Static Analysis: Proceedings of the 5th International Symposium*, volume 1503 of*Lecture Notes in Computer Science*, pages 99-114, Pisa, Italy, 1998. Springer-Verlag, Berlin. - ZBH99
- E. Zaffanella, R. Bagnara, and P. M. Hill. Widening set-sharing.
In
*Principles and Practice of Declarative Programming, Proceedings of the First International Conference*, Lecture Notes in Computer Science, Paris, France, September 1999. Springer-Verlag, Berlin. To appear.Roberto Bagnara

Dipartimento di Matematica

Università di Parma

Italy

bagnara@cs.unipr.it