Research Statement - Software Structure

For small software systems, perhaps a few thousand lines of code, software structure is largely an esthetic issue. We want to build small neat structures out of professional pride. We would like readers of our code to find our intents clear and the code behavior obvious. When software systems grow large, perhaps a million or more lines of source code, its structure becomes much more important. Developing a large system requires teams of developers working in concert to provide a finished product in a reasonable amount of time. That means that many people will read each component of our code - all those who's code depends on ours, quality assurance personnel, and our team leads, charged with meeting schedules with quality products.

Our research group has been, and continues to be interested in finding ways to successfully create and maintain software systems that are larger than any one person can completely comprehend from its code alone. We need abstractions and tools to help understand and discuss these systems, ways of building them with reasonable amounts of effort and expense, and ways of packaging, shipping, and maintaining them over their effective lifetimes. Much of our effort has be directed to understanding and managing software structures and repairing them when the structures are flawed.

Software structure is expressed in several ways:

  1. Placement Tree

    Object Oriented Languages like C++, C#, and Java obey a set of inclusion rules that govern how key constructs are placed within source code. The constructs that will interest us here are: namespaces, interfaces, classes, structs, enums, functions, controls, and data declarations. We will see that this placement structure is a tree - each element has only one parent. Of the constructs cited here, all but data declarations are each endowed with a scope.

    Scope is defined by all code enclosed within braces, e.g., "{" and "}". Namespace, interface, class, struct, and enum scopes are compile-time only constructs. They add and qualify names in the compiler symbol table and interfaces, classes, and structs determine access rules and syntax for functions and data contained within their scope, and enums simply define a set of named integer constants.

    Function and control scopes have both compile-time and run-time effects. At compile time they serve to provide names and define syntax. At run time when a function or control scope is entered by the thread of execution it is provided a stack frame that serves as scratch-pad memory, valid only for the time that execution resides within that scope. Local data can be declared withing these scopes and if the data is an instance of a C++ class or struct its destructor is invoked when execution leaves the scope.

    The placement tree is a tree of scopes with the global namespace as its root. The rules that govern contruction of the scope tree are as follows:

    • Namespaces

      Namespaces can be composed (we sometimes say nested) and/or sequenced but must be declared at the global level. That is, they cannot be declared inside any non-namespace scope. The outer namespace is the global namespace. It has no name, is not declared, and includes everything not enclosed in a declared namespace. Namespaces can include:
      • other namespaces, interfaces, classes, structs, enums, functions, and data
    • Interfaces, Classes, and Structs

      Classes and structs can be declared and may be defined any place a declaration is accepted by the programming language. They can include:
      • other interfaces, classes and structs, enums, functions, and data declarations
    • Enums

      Enums can include only named constant integers that may have an associated value specifier.
    • Functions

      Functions can be declared only within namespaces and classes. Functions are the only place that program behavior is defined. They can include:
      • Interfaces, classes, structs, controls, data declarations and data definitions.
    • Data declarations and definitions

      Data declarations can not include any of the other constructs discussed here. They are always a leaf of the placement tree.

    Code restructuring entails making behavior preserving modifications to the placement tree. The goal of restructuring is to make the code more understandable, maintainable, and to reduce undesireable dependency effects. These operations are exclusively:

    • Extracting new Functions

      Adding a new function as a child of a namespace or class and moving some branch of the tree inside the new function and adding a function invocation at the original site of the moved branch.
    • Extracting new Classes and Structs

      Adding a new class as a child of a namespace or function and moving functions and data declarations within its scope by making them children of the new class or struct.
    • Adding new Interfaces

      Adding a new interface has a small syntactical effect, but a major effect on the code's dependency structure, discussed next.
  2. Dependency Graphs

    Unlike placement, dependency is a graph relationship. Any software element may have more than one other element that depends upon it, e.g., a parent in the software's dependency graph. There are several different graph representations useful for understanding software structure:
    • Function Call Dependency Graph

      One function depends on another if, and only if, it invokes the other. The graph of function call dependencies is often referred to as a structure chart. That name is a hold-over from the days of structured programming which focused on partitioning software into functions. Today, for understanding, we tend to focus more on classes and packages rather than functions. That is one of our abstraction tools that help us understand large systems. But even now, when we need to understand the details of how a package works, function call dependency graphs can be very useful.
    • Package Dependency Graph

      Package A depends on package B if:
      • Package B declares and/or defines a namespace, interface, class, struct, enum, function or datum cited in A, e.g., any of the constructs that define package B's placement tree.
      The citation may be to create an instance, include in a derivation chain, invoke, or include in an expression.

      Package dependencies determine both build and test strategies. For example, all mutually dependent packages must be considered a single unit of testing, for a change to any one package in that set may break any other element of the set. In the language of graph theory, these sets of mutually dependent packages are called strong components.

    • Data Use by Line Number, a Bipartite Graph

      When restructuring code, one of the essential things we need to track are the locations of all references to data. One way of representing that is with a bipartite graph. A bipartite graph is a graph with vertices divided onto two sets with relationships only between an element of one set with an element of the other set.

      The bipartite graph we are interested in is one in which one set of vertices represents each datum (either primitive or user defined) and the other set of verticies represents the line numbers where that datum is cited. So, for each datum, the graph contains relationships that define all the line numbers where that datum was used.

    • Data Composition Dependencies

      One member of our group, Mehmet Kaya, is currently working on the code restructuring problem, and believes that data dependencies will be important for extracting new classes from existing code. These dependencies arise from composition operations. A new value for an existing datum is defined by an expression that uses values of one or more other data. So there exists a potentially interesting data composition graph that may find use in defining new cohesive classes in code that could benefit from additional structure.
    Dependency graphs can be restructured too by inserting interfaces to break implementation dependencies, by shifting some package contents to another package for reasons of size, complexity, or cohesion, and by extracting code from a package to form one or more new packages.
  3. Visualization

    Each package in a system defines one and only one placement tree. If package sizes are reasonable, say at most a few hundred lines of code, the placement tree of any of them will be of reasonably modest size, visualizable by fairly routine means, say a well laid-out tree in a single view.

    Dependency graphs are an entirely different matter. A large system may have thousands of packages with correspondingly large dependency graph. And function calling sequences are not constrained by package boundaries so may also become quite large. How can we represent these large structures in a way that is meaningful to a human observer? One of our group, Mubarek Mohammed, is beginning to think about some of these issues.

    We need tools to discover and manipulate the relationships we've discussed above. But, without some kind of effective visualization process a user will be overwhelmed with data. Visualization procedures are needed to render these large collections of data into meaningful information.

    • Leveling

      One idea is to define levels withing the dependency relationships and, at first, show only the top few levels. When the user identifies a region of interest, that vertex becomes a new visualization root and the next few levels are displayed. This gives us a way to zoom into and explore regions of interest.
    • Semantic Tagging

      Another idea is to use some form of semantic tagging to allow a user to explore and mark regions of interest. Namespaces give us a means of defining a semantic group, but we may also need some, perhaps attribute marking scheme, encoded into the dependency relationships, to identify a particular interesting point.

The format of this page is an experiment to see if some variations in page style work for this site. I think the jury is still out on this question.