Flow-Based Programming - Chap. XI
Data Descriptions and Descriptors


This chapter has been excerpted from the book "Flow-Based Programming: A New Approach to Application Development" (van Nostrand Reinhold, 1994), by J.Paul Morrison.

A second edition (2010) is now available from CreateSpace eStore and Amazon.com.

The 2nd edition is also available in e-book format from Kindle (Kindle format) and Lulu (epub format).

To find out more about FBP, click on FBP.                               

For definitions of FBP terms, see Glossary.                                                                                       

The reader will note similarities between this chapter and Object-Oriented (OO) concepts. Much of the following has perhaps been made obsolete by OO, and it is not really central to the idea of Flow-Based Programming, but the reader may wish to read it for historical reasons, or to think about directions in which OO could be extended. Since FBP is so heavily data-oriented, it makes sense to try to manage data as intelligently as possible. 

In a recent e-business application built around 2000 and 2001 using the Java implementation of FBP (now called JavaFBP), we also built a number of business objects, which are described in Business Data Types.  It is to be hoped that, when building business applications in OO, programmers do not persist in using int or float for currency values!

Material from book starts here:

Up to now, you have probably noticed that we have been assuming that all components "know" the layouts of the IPs they handle. The layouts of all IP types that a component can handle and the IP types that it can generate become part of the specification of that component, just as much as the overall function is. If you feed something raw material it can't digest, you are bound to have problems, just as in the real world!

In conventional programs, you usually use the same layout for a structure or file record in all the subroutines of a program - in FBP, only each pair of neighbouring processes actually has to agree on the layout. This means that a process can receive data in one format and send it on in a different format. If two neighbouring processes (perhaps written by different suppliers) are using different layouts, all you have to do is add a transform process in between.

Now suppose you at first want to have two neighbouring processes communicate by means of 20-element arrays. You then decide this is too restrictive, so you stay with the arrays, but allow the processes to communicate the size of the arrays at run-time. This is a type of "metadata", data about data, and can be as much or as little as the two processes involved want. For instance, their agreement might specify that the array size is to be positioned as a separate field (it could even be a separate IP) ahead of the array.
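Here is a minimal sketch of such an agreement in Java. It does not use the actual JavaFBP API: the connection is modelled simply as a queue of objects (the "IPs"), and the names are purely illustrative. The point is just that a metadata IP carrying the size travels ahead of the array IP, and the receiver reads it first.

    // A minimal sketch, not the JavaFBP API: the connection is modelled as a
    // queue of Objects ("IPs"), and the two processes agree that an Integer IP
    // carrying the array size precedes the array IP itself.
    import java.util.ArrayDeque;
    import java.util.Queue;

    public class ArraySizeMetadataSketch {
        public static void main(String[] args) {
            Queue<Object> connection = new ArrayDeque<>();

            // Upstream process: send the size as a separate IP, then the array.
            double[] payload = {1.0, 2.0, 3.0};    // no longer fixed at 20 elements
            connection.add(payload.length);        // metadata IP: "data about data"
            connection.add(payload);               // the array IP itself

            // Downstream process: read the size first, then the array.
            int size = (Integer) connection.remove();
            double[] received = (double[]) connection.remove();
            System.out.println("Expecting " + size + " elements, got " + received.length);
        }
    }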

Now most higher level languages don't support metadata very well, so you are dependent on having the data formats embedded in a program. Also, it is easy for old data to get out of step with the programs that describe them. There is a perhaps apocryphal story that somebody discovered a few decades ago that the majority of the tapes in the US Navy's tape library were illegible. It wasn't that the tapes had I/O errors - they were in perfect shape physically - but the problem was that the layouts of the tape records were hard-coded in the programs, and nobody knew which programs or copy code described which tapes! I'm sure that such a problem, if it ever existed, will have been remedied long ago! [tongue firmly planted in cheek!]

In DFDM we extended the idea of metadata by providing run-time descriptions which could be attached to IPs passing across a connection. DFDM allowed the creator of an IP to attach a separately compiled descriptor to the IP, which was used by special DFDM services which allowed fields to be accessed by name. Whenever an IP was created, the "allocate" function could optionally specify a descriptor which was to be permanently associated with that IP. All the IPs of the same type would share the same descriptor [shades of OO].

In DFDM these access services were called GETV and SETV ("get value" and "set value"). They had the advantage that, if you ever wanted to move a field (call it AMOUNT) from one place to another in the structure containing it, you didn't have to recompile all the programs referring to it. Another advantage was that, with a single call, a component could access a field which might be at different offsets in different types of incoming IP. For example, the field called AMOUNT could be at different locations in different IP types, and the component would still be able to access it or modify it, as long as the IPs had descriptors. The DFDM GETV and SETV services (and their descendants) were designed to be called from S/370 Assembler or from the HLLs we supported. They also provided limited conversion facilities between similar field formats - for example, between 2-byte and 4-byte binary fields, between different lengths and scales of packed decimal, or between varying and non-varying character strings. Thus you could specify that you wanted to see a binary field as 4 bytes in working storage, even though it was only 2 bytes in the IP. As of the time of writing, these services have not been implemented for THREADS.
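The following Java sketch gives the flavour of this kind of access by name. It is not the DFDM implementation, and the Descriptor, getv and setv names are inventions for illustration: the descriptor maps field names onto offsets and lengths within a flat character buffer, so the same call works whether AMOUNT is the first field in one IP type or a later field in another.

    // Hypothetical descriptor-based field access: fields are located by name
    // via the descriptor, not by offsets compiled into the component.
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class DescriptorSketch {
        record Field(int offset, int length) {}

        static class Descriptor {
            private final Map<String, Field> fields = new LinkedHashMap<>();
            Descriptor define(String name, int offset, int length) {
                fields.put(name, new Field(offset, length));
                return this;
            }
            String getv(char[] ip, String name) {              // "get value" by name
                Field f = fields.get(name);
                return new String(ip, f.offset(), f.length());
            }
            void setv(char[] ip, String name, String value) {  // "set value" by name
                Field f = fields.get(name);
                String padded = String.format("%-" + f.length() + "s", value);
                padded.getChars(0, f.length(), ip, f.offset());
            }
        }

        public static void main(String[] args) {
            // Two IP types put AMOUNT at different offsets; the caller doesn't care.
            Descriptor typeA = new Descriptor().define("AMOUNT", 0, 8).define("BRANCH", 8, 3);
            Descriptor typeB = new Descriptor().define("BRANCH", 0, 3).define("AMOUNT", 3, 8);

            char[] ipA = "00012345NYC".toCharArray();
            char[] ipB = "NYC00012345".toCharArray();
            System.out.println(typeA.getv(ipA, "AMOUNT"));     // 00012345
            System.out.println(typeB.getv(ipB, "AMOUNT"));     // 00012345

            typeB.setv(ipB, "AMOUNT", "00099999");             // update by name
            System.out.println(new String(ipB));               // NYC00099999
        }
    }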

Without such a facility, the layout of incoming IPs has to be part of the specification of all components. This points up another advantage of GETV and SETV: if a component is only interested in three fields, only those specific field names have to be mentioned in the specification for the component, rather than the whole layout of the IPs in question.

When you add this facility to the idea of option IPs, you get a powerful way of building more user-friendly black boxes. For instance, you could write a Collator which specified two field names for its major and minor keys, respectively. It would then use GETV to locate these fields in the correct place in all incoming IPs. "Collate on Salesman Number within Branch" seems much more natural than "Collate on cols 1-6 and 7-9 for IP type A, cols 4-9 and 1-3 for IP type B," and so on. Thus it is much better to parametrize generalized components using symbolic field names, rather than lengths and offsets. The disadvantage, of course, is performance: the component has to access the fields involved using the appropriate API calls, rather than compiled-in offsets. However, the additional CPU time is usually a negligible cost factor compared with the cost of the human time required for massive recompiles when something changes, or, still worse, the cost of finding and correcting errors introduced while making the changes!
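A hedged sketch of the idea, building on descriptor-style access: the generalized collator is parametrized with the symbolic names of its major and minor keys, and looks the fields up by name in each incoming IP (modelled here simply as a map from field names to values; the names are illustrative only).

    // Collation keys specified by field name, not by column positions.
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    public class SymbolicKeyCollatorSketch {
        // Build a comparator from symbolic key names instead of offsets.
        static Comparator<Map<String, String>> collateOn(String major, String minor) {
            Comparator<Map<String, String>> byMajor = Comparator.comparing(ip -> ip.get(major));
            Comparator<Map<String, String>> byMinor = Comparator.comparing(ip -> ip.get(minor));
            return byMajor.thenComparing(byMinor);
        }

        public static void main(String[] args) {
            // Each IP is modelled as a map from field name to value.
            List<Map<String, String>> ips = new ArrayList<>(List.of(
                    Map.of("BRANCH", "002", "SALESMAN", "000123"),
                    Map.of("BRANCH", "001", "SALESMAN", "000456"),
                    Map.of("BRANCH", "001", "SALESMAN", "000123")));
            ips.sort(collateOn("BRANCH", "SALESMAN"));   // "Salesman Number within Branch"
            ips.forEach(System.out::println);
        }
    }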

When we looked at the problems of passing data descriptions between components, we rapidly got into the problem of what the data "means": in the case of our conventional Higher Level Languages, the emphasis has always been on generating the desired machine instructions. For instance, on the IBM S/370, currency is usually held as a packed decimal field - 2 digits per byte (except the last byte, which has one digit and a sign), and usually amounts are held with 2 decimal places (these usually have special names, e.g. cents relative to dollars, new pence relative to pounds, and so forth - are there any mixed radix currencies left in the world?). Since the instructions on the machine don't care about scale (number of decimal places), the compiler has to keep track of scale information and make sure it is handled correctly for all operations. (You could conceivably use floating point notation, but this has other characteristics which make it less suitable for business calculations).

Now suppose a component receives an IP and tries to access a currency field in it based on its compiled-in knowledge of the IP's layout. If the compiled code has been told that the field has 2 places of decimals, that's what the component will "see". So we can now do arithmetic with the number, display it in the right format (if we know what national currency we are dealing with), and so forth. But note that the layout of the IP is only defined in the code - the code cannot tell where the fields really start and stop. So we have a sort of mutual dependency: the only definition of the data is in the code, and the code is tightly tied to the format of the data. If you want to decouple the two, you have to have a separate description of the data which various routines can interrogate, which can be attached to the data, independent of what routines are going to work with that data. If this is powerful enough, it will also let you access older forms of your data, say, on an old file (the "legacy data" problem).

Apart from format information, you also have to identify what type of data a field contains. For instance, the hexadecimal value

019920131F

might be a balance in a savings account ($199,201.31), but it could also be a date (31st January, 1992). Depending on which it is, we will want to perform very different operations on it. Conversely, a function like "display", which applies to both data types, will produce very different results:

$199,201.31

versus

31st January, 1992

Our traditional HLLs will see both data types as FIXED DECIMAL (or COMPUTATIONAL-3). Again, only the program knows which kind of data is in the field (by using the right operations). There is also the issue of which representation is being used. We have all run into the problem of not knowing whether 01021992 is January 2 (American convention) or February 1 (British convention). So we have to record somewhere which digits represent the day, which the month and which the year. Thus a complete description of our field has what we might call "base type" (signed packed decimal), length (and perhaps scale), domain and representation. Some systems use a standard representation for "internal" data, or at least a format which is less variable than "external" formats, but the "data about data" (metadata) items which I have just described are pretty basic. (Base type could be treated logically as part of the more general domain information, but it turns out to be useful for designing such functions as "dumb" print processes). In some systems, domain is referred to as "logical type", and representation as "physical type".
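The following Java sketch illustrates the distinction, using invented names: the same packed-decimal bytes (019920131F) are unpacked once, and it is the domain recorded in the field's descriptor that decides whether they are displayed as a currency amount or as a date.

    // The same bytes, interpreted according to the field's declared domain.
    import java.math.BigDecimal;
    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;
    import java.util.Locale;

    public class DomainVsRepresentationSketch {
        enum Domain { CURRENCY, DATE }

        // Unpack S/370-style packed decimal: two digits per byte, sign in the last nibble.
        static String unpack(byte[] packed) {
            StringBuilder digits = new StringBuilder();
            for (int i = 0; i < packed.length; i++) {
                digits.append((packed[i] >> 4) & 0x0F);
                if (i < packed.length - 1) digits.append(packed[i] & 0x0F);
            }
            return digits.toString();
        }

        static String display(byte[] packed, Domain domain) {
            String digits = unpack(packed);                           // "019920131"
            if (domain == Domain.CURRENCY) {
                BigDecimal amount = new BigDecimal(digits).movePointLeft(2);
                return "$" + String.format("%,.2f", amount);          // $199,201.31
            } else {
                LocalDate date = LocalDate.parse(digits.substring(1),
                        DateTimeFormatter.ofPattern("yyyyMMdd"));
                return date.format(DateTimeFormatter.ofPattern("d MMMM yyyy", Locale.UK));
            }
        }

        public static void main(String[] args) {
            byte[] field = {0x01, (byte) 0x99, 0x20, 0x13, 0x1F};
            System.out.println(display(field, Domain.CURRENCY));
            System.out.println(display(field, Domain.DATE));          // 31 January 1992
        }
    }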

Let us now look at another example of the pitfalls of programs not knowing what kind of data they are working with. Suppose that you have coded the following PL/I statement in a program by mistake:

NET_PAY = GROSS_PAY * TAX;

where TAX is a computed amount of tax, not the tax rate.  The compiler isn't going to hesitate for a microsecond. It will blithely multiply two decimal values together to give another decimal value (assuming this is how these fields are defined), even though the result is perfectly meaningless. Remember: computers do what you tell them, not what you mean! A human, on the other hand, would spot the error immediately (we hope), because we know that you can't multiply currency figures together. The compiler knows that the result of multiplying two numbers each with 2 decimal places is a number with 4 decimal places, so it will carefully trim off the 2 extra decimal places from the result (maybe even rounding the result to the nearest cent). What it can't do is tell us whether the whole operation makes any sense in the first place!
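As a hedged illustration of how a type could enforce this, here is a minimal Money class in Java (the names and rounding rules are assumptions, not the book's Business Data Types): it offers addition, subtraction and scaling by a dimensionless rate, but simply has no operation for multiplying two amounts together, so the mistake above could not even compile.

    // A sketch of a currency type that only permits operations that make sense
    // for money; multiplying two Money values is not expressible.
    import java.math.BigDecimal;
    import java.math.RoundingMode;

    public final class Money {
        private final BigDecimal amount;   // always kept at 2 decimal places
        private final String currency;     // e.g. "USD"

        public Money(String amount, String currency) {
            this(new BigDecimal(amount), currency);
        }
        private Money(BigDecimal amount, String currency) {
            this.amount = amount.setScale(2, RoundingMode.HALF_UP);
            this.currency = currency;
        }

        public Money add(Money other)       { requireSame(other); return new Money(amount.add(other.amount), currency); }
        public Money subtract(Money other)  { requireSame(other); return new Money(amount.subtract(other.amount), currency); }
        public Money times(BigDecimal rate) { return new Money(amount.multiply(rate), currency); }
        // Deliberately no times(Money): multiplying two currency amounts is meaningless.

        private void requireSame(Money other) {
            if (!currency.equals(other.currency))
                throw new IllegalArgumentException("Mixed currencies: " + currency + " vs " + other.currency);
        }

        @Override public String toString() { return currency + " " + amount; }

        public static void main(String[] args) {
            Money gross = new Money("2500.00", "USD");
            Money tax   = gross.times(new BigDecimal("0.30"));   // a rate, not an amount
            Money net   = gross.subtract(tax);
            System.out.println(net);                             // USD 1750.00
            // gross.times(tax);   // would not compile: no such operation exists
        }
    }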

One other (real world) problem with currency figures is that inflation will make them steadily larger without increasing their real value. If 11 digits seemed quite enough in 1970, the same sort of information may need 13 or 15 digits in 1992. It is disconcerting to have your program report that you mislaid exactly a billion dollars (even though it is usually a good hint about what's wrong)! It would be nice if we could avoid building this kind of information into the logic of our applications. If we had an external description of a file, we could either use this dynamically at run-time, or convert the data into some kind of standard internal format - with lots of room for inflation, of course! The other place where this affects our systems is in screen and report layouts. As we shall see, these are also areas where it makes a lot of sense to hold descriptions separately, and interpret them at run-time. What we don't want to do is have to recompile our business systems every time some currency amount gets too big for the fields which hold it.

This is another legacy of the mathematical origins of today's computers - everything is viewed as a mathematical construct - integers, real numbers, vectors, matrices. In real life, almost everything has a dimension and a unit, e.g. weight, in pounds or kilograms, or distance, in miles or kilometres. If you multiply two distances together, you should get area (acres or hectares); three distances give volume, in cubic centimetres, bushels or litres. Currency amounts can never be multiplied together, although you can add and subtract them. You could also theoretically convert currency from dollars to pounds, or francs to marks, but this is a little different as it would need access to up-to-date conversion rate tables, and it was recently pointed out to me that banks usually like to charge for converting from one currency to another, so that it's not just a simple matter of multiplying a currency amount by a conversion factor [although we built just such a facility for a recent e-business application]. Dates can't even be added, although you can subtract one date from another. There is a temptation with dates to just convert them into a canonical form (number of days from a reference date - for instance, Jan. 1, 1800) and then assume you can do anything with them. In fact, they remain dates, just represented differently, and you still can't add them.... On the other hand, you can do things like ask what day of the week a date falls on, what is the date of the following Monday, how many business days there are between June 30 and August 10, etc.
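The standard java.time library already embodies some of this discipline, as the following sketch shows: you can subtract one date from another (getting a number of days), ask for the day of the week or the following Monday, and count business days, but there is no operation that adds two dates together. The businessDaysBetween helper is an illustrative assumption, not a library method.

    // Operations that make sense for dates - and one that deliberately doesn't exist.
    import java.time.DayOfWeek;
    import java.time.LocalDate;
    import java.time.temporal.ChronoUnit;
    import java.time.temporal.TemporalAdjusters;
    import java.util.stream.Stream;

    public class DateArithmeticSketch {
        // Count Monday-to-Friday days between two dates (exclusive of 'to').
        static long businessDaysBetween(LocalDate from, LocalDate to) {
            return Stream.iterate(from, d -> d.plusDays(1))
                    .limit(ChronoUnit.DAYS.between(from, to))
                    .filter(d -> d.getDayOfWeek() != DayOfWeek.SATURDAY
                              && d.getDayOfWeek() != DayOfWeek.SUNDAY)
                    .count();
        }

        public static void main(String[] args) {
            LocalDate june30 = LocalDate.of(1992, 6, 30);
            LocalDate aug10  = LocalDate.of(1992, 8, 10);

            System.out.println(ChronoUnit.DAYS.between(june30, aug10)); // date minus date = days
            System.out.println(june30.getDayOfWeek());                  // what day of the week?
            System.out.println(june30.with(TemporalAdjusters.next(DayOfWeek.MONDAY))); // following Monday
            System.out.println(businessDaysBetween(june30, aug10));     // business days in between
            // june30.plus(aug10);   // no such operation: two dates cannot be added
        }
    }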

We have used the term "representation" quite a lot so far. Some of the above complexities come from confusing what the data is with how it is represented. The data is really a value, drawn from a domain (defined as a set of possible values). We really shouldn't care how the data is represented internally - we only care when we have to interface with humans, or when a file is coming from another system. However, we also have to care every time we interface with current higher level languages. The requirements for interfacing with humans involve even more interesting considerations such as national languages and national conventions for writing numbers and dates, which should as far as possible be encapsulated within off-the-shelf subroutines.

Multilingual support is an increasingly important area. Some Asian languages involve double-byte coding, which differs from machine to machine, as well as from language to language. Computer users no longer feel that they should have to learn English to use an application, although most programmers are still willing to do so! This attitude on the part of programmers still sometimes laps over into what they build for their customers, but the more sophisticated ones know that we are living in a global market, and that computers have to adapt to people, rather than the other way around. In many ways, Canada has been at the forefront of these changes, as it is an officially bilingual country, and French Canadians have historically been very insistent, and rightly so, that their language be written correctly!

If we separate the representation from the content of the data, we can look at the variety of possible representations for any given chunk of data, and consider how best to support conversion from one to another. I once counted 18 different representations of dates in use in our shop! These are all inter-convertible, provided you are not missing information (like the century). I imagine a lot of retired programmers are going to be called back into harness around the year 2000, converting programs with 6-digit dates to make them able to cope with the 21st century! [Yup! We were!]
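A minimal sketch of such conversions using java.time: once a value has been captured as a LocalDate (the domain), moving between external representations is just a matter of choosing formatters - provided each format states explicitly which digits are the day, the month, the year and the century. The patterns shown are illustrative, not the 18 formats referred to above.

    // One internal value, many interchangeable external representations.
    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;
    import java.util.List;
    import java.util.Locale;

    public class DateRepresentationSketch {
        public static void main(String[] args) {
            // "01021992" is ambiguous on its own; the formatter records the convention.
            LocalDate american = LocalDate.parse("01021992", DateTimeFormatter.ofPattern("MMddyyyy")); // Jan 2
            LocalDate british  = LocalDate.parse("01021992", DateTimeFormatter.ofPattern("ddMMyyyy")); // Feb 1

            List<DateTimeFormatter> representations = List.of(
                    DateTimeFormatter.ofPattern("yyyy-MM-dd"),
                    DateTimeFormatter.ofPattern("dd/MM/yy"),                 // loses the century!
                    DateTimeFormatter.ofPattern("d MMMM yyyy", Locale.UK),
                    DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.US));
            for (DateTimeFormatter f : representations) {
                System.out.println(american.format(f) + "   vs   " + british.format(f));
            }
        }
    }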

As we said above, the representation of data inside the machine doesn't really concern us. It is of interest when we are talking about external uses of that data (being read by humans, being written to or read from data bases, or being processed by HLLs). In some ways, this resembles the Object-Oriented view. However, when displaying data (e.g. numeric fields) we found that we needed "global" information to control how the data should be presented, such as:

  • currency symbol (whether required and, if so, which)

  • whether currency symbol is floating or fixed

  • separators between groups of three digits (whether required, and, if so, what symbol)

  • separators between integer part and fractional part (what symbol)

  • whether negative values should be indicated and, if so, how (DR, preceding -, following -, etc.)

In addition, these options usually come in "layers": there may be an international standard, a national standard, and a company standard, and a particular report may even use one or more such representations for the same domain, e.g. amounts with and without separators. Today it is not enough just to provide one conversion facility in each direction. Representations occur at the boundaries between responsibilities and, I believe, require sophisticated multi-level parametrization.
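As a rough illustration (not a full multi-level scheme), the standard java.text classes already parametrize most of the options listed above: the locale supplies the national defaults for currency symbol and separators, and a company- or report-level pattern can override them for a particular use. The report-style choices below are assumptions made for the sake of the example.

    // National defaults per locale, with a house-style override layered on top.
    import java.math.BigDecimal;
    import java.text.DecimalFormat;
    import java.text.NumberFormat;
    import java.util.Locale;

    public class LayeredCurrencyFormatSketch {
        public static void main(String[] args) {
            BigDecimal amount = new BigDecimal("-199201.31");

            // National layer: defaults for each locale.
            for (Locale locale : new Locale[] {Locale.US, Locale.UK, Locale.GERMANY}) {
                NumberFormat national = NumberFormat.getCurrencyInstance(locale);
                System.out.println(locale + ": " + national.format(amount));
            }

            // Company/report layer: same domain value, house conventions -
            // no currency symbol, no grouping separators, trailing "DR" for negatives.
            DecimalFormat report = new DecimalFormat("0.00;0.00DR");
            System.out.println("report style: " + report.format(amount));
        }
    }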

Given that we have to work with existing HLLs [this was pre-Java!], the best we can probably do is to describe fields and use smart subroutines as much as possible for all the conversion and interfacing logic. This will let us implement all known useful functions, but it cannot prevent illegal ones. Object-oriented languages and some of the newer High Level Languages with strong typing are moving in this direction, but the older HLLs do not provide any protection. I suspect we have to go further, though, as some form of dimensional analysis will probably be necessary eventually. Hardware designs may emerge which take into account the needs of business users, and when this happens, we will not have to have these drastic conversions back and forth between such different paradigms.

So far we have only talked about static descriptions of data. DFDM also had another type of data description which proved very useful, which we called "dynamic attributes". Dynamic attributes were also a form of metadata, but were attached to IPs, as well as to their descriptions. The first example of this that we came up with was the "null" attribute. We took this term from IBM's DB2, in which a column in a table may have the attribute of "may be null". This means that individual fields in this column will have an additional bit of information, indicating whether or not the particular value is null, meaning either "don't know" or "does not apply". Some writers feel that these two cases are different, and in fact the latter may be avoided by judicious choice of entity classes, but the former is certainly very useful. In DFDM interactive applications, we often used "nullness" to indicate fields which had not been filled in on a panel by the end-user.

The "null" attribute also works well with another dynamic attribute which we also found useful: the "modified" attribute. Suppose that an application panel has a number of fields on it, some of which do not have values known to the program. It is reasonable to display "null" fields as blank, or maybe question-marks. If the user fills one or more fields in, their attributes are changed to "modified" and "non-null". This information can then be used by the application code to provide user-responsive logic. We found that this kind of logic often occurred in the type of application which is called "decision assist": here you often see screens with a large number of fields and it becomes important to know which ones have been modified by the user.

Many applications encode null as a "default" value, e.g. binary zeros, but there are a number of formats which do not have an unused value, e.g. binary, so how do you tell whether you have zero eggs, or an unknown number? Does a blank street name and number in an address mean that we don't know the house-owner's full address, or that she lives in a rural community, where the mailman knows everyone by name? Also, we saw no advantage to confusing the idea of null and default - what is an appropriate default number of eggs?

Where DB2 has specific handling for the "nullness" attribute only, DFDM generalized this idea to allow you to attach any kind of dynamic attribute data to any field of any IP, e.g. "null" and "modified", but also "colour", "error number", etc. Since we felt we couldn't predict what kinds of dynamic attribute data we might want to attach to the fields of an IP, we built a very general mechanism, driven by its own descriptor (called, naturally, a Dynamic Attribute Descriptor or DAD). It allowed any number of attributes to be attached to each field and also to the IP as a whole. Thus, we had a "modified" attribute on each field, but, for performance reasons, we had a "modified" attribute on the IP as a whole, which was set on if any fields were modified.
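A hypothetical Java sketch of the idea (not the actual DFDM mechanism or its DAD format): each field of an IP can carry arbitrary named attributes such as "null", "modified" or "error code", and an IP-level "modified" flag is maintained as a short cut whenever any field is marked modified.

    // Arbitrary named attributes attached to individual fields and to the IP as a whole.
    import java.util.HashMap;
    import java.util.Map;

    public class DynamicAttributeSketch {
        static class AttributedIP {
            final Map<String, Object> fields = new HashMap<>();
            final Map<String, Map<String, Object>> fieldAttrs = new HashMap<>();
            boolean modified;                       // IP-level "modified" attribute

            void setAttr(String field, String attr, Object value) {
                fieldAttrs.computeIfAbsent(field, k -> new HashMap<>()).put(attr, value);
                if ("modified".equals(attr) && Boolean.TRUE.equals(value)) modified = true;
            }
            Object getAttr(String field, String attr) {
                return fieldAttrs.getOrDefault(field, Map.of()).get(attr);
            }
        }

        public static void main(String[] args) {
            AttributedIP screen = new AttributedIP();
            screen.setAttr("QUANTITY", "null", true);              // not filled in yet

            // User types a value; an edit routine finds it invalid.
            screen.fields.put("QUANTITY", "12x");
            screen.setAttr("QUANTITY", "null", false);
            screen.setAttr("QUANTITY", "modified", true);
            screen.setAttr("QUANTITY", "error code", "INVALID_NUMERIC");

            System.out.println(screen.getAttr("QUANTITY", "error code")); // INVALID_NUMERIC
            System.out.println(screen.modified);                           // true
        }
    }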

"Null" and "modified" are of course Boolean, but we allowed binary or even character dynamic attributes. One character-type dynamic attribute which we found very useful for interactive applications was "error code". Suppose an editing routine discovered that a numeric field had been entered by the user incorrectly: it would then tag that field with the error code indicating "invalid numeric value". Any number of fields could be tagged in this way. When the IP containing the screen data was redisplayed, the display component would automatically change all the erroneous fields to some distinctive colour (in our case yellow), position the cursor under the first one and put the corresponding error message in the message field of the screen. Without leaving the component (under IBM's ISPF environment), the user could then cycle through all the error fields, with the correct error message being displayed each time. This was really one of the friendliest applications I have ever used, and it was all managed by one reusable component which invoked ISPF services (under IBM's IMS you can do the same thing but it would probably take a set of components working together).

Some writers have objected to the "null" attribute on the grounds that it introduces 3-valued logic: yes, no and don't know. Our experience was that, in practice, it never caused any confusion, and in fact significantly reduced the complexity of the design of our end user interfaces.

A last point has been suggested by our study of Object-Oriented Programming: IP types are often related in a superclass-subclass relationship. This comes up frequently in file handling: one may know that one is dealing with, say, cars, but not know until a record is read what kind of car it is. It would be very nice to be able to attach a "car" descriptor to each record as it is read in, and then "move down" the class hierarchy for a given record, based on some indicator in the record. This is in turn related to the question of compatibility of descriptors: what changes are allowed between descriptors? It seems reasonable to be able to change a generic car into a Volkswagen as one gains more information about it, but not a car into, say, a beetle [note - this is similar to the OO class/superclass relationship, except that it is not acceptable to have to modify code as new models of car are produced]. In the later versions of DFDM we decided that IP lengths should be completely specified by IP type - i.e. if you created an IP with a given type, the length would be obtained from the type descriptor, and couldn't be changed. The question then arises: if you change a car into a Pontiac (move from superclass to subclass), does the IP become longer, or does an "unstructured" portion of the IP become structured? Let's leave this as an exercise for the reader!