Flow-Based Programming - Chap. IX
Substreams and Control IPs


This chapter has been excerpted from the book "Flow-Based Programming: A New Approach to Application Development" (van Nostrand Reinhold, 1994), by J.Paul Morrison.

A second edition (2010) is now available from CreateSpace eStore and Amazon.com.

The 2nd edition is also available in e-book format from Kindle (Kindle format) and Lulu (epub format).


Material from book starts here:

We are now going to expand on the use of the Collate component mentioned in the previous chapter. This chapter will also show how Collate, substreams and control IPs can be combined to address one of the most difficult types of conventional business batch application. The main function of Collate, just as it was for the Collator Unit Record machine from which it gets its name, is to merge the IPs in its incoming data streams based on values in key fields (these key fields are normally specified in an options IP). In most applications we have more than one key field, and the fields are used to specify different levels of grouping. As an example, let's take a file of bank accounts within branches. In this particular bank, we'll say that account numbers are not guaranteed to be unique across branches. Another way of saying this is that, to make an account number unique across the whole bank, we must specify the branch number as well.

Suppose we have an application where a stream of banking transactions must be run against a stream of account records: we first sort both streams by account number within branch number. Now in conventional programming, we have to write an "Update" program. I once figured that something like a quarter of all business programs running today are Updates! Whether or not that is the right figure, Update programs are hard to code and harder to modify, and yet the only assistance programmers have ever received is a piece of paper, handed down from father to son, showing the basic logic for Updates, which is a pattern of moves and compares usually called the "Balance Line" technique. This logic then has to be modified by hand to suit the particular situation you are faced with. However, it is still only a description of an approach - to adapt it to a particular application you have to massively modify it. I once had the dubious pleasure of having to modify an update program whose author had written the client an explanation of why his request could not be satisfied, which started: "Owing to the limitations of data processing,..." My clear recollection, after almost 25 years, is that modifying that program (and, for a conventional program, it was really quite well-written) was not quite impossible!

Now imagine what you could do if you had a prewritten, pretested component for collating data streams. Let us imagine that we have a stream of transactions and a stream of accounts, both sorted by account number within branch. We will now collate them into a single stream, specifying branch number and account number as major and minor control fields, respectively. When Collate finds two equal records from two different port elements, it outputs the one from the lowest-numbered element first. The resulting output stream will contain the following sort of pattern:

IP type   branch   acct #   date        amount   DEP/WD

account      1       1
trans        1       1      1992/3/12    12.82   DEP
trans        1       1      1992/3/12   101.99   WD
trans        1       1      1992/3/12    43.56   WD
trans        1       1      1992/3/26    54.77   WD
trans        1       1      1992/3/26    12.26   WD

account      1       2
trans        1       2      1992/3/03    34.88   DEP
trans        1       2      1992/3/03    10.00   WD
.
.
.

account      2       1
trans        2       1      1992/2/29    25.99   DEP
trans        2       1      1992/3/25    87.56   DEP

account      2       3
trans        2       3      1992/3/01    34.88   WD
trans        2       3      1992/3/17    88.22   DEP
.
.

Figure 9.1
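The merge logic described above can be sketched in a few lines. This is only an illustration, not the actual Collate component or its API: IPs are represented as Python dicts, the key fields are passed as a function, and the port index serves as the tie-breaker so that equal records come out lowest-numbered port element first.

```python
import heapq

def collate(streams, key):
    """Merge already-sorted input streams of IPs (dicts here) on the
    key fields extracted by `key`.  When two IPs have equal keys, the
    one from the lowest-numbered port element is emitted first,
    because the port index is the second element of the sort key."""
    tagged = [[((key(ip), port), ip) for ip in stream]
              for port, stream in enumerate(streams)]
    for _, ip in heapq.merge(*tagged):
        yield ip

# port element 0: accounts, port element 1: transactions,
# both sorted by account number within branch number
accounts = [{"type": "account", "branch": 1, "acct": 1},
            {"type": "account", "branch": 1, "acct": 2}]
trans = [{"type": "trans", "branch": 1, "acct": 1, "amount": 12.82},
         {"type": "trans", "branch": 1, "acct": 2, "amount": 34.88}]

merged = list(collate([accounts, trans],
                      key=lambda ip: (ip["branch"], ip["acct"])))
# each account IP now precedes its transactions, as in Figure 9.1
```

Because the accounts arrive on the lower-numbered port element, each account IP sorts ahead of the transactions that share its key, giving exactly the grouping shown in the figure.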

Notice that the effect of Collate operating on sorted input streams is to give us a nicely sequenced and grouped data stream, consisting of two kinds of data IP. The job of the process downstream of Collate is therefore much simpler than the conventional Balance Line, which has to do this grouping as well as implement the required business logic. A conventional Update also has to worry about what happens if one file is exhausted before the other. Instead, in our FBP solution, the actual business logic (as compared with all the logic of synchronizing the two data files) sees one IP at a time, determines its type, and decides what to do with it. In what follows, we will call this process UPDATE_ACCTS. One rule of thumb in conventional programming is that the complexity of a program is roughly proportional to the square of the number of input files. Just one reusable component, Collate, therefore can reduce the complexity of UPDATE_ACCTS significantly!

So far we have talked about the branch and account levels. Now let's assume we want to group transactions by date - bank statements often only show one subtotal per day. This then gives us three grouping levels, most of which are only recognized by changes in control fields. A change of account number is recognizable by the arrival of a new Account IP, but this cannot tell us when we have started a new branch. So a lot of the logic in a conventional Update is keyed to changes in value of control fields. Now, Collate has to look at all the control field values anyway, so it would be nice if we could have Collate figure out the groupings and pass that information to downstream processes, which would therefore be relieved of all this comparing to see when a given group starts or finishes. How does Collate pass this grouping information downstream? You guessed it! We use the "bracket" IPs mentioned in Chapter 3.

Bracket IPs have a recognizable type which follows a special convention, so that they can never conflict with user-defined types. Brackets come in two flavours: open and close brackets. They may also contain real data (if their IP length is non-zero), which by convention we use for the name of the group they delimit. Let's get Collate to insert some brackets into its output data stream, resulting in a collated data stream that looks like the following diagram. As before, I will use bracket symbols to represent open and close bracket IPs, but this time we will show the names of the groups they refer to in the data part of the bracket IP ("date" means a group comprising all the deposits and withdrawals for a given date). To make things a bit clearer, I will show a "level number" (L) in front of each IP (open brackets increase the number, close brackets decrease it). I will just show the first few IPs of the collated stream:

L   IP type   branch   acct #   date        amount   DEP/WD

0 | < branch
1 | < account
2 | account      1       1
2 | < date
3 | trans        1       1     1992/3/12    12.82   DEP
3 | trans        1       1     1992/3/12   101.99   WD
3 | trans        1       1     1992/3/12    43.56   WD
3 | > date
2 | < date
3 | trans        1       1     1992/3/26    54.77   WD
3 | trans        1       1     1992/3/26    12.26   WD
3 | > date
2 | > account
1 | < account
2 | account      1       2
2 | < date
3 | trans        1       2     1992/3/03    34.88   DEP
3 | trans        1       2     1992/3/03    10.00   WD
3 | > date

Figure 9.2
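The bracket-insertion step itself can also be sketched. The following is a simplified illustration, not the real Collate: it assumes every IP carries all the key fields for every level (so I show only the branch and account levels here, leaving out the date level and the question of IPs, like account IPs, that lack the minor keys), and it emits plain tuples where the real component would emit bracket IPs.

```python
def insert_brackets(ips, levels):
    """Emit ('open', name) / ('close', name) bracket tokens around each
    group, plus ('data', ip) for the real IPs.  `levels` is a list of
    (group name, key function) pairs, major to minor."""
    current = []                      # keys of the currently open groups
    for ip in ips:
        keys = [key_fn(ip) for _, key_fn in levels]
        i = 0                         # first level whose key changed
        while i < len(current) and current[i] == keys[i]:
            i += 1
        for j in range(len(current) - 1, i - 1, -1):
            yield ("close", levels[j][0])   # close inner groups first
        for j in range(i, len(levels)):
            yield ("open", levels[j][0])    # reopen from the changed level
        current = keys
        yield ("data", ip)
    for j in range(len(current) - 1, -1, -1):
        yield ("close", levels[j][0])       # end of data closes everything

levels = [("branch", lambda ip: ip["branch"]),
          ("account", lambda ip: (ip["branch"], ip["acct"]))]
stream = [{"branch": 1, "acct": 1}, {"branch": 1, "acct": 2},
          {"branch": 2, "acct": 1}]
tokens = list(insert_brackets(stream, levels))
# brackets come out perfectly nested: a branch change closes the open
# account group before closing the branch group
```

Note the key property the downstream process relies on: a change at a major level forces the minor levels closed first, so the brackets are always perfectly nested.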

Generally, the logic of UPDATE_ACCTS will consist of a "case" statement based on the type of the incoming IP. An open bracket will cause counters and totals for the correct level to be initialized to zero; a close bracket will cause an IP containing the counters and totals for that level to be sent to an output port. We won't even have to reinitialize the counters and totals at this point because we know that another open bracket will be coming along shortly (or end of data). We could either update the counters and totals at every level for every incoming data IP, or just roll the values into the next level up at close bracket time - it seems simpler to choose the latter. There is some redundancy in the data structure, as an account IP is always immediately preceded by an account open bracket, but this is much better than not having enough data! Since we will be needing information from the account IP, we can just ignore the account open bracket, or we can do a cross-check that they are both present ("belt and braces" programming - that's "[belt and] suspenders" for American readers!).

So far the main piece of logic for UPDATE_ACCTS (the process downstream of the Collate) looks roughly like this:

receive incoming IP
begin cases based on IP type
  case: open bracket for branch
    initialize counters and totals for branch

  case: open bracket for account
    initialize counters and totals for account

  case: open bracket for date
    initialize counters and totals for date

  case: account
    pick up account info

  case: transaction
    increment counter for debit or credit
    add amount to debit or credit total

  case: close bracket for date
    output IP containing counters and totals for date
    roll these values to account level

  case: close bracket for account
    output IP containing counters and totals for account
    roll these values to branch level

  case: close bracket for branch
    output IP containing counters and totals for branch
end cases

Figure 9.3

Notice that these groupings are perfectly "nested", and at any point of time we are only looking at one level or at most two adjacent ones. This suggests that we could handle this kind of logic very elegantly using a push-down or "last in first out" (LIFO) stack. You will remember that this is a storage structure which is added to and removed from at one end only. We also have to decide what kind of data structure to hold all these counters and totals in. One possibility would be an array, but for each level we have to eventually output an IP containing the values for that level, so why not hold the values in IPs all the time? This kind of IP is called a control IP, and it can be described as an IP whose lifetime corresponds exactly to the lifetime of a substream, which it can be said to "represent".

Now we can put these techniques together and deduce that we will be working with a stack containing control IPs (actually their handles). So the above logic becomes much simpler. It now reads something like this:

receive incoming IP
begin cases based on IP type
  case: open bracket
    create IP for this level
    initialize counters and totals for this level
    push IP onto stack

  case: account
    pick up account info, insert into account IP

  case: transaction
    increment counter for debit or credit in IP currently at head of stack
    add amount to debit or credit total in IP currently at head of stack

  case: close bracket
    pop IP off stack
    roll counters and totals in this IP into IP currently at top of stack,
      if there is one
    output IP which was just popped off stack
end cases

Figure 9.4
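The logic of Figure 9.4 is simple enough to sketch in executable form. This is only an illustration, not the DFDM or THREADS API: IPs and control IPs are dicts, the stack is a Python list held outside the function (as a non-looper's stack would be), output IPs are just collected in a list, and for brevity I have collapsed the separate debit and credit counters into a single count and total.

```python
def update_accts(ip, stack, output):
    """Process one incoming IP, non-looper style: invoked once per IP,
    with all continuity held in `stack` (the control IPs) rather than
    in local working storage."""
    kind = ip["type"]
    if kind == "open":                      # open bracket: new control IP
        stack.append({"group": ip["name"], "count": 0, "total": 0.0})
    elif kind == "account":                 # copy identifiers into the
        stack[-1].update(branch=ip["branch"],   # control IP at head of stack
                         acct=ip["acct"])
    elif kind == "trans":                   # accumulate at head of stack
        stack[-1]["count"] += 1
        stack[-1]["total"] += ip["amount"]
    elif kind == "close":                   # close bracket: pop, roll up,
        popped = stack.pop()                # output the popped control IP
        if stack:
            stack[-1]["count"] += popped["count"]
            stack[-1]["total"] += popped["total"]
        output.append(popped)

stack, output = [], []
stream = [
    {"type": "open", "name": "branch"},
    {"type": "open", "name": "account"},
    {"type": "account", "branch": 1, "acct": 1},
    {"type": "open", "name": "date"},
    {"type": "trans", "amount": 12.82},
    {"type": "trans", "amount": 101.99},
    {"type": "close"},   # date
    {"type": "close"},   # account
    {"type": "close"},   # branch
]
for ip in stream:
    update_accts(ip, stack, output)
# output now holds one totals IP each for the date, account and
# branch groups, innermost first
```

Notice how close the code is to the pseudocode: each case is self-contained, and the roll-up at close bracket time means the values automatically propagate from date to account to branch level.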

When we say "create an IP" in the above logic, we may simply use the incoming IP if there is a characteristic type at the beginning of the group. Bear in mind, however, that the IP must have data fields in it for the totals. The most obvious technique is to create a control IP, copy fields such as identifiers across from the incoming IP (frequently the one after the open bracket) into the control IP, stack the control IP and throw away the original incoming IP. A faster technique, though, which can sometimes be used, is to arrange for the incoming data to be put into a bigger IP at the time it is read in by the application. It can then be stacked immediately, without the tedious "create new, copy, destroy old" sequence.

Now we have our control IPs, what do we use as a stack? AMPS had a stack mechanism specifically for this kind of logic, which was actually a kind of connection (one that was only attached at one end, like a cul de sac in a city). The later versions of DFDM did not provide stack mechanisms, but, since we are only dealing with one component, it was easy enough to set up an array of IP handles in the process's working storage and "push" and "pop" (the basic stack operations) by incrementing or decrementing an index over the array. However, I happen to like stacks, and we shall see in a later chapter that there are striking similarities between the way components parse their input streams and the way compilers parse their input. In both cases stacks are the natural mechanism for keeping track of nested structures. We have accordingly provided a stack mechanism in THREADS.
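The "array plus index" technique just described can be sketched as follows. The class name and string handles are purely illustrative (a Python list could of course do this directly); the point is the mechanism: a preallocated array in working storage, with push and pop implemented as moves of a single index.

```python
class HandleStack:
    """A fixed-depth stack of IP handles kept in a component's working
    storage: push and pop just move an index over a preallocated array,
    as described above for the DFDM-era technique."""
    def __init__(self, depth):
        self.handles = [None] * depth
        self.top = 0                  # index of the next free slot

    def push(self, handle):
        if self.top >= len(self.handles):
            raise OverflowError("stack full")
        self.handles[self.top] = handle
        self.top += 1

    def pop(self):
        if self.top == 0:
            raise IndexError("stack empty")
        self.top -= 1
        return self.handles[self.top]

s = HandleStack(10)
s.push("ctl-branch")     # open bracket for branch
s.push("ctl-account")    # open bracket for account
```

The depth only needs to match the maximum nesting of the substreams, which for the banking example is known in advance (branch, account, date).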

You may have noticed that Figures 9.3 and 9.4 did not have the familiar "do while receive has not reached end of data" - this is because they are written in a style which assumes that they end execution after processing each IP. In FBP an inactive process will be invoked the next time an IP arrives at any of its input ports, so this kind of component will be invoked once for each incoming IP. A result of this is that it cannot maintain continuity across multiple IPs, but this is where the stack comes in. Since the stack is outside the process's local storage, continuity can be maintained across the invocations using the stack. This style of component is called a "non-looper", as opposed to components written in the "do while receive has not reached end of data" style, which are referred to as "loopers". This is not an externally detectable attribute of a component, but just depends on when and how often the component decides to end processing - as long as there is data to be processed, it will continue being reinvoked.

You may be wondering what the advantages of non-loopers are, if any. Well, for one thing, non-loopers end execution more frequently, so IPs which have not been disposed of are detected sooner, making such errors easier to find. Also, non-loopers' local storage is only used within one invocation, so there is less opportunity for one IP's logic to interfere with another's. Finally, we shall see in the chapter on Checkpointing (Chapter 20) that there is an advantage to having processes yield control as often as possible. As always, there are pros and cons.

Obviously, there is still some application logic left to be written for UPDATE_ACCTS, but I have tried to show that the approach of holding control IPs in a stack (together with a generalized Collate) does significantly simplify the remaining piece of logic you do have to write. One of the things that also makes this logic simpler is the fact that every action is associated with the arrival of a distinct type of IP (this is even more obvious in the case of non-loopers), rather than a change in value - this is what allows one to cleave the logic of such a component into distinct self-contained cases. When talking about such logic, I often find it useful to refer to "open bracket time", or "detail time". An incoming IP triggers an action, which starts up and then finishes, readying the process for the arrival of the next IP (or end of data). In a later chapter, I will try to show that the code we still have to write has such a simple structure that we can begin to think about generating it semi-automatically from some kind of specification other than a HLL program.

By the way, as you work through this kind of logic, you may notice a characteristic flavour of FBP coding: very little of the data you will be dealing with actually resides in a process's working storage - the vast majority of it will be in IPs, very often control IPs like the ones we have just been discussing. When you think about it, this should not be that strange - in a real-life office, most of your data is in files, memos or on the computer - how much data do you have to hold in your personal short-term memory? I don't know about you, but I try to hold onto as little data as possible for as short a time as possible (after which I destroy it, pass it on, or file it - just like IPs).