VorlathDiscussion

The following ties in with a discussion of functional, procedural and OOP coding styles in http://my.opera.com/Vorlath/blog/show.dml/202787, called "Recategorizing Paradigms". --pm


Note from Vorlath:

The whole point of the first section is that machine code is executed in parallel. This is what makes hyperthreading possible: if there is a stall, some pipeline stages sit empty while the processor waits for a previous instruction to complete, and those free stages can then be used by another process instead of wasting the processing power. That's hyperthreading. I want to use the same technique on a larger scale, with components. That's what the diagrams in the third section show. It would mean faster execution because there is no longer any sync on the streams.

But a single component cannot be broken down, as I understand it, in FBP. If there are only basic instructions in the component, then fine. But I doubt every component would be so primitive as to not theoretically be decomposable. I want to resolve this issue. I think it IS possible to have even the insides of components be pipelined if the compiler and architecture dictate it necessary. This is VERY important because in the future processors will become simpler, but more numerous. I am 100% convinced that in the future, each processing core will only be able to execute the simplest of instructions. Yet, at the same time, I think that we will have hundreds and maybe thousands of processing cores. FBP will not be able to survive if you retain the notion of a basic component that is not decomposable. That is what I'm trying to address. I think BIG! and for the future.

I realise you pushed the notion of asynchronous components. While this is simple to implement using a locking mechanism (i.e. once you get it working once, you can use it for all components), it is a serious hit on performance. You can use pipelining and remove most of this contention. I want to try sequentially pipelined components (with threading to pick up the slack on any blocked I/O component, but only if absolutely necessary). The thing to remember is that on a single core, sequential execution is always preferred over threading. There are two exceptions to this: 1) a blocking process; 2) a process that is taking too long to complete.

  1. A blocking process is easily detected with an idle process: if the idle process starts to execute, you know the other process has blocked.
  2. A process taking too long can be monitored with a timer event adjusted for maximum responsiveness.
And if components are written using pure functional techniques that don't have side-effects, then these basic instructions can be pipelined as well. In fact, this has long been a feature that proponents of functional languages boast about.
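
As a rough illustration of statement-level pipelining of side-effect-free code, here is a sketch in Go, with channels standing in for the streams. (Go channels do synchronise internally, so this shows only the shape of the idea, not the zero-locking execution discussed here; the stage functions are made up for the example.)

package main

import "fmt"

// Each stage is a pure function: it reads its input stream, writes its
// output stream, and touches no shared state. Because there are no
// side-effects, all stages can safely process different items at the
// same time, like the stages of a hardware pipeline.
func stage(f func(int) int, in <-chan int) <-chan int {
    out := make(chan int)
    go func() {
        defer close(out)
        for v := range in {
            out <- f(v)
        }
    }()
    return out
}

func main() {
    src := make(chan int)
    go func() {
        defer close(src)
        for i := 1; i <= 5; i++ {
            src <- i
        }
    }()

    // Two pipelined "statements": while item n is in the square stage,
    // item n+1 can already be in the double stage.
    double := stage(func(v int) int { return v * 2 }, src)
    square := stage(func(v int) int { return v * v }, double)

    for v := range square {
        fmt.Println(v)
    }
}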

So my purpose is: 1) Reduce contention (and task switching) by implementing some form of sequential pipeline component execution when possible. 2) Use a language that has limited scoping in order to pipeline statements inside components.

Asynchronous components on a single CPU are currently a waste of processing power. There has to be a way to scale automatically across both single- and multi-core CPUs or processing stations.

I hope this explains my objective better. And I hope that my wiki entry shows my thought processes on how I came about these conclusions.


PaulMorrison's questions:

do while not end of stream
  get from input port 1 using handle 'a'
  put to output port using handle 'a'
  get from input port 2 using handle 'a'
  put to output port using handle 'a' 
enddo

The "do while" could be a couple of instructions, and the rest are all calls - on a PC that's a couple of stack instructions, and a call instruction for each function call. I fail to see why you need to move to a lower level, and if you did, the logic would probably be a lot more complex. What am I missing?


I'll repost and reply here.

To me, asynchronous events can only be created by hardware. Threads themselves are created by hardware timer events. It's anything that isn't directly controlled, anything that is free to initiate an operation at any time.

do while not end of stream
  get from input port 1 using handle 'a'   (Line 1)
  put to output port using handle 'a'      (Line 2)
  get from input port 2 using handle 'a'   (Line 3)
  put to output port using handle 'a'      (Line 4)
enddo

The "do while" could be a couple of instructions, and the rest are all calls - on a PC that's a couple of stack instructions, and a call instruction for each function call. I fail to see why you need to move to a lower level, and if you did, the logic would probably be a lot more complex. What am I missing?

We've just hit GOLD! This is a PERFECT example of where you could decompose this component. In fact, this example is so conducive to parallelism that parts of it can not only be made into a pipeline, but also made into distinct parallel execution points. The function calls do not matter for this example.

I think I should first mention that what I'm describing is on the compiler level and the implementation. I don't want the programmer to change the way he/she writes components or the way he/she thinks about it.

Here, I will put internal components/statements inside brackets, and leave the streams/data outside them. There can be two pipelines, as long as Lines 2 and 3 are started at the same time.

Port1 -> [Line 1] -> Handle A -> [Line 2] -> Output
                     Port2    -> [Line 3] -> Handle B -> [Line 4] -> Output             

A compiler can see that there is no dependency between the first two lines and the last two when it comes to the handle. So I renamed it to Handle B in the second part, just as the compiler would. Let's say we have 2 processing cores and TONS of input coming in on ports 1 & 2. In contemporary solutions, you would need processors large enough to handle the ENTIRE component. While this is fine today, in the future this will not be the case. Smaller cores with exceptional execution speed will be the norm. I believe cores of the future will be as small as molecules or atoms. Much smaller than your non-reducible component. So now look at what we can do with the above two pipelines.

Let's say there is data at all streams and that we're already mid-way through merging tons of data. We can execute Line 1 and Line 4 on two independent processing cores without any locking on the streams. Lines 2 and 3 are idle, so they're not touching the streams. Lines 1 and 4 are free to do as they please with these streams (Port1, Output and the handles). Once Lines 1 and 4 are done, Lines 2 and 3 can each be executed in parallel (one on each core). Once both have completed, restart Lines 1 and 4 and continue going back and forth until everything is processed. This is maximum processing power that you wouldn't get with the monolithic component model. While it's true that in this example each line could only process one item at a time because the output is alternated, there is a way around this limitation: the compiler could figure out that the second pipeline should output its items at every second location in the output stream. That way, each line could be placed in a loop and process its entire input.
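
Here is a minimal sketch of that back-and-forth schedule, in Go (the phase loop, the haveA/haveB flags, and the slices standing in for the ports are scaffolding invented for this sketch, not part of FBP). The WaitGroup barrier is the only synchronisation, once per phase; within a phase the two lines touch disjoint data, so the streams themselves need no locks.

package main

import (
    "fmt"
    "sync"
)

func main() {
    port1 := []int{1, 3, 5}  // input stream 1
    port2 := []int{2, 4, 6}  // input stream 2
    output := []int{}        // merged output stream
    var handleA, handleB int // the two renamed handles
    haveA, haveB := false, false

    // run executes two "lines" at the same time, one per core; the
    // WaitGroup barrier is the only synchronisation, once per phase.
    run := func(a, b func()) {
        var wg sync.WaitGroup
        wg.Add(2)
        go func() { defer wg.Done(); a() }()
        go func() { defer wg.Done(); b() }()
        wg.Wait()
    }

    for len(port1) > 0 || len(port2) > 0 || haveA || haveB {
        // Phase 1: Line 1 (Port1 -> Handle A) and Line 4 (Handle B -> Output)
        // touch disjoint data, so neither needs a lock on any stream.
        run(func() {
            if len(port1) > 0 {
                handleA, port1, haveA = port1[0], port1[1:], true
            }
        }, func() {
            if haveB {
                output, haveB = append(output, handleB), false
            }
        })
        // Phase 2: Line 2 (Handle A -> Output) and Line 3 (Port2 -> Handle B).
        run(func() {
            if haveA {
                output, haveA = append(output, handleA), false
            }
        }, func() {
            if len(port2) > 0 {
                handleB, port2, haveB = port2[0], port2[1:], true
            }
        })
    }
    fmt.Println(output) // [1 2 3 4 5 6]
}

On a single core the same two-phase schedule simply runs the four lines in sequence; nothing in the sketch depends on real parallel hardware.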

Let's look at it again. When we execute Line 1 and Line 4 at the exact same time on multiple cores, here is what they access:

Line 1    Port1, Handle A
Line 4    Handle B, Output

Notice that there is no data in common between Line 1 and Line 4. So there is no need to sync anything at all. They can access that data as they please without fear and without synchronisation.

After the above two lines have completed, here is what Line 2 and Line 3 access:

Line 2    Handle A, Output
Line 3    Port2, Handle B

Again, these two lines are truly independent of each other. No overlap in data access. By alternating execution between these two lines and the above two lines (and having each line in a loop), we can obtain maximum throughput with zero processing power wasted on task switches or synchronisation.

Maybe it's just me, but I see a future where this can be taken advantage of. If you create a monolithic component, this is no longer possible. Also, if one line/sub-component finishes before the other, then do what hardware pipelining does: let another chain of components execute in the meantime.

If this doesn't spark some kind of enthusiasm, then perhaps I'll have to recheck my priorities. But I find this kind of thing absolutely fascinating for the future of programming. While it's true that right now you don't need this, I don't think it would be wise to overlook it. And you can get SOME speed increase right now, today.

Actually, if the two threads/components are allowed to run at the same time, then there will always be blocking. I'm getting the impression that just because you don't see the synchronisation logic, you assume it isn't there. This is not so. It IS there. Your threads WILL block if they can both access the same stream at the same time. It doesn't matter how big the stream is, or that the data chunks are separate. You need to sync the stream itself to access it. I plan to remove that restriction.

The logic is indeed simple. What I am proposing is a little more complicated, but more scalable to smaller and more numerous processing cores.

Well, you have to weigh the cost of transfer vs. the cost of execution. It also matters how much data you are processing and how fast you need it. But the faster you need any particular piece of data, the more the other data has to wait. So it cancels out. I want to maximise the total throughput. While some parts may slow down if taken individually, overall the throughput should increase drastically. In fact, I wouldn't be surprised if the responsiveness went up too.

Again, I don't mean that you should manually break down the components. I want the compiler to be able to break it down and make it parallel when needed. I also want the programmer to use a language or a style that is conducive to this. If we write components like we write other software (and using those other languages), then we're just delaying the very same problems that plague those languages to a later time. Not fixing them.

My biggest worry is having other programmers, such as procedural programmers, switch over to FBP but use their sequential programming techniques when it comes to writing components. If this were to happen, I would be immensely disappointed because it would seal the fate of FBP. I want something that will still be around 100 years from now, where new compilers can analyse the code and make it more (or less) modular depending on the need. If there is any part of the system that cannot be analysed or decomposed in this manner, then that will create an obstacle for the compiler. That is the obstacle I want to remove. Where all your variables can easily be converted to streams, and all your function calls can be pipelined or made parallel outright. Once we have that, then I'll be happy. Note that what I'm talking about is enhancements!

-Vorlath


I'll repeat the comments here. The above is getting messy.

I'm talking about lines of code. There are four lines in your loop, so each line number corresponds to one of those lines. -V

Sure, the logic to process the data from input port 1 can be overlapped with that of port 2. OK, how do I explain this in a different way?

If only one item at a time is pumped through the pipeline, there is no problem. The component will process two items at once instead of only one, and sequencing will be automatic. But you can also have multiple items go through at once; it requires some modification to how you place the items in the output stream, though. Let's just keep it at one item at a time for this example.

At each step, both cores operate in parallel: not like threading, where it's simulated, but at the very same physical time. Both cores must complete before going to the next step. Just as in hyperthreading, you can start another component while waiting for the other core to complete, but for now we'll leave that out. What is important is that the input and output of each core need not be synchronised and can be freely accessed.

Step 1

 Core 1
 Item1(P1) from Port1 -> Line 1 -> Handle A

 Core 2
 Idle or Processing something else.

 Output so far:
 {empty}

Step 2

 Core 1
 Item1(P1) from Handle A -> Line 2 -> Output

 Core 2
 Item1(P2) from Port2 -> Line 3 -> Handle B

 Output so far: 
 Item1(P1)

Step 3

 Core 1
 Item2(P1) from Port1 -> Line 1 -> Handle A

 Core 2
 Item1(P2) from Handle B -> Line 4 -> Output 

 Output so far:
 Item1(P1),Item1(P2)

Repeat steps 2 and 3 until all done.
Notice that all items will be properly interleaved.

So the items will indeed come out in sequence. The thing is that each step does need some kind of synchronisation to keep track of when each core terminates. I left it at one item to make it easier to comprehend, but you can easily make it process more than one item at each step. When it comes to the output, you just have to keep track of what index you're outputting. Core 1 would output to even locations. Core 2 would output to odd locations. It's a modified stream for sure, but it's a simple counter. Nothing fancy here. The example above can easily be converted to process multiple items at a time and still be in sequence.
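
A minimal sketch of that even/odd output trick, again in Go (the port contents and sizes are made up for illustration). Each core owns its own output slots, so the only synchronisation is one wait at the end rather than a lock per item.

package main

import (
    "fmt"
    "sync"
)

func main() {
    port1 := []int{1, 3, 5, 7} // items for core 1
    port2 := []int{2, 4, 6, 8} // items for core 2

    // Pre-sized output stream: because each core owns a fixed set of
    // slots (even indices for core 1, odd for core 2), neither core
    // ever touches the other's memory and no per-item lock is needed.
    output := make([]int, len(port1)+len(port2))

    var wg sync.WaitGroup
    wg.Add(2)
    go func() { // core 1: even output locations
        defer wg.Done()
        for i, v := range port1 {
            output[2*i] = v
        }
    }()
    go func() { // core 2: odd output locations
        defer wg.Done()
        for i, v := range port2 {
            output[2*i+1] = v
        }
    }()
    wg.Wait() // the only synchronisation: once, at the end

    fmt.Println(output) // [1 2 3 4 5 6 7 8] -- properly interleaved
}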

Let's say you have millions of processors, but they can only execute simple instructions, and their memory is limited. A component will not fit on a single processor. How would you get your component to execute? I realise it's not important NOW, but I want something for the future. Yet, at the same time, I believe a lot of gains can be had today even if not absolutely necessary.

For one, no more syncing on the streams. You say that it's fast, but this is not so when processing a lot of items. If there isn't a lot of data, then FBP is not really important; a sequential program will do fine. Assuming that there IS a lot of data, there will be occasions where you will not get a lock on the stream, and this will cause a task switch. Not just one, but at least two, because control has to return to the original process so that it may complete. Even if there is no synchronisation, having a thread for each component requires task switches for no good reason, and stack space for each. My method requires but one stack for each processor, plus maybe a few extra ones in case there are components that access hardware and are waiting.

I can reduce time wasted on task switching because there will hardly be any. I can significantly reduce sync contention. And I can really simplify the compiling of code because there is zero locking code. Any locking code that does exist will be in predefined code that controls the sequence in which each component or sub-component is executed. Leaving this up to the OS is not optimal because it has no idea what order is best suited for the application.

Anyways, I'll try and get something running first. Maybe I can better show you what I mean at that time.


There are a few issues to deal with in FBP:

Let's take the above bullet points and go through how I think they could be improved.

My idea is to have a self-symmetric system, where there is only one kind of code, not two. This will make it easier for the compiler to decompose the code into parallel parts if needed. Or it can make multiple components sequential if there are only a few processing cores. This will become a requirement in the future of programming. There is plenty of talk about this in the programming community. It would be unwise to disregard this issue.

This is related to the above point. If the programmer continues to program components in a sequential manner, then these components will suffer the same fate as sequential programs. Because they are generally smaller, they are easier to maintain, but there is nothing stopping a programmer from writing large components. The size of components is completely arbitrary. While it should be easy for experienced programmers to separate software into components, this is not an enviable task for beginners. However, given a proper language that is implicitly parallel, this risk is limited, and future IDEs could offer suggestions on how to split up components. And even if there are large components, these would execute just like compound components. This is just common business sense: you need a way to transition other programmers.

Between the choice of locking on every item in the stream or locking once to signal termination of the component, which would you choose? Right, locking once is always better than locking multiple times. Locking is extremely expensive.
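
A small illustrative sketch of the difference, in Go (the stream type here is invented for the example, not an FBP artifact): moving n items in groups of k takes n/k lock operations instead of n, and locking only once at component termination is the limiting case.

package main

import (
    "fmt"
    "sync"
)

// A shared stream guarded by one mutex. The point of the sketch is
// how often that mutex is taken, not what the stream holds.
type stream struct {
    mu    sync.Mutex
    items []int
}

// Per-item locking: one lock/unlock for every item moved.
func (s *stream) putOne(v int) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.items = append(s.items, v)
}

// Batched locking: one lock/unlock per group of items. For n items in
// groups of k, this is n/k lock operations instead of n.
func (s *stream) putBatch(batch []int) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.items = append(s.items, batch...)
}

func main() {
    s := &stream{}
    for i := 0; i < 4; i++ {
        s.putOne(i) // 4 lock operations for 4 items
    }
    s.putBatch([]int{4, 5, 6, 7}) // 1 lock operation for 4 items
    fmt.Println(s.items)
}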

On a single CPU (single core), there is never a need for threads. This will be an unpopular view, but that doesn't make the statement any less true. I predict that the failure to understand this will be the #1 obstacle to implementing proper parallel code on multiple CPUs and cores as they become more prevalent.

On a single CPU, you should never start another thread if there is already one that is currently active. The problem is that sequential programming leads to synchronous (and blocking) code. In FBP, there should be no need for this. Why you would want to keep this notion of threads is beyond me. Threads were invented for two reasons.

  1. To let other code run if something is waiting on hardware.
  2. To let other code run if something is hogging the CPU.

Waiting on asynchronous hardware is a throwback to the early days of programming, where asynchronous events were converted into synchronous calls at the cost of wasted CPU time, in exchange for easier sequential programming techniques. In order not to waste this CPU time, threads were used. There is no need to ever wait on anything or keep using these archaic techniques. You can be notified when something happens and resume the operation at that time. No need to stall the entire application or computer.

The second one is related to bad programming practices, or simply to having to process a lot of data. The thing here is that this cannot happen with my improvements. If every component can be split up into parallel components, then there is no risk that any one part hogs the CPU. Every part of the entire system will get its chance to execute.

So no threading is needed. This brings even more advantages: no unnecessary locking between tasks that run on the same CPU or core. When I say task, I mean that the application itself controls what parts of your system are executed. So if each task is executed to completion, they are run one after the other just as in a regular task-switching system in an OS. But because they are allowed to execute to completion, the next task need not lock anything, unless it uses something that could be used on another core at that precise time (which I have already explained how to avoid in my article). You can write your own task-switching code; it is very simple to implement, as sketched below. In this case, you don't even need a separate stack, because each task can use the same one. Realise that tasks are executed out of order to achieve the maximum possible throughput. Tasks are reduced or increased in size according to the compiler's determination of where you will be running your software. Without this ability to scale up or down, you are severely limited in what kind of systems your code can run on.
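
As a sketch of how simple such application-level task switching can be (Go used purely for illustration; the scheduler and task names are invented for the example): a run-to-completion task queue drained by one loop, everything on one stack, no locks between tasks.

package main

import "fmt"

// task is a unit of work that always runs to completion; it never
// blocks and is never preempted.
type task func(s *scheduler)

// scheduler is a minimal application-level task switcher: a FIFO of
// tasks drained by a single loop. Because only one task runs at a
// time and each runs to completion, tasks on this core share data
// freely with no locks, and everything uses the one ordinary stack.
type scheduler struct {
    queue []task
}

func (s *scheduler) spawn(t task) { s.queue = append(s.queue, t) }

func (s *scheduler) run() {
    for len(s.queue) > 0 {
        t := s.queue[0]
        s.queue = s.queue[1:]
        t(s) // run to completion; a task may spawn follow-up tasks
    }
}

func main() {
    s := &scheduler{}
    s.spawn(func(s *scheduler) {
        fmt.Println("read item")
        s.spawn(func(s *scheduler) { fmt.Println("process item") })
    })
    s.spawn(func(s *scheduler) { fmt.Println("housekeeping") })
    s.run()
}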

I talked about this in the previous bullet point. A proper functional language will be needed for this: a programming style with the least amount of side-effects. The ability to scale up or down to different parallel architectures will result in the most efficient system.

Advantages:

  1. A consistent and self-symmetric system. If you ever need to make several components into one, or split one component into several parts, there is no longer any difficulty in doing so, whether by the compiler or by programmers.
  2. Component sizes are no longer arbitrary, nor are they left up to the programmer to decide.
  3. Locking once per group of tasks instead of locking on every item in the stream: an improvement from O(n) to O(log n) lock operations, almost two orders of magnitude.
  4. No more task switches, significantly improving execution speed. They are eliminated entirely; you can't get better than that.
  5. Only one stack per CPU or core. Much less storage space needed.
  6. Automatically scalable to different architectures. No longer limited by maximum component size.
  7. Easier compilation of individual tasks as there is no longer any locking code.

These are some of the more obvious advantages. In fact, these are the advantages that I thought FBP would bring. Anyways, I hope that I've explained myself better. While efficiency (memory and CPU usage) may not be necessary if you have ample processing power and memory, it should always be desirable. I thought every programmer believed at least in that basic principle. That's why your comments seem really confusing to me: they seem to go against this notion. Then again, perhaps I just don't understand.

-Vorlath


StephenBrewell's questions:

I've been reading this discussion with interest and have just a few observations/comments to make. Forgive me if these have already been made or answered, as I am just coming up to speed in this area. I have done a lot of work in the area of MOM-based distributed component architectures - areas that on the surface have synergies with the ideas behind FBP. I am quite happy to stand corrected on any misguided statements or flawed logic. That said:

I can see Vorlath's reasoning behind functional decomposition of FBP components - to facilitate the distribution of these amongst potentially large numbers of simple processors - and I can see how the pipelining pattern can offer "real" parallel processing of seemingly sequential programs based on these functions.

However, pipelining only works due to the existence of a system clock which triggers the flow from one stage to the next. This works fine on a single-CPU system or on multi-CPU systems which share a system clock; indeed, it will in theory support the many thousands of simple "core" scenarios expected in the future, so long as those pipelined "cores" share the same system clock. This would support massively parallel computers (i.e. single nodes) well, but not necessarily distributed multi-node systems and all the well-known scalability and availability benefits they provide. Surely future architectures need to support heavily distributed computing?

Is pipelining introducing a synchronization overhead/possible single point of failure, counteracting the benefits of the I/O synchronization it removes? In current distributed processing models there is no need to synchronize CPUs, but there is a need to synchronize the I/O (e.g. via MOM); isn't the reverse true in your proposition: pipelined "cores" which process parallel dependent functions will require synchronization.

Also, pipelines work because there is a fixed set of stages that each function goes through - a seven-stage pipeline can theoretically support seven instructions in parallel. In your example above, the two independent activities (yet dependent, as together they provide the merge functionality) contain the same number of stages, and hence order is preserved. What if HANDLE A was a sub-network that contained 3 stages and HANDLE B was a sub-network that contained 5 stages? It would seem a restriction to me that any operations that could benefit from pipelining would need the same number of stages?

Anyhow, that's my thoughts for now on this subject. I am also curious how the ideas of FBP fit into the current vogue Enterprise Service Bus architectures, BPEL and tools offered by BEA Weblogic, IBM Websphere, Microsoft Biztalk and Progress Sonic MQ for graphically linking components into business process flows and specifying the linkages/dataflow between them.

Stephen

Hi Stephen! Thanks for your input - I had the feeling that a single clock might be underlying some of Vorlath's comments... but then I saw the phrase, "When it comes to the output, you just have to keep track of what index you're outputting. Core 1 would output to even locations. Core 2 would output to odd locations." As I said above, this seems to be a very ad hoc solution for the problem I proposed. I also totally agree with you that, in a highly distributed environment, we cannot count on all clocks being in step. In fact, I believe there is current work underway on asynchronous computer architectures - the "Bucket Brigade" approach - so we may be moving away from this assumption at the lowest levels.

Thanks also for the above list - some are new to me, but IMO graphical notations were inevitable, once the technology got sophisticated enough. --pm


StephenBrewell wrote:

However, pipelining only works due to the existence of a system clock which triggers the flow from one stage to the next. This works fine on a single-CPU system or on multi-CPU systems which share a system clock; indeed, it will in theory support the many thousands of simple "core" scenarios expected in the future, so long as those pipelined "cores" share the same system clock. This would support massively parallel computers (i.e. single nodes) well, but not necessarily distributed multi-node systems and all the well-known scalability and availability benefits they provide. Surely future architectures need to support heavily distributed computing?

Yes, that's why I want the compiler to be able to partition code that is well suited for massively parallel computers, and larger sections of code that would be well suited for distributed computing (or allow these selections to be made manually). I've written on this, but basically I want to make it possible to go up or down in scalability without the need to explicitly state the differences in the actual code. Rather, you would specify in configuration how the code should be executed if you want certain parts to run on specific machines.

Also, you don't need the clock, although there is somewhat of a simulation of it. Processes don't end at the same time, and you don't necessarily have the same number of stages. If there are other pipelines, then you can use something similar to hyperthreading, where other pipelines execute in free cores not used by the main pipeline. If there aren't any other pipelines, then there's nothing to discuss: you have to wait for each part to finish anyhow, whether using FBP or any lower-level modifications.

Is pipelining introducing a synchronization overhead/possible single point of failure, counteracting the benefits of the I/O synchronization it removes? In current distributed processing models there is no need to synchronize CPUs, but there is a need to synchronize the I/O (e.g. via MOM); isn't the reverse true in your proposition: pipelined "cores" which process parallel dependent functions will require synchronization.

There definitely is a counteracting effect. But instead of a sync on every I/O, it's once per process. And even then, it's possible to reduce it to much less than that: it can be once per X processes. My discussion here isn't meant for distributed applications, but to enhance FBP to also support massively parallel and vectored machines. So I think that may answer a lot of your questions. Syncing multiple machines would involve way too much latency IMO. But on single machines, with DMA and other transfer technologies, these transfers can be thought of as parallel "mini-components" whose only function is data transfer. So latency can be worked into the algorithm with a certain amount of reliability. Not so in distributed computing. I'm not trying to fix or enhance that part. It's what I liked about FBP in the first place.

Also, pipelines work because there is a fixed set of stages that each function goes through - a seven-stage pipeline can theoretically support seven instructions in parallel. In your example above, the two independent activities (yet dependent, as together they provide the merge functionality) contain the same number of stages, and hence order is preserved. What if HANDLE A was a sub-network that contained 3 stages and HANDLE B was a sub-network that contained 5 stages? It would seem a restriction to me that any operations that could benefit from pipelining would need the same number of stages?

I have thought of sub-components, but I didn't want to unnecessarily confuse the situation. What you do is (not sure of the technical term here) unwrap the sub-components of HANDLE B and HANDLE A into full-fledged component statements. In the event that there is contention, you simply execute other pipelines or put in empty (idle) components. Sure, it takes management, but the compiler can figure it out. Besides, your OS task switcher would take much more time and space. The user never needs to deal with this directly.

Here is your proposed network after unwrapping (based on the original dual pipeline).

Port1 -> [Line 1] -> Handle A/IN -> [Stage 1A] -> X1 -> [Stage 2A] -> X2 -> [Stage 3A] -> Handle A/OUT -> [Line 2] -> Output
                     Port2    -> [Line 3] -> Handle B/IN -> [Stage 1B] -> Y1 -> [Stage 2B] -> Y2 -> [Stage 3B] -> Y3 -> [Stage 4B] -> Y4 -> [Stage 5B] -> Handle B/OUT -> [Line 4] -> Output 

Setup

Line 3    Port2, Handle B/IN

Stage 1

Line 1    Port1, Handle A/IN
Stage 2A  X1, X2
Line 2    Handle A/OUT, Output

Stage 1B  Handle B/IN, Y1
Stage 3B  Y2, Y3
Stage 5B  Y4, Handle B/OUT

Stage 2

Stage 1A  Handle A/IN, X1
Stage 3A  X2, Handle A/OUT

Line 3    Port2, Handle B/IN
Stage 2B  Y1, Y2
Stage 4B  Y3, Y4
Line 4    Handle B/OUT, Output

Note that each stage doesn't use any overlapping resource. I really like this because the compiler can very easily analyse resource usage. Any overlap can be separated. The setup part is because there are more statements for the second output than the first. There are two extra statements, but only one extra is needed for setup as the second one can be executed during the first output (both outputs are not done at the same time). So stage 1 and 2 would alternate. And each of the 6 statements can be executed in parallel. If you have less than 6 cores in this example, the ones that finish first can execute the remaining statements. Single cores would just execute all stages in sequence without bothering with threads at all. Zero sync.

That is your final network. So if there are differing amounts of statements, then you only need to make sure that the longer one is pre-fed with enough data.

So in the final binary version, there is only one pipeline, or multiple chains of pipelines, but they would all be at the same level. The hierarchy would only be seen in the code itself, to help humans maintain the system. For dynamic components, of course, you will need well-defined boundaries. But again, that is more like distributed computing and I didn't want to confuse the issue.
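
To make the alternation concrete, here is a minimal sketch in Go of executing that two-stage schedule (the statement names come from the listing above; the print statement is a placeholder for the real data-moving work, and the whole harness is invented for illustration). Within a stage no two statements share a resource, so they can all be launched in parallel, and the only synchronisation is the barrier between stages.

package main

import (
    "fmt"
    "sync"
)

// stmt is one statement of the unwrapped network. Here a statement
// just reports itself; in reality it would move one item between the
// buffers named in the schedule (Port1, X1, Handle A/IN, ...).
type stmt string

// The compiler-produced schedule: within a stage no two statements
// share a resource, so all of them may run in parallel with no locks.
var schedule = [][]stmt{
    {"Line 1", "Stage 2A", "Line 2", "Stage 1B", "Stage 3B", "Stage 5B"}, // Stage 1
    {"Stage 1A", "Stage 3A", "Line 3", "Stage 2B", "Stage 4B", "Line 4"}, // Stage 2
}

func runStage(stage []stmt) {
    var wg sync.WaitGroup
    for _, s := range stage {
        wg.Add(1)
        go func(s stmt) { // one statement per free core
            defer wg.Done()
            fmt.Println("executing", s)
        }(s)
    }
    wg.Wait() // barrier: the whole stage completes before alternating
}

func main() {
    for cycle := 0; cycle < 2; cycle++ { // alternate Stage 1 and Stage 2
        for _, stage := range schedule {
            runStage(stage)
        }
    }
    // On a single core the same schedule degenerates to running every
    // statement in sequence, with zero locking, as described above.
}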

Anyhow, that's my thoughts for now on this subject. I am also curious how the ideas of FBP fit into the current vogue Enterprise Service Bus architectures, BPEL and tools offered by BEA Weblogic, IBM Websphere, Microsoft Biztalk and Progress Sonic MQ for graphically linking components into business process flows and specifying the linkages/dataflow between them.

Stephen

I saw a presentation on BPEL once. I couldn't understand what it was trying to do or what problem it was trying to solve. It was quite alien to me. Then again, I'm always lost if I don't see the foundation work first. They seemed to jump right into it even though it was an introductory presentation.

-Vorlath

