OO Batch Model and optimised Java for batch?
Hi All,
I'm looking to see if there is any literature on OO models for batch processing and on optimising batch Java.
Thoughts & comments welcome.........
I have an existing batch process running on a mainframe which is very successful. We would like to leverage this by building a similar batch process to run 'anywhere' so likely options are Java/Unix.
There are many patterns/models etc. for OO-based GUI / interactive processes, but very few (that I have found) for batch.
I have worked mainly with mainframe batch and online applications, and come with the baggage that any activity that can be processed in batch should be, to avoid overloading the online container (CICS region, web server, etc.).
I believe that this continues to be true, and that the particular data we are processing also benefits from the efficiencies of batching the data together to store eventually on tape.
Having found no literature (though I doubt that none exists), it seems to me that the problem is the same, so the solution is probably similar too.
In the procedural solution, a Jackson design (or similar) would have been drawn up, which would then be reflected in the procedures built into the code.
I expect that if classes were defined instead of procedures, certainly at a higher level, then the design would still be OK.
(So at the higher level you have a 'main' class, which instantiates a 'read' IO object, a processing object which handles the actual processing activity, and a 'write' IO object.)
The level to which one would combine procedures together, or further split them out, would then be the main point of discussion.
(However, I am open to the above suggestion being completely wrong.)
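To make the read / process / write split above concrete, here is a minimal sketch. All names (RecordReader, RecordProcessor, RecordWriter, BatchJob) are made up for illustration and records are plain strings; it is one possible shape for the design, not an established batch pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical interfaces; none of these come from an existing framework.
interface RecordReader { String next(); }                 // returns null when input is exhausted
interface RecordProcessor { String process(String record); }
interface RecordWriter { void write(String record); }

// The 'main' class: holds the three collaborators and drives the loop.
final class BatchJob {
    private final RecordReader reader;
    private final RecordProcessor processor;
    private final RecordWriter writer;

    BatchJob(RecordReader reader, RecordProcessor processor, RecordWriter writer) {
        this.reader = reader;
        this.processor = processor;
        this.writer = writer;
    }

    // Reads until the reader is exhausted; returns the number of records handled.
    int run() {
        int count = 0;
        String record;
        while ((record = reader.next()) != null) {
            writer.write(processor.process(record));
            count++;
        }
        return count;
    }
}

final class BatchJobDemo {
    public static void main(String[] args) {
        // In-memory stand-ins for the real input and output.
        Iterator<String> input = Arrays.asList("a", "b", "c").iterator();
        List<String> output = new ArrayList<>();
        BatchJob job = new BatchJob(
                () -> input.hasNext() ? input.next() : null,
                String::toUpperCase,
                output::add);
        System.out.println(job.run() + " records: " + output);
    }
}
```

How far to split the processor into smaller classes is then the same discussion as how far to split the procedures in the procedural design.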
Then there is efficient configuration when processing........
When running on the mainframe, the code is loaded once and the memory for all the working storage structures is created. When actually processing there is no instantiating of classes, running of the garbage collector, etc. I re-use the same memory for each new record read in / processed / written out, and all the code is normally loaded once when first called and the same code is re-used until all records are processed.
Is there any way that I can replicate this within Java, either in its own JVM or running in a container such as WebSphere? When processing the volume of data that we do (20 million db entries + 40GB of document data avg) then anything not optimised is costing money and available processing time.
[2536 byte] By [klanky68a] at [2007-10-3 9:53:45]

When you declare an object, it will live (at least) as long as you have a strong reference to it. If you declare a structure in your main(), like
public static void main(String[] args) {
    // created once, outside the loop, so both objects are reused
    Processor p = new Processor();
    DataStructure data = new DataStructure();
    for (int i = 0; i < 10; i++) {
        //... load the next record into data
        p.process(data);
        // or even, if process is a static method:
        // Processor.process(data);
    }
}
Both the processor and data objects will live until main() returns, as long as the variables are not set to null. Had they been created with new inside the loop, a fresh pair would be allocated on every iteration and the old ones left for the garbage collector. Declared outside the loop, you can reuse the data variable by giving it new contents on each iteration, by invoking data.set(myFields) or something like that.
Analyse your data structure to see when and how you can reuse the same objects.
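As a rough illustration of that kind of reuse, here is a sketch of a mutable record backed by a buffer that is allocated once, a bit like working storage. ReusableRecord and its methods are invented for this example, and the checksum is only a placeholder for real processing. Reusing objects like this reduces allocation; it does not switch off the garbage collector, so whether it pays off should be measured.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical mutable record: one instance is reused for every input record.
final class ReusableRecord {
    private final byte[] buffer;   // allocated once, sized for the largest record
    private int length;

    ReusableRecord(int capacity) {
        buffer = new byte[capacity];
    }

    // Overwrites the previous contents instead of allocating a new object.
    void set(byte[] source, int len) {
        if (len > buffer.length) {
            throw new IllegalArgumentException("record too long: " + len);
        }
        System.arraycopy(source, 0, buffer, 0, len);
        length = len;
    }

    int length() {
        return length;
    }

    // Placeholder processing done in place: sums the bytes of the current record.
    long checksum() {
        long sum = 0;
        for (int i = 0; i < length; i++) {
            sum += buffer[i] & 0xFF;
        }
        return sum;
    }
}

final class ReuseDemo {
    public static void main(String[] args) {
        byte[][] input = {
            "first".getBytes(StandardCharsets.US_ASCII),
            "second".getBytes(StandardCharsets.US_ASCII)
        };
        ReusableRecord record = new ReusableRecord(64); // created once, before the loop
        for (byte[] raw : input) {
            record.set(raw, raw.length);                 // same object, new contents
            System.out.println(record.length() + " bytes, checksum " + record.checksum());
        }
    }
}
```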
> I have worked mainly with mainframe batch and
> online applications, and come with the baggage that
> any activity that can be processed in batch should be,
> to avoid overloading the online container (CICS
> region, web server, etc.).
I have worked for a few years with batch-style Java applications and I am currently writing Java to run on IBM midrange hardware.
I disagree with your premise here. Excessive batching, IMO, is an anti-pattern. It leads to uneven usage of CPU time and leaves lots of idle periods. In fact, I've seen batch-style programming be the cause of many problems, including missed SLAs and an inability to handle high volumes.
You do not have to batch to avoid overloading the system. One simple alternative is to use a messaging system (e.g. any JMS container) to queue up tasks. Listening applications process these messages as fast as they can. This means that during high-volume time periods, the queues fill up and when volumes go down, the machine still processes at maximum capacity until backlogs are cleared.
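The queueing idea can be sketched without a messaging product. The example below uses an in-process java.util.concurrent.BlockingQueue as a stand-in for a JMS queue, so it shows the shape of the approach rather than real JMS code. QueueWorker and the STOP marker are made-up names, and the counter is a placeholder for the actual work.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// In-process stand-in for a listener on a message queue (not real JMS).
final class QueueWorker implements Runnable {
    static final String STOP = "__STOP__";          // hypothetical shutdown marker
    private final BlockingQueue<String> queue;
    private final AtomicInteger processed = new AtomicInteger();

    QueueWorker(BlockingQueue<String> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        try {
            while (true) {
                String task = queue.take();          // blocks while the queue is empty
                if (STOP.equals(task)) {
                    return;
                }
                processed.incrementAndGet();         // placeholder for the real work
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();      // preserve the interrupt and exit
        }
    }

    int processedCount() {
        return processed.get();
    }
}

final class QueueDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        QueueWorker worker = new QueueWorker(queue);
        Thread thread = new Thread(worker);
        thread.start();
        for (int i = 0; i < 1000; i++) {
            queue.put("task-" + i);                  // a burst of arrivals builds a backlog
        }
        queue.put(QueueWorker.STOP);
        thread.join();                               // worker drains the backlog, then stops
        System.out.println("processed " + worker.processedCount());
    }
}
```

The worker never waits for a schedule: it is idle only when the queue is empty, and a burst simply lengthens the queue until it catches up.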
> I believe that this continues to be true, and
> that the particular data we are processing also
> benefits from the efficiencies of batching the data
> together to store eventually on tape.
If there is real benefit to processing the records together at one time then batching may be warranted for that specific case but as a general rule, it is a bad approach, in my experience.
> Is there any way that I can replicate this within
> Java, either in its own JVM or running in a container
> such as WebSphere? When processing the volume of data
> that we do (20 million db entries + 40GB of document
> data avg) then anything not optimised is costing
> money and available processing time.
How will batching solve this? Batching means you run the application less often. I often come into work and see messages about how 'batch is still running' and we had tons of excess capacity during the previous day. Why do you think this will improve things?
>
> I disagree with your premise here. Excessive
> batching, IMO, is an anti-pattern. It leads to uneven
> usage of CPU time and leaves lots of idle periods. In
> fact, I've seen batch-style programming be the
> cause of many problems, including missed SLAs and
> an inability to handle high volumes.
Interesting thought.
I suspect that batching is underused though rather than overused. Several reasons for that.
1. Newer programmers don't even think about that.
2. Older applications, which cannot be changed, used batch processing. It might not be ideal to continue that usage, but changing it would require re-architecting the entire system, and that is unlikely to be cost effective. The usage here is required because of that.
That of course is in general. There could certainly be sectors (banking comes to mind) where it is overused regularly on newer apps and newer business models simply because it looks like everything else.
I know that I cringe every time I see a question here that suggests that someone is going to move their entire database to a Java app, process it, and then send it back. I certainly wouldn't want to encourage that as the norm.
>> There could certainly be sectors (banking comes to mind) where it is overused regularly on newer apps
Banking relies very heavily on batch processing, in part because banks don't do a lot of it themselves; they outsource it, and because of the sensitive nature and the sheer volume of the data, the preferred delivery medium is courier rather than electronic means.
On top of that, much of said banking data is delivered to the bank as a batch. Up until very recently, EFT transactions consisted of a real-time authorization cycle followed by a (typically) end-of-day batch settlement.
> I suspect that batching is underused though rather
> than overused.
Can you elaborate on that? What kind of conditions would you advocate batching for? I can think of only a few: there is data that must be processed together, calculations run over specific time periods, or running larger datasets causes a significant (e.g. exponential) reduction in resource usage.
From what I can tell, most of the reason for batching is like what the OP proposes: that by batching you can save on overhead like context switching and resource allocation. It may make perfect sense in the mainframe world.
I have some experience working with batch java applications running in Unix. And I can tell you that they did not improve anything. They were actually the source of many of our issues. They added arbitrary time lags during times of low volumes, sometimes adding 30 minutes or more to the processing of a transaction as it waited for the next scheduled batch. They also made our backlogs worse in times of high volumes because the incoming data flow was uneven; we would often get big batches of data from partner systems (more batching, gotta love it) that hit us when our batch process was sleeping. 5, 10, 15 minutes would pass where the server ran at 10% capacity while huge backlogs were piling up. It didn't make anything better. It was just causing idling.
It just seems to me that in Java with all the nice threading we have access to, the server should never be idle and if you cannot handle your volume you are better off adding more servers, not attempting to batch things.
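As a sketch of what using the threading can look like, the example below hands each transaction to a fixed-size thread pool as soon as it is available, instead of holding it for a scheduled run. ParallelProcessor and handle are invented names, and handle just returns the string length as a placeholder for real work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelProcessor {
    // Placeholder for the real per-transaction work (hypothetical).
    static int handle(String transaction) {
        return transaction.length();
    }

    // Submits every transaction to a fixed pool and sums the individual results.
    static int processAll(List<String> transactions, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (String tx : transactions) {
                Callable<Integer> task = () -> handle(tx);
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();               // waits for that task to finish
            }
            return total;
        } finally {
            pool.shutdown();                         // no new tasks; workers exit when done
        }
    }

    public static void main(String[] args) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println(processAll(Arrays.asList("a", "bb", "ccc"), threads));
    }
}
```

Sizing the pool to the number of processors is only a starting assumption; work that waits on a database or the network may want a different size.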
> > I suspect that batching is underused though rather
> > than overused.
>
> Can you elaborate on that? What kind of conditions
> would you advocate batching for?
Running daily, monthly, etc reports. Or something that feeds those.
>
> I have some experience working with batch java
> applications running in Unix. And I can tell you
> that they did not improve anything.
I suspect I would agree with that. I am not advocating that the batching be done in Java. Just that an 'incremental' process that requires moving data, versus a 'batch' process that doesn't, isn't something that I would normally consider a good idea.
> They were
> actually the source of many of our issues. They
> added arbitrary time lags during times of low volumes,
> sometimes adding 30 minutes or more to the processing
> of a transaction as it waited for the next scheduled
> batch. They also made our backlogs worse in times of
> high volumes because the incoming data flow was
> uneven; we would often get big batches of data from
> partner systems (more batching, gotta love it) that
> hit us when our batch process was sleeping. 5, 10,
> 15 minutes would pass where the server ran at 10%
> capacity while huge backlogs were piling up. It
> didn't make anything better. It was just causing
> idling.
>
What were the timeliness requirements for the processing? Did it need to be completed by 2am? Or could it really have just been completed on demand?
> It just seems to me that in Java with all the nice
> threading we have access to, the server should never
> be idle and if you cannot handle your volume you are
> better off adding more servers, not attempting to
> batch things.
I have created applications that were intended to run 'batch' jobs which could be spread across servers. Those particular processes had to finish within a very narrow time span as well - about two hours as I recall. There was an incremental as well as batch functionality that needed to be run for this. The batch functionality ran on the database. The incremental took the batched results and handled the incremental part.
Although management was never willing to dedicate more than one server to the processing so I guess it wasn't that important to them.
I have seen apps that claimed they were 'fast' because they did all of the incremental processing outside of the database. The design required moving, literally, the entire database over the network to other servers which would then process it. Processing it in the database would have taken orders of magnitude less time. And that was time sensitive data. I can't remember if that app allowed for multiple boxes to do the processing. I do know that the people working on it could never figure out the bottleneck (it was scaling to something like 12 hours a day which was not acceptable.)
> > > I suspect that batching is underused though rather
> > > than overused.
> >
> > Can you elaborate on that? What kind of conditions
> > would you advocate batching for?
>
> Running daily, monthly, etc reports. Or something
> that feeds those.
I don't disagree, I just don't even see this as 'batching'. Batching to me is when you take something that could be done incrementally and purposely do it in large groups at set times or time periods. If you have a daily report and you do it daily, you're just taking the most obvious approach. It might not even be the most efficient.
> >
> > I have some experience working with batch java
> > applications running in Unix. And I can tell you
> > that they did not improve anything.
>
> I suspect I would agree with that. I am not
> advocating that the batching be done in Java. Just
> that an 'incremental' process that requires
> moving data, versus a 'batch' process that doesn't,
> isn't something that I would normally consider a good
> idea.
I think we are thinking about different things. I'm really just talking about incremental or real-time vs. batching.
> > They were
> > actually the source of many of our issues. They
> > added arbitrary time lags during times of low volumes,
> > sometimes adding 30 minutes or more to the processing
> > of a transaction as it waited for the next scheduled
> > batch. They also made our backlogs worse in times of
> > high volumes because the incoming data flow was
> > uneven; we would often get big batches of data from
> > partner systems (more batching, gotta love it) that
> > hit us when our batch process was sleeping. 5, 10,
> > 15 minutes would pass where the server ran at 10%
> > capacity while huge backlogs were piling up. It
> > didn't make anything better. It was just causing
> > idling.
> >
>
> What were the timeliness requirements for the
> processing? Did it need to be completed by 2am?
> Or could it really have just been
> completed on demand?
It was B2B transactions, so ASAP was the time requirement. I guess the upper limit was 6 hours or so. But batching didn't really decrease the processing time per transaction anyway, and the server was never dedicated to the batch or anything, so there were still context switches.
> > It just seems to me that in Java with all the nice
> > threading we have access to, the server should never
> > be idle and if you cannot handle your volume you are
> > better off adding more servers, not attempting to
> > batch things.
>
> I have created applications that were intended to run
> 'batch' jobs which could be spread across servers.
> Those particular processes had to finish within a
> very narrow time span as well - about two hours as I
> recall. There was an incremental as well as batch
> functionality that needed to be run for this. The
> batch functionality ran on the database. The
> incremental took the batched results and handled the
> incremental part.
>
> Although management was never willing to dedicate
> more than one server to the processing so I guess it
> wasn't that important to them.
>
> I have seen apps that claimed they were 'fast'
> because they did all of the incremental processing
> outside of the database. The design required moving,
> literally, the entire database over the network to
> other servers which would then process it.
> Processing it in the database would have taken
> orders of magnitude less time. And that was time sensitive
> data. I can't remember if that app allowed for
> multiple boxes to do the processing. I do know that
> the people working on it could never figure out the
> bottleneck (it was scaling to something like 12
> hours a day which was not acceptable.)
I guess I don't see doing it on the DB as implying batching. We use triggers to drive processes in Java, COBOL, whatever.
> I think we are thinking about different things.
I am certain of it.
So then I don't disagree with what you are saying.
I have only seen that pattern (the one that would fit your definition) once, and in that case it was a time-to-market decision and it was replaced in a follow-on update.
I'm also looking for info on this topic and have found the following: http://www.devx.com/Java/Article/20791/0/page/1
Just curious, did you end up finding any other articles, papers, etc.?