ValueListHandler, Large Results and Clustering

Has anybody got experience of using the ValueListHanlder pattern with a session facade and potentially very large query results, e.g. millions of results (even when filtered)?

How did this solution scale with many users each with a stateful session bean containing all of the results? How did state replication over a cluster scale? Are there any better solutions you have implemented?

Any experience/tips would be much appreciated.

Duncan Eley

[469 byte] By [djeleya] at [2007-10-2 13:29:07]
# 1

Ah, ValueListHandler, a pattern whose sole existence is due to the limitations of entity beans. Ah, the old painful days of EJB. (I digress)

Yes, there are several solutions. Do you need millions of rows? There are a few ways to get around this, depending on your requirements:

> If you are storing all those rows to perform a series of calculations, perform the calculations 'close' to the ResultSet itself, meaning read each row and update your calculations accordingly. You should then simply have to return the calculation results. This would typically be done during a batch process run or a report.

> If you are only displaying, say, a hundred at a time, implement pagination. This would be like in Google where you see 1 ... n for however many pages of data there are. Rather than returning a million rows, SELECT the count first and then SELECT however many rows are appropriate for a page. You can use ROWNUM (for Oracle) or LIMIT (for ANSI compliant RDBMS) to 'page' the results returned by the database.

> If all else fails, and this is a very rare requirement that literally millions of rows must be sent to either the app server or the client, then store the results temporarily in the file system of the app server. This is a last resort. I would be shocked to find real, valid business requirements to actually hold onto millions of rows.

Hope that stimulates a few ideas. Why do you have millions of rows? (BTW, regarding state replication, this would make a horrendous situation that much worse; it would in all likelihood gum up your network and cause all your machines to run out of memory soon).

- Saish

Saisha at 2007-7-13 11:12:18 > top of Java-index,Other Topics,Patterns & OO Design...
# 2

> Ah, ValueListHandler, a pattern whose sole existence

> is due to the limitations of entity beans. Ah, the

> old painful days of EJB. (I digress)

>

> Yes, there are several solutions. Do you need

> millions of rows? There are a few ways to get around

> this, depending on your requirements:

Unfortunatley, the current implementation of the system could result in millions of rows, paged of course, being delivered to the end user. I am yet to discuss how useful this could be to the end user - it is quite possibly useless but that's for our users to decide.

> > If you are storing all those rows to perform a

> series of calculations, perform the calculations

> 'close' to the ResultSet itself, meaning read each

> row and update your calculations accordingly. You

> should then simply have to return the calculation

> results. This would typically be done during a batch

> process run or a report.

>

> > If you are only displaying, say, a hundred at a

> time, implement pagination. This would be like in

> Google where you see 1 ... n for however many pages

> of data there are. Rather than returning a million

> rows, SELECT the count first and then SELECT however

> many rows are appropriate for a page. You can use

> ROWNUM (for Oracle) or LIMIT (for ANSI compliant

> RDBMS) to 'page' the results returned by the

> database.

This approach would require two queries to begin with (count and first page) then a query for each page. What worries me about this approach is that if the query consists of multiple joins on tables with millions of rows, the queries can be quite slow. And having used this technique once before on a complex query with GROUP BY, ORDER BY and HAVING, using LIMIT was not much quicker than not using LIMIT (in MySQL 4.0).

> > If all else fails, and this is a very rare

> requirement that literally millions of rows

> must be sent to either the app server or the

> client, then store the results temporarily in the

> file system of the app server. This is a last

> resort. I would be shocked to find real, valid

> business requirements to actually hold onto millions

> of rows.

I agree this would be a last resort: would not work in a cluster, clean up issues etc. I've seen one solution where results were stored back in the database as a BLOB!? See: http://wldj.sys-con.com/read/45563.htm

> Hope that stimulates a few ideas. Why do you have

> millions of rows? (BTW, regarding state replication,

> this would make a horrendous situation that much

> worse; it would in all likelihood gum up your network

> and cause all your machines to run out of memory

> soon).

Thanks for your input. If the requirements cannot change then I guess at the moment I'll have to compare the 'one query, page through results in stateful session bean' approach with the 'multiple but limited queries approach'.

Has anybody already compared these two approaches? What do people think of the 'results stored in the database' approach?

> - Saish

Duncan Eley

djeleya at 2007-7-13 11:12:18 > top of Java-index,Other Topics,Patterns & OO Design...
# 3

> > Ah, ValueListHandler, a pattern whose sole

> existence

> > is due to the limitations of entity beans. Ah,

> the

> > old painful days of EJB. (I digress)

> >

> > Yes, there are several solutions. Do you

> need

> > millions of rows? There are a few ways to get

> around

> > this, depending on your requirements:

>

> Unfortunatley, the current implementation of the

> system could result in millions of rows, paged of

> course, being delivered to the end user. I am yet to

> discuss how useful this could be to the end user -

> - it is quite possibly useless but that's for our

> users to decide.

>

There are business requirements, and there are also technical realities. First approach them with, "How would you even scroll through a million records?" Then, if they persist, "Well, it doesn't matter anyway because a million records will either break the server or require you to buy ten for every one you would have purchased otherwise."

> > > If you are storing all those rows to perform a

> > series of calculations, perform the calculations

> > 'close' to the ResultSet itself, meaning read each

> > row and update your calculations accordingly. You

> > should then simply have to return the calculation

> > results. This would typically be done during a

> batch

> > process run or a report.

> >

> > > If you are only displaying, say, a hundred at a

> > time, implement pagination. This would be like in

> > Google where you see 1 ... n for however many

> pages

> > of data there are. Rather than returning a

> million

> > rows, SELECT the count first and then SELECT

> however

> > many rows are appropriate for a page. You can use

> > ROWNUM (for Oracle) or LIMIT (for ANSI compliant

> > RDBMS) to 'page' the results returned by the

> > database.

>

> This approach would require two queries to begin

> with (count and first page) then a query for each

> page. What worries me about this approach is that if

> the query consists of multiple joins on tables with

> millions of rows, the queries can be quite slow. And

> having used this technique once before on a complex

> query with GROUP BY, ORDER BY and HAVING, using LIMIT

> was not much quicker than not using LIMIT (in MySQL

> 4.0).

>

You can always serialize the results to the file system or store the query results in a temporary table. The latter is nice because LIMIT works on that (smaller with fewer joins) query. The issue you will run into is how and when to clear out 'stale' query results. Depending on how much disk you have, you could conceivably dedicate a parent record for each user. When the user requested another query, the existing one would be overwritten. A batch process could expire all stored results at the end of the night, week, month, etc.

> > > If all else fails, and this is a very rare

> > requirement that literally millions of rows

> > must be sent to either the app server or

> the

> > client, then store the results temporarily in the

> > file system of the app server. This is a last

> > resort. I would be shocked to find real, valid

> > business requirements to actually hold onto

> millions

> > of rows.

>

> I agree this would be a last resort: would not work

> in a cluster, clean up issues etc. I've seen one

> solution where results were stored back in the

> database as a BLOB!? See:

> http://wldj.sys-con.com/read/45563.htm

>

BLOB is a possibility, but I think a dedicated temporary table is more elegant. How would you paginate a BLOB without loading it into memory first?

> > Hope that stimulates a few ideas. Why do you have

> > millions of rows? (BTW, regarding state

> replication,

> > this would make a horrendous situation that much

> > worse; it would in all likelihood gum up your

> network

> > and cause all your machines to run out of memory

> > soon).

>

> Thanks for your input. If the requirements

> cannot change then I guess at the moment I'll have to

> compare the 'one query, page through results in

> stateful session bean' approach with the 'multiple

> but limited queries approach'.

>

I think the former has memory scaling issues and the latter may have performance issues.

> Has anybody already compared these two approaches?

> What do people think of the 'results stored in the

> database' approach?

>

> > - Saish

>

> Duncan Eley

Interesting discussion!

- Saish

Saisha at 2007-7-13 11:12:18 > top of Java-index,Other Topics,Patterns & OO Design...