Advice on my design please

I would like some advice on the project I'm currently working on.

First, you should know that I can program in Java and have already created some small sites using JSP and beans with a MySQL database in the background. But so far all I have done is A to B to C programming, treating each visitor as independent of every other.

So basically a visitor comes along and uses a JSP page to request data, a bean runs a query on the database and returns the data to the JSP page, which displays it. The next visitor does exactly the same thing. This potentially wastes resources: when the data is the same, there is no need for the second visitor to query the database; he should be able to re-use what the bean has already done for the first visitor. But this is a new area for me, hence the need for some advice.

I want to create a site where people can find info on football (soccer) competitions and matches. A constraint I am imposing on myself is that for the end user, the website's visitor, speed is the most important thing. I do not want people to wait for pages to appear; things should appear instantly (or as close to it as possible).

If you want to see some examples of existing sites, these illustrate what I'm about to create:

www.soccerway.com , www.365stats.com , www.betexplorer.com , www.sfstats.net

The design I've come up with so far is a 4 part system:

(some would call them tiers, but I'm not sure)

Part 1 Website presentation using mainly JSP

Part 2 Data retrieval using beans

Part 3 Centralised database

Part 4 Data collection routines

First part 3

A centralised database containing data that acts as a resource for not one but multiple websites where each site retrieves info that is partly unique to that site and partly the same for several sites.

Then Part 4

The data collection routines are a group of threads running on timers that continuously collect data from various sources such as other databases or XML feeds. Each data collector thread runs independently: it collects data, validates it and stores (adds) the info in the database.

Part 1

A number of JSP pages that make up the front end of the website.

Here I intend to look at my pages as collections of components. For example, a component can be a list of matches, a competition table, a team's statistics or a list of bookmaker prices on a match. Then I want to relate each component to a data retrieval bean in part 2. This should allow me to re-use much of the code rather than create each page individually.

Part 2

Now we come to the part that I am most uncertain of.

I am thinking of creating a number of beans to retrieve the data belonging to a component of my views. Each bean, depending on what data it handles, would use a mechanism to optimise speed and reduce the waste of resources.

Part 2a

In some cases the data retrieved needs to be as up to date as possible, for example bookmaker betting prices. In this case the bean runs the same query for each visitor. The only mechanism I might include is to limit the number of queries on the database over time. So if a visitor requests the same info as a previous visitor within a minute, the bean should return the same result. Only if more than 1 minute has passed should it run the actual query again.
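The one-minute rule described above could be sketched roughly as follows. This is only an illustration under assumptions: the class name `TimeLimitedResult`, the `query` supplier (standing in for the bean's real database query) and the injectable `clock` are all hypothetical names, not part of the original design.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Reuses one query result for every visitor until it is older than maxAgeMillis. */
class TimeLimitedResult<T> {
    private final Supplier<T> query;   // hypothetical: wraps the real database query
    private final long maxAgeMillis;   // e.g. 60_000 for the one-minute window
    private final LongSupplier clock;  // e.g. System::currentTimeMillis; injectable for testing
    private T cached;
    private long loadedAt;
    private boolean loaded;

    TimeLimitedResult(Supplier<T> query, long maxAgeMillis, LongSupplier clock) {
        this.query = query;
        this.maxAgeMillis = maxAgeMillis;
        this.clock = clock;
    }

    /** Synchronized so that concurrent visitors never run the query at the same time. */
    synchronized T get() {
        long now = clock.getAsLong();
        // Run the real query only on first use or once the stored result is too old.
        if (!loaded || now - loadedAt > maxAgeMillis) {
            cached = query.get();
            loadedAt = now;
            loaded = true;
        }
        return cached;
    }
}
```

A bean holding bookmaker prices might keep one such object per match and call `get()` for every visitor; at most one database query per minute then reaches the database for that match, however many visitors ask.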

Part 2b

The second mechanism I'm thinking of is for data that doesn't change that often, like a competition overview or a list of played matches. There are lots of changes during the weekend when matches are played, but very few during the week when nothing is going on. Here I will maintain a "change flag" on a key element. All the data collection routines will relate to the key element and update its "change flag" when they add data belonging to that element.

The bean retrieving data will store the data in files on the server, possibly in HTML format. Each time a visitor requests that data, the bean checks the "change flag" and compares it to the existing file. If there was no change since the file was last created, it simply returns nothing and the requesting JSP file uses (includes) the existing stored HTML file. If there was a change, the bean refreshes the HTML file first. This may require quite some disk space, but that's OK.
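A rough sketch of the change-flag check, under stated assumptions: the "change flag" is modelled here as a number that the data collectors increase on every change (the post does not say what form the flag takes), and the names `StoredFragment`, `changeFlag` and `renderer` are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Keeps one stored HTML file in step with the change flag of its key element. */
class StoredFragment {
    private final Path htmlFile;
    private final LongSupplier changeFlag;    // hypothetical: reads the element's flag from the database
    private final Supplier<String> renderer;  // hypothetical: runs the query and builds the HTML
    private Long renderedFlag;                // flag value the stored file was built from, null if none yet

    StoredFragment(Path htmlFile, LongSupplier changeFlag, Supplier<String> renderer) {
        this.htmlFile = htmlFile;
        this.changeFlag = changeFlag;
        this.renderer = renderer;
    }

    /** Returns the file to include, regenerating it first if the flag has moved. */
    synchronized Path current() {
        long flag = changeFlag.getAsLong();
        boolean stale = renderedFlag == null || renderedFlag != flag || !Files.exists(htmlFile);
        if (stale) {
            try {
                Files.write(htmlFile, renderer.get().getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            renderedFlag = flag;
        }
        return htmlFile;
    }
}
```

Note that reading the flag still costs one small query per request in this form; only the expensive query and the HTML generation are skipped when nothing has changed.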

This is where I'm at at this point in time, and I welcome any advice from any experienced developer. My main problem is that I cannot relate to "size". For example, I would describe 1000 visitors per day, 100 at the same time, each generating 5 page views, as "large". But someone else might describe that as small or even tiny.

So am I thinking about things I will probably never need, and should I simply go with A - B - C and not worry about wasting resources as I'll always have enough?

Am I thinking too small, and should I look into Java EE and a more elaborate design with other workload mechanisms?

Or am I (HOPEFULLY) thinking along the right lines?

In that case, where should I put my priorities?

What tutorials/books can help me, given this project as context?

[5222 byte] By [Datapuntera] at [2007-11-27 5:24:00]
# 1

> I would like some advice on the project I'm currently working on.
>
> First, you should know that I can program in Java and have already created some small sites using JSP and beans with a MySQL database in the background. But so far all I have done is A to B to C programming, treating each visitor as independent of every other.
>
> So basically a visitor comes along and uses a JSP page to request data, a bean runs a query on the database and returns the data to the JSP page, which displays it. The next visitor does exactly the same thing. This potentially wastes resources: when the data is the same, there is no need for the second visitor to query the database; he should be able to re-use what the bean has already done for the first visitor. But this is a new area for me, hence the need for some advice.

First, I think you need to make sure this is really a problem. Does this lookup take a long time?

The most obvious solution to your problem is to use a local (to the webserver) cache. A lot of people have written their own caches (including myself), but if you decide to go that route, I would suggest using a cache library like EHCache instead. Caches are notoriously difficult to get right, especially if you need good multithreading performance.

dubwaia at 2007-7-12 11:50:09 > top of Java-index,Other Topics,Patterns & OO Design...
# 2

> Part 2b
>
> The second mechanism I'm thinking of is for data that doesn't change that often, like a competition overview or a list of played matches. There are lots of changes during the weekend when matches are played, but very few during the week when nothing is going on. Here I will maintain a "change flag" on a key element. All the data collection routines will relate to the key element and update its "change flag" when they add data belonging to that element.
>
> The bean retrieving data will store the data in files on the server, possibly in HTML format. Each time a visitor requests that data, the bean checks the "change flag" and compares it to the existing file. If there was no change since the file was last created, it simply returns nothing and the requesting JSP file uses (includes) the existing stored HTML file. If there was a change, the bean refreshes the HTML file first. This may require quite some disk space, but that's OK.

These can easily be handled with a cache. The first would merely require setting the cache to time out after a minute. Or you can create a thread that updates the cache every minute regardless of whether there is a request. For the HTML page, you can again cache the page and load it as needed. Generally, an HTML page doesn't take up much space.

One thing I have done in the past is use triggers in the DB to update a table with the last modifications of all the other tables. Then I had a thread that would check this table every minute or so for any updates. This is more complicated than just using time-based queries, but it solves some pretty sticky issues.

dubwaia at 2007-7-12 11:50:09
# 3

> This is where I'm at at this point in time, and I welcome any advice from any experienced developer. My main problem is that I cannot relate to "size". For example, I would describe 1000 visitors per day, 100 at the same time, each generating 5 page views, as "large". But someone else might describe that as small or even tiny.
>
> So am I thinking about things I will probably never need, and should I simply go with A - B - C and not worry about wasting resources as I'll always have enough?
>
> Am I thinking too small, and should I look into Java EE and a more elaborate design with other workload mechanisms?
>
> Or am I (HOPEFULLY) thinking along the right lines?
>
> In that case, where should I put my priorities?
>
> What tutorials/books can help me, given this project as context?

You seem to hint that you are inexperienced but if that's the case, I'm pretty impressed with your design.

From the volumes that you mention here, I see nothing about your design that should be problematic. In fact, I would advise that you not get ahead of yourself. If you get (or have gotten) this all working smoothly, that is in itself a great accomplishment and something a lot of experienced developers can't seem to pull off. You'll probably be better off making sure you don't have any common coding mistakes and optimizing what you have than trying to bring in more tools.

If your site takes off and you can't handle all the load with this design, well, that's a good problem to have. I would say design for the present with the future in your peripheral vision. Don't code for things that may never happen but don't paint yourself into a corner either.

dubwaia at 2007-7-12 11:50:09
# 4

Thanks for your replies dubwai,

>> First, I think you need to make sure this is really a problem.

Yes, good point. Frankly, I don't know.

And not knowing would be the actual problem.

It's difficult enough getting a regular flow of visitors to a website, but then to lose them again due to performance problems is a webmaster's worst nightmare. So I'm looking to be just one step ahead of things. If nothing else, it will put my mind at ease on the subject and I can fully concentrate on other aspects.

In the whole project I can see only one area that can become a performance bottleneck, and that is database access.

First of all, because on rented hosting accounts you only get a limited number of concurrent connections. The data collection threads will continuously use a number of connections, and I can't do much about that as that is an essential part of the whole. Hosting companies usually don't have the option of increasing this; they immediately suggest using dedicated servers, but that would be too big a jump in budget.

Second is the speed of queries. I haven't actually designed the database yet (I'm about to do that now), but I know that even if a page takes only a few seconds, to a site visitor that's an annoyingly long time. OK, OK, so I'll need to get creative and design fast queries.

>> You seem to hint that you are inexperienced but if that's the case, I'm pretty impressed with your design

Thanks, amazing how far one can get with a bit of common sense LOL.

I do have some experience in other IT areas, so I'm not completely new, just new to Java-based websites.

--

To summarise, my 2 mechanisms could work and get me a long way, but I need to look into caching as that may be more appropriate.

And it would be pretty safe to just start with a simple JSP-Bean-DB structure, as that should be able to handle 1000 visitors per day.

>> If your site takes off and you can't handle all the load with this design, well, that's a good problem to have.

Yes, I sure hope my next topic here will be titled: "Help, I have too many visitors" LOL

Datapuntera at 2007-7-12 11:50:09
# 5

> Thanks for your replies dubwai,
>
> >> First, I think you need to make sure this is really a problem.
>
> Yes, good point. Frankly, I don't know.
> And not knowing would be the actual problem.
>
> It's difficult enough getting a regular flow of visitors to a website, but then to lose them again due to performance problems is a webmaster's worst nightmare. So I'm looking to be just one step ahead of things. If nothing else, it will put my mind at ease on the subject and I can fully concentrate on other aspects.
>
> In the whole project I can see only one area that can become a performance bottleneck, and that is database access.

If you really want to put your mind at ease, you should start considering some testing strategies. How might you simulate the situation that you are concerned with? Then you can take it past what you expect. If your site can handle 10 times what you expect in terms of volume, you are probably pretty safe for a while. The last step is to take it to the breaking point. When does it start having problems? Then you'll know when to start thinking about adding more muscle.
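One very simple way to simulate such a situation is to run a number of concurrent "visitors" and time their requests. The sketch below is only an illustration under assumptions: `LoadSimulation` and `slowestMillis` are hypothetical names, and `request` stands in for whatever fetches one page (for example an HTTP GET against a test copy of the site). Dedicated load-testing tools such as Apache JMeter do this far more thoroughly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class LoadSimulation {
    /**
     * Runs `visitors` concurrent visitors, each making `pageViews` requests,
     * and returns the duration of the slowest single request in milliseconds.
     */
    static long slowestMillis(int visitors, int pageViews, Runnable request) {
        ExecutorService pool = Executors.newFixedThreadPool(visitors);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int v = 0; v < visitors; v++) {
                results.add(pool.submit(() -> {
                    long worst = 0;
                    for (int p = 0; p < pageViews; p++) {
                        long start = System.nanoTime();
                        request.run();   // hypothetical: fetch one page
                        worst = Math.max(worst, System.nanoTime() - start);
                    }
                    return worst;
                }));
            }
            long slowest = 0;
            for (Future<Long> result : results) {
                slowest = Math.max(slowest, result.get());
            }
            return TimeUnit.NANOSECONDS.toMillis(slowest);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("load simulation failed", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Starting at the expected volume (say 100 visitors with 5 page views each), then trying 10 times that, then increasing further until the slowest request becomes unacceptable, follows the three steps described above.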

> First of all, because on rented hosting accounts you only get a limited number of concurrent connections. The data collection threads will continuously use a number of connections, and I can't do much about that as that is an essential part of the whole. Hosting companies usually don't have the option of increasing this; they immediately suggest using dedicated servers, but that would be too big a jump in budget.

Caching packages can definitely help with this. You might want to do some preemptive caching too. Basically you can make the cache the only thing hitting the DB. The main reason this matters is that connections aren't threadsafe. If you only have 3 connections to work with, you can only run three queries at a time. If all the pages are loaded via a fresh query, you'll run into contention issues really quickly.
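To make the "only three queries at a time" point concrete, here is a minimal sketch that forces every query through a fixed number of permits, so the code can never ask for more connections than the hosting account allows. The class name `BoundedDatabaseAccess` and the `query` supplier are hypothetical; in practice a connection pool usually plays this role.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Lets at most maxConcurrentQueries queries run at once; the rest wait their turn. */
class BoundedDatabaseAccess {
    private final Semaphore permits;

    BoundedDatabaseAccess(int maxConcurrentQueries) {
        // Fair semaphore: waiting callers are served in arrival order.
        this.permits = new Semaphore(maxConcurrentQueries, true);
    }

    /** Runs the query once a permit is free and always hands the permit back. */
    <T> T run(Supplier<T> query) {
        permits.acquireUninterruptibly();
        try {
            return query.get();   // hypothetical: borrow a connection and execute the SQL
        } finally {
            permits.release();
        }
    }
}
```

If only the cache loaders call `run`, page requests that hit the cache never wait for a permit at all, which is the effect of making the cache the only thing that talks to the DB.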

> Second is the speed of queries. I haven't actually designed the database yet (I'm about to do that now), but I know that even if a page takes only a few seconds, to a site visitor that's an annoyingly long time. OK, OK, so I'll need to get creative and design fast queries.

Indexes are important. I would design the tables in the way that makes sense from a data perspective and then optimize. I would avoid denormalizing the data in order to optimize the queries but some people will argue otherwise.

dubwaia at 2007-7-12 11:50:09
# 6

I've read about caching and that does seem the way to go.

(although it will take me some time to get my head around it)

Let's see if I got it right.

In order to use caching for database queries, I need to design the data in such a way that I can uniquely identify a set of data from a query.

Then when a user comes along and requests that data set, I can attempt to retrieve it from the cache. If successful, it is a very quick way to return the data to the user. If unsuccessful, I need to get it from the database (and put it in the cache for re-use) and return the data to the user.

Depending on the type of data, or better, the frequency with which the data is expected to change/expand, I can select the most appropriate cache, being memory or disk.

The ability to set an expiration time on the data in the cache effectively provides the same functionality as the 2 mechanisms I thought of.

Seems to do the job:

1) quick for the end-user

2) avoid (or manage) any potential bottleneck in database access

Datapuntera at 2007-7-12 11:50:09
# 7

>>If you really want to put your mind at ease, you should start considering some testing strategies.

How do I go about doing this?

On my web wanderings I have come across texts that deal with testing the workload a site can handle. And if I recall correctly, there are applications for workload testing. Can you give me a starting point?

Datapuntera at 2007-7-12 11:50:09
# 8

> In order to use caching for database queries, I need to design the data in such a way that I can uniquely identify a set of data from a query.

Yes. But unless I'm missing something, this should be trivial. In order to identify two requests as being the same, you must be able to do this.

> Then when a user comes along and requests that data set, I can attempt to retrieve it from the cache. If successful, it is a very quick way to return the data to the user. If unsuccessful, I need to get it from the database (and put it in the cache for re-use) and return the data to the user.

Yes. Make sure you break it down this way: User requests data from the cache. The cache will always service the request or return an error if it is not possible. The cache services the request by checking for a copy of the data. If it does not exist in cache, it attempts to acquire it. If it does exist, it checks to see if it is expired. If it is, it refreshes the data. Lastly the user request is satisfied. The point is that the cache needs to manage itself. The code requesting the data has other things to worry about.
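The breakdown above, where the requesting code only ever asks the cache and the cache decides for itself whether to load or refresh, might look roughly like this. It is a sketch under assumptions, not a replacement for a library such as EHCache: `SelfManagingCache`, `loader` and `clock` are hypothetical names, and a single lock is used for simplicity rather than for multithreading performance.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** A cache that loads missing entries and refreshes expired ones on its own. */
class SelfManagingCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long loadedAt;
        Entry(T value, long loadedAt) { this.value = value; this.loadedAt = loadedAt; }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();
    private final Function<K, V> loader;   // hypothetical: runs the query identified by the key
    private final long timeToLiveMillis;
    private final LongSupplier clock;      // e.g. System::currentTimeMillis

    SelfManagingCache(Function<K, V> loader, long timeToLiveMillis, LongSupplier clock) {
        this.loader = loader;
        this.timeToLiveMillis = timeToLiveMillis;
        this.clock = clock;
    }

    /** Always services the request; an exception from the loader is the "error" case. */
    synchronized V get(K key) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.get(key);
        boolean missing = entry == null;
        boolean expired = !missing && now - entry.loadedAt > timeToLiveMillis;
        if (missing || expired) {
            entry = new Entry<>(loader.apply(key), now);
            entries.put(key, entry);
        }
        return entry.value;
    }
}
```

The key is what "uniquely identifies a set of data from a query": for instance a string combining the component name and its parameters. The JSP-side bean then only calls `get(key)` and never touches the database directly.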

> Depending on the type of data, or better, the frequency with which the data is expected to change/expand, I can select the most appropriate cache, being memory or disk.

I usually don't worry about this much. Keep everything in memory if you can. If not, you can use LRU or other caching strategies. A good caching package should address these kinds of concerns.
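For reference, an LRU (least recently used) strategy keeps a bounded number of entries and throws out whichever one has gone longest without being read or written. A minimal sketch using the standard `LinkedHashMap` in access order follows; the class name `LruCache` is an assumption, and the class is not thread-safe on its own (it could be wrapped with `Collections.synchronizedMap`).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Holds at most maxEntries entries, evicting the least recently used one. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true);   // true = iterate in access order, oldest access first
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion; returning true drops the eldest entry.
        return size() > maxEntries;
    }
}
```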

> The ability to set an expiration time on the data in the cache effectively provides the same functionality as the 2 mechanisms I thought of.
>
> Seems to do the job:
> 1) quick for the end-user
> 2) avoid (or manage) any potential bottleneck in database access

Yes. And if you get to a point where you are flooded with requests for the same data, you can use a concurrent cache to repeatedly retrieve data in the background instead of having a few requests wait every 2 minutes (or whatever the cache timeout is.)
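Retrieving data in the background, so that no visitor ever waits for a reload, could be sketched as below. This is an illustration under assumptions: `BackgroundRefreshedValue` and `loader` are hypothetical names, and the sketch handles a single value rather than a whole keyed cache.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Holds one value that a background thread reloads at a fixed interval. */
class BackgroundRefreshedValue<T> {
    private final Supplier<T> loader;   // hypothetical: runs the real query
    private volatile T current;         // volatile so readers always see the latest reload
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "cache-refresh");
                t.setDaemon(true);      // do not keep the JVM alive just for refreshing
                return t;
            });

    BackgroundRefreshedValue(Supplier<T> loader, long period, TimeUnit unit) {
        this.loader = loader;
        this.current = loader.get();    // load once up front so get() never returns null
        scheduler.scheduleAtFixedRate(this::refresh, period, period, unit);
    }

    private void refresh() {
        try {
            current = loader.get();
        } catch (RuntimeException e) {
            // Keep serving the old value; an uncaught exception would cancel the schedule.
        }
    }

    /** Never blocks on the database: returns whatever was loaded last. */
    T get() {
        return current;
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```

The trade-off is that the query runs on every interval even when nobody asks for the data, so this fits the few heavily requested items better than the long tail.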

dubwaia at 2007-7-12 11:50:09