Sorting vs. database DISTINCT

Hi,

The question is what would be more efficient, to apply sorting algoritm

or to use a DISTINCT clause.

I have a list which must contain only unique values.

Values are coming from the database table where each column might contain

duplications.

I have two options and I can't decide which one is best.

Option 1.

Query each column with DISTINCT clause and then put it in a list.

or

Option 2.

Query all columns and apply some sorting algorithm to the result set. Algorithm

will sort and search for duplications and then put it in a list.

The number of records vary from 10 to 10.000 or more.

GROUP BY clause won't work with the DISTINCT either.

Thanks,

[752 byte] By [TR1a] at [2007-10-1 3:51:03]
# 1

> Hi,

>

> The question is what would be more efficient, to

> apply sorting algoritm

> or to use a DISTINCT clause.

>

> I have a list which must contain only unique values.

> Values are coming from the database table where each

> column might contain

> duplications.

>

> I have two options and I can't decide which one is

> best.

> Option 1.

> Query each column with DISTINCT clause and then put

> it in a list.

>or

> Option 2.

> Query all columns and apply some sorting algorithm to

> the result set. Algorithm

> will sort and search for duplications and then put

> it in a list.

>

> The number of records vary from 10 to 10.000 or

> more.

> GROUP BY clause won't work with the DISTINCT either.

>

> Thanks,

If you have so few just shove 'em into a HashSet (Or if order is necessary, a TreeSet) as they come out.

Adeodatusa at 2007-7-8 22:42:52 > top of Java-index,Other Topics,Algorithms...
# 2

> The question is what would be more efficient, to

> apply sorting algoritm

> or to use a DISTINCT clause.

The easiest and most reliable way to get a good answer to this question would be to try both and see what is more efficient. Another reason for doing that is that you would be able to use your own definition of "efficient" instead of relying on uninterested parties to assign the word a meaning.

DrClapa at 2007-7-8 22:42:52 > top of Java-index,Other Topics,Algorithms...
# 3

> Hi,

>

> The question is what would be more efficient, to

> apply sorting algoritm

> or to use a DISTINCT clause.

In general those are not equivanlent.

Distinct insures uniquess without ordering. Sorting applies ordering but does not require uniquess.

Most operations should be down in a database, particularily if you end up eliminating data on the client side (which uniqueness would suggest.)

Ordering, particularly when sorting on small sets with differing fields with the same data is often better on the client side because the database round trip is eliminated.

jschella at 2007-7-8 22:42:52 > top of Java-index,Other Topics,Algorithms...
# 4

Option 3 might be best depending on your requirements.

Although it wouldn't make sense to use GROUP BY and DISTINCT together (since GROUP BY must have something to group), you can certainly use a GROUP BY to do what you want. You have to add a "computed column" to your query. You 'group on' the computed column and you 'GROUP BY' all the columns that you actually want to do the 'distinct' operation on. The GROUP BY will return an ordered result set so you won't need another sort.

Option 2 is next best. A sort is going to happen whether you use DISTINCT to do it or use an ORDER BY.

As jschell states 'Distinct insures uniquess without ordering'. This doesn't mean that a sort isn't happening but rather means that the result set is not guaranteed to be in any particular order.

Although a sort is not strictly REQUIRED in order to determine uniqueness performing a sort first is far more efficient and requires fewer resources than caching every record and comparing each new record with everything in the cache.

Thus, the DISTINCT will perform a sort for you.

If you use an ORDER BY you still have to compare all of the columns in the result set in order to eliminate the duplicates yourself.

Obviously, the use of DISTINCT isn't needed if querying a single table that has a unique index or primary key.

By the way, I assume you know that the DISTINCT keyword applies to all of the selected columns; you don't "query each column with DISTINCT".

rp0428a at 2007-7-8 22:42:52 > top of Java-index,Other Topics,Algorithms...