Danger when using UNION in SQL
Posted by Michael Bray on December 19, 2007
Interesting thing I learned about UNION today – it removes duplicated result set rows. At first glance, it may not be a big deal – it is clearly documented in the Transact SQL SELECT statement docs, and after all, this matches the concept of UNION in set theory. However, what I didn’t expect was that not only does it remove the rows that are duplicates BETWEEN the two queries, but it will also remove duplicated rows WITHIN each of the individual queries.
I found this when trying to calculate the sum of a single column, where the data was actually stored in two different tables. So the query I was trying to execute was:
SELECT SUM(Duration) FROM (SELECT Duration FROM tblA [WHERE…] UNION SELECT Duration FROM tblB [WHERE…]) tmp
The individual result sets were:
And the result of the SELECT SUM(Duration) was 16. Huh?? It’s 16 because the Duration values are selected, then UNIONed, then the duplicates removed, resulting in a list of ‘8, 6, 2, 0’ to be summed. I would have expected that the UNION find the duplicated rows between the two sets, and then doing the UNION, resulting in something more like ‘8, 0; 8, 2; 6, 2’ to be the rows that are kept after the removal of the duplcates, followed by the sum resulting in 26.
Of course, this isn’t the number I want either, and the answer is to include the ‘ALL’ keyword along with UNION (“UNION ALL”) which disables the removal of duplicates. Then the sum works perfectly, giving an answer of 42. (Of course!!! :))