This post has already been read 214 times!

The Query’s goal is :

  1. Find the top 5 Locations
  2. List the users who live in those top 5 Locations, alphabetized by their DisplayName

There are a LOT of ways I could write this query, but for the purpose of this post, I’m only going to contrast common table expressions (CTEs) versus temp tables:

CTE & Temp Table
/* Set up an index to make our query easier: */
CREATE INDEX IX_Location ON dbo.Users(Location);
 
/* Common Table Expression, CTE: */
WITH TopLocations AS
  (SELECT TOP 5 Location
   FROM dbo.Users
   GROUP BY Location
   ORDER BY COUNT(*) DESC
   )
SELECT u.DisplayName, u.Location, u.WebsiteUrl, u.Reputation, u.Id
  FROM TopLocations tl
  INNER JOIN dbo.Users u ON tl.Location = u.Location
  ORDER BY u.DisplayName;
GO
 
/* Temp table version: */
CREATE TABLE #TopLocations (Location NVARCHAR(100));
INSERT INTO #TopLocations (Location)
   SELECT TOP 5 Location
   FROM dbo.Users
   GROUP BY Location
   ORDER BY COUNT(*) DESC;
 
SELECT u.DisplayName, u.Location, u.WebsiteUrl, u.Reputation, u.Id
  FROM #TopLocations tl
  INNER JOIN dbo.Users u ON tl.Location = u.Location
  ORDER BY u.DisplayName;
GO

We’re talking about a relatively small table here – less than 1GB in the Stack Overflow 2013 (50GB) version – so both versions of the query perform fairly quickly.

Here’s the execution plan for the CTE:

The CTE does both operations (finding the top locations, and finding the users in that location) in a single statement. That has pros and cons:

  • Good: SQL Server doesn’t necessarily have to materialize the top 5 locations to disk
  • Good: it accurately estimated that 5 locations would come out of the CTE
  • Bad: it didn’t know what those 5 locations would be, so…
  • Bad: it wildly under-estimated how many Users would be found in those locations (est: 66 rows, actual: 50,073 rows)
  • Bad: it chose to do index seeks + key lookups for a total of 158,916 reads

Here’s the execution plan for the temp table:

The temp table version splits the work up into two phases, which means that by the time the second operation happens, SQL Server has the benefit of knowing what happened in the first phase. That also has pros and cons:

  • Good: SQL Server accurately estimated that 5 locations would come out of the temp table
  • Great: it even estimated what those 5 locations would be, so
  • Great: it guessed much more accurately about how many Users lived in those locations (est: 24,657 rows, actual: 50,073 rows)
  • Great: it chose to do a table scan, leading to less logical reads (just 49,900 total for both operations)

So what’s better, and when?

CTEs are usually better when:

  • SQL Server can do a good job of estimating how many rows will come out of it, and the contents of what those rows will be, or
  • When what comes out of the CTE doesn’t really influence the behavior of the rest of the query, or
  • When you’re not sure what portions of the CTE’s data will actually be necessary for the rest of the query (because SQL Server can figure out what parts to execute, and what parts to simply ignore)

Temp tables are usually better when:

  • You have to refer to the output multiple times, or
  • When you need to pass data between stored procedures, or
  • When you need to break a query up into phases to isolate unpredictable components that dramatically affect the behavior of the rest of the query

I’d suggest starting with CTEs because they’re easy to write and to read. If you hit a performance wall, try ripping out a CTE and writing it to a temp table, then joining to the temp table.

Leave a Reply

Post Navigation