A while back, I asked if surrogate keys provide better performance than natural keys in SQL Server. @sqlvogel provided an answer to that question yesterday that caused me to revisit it.
This question is an attempt to "upgrade" the prior question, and hopefully provide the opportunity for thoughtful answers that help the community.
Consider a system for storing details about computers. Each computer has an architecture, and an Operating System. In SQL Server, we could create these tables using natural keys like this:
CREATE TABLE dbo.Architecture( ArchitectureName varchar(10) NOT NULL , ArchitectureVersion decimal(5,2) NOT NULL , ReleaseDate date NOT NULL , CONSTRAINT PK_Architecture PRIMARY KEY CLUSTERED (ArchitectureName, ArchitectureVersion));CREATE TABLE dbo.Manufacturer( ManufacturerName varchar(10) NOT NULL CONSTRAINT PK_Manufacturer PRIMARY KEY CLUSTERED);CREATE TABLE dbo.OS( OSName varchar(30) NOT NULL , ManufacturerName varchar(10) NOT NULL CONSTRAINT FK_OS_Manufacturer FOREIGN KEY (ManufacturerName) REFERENCES dbo.Manufacturer(ManufacturerName) , ArchitectureName varchar(10) NOT NULL , ArchitectureVersion decimal(5,2) NOT NULL , CONSTRAINT FK_OS_Architecture FOREIGN KEY (ArchitectureName, ArchitectureVersion) REFERENCES dbo.Architecture(ArchitectureName, ArchitectureVersion) , CONSTRAINT PK_OS PRIMARY KEY CLUSTERED (OSName));CREATE TABLE dbo.Computers( ComputerID varchar(10) NOT NULL CONSTRAINT PK_Computers PRIMARY KEY CLUSTERED , OSName varchar(30) NOT NULL CONSTRAINT FK_Computers_OSName FOREIGN KEY REFERENCES dbo.OS(OSName) , ComputerManufacturerName varchar(10) NOT NULL CONSTRAINT FK_Computers_Manufacturer FOREIGN KEY REFERENCES dbo.Manufacturer(ManufacturerName) , EffectiveDate datetime NOT NULL CONSTRAINT DF_Computers_EffectiveDate DEFAULT (GETDATE()) , ExpiryDate datetime NULL);
To query the dbo.Computers
table, with 2 rows in dbo.Computers
, showing various details, we could do this:
SELECT Computers.ComputerID , Computers.ComputerManufacturerName , OSManufacturer = OS.ManufacturerName , Computers.OSName , OS.ArchitectureName , OS.ArchitectureVersionFROM dbo.Computers INNER JOIN dbo.OS ON Computers.OSName = OS.OSNameWHERE Computers.EffectiveDate <= GETDATE() AND (Computers.ExpiryDate >= GETDATE() OR Computers.ExpiryDate IS NULL)ORDER BY Computers.ComputerID;
The query output is:
╔════════════╦══════════════════════════╦════════════════╦════════════╦══════════════════╦═════════════════════╗║ ComputerID ║ ComputerManufacturerName ║ OSManufacturer ║ OSName ║ ArchitectureName ║ ArchitectureVersion ║╠════════════╬══════════════════════════╬════════════════╬════════════╬══════════════════╬═════════════════════╣║ CM700-01 ║ HP ║ Microsoft ║ Windows 10 ║ x64 ║ 1.00 ║║ CM700-02 ║ HP ║ Microsoft ║ Windows 10 ║ x64 ║ 1.00 ║╚════════════╩══════════════════════════╩════════════════╩════════════╩══════════════════╩═════════════════════╝
The query plan for this is quite simple:
Image may be NSFW.
Clik here to view.
Or, if we choose to use surrogate keys, like this:
CREATE TABLE dbo.Architecture( ArchitectureID int NOT NULL IDENTITY(1,1) CONSTRAINT PK_Architecture PRIMARY KEY CLUSTERED , ArchitectureName varchar(10) NOT NULL , ArchitectureVersion decimal(5,2) NOT NULL , ReleaseDate date NOT NULL , CONSTRAINT UQ_Architecture_Name UNIQUE (ArchitectureName, ArchitectureVersion));CREATE TABLE dbo.Manufacturer( ManufacturerID int NOT NULL IDENTITY(1,1) CONSTRAINT PK_Manufacturer PRIMARY KEY CLUSTERED , ManufacturerName varchar(10) NOT NULL);CREATE TABLE dbo.OS( OS_ID int NOT NULL IDENTITY(1,1) CONSTRAINT PK_OS PRIMARY KEY CLUSTERED , OSName varchar(30) NOT NULL CONSTRAINT UQ_OS_Name UNIQUE , ManufacturerID int NOT NULL CONSTRAINT FK_OS_Manufacturer FOREIGN KEY REFERENCES dbo.Manufacturer(ManufacturerID) , ArchitectureID int NOT NULL CONSTRAINT FK_OS_Architecture FOREIGN KEY REFERENCES dbo.Architecture(ArchitectureID));CREATE TABLE dbo.Computers( ComputerID int NOT NULL IDENTITY(1,1) CONSTRAINT PK_Computers PRIMARY KEY CLUSTERED , ComputerName varchar(10) NOT NULL CONSTRAINT UQ_Computers_Name UNIQUE , OS_ID int NOT NULL CONSTRAINT FK_Computers_OS FOREIGN KEY REFERENCES dbo.OS(OS_ID) , ComputerManufacturerID int NOT NULL CONSTRAINT FK_Computers_Manufacturer FOREIGN KEY REFERENCES dbo.Manufacturer(ManufacturerID) , EffectiveDate datetime NOT NULL CONSTRAINT DF_Computers_EffectiveDate DEFAULT (GETDATE()) , ExpiryDate datetime NULL);
In the design above, you may notice we have to include several new unique constraints to ensure our data model is consistent across both approaches.
Querying this surrogate-key approach with 2 rows in dbo.Computers
looks like:
SELECT Computers.ComputerName , ComputerManufacturerName = cm.ManufacturerName , OSManufacturer = om.ManufacturerName , OS.OSName , Architecture.ArchitectureName , Architecture.ArchitectureVersionFROM dbo.Computers INNER JOIN dbo.OS ON Computers.OS_ID = OS.OS_ID INNER JOIN dbo.Manufacturer cm ON Computers.ComputerManufacturerID = cm.ManufacturerID INNER JOIN dbo.Architecture ON OS.ArchitectureID = Architecture.ArchitectureID INNER JOIN dbo.Manufacturer om ON OS.ManufacturerID = om.ManufacturerIDWHERE Computers.EffectiveDate <= GETDATE() AND (Computers.ExpiryDate >= GETDATE() OR Computers.ExpiryDate IS NULL)ORDER BY Computers.ComputerID;
The results:
╔══════════════╦══════════════════════════╦════════════════╦════════════╦══════════════════╦═════════════════════╗║ ComputerName ║ ComputerManufacturerName ║ OSManufacturer ║ OSName ║ ArchitectureName ║ ArchitectureVersion ║╠══════════════╬══════════════════════════╬════════════════╬════════════╬══════════════════╬═════════════════════╣║ CM700-01 ║ HP ║ Microsoft ║ Windows 10 ║ x64 ║ 1.00 ║║ CM700-02 ║ HP ║ Microsoft ║ Windows 10 ║ x64 ║ 1.00 ║╚══════════════╩══════════════════════════╩════════════════╩════════════╩══════════════════╩═════════════════════╝
Image may be NSFW.
Clik here to view.
The I/O statistics are even more telling. For the natural keys, we have:
Table 'OS'. Scan count 0, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Computers'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
For the surrogate key setup, we get:
Table 'Manufacturer'. Scan count 0, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.Table 'Architecture'. Scan count 0, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.Table 'OS'. Scan count 0, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.Table 'Computers'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Quite clearly, in the above admittedly very simple setup, the surrogate key is lagging in both ease-of-use, and performance.
Having said that, what happens if we need to change the name of one of the manufacturers? Here's the T-SQL for the natural key version:
UPDATE dbo.Manufacturer SET ManufacturerName = 'Microsoft™'WHERE ManufacturerName = 'Microsoft';
And the plan:
Image may be NSFW.
Clik here to view.
The T-SQL for the surrogate key version:
UPDATE dbo.Manufacturer SET ManufacturerName = 'Microsoft™'WHERE ManufacturerID = 1;
And that plan:
Image may be NSFW.
Clik here to view.
The natural key version has an estimated subtree cost that is nearly three times greater than the surrogate key version.
Am I correct in saying that both natural keys and surrogate keys offer benefits; deciding which methodology to use should be carefully considered?
Are there common situations where the comparisons I made above don't work? What other considerations should be made when choosing natural or surrogate keys?