Employee ID | Salary |
3 | 200 |
4 | 800 |
7 | 450 |
Query:
SELECT * /*This is the outer query part */ FROM Employee Emp1 WHERE (N-1) = ( /* Subquery starts here */ SELECT COUNT(DISTINCT(Emp2.Salary)) FROM Employee Emp2 WHERE Emp2.Salary > Emp1.Salary)
How does the query above work?
The query above can be quite confusing if you have not seen anything like it before – pay special attention to the fact that “Emp1” appears in both the subquery (also known as an inner query) and the “outer” query. The outer query is just the part of the query that is not the subquery/inner query – both parts of the query are clearly labeled in the comments.The subquery is a correlated subquery
The subquery in the SQL above is actually a specific type of subquery known as a correlated subquery. The reason it is called a correlated subquery is because the the subquery uses a value from the outer query in it’s WHERE clause. In this case that value is the Emp1 table alias as we pointed out earlier. A normal subquery can be run independently of the outer query, but a correlated subquery can NOT be run independently of the outer query. If you want to read more about the differences between correlated and uncorrelated subqueries you can go here: Correlated vs Uncorrelated Subqueries.The most important thing to understand in the query above is that the subquery is evaluated each and every time a row is processed by the outer query. In other words, the inner query can not be processed independently of the outer query since the inner query uses the Emp1 value as well.Finding nth highest salary example and explanation
Let’s step through an actual example to see how the query above will actually execute step by step. Suppose we are looking for the 2nd highest Salary value in our table above, so our N is 2. This means that the query will look like this:SELECT * FROM Employee Emp1 WHERE (1) = ( SELECT COUNT(DISTINCT(Emp2.Salary)) FROM Employee Emp2 WHERE Emp2.Salary > Emp1.Salary)You can probably see that Emp1 and Emp2 are just aliases for the same Employee table – it’s like we just created 2 separate clones of the Employee table and gave them different names.Understanding and visualizing how the query above works
Let’s assume that we are using this data:
Employee ID | Salary |
3 | 200 |
4 | 800 |
7 | 450 |
For the sake of our explanation, let’s assume that N is 2 – so the query is trying to find the 2nd highest salary in the Employee table. The first thing that the query above does is process the very first row of the Employee table, which has an alias of Emp1.
The salary in the first row of the Employee table is 200. Because the subquery is correlated to the outer query through the alias Emp1, it means that when the first row is processed, the query will essentially look like this – note that all we did is replace Emp1.Salary with the value of 200:
SELECT * FROM Employee Emp1 WHERE (1) = ( SELECT COUNT(DISTINCT(Emp2.Salary)) FROM Employee Emp2 WHERE Emp2.Salary > 200)
So, what exactly is happening when that first row is processed? Well, if you pay special attention to the subquery you will notice that it’s basically searching for the count of salary entries in the Employee table that are greater than 200. Basically, the subquery is trying to find how many salary entries are greater than 200. Then, that count of salary entries is checked to see if it equals 1 in the outer query, and if so then everything from that particular row in Emp1 will be returned.
Note that Emp1 and Emp2 are both aliases for the same table – Employee. Emp2 is only being used in the subquery to compare all the salary values to the current salary value chosen in Emp1. This allows us to find the number of salary entries (the count) that are greater than 200. And if this number is equal to N-1 (which is 1 in our case) then we know that we have a winner – and that we have found our answer.
But, it’s clear that the subquery will return a 2 when Emp1.Salary is 200, because there are clearly 2 salaries greater than 200 in the Employee table. And since 2 is not equal to 1, the salary of 200 will clearly not be returned.
So, what happens next? Well, the SQL processor will move on to the next row which is 800, and the resulting query looks like this:
SELECT * FROM Employee Emp1 WHERE (1) = ( SELECT COUNT(DISTINCT(Emp2.Salary)) FROM Employee Emp2 WHERE Emp2.Salary > 800)
Since there are no salaries greater than 800, the query will move on to the last row and will of course find the answer as 450. This is because 800 is greater than 450, and the count will be 1. More precisely, the entire row with the desired salary would be returned, and this is what it would look like:
EmployeeID | Salary |
7 | 450 |
It’s also worth pointing out that the reason DISTINCT is used in the query above is because there may be duplicate salary values in the table. In that scenario, we only want to count repeated salaries just once, which is exactly why we use the DISTINCT operator.
A high level summary of how the query works
Let’s go through a high level summary of how someone would have come up with the SQL in the first place – since we showed you the answer first without really going through the thought process one would use to arrive at that answer.
Think of it this way – we are looking for a pattern that will lead us to the answer. One way to look at it is that the 2nd highest salary would have just one salary that is greater than it. The 4th highest salary would have 3 salaries that are greater than it. In more general terms, in order to find the Nth highest salary, we just find the salary that has exactly N-1 salaries greater than itself. And that is exactly what the query above accomplishes – it simply finds the salary that has N-1 salaries greater than itself and returns that value as the answer.
No comments:
Post a Comment