SQL self join table and explaination to find min/max values
Introduction
It’s general and popular to find min/max value of a column in tables with aggregate functions such as min
and max
.
However, I would like to use vanila way to solve this problem, in addition to explain how does it work with join
.
The explaination can help you get a abstract of join and how does it work in processing.
Issue
To help you understand how join
can help you approach min
and max
, we need a good example. Now let start. We got
alot of people in a company, each of them has more then one assessment during their work time. The question is how to
get the latest, newest
assessment and the first, oldest
assessment. The big question is how to get the latest
and the first assessment answers filter by that person and assessment, remember that an assessment can be done many times. Our table will look like:
table named: assessment_answers;
|-------------+-----------+---------------+--------------------+-------------|
| id | person_id | assessment_id | assessment_answers | inserted_at |
|-------------+-----------+---------------+--------------------+-------------|
| integer, fk | int | int | jsonb | datetime |
|-------------+-----------+---------------+--------------------+-------------|
Thinking approach
To find min/max value of inserted_at
value filtered by the person_id
and assessment_id
, first we need to find all record filtered by person_id
and assessment_id
, then we must find a min/value of inserted_at
via a loop to comparing one to others inserted_at
siblings, one which is smaller than others is the smallest and one which is greater than all others is the greatest. This also can be understand in this way.
- If you can find a
group named A
froma fixed set
filtered by some conditions, all elements in thisgroup A
can find a greater element froma fix set
. On the other hands, other elements which does not belong to thegroup A
cannot find any element which is greater than itself, which also means that those other element are the greatest. - If you can find a
group named B
froma fixed set
filtered by some conditions, all elemments in thisgroup B
can find a smaller element froma fix set
. On the other hands, other elements which does not belong to thegroup B
cannot find any element which is smaller than itself, which also mean that those elements are the smallest.
Solution
I would like to find the latest/newest assessments of all people. Let do it step by step and analyze:
The over all data set of a person who own id is 1. As you can see, for a single person, there are many assessments has been done by him/her. We have to find out the latest assessment answers of each assessment id. For example: assessment answers with id: 1, 7, 8, 9; the person has finished this assessment_id 2 four times.
select id, person_id, assessment_id, inserted_at from assessment_answers where person_id = 1;
id | person_id | assessment_id | inserted_at
----+-----------+---------------+-------------------------
1 | 1 | 2 | 2016-12-14 04:04:46.477 x
2 | 1 | 3 | 2016-12-14 04:07:11.96
3 | 1 | 8 | 2016-12-14 04:07:12.74
4 | 1 | 5 | 2016-12-14 04:07:13.177
5 | 1 | 1 | 2016-12-14 04:07:14.053
6 | 1 | 7 | 2016-12-14 04:07:14.427
7 | 1 | 2 | 2016-12-16 01:19:25.61 x
8 | 1 | 2 | 2017-02-24 22:13:12.79 x
9 | 1 | 2 | 2017-02-25 09:48:03.53 x
10 | 1 | 4 | 2017-02-14 00:00:00
11 | 1 | 4 | 2017-02-22 00:00:00
12 | 1 | 6 | 2017-02-14 00:00:00
(12 rows)
Now, let find a group of element which can find a greater one among the fixed set
. The comparing value based on inserted_at
column.
select aa.id, aa.person_id, aa.assessment_id, aa.inserted_at, ab.id, ab.person_id, ab.assessment_id
from assessment_answers aa join assessment_answers ab
on aa.person_id = ab.person_id
and aa.assessment_id = ab.assessment_id
and aa.inserted_at < ab.inserted_at -> I will explain why we set condition here not under _where_ phrase
where aa.person_id = 1
order by aa.id, ab.id;
id | person_id | assessment_id | inserted_at | id | person_id | assessment_id | inserted_at
----+-----------+---------------+-------------------------+----+-----------+---------------+------------------------
1 | 1 | 2 | 2016-12-14 04:04:46.477 | 7 | 1 | 2 | 2016-12-16 01:19:25.61
1 | 1 | 2 | 2016-12-14 04:04:46.477 | 8 | 1 | 2 | 2017-02-24 22:13:12.79
1 | 1 | 2 | 2016-12-14 04:04:46.477 | 9 | 1 | 2 | 2017-02-25 09:48:03.53
7 | 1 | 2 | 2016-12-16 01:19:25.61 | 8 | 1 | 2 | 2017-02-24 22:13:12.79
7 | 1 | 2 | 2016-12-16 01:19:25.61 | 9 | 1 | 2 | 2017-02-25 09:48:03.53
8 | 1 | 2 | 2017-02-24 22:13:12.79 | 9 | 1 | 2 | 2017-02-25 09:48:03.53
10 | 1 | 4 | 2017-02-14 00:00:00 | 11 | 1 | 4 | 2017-02-22 00:00:00
(7 rows)
The fixed set we are talking about is starting from id 1 -> 12
, however, our group contain 1, 7, 8, 10
. 1, 7, 8, 10
can find greater element for instance:
id | person_id | assessment_id | inserted_at | id | person_id | assessment_id | inserted_at
----+-----------+---------------+-------------------------+----+-----------+---------------+------------------------
1 | 1 | 2 | 2016-12-14 04:04:46.477 | 7 | 1 | 2 | 2016-12-16 01:19:25.61
The inserted_at(id_1)
is smaller than inserted_at(id_7)
. For those elements which does not belong to this group cannot find greater element, because they are greatest ones. The other elements should include 2, 3, 4, 5, 6, 9, 11, 12
. The previous query used join
phrase as a consequence, it remove all unsatisfied element for condition aa.insrted_at < ab.inserted_at
. On the other hands, left join
takes satisfied and unsatisfied records.
select aa.id, aa.person_id, aa.assessment_id, aa.inserted_at, ab.id, ab.person_id, ab.assessment_id, ab.inserted_at from assessment_answers aa
left join assessment_answers ab
on aa.person_id = ab.person_id
and aa.assessment_id = ab.assessment_id
and aa.inserted_at < ab.inserted_at
where aa.person_id = 1
order by aa.id, ab.id;
id | person_id | assessment_id | inserted_at | id | person_id | assessment_id | inserted_at
----+-----------+---------------+-------------------------+----+-----------+---------------+------------------------
1 | 1 | 2 | 2016-12-14 04:04:46.477 | 7 | 1 | 2 | 2016-12-16 01:19:25.61
1 | 1 | 2 | 2016-12-14 04:04:46.477 | 8 | 1 | 2 | 2017-02-24 22:13:12.79
1 | 1 | 2 | 2016-12-14 04:04:46.477 | 9 | 1 | 2 | 2017-02-25 09:48:03.53
2 | 1 | 3 | 2016-12-14 04:07:11.96 | | | |
3 | 1 | 8 | 2016-12-14 04:07:12.74 | | | |
4 | 1 | 5 | 2016-12-14 04:07:13.177 | | | |
5 | 1 | 1 | 2016-12-14 04:07:14.053 | | | |
6 | 1 | 7 | 2016-12-14 04:07:14.427 | | | |
7 | 1 | 2 | 2016-12-16 01:19:25.61 | 8 | 1 | 2 | 2017-02-24 22:13:12.79
7 | 1 | 2 | 2016-12-16 01:19:25.61 | 9 | 1 | 2 | 2017-02-25 09:48:03.53
8 | 1 | 2 | 2017-02-24 22:13:12.79 | 9 | 1 | 2 | 2017-02-25 09:48:03.53
9 | 1 | 2 | 2017-02-25 09:48:03.53 | | | |
10 | 1 | 4 | 2017-02-14 00:00:00 | 11 | 1 | 4 | 2017-02-22 00:00:00
11 | 1 | 4 | 2017-02-22 00:00:00 | | | |
12 | 1 | 6 | 2017-02-14 00:00:00 | | | |
(15 rows)
Now let check the id of assessment answers records if the id 2, 3, 4, 5, 6, 7, 9, 11, 12
are the greatest. They cannot find any greater elements.
Let finalize our works to extract the greatest element only
select aa.id, aa.person_id, aa.assessment_id, aa.inserted_at, ab.id, ab.person_id, ab.assessment_id, ab.inserted_at from assessment_answers aa
left join assessment_answers ab
on aa.person_id = ab.person_id
and aa.assessment_id = ab.assessment_id
and aa.inserted_at < ab.inserted_at
where aa.person_id = 1
and ab.id is null
order by aa.id, ab.id;
id | person_id | assessment_id | inserted_at | id | person_id | assessment_id | inserted_at
----+-----------+---------------+-------------------------+----+-----------+---------------+-------------
2 | 1 | 3 | 2016-12-14 04:07:11.96 | | | |
3 | 1 | 8 | 2016-12-14 04:07:12.74 | | | |
4 | 1 | 5 | 2016-12-14 04:07:13.177 | | | |
5 | 1 | 1 | 2016-12-14 04:07:14.053 | | | |
6 | 1 | 7 | 2016-12-14 04:07:14.427 | | | |
9 | 1 | 2 | 2017-02-25 09:48:03.53 | | | |
11 | 1 | 4 | 2017-02-22 00:00:00 | | | |
12 | 1 | 6 | 2017-02-14 00:00:00 | | | |
(8 rows)
And for short, here it’s final shot to find the latest/newest assessment answers of user(id: 1)
select aa.id, aa.person_id, aa.assessment_id, aa.inserted_at from assessment_answers aa
left join assessment_answers ab
on aa.person_id = ab.person_id
and aa.assessment_id = ab.assessment_id
and aa.inserted_at < ab.inserted_at
where aa.person_id = 1
and ab.id is null
order by aa.id;
id | person_id | assessment_id | inserted_at
----+-----------+---------------+-------------------------
2 | 1 | 3 | 2016-12-14 04:07:11.96
3 | 1 | 8 | 2016-12-14 04:07:12.74
4 | 1 | 5 | 2016-12-14 04:07:13.177
5 | 1 | 1 | 2016-12-14 04:07:14.053
6 | 1 | 7 | 2016-12-14 04:07:14.427
9 | 1 | 2 | 2017-02-25 09:48:03.53
11 | 1 | 4 | 2017-02-22 00:00:00
12 | 1 | 6 | 2017-02-14 00:00:00
(8 rows)
To find the smallest value, here is your solution:
select aa.id, aa.person_id, aa.assessment_id, aa.inserted_at from assessment_answers aa
left join assessment_answers ab
on aa.person_id = ab.person_id
and aa.assessment_id = ab.assessment_id
and aa.inserted_at > ab.inserted_at
where aa.person_id = 1
and ab.id is null
order by aa.id;
id | person_id | assessment_id | inserted_at
----+-----------+---------------+-------------------------
1 | 1 | 2 | 2016-12-14 04:04:46.477
2 | 1 | 3 | 2016-12-14 04:07:11.96
3 | 1 | 8 | 2016-12-14 04:07:12.74
4 | 1 | 5 | 2016-12-14 04:07:13.177
5 | 1 | 1 | 2016-12-14 04:07:14.053
6 | 1 | 7 | 2016-12-14 04:07:14.427
10 | 1 | 4 | 2017-02-14 00:00:00
12 | 1 | 6 | 2017-02-14 00:00:00
Quesions
Q: Can we place the condition aa.inserted_at < ab.inserted_at
under where
phrase instead of under lelft join on
phrase, and why?
A: No, if we put the condition aa.inserted_at < ab.inserted_at
under the where
phrase, we can only get a group of not-greatest elements
select aa.id, aa.person_id, aa.assessment_id, aa.inserted_at, ab.id, ab.person_id, ab.assessment_id, ab.inserted_at from assessment_answers aa
left join assessment_answers ab
on aa.person_id = ab.person_id
and aa.assessment_id = ab.assessment_id
where aa.person_id = 1
and aa.inserted_at < ab.inserted_at
order by aa.id, ab.id;
id | person_id | assessment_id | inserted_at | id | person_id | assessment_id | inserted_at
----+-----------+---------------+-------------------------+----+-----------+---------------+------------------------
1 | 1 | 2 | 2016-12-14 04:04:46.477 | 7 | 1 | 2 | 2016-12-16 01:19:25.61
1 | 1 | 2 | 2016-12-14 04:04:46.477 | 8 | 1 | 2 | 2017-02-24 22:13:12.79
1 | 1 | 2 | 2016-12-14 04:04:46.477 | 9 | 1 | 2 | 2017-02-25 09:48:03.53
7 | 1 | 2 | 2016-12-16 01:19:25.61 | 8 | 1 | 2 | 2017-02-24 22:13:12.79
7 | 1 | 2 | 2016-12-16 01:19:25.61 | 9 | 1 | 2 | 2017-02-25 09:48:03.53
8 | 1 | 2 | 2017-02-24 22:13:12.79 | 9 | 1 | 2 | 2017-02-25 09:48:03.53
10 | 1 | 4 | 2017-02-14 00:00:00 | 11 | 1 | 4 | 2017-02-22 00:00:00
(7 rows)
Q: If I use left join
in this query, why dont I get a combine of satisfied and unsatisfied records:
A: This is how the query look like and its result set.
select aa.id, aa.person_id, aa.assessment_id, aa.inserted_at, ab.id, ab.person_id, ab.assessment_id, ab.inserted_at from assessment_answers aa
left join assessment_answers ab
on aa.person_id = ab.person_id
and aa.assessment_id = ab.assessment_id
where aa.person_id = 1
order by aa.id, ab.id;
id | person_id | assessment_id | inserted_at | id | person_id | assessment_id | inserted_at
----+-----------+---------------+-------------------------+----+-----------+---------------+-------------------------
1 | 1 | 2 | 2016-12-14 04:04:46.477 | 1 | 1 | 2 | 2016-12-14 04:04:46.477
1 | 1 | 2 | 2016-12-14 04:04:46.477 | 7 | 1 | 2 | 2016-12-16 01:19:25.61
1 | 1 | 2 | 2016-12-14 04:04:46.477 | 8 | 1 | 2 | 2017-02-24 22:13:12.79
1 | 1 | 2 | 2016-12-14 04:04:46.477 | 9 | 1 | 2 | 2017-02-25 09:48:03.53
2 | 1 | 3 | 2016-12-14 04:07:11.96 | 2 | 1 | 3 | 2016-12-14 04:07:11.96
3 | 1 | 8 | 2016-12-14 04:07:12.74 | 3 | 1 | 8 | 2016-12-14 04:07:12.74
4 | 1 | 5 | 2016-12-14 04:07:13.177 | 4 | 1 | 5 | 2016-12-14 04:07:13.177
5 | 1 | 1 | 2016-12-14 04:07:14.053 | 5 | 1 | 1 | 2016-12-14 04:07:14.053
6 | 1 | 7 | 2016-12-14 04:07:14.427 | 6 | 1 | 7 | 2016-12-14 04:07:14.427
7 | 1 | 2 | 2016-12-16 01:19:25.61 | 1 | 1 | 2 | 2016-12-14 04:04:46.477
7 | 1 | 2 | 2016-12-16 01:19:25.61 | 7 | 1 | 2 | 2016-12-16 01:19:25.61
7 | 1 | 2 | 2016-12-16 01:19:25.61 | 8 | 1 | 2 | 2017-02-24 22:13:12.79
7 | 1 | 2 | 2016-12-16 01:19:25.61 | 9 | 1 | 2 | 2017-02-25 09:48:03.53
8 | 1 | 2 | 2017-02-24 22:13:12.79 | 1 | 1 | 2 | 2016-12-14 04:04:46.477
8 | 1 | 2 | 2017-02-24 22:13:12.79 | 7 | 1 | 2 | 2016-12-16 01:19:25.61
8 | 1 | 2 | 2017-02-24 22:13:12.79 | 8 | 1 | 2 | 2017-02-24 22:13:12.79
8 | 1 | 2 | 2017-02-24 22:13:12.79 | 9 | 1 | 2 | 2017-02-25 09:48:03.53
9 | 1 | 2 | 2017-02-25 09:48:03.53 | 1 | 1 | 2 | 2016-12-14 04:04:46.477
9 | 1 | 2 | 2017-02-25 09:48:03.53 | 7 | 1 | 2 | 2016-12-16 01:19:25.61
9 | 1 | 2 | 2017-02-25 09:48:03.53 | 8 | 1 | 2 | 2017-02-24 22:13:12.79
9 | 1 | 2 | 2017-02-25 09:48:03.53 | 9 | 1 | 2 | 2017-02-25 09:48:03.53
10 | 1 | 4 | 2017-02-14 00:00:00 | 10 | 1 | 4 | 2017-02-14 00:00:00
10 | 1 | 4 | 2017-02-14 00:00:00 | 11 | 1 | 4 | 2017-02-22 00:00:00
11 | 1 | 4 | 2017-02-22 00:00:00 | 10 | 1 | 4 | 2017-02-14 00:00:00
11 | 1 | 4 | 2017-02-22 00:00:00 | 11 | 1 | 4 | 2017-02-22 00:00:00
12 | 1 | 6 | 2017-02-14 00:00:00 | 12 | 1 | 6 | 2017-02-14 00:00:00
(26 rows)
The condition I am talking here is aa.person_id = ab.person_id
and aa.assessment_id = ab.assessment_id
- Satisfied condition is
aa.person_id == ab.person_id
andaa.assessment_id == ab.assessment_id
- Unsatisfied condition is
aa.person_id != ab.person_id
oraa.assessment_id != ab.assessment_id
Why the record that are not satisfied exist in the result set. Hmm, that’s a question I am think about. Personally, I think the developers of postgres did make an exception, hardcode perhap if people useequal
in thejoin
expression.