Data cleaning with SQL: filling missing dates
Filling in missing dates is a common data cleaning task that can be a bit of a curly problem when taking into account different groups of data within a dataset.
To demonstrate this (and the solutions) I'll be using a table called page_traffic
:
DT | PAGE_NAME | PAGE_VIEWS |
---|---|---|
2024-09-01 | Homepage | 10 |
2024-09-02 | Homepage | 15 |
2024-09-05 | Homepage | 13 |
2024-09-02 | International news | 2 |
2024-09-08 | International news | 9 |
What we want is a row for each missing date in this dataset:
DATES_DT | PAGE_NAME | PAGE_VIEWS |
---|---|---|
2024-09-01 | Homepage | 10 |
2024-09-02 | Homepage | 15 |
2024-09-03 | Homepage | |
2024-09-04 | Homepage | |
2024-09-05 | Homepage | 13 |
2024-09-06 | Homepage | |
2024-09-07 | Homepage | |
2024-09-08 | Homepage | |
2024-09-01 | International news | |
2024-09-02 | International news | 2 |
2024-09-03 | International news | |
2024-09-04 | International news | |
2024-09-05 | International news | |
2024-09-06 | International news | |
2024-09-07 | International news | |
2024-09-08 | International news | 9 |
Note that this solution was developed using Snowflake but it should work for all RDBMs with some syntax adjustments.
Here's the code for creating the table and inserting the values:
CREATE OR REPLACE TABLE PAGE_TRAFFIC(
DT DATE,
PAGE_NAME VARCHAR(100),
PAGE_VIEWS NUMBER(38,0)
);
INSERT INTO PAGE_TRAFFIC (DT, PAGE_NAME, PAGE_VIEWS) VALUES('2024-09-01', 'Homepage', 10);
INSERT INTO PAGE_TRAFFIC (DT, PAGE_NAME, PAGE_VIEWS) VALUES('2024-09-02', 'Homepage', 15);
INSERT INTO PAGE_TRAFFIC (DT, PAGE_NAME, PAGE_VIEWS) VALUES('2024-09-05', 'Homepage', 13);
INSERT INTO PAGE_TRAFFIC (DT, PAGE_NAME, PAGE_VIEWS) VALUES('2024-09-02', 'International news', 2);
INSERT INTO PAGE_TRAFFIC (DT, PAGE_NAME, PAGE_VIEWS) VALUES('2024-09-08', 'International news', 9);
If you don't want to read the explanation you can jump straight to the solutions.
Generating the dates
First let's use a recursive CTE to create the date range we're after:
WITH RECURSIVE DATES AS
(
SELECT DATE('2024-09-01') AS dt
UNION ALL
SELECT DATEADD(DAY, 1, dt) AS dt
FROM DATES
WHERE dt < '2024-09-08'
)
SELECT *
FROM DATES
;
DT |
---|
2024-09-01 |
2024-09-02 |
2024-09-03 |
2024-09-04 |
2024-09-05 |
2024-09-06 |
2024-09-07 |
2024-09-08 |
Let's fill some gaps!
Let's focus just on the Homepage rows for now.
Using a RIGHT JOIN
we'll return every row from DATES
and also return PAGE_VIEWS
from page_traffic
where they exist:
WITH RECURSIVE DATES AS
(
SELECT DATE('2024-09-01') AS dt
UNION ALL
SELECT DATEADD(DAY, 1, dt) AS dt
FROM DATES
WHERE DT < '2024-09-08'
)
SELECT
DATES.DT
, PAGE_VIEWS -- NULL is returned if there is no match in DATES.
FROM page_traffic AS pt
RIGHT JOIN DATES -- All rows from this CTE will be returned, even if there is no match in page_traffic.
ON pt.dt = dates.dt
AND page_name = 'Homepage'
;
Our missing dates have been added to the dataset:
DT | PAGE_VIEWS |
---|---|
2024-09-01 | 10 |
2024-09-02 | 15 |
2024-09-03 | |
2024-09-04 | |
2024-09-05 | 13 |
2024-09-06 | |
2024-09-07 | |
2024-09-08 |
Including additional columns
However we also want to include the page_name
column.
Including the page_name
column in our SELECT
statement won't work with our current query because the DATES
dataset only
has a match with rows in page_traffic
that have the same date:
WITH RECURSIVE DATES AS ... -- Truncuating this to make these code blocks more readable.
SELECT
DATES.dt
, page_name
, page_views
FROM page_traffic AS pt
RIGHT JOIN DATES
ON pt.dt = dates.dt
AND page_name = 'Homepage'
;
DT | PAGE_NAME | PAGE_VIEWS |
---|---|---|
2024-09-01 | Homepage | 10 |
2024-09-02 | Homepage | 15 |
2024-09-03 | ||
2024-09-04 | ||
2024-09-05 | Homepage | 13 |
2024-09-06 | ||
2024-09-07 | ||
2024-09-08 |
To solve this problem we could harcode the value, use COALESCE()
or a window function:
WITH RECURSIVE DATES AS ...
SELECT
DATES.dt
, 'Homepage' AS page_name
, COALESCE(page_name, 'Homepage') AS page_name_v2
, FIRST_VALUE(PAGE_NAME) IGNORE NULLS OVER (ORDER BY DATES.DT) AS page_name_v3
, PAGE_VIEWS
FROM page_traffic AS pt
RIGHT JOIN DATES
ON pt.dt = dates.dt
AND page_name = 'Homepage'
;
However these solutions won't work considering that our dataset includes more than one unique value in the page_name
column.
Remember, up until now I've just been focused on the Homepage rows.
One solution would be to use COALESCE
with a SELECT
query for each unique group in page_traffic
and use a UNION ALL
to create a complete dataset...
WITH RECURSIVE DATES AS ...
SELECT
DATES.dt
, COALESCE(page_name, 'Homepage') AS page_name
, page_views
FROM page_traffic AS pt
RIGHT JOIN DATES
ON pt.DT = dates.DT
AND page_name = 'Homepage'
UNION ALL
SELECT
DATES.DT
, COALESCE(page_name, 'International news') AS page_name
, page_views
FROM page_traffic AS pt
RIGHT JOIN DATES
ON pt.dt = dates.dt
AND page_name = 'International news'
;
... however this is a cumbersome solution, particularly when there is a more elegant, dynamic solutions.
Solution 1
We want to ensure that every date/row from DATES
is paired wth as many unique values as there are in our page_traffic
table.
Once we've done this, we can deduplicate the data so that page_views
data is only returned when it exists.
We can accomplish this by using a CROSS JOIN
which will join each row in our DATES
data with each row in our page_views
table (resulting in what is known as a Cartesian product).
Let's examine what that looks like:
WITH RECURSIVE DATES AS ...
SELECT
pt.*
, dates.dt as dates_dt -- Note I've renamed it this to make it a bit easier to understand.
FROM page_traffic AS pt
CROSS JOIN DATES
ORDER BY page_name, pt.dt, dates.dt
;
DT | PAGE_NAME | PAGE_VIEWS | DATES_DT |
---|---|---|---|
2024-09-01 | Homepage | 10 | 2024-09-01 |
2024-09-01 | Homepage | 10 | 2024-09-02 |
2024-09-01 | Homepage | 10 | 2024-09-03 |
2024-09-01 | Homepage | 10 | 2024-09-04 |
2024-09-01 | Homepage | 10 | 2024-09-05 |
2024-09-01 | Homepage | 10 | 2024-09-06 |
2024-09-01 | Homepage | 10 | 2024-09-07 |
2024-09-01 | Homepage | 10 | 2024-09-08 |
2024-09-02 | Homepage | 15 | 2024-09-01 |
2024-09-02 | Homepage | 15 | 2024-09-02 |
2024-09-02 | Homepage | 15 | 2024-09-03 |
2024-09-02 | Homepage | 15 | 2024-09-04 |
2024-09-02 | Homepage | 15 | 2024-09-05 |
2024-09-02 | Homepage | 15 | 2024-09-06 |
2024-09-02 | Homepage | 15 | 2024-09-07 |
2024-09-02 | Homepage | 15 | 2024-09-08 |
2024-09-05 | Homepage | 13 | 2024-09-01 |
2024-09-05 | Homepage | 13 | 2024-09-02 |
2024-09-05 | Homepage | 13 | 2024-09-03 |
2024-09-05 | Homepage | 13 | 2024-09-04 |
2024-09-05 | Homepage | 13 | 2024-09-05 |
2024-09-05 | Homepage | 13 | 2024-09-06 |
2024-09-05 | Homepage | 13 | 2024-09-07 |
2024-09-05 | Homepage | 13 | 2024-09-08 |
2024-09-02 | International news | 2 | 2024-09-01 |
2024-09-02 | International news | 2 | 2024-09-02 |
2024-09-02 | International news | 2 | 2024-09-03 |
2024-09-02 | International news | 2 | 2024-09-04 |
2024-09-02 | International news | 2 | 2024-09-05 |
2024-09-02 | International news | 2 | 2024-09-06 |
2024-09-02 | International news | 2 | 2024-09-07 |
2024-09-02 | International news | 2 | 2024-09-08 |
2024-09-08 | International news | 9 | 2024-09-01 |
2024-09-08 | International news | 9 | 2024-09-02 |
2024-09-08 | International news | 9 | 2024-09-03 |
2024-09-08 | International news | 9 | 2024-09-04 |
2024-09-08 | International news | 9 | 2024-09-05 |
2024-09-08 | International news | 9 | 2024-09-06 |
2024-09-08 | International news | 9 | 2024-09-07 |
2024-09-08 | International news | 9 | 2024-09-08 |
If you find this confusing, just remember that every date in our date range (1st of September to 8th of September) has been joined to every row in our page_traffic
table. Doing this means we can dynamically pair every date with as many unique values as there are in the page_traffic
table.
For example, our 1st of September row in our page_traffic
table has been duplicated 8 times, for every row in our DATES
CTE.
If we were to ignore the page_views
column our solution would be complete. We can just use the DISTICT
keyword to only return unique rows:
WITH RECURSIVE DATES AS ...
SELECT
DISTINCT
DATES.dt
, PT.page_name
FROM page_traffic AS pt
CROSS JOIN DATES
ORDER BY page_name, dt
;
DT | PAGE_NAME |
---|---|
2024-09-01 | Homepage |
2024-09-02 | Homepage |
2024-09-03 | Homepage |
2024-09-04 | Homepage |
2024-09-05 | Homepage |
2024-09-06 | Homepage |
2024-09-07 | Homepage |
2024-09-08 | Homepage |
2024-09-01 | International news |
2024-09-02 | International news |
2024-09-03 | International news |
2024-09-04 | International news |
2024-09-05 | International news |
2024-09-06 | International news |
2024-09-07 | International news |
2024-09-08 | International news |
However of course we'll want to include page_views
and this is where it gets a little trickier.
Let's use 2024-09-01
for the Homepage as an example:
WITH RECURSIVE DATES AS ...
SELECT
dates.dt AS dates_dt
, pt.*
FROM page_traffic AS pt
CROSS JOIN DATES
WHERE dates.dt = '2024-09-01'
AND page_name = 'Homepage'
ORDER BY page_name, pt.dt, dates.dt
;
DATES_DT | DT | PAGE_NAME | PAGE_VIEWS |
---|---|---|---|
2024-09-01 | 2024-09-01 | Homepage | 10 |
2024-09-01 | 2024-09-02 | Homepage | 15 |
2024-09-01 | 2024-09-05 | Homepage | 13 |
We want to exclude the page_views
for the 2024-09-02
and 2024-09-05
rows (as these are simply the product of joining 2024-08-01
to every row in page_traffic
) and so let's use a CASE
statement to do to do this:
WITH RECURSIVE DATES ...
SELECT
DISTINCT
dates.dt AS dates_dt
, pt.page_name
, CASE WHEN DATES.dt = PT.dt THEN page_views END AS pv
FROM page_traffic AS pt
CROSS JOIN DATES
WHERE dates.dt = '2024-09-01'
AND page_name = 'Homepage'
;
DATES_DT | PAGE_NAME | PAGE_VIEWS |
---|---|---|
2024-09-01 | Homepage | 10 |
2024-09-01 | Homepage |
Great! If page views exist for a date (as determined by our CASE
statement) that's what will be returned.
Next we'll want to only include the date with a non-NULL page_view
and we can accomplish this by using the RANK()
window function:
WITH RECURSIVE DATES ...
SELECT
DISTINCT
dates.dt AS dates_dt
, pt.page_name
, CASE WHEN DATES.dt = PT.dt THEN page_views END AS pv -- Be sure to rename this column.
, RANK() OVER (PARTITION BY dates_dt, page_name ORDER BY pv) AS pv_ranking
FROM page_traffic AS pt
CROSS JOIN DATES
WHERE dates.dt = '2024-09-01'
AND page_name = 'Homepage'
;
DATES_DT | PAGE_NAME | PV | PV_RANKING |
---|---|---|---|
2024-09-01 | Homepage | 10 | 1 |
2024-09-01 | Homepage | 2 |
As you can see, rows with a non-NULL value will always be ranked 1 and we just need to filter out rows without this rank.
What about dates on which there are no page views? NULL
values will recieve the same rank but our DISTINCT
keyword will take care of them (returning only 1 row).
And that's our first solution!
If you're using Snowflake or any other RDBMS where the QUALIFY
clause is supported we can filter to rows ranked 1 without having to using an inline-view or additional CTE:
-- RDBMS with QUALIFY:
WITH RECURSIVE DATES AS
(
SELECT DATE('2024-09-01') AS dt
UNION ALL
SELECT DATEADD(DAY, 1, dt) AS dt
FROM DATES
WHERE dt < '2024-09-08'
)
SELECT
DISTINCT
dates.dt AS dates_dt
, pt.page_name
, CASE WHEN DATES.dt = PT.dt THEN page_views END AS pv
FROM page_traffic AS pt
CROSS JOIN DATES
QUALIFY RANK() OVER (PARTITION BY dates_dt, page_name ORDER BY pv) = 1
ORDER BY page_name, dates_dt
;
-- If QUALIFY is not supported:
WITH RECURSIVE DATES AS
(
SELECT DATE('2024-09-01') AS dt
UNION ALL
SELECT DATEADD(DAY, 1, DT) AS pt
FROM DATES
WHERE dt < '2024-09-08'
),
final_dataset AS
(
SELECT
dates.dt AS dates_dt
, pt.page_name
, CASE WHEN DATES.DT = PT.DT THEN page_views END AS pv
, RANK() OVER (PARTITION BY dates_dt, page_name ORDER BY pv) AS pv_rank
FROM page_traffic AS pt
CROSS JOIN DATES
)
SELECT
DISTINCT
dates_dt
, page_name
, pv
FROM final_dataset
WHERE pv_rank = 1
ORDER BY page_name, dates_dt
;
DATES_DT | PAGE_NAME | PV |
---|---|---|
2024-09-01 | Homepage | 10 |
2024-09-02 | Homepage | 15 |
2024-09-03 | Homepage | |
2024-09-04 | Homepage | |
2024-09-05 | Homepage | 13 |
2024-09-06 | Homepage | |
2024-09-07 | Homepage | |
2024-09-08 | Homepage | |
2024-09-01 | International news | |
2024-09-02 | International news | 2 |
2024-09-03 | International news | |
2024-09-04 | International news | |
2024-09-05 | International news | |
2024-09-06 | International news | |
2024-09-07 | International news | |
2024-09-08 | International news | 9 |
Solution 2
Solution 2 involves the same approach, in that we're using a CROSS JOIN
to make sure there is a row for every date in our range for every unique page name.
However we'll also use a LEFT JOIN
to take out the manual work of evaluating whether a page view exists for a date.
By using an in-line view of distinct page names we can join every date in our range with all possible page names:
WITH RECURSIVE DATES AS
(
SELECT DATE('2024-09-01') AS dt
UNION ALL
SELECT DATEADD(DAY, 1, dt) AS dt
FROM DATES
WHERE dt < '2024-09-08'
)
SELECT
DISTINCT
dates.dt AS dt
, pn.page_name
FROM DATES
CROSS JOIN (SELECT DISTINCT page_name FROM page_traffic) AS pn
;
dates_dt | page_name |
---|---|
2024-09-01 | Homepage |
2024-09-01 | International news |
2024-09-02 | Homepage |
2024-09-02 | International news |
2024-09-03 | Homepage |
2024-09-03 | International news |
2024-09-04 | Homepage |
2024-09-04 | International news |
2024-09-05 | Homepage |
2024-09-05 | International news |
2024-09-06 | Homepage |
2024-09-06 | International news |
2024-09-07 | Homepage |
2024-09-07 | International news |
2024-09-08 | Homepage |
2024-09-08 | International news |
Next, let's LEFT JOIN
to page_traffic
. If a page view exists that's what
will be returned, otherwise NULL
will be returned:
WITH RECURSIVE DATES AS
(
SELECT DATE('2024-09-01') AS dt
UNION ALL
SELECT DATEADD(DAY, 1, dt) AS dt
FROM DATES
WHERE dt < '2024-09-08'
)
SELECT
dates.dt AS dates_dt
, pn.page_name
, pt.page_views as pv
FROM dates as dates
CROSS JOIN (SELECT DISTINCT page_name FROM page_traffic) AS pn
LEFT join page_traffic as pt
ON pt.dt = dates.dt
AND pt.page_name = pn.page_name
ORDER BY page_name, dates_dt
;
And that's it!
Comments !