A Deep Dive into Gaps and Islands: Calculating Consecutive Days for User Activity

In this article, we explore a SQL query that computes the day_in_row field for a table called FactDailyUsers. Each row records a user who was active on a specific date, the action they performed, and the total number of those actions that day. We’ll break the problem down step by step and explain the terms and techniques the solution relies on.

Problem Statement

The problem statement is as follows:

Given a table called FactDailyUsers, which contains the users who were active on a specific date together with the action they performed (total actions aggregated per row), write a query that calculates the day_in_row field. For each row, day_in_row is the number of consecutive days, up to and including that row’s date, on which the user has returned to the website.

Table Structure

Here is an example table structure:

user_id   date         Action     day_in_row   total_actions
1123      21/06/2018   Pageview   1            1
3321      20/06/2018   Pageview   1            6
3321      20/06/2018   Click      1            2
2122      19/05/2018   Pageview   1            1
2212      21/06/2018   Pageview   2            1
2212      20/06/2018   Pageview   1            1
2212      24/06/2018   Pageview   1            4
2212      24/06/2018   click      1            2
3321      21/06/2018   pageview   2            4
3321      21/06/2018   click      2            3
3321      17/06/2018   pageview   1            1

Solution Overview

The solution breaks the problem into smaller sub-problems and solves each one with its own Common Table Expression (CTE). A final query then reads from the last CTE to produce the result.

Step 1: Create CTEs for Data Preparation

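-- Written for SQL Server (T-SQL). Assumes a helper numbers (tally) table
-- dbo.Numbers with an integer column n that starts at 1.
WITH d AS (
    -- First and last activity dates in the table.
    SELECT MIN(f.date) AS mn, MAX(f.date) AS mx
    FROM FactDailyUsers AS f
),
dt_table AS (
    -- Mini calendar: one row per day from mn to mx, numbered n = 1, 2, 3, ...
    SELECT n.n, DATEADD(DAY, n.n - 1, m.mn) AS d
    FROM d AS m
    JOIN dbo.Numbers AS n
        ON n.n <= DATEDIFF(DAY, m.mn, m.mx) + 1
),
g AS (
    -- Number each user's rows per action in date order.
    SELECT f.*,
        ROW_NUMBER() OVER (PARTITION BY f.USER_ID, f.ACTION ORDER BY f.date) AS r
    FROM FactDailyUsers AS f
),
i AS (
    -- r - n stays constant while dates are consecutive, so it labels each island.
    SELECT g.*, dt.n, g.r - dt.n AS island
    FROM g
    JOIN dt_table AS dt
        ON dt.d = g.date
)
-- The final SELECT in Step 2 completes this statement.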

Step 2: Calculate day_in_row

SELECT *, 
    ROW_NUMBER() OVER (
        PARTITION BY 
            i.USER_ID, 
            i.ACTION, 
            i.island 
        ORDER BY i.date
    ) AS calculated_day_in_row
FROM i
ORDER BY USER_ID, ACTION, date;

Explanation of the Solution

The solution involves four CTEs:

  1. d: This CTE calculates the minimum and maximum dates in the FactDailyUsers table.
  2. dt_table: This CTE creates a mini calendar table with contiguous dates.
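  3. g: This CTE numbers the rows of each user and action in date order, using ROW_NUMBER() partitioned by user and action.
  4. i: This CTE joins each row to the calendar and subtracts the calendar position n from the row number r. The difference stays constant across a run of consecutive days and changes whenever a day is skipped, so it identifies each island.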
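
To make the island key concrete, here is a minimal Python sketch (not part of the original solution) that mirrors dt_table, g, and i for a single user and action. It uses the three pageview dates of user 3321 from the example table. The variable names are illustrative only.

```python
from datetime import date, timedelta

# Pageview dates of user 3321, taken from the example table.
active = [date(2018, 6, 20), date(2018, 6, 21), date(2018, 6, 17)]

# dt_table equivalent: number every calendar day from the first to the
# last date, starting at 1.
first, last = min(active), max(active)
calendar = {
    first + timedelta(days=k): k + 1
    for k in range((last - first).days + 1)
}

rows = []
for r, d in enumerate(sorted(active), start=1):  # r mirrors ROW_NUMBER() in g
    n = calendar[d]                              # n mirrors dt_table.n
    rows.append((d.isoformat(), r, n, r - n))    # r - n is the island key

for row in rows:
    print(row)
```

The row for 17/06 gets island key 0, while 20/06 and 21/06 share the key -2: the gap on 18/06 and 19/06 moves the calendar position ahead of the row number, and the two consecutive days then advance both counters in step.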

The final query selects all columns from the i CTE and adds a new column called calculated_day_in_row. Because ROW_NUMBER() is partitioned by user, action, and island and ordered by date, it restarts at 1 at the beginning of each island, so its value is the row’s position within the current run of consecutive days.
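
As a rough check of the approach, the sketch below loads the example rows into an in-memory SQLite database and compares the computed value with the day_in_row column from the table. It is an adaptation, not the T-SQL above, and rests on three assumptions: SQLite 3.25 or later is available (needed for window functions), SQLite's julianday() stands in for the dbo.Numbers calendar join as the gap-free day counter, and actions that differ only in letter case (Pageview, pageview) count as the same action, hence LOWER(action). Dates are rewritten in ISO format so that they sort correctly as text.

```python
import sqlite3

# Example rows: (user_id, date, action, expected day_in_row, total_actions).
rows = [
    (1123, "2018-06-21", "Pageview", 1, 1),
    (3321, "2018-06-20", "Pageview", 1, 6),
    (3321, "2018-06-20", "Click", 1, 2),
    (2122, "2018-05-19", "Pageview", 1, 1),
    (2212, "2018-06-21", "Pageview", 2, 1),
    (2212, "2018-06-20", "Pageview", 1, 1),
    (2212, "2018-06-24", "Pageview", 1, 4),
    (2212, "2018-06-24", "click", 1, 2),
    (3321, "2018-06-21", "pageview", 2, 4),
    (3321, "2018-06-21", "click", 2, 3),
    (3321, "2018-06-17", "pageview", 1, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE FactDailyUsers "
    "(user_id INTEGER, date TEXT, action TEXT, "
    "day_in_row INTEGER, total_actions INTEGER)"
)
conn.executemany("INSERT INTO FactDailyUsers VALUES (?, ?, ?, ?, ?)", rows)

query = """
WITH g AS (
    SELECT user_id, date, LOWER(action) AS action, day_in_row,
           -- Whole-day counter; plays the role of dt_table.n.
           CAST(julianday(date) AS INTEGER) AS n,
           ROW_NUMBER() OVER (
               PARTITION BY user_id, LOWER(action) ORDER BY date
           ) AS r
    FROM FactDailyUsers
),
i AS (
    -- Constant within a run of consecutive days.
    SELECT *, r - n AS island FROM g
)
SELECT user_id, date, action, day_in_row,
       ROW_NUMBER() OVER (
           PARTITION BY user_id, action, island ORDER BY date
       ) AS calculated_day_in_row
FROM i
ORDER BY user_id, action, date
"""

result = conn.execute(query).fetchall()

# Rows where the computed value differs from the table's day_in_row.
mismatches = [row for row in result if row[3] != row[4]]
print(mismatches)
```

On the example data the list of mismatches comes out empty, meaning the computed column agrees with the day_in_row values shown in the table for every row.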

Advice

  • Always break down complex problems into smaller sub-problems.
  • Use CTEs to simplify your queries and make them more readable.
  • Join tables carefully, making sure that the columns you’re joining on exist in both tables.
  • Use the ROW_NUMBER() function with a suitable PARTITION BY and ORDER BY to number rows within each group; comparing that number with a gap-free sequence is what exposes the islands.

By following these steps and using a combination of CTEs, you can solve complex problems like this one and improve your SQL skills.


Last modified on 2023-08-01