
π Letβs talk about mostly used SQL window functions
Why should we use that?
π SQL window functions are powerful tools for analytical queries, especially in data-heavy environments. They were created to address limitations in traditional SQL operations, particularly for performing analytical calculations across related rows without collapsing the result set.
βΉοΈ Without a PARTITION BY clause, a window function still operates on a window, but the entire result set is treated as a single, undivided window.
Unlike aggregate functions (e.g., SUM, AVG) which group rows and return a single result (a single summary row) per group, window functions operate on a defined set of rows (the "window") related to the current row, allowing calculations to be performed while still returning all individual rows. This is crucial for tasks like calculating running totals, moving averages, or ranking within specific partitions of data.
β³ Let's take a deep dive into the seven most used window functions:
- ROW_NUMBER();
- RANK();
- DENSE_RANK();
- NTILE(n);
- PERCENT_RANK();
- LAG();
- LEAD()
βοΈ Okay, let's see how big tech companies (such as Uber, Google, and Amazon) use these in their data-heavy environments.
[1] ROW_NUMBER() β Deduplication / Picking Top-N
Usecase: When we need a unique sequential number for each row within a partition, even if there are ties.
[β
Always gives unique numbers: 1, 2, 3, 4...; e.g. deduplication, pagination]
| customer_id | order_id | order_date | total_amount |
|---|---|---|---|
| 100 | 1010 | 2025-04-05 | 111.20 |
| 102 | 1007 | 2025-04-04 | 45.50 |
| 103 | 1021 | 2025-04-06 | 203.00 |
| 104 | 1035 | 2025-04-07 | 89.00 |
| 101 | 1012 | 2025-04-05 | 88.59 |
| 101 | 1031 | 2025-04-07 | 129.99 |
| 101 | 1033 | 2025-04-07 | 125.00 |
π π Amazon Example: Finding the single most recent order for every customer
β Amazon wants to identify the latest order for each customer β for follow-up emails, loyalty programs, or customer retention analysis.
| customer_id | order_id | order_date | total_amount | row_number |
|---|---|---|---|---|
| 101 | 1033 | 2025-04-07 | 125.00 | 1 |
| 101 | 1031 | 2025-04-07 | 129.99 | 2 |
| 101 | 1012 | 2025-04-05 | 88.59 | 3 |
SELECT
customer_id,
order_id,
order_date,
total_amount
FROM (
SELECT
customer_id,
order_id,
order_date,
total_amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC, order_id DESC) AS rn
FROM orders
) ranked
WHERE rn = 1;
π Why ROW_NUMBER()?
Amazon wants exactly one most recent order per customer β so ROW_NUMBER() is the right choice.
π π Uber Example: Assigning Rider Support Tickets
β Suppose Uber wants to assign unique ticket IDs to customer support requests per city, ordered by timestamp.
SELECT
city,
request_time,
ROW_NUMBER() OVER (PARTITION BY city ORDER BY request_time) AS ticket_id
FROM support_requests;
π Why ROW_NUMBER()?
Even if two requests happen at the same time, they must get different ticket IDs (no duplicates). Uniqueness is key.
[2] RANK() β Competition-style Ranking
Usecase: When we want to rank rows, but allow gaps in ranking after ties like traditional competition-style ranking.
[β
Gives same rank for ties, then skips the next rank(s): 1, 1, 3, 4... ]
π π Amazon Example: Top-Selling Products by Category
β Amazon wants to rank products in each category by units sold.
SELECT
category,
product_name,
units_sold,
RANK() OVER (PARTITION BY category ORDER BY units_sold DESC) AS sales_rank
FROM product_sales;
Suppose two products tie for 1st place:
Product A: 500 units β Rank 1
Product B: 500 units β Rank 1
Product C: 400 units β Rank 3 (skips 2)
π Why RANK()?
It's okay to have gaps. The business wants to know: "Who are the top 3 ranked sold products?" and it's acceptable that there are 2 products in rank - 1 & no ranked 2 position.
Here, the Product C actually should be ranked as 3; cause its' postion in the 3rd row but as the first two ones are same sold, that's why they should be ranked same. That's the
requirement here.
β OR - Ranking sellers by revenue in a category. If two sellers tie at #1, the next seller is #3. Here, we need to reflect gaps (like Olympic medals: 1st, 2nd, 2nd, 4th).
[3] DENSE_RANK() β Leaderboards Without Gaps
Usecase: When we want to rank with no gaps after ties.
[β
Same rank for ties, but next rank is +1: 1, 1, 2, 3...]
π π Amazon Example: Leaderboard for Internal Sales Reps
β Amazon wants to show a clean leaderboard for sales reps in each region.
SELECT
region,
rep_name,
revenue,
DENSE_RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS leader_rank
FROM sales_reps;
If two reps tie for 1st:
Rep A: $100K β Rank 1
Rep B: $100K β Rank 1
Rep C: $90K β Rank 2
π Why DENSE_RANK()?
We want a clean, intuitive ranking: βTop 3 performersβ should mean 3 ranks (1, 2, 3), not skip numbers. Gaps (RANK()) would confuse users.
π π Google Ads: Ranking keywords by click-through-rate (CTR). If two have the same CTR, next rank continues sequentially.
[4] NTILE(n) β Segmentation / Bucketing
Usecase: When we want to divide rows into roughly equal-sized buckets (e.g., quartiles, deciles).
[β
Splits data into n groups: 1, 2, ..., n ]
π π Uber Example: Driver Performance Quartiles
β Uber wants to categorize drivers into quartiles (4 groups) based on trip completion rate.
SELECT
driver_id,
completion_rate,
NTILE(4) OVER (ORDER BY completion_rate DESC) AS performance_quartile
FROM drivers;
Result:
Quartile 1: Top 25% performers
Quartile 2: Next 25%
...
Quartile 4: Bottom 25%
π Why NTILE(4)?
For performance reviews, bonuses, or identifying underperformers, dividing into equal-sized buckets is more useful than ranks.
π π Google Example: User Engagement Deciles
β Google wants to analyze user activity by dividing users into 10 deciles based on daily usage time.
NTILE(10) OVER (ORDER BY daily_minutes) AS usage_decile
Useful for:
- A/B testing (target top/bottom deciles)
- Identifying power users or churn risks
[5] PERCENT_RANK() β Relative Performance
Usecase: When we want the relative rank as a percentage (0 to 1), useful for percentile analysis.
[β
Formula: (rank - 1) / (total rows - 1)]
π π Google Example: Ad Performance Percentile
β Google Ads wants to show advertisers how their CTR (click-through rate) compares to others.
SELECT
ad_id,
ctr,
PERCENT_RANK() OVER (ORDER BY ctr) AS performance_percentile
FROM ads;
Result:
If performance_percentile = 0.85, your ad performs better than 85% of ads.
π Why PERCENT_RANK()?
It gives an intuitive "you're in the top X%" metric, perfect for dashboards and benchmarks.
π π Amazon Example: Product Rating Distribution
β Show where a productβs rating falls in the overall distribution.
PERCENT_RANK() OVER (ORDER BY avg_rating) AS rating_percentile
Helps answer: βIs this 4-star product above or below average?β
[6] LAG(column, offset, default) β Compare with Previous Row
Usecase: When we want to access the value from a previous row in the result set.
[β
It's critical for comparing current vs. past/future values β especially in time-series data.]
LAG(col) β previous row;
LAG(col, 2) β 2 rows back;
LAG(col, 1, 0) β 1 row back, default to 0 if NULL
π π Google Example: Daily Active Users (DAU) Growth
β Google wants to calculate day-over-day change in DAU.
SELECT
date,
dau,
LAG(dau, 1) OVER (ORDER BY date) AS prev_dau,
dau - LAG(dau, 1) OVER (ORDER BY date) AS daily_change,
ROUND(100.0 * (dau - LAG(dau, 1) OVER (ORDER BY date)) / LAG(dau, 1) OVER (ORDER BY date), 2) AS growth_pct
FROM daily_metrics;
π Why LAG()?
To compare todayβs DAU with yesterdayβs β no self-joins needed. Clean, fast, and scalable.
π π Amazon Example: Price Change Detection
β Amazon tracks product price history and wants to flag when a price changed.
SELECT
product_id,
date,
price,
LAG(price) OVER (PARTITION BY product_id ORDER BY date) AS prev_price,
CASE
WHEN price != LAG(price) OVER (PARTITION BY product_id ORDER BY date)
THEN 'Price Changed'
ELSE 'No Change'
END AS price_status
FROM product_pricing;
π Why LAG()?
To detect trends, discounts, or anomalies in pricing over time β useful for dynamic pricing engines or customer alerts.
π π Uber Example: Driver Session Duration
β Uber wants to calculate time between consecutive trips for a driver (idle time).
SELECT
driver_id,
trip_start_time,
LAG(trip_start_time) OVER (PARTITION BY driver_id ORDER BY trip_start_time) AS last_trip_time,
trip_start_time - LAG(trip_start_time) OVER (PARTITION BY driver_id ORDER BY trip_start_time) AS idle_duration
FROM trips;
π Why LAG()?
To understand driver availability, utilization, and fatigue β key for supply optimization.
[7] LEAD(column, offset, default) β Compare with Next Row
Usecase: When we want to access the value from a future row in the result set.
[β
It's critical for comparing current vs. past/future values β especially in time-series data.]
LEAD(col) β next row
LEAD(col, 2) β 2 rows ahead
LEAD(col, 1, 0) β 1 row ahead, default to 0 if NULL
π π Google Example: Next Event Prediction (User Behavior)
β Google wants to see what users do after watching a video.
SELECT
user_id,
event_timestamp,
event_type,
LEAD(event_type, 1) OVER (PARTITION BY user_id ORDER BY event_timestamp) AS next_action
FROM user_events
WHERE event_type = 'video_play';
Result: Shows if user searched, shared, subscribed, or left after playing a video.
π Why LEAD()?
For funnel analysis, churn prediction, or recommendation systems β understanding what comes next is gold.
π π Amazon Example: Forecasting Stock Refilling
β Amazon wants to predict restocking needs by analyzing time between inventory drops.
SELECT
product_id,
check_date,
stock_level,
LEAD(check_date) OVER (PARTITION BY product_id ORDER BY check_date) AS next_check_date
FROM inventory_logs;
π Why LEAD()?
To analyze inventory decay rate and forecast future stockouts.
π§ π₯Final Decision Tree (7 Functions)π₯
Start
β
ββ Analyzing time-series or sequences? βββ¬β Need previous row value? βββββββββ LAG() [e.g. "What was yesterdayβs revenue?"]
β (e.g., trends, changes, sessions)
β ββ Need next row value? βββββββββββββ LEAD() [e.g. "Whatβs the next event in the user journey?"]
β ββ Need change/gap? ββββββββββββ-ββββ LAG()/LEAD()
β
ββ Ranking/grouping rows? βββββββββββββ-β¬β Need unique row numbers per group?β ROW_NUMBER() [e.g. "Pick the first order per customer" (dedup/pick top)]
ββ Allow rank gaps after ties? βββββββ RANK() [e.g. "Top 10 movies" where two #1s mean next is #3]
ββ No gaps in ranks? βββββββββββββββββ DENSE_RANK() [e.g. "Top 3 performers"]
ββ Split into equal buckets? βββββββββ NTILE(n) [e.g. "Quartiles (NTILE(4)), deciles (NTILE(10))"]
ββ Need percentile (0 to 1)? βββββββββ PERCENT_RANK() [e.g. "Your ad is in the 80th percentile" (0β1 relative position)]
π Mastering them lets we write efficient, insightful queries without complex joins or application logic.