If your BigQuery queries are drowning in duplicates, it’s time to bring out the big guns: QUALIFY + ROW_NUMBER(). This trick is a lifesaver when cleaning up messy datasets, and shoutout to Constantin Lungu for highlighting it! I with great pleasure borrowed his screenshot to make it even clearer!
Here’s the deal:
You need to keep only the first (or last) occurrence of something — say, the first event per user. Easy, right? Just use ROW_NUMBER() over your partition by user and order by event time. Then, use QUALIFY [window function result] = 1, and boom — duplicates gone!
BUT… there’s a catch.
Sometimes, two events happen at the exact same time. Uh-oh. What now?
- ROW_NUMBER() will randomly pick one and move on — clean, but non-deterministic.
- DENSE_RANK() will keep ties, letting you decide what to do next.
Moral of the story? Choose your ranking function wisely!
Want to get all my top Linkedin content? I regularly upload it to one Notion doc.
Go here to download it for FREE.