Cheat Sheet - Detecting Incorrect Data
Data is a beautiful mess.
As data heroes, we spot the mess and get rid of it to highlight the beauty.
Here are a couple useful tricks to detect incorrect data.
We will assume a table that describes the number of meows made by each cat in each box, on a given day.

As data heroes, we spot the mess and get rid of it to highlight the beauty.
Here are a couple useful tricks to detect incorrect data.
We will assume a table that describes the number of meows made by each cat in each box, on a given day.


why analytics? because cats
- NOTE THE DETECTED DATA TYPES
If a field was detected as a different type, its values may not follow the right format

- EXPECT THE UNEXPECTED: NULLS, NUMERIC ANOMALIES, TEXT TYPOS
See which text values are contained in your categorial fields:
SELECT DISTINCT a.cat
FROM table a
Find numeric anomalies by detecting range (or simply visualize the field):
SELECT
CASE WHEN a.meows < 0 THEN "under 0"
WHEN a.meows >= 0 AND a.meows <= 10 THEN " 0 - 10"
WHEN a.meows > 10 AND a.meows <= 50 THEN " 11 - 50"
ELSE "over 50"
END meows_range,
COUNT(*) AS instances
FROM table a
GROUP BY 1
- EXPECT THE UNEXPECTED UNDER CERTAIN CONDITIONS
Repeat the previous bullet, this time under certain business logic
Here we expect Jingle the cat to meow at least once every time he snuggles in any box, therefore we will check Jingle's specific range
SELECT
CASE WHEN a.meows < 0 THEN "under 0"
WHEN a.meows >= 0 AND a.meows <= 10 THEN " 0 - 10"
WHEN a.meows > 10 AND a.meows <= 50 THEN " 11 - 50"
ELSE "over 50"
END meow_range,
COUNT(*) AS instances
FROM table a
WHERE a.cat = 'Jingle'
GROUP BY 1
- VALIDATE WITH REALITY
Sample some specific and aggregate data points and check with external application
Why are there entries of Mia the cat given she spent the day at the vet?
- FIND THOSE DUPLICATES
Make sure that you know what combination of fields make a unique identifier in your table
SELECT a.cat, a.box, COUNT(*) AS duplicates
FROM table a
GROUP BY a.cat, a.box
Updated 03-02-2023
intapiuser
Admin
Joined December 15, 2022