Cheat Sheet - Detecting Incorrect Data

intapiuser · ‎03-02-2023

Data is a beautiful mess.
As data heroes, we spot the mess and get rid of it to highlight the beauty.
Here are a couple useful tricks to detect incorrect data.

We will assume a table that describes the number of meows made by each cat in each box, on a given day.

why analytics? because cats

NOTE THE DETECTED DATA TYPES

If a field was detected as a different type, its values may not follow the right format

EXPECT THE UNEXPECTED: NULLS, NUMERIC ANOMALIES, TEXT TYPOS

See which text values are contained in your categorial fields:

SELECT DISTINCT a.cat
FROM table a

Find numeric anomalies by detecting range (or simply visualize the field):

SELECT
CASE WHEN a.meows < 0 THEN "under 0"
WHEN a.meows >= 0 AND a.meows <= 10 THEN " 0 - 10"
WHEN a.meows > 10 AND a.meows <= 50 THEN " 11 - 50"
ELSE "over 50"
END meows_range,
COUNT(*) AS instances
FROM table a
GROUP BY 1
EXPECT THE UNEXPECTED UNDER CERTAIN CONDITIONS

Repeat the previous bullet, this time under certain business logic
Here we expect Jingle the cat to meow at least once every time he snuggles in any box, therefore we will check Jingle's specific range

SELECT
CASE WHEN a.meows < 0 THEN "under 0"
WHEN a.meows >= 0 AND a.meows <= 10 THEN " 0 - 10"
WHEN a.meows > 10 AND a.meows <= 50 THEN " 11 - 50"
ELSE "over 50"
END meow_range,
COUNT(*) AS instances
FROM table a
WHERE a.cat = 'Jingle'
GROUP BY 1
VALIDATE WITH REALITY

Sample some specific and aggregate data points and check with external application
Why are there entries of Mia the cat given she spent the day at the vet?
FIND THOSE DUPLICATES

Make sure that you know what combination of fields make a unique identifier in your table
SELECT a.cat, a.box, COUNT(*) AS duplicates
FROM table a
GROUP BY a.cat, a.box

Sisense Community

Cheat Sheet - Detecting Incorrect Data