cancel
Showing results for
Did you mean:

# Cheat Sheet - Detecting Incorrect Data

Community Team Member
Data is a beautiful mess.
As data heroes, we spot the mess and get rid of it to highlight the beauty.
Here are a couple useful tricks to detect incorrect data.

We will assume a table that describes the number of meows made by each cat in each box, on a given day.
why analytics? because cats

1. NOTE THE DETECTED DATA TYPES

If a field was detected as a different type, its values may not follow the right format

1. EXPECT THE UNEXPECTED: NULLS, NUMERIC ANOMALIES, TEXT TYPOS

See which text values are contained in your categorial fields:

SELECT DISTINCT a.cat
FROM table a

Find numeric anomalies by detecting range (or simply visualize the field):

SELECT
CASE WHEN a.meows < 0 THEN "under 0"
WHEN a.meows >= 0 AND a.meows <= 10 THEN " 0 - 10"
WHEN a.meows > 10 AND a.meows <= 50 THEN " 11 - 50"
ELSE "over 50"
END meows_range,
COUNT(*) AS instances
FROM table a
GROUP BY 1

2. EXPECT THE UNEXPECTED UNDER CERTAIN CONDITIONS

Repeat the previous bullet, this time under certain business logic
Here we expect Jingle the cat to meow at least once every time he snuggles in any box, therefore we will check Jingle's specific range

SELECT
CASE WHEN a.meows < 0 THEN "under 0"
WHEN a.meows >= 0 AND a.meows <= 10 THEN " 0 - 10"
WHEN a.meows > 10 AND a.meows <= 50 THEN " 11 - 50"
ELSE "over 50"
END meow_range,
COUNT(*) AS instances
FROM table a
WHERE a.cat = 'Jingle'
GROUP BY 1

3. VALIDATE WITH REALITY

Sample some specific and aggregate data points and check with external application
Why are there entries of Mia the cat given she spent the day at the vet?
4. FIND THOSE DUPLICATES

Make sure that you know what combination of fields make a unique identifier in your table
SELECT a.cat, a.box, COUNT(*) AS duplicates
FROM table a
GROUP BY a.cat, a.box

Version history
Last update:
‎03-02-2023 08:33 AM
Updated by:
Contributors
Community Toolbox

Developers Group:

Product Feedback Forum: