A Deeper Dive Into The Redshift Warehouse - Backend Mechanics, Table Size/Sortkey/Distkey Info, And More!

Databases within Databases

The Warehouse is a managed Redshift solution which has the ability to store data from multiple origin data sources. Each Redshift cluster can contain multiple databases, and the Warehouse stores all relevant tables pulled in from the Cache UI in a database titled "site_#####" where the numbers indicate your Site ID.

If you're logged in via psql, you can type

\l (SQL)

SELECT datname FROM pg_database; (SQL)

to see the list of databases. The relevant database in the Warehouse which is connected to Sisense for Cloud Data Teams can be found in the cluster info page, using the above command/query, or within Sisense for Cloud Data Teams' IDE, using the following SQL query:

select current_database()

Warehouse Naming Conventions

How does the Warehouse enable cross-database joins between tables coming from different databases? It does so by putting all tables within the "site_####" database, so the joins are no longer actually cross database - all tables reside within the same database of the Warehouse. First, the origin database is given an ID, and the schema name is concatenated to the database ID, which combines the schema and db alias into a single schema within the "site_####" db.

Let's walk through an example.

Suppose I had a Warehouse with a db "site_1" which pulls in data from a postgres db called "production_replica," (assume the alias "db_5" is given by the Warehouse) and a table "public.daily_active_users" within "production_replica".

Within the Sisense for Cloud Data Teams IDE, if we write the query:

select * from production_replica.public.daily_active_users

the Warehouse will translate this query on the backend to the actual Warehouse table names :

select * from site_1.db_5_public.daily_active_users

This can be seen any time under the "Query" tab by running a select * against a table.

Warehouse Names, Sortkeys, and Distkeys

The following is a useful Warehouse query which will yield the database name (within the Warehouse), the schema name (within the origin database), and the primary sortkey and diststyle of each table in the Warehouse (specifically, the site_#### database). Additionally, the table size (GB) and a running cumulative table size are generated. (SQL)


with
  tbl_ids as
    (select distinct oid
     from pg_class c
     where relowner>1
       and relkind='r'),
  stp as
    (select id,sum(rows)sum_r,sum(sorted_rows)sum_sr,min(rows)min_r,
       max(rows)max_r,nvl(count(distinct slice),0)pop_slices
     from stv_tbl_perm
     where id in (select oid from tbl_ids)
       and slice<6400
     group by id),
  colenc as
    (select attrelid,sum(case when a.attencodingtype=0 then 0 else 1 end)
        as encoded_cols,count(*)as cols
     from pg_attribute a
     where a.attrelid in (select oid from tbl_ids)
       and a.attnum>0
     group by a.attrelid),
  cluster_info as
    (select count(distinct node) node_count
     from stv_slices)
select
        ti.database,
       coalesce(regexp_substr(ti.schema, 'db_[0-9]+'), ti.schema) as origin_database_aliased_id,
       right(ti.schema, len(ti.schema) - len(regexp_substr(ti.schema, 'db_[0-9]+_')))  as schema_name,
       ti."table"as tablename,
       ti.diststyle,
       ti.sortkey1,
       round(1.0*ti.size / 1024,2) current_size_gb,
       sum(round(1.0*ti.size / 1024,2)) over (order by size desc rows unbounded preceding) as cumulative_size_on_disk

from svv_table_info ti
left join stp on stp.id=ti.table_id
left join colenc on colenc.attrelid=ti.table_id
cross join cluster_info
where ti.schema not in('pg_internal')
order by ti.size desc