A Deeper Dive Into The Redshift Warehouse - Backend Mechanics, Table Size/Sortkey/Distkey Info, And More!
Databases within Databases
The Warehouse is a managed Redshift solution which has the ability to store data from multiple origin data sources. Each Redshift cluster can contain multiple databases, and the Warehouse stores all relevant tables pulled in from the Cache UI in a database titled "site_#####" where the numbers indicate your Site ID.
If you're logged in via psql, you can type
\l (SQL)
or
SELECT datname FROM pg_database; (SQL)
to see the list of databases. The relevant database in the Warehouse which is connected to Sisense for Cloud Data Teams can be found in the cluster info page, using the above command/query, or within Sisense for Cloud Data Teams' IDE, using the following SQL query:
select current_database()
Warehouse Naming Conventions
How does the Warehouse enable cross-database joins between tables coming from different databases? It does so by putting all tables within the "site_####" database, so the joins are no longer actually cross database - all tables reside within the same database of the Warehouse. First, the origin database is given an ID, and the schema name is concatenated to the database ID, which combines the schema and db alias into a single schema within the "site_####" db.
Let's walk through an example.
Suppose I had a Warehouse with a db "site_1" which pulls in data from a postgres db called "production_replica," (assume the alias "db_5" is given by the Warehouse) and a table "public.daily_active_users" within "production_replica".
Within the Sisense for Cloud Data Teams IDE, if we write the query:
select * from production_replica.public.daily_active_users
the Warehouse will translate this query on the backend to the actual Warehouse table names :
select * from site_1.db_5_public.daily_active_users
This can be seen any time under the "Query" tab by running a select * against a table.
Warehouse Names, Sortkeys, and Distkeys
The following is a useful Warehouse query which will yield the database name (within the Warehouse), the schema name (within the origin database), and the primary sortkey and diststyle of each table in the Warehouse (specifically, the site_#### database). Additionally, the table size (GB) and a running cumulative table size are generated. (SQL)
with tbl_ids as (select distinct oid from pg_class c where relowner>1 and relkind='r'), stp as (select id,sum(rows)sum_r,sum(sorted_rows)sum_sr,min(rows)min_r, max(rows)max_r,nvl(count(distinct slice),0)pop_slices from stv_tbl_perm where id in (select oid from tbl_ids) and slice<6400 group by id), colenc as (select attrelid,sum(case when a.attencodingtype=0 then 0 else 1 end) as encoded_cols,count(*)as cols from pg_attribute a where a.attrelid in (select oid from tbl_ids) and a.attnum>0 group by a.attrelid), cluster_info as (select count(distinct node) node_count from stv_slices) select ti.database, coalesce(regexp_substr(ti.schema, 'db_[0-9]+'), ti.schema) as origin_database_aliased_id, right(ti.schema, len(ti.schema) - len(regexp_substr(ti.schema, 'db_[0-9]+_'))) as schema_name, ti."table"as tablename, ti.diststyle, ti.sortkey1, round(1.0*ti.size / 1024,2) current_size_gb, sum(round(1.0*ti.size / 1024,2)) over (order by size desc rows unbounded preceding) as cumulative_size_on_disk from svv_table_info ti left join stp on stp.id=ti.table_id left join colenc on colenc.attrelid=ti.table_id cross join cluster_info where ti.schema not in('pg_internal') order by ti.size desc