Funnel Charts In Python
Funnel charts are a great way to represent drop-offs in sample size throughout a series of steps. Using a little bit of Python handiwork in Sisense for Cloud Data Teams' R/Python integration, we can easily create this chart type. A common use case is visualizing your pipeline via a sales funnel, which is an extremely effective way to spot opportunities to improve your current sales process! Below is the Python snippet used to create the funnel chart above, along with its input parameters.

Inputs

df: a data frame from your SQL output with 2 columns: "step" (the parts of the process you want to visualize in a funnel chart) and "val" (the value associated with that step in the funnel)
show_n (optional parameter): boolean value that determines whether to display the value corresponding to each step. Default is True
show_pct (optional parameter): determines whether to show each step as...
- a percent of the previous step ('of last step')
- a percent of the initial sample ('of whole')
- no percent values ('NA'), the default

Snippet

# SQL output is imported as a pandas dataframe variable called "df"
import pandas as pd
import matplotlib.pyplot as plt

# Function: funnel_chart, creates a funnel chart image from a dataframe of steps
# Inputs: (1) dataframe with 2 columns. The first is the "step", and the second is the "val" corresponding to each step.
#         (2) Optional boolean parameter show_n (displays the n size of each step).
#         (3) Optional parameter show_pct, which shows each step of the funnel as a percent of the
#             original n size ('of whole') or a percent of the previous step ('of last step')
# Output: matplotlib figure representing a funnel chart
def funnel_chart(df, show_n=True, show_pct='NA'):
    # set up data frame
    df['val'] = df['val'].astype(int)
    my_max = df['val'][0]
    df = df.sort_values('val', ascending=False).reset_index(drop=True)
    df['pct_of_whole'] = round((df['val'] / my_max) * 100).astype(int)
    pct_change = [100]
    for j in range(1, len(df['val'])):
        pct_change.append(int(round(df['val'][j] / df['val'][j - 1] * 100)))
    df['pct_change'] = pct_change
    df = df.sort_values('val').reset_index(drop=True)
    df['left'] = (my_max - df['val']) / 2
    # palette supports up to 8 steps; trim the list to the number of rows
    # (the original assigned the full 8-color list, which errors for funnels with fewer steps)
    df['col'] = ['indigo', 'purple', 'darkviolet', 'DarkOrchid', 'MediumOrchid',
                 'orchid', 'plum', 'thistle'][:len(df)]

    # initialize plot
    fig, ax = plt.subplots()
    for i in range(len(df['step'])):
        ax.barh(0.5 * i + 0.5, df['val'][i], height=0.3, left=df['left'][i], align='center',
                color=df['col'][i], alpha=1.0, label=df['step'][i])
        if show_n:
            ax.annotate(' ' + df['step'][i] + ': ' + str(df['val'][i]),
                        xy=(my_max, 0.5 * i + 0.45), horizontalalignment='left')
        else:
            ax.annotate(' ' + df['step'][i], xy=(my_max, 0.5 * i + 0.45), horizontalalignment='left')
        if show_pct == 'of whole':
            ax.annotate(str(df['pct_of_whole'][i]) + '%', xy=(my_max / 2, 0.5 * i + 0.45),
                        horizontalalignment='center', color='white')
        elif show_pct == 'of last step':
            ax.annotate(str(df['pct_change'][i]) + '%', xy=(my_max / 2, 0.5 * i + 0.45),
                        horizontalalignment='center', color='white')

    # remove border and align chart
    ax.axis('off')
    fig.subplots_adjust(right=0.8)
    return fig

# Use Sisense for Cloud Data Teams to visualize a dataframe or an image by passing data to periscope.output()
periscope.output(funnel_chart(df, show_n=True, show_pct='of last step'))

What other chart types do you like to use for your sales processes?
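As a footnote, the SQL feeding the snippet only needs to return the two expected columns, sorted so the first row is the top of the funnel. A minimal sketch, assuming a hypothetical funnel_events table with stage and user_id columns:

select stage as step
  , count(distinct user_id) as val
from funnel_events
group by 1
order by 2 desc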
Sankey Diagrams With Plot.Ly In Periscope

Sankey diagrams are a very easy-to-digest way to look at the flow of objects between different points. The example we will walk through here is a basic Sankey overview, leveraging data from a fictional gaming company. The emphasis of this post is on how to get your pandas dataframe data into something that can be used by plot.ly in a Sankey diagram. The Sankey setup itself is modeled after the first example in the documentation here. If you want to look at different Sankey plots from a visualization standpoint, plot.ly has a great library of resources; I've linked some of those resources at the bottom of this post for further reference.

Without any further ado, here is the annotated code! Note that you would need the Periscope Python and R integration to use our Plot.ly integration.

Our SQL output is a dataframe that contains 3 columns:

Source - The source that the user came from
Platform - The user's platform of choice
Count - The number of users for that source-platform combination

The SQL to generate this is noted below for reference:

select source
  , platform
  , count(*)
from users
group by 1
  , 2

And here is the Python 3.6 code, annotated up top where we set up our pandas dataframe df for use in plot.ly.

# SQL output is imported as a dataframe variable called 'df'
import pandas as pd
import plotly.plotly as py

# First, we get a list of all of the sources, remove duplicates, and make this a list
sources = df["source"].drop_duplicates().tolist()

# Then, we get a list of all of the platforms (our targets), remove duplicates, and make this a list
platforms = df["platform"].drop_duplicates().tolist()

# Finally, create a list of all our nodes. We will use this for giving an id to each node for plot.ly
all_nodes = sources + platforms

# Keep the sizes of our dataframes. These are useful for applying the same color
# to each "node" and "link" of our sankey diagram, if we so choose to do so
n = len(all_nodes)
n2 = len(df["source"])

# Create a dataframe that has all of the node ids. We will join this to the original data frame to reference later
df1 = pd.DataFrame(all_nodes, columns=['node'])
df1 = df1.reset_index()
df2 = pd.merge(pd.merge(df, df1, how='inner', left_on="source", right_on="node"),
               df1, how='inner', left_on="platform", right_on="node",
               suffixes=('_source', '_target'))

# Setting up the data in the plotly "data" argument.
# The nodes are described in the "node" dictionary (these are the vertical rectangles in the diagram).
# The links are described in the "link" dictionary. These have 3 attributes: the "source" (the index of
# the node they start at), the "target" (the index of the node they end at), and the "value" (the
# thickness of the band). Additional attributes, such as color, can also be specified.
data = dict(
    type='sankey',
    node=dict(
        pad=15,
        thickness=20,
        line=dict(
            color="#435951",
            width=0.5
        ),
        label=all_nodes,
        color=["#84baa6"] * n
    ),
    link=dict(
        source=df2["index_source"],
        target=df2["index_target"],
        value=df2["count"],
        color=['#bdf9e5'] * n2
    ))

# Setting up the layout settings in the "layout" argument
layout = dict(
    title="An Example Sankey Diagram",
    font=dict(
        size=12
    )
)

fig = dict(data=[data], layout=layout)

# Use Periscope to visualize a dataframe, text, or an image by passing data to periscope.table(), periscope.text(), or periscope.image() respectively.
periscope.plotly(fig)

Further reference:
Plot.ly Sankey Diagrams - 1
Plot.ly Sankey Diagrams - 2
Plot.ly Sankey Diagrams - 3
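As a postscript, to make the node and link indexing concrete, here is a small sketch on made-up data (the source/platform values are illustrative only). It mirrors the merge above and prints the integer ids plot.ly expects:

import pandas as pd

# Illustrative stand-in for the SQL output
toy = pd.DataFrame({
    "source":   ["organic", "organic", "paid"],
    "platform": ["ios", "android", "ios"],
    "count":    [120, 80, 42],
})

# Node ids: sources get 0..k-1, platforms continue from k
nodes = pd.DataFrame(
    toy["source"].drop_duplicates().tolist() + toy["platform"].drop_duplicates().tolist(),
    columns=["node"],
).reset_index()

linked = pd.merge(
    pd.merge(toy, nodes, how="inner", left_on="source", right_on="node"),
    nodes, how="inner", left_on="platform", right_on="node",
    suffixes=("_source", "_target"),
)

# Each row is one band: e.g. ("paid" -> "ios") becomes source=1, target=2, value=42
print(linked[["index_source", "index_target", "count"]])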
Manipulating Dates, Datetimes, Unix Timestamps, And Other Date Formatting Tricks In Redshift

Dates aren't stored in the database in the format you want to display? You can use the Redshift TO_CHAR() function to create a string-formatted extract in order to display dates however you want!

Note that in Redshift, there are DATE (no timestamp), TIMESTAMP (timestamp, no time zone) and TIMESTAMPTZ (timestamp with time zone) types. Another common datetime type you'll find is the UNIX timestamp, which is usually a 10-digit integer representing the number of seconds passed since 1970-01-01 00:00:00 UTC (midnight). Less commonly you may find them with 13 digits (3 extra digits for millisecond precision) or 16 digits (microsecond precision). I've not seen a 19-digit nanosecond-precision UNIX timestamp out in the wild, but if you happen to find one, let me know.

Reference documents:
Redshift datetime types - https://docs.aws.amazon.com/redshift/latest/dg/r_Datetime_types.html
Redshift formatters - https://docs.aws.amazon.com/redshift/latest/dg/r_FORMAT_strings.html
Postgres - https://www.postgresql.org/docs/9.5/static/functions-formatting.html

Here are some examples using various formatters! I'm going to use getdate() so you can run the queries below, but you can substitute any datetime column for actual usage.

SELECT to_char(getdate(), 'Mon')   -> 'Feb'
SELECT to_char(getdate(), 'MON')   -> 'FEB'
SELECT to_char(getdate(), 'Month') -> 'February'
SELECT to_char(getdate(), 'Mon DD HH12:MI:SS')     -> 'Feb 27 12:39:12' (12-hour clock)
SELECT to_char(getdate(), 'Mon DD, HH24:MI:SS.MS') -> 'Feb 27, 00:39:12.331' (24-hour clock, milliseconds)

How about something like grabbing the week number out of 52?

select to_char(getdate(), 'WW') -> 09

Try 'DDD' for day number, 'D' for day of week (Sunday = 1), or maybe you just really want the month number in Roman numerals; use 'RM' (or 'rm' for lowercase) for that.

What if you have a unix timestamp, i.e. 1519692428? Redshift doesn't have a native 'timestamp_to_datetime' type function. Luckily, you can use our Date Aggregators; here we'll use the :ts aggregator, which turns a unix timestamp into a datetime:

SELECT to_char([1519716997:ts], 'W') -> 4 (week of month)

What if you want to display the dates in the format 'Q1 - 2018'?

select to_char([1519692428:ts], 'YYYY') || ' - Q' || to_char([1519692428:ts], 'Q')

But what if you're running this directly on your Redshift cluster (outside of Periscope) and don't have access to our handy SQL formatters?

SELECT to_char(TIMESTAMP 'epoch' + 1519692428 * interval '1 second', 'DD-Mon-YY HH24:MI:SS') -> 27-Feb-18 00:47:08

(Note the minutes formatter is MI; MM is the month number, a classic to_char gotcha.)

More generally:

SELECT to_char(TIMESTAMP 'epoch' + YOUR_UNIX_TIMESTAMP_COLUMN * interval '1 second', '<formatter string>')

And the reverse: taking a date/datetime and converting it to a unix timestamp (remember to account for timezone conversions if you have a TIMESTAMPTZ; unix timestamps are always defined in UTC):

SELECT extract(epoch from getdate()) -> 1520205318

If you have a 13-digit timestamp (instead of the standard 10-digit), your timestamp is in milliseconds. Divide by 1000 (or, if you don't care about the precision, just take the leftmost 10 digits) and use the above function. Same procedure for 16-digit microsecond timestamps.

Between CAST/CONVERT, EXTRACT, TO_CHAR, and the date formatters, you can take any kind of date field and display it however you'd like! The examples above use the current-datetime-returning GETDATE() function; in Postgres, you'll use NOW().
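Putting the millisecond case together, a sketch (the 13-digit literal is illustrative; integer division by 1000 truncates away the millisecond part):

SELECT to_char(TIMESTAMP 'epoch' + (1519692428331 / 1000) * interval '1 second', 'DD-Mon-YY HH24:MI:SS') -> 27-Feb-18 00:47:08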
You can pass any date or datetime column in for these examples, although you'll need to make sure your underlying column has the necessary precision (i.e. a DATE type doesn't have time precision).

Maybe your data is a bit messier and you have the date field stored as a string with mixed dates and datetimes. (If anyone has unix timestamps and nulls mixed in with their dates/datetimes, buy your DBA a drink and ask them to fix it. But really, if someone is actually facing this issue, let me know in the comments and I'll whip up a solution... then ask your DBA to fix it.) Or maybe you just need more manual control over the output, like perhaps you wish to capitalize month names, but only if they're 31 days long, and also include the number of days in the month. The tried-and-true CASE WHEN never fails:

CASE
  WHEN extract(MONTH from date_column::date) = 1 then 'JANUARY (31 Days)'
  WHEN date_part(month, date_column::date) = 2 then 'February (28 Days)'
  WHEN extract(month from date_column::date) = 3 then 'MARCH (31 Days)'
  WHEN date_part(month, date_column::date) = 4 then 'April (30 Days)'
  ...
END as month_name

and so on. You can use whichever you prefer between EXTRACT and DATE_PART here. If you're working with time precision, make sure to cast using ::timestamp instead of ::date. Finally, you probably don't want to type that in every time, so use a SQL Snippet to save and reuse!
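One more variation: if you'd rather not hardcode the day counts, Redshift's LAST_DAY() returns the final date of the month, so extracting its day number gives the month's length (leap years included). A sketch, assuming a hypothetical my_table with a date_column:

select case when date_part(day, last_day(date_column::date)) = 31
            then upper(trim(to_char(date_column::date, 'Month')))
            else trim(to_char(date_column::date, 'Month'))
       end || ' (' || date_part(day, last_day(date_column::date))::int || ' Days)' as month_name
from my_table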
Chart Type - Radial Bar Chart In Matplotlib (Python)

Here's a script that takes a data frame with two values, the current value and the benchmark, and returns radial bar charts to plot progress toward a goal. You can also choose a color using the color_theme parameter, which takes the values 'Grey', 'Purple', 'Blue', 'Green', 'Orange', or 'Red'.

Reason to use it: One of the most common questions an analyst is asked is "how close am I to a goal?" We had a great thread about doing this in a variety of ways here. The most common tool people default to for this is a gauge chart, which, although it feels familiar from driving a car, is not necessarily an elegant way to encode the data visually. With Python in Sisense for Cloud Data Teams, we can start to create custom chart types that are awesome and fit use cases perfectly.

The biggest issue with gauge charts is what to do when you go beyond 100%: do you loop back to the beginning of the gauge or just adjust where 100% is? If we take a step back and look at how the visual could work, we can assemble a few radial components into a full 360° = 100%. That way, when we want 200%, we can just have multiple 360° rings.

#######################################################################
### LIBRARIES                                                       ###
#######################################################################
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

#######################################################################
### HELPER FUNCTIONS                                                ###
#######################################################################

# USE: Create an array structure for rings.
# INPUT: a df of row length 1 with the first column as the current metric value and the second column as the target metric value
# OUTPUT: an array of arrays representing each ring
def calculate_rings(df):
    if df.iloc[0, 0] < df.iloc[0, 1]:
        rings = [[df.iloc[0, 0], df.iloc[0, 1] - df.iloc[0, 0]], [0, 0]]
    elif df.iloc[0, 0] / df.iloc[0, 1] < 2:
        rings = [[df.iloc[0, 0], 0],
                 [df.iloc[0, 0] % df.iloc[0, 1], df.iloc[0, 1] - df.iloc[0, 0] % df.iloc[0, 1]]]
    else:
        rings = [[0, 0], [0, 0]]
    return rings

# USE: Determine if the rotating number label should be aligned left/center/right
# INPUT: a df of row length 1 with the first column as the current metric value and the second column as the target metric value
# OUTPUT: the proper text alignment
def horizontal_aligner(df):
    metric = 1.0 * df.iloc[0, 0] % df.iloc[0, 1] / df.iloc[0, 1]
    if metric in (0, 0.5):
        align = 'center'
    elif metric < 0.5:
        align = 'left'
    else:
        align = 'right'
    return align

def vertical_aligner(df):
    metric = 1.0 * df.iloc[0, 0] % df.iloc[0, 1] / df.iloc[0, 1]
    if metric < 0.25:
        align = 'bottom'
    elif metric < 0.75:
        align = 'top'
    elif metric > 0.75:
        align = 'bottom'
    else:
        align = 'center'
    return align

# USE: Create a center label in the middle of the radial chart.
# INPUT: a df of row length 1 with the first column as the current metric value and the second column as the target metric value
# OUTPUT: the proper text label
def add_center_label(df):
    percent = str(round(1.0 * df.iloc[0, 0] / df.iloc[0, 1] * 100, 1)) + '%'
    return plt.text(0, 0.2, percent,
                    horizontalalignment='center', verticalalignment='center',
                    fontsize=40, family='sans-serif')

# USE: Format a number with the appropriate currency tags.
# INPUT: a currency number
# OUTPUT: the properly formatted currency string
def get_currency_label(num):
    currency = ''
    if num < 10**3:
        currency = '$' + str(num)
    elif num < 10**6:
        currency = '$' + str(round(1.0 * num / 10**3, 1)) + 'K'
    elif num < 10**9:  # fixed: the original compared df.iloc[0,0] here instead of num
        currency = '$' + str(round(num / 10**6, 1)) + 'M'
    else:
        currency = '$' + str(round(num / 10**9, 1)) + 'B'
    return currency

# USE: Create a dynamic outer label that serves as a pointer on the ring.
# INPUT: a df of row length 1 with the first column as the current metric value and the second column as the target metric value
# OUTPUT: the proper text label at the appropriate position
def add_current_label(df):
    currency = get_currency_label(df.iloc[0, 0])
    print('vertical: ' + vertical_aligner(df))
    print('horizontal: ' + horizontal_aligner(df))
    return plt.text(1.5 * np.cos(0.5 * np.pi - 2 * np.pi * (float(df.iloc[0, 0]) % df.iloc[0, 1] / df.iloc[0, 1])),
                    1.5 * np.sin(0.5 * np.pi - 2 * np.pi * (float(df.iloc[0, 0]) % df.iloc[0, 1] / df.iloc[0, 1])),
                    currency,
                    horizontalalignment=horizontal_aligner(df),
                    verticalalignment=vertical_aligner(df),
                    fontsize=20, family='sans-serif')

def add_sub_center_label(df):
    amount = 'Goal: ' + get_currency_label(df.iloc[0, 1])
    return plt.text(0, -.1, amount, horizontalalignment='center', verticalalignment='top',
                    fontsize=22, family='sans-serif')

#######################################################################
### MAIN FUNCTION                                                   ###
#######################################################################
def create_radial_chart(df, color_theme='Purple'):
    # base styling logic
    color = plt.get_cmap(color_theme + 's')
    ring_width = 0.3
    outer_radius = 1.5
    inner_radius = outer_radius - ring_width

    # set up plot
    ring_arrays = calculate_rings(df)
    fig, ax = plt.subplots()
    if df.iloc[0, 0] > df.iloc[0, 1]:
        ring_to_label = 0
        outer_edge_color = None
        inner_edge_color = 'white'
    else:
        ring_to_label = 1
        outer_edge_color, inner_edge_color = ['white', None]

    # plot logic
    outer_ring, _ = ax.pie(ring_arrays[0], radius=1.5,
                           colors=[color(0.9), color(0.15)],
                           startangle=90, counterclock=False)
    plt.setp(outer_ring, width=ring_width, edgecolor=outer_edge_color)
    inner_ring, _ = ax.pie(ring_arrays[1], radius=inner_radius,
                           colors=[color(0.55), color(0.05)],
                           startangle=90, counterclock=False)
    plt.setp(inner_ring, width=ring_width, edgecolor=inner_edge_color)

    # add labels and format plots
    add_center_label(df)
    add_current_label(df)
    add_sub_center_label(df)
    ax.axis('equal')
    plt.margins(0, 0)
    plt.autoscale('enable')
    return plt

# call the chart maker function and display the chart
periscope.output(create_radial_chart(df, color_theme='Purple'))
# Currently supported color themes: Grey, Purple, Blue, Green, Orange, Red

An additional layer of information can be introduced using a subtitle below the percentage. Thanks to Kyle Dempsey for developing a simple function to do this. Here's the code with an adjusted position of the percent and subtitle.

def add_center_label(df):
    percent = str(round(1.0 * df.iloc[0, 0] / df.iloc[0, 1] * 100)) + '%'
    return plt.text(0, 0.2, percent, horizontalalignment='center', verticalalignment='center',
                    fontsize=40, family='sans-serif')

def add_sub_center_label(df):
    amount = 'Goal: ' + get_currency_label(df.iloc[0, 1])
    return plt.text(0, -0.1, amount, horizontalalignment='center', verticalalignment='top',
                    fontsize=22, family='sans-serif')
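As a postscript, if you want to test this outside the Sisense environment (where df and periscope aren't injected for you), here is a minimal sketch with made-up numbers; the column names are illustrative, since the functions only read positions [0, 0] and [0, 1]:

import pandas as pd

# Hypothetical one-row input: current value first, goal second
df = pd.DataFrame({'current': [1250000], 'goal': [2000000]})

chart = create_radial_chart(df, color_theme='Blue')
chart.savefig('radial_progress.png', bbox_inches='tight')  # stand-in for periscope.output()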
Calculating Trimmed Means (SQL And Python Variations)

Data can oftentimes have extreme outliers, which can heavily skew certain metrics, such as the mean. One way to get around this is to use a trimmed mean: remove the top and bottom x percent of the data and take the average of the result. Trimmed ranges are a standard methodology that college programs use to communicate the standardized test scores of their admitted students (ex: 25th to 75th percentile of SAT scores), and the same thought process drives some data professionals to use trimmed means.

For the examples below, let's assume we have the following data: user_ids, and how much money each user spent on a fictional gaming app.

Figure 1

Note: There are a few nuances to keep in mind when developing a protocol for trimmed means. Primarily, we need to decide how to handle multiple occurrences of a value when it is either the lower or upper bound of the results to be trimmed.

Case 1: Exclude the upper and lower bounds of the trimmed dataset

You can perform this calculation in Redshift SQL or Python.

Redshift SQL (assume the table in Figure 1 is stored in t1):

with t2 as
  (select
    case
      when amt_paid <= PERCENTILE_DISC(0.1) WITHIN group(order by amt_paid) over() then null
      when amt_paid >= PERCENTILE_DISC(1-0.1) WITHIN group(order by amt_paid) over() then null
      else amt_paid
    end as trimmed_vals
  from t1)
select avg(trimmed_vals) from t2

Python 3.6 (assume the table in Figure 1 is stored in a dataframe df):

# SQL output is imported as a pandas dataframe variable called "df"
# Source: https://stackoverflow.com/questions/19441730/trimmed-mean-with-percentage-limit-in-python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import tmean, scoreatpercentile
import numpy as np

def trimmean(arr, percent):
    lower_limit = scoreatpercentile(arr, percent)
    upper_limit = scoreatpercentile(arr, 100 - percent)
    return tmean(arr, limits=(lower_limit, upper_limit), inclusive=(False, False))

my_result = trimmean(df["amt_paid"].values, 10)

Case 2: Disregard duplicate values of the upper/lower bounds

In other words, ensure that exactly x% of the lowest and x% of the highest values are removed from the dataset. If there are multiple records where the value is equal to the upper limit, discard enough copies such that the resulting sample size is 2*x% smaller than the initial sample size.
SQL

There is no known effective SQL equivalent for this behavior.

Python 3.6

# SQL output is imported as a pandas dataframe variable called "df"
import pandas as pd
from scipy.stats import trim_mean
import numpy as np

my_result = trim_mean(df["amt_paid"].values, 0.1)

Case 3: Include the upper and lower bounds of the trimmed dataset

Redshift SQL (assume the table in Figure 1 is stored in t1):

with t2 as
  (select
    case
      when amt_paid < PERCENTILE_DISC(0.1) WITHIN group(order by amt_paid) over() then null
      when amt_paid > PERCENTILE_DISC(1-0.1) WITHIN group(order by amt_paid) over() then null
      else amt_paid
    end as trimmed_vals
  from t1)
select avg(trimmed_vals) from t2

Python 3.6 (assume the table in Figure 1 is stored in a dataframe df):

# SQL output is imported as a pandas dataframe variable called "df"
# Source: https://stackoverflow.com/questions/19441730/trimmed-mean-with-percentage-limit-in-python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import tmean, scoreatpercentile
import numpy as np

def trimmean(arr, percent):
    lower_limit = scoreatpercentile(arr, percent)
    upper_limit = scoreatpercentile(arr, 100 - percent)
    return tmean(arr, limits=(lower_limit, upper_limit), inclusive=(True, True))

my_result = trimmean(df["amt_paid"].values, 10)
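To see how the three cases actually diverge, here is a quick sketch on a toy array (values made up; the duplicated 1s sit right at the lower trim boundary, which is where the protocols differ):

import numpy as np
from scipy.stats import tmean, trim_mean, scoreatpercentile

arr = np.array([1, 1, 1, 2, 3, 4, 5, 6, 7, 100])
lower, upper = scoreatpercentile(arr, 10), scoreatpercentile(arr, 90)

# Case 1: bounds excluded, so every 1 is dropped -> 4.5
print(tmean(arr, limits=(lower, upper), inclusive=(False, False)))
# Case 2: exactly 10% cut from each end (one value per side) -> 3.625
print(trim_mean(arr, 0.1))
# Case 3: bounds included, so the 1s stay in -> ~3.33
print(tmean(arr, limits=(lower, upper), inclusive=(True, True)))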
Confidence Interval Printout - Python

Let's say we want a printout of the confidence interval for an entire sample. (Note: if you're looking for a visual of a confidence interval over time, check out the post here!) The solution here requires Periscope Data's Python/R integration, as we'll be using Python's SciPy package and matplotlib text charts.

Our SQL output is a dataframe (df) with one column of values. Notice that the function created in the Python snippet also takes the following optional parameters for added customization:

interval: the width of the confidence interval. Default is 0.95 (95%)
method: whether we use the t distribution or a z/normal distribution to calculate the confidence interval. Default is the 't' distribution. We recommend verifying that your data is normally distributed before using the z distribution statistic.

The snippet is written in Python 3.6.

# SQL output is imported as a pandas dataframe variable called "df"
import pandas as pd
from scipy import stats
import math
import matplotlib.pyplot as plt
import numpy as np

# Function: CI_printout, outputs a number overlay expressing a sample's confidence interval
# Inputs: a dataframe with one column of values. Optional parameter interval for the size of the
#   confidence interval (default is 0.95). Optional parameter method that specifies whether the
#   confidence interval is calculated using the t distribution or a z/normal distribution.
# Outputs: a matplotlib text chart with the % confidence interval and the lower and upper bounds
def CI_printout(series, interval=0.95, method='t'):
    mean_val = series.mean()
    n = series.count()
    stdev = series.std()
    if method == 't':
        test_stat = stats.t.ppf((interval + 1) / 2, n - 1)  # n - 1 degrees of freedom (the original passed n)
    elif method == 'z':
        test_stat = stats.norm.ppf((interval + 1) / 2)
    lower_bound = mean_val - test_stat * stdev / math.sqrt(n)
    upper_bound = mean_val + test_stat * stdev / math.sqrt(n)
    fig = plt.figure()
    plt.axis('off')
    plt.gcf().set_size_inches(8, 2)
    plt.xticks([])
    plt.yticks([])
    plt.text(.5, .75, str(round(interval * 100)) + '% Confidence Interval',
             fontsize=25, color='black', ha='center')
    plt.text(.5, .35, str(round(lower_bound[0], 2)) + ' to ' + str(round(upper_bound[0], 2)),
             fontsize=25, color='black', ha='center')
    return plt

# Use Periscope to visualize a dataframe or an image by passing data to periscope.output()
periscope.output(CI_printout(df))

For the R equivalent of this post, check out the community page here!

Found this useful? Let us know in the comments!
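As a postscript, here is a minimal sketch for trying the function outside Periscope, on synthetic normal data (all names and values illustrative):

import numpy as np
import pandas as pd

np.random.seed(0)
sample = pd.DataFrame({'val': np.random.normal(loc=50, scale=10, size=200)})

fig = CI_printout(sample, interval=0.99, method='t')
fig.savefig('ci_printout.png')  # stand-in for periscope.output()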
Databases within Databases

The Warehouse is a managed Redshift solution which can store data from multiple origin data sources. Each Redshift cluster can contain multiple databases, and the Warehouse stores all relevant tables pulled in from the Cache UI in a database titled "site_#####", where the numbers indicate your Site ID.

If you're logged in via psql, you can type \l or run SELECT datname FROM pg_database; to see the list of databases. The relevant database in the Warehouse which is connected to Sisense for Cloud Data Teams can be found on the cluster info page, using the above command/query, or within Sisense for Cloud Data Teams' IDE, using the following SQL query:

select current_database()

Warehouse Naming Conventions

How does the Warehouse enable cross-database joins between tables coming from different databases? It does so by putting all tables within the "site_####" database, so the joins are no longer actually cross-database: all tables reside within the same database of the Warehouse. First, the origin database is given an ID, and the schema name is concatenated to the database ID, which combines the schema and db alias into a single schema within the "site_####" db.

Let's walk through an example. Suppose I had a Warehouse with a db "site_1" which pulls in data from a Postgres db called "production_replica" (assume the alias "db_5" is given by the Warehouse), and a table "public.daily_active_users" within "production_replica". Within the Sisense for Cloud Data Teams IDE, if we write the query:

select * from production_replica.public.daily_active_users

the Warehouse will translate this query on the backend to the actual Warehouse table names:

select * from site_1.db_5_public.daily_active_users

This can be seen any time under the "Query" tab by running a select * against a table.

Warehouse Names, Sortkeys, and Distkeys

The following is a useful Warehouse query which yields the database name (within the Warehouse), the schema name (within the origin database), and the primary sortkey and diststyle of each table in the Warehouse (specifically, the site_#### database). Additionally, the table size (GB) and a running cumulative table size are generated.

with tbl_ids as
  (select distinct oid
  from pg_class c
  where relowner > 1
    and relkind = 'r'),

stp as
  (select id, sum(rows) sum_r, sum(sorted_rows) sum_sr, min(rows) min_r,
    max(rows) max_r, nvl(count(distinct slice), 0) pop_slices
  from stv_tbl_perm
  where id in (select oid from tbl_ids)
    and slice < 6400
  group by id),

colenc as
  (select attrelid,
    sum(case when a.attencodingtype = 0 then 0 else 1 end) as encoded_cols,
    count(*) as cols
  from pg_attribute a
  where a.attrelid in (select oid from tbl_ids)
    and a.attnum > 0
  group by a.attrelid),

cluster_info as
  (select count(distinct node) node_count from stv_slices)

select
  ti.database,
  coalesce(regexp_substr(ti.schema, 'db_[0-9]+'), ti.schema) as origin_database_aliased_id,
  right(ti.schema, len(ti.schema) - len(regexp_substr(ti.schema, 'db_[0-9]+_'))) as schema_name,
  ti."table" as tablename,
  ti.diststyle,
  ti.sortkey1,
  round(1.0 * ti.size / 1024, 2) current_size_gb,
  sum(round(1.0 * ti.size / 1024, 2)) over (order by size desc rows unbounded preceding) as cumulative_size_on_disk
from svv_table_info ti
left join stp on stp.id = ti.table_id
left join colenc on colenc.attrelid = ti.table_id
cross join cluster_info
where ti.schema not in ('pg_internal')
order by ti.size desc
Question

For a novice starting out with Sisense and a very basic knowledge of SQL, would someone in this community be able to recommend a book and/or tutorial that could help expedite learning the SQL syntax used in the system to set up custom tables, etc.?

Answer

You can start with the Udemy course below, which covers most of the basics that you'll need; for new users it's practically free (if for some reason it costs you more, simply Google the course's name and the phrase "coupon". Works like a charm.)

https://www.udemy.com/microsoft-sql-for-beginners/

https://www.w3schools.com/sql/ is another website where you can learn basic SQL with ease.

Here is another free interactive site where you can learn some basic SQL:

https://sqlzoo.net/

How To Filter For The Last Two Records Using SQL In Ecube
Question

How can I find the top 2 records for the same ClientID in a table? I know that MAX gives the top record and that SQL has LIMIT 2, but I cannot use these two keywords in an eCube.

Answer

This can be easily done within the ElastiCube. Here are the steps:

1. Create a new custom column in your table to rank all records for a specific ClientID; the script will be something like this:

rankasc([client_id], [created_at])

2. Create a custom SQL expression (and name this table [Table Max Ranking]) to find out the max ranking for each client; the syntax will be:

SELECT client_id, max([rank]) AS MaxRank
FROM [Table]
GROUP BY client_id

3. Create another custom SQL expression with the following script:

SELECT a.*
FROM [Table] a
JOIN [Table Max Ranking] b
  ON a.client_id = b.client_id
  AND a.[rank] >= (b.MaxRank - 1)
ORDER BY a.client_id, a.[rank] desc
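If your data source supports window functions, a hedged alternative does the same top-2 filter in one pass (standard SQL sketch, not ElastiCube custom-column syntax; my_table is a hypothetical name):

-- keep the two most recent records per client
SELECT client_id, created_at
FROM (
  SELECT client_id, created_at,
         ROW_NUMBER() OVER (PARTITION BY client_id ORDER BY created_at DESC) AS rn
  FROM my_table
) ranked
WHERE rn <= 2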