SQL Order Of Operations
The first thing you'll learn in SQL is the order in which we write the various clauses in our queries (SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, then LIMIT). However, the order in which the database interprets the query is slightly different. Understanding this order will help you construct more complex queries and optimize your queries for speed and performance.

1. FROM

The first part of the query the database reads is the FROM clause: the table(s) we are pulling data from. Any joins are executed first by the query planner. Joining on many rows can often be costly, so one optimization strategy is to filter the rows down before joining the tables. Because the FROM clause is evaluated before the WHERE clause, we can filter the results with CTEs before joining them in our final query. This might look like:

    with american_companies as (
      select * from company where country = 'USA'
    )
    , current_users as (
      select * from users where deleted_at is null
    )
    select *
    from american_companies
    join current_users
      on american_companies.id = current_users.company_id

2. WHERE

The WHERE clause is where we filter down the rows from the table. We can filter on any supported data type. You may notice that column aliases and aggregates cannot be used in the WHERE clause. This is because they are established later in SQL's order of operations.

    where price > 200
      and product_name ilike '%iphone%'
      and purchase_date > '2018-01-01'

3. GROUP BY

The GROUP BY clause is often used in conjunction with aggregate functions to return an aggregation of results grouped by one or more columns.

4. HAVING

As touched on previously, putting an aggregate function in the WHERE clause will throw an error. That is why we have the HAVING clause, which allows us to filter a result set after the data is grouped and aggregated. Here is an example query that filters on an aggregate:

    select user_id
      , count(transaction_id)
    from transactions
    where transaction_amount > 50
    group by user_id
    having count(transaction_id) > 10

Window Functions

Window functions are evaluated after the HAVING clause. As such, they can only be used in the SELECT and ORDER BY clauses. You can also use aggregate functions inside a window function.

5. SELECT

The SELECT clause is where we define the columns and aggregate functions we want returned as columns in our result.

6. DISTINCT

GROUP BY and DISTINCT can essentially be used the same way when no aggregate function is involved. DISTINCT comes later in the order of operations because it removes duplicate rows after all rows have been selected. In many cases, however, DISTINCT and GROUP BY will produce the same query plan. This is because most databases are smart enough to recognize that both have the same outcome and will choose the most efficient plan of execution.

7. UNION

A union takes two queries that can each stand alone as valid queries and stacks them on top of each other to combine them into one result set.

8. ORDER BY

Once all of our data has been grouped and aggregated, the ORDER BY clause sorts the resulting set of rows. Because it comes so late in the order of operations, we can order by aggregates, window functions, and column aliases. We can also re-order unioned tables.

9. LIMIT

The LIMIT clause is where we define the number of rows we want returned by our query. It can be used in conjunction with ORDER BY to return the top or bottom x rows. We can also use OFFSET to return a given number of rows past a given row position.
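To tie several of these steps together, here is a minimal sketch (reusing the hypothetical transactions table from the HAVING example above; the exact columns are assumptions) that orders by a column alias defined in the SELECT and then pages through the sorted rows with LIMIT and OFFSET:

    select user_id
      , count(transaction_id) as transaction_count  -- alias defined in SELECT
    from transactions
    group by user_id
    having count(transaction_id) > 10               -- filter on the aggregate
    order by transaction_count desc                 -- the alias is available here
    limit 10 offset 10                              -- skip the top 10, return rows 11-20

Because ORDER BY and LIMIT are evaluated last, both the aggregate and its alias are available to them.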
Get Yesterday's Date
Oftentimes, we want to analyze data with the date from yesterday. How do we auto-populate yesterday's date? There are a few formats that we can use!

The syntax below lets us get yesterday's date with a timestamp:

    select dateadd(day,-1,getdate())

To get only yesterday's date at 00:00:00:

    select dateadd(day,-1,trunc(getdate()))

If you wish to get the date and ditch the timestamp completely, you can use the SQL formatter:

    select [dateadd(day,-1,getdate()):date]

You can also use the interval syntax for more elegant code:

    select [getdate():date] - interval '1 day'
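As a quick usage sketch (the orders table and created_at column are hypothetical), any of these expressions can be dropped into a WHERE clause to pull only yesterday's rows:

    select count(*)
    from orders
    where [created_at:date] = [getdate():date] - interval '1 day'

Here the SQL formatter truncates the timestamp on both sides, so the comparison is a simple date match.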
Changing Line Styling -- Plot.ly (Python And R)

Want to alter the thickness of the lines in your series, add markers to your lines, and make them dashed or dotted? The plot.ly library support on Sisense for Cloud Data Teams' Python/R integration allows you to tailor these parameters to your heart's desire! For reference, here are the first few rows of our SQL output. This is data from a hypothetical gaming company.

And below is the annotated Python 3.6 code! Note we need to first pivot the data such that each series gets its own column. More information on that can be found here. Also, plot.ly has great documentation going into different examples. Many of the options below are outlined in the documentation itself!

    # SQL output is imported as a dataframe variable called 'df'

    # Import pandas to use dataframe objects
    import pandas as pd

    # Import plot.ly viz libraries
    import plotly.plotly as py
    import plotly.graph_objs as go

    # Pivot data. Details on how to do this here https://community.periscopedata.com/t/q5gk76/pivoting-and-melting-dataframes
    df=pd.pivot_table(df,index='week',columns='source',values='count')
    df=df.reset_index()

    # Create traces
    # Refer to https://plot.ly/python/line-charts/ for more options
    admob_trace = dict(
        x = df.week,
        y = df['admob'],
        mode = 'lines',
        type = 'scatter',
        name = 'admob',
        line = dict(shape = 'linear', color = 'rgb(205, 12, 24)', width= 4, dash = 'dash'),
        connectgaps = True
    )

    leadbolt_trace = go.Scatter(
        x = df['week'],
        y = df['leadbolt'],
        mode = 'lines+markers',
        name = 'leadbolt',
        line = dict(shape = 'linear', color = 'rgb(10, 12, 240)', dash = 'dash'),
        marker = dict(symbol = "star-diamond", color = 'rgb(17, 157, 255)', size = 12),
        connectgaps = True
    )

    organic_trace = go.Scatter(
        x = df.week,
        y = df['organic'],
        mode = 'lines',
        name = 'organic',
        line = dict(shape = 'linear', color = 'rgb(10, 120, 24)', dash = 'dot'),
        connectgaps = True
    )

    tapjoy_trace = go.Scatter(
        x = df['week'],
        y = df['tapjoy'],
        mode = 'lines',
        name = 'tapjoy',
        line = dict(shape = 'linear', color = 'rgb(100, 10, 100)', width = 2, dash = 'dot'),
        connectgaps = True
    )

    # Setting up the layout settings in the "layout" argument
    layout = dict(
        xaxis = dict(title = 'Week'),
        yaxis = dict(title = 'Source'),
        margin = dict(l=70, r=10, b=50, t=10)
    )

    data = [admob_trace, leadbolt_trace, organic_trace, tapjoy_trace]
    fig = go.Figure(data = data, layout = layout)

    # Use Periscope to visualize a dataframe, text, or an image by passing data to periscope.table(), periscope.text(), or periscope.image() respectively.
    periscope.plotly(fig)

Prefer R Plot.ly? Here's the code for that!

    # SQL output is imported as a dataframe variable called 'df'
    # Use Periscope to visualize a dataframe or show text by passing data to periscope.table() or periscope.text() respectively. Show an image by calling periscope.image() after your plot.

    # Use tidyr to pivot the data. We want a column per series
    library(tidyr)

    # Use plot.ly for visualization
    library(plotly)

    # Pivoting data, see this post for further details https://community.periscopedata.com/t/q5gk76/pivoting-and-melting-dataframes
    df=spread(df,source,count)

    admob_trace <- df$admob
    leadbolt_trace <- df$leadbolt
    organic_trace <- df$organic
    tapjoy_trace <- df$tapjoy
    x <- df$week

    data <- data.frame(x, admob_trace, leadbolt_trace, organic_trace, tapjoy_trace)

    # Check out Plot.ly's documentation to find more options!
    # https://plot.ly/r/reference/
    p <- plot_ly(data, x = ~x) %>%
      add_trace(y = ~admob_trace, name = 'admob', type = 'scatter', mode = 'lines', line = list(shape = 'linear', color = 'rgb(205, 12, 24)', width= 4, dash = 'dash'), connectgaps = TRUE) %>%
      add_trace(y = ~leadbolt_trace, name = 'leadbolt', type = 'scatter', mode = 'lines+markers', line = list(shape = 'linear', color = 'rgb(10, 12, 240)', dash = 'dash'), marker = list(symbol = "star-diamond", color = 'rgb(17, 157, 255)', size = 12), connectgaps = TRUE) %>%
      add_trace(y = ~organic_trace, name = 'organic', type = 'scatter', mode = 'lines', line = list(shape = 'linear', color = 'rgb(10, 120, 24)', dash = 'dot'), connectgaps = TRUE) %>%
      add_trace(y = ~tapjoy_trace, name = 'tapjoy', type = 'scatter', mode = 'lines', line = list(shape = 'linear', color = 'rgb(100, 10, 100)', width = 2, dash = 'dot'), connectgaps = TRUE) %>%
      layout(xaxis = list(title = 'Week'), yaxis = list(title = 'Source'))

    periscope.plotly(p)

Part 1: Understanding The NVL/COALESCE Expression
What are Nulls?

The null value can be confusing: does it represent a zero, an empty space, a blank placeholder? In SQL, null represents an "unknown" type or the absence of a value, and the use cases and queries involving nulls are often unique. For example, when a calculation or expression involving timestamps, strings, integers, or even booleans contains a null value, the overall expression is also null. The logic is valid: if one part of the expression is unknown, then the overall expression must also be unknown.

Now, let's say we have columns with both null and non-null values. One way to check whether there are null values in a column is to use the "Is Null" check. If we wanted to return all the non-null values in a column, we would simply use "Is Not Null."

If the "Is Null" and "Is Not Null" checks (along with the "IsNull" function and its MySQL equivalent, "IfNull") already let us filter for either null or non-null values, then what is the use case for NVL/Coalesce expressions? Well, there are a few key differences between "Is Null" and the Coalesce/NVL expression.

Comparison of "Is NULL" Function vs. NVL/Coalesce Expression:

- The Is NULL function is typically computationally faster than the Coalesce expression (although the difference is pretty negligible).
- The Is NULL function is evaluated once, while the input values for the Coalesce expression can be evaluated multiple times.
- The Is NULL function uses the data type of the first parameter, while the Coalesce expression abides by the Case expression rules and returns the data type of the value with the highest precedence.
- The Is NULL function returns a value that is "not nullable," while the equivalent Coalesce expression is still considered nullable.

NVL/Coalesce Expression

Now that we have discussed the differences between "Is Null" and "Coalesce," let's take a deeper look at the NVL/Coalesce expression. Coalesce() and NVL() are synonymous expressions, used to return the first non-null expression in a list of at least two inputs/arguments. In the case where all the inputs/arguments are null, the Coalesce() expression will return null. For example, coalesce(null, 'a', 'b') returns the first non-null argument, 'a', while coalesce(null, null, null) returns null.

Another way to think about the Coalesce expression is as an abbreviated version of the Case statement: coalesce(a, b, c) achieves the same result as case when a is not null then a when b is not null then b else c end (a fuller worked example appears at the end of this post).

Tip: One of the advantages of the Sisense for Cloud Data Teams SQL editor is that it provides hints as to the purpose of most SQL expressions as well as their format.

Interested in the applications of the NVL/Coalesce Expressions? Click here!
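To make the Coalesce/Case equivalence concrete, here is a minimal sketch (the users table and its phone columns are hypothetical) that computes the same value both ways:

    select user_id
      , coalesce(mobile_phone, work_phone, 'no phone on file') as best_phone
      , case
          when mobile_phone is not null then mobile_phone
          when work_phone is not null then work_phone
          else 'no phone on file'
        end as best_phone_case
    from users

Both columns return the same result for every row; Coalesce is simply the more compact form.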
Remove Whitespace Margins From Plotly Charts

Plotly is wonderful, but sometimes you end up with a chart that looks too small, or that isn't taking up all the available space. Why does it do that? In this post, you'll learn how to fix that.

With Python, this is actually very easy. We just need to add layout specifications. The documentation for the layout settings can be found on Plotly's website here.

    layout = go.Layout(
        margin=go.layout.Margin(
            l=0,  #left margin
            r=0,  #right margin
            b=0,  #bottom margin
            t=0   #top margin
        )
    )
    fig = dict(data=data, layout=layout)

Voila! There you go. The default values for the margins are 80 pixels on the left, right, and bottom and 100 pixels on the top. We don't need those!

Note: A chart with axes that need to be labeled might need about 25 pixels on the bottom and left.

Before

    import plotly.plotly as py
    import plotly.graph_objs as go

    labels = ['Oxygen','Hydrogen','Carbon_Dioxide','Nitrogen']
    values = [4500,2500,1053,500]

    trace = go.Pie(labels=labels, values=values)
    data = [trace]

    fig = dict(data=data)
    periscope.plotly(fig)

After

    import plotly.plotly as py
    import plotly.graph_objs as go

    labels = ['Oxygen','Hydrogen','Carbon_Dioxide','Nitrogen']
    values = [4500,2500,1053,500]

    trace = go.Pie(labels=labels, values=values)
    data = [trace]

    layout = go.Layout(
        margin=go.layout.Margin(
            l=0,  #left margin
            r=0,  #right margin
            b=0,  #bottom margin
            t=0   #top margin
        )
    )

    fig = dict(data=data, layout=layout)
    periscope.plotly(fig)

Bonus: In R, you can do this quite similarly. Documentation here. Example here.

Dealing With Missing Values In Python
Python pandas offers a few different options to deal with null values. Based on your dataset, there will likely be a preferred method to account for null values that 1. accurately represents your data and 2. preserves a decent sample size for rigorous analyses.

Option 1: Remove the null columns

We can use the following command, dropna, to remove columns that either have all null values or any null values. Be careful when dropping columns that have any null values - there may be cases where your remaining data set would have very few results to analyze!

    df=df.dropna(axis=1, how='all')
    df=df.dropna(axis=1, how='any')

Option 2: Remove the null rows

Alternatively, we can use dropna to remove rows with all or any null values. This looks just like dropping columns, except the axis parameter is set to 0. Again, use discretion when dropping null rows to ensure your remaining results are representative of the larger set of data.

Option 3: Replace the null values

We can also pick a value that replaces the missing values. For this, we use the fillna function:

    df[col]=df[col].fillna(value)

"Value" can either be a static number (such as 0), or it can just as easily be a summary metric that best represents your data, such as a median or a mean.

Option 4: Interpolate results

There may be times when backfilling or using a static value isn't sufficient for handling null values. In cases where the missing values are numeric, the interpolate function can be used! For example, let's say this is our data:

We can use Python to fill in those three blank values with the following code:

    df["y"] = df["y"].interpolate(method="quadratic")

This will give the following result:

Pretty good!! We can round this by appending .round() to the end of the line:

    df["y"] = df["y"].interpolate(method="quadratic").round()

Quadratic interpolation is just one of the many ways the values can be interpolated. See the Pandas Documentation for more, including cubic and polynomial!

How does your data team handle null values? Share your use cases below!

Getting Around Overlapping Data Labels With Python
A common hazard when adding data labels to a chart is that the labels will often overlap when two points are located relatively close to one another. Fortunately, the flexibility of Python allows us a way around overlapping data labels.

Here is my original chart as a Sisense for Cloud Data Teams scatter plot with data labels applied.

We can see that the data labels overlap in some areas, making the chart hard to read and visually unappealing. Instead, we can create a Python chart and write a custom function that checks the location of the data labels and adjusts any that overlap.

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd

    # source: https://stackoverflow.com/questions/8850142/matplotlib-overlapping-annotations
    def get_text_positions(x_data, y_data, txt_width, txt_height):
        a = list(zip(y_data, x_data))  # materialize the pairs so they can be re-scanned and updated below
        text_positions = y_data.copy()
        for index, (y, x) in enumerate(a):
            local_text_positions = [i for i in a if i[0] > (y - txt_height) and (abs(i[1] - x) < txt_width * 2) and i != (y,x)]
            if local_text_positions:
                sorted_ltp = sorted(local_text_positions)
                if abs(sorted_ltp[0][0] - y) < txt_height:  #True == collision
                    differ = np.diff(sorted_ltp, axis=0)
                    a[index] = (sorted_ltp[-1][0] + txt_height, a[index][1])
                    text_positions[index] = sorted_ltp[-1][0] + txt_height
                    for k, (j, m) in enumerate(differ):
                        #j is the vertical distance between words
                        if j > txt_height * 1.5:  #if True then room to fit a word in
                            a[index] = (sorted_ltp[k][0] + txt_height, a[index][1])
                            text_positions[index] = sorted_ltp[k][0] + txt_height
                            break
        return text_positions

    def text_plotter(x_data, y_data, text_positions, axis, txt_width, txt_height):
        for x, y, t in zip(x_data, y_data, text_positions):
            axis.text(x - .03, 1.02*t, '%d'%int(y), rotation=0, color='blue', fontsize=13)
            if y != t:
                axis.arrow(x, t+20, 0, y-t, color='blue', alpha=0.2, width=txt_width*0.0, head_width=.02, head_length=txt_height*0.5, zorder=0, length_includes_head=True)

    x_data = df['avg']
    y_data = df['count']

    fig = plt.figure(figsize=(15,8))
    ax = fig.add_subplot(111)
    ax.scatter(x_data, y_data, alpha = 0.4)

    txt_height = 0.04*(plt.ylim()[1] - plt.ylim()[0])
    txt_width = 0.02*(plt.xlim()[1] - plt.xlim()[0])

    text_positions = get_text_positions(x_data, y_data, txt_width, txt_height)
    text_plotter(x_data, y_data, text_positions, ax, txt_width, txt_height)

    plt.ylim(0,3610)
    plt.xlim(4.3,6.5)

    periscope.output(plt)

Here is our new Python scatter plot:

Display df.describe() Output In A Dataframe
The .describe() function in Python is really handy for exploratory analysis! Sometimes you'll want to display this as a table on a dashboard, but how? The output is typically printed out as text, not as a dataframe. Fear not, getting this into a dataframe is fairly painless! Before we go on, note that using Python requires the Python/R Integration.

Let's say this is the first 10 rows of your SQL output / starting dataframe (the data below is the User IDs and number of gameplays they made for a fictional gaming company).

    print(df['gameplay_number'].describe())

gives you the following output.

To get this in a table of its own (rather than a printout), use the code below! Note, the key in the dictionary used as an argument for pd.DataFrame is what we want to name the column of summary values in our final output. I chose to name this 'gameplays.'

    df2 = pd.DataFrame({'gameplays': df['gameplay_number'].describe()})
    df2 = df2.reset_index()

    # Use Sisense for Cloud Data Teams to visualize a dataframe by passing the data to periscope.output()
    periscope.output(df2)

And there you have it!
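For reference, here is a minimal sketch of the kind of SQL that could produce a starting dataframe like the one above (the gameplays table and its columns are hypothetical):

    select user_id
      , count(*) as gameplay_number
    from gameplays
    group by user_id

The describe() call then summarizes the gameplay_number column of whatever result set your own query returns.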
Average Time Interval Between A Start And End Time

Looking to get the average time interval between a start and an end time (ex: given the start and end times for runners in a race, find the average time it takes for a runner to complete the race)? Find exactly how to write this up in each flavor of SQL below!

Redshift

    select avg(extract(epoch from endtime) - extract(epoch from starttime)) as avg_time

Postgres

    select avg(endtime - starttime) as avg_time

SQL Server

    select cast(cast(avg(cast((endtime - starttime) as FLOAT) - floor(cast((endtime - starttime) as FLOAT))) as datetime) as time) as 'Avg Time'

MySQL

    select TIME(FROM_UNIXTIME(AVG(UNIX_TIMESTAMP(endtime) - UNIX_TIMESTAMP(starttime)))) AS AVG_TIME

BigQuery (Standard SQL)

    select TIME(TIMESTAMP_MILLIS(CAST(avg(DATETIME_DIFF(endtime,starttime,millisecond)) as INT64))) as avg_time

SQL Server solution inspired by this post
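As a usage sketch tied to the race example above (the race_results table and its columns are hypothetical), the Redshift version might be applied like this:

    select avg(extract(epoch from finish_time) - extract(epoch from start_time)) / 60.0 as avg_minutes_to_finish
    from race_results
    where finish_time is not null

The epoch difference is in seconds, so dividing by 60 reports the average as minutes.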
Using Python To Write A Create Table Statement And Load A CSV Into Redshift

Usually when I need to upload a CSV I will use the Sisense for Cloud Data Teams' CSV functionality. It's fast, easy, allows me to join the data with all my databases, and automatically casts types. Sometimes, however, I like to interact directly with a Redshift cluster, usually for complex data transformations and modeling in Python. When interacting directly with a database, it can be a pain to write a create table statement and load your data. When the table is wide, you have two choices while writing your create table: spend the time to figure out the correct data types, or lazily import everything as text and deal with the type casting in SQL. The first is slow, and the second will get you in trouble down the road.

I recently ran into a great example of this when I found out that Stack Overflow released their awesome-as-always survey results for 2017. They are available in a CSV format, but a daunting 158 columns wide. I wanted to load the data into Redshift, and rather than be generous in my data types, I wanted to use the proper columns. I decided to speed up the load process by writing a Python script, which turned into a fun exercise in data type detection.

IMPORTING LIBRARIES AND READING DATA IN PYTHON

The first step is to import our libraries and load the data into a CSV reader object. The csv library will be used to iterate over the data, and the ast library will be used to determine data type. We will also use a few lists. "Longest" will be a list of the longest values in character length to specify varchar column capacity, "headers" will be a list of the column names, and "type_list" will be the updating list of column types as we iterate over our data.

    import csv, ast, psycopg2

    f = open('/path/to/survey/data/survey_data.csv', 'r')
    reader = csv.reader(f)
    longest, headers, type_list = [], [], []

FINDING THE DATA TYPE

Once we have our data, we need to find the data type for each column. This means we need to evaluate every value and cast to the most restrictive option, from decimalized numbers to integers, and from integers to strings. The function dataType does this. First, it evaluates to see if the value is text or a number, and then for the appropriate type of number if needed. This function consumes both the new data and the current best type to evaluate against.

    def dataType(val, current_type):
        try:
            # Evaluates numbers to an appropriate type, and strings an error
            t = ast.literal_eval(val)
        except ValueError:
            return 'varchar'
        except SyntaxError:
            return 'varchar'

        if type(t) in [int, long, float]:
            if (type(t) in [int, long]) and current_type not in ['float', 'varchar']:
                # Use smallest possible int type
                if (-32768 < t < 32767) and current_type not in ['int', 'bigint']:
                    return 'smallint'
                elif (-2147483648 < t < 2147483647) and current_type not in ['bigint']:
                    return 'int'
                else:
                    return 'bigint'
            if type(t) is float and current_type not in ['varchar']:
                return 'decimal'
        else:
            return 'varchar'

We can iterate over the rows in our CSV, call our function above, and populate our lists.

    for row in reader:
        if len(headers) == 0:
            headers = row
            for col in row:
                longest.append(0)
                type_list.append('')
        else:
            for i in range(len(row)):
                # NA is the csv null value
                if type_list[i] == 'varchar' or row[i] == 'NA':
                    pass
                else:
                    var_type = dataType(row[i], type_list[i])
                    type_list[i] = var_type
                if len(row[i]) > longest[i]:
                    longest[i] = len(row[i])

    f.close()

And use our lists to write the SQL statement.
    statement = 'create table stack_overflow_survey_data ('

    for i in range(len(headers)):
        if type_list[i] == 'varchar':
            statement = (statement + '\n{} varchar({}),').format(headers[i].lower(), str(longest[i]))
        else:
            statement = (statement + '\n' + '{} {}' + ',').format(headers[i].lower(), type_list[i])

    statement = statement[:-1] + ');'

Finally, our output!

    create table stack_overflow_survey_data (
      respondent int
      , professional varchar(56)
      , programhobby varchar(45)
      , country varchar(34)
      ....
      , expectedsalary decimal);

FINISHING THE JOB

Of course, the job isn't done: the data needs to get into Redshift! This can be done using the psycopg2 library, and the astute reader will notice we imported it above. To use the copy command, I first loaded the data to S3. The access key ID and secret access key can be found under users in your AWS console. You can find additional details about the copy command used below on our blog, How to ETL Data into and out of Amazon Redshift.

    conn = psycopg2.connect(
        host='mydb.mydatabase.us-west-2.redshift.amazonaws.com',
        user='user',
        port=5439,
        password='password',
        dbname='example_db')

    cur = conn.cursor()
    cur.execute(statement)
    conn.commit()

    sql = """copy stack_overflow_survey_data from 's3://an-example-bucket/survey_data.csv'
        access_key_id '<access_key_id>'
        secret_access_key '<secret_access_key>'
        region 'us-west-1'
        ignoreheader 1
        null as 'NA'
        removequotes
        delimiter ',';"""
    cur.execute(sql)
    conn.commit()

And you're ready to begin examining some awesome Stack Overflow data! Let's make a few quick charts to celebrate.

Stack Overflow asked a great series of questions asking people what attributes are the most important when making technical hires, including knowledge of algorithms, communications skills, titles, and education. We can compare a couple of those questions with the average salary of the respondents:

    select importantHiringAlgorithms
      , avg(salary) as salary
      , count(*) as respondents
    from stack_overflow_survey_data
    group by 1
    order by 1

And we quickly see that the average respondent's salary actually decreases with a focus on data structures and algorithms when hiring! In contrast, a heavier emphasis on communications skills is associated with higher salaries.

How important should each of the following be in the hiring process?

When we look a level deeper, we can see why: students aren't worried about communications skills yet. While Stack Overflow didn't ask for respondents' titles, I suspect an emphasis on communication skills continues to grow as people advance their careers. It turns out communications skills are highly valued. Luckily, we've written a post about communications in analytics!