Data Model Development Using Jinja

scastle · ‎10-14-2021

Every aspect of analytics is powered by a data model. A data model presents a “single source of truth” that all analytics queries are based on, from internal reports and insights embedded into applications to the data underlying AI algorithms and much more.

During development, a data model is created that addresses all the data stored in sources that could be called upon for a specific purpose. The primary aim when building a model is to transform complex, raw, real-world data into a coherent picture that can be used to answer likely questions from the business.

Designers, engineers, and analysts see data in different ways. The data model facilitates interaction among these groups by reformatting and restructuring data in order to define the relationship among datasets.

Another way of thinking about a data model is like an architect’s blueprint of a building: It helps create a conceptual model that sets the relationship among various data items.

Transformation: The first step to unlocking insights

Data modeling organizes and transforms data. As data is extracted from various sources, each has its preexisting format, so it’s necessary to transform the data into a clean and uniform format. Consistent formatting and ease of access is a vital part of producing trustworthy insights.

Data warehouses have become intensely important in the modern business world. For many organizations, it’s not uncommon for all their data to be extracted, loaded unchanged into data warehouses, and then transformed via cleaning, merging, aggregation, etc. As an alternative, when using data modeling tools, data goes through an extract, load, and transform (ELT) process to convert it into the required format for analysis.

Once data reaches the warehouse, it can be updated offline or in real time. Offline data is updated regularly on a fixed schedule (usually daily, nightly, weekly, or monthly) from the operational database. The operational database is designed to support transactional processing, also known as online transactional processing (OLTP). OLTP does not hold historical data, only current data. OLTP works as a source for a data warehouse that is used to store and manage data in real time. Elements in an operational database are updated immediately when any transaction takes place — for example, airline booking systems, banking systems, payroll records, and employee data.

Data warehouses provide a consolidated, multidimensional view of data along with online analytical processing (OLAP) tools. OLAP tools help in the interactive and effective processing of data in a multidimensional space. That is why OLAP systems are used by analysts and managers whereas OLTP systems are used by database professionals. A data warehouse is a time-variant database that allows analysts to analyze and compare the business over various time periods (year, quarter, month, week, day).

DBT: Data Build Tool

DBT stands for Data Build Tool and is a data modeling tool built by Fishtown Analytics. DBT is a command-line tool where data analysts and engineers write SELECT statements and queries to transform data in their warehouses. DBT does the transform part of the ELT process, transforming data that has already been extracted and loaded into the warehouse in an efficient way.

The DBT function takes code, compiles it in SQL, and runs against the database. DBTs support databases such as BigQuery, Snowflake, Postgres, and Redshift.

There are sub-processes in transforming data that prepare the raw data for analysis:

Merging: This process integrates data from multiple sources of similar or dissimilar structures. For example, we have two data sources: A and B. A has the date in format dd/mm/yyyy, and B has the date in format yyyy/mm/dd. In transformation, these dates are brought into one format.

Cleansing: This process identifies and changes inconsistencies and removes inaccurate data. For example, it marks active accounts as “A” and inactive accounts as “I.”

Normalizing: This process brings numerical data points into the same range to improve accuracy and results.

Filtering: This selects only certain columns to load.

Aggregating: In this process, multiple detailed values are aggregated into a single summary value, such as sum, average, maximum, or minimum.

DBT with SQL and Jinja

SQL is an easy and powerful language to use, but it still does not have programming languages such as Python and Java features. When SQL is combined with a fully featured template language like Jinja, it becomes dynamic and more powerful.

DBT queries with a combination of SQL and a templating language called Jinja turn DBT projects into a programming environment for SQL. It gives the ability to use programming language features such as write functions and control structures that aren’t normally possible in SQL.

DBT does the most difficult and time-consuming step — dynamic transformation of data for data teams — with ease. Jinja helps data analysts benefit from employing the best software engineering principles like version controlling (using GitHub or GitLab), modularity, staging, and unit testing by introducing these principles in the SQL paradigm.

Jinja’s important features

There are some out-of-the-box features offered by Jinja:

Template inheritance

The most powerful feature offered by Jinja is template inheritance. In this feature, a parent or base template is created with all the basic and common features of the site that need to be included in every rendering. Child templates are created that inherit a common structure from the base template and can override the blocks defined in the base template.

HTML escaping

Jinja provides a powerful automatic HTML escaping feature. HTML escaping helps in preventing client site cross-site scripting (XSS attacks). When generating HTML from templates, there’s always a risk that a variable will include special characters like “>,” “<,” or “&” that have special meaning in the template and that affect the resulting HTML.

Sandboxing

Sandboxing is used to evaluate the untrusted code. The sandboxed environment at runtime evaluates what attributes or functions are safe to access. If the code accessed by the template is insecure, then a security error is raised.

Macros

Macros are like import statements in Python. Repetitive code is generally put in macros in Jinja. These macros can go into different templates and can be imported from there. The main aim of a macro is “DRY” (“don’t repeat yourself”).

Data modeling in Jinja

Originally, in order to pull data from a database, we needed to write a complex SQL query, messily copy-pasting or rewriting SQL queries from scratch. Luckily, DBT with Jinja solves this problem by providing an easy way to manage and execute SQL scripts needed for transforming data already in a data warehouse for analytics/BI tools. Instead of having to copy and paste SQL queries in your scripts, you can use templates and reusable data models to build complex analytics queries.

Jinja templates can be big and stored in a Sisense Git repository that can then be used in analysis and visualization. Hosting your templates on external cloud sources has certain advantages over storing templates locally:

Easy sharing of templates
All the configurations always use the most up-to-date template
Version controlling can be done
Imports and schemas are automatically managed

A model is usually a select statement, defined in .sql files. Data modeling with Jinja templating language means combining the Jinja template with SQL statements. It makes writing and maintaining complex SQL queries easier.

Jinja installation

Use the pip tool to install Jinja

<code>

$ sudo pip3 install jinja2

</code>

Example code

We will start by importing the “Template” object from the jinja2 module. “Template” is the central object that represents a compiled template that holds a set of rules for evaluation.

We have a table, PROJECT, with seven columns. If we want to add a new column, “last_update,” dynamically, it can be done using the Jinja template. Template “addtable” is created for the same, and params “table” and “feature” are passed.

<code>

from jinja2 import Template

import pandas as pd

import sqlite3

addtable = '''CREATE TABLE "{{per.table}}"

(id integer PRIMARY KEY,

name text NOT NULL,

priority integer,

status_id integer NOT NULL,

project_id integer NOT NULL,

begin_date text NOT NULL,

end_date text NOT NULL,

"{{per.feature}}" DATE);'''

params = {'table': 'PROJECT', 'feature': 'last_update'}

output = Template(addtable).render(per=params)

conn = sqlite3.connect('sample-db.db')

data = pd.read_sql(output, conn)

print("Table created........")

print(data)

</code>

Output

<Table>

In our template, we have the {{ }} syntax, which is used to print the variable. We have already defined variable values in “params.” The variable “params” is passed on via the render() method. We use the render() method to generate the final output. The method joins the template string with the data passed as an argument.

As you can see, we can use the Jinja template to simplify complicated dynamic queries. Using the Jinja template with SQL, new features can be added automatically, without additional work on the model.

Sisense data models

Sisense provides two types of data models: ElastiCube models and live models. ElastiCubes are Sisense’s proprietary, high-performance analytical databases specifically designed to withstand the extensive querying typically required for business intelligence applications.

Live models run queries directly against the data source. This provides you with near real-time data updates in your dashboard, through live connections that rely on the source database for all queries without caching.

Sisense goes great with a DBT modeling pipeline because it integrates with all major cloud data warehouses, meaning DBT models are easily imported into Sisense for self-service analytics. With DBT and Jinja, data teams are able to use Sisense to reduce the cost of maintaining data models and better support code-driven analytics.

The benefits of DBT with Jinja

DBT with Jinja makes data models easier to maintain through the use of template SQL and helps provide user-friendly, fast, and accurate models for BI systems.
SQL knowledge is sufficient to write queries in Jinja. Programming knowledge is not required.
With Jinja, code legibility (reuse) increases, as there are many features like macros and template inheritance that help increase code reuse in Jinja.
Templates are easily importable as a single file in Jinja and improve model performance.
DBT with Jinja reduces the maintenance cost of the model and improves the speed of delivery of modeled data.

DBT + Sisense: A powerful combination

DBT with Jinja templates becomes a very flexible and powerful tool to generate dynamic SQL queries. Its unique features enable organizations to integrate with existing data warehouses and dynamically transform data from various sources in the required format. At Sisense, we use DBT with Jinja internally because it reduces the cost of maintaining our data models and better supports our code-driven and self-service analytics practice. If you’re running an advanced data team, hopefully this article has given you a lot to think about and actionable how-to steps for using Jinja in your work to reap some of these benefits.

From simplifying advanced analytics operations to speeding up visualizations, whatever you need to get done, Jinja can help. Do more with your skills, your data, and your time, and start infusing insights of all kinds, anywhere you need them, more simply.

Sisense Community