SQL... but better

December 26, 2023

I can't believe it's not SQL

So, with my first actual technical blog - here’s something that is really interesting, but probably very difficult to convince a team to take on board; but hey-ho it’s interesting to me.

SQL, at least the DQL part is a language created with the idea to sound english - this allows the basic structure of SQL

SELECT field1, field2
FROM table1
WHERE field1 > 100;

… to be read like (with some grammar stuff added) …

Select these fields from this table where this field 
is greater than 100

… pretty much like for like.

which you know, is great and all, but this does mean a lot of shortcomings and leaves many areas for improvement - and does lead to the statement

There's got to be a better way!

And there are in the corners of the internet, the following are a couple of ones that I’ve played around with, or have heard good things about - out of pure simplicity sake, I’m going to split the run through into 2 categories - friendly SQL & SQL alternatives.

Friendly SQL

This category I’m kind of using to cover a number of changes made by groups like the DuckDB developers - this amounts to small snippets and shortcuts that can be used within standard ANSI SQL.

A couple of examples include

No need for SELECT *

  /* Instead of SELECT * FROM tbl1 */
  FROM tbl1

Not needed to expand the * with fields we don’t want, or small calculations

  SELECT *
  EXCLUDE (field1, field2) -- this will show all apart from field1 & field 2
  REPLACE (field3 + 5 AS field3) -- this will replace field3 with field3 + 5
  FROM tbl1

No need to specify every column when grouping and ordering

  SELECT field1,
      field2,
      field3, -- note the trailing column 🥰
  FROM tbl1
  GROUP BY ALL -- same as GROUP BY field1, field2, field3
  ORDER BY ALL -- same as ORDER BY field1, field2, field3

Using aliased names in WHERE / GROUP BY / HAVING

  SELECT
      field1 AS some_bool,
      field2 AS some_data,
      SUM(field3) AS total_stuff
  FROM tbl1
  WHERE
      some_bool = 1
  GROUP BY
      some_bool,
      some_data
  HAVING
      total_stuff > 0

This post on DuckDB or this MotherDuck LinkedIn stream with this accompanying Colab delves into this in more detail - a nice thing about DuckDB as well is that it’s super easy to spin up or you can even use a demo up on the DuckDB home page.

Unfortunately, these changes are only applicable to DuckDB, much like with DuckDB though; there are other friendly SQL/quality of life changes in platforms such as Snowflake.

SQL Alternatives

The other 2 things that I’ll briefly run through are essentially new languages that compiles to SQL; we have 2 to showcase, PRQL (pronounced prequel) which is a general purpose query language, and Malloy which is more towards a semantic layer for analysis.

PRQL

Prequel is one of my favourites; it has many of the quality of life improvements featured under the friendly SQL section, but contains solutions for many of the problems raised in the Against SQL article; it’s more concise, includes many features of other languages that really helps it along, and is system agnostic (with switches for different engines).

The best introductory example is provided on the PRQL playground

from invoices                               # A PRQL query begins with a table
                                            # Subsequent lines "transform" (modify) it
derive {                                    # "derive" adds columns to the result
  transaction_fee = 0.8,                    # "=" sets a column name
  income = total - transaction_fee          # Calculations can use other column names
}

# starts a comment; commenting out a line leaves a valid query
filter income > 5                           # "filter" replaces both of SQL's WHERE & HAVING
filter invoice_date >= @2010-01-16          # Clear date syntax
group customer_id (                         # "group" performs the pipeline in (...) on each group
  aggregate {                               # "aggregate" reduces each group to a single row
    sum_income = sum income,                # ... using SQL SUM(), COUNT(), etc. functions
    ct = count customer_id,          
  }
)
join c=customers (==customer_id)            # join on "customer_id" from both tables
derive name = f"{c.last_name}, 
    {c.first_name}"                         # F-strings like Python
derive db_version = s"version()"            # S-string offers escape hatch to SQL
select {                                    # "select" passes along only the named columns
  c.customer_id, 
  name, 
  sum_income, 
  ct, 
  db_version,
}                                           # trailing commas always ignored
sort {-sum_income}                          # "sort" sorts the result; "-" is decreasing order
take 1..10                                  # Limit to a range - could also be "take 10"

which compiles to

WITH table_1 AS (
  SELECT
    customer_id,
    total - 0.8 AS _expr_0,
    invoice_date
  FROM
    invoices
),
table_0 AS (
  SELECT
    COALESCE(SUM(_expr_0), 0) AS sum_income,
    COUNT(*) AS ct,
    customer_id
  FROM
    table_1
  WHERE
    _expr_0 > 5
    AND invoice_date >= DATE '2010-01-16'
  GROUP BY
    customer_id
)
SELECT
  c.customer_id,
  CONCAT(c.last_name, ', ', c.first_name) AS name,
  table_0.sum_income,
  table_0.ct,
  version() AS db_version
FROM
  table_0
  JOIN customers AS c ON table_0.customer_id = c.customer_id
ORDER BY
  table_0.sum_income DESC
LIMIT
  10

There are a number of reasons as to why I enjoy playing around with this, from the simple quality of life things (trailing commas & alias derived like variables in pretty much any other language) to more (piping, universal functions and variable declarations); I could wax lyrical and blurt out in a rambling way, but the best thing to do is look for yourself - the documentation is growing but is already quite detailed and the playground, VS Code extension, and all of the bindings and integrations makes it proper easy to play around with and juggle with

Just look at all of the integrations, tools, and integrations...

Malloy

This experiment is developed by Google and covers many of the same things as PRQL (albeit it is missing a few features and is more immature), but what it does have above and beyond is the ability to create a sematic layer (deriving aggregations and dimensions) and dashboards from them as well.

And much like with PRQL, there’s VS Code extensions, an online playground, and an integration with Python & Jupyter.

I haven’t played with this too much, just spent time flicking through the site and looking at the documentation but the way it is defined does seem interesting. It seems to be split into 2, queries and sources - with sources you can define the structure of a semantic layer, extending with dimensions and measures, basically creating reusable components. Queries can then be ran from these sources. Views and dashboard elements can also be added to these to create general model files, as with many other languages these files are composable and can be imported into each other to create DRY projects.

As I said, I haven’t played around with it much, but I am interested to carry on with playing around with it and creating some simple dashboards that can be defined in a simple syntax.

So that’s it, not a walkthrough or anything, just a simple show of what’s going on in the world of replacing SQL… just to end the year on a note of change. Saying that, these are all in the experimental stages, and SQL is still key to data at the moment.

Till next time…