SQL/Postgres' Inner Workings & Optimizations - The Hidden Beauty

Although SQL, and database systems like Postgres that implement it, are heavily used, I have found that due to the sheer abstraction they provide, their inner workings remain a mystery to many. The difference mainly comes from the fact that SQL is a declarative language, unlike the imperative languages we are used to, like C++ and Python. Instead of telling it exactly what to do, we declare what we want, and it figures out how to achieve that.

In this post, I'll go over the details of how SQL works and share some findings from my research into optimizing Postgres queries as part of my DBMS project at IIT Kanpur.

A case can be made that imperative languages are themselves declarative at the hardware level, sitting above assembly and system calls. Just as the sort function in Python or the C++ STL abstracts away the exact sorting algorithm used, SQL's ORDER BY keyword does the same.

While there is a lot of truth to that statement, and SQL is indeed a higher-level language, there is a key difference at the algorithm level. Instead of giving us packages with predefined methods to call, SQL has us state what we want, and at runtime it dynamically figures out the best way to achieve that goal based on metadata, estimations, and statistics of the tables involved.
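
To make "statistics" concrete: Postgres gathers per-column statistics with ANALYZE and exposes them in the pg_stats view; these numbers are what the planner's estimations are built on. A minimal look, assuming a hypothetical orders table:

```sql
-- Refresh per-column statistics for the (hypothetical) orders table.
ANALYZE orders;

-- The planner's row-count estimates are derived from data like this.
SELECT attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'orders';
```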

Every SQL query can end up being executed differently, and every keyword used in it could be handled by a different algorithm. The sort method in Python or C++ is much simpler in this sense: the actual algorithm performing the work is fixed (Timsort in CPython, typically introsort in C++ standard libraries). But the same join in SQL could be executed via several different algorithms abstracted away from the user: Nested Loop Join, Merge Join, or Hash Join.
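
You can watch this choice being made with EXPLAIN. A sketch, assuming hypothetical orders and customers tables:

```sql
-- The join algorithm is chosen per query, not fixed in advance.
EXPLAIN
SELECT *
FROM orders o
JOIN customers c ON c.id = o.customer_id;
-- The plan might show a Hash Join here, yet the same query could use
-- a Nested Loop (e.g. with a fitting index and few estimated rows) or
-- a Merge Join (e.g. when both inputs arrive sorted) as the tables'
-- statistics change.
```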

The difference lies in philosophy. While imperative programming languages are meant to be highly abstracted ways to interact with the hardware and tell it what to do with data, SQL was built on the belief that the researchers behind SQL and its implementations can figure out how best to perform a given task dynamically, far better than we programmers can.

Does this mean there's no point in learning what happens behind the scenes? Absolutely not! It is quite crucial that we do, because when we want a query to run faster or with less load, we need to understand what is really happening to make any actual impact. The way to optimize a query is not the way we usually optimize an algorithm; instead, we support the optimizer, by maintaining proper indexes and writing queries that can use them, so that it can accurately find the best way to execute the query.
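
The standard tool for seeing "what is really happening" is EXPLAIN; adding ANALYZE actually runs the query and reports real row counts and timings next to the planner's estimates. A minimal sketch with a hypothetical query:

```sql
-- Compare the planner's estimates against what actually happened.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
```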

My findings from looking into PostgreSQL

The optimizer works by finding equivalent expressions for a given query and choosing the best one based on statistics and estimations. I found that Postgres' optimizer is so well made that the structure of the query makes no difference to performance: ordering WHERE clauses by descending selectivity, rewriting with CTEs, and similar restructurings all end up with the same plan.
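
For instance (a minimal sketch, assuming a hypothetical table t(a int, b int)), predicate order leaves the plan untouched, and since PostgreSQL 12 a plain CTE is inlined into the surrounding query as well:

```sql
-- Both statements produce an identical plan; the planner orders the
-- predicates itself.
EXPLAIN SELECT * FROM t WHERE a = 1 AND b < 100;
EXPLAIN SELECT * FROM t WHERE b < 100 AND a = 1;

-- Since PostgreSQL 12, this CTE is inlined (unless declared
-- MATERIALIZED), so it is planned exactly like the flat query above.
EXPLAIN WITH f AS (SELECT * FROM t WHERE a = 1)
SELECT * FROM f WHERE b < 100;
```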

Postgres seems capable of finding any relational-algebra rewrite that makes the query run better. Additionally, if a redundant filter exists, Postgres will probably estimate that it filters out very little and schedule it late in the query plan, at a point where the intermediate results are already small enough to be loaded into memory, after which operations are relatively instant.
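
One checkable instance of this deferral (a sketch; the table t and the function are hypothetical): Postgres orders the predicates inside a Filter node by estimated per-tuple cost, so a predicate it knows to be expensive is evaluated after the cheap ones, however we wrote the query:

```sql
-- A deliberately expensive predicate; COST is a hint to the planner.
CREATE FUNCTION is_flagged(x int) RETURNS boolean
    LANGUAGE sql IMMUTABLE COST 10000
    AS 'SELECT x > 0';

EXPLAIN SELECT * FROM t WHERE is_flagged(a) AND b = 1;
-- Filter: ((b = 1) AND is_flagged(a))
-- The cheap "b = 1" runs first even though we wrote it second.
```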

This means that we can optimize the execution of any Postgres query in one of two ways (both sketched below):

  1. Ensuring proper indexing, for faster execution of the best expression.
  2. Using domain knowledge to remove implicit redundancy, improving the initial relational query into a non-equivalent one that still yields the relevant result.
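
A hedged sketch of both, with a hypothetical orders table, columns, and index name:

```sql
-- Way 1: give the planner an index that fits the workload, so the
-- plan it wants to pick becomes cheap to execute.
CREATE INDEX orders_customer_id_idx ON orders (customer_id);

-- Way 2: apply domain knowledge the optimizer cannot infer. If we know
-- every order of interest was placed this year, the added predicate is
-- non-equivalent in general but correct for our data, and it lets the
-- planner cut down the scanned range dramatically.
SELECT *
FROM orders
WHERE customer_id = 42
  AND created_at >= DATE '2024-01-01';
```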


Note: The only exception to this I have found is when using foreign data wrappers to join tables that live in different databases. In such cases, using CTEs to filter the foreign table first and then performing the join does seem to make a difference. This is probably because the executing database can communicate with the foreign database only via SQL, whose execution is decided by the foreign database, which limits how effectively the optimizer can do its job.
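
As a sketch of that workaround (assuming postgres_fdw and a foreign table named remote_orders), materializing a filtered CTE before the join keeps the cross-database traffic small:

```sql
-- Pull only the filtered rows across the wire, then join locally.
WITH recent AS MATERIALIZED (
    SELECT id, customer_id, amount
    FROM remote_orders              -- foreign table via postgres_fdw
    WHERE created_at >= DATE '2024-01-01'
)
SELECT c.name, r.amount
FROM recent r
JOIN customers c ON c.id = r.customer_id;
```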
