dcmoura/spyql: Query data on the command line with SQL-like SELECTs powered by Python expressions

Date: unknown

Location: github.com

SQL with Python in the middle

Concept

SpyQL is a query language that combines:

  • the simplicity and structure of SQL
  • with the power and readability of Python
SELECT
    date.fromtimestamp(purchase_ts) AS purchase_date,
    price * quantity AS total
FROM csv
WHERE department.upper() == 'IT'
TO json

SQL provides the structure of the query, while Python is used to define expressions, bringing along a vast ecosystem of packages.

SpyQL is fast and memory efficient. Take a look at the benchmarks with GB-size JSON data (best viewed on Colab).

SpyQL command-line tool

Demo video

With the SpyQL command-line tool you can make SQL-like SELECTs powered by Python on top of text data (e.g. CSV and JSON). Data can come from files but also from data streams, such as Kafka, or from databases such as PostgreSQL. Basically, data can come from any command that outputs text :-). More, data can be generated by a Python iterator! Take a look at the examples section to see how to query parquet, process API calls, transverse directories of zipped JSONs, among many other things.

SpyQL also allows you to easily convert between text data formats:

  • FROM: CSV, JSON, TEXT and Python iterators (YES, you can use a list comprehension as the data source)

  • TO: CSV, JSON, SQL (INSERT statements), pretty terminal printing, and terminal plotting.

The JSON format is JSON lines, where each line has a valid JSON object or array. Piping with jq allows SpyQL to handle any JSON input (more on the examples section).

You can leverage command line tools to process other file types like Parquet and XML (more on the examples section).

Installation

To install SpyQL, run this command in your terminal:

Hello world

To test your installation run in the terminal:

spyql "SELECT 'Hello world' as Message TO pretty"

Output:

Message
-----------
Hello world

Try replacing the output format by json and csv, and try adding more columns. e.g. run in the terminal:

spyql "SELECT 'Hello world' as message, 1+2 as three TO json"

Output:

{"message": "Hello world", "three": 3}

SPyQL supports reading/writing json data using orjson, a fast, correct JSON library for Python. To use orjson, you need to install it separately. Then, test it by running the last example again, replacing json by orjson:

spyql "SELECT 'Hello world' as message, 1+2 as three TO orjson"

Output:

{"message":"Hello world","three":3}

Principles

Right now, the focus is on building a command-line tool that follows these core principles:

  • Simple: simple to use with a straightforward implementation
  • Familiar: you should feel at home if you are acquainted with SQL and Python
  • Light: small memory footprint that allows you to process large data that fit into your machine
  • Useful: it should make your life easier, filling a gap in the ecosystem

Syntax

[ IMPORT python_module [ AS identifier ] [, ...] ]
SELECT [ DISTINCT | PARTIALS ]
    [ * | python_expression [ AS output_column_name ] [, ...] ]
    [ FROM csv | spy | text | python_expression | orjson | json [ EXPLODE path ] ]
    [ WHERE python_expression ]
    [ GROUP BY output_column_number | python_expression  [, ...] ]
    [ ORDER BY output_column_number | python_expression
        [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] ]
    [ LIMIT row_count ]
    [ OFFSET num_rows_to_skip ]
    [ TO csv | orjson | json | spy | sql | pretty | plot ]

Notable differences to SQL

In SpyQL:

  • there is guarantee that the order of the output rows is the same as in the input (if no reordering is done)
  • the AS keyword must precede a column alias definition (it is not optional as in SQL)
  • you can always access the nth input column by using the default column names colN (e.g. col1 for the first column)
  • currently only a small subset of SQL is supported, namely SELECT statements without: sub-queries, joins, set operations, etc (check the Syntax section)
  • sub-queries are achieved by piping (see the Command line examples section)
  • aggregation functions have the suffix _agg to avoid conflicts with python's built-in functions:
Operation PostgreSQL SpyQL
Sum all values of a column SELECT sum(col_name) SELECT sum_agg(col_name)
Sum an array SELECT sum(a) FROM (SELECT unnest(array[1,2,3]) AS a) AS t SELECT sum([1,2,3])
  • expressions are pure Python:
SQL SpySQL
x = y x == y
x BETWEEN a AND b a <= x <= b
CAST(x AS INTEGER) int(x)
CASE WHEN x > 0 THEN 1 ELSE -1 END 1 if x > 0 else -1
upper('hello') 'hello'.upper()

Notable differences to Python

Additional syntax

We added additional syntax for making querying easier:

Python SpySQL shortcut Purpose
json['hello']['planet earth'] json->hello->'planet earth' Easy access of elements in dicts (e.g. JSONs)
json['hello'] json.hello Easy access of elements in dicts (e.g. JSONs)

NULL datatype

Python's None generates exceptions when making operations on missing data, breaking query execution (e.g. None + 1 throws a TypeError). To overcome this, we created a NULL type that has the same behavior as in SQL (e.g. NULL + 1 returns NULL), allowing for queries to continue processing data.

Operation Native Python throws SpySQL returns SpySQL warning
NULL + 1 NameError NULL
a_dict['inexistent_key'] KeyError NULL yes
int('') ValueError NULL yes
int('abc') ValueError NULL yes

The above dictionary key access only returns NULL if the dict is an instance of qdict. SpyQL adds qdict, which extends python's native dict. JSONs are automatically loaded as qdict. Unless you are creating dictionaries on the fly you do not need to worry about this.

Importing python modules and user-defined functions

By default, spyql do some commonly used imports:

  • everything from the math module
  • datetime, date and timezone from the datetime module
  • the re module

SpyQL queries support a single import statement at the beginning of the query where several modules can be imported (e.g. IMPORT numpy AS np, sys SELECT ...). Note that the python syntax from module import identifier is not supported in queries.

In addition, you can create a python file that is loaded before executing queries. Here you can define imports, functions, variables, etc using regular python code. Everything defined in this file is available to all your spyql queries. The file should be located at XDG_CONFIG_HOME/spyql/init.py. If the environment variable XDG_CONFIG_HOME is not defined, it defaults to HOME/.config (e.g. /Users/janedoe/.config/spyql/init.py).

Example queries

You can run the following example queries in the terminal: spyql "the_query" < a_data_file

Example data files are not provided on most cases.

Query a CSV (and print a pretty table)

SELECT a_col_name, 'positive' if col2 >= 0 else 'negative' AS sign
FROM csv
TO pretty

Convert CSV to a flat JSON

SELECT * FROM csv TO json

Convert from CSV to a hierarchical JSON

SELECT {'client': {'id': col1, 'name': col2}, 'price': 120.40}
FROM csv TO json

or

SELECT {'id': col1, 'name': col2} AS client, 120.40 AS price
FROM csv TO json

JSON to CSV, filtering out NULLs

SELECT json->client->id AS id, json->client->name AS name, json->price AS price
FROM json
WHERE json->client->name is not NULL
TO csv

Explode JSON to CSV

SELECT json->invoice_num AS id, json->items->name AS name, json-items->price AS price
FROM json
EXPLODE json->items
TO csv

Sample input:

{"invoice_num" : 1028, "items": [{"name": "tomatoes", "price": 1.5}, {"name": "bananas", "price": 2.0}]}
{"invoice_num" : 1029, "items": [{"name": "peaches", "price": 3.12}]}

Output:

id, name, price
1028, tomatoes, 1.5
1028, bananas, 2.0
1029, peaches, 3.12

Python iterator/list/comprehension to JSON

SELECT 10 * cos(col1 * ((pi * 4) / 90))
FROM range(80)
TO json

or

SELECT col1
FROM [10 * cos(i * ((pi * 4) / 90)) for i in range(80)]
TO json

Importing python modules

Here we import hashlib to calculate a md5 hash for each input line. Before running this example you need to install the hashlib package (pip install hashlib).

IMPORT hashlib as hl
SELECT hl.md5(col1.encode('utf-8')).hexdigest()
FROM text

Getting the top 5 records

SELECT int(score) AS score, player_name
FROM csv
ORDER BY 1 DESC NULLS LAST, score_date
LIMIT 5

Aggregations

Totals by player, alphabetically ordered.

SELECT json->player_name, sum_agg(json->score) AS total_score
FROM json
GROUP BY 1
ORDER BY 1

Partial aggregations

Calculating the cumulative sum of a variable using the PARTIALS modifier. Also demoing the lag aggregator.

SELECT PARTIALS
    json->new_entries,
    sum_agg(json->new_entries) AS cum_new_entries,
    lag(json->new_entries) AS prev_entries
FROM json
TO json

Sample input:

{"new_entries" : 10}
{"new_entries" : 5}
{"new_entries" : 25}
{"new_entries" : null}
{}
{"new_entries" : 100}

Output:

{"new_entries" : 10,   "cum_new_entries" : 10,  "prev_entries": null}
{"new_entries" : 5,    "cum_new_entries" : 15,  "prev_entries": 10}
{"new_entries" : 25,   "cum_new_entries" : 40,  "prev_entries": 5}
{"new_entries" : null, "cum_new_entries" : 40,  "prev_entries": 25}
{"new_entries" : null, "cum_new_entries" : 40,  "prev_entries": null}
{"new_entries" : 100,  "cum_new_entries" : 140, "prev_entries": null}

If PARTIALSwas omitted the result would be equivalent to the last output row.

Distinct rows

SELECT DISTINCT *
FROM csv

Command line examples

To run the following examples, type Ctrl-x Ctrl-e on you terminal. This will open your default editor (emacs/vim). Paste the code of one of the examples, save and exit.

Queries on Parquet with directories

Here, find transverses a directory and executes parquet-tools for each parquet file, dumping each file to json format. jq -c makes sure that the output has 1 json per line before handing over to spyql. This is far from being an efficient way to query parquet files, but it might be a handy option if you need to do a quick inspection.

find /the/directory -name "*.parquet" -exec parquet-tools cat --json {} \; |
jq -c |
spyql "
	SELECT json->a_field, json->a_num_field * 2 + 1
	FROM json
"

Querying multiple json.gz files

gzcat *.json.gz |
jq -c |
spyql "
	SELECT json->a_field, json->a_num_field * 2 + 1
	FROM json
"

Querying YAML / XML / TOML files

yq converts yaml, xml and toml files to json, allowing to easily query any of these with spyql.

cat file.yaml | yq -c | spyql "SELECT json->a_field FROM json"
cat file.xml | xq -c | spyql "SELECT json->a_field FROM json"
cat file.toml | tomlq -c | spyql "SELECT json->a_field FROM json"

Kafka to PostgreSQL pipeline

Read data from a kafka topic and write to postgres table name customer.

kafkacat -b the.broker.com -t the.topic |
spyql -Otable=customer -Ochunk_size=1 --unbuffered "
	SELECT
		json->customer->id AS id,
		json->customer->name AS name
	FROM json
	TO sql
" |
psql -U an_user_name -h a.host.com a_database_name

Monitoring statistics in Kafka

Read data from a kafka topic, continuously calculating statistics.

kafkacat -b the.broker.com -t the.topic |
spyql --unbuffered "
	SELECT PARTIALS
        count_agg(*) AS running_count,
		sum_agg(value) AS running_sum,
		min_agg(value) AS min_so_far,
        value AS current_value
	FROM json
	TO csv
"

Sub-queries (piping)

A special file format (spy) is used to efficiently pipe data between queries.

cat a_file.json |
spyql "
	SELECT ' '.join([json->first_name, json->middle_name, json->last_name]) AS full_name
	FROM json
	TO spy" |
spyql "SELECT full_name, full_name.upper() FROM spy"

Queries over APIs

curl https://reqres.in/api/users?page=2 |
spyql "
	SELECT
		json->data->email AS email,
		'Dear {}, thank you for being a great customer!'.format(json->data->first_name) AS msg
	FROM json
	EXPLODE json->data
	TO json
"

Plotting to the terminal

spyql "
    SELECT col1
    FROM [10 * cos(i * ((pi * 4) / 90)) for i in range(80)]
    TO plot
"

GUI Plotting with matplotcli

spyql "
    SELECT col1 AS y
    FROM [10 * cos(i * ((pi * 4) / 90)) for i in range(80)]
    TO json
" | plt "plot(y)"

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.