Working with Parquet in ClickHouse
Parquet is an efficient file format for storing data in a column-oriented way. ClickHouse supports both reading and writing Parquet files.
When you reference a file path in a query, the location ClickHouse attempts to read from depends on the variant of ClickHouse that you're using. If you're using clickhouse-local, it will read from a location relative to where you launched ClickHouse Local. If you're using ClickHouse Server or ClickHouse Cloud via clickhouse client, it will read from a location relative to the /var/lib/clickhouse/user_files/ directory on the server.
Importing from Parquet
Before loading data, we can use the file() function to explore the structure of an example Parquet file:
DESCRIBE TABLE file('data.parquet', Parquet);
We've used Parquet as the second argument, so ClickHouse knows the file format. This will print the columns along with their types:
┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ path │ Nullable(String) │ │ │ │ │ │
│ date │ Nullable(String) │ │ │ │ │ │
│ hits │ Nullable(Int64) │ │ │ │ │ │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
We can also explore files before actually importing the data, using the full power of SQL:
SELECT *
FROM file('data.parquet', Parquet)
LIMIT 3;
┌─path──────────────────────┬─date───────┬─hits─┐
│ Akiba_Hebrew_Academy │ 2017-08-01 │ 241 │
│ Aegithina_tiphia │ 2018-02-01 │ 34 │
│ 1971-72_Utah_Stars_season │ 2016-10-01 │ 1 │
└───────────────────────────┴────────────┴──────┘
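For example, we could run an aggregation directly over the file before deciding what to import; this sketch assumes the same data.parquet file with its path, date, and hits columns:
SELECT
    path,
    sum(hits) AS total_hits
FROM file('data.parquet', Parquet)
GROUP BY path
ORDER BY total_hits DESC
LIMIT 3;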
We can skip the explicit format setting for file() and INFILE/OUTFILE. In that case, ClickHouse will automatically detect the format based on the file extension.
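For example, since data.parquet ends in the .parquet extension, the format argument can be omitted:
SELECT count()
FROM file('data.parquet');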
Importing to an existing table
Let's create a table into which we'll import Parquet data:
CREATE TABLE sometable
(
`path` String,
`date` Date,
`hits` UInt32
)
ENGINE = MergeTree
ORDER BY (date, path);
Now we can import data using the FROM INFILE clause:
INSERT INTO sometable
FROM INFILE 'data.parquet' FORMAT Parquet;
SELECT *
FROM sometable
LIMIT 5;
┌─path──────────────────────────┬───────date─┬─hits─┐
│ 1988_in_philosophy │ 2015-05-01 │ 70 │
│ 2004_Green_Bay_Packers_season │ 2015-05-01 │ 970 │
│ 24_hours_of_lemans │ 2015-05-01 │ 37 │
│ 25604_Karlin │ 2015-05-01 │ 20 │
│ ASCII_ART │ 2015-05-01 │ 9 │
└───────────────────────────────┴────────────┴──────┘
Note how ClickHouse automatically converted the Parquet strings (in the date column) to the Date type. This is because ClickHouse casts values automatically based on the types in the target table.
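As an alternative to the FROM INFILE clause, we could insert via the file() table function, which also lets us filter or transform rows on the way in. This is a sketch assuming the same data.parquet file; the explicit cast and the WHERE condition are just illustrative:
INSERT INTO sometable
SELECT
    path,
    toDate(date) AS date,   -- explicit cast from the Parquet string column
    hits
FROM file('data.parquet', Parquet)
WHERE hits > 0;             -- illustrative filter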
Inserting a local file to remote server
If you want to insert a local Parquet file to a remote ClickHouse server, you can do this by piping the contents of the file into clickhouse-client
, as shown below:
clickhouse client -q "INSERT INTO sometable FORMAT Parquet" < data.parquet
Creating new tables from Parquet files
Since ClickHouse reads the Parquet file schema, we can create tables on the fly:
CREATE TABLE imported_from_parquet
ENGINE = MergeTree
ORDER BY tuple() AS
SELECT *
FROM file('data.parquet', Parquet)
This will automatically create and populate the table from the given Parquet file:
DESCRIBE TABLE imported_from_parquet;
┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ path │ Nullable(String) │ │ │ │ │ │
│ date │ Nullable(String) │ │ │ │ │ │
│ hits │ Nullable(Int64) │ │ │ │ │ │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
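Note that the inferred columns are Nullable. If we'd rather avoid that, here is a sketch, assuming the schema_inference_make_columns_nullable setting is available in your ClickHouse version and using a hypothetical table name:
-- assumption: this setting exists in your ClickHouse version;
-- it stops schema inference from wrapping every column in Nullable
SET schema_inference_make_columns_nullable = 0;

CREATE TABLE imported_without_nulls
ENGINE = MergeTree
ORDER BY tuple() AS
SELECT *
FROM file('data.parquet', Parquet);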
By default, ClickHouse is strict with column names, types, and values. But sometimes we may want to skip nonexistent columns or unsupported values during import. This behavior can be managed with Parquet settings.
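For example, a minimal sketch, assuming the input_format_parquet_allow_missing_columns setting is available in your ClickHouse version, which fills target-table columns that are absent from the Parquet file with default values instead of failing the import:
-- assumption: this setting exists in your ClickHouse version;
-- missing columns are filled with defaults rather than causing an error
SET input_format_parquet_allow_missing_columns = 1;

INSERT INTO sometable
FROM INFILE 'data.parquet' FORMAT Parquet;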
Exporting to Parquet format
When using INTO OUTFILE with ClickHouse Cloud, you will need to run the commands in clickhouse client on the machine where the file will be written.
To export any table or query result to a Parquet file, we can use the INTO OUTFILE clause:
SELECT *
FROM sometable
INTO OUTFILE 'export.parquet'
FORMAT Parquet
This will create the export.parquet file in the working directory.
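Parquet output can also be tuned with format settings. For instance, here is a sketch, assuming the output_format_parquet_compression_method setting is available in your ClickHouse version, that writes a hypothetical export_zstd.parquet file with zstd compression:
-- assumption: output_format_parquet_compression_method exists in this ClickHouse version
SET output_format_parquet_compression_method = 'zstd';

SELECT *
FROM sometable
INTO OUTFILE 'export_zstd.parquet'
FORMAT Parquet;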
ClickHouse and Parquet data types
ClickHouse and Parquet data types are mostly identical but still differ a bit. For example, ClickHouse exports the DateTime type as Parquet's int64. If we then import that back into ClickHouse, we'll see numbers (from the time.parquet file):
SELECT * FROM file('time.parquet', Parquet);
┌─n─┬───────time─┐
│ 0 │ 1673622611 │
│ 1 │ 1673622610 │
│ 2 │ 1673622609 │
│ 3 │ 1673622608 │
│ 4 │ 1673622607 │
└───┴────────────┘
In this case, type conversion can be used:
SELECT
    n,
    toDateTime(time)    -- convert the integer back to DateTime
FROM file('time.parquet', Parquet);
┌─n─┬────toDateTime(time)─┐
│ 0 │ 2023-01-13 15:10:11 │
│ 1 │ 2023-01-13 15:10:10 │
│ 2 │ 2023-01-13 15:10:09 │
│ 3 │ 2023-01-13 15:10:08 │
│ 4 │ 2023-01-13 15:10:07 │
└───┴─────────────────────┘
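If we want to keep the converted values, we could load them into a table with a DateTime column so the cast happens once at import time; the imported_times table below is a hypothetical example:
-- hypothetical target table with a proper DateTime column
CREATE TABLE imported_times
(
    `n` UInt32,
    `time` DateTime
)
ENGINE = MergeTree
ORDER BY n;

INSERT INTO imported_times
SELECT
    n,
    toDateTime(time)    -- convert the exported int64 back to DateTime
FROM file('time.parquet', Parquet);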
Further reading
ClickHouse supports many formats, both text and binary, to cover various scenarios and platforms. Explore more formats and ways to work with them in the following articles:
- CSV and TSV formats
- Avro, Arrow and ORC
- JSON formats
- Regex and templates
- Native and binary formats
- SQL formats
Also check out clickhouse-local - a portable, full-featured tool for working with local/remote files without the need for a ClickHouse server.