Documentation Index
Fetch the complete documentation index at: https://private-7c7dfe99-page-updates.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
Overview
This tutorial follows the [ClickHouse tutorial] but runs all its queries via pg_clickhouse.Start ClickHouse
First, create a ClickHouse database if you don’t already have one. A quick way to start is with the Docker image:Create a Table
Let’s borrow from the [ClickHouse tutorial] to create a simple database with The New York City taxi dataset:Add the Data Set
And then import the data:Install pg_clickhouse
Build and install pg_clickhouse from PGXN or GitHub. Or spin up a Docker container using the pg_clickhouse image, which simply adds pg_clickhouse to the Docker Postgres image:Connect pg_clickhouse
Now connect to Postgres:password option.
Now, add the taxi table, just import it all of the tables from the remote
ClickHouse database into a Postgres schema:
\det+ to see it:
\d to show all the columns:
COUNT() aggregate, so it runs on ClickHouse and only
returns the single row to Postgres. Use EXPLAIN to see it:
Analyze the data
Run some queries to analyze the data. Explore the following examples or try your own SQL query.-
Calculate the average tip amount:
-
Calculate the average cost based on the number of passengers:
-
Calculate the daily number of pickups per neighborhood:
-
Calculate the length of each trip in minutes, then group the results by
trip length:
-
Show the number of pickups in each neighborhood broken down by hour of the day:
-
Set display time zone for New York and retrieve rides to LaGuardia or JFK
airports:
Create a Dictionary
Create a dictionary associated with a table in your ClickHouse service. The table and dictionary are based on a CSV file that contains a row for each neighborhood in New York City. The neighborhoods are mapped to the names of the five New York City boroughs (Bronx, Brooklyn, Manhattan, Queens and Staten Island), as well as Newark Airport (EWR). Here’s an excerpt from the CSV file you’re using in table format. TheLocationID column in the file maps to the pickup_nyct2010_gid and
dropoff_nyct2010_gid columns in your trips table:
| LocationID | Borough | Zone | service_zone |
|---|---|---|---|
| 1 | EWR | Newark Airport | EWR |
| 2 | Queens | Jamaica Bay | Boro Zone |
| 3 | Bronx | Allerton/Pelham Gardens | Boro Zone |
| 4 | Manhattan | Alphabet City | Yellow Zone |
| 5 | Staten Island | Arden Heights | Boro Zone |
-
Still in Postgres, use the
clickhouse_raw_queryfunction to create a ClickHouse dictionary namedtaxi_zone_dictionaryand populate the dictionary from the CSV file in S3:
Setting
LIFETIME to 0 disables automatic updates to avoid unnecessary
traffic to our S3 bucket. In other cases, you might configure it
differently. For details, see Refreshing dictionary data using
LIFETIME.- Now import it:
- Confirm we can query it:
- Excellent. Now use the
dictGetfunction unction to retrieve a borough’s name in a query. For this query sums up the number of taxi rides per borough that end at either the LaGuardia or JFK airport:
Perform a join
Write some queries that join thetaxi_zone_dictionary with your trips
table.
-
Start with a simple
JOINthat acts similarly to the previous airport query above:
Notice the output of the above
JOIN query is the same as the dictGet
query above, (except that the Unknown values aren’t included). Behind
the scenes, ClickHouse is actually calling the dictGet function for
the taxi_zone_dictionary dictionary, but the JOIN syntax is more
familiar for SQL developers.-
This query returns rows for the the 1000 trips with the highest tip
amount, then performs an inner join of each row with the dictionary:
Generally, we avoid using
SELECT * in PostgreSQL and ClickHouse. You
should only retrieve the columns you actually need.