Documentation Index
Fetch the complete documentation index at: https://private-7c7dfe99-page-updates.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
This dataset contains over 150M customer reviews of Amazon products. The data is in snappy-compressed Parquet files in AWS S3 that total 49GB in size (compressed). Let’s walk through the steps to insert it into ClickHouse.
Loading the dataset
- Without inserting the data into ClickHouse, we can query it in place. Let’s grab some rows, so we can see what they look like:
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2015.snappy.parquet')
LIMIT 3
The rows look like:
Row 1:
──────
review_date: 16462
marketplace: US
customer_id: 25444946 -- 25.44 million
review_id: R146L9MMZYG0WA
product_id: B00NV85102
product_parent: 908181913 -- 908.18 million
product_title: XIKEZAN iPhone 6 Plus 5.5 inch Waterproof Case, Shockproof Dirtproof Snowproof Full Body Skin Case Protective Cover with Hand Strap & Headphone Adapter & Kickstand
product_category: Wireless
star_rating: 4
helpful_votes: 0
total_votes: 0
vine: false
verified_purchase: true
review_headline: case is sturdy and protects as I want
review_body: I won't count on the waterproof part (I took off the rubber seals at the bottom because the got on my nerves). But the case is sturdy and protects as I want.
Row 2:
──────
review_date: 16462
marketplace: US
customer_id: 1974568 -- 1.97 million
review_id: R2LXDXT293LG1T
product_id: B00OTFZ23M
product_parent: 951208259 -- 951.21 million
product_title: Season.C Chicago Bulls Marilyn Monroe No.1 Hard Back Case Cover for Samsung Galaxy S5 i9600
product_category: Wireless
star_rating: 1
helpful_votes: 0
total_votes: 0
vine: false
verified_purchase: true
review_headline: One Star
review_body: Cant use the case because its big for the phone. Waist of money!
Row 3:
──────
review_date: 16462
marketplace: US
customer_id: 24803564 -- 24.80 million
review_id: R7K9U5OEIRJWR
product_id: B00LB8C4U4
product_parent: 524588109 -- 524.59 million
product_title: iPhone 5s Case, BUDDIBOX [Shield] Slim Dual Layer Protective Case with Kickstand for Apple iPhone 5 and 5s
product_category: Wireless
star_rating: 4
helpful_votes: 0
total_votes: 0
vine: false
verified_purchase: true
review_headline: but overall this case is pretty sturdy and provides good protection for the phone
review_body: The front piece was a little difficult to secure to the phone at first, but overall this case is pretty sturdy and provides good protection for the phone, which is what I need. I would buy this case again.
- Let’s define a new
MergeTree table named amazon_reviews to store this data in ClickHouse:
CREATE DATABASE amazon
CREATE TABLE amazon.amazon_reviews
(
`review_date` Date,
`marketplace` LowCardinality(String),
`customer_id` UInt64,
`review_id` String,
`product_id` String,
`product_parent` UInt64,
`product_title` String,
`product_category` LowCardinality(String),
`star_rating` UInt8,
`helpful_votes` UInt32,
`total_votes` UInt32,
`vine` Bool,
`verified_purchase` Bool,
`review_headline` String,
`review_body` String,
PROJECTION helpful_votes
(
SELECT *
ORDER BY helpful_votes
)
)
ENGINE = MergeTree
ORDER BY (review_date, product_category)
- The following
INSERT command uses the s3Cluster table function, which allows the processing of multiple S3 files in parallel using all the nodes of your cluster. We also use a wildcard to insert any file that starts with the name https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet:
INSERT INTO amazon.amazon_reviews SELECT *
FROM s3Cluster('default',
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet')
In ClickHouse Cloud, the name of the cluster is default. Change default to the name of your cluster…or use the s3 table function (instead of s3Cluster) if you don’t have a cluster.
- That query doesn’t take long - averaging about 300,000 rows per second. Within 5 minutes or so you should see all the rows inserted:
- Let’s see how much space our data is using:
The original data was about 70G, but compressed in ClickHouse it takes up about 30G.
Example queries
- Let’s run some queries. Here are the top 10 most-helpful reviews in the dataset:
This query is using a projection to speed up performance.
- Here are the top 10 products in Amazon with the most reviews:
- Here are the average review ratings per month for each product (an actual Amazon job interview question!):
- Here are the total number of votes per product category. This query is fast because
product_category is in the primary key:
- Let’s find the products with the word “awful” occurring most frequently in the review. This is a big task - over 151M strings have to be parsed looking for a single word:
SELECT
product_id,
any(product_title),
avg(star_rating),
count() AS count
FROM amazon.amazon_reviews
WHERE position(review_body, 'awful') > 0
GROUP BY product_id
ORDER BY count DESC
LIMIT 50;
Notice the query time for such a large amount of data. The results are also a fun read!
- We can run the same query again, except this time we search for awesome in the reviews:
SELECT
product_id,
any(product_title),
avg(star_rating),
count() AS count
FROM amazon.amazon_reviews
WHERE position(review_body, 'awesome') > 0
GROUP BY product_id
ORDER BY count DESC
LIMIT 50;