This dataset contains weather measurements for the last 120 years. Each row is a measurement for a point in time and station. More precisely and according to the origin of this data:Documentation Index
Fetch the complete documentation index at: https://private-7c7dfe99-page-updates.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
GHCN-Daily is a dataset that contains daily observations over global land areas. It contains station-based measurements from land-based stations worldwide, about two-thirds of which are for precipitation measurements only (Menne et al., 2012). GHCN-Daily is a composite of climate records from numerous sources that were merged together and subjected to a common suite of quality assurance reviews (Durre et al., 2010). The archive includes the following meteorological elements:
- Daily maximum temperature
- Daily minimum temperature
- Temperature at the time of observation
- Precipitation (i.e., rain, melted snow)
- Snowfall
- Snow depth
- Other elements where available
Downloading the data
- A pre-prepared version of the data for ClickHouse, which has been cleansed, re-structured, and enriched. This data covers the years 1900 to 2022.
- Download the original data and convert to the format required by ClickHouse. Users wanting to add their own columns may wish to explore this approach.
Pre-prepared data
More specifically, rows have been removed that didn’t fail any quality assurance checks by Noaa. The data has also been restructured from a measurement per line to a row per station id and date, i.e.Original data
The following details the steps to download and transform the original data in preparation for loading into ClickHouse.Download
To download the original data:Sampling the data
- An 11 character station identification code. This itself encodes some useful information
- YEAR/MONTH/DAY = 8 character date in YYYYMMDD format (e.g. 19860529 = May 29, 1986)
- ELEMENT = 4 character indicator of element type. Effectively the measurement type. While there are many measurements available, we select the following:
- PRCP - Precipitation (tenths of mm)
- SNOW - Snowfall (mm)
- SNWD - Snow depth (mm)
- TMAX - Maximum temperature (tenths of degrees C)
- TAVG - Average temperature (tenths of a degree C)
- TMIN - Minimum temperature (tenths of degrees C)
- PSUN - Daily percent of possible sunshine (percent)
- AWND - Average daily wind speed (tenths of meters per second)
- WSFG - Peak gust wind speed (tenths of meters per second)
- WT** = Weather Type where ** defines the weather type. Full list of weather types here.
- DATA VALUE = 5 character data value for ELEMENT i.e. the value of the measurement.
- M-FLAG = 1 character Measurement Flag. This has 10 possible values. Some of these values indicate questionable data accuracy. We accept data where this is set to “P” - identified as missing presumed zero, as this is only relevant to the PRCP, SNOW and SNWD measurements.
- Q-FLAG is the measurement quality flag with 14 possible values. We’re only interested in data with an empty value i.e. it didn’t fail any quality assurance checks.
- S-FLAG is the source flag for the observation. Not useful for our analysis and ignored.
- OBS-TIME = 4-character time of observation in hour-minute format (i.e. 0700 =7:00 am). Typically not present in older data. We ignore this for our purposes.
qFlag is equal to an empty string.
Clean the data
Using ClickHouse local we can filter rows that represent measurements of interest and pass our quality requirements:Pivot data
While the measurement per line structure can be used with ClickHouse, it will unnecessarily complicate future queries. Ideally, we need a row per station id and date, where each measurement type and associated value are a column i.e.GROUP BY, we can repivot our data to this structure. To limit memory overhead, we do this one file at a time.
noaa.csv.
Enriching the data
The data has no indication of location aside from a station id, which includes a prefix country code. Ideally, each station would have a latitude and longitude associated with it. To achieve this, NOAA conveniently provides the details of each station as a separate ghcnd-stations.txt. This file has several columns, of which five are useful to our future analysis: id, latitude, longitude, elevation, and name.noaa_enriched.parquet.
Create table
Create a MergeTree table in ClickHouse (from the ClickHouse client).Inserting into ClickHouse
Inserting from local file
Data can be inserted from a local file as follows (from the ClickHouse client):<path> represents the full path to the local file on disk.
See here for how to speed this load up.