
New version of the Berkeley Earth Surface Temperature data set

by Steve Mosher and Zeke Hausfather

Today the Berkeley Earth Surface Temperature Project publicly released its accumulated minimum, maximum, and mean monthly data.

The data set is a composite of fourteen different surface temperature datasets, including:

  1. Global Historical Climatology Network – Monthly
  2. Global Historical Climatology Network – Daily
  3. US Historical Climatology Network – Monthly
  4. World Monthly Surface Station Climatology
  5. Hadley Centre / Climatic Research Unit Data Collection
  6. US Cooperative Summary of the Month
  7. US Cooperative Summary of the Day
  8. US First Order Summary of the Day
  9. Scientific Committee on Antarctic Research
  10. GSN Monthly Summaries from NOAA
  11. Monthly Climatic Data of the World
  12. GCOS Monthly Summaries from DWD
  13. World Weather Records (only those published since 1961)
  14. Colonial Era Weather Archives

This represents an unprecedented amount of land measurement data, with 40,752 unique station records comprising over 15 million station-months of data. It is also an invaluable resource for those of us interested in analyzing land temperature data, as it provides considerably better spatial coverage than any prior dataset.

The overall picture is unsurprising: the Berkeley Earth data shows nearly the same long-term land warming trend found in NCDC, GISTemp, and CRUTEM records.

Note that the CRUTEM record used here is the Simple Average land product rather than the more commonly displayed hemispheric-weighted product, as it is more methodologically comparable to the records produced by other groups. The GISTemp record shown here has a land mask applied.

The Berkeley group does go considerably further back in time than any prior record, with data available since 1750 (or, more reliably, since 1800). The uncertainty ranges for these early periods are derived in part by comparing regions where early coverage is available to the overall global land temperature during times when both have excellent coverage.

The Berkeley group also applies a novel approach to dealing with inhomogeneities. It detects series breakpoints based both on metadata and on neighbor comparisons, and cuts series at those breakpoints, turning them into effectively separate records. These “scalpeled” records are then recombined using a least-squares approach. For more information on the specific methods used, see Rohde et al. (submitted).
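To make the recombination step concrete, here is a toy sketch in R of one common least-squares approach to combining fragments (this is our illustration, not the Berkeley Earth code): each scalpeled fragment is given its own offset, and the offsets and a shared signal are solved for iteratively so as to minimize the squared misfit to the fragments.

# recombineLS: toy least-squares recombination of scalpeled fragments (illustrative only)
# frags: list of numeric vectors (temperature fragments)
# times: list of integer vectors giving each fragment's position on a common time axis
recombineLS <- function(frags, times, n.iter = 50) {
  n.t    <- max(unlist(times))
  offset <- rep(0, length(frags))
  signal <- rep(NA_real_, n.t)
  for (k in seq_len(n.iter)) {
    # given the offsets, the best shared signal is the mean of the offset-corrected fragments
    num <- rep(0, n.t); den <- rep(0, n.t)
    for (i in seq_along(frags)) {
      num[times[[i]]] <- num[times[[i]]] + (frags[[i]] - offset[i])
      den[times[[i]]] <- den[times[[i]]] + 1
    }
    signal <- ifelse(den > 0, num / den, NA)
    # given the signal, the best offset for each fragment is its mean departure from it
    for (i in seq_along(frags)) {
      offset[i] <- mean(frags[[i]] - signal[times[[i]]], na.rm = TRUE)
    }
  }
  # note: the absolute level is arbitrary; only the relative offsets matter
  list(signal = signal, offset = offset)
}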

Having access to the raw data allows us to examine how the results differ if raw data is used and no homogenization process is applied.

 

Here we see the published Berkeley data compared to two different methods using their newly released data. The “Zeke” method uses a simple Common Anomaly Method (CAM) coupled with 5×5 lat/lon grid cells, and excludes any records that do not have at least 10 years of data during the 1971-2000 period. The “Nick” method (using Mosher’s implementation of Nick Stokes’ code) uses a Least Squares Method (LSM) to combine fragmentary station records within 5×5 grid cells and only excludes stations with fewer than 36 months of data in the entire station history. The “Zeke” method employs a land mask to adjust the weights of grid cells based on their land area, while the “Nick” method does not. The dataset analyzed here is the “Quality Controlled” release, which involves removing obviously wrong data (e.g., a few 10,000 °C observations) but makes no adjustments.

The results show that our “raw” series are similar to Berkeley’s homogenized series, but with a slightly lower slope over the century period (and a less steep rise over the last 30 years). How much of this is due to differing methodological choices versus homogenization is still unclear.
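For readers who want to try something along these lines themselves, here is a minimal sketch of the Common Anomaly Method step used in the “Zeke” series. This is not the actual script behind the figure: the data.frame layout and the function name are assumptions for illustration, and the land-mask weighting is omitted.

# camSeries: minimal CAM sketch. "obs" is assumed to be a data.frame with columns
# id, lat, lon, year, month, temp (one row per station-month).
camSeries <- function(obs, base.start = 1971, base.end = 2000, min.years = 10) {
  # baseline window and station screen: at least min.years distinct years in 1971-2000
  base <- obs[obs$year >= base.start & obs$year <= base.end, ]
  nyrs <- tapply(base$year, base$id, function(y) length(unique(y)))
  keep <- names(nyrs)[nyrs >= min.years]
  obs  <- obs[obs$id %in% keep, ]
  base <- base[base$id %in% keep, ]
  # station-month climatology over the baseline, and anomalies against it
  clim <- aggregate(list(clim = base$temp), by = list(id = base$id, month = base$month), mean)
  obs  <- merge(obs, clim, by = c("id", "month"))
  obs$anom <- obs$temp - obs$clim
  # 5x5 degree cells; the cell-centre latitude supplies a cos(lat) area weight
  obs$clat <- floor(obs$lat / 5) * 5 + 2.5
  obs$cell <- paste(obs$clat, floor(obs$lon / 5) * 5 + 2.5)
  cells <- aggregate(list(anom = obs$anom),
                     by = list(cell = obs$cell, clat = obs$clat,
                               year = obs$year, month = obs$month), mean)
  w <- cos(cells$clat * pi / 180)
  out <- aggregate(list(anom = cells$anom * w, w = w),
                   by = list(year = cells$year, month = cells$month), sum)
  out$anom <- out$anom / out$w   # area-weighted mean of cell anomalies for each month
  out[order(out$year, out$month), c("year", "month", "anom")]
}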

Finally, the newly released Berkeley data includes metadata flags indicating the origin of each station’s data. This allows us to compare the standard Global Historical Climatology Network-Monthly (GHCN-M) data that underlies the existing records (GISTemp, CRUTEM, NCDC) to all of the new non-GHCN-M data that has been added.

Here we see (using the “Zeke” method described earlier) that non-GHCN-M stations produce a record quite similar to that of GHCN-M stations, though there is a bit more noise early in the record, as the non-GHCN-M set has fewer station-months that far back.

Steven Mosher has done extensive work on building a (free and open source) R package to import and process the Berkeley Earth dataset. Details about his package are below:

The first official release of the Berkeley Earth dataset can be a bit daunting, but everything is there for people to get started looking at the data. An R package, BerkeleyEarth, has been created to provide an easy way to get the data. First, some background on the data ingest process. The Berkeley Earth Surface Temperature project ingests many different source datasets, which are then transformed into a common format. The package has not been tested with the common format and won’t support reading daily data until a future release. The sources are defined in the file source_flag_definitions.txt. That file is read with the function readSourceFlagDefine().
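A minimal first step might look like the following; the Directory argument is an assumption made by analogy with the readers described below, and the path is of course hypothetical:

library(BerkeleyEarth)
myDir   <- "~/BerkeleyEarth"                        # hypothetical path to the unpacked release
sources <- readSourceFlagDefine(Directory = myDir)  # parses source_flag_definitions.txt
head(sources)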

The next step in the data process is to create a multi-value file. This file is created by merging the source data into one dataset. A given site will often have 3-4 series that make up its complete record. At this point only limited QC is done on the data. In future releases of the package the process of merging data and the QC applied will be documented. At this stage the package has not been tested on the multi-value dataset.

Two further steps are then applied to the multi-value data: a quality control step and a seasonality step. After these two processes there are four datasets, all of them single-valued. The QC step applies quality flags to the data, removing those data elements that are suspect. The seasonality step removes the seasonal cycle; this is described in the main readme for the data. For the R package the following dataset was used: single-valued, quality controlled, no seasonality removed. Over time, and with help from others, the other single-valued datasets will be tested, as well as the multi-value dataset and the source datasets.

Reading the data:

There are three different functions for reading the main datafile, data.txt. That file contains 7 columns of data: the station ID, the series ID, the date, the temperature, the uncertainty, the number of observations, and the time of observation. The data is monthly, so the number of observations indicates how many days of the month reported observations. The data is presented in a sparse time format: missing months are simply not represented in the file, and no NA values are provided for them.
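Before turning to the package readers, one can peek at the raw file with base R. This is a quick sketch only: it assumes whitespace-separated values with header lines marked by “%”, so adjust if the file’s header format differs.

myDir <- "~/BerkeleyEarth"                          # hypothetical path to the unpacked release
cols  <- c("station.id", "series.id", "date", "temperature",
           "uncertainty", "n.obs", "time.of.obs")
peek  <- read.table(file.path(myDir, "data.txt"), nrows = 10,
                    comment.char = "%", col.names = cols)
peek   # first ten station-month records in the sparse format described above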

A simple routine is provided for reading in this data as is:

readBerkeleyData(). The size of the Berkeley dataset is such that it may overrun the user’s RAM. To manage this, the function creates a memory-mapped file of the data. On the first invocation of the function, if the memory-mapped file does not exist it is created. The function is called like so: Bestdata <- readBerkeleyData(Directory, filename = "data.bin"). On the first call “data.bin” does not exist, so it is created. This takes about 10 minutes. Once that file is created, subsequent access is immediate: the function simply attaches the existing file. This is accomplished using the package bigmemory. All 7 columns are returned as a matrix.
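In practice the first call and later calls look the same; the column extraction below assumes the matrix keeps the seven columns in the order listed earlier (paths hypothetical):

library(BerkeleyEarth)
myDir    <- "~/BerkeleyEarth"                               # hypothetical path to the release
Bestdata <- readBerkeleyData(Directory = myDir, filename = "data.bin")
dim(Bestdata)            # station-months in rows, the 7 columns described above
temps <- Bestdata[, 4]   # temperature, assuming the columns follow the order listed earlier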

To access only the temperature data, two functions are provided: readBerkeleyTemp() and readAsArray(). At present readBerkeleyTemp() is not optimized. That function reads the data and creates a file-backed matrix for just the temperature data. It takes hours; once the buffering is optimized this will be reduced. The approach is the same as with reading all the data: data.txt is read and processed into a 2D matrix (if data.bin already exists, it is read instead). The 2D matrix has a row for every time in the dataset from 1701 to the present and a column for every station, over 44,000 in all. Missing months are filled in with NA, and there will be stations that have no data at all. Once the file-backed matrix is created, access to the data is immediate, but the first call to the function takes hours to rebuild the dataset into a 2D matrix. As noted above, buffering will be added to speed this up. If your system has less than 2 GB of RAM you are almost forced to use this method of reading. The function is called like so: Bestdata <- readBerkeleyTemp(Directory, filename = "temperature.bin"). Again, if temperature.bin does not exist it will be created by reading data.bin (or creating data.bin from data.txt). If temperature.bin exists it will be attached and access is immediate.
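Usage follows the same pattern; a short sketch (paths hypothetical, and remember the first build of temperature.bin is slow):

library(BerkeleyEarth)
myDir   <- "~/BerkeleyEarth"                     # hypothetical path to the release
TempMat <- readBerkeleyTemp(Directory = myDir, filename = "temperature.bin")
dim(TempMat)                    # rows are months from 1701 on, columns are the 44,000+ stations
plot(TempMat[, 1], type = "l")  # quick look at the first station's monthly values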

The readAsArray() function is for users who have at least 4 GB of RAM. This function reads data.txt into RAM and converts it into a 3D array of temperatures. At this time the function does not create a file-backed, memory-mapped version of the data. The dimensions of the array are station, month, and year. All of the functions in the package RghcnV3 use this data format. So, for example, to read the data in, remove the stations that have no data, and window the result to 1880-2010, we do the following:

Berkdata <- readAsArray(Directory = myDir, filename = "data.txt")
Berkdata <- windowArray(Berkdata, start = 1880, end = 2010)
Berkdata <- removeNaStations(Berkdata)

The array format has R dimnames applied, so the station IDs are the dimnames for margin 1, months are margin 2, and years are margin 3. Berkdata[1, "jan", "1980"] extracts January 1980 for the first station. It’s a simple matter to create time series from the array. The array can also be turned into a multiple time series with asMts(). This function unwraps the 3D array into a 2D matrix of time series with stations in columns and time in rows. Berkdata <- asMts(Berkdata); plot(Berkdata[, 3]) plots the 3rd station as a time series.
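Individual series can also be built directly from the 3D array (i.e., before the asMts() conversion above); a short sketch, assuming the array was windowed to 1880-2010 as in the earlier example:

# using the 3D array from the readAsArray()/windowArray() example, before asMts()
s1    <- Berkdata[1, , ]     # a 12 x nYears matrix: months in rows, years in columns
s1.ts <- ts(as.vector(s1), start = c(1880, 1), frequency = 12)
plot(s1.ts, ylab = "Temperature", main = dimnames(Berkdata)[[1]][1])   # first station's record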

In addition to the temperature data there are various subsets of station metadata. For every subset of metadata, from the most complete to the summary format, there is a corresponding function: readSiteComplete() reads the complete metadata, and readSiteSummary() reads the file that contains the site ID, site latitude, site longitude, and site elevation. These functions output data.frames that work with the RghcnV3 format. That means we can take a station inventory and “crop” it by latitude and longitude using the RghcnV3 functions:

Inventory <- cropInv(Inventory, extent = c(-120, -60, 20, 50))
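The Inventory used above would typically come straight from one of the metadata readers; a minimal sketch, assuming readSiteSummary() takes the data directory like the other readers (path hypothetical):

library(BerkeleyEarth)
myDir     <- "~/BerkeleyEarth"                   # hypothetical path to the unpacked release
Inventory <- readSiteSummary(Directory = myDir)  # data.frame with site ID, latitude, longitude, elevation
head(Inventory)

The extent vector in cropInv() appears to be ordered c(lonMin, lonMax, latMin, latMax), so the example above crops the inventory to roughly North America between 20°N and 50°N.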

The package is currently at version 1.0, available on CRAN. The version described above is 1.1, which is in the build queue. Version 1.2, which has buffering implemented, is due for release shortly. The RghcnV3 package is also available on CRAN.

Disclaimer: Both Steven Mosher and Zeke Hausfather are participants in the Berkeley Earth Surface Temperature project. However, the content of this post reflects only their personal opinions and not those of the project as a whole.

JC comment:  The new version of the Berkeley Earth Surface Temperature dataset is now achieving its goal as an unprecedented data resource, including transparency and user-friendliness.  The addition of Steve and Zeke to the team was an excellent move.  They have clearly added value to the product.  Further, they provide a welcome and needed link between the Berkeley team and the blogosphere.
