Search for Snow with Hadoop Hive

I don’t know how I ended up becoming the head of our local community association. Anyhow, I’m now responsible for laying out next year’s budget. Most of our expenses seem to be fixed from one year to another, but then there’s the expense for the snow removal service. This year, no snow. Last year, most snow on record in 30 years! How do you budget for something as volatile as snow? I need more data!

Instead of just googling the answer, we’re going to fetch some raw data and feed it into Hadoop Hive.

Just a short primer if you’re unfamiliar with Hadoop and Hive. Hadoop is an ecosystem of tools for storing and analyzing Big Data. At its core Hadoop has its own distributed filesystem, HDFS, which can span hundreds of nodes and thousands of terabytes – even petabytes – of data. One layer above HDFS lives MapReduce – a simple yet effective method introduced by Google for distributing calculations over data. Hive sits on top of HDFS and MapReduce and brings SQL capabilities to Hadoop.

Think of your ordinary RDBMS as a sports car – a fast vehicle often built on fancy hardware. An RDBMS can answer rather complex queries within milliseconds, at least if you keep your data sets below a couple of million rows. Hadoop is a big yellow elephant. It has traded speed for scalability and brute force – it was conceived to move BIG chunks of data around, and it lives happily on commodity hardware. For the sake of brevity we’re going to use some rather small data sets – about 1 megabyte each. That won’t even fill a single file block in HDFS (64 megabytes). A more realistic example of Hadoop’s capabilities would be something like querying 100 billion tweets. An RDBMS can’t do that.

You can run Hadoop on your local machine – like me, on an old MacBook Pro using VMware. Just download the latest image from Cloudera.

The Swedish national weather service SMHI provides the data we need: daily temperature and precipitation data from 1961 to 1997, gathered at a weather station about 60 km from where I live.

Log on to your Hadoop instance and open a terminal to download the data:

wget http://data.smhi.se/met/climate/time_series/day/temperature/SMHI_day_temperature_clim_9720.txt

(run sudo yum install wget first if wget is missing)

Trim off the header information with a text editor:

– – – – – – – –
9720
STOCKHOLM-BROMMA
1961 2010
0101 1231
593537. 179513.
DATUM TT1 TT2 TT3 TTN TTTM TTX
– – – – – – – –

Replace leading and trailing spaces, and replace spaces between fields with commas:

cat SMHI_day_temperature_clim_9720.txt | sed -e 's/^[ \t]*//' -e 's/[ \t]*$//' | sed 's/[[:space:]]\+/,/g' > temperature.txt
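
To sanity-check the result, peek at the first few lines; each one should now be comma-separated with no leading whitespace:

head -3 temperature.txt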

Now we have properly formatted raw data ready to import into Hive. Just type “hive” in the terminal to start up Hive. The columns in the temperature data set look like this:

DATUM YearMonthDay YYYYMMDD
TT1 temperature at 06 UTC
TT2 temperature at 12 UTC
TT3 temperature at 18 UTC
TTX(1) daily max-temperature
TTN(1) daily min-temperature
TTTM(2) daily mean temperature
-999.0 missing value

There’s no date data type in Hive, so we’ll store the date as a string.

CREATE TABLE temperature (
DATUM STRING,
TT1 DOUBLE,
TT2 DOUBLE,
TT3 DOUBLE,
TTN DOUBLE,
TTTM DOUBLE,
TTX DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

It is not intuitive to me why the field separator is defined in the table definition, but it makes sense when you remember that a Hive table is essentially just files in HDFS plus a recipe for how to parse them. Load the data into the Hive table:

LOAD DATA LOCAL INPATH 'temperature.txt'
OVERWRITE INTO TABLE temperature;
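
As a quick sanity check, and to see that the YYYYMMDD string really can be unpacked at query time, try something like this (the exact rows returned depend on the file):

SELECT datum, year(from_unixtime(unix_timestamp(datum, 'yyyyMMdd'))), tttm
FROM temperature
LIMIT 5;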

Now repeat the same steps for the precipitation data:

wget http://data.smhi.se/met/climate/time_series/day/precipitation/SMHI_day_precipitation_clim_9720.txt

Trim off the header information with a text editor:

– – – – – – – –
9720
STOCKHOLM-BROMMA
1961 1997
0101 1231
593537. 179513.
DATUM PES PQRR PRR PRRC1 PRRC2 PRRC3 PSSS PWS
– – – – – – – –

Replace leading and trailing spaces, and replace spaces between fields with commas:

cat SMHI_day_precipitation_clim_9720.txt | sed -e 's/^[ \t]*//' -e 's/[ \t]*$//' | sed 's/[[:space:]]\+/,/g' > precipitation.txt

Columns for the precipitation data set:

DATUM YearMonthDay YYYYMMDD
PES(1) ground snow/ice code
PRR(2) precipitation mm
PQRR(3) quality code
PRRC1(4) precipitation type
PRRC2(4) precipitation type
PRRC3(4) precipitation type
PSSS(5) total snow depth cm
PWS(3) thunder, fog or aurora borealis code
-999.0 missing value

Create the Hive precipitation table:

CREATE TABLE precipitation (
DATUM STRING,
PES DOUBLE,
PQRR DOUBLE,
PRR DOUBLE,
PRRC1 DOUBLE,
PRRC2 DOUBLE,
PRRC3 DOUBLE,
PSSS DOUBLE,
PWS DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

And load the precipitation data into the Hive table:

LOAD DATA LOCAL INPATH 'precipitation.txt'
OVERWRITE INTO TABLE precipitation;
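
Note that the temperature series runs through 2010 while the precipitation series stops in 1997, so joining the two tables will effectively limit the analysis to 1961 to 1997. A quick row count of each table confirms that the two series differ in length:

SELECT COUNT(*) FROM temperature;
SELECT COUNT(*) FROM precipitation;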

Let’s define a snowy day as a day with a daily mean temperature below 0 degrees Celsius (freezing) and more than 3 mm of precipitation (approximately 30 mm of snow).

-- Number of snow days grouped by year
-- TTTM: temperature < 0 degrees Celsius
-- PRR: precipitation > 3 mm (approximately 3 cm of snow)

SELECT year(from_unixtime(unix_timestamp(precipitation.datum, 'yyyyMMdd'))), count(*)
FROM precipitation JOIN temperature ON (precipitation.datum = temperature.datum)
AND temperature.TTTM < 0
AND precipitation.PRR > 3
GROUP BY year(from_unixtime(unix_timestamp(precipitation.datum, 'yyyyMMdd')));

Let’s execute the query:

hive> SELECT year(from_unixtime(unix_timestamp(precipitation.datum, 'yyyyMMdd'))), count(*)
> FROM precipitation JOIN temperature ON (precipitation.datum = temperature.datum)
> AND temperature.TTTM < 0
> AND precipitation.PRR > 3
> GROUP BY year(from_unixtime(unix_timestamp(precipitation.datum, 'yyyyMMdd')));

Total MapReduce jobs = 2
Launching Job 1 out of 2

Starting Job = job_201201290156_0082, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201201290156_0082
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201201290156_0082
2012-01-29 13:26:24,615 Stage-1 map = 0%, reduce = 0%
2012-01-29 13:26:33,272 Stage-1 map = 50%, reduce = 0%
2012-01-29 13:26:36,288 Stage-1 map = 100%, reduce = 0%
2012-01-29 13:26:47,395 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201201290156_0082

Launching Job 2 out of 2

Starting Job = job_201201290156_0083, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201201290156_0083
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201201290156_0083
2012-01-29 13:26:54,675 Stage-2 map = 0%, reduce = 0%
2012-01-29 13:26:59,694 Stage-2 map = 100%, reduce = 0%
2012-01-29 13:27:10,791 Stage-2 map = 100%, reduce = 100%
Ended Job = job_201201290156_0083
OK
1961 6
1962 6
1963 7
1964 6
1965 7
1966 8
1967 7
1968 5
1969 7
1970 12
1971 8
1972 4
1973 8
1974 3
1975 3
1976 10
1977 13
1978 8
1979 6
1980 7
1981 17
1982 5
1983 8
1984 5
1985 19
1986 13
1987 4
1988 11
1989 4
1990 1
1991 3
1992 4
1993 6
1994 1
1995 6
1996 2
1997 6
Time taken: 52.914 seconds

Notice how Hive transforms the SQL query into MapReduce jobs. We could of course write those jobs ourselves in Java, but we’d be swamped in code. Hive hides the underlying complexity of MapReduce behind convenient, mainstream SQL.
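
If you’re curious about what Hive is planning under the hood, you can prefix the query with EXPLAIN to see the stages it will generate without actually running them (output omitted here):

EXPLAIN
SELECT year(from_unixtime(unix_timestamp(precipitation.datum, 'yyyyMMdd'))), count(*)
FROM precipitation JOIN temperature ON (precipitation.datum = temperature.datum)
AND temperature.TTTM < 0
AND precipitation.PRR > 3
GROUP BY year(from_unixtime(unix_timestamp(precipitation.datum, 'yyyyMMdd')));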

Hive supports subqueries, so let’s use one to calculate the average number of snow days per year:

SELECT AVG(sum)
FROM (
SELECT year(from_unixtime(unix_timestamp(precipitation.datum, 'yyyyMMdd'))), count(*) AS sum
FROM precipitation JOIN temperature ON (precipitation.datum = temperature.datum)
AND temperature.TTTM < 0
AND precipitation.PRR > 3
GROUP BY year(from_unixtime(unix_timestamp(precipitation.datum, 'yyyyMMdd')))
) t;

From these calculations it looks like the worst case for the snow removal service is 19 snow days in a single year, with an average of about 7.
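
One caveat: -999.0 marks missing values in both files, and -999.0 is of course below zero, so a day with a missing mean temperature but measurable precipitation would sneak into the snow-day count. A slightly stricter version of the query that excludes the missing-value marker might look like this (just a sketch; it could shift the numbers above a little):

SELECT year(from_unixtime(unix_timestamp(precipitation.datum, 'yyyyMMdd'))), count(*)
FROM precipitation JOIN temperature ON (precipitation.datum = temperature.datum)
WHERE temperature.TTTM < 0
AND temperature.TTTM > -999
AND precipitation.PRR > 3
GROUP BY year(from_unixtime(unix_timestamp(precipitation.datum, 'yyyyMMdd')));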

Pretty sweet, huh?