NICAR13 Day 3, Saturday 2/3

It’s the last day of the NICAR13 conference. Today I’ve been watching Matt Waite tell the story about the Pulitzer Prize-winning site Politifact. Matt was very keen on structure, because everything has structure, especially stories. If you can find the structure and think of it on a higher level, you can build systems (like Politifact). Another aspect of building something that overlaps journalism and IT is cultural resistance: freaked out reporters and reluctant developers, not to mention clueless management. “Build shit, don’t talk shit” – i.e. build a prototype to have conversations around. “Your mission might be a very small defined thing”, says Matt. If you can describe your thing with a single short declarative sentence, then you have a chance – you can pitch things. Guard your ONE THING zealously. Having a structure makes it possible to say NO to things. The core question: what is the atomic unit of this?

Another session I went to was on how to develop reusable visualization components using D3 and Backbone, with Alastair Dant, who works with development at the Guardian. The Javascript library D3 really stands out as everyone’s favorite at this conference. Alastair is a fun guy, and it was a joy listening to him. I could feel part of the audience zoning out during his walkthrough of the code (this isn’t primarily a developer conference, after all). I truly enjoyed it though. The example code can be found on GitHub. Also check out R2D3 if you are required to support IE 7 and 8.

The “Swedish Contingent” at the conference had booked a lunch session. To be honest, this was nothing I was looking forward to in particular – I think the organizers of this conference put in a 2 hour lunch break for a reason. But I was very happy to see Matt Waite again, this time flying around with a microscopic quadrotor drone. And I learned that there’s a drone journalism lab somewhere at the University of Nebraska. How satisfying for the nerd in me to hear Matt speak about hardware hacking, Arduino programming, drones and mesh networks. Where is journalism going? 😉

People have shown amazing stuff here, and we can all do amazing stuff back home – by crossbreeding ideas and competences. And a little bit of coding 😉

NICAR13 Day 2, Friday 1/3

Bringing Local Geodata to Journalism – Ate Poorthuis, Matt Zook
Ever since December 2011 Floatingsheep.org has consumed and indexed every geotagged tweet produced in the world (about 3% of all tweets are geotagged) using Elasticsearch and Twitter’s streaming API. Unfortunately Floatingsheep is not openly available to the public. Ate Poorthuis and Matt Zook from the University of Kentucky demoed some of Floatingsheep’s awesome capabilities, like locating the epicenter of an earthquake(!).

Data visualization on a shoestring – Sharon Machlis, Kevin Hirten
What can you do if you are on a small budget, or even, no budget at all?
Sharon and Kevin pelted the audience with free (as in free beer) tools:

Sharon’s chart: http://www.computerworld.com/s/article/9214755/Chart_and_image_gallery_30_free_tools_for_data_visualization_and_analysis

Smarter interactive Web projects with Google Spreadsheets and Tabletop.js – Tasneem Raja
Tasneem Raja at Mother Jones sees everything that reporters produce as data – “Everything is data, and since everything is data it can have structure”. At Mother Jones they have built their own CMS on Drupal and Google spreadsheets. Reporters feed data into spreadsheets and the information is extracted into the browser using Tabletop.js. Tasneem pointed out a few caveats: Google limits access and the thresholds aren’t clear, and the solution depends on Google not changing its API. The solution manages to run on a single private Google account though.

D3? R? Tableau? What’s right for you? – Amanda Cox, Robert Kosara
Having no particular experience with any of the tools, my impression is that D3, R, and Tableau each solve different problems. What caught my interest the most was the D3 Javascript library (here’s one example: the Waterman Butterfly Map). Because D3 uses SVG it will not work (out of the box) with Internet Explorer below version 9. Another Javascript library mentioned was numeric.js, which can work with matrices and vectors.

How to serve mad traffic – Jeremy Bowers, Jacqui Maher
This session was hilarious! With a great sense of humor, Jeremy explained the three virtues of a great sysadmin: lazy, impatient, and proud. At Nytimes.com they use Ruby on Rails and nginx on Amazon S3. By putting their systems in the cloud they can tailor them to the traffic in a flexible way. But despite your best efforts your load balancer might melt – things like Ajax polling might cause unexpected load.

A few pointers on serving mad traffic:

  • You need to know the path that each request travels.
  • And if each request requires an application server, it won’t scale.
  • There are only two hard things in computer science: cache invalidation and naming things – Phil Karlton
  • nginx! 100k+ req/sec
  • no db + dynamic == easy to scale
  • Scale out: add more servers.
  • Use consistent libs for live polling (JS).
  • Sanity check data entry/delivery points.
  • Plan to degrade gracefully at risky areas.
  • review, review, review
  • Don’t bypass caches.
  • Don’t request MBs of JSON every 30s! (see the polling sketch after this list)
  • Turn off keep-alive.
  • Turn off gzip.
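
None of this code comes from the session, but to make the “don’t request MBs of JSON” point concrete, here’s a minimal polling sketch in Python using HTTP conditional requests. The endpoint URL and the 30 second interval are made-up placeholders; the idea is that the client sends back the ETag it last saw, so the server can answer with a tiny 304 when nothing has changed.

# Polling sketch using HTTP conditional requests (illustration, not session code).
# The URL and interval are placeholders.
import time
import requests

URL = "http://example.com/results.json"  # hypothetical endpoint

def handle_update(data):
    print("new data:", data)

def poll_forever():
    etag = None
    while True:
        headers = {"If-None-Match": etag} if etag else {}
        resp = requests.get(URL, headers=headers, timeout=10)
        if resp.status_code == 304:
            pass  # unchanged: tiny response, cheap for the server
        elif resp.ok:
            etag = resp.headers.get("ETag")  # remember the validator for next time
            handle_update(resp.json())
        time.sleep(30)

if __name__ == "__main__":
    poll_forever()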

I asked Jacqui and Jeremy how maps are served by Nytimes and apparently they use a tool called TileMill. Gotta check that out…

Lightning talks
Five-minute enlightening lightning talks for about an hour. Fun and intelligent. I’m truly awed and impressed by the performances of the people on stage. As a data nerd and hardware hacker I found Matt Waite’s Arduino and Nintendo Wii hack particularly inspiring. Using an accelerometer (harvested from a Wii remote control) connected to a programmable microcontroller, they built a data gathering device which they checked in with a bag at an airport to track the TSA’s (mis)handling. I hope this example inspires more people to go hacking with hardware and programming – because it is true: with programming you can control robots!

NICAR13 Day 1, Thursday 28/2

I’m at the Computer-Assisted Reporting conference in Louisville, Kentucky. Here’s my summary of day one:

Information design and crossing the digital divide – Christopher Canipe, Helene Sears
What inspires me is hearing stories from people who took on the challenge to do something new – like Christopher Canipe who moved from paper to web and had to learn about programming and Javascript. Helene Sears told the story about how graphical work is done at BBC. What I liked the most was the James Bond parallax scrolling infographics.

Prediction is very difficult, especially about the future – Andy Cox
Part of this session (like weather predictions and forecasts) went completely over my head, but I got inspired to try out the D3 Javascript library.

Down and dirty with the DocumentCloud API – Ted Han
DocumentCloud is a service that turns documents into searchable and analyzable data. It seems pretty useful with its API and scripting abilities. I wonder if there are limitations with foreign languages like Swedish?

Dig deeper with social tools – Mandy Jenkins, Doug Haddix
Mandy and Doug went through an amazing array of useful social web tools. Go check them out for yourself: gohachi.com, topsy.com, socialmention.com, foller.me, muckrack.com, twazzup.com, bank.jo, geofeedia.com, mappeo.net, allmytweets.net.

Practical machine learning: Tips, tricks and real-world examples for using machine learning in the newsroom – Jeff Larson, Chase Davis
As a data nerd, this was the most exciting session of the day. Jeff and Chase showed different techniques for creating decision trees and other machine learning tools, and pointed to Weka as a tool for exploring them. Jeff and Chase have been kind enough to put their code on GitHub: https://github.com/thejefflarson https://github.com/cjdd3b
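
Their real examples are in the repos above; as a rough illustration of the decision tree idea (in Python with scikit-learn rather than Weka, and with invented toy data), training and using a classifier can be as short as this:

# Toy decision tree sketch (scikit-learn, not the presenters' code).
# Features and labels below are invented placeholders.
from sklearn.tree import DecisionTreeClassifier

# Each row: [word_count, number_count, name_count] for a paragraph (made up)
X = [[120, 0, 2], [45, 8, 0], [200, 1, 5], [30, 12, 1]]
y = ["narrative", "data", "narrative", "data"]  # hand-labeled classes

clf = DecisionTreeClassifier(max_depth=3)  # keep the tree shallow and inspectable
clf.fit(X, y)
print(clf.predict([[50, 9, 0]]))  # -> ['data']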

Visualizing networks and connections – Irene Liu, Kevin Connor
Littlesis.org may perhaps be described as a crowdsourced Facebook about the power elite, where the dots are connected between the ultra rich and those in power, and how they connect to organizations. Another similar website is theyrule.net. Irene Liu made perhaps the boldest presentation so far – http://connectedchina.reuters.com went live just half an hour before her presentation(!). A very impressive and thorough HTML5 app on China’s power elite. A thing I learned about China is that although they have only one political party, it is highly factionalized. Very interesting. My guess is that the site will probably be censored in China.

Goodnight and see you tomorrow!

Home Automation FTW!

Temperature Graph

Although we’ve been living in this house for almost six years now, I never really bothered to figure out how the heating system works. Not until I bought a wireless energy monitor. What a surprise – I had no idea a house could use that much energy! I started to search the old documents we inherited from the previous house owner, and finally I read up on the heating system. Left unattended, the heat pump had gathered a lot of air and the circulation pump had come to a complete stop. I realized that I needed to get in control. Enter home automation.

At work we monitor software systems in order to (hopefully) act proactively before things get out of hand. It’s as simple as this: if you don’t know, you don’t know! I wanted to apply the same principles to the systems at home. Inspired by my colleague Christian Lizell (@lizell), who not only built our monitoring system at work but also made his own home monitoring system based on the realtime graphing system Graphite and home automation technology from Telldus, I went out and bought a Tellstick Net unit.

The Tellstick Net is a fairly small device that connects to a cloud based service and can be accessed through a web interface and an iPhone app. You can connect an array of 433 MHz devices to it, like on/off switches and dimmers, and then control them individually and in groups. Schedules can be made so that lights turn on and off – say, on at sunset and off at sunrise. Better still, the Tellstick Net can recognize a number of wireless sensors. I ended up buying 8 temperature sensors that I placed outside, inside, in the two attics, the workshop, the fridge, and the freezer.

What I had in mind was to read temperature sensor values from Telldus every minute and then send those values to a Graphite backend. Telldus, who have open sourced their software, offer a REST API with which sensor values can be read. So I started out hacking on the Telldus tdtool.py and the Graphite example-client.py and put the code on GitHub. How I love Python for tasks like this! Installing a Graphite server is quite a task, so I ended up downloading a prebuilt VirtualBox Graphite server. Preferably I would like to create a server somewhere in the cloud.
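
My script is on GitHub as mentioned, but in spirit it boils down to something like the sketch below. Graphite’s plaintext protocol really is just “metric value timestamp” lines over TCP port 2003; the read_sensors() function here is a stand-in for the Telldus REST calls, which need registered API keys and OAuth request signing that I’ve left out.

# Sketch: push temperature readings to Graphite's plaintext listener.
# read_sensors() is a placeholder for the Telldus REST API calls.
import socket
import time

GRAPHITE_HOST = "graphite.local"  # hypothetical hostname
GRAPHITE_PORT = 2003              # Graphite's plaintext protocol port

def read_sensors():
    # Stand-in: the real script queries the Telldus REST API here.
    return {"home.outside": -4.2, "home.freezer": -18.5}

def send_to_graphite(readings):
    now = int(time.time())
    lines = ["%s %.1f %d" % (name, value, now) for name, value in readings.items()]
    sock = socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT))
    sock.sendall(("\n".join(lines) + "\n").encode("ascii"))
    sock.close()

while True:
    send_to_graphite(read_sensors())
    time.sleep(60)  # one reading per minute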

With a graph showing the temperature over time I could tune the fridge, freezer, and workshop (notice the repeating form of the freezer graph). I expect the attic temperature to tell how well the insulation works, but I need more time and freezing temperatures outside to make that analysis. The workshop just needs to be a few degrees above freezing.

Next step: When the farming season begins I want to automate irrigation and monitor humidity in the ground. Enter farm automation…

Search for Snow with Hadoop Hive

I don’t know how I ended up becoming the head of our local community association. Anyhow, I’m now responsible for laying out next year’s budget. Most of our expenses seem to be fixed from one year to another, but then there’s the expense for the snow removal service. This year, no snow. Last year, most snow on record in 30 years! How do you budget for something as volatile as snow? I need more data!

Instead of just googling the answer, we’re going to fetch some raw data and feed it into Hadoop Hive.

Just a short primer if you’re unfamiliar with Hadoop and Hive. Hadoop is an ecosystem of tools for storing and analyzing Big Data. At its core Hadoop has its own distributed filesystem called HDFS that can be made to span hundreds of nodes and thousands of terabytes, even petabytes, of data. One layer above HDFS lives MapReduce – a simple yet effective method introduced by Google to distribute calculations on data. Hive lives on top of HDFS and MapReduce and brings SQL capabilities to Hadoop.

Think of your ordinary RDBMS as a sports car – a fast vehicle often built on fancy hardware. An RDBMS can yield answers to rather complex queries within milliseconds, at least if you keep your data sets below a couple of million rows. Hadoop is a big yellow elephant. It has traded speed for scalability and brute force – it was conceived to move BIG chunks of data around. And it can live happily on commodity hardware. For the sake of brevity we’re going to use some rather small data sets – about 1 megabyte each. That won’t even fill a single file block in HDFS (64 megabytes). A more realistic example of Hadoop’s capabilities would be something like querying 100 billion tweets. An RDBMS can’t do that.

You can run Hadoop on your local machine – like me, on an old MacBook Pro using VMWare. Just download the latest image from Cloudera.

The Swedish national weather service SMHI provides the data we need: daily temperature and precipitation data from 1961 to 1997, gathered at a weather station about 60 km from where I live.

Logon to your Hadoop instance and open the terminal to download the data:

wget http://data.smhi.se/met/climate/time_series/day/temperature/SMHI_day_temperature_clim_9720.txt

(sudo yum install wget – if wget is missing)

Trim off header information with a text editor:

– – – – – – – –
9720
STOCKHOLM-BROMMA
1961 2010
0101 1231
593537. 179513.
DATUM TT1 TT2 TT3 TTN TTTM TTX
– – – – – – – –

Replace leading and trailing spaces, and replace spaces between fields with commas:

cat SMHI_day_temperature_clim_9720.txt | sed -e 's/^[ \t]*//' -e 's/[ \t]*$//' | sed 's/[[:space:]]\+/,/g' > temperature.txt

Now we have properly formatted raw data ready to import into Hive. Just type “hive” in the terminal to start up Hive. The columns in the temperature data set look like this:

DATUM YearMonthDay YYYYMMDD
TT1 temperature at 06 UTC
TT2 temperature at 12 UTC
TT3 temperature at 18 UTC
TTX(1) daily max-temperature
TTN(1) daily min-temperature
TTTM(2) daily mean temperature
-999.0 missing value

There’s no date data type in Hive, so we’ll store the date as a string.

CREATE TABLE temperature (
DATUM STRING,
TT1 DOUBLE,
TT2 DOUBLE,
TT3 DOUBLE,
TTN DOUBLE,
TTTM DOUBLE,
TTX DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

It is not intuitive to me why the field separator is defined in the table definition, but that’s apparently how Hive works. Load the data into the Hive table:

LOAD DATA LOCAL INPATH 'temperature.txt'
OVERWRITE INTO TABLE temperature;

Now repeat the process for the precipitation data:

wget http://data.smhi.se/met/climate/time_series/day/precipitation/SMHI_day_precipitation_clim_9720.txt

Trim off header information:

– – – – – – – –
9720
STOCKHOLM-BROMMA
1961 1997
0101 1231
593537. 179513.
DATUM PES PQRR PRR PRRC1 PRRC2 PRRC3 PSSS PWS
– – – – – – – –

Replace leading and trailing spaces, and replace spaces between fields with commas:

cat SMHI_day_precipitation_clim_9720.txt | sed -e 's/^[ \t]*//' -e 's/[ \t]*$//' | sed 's/[[:space:]]\+/,/g' > precipitation.txt

Columns for the precipitation data set:

DATUM YearMonthDay YYYYMMDD
PES(1) ground snow/ice code
PRR(2) precipitation mm
PQRR(3) quality code
PRRC1(4) precipitation type
PRRC2(4) precipitation type
PRRC3(4) precipitation type
PSSS(5) total snow depth cm
PWS(3) thunder, fog or aurora borealis code
-999.0 missing value

Create the Hive precipitation table:

CREATE TABLE precipitation (
DATUM STRING,
PES DOUBLE,
PQRR DOUBLE,
PRR DOUBLE,
PRRC1 DOUBLE,
PRRC2 DOUBLE,
PRRC3 DOUBLE,
PSSS DOUBLE,
PWS DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

And load the precipitation data into the Hive table:

LOAD DATA LOCAL INPATH 'precipitation.txt'
OVERWRITE INTO TABLE precipitation;

Let’s define a snowy day as a day with a temperature below 0 degrees Celsius (freezing) and more than 3 mm of precipitation (approximately 3 cm of snow).

-- Number of snow days grouped by year
-- TTTM temperature < 0 degrees Celsius
-- PRR precipitation > 3 mm (approximately 3 cm of snow)

SELECT year(from_unixtime(unix_timestamp(precipitation.datum, 'yyyyMMdd'))), count(*)
FROM precipitation JOIN temperature ON (precipitation.datum = temperature.datum)
AND temperature.TTTM < 0
AND precipitation.PRR > 3
GROUP BY year(from_unixtime(unix_timestamp(precipitation.datum, 'yyyyMMdd')));

Let’s execute the query:

hive> SELECT year(from_unixtime(unix_timestamp(precipitation.datum, 'yyyyMMdd'))), count(*)
> FROM precipitation JOIN temperature ON (precipitation.datum = temperature.datum)
> AND temperature.TTTM < 0
> AND precipitation.PRR > 3
> GROUP BY year(from_unixtime(unix_timestamp(precipitation.datum, 'yyyyMMdd')));

Total MapReduce jobs = 2
Launching Job 1 out of 2

Starting Job = job_201201290156_0082, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201201290156_0082
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201201290156_0082
2012-01-29 13:26:24,615 Stage-1 map = 0%, reduce = 0%
2012-01-29 13:26:33,272 Stage-1 map = 50%, reduce = 0%
2012-01-29 13:26:36,288 Stage-1 map = 100%, reduce = 0%
2012-01-29 13:26:47,395 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201201290156_0082

Launching Job 2 out of 2

Starting Job = job_201201290156_0083, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201201290156_0083
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201201290156_0083
2012-01-29 13:26:54,675 Stage-2 map = 0%, reduce = 0%
2012-01-29 13:26:59,694 Stage-2 map = 100%, reduce = 0%
2012-01-29 13:27:10,791 Stage-2 map = 100%, reduce = 100%
Ended Job = job_201201290156_0083
OK
1961 6
1962 6
1963 7
1964 6
1965 7
1966 8
1967 7
1968 5
1969 7
1970 12
1971 8
1972 4
1973 8
1974 3
1975 3
1976 10
1977 13
1978 8
1979 6
1980 7
1981 17
1982 5
1983 8
1984 5
1985 19
1986 13
1987 4
1988 11
1989 4
1990 1
1991 3
1992 4
1993 6
1994 1
1995 6
1996 2
1997 6
Time taken: 52.914 seconds

Notice how Hive transforms the SQL query into MapReduce jobs. We could of course do this ourselves in Java, but we’d be swamped in code. Hive hides the underlying complexity of MapReduce behind more convenient and mainstream SQL.
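
To make that concrete, here is roughly the job we would otherwise write ourselves, sketched as Hadoop Streaming scripts in Python rather than Java, and cheating by assuming the two tables have already been joined so that each input line reads yyyymmdd,tttm,prr. The join alone would need another MapReduce stage.

# mapper.py -- emit "year<TAB>1" for every snowy day.
# Assumes pre-joined input lines: yyyymmdd,tttm,prr (the join itself is omitted).
import sys

for line in sys.stdin:
    parts = line.strip().split(",")
    if len(parts) != 3:
        continue
    datum, tttm, prr = parts
    if float(tttm) < 0 and float(prr) > 3:
        print("%s\t1" % datum[:4])

# reducer.py -- sum the ones per year (streaming input arrives sorted by key).
import sys

current_year, count = None, 0
for line in sys.stdin:
    year, _ = line.strip().split("\t")
    if year != current_year:
        if current_year is not None:
            print("%s\t%d" % (current_year, count))
        current_year, count = year, 0
    count += 1
if current_year is not None:
    print("%s\t%d" % (current_year, count))

And that is just the counting; Hive generates, schedules, and chains all of this (plus the join) from five lines of SQL.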

Hive also supports subqueries, so let’s calculate the average number of snow days per year:

SELECT AVG(snow_days)
FROM (
SELECT year(from_unixtime(unix_timestamp(precipitation.datum, 'yyyyMMdd'))), count(*) AS snow_days
FROM precipitation JOIN temperature ON (precipitation.datum = temperature.datum)
AND temperature.TTTM < 0
AND precipitation.PRR > 3
GROUP BY year(from_unixtime(unix_timestamp(precipitation.datum, 'yyyyMMdd')))
) t;

From these calculations it seems like the worst case scenario for the snow removal service is 19 occasions per year with an average of 7.

Pretty sweet huh!

Flying with Arduino!

I like to build stuff with my hands, I like programming, and I like things that fly, so for me the tricopter project was really three things coming together. The project was seeded about a year ago after seeing the amazing videos on Youtube made by David Windestål. I soon ordered the hardware needed for the build from Hong Kong, but my order got stuck in customs and I didn’t get the stuff until late autumn. Finally, with a box full of cheap Chinese RC electronics in my possession, I started out making the wooden frame on which I fitted motors and electronics; a rather crude and simple construction. The only moving part (apart from the three rotating propellers, hence the name “tricopter”) is the tail rotor that controls the machine’s yaw direction (the “rudder”). This makes a tricopter a much simpler construction than a helicopter. When almost done with the fittings I discovered that my transmitter did not support the mixing needed to make David’s build work.

While hesitating to buy a new transmitter I stumbled upon the multiwii.com project, a project devoted to making multicopters (tricopters, quadcopters, hexacopters) fly with the help of an Arduino (a programmable microcontroller), a Nintendo Wii Motion Plus (gyro), and a Nintendo Nunchuck (accelerometer). It turns out that the gyro and accelerometer used in the Nintendo Wii hand controls are pretty competent. The Arduino is like a very small computer with enough processing power to make real time decisions based on the inputs from the Nintendo PCBs and the user. A tricopter is an inherently unstable flying machine, and without the help of (correctly configured) electronics it will flip in a matter of milliseconds. The combination of the three main components makes for a very stable flying platform. Since these units are mass produced they come with a relatively low price tag.
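
MultiWii’s actual flight code is Arduino C and far more involved, but the core idea of the stabilization loop can be sketched in a few lines. This is an illustration only, with made-up gains: blend the gyro and accelerometer into an angle estimate (a complementary filter), then run the error through a PID controller whose output is mixed into the motor speeds.

# Toy stabilization loop for one axis (illustration only, not MultiWii code).
# Gains and the 0.98/0.02 blend factor are made up but typical-looking.
def stabilize(gyro_rate, accel_angle, angle, integral, dt,
              kp=4.0, ki=0.02, kd=0.3):
    # Complementary filter: trust the gyro short-term, the accelerometer long-term.
    angle = 0.98 * (angle + gyro_rate * dt) + 0.02 * accel_angle
    error = 0.0 - angle        # we want the craft level
    integral += error * dt
    derivative = -gyro_rate    # the error changes at minus the measured rate
    correction = kp * error + ki * integral + kd * derivative
    return correction, angle, integral  # correction is mixed into motor outputs

Run at a few hundred hertz, a loop like this is what keeps an inherently unstable airframe level.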

Being a happy Arduino hobbyist I immediately fell in love with the idea of using an Arduino as the tricopter’s brain, and as a bonus I could use my old transmitter. The multiwii project comes with two programs: an Arduino program that needs to be configured with your multicopter’s specific settings, such as the number of rotors, min and max rpm, yaw direction and a few other parameters; and the MultiWiiConf program used to configure and calibrate the multicopter. Even though I followed the assembly instructions as closely as I could, it took a while to sort things out. The soldering of the tiny Nintendo components and connecting all the wires to the Arduino proved to be quite a challenge. Finally, when I had found the correct positions for the gyro (Wii Motion Plus) and the accelerometer (Nintendo Nunchuck), I could calibrate the thing and spin up the motors for the first time.

Equipped with three brushless motors, each capable of producing almost 1 kilo of thrust and mounted on the tip of a long arm, the tricopter has enough punch to break loose from your hand if not gripped firmly. With safety goggles on I did a lot of test runs, first holding the thing in my hand, and later on the tarmac outside my maker shed. Luckily nothing vital broke and no one got injured. After a few iterations of weeding out vibrations, trims, and other configurations, it finally took off!

Interviews with Johanna Rothman

It’s been a privilege to spend time with Johanna while doing these interviews. Johanna describes herself as a flaming extravert, and I would like to add to that description that she’s full of humor and wisdom. A great opportunity to meet Johanna face-to-face is to attend the PSL Workshop in January next year. There are still a few seats available – less than a few, actually!

From the PNEHM! interview with Johanna:

“Anyone can achieve this kind of power. Because it comes from within, no one can take it away, except for yourself. And, no one can give it to you, except for yourself.”

From the first part of Johanna’s podcast:

“I took PSL in June of 96, and it was a real turning point for me…”

From the second part of Johanna’s podcast:

“It’s (PSL) really all about: how do you see yourself; how do you understand yourself first; and see what your defaults are; and then how do you make changes – if you choose to make changes.”