My book Clean Data is available from various booksellers like Amazon.
Topics include: cleaning data from Stack Overflow, getting and cleaning data from Twitter, extracting and cleaning data found in PDFs, extracting and cleaning data found in web pages, flipping data between various file formats and encodings.
Mostly it’s written in Python 2.7 with a little PHP thrown in for variety.
There’s a free sample on the publisher’s web site, and a lot of the code is up on GitHub.
I’m at HICSS-48 this week. One of the things I get to do is present a new paper that my wonderful and talented undergraduate student Becca Gazda and I wrote: FLOSS as a source for profanity and insults: Collecting the data. And FLOSSmole is hosting the data from this paper here: http://flossdata.syr.edu/data/insults/
Anil Dash has a great explication of the issues surrounding what it means for something to be “public”: how we talk about “public,” how we defend or rely on that status, and how we keep defining and redefining it.
He calls attention to an increasing reliance on binary definitions, friends/global and public/private, when these are obviously oversimplified. He writes:
…there are two communities keenly interested in reducing the complexity of publicness:
The media industry benefits from things that are in a gray area being treated as “public”, because this makes it (at best) fair game for discussion or (at worst) raw materials to feed the insatiable need for new content.
The technology industry benefits from treating “public” as a binary state because handling more complex forms of privacy can be more expensive to accommodate in software, and because tech companies increasingly rely on the same advertising model which supports media, where public information is more valuable because it can be monetized.
We’ve spent so many years being indoctrinated in this new, false definition of what is public that most people will still not concede that there is any complexity to the issue.
… and proud of it!
The New York Times recently ran an article on how 50-80% of data science is “janitor work” like data cleaning and wrangling. This, along with data collection, is of course my favorite part of doing my job, so I liked the article. Check it out!
Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”
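To give a flavor of what that “janitor work” looks like in practice, here’s a minimal sketch in Python. The messy CSV snippet, the field names, and the cleaning rules are all hypothetical, just the kind of stray whitespace, inconsistent casing, missing values, and duplicate rows that eat up so much of a data scientist’s time:

```python
# Illustrative sketch of routine data cleaning: normalizing a messy CSV
# extract before analysis. The data and field names are made up.
import csv
import io

RAW = """name , signup_date ,country
  Alice ,2014-03-02, us
BOB,, US
alice , 2014-03-02 , us
"""

def clean_rows(text):
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    seen = set()
    for row in reader:
        # Strip stray whitespace from both headers and values.
        row = {k.strip(): (v or "").strip() for k, v in row.items()}
        # Normalize case so "us"/"US" and "alice"/"Alice" match.
        row["name"] = row["name"].title()
        row["country"] = row["country"].upper()
        # Drop rows missing a signup date, and exact duplicates.
        key = (row["name"], row["signup_date"], row["country"])
        if not row["signup_date"] or key in seen:
            continue
        seen.add(key)
        yield row

cleaned = list(clean_rows(RAW))
# Three raw rows collapse to one clean record for Alice.
```

None of this is glamorous, but it is exactly the kind of work the article describes: every dataset needs its own version of these few lines before any analysis can start.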
I’ll be speaking at All Things Open this year in Raleigh. My talk is titled “We’re Watching You: How and Why Researchers Study Open Source And What We’ve Found So Far”. Spoiler alert: I conclude open source developers are jerks, get thrown out of conference. Just kidding. Sort of.