I’m not really updating this blog much any more, but I need to keep it online since there are a lot of things in it that I and other people link to. If you find yourself here for some reason, you may wish to peruse my official Elon page, which has a lot of my other links: Amazon author page, Stack Overflow page, Twitter, etc.
Topics include: cleaning data from Stack Overflow, getting and cleaning data from Twitter, extracting and cleaning data found in PDFs, extracting and cleaning data found in web pages, flipping data between various file formats and encodings.
Mostly it’s written in Python 2.7 with a little PHP thrown in for variety.
I’m at HICSS-48 this week. One of the things I get to do is present a new paper that my wonderful and talented undergraduate student Becca Gazda and I wrote: FLOSS as a source for profanity and insults: Collecting the data. FLOSSmole is hosting the data from this paper here: http://flossdata.syr.edu/data/insults/
Anil Dash has a great explication of the issues surrounding what it means for something to be “public” and how we talk about “public” and how we defend or rely on that status, defining it and redefining it.
He calls attention to an increasing reliance on binary definitions (friends/global, public/private) when these are obviously over-simplified. He writes:
…there are two communities keenly interested in reducing the complexity of publicness:
The media industry benefits from things that are in a gray area being treated as “public”, because this makes it (at best) fair game for discussion or (at worst) raw materials to feed the insatiable need for new content.
The technology industry benefits from treating “public” as a binary state because handling more complex forms of privacy can be more expensive to accommodate in software, and because tech companies increasingly rely on the same advertising model which supports media, where public information is more valuable because it can be monetized.
We’ve spent so many years being indoctrinated in this new, false definition of what is public that most people will still not concede that there is any complexity to the issue.
… and proud of it!
The New York Times recently ran an article on how 50-80% of data science is “janitor work” like data cleaning and wrangling. This, along with data collection, is of course my favorite part of doing my job, so I liked the article. Check it out!
Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”