School of Information Blogs

September 18, 2014

Ph.D. student

technical work

Dipping into Julian Orr’s Talking about Machines, an ethnography of Xerox photocopier technicians, has set off some light bulbs for me.

First, there’s Orr’s story: Orr dropped out of college and got drafted, then worked as a technician in the military before returning to school. He paid the bills doing technical repair work, and found it convenient to do his dissertation on those doing photocopy repair.

Orr’s story reminds me of my grandfather and great-uncle, both of whom were technicians–radio operators–during WWII. Their civilian careers were as carpenters, building houses.

My own dissertation research is motivated by my work background as an open source engineer, and my own desire to maintain and improve my technical chops. I’d like to learn to be a data scientist; I’m also studying data scientists at work.

Further fascinating was Orr’s discussion of the Xerox technicians’ identity as technicians as opposed to customers:

The distinction between technician and customer is a critical division of this population, but for technicians at work, all nontechnicians are in some category of other, including the corporation that employs the technicians, which is seen as alien, distant, and only sometimes an ally.

It’s interesting to read about this distinction between technicians and others in the context of Xerox photocopiers when I’ve been so affected lately by the distinctions between tech folk and others, and between data scientists and others. This distinction between those who do technical work and those whom they serve is a deep historical one that transcends the contemporary and over-computed world.

I recall my earlier work experience. I was a decent engineer and engineering project manager. I was a horrible account manager. My customer service skills were abysmal, because I did not empathize with the client. The open source context contributes to this attitude, because it makes a different set of demands on its users than consumer technology does. One gets assistance with consumer grade technology by hiring a technician who treats you as a customer. You get assistance with open source technology by joining the community of practice as a technician. Commercial open source software, according to the Pentaho beekeeper model, is about providing, at cost, that customer support.

I’ve been thinking about customer service and reflecting on my failures at it a lot lately. It keeps coming up. Mary Gray’s piece, When Science, Customer Service, and Human Subjects Research Collide explicitly makes the connection between commercial data science at Facebook and customer service. The ugly dispute between Gratipay (formerly Gittip) and Shanley Kane was, I realized after the fact, a similar crisis between the expectations of customers/customer service people and the expectations of open source communities. When “free” (gratis) web services display a similar disregard for their users as open source communities do, it’s harder to justify in the same way that FOSS does. But there are similar tensions, perhaps. It’s hard for technicians to empathize with non-technicians about their technical problems, because their lived experience is so different.

It’s alarming how much is being hinged on the professional distinction between technical worker and non-technical worker. The intra-technology industry debates are thick with confusions along these lines. What about marketing people in the tech context? Sales? Are the “tech folks” responsible for distributional justice today? Are they in the throes of an ideology? I was reading a paper the other day suggesting that software engineers should be held ethically accountable for the implicit moral implications of their algorithms. Specifically the engineers; for some reason not the designers or product managers or corporate shareholders, who were not mentioned. An interesting proposal.

Meanwhile, at the D-Lab, where I work, I’m in the process of navigating my relationship between two teams, the Technical Team, and the Services Team. I have been on the Technical team in the past. Our work has been to stay on top of and assist people with data science software and infrastructure. Early on, we abolished regular meetings as a waste of time. Naturally, there was a suspicion expressed to me at one point that we were unaccountable and didn’t do as much work as others on the Services team, which dealt directly with the people-facing component of the lab–scheduling workshops, managing the undergraduate work-study staff. Sitting in on Services meetings for the first time this semester, I’ve been struck by how much work the other team does. By and large, it’s information work: calendaring, scheduling, entering into spreadsheets, documenting processes in case of turnover, sending emails out, responding to emails. All important work.

This is exactly the work that information technicians want to automate away. If there is a way to reduce the amount of calendaring and entering into spreadsheets, programmers will find a way. The whole purpose of computer science is to automate tasks that would otherwise be tedious.

Eric S. Raymond’s classic (2001) essay How to Become a Hacker characterizes the Hacker Attitude, in five points:

  1. The world is full of fascinating problems waiting to be solved.
  2. No problem should ever have to be solved twice.
  3. Boredom and drudgery are evil.
  4. Freedom is good.
  5. Attitude is no substitute for competence.

There is no better articulation of the “ideology” of “tech folks” than this, in my opinion, yet Raymond is not used much as a source for understanding the idiosyncrasies of the technical industry today. Of course, not all “hackers” are well characterized by Raymond (I’m reminded of Coleman’s injunction to speak of “cultures of hacking”) and not all software engineers are hackers (I’m sure my sister, a software engineer, is not a hacker. For example, based on my conversations with her, it’s clear that she does not see all the unsolved problems with the world to be intrinsically fascinating. Rather, she finds problems that pertain to some human interest, like children’s education, to be most motivating. I have no doubt that she is a much better software engineer than I am–she has worked full time at it for many years and now works for a top tech company. As somebody closer to the Raymond Hacker ethic, I recognize that my own attitude is no substitute for that competence, and hold my sister’s abilities in very high esteem.)

As usual, I appear to have forgotten where I was going with this.

by Sebastian Benthall at September 18, 2014 12:08 AM

September 17, 2014

Ph.D. student

frustrations with machine ethics

It’s perhaps because of the contemporary two cultures problem of tech and the humanities that machine ethics is in such a frustrating state.

Today I read danah boyd’s piece in The Message about technology as an arbiter of fairness. It’s more baffling conflation of data science with neoliberalism. This time, the assertion was that the ideology of the tech industry is neoliberalism hence their idea of ‘fairness’ is individualist and against social fabric. It’s not clear what backs up these kinds of assertions. They are more or less refuted by the fact that industrial data science is obsessed with our network of ties for marketing reasons. If anybody understands the failure of the myth of the atomistic individual, it’s “tech folks,” a category boyd uses to capture, I guess, everyone from marketing people at Google to venture capitalists to startup engineers to IBM researchers. You know, the homogenous category that is “tech folks.”

This kind of criticism makes the mistake of thinking that a historic past is the right way to understand a rapidly changing present that is often more technically sophisticated than the critics understand. But critical academics have fallen into the trap of critiquing neoliberalism over and over again. One problem is that tech folks don’t spend a ton of time articulating their ideology in ways that are convenient for pop culture critique. Often their business models require rather sophisticated understandings of the market, etc. that don’t fit readily into that kind of mold.

What’s needed is substantive progress in computational ethics. Ok, so algorithms are ethically and politically important. What politics would you like to see enacted, and how do you go about implementing that? How do you do it in a way that attracts new users and is competitively funded so that it can keep up with the changing technology we use to access the web? These are the real questions. There is so little effort spent trying to answer them. Instead there’s just an endless series of op-eds bemoaning the way things continue to be bad because it’s easier than having agency about making things better.

by Sebastian Benthall at September 17, 2014 03:26 AM

September 13, 2014

Ph.D. student

notes towards benign superintelligence j/k

Nick Bostrom will give a book talk on campus soon. My departmental seminar on “Algorithms as Computation and Culture” has opened with a paper on the ethics of algorithms and a paper on accumulated practical wisdom regarding machine learning. Of course, these are related subjects.

Jenna Burrell recently trolled me in order to get me to give up my own opinions on the matter, which are rooted in a philosophical functionalism. I’ve learned just now that these opinions may depend on obsolete philosophy of mind. I’m not sure. R. Scott Bakker’s blog post against pragmatic functionalism makes me wonder: what do I believe again? I’ve been resting on a position established when I was deeper into this stuff seven years ago. A lot has happened since then.

I’m turning into a historicist perhaps due to lack of imagination or simply because older works are more accessible. Cybernetic theories of control–or, electrical engineering theories of control–are as relevant, it seems, to contemporary debates as machine learning, which to the extent it depends on stochastic gradient descent is just another version of cybernetic control anyway, right?

Ashwin Parameswaran’s blog post about Beniger’s Control Revolution illustrates this point well. To a first approximation, we are simply undergoing the continuation of prophecies of the 20th century, only more thoroughly. Over and over, and over, and over, and over, like a monkey with a miniature cymbal.


One property of a persistent super-intelligent infrastructure of control would be our inability to comprehend it. Our cognitive models, constructed over the course of a single lifetime with constraints on memory both in time and space, limited to a particular hypothesis space, could simply be outgunned by the complexity of the sociotechnical system in which they are embedded. I tried to get at this problem with work on computational asymmetry but didn’t find the right audience. I just learned there’s been work on this in finance, which makes sense, as it’s where it’s most directly relevant today.

by Sebastian Benthall at September 13, 2014 01:43 AM

September 09, 2014

Ph.D. student

more on algorithms, judgment, polarization

I’m still pondering the most recent Tufekci piece about algorithms and human judgment on Twitter. It prompted some grumbling among data scientists. Sweeping statements about ‘algorithms’ do that, since to a computer scientist ‘algorithm’ is about as general a term as ‘math’.

In later conversation, Tufekci clarified that when she was calling out the potential problems of algorithmic filtering of the Twitter newsfeed, she was speaking to the problems of a newsfeed curated algorithmically for the sake of maximizing ‘engagement’. Or ads. Or, it is apparent on a re-reading of the piece, new members. She thinks an anti-homophily algorithm would maybe be a good idea, but that this is so unlikely according to the commercial logic of Twitter as to be a marginal point. And, meanwhile, she defends ‘human prioritization’ over algorithmic curation, despite the fact that homophily (not to mention preferential attachment) is arguably a negative consequence of social systems driven by human judgment.

I think inquiry into this question is important, but bound to be confusing to those who aren’t familiar in a deep way with network science, machine learning, and related fields. It’s also, I believe, helpful to have a background in cognitive science, because that’s a field which maintains that human judgment and computational systems are doing fundamentally commensurable kinds of work. When we think in a sophisticated way about crowdsourced labor, we use this sort of thinking. We acknowledge, for example, that human brains are better at the computational task of image recognition, so then we employ Turkers to look at and label images. But those human judgments are then inputs to statistical processes that verify and check those judgments against each other. Later, those determinations that result from a combination of human judgment and algorithmic processing could be used in a search engine–which returns answers to questions based on human input. Search engines, then, are also a way of combining human and purely algorithmic judgment.
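To make the crowdsourcing example concrete, here is a minimal sketch of one way human labels might be checked against each other: a simple majority vote with an agreement threshold. The function name, the threshold, and the sample data are all hypothetical, chosen for illustration; real labeling pipelines typically use more elaborate statistical models.

```python
from collections import Counter


def aggregate_labels(labels_by_item, min_agreement=0.6):
    """Combine several human labels per item by majority vote.

    An item's label is accepted only when the winning label's share of
    the votes reaches min_agreement; otherwise it is marked None so it
    can be sent back for more human judgment.
    (Hypothetical helper, for illustration only.)
    """
    results = {}
    for item, labels in labels_by_item.items():
        if not labels:
            results[item] = None
            continue
        label, count = Counter(labels).most_common(1)[0]
        agreement = count / len(labels)
        results[item] = label if agreement >= min_agreement else None
    return results


# Made-up labels from three hypothetical workers.
crowd_labels = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "cat"],
    "img_003": ["bird", "bird", "bird"],
}
accepted = aggregate_labels(crowd_labels)
```

Here the human judgments are the raw input, and the algorithmic step only decides which of them agree strongly enough to be trusted downstream.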

What it comes down to is that virtually all of our interactions with the internet are built around algorithmic affordances. And these systems can be understood systematically if we reject the quantitative/qualitative divide at the ontological level. Reductive physicalism entails this rejection, but–and this is not to be understated–pisses off or alienates people who do qualitative or humanities research.

This is old news. C.P. Snow’s The Two Cultures. The Science Wars. We’ve been through this before. Ironically, the polarization is algorithmically visible in the contemporary discussion about algorithms.*

The Two Cultures on Twitter?

It’s I guess not surprising that STS and cultural studies academics are still around and in opposition to the hard scientists. What’s maybe new is how much computer science now affects the public, and how the popular press appears to have allied itself with the STS and cultural studies view. I guess this must be because cultural anthropologists and media studies people are more likely to become journalists and writers, whereas harder science is pretty abstruse.

There’s an interesting conflation now from the soft side of the culture wars of science with power/privilege/capitalism that plays out again and again. I bump into it in the university context. I read about it all the time. Tufekci’s pessimism that the only algorithmic filtering Twitter would adopt would be one that essentially obeys the logic “of Wall Street” is, well, sad. It’s sad that an unfortunate pairing that is analytically contingent is historically determined to be so.

But there is also something deeply wrong about this view. Of course there are humanitarian scientists. Of course there is a nuanced center to the science wars “debate”. It’s just that the tedious framing of the science wars has been so pervasive and compelling, like a commercial jingle, that it’s hard to feel like there’s an audience for anything more subtle. How would you even talk about it?

* I need to confess: I think there was some sloppiness in that Medium piece. If I had had more time, I would have done something to check which conversations were actually about the Tufekci article, and which were just about whatever. I feel I may have misrepresented this in the post. For the sake of accessibility or to make the point, I guess. Also, I’m retrospectively skittish about exactly how distinct a cluster the data scientists were, and whether its insularity might have been an artifact of the data collection method. I’ve been building out poll.emic in fits mainly as a hobby. I built it originally because I wanted to at last understand Weird Twitter’s internal structure. The results were interesting but I never got to writing them up. Now I’m afraid that the culture has changed so much that I wouldn’t recognize it any more. But I digress. Is it even notable that social scientists from different disciplines would have very different social circles around them? Is the generalization too much? And are there enough nodes in this graph to make it a significant thing to say about anything, really? There could be thousands of academic tiffs I haven’t heard about that are just as important but which defy my expectations and assumptions. Or is the fact that Medium appears to have endorsed a particular small set of public intellectuals significant? How many Medium readers are there? Not as many as there are Twitter users, by several orders of magnitude, I expect. Who matters? Do academics matter? Why am I even studying these people as opposed to people who do more real things? What about all the presumably sane and happy people who are not pathologically on the Internet? Etc.

by Sebastian Benthall at September 09, 2014 07:03 AM

September 08, 2014

MIMS 2011

Max Klein on Wikidata, “botpedia” and gender classification

Max Klein defines himself on his blog as a ‘Mathematician-Programmer, Wikimedia-Enthusiast, Burner-Yogi’ who believes in ‘liberty through wikis and logic’. I interviewed him a few weeks ago when he was in the UK for Wikimania 2014. He then wrote up some of his answers so that we could share it with others. Max is a long-time volunteer of Wikipedia who has occupied a wide range of roles as a volunteer and as a Wikipedian in residence for OCLC, among others. He has been working on Wikidata from the beginning but it hasn’t always been plain sailing. Max is outspoken about his ideas and he is respected for that, as well as for his patience in teaching those who want to learn. This interview serves as a brief introduction to Wikidata and some of its early disagreements.

Max Klein in 2011. CC BY SA, Wikimedia Commons

How was Wikidata originally seeded?
In the first days of Wikidata we used to call it a ‘botpedia’ because it was basically just an echo chamber of bots talking to each other. People were writing bots to import information from infoboxes on Wikipedia. A heavy focus of this was data about persons from authority files.

Authority files?
An authority file is a Library Science term that is basically a numbering system to assign authors unique identifiers. The point is to avoid a “which John Smith?” problem. At last year’s Wikimania I said that Wikidata itself has become a kind of “super authority control” because now it connects so many other organisations’ authority control (e.g. Library of Congress and IMDB). In the future I can imagine Wikidata being the one authority control system to rule them all.

In the beginning, each Wikipedia project was supposed to be able to decide whether it wanted to integrate Wikidata. Do you know how this process was undertaken?
It actually wasn’t decided site-by-site. At first only Hungarian, Italian, and Hebrew Wikipedias were progressive enough to try. But once English Wikipedia approved the migration to use Wikidata, soon after there was a global switch for all Wikis to do so (see the announcement here).

Do you think it will be more difficult to edit Wikipedia when infoboxes are linking to templates that derive their data from Wikidata? (both editing and producing new infoboxes?)
It would seem to complicate matters that infobox editing becomes opaque to those who aren’t Wikidata aware. However at Wikimania 2014, two Sergeys from Russian Wikipedia demonstrated a very slick gadget that made this transparent again – it allowed editing of the Wikidata item from the Wikipedia article. So with the right technology this problem is a nonstarter.

Can you tell me about your opposition to the ways in which Wikidata editors decided to structure gender information on Wikidata?
In Wikidata you can put a constraint on what values a property can have. When I came across it the “sex or gender” property said “only one of ‘male, female, or intersex’”. I was opposed to this because I believe that any way the Wikidata community structures the gender options, we are going to imbue it with our own bias. For instance already the property is called “sex or gender”, which shows a lack of distinction between the two, which some people would consider important. So I spent some time arguing that at least we should allow any value. So if you want to say that someone is “third gender” or even that their gender is “Sodium” that’s now possible. It was just an early case of heteronormativity sneaking into the ontology.
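As a rough illustration of the kind of “one of” constraint Max describes, the sketch below checks a value against an allowed set, and treats the absence of a set as “any value is allowed”. This is not Wikidata’s actual constraint mechanism or API; the function and variable names are hypothetical, and the allowed values are simply the three quoted in the interview.

```python
def satisfies_one_of(value, allowed=None):
    """Return True if value passes a 'one of' constraint.

    allowed=None models a property with no value constraint at all,
    so any value passes. (Hypothetical helper, for illustration only.)
    """
    return allowed is None or value in allowed


# The three values quoted in the interview as the original constraint.
ORIGINAL_ALLOWED = {"male", "female", "intersex"}

rejected_before = not satisfies_one_of("third gender", ORIGINAL_ALLOWED)
accepted_after = satisfies_one_of("third gender")
```

The point of the sketch is that the choice of the allowed set is an editorial decision baked into the data model, which is exactly what the disagreement was about.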

Wikidata uses a CC0 license which is less restrictive than the CC BY SA license that Wikipedia is governed by. What do you think the impact of this decision has been in relation to others like Google who make use of Wikidata in projects like the Google Knowledge Graph?
Wikidata being CC0 at first seemed very radical to me. But one thing I noticed was that increasingly this will mean where the Google Knowledge Graph now credits their “info-cards” to Wikipedia, the attribution will just start disappearing. This seems mostly innocent until you consider that Google is a funder of the Wikidata project. So in some way it could seem like they are just paying to remove a blemish on their perceived omniscience.

But to nip my pessimism I have to remind myself that if we really believe in the Open Source, Open Data credo then this rising tide lifts all boats.

by Heather Ford at September 08, 2014 11:24 AM

Code and the (Semantic) City

Mark Graham and I have just returned from Maynooth in Ireland where we participated in a really great workshop called Code and the City organised by Rob Kitchin and his team at the Programmable City project. We presented a draft paper entitled, ‘Semantic Cities: Coded Geopolitics and Rise of the Semantic Web’ where we trace how the city of Jerusalem is represented across Wikipedia and through WikiData, Freebase and to Google’s Knowledge Graph in order to answer questions about how linked data and the semantic web change a user’s interactions with the city. We’ve been indebted to the folks from all of these projects who have helped us navigate questions about the history and affordances of these projects so that we can better understand the current Web ecology. The paper is currently being revised and will be available soon, we hope!

by Heather Ford at September 08, 2014 11:08 AM

September 02, 2014

MIMS 2011

Infoboxes and cleanup tags: Artifacts of Wikipedia newsmaking

Infobox from the first version of the 2011 Egyptian Revolution (then ‘protests’) article on English Wikipedia, 25 January, 2011

My article about Wikipedia infoboxes and cleanup tags and their role in the development of the 2011 Egyptian Revolution article has just been published in the journal, ‘Journalism: Theory, Practice and Criticism‘ (a pre-print is available on The article forms part of a special issue of the journal edited by C W Anderson and Juliette de Meyer who organised the ‘Objects of Journalism’ pre-conference at the International Communications Association conference in London that I attended last year. The issue includes a number of really interesting articles from a variety of periods in journalism’s history – from pica sticks to interfaces, timezones to software, some of which we covered in the August 2013 edition of

My article is about infoboxes and cleanup tags as objects of Wikipedia journalism, objects that have important functions in the coordination of editing and writing by distributed groups of editors. Infoboxes are summary tables on the right hand side of an article that enable readability and quick reference, while cleanup tags are notices at the head of an article warning readers and editors of specific problems with articles. When added to an article, both tools simultaneously notify editors about missing or weak elements of the article and add articles to particular categories of work.

The article contains an account of the first 18 days of the protests that resulted in the resignation of then-president Hosni Mubarak based on interviews with a number of the article’s key editors as well as traces in related articles, talk pages and edit histories. Below is a selection from what happened on day 1:

Day 1: 25 January, 2011 (first day of the protests)

The_Egyptian_Liberal published the article on English Wikipedia on the afternoon of what would become a wave of protests that would lead to the unseating of President Hosni Mubarak. A template was used to insert the ‘uprising’ infobox to house summarised information about the event including fields for its ‘characteristics’, the number of injuries and fatalities. This template was chosen from a range of other infoboxes relating to history and events on Wikipedia, but has since been deleted in favor of the more recently developed ‘civil conflict’ infobox with fields for ‘causes’, ‘methods’ and ‘results’.

The first draft included the terms ‘demonstration’, ‘riot’ and ‘self-immolation’ in the ‘characteristics’ field and was illustrated by the Latuff cartoon of Khaled Mohamed Saeed and Hosni Mubarak with the caption ‘Khaled Mohamed Saeed holding up a tiny, flailing, stone-faced Hosni Mubarak’. Khaled Mohamed Saeed was a young Egyptian man who was beaten to death reportedly by Egyptian security forces and the subject of the Facebook group ‘We are all Khaled Said’ moderated by Wael Ghonim that contributed to the growing discontent in the weeks leading up to 25 January, 2011. This space would ideally have been filled by a photograph of the protests, but the cartoon was used because the article was uploaded so soon after the first protests began. It also has significant emotive power and clearly represented the perspective of the crowd of anti-Mubarak demonstrators in the first protests.

Upon publishing, three prominent cleanup tags were automatically appended to the head of the article. These included the ‘new unreviewed article’ tag, the ‘expert in politics needed’ tag and the ‘current event’ tag, warning readers that information on the page may change rapidly as events progress. These three lines of code that constituted the cleanup tags initiated a complex distribution of tasks to different groups of users located in work groups throughout the site: page patrollers, subject experts and those interested in current events.

The three cleanup tags automatically appended to the article when it was published at UTC 13:27 on 25 January, 2011

Looking at the diffs in the first day of the article’s growth, it becomes clear that the article is by no means a ‘blank slate’ that editors fill progressively with prose. Much of the activity in the first stage of the article’s development consisted of editors inserting markers or frames in the article that acted to prioritize and distribute work. Cleanup tags alerted others about what they believed to be priorities (to improve weak sections or provide political expertise, for example) while infoboxes and tables provided frames for editors to fill in details iteratively as new information became available.

By discussing the use of these tools in the context of Bowker and Star’s theories of classification (2000), I argue that these tools are not only material but also conceptual and symbolic. They facilitate collaboration by enabling users to fill in details according to a pre-defined set of categories and by catalyzing notices that alert others to the work that they believe needs to be done on the article. Their power, however, cannot only be seen in terms of their functional value. These artifacts are deployed and removed as acts of social and strategic power play among Wikipedia editors who each want to influence the narrative about what happened and why it happened. Infoboxes and tabular elements arise as clean, simple, well-referenced numbers out of the messiness and conflict that gave rise to them. When cleanup tags are removed, the article develops an implicit authority, appearing to rise above uncertainty, power struggles and the impermanence of the compromise that it originated from.

This categorization practice enables editors to collaborate iteratively with one another because each object signals work that needs to be done by others in order to fill in the gaps of the current content. In addition to this functional value, however, categorization also has a number of symbolic and political consequences. Editors are engaged in a continual practice of iterative summation that contributes to an active construction of the event as it happens rather than a mere assembling of ‘reliable sources’. The deployment and removal of cleanup tags can be seen as an act of power play between editors that affects readers’ evaluation of the article’s content. Infoboxes are similar sites of struggle whose deployment and development result in an erasure of the contradictions and debates that gave rise to them. These objects illuminate how this novel journalistic practice has important implications for the way that political events are represented.

by Heather Ford at September 02, 2014 04:46 PM

September 01, 2014

MIMS 2014

Of Goodreads and Listicles

I’m a HUGE fan of Goodreads. I have been using it for a few years now, and I was depressed when the new Kindle Fire came out with Goodreads integration and my old Paperwhite didn’t get it until almost a year later. I mark every book I read (mostly the paper kinds, and yes, I’m weird like that) and rate them, though I rarely write reviews. Today, I was looking at the recent deluge of Facebook Book lists and it got me wondering why these lists were a Facebook thing, when all my friends seem to be on Goodreads too. When I started making my own list, I had this vague plan to link the Goodreads pages to the list, but then frankly, for a status message in FB it was just too cumbersome. It’s weird though. Many of my friends have books on their list that I want to read, but now, I have to go discover these lists (or hope that Facebook surfaces the ones I’d really like) and then keep adding on Goodreads.

I wondered why in these times of Buzzfeed and crazed listicles, Goodreads doesn’t have lists. Except, it does. I checked. But here’s my issue – I’m a longtime user and it took a search for me to discover this. I realize that in the interest of simplicity, there is no point in having lists upfront on the login screen. But, in times like this, especially when a book tag is doing the rounds, shouldn’t Goodreads be pushing users to publish these lists on Goodreads? Especially since the FB ones are going to die down, and none of us will ever be able to locate them later. It also looks like Goodreads believes the lists should only be of the format ‘Best Robot Books’ not ‘Michael’s Favorite Books’ – I wonder why. I mean, I may be far more interested in discovering something from A’s favorite books, than a list of her favorite thrillers, for instance. Maybe I’m projecting way too much of myself into the shoes of a generic user on Goodreads. Maybe people would prefer Goodreads be the way it is. It would be interesting though, to see if Goodreads could maybe create these list driven FB posts as a social media marketing campaign, where they get people to tag books on Goodreads or some such. I feel like all the virality should benefit them!

by muchnessofd at September 01, 2014 06:53 PM

Ph.D. alumna

What is Privacy?

Earlier this week, Anil Dash wrote a smart piece unpacking the concept of “public.” He opens with some provocative questions about how we imagine the public, highlighting how new technologies make heightened visibility possible. For example,

Someone could make off with all your garbage that’s put out on the street, and carefully record how many used condoms or pregnancy tests or discarded pill bottles are in the trash, and then post that information up on the web along with your name and your address. There’s probably no law against it in your area. Trash on the curb is public.

The acts that he describes are at odds with — or at least complicate — our collective sense of what’s appropriate. What’s at stake is not about the law, but about our idea of the society we live in. This leads him to argue that the notion of public is not easy to define. “Public is not just what can be viewed by others, but a fragile set of social conventions about what behaviors are acceptable and appropriate.” He then goes on to talk about the vested interests in undermining people’s conception of public and expanding the collective standards of what is in.

To get there, he pushes back at the dichotomy between “public” and “private,” suggesting that we should think of these as a spectrum. I’d like to push back even further to suggest that our notion of privacy, when conceptualized in relationship to “public,” does a disservice to both concepts. The notion of private is also a social convention, but privacy isn’t a state of a particular set of data. It’s a practice and a process, an idealized state of being, to be actively negotiated in an effort to have agency. Once we realize this, we can reimagine how to negotiate privacy in a networked world. So let me unpack this for a moment.

Imagine that you’re sitting in a park with your best friend talking about your relationship troubles. You may be in a public space (in both senses of that term), but you see your conversation as private because of the social context, not the physical setting. Most likely, what you’ve thought through is whether or not your friend will violate your trust, and thus your privacy. If you’re a typical person, you don’t even begin to imagine drones that your significant other might have deployed or mechanisms by which your phone might be tapped. (Let’s leave aside the NSA, hacker-geek aspect of this.)

You imagine privacy because you have an understanding of the context and are working hard to control the social situation. You may even explicitly ask your best friend not to say anything (prompting hir to say “of course not” as a social ritual).

As Alice Marwick and I traversed the United States talking with youth, trying to make sense of privacy, we quickly realized that the tech-centric narrative of privacy just doesn’t fit with people’s understandings and experience of it. They don’t see privacy as simply being the control of information. They don’t see the “solution” to privacy being access-control lists or other technical mechanisms of limiting who has access to information. Instead, they try to achieve privacy by controlling the social situation. To do so, they struggle with their own power in that situation. For teens, it’s all about mom looking over their shoulder. No amount of privacy settings can solve for that one. While learning to read social contexts is hard, it’s especially hard online, where the contexts seem to be constantly destabilized by new technological interventions. As such, context becomes visible and significant in the effort to achieve privacy. Achieving privacy requires a whole slew of skills, not just in the technological sense, but in the social sense. Knowing how to read people, how to navigate interpersonal conflict, how to make trust stick. This is far more complex than people realize, and yet we do this every day in our efforts to control the social situations around us.

The very practice of privacy is all about control in a world in which we fully know that we never have control. Our friends might betray us, our spaces might be surveilled, our expectations might be shattered. But this is why achieving privacy is desirable. People want to be *in* public, but that doesn’t necessarily mean that they want to *be* public. There’s a huge difference between the two. As a result of the destabilization of social spaces, what’s shocking is how frequently teens have shifted from trying to restrict access to content to trying to restrict access to meaning. They get, at a gut level, that they can’t have control over who sees what’s said, but they hope to instead have control over how that information is interpreted. And thus, we see our collective imagination of what’s private colliding smack into the notion of public. They are less of a continuum and more of an entwined hairball, reshaping and influencing each other in significant ways.

Anil is right when he highlights the ways in which tech companies rely on conceptions of “public” to justify data collection practices. He points to the lack of consent, which signals what’s really at stake. When powerful actors, be they companies or governmental agencies, use the excuse of something being “public” to defend their right to look, they systematically assert control over people in a way that fundamentally disenfranchises them. This is the very essence of power and the core of why concepts like “surveillance” matter. Surveillance isn’t simply the all-being all-looking eye. It’s a mechanism by which systems of power assert their power. And it is why people grow angry and distrustful. Why they throw fits over being experimented on. Why they cry privacy foul even when the content being discussed is, for all intents and purposes, public.

As Anil points out, our lives are shaped by all sorts of unspoken social agreements. Allowing organizations or powerful actors to undermine them for personal gain may not be illegal, but it does tear at the social fabric. The costs of this are, at one level, minuscule, but when added up, they can cause a serious earthquake. Is that really what we’re seeking to achieve?

(The work that Alice and I did with teens, and the implications that this has for our conception of privacy writ large, is written up as “Networked Privacy” in New Media & Society. If you don’t have library access, email me and I’ll send you a copy.)

(This entry was first posted on August 1, 2014 at Medium under the title “What is Privacy” as part of The Message.)

by zephoria at September 01, 2014 02:34 PM

August 28, 2014

Ph.D. student

a mathematical model of collective creativity

I love my Mom. One reason I love her is that she is so good at asking questions.

I thought I was on vacation today, but then my Mom started to ask me questions about my dissertation. What is my dissertation about? Why is it interesting?

I tried to explain: I’m interested in studying how these people working on scientific software work together. That could be useful in the design of new research infrastructure.

M: Ok, so like…GitHub? Is that something people use to share their research? How do they find each other using that?

S: Well, people can follow each other’s repositories to get notifications. Or they can meet each other at conferences and learn what people are working on. Sometimes people use social media to talk about what they are doing.

M: That sounds like a lot of different ways of learning about things. Could your research be about how to get them all to talk about it in one place?

S: Yes, maybe. In some ways GitHub is already serving as that central repository these days. One application of my research could be about how to design, say, an extension to GitHub that connects people. There’s a lot of research on ‘link formation’ in the social media context–well I’m your friend, and you have this other friend, so maybe we should be friends. Maybe the story is different for collaborators. I have certain interests, and somebody else does too. When are our interests aligned, so that we’d really want to work together on the same thing? And how do we resolve disputes when our interests diverge?

M: That sounds like what open source is all about.

S: Yeah!

M: Could you build something like that that wasn’t just for software? Say I’m a researcher and I’m interested in studying children’s education, and there’s another researcher who is interested in studying children’s education. Could you build something like that in your…your D-Lab?

S: We’ve actually talked about building an OKCupid for academic research! The trick there would be bringing together researchers interested in the same things, but with different skills. Maybe somebody is really good at analyzing data, and somebody else is really good at collecting data. But it’s a lot of work to build something nice. Not as easy as “build it and they will come.”

M: But if it was something like what people are used to using, like OKCupid, then…

S: It’s true that would be a really interesting project. But it’s not exactly my research interest. I’m trying really hard to be a scientist. That means working on problems that aren’t immediately appreciable by a lot of people. There are a lot of applications of what I’m trying to do, but I won’t really know what they are until I get the answers to what I’m looking for.

M: What are you looking for?

S: I guess, well…I’m looking for a mathematical model of creativity.

M: What? Wow! And you think you’re going to find that in your data?

S: I’m going to try. But I’m afraid to say that. People are going to say, “Why aren’t you studying artists?”

M: Well, the people you are studying are doing creative work. They’re developing software, they’re scientists…

S: Yes.

M: But they aren’t like Beethoven writing a symphony, it’s like…

S: …a craft.

M: Yes, a craft. But also, it’s a lot of people working together. It’s collective creativity.

S: Yes, that’s right.

M: You really should write that down. A mathematical model of collective creativity! That gives me chills. I really hope you’ll write that down.

Thanks, Mom.

by Sebastian Benthall at August 28, 2014 08:26 PM

Ph.D. student

Re: Homebrew Website Club: August 27, 2014

I did make it to the Indiewebcamp/Homebrew meeting this evening after all, in Portland this time, since I happened to be passing through.

I was able to show off some of the work I've been doing on embedding data-driven graphs/charts in the Web versions of in-progress academic writing: d3.js generating SVG tables in the browser, but also saving SVG/PDF versions which are used as figures in the LaTeX/PDF version (which I still need for sharing the document in print and with most academics). I need to write a brief blog post describing my process for doing this, even though it's not finished. In fact, that's a theme; we all need to be publishing code and writing blog posts, especially for inchoate work.

Also, I've been thinking about pseudonymity in the context of personal websites. Is there anything we need to do to make it possible to maintain different identities / domain names without creating links between them? Also, it may be a real privacy advantage to split the reading and writing on the Web: if you don't have to create a separate list of friends/follows in each site with each pseudonym, then you can't as easily be re-identified by having the same friends. But I want to think carefully about the use case, because while I've become very comfortable with a domain name based on my real name and linking my professional, academic and personal web presences, I find that a lot of my friends are using pseudonyms, or intentionally subdividing

Finally, I learned about some cool projects.

  • Indiewebcamp IRC logs become more and more featureful, including an interactive chat client in the logs page itself
  • Google Web Starter Kit provides boilerplate and a basic build/task system for building static web sites
  • Gulp and Harp are two (more) JavaScript-based tools for preparing/processing/hosting static web sites

All in all, good fun. And then I went to the Powell's bookstore dedicated just to technical and scientific books, saw an old NeXT cube and bought an old book on software patterns.

Thanks for hosting us, @aaronpk!
— Nick

by at August 28, 2014 05:49 AM

August 27, 2014

Ph.D. student

a response to “Big Data and the ‘Physics’ of Social Harmony” by @doctaj; also Notes towards ‘Criticality as ideology’

I’ve been thinking over Robin James’ “Big Data & the ‘Physics’ of Social Harmony”, an essay in three sections. The first discusses Singapore’s use of data science to detect terrorists and public health threats for the sake of “social harmony,” as reported by Harris in Foreign Policy. The second ties together Plato, Pentland’s “social physics”, and neoliberalism. The last discusses the limits to individual liberty proposed by J.S. Mill. The author admits it’s “all over the place.” I get the sense that it is a draft towards a greater argument. It is very thought-provoking and informative.

I take issue with a number of points in the essay. Underlying my disagreement is what I think is a political difference about the framing of “data science” and its impact on society. Since I am a data science practitioner who takes my work seriously, I would like this framing to be nuanced, recognizing both the harm and help that data science can do. I would like the debate about data science to be more concrete and pragmatic so that practitioners can use this discussion as a guide to do the right thing. I believe this will require discussion of data science in society to be informed by a technical understanding of what data science is up to. However, I think it’s also very important that these discussions rigorously take up the normative questions surrounding data science’s use. It’s with this agenda that I’m interested in James’ piece.

James is a professor of Philosophy and Women’s/Gender Studies and the essay bears the hallmarks of these disciplines. Situated in a Western and primarily anglophone intellectual tradition, it draws on Plato and Mill for its understanding of social harmony and liberalism. At the same time, it has the political orientation common to Gender Studies, alluding to the gendered division of economic labor, at times adopting Marxist terminology, and holding suspicion for authoritarian power. Plato is read as being the intellectual root of a “particular neoliberal kind of social harmony” that is “the ideal that informs data science.” James contrasts this ideal with the ideal of individual liberty, as espoused and then limited by Mill.

Where I take issue with James is that I think this line of argument is biased by its disciplinary formation. (Since this is more or less a truism for all academics, I suppose this is less a rebuttal than a critique.) Where I believe this is most visible is in her casting of Singapore’s ideal of social harmony as an upgrade of Plato, via the ideology of neoliberalism. She does not consider in the essay that Singapore’s ideal of social harmony might be rooted in Eastern philosophy, not Western philosophy. Though I have no special access or insight into the political philosophy of Singapore, this seems to me to be an important omission given that Singapore is ethnically 74.2% Chinese, with a Buddhist plurality.

Social harmony is a central concept in Eastern, especially Chinese, philosophy with deep roots in Confucianism and Daoism. A great introduction for those with background in Western philosophy who are interested in the philosophical contributions of Confucius is Fingarette’s Confucius: The Secular as Sacred. Fingarette discusses how Confucian thought is a reaction to the social upheaval and war of Ancient China’s Warring States Period, roughly 475 – 221 BC. Out of these troubling social conditions, Confucian thought attempts to establish conditions for peace. These include ritualized forms of social interaction at whose center is a benevolent Emperor.

There are many parallels with Plato’s political philosophy, but Fingarette makes a point of highlighting where Confucianism is different. In particular, the role of social ritual and ceremony as the basis of society is at odds with Western individualism. Political power is not a matter of contest of wills but the proper enactment of communal rites. It is like a dance. Frequently, the word “harmony” is used in the translation of Confucian texts to refer to the ideal of this functional, peaceful ceremonial society and, especially, its relationship with nature.

A thorough analysis of the use of data science for social control in light of Eastern philosophy would be an important and interesting work. I certainly haven’t done it. My point is simply that when we consider the use of data science for social control as a global phenomenon, it is dubious to see it narrowly in light of Western intellectual history and ideology. That includes rooting it in Plato, contrasting it with Mill, and characterizing it primarily as an expression of white neoliberalism. Expansive use of these Western tropes is a projection, a fallacy of “I think this way, therefore the world must.” This I submit is an occupational hazard of anyone who sees their work primarily as an analysis or critique of ideology.

In a lecture in 1965 printed in Knowledge and Human Interests, Habermas states:

The concept of knowledge-constitutive human interests already conjoins the two elements whose relation still has to be explained: knowledge and interest. From everyday experience we know that ideas serve often enough to furnish our actions with justifying motives in place of the real ones. What is called rationalization at this level is called ideology at the level of collective action. In both cases the manifest content of statements is falsified by consciousness’ unreflected tie to interests, despite its illusion of autonomy. The discipline of trained thought thus correctly aims at excluding such interests. In all the sciences routines have been developed that guard against the subjectivity of opinion, and a new discipline, the sociology of knowledge, has emerged to counter the uncontrolled influence of interests on a deeper level, which derive less from the individual than from the objective situation of social groups.

Habermas goes on to reflect on the interests driving scientific inquiry–“scientific” in the broadest sense of having to do with knowledge. He delineates:

  • Technical inquiry motivated by the drive for manipulation and control, or power
  • Historical-hermeneutic inquiry motivated by the drive to guide collective action
  • Critical, reflexive inquiry into how the objective situation of social groups controls ideology, motivated by the drive to be free or liberated

This was written in 1965. Habermas was positioning himself as a critical thinker; however, unlike some of the earlier Frankfurt School thinkers he drew on, he maintained that technical power was an objective human interest (see Bohman and Rehg). In the United States especially, criticality as a mode of inquiry took aim at the ideologies that aimed at white, bourgeois, and male power. Contemporary academic critique has since solidified as an academic discipline and wields political power. In particular, it is frequently enlisted as an expression of the interests of marginalized groups. In so doing, academic criticality has (in my view regrettably) become mere ideology. No longer interested in being scientifically disinterested, it has become a tool of rationalization. Its project is the articulation of changing historical conditions in certain institutionally recognized tropes. One of these tropes is the critique of capitalism, modernism, neoliberalism, etc. and their white male bourgeois heritage. Another is the feminist emphasis on domesticity as a dismissed form of economic production. This trope features in James’ analysis of Singapore’s ideal of social harmony:

Harris emphasizes that Singaporeans generally think that finely-tuned social harmony is the one thing that keeps the tiny city-state from tumbling into chaos. [1] In a context where resources are extremely scarce–there’s very little land, and little to no domestic water, food, or energy sources, harmony is crucial. It’s what makes society sufficiently productive so that it can generate enough commercial and tax revenue to buy and import the things it can’t cultivate domestically (and by domestically, I really mean domestically, as in, by ‘housework’ or the un/low-waged labor traditionally done by women and slaves/servants.) Harmony is what makes commercial processes efficient enough to make up for what’s lost when you don’t have a ‘domestic’ supply chain. (emphasis mine)

To me, this parenthetical is quite odd. There are other uses of the word “domestic” that do not specifically carry the connotation of women and slave/servants. For example, the economic idea of gross domestic product just means “an aggregate measure of production equal to the sum of the gross values added of all resident institutional units engaged in production (plus any taxes, and minus any subsidies, on products not included in the value of their outputs).” Included in that production is work done by men and high-wage laborers. To suggest that natural resources are primarily exploited by “domestic” labor in the ‘housework’ sense is bizarre given, say, agribusiness, industrial mining, etc.

There is perhaps an interesting etymological relationship here; does our use of ‘domestic’ in ‘domestic product’ have its roots in household production? I wouldn’t know. Does that same etymological root apply in Singapore? Was agriculture traditionally the province of household servants in China and Southeast Asia (as opposed to independent farmers and their sons)? Regardless, domestic economic production, including agricultural production, is not housework now. So it’s mysterious that this detail should play a role in explaining Singapore’s emphasis on social harmony today.

So I think it’s safe to say that this parenthetical remark by James is due to her disciplinary orientation and academic focus. Perhaps it is a contortion to satisfy the audience of Cyborgology, which has a critical left-leaning politics. Harris’s original article does not appear to support this interpretation. Rather, it only uses the word ‘harmony’ twice, and maintains a cultural sensitivity that James’ piece lacks, noting that Singapore’s use of data science may be motivated by a cultural fear of loss or risk.

The colloquial word kiasu, which stems from a vernacular Chinese word that means “fear of losing,” is a shorthand by which natives concisely convey the sense of vulnerability that seems coded into their social DNA (as well as their anxiety about missing out — on the best schools, the best jobs, the best new consumer products). Singaporeans’ boundless ambition is matched only by their extreme aversion to risk.

If we think that Harris is closer to the source here, then we do not need the projections of Western philosophy and neoliberal theory to explain what is really meant by Singapore’s use of data science. Rather, we can look to Singapore’s culture and perhaps its ideological origins in East Asian thinking. Confucius, not Plato.

* * *

If there is a disciplinary bias to American philosophy departments, it is that they exist to reproduce anglophone philosophy. This is a point that James has recently expressed herself…in fact while I have been in the process of writing this response.

Though I don’t share James’ political project, generally speaking I agree that effort spent on the reproduction of disciplinary terminology is not helpful to the philosophical and scientific projects. Terminology should be deployed for pragmatic reasons in service to objective interests like power, understanding, and freedom. On the other hand, language requires consistency to be effective, and education requires language. My own personal conclusion is that the scientific project can only be sustained now through disciplinary collapse.

When James suggests that old terms like metaphysics and epistemology prevent the de-centering of the “white supremacist/patriarchal/capitalist heart of philosophy”, she perhaps alludes to her recent coinage of “epistemontology” as a combination of epistemology and ontology, as a way of designating what neoliberalism is. She notes that she is trying to understand neoliberalism as an ideology, not as a historical period, and finds useful the definition that “neoliberals think everything in the universe works like a deregulated, competitive, financialized capitalist market.”

However helpful a philosophical understanding of neoliberalism as market epistemontology might be, I wonder whether James sees the tension between her statements about rejecting traditional terminology that reproduces the philosophical discipline and her interest in preserving the idea of “neoliberalism” in a way that can be taught in an introduction to philosophy class, a point she makes in a blog comment later. It is, perhaps, in the act of teaching that a discipline is reproduced.

The use of neoliberalism as a target of leftist academic critique has been challenged relatively recently. Craig Hickman, in a blog post about Luis Suarez-Villa, writes:

In fact Williams and Srinicek see this already in their first statement in the interview where they remind us that “what is interesting is that the neoliberal hegemony remains relatively impervious to critique from the standpoint of the latter, whilst it appears fundamentally unable to counter a politics which would be able to combat it on the terrain of modernity, technology, creativity, and innovation.” That’s because the ball has moved and the neoliberalist target has shifted in the past few years. The Left is stuck in waging a war it cannot win. What I mean by that is that it is at war with a target (neoliberalism) that no longer exists except in the facades of spectacle and illusion promoted in the vast Industrial-Media-Complex. What is going on in the world is now shifting toward the East and in new visions of technocapitalism of which such initiatives as Smart Cities by both CISCO (see here) and IBM and a conglomerate of other subsidiary firms and networking partners to build new 21st Century infrastructures and architectures to promote creativity, innovation, ultra-modernity, and technocapitalism.

Let’s face it capitalism is once again reinventing itself in a new guise and all the Foundations, Think-Tanks, academic, and media blitz hype artists are slowly pushing toward a different order than the older market economy of neoliberalism. So it’s time the Left begin addressing the new target and its ideological shift rather than attacking the boogeyman of capitalism’s past. Oh, true, the façade of neoliberalism will remain in the EU and U.S.A. and much of the rest of the world for a long while yet, so there is a need to continue our watchdog efforts on that score. But what I’m getting at is that we need to move forward and overtake this new agenda that is slowly creeping into the mix before it suddenly displaces any forms of resistance. So far I’m not sure if this new technocapitalistic ideology has even registered on the major leftist critiques beyond a few individuals like Luis Suarez-Villa. Mark Bergfield has a good critique of Suarez-Villa’s first book on Marx & Philosophy site: here.

In other words, the continuation of capitalist domination is due to its evolution relative to the stagnation of intellectual critiques of it. Or to put it another way, privilege is the capacity to evolve and not merely reproduce. Indeed, the language game of academic criticality is won by those who develop and disseminate new tropes through which to represent the interests of the marginalized. These privileged academics accomplish what Lyotard describes as “legitimation through paralogy.”

* * * * *

If James were working merely within academic criticality, I would be less interested in the work. But her aspirations appear to be higher, in a new political philosophy that can provide normative guidance in a world where data science is a technical reality. She writes:

Mill has already made–in 1859 no less–the argument that rationalizes the sacrifice of individual liberty for social harmony: as long as such harmony is enforced as a matter of opinion rather than a matter of law, then nobody’s violating anybody’s individual rights or liberties. This is, however, a crap argument, one designed to limit the possibly revolutionary effects of actually granting individual liberty as more than a merely formal, procedural thing (emancipating people really, not just politically, to use Marx’s distinction). For example, a careful, critical reading of On Liberty shows that Mill’s argument only works if large groups of people–mainly Asians–don’t get individual liberty in the first place. [2] So, critiquing Mill’s argument may help us show why updated data-science versions of it are crap, too. (And, I don’t think the solution is to shore up individual liberty–cause remember, individual liberty is exclusionary to begin with–but to think of something that’s both better than the old ideas, and more suited to new material/technical realities.)

It’s because of these more universalist ambitions that I think it’s fair to point out the limits of her argument. If a government’s idea of “social harmony” is not in fact white capitalist but premodern Chinese, if “neoliberalism” is no longer the dominant ideology but rather an idea of an ideology reproduced by a stagnating academic discipline, then these ideas will not help us understand what is going on in the contemporary world in which ‘data science’ is allegedly of such importance.

What would be better than this?

There is an empirical reality to the practices of data science. Perhaps it should be studied on its own terms, without disciplinary baggage.

by Sebastian Benthall at August 27, 2014 05:27 PM

August 18, 2014

MIMS 2012

1-year Retrospective

August 2nd marked the 1-year anniversary of my first post, so it seems appropriate to do a quick retrospective of my first year blogging on my personal site.

Writing Stats

I’ve written 22 posts in that time, which is a rate of 1.83 per month. My (unstated) goal was 2 per month, so I wasn’t far off. My most prolific month is a tie between September 2013 and May 2014, in which I wrote 4 articles each. But in September I re-used some posts I had written previously for Optimizely, so May wins for more original content.
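As a quick sanity check on that rate, here is a minimal sketch in Python. The variable names are made up for illustration, and it assumes the year is counted as 12 full months.

```python
# Figures taken from the paragraph above; names are illustrative only.
posts_written = 22
months_elapsed = 12          # assumption: one year counted as 12 months
goal_per_month = 2           # the (unstated) goal mentioned above

# Average posts per month over the year.
rate = posts_written / months_elapsed

# How many posts short of the goal the year ended up.
shortfall = goal_per_month * months_elapsed - posts_written

print(f"{rate:.2f} posts per month, {shortfall} posts short of the goal")
```

Under those assumptions the rate rounds to 1.83, and hitting the goal would have taken two more posts over the year.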

Sadly, there were two months in which I didn’t write any articles: Dec 2013, and July 2014. In December I was in India, so that’s a pretty legitimate reason. July, however, has no good reason. It was generally a busy month, but I should have made time to write at least one post. And looking closer, just saying “July” makes it sound better than it actually was - I had a seven-week stretch of no posts then!

My longest article was “Re-Designing Optimizely’s Preview Tool”, clocking in at 4,158 words!

Site Analytics

Diving into Google Analytics, I’ve had 3,092 page views, 2,158 sessions, and 1,778 users. I seem to get a steady trickle of traffic every day, with a few occasional spikes in activity (which are caused by retweets, Facebook posts, or sending posts to all of Optimizely). All of which I find pretty surprising since I don’t write very regularly, and I don’t do much to actively seek readers.

Google Analytics stats for the past year

So where do these visitors (i.e. you) come from? Google Analytics tells me that, even more surprisingly, the top two sources are organic search and direct, respectively. From looking through the search terms used to find me, they can be grouped into three categories:

  • My name: this is most likely people who are interviewing at Optimizely.
  • Cloud.typography and Typekit comparison: people are interested in a performance comparison of these two services. And in fact, I wrote this article precisely because I was searching for that information myself, but there weren’t any posts about it.
  • Framing messages: I wrote a post about the behavioral economics principle of framing, and how you can use it to generate A/B testing ideas. Apparently people want help writing headlines!

Top Posts

Continuing to dig into Google Analytics, these are my three most popular posts:

  1. “Extend – SASS’s Awkward Stepchild”, with 354 page views.
  2. “Re-Designing Optimizely’s Preview Tool”, with 306 page views.
  3. “Performance comparison of serving fonts through Typekit vs Cloud.typography”, with 302 page views.

They’re all pretty close in terms of traffic, but quite different in terms of content. So what does this tell me about what’s resonating with you, the reader, and what I should continue doing going forward? The main commonality is that all of those articles are original, in-depth content. In fact, this holds true past the top 10. My shorter posts that are responses to other people’s posts don’t receive as much mind share. I’ll have to think more about whether they’re worth doing at all anymore.


Overall I’m pretty satisfied with those numbers, and the content I’ve been able to produce. Going forward I hope I can write more in-depth content, especially about the design process of my projects (which are my favorite to write). Here’s to the upcoming year!

by Jeff Zych at August 18, 2014 04:13 AM

August 14, 2014

MIMS 2011

Diary of an internet geography project #4

Reblogged from ‘Connectivity, Inclusivity and Inequality’

Continuing with our series of blog posts exposing the workings behind a multidisciplinary big data project, we talk this week about the process of moving between small data and big data analyses. Last week, we did a group deep dive into our data. Extending the metaphor: Shilad caught the fish and dumped them on the boat for us to sort through. We wanted to know whether our method of collecting and determining the origins of the fish was working by looking at a bunch of randomly selected fish up close. Working out how we would do the sorting was the biggest challenge. Some of us liked really strict rules about how we were identifying the fish. ‘Small’ wasn’t a good enough description; better would be that small = 10-15cm diameter after a maximum of 30 minutes out of the water. Through this process we learned a few lessons about how to do this close-looking as a team.

Step 1: Randomly selecting items from the corpus

We wanted to know two things about the data that we were selecting through this ‘small data’ analysis: Q1) Were we getting every citation in the article or were we missing/duplicating any? Q2) What was the best way to determine the location of the source?

Shilad used the WikiBrain software library he developed with Brent to identify all roughly one million geo-tagged Wikipedia articles. He then collected all external URLs (about 2.9 million unique URLs) appearing within those articles and used this data to create two samples for coding tasks. He sampled about 50 geotagged articles (to answer Q1) and selected a few hundred random URLs cited within particular articles (to answer Q2).

  • Batch 1 for Q1: 50 documents each containing an article title, url, list of citations, empty list of ‘missing citations’
  • Batch 2 for Q2: Spreadsheet of 500 random citations occurring in 500 random geotagged articles.
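The two batches above could be drawn with a sketch along these lines. It is an illustration only: the `draw_batches` function, the dictionary layout, the fixed seed and the choice of one citation per sampled article are all assumptions, not the WikiBrain code that was actually used.

```python
import random


def draw_batches(article_urls, n_articles=50, n_citations=500, seed=0):
    """Draw two coding samples from a {article title: [cited URLs]} mapping.

    Hypothetical helper: the data layout and the seed are assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    titles = sorted(article_urls)

    # Batch 1: whole articles, used to check that no citation is missed.
    batch1 = rng.sample(titles, min(n_articles, len(titles)))

    # Batch 2: one random citation from each of n randomly chosen articles
    # (articles without any URL cannot contribute a citation).
    with_urls = [t for t in titles if article_urls[t]]
    chosen = rng.sample(with_urls, min(n_citations, len(with_urls)))
    batch2 = [(t, rng.choice(article_urls[t])) for t in chosen]
    return batch1, batch2
```

Seeding the generator is a design choice rather than a requirement: it lets every coder on the team regenerate exactly the same sample.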

Example from batch 1:

Coding for Montesquiu

  1. Visit the page at Montesquiu
  2. Enter your initials in the ‘coder’ section
  3. Look at the list of extracted links below in the ‘Correct sources’ section
  4. Add a short description of each missed source to the ‘Missed sources’ section

Initials of person who coded this:

Correct sources

Missing sources

Example from batch 2:

url domain effective domain article article url
C&pg=PA308 Teatro Calderón (Valladolid)

For batch 1, we looked up each article and made sure that the algorithm we were using was catching all the citations. We found three kinds of anomaly. Some citations were duplicated (for example, when a single citation contained two urls: one to the ISBN address and another to a Google books url). Some were missing (when the API listed a URL only once although it had been used multiple times, or when a book was cited without a url, for example). And some were incorrect (when the citation url pointed to the Wikipedia article on the Italian National Institute of Statistics (Istat) rather than to the Istat domain).
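A check of this kind can be sketched as a comparison between the extracted URLs and a coder’s own list for the same article. The function name and the idea of comparing plain lists of URLs are assumptions made for illustration; the comparison in the project was done by hand.

```python
from collections import Counter


def check_extraction(extracted, hand_coded):
    """Compare automatically extracted URLs with a coder's list for one article.

    Hypothetical helper. Returns (extra, missing): URLs the extractor returned
    too often or wrongly, and URLs (or repeat uses) it failed to return.
    """
    extracted_counts = Counter(extracted)
    coded_counts = Counter(hand_coded)
    # Counter subtraction keeps only positive counts, so each side of the
    # comparison lists just the surplus in one direction.
    extra = sorted((extracted_counts - coded_counts).elements())
    missing = sorted((coded_counts - extracted_counts).elements())
    return extra, missing
```

Counting occurrences rather than using sets matters here, because one of the anomalies is a URL that is cited several times but listed only once.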

The article on the town of El Bayad in Libya, for example, contained two citations that weren’t included in the analysis because they didn’t contain a url. One appears to be a newspaper and the other a book, but I couldn’t find the citations online. These would not be included in the analysis, but it was the only example like this:

  • Amraja M. el Khajkhaj, “Noumou al Mudon as Sagheera fi Libia”, Dar as Saqia, Benghazi-2008, p.120.
  • Al Ain newspaper, Sep. 26; 2011, no. 20, Dar al Faris al Arabi, p.7.

We listed each of these anomalies in order to work out whether a) we can accommodate them in the algorithm or b) there are so few of them that they probably won’t affect the analysis too heavily.

Step 2: Developing a codebook and initial coding

I took the list of 500 random citations in batch 2 and went through each one to develop a new list of 100 working URLs and a codebook to help the others code the same list. I discarded 24 dead links and developed a working definition for each code in the codebook.

The biggest challenge when trying to locate citations in Wikipedia is whether to define the location according to the domain that is being pointed to, or whether one should find the original source. Google books urls are the most common form of this challenge. If a book is cited and the url points to its Google books location, do we cite the source as coming from Google or from the original publisher of the work?
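The URL-based option can be sketched in a few lines of Python. This is an illustration only: the `INTERMEDIARY_HOSTS` set and both function names are assumptions, and a real list of intermediary hosts would need to be much longer (Google Books alone is served from several country domains).

```python
from urllib.parse import urlsplit

# Hosts that serve other publishers' works; this set is illustrative only.
INTERMEDIARY_HOSTS = {"books.google.com"}


def url_host(url):
    """Return the lower-cased host name of a citation URL ('' if none)."""
    return (urlsplit(url).hostname or "").lower()


def needs_original_publisher(url):
    """True when the URL host is only an intermediary for the cited work."""
    return url_host(url) in INTERMEDIARY_HOSTS
```

A citation flagged by `needs_original_publisher` is exactly the case where the two definitions of location diverge and a human coder has to look up the original publisher.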

My initial thought was to define URL location instead of original location — mostly because it seemed like the easiest way to scale up the analysis after this initial hand coding. But after discussing it, I really appreciated when Brent said, ‘Let’s just start this phase by avoiding thinking like computer scientists and code how we need to code without thinking about the algorithm.’ Instead, we tried to use this process as a way to develop a number of different ways of accurately locating sources and to see whether there were any major differences afterwards. Instead of using just one field for location, we developed three coding categories.

Source country:

Country where the article’s subject is located | Country of the original publisher | Country of the URL publisher

We’ll compare these three to the:

Country of the administrative contact for the URL’s domain

that Shilad and Dave are working on extracting automatically.

When I first started doing the coding, I was really interested in looking at other aspects of the data such as what kinds of articles are being captured by the geotagged list, as well as what types of sources are being employed. So I created two new codes: ‘source type’ and ‘article subject’. I defined the article subject as: ‘The subject/category of the article referred to in the title or opening sentence e.g. ‘Humpety is a village in West Sussex, England’ (subject: village)’. I defined source type as ‘the type of site/document etc that *best* describes the source e.g. if the url points to a list of statistics but it’s contained within a newspaper site, it should be classified as ‘statistics’ rather than ‘newspaper’’.

Coding categories based on example item above from batch 2:

subject | subject country | original publisher location | URL publisher location | language | source type
building | Spain | Spain | US | Spanish | book

In our previous project we divided up the ‘source type’ into many facets. These included the medium (e.g. website, book etc) and the format (statistics, news etc). But this can get very complicated very fast because there are a host of websites that do not fall easily into these categories: a url pointing to a news report by a blogger on a newspaper’s website, for example, or a link to a list of hyperlinks that download as spreadsheets on a government website. This is why I chose to use the ‘best guess’ for the type of source, because choosing one category ends up being much easier than the faceted coding that we did in the previous project.

The problem was that this wasn’t a very conclusive definition and would not result in consistent coding. It is particularly problematic because we are doing this project iteratively and we want to try to get as much data as possible so that we have it if we need it later on. After much to-ing and fro-ing, we decided to go back to our research questions and focus on those. The most important thing that we needed to work out was how we were locating sources, and whether the data changed significantly depending on what definition we used. So we decided not to focus on the article type and source type for now, choosing instead to look at the three ways of coding location of sources so that we could compare them to the automated list that we develop.
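One way to make that comparison concrete is to compute, for each of the three hand-coded location fields, how often it agrees with the automatically extracted country. The field names below (including `whois_country` for the administrative contact’s country) are hypothetical labels chosen for this sketch, not the project’s actual column names.

```python
HAND_CODED_FIELDS = (
    "subject_country",
    "original_publisher_country",
    "url_publisher_country",
)


def agreement_rates(rows, automated_field="whois_country"):
    """Share of rows in which each hand-coded field matches the automated one.

    `rows` is a list of dicts, one per coded citation (hypothetical layout).
    """
    rates = {}
    for field in HAND_CODED_FIELDS:
        matches = sum(1 for row in rows if row[field] == row[automated_field])
        rates[field] = matches / len(rows) if rows else 0.0
    return rates
```

A large gap between the three rates would suggest that the choice of definition changes the picture significantly; similar rates would suggest that the cheaper, URL-based definition is good enough to scale up.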

This has been the hardest part of the project so far, I think. We went backwards and forwards a lot about how we might want to code this second set of randomly sampled citations. What definition of ‘source’ and ‘source location’ should we use? How do we balance the need to find the most accurate way to catch all outliers against the need for a way that we could abstract into an algorithm that would enable us to scale up the study to look at all citations? It was a really useful exercise, though, and we took a few lessons from it.

- When you first look at the data, make sure you all do a small data analysis using a random sample;

- When you do the small data analysis, make sure you suspend your computer scientist view of the world and try to think about what is the most accurate way of coding this data from multiple facets and perspectives;

- After you’ve done this multiple analysis, you can then work out how you might develop abstract rules to accommodate the nuances in the data and/or to do a further round of coding to get a ‘ground truth’ dataset.

In this series of blog posts, a team of computer and social scientists including Heather Ford, Mark Graham, Brent Hecht, Dave Musicant and Shilad Sen are documenting the process by which they are working together on a project to understand the geography of Wikipedia citations. Our aim is not only to better understand how far Wikipedia has come towards representing ‘the sum of all human knowledge’ but to do so in a way that lays bare the processes by which ‘big data’ is selected and visualized.

by Heather Ford at August 14, 2014 01:15 PM

August 11, 2014

Ph.D. student

picking a data backend for representing email in #python

I’m at a difficult crossroads with BigBang where I need to pick an appropriate data storage backend for my preprocessed mailing list data.

There are a lot of different aspects to this problem.

The first and most important consideration is speed. If you know anything about computer science, you know that it exists to quickly execute complex tasks that would take too long to do by hand. It’s odd writing that sentence since computational complexity considerations are so fundamental to algorithm design that this can go unspoken in most technical contexts. But since coming to grad school I’ve found myself writing for a more diverse audience, so…

The problem I’m facing is that in doing exploratory data analysis, I do not know all the questions I am going to ask yet. But any particular question will be impractical to ask unless I tune the underlying infrastructure to answer it. This chicken-and-egg problem means that the process of inquiry is necessarily constrained by the engineering options that are available.

This is not new in scientific practice. Notoriously, the field of economics in the 20th century was shaped by what was analytically tractable as formal, mathematical results. The nuance of contemporary modeling of complex systems is due largely to the fact that we now have computers to do this work for us. That means we can still have the intersubjectively verified rigor that comes with mathematization without trying to fit square pegs into round holes. (Side note: something mathematicians acknowledge that others tend to miss is that mathematics is based on dialectic proof and intersubjective agreement. This makes it much closer epistemologically to something like history as a discipline than it is to technical fields dedicated to prediction and control, like chemistry or structural engineering. Computer science is in many ways an extension of mathematics. Obviously, these formalizations are then applied to great effect. Their power comes from their deep intersubjective validity–in other words, their truth. Disciplines that have dispensed with intersubjective validity as a grounds for truth claims in favor of a more nebulous sense of diverse truths in a manifold of interpretation have difficulty understanding this and so are likely to see the institutional gains of computer scientists to be a result of political manipulation, as opposed to something more basic: mastery of nature, or more provocatively, use of force. This disciplinary dysfunction is one reason why these groups see their influence erode.)

For example, I have determined that in order to implement a certain query on the data efficiently, it would be best if another query were constant time. One way to do this is to use a database with an index.

However, setting up a database is something that requires extra work on the part of the programmer and so makes it harder to reproduce results. So far I have been keeping my processed email data “in memory” after it is pulled from files on the file system. This means that I have access to the data within the programming environment I’m most comfortable with, without depending on an external or parallel process. Fewer moving parts means that it is simpler to do my work.
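As a minimal sketch of how data kept in memory can still offer the constant-time lookup mentioned above, a plain dictionary can play the role of an index. The message layout (dictionaries keyed by header name) and the `build_index` function are assumptions for illustration; this is not BigBang’s actual code.

```python
from collections import defaultdict


def build_index(messages, key="Message-ID"):
    """Group messages by the value of one header, in a single O(n) pass.

    Hypothetical helper: `messages` is a list of dicts keyed by header name.
    """
    index = defaultdict(list)
    for message in messages:
        index[message.get(key)].append(message)
    # Return a plain dict so that later lookups never create empty entries.
    return dict(index)
```

Building the index costs one pass over the data, after which each lookup is a hash-table access that takes constant time on average. For example, `build_index(messages, key="In-Reply-To")` would let all direct replies to a given message be fetched without scanning the whole list.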

So there is a tradeoff between the computational time of the software as it executes and the time and attention it takes me (and others that want to reproduce my results) to set up the environment in which the software runs. Since I am running this as an open source project and hope others will build on my work, I have every reason to be lazy, in a certain sense. Every inconvenience I suffer is one that will be suffered by everyone that follows me. There is a Kantian categorical imperative to keep things as simple as possible for people, to take any complex procedure and replace it with a script, so that others can do original creative thinking, solve the next problem. This is the imperative that those of us embedded in this culture have internalized. (G. Coleman notes that there are many cultures of hacking; I don’t know how prevalent these norms are, to be honest; I’m speaking from my experience.) It is what makes this social process of developing our software infrastructure a social one with a modernist sense of progress. We are part of something that is being built out.

There are also social and political considerations. I am building this project intentionally in a way that is embedded within the Scientific Python ecosystem, as it is also my object of study. Certain projects are trendy right now, and for good reason. At the Python Worker’s Party at Berkeley last Friday, I saw a great presentation of Blaze. Blaze is a project that allows programmers experienced with older idioms of scientific Python programming to transfer their skills to systems that can handle more data, like Spark. This is exciting for the Python community. In such a fast moving field with multiple interoperating ecosystems, there is always the anxiety that one’s skills are no longer the best skills to have. Has your expertise been made obsolete? So there is a huge demand for tools that adapt one way of thinking to a new system. As more data has become available, people have engineered new sophisticated processing backends. Often these are not done in Python, which has a reputation for being very usable and accessible but slow to run in operation. Getting the usable programming interface to interoperate with the carefully engineered data backends is hard work, work that Matt Rocklin is doing while being paid by Continuum Analytics. That is sweet.

I’m eager to try out Blaze. But as I think through the questions I am trying to ask about open source projects, I’m realizing that they don’t fit easily into the kind of data processing that Blaze currently supports. Perhaps this is dense on my part. If I knew better what I was asking, I could maybe figure out how to make it fit. But probably, what I’m looking at is data that is not “big”, that does not need the kind of power that these new tools provide. Currently my data fits on my laptop. It even fits in memory! Shouldn’t I build something that works well for what I need it for, and not worry about scaling at this point?

But I’m also trying to think long-term. What happens if and when it does scale up? What if I want to analyze ALL the mailing list data? Is that “big” data?

“Premature optimization is the root of all evil.” – Donald Knuth

by Sebastian Benthall at August 11, 2014 06:28 PM

MIMS 2011

Wikipedia and breaking news: The promise of a global media platform and the threat of the filter bubble

I gave this talk at Wikimania in London yesterday. 

In the first years of Wikipedia’s existence, many of us said that, as an example of citizen journalism and journalism by the people, Wikipedia would be able to avoid the gatekeeping problems faced by traditional media. The theory was that because we didn’t have the burden of shareholders and the practices that favoured elite viewpoints, we could produce a media that was about ‘all of us’ and not just ‘some of us’.

Dan Gillmor (2004) wrote that Wikipedia was an example of a wave of citizen journalism projects initiated at the turn of the century in which ‘news was being produced by regular people who had something to say and show, and not solely by the “official” news organizations that had traditionally decided how the first draft of history would look’ (Gillmor, 2004: x).

Yochai Benkler (2006) wrote that projects like Wikipedia enable ‘many more individuals to communicate their observations and their viewpoints to many others, and to do so in a way that cannot be controlled by media owners and is not as easily corruptible by money as were the mass media.’ (Benkler, 2006: 11)

I think that at that time we were all really buoyed by the idea that Wikipedia and peer production could produce information products that were much more representative of “everyone’s” experience. But the idea that Wikipedia could avoid bias completely, I now believe, is fundamentally wrong. Wikipedia presents a particular view of the world while rejecting others. Its bias arises both from its dependence on sources which are themselves biased and from its own policies and practices, which favour particular viewpoints. Although Wikipedia is as close to a truly global media product as we have probably ever come*, like every media product it is a representation of the world and is the result of a series of editorial, technical and social decisions made to prioritise certain narratives over others.

Mark Graham (2009) has shown how Wikipedia’s representation of place is skewed towards the developed North; researchers such as Brendan Luyt (2011) have shown that Wikipedia’s coverage of history suffers from an over-reliance on foreign government sources, and others like Tony Lam, Anuradha Uduwage, Zhenhua Dong, Shilad Sen, Dave Musicant, Loren Terveen and John Riedl (Lam et al., 2011) have shown how there are significant gender-associated imbalances in its topic coverage.

But there is, as yet, little research on how such imbalances might manifest themselves in articles about breaking news topics. At a stage when there is often no single conclusive narrative about what happened and why, we see the effects of these choices most starkly – both in decisions about whether a particular idea, event, object or person is important enough to warrant a standalone article, as well as decisions about which statements (aka ‘facts’) to include, what those statements will be, and what shape the narrative arc will take. Wikipedia, then, presents a particular view of the world in the face of a variety of alternatives.

It is not necessarily problematic that we choose to present one article about an event rather than 20 different takes on it. But it becomes problematic when Wikipedia is presented as a neutral source; a source that represents the views of “everyone”. It’s problematic because it means that users don’t often recognise that Wikipedia is constructed (and mirrors in many ways the biases of the sources it uses to support it), but it is also problematic because it means that Wikipedians don’t always recognise that we need to change the way that we work in order to be more inclusive of perspectives. Such perspectives will remain unreflected if we continue to adhere to policies developed to favour a particular perspective of the world.

What is Wikipedia news?

The 6th biggest website in the world, Wikipedia enjoys 18 billion page views and nearly 500 million unique visitors a month, according to comScore. Since 2003, the top 25 Wikipedia articles with the most contributors every month consist nearly exclusively of articles related to current events (Keegan, 2013). Last week, for example, the article entitled ‘Ebola virus disease’ was the most viewed article on English Wikipedia at about 1.8 million views, and the Israeli-Palestinian conflict and related articles made up 5 of the top 25 most popular articles and accounted for about 1.5 million views (see the Top 25 Report on English Wikipedia).

Wikipedia didn’t always ‘do news’. Brian Keegan (2013) writes that the policy around breaking news emerged after the September 11 attacks in 2001 when there was an attempt to write articles about every person who died in the Twin Towers. There was a subsequent decision to separate these out in what is called the ‘Memorial Wiki’. It was also around this time that editors defined what would constitute a notable event and signaled the start of an ‘in the news’ section on the website that would list the most important news of the day linking to good quality Wikipedia articles about those topics. Editors can now propose topics for the ‘in the news’ section on the home page and discuss whether the articles are good enough quality to be featured and whether the news is appropriate for that day.

Although Wikipedia and traditional media both produce news, probably the most fundamental difference between Wikipedia and journalism practice is in our handling of sources. Journalists pride themselves on their ability to do original research by finding the right people to answer the right kinds of questions and to distil the important elements from those conversations into an article. For journalists, the people they interview and interrogate are their sources, and the process is a dialogic one: through conversation, questions, answers, follow-up questions and clarifications, they produce their article.

Wikipedians, on the other hand, are forbidden from doing ‘original research’ and must write what we can about the world on the basis of what we find in ‘reliable secondary sources’. For Wikipedians, sources are the ‘documents’ that we find to back up what we write. This is both a limiting and empowering feature of Wikipedia – limiting because we rely heavily on what documents say (and documents can be contradictory and false without an opportunity to follow up with their authors), but empowering (at least in theory) because it enables readers to follow up on the sources that have been provided to back up different arguments and check or verify whether they are accurate. This is called the ‘verifiability’ principle and is one of Wikipedia’s core policies.

Wikipedia’s ‘no original research’ article is summarised as follows:

Wikipedia does not publish original thought: all material in Wikipedia must be attributable to a reliable, published source. Articles may not contain any new analysis or synthesis of published material that serves to reach or imply a conclusion not clearly stated by the sources themselves.

The problem is that when Wikipedia says it doesn’t allow ‘original research’, this doesn’t mean that Wikipedians aren’t constantly making decisions about what to write and what content to include that are to a lesser or greater extent subjective decisions. This is true for the construction of articles which require Wikipedians to construct a narrative from a host of distinct reliable and unreliable sources, but it is especially true when Wikipedians must decide whether something that happened is important enough to warrant a standalone article.

Notability, according to Wikipedia, is defined as follows:

The topic of an article should be notable, or “worthy of notice”; that is, “significant, interesting, or unusual enough to deserve attention or to be recorded”. Notable in the sense of being “famous”, or “popular”—although not irrelevant—is secondary.

Decisions about notability, then, can only be original research: the conclusion that something is important enough (according to Wikipedia’s criteria) to warrant an article must be made according to editors’ subjective viewpoints. The way in which Wikipedians summarise issues and pay attention to particular points about an issue is also subjective: there is no single reference that is an exact replica of what is represented in an article, and decisions about what to include and what to leave out are happening all the time.

Such decisions are informed not just by information reflected in reliable sources but by a host of informational sources and experiences of the individual editor, and by the social interactions that develop as a result of the progress of an article. We don’t only learn about the world through ‘reliable sources’; we learn about the world through a host of informational cues – through street corner conversations, through gossip, through signage and posters and abandoned newspapers in restaurants, in train carriages and through social media and email and text messages and a whole host of what would be regarded, according to Wikipedia’s definition, as totally ‘unreliable sources’.

Let’s take the example of the first version of the 2011 Egyptian Revolution article (then called ‘protests’ rather than ‘revolution’). The article was started at 4:26pm local time on the first day of the Egyptian protests that led to the unseating of then-President Hosni Mubarak. (The first protests began around 2pm on that day). Let’s first look at the AFP article used as a citation in the article:

Egypt braces for nationwide protests

By Jailan Zayan (AFP) – Jan 25, 2011

CAIRO — Egypt braced for a day of nationwide anti-government protests on Tuesday, with organisers counting on the Tunisian uprising to inspire crowds to mobilise for political and economic reforms.

And then the first version of the article:

The 2011 Egyptian protests are a continuing series of street demonstrations taking place throughout Egypt from January 2010 onwards with organisers counting on the Tunisian uprising to inspire the crowds to mobilize.

I interviewed some of the (frankly amazing) Wikipedians who worked on what became the 2011 Egyptian Revolution article. I knew that there had been a series of protests in Egypt in the run-up to the January 25 protest, but none of these had articles on them on Wikipedia, so I wondered why this article was started (backed up by such weak evidence at the time) and how the article was able to survive.

In a Skype interview, the editor who started the article and oversaw much of its development, TheEgyptianLiberal, said that he knew

‘the thing was going to be big… before the revolution became a revolution’

(He had actually prepared the article the day before the protests even began.) When I asked him how he knew that it was going to be significant, he replied,

‘The frustration in the street. And what happened in Tunisia.’

TheEgyptianLiberal had access to a wealth of information on the streets of Cairo that showed him what was really happening, and what was happening, he (rightly) believed, was definitely “a thing” – “a thing” worth taking notice of. In Wikipedia’s terms, this was something “significant, interesting, or unusual enough to deserve attention or to be recorded”, despite the fact that it was impossible to tell at this early stage.

Another article wasn’t as successful in its early stages. When working for Ushahidi in 2011/12, I took a trip to Kenya to visit the HQ. Ushahidi let me do a side project which for me was to try to understand the development of Swahili Wikipedia. When I arrived in Nairobi, the first thing I did was to buy every newspaper available from the local supermarket. I wanted to immerse myself in the media environment that Kenyans were being enveloped by. I also sat in the B&B and watched a lot of local television. Most of the headlines were about the looming war against Al Shabaab, as the Kenyan army moved into southern Somalia to try to root out the militant terrorist group, which was alleged to have kidnapped several foreign tourists and aid workers inside Kenya.

This was the first time that the Kenyan army had been engaged in a military campaign since independence, and so it was a big deal because the government wanted to be seen to be acting to root out the elements that were believed to be behind a series of kidnappings and murders of both locals and foreigners near the border. After two bombings in central Nairobi while I was visiting, people were trying to stay at home and avoid crowded areas.

During this time, one of the Wikipedians who I interviewed, AbbasJnr, pointed me to a deletion discussion under way on the English Wikipedia about whether the Kenyan military intervention warranted its own article. The article had been nominated for deletion on the grounds that it did not meet Wikipedia’s notability standards. The nominator wrote that the event was not being reported in reliable news outlets as an ‘invasion’ but rather an ‘incursion’ and since it was ‘routine’ for troops from neighboring countries to cross the border for military operations, this event did not warrant its own article.

The Wikipedians who I spoke to were very sure that what was happening in their country should be considered notable. I was sure too, having spent at that stage only 24 hours in the country. But the media in the West weren’t presenting this story as the important, notable event that the people living in Kenya understood it to be. Wikipedians (the majority of whom are based in the West) were making the decision to put the article up for deletion because they didn’t have much to go by – they only had the ‘reliable sources’, a few international publications, very few of the Kenyan media publications (since few are online and updated regularly) and fewer still of the informal and unreliable communications that filled the air in Kenya at that time.

Local spheres

Both of these examples show that there is important local contextual information required to make decisions about whether something is a notable event worth documenting on Wikipedia, and that usually we don’t notice this because the majority of editors of English Wikipedia share a similar media sphere and world view. They occupy similar informal and formal media spheres. When there is disagreement, the disagreement is usually about how to cover an issue rather than whether the issue is actually important.

There are plenty of disagreements that result from these isolated media spheres. We see these disagreements when the very different and highly isolated media spheres operating inside Israel and in Gaza are exposed; we see them when we find Russian government employees as well as ordinary citizens attempting to edit articles about the Crimean crisis, or to change how high a Ukrainian military jet can fly in a Russian Wikipedia article, in order to support alternative narratives being promulgated inside Russia about what happened to MH17. We see glimpses of what Eli Pariser calls ‘the filter bubble’ when Gilad Lotan invites us to do a Facebook search and see what our friends are saying about a particular event or when he shows us how there are distinct, isolated Twitter groupings in accounts following news about one of the recent UNRWA school bombings in Gaza.

When the issue or the event happens outside both the formal and informal media spheres that the majority of Wikipedians inhabit, where the BBC or NYTimes is not covering the issue and when there aren’t many Wikipedians in place to account for its importance, Wikipedia has nothing to go on. Our ‘no original research’ and notability policies do not help us.


Not only does Wikipedia inherit biases from the traditional media but we also have our own biases brought about by the local media spheres that we inhabit. What makes our bias more problematic is that Wikipedia is taken as a neutral source on the world. A large degree of Wikipedia’s authoritativeness comes from the authority implied by the encyclopaedic form.

Like a dictionary, an encyclopedia gives a very good impression of being comprehensive because comprehensiveness is its goal, its history and narrative. Having a Wikipedia article brings with it an air of importance. Not every event has a Wikipedia article so there is an assumption that a group of people have reached consensus on the importance of the topic, but there is also an implicit authority from the format of an encyclopedic entry. The journalistic account is written as a single author account of an event gathering evidence from their reliable sources and/or from being there. The encyclopedic account, on the other hand, is written without explicit author credits (which adds to the authority) and with evidence of alternative points of view which gives an added appearance of neutrality. When we as readers come to a Wikipedia or newspaper article, we come to it with a vast background of understandings and assumptions about what encyclopedias or newspapers are, and that influences our understanding.

The context of use also points to Wikipedia’s assumed authority. Whereas people generally go to newspapers to find out what is happening in the world, people go to Wikipedia to get the authoritative take. Wikipedia is used to settle arguments in bars about how many people there are in Britain or when the London Underground was built. People go to Wikipedia to find ‘facts’ about the world. The encyclopedia gains its authority because it is an encyclopedia, and this is a very different authority from the one that the New York Times holds for its readers, for example. (You won’t find someone saying that they’re going to go to the New York Times to find the authoritative answer to how many people were killed in World War 1.)

Wikipedia, then, is not just powerful because it is widely consulted, it is powerful because it is seen as unbiased, as neutral and as a reflection of all of us, instead of the ‘some of us’ that it actually represents.


Wikipedians need to recognise their power as newsmakers — newsmakers who are making decisions to prioritise one narrative at the expense of others and to make sure we wield that power with care. We need to take a much closer look at our policies and practices and make sure that we are building a conducive environment for future articles about a big part of the world that is as yet unrepresented. It means re-looking at efforts to expand our definitions of ‘original research’ such as those currently being discussed by Peter Gallert and others to accommodate oral citations in the developing world. It means recognising that, although we are doing a great job in broadening the scope of the events we cover, we have not begun to represent them in any way that can be considered truly global. This requires a whole lot more work in bringing editors from other countries into the Wikipedia fold, and in being more flexible about how we define what constitutes reliable knowledge.

Finally, I think that understanding Wikipedia bias enables us to develop an understanding of how media bias itself is changing – because it is less about how the media industry is failing to give us an accurate, balanced picture of the world than it is about us getting out of our filter bubbles and recognising the role of “unreliable sources” in our understandings about the world.

* But there are others like Global Voices, that are, in many ways, more globally representative than Wikipedia.

by Heather Ford at August 11, 2014 03:32 PM

August 10, 2014

MIMS 2012

DRY isn't the One True Principle of CSS

Ben Frain wrote a great article of recommendations for writing long-living CSS that’s authored by many people (e.g. CSS for a product). The whole thing is worth reading, but I especially agree with his argument against writing DRY (Don’t Repeat Yourself) code at the expense of maintainability. He says:

As a concrete example; being able to delete an entire Sass partial (say 9KB) in six months time with impunity (in that I know what will and won’t be affected by the removal) is far more valuable to me than a 1KB saving enjoyed because I re-used or extended some vague abstracted styles.

Essentially, by abstracting styles as much as possible and maximizing re-use, you’re creating an invisible web of dependencies. If you ever want to change or remove styles, you will cause a ripple effect throughout your site. Tasks that should have been 30 or 60 minutes balloon into multi-hour endeavors. As Ben says, being able to delete (or modify) with impunity is far more valuable than the small savings you get by abstracting your code.

I have made this mistake many times in my career, and it took me a long time to distinguish between good and bad code reuse. The trick I use is to ask, “If I were to later change the style of this component, would I want the others that depend on it to be updated, too?” More often than not, the answer is no, and the styles should be written separately. It took some time for me to be comfortable having duplicate CSS in order to increase maintainability.

Another way of thinking about this is to figure out why two components look the same. If it’s because one is a modification of a base module (e.g. a small version of a button), it’s a good candidate for code reuse. If not (e.g. a tab that looks similar to a button, but behaves differently), then you’re just setting yourself up for a maintainability nightmare.

As beneficial as DRY code is, it isn’t the One True Principle of CSS. At best, it saves some time and bytes, but at worst, it’s a hindrance to CSS maintainability.

by Jeff Zych at August 10, 2014 10:50 PM

August 08, 2014

Ph.D. student

the research lately

I’ve been working hard.

I wrote a lot, consolidating a lot of thinking about networked public design, digital activism, and Habermas. A lot of the thinking was inspired by Xiao Qiang’s course over a year ago, then a conversation with Nathan Matias and Brian Keegan on Twitter, then building @TheTweetserve for Theorizing the Web. Interesting how these things accrete.

Through this, I think I’ve gotten a deeper understanding of Habermas’ system/lifeworld distinction than I’ve ever had before. Where I’m still weak on this is on his understanding of the role of law. There’s an angle in there about technology as regulation (a la Lessig) that ties things back to the recursive public. But of course Habermas was envisioning the normal kind of law–the potentially democratic law. Since the I School engages more with policy than it does with technicality, it would be good to have sharper thinking about this besides vague notions of the injustice or not of “the system”–how much of this rhetoric is owed to Habermas or the people he’s drawing on?

My next big writing project is going to be about Piketty and intellectual property, I hope. This is another argument that I’ve been working out for a long time–as an undergrad working on microeconomics of intellectual property, on the job at OpenGeo reading Lukacs for some reason, in grad school coursework. I tried to write something about this shortly after coming back to school but it went nowhere, partly because I was using anachronistic concepts and partly because the term “hacker” got weird political treatment due to some anti-startup yellow journalism.

The name of the imagined essay is “Free Capital.” It will try to trace the economic implications of free software and other open access technical designs, especially their impact on the relationship between capital and labor. It’s sort of an extension of this. I feel like there is more substance there to dig out, especially around liquidity and vendor and employer lock-in. I’m imagining engaging some of the VC strategy press–I’ve been following the thinking of Kanyi Maqubela for a long time and always learning from it.

What I need to hone in on in terms of economic modeling is under what conditions it’s in labor’s interest to work to produce open source IP or ‘free capital’, under what conditions it is in capital’s interest to invest in free capital, and what the macroeconomic implications of this are. It’s clear that capital will invest in free capital in order to unseat a monopoly–take Android for instance, or Firefox–but that this is (a) unstable and (b) difficult to take into account in measures of economic growth, since the gains in this case are to be had in the efficiency of the industrial organization rather than in the value of the innovation itself. Meanwhile, Matt Asay has been saying for years that the returns on open source investment are not high enough to attract serious investment, and industry experience appears to bear that out.

Meanwhile, Piketty argues that the main force for convergence in income is technology and skills diffusion. But these are exogenous to his model. Meanwhile, here in the Bay Area the gold rush rages on and at least word on the grapevine is that VC money is having a harder and harder time finding high-return investments, and is being sunk into lamer and lamer teams of recent Stanford undergrads.

My weakness in these arguments is that I don’t have data and don’t even know what predictions I’m making. It’s dangerously theoretical.

Meanwhile, my actual dissertation work progresses…slowly. I managed to get a lot done to get my preliminary results with BigBang ready for SciPy 2014. Since then I’ve switched it over to favor an Anaconda build and use IPython Notebooks internally–all good architectural changes but it’s yak shaving. Now I’m hitting performance issues and need to make some serious considerations about databases and data structures.

And then there’s the social work around it. They are good instincts–that I should be working on accessibility, polishing my communication, trying to encourage collaborators’ interest. I know how to start an open source project and it requires that. But then–what about the research? What about the whole point of the thing? Talking with Dave Kush today, he pointed me towards research on computational discourse analysis, which is where I think this needs to go. The material felt way over my head, a reminder that I’ve been barking up so many trees that are not where I think the real problem to work on is. Mainly because I’ve been caught up in the politics of things. It’s bewildering how enriching but distracting the academic context is–how many barriers there are to sitting and doing your best work. Petty disciplinary disputes, for example.

by Sebastian Benthall at August 08, 2014 03:43 AM

August 07, 2014

Ph.D. alumna

Why Jane Doe doesn’t get to be a sex trafficking victim

In detailing the story of “Jane Doe,” a 16-year-old transgender youth stuck in an adult prison in Connecticut for over six weeks without even being charged, Shane Bauer at Mother Jones steps back to describe the context in which Jane grew up. In reading this horrific (but not that uncommon) account of abuse, neglect, poverty, and dreadful state interventions, I came across this sentence:

“While in group homes, she says she was sexually assaulted by staffers, and at 15, she became a sex worker and was once locked up for weeks and forced to have sex with “customers” until she escaped.” (Mother Jones)

What makes this sentence so startling is the choice of the term “sex work.” Whether the author realizes it or not, this term is extraordinarily political, especially when applied to an abused and entrapped teenager. I couldn’t help but wonder why the author didn’t identify Jane as a victim of human trafficking.

Commercial sexual exploitation of minors

Over the last few years, I’ve been working with an amazing collection of researchers in an effort to better understand technology’s relationship to human trafficking and, more specifically, the commercial sexual exploitation of children. In the process, I’ve learned a lot about the politics of sex work and the political framing of sex trafficking. What’s been infuriating is to watch the way in which journalists and the public reify a Hollywood narrative of what trafficking is supposed to look like — innocent young girl abducted from happy, healthy, not impoverished home with loving parents and then forced into sexual acts by a cruel older man. For a lot of journalists, this is the only narrative that “counts.” These are the portraits that are held up and valorized, so much so that an advocate reportedly fabricated her personal story to get attention for the cause.

The stark reality of how youth end up being commercially sexually exploited is much darker and implicates many more people in power. All too often, we’re talking about a child growing up in poverty, surrounded by drug/alcohol addiction. More often than not, the parents are part of the problem. If the child wasn’t directly pimped out by the parents, there’s a high likelihood that s/he was abused or severely neglected. The portrait of a sex trafficking victim is usually a white or Asian girl, but darker skinned youth are more likely to be commercially sexually exploited and boys (and especially queer youth) are victimized far more than people acknowledge.

Many youth who are commercially exploited are not pimped out in the sense of having a controlling adult who negotiates their sexual acts. All too often, youth begin trading sex for basic services — food, shelter, protection. This is part of what makes the conversation about sex work vs. human trafficking so difficult. The former presumes agency, even though that’s not always the case while the latter assumes that no agency is possible. When it comes to sex work, there’s a spectrum. Sex work by choice, sex work by circumstance, and sex work by coercion. The third category is clearly recognizable as human trafficking, but when it comes to minors, most anti-trafficking advocates and government actors argue that it’s all trafficking. Except when that label’s not convenient for other political efforts. And this is where I find myself scratching my head at how Jane Doe’s abuse is framed.

How should we label Jane Doe’s abuse?

By the sounds of the piece in Mother Jones, Jane Doe most likely started trading sex for services. Perhaps she was also looking for love and validation. This is not that uncommon, especially for queer and transgender youth. For this reason, perhaps it is valuable to imply that she has agency in her life, to give her a label of sex work to suggest that these choices are her choices.

Yet, her story shows that things are far more complicated than that. It looks as though those who were supposed to protect her — staff at group homes — took advantage of her. This would also not be that uncommon for youth who end up commercially sexually exploited. Too many sexually exploited youth that I’ve met have had far worse relationships with parents and state actors than any client. But the clincher for me is her account of having been locked up and forced to have sex until she escaped. This is coercion through-and-through. Regardless of why Doe entered into the sex trade or how we want to read her agency in this process, there is no way to interpret this kind of circumscribed existence and abuse as anything other than trafficking.

So why isn’t she identified as a trafficking victim? Why aren’t human trafficking advocacy organizations raising a stink about her case? Why aren’t anti-trafficking journalists telling her story?

The reality is that she’s not a good example for those who want clean narratives. Her case shows the messiness of human trafficking. The way in which commercial exploitation of minors is entwined with other dynamics of poverty and abuse. The ways in which law enforcement isn’t always helpful. (Ah, yes, our lovely history of putting victims into jail because “it’s safer there.”) Jane Doe isn’t white and her gender identity confounds heteronormative anti-trafficking conversations. She doesn’t fit people’s image of a victim of commercial sexual exploitation. So it’s safer to avoid terms like trafficking so as to not muddy the waters even though the water was muddy in the first place.

(This entry was first posted on June 19, 2014 at Medium under the title “Why Jane Doe doesn’t get to be a sex trafficking victim” as part of The Message.)

by zephoria at August 07, 2014 12:34 AM

August 06, 2014

MIMS 2011

Big Data and Small: Collaborations between ethnographers and data scientists

This article first appeared in Big Data and Society journal published by Sage and is licensed by the author under a Creative Commons Attribution license. [PDF]


In the past three years, Heather Ford—an ethnographer and now a PhD student—has worked on ad hoc collaborative projects around Wikipedia sources with two data scientists from Minnesota, Dave Musicant and Shilad Sen. In this essay, she talks about how the three met, how they worked together, and what they gained from the experience. Three themes became apparent through their collaboration: that data scientists and ethnographers have much in common, that their skills are complementary, and that discovering the data together rather than compartmentalizing research activities was key to their success.

In July 2011, at WikiSym in Mountain View, California, I met two computer scientists from Minnesota. I was working as an ethnographer for the non-profit technology company Ushahidi at the time, and I had worked with computer scientists before on tool building and design projects, but never on research. The three of us were introduced because we were all working on the subject of Wikipedia sources and citations.

We recently argued about who started the conversation. Dave Musicant, a computer scientist at Carleton College, said later that, although he loved doing interdisciplinary research, he was much too shy to have introduced himself. Shilad Sen is an Assistant Professor of Computer Science at Macalester College and had been working with Dave on a dataset of about 67 million source postings from about 3.5 million Wikipedia articles. In his usual generous manner, Shilad wrote later: “We had ground to a halt when you came to talk to us. We had done this Big Data analysis, but didn’t have any idea what we should do with the data. You saved us!”

In retrospect, the collaboration that followed involved a great deal of mutual “saving.” I was trying to build a portrait of how Wikipedians managed sources during breaking news events to inform Ushahidi’s software development projects, but I did not have the bigger picture of Wikipedia sources to guide new directions in the research. Dave and Shilad were looking at whether one could predict which sources would stay on Wikipedia longer than others in order to build software tools to suggest citations to Wikipedians, but they had little detailed insight into why sources were added or removed in different contexts.

Over the next two years, the three of us met on Skype every few months to share our findings and then to conduct new analyses, test out new theories about the data, and finally produce a paper entitled “Getting to the source” (Ford et al., 2013) for WikiSym in 2013. I visited the two in Minnesota more recently to discuss the possible future trajectories for research, but our collaboration has remained informal and ad hoc. Despite this (or perhaps in large part because of this), my collaboration with Dave and Shilad has been one of the most enjoyable, educational experiences for me as an early career researcher. This is perhaps partly due to the unique combination of personalities that happen to combine particularly well, but I also think that interdisciplinary research like this can yield very exciting results for researchers coming from very different epistemological and methodological vantage points if they remain open and creative about the process. Three observations are particularly noteworthy here: that data scientists and ethnographers have much in common, that our skills are complementary, and that discovering the data together rather than compartmentalizing research activities was key to our success.

Ethnographers and data scientists have much in common

Although at first glance Big Data and ethnography can be seen in opposition (after all, ethnographers have their roots in studies of societies far removed from the heavily mediated ones of today), there are actually some significant commonalities. Both recognize that what people actually do (rather than only what they say) is invaluable, and both require an immersion in data in order to understand their research subject. As Jenna Burrell (2012) writes for Ethnography Matters:

Ethnographers get at this the labor-intensive way, by hanging around and witnessing things first hand. Big data people do it a different way, by figuring out ways to capture actions in the moment, i.e. someone clicked on this link, set that preference, moved from this wireless access point to that one at a particular time.

Burrell argues that where there are differences is in the emphasis that ethnographers and data scientists place on what people do. Ethnographers, for example, do a lot of complementary work to connect apparent behavior to underlying meaning through in situ conversations or more formal interviews. Data scientists, on the other hand, tend to focus only on behavioral data traces.

If timed well, however, ethnographers and data scientists can come together at appropriate moments to collaborate on answers to common questions before moving on to wider (in the case of data science) or deeper (in the case of ethnography) research projects. In the case of the “Getting to the source” collaboration, the three of us shared a curiosity about sources and with Wikipedia practice more generally, and it was this shared curiosity that drove the project forward. I was interested in large-scale approaches to Wikipedia sources because I had been looking at Wikipedia’s policy on sources and was finding in the examples and interviews that practice around sourcing was very different from what was being recommended in the policies. I was curious about whether source choices were, in fact, contradictory to policies that preferred academic sources. To understand whether my cases were indicative of larger trends, I needed to get a handle on the entire corpus of data traces. Shilad and Dave were interested in the “stickiness” of sources, trying to understand why some sources stuck around more than others. Sourcing practice, for them, was therefore really important for understanding how to analyze and evaluate the data traces represented in the database. All of us recognized the benefit of sharing skills and knowledge that we had gained in our different areas. I needed to understand ways of analyzing the entire corpus, and Dave and Shilad needed to understand everyday Wikipedia practice.

It turned out that, in addition to common questions and the need for shared expertise, we also shared commonalities in our approach. I was pleasantly surprised when I started working with Dave and Shilad that all of us preferred an approach that was inductive (testing out theories about the data as we progressed), systematic (being sure to follow up leads and challenge our assumptions), and collaborative (sharing responsibilities equally and understanding the decisions that we were all making and their impact on the project as a whole). I started this collaboration with an idea that quantitative research was largely deductive and that quantitative researchers would feel they had little to gain from working with those who tend to take a more qualitative approach. Through working with Dave and Shilad, however, I learned that we had much more in common than not, and that collaboration could yield worthwhile results for both data scientists and ethnographers.

Our skills and experience are complementary

In the Wikipedia research arena, a few Big Data researchers have used interviewing, participant observation, and coding in addition to their large-scale analyses to explore research questions. Brian Keegan’s large-scale network analyses of traces through a system (Keegan et al., 2012) are an exemplar of Big Data research, for example, but Keegan also spent countless hours participating in the production of the class of Wikipedia articles that he was studying in order to understand the meaning of the traces that he was collecting. Keegan is, however, a rare example of an individual researcher who possesses the variety of skills necessary to answer some of the important questions of our age. More usual are the types of collaborations where researchers with a wide variety of skills and epistemologies work together to build rich perspectives on their research subjects and learn from one another in order to improve their skills and experience with methods with which they are unfamiliar.

In the case of the Wikipedia sources collaboration, Dave and Shilad had the necessary skills and resources to extract over 67 million source postings from about 3.5 million Wikipedia articles. Based on the interviews that I had done on ways in which Wikipedians chose and inscribed sources on the encyclopedia, I was able to contribute ideas about different ways of slicing the data in order to gain new insights. Dave and Shilad had access to sophisticated software and data processing tools for managing such a high volume of data, and I had the knowledge about Wikipedia practice that would inform some of the analyses that we chose to do on this data. After hearing from an expert interviewee that Wikipedians often discover their information using local sources but cite Western sources, for example, we were able to explore the diversity of sources in relation to their geographical provenance. By understanding this practice, we could also mention what was missing from the data that we had access to, namely, that citations did not necessarily represent what sources editors were using to find information, but rather what citations they believed others were more likely to respect. This small detail has significant implications for the conclusions that we draw about what sources and citations represent and the dynamics of collaboration on large peer production communities like Wikipedia. By discussing my findings with Dave and Shilad iteratively, we were able to come up with methods for operationalizing these hypotheses and developing different lenses for analyzing the data. Through this process, we recognized that our skills and experiences were highly complementary.

Discovering the data together is better than compartmentalizing activities

Where a large number of collaborative research activities fail is where tasks are divided up according to perceived skills and expertise of different types of research identities, rather than taking a more creative approach to research design. In this traditional view, ethnographers might be asked to do the interviews and manual coding where the Big Data analysts do the large-scale analyses with little collaboration and experience of these processes shared. The result is that there is no learning or sharing of skills: data scientists are seen merely as technicians who are able to manipulate the data and ethnographers as those who will “fill in” the context during write-up. If both researchers are to learn from the experience and stand on one another’s shoulders to produce high-quality results, it is important that researchers share some unfamiliar tasks, or that they are at least taken through the processes that resulted in particular data being produced.

Although Dave and Shilad could have asked me to do the manual coding for our sources project alone, we decided to divide the tasks up so that we all contributed to the development of the coding scheme, and coded individual results and checked the accuracy of one another’s coding. Although I led the development of a coding scheme, Dave and Shilad challenged me on the ways in which I was defining the scheme and both helped to manually code the random sample and to check my results. In this way, we all came out with a deeper understanding of the subject and of the ways in which our particular lens contributed to the shape of the research output. We also learned some important new skills. I learned how such large-scale analysis is done and about the choices that are made to achieve a particular result. Shilad, on the other hand, used the coding scheme that we developed together as an example in one of the method classes that he now teaches at Macalester College. We all extended ourselves through this project by sharing unfamiliar tasks and gaining a great deal more from this than we might have if we had kept to our traditional roles.

In summary, ethnographers have much to gain from analyzing large-scale data sources because they can provide a unique insight into how participants are interacting in complex media platforms in ways that complement observations in the field. Data scientists, in turn, can benefit from more qualitative insight into the implications of missing data, data incompleteness, and the social meanings attributed to data traces. Working together, ethnographers and data scientists can not only produce rigorous research but can also find ways of diversifying their skills as researchers. My experience with this project has given me new respect for quantitative research done well and has reiterated the fact that good research is good research whatever we call ourselves.

This article is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page.


  1. Burrell J (2012) The ethnographer’s complete guide to big data: Answers. Ethnography Matters (accessed 9 July 2014).
  2. Ford H, Sen S, Musicant DR, et al. (2013) Getting to the source: Where does Wikipedia get its information from? In: Proceedings of the 9th international symposium on open collaboration. New York, NY: ACM, pp. 9:1–9:10. doi:10.1145/2491055.2491064.
  3. Keegan B, Gergle D and Contractor N (2012) Do editors or articles drive collaboration? Multilevel statistical network analysis of Wikipedia coauthorship. In: Proceedings of the ACM 2012 conference on computer supported cooperative work. New York, NY: ACM, pp. 427–436. doi:10.1145/2145204.2145271.


by Heather Ford at August 06, 2014 06:44 PM

Full disclosure: Diary of an internet geography project #3

Reblogged from ‘Connectivity, Inclusivity and Inequality’

In this series of blog posts, we are documenting the process by which a group of computer and social scientists are working together on a project to understand the geography of Wikipedia citations. Our aim is not only to better understand how far Wikipedia has come in representing ‘the sum of all human knowledge’ but to do so in a way that lays bare the processes by which ‘big data’ is selected and visualized. In this post, I outline the way we initially thought about locating citations and Dave Musicant tells the story of how he has started to build a foundation for coding citation location at scale. It includes feats of superhuman effort including the posting of letters to a host of companies around the world (and you thought that data scientists sat in front of their computers all day!)

Many articles about places on Wikipedia include a list of citations and references linked to particular statements in the text of the article. Some of the smaller language Wikipedias have fewer citations than the English, Dutch or German Wikipedias, and some have very, very few, but the source of information about places can still act as an important signal of ‘how much information about a place comes from that place’.

When Dave, Shilad and I did our overview paper (‘Getting to the Source’) looking at citations on English Wikipedia, we manually looked up the whois data for a set of 500 randomly collected citations for articles across the encyclopedia (not just about places). We coded citations according to their top-level domain: if the domain was a country code top-level domain (such as ‘.za’), we coded it according to the country (South Africa), but if it used a generic top-level domain such as .com, we looked up the whois data and entered the country for the administrative contact (since the technical contact is often the domain registration company, which is frequently located in a different country). The results were interesting, but perhaps unsurprising. We found that the majority of publishers were from the US (at 56% of the sample), followed by the UK (at 13%) and then a long tail of countries including Australia, Germany, India, New Zealand, the Netherlands and France at either 2 or 3% of the sample.
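As a rough illustration of that coding rule, here is a minimal Python sketch. The mapping table is a tiny made-up subset of country code TLDs, and `whois_lookup` is a hypothetical caller-supplied function standing in for the manual whois lookup described above; neither comes from the original study.

```python
from urllib.parse import urlparse

# Tiny illustrative subset of ccTLD -> country mappings (not the full list).
CCTLD_TO_COUNTRY = {
    "za": "South Africa",
    "uk": "United Kingdom",
    "de": "Germany",
    "in": "India",
}

def code_citation(url, whois_lookup):
    """Return a country for a cited URL, using the ccTLD when there is one.

    whois_lookup is a hypothetical function that takes a hostname and
    returns the administrative contact's country, or None if unknown.
    """
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in CCTLD_TO_COUNTRY:
        return CCTLD_TO_COUNTRY[tld]
    # Generic TLDs such as .com carry no country signal, so fall back to whois.
    return whois_lookup(host)
```

The fallback is passed in as a function so that the expensive, rate-limited part of the job stays separate from the cheap TLD check.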


Geographic distribution of English Wikipedia sources, grouped by country and continent. Ref: ‘Getting to the Source: Where does Wikipedia get its information from?’ Ford, Musicant, Sen, Miller (2013).

This was useful to some extent, but we also knew that we needed to extend this to capture more citations and to do this across particular types of article in order for it to be more meaningful. We were beginning to understand that local citation practices (local in the sense of the type of article and the language edition) dictated particular citation norms and that we needed to look at particular types of article in order to better understand what was happening in the dataset. This is a common problem besetting many ‘big data’ projects when the scale is too large to get at meaningful answers. It is this deeper understanding that we’re aiming at with our Wikipedia geography of citations research project. Instead of just a random sample of English Wikipedia citations, we’re going to be looking up citation geography for millions of articles across many different languages, but only for articles about places. We’re also going to be complementing the quantitative analysis with some deep dive qualitative analysis into citation practice within articles about places. In the meantime, though, Dave has been working on the technical challenge of how to scale up location data for citations using the whois lookups as a starting point.

[hands over the conch to Dave…]

In order to try to capture the country associated with a particular citation, we thought that capturing information from whois databases might be instructive since every domain, when registered, has an administrative address which represents in at least some sense the location of the organization registering the domain. Though this information would not necessarily always tell us precisely where a cited source was located (when some website is merely hosting information produced elsewhere, for example), we felt like it would be a good place to start.

To that end, I set out to do an exhaustive database lookup by collecting the whois administrative country code associated with each English Wikipedia citation. For anyone reading this blog who is familiar with the structure of whois data, this would readily be recognized as exceedingly difficult to do without spending lots of time or money. However, these details were new to me, and it was a classic experience of me learning about something “the hard way.”

I soon realised how difficult it was going to be to obtain the data quickly. Whois data for a domain can be obtained from a whois server. This data is typically obtained interactively by running a whois client, which is most commonly either a command-line program or a whois client website. I found a Python library to make this easy if I already had the IP addresses I needed, and so I discovered, in initial benchmarking, that I could run about 1,000 IP-address-based whois queries an hour. That would make it exceedingly slow to look up the millions of citations in English Wikipedia, before even getting to other language versions. I later discovered that most whois servers limit the number of queries that you can make per day, and had I continued along this route, I undoubtedly would have been blocked from those servers for exceeding daily limits.
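To put that benchmark in perspective, a quick back-of-the-envelope calculation helps. The 1,000 queries an hour comes from the benchmarking above; the workload of one million lookups is only an assumed round number, since the post says "millions" without giving a figure.

```python
QUERIES_PER_HOUR = 1_000  # benchmark figure quoted in the post

def days_needed(num_lookups, per_hour=QUERIES_PER_HOUR):
    """Days of non-stop querying needed for num_lookups lookups."""
    return num_lookups / per_hour / 24

# Hypothetical workload of one million lookups.
print(round(days_needed(1_000_000), 1))  # 41.7
```

So every million citations would cost roughly six weeks of continuous querying, even before any daily limits kick in.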

The team chatted, and we found what seemed to be some good options for doing bulk whois results. We found web pages of the Regional Internet Registry (RIR) ARIN, which has a system whereby researchers are able to request access to their entire database after filling out some forms. Apart from the red tape (the forms had to be mailed in by postal mail), this sounded great. I then discovered that ARIN and the other RIRs make the entire dump of the IP addresses and country codes that they allocate available publicly, via FTP. ‘Perfect!’ I thought. I downloaded this data, and decided that since I was already looking up the IP addresses associated with the Wikipedia citations before doing the whois queries, I could then look up those IP addresses in the bulk data available from the RIRs instead.

Now that I had a feasible plan, I then proceeded to write more code to look up IP addresses for the domains in each citation. This was much faster, as domain-to-IP lookups are done locally, at our DNS server. I could now do approximately 600 lookups a minute to get IP addresses, and then an effectively instant lookup for country code on the data I obtained from the RIRs. It was then pointed out to me, however, that this approach was flawed because of content distribution networks (CDNs), such as Akamai. Many large- and medium-sized companies use CDNs to mirror their websites, and when you do a lookup on domain to get IP address, you get the IP address of the CDN, not of the original site. ‘Ouch!’ This approach would not work…
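For readers who want to see what that two-step lookup might look like, here is a minimal sketch, not the code actually used in the project. It assumes the RIR data is in the pipe-separated "delegated statistics" layout (`registry|cc|type|start|value|date|status`, where `value` is an address count for IPv4 records); the two sample lines are made up, using documentation-only address ranges.

```python
import bisect
import ipaddress
import socket

def resolve(domain):
    """Domain-to-IP lookup through the local resolver (CDNs will skew this)."""
    return socket.gethostbyname(domain)

def parse_delegations(lines):
    """Build a sorted list of (start, end, country) from delegated-stats lines."""
    ranges = []
    for line in lines:
        fields = line.strip().split("|")
        # Skip comments, the version header, summary lines and non-IPv4 records.
        if len(fields) < 7 or fields[2] != "ipv4" or fields[1] in ("", "*"):
            continue
        start = int(ipaddress.IPv4Address(fields[3]))
        count = int(fields[4])  # for ipv4 records this is an address count
        ranges.append((start, start + count - 1, fields[1]))
    ranges.sort()
    return ranges

def country_for_ip(ranges, ip):
    """Return the country code whose range contains ip, or None."""
    value = int(ipaddress.IPv4Address(ip))
    # Find the last range that starts at or before the address.
    i = bisect.bisect_right(ranges, (value, float("inf"), "")) - 1
    if i >= 0 and ranges[i][0] <= value <= ranges[i][1]:
        return ranges[i][2]
    return None

# Made-up sample records (documentation address ranges, not real allocations).
SAMPLE = [
    "arin|US|ipv4|192.0.2.0|256|20140101|assigned",
    "ripencc|PL|ipv4|198.51.100.0|256|20140101|allocated",
]
ranges = parse_delegations(SAMPLE)
print(country_for_ip(ranges, "192.0.2.17"))  # US
```

The country lookup itself is a binary search over an in-memory list, which is why it is effectively instant. The weak link is `resolve`, which returns whatever address the DNS hands back, CDN or not.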

I next considered going back to the full bulk datasets available from the RIRs. After filling out some forms, mailing them abroad, and filling out a variety of online support requests, I finally engaged in email conversations with some helpful folks at two of the RIRs who told me that they had no information on domains at all. The RIRs merely allocate ranges of IP addresses to domain registrars, and they are the ones who can actually map domain to IP. It turns out that the place to find the canonical IP address associated with a domain is precisely the same place as I would get the country code I wanted: the whois data.

Whois data isn’t centralized – not even in a few places. Every TLD essentially has its own canonical whois server, each one of which reports the data back in its own different natural-text format. Each one of those servers limits how much information you can get per day. When you issue a whois query yourself, at a particular whois server, it in turn passes the query along to other whois servers to get the right answer for you, which it passes back along.
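For the curious, the protocol itself is about as simple as it gets: open a TCP connection to port 43, send the query followed by a carriage return and line feed, and read free text until the server closes the connection. The sketch below assumes you already know which server to ask; the `server` argument is left to the caller, and nothing here handles referrals or rate limits.

```python
import socket

def build_query(domain):
    """A whois request is just the query text followed by CRLF."""
    return domain.strip().encode("ascii") + b"\r\n"

def whois_raw(domain, server, timeout=10.0):
    """Send one query to one whois server and return its free-text reply."""
    with socket.create_connection((server, 43), timeout=timeout) as conn:
        conn.sendall(build_query(domain))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

The hard part is everything around this: picking the right server for each TLD, staying under its daily limit, and making sense of whatever text comes back.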

There have been efforts to try to make this simpler. The software projects ruby-whois and phpwhois implement a large number of parsers designed to cope with the outputs from all the different whois servers, but you still need to be able to get the data from them without being limited. Commercial providers will provide you bulk lookups at a cost – they must query what they can at whatever speed they can, and archive the results. But they are quite expensive. Robowhois, one of the more economical bulk providers, asks for $750 for 500,000 lookups. Furthermore, there is no particularly good way to validate the accuracy or completeness of their mirrored databases.
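To get a feel for what that price means at scale, here is a small calculation. It assumes the quoted rate scales linearly with volume, which may not match how any provider actually bills, and the one-million workload is again just a round number for illustration.

```python
PRICE_USD = 750              # quoted price in the post
LOOKUPS_PER_PRICE = 500_000  # lookups covered by that price

def bulk_cost_usd(num_lookups):
    """Estimated cost, assuming the quoted rate scales linearly."""
    return num_lookups * PRICE_USD / LOOKUPS_PER_PRICE

print(bulk_cost_usd(1_000_000))  # 1500.0
```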

It was finally proposed that maybe we could do this ourselves by the use of parallel processing, using multiple IP addresses ourselves so as to not get rate limited. I began looking into that possibility but it was only then that I realized that many of the whois providers don’t ever really use country codes at all in the results of a whois query. At the time I’m writing this, none of the following queries return a country code:








So after all that, we’ve landed in the following place:

- Bulk querying whois databases is exceedingly time consuming or expensive, with the added risk of having our access to servers blocked.

- Even if the above problems were solved, many TLDs don’t provide country code information on a whois lookup, which would make doing an exhaustive lookup pointless because it would unbalance the whole endeavor towards those domains where we could get country information.

- I’m a lot more knowledgeable than I was about how whois works.

So, after a long series of efforts, I found myself dramatically better educated about how whois works; and in much better shape to understand why obtaining whois data for all of the English Wikipedia citations is so challenging.
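The imbalance problem in the second point can be made concrete with a small sketch. The field labels matched below are illustrative guesses rather than a survey of real server formats, and the sample text is made up; the idea is only to show how one might measure, per TLD, how often a country can be found at all before committing to an exhaustive lookup.

```python
import re
from collections import defaultdict

# Field labels vary by server; these two patterns are illustrative guesses.
COUNTRY_PATTERNS = [
    re.compile(r"^\s*admin(?:istrative)?\s+country\s*:\s*(\S+)", re.I | re.M),
    re.compile(r"^\s*country\s*:\s*(\S+)", re.I | re.M),
]

def extract_country(whois_text):
    """Return an upper-cased country value if one of the patterns matches."""
    for pattern in COUNTRY_PATTERNS:
        match = pattern.search(whois_text)
        if match:
            return match.group(1).upper()
    return None

def coverage_by_tld(results):
    """results: iterable of (tld, country_or_None) -> share per TLD with a country."""
    totals = defaultdict(int)
    found = defaultdict(int)
    for tld, country in results:
        totals[tld] += 1
        if country is not None:
            found[tld] += 1
    return {tld: found[tld] / totals[tld] for tld in totals}

# Made-up reply text, not output from any real server.
sample = "Domain Name: EXAMPLE.ORG\nAdmin Country: US\n"
print(extract_country(sample))  # US
```

If coverage is close to zero for some TLDs, any country totals built from the rest will be tilted towards the TLDs that do report a country.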


by Heather Ford at August 06, 2014 06:36 PM

Full disclosure: Diary of an internet geography project #2

Reblogged from ‘Connectivity, Inclusivity and Inequality’

In this series of blog posts, Heather Ford documents the process by which a group of computer and social scientists are working together on a project to understand the geography of Wikipedia citations. Their aim is not only to better understand how far Wikipedia has come toward representing ‘the sum of all human knowledge’ but to do so in a way that lays bare the processes by which ‘big data’ is selected and visualized. In this post, Heather discusses how the group are focusing their work on a series of exploratory research questions.

In last week’s call, we had a conversation about articulating the initial research questions that we’re trying to answer. At its simplest level, we decided that what we’re interested in is:

‘How much information about a place on Wikipedia comes from that place?’

In the English Wikipedia article about Guinea Bissau, for example, how many of the citations originate from organisations or authors in Guinea Bissau? In the Spanish Wikipedia article about Argentina, for example, what proportion of editors are from Argentina? Cumulatively, can we see any patterns among different language versions that indicate that some language versions contain more ‘local’ voices than others? We think that these are important questions because they point to the extent to which Wikipedia projects can be said to be a reflection of how people from a particular place see the world; they also point to the importance of particular countries in shaping information about certain places from outside their borders. We think it makes a difference to the content of Wikipedia that the US’s Central Intelligence Agency (CIA) is responsible for such a large proportion of the citations, for example.

Past research from Brendan Luyt and Tan (2010) is instructive here. In 2010, Luyt and Tan took a random sample of national history articles on Wikipedia (English) and found that 17% of the sources cited were government sites and that, of those 17%, four of the top five sites were US government sites including the US Department of State and the CIA World Fact Book. The authors write that this is problematic because ‘the nature of the institutions producing these documents makes it difficult for certain points of view to be included. Being part of the US government, the CIA World Fact Book, for example, is unlikely to include a reading of Guatemalan history that stresses the role of US intervention as an explanation for that country’s long and brutal civil war.’ (p719)

Instead of Luyt and Tan’s history articles, we’re looking at country articles and we’re zeroing in on citations and trying to ‘locate’ those citations in different ways. While we were talking on Skype, Shilad drew this really great diagram to show how we seem to be looking at this question of information geography:

[Image: Shilad’s diagram]

In this research, we seem to be looking at locating all three elements (the location of the article, the sources/citations and the editors) and then establishing the relationships between them i.e.

RQ1a What proportion of editors from a place edit articles about that place?

RQ1b What proportion of sources in an article about a place come from that place?

RQ1c What proportion of sources from particular places are added by editors from that place?
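Once each source and editor has been assigned a country, these questions reduce to simple proportions. The sketch below is only one possible reading of RQ1b and RQ1c; the country codes in the example are made up, and unlocated items (`None`) are left out of the denominator, which is itself a choice that would need defending.

```python
def share_from_place(countries, place):
    """Fraction of located items whose country equals place (None if none located)."""
    located = [c for c in countries if c is not None]
    if not located:
        return None
    return sum(1 for c in located if c == place) / len(located)

def local_sources_added_locally(pairs, place):
    """RQ1c: of the sources from place, the share added by editors from place.

    pairs is an iterable of (source_country, editor_country).
    """
    editors = [editor for source, editor in pairs if source == place]
    return share_from_place(editors, place)

# RQ1b on a made-up article about Kenya: one of four located sources is Kenyan.
print(share_from_place(["KE", "US", "US", "PL", None], "KE"))  # 0.25
```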

We started out by taking the address of the administrative contact contained in a source’s domain registration as the signal for the source’s location but we’ve come up against a number of issues as we’ve discussed the initial results. A question that seems to be a precursor to the questions above is how we define ‘location’ in the context of a citation contained within an article about a particular place. There are numerous signals that we might use to associate a citation with a particular place: the HQ of the publisher, for example, or the nationality of the author; the place in which the article/paper/book is set, or the place in which the publishers are located. An added complexity has to do with the fact that websites sometimes host content produced elsewhere. Are we using ‘author’ or ‘publisher’ when we attempt to locate a citation? If we attribute the source to the HQ of the website and not the actual text, are we still accurately portraying the location of the source?

In order to understand which signals to use in our large scale analysis, then, we’ve decided to try to get a better understanding of both the shape of these citations and the context in which those citations occur by looking more closely at a random sample of citations from articles about places and asking the questions:

RQ0a To what extent might signals like ‘administrative contact of the domain registrant’ or ‘country domain’ accurately reflect the location of authors of Wikipedia sources about places?

RQ0b What alternative signals might more accurately capture the locations of sources?

Already in my own initial analysis of the English Wikipedia article on Mombasa, I noticed that citations to some articles written by locals were hosted on domains registered in the US and Poland. There was also a citation to the Kenyan 2009 census, authored by the Kenya National Bureau of Statistics but hosted elsewhere, and a link to an article about Mombasa written by a Ugandan on a US-based travel blog. All this means that we are going to under-represent East Africans’ participation in the writing of this place-based article about Mombasa if we use signals like domain registration.

We can, of course, ‘solve’ each of these problems by removing hosting sites like WordPress from our analysis, but the concern is whether this will negatively affect the representation of efforts by those few in developing countries who are doing their best to produce local content on Wikipedia. Right now, though, we’re starting with the micro level instances and developing a deeper understanding that way, rather than the other way around. And that I really appreciate.

by Heather Ford at August 06, 2014 06:29 PM

July 30, 2014

Ph.D. student

Re: Homebrew Website Club: July 30, 2014

I'll try to make it tonight for Homebrew meeting. Maybe I can get "fragmentions" (ugh, terminology) or annotations on academic papers working beforehand.


P.S. While last time I RSVP'ed I worried that these irrelevant posts in my feed were needless, I ended up getting multiple emails with really valuable responses (about hiding certain types of posts and about the academic writing on the web project in general). So I'm persuaded not to worry urgently about hiding them from my index page or feed.

by at July 30, 2014 07:36 PM

Ph.D. alumna

Goodbye Avis, Hello Uber

Only two hours before the nightmare that would unfold, I was sitting with friends sharing my loyalties to travel programs. I had lost status on nearly everything when I got pregnant with my son (where’s parental leave??), forcing me to rethink my commitments. I told everyone about how I loved the fact that Avis had been so good to me, so willing to give me hybrids when they were available. I had been in an Avis car for 20 of the 28 days that month and I was sad that I didn’t have a hybrid in LA but the customer service rep was super apologetic and I understood that it was a perk, not a guarantee.

When I got into my car at 10PM that night, I discovered I had a flat tire. Exhausted and jetlagged, I called Roadside Assistance and braced myself to begin the process. I didn’t give it much thought given that I was 7 miles from LAX where it’d be easy to exchange a car. And it’s LA, land of cars, right? I had gotten stuck in much worse situations, situations without phone service. When I got the rep on the phone, we went through the process and I said that I didn’t feel safe driving significantly on a spare, especially not in LA. I asked how long for an exchange because we were so close. He said it’d be longer. I asked how long but he didn’t know; he said he’d text me when the order was placed. I figured go ahead and I can always call back and shift things. It was dark, I was falling asleep, and time passed.

An hour later, I still hadn’t heard anything. I called back, now much more frustrated. They told me that they still didn’t know. I pushed and pushed and they told me it’d probably take 4 hours. WTF? Are you serious!?!? How long for a spare to be changed, I asked? Another 90 minutes, they told me. They wanted me to wait until 12:30AM to get a spare tire on my car or until 3AM to get a replacement. I told them that this wasn’t safe; they asked if I was in a life-threatening emergency. No, it just wasn’t safe for me to sleep in my car in the middle of Los Angeles. I asked if I could just take a cab to the 24/7 LAX counter and hand over the keys. No, I couldn’t get a new car without giving up the old one and they wouldn’t receive the keys without the car. They reminded me that I was liable for the car. At one point, he recommended that I just leave the keys in the unlocked car. At this point, I knew the rep knew zero about the context I was in. Los Angeles. Late at night. In the dark. I was furious. Luckily, I have friends in Los Angeles. One is a late night owl and agreed to take the keys and do the exchange. I got driven to the hotel, angry as hell.

They texted us that they’d arrive at 4AM to pick up the car. They didn’t show up. At 9:30AM, I called back furious. They blamed the towing company and said another 30 minutes. Eventually they showed up at 11:30AM. Luckily, my friend was amazingly awesome and managed to make it work even though she worked and had to juggle. At 4PM, I called Avis to make sure they had the car. Nope. And they couldn’t close the account or look up the repair information. Roadside assistance told me to call customer service, customer service told me to call LAX rental directly, LAX rental sent me to his manager who went straight to voicemail. Not surprisingly, they didn’t return that phone call. I tweeted throughout and the only response that I got from the Avis rep was a polite note to say that they hoped everything worked out. I wrote back that it absolutely had not and got zero response. I wrote to the Avis customer service and the Avis FIRST email. No response. So much for being a valuable customer. Luckily I had done all of this through Amex Business Travel who was just awesome and leveraged their status to push Avis into taking care of it and giving me a refund.

I know lots of people have horrible customer service experiences with companies like Avis, but I’m still stunned by the acceptability of what unfolded. The way in which such treatment is considered acceptable, normative even. The absolute lack of accountability or recognition of how outright problematic that experience was. It all comes back down to markets and “choice,” as though the answer is simply for me to go to another company. Admittedly, I will walk away from Avis and my status now but it’s not simply because I think that a different company will be better. It’s because the entire experience soured me on the very social contract that I thought I had with Avis.

What if I was in a city where I didn’t have friends? What if I had been in a more remote setting (like I had been for 14 of the 20 days of rentals this month)? What if I had a plane to catch? I thought the whole promise of roadside assistance was that Avis would be there for me when things went haywire. Instead, they passed the buck at every turn, making it clear that they refused to take responsibility for their vendors. One of the phone reps eventually went off script and noted that some of the company policies are disturbing. But he was clearly resigned to it.

As customer service has become more automated, more mechanized, companies create distance between themselves and their customers. We aren’t people. We are simply a pool of possible money, valued based on our worth to the company. They do enough to keep us from going elsewhere if we are valuable, but otherwise do everything possible to not take responsibility. They don’t want us calling in so they pass the buck to keep their numbers and they stick to their scripts. The low-level employees have no power and they know darn straight that when we ask for their managers, we’ll never reach them. This is what Kafka feared and the reality of it is far more pervasive than we acknowledge in a market economy.

Old industries rage against new startups who are seeking to disrupt them, but what they don’t account for is the way in which customers are fed up with being beholden to the Milgram-esque practices of these large companies. When all goes well, working with big companies can be seamless. But when it doesn’t, you’re on your own. And that’s a terrifying risk to take. Cars break down, flights get delayed, hotels get oversold. The risks are more upfront with new disruptors but, above all else in peer economy stuff, you often get to interact with people. It’s not perfect – and goddess knows that there are incidents that are forcing the peer economy companies to develop better protections – but somehow, it feels better to know that you’ll be interacting with people, not automatons.

I rent cars for work travel mostly because I like listening to NPR when I’m moving around. I like being able to explore when I don’t know where to eat and this has historically made it easier. But I’m reassessing that logic. I never want to have a repeat of the hellish night that I went through this week. I don’t trust Avis to be there for me. I have a lot more faith in the imperfections of the network of Uber drivers than the coldness of the corporate giant. When they leave you stranded, they leave you *really* stranded. As for my non-urban car rentals, I need to figure out what’s next. I am very angry at Avis. Truly, overwhelmingly offended by how they’ve treated me this week. Also, scared. Scared of what happens the next time when the circumstances aren’t as functional.  But are any of the other companies any better? Do we really have market choice or is it a big ole farce?

by zephoria at July 30, 2014 06:56 PM

July 27, 2014

Ph.D. student

imperialist/sexist maps

I’ve been thinking about the qualities of maps — imperialist, humanitarian, democratizing — and the demographics of cartographers, neogeographers and Web mapping folks.1 Did you all see this article on the possibilities of sexism in street maps? I’m encouraged to see writing on the topic, but I see two important lessons to remember.

First, let’s use data, quantitative and qualitative, to investigate sexism and fairness in maps.

The FastCo article makes a specific claim that OpenStreetMap “may contain more strip clubs than day care centers”. Getting an exhaustive answer to that question would take some time, but it doesn’t take long to gather some initial data.2 Because OpenStreetMap is actually more an accessible database than it is a map, we can issue queries and get statistics. In fact, we have an awesome opportunity to ask and answer some of these questions of fairness in the map, such that it’s worth spending some time investigating.

Strip clubs are typically tagged as “amenity=stripclub”, a well-defined tag, of which there are 455 in OSM. Day care centers are a little more difficult to count (more on this later) because people use different tags for them: one might be “amenity=kindergarten”, which covers pre-school centers (what we often call “kindergarten” in the US) and services that look after young children but which aren’t educational. I count 124,197 of these in OSM, but I can’t quickly tell how many are pre-schools and how many are child care centers. A few years ago there was a proposal to formally start using amenity=childcare to refer to child care facilities, a proposal that was rejected by some voters who thought it overlapped too much with amenity=kindergarten. Nevertheless, OSM users can use whatever tags they want,3 and many of them are using amenity=childcare: there are 1,329 instances (triple the amenity=stripclub count, although that tag is more formally approved). There are 504 instances of the more obscure syntax of social_facility:for=child and :for=juvenile, although I suspect many of those are covering group homes, orphanages, community centers and various social work facilities for children.
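If you want to reproduce counts like these yourself, one route is the Overpass API, which accepts queries written in Overpass QL. The sketch below only builds the query text and does the arithmetic on the figures quoted above; it is my assumption about a convenient way to count, not necessarily how the numbers in this post were gathered, and sending the query to a server is left out.

```python
def count_query(key, value):
    """Overpass QL asking for a count of nodes, ways and relations with a tag."""
    return f'[out:json];nwr["{key}"="{value}"];out count;'

print(count_query("amenity", "stripclub"))

# Counts as quoted in the post (they will have changed since).
counts = {
    "amenity=stripclub": 455,
    "amenity=childcare": 1329,
    "amenity=kindergarten": 124197,
}
ratio = counts["amenity=childcare"] / counts["amenity=stripclub"]
print(round(ratio, 2))  # 2.92
```

Worldwide counts like these can be slow on a public server, so in practice you might restrict the query to a region first.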

(We could also search OSM by name to try to count child care centers and strip clubs. There seem to be roughly 1100 with “day care” in the name and 640 with “child care”, but those names are likely very English-specific and don’t provide for good comparisons with strip clubs, which I expect rarely include “strip club” in the venue name. I can find about 20 that use some variation of “gentlemen’s club”.)

But these numbers don’t discount the concern that the distribution of mapped venues may be skewed in a way that might be sexist in intent or impact. Rather, I suspect that statistics would actually bring the problem into greater relief. For example, if business license records show 100 or 1000 times as many child care centers as strip clubs in many jurisdictions and the OSM database only shows 5 times as many child care centers, that would be an important result. Rather than comparing only two numbers, we would do better to compare the proportions of different venues to some independent measure to see which are disproportionately present or missing. Side note: if you gather that data, give it to OSM volunteers so they can identify where or why that skew is happening.4 We would learn more by comparing other categories as well: while the stripclub/childcare example may be relevant (in particular because of the taxonomic question, see below), not all and not only women care about childcare services.
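That comparison against an independent measure can be written down as a single ratio of ratios. The numbers below are the hypothetical ones from the paragraph above (5 times as many child care centers in OSM, 100 times as many in business license records), not real data, and the function name is mine.

```python
def representation_ratio(osm_counts, reference_counts, a, b):
    """Compare the a:b ratio in OSM with the a:b ratio in a reference source.

    A result below 1 means a is under-represented in OSM relative to b.
    """
    osm_ratio = osm_counts[a] / osm_counts[b]
    ref_ratio = reference_counts[a] / reference_counts[b]
    return osm_ratio / ref_ratio

# Hypothetical jurisdiction: OSM shows 5x, license records show 100x.
osm = {"childcare": 500, "stripclub": 100}
licenses = {"childcare": 10_000, "stripclub": 100}
print(representation_ratio(osm, licenses, "childcare", "stripclub"))  # 0.05
```

A value of 0.05 would say that, relative to strip clubs, child care centers show up in the map at one twentieth of the rate the license records would lead you to expect.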

Beyond statistical counts of features, we can and should use qualitative methods to evaluate sexism in the map and in the community. For example, that rejected 2011 proposal for a recommended amenity=childcare tag revealed that the (mostly male) OSM editors may discount the need for a separate tag, to the detriment of users who would benefit from it. Because OSM conducts votes on these taxonomic questions, complete with explanations memorialized in wiki form, researchers and the public can review the debate, comment by comment. That proposal is also an interesting case because it reveals a linguistic difference. As I understand it, “kindergarten” is used differently in Germany (which many OSM editors call home), where a day care (Kindertagesstätte) might be more closely related. None of this is my discovery, so don’t take my word for it:

  • Dr. Monica Stephens gave a nice (short, I just watched it, it’s awesome, go watch it!) talk on exactly this topic in 2012 — you can watch the video online — comparing the sparse selection of childcare tags to the large diversity of bar/nightclub/swingerclub tags. See also her paper from 2013.5
  • The discussion page on the OSM wiki has a good “post-mortem” discussion of the childcare proposal that’s worth reading.
  • This blog post from April does some numerical analyses after the childcare tag controversy, and also tries to analyze the presence of commercial venues that might tend to bias towards one gender or another, with mixed results.

The second lesson to remember is that maps always reflect the perspectives of their creators; there is no present or future “objective” map.

“But unlike Google Maps, which rigorously chronicles every address, gas station, and shop on the ground, OpenStreetMap’s perspective on the world is skewed by its contributors.”

I don’t dispute the latter clause: OSM is absolutely skewed by its contributors. However, I don’t see that maps that don’t rely on crowdsourced data (whether it’s Google Maps or the USGS or any other) are, in contrast, objective in a way that OpenStreetMap could never be.

All maps are skewed by the selections of the humans who make them or, increasingly, the technology that humans build in order to make them. In fact, one might define a map as exactly the process of selecting some geographic data and leaving out all others. This American Life illustrates this point beautifully.6

Google Maps may have a relatively exhaustive accounting of commercial venues. Even in that incredibly narrow category, though, consider last week’s article on mistakes in the Google business directory from malicious or mistaken reports. I say “narrow” because what about the parts of our physical world that aren’t commercial venues or roads for automobiles? Here’s a quick list of some of the interesting feature types in OpenStreetMap that aren’t as easy to find in Google Maps.7

  • Car-sharing locations: as a user of the CityCarShare nonprofit, I’m pleased that many CityCarShare locations are available in the OSM database and are typically rendered on the basemap. Searching in my neighborhood, I find that Google Maps actually does let you find car-sharing locations, but maybe only for Zipcar?

  • Benches: OSM has the locations of 400,000 benches! (Mostly in Europe and some in the US and I bet this isn’t nearly exhaustive enough, but I love that it’s there.)

  • Mailboxes: While the USPS can let you search online for locations of those blue mailboxes, Google Maps only directs you to FedEx or Mailboxes Etc venues. In Oakland there aren’t many of these marked in OSM yet, but when I was traveling in Brussels, I thought it was pretty awesome to be able to pull up the exact location of the nearest red postbox. (161,982 in OSM.)

  • Fire Hydrants: Almost a quarter million of these in OSM. Maybe they’d be useful for Adopt-a-Hydrant websites, without a city having to import all the data themselves.

  • Wheelchair Accessibility: This one is a real challenge. It would be awesome if streets, sidewalks, businesses, toilets could all have metadata about their wheelchair accessibility, so that, for example, your navigation software could tell you how to get from point A to point B without ever directing you to take stairs, or cross a road where there isn’t a curb ramp. OSM has 600,000 tags with wheelchair accessibility metadata, but even that surely isn’t nearly enough. (OSM Wiki has a page on wheelchair routing and there are also some Google Maps projects for crowdsourcing that kind of data.)

  • Trees: OSM has 3.4 million individual trees mapped. Ha, awesome. I really want to map all the trees in our neighborhood in order to make a more beautiful and detailed print map of the area. (See also, the Urban Forest Map of all 88,000 trees on San Francisco's streets.)

Personally, I like to use OpenStreetMap for its detailed data on hiking trails, including gates and fences along the route. Others use OSM for bicycle routing and Google Maps also has a different mode for viewing bike lanes. But it should be clear that no single map contains everything, and certainly not everything in an objective way that doesn’t involve the perspectives of both the designer of the map and the contributor of the data. Even the distribution of categories themselves is a pragmatic, rather than essentialist, exercise.

It can be tempting, and perhaps more so now with maps that are more databases than static cartographic projections, to believe that a map can contain everything, such that claims of sexism could simply be refuted. Maybe if we just had more and more data, all the data, then the perspective of the cartographer would disappear as it became more and more precise, until the map itself contained everything in the territory. Indeed, digital maps have made it possible to represent different maps in different situations; to show different ownership of the same territory based on who’s looking at the map, for example. But mapping, like any data collection and analysis project, will always have perspectives. We can do better or worse at being aware of these perspectives and adjusting our practices to address disparities in the design of maps, but we shouldn’t imagine that one day there will be such an authoritative source that we can stop asking whether the map is sexist or how to make it less so.

Please forgive my verbose enthusiasm. Yes, of course, I’m super into maps, but I can’t help but think that these same lessons will arise in every data project we pursue.

Thanks for reading my ramblings,

P.S. And thanks to Brendan, Geoff, Julie and Zeina for helping to clarify some of this before I posted it publicly.

  1. In short, is my vague impression correct that mapping technology meetings are more disproportionately white even than other technology-focused communities? Is there greater representation of Brits and Americans? And if so, why, and what are the implications?

  2. A commenter on the FastCo site has already pointed this out in brief, but I’ll share my data with some links anyway.

  3. This point is extremely important — a case where the users/implementers can behave in ways that contradict the attempt to standardize. It’s a good check on what I think was a real mistake in not approving the tag. I’m not sure, however, if this voting affects the common renderings of the map, like at

  4. As one friend put it, and maybe this is the issue with many disparities in tech, the community is good-hearted but just naive.

  5. Monica Stephens, “Gender and the Geoweb: Divisions in the Production of User-Generated Cartographic Information,” GeoJournal 78, no. 6 (2013): 981–96,

  6. The whole episode is great, but the first few minutes of the prologue are enough, and lovely.

  7. This might seem like I’m poking fun or diminishing Google’s awesome map, but I really don’t mean it that way at all. Different maps work for different uses, and while I think there’s often a healthy competition among Web mappers, these are just examples.

by at July 27, 2014 10:04 PM

July 22, 2014

Ph.D. student

responding to @npdoty on ethics in engineering

Nick Doty wrote a thorough and thoughtful response to my earlier post about the Facebook research ethics problem, correcting me on a number of points.

In particular, he highlights how academic ethicists like Floridi and Nissenbaum have an impact on industry regulation. It’s worth reading for sure.

Nick writes from an interesting position. Since he works for the W3C himself, he is closer to the policy decision makers on these issues. I think this, as well as his general erudition, gives him a richer view of how these debates play out. Contrast that with the debate that happens for public consumption, which is naturally less focused.

In trying to understand scholarly work on these ethical and political issues of technology, I’m struck by how differences in where writers and audiences are coming from lead to communication breakdown. The recent blast of popular scholarship about ‘algorithms’, for example, is bewildering to me. I had the privilege of learning what an algorithm was fairly early. I learned about quicksort in an introductory computing class in college. While certainly an intellectual accomplishment, quicksort is politically quite neutral.

What’s odd is how certain contemporary popular scholarship seeks to introduce an unknowing audience to algorithms not via their basic properties–their pseudocode form, their construction from more fundamental computing components, their running time–but for their application in select and controversial contexts. Is this good for the public education? Or is this capitalizing on the vagaries of public attention?
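For readers who have never seen one written down, here is roughly what those basic properties look like in practice. This is only a minimal sketch of quicksort in Python, not the version from any particular textbook or class; the function name and the choice of the middle element as pivot are my own assumptions for illustration.

```python
def quicksort(items):
    """Return a new sorted list (a simple, not-in-place quicksort sketch)."""
    # Base case: a list of zero or one elements is already sorted.
    if len(items) <= 1:
        return list(items)

    # Assumption for illustration: use the middle element as the pivot.
    pivot = items[len(items) // 2]

    # Partition into elements smaller than, equal to, and larger than the pivot.
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]

    # Recursively sort the two outer partitions and stitch the result together.
    return quicksort(smaller) + equal + quicksort(larger)
```

The construction is just comparison, partitioning and recursion. The running time is proportional to n log n on average and to n squared in the worst case, and none of that depends on what the numbers being sorted happen to mean.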

My democratic values are being sorely tested by the quality of public discussion on matters like these. I’m becoming more content with the fact that in reality, these decisions are made by self-selecting experts in inaccessible conversations. To hope otherwise is to downplay the genuine complexity of technical problems and the amount of effort it takes to truly understand them.

But if I can sit complacently with my own expertise, this does not seem like a political solution. The FCC’s willingness to accept public comment, which normally does not elicit the response of a mass action, was just tested by Net Neutrality activists. I see from the linked article that other media-related requests for comments were similarly swamped.

The crux, I believe, is the self-referential nature of the problem–that the mechanics of information flow among the public are both what’s at stake (in terms of technical outcomes) and what drives the process to begin with, when it’s democratic. This is a recipe for a chaotic process. Perhaps there are no attractors or steady states.

Following Rash’s analysis of Habermas and Luhmann’s disagreement as to the fate of complex social systems, we’ve got at least two possible outcomes for how these debates play out. On the one hand, rationality may prevail. Genuine interlocutors, given enough time and with shared standards of discourse, can arrive at consensus about how to act–or, what technical standards to adopt, or what patches to accept into foundational software. On the other hand, the layering of those standards on top of each other, and the reaction of users to them as they build layers of communication on top of the technical edifice, can create further irreducible complexity. With that complexity come further ethical dilemmas and political tensions.

A good desideratum for a communications system that is used to determine the technicalities of its own design is that its algorithms should intelligently manage the complexity of arriving at normative consensus.

by Sebastian Benthall at July 22, 2014 08:55 PM

This is truly unfortunate

This is truly unfortunate.

In one sense, this indicates that the majority of Facebook users have no idea how computers work. Do these Facebook users also know that their use of a word processor, or their web browser, or their Amazon purchases, are all mediated by algorithms? Do they understand that what computers do–more or less all they ever do–is mechanically execute algorithms?

I guess not. This is a massive failure of the education system. Perhaps we should start mandating that students read this well-written HowStuffWorks article, “What is a computer algorithm?” That would clear up a lot of confusion, I think.

by Sebastian Benthall at July 22, 2014 07:09 PM

July 16, 2014

Ph.D. student

Re: Homebrew Website Club: July 16, 2014

Sure, I'm in for tonight's Homebrew meeting. I don't have a ton of progress to report, but I've been working on academic writing that can be simultaneously posted to the Web (where it can be easily shared and annotated) and also formatted to PDF via LaTeX. Oh, and I'm excited to chat with people about OpenPGP for indieweb purposes.

P.S. While I like the idea of posting RSVPs via my website, it seems a little silly to include them in RSS feeds or the blog index page like any other blog entry. What are people doing to filter/distinguish different kinds of posts?

by at July 16, 2014 10:04 PM

July 15, 2014

Ph.D. student

Re: The Facebook ethics problem is a political problem

Thanks for writing. I’m inspired to write a couple of comments in response.

First, are academic, professional ethicists as irrelevant as you suggest? (Okay, that’s a bit of a strawman framing, but I hope the response is still useful.)

Floridi is an interesting example. I’m also a fan of his work (although I know him more for his philosophy of information work — I like to cite him on semantics/ontologies, for example (Floridi 2013) — rather than his ethics work), but he’s also in the news this week because he’s on Google’s panel of experts (their “Advisory Council”) for determining the right balance in processing right-to-be-forgotten requests.

Also, I think we see the influence of these ethical and other academic theories play out in practical terms, even if they’re not cited in a direct company response to a particular scandal. For example, you can see Nissenbaum’s contextual integrity theory of privacy (Nissenbaum 2004) throughout the Federal Trade Commission’s 2012 report on privacy (FTC 2012), even though she’s never explicitly cited. And, forgive me for rooting for the home team here, but I think Ken and Deirdre’s research of “on the ground” privacy (Bamberger and Mulligan 2011) played a pretty prominent role in the White House framework for consumer privacy (“Consumer Data Privacy in a Networked World: A Framework for Protecting Privacy and Promoting Innovation in the Global Digital Economy” 2012).

But second, I’m even more excited about your conclusion. Yes, decentralize!, despite the skepticism about it (Narayanan et al. 2012). But more than just repeating that rallying cry (which I still think needs repeating – I’m trying to support #indieweb as my part of that), is the form of the problem.

I think a really cool project that everybody who cares about this should be working on is designing and executing on building that alternative to Facebook. That’s a huge project. But just think about how great it would be if we could figure out how to fund, design, build, and market that. These are the big questions for political praxis in the 21st century.

Politics in our century might be defined by engineering challenges, and if that’s true, then it emphasizes even more how coding is not just entangled with, but is itself a question of, policy and values. I think our institution could dedicate a group blog just to different takes on that.


Some references:

Bamberger, KA, and DK Mulligan. 2011. “Privacy on the Books and on the Ground.” Stanford Law Review.

“Consumer Data Privacy in a Networked World: A Framework for Protecting Privacy and Promoting Innovation in the Global Digital Economy.” 2012. White House, Washington, DC.

Floridi, Luciano. 2013. “Semantic Conceptions of Information.” In Stanford Encyclopedia of Philosophy, edited by Edward N. Zalta, Spring 201.

FTC. 2012. “Protecting Consumer Privacy in an Era of Rapid Change Recommendations for Businesses and Policymakers.” Technical report March. Federal Trade Commission.

Narayanan, Arvind, Vincent Toubiana, Helen Nissenbaum, and Dan Boneh. 2012. “A Critical Look at Decentralized Personal Data Architectures.”

Nissenbaum, Helen. 2004. “Privacy as Contextual Integrity.” Washington Law Review 79 (1): 101–139.

by at July 15, 2014 06:02 AM

July 14, 2014

MIMS 2011

Full disclosure: Diary of an internet geography project #1

Reblogged from ‘Connectivity, Inclusivity and Inequality’

OII research fellow, Mark Graham and DPhil student, Heather Ford (both part of the CII group) are working with a group of computer scientists including Brent Hecht, Dave Musicant and Shilad Sen to understand how far Wikipedia has come toward representing ‘the sum of all human knowledge’. As part of the project, they will be making explicit the methods that they use to analyse millions of data records from Wikipedia articles about places in many languages. The hope is that by experimenting with a reflexive method of doing a multidisciplinary ‘big data’ project, others might be able to use this as a model for pursuing their own analyses in the future. This is the first post in a series in which Heather outlines the team’s plans and processes.

It was a beautiful day in Oxford and we wanted to show our Minnesotan friends some Harry Pottery architecture, so Mark and I sat on a bench in the Balliol gardens while we called Brent, Dave and Shilad who are based in Minnesota for our inaugural Skype meeting. I have worked with Dave and Shilad on a paper about Wikipedia sources in the past, and Mark and Brent know each other because they both have produced great work on Wikipedia geography, but we’ve never all worked together as a team. A recent grant from Oxford University’s John Fell Fund provided impetus for the five of us to get together and pool efforts in a short, multidisciplinary project that will hopefully catalyse further collaborative work in the future.

In last week’s meeting, we talked about our goals and timing and how we wanted to work as a team. Since we’re a multidisciplinary group who really value both quantitative and qualitative approaches, we thought that it might make sense to present our goals as consisting of two main strands: 1) to investigate the origins of knowledge about places on Wikipedia in many languages, and 2) to do this in a way that is both transparent and reflexive.

In her eight ‘big tent’ criteria for excellent qualitative research, Sarah Tracy (2010, PDF) includes self-reflexivity and transparency in her conception of researcher ‘sincerity’. Tracy believes that sincerity is a valuable quality that relates to researchers being earnest and vulnerable in their work and ‘considering not only their own needs but also those of their participants, readers, coauthors and potential audiences’. Despite the focus on qualitative research in Tracy’s influential paper, we think that practicing transparency and reflexivity can have enormous benefits for quantitative research as well, but one of the challenges is finding ways to pursue transparency and reflexivity as a team rather than as individual researchers.


Tracy writes that transparency is about researchers being honest about the research process.

‘Transparent research is marked by disclosure of the study’s challenges and unexpected twists and turns and revelation of the ways research foci transformed over time.’

She writes that, in practice, transparency requires a formal audit trail of all research decisions and activities. For this project, we’ve set up a series of Google docs folders for our meeting agendas, minutes, Skype calls, screenshots of our video call as well as any related spreadsheets and analyses produced during the week. After each session, I clean up the meeting minutes that we’ve co-produced on the Google doc while we’re talking, and write a more narrative account about what we did and what we learned beneath that.

Although we’re co-editing these documents as a team, it’s important to note that, as the documenter of the process, it’s my perspective that is foregrounded and I have to be really mindful of this as I reflect on what happened. Our team meetings are occasions for discussion of the week’s activities, challenges and revelations which I try to document as accurately as possible, but I will probably also need to conduct interviews with individual members of the team further along in the process in order to capture individual responses to the project and the process that aren’t necessarily accommodated in the weekly meetings.


According to Tracy, self-reflexivity involves ‘honesty and authenticity with one’s self, one’s research and one’s audience’. Apart from the focus on interrogating our own biases as researchers, reflexivity is about being frank about our strengths and weaknesses, and, importantly, about examining our impact on the scene and asking for feedback from participants.

Soliciting feedback from participants is something quite rare in quantitative research but we believe that gaining input from Wikipedians and other stakeholders can be extremely valuable for improving the rigor of our results and for providing insight into the humans behind the data.

As an example, a few years ago when I was at a Wikimedia Kenya meetup, I asked what editors thought about Mark Graham’s Swahili Wikipedia maps. One respondent was immediately able to explain the concentration of geolocated articles from Turkey because he knew the editor who was known as a specialist of Turkey geography stubs. Suddenly the map took on a more human form — a reflection of the relationships between real people trying to represent their world. More recently, a Swahili Wikipedian contacted Mark about the same maps and engaged him in a conversation about how they could be made better. Inspired by these engagements, we want to really encourage those conversations and invite people to comment on our process as it evolves. To do this, we’ll be blogging about the progress of the project and inviting particular groups of stakeholders to provide comments and questions. We’ll then discuss those comments and questions in our weekly meetings and try to respond to as many of them as possible in thinking about how we move the analysis forward.

In conclusion, transparency and reflexivity are two really important aspects of researcher sincerity. The challenge with this project is trying to put this into practice in a quantitative rather than qualitative project, a project driven by a team rather than an individual researcher. Potential risks are that I inaccurately report on what we’re doing, or expose something about our process that is considered inappropriate. What I’m hoping is that we can mark these entries clearly as my initial, necessarily incomplete reflections on our process and that this can feed into the team’s reflections going forward. Knowing the researchers in the team and having worked with all of them in the past, my goal is to reflect the ways in which they bring what Tracy values in ‘sincere’ researchers: the empathy, kindness, self-awareness and self-deprecation that I know all of these team members display in their daily work.

by Heather Ford at July 14, 2014 03:10 PM

July 09, 2014

Ph.D. alumna

The Cost of Contemporary Policing: A Review of Alice Goffman’s ‘On the Run’

Growing up in Lancaster, Pennsylvania in the 80s and 90s, I had a pretty strong sense of fear and hatred for cops. I got to witness corruption and intimidation first hand, and I despised the hypocritical nature of the “PoPo.” As a teen, I worked at Subway. Whenever I had a late shift, I could rely on cops coming by. About half of them were decent. They’d order politely and, as if recognizing the fear in my body, would try to make small talk to suggest that we were on even ground in this context. And they’d actually pay their bills. The other half were a different matter. Especially when they came in in pairs. They’d yell at me, demean me, sexualize me. More importantly, I could depend on the fact that they would not pay for their food and threaten me if I tried to get them to pony up. On the job, I got one free sandwich per shift. If I was lucky, and it was only one cop, I could cover it by not eating dinner. For each additional cop, I would be docked an hour’s pay. There were nights where I had to fork over my entire paycheck.

I had it easy. Around me, I saw much worse. A girl at a neighboring school was gang raped by a group of cops after her arrest for a crime it turned out she didn’t commit but which was committed by a friend of her first cop rapist. Men that I knew got beaten up when they had a run-in. The law wasn’t about justice; it was about power and I knew to stay clear. The funny thing is that I always assumed that this was because “old” people were messed up. And cops were old people. This notion got shattered when I went back for a friend’s high school reunion. Some of his classmates had become police officers and so they decided to do a series of busts that day to provide drugs to the revelers. Much to my horror, some of the very people that I grew up with became corrupt cops. I had to accept that it wasn’t just “old” people; it was “my” people.

I did not grow up poor, although we definitely struggled. We always had food on the table and the rent got paid, but my mother worked two jobs and was always exhausted to the bones. Of course, we were white and living in a nice part of town so I knew my experiences were pretty privileged from the get-go. Most of my close friends who got arrested were arrested for hacking and drug-related offenses. Only those of color were arrested for more serious crimes. I knew straight up that my white, blonde self wasn’t going to be targeted which meant that I just needed to keep my nose clean. But in practice, that meant dumping OD’ed friends off at the steps of the hospital and driving away rather than walking through the front door.

As I aged and began researching teens, my attitude towards law enforcement became more complex. I met police officers who were far more interested in making the world a better place than those who I encountered as a kid. At the same time, I met countless youth whose run-ins were far worse than anything that I ever experienced. I knew that certain aspects of policing were far darker than I got to see first hand, but I didn’t really have the right conceptual frame for understanding what was at play with many of the teens that I met.

And then I read Alice Goffman’s On the Run.

This book has forced me to really contend with all of my mixed and complicated feelings towards law enforcement, while providing a deeper context for my own fieldwork with teens. More than anything, this book has shone a spotlight on exactly what’s at stake in our racist and classist policing practices. She brilliantly deciphers the cultural logic of black men’s relationship with law enforcement, allowing outsiders to better understand why black communities respond the way they do. In doing so, she challenges most people’s assumptions about policing and inequality in America.

Alice Goffman’s ‘On the Run’

For the better part of her undergraduate and graduate school years, Alice Goffman embedded herself in a poor black neighborhood of Philadelphia, in a community where young men are bound to run into the law and end up jailed. What began as fieldwork for a class paper turned into an undergraduate thesis and then grew into a dissertation which resulted in her first book, published by University of Chicago, called On the Run: Fugitive Life in an American City. This book examines the dynamics of a group of boys — and the people around them — as they encounter law enforcement and become part of the system. She lived alongside them, participated in their community, and bore witness to their experiences. She lived through arrests, raids, and murders. She saw it all and the account she offers doesn’t pull punches.

While I’ve seen police intimidation and corruption, the detail with which Goffman documents the practices of policing in the community in which she studied is both eloquent and harrowing. Through her writing, you can see what she saw, gaining insight into a dynamic to which few privileged people can bear witness. What’s most striking about Goffman’s accounting is the empathy with which she approaches the community. It is a true ethnographic account, in every sense. But, at the same time, it is so accessible and delightful that I want the world to read it.

Although most Americans realize that black men are overrepresented in US jails, most people don’t realize just how bad it is. As Goffman notes in her prologue, 1 in every 107 people in the adult population is currently in jail while 3% of the adult population is under correctional supervision. Not only are 37% of those in prison black, but 60% of black men who didn’t finish high school will go to prison by their mid-30s. We’ve built a prison-industrial complex and most of our prison reform laws have only made life worse for poor blacks.

The incentive structures around policing are disgusting and, with the onset of predictive policing, getting worse. As Goffman shows, officers have to hit their numbers and they’re free to use many abusive practices to get there. Although some law enforcement officers have a strong moral compass, many have no qualms about asserting their authority in the most vicious and abusive ways imaginable. The fear that they produce in poor communities doesn’t increase lawful behavior; it undermines the very trust in authority that is necessary to a healthy democracy.

The most eye-opening chapter in Goffman’s book is her accounting of what women experience as they are forced into snitching on the men in their communities. All too often, their houses are raided and they are threatened with violence, arrest, eviction, and the loss of children. Their homes are torn apart, their money is taken, and they are constantly surveilled. Police use phone records to “prove” that their boyfriends are cheating on them or offer up witnesses who suggest that the men in their lives aren’t really looking out for them. While she describes how important loyalty is in these communities, she also details just how law enforcement actively destroys the fabric of these communities through intimidation and force. Under immense pressure, most everyone breaks. It’s a modern day instantiation of antebellum slavery practices. If you tear apart a community, authority has power.

For all of the abuse and intimidation faced by those targeted by policing practices, it delights me to see the acts of creative resistance that many of Goffman’s informants undertake. Consider, for example, the realities of banking in poor communities. Most poor folks have no access to traditional banks to store their money and keeping cash on them is tricky. Not only might they be robbed by someone in the community, but they can rely on the fact that any police officer who frisks them will take whatever cash is found. So where should they store money for safe keeping?

When you bail someone out of jail and they show up for their court dates, you can get your bail money back. But why not just leave it at the court for safe keeping? You have up to six months to recover it and it’s often safer there than anywhere else. In her analysis, Goffman offers practices like these as well as other innovative ways poor people use the unjust system to their advantage.

Seeing Police Through the Eyes of Teens

Reading Goffman’s book also allowed me to better understand the teens that I encountered through my research. Doing fieldwork with working class and poor youth of color was both the highlight of my study and the hardest to fully grok. I have countless fieldnotes about teens’ recounted problems with cops, their struggles to stay out of trouble, and the violence that they witnessed all around them. I knew the stats. I knew that many of the teens that I met would probably end up in jail, if they hadn’t already had a run-in with the law. But I didn’t really get it.

Perhaps the hardest interview I had was with a young man who had just gotten out of jail and was in a halfway house. When he was a small boy, his mom got sick of his dad and so asked him to rat out his dad when the cops showed up. He obliged and his father was sent to jail. His mom then moved him and his younger brother across the country. By the time he was a teenager, his mom would call the cops on him and his brother whenever she wanted some peace and quiet. He eventually ran away and was always looking for a place to stay. His brother made a different decision — he found older white men who would “take care of him.” The teen I met was disgusted by his brother’s activities and thought that these men were gross so one day, he planted drugs on one of the guys’ cars and called the cops on him. And so the cycle continues.

In order to better understand human trafficking, I began talking to commercially exploited youth. Here, I also witnessed some pretty horrible dynamics. Teens who were arrested for prostitution “to keep them safe,” not to mention the threats and rapes that many young people engaged in sex work encountered from the very same law enforcement officers who were theoretically there to protect them. All too often, teens told me that their abusive “boyfriends” were much better than the abusive State apparatus (and their fathers). And based on what I saw, this was a fair assessment. And so I continue to struggle with policy discussions that center on empowering law enforcement. Sure, I had met some law enforcement folks in this work that were really working to end commercial sexual abuse of minors. And I want to see law enforcement serve a healthy enforcing role. But every youth I met feared the cops far more than they feared their abusers. And I still struggle to make sense of the right path forward.

Although the teens that I met often recounted their negative encounters with police, I never fully understood the underlying dynamics that shaped what they were telling me. What I was studying theoretically had nothing to do with teens’ relationship with the law and so this data was simply context. Context I was curious about, but not context that I got to observe properly. I knew that there was a lot more going on. A lot that I didn’t see. Enough to make me concerned about how law enforcement shapes the lives of working class and poor youth, but not enough to enable me to do anything about it.

What Goffman taught me was to appreciate the way in which the teens that I met were forced into a game of survival that was far more extreme than what I imagined. They are trying to game a system that is systematically unfair, that leaves them completely disempowered, and that teaches them to trust no one. For most poor populations, authority isn’t just corrupt — it’s outright abusive. Why then should we expect marginalized populations to play within a system that is out to get them?

As Ta-Nehisi Coates eloquently explained in “The Case for Reparations,” we may speak of a post-racial society where we no longer engage in racist activities, but the on-the-ground realities are much more systemically destructive. The costs of our historical racism and the damage done by slavery are woven into the fabric of our society. “It is as though we have run up a credit-card bill and, having pledged to charge no more, remain befuddled that the balance does not disappear. The effects of that balance, interest accruing daily, are all around us.”

We cannot expect the most marginalized people in American society to simply start trusting authority when authority continues to actively fragment their communities in an abusive assertion of power. It is both unfair and unreasonable to expect poor folks to work within a system that was designed to oppress them. If we want change, we need to better understand what’s at stake.

Goffman’s On the Run offers a brilliant account of what poor black people who are targeted by policing face on a daily basis. And how they learn to live in a society where their every move is surveilled. It is a phenomenal and eye-opening book, full of beauty and sorrow. Without a doubt, it’s one of the best books I’ve read in a long time. It makes very clear just how much we need policing reform in this country.

Understanding the cultural logic underpinning poor black men’s relationship with the law is essential for all who care about equality in this country. Law enforcement has its role in society, but, as with any system of power, it must always be checked. This book is a significant check to power, making visible some of the most invisible mechanisms of racism and inequality that exist today.

(Photo by Pavel P.)

(This entry was first posted on June 9, 2014 at Medium under the title “The Cost of Contemporary Policing” as part of The Message.)

by zephoria at July 09, 2014 06:21 PM

Ph.D. student

The Facebook ethics problem is a political problem

So much has been said about the Facebook emotion contagion experiment. Perhaps everything has been said.

The problem with everything having been said is that by and large people’s ethical stances seem predetermined by their habitus.

By which I mean: most people don’t really care. People who care about what happens on the Internet care about it in whatever way is determined by their professional orientation on that matter. Obviously, some groups of people benefit from there being fewer socially imposed ethical restrictions on data scientific practice, either in an industrial or academic context. Others benefit from imposing those ethical restrictions, or cultivating public outrage on the matter.

If this is an ethical issue, what system of ethics are we prepared to use to evaluate it?

You could make an argument from, say, a utilitarian perspective, or a deontological perspective, or even a virtue ethics standpoint. Those are classic moves.

But nobody will listen to what a professionalized academic ethicist will say on the matter. If there’s anybody who does rigorous work on this, it’s probably somebody like Luciano Floridi. His work is great, in my opinion. But I haven’t found any other academics who work in, say, policy that embrace his thinking. I’d love to be proven wrong.

But since Floridi does serious work on information ethics, that’s mainly an inconvenience to pundits. Instead we get heat, not light.

If this process resolves into anything like policy change–either governmental or internally at Facebook–it will be because of a process of agonistic politics. “Agonistic” here means fraught with conflicted interests. It may be redundant to modify ‘politics’ with ‘agonistic’ but it makes the point that the moves being made are strategic actions, aimed at gain for one’s person or group, more than they are communicative ones, aimed at consensus.

Because e.g. Facebook keeps public discussion fragmented through its EdgeRank algorithm, which even in its well-documented public version is full of apparent political consequences and flaws, there is no way for conversation within the Facebook platform to result in consensus. It is not, as has been observed by others, a public. In a trivial sense, it’s not a public because the data isn’t public. The data is (sort of) private. That’s not a bad thing. It just means that Facebook shouldn’t be where you go to develop a political consensus that could legitimize power.

Twitter is a little better for this, because it’s actually public. Facebook has zero reason to care about the public consensus of people on Twitter though, because those people won’t organize a consumer boycott of Facebook, because they can only reach people that use Twitter.

Facebook is a great–perhaps the greatest–example of what Habermas calls the steering media. “Steering,” because it’s how powerful entities steer public opinion. For Habermas, the steering media control language and therefore culture. When ‘mass’ media control language, citizens no longer use language to form collective will.

For individualized ‘social’ media that is arranged into filter bubbles through relevance algorithms, language is similarly controlled. But rather than having just a single commanding voice, you have the opportunity for every voice to be expressed at once. Through homophily effects in network formation, what you’d expect to see are very intense clusters of extreme cultures that see themselves as ‘normal’ and don’t interact outside of their bubble.

The irony is that the critical left, who should be making these sorts of observations, is itself a bubble within this system of bubbles. Since critical leftism is enacted in commercialized social media which evolves around it, it becomes recuperated in the Situationist sense. Critical outrage is tapped for advertising revenue, which spurs more critical outrage.

The dependence of contemporary criticality on commercial social media for its own diffusion means that, ironically, none of them are able to just quit Facebook like everyone else who has figured out how much Facebook sucks.

It’s not a secret that decentralized communication systems are the solution to this sort of thing. Stanford’s Liberation Tech group captures this ideology rather well. There’s a lot of good work on censorship-resistant systems, distributed messaging systems, etc. For people who are citizens in the free world, many of these alternative communication platforms where we are spared from algorithmic control are very old. Some people still use IRC for chat. I’m a huge fan of mailing lists, myself. Email is the original on-line social media, and one’s inbox is one’s domain. Everyone who is posting their stuff to Facebook could be posting to a WordPress blog. WordPress, by the way, has a lovely user interface these days and keeps adding “social” features like “liking” and “following”. This goes largely unnoticed, which is too bad, because Automattic, the company that runs WordPress, is really not evil at all.

So there are plenty of solutions to Facebook being manipulative and bad for democracy. Those solutions involve getting people off of Facebook and onto alternative platforms. That’s what a consumer boycott is. That’s how you get companies to stop doing bad stuff, if you don’t have regulatory power.

Obviously the real problem is that we don’t have a less politically problematic technology that does everything we want Facebook to do only not the bad stuff. There are a lot of unsolved technical problems in getting that to work. I think I wrote a social media think piece about this once.

I think a really cool project that everybody who cares about this should be working on is designing and executing on building that alternative to Facebook. That’s a huge project. But just think about how great it would be if we could figure out how to fund, design, build, and market that. These are the big questions for political praxis in the 21st century.

by Sebastian Benthall at July 09, 2014 04:35 AM

July 08, 2014

Ph.D. student

Theorizing the Web and SciPy conferences compared

I’ve just been through two days of tutorials at SciPy 2014–that stands for Scientific Python (the programming language). The last conference I went to was Theorizing the Web 2014. I wonder if I’m the first person to ever go to both conferences. Since I see my purpose in grad school as being a bridge node, I think it’s worthwhile to write something comparing the two.

Theorizing the Web was held in a “gorgeous warehouse space” in Williamsburg, the neighborhood of Brooklyn, New York that was full of hipsters ten years ago and now is full of baby carriages but still has gorgeous warehouse spaces and loft apartments. The warehouse spaces are actually gallery spaces that only look like warehouses from the outside. On the inside of the one where TtW was held, whole rooms with rounded interior corners were painted white, perhaps for a photo shoot. To call it a “warehouse” is to appeal to the blue collar and industrial origins that Brooklyn gentrifiers appeal to in order to distinguish themselves from the elites in Manhattan. During my visit to New York for the conference, I crashed on a friend’s air mattress in the Brooklyn neighborhood I had been gentrifying just a few years earlier. The speakers included empirical scientific researchers, but these were not the focus of the event. Rather, the emphasis was on theorizing in a way that is accessible to the public. The most anticipated speaker was a porn actress. Others were artists or writers of one sort or another. One was a sex worker who then wrote a book. Others were professors of sociology and communications. Another was a Buzzfeed editor.

SciPy is taking place in the AT&T Education and Conference Center in Austin, Texas, near the UT Austin campus. I’m writing from the adjoining hotel. The conference rooms we are using are in the basement; they seat many in comfortable mesh rolling chairs on tiers so everybody can see the dual projector screens. The attendees are primarily scientists who do computationally intensive work. One is a former marine biologist who now does bioinformatics mainly. Another team does robotics. Another does image processing on electron microscope images of chromosomes. They are not trying to be accessible to the public. What they are trying to teach is hard enough to get across to others with similar expertise. It is a small community trying to enlarge itself by teaching others its skills.

At Theorizing the Web, the rare technologist spoke up to talk about the dangers of drones. In the same panel, it was pointed out how the people designing medical supply drones for use in foreign conflict zones were considering coloring them white, not black, to make them less intimidating. The implication was that drone designers are racist.

It’s true that the vast majority of attendees of the conference are white and male. To some extent, this is generational. Both tutorials I attended today–including the one on software for modeling multi-body dynamics, useful for designing things like walking robots–were interracial and taught by guys around my age. The audience has some older folks. These are not necessarily academics, but may be industry types or engineers whose firms are paying them to attend to train on cutting edge technology.

The afterparty on the first night of Theorizing the Web was in a dive bar in Williamsburg. Brooklyn’s Williamsburg has dive bars the same way Virginia’s Williamsburg has a colonial village–they are a cherished part of its cultural heritage. But the venue was alienating for some. One woman from abroad confided to me that she was intimidated by how cool the bar felt. It was my duty as an American and a former New Yorker to explain that Williamsburg stopped being cool a long time ago.

I’m an introvert and am initially uneasy in basically any social setting. Tonight’s SciPy afterparty was in the downtown office of Enthought, in the Bank of America building. Enthought’s digs are on the 21st floor, with spacious personal offices and lots of whiteboards which display serious use. As an open source product/consulting/training company, it appears to be doing quite well. I imagine really cool people would find it rather banal.

I don’t think it’s overstating things to say that Theorizing the Web serves mainly those skeptical of the scientific project. Knowledge is conceived of as a threat to the known. One panelist at TtW described the problem of “explainer” sites–web sites whose purpose is to explain things that are going on to people who don’t understand them–when they try to translate cultural phenomena that they don’t understand. It was argued that even in cases where these cultural events are public, to capture that content and provide an interpretation or narration around it can be exploitative. Later, Kate Crawford, a very distinguished scholar on civic media, spoke to a rapt audience about the “conjoint anxieties” of Big Data. The anxieties of the watched are matched by the anxieties of the watchmen–like the NSA and, more implicitly, Facebook–who must always seek out more data in order to know things. The implication is that their political or economic agenda is due to a psychological complex–damning if true. In a brilliant rhetorical move that I didn’t quite follow, she tied this in to normcore, which I’m pretty sure is an Internet meme about a fake “fashion” trend in New York. Young people in New York go gaga for irony like this. For some reason earlier this year hipsters ironically wearing unstylish clothing became notable again.

I once met somebody from L.A. who told me their opinion of Brooklyn was that all nerds gathered in one place and thought they could decide what cool was just by saying so. At the time I had only recently moved to Berkeley and was still adjusting. Now I realize how parochial that zeitgeist is, however much I may still identify with it some.

Back in Austin, I have interesting conversations with folks at the SciPy party. One conversation is with two social scientists (demographic observation: one man, one woman) from New York that work on statistical analysis of violent crime in service to the city. They talk about the difficulty of remaining detached from their research subjects, who are eager to assist with the research somehow, though this would violate the statistical rigor of their study. Since they are doing policy research, objectivity is important. They are painfully aware of the limitations of their methods and the implications this has on those their work serves.

Later, I’m sitting alone when I’m joined by an electrical engineer turned programmer. He’s from Tennessee. We talk shop for a bit but the conversation quickly turns philosophical–about the experience of doing certain kinds of science, the role of rationality in human ethics, whether religion is an evolved human impulse and whether that matters. We are joined by a bioinformatics researcher from Paris. She tells us later that she has an applied math/machine learning background.

The problem in her field, she explains, is that for rare diseases it is very hard to find genetic causes because there isn’t enough data to do significant inference. Genomic data is very highly dimensional–thousands of genes–and for some diseases there may be fewer than fifty cases to study. Machine learning researchers are doing their best to figure out ways for researchers to incorporate “prior knowledge”–theoretical understanding from beyond the data available–to improve their conclusions.
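
To make “prior knowledge” concrete for myself, here is a toy sketch in Python. It is not her method, only the textbook normal-normal update for a single effect size with a known noise variance. The five measurements, the prior variances, and the function name posterior_mean are all made up for illustration.

```python
from statistics import fmean


def posterior_mean(samples, prior_mean, prior_var, noise_var):
    """Posterior mean of a Gaussian effect size under a Gaussian prior.

    Assumes the noise variance is known; every number used below is invented.
    """
    n = len(samples)
    prior_precision = 1.0 / prior_var
    data_precision = n / noise_var
    weighted = prior_precision * prior_mean + data_precision * fmean(samples)
    return weighted / (prior_precision + data_precision)


# Hypothetical measurements of one gene's effect from only five cases.
few_cases = [2.1, -0.4, 1.7, 0.3, 1.3]

# A sceptical prior ("most genes have no effect") versus a nearly flat one.
informed = posterior_mean(few_cases, prior_mean=0.0, prior_var=0.25, noise_var=1.0)
vague = posterior_mean(few_cases, prior_mean=0.0, prior_var=100.0, noise_var=1.0)

print(f"sample mean:         {fmean(few_cases):.2f}")
print(f"with vague prior:    {vague:.2f}")
print(f"with informed prior: {informed:.2f}")
```

With so few cases the informed prior pulls the estimate well toward zero, while the vague prior leaves it close to the raw sample mean. With more data the two would converge, which is the sense in which prior knowledge matters most when cases are scarce.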

Over meals the past couple days I’ve been checking Twitter, where a lot of the intellectuals who organize Theorizing the Web or are otherwise prominent in that community are active. One extended conversation is about the relative failure of the open source movement to produce compelling consumer products. My theory is that this has to do with business models and the difficulty of coming up with upfront capital investment. But emotionally my response to that question is that it is misplaced: consumer products are trivial. Who cares?

Today, folks on Twitter are getting excited about using Adorno’s concept of the culture industry to critique Facebook’s emotional contagion experiment and other media manipulation. I find this both encouraging–it’s about time the Theorizing the Web community learned to embrace Frankfurt School thought–and baffling, because I believe they are misreading Adorno. The culture industry is that sector of the economy that produces cultural products, like Hollywood and television production companies. On the Internet, the culture industry is Buzzfeed, the Atlantic, and to a lesser extent (though this is surely masked by its own ideology) The New Inquiry. My honest opinion for a long time has been that the brand of “anticapitalist” criticality indulged in on-line is a politically impotent form of entertainment equivalent to the soap opera. A concept more appropriate for understanding Facebook’s role in controlling access to news and the formation of culture is Habermas’ idea of steering media.

He gets into this in Theory of Communicative Action, vol. 2, which is underrated in America probably due to its heaviness.

by Sebastian Benthall at July 08, 2014 05:33 AM

July 06, 2014

Ph.D. student

economic theory and intellectual property

I’ve started reading Piketty’s Capital. His introduction begins with an overview of the history of economic theory, starting with Ricardo and Marx.

Both these early theorists predicted the concentration of wealth into the hands of the owners of factors of production that are not labor. For Ricardo, land owners extract rents and dominate the economy. For Marx, capitalists–owners of private capital–accumulate capital and dominate the economy.

Since those of us with an eye on the tech sector are aware of a concentration of wealth in the hands of the owners of intellectual property, it’s a good question what kind of economic theory ought to apply to those cases.

In one sense, intellectual property is a kind of capital. It is a factor of production that is made through human labor.

On the other hand, we talk about ideas being ‘discovered’ like land is discovered, and we imagine that intellectual property can in principle be ‘shared’ like a ‘commons’. If we see intellectual property as a position in a space of ideas, it is not hard to think of it like land.

Like land, a piece of intellectual property is unique and gains in value due to further improvements–applications or innovations–built upon it. In a world where intellectual property ownership never expires and isn’t shared, you can imagine that whoever holds some critical early work in some field could extract rents in perpetuity. Owning a patent would be like owning a land estate.

Like capital, intellectual property is produced by workers and often owned by those investing in the workers with pre-existing capital. The produced capital is then owned by the initiating capitalist, and accumulates.

Open source software is an important exception to this pattern. This kind of intellectual property is unalienated from those that produce it.

by Sebastian Benthall at July 06, 2014 06:38 PM

Ph.D. student

notary digital?

Recently I had the honor of swearing, and having notarized, an affidavit of bona fide marriage for a good friend as part of an immigration application. Speaking with another friend who had done the same for a friend of hers, she remarked that it was such a basic and important thing to do, that even if she did nothing else this year it would have been an accomplishment. And the formal, official process of notarization was interesting enough itself that I spent some time looking into how to become one.

Notary Public

Becoming a notary is a strange process. By its nature, it's an extremely regulated field: state law specifies exactly what a notary must do, what training they must have, what level of verification is needed for different notarized documents, exactly how much a notary may charge for each service, how the notary may advertise itself, etc. That is, you become a notary public, not just a notary. Presumably this is in part because other legal and commercial processes depend on notarization of certain kinds.

Given all those regulations, if the notary errs or forgets when conducting her duties, the law provides penalties. Forgot to thumbprint someone when you notarized their affidavit? That's $2500. Forgot to inform the Secretary of State when you moved to a new apartment? $500. Screw up the process for identifying an individual in a way that screws up someone else's business? They can sue you for damages. In short, if you're a notary, you need to buy notary errors and omissions insurance, at least $50 for four years. Also, the State wants to be sure that you can pay if you become a rogue notary who violates all these rules. As a result, as soon as you become a notary you're required to execute a bond of $15,000 with your county. In short, you pay a certified bond organization maybe $50 for the bond; if the State thinks you screwed up, they get the money directly from the bondsman and then the bondsman comes and gets the money from you.

Notary Digital?

But mostly I'm curious about this just because I've been thinking about the idea of a digital notary. (This is not to be confused with completing notary public activities with webcam verification instead of in-person, which appears to be illegal in most states, and not what I'm offering.)

That is, it seems like there are some operations we do in our digital, electronic lives these days that could benefit from some in-person verification. Those operations might otherwise just be cumbersome or awkward, but if we have an existing structure — of people who advertise themselves as carefully completing these verification operations in person — maybe that would actually work well, even with our online personas. These thoughts are, charmingly I hope, inchoate and I would appreciate your thoughts about them.

Backup / Escrow

Some really important digital files you want to back up in a secure, offline way, where you're guaranteed to be able to get them back. (Say: Bitcoin wallets; financial records; passwords, certificate revocations, private keys.) You meet with the digital notary; she confirms who you are, who can have access to the files, whether you want them encrypted in a way that she can't access them, how and when to get them back to you (offline-only, online with certain verifications, etc.). You pay her a fee then and a fee at the time if you ever need to retrieve them.

Alternatives: online "cold storage" services; university IT escrow services (not sure if this is common, but Chicago provides it for faculty and staff); bank safety deposit boxes with USB keys in them; online backup you really hope is secure.
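
As a rough sketch of one piece of this, assuming the files arrive already encrypted: the notary and the client could each keep a SHA-256 fingerprint of the deposit, so that either side can later check that what comes back is what went in. The function names and the sample bytes are hypothetical; this is an illustration in Python, not a complete escrow protocol.

```python
import hashlib
import hmac


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the deposited bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def unchanged(data: bytes, recorded: str) -> bool:
    """Check returned bytes against the fingerprint recorded at deposit."""
    return hmac.compare_digest(fingerprint(data), recorded)


# At deposit: the client hands over an (already encrypted) blob,
# and both sides keep the fingerprint as a receipt.
deposit = b"hypothetical encrypted wallet backup"
receipt = fingerprint(deposit)

# At retrieval: the client re-computes the fingerprint on what comes back.
print(unchanged(deposit, receipt))
print(unchanged(deposit + b"tampered", receipt))
```

Because the fingerprint is computed over the encrypted bytes, the notary would not need to be able to read the files in order to vouch that they were returned intact.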

Verification and Certification

You can go to a digital notary to get some digital confirmation that you are who you say you are online. The digital notary can give you a certificate to use that has your legal name and her signature (complete with precise verification steps) that you can use to sign electronic documents or sign/encrypt email. Sure, anyone can sign your OpenPGP key and confirm your identity, but the notary can help you set it up and give you a trusted verification (based on her well-known reputation and connection to the Web of Trust and other notaries).

And, traditional to the notary, she can sign a jurat. That is, you can swear an affidavit of some statement and she can verify that it was really you saying exactly what you said, but do so in a way that can be automatically and remotely verified.

Alternatives: key-signing parties; certificate authorities (some do this for free, others require a fee, or require a fee if it's not just personal use); creating your own key and participating in the Web of Trust in order to establish some reputation.


While we hope to see an increase in the thanatosensitivity (oh man, I've been waiting for an excuse to use that term again; here are all my bookmarks related to the topic) of online services — like Google's Inactive Account Manager — after we die, it's likely that our online accounts will become defunct and difficult for our next-of-kin to access. It would be useful to give someone instructions for what we want done with our accounts and data after death; that person will likely have to securely maintain passwords and keys and be able to verify, offline, our identities. Pay your digital notary a fee and she can execute certain actions (deleting some data, revealing some passwords to whichever family members you chose, disabling social media accounts) after your death, after verifying it using not just inactivity, but also confirmation with government or family.

Alternatives: a lawyer who understands technology well enough to execute these digital terms of your will just as they do your regular will and testament. (Does anyone know the current state of the art for lawyers who know how to handle these things?)


And actually what might be most valuable about digital notary services is that she can explain to you how these digital verifications work. That is, not only can a digital notary provide digital execution with in-person verification, she can provide the basic capability, explain how it works and then conduct it. Another advantage of in-person meetings: you can seek individualized counsel, not just formalistic execution of tasks.

It would be nice if information technology had a profession with a fiduciary responsibility to its clients; the implications of digital work are increasingly important to us but remain hard for non-experts to understand, much less control. Just as we expect with our doctors and our lawyers, we should be able to ask technological experts for advice and services that are in our own best, and varied, interests. Related, it would be useful if the law reflected that relationship and provided liability but also confidentiality, for such transactions. That latter part will take a little while (the law is slow to change, as we know), but a description of the profession and some common ethical guidelines of its own could help.

A Shingle?

As an experiment, I offer you all and our friends the services described above — escrow of files/keys; authentication, encryption and certification of messages; execution of a digital will and testament — at a nominal $2 fee per service.

Sincerely yours,


P.S. Did you know that payment of fees is one factor used to determine that a privileged client-attorney relationship has been established?

by at July 06, 2014 04:41 AM

July 03, 2014

Ph.D. student

Preparing for SciPy 2014

I’ve been instructed to focus my attention on mid-level concepts rather than grand theory as I begin my empirical

This is difficult for me, as I tend to oscillate between thinking very big and thinking very narrowly. This is an occupational hazard of a developer. Technical minutiae accumulate into something durable and powerful. To sustain one’s motivation one has to be able to envision one’s tiny tasks (correcting the spelling of some word in a program) stepping towards a larger project.

I’m working in my comfort zone. I’ve got my software project open on GitHub and I’m preparing to present my preliminary results at SciPy 2014 next week. A colleague and mentor I met with today told me it’s not a conference for people marking up career points. It’s a conference for people to meet each other, get an update on how their community is doing as a whole, and to learn new skills from each other.

It’s been a few years since I’ve been to a developer conference. In my past career I went to FOSS4G, the open source geospatial conference, a number of times. In 2008, the conference was in South Africa. I didn’t know anybody, so I blogged about it, and got chastised for being too divisive. I wasn’t being sensitive to the delicate balance between the open source developer geospatial community and their greatest proprietary coopetitor, ESRI. I was being an ideologue at a time when the open source model was, in that industry, just at its inflection point and becoming mainstream. Obviously I didn’t understand the subtlety of the relationships, business and personal, threaded through the conference.

Later I attended FOSS4G in 2010 to pitch the project my team had recently launched, GeoNode. It was a very exciting time for me personally. I was very personally invested in the project, and I was so proud of my team and myself for pulling through on the beta release. In retrospect, building a system for serving spatial data modeled on a content management system seems like a no-brainer. Today there are plenty of data management startups and services out there, some industrial, some academic. But at the time we were ahead of the curve, thanks largely to the vision of Chris Holmes, who at the time was the wunderkind visionary president of OpenGeo.

Cholmes always envisioned OpenGeo turning into an anti-capitalist organization, a hacker coop with as much transparency as it could handle. If only it could get its business model right. It was incubating in a pre-crash bubble that thinned out over time. I was very into the politics of the organization when I joined it, but over time I became more cynical and embraced the economic logic I was being taught by the mature entrepreneurs who had been attracted to OpenGeo’s promise and standing in the geospatial world. While trying to wrap my head around managing developers, clients, and the budget around GeoNode, I began to see why businesses are the way they are, and how open source plays out in the industrial organization of the tech industry as a whole.

GeoNode, the project, remains a success. There is glory to that, though in retrospect I can claim little of it. I made many big mistakes and the success of the project has always been due to the very intelligent team working on it, as well as its institutional positioning.

I left OpenGeo because I wanted to be a scientist. I had spent four years there, and had found my way onto a project where we were building data plumbing for disaster reduction scientists and the military. OpenGeo had become a victim of its own success and outgrown its non-profit incubator, buckling under the weight of the demand for its services. I had deferred enrollment at Berkeley for a year to see GeoNode through to a place where it couldn’t get canned. My last major act was to raise funding for a v1.1 release that fixed the show-stopping bugs in the v1.0 version.

OpenGeo is now Boundless, a for-profit company. It’s better that way. It’s still doing revolutionary work.

I’ve been under the radar in the open source world for the three years I’ve been in grad school. But as I begin this dissertation work, I feel myself coming back to it. My research questions, in one framing, are about software ecosystem sustainability and management. I’m drawing from my experience participating in and growing open source communities and am trying to operationalize my intuitions from that work. At Berkeley I’ve discovered the scientific Python community, which I feel at home with since I learned about how to do open source from the inimitable Whit Morris, a Pythonista of the Plone cohort, among others.

After immersing myself in academia, I’m excited to get back into the open source development world. Some of the most intelligent and genuine people I’ve ever met work in that space. Like the sciences, it is a community of very smart and creative people with the privilege to pursue opportunity but with goals that go beyond narrow commercial interests. But it’s also in many ways a more richly collaborative and constructive community than the academic world. It’s not a prestige economy, where people are rewarded with scarce attention and even scarcer titles. It’s a constructive economy, where there is always room to contribute usefully, and to be recognized even in a small way for that contribution.

I’m going to introduce my research on the SciPy communities themselves. In the wake of the backlash against Facebook’s “manipulative” data science research, I’m relieved to be studying a community that has from the beginning wanted to be open about its processes. My hope is that my data scientific work will be a contribution to, not an exploitation of, the community I’m studying. It’s an exciting opportunity that I’ve been preparing for for a long time.

by Sebastian Benthall at July 03, 2014 11:18 PM

July 01, 2014

Ph.D. alumna

What does the Facebook experiment teach us?

I’m intrigued by the reaction that has unfolded around the Facebook “emotion contagion” study. (If you aren’t familiar with this, read this primer.) As others have pointed out, the practice of A/B testing content is quite common. And Facebook has a long history of experimenting on how it can influence people’s attitudes and practices, even in the realm of research. An earlier study showed that Facebook decisions could shape voters’ practices. But why is it that *this* study has sparked a firestorm?

In asking people about this, I’ve been given two dominant reasons:

  1. People’s emotional well-being is sacred.
  2. Research is different than marketing practices.

I don’t find either of these responses satisfying.

The Consequences of Facebook’s Experiment

Facebook’s research team is not truly independent of product. They have a license to do research and publish it, provided that it contributes to the positive development of the company. If Facebook knew that this research would spark the negative PR backlash, they never would’ve allowed it to go forward or be published. I can only imagine the ugliness of the fight inside the company now, but I’m confident that PR is demanding silence from researchers.

I do believe that the research was intended to be helpful to Facebook. So what was the intended positive contribution of this study? I get the sense from Adam Kramer’s comments that the goal was to determine if content sentiment could affect people’s emotional response after being on Facebook. In other words, given that Facebook wants to keep people on Facebook, if people came away from Facebook feeling sadder, presumably they’d not want to come back to Facebook again. Thus, it’s in Facebook’s better interest to leave people feeling happier. And this study suggests that the sentiment of the content influences this. This suggests that one applied take-away for product is to downplay negative content. Presumably this is better for users and better for Facebook.

We can debate all day long as to whether or not this is what that study actually shows, but let’s work with this for a second. Let’s say that pre-study Facebook showed 1 negative post for every 3 positive and now, because of this study, Facebook shows 1 negative post for every 10 positive ones. If that’s the case, was the one week treatment worth the outcome for longer term content exposure? Who gets to make that decision?

Folks keep talking about all of the potential harm that could’ve happened by the study – the possibility of suicides, the mental health consequences. But what about the potential harm of negative content on Facebook more generally? Even if we believe that there were subtle negative costs to those who received the treatment, the ongoing costs of negative content on Facebook every week other than that 1 week experiment must be more costly. How then do we account for positive benefits to users if Facebook increased positive treatments en masse as a result of this study? Of course, the problem is that Facebook is a black box. We don’t know what they did with this study. The only thing we know is what is published in PNAS and that ain’t much.

Of course, if Facebook did make the content that users see more positive, should we simply be happy? What would it mean that you’re more likely to see announcements from your friends when they are celebrating a new child or a fun night on the town, but less likely to see their posts when they’re offering depressive missives or angsting over a relationship in shambles? If Alice is happier when she is oblivious to Bob’s pain because Facebook chooses to keep that from her, are we willing to sacrifice Bob’s need for support and validation? This is a hard ethical choice at the crux of any decision of what content to show when you’re making choices. And the reality is that Facebook is making these choices every day without oversight, transparency, or informed consent.

Algorithmic Manipulation of Attention and Emotions

Facebook actively alters the content you see. Most people focus on the practice of marketing, but most of what Facebook’s algorithms do involves curating content to provide you with what they think you want to see. Facebook algorithmically determines which of your friends’ posts you see. They don’t do this for marketing reasons. They do this because they want you to want to come back to the site day after day. They want you to be happy. They don’t want you to be overwhelmed. Their everyday algorithms are meant to manipulate your emotions. What factors go into this? We don’t know.

Facebook is not alone in algorithmically predicting what content you wish to see. Any recommendation system or curatorial system is prioritizing some content over others. But let’s compare what we glean from this study with standard practice. Most sites, from major news media to social media, have some algorithm that shows you the content that people click on the most. This is what drives media entities to produce listicles, flashy headlines, and car crash news stories. What do you think garners more traffic – a detailed analysis of what’s happening in Syria or 29 pictures of the cutest members of the animal kingdom? Part of what media learned long ago is that fear and salacious gossip sell papers. 4chan taught us that grotesque imagery and cute kittens work too. What this means online is that stories about child abductions, dangerous islands filled with snakes, and celebrity sex tape scandals are often the most clicked on, retweeted, favorited, etc. So an entire industry has emerged to produce crappy click bait content under the banner of “news.”

Guess what? When people are surrounded by fear-mongering news media, they get anxious. They fear the wrong things. Moral panics emerge. And yet, we as a society believe that it’s totally acceptable for news media – and its click bait brethren – to manipulate people’s emotions through the headlines they produce and the content they cover. And we generally accept that algorithmic curators are perfectly well within their right to prioritize that heavily clicked content over others, regardless of the psychological toll on individuals or the society. What makes their practice different? (Other than the fact that the media wouldn’t hold itself accountable for its own manipulative practices…)

Somehow, shrugging our shoulders and saying that we promoted content because it was popular is acceptable because those actors don’t voice that their intention is to manipulate your emotions so that you keep viewing their reporting and advertisements. And it’s also acceptable to manipulate people for advertising because that’s just business. But when researchers admit that they’re trying to learn if they can manipulate people’s emotions, they’re shunned. What this suggests is that the practice is acceptable, but admitting the intention and being transparent about the process is not.

But Research is Different!!

As this debate has unfolded, whenever people point out that these business practices are commonplace, folks respond by highlighting that research or science is different. What unfolds is a high-browed notion about the purity of research and its exclusive claims on ethical standards.

Do I think that we need to have a serious conversation about informed consent? Absolutely. Do I think that we need to have a serious conversation about the ethical decisions companies make with user data? Absolutely. But I do not believe that this conversation should ever apply just to that which is categorized under “research.” Nor do I believe that academe is necessarily providing a golden standard.

Academe has many problems that need to be accounted for. Researchers are incentivized to figure out how to get through IRBs rather than to think critically and collectively about the ethics of their research protocols. IRBs are incentivized to protect the university rather than truly work out an ethical framework for these issues. Journals relish corporate datasets even when replicability is impossible. And for that matter, even in a post-paper era, journals have ridiculous word count limits that demotivate researchers from spelling out all of the gory details of their methods. But there are also broader structural issues. Academe is so stupidly competitive and peer review is so much of a game that researchers have little incentive to share their studies-in-progress with their peers for true feedback and critique. And the status games of academe reward those who get access to private coffers of data while prompting those who don’t to chastise those who do. And there’s generally no incentive for corporates to play nice with researchers unless it helps their prestige, hiring opportunities, or product.

IRBs are an abysmal mechanism for actually accounting for ethics in research. By and large, they’re structured to make certain that the university will not be liable. Ethics aren’t a checklist. Nor are they a universal. Navigating ethics involves a process of working through the benefits and costs of a research act and making a conscientious decision about how to move forward. Reasonable people differ on what they think is ethical. And disciplines have different standards for how to navigate ethics. But we’ve trained an entire generation of scholars that ethics equals “that which gets past the IRB” which is a travesty. We need researchers to systematically think about how their practices alter the world in ways that benefit and harm people. We need ethics to not just be tacked on, but to be an integral part of how *everyone* thinks about what they study, build, and do.

There’s a lot of research that has serious consequences on the people who are part of the study. I think about the work that some of my colleagues do with child victims of sexual abuse. Getting children to talk about these awful experiences can be quite psychologically taxing. Yet, better understanding what they experienced has huge benefits for society. So we make our trade-offs and we do research that can have consequences. But what warms my heart is how my colleagues work hard to help those children by providing counseling immediately following the interview (and, in some cases, follow-up counseling). They think long and hard about each question they ask, and how they go about asking it. And yet most IRBs wouldn’t let them do this work because no university wants to touch anything that involves kids and sexual abuse. Doing research involves trade-offs and finding an ethical path forward requires effort and risk.

It’s far too easy to say “informed consent” and then not take responsibility for the costs of the research process, just as it’s far too easy to point to an IRB as proof of ethical thought. For any study that involves manipulation – common in economics, psychology, and other social science disciplines – people are only so informed about what they’re getting themselves into. You may think that you know what you’re consenting to, but do you? And then there are studies like discrimination audit studies in which we purposefully don’t inform people that they’re part of a study. So what are the right trade-offs? When is it OK to eschew consent altogether? What does it mean to truly be informed? When is being informed not enough? These aren’t easy questions and there aren’t easy answers.

I’m not necessarily saying that Facebook made the right trade-offs with this study, but I think that the scholarly reaction that “research is only acceptable with IRB plus informed consent” is disingenuous. Of course, a huge part of what’s at stake has to do with the fact that what counts as a contract legally is not the same as consent. Most people haven’t consented to all of Facebook’s terms of service. They’ve agreed to a contract because they feel as though they have no other choice. And this really upsets people.

A Different Theory

The more I read people’s reactions to this study, the more that I’ve started to think that the outrage has nothing to do with the study at all. There is a growing amount of negative sentiment towards Facebook and other companies that collect and use data about people. In short, there’s anger at the practice of big data. This paper provided ammunition for people’s anger because it’s so hard to talk about harm in the abstract.

For better or worse, people imagine that Facebook is offered by a benevolent dictator, that the site is there to enable people to better connect with others. In some senses, this is true. But Facebook is also a company. And a public company for that matter. It has to find ways to become more profitable with each passing quarter. This means that it designs its algorithms not just to market to you directly but to convince you to keep coming back over and over again. People have an abstract notion of how that operates, but they don’t really know, or even want to know. They just want the hot dog to taste good. Whether it’s couched as research or operations, people don’t want to think that they’re being manipulated. So when they find out what soylent green is made of, they’re outraged. This study isn’t really what’s at stake. What’s at stake is the underlying dynamic of how Facebook runs its business, operates its system, and makes decisions that have nothing to do with how its users want Facebook to operate. It’s not about research. It’s a question of power.

I get the anger. I personally loathe Facebook and I have for a long time, even as I appreciate and study its importance in people’s lives. But on a personal level, I hate the fact that Facebook thinks it’s better than me at deciding which of my friends’ posts I should see. I hate that I have no meaningful mechanism of control on the site. And I am painfully aware of how my sporadic use of the site has confused their algorithms so much that what I see in my newsfeed is complete garbage. And I resent the fact that because I barely use the site, the only way that I could actually get a message out to friends is to pay to have it posted. My minimal use has made me an algorithmic pariah and if I weren’t technologically savvy enough to know better, I would feel as though I’ve been shunned by my friends rather than simply deemed unworthy by an algorithm. I also refuse to play the game to make myself look good before the altar of the algorithm. And every time I’m forced to deal with Facebook, I can’t help but resent its manipulations.

There’s also a lot that I dislike about the company and its practices. At the same time, I’m glad that they’ve started working with researchers and started publishing their findings. I think that we need more transparency in the algorithmic work done by these kinds of systems and their willingness to publish has been one of the few ways that we’ve gleaned insight into what’s going on. Of course, I also suspect that the angry reaction from this study will prompt them to clamp down on allowing researchers to be remotely public. My gut says that they will naively respond to this situation as though the practice of research is what makes them vulnerable rather than their practices as a company as a whole. Beyond what this means for researchers, I’m concerned about what increased silence will mean for a public who has no clue of what’s being done with their data, who will think that no new report of terrible misdeeds means that Facebook has stopped manipulating data.

Information companies aren’t the same as pharmaceuticals. They don’t need to do clinical trials before they put a product on the market. They can psychologically manipulate their users all they want without being remotely public about exactly what they’re doing. And as the public, we can only guess what the black box is doing.

There’s a lot that needs to be reformed here. We need to figure out how to have a meaningful conversation about corporate ethics, regardless of whether it’s couched as research or not. But it’s not so simple as saying that a lack of a corporate IRB or a lack of a golden standard “informed consent” means that a practice is unethical. Almost all manipulations that take place by these companies occur without either one of these. And they go unchecked because they aren’t published or public.

Ethical oversight isn’t easy and I don’t have a quick and dirty solution to how it should be implemented. But I do have a few ideas. For starters, I’d like to see any company that manipulates user data create an ethics board. Not an IRB that approves research studies, but an ethics board that has visibility into all proprietary algorithms that could affect users. For public companies, this could be done through the ethics committee of the Board of Directors. But rather than simply consisting of board members, I think that it should consist of scholars and users. I also think that there needs to be a mechanism for whistleblowing regarding ethics from within companies because I’ve found that many employees of companies like Facebook are quite concerned by certain algorithmic decisions, but feel as though there’s no path to responsibly report concerns without going fully public. This wouldn’t solve all of the problems, nor am I convinced that most companies would do so voluntarily, but it is certainly something to consider. More than anything, I want to see users have the ability to meaningfully influence what’s being done with their data and I’d love to see a way for their voices to be represented in these processes.

I’m glad that this study has prompted an intense debate among scholars and the public, but I fear that it’s turned into a simplistic attack on Facebook over this particular study rather than a nuanced debate over how we create meaningful ethical oversight in research and practice. The lines between research and practice are always blurred and information companies like Facebook make this increasingly salient. No one benefits by drawing lines in the sand. We need to address the problem more holistically. And, in the meantime, we need to hold companies accountable for how they manipulate people across the board, regardless of whether or not it’s couched as research. If we focus too much on this study, we’ll lose track of the broader issues at stake.

by zephoria at July 01, 2014 11:00 PM

June 27, 2014

Ph.D. alumna

‘Selling Out’ Is Meaningless: Teens live in the commercial world we created

In the recent Frontline documentary “Generation Like,” Doug Rushkoff lamented that today’s youth don’t even know what the term “sell-out” means. While this surprised Rushkoff and other fuddy duddies, it didn’t make me blink for a second. Of course this term means nothing to them. Why do we think it should?

The critique of today’s teens has two issues intertwined into one. First, there’s the issue of language — is this term the right term? Second, there’s the question of whether or not the underlying concept is meaningful in contemporary youth culture.

Slang Shifts Over Time

My cohort grew up with the term “dude” with zero recognition that the term was originally a slur for city slickers and dandies known for their fancy duds (a.k.a. clothing). And even as LGBT folks know that “gay” once meant happy, few realize that it once referred to hobos and drifters. Terms change over time.

Even the term “sell-out” has different connotations depending on who you ask… and when you ask. While it’s generally conceptualized as a corrupt bargain, it was originally of political origins, equivalent to traitor. For example, it was used to refer to those in the South who chose to leave the Confederacy for personal gain. Among the black community, it took a different turn, referring to those African-Americans who appeared to be too white. Of course, the version that Rushkoff is most familiar with stems from when musicians were being attacked for putting commercial interests above artistic vision. Needless to say, those who had the privilege to make these decisions were inevitably white men, so it’s not that surprising that the notion of selling out was particularly central to the punk and alternative music scenes from the 1960s-1990s, when white men played a defining role. For many other musicians, hustling was always part of the culture and you were darn lucky to be able to earn a living doing what you loved. This doesn’t mean that the music industry isn’t abusive or corrupt or corrupting. Personally, I’m glad that today’s music ecosystem isn’t as uniformly white or male as it once was.

All that said, why on earth should contemporary adults expect today’s teens to use the same terms that us old fogies have been using to refer to cultural dynamics? Their musical ecosystem is extraordinarily different than what I grew up with. RIAA types complain about how technology undercut their industry, but I would argue that the core industry got greedy and, then, abusive. Today’s teens are certainly living in a world with phenomenally famous pop stars, but they are also experiencing the greatest levels of fragmentation ever. Rather than relying on the radio for music recommendations, they turn to YouTube and share media content through existing networks, undermining industrial curatorial control. As a result, I constantly meet teens whose sense of the music industry is radically different than that of peers who live in the next town over. The notion of selling out requires that there is one reigning empire. That really isn’t the case anymore.

Of course, the issue of slang is only the surface issue. Do teens recognize the commercial ecosystem that they live in? And how do they feel about it? What I found in my research was pretty consistent on this front.

Growing Up in a Commercial World

Today’s teens are desperate for any form of freedom. In a world where they have limited physical mobility and few places to go, they’re deeply appreciative of any space that will accept them. Because we’ve pretty much obliterated all public spaces for youth to gather in, they find their freedom in commercial spaces, especially online. This doesn’t mean teens like the advertisements that are all around them, but they’ll accept this nuisance for the freedom to socialize with their friends. They know it’s a dirty trade-off and they’re more than happy to mess with the data that the systems scrape, but they are growing up in a world where they don’t feel as though they have much agency or choice.

These teens are not going to critique their friends for being sell-outs because they’ve already been sold out by the adults in their world. These teens want freedom and it’s our fault that they don’t have it except in commercial spaces. These teens want opportunities and we do everything possible to restrict those that they have access to. Why should we expect them to stand up to commercial surveillance when every adult in their world surveils their every move “for their own good”? Why should these teens lament the commercialization of public spaces when these are the only spaces that they feel actually allow them to be authentic?

It makes me grouchy when adults gripe about teens’ practice without taking into account all of the ways in which we’ve forced them into the corners that they’re trying to navigate. There’s good reason to be critical of how commercialized American society has become, but I don’t think that we should place the blame on the backs of teenagers who are just trying to find their way. If we don’t like what we see when we watch teenagers, it’s time to look in the mirror. We’ve created this commercially oriented society. Teens are just trying to figure out how to live in it.

(Thanks to Tamara Kneese for helping track down some of the relevant history for this post.)

(This entry was first posted on May 27, 2014 at Medium under the title “‘Selling Out’ Is Meaningless” as part of The Message.)

by zephoria at June 27, 2014 06:36 PM

June 25, 2014

Ph.D. student

metaphorical problems with logical solutions

There are polarizing discourses on the Internet about the following three dichotomies:

  • Public vs. Private (information).
  • (Social) Inclusivity vs. Exclusivity.
  • Open vs. Closed (systems, properties, communities).

Each of these pairings enlists certain metaphors and intuitions. Rarely are they precisely defined.

Due to their intuitive pull, it’s easy to draw certain naive associations. I certainly do. But how do they work together logically?

To what extent can we fill in other octants of this cube? Or is that way of modeling it too simplistic as well?

If privacy is about having contextual control over information flowing out of oneself, then that means that somebody must have the option of closing off some access to their information. To close off access is necessarily to exclude.


But it has been argued that open sociotechnical systems exclude as well by being inhospitable to those with greater need for privacy.


These conditionals limit the kinds of communities that can exist.


Social inclusivity in sociotechnical systems is impossible. There is no such thing as a sociotechnical system that works for everybody.

There are only three kinds of systems: open systems, private systems, or systems that are neither open nor private. We can call the latter leaky systems.
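One way to check how these conditionals fit together is to enumerate the octants of the cube and discard the ones the prose rules out. The sketch below is only one reading of the argument, not a formalization taken from the post: the three boolean names and the three constraints are assumptions drawn from the surrounding paragraphs.

```python
from itertools import product


def allowed(private: bool, open_: bool, inclusive: bool) -> bool:
    """Return True if an octant survives the three assumed conditionals."""
    # Privacy requires closing off some access, so a private system is not open.
    if private and open_:
        return False
    # To close off access is to exclude, so a closed system is not inclusive.
    if not open_ and inclusive:
        return False
    # Open systems are inhospitable to those who need privacy, so they exclude too.
    if open_ and inclusive:
        return False
    return True


def label(private: bool, open_: bool) -> str:
    """Name a surviving system using the post's vocabulary."""
    if open_:
        return "open"
    if private:
        return "private"
    return "leaky"  # neither open nor private


# Walk all eight octants of the (private, open, inclusive) cube.
surviving = [
    (private, open_, inclusive)
    for private, open_, inclusive in product([False, True], repeat=3)
    if allowed(private, open_, inclusive)
]
kinds = sorted(label(private, open_) for private, open_, _ in surviving)
print(kinds)
```

Under this reading, no surviving octant is inclusive and exactly three kinds of system remain, which is consistent with the two claims above.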

These binary logical relations capture only the limiting properties of these systems. If there has ever been an open system, it is the Internet; but everyone knows that even the Internet isn’t truly open because of access issues.

The difference between a private system and a leaky system is participants’ ability to control how their data escapes the system.

But in this case, systems that we call ‘open’ are often private systems, since participants choose whether or not to put information into the open.

So is the only question whether and when information is disclosed vs. leaked?

by Sebastian Benthall at June 25, 2014 11:33 PM

June 24, 2014

Ph.D. student

Protected: some ruminations regarding ‘openness’

This post is password protected. You must visit the website and enter the password to continue reading.

by Sebastian Benthall at June 24, 2014 06:22 AM

June 19, 2014

Ph.D. student

turns out network backbone markets in the US are competitive after all

I’ve been depressed for a while now about the oligopolistic control of telecommunications. There’s the Web We’ve Lost; there are the Snowden leaks; there’s the end of net neutrality. I’ll admit a lot of my moodiness about this has been just that–moodiness. But it was moodiness tied to a particular narrative.

In this narrative, power is transmitted via flows of information. Media is, if not determinative of public opinion, determinative of how that opinion is acted upon. Surveillance is also an information flow. Broadly, mid-20th century telecommunications enabled mass culture due to the uniformity of media. The Internet’s protocols allowed it to support a different kind of culture–a more participatory one. But monetization and consolidation of the infrastructure has resulted in a society that’s fragmented but more tightly controlled.

There is still hope of counteracting that trend at the software/application layer, which is part of the reason why I’m doing research on open source software production. One of my colleagues, Nick Doty, studies the governance of Internet Standards, which is another piece of the puzzle.

But if the networking infrastructure itself is centrally controlled, then all bets are off. Democracy, in the sense of decentralized power with checks and balances, would be undermined.

Yesterday I learned something new from Ashwin Mathew, another colleague who studies Internet governance at the level of network administration. The man is deep in the process of finishing up his dissertation, but he looked up from his laptop for long enough to tell me that the network backbone market is in fact highly competitive at the moment. Apparently, there was a lot of dark fiberoptic cable (“dark fiber”–meaning, no light’s going through it) laid during the first dot-com boom, which has been lying fallow and getting bought up by many different companies. Since there are many routes from A to B and excess capacity, this market is highly competitive.

Phew! So why the perception of oligopolistic control of networks? Because the consumer-facing telecom end-points ARE an oligopoly. Here there’s the last-mile problem. When wire has to be laid to every house, the economies of scale are such that it’s hard to have competitive markets. Enter Comcast etc.

I can rest easier now, because I think that this means there are various engineering solutions to this (like AirJaldi networks? though I think those still aren’t last mile…; mesh networks?) as well as political solutions (like a local government running its last mile network as a public utility).

by Sebastian Benthall at June 19, 2014 10:38 PM

June 18, 2014

Ph.D. student


This post is password protected. You must visit the website and enter the password to continue reading.

by Sebastian Benthall at June 18, 2014 10:50 PM

MIMS 2010

Reworking the CourtListener Datamodel

Brian and I have been hard at work the past week figuring out how to make CourtListener able to understand more than one document type. Our goal right now is to make it possible to add:

  • oral arguments and other audio content,
  • video content if it's available,
  • content from RECAP, and
  • thousands of ninth circuit briefs that have recently been scanned.

The problem with our current database is that it's not organized in a way that supports linkages between content. So, if we have the oral argument and the opinion from a single case, we have no way of pointing them at each other. Turns out this is a sticky problem.

The solution we've come up with is an architecture like the following:

(we also have a more detailed version and an editable version)

And eventually, this will also have a Case table above the docket that allows multiple dockets to be associated with a single case. For now though, that's moot, as we don't have any way of figuring out which dockets go together.
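As a rough illustration of the linkage described above, here is a minimal sketch in Python. The class and field names (`Docket`, `Opinion`, `OralArgument`, `case_name`, `audio_path`) are hypothetical and are not taken from the CourtListener codebase. The point is only that each piece of content points at a shared docket, so an opinion and an oral argument from the same case can find each other.

```python
from dataclasses import dataclass, field
from typing import List


# eq=False keeps comparisons identity-based, which avoids recursing through
# the circular docket <-> content references.
@dataclass(eq=False)
class Docket:
    case_name: str
    # repr=False prevents infinite recursion when printing linked objects.
    opinions: List["Opinion"] = field(default_factory=list, repr=False)
    oral_arguments: List["OralArgument"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class Opinion:
    docket: Docket
    text: str

    def __post_init__(self) -> None:
        # Register with the parent docket so siblings can discover this item.
        self.docket.opinions.append(self)


@dataclass(eq=False)
class OralArgument:
    docket: Docket
    audio_path: str

    def __post_init__(self) -> None:
        self.docket.oral_arguments.append(self)

    def related_opinions(self) -> List[Opinion]:
        """Opinions that share this recording's docket."""
        return self.docket.opinions


# Hypothetical example data, for illustration only.
docket = Docket(case_name="Example v. Example")
opinion = Opinion(docket=docket, text="(opinion text)")
argument = OralArgument(docket=docket, audio_path="example.mp3")
print(argument.related_opinions()[0].text)
```

A Case level could later be added the same way, with each docket holding a reference to its parent case.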

The first stage of this will be to add support for oral arguments, since they make a simple case to work with. Once that's complete the next stage will be either to add the RECAP documents or those from


Since this is such a big change, we're also taking this opportunity to re-work our URLs. Currently, they look like this:


For example:

A few things bug me about that. First, it doesn't tell you anything about what kind of thing you can expect to see if you click that link. Second, the alpha-numeric ID is kind of lame. It's just a reference to the database primary key for the item, and we should just show that value (in this case, "yjn" means "108713"). To fix both of these issues, the new URLs will be:



That should be easier to read and should tell you what type of item you're about to look at. Don't worry, the old URLs will keep working just fine.

And the rest of the new URLs will be:


and eventually:



We expect these changes to come with changes to the API, so we'll likely be releasing API version 1.1 that will add support for dockets and oral arguments.

The current version 1.0 should keep working just fine, since we're not changing any of the underlying data, but I expect that it will have some changes to the URLs and things like that. I'll be posting more about this in the CourtListener dev list as the changes become clearer and as we sort out what a fair policy is for the deprecation of old APIs.

by mlissner at June 18, 2014 05:26 PM

MIMS 2012

Design Process of Optimizely's Sample Size Calculator

Optimizely just released a sample size calculator, which tells people how many visitors they need for an A/B test to get results. This page began as a hack week project, which produced a functioning page, but needed some design love before being ready for primetime. So my coworker Jon (a communication designer at Optimizely) and I teamed up to push this page over the finish line. In this post, I’m going to explain the design decisions we made along the way.

The finished sample size calculator

We Started with a Functioning Prototype

The page started as an Optimizely hack week project that functioned correctly, but suffered from a confusing layout that didn’t guide people through the calculation. After brainstorming some ideas, we decided the calculator’s layout should follow the form of basic math taught in primary school. You start at the top, write down each of the inputs in a column, and calculate the answer at the bottom.

The original sample size calculator prototype

This made sense conceptually, but put the most important piece of data (the number of visitors needed) at the bottom of the page. Conventional wisdom and design theory would say our information hierarchy is backwards, and users may not even see this content. It also meant the answer is below the fold, which could increase the bounce rate.

All of these fears make sense when viewing the page through the lens of static content. But this page is interactive, and requires input from the user to produce a meaningful answer to how many visitors are needed for an A/B test. Lining up the inputs in a vertical column, and placing the answer at the bottom, encourages people to look at each piece of data going into the calculation, and enter appropriate values before getting an answer. The risk of visitors bouncing, or not seeing the answer, is minimal. Although this is counter to best practices, we felt our reasons for breaking the rules were sound.

Even so, we did our due diligence and sketched a few variations that shuffled around the inputs (e.g. horizontal; multi-column) and the sample size per variation (e.g. top; sides). None of these alternatives felt right, and having the final answer at the bottom made the most sense for the reasons described above. But sketching out other ideas made us confident in our original design decision.

Power User Inputs

After deciding on the basic layout of the page, we tackled the statistical power and significance inputs. We knew from discussions with our statisticians that mathematically speaking these were important variables in the calculation, but they don’t need to be changed by most people. The primary user of this page just wants to know how many visitors they’ll need to run through an A/B test, for whom the mathematical details of these variables are unimportant. However, the values should still be clear to all users, and editable for power users who understand their effect.
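
For readers curious about how power and significance enter the calculation, here is a minimal sketch using the standard textbook normal approximation for comparing two conversion rates. It is not necessarily the formula Optimizely's calculator uses; the function name, the baseline-rate and relative-effect parameters, and the default values are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variation(baseline_rate, relative_effect,
                           significance=0.95, power=0.80):
    """Approximate visitors needed per variation for a two-sided test.

    Hypothetical helper based on the textbook two-proportion approximation;
    parameter names and defaults are illustrative assumptions.
    """
    p1 = baseline_rate                          # current conversion rate
    p2 = baseline_rate * (1 + relative_effect)  # rate we hope to detect
    alpha = 1 - significance                    # false-positive rate

    # Critical values of the standard normal distribution.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)

    # Sum of the per-visitor variances of the two conversion rates.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Example: 10% baseline conversion, looking for a 20% relative lift.
    print(visitors_per_variation(0.10, 0.20))
```

Raising either the significance or the power increases the number of visitors required, which is why those inputs still matter to people who understand the trade-off.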

To solve this challenge, we decided to display the value in plain text, but hide the controls behind an “Edit” button. Clicking the button reveals a slider to change the input. We agreed that this solution gave enough friction to deter most users from playing around with these values, but it’s not so burdensome as to frustrate an expert user who wants to change it.

Removing the “Calculate” Button

The original version of the page didn’t spit out the number of visitors until the “Calculate” button was clicked. But once I started using the page and personally experienced the annoyance of having to click this button every time I changed the values, it was clear the whole process would be a lot smoother if the answer updated automatically anytime an input changed. This makes the page much more fluid to use, and encourages people to play with each variable to see how it affects the number of visitors their test needs.

This is a design decision that only became clear to me from using a working implementation. In a static mock, a button will look fine and come across as an adequate user experience. But it’s hard to assess the user experience unless you can actually experience a working implementation. Once I re-implemented the page, it was clear auto-updating the answer was a superior experience. But without actually trying each version, I wouldn’t have been confident in that decision.


This project was a fun cross-collaboration between product and communication design at Optimizely. I focused on the interactions and implementation, while Jon focused on the visual design, but we sat side-by-side and talked through each decision together, sketching and pushing each other along the way. Ultimately, the finished product landed in a better place from this collaboration. It was a fun little side project that we hope adds value to the customer experience.

by Jeff Zych at June 18, 2014 03:24 PM

June 13, 2014

Ph.D. alumna

San Francisco’s (In)Visible Class War

In 2003, I was living in San Francisco and working at a startup when I overheard a colleague of mine — a self-identified libertarian — spout off about “the homeless problem.” I don’t remember exactly what he said, but I’m sure it fit into a well-trodden frame about no-good lazy leeches. I marched right over to him and asked if he’d ever talked to someone who was homeless. He looked at me with shock and his cheeks flushed, so I said, “Let’s go!” Unwilling to admit discomfort, he followed.

We drove down to 6th Street, and I nodded to a group of men sitting on the sidewalk and told him to ask them about their lives. Then I watched as he nervously approached one guy and stumbled through a conversation. I was pleasantly surprised that he ended up talking for longer than I expected before coming back to me.

“He’s a vet.”
“And he said the government got him addicted and he can’t shake the habit.”
“And he doesn’t know what he should do to get a job because no one will ever talk to him.”
“I didn’t think…. He’s not doing so well…”

I let him trail off as we got back into the car and drove back to the office in silence.

San Francisco is in the middle of a class war. It’s not the first or last city to have heart-wrenching inequality tear at its fabric, challenge its values, test its support structures. But what’s jaw-dropping to me is how openly, defensively, and critically technology folks demean those who are struggling. The tech industry has a sickening obsession with meritocracy. Far too many geeks and entrepreneurs worship at the altar of zeros and ones, believing that outputs can be boiled down to a simple equation based on inputs. In a modern-day version of the Protestant ethic, there’s a sense that success is a guaranteed outcome of hard work, skills, and intelligence. Thus, anyone who is struggling can be blamed for their own circumstances.

This attitude is front and center when it comes to people who are visibly homeless on the streets of San Francisco, a mere fraction of the total homeless population in that city.

I wish that more people working in the tech sector would take a moment to talk to these men and women. Listening to their stories is humbling. Vets who fought for our country, under the banner of “freedom,” only to be cognitively imprisoned by addiction and mental illness. Abused runaways trying to find someone who will treat them with respect. People who were working hard and getting by until an accident struck and they lost their job and ended up in medical debt. Immigrants who came looking for the American Dream only to find themselves trapped. These aren’t no-good lazy leeches. They’re people. People whose lives have been a hell of a lot harder than most of us can even fathom. People who struggle on a daily basis to find food and shelter. People who we’ve systematically disenfranchised and failed to support. People who the bulk of tech workers ignore, shun, resent, and demonize.

A city without a safety net cannot be a healthy society. And nothing exacerbates this worse than condescension, resentment, and dismissal. We can talk about the tech buses and the lack of affordable housing, but it all starts with appreciating those who are struggling. Only a mere fraction of San Francisco’s homeless population are visible, but those who are reveal the starkness of what’s unfolding. And, as with many things, there’s more of a desire to make the visible invisible than there is to grapple with dynamics of poverty, mental illness, addiction, abuse, and misfortune. Too many people think that they’re invincible.

If you’re living in the Bay Area and working in tech, take a moment to do what I asked my colleague to do a decade ago. Walk around the Tenderloin and talk with someone whose poverty is written on their body. Respectfully ask about their life. Where did they come from? How did they get here? Where do they want to go? Ask about their hopes and dreams, struggles and challenges. Get a sense for their story. Connect as people. Then think about what meritocracy in tech really means.

(Photo by Darryl Harris.)

Two great local organizations: Delancey Street Foundation and Homeless Children’s Network.

(This entry was first posted on May 13, 2014 at Medium under the title “San Francisco’s (In)Visible Class War” as part of The Message.)

by zephoria at June 13, 2014 04:27 PM

June 10, 2014

MIMS 2014

Spotlight: Diagknowzit


To me, data science is the combination of better decision-making with technology that enables us to actually execute those better decisions in real time. Understanding the science behind better decision making is not on its own sufficient. Technology is required to gather the relevant information, present it to people, and enable them to act upon it in real time. I learned this first-hand five years ago when I was working at the global health NGO, Partners In Health (PIH).

In rural Lesotho, PIH developed an app for feature phones that enabled community health workers to verify, via SMS, whether patients had actually taken their antiretroviral (ARV) medications. The text messages were uploaded directly into PIH’s electronic medical record (EMR) system, and allowed the medical staff to monitor ARV adherence rates in their catchment areas in real time. By using a simple feature phone app, PIH leapfrogged over the serious infrastructure challenges that are present in Lesotho, a landlocked country with many mountains and very few roads.

Seeing PIH pioneer this solution inspired me to take a more technical turn in my career. When it came time to do my final project for my masters degree, I reached out to PIH to see if we could find an opportunity to put data science to work for PIH. We identified the following problem:

When patients arrive at PIH clinics, someone has to document their visit in the EMR. This person may not have much medical expertise; they might not be very familiar with the ins and outs of the EMR system. Still, the way the system is set up, that person has to record a presumed diagnosis into the EMR system. To deal with the problem currently, PIH allows entry technicians to enter whatever they want into the system as a presumed diagnosis. If the system doesn’t recognize the user input, the input is stored as a placeholder value that someone has to go back and manually code at some point in the future. Manually coding the data must be done by someone with greater medical/system knowledge–perhaps even a doctor. Looking at the entire process, we concluded these workers would be more effective seeing patients instead of dealing with data entry.

To fix the problem, we built Diagknowzit, a recommendation engine built on top of Open-MRS, the open source EMR used by PIH. Diagknowzit works similarly to Google’s “Did you mean…?” feature–except whereas “Did you mean…?” gets the user to spell things correctly, Diagknowzit matches the medical condition meant by the user to the EMR system’s official representation of that condition.
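
To give a feel for the "Did you mean…?" idea, here is a minimal sketch that matches free-text input against a list of official concept names. It uses plain string similarity from Python's standard library, which is a swapped-in technique for illustration and not the model Diagknowzit uses; the concept list and the function name are hypothetical.

```python
import difflib

# Hypothetical stand-in for the EMR's official diagnosis concepts.
OFFICIAL_CONCEPTS = ["Malaria", "Tuberculosis", "Pneumonia", "Hypertension"]


def suggest_diagnoses(user_input, concepts=OFFICIAL_CONCEPTS, limit=3):
    """Return up to `limit` official concept names that resemble the input."""
    # Compare case-insensitively, but return the official spelling.
    by_lowercase = {name.lower(): name for name in concepts}
    matches = difflib.get_close_matches(
        user_input.strip().lower(), list(by_lowercase), n=limit, cutoff=0.6
    )
    return [by_lowercase[match] for match in matches]


if __name__ == "__main__":
    print(suggest_diagnoses("malaira"))
```

String similarity only catches misspellings; mapping what the user meant to the right concept is the harder problem, and that is where a trained model comes in.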

The challenges in building Diagknowzit could fill many blog posts, but here I’ll only mention a few. For one thing, the core codebase of Open-MRS–built using Spring/Hibernate and Java–was built in the mid 2000s and is a bit dated at this point. My partners and I had to spend a significant amount of time porting our knowledge of more Pythonic frameworks back in time to be able to build with the requisite toolset.

Perhaps the biggest challenge we faced was the fact that we only received the data we needed to train our recommendation engine two weeks before the final project was due. Sharing data across organizations is never easy, and when that information contains sensitive medical information, the risks are particularly high. It took a long time for our request to filter through all the necessary layers of oversight at PIH, but we’re very happy and grateful that it worked out in the end.

Algorithm Performance

Because of the compressed timeline, we didn’t have quite as much time to explore and experiment with the data as I would have liked. Still, even with a relatively simple machine learning model, we were able to achieve pretty decent performance. Using multinomial logistic regression, our engine guessed correctly 71% of the time, which is in the ballpark of similar projects we found during our literature review.

Ultimately, our goal in doing this project was more than just to build something useful for PIH. I mentioned before how PIH’s EMR system, Open-MRS, is open source; it was actually developed by PIH and others as a response to the growing need for global health NGOs to manage their information effectively. The Open-MRS development community is thriving, but little attention has been paid thus far to the potential of data science-based tools to improve clinical decision support. We hope that our promising results inspire more development in this direction. In fact, if you are a member of the Open-MRS community and are interested in Diagknowzit or more tools like it, please don’t hesitate to get in touch :)

by dgreis at June 10, 2014 03:13 PM

June 06, 2014

Ph.D. student

i’ve started working on my dissertation // diversity in open source // reflexive data science

I’m studying software development and not social media for my dissertation.

That’s a bit of a false dichotomy. Much software development happens through social media.

Which is really the point–that software development is a computer mediated social process.

What’s neat is that it’s a computer mediated social process that, at its best, creates the conditions for it to continue as a social process. Cf. Kelty’s “recursive public”.

What’s also neat is that this is a significant kind of labor that is not easy to think about given the tools of neoclassical economics or anything else really.

In particular I’m focusing on the development of scientific software, i.e. software that’s made and used to improve our scientific understanding of the natural world and each other.

The data I’m looking at is communications data between developers and their users. I’m including the code, under version control, as this. In addition to being communication between developers, you might think of source code as a communication between developers and machines. The process of writing code as a collaboration or conversation between people and machines.

There is a lot of this data so I get to use computational techniques to examine it. “Data science,” if you like.

But it’s also legible, readable data with readily accessible human narrative behind it. As I debug my code, I am reading the messages sent ten years ago on a mailing list. Characters begin to emerge serendipitously because their email signatures break my archive parser. I find myself Googling them. “Who is that person?”

One email I found while debugging stood out because it was written, evidently, by a woman. Given the current press on diversity in tech, I thought it was an interesting example from 2001:

From sag at Thu Nov 29 15:21:04 2001
From: sag at (Sue Giller)
Date: Thu Nov 29 15:21:04 2001
Subject: [Numpy-discussion] Re: Using Reduce with Multi-dimensional Masked array
In-Reply-To: <000201c17917$ac5efec0$>
References: <>
Message-ID: <>


Well, you’re right. I did misunderstand your reply, as well as what
the various functions were supposed to do. I was mis-using the
sum, minimum, maximum as tho they were MA..reduce, and
my test case didn’t point out the difference. I should always have
been doing the .reduce version.

I apologize for this!

I found a section on page 45 of the Numerical Python text (PDF
form, July 13, 2001) that defines sum as
‘The sum function is a synonym for the reduce method of the add
ufunc. It returns the sum of all the elements in the sequence given
along the specified axis (first axis by default).’

This is where I would expect to see a caveat about it not retaining
any mask-edness.

I was misussing the MA.minimum and MA.maximum as tho they
were .reduce version. My bad.

The MA.average does produce a masked array, but it has changed
the ‘missing value’ to fill_value=[ 1.00000002e+020,]). I do find this
a bit odd, since the other reductions didn’t change the fill value.

Anyway, I can now get the stats I want in a format I want, and I
understand better the various functions for array/masked array.

Thanks for the comments/input.


I am trying to approach this project as a quantitative scientist. But the process of developing the software for analysis is putting me in conversation not just with the laptop I run the software on, but also the data. The data is a quantified representation–I count the number of lines, even the number of characters in a line as I construct the regular expression needed to parse the headers properly–but it represents a conversation in the past. As I write the software, I consult documentation written through a process not unlike the one I am examining, as well as Stack Overflow posts written by others who have tried to perform similar tasks. And now I am writing a blog post about this work. I will tweet a link of this out to my followers; I know some people from the Scientific Python community that I am studying follow me on Twitter. Will one of them catch wind of this post? What will they think of it?
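
For a sense of what that header parsing involves, here is a minimal sketch using a regular expression. It is not my actual archive parser; the function name and the pattern are illustrative assumptions, and it only handles simple one-line headers like the ones in the email above.

```python
import re

# A header line looks like "Name: value"; the mbox separator line
# ("From sag at Thu Nov 29 ...") has no colon after the name and is skipped.
HEADER_RE = re.compile(r"^(?P<name>[A-Za-z-]+):\s?(?P<value>.*)$")


def parse_headers(message):
    """Collect "Name: value" headers up to the first blank line."""
    headers = {}
    for line in message.splitlines():
        if not line.strip():
            break  # a blank line separates the headers from the body
        match = HEADER_RE.match(line)
        if match:
            headers[match.group("name")] = match.group("value").strip()
    return headers


if __name__ == "__main__":
    sample = (
        "From sag at Thu Nov 29 15:21:04 2001\n"
        "From: sag at (Sue Giller)\n"
        "Date: Thu Nov 29 15:21:04 2001\n"
        "\n"
        "Well, you're right.\n"
    )
    print(parse_headers(sample))
```

Real archives are messier than this: headers can wrap across lines, and signatures or quoted messages in the body can look like headers, which is exactly the kind of thing that breaks a parser.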

by Sebastian Benthall at June 06, 2014 04:52 AM

June 05, 2014

Ph.D. student

Being the Machine



The age of digital fabrication is upon us and there are hundreds of articles that will tell you that it will surely change the world. My current research project called “Being the Machine” explores how the design of digital fabricators could be different and explores new possibilities for interacting with fabrication and computer numeric controlled (CNC) technologies. My approach designs for fabrication as a kind of performance rather than a tool for accomplishing a given task. As a performance, all parts of the system become aesthetically meaningful: the movements of the human, the movements of the machine, the objects that are produced, the contexts in which they are placed, and the materials used for development. The system I am building consists of a head-worn laser guide that draws G-Code paths that the user follows by hand. What this system does is guide someone in building any object in the way that a 3D printer would. This system allows us to fabricate in ways that are currently difficult with existing 3D printers, as it is completely portable and the user has a wide range of choices about what materials to use in fabrication (sand at the beach, snow in the mountains, polenta in the kitchen, etc.). Additionally, since fabrication is tied to a human rather than a machine, the user is free to explore different ways to “break” the system in order to reveal new aesthetic choices. For instance, the material properties of the objects one is building with can be unpredictable and subject to environmental factors (wind, rain, etc.). In the spirit of indeterminacy, these “unknowns” can be productive ways to expose new aesthetic possibilities. By building and studying this system, I hope to reveal new insights about the way in which value is constructed in fabricated objects and the role fabrication might play in someone’s social and emotional life.
I am currently developing this project as a Graduate Student Researcher and an Artist-In-Residence at Instructables/Autodesk. I’m posting all of my prototypes and progress here:

by admin at June 05, 2014 08:40 PM

Ph.D. alumna

Will my grandchildren learn to drive? I expect not

I rarely drive these days, and when I do, it’s bloody terrifying. Even though I grew up driving and drove every day for fifteen years, my lack of practice is palpable as I grip the steering wheel. Every time I get behind the wheel, in order to silence my own fears about all of the ways in which I might crash, I ruminate over the anxieties that people have about teenagers and driving. I try not to get distracted in my own driving by looking to see if other drivers are texting while driving, but I can’t help but muse about these things. And while I was driving down the 101 in California last week, it hit me: driving is about to become obsolete.

The history of cars in America is tied up with what it means to be American in the first place. American history —with its ups and downs — can be understood through the automobile industry. In fact, it can be summed up with one word: Detroit. Once a booming metropolis, this one-industry town iconically highlights the issues that surround globalization, class inequality, and labor identities. But entwined with the very real economic factors surrounding the automobile industry is an American obsession with freedom.

It used to be that getting access to a car was the ultimate marker of freedom. As a teenager in the nineties, I longed for my sixteenth birthday and all that was represented by a driver’s license. Today, this sentiment is not echoed by the teens that I meet. Some still desperately want a car, but it doesn’t have the same symbolic feeling that it once did. When I ask teens about driving, what they share with me reveals the burdens imposed by this supposed tool of freedom. They talk about the costs — especially the cost of gas. They talk about the rules — especially the rules that limit them from driving with other teens in the car. And they talk about the risks — regurgitating back countless PSAs on drinking or texting while driving. While plenty of teens still drive, the very notion of driving doesn’t prompt the twinkle in their eyes that I knew from my classmates.

Driving used to be hard work. Before there was power steering and automatic transmission, maneuvering a car took effort. Driving used to be a gateway for learning. Before there were computers in every part of a car, curious youth could easily tear apart their cars and tinker with their innards. Learning to drive and manipulate a car used to be admired. Driving also used to be fun. Although speed limits and safety belts have saved many lives, I still remember the ways in which we would experiment with the boundaries of a car by testing its limits in parking lots on winter days. And I will never forget my first cross-country road trip, when I embraced the openness of the road and pushed my car to the limits and felt the wind on my face. Freedom, I felt freedom.

Today, what I feel is boredom, if not misery. The actual mechanisms of driving are easy, fooling me into a lull when I get into a car. Even with stimuli all around me, all I get to do is pump the gas, hit the brakes, and steer the wheel no more than ten degrees. My body is bored and my brain turns off. By contrast, I totally get the allure of the phone—or anything that would be more interesting than trying to navigate the road while changing the radio station to avoid the incessant chatter from not-very-entertaining DJs.

It’s rare that I hear many adults talk about driving with much joy. Some still get giddy about their cars; I hear this most often from my privileged friends when they get access to a car that changes their relationship to driving, such as an electric car or a hybrid or a Tesla. But even in those cases, I hear enthusiasm for a month before people go back to moaning about traffic and parking and surveillance. Outside of my friends, I hear people lament gas prices and tolls and this, that, or the other regulation. And when I listen to parents, they’re always complaining about having to drive their kids here, there, and everywhere. Not surprisingly, the teens that I meet rarely hear people talk joyously about cars. They hear it as a hassle.

So where does this end up? Data from both the CDC and AAA suggests that fewer and fewer American teens are bothering to even get their driver’s license. There’s so much handwringing about driving dangers, so much effort towards passing new laws and restrictions targeting teens in particular, and so much anxiety about distracted driving. Not surprisingly, more and more teens are throwing their hands in the air and giving up, demanding their parents drive them because there’s no other way. This, in turn, means that parents hate driving even more. And since our government is incapable of working together to invest in infrastructure, thereby undermining any hopes of public transit in huge parts of the country, what we’re effectively doing is laying the groundwork for autonomous vehicles. It’s been 75 years since General Motors exhibited an autonomous car at the 1939 World’s Fair, but we’ve now created the cultural conditions for this innovation to fit into American society.

We’re going to see a decade of people flipping out over fear that autonomous vehicles are dangerous, even though I expect them to be a lot less dangerous than sleepy drivers, drunken drivers, distracted drivers, and inexperienced drivers. Older populations that still associate driving with freedom are going to be resistant to the very idea of autonomous vehicles, but both parents and teenagers will start to see them as more freeing than driving. We’re still a long way from autonomous vehicles being meaningfully accessible to the general population. But we’re going to get there. We’ve spent the last thirty years ratcheting up fears and safety measures around cars, and we’ve successfully undermined the cultural appeal of driving. This is what will open the doors to a new form of transportation. And the opportunities for innovation here are only just beginning.

(This entry was first posted on May 5, 2014 at Medium under the title “Will my grandchildren learn to drive? I expect not” as part of The Message.)

by zephoria at June 05, 2014 03:23 PM

May 25, 2014

MIMS 2012

I, Too, Only Work On Unshiny Products

Jeff Domke’s article about working on “unshiny” products sums up my view of my own work at Optimizely. The crux of his argument is that many designers are drawn to working on “shiny” products — products that are pretty and lauded in the design community — but “unshiny” products are a lot more interesting to work on. They’re often solving difficult problems, and have more room for you to make an impact. You’re working to make the product reach its potential. Shiny products, on the other hand, have reached that potential, and you are less able to make your mark.

Optimizely sits right in the sweet spot. We aren’t “shiny” (compared to sexy products like Square and Medium, we have a long way to go); nor are we “unshiny” (our customers describe us as well-designed and easy-to-use). Rather, we land more in the middle — we have a solid user experience that has a lot of room for improvement.

We’re also solving really hairy, complex problems. Apps that get mounds of praise tend to solve relatively simple problems (such as to-do list apps). It’s much more interesting to work on a problem space that is unexplored and is full of murky, vague, conflicting goals that must be untangled. And once you’ve made sense of the mess, you know you’ve enabled someone to do their job better.

And that’s the most exciting part of working at Optimizely — making the product fulfill its potential, and solving tough problems that impact businesses' bottom lines.

Want to come work on hard problems with me at Optimizely? Check out our jobs, or reach out @jlzych.

by Jeff Zych at May 25, 2014 06:55 PM

Ph.D. alumna

Matt Wolf’s “Teenage”

Close your eyes and imagine what it was like to be a teenager in the 1920s. Perhaps you are out late dancing swing to jazz or dressed up as a flapper. Most likely, you don’t visualize yourself stuck at home unable to see your friends like today’s teenagers. And for good reason. In the 1920s, teenagers used to complain when their parents made them come home before 11pm. Many, in fact, earned their own money; compulsory high school wasn’t fully implemented until the 1930s when adult labor became anxious about the limited number of available jobs.

Although contemporary parents fret incessantly about teenagers, most people don’t realize that the very concept of a “teenager” is a 1940s marketing invention. And it didn’t arrive overnight. It started with a transformation in the 1890s when activists began to question child labor and the psychologist G. Stanley Hall identified a state of “adolescence” that was used to propel significant changes in labor laws. By the early 1900s, with youth out of the work force and having far too much free time, concerns about the safety and morality of the young emerged, prompting reformers to imagine ways to put youthful energy to good use. Up popped the Scouts, a social movement intended to help produce robust youth, fit in body, mind, and soul. This inadvertently became a training ground for World War I soldiers who, by the 1920s, were ready to let loose. And then along came the Great Depression, sending a generation into a tailspin and prompting government intervention. While the US turned to compulsory high school and the Civilian Conservation Corps, Germany saw the rise of Hitler Youth. And an entire cohort, passionate about being a part of a community with meaning, was mobilized on the march towards World War II.

All of this (and much more) is brilliantly documented in Jon Savage’s beautiful historical account Teenage: The Creation of Youth Culture. This book helped me rethink how teenagers are currently understood in light of how they were historically positioned. Adolescence is one of many psychological and physical transformations that people go through as they mature, but being a teenager is purely a social construct, laden with all sorts of political and economic interests.

When I heard that Savage’s book was being turned into a film, I was both ecstatic and doubtful. How could a filmmaker do justice to the 576 pages of historical documentation? To my surprise and delight, the answer was simple: make a film that brings to visual life the historical texts that Savage referenced.

In his new documentary, Teenage, Matt Wolf weaves together an unbelievable collection of archival footage to produce a breathless visual collage. Overlaid on top of this visual eye candy are historical notes and diary entries that bring to life the voices and experiences of teens in the first half of the 20th century. Although this film invites the viewer to reflect on the past, doing so forces a reflection on the present. I can’t help but wonder: what will historians think of our contemporary efforts to isolate young people “for their own good”?

This film is making its way through US independent theaters so it may take a while until you can see it, but to whet your appetite, watch the trailer:


(This entry was first posted on April 25, 2014 at Medium under the title “A Dazzling Film About Youth in the Early 20th Century” as part of The Message.)

by zephoria at May 25, 2014 03:15 PM

May 20, 2014

MIMS 2012

On Being a Generalist

I recently read Frank Chimero’s excellent article, “Designing in the Borderlands”. The gist of it is that we (designers, and the larger tech community) have constructed walls between various disciplines that we see as opposites, such as print vs. digital, text vs. image, and so on. However, the most interesting design happens in the borderlands, where these different media connect. He cites examples that combine physical and digital media, but the most interesting bit for me was his thoughts on roles that span disciplines:

For a long time, I perceived my practice’s sprawl as a defect—evidence of an itchy mind or a fear of commitment—but I am starting to learn that a disadvantage can turn into an advantage with a change of venue. The ability to cross borders is an asset. Who else could go from group to group and be welcomed? The pattern happens over and over: if you’re not a part of any group, you can move amongst them all by tip-toeing across the lines that connect them.

I have felt this way many times throughout my career (especially that “fear of commitment” part). I have long felt like a generalist who works in both design and engineering, and I label myself to help people understand what I do (not to mention the necessity of a title). But I’ve never cleanly fit into any discipline.

This line was further blurred by my graduate degree from UC Berkeley’s School of Information. The program brings together folks with diverse backgrounds, and produces T-shaped people who can think across disciplines and understand the broader context of their work, whether it be in engineering, design, policy & law, sociology, or dozens of other fields that our alumni call home.

These borderlands are the best place for a designer like me, and maybe like you, because the borderlands are where things connect. If you’re in the borderlands, your different tongues, your scattered thoughts, your lack of identification with a group, and all the things that used to be thought of as drawbacks in a specialist enclave become the hardened armor of a shrewd generalist in the borderlands.

Couldn’t have said it any better. Being able to move between groups and think across disciplines is more of an advantage than a disadvantage.

by Jeff Zych at May 20, 2014 04:02 AM

May 14, 2014

Ph.D. student

A dynamically-generated robots.txt: will search engine bots recognize themselves?

In short, I built a script that dynamically generates a robots.txt file for search engine bots, which download the file when they seek direction on what parts of a website they are allowed to index. By default, it directs all bots to stay away from the entire site, but then presents an exception: only the bot that requests the robots.txt file is given free rein over the site. If Google’s bot downloads the robots.txt file, it will see that only Google’s bot gets to index the entire site. If Yahoo’s bot downloads the robots.txt file, it will see that only Yahoo’s bot gets to index the entire site. Of course, this is assuming that bots identify themselves to my server in a way that they recognize when it is reflected back to them.

What is a robots.txt file? Most websites have one of these very simple files, called “robots.txt”, in the main directory of their server. The robots.txt file has been around for almost two decades, and it is now a standardized way of communicating what pages search engine bots (or crawlers) should and should not visit. Crawlers are supposed to request and download a robots.txt file from any website they visit, and then obey the directives in that file. Of course, there is nothing that prevents a crawler from still crawling pages that are forbidden in a robots.txt file, but most major search engine bots behave themselves.

In many ways, robots.txt files stand out as a legacy from a much earlier time. When was the last time you wrote something for public distribution in a .txt file, anyway? In an age of server-side scripting and content management systems, robots.txt is also one of the few public-facing files a systems administrator will actually edit and maintain by hand, manually adding and removing entries in a text editor. A robots.txt file has no changelog in it, but its revision history would be a partial chronicle of a systems administrator’s interactions with how their website is represented by various search engines.

You can specify different directives for different bots by specifying a user agent, and well-behaved bots are supposed to look for their own user agents in a robots.txt file and follow the instructions left for them. As for my own, I’m sad to report that I simply let all bots through wherever they roam, as I use a sitemap.tar.gz file which a WordPress plugin generates for me on a regular basis and submits to the major search engines. So my robots.txt file just looks like this:

User-agent: *
Allow: /
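To make the per-bot directives concrete, here is a minimal sketch in Python (not the language of this post's own script) of how a well-behaved crawler might consult a robots.txt file, using the standard library's urllib.robotparser. The bot name "examplebot" and the paths are made up for illustration, and real crawlers use their own parsers, which may behave differently.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt with one group for a specific bot
# and a catch-all group for everyone else.
RULES = """\
User-agent: examplebot
Disallow: /drafts/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())  # parse in memory, no network request


def allowed(agent: str, path: str) -> bool:
    """Ask the parser whether `agent` may fetch `path` under RULES."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("examplebot", "/posts/hello"),
        ("examplebot", "/drafts/secret"),
        ("otherbot", "/posts/hello"),
    ]:
        print(agent, path, allowed(agent, path))
```

The parser looks for a group naming the requesting agent first and only falls back to the asterisk group when none matches, which is the behavior the dynamic file described below relies on.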

An interesting thing about contemporary web servers is that file formats no longer really matter as much as they used to. In fact, files don’t even have to exist as they are typically represented in URLs. When your browser requests the page, there is a directory called “wordpress” on my server, but everything after that is a fiction. There is no directory called 2014, no subdirectory called 05, and no file called robots-txt that existed on the server before or after you downloaded it. Rather, when WordPress receives a request to download this non-existent file, it intercepts it and interprets it as a request to dynamically generate a new HTML page on the fly. WordPress queries a database for the content of the post, inserts that into a theme, and then has the server send you that HTML page, along with linked images, stylesheets, and JavaScript files, which often do actually exist as files on a server. The server probably stores the dynamically-generated HTML page in its memory, and sometimes there is caching to pre-generate these pages to make things faster, but other than that, the only time an HTML file of this page ever exists in any persistent form is if you save it to your hard drive.

Yet robots.txt lives on, doing its job well. It doesn’t need any fancy server-side scripting; it does just fine on its own. Still, I kept thinking about what it would be like to have a script dynamically generate a robots.txt file on the fly whenever it is requested. Given that the only time a robots.txt file is usually downloaded is when an automated software agent requests it, there is something strangely poetic about an algorithmically-generated robots.txt file. It is something that would, for the most part, only ever really exist in the fleeting interaction between two automated routines. So of course I had to build one.

The code required to implement this is trivial. First, I needed to modify how my web server interprets requests, so that whenever a request was made to robots.txt, the server would execute a script called robots.php and send the client the output as robots.txt. Modify the .htaccess file to add:

RewriteEngine On
RewriteBase /
RewriteRule ^robots\.txt$ /robots.php [L]

Next, the PHP script itself:

<?php
header("Content-Type: text/plain");
echo "User-agent: *" . "\r\n";
echo "Allow: /" . "\r\n";

Then I realized that this was all a little impersonal, and I could do better since I’m scripting. With PHP, I can easily query the user-agent of the client which is requesting the file, the identifier it sends to the web server. Normally, user agents define the browser that is requesting the page, but bots are supposed to have an identifiable user-agent like “Googlebot” or “Twitterbot” so that you can know them when they come to visit. Instead of granting access to every user agent with the asterisk, I made it so that the user agent of the requesting client is the only one that is directed to have full access.

echo "User-agent: " . $_SERVER['HTTP_USER_AGENT'] . "\r\n";
echo "Allow: /" . "\r\n";

After making sure this worked, I realized that I needed to go a little further. If the bots didn’t recognize themselves, then by default, they would still be allowed to crawl the site anyway, because robots.txt works on a principle of allow by default. So I needed to add a few more lines so that the robots.txt file the bot downloaded would direct all other bots to not crawl the site, but give free rein to bots with the user agent it sent the server.

echo "User-agent: *" . "\r\n";
echo "Disallow: /" . "\r\n";
echo "User-agent: " . $_SERVER['HTTP_USER_AGENT'] . "\r\n";
echo "Allow: /" . "\r\n";

This is what you get if you download it in Chrome:

User-agent: *
Disallow: /
User-agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36 
Allow: /
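For readers who do not run PHP, the same idea can be sketched in Python as a small WSGI application. This is an illustrative port under my own assumptions, not the script running on this site: the names build_robots_txt and robots_app are hypothetical, and the Content-Type header is my addition.

```python
def build_robots_txt(user_agent: str) -> str:
    """Return a robots.txt body that shuts out every bot except the
    one whose User-Agent header was sent with this request."""
    lines = [
        "User-agent: *",
        "Disallow: /",
        f"User-agent: {user_agent}",
        "Allow: /",
    ]
    return "\r\n".join(lines) + "\r\n"


def robots_app(environ, start_response):
    """Minimal WSGI app: every request gets a freshly generated file,
    so no robots.txt ever has to exist on disk."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    body = build_robots_txt(user_agent).encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

To try it locally, one option is the standard library's wsgiref.simple_server.make_server("", 8000, robots_app).serve_forever(), then requesting any path with different User-Agent headers.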

The restrictive version is now live, and I’ve also put it up on GitHub, because apparently that’s what cool kids do. I’m looking forward to seeing what will happen. Google’s webmaster tools will notify me if its crawlers can’t index my site, for whatever reason, and I’m curious whether Google’s bots will identify themselves to my servers in a way that they will recognize.
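While waiting for real crawlers to answer that question, it can be explored with one off-the-shelf parser. The sketch below, in Python, rebuilds the reflected file for a given User-Agent header and asks the standard library's urllib.robotparser whether a bot matching itself under a short name would be let in. The helper names are hypothetical, "examplebot" is made up, and the long Googlebot string is the commonly published form of its header, used here only as sample input.

```python
from urllib.robotparser import RobotFileParser


def reflected_robots_txt(sent_header: str) -> list[str]:
    """Lines of the restrictive file as it would be generated for a
    client that sent `sent_header` as its User-Agent."""
    return [
        "User-agent: *",
        "Disallow: /",
        f"User-agent: {sent_header}",
        "Allow: /",
    ]


def recognizes_itself(sent_header: str, match_token: str) -> bool:
    """True if a parser that identifies the bot as `match_token`
    finds permission in the file reflected for `sent_header`."""
    parser = RobotFileParser()
    parser.parse(reflected_robots_txt(sent_header))
    return parser.can_fetch(match_token, "/any/page")


if __name__ == "__main__":
    # A bot that sends exactly the short name it later matches on.
    print(recognizes_itself("examplebot", "examplebot"))
    # A bot that sends a long browser-style header but matches on a short name.
    googlebot_header = (
        "Mozilla/5.0 (compatible; Googlebot/2.1; "
        "+http://www.google.com/bot.html)"
    )
    print(recognizes_itself(googlebot_header, "Googlebot"))
```

In the CPython versions I am familiar with, the first check finds a match and the second does not, because the parser compares the entire reflected line against the bot's short name. That only describes this one parser; whether Google's own crawler behaves the same way is exactly what the live experiment is meant to find out.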

by stuart at May 14, 2014 01:37 AM

May 12, 2014

MIMS 2012

Why We Hire UI Engineers on the Design Team

This Happy Cog article by Stephen Caver perfectly encapsulates why we hire UI Engineers on the design team at Optimizely (as opposed to the engineering team). We want the folks coding a UI to be involved in the design process from the beginning, to understand the design system that underlies a user experience, and to be empowered to make design decisions while developing a UI. Successful designs must adapt to various contexts and degrade gracefully. The people most qualified to make those kinds of decisions are the ones writing the code. As said in the article, “In this new world, the best thing a developer can do is to acquire an eye for design—to be able to take design aesthetic, realize its essential components, and reinterpret them on the fly.” By embracing this mindset in our hiring and design process, we’ve found the end result is a higher quality product.

by Jeff Zych at May 12, 2014 04:13 AM

May 07, 2014

Ph.D. student


autocatalysis sustains autopoiesis

by Sebastian Benthall at May 07, 2014 05:36 PM

MIMS 2010

Ways to communicate with me that are more effective than leaving a voicemail

  • text message
  • email
  • postal mail
  • Twitter direct message
  • pager
  • skywriting
  • interpretive dance
  • smoke signals
  • drunk carrier pigeon
  • singing telegram
  • sports arena jumbotron
  • tell Suzy to tell Rachel to tell Bill to tell me
  • message in a bottle thrown into ocean
  • give me a telling look
  • send a taxi to pick me up and drive me to the coast where a crewman aboard a ship signals using flag semaphore
  • Western Union
  • telepathy

Expanded from the condensed version. Hat-tip to @ravi and @av.

by Ryan at May 07, 2014 06:18 AM

May 06, 2014

Ph.D. alumna

What if the sexual predator image you have in your mind is wrong?

(I wrote the following piece for Psychology Today under the title “Sexual Predators: The Imagined and the Real.”)

If you’re a parent, you’ve probably seen the creepy portraits of online sexual predators constructed by media: The twisted older man, lurking online, ready to abduct a naive and innocent child and do horrible things. If you’re like most parents, the mere mention of online sexual predators sends shivers down your spine. Perhaps it prompts you to hover over your child’s shoulder or rally your school to host online safety assemblies.

But what if the sexual predator image you have in your mind is wrong? And what if that inaccurate portrait is actually destructive?

When it comes to child safety, the real statistics don’t stop parental worry. Exceptions dominate the mind. The facts highlight how we fail to protect those teenagers who are most at-risk for sexual exploitation online.

If you poke around, you may learn that 1 in 7 children are sexually exploited online. This data comes from the very reputable Crimes Against Children Research Center; however, very few take the time to read the report carefully. Most children are sexually solicited by their classmates, peers, or young adults just a few years older than they are. And most of these sexual solicitations don’t upset teens. Alarm bells should go off over the tiny percentage of youth who are upsettingly solicited by people who are much older than them. No victimization is acceptable, but we need to drill into understanding who is at risk and why if we want to intervene.

The same phenomenal research group, led by David Finkelhor, went on to analyze the recorded cases of sexual victimization linked to the internet and identified a disturbing pattern. These encounters weren’t random. Rather, those who were victimized were significantly more likely to be from abusive homes, grappling with addiction or mental health issues, and/or struggling with sexual identity. Furthermore, the recorded incidents showed a more upsetting dynamic. By and large, these youth portrayed themselves as older online, sought out interactions with older men, talked about sex online with these men, met up knowing that sex was in the cards, and did so repeatedly because they believed that they were in love. These teenagers are being victimized, but the go-to solutions of empowering parents, educating youth about strangers, or verifying the age of adults won’t put a dent in the issue. These youth need professional help. We need to think about how to identify and support those at-risk, not build another ad campaign.

What makes our national obsession with sexual predation destructive is that it is used to justify systematically excluding young people from public life, both online and off. Stopping children from connecting to strangers is seen as critical for their own protection, even though learning to navigate strangers is a key part of growing up. Youth are discouraged from lingering in public parks or navigating malls without parental supervision. They don’t learn how to respectfully and conscientiously navigate new people because they are taught to fear all who are unknown.

The other problem with our obsession with sexual predators is that it distracts parents and educators. Everyone rallies to teach children to look out for and fear rare dangers without giving them the tools for managing more common forms of harm that they might encounter. Far too many young people are raped and sexually victimized in this country. Only a minuscule number of them are harmed at the hands of strangers, online or off. Most who will be abused will suffer at the hands of their classmates and peers.

In a culture of abstinence-only education, schools don’t want to address any aspect of sexual and reproductive health for fear of upsetting parents. As a result, we fail to give young people the tools to handle sexual victimization. When the message is “just say no,” we shame young people who were sexually abused or violated.

It’s high time that we walk away from our nightmare scenarios and focus on addressing the serious injustices that exist. The world we live in isn’t fair and many youth who are most at-risk do not have concerned parents looking out for them. Because we have stopped raising children as a community, adults are often too afraid to step on other parents’ toes. Yet, we need adults who are looking out for more than just their children. Furthermore, our children need us to talk candidly about sexual victimization without resorting to boogeymen.

While it’s important to protect youth from dangers, a society based on fear-mongering is not healthy. Let’s instead talk about how we can help teenagers be passionate, engaged, constructive members of society rather than how we can protect them from statistically anomalous dangers. Let’s understand those teens who are truly at risk; these teens often have the least support.

(This piece was first published at Psychology Today.)

by zephoria at May 06, 2014 01:17 AM

May 04, 2014

Ph.D. student

Re: The Great Works of Software

Hi Paul,

This "Great Works of Software" piece is fantastic. Of course I want to correct it, and I'm sure everyone does and I'm fairly confident that was the intention of it, and getting everyone to reflect and debate the greatest pieces of software is as worthy of an intention for a blog post (even one hosted on Medium) as any I can think of.

I don't dispute any of your five [0], but I was surprised by something: where are the Internet and the Web? Sure, the Web is a little young at 25, but it's old enough to have been declared dead a good handful of times and the Internet calls Word and Photoshop young whippersnappers. Does the Web satisfy your criteria of everyday, meaningful use? Of course. But I'm guessing that you didn't just forget the Web when writing about meaningful software. Instead, I suspect you very intentionally chose [1] to leave these out to illustrate an important point: that the Web isn't a single piece of software in the same sense that the programs you listed are.

The Web is made up of software (and hardware): web server software running on millions of machines all around the world; user agents running on every client machine we can think of (desktop, mobile, laptop, refrigerator); proxies and caching middleboxes; DNS servers; software and firmware running on routers and switches, in Internet Exchange Points and Internet Service Providers; software not included in this classification; crawlers constantly indexing and archiving Web pages; open source libraries which encrypt communications for Transport Layer Security; et cetera. But even if one had an overly-simplified view of Web architecture (and I wouldn't criticize anyone for this; this is the poor-man's Web architecture that I teach students all the time) consisting of servers and browsers, anyone would see that there's no singular piece of software involved. You mentioned the TCP/IP stack as a runner up, but there's no single TCP/IP implementation that's particularly great or important: what's important is that separate implementations of the relevant IETF standards interoperate [2]. Other listmakers included a browser (Kirschenbaum highlighted Mosaic [3]; PC World, Navigator) or you could imagine listing Apache as a canonical server (and the corresponding foundation and software development methodology), but even as important as those pieces were (and are!), alone they just don't make a difference.

As a thought experiment then, I submit a preliminary list for a Web software canon, listing not single pieces of software but systems of software, standards and people.

Non-exhaustive, of course, but I hope it's helpful for your next blog post, which I hope to see on Is there something distinctive to these systems of software that are intrinsically tied up with the communities that use and develop them? Whole publics that are recursive, say [4]? I hope there are a few people out there writing books and dissertations about that. (I should really get back to writing that prospectus.)


[0] Okay, I'm skeptical about Emacs -- isn't the operating system/joining of small software pieces already well-covered by Unix?

[1] By the Principle of Charity.

[2] It might be tempting, for someone who works on Web standards like I do, to claim that the Web is really just a set of interoperable standards, but that's nonsense as soon as I think about it at all. Sure, I think standards are important, but a standard without an implementation is just a bit of text somewhere. And of course, that's not hypothetical at all: standards without widespread implementation are commonplace, and bittersweet.

[3] Also, Kirschenbaum includes Hypercard in his list, with a reference to Vannevar Bush and the Memex, which I love, and it might be the closest in these lists to something that looks like the Web/hypertext but in non-networked single-piece-of-software form.

[4] Kelty, Christopher M. Two Bits: The Cultural Significance of Free Software. Duke University Press Books, 2008.

by at May 04, 2014 05:12 AM


Can you say "sacrilegious"? Can I?

Every year I serve as the judge for an Author's Spelling Bee held to benefit the nonprofit outfit Small Press Distribution in Berkeley, which distributes works (poetry, fiction, journals, translations) published by several hundred  independent presses (check them out, really). My job is a lot less harrowing than being a competitor: I just say the words to the authors and then signal their success (bell) or failure (slide whistle), the latter obliging them to flip over their name tags and withdraw from the competition. The only place where linguistic expertise plays any role is in helping the organizers winnow down the word list they've prepared, suggesting additions and deleting items that are too obscure or that have several alternate spellings (acknowledgment, say). The best words to use are the relatively familiar but tricky ones that trip up many literate people but not everyone—braggadocio, supersede, absorption, minuscule—the object being to neither eliminate everybody on the first go round (which could easily happen with an item like dieffenbachia) nor let the affair run on more than 45 minutes or so. Oh, and I check beforehand to make sure I actually know how to pronounce the words. Which is what sent me to Merriam-Webster to see what they gave for sacrilegious:

That's all?? How about Oxford Dictionaries?

And the redoubtable American Heritage, of whose usage panel I hold the august title of chair emeritus?

Hmm. What's an honest judge to do? Granted, the [-lɪdʒəs] pronunciation (rhymes with prestigious) is by far more common than the historically correct [-li:dʒəs] (rhymes with egregious), which is the only pronunciation given in the OED's first edition (the second accepts both).

That's almost certainly because everyone folk-etymologizes the word as a derivative of religious, rather than of sacrilege, which in turn is why so many people misspell it with the i and e reversed (we lost two or three competitors before Daniel Levin Becker nailed it). And when the preponderance of literate usage favors a particular pronunciation, how could it be other than correct? As H. W. Fowler wrote, "Pronounce as your neighbours do, not better; for words in general use your neighbour is the general public." Indeed, as early as 1912, the author of a book called Correct Pronunciation was enjoining readers to say the word with /i:/ rather than /I/, which means of course that the latter was already common.

But Fowler's rule doesn't carry over to orthography: no dictionary would think of recording the word as sacreligious, even as an alternate. And if you do know how it's spelled, is it right to accede to a pronunciation that is based on what you know to be a misconception about how it's derived and written? Shouldn't the dictionary at least nod to [li:dʒəs] as an alternate, as the OED Second Edition did? It makes you wonder whether the editors themselves fretted over this. Or did the people who recorded the words just use the pronunciation they were familiar with without giving it a thought?

So what to do? I couldn't bring myself to say it as if it were spelled sacreligious, not just because it would have encouraged the competitors to spell it that way, but also because (let's be frank) I couldn't bear to imagine that someone might think I didn't know myself how the word was spelled or where it came from. But I couldn't say it to rhyme with egregious (or for that matter sortilegious), which would not only suggest an unseasonable ostentation of learning (as Johnson once defined pedantry) but would likely have given the orthographic game away. So I sort of swallowed the vowel, and nobody seemed the wiser.

Fortunately, it's not a problem you're apt to encounter outside of a spelling bee. If you need the concept you could go with any of a number of near-synonyms whose pronunciation presents no problems, like impious. And anyway, what would you have done in my place—other, I mean, than judiciously dropping the word from the list?

by Geoff Nunberg at May 04, 2014 04:19 AM

MIMS 2012

Matthew Carter’s “My Life in Typefaces”

I just got around to watching Matthew Carter’s excellent TED talk, “My Life in Typefaces”. In it, he talks about his experience designing type for the past 5 decades, and how technical constraints influenced his designs. The central question he tries to answer is, “Does a constraint force a compromise? By accepting a constraint, are you working to a lower standard?” This is a question that comes up in every discipline, and with every technological change. Matthew Carter’s take on this subject is interesting because he’s experienced numerous technological changes, and has designed superb typefaces for all of them.

At first blush, it’s easy to conclude that constraints force designers to compromise their vision. But design isn’t produced in a vacuum, and ultimately must be realized through one or more mediums (print, screen, radio, etc.). Therefore, one must work within constraints to produce the best designs. To do so, designers must understand the technology that enables their designs to be experienced, be it code, the printing process, and so on. As Matthew Carter said in this talk, “I’m a pragmatist, not an idealist, out of necessity,” which is a valuable lesson that all designers should take to heart.

by Jeff Zych at May 04, 2014 03:24 AM