School of Information Blogs

December 20, 2014

Ph.D. student

Discourse theory of law from Habermas

There has been at least one major gap in my understanding of Habermas’s social theory which I’m just filling now. The position Habermas reaches towards the end of Theory of Communicative Action vol 2 and develops further in later work in Between Facts and Norms (1992) is the discourse theory of law.

What I think went on is that Habermas eventually gave up on deliberative democracy in its purest form. After a career of scholarship about the public sphere, the ideal speech situation, and communicative action, fully developing the lifeworld as the ground for legitimate norms, he eventually had to make a concession to “the steering media” of money and power as necessary for the organization of society at scale. But at the intersection between lifeworld and system is law. Law serves as a transmission belt between legitimate norms established by civil society and “system”; at its best it is both efficacious and legitimate.

Law is ambiguous: it can serve legitimate citizen interests united in communicative solidarity, and it can also serve strong, powerful interests. But it’s where the action is, because it’s where Habermas sees the ability for lifeworld to counter-steer the whole political apparatus towards legitimacy, including shifting the balance of power between lifeworld and system.

This is interesting because:

  • Habermas is like the last living heir of the Frankfurt School mission and this is a mature and actionable view nevertheless founded in the Critical Theory tradition.
  • If you pair it with Lessig’s Code is Law thesis, you get a framework for thinking about how technical mediation of civil society can be legitimate but also efficacious. I.e., code can be legitimized discursively through communicative action. Arguably, this is how a lot of open source communities work, as well as standards bodies.
  • Thinking about managerialism as a system of centralized power that provides a framework of freedoms within it, Habermas seems to be presenting an alternative model where law or code evolves with the direct input of civil stakeholders. I’m fascinated by where Nick Doty’s work on multistakeholderism in the W3C is going and think there’s an alternative model in there somewhere. There’s a deep consistency in this, noted a while ago (2003) by Froomkin but largely unacknowledged as far as I can tell in the Data and Society or Berkman worlds.

I don’t see in Habermas anything about funding the state. That would mean acknowledging military force and the power to tax. But this is progress for me.


Zurn, Christopher. “Discourse theory of law”, in Jürgen Habermas: Key Concepts, edited by Barbara Fultner.

by Sebastian Benthall at December 20, 2014 04:43 AM

December 15, 2014

Ph.D. student

Some research questions

Last week was so interesting. Some weeks you just get exposed to so many different ideas that it’s hard to integrate them. I tried to articulate what’s been coming up as a result. It comes down to several difficult questions.

  • Assuming trust is necessary for effective context management, how does one organize sociotechnical systems to provide social equity in a sustainable way?
  • Assuming an ecology of scientific practices, what are appropriate selection mechanisms (or criteria)? Are they transcendent or immanent?
  • Given the contradictory character of emotional reality, how can psychic integration occur without rendering one dead or at least very boring?
  • Are there limitations of the computational paradigm imposed by data science as an emerging pan-constructivist practice coextensive with the limits of cognitive or phenomenological primitives?

Some notes:

  • I think that two or three of these questions above may be in essence the same question, in that they can be formalized into the same mathematical problem, and the solution is the same in each case.
  • I really do have to read Isabelle Stengers and Nancy Nersessian. Based on the signals I’m getting, they seem to be the people most on top of their game in terms of understanding how science happens.
  • I’ve been assuming that trust relations are interpersonal but I suppose they can be interorganizational as well, or between a person and an organization. This gets back to a problem I struggle with in a recurring way: how do you account for causal relationships between a macro-organism (like an organization or company) and a micro-organism? I think it’s when there are entanglements between these kinds of entities that we are inclined to call something an “ecosystem”, though I learned recently that this use of the term bothers actual ecologists (no surprise there). The only things I know about ecology are from reading Ulanowicz papers, but those have been so on point and beautiful that I feel I can proceed with confidence anyway.
  • I don’t think there’s any way to get around having at least a psychological model to work with when looking at these sorts of things. A recurring and promising angle is that of psychic integration. Carl Jung, who has inspired clinical practices that I can personally vouch for, and Gregory Bateson both understood the goal of personal growth to be integration of disparate elements. I’ve learned recently from Turner’s The Democratic Surround that Bateson was a more significant historical figure than I thought, unless Turner’s account of history is a glorification of intellectuals that appeal to him, which is entirely possible. Perhaps more importantly to me, Bateson inspired Ulanowicz, and so these theories are compatible; Bateson was also a cyberneticist following Wiener, who was prescient and either foundational to contemporary data science or a good articulator of its roots. But there is also a tie-in to constructivist epistemology. DiSessa’s epistemology, building on Piaget but embracing what he calls the computational metaphor, understands the learning of math and physics as the integration of phenomenological primitives.
  • The purpose of all this is ultimately protocol design.
  • This does not pertain directly to my dissertation, though I think it’s useful orienting context.

by Sebastian Benthall at December 15, 2014 07:03 AM

Ph.D. student

what i'm protesting for

[Meta: Is noise an appropriate list for this conversation? I hope so, but I take no offense if you immediately archive this message.]

I’ve been asked, what are you protesting for? Black lives matter, but what should we do about it, besides asking cops not to shoot people?

Well, I think there’s value in marching and protesting even without a specific goal. If you’ve been pushed to the edge for so long, you need some outlet for anger and frustration and I want to respect that and take part to demonstrate solidarity. I see good reason to have sympathy even for protesters taking actions I wouldn’t support.

As Jay Smooth puts it, "That unrest we saw […] was a byproduct of the injustice that preceded it." Or MLK, Jr: "I think that we've got to see that a riot is the language of the unheard."

If you’re frustrated with late-night protests that express that anger in ways that might include destruction of property, I encourage you to vote with your feet and attend daytime marches. I was very pleased to run into iSchool alumni at yesterday afternoon’s millions march in Oakland. Families were welcome and plenty of children were in attendance.

But I also think it’s completely reasonable to ask for some pragmatic ends. To really show that black lives matter, we must take action that decreases the number of these thoughtless deaths. There are various lists of demands you can find online (I link to a few below). But below I’ll list four demands I’ve seen that resonate with me (and aren’t Missouri-specific). This is what I’m marching for, and will keep marching for. (It is obviously not an exhaustive list or even the highest priority list for people who face this more directly than I; it's just my list.) If our elected and appointed government leaders, most of whom are currently doing nothing to lead or respond to the massive outpouring of popular demand, want there to be justice and want protesters to stop protesting, I believe this would be a good start.

* Special Prosecutor for All Deadly Force Cases

Media have reported extensively on the “grand jury decisions” in St. Louis County and in Staten Island. I believe this is a misnomer. Prosecutors, who regularly work very closely with their police colleagues in bringing charges, have taken extraordinary means in the alleged investigations of Darren Wilson and Daniel Pantaleo not to obtain indictments. Bob McCulloch in St. Louis spent several months presenting massive amounts of evidence to the grand jury, and presented conflicting or even grossly inaccurate information about the laws governing police use of force. A typical grand jury in such a case would have taken a day: presentation of forensic evidence showing several bullets fired into an unarmed man and a couple of eyewitnesses describing the scene; that’s more than enough evidence to get an indictment on a number of different charges and have a proper public trial. Instead, McCulloch sought to have a sort of closed door trial where Wilson himself testified (unusual for a grand jury), and then presented as much evidence as he could during the announcement of non-indictment in defense of the police officer. That might sound like a sort of fair process with the evidence, but it’s actually nothing like a trial, because we don’t have both sides represented, we don’t have public transparency, we don’t have counsel cross-examining statements and the like. If regular prosecutors (who work with these police forces every day) won’t actually seek indictments in cases where police kill unarmed citizens, states need to set a formal requirement that independent special prosecutors will be appointed in cases of deadly force.

More info:
The Demands
Gothamist on grand juries

* Police Forces Must Be Representative and Trained

States and municipalities should take measures so that their police forces are representative of the communities they police. We should be suspicious of forces where the police are not residents of the towns where they serve and protect or where the racial makeup is dramatically different from the population. In Ferguson, Missouri, for example, a mostly white police force serves a mostly black town, and makes significant revenue by extremely frequently citing and fining those black residents. Oakland has its own issues with police who aren’t residents (in part, I expect, because of the high cost of living here). But I applaud Oakland for running police academies in order to give the sufficient training to existing residents so they can become officers. Training might also be one way to help with racial disparities in policing. Our incoming mayor, Libby Schaaf, calls for an increase in “community policing”. I’m not sure why she isn’t attending and speaking up at these protests and demonstrating her commitment to implementing such changes in a city where lack of trust in the police has been a deep and sometimes fatal problem.

More info:
The Demands
Libby Schaaf on community policing
FiveThirtyEight on where police live
WaPo on police race and ethnicity
Bloomberg on Ferguson ticketing revenues

* The Right to Protest

Police must not use indiscriminate violent tactics against non-violent protesters. I was pleased to have on-the-ground reports from our colleague Stu from Berkeley this past week. The use of tear gas, a chemical weapon, against unarmed and non-violent student protesters is particularly outrageous. If our elected officials want our trust, they need to work on coordinating the activities of different police departments and making it absolutely clear that police violence is not an acceptable response to non-violent demonstration.

Did the Oakland PD really not even know about the undercover California “Highway” Patrol officers who were walking with protesters at a march in Oakland and then wildly waved a gun at the protesters and media when they were discovered? Are police instigating vandalism and violence among protesters?

In St. Louis, it seemed to be a regular problem that no one knew who was in charge of the law enforcement response to protesters, and we seem to be having the same problem when non-Berkeley police are called in to confront Berkeley protesters. Law enforcement must make it clear who is in charge and to whom crimes and complaints about police brutality can be reported.

More info:
Tweets on undercover cops
staeiou on Twitter

* Body Cameras for Armed Police

The family of Michael Brown has said:

Join with us in our campaign to ensure that every police officer working the streets in this country wears a body camera.

This is an important project, one that has received support even from the Obama administration, and one where the School of Information can be particularly relevant. While it’s not clear to me that all or even most law enforcement officials need to carry firearms at all times, we could at least ask that those officers use body-worn cameras to improve transparency about events where police use potentially deadly force against civilians. The policies, practices and technologies used for those body cameras and the handling of that data will be particularly important, as emphasized by the ACLU. Cameras are no panacea — the killings of Oscar Grant, Eric Garner, Tamir Rice and others have been well-captured by various video sources — but at least some evidence shows that increased transparency can decrease use of violence by police and help absolve police of complaints where their actions are justified.

More info:
ACLU on body cameras
White House fact sheet on policing reforms proposal
NYT on Rialto study of body cameras

Finally, here are some of the lists of demands that I’ve found informative or useful:
The Demands
Millions March Demands, via Facebook
Millions March NYC Demands, via Twitter
MillionsMarchOakland Demands, via Facebook

I have valued so much the conversations I’ve been able to have in this intellectual community about these local protests and the ongoing civil rights struggle. I hope these words can contribute something, anything to that discussion. I look forward to learning much more.


by at December 15, 2014 12:50 AM

December 10, 2014

Ph.D. student

Discovering Thomas Sowell #blacklivesmatter

If you come up with a lot of wrong ideas and pay a price for it, you are forced to think about it and change your ways or else be eliminated. But there is no such test. The only test for most intellectuals is whether other intellectuals go along with them. And if they all have a wrong idea, then it becomes invincible.

On Sunday night I walked restlessly through the streets of Berkeley while news helicopters circled overhead and sirens wailed. For the second night in a row I saw lines of militarized police. Texting with a friend who had participated in the protests the night before about how he was assaulted by the cops, I walked down Shattuck counting smashed shop windows. I discovered a smoldering dumpster. According to Bernt Wahl, who I bumped into outside of a shattered RadioShack storefront, there had been several fires started around the city; he was wielding a fire extinguisher, advising people to walk the streets to prevent further looting.

The dramatic events around me and the sincere urgings of many deeply respected friends that I join the public outcry against racial injustice made me realize that I could no longer withhold judgment on the Brown and Garner cases and the responses to them. I have reserved my judgment, unwilling to follow the flow of events as told play-by-play by journalists because, frankly, I don’t trust them. As I was discussing this morning with somebody in my lab, real journalism takes time. You have to interview people, assemble facts. That’s not how things are being done around these highly sensitive and contentious issues. In The Democratic Surround, Fred Turner writes about how in the history of the United States, psychologists and social scientists once thought the principal mechanism by which fascism spread was through the mass media’s skillful manipulation of their audience’s emotions. Out of their concern for mobilizing the American people to fight World War II, the state sponsored a new kind of domestic media strategy that aimed to give its audience the grounds to come to its own rational conclusions. That media strategy sustained what we now call “the greatest generation.” These principles seem to be lacking in journalism today.

I am a social scientist, so when I started to investigate the killings thoroughly, the first thing I wanted to see was numbers. Specifically, I wanted to know the comparative rates of police killings broken down by race so I could understand the size of the problem. The first article I found on this subject was Jack Kelly’s article in Real Clear Politics, which I do not recommend you read. It is not a sensitively written piece and some of the sources and arguments he uses signal, to me, a conservative bias.

What I do highly recommend you read are two of Kelly’s sources, which he doesn’t link to but which are both in my view excellent. One is Pro Publica’s research into the data about police violence and the killings of young men. It gave me a sense of proportion I needed to understand the problems at hand.

Thomas Sowell

The other is this article on Michael Brown published last Saturday by Thomas Sowell, who has just skyrocketed to the top of my list of highly respected people. Sowell is far more accomplished than I will ever be and of much humbler origins. He is a military veteran and apparently a courageous scholar. He is now Senior Fellow at the Hoover Institution at Stanford University. Though I am at UC Berkeley and say this very grudgingly, as I write this blog post I am slowly coming to understand that Stanford might be a place of deep and sincere intellectual inquiry, not just the preprofessional school spitting out entrepreneurial drones whose caricature I had come to believe.

Sowell’s claim is that the grand jury has determined that Brown was guilty of assaulting the officer who shot him, that this judgment was based on the testimony of several black witnesses. He notes the tragedy of the riots related to the event and accuses the media of misrepresenting the facts.

So far I have no reason to doubt Sowell’s sober analysis of the Brown case. From what I’ve heard, the Garner case is more horrific and I have not yet had the stomach to work my way through its complexities. Instead I’ve looked more into Sowell’s scholarly work. I recommend watching this YouTube video of him discussing his book Intellectuals and Race in full.

I don’t agree with everything in this video, and not just because much of what Sowell says is the sort of thing I “can’t say”. I find the interviewer too eager in his guiding questions. I think Sowell does not give enough credence to the prison industrial complex and ignores the recent empirical work on the value of diversity–I’m thinking of Scott Page’s work in particular. But Sowell makes serious and sincere arguments about race and racism with a rare historical awareness. In particular, he is critical of the role of intellectuals in making race relations in the U.S. worse. As an intellectual myself, I think it’s important to pay attention to this criticism.

by Sebastian Benthall at December 10, 2014 03:30 PM

December 09, 2014

Ph.D. alumna

Data & Civil Rights: What do we know? What don’t we know?

From algorithmic sentencing to workplace analytics, data is increasingly being used in areas of society that have had longstanding civil rights issues.  This prompts a very real and challenging set of questions: What does the intersection of data and civil rights look like? When can technology be used to enable civil rights? And when are technologies being used in ways that undermine them? For the last 50 years, civil rights has been a legal battle.  But with new technologies shaping society in new ways, perhaps we need to start wondering what the technological battle over civil rights will look like.

To get our heads around what is emerging and where the hard questions lie, the Data & Society Research Institute, The Leadership Conference on Civil and Human Rights, and New America’s Open Technology Institute teamed up to host the first “Data & Civil Rights” conference.  For this event, we brought together diverse constituencies (civil rights leaders, corporate allies, government agencies, philanthropists, and technology researchers) to explore how data and civil rights are increasingly colliding in complicated ways.

In preparation for the conversation, we dove into the literature to see what is known and unknown about the intersection of data and civil rights in six domains: criminal justice, education, employment, finance, health, and housing.  We produced a series of primers that contextualize where we’re at and what questions we need to consider.  And, for the conference, we used these materials to spark a series of small-group moderated conversations.

The conference itself was an invite-only event, with small groups brought together to dive into hard issues around these domains in a workshop-style format.  We felt it was important that we make available our findings and questions.  Today, we’re releasing all of the write-ups from the workshops and breakouts we held, the videos from the level-setting opening, and an executive summary of what we learned.  This event was designed to elicit tensions and push deeper into hard questions. Much is needed for us to move forward in these domains, including empirical evidence, innovation, community organizing, and strategic thinking.  We learned a lot during this process, but we don’t have clear answers about what the future of data and civil rights will or should look like.  Instead, what we learned in this process is how important it is for diverse constituencies to come together to address the challenges and opportunities that face us.

Moving forward, we need your help.  We need to go beyond hype and fear, hope and anxiety, and deepen our collective understanding of technology, civil rights, and social justice. We need to work across sectors to imagine how we can create a more robust society, free of the cancerous nature of inequity. We need to imagine how technology can be used to empower all of us as a society, not just the most privileged individuals.  This means that computer scientists, software engineers, and entrepreneurs must take seriously the costs and consequences of inequity in our society. It means that those working to create a more fair and just society need to understand how technology works.  And it means that all of us need to come together and get creative about building the society that we want to live in.

The material we are releasing today is a baby step, an attempt to scope out the landscape as best we know it so that we can all work together to go further and deeper.  Please help us imagine how we should move forward.  If you have any ideas or feedback, don’t hesitate to contact us at nextsteps at

(Image by Mark K.)

by zephoria at December 09, 2014 04:47 PM

December 07, 2014

MIMS 2012

Warm Gun 2014 Conference Notes

This year’s Warm Gun conference was great, just like last year’s. This year generally seemed to be about using design to generate and validate product insights, e.g. through MVPs, prototyping, researching, etc.

Kickoff (Jared Spool)

Jared Spool’s opening talk focused mostly on MVPs and designing delight into products. To achieve this, he recommended the Kano Model and Dana Chisnell’s Three Approaches to Delight (adding pleasure e.g. through humorous copy, improving the flow, and meaning e.g. believing in a company’s mission [hardest to achieve]).

You’re Hired! Strategies for Finding the Perfect Fit (Kim Goodwin)

This was a great talk about how to hire a great design team, which is certainly no easy task (as I’ve seen at Optimizely).

  • Hiring is like dating on a deadline – after a couple of dates, you have to decide whether or not to marry the person!
  • You should worry more about missing the right opportunity, rather than avoiding the wrong choice

5 Lessons She’s Learned

  1. Hire with a long-term growth plan in mind
    • Waiting until you need someone to start looking is too late (it takes months to find the right person)
    • What kind of designers will you need? Do you want generalists (can do everything, but not great at any one thing; typically needed by small startups) or specialists (great at one narrow thing, like visual design; typically needed by larger companies)?
    • Grow (i.e. mentor junior people) or buy talent? Training junior people isn’t cheap - it takes a senior person’s time.
      • A healthy team has a mix of skill levels (she recommends a ratio of 1 senior:4 mid:2 junior). (Optimizely isn’t far off – we mostly lack a senior member!)
    • A big mistake she sees a lot of startups make is to hire too junior of a person too early
  2. Understand the Market
    • The market has 5 cohorts: low skill junior folks (think design is knowing tools); spotty/developing skills; skilled specialists; generalists; and team leads
    • Senior == able to teach others, NOT a title (there’s lots of title inflation in the startup industry). Years of experience does NOT make a person senior. Many people with “senior” in their title have holes in their skills (especially if they’ve only worked on small teams at startups)
    • 5 years experience only at a design consultancy == somewhat valuable (lots of mentorship opportunities from senior folks, but they lack experience working continuously/iteratively on a product)
    • 5 years experience mixed at design consultancy and on an in-house team == best mix of skills (worked with senior folks, and on a single product continuously)
    • 5 years only on small startup teams == less valuable than the other two; is a red flag. There are often holes in the skills. They’re often “lone rangers” who haven’t worked with senior designers, or a team, and probably developed bad habits. Often have an inflated self-assessment of their skills and don’t know how to take feedback. (uh-oh! I’m kind of in this group)
    • It takes craft to assess craft
    • Alternate between hiring leads and those who can learn from the leads (i.e. a mix of skill levels)
    • Education matters - school can fill in gaps of knowledge. Schools produce different types of people (HCI, visual, etc.). (yay for me!)
  3. Make 2 lists before posting the job
    • First, list the responsibilities a person will have (what will they do?)
    • Second, list the skills they need to achieve the first list.
    • Turn these 2 lists into a job posting (avoid listing tools in the hiring criteria - that is dumb)
    • DON’T look for someone who has experience designing your exact product in your industry already. The best designers can quickly adapt to different contexts (better to hire a skilled designer with no mobile experience than a junior/mid designer with ONLY mobile experience - there’s ramp-up time, but that’s negligible for a skilled designer)
    • Junior to senior folks progress through this: Knows concepts -> Can design effectively w/coach -> Can design solo -> Can teach others
    • On small/inexperienced teams, watch out for the “Similar to me” effect. Designers new to hiring/interviewing will evaluate people against themselves, rather than evaluate their actual skills or potential. (Can ask, “Where were you relative to this person when you were hired?” to combat this).
  4. Evaluate Based on Skills You Need
    • Resumes == bozo filter
    • Look at the goals, process, role, results, lessons learned, things they’d do differently (we’re pretty good at this at Optimizely!)
    • Do “behavioral interviewing”, i.e. focus on specifics of actual behavior. Look at their actual work, do design exercises, ask “Tell me about a time when…”. It’s a conversation, not an interrogation. (Optimizely is also good at this!)
  5. Be Real to Close the Deal
    • Be honest about what you’re looking for
    • If you have doubts about a person, discuss them directly with the candidate to try to overcome them (or confirm them). (We need to do this more at Optimizely)

Product Strategy in a Growing Company (Des Traynor)

This was one of my favorite talks. Product strategy is hard, and it’s really easy to say “Yes” to every idea and feature request. One of my favorite bits was that you need to say “No” because something’s not in the product vision. If you don’t ever say this, then you have no product vision. (This has been a challenge at Optimizely at times).

  • Software is eating the world!
  • We’re the ones who control the software that’s available to everyone.
  • Niece asked, “Where do products come from?”. There are 5 reasons a product is built:
    1. Product visionary
    2. Customer-focused (built to solve pain point(s))
    3. Auteur (art + business)
    4. Scratching own itch (you see a need in the marketplace)
    5. Copy a pattern (e.g. “Uber for X!”)
  • (Optimizely started as scratching own itch, but is adapting to customer-focused)
  • Scope: scalpel vs. swiss army knife
    • When first starting, build a scalpel (it’s the only way to achieve marketshare when starting)
    • Gall’s Law: complex systems evolve from simple ones (like WWW. Optimizely is also evolving towards complexity!). You can’t set out to design a complex system from scratch (think Google Wave)
  • A simple product !== making a product simple (i.e. an easy to use product isn’t necessarily simple [difference between designing a whole UX vs. polishing a specific UI]).
    • Simplify a product by removing steps. Watch out for Scopi-locks (i.e. scope the problem just right – not too big, not too small). You don’t want to solve steps of a flow that are already solved by a market leader, or when there are multiple solutions already in use by people (e.g. don’t rebuild email, there’s already Gmail and Outlook and Mailbox, etc.)
  • How to fight feature creep
    • Say “No” by default
    • Have a checklist new ideas must go through before you build them, e.g. (this is a subset):
      • Does it benefit all customers?
      • Is the value worth the effort?
      • Does it improve existing features? Does it increase engagement across the system, or divide it?
      • If a feature takes off, can we afford it? (E.g. if you have a contractor build an Android app, how will you respond to customer feedback and bugs?)
      • Is it low effort for the customer to use, and does it result in high value? (E.g. Circles in G+ fails this criterion - no one wants to manage friends like this)
      • It’s been talked about forever; it’s easy to build; we’ve put a lot of effort in already == all bad reasons to ship a new feature
    • (Optimizely has failed at this a couple of times. E.g. Preview As. On the other hand, Audiences met these criteria)
    • Once you ship something, it’s really hard to take back. Even if customers complain about it, there is always a minority that will be really angry if you take it away.

Hunches, Instincts, and Trusting Your Gut (Leah Buley)

This was probably my least favorite talk. The gist of it is that as a designer, there are times you need to be an expert and just make a call using your gut (colleagues and upper management need you to be this person sometimes). We have design processes to follow, but there are always points at which you need to make a leap of faith and trust your gut. I agree with those main points, but this talk lost me by focusing only on visual design. She barely mentioned anything about user goals or UX design.

Her talk was mainly about “The Gut Test”, which is a way of looking at a UI (or print piece, or anything that has visual design) and assessing your gut reaction to it. This is useful for visual design, but won’t find issues like, “Can the user accomplish their goal? Is the product/feature easy to use?” (Something can be easy to use, but fail her gut test). It’s fine that she didn’t tackle these issues, but I wish she would have acknowledged more explicitly that the talk was only addressing a thin slice of product design.

  • In the first 50ms of seeing something, we have an immediate visceral reaction to things
  • Exposure effect: the more we see something, the more we like it (we lose our gut reaction to it)
  • To combat this, close your eyes for 5 seconds, then look at a UI and ask these questions:
    • What do you notice first?
    • How does it make you feel (if anything)? What words would you use to describe it?
    • Is it prototypical? (i.e. does it conform to known patterns). Non-conformity == dislike and distrust
  • Then figure out what you can do to address any issues discovered.

Real Life Trust Online (Mara Zepeda)

This talk was interesting, but not directly applicable to my everyday job at Optimizely. The gist of it was how do we increase trust in the world, and not just in the specific product or company we’re using? For example, when you buy or sell something successfully on Craigslist, your faith in humanity increases a little bit. Reviews on Amazon, by contrast, increase your trust in that product and in Amazon, but not necessarily in your fellow human beings.

  • Before trust is earned, there’s a moment of vulnerability and an optimism about the future.
  • Trust gets built up in specific companies (e.g. Craigslist - there’s no reason to trust the company or site, but trust in humans and universe increases when a transaction is successful).
  • Social networks don’t create networks of trust in the real world
  • Switchboard MVP was a phone hotline
    • LinkedIn: ask for job connections, no one responds. But if you call via Switchboard, people are more likely to respond (there’s a real human connection)
    • They’re trying to create a trust network online
  • To build trust:
    • Humanize the transaction (e.g. make it person to person)
    • Give a victory lap (i.e. celebrate when the system works)
    • Provide allies / mentors along the journey (i.e. people who are invested in the journey being successful, and can celebrate the win)
  • She brought up the USDA’s “Hay Net” as an example of this. It connects those who have Hay with those who need Hay (and vice versa). UI had two options: “Have Hay” and “Need Hay”, which I find hilarious and amazing.

Designing for Unmet Needs (Steve Portigal)

Steve Portigal’s talk was solid, but it didn’t really tell me anything I didn’t already know. The gist of it was there are different types of research (generative v. evaluative), you need to know which is appropriate for your goals (although it’s a spectrum, not a dichotomy), and there are ways around anything you think is stopping you (e.g. no resources; no users; no product; etc.). The two most interesting points to me were:

  • Create provocative examples/prototypes/mocks to show people and gather responses (he likened this to a scientist measuring reactions to new stimuli). Create a vision of the future and see what people think of it, find needs, iterate, adapt. Go beyond iterative improvements to an existing product or UI (we’re starting to explore this technique at Optimizely now).
  • Build an infrastructure for ongoing research. This is something that’s been on my mind for a while, since we’re very reactive in our research needs at Optimizely. I’d like us to have more continual, ongoing research that’s constantly informing product decisions.

Redesigning with Confidence (Jessica Harllee)

This was a cool talk that described the process Etsy went through to redesign the seller onboarding experience, and how they used data to be confident in the final result. The primary goal was increasing the shop open rate, while maintaining the products listed per seller. They a/b tested a new design that increased the open rate, but had fewer products listed per seller. They made some tweaks, a/b tested again, and found a design that increased the shop open rate while maintaining the number of products listed per seller. Which means more products are listed on Etsy overall!

I didn’t learn much new from this talk, but I enjoy hearing these stories. It also got me thinking about how we don’t a/b test much in the product at Optimizely. A big reason is that it takes too long to get significant results (as Jessica mentioned in her talk, they had to run both tests for months, and the overall project took over a year). Another reason is that when developing new features, there aren’t any direct metrics to compare. Since Jessica’s example was a redesign, they could directly compare behavior of the redesign to the original.
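To make the "takes too long to get significant results" point concrete, here is a minimal sketch of a two-proportion z-test, one common way to compare conversion rates between two variants. It is not Etsy's or Optimizely's actual method, and the function name and visitor counts are made up for illustration. The same relative lift that is indistinguishable from noise at a thousand visitors per variant becomes significant at ten thousand, which is why low-traffic flows need tests that run for months.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates.

    conv_a / n_a is the control rate, conv_b / n_b is the variant rate.
    Uses the pooled-variance normal approximation.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: a 10% -> 11% lift at two different traffic levels.
small_z, small_p = two_proportion_z_test(100, 1_000, 110, 1_000)
large_z, large_p = two_proportion_z_test(1_000, 10_000, 1_100, 10_000)
print(f"1k visitors per variant:  z={small_z:.2f}, p={small_p:.3f}")
print(f"10k visitors per variant: z={large_z:.2f}, p={large_p:.3f}")
```

With the conventional 0.05 threshold, only the larger sample clears the bar, even though the observed lift is identical.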

Designing for Startup Problems (Braden Kowitz)

Braden’s talk was solid, as usual, but since I’ve seen him talk before and read his blog I didn’t get much new out of it. His talk was about how Design (and therefore, designers) can help build a great company (beyond just UIs). Most companies think of design at the “surface level”, i.e. visual design, logos, etc., but at its core design is about product and process and problem solving. Design can help at the highest levels.

  • 4 Skills Designers Need:
    1. Know where to dig
      • Map the problem
      • Identify the riskiest part (e.g. does anyone need this product or feature at all?)
      • Design to answer that question. Find the cheapest/simplest/fastest thing you can create to answer this question (fake as much as you can to avoid building a fully working artifact)
    2. Get dirty
      • Prototype as quickly as possible (colors, polish, etc., aren’t important)
      • Focus on the most important part, e.g. the user flow, layout, copy, etc. Use templates/libraries to save time
      • Use deadlines (it’s really easy to polish a prototype forever)
    3. Pump in fresh data
      • Your brain fills in gaps in data, so have regular research and testing (reinforces Portigal’s points nicely)
    4. Take big leaps
      • Combine the above 3 steps to generate innovative solutions to problems

Accomplish Big Goals with Objectives & Key Results (Christina Wodtke)

This was an illuminating talk about the right way to create and evaluate OKRs. I didn’t hear much I hadn’t already heard (we use OKRs at Optimizely and have discussed best practices). But to recap:

  • Objective == Your Dream, Your Goal. It’s hard. It’s qualitative.
  • Key Results == How you know you reached your goal. They’re quantitative. They’re measurable. They’re not tasks (it’s something you can put on a dashboard and measure over time, e.g. sales numbers, adoption, etc.).
  • Focus: only have ONE Objective at a time, and measure it with 3 Key Results. (She didn’t talk about how to scale this as companies get bigger. I wish she had.)
  • Measure it throughout the quarter so you can know how you’re tracking. Don’t wait until the end of the quarter.
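The structure described above (one qualitative Objective, three measurable Key Results, checked throughout the quarter) can be sketched as a small data model. This is only an illustration: the class names, the example objective, and the choice to average Key Result progress into a single score are assumptions, not something from the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KeyResult:
    description: str
    start: float    # metric value at the start of the quarter
    target: float   # metric value that counts as success
    current: float  # latest measured value

    def progress(self) -> float:
        """Fraction of the way from start to target, clamped to [0, 1]."""
        span = self.target - self.start
        if span == 0:
            return 1.0
        return max(0.0, min(1.0, (self.current - self.start) / span))


@dataclass
class Objective:
    statement: str  # qualitative and ambitious
    key_results: List[KeyResult] = field(default_factory=list)

    def progress(self) -> float:
        """Simple average of Key Result progress (an assumed scoring rule)."""
        if not self.key_results:
            return 0.0
        return sum(kr.progress() for kr in self.key_results) / len(self.key_results)


# Hypothetical mid-quarter check-in.
objective = Objective(
    "Make onboarding delightful",
    [
        KeyResult("Activation rate (%)", start=10, target=20, current=15),
        KeyResult("Signups per week", start=0, target=100, current=50),
        KeyResult("Onboarding steps rewritten", start=0, target=4, current=4),
    ],
)
for kr in objective.key_results:
    print(f"{kr.description}: {kr.progress():.0%}")
print(f"Overall: {objective.progress():.0%}")
```

Because each Key Result is a number with a start and a target, it can be recomputed whenever the metric is measured rather than only at the end of the quarter.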

Thought Experiments for Data-Driven Design (Aviva Rosenstein)

This was an illuminating talk about the right way to incorporate data into the decision making process. You need to find a balance between researching/measuring to death, and making a decision. She used DocuSign’s feedback button as a good example of this.

  • Don’t research to death — try something and measure the result (but make an educated guess).
  • DocuSign tried to roll their own “Feedback” button (rather than using an existing service). They gave the user a text box to enter feedback, and submitting it sent it to an email alias (not stored anywhere; not categorized at all).
    • This approach became a data deluge
    • There was no owner of the feedback
    • Users entered all kinds of stuff in that box that shouldn’t have gone there (e.g. asking for product help). People use the path of least resistance to get what they want. (I experienced this with the feedback button in the Preview tool)
  • Data should lead to insight (via analysis and interpretation)
  • Collecting feedback by itself has no ROI (can be negative because if people feel their feedback is being ignored they get upset)
  • Aviva’s goal: find a feedback mechanism that’s actually useful
  • Other feedback approaches:
    • Phone/email conversation (inefficient, hard to track)
    • Social media (same as above; biased)
    • Ad hoc survey/focus groups (not systematic; creating good surveys is time consuming)
  • Feedback goals:
    1. Valid: trustworthy and unbiased
    2. Delivery: goes to the right person/people
    3. Retention: increase overall company knowledge; make it available when needed
    4. Usable: can’t pollute customer feedback
    5. Scalable: easy to implement
    6. Contextual: gather feedback in context of use
  • They changed their feedback mechanism slightly by asking users to bucket the feedback first (e.g. “Billing problems”, “Positive feedback”, etc.), then emailed it to different places. This made it more actionable.
  • Doesn’t need to be “Ready -> Fire -> Aim”: we can use data and the double diamond approach to inform the problem, and make our best guess.
    • This limits collateral damage from not aiming. A poorly aimed guess can mar the user experience, which users don’t easily forget.
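The bucketing change described above can be sketched in a few lines: the user picks a category first, and that category decides where the feedback goes. The bucket names "Billing problems" and "Positive feedback" come from the talk notes; the "Product help" bucket, the email addresses, and the function names are hypothetical, and this is not DocuSign's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical routing table: bucket chosen by the user -> owning team's inbox.
ROUTES = {
    "Billing problems": "billing@example.com",
    "Positive feedback": "product@example.com",
    "Product help": "support@example.com",
}
# Anything unrecognized still gets an owner instead of vanishing into one alias.
FALLBACK = "feedback-triage@example.com"


@dataclass
class Feedback:
    bucket: str
    text: str


def route(feedback: Feedback) -> str:
    """Return the address that should receive this piece of feedback."""
    return ROUTES.get(feedback.bucket, FALLBACK)


print(route(Feedback("Billing problems", "I was charged twice.")))
print(route(Feedback("Something else", "Where is the export button?")))
```

The routing table is what addresses the "Delivery" goal: every piece of feedback arrives with a category and lands with a team that owns it.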

Growing Your Userbase with Better Onboarding (Samuel Hulick)

This was one of my favorite talks of the day (and not only because Samuel gave Optimizely a sweet shout out). I didn’t learn a ton new from it, but Samuel is an entertaining speaker. His pitch is basically that the first run experience is important, and needs to be thought about at the start of developing a product (not tacked on right before launch).

  • “Onboarding” is often just overlaying a UI with coach marks. But there’s very little utility in this.
  • Product design tends to focus on the “flying” state, once someone is using a system. Empty states, and new user experiences, are often tacked on.
  • You need to start design with where the users start
  • Design Recommendations
    • Show a single, action-oriented tooltip at a time (Optimizely was his example of choice here!)
      • Ask for signup when there’s something to lose (e.g. after you’ve already created an experiment)
      • Assume guided tours will be skipped, i.e. don’t rely on them to make your product usable
    • Use completion meters to get people fully using a product
    • Keep in mind that users don’t buy products, they buy better versions of themselves (Mario + fire flower), and use this as the driving force to get people fully engaged with your product
    • Provide positive reinforcement when they complete steps! (Emails can help push them along)

Fostering Effective Collaboration in a Global Environment (PJ McCormick)

PJ’s talk was just as good this year as it was last year. He gave lots of great tips for increasing collaboration and trust among teams (especially the engineering and design teams), which is also a topic that has been on my mind recently.

  • His UX team designs retail pages (e.g. discover new music page). In one case, he presented work to the stakeholders and dev team, who then essentially killed the project. What went wrong? Essentially, it was a breakdown of communication and he didn’t include the dev team early enough.
  • Steps to increasing collaboration:
    1. Be accessible and transparent
      • Put work up on public walls so everyone can see progress (this is something I want to do more of)
      • Get comfortable showing work in progress
      • Demystify the black box of design
    2. Listen
      • Listen to stakeholders’ and outside team members’ opinions and feedback (you don’t have to incorporate it, but make sure they know they’re being heard)
    3. Be a Person
      • On this project, the communication was primarily through email or bug tracking, which lacks tone of voice, body language, etc.
      • There was no real dialog. Talk more face to face, or over phone. (I have learned this again and again, and regularly walk over to someone to hash things out at the first hint of contention in an email chain. It’s both faster to resolve and maintains good relations among team members)
    4. Work with people, not at them
      • He should have included stakeholders and outside team members in the design process.
      • Show them the wall; show UX studies; show work in progress
      • Help teach people what design is (this is hard. I want to get better at this)

A question came up about distributed teams, since much of his advice hinges on face to face communication. I’ve been struggling with this (actually, the whole company has), and his recommendations are in line with what we’ve been trying: use a webcam + video chat to show walls (awkward; not as effective as in person), and take pictures/digitize artifacts to share with people (has the side effect of being archived for the future, but introduces the problem of discoverability).

And that’s all! (Actually, I missed the last talk…). Overall, a great conference that I intend to go back to next year.

by Jeff Zych at December 07, 2014 02:53 AM

December 04, 2014

Ph.D. student

Notes on The Democratic Surround; managerialism

I’ve been greatly enjoying Fred Turner’s The Democratic Surround partly because it cuts through a lot of ideological baggage with smart historical detail. It marks a turn, perhaps, in what intellectuals talk about. The critical left has been hung up on neoliberalism for decades while the actual institutions that are worth criticizing have moved on. It’s nice to see a new name for what’s happening. That new name is managerialism.

Managerialism is a way to talk about what Facebook and the Democratic Party and everybody else providing a highly computationally tuned menu of options is doing without making the mistake of using old metaphors of control to talk about a new thing.

Turner is ambivalent about managerialism perhaps because he’s at Stanford and so occupies an interesting position in the grand intellectual matrix. He’s read his Foucault, he explains when he speaks in public, though he is sometimes criticized for not being critical enough. I think ‘critical’ intellectuals may find him confusing because he’s not deploying the same ‘critical’ tropes that have been used since Adorno even though he’s writing sometimes about Adorno. He is optimistic, or at least writes optimistically about the past, or at least writes about the past in a way that isn’t overtly scathing which is just more upbeat than a lot of writing nowadays.

Managerialism is, roughly, the idea of technocratically bounded space of complex interactive freedom as a principle of governance or social organization. In The Democratic Surround, he is providing a historical analysis of a Bauhaus-initiated multimedia curation format, the ‘surround’, to represent managerialist democracy in the same way Foucault provided a historical analysis of the Panopticon to represent surveillance. He is attempting to implant a new symbol into the vocabulary of political and social thinkers that we can use to understand the world around us while giving it a rich and subtle history that expands our sense of its possibilities.

I’m about halfway through the book. I love it. If I have a criticism of it it’s that everything in it is a managerialist surround and sometimes his arguments seem a bit stretched. For example, here’s his description of how John Cage’s famous 4’33” is a managerialist surround:

With 4’33”, as with Theater Piece #1, Cage freed sounds, performers, and audiences alike from the tyrannical wills of musical dictators. All tensions–between composer, performer, and audience; between sound and music; between the West and the East–had dissolved. Even as he turned away from what he saw as more authoritarian modes of composition and performance, though, Cage did not relinquish all control of the situation. Rather, he acted as an aesthetic expert, issuing instructions that set the parameters for action. Even as he declined the dictator’s baton, Cage took up a version of the manager’s spreadsheet and memo. Thanks to his benevolent instructions, listeners and music makers alike became free to hear the world as it was and to know themselves in that moment. Sounds and people became unified in their diversity, free to act as they liked, within a distinctly American musical universe–a universe finally freed of dictators, but not without order.

I have two weaknesses as a reader. One is a soft spot for wicked vitriol. Another is an intolerance of rhetorical flourish. The above paragraph is rhetorical flourish that doesn’t make sense. Saying that 4’33” is a manager’s spreadsheet is just about the most nonsensical metaphor I could imagine. In a universe with only fascists and managerialists, then I guess 4’33” is more like a memo. But there are so many more apt musical metaphors for unification in diversity in music. For example, a blues or jazz band playing a standard. Literally any improvisational musical form. No less quintessentially American.

If you bear with me and agree that this particular point is poorly argued and that John Cage wasn’t actually a managerialist and was in fact the Zen spiritualist that he claimed to be in his essays, then either Turner is equating managerialism with Zen spiritualism or Turner is trying to make Cage a symbol of managerialism for his own ideological ends.

Either of these is plausible. Steve Jobs was an I Ching enthusiast like Cage. Stewart Brand, the subject of Turner’s last book, From Counterculture to Cyberculture, was a back-to-land commune enthusiast before he became a capitalist digerati hero. Running through Turner’s work is the demonstration of the cool origins of today’s world that’s run by managerialist power. We are where we are today because democracy won against fascism. We are where we are today because hippies won against whoever. Sort of. Turner is also frank about capitalist recuperation of everything cool. But this is not so bad. Startups are basically like co-ops–worker owned until the VCs get too involved.

I’m a tech guy, sort of. It’s easy for me to read my own ambivalence about the world we’re in today into Turner’s book. I’m cool, right? I like interesting music and read books on intellectual history and am tolerant of people despite my connections to power, right? Managers aren’t so bad. I’ve been a manager. They are necessary. Sometimes they are benevolent and loved. That’s not bad, right? Maybe everything is just fine because we have a mode of social organization that just makes more sense now than what we had before. It’s a nice happy medium between fascism, communism, anarchism, and all the other extreme -ism’s that plagued the 20th century with war. People used to starve to death or kill each other en masse. Now they complain about bad management or, more likely, bad customer service. They complain as if the bad managers are likely to commit a war crime at any minute but that’s because their complaints would sound so petty and trivial if they were voiced without the use of tropes that let us associate poor customer service with deliberate mind-control propaganda or industrial wage slavery. We’ve forgotten how to complain in a way that isn’t hyperbolic.

Maybe it’s the hyperbole that’s the real issue. Maybe a managerialist world lacks catastrophe and so is so frickin’ boring that we just don’t have the kinds of social crises that a generation of intellectuals trained in social criticism have been prepared for. Maybe we talk about how things are “totally awesome!” and totally bad because nothing really is that good or that bad and so our field of attention has contracted to the minute, amplifying even the faintest signal into something significant. Case in point, Alex from Target. Under well-tuned managerialism, the only thing worth getting worked up about is that people are worked up about something. Even if it’s nothing. That’s the news!

So if there’s a critique of managerialism, it’s that it renders the managed stupid. This is a problem.

by Sebastian Benthall at December 04, 2014 02:45 AM

December 01, 2014

MIMS 2012

Optimizely's iOS SDK Hits Version 1.0!

On November 18th, 2014, Optimizely officially released version 1.0 of our iOS SDK and a new mobile editing experience. As the lead designer of this project, I’m extremely proud of the progress we’ve made. This is just the beginning: there’s a lot more work to come! Check out the product video below:

Stay tuned for future posts about the design process.

by Jeff Zych at December 01, 2014 02:13 AM

November 29, 2014

Ph.D. student

textual causation

A problem that’s coming up for me as a data scientist is the problem of textual causation.

There has been significant interesting research into the problem of extracting causal relationships between things in the world from text about those things. That’s an interesting problem but not the problem I am talking about.

I am talking about the problem of identifying when a piece of text has been the cause of some event in the world. So, did the State of the Union address affect the stock prices of U.S. companies? Specifically, did the text of the State of the Union address affect the stock price? Did my email cause my company to be more productive? Did specifically what I wrote in the email make a difference?

A trivial example of textual causation (if I have my facts right–maybe I don’t) is the calculation of Twitter trending topics. Millions of users write text. That text is algorithmically scanned and under certain conditions, Twitter determines a topic to be trending and displays it to more users through its user interface, which also uses text. The user interface text causes thousands more users to look at what people are saying about the topic, increasing the causal impact of the original text. And so on.
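As a toy illustration of that loop, and emphatically not Twitter's actual algorithm, a "trending" rule can be sketched as a comparison of hashtag counts between two time windows. The function name, the thresholds, and the sample posts are all invented for the example.

```python
from collections import Counter


def trending_hashtags(previous_posts, current_posts, min_count=3, growth=2.0):
    """Return hashtags whose use in the current window jumped versus the previous one.

    A tag trends if at least `min_count` current posts use it and that count is
    at least `growth` times its count in the previous window (treating a
    previous count of zero as one, so new tags can still trend).
    """
    def count_tags(posts):
        counts = Counter()
        for post in posts:
            # Count each tag at most once per post.
            tags = {word for word in post.lower().split() if word.startswith("#")}
            counts.update(tags)
        return counts

    previous = count_tags(previous_posts)
    current = count_tags(current_posts)
    return sorted(
        tag
        for tag, count in current.items()
        if count >= min_count and count >= growth * max(previous[tag], 1)
    )


earlier = ["Getting ready for #sotu tonight", "Lunch was great #food"]
now = ["#SOTU is starting", "Watching the #sotu", "Big claims in the #sotu", "More #food"]
print(trending_hashtags(earlier, now))
```

Even in this toy version the causal structure is visible: text written by users determines what the function returns, and whatever is displayed from that result prompts more text about the same topic.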

These are some challenges to understanding the causal impact of text:

  • Text is an extraordinarily high-dimensional space with tremendous irregularity in distribution of features.
  • Textual events are unique not just because the probability of any particular utterance is so low, but also because the context of an utterance is informed by all the text prior to it.
  • For the most part, text is generated by a process of unfathomable complexity and interpreted likewise.
  • A single ‘piece’ of text can appear and reappear in multiple contexts as distinct events.

I am interested in whether it is possible to get a grip on textual causation mathematically and with machine learning tools. Bayesian methods theoretically can help with the prediction of unique events. And the Pearl/Rubin model of causation is well integrated with Bayesian methods. But is it possible to use the Pearl/Rubin model to understand unique events? The methodological uses of Pearl/Rubin I’ve seen are all about establishing type causation between independent occurrences. Textual causation appears to be as a rule a kind of token causation in a deeply integrated contextual web.

Perhaps this is what makes the study of textual causation uninteresting. If it does not generalize, then it is difficult to monetize. It is a matter of historical or cultural interest.

But think about all the effort that goes into communication at, say, the operational level of an organization. How many jobs require “excellent communication skills”? A great deal of emphasis is placed not only on the fact that communication happens, but on how people communicate.

One way to approach this is using the tools of linguistics. Linguistics looks at speech and breaks it down into components and structures that can be scientifically analyzed. It can identify when there are differences in these components and structures, calling these differences dialects or languages.

by Sebastian Benthall at November 29, 2014 04:49 PM

analysis of content vs. analysis of distribution of media

A theme that keeps coming up for me in work and conversation lately is the difference between analysis of the content of media and analysis of the distribution of media.

Analysis of content looks for the tropes, motifs, psychological intentions, unconscious historical influences, etc. of the media. Over Thanksgiving a friend of mine was arguing that the Scorpions were a dog whistle to white listeners because that band made a deliberate move to distance themselves from the influence of black music on rock. Contrast this with Def Leppard. He reached this conclusion by listening carefully to the beats and contextualizing them in historical conversations that were happening at the time.

Analysis of distribution looks at information flow and the systemic channels that shape it. How did the telegraph change patterns of communication? How did television? Radio? The Internet? Google? Facebook? Twitter? Ello? Who is paying for the distribution of this media? How far does the signal reach?

Each of these views is incomplete. Just as data underdetermines hypotheses, media underdetermines its interpretation. In both cases, a more complete understanding of the etiology of the data/media is needed to select between competing hypotheses. We can’t truly understand content unless we understand the channels through which it passes.

Analysis of distribution is more difficult than analysis of content because distribution is less visible. It is much easier to possess and study data/media than it is to possess and study the means of distribution. The means of distribution are a kind of capital. Those that study it from the outside must work hard to get anything better than a superficial view of it. Those on the inside work hard to get a deep view of it that stays up to date.

Part of the difficulty of analysis of distribution is that the system of distribution depends on the totality of information passing through it. Communication involves the dynamic engagement of both speakers and an audience. So a complete analysis of distribution must include an analysis of content for every piece of implicated content.

One thing that makes the content analysis necessary for analysis of distribution more difficult than what passes for content analysis simpliciter is that the former needs to take into account incorrect interpretation. Suppose you were trying to understand the popularity of Fascist propaganda in pre-WWII Germany and were interested in how the state owned the mass media channels. You could initially base your theory simply on how people were getting bombarded by the same information all the time. But you would at some point need to consider how the audience was reacting. Was it stirring feelings of patriotic national identity? Did they experience communal feelings with others sharing similar opinions? As propaganda provided interpretations of Shakespeare saying he was secretly a German and denunciation of other works as “degenerate art”, did the audience believe this content analysis? Did their belief in the propaganda allow them to continue to endorse the systems of distribution in which they took part?

This shows how the question of how media is interpreted is a political battle fought by many. Nobody fighting these battles is an impartial scientist. Since one gets an understanding of the means of distribution through impartial science, and since this understanding of the means of distribution is necessary for correct content analysis, we can dismiss most content analysis as speculative garbage, from a scientific perspective. What this kind of content analysis is instead is art. It can be really beautiful and important art.

On the other hand, since distribution analysis depends on the analysis of every piece of implicated content, distribution analysis is ultimately hopeless without automated methods for content analysis. This is one reason why machine learning techniques for analyzing text, images, and video are such a hot research area. While the techniques for optimizing supply chain logistics (for example) are rather old, the automated processing of media is a more subtle problem precisely because it involves the interpretation and reinterpretation by finite subjects.

By “finite subject” here I mean subjects that are inescapably limited by the boundaries of their own perspective. These limits are what make their interpretation possible and also what make their interpretation incomplete.

by Sebastian Benthall at November 29, 2014 04:16 PM

November 26, 2014

Ph.D. student

things I’ve been doing while not looking at twitter

Twitter was getting me down so I went on a hiatus. I’m still on that hiatus. Instead of reading Twitter, I’ve been:

  • Reading Fred Turner’s The Democratic Surround. This is a great book about the relationship between media and democracy. Since a lot of my interest in Twitter has been because of my interest in the media and democracy, this gives me those kinds of jollies without the soap opera trainwreck of actually participating in social media.
  • Going to arts events. There was a staging of Rhinoceros at Berkeley. It’s an absurdist play in which a small French village is suddenly stricken by an epidemic wherein everybody is transformed into a rhinoceros. It’s probably an allegory for the rise of Communism or Fascism but the play is written so that it’s completely ambiguous. Mainly it’s about conformity in general, perhaps ideological conformity but just as easily about conformity to non-ideology, to a state of nature (hence, the animal form, rhinoceros.) It’s a good play.
  • I’ve been playing Transistor. What an incredible game! The gameplay is appealingly designed and original, but beyond that it is powerfully written and atmospheric. In many ways it can be read as a commentary on the virtual realities of the Internet and the problems with them. Somehow there was more media attention to GamerGate than to this one actually great game. Too bad.
  • I’ve been working on papers, software, and research in anticipation of the next semester. Lots of work to do!

Above all, what’s great about unplugging from social media is that it isn’t actually unplugging at all. Instead, you can plug into a smarter, better, deeper world of content where people are more complex and reasonable. It’s elevating!

I’m writing this because some time ago it was a matter of debate whether or not you can ‘just quit Facebook’ etc. It turns out you definitely can and it’s great. Go for it!

(Happy to respond to comments but won’t respond to tweets until back from the hiatus)

by Sebastian Benthall at November 26, 2014 10:02 PM

November 14, 2014

Ph.D. alumna

Heads Up: Upcoming Parental Leave

If you’ve seen me waddle onto stage lately, you’ve probably guessed that I’m either growing a baby or an alien. I’m hoping for the former, although contemporary imaging technologies still do make me wonder. If all goes well, I will give birth in late January or early February. Although I don’t publicly talk much about my son, this will be #2 for me and so I have both a vague sense of what I’m in for and no clue at all. I avoid parenting advice like the plague so I’m mostly plugging my ears and singing “la-la-la-la” whenever anyone tells me what I’m in for. I don’t know, no one knows, and I’m not going to pretend like anything I imagine now will determine how I will feel come this baby’s arrival.

What I do know is that I don’t want to leave any collaborator or partner in the lurch since there’s a pretty good chance that I’ll be relatively out of commission (a.k.a. loopy as all getup) for a bit. I will most likely turn off my email firehose and give collaborators alternate channels for contacting me. I do know that I’m not taking on additional speaking gigs, writing responsibilities, scholarly commitments, or other non-critical tasks. I also know that I’m going to do everything possible to make sure that Data & Society is in good hands and will continue to grow while I wade through the insane mysteries of biology. If you want to stay in touch with everything happening at D&S, please make sure to sign up for our newsletter! (You may even catch me sneaking into our events with a baby.)

As an employee of Microsoft Research who is running an independent research institute, I have a ridiculous amount of flexibility in how I navigate my parental leave. I thank my lucky stars for this privilege on a regular basis, especially in a society where we force parents (and especially mothers) into impossible trade-offs. What this means in practice for me is that I refuse to commit to exactly how I’m going to navigate parental leave once #2 arrives. Last time, I penned an essay “Choosing the ‘Right’ Maternity Leave Plan” to express my uncertainty. What I learned last time is that the flexibility to be able to work when it made sense and not work when I’d been up all night made me more happy and sane than creating rigid leave plans. I’m fully aware of just how fortunate I am to be able to make these determinations and how utterly unfair it is that others can’t. I’m also aware of just how much I love what I do for work and, in spite of folks telling me that work wouldn’t matter as much after having a child, I’ve found that having and loving a child has made me love what I do professionally all the more. I will continue to be passionately engaged in my work, even as I spend time welcoming a new member of my family to this earth.

I don’t know what the new year has in store for me, but I do know that I don’t want anyone who needs something from me to feel blindsided. If you need something from me, now is the time to holler and I will do my best. I’m excited that my family is growing and I’m also ecstatic that I’ve been able to build a non-profit startup this year. It’s been a crazy year and I expect that 2015 will be no different.

by zephoria at November 14, 2014 03:35 PM

November 10, 2014

Ph.D. alumna

me + National Museum of the American Indian

I’m pleased to share that I’m joining the Board of Trustees of Smithsonian’s National Museum of the American Indian (NMAI) in 2015.  I am honored and humbled by the opportunity to help guide such an esteemed organization full of wonderful people who are working hard to create a more informed and respectful society.

I am not (knowingly) of Native descent, but as an American who has struggled to make sense of our history and my place in our cultural ecosystem, I’ve always been passionate about using the privileges I have to make sure that our public narrative is as complex as our people.  America has a sordid history and out of those ashes, we have a responsibility to both remember and do right by future generations.  When the folks at NMAI approached me to see if I were willing to use the knowledge I have about technology, youth, and social justice to help them imagine their future as a cultural institution, the answer was obvious to me.

Make no mistake – I have a lot to learn.  I cannot and will not speak on behalf of Native peoples or their experiences. I’m joining this Board, fully aware of how little I know about the struggles of Indians today, but I am doing so with a deep appreciation of their stories and passions. I am coming to this table to learn from those who identify as Native and Indian with the hopes that what I have to offer as a youth researcher, technologist, and committed activist can be valuable. As an ally, I hope that I can help the Museum accomplish its dual mission of preserving and sharing the culture of Native peoples to advance public understanding and empower those who have been historically disenfranchised.

I am still trying to figure out how I will be able to be most helpful, but at the very least, please feel free to use me to share your thoughts and perspectives that might help NMAI advance its mission and more actively help inform and shape American society. I would also greatly appreciate your help in supporting NMAI’s education initiatives through a generous donation. In the United States, these donations are tax deductible.

by zephoria at November 10, 2014 04:10 PM

November 09, 2014

Ph.D. alumna

What is Fairness?

What is “fairness”? And what happens when technology decides?

Fairness is one of those values that Americans love to espouse. It’s just as beloved in technical circles, where it’s often introduced as one of the things that “neutral” computers do best. We collectively perceive ourselves and our systems to be fair and push against any assertion that our practices are unfair. But what do we even mean by fairness in the first place?

In the United States, fairness has historically been a battle between equality and equity. Equality is the notion that everyone should have an equal opportunity. It’s the core of meritocracy and central to the American Dream. Preferential treatment is seen as antithetical to equality and the root of corruption. And yet, as civil rights leaders have long argued, we don’t all start out from the same place. Privilege matters. As a result, we’ve seen historical battles over equity, arguing that fairness is only possible when we take into account systemic marginalization and differences of ability, opportunity, and access. When civil rights leaders fought for equity in the 60s, they were labeled communists. Still, equity-based concepts like “affirmative action” managed to gain traction. Today, we’ve shifted from communism to socialism as the anti-equity justification. Many purposefully refuse to acknowledge that people don’t start out from the same position and take offense at any effort to right historical wrongs through equity-based models. Affirmative action continues to be dismantled and the very notion of reparations sends many into a tizzy.

Beyond the cultural fight over equality vs. equity, a new battle to define fairness has emerged. Long normative in business, a market logic of fairness is moving beyond industry to increasingly become our normative understanding of fairness in America.

To understand market-driven models of fairness, consider frequent flyer programs. If you are high status on Delta, you get all sorts of privileges. You don’t have to pay $25 to check a bag, you get better seats and frequent upgrades, you get free food and extra services, etc. etc. We consider this fair because it enables businesses to compete. Delta cares to keep you as a customer because they rely on you spending a lot more over the year or lifetime of the program than you cost in terms of perks. Bob, on the other hand, isn’t that interesting to Delta if he only flies once a year and isn’t even eligible for the credit card. Thus, Bob doesn’t get the perks and is, in effect, charged more for equivalent services.

What happens when this logic of fairness alters the foundations of society? Consider financial services where business rubs up against something so practical — and seemingly necessary — as housing. Martha Poon has done phenomenal work on the history of FICO scores which originally empowered new populations to get access to credit. These days, FICO scores are used for many things beyond financial services, but even in the financial services domain, things aren’t as equitable as one might think. The scores are not necessarily fair and their usage introduces new problems. If you’re seeking a loan and you have a better score than Bob, you pay a lower interest rate. This is considered acceptable because you are a lower risk than Bob. But just like Delta wants to keep you as a customer, so does Chase. And so they start to give you deals to compete with other banks for your business. In effect, they minimize the profit they make directly off of the wealthiest because they need high end customers for secondary and competitive reasons. As a result, not only is Bob burdened with the higher interest loans, but all of the profits are also made off of him as well.

For a moment, let’s turn away from business-based environments altogether and think more generally about how allocation of scarce resources is beginning to unfold thanks to computational systems that can distribute resources “fairly.” Consider, for example, what’s happening with policing practices, especially as computational systems allow precincts to distribute their officers “fairly.” In many jurisdictions, more officers are placed into areas that are deemed “high risk.” This is deemed to be appropriate at a societal level. And yet, people don’t think about the incentive structures of policing, especially in communities where the law is expected to clear so many warrants and do so many arrests per month. When they’re stationed in algorithmically determined “high risk” communities, they arrest in those communities, thereby reinforcing the algorithms’ assumptions.

Addressing modern day redlining equivalents isn’t enough. Statistically, if your family members are engaged in criminal activities, there’s a high probability that you will be too. Is it fair to profile and target individuals based on their networks if it will make law enforcement more efficient?

Increasingly, tech folks are participating in the instantiation of fairness in our society. Not only do they produce the algorithms that score people and unevenly distribute scarce resources, but the fetishization of “personalization” and the increasingly common practice of “curation” are, in effect, arbiters of fairness.

The most important thing that we all need to recognize is that how fairness is instantiated significantly affects the very architecture of our society. I regularly come back to a quote by Alistair Croll:

Our social safety net is woven on uncertainty. We have welfare, insurance, and other institutions precisely because we can’t tell what’s going to happen — so we amortize that risk across shared resources. The better we are at predicting the future, the less we’ll be willing to share our fates with others. And the more those predictions look like facts, the more justice looks like thoughtcrime.

The market-driven logic of fairness is fundamentally about individuals at the expense of the social fabric. Not surprisingly, the tech industry — very neoliberal in cultural ideology — embraces market-driven fairness as the most desirable form of fairness because it is the model that is most about individual empowerment. But, of course, this form of empowerment is at the expense of others. And, significantly, at the expense of those who have been historically marginalized and ostracized.

We are collectively architecting the technological infrastructure of this world. Are we OK with what we’re doing and how it will affect the society around us?

(This post was originally published on September 3, 2014 in The Message on Medium.)

by zephoria at November 09, 2014 04:11 PM

November 06, 2014

MIMS 2010

Updating Bulk Data in CourtListener…Again

I wrote a few weeks ago about our new system for creating bulk files in CourtListener. The system was pretty good. The goal was and is to efficiently create one bulk file for every jurisdiction-object pair in the system. So, that means one bulk file for oral arguments from the Supreme Court, another for opinions from the Ninth Circuit Court of Appeals, another for dockets from Alabama’s appellate court, etc. We have about 350 jurisdictions and four different object types right now, for a total of about 1,400 bulk files.

This system needs to be fast.

The old system that I wrote about before would create 350 open file handles at a time, and then would iterate over each item in the database, adding it to the correct file as it inspected the item. This was a beautiful system because it only had to iterate over the database once, but even after performance tuning, it still took about 24 hours. Not good enough.

I got to thinking that it was terrible to create the entire bulk archive over and over when in reality only a few items change each day. So I created a bug to make bulk data creation incremental.

This post is about that process.

The First Approach

The obvious way to do this kind of thing is to grab the bulk files you already have (as gz-compressed tar files), and add the updated items to those files. Well, I wrote up code for this, tested it pretty thoroughly and considered it done. Only to realize that, like regular files, when you create a compressed tar file with a command like this…

tar_files['all'] = tarfile.open('all.tar.gz', mode='w:gz')

…it clobbers any old files that might have the same name. So much for that.

Next Approach

Well, it looked like we needed append mode for compressed tar files, but alas, from the documentation:

Note that ‘a:gz’, ‘a:bz2’ or ‘a:xz’ is not possible.

“a:gz” means a gz-compressed tar file in append mode, so much for that idea. Next!
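To see the limitation first-hand, here is a minimal sketch in Python 3 (the snippets in this post are Python 2). The file name is only a placeholder, and the helper function is made up for the demonstration. As far as I can tell, the standard library rejects the mode before it touches the disk:

```python
import os
import tarfile
import tempfile


def try_append_gz(path):
    """Return the error message raised when opening a .tar.gz in append mode."""
    try:
        tarfile.open(path, mode='a:gz')
    except ValueError as exc:
        return str(exc)
    return None


# 'all.tar.gz' is only a placeholder name for this demonstration.
message = try_append_gz(os.path.join(tempfile.mkdtemp(), 'all.tar.gz'))
print(message)
```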

Next Approach

Well, you can’t make gz-compressed tar files in append mode, but you can create tar files in append mode as step one, then compress them as step two. I tried this next, and again, it looked like it was working…until I realized that my tar files contained copy after copy after copy of each file. I was hoping that it’d simply clobber files that were previously in the file, but instead it was just putting multiple files of the same name into the tar.

Perhaps I can delete from the tar file before adding items back to it? Nope, that’s not possible either. Next idea?
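The duplication is easy to reproduce. The following is a small sketch in Python 3, not the CourtListener code: the helper name, the file names and the JSON payloads are all invented for illustration. It appends the same member name twice and then lists the archive:

```python
import io
import os
import tarfile
import tempfile


def append_json(tar_path, name, payload):
    """Append one in-memory JSON document to an uncompressed tar file."""
    data = payload.encode('utf-8')
    info = tarfile.TarInfo(name)
    info.size = len(data)  # addfile() copies exactly this many bytes
    with tarfile.open(tar_path, mode='a') as tar:
        tar.addfile(info, io.BytesIO(data))


tar_path = os.path.join(tempfile.mkdtemp(), 'all.tar')
append_json(tar_path, '1.json', '{"version": 1}')
append_json(tar_path, '1.json', '{"version": 2}')  # same name, "updated" item

with tarfile.open(tar_path) as tar:
    names = tar.getnames()
print(names)  # both copies are listed: ['1.json', '1.json']
```

Both versions stay in the archive, which is exactly the copy-after-copy behavior described above.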

Final Approach

I was feeling pretty frustrated by now, but there was one more approach, and that was to add an intermediate step. Instead of creating the tar files directly in Python, I could save the individual json files I was previously putting into the tar file to disk, then create the compressed tar files directly from those once they’re all created. We proved earlier that Python has no issues about clobbering items on disk, so that’ll work nicely for incremental bulk files, which will just clobber old versions of the files.

From performance analyses of the code, most of the bottleneck is in serializing JSON. With this change, serialization happens at most once per item in the database, and most of the remaining work is making tar files and gz-compressing them.
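Roughly, the two steps look like the sketch below. It is written in Python 3 under assumed names: write_item, build_tarball, the directory layout and the sample items are placeholders, not the actual CourtListener code. Step one writes (or overwrites) one JSON file per item; step two packs whatever is on disk into a fresh compressed tar:

```python
import json
import os
import tarfile
import tempfile


def write_item(json_dir, item_id, item):
    """Serialize one item to <json_dir>/<item_id>.json, replacing any old copy."""
    path = os.path.join(json_dir, '%s.json' % item_id)
    with open(path, 'w') as f:
        json.dump(item, f)
    return path


def build_tarball(json_dir, tar_path):
    """Pack every JSON file in json_dir into a fresh gz-compressed tar."""
    with tarfile.open(tar_path, mode='w:gz') as tar:
        for name in sorted(os.listdir(json_dir)):
            tar.add(os.path.join(json_dir, name), arcname=name)


work_dir = tempfile.mkdtemp()
json_dir = os.path.join(work_dir, 'json')
os.mkdir(json_dir)

# Initial run: serialize everything once.
write_item(json_dir, 1, {'id': 1, 'text': 'first version'})
write_item(json_dir, 2, {'id': 2, 'text': 'unchanged'})
# Incremental run: only the changed item is re-serialized.
write_item(json_dir, 1, {'id': 1, 'text': 'second version'})

tar_path = os.path.join(work_dir, 'all.tar.gz')
build_tarball(json_dir, tar_path)

with tarfile.open(tar_path) as tar:
    names = tar.getnames()
    first = json.load(tar.extractfile('1.json'))
```

Because the tarball is rebuilt from the files on disk every time, the archive only ever holds the latest version of each item.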


I was hoping that I would be able to easily update items inside a compressed tar file, or even inside an uncompressed tar file, but that doesn’t seem possible.

I was hoping that I could create these files while iterating over the database, as described in the first approach, but that’s not doable either.

At the end of the day, the final method is just to write things to disk. Simple beats complicated this time, even when it comes to performance.

by Mike Lissner at November 06, 2014 08:00 AM

Updating Bulk Data in CourtListener

Update: I’ve written another post about how the solution presented here wasn’t fast enough and didn’t work out. You may want to read it as well.

There’s an increasing demand for bulk data from government systems, and while this will generate big wins for transparency, accountability, and innovation (at the occasional cost of privacy [1]), it’s important to consider a handful of technical difficulties that can come along with creating and providing such data. Do not misread this post as me saying, “bulk data is hard, don’t bother doing it.” Rather, like most of my posts, read this as an in-the-trenches account of issues we’ve encountered and solutions we’ve developed at CourtListener.

The Past and Present

For the past several years we’ve had bulk data at CourtListener, but to be frank, it’s been pretty terrible in a lot of ways. Probably the biggest issue with it was that we created it as a single massive XML file (~13GB, compressed!). That made a lot of sense for our backend processing, but people consuming the bulk data complained that it crashed 32 bit systems [2], consumed memory endlessly, couldn’t be decompressed on Windows [2], etc.

On top of these issues for people consuming our bulk data, and even though we set it up to be efficient for our servers, we did a stupid thing when we set it up and made it so our users could generate bulk files whenever they wanted for any day, month, year or jurisdiction. And create bulk files they did. Indeed in the year since we started keeping tabs on this, people made nearly fifty thousand requests for time-based bulk data [3].

On the backend, the way this worked was that the first time somebody wanted a bulk file, they requested it, we generated it, and then we served it. The second time somebody requested that same file, we just served the one we generated before, creating a disk-based cache of sorts. This actually worked pretty well but it let people start long-running processes on our server that could degrade the performance of the front end. It wasn’t great, but it was a simple way to serve time- and jurisdiction-based files. [4]

As any seasoned developer knows, the next problem with such a system would be cache invalidation. How would we know that a cached bulk file had bad data and how would we delete it if necessary? Turns out this wasn’t so hard, but every time we changed (or deleted) an item in our database we had code that went out to the cache on disk and deleted any bulk files that might contain stale data. Our data doesn’t change often, so for the most part this worked, but it’s the kind of spaghetti code you want to avoid. Touching disk whenever an item is updated? Not so good.
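For readers who haven’t built one of these, the pattern is roughly the following. This is a simplified Python 3 sketch, not our old code: the function names, the cache key and the fake generator are all hypothetical. A file is generated on the first request, served from disk afterwards, and deleted whenever an item that might be in it changes:

```python
import os
import tempfile


def get_bulk_file(cache_dir, key, generate):
    """Serve the cached bulk file for key, generating it on the first request."""
    path = os.path.join(cache_dir, key)
    if not os.path.exists(path):
        with open(path, 'w') as f:
            f.write(generate(key))
    return path


def invalidate(cache_dir, stale_keys):
    """Delete cached bulk files that might contain stale data."""
    for key in stale_keys:
        path = os.path.join(cache_dir, key)
        if os.path.exists(path):
            os.remove(path)


calls = []  # records every time the (expensive) generator actually runs


def fake_generate(key):
    calls.append(key)
    return '<bulk data for %s>' % key


cache_dir = tempfile.mkdtemp()
get_bulk_file(cache_dir, '2014-11.xml', fake_generate)  # generated
get_bulk_file(cache_dir, '2014-11.xml', fake_generate)  # served from the cache
invalidate(cache_dir, ['2014-11.xml'])                  # an item changed
get_bulk_file(cache_dir, '2014-11.xml', fake_generate)  # generated again
```

The invalidate call is the part that ends up wired into every save and delete, which is where the disk-touching spaghetti comes from.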

And there were bugs. Weird ones.

Yeah, the old system kind of sucked. The last few days I’ve been busy re-working our bulk data system to make it more reliable, easier to use and just overall, better.

The New System

Let’s get the bad news taken care of off the bat: The new system no longer allows date-based bulk files. Since these could cause performance issues on the front end, and since nobody opposed the change, we’ve done away with this feature. It had a good life, may it RIP.

The good news is that by getting rid of the date-based bulk files, we’ve been able to eliminate a metric ton of complexity, literally! No longer do we need the disk-cache. No longer do we need to parse URLs and generate bulk data on the fly. No longer is the code a mess of decision trees based on cache state and user requests. Ah, it feels so free at last!

And it gets even better. On top of this, we were able to resolve a long-standing feature request for complete bulk data files by jurisdiction. We were able to make the schema of the bulk files match that of our REST API. We were able to make the bulk file a tar of smaller JSON files, so no more issues unzipping massive files or having 32 bit systems crash. We settled all the family business.

Oh, and one more thing: When this goes live, we’ll have bulk files and an API for oral arguments as well — Another CourtListener first.

Jeez, That’s Great, Why’d You Wait So Long?

This is a fair question. If it was possible to gain so much so quickly, why didn’t we do it sooner? Well, there are a number of reasons, but at the core, like so many things, it’s because nothing is actually that easy.

Before we could make these improvements, we needed to:

And pretty much everything else you can imagine. So, I suppose the answer is: We waited so long because it was hard.

But being hard is one thing. Another thing is that although a number of organizations have used our bulk data, never has any contributed either energy or resources to fixing the bugs that they reported. Despite the benefits these organizations got from the bulk files, none chose to support the ecosystem from which they benefited. You can imagine how this isn’t particularly motivational for us, but we’re hopeful that with the new and improved system, those using our data will appreciate the quality of the bulk data and consider supporting us down the road.

Wrapping Up

So, without sucking on too many sour grapes, that’s the story behind the upgrades we’re making to the bulk files at CourtListener. At first blush it may seem like a fairly straightforward feature to get in place (and remember, in many cases bulk data is stupid-easy to do), but we thought it would be interesting to share our experiences so others might compare notes. If you’re a consumer of CourtListener bulk data, we’ll be taking the wraps off of these new features soon, so make sure to watch the Free Law Project blog. If you’re a developer that’s interested in this kind of thing, we’re eager to hear your feedback and any thoughts you might have.

  1. For example, a few days ago some folks got access to NYC taxi information in bulk. In theory it was anonymized using MD5 hashing, but because there were a limited number of inputs into the hashing algorithm, all it took to de-anonymize the data was to compute every possible hash (“computing the 22M hashes took less than 2 minutes“) and then work backwards from there to the original IDs. While one researcher did that, another one began finding images of celebrities in taxis and figuring out where they went. Privacy is hard. 

  2. I confess I’m not that sympathetic… 

  3. To be exact: 48271 requests, as gathered by our stats module. 

  4. So far, 17866 files were created this way that haven’t been invalidated, as counted by:

    find -maxdepth 1 -type d | while read -r dir; do printf "%s:\t" "$dir"; find "$dir" -type f | wc -l; done

  5. Commence a fun digression for the developers. As you might expect, aside from compressing bulk files, the bottleneck of generating 350+ bulk files at once is pulling items from the database and converting them to JSON. We tried a few solutions to this problem, but the best we came up with takes advantage of the fact that every item in the database belongs in exactly two bulk files: The all.tar.gz file and the {jurisdiction}.tar.gz file. One way to put the item into both places would be to generate the all.tar.gz file and then generate each of the 350 smaller files.

    That would iterate every item in the database twice, but while making the jurisdiction files you’d have to do a lot of database filtering…something that it’s generally good to avoid. Our solution to this problem is to create a dictionary of open file handles and then to iterate the entire database once. For each item in the database, add it to both the all.tar.gz file and add it to the {jurisdiction}.tar.gz file. Once complete, close all the file handles. For example:

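    import StringIO  # Python 2, as in the original listing
    import tarfile

    # Get all the courts
    courts = Court.objects.all()
    # Create a dictionary with one open tarball per jurisdiction, plus 'all'.
    # NOTE: the identifiers on the right were lost from this listing; court.pk,
    # item.pk and item.court_id are assumed names, not necessarily the originals.
    tar_files = {'all': tarfile.open('/tmp/bulk/opinions/all.tar.gz', mode='w:gz')}
    for court in courts:
        tar_files[court.pk] = tarfile.open(
            '/tmp/bulk/opinions/%s.tar.gz' % court.pk, mode='w:gz')
    # Then iterate over everything, adding it to the correct key
    for item in item_list:
        # Add the json str (json_str, the serialized item) to the two tarballs
        tarinfo = tarfile.TarInfo("%s.json" % item.pk)
        tarinfo.size = len(json_str)  # addfile() copies exactly this many bytes
        tar_files[item.court_id].addfile(
            tarinfo, StringIO.StringIO(json_str))
        tar_files['all'].addfile(
            tarinfo, StringIO.StringIO(json_str))
    # Once complete, close all the file handles
    for tar_file in tar_files.values():
        tar_file.close()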

    In a sense the first part creates a variable for every jurisdiction on the fly and the second part uses that variable as a dumping point for each item as it iterates over them.

    A fine hack. 

by Mike Lissner at November 06, 2014 08:00 AM

November 04, 2014

Ph.D. student

prediction and computational complexity

To the extent that an agent is predictable, it must be:

  • observable, and
  • have a knowable internal structure

The first implies that the predictor has collected data emitted by the agent.

The second implies that the agent has internal structure and that the predictor has the capacity to represent the internal structure of the other agent.

In general, we can say that people do not have the capacity to explicitly represent other people very well. People are unpredictable to each other. This is what makes us free. When somebody is utterly predictable to us, their rigidity is a sign of weakness or stupidity. They are following a simple algorithm.

We are able to model the internal structure of worms with available computing power.

As we build more and more powerful predictive systems, we can ask: is our internal structure in principle knowable by this powerful machine?

This is different from the question of whether or not the predictive machine has data from which to draw inferences. Though of course the questions are related in their implications.

I’ve tried to make progress on modeling this with limited success. Spiros has just told me about binary decision diagrams which are a promising lead.

by Sebastian Benthall at November 04, 2014 05:54 AM

November 03, 2014

Ph.D. student

objective properties of text and robot scientists

One problem with having objectivity as a scientific goal is that it may be humanly impossible.

One area where this comes up is in the reading of a text. To read is to interpret, and it is impossible to interpret without bringing one’s own concepts and experience to bear on the interpretation. This introduces partiality.

This is one reason why Digital Humanities are interesting. In Digital Humanities, one is using only the objective properties of the text–its data as a string of characters and its metadata. Semantic analysis is reduced to a study of a statistical distribution over words.

An odd conclusion: the objective scientific subject won’t be a human intelligence at all. It will need to be a robot. Its concepts may never be interpretable by humans because any individual human is too small-minded or restricted in their point of view to understand the whole.

Looking at the history of cybernetics, artificial intelligence, and machine learning, we can see the progression of a science dedicated to understanding the abstract properties of an idealized, objective learner. That systems such as these underlie the infrastructure we depend on for the organization of society is a testament to their success.

by Sebastian Benthall at November 03, 2014 07:59 PM

November 01, 2014

MIMS 2010

Some Thoughts on Celery

We finally upgraded CourtListener last week and things went pretty well with the exception of two issues. First, we had some extended downtime as we waited for the database migration to complete. In retrospect, I should have realized that updating every item one row at a time would take a while. My bad.

Second, Celery broke again and that took me the better part of a day to detect and fix. As a central part of our infrastructure, this is really, truly frustrating. The remainder of this post goes into what happened, why it happened and how I finally managed to fix it.


First, why did this happen? Well…because I decided to log something. I created a task that processes our new audio files and I thought, “Hey, these should really log to the Juriscraper log rather than the usual celery log.” So, I added two lines to the file: One importing the log file and the second writing a log message. This is the little change that brought Celery to a grinding halt.

What the Hell?

If you’re wondering why logging would break an entire system, well, the answer is because Celery runs as a different user than everything else. In our case, that’s the celery user, a user that didn’t have permission to write to the log file I requested. Ugh.

Fine, that’s not so bad, but there were a number of other frustrating things that made this much worse:

  1. The Celery init script that we use was reporting the following:

     sudo service celeryd start
    celeryd-multi v3.0.13 (Chiastic Slide)
    > Starting nodes...
        > OK

    But no, it was not starting “OK”. It was immediately crashing.

  2. No log messages…anywhere. This appears to be because you have to detach stdin and stdout before daemonizing and according to asksol on IRC, this has been fixed in recent versions of Celery so even daemonizing errors can go to the Celery logs. Progress!

  3. The collection of things that happens when celery starts is complicated:

    1. I call sudo service celeryd start

    2. service calls /etc/init.d/celeryd

    3. celeryd does some stuff and calls (another file altogether), where our settings are.

      Update: Apparently this is a CourtListener-specific customization, so this step probably won’t apply to you, but I have no idea where this wacky set up came from (it’s been in place for years).

    4. Control is returned to celery, which starts celery itself with a command generated from

    On top of this, there’s a celery binary and there’s a celery management command for Django. (Update: the Django commands were removed in Celery 3.1. More progress!) celery --help prints out 68 lines of documentation. Not too bad, but many of those lines refer you to other areas of the documentation. For example, celery worker --help prints another 100 lines of help text. Jesus this thing is complicated.

    Did I mention it has changing APIs?

I digress a bit, but the point here is that it fails silently, there are no log messages when it fails, and there’s no way to know which part of a complicated infrastructure is the problem. End rant. [1]
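For what it’s worth, the two-line logging change that started all this can be made less fragile. The sketch below is Python 3 and purely illustrative: make_logger, the logger name and the log path are made up, and this is not what CourtListener does. It tries to open the log file and falls back to stderr when the current user can’t open it, instead of blowing up at import time:

```python
import logging
import os
import sys
import tempfile


def make_logger(name, log_path):
    """Log to log_path if it can be opened, otherwise fall back to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    try:
        handler = logging.FileHandler(log_path)
        problem = None
    except OSError as exc:  # covers PermissionError and missing directories
        handler = logging.StreamHandler(sys.stderr)
        problem = exc
    logger.addHandler(handler)
    if problem is not None:
        logger.warning('Cannot open %s (%s); logging to stderr', log_path, problem)
    return logger


# A path that cannot be opened, standing in for a log file owned by another user.
bad_path = os.path.join(tempfile.mkdtemp(), 'missing-dir', 'juriscraper.log')
logger = make_logger('demo.audio', bad_path)
handler_types = [type(h).__name__ for h in logger.handlers]
print(handler_types)
```

Whether falling back quietly is the right call is debatable, but at least the worker starts and the warning says which file it couldn’t open.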

Seeking Sanity

It took me a long time to figure out what was going wrong, but I did eventually figure it out. The process, in case you run into something similar, is to modify celeryd so it prints out the command that it eventually runs. At that point you’ll have the correct command. With that, you can run it as the celery user and with some luck you’ll see what the problem is. There’s a modified init script for this purpose, if you like.

Other tips:

  1. If you have a new enough version of Celery, there are some troubleshooting tips that should help. They did nothing for me, because I haven’t upgraded yet for fear of the changing APIs.

  2. There seem to be a handful of different command line flags that Celery can use to be sent to the background. You’ll need to disable these when you’re testing or else you won’t see error messages or anything (apparently?).

Moving Forward

So, I feel bad: I’ve ranted a good deal about Celery, but I haven’t proposed any solutions. It looks like a lot of things have been improved in recent versions of Celery, so part of the solution is likely for us to upgrade.

But this isn’t the first time I’ve spent a long time trying to make Celery work, so what would it take to make Celery a less complicated, more reliable tool?

The ideas I’ve come up with so far are:

  • More documentation for installation and set up troubleshooting with the possibility of a wiki.
    • But already I rant about how much documentation it has.
  • A simpler interface that eliminates a number of edge uses.
    • But I have no idea what, if anything, can be eliminated.
  • Support for fewer task brokers.
    • But I use RabbitMQ and am considering switching to Redis.
  • A more verbose, more thorough debug mode.
    • But apparently this is already in place in the latest versions?
  • Let Celery run as the www-data user as a general practice?
    • But apparently that’s a bad idea.

      Update: this is a bad idea in general, but it’s not particularly bad if you don’t expose Celery on the network. If you’re only running it locally, you can probably get by with Celery as a www-data user or similar.

As you can tell, I don’t feel strongly that any of these are the right solution. I am convinced though that Celery has a bad smell and that it’s ripe for a leaner solution to fill some of its simpler use cases. I’m currently considering switching to a simpler task queue, but I don’t know that I’ll do it since Celery is the de facto one for Django projects.

We deserve a good, simple, reliable task queue though, and I wonder if there are good ideas for what could be changed in Celery to make that possible. I, for one, would love to never spend another minute trying to make RabbitMQ, Celery and my code play nicely together.

  1. In truth Celery is a classic love/hate relationship. On the one hand, it evokes posts like this one, but on the other, it allows me to send tasks to a background queue and distribute loads among many servers. Hell, it’s good enough for Instagram. On the other hand, god damn it, when it fails I go nuts. 

by Mike Lissner at November 01, 2014 07:00 AM

October 30, 2014

Ph.D. student

Comments on Haraway: Situated knowledge, bias, and code

“Above all, rational knowledge does not pretend to disengagement: to be from everywhere and so nowhere, to be free from interpretation, from being represented, to be fully self-contained or fully formalizable. Rational knowledge is a process of ongoing critical interpretation among “fields” of interpreters and decoders. Rational knowledge is power-sensitive conversation. Decoding and transcoding plus translation and criticism; all are necessary. So science becomes the paradigmatic model, not of closure, but of that which is contestable and contested. Science becomes the myth, not of what escapes human agency and responsibility in a realm above the fray, but, rather, of accountability and responsibility for translations and solidarities linking the cacophonous visions and visionary voices that characterize the knowledges of the subjugated.” – Donna Haraway, “Situated Knowledges: The Science Question in Feminism and the Privilege of the Partial Perspective”, 1988

We are reading Donna Haraway’s Situated Knowledges and Cyborg Manifesto for our department’s “Classics” reading group. An odd institution at Berkeley’s School of Information, the group formed years ago to be the place where doctoral students could gather together to read the things they had a sneaking suspicion they should have read, but never were assigned in a class. Since we bridge between many disciplines, there is a lot of ground to cover. Often our readings are from the Science and Technology Studies (STS) tradition.

I love Haraway’s writing. It’s fun. I also think she is mostly right about things. This is not what I expected going into reading her. Her position is that for feminists, rational objective knowledge has to be found in the interpretation and reinterpretation of partial perspectives, not a “god trick” that is assumed to know everything. This “god trick” she associates with phallogocentric white male patriarchal science. This is in 1988.

In 1981, Habermas published his Theory of Communicative Action in German. This work incorporates some of the feminist critiques of his earlier work on the formation of the bourgeois public sphere. Habermas reaches more or less the same conclusion as Haraway: there is no transcendent subject or god’s point of view to ground science; rather, science must be grounded in the interaction of perspectives through communicative action aimed at consensus.

Despite their similarities, there are some significant differences between these points of view. Importantly, Haraway’s feminist science has no white men in it. It’s not clear if it has any Asian, Indian, Black, or Latino men in it either, though she frequently mentions race as an important dimension of subjugation. It’s an appropriation and erasure of non-white masculinity. Does it include working class white men? Or men with disabilities of any kind? Apparently not. Since I’m a man and many of my scientist friends are men (of various races), I find this objectionable.

Then there is Haraway’s belief that such a conversation must always be cacophonous and frenetic. Essentially, she does not believe that the scientific conversation can or should reach consensus or agreement. She is focusing on the critical process. Reading Habermas, on the other hand, you get the sense that he believes that if everyone would just calm down and stop bickering, we would at last have scientific peace.

Perhaps the difference here comes from the presumed orientation or purpose of interpretation. For Habermas, it is mutual understanding. For Haraway, it is criticism and contest. The “we” must never completely subsume individual partiality for Haraway.

Advocates of a cyborg feminist science or successor science or science of situated knowledges might argue for it on the grounds that it improves diversity. Specifically, it provides a way for women to participate in science.

In UC Berkeley’s D-Lab, where I work, we also have an interest in diversity in science, especially computational social science. In a recent training organized by the committee for Equity, Inclusion, and Diversity, we met together and did exercises where we discussed our unconscious biases.

According to Wikipedia, “Bias is an inclination of temperament or outlook to present or hold a partial perspective, often accompanied by a refusal to even consider the possible merits of alternative points of view. People may be biased toward or against an individual, a race, a religion, a social class, or a political party. Biased means one-sided, lacking a neutral viewpoint, not having an open mind.” To understand one’s bias is to understand one’s partial perspective. The problem with bias in a diverse setting is that it leads to communication breakdown and exclusion.

A related idea is statistical bias, which is when a statistic systematically deviates from the value it is meant to estimate in the population of interest. In computational social science, we have to look out for statistical biases because we aim for our social scientific results to be objective.

Another related idea is cognitive bias, a psychological phenomenon more general than the first kind of bias. These biases are deviations from rationality in thought. Nobel Prize winning psychological research has found systematic ways in which all people make mental shortcuts that skew their judgments. I’m not familiar with the research on how these cognitive biases interact with social psychology, but one would imagine that the answer is significantly so.

Haraway’s situated knowledges are biased, partial knowledges. She is upholding the rationality of these knowledges in opposition to the “god trick,” “view from nowhere,” which she also thinks is the position of phallogocentric subjugating science. Somehow, for Haraway, men have no perspective, not even a partial one. Yet, from this non-perspective, men carry out their project of domination.

As much as I like Haraway’s energetic style and creativity, as a man I have a difficult time accepting her view of science as situated knowledges because it seems to very deliberately exclude my position.

So instead I am going along with what we learned in Equity, Inclusion, and Diversity training, which is to try to understand better my own biases so that I can do my best to correct them.

This experience is something that anybody who has worked collaboratively on source code will recognize as familiar. When working on software with a team of people, everybody has different ideas about how things should be organized and implemented. Some people may have quirky ideas, some people may be flat out wrong. The discussion that happens on, for example, an open source issue tracker is a discussion about reaching consensus on a course of action. Like in statistics or the pursuit of psychological rationality, this activity is one of finding an agreement that reduces the bias of the outcome.

In machine learning and statistics, one of the ways you can get an unbiased estimator is by combining many biased ones together and weighting their outcomes. One name for this is bagging, short for ‘bootstrap aggregating’. The idea of an unbiased democratic outcome of combined partial perspectives is familiar to people who work in data science or computational social science because it is foundational to their work. It is considered foundational because in the “exact sciences”–which is how Haraway refers to mathematics–there is a robust consensus on the mathematics backing the technique, as well as a robust empirical conclusion of the technique’s successful performance. This robust consensus has happened through translation and criticism of many, many scientists’ partial perspectives.
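To make the mechanics concrete, here is a minimal sketch of bootstrap aggregating using only Python's standard library. The function name `bagged_estimate`, the simulated data, and the choice of the median as the estimator are all assumptions made for illustration. Strictly speaking, averaging over bootstrap resamples is usually credited with reducing variance rather than removing bias, so this should be read as a picture of "combining many partial views", not as a proof of unbiasedness.

```python
import random
import statistics


def bagged_estimate(sample, estimator, n_rounds=200, seed=0):
    """Average `estimator` over bootstrap resamples of `sample`.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(n_rounds):
        # Each resample is one "partial perspective": n draws with replacement.
        resample = [rng.choice(sample) for _ in range(n)]
        estimates.append(estimator(resample))
    # Aggregate the individual estimates with an unweighted average.
    return statistics.mean(estimates)


# Simulated data (an assumption): 100 noisy observations around 10.0.
data_rng = random.Random(42)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(100)]

bagged = bagged_estimate(sample, statistics.median)
print("single median:", statistics.median(sample))
print("bagged median:", bagged)
```

Each resample sees the data slightly differently, and the aggregate is just the average of those views. A weighted average would be a small change to the last line of the function.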

It is frustrating that this kind of robust, consensually arrived at agreement is still sometimes rejected as radically contingent or historical by those from the Science and Technology Studies (STS) tradition who find their epistemic roots in Haraway. It’s especially frustrating, to me, because I would like to see more diversity–especially more women–in computational social science, or data science more generally. Haraway seems to believe that women in science are unable to overcome their own bias (partiality), or at least encourages them to not try. That seems like a non-starter for women in STEM, because I don’t know how you would ever learn statistics or programming without orienting yourself towards unbiased agreement with others.

So I have to conclude that teaching people Haraway as an epistemology is really bad for science, because it’s bad for diversity in science. That’s a little sad because obviously Haraway had the best of intentions and she is a really interesting writer. It’s also sad because a lot of STS people who base their work off of Haraway really think they are supporting diversity in science. I’ve argued: Nope. Maybe they should be reading Habermas instead.

by Sebastian Benthall at October 30, 2014 05:29 AM

October 29, 2014

Ph.D. student

A troubling dilemma

I’m troubling over the following dilemma:

On the one hand, serendipitous exposure to views unlike your own is good, because that increases the breadth of perspective that’s available to you. You become more cosmopolitan and tolerant.

On the other hand, exposure to views that are hateful, stupid, or evil can be bad, because this can be hurtful, misinforming, or disturbing. Broadly, content can harm.

So, suppose you are deciding what to expose yourself to, or others to, either directly or through the design of some information system.

This requires making a judgment about whether exposure to that perspective will be good or bad.

How is it possible to make that judgment without already having been exposed to it?

Put another way, filter bubbles are sometimes good and sometimes bad. How can you tell the difference, from within a bubble, about whether bridging to another bubble is worthwhile? How could you tell from outside of a bubble? Is there a way to derive this from the nature of bubbles in the abstract?

by Sebastian Benthall at October 29, 2014 06:36 AM

October 26, 2014

Ph.D. student

Dealing with harassment (and spam) on the Web

I see that work is ongoing for anti-spam proposals for the Web — if you post a response to my blog post on your own blog and send me a notification about it, how should my blog software know that you're not a spammer?

But I'm more concerned about harassment than spam. By now, it should be impossible to think about online communities without confronting directly the issue of abuse and harassment. That problem does not affect all demographic groups directly in the same way, but it effects a loss of the sense of safety that is currently the greatest threat to all of our online communities. #GamerGate should be a lesson for us. E.g., Tim Bray:

Part of me suspects there’s an upside to GamerGate: It dragged a part of the Internet that we always knew was there out into the open where it’s really hard to ignore. It’s damaged some people’s lives, but you know what? That was happening all the time, anyhow. The difference is, now we can’t not see it.

There has been useful debate about the policies that large online social networking sites are using for detecting, reporting and removing abusive content. It's not an easy algorithmic problem, it takes a psychological toll on human moderators, and it puts online services into the uncomfortable position of arbiter of appropriateness of speech. Once you start down that path, it becomes increasingly difficult to distinguish between requests of various types, be it DMCA takedowns (thanks, Wendy, for; government censorship; right to be forgotten requests.

But the problem is different on the Web: not easier, not harder, just different. If I write something nasty about you on my blog, you have no control over my web server and can't take it down. As Jeff Atwood, talking about a difference between large, worldwide communities (like Facebook) and smaller, self-hosted communities (like Discourse) puts it, it's not your house:

How do we show people like this the door? You can block, you can hide, you can mute. But what you can't do is show them the door, because it's not your house. It's Facebook's house. It's their door, and the rules say the whole world has to be accommodated within the Facebook community. So mute and block and so forth are the only options available. But they are anemic, barely workable options.

I'm not sure I'm willing to accept that these options are anemic, but I want to consider the options and limitations and propose code we can write right now. It's possible that spam could be addressed in much the same way.

Self-hosted (or remote) comments are those comments and responses that are posts hosted by the commenter, on his own domain name, perhaps as part of his own blog. The IndieWeb folks have put forward a proposed standard for WebMentions so that if someone replies to my blog on their own site, I can receive a notification of that reply and, if I care to, show that response at the bottom of my post so that readers can follow the conversation. (This is like Pingback, but without the XML-RPC.) But what if those self-hosted comments are spam? What if they're full of vicious insults?

We need to update our blog software with a feature to block future mentions from these abusive domains (and handling of a block file format, more later).

The model of self-hosted comments, hosted on the commenter's domain, has some real advantages. If Joe is writing insults about me on his blog and sending notifications via WebMention, I read the first such abusive message and then instruct my software to ignore all future notifications from his domain. Joe might create a new domain tomorrow, start blogging from it and send me another obnoxious message, but then I can block that domain too. It costs him $10 in domain registration fees to send me a message, which is generally quite a bit more burdensome than creating an email address or a new Twitter account or spoofing a different IP address.

This isn't the same as takedown, though. Even if I "block" his domain in my blog software so that my visitors and I don't see notifications of his insulting writing, it's still out there and people who subscribe to his blog will read it. Recent experiences with trolling and other methods of harassment have demonstrated that real harm can come not just from forcing the target to read insults or threats, but also from having them published for others to read. But this level of block functionality would be a start, and an improvement upon what we're seeing in large online social networking sites.

Here's another problem, and another couple of proposals. Many people blog not from their own domain names, but as a part of a larger service, typically on a subdomain of the host's domain. If someone posts an abusive message on one of those subdomains, I can block (automatically ignore and not re-publish) all future messages from that subdomain, but it's easy for the harasser to register a new account on a new subdomain and continue. While it would be easy to block all messages from every subdomain of the host, that's probably not what I want either. It would be better if, 1) I could inform the host that this harassment is going on from some of their users and, 2) I could share lists with my friends of which domains, subdomains or accounts are abusive.
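The distinction between blocking one subdomain and blocking a whole host can be sketched in a few lines of Python. Everything here is an assumption for illustration: the line-based list syntax (`host` for an exact match, `*.suffix` for a domain and all of its subdomains, `#` for comments), the function names, and the `.example` domains. No such blocklist format has been standardized.

```python
from urllib.parse import urlsplit


def parse_blocklist(text):
    """Parse a hypothetical line-based blocklist into (exact hosts, suffixes)."""
    exact, suffixes = set(), set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip().lower()  # drop comments and blanks
        if not line:
            continue
        if line.startswith("*."):
            suffixes.add(line[2:])  # blocks the domain and every subdomain
        else:
            exact.add(line)  # blocks this one host only
    return exact, suffixes


def is_blocked(source_url, exact, suffixes):
    """Return True if a mention from `source_url` should be ignored."""
    host = (urlsplit(source_url).hostname or "").lower()
    if host in exact:
        return True
    return any(host == s or host.endswith("." + s) for s in suffixes)


BLOCKLIST = """
# shared by a friend
troll.blogs.example
*.spamfarm.example
"""

exact, suffixes = parse_blocklist(BLOCKLIST)
print(is_blocked("https://troll.blogs.example/post/1", exact, suffixes))
print(is_blocked("https://friend.blogs.example/hello", exact, suffixes))
```

Keeping exact hosts and suffix rules separate means one abusive account on a shared host can be ignored without silencing everyone else on that host.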

To that end, I propose the following:

  1. That, if you maintain a Web server that hosts user-provided content from many different users, you don't mean to intentionally host abusive content and you don't want links to your server to be ignored because some of your users are posting abuse, you advertise an endpoint for reporting abuse. For example, on a page served by such a host, I would find in the <head> something like:

    <link rel="abuse" href="">

    I imagine that would direct to a human-readable page describing their policies for handling abusive content and a form for reporting URLs. Large hosts would probably have a CAPTCHA on that submission form. Today, for email spam/abuse, the Network Abuse Clearinghouse maintains email contact information for administrators of domains that send email, so that you can forward abusive messages to the correct source. I'm not sure a centralized directory is necessary for the Web, where it's easy to mark up metadata in our pages.

  2. That we explore ways to publish blocklists and subscribe to our friends' blocklists.

    I'm excited to see a Twitter tool for blocking certain types of accounts and managing lists of blocked accounts, which can be shared. Currently under discussion is a design for subscribing to lists of blocked accounts. I spent some time working on Flaminga, a project from Cori Johnson to create a Twitter client with blocking features, at the One Web For All Hackathon. But I think that tool has a more promising design and has taken the work farther.

    Publishing a list of domain names isn't technically difficult. Automated subscription would be useful, but just a standard file-format and a way to share them would go a long way. I'd like that tool in my browser too: if I click a link to a domain that my friends say hosts abusive content, then warn me before navigating to it. Shared blocklists also have the advantage of hiding abuse without requiring every individual to moderate it away. I won't even see mentions from a harasser if my friend has already dealt with his abusive behavior.

    Spam blocklists are widely used today as one method of fighting email spam: maintained lists primarily of source IP addresses, that are typically distributed through an overloading of DNS. Domain names are not so disposable, so list maintenance may be more effective. We can come up with a file format for specifying inclusion/exclusion of domains, subdomains or even paths, rather than re-working the Domain Name System.
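Returning to the first proposal, discovering a host's abuse-reporting endpoint could look roughly like the sketch below, which uses Python's standard `html.parser`. Note that `rel="abuse"` is only the relation proposed in this post, not a registered link relation, and the function name, sample markup and `.example` URLs are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AbuseLinkFinder(HTMLParser):
    """Collect href values of <link> tags whose rel includes "abuse"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "abuse" in rels and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_abuse_endpoints(html, base_url):
    """Return absolute URLs of any proposed rel="abuse" links in `html`."""
    finder = AbuseLinkFinder()
    finder.feed(html)
    # Resolve relative hrefs against the URL of the page they came from.
    return [urljoin(base_url, href) for href in finder.hrefs]


# Hypothetical page from a multi-user blog host.
page = '<html><head><link rel="abuse" href="/report-abuse"></head><body></body></html>'
print(find_abuse_endpoints(page, "https://blogs.example/alice/"))
```

Blog software that receives an abusive mention could run this against the source page and show its owner a "report to host" link whenever an endpoint is found.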

Handling, inhibiting and preventing online harassment is so important for open Web writing and reading. It's potentially a major distinguishing factor from alternative online social networking sites and could encourage adoption of personal websites and owning one's own domain. But it's also an ethical issue for the whole Web right now.

As with email spam, let's build tools for blocking domains for spam and abuse on the social Web, systems for notifying hosts about abusive content and standards for sharing blocklists. I think we can go implement and test these right now; I'd certainly appreciate hearing your thoughts, via email, your blog or at TPAC.


P.S. I'm not crazy about the proposed vouching system, because it seems fiddly to implement and because I value most highly the responses from people outside my social circles, but I'm glad we're iterating.

Also, has anyone studied the use of rhymes/alternate spellings of GamerGate on Twitter? I find an increasing usage of them among people in my Twitter feed, doing that apparently to talk about the topic without inviting the stream of antagonistic mentions they've received when they use the #GamerGate hashtag directly. Cf. the use of "grass mud horse" as an attempt to evade censorship in China, or rhyming slang in general.

by at October 26, 2014 06:11 PM

October 25, 2014

MIMS 2012

Testing Theory: Mo' Choices, Mo' Problems

When asked, most of us would say we’d prefer more options to choose from, rather than fewer. We want to make the best possible choice, so more options should increase the likelihood we’ll choose correctly. But in actuality, research shows that more choice usually leads to worse decisions, or the abandonment of choice altogether. In this post, I will describe how we can use this knowledge to generate A/B test ideas.

Cognitive Overload

More choices are more mentally taxing to compare and evaluate, leading to cognitive overload and a decrease in decision making skills. Anecdotally, it’s the experience of walking into a supermarket to buy toothpaste, only to be confronted by an endless wall of brands and specialized types that all seem roughly the same. You’re quickly overwhelmed, and with no distinguishing characteristics to help you choose, you just grab whatever you bought last time and get the hell out of there.

This common experience was formally studied by Iyengar and Lepper (pdf) (2000), who compared buying rates when shoppers were presented 24 jams to sample, versus just 6. They found when 24 jams were available, only 3% of people bought a jar. But when only 6 jams were available, 30% bought a jar. By providing fewer jams to sample, it was easier for shoppers to compare them to each other and make a decision.

Generating Test Ideas

From these findings you can apply a simple rule to your site or mobile app to generate test ideas: any time a user has to make a choice (e.g. deciding which product to buy; clicking a navigation link; etc.), reduce the number of available options. Here are some examples:

  • Have just one call-to-action. If you have “Sign Up” and “Learn More” buttons, for example, try removing the “Learn More” button. (See below for an example).
  • Remove navigation items. For example, Amazon has been continually simplifying its homepage by hiding its store categories in favor of search. Shoppers don’t need to think about which category might have their desired item; rather, they just search for it. (For help simplifying your navigation, check out this series of articles on Smashing Magazine).
  • Try offering fewer products. See if hiding unpopular or similar products increases purchases of the few that remain.
  • If removing products isn’t feasible, try asking people to make a series of simple choices to narrow down their options. Returning to the toothpaste example, you could ask people to choose a flavor, then a type (whitening, no-fluoride, baking soda, etc.), and present only toothpastes that match those choices. The key is to make sure your customers understand each facet, and the answers are distinct and not too numerous (i.e. fewer than 6).
  • Break up checkout forms into discrete steps.
  • Remove navigation from checkout funnels. Many eCommerce sites (like Crate&Barrel and Amazon) do this because it leaves one option to the user — completing the purchase (see below).

"Crate&Barrel checkout flow comparison"

By removing the main navigation from their checkout flow, Crate&Barrel increased average order value by 2% and revenue per visitor by 1%.

"SeeClickFix call-to-action comparison"

By removing extraneous calls-to-action (“Free Sign Up” and “Go Pro! Free Trial”), SeeClickFix (a service for reporting neighborhood issues) focused users’ attention on the search bar and increased engagement by 8%.
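Before acting on a lift like the ones above, it helps to check whether the difference between variations is larger than chance alone would explain. One common approach (not necessarily what these companies used) is a two-proportion z-test. The sketch below uses only the standard library; the function name and the visitor and conversion counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) comparing conversion rates of A and B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0% or 100%")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: original page (A) vs. page with one call-to-action (B).
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value suggests the simplified variation really does convert differently. With small samples or very low conversion rates the normal approximation gets shaky, so treat the result as a rough guide.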

Know your audience

Of course, there are times when more choice is better. Broadly speaking, experts typically know what they’re looking for, and are able to evaluate many options because they understand all the distinguishing minutiae. For example, professional tennis players can rapidly narrow down the choice of thousands of racquets to just a few because they understand the difference between different materials, weights, head sizes, lengths, and so on. If you don’t offer what they’re looking for, or make it easy to get to what they want, they’ll look elsewhere. For this reason, it’s important that you understand your audience and cater to their buying habits.

We’re trained from an early age to believe that more choice is always better. But in actuality, more choices are mentally taxing, and lead to poor decision making or the abandonment of choice altogether. By testing the removal or simplification of options, you can increase sales, conversions, and overall customer satisfaction.

by Jeff Zych at October 25, 2014 08:17 PM

Ph.D. student

developing a nuanced view on transparency

I’m a little late to the party, but I think I may at last be developing a nuanced view on transparency. This is a personal breakthrough about the importance of privacy that I owe largely to the education I’m getting at Berkeley’s School of Information.

When I was an undergrad, I also was a student activist around campaign finance reform. Money in politics was the root of all evil. We were told by our older, wiser activist mentors that we were supposed to lay the groundwork for our policy recommendation and then wait for journalists to expose a scandal. That way we could move in to reform.

Then I worked on projects involving open source, open government, open data, open science, etc. The goal of those activities is to make things more open/transparent.

My ideas about transparency as a political, organizational, and personal issue originated in those experiences with those movements and tactics.

There is a “radically open” wing of these movements which thinks that everything should be open. This has been debunked. The primary way to debunk this is to point out that less privileged groups often need privacy for reasons that more privileged advocates of openness have trouble understanding. Classic cases of this include women who are trying to evade stalkers.

This has been expanded to a general critique of “big data” practices. Data is collected from people who are less powerful than people that process that data and act on it. There has been a call to make the data processing practices more transparent to prevent discrimination.

A conclusion I have found it easy to draw until relatively recently is: ok, this is not so hard. What’s important is that we guarantee privacy for those with less power, and enforce transparency on those with more power so that they can be held accountable. Let’s call this “openness for accountability.” Proponents of this view are in my opinion very well-intended, motivated by values like justice, democracy, and equity. This tends to be the perspective of many journalists and open government types especially.

Openness for accountability is not a nuanced view on transparency.

Here are some examples of cases where an openness for accountability view can go wrong:

  • Arguably, the “Gawker Stalker” platform for reporting the location of celebrities was justified by an ‘openness for accountability’ logic. Jimmy Kimmel’s browbeating of Emily Gould indicates how this can be a problem. Celebrity status is a form of power but also raises one’s level of risk because there is a small percentage of the population that for unfathomable reasons goes crazy and threatens and even attacks people. There is a vicious cycle here. If one is perceived to be powerful, then people will feel more comfortable exposing and attacking that person, which increases their celebrity, increasing their perceived power.
  • There are good reasons to be concerned about stereotypes and representation of underprivileged groups. There are also cases where members of those groups do things that conform to those stereotypes. When these are behaviors that are ethically questionable or manipulative, it’s often important organizationally for somebody to know about them and act on them. But transparency about that information would feed the stereotypes that are being socially combated on a larger scale for equity reasons.
  • Members of powerful groups can have aesthetic taste and senses of humor that are offensive or even triggering to less powerful groups. More generally, different social groups will have different and sometimes mutually offensive senses of humor. A certain amount of public effort goes into regulating “good taste” and that is fine. But also, as is well known, art that is in good taste is often bland and fails to probe the depths of the human condition. Understanding the depths of the human condition is important for everybody but especially for powerful people who have to take more responsibility for other humans.
  • This one is based on anecdotal information from a close friend: one reason why Congress is so dysfunctional now is that it is so much more transparent. That transparency means that politicians have to be more wary of how they act so that they don’t alienate their constituencies. But bipartisan negotiation is exactly the sort of thing that alienates partisan constituencies.

If you asked me maybe two years ago, I wouldn’t have been able to come up with these cases. That was partly because of my positionality in society. Though I am a very privileged man, I still perceived myself as an outsider to important systems of power. I wanted to know more about what was going on inside important organizations and was frustrated by my lack of access to it. I was very idealistic about wanting a more fair society.

Now I am getting older, reading more, experiencing more. As I mature, people are trusting me with more sensitive information, and I am beginning to anticipate the kinds of positions I may have later in my career. I have begun to see how my best intentions for making the world a better place are at odds with my earlier belief in openness for accountability.

I’m not sure what to do with this realization. I put a lot of thought into my political beliefs and for a long time they have been oriented around ideas of transparency, openness, and equity. Now I’m starting to see the social necessity of power that maintains its privacy, unaccountable to the public. I’m starting to see how “Public Relations” is important work. A lot of what I had a kneejerk reaction against now makes more sense.

I am in many ways a slow learner. These ideas are not meant to impress anybody. I’m not a privacy scholar or expert. I expect these thoughts are obvious to those with less of an ideological background in this sort of thing. I’m writing this here because I see my current role as a graduate student as participating in the education system. Education requires a certain amount of openness because you can’t learn unless you have access to information and people who are willing to teach you from their experience, especially their mistakes and revisions.

I am also perhaps writing this now because, who knows, maybe one day I will be an unaccountable, secretive, powerful old man. Nobody would believe me if I said all this then.

by Sebastian Benthall at October 25, 2014 06:14 PM

October 24, 2014

Ph.D. alumna

My name is danah and I’m a stats addict.

I love data and I hate stats. Not stats in abstract — statistics are great — but the kind of stats that seem to accompany any web activity. Number of followers, number of readers, number of viewers, etc. I hate them in the way that an addict hates that which she loves the most. My pulse quickens as I refresh the page to see if one more person clicked the link. As my eyes water and hours pass, I have to tear myself away from the numbers, the obsessive calculating that I do, creating averages and other statistical equations for no good reason. I gift my math-craving brain with a different addiction, turning to various games — these days, Yushino — to just get a quick hit of addition. And then I grumble, grumble at the increasing presence of stats to quantify and measure everything that I do.

My hatred is not a new hatred. Oh, no. I’ve had an unhealthy relationship with measurement since I was a precocious teenager. I went to a school whose grades came in five letters — A, B, C, D, F (poor E, whatever happened to E?). Grades were rounded to the nearest whole number so if you got an 89.5, you got an A. I was that horrible bratty know-it-all student who used to find sick and twisted joy in performing at exactly bare minimum levels. Not in that slacking way, but in that middle finger in the air way. For example, if you gave me a problem set with 10 questions on it, where each question was harder than the last, I would’ve done the last 9 problems and left the first one blank. Oh was I arrogant.

The reason that I went to Brown for college was because it was one of two colleges I found out about that didn’t require grades; the other college didn’t have a strong science program. I took every class that I could Pass/Fail and loved it. I’m pretty sure that I actually got an A in those classes, but the whole point was that I didn’t have to obsess over it. I could perform strongly without playing the game, without trying to prove to some abstract entity that I could manipulate the system just for the fun of it.

When I started my blog back in the day, I originally put on a tracker. (Remember those older trackers that were basically data collectors in the old skool days??) But then I stopped. I turned it off. I set it up to erase the logs. I purposefully don’t login to my server because I don’t want to know. I love not knowing, not wondering why people didn’t come back, not realizing that a post happened to hit it big in some country. I don’t want the data, I don’t want to know.

Data isn’t always helpful. When my advisor was dying of brain cancer, he struggled to explain to people how he didn’t want to know more than was necessary. His son wanted the data to get his head around what was happening, to help his father. But my advisor and I made a deal: I didn’t look up anything about his illness and would take his interpretation at face-value, never contradicting whatever he said. He was a political scientist and an ethnographer, a man who lived by data of all forms and yet, he wanted to be free from the narrative of stats as he was dying.

As we move further and further into the era of “big data,” I find that stats are everywhere. I can’t turn off the number of followers I have on Twitter. And I can’t help but peek at the stats on my posts on Medium, even though I know it’s not healthy for me to look. (Why aren’t people reading that post!?!? It was sooo good!!) The one nice thing about the fact that 3rd party stats on book sales are dreadfully meaningless is that I have zero clue how many people have bought my book. But I can’t help but query my Amazon book sale rank more often than I’d like to admit. For some, it’s about nervously assessing their potential income. I get this. But for me, it’s just the obsessive desire to see a number, to assess my worth on a different level. If it goes up, it means “YAY I’M AWESOME!” but if it goes down, no one loves me, I’m a terrible person. At least in the domain that I have total control over — my website — I don’t track how many people have downloaded my book. I simply don’t know and I like it that way. Because if I don’t know, I can’t beat myself up over a number.

The number doesn’t have to be there. I love that Benjamin Grosser created a demetricator to remove numbers from Facebook (tx Clive Thompson!). It’s like an AdBlocker for stats junkies, but it only works on Facebook. It doesn’t chastise you for sneaking a look at numbers elsewhere. But why is it that it takes so much effort to remove the numbers? Why are those numbers so beneficial to society that everyone has them?

Stats have this terrible way of turning you — or, at least, me — into a zombie. I know that they don’t say anything. I know that huge chunks of my Twitter followers are bots, that I could’ve bought my way to a higher Amazon ranking, that my Medium stats say nothing about the quality of my work, and that I should not treat any number out there as a mechanism for self-evaluation of my worth as a human being. And yet, when there are numbers beckoning, I am no better than a moth who sees a fire.

I want to resist. I want serenity and courage and wisdom. And yet, and yet… How many people will read this post? Are you going to tweet it? Are you going to leave comments? Are you going to tell me that I’m awesome? Gaaaaaaaah.

(This post was originally published on September 24, 2014 in The Message on Medium.)

by zephoria at October 24, 2014 04:27 PM

Ph.D. student

writing about writing

Years ago on a now defunct Internet forum, somebody recommended that I read a book about the history of writing and its influence on culture.

I just spent ten minutes searching through my email archives trying to find the reference. I didn’t find it.

I’ve been thinking about writing a lot lately. And I’ve been thinking about writing especially tonight, because I was reading this essay that is in a narrow sense about Emily Gould but in a broad sense is about writing.*

I used to find writing about writing insufferable because I thought it was lazy. Only writers with nothing to say about anything else write about writing.

I don’t disagree with that sentiment tonight. Instead I’ve succumbed to the idea that actually writing is a rather specialized activity that is perhaps special because it affords so much of an opportunity to scrutinize and rescrutinize in ways that everyday social interaction does not. By everyday social interaction, I mean specifically the conversations I have with people that are physically present. I am not referring to the social interactions that I conduct through writing with sometimes literally hundreds of people at a time, theoretically, but actually more on the order of I don’t know twenty, every day.

The whole idea that you are supposed to edit what you write before you send it presupposes a reflective editorial process where text, as a condensed signal, is the result of an optimization process over possible interpretations that happens before it is ever emitted. The conscious decision to not edit text as one writes it is difficult if not impossible for some people but for others more…natural. Why?

The fluidity with which writing can morph genres today–it’s gossip, it’s journalism, it’s literature, it’s self expression reflective of genuine character, it’s performance of an assumed character, it’s…–is I think something new.

* Since writing this blog post, I have concluded that this article is quite evil.

by Sebastian Benthall at October 24, 2014 08:09 AM

October 19, 2014

Ph.D. student

It all comes back to Artificial Intelligence

I am blessed with many fascinating conversations every week. Because of the field I am in, these conversations are mainly about technology and people and where they intersect.

Sometimes they are about philosophical themes like how we know anything, or what is ethical. These topics are obviously relevant to an academic researcher, especially when one is interested in computational social science, a kind of science whose ethics have lately been called into question. Other times they are about the theoretical questions that such a science should or could address, like: how do we identify leaders? Or determine what are the ingredients for a thriving community? What is creativity, and how can we mathematically model how it arises from social interaction?

Sometimes the conversations are political. Is it a problem that algorithms are governing more of our political lives and culture? If so, what should we do about it?

The richest and most involved conversations, though, are about artificial intelligence (AI). As a term, it has fallen out of fashion. I was very surprised to see it as a central concept in Bengio et al.’s “Representation Learning: A Review and New Perspectives” [arXiv]. In most discussions of scientific computing or ‘data science’, people have abandoned the idea of intelligent machines. Perhaps this is because so many of the applications of this technology seem so prosaic now. Curating newsfeeds, for example. That can’t be done intelligently. That’s just an algorithm.

Never mind that the origins of all of what we now call machine learning were in the AI research program, which is as old as computer science itself and really has grown up with it. Marvin Minsky famously once defined artificial intelligence as ‘whatever humans still do better than computers.’ And this is the curse of the field. Every technological advance that is at the time mind-blowingly powerful, performing a task that used to require hundreds of people, very shortly becomes mere technology.

It’s appropriate then that representation learning, the problem of deriving and selecting features from a complex data set that are valuable for other kinds of statistical analysis in other tasks, is brought up in the context of AI. Because this is precisely the sort of thing that people still think they are comparatively good at. A couple years ago, everyone was talking about the phenomenon of crowdsourced image tagging. People are better at seeing and recognizing objects in images than computers, so in order to, say, provide the data for Google’s Image search, you still need to mobilize lots of people. You just have to organize them as if they were computer functions so that you can properly aggregate their results.
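To make "organize them as if they were computer functions" a little more concrete, here is a minimal sketch of one common way to aggregate crowd answers: a simple majority vote per image. This is only an illustration, not a description of how Google or any real tagging service works. The function name `aggregate_tags`, the `min_agreement` threshold, and the sample votes are all made up for the example.

```python
from collections import Counter


def aggregate_tags(votes_by_image, min_agreement=0.5):
    """Return the most common tag per image, or None when agreement is too low.

    votes_by_image maps a (hypothetical) image id to the list of tags that
    individual workers submitted for that image.
    """
    results = {}
    for image_id, votes in votes_by_image.items():
        if not votes:
            results[image_id] = None
            continue
        # most_common(1) yields the single (tag, count) pair with the highest count.
        tag, count = Counter(votes).most_common(1)[0]
        # Accept the tag only if strictly more than min_agreement of workers chose it.
        results[image_id] = tag if count / len(votes) > min_agreement else None
    return results


# Made-up votes from three workers per image.
votes = {
    "img-001": ["cat", "cat", "dog"],
    "img-002": ["tree", "bush", "shrub"],
}
consensus = aggregate_tags(votes)
print(consensus)
```

Each worker behaves like a noisy function from image to label, and the aggregation step is where their answers become usable data. Images with no clear winner come back as `None` so they can be sent out for more votes.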

One of the earliest tasks posed to AI, the Turing Test, proposed by and named after Alan Turing, the inventor of the fricking computer, is the task of engaging in conversation as if one is a human. This is harder than chess. It is harder than reading handwriting. Something about human communication is so subtle that it has withstood the test of time as an unsolved problem.

Until June of this year, when a program passed the Turing Test in the annual competition. Conversation is no longer something intelligent. It can be performed by a mere algorithm. Indeed, I have heard that a lot of call centers now use scripted dialog. An operator pushes buttons guiding the caller through a conversation that has already been written for them.

So what’s next?

I have a proposal: software engineering. We still don’t have an AI that can write its own source code.

How could we create such an AI? We could use machine learning, training it on data. What’s amazing is that we have vast amounts of data available on what it is like to be a functioning member of a software development team. Open source software communities have provided an enormous corpus of what we can guess is some of the most complex and interesting data ever created. Among other things, this software includes source code for all kinds of other algorithms that were once considered AI.

One reason why I am building BigBang, a toolkit for the scientific analysis of software communities, is because I believe it’s the first step to a better understanding of this very complex and still intelligent process.

While above I have framed AI pessimistically–as what we delegate away from people to machines, that is unnecessarily grim. In fact, with every advance in AI we have come to a better understanding of our world and how we see, hear, think, and do things. The task of trying to scientifically understand how we create together and the task of developing an AI to create with us is in many ways the same task. It’s just a matter of how you look at it.

by Sebastian Benthall at October 19, 2014 03:04 AM

October 18, 2014

MIMS 2012

October 14, 2014

MIMS 2010

Creating a Non-Profit

The Goal

This post is an attempt to document the things that we’ve done at Free Law Project to get our official Federal and State non-profit status. This has been a grueling process for Brian and me but as we announced on Twitter, we now have it officially in hand, and likely in record time:

Check out this beauty! We’re finally the real deal.

All through the process, I wished there was something that had all the documentation of the process, so this is my attempt at such a post. I’m writing this after the fact, so I expect that I’ll munge a few details. If you catch me making a mistake, you can either edit this page yourself using my handy guide, or you can send me a note and I’ll update it.

Before We Begin

Three notes before we begin:

  1. Our complete IRS packet is available. Please feel free to take a look.

  2. The very best resource we found for this was a checklist from Public Counsel which reads like this blog post if it were written by qualified lawyers.

  3. Nothing here, there, or anywhere is tax advice or legal advice or advice in any way, period. This is an overview of the process as we understand it. It might work for you, it might not. We’re not tax or IRS experts. Hell, I’m not even a lawyer.

The Overall Process

Here are the major steps for California and Federal recognition:

  1. Reserve the name of your organization with the Secretary of State
  2. Check for trademarks on your organization name
  3. Get an EIN from the IRS
  4. Get official with the Secretary of State of California
    1. Write Bylaws
    2. Write Articles of Incorporation
    3. Write Conflict of Interest and Ethics Policy
    4. Hold a meeting creating directors (and having them resign as incorporators, if necessary)
    5. Hold a meeting to ratify and adopt everything above
  5. File Statement of Information with Secretary of State
  6. Register with the California Attorney General’s registry of charitable trusts
  7. Get Federal recognition as a 501(c)(3)
    1. IRS-1023
    2. Your organization’s press coverage
    3. Your homepage
    4. Articles of Incorporation stamped by the State Secretary
    5. Phone a friend
  8. File for California tax exemption
  9. Get Municipal recognition

Reserve Your Name with the Secretary of State

This is an important step if you think somebody else might already have your name or if you think it might get scooped up before you finish your paperwork. This is your opportunity to say that your organization is named such-and-such and nobody else can have that name.

More information about this can be found on the Secretary of State’s website. The process involves downloading a form, filling it out, mailing it in, and then waiting for a reply. Once you get the reply, the name is yours for 60 days. This is probably also a good time to…

Check for Trademarks

If you think the name of your organization might be a trademark, you should check the USPTO’s trademark database to see if it is. If so, it’s probably wise to re-think the name of your organization. Naming your organization the Nike Charitable Trust probably won’t work out well for you.

Get an EIN

This is the official step that’s required to incorporate your organization and it’s a fairly easy one. Once this is done you’ll have an Employer Identification Number (EIN) from the IRS.

To do this, there is a multi-step form on the IRS website. Work your way through it, and if you come out the other side, you’ll quickly be the owner of a freshly minted EIN. Keep it private, as you would an SSN.

Get Official with California

At this point, you’ve moved past the easy stuff. It’s time for the weird and difficult paperwork.

Write Bylaws, Articles of Incorporation and Conflict of Interest and Ethics Policy

Writing these three items is a very persnickety part of the process. Each item must include certain phrases and failure to include those phrases will sink any attempt to get 501(c)(3) status down the road. The template we used for each of these was created by Public Counsel and can be downloaded from their website (Bylaws, Articles of Incorporation, Ethics Policy).

The best process we discovered for this was to very carefully work our way through each template and to update any section that needed it. The result clocks in at 25 pages for the Bylaws and Articles of Incorporation and ten pages for the Conflict of Interest and Ethics Policy.

Hold Some Meetings

OK, you’ve got your name, EIN, Bylaws, Articles of Incorporation and Conflict of Interest and Ethics Policy all ready. Now what? Well, now we enter the portion of the process that involves magic and wizardry. What we do now is we hold two meetings. Feel free to chant during these meetings if that helps them make sense.

The first meeting serves the purpose of creating directors and having them resign as incorporators, if necessary. To have this meeting, get all of your incorporators and directors together and decide to make it so. Have your secretary keep minutes from the meeting. You’ll need them for the 501(c)(3). Here are ours and here’s the template we used. You can see how this might feel a bit like voodoo magic if your board of directors is the same group of people as your incorporators (as in our case) — One minute they’re incorporators, the next they’re directors, and the people that authorized the switch are themselves.

The second meeting is where the real business goes down. Here you adopt all of the paperwork you created above, establish bank accounts, etc. Again, we used the templates from Public Counsel to keep minutes for this meeting. Check out our minutes for details and here’s a template. You’ll also see in our minutes a waiver of notice that waives the directors’ normal requirement to tell people about the meeting in advance.

These two meetings can (and probably will) take place back to back, but they need to have separate minutes and need to be separate meetings. This is because until the board adopts themselves in the first meeting, they can’t very well do the things in the second meeting. Voodoo? Perhaps.

File Statement of Information with Secretary of State of California

Within 90 days of when you filed your original Articles of Incorporation, you need to take all of the above and send it in to the Secretary of State along with the SI-100 form.

If you do all of this well and properly, you’ll soon be registered with the State of California, but until you get your 501(c)(3) pushed through you can’t become an official California non-profit, so you’ll have to hold on for a bit for that piece of the puzzle. More on this in a moment.

Register with the California Attorney General

Another thing you’ve got to do, once you’ve got state recognition, is to register with the California Attorney General. You have 30 days to do this from the moment when you first had assets as an organization. Be swift.

To do this, there are instructions on the Attorney General’s website, and there’s a PDF that you need to complete.
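Since several of the steps so far come with a clock attached (60 days on the name reservation, 90 days for the Statement of Information, 30 days for the Attorney General registration), a few lines of Python can keep track of the dates. This is just a sketch: the day counts are the ones mentioned in this post and may not match current rules, and the dates and the `filing_deadlines` function are hypothetical.

```python
from datetime import date, timedelta

# Day counts as described in this post; verify them before relying on this.
NAME_RESERVATION_DAYS = 60
STATEMENT_OF_INFORMATION_DAYS = 90
ATTORNEY_GENERAL_DAYS = 30


def filing_deadlines(name_reserved, articles_filed, first_assets):
    """Compute the three deadlines from the dates that start each clock."""
    return {
        "name reservation expires": name_reserved + timedelta(days=NAME_RESERVATION_DAYS),
        "statement of information due": articles_filed + timedelta(days=STATEMENT_OF_INFORMATION_DAYS),
        "attorney general registration due": first_assets + timedelta(days=ATTORNEY_GENERAL_DAYS),
    }


# Hypothetical dates, purely for illustration.
deadlines = filing_deadlines(
    name_reserved=date(2014, 1, 6),
    articles_filed=date(2014, 2, 3),
    first_assets=date(2014, 2, 10),
)
for label, due in deadlines.items():
    print(f"{label}: {due.isoformat()}")
```

Each deadline hangs off a different starting event, which is exactly why they are easy to lose track of.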

Get Federal Recognition

If you’ve come this far, you’re actually doing pretty well, but it’s time to find a good fortifying drink, because it’s about to get worse. Much worse. Our operating theory is that the IRS makes this hard because they simply don’t like giving tax exemptions — it’s antithetical to their whole raison d’être. But be that as it may, we must persevere if we’re going to make our organization a 501(c)(3).

So, what’s this process look like?

Well, there are really only two forms that you need to worry about. The first is the IRS-1023 and the second is the checklist for the IRS-1023. That should tell you something about the process you’re about to engage in: There’s a form for the form. Oh, and that’s not all: there’s a web form for the form for the form. Also, the IRS-1023 is an interactive PDF with parts that appear and disappear as you complete it. Also it crashes sometimes (save often!), can only be opened in Adobe Reader and there are three versions of the form and two different revisions. Dizzy yet?

Let’s see if I can simplify this:

  1. The IRS-1023 from December 2013 is currently the main form you want — it’s long and has a lot of questions. It is available in three forms: Interactive (recommended), Regular (no interactive stuff), and Accessible (even less interactive stuff?). You only seem to be able to get this form if you answer a bunch of questions aimed at prepping you for the process. Even then it gives you a zip containing the form, sigh.1
  2. The 1023 checklist must be included in your submission as a table of contents of sorts. The newest one I’ve found is from June 2006.
  3. There are copious resources online to help you complete these forms. The ones we used were the IRS’s own documentation and the IRS’s FAQ for the form.

OK, you’ve got your forms, let’s talk a bit about the packet you’re going to send to the IRS. The best place to begin understanding the packet is by looking at the checklist we just downloaded. In addition to the items mentioned above, it also requests a number of new items we haven’t seen before. Most of these won’t be necessary for most non-profits, but one is new and worth mentioning: the Expedite Request.

Getting Expedited

As we’ve come to understand it, there are essentially three queues your paperwork can fall into at the IRS:

  1. The urgent queue (30 days?)
  2. The normal queue (90 days?) and
  3. The troublemaker queue (> 90 days / Never)

Your goal is to fall into one of the first two queues. If you fall into the third, it’s possible you’ll never come out the other side. Seriously.

If you want to fall into the first queue, you need to complete an Expedite Request. These are actually pretty straightforward, but you need to qualify. You can see an example of our Expedite Request in our 1023 submission, but basically, you need to state specific harm that will occur if your organization doesn’t get swift 501(c)(3) processing. There are guides about this on the IRS website that we used (successfully, we believe).

Getting faster processing is great but it’s not always possible. Failing that, the thing to do is make sure that you don’t fall into the third queue.

I think the important parts of this are:

  1. Carefully follow the instructions provided by the IRS for the 1023.
  2. Make sure that your articles of incorporation contain the proper purpose and dissolution clauses (they will if you use the templates).
  3. Check the top ten list provided by the IRS for speeding up the process.
  4. Do not mention any of the cursed words on the IRS’s list to “Be On the Look Out” for (so-called BOLO words).

The list is apparently no longer in use due to the furor it caused, but it’s still instructive to know what was on it. For example, in our case “Open Source” was on the list, so despite working in the open (something we believe contributes to our educational purpose), we had to be very careful never to mention that in our mission or anywhere else just to ensure there were no misunderstandings.

Once you’ve got your Expedite Request completed, it’s time to work on the 1023 itself. This is a long and arduous process that is too detailed to get into. Be careful, be thorough, follow the guides, and get help from a friend or lawyer. We found it to be incredibly useful to get somebody with experience to carefully look at our paperwork.

Other Things We Sent the IRS

In addition to the items mentioned above, we also included printed copies of a partnership agreement we have with Princeton for the hosting of RECAP, a printed selection of press, and printed copies of our homepages (RECAP, CourtListener, Free Law Project).

The goal of these enclosures was mostly to keep the IRS reviewer from touching their computer, but also to keep their life as simple as possible. Like any application, you want to control the information that is provided to the reviewer. Just like you wouldn’t want your next boss seeing your Facebook profile, you don’t want the IRS reviewer looking up your organization’s website. There’s likely nothing bad for them to see, but you want to keep things as simple as possible. Maybe, we reason, if we provide a printed copy of our homepage they won’t bother booting up their computer. Perhaps.

Remarks on Formatting, Etc.

Sadly, like filing with the Supreme Court, completing your 1023 involves a few formatting and clerical details that we must attend to. First, you must be sure to put your name and EIN on the top of every page. This is surprisingly difficult since many of the pages are PDFs you don’t control, but you can pull it off by feeding your printed documents through the printer twice. The first time, you print the regular stuff; the second time, you print a blank page over and over that contains your EIN and organization name in the header. Fun.

The second thing to attend to is the ordering of the documents themselves. This is the order of our 1023, and from what we can tell, you really shouldn’t do anything much different:

  1. 1023 Checklist
  2. Request for Expedited Processing
  3. List of Enclosures
  4. The 1023 itself
  5. Articles of Incorporation
  6. Bylaws
  7. Supplemental answers to 1023 questions
  8. Conflict of Interest and Ethics Policy
  9. Minutes adopting Conflict of Interest and Ethics Policy (remember when we made these?)
  10. A partnership agreement we have with Princeton
  11. Our selection of press coverage
  12. Printed copies of our homepages
  13. IRS Form SS-4 indicating our EIN

In total: 83 pages of delightful paperwork and one check for $850.

Total weight: 3/4 lbs.

File California Tax Exemption Forms

If all goes well, you’ll soon hear back from the IRS and be granted your Federal recognition as a 501(c)(3). Congratulations on a hard-won victory.

Now that that’s in place, it’s time to switch back to California and wrap things up with them. To do this you need to complete form 3500A (information / download).

Don’t try to save this form. You can’t:

Go F*** Yourself -- You cannot save this file.

Instead, fill it out, print it, and mail it in along with a copy of your Federal Recognition. If you can print to PDF, that might save your work.

Get Municipal Recognition

The final step of this process for us, though it might come much earlier for you, was to get in touch with the city where we incorporated and to tell them that we exist. We tried to do this early on and had the city staff member in charge of business licenses tell us to come back once we had 501(c)(3) recognition. In the city we selected non-profits are exempt from city business license fees, so that may be why they were so lax about the timing of this paperwork. You may find in your city that they want you to have a business license and pay related fees while you’re waiting on 501(c)(3) status (and sometimes even after).

Wrapping Up

All in all, that’s the basic process of creating a non-profit and getting tax exemption from the feds, the state and your city. Most of this went pretty smoothly, but the most difficult part was by far the IRS-1023, and even for that we were able to get our results back in about 30 days. This feels like something of a miracle, but it took us over a year to get all the paperwork completed and submitted.

In the end I liken the process to an incantation of a magic spell: Done correctly, you wind up with a massive pile of paperwork that magically looks like a bad-ass application for tax-exempt status that washes over anybody that looks at it, convincing him or her that your organization is charitable and deserves tax exemption in a forthright manner.

Done incorrectly, you enter a hole of despair, despondency and, worse, taxation.

by Mike Lissner at October 14, 2014 07:00 AM

October 08, 2014

Ph.D. alumna

Frameworks for Understanding the Future of Work

Technology is changing work. It’s changing labor. Some imagine radical transformations, both positive and negative. Words like robots and drones conjure up all sorts of science fiction imagination. But many of the transformations that are underway are far more mundane and, yet, phenomenally disruptive, especially for those who are struggling to figure out their place in this new ecosystem. Disruption, a term of endearment in the tech industry, sends shudders down the spine of many, from those whose privilege exists because of the status quo to those who are struggling to put bread on the table.

A group of us at Data & Society decided to examine various different emergent disruptions that affect the future of work. Thanks to tremendous support from the Open Society Foundations, we’ve produced five working papers that help frame various issues at play. We’re happy to share them with you today.

  • Understanding Intelligent Systems unpacks the science fiction stories of robots to look at the various ways in which intelligent systems are being integrated into the workforce in both protective and problematic ways. Much of what’s at stake in this domain stems from people’s conflicting values regarding robots, drones, and other intelligent systems.
  • Technologically Mediated Artisanal Production considers the disruptions introduced by 3D printing and “maker culture,” as the very act of physical production begins to shift from large-scale manufacturing to localized creation. The implications for the workforce are profound, but there are other huge potential shifts here, ranging from positive possibilities like democratizing design to more disconcerting concerns like increased environmental costs.
  • Networked Employment Discrimination examines the automation of hiring and the implications this has on those seeking jobs. The issues addressed here range from the ways in which algorithms automatically exclude applicants based on keywords to the ways in which people are dismissed for not having the right networks.
  • Workplace Surveillance traces the history of efforts to use tracking technologies to increase efficiency and measure productivity while decreasing risks for employers. As new technologies come into the workplace to enable new forms of surveillance, a whole host of ethical and economic questions emerge.
  • Understanding Fair Labor Practices in a Networked Age dives into the question of what collective bargaining and labor protections look like when work is no longer cleanly delineated, bounded, or structured within an organization, such as those engaged in peer economy work. Far from being an easy issue, we seek to show the complexity of trying to get at fair labor in today’s economy.

Each of these documents provides a framework for understanding the issues at play while also highlighting the variety of questions that go unanswered. We hope that these will provide a foundation for those trying to better understand these issues and we see this as just the beginning of much needed work in these areas. As we were working on these papers, we were delighted to see a wide variety of investigative journalism into these issues and we hope that much more work is done to better understand the social and cultural dimensions of these technological shifts. We look forward to doing more work in this area and would love to hear feedback from others, including references to other work and efforts to address these issues. Feel free to contact us at

(All five papers were authored by a combination of Alex Rosenblat, Tamara Kneese, and danah boyd; author order varies by document. This work was supported by the Open Society Foundations and is part of ongoing efforts at Data & Society to better understand the Future of Labor.)

(Photo by David Blaine.)

by zephoria at October 08, 2014 07:34 PM

MIMS 2010


Kurt Opsahl for EFF


I spent the morning today in court watching Kurt Opsahl from the Electronic Frontier Foundation deliver oral arguments for In Re National Security Letters, AKA Under Seal v. Eric Holder. I had originally planned on attending this as a lay person holding a smart phone camera and taking pictures, but soon after I filed an application to take photos, I became the media “pool”, meaning I was the one guy responsible for taking photos and distributing to the press after the fact.

Well, I’m no photographer, but since nobody else applied to take photos, here are the photos that I took:

I hereby release these as public domain photos. If you want to give me attribution, great. If not, no worries.

Some Comments on NSLs and the Proceedings

The district court previously found that National Security Letters (NSLs) violate the First Amendment and today the government appealed that finding with a variety of interesting claims. Probably my favorite was when the government claimed that the gag orders that typically apply to NSLs were necessary because without them the FBI wouldn’t be able to use NSLs as an investigative tool anymore.

Matt Zimmerman, the attorney when this case was in the district court, put it well:

The interesting thing here is that it’s probably true. If the gag orders were removed, the thousands of National Security Letters that the government issues each year would need more review, making it so that the FBI no longer had an extra-judicial backdoor into investigations. The FBI wouldn’t want that.


There were a million other arguments made that I’ll leave to the lawyers to discuss, but all in all it looked pretty good for EFF and I think folks are feeling confident. From here, it’s now up to the panel of judges to issue their decision. My bet is that regardless of what the decision says, this one will be appealed to the Supreme Court. Perhaps they’ll need a photographer too.

by Mike Lissner at October 08, 2014 07:00 AM

October 06, 2014

MIMS 2010

Editing a File on Github

When writing programs, developers have a choice of whether they want their work to be public or private. Programs that are made public are called “open source” and ones that are not are called “closed source”. In both cases the developer can share a program with the world as a website or iPhone app, or whatever, but in the case where the code is shared publicly it’s also possible for anybody anywhere in the world to change the program to make it better. (For more detail on this and other jargon, see the definitions at the end.)

This is very cool!

But I hear you asking, “How do I, a non-developer, make use of this system to make the world a better place?” I’m glad you asked — this article is for you.

And then there was Git

Git is an extremely popular system that developers use to keep track of the code they write. The main thing it does is make it so that two developers can work on the same file, track their individual changes and then combine their work, as you might do in Microsoft Word. Since all programs are just collections of lots of files that are together known as a “repository”, this lets a number of developers work together without trampling on each other’s changes.
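If you’re curious what a tracked "change" actually looks like, here is a tiny sketch using Python’s built-in `difflib` module. To be clear, this is not how Git works internally (Git has its own storage and diff machinery); it only shows the kind of before-and-after comparison that developers read all day. The file name `names.txt` and the names in it are made up.

```python
import difflib

# The file as it was before anybody touched it (hypothetical contents).
before = ["Alice\n"]

# The same file after a second person adds their name at the bottom.
after = before + ["Bob\n"]

# unified_diff yields the lines of a "unified diff": unchanged lines start
# with a space, added lines start with "+", and removed lines start with "-".
diff = list(
    difflib.unified_diff(before, after, fromfile="names.txt", tofile="names.txt")
)
print("".join(diff))
```

The line beginning with `+` is the new name. Everything else is context so a reviewer can see where in the file the change landed.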

There are a million ways to use Git but lately a lot of people use Git through a website called Github. Github makes it super-easy to use Git, but you still need to understand a few steps that are necessary to make changes. The basic steps we’ll take are:

  1. You: Find the file
  2. You: Change the file and save your changes
  3. You: Create a pull request
  4. The manager (me or somebody else): Merges the pull request, making your changes live

For the purpose of this article, I’ve created a new repository as a playground where you can try this out.

The playground is here:

Go check out the playground and create a Github account, then come back here and continue to the next step, changing a file.

Make your change

Like the rest of this, the process of making a change is actually pretty easy. All you have to do is find the file, make your change, and then save it. So:

Find the file

When you look at the playground, you’ll see a bunch of files like this:

[Screenshot: File List]

Click the file you want to edit. In this case, we’ll be changing a file called “your-name.txt”. Click it.

Once you do that, you’ll see the contents of the file — a list of names, mine at the top — and you’ll see a pencil that lets you edit the file.

Click the pencil!

Change the file

At this point you’ll see a message saying something like:

You are editing a file in a project you do not have write access to. We are forking this project for you (if one does not yet exist) to write your proposed changes to. Submitting a change to this file will write it to a new branch in your fork so you can send a pull request.

Groovy. If you ignore both the jargon and the bad grammar, you can go ahead and add your name to the bottom of the file, and then you’ll see two fields at the bottom that you can use to explain your change:

[Screenshot: Explain Thyself]

This is like an email. The first field is the subject of your change, something brief and to the point. The second field lets you flesh out in more detail what you did, why it’s useful, etc. In many cases — like simply adding your name to this file — your changes are obvious and you can just hit the big green “Propose file change” button.

Let’s press the big green button, shall we?

Send a “pull request”

At this point you’ll see another form with another somewhat cryptic message:

The change you just made was written to a new branch in your fork of this project named patch-1. If you’d like the author of the original project to merge these changes, submit a pull request.

I think the important part of that message is the second sentence:

If you’d like the author of the original project to merge these changes, submit a pull request.

Ok, so how do you do that? Well, it turns out that the page we’re looking at is very similar to the one we were just on. It has two fields, one for a subject and one for a comment. You can fill these out, but if it’s a simple change you don’t need to, and anyway, if you put stuff on the last page it’ll just be copied here already.

So: Press the big green button that says “Create pull request”.

You’re now done, but what did you do, exactly?

Let’s parse what’s happened so far

At this point, you’ve found a file, changed it, and submitted a pull request. Along the way, the system told you that it was “forking this project for you” and that your changes were, “written to a new branch in your fork of this project”.

Um, what?

The most amazing thing that Git does is allow many developers to work on the same file at the same time. It does this by creating what it calls forks and branches. For our purposes these are basically the same thing. The idea behind both is that every so often people working on a file save a copy of the entire repository into what’s called a commit. A commit is a copy of the code that is saved forever so anybody can travel back in time and see the code from two weeks ago or a month ago or whatever. 95% of any Git repository is just a bunch of these copies, and you actually created one when you saved your changes to the file.

This is super useful on its own, but when somebody forks or branches the repository, what they do is say, “I want a perfect copy of all the old stuff, but from here on, I’m going my own way whenever I save things.” Over time, everybody working in the repository does this, creating their own work in their own branches, and amazingly, one person’s work doesn’t interfere with another’s.

Later, once somebody thinks that their work is good enough to share with everybody, they create what’s called a “Pull Request”, just like you did a moment ago, and the owner of the repository — in this case, me — gets an email asking him or her to “pull” the code into the main repository and “merge” the changes into the files that are there. Once this is done, everybody gets those changes from then on.

It’s a brilliant system.
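
For readers who find code easier to follow than prose, here is a toy sketch of the idea in Python. It is not how Git actually stores or merges anything: the Repo class, the merge_pull_request function and the names Alice and Bob are all made up for illustration, and this "merge" only copes with lines being added to a file.

```python
import copy


class Repo:
    """Toy repository: a history of commits, each a full snapshot of the files."""

    def __init__(self, files):
        # Each commit is a (message, {filename: [lines]}) pair.
        self.history = [("Initial commit", copy.deepcopy(files))]
        self.fork_point = 0  # index of the commit this copy started from

    @property
    def files(self):
        """The most recent snapshot."""
        return self.history[-1][1]

    def commit(self, message, filename, lines):
        """Save a new snapshot; older snapshots stay in the history untouched."""
        snapshot = copy.deepcopy(self.files)
        snapshot[filename] = list(lines)
        self.history.append((message, snapshot))

    def fork(self):
        """A perfect copy of all the old stuff that goes its own way from here on."""
        clone = Repo({})
        clone.history = copy.deepcopy(self.history)
        clone.fork_point = len(self.history) - 1
        return clone


def merge_pull_request(main, fork, filename):
    """Hypothetical, simplified merge: append the lines the fork added."""
    base = fork.history[fork.fork_point][1].get(filename, [])
    current = main.files.get(filename, [])
    added = [line for line in fork.files.get(filename, []) if line not in base]
    merged = current + [line for line in added if line not in current]
    main.commit("Merge pull request", filename, merged)


main = Repo({"your-name.txt": ["Mike"]})
alice, bob = main.fork(), main.fork()
alice.commit("Add Alice", "your-name.txt", alice.files["your-name.txt"] + ["Alice"])
bob.commit("Add Bob", "your-name.txt", bob.files["your-name.txt"] + ["Bob"])

# The forks do not affect the main repository until they are merged.
print(main.files["your-name.txt"])
merge_pull_request(main, alice, "your-name.txt")
merge_pull_request(main, bob, "your-name.txt")
print(main.files["your-name.txt"])
```

The first print shows only the original name, because work in a fork stays in the fork. After both pull requests are merged, the second print shows all three names, and the earlier snapshots are still sitting in main.history for anybody who wants to travel back in time.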

My turn: Merging the pull request

When you created that pull request a moment ago, you actually sent me an email and now you have to wait for me to do something. Eventually, I’ll get your email, and when I do I’ll go to Github and see a screen like this:

[Screenshot: PR Screen]

I’ll probably make a comment saying thank you, and then I’ll press the Big Green Button that says, “Merge pull request”.

This will merge your changes into mine and we’ll both go about our merry way. Mission accomplished!

Why this works so well

This system is pretty amazing and it works very well for tiny little projects and massive ones alike (for example, some projects have thousands of active forks). What’s great about this system is that it allows anybody to do whatever they want in their fork without requiring any permission from the owner of the code, and I’m happy to see them experimenting. That work will never affect me until they issue a pull request and I merge it in, accepting their proposed changes.

This process mirrors a lot of real world processes between writers and editors, but solidifies and equalizes it so that there’s a right way to do things and so that nobody can cause any trouble. The process itself can be a little overwhelming at first, with lots of jargon and steps, but once you get it down, it’s smooth and quick and works very well.

As you might expect, there are tons of resources about this on the Web. Some really good ones are at Github and there are even entire online books going into these topics. Like all things, you can go as deep as you want, but the above should give you some good basics to get you started.

Some Definitions

  1. Open vs Closed Source: This is a topic entire theses and books have been written about, but in general open source is a way of creating a program where a developer shares all of their code so anybody can see it. In general when a program is open source, people are welcome to edit the code, help file and fix bugs, etc. On the other hand, closed source development is a way of creating a program so that only the developers can see the code, and the public at large is generally not welcome to contribute, except to sometimes email the developer with comments.

    In a way, the product of open source development is a combination of the code itself plus the program it creates, while in closed source projects the product is the program alone. There are thousands of examples of each of these ways of developing software. For example, Android and the Linux Kernel are open source, while Microsoft Word and iPhones are not. (See how I couldn’t link to the latter two?)

  2. Repository: A collection of files, images, and other stuff that are kept together for a common purpose. Generally it’s a bunch of files that create a website or program, but some people use repositories for all kinds of things, like dealing with identity theft (shameless plug), holding the contents of this very webpage (shameless plug), or even writing online books teaching lawyers to code (not a shameless plug!).

  3. Pull request: A polite way to say, “This code is ready to get included in the main repository. Please pull it in.”

  4. Merging: The process of taking a branch or fork and merging the changes in it into another branch or fork. This combines two people’s work into a single place.

by Mike Lissner at October 06, 2014 07:00 AM

Adding New Fonts to Tesseract 3 OCR Engine


I’m attempting to keep this up to date as Tesseract changes. If you have corrections, please send them directly using the contact page.

I’ve turned off commenting on this article because it was just a bunch of people asking for help and never getting any. If you need help with these instructions, go to Stack Overflow and ask there. If you send me a link to a question on such a site, I’m much more likely to respond positively.


Tesseract is a great and powerful OCR engine, but their instructions for adding a new font are incredibly long and complicated. At CourtListener we have to handle several unusual blackletter fonts, so we had to go through this process a few times. Below I’ve explained the process so others may more easily add fonts to their system.

The process has a few major steps:

Create training documents

To create training documents, open up MS Word or LibreOffice and paste in the contents of the attached file named ‘standard-training-text.txt’. This file contains the training text that is used by Tesseract for the included fonts.

Set your line spacing to at least 1.5, and space out the letters by about 1pt. using character spacing. I’ve attached a sample doc too, if that helps. Set the text to the font you want to use, and save it as font-name.doc.

Save the document as a PDF (call it [lang].font-name.exp0.pdf, with lang being an ISO-639 three letter abbreviation for your language), and then use the following command to convert it to a 300dpi tiff (requires imagemagick):

convert -density 300 -depth 4 lang.font-name.exp0.pdf lang.font-name.exp0.tif

You’ll now have a good training image called lang.font-name.exp0.tif. If you’re adding multiple fonts, or bold, italic or underline, repeat this process multiple times, creating one doc → pdf → tiff per font variation.

Train Tesseract

The next step is to run tesseract over the image(s) we just created, and to see how well it can do with the new font. After it’s taken its best shot, we then give it corrections. It’ll provide us with a box file, which is just a file containing x,y coordinates of each letter it found along with what letter it thinks it is. So let’s see what it can do:

tesseract lang.font-name.exp0.tif lang.font-name.exp0 batch.nochop makebox

You’ll now have a file called lang.font-name.exp0.box, and you’ll need to open it in a box-file editor. There are a bunch of these on the Tesseract wiki. The one that works for me (on Ubuntu) is moshpytt, though it doesn’t support multi-page tiffs. If you need to use a multi-page tiff, see the issue on the topic for tips. Once you’ve opened it, go through every letter, and make sure it was detected correctly. If a letter was skipped, add it as a row to the box file. Similarly, if two letters were detected as one, break them up into two lines.
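
If you would rather script some of these corrections than click through them, a box file is easy to handle in code. The sketch below assumes the Tesseract 3 box format: one character per row, followed by its left, bottom, right and top pixel coordinates and a page number. The Box class, the helper names and the sample coordinates are invented for illustration, so check them against a box file from your own run.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file row: the character first, then five integers."""
    char, *numbers = line.split()
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(char, left, bottom, right, top, page)


def format_box(box: Box) -> str:
    """Turn a Box back into a box-file row."""
    return " ".join(str(field) for field in box)


def split_box(box: Box, first: str, second: str, x: int) -> List[Box]:
    """Break one box that swallowed two letters into two rows at pixel column x."""
    if not box.left < x < box.right:
        raise ValueError("split position must fall inside the box")
    return [box._replace(char=first, right=x), box._replace(char=second, left=x)]


# Made-up rows: the second one pretends "rn" was misread as a single "m".
sample = ["T 100 200 130 240 0", "m 135 200 170 225 0"]
boxes = [parse_box_line(line) for line in sample]
fixed = boxes[:1] + split_box(boxes[1], "r", "n", 150)
for box in fixed:
    print(format_box(box))
```

This only covers the "two letters detected as one" case. Adding a skipped letter is the same idea in reverse: build a Box by hand and insert it at the right position in the list before writing the rows back out.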

When that’s done, you feed the box file back into tesseract:

tesseract lang.font-name.exp0.tif lang.font-name.exp0 nobatch box.train.stderr

Next, you need to detect the character set used in all your box files:

unicharset_extractor *.box

When that’s complete, you need to create a font_properties file. It should list every font you’re training, one per line, and identify whether it has the following characteristics: <fontname> <italic> <bold> <fixed> <serif> <fraktur>

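So, for example, if you use the standard training data, you might end up with a file like this (one font per line; the font names are missing from this copy, so <fontname> stands in for each):

<fontname> 0 0 0 0 0
<fontname> 0 1 0 0 0
<fontname> 1 1 0 0 0
<fontname> 1 0 0 0 0
<fontname> 0 0 0 1 0
<fontname> 0 1 0 1 0
<fontname> 1 0 0 1 0
<fontname> 1 1 0 1 0
<fontname> 0 0 0 1 0
<fontname> 0 1 0 1 0
<fontname> 1 0 0 1 0
<fontname> 1 1 0 1 0
<fontname> 0 0 1 1 0
<fontname> 0 1 1 1 0
<fontname> 1 1 1 1 0
<fontname> 1 0 1 1 0
<fontname> 0 0 0 1 0
<fontname> 0 1 0 1 0
<fontname> 1 0 0 1 0
<fontname> 1 1 0 1 0
<fontname> 0 0 0 0 1
<fontname> 0 0 0 0 1
<fontname> 0 0 0 1 0
<fontname> 0 1 0 1 0
<fontname> 1 1 0 1 0
<fontname> 1 0 0 1 0
<fontname> 0 0 0 1 0
<fontname> 0 1 0 1 0
<fontname> 1 1 0 1 0
<fontname> 1 0 0 1 0
<fontname> 0 0 0 0 0
<fontname> 0 1 0 0 0
<fontname> 1 0 0 0 0
<fontname> 1 1 0 0 0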

Note that this is the standard font_properties file that should be supplied with Tesseract and I’ve added the two bold rows for the blackletter fonts I’m training. You can also see which fonts are included out of the box.
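
Since a font_properties file is easy to get wrong when edited by hand, a quick sanity check can help before you run the training tools. This is only a sketch: it assumes each line is a font name followed by exactly five 0/1 flags in the order given above, and the function name and the two example font names are hypothetical.

```python
FLAGS = ("italic", "bold", "fixed", "serif", "fraktur")


def parse_font_properties(text):
    """Return {fontname: {flag: bool}}, raising ValueError on malformed lines."""
    fonts = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # this checker skips blank lines
        name, *values = line.split()
        if len(values) != len(FLAGS) or any(v not in ("0", "1") for v in values):
            raise ValueError(f"line {number}: expected a name and five 0/1 flags")
        fonts[name] = {flag: value == "1" for flag, value in zip(FLAGS, values)}
    return fonts


# Hypothetical font names, for illustration only.
example = """\
examplesans 0 0 0 0 0
exampleblackletter 0 0 0 0 1
"""
fonts = parse_font_properties(example)
print([name for name, flags in fonts.items() if flags["fraktur"]])
```

A blackletter font would normally have the last flag (fraktur) set, which is what the final line of the sketch picks out.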

We’re getting near the end. Next, create the clustering data:

mftraining -F font_properties -U unicharset -O lang.unicharset *.tr 
cntraining *.tr

If you want, you can create a wordlist or a unicharambigs file. If you don’t plan on doing that, the last step is to combine the various files we’ve created.

To do that, rename each of the language files (normproto, Microfeat, inttemp, pffmtable) to have your lang prefix, and run (mind the dot at the end):

combine_tessdata lang.
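
The renaming step that comes before combine_tessdata is tedious by hand, so here is a minimal Python sketch of it. It assumes the four files named above are sitting in your training directory; the function name is a placeholder rather than anything Tesseract provides, and if your version produces other files you can pass your own list of names.

```python
from pathlib import Path

# The file names listed in the step above.
TRAINING_FILES = ("normproto", "Microfeat", "inttemp", "pffmtable")


def add_lang_prefix(directory, lang, names=TRAINING_FILES):
    """Rename e.g. normproto to <lang>.normproto and return the new paths."""
    renamed = []
    for name in names:
        source = Path(directory) / name
        if not source.exists():
            continue  # skip files this run did not produce
        target = source.with_name(f"{lang}.{name}")
        source.rename(target)
        renamed.append(target)
    return renamed
```

Calling add_lang_prefix(".", "lang") from the training directory, with your own language code in place of lang, leaves the files ready for the combine_tessdata command.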

This will create all the data files you need, and you just need to move them to the correct place on your OS. On Ubuntu, I was able to move them with:

sudo mv lang.traineddata /usr/local/share/tessdata/

And that, good friend, is it. Worst process for a human, ever.


by Mike Lissner at October 06, 2014 07:00 AM

MIMS 2012

How Tina Fey’s “Lessons From Late Night” Apply to Product Design

In “Lessons From Late Night”, Tina Fey discusses the lessons she learned from Lorne Michaels while working at SNL. Turns out most of them are lessons I’ve learned in product design while working at Optimizely.

“Producing is about discouraging creativity.”

She goes on to say, “A TV show comprises many departments — costumes, props, talent, graphics, set dressing, transportation. Everyone in every department wants to show off his or her skills and contribute creatively to the show, which is a blessing.” But this creative energy must be harnessed and directed in a way that contributes positively to the sketch.

Applied to design, this basically means you need to say “no” a lot. Everyone’s full of ideas and will suggest things like, “Why don’t we just add a toggle so the user can choose their preference? More choice is good, right?” And then you need to explain that actually, users typically stick with the defaults, and don’t spend time configuring their software, and letting them toggle this option has all kinds of other product implications, so, no, sorry, let’s try this other solution instead.

“The show doesn’t go on because it’s ready; it goes on because it’s eleven-thirty.”

This is a lesson in perfection. She elaborates:

You have to try your hardest to be at the top of your game and improve every joke until the last possible second, but then you have to let it go. You can’t be that kid standing at the top of the waterslide, overthinking it. You have to go down the chute. […] You have to let people see what you wrote. It will never be perfect, but perfect is overrated. Perfect is boring on live television.

Just change a few words to “design” and “the web,” and this applies perfectly to product design. Designers can polish forever, but perfect is the enemy of done. But unlike SNL, Optimizely (and I imagine most startups) doesn’t often have hard 11:30 PM Saturday night deadlines, which means we have a tendency to let dates slip. I used to think that was great (“I can spend time polishing!”), but I’ve found that deadlines force you to make tough decisions in order to release product (such as cutting the scope of a feature). And that extra “polish” I think I’m adding is just me overthinking decisions I’ve already made. And, oh, guess what: now that actual human beings are using it, we need to cut or change those corners of the UI I polished, because they don’t matter at all in the real world.

“When hiring, mix Harvard nerds with Chicago improvisers and stir.”

The gist of this lesson is that diversity of thought is important when hiring. Having a variety of designers with different backgrounds and skills results in a better product.

At Optimizely, we have a mix of visual-designers-turned-UX-designers, designers formally trained in human–computer interaction and psychology (the Harvard nerds of the group, such as yours truly), and developers turned designers. We all push and learn from each other. “The Harvard guys check the logic and grammatical construction of every joke, and the improvisers teach them how to be human. It’s Spock and Kirk.” In our case, the HCI folks make sure designs are usable and don’t violate established interaction patterns, and the visual designers make sure we aren’t shipping a bunch of gray boxes.

“Never cut to a closed door.”

As applied to user experience design, this is a way of saying, “don’t leave the user at a dead end”. If users get to a screen where they can’t do anything, then you’ve lost them. Products often dump users in empty screens that have no content (we’ve made this mistake plenty at Optimizely), which lowers engagement and increases churn. Marketing pages can lack a call to action, leading to a high bounce rate. You should always provide clear next steps.

“Don’t hire anyone you wouldn’t want to run into in the hallway at three in the morning.”

This is a way of saying hire people you enjoy working with. At Optimizely, culture is a huge part of the hiring process. Work is easier, more fun, and turns out better when you’re working with people you like and respect.

Writing comedy and designing product don’t sound related, but there’s a lot of overlap in the creative process. As Tina’s lessons show, they each have a lot they can learn from each other.

by Jeff Zych at October 06, 2014 12:13 AM

September 30, 2014

Ph.D. student

objectivity is powerful

Like “neoliberal”, “objectivity” in contemporary academic discourse is only used as a term of disparagement. It has fallen out of fashion to speak about “objectivity” in scientific language. It remains in fashion to be critical of objectivity in those disciplines that have been criticizing objectivity since at least the 70’s.

This is too bad because objectivity is great.

The criticism goes like this: scientists and journalists both used to say that they were being objective. There was a lot to this term. It could mean ‘disinterested’ or it could mean so rigorous as to be perfectly inter-subjective. It sounded good. But actually, all the scientists and journalists who claimed to be objective were sexist, racist, and lapdogs of the bourgeoisie. They used ‘objectivity’ as a way to exclude those who were interested in promoting social justice. Hence, anyone who claims to be objective is suspicious.

There are some more sophisticated arguments than this but their sophistication only weakens the main emotional thrust of the original criticism. The only reason for this sophistication is to be academically impressive, which is fundamentally useless, or to respond in good faith to criticisms, which is politically unnecessary and probably unwise.

Why is it unwise to respond in good faith to criticisms of a critique of objectivity? Because to concede that good faith response to criticism is epistemically virtuous would be to concede something to the defender of objectivity. Once you start negotiating with the enemy in terms of reasons, you become accountable to some kind of shared logic which transcends your personal subjectivity, or the collective subjectivity of those whose perspectives are channeled in your discourse.

In a world in which power is enacted and exerted through discourse, and in which cultural logics are just rules in a language game provisionally accepted by players, this rejection of objectivity is true resistance. The act of will that resists logical engagement with those in power will stymie that power. It’s what sticks it to the Man.

The problem is that however well-intentioned this strategy may be, it is dumb.

It is dumb because as everybody knows, power isn’t exerted mainly through discourse. Power is exerted through violence. And while it may be fun to talk about “cultural logics” if you are a particular kind of academic, and even fun to talk about how cultural logics can be violent, that is vague metaphorical poetry compared to something else that they could be talking about. Words don’t kill people. Guns kill people.

Put yourself in the position of somebody designing and manufacturing guns. What do you talk about with your friends and collaborators? If you think that power is about discourse, then you might think that these people talk about their racist political agenda, wherein they reproduce the power dynamics that they will wield to continue their military dominance.

They don’t though.

Instead what they talk about is the mechanics of how guns work and the technicalities of supply chain management. Where are they importing their gunpowder from and how much does it cost? How much will it go boom?

These conversations aren’t governed by “cultural logics.” They are governed by logic. Because logic is what preserves the intersubjective validity of their claims. That’s important because to successfully build and market guns, the gun has to go boom the same amount whether or not the person being aimed at shares your cultural logic.

This is all quite grim. “Of course, that’s the point: objectivity is the language of violence and power! Boo objectivity!”

But that misses the point. The point is that it’s not that objectivity is what powerful people dupe people into believing in order to stay powerful. The point is that objectivity is what powerful people strive for in order to stay powerful. Objectivity is powerful in ways that more subjectively justified forms of knowledge are not.

This is not a popular perspective. There are a number of reasons for this. One is that attaining objective understanding is a lot of hard work and most people are just not up for it. Another is that there are a lot of people who have made their careers arguing for a much more popular perspective, which is that “objectivity” is associated with evil people and therefore we should reject it as an epistemic principle. There will always be an audience for this view, who will be rendered powerless by it and become the self-fulfilling prophecy of the demagogues who encourage their ignorance.

by Sebastian Benthall at September 30, 2014 05:00 AM

Ph.D. student

Re: Adding a data use disclaimer for social media sites

Hi Kyle,

It's nice to think about what a disclaimer should look like for services that are backing-up/syndicating content from social networking sites. And comparing that disclaimer to the current situation is a useful reminder. It's great to be conscious not only of the potential privacy advantages but, more generally, of the privacy implications of decentralized technologies like the Web.

Is there an etiquette about when it's fine and when it's not to publish a copy of someone's Twitter post? We may develop one, but in the meantime, I think that when someone has specifically replied to your post, it's in context to keep a copy of that post.


P.S. This is clearly mostly just a test of the webmention-sending code that I've added to this Bcc blog, but I wanted to say bravo anyway, and why not use a test post to say bravo?

by at September 30, 2014 12:50 AM

September 18, 2014

Ph.D. student

technical work

Dipping into Julian Orr’s Talking about Machines, an ethnography of Xerox photocopier technicians, has set off some light bulbs for me.

First, there’s Orr’s story: Orr dropped out of college and got drafted, then worked as a technician in the military before returning to school. He paid the bills doing technical repair work, and found it convenient to do his dissertation on those doing photocopy repair.

Orr’s story reminds me of my grandfather and great-uncle, both of whom were technicians–radio operators–during WWII. Their civilian careers were as carpenters, building houses.

My own dissertation research is motivated by my work background as an open source engineer, and my own desire to maintain and improve my technical chops. I’d like to learn to be a data scientist; I’m also studying data scientists at work.

Further fascinating was Orr’s discussion of the Xerox technicians’ identity as technicians as opposed to customers:

The distinction between technician and customer is a critical division of this population, but for technicians at work, all nontechnicians are in some category of other, including the corporation that employs the technicians, which is seen as alien, distant, and only sometimes an ally.

It’s interesting to read about this distinction between technicians and others in the context of Xerox photocopiers when I’ve been so affected lately by the distinction between tech folk and others, and between data scientists and others. This distinction between those who do technical work and those whom they serve is a deep historical one that transcends the contemporary and over-computed world.

I recall my earlier work experience. I was a decent engineer and engineering project manager. I was a horrible account manager. My customer service skills were abysmal, because I did not empathize with the client. The open source context contributes to this attitude, because it makes a different set of demands on its users than consumer technology does. One gets assistance with consumer grade technology by hiring a technician who treats you as a customer. You get assistance with open source technology by joining the community of practice as a technician. Commercial open source software, according to the Pentaho beekeeper model, is about providing, at cost, that customer support.

I’ve been thinking about customer service and reflecting on my failures at it a lot lately. It keeps coming up. Mary Gray’s piece, When Science, Customer Service, and Human Subjects Research Collide explicitly makes the connection between commercial data science at Facebook and customer service. The ugly dispute between Gratipay (formerly Gittip) and Shanley Kane was, I realized after the fact, a similar crisis between the expectations of customers/customer service people and the expectations of open source communities. When “free” (gratis) web services display a similar disregard for their users as open source communities do, it’s harder to justify in the same way that FOSS does. But there are similar tensions, perhaps. It’s hard for technicians to empathize with non-technicians about their technical problems, because their lived experience is so different.

It’s alarming how much is being hinged on the professional distinction between technical worker and non-technical worker. The intra-technology industry debates are thick with confusions along these lines. What about marketing people in the tech context? Sales? Are the “tech folks” responsible for distributional justice today? Are they in the throes of an ideology? I was reading a paper the other day suggesting that software engineers should be held ethically accountable for the implicit moral implications of their algorithms. Specifically the engineers; for some reason not the designers or product managers or corporate shareholders, who were not mentioned. An interesting proposal.

Meanwhile, at the D-Lab, where I work, I’m in the process of navigating my relationship between two teams, the Technical Team, and the Services Team. I have been on the Technical Team in the past. Our work has been to stay on top of and assist people with data science software and infrastructure. Early on, we abolished regular meetings as a waste of time. Naturally, there was a suspicion expressed to me at one point that we were unaccountable and didn’t do as much work as others on the Services Team, which dealt directly with the people-facing component of the lab–scheduling workshops, managing the undergraduate work-study staff. Sitting in on Services meetings for the first time this semester, I’ve been struck by how much work the other team does. By and large, it’s information work: calendaring, scheduling, entering into spreadsheets, documenting processes in case of turnover, sending emails out, responding to emails. All important work.

This is exactly the work that information technicians want to automate away. If there is a way to reduce the amount of calendaring and entering into spreadsheets, programmers will find a way. The whole purpose of computer science is to automate tasks that would otherwise be tedious.

Eric S. Raymond’s classic (2001) essay How to Become a Hacker characterizes the Hacker Attitude, in five points:

  1. The world is full of fascinating problems waiting to be solved.
  2. No problem should ever have to be solved twice.
  3. Boredom and drudgery are evil.
  4. Freedom is good.
  5. Attitude is no substitute for competence.

There is no better articulation of the “ideology” of “tech folks” than this, in my opinion, yet Raymond is not used much as a source for understanding the idiosyncrasies of the technical industry today. Of course, not all “hackers” are well characterized by Raymond (I’m reminded of Coleman’s injunction to speak of “cultures of hacking”) and not all software engineers are hackers (I’m sure my sister, a software engineer, is not a hacker. For example, based on my conversations with her, it’s clear that she does not see all the unsolved problems with the world to be intrinsically fascinating. Rather, she finds problems that pertain to some human interest, like children’s education, to be most motivating. I have no doubt that she is a much better software engineer than I am–she has worked full time at it for many years and now works for a top tech company. As somebody closer to the Raymond Hacker ethic, I recognize that my own attitude is no substitute for that competence, and hold my sister’s abilities in very high esteem.)

As usual, I appear to have forgotten where I was going with this.

by Sebastian Benthall at September 18, 2014 12:08 AM

September 17, 2014

Ph.D. student

frustrations with machine ethics

It’s perhaps because of the contemporary two cultures problem of tech and the humanities that machine ethics is in such a frustrating state.

Today I read danah boyd’s piece in The Message about technology as an arbiter of fairness. It’s another baffling conflation of data science with neoliberalism. This time, the assertion was that the ideology of the tech industry is neoliberalism, hence their idea of ‘fairness’ is individualist and against social fabric. It’s not clear what backs up these kinds of assertions. They are more or less refuted by the fact that industrial data science is obsessed with our network of ties for marketing reasons. If anybody understands the failure of the myth of the atomistic individual, it’s “tech folks,” a category boyd uses to capture, I guess, everyone from marketing people at Google to venture capitalists to startup engineers to IBM researchers. You know, the homogeneous category that is “tech folks.”

This kind of criticism makes the mistake of thinking that a historic past is the right way to understand a rapidly changing present that is often more technically sophisticated than the critics understand. But critical academics have fallen into the trap of critiquing neoliberalism over and over again. One problem is that tech folks don’t spend a ton of time articulating their ideology in ways that are convenient for pop culture critique. Often their business models require rather sophisticated understandings of the market, etc. that don’t fit readily into that kind of mold.

What’s needed is substantive progress in computational ethics. Ok, so algorithms are ethically and politically important. What politics would you like to see enacted, and how do you go about implementing that? How do you do it in a way that attracts new users and is competitively funded so that it can keep up with the changing technology we use to access the web? These are the real questions. There is so little effort spent trying to answer them. Instead there’s just an endless series of op-eds bemoaning the way things continue to be bad, because that’s easier than having agency about making things better.

by Sebastian Benthall at September 17, 2014 03:26 AM

September 15, 2014

Ph.D. alumna

Am I a Blogger?

On July 25th, I was asked to address thousands of women (and some men) at the 10th annual Blogher conference. I was asked to reflect on what it meant to be a blogger and so I did. You can watch it here:

Or you can read an edited version of the remarks I offered below:

I started blogging in 1997. I was 19 years old. I didn’t call it blogging then, and my blog didn’t look like it does now. It was a manually-created HTML site with a calendar made of tables (OMG tables) and Geocities-style forward and back buttons with terrible graphics. I posted entries a few times a week as part of an independent study on Buddhism as a Brown University student that involved both meditation and self-reflection. Each week, the monk I was working with would ask me to reflect on certain things, and I would write. And write. And write. He lived in Ohio and had originally proposed sending letters, but I thought pencils were a foreign concept. I decided to type my thoughts and that, if I was going to type them, I might as well put them up online. Ah, teen logic.

Most of those early reflections were deeply intense. I posted in detail about what it meant to navigate rape and abuse, to come into a sense of self in light of scarring situations. I have since deleted much of this material, not because I’m ashamed by it, but because I found that it created the wrong introduction. As my blog became more professional, people would flip back and look at those first posts and be like errr… uhh… While I’m completely open about my past, I’ve found that rape details are not the way that I want to start a conversation most of the time. So, in a heretical act, I deleted those posts.

What my blog is to me and to others has shifted tremendously over the years. For the first five years, my blog was read by roughly four people. That was fine because I wasn’t thinking about audience. I was blogging to think, to process, to understand. To understand myself and the world around me. I wasn’t really aware of or interested in community building so I didn’t really participate in the broader practice. Blogging was about me. Until things changed.

As research became more central to my life, my blog became more focused on my research. In December 2002, I started tracking Friendster. (Keep in mind that the first public news story was written about Friendster by the Village Voice in June of 2003.) I was documenting my understanding of the new technologies that were emerging because that’s what I was thinking about. But because I was writing about tech, my blog caught the eye of technology folks who were trying to track this new phenomenon.

I became a blogger because people who identified as bloggers called me a blogger. And they linked to my blog. And commented on it. And talked about what I posted. I was invited to blog on group sites, like Many-to-Many, and participate in blogger-related activities. I became a part of the nascent blogging world, kinda by accident.

As I became understood as a blogger, people started asking me about my blogging practice. Errrr… Blogging practice? And then people started asking me about my monetization plans. Woah nelly! So I did some reflection and made a few very intentional decisions. I valued blogging because it allowed me to express what was on my mind without anyone else editing me, but I understood that I was becoming part of the public. I valued the freedom to have a single place where my voice sat, where I was in control, but I also had power. So I struggled, but I concluded that at the end of the day, I couldn’t keep this up if this stopped being about me. And so I decided to never add advertisements, to never commercialize my personal blog, and to never let others post there. I needed boundaries. I’d blog elsewhere under other terms, but my blog was mine.

I started thinking a lot more about blogging, both personally and professionally, when I went to work for Ev Williams at Blogger (already acquired by Google). My title was “ethnographic engineer.” (Gotta love Google titles.) And my job was to help the Blogger team better understand the diversity of practices unfolding in Blogger. I interviewed numerous bloggers. I randomly sampled Blogger entries to get a sense of the diversity of posts that we were seeing. And I helped the engineering team think about different types of practices. I also became much more involved in the blogging community, attending blogging events like Blogher ten years ago.

I made a decision to live certain parts of my life in public in order not to hide from myself, in order to be human in a networked age where I am more comfortable behind a keyboard than at a bar. But I also had to contend with the fact that I was visible in ways that were de-humanizing. As a public speaker, I am regularly objectified, just a mouthpiece on stage with no feelings. I’ve smiled my way through catcalls and sexualized commentary. Sadly, it hasn’t just been men who have objectified me. At the second Blogher, I was stunned to read many women blog about my talk by dissecting my hairstyle and clothing choices in condescending ways. I may have been a blogger, but I didn’t feel like it was a community. I felt like I had become another digital artifact to be poked and prodded.

My experience with objectification took on a whole new level in 2009 when, at Web2.0 Expo, my experience on stage devolved. I wrote about this incident in gory detail on my blog, but the basic story is that I talk fast. And when I’m nervous, I talk even faster. I was nervous, it was a big stage, there were high-power lights so I couldn’t see anything. And there was a Twitter feed behind me that I couldn’t see. As I nervously started in on my talk, the audience began critiquing me on Twitter and then laughing at what others wrote. It devolved into outright misogyny—the Twitter stream was taken down and then put back up. The audience was loud but clearly not listening to what I had to say. I didn’t know what was going on, and I melted. I talked faster, I stared at the podium. I didn’t leave stage in tears, but I thought about doing so. It was humiliating.

When I finally got off stage and online, I learned that people were talking about me as though I had no feelings. And so I decided to explain what it was like to be on that stage in that moment. It was gut-wrenching to write but it hit a chord. And it allowed me to see the beauty and pain of being public in every sense of the word. Being in public. Being a public figure. Being public with my feelings. Being public.

I’ve spent the last decade studying teenagers and their relationship to social media — in effect, their relationship to public life. Through the process, I’ve watched many of them struggle with what it means to be public, what it means to have a public voice — all in an environment where young people are not encouraged to be a part of public life. Over the last 30 years, we’ve systematically eliminated young people’s ability to participate in public life. They turn to technology as a relief valve, as an opportunity to have a space of their own. As a chance to be public. And, of course, we shoo them away from there too.

Because teens want to be *in* public, we assume that they want to *be* public. Thus, we assume that they don’t want privacy. Nothing could be further from the truth. Teens want to be a part of public life, but they want privacy from those who hold power over them. Having both is often very difficult so teenagers develop sophisticated techniques to be public and to have privacy. They focus more on hiding access to meaning than hiding access to content. They use the technologies they have around them to navigate their identity and voice. They are growing up in a digital world and they try to make sense of it the best they can. But all too often, they’re blamed and shamed for what they do and adults don’t take the time to understand where they’re coming from and why.

In spending a decade with youth, I’ve learned a lot about what it means to be public. I’ve learned how to encode what I’m saying, layer my messages to reveal different things to different people. I’ve learned how to appear to be open and still keep some things to myself. I’ve learned how to use different tools for different parts of my network. And I’ve learned just how significantly the internet has changed since I was a teen.

I grew up in an era when the internet was comprised of self-identified geeks, freaks, and queers. Claiming all three, I felt quite at home. Today’s internet is mainstream. Today’s youth are growing up in a world where technology is taken for granted. Traditional aspects of power are asserted through technology. It’s no longer the realm of the marginalized, but the new mechanism by which marginalization happens.

A decade ago, people talked about the democratizing power of blogging, but even back then we all realized that some voices were more visible than others. This is what sparked the creation of Blogher in the first place. Women’s voices were often ignored online, even when they were participating. The mechanisms of structural inequality got reified, which went against what many people imagined the internet to be about. The conversation focused on how we could create a future based on common values, a future that challenged the status quo. We never imagined we would be the status quo.

As more and more people have embraced social media and blogging, normative societal values have dominated our cultural frames about these tools. It’s no longer about imagined communities, new mechanisms of enlightenment, or resisting institutional power. Technology is situated within a context of capitalism, traditional politics, and geoglobal power struggles.

With that in mind, what does it mean to be a blogger today? What does it mean to be public? Is value only derived by commercial acts of self-branding? How can we understand the work of identity and public culture development? Is there a coherent sense of being a blogger? What are the shared values that underpin the practice?

I started blogging to feel my humanity. I became a part of the blogging community to participate in shaping a society that I care about. I reflect and share publicly to engage others and build understanding. This is my blogging practice. What is yours?

(This post was originally posted on August 6, 2014 in The Message on Medium.)

by zephoria at September 15, 2014 04:12 PM

September 13, 2014

Ph.D. student

notes towards benign superintelligence j/k

Nick Bostrom will give a book talk on campus soon. My departmental seminar on “Algorithms as Computation and Culture” has opened with a paper on the ethics of algorithms and a paper on accumulated practical wisdom regarding machine learning. Of course, these are related subjects.

Jenna Burrell recently trolled me in order to get me to give up my own opinions on the matter, which are rooted in a philosophical functionalism. I’ve learned just now that these opinions may depend on obsolete philosophy of mind. I’m not sure. R. Scott Bakker’s blog post against pragmatic functionalism makes me wonder: what do I believe again? I’ve been resting on a position established when I was deeper into this stuff seven years ago. A lot has happened since then.

I’m turning into a historicist perhaps due to lack of imagination or simply because older works are more accessible. Cybernetic theories of control–or, electrical engineering theories of control–are as relevant, it seems, to contemporary debates as machine learning, which to the extent it depends on stochastic gradient descent is just another version of cybernetic control anyway, right?

Ashwin Parameswaran’s blog post about Beniger’s Control Revolution illustrates this point well. To a first approximation, we are simply undergoing the continuation of prophecies of the 20th century, only more thoroughly. Over and over, and over, and over, and over, like a monkey with a miniature cymbal.


One property of a persistent super-intelligent infrastructure of control would be our inability to comprehend it. Our cognitive models, constructed over the course of a single lifetime with constraints on memory both in time and space, limited to a particular hypothesis space, could simply be outgunned by the complexity of the sociotechnical system in which they are embedded. I tried to get at this problem with work on computational asymmetry but didn’t find the right audience. I just learned there’s been work on this in finance, which makes sense, as it’s where it’s most directly relevant today.

by Sebastian Benthall at September 13, 2014 01:43 AM

September 09, 2014

Ph.D. student

more on algorithms, judgment, polarization

I’m still pondering the most recent Tufekci piece about algorithms and human judgment on Twitter. It prompted some grumbling among data scientists. Sweeping statements about ‘algorithms’ do that, since to a computer scientist ‘algorithm’ is about as general a term as ‘math’.

In later conversation, Tufekci clarified that when she was calling out the potential problems of algorithmic filtering of the Twitter newsfeed, she was speaking to the problems of a newsfeed curated algorithmically for the sake of maximizing ‘engagement’. Or ads. Or, it is apparent on a re-reading of the piece, new members. She thinks an anti-homophily algorithm would maybe be a good idea, but that this is so unlikely according to the commercial logic of Twitter as to be a marginal point. And, meanwhile, she defends ‘human prioritization’ over algorithmic curation, despite the fact that homophily (not to mention preferential attachment) is arguably a negative consequence of social systems driven by human judgment.

I think inquiry into this question is important, but bound to be confusing to those who aren’t familiar in a deep way with network science, machine learning, and related fields. It’s also, I believe, helpful to have a background in cognitive science, because that’s a field which maintains that human judgment and computational systems are doing fundamentally commensurable kinds of work. When we think in a sophisticated way about crowdsourced labor, we use this sort of thinking. We acknowledge, for example, that human brains are better at the computational task of image recognition, so then we employ Turkers to look at and label images. But those human judgments are then inputs to statistical processes that verify and check those judgments against each other. Later, those determinations that result from a combination of human judgment and algorithmic processing could be used in a search engine–which returns answers to questions based on human input. Search engines, then, are also a way of combining human and purely algorithmic judgment.
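
To make the labeling example concrete, here is a minimal sketch of one way human judgments might be checked against each other: a plain majority vote with an agreement threshold. The function name, the threshold, and the sample votes are all hypothetical, and real crowdsourcing pipelines typically use more elaborate statistical models; this only shows the shape of the idea.

```python
from collections import Counter


def aggregate_labels(votes, min_agreement=0.6):
    """Combine several workers' labels for one item by majority vote.

    Returns (label, agreement). The label is None when the most common
    answer does not reach the min_agreement share of the votes.
    """
    if not votes:
        return None, 0.0
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return (label if agreement >= min_agreement else None), agreement


# Hypothetical worker labels for two images (illustrative data only).
image_votes = {
    "img_001": ["cat", "cat", "dog", "cat", "cat"],
    "img_002": ["cat", "dog", "bird"],
}

results = {image: aggregate_labels(votes) for image, votes in image_votes.items()}
for image, (label, agreement) in results.items():
    print(image, label, round(agreement, 2))
```

Items where the workers disagree too much come back with no label, which is the point at which a system like this would ask for more human judgment rather than guess.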

What it comes down to is that virtually all of our interactions with the internet are built around algorithmic affordances. And these systems can be understood systematically if we reject the quantitative/qualitative divide at the ontological level. Reductive physicalism entails this rejection, but–and this is not to be understated–pisses off or alienates people who do qualitative or humanities research.

This is old news. C.P. Snow’s The Two Cultures. The Science Wars. We’ve been through this before. Ironically, the polarization is algorithmically visible in the contemporary discussion about algorithms.*

The Two Cultures on Twitter?

It’s I guess not surprising that STS and cultural studies academics are still around and in opposition to the hard scientists. What’s maybe new is how much computer science now affects the public, and how the popular press appears to have allied itself with the STS and cultural studies view. I guess this must be because cultural anthropologists and media studies people are more likely to become journalists and writers, whereas harder science is pretty abstruse.

There’s an interesting conflation now from the soft side of the culture wars of science with power/privilege/capitalism that plays out again and again. I bump into it in the university context. I read about it all the time. Tufekci’s pessimism that the only algorithmic filtering Twitter would adopt would be one that essentially obeys the logic “of Wall Street” is, well, sad. It’s sad that an unfortunate pairing that is analytically contingent is historically determined to be so.

But there is also something deeply wrong about this view. Of course there are humanitarian scientists. Of course there is a nuanced center to the science wars “debate”. It’s just that the tedious framing of the science wars has been so pervasive and compelling, like a commercial jingle, that it’s hard to feel like there’s an audience for anything more subtle. How would you even talk about it?

* I need to confess: I think there was some sloppiness in that Medium piece. If I had had more time, I would have done something to check which conversations were actually about the Tufekci article, and which were just about whatever. I feel I may have misrepresented this in the post. For the sake of accessibility or to make the point, I guess. Also, I’m retrospectively skittish about exactly how distinct a cluster the data scientists were, and whether its insularity might have been an artifact of the data collection method. I’ve been building out poll.emic in fits mainly as a hobby. I built it originally because I wanted to at last understand Weird Twitter’s internal structure. The results were interesting but I never got to writing them up. Now I’m afraid that the culture has changed so much that I wouldn’t recognize it any more. But I digress. Is it even notable that social scientists from different disciplines would have very different social circles around them? Is the generalization too much? And are there enough nodes in this graph to make it a significant thing to say about anything, really? There could be thousands of academic tiffs I haven’t heard about that are just as important but which defy my expectations and assumptions. Or is the fact that Medium appears to have endorsed a particular small set of public intellectuals significant? How many Medium readers are there? Not as many as there are Twitter users, by several orders of magnitude, I expect. Who matters? Do academics matter? Why am I even studying these people as opposed to people who do more real things? What about all the presumably sane and happy people who are not pathologically on the Internet? Etc.

by Sebastian Benthall at September 09, 2014 07:03 AM

September 08, 2014

MIMS 2011

Max Klein on Wikidata, “botpedia” and gender classification

Max Klein defines himself on his blog as a ‘Mathematician-Programmer, Wikimedia-Enthusiast, Burner-Yogi’ who believes in ‘liberty through wikis and logic’. I interviewed him a few weeks ago when he was in the UK for Wikimania 2014. He then wrote up some of his answers so that we could share them with others. Max is a long-time volunteer of Wikipedia who has occupied a wide range of roles as a volunteer and as a Wikipedian in residence for OCLC, among others. He has been working on Wikidata from the beginning but it hasn’t always been plain sailing. Max is outspoken about his ideas and he is respected for that, as well as for his patience in teaching those who want to learn. This interview serves as a brief introduction to Wikidata and some of its early disagreements.

Max Klein in 2011. CC BY SA, Wikimedia Commons

How was Wikidata originally seeded?
In the first days of Wikidata we used to call it a ‘botpedia’ because it was basically just an echo chamber of bots talking to each other. People were writing bots to import information from infoboxes on Wikipedia. A heavy focus of this was data about persons from authority files.

Authority files?
An authority file is a Library Science term that is basically a numbering system to assign authors unique identifiers. The point is to avoid a “which John Smith?” problem. At last year’s Wikimania I said that Wikidata itself has become a kind of “super authority control” because now it connects so many other organisations’ authority control (e.g. Library of Congress and IMDB). In the future I can imagine Wikidata being the one authority control system to rule them all.
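
As an illustration of the idea (not Wikidata’s or any library’s actual data model), an authority file can be sketched as a table of records that share a name but carry distinct identifiers, with a “super authority” record also holding the identifiers used by other organisations. Every identifier, scheme name, and field below is made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorityRecord:
    local_id: str   # identifier in this (hypothetical) authority file
    name: str       # the ambiguous human-readable name
    note: str       # disambiguating description
    external_ids: dict = field(default_factory=dict)  # scheme -> identifier


# Two different people with the same name; all identifiers are invented.
records = [
    AuthorityRecord("A1", "John Smith", "novelist", {"library_x": "lx-0001"}),
    AuthorityRecord("A2", "John Smith", "film actor", {"film_db_y": "fy-0420"}),
]


def find_by_name(name):
    """A name alone can match several records: the 'which John Smith?' problem."""
    return [r for r in records if r.name == name]


def find_by_external(scheme, identifier):
    """An identifier from another organisation's scheme picks out one record."""
    for r in records:
        if r.external_ids.get(scheme) == identifier:
            return r
    return None


print(len(find_by_name("John Smith")))
print(find_by_external("film_db_y", "fy-0420").note)
```

Looking up by name returns both records, while looking up by an external identifier returns exactly one, which is the sense in which one system can connect several authority controls.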

In the beginning, each Wikipedia project was supposed to be able to decide whether it wanted to integrate Wikidata. Do you know how this process was undertaken?
It actually wasn’t decided site-by-site. At first only Hungarian, Italian, and Hebrew Wikipedias were progressive enough to try. But once English Wikipedia approved the migration to use Wikidata, soon after there was a global switch for all Wikis to do so (see the announcement here).

Do you think it will be more difficult to edit Wikipedia when infoboxes are linking to templates that derive their data from Wikidata? (both editing and producing new infoboxes?)
It would seem to complicate matters that infobox editing becomes opaque to those who aren’t Wikidata aware. However at Wikimania 2014, two Sergeys from Russian Wikipedia demonstrated a very slick gadget that made this transparent again – it allowed editing of the Wikidata item from the Wikipedia article. So with the right technology this problem is a nonstarter.

Can you tell me about your opposition to the ways in which Wikidata editors decided to structure gender information on Wikidata?
In Wikidata you can put a constraint on what values a property can have. When I came across it the “sex or gender” property said “only one of ‘male, female, or intersex’”. I was opposed to this because I believe that any way the Wikidata community structures the gender options, we are going to imbue it with our own bias. For instance already the property is called “sex or gender”, which shows a lack of distinction between the two, which some people would consider important. So I spent some time arguing that at least we should allow any value. So if you want to say that someone is “third gender” or even that their gender is “Sodium” that’s now possible. It was just an early case of heteronormativity sneaking into the ontology.
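
The difference between the two positions can be sketched in a few lines. This is a hypothetical check written for illustration, not Wikidata’s constraint system or API: a closed “one of” list flags anything outside it, while an open property accepts any value.

```python
def violates_one_of(value, allowed=None):
    """Return True if value breaks a 'one of' constraint.

    allowed=None models a property with no such constraint, so any
    value is accepted.
    """
    return allowed is not None and value not in allowed


# The closed list described in the interview, as a plain set of strings.
closed_list = {"male", "female", "intersex"}

for value in ["female", "third gender", "Sodium"]:
    print(value, violates_one_of(value, closed_list), violates_one_of(value))
```

Under the closed list, “third gender” is reported as a violation; with no list, nothing is, including values like “Sodium” that are probably mistakes. That trade-off is the design choice being argued about.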

Wikidata uses a CC0 license which is less restrictive than the CC BY SA license that Wikipedia is governed by. What do you think the impact of this decision has been in relation to others like Google who make use of Wikidata in projects like the Google Knowledge Graph?
Wikidata being CC0 at first seemed very radical to me. But one thing I noticed was that, increasingly, where the Google Knowledge Graph now credits its “info-cards” to Wikipedia, the attribution will just start disappearing. This seems mostly innocent until you consider that Google is a funder of the Wikidata project. So in some way it could seem like they are just paying to remove a blemish on their perceived omniscience.

But to nip my pessimism I have to remind myself that if we really believe in the Open Source, Open Data credo then this rising tide lifts all boats.

by Heather Ford at September 08, 2014 11:24 AM

Code and the (Semantic) City

Mark Graham and I have just returned from Maynooth in Ireland where we participated in a really great workshop called Code and the City organised by Rob Kitchin and his team at the Programmable City project. We presented a draft paper entitled, ‘Semantic Cities: Coded Geopolitics and Rise of the Semantic Web’ where we trace how the city of Jerusalem is represented across Wikipedia and through Wikidata, Freebase and to Google’s Knowledge Graph in order to answer questions about how linked data and the semantic web change a user’s interactions with the city. We’ve been indebted to the folks from all of these projects who have helped us navigate questions about the history and affordances of these projects so that we can better understand the current Web ecology. The paper is currently being revised and will be available soon, we hope!

by Heather Ford at September 08, 2014 11:08 AM

September 02, 2014

MIMS 2011

Infoboxes and cleanup tags: Artifacts of Wikipedia newsmaking

Infobox from the first version of the 2011 Egyptian Revolution (then ‘protests’) article on English Wikipedia, 25 January, 2011

My article about Wikipedia infoboxes and cleanup tags and their role in the development of the 2011 Egyptian Revolution article has just been published in the journal ‘Journalism: Theory, Practice and Criticism’ (a pre-print is available on The article forms part of a special issue of the journal edited by C W Anderson and Juliette de Meyer who organised the ‘Objects of Journalism’ pre-conference at the International Communication Association conference in London that I attended last year. The issue includes a number of really interesting articles from a variety of periods in journalism’s history – from pica sticks to interfaces, timezones to software, some of which we covered in the August 2013 edition of

My article is about infoboxes and cleanup tags as objects of Wikipedia journalism, objects that have important functions in the coordination of editing and writing by distributed groups of editors. Infoboxes are summary tables on the right hand side of an article that enable readability and quick reference, while cleanup tags are notices at the head of an article warning readers and editors of specific problems with articles. When added to an article, both tools simultaneously notify editors about missing or weak elements of the article and add articles to particular categories of work.

The article contains an account of the first 18 days of the protests that resulted in the resignation of then-president Hosni Mubarak based on interviews with a number of the article’s key editors as well as traces in related articles, talk pages and edit histories. Below is a selection from what happened on day 1:

Day 1: 25 January, 2011 (first day of the protests)

The_Egyptian_Liberal published the article on English Wikipedia on the afternoon of what would become a wave of protests that would lead to the unseating of President Hosni Mubarak. A template was used to insert the ‘uprising’ infobox to house summarised information about the event including fields for its ‘characteristics’, the number of injuries and fatalities. This template was chosen from a range of other infoboxes relating to history and events on Wikipedia, but has since been deleted in favor of the more recently developed ‘civil conflict’ infobox with fields for ‘causes’, ‘methods’ and ‘results’.

The first draft included the terms ‘demonstration’, ‘riot’ and ‘self-immolation’ in the ‘characteristics’ field and was illustrated by the Latuff cartoon of Khaled Mohamed Saeed and Hosni Mubarak with the caption ‘Khaled Mohamed Saeed holding up a tiny, flailing, stone-faced Hosni Mubarak’. Khaled Mohamed Saeed was a young Egyptian man who was beaten to death reportedly by Egyptian security forces and the subject of the Facebook group ‘We are all Khaled Said’ moderated by Wael Ghonim that contributed to the growing discontent in the weeks leading up to 25 January, 2011. The image slot would ideally have been filled by a photograph of the protests, but the cartoon was used because the article was uploaded so soon after the first protests began. It also has significant emotive power and clearly represented the perspective of the crowd of anti-Mubarak demonstrators in the first protests.

Upon publishing, three prominent cleanup tags were automatically appended to the head of the article. These included the ‘new unreviewed article’ tag, the ‘expert in politics needed’ tag and the ‘current event’ tag, warning readers that information on the page may change rapidly as events progress. These three lines of code that constituted the cleanup tags initiated a complex distribution of tasks to different groups of users located in work groups throughout the site: page patrollers, subject experts and those interested in current events.

The three cleanup tags automatically appended to the article when it was published at UTC 13:27 on 25 January, 2011

Looking at the diffs in the first day of the article’s growth, it becomes clear that the article is by no means a ‘blank slate’ that editors fill progressively with prose. Much of the activity in the first stage of the article’s development consisted of editors inserting markers or frames in the article that acted to prioritize and distribute work. Cleanup tags alerted others about what they believed to be priorities (to improve weak sections or provide political expertise, for example) while infoboxes and tables provided frames for editors to fill in details iteratively as new information became available.

By discussing the use of these tools in the context of Bowker and Star’s theories of classification (2000), I argue that these tools are not only material but also conceptual and symbolic. They facilitate collaboration by enabling users to fill in details according to a pre-defined set of categories and by catalyzing notices that alert others to the work that they believe needs to be done on the article. Their power, however, cannot only be seen in terms of their functional value. These artifacts are deployed and removed as acts of social and strategic power play among Wikipedia editors who each want to influence the narrative about what happened and why it happened. Infoboxes and tabular elements arise as clean, simple, well-referenced numbers out of the messiness and conflict that gave rise to them. When cleanup tags are removed, the article develops an implicit authority, appearing to rise above uncertainty, power struggles and the impermanence of the compromise that it originated from.

This categorization practice enables editors to collaborate iteratively with one another because each object signals work that needs to be done by others in order to fill in the gaps of the current content. In addition to this functional value, however, categorization also has a number of symbolic and political consequences. Editors are engaged in a continual practice of iterative summation that contributes to an active construction of the event as it happens rather than a mere assembling of ‘reliable sources’. The deployment and removal of cleanup tags can be seen as an act of power play between editors that affects readers’ evaluation of the article’s content. Infoboxes are similar sites of struggle whose deployment and development result in an erasure of the contradictions and debates that gave rise to them. These objects illuminate how this novel journalistic practice has important implications for the way that political events are represented.

by Heather Ford at September 02, 2014 04:46 PM

September 01, 2014

MIMS 2014

Of Goodreads and Listicles

I’m a HUGE fan of Goodreads. I have been using it for a few years now, and I was depressed when the new Kindle Fire came out with Goodreads integration and my old Paperwhite didn’t get it until almost a year later. I mark every book I read (mostly the paper kinds, and yes, I’m weird like that) and rate them, though I rarely write reviews. Today, I was looking at the recent deluge of Facebook Book lists and it got me wondering why these lists were a Facebook thing, when all my friends seem to be on Goodreads too. When I started making my own list, I had this vague plan to link the Goodreads pages to the list, but then frankly, for a status message in FB it was just too cumbersome. It’s weird though. Many of my friends have books on their list that I want to read, but now, I have to go discover these lists (or hope that Facebook surfaces the ones I’d really like) and then keep adding on Goodreads.

I wondered why in these times of Buzzfeed and crazed listicles, Goodreads doesn’t have lists. Except, it does. I checked. But here’s my issue – I’m a longtime user and it took a search for me to discover this. I realize that in the interest of simplicity, there is no point in having lists upfront on the login screen. But, in times like this, especially when a book tag is doing the rounds, shouldn’t Goodreads be pushing users to publish these lists on Goodreads? Especially since the FB ones are going to die down, and none of us will ever be able to locate them later. It also looks like Goodreads believes the lists should only be of the format ‘Best Robot Books’ not ‘Michael’s Favorite Books’ – I wonder why. I mean, I may be far more interested in discovering something from A’s favorite books, than a list of her favorite thrillers, for instance. Maybe I’m projecting way too much of myself into the shoes of a generic user on Goodreads. Maybe people would prefer Goodreads be the way it is. It would be interesting though, to see if Goodreads could maybe create these list driven FB posts as a social media marketing campaign, where they get people to tag books on Goodreads or some such. I feel like all the virality should benefit them!

by muchnessofd at September 01, 2014 06:53 PM

Ph.D. alumna

What is Privacy?

Earlier this week, Anil Dash wrote a smart piece unpacking the concept of “public.” He opens with some provocative questions about how we imagine the public, highlighting how new technologies make heightened visibility possible. For example,

Someone could make off with all your garbage that’s put out on the street, and carefully record how many used condoms or pregnancy tests or discarded pill bottles are in the trash, and then post that information up on the web along with your name and your address. There’s probably no law against it in your area. Trash on the curb is public.

The acts that he describes are at odds with — or at least complicate — our collective sense of what’s appropriate. What’s at stake is not about the law, but about our idea of the society we live in. This leads him to argue that the notion of public is not easy to define. “Public is not just what can be viewed by others, but a fragile set of social conventions about what behaviors are acceptable and appropriate.” He then goes on to talk about the vested interests in undermining people’s conception of public and expanding the collective standards of what is in.

To get there, he pushes back at the dichotomy between “public” and “private,” suggesting that we should think of these as a spectrum. I’d like to push back even further to suggest that our notion of privacy, when conceptualized in relationship to “public,” does a disservice to both concepts. The notion of private is also a social convention, but privacy isn’t a state of a particular set of data. It’s a practice and a process, an idealized state of being, to be actively negotiated in an effort to have agency. Once we realize this, we can reimagine how to negotiate privacy in a networked world. So let me unpack this for a moment.

Imagine that you’re sitting in a park with your best friend talking about your relationship troubles. You may be in a public space (in both senses of that term), but you see your conversation as private because of the social context, not the physical setting. Most likely, what you’ve thought through is whether or not your friend will violate your trust, and thus your privacy. If you’re a typical person, you don’t even begin to imagine drones that your significant other might have deployed or mechanisms by which your phone might be tapped. (Let’s leave aside the NSA, hacker-geek aspect of this.)

You imagine privacy because you have an understanding of the context and are working hard to control the social situation. You may even explicitly ask your best friend not to say anything (prompting hir to say “of course not” as a social ritual).

As Alice Marwick and I traversed the United States talking with youth, trying to make sense of privacy, we quickly realized that the tech-centric narrative of privacy just doesn’t fit with people’s understandings and experience of it. They don’t see privacy as simply being the control of information. They don’t see the “solution” to privacy being access-control lists or other technical mechanisms of limiting who has access to information. Instead, they try to achieve privacy by controlling the social situation. To do so, they struggle with their own power in that situation. For teens, it’s all about mom looking over their shoulder. No amount of privacy settings can solve for that one. While learning to read social contexts is hard, it’s especially hard online, where the contexts seem to be constantly destabilized by new technological interventions. As such, context becomes visible and significant in the effort to achieve privacy. Achieving privacy requires a whole slew of skills, not just in the technological sense, but in the social sense. Knowing how to read people, how to navigate interpersonal conflict, how to make trust stick. This is far more complex than people realize, and yet we do this every day in our efforts to control the social situations around us.

The very practice of privacy is all about control in a world in which we fully know that we never have control. Our friends might betray us, our spaces might be surveilled, our expectations might be shattered. But this is why achieving privacy is desirable. People want to be *in* public, but that doesn’t necessarily mean that they want to *be* public. There’s a huge difference between the two. As a result of the destabilization of social spaces, what’s shocking is how frequently teens have shifted from trying to restrict access to content to trying to restrict access to meaning. They get, at a gut level, that they can’t have control over who sees what’s said, but they hope to instead have control over how that information is interpreted. And thus, we see our collective imagination of what’s private colliding smack into the notion of public. They are less of a continuum and more of an entwined hairball, reshaping and influencing each other in significant ways.

Anil is right when he highlights the ways in which tech companies rely on conceptions of “public” to justify data collection practices. He points to the lack of consent, which signals what’s really at stake. When powerful actors, be they companies or governmental agencies, use the excuse of something being “public” to defend their right to look, they systematically assert control over people in a way that fundamentally disenfranchises them. This is the very essence of power and the core of why concepts like “surveillance” matter. Surveillance isn’t simply the all-being all-looking eye. It’s a mechanism by which systems of power assert their power. And it is why people grow angry and distrustful. Why they throw fits over being experimented on. Why they cry privacy foul even when the content being discussed is, for all intents and purposes, public.

As Anil points out, our lives are shaped by all sorts of unspoken social agreements. Allowing organizations or powerful actors to undermine them for personal gain may not be illegal, but it does tear at the social fabric. The costs of this are, at one level, minuscule, but when added up, they can cause a serious earthquake. Is that really what we’re seeking to achieve?

(The work that Alice and I did with teens, and the implications that this has for our conception of privacy writ large, is written up as “Networked Privacy” in New Media & Society. If you don’t have library access, email me and I’ll send you a copy.)

(This entry was first posted on August 1, 2014 at Medium under the title “What is Privacy” as part of The Message.)

by zephoria at September 01, 2014 02:34 PM

August 28, 2014

Ph.D. student

a mathematical model of collective creativity

I love my Mom. One reason I love her is that she is so good at asking questions.

I thought I was on vacation today, but then my Mom started to ask me questions about my dissertation. What is my dissertation about? Why is it interesting?

I tried to explain: I’m interested in studying how these people working on scientific software work together. That could be useful in the design of new research infrastructure.

M: Ok, so like…GitHub? Is that something people use to share their research? How do they find each other using that?

S: Well, people can follow each other’s repositories to get notifications. Or they can meet each other at conferences and learn what people are working on. Sometimes people use social media to talk about what they are doing.

M: That sounds like a lot of different ways of learning about things. Could your research be about how to get them all to talk about it in one place?

S: Yes, maybe. In some ways GitHub is already serving as that central repository these days. One application of my research could be about how to design, say, an extension to GitHub that connects people. There’s a lot of research on ‘link formation’ in the social media context–well I’m your friend, and you have this other friend, so maybe we should be friends. Maybe the story is different for collaborators. I have certain interests, and somebody else does too. When are our interests aligned, so that we’d really want to work together on the same thing? And how do we resolve disputes when our interests diverge?

M: That sounds like what open source is all about.

S: Yeah!

M: Could you build something like that that wasn’t just for software? Say I’m a researcher and I’m interested in studying children’s education, and there’s another researcher who is interested in studying children’s education. Could you build something like that in your…your D-Lab?

S: We’ve actually talked about building an OKCupid for academic research! The trick there would be bringing together researchers interested in the same things, but with different skills. Maybe somebody is really good at analyzing data, and somebody else is really good at collecting data. But it’s a lot of work to build something nice. Not as easy as “build it and they will come.”

M: But if it was something like what people are used to using, like OKCupid, then…

S: It’s true that would be a really interesting project. But it’s not exactly my research interest. I’m trying really hard to be a scientist. That means working on problems that aren’t immediately appreciable by a lot of people. There are a lot of applications of what I’m trying to do, but I won’t really know what they are until I get the answers to what I’m looking for.

M: What are you looking for?

S: I guess, well…I’m looking for a mathematical model of creativity.

M: What? Wow! And you think you’re going to find that in your data?

S: I’m going to try. But I’m afraid to say that. People are going to say, “Why aren’t you studying artists?”

M: Well, the people you are studying are doing creative work. They’re developing software, they’re scientists…

S: Yes.

M: But they aren’t like Beethoven writing a symphony, it’s like…

S: …a craft.

M: Yes, a craft. But also, it’s a lot of people working together. It’s collective creativity.

S: Yes, that’s right.

M: You really should write that down. A mathematical model of collective creativity! That gives me chills. I really hope you’ll write that down.

Thanks, Mom.

by Sebastian Benthall at August 28, 2014 08:26 PM

Ph.D. student

Re: Homebrew Website Club: August 27, 2014

I did make it to the Indiewebcamp/Homebrew meeting this evening after all, in Portland this time, since I happened to be passing through.

I was able to show off some of the work I've been doing on embedding data-driven graphs/charts in the Web versions of in-progress academic writing: d3.js generating SVG tables in the browser, but also saving SVG/PDF versions which are used as figures in the LaTeX/PDF version (which I still need for sharing the document in print and with most academics). I need to write a brief blog post describing my process for doing this, even though it's not finished. In fact, that's a theme; we all need to be publishing code and writing blog posts, especially for inchoate work.

Also, I've been thinking about pseudonymity in the context of personal websites. Is there anything we need to do to make it possible to maintain different identities / domain names without creating links between them? Also, it may be a real privacy advantage to split the reading and writing on the Web: if you don't have to create a separate list of friends/follows in each site with each pseudonym, then you can't as easily be re-identified by having the same friends. But I want to think carefully about the use case, because while I've become very comfortable with a domain name based on my real name and linking my professional, academic and personal web presences, I find that a lot of my friends are using pseudonyms, or intentionally subdividing

Finally, I learned about some cool projects.

  • Indiewebcamp IRC logs become more and more featureful, including an interactive chat client in the logs page itself
  • Google Web Starter Kit provides boilerplate and a basic build/task system for building static web sites
  • Gulp and Harp are two (more) JavaScript-based tools for preparing/processing/hosting static web sites

All in all, good fun. And then I went to the Powell's bookstore dedicated just to technical and scientific books, saw an old NeXT cube and bought an old book on software patterns.

Thanks for hosting us, @aaronpk!
— Nick

by at August 28, 2014 05:49 AM

August 27, 2014

Ph.D. student

a response to “Big Data and the ‘Physics’ of Social Harmony” by @doctaj; also Notes towards ‘Criticality as ideology’

I’ve been thinking over Robin James’ “Big Data & the ‘Physics’ of Social Harmony”, an essay in three sections. The first discusses Singapore’s use of data science to detect terrorists and public health threats for the sake of “social harmony,” as reported by Harris in Foreign Policy. The second ties together Plato, Pentland’s “social physics”, and neoliberalism. The last discusses the limits to individual liberty proposed by J.S. Mill. The author admits it’s “all over the place.” I get the sense that it is a draft towards a greater argument. It is very thought-provoking and informative.

I take issue with a number of points in the essay. Underlying my disagreement is what I think is a political difference about the framing of “data science” and its impact on society. Since I am a data science practitioner who takes my work seriously, I would like this framing to be nuanced, recognizing both the harm and help that data science can do. I would like the debate about data science to be more concrete and pragmatic so that practitioners can use this discussion as a guide to do the right thing. I believe this will require discussion of data science in society to be informed by a technical understanding of what data science is up to. However, I think it’s also very important that these discussions rigorously take up the normative questions surrounding data science’s use. It’s with this agenda that I’m interested in James’ piece.

James is a professor of Philosophy and Women’s/Gender Studies and the essay bears the hallmarks of these disciplines. Situated in a Western and primarily anglophone intellectual tradition, it draws on Plato and Mill for its understanding of social harmony and liberalism. At the same time, it has the political orientation common to Gender Studies, alluding to the gendered division of economic labor, at times adopting Marxist terminology, and holding suspicion for authoritarian power. Plato is read as being the intellectual root of a “particular neoliberal kind of social harmony” that is “the ideal that informs data science.” James contrasts this ideal with the ideal of individual liberty, as espoused and then limited by Mill.

Where I take issue with James is that I think this line of argument is biased by its disciplinary formation. (Since this is more or less a truism for all academics, I suppose this is less a rebuttal than a critique.) Where I believe this is most visible is in her casting of Singapore’s ideal of social harmony as an upgrade of Plato, via the ideology of neoliberalism. She does not consider in the essay that Singapore’s ideal of social harmony might be rooted in Eastern philosophy, not Western philosophy. Though I have no special access or insight into the political philosophy of Singapore, this seems to me to be an important omission given that Singapore is ethnically 74.2% Chinese, with a Buddhist plurality.

Social harmony is a central concept in Eastern, especially Chinese, philosophy with deep roots in Confucianism and Daoism. A great introduction for those with background in Western philosophy who are interested in the philosophical contributions of Confucius is Fingarette’s Confucius: The Secular as Sacred. Fingarette discusses how Confucian thought is a reaction to the social upheaval and war of Ancient China’s Warring States Period, roughly 475 – 221 BC. Out of these troubling social conditions, Confucian thought attempts to establish conditions for peace. These include ritualized forms of social interaction at whose center is a benevolent Emperor.

There are many parallels with Plato’s political philosophy, but Fingarette makes a point of highlighting where Confucianism is different. In particular, the role of social ritual and ceremony as the basis of society is at odds with Western individualism. Political power is not a matter of contest of wills but the proper enactment of communal rites. It is like a dance. Frequently, the word “harmony” is used in the translation of Confucian texts to refer to the ideal of this functional, peaceful ceremonial society and, especially, its relationship with nature.

A thorough analysis of the use of data science for social control in light of Eastern philosophy would be an important and interesting work. I certainly haven’t done it. My point is simply that when we consider the use of data science for social control as a global phenomenon, it is dubious to see it narrowly in light of Western intellectual history and ideology. That includes rooting it in Plato, contrasting it with Mill, and characterizing it primarily as an expression of white neoliberalism. Expansive use of these Western tropes is a projection, a fallacy of “I think this way, therefore the world must.” This I submit is an occupational hazard of anyone who sees their work primarily as an analysis or critique of ideology.

In a lecture in 1965 printed in Knowledge and Human Interests, Habermas states:

The concept of knowledge-constitutive human interests already conjoins the two elements whose relation still has to be explained: knowledge and interest. From everyday experience we know that ideas serve often enough to furnish our actions with justifying motives in place of the real ones. What is called rationalization at this level is called ideology at the level of collective action. In both cases the manifest content of statements is falsified by consciousness’ unreflected tie to interests, despite its illusion of autonomy. The discipline of trained thought thus correctly aims at excluding such interests. In all the sciences routines have been developed that guard against the subjectivity of opinion, and a new discipline, the sociology of knowledge, has emerged to counter the uncontrolled influence of interests on a deeper level, which derive less from the individual than from the objective situation of social groups.

Habermas goes on to reflect on the interests driving scientific inquiry–“scientific” in the broadest sense of having to do with knowledge. He delineates:

  • Technical inquiry motivated by the drive for manipulation and control, or power
  • Historical-hermeneutic inquiry motivated by the drive to guide collective action
  • Critical, reflexive inquiry into how the objective situation of social groups controls ideology, motivated by the drive to be free or liberated

This was written in 1965. Habermas was positioning himself as a critical thinker; however, unlike some of the earlier Frankfurt School thinkers he drew on, he maintained that technical power was an objective human interest (see Bohman and Rehg). In the United States especially, criticality as a mode of inquiry took aim at the ideologies that aimed at white, bourgeois, and male power. Contemporary academic critique has since solidified as an academic discipline and wields political power. In particular, it is frequently enlisted as an expression of the interests of marginalized groups. In so doing, academic criticality has (in my view regrettably) become mere ideology. No longer interested in being scientifically disinterested, it has become a tool of rationalization. Its project is the articulation of changing historical conditions in certain institutionally recognized tropes. One of these tropes is the critique of capitalism, modernism, neoliberalism, etc. and their white male bourgeois heritage. Another is the feminist emphasis on domesticity as a dismissed form of economic production. This trope features in James’ analysis of Singapore’s ideal of social harmony:

Harris emphasizes that Singaporeans generally think that finely-tuned social harmony is the one thing that keeps the tiny city-state from tumbling into chaos. [1] In a context where resources are extremely scarce–there’s very little land, and little to no domestic water, food, or energy sources, harmony is crucial. It’s what makes society sufficiently productive so that it can generate enough commercial and tax revenue to buy and import the things it can’t cultivate domestically (and by domestically, I really mean domestically, as in, by ‘housework’ or the un/low-waged labor traditionally done by women and slaves/servants.) Harmony is what makes commercial processes efficient enough to make up for what’s lost when you don’t have a ‘domestic’ supply chain. (emphasis mine)

To me, this parenthetical is quite odd. There are other uses of the word “domestic” that do not specifically carry the connotation of women and slave/servants. For example, the economic idea of gross domestic product just means “an aggregate measure of production equal to the sum of the gross values added of all resident institutional units engaged in production (plus any taxes, and minus any subsidies, on products not included in the value of their outputs).” Included in that production is work done by men and high-wage laborers. To suggest that natural resources are primarily exploited by “domestic” labor in the ‘housework’ sense is bizarre given, say, agribusiness, industrial mining, etc.

There is perhaps an interesting etymological relationship here; does our use of ‘domestic’ in ‘domestic product’ have its roots in household production? I wouldn’t know. Does that same etymological root apply in Singapore? Was agriculture in China and Southeast Asia traditionally the province of household servants (as opposed to independent farmers and their sons)? Regardless, domestic agricultural production is not housework now. So it’s mysterious that this detail should play a role in explaining Singapore’s emphasis on social harmony today.

So I think it’s safe to say that this parenthetical remark by James is due to her disciplinary orientation and academic focus. Perhaps it is a contortion to satisfy the audience of Cyborgology, which has a critical left-leaning politics. Harris’s original article does not appear to support this interpretation. Rather, it only uses the word ‘harmony’ twice, and maintains a cultural sensitivity that James’ piece lacks, noting that Singapore’s use of data science may be motivated by a cultural fear of loss or risk.

The colloquial word kiasu, which stems from a vernacular Chinese word that means “fear of losing,” is a shorthand by which natives concisely convey the sense of vulnerability that seems coded into their social DNA (as well as their anxiety about missing out — on the best schools, the best jobs, the best new consumer products). Singaporeans’ boundless ambition is matched only by their extreme aversion to risk.

If we think that Harris is closer to the source here, then we do not need the projections of Western philosophy and neoliberal theory to explain what is really meant by Singapore’s use of data science. Rather, we can look to Singapore’s culture and perhaps its ideological origins in East Asian thinking. Confucius, not Plato.

* * *

If there is a disciplinary bias to American philosophy departments, it is that they exist to reproduce anglophone philosophy. This is a point that James has recently expressed herself…in fact while I have been in the process of writing this response.

Though I don’t share James’ political project, generally speaking I agree that effort spent on the reproduction of disciplinary terminology is not helpful to the philosophical and scientific projects. Terminology should be deployed for pragmatic reasons in service to objective interests like power, understanding, and freedom. On the other hand, language requires consistency to be effective, and education requires language. My own personal conclusion is that the scientific project can only be sustained now through disciplinary collapse.

When James suggests that old terms like metaphysics and epistemology prevent the de-centering of the “white supremacist/patriarchal/capitalist heart of philosophy”, she perhaps alludes to her recent coinage of “epistemontology” as a combination of epistemology and ontology, as a way of designating what neoliberalism is. She notes that she is trying to understand neoliberalism as an ideology, not as a historical period, and finds useful the definition that “neoliberals think everything in the universe works like a deregulated, competitive, financialized capitalist market.”

However helpful a philosophical understanding of neoliberalism as market epistemontology might be, I wonder whether James sees the tension between her statements about rejecting traditional terminology that reproduces the philosophical discipline and her interest in preserving the idea of “neoliberalism” in a way that can be taught in an introduction to philosophy class, a point she makes in a blog comment later. It is, perhaps, in the act of teaching that a discipline is reproduced.

The use of neoliberalism as a target of leftist academic critique has been challenged relatively recently. Craig Hickman, in a blog post about Luis Suarez-Villa, writes:

In fact Williams and Srinicek see this already in their first statement in the interview where they remind us that “what is interesting is that the neoliberal hegemony remains relatively impervious to critique from the standpoint of the latter, whilst it appears fundamentally unable to counter a politics which would be able to combat it on the terrain of modernity, technology, creativity, and innovation.” That’s because the ball has moved and the neoliberalist target has shifted in the past few years. The Left is stuck in waging a war it cannot win. What I mean by that is that it is at war with a target (neoliberalism) that no longer exists except in the facades of spectacle and illusion promoted in the vast Industrial-Media-Complex. What is going on in the world is now shifting toward the East and in new visions of technocapitalism of which such initiatives as Smart Cities by both CISCO (see here) and IBM and a conglomerate of other subsidiary firms and networking partners to build new 21st Century infrastructures and architectures to promote creativity, innovation, ultra-modernity, and technocapitalism.

Let’s face it capitalism is once again reinventing itself in a new guise and all the Foundations, Think-Tanks, academic, and media blitz hype artists are slowly pushing toward a different order than the older market economy of neoliberalism. So it’s time the Left begin addressing the new target and its ideological shift rather than attacking the boogeyman of capitalism’s past. Oh, true, the façade of neoliberalism will remain in the EU and U.S.A. and much of the rest of the world for a long while yet, so there is a need to continue our watchdog efforts on that score. But what I’m getting at is that we need to move forward and overtake this new agenda that is slowly creeping into the mix before it suddenly displaces any forms of resistance. So far I’m not sure if this new technocapitalistic ideology has even registered on the major leftist critiques beyond a few individuals like Luis Suarez-Villa. Mark Bergfield has a good critique of Suarez-Villa’s first book on Marx & Philosophy site: here.

In other words, the continuation of capitalist domination is due to its evolution relative to the stagnation of intellectual critiques of it. Or to put it another way, privilege is the capacity to evolve and not merely reproduce. Indeed, the language game of academic criticality is won by those who develop and disseminate new tropes through which to represent the interests of the marginalized. These privileged academics accomplish what Lyotard describes as “legitimation through paralogy.”

* * * * *

If James were working merely within academic criticality, I would be less interested in the work. But her aspirations appear to be higher, in a new political philosophy that can provide normative guidance in a world where data science is a technical reality. She writes:

Mill has already made–in 1859 no less–the argument that rationalizes the sacrifice of individual liberty for social harmony: as long as such harmony is enforced as a matter of opinion rather than a matter of law, then nobody’s violating anybody’s individual rights or liberties. This is, however, a crap argument, one designed to limit the possibly revolutionary effects of actually granting individual liberty as more than a merely formal, procedural thing (emancipating people really, not just politically, to use Marx’s distinction). For example, a careful, critical reading of On Liberty shows that Mill’s argument only works if large groups of people–mainly Asians–don’t get individual liberty in the first place. [2] So, critiquing Mill’s argument may help us show why updated data-science versions of it are crap, too. (And, I don’t think the solution is to shore up individual liberty–cause remember, individual liberty is exclusionary to begin with–but to think of something that’s both better than the old ideas, and more suited to new material/technical realities.)

It’s because of these more universalist ambitions that I think it’s fair to point out the limits of her argument. If a government’s idea of “social harmony” is not in fact white capitalist but premodern Chinese, if “neoliberalism” is no longer the dominant ideology but rather an idea of an ideology reproduced by a stagnating academic discipline, then these ideas will not help us understand what is going on in the contemporary world in which ‘data science’ is allegedly of such importance.

What would be better than this?

There is an empirical reality to the practices of data science. Perhaps it should be studied on its own terms, without disciplinary baggage.

by Sebastian Benthall at August 27, 2014 05:27 PM

MIMS 2010

New Version of the Site is Now Live

I have big news today for the small world of people who read my blog regularly: A new version of the site is now live and the old version shall die a quick death.

Version 4 was pretty nice though, while it lasted:

Site v4

Other old versions of the site still available!

The improvements

This new version comes with some big improvements that I’m quite pleased with:

  1. If you find typos in a blog post, you can edit them on Github and I can easily integrate your changes. Check out the link on the right to edit the typos in this very page. (I’ve left a few conspicuous ones as a treasure hunt for the reader!)
  2. The site is now much faster and can handle immense traffic without a hitch, thanks to being hosted by Github Pages. The previous version would have occasional hiccups during times of high traffic – something that’s really quite untenable.
  3. Comments are now moved to Disqus, though unfortunately old comments have not made the jump to the new version of the blog. Comments are collapsed by default so the scrollbar actually represents the length of a highly-commented post.
  4. The site now looks bad-ass. Regardless of whether you’re on a phone, tablet computer, or what-have-you, it’s going to look good.
  5. All content has been categorized as well as tagged, as you can see in the sidebar. There are Atom feeds for each.
  6. The homepage has a new design that focuses on my projects and bio, and then has recent posts below that.
  7. Long articles like this one get an automatic table of contents on the left.
  8. The site is now optimized for speed dial in Opera and to be made into apps on mobile phones and tablets. For example, if you’re reading this on Android Chrome, you can simply click “Add to homescreen” in the browser menu and you’ll be all set.
  9. Security is now invincible: No more webserver to update, no more database, no more outdated Drupal. It’s basically impossible to hack the new site. I’ve also added my PGP key to the contact page, for those interested.
  10. The entire site is now static and doesn’t require that I pay for or maintain a server or database. Bonus!

So those are the high-level changes you can see as of now. If you’re interested in the technical nitty-gritty, read on.

The Tech

The original motivation to rebuild the site came when the old version kept overwhelming the server that was running it and requiring that I step in to make it work again. And if that weren’t annoying enough, I have been paying for that server for the past several years, which just seems a bit silly for a simple blog like this one.

The solution? A so-called Static Site Generator or SSG. With one of these, the paradigm for your site totally changes. Instead of having a dynamic site that loads every time somebody visits the page or makes a comment, you generate the entire website on your laptop (this takes about 30 seconds), creating static HTML, and then push that to some cloud provider of choice (in my case, I use Github pages for this because it’s free and easy).

There are about 300 SSGs right now and the one I eventually landed on was Pelican due to it being written in a language I knew (Python), and due to it having lots of good themes and plugins. I briefly tried to make a switch to Hugo instead because it’s written in Go and is much faster at generating content, but the documentation for Hugo isn’t very good yet, and it doesn’t support basic pagination, which is something of a showstopper.

Switching to a SSG from Drupal

Switching from Drupal was pretty awful and took a lot of effort — days of it! The goal was to get all of my posts exported from Drupal, convert them all to markdown, and to get them all live on Github pages. Let’s go through this process together.

Exporting from Drupal

This step of the puzzle was, shall we say, a pain. Nobody has yet made a Drupal to Pelican converter, so I had to do it myself. The script that I wrote dug directly into Drupal’s database, pulled out the contents and converted them to a format that Hugo could understand. At the time I thought Hugo would be the SSG for me, but later I switched to Pelican, and had to write another script to make the conversion from Hugo to Pelican.

Problems with Drupal

This was a good start, but Drupal has a few funny conventions. One is that it allows files to be “attached” to blog posts. Most blogs don’t do this (Pelican and Hugo included), so I had to go through all of the items that I attached to Drupal posts and convert them to inline links instead. This took a while.

Another problem I ran into is that the posts themselves were written directly in HTML, which makes them kind of awful, and not very portable between blog engines. Content for Hugo or Pelican should be written in Markdown, so I began making this conversion to the 200+ posts on the site. In general, the process for this was to find a post and begin cleaning it up. If I encountered something that a computer could reliably fix across all the posts (for example, <i> can be converted to * and <strong> to **), I wrote a little script to do so. In the end, this took a lot of time, but I now have a collection of a few hundred nicely-formed markdown files that power the blog.
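A minimal sketch of what one of those little scripts might look like, assuming only the two tag conversions mentioned above. The function name and the tag table are hypothetical, not the script that was actually used, and tags with attributes or nested markup are not handled:

```python
import re

# Hypothetical table of simple inline HTML tags and their Markdown markers.
TAG_TO_MARKDOWN = {
    "i": "*",
    "strong": "**",
}


def convert_inline_tags(html):
    """Replace bare <tag> and </tag> pairs with the matching Markdown marker."""
    for tag, marker in TAG_TO_MARKDOWN.items():
        # Matches only the bare opening or closing tag, case-insensitively.
        pattern = re.compile(r"</?{}>".format(re.escape(tag)), re.IGNORECASE)
        html = pattern.sub(marker, html)
    return html


print(convert_inline_tags("An <i>italic</i> and a <strong>bold</strong> word."))
```

Anything a regular expression like this cannot reliably handle (tables, embedded media, attributes) would still need to be cleaned up by hand.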

Moving to Github Pages

With all of the content converted properly, the remaining step was to get the project live on Github. I found this confusing at first, but the process is basically this:

  1. You need to take the output directory from Pelican and put it into a Git branch called gh-pages. This is remarkably easy with Pelican, as there is a simple command you can run: make github. Run that, and you’ll be all set, with the content pushed and everything.

  2. You need a file named CNAME that simply contains the domain of your website. This is easy in theory — it’s just a plaintext file — but in practice it is difficult because you need the file to be created by the make github command mentioned above. To do that add the CNAME file to a directory at content/extra/CNAME and then add the following to your pelican configuration file:

    EXTRA_PATH_METADATA = {'extra/CNAME': {'path': 'CNAME'}, }

    Do that, and the file will get copied over whenever you run make github.

    If you’ve done this correctly, you’ll see evidence of such in the repository’s settings page on Github, where it will tell you the domain in the CNAME.

  3. You need to configure your DNS provider to point your domain to Github. This varies by provider, but I can tell you that your final version should look something like this:

     dig +nostats +nocomments +nocmd
    ; <<>> DiG 9.9.5-3-Ubuntu <<>> +nostats +nocomments +nocmd
    ;; global options: +cmd
    ;                   IN  A
                3600    IN  CNAME
                3600    IN  CNAME
                2       IN  A
                66087   IN  NS
                66087   IN  NS
                66087   IN  NS
                66087   IN  NS
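For step 2, the Pelican documentation pairs EXTRA_PATH_METADATA with a STATIC_PATHS entry, so that the CNAME file is picked up as a static file in the first place. A sketch of the relevant fragment of the configuration file, assuming the file lives at content/extra/CNAME as described above and that no other static paths are needed (if your configuration already lists some, append to that list instead):

```python
# pelicanconf.py (fragment)

# Tell Pelican to treat content/extra/CNAME as a static file to copy.
STATIC_PATHS = ['extra/CNAME']

# Place the copied file at the root of the generated site as "CNAME".
EXTRA_PATH_METADATA = {'extra/CNAME': {'path': 'CNAME'}}
```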

Final Words

This has been a much larger undertaking than I expected, with tons of corner cases that I wanted to fix before releasing a new version of the site. In the end though, this has been a good investment that I can expect to keep the site going for the next five to ten years.

I hope you enjoy the new look and new features.

by Mike Lissner at August 27, 2014 07:00 AM

August 18, 2014

MIMS 2012

1-year Retrospective

August 2nd marked the 1-year anniversary of my first post, so it seems appropriate to do a quick retrospective of my first year blogging on my personal site.

Writing Stats

I’ve written 22 posts in that time, which is a rate of 1.83 per month. My (unstated) goal was 2 per month, so I wasn’t far off. My most prolific month is a tie between September 2013 and May 2014, in which I wrote 4 articles each. But in September I re-used some posts I had written previously for Optimizely, so May wins for more original content.

Sadly, there were two months in which I didn’t write any articles: Dec 2013, and July 2014. In December I was in India, so that’s a pretty legitimate reason. July, however, has no good reason. It was generally a busy month, but I should have made time to write at least one post. And looking closer, just saying “July” makes it sound better than it actually was - I had a seven week stretch of no posts then!

My longest article was “Re-Designing Optimizely’s Preview Tool”, clocking in at 4,158 words!

Site Analytics

Diving into Google Analytics, I’ve had 3,092 page views, 2,158 sessions, and 1,778 users. I seem to get a steady trickle of traffic every day, with a few occasional spikes in activity (which are caused by retweets, Facebook posts, or sending posts to all of Optimizely). All of which I find pretty surprising since I don’t write very regularly, and I don’t do much to actively seek readers.

Google Analytics stats for the past year

So where do these visitors (i.e. you) come from? Google Analytics tells me that, even more surprisingly, the top two sources are organic search and direct, respectively. From looking through the search terms used to find me, they can be grouped into three categories:

  • My name: this is most likely people who are interviewing at Optimizely.
  • Cloud.typography and Typekit comparison: people are interested in a performance comparison of these two services. And in fact, I wrote this article precisely because I was searching for that information myself, but there weren’t any posts about it.
  • Framing messages: I wrote a post about the behavioral economics principle of framing, and how you can use it to generate A/B testing ideas. Apparently people want help writing headlines!

Top Posts

Continuing to dig into Google Analytics, these are my three most popular posts:

  1. “Extend – SASS’s Awkward Stepchild”, with 354 page views.
  2. “Re-Designing Optimizely’s Preview Tool”, with 306 page views.
  3. “Performance comparison of serving fonts through Typekit vs Cloud.typography”, with 302 page views.

They’re all pretty close in terms of traffic, but quite different in terms of content. So what does this tell me about what’s resonating with you, the reader, and what I should continue doing going forward? The main commonality is that all of those articles are original, in-depth content. In fact, this holds true past the top 10. My shorter posts that are responses to other people’s posts don’t receive as much mind share. I’ll have to think more about whether they’re worth doing at all anymore.


Overall I’m pretty satisfied with those numbers, and the content I’ve been able to produce. Going forward I hope I can write more in-depth content, especially about the design process of my projects (which are my favorite to write). Here’s to the upcoming year!

by Jeff Zych at August 18, 2014 04:13 AM

August 14, 2014

MIMS 2011

Diary of an internet geography project #4

Reblogged from ‘Connectivity, Inclusivity and Inequality’

Continuing with our series of blog posts exposing the workings behind a multidisciplinary big data project, we talk this week about the process of moving between small data and big data analyses. Last week, we did a group deep dive into our data. Extending the metaphor: Shilad caught the fish and dumped them on the boat for us to sort through. We wanted to know whether our method of collecting and determining the origins of the fish was working by looking at a bunch of randomly selected fish up close. Working out how we would do the sorting was the biggest challenge. Some of us liked really strict rules about how we were identifying the fish. ‘Small’ wasn’t a good enough description; better would be that small = 10-15cm diameter after a maximum of 30 minutes out of the water. Through this process we learned a few lessons about how to do this close-looking as a team.

Step 1: Randomly selecting items from the corpus

We wanted to know two things about the data that we were selecting through this ‘small data’ analysis: Q1) Were we getting every citation in the article or were we missing/duplicating any? Q2) What was the best way to determine the location of the source?

Shilad used the WikiBrain software library he developed with Brent to identify all roughly one million geo-tagged Wikipedia articles. He then collected all external URLs (about 2.9 million unique URLs) appearing within those articles and used this data to create two samples for coding tasks. He sampled about 50 geotagged articles (to answer Q1) and selected a few hundred random URLs cited within particular articles (to answer Q2).

  • Batch 1 for Q1: 50 documents each containing an article title, url, list of citations, empty list of ‘missing citations’
  • Batch 2 for Q2: Spreadsheet of 500 random citations occurring in 500 random geotagged articles.
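The two batches can be sketched roughly as follows. This is only an illustration of the sampling idea, not the WikiBrain API: the function name is hypothetical, and the input is assumed to be a plain dictionary mapping each geotagged article title to the list of external URLs it cites.

```python
import random


def sample_for_coding(article_urls, n_articles, n_citations, seed=0):
    """Draw two random samples for hand coding.

    article_urls: dict mapping article title -> list of cited URLs
                  (an assumed input format, not WikiBrain's own).
    Returns (batch1, batch2): batch1 is a list of article titles,
    batch2 is a list of (title, url) pairs, one URL per sampled article.
    """
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    titles = sorted(article_urls)

    # Batch 1: random articles whose full citation lists will be checked.
    batch1 = rng.sample(titles, min(n_articles, len(titles)))

    # Batch 2: one random citation from each of n_citations random articles.
    with_urls = [t for t in titles if article_urls[t]]
    chosen = rng.sample(with_urls, min(n_citations, len(with_urls)))
    batch2 = [(t, rng.choice(article_urls[t])) for t in chosen]
    return batch1, batch2
```

Calling something like `sample_for_coding(article_urls, 50, 500)` would correspond to the batch sizes described above.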

Example from batch 1:

Coding for Montesquiu

  1. Visit the page at Montesquiu
  2. Enter your initials in the ‘coder’ section
  3. Look at the list of extracted links below in the ‘Correct sources’ section
  4. Add a short description of each missed source to the ‘Missed sources’ section

Initials of person who coded this:

Correct sources

Missing sources

Example from batch 2:

url | domain | effective domain | article | article url
C&pg=PA308 | | | Teatro Calderón (Valladolid) |

For batch 1, we looked up each article and made sure that the algorithm we were using was catching all the citations. We found that there were a few anomalies where there was a duplication of citations (for example, when a single citation contained two urls: one to the ISBN address and another to a Google books url) or when we were missing citations (when the API was only listing a URL once when it had been used multiple times or when a book was cited without a url, for example) or when we were getting incorrect citations (when the citation url pointed to the Italian National Institute of Statistics (Istat) article on Wikipedia rather than the Istat domain).

For example, the article on the town of El Bayad in Libya contained two citations that weren’t included in the analysis because they didn’t contain a url. One appears to be a newspaper and the other a book, but I couldn’t find the citations online. These would not be included in the analysis but it was the only example like this:

  • Amraja M. el Khajkhaj, “Noumou al Mudon as Sagheera fi Libia”, Dar as Saqia, Benghazi-2008, p.120.
  • Al Ain newspaper, Sep. 26; 2011, no. 20, Dar al Faris al Arabi, p.7.

We listed each of these anomalies in order to work out a) whether we can accommodate them in the algorithm or whether b) there are so few of them that they probably won’t affect the analysis too heavily.
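One way to tally such anomalies for a single article is to compare the automatically extracted URLs with the hand-checked ones as multisets. This is a sketch under the assumption that both are available as plain lists of URL strings; the function name is hypothetical and this is not the check we actually ran.

```python
from collections import Counter


def citation_anomalies(extracted, hand_coded):
    """Compare extracted citation URLs with hand-checked ones.

    Returns (missing, extra) as Counters:
      missing: URLs (with counts) found by hand but not extracted,
               e.g. a URL used several times but listed only once;
      extra:   URLs (with counts) extracted but not confirmed by hand,
               e.g. duplicated or incorrect citations.
    """
    extracted_counts = Counter(extracted)
    hand_counts = Counter(hand_coded)
    # Counter subtraction keeps only positive differences.
    missing = hand_counts - extracted_counts
    extra = extracted_counts - hand_counts
    return missing, extra
```

Summing these counts over the sampled articles gives a rough sense of whether the anomalies are rare enough to ignore.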

Step 2: Developing a codebook and initial coding

I took the list of 500 random citations in batch 2 and went through each one to develop a new list of 100 working URLs and a codebook to help the others code the same list. I discarded 24 dead links and developed a working definition for each code in the codebook.

The biggest challenge when trying to locate citations in Wikipedia is whether to define the location according to the domain that is being pointed to, or whether one should find the original source. Google books urls are the most common form of this challenge. If a book is cited and the url points to its Google books location, do we cite the source as coming from Google or from the original publisher of the work?

My initial thought was to define URL location instead of original location — mostly because it seemed like the easiest way to scale up the analysis after this initial hand coding. But after discussing it, I really appreciated when Brent said, ‘Let’s just start this phase by avoiding thinking like computer scientists and code how we need to code without thinking about the algorithm.’ Instead, we tried to use this process as a way to develop a number of different ways of accurately locating sources and to see whether there were any major differences afterwards. Instead of using just one field for location, we developed three coding categories.

Source country:

Country where the article’s subject is located | Country of the original publisher | Country of the URL publisher

We’ll compare these three to the:

Country of the administrative contact for the URL’s domain

that Shilad and Dave are working on extracting automatically.

When I first started doing the coding, I was really interested in looking at other aspects of the data such as what kinds of articles are being captured by the geotagged list, as well as what types of sources are being employed. So I created two new codes: ‘source type’ and ‘article subject’. I defined the article subject as: ‘The subject/category of the article referred to in the title or opening sentence e.g. ‘Humpety is a village in West Sussex, England’ (subject: village)’. I defined source type as: ‘The type of site/document etc that *best* describes the source e.g. if the url points to a list of statistics but it’s contained within a newspaper site, it should be classified as ‘statistics’ rather than ‘newspaper’’.

Coding categories based on example item above from batch 2:

subject | subject country | original publisher location | URL publisher location | language | source type
building | Spain | Spain | US | Spanish | book

In our previous project we divided up the ‘source type’ into many facets. These included the medium (e.g. website, book etc) and the format (statistics, news etc). But this can get very complicated very fast because there are a host of websites that do not fall easily into these categories. A url pointing to a news report by a blogger on a newspaper’s website, for example, or a link to a list of hyperlinks that download as spreadsheets on a government website. This is why I chose to use the ‘best guess’ for the type of source because choosing one category ends up being much easier than the faceted coding that we did in the previous project.

The problem was that this wasn’t a very conclusive definition and would not result in consistent coding. It is particularly problematic because we are doing this project iteratively and we want to try to get as much data as possible so that we have it if we need it later on. After much to-ing and fro-ing, we decided to go back to our research questions and focus on those. The most important thing that we needed to work out was how we were locating sources, and whether the data changed significantly depending on what definition we used. So we decided not to focus on the article type and source type for now, choosing instead to look at the three ways of coding location of sources so that we could compare them to the automated list that we develop.
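The comparison itself could be as simple as measuring how often each hand-coded location agrees with the automatically extracted one. The sketch below assumes each coded citation is a dictionary; all of the field names are hypothetical, and any rows passed in would be the hand-coded sample, not data shown here.

```python
# Hypothetical field names for the three hand-coded location categories.
HAND_CODED_FIELDS = (
    "subject_country",
    "original_publisher_country",
    "url_publisher_country",
)


def agreement_rates(rows, automated_field="admin_contact_country"):
    """Fraction of rows where each hand-coded country equals the automated one."""
    if not rows:
        return {field: 0.0 for field in HAND_CODED_FIELDS}
    rates = {}
    for field in HAND_CODED_FIELDS:
        matches = sum(1 for row in rows if row[field] == row[automated_field])
        rates[field] = matches / len(rows)
    return rates
```

A large gap between the three rates would suggest that the choice of definition matters for the final analysis; similar rates would suggest the automated version is a reasonable stand-in.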

This has been the hardest part of the project so far, I think. We went backwards and forwards a lot about how we might want to code this second set of randomly sampled citations. What definition of ‘source’ and ‘source location’ should we use? How do we balance the need to find the most accurate way to catch all outliers and a way that we could abstract into an algorithm that would enable us to scale up the study to look at all citations? It was a really useful exercise, though, and we have a few learnings from it.

- When you first look at the data, make sure you all do a small data analysis using a random sample;

- When you do the small data analysis, make sure you suspend your computer scientist view of the world and try to think about what is the most accurate way of coding this data from multiple facets and perspectives;

- After you’ve done this multiple analysis, you can then work out how you might develop abstract rules to accommodate the nuances in the data and/or to do a further round of coding to get a ‘ground truth’ dataset.

In this series of blog posts, a team of computer and social scientists including Heather Ford, Mark Graham, Brent Hecht, Dave Musicant and Shilad Sen are documenting the process by which they are working together on a project to understand the geography of Wikipedia citations. Our aim is not only to better understand how far Wikipedia has come towards representing ‘the sum of all human knowledge’ but to do so in a way that lays bare the processes by which ‘big data’ is selected and visualized.

by Heather Ford at August 14, 2014 01:15 PM

August 11, 2014

Ph.D. student

picking a data backend for representing email in #python

I’m at a difficult crossroads with BigBang where I need to pick an appropriate data storage backend for my preprocessed mailing list data.

There are a lot of different aspects to this problem.

The first and most important consideration is speed. If you know anything about computer science, you know that it exists to quickly execute complex tasks that would take too long to do by hand. It’s odd writing that sentence since computational complexity considerations are so fundamental to algorithm design that this can go unspoken in most technical contexts. But since coming to grad school I’ve found myself writing for a more diverse audience, so…

The problem I’m facing is that in doing exploratory data analysis, I do not know all the questions I am going to ask yet. But any particular question will be impractical to ask unless I tune the underlying infrastructure to answer it. This chicken-and-egg problem means that the process of inquiry is necessarily constrained by the engineering options that are available.

This is not new in scientific practice. Notoriously, the field of economics in the 20th century was shaped by what was analytically tractable as formal, mathematical results. The nuance of contemporary modeling of complex systems is due largely to the fact that we now have computers to do this work for us. That means we can still have the intersubjectively verified rigor that comes with mathematization without trying to fit square pegs into round holes. (Side note: something mathematicians acknowledge that others tend to miss is that mathematics is based on dialectic proof and intersubjective agreement. This makes it much closer epistemologically to something like history as a discipline than it is to technical fields dedicated to prediction and control, like chemistry or structural engineering. Computer science is in many ways an extension of mathematics. Obviously, these formalizations are then applied to great effect. Their power comes from their deep intersubjective validity–in other words, their truth. Disciplines that have dispensed with intersubjective validity as a grounds for truth claims in favor of a more nebulous sense of diverse truths in a manifold of interpretation have difficulty understanding this and so are likely to see the institutional gains of computer scientists to be a result of political manipulation, as opposed to something more basic: mastery of nature, or more provocatively, use of force. This disciplinary dysfunction is one reason why these groups see their influence erode.)

For example, I have determined that in order to implement a certain query on the data efficiently, it would be best if another query were constant time. One way to do this is to use a database with an index.

However, setting up a database is something that requires extra work on the part of the programmer and so makes it harder to reproduce results. So far I have been keeping my processed email data “in memory” after it is pulled from files on the file system. This means that I have access to the data within the programming environment I’m most comfortable with, without depending on an external or parallel process. Fewer moving parts means that it is simpler to do my work.
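The effect of a database index can be approximated without leaving memory by using a plain Python dictionary, which gives average constant-time lookups. The sketch below is not BigBang code: it assumes each message is a dictionary, and the field names `message_id` and `in_reply_to` are hypothetical.

```python
def build_index(messages, key="message_id"):
    """Map each message's id to the message, for average O(1) lookup."""
    return {msg[key]: msg for msg in messages}


def parent_of(msg, index):
    """Return the message this one replies to, or None if it is not indexed."""
    return index.get(msg.get("in_reply_to"))


# Illustrative, made-up messages.
messages = [
    {"message_id": "<1@example.org>", "in_reply_to": None, "subject": "hello"},
    {"message_id": "<2@example.org>", "in_reply_to": "<1@example.org>",
     "subject": "Re: hello"},
]
index = build_index(messages)
print(parent_of(messages[1], index)["subject"])
```

The cost is that the index must be rebuilt each time the data is loaded, which is fine as long as everything still fits in memory.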

So there is a tradeoff between the computational time of the software as it executes and the time and attention it takes me (and others that want to reproduce my results) to set up the environment in which the software runs. Since I am running this as an open source project and hope others will build on my work, I have every reason to be lazy, in a certain sense. Every inconvenience I suffer is one that will be suffered by everyone that follows me. There is a Kantian categorical imperative to keep things as simple as possible for people, to take any complex procedure and replace it with a script, so that others can do original creative thinking, solve the next problem. This is the imperative that those of us embedded in this culture have internalized. (G. Coleman notes that there are many cultures of hacking; I don’t know how prevalent these norms are, to be honest; I’m speaking from my experience.) It is what makes this social process of developing our software infrastructure a social one with a modernist sense of progress. We are part of something that is being built out.

There are also social and political considerations. I am building this project intentionally in a way that is embedded within the Scientific Python ecosystem, as they are also my object of study. Certain projects are trendy right now, and for good reason. At the Python Worker’s Party at Berkeley last Friday, I saw a great presentation of Blaze. Blaze is a project that allows programmers experienced with older idioms of scientific Python programming to transfer their skills to systems that can handle more data, like Spark. This is exciting for the Python community. In such a fast moving field with multiple interoperating ecosystems, there is always the anxiety that one’s skills are no longer the best skills to have. Has your expertise been made obsolete? So there is a huge demand for tools that adapt one way of thinking to a new system. As more data has become available, people have engineered new sophisticated processing backends. Often these are not done in Python, which has a reputation for being very usable and accessible but slow to run in operation. Getting the usable programming interface to interoperate with the carefully engineered data backends is hard work, work that Matt Rocklin is doing while being paid by Continuum Analytics. That is sweet.

I’m eager to try out Blaze. But as I think through the questions I am trying to ask about open source projects, I’m realizing that they don’t fit easily into the kind of data processing that Blaze currently supports. Perhaps this is dense on my part. If I knew better what I was asking, I could maybe figure out how to make it fit. But probably, what I’m looking at is data that is not “big”, that does not need the kind of power that these new tools provide. Currently my data fits on my laptop. It even fits in memory! Shouldn’t I build something that works well for what I need it for, and not worry about scaling at this point?

But I’m also trying to think long-term. What happens if and when it does scale up? What if I want to analyze ALL the mailing list data? Is that “big” data?

“Premature optimization is the root of all evil.” – Donald Knuth

by Sebastian Benthall at August 11, 2014 06:28 PM

MIMS 2011

Wikipedia and breaking news: The promise of a global media platform and the threat of the filter bubble

I gave this talk at Wikimania in London yesterday. 

In the first years of Wikipedia’s existence, many of us said that, as an example of citizen journalism and journalism by the people, Wikipedia would be able to avoid the gatekeeping problems faced by traditional media. The theory was that because we didn’t have the burden of shareholders and the practices that favoured elite viewpoints, we could produce a media that was about ‘all of us’ and not just ‘some of us’.

Dan Gillmor (2004) wrote that Wikipedia was an example of a wave of citizen journalism projects initiated at the turn of the century in which ‘news was being produced by regular people who had something to say and show, and not solely by the “official” news organizations that had traditionally decided how the first draft of history would look’ (Gillmor, 2004: x).

Yochai Benkler (2006) wrote that projects like Wikipedia enable ‘many more individuals to communicate their observations and their viewpoints to many others, and to do so in a way that cannot be controlled by media owners and is not as easily corruptible by money as were the mass media.’ (Benkler, 2006: 11)

I think that at that time we were all really buoyed by the idea that Wikipedia and peer production could produce information products that were much more representative of “everyone’s” experience. But the idea that Wikipedia could avoid bias completely, I now believe, is fundamentally wrong. Wikipedia presents a particular view of the world while rejecting others. Its bias arises both from its dependence on sources which are themselves biased and from Wikipedia’s own policies and practices, which favour particular viewpoints. Although Wikipedia is as close to a truly global media product as we have probably ever come*, like every media product it is a representation of the world and is the result of a series of editorial, technical and social decisions made to prioritise certain narratives over others.

Mark Graham (2009) has shown how Wikipedia’s representation of place is skewed towards the developed North; researchers such as Brendan Luyt (2011) have shown that Wikipedia’s coverage of history suffers from an over-reliance on foreign government sources, and others like Tony Lam, Anuradha Uduwage, Zhenhua Dong, Shilad Sen, Dave Musicant, Loren Terveen and John Riedl (Lam et al., 2011) have shown how there are significant gender-associated imbalances in its topic coverage.

But there is, as yet, little research on how such imbalances might manifest themselves in articles about breaking news topics. At a stage when there is often no single conclusive narrative about what happened and why, we see the effects of these choices most starkly – both in decisions about whether a particular idea, event, object or person is important enough to warrant a standalone article, as well as decisions about which statements (aka ‘facts’) to include, what those statements will be, and what shape the narrative arc will take. Wikipedia, then, presents a particular view of the world in the face of a variety of alternatives.

It is not necessarily problematic that we choose to present one article about an event rather than 20 different takes on it. But it becomes problematic when Wikipedia is presented as a neutral source; a source that represents the views of “everyone”. It’s problematic because it means that users don’t often recognise that Wikipedia is constructed (and mirrors in many ways the biases of the sources it uses to support it), but it is also problematic because it means that Wikipedians don’t always recognise that we need to change the way that we work in order to be more inclusive of perspectives. Such perspectives will remain unreflected if we continue to adhere to policies developed to favour a particular perspective of the world.

What is Wikipedia news?

The 6th biggest website in the world, Wikipedia enjoys 18 billion page views and nearly 500 million unique visitors a month, according to comScore. Since 2003, the top 25 Wikipedia articles with the most contributors every month consist nearly exclusively of articles related to current events (Keegan, 2013). Last week, for example, the article entitled ‘Ebola virus disease’ was the most viewed article on English Wikipedia at about 1.8 million views, and the Israeli-Palestinian conflict and related articles made up 5 of the top 25 most popular articles and accounted for about 1.5 million views (see the Top 25 Report on English Wikipedia).

Wikipedia didn’t always ‘do news’. Brian Keegan (2013) writes that the policy around breaking news emerged after the September 11 attacks in 2001 when there was an attempt to write articles about every person who died in the Twin Towers. There was a subsequent decision to separate these out in what is called the ‘Memorial Wiki’. It was also around this time that editors defined what would constitute a notable event and signaled the start of an ‘in the news’ section on the website that would list the most important news of the day linking to good quality Wikipedia articles about those topics. Editors can now propose topics for the ‘in the news’ section on the home page and discuss whether the articles are good enough quality to be featured and whether the news is appropriate for that day.

Although Wikipedia and traditional media both produce news, probably the most fundamental difference between Wikipedia and journalism practice is in our handling of sources. Journalists pride themselves on their ability to do original research by finding the right people to answer the right kinds of questions and for them to distil the important elements from those conversations into an article. For journalists, the people they interview and interrogate are their sources, and the process is a dialogic one: through conversation, questions, answers, follow-up questions and clarifications, they produce their article.

Wikipedians, on the other hand, are forbidden from doing ‘original research’ and must write what they can about the world on the basis of what we find in ‘reliable secondary sources’. For Wikipedians, sources are the ‘documents’ that we find to back up what we write. This is both a limiting and empowering feature of Wikipedia – limiting because we rely heavily on what documents say (and documents can be contradictory and false without an opportunity to follow up with their authors), but empowering (at least in theory) because it enables readers to follow up on the sources that have been provided to back up different arguments and check or verify whether they are accurate. This is called the ‘verifiability’ principle and is one of Wikipedia’s core policies.

Wikipedia’s ‘no original research’ article is summarised as follows:

Wikipedia does not publish original thought: all material in Wikipedia must be attributable to a reliable, published source. Articles may not contain any new analysis or synthesis of published material that serves to reach or imply a conclusion not clearly stated by the sources themselves.

The problem is that when Wikipedia says it doesn’t allow ‘original research’, this doesn’t mean that Wikipedians aren’t constantly making decisions about what to write and what content to include that are to a lesser or greater extent subjective decisions. This is true for the construction of articles which require Wikipedians to construct a narrative from a host of distinct reliable and unreliable sources, but it is especially true when Wikipedians must decide whether something that happened is important enough to warrant a standalone article.

Notability, according to Wikipedia, is defined as follows:

The topic of an article should be notable, or “worthy of notice”; that is, “significant, interesting, or unusual enough to deserve attention or to be recorded”. Notable in the sense of being “famous”, or “popular”—although not irrelevant—is secondary.

Decisions about notability, then, can only be original research: the conclusion that something is important enough (according to Wikipedia’s criteria) to warrant an article must be made according to editors’ subjective viewpoints. The way in which Wikipedians summarise issues and pay attention to particular points about an issue is also subjective: there is no single reference that is an exact replica of what is represented in an article, and decisions about what to include and what to leave out are happening all the time.

Such decisions are made, not just by information reflected in reliable sources but by a host of informational sources and experiences of the individual editor and the social interactions that develop as a result of the progress of an article. We don’t only learn about the world through ‘reliable sources’; we learn about the world through a host of informational cues – through street corner conversations, through gossip, through signage and posters and abandoned newspapers in restaurants, in train carriages and through social media and email and text messages and a whole host of what would be regarded, according to Wikipedia’s definition, as totally ‘unreliable sources’.

Let’s take the example of the first version of the 2011 Egyptian Revolution article (then called ‘protests’ rather than ‘revolution’). The article was started at 4:26pm local time on the first day of the Egyptian protests that led to the unseating of then-President Hosni Mubarak. (The first protests began around 2pm on that day). Let’s first look at the AFP article used as a citation in the article:

Egypt braces for nationwide protests

By Jailan Zayan (AFP) – Jan 25, 2011

CAIRO — Egypt braced for a day of nationwide anti-government protests on Tuesday, with organisers counting on the Tunisian uprising to inspire crowds to mobilise for political and economic reforms.

And then the first version of the article:

The 2011 Egyptian protests are a continuing series of street demonstrations taking place throughout Egypt from January 2010 onwards with organisers counting on the Tunisian uprising to inspire the crowds to mobilize.

I interviewed some of the (frankly amazing) Wikipedians who worked on what became the 2011 Egyptian Revolution Wikipedia article. I knew that there had been a series of protests in Egypt in the run-up to the January 25 protest, but none of these had articles on Wikipedia, so I wondered why this article was started (backed up by such weak evidence at the time) and how it was able to survive.

In a Skype interview, the editor who started the article and oversaw much of its development, TheEgyptianLiberal, said that he knew

‘the thing was going to be big… before the revolution became a revolution’

(He had actually prepared the article the day before the protests even began.) When I asked him how he knew that it was going to be significant, he replied,

‘The frustration in the street. And what happened in Tunisia.’

TheEgyptianLiberal had a wealth of information from the streets of Cairo that showed him what was really happening, and what was happening, he (rightly) believed, was definitely “a thing” – “a thing” worth taking notice of. In Wikipedia’s terms, this was something “significant, interesting, or unusual enough to deserve attention or to be recorded”, despite the fact that it was impossible to be certain of this at such an early stage.

Another article wasn’t as successful in its early stages. When working for Ushahidi in 2011/12, I took a trip to Kenya to visit the HQ. Ushahidi let me do a side project, which for me was to try to understand the development of Swahili Wikipedia. When I arrived in Nairobi, the first thing I did was to buy every newspaper available from the local supermarket. I wanted to immerse myself in the media environment that Kenyans were being enveloped by. I also sat in the B&B and watched a lot of local television. Most of the headlines were about the looming war against Al Shabaab as the Kenyan army moved into southern Somalia to try to root out the militant Al Shabaab terrorist group, which was alleged to have kidnapped several foreign tourists and aid workers inside Kenya.

This was the first time that the Kenyan army had been engaged in a military campaign since independence, and so it was a big deal because the government wanted to be seen to be acting to root out the elements that were believed to be behind a series of kidnappings and murders of both locals and foreigners near the border. After two bombings in central Nairobi while I was visiting, people were trying to stay at home and avoid crowded areas.

During this time, one of the Wikipedians who I interviewed, AbbasJnr, pointed me to a deletion discussion taking place on the English Wikipedia about whether the Kenyan military intervention warranted its own article. The article had been nominated for deletion on the grounds that it did not meet Wikipedia’s notability standards. The nominator wrote that the event was not being reported in reliable news outlets as an ‘invasion’ but rather an ‘incursion’ and since it was ‘routine’ for troops from neighboring countries to cross the border for military operations, this event did not warrant its own article.

The Wikipedians who I spoke to were very sure that what was happening in their country should be considered notable. I was sure too, having spent at that stage only 24 hours in the country. But the media in the West weren’t reflecting this story as the important, notable event that the people living in Kenya understood it to be. Wikipedians (the majority of whom are based in the West) were making the decision to put the article up for deletion because they didn’t have much to go by – they only had the ‘reliable sources’, a few international publications, very few of the Kenyan media publications (since few are online and updated regularly) and fewer still of the informal and unreliable communications that filled the air in Kenya at that time.

Local spheres

Both of these examples show that there is important local contextual information required to make decisions about whether something is a notable event worth documenting on Wikipedia, and that usually we don’t notice this because the majority of editors of English Wikipedia share a similar media sphere and world view. They occupy similar informal and formal media spheres. When there is disagreement, the disagreement is usually about how to cover an issue rather than whether the issue is actually important.

There are plenty of disagreements that result from these isolated media spheres. We see these disagreements when the very different and highly isolated media spheres operating inside Israel and in Gaza are exposed; we see them when we find Russian government employees as well as ordinary citizens attempting to edit articles about the Crimean crisis, or to change how high a Ukrainian military jet can fly in a Russian Wikipedia article, in order to support alternative narratives being promulgated inside Russia about what happened to MH17. We see glimpses of what Eli Pariser calls ‘the filter bubble’ when Gilad Lotan invites us to do a Facebook search and see what our friends are saying about a particular event or when he shows us how there are distinct, isolated Twitter groupings in accounts following news about one of the recent UNRWA school bombings in Gaza.

When the issue or the event happens outside both the formal and informal media spheres that the majority of Wikipedians inhabit, where the BBC or NYTimes is not covering the issue and when there aren’t many Wikipedians in place to account for its importance, Wikipedia has nothing to go on. Our ‘no original research’ and notability policies do not help us.


Not only does Wikipedia inherit biases from the traditional media but we also have our own biases brought about by the local media spheres that we inhabit. What makes our bias more problematic is that Wikipedia is taken as a neutral source on the world. A large degree of Wikipedia’s authoritativeness comes from the authority implied by the encyclopaedic form.

Like a dictionary, an encyclopedia gives a very good impression of being comprehensive because comprehensiveness is its goal, its history and narrative. Having a Wikipedia article brings with it an air of importance. Not every event has a Wikipedia article so there is an assumption that a group of people have reached consensus on the importance of the topic, but there is also an implicit authority from the format of an encyclopedic entry. The journalistic account is written as a single-author account of an event, gathering evidence from the author’s reliable sources and/or from being there. The encyclopedic account, on the other hand, is written without explicit author credits (which adds to the authority) and with evidence of alternative points of view which gives an added appearance of neutrality. When we as readers come to a Wikipedia or newspaper article, we come to it with a vast background of understanding and assumptions about what encyclopedias or newspapers are, and that influences our understanding.

The context of use also points to Wikipedia’s assumed authority. Whereas people generally go to newspapers to find out what is happening in the world, people go to Wikipedia to get the authoritative take. Wikipedia is used to settle arguments in bars about how many people there are in Britain or when the London Underground was built. People go to Wikipedia to find ‘facts’ about the world. The encyclopedia gains its authority because it is an encyclopedia, and this is a very different authority from the one that the New York Times, for example, holds for its readers. (You won’t find someone saying that they’re going to go to the New York Times to find the authoritative answer to how many people were killed in World War 1.)

Wikipedia, then, is not just powerful because it is widely consulted, it is powerful because it is seen as unbiased, as neutral and as a reflection of all of us, instead of the ‘some of us’ that it actually represents.


Wikipedians need to recognise our power as newsmakers: newsmakers who are making decisions to prioritise one narrative at the expense of others. We need to make sure we wield that power with care. We need to take a much closer look at our policies and practices and make sure that we are building a conducive environment for future articles about a big part of the world that is as yet unrepresented. It means re-looking at efforts to expand our definitions of ‘original research’ such as those currently being discussed by Peter Gallert and others to accommodate oral citations in the developing world. It means recognising that, although we are doing a great job in broadening the scope of the events we cover, we have not begun to represent them in any way that can be considered truly global. This requires a whole lot more work in bringing editors from other countries into the Wikipedia fold, and in being more flexible about how we define what constitutes reliable knowledge.

Finally, I think that understanding Wikipedia bias enables us to develop an understanding of how media bias itself is changing – because it’s not so much about how the media industry is failing to give us an accurate, balanced picture of the world as it is about us getting out of our filter bubbles and recognising the role of “unreliable sources” in our understandings about the world.

* But there are others like Global Voices, that are, in many ways, more globally representative than Wikipedia.

by Heather Ford at August 11, 2014 03:32 PM

August 10, 2014

MIMS 2012

DRY isn't the One True Principle of CSS

Ben Frain wrote a great article of recommendations for writing long-living CSS that’s authored by many people (e.g. CSS for a product). The whole thing is worth reading, but I especially agree with his argument against writing DRY (Don’t Repeat Yourself) code at the expense of maintainability. He says:

As a concrete example; being able to delete an entire Sass partial (say 9KB) in six months time with impunity (in that I know what will and won’t be affected by the removal) is far more valuable to me than a 1KB saving enjoyed because I re-used or extended some vague abstracted styles.

Essentially, by abstracting styles as much as possible and maximizing re-use, you’re creating an invisible web of dependencies. If you ever want to change or remove styles, you will cause a ripple effect throughout your site. Tasks that should have been 30 or 60 minutes balloon into multi-hour endeavors. As Ben says, being able to delete (or modify) with impunity is far more valuable than the small savings you get by abstracting your code.

I have made this mistake many times in my career, and it took me a long time to distinguish between good and bad code reuse. The trick I use is to ask, “If I were to later change the style of this component, would I want the others that depend on it to be updated, too?” More often than not, the answer is no, and the styles should be written separately. It took some time for me to be comfortable having duplicate CSS in order to increase maintainability.

Another way of thinking about this is to figure out why two components look the same. If it’s because one is a modification of a base module (e.g. a small version of a button), it’s a good candidate for code reuse. If not (e.g. a tab that looks similar to a button, but behaves differently), then you’re just setting yourself up for a maintainability nightmare.
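To make the distinction concrete, here is a minimal sketch in plain CSS. The class names (.button, .button--small, .tab) and all of the property values are hypothetical and chosen only for illustration; they are not taken from Ben's article or from any particular codebase.

```css
/* Hypothetical class names and values, for illustration only. */

/* Good candidate for reuse: .button--small is a modification of the
   .button base module, so a later change to .button should carry over. */
.button {
  padding: 8px 16px;
  border: 1px solid #888;
  border-radius: 4px;
  background: #eee;
}

.button--small {
  padding: 4px 8px;
  font-size: 12px;
}

/* Deliberate duplication: .tab happens to look like a button today but
   behaves differently, so it carries its own rules instead of depending
   on .button. Either one can be restyled or deleted without touching
   the other. */
.tab {
  padding: 8px 16px;
  border: 1px solid #888;
  border-radius: 4px 4px 0 0;
  background: #eee;
}
```

In markup, the small button would carry both classes (class="button button--small"), while a tab would carry only class="tab". The cost is a few repeated declarations; the benefit is that the answer to "what else changes if I edit this?" stays obvious.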

As beneficial as DRY code is, it isn’t the One True Principle of CSS. At best, it saves some time and bytes, but at worst, it’s a hindrance to CSS maintainability.

by Jeff Zych at August 10, 2014 10:50 PM

August 08, 2014

Ph.D. alumna

What Is an Honorable Response to Israel/Gaza?

In 1968, Walter Cronkite did the unthinkable. After visiting Vietnam to assess the state of the war in light of the Tet Offensive, he produced documentary coverage of the situation. And then, to the shock of many, he concluded with his opinion. In an era in which reporters never stated their own assessment, this act stunned the nation. And if lore has any basis in truth, his statement altered the course of history. Although the accuracy is debated, President Johnson is reported as having said, “If I’ve lost Cronkite, I’ve lost Middle America.” Johnson did not seek re-election that year.

I wake up each day to depressing news of what’s happening in Israel and Gaza. I read about America’s obsession with this war while ignoring what’s happening in Syria, while ignoring its own power games. I read through the histories of Middle Eastern politics, painfully aware of how the Palestinian people are perceived by other Arab nations and disgusted by the way in which each side is a pawn in others’ geopolitical games. I read the personal accounts of fear, anger, horror. And I listen to friends, family, and acquaintances spout racist narratives about the “other” side that make my blood boil. Painfully aware of just how divided the conversation has become, my data scientist partner Gilad Lotan obsessively scrapes social media conversations in an effort to understand how the news media from a country that he considers home could become so biased. I simply try to hold all of the conflicting perspectives in my head to better understand how we’ve gotten here. And the exercise makes me want to crawl into a hole and whimper.

I know that this war will continue. Maybe we’ll see a ceasefire this week, but that will not put an end to this war even if we label it differently in the next round. I know that there will be more innocent bloodshed. And the only outcome of this conflict will be increased intolerance. No good will come from this. Violence will not stop violence. Violence will not end poverty. Violence will not end social divisions. Violence will only increase hatred.

We can debate the particulars until we go blue. We can talk about particular decisions and pass judgment. But we’ll never get to a “right” answer. As Jon Stewart kindly illustrates, we can’t even have a civil conversation about Israel/Gaza without it devolving into a screaming match. We are, after all, trying to reduce a 3D puzzle into a 2D frame. Nowhere is this more clear than when you stand in the middle of the Old City in Jerusalem. Generation after generation of war has buried cities and built new ones on top of old ones. Which layer is legitimate? The first? The last? The most populous? The most harmed? The most powerful? Time creates the third dimension, creates the elevation. No good comes from declaring one layer legitimate.

Like Gilad, I’m watching social media — the tool I’ve spent the last 10 years studying — being used to fuel these fires. Far from engendering enlightenment or enabling civil conversation, I’m watching personalization allow people to revel in their intolerant ghettos, oblivious to how one-sided their perspective has become. I’m watching televised and written media consumed in a segregated fashion. And I’m watching people actively avoid perspectives that make them uncomfortable or otherwise make an effort to truly grok a different world view. Media may be framed as a tool to create an informed citizenry, but people’s biased engagement with it can be used to create ugly silos justified in their own hatred.

And so I go back and re-watch Cronkite’s conclusion, hoping that somehow, somewhere we’ll see an intervention as powerful as this one while also fully aware that the war continued on for seven more years after he asked the American people to reflect. We no longer live in a world where one man can get up on television and humbly profess his opinion such that the world takes stock. The question is: how can we collectively reach an unsatisfactory conclusion and come out as honorable people?

To say that we are closer to victory today is to believe, in the face of the evidence, the optimists who have been wrong in the past. To suggest we are on the edge of defeat is to yield to unreasonable pessimism. To say that we are mired in stalemate seems the only realistic, yet unsatisfactory, conclusion. On the off chance that military and political analysts are right, in the next few months we must test the enemy’s intentions, in case this is indeed his last big gasp before negotiations. But it is increasingly clear to this reporter that the only rational way out then will be to negotiate, not as victors, but as an honorable people who lived up to their pledge to defend democracy, and did the best they could.

(This post was originally published on August 4, 2014 in The Message on Medium.)

by zephoria at August 08, 2014 04:23 PM

Ph.D. student

the research lately

I’ve been working hard.

I wrote a lot, consolidating a lot of thinking about networked public design, digital activism, and Habermas. A lot of the thinking was inspired by Xiao Qiang’s course over a year ago, then a conversation with Nathan Matias and Brian Keegan on Twitter, then building @TheTweetserve for Theorizing the Web. Interesting how these things accrete.

Through this, I think I’ve gotten a deeper understanding of Habermas’ system/lifeworld distinction than I’ve ever had before. Where I’m still weak on this is on his understanding of the role of law. There’s an angle in there about technology as regulation (a la Lessig) that ties things back to the recursive public. But of course Habermas was envisioning the normal kind of law–the potentially democratic law. Since the I School engages more with policy than it does with technicality, it would be good to have sharper thinking about this besides vague notions of the injustice or not of “the system”–how much of this rhetoric is owed to Habermas or the people he’s drawing on?

My next big writing project is going to be about Piketty and intellectual property, I hope. This is another argument that I’ve been working out for a long time–as an undergrad working on microeconomics of intellectual property, on the job at OpenGeo reading Lukacs for some reason, in grad school coursework. I tried to write something about this shortly after coming back to school but it went nowhere, partly because I was using anachronistic concepts and partly because the term “hacker” got weird political treatment due to some anti-startup yellow journalism.

The name of the imagined essay is “Free Capital.” It will try to trace the economic implications of free software and other open access technical designs, especially their impact on the relationship between capital and labor. It’s sort of an extension of this. I feel like there is more substance there to dig out, especially around liquidity and vendor and employer lock-in. I’m imagining engaging some of the VC strategy press–I’ve been following the thinking of Kanyi Maqubela for a long time and always learning from it.

What I need to home in on in terms of economic modeling is under what conditions it’s in labor’s interest to work to produce open source IP or ‘free capital’, under what conditions it is in capital’s interest to invest in free capital, and what the macroeconomic implications of this are. It’s clear that capital will invest in free capital in order to unseat a monopoly–take Android for instance, or Firefox–but that this is (a) unstable and (b) difficult to take into account in measures of economic growth, since the gains in this case are to be had in the efficiency of the industrial organization rather than in the value of the innovation itself. Meanwhile, Matt Asay has been saying for years that the returns on open source investment are not high enough to attract serious investment, and industry experience appears to bear that out.

Meanwhile, Piketty argues that the main force for convergence in income is technology and skills diffusion. But these are exogenous to his model. Meanwhile, here in the Bay Area the gold rush rages on, and at least word on the grapevine is that VC money is having a harder and harder time finding high-return investments and is being sunk into lamer and lamer teams of recent Stanford undergrads.

My weakness in these arguments is that I don’t have data and don’t even know what predictions I’m making. It’s dangerously theoretical.

Meanwhile, my actual dissertation work progresses…slowly. I managed to get a lot done to get my preliminary results with BigBang ready for SciPy 2014. Since then I’ve switched it over to favor an Anaconda build and use IPython Notebooks internally–all good architectural changes but it’s yak shaving. Now I’m hitting performance issues and need to make some serious considerations about databases and data structures.

And then there’s the social work around it. They are good instincts–that I should be working on accessibility, polishing my communication, trying to encourage collaborators’ interest. I know how to start an open source project and it requires that. But then–what about the research? What about the whole point of the thing? Talking with Dave Kush today, he pointed me towards research on computational discourse analysis, which is where I think this needs to go. The material felt way over my head, a reminder that I’ve been barking up so many trees that are not where I think the real problem to work on is. Mainly because I’ve been caught up in the politics of things. It’s bewildering how enriching but distracting the academic context is–how many barriers there are to sitting and doing your best work. Petty disciplinary disputes, for example.

by Sebastian Benthall at August 08, 2014 03:43 AM

August 07, 2014

Ph.D. alumna

Why Jane Doe doesn’t get to be a sex trafficking victim

In detailing the story of “Jane Doe,” a 16-year-old transgender youth stuck in an adult prison in Connecticut for over six weeks without even being charged, Shane Bauer at Mother Jones steps back to describe the context in which Jane grew up. In reading this horrific (but not that uncommon) account of abuse, neglect, poverty, and dreadful state interventions, I came across this sentence:

“While in group homes, she says she was sexually assaulted by staffers, and at 15, she became a sex worker and was once locked up for weeks and forced to have sex with “customers” until she escaped.” (Mother Jones)

What makes this sentence so startling is the choice of the term “sex work.” Whether the author realizes it or not, this term is extraordinarily political, especially when applied to an abused and entrapped teenager. I couldn’t help but wonder why the author didn’t identify Jane as a victim of human trafficking.

Commercial sexual exploitation of minors

Over the last few years, I’ve been working with an amazing collection of researchers in an effort to better understand technology’s relationship to human trafficking and, more specifically, the commercial sexual exploitation of children. In the process, I’ve learned a lot about the politics of sex work and the political framing of sex trafficking. What’s been infuriating is to watch the way in which journalists and the public reify a Hollywood narrative of what trafficking is supposed to look like — innocent young girl abducted from happy, healthy, not impoverished home with loving parents and then forced into sexual acts by a cruel older man. For a lot of journalists, this is the only narrative that “counts.” These are the portraits that are held up and valorized, so much so that an advocate reportedly fabricated her personal story to get attention for the cause.

The stark reality of how youth end up being commercially sexually exploited is much darker and implicates many more people in power. All too often, we’re talking about a child growing up in poverty, surrounded by drug/alcohol addiction. More often than not, the parents are part of the problem. If the child wasn’t directly pimped out by the parents, there’s a high likelihood that s/he was abused or severely neglected. The portrait of a sex trafficking victim is usually a white or Asian girl, but darker skinned youth are more likely to be commercially sexually exploited and boys (and especially queer youth) are victimized far more than people acknowledge.

Many youth who are commercially exploited are not pimped out in the sense of having a controlling adult who negotiates their sexual acts. All too often, youth begin trading sex for basic services — food, shelter, protection. This is part of what makes the conversation about sex work vs. human trafficking so difficult. The former presumes agency, even though that’s not always the case while the latter assumes that no agency is possible. When it comes to sex work, there’s a spectrum. Sex work by choice, sex work by circumstance, and sex work by coercion. The third category is clearly recognizable as human trafficking, but when it comes to minors, most anti-trafficking advocates and government actors argue that it’s all trafficking. Except when that label’s not convenient for other political efforts. And this is where I find myself scratching my head at how Jane Doe’s abuse is framed.

How should we label Jane Doe’s abuse?

By the sounds of the piece in Mother Jones, Jane Doe most likely started trading sex for services. Perhaps she was also looking for love and validation. This is not that uncommon, especially for queer and transgender youth. For this reason, perhaps it is valuable to imply that she has agency in her life, to give her a label of sex work to suggest that these choices are her choices.

Yet, her story shows that things are far more complicated than that. It looks as though those who were supposed to protect her — staff at group homes — took advantage of her. This would also not be that uncommon for youth who end up commercially sexually exploited. Too many sexually exploited youth that I’ve met have had far worse relationships with parents and state actors than any client. But the clincher for me is her account of having been locked up and forced to have sex until she escaped. This is coercion through-and-through. Regardless of why Doe entered into the sex trade or how we want to read her agency in this process, there is no way to interpret this kind of circumscribed existence and abuse as anything other than trafficking.

So why isn’t she identified as a trafficking victim? Why aren’t human trafficking advocacy organizations raising a stink about her case? Why aren’t anti-trafficking journalists telling her story?

The reality is that she’s not a good example for those who want clean narratives. Her case shows the messiness of human trafficking. The way in which commercial exploitation of minors is entwined with other dynamics of poverty and abuse. The ways in which law enforcement isn’t always helpful. (Ah, yes, our lovely history of putting victims into jail because “it’s safer there.”) Jane Doe isn’t white and her gender identity confounds heteronormative anti-trafficking conversations. She doesn’t fit people’s image of a victim of commercial sexual exploitation. So it’s safer to avoid terms like trafficking so as to not muddy the waters even though the water was muddy in the first place.

(This entry was first posted on June 19, 2014 at Medium under the title “Why Jane Doe doesn’t get to be a sex trafficking victim” as part of The Message.)

by zephoria at August 07, 2014 12:34 AM

August 06, 2014

MIMS 2011

Big Data and Small: Collaborations between ethnographers and data scientists

This article first appeared in Big Data and Society journal published by Sage and is licensed by the author under a Creative Commons Attribution license. [PDF]


In the past three years, Heather Ford—an ethnographer and now a PhD student—has worked on ad hoc collaborative projects around Wikipedia sources with two data scientists from Minnesota, Dave Musicant and Shilad Sen. In this essay, she talks about how the three met, how they worked together, and what they gained from the experience. Three themes became apparent through their collaboration: that data scientists and ethnographers have much in common, that their skills are complementary, and that discovering the data together rather than compartmentalizing research activities was key to their success.

In July 2011, at WikiSym in Mountain View, California, I met two computer scientists from Minnesota. I was working as an ethnographer for the non-profit technology company Ushahidi at the time, and I had worked with computer scientists before on tool building and design projects, but never on research. The three of us were introduced because we were all working on the subject of Wikipedia sources and citations.

We recently argued about who started the conversation. Dave Musicant, a computer scientist at Carleton College, said later that, although he loved doing interdisciplinary research, he was much too shy to have introduced himself. Shilad Sen is an Assistant Professor of Computer Science at Macalester College and had been working with Dave on a dataset of about 67 million source postings from about 3.5 million Wikipedia articles. In his usual generous manner, Shilad wrote later: “We had ground to a halt when you came to talk to us. We had done this Big Data analysis, but didn’t have any idea what we should do with the data. You saved us!”

In retrospect, the collaboration that followed involved a great deal of mutual “saving.” I was trying to build a portrait of how Wikipedians managed sources during breaking news events to inform Ushahidi’s software development projects, but I did not have the bigger picture of Wikipedia sources to guide new directions in the research. Dave and Shilad were looking at whether one could predict which sources would stay on Wikipedia longer than others in order to build software tools to suggest citations to Wikipedians, but they had little detailed insight into why sources were added or removed in different contexts.

Over the next two years, the three of us met on Skype every few months to share our findings and then to conduct new analyses, test out new theories about the data, and finally produce a paper entitled “Getting to the source” (Ford et al., 2013) for WikiSym in 2013. I visited the two in Minnesota more recently to discuss the possible future trajectories for research, but our collaboration has remained informal and ad hoc. Despite this (or perhaps in large part because of this), my collaboration with Dave and Shilad has been one of the most enjoyable, educational experiences for me as an early career researcher. This is perhaps partly due to the unique combination of personalities that happen to combine particularly well, but I also think that interdisciplinary research like this can yield very exciting results for researchers coming from very different epistemological and methodological vantage points if they remain open and creative about the process. Three observations are particularly noteworthy here: that data scientists and ethnographers have much in common, that our skills are complementary, and that discovering the data together rather than compartmentalizing research activities was key to our success.

Ethnographers and data scientists have much in common

Although at first glance Big Data and ethnography can be seen in opposition (after all, ethnographers have their roots in studies of societies far removed from the heavily mediated ones of today), there are actually some significant commonalities. Both recognize that what people actually do (rather than only what they say) is invaluable, and both require an immersion in data in order to understand their research subject. As Jenna Burrell (2012) writes for Ethnography Matters:

“Ethnographers get at this the labor-intensive way, by hanging around and witnessing things first hand. Big data people do it a different way, by figuring out ways to capture actions in the moment, i.e. someone clicked on this link, set that preference, moved from this wireless access point to that one at a particular time.”

Burrell argues that the difference lies in the emphasis that ethnographers and data scientists place on what people do. Ethnographers, for example, do a lot of complementary work to connect apparent behavior to underlying meaning through in situ conversations or more formal interviews. Data scientists, on the other hand, tend to focus only on behavioral data traces.

If timed well, however, ethnographers and data scientists can come together at appropriate moments to collaborate on answers to common questions before moving on to wider (in the case of data science) or deeper (in the case of ethnography) research projects. In the case of the “Getting to the source” collaboration, the three of us shared a curiosity about sources and about Wikipedia practice more generally, and it was this shared curiosity that drove the project forward. I was interested in large-scale approaches to Wikipedia sources because I had been looking at Wikipedia’s policy on sources and was finding in the examples and interviews that practice around sourcing was very different from what was being recommended in the policies. I was curious about whether source choices were, in fact, contradictory to policies that preferred academic sources. To understand whether my cases were indicative of larger trends, I needed to get a handle on the entire corpus of data traces. Shilad and Dave were interested in the “stickiness” of sources, trying to understand why some sources stuck around more than others. Sourcing practice, for them, was therefore really important for understanding how to analyze and evaluate the data traces represented in the database. All of us recognized the benefit of sharing skills and knowledge that we had gained in our different areas. I needed to understand ways of analyzing the entire corpus, and Dave and Shilad needed to understand everyday Wikipedia practice.

It turned out that, in addition to common questions and the need for shared expertise, we also shared commonalities in our approach. I was pleasantly surprised when I started working with Dave and Shilad that all of us preferred an approach that was inductive (testing out theories about the data as we progressed), systematic (being sure to follow up leads and challenge our assumptions), and collaborative (sharing responsibilities equally and understanding the decisions that we were all making and their impact on the project as a whole). I started this collaboration with an idea that quantitative research was largely deductive and that quantitative researchers would feel they had little to gain from working with those who tend to take a more qualitative approach. Through working with Dave and Shilad, however, I learned that we had much more in common than not, and that collaboration could yield worthwhile results for both data scientists and ethnographers.

Our skills and experience are complementary

In the Wikipedia research arena, a few Big Data researchers have used interviewing, participant observation, and coding in addition to their large-scale analyses to explore research questions. Brian Keegan’s large-scale network analysis of traces through a system (Keegan et al., 2012) is an exemplar of Big Data research, for example, but Keegan also spent countless hours participating in the production of the class of Wikipedia articles that he was studying in order to understand the meaning of the traces that he was collecting. Keegan is, however, a rare example of an individual researcher who possesses the variety of skills necessary to answer some of the important questions of our age. More usual are the types of collaborations where researchers with a wide variety of skills and epistemologies work together to build rich perspectives on their research subjects and learn from one another in order to improve their skills and experience with methods with which they are unfamiliar.

In the case of the Wikipedia sources collaboration, Dave and Shilad had the necessary skills and resources to extract over 67 million source postings from about 3.5 million Wikipedia articles. Based on the interviews that I had done on ways in which Wikipedians chose and inscribed sources on the encyclopedia, I was able to contribute ideas about different ways of slicing the data in order to gain new insights. Dave and Shilad had access to sophisticated software and data processing tools for managing such a high volume of data, and I had the knowledge about Wikipedia practice that would inform some of the analyses that we chose to do on this data. After hearing from an expert interviewee that Wikipedians often discover their information using local sources but cite Western sources, for example, we were able to explore the diversity of sources in relation to their geographical provenance. By understanding this practice, we could also mention what was missing from the data that we had access to, namely, that citations did not necessarily represent what sources editors were using to find information, but rather what citations they believed others were more likely to respect. This small detail has significant implications for the conclusions that we draw about what sources and citations represent and the dynamics of collaboration on large peer production communities like Wikipedia. By discussing my findings with Dave and Shilad iteratively, we were able to come up with methods for operationalizing these hypotheses and developing different lenses for analyzing the data. Through this process, we recognized that our skills and experiences were highly complementary.

Discovering the data together is better than compartmentalizing activities

Many collaborative research activities fail when tasks are divided up according to the perceived skills and expertise of different types of research identities, rather than taking a more creative approach to research design. In this traditional view, ethnographers might be asked to do the interviews and manual coding while the Big Data analysts do the large-scale analyses, with little collaboration and little shared experience of these processes. The result is that there is no learning or sharing of skills: data scientists are seen merely as technicians who are able to manipulate the data and ethnographers as those who will “fill in” the context during write-up. If both researchers are to learn from the experience and stand on one another’s shoulders to produce high-quality results, it is important that researchers share some unfamiliar tasks, or that they are at least taken through the processes that resulted in particular data being produced.

Although Dave and Shilad could have asked me to do the manual coding for our sources project alone, we decided to divide the tasks up so that we all contributed to the development of the coding scheme, and coded individual results and checked the accuracy of one another’s coding. Although I led the development of a coding scheme, Dave and Shilad challenged me on the ways in which I was defining the scheme and both helped to manually code the random sample and to check my results. In this way, we all came out with a deeper understanding of the subject and of the ways in which our particular lens contributed to the shape of the research output. We also learned some important new skills. I learned how such large-scale analysis is done and about the choices that are made to achieve a particular result. Shilad, on the other hand, used the coding scheme that we developed together as an example in one of the method classes that he now teaches at Macalester College. We all extended ourselves through this project by sharing unfamiliar tasks and gaining a great deal more from this than we might have if we had kept to our traditional roles.

In summary, ethnographers have much to gain from analyzing large-scale data sources because they can provide a unique insight into how participants are interacting in complex media platforms in ways that complement observations in the field. Data scientists, in turn, can benefit from more qualitative insight into the implications of missing data, data incompleteness, and the social meanings attributed to data traces. Working together, ethnographers and data scientists can not only produce rigorous research but can also find ways of diversifying their skills as researchers. My experience with this project has given me new respect for quantitative research done well and has reiterated the fact that good research is good research whatever we call ourselves.

This article is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page.


  1. Burrell J (2012) The ethnographer’s complete guide to big data: Answers. Ethnography Matters. Available at: (accessed 9 July 2014).
  2. Ford H, Sen S, Musicant DR, et al. (2013) Getting to the source: Where does Wikipedia get its information from? In: Proceedings of the 9th international symposium on open collaboration. New York, NY: ACM, pp. 9:1–9:10. doi:10.1145/2491055.2491064.
  3. Keegan B, Gergle D and Contractor N (2012) Do editors or articles drive collaboration? Multilevel statistical network analysis of Wikipedia coauthorship. In: Proceedings of the ACM 2012 conference on computer supported cooperative work. New York, NY: ACM, pp. 427–436. doi:10.1145/2145204.2145271.


by Heather Ford at August 06, 2014 06:44 PM

Full disclosure: Diary of an internet geography project #3

Reblogged from ‘Connectivity, Inclusivity and Inequality’

In this series of blog posts, we are documenting the process by which a group of computer and social scientists are working together on a project to understand the geography of Wikipedia citations. Our aim is not only to better understand how far Wikipedia has come towards representing ‘the sum of all human knowledge’ but to do so in a way that lays bare the processes by which ‘big data’ is selected and visualized. In this post, I outline the way we initially thought about locating citations and Dave Musicant tells the story of how he has started to build a foundation for coding citation location at scale. It includes feats of superhuman effort including the posting of letters to a host of companies around the world (and you thought that data scientists sat in front of their computers all day!)

Many articles about places on Wikipedia include a list of citations and references linked to particular statements in the text of the article. Some of the smaller language Wikipedias have fewer citations than the English, Dutch or German Wikipedias, and some have very, very few, but the source of information about places can still act as an important signal of ‘how much information about a place comes from that place’.

When Dave, Shilad and I did our overview paper (‘Getting to the Source’) looking at citations on English Wikipedia, we manually looked up the whois data for a set of 500 randomly collected citations for articles across the encyclopedia (not just about places). We coded citations according to their top-level domain so that if the domain was a country code top-level domain (such as ‘.za’), then we coded it according to the country (South Africa), but if it was using a generic top-level domain such as .com, we looked up the whois data and entered the country for the administrative contact (since the technical contact is often the domain registration company, which is frequently located in a different country). The results were interesting, but perhaps unsurprising. We found that the majority of publishers were from the US (at 56% of the sample), followed by the UK (at 13%) and then a long tail of countries including Australia, Germany, India, New Zealand, the Netherlands and France at either 2 or 3% of the sample.

Geographic distribution of English Wikipedia sources, grouped by country and continent. Ref: ‘Getting to the Source: Where does Wikipedia get its information from?’ Ford, Musicant, Sen, Miller (2013).
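The coding rule we used for the 500-citation sample can be sketched in a few lines of Python. This is only an illustration, not the script we actually ran: the function name `code_citation` and the tiny `CCTLD_TO_COUNTRY` table are made up for the example, and a real table would need every country code top-level domain.

```python
from urllib.parse import urlparse

# Illustrative subset only (an assumption for this sketch); a real mapping
# would cover all country code top-level domains.
CCTLD_TO_COUNTRY = {
    "za": "South Africa",
    "uk": "United Kingdom",
    "de": "Germany",
    "au": "Australia",
}


def code_citation(url):
    """Return (country, needs_whois) for a citation URL."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in CCTLD_TO_COUNTRY:
        # Country code TLD: code the citation directly by country.
        return CCTLD_TO_COUNTRY[tld], False
    # Generic TLDs (.com, .org, ...) carry no country signal, so the
    # citation is flagged for a whois lookup of the administrative contact.
    return None, True
```

For example, `code_citation("http://www.news24.co.za/story")` is coded as South Africa without any lookup, while a `.com` address is flagged for a whois lookup.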

This was useful to some extent, but we also knew that we needed to extend this to capture more citations and to do this across particular types of article in order for it to be more meaningful. We were beginning to understand that local citation practices (local in the sense of the type of article and the language edition) dictated particular citation norms and that we needed to look at particular types of article in order to better understand what was happening in the dataset. This is a common problem besetting many ‘big data’ projects when the scale is too large to get at meaningful answers. It is this deeper understanding that we’re aiming at with our Wikipedia geography of citations research project. Instead of just a random sample of English Wikipedia citations, we’re going to be looking up citation geography for millions of articles across many different languages, but only for articles about places. We’re also going to be complementing the quantitative analysis with some deep dive qualitative analysis into citation practice within articles about places, and doing the analysis across many language versions, not just English. In the meantime, though, Dave has been working on the technical challenge of how to scale up location data for citations using the whois lookups as a starting point.

[hands over the conch to Dave…]

In order to try to capture the country associated with a particular citation, we thought that capturing information from whois databases might be instructive since every domain, when registered, has an administrative address which represents in at least some sense the location of the organization registering the domain. Though this information would not necessarily always tell us precisely where a cited source was located (when some website is merely hosting information produced elsewhere, for example), we felt like it would be a good place to start.

To that end, I set out to do an exhaustive database lookup by collecting the whois administrative country code associated with each English Wikipedia citation. For anyone reading this blog who is familiar with the structure of whois data, this would readily be recognized as exceedingly difficult to do without spending lots of time or money. However, these details were new to me, and it was a classic experience of me learning about something “the hard way.”

I soon realised how difficult it was going to be to obtain the data quickly. Whois data for a domain can be obtained from a whois server. This data is typically obtained interactively by running a whois client, which is most commonly either a command-line program or a whois lookup website. I found a Python library to make this easy if I already had the IP addresses I needed, and so I discovered, in initial benchmarking, that I could run about 1,000 IP-address-based whois queries an hour. That would make it exceedingly slow to look up the millions of citations in English Wikipedia, before even getting to other language versions. I later discovered that most whois servers limit the number of queries that you can make per day, and had I continued along this route, I undoubtedly would have been blocked from those servers for exceeding daily limits.
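To put the benchmark in perspective, here is a back-of-the-envelope calculation. The rate is the figure quoted above; the lookup counts are hypothetical round numbers chosen for illustration, not measurements from our dataset.

```python
QUERIES_PER_HOUR = 1_000  # benchmark figure quoted above


def days_needed(n_lookups, per_hour=QUERIES_PER_HOUR):
    """Days of non-stop querying needed for n_lookups at the given rate."""
    return n_lookups / per_hour / 24


# Hypothetical workloads, for illustration only.
for n in (1_000_000, 5_000_000):
    print(f"{n:,} lookups: about {days_needed(n):.0f} days")
```

Even a single million lookups would take more than a month of uninterrupted querying at that rate, and that is before any daily limits kick in.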

The team chatted, and we found what seemed to be some good options for getting bulk whois results. We found web pages of the Regional Internet Registry (RIR) ARIN, which has a system whereby researchers are able to request access to their entire database after filling out some forms. Apart from the red tape (the forms had to be mailed in by postal mail), this sounded great. I then discovered that ARIN and the other RIRs make the entire dump of the IP addresses and country codes that they allocate available publicly, via FTP. ‘Perfect!’ I thought. I downloaded this data, and decided that since I was already looking up the IP addresses associated with the Wikipedia citations before doing the whois queries, I could then look up those IP addresses in the bulk data available from the RIRs instead.
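For readers curious what a lookup against the RIR dumps might look like, here is a minimal sketch. It assumes the pipe-delimited delegation format as I understand it (registry, country code, type, start address, address count, date, status); the sample records are invented and use documentation address ranges, and the function names are my own.

```python
import bisect
import ipaddress

# Invented sample records in the assumed delegation format.
SAMPLE = [
    "# comment line",
    "arin|*|ipv4|*|2|summary",
    "arin|US|ipv4|192.0.2.0|256|20140101|assigned",
    "ripencc|DE|ipv4|198.51.100.0|256|20140101|assigned",
]


def load_ranges(lines):
    """Parse delegation records into a sorted list of (start, end, country)."""
    ranges = []
    for line in lines:
        fields = line.strip().split("|")
        # Skip comments, header and summary lines, and non-IPv4 records.
        if line.startswith("#") or len(fields) < 7 or fields[2] != "ipv4":
            continue
        if fields[1] in ("", "*"):
            continue  # no country recorded for this block
        start = int(ipaddress.IPv4Address(fields[3]))
        count = int(fields[4])  # for IPv4, the number of addresses in the block
        ranges.append((start, start + count - 1, fields[1]))
    ranges.sort()
    return ranges


def country_for_ip(ranges, ip):
    """Return the country code of the block containing ip, or None."""
    value = int(ipaddress.IPv4Address(ip))
    # Find the last block whose start address is <= value.
    i = bisect.bisect_right(ranges, (value, float("inf"), "")) - 1
    if i >= 0 and ranges[i][0] <= value <= ranges[i][1]:
        return ranges[i][2]
    return None
```

Because the ranges are sorted once and searched by bisection, each lookup is effectively instant, which is what made this route look so attractive at the time.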

Now that I had a feasible plan, I then proceeded to write more code to look up IP addresses for the domains in each citation. This was much faster, as domain-to-IP lookups are done locally, at our DNS server. I could now do approximately 600 lookups a minute to get IP addresses, and then an effectively instant lookup for country code on the data I obtained from the RIRs. It was then pointed out to me, however, that this approach was flawed because of content distribution networks (CDNs), such as Akamai. Many large- and medium-sized companies use CDNs to mirror their websites, and when you do a lookup on a domain to get its IP address, you get the IP address of the CDN, not of the original site. ‘Ouch!’ This approach would not work…
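The domain-to-IP step itself is simple. The sketch below is not the code I wrote at the time; it uses Python's standard `socket.gethostbyname`, and the `resolver` parameter is an addition for this example so the function can be exercised without touching the network.

```python
import socket
from urllib.parse import urlparse


def resolve_citation(url, resolver=socket.gethostbyname):
    """Return (domain, ip) for a citation URL; ip is None if resolution fails."""
    domain = urlparse(url).hostname
    if not domain:
        return None, None
    try:
        # Caveat from the text: for sites behind a CDN this is the CDN's
        # address, not the address of the original publisher.
        return domain, resolver(domain)
    except OSError:  # socket.gaierror is a subclass of OSError
        return domain, None
```

Nothing in the returned address says whether it belongs to the publisher or to a CDN, which is exactly why the country codes derived this way could not be trusted.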

I next considered going back to the full bulk datasets available from the RIRs. After filling out some forms, mailing them abroad, and filling out a variety of online support requests, I finally engaged in email conversations with some helpful folks at two of the RIRs who told me that they had no information on domains at all. The RIRs merely allocate ranges of IP addresses to domain registrars, and they are the ones who can actually map domain to IP. It turns out that the place to find the canonical IP address associated with a domain is precisely the same place as I would get the country code I wanted: the whois data.

Whois data isn’t centralized – not even in a few places. Every TLD essentially has its own canonical whois server, each one of which reports the data back in its own different natural-text format. Each one of those servers limits how much information you can get per day. When you issue a whois query yourself, at a particular whois server, it in turn passes the query along to other whois servers to get the right answer for you, which it passes back along.
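To make the mechanics concrete, here is a rough sketch of a single whois query and of pulling a country out of the free-text reply. The query follows the plain whois protocol (a TCP connection on port 43, the query followed by CRLF); the server to ask is left to the caller, and the `country` pattern is only a guess at one common label, since every server formats its reply differently. The sample reply in the usage note is invented.

```python
import re
import socket


def whois_query(domain, server, port=43, timeout=10):
    """Send one whois query to the given server and return the raw text reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Assumed label format: any line whose field name contains the word "country".
COUNTRY_LINE = re.compile(r"^[^:\n]*\bcountry\b[^:\n]*:[ \t]*(\S+)", re.I | re.M)


def extract_country(reply):
    """Return the first country value found in a whois reply, or None."""
    match = COUNTRY_LINE.search(reply)
    return match.group(1).upper() if match else None
```

Given an invented reply such as `"Domain Name: EXAMPLE.ORG\nAdmin Country: za\n"`, `extract_country` returns `"ZA"`; given a reply with no country line at all, it returns `None`, which turns out to be the common case for many TLDs.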

There have been efforts to try to make this simpler. The software projects ruby-whois and phpwhois implement a large number of parsers designed to cope with the outputs from all the different whois servers, but you still need to be able to get the data from them without being limited. Commercial providers will provide you bulk lookups at a cost: they must query what they can at whatever speed they can, and archive the results. But they are quite expensive. Robowhois, one of the more economical bulk providers, asks for $750 for 500,000 lookups. Furthermore, there is no particularly good way to validate the accuracy or completeness of their mirrored databases.

It was finally proposed that maybe we could do this ourselves by the use of parallel processing, using multiple IP addresses ourselves so as to not get rate limited. I began looking into that possibility but it was only then that I realized that many of the whois providers don’t ever really use country codes at all in the results of a whois query. At the time I’m writing this, none of the following queries return a country code:








So after all that, we’ve landed in the following place:

- Bulk querying whois databases is exceedingly time consuming or expensive, and it carries the risk of having our access to servers blocked.

- Even if the above problems were solved, many TLDs don’t provide country code information on a whois lookup, which would make doing an exhaustive lookup pointless because it would unbalance the whole endeavor towards those domains where we could get country information.

- I’m a lot more knowledgeable than I was about how whois works.

So, after a long series of efforts, I found myself dramatically better educated about how whois works; and in much better shape to understand why obtaining whois data for all of the English Wikipedia citations is so challenging.


by Heather Ford at August 06, 2014 06:36 PM