scrubbing numbers and other data
I've been getting asked about this a bunch lately, so here is an answer for everyone. You may have noticed that telephone numbers are displaying in completed transcripts as hash marks:
(###) ###-####
What is going on? Did the patron get the correct number in the answer?
Questions that are 'closed' are displayed as having been scrubbed, though answers sent to patrons are not. Questions that are 'open' are displayed without scrubbing. When you answer a question, it is automatically set to 'closed', so your answer will immediately display as scrubbed.
The idea right now is to make sure the scrubbing works well before we make it permanent.
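For the technically curious, here is a minimal sketch of that behavior, reading the above as scrubbing applied only when a closed transcript is displayed, with the stored text left alone. The `Transcript` class, the `display_text` function, and the stand-in `hash_digits` scrubber are all hypothetical names made up for illustration, not our actual code.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transcript:
    status: str       # "open" or "closed" (hypothetical field)
    stored_text: str  # text as stored, never modified here


def display_text(transcript: Transcript, scrub: Callable[[str], str]) -> str:
    """Return the text to show on screen for a transcript."""
    if transcript.status == "closed":
        # Closed questions are shown scrubbed; the stored copy is untouched.
        return scrub(transcript.stored_text)
    # Open questions are shown as-is so the librarian can answer them.
    return transcript.stored_text


def hash_digits(text: str) -> str:
    """Stand-in scrubber for the demo: turn every digit into a hash mark."""
    return re.sub(r"\d", "#", text)


if __name__ == "__main__":
    t = Transcript(status="open", stored_text="Call (503) 555-0142")
    print(display_text(t, hash_digits))
    t.status = "closed"
    print(display_text(t, hash_digits))
```

Because nothing is overwritten in this sketch, a scrubbing rule that turns out to be too aggressive can be changed without losing anything.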
Names (if we know them), e-mail addresses, phone numbers, groups of 5 or more digits (e.g. zip codes), and groups of 11 or more digits and spaces (sometimes people enter barcodes and credit card numbers) are removed from transcripts unless they are part of a URL.
We scrub every phone number because the computer system doesn't know who a number belongs to. If it is a patron's phone number, we want it removed, but who can tell? So we are removing anything that looks like it might be a phone number.
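Here is a rough sketch of how rules like these could be written with regular expressions. It is not our actual code: the patterns, the `scrub` function, the `known_names` parameter, and the `[name removed]` and `[e-mail removed]` placeholders are all assumptions for illustration. The only part taken from the real display is that digits become hash marks.

```python
import re

# All patterns below are illustrative guesses, not the production rules.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
DIGITS_AND_SPACES_RE = re.compile(r"\d[\d ]{9,}\d")  # 11+ digits and spaces
LONG_DIGITS_RE = re.compile(r"\d{5,}")               # 5+ digits in a row


def _hash_digits(match):
    """Replace each digit in the match with '#', keeping punctuation."""
    return re.sub(r"\d", "#", match.group(0))


def _scrub_plain(text, known_names):
    """Scrub a stretch of text that is known not to be part of a URL."""
    # E-mail addresses go first so a name inside an address is not
    # half-replaced before the address pattern can match.
    text = EMAIL_RE.sub("[e-mail removed]", text)
    for name in known_names:
        text = re.sub(re.escape(name), "[name removed]", text,
                      flags=re.IGNORECASE)
    for pattern in (PHONE_RE, DIGITS_AND_SPACES_RE, LONG_DIGITS_RE):
        text = pattern.sub(_hash_digits, text)
    return text


def scrub(text, known_names=()):
    """Scrub everything except URLs, which are copied through untouched."""
    parts = []
    last = 0
    for url in URL_RE.finditer(text):
        parts.append(_scrub_plain(text[last:url.start()], known_names))
        parts.append(url.group(0))
        last = url.end()
    parts.append(_scrub_plain(text[last:], known_names))
    return "".join(parts)


if __name__ == "__main__":
    print(scrub("My number is (503) 555-0142"))
    print(scrub("see http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006"))
```

Note how blunt this is: any run of five digits gets hashed whether or not it is a zip code, which is the same trade-off described above for phone numbers.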
And all of this raises the question: why keep any of it at all? I used to be more firm in my conviction that the transcripts were valuable for statistics, for education, to be re-used, and to be mined for the connection between the questions people ask, the words we use in conversation, and the resources librarians use to answer them. I still believe this, but since we have been quite slow to do anything with them lately, I am leaning more and more towards deleting transcripts altogether.
To help make things a little clearer, I added a message to closed transcripts:

Comments
and more
I wanted to add also that a forthcoming lawyerly article by Paul Ohm posits that a database is less and less useful as personal information is removed - http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006.
According to Ohm, computer science tells us that a database can either be 'scrubbed' or useful, but not both. He argues that computer science has proven that policies for "scrubbing" personal information from databases are misguided because individuals involved can very often be re-identified.
I think this is a really important idea, and without getting too into it, I think L-net differs from what he is talking about in two significant ways:
First, we are not sharing this data with anyone; we are keeping it for ourselves ("ourselves" being our partner libraries). Perhaps we don't need to scrub at all. I do welcome library scientists who wish to use our database for research, though so far this has been limited to staff at our partner libraries.
Still, there is some risk that we will accidentally release the data (governments are known to do that) or that an "adversary" will break into the site, and if that happens, it's better that all they find is the scrubbed transcripts.
Second, our data isn't actually about people. People are part of our metadata. We aren't interested in the habits of individual people, in part because so many people only ever use our service once (though, due to scrubbing, I can't back that up), so there is nothing we can really conclude about anyone.
I think our scrubbed data is useful, but I also want to acknowledge that "scrubbed" can't always mean anonymous.