Library of Congress and the social media archive

By now everyone is up to their ears in tweets about the Library of Congress’s announcement that they will archive every Tweet. Here are my initial concerns and lauds.

  1. Cost. Library Journal has already questioned this. How much storage space is this going to require? How will it be sustainable? And how often are they planning on doing updates to the data stream? Will they begin collecting Tweets in real time? Monthly? Yearly?
  2. Content and archival quality. What about all those shortened links? Or the old ones from services that have shut down, like Twurl? Or the really old ones that might be full URLs but that have rotted away? We can’t expect this to be perfect, but is LOC planning on trying to capture any of the external content that the tweets refer to? I got this idea from @dancohen. He suggests that LOC may need to take snapshots of the linked websites, and I think that sounds almost essential, albeit messy and difficult.
  3. Searchability. This could either be the greatest thing to happen to Twitter search, or a huge disappointment. Will LOC make their database of Tweets searchable? Right now, Twitter search is good for about two weeks. Library of Congress has a huge opportunity to blast that wide open, and we can only hope that they are able (infrastructure and $$$-wise) to do so.
  4. Privacy. A commenter posted on the LJ blog about this issue. Is there a privacy problem here? Yes, our tweets are public, but is it somehow unethical, even if it may not be a violation of copyright, to republish Tweets in what could become a public archive? Don’t ask me for an answer. Because I’ll say “no, it isn’t.”
  5. Metadata. How will the data about the tweets and their authors be captured and stored? Furthermore, Twitter is about to let us start adding annotations and other metadata to tweets in our stream. Will this sort of marginalia be lost?

All in all, I have a feeling that this project is going to set a tone for social media archiving practice. This is one of the most talked-about services being archived by one of the world’s largest libraries. If they truly think this is important (and I am tempted to agree), I think there is an excellent opportunity here to demonstrate that importance publicly. Essentially, I think the LOC is about to create the standard and best practices for social media archiving with this project, for better or for worse. If it is not implemented well in the beginning, it has the potential to set the bar too low (in both the technical and the public eye) for future endeavours seeking to capture online content.

In any case, this is a very exciting development to round off my library education. Two more days!


  1. ReadWriteWeb has some more good questions. Among them: “Will the archive include friend/follower connection data? Will it be usable for commercial purposes? Will there be a Web interface for searching it, and will that change the face of Twitter search for good? Is there any way that the much larger archive of Facebook data could be submitted to the same body for analysis of the same kind?” The answer to some of these is already known: no commercial use, and it sounds like there will be little in the way of a web interface for searching. Instead they will present a curated set for public use, while the entire archive will remain for serious research only.
  2. To address the problem of search, Google Replay was announced yesterday as well. This is Google’s attempt to capture what SearchEngineBLog calls a “vox populi” view of historical events. You can essentially search Google’s index of tweets for a specific date or range, plus keywords, to get a sense of what was said about topics such as health care reform. With Twitter handling a reported 19 billion searches a month on their junky index, it’s about time we got another option. Google Replay, just like in their real-time results display, resolves those shortened links, but I don’t know whether the full URL is saved within the index or resolved on the fly. My guess is the latter.
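
For what it’s worth, the “saved within the index” option could look something like the sketch below. This is only an illustration of the idea, not how Google or the LOC actually does it: the function names (resolve_over_http, archive_tweet), the record layout, and the short.example domain are all made up for the example. The point is simply that the full URL is looked up once, at capture time, and stored next to the short one, so the archive does not depend on the shortening service still existing later.

```python
import http.client
import re
import urllib.request
from typing import Callable, Dict, List, Optional

# Very rough URL matcher; real tweet parsing is messier than this.
URL_PATTERN = re.compile(r"https?://\S+")


def resolve_over_http(short_url: str, timeout: float = 5.0) -> Optional[str]:
    """Follow redirects and return the final URL, or None if the link is dead."""
    request = urllib.request.Request(short_url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.geturl()
    except (OSError, ValueError, http.client.HTTPException):
        # Covers network failures, HTTP errors, and malformed URLs.
        return None


def archive_tweet(
    text: str,
    resolve: Callable[[str], Optional[str]] = resolve_over_http,
) -> Dict[str, object]:
    """Build a hypothetical archive record that keeps short and full URLs together."""
    links: List[Dict[str, Optional[str]]] = []
    for match in URL_PATTERN.findall(text):
        # Drop punctuation that belongs to the sentence, not the link.
        short_url = match.rstrip(".,;:!?)")
        links.append({"short": short_url, "resolved": resolve(short_url)})
    return {"text": text, "links": links}
```

Passing a different resolve function (a dictionary lookup, say) makes it possible to try this without touching the network. A link that no longer resolves is stored with resolved set to None, which at least records that rot had already set in when the tweet was captured.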

iphone @danhooker

What I want, and have always wanted, is a way to search for specific tweets by specific users. Sometimes I can recall a fuzzy thing like, “I know @somebody tweeted something about ‘Topic A’ like a month ago.” With Google Replay, we’re getting closer, but it’s not perfect yet. It does effectively use Twitter handles as a search term. For example, “iphone @danhooker” brings up some tweets (but not all) that I have sent or that were RTd by me. I hope it will get better. Google has that habit, so I fully expect–and pray–this will be a workable option for meaningful Twitter search in the future.


3 thoughts on “Library of Congress and the social media archive”

  1. Daniel,
    Storage is cheap and getting cheaper all the time. Google’s approach to data storage and unlimited space for e-mail are two examples. As long as the tweet data is mineable, many of the issues you raise can be resolved. I take your point about linkrot and how this will be confusing and frustrating to the researcher.


    1. Dean, storage is indeed “cheap” but it’s not free, and the LC doesn’t rake in billions of dollars a year doing search advertisements. Not saying this is a deal-breaker by any means, just another consideration I thought was worth mentioning.

      It will be nice to start seeing some research come out of this archive, won’t it? The link issue will be one to watch, I think. Maybe it would suffice to archive page titles, similar to what you see in a Google search?


  2. Twitter To Launch Their Own URL Shortener Soon (And Won’t Be Giving Users A Choice).

    This may help with the confusion and frustration that future tweet miners will experience due to linkrot. Or…maybe not. We’ll see.

