I have complained about Twitter’s search system before. I always used to think it was strange that you had to go to a subdomain, search.twitter.com, just to search anything at all. Eventually they integrated the search engine right onto the home page, so that problem went away. But as we’ve known for a while, it is nearly impossible to find older tweets since Twitter Search only keeps an index of tweets searchable for about two weeks.
Now of course that is probably small potatoes for most: why do you want to be able to search old tweets when Twitter is all about real time information? And now that the Library of Congress is taking over, who cares? These are good questions, and ultimately, the times are few when I truly need to find an old tweet through a search engine. But these things are steeped in principle: I should be able to if I want to. Moreover, there should be a better way to find a month-old tweet (one that I wrote, no less) than paging back through my entire timeline by clicking “more” 40 times.
Today a quirk of Twitter search was exposed that is even stranger. Apparently, Twitter is filtering out search results that contain the term “RT.” This cuts down on repetitive results, to be sure, but again, a tailored list of results is not what we should be receiving from searching. Personally I think repetitive RT tweets are 1) easy to ignore and 2) a visual cue for gauging the popularity and, in some ways, the importance of a particular tweet.
I was a bit skeptical of this glitch even happening since it sounds like a bug, but it does indeed occur. Here’s the rub: it only happens when you are logged in. So logged-out or non-users searching the Twitter homepage get full results, as does anyone searching on search.twitter.com.
It is likely that Twitter is doing this on purpose, and for several reasons.
- They want to clean up results for users and see it as a wanted convenience and added value (which it may indeed be for many people).
- They want to encourage use of their “Retweet” button and are underhandedly “forcing” users to use it if they want to be included in search results.
- Building on the two above, @josiefraser, from the Digizen project in the UK, thinks they are doing this to follow up on search deals with Google and Microsoft, to have Twitter’s new and approved Retweets as a way to easily and quantitatively rank tweet relevance.
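For the curious, here is a minimal Python sketch of the trade-off I am describing. It is not Twitter’s actual filter, which has not been published; the function names, the “RT @user” pattern and the sample tweets are all my own assumptions for illustration. It shows both views of the same data: dropping manual RTs from the results, and counting them as a rough popularity signal.

```python
import re
from collections import Counter

# Assumed convention for a manual retweet: "RT @user: original text",
# optionally preceded by a short comment. Hypothetical pattern, not Twitter's own.
MANUAL_RT = re.compile(r"\bRT @(\w+):?\s*(.*)", re.IGNORECASE)


def filter_manual_rts(tweets):
    """Return only the tweets that do not look like manual retweets."""
    return [t for t in tweets if not MANUAL_RT.search(t)]


def rt_counts(tweets):
    """Count how often each original text was manually retweeted."""
    counts = Counter()
    for t in tweets:
        match = MANUAL_RT.search(t)
        if match:
            counts[match.group(2).strip()] += 1
    return counts


# Invented sample data, for illustration only.
sample = [
    "LOC to archive every tweet",
    "RT @alice: LOC to archive every tweet",
    "Wow. RT @alice: LOC to archive every tweet",
    "Lunch was good",
]

filtered = filter_manual_rts(sample)   # what a logged-in user would see
popularity = rt_counts(sample)         # the signal the filter throws away
```

The filtered list is tidier, but the counts are exactly the “visual cue” that disappears along with the noise.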
So anyway, just another thing to be skeptical of when using Twitter for searching. With the addition of “sponsored tweets” and now the elimination of manual RTs, Twitter Search has really hit rock-bottom in terms of transparent and accessible search results. Worse yet, the users who do use the Retweet button do not get credit for having done so. All that appears is a note, saying that a certain tweet has been retweeted by x number of others (see picture above).
Hopefully the serious research archive at the Library of Congress will be useful for some who need a real archive. For the rest of you, hopefully most of you will see this as an improvement, an elimination of noise, and tell me to quit my bellyaching. As for me, well, I’ll just go crawl back under my griping rock and wait until someone finds me an easy way to search old tweets (that doesn’t involve Google or equally unappealing RSS parsing).
Update: The Next Web has weighed in saying, “If I were a betting blogger, I’d place my wager on Twitter addressing the filter as a “coding error” that will soon be “corrected”.” We’ll see.
By now everyone is up to their ears with tweets about the Library of Congress’s announcement that they will archive every tweet. Here are my initial concerns and lauds.
- Cost. Library Journal has already questioned this. How much storage space is this going to require? How will it be sustainable? And how often are they planning on doing updates to the data stream? Will they begin collecting Tweets in real time? Monthly? Yearly?
- Content and archival quality. What about all those shortened bit.ly links? Or the old ones from services that have shut down, like Twurl? Or the really old ones that might be full URLs but that have rotted away? We can’t expect this to be perfect, but is LOC planning on trying to capture any of the external content that the tweets refer to? I got this idea from @dancohen. He suggests that LOC may need to take snapshots of the linked websites, and I think that sounds almost essential, albeit messy and difficult.
- Searchability. This could either be the greatest thing to happen to Twitter search, or a huge disappointment. Will LOC make their database of Tweets searchable? Right now, Twitter search is good for about two weeks. Library of Congress has a huge opportunity to blast that wide open, and we can only hope that they are able (infrastructure and $$$-wise) to do so.
- Privacy. A commenter posted on the LJ blog about this issue. Is there a privacy problem here? Yes, our tweets are public, but is it somehow unethical, even if it may not be a violation of copyright, to republish tweets in what could become a public archive? Don’t ask me for an answer. Because I’ll say “no, it isn’t.”
- Metadata. How will the data about the tweets and their authors be captured and stored? Furthermore, Twitter is about to let us start adding annotations and other metadata to tweets in our stream. Will this sort of marginalia be lost?
All in all, I have a feeling that this project is going to set the tone for social media archiving practice: one of the most talked-about services is being archived by one of the world’s largest libraries. If they truly think this is important (and I am tempted to agree), I think there is an excellent opportunity here to demonstrate that importance publicly. Essentially, I think the LOC is about to create the standard and best practices for social media archiving with this project, for better or for worse. If it is not implemented well in the beginning, it has the potential to set the bar too low (in both the technical and the public eye) for future endeavours seeking to capture online content.
In any case, this is a very exciting development to round off my library education. Two more days!
UPDATED Apr 15:
- ReadWriteWeb has some more good questions. Among them: “Will the archive include friend/follower connection data? Will it be usable for commercial purposes? Will there be a Web interface for searching it, and will that change the face of Twitter search for good? Is there any way that the much larger archive of Facebook data could be submitted to the same body for analysis of the same kind?” The answer to some of these is already known: no commercial use, there will [sounds like] be little web interface for searching–instead they will present a curated set for public use, while the entire archive will remain for serious research only.
- To address the problem of search, Google Replay was announced yesterday as well. This is Google’s attempt to capture what SearchEngineBLog calls a “vox populi” view of historical events. You can essentially search Google’s index of tweets easily for a specific date or range and keywords to get a sense of what was said about topics such as health care reform. With Twitter handling a reported 19 billion searches a month on their junky index, it’s about time we got another option. Google Replay, just like in their real-time results display, resolves those shortened links, but I don’t know whether the full URL is saved within the index or resolved on the fly. My guess is the latter.
What I want, and have always wanted, is a way to search for specific tweets by specific users. Sometimes I can recall a fuzzy thing like, “I know @somebody tweeted something about ‘Topic A’ like a month ago.” With Google Replay, we’re getting closer, but it’s not perfect yet. It does effectively use Twitter handles as a search term, for example: “iphone @danhooker” brings up some tweets (but not all) that I have sent or that were RTd by me. I hope it will get better. Google has that habit, so I fully expect–and pray–this will be a workable option for meaningful Twitter search in the future.
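To pin down what I mean by searching “specific tweets by specific users,” here is a rough Python sketch of the query I wish existed. It has nothing to do with how Google Replay or Twitter Search actually work; the `Tweet` record, the `search` function and the sample tweets are all invented for illustration, and it assumes you already have the tweets in hand, which is of course the hard part.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Tweet:
    author: str   # handle without the leading "@"
    text: str
    posted: date


def search(tweets, author=None, keyword=None, start=None, end=None):
    """Return the tweets that match every criterion that was supplied."""
    results = []
    for t in tweets:
        if author and t.author.lower() != author.lstrip("@").lower():
            continue
        if keyword and keyword.lower() not in t.text.lower():
            continue
        if start and t.posted < start:
            continue
        if end and t.posted > end:
            continue
        results.append(t)
    return results


# Invented sample archive, for illustration only.
archive = [
    Tweet("danhooker", "Thinking about the iPhone again", date(2010, 3, 10)),
    Tweet("danhooker", "Library school is almost over", date(2010, 4, 12)),
    Tweet("somebody", "iPhone rumours everywhere", date(2010, 3, 11)),
]

# "I know @danhooker tweeted something about the iPhone in March."
hits = search(archive, author="@danhooker", keyword="iphone",
              start=date(2010, 3, 1), end=date(2010, 3, 31))
```

Nothing clever is going on here: handle, keyword and date range are simply combined as filters. The point is only that the fuzzy recollection above translates into a very ordinary query once the tweets are kept around long enough to ask it.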
I just came across a most interesting post by Ben Parr over at Mashable, entitled “5 Ways Social Media Will Change Recorded History.” Go ahead and take a look, I’ll wait.
Got the idea? Basically, social media is open, “archived,” “organized” and ready for immediate analysis by future scholars who, in wondering how social trends developed, can look back one hundred years across “Twitter, Facebook, blogs, websites, forums, and search habits” to gain a deep and, more important, true sense of history.
In theory, this is true. We do record more data about our thoughts and feelings on social media tools than any previous generation has had the opportunity (or the willingness) to do. However, this does not mean ipso facto that we will be allowed to have access to this information for eternity. Tweets are hardly archived for more than a week in Twitter’s search engine. Parr’s own link at the bottom of his post was broken when I read the article. We all know the frustrating reality of rotten and broken links on old websites.
I have schoolwork to do, so I’ll be brief in my critique on the problems of the notions of social media’s persistence:
- The first problem is link shorteners. Twurl went down once; we’ve already seen the problems that can arise from our favorite tweet-enabling link service having server trouble. “Archival” tweets will forever be associated with what will most likely end up as unresolvable links to unknown content.
- The second problem is the tweets (or statuses, or forum posts) themselves. We don’t just “have” Facebook and Twitter to analyze. Facebook has Facebook. And whether or not we get to look at that data is up to them, or whoever owns the rights in a hundred years. And who’s to say anyone will? What if Twitter doesn’t get another round of funding and drops server support? We’ll be left with personal backups from strange people scattered across the globe in God-only-knows-what format.
- Finally, the notion of the social media “archive” is short-sighted at best. At worst, it doesn’t even exist. Parr says “the information is archived, easily organized, and a large stock of it is readily available to the public.” Regardless of whether the organization of social media data is “easy,” the problem remains whether it is accessible or well organized at all. In my mind, social media data is disparate, context-dependent and, worst of all, proprietary.
This is not to say that the potential for social media to be groundbreaking in terms of social and possibly even recorded history does not exist. It does, and Parr is right in pointing it out. However, we can’t start the process of ensuring long-term access to social media from a position of blissful ignorance. We have a long, long road to travel before the data stored in our social media activities is available for use (viewable by the public does not equal usable for study), well-organized (reverse-chronological order does not equal structured) or archived (searchable does not equal persistently available). We cannot just assume that these things will be around forever or we risk losing them sooner than we might imagine.
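The link-shortener problem in the first point is the easiest one to make concrete. Below is a minimal Python sketch of what “archiving” a tweet would have to mean for its links to survive: the expanded URL is stored alongside the text at archive time rather than looked up later. Everything here is hypothetical: the `ArchivedTweet` record, the `archive_tweet` function, the lookup-table resolver and the `.example` URLs are stand-ins I made up, not anything Twitter or an actual archive does.

```python
import re
from dataclasses import dataclass, field

# Naive link matcher; it would also swallow trailing punctuation.
LINK = re.compile(r"https?://\S+")


@dataclass
class ArchivedTweet:
    text: str
    # Maps each link as written in the tweet to its expanded target,
    # or to None if it could not be resolved at archive time.
    expanded_links: dict = field(default_factory=dict)


def archive_tweet(text, resolve):
    """Store the tweet together with the expansion of every link in it.

    `resolve` is any callable that takes a URL and returns the final URL,
    or raises an exception if the shortener cannot be reached.
    """
    record = ArchivedTweet(text=text)
    for url in LINK.findall(text):
        try:
            record.expanded_links[url] = resolve(url)
        except Exception:
            record.expanded_links[url] = None  # the link has already rotted
    return record


# Stand-in resolver backed by a lookup table (hypothetical URLs).
KNOWN = {"http://short.example/abc123": "http://example.org/full-article"}


def fake_resolve(url):
    return KNOWN[url]  # raises KeyError for dead or unknown short links


record = archive_tweet(
    "Good read http://short.example/abc123 and http://dead.example/xyz",
    fake_resolve,
)
```

A real resolver might follow redirects with something like `urllib.request.urlopen(url).geturl()`, but the design point is the same either way: if the expansion is not captured while the shortener is still alive, the second link above is all that future scholars will ever get.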