thinkingmachine
Web Stuff
Registration and Unintended Consequences

Usually the law of unintended consequences decrees that things will be much worse than expected, but every now and then there's a pleasant surprise. This story comes from Topix, via Greg Linden. I won't rehash it in too much detail, but the upshot is that Topix got rid of the registration requirement on their user forums, and not only did participation increase dramatically (as expected), the spam rate actually decreased (pleasant surprise).

The thing is, requiring registration has some unintended consequences of its own. In particular, lots of people won't do it. And, on the other hand, spammers, trolls, and immature jerks are happy to do it. So (proportionately at least), you end up with more crappy posts. These observations are from the "2ch principles"; 2ch is an anything-goes web forum in Japan, and people have modeled their forums on 2ch's, having observed the benefits of open posting.

Another one of the mentioned principles is "anonymity counters vanity". The idea is that registered users will become cliquish, protecting their turf and jockeying over pride and identity. An open, more anonymous system essentially gives less reward to pride of identity, and posts are more topic-focused. This seems to run counter to another bit of web conventional wisdom, though, which is that reputation is an incentive for good behavior. The 2ch observation is that, at least, reputation is a mixed blessing; people will behave better when they want to protect their reputation, but they will also spend more time maintaining, promoting, and jockeying over that reputation. The cost/benefit tradeoff is probably different for different uses. On eBay, reputation is needed to provide for some trust when making costly exchanges; on a message forum, perhaps it's just a distraction.

Blog Spamming Gets Worse

CNET reports on a sudden flurry of blog spam activity, apparently due to one clever and obnoxious spammer. It's not email spam, comment spam, or trackback spam, but thousands and thousands of blogs created on Blogspot with links to the spammer's sites embedded in text snagged from popular blogs.

It's pretty annoying, not to mention depressing, what with the apparently eternal arms race we're locked into here. But the most surprising thing in the article to me was that Google doesn't seem to do any serious account verification on its Blogspot service (or didn't until last week). It's not like captchas are top secret advanced technology -- pretty much everyone uses them now. Where was Google?

The other thing I don't quite get is: why did all these crap blogs create so much trouble? I mean, it's the nature of the web that there's all kinds of crap out there -- some spammer adding more crap sites shouldn't make it appreciably worse. This isn't like spam mail or comment spam, where someone is shoving their message into your inbox. The problem here seems to have come from the fact that the spammer cleverly made all his fake blogs highly appealing to search engines. They started to appear in people's search results and RSS feeds (why? do people have RSS feeds of open searches?), and that caused the problem.

So, why? You know, if people had been doing that search on Google, I don't think they would have gotten all those crap results, because Google takes into account reputation (in terms of incoming links) in its results rankings -- crap spam sites that are only linked to by other crap spam sites shouldn't get a reputation boost. So to Technorati and PubSub and so on: do what Google does. Not that Google is infallible (see above about captcha), but these blog search engines are talking about blocking Blogspot from their results. Which will work precisely as long as spammers don't crack other blog hosting sites. Anyway, this is probably going to get worse before it gets better.
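
To make the "reputation in terms of incoming links" idea concrete, here is a minimal sketch of a PageRank-style score in Python. It is only an illustration, not Google's actual ranking: the page names, the toy link graph, the damping factor of 0.85, and the iteration count are all assumptions made up for this example.

```python
def link_reputation(links, damping=0.85, iterations=50):
    """Toy PageRank-style score. `links` maps a page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)  # fixed order keeps the arithmetic repeatable
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its own score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: real_blog and spam_site each have three incoming links,
# but only real_blog's linkers are themselves linked to by anyone.
toy_web = {
    "blog_a": ["blog_b", "real_blog"],
    "blog_b": ["blog_c", "real_blog"],
    "blog_c": ["blog_a", "real_blog"],
    "real_blog": ["blog_a", "blog_b", "blog_c"],
    "spam_blog_1": ["spam_site"],
    "spam_blog_2": ["spam_site"],
    "spam_blog_3": ["spam_site"],
    "spam_site": [],
}

ranks = link_reputation(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page:12s} {ranks[page]:.3f}")
```

In this toy graph real_blog ends up scoring well above spam_site even though both have three incoming links, because the spam blogs have no reputation of their own to pass along. The graph assumes nobody outside the spam cluster links into it and that the spam blogs do not link to each other; under this naive scoring, a tightly interlinked farm would hold on to more of its score.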

WWW2004 thoughts

WWW is always an interesting conference. The range of relevant topics is quite wide, from cache-and-network type stuff for optimizing performance to speculative artificial-intelligence-type ideas, to sociological analysis and theory of what people actually do on the web. And so, going from one poster to another, or slipping out of your usual track into some other talk, can be surprising, with the occasional benefit of jolting you into a new idea.

The other interesting thing about WWW is that it does represent, in some ways, much of the brainpower at the center of web developments. Many of the people involved in standards and so on are there, and many of today's papers will be tomorrow's hot new ideas. On the other hand, so much of what happens on the web and affects it for regular users makes no appearance among the pointy-headed types at all. Some of it is just secretive (e.g., Google is well-represented at these things, but they never talk about what they're doing) and some of it just pays no attention to research papers (e.g., most ecommerce, publishing, and daily stuff that people use).

The tension here appears all the time, for example in the contrast between all the cool research ideas people have for search and data extraction, and what people actually do every day. Or the contrast between what WWWers hope to do with the semantic web and the reality of how much attention span most people have for such complexity. Or in the way that search engine response time and ease of use have basically eclipsed many clever ideas that would be too costly to add.

If pressed, I'd say this kind of contrast appears in many areas of computer science: the tension between what researchers can think of and what people actually will use (or do use). But it's all much more obvious at WWW, perhaps just because the web is so widely used and develops so quickly and seems more like a force of nature (or, at least, an organic entity like a city or a nation) than a human-designed artifact. A lot of the time, we are just trying to keep up with its relentless development.

www2004: themes

themes at www2004: semantic web, learning/information extraction, search.

www2004 talk

My research group at Intel had a paper titled "Mining Models of Human Activities from the Web" at WWW 2004. I gave the talk at the conference on Friday. Here is the powerpoint.

www2004: Udi Manber talk

Udi Manber, formerly of Yahoo and academia, is now running Amazon's search engine offshoot a9. His topic was "Customer-centric innovations in search and e-commerce". He made some observations about search in general, talked about some Amazon and a9 projects, and ended with some what-ifs.

www2004: finding new news

[...regarding a paper on figuring out how novel articles are in comparison to previously seen articles, for the purpose of presenting users with the maximally new information...]

Don Patterson: I don't think this is a real problem.
Mike Perkowitz: whys that
Don Patterson: Because news is almost all the same.
Mike Perkowitz: true
Don Patterson: Tons of places just repeat what others say.
Mike Perkowitz: but what about a story breaking over time and you want to catch the scoop
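
For a rough sense of what scoring an article's novelty against previously seen articles could look like, here is a minimal Python sketch. It is not the method from the paper under discussion: it simply assumes novelty can be approximated as one minus the highest word-overlap (Jaccard) similarity with anything already seen, and the headlines are invented for illustration.

```python
import re


def word_set(text):
    """Lowercase a text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def novelty(article, seen_articles):
    """Return a score from 0.0 (nothing new) to 1.0 (nothing like it seen)."""
    words = word_set(article)
    if not words:
        return 0.0  # an empty article carries no new information
    closest = 0.0
    for seen in seen_articles:
        other = word_set(seen)
        # Jaccard similarity: shared words over all distinct words.
        similarity = len(words & other) / len(words | other)
        closest = max(closest, similarity)
    return 1.0 - closest


# Hypothetical headlines, made up for this example.
seen = ["Conference opens in New York with keynote on the semantic web"]
repeat = "conference opens in new york with keynote on the semantic web"
update = "Conference opens in New York with keynote on the semantic web, crowd skeptical"
unrelated = "Search startup launches desktop toolbar"

for label, text in [("repeat", repeat), ("update", update), ("unrelated", unrelated)]:
    print(f"{label:10s} {novelty(text, seen):.2f}")
```

A verbatim repeat scores 0.0, which is Don's point about outlets echoing each other, while a developing story that adds a few new words scores somewhere in between, which is the case Mike is asking about.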

www2004: Rick Rashid talk

The first talk this morning was from Rick Rashid, of Microsoft Research. His general theme was "Empowering the Individual". Highlight of the talk: a 1994 Microsoft promo video clip about "digital convergence" to the tune of "Surfin' USA" (with badly rewritten lyrics, believe you me). Anyway, he had three basic themes: democratization of information, getting your life back, and bending things around you to your will. His talk came across more like a laundry list of MSR projects, though, with some themes perhaps detectable.

The Semantic Web

Rendezvous IM with Donald Patterson
Mike Perkowitz: i dont think i buy this semantic web
Donald Patterson: It's hard to pin down.

www2004: Tim Berners-Lee

TBL, as always, gave the conference keynote talk. He focused on a couple of things: top-level domains/namespace, and the semantic web.

www2004

Greetings from WWW 2004. I'll try to make notes about anything interesting as it happens. At the moment, I'll just note that Mayor Bloomberg has declared today, 5/19/04, to be "World Wide Web Day" in NYC.