Traditionally, the way to get to a webpage is via a search engine, a social network, or an aggregator like reddit. But there isn’t really a reason why these should be the only ways, and I think the web is suffering because of it.

The original idea of the internet was that ordinary people could build a webpage like this:

<html>
<head>
    <title>My first website</title>
</head>
<body>
    Hello world! Here's a link to a cool thing:
    <a href="http://www.anexamplewebsite.com">My example website</a>
</body>
</html>

You’d use some type of hosting to “deploy your page to the web”. Then users could visit your website using a browser, roll their eyes at how boring it was, and click a link to another website. You’ll notice that the markup above includes a “link” - the browser would color it blue.

Slowly things got added from there. Someone probably wanted to make the links red instead of blue, so they invented CSS. Then JavaScript was added for basic user interactivity. Nowadays websites can be really complicated, but at least at a surface level they’re still just markup documents as before - HTML, CSS, and JS.

So far so good - right? Okay, now the problem is, all these chuckleheads are putting up websites, and there’s no good way to find a specific one, except by clicking links from the ones you already know about. The decentralized nature of the web makes it easy to create pages, but hard to navigate across them.

Given that need, people started to make little bots that would, effectively, start with a webpage, visit every link in the markup, open those as webpages, visit every link in their markup, and so on.
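
To make that concrete, here’s a rough sketch of that crawl loop in Python, using only the standard library. The breadth-first traversal, the page limit, and all of the names are my own choices - this is the idea, not anyone’s production crawler.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=100):
    """Breadth-first crawl: fetch a page, queue up its links, repeat."""
    seen, queue = {start_url}, deque([start_url])
    link_graph = {}  # page -> list of pages it links to
    while queue and len(link_graph) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # dead link, timeout, non-HTML response, etc.
        parser = LinkCollector()
        parser.feed(html)
        outgoing = [urljoin(url, href) for href in parser.links]
        link_graph[url] = outgoing
        for link in outgoing:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return link_graph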

To a first approximation, Google’s index today is built the same way. They have a bunch of complex techniques for actually doing the visiting, link traversing, etc., but the underlying approach isn’t that different. Google’s primary innovation was that, for a user searching across the index, they ordered the search results by what was most linked to. The idea was to “crowdsource” which websites had value. The democracy of the internet would decide which links were worth adding and which were not. And Google would simply count up the votes, and then order the search results accordingly.
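
A toy version of that vote counting, reusing the link graph from the crawler sketch above, might look like the following. The search function and the page_text mapping (URL to extracted text) are assumptions of mine; real ranking, from PageRank onward, is enormously more involved.

from collections import Counter

def rank_by_inbound_links(link_graph):
    """Toy vote counting: a page's score is how many other pages link to it."""
    votes = Counter()
    for page, outgoing in link_graph.items():
        for target in set(outgoing):  # a page gets at most one vote per linker
            if target != page:        # ignore self-links
                votes[target] += 1
    return [page for page, _ in votes.most_common()]  # most linked-to first

def search(query, link_graph, page_text):
    """Keep pages that mention the query, ordered by inbound-link 'votes'."""
    ranked = rank_by_inbound_links(link_graph)
    return [p for p in ranked if query.lower() in page_text.get(p, "").lower()]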

Since people would visit Google for its great search results, and their visits would drive linking behavior, this created a positive feedback loop. Now almost everyone searches via Google, and therefore if you aren’t on Google, you might as well not exist.

This blog is an example of something that doesn’t really exist on Google (as of this writing). You can find it if you search for “sudopoint”, or a very specific phrase from one of the posts, but by definition, if you knew either of those things you probably wouldn’t have needed a search engine (you would have had the link some other way - either I spammed it to you via email, or you know me in real life, etc). The whole goal of a search engine is to surface links that you wouldn’t be able to find, and while that works well for “established websites you haven’t heard of yet” it doesn’t work nearly as well for “recently created websites”.

The problem then becomes, if you’re a new website, how do you get Google’s attention? And it turns out the answer is to write a bunch of drivel that other people will link to, after finding it on Google. Wait, what?

Yeah, that’s the downside to a positive feedback loop that everyone uses. The only way to enter the loop is to jump in and hope that the search->link->search->link lifecycle will bring you from the bottom of the results closer to the top. The idea behind adding the drivel is that unique phrases you happen to use will catch some small percentage of searchers, who then link to you, so you go up the rankings and more searchers find you.

Google’s hope of a democracy where people choose the most valuable webpages has turned into a rat race where people structure their webpages specifically to cater to Google’s algorithms. Want to make a useful tool? You should probably add a lot of text, not for users visiting the page, but so that you catch a tiny ray of Googly sunlight. That’s the real reason why recipe websites tend to have a whole bunch of crap before actually showing the recipe. The measure has become the target.

That also means that search results are getting worse. Google itself is driving traffic to the crap, and then measuring that traffic as an indicator of value. Meanwhile, us proletariat are starving for the decent, in-depth material that was pretty normal to find back in the 90s. Don’t even get me started on the vacuous, pseudo-intellectual tidbits that people tweet five times a day. Get off my lawn, modern internet.

Is this an unhinged rant? Yes. But I do have a couple of ideas for ways this could be improved.

The basic directory

The idea behind Yahoo’s directory (the first one that I can remember) is that most people wanted to visit the same, say, 500 websites. And we can organize those websites into categories - sports over here, culture over there, etc. So you can just make a massive page of links and let people use that to find stuff. In fact, Yahoo’s first search box was less of a search of the internet and more of a search of the directory.

It’s obvious that this approach probably wouldn’t work well today. The internet is much bigger now than it was when David & Jerry were building Yahoo, and you can’t browse through it all, or even categorize it, in one place. However, the number of distinct pages that people actually visit is pretty limited - could we do better?

The social directory

Instead of a generic directory of everything on the internet, maybe we add things to the directory based on what people visit. So, for example, we have a page that lets people submit links, and we rank everything in the directory based on which links get visited most. Depending on how you think of it, facebook / twitter / linkedin are the “social network” version of this directory (where “links visited” is replaced by “likes / shares / retweets”) and hacker news / reddit are “link aggregator” versions of this directory (where “links visited” is replaced by “votes”).
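
Boiled down, the data model behind any of these is tiny. Here’s a sketch; the class and method names are made up, and “engagement” stands in for whichever signal you prefer - visits, likes, shares, or votes.

class SocialDirectory:
    """A bare-bones 'submit a link, rank by engagement' directory."""

    def __init__(self):
        self.engagement = {}  # url -> engagement count

    def submit(self, url):
        self.engagement.setdefault(url, 0)

    def record_engagement(self, url):
        self.engagement[url] = self.engagement.get(url, 0) + 1

    def front_page(self, n=30):
        # Most-engaged links first - this is the whole "algorithm".
        return sorted(self.engagement, key=self.engagement.get, reverse=True)[:n]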

The problem with this approach is, frankly, your results aren’t going to be that different. The same people clicking on boring garbage in search engines are probably going to like / vote on boring garbage in the social directory. You can always niche your directory to a specific cohort of users (pinterest is targeted to women, hacker news is targeted to programmers, and so on), but ultimately your directory is still going to be biased to the shallow and easily accessible.

The tiered social directory (TSD)

Okay so maybe instead of treating all of your friends as equal, you take a harsher approach. Some of your friends are smart / intellectual / interesting, and the rest of them aren’t. You’re going to force people to categorize which folks are “trusted” and which aren’t, and you create a link directory solely based on what the trusted people like.

This doesn’t really exist today, but you could imagine building a “book club”-like experience for reddit where only people “with status” get to submit articles. Or a browser plugin that scrapes everyone’s history and collates it somewhere, letting you choose which people’s links to follow, etc.
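
The core mechanic, sketched in the same toy style as above (names and details invented): the only change from the social directory is that endorsements from anyone outside your trusted set are ignored.

class TieredDirectory:
    """A social directory where only 'trusted' people's endorsements count."""

    def __init__(self, trusted_users):
        self.trusted = set(trusted_users)  # the people whose taste you vouch for
        self.scores = {}                   # url -> number of trusted endorsements

    def endorse(self, user, url):
        if user in self.trusted:           # everyone else is simply ignored
            self.scores[url] = self.scores.get(url, 0) + 1

    def front_page(self, n=30):
        return sorted(self.scores, key=self.scores.get, reverse=True)[:n]

# Only alice's and bob's endorsements move the ranking.
directory = TieredDirectory(trusted_users=["alice", "bob"])
directory.endorse("alice", "https://example.com/deep-essay")
directory.endorse("random-stranger", "https://example.com/clickbait")  # no effect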

The problem is that smart people are still kind of dumb sometimes. Smart people will read memes (I still do). Smart people will talk about silly topics. Even if I picked my 20 cleverest friends, I’m not sure how good of a link directory we’d build together. We still need some way of removing the garbage.

However, this approach does solve one problem that Google currently doesn’t. Smart folks can easily distinguish between SEO crap and actually substantive, interesting writing. So here’s where we’re at now:

  • Shallow and interesting - supported by Google / social / aggregators & TSD
  • Shallow and uninteresting - supported by Google / social / aggregators, vetoed by TSD
  • Deep and interesting - supported by TSD, vetoed by Google / social / aggregators
  • Deep and uninteresting - supported by no one. Also, arguably, this isn’t a thing.

Distinguishing between shallow and deep is socially difficult, but it might not be *algorithmically* difficult.

Tiered social directory + algorithmic sorting

One could picture basic heuristics to sort deep pages from shallow ones. Deep pages will have a lot of text, shallow ones will have a lot of images. Deep pages will not be about certain topics (celebrity gossip, for example). Deep pages will generally link to other deep pages. And so on.
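
As a sketch, each of those heuristics is only a line or two of code. The input format, the weights, and the cutoffs below are all invented for illustration.

def depth_score(page):
    """Crude depth heuristics of exactly the kind listed above.
    `page` is assumed to look like:
        {"text": "...", "image_count": 3, "topics": ["programming"],
         "linked_depth_scores": [0.7, 0.4]}
    """
    SHALLOW_TOPICS = {"celebrity gossip", "memes", "listicles"}

    score = 0.0
    score += min(len(page["text"].split()) / 2000, 1.0)  # lots of text -> deeper
    score -= min(page["image_count"] / 20, 1.0)          # lots of images -> shallower
    if SHALLOW_TOPICS & set(page["topics"]):              # certain topics are a red flag
        score -= 1.0
    linked = page["linked_depth_scores"]                  # deep pages link to deep pages
    if linked:
        score += sum(linked) / len(linked)
    return score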

When I think about the webpages that I most often visit, I’m basically building this directory in my head each time: 1) did the person who sent me this link fall into the “trusted” or “untrusted” category, and 2) does the page itself look like most of the other trusted pages?

There are some positive signs that the current link aggregators know something is wrong and are trying to adjust. Facebook is currently attempting to make the newsfeed more “meaningful”, and Google is attempting to make its results more relevant to users, though I agree with Paul Graham that it hasn’t gone that well.

I really think this might be a place for a viable competitor. Let me know if you find something that is moving in this direction - I’d love to try it!