Algorithms for detection of unnatural links
Author: Mladen Adamovic

In their early days, search engines used links as a measure of a page's importance, and the PageRank algorithm produced good rankings. But things have changed. Search engines now most likely combine other signals as well, such as anchor text, user behaviour on the page and the TrustRank algorithm. Links still seem to play a major role, but search engines have good reasons to discount some of them. This article is about the link patterns search engines would likely discount, and the methods and algorithms for detecting them.

Under the PageRank algorithm, a link from a higher-ranked page with fewer outbound links was more valuable than a link from a lower-ranked page or from a page with more outbound links. Nowadays search engines seem to have good reasons to treat some types of links differently: discounted links, negative links (links which harm the ranking of the site), or links which could get the site banned from the search engines. A few years ago this applied only to heavily crosslinked sites.

The rest of this article describes the types of devalued links and the algorithms that could be used to detect them.
Sitewide links - these are links from every page of one site to another site. Obviously the webmaster wants to pass much of the site's reputation to the other website. Such links could be biased, and search engines most likely discount them. They would probably count all of them as just one link.
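
The "count it as just one link" idea can be sketched in a few lines. This is only an illustration, not how any search engine is known to work: the function names and the example URLs are made up, and the host name is assumed to be a good enough stand-in for "site".

```python
from urllib.parse import urlparse

def host(url):
    """Return the lowercase host name of a URL (used here as the 'site')."""
    return urlparse(url).netloc.lower()

def collapse_sitewide_links(links):
    """Keep one entry per (source site, target site), however many pages carry the link."""
    pairs = set()
    for source_url, target_url in links:
        src, dst = host(source_url), host(target_url)
        if src != dst:  # internal links are ignored
            pairs.add((src, dst))
    return pairs

# Hypothetical crawl data: a.example links to b.example from every page.
links = [
    ("http://a.example/page1", "http://b.example/"),
    ("http://a.example/page2", "http://b.example/"),
    ("http://a.example/page3", "http://b.example/"),
    ("http://c.example/post", "http://b.example/about"),
]
site_links = collapse_sitewide_links(links)
```

Here the three page-level links from a.example collapse into a single site-level link, so b.example ends up with two inbound site links instead of four.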

Footer links - one old and primitive way of passing PageRank from one domain to another is to put links in the footer of the page. Most likely footer links are there just to boost the page or site they point to. They are usually designed for search engines, not for users. Since users don't benefit from these links, search engines (SE) don't benefit from them either.

Links from irrelevant sites - if site A is about mobile phones and site B is about real estate, and the search engines' statistical analysers can determine both topics with confidence, then links from site A to site B will be discounted. If a site has a large number of off-topic links, it will be banned.
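
One simple way a statistical analyser could compare the topics of two sites is cosine similarity between word-count vectors. The sketch below is an assumption made for illustration only (real topic analysis is surely more elaborate), and the sample texts are invented.

```python
import math
from collections import Counter

def term_vector(text):
    """Very rough topic fingerprint: word counts of the site's text."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 means no shared words)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented sample texts standing in for whole sites.
phones = term_vector("mobile phones smartphone reviews")
gadgets = term_vector("smartphone reviews mobile accessories")
estate = term_vector("houses apartments property sale")

related = cosine_similarity(phones, gadgets)
unrelated = cosine_similarity(phones, estate)
```

A link between two sites whose similarity falls under some threshold (which would have to be chosen experimentally) could then be treated as off-topic.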

Reciprocal links (relevant or irrelevant) - webmasters exchange links to boost their rankings, so reciprocal links are most likely irrelevant and should be discounted. Finding reciprocal links is a graph theory problem, and well-known algorithms like DFS and BFS could be used for it, although there are reportedly newer algorithms which can find loops in very large graphs (millions of nodes) very quickly. Massive irrelevant reciprocal linking would likely harm your site's ranking.
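
For the simplest case, a direct exchange between two sites, no graph search is needed: it is enough to check whether the reverse edge exists. A minimal sketch, assuming the link graph is already reduced to (source site, target site) pairs; the site names are placeholders.

```python
def reciprocal_pairs(edges):
    """Return the pairs of sites that link to each other in both directions."""
    edge_set = set(edges)
    pairs = set()
    for src, dst in edge_set:
        if src != dst and (dst, src) in edge_set:
            pairs.add(tuple(sorted((src, dst))))  # store each pair once
    return pairs

# Placeholder site-level link graph.
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
exchanged = reciprocal_pairs(edges)
```

With a hash set the check is linear in the number of links, which matters for graphs with millions of nodes.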

Links from blog comments - blog comment spam is a major issue in discounting links. Search engines develop algorithms which use the design of a site to determine which software it runs. They probably find the start of the "comment" part of the article and discount all links in that part. Moreover, if most of a site's links come from blog comments, the site has engaged in blog comment spam and will most probably be banned or buried at the end of the SERPs (search engine results pages). It seems easy to determine the comment section of a blog algorithmically: it usually comes after "Comments >>><p/>" and some other patterns. The search engine just has to skip all links within the comment section.
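
A rough sketch of the "skip everything after the comment marker" idea follows. The marker list is hypothetical apart from "Comments >>>", and a regular expression is used instead of a real HTML parser only to keep the example short.

```python
import re

# Assumed markers for the start of a comment section; only the first
# one comes from the text above, the others are guesses.
COMMENT_MARKERS = ("Comments >>>", 'id="comments"', 'class="comments"')
HREF_RE = re.compile(r'<a\s[^>]*href="([^"]+)"', re.IGNORECASE)

def links_outside_comments(html):
    """Return link targets that appear before the first comment marker."""
    positions = [html.find(marker) for marker in COMMENT_MARKERS]
    positions = [p for p in positions if p != -1]
    cut = min(positions) if positions else len(html)
    return HREF_RE.findall(html[:cut])

page = (
    '<p>Post text with a <a href="http://good.example/">source</a>.</p>'
    'Comments >>><p/>'
    '<a href="http://spam.example/">cheap pills</a>'
)
counted = links_outside_comments(page)
```

Only the link inside the post body is returned; the one in the comment section is ignored.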

Directories - 10 years ago directories played an important role on the Internet. Yahoo, nowadays one of the major search engines, started as a directory. But most directories nowadays exist only to let webmasters get a link to their sites. Most of them are designed for search engines, not for users. Most of them sell links or trade them for a link back (link exchange). Still, some directories show some quality: Yahoo's, DMOZ and some specific niche directories. There is also a grey area between directories and scraper sites. Some scraper sites look like directories and vice versa. A directory can be detected easily: it is a site which mainly contains links and descriptions (a few sentences after each link). If the sentences after the links are taken from the linked sites themselves, the directory will be banned by the scraper detector anyway.

n-Way link exchange. In this type of exchange, webmaster A puts some links to webmaster B's sites, and webmaster B puts some links to OTHER sites of webmaster A. Such exchanges could be detected in the same way as reciprocal links, with one modification: all sites which belong to webmaster A and are on the same IP address could be collapsed into one node. A loop in the resulting link graph then reveals the n-way link exchange, and these links could be discounted.
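
The collapse-then-find-loops idea might look like the sketch below. It assumes the IP address of every site is already known (the `ip_of` mapping is a made-up input), and it uses a plain depth-first search rather than anything tuned for very large graphs.

```python
def collapse_by_ip(edges, ip_of):
    """Replace every site by its IP address and drop links inside one IP group."""
    collapsed = set()
    for src, dst in edges:
        a, b = ip_of[src], ip_of[dst]
        if a != b:
            collapsed.add((a, b))
    return collapsed

def reachable(graph, start, goal):
    """Iterative depth-first search: can goal be reached from start?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False

def edges_on_loops(edges):
    """Return the links that lie on some loop: (u, v) qualifies if u is reachable from v."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
    return {(src, dst) for src, dst in edges if reachable(graph, dst, src)}

# Webmaster A owns a1 and a2 (same IP); webmaster B owns b1.
ip_of = {"a1.example": "192.0.2.1", "a2.example": "192.0.2.1", "b1.example": "192.0.2.2"}
site_edges = [("a1.example", "b1.example"), ("b1.example", "a2.example")]

raw_loops = edges_on_loops(site_edges)
collapsed_loops = edges_on_loops(collapse_by_ip(site_edges, ip_of))
```

On the raw site graph no loop exists, so the exchange is invisible; after collapsing a1 and a2 into one node, both links fall on a loop and could be discounted.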

Link farms and other link manipulators. There are several algorithms for their detection. Let me mention some of them:
1) PR distribution test and Truncated PageRank algorithm - most sites in a link farm are poor-quality sites with low PR. Truncated PageRank is PageRank computed only from pages with higher PR. If the ratio between Truncated PageRank and PageRank is low, it suggests that the page used link manipulation.
2) An algorithm similar to the one for detecting n-way link exchange could be run. The nodes which have a lot of loops compared to their overall number of inbound links in the resulting graph will most likely be involved in a link farm.
3) Pattern recognition: support vector machines, neural networks and fuzzy logic could be used to analyse the link graph. These models would have to be trained on examples labelled by humans.
4) Link farms use some delimiters between links, for example: -, |, <p/>, <br/>, _. If most of the links to a site have prefixes or suffixes like these, the site is very likely in a link farm.
5) Some link farms have other patterns in their scripts or code which search engines could access.
6) BadRank (or DistrustRank) - the idea is that sites which are detected as bad propagate their "spamicity" to all sites which link to them.
7) Site Level Link Alliances algorithm - if page p has in-links from pages (i1,...,ik), the search engine could test whether those pages are highly connected to each other.
8) A site could be marked as bad if the domains of n of its out-links match the domains of n of its in-links.
9) A combination of the above tests could be used. If a site fails, for example, tests 1, 3 and 4, it is most likely involved in a link farm. Even if the algorithm successfully detects only 90% of the sites involved in a link farm (and most probably it would detect 100% of them), the rest of the involved sites will be marked as bad for linking to a "bad neighbourhood".
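
To make item 6 more concrete, here is a simplified sketch of backward propagation of "badness" from a set of known bad sites. It is one possible PageRank-like formulation, not a published specification: the damping value, the number of iterations and the way a site's score is split among the sites linking to it are all assumptions.

```python
def bad_rank(out_links, seeds, damping=0.85, iterations=20):
    """Propagate a badness score from seed sites to the sites that link to them."""
    sites = set(out_links) | {t for targets in out_links.values() for t in targets}
    # Number of in-links per site, used to split its badness among its linkers.
    in_count = {s: 0 for s in sites}
    for targets in out_links.values():
        for t in targets:
            in_count[t] += 1
    score = {s: (1.0 if s in seeds else 0.0) for s in sites}
    for _ in range(iterations):
        new_score = {}
        for s in sites:
            # A site inherits badness from every site it links to.
            inherited = sum(score[t] / in_count[t] for t in out_links.get(s, ()))
            base = 1.0 if s in seeds else 0.0
            new_score[s] = (1 - damping) * base + damping * inherited
        score = new_score
    return score

# Made-up graph: shop.example links to a known link farm, news.example does not.
out_links = {
    "farm.example": [],
    "shop.example": ["farm.example"],
    "news.example": ["blog.example"],
    "blog.example": [],
}
scores = bad_rank(out_links, seeds={"farm.example"})
```

In this toy graph shop.example picks up a positive score because it links to the seed, while news.example and blog.example stay at zero. Sites above some cut-off could then be treated as part of the "bad neighbourhood".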

Paid links. If the section of the page containing the links includes text like "sponsored links", "our sponsors", "our donators", "supporters", "advertisement" or similar, the links could easily be detected as paid links.
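
The keyword test is easy to sketch. The phrase list simply repeats the examples above, and the function name is made up; a real detector would need a longer list and would have to decide what counts as a "section" of the page.

```python
# Phrases taken from the examples above; the list is surely incomplete.
PAID_MARKERS = (
    "sponsored links",
    "our sponsors",
    "our donators",
    "supporters",
    "advertisement",
)

def looks_like_paid_section(section_text):
    """Return True if the text of a page section contains a paid-link phrase."""
    text = section_text.lower()
    return any(marker in text for marker in PAID_MARKERS)

flagged = looks_like_paid_section("Our Sponsors: Acme Widgets, Example Hosting")
not_flagged = looks_like_paid_section("Further reading on link analysis")
```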

If I forgot any type of link that search engines should treat as "fraud links", please let me know.

This article is available under the terms of the GNU General Public License, version 2.
You may copy this article to your site if you link back to the original article (keep the following):
Algorithms for detection of unnatural links
Author: Mladen Adamovic


