Webmasters have long used JavaScript to give web pages dynamic behavior, for reasons such as making pages more responsive, reducing site traffic, hiding links, or embedding ads. Because early search engines lacked the processing power to handle it, indexing such pages was often problematic: valuable pages could fail to be indexed, and cheating pages could slip through.
The purpose of introducing JavaScript parsing is to solve these two problems, so that the search engine can understand more clearly what users actually see when they open the page. For example, some websites separate user reviews, ratings and similar information from the page HTML and display it dynamically with JavaScript or AJAX when the page is opened. The content seen by an early search engine is then incomplete, which in turn affects its judgment of the page's index value.
Introducing JavaScript parsing requires considering the design and implementation of the parser itself, its parsing speed, and its impact on the rest of the system. This article uses some typical cases to analyze how to design and implement a JavaScript parsing system for web pages, and briefly describes the role and impact of such a system on other parts of the search engine.
First, discover page links
Links usually take the form of an A tag in HTML, with the link URL given in the href attribute, but some websites choose a more "dynamic" approach. One is to write or adjust the A tag dynamically; the other is to attach an event that changes the default link-opening behavior when the user clicks.
1. Dynamically write or adjust link tags
In the abstract, achieving this effect (and the other effects described below) is much like putting an elephant in a refrigerator, and takes three steps: find the target to write or modify (find the elephant), prepare the content to write or modify (open the refrigerator door), and perform the write or modification (put it in).
Mapped to JavaScript, these three steps invoke three groups of standard browser functions: page element location, data preparation, and page modification. The job of JavaScript parsing is therefore to provide the same functions, so that the corresponding content and behavior are discovered naturally as the webmaster's JavaScript code calls them.
At this point the functions that need to be implemented are basically determined. They are relatively simple and include:
document.getElementById // locate
document.getElementsByTagName // locate
document.getElementsByClassName // locate
node.[firstChild/nextSibling/previousSibling/parentNode] // locate
document.[createElement/createTextNode] // create the link
node.[appendChild/insertBefore/innerHTML=?] // write content
element.getAttribute, element.setAttribute // get or set an attribute
element.href = ? // set a property
As for the content to be written, it may be stored in the JavaScript itself, for example in arrays, or it may be loaded dynamically using AJAX. The former is a built-in feature of the JavaScript language and needs no further discussion here; the latter is a separate topic that is discussed later.
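The three steps can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: StubNode, makeDocument and collectLinks are hypothetical names, the stub covers only a handful of the functions listed above, and new Function stands in for a real embedded engine (it is not a sandbox).

```javascript
// Hypothetical minimal DOM stub: just enough for a page script to locate a
// node, create an <a> element and attach it, so the link can be observed.
class StubNode {
  constructor(tagName) {
    this.tagName = tagName.toUpperCase();
    this.attributes = {};
    this.childNodes = [];
    this.parentNode = null;
  }
  setAttribute(name, value) { this.attributes[name] = String(value); }
  getAttribute(name) {
    return name in this.attributes ? this.attributes[name] : null;
  }
  // element.href = ? is treated as shorthand for setAttribute('href', ?).
  set href(value) { this.setAttribute('href', value); }
  get href() { return this.getAttribute('href'); }
  appendChild(child) {
    child.parentNode = this;
    this.childNodes.push(child);
    return child;
  }
}

// Builds a document whose body already contains one <div> per given id.
function makeDocument(ids) {
  const body = new StubNode('body');
  const byId = new Map();
  for (const id of ids) {
    const div = body.appendChild(new StubNode('div'));
    div.setAttribute('id', id);
    byId.set(id, div);
  }
  return {
    body,
    getElementById: (id) => byId.get(id) || null,
    createElement: (tag) => new StubNode(tag),
    createTextNode: (text) =>
      Object.assign(new StubNode('#text'), { data: String(text) }),
  };
}

// Walks the stub tree and returns every href found on an A node.
function collectLinks(node, found = []) {
  if (node.tagName === 'A' && node.href !== null) found.push(node.href);
  for (const child of node.childNodes) collectLinks(child, found);
  return found;
}

// A page script that keeps its URLs in an array and writes the links itself.
const pageScript = `
  var urls = ['/a.html', '/b.html'];
  var nav = document.getElementById('nav');          // find the target
  for (var i = 0; i < urls.length; i++) {
    var a = document.createElement('a');             // prepare the content
    a.href = urls[i];
    a.appendChild(document.createTextNode('page ' + i));
    nav.appendChild(a);                              // perform the write
  }
`;

const doc = makeDocument(['nav']);
new Function('document', pageScript)(doc); // stand-in for an embedded engine
const links = collectLinks(doc.body);
console.log(links);
```

The point of the design is that link discovery needs no knowledge of how the script is written: the links show up as a side effect of the script calling the stubbed functions.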
2. Trigger an event on click to change the default link-opening behavior
Pages do this for a variety of reasons: to hide links, to implement pop-ups, to compute the URL programmatically, to check whether the link should be opened at all, and so on. All of these reasons correspond to the same implementation: adding a click event.
There are three ways to add a click event:
Set the href attribute of the A tag to the form href="javascript:func(...)"
Set the onclick attribute of the A tag to the form onclick="js_code"
Call an event-binding function, such as my_link_node.addEventListener('click', func, false)
Supporting these three methods is relatively simple in itself. What needs attention is how such click events are triggered and how the destination URL is intercepted after they fire.
To trigger events, all possible click events are first collected and then triggered in turn. Before each click is actually triggered, however, the parser must check whether it still exists, because an earlier click handler may well have removed it.
To intercept the URL, the relevant page navigation functions must first be implemented: location.href = ?, window.open, and so on. Then, by setting a series of flags, each click is associated with the navigation it causes, which yields the target URL.
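A rough sketch of the collect, check, trigger and intercept sequence follows. The helper names (makeEnv, makeLink, triggerAll), the attached field and the single activeClick flag are assumptions made for illustration, and string handlers are run with new Function rather than a real engine.

```javascript
// Hypothetical environment: location.href and window.open are stubbed so
// that every navigation is recorded together with the click that caused it.
function makeEnv() {
  const env = { activeClick: null, targets: [] };
  const record = (url) =>
    env.targets.push({ click: env.activeClick, url: String(url) });
  env.location = {
    get href() { return 'http://example.test/'; },
    set href(url) { record(url); },
  };
  env.window = { location: env.location, open: (url) => { record(url); return null; } };
  return env;
}

// Hypothetical link node supporting the three ways of adding a click.
function makeLink(id, { href = '', onclick = null } = {}) {
  return {
    id, href, onclick, attached: true, listeners: [],
    addEventListener(type, fn) { if (type === 'click') this.listeners.push(fn); },
  };
}

// Step 1: collect every possible click before triggering anything.
function collectClicks(links) {
  const clicks = [];
  for (const link of links) {
    if (link.href.startsWith('javascript:')) {
      clicks.push({ link, code: link.href.slice('javascript:'.length) });
    }
    if (link.onclick) clicks.push({ link, code: link.onclick });
    for (const fn of link.listeners) clicks.push({ link, fn });
  }
  return clicks;
}

// Step 2: trigger in turn, re-checking that each click still exists.
function triggerAll(links, env, page) {
  for (const click of collectClicks(links)) {
    if (!click.link.attached) continue; // removed by an earlier click
    env.activeClick = click.link.id;    // flag tying the click to navigations
    try {
      if (click.fn) click.fn.call(click.link);
      else new Function('window', 'location', 'page', click.code)
        .call(click.link, env.window, env.location, page);
    } catch (err) {
      // A failing handler must not stop the remaining clicks.
    } finally {
      env.activeClick = null;
    }
  }
  return env.targets;
}

const env = makeEnv();
const a = makeLink('a', { href: "javascript:location.href = '/one.html'" });
const b = makeLink('b', { onclick: "window.open('/two.html'); page.c.attached = false;" });
const c = makeLink('c');
c.addEventListener('click', () => { env.location.href = '/three.html'; });

// b's handler removes c, so c's click is skipped when its turn comes.
const targets = triggerAll([a, b, c], env, { c });
console.log(targets);
```

In this run only the clicks on a and b produce target URLs; the click on c no longer exists by the time it would be triggered.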
Second, dynamic page content
Dynamic page content is a technique for improving page load speed and making a website more flexible. Content that changes (such as comments and ratings) can be split off, so that the page is divided into a static part and a dynamic part: static content can use caching and similar methods to speed up page display and reduce website traffic, while dynamic content has the advantage of being simple in format and easy to generate, which also saves traffic.
On the other hand, dynamic content is also an important method of loading ads and of content cheating. The most common form is writing an iframe, which was very well hidden from early search engines.
On a technical level, the work required to support dynamic page content is largely the same as in the earlier section on dynamically writing or adjusting link tags. What needs to be added here is the classic document.write approach.
This method was one of the earliest JavaScript features: it writes a piece of HTML code directly into the page, and it is still widely used today. Early search engines did support it, but essentially only through character matching, so they could handle only the most direct case of writing a single JavaScript string and were powerless against even slightly complex text concatenation. For JavaScript parsing, however, such code ultimately conforms to the language specification, so it can be fully supported, including text concatenation, conditional logic, and obfuscated code.
Another point to discuss here is nested document.write, where document.write writes a SCRIPT tag whose content is itself another document.write. This pattern is common on redirect-cheating pages. Supporting it requires not only JavaScript parsing but also an HTML parser that can process nested HTML writes, which is not analyzed here.
With the methods above, both the main information of the page and its ads or other auxiliary information are exposed, which makes the webmaster's intention easier to understand.
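The difference between character matching and actually executing document.write can be seen in a small sketch. It is deliberately naive: the hypothetical renderWrites function finds inline scripts with a regular expression instead of a real HTML parser, runs them with new Function, and feeds whatever they write back through the same routine so that nested writes are expanded too. The depth limit of 10 is an arbitrary assumption.

```javascript
// Replaces each inline <script> with the HTML it writes, recursively.
// Assumption: a regular expression is good enough to find scripts in this
// toy input; a real system would hand the written HTML to its HTML parser.
function renderWrites(html, depth = 0) {
  const scriptRe = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  if (depth > 10) return html.replace(scriptRe, ''); // guard against runaway nesting
  return html.replace(scriptRe, (match, code) => {
    let written = '';
    const document = { write: (...parts) => { written += parts.join(''); } };
    try {
      new Function('document', code)(document); // stand-in for an embedded engine
    } catch (err) {
      // A broken script simply contributes nothing.
    }
    return renderWrites(written, depth + 1); // the output may contain scripts
  });
}

// The outer script concatenates its strings and writes a second script,
// which in turn writes the iframe. Character matching would not find it.
const page = `<p>hi</p><script>
var u = "http://ads.example" + "/frame";
document.write("<scr" + "ipt>document.write('<iframe src=" + u + "></iframe>')</scr" + "ipt>");
</script>`;

const rendered = renderWrites(page);
console.log(rendered); // <p>hi</p><iframe src=http://ads.example/frame></iframe>
```

Because the script is executed rather than pattern-matched, the concatenated URL and the second layer of document.write are resolved without any special handling.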
Third, page redirects
In some cases a redirect is necessary to achieve the page's effect, but redirects are also used for cheating. Technically, they appear in the following two ways:
Call the page redirect function directly
Call the page redirect function conditionally, depending on the search engine's UA, the referer, etc.
To recognize these, the core task is to implement the page redirect function: the location object. Since this is technically the only way to redirect from JavaScript, it will eventually be called no matter how the page's JavaScript is written or obfuscated. The redirect code on different pages therefore looks varied, but identifying it is simple.
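As a sketch of this idea, the location object can be stubbed so that every redirect is recorded, and the same script can be run under two identities to expose redirects that depend on the UA or referer. The function names, the user-agent strings and the example URLs are all hypothetical, and a bare `location = url` assignment is not caught by this simplified stub.

```javascript
// Runs a page script against a stubbed location object and returns every
// redirect target it attempted. All names here are illustrative assumptions.
function runWithLocation(code, { userAgent, referrer }) {
  const jumps = [];
  const location = {
    get href() { return 'http://example.test/page'; },
    set href(url) { jumps.push(String(url)); },
    assign(url) { jumps.push(String(url)); },
    replace(url) { jumps.push(String(url)); },
  };
  const navigator = { userAgent };
  const document = { referrer, location };
  const window = {
    navigator,
    document,
    get location() { return location; },
    set location(url) { jumps.push(String(url)); }, // window.location = url
  };
  try {
    new Function('window', 'document', 'navigator', 'location', code)(
      window, document, navigator, location);
  } catch (err) {
    // Scripts that fail are treated as producing no redirect.
  }
  return jumps;
}

// Runs the script once as a search visitor and once as a crawler.
function detectRedirect(code) {
  const asUser = runWithLocation(code, {
    userAgent: 'Mozilla/5.0 (hypothetical browser)',
    referrer: 'http://search.example/?q=x',
  });
  const asCrawler = runWithLocation(code, {
    userAgent: 'ExampleBot/1.0',
    referrer: '',
  });
  return { asUser, asCrawler, conditional: asUser.join('\n') !== asCrawler.join('\n') };
}

// Obfuscated redirect: neither "location" nor "href" appears literally.
const redirectScript = `
  var k = 'loc' + 'ation', p = ['hr', 'ef'].join('');
  if (!/bot/i.test(navigator.userAgent) &&
      document.referrer.indexOf('search.') !== -1) {
    window[k][p] = 'http://landing.example/';
  }
`;

const result = detectRedirect(redirectScript);
console.log(result);
```

However the script spells the property names, the assignment ends up at the stubbed setter, and a difference between the two runs marks the redirect as conditional.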
Fourth, about AJAX
AJAX is a very common web technology. In essence, while a web page is being displayed, a piece of data (which may be HTML or something else) is fetched dynamically from the network and displayed after processing.
The fundamental work for this technique is not implementing the XMLHttpRequest object but dealing with its impact on the search engine's crawler architecture. A crawler is typically designed to fetch a page, extract its links, and then crawl those in turn; its work focuses mainly on scheduling and on controlling crawl pressure. The crawler itself is relatively simple and usually cannot execute JavaScript and fetch AJAX data immediately after crawling a page, so it needs a technical upgrade to support AJAX.
The analysis of crawlers is beyond the scope of this article; interested readers can consult other relevant literature.
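One possible way to picture the interaction between the parser and the crawler is a two-pass scheme, sketched below under loud assumptions: the runWithAjax function is hypothetical, the XMLHttpRequest stub supports only open, send, onload and onreadystatechange, callbacks fire synchronously, and the responses map stands in for data that the crawler has fetched out of band. It is not a description of any particular search engine's design.

```javascript
// Pass 1 records which URLs the page asks for; pass 2 replays the script
// with the responses the crawler fetched in the meantime.
function runWithAjax(code, responses = new Map()) {
  const pending = []; // URLs the crawler still has to fetch
  const html = [];    // stand-in for content the page displays

  class XMLHttpRequest {
    open(method, url) { this.method = method; this.url = url; }
    send() {
      if (!responses.has(this.url)) { pending.push(this.url); return; }
      this.readyState = 4;
      this.status = 200;
      this.responseText = responses.get(this.url);
      // Simplification: callbacks fire synchronously inside send().
      if (typeof this.onreadystatechange === 'function') this.onreadystatechange();
      if (typeof this.onload === 'function') this.onload();
    }
  }

  const document = { write: (s) => { html.push(String(s)); } };
  new Function('XMLHttpRequest', 'document', code)(XMLHttpRequest, document);
  return { pending, html: html.join('') };
}

// A page that loads its comments separately from the page HTML.
const ajaxScript = `
  var xhr = new XMLHttpRequest();
  xhr.open('GET', '/comments.json');
  xhr.onload = function () {
    var items = JSON.parse(xhr.responseText);
    for (var i = 0; i < items.length; i++) {
      document.write('<li>' + items[i] + '</li>');
    }
  };
  xhr.send();
`;

// Pass 1: nothing is available yet, so the request is only recorded.
const firstPass = runWithAjax(ajaxScript);

// Pass 2: the crawler has fetched the data (hard-coded here for illustration).
const fetched = new Map([['/comments.json', '["great","ok"]']]);
const secondPass = runWithAjax(ajaxScript, fetched);

console.log(firstPass.pending, secondPass.html);
```

The sketch keeps the crawler's scheduling role intact: the parser never touches the network itself and only reports what it would need.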
Through the case analysis above, we have summarized the basic work needed to implement JavaScript parsing; adding some basic infrastructure on top of it yields a relatively complete system. Rearranged, the work falls into three parts:
1. Embed a JavaScript language engine in the HTML parser. Mature open-source engines such as V8 and SpiderMonkey can be used.
2. Implement the required functions, following the relevant W3C HTML and DOM specifications.
3. As a direct corollary, the .js files must also be crawled, since they are the source code that JavaScript parsing actually needs to parse.
The functions introduced in this article are only some of the more common JavaScript functions. For a search engine to really see the page as it actually appears, the other required functions must also be implemented, together with support for HTML, CSS, images and so on.
Finally, for webmasters who wish to use JavaScript, this article gives the following recommendations:
1. Don't use overly complex JavaScript techniques, which are not friendly to search engines.
2. Do not block .js files from being crawled, otherwise JavaScript parsing will be limited.
3. Divide the static and dynamic parts of the site reasonably.