Applications of Web Data Mining: English Literature with Chinese Translation (2)

The literature shows that no concrete work has been done on Flash Web pages. Here we concentrate on XML Web page classification as an avenue for future research.
3. XML URL Classification based on their semantic orientation
The architecture of the proposed system, shown in Fig. 2, outlines the steps we followed to carry out the classification process. Each individual process is carried out on XML web pages, and each step is discussed in the following sections.
Fig. 2 Architecture of the Proposed System
3.1 Knowledge base
The knowledge base holds the domain knowledge that is used to guide the search or to evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, which organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources). Here, the Knowledge Base is created in four steps as follows.
(1) Redundancy check on the XML URL dataset
(2) Source code extraction
(3) Tag extraction using DOM structure
(4) Knowledge Base creation by tag redundancy analysis
3.1.1 XML URL Redundancy Analysis
In our proposed classification method, redundancy analysis is the very first step of the Knowledge Base creation task. Once the XML URL datasets of the various types (Pure XML, Code based XML, HTML embedded XML, and RSS XML) have been created, the URLs of each type are processed individually in this phase.
In the first step, the algorithm reads the source files (Pure XML, Code based XML, HTML embedded XML, and RSS XML) line by line and fetches each URL. Each fetched URL is tested against the destination file for redundancy using sequential search. If the fetched URL is not present in the destination file, it is appended; otherwise it is skipped. This process continues until the last URL in the source file has been processed. Finally, the unique XML URLs of each category are obtained.
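The redundancy check described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `append_unique_urls`, the use of in-memory lists in place of the source and destination files, and the example URLs are all assumptions made for the sketch.

```python
def append_unique_urls(source_lines, destination):
    """Append each URL from source_lines to destination unless already present."""
    for line in source_lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines in the source file
        # Sequential search over the destination, as the text describes.
        if url not in destination:
            destination.append(url)
    return destination


# Hypothetical contents of one source file (e.g. the Pure XML category).
pure_xml_source = [
    "http://example.com/a.xml\n",
    "http://example.com/b.xml\n",
    "http://example.com/a.xml\n",
]
unique_urls = append_unique_urls(pure_xml_source, [])
print(unique_urls)
```

The same function would be run once per category, each time with its own destination list.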
3.1.2 Source Code Extraction
The resultant vector of the first step is given as input to the second step of the algorithm, which extracts the source code of the respective unique URLs. Here, the algorithm reads the URLs from the input file and retrieves the source code of each one over the Transmission Control Protocol (TCP). The extracted source code is saved in an automatically created destination file named after the URL number.
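A possible shape for this step is sketched below. The text only states that the source is retrieved over TCP, so the use of HTTP through Python's `urllib.request` is an assumption, as are the names `fetch_source` and `save_sources` and the `<number>.txt` file naming. The demo at the end passes a stand-in fetcher so that it runs without network access.

```python
import os
import tempfile
import urllib.request


def fetch_source(url, timeout=10):
    """Download the raw source of one URL (HTTP, which runs over TCP)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def save_sources(urls, out_dir, fetch=fetch_source):
    """Save each URL's source to out_dir/<number>.txt and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for number, url in enumerate(urls, start=1):
        path = os.path.join(out_dir, f"{number}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(fetch(url))
        paths.append(path)
    return paths


# Offline demo: a stand-in fetcher replaces the real download.
demo_dir = tempfile.mkdtemp()
demo_paths = save_sources(
    ["http://example.com/a.xml", "http://example.com/b.xml"],
    demo_dir,
    fetch=lambda url: "<root/>",
)
print(demo_paths)
```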
3.1.3 Tag extraction using DOM Structure
After extracting the source code of the XML URLs, in the third step we extract the tags using the Document Object Model (DOM) tree structure. The extracted source code is read line by line, and the algorithm looks for tags using the DOM. The tags found are then extracted and stored in a correspondingly named file.
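As a rough illustration of DOM-based tag extraction, the sketch below parses a source string with Python's standard `xml.dom.minidom` and collects every element name in document order. It assumes the source is well-formed XML, which may not hold for HTML embedded XML pages; the function name `extract_tags` and the sample document are hypothetical.

```python
from xml.dom import minidom


def extract_tags(xml_source):
    """Return the tag name of every element, in document order."""
    document = minidom.parseString(xml_source)
    # "*" matches all elements in the DOM tree, including the root.
    return [element.tagName for element in document.getElementsByTagName("*")]


# Hypothetical RSS-like source used only for illustration.
sample_source = (
    "<rss><channel><title>News</title>"
    "<item><title>A</title></item></channel></rss>"
)
sample_tags = extract_tags(sample_source)
print(sample_tags)
```

Repeated tags are kept at this stage; duplicates are removed later when the knowledge base is built.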
3.1.4 Unique Tag Identification
The resultant vector of Step 3 is processed here to identify the unique tags and create the knowledge base. In this phase, we read each tag file and compare its tags with those in the destination file. A tag is appended if it does not already exist in the destination file; otherwise it is skipped and the next tag of the source file is considered. This process is carried out for all tag files, with every comparison made against the destination file.
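One way to merge the per-URL tag files into a single knowledge base is sketched below. The tag files are represented as in-memory lists, and a set is used alongside the ordered list to speed up the membership test; the text describes a direct comparison against the destination file, so this is a substitution made for the sketch. The name `build_knowledge_base` and the sample tags are assumptions.

```python
def build_knowledge_base(tag_files):
    """Merge several tag lists into one ordered list of unique tags."""
    knowledge_base = []  # plays the role of the destination file
    seen = set()         # fast lookup of tags already appended
    for tags in tag_files:
        for tag in tags:
            if tag not in seen:
                seen.add(tag)
                knowledge_base.append(tag)
    return knowledge_base


# Hypothetical tag files extracted from two RSS XML URLs.
rss_tag_files = [
    ["rss", "channel", "title", "item", "title"],
    ["rss", "channel", "item", "link"],
]
kb_rss = build_knowledge_base(rss_tag_files)
print(kb_rss)
```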

Fig. 3 Block diagram of XML URL Classification
All four steps are carried out on each type of XML URL in turn to create the tag dictionary (Knowledge Base). After the knowledge base has been created for each category of XML URLs, matching and representation are performed using the testing dataset. For each testing XML URL, the source code and its tags are extracted.
Here, the extracted tags are matched against the Knowledge Bases to identify their respective class. Matching is done against all four Knowledge Bases: KB RSS, KB Pure, KB HTML, and KB Code. Using string matching, the overall matching level is calculated as the number of tags matched divided by the number of tags in the source file. The Knowledge Base with the highest matching percentage is taken as the class.
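The matching step can be illustrated as follows. This is a sketch under stated assumptions: the matching level is computed over all tag occurrences in the test file (the text does not say whether repeated tags count once or several times), the knowledge base contents are invented for the example, and the names `match_level` and `classify` are hypothetical.

```python
def match_level(tags, knowledge_base):
    """Fraction of the file's tags that appear in the knowledge base."""
    if not tags:
        return 0.0
    matched = sum(1 for tag in tags if tag in knowledge_base)
    return matched / len(tags)


def classify(tags, knowledge_bases):
    """Return the best-matching class name and the score for every class."""
    scores = {name: match_level(tags, kb) for name, kb in knowledge_bases.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Invented knowledge base contents, for illustration only.
knowledge_bases = {
    "KB RSS": {"rss", "channel", "item", "title", "link"},
    "KB Pure": {"catalog", "book", "title", "author"},
    "KB HTML": {"html", "body", "div", "xml"},
    "KB Code": {"configuration", "property", "name", "value"},
}
test_tags = ["rss", "channel", "title", "item", "title"]
label, scores = classify(test_tags, knowledge_bases)
print(label, scores)
```

In this example all five tags occur in KB RSS and only the two `title` tags occur in KB Pure, so the test file is assigned to the RSS class.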