Research on a Focused Search Engine + Literature Review

Posted: 2017-08-12 15:09  Source: 毕业论文
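Abstract
Online commodity trading is increasingly common. For this application, this thesis designs a search engine focused on commodities, referred to here as a focused search engine, so that users can quickly find the goods they need and buy them promptly. The work therefore centers on implementing such an engine.
General-purpose search engines are judged on four criteria, "complete, accurate, fresh, fast": the coverage of crawled pages, the accuracy of returned results, how quickly pages are refreshed, and how quickly information is crawled. The focused search engine is optimized along the same four criteria but differs from a general-purpose engine in several respects. First, coverage: the goal is to collect more pages, but only pages related to commodities, so the seed sites differ from those of a general-purpose engine. Page extraction likewise keeps only commodity-related information, so the measure of interest here is the coverage of crawled commodity information. Second, accuracy: giving users accurate commodity information depends on high coverage of crawled commodity information and on accurate keyword search. This thesis uses the open-source Lucene framework to index and retrieve the information, meeting the need for fast and precise search. Third, freshness: the crawl frequency of a page is adjusted dynamically according to whether its content has changed since it was last crawled. Fourth, speed: the crawler builds on the Heritrix framework to implement multithreaded crawling, and the URL hashing algorithm is optimized so that multithreaded crawling is efficient while remaining valid. Access to the database and the index store is also made considerably faster. Finally, the results are presented to the user in an attractive, novel page design.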
Keywords  search engine  focused web crawler  Lucene  Heritrix  performance optimization
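
The third point of the abstract, adjusting crawl frequency according to whether a page's content changed, can be illustrated with a minimal sketch. The abstract does not give the actual rule, so everything below is an assumption made for illustration: the class name `RecrawlState`, the halve-on-change and double-on-no-change policy, the interval bounds, and the use of a SHA-256 fingerprint to detect changes. None of it is taken from the thesis or from Heritrix.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical per-page revisit state; not the thesis's or Heritrix's implementation. */
final class RecrawlState {
    static final long MIN_INTERVAL_MINUTES = 30;            // assumed lower bound
    static final long MAX_INTERVAL_MINUTES = 7 * 24 * 60;   // assumed upper bound: one week

    private byte[] lastDigest;                 // fingerprint of the previously fetched content
    private long intervalMinutes = 24 * 60;    // assumed starting interval: one day

    /** Records a fetch and returns the number of minutes to wait before the next one. */
    long recordFetch(String content) {
        byte[] digest = sha256(content);
        if (lastDigest != null) {
            boolean changed = !Arrays.equals(digest, lastDigest);
            // Changed pages are revisited sooner, stable pages less often.
            intervalMinutes = changed
                    ? Math.max(MIN_INTERVAL_MINUTES, intervalMinutes / 2)
                    : Math.min(MAX_INTERVAL_MINUTES, intervalMinutes * 2);
        }
        lastDigest = digest;
        return intervalMinutes;
    }

    private static byte[] sha256(String text) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    public static void main(String[] args) {
        RecrawlState page = new RecrawlState();
        System.out.println(page.recordFetch("price: 100")); // first fetch keeps the start value: 1440
        System.out.println(page.recordFetch("price: 100")); // unchanged, interval doubles: 2880
        System.out.println(page.recordFetch("price: 90"));  // changed, interval halves: 1440
    }
}
```

Hashing the whole page treats any byte-level difference as a change. A crawler for commodity pages would probably fingerprint only the extracted commodity fields, so that rotating advertisements or timestamps do not shorten the interval.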
Graduation Project Report (Thesis): Foreign-Language Abstract
Title    Focused Search Engine
Abstract
Nowadays, commodity trading on the Internet is becoming more and more popular. For this application we design a search engine focused on commodity information, namely a focused search engine, so that users can find the commodities they need in a short time and purchase them promptly. This thesis therefore addresses the implementation of such a focused search engine.
There are four indicators for evaluating a general-purpose search engine: complete, accurate, fresh, and fast. These correspond to the coverage of the crawled pages, the accuracy of the returned results, the page update rate, and the crawling speed. A focused search engine is optimized along the same indicators, but it also differs from a general-purpose search engine. First, it must collect more pages, but only commodity-related ones, so its seed sites differ from those of a general-purpose engine. In addition, page extraction keeps only commodity-related information, so this thesis focuses on the coverage of crawled commodity information. Second, giving users accurate commodity information depends on high coverage of crawled commodity information and on accurate keyword search. This thesis uses the Lucene framework for indexing and retrieval to meet the need for fast and precise search. Third, to keep commodity pages up to date, the thesis adjusts the crawl frequency dynamically according to whether a page's content has changed since it was last crawled. Fourth, the thesis uses the Heritrix crawling framework to implement multithreaded crawling and optimizes the URL hash algorithm so that multithreaded crawling is efficient while remaining valid. Access to the database and the index store is also made considerably faster. Finally, the results are presented to the user in an attractive, novel page design.
Keywords  search engine  focused web crawler  Lucene  Heritrix  performance optimization
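
The fourth point mentions an optimized URL hash algorithm for multithreaded crawling but gives no details. As a rough illustration only, one common way to use hashing in a multithreaded crawler is to map each URL to a worker by its host name, so that all pages of one site go to the same thread and the load spreads across threads. The class `UrlPartitioner`, its method names, and the bit-mixing step are hypothetical and are not the thesis's algorithm or a Heritrix API.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper; the thesis does not describe its actual hash function. */
final class UrlPartitioner {
    private final int workerCount;

    UrlPartitioner(int workerCount) {
        if (workerCount <= 0) {
            throw new IllegalArgumentException("workerCount must be positive");
        }
        this.workerCount = workerCount;
    }

    /** Maps a URL to a worker index in [0, workerCount) based on its host name. */
    int workerFor(String url) {
        // URI.create throws IllegalArgumentException for malformed URLs.
        String host = URI.create(url).getHost();
        if (host == null) {
            host = url; // no host component: fall back to hashing the whole string
        }
        int h = host.toLowerCase(Locale.ROOT).hashCode();
        h ^= (h >>> 16); // fold the high bits into the low bits before reducing
        return Math.floorMod(h, workerCount); // floorMod keeps the index non-negative
    }

    public static void main(String[] args) {
        UrlPartitioner partitioner = new UrlPartitioner(4);
        // Both URLs share a host, so they are assigned to the same worker.
        System.out.println(partitioner.workerFor("http://shop.example.com/item/1"));
        System.out.println(partitioner.workerFor("http://shop.example.com/item/2?color=red"));
    }
}
```

Keying on the host rather than the full URL keeps per-site request ordering and politeness delays inside a single thread. The cost is that a few very large sites can leave the workers unevenly loaded.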
1  Introduction
1.1  Background and significance