
Research on Automatic Acquisition of Domain Parallel Corpora from the Internet

Abstract: As an important resource for natural language processing applications such as cross-language information retrieval and machine translation, large-scale parallel corpora have long been a focus of research. With the rapid development of the Internet, people in different countries share more and more information online. Today, many websites are bilingual or even multilingual, and the pages on such sites largely form parallel monolingual page pairs (i.e., translations of each other); in other words, they contain a large amount of parallel text. Some researchers build parallel corpora by mining the resources on these kinds of websites.
This thesis uses a search engine, together with its search rules, to obtain a large number of parallel web pages from the Internet. The main work comprises the following three parts.
First, a set of seed sites is obtained through the search engine; these seed sites serve as candidate parallel websites. With additional search rules applied, a second round of searches is run over these seed sites and the returned results are collected, as sketched below.
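To make the second-round search concrete, the following Python sketch shows one way such queries might be assembled. The thesis does not name the engine or its exact rules, so the anchor words and the build_queries helper are illustrative assumptions only.

    # Hypothetical second-round query construction; the thesis does not
    # specify the actual engine or rule syntax, so everything here is a
    # sketch under that assumption.
    ANCHOR_WORDS = ["English", "英文版", "中文版"]  # cues for a bilingual site

    def build_queries(seed_domain):
        """Pair a seed site with anchor-text cues suggesting that the
        site offers a parallel Chinese/English version."""
        for word in ANCHOR_WORDS:
            yield 'site:{0} "{1}"'.format(seed_domain, word)

    for q in build_queries("example.com"):
        print(q)  # each query string is submitted to the search engine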
Second, the collected results are processed in two modes, a domain-name mode and a directory mode, which yields two batches of URLs for the English-language sites of the candidate parallel websites. Each URL then undergoes a language-pattern substitution: for example, where an English URL uses "en" to mark the English version, replacing that "en" with "tc", the marker for the Chinese version, generates a new URL.
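The language-pattern substitution can be shown as a short Python sketch. The "en"-to-"tc" pair follows the example in the text; the extra pair is an assumption, since real sites use many different markers.

    import re

    # "en" -> "tc" follows the example in the text; ("en", "cn") is an
    # additional illustrative assumption.
    LANG_PAIRS = [("en", "tc"), ("en", "cn")]

    def generate_candidates(url):
        """Yield candidate Chinese-page URLs by swapping the language
        marker embedded in an English-page URL (domain or path segment)."""
        for src, dst in LANG_PAIRS:
            # Match only whole segments such as "/en/" or "en." so that
            # words merely containing "en" are left untouched.
            pattern = re.compile(r"(?<=[./])%s(?=[./])" % re.escape(src))
            candidate = pattern.sub(dst, url)
            if candidate != url:
                yield candidate

    for c in generate_candidates("http://www.example.com/en/news/index.html"):
        print(c)  # .../tc/news/index.html and .../cn/news/index.html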
Third, each newly generated URL is checked to see whether the page actually exists, and all existing URLs undergo a final evaluation to decide whether the URL pair before and after substitution constitutes a pair of parallel pages. We also crawled the content of the Chinese and English pages of the candidate parallel sites and ran LDA topic clustering on it, obtaining 20 topics in total.
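Checking whether a generated URL exists comes down to one HTTP request per candidate. A minimal sketch using the requests library, treating a 200 response as "exists" (the thesis does not state its exact criterion):

    import requests

    def url_exists(url, timeout=5.0):
        """Return True if the candidate URL resolves to a live page.
        A HEAD request keeps the check cheap; servers that reject HEAD
        get a GET fallback. Counting only status 200 as "exists" is an
        assumption, not the thesis's stated criterion."""
        try:
            resp = requests.head(url, timeout=timeout, allow_redirects=True)
            if resp.status_code >= 400:
                resp = requests.get(url, timeout=timeout, stream=True)
            return resp.status_code == 200
        except requests.RequestException:
            return False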
In the experiments, a total of 153,683 distinct URLs of the Chinese and English sites of parallel websites were collected; after substitution, 88,915 of the corresponding URLs were judged to exist, and in the final evaluation the precision with which the before/after URL pairs formed true parallel page pairs reached 88.5%.
Finally, the work of this thesis is summarized and research goals for further improving the system are given.
Keywords: parallel web pages; URL patterns; web mining
Thesis Abstract in English
Title   Automatic Acquisition of Parallel Corpus Based on the Internet
Abstract
As an important resource for cross-language information retrieval, machine translation, and other natural language processing applications, large-scale parallel corpora have long been a focus for researchers. With the rapid development of the Internet, people in various countries have ever more chances to share information with each other online. Today, many websites on the Internet are bilingual or multilingual, and they mostly consist of parallel monolingual web pages. Clearly, these websites hold a large amount of parallel text. Some researchers mine these kinds of resources on the Web to build parallel corpora.
In this paper, we use a search engine with certain search rules to obtain parallel websites from the Internet. The main work includes the following three parts.
First, with the search engine we obtain a set of seed websites; these sites are candidates for parallel sites. Under the constraints of the search rules, we run a second batch of searches over the seed websites and collect the results.
In our view, these results are in fact the English-language sites of the parallel websites. In the second step, the URLs are processed in two ways, and the language marker of each URL is replaced to generate a new one.
Finally, each newly generated URL is checked to see whether it points to an existing website, and we make a final evaluation of all the existing URLs to determine whether the original website and the generated one are actually a pair of parallel pages. We collected the content of the English web pages and ran topic clustering with an LDA model, obtaining a total of 20 topics.
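For the LDA topic clustering step, a minimal sketch with the gensim library is given below. The tokenized sample documents are placeholders; only the 20-topic setting comes from the text.

    from gensim import corpora, models

    # Placeholder tokenized documents; in the thesis these would be the
    # crawled contents of the candidate parallel pages.
    texts = [["parallel", "corpus", "web", "mining"],
             ["machine", "translation", "alignment", "corpus"]]

    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(t) for t in texts]

    # 20 topics, matching the number reported in the text.
    lda = models.LdaModel(bow_corpus, num_topics=20,
                          id2word=dictionary, passes=10)

    for topic_id, topic in lda.print_topics(num_topics=5, num_words=8):
        print(topic_id, topic)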