
jinchangge's blog

Fun College English

 
 

CorpusCreator  

2015-11-07 17:43:58 | Category: Corpora


CorpusCreator is a free, user-friendly tool for creating small- to medium-size corpora using the Web as a source of texts. Despite some obvious drawbacks, such as the lack of documentation and control, there is no doubt that the Internet has become the most important source of language data in many areas of linguistic research. In particular, it has been suggested to be the only way to build so-called "disposable" corpora: text collections used for specific (and often limited) purposes, such as translation tasks, LSP learning, and terminology work.

Even if it is theoretically possible to build small specialized corpora by hand (querying the Web with a commercial search engine, downloading all relevant documents, and converting them into plain text), the task is very time-consuming. The simple interface of CorpusCreator looks like a normal search engine and lets you perform a Web search using keywords that represent your domain of interest, download the documents you have found (PDF or HTML), and convert each document into plain text (UTF-8). The texts can be converted with no mark-up at all, with simple mark-up, or with a more complex XML mark-up (TEI) that records the source, time, etc.
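
To make the conversion step more concrete, here is a minimal Python sketch of what turning one downloaded HTML page into plain text, with the three mark-up levels mentioned above, might look like. It is not CorpusCreator's actual code: the names `html_to_text` and `render`, the `<doc>` element used for the "simple" mark-up, and the TEI skeleton are all assumptions made for illustration, and the TEI output is simplified and not validated against the TEI schema.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def render(text, source, retrieved, markup="none"):
    """Wrap plain text in no, simple, or TEI-like mark-up (hypothetical layout)."""
    if markup == "none":
        return text
    if markup == "simple":
        # Hypothetical minimal wrapper recording where and when the text was found.
        return (
            f"<doc source={quoteattr(source)} date={quoteattr(retrieved)}>\n"
            f"{escape(text)}\n</doc>"
        )
    if markup == "tei":
        paragraphs = "\n".join(f"<p>{escape(line)}</p>" for line in text.splitlines())
        return (
            '<TEI xmlns="http://www.tei-c.org/ns/1.0">\n'
            "<teiHeader><fileDesc>\n"
            f"<titleStmt><title>{escape(source)}</title></titleStmt>\n"
            f"<publicationStmt><p>Retrieved {escape(retrieved)}</p></publicationStmt>\n"
            f"<sourceDesc><p>{escape(source)}</p></sourceDesc>\n"
            "</fileDesc></teiHeader>\n"
            f"<text><body>\n{paragraphs}\n</body></text>\n"
            "</TEI>"
        )
    raise ValueError(f"unknown markup level: {markup}")


if __name__ == "__main__":
    page = "<html><body><p>Hello &amp; welcome</p><script>var x = 1;</script></body></html>"
    text = html_to_text(page)
    # Write the result as UTF-8, the encoding the tool is said to produce.
    with open("sample.txt", "w", encoding="utf-8") as handle:
        handle.write(render(text, "http://example.org/page", "2015-11-07", markup="tei"))
```

Searching the Web and extracting text from PDF files are left out of the sketch, since both would need a search API or a third-party library.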

http://www.staff.uni-mainz.de/fantinuo/corpuscreator.html
