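Importing nutch-2.1 into Eclipse and running it with MySQL
This is my first time working with Nutch, so I am writing the steps down.
First, the database:
CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_unicode_ci;
The table: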
CREATE TABLE `webpage` (
  `id` varchar(767) NOT NULL,
  `headers` blob,
  `text` mediumtext,
  `status` int(11) default NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) default NULL,
  `score` float default NULL,
  `typ` varchar(32) default NULL,
  `baseUrl` varchar(767) default NULL,
  `content` longblob,
  `title` varchar(2048) default NULL,
  `reprUrl` varchar(767) default NULL,
  `fetchInterval` int(11) default NULL,
  `prevFetchTime` bigint(20) default NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) default NULL,
  `retriesSinceFetch` int(11) default NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPRESSED;
Install the svn, ivy, and ant plugins in Eclipse
These are the plugins the Nutch project uses; install them yourself.
The remote SVN repository URL for nutch 2.1:
https://svn.apache.org/repos/asf/nutch/tags/release-2.1
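Check out the project.
Accept the defaults, click Finish, and create it as a Java project.
Wait for the download to finish.
After the download finishes (note: the project called nutch2 here is renamed to nutch-2.1 in the steps below):
In Project Explorer, right-click the project and choose Properties, then go to Java Build Path.
Click Add Folder and select the folders to import, adding the src/java and src/test folders of every project under plugin.
You can also do this step by editing the project's .classpath file directly and then refreshing the project so the folders are added automatically. That is more convenient, but check that nothing was added incorrectly.
Contents of .classpath: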
<?xml version="1.0" encoding="UTF-8"?>
<classpath>
  <classpathentry kind="src" path="conf"/>
  <classpathentry kind="src" path="src/java"/>
  <classpathentry kind="src" path="src/test"/>
  <classpathentry kind="src" path="src/plugin/protocol-file/src/test"/>
  <classpathentry kind="src" path="src/plugin/protocol-httpclient/src/test"/>
  <classpathentry kind="src" path="src/plugin/subcollection/src/test"/>
  <classpathentry kind="src" path="src/plugin/parse-html/src/test"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-automaton/src/test"/>
  <classpathentry kind="src" path="src/plugin/parse-html/src/java"/>
  <classpathentry kind="src" path="src/plugin/parse-tika/src/test"/>
  <classpathentry kind="src" path="src/plugin/lib-http/src/test"/>
  <classpathentry kind="src" path="src/plugin/parse-tika/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-regex/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-domain/src/java"/>
  <classpathentry kind="src" path="src/plugin/scoring-link/src/java"/>
  <classpathentry kind="src" path="src/plugin/index-anchor/src/test"/>
  <classpathentry kind="src" path="src/plugin/protocol-http/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlnormalizer-regex/src/test"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-prefix/src/java"/>
  <classpathentry kind="src" path="src/plugin/scoring-opic/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-domain/src/test"/>
  <classpathentry kind="src" path="src/plugin/protocol-file/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlnormalizer-regex/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-suffix/src/java"/>
  <classpathentry kind="src" path="src/plugin/language-identifier/src/java"/>
  <classpathentry kind="src" path="src/plugin/lib-regex-filter/src/test"/>
  <classpathentry kind="src" path="src/plugin/language-identifier/src/test"/>
  <classpathentry kind="src" path="src/plugin/subcollection/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlnormalizer-basic/src/test"/>
  <classpathentry kind="src" path="src/plugin/index-basic/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlnormalizer-pass/src/test"/>
  <classpathentry kind="src" path="src/plugin/creativecommons/src/java"/>
  <classpathentry kind="src" path="src/bin"/>
  <classpathentry kind="src" path="src/plugin/protocol-httpclient/src/java"/>
  <classpathentry kind="src" path="src/plugin/tld/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlnormalizer-basic/src/java"/>
  <classpathentry kind="src" path="src/plugin/index-basic/src/test"/>
  <classpathentry kind="src" path="src/plugin/lib-http/src/java"/>
  <classpathentry kind="src" path="src/plugin/protocol-ftp/src/java"/>
  <classpathentry kind="src" path="src/plugin/index-anchor/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-validator/src/java"/>
  <classpathentry kind="src" path="src/plugin/index-more/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-suffix/src/test"/>
  <classpathentry kind="src" path="src/plugin/creativecommons/src/test"/>
  <classpathentry kind="src" path="src/plugin/microformats-reltag/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-regex/src/test"/>
  <classpathentry kind="src" path="src/plugin/lib-regex-filter/src/java"/>
  <classpathentry kind="src" path="src/plugin/index-more/src/test"/>
  <classpathentry kind="src" path="src/plugin/urlnormalizer-pass/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-automaton/src/java"/>
  <classpathentry kind="src" path="src/testresources"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=ivy%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fcreativecommons%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Ffeed%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Findex-anchor%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Findex-basic%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Findex-more%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flanguage-identifier%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flib-http%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flib-nekohtml%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flib-regex-filter%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flib-xml%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fmicroformats-reltag%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fnutch-extensionpoints%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-ext%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-html%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-js%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-swf%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-tika%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-zip%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-file%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-ftp%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-http%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-httpclient%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-sftp%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fscoring-link%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fscoring-opic%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fsubcollection%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Ftld%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-automaton%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-domain%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-prefix%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-regex%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-suffix%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-validator%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlnormalizer-basic%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlnormalizer-pass%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlnormalizer-regex%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/>
  <classpathentry kind="con" path="org.eclipse.jdt.junit.JUNIT_CONTAINER/4"/>
  <classpathentry kind="lib" path="lib/org.restlet-2.0.0.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.example.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.atom_1.0.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.atom.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.crypto.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.fileupload_1.2.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.freemarker_2.3.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.freemarker.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.grizzly.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.gwt.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.httpclient.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.jaas.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.jackson.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.jaxb_2.1.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.jaxrs_1.0.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.jaxrs-2.0-RC3.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.jibx_1.1.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.json_2.0.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.json.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.net.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.odata.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.rdf.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.servlet-2.0-RC3.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.servlet-2.0.0.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.servlet.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.spring_2.5.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.spring-2.0.0.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.velocity_1.5.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.wadl_1.0.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.xml.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.xstream.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.gae-2.0-RC3.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.gwt.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.lib.org.json-2.0.jar"/>
  <classpathentry kind="lib" path="src/plugin/urlfilter-automaton/lib/automaton.jar"/>
  <classpathentry kind="lib" path="lib/mysql-connector-java-5.0.7.jar"/>
  <classpathentry kind="output" path="bin"/>
</classpath>
After refreshing the project, the result is the same as with the manual steps above.
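Listing every plugin source folder by hand is where mistakes tend to creep in. The following is a minimal Python sketch, not part of Nutch or Eclipse, that generates the kind="src" lines for the plugin folders. The function name plugin_source_entries is made up for illustration, and it assumes the checkout keeps plugins under src/plugin as in the .classpath shown above.

```python
from pathlib import Path
from xml.sax.saxutils import quoteattr


def plugin_source_entries(project_root):
    """Return <classpathentry> lines for each plugin's src/java and src/test.

    Hypothetical helper: it only looks at <project_root>/src/plugin/*/ and
    skips anything that is not a directory containing one of the two folders.
    """
    root = Path(project_root)
    plugin_root = root / "src" / "plugin"
    if not plugin_root.is_dir():
        return []
    entries = []
    for plugin_dir in sorted(plugin_root.iterdir()):
        for sub in ("src/java", "src/test"):
            folder = plugin_dir / sub
            if folder.is_dir():
                # Path relative to the project, with forward slashes as in .classpath.
                rel = folder.relative_to(root).as_posix()
                entries.append(
                    f'<classpathentry kind="src" path={quoteattr(rel)}/>'
                )
    return entries
```

Calling plugin_source_entries on the project directory and printing each returned line gives text that can be pasted into .classpath. The result still needs the manual check mentioned above, since the sketch does not know which folders you actually want on the build path.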
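Next, in Order and Export, move conf to the top so that it is loaded first.
Once that is done, the next step is importing the dependency jars.
With the Ivy plugin installed, you can right-click an ivy.xml file directly and choose Add Ivy Library.
Click Finish and the jars are downloaded automatically. Note that there are many ivy.xml files in the project; you have to run Add Ivy Library once for every one that declares jars.
Hunting for them this way takes some time.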
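To shorten the search, the ivy.xml files that actually declare dependencies can be listed with a script. This is a minimal Python sketch under stated assumptions: the function name ivy_files_with_dependencies is hypothetical, and it treats a file as "having jars" when it contains at least one dependency element, which is how Ivy module descriptors normally declare them.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def ivy_files_with_dependencies(project_root):
    """List ivy.xml files (relative paths) that contain a <dependency> element."""
    root = Path(project_root)
    found = []
    for ivy_file in sorted(root.rglob("ivy.xml")):
        try:
            tree = ET.parse(ivy_file)
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        # Dependencies inside XML comments are ignored by the parser,
        # so a commented-out entry does not count.
        if next(tree.iter("dependency"), None) is not None:
            found.append(ivy_file.relative_to(root).as_posix())
    return found
```

Each path it returns is a candidate for Add Ivy Library; files with an empty dependencies section are left out.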
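After all the Ivy libraries have been imported, a few jars will still be missing.
(Download these yourself from the web. The jars I added manually are:)
One more jar, hadoop-core, has to be modified, specifically FileUtil.java.
For details see http://yangshangchuan.iteye.com/blog/1839784
Excerpted here (without the change, an error is reported at run time):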
Error message: Exception in thread "main" java.io.IOException:Failed to set permissions of path:\tmp\hadoop-ysc\mapred\staging\ysc-2036315919\.staging to 0700
Official bug reference: https://issues.apache.org/jira/browse/HADOOP-7682
Fix:
1. Download and extract http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-1.1.2/hadoop-1.1.2.tar.gz
2. Edit hadoop-1.1.2\src\core\org\apache\hadoop\fs\FileUtil.java: search for Failed to set permissions of path, find line 689, and change throw new IOException to LOG.warn
3. Edit hadoop-1.1.2\build.xml: search for autoreconf and remove the 6 matching exec entries that have executable="autoreconf"
4. Download and extract ant, then add the bin directory under the ant directory to the PATH environment variable
5. In a Cygwin shell, change to the hadoop-1.1.2 directory and run ant
6. Replace nutch's hadoop-core-1.0.3.jar with the newly built hadoop-1.1.2\build\hadoop-core-1.1.3-SNAPSHOT.jar
7. For Eclipse development, replace C:\Users\ysc\.ivy2\cache\org.apache.hadoop\hadoop-core\jars\hadoop-core-1.1.2.jar
The JAR attached to that post is a modified hadoop 1.2.1 JAR. It can be used with Nutch 1.7; other Nutch versions were not tested.
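The source edit in step 2 can also be expressed as a text substitution. The sketch below is only an illustration in Python: the function name downgrade_permission_error is hypothetical, and SAMPLE is a stand-in snippet written for this example, not a copy of Hadoop's FileUtil.java. It swaps the nearest throw new IOException( that precedes the error message for LOG.warn(.

```python
MESSAGE = "Failed to set permissions of path"
THROW = "throw new IOException("


def downgrade_permission_error(java_source):
    """Turn the throw that reports MESSAGE into a LOG.warn call.

    Assumes the throw statement starts before the message text and that no
    other `throw new IOException(` sits between the two.
    """
    msg_at = java_source.find(MESSAGE)
    if msg_at == -1:
        return java_source  # message not present, nothing to patch
    throw_at = java_source.rfind(THROW, 0, msg_at)
    if throw_at == -1:
        return java_source  # no throw before the message, leave untouched
    return java_source[:throw_at] + "LOG.warn(" + java_source[throw_at + len(THROW):]


# Stand-in Java text for demonstration only.
SAMPLE = '''if (!rv) {
  throw new IOException("Failed to set permissions of path: " + p);
}'''

patched = downgrade_permission_error(SAMPLE)
```

With this change the failed permission call is logged instead of aborting the job, which is the effect step 2 describes. Review the patched file by hand before building.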
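When I made this change, I simply downloaded the patched jar and used it to replace the hadoop-core jar in the Ivy cache, keeping the same file name.
Download: http://pan.baidu.com/s/1i3FBLEP
Next comes the configuration.
Under nutch2.1/conf,
in gora.properties,
add:
and comment out the other database connections.
In ivy/ivy.xml,
uncomment the mysql-connector dependency.
Inside the configuration element of /conf/nutch-site.xml.template, add the following code: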
In build.xml in the root directory, find the following code:
<target name="resolve-default" depends="clean-lib, init" description="--> resolve and retrieve dependencies with ivy">
  <ivy:resolve file="${ivy.file}" conf="default" log="download-only" />
  <ivy:retrieve pattern="${build.lib.dir}/[artifact]-[revision].[ext]" symlink="false" log="quiet" />
  <antcall target="copy-libs" />
</target>
Change the original
pattern="${build.lib.dir}/[artifact]-[revision].[ext]"
to
pattern="${build.lib.dir}/[artifact]-[type]-[revision].[ext]"
This avoids the build failing when Ivy downloads the artifacts again. The reason: Ivy downloads both the class jar and the source jar, and with the original pattern the two files get the same name and cannot be told apart, which produces a duplicate-file error.
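The name clash is easy to see by expanding both patterns for a binary jar and a source jar of the same module. The following Python sketch imitates only the simple [token] substitution; it is not Ivy's implementation and ignores Ivy's optional parts in parentheses. The artifact name example-lib, its revision, and the type value source for the source jar are assumptions made for illustration.

```python
import re


def expand(pattern, artifact):
    """Fill [token] placeholders in a retrieve pattern from a dict of attributes."""
    return re.sub(r"\[(\w+)\]", lambda m: artifact[m.group(1)], pattern)


# Two artifacts of the same hypothetical module: classes and sources.
binary = {"artifact": "example-lib", "revision": "1.0", "type": "jar", "ext": "jar"}
sources = {"artifact": "example-lib", "revision": "1.0", "type": "source", "ext": "jar"}

OLD = "[artifact]-[revision].[ext]"
NEW = "[artifact]-[type]-[revision].[ext]"

old_names = {expand(OLD, binary), expand(OLD, sources)}
new_names = {expand(NEW, binary), expand(NEW, sources)}

print(sorted(old_names))  # one file name shared by both artifacts
print(sorted(new_names))  # two distinct file names
```

Under the old pattern both artifacts map to example-lib-1.0.jar, so the second one collides with the first. Adding [type] yields example-lib-jar-1.0.jar and example-lib-source-1.0.jar, which can coexist in the same lib directory.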
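Once all of the above is done, run the Ant build from build.xml; it generates the runtime directory.
In the root directory, add a folder named urls and put a seed.txt file in it containing one site address, for example: http://nutch.apache.org/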
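If you prefer to script this step, the seed file can be created as below. This is a minimal Python sketch; write_seed_file is a hypothetical helper name, and it simply writes one URL per line into urls/seed.txt under the given project directory.

```python
from pathlib import Path


def write_seed_file(project_root, urls):
    """Create <project_root>/urls/seed.txt with one URL per line."""
    seed_dir = Path(project_root) / "urls"
    seed_dir.mkdir(parents=True, exist_ok=True)
    seed_file = seed_dir / "seed.txt"
    seed_file.write_text("".join(url + "\n" for url in urls), encoding="utf-8")
    return seed_file
```

For the example in this post the call would be write_seed_file(project_dir, ["http://nutch.apache.org/"]), where project_dir is the path of your nutch-2.1 checkout.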
Open the Run Configurations dialog.
The first tab is already filled in by default.
Select the second tab, Arguments,
and enter:
Finally, click Run to crawl the link information of that site.
Output printed after the run:
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property. FetcherJob: threads: 10 FetcherJob: parsing: false FetcherJob: resuming: false FetcherJob : timelimit set for : -1 Using queue mode : byHost Fetcher: threads: 10 QueueFeeder finished: total 1 records. Hit by time limit :0 fetching http://nutch.apache.org/ Fetcher: throughput threshold: -1 Fetcher: throughput threshold sequence: 5 -finishing thread FetcherThread2, activeThreads=1 -finishing thread FetcherThread3, activeThreads=1 -finishing thread FetcherThread4, activeThreads=1 -finishing thread FetcherThread7, activeThreads=1 -finishing thread FetcherThread8, activeThreads=1 -finishing thread FetcherThread9, activeThreads=1 -finishing thread FetcherThread5, activeThreads=1 -finishing thread FetcherThread6, activeThreads=1 -finishing thread FetcherThread1, activeThreads=1 -finishing thread FetcherThread0, activeThreads=0 0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0.2 pages/s, 84 84 kb/s, 0 URLs in 0 queues -activeThreads=0 ParserJob: resuming: false ParserJob: forced reparse: false ParserJob: parsing all Parsing http://nutch.apache.org/ Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property. FetcherJob: threads: 10 FetcherJob: parsing: false FetcherJob: resuming: false FetcherJob : timelimit set for : -1 Using queue mode : byHost Fetcher: threads: 10 QueueFeeder finished: total 6 records. 
Hit by time limit :0 Fetcher: throughput threshold: -1 Fetcher: throughput threshold sequence: 5 fetching http://cassandra.apache.org/ fetching http://nutch.apache.org/ fetching http://accumulo.apache.org/ fetching http://avro.apache.org/ fetching http://blog.foofactory.fi/2007/03/twice-speed-half-size.html fetching http://code.google.com/p/crawler-commons/ -finishing thread FetcherThread1, activeThreads=9 -finishing thread FetcherThread2, activeThreads=8 -finishing thread FetcherThread3, activeThreads=7 -finishing thread FetcherThread6, activeThreads=6 -finishing thread FetcherThread0, activeThreads=5 -finishing thread FetcherThread8, activeThreads=4 -finishing thread FetcherThread7, activeThreads=3 -finishing thread FetcherThread9, activeThreads=2 0/2 spinwaiting/active, 4 pages, 0 errors, 0.8 0.8 pages/s, 136 136 kb/s, 0 URLs in 2 queues 0/2 spinwaiting/active, 4 pages, 0 errors, 0.4 0.0 pages/s, 68 0 kb/s, 0 URLs in 2 queues 0/2 spinwaiting/active, 4 pages, 0 errors, 0.3 0.0 pages/s, 45 0 kb/s, 0 URLs in 2 queues 0/2 spinwaiting/active, 4 pages, 0 errors, 0.2 0.0 pages/s, 34 0 kb/s, 0 URLs in 2 queues fetch of http://code.google.com/p/crawler-commons/ failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms -finishing thread FetcherThread4, activeThreads=1 fetch of http://blog.foofactory.fi/2007/03/twice-speed-half-size.html failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms -finishing thread FetcherThread5, activeThreads=0 0/0 spinwaiting/active, 6 pages, 2 errors, 0.2 0.4 pages/s, 27 0 kb/s, 0 URLs in 0 queues -activeThreads=0 ParserJob: resuming: false ParserJob: forced reparse: false ParserJob: parsing all Skipping http://sched.co/1pav9xl; different batch id (null) Skipping http://sched.co/1pbE15n; different batch id (null) Skipping http://t.co/k3VLhbJQhg; different batch id (null) Skipping 
http://www.eu.apachecon.com/c/aceu2009/; different batch id (null) Skipping http://eu.apachecon.com/c/aceu2009/sessions/136; different batch id (null) Skipping http://eu.apachecon.com/c/aceu2009/sessions/137; different batch id (null) Skipping http://eu.apachecon.com/c/aceu2009/sessions/138; different batch id (null) Skipping http://eu.apachecon.com/c/aceu2009/sessions/165; different batch id (null) Skipping http://eu.apachecon.com/c/aceu2009/sessions/197; different batch id (null) Skipping http://eu.apachecon.com/c/aceu2009/sessions/201; different batch id (null) Skipping http://eu.apachecon.com/c/aceu2009/sessions/250; different batch id (null) Skipping http://eu.apachecon.com/c/aceu2009/sessions/251; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/schedule; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/331; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/332; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/333; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/334; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/335; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/375; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/427; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/428; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/430; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/437; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/461; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/462; different batch id (null) Skipping http://www.cafepress.com/nutch; 
different batch id (null) Skipping https://www.flickr.com/photos/andrewfhart/8106189987/; different batch id (null) Skipping https://www.flickr.com/photos/andrewfhart/8106200690/; different batch id (null) Skipping https://www.flickr.com/photos/mrmuskrat/3637703614/; different batch id (null) Skipping https://www.flickr.com/photos/splorp/3981832163/; different batch id (null) Skipping https://www.google-melange.com/gsoc/homepage/google/gsoc2014; different batch id (null) Parsing http://code.google.com/p/crawler-commons/ Skipping https://twitter.com/ApacheNutch; different batch id (null) Skipping https://twitter.com/ApacheNutch/status/591359830171856896; different batch id (null) Skipping https://twitter.com/cutting/status/233415059798372353; different batch id (null) Skipping https://twitter.com/TheASF; different batch id (null) Skipping http://www.brics.dk/automaton/; different batch id (null) Skipping http://www.brics.dk/automaton/automaton; different batch id (null) Parsing http://blog.foofactory.fi/2007/03/twice-speed-half-size.html Parsing http://accumulo.apache.org/ Parsing http://avro.apache.org/ Skipping https://builds.apache.org/view/M-R/view/Nutch/; different batch id (null) Parsing http://cassandra.apache.org/ Skipping https://cwiki.apache.org/confluence/display/solr/SolrCloud; different batch id (null) Skipping http://gora.apache.org/; different batch id (null) Skipping http://hadoop.apache.org/; different batch id (null) Skipping http://hbase.apache.org/; different batch id (null) Skipping https://issues.apache.org/jira/browse/NUTCH-1047; different batch id (null) Skipping https://issues.apache.org/jira/browse/NUTCH-1591; different batch id (null) Skipping https://issues.apache.org/jira/browse/NUTCH-841; different batch id (null) Skipping https://issues.apache.org/jira/browse/NUTCH/; different batch id (null) Skipping http://lucene.apache.org/; different batch id (null) Skipping http://lucene.apache.org/solr; different batch id (null) Skipping 
http://lucene.apache.org/solr/; different batch id (null) Parsing http://nutch.apache.org/ Skipping http://nutch.apache.org/bot.html; different batch id (null) Skipping http://nutch.apache.org/credits.html; different batch id (null) Skipping http://nutch.apache.org/downloads.html; different batch id (null) Skipping http://nutch.apache.org/index.html; different batch id (null) Skipping http://nutch.apache.org/javadoc.html; different batch id (null) Skipping http://nutch.apache.org/mailing_lists.html; different batch id (null) Skipping http://nutch.apache.org/version_control.html; different batch id (null) Skipping http://s.apache.org/1.9-release; different batch id (null) Skipping http://s.apache.org/1zE; different batch id (null) Skipping http://s.apache.org/LPB; different batch id (null) Skipping http://s.apache.org/nutch10; different batch id (null) Skipping http://s.apache.org/nutch_2.3; different batch id (null) Skipping http://s.apache.org/oHY; different batch id (null) Skipping http://s.apache.org/PGa; different batch id (null) Skipping http://tika.apache.org/; different batch id (null) Skipping http://tika.apache.org/1.2/index.html; different batch id (null) Skipping https://whimsy.apache.org/board/minutes/Nutch.html; different batch id (null) Skipping http://wicket.apache.org/; different batch id (null) Skipping http://wiki.apache.org/nutch/; different batch id (null) Skipping http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer; different batch id (null) Skipping http://wiki.apache.org/nutch/FAQ; different batch id (null) Skipping http://wiki.apache.org/nutch/NutchPropertiesCompleteList; different batch id (null) Skipping https://wiki.apache.org/nutch/FrontPage; different batch id (null) Skipping https://wiki.apache.org/nutch/NutchRESTAPI; different batch id (null) Skipping http://www.apache.org/; different batch id (null) Skipping http://www.apache.org/dist/nutch/1.5.1/CHANGES.txt; different batch id (null) Skipping 
http://www.apache.org/dist/nutch/1.6/CHANGES_1.6.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/1.7/1.7-CHANGES.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/1.8/CHANGES.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/1.9/CHANGES.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/2.0/CHANGES.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/2.1/CHANGES-2.1.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/2.2.1/CHANGES-2.2.1.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/2.2/2.2-CHANGES.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/CHANGES-0.8.1.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/CHANGES-0.9.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/CHANGES-1.0.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/CHANGES-1.1.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/CHANGES-1.2.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/CHANGES-1.3.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/CHANGES-1.4.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/CHANGES-1.5.txt; different batch id (null) Skipping http://www.apache.org/dyn/closer.cgi/nutch/; different batch id (null) Skipping http://www.apache.org/foundation/records/minutes/2010/board_minutes_2010_04_21.txt; different batch id (null) Skipping http://www.apache.org/foundation/sponsorship.html; different batch id (null) Skipping http://www.apache.org/foundation/thanks.html; different batch id (null) Skipping http://www.apache.org/licenses/; different batch id (null) Skipping http://www.apache.org/licenses/LICENSE-2.0; different batch id (null) Skipping http://www.apache.org/security/; different batch id (null) Skipping 
http://creativecommons.org/press-releases/entry/5064; different batch id (null) Skipping https://creativecommons.org/licenses/by-sa/2.0/; different batch id (null) Skipping http://www.elasticsearch.org/; different batch id (null) Skipping http://events.linuxfoundation.org/events/apachecon-europe; different batch id (null) Skipping http://events.linuxfoundation.org/events/apachecon-north-america; different batch id (null) Skipping http://search.maven.org/; different batch id (null) Skipping http://mongodb.org/; different batch id (null) Skipping http://osuosl.org/news_folder/nutch; different batch id (null) Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property. FetcherJob: threads: 10 FetcherJob: parsing: false FetcherJob: resuming: false FetcherJob : timelimit set for : -1 Using queue mode : byHost Fetcher: threads: 10 fetching http://cassandra.apache.org/ fetching http://nutch.apache.org/ Fetcher: throughput threshold: -1 Fetcher: throughput threshold sequence: 5 fetching http://accumulo.apache.org/ fetching http://avro.apache.org/ QueueFeeder finished: total 11 records. Hit by time limit :0 fetching http://blog.foofactory.fi/2007/03/twice-speed-half-size.html fetching http://www.apache.org/foundation/sponsorship.html fetching http://code.google.com/p/crawler-commons/ fetching http://www.apache.org/security/ 7/10 spinwaiting/active, 5 pages, 0 errors, 1.0 1.0 pages/s, 169 169 kb/s, 3 URLs in 3 queues * queue: http://www.apache.orgmaxThreads = 1inProgress = 1crawlDelay = 4000minCrawlDelay = 0nextFetchTime = 1445831574525now = 14458315748140. http://www.apache.org/foundation/thanks.html1. http://www.apache.org/licenses/2. 
http://www.apache.org/ fetching http://www.apache.org/foundation/thanks.html 8/10 spinwaiting/active, 7 pages, 0 errors, 0.7 0.4 pages/s, 113 57 kb/s, 2 URLs in 3 queues * queue: http://www.apache.orgmaxThreads = 1inProgress = 0crawlDelay = 4000minCrawlDelay = 0nextFetchTime = 1445831583211now = 14458315798170. http://www.apache.org/licenses/1. http://www.apache.org/ fetching http://www.apache.org/licenses/ 8/10 spinwaiting/active, 8 pages, 0 errors, 0.5 0.2 pages/s, 86 31 kb/s, 1 URLs in 3 queues * queue: http://www.apache.orgmaxThreads = 1inProgress = 0crawlDelay = 4000minCrawlDelay = 0nextFetchTime = 1445831587582now = 14458315848200. http://www.apache.org/ fetching http://www.apache.org/ -finishing thread FetcherThread9, activeThreads=8 -finishing thread FetcherThread2, activeThreads=8 -finishing thread FetcherThread0, activeThreads=7 -finishing thread FetcherThread1, activeThreads=6 -finishing thread FetcherThread4, activeThreads=4 -finishing thread FetcherThread3, activeThreads=4 -finishing thread FetcherThread5, activeThreads=3 -finishing thread FetcherThread7, activeThreads=2 0/2 spinwaiting/active, 9 pages, 0 errors, 0.5 0.2 pages/s, 84 81 kb/s, 0 URLs in 2 queues fetch of http://blog.foofactory.fi/2007/03/twice-speed-half-size.html failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms -finishing thread FetcherThread8, activeThreads=1 fetch of http://code.google.com/p/crawler-commons/ failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms -finishing thread FetcherThread6, activeThreads=0 0/0 spinwaiting/active, 11 pages, 2 errors, 0.4 0.4 pages/s, 67 0 kb/s, 0 URLs in 0 queues -activeThreads=0 ParserJob: resuming: false ParserJob: forced reparse: false ParserJob: parsing all Skipping http://sched.co/1pav9xl; different batch id (null) Skipping http://sched.co/1pbE15n; different batch id (null) 
Skipping http://t.co/k3VLhbJQhg; different batch id (null) Skipping http://accumulosummit.com/; different batch id (null) Skipping http://www.amazon.com/Cassandra-High-Availability-Robbie-Strickland/dp/1783989122; different batch id (null) Skipping http://www.eu.apachecon.com/c/aceu2009/; different batch id (null) Skipping http://eu.apachecon.com/c/aceu2009/sessions/136; different batch id (null) Skipping http://eu.apachecon.com/c/aceu2009/sessions/137; different batch id (null) Skipping http://eu.apachecon.com/c/aceu2009/sessions/138; different batch id (null) Skipping http://eu.apachecon.com/c/aceu2009/sessions/165; different batch id (null) Skipping http://eu.apachecon.com/c/aceu2009/sessions/197; different batch id (null) Skipping http://eu.apachecon.com/c/aceu2009/sessions/201; different batch id (null) Skipping http://eu.apachecon.com/c/aceu2009/sessions/250; different batch id (null) Skipping http://eu.apachecon.com/c/aceu2009/sessions/251; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/schedule; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/331; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/332; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/333; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/334; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/335; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/375; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/427; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/428; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/430; different batch id (null) Skipping 
http://www.us.apachecon.com/c/acus2009/sessions/437; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/461; different batch id (null) Skipping http://www.us.apachecon.com/c/acus2009/sessions/462; different batch id (null) Skipping http://www.cafepress.com/nutch; different batch id (null) Skipping http://www.datastax.com/dev/blog/2012-in-review-performance; different batch id (null) Skipping http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_intro_c.html; different batch id (null) Skipping http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_primary_index_c.html; different batch id (null) Skipping http://www.datastax.com/resources/whitepapers/benchmarking-top-nosql-databases; different batch id (null) Skipping https://www.flickr.com/photos/andrewfhart/8106189987/; different batch id (null) Skipping https://www.flickr.com/photos/andrewfhart/8106200690/; different batch id (null) Skipping https://www.flickr.com/photos/mrmuskrat/3637703614/; different batch id (null) Skipping https://www.flickr.com/photos/splorp/3981832163/; different batch id (null) Skipping http://getbootstrap.com/; different batch id (null) Skipping https://github.com/apache/accumulo; different batch id (null) Skipping http://glyphicons.com/; different batch id (null) Skipping https://www.google-melange.com/gsoc/homepage/google/gsoc2014; different batch id (null) Parsing http://code.google.com/p/crawler-commons/ Skipping http://research.google.com/archive/bigtable.html; different batch id (null) Skipping https://www.linkedin.com/groups/Apache-Accumulo-Professionals-4554913; different batch id (null) Skipping http://blog.markedup.com/2013/02/cassandra-hive-and-hadoop-how-we-picked-our-analytics-stack/; different batch id (null) Skipping http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/; different batch id (null) Skipping http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html; different batch id (null) 
Skipping https://twitter.com/apacheaccumulo; different batch id (null) Skipping https://twitter.com/ApacheNutch; different batch id (null) Skipping https://twitter.com/ApacheNutch/status/591359830171856896; different batch id (null) Skipping https://twitter.com/cutting/status/233415059798372353; different batch id (null) Skipping https://twitter.com/TheASF; different batch id (null) Skipping http://www.brics.dk/automaton/; different batch id (null) Skipping http://www.brics.dk/automaton/automaton; different batch id (null) Parsing http://blog.foofactory.fi/2007/03/twice-speed-half-size.html Skipping http://fontawesome.io/; different batch id (null) Skipping http://freenode.net/; different batch id (null) Skipping http://www.slideshare.net/adrianco/migrating-netflix-from-oracle-to-global-cassandra; different batch id (null) Skipping http://www.slideshare.net/daveconnors/cassandra-puppet-scaling-data-at-15-per-month; different batch id (null) Skipping http://www.slideshare.net/jaykumarpatel/cassandra-at-ebay-13920376; different batch id (null) Skipping http://www.slideshare.net/jbellis; different batch id (null) Skipping http://www.slideshare.net/jbellis/cassandra-at-nosql-matters-2012; different batch id (null) Skipping http://www.slideshare.net/planetcassandra/3-mohit-anchlia; different batch id (null) Skipping http://www.slideshare.net/planetcassandra/nyc-tech-day-using-cassandra-for-dvr-scheduling-at-comcast; different batch id (null) Skipping http://www.slideshare.net/slideshow/embed_code/15832310; different batch id (null) Parsing http://accumulo.apache.org/ Skipping http://accumulo.apache.org/1.5/accumulo_user_manual.html; different batch id (null) Skipping http://accumulo.apache.org/1.5/apidocs; different batch id (null) Skipping http://accumulo.apache.org/1.5/examples; different batch id (null) Skipping http://accumulo.apache.org/1.6/accumulo_user_manual.html; different batch id (null) Skipping http://accumulo.apache.org/1.6/apidocs; different batch id 
(null) Skipping http://accumulo.apache.org/1.6/examples; different batch id (null) Skipping http://accumulo.apache.org/1.7/accumulo_user_manual.html; different batch id (null) Skipping http://accumulo.apache.org/1.7/apidocs; different batch id (null) Skipping http://accumulo.apache.org/1.7/examples; different batch id (null) Skipping http://accumulo.apache.org/bylaws.html; different batch id (null) Skipping http://accumulo.apache.org/contrib.html; different batch id (null) Skipping http://accumulo.apache.org/downloads; different batch id (null) Skipping http://accumulo.apache.org/downloads/; different batch id (null) Skipping http://accumulo.apache.org/get_involved.html; different batch id (null) Skipping http://accumulo.apache.org/git.html; different batch id (null) Skipping http://accumulo.apache.org/glossary.html; different batch id (null) Skipping http://accumulo.apache.org/governance/consensusBuilding.html; different batch id (null) Skipping http://accumulo.apache.org/governance/lazyConsensus.html; different batch id (null) Skipping http://accumulo.apache.org/governance/releasing.html; different batch id (null) Skipping http://accumulo.apache.org/governance/voting.html; different batch id (null) Skipping http://accumulo.apache.org/index.html; different batch id (null) Skipping http://accumulo.apache.org/mailing_list.html; different batch id (null) Skipping http://accumulo.apache.org/notable_features.html; different batch id (null) Skipping http://accumulo.apache.org/old_documentation.html; different batch id (null) Skipping http://accumulo.apache.org/papers.html; different batch id (null) Skipping http://accumulo.apache.org/people.html; different batch id (null) Skipping http://accumulo.apache.org/projects.html; different batch id (null) Skipping http://accumulo.apache.org/rb.html; different batch id (null) Skipping http://accumulo.apache.org/release_notes/; different batch id (null) Skipping http://accumulo.apache.org/release_notes/1.5.4.html; different batch 
id (null) Skipping http://accumulo.apache.org/release_notes/1.6.4.html; different batch id (null) Skipping http://accumulo.apache.org/release_notes/1.7.0.html; different batch id (null) Skipping http://accumulo.apache.org/releasing.html; different batch id (null) Skipping http://accumulo.apache.org/screenshots.html; different batch id (null) Skipping http://accumulo.apache.org/source.html; different batch id (null) Skipping http://accumulo.apache.org/verifying_releases.html; different batch id (null) Skipping http://accumulo.apache.org/versioning.html; different batch id (null) Parsing http://avro.apache.org/ Skipping http://avro.apache.org/credits.html; different batch id (null) Skipping http://avro.apache.org/docs/1.6.3; different batch id (null) Skipping http://avro.apache.org/docs/1.7.7; different batch id (null) Skipping http://avro.apache.org/docs/current; different batch id (null) Skipping http://avro.apache.org/docs/current/; different batch id (null) Skipping http://avro.apache.org/index.html; different batch id (null) Skipping http://avro.apache.org/irc.html; different batch id (null) Skipping http://avro.apache.org/issue_tracking.html; different batch id (null) Skipping http://avro.apache.org/mailing_lists.html; different batch id (null) Skipping http://avro.apache.org/releases.html; different batch id (null) Skipping http://avro.apache.org/version_control.html; different batch id (null) Skipping http://blogs.apache.org/accumulo; different batch id (null) Skipping https://blogs.apache.org/accumulo/; different batch id (null) Skipping https://builds.apache.org/view/A-D/view/Accumulo/; different batch id (null) Skipping https://builds.apache.org/view/M-R/view/Nutch/; different batch id (null) Parsing http://cassandra.apache.org/ Skipping http://cassandra.apache.org/download/; different batch id (null) Skipping http://cassandra.apache.org/privacy.html; different batch id (null) Skipping https://cwiki.apache.org/confluence/display/AVRO/How+To+Contribute; 
different batch id (null) Skipping https://cwiki.apache.org/confluence/display/AVRO/Index; different batch id (null) Skipping https://cwiki.apache.org/confluence/display/solr/SolrCloud; different batch id (null) Skipping http://forrest.apache.org/; different batch id (null) Skipping http://gora.apache.org/; different batch id (null) Skipping http://hadoop.apache.org/; different batch id (null) Skipping http://hadoop.apache.org/privacy_policy.html; different batch id (null) Skipping http://hbase.apache.org/; different batch id (null) Skipping https://issues.apache.org/jira/browse/accumulo; different batch id (null) Skipping https://issues.apache.org/jira/browse/NUTCH-1047; different batch id (null) Skipping https://issues.apache.org/jira/browse/NUTCH-1591; different batch id (null) Skipping https://issues.apache.org/jira/browse/NUTCH-841; different batch id (null) Skipping https://issues.apache.org/jira/browse/NUTCH/; different batch id (null) Skipping http://lucene.apache.org/; different batch id (null) Skipping http://lucene.apache.org/solr; different batch id (null) Skipping http://lucene.apache.org/solr/; different batch id (null) Parsing http://nutch.apache.org/ Skipping http://nutch.apache.org/bot.html; different batch id (null) Skipping http://nutch.apache.org/credits.html; different batch id (null) Skipping http://nutch.apache.org/downloads.html; different batch id (null) Skipping http://nutch.apache.org/index.html; different batch id (null) Skipping http://nutch.apache.org/javadoc.html; different batch id (null) Skipping http://nutch.apache.org/mailing_lists.html; different batch id (null) Skipping http://nutch.apache.org/version_control.html; different batch id (null) Skipping http://s.apache.org/1.9-release; different batch id (null) Skipping http://s.apache.org/1zE; different batch id (null) Skipping http://s.apache.org/LPB; different batch id (null) Skipping http://s.apache.org/nutch10; different batch id (null) Skipping http://s.apache.org/nutch_2.3; 
different batch id (null) Skipping http://s.apache.org/oHY; different batch id (null) Skipping http://s.apache.org/PGa; different batch id (null) Skipping http://thrift.apache.org/; different batch id (null) Skipping http://tika.apache.org/; different batch id (null) Skipping http://tika.apache.org/1.2/index.html; different batch id (null) Skipping https://whimsy.apache.org/board/minutes/Nutch.html; different batch id (null) Skipping http://wicket.apache.org/; different batch id (null) Skipping http://wiki.apache.org/cassandra; different batch id (null) Skipping http://wiki.apache.org/cassandra/Durability; different batch id (null) Skipping http://wiki.apache.org/cassandra/FAQ; different batch id (null) Skipping http://wiki.apache.org/cassandra/GettingStarted; different batch id (null) Skipping http://wiki.apache.org/cassandra/HintedHandoff; different batch id (null) Skipping http://wiki.apache.org/cassandra/HowToContribute; different batch id (null) Skipping http://wiki.apache.org/cassandra/ReadRepair; different batch id (null) Skipping http://wiki.apache.org/cassandra/ThirdPartySupport; different batch id (null) Skipping http://wiki.apache.org/nutch/; different batch id (null) Skipping http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer; different batch id (null) Skipping http://wiki.apache.org/nutch/FAQ; different batch id (null) Skipping http://wiki.apache.org/nutch/NutchPropertiesCompleteList; different batch id (null) Skipping https://wiki.apache.org/nutch/FrontPage; different batch id (null) Skipping https://wiki.apache.org/nutch/NutchRESTAPI; different batch id (null) Parsing http://www.apache.org/ Skipping http://www.apache.org/dist/nutch/1.5.1/CHANGES.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/1.6/CHANGES_1.6.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/1.7/1.7-CHANGES.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/1.8/CHANGES.txt; different batch id (null) Skipping 
http://www.apache.org/dist/nutch/1.9/CHANGES.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/2.0/CHANGES.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/2.1/CHANGES-2.1.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/2.2.1/CHANGES-2.2.1.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/2.2/2.2-CHANGES.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/CHANGES-0.8.1.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/CHANGES-0.9.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/CHANGES-1.0.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/CHANGES-1.1.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/CHANGES-1.2.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/CHANGES-1.3.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/CHANGES-1.4.txt; different batch id (null) Skipping http://www.apache.org/dist/nutch/CHANGES-1.5.txt; different batch id (null) Skipping http://www.apache.org/dyn/closer.cgi/nutch/; different batch id (null) Skipping http://www.apache.org/foundation/policies/conduct.html; different batch id (null) Skipping http://www.apache.org/foundation/records/minutes/2010/board_minutes_2010_04_21.txt; different batch id (null) Parsing http://www.apache.org/foundation/sponsorship.html Parsing http://www.apache.org/foundation/thanks.html Parsing http://www.apache.org/licenses/ Skipping http://www.apache.org/licenses/LICENSE-2.0; different batch id (null) Parsing http://www.apache.org/security/ Skipping http://zookeeper.apache.org/; different batch id (null) Skipping http://creativecommons.org/press-releases/entry/5064; different batch id (null) Skipping https://creativecommons.org/licenses/by-sa/2.0/; different batch id (null) Skipping http://www.elasticsearch.org/; different batch id (null) Skipping 
http://hypertable.org/; different batch id (null) Skipping http://events.linuxfoundation.org/events/apachecon-europe; different batch id (null) Skipping http://events.linuxfoundation.org/events/apachecon-north-america; different batch id (null) Skipping http://search.maven.org/; different batch id (null) Skipping http://mongodb.org/; different batch id (null) Skipping http://osuosl.org/news_folder/nutch; different batch id (null) Skipping http://www.planetcassandra.org/; different batch id (null) Skipping http://planetcassandra.org/; different batch id (null) Skipping http://planetcassandra.org/blog/post/analytics-at-github-with-apache-cassandra/; different batch id (null) Skipping http://planetcassandra.org/blog/post/cassandra-at-cern-large-hadron-collider/; different batch id (null) Skipping http://planetcassandra.org/blog/post/cassandra-used-to-build-scalable-and-highly-available-systems-at-hulu-streaming-content-to-over-5-million-subscribers/; different batch id (null) Skipping http://planetcassandra.org/blog/post/godaddy-worlds-largest-domain-name-registrar-and-web-host-provider-utilizes-cassandra-for-replication-and-scalability/; different batch id (null) Skipping http://planetcassandra.org/blog/post/instagram-making-the-switch-to-cassandra-from-redis-75-instasavings/; different batch id (null) Skipping http://planetcassandra.org/blog/post/make-it-rain-apache-cassandra-at-the-weather-channel-for-severe-weather-alerts/; different batch id (null) Skipping http://planetcassandra.org/blog/post/reddit-upvotes-apache-cassandras-horizontal-scaling-managing-17000000-votes-daily/; different batch id (null) Skipping http://planetcassandra.org/companies/; different batch id (null) Skipping http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf; different batch id (null)?
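Data inserted into the table
At this point, the import into Eclipse is basically complete.
From here on, I will keep learning the rest on my own.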
---------------------------------------------------------------------------------
Another, simpler way
File > New > Project > SVN > Checkout Projects from SVN
Create a new repository location >
Select the URL > Finish. The New Project wizard pops up; choose Java Project > Next,
Enter Project name: nutch1.7 > Finish
Setting up the environment
In the Package Explorer on the left, right-click the nutch1.7 folder > Build Path > Configure Build Path...
> Select the Source tab > select src > Remove > Add Folder... > select src/bin, src/java, src/test and src/testresources
Switch to the Libraries tab >
Add Class Folder... > select nutch1.7/conf
Add Library... > IvyDE Managed Dependencies > Next > Main > Ivy File > Browse > ivy/ivy.xml > Finish
Switch to the Order and Export tab > select conf > Top > OK
Finally: in the Package Explorer on the left, right-click the build.xml file under the nutch1.7 folder > Run As > Ant Build (then wait for it to finish)
In the Package Explorer on the left, right-click the nutch1.7 folder > Refresh
In the Package Explorer on the left, right-click the nutch1.7 folder > Build Path > Configure Build Path... > select the Libraries tab > Add Class Folder... > select build >
Wait for it to finish
OK, the whole project is now imported, with no red error markers.
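As with the first method, the same build path can probably be set by editing the project's .classpath file directly and then refreshing the project. The fragment below is only a sketch of what the steps above would roughly produce. It assumes the project is named nutch1.7 and that the default output folder is bin. The IvyDE container path is an assumption based on the form IvyDE typically generates and may differ between IvyDE versions, so compare it with what the Add Library wizard writes before relying on it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<classpath>
	<!-- conf comes first, matching "Order and Export > select conf > Top" -->
	<classpathentry kind="lib" path="conf"/>
	<!-- source folders added on the Source tab after removing src -->
	<classpathentry kind="src" path="src/bin"/>
	<classpathentry kind="src" path="src/java"/>
	<classpathentry kind="src" path="src/test"/>
	<classpathentry kind="src" path="src/testresources"/>
	<!-- JRE system library that Eclipse adds to every Java project -->
	<classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/>
	<!-- IvyDE managed dependencies resolved from ivy/ivy.xml (assumed path format) -->
	<classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?ivyXmlPath=ivy%2Fivy.xml&amp;confs=*"/>
	<!-- build class folder, added after the Ant build has finished -->
	<classpathentry kind="lib" path="build"/>
	<!-- assumed default output folder -->
	<classpathentry kind="output" path="bin"/>
</classpath>
```

Keep a copy of the original .classpath before editing, and check for wrongly added entries after refreshing the project.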
Reposted from: https://www.cnblogs.com/hwaggLee/p/4910931.html