Running Nutch in Eclipse

Published: 2024/4/13

1. Before installing, make sure your Hadoop environment is already up and running.

2. Open wiki.apache.org/nutch and search for "RunNutchInEclipse" to find the tutorial.

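3. Pay attention to the Prerequisites section. First, Ant is installed on the operating system, not in Eclipse. Second, the Maven plugin already ships with Eclipse and does not need to be installed. Third, the SVN plugin is the same on Linux and Windows: copy the SVN plugin's plugins and features directories from the Windows Eclipse into the plugins and features directories of the Linux Eclipse.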

4. To check out the Nutch code, you can use TortoiseSVN on Windows and then copy the working copy to the Linux environment. Pick the matching version under https://svn.apache.org/repos/asf/nutch/branches/.

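5. How should the Nutch configuration files be set up? See the article "nutch2.3编译安装和hbase集成" ("Nutch 2.3: compiling, installing, and integrating with HBase") and follow it exactly.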

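6. Note step 5 of the tutorial, which sets the plugin.folders property in nutch-default.xml. My value is "./build/plugins" (the build/plugins directory of this project). Without it you get the error "InjectorJob: java.lang.RuntimeException: job failed: name=inject urls, jobid=job_...". For details, check the log file hadoop.log in the logs folder under the project root.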

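7. Running the ant eclipse command may fail with "java.net.UnknownHostException: repo2.maven.org". This means the default DNS server may not be able to resolve repo2.maven.org, so add another DNS server: DNS2="114.114.114.114". Reference article: "Cassandra 2.x中文教程（23）：Cassandra 2.1.0源码导入Eclipse LUNA" ("Cassandra 2.x Chinese tutorial (23): importing the Cassandra 2.1.0 source into Eclipse Luna").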

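8. Do not casually change the Gora version bundled with Nutch. Changing it may cause errors, so it is best to stay with the bundled version.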

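9. If you see "ClassNotFoundException: org.apache.gora.sql.store.SqlStore", first check that the configuration files are all correct, then check step 6 of the tutorial's "Load project in Eclipse" section, i.e. whether the conf folder has been moved to the top with the "Top" button.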
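Note 6 points to logs/hadoop.log as the place to look when a job fails. The following is a minimal shell sketch for pulling the most recent error-looking lines out of that file. The function name recent_log_errors and the match pattern are illustrative assumptions, not part of Nutch.

```shell
# Print the most recent error-looking lines from a Nutch/Hadoop log.
# Usage: recent_log_errors [LOG_FILE] [MAX_LINES]
# recent_log_errors is a hypothetical helper name used for this sketch.
recent_log_errors() {
    log_file="${1:-logs/hadoop.log}"
    max_lines="${2:-20}"
    if [ ! -f "$log_file" ]; then
        echo "log file not found: $log_file" >&2
        return 1
    fi
    # -n prefixes each match with its line number in the log,
    # -E enables the alternation pattern; tail keeps the newest matches.
    grep -nE 'ERROR|Exception|Caused by' "$log_file" | tail -n "$max_lines"
}

# Example (run from the project root after a failed job):
#   recent_log_errors logs/hadoop.log 30
```

The line numbers in the output make it easier to open the log at the right place and read the full stack trace around a match.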

Before you start

Setting up Nutch to run in Eclipse can be tricky, and most of the time you are much faster if you edit Nutch in Eclipse but run the scripts from the command line. However, it is very useful to be able to debug Nutch in Eclipse, and it is also extremely useful when applying and testing patches, as it lets you see them working in a larger context. That being said, you will still benefit greatly from looking at the hadoop.log output. This tutorial covers a fully internal Eclipse/Nutch setup, using only Eclipse tools and associated plugins.

Prerequisites

You need to have Apache Ant installed and configured on your system.
Grab the newest version of Eclipse available here.
All of the following should be available from the Eclipse Marketplace. If not, you can download them through Eclipse as follows.
Once you've set up Eclipse, download Subclipse as per here. N.B. If you experience an error with the 1.8.x release, try 1.6.x. This tends to solve compatibility problems.
Grab the IvyDE plugin for Eclipse as here.
Grab the m2e plugin for Eclipse here.

Steps

Checkout and Build Nutch

Get the latest source code from SVN using a terminal. For Nutch 1.x (i.e. trunk) run this:

  svn co https://svn.apache.org/repos/asf/nutch/trunk
  cd trunk

For Nutch 2.x run this:

  svn co https://svn.apache.org/repos/asf/nutch/branches/2.x
  cd 2.x

For Nutch 1.x (i.e. trunk), skip ahead to step #5.

At this point you should have decided which data store you want to use. See the Apache Gora documentation for more information. Here are a few of the available storage classes:

  org.apache.gora.hbase.store.HBaseStore
  org.apache.gora.cassandra.store.CassandraStore
  org.apache.gora.accumulo.store.AccumuloStore
  org.apache.gora.avro.store.AvroStore
  org.apache.gora.avro.store.DataFileAvroStore

In “conf/nutch-site.xml” add the storage class name. For example, if you pick HBase as the data store, add this to “conf/nutch-site.xml”:

  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>

In ivy/ivy.xml, uncomment the dependency for the data store that you selected. For example, if you plan to use HBase, uncomment this line:

  <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

Set the default data store in conf/gora.properties. For example, for HBase as the data store, put this in conf/gora.properties:

  gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

Add “http.agent.name” and “http.robots.agents” with appropriate values in “conf/nutch-site.xml”. See conf/nutch-default.xml for the description of these properties. Also, add “plugin.folders” and set it to {PATH_TO_NUTCH_CHECKOUT}/build/plugins. For example, if Nutch is located at "/home/tejas/Desktop/2.x", set the property to:

  <property>
    <name>plugin.folders</name>
    <value>/home/tejas/Desktop/2.x/build/plugins</value>
  </property>

Run this command:

  ant eclipse
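For Nutch 2.x, the steps above name the storage class in more than one place (conf/nutch-site.xml and conf/gora.properties), and a mismatch between them is an easy mistake to make. The sketch below checks that the class set in gora.properties also appears as a value in nutch-site.xml. The function name check_store_config is hypothetical, and the check is a plain text match that assumes <value>...</value> sits on one line, so treat it as a rough sanity check rather than a real XML validation.

```shell
# Check that the default data store in gora.properties is also referenced
# in nutch-site.xml. check_store_config is a hypothetical helper name.
# Usage: check_store_config [CONF_DIR]
check_store_config() {
    conf_dir="${1:-conf}"
    # Take the last uncommented gora.datastore.default=... line and
    # strip trailing whitespace from its value.
    store=$(sed -n 's/^[[:space:]]*gora\.datastore\.default[[:space:]]*=[[:space:]]*//p' \
        "$conf_dir/gora.properties" 2>/dev/null | tail -n 1 | sed 's/[[:space:]]*$//')
    if [ -z "$store" ]; then
        echo "gora.datastore.default is not set in $conf_dir/gora.properties" >&2
        return 1
    fi
    # -F treats the class name as a fixed string, so dots are not wildcards.
    if grep -qF "<value>$store</value>" "$conf_dir/nutch-site.xml"; then
        echo "data store consistent: $store"
    else
        echo "nutch-site.xml does not reference $store" >&2
        return 1
    fi
}

# Example (run from the Nutch checkout):
#   check_store_config conf
```

The ivy/ivy.xml dependency is not covered here, because the artifact name (for example gora-hbase) cannot be derived reliably from the class name.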

Load project in Eclipse

In Eclipse, click on “File” -> “Import...”

Select “Existing Projects into Workspace”

In the next window, set the root directory to the location where you checked out Nutch 2.x (or trunk). Click “Finish”.

You will now see a new project named 2.x (or trunk) being added in the workspace. Wait for a moment until Eclipse refreshes its SVN cache and builds its workspace. You can see the status at the bottom right corner of Eclipse.

In Package Explorer, right click on the project “2.x” (or trunk), select “Build Path” -> “Configure Build Path”

In the “Order and Export” tab, scroll down and select “2.x/conf” (or trunk/conf). Click the “Top” button. Sadly, Eclipse will build the workspace again, but this time it won’t take long.

Create Eclipse launcher

Now, let's get ready to run something. Let's start with the inject operation. Right click on the project in “Package Explorer” -> select “Run As” -> select “Run Configurations”. Create a new configuration. Name it "inject".

For 1.x (i.e. trunk): set the main class to org.apache.nutch.crawl.Injector
For 2.x: set the main class to org.apache.nutch.crawl.InjectorJob

In the arguments tab, for program arguments, provide the path of the input directory which has seed urls. Set VM Arguments to “-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log”

Click "Apply" and then click "Run". If everything is set up correctly, you should see the inject operation progressing in the console.
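The launcher above takes the path of an input directory holding the seed URLs as its program argument. A minimal sketch for creating such a directory from the command line follows. The function name make_seed_dir and the file name seed.txt are assumptions made for illustration; the layout assumed here is a directory of plain text files with one URL per line.

```shell
# Create a seed directory with one URL per line in seed.txt.
# make_seed_dir and seed.txt are hypothetical names for this sketch.
# Usage: make_seed_dir DIR URL [URL...]
make_seed_dir() {
    if [ "$#" -lt 2 ]; then
        echo "usage: make_seed_dir DIR URL [URL...]" >&2
        return 1
    fi
    seed_dir="$1"
    shift
    mkdir -p "$seed_dir" || return 1
    # printf repeats its format for every remaining argument,
    # so each URL ends up on its own line.
    printf '%s\n' "$@" > "$seed_dir/seed.txt"
}

# Example (then pass "urls" as the program argument in the launcher):
#   make_seed_dir urls http://nutch.apache.org/
```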

If you want to find out the java class corresponding to any command, just peek inside "src/bin/nutch" script and at the bottom you would find a switch case with a case corresponding to each command. Here are the important classes corresponding to the crawl cycle:

Operation	Class in Nutch 1.x (i.e.trunk)	Class in Nutch 2.x
inject	org.apache.nutch.crawl.Injector	org.apache.nutch.crawl.InjectorJob
generate	org.apache.nutch.crawl.Generator	org.apache.nutch.crawl.GeneratorJob
fetch	org.apache.nutch.fetcher.Fetcher	org.apache.nutch.fetcher.FetcherJob
parse	org.apache.nutch.parse.ParseSegment	org.apache.nutch.parse.ParserJob
updatedb	org.apache.nutch.crawl.CrawlDb	org.apache.nutch.crawl.DbUpdaterJob

Debug Nutch in Eclipse

Set breakpoints and debug a crawl
It can be tricky to find out where to set the breakpoint, because of the Hadoop jobs.
Here are a few good places to set breakpoints in the 1.x codebase:

  Fetcher [line: 1115] - run
  Fetcher [line: 530] - fetch
  Fetcher$FetcherThread [line: 560] - run()
  Generator [line: 443] - generate
  Generator$Selector [line: 108] - map
  OutlinkExtractor [line: 71 & 74] - getOutlinks

Here are a few good places to set breakpoints in the 2.x codebase:

  FetcherReducer$FetcherThread run() : line 487 : LOG.info("fetching " + fit.url ....
                                     : line 519 : final ProtocolStatus status = output.getStatus();
  GeneratorMapper : map() : line 53
  GeneratorReducer : reduce() : line 53
  OutlinkExtractor : getOutlinks() : line 84

Remote Debugging in Eclipse

Create a new Debug Configuration of type Remote Java Application and remember the port (here: 37649).

Launch Nutch from the command line, but add options to use the Java Debugger JDWP Agent Library, e.g. from bash:

  % export NUTCH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:37649"
  % $NUTCH_HOME/bin/nutch parsechecker http://myurl.com/

The application will be suspended just after launch.

Now go to Eclipse, set appropriate breakpoints, and run the previously created Debug Configuration.

Instead of creating an extra launch configuration for every tool you want to debug, a single configuration is enough to debug any tool (parsechecker, indexchecker, URL filter, etc.), and even remotely (crawler/tool running on a server, Eclipse debugger running locally).
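Because the same JDWP option string is reused for every tool, it can be convenient to build it in one place. The sketch below assembles the NUTCH_OPTS value shown above from a port, host, and suspend flag. The function name jdwp_opts and its argument order are assumptions for illustration only.

```shell
# Build the JDWP agent option string used in NUTCH_OPTS.
# jdwp_opts is a hypothetical helper name.
# Usage: jdwp_opts [PORT] [HOST] [SUSPEND]
jdwp_opts() {
    port="${1:-37649}"
    host="${2:-localhost}"
    # suspend=y makes the JVM wait until the debugger attaches.
    suspend="${3:-y}"
    printf '%s\n' "-agentlib:jdwp=transport=dt_socket,server=y,suspend=${suspend},address=${host}:${port}"
}

# Example:
#   export NUTCH_OPTS="$(jdwp_opts 37649)"
#   "$NUTCH_HOME/bin/nutch" parsechecker http://myurl.com/
```

Passing n as the third argument lets the tool start immediately and accept a debugger connection later, which can be more practical for long-running jobs.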

Debugging and Timeouts

Debugging takes time, esp. when inspecting variables, stack traces, etc. Usually too much time, so that some timeout will apply and stop the application. Set timeouts in the nutch-site.xml used for debugging to a rather high value (or -1 for unlimited), e.g., when debugging the parser:

  <property>
    <name>parser.timeout</name>
    <value>-1</value>
  </property>

Display Javadoc for Dependent Libraries

Eclipse is able to show Javadocs immediately, not only for Nutch classes but also for dependent libraries. While Eclipse takes the Javadocs of Nutch classes directly from the source files, this is not the case for dependent Ivy-managed libraries. There are two ways to tell Eclipse where to find the Javadocs of dependent libraries: (1) adding the Javadoc URL to a jar file, or (2) using the IvyDE Eclipse plugin. Note that both ways will modify the file .classpath. Because the ant eclipse target overwrites the .classpath file, you should back it up beforehand and merge the changes made via Eclipse back in afterwards.
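The backup-and-merge advice above can be sketched in shell as follows. The function name backup_classpath and the timestamped backup file name are assumptions for illustration; the merge itself is left to a diff that you review by hand.

```shell
# Copy PROJECT_DIR/.classpath to a timestamped backup and print its path.
# backup_classpath is a hypothetical helper name.
# Usage: backup_classpath [PROJECT_DIR]
backup_classpath() {
    project_dir="${1:-.}"
    src="$project_dir/.classpath"
    if [ ! -f "$src" ]; then
        echo "no .classpath in $project_dir" >&2
        return 1
    fi
    backup="$src.bak.$(date +%Y%m%d%H%M%S)"
    # -p preserves the modification time of the original file.
    cp -p "$src" "$backup" && printf '%s\n' "$backup"
}

# Example (run from the Nutch checkout):
#   saved=$(backup_classpath .) && ant eclipse
#   diff -u "$saved" .classpath    # shows what needs merging back
```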

Connect a Library to the Javadoc URL

The simplest way to connect a jar library with its Javadocs is to add the Javadoc URL manually in the classpath editor.

IvyDE

The Nutch build system delegates the management of library dependencies to Apache Ivy. There is an Eclipse plugin, IvyDE, that integrates Ivy's dependency management. It is well documented, including a description of how to add the managed libraries to the Eclipse project. The main Ivy file is ivy/ivy.xml, but note that every plugin has its own ivy.xml. If you are working on a specific plugin, it is a good idea to add its ivy.xml as well. It is possible to use IvyDE in addition to the libraries placed in .classpath by ant eclipse.

The repository hosting a library often also provides packages containing javadoc and sources. E.g., the JUnit repository https://repo1.maven.org/maven2/junit/junit/4.11/ provides the following files:

  junit-4.11-javadoc.jar    14-Nov-2012 19:21    379344
  junit-4.11-sources.jar    14-Nov-2012 19:21    151329
  junit-4.11.jar            14-Nov-2012 19:21    245039
  junit-4.11.pom            14-Nov-2012 19:21      2344

IvyDE is then able to fetch the javadoc and source packages as well (if provided) and show them in Eclipse. Again, there is an excellent description of how this can be enabled in the Source/Javadoc Mapping section of the Ivy preferences. Note that the Ivy cache (usually ~/.ivy/cache/) must be cleaned before Ivy Resolve is called from Eclipse.
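Cleaning the Ivy cache can be done from a terminal. The sketch below empties a cache directory that you pass in explicitly, so nothing is deleted by accident. The function name clean_ivy_cache is hypothetical. The text above names ~/.ivy/cache/, but many Ivy 2 installations keep the cache under ~/.ivy2/cache instead, so check which directory exists on your machine before running it.

```shell
# Remove everything inside an Ivy cache directory, keeping the directory.
# clean_ivy_cache is a hypothetical helper name.
# Usage: clean_ivy_cache CACHE_DIR
clean_ivy_cache() {
    cache_dir="$1"
    if [ -z "$cache_dir" ] || [ ! -d "$cache_dir" ]; then
        echo "not a directory: ${cache_dir:-<empty>}" >&2
        return 1
    fi
    # -mindepth 1 skips the cache directory itself; -delete removes
    # files first and then the emptied subdirectories.
    find "$cache_dir" -mindepth 1 -delete
}

# Example (adjust the path to your installation):
#   clean_ivy_cache "$HOME/.ivy2/cache"
```

The next Ivy Resolve then downloads the artifacts again, so expect it to take longer than usual.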

Troubleshooting

eclipse: Cannot create project content in workspace

The Nutch source code must be outside the workspace folder. Alternatively, you can check out the code with Eclipse (SVN) into your workspace rather than creating the project from existing code; Eclipse sometimes refuses to create a project from source code that already sits inside the workspace.

Plugin directory not found

Make sure you set your plugin.folders property correctly. Instead of a relative path you can also use an absolute one, either in nutch-default.xml or, even better, in nutch-site.xml. Ideally, every effort should be made to keep nutch-default.xml completely intact.

  <property>
    <name>plugin.folders</name>
    <value>/home/....../trunk/src/plugin</value>
  </property>
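To confirm that plugin.folders really points at an existing directory, a quick check from the shell can help. The sketch below is line-oriented and not a real XML parser: it assumes <name> and <value> each sit on a single line, that the property is not commented out, and that plugin.folders holds a single directory. The function names get_property and check_plugin_folders are hypothetical.

```shell
# Print the <value> of the named property from a Hadoop-style XML config.
# get_property is a hypothetical helper; it matches text line by line.
# Usage: get_property CONF_FILE PROPERTY_NAME
get_property() {
    awk -v want="$2" '
        /<name>/ {
            # Remember the most recent property name seen.
            name = $0
            sub(/.*<name>[ \t]*/, "", name)
            sub(/[ \t]*<\/name>.*/, "", name)
        }
        /<value>/ && name == want {
            value = $0
            sub(/.*<value>[ \t]*/, "", value)
            sub(/[ \t]*<\/value>.*/, "", value)
            print value
            exit
        }
    ' "$1"
}

# Check that plugin.folders is set and names an existing directory.
# A relative value is resolved against the current working directory.
# Usage: check_plugin_folders [CONF_FILE]
check_plugin_folders() {
    conf_file="${1:-conf/nutch-site.xml}"
    dir=$(get_property "$conf_file" plugin.folders)
    if [ -z "$dir" ]; then
        echo "plugin.folders is not set in $conf_file" >&2
        return 1
    fi
    if [ ! -d "$dir" ]; then
        echo "plugin.folders points to a missing directory: $dir" >&2
        return 1
    fi
    echo "plugin.folders OK: $dir"
}

# Example (run from the Nutch checkout):
#   check_plugin_folders conf/nutch-site.xml
```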

No plugins loaded during unit tests in Eclipse

During unit testing, Eclipse ignores conf/nutch-site.xml in favor of src/test/nutch-site.xml, so you might need to add the plugin directory configuration to that file as well.

Debugging Hadoop classes

Sometimes (fairly often) it makes sense to also have the Hadoop classes available during debugging. This should really be second nature, as Nutch relies heavily on the underlying Hadoop infrastructure. You can therefore check out the Hadoop sources into your Eclipse IDE and debug both together. You can:

Check out the Hadoop version that is used within Nutch trunk.
Configure a Hadoop project similar to the Nutch project within your Eclipse IDE. See this.
Add the Hadoop project as a dependent project of the Nutch project.
You can now also set breakpoints within Hadoop classes such as InputFormat implementations.

Non-ported Plugins to 2.x

A few plugins have not yet been ported to the Nutch 2.x series. If you are following the above tutorial to build Nutch 2.x, please check Nutch2Plugins for more information.

Reposted from: https://my.oschina.net/cjun/blog/405084
