I had generated many indexes for individual sites using Nutch and was looking for a way to merge them all into one index for use in a live system. I struggled with the merge for a while, but finally got it working. Here are the steps.
Let's say you have two working crawls, crawl1 and crawl2. I am assuming that each of them has the following directories, generated by the bin/nutch crawl command:
crawldb
index
indexes
linkdb
segments
Make sure you are in the parent directory that contains the folders "crawl1" and "crawl2".
You need to merge the individual dbs, i.e. linkdb, crawldb and segments, and then generate the index.
Create a directory called mergeaall. This directory will contain the merged linkdb, crawldb and segments.
- Merge linkdbs
bin/nutch mergelinkdb mergeaall/linkdb crawl1/linkdb/ crawl2/linkdb/
- Merge crawldbs
bin/nutch mergedb mergeaall/crawldb crawl1/crawldb/ crawl2/crawldb/
- Merge segments
bin/nutch mergesegs mergeaall/segments crawl1/segments/* crawl2/segments/*
- Invertlinks
bin/nutch invertlinks mergeaall/linkdb/ mergeaall/segments/*
Now run the index command to create the Nutch index:
bin/nutch index mergeaall/indexes mergeaall/crawldb/ mergeaall/linkdb/ mergeaall/segments/*
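The steps above can be collected into one small script. This is only a rough sketch: the names NUTCH, OUT, DRY_RUN, run and merge_crawls are made up for illustration and are not part of Nutch. It assumes the Nutch 0.9 indexer takes its arguments in the order index, crawldb, linkdb, segments, which is what the "Indexer: linkdb:" log line in the comments below suggests. By default it only prints the commands, so you can check them before running anything.

```shell
#!/bin/sh
# Sketch: merge two Nutch crawl directories into $OUT and rebuild the index.
# NUTCH, OUT, DRY_RUN, run and merge_crawls are illustrative names, not part of Nutch.
NUTCH="${NUTCH:-bin/nutch}"
OUT="${OUT:-mergeaall}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, anything else = run them

# Print a command, then execute it unless DRY_RUN is 1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# Usage: merge_crawls <crawl dir 1> <crawl dir 2>
# Assumed indexer argument order: index dir, crawldb, linkdb, segments.
# Each step runs only if the previous one succeeded.
merge_crawls() {
  a=$1
  b=$2
  run mkdir -p "$OUT" &&
  run "$NUTCH" mergelinkdb "$OUT/linkdb" "$a/linkdb" "$b/linkdb" &&
  run "$NUTCH" mergedb "$OUT/crawldb" "$a/crawldb" "$b/crawldb" &&
  run "$NUTCH" mergesegs "$OUT/segments" "$a"/segments/* "$b"/segments/* &&
  run "$NUTCH" invertlinks "$OUT/linkdb" "$OUT"/segments/* &&
  run "$NUTCH" index "$OUT/indexes" "$OUT/crawldb" "$OUT/linkdb" "$OUT"/segments/*
}
```

After sourcing the script, "merge_crawls crawl1 crawl2" prints the six commands, and "DRY_RUN=0 merge_crawls crawl1 crawl2" actually runs them from the parent directory.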
That's it! You are done.
Check out the search facility at www.ajaxtrend.com, where I have merged a couple of Nutch indexes.
August 28, 2008 at 8:19 pm
Good stuff on index merging. My question is about the last step, where we have to index all the fetched URLs again. If crawl1 and crawl2 are both large, this takes a long time. Is there any way to avoid indexing again? I tried:
nutch merge mergeaall/index crawl1/indexes crawl2/indexes
It seems to work at first glance, because we get some search results. But when I click "cached" to see the detailed content for a URL, it returns an error.
November 18, 2008 at 7:42 am
I get an error when running:
$ bin/nutch index new/indexes new/linkdb/ new/crawldb/ new/segments/*
Indexer: starting
Indexer: linkdb: new/crawldb
Indexer: adding segment: new/segments/20081118093959
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /opt/nutch-0.9/new/segments/20081118093959/crawl_fetch
Input path doesnt exist : /opt/nutch-0.9/new/segments/20081118093959/parse_data
Input path doesnt exist : /opt/nutch-0.9/new/segments/20081118093959/parse_text
at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
at org.apache.nutch.indexer.Indexer.run(Indexer.java:295)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:278)
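One way to narrow this down before running the indexer might be to check whether each segment actually contains the three directories named in the error (crawl_fetch, parse_data, parse_text). The helper below is just a sketch with a made-up name, check_segments, and it only tests that the directories exist on the local filesystem. It says nothing about why they might be missing.

```shell
#!/bin/sh
# Sketch: list segments that lack the directories named in the Indexer error.
# check_segments is a hypothetical helper, not a Nutch command.
# Usage: check_segments new/segments/*
check_segments() {
  status=0
  for seg in "$@"; do
    for part in crawl_fetch parse_data parse_text; do
      if [ ! -d "$seg/$part" ]; then
        echo "missing: $seg/$part"
        status=1
      fi
    done
  done
  # Returns 0 only if every segment has all three directories.
  return $status
}
```

If "check_segments new/segments/*" prints nothing and returns 0, the input paths the indexer looks for are at least present.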