How to merge nutch indexes v 0.9

By ajaxtrend

I generated many indexes for individual sites using Nutch and was looking for a way to merge them all into one index for use in a live system. I struggled with this for a while before finally getting it to work. Here are the steps.

Let's say you have two working crawls, crawl1 and crawl2. I am assuming that each of them has the following directories, generated by the bin/nutch crawl command:

crawldb

index

indexes

linkdb

segments
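
Before merging, it may help to confirm that each crawl directory really has the pieces the merge reads from. Here is a minimal sketch, assuming a POSIX shell; check_crawl_dir is a hypothetical helper written for this post, not part of Nutch, and it only looks for crawldb, linkdb and segments because those are the directories the merge steps take as input.

```shell
#!/bin/sh
# Sketch only: check_crawl_dir is a hypothetical helper, not a Nutch command.
# It prints every expected sub-directory that is missing from a crawl
# directory and returns non-zero if at least one is missing.
check_crawl_dir() {
    dir=$1
    missing=0
    for sub in crawldb linkdb segments; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing: $dir/$sub"
            missing=1
        fi
    done
    return $missing
}
```

From the parent directory you could call it as `check_crawl_dir crawl1 && check_crawl_dir crawl2` and only continue if both succeed.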

Now change to the parent directory that contains the folders "crawl1" and "crawl2".

You need to merge the individual databases, i.e. linkdb, crawldb and segments, and then generate the index.

Please create a directory called mergeaall. This directory will contain the merged linkdb, crawldb and segments.

- Merge linkdbs
bin/nutch mergelinkdb mergeaall/linkdb crawl1/linkdb/ crawl2/linkdb/

- Merge crawldbs

bin/nutch mergedb mergeaall/crawldb crawl1/crawldb/ crawl2/crawldb/

- Merge segments

bin/nutch mergesegs mergeaall/segments crawl1/segments/* crawl2/segments/*

- Invertlinks

bin/nutch invertlinks mergeaall/linkdb/ mergeaall/segments/*

Now run the index command to create the Nutch index:

bin/nutch index mergeaall/indexes mergeaall/crawldb/ mergeaall/linkdb/ mergeaall/segments/*

That's it! You are done.

Check out the search facility at www.ajaxtrend.com, where I have merged a couple of Nutch indexes.
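
The steps above can be strung together in one small script. This is only a sketch under a few assumptions: NUTCH, DRY_RUN, run and merge_crawls are names I made up for it, not Nutch features; paths must not contain spaces; and the index step passes the crawldb before the linkdb, which is the argument order I understand the Nutch 0.9 indexer to expect (index, crawldb, linkdb, segments). Set DRY_RUN=1 first to print the commands without running them.

```shell
#!/bin/sh
# Sketch only: merge several Nutch 0.9 crawl directories and re-index.
# NUTCH, DRY_RUN, run and merge_crawls are hypothetical names for this sketch.
NUTCH=${NUTCH:-bin/nutch}

# Print a command, then execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Usage: merge_crawls <output_dir> <crawl_dir> [<crawl_dir> ...]
merge_crawls() {
    out=$1
    shift
    linkdbs=""
    crawldbs=""
    segs=""
    for c in "$@"; do
        linkdbs="$linkdbs $c/linkdb"
        crawldbs="$crawldbs $c/crawldb"
        segs="$segs $c/segments/*"
    done
    # The lists below are left unquoted on purpose so that the shell
    # splits them into separate arguments and expands the segment globs.
    run mkdir -p "$out" || return 1
    run "$NUTCH" mergelinkdb "$out/linkdb" $linkdbs || return 1
    run "$NUTCH" mergedb "$out/crawldb" $crawldbs || return 1
    run "$NUTCH" mergesegs "$out/segments" $segs || return 1
    run "$NUTCH" invertlinks "$out/linkdb" "$out"/segments/* || return 1
    run "$NUTCH" index "$out/indexes" "$out/crawldb" "$out/linkdb" "$out"/segments/* || return 1
}
```

For the example in this post you would run `merge_crawls mergeaall crawl1 crawl2` from the parent directory. Each step stops the function if the previous command fails, so a broken merge does not go on to the indexing step.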

2 Responses to “How to merge nutch indexes v 0.9”

  1. tigertail Says:

    Good stuff for index merging. My question is about the last step: we have to index all fetched URLs again. If crawl1 and crawl2 are both large, that takes a long time. Is there any way to avoid indexing again? I tried

    nutch merge mergeaall/index crawl1/indexes crawl2/indexes

    It seems to work at first glance, because we get some search results. But when I click "cached" to see the detailed content for a URL, it returns an error.

  2. Kimvais Says:

    I get an error when running:

    $ bin/nutch index new/indexes new/linkdb/ new/crawldb/ new/segments/*
    Indexer: starting
    Indexer: linkdb: new/crawldb
    Indexer: adding segment: new/segments/20081118093959
    Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /opt/nutch-0.9/new/segments/20081118093959/crawl_fetch
    Input path doesnt exist : /opt/nutch-0.9/new/segments/20081118093959/parse_data
    Input path doesnt exist : /opt/nutch-0.9/new/segments/20081118093959/parse_text
    at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
    at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
    at org.apache.nutch.indexer.Indexer.run(Indexer.java:295)
    at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
    at org.apache.nutch.indexer.Indexer.main(Indexer.java:278)
