Hadoop is an open-source framework provided by Apache to process and analyze extremely large volumes of data. It is written in Java and is currently used by Google, Facebook, LinkedIn, Yahoo, Twitter, and others.
With the advent of new technologies, devices, and communication channels such as social networking sites, the amount of data produced by humankind is growing rapidly every year. The amount of data produced from the beginning of time until 2003 was about 5 billion gigabytes. If you piled that data up as disks, it could fill an entire football field. The same amount was being created every two days by 2011, and in even shorter intervals by 2013, and the rate is still growing enormously. Although this information can be valuable when processed, much of it is being neglected.
The Hadoop File System was developed following a distributed file system design and runs on commodity hardware. Unlike some other distributed systems, HDFS is highly fault tolerant while being built from low-cost hardware.
HDFS holds a very large amount of data and provides easy access to it. To store such huge data, files are spread across multiple machines and kept in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
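As a minimal sketch of how a client interacts with HDFS (the NameNode address hdfs://namenode:9000 and the path /user/demo/sample.txt below are placeholder assumptions), the standard org.apache.hadoop.fs.FileSystem API can be used to write a file, which HDFS stores as replicated blocks across DataNodes, and to read it back:
[java]import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; adjust to your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/sample.txt");

            // Write a small file; HDFS stores it as replicated blocks across DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back and copy the contents to stdout.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}[/java]
The same operations are also available from the command line, for example with hadoop fs -put and hadoop fs -cat.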
Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
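For illustration, the classic word-count example (adapted here as a minimal sketch of the standard Hadoop tutorial) shows what a mapper and reducer look like: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word.
[java]import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: emit (word, 1) for every word in its input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: sum the counts for each word after the framework sorts and groups by key.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}[/java]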
A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing any that fail. The following is the custom OutputFormat used in this example.
[java]import java.io.File;
import java.io.IOException;

import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HdfsSyncingLocalFileOutputFormat<K, V> extends FileOutputFormat<K, V> {

    public static final String PARAMETER_LOCAL_SCRATCH_PATH = "param.localScratchPath";

    private HdfsSyncingLocalFileOutputCommitter committer;

    @Override
    public synchronized OutputCommitter getOutputCommitter(TaskAttemptContext context) throws IOException {
        if (committer == null) {
            // Create the temporary scratch directory on the local file system and pass it to the committer.
            File localScratchPath = new File(
                    context.getConfiguration().get(PARAMETER_LOCAL_SCRATCH_PATH)
                    + File.separator + "scratch"
                    + File.separator + context.getTaskAttemptID().toString()
                    + File.separator);
            committer = new HdfsSyncingLocalFileOutputCommitter(localScratchPath, super.getOutputPath(context), context);
        }
        return committer;
    }

    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException {
        // No records are written through this writer; the committer syncs the locally written files to HDFS.
        return new RecordWriter<K, V>() {
            @Override
            public void close(TaskAttemptContext context) throws IOException, InterruptedException { }

            @Override
            public void write(K key, V val) throws IOException, InterruptedException { }
        };
    }
}[/java]
The format relies on an OutputCommitter to handle the actual syncing. It reads from the Configuration the root directory on each node where the Lucene index is stored and passes it to the committer. The configuration parameter is set by the job driver class as follows.
[java]getConf().set(HdfsSyncingLocalFileOutputFormat.PARAMETER_LOCAL_SCRATCH_PATH, localScratchPath); [/java]
The localScratchPath variable can be initialized from anywhere in the driver class; in our case, it was read as a command-line parameter.
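As a rough sketch of such a driver (the class name IndexJobDriver, the argument order, and the job wiring are assumptions for illustration, not the original driver), the path can be taken from the command line and pushed into the job configuration before the job is submitted:
[java]import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class IndexJobDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Hypothetical argument order: <input> <output> <localScratchPath>
        String localScratchPath = args[2];
        getConf().set(HdfsSyncingLocalFileOutputFormat.PARAMETER_LOCAL_SCRATCH_PATH, localScratchPath);

        Job job = Job.getInstance(getConf(), "index-build");
        job.setJarByClass(IndexJobDriver.class);
        job.setOutputFormatClass(HdfsSyncingLocalFileOutputFormat.class);
        // Mapper, reducer, and key/value classes would be configured here as needed.

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new IndexJobDriver(), args));
    }
}[/java]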
The write() and close() methods on the RecordWriter in the OutputFormat are empty, because no actual data is written to HDFS from the OutputFormat itself. The data is side-loaded by the OutputCommitter.
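The committer itself does not appear in the listing above. A hypothetical sketch of the idea (the actual HdfsSyncingLocalFileOutputCommitter may differ) is shown below: each task writes its output, such as a Lucene index, into the local scratch directory, and on task commit the committer copies those files up to the job's HDFS output path.
[java]import java.io.File;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

// Hypothetical sketch only: tasks build output (e.g. a Lucene index) in a local
// scratch directory, and the committer syncs it to HDFS when a task commits.
public class HdfsSyncingLocalFileOutputCommitter extends FileOutputCommitter {

    private final File localScratchPath;
    private final Path hdfsOutputPath;

    public HdfsSyncingLocalFileOutputCommitter(File localScratchPath, Path hdfsOutputPath,
            TaskAttemptContext context) throws IOException {
        super(hdfsOutputPath, context);
        this.localScratchPath = localScratchPath;
        this.hdfsOutputPath = hdfsOutputPath;
    }

    @Override
    public void commitTask(TaskAttemptContext context) throws IOException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        File[] localFiles = localScratchPath.listFiles();
        if (localFiles != null) {
            // Side-load each locally written file into the job's HDFS output directory.
            for (File localFile : localFiles) {
                fs.copyFromLocalFile(new Path(localFile.getAbsolutePath()),
                        new Path(hdfsOutputPath, localFile.getName()));
            }
        }
        super.commitTask(context);
    }

    @Override
    public void abortTask(TaskAttemptContext context) throws IOException {
        // Discard any partial local output if the task attempt fails.
        deleteRecursively(localScratchPath);
        super.abortTask(context);
    }

    private static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        f.delete();
    }
}[/java]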