MapReduce: A ReduceJoin Example

How Reduce Join Works

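Main job of the map side: tag the key/value pairs coming from different tables or files so that records from different sources can be told apart. Each record is then emitted with the join field as the key and the remaining fields plus the new tag as the value.
Main job of the reduce side: by the time data reaches the reduce side, grouping by the join field is already done. Within each group we only need to separate the records by source file (using the tag added in the map phase) and then merge them.
Drawback: every record from both inputs has to travel from the map side to the reduce side, so the shuffle phase carries a large volume of data and the join is slow.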
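The tag, group, and merge steps described above can be sketched in plain Java without Hadoop. This is only an illustration: the class `ReduceJoinSketch`, the `Tagged` holder, the in-memory `TreeMap` standing in for the shuffle, and the romanized product names are all assumptions made for this example, not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceJoinSketch {

    // A record tagged with the name of the source it came from (hypothetical holder).
    private static final class Tagged {
        final String source;
        final String[] fields;

        Tagged(String source, String[] fields) {
            this.source = source;
            this.fields = fields;
        }
    }

    // "Map side": key the record by its join field (pid) and tag it with its source.
    private static void mapLine(String source, String line, Map<String, List<Tagged>> shuffle) {
        String[] fields = line.split("\t");
        String pid = source.equals("order") ? fields[1] : fields[0];
        shuffle.computeIfAbsent(pid, k -> new ArrayList<>()).add(new Tagged(source, fields));
    }

    // "Reduce side": inside one pid group, separate records by tag, then merge them.
    private static List<String> reduceGroup(List<Tagged> group) {
        String pname = null;
        List<String[]> orders = new ArrayList<>();
        for (Tagged t : group) {
            if (t.source.equals("order")) {
                orders.add(t.fields);
            } else {
                pname = t.fields[1];
            }
        }
        List<String> out = new ArrayList<>();
        for (String[] o : orders) {
            out.add(o[0] + "\t" + pname + "\t" + o[2]);
        }
        return out;
    }

    public static List<String> join(List<String> orderLines, List<String> pdLines) {
        // The TreeMap plays the role of the shuffle: it groups records by pid.
        Map<String, List<Tagged>> shuffle = new TreeMap<>();
        for (String line : orderLines) mapLine("order", line, shuffle);
        for (String line : pdLines) mapLine("pd", line, shuffle);

        List<String> out = new ArrayList<>();
        for (List<Tagged> group : shuffle.values()) {
            out.addAll(reduceGroup(group));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> joined = join(
                Arrays.asList("1001\t01\t1", "1002\t02\t2"),
                Arrays.asList("01\tXiaomi", "02\tHuawei"));
        joined.forEach(System.out::println);
    }
}
```

Note that every input line ends up in the `shuffle` map before any joining happens, which mirrors the drawback: all of both inputs crosses the shuffle.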

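Hands-on Example

Requirements Analysis

Use the join condition as the map output key, so that records from both tables that satisfy the join condition, each carrying the name of the file it came from, are sent to the same ReduceTask. The records are then joined in the reduce phase.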

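MR Analysis

The replacement only works if all records with the same pid are sent to the same partition.

Partition 0: 1001 01 1 01 小米

Partition 1: 1002 02 2 02 华为

Notes:

Partition by pid!
Both kinds of data go through the map() method of the same Mapper class, so map() must determine where the current split comes from and wrap the data differently depending on the source.
A single MapTask processes only one split, and therefore only one kind of data, so the join cannot be completed in the map phase and has to be done in reduce.
In the map phase, wrap the data. The custom Bean must be able to hold every field from both kinds of split.
At reduce output time, only the records from order.txt are written, with pid replaced by pname. Not every key-value pair needs to be output.
In the map phase, tag the data to mark which key-value pairs belong to order.txt and which belong to pd.txt.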

order.txt ----> split (orderId, pid, amount) ----> JoinMapper.map() ----> JoinReducer
pd.txt ----> split (pid, pname) ----> JoinMapper.map() ----> JoinReducer

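MR Implementation

Mapper: keyin-valuein: LongWritable-Text; map(): tag each record with its source file and wrap it in a JoinBean; keyout-valueout: NullWritable-JoinBean

Reducer: keyin-valuein: NullWritable-JoinBean; reduce(): classify the beans by source, then replace pid with pname in the order records; keyout-valueout: NullWritable-JoinBean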

ReduceJoin

ReduceJoin performs the join in the reduce phase, so it becomes slow once the data volume grows. MapJoin, covered later, is one way to address this inefficiency.
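As a rough preview of the MapJoin idea, the sketch below loads the small table into an in-memory map and looks up each order record against it, so no grouping step is needed. It is plain Java, not Hadoop code; the class name `MapJoinSketch`, the inner-join behaviour for unmatched pids, and the romanized product names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapJoinSketch {

    public static List<String> join(List<String> orderLines, List<String> pdLines) {
        // Load the small table (pid -> pname) into memory once.
        Map<String, String> pidToPname = new HashMap<>();
        for (String line : pdLines) {
            String[] f = line.split("\t");
            pidToPname.put(f[0], f[1]);
        }

        // Stream the large table and join each record by lookup.
        List<String> out = new ArrayList<>();
        for (String line : orderLines) {
            String[] f = line.split("\t");
            String pname = pidToPname.get(f[1]);
            if (pname != null) { // assumption: orders without a matching product are dropped
                out.add(f[0] + "\t" + pname + "\t" + f[2]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        join(Arrays.asList("1001\t01\t1", "1002\t02\t2"),
             Arrays.asList("01\tXiaomi", "02\tHuawei"))
            .forEach(System.out::println);
    }
}
```

This approach only fits when one side of the join is small enough to hold in memory.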

Code Implementation

Create the Bean class that holds the merged product and order data, JoinBean.java

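import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class JoinBean implements Writable{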
	
	private String orderId;
	private String pid;
	private String pname;
	private String amount;
	private String source;
	
	

	@Override
	public String toString() {
		// Fields are tab-separated in the output
		return orderId + "\t" + pname + "\t" + amount;
	}

	public String getOrderId() {
		return orderId;
	}

	public void setOrderId(String orderId) {
		this.orderId = orderId;
	}

	public String getPid() {
		return pid;
	}

	public void setPid(String pid) {
		this.pid = pid;
	}

	public String getPname() {
		return pname;
	}

	public void setPname(String pname) {
		this.pname = pname;
	}

	public String getAmount() {
		return amount;
	}

	public void setAmount(String amount) {
		this.amount = amount;
	}

	public String getSource() {
		return source;
	}

	public void setSource(String source) {
		this.source = source;
	}

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeUTF(orderId);
		out.writeUTF(pid);
		out.writeUTF(pname);
		out.writeUTF(amount);
		out.writeUTF(source);
		
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		orderId=in.readUTF();
		pid=in.readUTF();
		pname=in.readUTF();
		amount=in.readUTF();
		source=in.readUTF();
		
	}

}

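Write the Mapper class, ReduceJoinMapper.java

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/*
 * The join cannot be completed in the map phase. The Mapper only wraps the data;
 * the join itself is done in the reduce phase.
 * 
 * 1. order.txt: 1001	01	1
 * 	 pd.txt :  01 小米
 * 
 * 2. The Bean must be able to hold all of the data
 * 
 * 3. The Reducer only outputs records from order.txt, so the Mapper must tag each record with its source
 * 
 * 4. The Mapper needs to find out where the current split comes from and wrap the data accordingly
 */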
public class ReduceJoinMapper extends Mapper<LongWritable, Text, NullWritable, JoinBean>{
	
	private NullWritable out_key=NullWritable.get();
	private JoinBean out_value=new JoinBean();
	private String source;
	
	// setup() runs once per MapTask, before the first call to map()
	@Override
	protected void setup(Mapper<LongWritable, Text, NullWritable, JoinBean>.Context context)
			throws IOException, InterruptedException {
		
		InputSplit inputSplit = context.getInputSplit();
		
		FileSplit split=(FileSplit) inputSplit;
		
		source=split.getPath().getName();
	}
	
	@Override
	protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, NullWritable, JoinBean>.Context context)
			throws IOException, InterruptedException {
		
		// Input fields are tab-separated
		String[] words = value.toString().split("\t");
		// Tag the record with its source file
		out_value.setSource(source);
		
		if (source.equals("order.txt")) {
			out_value.setOrderId(words[0]);
			out_value.setPid(words[1]);
			out_value.setAmount(words[2]);
			// Keep every field non-null
			out_value.setPname("nodata");
		}else {
			out_value.setPid(words[0]);
			out_value.setPname(words[1]);
			// Keep every field non-null
			out_value.setOrderId("nodata");
			out_value.setAmount("nodata");			
		}
		
		context.write(out_key, out_value);
	}
}
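The Mapper fills unused fields with "nodata" because JoinBean.write() serializes every field with writeUTF, and java.io.DataOutputStream.writeUTF throws a NullPointerException when given null. The sketch below shows this with the JDK classes alone; the class `WriteUtfNullSketch` and its helper `canWrite` are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WriteUtfNullSketch {

    // Returns true if the value can be serialized with writeUTF,
    // false if writeUTF rejects it with a NullPointerException.
    public static boolean canWrite(String value) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(value);
            return true;
        } catch (NullPointerException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("placeholder: " + canWrite("nodata"));
        System.out.println("null field:  " + canWrite(null));
    }
}
```

A placeholder string is one simple way around this; writing a presence flag before each optional field would be another, at the cost of a slightly more involved write()/readFields() pair.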

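Custom partitioner, MyPartitioner.java

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

/*
 * 1. Ensures that key-value pairs with the same pid go to the same partition
 */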
public class MyPartitioner extends Partitioner<NullWritable, JoinBean>{

	@Override
	public int getPartition(NullWritable key, JoinBean value, int numPartitions) {
		
		return (value.getPid().hashCode() & Integer.MAX_VALUE) % numPartitions;
	}

}
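The partitioning expression can be tried out on its own. The sketch below copies the arithmetic from getPartition() into a hypothetical helper, `PidPartitionSketch.partitionFor`, to show that equal pids always map to the same partition number and that the result stays within 0 to numPartitions - 1. Masking with Integer.MAX_VALUE clears the sign bit, so a negative hashCode() cannot produce a negative partition number.

```java
public class PidPartitionSketch {

    // Same arithmetic as MyPartitioner.getPartition(), applied directly to a pid string.
    public static int partitionFor(String pid, int numPartitions) {
        return (pid.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 2;
        for (String pid : new String[] {"01", "02", "03", "01"}) {
            System.out.println(pid + " -> partition " + partitionFor(pid, numPartitions));
        }
    }
}
```

Which partition a given pid lands in depends on its hash, so the numbering may differ from the illustrative layout in the MR analysis; what matters for the join is only that matching pids land together.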

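Write the Reducer class, JoinBeanReducer.java

import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Assumed to be Apache Commons BeanUtils: copyProperties(dest, orig)
import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

/*
 *  order.txt: 1001	01	1
 * 	 pd.txt :  01 小米
 *          orderid,pid,amount,source,pname
 * 1. (null,1001,01,1,order.txt,nodata)
 * (null,nodata,01,nodata,pd.txt,小米)
 * 
 * 2. Before output, the data must be classified by its source attribute.
 * 		This classification can only be done in reduce
 */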
public class JoinBeanReducer extends Reducer<NullWritable, JoinBean, NullWritable, JoinBean>{

	// Collections that hold the classified data
	private List<JoinBean> orderDatas=new ArrayList<>();
	private Map<String, String> pdDatas=new HashMap<>();
	
	// Classify by source
	@Override
	protected void reduce(NullWritable key, Iterable<JoinBean> values,
			Reducer<NullWritable, JoinBean, NullWritable, JoinBean>.Context arg2)
			throws IOException, InterruptedException {
		
		for (JoinBean value : values) {
			
			if (value.getSource().equals("order.txt")) {
				
				// Copy the fields of value into a new JoinBean,
				// because value is the same object throughout the loop; only its fields change on each iteration
				JoinBean joinBean = new JoinBean();
				
				try {
					BeanUtils.copyProperties(joinBean, value);
				} catch (IllegalAccessException e) {
					e.printStackTrace();
				} catch (InvocationTargetException e) {
					e.printStackTrace();
				}
				
				orderDatas.add(joinBean);
				
			}else {
				
				// The record comes from pd.txt
				pdDatas.put(value.getPid(), value.getPname());
				
			}
			
		}
		
	}
	
	// Join the data and write it out
	@Override
	protected void cleanup(Reducer<NullWritable, JoinBean, NullWritable, JoinBean>.Context context)
			throws IOException, InterruptedException {
		
		// Only the records in orderDatas are output
		for (JoinBean joinBean : orderDatas) {	
			// Look up pname in the Map by pid and set it on the bean
			joinBean.setPname(pdDatas.get(joinBean.getPid()));
			context.write(NullWritable.get(), joinBean);
		}
		
	}
}
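The copy made inside reduce() matters because the framework hands back the same value object on every iteration and only refills its fields. The plain-Java sketch below imitates that behaviour with a hypothetical iterator (`reusingIterable`) and a hypothetical `Holder` class, then compares storing the reference directly against storing a copy.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ValueReuseSketch {

    static final class Holder {
        String orderId;
    }

    // Hypothetical stand-in for the framework's value iterator:
    // next() refills and returns the SAME Holder instance every time.
    static Iterable<Holder> reusingIterable(final String... orderIds) {
        final Holder shared = new Holder();
        return () -> new Iterator<Holder>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < orderIds.length;
            }

            @Override
            public Holder next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.orderId = orderIds[i++];
                return shared;
            }
        };
    }

    // Stores the iterated reference directly: every list entry is the same object.
    public static List<String> collectWithoutCopy(String... ids) {
        List<Holder> kept = new ArrayList<>();
        for (Holder h : reusingIterable(ids)) {
            kept.add(h);
        }
        return orderIdsOf(kept);
    }

    // Copies the fields into a fresh object before storing, as the Reducer does.
    public static List<String> collectWithCopy(String... ids) {
        List<Holder> kept = new ArrayList<>();
        for (Holder h : reusingIterable(ids)) {
            Holder copy = new Holder();
            copy.orderId = h.orderId;
            kept.add(copy);
        }
        return orderIdsOf(kept);
    }

    private static List<String> orderIdsOf(List<Holder> holders) {
        List<String> out = new ArrayList<>();
        for (Holder h : holders) out.add(h.orderId);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectWithoutCopy("1001", "1002", "1003")); // [1003, 1003, 1003]
        System.out.println(collectWithCopy("1001", "1002", "1003"));    // [1001, 1002, 1003]
    }
}
```

Without the copy, every stored entry ends up showing the last record's data, which is why the Reducer creates a new JoinBean per order record.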

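Write the driver class, CustomIFDriver.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CustomIFDriver {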
	
	public static void main(String[] args) throws Exception {
		
		Path inputPath=new Path("e:/mrinput/reducejoin");
		Path outputPath=new Path("e:/mroutput/reducejoin");
		

		// Configuration for the whole Job
		Configuration conf = new Configuration();
		// Make sure the output directory does not exist
		FileSystem fs=FileSystem.get(conf);
		
		if (fs.exists(outputPath)) {
			
			fs.delete(outputPath, true);
			
		}
		
		// ① Create the Job
		Job job = Job.getInstance(conf);
		
		job.setJarByClass(CustomIFDriver.class);
		
		
		// Give the Job a name
		job.setJobName("reducejoin");
		
		// ② Configure the Job
		// Set the Mapper and Reducer classes and the key-value types they output
		job.setMapperClass(ReduceJoinMapper.class);
		job.setReducerClass(JoinBeanReducer.class);
		
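		// The Job prepares serializers based on the key-value types output by the Mapper and Reducer,
		// and uses them to serialize and deserialize the output key-value pairs.
		// If the Mapper and Reducer output the same key-value types, setting the Job's final output types is enough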
		job.setOutputKeyClass(NullWritable.class);
		job.setOutputValueClass(JoinBean.class);
		
		// Set the input and output directories
		FileInputFormat.setInputPaths(job, inputPath);
		FileOutputFormat.setOutputPath(job, outputPath);
		
		// Set the partitioner
		job.setPartitionerClass(MyPartitioner.class);
		
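		// When the data to join is very large (e.g. order.txt with 1 billion records, pd.txt with 1 million),
		// increase the parallelism of the MR job:
		// Map phase: change the split size; more splits means more MapTasks run
		// Reduce phase: change the number of ReduceTasks
		
		// The number of ReduceTasks can be set here. The default is 1, which writes a single output file.
		// In this example, setting it to 3 produces three files. With more than three, the extra files are empty.
		//job.setNumReduceTasks(3);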
		
		// ③ Run the Job
		job.waitForCompletion(true);
		
	}
}

Output: