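intel-hadoop/HiBench Workflow Analysis: The Bayes Workload as an Example
1. Overview of the HiBench workloads
HiBench includes 9 typical Hadoop workloads, grouped into micro benchmarks, HDFS benchmarks, web search benchmarks, machine learning benchmarks, and data analytics benchmarks.
For details, see Section 3 of "CDH Cluster Installation & Testing Summary".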
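- Micro benchmarks
  - Sort: generates data with Hadoop RandomTextWriter and sorts it.
  - Wordcount: counts the occurrences of each word in the input data; the input is generated with Hadoop RandomTextWriter.
  - TeraSort: the input is produced by Hadoop TeraGen and sorted by key.
- HDFS benchmarks
  - Enhanced DFSIO: tests the HDFS throughput of a Hadoop cluster by spawning a large number of tasks that issue read and write requests concurrently.
- Web search benchmarks
  - Nutch indexing: large-scale search engine indexing. This workload tests the indexing subsystem of Nutch (an open-source Apache search engine) on automatically generated web data, in which the links and words follow a Zipfian distribution (the number of occurrences of a word is inversely proportional to its rank in the frequency table).
  - PageRank: an implementation of the PageRank algorithm on Hadoop, run on automatically generated web data whose links follow a Zipfian distribution (for any term, the product of its frequency rank and its frequency is roughly constant).
- Machine learning benchmarks
  - Mahout Bayesian classification (bayes): large-scale machine learning. This workload tests the naive Bayes trainer in Mahout (an open-source Apache machine learning library). The input is automatically generated documents whose words follow a Zipfian distribution.
  - Mahout k-means clustering (kmeans): tests the k-means clustering algorithm in Mahout. The input data set is produced by GenKMeansDataset, based on uniform and Gaussian distributions.
- Data analytics benchmarks
  - Hive query benchmarks (hivebench): Hive queries that perform typical OLAP operations (aggregation and join) on automatically generated web data whose links follow a Zipfian distribution.
Note: the data generation programs are in the hadoop-mapreduce-examples-2.6.0 jar and can be inspected with a decompiler.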
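Several of the workloads above rely on Zipfian data. As a rough illustration of that property (this is not HiBench's generator), the sketch below builds an ideal Zipf frequency table and shows that rank times frequency stays constant. The vocabulary size and the total word count are made-up values chosen only for the example.

```python
def zipf_counts(vocab_size, total):
    """Expected count per rank under an ideal Zipf law (count proportional to 1/rank)."""
    harmonic = sum(1.0 / r for r in range(1, vocab_size + 1))
    return [total / (r * harmonic) for r in range(1, vocab_size + 1)]

# Hypothetical corpus: 1000 distinct words, one million word occurrences.
counts = zipf_counts(vocab_size=1000, total=1_000_000)

# rank * count is the same for every rank in the ideal case.
products = [rank * count for rank, count in enumerate(counts, start=1)]
print(round(products[0]), round(products[99]), round(products[999]))
```

Real generated data only approximates this law, so the products would scatter around a constant rather than match it exactly.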
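2. The Bayes workflow in HiBench
- The main flow is: configure the workloads, the language, and the DataSize under conf, then run run-all.sh under bin to complete one test. This is a manual process; a script can repeat these steps to run multiple tests with less manual work, e.g.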
#!/bin/bash
# Time: 20160930,created by sunfei
# Describe: automatically run HiBench
# Functions :
# search(): find the data scale profile in 99-user_defined_properties.conf, e.g. tiny, small..
# exec_application_noSQL(): run the application N times, without hive
# exec_application_SQL(): run the application N times, with hive
# save_result(): save the result of the application
# main_function(): the main function for running all the applications
# main(): the main function for running the different kinds of application
cpuLoad()
{
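cpu=$(grep -c 'model name' /proc/cpuinfo)
# the last field of uptime is the 15-minute load average
load_15=$(uptime | awk '{print $NF}')
# load per core; bc omits the leading zero for values below 1, so print it explicitly
average_load=$(echo "scale=2;a=${load_15}/${cpu};if(length(a)==scale(a)) print 0;print a" | bc)
date >> datetime-load.txt
# the value must be echoed; a bare ${average_load} would be executed as a command
echo "${average_load}" >> cpu-load.txt
# rebuild the combined file each time instead of appending duplicate rows
paste datetime-load.txt cpu-load.txt > load-day.txt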
}
search()
{
#config="/opt/HiBench/HiBench-master/conf/99-user_defined_properties.conf"
config=/usr/HiBench-master/conf/99-user_defined_properties.conf
sed -n '/hibench.scale.profile/p' "${config}" > hibench.txt
var=''
while read -r line
do
if [ "${line:0:13}" = "hibench.scale" ];then
echo -e "\033[32m match successful! \033[0m"
# the value starts after "hibench.scale.profile" plus one separator character
var=${line:22}
fi
done<"hibench.txt"
if [ "$var" = "${1}" ];then
echo -e "\033[31m The application already uses this data scale, do you want to continue? yes | no \033[0m"
read -p "Input your choice :" chose
if [ "${chose}" = "no" ];then
exit 1
else
echo -e "\033[32m The ${1} style of application will be run! \033[0m"
fi
fi
if [ -f "hibench.txt" ];then
rm -rf "hibench.txt"
echo -e "\033[32m The hibench.txt has been deleted! \033[0m"
fi
echo -e "\033[32m The application will run the ${1} style \033[0m"
# only rewrite the hibench.scale.profile line, not every line that contains the old value
sed -i "/hibench.scale.profile/s/${var}/${1}/" "${config}"
}
exec_application_noSQL()
{
var=0
for ((i=1;i<=${1};i++))
do
let "var=$i%1"
if [ "$var" -eq 0 ];then
hadoop fs -rm -r hdfs://archive.cloudera.com:8020/user/hdfs/.Trash/*
hadoop fs -rm -r hdfs://archive.cloudera.com:8020/HiBench/*
fi
echo -e "\033[32m **********************The current run is ********************: \033[0m" ${i}
#/opt/HiBench/HiBench-master/bin/run-all.sh
/usr/HiBench-master/bin/run-all.sh
echo -e "\033[32m ********************** Run ${i} has finished successfully! ********************: \033[0m"
done
echo -e "\033[32m *********The application has finished,please modify the configuration!***** \033[0m"
}
exec_application_SQL()
{
var=0
for ((i=1;i<=${1};i++))
do
echo "drop table uservisits;drop table uservisits_aggre;drop table rankings;drop table rankings_uservisits_join;drop table uservisits_copy;exit;" | /usr/bin/hive
let "var=$i%1"
if [ "$var" -eq 0 ];then
hadoop fs -rm -r hdfs://archive.cloudera.com:8020/user/hdfs/.Trash/*
hadoop fs -rm -r hdfs://archive.cloudera.com:8020/HiBench/*
fi
echo -e "\033[32m **********************The current run is ********************: \033[0m" ${i}
#/opt/HiBench/HiBench-master/bin/run-all.sh
/usr/HiBench-master/bin/run-all.sh
echo -e "\033[32m ********************** Run ${i} has finished successfully! ********************: \033[0m"
done
echo -e "\033[32m *********The application has finished,please modify the configuration!***** \033[0m"
}
save_result()
{
if [ -f result.txt ];then
rm -rf result.txt
echo -e "\033[32m The result.txt has been deleted! \033[0m"
fi
#select the words in the report
#filepath=/opt/HiBench/HiBench-master/report/hibench.report
filepath=/usr/HiBench-master/report/hibench.report
word=""
var1=`date +"%m/%d/%Y-%k:%M:%S"`
var2=${1}
var5=".txt"
var4=${var2}${var5}
case ${1} in
"aggregation")
word="JavaSparkAggregation"
;;
"join")
word="JavaSparkJoin"
;;
"scan")
word="JavaSparkScan"
;;
"kmeans")
word="JavaSparkKmeans"
;;
"pagerank")
word="JavaSparkPagerank"
;;
"sleep")
word="JavaSparkSleep"
;;
"sort")
word="JavaSparkSort"
;;
"wordcount")
word="JavaSparkWordcount"
;;
"bayes")
word="JavaSparkBayes"
;;
"terasort")
word="JavaSparkTerasort"
;;
*)
echo -e "\033[32m The name of application is wrong,please change it! \033[0m"
;;
esac
while read line
do
echo $line | sed -n "/${word}/p" >> ${var4}
done <$filepath
echo -e "\033[32m The job has finished! \033[0m"
}
main_function()
{
#Input the name of application need to exec
for appName in aggregation join scan pagerank sleep sort wordcount bayes terasort kmeans
do
#appConfig=/opt/HiBench/HiBench-master/conf/benchmarks.lst
appConfig=/usr/HiBench-master/conf/benchmarks.lst
echo "The name of application is :"${appName}
echo ${appName} > ${appConfig}
for style in tiny small large huge gigantic
do
search ${style}
if [ "aggregation" = ${appName} ] || [ "join" = ${appName} ] || [ "scan" = ${appName} ];then
exec_application_SQL ${1}
else
exec_application_noSQL ${1}
fi
done
save_result ${appName}
done
}
main()
{
# run the application
read -p "Input the times of exec: " times
if [ "${times}" -eq 0 -o "${times}" -gt 60 ];then
echo -e "\033[31m The number of runs can't be 0 or greater than 60! Do you want to continue? yes | no \033[0m"
read -p "Input your choice :" chose
if [ "${chose}" = "no" ];then
exit 1
else
echo -e "\033[32m The application will be run ${times} times ! \033[0m"
fi
fi
echo -e "\033[33m Select the style of application : \033[0m \033[31m All | Signal \033[0m"
read -p "Input your choice :" style
if [ "${style}" = "" ];then
echo -e "\033[31m The style of application can't be empty \033[0m"
exit 1
elif [ "${style}" != "All" -a "${style}" != "Signal" ];then
echo -e "\033[31m The style of application is wrong,please correct! \033[0m"
exit 1
else
echo -e "\033[32m The style of application is ok ! \033[0m"
fi
if [ "All" = "${style}" ];then
main_function ${times}
else
echo -e "\033[33m Input the name of application,eg: \033[0m \033[31m aggregation | join | scan | kmeans | pagerank | sleep | sort | wordcount | bayes | terasort \033[0m"
read -p "Input your choice :" application
if [ "${application}" = "" ];then
echo -e "\033[31m The name of application can't be empty! \033[0m"
exit 1
fi
echo "********************The ${application} will be exec**********************"
appConfig=/usr/HiBench-master/conf/benchmarks.lst
#appConfig=/opt/HiBench/HiBench-master/conf/benchmarks.lst
read -p "Do you want exec all the style of application,eg:tiny,small,large,huge,gigantic? yes | no " chose
if [ "${chose}" = "" ];then
echo -e "\033[31m The style of application can't be empty! \033[0m"
exit 1
elif [ "yes" != "${chose}" ] && [ "no" != "${chose}" ];then
echo -e "\033[31m The style of application is wrong,please correct! \033[0m"
exit 1
else
echo -e "\033[32m The style of application is ok ! \033[0m"
fi
read -p "Input the style of application,eg:( tiny small large huge gigantic )!" appStyle
echo "***************************The ${appStyle} style will be exec***************************"
for appName in ${application}
do
echo ${appName} > ${appConfig}
if [ "yes" = "${chose}" ];then
for var in tiny small large huge gigantic
do
echo "******************The ${appName} will be exec!************************************"
search ${var}
if [ "aggregation" = ${appName} ] || [ "join" = ${appName} ] || [ "scan" = ${appName} ];then
exec_application_SQL ${times}
else
exec_application_noSQL ${times}
fi
done
else
# read -p "Input the sytle of application,eg:( tiny small large huge gigantic )!" appStyle
echo "**************************The ${appName} will be exec!************************"
if [ "${appStyle}" = "" ];then
echo -e "\033[31m The style of application can't be empty! \033[0m"
exit 1
fi
for var in ${appStyle}
do
search ${var}
if [ "aggregation" = ${appName} ] || [ "join" = ${appName} ] || [ "scan" = ${appName} ];then
exec_application_SQL ${times}
else
exec_application_noSQL ${times}
fi
done
fi
save_result ${appName}
done
fi
}
# the main function of application
main
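- prepare.sh -> run.sh is the sub-flow of run-all.sh;
- enter_bench -> … -> leave_bench is the sub-flow of prepare.sh and run.sh;
- enter_bench … gen_report and so on are shared functions in workload-functions.sh.
The flow chart is as follows:
2.1 Data generation code analysis; entry point: HiBench.DataGen
I am not very familiar with Java, but as far as I can tell the entry point mainly consists of a switch statement.
The DataGen class parses the arguments with DataOptions options = new DataOptions(args); for a bayes test it then calls the corresponding data generation class to generate the data. The relevant part of the entry point: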
case BAYES: {
BayesData data = new BayesData(options);
data.generate();
break;
}
BayesData implementation:
package HiBench;
import java.io.IOException;
import java.net.URISyntaxException;
import java.util.Random;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.NLineInputFormat;
public class BayesData {
private static final Log log = LogFactory.getLog(BayesData.class.getName());
private DataOptions options;
private Dummy dummy;
private int cgroups;
BayesData(DataOptions options) {
this.options = options;
parseArgs(options.getRemainArgs());
}
private void parseArgs(String[] args) {
for (int i=0; i<args.length; i++) {
if ("-class".equals(args[i])) {
cgroups = Integer.parseInt(args[++i]);
} else {
DataOptions.printUsage("Unknown bayes data arguments -- " + args[i] + "!!!");
System.exit(-1);
}
}
}
private static class CreateBayesPages extends MapReduceBase implements
Mapper<LongWritable, Text, Text, Text> {
private static final Log log = LogFactory.getLog(CreateBayesPages.class.getName());
private long pages, slotpages;
private int groups;
private HtmlCore generator;
private Random rand;
public void configure(JobConf job) {
try {
pages = job.getLong("pages", 0);
slotpages = job.getLong("slotpages", 0);
groups = job.getInt("groups", 0);
generator = new HtmlCore(job);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
@Override
public void map(LongWritable key, Text value,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
int slotId = Integer.parseInt(value.toString().trim());
long[] range = HtmlCore.getPageRange(slotId, pages, slotpages);
generator.fireRandom(slotId);
rand = new Random(slotId * 1000 + 101);
Text k = new Text();
for (long i=range[0]; i<range[1]; i++) {
String classname = "/class" + rand.nextInt(groups);
k.set(classname);
value.set(generator.genBayesWords());
output.collect(k, value);
reporter.incrCounter(HiBench.Counters.BYTES_DATA_GENERATED,
k.getLength()+value.getLength());
if (0==(i % 10000)) {
log.info("still running: " + (i - range[0]) + " of " + slotpages);
}
}
}
}
private void setBayesOptions(JobConf job) throws URISyntaxException {
job.setLong("pages", options.getNumPages());
job.setLong("slotpages", options.getNumSlotPages());
job.setInt("groups", cgroups);
Utils.shareWordZipfCore(options, job);
}
private void createBayesData() throws IOException, URISyntaxException {
log.info("creating bayes text data ... ");
JobConf job = new JobConf();
Path fout = options.getResultPath();
Utils.checkHdfsPath(fout);
String jobname = "Create bayes data";
job.setJobName(jobname);
Utils.shareDict(options, job);
setBayesOptions(job);
FileInputFormat.setInputPaths(job, dummy.getPath());
job.setInputFormat(NLineInputFormat.class);
job.setJarByClass(CreateBayesPages.class);
job.setMapperClass(CreateBayesPages.class);
job.setNumReduceTasks(0);
FileOutputFormat.setOutputPath(job, fout);
job.setOutputFormat(SequenceFileOutputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
log.info("Running Job: " +jobname);
log.info("Pages file " + dummy.getPath() + " as input");
log.info("Rankings file " + fout + " as output");
JobClient.runJob(job);
log.info("Finished Running Job: " + jobname);
}
private void init() throws IOException {
Utils.checkHdfsPath(options.getResultPath(), true);
Utils.checkHdfsPath(options.getWorkPath(), true);
dummy = new Dummy(options.getWorkPath(), options.getNumMaps());
int words = RawData.putDictToHdfs(new Path(options.getWorkPath(), HtmlCore.getDictName()), options.getNumWords());
options.setNumWords(words);
Utils.serialWordZipf(options);
}
public void generate() throws Exception {
init();
createBayesData();
close();
}
private void close() throws IOException {
log.info("Closing bayes data generator...");
Utils.checkHdfsPath(options.getWorkPath());
}
}
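To make the shape of the generated records concrete, here is a minimal Python sketch of what the mapper above emits: a key of the form "/class<N>" with N drawn uniformly from the number of classes, and a value made of space-separated words. The word sampling is an assumption: HtmlCore.genBayesWords is not shown in this post, so the 1/rank weighting, the tiny dictionary, and the words_per_page parameter are illustrative only. Python's random module also produces a different sequence than java.util.Random, so the actual labels will not match HiBench's output.

```python
import random

def generate_pages(slot_id, num_pages, groups, dictionary, words_per_page=5):
    """Yield (label, text) pairs shaped like the records written by CreateBayesPages."""
    # Per-slot seed, mirroring `new Random(slotId * 1000 + 101)` in the Java mapper.
    rng = random.Random(slot_id * 1000 + 101)
    # Assumption: Zipf-like weights, the word at rank r is drawn with weight 1/r.
    weights = [1.0 / r for r in range(1, len(dictionary) + 1)]
    for _ in range(num_pages):
        label = "/class" + str(rng.randrange(groups))
        words = rng.choices(dictionary, weights=weights, k=words_per_page)
        yield label, " ".join(words)

# Hypothetical four-word dictionary standing in for /usr/share/dict/words.
pages = list(generate_pages(slot_id=1, num_pages=3, groups=100,
                            dictionary=["alpha", "beta", "gamma", "delta"]))
for label, text in pages:
    print(label, text)
```

Because the seed depends only on the slot id, rerunning the same slot reproduces the same pages, which is also the effect of the fixed seed in the Java code.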
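The output of prepare.sh is shown below. It starts by reading the configuration files, then uses hadoop and a jar to run a job; this job generates the data for the bayes text classification. According to Section 1 and the official documentation, the text is drawn mainly from the Linux dictionary "/usr/share/dict/words" and follows a Zipfian distribution.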
[hdfs@sf11 prepare]$ ./prepare.sh
patching args=
Parsing conf: /opt/HiBench/HiBench-master/conf/00-default-properties.conf
Parsing conf: /opt/HiBench/HiBench-master/conf/01-default-streamingbench.conf
Parsing conf: /opt/HiBench/HiBench-master/conf/10-data-scale-profile.conf
Parsing conf: /opt/HiBench/HiBench-master/conf/20-samza-common.conf
Parsing conf: /opt/HiBench/HiBench-master/conf/30-samza-workloads.conf
Parsing conf: /opt/HiBench/HiBench-master/conf/99-user_defined_properties.conf
Parsing conf: /opt/HiBench/HiBench-master/workloads/bayes/conf/00-bayes-default.conf
Parsing conf: /opt/HiBench/HiBench-master/workloads/bayes/conf/10-bayes-userdefine.conf
probe sleep jar: /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/share/hadoop/mapreduce2/hadoop-mapreduce-client-jobclient-tests.jar
start HadoopPrepareBayes bench
/opt/HiBench/HiBench-master/bin/functions/workload-functions.sh: line 120: /dev/stderr: Permission denied
rm: `hdfs://archive.cloudera.com:8020/HiBench/Bayes/Input': No such file or directory
Submit MapReduce Job: /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop --config /etc/hadoop/conf jar /opt/HiBench/HiBench-master/src/autogen/target/autogen-5.0-SNAPSHOT-jar-with-dependencies.jar HiBench.DataGen -t bayes -b hdfs://archive.cloudera.com:8020/HiBench/Bayes -n Input -m 300 -r 1600 -p 500000 -class 100 -o sequence
16/10/21 16:34:02 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/10/21 16:34:32 INFO HiBench.BayesData: Closing bayes data generator...
finish HadoopPrepareBayes bench
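Part of the generated data:
After nearly two weeks of reading and testing the HiBench code, I finally worked out the flow described above. Intel's test framework is indeed quite concise: with configuration files, shell scripts, and examples bundled with the big data frameworks (for instance, the wordcount test in HiBench directly calls the program shipped with Hadoop or Spark), it covers the whole, rather large, test suite. Below we analyze the three languages HiBench uses for the Bayes text classification workload: Python, Scala, and Java.
2.3 Python code analysis
Part of the Python code: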
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
"""
A naive bayes program using MLlib.
This example requires NumPy (http://www.numpy.org/).
"""
import sys
from pyspark import SparkContext
from pyspark.mllib.util import MLUtils
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.storagelevel import StorageLevel
from operator import add
from itertools import groupby
#
# Adopted from spark's doc: http://spark.apache.org/docs/latest/mllib-naive-bayes.html
#
# Note: this script uses Python 2 syntax (print statement, tuple-unpacking lambdas).
import numpy as np  # needed by parseVector below
def parseVector(line):
    return np.array([float(x) for x in line.split(' ')])
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print >> sys.stderr, "Usage: bayes <file>"
        exit(-1)
    sc = SparkContext(appName="PythonNaiveBayes")
    filename = sys.argv[1]
    data = sc.sequenceFile(filename, "org.apache.hadoop.io.Text", "org.apache.hadoop.io.Text")
    # count every word over the whole corpus
    wordCount = data \
        .flatMap(lambda (key, doc):doc.split(" ")) \
        .map(lambda x:(x, 1)) \
        .reduceByKey(add)
    wordSum = wordCount.map(lambda x:x[1]).reduce(lambda x,y:x+y)
    # word -> (index, relative frequency), collected to the driver and broadcast
    wordDict = wordCount.zipWithIndex() \
        .map(lambda ((key, count), index): (key, (index, count*1.0 / wordSum)) ) \
        .collectAsMap()
    sharedWordDict = sc.broadcast(wordDict)
    # for each document, generate vector based on word freq
    def doc2vector(dockey, doc):
        # map to word index: freq
        # combine freq with same word
        docVector = [(key, sum((z[1] for z in values))) for key, values in
                     groupby(sorted([sharedWordDict.value[x] for x in doc.split(" ")],
                                    key=lambda x:x[0]),
                             key=lambda x:x[0])]
        (indices, values) = zip(*docVector) # unzip
        # the key looks like "/class12"; strip the "/class" prefix to get the label
        label = float(dockey[6:])
        return label, indices, values
    vector = data.map( lambda (dockey, doc) : doc2vector(dockey, doc))
    vector.persist(StorageLevel.MEMORY_ONLY)
    # vector dimension = largest word index + 1
    d = vector.map( lambda (label, indices, values) : indices[-1] if indices else 0) \
        .reduce(lambda a,b:max(a,b)) + 1
    # print "###### Load svm file", filename
    #examples = MLUtils.loadLibSVMFile(sc, filename, numFeatures = numFeatures)
    examples = vector.map( lambda (label, indices, values) : LabeledPoint(label, Vectors.sparse(d, indices, values)))
    examples.cache()
    # FIXME: need randomSplit!
    training = examples.sample(False, 0.8, 2)
    test = examples.sample(False, 0.2, 2)
    numTraining = training.count()
    numTest = test.count()
    print " numTraining = %d, numTest = %d." % (numTraining, numTest)
    model = NaiveBayes.train(training, 1.0)
    model_share = sc.broadcast(model)
    predictionAndLabel = test.map( lambda x: (x.label, model_share.value.predict(x.features)))
    # prediction = model.predict(test.map( lambda x: x.features ))
    # predictionAndLabel = prediction.zip(test.map( lambda x:x.label ))
    accuracy = predictionAndLabel.filter(lambda x: x[0] == x[1]).count() * 1.0 / numTest
    print "Test accuracy = %s." % accuracy
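The core of the script is the conversion of each document into a sparse vector. The following Spark-free Python 3 sketch replays that step on a toy corpus so it can be run locally. It is a simplification under stated assumptions: build_word_dict is a hypothetical helper that stands in for the wordCount/zipWithIndex pipeline, and it assigns indices in sorted word order for determinism, whereas zipWithIndex depends on the RDD's partition order.

```python
from collections import Counter

def build_word_dict(docs):
    """Map each word to (index, relative frequency over the whole corpus)."""
    counts = Counter(word for doc in docs for word in doc.split(" "))
    total = sum(counts.values())
    return {word: (index, count / total)
            for index, (word, count) in enumerate(sorted(counts.items()))}

def doc2vector(dockey, doc, word_dict):
    """Return (label, indices, values) with indices in ascending order."""
    acc = {}
    for word in doc.split(" "):
        index, freq = word_dict[word]
        # A word that occurs several times contributes its weight each time.
        acc[index] = acc.get(index, 0.0) + freq
    indices = sorted(acc)
    values = [acc[i] for i in indices]
    label = float(dockey[6:])  # "/class7" -> 7.0
    return label, indices, values

docs = ["a b a", "b c"]  # toy corpus, not HiBench data
word_dict = build_word_dict(docs)
label, indices, values = doc2vector("/class7", "a b a", word_dict)
print(label, indices, values)
```

Note that the feature value is the corpus-wide relative frequency of the word, summed once per occurrence in the document, rather than a plain per-document term count.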
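2.4 Scala code analysis
run-spark-job org.apache.spark.examples.mllib.SparseNaiveBayes ${INPUT_HDFS}
Clearly, the Scala naive Bayes simply calls the example code that ships with Spark MLlib.
2.5 Java code analysis
run-spark-job com.intel.sparkbench.bayes.JavaBayes ${INPUT_HDFS}
Somewhat surprisingly, for Java HiBench does not use stock code or an existing jar but ships its own implementation. The code is below; I will analyze it in detail later: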
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.intel.sparkbench.bayes;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.rdd.RDD;
import org.apache.spark.storage.StorageLevel;
import scala.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.hadoop.io.Text;
import java.lang.Boolean;
import java.lang.Double;
import java.lang.Long;
import java.util.*;
import java.util.regex.Pattern;
/*
* Adopted from spark's doc: http://spark.apache.org/docs/latest/mllib-naive-bayes.html
*/
public final class JavaBayes {
private static final Pattern SPACE = Pattern.compile(" ");
public static void main(String[] args) throws Exception {
if (args.length < 1) {
System.err.println("Usage: JavaBayes <file>");
System.exit(1);
}
Random rand = new Random();
SparkConf sparkConf = new SparkConf().setAppName("JavaBayes");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
// int numFeatures = Integer.parseInt(args[1]);
// Generate vectors according to input documents
JavaPairRDD<String, String> data = ctx.sequenceFile(args[0], Text.class, Text.class)
.mapToPair(new PairFunction<Tuple2<Text, Text>, String, String>() {
@Override
public Tuple2<String, String> call(Tuple2<Text, Text> e) {
return new Tuple2<String, String>(e._1().toString(), e._2().toString());
}
});
JavaPairRDD<String, Long> wordCount = data
.flatMap(new FlatMapFunction<Tuple2<String, String>, String>() {
@Override
public Iterable<String> call(Tuple2<String, String> e) {
return Arrays.asList(SPACE.split(e._2()));
}
})
.mapToPair(new PairFunction<String, String, Long>() {
@Override
public Tuple2<String, Long> call(String e) {
return new Tuple2<String, Long>(e, 1L);
}
})
.reduceByKey(new Function2<Long, Long, Long>() {
@Override
public Long call(Long i1, Long i2) {
return i1 + i2;
}
});
final Long wordSum = wordCount.map(new Function<Tuple2<String, Long>, Long>(){
@Override
public Long call(Tuple2<String, Long> e) {
return e._2();
}
})
.reduce(new Function2<Long, Long, Long>() {
@Override
public Long call(Long v1, Long v2) throws Exception {
return v1 + v2;
}
});
List<Tuple2<String, Tuple2<Long, Double>>> wordDictList = wordCount.zipWithIndex()
.map(new Function<Tuple2<Tuple2<String, Long>, Long>, Tuple2<String, Tuple2<Long, Double>>>() {
@Override
public Tuple2<String, Tuple2<Long, Double>> call(Tuple2<Tuple2<String, Long>, Long> e) throws Exception {
String key = e._1()._1();
Long count = e._1()._2();
Long index = e._2();
return new Tuple2<String, Tuple2<Long, Double>>(key, new Tuple2<Long, Double>(index,
count.doubleValue() / wordSum));
}
}).collect();
Map<String, Tuple2<Long, Double>> wordDict = new HashMap();
for (Tuple2<String, Tuple2<Long, Double>> item : wordDictList) {
wordDict.put(item._1(), item._2());
}
final Broadcast<Map<String, Tuple2<Long, Double>>> sharedWordDict = ctx.broadcast(wordDict);
// for each document, generate vector based on word freq
JavaRDD<Tuple3<Double, Long[], Double[]>> vector = data.map(new Function<Tuple2<String, String>, Tuple3<Double, Long[], Double[]>>() {
@Override
public Tuple3<Double, Long[], Double[]> call(Tuple2<String, String> v1) throws Exception {
String dockey = v1._1();
String doc = v1._2();
String[] keys = SPACE.split(doc);
Tuple2<Long, Double>[] datas = new Tuple2[keys.length];
for (int i = 0; i < keys.length; i++) {
datas[i] = sharedWordDict.getValue().get(keys[i]);
}
Map<Long, Double> vector = new HashMap<Long, Double>();
for (int i = 0; i < datas.length; i++) {
Long indic = datas[i]._1();
Double value = datas[i]._2();
if (vector.containsKey(indic)) {
vector.put(indic, value + vector.get(indic));
} else {
vector.put(indic, value);
}
}
Long[] indices = new Long[vector.size()];
Double[] values = new Double[vector.size()];
SortedSet<Long> sortedKeys = new TreeSet<Long>(vector.keySet());
int c = 0;
for (Long key : sortedKeys) {
indices[c] = key;
values[c] = vector.get(key);
c+=1;
}
Double label = Double.parseDouble(dockey.substring(6));
return new Tuple3<Double, Long[], Double[]>(label, indices, values);
}
});
vector.persist(StorageLevel.MEMORY_ONLY());
final Long d = vector
.map(new Function<Tuple3<Double,Long[],Double[]>, Long>() {
@Override
public Long call(Tuple3<Double, Long[], Double[]> v1) throws Exception {
Long[] indices = v1._2();
if (indices.length > 0) {
// System.out.println("v_length:"+indices.length+" v_val:" + indices[indices.length - 1]);
return indices[indices.length - 1];
} else return Long.valueOf(0);
}
})
.reduce(new Function2<Long, Long, Long>() {
@Override
public Long call(Long v1, Long v2) throws Exception {
// System.out.println("v1:"+v1+" v2:"+v2);
return v1 > v2 ? v1 : v2;
}
}) + 1;
RDD<LabeledPoint> examples = vector.map(new Function<Tuple3<Double,Long[],Double[]>, LabeledPoint>() {
@Override
public LabeledPoint call(Tuple3<Double, Long[], Double[]> v1) throws Exception {
int intIndices [] = new int[v1._2().length];
double intValues [] = new double[v1._3().length];
for (int i=0; i< v1._2().length; i++){
intIndices[i] = v1._2()[i].intValue();
intValues[i] = v1._3()[i];
}
return new LabeledPoint(v1._1(), Vectors.sparse(d.intValue(),
intIndices, intValues));
}
}).rdd();
//RDD<LabeledPoint> examples = MLUtils.loadLibSVMFile(ctx.sc(), args[0], false, numFeatures);
RDD<LabeledPoint>[] split = examples.randomSplit(new double[]{0.8, 0.2}, rand.nextLong());
JavaRDD<LabeledPoint> training = split[0].toJavaRDD();
JavaRDD<LabeledPoint> test = split[1].toJavaRDD();
final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
JavaRDD<Double> prediction =
test.map(new Function<LabeledPoint, Double>() {
@Override
public Double call(LabeledPoint p) {
return model.predict(p.features());
}
});
JavaPairRDD < Double, Double > predictionAndLabel =
prediction.zip(test.map(new Function<LabeledPoint, Double>() {
@Override
public Double call(LabeledPoint p) {
return p.label();
}
}));
double accuracy = (double) predictionAndLabel.filter(
new Function<Tuple2<Double, Double>, Boolean>() {
@Override
public Boolean call(Tuple2<Double, Double> pl) {
return pl._1().equals(pl._2());
}
}).count() / test.count();
System.out.println(String.format("Test accuracy = %f", accuracy));
ctx.stop();
}
}
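Both the Python and Java versions end with NaiveBayes.train(training, 1.0), where 1.0 is the smoothing parameter. As a hedged illustration of what a multinomial naive Bayes trainer with additive smoothing computes, here is a small pure-Python sketch. It follows the textbook formulation and is not MLlib's implementation; train_nb, predict_nb, and the toy data are names and values made up for this example.

```python
import math
from collections import defaultdict

def train_nb(examples, num_features, lam=1.0):
    """examples: list of (label, {feature_index: value}). Returns (log_prior, log_theta)."""
    doc_count = defaultdict(int)
    feat_sum = defaultdict(lambda: [0.0] * num_features)
    for label, features in examples:
        doc_count[label] += 1
        for index, value in features.items():
            feat_sum[label][index] += value
    n, k = len(examples), len(doc_count)
    # Smoothed class priors.
    log_prior = {c: math.log(doc_count[c] + lam) - math.log(n + k * lam)
                 for c in doc_count}
    # Smoothed per-class feature likelihoods.
    log_theta = {}
    for c in doc_count:
        denom = math.log(sum(feat_sum[c]) + lam * num_features)
        log_theta[c] = [math.log(v + lam) - denom for v in feat_sum[c]]
    return log_prior, log_theta

def predict_nb(model, features):
    """Pick the class with the highest log posterior (up to a constant)."""
    log_prior, log_theta = model
    def score(c):
        return log_prior[c] + sum(v * log_theta[c][i] for i, v in features.items())
    return max(log_prior, key=score)

# Toy training set: class 0.0 favours feature 0, class 1.0 favours feature 2.
train = [(0.0, {0: 2.0}), (0.0, {0: 1.0, 1: 1.0}),
         (1.0, {2: 3.0}), (1.0, {1: 1.0, 2: 1.0})]
model = train_nb(train, num_features=3)
predictions = [predict_nb(model, f) for _, f in train]
accuracy = sum(p == y for p, (y, _) in zip(predictions, train)) / len(train)
print(accuracy)
```

The accuracy here is measured on the training data only to keep the sketch short; the HiBench code evaluates on a held-out 20% split.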
3. Results
Type | Date | Time | Input_data_size | Duration(s) | Throughput(bytes/s) | Throughput/node |
---|---|---|---|---|---|---|
JavaSparkBayes | 2016-10-09 | 16:41:09 | 113387030 | 48.857 | 2320793 | 2320793 |
ScalaSparkBayes | 2016-10-09 | 16:42:00 | 113387030 | 45.164 | 2510562 | 2510562 |
PythonSparkBayes | 2016-10-09 | 16:44:03 | 113387030 | 118.521 | 956683 | 956683 |
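The throughput column can be cross-checked from the other two columns. The short Python sketch below recomputes it under the assumption that throughput is the input size divided by the duration, truncated to an integer; that assumption reproduces all three reported values. The Throughput/node column equals the throughput in every row, which would be consistent with a node count of one, although the post does not state the cluster size.

```python
# (input size in bytes, duration in seconds), copied from the table above
runs = {
    "JavaSparkBayes": (113387030, 48.857),
    "ScalaSparkBayes": (113387030, 45.164),
    "PythonSparkBayes": (113387030, 118.521),
}

def throughput(size_bytes, duration_s):
    """Bytes per second, truncated to an integer (assumed rounding rule)."""
    return int(size_bytes / duration_s)

for name, (size, duration) in runs.items():
    print(name, throughput(size, duration))

# Relative slowdown of the Python run compared with the Scala run.
slowdown = runs["PythonSparkBayes"][1] / runs["ScalaSparkBayes"][1]
print(round(slowdown, 2))
```

On this data set the Python run takes roughly 2.6 times as long as the Scala run, while Java and Scala are within about 10% of each other.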
Reference data scales for the bayes workload:
#Bayes
hibench.bayes.tiny.pages 25000
hibench.bayes.tiny.classes 10
hibench.bayes.tiny.ngrams 1
hibench.bayes.small.pages 30000
hibench.bayes.small.classes 100
hibench.bayes.small.ngrams 2
hibench.bayes.large.pages 100000
hibench.bayes.large.classes 100
hibench.bayes.large.ngrams 2
hibench.bayes.huge.pages 500000
hibench.bayes.huge.classes 100
hibench.bayes.huge.ngrams 2
hibench.bayes.gigantic.pages 1000000
hibench.bayes.gigantic.classes 100
hibench.bayes.gigantic.ngrams 2
hibench.bayes.bigdata.pages 20000000
hibench.bayes.bigdata.classes 20000
hibench.bayes.bigdata.ngrams 2