基于腾讯云对象存储跑hadoop任务实战一

时间:2022-06-18
本文章向大家介绍基于腾讯云对象存储跑hadoop任务实战一,主要内容包括其使用实例、应用技巧、基本知识点总结和需要注意事项,具有一定的参考价值,需要的朋友可以参考一下。

公司在腾讯云有一个大数据集群,用hdp的ambari部署管理的,hdp面有hadoop、hive、spark等常用的大数据组件,公司的报表都从这里生成。

随着业务发展,历史数据量急速增加,半年已经过百TB,还在以每天几TB的数量增加着,删除太可惜了,要找个地方备份。于是想到了对象存储,看了下官方文档 https://cloud.tencent.com/document/product/436/6884, hadoop任务竟然可以直接跑在腾讯云对象存储上,太给力了!直接动手实践一下。

部署过程

部署配置过程,文档里还是写的很详细的,这里简单列下步骤:

1、在腾讯云存储新建一个bucket,注意bucket建的园区需要和你大数据cvm的园区相同,这样上传、下载都会走内网,不收钱的。如果建成了不同园区,下载会收流量费用。

2、有了bucket后,去文档里给的github地址(https://github.com/tencentyun/hadoop-cos/tree/master/dep)下载编译好的最新版本的两个lib,我的hadoop版本是2.7.3,所以下载 cos_hadoop_api-5.2.6.jar,hadoop-cos-2.7.3.jar(如果没有对应版本也可以下载源码自己编译)。这两个lib需要复制到跑hadoop任务的所有机器上去,都需要放到同一个路径下,后面会用到。

3、按照官网文档修改你的hadoop集群配置。这一步,有可能很多朋友现网环境配置不能随意变更,也可以针对对象存储,单独生成一份配置,在跑hadoop任务的时候通过参数指定 ,例如:hadoop fs -conf ./core-site.xml -libjars ./cos_hadoop_api-5.2.6.jar,./hadoop-cos-2.7.3.jar -ls / 。

通过上面三步,大概不到半小时的时间,就部署好了。赶紧执行下 hadoop fs -ls / 试试效果,列出了我的bucket里的文件,so easy!

跑mr任务

不过,在执行hadoop的example里的MapReduce任务的时候 hadoop jar ./mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount cosn://example/mr/input cosn://example/mr/output,报错了 "Class org.apache.hadoop.fs.CosFileSystem not found"。

上网查了下,为了能够在云对象存储运行 MAPREDUCE 任务,还需更改hdp安装目录下/hdp/apps/2.6.xxxx/mapreduce/mapreduce.tar.gz包的内容,将 COSN 的两个lib放如上述压缩包,替换掉原来的文件,测试可用。不过这个方案有个缺点,就是hdp的ambari每次修改完配置重启的时候,都会把mapreduce.tar.gz

替换回系统自带的。该方案治标不治本。

不过上面的方案启发了我,应该是MapReduce任务的配置有点问题,导致找不到cos的lib,研究了下mapred-site.xml这个配置文件,在mapreduce.application.classpath里加上cosn的lib,就ok了。

测试ok,即使修改集群配置、重启集群,都还能正常执行MapReduce任务:

[hdfs@hadoop01 ~]$ hadoop jar /usr/hdp/2.6.xxx/hadoop-mapreduce/hadoop-mapreduce-examples.jar  wordcount cosn://testshhadoop-xxxxxxx/input/ cosn://testshhadoop-xxxxxxxx/example/output/
19/02/22 10:44:12 INFO client.RMProxy: Connecting to ResourceManager at hadoop03.cos.qcloud.com/xxxxxxx:8050
19/02/22 10:44:12 INFO client.AHSProxy: Connecting to Application History server at hadoop04.cos.qcloud.com/xxxxxxx:10200
19/02/22 10:44:13 INFO fs.CosNativeFileSystemStore: store file input stream md5 hash content length
19/02/22 10:44:13 INFO fs.CosFsDataOutputStream: OutputStream for key '/user/hdfs/.staging/job_1550379850904_0007/job.jar' upload complete
19/02/22 10:44:14 INFO input.FileInputFormat: Total input paths to process : 1
19/02/22 10:44:14 INFO fs.CosNativeFileSystemStore: store file input stream md5 hash content length
19/02/22 10:44:14 INFO fs.CosFsDataOutputStream: OutputStream for key '/user/hdfs/.staging/job_1550379850904_0007/job.split' upload complete
19/02/22 10:44:14 INFO fs.CosNativeFileSystemStore: store file input stream md5 hash content length
19/02/22 10:44:14 INFO fs.CosFsDataOutputStream: OutputStream for key '/user/hdfs/.staging/job_1550379850904_0007/job.splitmetainfo' upload complete
19/02/22 10:44:14 INFO mapreduce.JobSubmitter: number of splits:1
19/02/22 10:44:14 INFO fs.CosNativeFileSystemStore: store file input stream md5 hash content length
19/02/22 10:44:14 INFO fs.CosFsDataOutputStream: OutputStream for key '/user/hdfs/.staging/job_1550379850904_0007/job.xml' upload complete
19/02/22 10:44:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1550379850904_0007
19/02/22 10:44:14 INFO impl.YarnClientImpl: Submitted application application_1550379850904_0007
19/02/22 10:44:14 INFO mapreduce.Job: The url to track the job: http://hadoop03.cos.qcloud.com:8088/proxy/application_1550379850904_0007/
19/02/22 10:44:14 INFO mapreduce.Job: Running job: job_1550379850904_0007
19/02/22 10:44:20 INFO mapreduce.Job: Job job_1550379850904_0007 running in uber mode : false
19/02/22 10:44:20 INFO mapreduce.Job:  map 0% reduce 0%
19/02/22 10:44:26 INFO mapreduce.Job:  map 100% reduce 0%
19/02/22 10:44:31 INFO mapreduce.Job:  map 100% reduce 100%
19/02/22 10:44:34 INFO mapreduce.Job: Job job_1550379850904_0007 completed successfully
19/02/22 10:44:34 INFO mapreduce.Job: Counters: 49
        File System Counters
                COSN: Number of bytes read=14408
                COSN: Number of bytes written=55
                COSN: Number of read operations=0
                COSN: Number of large read operations=0
                COSN: Number of write operations=0
                FILE: Number of bytes read=89
                FILE: Number of bytes written=303554
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Job Counters 
                Launched map tasks=1
                Launched reduce tasks=1
                Rack-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=9808
                Total time spent by all reduces in occupied slots (ms)=5824
                Total time spent by all map tasks (ms)=2452
                Total time spent by all reduce tasks (ms)=2912
                Total vcore-milliseconds taken by all map tasks=2452
                Total vcore-milliseconds taken by all reduce tasks=2912
                Total megabyte-milliseconds taken by all map tasks=10043392
                Total megabyte-milliseconds taken by all reduce tasks=5963776
        Map-Reduce Framework
                Map input records=3
                Map output records=9
                Map output bytes=89
                Map output materialized bytes=89
                Input split bytes=117
                Combine input records=9
                Combine output records=7
                Reduce input groups=7
                Reduce shuffle bytes=89
                Reduce input records=7
                Reduce output records=7
                Spilled Records=14
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=74
                CPU time spent (ms)=1840
                Physical memory (bytes) snapshot=593371136
                Virtual memory (bytes) snapshot=9280081920
                Total committed heap usage (bytes)=462946304
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=4994
        File Output Format Counters 
                Bytes Written=55

至此,就能通过hadoop自带的hadoop distcp任务,愉快地把我们公司的历史数据批量备份到腾讯云对象存储上了,并且想要对这些数据做分析,也不用把他们拉回本地hdfs,可以直接分析云上数据。