Getting Started with Spark 01

Date: 2019-11-14
This article is an introduction to Spark: usage examples, practical tips, a summary of the basic concepts, and points to watch out for.

I. Spark Overview

Spark framework links

1. Official site:
    http://spark.apache.org/

2. Source repository:
    https://github.com/apache/spark

3. Databricks (the company behind Spark):
    https://databricks.com/
    Official blogs: https://databricks.com/blog/, https://databricks.com/blog/category/engineering/spark

1. Official Definition

http://spark.apache.org/docs/2.2.0/

Spark, like MapReduce, is a framework for analyzing data at large scale.

2. Types of Big Data Analysis

  1. Offline (batch) processing: the data being analyzed is static and does not change; e.g. the MapReduce and Hive frameworks
  2. Interactive analysis: ad-hoc queries; e.g. Impala
  3. Real-time analysis: processing streaming data and presenting results as the data arrives

3. Introduction to the Spark Framework

Sorting 100 TB of data on disk, Spark is much faster and more efficient than Hadoop: in the 2014 Daytona GraySort benchmark, Spark sorted 100 TB in 23 minutes on 206 machines, against the previous Hadoop MapReduce record of 72 minutes on roughly 2,100 machines.

Why is the Spark framework so fast?

Data structure

  • RDD (Resilient Distributed Dataset): Spark wraps the data to be processed in an RDD collection and processes it by calling the RDD's functions.

    RDD data can be held in memory, spilling to disk when memory is insufficient.

Tasks are executed differently

  • In a MapReduce application, every MapTask and ReduceTask is a JVM process, and starting a JVM process is slow.

    In Spark, a Task runs as a thread; threads live inside a process and are cheap to create and destroy, which is far more efficient.
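To get a feel for the cost difference, here is a small plain-Python experiment (an illustration, not Spark code) that compares the time to spawn OS processes against the time to spawn threads:

```python
import subprocess
import sys
import threading
import time

def time_threads(n):
    """Start and join n no-op threads; return elapsed seconds."""
    t0 = time.perf_counter()
    threads = [threading.Thread(target=lambda: None) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

def time_processes(n):
    """Start and wait for n no-op Python processes; return elapsed seconds."""
    t0 = time.perf_counter()
    procs = [subprocess.Popen([sys.executable, "-c", "pass"]) for _ in range(n)]
    for p in procs:
        p.wait()
    return time.perf_counter() - t0

if __name__ == "__main__":
    n = 10
    print(f"{n} threads:   {time_threads(n):.4f}s")
    print(f"{n} processes: {time_processes(n):.4f}s")
```

On a typical machine the processes take orders of magnitude longer than the threads, which is exactly why running many short Tasks as threads pays off.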

4. Spark Framework Features

  1. Fast: compared with Hadoop MapReduce, Spark's in-memory computation can be over 100x faster, and even its disk-based computation over 10x faster.
  2. Easy to use: Spark provides APIs for Java, Python, R, and Scala, along with more than 80 high-level operators.
  3. General: Spark covers batch processing, interactive queries, real-time stream processing, machine learning, and graph computation.
  4. Compatible: Spark can use Hadoop YARN as its resource manager and scheduler.

II. Framework Modules

A: Spark Core: the core of the framework, centered on the RDD; offline (batch) analysis of massive data, similar to the MapReduce framework

B: Spark SQL: the most widely used module, similar to the Hive framework; it provides SQL for analyzing data, and far more than just SQL — it also offers a DSL.

C: Spark Streaming: the module for stream processing applications.

D: Structured Streaming: a new stream processing framework introduced in Spark 2.x

E: Spark MLlib: machine learning library

F: Spark GraphX: graph computation

G: PySpark: the module for Python development

H: SparkR: the module for R development

III. Spark Run Modes

1. Local Mode

Used mainly for development and testing.

2. Cluster Mode

Spark Standalone cluster: the cluster manager that ships with Spark

Hadoop YARN: in production, MapReduce, Flink, and Spark applications are commonly run on YARN

IV. Quick Start

1. Local Mode

Essence: a single JVM process is started, and it executes the Tasks.

--master local | local[*] | local[K], where K is a positive integer (K >= 2 is recommended)

Specifically:

  • local: run everything in a single thread
  • local[K]: run with K worker threads
  • local[*]: run with as many worker threads as the machine has CPU cores

Start the shell as follows:

bin/spark-shell --master local[2]

2. Word Count

# Prepare the data
/export/servers/hadoop/bin/hdfs dfs -put wordcount.input /datas

// Read the text data from HDFS into an RDD; each line of the file becomes one element of the collection
val inputRDD = sc.textFile("/datas/wordcount.input")

// Split each element on whitespace (regex reference: https://www.runoob.com/regexp/regexp-syntax.html)
val wordsRDD = inputRDD.flatMap(line => line.split("\\s+"))
// equivalently: inputRDD.flatMap(_.split("\\s+"))

// Map to tuples: each word paired with the count 1
val tuplesRDD = wordsRDD.map(word => (word, 1))
// equivalently: wordsRDD.map((_, 1))

// Group by key and aggregate the values
// (a Scala tuple plays the role of a Java key/value pair)
// reduceByKey: group first, then aggregate
// val wordcountsRDD = tuplesRDD.reduceByKey((a, b) => a + b)
val wordcountsRDD = tuplesRDD.reduceByKey((tmp, item) => tmp + item)

// Inspect the result
wordcountsRDD.take(5)

// Save the result to HDFS
wordcountsRDD.saveAsTextFile("/datas/spark-wc")

# Check the output
/export/servers/hadoop/bin/hdfs dfs -text /datas/spark-wc/par*
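The same flatMap → map → reduceByKey chain can be mimicked on an ordinary local collection. This plain-Python sketch (no Spark involved; the sample lines are made up) shows what each step produces:

```python
import re
from collections import defaultdict

lines = ["hello spark", "hello hadoop spark"]  # stand-in for wordcount.input

# flatMap: split every line on whitespace and flatten into one list of words
words = [w for line in lines for w in re.split(r"\s+", line.strip()) if w]

# map: pair each word with the count 1
pairs = [(w, 1) for w in words]

# reduceByKey: group by key, then sum the values within each group
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

print(sorted(counts.items()))
# [('hadoop', 1), ('hello', 2), ('spark', 2)]
```

Spark performs the identical logic, except that the collection is partitioned across machines and the grouping step shuffles data between them.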

3. Aggregating a Collection with reduce

// Higher-order function: a function A whose parameter type is itself a function
def reduce[A1 >: Int](op: (A1, A1) => A1): A1

// What op must look like
op: (A1: first parameter, A1: second parameter) => A1
/*
    op takes two parameters and returns a value, all of the same type:
    first parameter: the intermediate accumulator of the aggregation
    second parameter: the current element of the collection
*/
scala> val list = (1 to 10).toList
list: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
/*
    var tmp = 0
    tmp = tmp + item
    return tmp

    tmp: the intermediate accumulator during aggregation

    Exercise: compute the average
        sum / count
*/

list.reduce((tmp, item) => {
    println(s"tmp = $tmp, item = $item, sum = ${tmp + item}")
    tmp + item
})

list.reduceLeft((tmp, item) => {
    println(s"tmp = $tmp, item = $item, sum = ${tmp + item}")
    tmp + item
})

list.reduceRight((item, tmp) => {
    println(s"tmp = $tmp, item = $item, sum = ${tmp + item}")
    tmp + item
})

Compare reduceRight: it folds from the right, and its function takes the current element first and the accumulator second.
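For addition all three variants give the same answer; the difference only shows up with a non-commutative operator such as subtraction. A plain-Python illustration (functools.reduce folds from the left; the right fold is emulated by folding the reversed list with swapped arguments), also covering the average exercise mentioned in the comments above:

```python
from functools import reduce

# reduce / reduceLeft: ((1 - 2) - 3) - 4, folding from the left
left = reduce(lambda tmp, item: tmp - item, [1, 2, 3, 4])

# reduceRight: 1 - (2 - (3 - 4)), folding from the right;
# emulated here by reversing the list and swapping the arguments
right = reduce(lambda tmp, item: item - tmp, reversed([1, 2, 3, 4]))

print(left, right)  # -8 -2

# the average exercise: sum / count, with the sum computed via reduce
data = list(range(1, 11))  # 1..10
total = reduce(lambda tmp, item: tmp + item, data)
print(total / len(data))  # 5.5
```

With `+` the accumulator order is irrelevant, which is why the three Scala calls above all print the same sums.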

4. Computing Pi in Local Mode

SPARK_HOME=/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/
${SPARK_HOME}/bin/spark-submit \
--master local[2] \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.2.0.jar \
10

# The argument 10 is the number of slices: the job is split into 10 tasks, each sampling 100,000 points
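SparkPi estimates π by Monte Carlo sampling: throw random points into the unit square and count how many land inside the quarter circle. A single-threaded Python sketch of the same idea (the function name and seed are illustrative, not Spark's actual code):

```python
import random

def estimate_pi(num_samples, seed=42):
    """Estimate pi by sampling points in the unit square and counting
    how many fall inside the quarter circle x^2 + y^2 <= 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # (area of quarter circle) / (area of square) = pi / 4
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))  # roughly 3.14
```

Spark parallelizes this by giving each slice its own batch of samples and summing the `inside` counts across tasks, which is why more slices means a finer estimate.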

V. Anatomy of a Running Spark Application

1. MapReduce Components

A running MapReduce application is one Job. It has a single AppMaster, the manager of the application, responsible for the execution of all of its Tasks; each MapTask and ReduceTask runs as a separate process.

2. Spark Application Components

A Spark Application can contain multiple Jobs, and when it runs on a cluster it also consists of two parts.

Part one is the Driver Program, the counterpart of the AppMaster: the manager of the whole application, responsible for scheduling and executing all of its Jobs. It is a JVM process that runs the program's main function and must create the SparkContext object.

Part two is the Executors. Each Executor is a JVM process that behaves like a thread pool: it holds many threads, each thread runs one Task, and every running Task needs one CPU core, so the number of threads in an Executor can be taken to equal its number of CPU cores.
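The Executor model can be pictured with an ordinary thread pool: a fixed number of worker threads (one per core), each picking up one task at a time. A Python sketch of the analogy (illustrative only, not Spark internals; `run_task` is a made-up stand-in):

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(task_id):
    """A stand-in for one Spark Task: do some work, return a result."""
    return task_id * task_id

# An "Executor" with 4 cores runs at most 4 tasks concurrently,
# one task per thread / per core; the remaining tasks wait their turn.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_task, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Eight tasks on a four-thread pool run in two waves, just as a stage with more tasks than total Executor cores runs in multiple rounds.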

VI. Developing a Spark Application

A Spark application has three main parts: reading data, processing data, and writing the results.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Word count (WordCount) implemented in Scala with Spark Core:
  *     read data from HDFS, compute the word counts, and save the result back to HDFS.
  */
object SparkWordCount {

    def main(args: Array[String]): Unit = {

        // Create the SparkConf object and set the application's configuration,
        // such as its name and run mode
        val sparkConf: SparkConf = new SparkConf()
            .setMaster("local[2]")
            .setAppName("SparkWordCount")
        // TODO: build the SparkContext instance, which reads data and schedules Job execution
        val sc: SparkContext = new SparkContext(sparkConf)
        // Set the log level; valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
        sc.setLogLevel("WARN")

        // Step 1: read the data
        //  into an RDD collection, which can be thought of as a List
        val inputRDD: RDD[String] = sc.textFile("/datas/wordcount.input")


        // Step 2: process the data
        //  by calling the RDD's functions, much as you would call functions on a list
        // a. split each line into words
        val wordsRDD = inputRDD.flatMap(line => line.split("\\s+"))
        // b. map to tuples: each word paired with the count 1
        val tuplesRDD: RDD[(String, Int)] = wordsRDD.map(word => (word, 1))
        // c. group by key and aggregate the values
        val wordCountsRDD: RDD[(String, Int)] = tuplesRDD.reduceByKey((tmp, item) => tmp + item)


        // Step 3: write the output
        //  to a storage system such as HDFS
        wordCountsRDD.saveAsTextFile(s"/datas/swc-output-${System.currentTimeMillis()}")
        wordCountsRDD.foreach(println)

        // For testing only: sleep so the Web UI (http://localhost:4040 by default) stays up for inspection
        Thread.sleep(10000000)

        // TODO: the application has finished; release resources
        sc.stop()
    }

}

VII. Submitting Spark Applications

1. Spark Submit

http://spark.apache.org/docs/2.2.0/submitting-applications.html
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
Some of the commonly used options are:

--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown).
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any

Use --help to see all available options:

# bin/spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.

Submitting an application:

Usage: spark-submit [options] <app jar | python file> [app arguments]

1) options
    optional flags configuring how the application runs, e.g. local mode or cluster mode;
    this is the important part
2) <app jar | python file>
    for Java or Scala, the application compiled into a jar; for Python, the script file

3) [app arguments]
    arguments passed to the application; optional

2. Submitting the Word Count Program

Submit in local mode:

SPARK_HOME=/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/
${SPARK_HOME}/bin/spark-submit \
--master local[2] \
--class cn.itcast.bigdata.spark.submit.SparkSubmit \
${SPARK_HOME}/day01-core_2.11-1.0-SNAPSHOT.jar \
/datas/wordcount.input /datas/swcs

Submit to the Standalone cluster:

SPARK_HOME=/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/
${SPARK_HOME}/bin/spark-submit \
--master spark://bigdata-cdh02.itcast.cn:7077,bigdata-cdh03.itcast.cn:7077 \
--driver-memory 512m \
--executor-memory 512m \
--num-executors 1 \
--total-executor-cores 2 \
--class cn.itcast.bigdata.spark.submit.SparkSubmit \
${SPARK_HOME}/day01-core_2.11-1.0-SNAPSHOT.jar \
/datas/wordcount.input /datas/swcs

3. Spark on YARN

Documentation: http://spark.apache.org/docs/2.2.0/running-on-yarn.html

Submitting a Spark Application to YARN goes through the ResourceManager; the command is as follows:

SPARK_HOME=/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.2.0.jar \
10

As the run shows, when Spark runs on YARN its dependency jars and configuration files have to be uploaded before use, which takes a long time. To avoid wasting that time on every submission, upload the jars to HDFS once and tell the application where they are, so they no longer need to be uploaded each time.
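One common way to do this (a sketch; the HDFS path and NameNode address below are placeholders for your own cluster) is to upload Spark's jars to HDFS once and point the `spark.yarn.jars` property at them:

```
# Upload Spark's jars to HDFS once
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put ${SPARK_HOME}/jars/* /spark/jars/

# Then add to conf/spark-defaults.conf (NameNode address is a placeholder):
#   spark.yarn.jars  hdfs://namenode:8020/spark/jars/*
```

After this, spark-submit no longer uploads the framework jars on each run; only your application jar is shipped.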

Appendix

<!-- Repository locations: the aliyun, cloudera, and jboss repositories, in that order -->
<repositories>
    <repository>
        <id>aliyun</id>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    </repository>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    <repository>
        <id>jboss</id>
        <url>http://repository.jboss.com/nexus/content/groups/public</url>
    </repository>
</repositories>

<properties>
    <scala.version>2.11.8</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
    <spark.version>2.2.0</spark.version>
    <hadoop.version>2.6.0-cdh5.14.0</hadoop.version>
</properties>

<dependencies>
<!-- Scala language dependency -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
<!-- Spark Core dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
<!-- Hadoop Client dependency -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>

<build>
    <outputDirectory>target/classes</outputDirectory>
    <testOutputDirectory>target/test-classes</testOutputDirectory>
    <resources>
        <resource>
            <directory>${project.basedir}/src/main/resources</directory>
        </resource>
    </resources>
<!-- Maven compiler plugins -->
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                    <configuration>
                        <args>
                            <arg>-dependencyfile</arg>
                            <arg>${project.build.directory}/.scala_dependencies</arg>
                        </args>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

Original article: https://www.cnblogs.com/qidi/p/11861697.html