spark2 sql读取数据源编程学习样例1
问题导读 1.dataframe如何保存格式为parquet的文件? 2.在读取csv文件中,如何设置第一行为字段名? 3.dataframe保存为表如何指定buckete数目? 作为一个开发人员,我们学习spark sql,最终的目标通过spark sql完成我们想做的事情,那么我们该如何实现。这里根据官网,给出代码样例,并且对代码做一些诠释和说明。
[Scala] 纯文本查看 复制代码
?
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.examples.sql
import java.util.Properties
import org.apache.spark.sql.SparkSession
object SQLDataSourceExample {
case class Person(name: String, age: Long)
def main(args: Array[String]) {
val spark = SparkSession
.builder()
.appName("Spark SQL data sources example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
runBasicDataSourceExample(spark)
runBasicParquetExample(spark)
runParquetSchemaMergingExample(spark)
runJsonDatasetExample(spark)
runJdbcDatasetExample(spark)
spark.stop()
}
private def runBasicDataSourceExample(spark: SparkSession): Unit = {
// $example on:generic_load_save_functions$
val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
// $example off:generic_load_save_functions$
// $example on:manual_load_options$
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
// $example off:manual_load_options$
// $example on:manual_load_options_csv$
val peopleDFCsv = spark.read.format("csv")
.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
.load("examples/src/main/resources/people.csv")
// $example off:manual_load_options_csv$
// $example on:direct_sql$
val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
// $example off:direct_sql$
// $example on:write_sorting_and_bucketing$
peopleDF.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")
// $example off:write_sorting_and_bucketing$
// $example on:write_partitioning$
usersDF.write.partitionBy("favorite_color").format("parquet").save("namesPartByColor.parquet")
// $example off:write_partitioning$
// $example on:write_partition_and_bucket$
peopleDF
.write
.partitionBy("favorite_color")
.bucketBy(42, "name")
.saveAsTable("people_partitioned_bucketed")
// $example off:write_partition_and_bucket$
spark.sql("DROP TABLE IF EXISTS people_bucketed")
spark.sql("DROP TABLE IF EXISTS people_partitioned_bucketed")
}
private def runBasicParquetExample(spark: SparkSession): Unit = {
// $example on:basic_parquet_example$
// Encoders for most common types are automatically provided by importing spark.implicits._
import spark.implicits._
val peopleDF = spark.read.json("examples/src/main/resources/people.json")
// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write.parquet("people.parquet")
// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("people.parquet")
// Parquet files can also be used to create a temporary view and then used in SQL statements
parquetFileDF.createOrReplaceTempView("parquetFile")
val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
namesDF.map(attributes => "Name: " + attributes(0)).show()
// +------------+
// | value|
// +------------+
// |Name: Justin|
// +------------+
// $example off:basic_parquet_example$
}
private def runParquetSchemaMergingExample(spark: SparkSession): Unit = {
// $example on:schema_merging$
// This is used to implicitly convert an RDD to a DataFrame.
import spark.implicits._
// Create a simple DataFrame, store into a partition directory
val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
squaresDF.write.parquet("data/test_table/key=1")
// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
cubesDF.write.parquet("data/test_table/key=2")
// Read the partitioned table
val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
mergedDF.printSchema()
// The final schema consists of all 3 columns in the Parquet files together
// with the partitioning column appeared in the partition directory paths
// root
// |-- value: int (nullable = true)
// |-- square: int (nullable = true)
// |-- cube: int (nullable = true)
// |-- key: int (nullable = true)
// $example off:schema_merging$
}
private def runJsonDatasetExample(spark: SparkSession): Unit = {
// $example on:json_dataset$
// Primitive types (Int, String, etc) and Product types (case classes) encoders are
// supported by importing this when creating a Dataset.
import spark.implicits._
// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)
// The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")
// SQL statements can be run by using the sql methods provided by spark
val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
// +------+
// | name|
// +------+
// |Justin|
// +------+
// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset[String] storing one JSON object per string
val otherPeopleDataset = spark.createDataset(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleDataset)
otherPeople.show()
// +---------------+----+
// | address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+
// $example off:json_dataset$
}
private def runJdbcDatasetExample(spark: SparkSession): Unit = {
// $example on:jdbc_dataset$
// Note: JDBC loading and saving can be achieved via either the load/save or jdbc methods
// Loading data from a JDBC source
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:postgresql:dbserver")
.option("dbtable", "schema.tablename")
.option("user", "username")
.option("password", "password")
.load()
val connectionProperties = new Properties()
connectionProperties.put("user", "username")
connectionProperties.put("password", "password")
val jdbcDF2 = spark.read
.jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
// Specifying the custom data types of the read schema
connectionProperties.put("customSchema", "id DECIMAL(38, 0), name STRING")
val jdbcDF3 = spark.read
.jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
// Saving data to a JDBC source
jdbcDF.write
.format("jdbc")
.option("url", "jdbc:postgresql:dbserver")
.option("dbtable", "schema.tablename")
.option("user", "username")
.option("password", "password")
.save()
jdbcDF2.write
.jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
// Specifying create table column data types on write
jdbcDF.write
.option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
.jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
// $example off:jdbc_dataset$
}
}
代码已经有了,我们来看看里面包含的内容。在这之前,我们可以想到自己以前是如何编程的。无论是那种语言,首先我们需要引入系统包,然后创建程序入口,最后去实现一个个功能。当然spark sql也是这样的。我们来看。 包名 首先
[Scala] 纯文本查看 复制代码
?
package org.apache.spark.examples.sql
这里是包名,如果熟悉Java编程,相信这个很容易理解。其它语言可以网上查查包的作用。 导入系统包 接着就是我们熟悉的导入系统包,也就是spark相关包。
[Scala] 纯文本查看 复制代码
?
import java.util.Properties
import org.apache.spark.sql.SparkSession
单例对象 导入包后,我们就要创建程序入口,在创建入口之前,我们需要一个单例对象SQLDataSourceExample
[Scala] 纯文本查看 复制代码
?
object SQLDataSourceExample
在其它程序,SQLDataSourceExample可能是一个静态类,这就涉及到Scala的特殊之处了,由于静态成员(方法或者变量)在Scala中并不存在。Scala从不定义静态成员,而通过定义单例object取而代之 更多参考 http://www.aboutyun.com/forum.php?mod=viewthread&tid=12402 程序入口main def main(args: Array[String]) 这里我们看到它的定义关键字def来实现,args是参数名,Array[String]是参数类型 实例化sparksession
[Scala] 纯文本查看 复制代码
?
val spark = SparkSession
.builder()
.appName("Spark SQL data sources example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
上面代码则是实例化SparkSession
[Scala] 纯文本查看 复制代码
?
runBasicDataSourceExample(spark)
runBasicParquetExample(spark)
runParquetSchemaMergingExample(spark)
runJsonDatasetExample(spark)
runJdbcDatasetExample(spark)
上面其实去入口里面实现的功能,是直接调用的函数
[Scala] 纯文本查看 复制代码
?
spark.stop()
spark.stop这里表示程序运行完毕。这样入口,也可以说驱动里面的内容,我们已经阅读完毕。 函数实现 接着我们看每个函数的功能实现。
[Scala] 纯文本查看 复制代码
?
private def runBasicDataSourceExample(spark: SparkSession): Unit = {
// $example on:generic_load_save_functions$
val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
// $example off:generic_load_save_functions$
// $example on:manual_load_options$
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
// $example off:manual_load_options$
// $example on:manual_load_options_csv$
val peopleDFCsv = spark.read.format("csv")
.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
.load("examples/src/main/resources/people.csv")
// $example off:manual_load_options_csv$
// $example on:direct_sql$
val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
// $example off:direct_sql$
// $example on:write_sorting_and_bucketing$
peopleDF.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")
// $example off:write_sorting_and_bucketing$
// $example on:write_partitioning$
usersDF.write.partitionBy("favorite_color").format("parquet").save("namesPartByColor.parquet")
// $example off:write_partitioning$
// $example on:write_partition_and_bucket$
peopleDF
.write
.partitionBy("favorite_color")
.bucketBy(42, "name")
.saveAsTable("people_partitioned_bucketed")
// $example off:write_partition_and_bucket$
spark.sql("DROP TABLE IF EXISTS people_bucketed")
spark.sql("DROP TABLE IF EXISTS people_partitioned_bucketed")
}
私有函数的定义通过private关键字实现。Unit 是 greet 的结果类型。Unit 的结果类型指的是函数没有返回有用的值。Scala 的 Unit 类型接近于 Java 的 void 类型。这里面最让我们不习惯的是冒号,其实这里可以理解为一个分隔符。
[Scala] 纯文本查看 复制代码
?
private def runBasicDataSourceExample(spark: SparkSession): Unit =
下面我们每个解读代码:
[Scala] 纯文本查看 复制代码
?
val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
用来读取数据。
[Scala] 纯文本查看 复制代码
?
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
用来指定name和age字段保存格式为parquet,save("namesAndAges.parquet"),这里容易让我们理解为文件,其实这里是文件夹。
[Scala] 纯文本查看 复制代码
?
val peopleDFCsv = spark.read.format("csv")
.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
.load("examples/src/main/resources/people.csv")
上面代码用来读取csv文件。option是csv设置,比如header是指是否以第一行作为字段名。默认为false。这是我们设置为true,也就是说默认第一行为字段名。 更多信息参考下面来自官网
[Plain Text] 纯文本查看 复制代码
?
sep (default ,): sets the single character as a separator for each field and value.
encoding (default UTF-8): decodes the CSV files by the given encoding type.
quote (default "): sets the single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, you need to set not null but an empty string. This behaviour is different from com.databricks.spark.csv.
escape (default ): sets the single character used for escaping quotes inside an already quoted value.
comment (default empty string): sets the single character used for skipping lines beginning with this character. By default, it is disabled.
header (default false): uses the first line as names of columns.
inferSchema (default false): infers the input schema automatically from data. It requires one extra pass over the data.
ignoreLeadingWhiteSpace (default false): a flag indicating whether or not leading whitespaces from values being read should be skipped.
ignoreTrailingWhiteSpace (default false): a flag indicating whether or not trailing whitespaces from values being read should be skipped.
nullValue (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type.
nanValue (default NaN): sets the string representation of a non-number" value.
positiveInf (default Inf): sets the string representation of a positive infinity value.
negativeInf (default -Inf): sets the string representation of a negative infinity value.
dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.
maxColumns (default 20480): defines a hard limit of how many columns a record can have.
maxCharsPerColumn (default -1): defines the maximum number of characters allowed for any given value being read. By default, it is -1 meaning unlimited length
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes.
PERMISSIVE : sets other fields to null when it meets a corrupted record, and puts the malformed string into a field configured by columnNameOfCorruptRecord. To keep corrupt records, an user can set a string type field named columnNameOfCorruptRecord in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When a length of parsed CSV tokens is shorter than an expected length of a schema, it sets null for extra fields.
DROPMALFORMED : ignores the whole corrupted records.
FAILFAST : throws an exception when it meets corrupted records.
columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord.
multiLine (default false): parse one record, which may span multiple lines.
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader
[Scala] 纯文本查看 复制代码
?
peopleDF.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")
42为bucket数目,name为字段名。这是在spark2.1才有的功能
[Scala] 纯文本查看 复制代码
?
usersDF.write.partitionBy("favorite_color").format("parquet").save("namesPartByColor.parquet")
在文件系统中按给定列favorite_color分区输出。
[Scala] 纯文本查看 复制代码
?
peopleDF
.write
.partitionBy("favorite_color")
.bucketBy(42, "name")
.saveAsTable("people_partitioned_bucketed")
// $example off:write_partition_and_bucket$
spark.sql("DROP TABLE IF EXISTS people_bucketed")
spark.sql("DROP TABLE IF EXISTS people_partitioned_bucketed")
上面分别为保存为表及删除表。 后续再更新
- Silverlight 3.0 中的 WriteableBitmap
- WCF后续之旅(10): 通过WCF Extension实现以对象池的方式创建Service Instance
- Silverlight菜单控件 — CurveMenu
- 实力终端撑腰 两枚域名均五位数被秒
- Silverlight制作逐帧动画 v2 - part2
- Nodejs学习笔记(四)--- 与MySQL交互(felixge/node-mysql)
- 学习Spark——环境搭建(Mac版)
- 离线网络环境下一键式部署
- WCF后续之旅(17):通过tcpTracer进行消息的路由
- Linux同步机制(一) - 线程锁
- Silverlight类库介绍-FJCore
- 大型网站的自强之路
- 人工智能:浮现
- 机器人进化 如何确保 安全概率?
- JavaScript 教程
- JavaScript 编辑工具
- JavaScript 与HTML
- JavaScript 与Java
- JavaScript 数据结构
- JavaScript 基本数据类型
- JavaScript 特殊数据类型
- JavaScript 运算符
- JavaScript typeof 运算符
- JavaScript 表达式
- JavaScript 类型转换
- JavaScript 基本语法
- JavaScript 注释
- Javascript 基本处理流程
- Javascript 选择结构
- Javascript if 语句
- Javascript if 语句的嵌套
- Javascript switch 语句
- Javascript 循环结构
- Javascript 循环结构实例
- Javascript 跳转语句
- Javascript 控制语句总结
- Javascript 函数介绍
- Javascript 函数的定义
- Javascript 函数调用
- Javascript 几种特殊的函数
- JavaScript 内置函数简介
- Javascript eval() 函数
- Javascript isFinite() 函数
- Javascript isNaN() 函数
- parseInt() 与 parseFloat()
- escape() 与 unescape()
- Javascript 字符串介绍
- Javascript length属性
- javascript 字符串函数
- Javascript 日期对象简介
- Javascript 日期对象用途
- Date 对象属性和方法
- Javascript 数组是什么
- Javascript 创建数组
- Javascript 数组赋值与取值
- Javascript 数组属性和方法
- TensorFlow中如何确定张量的形状实例
- 使用Dajngo 通过代码添加xadmin用户和权限(组)
- python和js交互调用的方法
- Python中flatten( ),matrix.A用法说明
- python中id函数运行方式
- CentOS 7如何实现定时执行python脚本
- PHP自动生成缩略图函数的源码示例
- 解决tensorflow 释放图,删除变量问题
- php生成word并下载代码实例
- TensorFlow保存TensorBoard图像操作
- 浅谈PHP SHA1withRSA加密生成签名及验签
- PHP PDO数据库操作预处理与注意事项
- laravel 框架配置404等异常页面
- Django –Xadmin 判断登录者身份实例
- Laravel 队列使用的实现