hive

Date: 2019-08-07

Hive installation

1. Upload and unpack the Hive tarball

2. cd conf; create hive-site.xml (configure the MySQL metastore connection)

vim hive-site.xml

<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://pengyy63:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>

3. cd lib

Upload the MySQL JDBC driver mysql-connector-java-5.1.39.jar

4. Start Hive

bin/hive

Hive 1.x starts directly without errors.
With Hive 3.x, the CLI starts, but entering any SQL statement then fails with:
FAILED: HiveException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Solution:
bin/schematool -initSchema -dbType mysql
The error shows that the metastore schema has not been initialized; run the command above from the bin directory of the Hive installation.

Hive syntax basics

1.show databases;
2.create database pengyy;
3.use pengyy;
4.show tables;
5.select * from t_user;
6. Show column headers: set hive.cli.print.header=true; -- only takes effect in the current session
7. Show the current database name: set hive.cli.print.current.db=true; -- only takes effect in the current session
-- Note: to enable the database name and column headers every time Hive starts, create a .hiverc file in the current user's home directory
    cd ~   -- go to the home directory
    vim .hiverc
    set hive.cli.print.current.db=true;
    set hive.cli.print.header=true;
8. Create a table
use pengyy;
create table t_movies(id string,name string,director string)
row format delimited
fields terminated by ',';
vim movies.txt
1,哪吒,张艺谋
2,星空,唐晶
3,啦啦,呼呼
hadoop fs -put movies.txt /user/hive/warehouse/pengyy.db/t_movies/
In the Hive CLI:
select * from t_movies;
+--------------+----------------+--------------------+
| t_movies.id  | t_movies.name  | t_movies.director  |
+--------------+----------------+--------------------+
| 1            | 哪吒             | 张艺谋                |
| 2            | 星空             | 唐晶                 |
| 3            | 啦啦             | 呼呼                 |
+--------------+----------------+--------------------+

vim movies2.txt
aa,战狼2,吴京
bb,三生三世十里桃花,杨幂
hadoop fs -put movies2.txt /user/hive/warehouse/pengyy.db/t_movies/
select * from t_movies;
+--------------+----------------+--------------------+
| t_movies.id  | t_movies.name  | t_movies.director  |
+--------------+----------------+--------------------+
| 1            | 哪吒             | 张艺谋                |
| 2            | 星空             | 唐晶                 |
| 3            | 啦啦             | 呼呼                 |
| aa           | 战狼2            | 吴京                 |
| bb           | 三生三世十里桃花       | 杨幂                 |
+--------------+----------------+--------------------+

Starting the Hive server

Running HiveServer2 in the foreground

On the Hive server host pengyy42, start:  bin/hiveserver2
On the Hive client host pengyy43, start:  bin/beeline
    bin/beeline
    !connect jdbc:hive2://pengyy42:10000
If the following problem appears: 19/07/30 22:49:29 [main]: WARN jdbc.HiveConnection: Failed to connect to pengyy42:10000
Error: Could not open client transport with JDBC Uri: jdbc:hive2://pengyy42:10000: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: root is not allowed to impersonate root (state=08S01,code=0)
Solution:
Add the following properties to Hadoop's core-site.xml:
<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>
Set the xxx in hadoop.proxyuser.xxx.hosts and hadoop.proxyuser.xxx.groups above to root (i.e. to whatever user appears as User: xxx in your error log). The "*" means any user, group, and host may operate on Hadoop through the super-proxy "xxx". Then restart HDFS, HiveServer2, and the Beeline client.
Why this change is needed:
Hadoop has an impersonation (proxy user) safety mechanism: an upper-layer system is not allowed to pass the real end user straight through to Hadoop. Instead, the real user is handed to a super-proxy, which performs the operations on Hadoop on the user's behalf, so arbitrary clients cannot operate on Hadoop at will.

Running HiveServer2 in the background

In Linux, file descriptor 1 is standard output and 2 is standard error.
./test.sh
./test.sh &
./test.sh 1>test.log 2>err.log &   -- redirect the console output into log files

In Linux, /dev/null can be thought of as a "black hole" file: everything written to it is lost forever.
If a script's output is not wanted, redirect it to /dev/null:
./test.sh 1>/dev/null 2>/dev/null &
./test.sh 1>/dev/null 2>&1 &     -- redirect stderr (2) to wherever stdout (1) is redirected
fg brings a background job back to the foreground; then Ctrl+C stops it
fg 1

nohup ./test.sh 1>/dev/null 2>&1 &   -- the program keeps running in the background even after the shell exits

Start HiveServer2 in the background:
nohup bin/hiveserver2 >/dev/null 2>&1 &
HiveServer2 listens on port 10000 by default; check that it is up with:
netstat -ntlp 

Client connection, option 1:
bin/beeline
!connect jdbc:hive2://pengyy42:10000
Client connection, option 2:
bin/beeline -u jdbc:hive2://pengyy42:10000 -n root

Running Hive from scripts

1. Run a statement directly: hive -e "select * from t_user"
2. Put the statements in a shell script, then run the script:
    vim test1.sh
        #!/bin/bash
        hive -e "create table t_count_sex(sex string,number int)"
        hive -e "insert into t_count_sex select sex,count(1) from t_user group by sex"
        hive -e "select * from t_count_sex"
    sh test1.sh
3. Put the SQL in a script file such as test2.sql and run it with hive -f test2.sql:
    vim test2.sql
        select * from t_user; -- a larger SQL script would go here, omitted
    hive -f test2.sql

Hive managed (internal) tables and external tables

Managed table (managed_table): the table directory follows Hive's conventions and lives under the warehouse directory /user/hive/warehouse/.
    Dropping a managed table 1. deletes the data files on HDFS and 2. deletes the table's metadata (the rows in the hive metastore database in MySQL); see the sketch below.
External table (external_table): the table directory is specified by the user who creates the table.
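
A minimal sketch of the drop-behavior difference (added for illustration; the table names t_demo_managed and t_demo_external are hypothetical, and /pvlogs is the directory used in the external-table example below):

-- managed table: Hive owns the data under /user/hive/warehouse/<db>.db/t_demo_managed
create table t_demo_managed(ip string,url string,access_time string)
row format delimited
fields terminated by ',';

-- external table: Hive only records the location; the data stays in /pvlogs
create external table t_demo_external(ip string,url string,access_time string)
row format delimited
fields terminated by ','
location '/pvlogs';

drop table t_demo_managed;   -- removes the metadata and the files under the warehouse directory
drop table t_demo_external;  -- removes the metadata only; the files in /pvlogs stay on HDFS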

vim pv_log.txt
192.168.31.22,http://sina.com/a,2019-07-31 20:00:23
192.168.31.22,http://sina.com/b,2019-07-26 15:00:31
192.168.31.23,http://sina.com/k,2019-07-31 20:09:27
192.168.31.24,http://sina.com/a,2019-07-27 20:00:31
192.168.31.25,http://sina.com/h,2019-07-31 20:08:29
192.168.31.28,http://sina.com/a,2019-07-29 20:22:31
192.168.31.22,http://sina.com/e,2019-07-31 20:07:22
192.168.31.88,http://sina.com/c,2019-07-31 20:00:11

hadoop fs -mkdir -p /pvlogs
hadoop fs -put pv_log.txt /pvlogs

use pengyy;
create external table t_pv_log(ip string,url string ,access_time string)
row format delimited
fields terminated by ','
location '/pvlogs'
;
select * from t_pv_log;

Hive partitioned tables

cd /usr/local/test
vim pv_log.31
192.168.31.22,http://sina.com/a,2019-07-31 20:00:23
192.168.31.22,http://sina.com/b,2019-07-31 15:00:31
192.168.31.23,http://sina.com/k,2019-07-31 20:09:27
192.168.31.24,http://sina.com/a,2019-07-31 20:00:31
192.168.31.25,http://sina.com/h,2019-07-31 20:08:29
192.168.31.28,http://sina.com/a,2019-07-31 20:22:31
192.168.31.22,http://sina.com/e,2019-07-31 20:07:22
192.168.31.88,http://sina.com/c,2019-07-31 20:00:11

vim pv_log.30
192.168.31.22,http://sina.com/a,2019-07-30 20:00:23
192.168.31.22,http://sina.com/b,2019-07-30 15:00:31
192.168.31.23,http://sina.com/k,2019-07-30 20:09:27
192.168.31.24,http://sina.com/a,2019-07-30 20:00:31
192.168.31.25,http://sina.com/h,2019-07-30 20:08:29
192.168.31.28,http://sina.com/a,2019-07-30 20:22:31
192.168.31.22,http://sina.com/e,2019-07-30 20:07:22
192.168.31.88,http://sina.com/c,2019-07-30 20:00:11

vim pv_log.29
192.168.31.22,http://sina.com/a,2019-07-29 20:00:23
192.168.31.22,http://sina.com/b,2019-07-29 15:00:31
192.168.31.23,http://sina.com/k,2019-07-29 20:09:27
192.168.31.24,http://sina.com/a,2019-07-29 20:00:31
192.168.31.25,http://sina.com/h,2019-07-29 20:08:29
192.168.31.28,http://sina.com/a,2019-07-29 20:22:31
192.168.31.22,http://sina.com/e,2019-07-29 20:07:22
192.168.31.88,http://sina.com/c,2019-07-29 20:00:11

-- create a partitioned table
create external table t_pv_log_daily(ip string,url string ,access_time string)
partitioned by (dtime string)
row format delimited
fields terminated by ','
;

Load the local Linux files into Hive.
In the Hive CLI:
load data local inpath "/usr/local/test/pv_log.31" into table t_pv_log_daily partition(dtime='20190731');
load data local inpath "/usr/local/test/pv_log.30" into table t_pv_log_daily partition(dtime='20190730');
load data local inpath "/usr/local/test/pv_log.29" into table t_pv_log_daily partition(dtime='20190729');

Note: the partition column must not be one of the columns already defined in the table; Hive treats the partition column as a pseudo-column of the table.
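
A minimal sketch of querying by the partition pseudo-column (added for illustration; the dtime values are the ones loaded above). Filtering on dtime lets Hive read only the matching partition directory instead of scanning the whole table:

-- read only the dtime='20190731' partition directory
select ip,url,access_time from t_pv_log_daily where dtime='20190731';

-- page views per day, grouping by the partition pseudo-column
select dtime,count(*) as pv from t_pv_log_daily group by dtime;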

Create a table with the same structure: create table t_pv_log_daily_2 like t_pv_log_daily; -- no data, identical table structure

Create a table whose columns match a query result:
create table t_pv_log_daily_3 as select * from t_pv_log_daily; -- includes the data

select * from t_pv_log_daily;
+--------------------+---------------------+-----------------------------+-----------------------+
| t_pv_log_daily.ip  | t_pv_log_daily.url  | t_pv_log_daily.access_time  | t_pv_log_daily.dtime  |
+--------------------+---------------------+-----------------------------+-----------------------+
| 192.168.31.22      | http://sina.com/a   | 2019-07-29 20:00:23         | 20190729              |
| 192.168.31.22      | http://sina.com/b   | 2019-07-29 15:00:31         | 20190729              |
| 192.168.31.23      | http://sina.com/k   | 2019-07-29 20:09:27         | 20190729              |
| 192.168.31.24      | http://sina.com/a   | 2019-07-29 20:00:31         | 20190729              |
| 192.168.31.25      | http://sina.com/h   | 2019-07-29 20:08:29         | 20190729              |
| 192.168.31.28      | http://sina.com/a   | 2019-07-29 20:22:31         | 20190729              |
| 192.168.31.22      | http://sina.com/e   | 2019-07-29 20:07:22         | 20190729              |
| 192.168.31.88      | http://sina.com/c   | 2019-07-29 20:00:11         | 20190729              |
| 192.168.31.22      | http://sina.com/a   | 2019-07-30 20:00:23         | 20190730              |
| 192.168.31.22      | http://sina.com/b   | 2019-07-30 15:00:31         | 20190730              |
| 192.168.31.23      | http://sina.com/k   | 2019-07-30 20:09:27         | 20190730              |
| 192.168.31.24      | http://sina.com/a   | 2019-07-30 20:00:31         | 20190730              |
| 192.168.31.25      | http://sina.com/h   | 2019-07-30 20:08:29         | 20190730              |
| 192.168.31.28      | http://sina.com/a   | 2019-07-30 20:22:31         | 20190730              |
| 192.168.31.22      | http://sina.com/e   | 2019-07-30 20:07:22         | 20190730              |
| 192.168.31.88      | http://sina.com/c   | 2019-07-30 20:00:11         | 20190730              |
| 192.168.31.22      | http://sina.com/a   | 2019-07-31 20:00:23         | 20190731              |
| 192.168.31.22      | http://sina.com/b   | 2019-07-31 15:00:31         | 20190731              |
| 192.168.31.23      | http://sina.com/k   | 2019-07-31 20:09:27         | 20190731              |
| 192.168.31.24      | http://sina.com/a   | 2019-07-31 20:00:31         | 20190731              |
| 192.168.31.25      | http://sina.com/h   | 2019-07-31 20:08:29         | 20190731              |
| 192.168.31.28      | http://sina.com/a   | 2019-07-31 20:22:31         | 20190731              |
| 192.168.31.22      | http://sina.com/e   | 2019-07-31 20:07:22         | 20190731              |
| 192.168.31.88      | http://sina.com/c   | 2019-07-31 20:00:11         | 20190731              | 
+--------------------+---------------------+-----------------------------+-----------------------+

As the output shows, the partition column dtime is indeed treated as a pseudo-column of the table.
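
As a quick cross-check (a small addition, not part of the original notes), the partitions created by the three loads can be listed directly:

show partitions t_pv_log_daily;
-- expected output, given the loads above:
-- dtime=20190729
-- dtime=20190730
-- dtime=20190731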

desc t_pv_log_daily;
+--------------------------+------------+----------+
|         col_name         | data_type  | comment  |
+--------------------------+------------+----------+
| ip                       | string     |          |
| url                      | string     |          |
| access_time              | string     |          |
| dtime                    | string     |          |
|                          | NULL       | NULL     |
| # Partition Information  | NULL       | NULL     |
| # col_name               | data_type  | comment  |
| dtime                    | string     |          |
+--------------------------+------------+----------+

show create table t_pv_log_daily;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `t_pv_log_daily`(            |
|   `ip` string,                                     |
|   `url` string,                                    |
|   `access_time` string)                            |
| PARTITIONED BY (                                   |
|   `dtime` string)                                  |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| WITH SERDEPROPERTIES (                             |
|   'field.delim'=',',                               |
|   'serialization.format'=',')                      |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'hdfs://pengyy22:9000/user/hive/warehouse/pengyy.db/t_pv_log_daily' |
| TBLPROPERTIES (                                    |
|   'bucketing_version'='2',                         |
|   'transient_lastDdlTime'='1564577640')            |
+----------------------------------------------------+

desc formatted t_pv_log_daily;
+-------------------------------+----------------------------------------------------+-----------------------+
|           col_name            |                     data_type                      |        comment        |
+-------------------------------+----------------------------------------------------+-----------------------+
| # col_name                    | data_type                                          | comment               |
| ip                            | string                                             |                       |
| url                           | string                                             |                       |
| access_time                   | string                                             |                       |
|                               | NULL                                               | NULL                  |
| # Partition Information       | NULL                                               | NULL                  |
| # col_name                    | data_type                                          | comment               |
| dtime                         | string                                             |                       |
|                               | NULL                                               | NULL                  |
| # Detailed Table Information  | NULL                                               | NULL                  |
| Database:                     | pengyy                                             | NULL                  |
| OwnerType:                    | USER                                               | NULL                  |
| Owner:                        | root                                               | NULL                  |
| CreateTime:                   | Wed Jul 31 20:54:00 CST 2019                       | NULL                  |
| LastAccessTime:               | UNKNOWN                                            | NULL                  |
| Retention:                    | 0                                                  | NULL                  |
| Location:                     | hdfs://pengyy22:9000/user/hive/warehouse/pengyy.db/t_pv_log_daily | NULL                  |
| Table Type:                   | EXTERNAL_TABLE                                     | NULL                  |
| Table Parameters:             | NULL                                               | NULL                  |
|                               | EXTERNAL                                           | TRUE                  |
|                               | bucketing_version                                  | 2                     |
|                               | numFiles                                           | 3                     |
|                               | numPartitions                                      | 3                     |
|                               | numRows                                            | 0                     |
|                               | rawDataSize                                        | 0                     |
|                               | totalSize                                          | 1249                  |
|                               | transient_lastDdlTime                              | 1564577640            |
|                               | NULL                                               | NULL                  |
| # Storage Information         | NULL                                               | NULL                  |
| SerDe Library:                | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL                  |
| InputFormat:                  | org.apache.hadoop.mapred.TextInputFormat           | NULL                  |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL                  |
| Compressed:                   | No                                                 | NULL                  |
| Num Buckets:                  | -1                                                 | NULL                  |
| Bucket Columns:               | []                                                 | NULL                  |
| Sort Columns:                 | []                                                 | NULL                  |
| Storage Desc Params:          | NULL                                               | NULL                  |
|                               | field.delim                                        | ,                     |
|                               | serialization.format                               | ,                     |
+-------------------------------+----------------------------------------------------+-----------------------+

Data import and export

In the Hive CLI:
Load a local file into Hive:
load data local inpath '/usr/local/test/user.info' into table t_user;
Load an HDFS file into Hive:
load data inpath '/user2.info' into table t_user;
The difference:
Loading a local file into Hive is a copy: an identical copy of the file is created under the table's directory.
Loading an HDFS file into Hive is a move: the file is moved from its HDFS source path into the table's directory.
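
The notes above only cover imports; the following is a minimal, hedged sketch of the export direction (standard HiveQL; the output paths are hypothetical):

-- export query results to a local directory as comma-delimited text files
insert overwrite local directory '/usr/local/test/t_user_export'
row format delimited fields terminated by ','
select * from t_user;

-- drop the LOCAL keyword to write to an HDFS directory instead
insert overwrite directory '/tmp/t_user_export'
row format delimited fields terminated by ','
select * from t_user;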

Original article: https://www.cnblogs.com/pengyy/p/11318262.html