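Using Hive for Multidimensional Statistical Analysis & a Summary of Tips
Multidimensional statistics generally fall into two categories. Let's look at how Hive handles each:
1. Multidimensional combinations over the same attribute
(1) Problem: given the data below, whose fields are, in order: url, catePath0, catePath1, catePath2, unitparams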
https://cwiki.apache.org/confluence 0 1 8 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058 0 1 23 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 1 25 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
https://cwiki.apache.org/confluence 0 5 18 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058 0 5 118 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 3 98 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 3 8 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058 0 5 81 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 9 8 {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
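(2) Requirement: for every combination of the three dimensions catePath0, catePath1, catePath2, compute the pv and uv of the corresponding URLs. Each output row is catePath0, catePath1, catePath2, pv, uv, for example: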
0 1 23 1 1
0 1 25 1 1
0 1 8 1 1
0 1 ALL 3 3
0 3 8 1 1
0 3 98 1 1
0 3 ALL 2 1
0 5 118 1 1
0 5 18 1 1
0 5 81 1 1
0 5 ALL 3 2
0 ALL ALL 8 3
ALL ALL ALL 8 3
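(3) Approach: in Hive, same-attribute multidimensional statistics are usually solved by using union all to assemble every dimension combination and then applying group by: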
create EXTERNAL table IF NOT EXISTS t_log (
url string, c0 string, c1 string, c2 string, unitparams string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' location '/tmp/decli/1';
select * from (
select host, c0, c1, c2 from t_log t0
LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
union all
select host, c0, c1, 'ALL' c2 from t_log t0
LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
union all
select host, c0, 'ALL' c1, 'ALL' c2 from t_log t0
LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
union all
select host, 'ALL' c0, 'ALL' c1, 'ALL' c2 from t_log t0
LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
) test;
select c0, c1, c2, count(host) PV, count(distinct(host)) UV from (
select host, c0, c1, c2 from t_log t0
LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
union all
select host, c0, c1, 'ALL' c2 from t_log t0
LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
union all
select host, c0, 'ALL' c1, 'ALL' c2 from t_log t0
LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
union all
select host, 'ALL' c0, 'ALL' c1, 'ALL' c2 from t_log t0
LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
) test group by c0, c1, c2;
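The four union all branches above are exactly the roll-up levels (c0, c1, c2), (c0, c1), (c0) and the grand total. As a hedged alternative, the same shape can be sketched with GROUP BY ... WITH ROLLUP, so the filtered subquery is written only once. This assumes a Hive version that supports ROLLUP (0.10 or later) and that c0, c1, c2 never hold real NULLs, because ROLLUP marks rolled-up levels with NULL. The output aliases cate0, cate1, cate2 are names chosen here for illustration.

```sql
-- Sketch only: ROLLUP in place of the four UNION ALL branches.
-- Rolled-up levels come back as NULL, so coalesce() relabels them as 'ALL'.
SELECT
  coalesce(c0, 'ALL')  AS cate0,
  coalesce(c1, 'ALL')  AS cate1,
  coalesce(c2, 'ALL')  AS cate2,
  count(host)          AS pv,
  count(DISTINCT host) AS uv
FROM (
  SELECT host, c0, c1, c2
  FROM t_log t0
  LATERAL VIEW parse_url_tuple(url, 'HOST') t1 AS host
  WHERE get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
) test
GROUP BY c0, c1, c2 WITH ROLLUP;
```

If only some of the combinations are needed, GROUPING SETS ((c0, c1, c2), (c0, c1), (c0), ()) lists them explicitly instead of taking the full roll-up.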
2. Multidimensional combinations over different attributes
For this scenario we usually choose Multi Table/File Inserts. The following is taken from Programming Hive, p. 124:
Making Multiple Passes over the Same Data
Hive has a special syntax for producing multiple aggregations from a single pass through a source of data, rather than rescanning it for each aggregation. This change can save considerable processing time for large input data sets. We discussed the details previously in Chapter 5.
For example, each of the following two queries creates a table from the same source table, history:
hive> INSERT OVERWRITE TABLE sales
    > SELECT * FROM history WHERE action='purchased';
hive> INSERT OVERWRITE TABLE credits
    > SELECT * FROM history WHERE action='returned';
This syntax is correct, but inefficient. The following rewrite achieves the same thing, but using a single pass through the source history table:
hive> FROM history
    > INSERT OVERWRITE sales SELECT * WHERE action='purchased'
    > INSERT OVERWRITE credits SELECT * WHERE action='returned';
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
SELECT pv_users.age, count(DISTINCT pv_users.userid)
GROUP BY pv_users.age;
https://cwiki.apache.org/confluence/display/Hive/Tutorial
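Caveats and a few tips:
1. Hive union all usage: at the time of writing it is not supported at the top level of a query, so it has to be wrapped in a subquery as in the examples above. The column names and types of every select must also match exactly.
2. Result ordering: you can add a character of your own to control the sort order.
3. Like union all, a multi-insert scans the source only once. But because it has to insert into multiple partitions, it does a lot of extra work and can take a very long time: it produces multiple jobs, whereas union all itself is a single job.
On running the multiple jobs produced by insert overwrite in parallel:
set hive.exec.parallel=true; -- enable parallel execution of jobs
set hive.exec.parallel.thread.number=16; -- maximum parallelism allowed within one SQL statement; the default is 8
http://superlxw1234.iteye.com/blog/1703713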
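4. At the time of writing, Hive does not support a subquery inside not in, so an HQL statement of the following form is not supported. To query rows whose key is in table a but not in table b:
select a.key from a where key not in (select key from b)
This statement is not supported in Hive. The same query can be done with a left outer join (assuming table b contains another field, key1):
select a.key from a left outer join b on a.key=b.key where b.key1 is null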
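For readers on newer releases, here is a minimal sketch of the same anti-join written as a correlated subquery. It assumes Hive 0.13 or later, where subqueries in the WHERE clause were added, and reuses the tables a and b from the tip above. The second statement is a variant of the left outer join that tests the join key itself, so no extra column such as key1 is needed.

```sql
-- Assumes Hive 0.13+ (subqueries in WHERE); a and b are the tables from the tip.
SELECT a.key
FROM a
WHERE NOT EXISTS (SELECT b.key FROM b WHERE b.key = a.key);

-- Works on older versions too: rows of a with no match in b have b.key = NULL.
SELECT a.key
FROM a
LEFT OUTER JOIN b ON a.key = b.key
WHERE b.key IS NULL;
```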
5. left outer join cannot be chained three or more times in a row. The joins must be grouped in pairs, with each pair wrapped in a subquery:
select p.ssi,p.pv,p.uv,p.nuv,p.visits,'2012-06-19 17:00:00' from (
select * from (
select * from (select ssi,count(1) pv,sum(visits) visits from FactClickAnalysis
where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi ) p1
left outer join
(
select ssi,count(1) uv from (select ssi,cookieid from FactClickAnalysis
where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi,cookieid ) t1 group by ssi
) p2 on p1.ssi=p2.ssi
) p3
left outer join
(
select ssi, count(1) nuv from FactClickAnalysis
where logTime = insertTime and logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi
) p4 on p3.ssi=p4.ssi
) p
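As a hedged aside, the three per-ssi aggregates in this particular query can also be sketched as a single GROUP BY, which sidesteps the join nesting altogether. This is not equivalent in every detail: nuv comes back as 0 rather than NULL when an ssi has no matching rows, and count(DISTINCT cookieid) ignores NULL cookieid values, whereas the grouped subquery above counts them as one group. The alias stat_time is a name chosen here for illustration.

```sql
-- Sketch only: pv, uv, nuv and visits per ssi in one pass over the same time window.
SELECT ssi,
       count(1)                                              AS pv,
       count(DISTINCT cookieid)                              AS uv,
       sum(CASE WHEN logTime = insertTime THEN 1 ELSE 0 END) AS nuv,
       sum(visits)                                           AS visits,
       '2012-06-19 17:00:00'                                 AS stat_time
FROM FactClickAnalysis
WHERE logTime >= '2012-06-19 17:00:00'
  AND logTime <= '2012-06-19 18:00:00'
GROUP BY ssi;
```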
6. Running MapReduce locally in Hive
http://superlxw1234.iteye.com/blog/1703546
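A minimal sketch of the settings usually involved, assuming a Hive version that has automatic local mode. The byte threshold below is an illustrative value, not a recommendation taken from the linked post.

```sql
-- Let Hive decide per query whether the job is small enough to run locally.
SET hive.exec.mode.local.auto=true;
-- Illustrative threshold: only inputs up to 128 MB (134217728 bytes) qualify.
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
```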
7. An error caused by creating too many Hive dynamic partitions
http://superlxw1234.iteye.com/blog/1677938
8. Making clever use of greedy regex matching in Hive
http://superlxw1234.iteye.com/blog/1751216
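9. Matching fields that consist entirely of Chinese characters in Hive
The regex used in Java to match Chinese characters works here as well: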
name rlike '^[\u4e00-\u9fa5]+$'
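Checking whether a field consists entirely of digits (the backslash is doubled because Hive string literals process escapes, so '\d' would otherwise be read as the letter d):
select mobile from woa_login_log_his where pt = '2012-01-10' and mobile rlike '^\\d+$' limit 50;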
10. Using the SQL window functions LAG/LEAD/FIRST/LAST in Hive
http://superlxw1234.iteye.com/blog/1600323
http://www.shaoqun.com/a/18839.aspx
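For reference, a small sketch using the built-in windowing functions, assuming Hive 0.11 or later (earlier versions need a workaround such as the ones in the linked posts). It reuses the FactClickAnalysis table and its cookieid and logTime columns from tip 5. The output aliases are names chosen here for illustration.

```sql
-- Sketch only: previous, next, first and last click time per cookie.
SELECT ssi, cookieid, logTime,
       lag(logTime, 1)      OVER (PARTITION BY cookieid ORDER BY logTime) AS prev_time,
       lead(logTime, 1)     OVER (PARTITION BY cookieid ORDER BY logTime) AS next_time,
       first_value(logTime) OVER (PARTITION BY cookieid ORDER BY logTime) AS first_time,
       -- last_value needs an explicit full frame; the default frame stops at the current row.
       last_value(logTime)  OVER (PARTITION BY cookieid ORDER BY logTime
                                  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_time
FROM FactClickAnalysis;
```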
11. Hive optimization: controlling the number of maps and reduces in a Hive job
http://superlxw1234.iteye.com/blog/1582880
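A rough sketch of the reducer-side knobs, using the classic property names from the MapReduce-era Hive releases. All values are illustrative assumptions and should be tuned against the actual data volume.

```sql
-- Option A: fix the reducer count for the session (illustrative value).
SET mapred.reduce.tasks=10;

-- Option B: let Hive estimate the count from the input size instead.
-- SET mapred.reduce.tasks=-1;
-- SET hive.exec.reducers.bytes.per.reducer=268435456;  -- about 256 MB per reducer
-- SET hive.exec.reducers.max=100;                      -- upper bound on reducers
```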
12. Escaping special characters such as $ in Hive
http://superlxw1234.iteye.com/blog/1568739
13. Date handling:
Get the date N days ago:
select from_unixtime(unix_timestamp('20111102','yyyyMMdd') - N*86400,'yyyyMMdd') from t_lxw_test1 limit 1;
Get the number of days/seconds/minutes etc. between two dates:
select ( unix_timestamp('2011-11-02','yyyy-MM-dd')-unix_timestamp('2011-11-01','yyyy-MM-dd') ) / 86400 from t_lxw_test limit 1;
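The same two calculations can also be sketched with Hive's built-in date functions, which avoid the manual division by 86400. date_sub and datediff expect 'yyyy-MM-dd' strings, so a 'yyyyMMdd' value has to be converted first. The tables are the ones used above, and the 7-day offset is an illustrative value for N.

```sql
-- Date 7 days before 2011-11-02.
SELECT date_sub('2011-11-02', 7) FROM t_lxw_test1 LIMIT 1;

-- Same, starting from a yyyyMMdd string: convert, subtract, convert back.
SELECT from_unixtime(
         unix_timestamp(
           date_sub(from_unixtime(unix_timestamp('20111102', 'yyyyMMdd'), 'yyyy-MM-dd'), 7),
           'yyyy-MM-dd'),
         'yyyyMMdd')
FROM t_lxw_test1 LIMIT 1;

-- Whole days between two dates (end date first, then start date).
SELECT datediff('2011-11-02', '2011-11-01') FROM t_lxw_test LIMIT 1;
```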
14. Deleting Hive temporary files (hive.exec.scratchdir)
http://hi.baidu.com/youziguo/item/1dd7e6315dcc0f28b2c0c576
REF:
http://superlxw1234.iteye.com/blog/1536440
http://liubingwwww.blog.163.com/blog/static/3048510720125201749323/
http://blog.csdn.net/azhao_dn/article/details/6921429
http://superlxw1234.iteye.com/category/228899