ElasticSearch Cardinality Aggregation聚合计算的误差
使用ES不久,今天发现生产环境数据异常,其使用的ES版本是2.1.2,其它版本也类似。通过使用ES的HTTP API进行查询,发现得到的数据跟javaClient API 查询得到的数据不一致,于是对代码逻辑以及ES查询工具产生了怀疑。通过查阅官方文档找到如下描述:
Precision controledit
This aggregation also supports the
precision_threshold
option:The
precision_threshold
option is specific to the current internal implementation of thecardinality
agg, which may change in the future{ "aggs" : { "author_count" : { "cardinality" : { "field" : "author_hash", "precision_threshold": 100
} } } }
The
precision_threshold
options allows to trade memory for accuracy, and defines a unique count below which counts are expected to be close to accurate. Above this value, counts might become a bit more fuzzy. The maximum supported value is 40000, thresholds above this number will have the same effect as a threshold of 40000. Default value depends on the number of parent aggregations that multiple create buckets (such as terms or histograms).Counts are approximateedit
Computing exact counts requires loading values into a hash set and returning its size. This doesn’t scale when working on high-cardinality sets and/or large values as the required memory usage and the need to communicate those per-shard sets between nodes would utilize too many resources of the cluster.
This
cardinality
aggregation is based on the HyperLogLog++ algorithm, which counts based on the hashes of the values with some interesting properties:
- configurable precision, which decides on how to trade memory for accuracy,
- excellent accuracy on low-cardinality sets,
- fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.
For a precision threshold of
c
, the implementation that we are using requires aboutc * 8
bytes.The following chart shows how the error varies before and after the threshold:
For all 3 thresholds, counts have been accurate up to the configured threshold (although not guaranteed, this is likely to be the case). Please also note that even with a threshold as low as 100, the error remains under 5%, even when counting millions of items.
其意思就是:聚合查询存在误差,在5%范围之内,通过调整“precision_threshold”参数进行调整。
于是翻阅查询代码:加入如下部分问题得到解决。该参数在查询时未设置的情况下,默认值为3000。
private void buildSearchQueryForAgg(NativeSearchQueryBuilder nativeSearchQueryBuilder) {
// 设置聚合条件
TermsBuilder agg = AggregationBuilders.terms(aggreName).field(XXX.XXX).size(Integer.MAX_VALUE);
// 查询条件构建
BoolQueryBuilder packBoolQuery = QueryBuilders.boolQuery();
FilterAggregationBuilder packAgg = AggregationBuilders.filter(xxx).filter(packBoolQuery);
packAgg.subAggregation(AggregationBuilders.cardinality(xxx).field(ZZZZ.XXX).precisionThreshold(CARDINALITY_PRECISION_THRESHOLD));//指定精度值
agg.subAggregation(packAgg);
nativeSearchQueryBuilder.addAggregation(agg);
}
- 平衡树初阶——AVL平衡二叉查找树+三大平衡树(Treap + Splay + SBT)模板【超详解】
- HDU 2689 Sort it【树状数组】
- BZOJ 1800: [Ahoi2009]fly 飞行棋【思维题,n^4大暴力】
- Vijos P1066 弱弱的战壕【多解,线段树,暴力,树状数组】
- GeetTest~下一代验证(附C#案例)
- [接口测试 - http.client篇] 17 http.client之入门级接口测试框架
- 评论JS插件~多说+畅言
- jQuery HTML5 Uploader
- 1022: [SHOI2008]小约翰的游戏John【Nim博弈,新生必做的水题】
- [接口测试 - http.client篇] 16 基于http.client之POM实战一下
- 数论部分第一节:素数与素性测试【详解】
- ProtoBuf 序列化工具组件
- C++STL vector简单使用练习1
- 小解Redis 系列
- JavaScript 教程
- JavaScript 编辑工具
- JavaScript 与HTML
- JavaScript 与Java
- JavaScript 数据结构
- JavaScript 基本数据类型
- JavaScript 特殊数据类型
- JavaScript 运算符
- JavaScript typeof 运算符
- JavaScript 表达式
- JavaScript 类型转换
- JavaScript 基本语法
- JavaScript 注释
- Javascript 基本处理流程
- Javascript 选择结构
- Javascript if 语句
- Javascript if 语句的嵌套
- Javascript switch 语句
- Javascript 循环结构
- Javascript 循环结构实例
- Javascript 跳转语句
- Javascript 控制语句总结
- Javascript 函数介绍
- Javascript 函数的定义
- Javascript 函数调用
- Javascript 几种特殊的函数
- JavaScript 内置函数简介
- Javascript eval() 函数
- Javascript isFinite() 函数
- Javascript isNaN() 函数
- parseInt() 与 parseFloat()
- escape() 与 unescape()
- Javascript 字符串介绍
- Javascript length属性
- javascript 字符串函数
- Javascript 日期对象简介
- Javascript 日期对象用途
- Date 对象属性和方法
- Javascript 数组是什么
- Javascript 创建数组
- Javascript 数组赋值与取值
- Javascript 数组属性和方法
- js .map方法
- 【一起学系列】之模板方法:写SSO我只要5分钟
- ConcurrentDictionary线程不安全么,你难道没疑惑,你难道弄懂了么?
- 【一起学系列】之迭代器&组合:虽然有点用不上啦
- 移动端touch事件影响click事件以及在touchmove添加preventDefault导致页面无法滚动的解决方法
- 使用ActionFilterAttribute 记录 WebApi Action 请求和返回结果记录
- scipy.stats连续分布的基本操作
- InvocationHandler中invoke方法中的第一个参数proxy的用途
- height、offsetheight、clientheight、scrollheight、innerheight、outerheight
- mysql sql-mode 解析和设置
- JAVABEAN EJB POJO区别
- @Component和@Bean以及@Autowired、@Resource
- mybatis generator and 和or条件
- 『.Net反射』ILGenerator.Emit 动态MSIL 编程
- Spring通过XML配置文件以及通过注解形式来AOP 来实现前置,后置,环绕,异常通知