Approximate Aggregate Query Method Based on Two-Stage Stratified Sampling

Affiliation:

1. College of Information, North China University of Technology, Beijing 100144, China; 2. Beijing Key Laboratory on Integration and Analysis of Large-Scale Stream Data (North China University of Technology), Beijing 100144, China

Fund Project:

International (Regional) Cooperation and Exchange Program of the National Natural Science Foundation of China (62061136006).

    Abstract:

    Interactive query analysis, typified by data warehouse applications, supports intelligent decision-making. As data volumes grow, computing exact aggregate query results often requires a full scan of the data, so these queries struggle to respond in real time. Answering complex aggregate queries approximately from pre-extracted samples is a feasible solution in many scenarios. This paper analyzes the specific conditions under which stratified sampling outperforms random sampling and proposes a two-stage stratified sampling method. In the first stage, the data are grouped by business characteristics, a random sample is drawn within each group, and the sampling effect is evaluated. In the second stage, for groups whose sampling effect is poor, a self-organizing feature mapping (SOM) network clusters the numeric values into subgroups to improve the approximate query results. Experiments on a public data set and on actual power grid data show that, at the same sampling rate, the proposed method improves performance by up to 15% over random sampling, stratified random sampling, and congressional sampling. SOM also yields better approximate query results than clustering with K-means or density-based spatial clustering of applications with noise (DBSCAN).
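The two-stage procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `error_threshold` and `n_bins` parameters, and the use of relative error against the exact group sum to judge the sampling effect are all hypothetical. The second stage here uses simple equal-frequency binning of the values as a stand-in for SOM clustering, so that the example needs only the Python standard library.

```python
import random
import statistics
from collections import defaultdict


def relative_error(estimate, truth):
    """Relative error of an estimate; falls back to absolute error if truth is 0."""
    return abs(estimate - truth) / abs(truth) if truth else abs(estimate)


def sample_stratum(values, rate, rng):
    """Draw a simple random sample of at least one value from one stratum."""
    k = max(1, round(len(values) * rate))
    return rng.sample(values, k)


def estimate_sum(strata, rate, rng):
    """Stratified SUM estimate: scale each stratum's sample mean by its size."""
    total = 0.0
    for values in strata:
        sample = sample_stratum(values, rate, rng)
        total += len(values) * statistics.fmean(sample)
    return total


def split_by_value(values, n_bins):
    """Stand-in for SOM clustering (assumption): equal-frequency value bins."""
    ordered = sorted(values)
    size = -(-len(ordered) // n_bins)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def two_stage_sum(rows, rate, error_threshold=0.05, n_bins=4, seed=0):
    """rows: iterable of (group_key, value). Returns {group_key: estimated SUM}."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)

    estimates = {}
    for key, values in groups.items():
        # Stage 1: plain random sampling inside each business group.
        est = estimate_sum([values], rate, rng)
        # Assumption: the sampling effect is judged offline against the exact
        # answer, which is available while the samples are being pre-built.
        if relative_error(est, sum(values)) > error_threshold:
            # Stage 2: re-stratify only the poorly estimated group by value.
            est = estimate_sum(split_by_value(values, n_bins), rate, rng)
        estimates[key] = est
    return estimates
```

Calling `two_stage_sum(rows, rate=0.1)` on a list of `(group, value)` pairs returns one estimated sum per group. Groups whose values are nearly constant are already well served by stage 1, while groups with widely spread values benefit from the value-based strata, since each stratum then has a smaller internal variance.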

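Cite this article

Fang Jun, Zhao Bo, Zuo Changqi. Approximate aggregate query method based on two-stage stratified sampling[J]. Journal of Data Acquisition and Processing, 2022, 37(5): 1049-1058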

History
  • Received: 2021-09-10
  • Last revised: 2022-01-23
  • Accepted:
  • Published online: 2022-09-25