Trestle Random Forest Based on Multiple Randomness and Privacy Protection
Author:
Affiliation:
School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
Fund Project:
摘要
|
图/表
|
访问统计
|
参考文献
|
相似文献
|
引证文献
|
资源附件
摘要:
作为一种针对分类和回归任务行之有效的集成学习算法,随机森林(Random forest, RF)还面临着泛化能力提升和隐私保护的挑战。本文提出了一种改进的基于多重随机性与隐私保护的栈式随机森林(Bernoulli-multinomial stacked random forest,BMS-RF)算法。基本思想是在构造决策树分裂特征和分裂点选择阶段引入伯努利分布Dropout部分特征向量选择候选特征向量,通过两个多项分布随机选择分裂特征与分裂点,每棵决策树采用非数值查询的指数机制添加噪声维持其隐私保护机制,在集成分类器时引入多层栈式结构将前一层的输出随机投影和源训练集拼接作为新的输入,使得每一森林可以共享源样本空间信息,逐层提高基学习器分类性能。通过对此算法的一致性以及隐私能力的理论分析表明BMS-RF可以通过栈式结构显著提高分类性能。14个中小规模数据集合上的实验结果验证了该算法不但能降低运行时间且具有更好的泛化性能,隐私保护水平较强时可以在简化结构和提高运行速度的基础上达到与RF变体基本一致的分类性能。
Abstract:
As an effective ensemble learning algorithm for classification and regression tasks, the random forest (RF) also faces challenges in improving generalization ability and privacy protection. In response to this challenge, this paper proposes an improved Bernoulli-multinomial stacked random forest (BMS-RF) algorithm based on multiple randomness and privacy protection. The basic idea is to introduce Bernoulli distribution Dropout partial feature vectors to select candidate feature vectors in the stage of constructing decision tree splitting features and splitting point selection. By randomly selecting splitting features and splitting points through two polynomial distributions, each decision tree adopts a non numerical query index mechanism to add noise for maintaining its privacy protection mechanism. When integrating classifiers, a multi-layer stack structure is introduced to randomly project the output of the previous layer and concatenate the source training set as new inputs, so that each forest can share the spatial information of the source samples and improve the classification performance of the base learner layer by layer. Theoretical analysis of the consistency and privacy ability of this algorithm shows that BMS-RF can significantly improve classification performance through a stack structure. Experimental results on 14 small and medium-sized datasets verify that the algorithm not only reduces running time but also has better generalization performance. When the privacy protection is strong, it can achieve classification performance similar to RF variants on the basis of simplifying the structure and improving running speed.