
Lecture 29 — "Distributed Estimation of Principal Eigenspaces"

December 28, 2017, 14:14

Professor Jianqing Fan, Dean of the School of Data Science and the Big Data Research Institute at Fudan University, visited our school to give guidance.

At 1:30 p.m. on December 27, 2017, at the invitation of the School of Economics and Statistics of Guangzhou University and the Lingnan Research Center for Statistical Science, Professor Jianqing Fan, Dean of the School of Data Science and the Big Data Research Institute at Fudan University, delivered a lecture entitled "Distributed Estimation of Principal Eigenspaces" in Conference Room 412, Front Block of the East Administration Building. The talk, the 29th in the "Yangcheng Forum" series, aimed to deepen young scholars' and graduate students' understanding of research. The lecture was chaired by Associate Dean Cui Xia and attended by faculty and students from related programs. The talk proposed and studied a distributed PCA algorithm: each node machine computes its top eigenvectors and transmits them to a central server; the central server then aggregates the information from all node machines and conducts a PCA based on the aggregated information. The bias and variance of the resulting distributed estimator of the top $K$ eigenvectors were investigated. In particular, it was shown that for distributions with symmetric innovation, the distributed PCA is "unbiased". The rate of convergence of the distributed PCA estimators was derived, which depends explicitly on the effective rank of the covariance matrix, the eigen-gap, and the number of machines. It was shown that when the number of machines is not unreasonably large, distributed PCA performs as well as whole-sample PCA, even without full access to the whole data. The theoretical results were verified by an extensive simulation study.


Abstract: Principal component analysis (PCA) is fundamental to statistical machine learning. It extracts latent principal factors that contribute to the most variation of the data. When data are stored across multiple machines, however, communication cost can prohibit the computation of PCA in a central location and distributed algorithms for PCA are thus needed. This paper proposes and studies a distributed PCA algorithm: each node machine computes the top eigenvectors and transmits them to the central server; the central server then aggregates the information from all the node machines and conducts a PCA based on the aggregated information. We investigate the bias and variance for the resulting distributed estimator of the top $K$ eigenvectors. In particular, we show that for distributions with symmetric innovation, the distributed PCA is "unbiased". We derive the rate of convergence for distributed PCA estimators, which depends explicitly on the effective rank of covariance, eigen-gap, and the number of machines. We show that when the number of machines is not unreasonably large, the distributed PCA performs as well as the whole sample PCA, even without full access to the whole data. The theoretical results are verified by an extensive simulation study. We also extend our analysis to the heterogeneous case where the population covariance matrices are different across local machines but share similar top eigen-structures.
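The aggregation step described in the abstract can be sketched in a few lines of NumPy. The sketch below is an illustration, not the authors' code: each simulated node machine computes the top-$K$ eigenvectors of its local sample covariance, the "transmitted" information is taken to be the local projection matrix $V_\ell V_\ell^\top$, and the central server averages these projections and runs a final eigendecomposition. The function name `distributed_pca` and the choice of projection matrices as the aggregated summary are assumptions made for this example.

```python
import numpy as np

def distributed_pca(data_splits, K):
    """Sketch of the distributed PCA scheme described in the abstract.

    data_splits: list of (n_l, d) arrays, one per node machine (hypothetical
                 stand-in for data already stored on separate machines).
    K: number of principal eigenvectors to estimate.
    """
    m = len(data_splits)
    d = data_splits[0].shape[1]
    agg = np.zeros((d, d))
    for X in data_splits:               # each node machine, locally:
        S = np.cov(X, rowvar=False)     #   local sample covariance
        _, V = np.linalg.eigh(S)        #   eigh sorts eigenvalues ascending
        Vk = V[:, -K:]                  #   top-K local eigenvectors
        agg += Vk @ Vk.T                # "transmit" the projection matrix
    agg /= m                            # central server: average projections
    _, V = np.linalg.eigh(agg)          # final PCA on the aggregate
    return V[:, -K:]                    # estimated top-K eigenspace
```

A natural check is to compare the returned eigenspace with whole-sample PCA via the distance between the two projection matrices; under a spiked covariance with a clear eigen-gap the two should nearly coincide, in line with the talk's main message.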




Address: 230 Outer Ring West Road, Guangzhou Higher Education Mega Center, Panyu District, Guangzhou. Postal code: 510006. Tel: 020-39366825. E-mail: ses@gzhu.edu.cn. Copyright © 2015 School of Economics and Statistics, Guangzhou University.