Time: May 16, 2018, 9:30 a.m.
Venue: Room 503, East Teaching Building No. 3, Wangjiang Campus
Speaker: Tang Mingjie
Speaker bio: Tang Mingjie received his B.S. in computer science in 2007, an M.S. in computer science from the Graduate University of the Chinese Academy of Sciences in 2010, an M.S. in computer science from Purdue University in 2013, and a Ph.D. in computer science from Purdue University in 2016. He previously worked at Microsoft and IBM Research in the United States, and is now a research scientist at the big-data company Hortonworks, focusing on research and development for Spark and TensorFlow. During his Ph.D. he published more than 20 papers in conferences and journals including VLDB, TKDE, ICDE, EDBT, and SIGSPATIAL; he received the best paper award at the database conference SISAP and the best application paper award at the data mining conference ADMA 2009, and some of his research results have been adopted by the open-source PostgreSQL and Spark communities.
Abstract: TensorFlow and XGBoost are state-of-the-art platforms for deep learning and machine learning. However, neither of them is well suited to big-data processing in real production environments. For example, TensorFlow provides no OLAP or ETL capabilities over big data, which prevents it from training deep learning models on clean and sufficient data in an efficient way. Similarly, despite its better performance compared with other gradient-boosting implementations, training an XGBoost model remains time-consuming when the data is big. Obtaining a highly accurate model also usually requires extensive parameter tuning, which creates a strong need to speed up the whole process.
In this talk, we will mainly introduce how Spark improves TensorFlow and XGBoost in real applications, and demonstrate how these platforms can benefit from big-data techniques. More specifically, we first introduce how Spark ML supports automatic parameter tuning, and how transfer learning can enhance real applications such as recommendation systems and image search. Second, we cover the implementation and performance improvements of the GPU-based XGBoost algorithm, summarize model-tuning experience and best practices, share insights on how to build a heterogeneous data analytics and machine learning pipeline based on Spark in a GPU-equipped YARN cluster, and show how to push models into production.
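The automatic parameter tuning mentioned above is, at its core, a search over a grid of candidate hyperparameter values scored by a validation metric, which Spark ML automates with ParamGridBuilder and CrossValidator. The following library-free Python sketch illustrates the underlying idea; the parameter names (`max_depth`, `eta`) and the toy scoring function are illustrative assumptions, not the speaker's actual setup.

```python
# Minimal sketch of exhaustive hyperparameter grid search, the idea that
# Spark ML's ParamGridBuilder + CrossValidator automate at cluster scale.
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # in practice: cross-validated accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy scoring function standing in for model evaluation: it peaks at
# max_depth=6, eta=0.3 (hypothetical "best" values for illustration).
def toy_score(params):
    return -abs(params["max_depth"] - 6) - abs(params["eta"] - 0.3)

PARAM_GRID = {"max_depth": [3, 6, 9], "eta": [0.1, 0.3]}
best, score = grid_search(PARAM_GRID, toy_score)
print(best)  # -> {'eta': 0.3, 'max_depth': 6}
```

In Spark ML the same pattern is expressed declaratively, and the candidate models are trained in parallel across the cluster, which is what makes tuning over big data tractable.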