BIG DATA PROCESSING AND MACHINE LEARNING WITH APACHE SPARK
ImpactHub - workshop room 2
19th November, 09:30-18:00
Apache Spark is the next-generation Big Data computing framework: compared to Hadoop MapReduce, it is easier to use and offers a more flexible computing model. Spark's popularity is largely due to its excellence at in-memory computation, which in some cases is 10x-100x faster than Hadoop MapReduce.
Spark was designed from the ground up to support multiple processing modules: batch processing, SQL, machine learning, streaming and graph processing. It provides a nice abstraction over large datasets through Resilient Distributed Datasets (RDDs) and DataFrames, both offering elegant APIs for manipulating them easily.
The main focus of this workshop is to get you familiar with Spark core basics (RDDs, DataFrames, transformations and actions) through hands-on exercises. We will also present other Spark capabilities, such as SQL and machine learning, using multiple examples.
All of these examples will be explained in three programming languages (Java, Scala and Python), so workshop participants can choose the environment they are most familiar with. We will work with a dataset offered by yelp.com, from which we can build interesting examples and extract meaningful insights from the data.
After this workshop, you will have the Spark infrastructure installed on your own laptop and the knowledge needed to continue working with Spark and try further examples.
The intended audience is software developers and technical architects who are interested in trying their first Big Data and machine learning examples with Apache Spark.
Daniel leads the Big Data and Machine Translation group at SDL Research Cluj, working on a Hadoop-based data pipeline that processes petabytes of unstructured data; the resulting clean data is used to increase the quality of SDL's Statistical Machine Translation engines. Daniel is involved in the local Big Data community as a meetup organizer and speaker, in an effort to raise awareness of the Big Data and Hadoop field.
I'm passionate about Big Data technologies, especially Apache Hadoop. I first heard about Apache Hadoop during my master's courses, and I have been fascinated by the Big Data world ever since. My first big professional success was introducing Apache Hadoop into the company I work for, skobbler, in 2012. Since then I have been working full time as a DevOps engineer on Big Data projects. My current work involves managing and maintaining our in-house Hadoop and HBase clusters. Out of this passion, I initiated a Big Data and data science community in my town, Cluj-Napoca, Romania, with the goals of meeting new passionate people, working together on cool projects and helping IT companies adopt Big Data technologies. So far we have held many meetups and workshops, with many participants.
Adrian Bona is a Software Engineer at Telenav, where he works with large amounts of GPS data and planet-scale maps. He is passionate about Big Data, Machine Learning and how new knowledge gains traction in active developer communities, such as the ones in Cluj. In his free time, Adrian teaches Fundamental Algorithms and Logic Programming laboratories at the Technical University of Cluj-Napoca.