This week I attended the Big Data Spain 2015 conference. The format of the conference was:
- 2 tracks with no thematic separation
- Spread over 2 days
- Talks of around 45 minutes, except some lightning talks (appropriately named talkreduce()) of 15 minutes or so
I would like to share some links and notes on the talks and ideas that I picked up.
Paco Nathan (@pacoid) gave a news recap in his talk Data Science in 2016: Moving up, with (literally) tons of links to different sources. I have selected three:
- Project Euclid, an impressive platform for maths and statistics publishers that I did not know about
- The fantastic PyData Seattle 2015 keynote by Lorena Barba, titled Data-driven Education and the Quantified Student
- The adoption of Jupyter Notebooks as a first-class publishing vehicle by O'Reilly's beta initiative
He also gave a Crash Introduction to Apache Spark workshop. Although he did not have time to finish it, the repo is online, so we can complete it at our own pace.
Kartik Paramasivam from LinkedIn showed us some projects that were new to me:
- Apache Samza, the distributed stream processing framework they use to handle incoming data from Kafka
- RocksDB, an embeddable, persistent key-value store. Their benchmarks look impressive
Antonio Gallego (@antoniogallego) from Pivotal showed a stock inference engine built on Apache Geode and Spark ML. The demo suffered quite a bit from the demo effect, but since the code is on GitHub, I will try to re-run it and test it myself.
Nicolás Poggi (@ni_po) from the Barcelona Supercomputing Center explained and demoed Aloja, a powerful Big Data benchmarking platform. A really impressive piece of software, and probably my greatest discovery of the event. Congrats!
Kostas Tzoumas (@kostas_tzoumas) from Data Artisans talked about Apache Flink. It seems that Flink is one of the hottest solutions for both streaming and batch processing, with the focus on streaming. I will have to keep an eye on it.
William Vambenepe (@vambenepe) from Google explained Dataflow. Dataflow is not only Google's newest paper on Big Data architecture but also an open-source implementation that is already available on Google Cloud Platform. It uses a concept called the watermark to window the reception of out-of-order real-time events, and it can apply the same algorithms to both real-time and batch processing. It looks very cool, and even cooler for me since he explained that a Python SDK is coming, and that visualization can be done via Jupyter notebooks uploaded to their Datalab platform. Python rulez!
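To get an intuition for the watermark idea, here is a toy sketch in plain Python. This is my own simplification, not the Dataflow SDK: the function name, the fixed-window policy, and the "max event time minus allowed lateness" watermark heuristic are all assumptions for illustration only.

```python
from collections import defaultdict

def window_sums(events, window_size=60, allowed_lateness=30):
    """Sum values per fixed event-time window, dropping events that
    arrive after the watermark has passed their window.

    `events` is a list of (event_time, value) pairs in *processing*
    order, which may differ from event-time order. The watermark here
    is a naive heuristic: the highest event time seen so far minus an
    allowed-lateness slack.
    """
    watermark = float("-inf")
    sums = defaultdict(int)
    for event_time, value in events:
        # Advance the watermark as newer events are observed.
        watermark = max(watermark, event_time - allowed_lateness)
        # Assign the event to its fixed window [start, start + window_size).
        window_start = (event_time // window_size) * window_size
        if window_start + window_size <= watermark:
            continue  # the window has already closed: event is too late
        sums[window_start] += value
    return dict(sums)

# Out-of-order arrival: the event at t=10 is late but within the
# allowed lateness, while the event at t=20 arrives after the
# watermark has closed its window and is dropped.
print(window_sums([(5, 1), (70, 2), (10, 3), (130, 4), (20, 5)]))
# → {0: 4, 60: 2, 120: 4}
```

Because the function only looks at event times, not arrival (processing) times, the same logic produces identical results whether the input is a replayed batch or a live stream, which is exactly the unification Dataflow argues for.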