Quantitative Big Imaging

Kevin Mader
7 May 2015

Scaling Up / Big Data

Course Outline

  • 19th February - Introduction and Workflows
  • 26th February - Image Enhancement (A. Kaestner)
  • 5th March - Basic Segmentation, Discrete Binary Structures
  • 12th March - Advanced Segmentation
  • 19th March - Applying Graphical Models and Machine Learning (A. Lucchi)
  • 26th March - Analyzing Single Objects
  • 2nd April - Analyzing Complex Objects
  • 16th April - Groups and Spatial Distribution
  • 23rd April - Statistics and Reproducibility
  • 30th April - Dynamic Experiments
  • 7th May - Scaling Up / Big Data
  • 21st May - Guest Lecture, Applications in High-content Screening and Wood
  • 28th May - Project Presentations

Literature / Useful References

Big Data

Cluster Computing

Databases

  • Ollion, J., Cochennec, J., Loll, F., Escudé, C., & Boudier, T. (2013). TANGO: a generic tool for high-throughput 3D image analysis for studying nuclear organization. Bioinformatics (Oxford, England), 29(14), 1840–1. doi:10.1093/bioinformatics/btt276

Cloud Computing

  • Amazon S3
  • Sitaram, D., & Manjunath, G. (2012). Moving To The Cloud. Elsevier. doi:10.1016/B978-1-59749-725-1.00006-8
  • Duan, P., Wang, W., Zhang, W., Gong, F., Zhang, P., & Rao, Y. (2013). Food Image Recognition Using Pervasive Cloud Computing. In 2013 IEEE International Conference on Green Computing and Communications and IEEE Internet of Things and IEEE Cyber, Physical and Social Computing (pp. 1631–1637). IEEE. doi:10.1109/GreenCom-iThings-CPSCom.2013.296

Outline

  • Motivation
  • Computer Science Principles
    • Parallelism
    • Distributed Computing
    • Imperative Programming
    • Declarative Programming
  • Organization
    • Queue Systems / Cluster Computing
    • Parameterization
    • Databases
  • Big Data
    • MapReduce
    • Spark
    • Streaming
  • Cloud Computing
  • Beyond / The future

Rich, heavily developed platform

Available Tools

Tools built for table-like data structures and much better adapted to them.

Commercial Support

Dozens of major companies (Apple, Google, Facebook, Cisco, …) donate over $30M a year to the development of Spark and the Berkeley Data Analytics Stack.

  • 2 startups in the last 6 months with seed-funding in excess of $15M each

Academic Support

  • All source code is available on GitHub
    • Elegant (20,000 lines vs my PhD of 75,000+)
  • No patents or restrictions on usage
  • Machine Learning Course in D-INFK next semester based on Spark

Beyond: Streaming

Post-processing goals

  • Analysis done in weeks instead of months
  • Some real-time analysis and statistics

Streaming

Can handle static data or live data coming in from a 'streaming' device like a camera to do real-time analysis. The exact same code can be used for real-time analysis and for static data.
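
The point that one piece of analysis code can serve both static and live data can be illustrated with a minimal plain-Python sketch. It is not Spark code: `running_mean` and `camera_stream` are hypothetical names, and the generator only stands in for a real device such as a camera.

```python
from typing import Iterable, Iterator


def running_mean(values: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, one result per input value."""
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        yield total / count


def camera_stream() -> Iterator[float]:
    """Hypothetical stand-in for a live device: yields one measurement at a time."""
    for measurement in (10.0, 20.0, 30.0):
        yield measurement


# Static data: a list that is already fully in memory.
static_volumes = [10.0, 20.0, 30.0]

# The same function is applied to the static list and to the "live" stream.
static_result = list(running_mean(static_volumes))
stream_result = list(running_mean(camera_stream()))
```

Because `running_mean` only assumes it can iterate over its input, it never needs to know whether the data is finished or still arriving. That is the property that lets the analysis logic be shared between the two cases.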

Scalability

Connect more computers.

Start workers on these computers.

Beyond: Approximate Results

Projects at AMPLab like Spark and BlinkDB are moving towards approximate results.

  • Instead of mean(volume)
    • mean(volume).within_time(5)
    • mean(volume).within_ci(0.95)

For real-time image processing it might be the only feasible solution and could drastically reduce the amount of time spent on analysis.
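
What the two approximate queries above might mean in practice can be sketched with random sampling in plain Python. This is an illustration under stated assumptions, not the BlinkDB or Spark API. The functions `mean_within_time` and `mean_within_ci` are hypothetical, as are the `rel_error` tolerance, the batch size, and the synthetic `volumes` data.

```python
import math
import random
import statistics
import time


def mean_within_time(values, seconds, batch=100, seed=0):
    """Estimate the mean from random samples until a time budget is used up.

    At least one batch is always processed. Returns (estimate, samples_used).
    """
    if len(values) == 0:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)
    deadline = time.monotonic() + seconds
    total, count = 0.0, 0
    while True:
        for _ in range(batch):
            total += values[rng.randrange(len(values))]  # sample with replacement
            count += 1
        if time.monotonic() >= deadline:
            return total / count, count


def mean_within_ci(values, confidence=0.95, rel_error=0.01, batch=100, seed=0):
    """Sample until the confidence-interval half-width is small enough.

    Stops when the half-width is at most rel_error * |mean|, or when as many
    samples as there are values have been drawn. Requires batch >= 2.
    Returns (estimate, half_width, samples_used).
    """
    if len(values) == 0:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)
    # Normal-approximation quantile, e.g. about 1.96 for confidence=0.95.
    z = statistics.NormalDist().inv_cdf((1 + confidence) / 2)
    sample = []
    while True:
        sample.extend(values[rng.randrange(len(values))] for _ in range(batch))
        mean = statistics.fmean(sample)
        half_width = z * statistics.stdev(sample) / math.sqrt(len(sample))
        if half_width <= rel_error * abs(mean) or len(sample) >= len(values):
            return mean, half_width, len(sample)


# Synthetic stand-in for measured object volumes (assumed data, not real results).
data_rng = random.Random(42)
volumes = [data_rng.gauss(100.0, 10.0) for _ in range(100_000)]

exact = statistics.fmean(volumes)
quick_estimate, quick_used = mean_within_time(volumes, seconds=0.0)
ci_estimate, ci_half_width, ci_used = mean_within_ci(volumes)
```

With a relative tolerance of 1%, the confidence-interval version is expected to stop after a few hundred samples rather than reading all 100,000 values. That is the kind of saving the slide refers to.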