Interactive Scientific Image Analysis and Analytics using Spark

Kevin Mader
Spark East, NYC, 19 March 2015

SIL

4Quant, Paul Scherrer Institut, ETH Zurich

Outline

  • Background: Our Technique (why we have big data)
    • X-Ray Tomographic Microscopy
  • Imaging in 2015
  • The Problem(s)

The Tools

  • Spark Imaging Layer
  • 3D Imaging
  • Hyperspectral Imaging
  • Interactive Analysis / Streaming

The Science

  • Genome Scale Studies
  • Large Datasets
  • Outlook / Developments

Internal Structures

Synchrotron-based X-Ray Tomographic Microscopy

The only technique that can do all of the following:

  • peer deep into large samples
  • achieve <1 μm isotropic spatial resolution
    • with 1.8mm field of view
  • achieve >10 Hz temporal resolution
  • 8GB/s of images

SLS and TOMCAT

[1] Mokso et al., J. Phys. D, 46(49), 2013

Courtesy of M. Pistone at U. Bristol

Image Science in 2015: More and faster

X-Ray

  • Swiss Light Source (SRXTM) images at (>1000fps) \rightarrow 8GB/s, diffraction patterns (cSAXS) at 30GB/s
  • Nanoscopium (Soleil), 10TB/day, 10-500GB file sizes, very heterogeneous data

Optical

  • Light-sheet microscopy (see talk of Jeremy Freeman) produces images \rightarrow 500MB/s
  • High-speed confocal images at (>200fps) \rightarrow 78Mb/s

Geospatial

  • New satellite projects (Skybox, etc.) will measure hundreds of terabytes to petabytes of images a year

Personal

  • GoPro 4 Black - 60MB/s (3840 x 2160 x 30fps) for $600
  • fps1000 - 400MB/s (640 x 480 x 840 fps) for $400
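
The consumer-camera figures above can be sanity-checked with a little arithmetic. The sketch below uses only the resolutions, frame rates, and data rates quoted on the slide; the function names are illustrative, and "MB" is assumed to mean 10^6 bytes.

```python
# Raw pixel throughput implied by the two consumer cameras listed above.
# Resolutions, frame rates, and data rates come from the slide;
# "MB" is ASSUMED to mean 10**6 bytes.

def pixels_per_second(width, height, fps):
    """Number of pixels recorded each second."""
    return width * height * fps

def bytes_per_pixel(rate_mb_per_s, width, height, fps):
    """Average stored bytes per pixel at the quoted data rate."""
    return rate_mb_per_s * 1e6 / pixels_per_second(width, height, fps)

# name -> (quoted MB/s, width, height, frames per second)
cameras = {
    "GoPro 4 Black": (60, 3840, 2160, 30),
    "fps1000": (400, 640, 480, 840),
}

for name, (rate, w, h, fps) in cameras.items():
    print(name, pixels_per_second(w, h, fps),
          round(bytes_per_pixel(rate, w, h, fps), 2))
```

Both cameras record roughly 250 million pixels per second. The much lower bytes-per-pixel figure for the GoPro suggests its quoted rate refers to compressed video, while the fps1000 figure is closer to raw sensor output.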

Using Machine Learning

Now that the images are stored as feature vectors, they can be easily analyzed with standard Machine Learning tools. Feature vectors are also much easier to combine with training information.

x    y  Absorb     Scatter    Training
700  4  0.3706262  0.9683849  0.0100140
704  4  0.3694059  0.9648784  0.0100140
692  8  0.3706371  0.9047878  0.0183156
696  8  0.3712537  0.9341989  0.0334994
700  8  0.3666887  0.9826912  0.0453049
704  8  0.3686623  0.8728824  0.0453049

Goal: predict Training from x, y, Absorb, and Scatter \rightarrow MLlib: Logistic Regression, Random Forest, K-Nearest Neighbors, …
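
To make the prediction task concrete, here is a minimal pure-Python sketch of a k-nearest-neighbors regressor applied to the six rows shown above. It is a stand-in for illustration only, not the MLlib API and not necessarily the method used in the talk; the helper names (`scale`, `knn_predict`) and the min-max scaling step are assumptions.

```python
import math

# Rows copied from the table above: (x, y, Absorb, Scatter, Training).
rows = [
    (700, 4, 0.3706262, 0.9683849, 0.0100140),
    (704, 4, 0.3694059, 0.9648784, 0.0100140),
    (692, 8, 0.3706371, 0.9047878, 0.0183156),
    (696, 8, 0.3712537, 0.9341989, 0.0334994),
    (700, 8, 0.3666887, 0.9826912, 0.0453049),
    (704, 8, 0.3686623, 0.8728824, 0.0453049),
]

def scale(vectors):
    """Min-max scale each column to [0, 1] so x and y do not dominate distances."""
    cols = list(zip(*vectors))
    lo = [min(c) for c in cols]
    span = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - l) / s for v, l, s in zip(vec, lo, span)] for vec in vectors]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Average the targets of the k training points nearest to the query."""
    nearest = sorted((distance(x, query), y) for x, y in zip(train_X, train_y))[:k]
    return sum(y for _, y in nearest) / len(nearest)

# For simplicity the scaling uses all rows, including the held-out one.
X = scale([r[:4] for r in rows])
y = [r[4] for r in rows]

# Hold out the last row and predict its Training value from the other five.
pred = knn_predict(X[:-1], y[:-1], X[-1], k=3)
print("predicted:", pred, "actual:", y[-1])
```

With only six rows this demonstrates the mechanics rather than a meaningful model; on a full image the same feature table would have one row per pixel or voxel.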


Beyond Image Processing

For many datasets, processing, segmentation, and morphological analysis yield all the information that needs to be extracted. For many systems, such as bone tissue, cellular tissues, and cellular materials, the structure is just the beginning, and the most interesting results come from applying physical, chemical, or biological rules inside these structures.

\sum_j \vec{F}_{ij} = m\ddot{x}_i

Such systems can be easily represented by a graph and analyzed using GraphX in a distributed, fault-tolerant manner.
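
The force balance above maps naturally onto a graph: each edge contributes a pairwise force, and each node sums the contributions from its neighbors. The sketch below shows that pattern in plain Python on a hypothetical three-node, one-dimensional system. The Hooke's-law force model, the spring constant, the rest length, and all names are assumptions made for illustration; this is not GraphX code.

```python
# Hypothetical toy system: three nodes in 1D joined by two springs.
positions = {0: 0.0, 1: 1.5, 2: 2.0}
mass = {0: 1.0, 1: 1.0, 2: 2.0}
edges = [(0, 1), (1, 2)]

K = 10.0     # ASSUMED spring constant
REST = 1.0   # ASSUMED rest length

def edge_force(xi, xj):
    """Force on node i due to neighbor j (Hooke's law, an assumed model)."""
    d = xj - xi
    stretch = abs(d) - REST
    direction = 1.0 if d > 0 else -1.0
    return K * stretch * direction

def accelerations(positions, mass, edges):
    """Sum edge forces at each node, then apply a_i = (sum_j F_ij) / m_i."""
    net = {i: 0.0 for i in positions}
    for i, j in edges:
        f = edge_force(positions[i], positions[j])
        net[i] += f   # contribution sent along the edge to node i
        net[j] -= f   # equal and opposite contribution to node j
    return {i: net[i] / mass[i] for i in positions}

acc = accelerations(positions, mass, edges)
print(acc)
```

The per-edge contribution followed by a per-node sum is the part a distributed graph engine can parallelize, since each edge is evaluated independently and the sums are associative.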


Hadoop Filesystem (HDFS not HDF5)

The bottleneck is the filesystem connection: many nodes (10+) reading in parallel bring even a GPFS-based InfiniBand system to a crawl


One of the central tenets of MapReduce™ is data-centric computation \rightarrow instead of moving the data to the computation, move the computation to the data.

  • Use fast local storage to store everything redundantly \rightarrow less data transfer, plus fault tolerance
  • Largest file size: 512 yottabytes; Yahoo has a 14-petabyte filesystem in use
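
The idea of moving the computation to the data can be sketched without any cluster: each node reduces the blocks it already holds to a small partial result, and only those partial results travel over the network. The node names, block contents, and function names below are hypothetical, and the example computes a global mean purely for illustration.

```python
# Hypothetical layout: each node stores some pixel blocks on its local disk.
blocks_on_node = {
    "node-a": [[3, 5, 7], [1, 1]],
    "node-b": [[2, 4]],
    "node-c": [[10, 0, 5, 5]],
}

def local_partial(blocks):
    """Runs where the blocks live; returns only a (sum, count) pair."""
    values = [v for block in blocks for v in block]
    return sum(values), len(values)

def combine(partials):
    """Runs centrally; merges the small per-node results into a global mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

partials = [local_partial(blocks) for blocks in blocks_on_node.values()]
mean_value = combine(partials)

values_stored = sum(n for _, n in partials)   # values that stay on local disks
numbers_shipped = 2 * len(partials)           # one (sum, count) pair per node
print(mean_value, values_stored, numbers_shipped)
```

The amount shipped scales with the number of nodes rather than with the size of the data, which is why a shared filesystem connection stops being the limiting factor.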
