Kevin Mader
Spark East, NYC, 19 March 2015
The only technique which can do all
[1] Mokso et al., J. Phys. D, 46(49), 2013
Courtesy of M. Pistone at U. Bristol
If you looked at one 1000 x 1000 image every second, it would take you 139 hours to browse through a terabyte of data (at 16 bits per pixel, a terabyte holds roughly 500,000 such images, and 500,000 seconds is about 139 hours).
Year | Time to 1 TB | Manpower to keep up | Salary Costs / Month |
---|---|---|---|
2000 | 4096 min | 2 people | 25 kCHF |
2008 | 1092 min | 8 people | 95 kCHF |
2014 | 32 min | 260 people | 3255 kCHF |
2016 | 2 min | 3906 people | 48828 kCHF |
\textrm{Transistors} \propto 2^{T/(\textrm{18 months})}
Based on data from https://gist.github.com/humberto-ortiz/de4b3a621602b78bf90d
There are now many more transistors inside a single computer, but single-core processing speed has barely increased. How can this be?
The figure shows the range of cloud costs (determined by peak usage) compared to a local workstation, with utilization shown as the average number of hours the computer is used each week.
The figure shows the cost of a cloud-based solution as a percentage of the cost of buying a single machine; values below 1 are labeled with the percentage. The panels distinguish the average time to replacement for the machines, in months.
What took an entire PhD 3-4 years ago can now be measured in a weekend, or even in several seconds. Analysis tools have not kept up: they are difficult to customize and usually highly specific.
Data structures that were fast and efficient for computers with 640 KB of memory no longer make sense.
CPUs are not getting much faster, but there are a lot more of them. Iterating through a huge array takes almost as long on 2014 hardware as on 2006 hardware.
The most important job for any piece of analysis is to be correct.
Almost all image processing tasks require a number of people to evaluate and implement them, and they are almost always moving targets.
The last of the major priorities is speed, which covers scalability, raw performance, and development time.
The two frameworks provide a free, out-of-the-box solution for
These frameworks are really cool and Spark has a big vocabulary, but flatMap, filter, aggregate, join, groupBy, and fold still do not sound like anything I want to do to an image.
I want to
We have developed a number of commands for SIL (the Spark Imaging Layer) that handle standard image processing tasks
Fully extensible with
Hyperspectral imaging is a rapidly growing area with the potential for massive datasets and a severe deficit of usable tools.
The scale of the data is large, and standard image processing tools are ill-suited to handling it. The ideas used in image processing (filtering, thresholding, segmentation, …) are equally applicable to hyperspectral data, and distributed, parallel approaches make even more sense on such massive datasets.
Developing in Scala brings additional flexibility through types [1]. In microscopy the standard formats are 2-, 3-, and even 4- or more dimensional arrays or matrices, which can be iterated through quickly using CPU and GPU code. While that is still possible in Scala, there is a great deal more flexibility in data types: anything can be stored as an image and then processed, as long as the basic functions make sense for it.
[1] M. Odersky, Fighting Bit Rot with Types (Experience Report: Scala Collections), FSTTCS 2009, December 2009
A collection of positions and values, and possibly more (not just an array of doubles). Arrays are efficient for storage in computer memory, but often a poor way of expressing scientific ideas and analyses.
combine information from nearby pixels
determine groups of pixels which are very similar to the desired result
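As a minimal sketch of this representation (the names here are illustrative, not the library's exact API), both operations become ordinary collection transformations:
case class D3int(x: Int, y: Int, z: Int) {
  def mag: Double = math.sqrt(x * x + y * y + z * z)  // distance from the origin
}
// an image as a collection of (position, value) pairs; distributed, an RDD[(D3int, T)]
type KVImage[T] = Seq[(D3int, T)]
// "groups of pixels similar to the desired result" is then an ordinary filter
def brighterThan(img: KVImage[Double], cutoff: Double): KVImage[Double] =
  img.filter { case (_, value) => value > cutoff }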
trait BasicMathSupport[T] extends Serializable {
  // primitive operations every supported pixel type must provide
  def plus(a: T, b: T): T
  def times(a: T, b: T): T
  def scale(a: T, b: Double): T
  def invert(a: T): T
  def abs(a: T): T
  def compare(a: T, b: T): Int
  // operations derived from the primitives above
  def negate(a: T): T = scale(a, -1)
  def minus(a: T, b: T): T = plus(a, negate(b))
  def divide(a: T, b: T): T = times(a, invert(b))
}
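For instance, a minimal instance of the trait for plain Double pixels (a sketch; the library's own implementations may differ):
implicit val DoubleBMS: BasicMathSupport[Double] =
  new BasicMathSupport[Double] {
    def plus(a: Double, b: Double) = a + b
    def times(a: Double, b: Double) = a * b
    def scale(a: Double, b: Double) = a * b
    def invert(a: Double) = 1.0 / a
    def abs(a: Double) = math.abs(a)
    def compare(a: Double, b: Double) = a.compareTo(b)
    // negate, minus, and divide come for free from the trait's defaults
  }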
Simple filter implementation
def SimpleFilter[T](inImage: Image[T])
                   (implicit wst: BasicMathSupport[T]) = {
  val width: Double = 1.0
  // Gaussian-style weight: scale each value by its distance from the kernel center
  val kernel = (pos: D3int, value: T) =>
    wst.scale(value, math.exp(-math.pow(pos.mag / width, 2)))
  // merge two weighted contributions by averaging them
  val kernelReduce = (ptA: T, ptB: T) => wst.scale(wst.plus(ptA, ptB), 0.5)
  runFilter(inImage, kernel, kernelReduce)
}
Spectra as well-supported types
implicit val SpectraBMS = new BasicMathSupport[Array[Double]] {
  def plus(a: Array[Double], b: Array[Double]) =
    a.zip(b).map { case (x, y) => x + y }
  ...
  def scale(a: Array[Double], b: Double) =
    a.map(_ * b)
}
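With this implicit in scope, the generic filter from the previous slide works unchanged on hyperspectral data (a sketch; spectraImage is a hypothetical Image[Array[Double]]):
val smoothedSpectra = SimpleFilter(spectraImage)  // each pixel is a full spectrum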
Combining many different components inside the Spark Shell, IPython, or Zeppelin makes it easier to assemble workflows.
We want to understand the relationship between genetic background and bone structure
Genetic studies require hundreds to thousands of samples; in this case, the difference between 717 and 1200 samples is the difference between finding the links and finding nothing.
// load all bone images, threshold them, and label the cells (lacunae)
val bones = sc.loadImages("work/f2_bones/*/bone.tif")
val hardTissue = bones.threshold(OTSU)
val softTissue = hardTissue.invert
val cells = hardTissue.componentLabel.
  filter(c => c.size > 100 && c.size < 1000)
cells.shapeAnalysis.WriteOutput("lacuna.csv")
// the shape results can then be queried with Spark SQL
val cellTable = sqlContext.csvFile("work/f2_bones/*/cells.csv")
val avgVol = sqlContext.sql("SELECT SAMPLE, AVG(VOLUME) FROM cells GROUP BY SAMPLE")
avgVol.filter(_._2 > 1000).map(sampleToPath).joinByKey(bones)
Task | Single Core Time | Spark Time (40 cores) |
---|---|---|
Load and Preprocess | 360 minutes | 10 minutes |
Single Column Average | 4.6s | 400ms |
1 K-means Iteration | 2 minutes | 1s |
\textrm{Images}: \textrm{RDD}[((x,y,z),Img[Double])] =\\ \left[(\vec{x},\textrm{Img}),\cdots\right]
val dispField = Images.
  cartesian(Images).map {
    // cross-correlate every pair of images around their expected offset
    case ((xA, imA), (xB, imB)) =>
      xcorr(imA, imB, in = xB - xA)
  }
The positions can then be refined using the updated information provided by the cross-correlations, applying appropriate smoothing criteria if necessary.
The stitching itself can be done in a lazy fashion: rather than rewriting the original data, regions of the stitched image are assembled as they are read.
def getView(tPos: Pos, tSize: Size) =
  stImgs.
    // keep only the tiles that overlap the requested region
    filter { case (x, img) => abs(x - tPos) < img.size }.
    map { case (x, img) =>
      val oImg = new Image(tSize)
      oImg.copy(img, x, tPos)  // copy the overlapping part into the output tile
    }.
    addImages(AVG)  // average the overlapping tiles
This also ensures the original data is left unaltered and all analysis is reversible.
getView(Pos(26.5, 13), Size(2, 2))
In the biological imaging community, the open source tools of ImageJ2 and Fiji are widely accepted and have a large number of readily available plugins and tools.
We can integrate this functionality directly into Spark and perform operations on much larger datasets than a single machine could hold in memory. These analyses can also be performed on streaming data.
val wr = new WebcamReceiver()
val ssc = sc.toStreaming(strTime)
val imgList = ssc.receiverStream(wr)
// apply ImageJ's median filter to every incoming frame
val filtImgs = imgList.mapValues(_.run("Median...", "radius=3"))
// estimate the background as the pixel-wise average over all frames
val totImgs = inImages.count()
val bgImage = inImages.reduce(_ add _).multiply(1.0 / totImgs)
val eventImages = filtImgs.
  transform { inImages =>
    // subtract the background and key each frame by its mean deviation
    inImages.map {
      case (inTime, inImage) =>
        val corImage = inImage.subtract(bgImage)
        (corImage.getImageStatistics().mean,
          (inTime, corImage))
    }
  }
// keep only the frames whose mean deviates strongly from the background
eventImages.filter(iv => Math.abs(iv._1) > 20).
  foreachRDD(showResultsStr("outlier", _))
Apache Spark is a brilliant platform, and with GraphX, MLlib, and the other packages built on it, the possibilities are nearly unlimited.
def spreadVoxels(pvec: ((Int, Int), Double), windSize: Int = 1) = {
  val wind = -windSize to windSize
  val pos = pvec._1
  // spread the value evenly over the (2 * windSize + 1)^2 neighborhood
  val scaleValue = pvec._2 / (wind.length * wind.length)
  for (x <- wind; y <- wind)
    yield ((pos._1 + x, pos._2 + y), scaleValue)
}
val filtImg = roiImg.
  flatMap(cvec => spreadVoxels(cvec)).
  filter(roiFun).reduceByKey(_ + _)
Here we use a KNIME-based workflow and our Spark Imaging Layer extensions to create a workflow that requires no Scala or programming knowledge, has an easily visible flow from one block to the next, and carries none of the performance overhead of using other tools.
A spinoff - 4Quant: From images to insight
We are interested in partnerships and collaborations
A pairing between spatial information (position) and some other kind of information (value). \vec{x} \rightarrow \vec{f}
We are used to seeing images in a grid format, where the position indicates the row and column in the grid and the intensity (absorption, reflection, tip deflection, etc.) is shown as a different color.
The alternative form for this image is as a list of positions and a corresponding value
\hat{I} = (\vec{x},\vec{f})
x | y | Intensity |
---|---|---|
1 | 1 | 12 |
2 | 1 | 68 |
3 | 1 | 81 |
4 | 1 | 89 |
5 | 1 | 87 |
1 | 2 | 40 |
This representation can be called a feature vector; in this case it contains only Intensity.
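As a minimal sketch, flattening a dense grid into such rows is a simple traversal (the second row is truncated here because only its first value appears in the table above):
// hypothetical grid holding the intensities from the table above
val img = Array(Array(12, 68, 81, 89, 87), Array(40))
val featureVectors = for {
  (row, y) <- img.zipWithIndex
  (intensity, x) <- row.zipWithIndex
} yield (x + 1, y + 1, intensity)  // 1-based coordinates, as in the table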
If we use feature vectors to describe our image, we no longer need to worry about how the images will be displayed, and we can approach the segmentation/thresholding problem from a classification standpoint rather than an image-processing one.
So we have an image of a cell and we want to identify the membrane (the ring) from the nucleus (the point in the middle).
A simple threshold doesn't work because we identify the point in the middle as well. We could try to use morphological tricks to get rid of the point in the middle, or we could better tune our segmentation to the ring structure.
In this case we add a very simple feature to the image: the distance from the center of the image (Distance).
x | y | Intensity | Distance |
---|---|---|---|
-10 | -10 | 0.9350683 | 14.14214 |
-10 | -9 | 0.7957197 | 13.45362 |
-10 | -8 | 0.6045178 | 12.80625 |
-10 | -7 | 0.3876575 | 12.20656 |
-10 | -6 | 0.1692429 | 11.66190 |
-10 | -5 | 0.0315481 | 11.18034 |
We now have a more complicated image, which we cannot visualize as easily, but we can incorporate the two pieces of information together.
Now, instead of trying to find an intensity threshold for the ring alone, we can combine Intensity and Distance to identify it:
\textrm{iff} \; (5 < \textrm{Distance} < 10 \;\;\&\;\; 0.5 < \textrm{Intensity} < 1.0)
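Applied to the feature-vector form of the image, this rule is a one-line filter (a sketch; featureTable is a hypothetical collection of (position, intensity, distance) rows):
val ringPixels = featureTable.filter { case (_, intensity, distance) =>
  distance > 5 && distance < 10 && intensity > 0.5 && intensity < 1.0
}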
The distance, while illustrative, is not a commonly used feature; more common are various filters applied to the image:
x | y | Intensity | Sobel | Gaussian |
---|---|---|---|---|
1 | 1 | 0.94 | 0.32 | 0.53 |
1 | 10 | 0.48 | 0.50 | 0.45 |
1 | 11 | 0.50 | 0.50 | 0.46 |
1 | 12 | 0.48 | 0.64 | 0.46 |
1 | 13 | 0.43 | 0.78 | 0.45 |
1 | 14 | 0.33 | 0.94 | 0.42 |
The distributions of the features appear very different and can thus likely be used for identifying different parts of the images.
Combine this with our a priori information (this is called supervised analysis).
Now that the images are stored as feature vectors, they can be easily analyzed with standard Machine Learning tools. It is also much easier to combine with training information.
x | y | Absorb | Scatter | Training |
---|---|---|---|---|
700 | 4 | 0.3706262 | 0.9683849 | 0.0100140 |
704 | 4 | 0.3694059 | 0.9648784 | 0.0100140 |
692 | 8 | 0.3706371 | 0.9047878 | 0.0183156 |
696 | 8 | 0.3712537 | 0.9341989 | 0.0334994 |
700 | 8 | 0.3666887 | 0.9826912 | 0.0453049 |
704 | 8 | 0.3686623 | 0.8728824 | 0.0453049 |
We want to predict Training from x, y, Absorb, and Scatter.
\rightarrow MLlib: Logistic Regression, Random Forest, K-Nearest Neighbors, …
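As a minimal sketch using the Spark 1.x MLlib API (featureRows is a hypothetical RDD of (x, y, absorb, scatter, training) tuples of Doubles, and the 0.02 cutoff is only for illustration):
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val trainingData = featureRows.map { case (x, y, absorb, scatter, training) =>
  // binarize the continuous training score into a label for this sketch
  LabeledPoint(if (training > 0.02) 1.0 else 0.0,
    Vectors.dense(x, y, absorb, scatter))
}
val model = LogisticRegressionWithSGD.train(trainingData, 100)
// predict a class for every pixel's feature vector
val predictions = trainingData.map(lp => model.predict(lp.features))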
For many datasets, processing, segmentation, and morphological analysis provide all the information that needs to be extracted. For many systems, however, like bone tissue, cellular tissues, and cellular materials, the structure is just the beginning, and the most interesting results come from applying physical, chemical, or biological rules inside these structures.
\sum_j \vec{F}_{ij} = m\ddot{\vec{x}}_i
Such systems can easily be represented as a graph and analyzed using GraphX in a distributed, fault-tolerant manner.
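A minimal sketch with the standard GraphX API (the voxel values and contacts here are illustrative toy data):
import org.apache.spark.graphx.{Edge, Graph, VertexId}

// voxels as vertices (id -> value) and neighbor contacts as edges
val voxels = sc.parallelize(Seq[(VertexId, Double)](
  (1L, 0.9), (2L, 0.8), (3L, 0.1)))
val contacts = sc.parallelize(Seq(
  Edge(1L, 2L, 1.0), Edge(2L, 3L, 1.0)))  // edge attribute: contact strength
val voxelGraph = Graph(voxels, contacts)
// accumulate the contributions each voxel receives from its neighbors,
// e.g. the force sum in the equation above
val neighborSums = voxelGraph.aggregateMessages[Double](
  ctx => { ctx.sendToDst(ctx.srcAttr); ctx.sendToSrc(ctx.dstAttr) },
  _ + _)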
The bottleneck is the filesystem connection; many nodes (10+) reading in parallel bring even a GPFS-based InfiniBand system to a crawl.
One of the central tenets of MapReduce™ is data-centric computation \rightarrow instead of moving the data to the computation, move the computation to the data.