For the Intraclass Correlation Coefficient you will need the ICC package, which can be installed with the following command from RStudio (or the R terminal):
install.packages("ICC")
The new names are explained relative to the dataset used in the first problem, so read the first problem for more introductory material on these tables.
ID | BMD | MECHANICS_STIFFNESS | CORT_DTO__C_TH | CORT_DTO__C_TH_SD |
---|---|---|---|---|
351 | 0.0302208 | 57.16318 | 0.186455 | 0.019785 |
356 | 0.0327883 | 54.97201 | 0.183007 | 0.015696 |
357 | 0.0360747 | 73.59088 | 0.216930 | 0.028019 |
359 | 0.0311451 | 49.85482 | 0.193758 | 0.024087 |
JOIN
ID | D1Mit64 | D1Mit236 | D1Mit7 | D1Mit386 |
---|---|---|---|---|
351 | H | H | H | H |
356 | H | A | A | A |
357 | H | B | B | B |
359 | A | B | B | B |
BECOMES (hiding excess columns)
ID | BMD | MECHANICS_STIFFNESS | D1Mit64 | D1Mit236 |
---|---|---|---|---|
351 | 0.0302208 | 57.16318 | H | H |
356 | 0.0327883 | 54.97201 | H | A |
357 | 0.0360747 | 73.59088 | H | B |
359 | 0.0311451 | 49.85482 | A | B |
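In R this join can be sketched with ```merge()```; the tiny data frames below are hypothetical stand-ins for the tables above (only a couple of columns are reproduced):

```r
# miniature versions of the two tables above (illustrative values only)
tableA <- data.frame(ID  = c(351, 356, 357, 359),
                     BMD = c(0.0302208, 0.0327883, 0.0360747, 0.0311451))
tableB <- data.frame(ID      = c(351, 356, 357, 359),
                     D1Mit64 = c("H", "H", "H", "A"))
# keep the rows that share an ID value, combining columns from both tables
joined <- merge(tableA, tableB, by = "ID")
print(joined)
```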
for eleA in tableA:
    for eleB in tableB:
        yield (eleA, eleB)
ID | BMD | MECHANICS_STIFFNESS | CORT_DTO__C_TH | CORT_DTO__C_TH_SD |
---|---|---|---|---|
351 | 0.0302208 | 57.16318 | 0.186455 | 0.019785 |
356 | 0.0327883 | 54.97201 | 0.183007 | 0.015696 |
357 | 0.0360747 | 73.59088 | 0.216930 | 0.028019 |
359 | 0.0311451 | 49.85482 | 0.193758 | 0.024087 |
CROSS JOIN
ID | D1Mit64 | D1Mit236 | D1Mit7 | D1Mit386 |
---|---|---|---|---|
351 | H | H | H | H |
356 | H | A | A | A |
357 | H | B | B | B |
359 | A | B | B | B |
BECOMES (hiding excess columns)
ID.x | BMD | MECHANICS_STIFFNESS | ID.y | D1Mit64 | D1Mit236 |
---|---|---|---|---|---|
351 | 0.0302208 | 57.16318 | 351 | H | H |
356 | 0.0327883 | 54.97201 | 351 | H | H |
357 | 0.0360747 | 73.59088 | 351 | H | H |
359 | 0.0311451 | 49.85482 | 351 | H | H |
351 | 0.0302208 | 57.16318 | 356 | H | A |
356 | 0.0327883 | 54.97201 | 356 | H | A |
357 | 0.0360747 | 73.59088 | 356 | H | A |
359 | 0.0311451 | 49.85482 | 356 | H | A |
351 | 0.0302208 | 57.16318 | 357 | H | B |
356 | 0.0327883 | 54.97201 | 357 | H | B |
357 | 0.0360747 | 73.59088 | 357 | H | B |
359 | 0.0311451 | 49.85482 | 357 | H | B |
351 | 0.0302208 | 57.16318 | 359 | A | B |
356 | 0.0327883 | 54.97201 | 359 | A | B |
357 | 0.0360747 | 73.59088 | 359 | A | B |
359 | 0.0311451 | 49.85482 | 359 | A | B |
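In R the same Cartesian product can be produced with ```merge()``` and ```by = NULL```; a minimal sketch with two-row stand-ins for the tables above:

```r
tableA <- data.frame(ID = c(351, 356), BMD = c(0.0302208, 0.0327883))
tableB <- data.frame(ID = c(351, 356), D1Mit64 = c("H", "H"))
# by = NULL pairs every row of tableA with every row of tableB;
# the duplicated ID column is renamed to ID.x and ID.y
crossed <- merge(tableA, tableB, by = NULL)
nrow(crossed)  # 2 rows x 2 rows = 4 combinations
```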
For example, if we have each experiment in a separate file, it makes sense to concatenate them all together rather than performing a Join or Cross Join.
First table
ID | BMD | MECHANICS_STIFFNESS | CORT_DTO__C_TH | CORT_DTO__C_TH_SD |
---|---|---|---|---|
351 | 0.0302208 | 57.16318 | 0.186455 | 0.019785 |
356 | 0.0327883 | 54.97201 | 0.183007 | 0.015696 |
357 | 0.0360747 | 73.59088 | 0.216930 | 0.028019 |
359 | 0.0311451 | 49.85482 | 0.193758 | 0.024087 |
Second set of results
 | ID | BMD | MECHANICS_STIFFNESS | CORT_DTO__C_TH | CORT_DTO__C_TH_SD |
---|---|---|---|---|---|
21 | 457 | 0.0311451 | 49.28858 | 0.178971 | 0.020928 |
22 | 459 | 0.0293992 | 58.17281 | 0.158907 | 0.018856 |
23 | 470 | 0.0316586 | 61.49628 | 0.162435 | 0.019928 |
24 | 472 | 0.0345342 | 56.62613 | 0.185258 | 0.028595 |
The concatenation of the two groups
 | ID | BMD | MECHANICS_STIFFNESS | CORT_DTO__C_TH | CORT_DTO__C_TH_SD |
---|---|---|---|---|---|
1 | 351 | 0.0302208 | 57.16318 | 0.186455 | 0.019785 |
2 | 356 | 0.0327883 | 54.97201 | 0.183007 | 0.015696 |
3 | 357 | 0.0360747 | 73.59088 | 0.216930 | 0.028019 |
4 | 359 | 0.0311451 | 49.85482 | 0.193758 | 0.024087 |
21 | 457 | 0.0311451 | 49.28858 | 0.178971 | 0.020928 |
22 | 459 | 0.0293992 | 58.17281 | 0.158907 | 0.018856 |
23 | 470 | 0.0316586 | 61.49628 | 0.162435 | 0.019928 |
24 | 472 | 0.0345342 | 56.62613 | 0.185258 | 0.028595 |
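In R such a concatenation is a row bind; a minimal sketch with hypothetical two-row data frames in place of the full tables:

```r
first.df  <- data.frame(ID = c(351, 356), BMD = c(0.0302208, 0.0327883))
second.df <- data.frame(ID = c(457, 459), BMD = c(0.0311451, 0.0293992))
# rbind stacks the rows; both tables must have the same column names
all.df <- rbind(first.df, second.df)
nrow(all.df)  # 2 + 2 = 4 rows
```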
For this example we will start with a fairly complicated dataset from a genetics analysis we did at the Institute of Biomechanics.
There are 1000 mouse femur bones which have been measured at high resolution, with a number of shape analyses run on each sample.

- Phenotypical information
- Each column represents a metric which was assessed in the images
- CORT_DTO__C_TH, for example, is the mean thickness of the cortical bone
ID | BMD | MECHANICS_STIFFNESS | CORT_DTO__C_TH | CORT_DTO__C_TH_SD |
---|---|---|---|---|
351 | 0.0302208 | 57.16318 | 0.186455 | 0.019785 |
356 | 0.0327883 | 54.97201 | 0.183007 | 0.015696 |
357 | 0.0360747 | 73.59088 | 0.216930 | 0.028019 |
359 | 0.0311451 | 49.85482 | 0.193758 | 0.024087 |
‘-’ indicates a missing or erroneous measurement
 | ID | D1Mit64 | D1Mit236 | D1Mit7 | D1Mit386 |
---|---|---|---|---|---|
101 | 685 | A | H | H | H |
102 | 686 | B | B | B | H |
103 | 687 | A | A | H | H |
104 | 688 | H | H | H | A |
105 | 689 | H | H | - | A |
For this example we will compare two real cortical bone samples taken from mice. The data will be downloaded in KNIME from the course website (KNIME can also download from and upload to FTP servers, making it easier to share results and data).

- If you are using your own computer you will need to change the Target Folder in both of the “Download” nodes to something reasonable (just click Browse)
For the purposes of the analysis, and to keep the data sizes small, we will use Kevin’s Crazy Camera again to simulate the noisy detection process. The assignment aims to be more integrative: you will combine material from a number of different lectures to get to the final answer.
library(ggplot2)
# assemble the relevant KNIME input columns into a data frame
cur.df<-data.frame(
  sample=as.factor(knime.in$"Image Number"),
  measurement=knime.in$"Measurement_Number",
  volume=knime.in$"Num Pix"
)
# density of the volume distribution, one curve per measurement,
# colored by sample
ggplot(cur.df,aes(x=volume))+
  geom_density(aes(color=sample,group=measurement))+
  theme_bw(25)
library(ggplot2)
library(plyr)
cur.df<-data.frame(
  sample=as.factor(knime.in$"Image Number"),
  measurement=knime.in$"Measurement_Number",
  volume=knime.in$"Num Pix"
)
# summarize each sample/measurement pair: mean volume and object count
sum.df<-ddply(cur.df,.(sample,measurement),function(in.df) {
  data.frame(mean.volume=mean(in.df$volume),cell.count=nrow(in.df))
})
ggplot(sum.df,aes(x=mean.volume))+
  geom_density(aes(color=sample))+
  theme_bw(25)
library(ggplot2)
library(plyr)
cur.df<-data.frame(
  sample=as.factor(knime.in$"Image Number"),
  measurement=knime.in$"Measurement_Number",
  volume=knime.in$"Num Pix"
)
sum.df<-ddply(cur.df,.(sample,measurement),function(in.df) {
  data.frame(mean.volume=mean(in.df$volume),cell.count=nrow(in.df))
})
ggplot(sum.df,aes(x=cell.count))+
  geom_density(aes(color=sample))+
  theme_bw(25)
library(ICC)
in.data<-data.frame(
  metric = knime.in$"Num Pix",
  group = knime.in$"Image Number"
)
# ICCbare(x, y) takes the grouping factor first and returns the ICC directly
icVal<-ICCbare(group,metric,data=in.data)
icVal
One possible final result looks something like this
This exercise (the workflow named Statistical Significance Hunter) shows the same results as we discussed in the lecture for finding p-values of significance. It takes a completely random stream of numbers with mean 0.0 and tests them against a null hypothesis (that they equal 0) in small batches. You can adjust the size of the batches, the number of items, and the confidence interval. The result is a pie chart showing the number of “significant” results found using the standard scientific criteria for common studies.
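The same experiment can be sketched in a few lines of R (the parameter names here are illustrative, not the ones in the workflow):

```r
set.seed(1234)           # fixed seed so the run is reproducible
batch.size <- 10         # items per batch
n.batches  <- 200        # number of batches tested
alpha      <- 0.05       # significance threshold
# pure noise with mean 0, t-tested in batches against mu = 0
p.values <- replicate(n.batches,
                      t.test(rnorm(batch.size), mu = 0)$p.value)
# the fraction of batches that look "significant" by chance alone,
# typically close to alpha
mean(p.values < alpha)
```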
Making plots or graphics should be divided into separate, independent components.

- Setup is the ggplot command and the data
- Mapping is in the aes command: ggplot(input data frame, aes(x=name of x column, y=name of y column))+
- Plot is the next command (geom_point, geom_smooth, geom_density, geom_histogram, geom_contour): geom_point()+
- Coordinates can then be added to any type of plot (coord_equal, coord_polar, etc.)
- Scales can also be added (scale_x_log10, scale_y_sqrt, scale_color_gradientn)
- Labels are added with labs(x="x label", y="y label", title="title")
library(ggplot2)
library(knitr)
# read the table in
phen.table<-read.csv("09-files/phenoTable.csv")
# take the first 4 rows and columns
p.sub.table<-phen.table[c(1:4),c(35,c(1:4))]
# print it out nicely
kable(p.sub.table)
ID | BMD | MECHANICS_STIFFNESS | CORT_DTO__C_TH | CORT_DTO__C_TH_SD |
---|---|---|---|---|
351 | 0.0302208 | 57.16318 | 0.186455 | 0.019785 |
356 | 0.0327883 | 54.97201 | 0.183007 | 0.015696 |
357 | 0.0360747 | 73.59088 | 0.216930 | 0.028019 |
359 | 0.0311451 | 49.85482 | 0.193758 | 0.024087 |
- Set up the input table as ```phen.table``` and the mapping with the x position mapped to BMD (Bone Mineral Density) and the y position as CT_TH_RAD (Cortical Bone Thickness)
ggplot(phen.table,aes(x=BMD,y=CT_TH_RAD))
ggplot(phen.table,aes(x=BMD,y=CT_TH_RAD))+geom_point()
- Change the color of the points to show whether the animal is female (in the mapping)
ggplot(phen.table,aes(x=BMD,y=CT_TH_RAD,color=FEMALE))+geom_point()
- Show the color as a discrete (factor) value instead of a number
m.table<-mutate(phen.table,GENDER = ifelse(FEMALE==1,"F","M"))
ggplot(m.table,aes(x=BMD,y=CT_TH_RAD,color=GENDER))+geom_point()
- Make the plot in two facets (windows) instead of the same one
ggplot(m.table,aes(x=BMD,y=CT_TH_RAD,color=GENDER))+
geom_point()+
facet_wrap(~GENDER)
- Make the facets by dividing the points evenly (in number) into 6 groups by the number of lacunae (LACUNA_COUNT)
m.table<-mutate(phen.table,
GENDER = ifelse(FEMALE==1,"F","M"),
LACUNA_NUMBER_GROUP = cut_number(LACUNA_COUNT,6)
)
ggplot(m.table,aes(x=BMD,y=CT_TH_RAD,color=GENDER))+
geom_point()+
facet_wrap(~LACUNA_NUMBER_GROUP)
- Fix the labels and add a white background (```theme_bw```) with font size 12
ggplot(m.table,aes(x=BMD,y=CT_TH_RAD,color=GENDER))+
geom_point()+
facet_wrap(~LACUNA_NUMBER_GROUP)+
labs(x="Bone Mineral Density",y="Cortical Bone Thickness (um)",color="Gender")+
theme_bw(12)
For more information and tutorials, see http://ggplot2.org/
Unit Testing works differently in every language but the major ideas remain the same.
def countPoints(in: Array[Array[Boolean]]): Int =
in.flatten.map(if(_) 1 else 0).sum
A simple 1x3 array with one true value
import org.scalatest.{FunSuite, Matchers}
class PointCounterTests extends FunSuite with Matchers {
  test("Simple Array") {
    val simpleArr = Array(Array(true,false,false))
    countPoints(simpleArr) shouldBe 1
  }
}
An empty array should return 0
val emptyArr = Array(new Array[Boolean](0))
countPoints(emptyArr) shouldBe 0
It is critical to note that if you are using a statically typed language (C, C++, Fortran, Java, Scala, Groovy, etc.) you do not need to explicitly check for type issues, since that is done automatically by the compiler (you cannot give a string to a function expecting an integer; it simply will not run). If you are using a dynamically typed language (Matlab, Python, R, Mathematica, etc.) you will need to write additional tests to make sure your function handles a variety of different inputs correctly. For example, if you implement the countPoints
function in Matlab, what happens when you give it a 3D array or a string? It should either immediately throw an error explaining that the input doesn’t make sense, or explicitly handle that case. Otherwise you could get very unexpected behavior. The following is a perfectly reasonable function in Matlab:
function [out]=isBiggerThanFive(inVar)
if inVar>5
    disp([inVar ' is bigger than 5'])
    out=true;
else
    disp([inVar ' is smaller than 5'])
    out=false;
end
And if you execute isBiggerThanFive('Tom')
it executes perfectly and you get Tom is bigger than 5
which is probably meaningless. It would be much better if the function threw an error saying that a string is not an appropriate input type for this function.
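The same kind of guard is easy to add in a dynamic language; a sketch in R, where the ```stopifnot``` check is the addition:

```r
isBiggerThanFive <- function(inVar) {
  # reject anything that is not a single number before comparing
  stopifnot(is.numeric(inVar), length(inVar) == 1)
  inVar > 5
}
isBiggerThanFive(7)       # TRUE
# isBiggerThanFive("Tom") # now stops with an informative error
```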
To make a report in RStudio, go to the File -> New File -> R Markdown… menu. You can then enter code in blocks, as shown in the example, along with text in Markdown (text with dashes and pound signs for formatting) to make a document which compiles to a PDF, HTML file, or Word document.
This allows you to keep your entire analysis in a single document. Additionally, it protects you from some of the major challenges in Matlab, Python, and R (dynamic languages) involving namespaces and analyses that run fine now but do not work, or work differently, tomorrow.