class: center, middle, inverse

# Publishing computer-aided research

### Konrad Hinsen

Centre de Biophysique Moléculaire, Orléans, France

Synchrotron SOLEIL, Saint Aubin, France

---

# Publishing is an important part of doing research

- Publications constitute the scientific record of studies and their results.
- A publication must provide enough detail for replication of the study.

# Publishing is important for a scientist's career

- "Publish or perish"
- Bibliometrics as a substitute for evaluating quality

# Two distinct goals create conflicts of interest

- Advancing science requires detailed and accurate publications of innovative studies.
- Advancing a career in science requires a large number of publications that catch interest.
- Peer review is supposed to ensure that scientists work for science and not just for their careers.

As publication output and specialization increase, peer review works less and less well - but that's not today's topic.

---

# What do we publish (ideally)?

Experimental studies:

- the setup of the experiment, the experimental protocol
- all the results of all observations
- data analysis protocols
- processed data
- an interpretation

Theoretical studies:

- the theoretical model(s)
- assumptions and approximations
- mathematical and computational procedures
- the results of interest to the authors
- an interpretation

No need to publish "all results" for theoretical work:

- Anyone can do computations starting from the same models, assumptions, and approximations using the described procedures. There is no "experimental error", nor any "uncontrolled parameters".
- An infinite number of results can be obtained from any given model.

---

# Computation and automation have changed the game

Experiments:

- Automated recording of observations can create huge amounts of data, too much for publication in a paper.
- Data analysis protocols take the form of computer programs. Often no other precise and complete formulation exists.
- Papers no longer contain all observations, and give only a summary of the data analysis protocols.

Theoretical work:

- Models and computational procedures often exist only in the form of computer programs.
- Papers provide only summaries of models, assumptions, approximations, and procedures, making it impossible for their readers to do their own computations.

Computing and automation have made scientific papers less and less complete, because the publishing technique is no longer adequate. This has led to serious mistakes in publications (see this [Nature news story](http://www.nature.com/news/2010/101013/full/467775a.html)), and probably also to increased fraud.

**Computer-aided research requires publishing _all the software_ and _all observed data_ in electronic form.**

---

# Why publish software and data?

- Moral obligation for any good citizen of the scientific community.
- For your career: bibliometrics is starting to take software and data into account.
- A [recent study](http://dx.doi.org/10.7717/peerj.175) claims that publishing the data will lead to more citations of your papers.
- Journals are starting to require data publication with article submissions. Software publication will likely follow.

# Where?

- Supplementary material for an article.
- Data repositories: [figshare](http://figshare.com/), [Zenodo](http://zenodo.org/), plus domain-specific or institution-specific sites.
- Version-control hosting sites: [GitHub](http://github.com), [Bitbucket](http://bitbucket.org/), plus institution-specific sites such as [Sourcesup](http://sourcesup.renater.fr/) in France.
- For large datasets, [Academic Torrents](http://www.academictorrents.com/).

# What?

- At the very least, the files used in your work: software source code, data files, workflows.
- Better: files that are easy to understand and use by others. This requires thinking about publication _before_ doing the research.

---

# In the ideal world...

...
there would be a file format for storing, sharing, and publishing _computations_, i.e. combinations of programs and datasets, with all the dependency information of a workflow.

_Self-advertisement:_ this happens to be a [research project](http://activepapers.org/) that I am working on.

## In the real world...

... we have to do the best we can with the tools we have. Let's look at some of them.

---

# IPython notebook

A notebook is a mixture of program code, results, and documentation. It is a good medium for explaining computations and their results.

A notebook can also contain short datasets, e.g. from experiments, expressed as Python data structures.

More information [on the IPython Web site](http://ipython.org/notebook.html)

The IPython notebook was initially written for the Python language, but support for other languages is being added. Working right now: [Ruby](http://www.ruby-lang.org/) and [Julia](http://julialang.org/).

An IPython notebook is a plain text file that can be managed with version control.

Similar approaches:

- [knitr](http://yihui.name/knitr/) for R
- [Org-mode/Babel](http://orgmode.org/worg/org-contrib/babel/index.html) for Emacs users
- [Sage notebooks](http://www.sagenb.org/)
- [Mathematica notebooks](http://reference.wolfram.com/mathematica/tutorial/NotebooksAsDocuments.html)

---

# Literate programming

A combination of text and program code, but with no live computation or results embedded in the document. Usually no maths either.

Notebooks are much better for "telling a story" about a computational study; they are closer to a paper.

Literate programming tools provide one feature that notebooks don't have: the possibility to describe multi-file software libraries.
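As a rough sketch of the idea (not taken from any real project), a tiny noweb-style source file might look like this: documentation chunks are introduced by `@`, named code chunks by `<<name>>=`, and a tool such as `notangle` assembles the chunks into one or more source files:

```text
@ A tiny statistics library, written as a literate program.
Extract the code with: notangle -Rstats.py library.nw > stats.py

<<stats.py>>=
<<imports>>
<<mean>>
@ We only need the standard library.
<<imports>>=
import math
@ The arithmetic mean of a list of numbers.
<<mean>>=
def mean(values):
    return sum(values) / len(values)
```

Because chunks are named, one document can describe several output files, which is exactly the multi-file capability that notebooks lack.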
Some popular tools (there are many more):

- [noweb](http://www.cs.tufts.edu/~nr/noweb/) is a classic, language-independent literate programming tool
- [pyWeb](http://pywebtool.sourceforge.net/) is a more modern alternative
- [Leo](http://leoeditor.com/) is an editor for literate programs

---

# What's still missing?

- Literate programs combine software with explanations.
- Notebooks add results and graphics.
- The missing piece is _data_.
- Small and simply structured datasets can be integrated into a notebook.
- Big or structured datasets need special file formats and separate storage.
- The best you can do right now is bundle everything up in a zip file.

---

# A look into the future

- Scientific publishers are experimenting with new publication technologies, integrating multimedia content, interactive exploration of data, etc.
- On-line executable computations as companion sites to articles are offered by [Exec&Share](http://execandshare.org). Elsevier is testing a similar approach called [Collage](http://collage.elsevier.com/).
- Scientific cloud hosting sites such as [Wakari](http://wakari.io) permit the publication of IPython notebooks together with support libraries and datasets, letting anyone explore your results.
- [ActivePapers](http://www.activepapers.org/) permits the bundling of code and datasets into publishable bundles.

It is safe to assume that the publication of scientific computations will look very different a few years from now.
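---

# Appendix: the zip-file stopgap in practice

Until better tools arrive, the "bundle everything in a zip file" advice from earlier needs only a few lines of Python. The file names below are hypothetical placeholders; the sketch even creates dummy files so it runs on its own - in real use you would of course list your actual notebook, code, and data files.

```python
import pathlib
import zipfile

# Hypothetical stand-ins for a real notebook, library, and dataset.
pathlib.Path("analysis.ipynb").write_text('{"cells": []}')
pathlib.Path("mylib.py").write_text("def mean(xs):\n    return sum(xs) / len(xs)\n")
pathlib.Path("observations.csv").write_text("t,value\n0,1.2\n1,1.5\n")

# Bundle everything into one compressed archive, suitable for
# uploading as supplementary material or to a data repository.
files = ["analysis.ipynb", "mylib.py", "observations.csv"]
with zipfile.ZipFile("supplementary.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in files:
        zf.write(name)

print(zipfile.ZipFile("supplementary.zip").namelist())
```

A flat zip file loses the dependency information between code and data, which is precisely what formats like ActivePapers aim to preserve.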