I need to finish this analysis for a manuscript that's X years old (...gasp)

Thejasvi Beleyur

2020-04-08 08:26

Updated : 2020-04-23 08:53:00 Category: data analysis

With the social distancing measures implemented thanks to the COVID19 pandemic, I saw a whole wave of memes about academics thinking they'd get more productive. Now that there are no more unnecessary meetings, teaching, or random interruptions from the workplace, it'd be great to finally get back to doing SCAAAAAAINCE. I wasn't quite so sure myself. As an end-phase PhD student working at a mainly research institute, I can't really complain about any of these things, although I will say I am beginning to enjoy the luxury and convenience of waking up, getting ready, and heading to the desk a few metres away.

One of the things I've been trying to get done is to make some new plots and add some analyses for a project I was part of back in 2015. Yes, the project is now almost five years old. The manuscript has gone through one rejection, been revamped a bit, and is soon to be submitted to its second journal. I actually began working full steam on the analysis about a month ago, and made a decent amount of progress getting back into R and writing up a new Markdown notebook to document the new analysis. That's when I began to realise how (!@#$'ing) hard it is to keep track of experiments, data and analyses that happened anything more than a few months ago.

Old code that's not tested or documented well can be a nightmare. Old experimental analyses that're semi-documented, with a bunch of intermediate files lying around everywhere - that's just torture. And yes, I will admit that this always happens whenever I'm working with my (not-so) favourite collaborator, pastme. Pastme has a habit of coming up with 'interesting' ideas for an analysis, and then putting them into the same Rmd notebook as the final figures for journal submission. The other irritating thing pastme does is take a bunch of analyses/explorative plots about 75% of the way, and then forget to say what else needs to be done in case someone'd like to carry the work forward. The burden of having to figure it all out afresh each time, of course, means my reluctance to pick the work up again grows each time.

This is the point where I began to think about what could be done to improve the situation. I have heard from colleagues who say they always have a fixed folder structure, for instance, or that they keep all the 'old' stuff in one folder and the manuscript-worthy stuff in another. I guess there is a lot of room for personal choice. However, the thing with personal choice is that while some really bad ideas don't get propagated, even worse, some great ideas don't get around either! One of the things I picked up from following coding and documentation conventions in the Python world (eg. NumPy docstrings, and using the Sphinx documentation system) is the power and discipline they bring to the way you write code. Using conventions and defined ways of expressing yourself may seem constraining, but in practice I've come to appreciate how much they help. Anyone familiar with the conventions can find their way around my code, and I around theirs.
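For anyone who hasn't come across one, here's a minimal sketch of what a NumPy-style docstring looks like. The function itself (`mean_call_duration`) is made up purely for illustration and has nothing to do with the project above; the point is the fixed section layout (Parameters, Returns, Raises, Examples), which a reader who knows the convention can skim right away.

```python
def mean_call_duration(durations_ms):
    """Compute the mean duration of a set of recorded calls.

    Hypothetical example function, written only to show the
    NumPy docstring layout.

    Parameters
    ----------
    durations_ms : list of float
        Call durations in milliseconds. Must not be empty.

    Returns
    -------
    float
        Arithmetic mean of the durations, in milliseconds.

    Raises
    ------
    ValueError
        If `durations_ms` is empty.

    Examples
    --------
    >>> mean_call_duration([2.0, 4.0, 6.0])
    4.0
    """
    # Fail loudly on empty input instead of dividing by zero.
    if len(durations_ms) == 0:
        raise ValueError("durations_ms must contain at least one value")
    return sum(durations_ms) / len(durations_ms)
```

The sections always come in the same order with the same underlined headings, so I know where to look for the units of an input or what a function hands back, even in code I've never seen before (or code pastme wrote).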

I have this nagging feeling there must be a field of research or industry where a nice set of protocols has been formulated for how to organise the code, raw data and processed data properly. I'd now venture to say it must be out there...time to start looking properly!!
