Alpathrupthi - the key to sustained research?

original post: 05/20, edits in 04/21

Over half a decade ago, I was still in college/university and busy training my future scientist-self in the art of collaboration through evenings filled with multi-player games.

Despite the fact that I really liked playing these games, it turned out I really ....sucked. My hand-eye coordination was too poor for fast-paced first-person shooters, and my sense of strategy for real-time strategy games was inadequate, to say the least. The problem was that we'd often play in teams, and so the people on my team would suffer quite a bit thanks to my incompetence :P. The main issue: my thresholds for satisfaction were low, and I was extremely happy whenever the most basic things worked out. In a first-person shooter, this meant not dying, and my ultimate strategy was to indiscriminately spray everything with bullets. In the strategy games, it meant not being invaded, or succeeding in a minor takeover of a neighbouring village. On top of that, I was in many ways like the proverbial frog in boiling water. Many real-time strategy games require you to keep track of your kingdom's finances, labour power and so on. I could never notice my kingdom failing, and would ask my team-mates for a loan only after it was too late. Also, I never got better over the two years that we played :p. No matter how badly we lost, I'd still be happy with something insignificant that had happened in the game.

On one of these evenings, a friend of mine finally expressed his frustration and amusement by calling me an 'alpathruptha', and that phrase has somehow stayed with me for a long time. Alpathruptha is a Sanskrit word (also used in Kannada) that describes someone who is easily satisfied, or has low ambition 1. My dear friend was of course strongly focussing on my utter lack of drive and ambition. While I may be an alpathruptha in certain parts of life, I constantly struggle with being dissatisfied with a project's progress or where it is at right now.

The origin of this flashback to the word now fails me, but it somehow led me to think about how, in some ways, the more of an alpathruptha one is, the better it might be while doing research. Especially while handling a new topic or technical method, perhaps less is better? In some ways, being thrilled at having survived the day is a great incentive for coming back the next day.

Considering how a lot of research in the initial phases is two steps forward, ten steps back, a sense of alpathrupthi is perhaps only appropriate. Perhaps it's also important to celebrate the realisation of failure. When you realise something is wrong with the code/equipment/results, maybe that should be considered an achievement in itself. Of course, some might argue this is nothing but alpathrupthi embodied; you decide.


  1. alpathruptha also has other, more ascetically associated meanings: without great desire, one with low expectations, or easily satisfied.

first post!!

Hello!

Hello, and welcome to the first post of my blog. This is where I'm planning to write about science and scientific computing. This is a somewhat informal venue to discuss some ideas, and so even though most of the ideas are not mine - do not expect formal citations for many things!

Why?

Why am I throwing all these words onto the internet? Isn't the internet filled with enough people who are already talking and writing about all kinds of specialised topics? Yes, it is, and that's kind of exactly what academia is too: a lot of people writing and discussing very specialised topics, with tiny audiences :P. I'm writing mainly to clarify some of these thoughts/ideas for myself, and hopefully for any reader with the patience and curiosity to stick around.

How?

Like many computer-related things I've done, this site and blog were built with a Python package, Nikola. The link to the package is in the page footer.

Universal data formats

I'm still trying to get through an old but (I think) cool analysis for a manuscript that's over five years in the making now. The code in the manuscript itself reflects the need for multiple coding platforms and how each of them brings its own superpowers with it. All of the image analysis in the project was done in MATLAB. The stats and analyses were executed and documented in R with Markdown notebooks. A collaborator recently did some additional analyses in MATLAB and sent over some new results.

It would all have been fine, and I probably wouldn't even have written this post, if I were still using the same laptop I had when the project started. Now, however, even though the laptop is the same, it has gone through a couple of OS changes and currently has R and Python installed on it - same cover, different contents. Getting some .mat files and having to open and analyse them is not impossible. I'm very thankful for cross-language packages like R.matlab and SciPy's scipy.io.loadmat, and was able to load the data without problem.
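
For anyone in the same boat, this is roughly what loading a MATLAB results file into Python looks like with scipy.io.loadmat (the file name and variable name here are made up for illustration):

from scipy.io import loadmat

# loadmat returns a dictionary mapping MATLAB variable names to NumPy arrays
results = loadmat('image_analysis_results.mat')
print(results.keys())                        # inspect what was actually saved
call_positions = results['call_positions']   # hypothetical variable name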

However, even the act of having to find a specialised package to load a dataset saved in a platform-specific format made me re-think how I would save my data in the future. The problem of language-specific data files is not tied to any one computing platform. Even as a Python/R user, it seems only natural to end the day by saving the results into one .Rda or .pkl file. The problem only arises when the same files have to be read by someone who's not invested in the platform of your choice. Simple question: what if I'd decided to send a collaborator a .Rda/.pkl file, and they use a completely different computing platform called, say... Kidneybeans? Kidneybeans is an established platform in the field of MagicMaking and has a small but established community of researchers using it. Should your collaborator bend to the pressure of a larger established community and spend 45 minutes of their time just to load and re-format the data? Not fair, right?

The solution to overcoming cross-language barriers is of course to use 'standard' formats (csv, json, hdf5). This has been suggested multiple times ref1,ref2 and despite having read the literature, it's only beginning to dawn on me. It does require some effort to plan and re-organise all the data into standard formats, instead of being able to 'naturally' dump that list, data frame and image data into one file (or maybe multiple files?..). I guess the main advantage of saving stuff in universal formats is that the data will remain accessible to future collaborators using any computational platform, and of course the most likely future collaborator, as always, is futureyou!
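
As a rough sketch of what this looks like in Python (the file names and contents here are invented for illustration): instead of pickling everything into one .pkl file, the same results can be written out as csv and json, which any platform can read.

import json
import pandas as pd

# a hypothetical set of results from a day's analysis
call_measurements = pd.DataFrame({'call_id': [1, 2, 3],
                                  'duration_ms': [2.1, 3.4, 2.8]})
analysis_settings = {'threshold_db': -40, 'window_size': 256}

# instead of pickle.dump(...) into one platform-specific file:
call_measurements.to_csv('call_measurements.csv', index=False)
with open('analysis_settings.json', 'w') as f:
    json.dump(analysis_settings, f, indent=2)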

I need to finish this analysis for a manuscript that's *X* years old (...gasp)

Updated : 2020-04-23 08:53:00 Category: data analysis

With the social distancing measures implemented thanks to the COVID19 pandemic, I saw a whole wave of memes about academics thinking they'd finally get more productive. Now that there are no more unnecessary meetings, teaching, or random interruptions at the workplace, it'd be great to finally get back to doing SCAAAAAAINCE. I myself wasn't so sure: as an end-phase PhD student working at a mainly research-focussed institute, I can't quite complain about any of these things, although I will say I am beginning to enjoy the luxury and convenience of waking up, getting ready, and heading to the desk a few metres away.

One of the things I've been trying to get done is to make some new plots and add some additional analyses to a project I was part of back in 2015. Yes, the project is now almost five years old. The manuscript has gone through one rejection, been revamped a bit, and is now soon to be submitted to its second journal. I actually began working at full steam on the analysis about a month ago, and made a decent amount of progress getting back into R and writing up a new Markdown notebook to document the new analysis, and that's when I began to realise how (!@#$'ing) hard it is to keep track of experiments, data and analyses that happened anything more than a few months ago.

Old code that's not tested or well documented can be a nightmare. Old experimental analyses that're semi-documented, with a bunch of intermediate files lying around everywhere - that's just torture. And yes, I will admit that this always happens whenever I'm working with my (not-so) favourite collaborator, pastme. Pastme has a habit of coming up with 'interesting' ideas for an analysis, and then putting them into the same Rmd notebook as the final figures for journal submission. The other thing pastme does is this irritating thing of taking a bunch of analyses/explorative plots about 75% of the way, but then forgetting to say what else needs to be done in case someone'd like to carry the work forward. The burden of having to figure it all out afresh each time, of course, means there's a growing sense of reluctance each time.

This is the point where I began to think about what could be done to improve the situation. I have colleagues who say they always use a fixed folder structure, for instance, or that they keep all the 'old' stuff in one folder and the manuscript-worthy stuff in another. I guess there is a lot of room for personal choice. However, the thing with personal choice is that while some really bad ideas don't get propagated, even worse, some great ideas don't spread either! One of the things I picked up from following coding and documentation conventions in the Python world (eg. NumPy docstrings, and using the Sphinx documentation system) is the power and discipline they bring to the way you write code. Using conventions and defined ways of expressing yourself may seem constraining, but in reality I've realised their power. Because these formats are shared, anyone familiar with the conventions can access my code, and I can access theirs.

I have this nagging feeling that there must be a field of research or industry where a nice set of protocols has already been formulated for how to organise code, raw data and processed data properly. I'd now venture to say it must be out there... time to start looking properly!!

So, how much do you really understand about (cool) topic X? (also, it's not you who's dumb, find the right source/teacher)

Working in academia can be quite stressful. The uncertainty of where you'll be in the next few years, and whether you're competent enough at all (impostor syndrome), are questions that're always hanging around. Even once you think you've understood something, five minutes later another thing will come up to show that the mental model you had was actually rather wrong. It can be very frustrating to put in a lot of effort, think of it as progress, and then find yourself back at square one. Now, after five years of doing this, I've come to accept it as a daily part of a researcher's life: grappling with new topics, themes and approaches all the time. Even within your own narrow academic field, there's always something new being presented in a conference talk or paper. And let's face it, learning new concepts is hard, especially when they're thrown at you for the first time in the form of a few pages in a paper or, even worse, over a few minutes in a talk.

Even though I may have the curiosity to pursue a subject or topic after hearing/reading about it somewhere, I often find strong historical and disciplinary barriers to entry. For instance, while working on a Python package to quantify bat echolocation calls, I realised (again) that the main way I was trying to separate two types of sound would work 75% of the time (method X), but would fail horribly the other 25% of the time. Method X is a somewhat 'standard' approach that my supervisor and I converged upon relatively quickly. However, it didn't work as well as hoped. Luckily, it wasn't so hard to find an alternative method (let's call it method Y), because signal processing is a vast field with applications in all kinds of situations. Excited, I started downloading seemingly relevant papers and books to try and figure out whether this method could be applied to the bat calls at all. The first barrier is disciplinary. Opening up a book filled with proofs of convergence or homology, or such things, doesn't quite help. As an earlier-stage grad student I may have felt underconfident about not understanding the equations on the page, but now, with a few years of on-the-job experience (and being able to acknowledge my limitations quicker :P), I'm able to skim through the whole text without worrying about the details in the equations. If I missed something important, it'll probably be there in another text too! The second barrier is historical. Especially since method Y was mainly applied in signal processing and some fields of physics, there were few non-mathematically driven explanations. I'm happy to handle maths when it's presented with more context around it, but on its own it's quite a struggle. Despite the abundance of nice equation-loaded LaTeX pdfs and slideshows, there are always crumbs of information to be picked up on the go. Even without getting into the nitty-gritties of the actual equations, I learnt that the method is useful, but can be hard to interpret, and was the basis for a bunch of newer methods used in the analysis of very short signals. Now, that's not bad, huh?

So what do you do when you're trying to understand a niche topic filled with foreign terms and steeped in its own historical context? Honestly speaking, all fields of science are like this, and I guess the only thing to do is to shop around pdfs, talks and presentations until you find the one that 'speaks' to you, whether it's a YouTube video, a MOOC lecture, or a conversation with a colleague. The fact is, when you're new to a field or trying to get into it, blaming yourself for not understanding is counter-productive. If I arrive in a foreign country and someone starts talking to me in the language of the land, I do not feel stupid about not understanding, but I am quick to realise it. It's not you who's dumb; there's a right source/teacher waiting for you at the level you are at right now - find it!

Simultaneous code and docs: a safety net for your project

In the past I always used to either document my code badly or ignore documentation completely. The focus was to get the job done and get out. Nowadays, however, things have changed a lot, and I really only have the Sphinx project to blame :P. Documenting your code as it's being written helps in a bunch of ways, and the person who benefits from it the most is futureyou. I've thanked pastme a zillion times by now, because the current project I'm working on has grown to more than a few files, and I was quite afraid of not being able to keep track of stuff. However, this time I really decided to write NumPy-style docstrings on all functions that are more than a couple of lines long, and with a little bit of Sphinx magic and some ReadTheDocs hosting, the result is a beautiful webpage that I keep visiting myself, just because it's rewarding to see something nice.
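
For anyone curious, the 'Sphinx magic' amounts to surprisingly little configuration. A minimal sketch of the relevant part of a docs/conf.py, assuming the standard autodoc and napoleon extensions (napoleon is what teaches Sphinx to parse NumPy-style docstrings):

# docs/conf.py (minimal sketch, rest of the file as generated by sphinx-quickstart)
extensions = [
    'sphinx.ext.autodoc',    # pull documentation straight from the docstrings
    'sphinx.ext.napoleon',   # parse NumPy-style (and Google-style) docstrings
]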

The problem I tend to have with a code base that gets large enough is that I forget which functions do what and their input and output formats. There is a limit to the number of things you can keep in your head over the span of a week or a month. Writing detailed documentation with the code is useful because it helps you keep track of how things are implemented, what is required for each function, and even keep notes on the limitations of different functions/objects.

Is it worth it? Even though in the beginning there may be only five functions used repeatedly, it slowly grows to seven, ten, fifteen, and that's way too many to keep in mind, even though I'm working with the code on a daily basis. It's only once you get knee-deep into a project that you begin to realise all the details that weren't so obvious from a distance.

Writing documentation:

In the console/command line: So you've begun your coding session, and now you want to do <<coolthing>> with <<amazefunction>>, but, aargh, do you have to specify amazelevel=10 or is that a default value? And don't you vaguely remember that the value for amazelevel is actually an input for <<create_amazingness>>? So, if past you had done a good job of it, then when you type in help(amazefunction) you'd hopefully get some information.

help(amazefunction)
Help on function amazefunction in module __main__:

amazefunction(amaze_type, **kwargs)

Disappointing?..yeah. Now, however, imagine you'd invested an extra minute while actually writing this function, and thus saved futureyou a few minutes. This is what the output from help(amazefunction) could have looked like!

Help on function amazefunction in module __main__:

amazefunction(amaze_type, **kwargs)
    Creates an amazing thing. 

    Parameters
    ----------
    amaze_type : str
        The type of amazing object to be created.
        Needs to be one of these values ['unicorn', 'rakshasa', 'wolpertinger'] 

    Returns
    -------
    amazingness : dictionary
        The amazingness dictionary has keys
        'height', 'weight', 'powers' and other  optional
        descriptive features controlled by optional 
        arguments.

    See Also
    --------
    make_amazing_object
    check_amaze_type
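
And for completeness, a minimal sketch of what the function definition producing that help text might look like; the body here is entirely made up, the docstring is the point.

def amazefunction(amaze_type, **kwargs):
    """Creates an amazing thing.

    Parameters
    ----------
    amaze_type : str
        The type of amazing object to be created.
        Needs to be one of these values ['unicorn', 'rakshasa', 'wolpertinger']

    Returns
    -------
    amazingness : dictionary
        The amazingness dictionary has keys
        'height', 'weight', 'powers' and other optional
        descriptive features controlled by optional
        arguments.

    See Also
    --------
    make_amazing_object
    check_amaze_type
    """
    # hypothetical body: build a default amazing thing and add any optional features
    amazingness = {'type': amaze_type, 'height': 10, 'weight': 5, 'powers': []}
    amazingness.update(kwargs)
    return amazingness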

To package or not to package?

After having worked in a field for a few years and gotten comfortable programming in the language of your choice, you slowly begin to realise that your work essentially consists of a set of analyses/tasks which are used again and again. For me, experimentally, this means having scripts ready that will initiate recordings, playbacks, and the saving of audio files for the experiments I do. In terms of simulations, it means writing a lot of code, typically based on a bunch of acoustics paradigms that are described by a bunch of equations/assumptions.

I have of course read, over and over again, that it's good practise to bundle all of your goodies into one place and keep them as a package somewhere [REFS]. But why don't I do it yet? Right now, as an end-phase graduate student, I guess it's mainly because each time I do one of these tasks, I'm not sure when I'll actually have to do it again. The effort of trying to create a common framework and plan all the basic experimental/computational tasks to be implemented doesn't pay off. Writing a package, I realise, is as much about the single tiny functions as it is about the broad common concept and the sufficiently detailed documentation around it. If I'm the only one who's going to use it, and I'm not even sure when - it's not worth the time :P.

But there are cases when I think the code I've been using regularly is worth putting a decent amount of thought into, and pushing all the way to a publicly available package. Initially I used to think it was worth the effort of making a whole package and releasing it for public use only if there was a whole community of researchers interested in it. However, this thinking changed when I had the opportunity to meet Kalle Åström, professor at the University of Lund. His group has been developing a package to automatically calculate microphone positions from sound playbacks. I had approached him to check whether the package they'd developed could be used to infer the positions of the microphone arrays I was using to track bats. The package worked for my data, and we thus began developing an integrated experimental + software workflow for field biologists to use. While talking about who might be interested in something like this, Kalle mentioned it'd be very cool even if just two or three labs were genuinely interested in using the whole workflow. This sounded like a very small number to me; two or three labs could basically just consist of four to six people (one PI + one grad student each), and back then I began wondering if this was too niche.

Nowadays, however, I've realised how niche academic research can be. While we may work on a day-to-day basis with a common set of assumptions and follow the same conceptual principles, each of us has slightly different use cases. Even if just one other person benefits from the use of a released package, it should be considered a success.

In some sense, the effect of publishing a package is a bit more tangible than publishing a paper in a journal. A package is something that can be used on a day-to-day basis, unlike a paper, which is a string of ideas put together. So should you publish your package or not? Well, the rule of thumb I'm beginning to develop is essentially: if you think there's at least one more person who might find it interesting - do it. And remember, the most likely person to find it interesting to use is this familiar-yet-unfamiliar stranger - future you.

REFS

The coding language maketh not the science, but...

When it comes to the choice of programming language for getting a task done, people can sometimes have strong opinions. I've seen tweets from authors reporting reviewers who wanted the stats/plots/something done in R, and who were dismissive of it being done in any other language. This is truly ridiculous, and this kind of attitude amounts to a weird, unhealthy kind of gate-keeping. What will come next: oh, the code is only compatible with a Windows OS, it should have been developed keeping a Unix OS in mind... or even more specific comments about which packages to use?

The point is, there is a feeling among some members of the community that certain computational tasks are well suited to being done with *insert favourite programming language*. I find this attitude absurd and take it to be a form of irrational favouritism. If a piece of code is not in a coding language I use regularly, the only thing it means is that I may not be familiar with a whole bunch of cool concepts and ideas that the authors use. It doesn't mean the work is sub-standard. This argument cuts both ways, whether the language is an open-source or a proprietary platform. If anything, if I have to read code written in an unfamiliar language and understand it - it needs to be well-documented! The user/reader needs to understand what is happening in the code irrespective of the actual for and while loops running under the hood.

Documenting code well is not a trivial task, and not something that can be done well over a couple of days. The closest task to documenting a codebase is writing a (scientific) manuscript. Things keep changing, you realise a bunch of things over a series of iterations, and even then there may be details lying around from the time you first created the manuscript file itself.

So, okay, the programming language maketh not the science. A well-documented codebase is more capable of convincing an audience of its own utility and accuracy.

So, you do *everything* in Python?

"..and that was when Mowgli knew he would forever be safe in the comforting coils of Saamp, the Python" -- Not Rudyard Kipling, 1892

"Open your eyes, look up to the skies and see" * --Queen, Not Rudyard Kipling, 1975 *

Every now and then at a conference, someone might ask me what programming language all this stuff was done in. When I say Python, there are typically two reactions. One typical response is 'That's cool, I too use Python, which libraries...', and as you can imagine, the conversation takes a very detailed trajectory. The other typical response is 'I see, yeah, everyone's using Python nowadays. My supervisor/whole lab uses *insert other language*, and so I'm kind of stuck at the moment'. My response to this statement is to strongly urge the person to switch. Now is the best time to switch, save yourself time later, etc. I probably (definitely) end up sounding like a weird mix of concerned parent and preacher.

Before I begin talking about the benefits of switching to Python, I will make it clear upfront that the computer language maketh not the science. Good code remains good code irrespective of the language of choice (and so does the bad code..). The language of choice, however, strongly shapes the kinds of techniques and attitudes to coding you pick up (which is a post for another time).

So, why should you switch to Python as early as possible?
Three strong reasons:

* 1) It's free for me, my collaborators, and anyone else in the world to download and use
* 2) The supporting packages I need to do my science are typically also free to download and use
* 3) I can write a piece of code, share it, and anyone can use it!
It's free (a la MJ, for me, and for you and the entire human race)

The fact that you can install an open-source computing platform anywhere makes it extremely portable. I can download Python on my personal laptop, go to a field site in another part of the world, and download it on a Raspberry Pi there! Working with commercially licensed platforms can be a genuine pain. You do not want to waste precious research hours going online and trying to get a license validated and authorised each time you switch locations, devices or labs. Moreover, most licenses are given to the research institution/lab using them - and not to individuals. What happens when you leave? What if you'd actually like to work with the latest release of the language - does that need an extra license? (Here I complain from personal experience.)

Open your eyes, and see - there's a rich scientific package ecosystem around you!

One common comment I've heard is 'oh, but there are no good packages for *insert niche task*'. Whenever someone says that, it must be taken with a big handful of salt. Statements like 'oh, but *insert other language* has such cool packages for audio input-output and signal processing' are IMHO the result of misunderstanding what to expect from a language's package ecosystem. Even established scientists have said to me, 'Ah, Python, yeah - there aren't any good packages to do X'. Since all of us work on niche topics, we will have niche requirements. Not even proprietary computing languages have niche packages to do specific things like processing black-hole data or detecting bat calls in recordings. What any computing platform can provide is an ecosystem to handle the variety of tasks needed on a day-to-day basis (data I/O, instrument interfacing, data pipelines, statistics, visualisation). I do not realistically expect or require my computing language to have inbuilt packages to solve my niche scientific needs. I can only expect that the bricks are available!

So... be warned, Python's ecosystem is pretty damn good, and it's only getting better with time. In the recent past, I've been using only Python, even for experimental work involving speaker playbacks, microphone recordings, signal analysis, visualisation, statistics $^{1}$ and, the cherry on the cake, writing it all up in pretty Jupyter notebooks. The open package ecosystem means I essentially have a wide variety of packages to learn and try new techniques with over time. Contrast this to constantly thinking about which packages you can actually afford ('..I'd love to try out machine learning, but it costs *this much*, and I'd also need the other cool package along with that').
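
As a rough sketch of what such an all-Python experimental session can look like (the signal parameters and device settings below are invented for illustration, not taken from any real experiment):

import numpy as np
import sounddevice as sd
from scipy.signal import chirp

fs = 192000                                   # hypothetical sampling rate, Hz
t = np.linspace(0, 0.005, int(fs * 0.005))
playback_sweep = chirp(t, f0=90000, t1=t[-1], f1=20000)   # a made-up downward sweep

# play the sweep through the speaker and record from the microphone simultaneously
recording = sd.playrec(playback_sweep, samplerate=fs, channels=1)
sd.wait()                                     # block until playback + recording are done

np.save('playback_recording.npy', recording)  # or, better, write to a universal format!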

Write, share, repeat

I write rather niche code most of the time. There are times when I will take the effort of putting it up online and sharing it: 1) as part of a paper submission (as a DOI on a public repository), or 2) when I think the code might be of interest to others, in which case I make it into a Python package and release it. I can confidently put up my code somewhere, knowing fully well that anyone who would like to try it can step in without thinking twice about the costs - and can keep up with any required upgrades or language version changes. This may not be such an issue when you work at a rather established institute/university. I'm thus constantly surprised when I see introductory scientific computing courses being taught with proprietary computing languages.

$^{1}$ The packages I use for these tasks are: scipy, sounddevice, pandas, matplotlib, statsmodels

What you *could* do, but you shouldn't

"What gets you into trouble ain't what you don't know, but what you think you do, but ain't so" - Anon

This post is about how good software, like an honest person, should know the limits of its competence before taking up a task. We (myself included) often receive or write code to get things working, and then proceed to keep using it, or eventually share it with our lab mates. Like anything in the course of life, the use-cases for the software may change. A new project comes up, or someone (Person X) legitimately decides it'll be cool to try out this new experiment and process the data with this awesome codebase. Person X now triumphantly sits in front of the computer... and waits for the code to run through with anticipation... and after a few seconds of anticipation, out come the plots!!

Of course, there's a twist to the story, in that the results actually seem okay in the beginning, until Person X notices a bunch of weird details. Factor Z, a universal constant, is about 1.5 times larger than it should be - and this is too much. But the diagnostic plots look fine, and the code ran!! What is happening??

Many agonising hours later, Person X finally figures out that Awesome Codebase has been a somewhat dishonest person. How is Awesome Codebase like a dishonest person, you ask? Well, Person X eventually figured out that Awesome Codebase was basically written to handle the analysis of Cool Experiments done to understand Factor Z using five or more Z-probes. What are Z-probes? They are the devices that measure the value of the Z-factor. Awesome Codebase can handle a wide variety of Z-probes, and that's why it's quite famous in its community. However, it turns out Person X wanted to replicate the experiments under the simplest possible conditions. Person X's experiment involves measurements with three Z-probes, instead of the standard five or seven. In principle, Person X thinks, the only reason people haven't done it before is because they were being extra careful in the early days - who needs five Z-probes anyway??!! But no, days later, Person X finds out that Awesome Codebase really wasn't built to handle three Z-probes at all! It is a logical impossibility given the equations and Science in the field of Z-factor studies. It turns out all Z-factor experiments have been done only with a prime number of Z-probes, ranging from 5 to 29, but no-one had really thought of going below 5. If it is so impossible, why didn't Awesome Codebase throw an error and stop everything dead in its tracks?

Yes, the analogy isn't clear yet. Awesome Codebase has been like someone who nods silently but enthusiastically to the question 'Can you speak German? The people at the dinner we're going to don't speak much English.' And then, when you meet with Awesome Codebase for dinner, Awesome Codebase says 'Gruesse Sie', 'Danke', 'Salz' in a burst, and then settles down quietly for the rest of the evening. Yes, you could argue this person does know some German - but the question is, can they handle a conversation? Do they understand what is happening around them? Why are they not saying anything? Do they know that saying these three things in a continuous string may be grammatically correct but socially unusual?

The biggest question of course remains: why didn't my acquaintance just say they don't speak German well enough? You could have then made other plans. And so, by analogy, if Awesome Codebase was not built to handle fewer than five Z-probes - it should have said so out loud, thrown a warning, or even better, thrown a nasty error saying exactly that: Z-probe error, this codebase cannot handle <5 Z-probes! Person X would have been happy and moved on to find another equally awesome codebase, and Awesome Codebase wouldn't have been dishonest.

This is not a rant about how codebases are limited in their nature; it is a rant about the lack of documentation and clear communication to the user. All codebases are limited in their capabilities and are constrained by their historically envisioned use-cases. The use-cases will naturally change, but if the codebase's capabilities don't match the requirements of the use-case - this must be easily detectable! The capabilities and limitations of a codebase must be clearly highlighted in the form of a README or user manual. And if even these are not heeded or are unclear, the true weapon to prevent misuse is throwing a clear error.
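
In Python, throwing that clear error is a one-line habit. A minimal sketch (the Z-probes and the five-probe limit are of course from the made-up story above):

def analyse_z_factor(probe_recordings):
    """Hypothetical entry point of Awesome Codebase."""
    n_probes = len(probe_recordings)
    if n_probes < 5:
        # fail loudly, with the limitation spelled out, instead of silently mis-analysing
        raise ValueError(f'Z-probe error: this codebase cannot handle <5 Z-probes '
                         f'(got {n_probes}). See the README for supported setups.')
    # ... the actual Z-factor analysis would follow here ...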