I have run into a lot of references in papers to software that was used to support various research efforts, particularly in the area of healthcare research. The idea behind this sharing is admirable: supporting reproducible research and clearly documenting the methods. In a number of papers, the tool with its cute, acronym-formed name boldly in the title is the core component of the publication: ‘Look!’ it cries, ‘I’ve made this software available with you in mind! Please use it!’ A good example is one I recently came across here for some work I was doing. The publication clearly states that ‘ScAN and ScANER are publicly available’ — and while there is a Github repository, there is no real documentation and the code doesn’t really do anything. Some key components of ScAN and ScANER were omitted.
Others are released to meet the requirements of publishers, who increasingly want better visibility into the methods, particularly when a paper relies on natural language processing (NLP) or some other customizable solution. These methods can often be formulated cleanly in a paper, yet that masks the reality of a bunch of odds and ends awkwardly glued together. While there are a number of more recent examples, I think my favorite is this repository, which states outright that the code is being shared at the request of the ‘scientific report editor’. The code reveals the arbitrary nature of the NLP, and the authors made no effort to allow someone else to replicate their work.
It has come to the point that an investigator I’m working with will trust the flowery headlines proclaiming that some brilliant minds at such-and-such institute have flawlessly solved the problem. No, they usually haven’t. Further, it will take me a while to review their code, probably longer to get it to run, and there is always some key piece which has been left out. Reaching out to the authors typically results in silence or a brief reply that said author is on to bigger and better things. In the wake of these experiences, we confront the problem of citationware: tools and code that ‘worked for the project’ and are ‘shared/cited in the publication’, but that no human will ever get running again. I very much doubt the original programmer could solve these problems. (If you’re unfamiliar with the suffix -ware, look at ransomware, vapourware, and abandonware.)
The origins of this problem, as I see it, are manifold:
Conflict between needs for publications and tool quality
Unlike a software company (including app developers), which has an inherent interest in releasing a high-quality tool to facilitate customer adoption (though incentives might need to be re-evaluated in terms of long-term maintenance…), scientific projects lack any incentive to release their code or to do so in a way which permits usage by other entities. The cycle follows a strict process of obtaining grants, doing the work in order to produce publications, and then using those publications to solicit additional grant funds. The products are publications, not code.
Further, there is a set of problematic assumptions among investigators who over-value the scientific manuscript, particularly in areas with slightly more ‘artsy’ processes like NLP. I have been on a project where one investigator has repeatedly emphasized using a set of validated measures from the literature on identifying ‘safety plans’ for individuals with suicidal ideation. Unfortunately, this doesn’t work. The literature produces so-called validated measures, but these have only been validated at a single site and, in all likelihood, do not scale. As anyone can attest who tries to implement the code released with these ‘validated algorithms’, or even to follow the steps that are outlined in the manuscript methods, the methods are incomplete and the code won’t work. In all likelihood (and I speak from a significant amount of experience here), even getting the code to run after days of labour will result in poor performance — I could outdo these with a couple of regular expressions and an hour of work.
Regarding the safety plan NLP, the paper (which is conveniently behind a paywall) promises to release ‘all of the programs and tools’ on their Github page, which (surprise!) has some waivers and IRB approvals, but nothing remotely usable. Even after obtaining the code (which is site-specific and would not run at any other location), it would have been easier to start fresh.
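To make the ‘couple of regular expressions’ remark concrete, here is a minimal sketch in Python. The patterns and the example notes are invented for illustration; they are not the published algorithm, nor anything I would claim is validated.

```python
import re

# Hypothetical patterns for flagging possible safety-plan documentation in
# clinical note text. Invented for illustration only; not a validated measure.
SAFETY_PLAN_PATTERNS = [
    re.compile(r"\bsafety\s+plan(?:ning)?\b", re.IGNORECASE),
    re.compile(r"\bcrisis\s+(?:response\s+)?plan\b", re.IGNORECASE),
]

def mentions_safety_plan(note_text: str) -> bool:
    """Return True if any pattern matches the note text."""
    return any(p.search(note_text) for p in SAFETY_PLAN_PATTERNS)

if __name__ == "__main__":
    examples = [
        "Reviewed and updated the patient's safety plan with warning signs.",
        "Patient denies suicidal ideation; no plan discussed.",
    ]
    for note in examples:
        print(mentions_safety_plan(note), "-", note)
```

Crude, yes, but at least it runs, and anyone can see exactly what it does.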
Inability of reviewers to evaluate the software
I suspect there’s another major problem as well, and that has to do with the expertise of those within the peer-review process. For those outside of this lifecycle, the peer-review process exists to ensure that scientific manuscripts are critiqued: their methods questioned and put ‘through the wringer’, the results perspicaciously evaluated, and any discussion points considered for limitations and logical fallacies. While this process has a number of issues (e.g., incompetent reviewers, authors lying about [or obfuscating] their methods, reviewers focusing on the wrong things [typos, not methodology], ideologically-bent journals, and reviewers wanting to place their mark on the manuscript), it generally works quite well. However, not many experts want to spend time reviewing articles (they don’t get compensated for it), so many journals are rather desperate — they even reach out to me! And those experts are incredibly unlikely to have any familiarity with code posted on a website — I’d bet that most don’t even check the link if one is provided.
I do tell the investigators I work with to forward me the links to code in any manuscripts they review, so that I can at least tell them whether the code is ‘runnable’ and/or well-documented. Unfortunately, no one has taken me up on the offer (I’m also not sure if it might be verboten for the reviewer to share…). Anyway, even if it is verboten, ask the editor if they can have a programmer evaluate the Github repository (or other shared code). I will extend the invitation to anyone reading this: if you are reviewing a publication that links code, either ensure that it runs yourself or forward it to a programmer (myself included) to evaluate.
Programmers as data specialists, not ‘coders’
At many research institutes, the programmers are not computer scientists. They do not understand or question the inner workings of their programs — for them, it is a tool. They have a task (e.g., get X data and prepare it for sharing/discussion at our next meeting), which they implement using the programming language (i.e., the tool). Their typical expertise is in the data. (In contrast, I am the rare bird who is much less interested in the data and more in the methodology.)
Well, what’s the problem with that? For the purposes of the project work, there’s nothing really wrong. The programmer collects the data, generates a few pretty graphs, and a paper is published. The problem is with the replication or dissemination of this research. Here, instead of modular code with a clear separation between code and data (e.g., config files to point at the data or alter parameters), it’s a mess. There’s no version control; code is commented out not because it shouldn’t run but because the programmer wanted to run it in steps; and so on.
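For contrast, here is a minimal sketch of what that code/data separation can look like, assuming a hypothetical config.ini sitting next to the script; the section and key names are invented for illustration.

```python
from configparser import ConfigParser
from pathlib import Path

# Hypothetical config.ini, kept out of the repository (or shared as
# config.example.ini with dummy paths):
#
#   [data]
#   input_csv = /path/to/site_specific/notes.csv
#
#   [params]
#   min_note_length = 50

def load_settings(config_path: str = "config.ini") -> dict:
    """Read data locations and parameters from a config file."""
    parser = ConfigParser()
    parser.read(config_path)
    return {
        "input_csv": Path(parser.get("data", "input_csv")),
        "min_note_length": parser.getint("params", "min_note_length"),
    }

if __name__ == "__main__":
    settings = load_settings()
    # Downstream code touches only `settings`, never a hard-coded local path,
    # so another site can rerun the analysis by editing config.ini alone.
    print(settings)
```

Nothing fancy, but it means the script a collaborator downloads does not have my filesystem baked into it.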
If you step into their shoes, imagine going through the iterations with all the local filesystem paths and modifications, commenting and uncommenting whatever you want to run…then, you suddenly expect them to share this thing!? On Github? They don’t really understand version control (except script1.sas, script1_final.sas, and script1_final2.sas) and associate git with the zombie apocalypse. Oh, and they have no paid time left on the project to get it up there. How hard, thinks the investigator, is it to move some files to a website?
Well, dear investigator, if the code hasn’t been prepared for sharing, hasn’t been version controlled, etc., give it at least a week of paid time to do it properly, or don’t share it. Don’t add to the bloat of Github citationware. Because, with a few days of hard work, you could actually make it…and the most important thing to recall is that just because it’s been uploaded doesn’t mean you’re done.
Make sure the code you share works
You need to ‘test’ that the shared code actually works when the instructions are followed. Too often, the effort of uploading your code to somewhere like Github is saved until the very end of the project and never tested. Yet this is easy to test: create a new directory somewhere, pull your project, follow the instructions, and run it. Did it work? Probably not — so what did you forget? What bit of code, dependency, or instruction was omitted? Fix it and try again. You have to prove that someone else can use it; otherwise, don’t share it. You have to start from the beginning and deploy/run your code. I am almost certain that no one else has ever done this with published code… (A rough sketch of automating this check follows the steps below.)
Here are the steps:
- Create a new folder/directory on your file system.
- Clone/download the code.
- Follow the steps and only the steps in your README.md (or other referenced documentation).
- Whenever you find something that was missed or unclear, fix it (in the documentation or code).
- Repeat (or, even better, ask another programmer to deploy it) until everything works.
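If you want to make this routine, the checklist can be wrapped in a small script. Here is a minimal sketch in Python, assuming git is installed; the repository URL and the run command are placeholders for whatever your README actually tells a stranger to do.

```python
import subprocess
import tempfile
from pathlib import Path

# Placeholders: substitute your own repository and the exact command(s)
# documented in your README.
REPO_URL = "https://github.com/your-org/your-project.git"
README_COMMAND = ["python", "run_pipeline.py", "--config", "config.example.ini"]

def smoke_test() -> int:
    """Clone into a clean temporary directory and run the documented entry point."""
    with tempfile.TemporaryDirectory() as tmp:
        clone_dir = Path(tmp) / "clean-clone"
        subprocess.run(["git", "clone", REPO_URL, str(clone_dir)], check=True)
        # If this step fails, something is missing from the repo or the README.
        result = subprocess.run(README_COMMAND, cwd=clone_dir)
        return result.returncode

if __name__ == "__main__":
    raise SystemExit(smoke_test())
```

If that exits non-zero on a clean machine, it will for your readers too.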
Additionally, don’t ‘refactor’ your code (i.e., rewrite the logic or alter the structure) unless you actually re-run it on your data. The worst code I ever see (sorry to pick on this repo again; recency bias) so often has the git commit message of ‘refactoring…’. When I re-wrote the ScAN package to actually work, I had to go back to the original commit prior to this ‘refactoring’.
Also, rather than saving the move to Github for the end of a project, get the code up earlier in the process. Design it from the beginning to be shareable. Create documentation from the start.
And, moreover, …
For all of these reasons, I don’t expect any change in the cycle. I am reminded of submitting my most recent biosketch for a grant application. My first-authored publications are pretty slim, but my code is generally clean and well-documented (godlike levels of both in the healthcare research space I inhabit, which is not intended as a self-compliment…I have a long way to go). Of the twelve spaces for citations, I included only a single reference to my repositories. Why? Because publications are the only currency.