Let's talk preprints

gjuggler · May 9, 2025, 7:05pm

Here’s one of those half-baked product ideas I was talking about in my intro post.

I’d love to hear any feedback or critiques of my logic!

tl;dr

I’m proposing Paperpile should handle preprints by creating a strong distinction between preprint and published articles, and aggressively inter-linking between them where possible.

Background

We know that preprints are a challenge for literature management, and Paperpile is far from perfect here (see this, that, or the other thread… among many others). There was even a period many years back when some major publishers wouldn’t accept papers that had been initially published as preprints (though the vast majority of publishers are now unrestricted in this regard – much credit to bioRxiv for pushing the field forward here).

I’ve spent a bit of time reviewing discussions from the forum and picking the brains of others on Team Paperpile about the topic of preprints. Two takeaways struck me:

It’s clear that the handling of preprints in Paperpile is far from ideal: the behavior is somewhat undefined, certainly not documented, and often surprising to users.
There doesn’t seem to be a single, well-understood, consensus description of how preprints should be handled in a reference manager like Paperpile!

This sort of duality always gets me excited as a product developer, because it means there’s an opportunity to corral complexity and ambiguity into a clear, and ideally simple, concept.

So, after thinking about this for more than a minute but less than an hour (I promised half-baked, right?) here’s where I landed. I’ll start with a few principles, then the product concept, and at least one edge case to consider. I’d love to hear what you think.

Product principles:

Each Paperpile reference should correspond roughly to an independently-citable “thing”. It follows that a preprint manuscript and its subsequently published journal article should be treated as separate “things”. Metadata leakage or deduplication between the two should be strictly avoided (wherever possible).
Paperpile aspires to be the best tool for quickly and automatically fetching full-text PDFs of articles for later reading, note-taking, etc. (It’s often the case that an article will be paywalled, but its corresponding preprint manuscript is openly available. So that’s something we’ll need to account for.)
Researchers often make use of links between papers. Citations are an obvious one, but other types of links exist (e.g. reference A is a commentary on reference B, or contributed chapter X is a component of book Y). In my opinion, Paperpile could do more to leverage and surface these links between “things”.

Following from that, the product concept has two straightforward components:

Create a strong firewall between preprint manuscripts and published journal articles: metadata, PDF links, and deduplication activities should never cross between the preprint manuscript and the journal article version of a paper.
Collect and surface cross-links between preprints and journal articles to guide users. Concretely:
- When a user has a preprint manuscript in their library and we know that a corresponding journal article is available, Paperpile should surface this and make it easy to import the journal article.
- Likewise: when a journal article is in a user’s library and a preprint version is available, Paperpile might surface this to the user. (This flow is more meaningful in cases where the journal PDF is unavailable; see the edge case below.)
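
To make the two components concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `Reference`, `Kind`, `link_versions`, and `suggestions` are names made up for illustration, not Paperpile's actual data model. The point is only that the two versions stay separate records and the cross-link is an ID, so no metadata or PDF can leak across.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    PREPRINT = "preprint"
    JOURNAL_ARTICLE = "journal_article"


@dataclass
class Reference:
    # Hypothetical record: one independently-citable "thing".
    ref_id: str
    kind: Kind
    title: str
    pdf_url: Optional[str] = None
    # Cross-link to the other version, stored as an ID only (nothing is shared or merged).
    linked_id: Optional[str] = None


def link_versions(preprint: Reference, article: Reference) -> None:
    """Record that two references are versions of the same work, without merging them."""
    if preprint.kind is not Kind.PREPRINT or article.kind is not Kind.JOURNAL_ARTICLE:
        raise ValueError("expected one preprint and one journal article")
    preprint.linked_id = article.ref_id
    article.linked_id = preprint.ref_id


def suggestions(library: dict[str, Reference], known: dict[str, Reference]) -> list[str]:
    """Return IDs of linked versions that are known but not yet in the user's library."""
    return [
        ref.linked_id
        for ref in library.values()
        if ref.linked_id is not None
        and ref.linked_id not in library
        and ref.linked_id in known
    ]
```

With a preprint in the library and a linked journal article known elsewhere, `suggestions` returns the article's ID so the UI could offer a one-click import. The article's own `pdf_url` is untouched by the link.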

And then there’s one major edge case to consider:

Guiding users to a preprint PDF when importing a paywalled article.
- One drawback of the above concept is that Paperpile could no longer automatically fetch a bioRxiv preprint PDF for a paywalled journal article.
- Concrete example: https://onlinelibrary.wiley.com/doi/10.1002/jez.b.23138. This article is published in JEZ-B Mol Dev Evol and is paywalled for me.
- When I import to my library, Paperpile (1) tries to fetch the publisher PDF, but that fails, so it (2) looks for the preprint PDF, finds it, and downloads it as the main PDF for this paper.
- This is convenient for me as a researcher (because what I care about most is gaining access to the paper and its contents), but this behavior would be disallowed by the strongly-firewalled product concept above.
- To mitigate this, I’d propose that Paperpile aggressively recognizes when a restricted-access PDF has a corresponding preprint available, and makes it easy for the user to import the associated preprint (as a separate reference) in one click. (This presents some UX challenges, but would result in clear and unambiguous behavior with the same helpfulness in getting a user to their paper of interest.)
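
The difference between today's fallback and the firewalled proposal can be sketched side by side. This is a simplified illustration under stated assumptions: `ImportResult`, `import_current`, and `import_firewalled` are hypothetical names, and a PDF that could not be fetched is represented as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImportResult:
    # PDF attached to the journal article reference, if any.
    main_pdf: Optional[str]
    # Preprint offered to the user as a separate, one-click import, if any.
    preprint_offer: Optional[str]


def import_current(publisher_pdf: Optional[str], preprint_pdf: Optional[str]) -> ImportResult:
    """Behavior described above: fall back to the preprint PDF as the article's main PDF."""
    return ImportResult(main_pdf=publisher_pdf or preprint_pdf, preprint_offer=None)


def import_firewalled(publisher_pdf: Optional[str], preprint_pdf: Optional[str]) -> ImportResult:
    """Proposed behavior: never attach the preprint PDF to the article; offer it instead."""
    if publisher_pdf is not None:
        return ImportResult(main_pdf=publisher_pdf, preprint_offer=None)
    return ImportResult(main_pdf=None, preprint_offer=preprint_pdf)
```

For a paywalled article with a preprint available, `import_current` attaches the preprint PDF to the article, while `import_firewalled` leaves the article without a PDF and surfaces the preprint as its own reference to import.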

Request for feedback

I’d love to hear what others think of the above—especially if you’ve been involved in past discussions around preprints, or if you have additional scenarios or user flows we should be paying attention to. It’s also great feedback if you feel like the proposal above is, or has been, obvious to you all along!

Bruce_Borkosky · May 9, 2025, 7:36pm

I dunno, Greg, it seems easy enough to me. Although the title might be the same between the two versions, the published version will have a citation to a reputable publisher (or, in the case of books, a book publisher). Preprints won’t have that, so it’s easy to distinguish the two. Or am I missing something here?

Bruce_Borkosky · May 9, 2025, 7:40pm

In my opinion, more important than this feature → I would love to see PP incorporate user-defined paywall logins. IOW, I have access to APA articles and Wiley articles, through my subscriptions to the associations. I would love to be able to enter my login information once in PP and then have PP access those articles.

Memming_Park · May 10, 2025, 9:00am

We definitely need a solution for the preprint vs journal issue. It is wasting a lot of time & more importantly emotional energy for me to deal with this.

gjuggler · May 14, 2025, 12:11pm

Thanks Bruce. Fascinating idea – kind of like Paperpile 1Password (which is an excellent product, btw – we on the team all use and love it).

My knowledge of the details may be fuzzy, but I think in principle Paperpile, with its Chrome extension, should use your browser’s active session information on websites when crawling for PDFs. I’d be curious to know if that session persistence ever helps ease the burden of accessing paid-for content (e.g. if you sign in and then import an article, can Paperpile access the paywalled PDF?).

Beyond that, there’s the EZProxy setup for institutional subscribers (which I understand doesn’t help much for folks who are individually subscribed to certain publishers).

I think I hear you on the use case. We’ll feed this into the hopper of ideas, and will see if or when there’s a good case to invest in building this out as a core product capability.

gjuggler · May 14, 2025, 12:12pm

Thanks for the feedback! I’d love to hear if you think the path forward I’m proposing (specifically, we create a strong firewall between preprints and their corresponding published journal articles) is the best approach.

Once we have conviction in the direction here, it’s just about executing the change. (Which is not trivial, but we’ve got a solid team that’s up to the task.)