Support for PDF Custom Metadata

KB_wydna · May 5, 2021, 7:27pm

A bit disappointing to find that, like so many other reference managers, Paperpile doesn’t bother with PDF metadata.

The PDF file format supports custom XMP metadata at a very granular level. Each page of a PDF can have its own custom metadata, and annotations can have their own as well.

Why do so many people have the attitude that PDF metadata is beneath them?

I used Calibre to embed custom metadata into the files for my entire digital library, including metadata for Library of Congress Catalog Numbers, Dewey Decimal Numbers, etc.

Decimal classification systems like Dewey Decimal or Universal Decimal Classification are semantically rich, meaning you can derive topical tags directly from the numbers. Paul Otlet’s Universal Decimal Classification even supports operators to build compound classifications. The first standard for UDC, the ‘Handbook to the Universal Bibliographic Repertory,’ was published in 1907! Yet more than one hundred years later, just handling PDF metadata tags is a bridge too far.
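To make "derive topical tags directly from the numbers" concrete, here is a minimal sketch in Python. The ten main classes are the standard top-level Dewey classes; the division table is a tiny illustrative subset, and the function name `dewey_tags` is made up for this example rather than taken from any library.

```python
# Top-level Dewey Decimal classes, keyed by the hundreds digit.
DEWEY_MAIN_CLASSES = {
    "0": "Computer science, information & general works",
    "1": "Philosophy & psychology",
    "2": "Religion",
    "3": "Social sciences",
    "4": "Language",
    "5": "Science",
    "6": "Technology",
    "7": "Arts & recreation",
    "8": "Literature",
    "9": "History & geography",
}

# A few second-level divisions, for illustration only (far from complete).
DEWEY_DIVISIONS = {
    "02": "Library & information sciences",
    "51": "Mathematics",
    "53": "Physics",
}


def dewey_tags(number):
    """Return broad-to-narrow topical tags for a Dewey number like '025.431'."""
    integer_part = number.strip().split(".")[0]
    if len(integer_part) != 3 or not integer_part.isdigit():
        raise ValueError(f"not a Dewey number: {number!r}")
    # The first digit alone already identifies the main class.
    tags = [DEWEY_MAIN_CLASSES[integer_part[0]]]
    # The first two digits narrow it to a division, when we know it.
    division = DEWEY_DIVISIONS.get(integer_part[:2])
    if division is not None:
        tags.append(division)
    return tags


print(dewey_tags("512.5"))
```

A fuller version would load the complete division and section tables, but even this prefix lookup shows why the number itself works as a hierarchy of tags.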

Classification of information is the most important thing in the world. It is the world. From Conrad Gessner’s Bibliotheca Universalis of 1545 to the present, library science and information management have been central problems in philosophy, political administration, etc. Think of Francis Bacon, Samuel Hartlib, Comenius’s Pansophism, Petrus Ramus, Leibniz, the Encyclopedia of Diderot, Hegel.

That’s the level to which I hope you guys are aspiring. Reference management software is hopelessly backwards across the board. I like Paperpile, so I hope you guys step it up and make something worthy of Leibniz and Bacon.

A great way to start is by implementing PDF metadata. It’s a real shame I have complete metadata in all my documents and Paperpile refuses it. It’s all right there. My documents hold out their bibliographic information with open hands and Paperpile says “No thanks”. Nor does it seem that Paperpile saves metadata back into the papers. When I take the papers synced to Google Drive and open them in Acrobat DC, it looks like Paperpile just doesn’t touch whatever metadata might be in there. So I see some papers that use PRISM custom metadata fields, others that don’t. Some papers have a DOI and other random information stored in the Dublin Core fields, others don’t. I guess you literally just rename the file? C’mon son.
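For anyone wondering what "it's all right there" looks like in practice, below is a rough sketch of reading Dublin Core and PRISM fields out of an XMP packet using only the Python standard library. It assumes the packet has already been pulled out of the PDF as a string (that step is not shown), the sample packet and its values are invented, and the helper names `read_xmp_fields` and `qname` are hypothetical. The PRISM namespace URI is the one commonly seen in publisher PDFs, but treat it as an assumption and check your own files.

```python
import xml.etree.ElementTree as ET

# Namespace URIs; the PRISM one is an assumption (the 2.0 "basic" namespace).
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "prism": "http://prismstandard.org/namespaces/basic/2.0/",
}
PREFIX_BY_URI = {uri: prefix for prefix, uri in NS.items()}

# Made-up sample packet standing in for one extracted from a real PDF.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">An Example Paper</rdf:li></rdf:Alt></dc:title>
      <dc:creator><rdf:Seq><rdf:li>A. Author</rdf:li><rdf:li>B. Author</rdf:li></rdf:Seq></dc:creator>
      <prism:doi>10.1234/example.doi</prism:doi>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def qname(tag):
    """Turn '{uri}local' into 'prefix:local' for known namespaces."""
    uri, _, local = tag[1:].partition("}")
    return f"{PREFIX_BY_URI.get(uri, uri)}:{local}"


def read_xmp_fields(xmp_xml):
    """Collect child properties of every rdf:Description into a dict.

    Properties written as XML attributes (XMP shorthand) are ignored here.
    """
    root = ET.fromstring(xmp_xml)
    fields = {}
    for desc in root.iter("{%s}Description" % NS["rdf"]):
        for prop in desc:
            # Array-valued properties (rdf:Alt/Seq/Bag) hold rdf:li items.
            items = [li.text for li in prop.iter("{%s}li" % NS["rdf"])]
            fields[qname(prop.tag)] = items if items else (prop.text or "").strip()
    return fields


fields = read_xmp_fields(SAMPLE_XMP)
print(fields["dc:title"], fields["dc:creator"], fields["prism:doi"])
```

The same dictionary could just as easily carry custom fields such as a Dewey or Library of Congress number, as long as the namespace they live in is known.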

For whatever reason the standard approach of reference managers is to save all reference information as a separate record from the document itself. And then the document is just ‘attached’ in some meaningless way to the record. Who wants that? I don’t. Maybe there’s some logic behind it in the insane, exploitative world of academic publishing, but personally I’m not about that. Are you?

This just wastes my time so badly. When downloading papers into Paperpile from journals using the extension, maybe 1 time in 10 it can’t get the metadata, and I have to go back and manually fuss around and add extra information until the paper is finally recognized correctly. And then when I download the PDFs, all that info is trapped in a worthless BibTeX file. So I have to export that, upload it into another reference manager, and then link that record to the PDF on the machine I downloaded the paper to. Lmao. Is that a joke? Honestly. Please, someone tell me.

You can do a lot better than this, fellas. You should be setting it up so that annotations in the PDF viewer embed metadata into the file. Users should be able to set semantic tags for different highlighter colors and embed those tags into the excerpts they’re highlighting. And then have those highlights sync in real time to a Markdown document or note for side-by-side close annotation, etc. That’s the kind of out-of-the-box thinking I hope to see from you guys.

Well, good luck. Let me know if you need any help.

vicente · May 13, 2021, 6:10pm

Thanks for this comprehensive feedback, @KB_wydna. I lack the technical knowledge to fully address the points you raise, but I do know that we parse at least the first page of all uploaded PDFs, where essential metadata is usually stored by publishers. To my knowledge we haven’t received much (or any) demand to scan documents beyond this, so we have focused our development on other areas.

A bit of my own opinion: exploitative or not, the world of academic publishing sets the standard which our software must continuously respond and adapt to in order to properly cater to our growing user base. Your impassioned arguments make sense to me and I’m sure to many others as well, so I hope that we are able to consider a more disruptive approach to metadata storage in the future. To that end, I have recorded your feedback on our internal tracker for the team’s consideration. Any +1s or further observations will be more than welcome.