Better detection of metadata from uploaded PDFs


#1

I’m uploading a large number of PDFs, many of which do not have metadata legible to Paperpile. It’s irritating to have to add the information manually. Two possible ways to make this process easier: (1) improve auto-detection of metadata so that I can fill in, say, author, date, and title and then have the other fields populate automatically, and/or (2) allow manual merging so that I can merge the incomplete entry (the one with the PDF attached) with an entry pulled from a database. Pulling an entry from a database and then uploading files one by one is a more time-consuming (and thus inferior) solution. Thanks!


#2

Realized that suggestion #1 is working for some entries. So thanks for that!


#3

That’s called “Auto-update” and is already available.

If Paperpile does not even get the title or authors, you can add those and then try autocomplete, either from the context menu of an item or, for multiple items, from the toolbar menu.


#4

The problem I found is that the information for a simple, fully documented paper could not be found. As far as I can tell, it searches only ncbi.nlm.nih.gov and then reports: “Could not match data online.”

However, the paper has a DOI and it is present at sciencedirect.com.

See: http://www.sciencedirect.com/science/article/pii/S0098135497876284

Why is Paperpile unable to find this paper’s information?

Thanks.


#5

Hi,

There are multiple issues with this paper. I’ll try to explain what Paperpile does to retrieve the information:

  1. It parses the PDF for author, title, and identifiers like DOIs. In this case it can retrieve the title and the authors, but no DOI. It is a scanned paper and the DOI is nowhere in the PDF, not even in the metadata.

  2. It queries online resources. First it tries PubMed, then it uses CrossRef (http://search.crossref.org/). I just tried it on my machine and it finds the paper on CrossRef. It might be that the CrossRef search took too long and timed out, so it did not yield any result in your case.

  3. However, the information Elsevier put into CrossRef is wrong (http://search.crossref.org/?q=10.1016%2FS0098-1354(97)00175-0). The DOI Elsevier put there leads to a ScienceDirect error, so Paperpile cannot retrieve the better publication information from there.

It is a chain of errors that leads to Paperpile not recovering the correct information.
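
To illustrate step 1, here is a minimal sketch of how a DOI might be pulled out of text extracted from a PDF. It is not Paperpile's actual parser: the function name `find_doi` is hypothetical, and the pattern is an assumption based on the common shape of modern DOIs, so it will miss some older forms. For a scanned paper with no text layer, as in this case, there is nothing to match and the function returns `None`.

```python
import re

# Assumed pattern for modern DOIs: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text):
    """Return the first DOI-like string found in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Drop sentence punctuation that the suffix character class may have swallowed.
    return match.group(0).rstrip(".,;:")
```

Calling `find_doi` on a string that contains the DOI mentioned in step 3 returns that DOI unchanged, including the parentheses in its suffix.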


#6

Hi,

I may be overlooking something. But out of curiosity, why not have the search also include Google Scholar? At least in my field, many older articles that do not show up in a PubMed search can be found with Google Scholar. Also, for me at least, it would be preferable to update the article with some information (title, authors, year, journal) instead of nothing when the DOI itself cannot be identified.

-C


#7

Paperpile does that already. For a regular paper it tries the following sources, in order:

PubMed > Links to publisher’s page (if any) > CrossRef > Google Scholar > Google Books

Google Scholar is rate limited. Too many queries will cause a temporary block of your IP address, which is why we only use it if the other sources fail.

Do you have an example of a paper that Paperpile cannot match?
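
The ordering can be pictured as a simple fallback chain. The sketch below is only an illustration under stated assumptions, not Paperpile's code: `first_match` and the three stub lookups are hypothetical, and a real lookup would call the respective service. The rate-limited source sits last, so it is queried only when everything before it returns nothing or fails.

```python
def first_match(query, sources):
    """Try each (name, lookup) pair in order and return (name, record) for the first hit."""
    for name, lookup in sources:
        try:
            record = lookup(query)
        except Exception:
            # In this sketch a timeout or a temporary block simply counts as "no match".
            continue
        if record:
            return name, record
    return None, None


# Hypothetical stand-ins for the real lookups.
def pubmed(query):
    return None  # nothing found


def crossref(query):
    raise TimeoutError("search took too long")


def google_scholar(query):
    return {"title": query}


# Cheapest and least restricted sources first; the rate-limited one last.
SOURCES = [
    ("PubMed", pubmed),
    ("CrossRef", crossref),
    ("Google Scholar", google_scholar),
]
```

With these stubs, `first_match("some title", SOURCES)` falls through PubMed and the timed-out CrossRef call and returns the Google Scholar record.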

Best,
Andreas


#8

I understand the need for standardisation with the auto-update function, but would it be possible to add a final stage that reads in metadata from the PDF itself?

I deal mainly with non-academic PDFs, and being able to import metadata from the PDF such as CreationDate (to map to Date published) or Author (to Author) would be quite useful. It could also help with importing academic papers that are not found in the databases.
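
As a rough sketch of what such a mapping could look like: PDF dates are normally stored as strings like `D:YYYYMMDDHHmmSS` followed by an optional time zone, so CreationDate needs a little parsing before it can serve as a publication date. The code below assumes the document information dictionary has already been read into a plain Python dict with keys `Author` and `CreationDate`; the function names and the output field names are hypothetical, and this is not an existing Paperpile feature.

```python
import re
from datetime import date

# Year is required; month and day are optional and default to 1.
PDF_DATE = re.compile(r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?")


def parse_pdf_date(value):
    """Parse a PDF date string such as "D:20140312093000+01'00'" into a date, or None."""
    match = PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day = match.groups()
    try:
        return date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        # Out-of-range month or day: treat the value as unusable.
        return None


def map_info_to_fields(info):
    """Map PDF info entries to reference fields, skipping anything empty or invalid."""
    fields = {}
    author = (info.get("Author") or "").strip()
    if author:
        fields["author"] = author
    published = parse_pdf_date(info.get("CreationDate") or "")
    if published is not None:
        fields["date"] = published.isoformat()
    return fields
```

Because CreationDate records when the file was produced, which is not necessarily when the document was published, a mapping like this is best treated as a last-resort guess that the user can correct.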


#9

The problem could be solved, to the extent it is possible, by allowing us to add our own Paperpile-retrieved metadata to imported PDFs. For instance, I can upload a PDF and not have it find the correct metadata… but if I go to “Add Papers” and search for the same title, I can find the metadata/reference corresponding to the PDF I have. If the auto-update feature, once its other measures have failed, could then simply let me choose as I would if I did a search in Add Papers, the problem would be solved.

Right now I have to DELETE the PDF I manually dropped in, then go to Add Papers, do a search, then ATTACH my own PDF. Highly inefficient, but thankfully, it seems, easy to solve?