Duplicate detection

Hello - Could you give me a clue as to how the duplicate detection is supposed to work? My library is not indicating that I have any duplicates - but I’ve just found about a dozen that I have had to manually edit and then send to trash. The titles are identical (although some have varying cases) in all the ones I found, as are some of the other details such as author or journal. I have tried copying some fields from one to another to try to provoke the duplicate detector but it doesn’t seem to want to co-operate today.
As an alternative could you include some easy user driven mechanism for identifying duplicates that could then be merged in the normal manner?

Many thanks

Julia


Have you tried copying the DOI?

Thanks for the query, @jmw. As @T_Verron points out, the best way to trigger our system's duplicate detection is to add a unique identifier (like DOI or URL) to both records - it could even be a made-up number or random link, which you can delete once the records are merged.

I realize this is just a workaround; improving our duplicate detection and implementing a clearer mechanism for users to identify and merge records easily is on our wishlist. Adding your +1 to the topic as usual to raise its priority on our tracker.


Thanks both. I have used the DOI-copying trick before to get around this. The problem with that method is that, although it is effective when both records have a DOI field, I'm finding that many of my duplicates are conference papers which have somehow been set up both as a 'Book Chapter' (with no DOI or URL field) and as a 'Journal Article' (which has both). Copying the DOI from one record to the other first involves adding a new DOI field to the 'Book Chapter' one and then pasting it across. This makes it an even slower and more painful process - hence my asking which fields are used to detect duplicates, or for any easier way around it.

And yes, it is definitely Paperpile that is making these classifications when I import the papers - I rarely make any changes to the records.

Out of curiosity, does auto-updating such items without DOI and URL have any effect? If it does, that might relieve you from having to manually copy such fields over for merging.

Thanks for the input @kernel - I don’t think so, but can’t confirm this at the moment. I have been trying to deduplicate as I come across items, and I’m pretty sure that I would have tried that.
I did try to test this out earlier, but one thing that does hamper the search for other duplicates is that the only sort option available for ‘All Papers’ is by date added. @stefan - would it be possible to sort the entire library by case insensitive title?

@kernfel auto-updating items without a DOI / URL could have an effect, but it depends on too many varying factors. It's hard to predict whether it will work, but it's definitely worth a try for anybody in @jmw's position - simply selecting the references and hitting Shift+A will do the trick. Besides attempting to match on DOI, URL, title and author names, we also create a synthetic value from the metadata fields title, booktitle and author to identify duplicates.
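The exact normalization Paperpile applies when building that synthetic value isn't documented here, but the general technique can be sketched as follows - this is an illustrative outline only, assuming simple case-folding and whitespace stripping, and the field names (`title`, `booktitle`, `author`) are taken from the post above:

```python
from collections import defaultdict

def synthetic_key(record):
    """Build a rough dedup key from title, booktitle and author.

    Assumption: lowercasing plus whitespace stripping is enough to
    make case-variant duplicates collide; the real system may do more.
    """
    parts = (
        record.get("title", ""),
        record.get("booktitle", ""),
        record.get("author", ""),
    )
    return "|".join(p.strip().lower() for p in parts)

def find_duplicates(records):
    """Return groups of records that share the same synthetic key."""
    groups = defaultdict(list)
    for rec in records:
        groups[synthetic_key(rec)].append(rec)
    return [grp for grp in groups.values() if len(grp) > 1]
```

Note that under this scheme two records with different reference types (say, 'Book Chapter' vs 'Journal Article') would still be flagged as duplicates, as long as their title, booktitle and author fields agree after normalization.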

@jmw is there no ‘Title’ option under Sort in the right-hand side column like below?

@vicente - Sadly there isn't. A 'Title' sort option only appears if I include a filter such as Notes or Unsorted. If I don't select one (or if I try 'Has PDF' or 'Has no PDF'), my only sort option is 'Date Added'.

Indeed, @jmw, I should have realized you're working with a sizable library. Sorting options become limited above roughly 2,500 citations:


Any filter containing fewer than 2,500 items will enable the sort options. You can also use labels and/or folders to this end.

Oh… Thanks for that useful nugget of information @vicente. Could that limit be removed at any point? I can’t guarantee that any of the other potential duplicates that may be in there would have the same labels or be in the same folder. I’ve already discovered that some of the duplicates have different ‘Types’ assigned to them, so I can’t see any other easy way to identify them.

Failing that, could we have the ability to mark papers as duplicates ourselves, so that they could then be merged via the existing Paperpile mechanism?