Context: my father is a lawyer and therefore has a bajillion PDF files that were digitised and stored on a server. I already have an idea of how to run OCR on all of them.

But after that, how can I make them easily searchable? (Keep in mind that, unfortunately, the directory structure carries important information for classifying the files, e.g. you may have a path like clientABC/caseAV1/d.pdf.)

  • georift@piefed.social · 18 days ago

    Might be a little heavy-handed for your needs, but I’ve found paperless-ngx to be amazing.

    • First_Thunder@lemmy.zip (OP) · 18 days ago

      My problem with paperless is that it doesn’t preserve the directory structure, losing essential info.

  • lsjw96kxs@sh.itjust.works · 17 days ago

    Maybe take a look at paperless-ngx; it will take care of the OCR for you and make everything searchable. I’m just not sure it will show the path correctly.

  • VoxAliorum@lemmy.ml · 17 days ago

    Search them for words? Try pdfgrep with the recursive flag - very easy to set up and try. If you feel like that’s taking too long, you probably need to accept some simplifications/helper structures.
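
    If you want to drive it from a script instead of the shell, a minimal sketch looks something like this; it assumes pdfgrep is installed and the PDFs already carry an OCR text layer:

        # Sketch: recursive pdfgrep over the whole archive, called from Python.
        # Assumes pdfgrep is on PATH and the PDFs contain an OCR text layer.
        import subprocess
        import sys

        def search_pdfs(root: str, pattern: str) -> None:
            # -r: recurse into subdirectories, -n: print page numbers,
            # -i: case-insensitive, -H: always print the file name
            result = subprocess.run(
                ["pdfgrep", "-r", "-n", "-i", "-H", pattern, root],
                capture_output=True, text=True,
            )
            # Output lines look like "clientABC/caseAV1/d.pdf:3: ...match...",
            # so the directory structure stays visible in every hit.
            print(result.stdout, end="")

        if __name__ == "__main__":
            search_pdfs(sys.argv[1], sys.argv[2])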

  • solrize@lemmy.ml · 18 days ago

    What’s a bajillion? If the OCR output is less than a few GB, which is a heck of a lot of text (like a million pages), just grepping the files is not too bad. Maybe a second or two. Otherwise you need search software. solr.apache.org is what I’m used to, but there are tons of options.
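
    If it does get to the point of needing search software, the rough shape with Solr’s Python client (pysolr) is below; the core name and field names are placeholders and would have to match whatever schema you set up:

        # Sketch: index OCR text into a Solr core and query it via pysolr.
        # "legal_docs", "path_s" and "content_txt" are made-up names for illustration.
        import pysolr

        solr = pysolr.Solr("http://localhost:8983/solr/legal_docs", always_commit=True)

        def index_document(path: str, text: str) -> None:
            # Keep the relative path as a field so the clientABC/caseAV1/... info survives.
            solr.add([{"id": path, "path_s": path, "content_txt": text}])

        def search(query: str) -> list[str]:
            # Full-text query against the OCR text; returns matching file paths.
            return [hit["path_s"] for hit in solr.search(f"content_txt:{query}")]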

  • greyfox@lemmy.world · 16 days ago

    If you want the search to be flexible and handle things like stemming (i.e. matching words that are pluralized, etc.), you might want to put the text into an Elasticsearch database.

    You might run into problems with field length if these are long documents. A possible solution would be to put each page into its own field inside the document.

    If this is for a non-technical user to search, the Kibana interface should be relatively easy for anyone to use.
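
    A very rough sketch of that per-page idea with the official Python client follows; the index name and field names are assumptions, and the stemming only works if the text field is mapped with an English analyzer:

        # Sketch: one Elasticsearch document per page, with the file path kept as a field.
        # Assumes Elasticsearch runs locally and a "legal_docs" index exists whose
        # "text" field uses an English analyzer (for stemming/plurals).
        from elasticsearch import Elasticsearch

        es = Elasticsearch("http://localhost:9200")

        def index_pages(path: str, pages: list[str]) -> None:
            # One document per page avoids overly long fields and lets a hit point at a page.
            for page_no, text in enumerate(pages, start=1):
                es.index(
                    index="legal_docs",
                    id=f"{path}#p{page_no}",
                    document={"path": path, "page": page_no, "text": text},
                )

        def search(query: str) -> list[tuple[str, int]]:
            # The match query goes through the analyzer, so plurals etc. still match.
            resp = es.search(index="legal_docs", query={"match": {"text": query}})
            return [(h["_source"]["path"], h["_source"]["page"])
                    for h in resp["hits"]["hits"]]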

  • __hetz@sh.itjust.works · 18 days ago

    I’m a fucking dolt that dabbles and picks up the gist of things pretty quickly, but I’m no authority on anything, so “grain of salt”:

    You’re already familiar with OCR, so my naive approach (assuming consistent fields on the documents where you can nab name, case no., form type, blah blah) would be to populate a simple sqlite db with that data and the full paths to the files. But I can write very basic SQL queries, so for your pops you might then need to cobble together some sort of search form. Something for people who didn’t learn SELECT filepath FROM casedata WHERE name LIKE "%Luigi%"; by having to manually repair their Jellyfin DB one time after a plugin made a bunch of erroneous entries >:|
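
    A bare-bones version of that idea, assuming you’ve already pulled the OCR text and a couple of fields out of each file (table and column names are made up):

        # Sketch: tiny sqlite index of OCR'd files, keyed by their full path.
        import sqlite3

        conn = sqlite3.connect("casedata.db")
        conn.execute(
            """CREATE TABLE IF NOT EXISTS casedata (
                   filepath TEXT PRIMARY KEY,  -- keeps clientABC/caseAV1/... intact
                   name     TEXT,              -- e.g. client name nabbed during OCR
                   case_no  TEXT,
                   body     TEXT               -- full OCR text, for LIKE searches
               )"""
        )

        def add_document(filepath, name, case_no, body):
            conn.execute(
                "INSERT OR REPLACE INTO casedata VALUES (?, ?, ?, ?)",
                (filepath, name, case_no, body),
            )
            conn.commit()

        def find_by_name(fragment):
            # Same shape as the SELECT ... LIKE query above, wrapped for non-SQL folks.
            cur = conn.execute(
                "SELECT filepath FROM casedata WHERE name LIKE ?",
                (f"%{fragment}%",),
            )
            return [row[0] for row in cur.fetchall()]

    If plain LIKE over the full text gets slow, sqlite’s built-in FTS5 extension is the usual next step.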

  • Father_Redbeard@lemmy.ml · 17 days ago

    Would Papra work for you? I like it better than Paperless-NGX personally, which others have mentioned. But I’ll admit I’m not sure it’ll fit in your use case as I’m feeding it newly scanned documents for mine rather than existing file/folder hierarchy.