SORT IT: Build a PDF Processor

Presented by

Adam Jelley, Data Scientist

About this talk

As the world becomes ever more digital, many businesses need to process documents automatically. In this webinar, we'll walk through an example end-to-end project for extracting, classifying and summarising PDF documents, and show how you can combine cutting-edge open-source technologies with your own in-house expertise and requirements to build your own PDF Processor with Dataiku DSS.

Resources:

PDF2Image (https://pypi.org/project/pdf2image/)
Tesseract OCR (https://tesseract-ocr.github.io/tessdoc/Home.html)
Pytesseract (https://pypi.org/project/pytesseract/#description)
The Plugin Store (https://www.dataiku.com/product/plugins/)
The Text Summarisation Plugin (https://www.dataiku.com/product/plugins/text-summarization/)
scikit-learn 20 Newsgroups Dataset (https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html#)
"Surprising Findings in Document Classification" (https://towardsdatascience.com/surprising-findings-in-document-classification-7a79e30f1666)
Webinar (tomorrow): How to Reduce Data Labelling Costs (+ Increase Data Quality) With Active Learning (https://www.brighttalk.com/webcast/17108/394533?utm_campaign=channel-feed&utm_source=brighttalk-portal&utm_medium=web)
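For a sense of what the extraction step might look like, here is a minimal Python sketch using PDF2Image and Pytesseract from the links in this description. It is not the code shown in the webinar: the function names `pdf_to_text` and `normalise_text`, the 300 DPI setting and the English language code are all assumptions chosen for illustration, and the sketch assumes the Poppler and Tesseract binaries are installed on the machine.

```python
import re


def normalise_text(raw: str) -> str:
    """Tidy OCR output (hypothetical helper, not part of any library)."""
    # Rejoin words that were hyphenated across a line break.
    joined = re.sub(r"-\n(?=\w)", "", raw)
    # Collapse runs of whitespace, including newlines, into single spaces.
    return re.sub(r"\s+", " ", joined).strip()


def pdf_to_text(pdf_path: str, dpi: int = 300, lang: str = "eng") -> str:
    """Render each PDF page to an image, OCR it, and return the cleaned text.

    Assumes pdf2image (with Poppler) and pytesseract (with Tesseract) are
    installed. They are imported here so the helper above works without them.
    """
    from pdf2image import convert_from_path
    import pytesseract

    # convert_from_path returns one PIL image per page.
    pages = convert_from_path(pdf_path, dpi=dpi)
    # OCR each page image into a string.
    page_texts = [pytesseract.image_to_string(page, lang=lang) for page in pages]
    return normalise_text("\n".join(page_texts))
```

Calling `pdf_to_text("report.pdf")` on a hypothetical file would return a single cleaned string, which could then be passed on to a classifier or a summarisation step.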

Dataiku is the platform for Everyday AI, enabling data experts and domain experts to work together to build data into their daily operations, from advanced analytics to Generative AI. Together, they design, develop and deploy new AI capabilities, at all scales and in all industries. Organizations that use Dataiku enable their people to be extraordinary, creating the AI that will power their company into the future. More than 600 companies worldwide use Dataiku, driving diverse use cases from predictive maintenance and supply chain optimization, to quality control in precision engineering, to marketing optimization, Generative AI use cases, and everything in between.