Automatic Scanned Document Data Extraction OCR NER in Python

Learn to build a business card scanner app from scratch with Python, spaCy, and Pytesseract.

Welcome to the course “Automatic Scanned Document Data Extraction OCR NER in Python”!

What you’ll learn

  • Develop and train a Named Entity Recognition model.
  • Extract not only text from an image but also entities from a business card.
  • Develop a business card scanner like ABBYY from scratch.
  • High-level data preprocessing techniques for natural language problems.
  • Real-time NER apps.

Course Content

  • Introduction –> 2 lectures • 3min.
  • Project Setup –> 6 lectures • 17min.
  • Data Preprocessing –> 10 lectures • 59min.
  • Training Named Entity Model (NER) –> 11 lectures • 47min.
  • Predictions –> 13 lectures • 1hr.

Requirements

  • At least beginner-level Python.
  • Understanding of aggregation techniques with Pandas DataFrames.
  • Reading and writing images with OpenCV, and drawing rectangles on an image.

In this course you will learn how to develop a customized Named Entity Recognizer. The main idea is to extract entities from scanned documents such as invoices, business cards, shipping bills, and bills of lading. For the sake of data privacy, we restrict our examples to business cards, but you can apply the framework explained here to all kinds of financial documents. Below is the curriculum we follow to develop the project.


Section – 0: Setting Up Project

  1. Install Python
  2. Install Dependencies
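
As a quick sanity check after the setup steps above, the sketch below reports which libraries are importable. The module list is an assumption based on the tools named in this description (on PyPI these are published as pytesseract, spacy, opencv-python, and pandas); the course's actual dependency list may differ. Note that pytesseract is only a wrapper, so the Tesseract OCR program itself must be installed separately.

```python
import importlib.util
import shutil

# Import names are assumptions drawn from the tools named in the course
# description; the real requirements file may list more or fewer packages.
REQUIRED_MODULES = ["pytesseract", "spacy", "cv2", "pandas"]


def check_environment(modules=REQUIRED_MODULES):
    """Return a dict mapping each dependency name to True if it is available."""
    status = {
        name: importlib.util.find_spec(name) is not None for name in modules
    }
    # pytesseract calls the external `tesseract` executable, so look for it on PATH.
    status["tesseract (binary)"] = shutil.which("tesseract") is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:20s} {'found' if ok else 'MISSING'}")
```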

Section – 1: Data Preprocessing

  1. Gather Images
  2. Overview of Pytesseract
  3. Extract Text from All Images
  4. Clean and Prepare Text
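
The preprocessing steps above might look roughly like the following sketch. It uses `pytesseract.image_to_data` to get one row per detected word together with its bounding box, then cleans each token. The helper names (`clean_text`, `rows_from_ocr_dict`, `ocr_image`) and the set of punctuation to keep are hypothetical choices for illustration, not necessarily what the course uses.

```python
import string

# Assumption: these characters are kept because they occur in emails, URLs,
# and phone numbers on business cards. Adjust to your own data.
KEEP_CHARS = "@.-/:&+#,"
_DROP_TABLE = str.maketrans(
    "", "", "".join(ch for ch in string.punctuation if ch not in KEEP_CHARS)
)


def clean_text(token: str) -> str:
    """Strip surrounding whitespace and remove punctuation not in KEEP_CHARS."""
    return token.strip().translate(_DROP_TABLE)


def rows_from_ocr_dict(data: dict) -> list:
    """Turn the dict returned by image_to_data into a list of cleaned word rows."""
    rows = []
    for text, left, top, width, height in zip(
        data["text"], data["left"], data["top"], data["width"], data["height"]
    ):
        token = clean_text(text)
        if token:  # Tesseract also emits empty rows for blocks and lines
            rows.append(
                {"text": token, "left": left, "top": top,
                 "width": width, "height": height}
            )
    return rows


def ocr_image(image_path: str) -> list:
    """Run Tesseract on one image and return cleaned word rows."""
    # Imported lazily so the pure helpers above work without these packages.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # OpenCV loads images as BGR
    data = pytesseract.image_to_data(rgb, output_type=pytesseract.Output.DICT)
    return rows_from_ocr_dict(data)
```

Calling `ocr_image` on each gathered card image and collecting the rows gives a word-level table that can then be labeled for training.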

Section – 2: Train Named Entity Recognition Model

  1. Prepare Training Data for spaCy
  2. Train Model
    1. Config
    2. Train
    3. Save
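
To make the training steps above more concrete, here is a minimal sketch that assumes spaCy v3 and word-level BIO tags. The labels `NAME` and `DES`, the file names, and the helper names are hypothetical. It converts tagged tokens into character-offset entities and stores them in a `DocBin`, the binary format that spaCy's training command reads. The course's own pipeline may differ in detail.

```python
def tokens_to_example(tokens, tags):
    """Join tokens with single spaces and merge B-/I- runs into
    (start_char, end_char, label) entity tuples."""
    entities = []
    current = None  # [start, end, label] of the entity being built
    pos = 0
    for token, tag in zip(tokens, tags):
        start, end = pos, pos + len(token)
        if tag.startswith("B-"):
            if current:
                entities.append(tuple(current))
            current = [start, end, tag[2:]]
        elif tag.startswith("I-") and current and current[2] == tag[2:]:
            current[1] = end  # extend the running entity
        else:
            # "O" tags and stray I- tags close any running entity
            if current:
                entities.append(tuple(current))
            current = None
        pos = end + 1  # account for the joining space
    if current:
        entities.append(tuple(current))
    return " ".join(tokens), entities


def save_docbin(examples, path):
    """Write (text, entities) pairs to a .spacy file for `spacy train`."""
    # Imported lazily so tokens_to_example works without spaCy installed.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")
    doc_bin = DocBin()
    for text, entities in examples:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in entities:
            span = doc.char_span(start, end, label=label)
            if span is not None:  # None means offsets do not align with tokens
                spans.append(span)
        doc.ents = spans
        doc_bin.add(doc)
    doc_bin.to_disk(path)


# Config, train, and save then happen on the command line (spaCy v3 CLI;
# the file names here are placeholders):
#   python -m spacy init config config.cfg --lang en --pipeline ner
#   python -m spacy train config.cfg --output ./output \
#       --paths.train train.spacy --paths.dev dev.spacy
# Training writes `model-best` and `model-last` folders under ./output.
```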

Section – 3: Prediction

  1. Load Model
  2. Render and Serve with displaCy
  3. Draw Bounding Box on Image
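
The prediction steps above can be sketched as follows, assuming a trained spaCy pipeline saved to a folder and a list of OCR word rows, each a dict with the hypothetical keys `text`, `left`, `top`, `width`, and `height`. The function names, colors, and paths are illustrative only. Each predicted entity is mapped back to the words it covers, and the enclosing rectangle is drawn with OpenCV.

```python
def merge_boxes(boxes):
    """Return the (x1, y1, x2, y2) rectangle enclosing all
    (left, top, width, height) boxes."""
    boxes = list(boxes)
    x1 = min(left for left, top, width, height in boxes)
    y1 = min(top for left, top, width, height in boxes)
    x2 = max(left + width for left, top, width, height in boxes)
    y2 = max(top + height for left, top, width, height in boxes)
    return x1, y1, x2, y2


def annotate(image_path, model_dir, rows, out_path):
    """Predict entities for the OCR rows and draw one box per entity."""
    # Imported lazily so merge_boxes works without these packages.
    import cv2
    import spacy

    nlp = spacy.load(model_dir)  # e.g. "./output/model-best" (placeholder path)
    doc = nlp(" ".join(row["text"] for row in rows))

    # Character span of every word inside the joined text.
    spans, pos = [], 0
    for row in rows:
        spans.append((pos, pos + len(row["text"])))
        pos += len(row["text"]) + 1

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    for ent in doc.ents:
        # Collect the boxes of all words overlapping this entity.
        boxes = [
            (row["left"], row["top"], row["width"], row["height"])
            for row, (start, end) in zip(rows, spans)
            if start < ent.end_char and end > ent.start_char
        ]
        if not boxes:
            continue
        x1, y1, x2, y2 = merge_boxes(boxes)
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(image, ent.label_, (x1, max(y1 - 5, 10)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)
    cv2.imwrite(out_path, image)
    return doc
```

The returned `doc` can also be inspected in a browser: `spacy.displacy.render(doc, style="ent", page=True)` returns an HTML page as a string, and `spacy.displacy.serve(doc, style="ent")` starts a small local web server that shows the highlighted entities.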



