What is Intelligent Document and Data Capture?
In this article we’ll discuss Intelligent Data Capture including defining the term, discussing the technology foundation and what’s expected in the future. But first, let’s look at the past.
In a now famous (or infamous) 1975 issue of BusinessWeek on the “The Office of the Future” technologists describe “The Paperless Office”. The article states, “Vincent E. Giuliano of Arthur D. Little, Inc., figures that the use of paper in business for records and correspondence should be declining by 1980, "and by 1990, most record-handling will be electronic.1
Well, it’s 2014 and we are still chasing the paperless office. Today, the consensus of an office utopia is more likely, “The Nearly Paperless Office”. Even though we are not there yet, great strides have been made since 1975 and today we have proven technologies and developing technologies propelling us to the nearly paperless office.
Document and Data Capture Definition
There are many different definitions for document and data capture. Add in “automated” and “intelligent” and the definitions can get pretty long. To keep it simple let’s stick with AIIM’s (Association for Information and Image Management) definition. AIIM is a nonprofit, serving information and image professionals and it states, “Document capture and data capture are not the same thing. Document capture is the conversion of a paper document into an electronic image of that document. Data capture extracts data from a business form”2. We’ll interpret “form” here as any paper or electronic source.
Why Intelligent or Automated Capture?
It’s not a stretch to argue that automatic capture reduces labor costs and speeds processing and ultimate information delivery to the end user or customer. Regulations such as the Health Information Technology for Economic and Clinical Health Act (HITECH) are also moving industries to adopt data capture in order to meet compliance deadlines. What is much harder to quantify is the increased accuracy and lowered cost of errors. In his AIIM blog, Brad Paxton, CEO at ADI LLC-Advanced Document Imaging, discusses this cost and states the “cost of error is not well understood or dealt with effectively.”3
What is the Capture Process?
So now that we have a basic definition of document and data capture and reasons to care, what is the capture process?
There are many models available from broad three-step processes to more specific five-step procresses. Let’s go the more specific route and look at the following:
- Capture
Most familiarly, capture includes document scanning. Using a document scanner or MFP device to capture images of documents is common. But capture also includes electronic formatted sources such as network directories, emails, electronic forms, print streams, faxes…anything made of 1’s and 0s. - Classify/Organize/Categorize
The classification process involves simply identifying what the document or information is in order to correctly process and deliver the document and extract the information. Is this an invoice, a contract, a patient record, a tax form? How should it be processed? Where should it be routed and stored? - Extract or Mine
Extracting the appropriate information from a document usually implies identifying and capturing the index or metadata information. This may be just key elements such as customer number, freight tracking number, invoice number, supplier name etc. Or, full-text indexing may be a requirement where all text on the documents are captured. See What is Document Indexing. - Validate
Validation involves using technology or manual inspection to ensure that the document is classified and processed correctly. With technology this may mean automatically validating against data sources or employing business rules. For instance if an inventory item should contain three alpha characters followed by five numbers, all documents with item numbers that are not identified with that scheme during the capture process may be tagged for manual inspection before further processing is done. - Deliver or Integrate
The final step in processing documents is delivery or integration with a search and retrieval or content management system. Often index information is sent to the document management system via an XML or CSV file where it can be made immediately available to the user. Systems such as SharePoint, Epic, Laserfiche, Dentrix, and other ECM, EMR, EDR, EHR systems have various ways of accepting data feeds.
What Technologies are Used in Document and Data Capture?
Barcode Technology in Data Capture
Built on proven standards and a successful history, barcode recognition (BCR) offers the most trustworthy recognition technology for data capture. Barcodes identify merchandise and equipment, tag documents and even tag patients in hospital care, all with accuracy that is unmatched by any other classification approach. Barcodes can be used in both structured and unstructured document formats. With barcodes and with other recognition technolgies such as OCR which is discussed further below, users can:
- Split at Barcodes
Barcode enabled capture tools can search for barcodes in a scan stack and create new documents when a new or common barcode is found. New invoice numbers, patient records, repair orders or any other classification can be used for splitting an entire stack of documents in a single pass. - Classify Documents
Barcodes help identify the type of documents being processed and can be used to select the appropriate processes for a specific document type. - Route Files
Data capture solutions can use barcode strings to determine or set the target path (or route) for scanned documents in the computer’s file system. Automatically create folders and subfolders based on the barcode values contained in document sets, thus providing an automated way of storing and folder classifying documents. - Index
Perhaps the most common use of barcodes is to contain metadata or indexing information intended for document management systems. - Name Files
Barcodes along with other system or document variables can be used to create output file names for scanned documents. - Bookmark
Instead of splitting at barcodes, barcodes can also be used to insert bookmarks in output PDF files. This option lets you keep a stack of scanned documents as one file, but add bookmarks for easy navigation through the file.
OCR (Optical Character Recognition)
OCR is another mature data capture technology that digitizes text images so that they can be electronically edited, searched, and stored. With OCR you can make your image-based file fully text-searchable or extract data from a zone for indexing. With zonal OCR, document areas are identified for automatic OCR capture. Additionally, drag-and-drop or rubber band OCR allows an operator to highlight document text which is automatically OCR'd and dropped into index fields. OCR is becoming increasingly important as text-mining technologies improve. Advanced data mining technology using "lookaround" technology with OCR text can create powerful text identification that can be used for indexing, naming, routing and more.
Other Recognition Technologies For Data Capture
ICR (Intelligent Character Recognition)
ICR is a handwriting recognition system that allows fonts and different styles of handwriting to be learned by a computer during processing to improve accuracy and recognition levels. Obviously this technology is not as accurate as OCR, but it does have a limited role in some capture systems.
OMR (Optical Mark Recognition)
OMR is the process of capturing human-marked data from document forms such as surveys and tests. Like ICR, lower accuracy limits the application within data capture environments today.
Forms Recognition
Generally, forms recognition uses BCR, OCR, ICR and OMR in a structured data capture format. Typically form templates are designed to instruct the capture software in where to look for information on the form and how to process the information found there.
Data Mining/Regex,
Data mining with regular expressions (regex) provides a fast and powerful method to search, extract and replace specific text strings found within scanned documents. Regular expressions are essentially a special text string for describing adjoining search patterns. You could think of regular expressions as extremely powerful wildcards that the intelligent data capture software can employ to find key information. Regex is extremely flexible and patterns can be constructed to match almost anything.
Data identified with regex can be used to classify, split, name and route files. Regex can also play a role in line item extraction. This extension of data mining is common within invoice forms and other table based documents. With line item extraction, instead of extracting strings of text beside a matching pattern, users may want to obtain data found below the match word in a column type fashion.
Batch Document Processing For “Unattended” Processing
Batch document processing is simply processing a large volume of documents, generally into a few files or one file and using intelligent capture software to process the scanned or captured file or files. Some products process folders of documents on demand or “watch” folders for files to process.
Batch processing can save companies significantly because of the lowered operator intervention gained by using the highly functional automation features of intelligent data capture and ease of walk-up scanning or PDF printing.
Field Validation
Field validation applies rules to determine if the data extracted or referenced follows established rules. In the example presented, "PA568793" does not match the part number scheme of three alpha followed by six numbers. Exception handing may direct the invalid document to a directory for further inspection or in interactive situations, it may present the document to an operator for immediate review and correction.
Image Enhancement
To improve usability and increase accuracy of OCR and other recognition technologies, image enhancement is required in a document capture solution. Typical image enhancement might include, deskew, despeckle and rotate functions. Truly intelligent capture should also include options to remove blank pages, remove separator sheets, autorotate, remove lines, and adaptive thresholding. Adaptive thresholding technology assists in cleaning “dirty” documents or documents that have a colored background which interferes with the foreground data.
Issues/Concerns Facing Data Capture
Security of documents can be of great concern, particularly in legal and health care environments. Digital Rights Management (DRM) is commonly used to encrypt PDF files with varying encryption algorithms. Be sure your provider offers point of entry encryption (meaning the file is encrypted as it exits the scanning application) and that no unprotected temporary files remain.
Where Is Document and Data Capture Headed?
As we move to a more "paperless" business model, technology will continue to improve allowing for more secure and efficient solutons to store and retrieve documents. As this continues the future of document management and capture will be characterized by these trends.
- Increased Cloud Computing. Cloud computing of documents will grow as accessiblity requirements grow and it's promise of sharable repositories. According to Gartner Group, "The use of cloud computing is growing, and by 2016 this growth will increase to become the bulk of new IT spend." As the volume of stored and shared documents increases, so will document security needs.
- Security Focus - Couple the increasing number of documents being stored with the growing ways to access them, and security concerns will continue to increase.
- Improved Data Mining and Classification - The increased used of data mining and better classification will increase OCR demands and lower the use of barcodes and separator pages.
- Increased Mobility Demands - Increased mobility demands in business impacts all information technology. Users want all information available from all platforms, no matter when or where.
Conclusion
Today’s intelligent data capture solutions have matured significantly in the last few years and we are accelerating towards “The Nearly Paperless Office”. We may never fully approach a paperless environment, but certainly one that involves less paper. Currently, no one data capture product can “do it all”, but many solutions offer a wide range of features. There is no better time to get started than now.
Works Cited
2 http://www.aiim.org/community/blogs/community/Cost-Of-Error-In-Data-Capture#sthash.qCwNOeTf.dpuf