Java text recognition technology with high recognition rate

Author：Eve Cole Update Time：2025-05-13 16:16:01

The key to a Java text recognition program is finding an OCR engine that it can call. tesseract-ocr is one such engine: it was developed at HP Labs between 1985 and 1995 and is now maintained under Google. Version 3.0 added support for Chinese. tesseract-ocr 3.0 does not ship with a graphical client, and FreeOCR, a graphical client written by a third party, cannot import the new 3.0 traineddata files. Even so, the release means that free Chinese OCR software is now available.

The steps for using tesseract-ocr 3.01 from Java are as follows:

1. Download and install tesseract-ocr-setup-3.01-1.exe (Chinese recognition is available only in version 3.0 and later).

2. In the installation wizard, select the language packs you want to download.

3. Find and download the two JAR packages required for image processing in Java: jai_imageio-1.1-alpha.jar and swingx-1.6.1.jar.

4. Java program listings:

ImageIOHelper class:

    package com.hhp.util;

    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import java.util.Iterator;
    import java.util.Locale;

    import javax.imageio.IIOImage;
    import javax.imageio.ImageIO;
    import javax.imageio.ImageReader;
    import javax.imageio.ImageWriteParam;
    import javax.imageio.ImageWriter;
    import javax.imageio.metadata.IIOMetadata;
    import javax.imageio.stream.ImageInputStream;
    import javax.imageio.stream.ImageOutputStream;

    import com.sun.media.imageio.plugins.tiff.TIFFImageWriteParam;

    public class ImageIOHelper {

        // Converts the input image to an uncompressed TIFF file next to it
        // and returns that temporary file (null if the conversion failed).
        public static File createImage(File imageFile, String imageFormat) {
            File tempFile = null;
            try {
                Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(imageFormat);
                ImageReader reader = readers.next();
                ImageInputStream iis = ImageIO.createImageInputStream(imageFile);
                reader.setInput(iis);
                // Read the stream metadata
                IIOMetadata streamMetadata = reader.getStreamMetadata();
                // Set up the writeParam
                TIFFImageWriteParam tiffWriteParam = new TIFFImageWriteParam(Locale.CHINESE);
                tiffWriteParam.setCompressionMode(ImageWriteParam.MODE_DISABLED);
                // Get the TIFF writer and set its output to the temporary file
                Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
                ImageWriter writer = writers.next();
                BufferedImage bi = reader.read(0);
                IIOImage image = new IIOImage(bi, null, reader.getImageMetadata(0));
                tempFile = tempImageFile(imageFile);
                ImageOutputStream ios = ImageIO.createImageOutputStream(tempFile);
                writer.setOutput(ios);
                writer.write(streamMetadata, image, tiffWriteParam);
                ios.close();
                iis.close();
                writer.dispose();
                reader.dispose();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return tempFile;
        }

        // Derives the temporary file name: inserts "0" before the extension
        // and replaces the extension with "tif" (for example, 4.png becomes 40.tif).
        private static File tempImageFile(File imageFile) {
            String path = imageFile.getPath();
            StringBuilder strB = new StringBuilder(path);
            strB.insert(path.lastIndexOf('.'), 0);
            return new File(strB.toString().replaceFirst("(?<=\\.)(\\w+)$", "tif"));
        }
    }
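To make the naming rule in tempImageFile concrete, here is a minimal sketch that isolates the same two string operations in a standalone class. The class name TempNameDemo and the method tempName are hypothetical, introduced only for illustration, and the sketch assumes the path contains a file extension.

```java
public class TempNameDemo {

    // Same two steps as tempImageFile in the listing above:
    // insert "0" before the last dot, then swap the extension for "tif".
    public static String tempName(String path) {
        StringBuilder sb = new StringBuilder(path);
        sb.insert(path.lastIndexOf('.'), 0);
        return sb.toString().replaceFirst("(?<=\\.)(\\w+)$", "tif");
    }

    public static void main(String[] args) {
        System.out.println(tempName("4.png"));          // prints 40.tif
        System.out.println(tempName("scan.page1.jpg")); // prints scan.page10.tif
    }
}
```

The inserted "0" keeps the temporary name from colliding with the input when the input is itself a .tif file. A path with no dot would make lastIndexOf return -1 and the insert call throw, so a caller may want to check for an extension first.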

OCR class:

    package com.hhp.util;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.jdesktop.swingx.util.OS;

    public class OCR {
        private final String LANG_OPTION = "-l"; // lowercase letter l, not the digit 1
        private final String EOL = System.getProperty("line.separator");
        private String tessPath = "C:\\Program Files (x86)\\Tesseract-OCR";
        // private String tessPath = new File("tesseract").getAbsolutePath();

        public String recognizeText(File imageFile, String imageFormat) throws Exception {
            File tempImage = ImageIOHelper.createImage(imageFile, imageFormat);
            File outputFile = new File(imageFile.getParentFile(), "output");
            StringBuilder strB = new StringBuilder();

            // Build the command line: tesseract <image> <output base> -l <language>
            List<String> cmd = new ArrayList<String>();
            if (OS.isWindowsXP()) {
                cmd.add(tessPath + "\\tesseract");
            } else if (OS.isLinux()) {
                cmd.add("tesseract");
            } else {
                cmd.add(tessPath + "\\tesseract");
            }
            cmd.add(""); // placeholder for the image name, set below
            cmd.add(outputFile.getName());
            cmd.add(LANG_OPTION);
            cmd.add("chi_sim");
            // cmd.add("eng");

            ProcessBuilder pb = new ProcessBuilder();
            pb.directory(imageFile.getParentFile());
            cmd.set(1, tempImage.getName());
            pb.command(cmd);
            pb.redirectErrorStream(true);
            Process process = pb.start();
            // Equivalent to: tesseract.exe 1.jpg 1 -l chi_sim
            int w = process.waitFor();

            // Delete the temporary working file
            tempImage.delete();

            if (w == 0) {
                // tesseract writes its result to <output base>.txt
                BufferedReader in = new BufferedReader(new InputStreamReader(
                        new FileInputStream(outputFile.getAbsolutePath() + ".txt"), "UTF-8"));
                String str;
                while ((str = in.readLine()) != null) {
                    strB.append(str).append(EOL);
                }
                in.close();
            } else {
                String msg;
                switch (w) {
                case 1:
                    msg = "Errors accessing files. There may be spaces in your image's filename.";
                    break;
                case 29:
                    msg = "Cannot recognize the image or its selected region.";
                    break;
                case 31:
                    msg = "Unsupported image format.";
                    break;
                default:
                    msg = "Errors occurred.";
                }
                throw new RuntimeException(msg);
            }
            new File(outputFile.getAbsolutePath() + ".txt").delete();
            return strB.toString();
        }
    }
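The command line that the OCR class assembles can be looked at on its own. The following is a minimal sketch, not part of the original program: the class TesseractCommandDemo and the method buildCommand are hypothetical names, and the argument order simply mirrors the listing above (executable, image, output base name, -l, language).

```java
import java.util.ArrayList;
import java.util.List;

public class TesseractCommandDemo {

    // Builds the argument list: <executable> <image> <outputBase> -l <lang>
    public static List<String> buildCommand(String executable, String imageName,
                                            String outputBase, String lang) {
        List<String> cmd = new ArrayList<String>();
        cmd.add(executable);
        cmd.add(imageName);
        cmd.add(outputBase); // the OCR class reads the result from <outputBase>.txt
        cmd.add("-l");       // lowercase letter l, not the digit 1
        cmd.add(lang);
        return cmd;
    }

    public static void main(String[] args) {
        // prints [tesseract, 40.tif, output, -l, chi_sim]
        System.out.println(buildCommand("tesseract", "40.tif", "output", "chi_sim"));
    }
}
```

Passing each argument as a separate list element to ProcessBuilder, as the OCR class does, avoids manual quoting of the executable path, which contains spaces on a default Windows install.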

OcrTest class:

    import java.io.File;

    import com.hhp.util.OCR;

    public class OcrTest {
        public static void main(String[] args) {
            String path = "C:\\temp\\OCRcode\\4.png";
            System.out.println("OCR Test Begin...");
            try {
                // Recognize the text in the PNG image and print it
                String valCode = new OCR().recognizeText(new File(path), "png");
                System.out.println(valCode);
            } catch (Exception e) {
                e.printStackTrace();
            }
            System.out.println("OCR Test End...");
        }
    }
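If the test run fails, one thing worth checking is whether the chi_sim language data was actually installed. The sketch below is only an illustration: the class LanguageDataCheck and the method hasLanguage are hypothetical names, and it assumes the installer places language files in a tessdata folder under the install directory, named after the language code with a .traineddata extension.

```java
import java.io.File;

public class LanguageDataCheck {

    // Returns true if <installDir>/tessdata/<lang>.traineddata exists.
    // The tessdata layout is an assumption about the installer, not verified here.
    public static boolean hasLanguage(File installDir, String lang) {
        File data = new File(new File(installDir, "tessdata"), lang + ".traineddata");
        return data.isFile();
    }

    public static void main(String[] args) {
        // Same install path as tessPath in the OCR class
        File installDir = new File("C:\\Program Files (x86)\\Tesseract-OCR");
        System.out.println("chi_sim installed: " + hasLanguage(installDir, "chi_sim"));
    }
}
```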

In testing, tesseract-ocr 3.01 recognized text with a very high success rate, and it also performed very well on the verification codes commonly used on websites.
