DeepSeek OCR is a vision-language model from DeepSeek AI that specializes in optical character recognition and document understanding. Its core innovation, "Contexts Optical Compression", compresses long textual contexts into a compact set of vision tokens, reducing the cost of processing text-heavy documents.
## Key Features
DeepSeek OCR excels at converting documents and images into structured output, with particular strength in markdown conversion and raw text extraction. Inference is configurable through five preset modes (Tiny, Small, Base, Large, and Gundam), each defined by its own `base_size` and `image_size` parameters, so resolution can be traded against speed and memory to match processing requirements.
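The mode presets can be expressed as a small lookup table. This is a minimal sketch: the parameter values mirror those published on the model card, and the helper function is a hypothetical convenience, so verify both against the release you are using.

```python
# Hypothetical helper mapping DeepSeek OCR mode names to the
# base_size / image_size / crop_mode presets listed on the model card.
# Verify these values against the current release before relying on them.
MODES = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False},
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False},
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    # "Gundam" tiles the page: a global view plus local crops.
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
}

def mode_kwargs(name: str) -> dict:
    """Return inference kwargs for a named mode (case-insensitive)."""
    try:
        return dict(MODES[name.lower()])
    except KeyError:
        raise ValueError(f"unknown mode {name!r}; choose from {sorted(MODES)}")
```

Returning a copy of the preset lets callers tweak individual parameters without mutating the shared table.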
The model supports grounded document understanding through a dedicated grounding token in the prompt, which helps it preserve context and layout structure during OCR. It also applies a no-repeat n-gram logits processor during generation, suppressing repetitive output in a way that proves especially useful for complex table extraction tasks.
## Architecture
Built on the Transformers framework and distributed in the Safetensors format, DeepSeek OCR uses Flash Attention 2 for optimized performance on NVIDIA GPUs. Custom inference parameters, including `crop_mode`, allow flexible handling of varied document layouts and formats. Integration with vLLM enables accelerated inference with batch processing support for production workloads.
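A minimal loading sketch with Transformers is shown below. The `infer(...)` call and its arguments follow the model's published usage example, but the model id and file paths are assumptions, a CUDA GPU is required for Flash Attention 2, and the remote-code API may change between releases.

```python
def run_ocr(image_file: str, output_path: str = "./out"):
    """Sketch: load DeepSeek OCR via Transformers and OCR one image.

    Requires a CUDA GPU with Flash Attention 2 installed; the infer()
    signature follows the model's usage example and may change.
    """
    # Heavy imports are kept local so this module imports without the deps.
    import torch
    from transformers import AutoModel, AutoTokenizer

    model_id = "deepseek-ai/DeepSeek-OCR"  # assumed model id
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_id,
        trust_remote_code=True,
        use_safetensors=True,
        _attn_implementation="flash_attention_2",  # NVIDIA GPUs only
    )
    model = model.eval().cuda().to(torch.bfloat16)

    prompt = "<image>\n<|grounding|>Convert the document to markdown."
    return model.infer(
        tokenizer,
        prompt=prompt,
        image_file=image_file,
        output_path=output_path,
        base_size=1024, image_size=1024, crop_mode=False,  # "Base" mode
    )
```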
## Use Cases
DeepSeek OCR is designed for a wide range of document processing applications:
- Document digitization and conversion to markdown format
- Table extraction from complex document layouts
- Multi-page PDF processing and analysis
- Batch OCR operations for production workflows
- Text extraction from images and scanned documents
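For batch workflows like the ones listed above, a driver typically pairs each input image with its own output location before dispatching OCR calls. The helper below is hypothetical: only the directory-walking logic is concrete, and the downstream consumer (e.g. a per-image `infer` call or a vLLM batch request) is assumed.

```python
from pathlib import Path

# Common raster formats; extend as needed for your corpus.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}

def collect_jobs(input_dir: str, output_dir: str) -> list:
    """Pair every image under input_dir with a per-file output folder.

    Hypothetical batching helper: the actual OCR step would consume
    the returned (image_path, out_dir) pairs.
    """
    in_root, out_root = Path(input_dir), Path(output_dir)
    jobs = []
    for path in sorted(in_root.rglob("*")):
        if path.suffix.lower() in IMAGE_EXTS:
            # Mirror the input tree under output_dir, one folder per image.
            out_dir = out_root / path.relative_to(in_root).with_suffix("")
            jobs.append((path, out_dir))
    return jobs
```

Sorting the walk keeps job order deterministic, which simplifies resuming interrupted batch runs.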
## Performance and Adoption
The model has achieved significant adoption in the community, with over 4 million downloads monthly. It is actively deployed in more than 78 community Spaces, demonstrating diverse real-world applications across document understanding tasks.
DeepSeek OCR is published under the MIT license, making it accessible for both commercial and non-commercial use.