Delete docs
- docs/config_refactoring.md +0 -47
- docs/preprocessing.md +0 -179
- docs/preprocessing_triage.md +0 -17
docs/config_refactoring.md
DELETED
@@ -1,47 +0,0 @@
# Configuration Refactoring

## Overview
This document outlines the changes made to centralize configuration parameters and reduce technical debt in the OCR processing system.

## Key Changes

### Centralized Configuration
All previously hard-coded parameters have been moved to `config.py` and organized by functional category:

- **PDF_SETTINGS**: Parameters for PDF processing
- **SEGMENTATION_SETTINGS**: Image segmentation configuration
- **CACHE_SETTINGS**: Cache TTL and capacity settings
- **TEXT_REPAIR_SETTINGS**: Duplication detection and repair thresholds

### Environment Variable Support
All configuration parameters can now be overridden via environment variables:

```bash
# Example: Override PDF DPI
export PDF_DEFAULT_DPI=200

# Example: Increase cache size
export CACHE_MAX_ENTRIES=50
```
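
As a rough illustration of how `config.py` might combine grouped settings with environment overrides, here is a minimal sketch. Only the variable names `PDF_DEFAULT_DPI` and `CACHE_MAX_ENTRIES` and the group names come from this document; the helper `env_int`, the function `load_settings`, the dictionary keys, and the default values (150 and 20) are assumptions made for the example.

```python
import os


def env_int(name: str, default: int) -> int:
    """Return an integer environment variable, or the default if unset or invalid."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError:
        # Ignore a malformed override instead of failing at import time.
        return default


def load_settings() -> dict:
    """Build grouped settings, applying any environment overrides.

    The keys and defaults below are hypothetical placeholders.
    """
    return {
        "PDF_SETTINGS": {"default_dpi": env_int("PDF_DEFAULT_DPI", 150)},
        "CACHE_SETTINGS": {"max_entries": env_int("CACHE_MAX_ENTRIES", 20)},
    }
```

With `export PDF_DEFAULT_DPI=200` in the environment, `load_settings()["PDF_SETTINGS"]["default_dpi"]` would be `200` under these assumptions; an unset or non-numeric value falls back to the default.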

### Import Strategy
To prevent circular dependencies, configuration is imported at function level where needed:

```python
def process_image():
    # Deferred import: runs when the function is called, not when this module loads
    from config import SEGMENTATION_SETTINGS
    # Function implementation using settings
```

## Benefits

- **Maintainability**: Settings are centralized and documented
- **Flexibility**: Configuration can be adjusted without code changes
- **Consistency**: Standardized approach to configuration across modules
- **Traceability**: Clear overview of all configurable parameters

## Future Improvements

- Add configuration schema validation
- Support configuration profiles (dev/test/prod)
- Add detailed documentation for each parameter
docs/preprocessing.md
DELETED
@@ -1,179 +0,0 @@
# Image Preprocessing for Historical Document OCR

This document outlines the enhanced preprocessing capabilities for improving OCR quality on historical documents, including deskewing, thresholding, and morphological operations.

## Overview

The preprocessing pipeline offers several options to enhance image quality before OCR processing:

1. **Deskewing**: Automatically detects and corrects document skew using multiple detection algorithms
2. **Thresholding**: Converts grayscale images to binary using adaptive or Otsu methods with pre-blur options
3. **Morphological Operations**: Cleans up binary images by removing noise or filling in gaps
4. **Document-Type Specific Settings**: Customized preprocessing configurations for different document types

## Configuration

Preprocessing options are set in `config.py` and are tunable per document type. All settings are accessible through environment variables for easy deployment configuration.

### Deskewing

```python
"deskew": {
    "enabled": True,                 # True or False: whether to apply deskewing
    "angle_threshold": 0.1,          # Minimum angle (degrees) to trigger deskewing
    "max_angle": 45.0,               # Maximum correction angle
    "use_hough": True,               # True or False: use Hough transform in addition to minAreaRect
    "consensus_method": "average",   # How to combine angle estimations
    "fallback": {"enabled": True}    # True or False: fall back to original if deskewing fails
}
```

Deskewing uses two methods:
- **minAreaRect**: Finds contours in the binary image and calculates their orientation
- **Hough Transform**: Detects lines in the image and their angles

The `consensus_method` can be:
- `"average"`: Average of all detected angles (most stable)
- `"median"`: Median of all angles (robust to outliers)
- `"min"`: Minimum absolute angle (most conservative)
- `"max"`: Maximum absolute angle (most aggressive)
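
The four consensus options can be sketched as a small reduction over the angle estimates. This is an illustration only: `combine_angles` and `should_deskew` are hypothetical names, and treating `max_angle` as a cutoff above which no correction is applied is an assumption about how that setting behaves.

```python
from statistics import mean, median


def combine_angles(angles: list[float], method: str = "average") -> float:
    """Reduce several skew estimates (in degrees) to a single angle."""
    if not angles:
        return 0.0
    if method == "average":
        return mean(angles)
    if method == "median":
        return median(angles)
    if method == "min":
        # Estimate with the smallest magnitude, sign preserved.
        return min(angles, key=abs)
    if method == "max":
        # Estimate with the largest magnitude, sign preserved.
        return max(angles, key=abs)
    raise ValueError(f"unknown consensus_method: {method!r}")


def should_deskew(angle: float, angle_threshold: float = 0.1,
                  max_angle: float = 45.0) -> bool:
    """Assumed rule: correct only when the skew lies within the configured range."""
    return angle_threshold <= abs(angle) <= max_angle
```

For example, with estimates `[1.0, -0.5, 3.0]`, `"median"` yields `1.0`, `"min"` yields `-0.5`, and `"max"` yields `3.0`.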

### Thresholding

```python
"thresholding": {
    "method": "adaptive",            # "none", "otsu", or "adaptive"
    "adaptive_block_size": 11,       # Block size for adaptive thresholding (must be odd)
    "adaptive_constant": 2,          # Constant subtracted from mean
    "otsu_gaussian_blur": 1,         # Blur kernel size for Otsu pre-processing
    "preblur": {
        "enabled": True,             # True or False: whether to apply pre-blur
        "method": "gaussian",        # "gaussian" or "median"
        "kernel_size": 3             # Blur kernel size (must be odd)
    },
    "fallback": {"enabled": True}    # True or False: fall back to grayscale if thresholding fails
}
```

Thresholding methods:
- **Otsu**: Automatically determines optimal global threshold (best for high-contrast documents)
- **Adaptive**: Calculates thresholds for different regions (better for uneven lighting, historical documents)
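
To make the "global threshold" idea concrete, the following is a minimal pure-Python sketch of Otsu's method on a flat list of 8-bit grayscale values. It is not the pipeline's implementation (which this document does not show); `otsu_threshold` and `binarize` are names chosen for the example.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the 0-255 threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with value <= t
    sum_bg = 0  # sum of those pixel values
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already background
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels: list[int], threshold: int) -> list[int]:
    """Map values above the threshold to 255 and the rest to 0."""
    return [255 if p > threshold else 0 for p in pixels]
```

Because a single threshold is applied to the whole page, this approach works when ink and paper form two well-separated intensity groups; uneven lighting breaks that assumption, which is the case the adaptive method addresses.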

### Morphological Operations

```python
"morphology": {
    "enabled": True,         # True or False: whether to apply morphological operations
    "operation": "close",    # "open", "close", "both"
    "kernel_size": 1,        # Size of the structuring element
    "kernel_shape": "rect"   # "rect", "ellipse", "cross"
}
```

Morphological operations:
- **Open**: Erosion followed by dilation - removes small noise and disconnects thin connections
- **Close**: Dilation followed by erosion - fills small holes and connects broken elements
- **Both**: Applies opening followed by closing
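
The definitions above can be illustrated with a small pure-Python sketch on a binary grid (lists of 0/1 rows). It assumes a square (`"rect"`) structuring element whose side length is `k` pixels and clips the window at the image border; the function names are hypothetical and the real pipeline may interpret `kernel_size` differently.

```python
Grid = list[list[int]]


def _apply(img: Grid, k: int, pick) -> Grid:
    """Slide a k x k window over img and reduce each window with pick (min or max)."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            ]
            out[y][x] = pick(window)
    return out


def erode(img: Grid, k: int = 3) -> Grid:
    """A pixel stays set only if every pixel in its window is set."""
    return _apply(img, k, min)


def dilate(img: Grid, k: int = 3) -> Grid:
    """A pixel becomes set if any pixel in its window is set."""
    return _apply(img, k, max)


def opening(img: Grid, k: int = 3) -> Grid:
    """Erosion followed by dilation: removes specks smaller than the window."""
    return dilate(erode(img, k), k)


def closing(img: Grid, k: int = 3) -> Grid:
    """Dilation followed by erosion: fills holes smaller than the window."""
    return erode(dilate(img, k), k)
```

Under this interpretation, a window of side 1 covers only the pixel itself, so opening and closing with `k=1` return the image unchanged.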

### Document Type Configurations

The system includes optimized settings for different document types:

```python
"document_types": {
    "standard": {
        # Default settings - will use the global settings
    },
    "newspaper": {
        "deskew": {"enabled": True, "angle_threshold": 0.3, "max_angle": 10.0},
        "thresholding": {
            "method": "adaptive",
            "adaptive_block_size": 15,
            "adaptive_constant": 3,
            "preblur": {"method": "gaussian", "kernel_size": 3}
        },
        "morphology": {"operation": "close", "kernel_size": 1}
    },
    "handwritten": {
        "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
        "thresholding": {
            "method": "adaptive",
            "adaptive_block_size": 31,
            "adaptive_constant": 5,
            "preblur": {"method": "median", "kernel_size": 3}
        },
        "morphology": {"operation": "open", "kernel_size": 1}
    },
    "book": {
        "deskew": {"enabled": True},
        "thresholding": {
            "method": "otsu",
            "preblur": {"method": "gaussian", "kernel_size": 5}
        },
        "morphology": {"operation": "both", "kernel_size": 1}
    }
}
```
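
One plausible way to resolve a document type against the global settings is a recursive merge in which per-type values override the globals and unspecified keys are inherited. The sketch below assumes that behavior; `deep_merge` and `settings_for` are hypothetical helpers, not functions named in this document.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge key by key instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def settings_for(document_type: str, global_settings: dict,
                 document_types: dict) -> dict:
    """Resolve effective settings; unknown types fall back to the globals."""
    return deep_merge(global_settings, document_types.get(document_type, {}))
```

Under this scheme, the `"handwritten"` entry above would change `angle_threshold` and `use_hough` while inheriting the global `max_angle`, and `"standard"` (an empty override) would return the global settings unchanged.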

## Performance and Logging

```python
"performance": {
    "parallel": {
        "enabled": True,       # True or False: whether to use parallel processing
        "max_workers": 4       # Maximum number of worker threads
    },
    "timeout_ms": 10000        # Timeout for preprocessing (in milliseconds)
},

"logging": {
    "enabled": True,           # True or False: whether to log preprocessing metrics
    "metrics": ["skew_angle", "binary_nonzero_pct", "processing_time"],
    "output_path": "logs/preprocessing_metrics.json"
}
```
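
A minimal sketch of how `max_workers` and `timeout_ms` could drive a thread pool is shown below. It assumes the timeout applies to the whole batch rather than to each page, and `preprocess_all` and `preprocess_page` are placeholder names rather than functions from the project.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess_all(pages, preprocess_page, max_workers=4, timeout_ms=10000):
    """Run preprocess_page over pages in worker threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order. It raises a TimeoutError if a
        # result is not ready within `timeout` seconds of the map() call.
        # Threads cannot be interrupted, so a timed-out task keeps running and
        # leaving the `with` block still waits for it to finish.
        return list(pool.map(preprocess_page, pages, timeout=timeout_ms / 1000))
```

A caller would pass the per-page preprocessing function, for example `preprocess_all(pages, my_preprocess, max_workers=4)`, where `my_preprocess` is whatever callable performs the steps configured above.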

## Usage with OCR Processing

When processing documents, simply specify the document type:

```python
preprocessing_options = {
    "document_type": "newspaper",  # Use newspaper-optimized settings
    "grayscale": True,             # Legacy option: apply grayscale conversion
    "denoise": True,               # Legacy option: apply denoising
    "contrast": 10,                # Legacy option: adjust contrast (0-100)
    "rotation": 0                  # Legacy option: manual rotation (degrees)
}

# Apply preprocessing and OCR
result = process_file(file_bytes, file_ext, preprocessing_options=preprocessing_options)
```

## Visual Examples

### Original Document
*[A historical newspaper or document image would be shown here]*

### After Deskewing
*[The same document, with skew corrected]*

### After Thresholding
*[The document converted to binary with clear text]*

### After Morphological Operations
*[The binary image with small noise removed and/or gaps filled]*

## Troubleshooting

### Poor Deskewing Results
- **Symptom**: Document skew is not correctly detected or corrected
- **Solution**: Try adjusting `angle_threshold` or `max_angle`, or disable Hough transform for handwritten documents

### Thresholding Issues
- **Symptom**: Text is lost or background noise is excessive after thresholding
- **Solution**: Try changing the thresholding method or adjusting `adaptive_block_size` and `adaptive_constant`

### Performance Concerns
- **Symptom**: Processing is too slow for large documents
- **Solution**: Enable parallel processing, reduce image size, or disable some preprocessing steps for faster results
docs/preprocessing_triage.md
DELETED
@@ -1,17 +0,0 @@
# OCR Preprocessing Triage

## Quick Fixes Implemented

1. **Handwritten** - Disabled thresholding, uses grayscale only
2. **Newspapers** - Increased block size (51) and constant (10) for softer thresholding
3. **JPEG Artifacts** - Auto-detection and specialized denoising
4. **Border Issues** - Crops edges after deskew to avoid threshold problems
5. **Low Resolution** - Upscales small text for better recognition

## Testing

```
python testing/test_triage_fix.py
```

Check `output/comparison/` for results.