OCR Service
This guide shows you how to use the ocr_service provider to extract tax return data from PDF documents using optical character recognition (OCR).
Goal
Enable automated data extraction from tax documents when:
- You have tax returns in PDF format (scans or exports)
- The company's tax filings are not yet available from official sources
- You need to process historical documents not available digitally
- You want to accelerate data collection from client-provided documents
Use Cases
OCR Service actions
Extract structured financial data from tax return PDFs automatically.
Extract
- Process scanned tax bundles
- Extract data from PDF exports
- Parse handwritten or typed forms
Accelerate
- Skip manual data entry
- Process documents in batch
- Get structured data immediately
Complement
- Add missing fiscal years
- Complement INPI public data
- Process foreign documents
Supported Data
| Data Type | Description | Output |
|---|---|---|
| Tax Return | Tax bundles (2050, 2033, etc.) | Structured financial data |
| Tax Return Analysis | Financial ratios and insights | Computed indicators |
Prerequisites
Before using ocr_service, ensure:
- You have the PDF documents to process
- Documents are readable (not too blurry or damaged)
- The user exists in your system
Configuration
Step 1: Enable the OCR Service provider
{
"enable": true
}
{
"auto_connect": false
}
Step 2: Create the data connection
{
"requested_data_types": ["TAX_RETURN"],
"provider_name": "ocr_service"
}
Uploading Documents for OCR
Submit a tax return PDF
Use the following endpoint to upload a tax bundle document for OCR processing:
/api/v6/input/users/{userId}/tax-returns/ocr-service
{
"tax_return_id": "tax-return-2023-001",
"data": {
"file": "JVBERi0xLjQKJeLjz9...",
"closing_date": "2023-12-31",
"closing_year": "2023",
"submitted_date": "2024-05-15",
"type": "C",
"duration": 12,
"privacy": "PRIVATE"
}
}
Request parameters
| Field | Type | Required | Description |
|---|---|---|---|
| tax_return_id | string | Yes | Your unique identifier for this tax return |
| data.file | string | Yes | Base64-encoded PDF document |
| data.closing_date | date | Yes | Fiscal year end date (YYYY-MM-DD) |
| data.closing_year | string | Yes | Fiscal year (e.g., "2023") |
| data.submitted_date | date | No | Date the return was filed |
| data.type | string | No | Bundle type: C (full), S (simplified), K (consolidated) |
| data.duration | integer | No | Fiscal year duration in months (default: 12) |
| data.privacy | string | No | Visibility setting |
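As a rough illustration of the fields above, the following sketch assembles an upload body from a PDF held in a Node.js Buffer. The function name buildOcrUploadBody and its options object are hypothetical and not part of the API; only the field names come from the table. It also assumes that closing_year is simply the year of closing_date, which matches the example request but may not hold for every case.

```javascript
// Hypothetical helper: assembles the JSON body for the OCR upload endpoint.
// Field names follow the request parameters table; everything else is illustrative.
function buildOcrUploadBody(taxReturnId, pdfBuffer, options) {
  if (typeof taxReturnId !== 'string' || taxReturnId.length === 0) {
    throw new Error('tax_return_id is required');
  }
  if (!Buffer.isBuffer(pdfBuffer) || pdfBuffer.length === 0) {
    throw new Error('data.file is required (non-empty PDF buffer)');
  }
  const { closingDate, submittedDate, type, duration, privacy } = options;
  if (!/^\d{4}-\d{2}-\d{2}$/.test(closingDate || '')) {
    throw new Error('data.closing_date must be formatted as YYYY-MM-DD');
  }
  const data = {
    file: pdfBuffer.toString('base64'),
    closing_date: closingDate,
    // Assumption: the fiscal year is the year of the closing date.
    closing_year: closingDate.slice(0, 4)
  };
  // Optional fields are only sent when the caller provides them.
  if (submittedDate !== undefined) data.submitted_date = submittedDate;
  if (type !== undefined) data.type = type;
  if (duration !== undefined) data.duration = duration;
  if (privacy !== undefined) data.privacy = privacy;
  return { tax_return_id: taxReturnId, data };
}

const body = buildOcrUploadBody('tax-return-2023-001', Buffer.from('%PDF-1.4\n'), {
  closingDate: '2023-12-31',
  type: 'C'
});
console.log(body.data.file); // JVBERi0xLjQK
```

Validating the required fields locally gives a clearer error than waiting for the API to reject the request.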
Response
{
"id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
"tax_return_id": "tax-return-2023-001",
"process_status": "PENDING",
"error_message": null,
"data": {
"id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
"revenues": 0,
"net_profit": 0,
"file_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
"closing_date": "2023-12-31",
"closing_year": "2023",
"submitted_date": "2024-05-15",
"type": "C",
"duration": 12,
"privacy": "PRIVATE",
"tax_return_values": [],
"millesime": "2024"
}
}
Checking OCR Status
List all OCR processing jobs
/api/v6/input/users/{userId}/tax-returns
Query parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| page | integer | 1 | Page number |
| per_page | integer | 20 | Items per page |
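A small sketch of how the pagination parameters might be applied when building the request path. The helper names buildJobsListPath and hasMorePages are hypothetical; hasMorePages relies only on the current_page and last_page fields shown in the list response.

```javascript
// Hypothetical helper: builds the path and query string for listing OCR jobs.
function buildJobsListPath(userId, { page = 1, perPage = 20 } = {}) {
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError('page must be a positive integer');
  }
  if (!Number.isInteger(perPage) || perPage < 1) {
    throw new RangeError('perPage must be a positive integer');
  }
  const query = new URLSearchParams({ page: String(page), per_page: String(perPage) });
  return `/api/v6/input/users/${encodeURIComponent(userId)}/tax-returns?${query}`;
}

// Hypothetical helper: true while the paginated response has further pages.
function hasMorePages(response) {
  return response.current_page < response.last_page;
}

console.log(buildJobsListPath('user-123', { page: 2, perPage: 50 }));
// /api/v6/input/users/user-123/tax-returns?page=2&per_page=50
```

Walking every page matters when a user has more jobs than fit on one page, because a job missing from page 1 is not necessarily missing altogether.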
Response
{
"total": 3,
"per_page": 20,
"current_page": 1,
"last_page": 1,
"result": [
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"tax_return_id": "tax-return-2023-001",
"process_status": "FINISHED",
"error_message": null,
"data": {
"closing_date": "2023-12-31",
"closing_year": "2023",
"type": "C",
"duration": 12,
"revenues": 2500000,
"net_profit": 180000
},
"created_at": "2024-01-15T10:30:00Z",
"updated_at": "2024-01-15T10:35:00Z"
},
{
"id": "660e8400-e29b-41d4-a716-446655440001",
"tax_return_id": "tax-return-2022-001",
"process_status": "IN_PROGRESS",
"error_message": null,
"data": {
"closing_date": "2022-12-31",
"closing_year": "2022"
},
"created_at": "2024-01-15T10:32:00Z",
"updated_at": "2024-01-15T10:32:00Z"
}
]
}
Processing statuses
| Status | Description |
|---|---|
| PENDING | Document uploaded, waiting to be processed |
| IN_PROGRESS | OCR extraction in progress |
| FINISHED | Processing complete, data available |
| ERROR | Processing failed (check error_message) |
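The status table can be turned into a small decision helper. This is a minimal sketch: describeJob and the shape of its return value are hypothetical, while the status names and the error_message field come from the responses shown in this guide.

```javascript
// Statuses that mean the job is still running, per the table above.
const ACTIVE_STATUSES = new Set(['PENDING', 'IN_PROGRESS']);

// Hypothetical helper: summarizes what a job's process_status means for the caller.
function describeJob(job) {
  if (ACTIVE_STATUSES.has(job.process_status)) {
    return { done: false, ok: false, message: 'still processing' };
  }
  if (job.process_status === 'FINISHED') {
    return { done: true, ok: true, message: 'data available' };
  }
  if (job.process_status === 'ERROR') {
    return { done: true, ok: false, message: job.error_message || 'unknown error' };
  }
  // Fail loudly on a status this guide does not list.
  throw new Error(`Unexpected process_status: ${job.process_status}`);
}

console.log(describeJob({ process_status: 'IN_PROGRESS', error_message: null }).done); // false
```

Throwing on an unknown status keeps a polling loop from spinning forever if the set of statuses ever differs from the four listed here.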
Synchronizing the Data
Once the OCR processing is complete (process_status: "FINISHED"), trigger a synchronization to make the extracted data available through the standard tax return endpoints:
{
"data_types": ["TAX_RETURN"]
}
Retrieving Processed Data
After synchronization, retrieve the extracted tax returns:
/api/v6/users/{userId}/tax-returns
Tax returns processed via OCR appear with provider_name: "ocr_service".
{
"total": 1,
"per_page": 20,
"current_page": 1,
"last_page": 1,
"result": [
{
"id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
"revenues": 2500000,
"net_profit": 180000,
"file_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
"closing_date": "2023-12-31",
"closing_year": "2023",
"millesime": "2024",
"submitted_date": "2024-05-15",
"type": "C",
"duration": 12,
"privacy": "PRIVATE",
"provider_name": "ocr_service",
"data_connection_id": "550e8400-e29b-41d4-a716-446655440000",
"warnings": []
}
]
}
Retrieve a single tax return
/api/v6/users/{userId}/tax-returns/{taxReturnId}
{
"id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
"revenues": 2500000,
"net_profit": 180000,
"file_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
"closing_date": "2023-12-31",
"closing_year": "2023",
"millesime": "2024",
"submitted_date": "2024-05-15",
"type": "C",
"duration": 12,
"privacy": "PRIVATE",
"provider_name": "ocr_service",
"data_connection_id": "550e8400-e29b-41d4-a716-446655440000",
"warnings": [],
"tax_return_values": [
{ "code": "FL", "values": [2500000, 0, 0, 0] },
{ "code": "HN", "values": [180000, 0, 0, 0] }
]
}
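The tax_return_values array pairs a form code with a list of values, so a lookup by code is a common first step. The sketch below uses the sample response above; getTaxReturnValues is a hypothetical helper name. This guide does not describe what each of the four positions in values means, so the helper returns the whole array rather than guessing.

```javascript
// Hypothetical helper: returns the values array for a given code, or null if absent.
function getTaxReturnValues(taxReturn, code) {
  const entry = (taxReturn.tax_return_values || []).find(v => v.code === code);
  return entry ? entry.values : null;
}

// Trimmed copy of the sample response from this guide.
const sampleTaxReturn = {
  revenues: 2500000,
  net_profit: 180000,
  tax_return_values: [
    { code: 'FL', values: [2500000, 0, 0, 0] },
    { code: 'HN', values: [180000, 0, 0, 0] }
  ]
};

console.log(getTaxReturnValues(sampleTaxReturn, 'FL')); // [ 2500000, 0, 0, 0 ]
console.log(getTaxReturnValues(sampleTaxReturn, 'ZZ')); // null
```

In the sample, the first value of FL equals revenues and the first value of HN equals net_profit. That is an observation about this one example, not a documented mapping.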
End-to-End Workflow
// qardApi and waitForSyncCompletion are assumed to come from your own API client code.
async function processDocumentWithOCR(userId, pdfBase64, fiscalYear) {
  // 1. Ensure OCR Service connection exists
  const connections = await qardApi.getDataConnections(userId);
  const ocrConnection = connections.find(c => c.provider_name === 'ocr_service');
  if (!ocrConnection) {
    await qardApi.createDataConnection(userId, {
      requested_data_types: ['TAX_RETURN'],
      provider_name: 'ocr_service'
    });
  }
  // 2. Upload the document for OCR processing
  const ocrJob = await qardApi.post(`/input/users/${userId}/tax-returns/ocr-service`, {
    tax_return_id: `tax-return-${fiscalYear}-${Date.now()}`,
    data: {
      file: pdfBase64,
      closing_date: `${fiscalYear}-12-31`,
      closing_year: String(fiscalYear),
      type: 'C',
      duration: 12
    }
  });
  console.log(`OCR job created: ${ocrJob.id}`);
  // 3. Poll for OCR completion
  let status = 'PENDING';
  while (status === 'PENDING' || status === 'IN_PROGRESS') {
    await sleep(5000); // Wait 5 seconds between checks
    const jobs = await qardApi.get(`/input/users/${userId}/tax-returns`);
    const currentJob = jobs.result.find(j => j.id === ocrJob.id);
    if (!currentJob) {
      // The list is paginated (20 items per page by default), so the job may be on a later page
      throw new Error(`OCR job ${ocrJob.id} not found in the first page of results`);
    }
    status = currentJob.process_status;
    console.log(`OCR status: ${status}`);
    if (status === 'ERROR') {
      throw new Error(`OCR failed: ${currentJob.error_message}`);
    }
  }
  // 4. Sync to integrate extracted data
  await qardApi.sync(userId, {
    data_types: ['TAX_RETURN']
  });
  // 5. Wait for sync completion
  await waitForSyncCompletion(userId);
  // 6. Retrieve the processed tax return
  const taxReturns = await qardApi.getTaxReturns(userId, {
    filter: { closing_year: fiscalYear, provider_name: 'ocr_service' }
  });
  return taxReturns[0];
}
// Helper function
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}
Batch Processing
Process multiple documents efficiently:
async function batchProcessDocuments(userId, documents) {
  const results = [];
  // Upload all documents
  for (const doc of documents) {
    const job = await qardApi.post(`/input/users/${userId}/tax-returns/ocr-service`, {
      tax_return_id: doc.id,
      data: {
        file: doc.pdfBase64,
        closing_date: doc.closingDate,
        closing_year: doc.closingYear,
        type: doc.type || 'C',
        duration: doc.duration || 12
      }
    });
    results.push({ documentId: doc.id, jobId: job.id });
  }
  // Wait for all to complete.
  // Note: only the first page of jobs is checked here (20 items by default);
  // jobs on later pages are not seen by this loop.
  let allComplete = false;
  while (!allComplete) {
    await sleep(10000);
    const jobs = await qardApi.get(`/input/users/${userId}/tax-returns`);
    const pendingJobs = jobs.result.filter(j =>
      results.some(r => r.jobId === j.id) &&
      (j.process_status === 'PENDING' || j.process_status === 'IN_PROGRESS')
    );
    allComplete = pendingJobs.length === 0;
    console.log(`${pendingJobs.length} jobs still processing...`);
  }
  // Single sync for all documents.
  // Jobs that ended in ERROR are not reported here; check process_status per job if you need that.
  await qardApi.sync(userId, { data_types: ['TAX_RETURN'] });
  return results;
}
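The batch function above returns the submitted job IDs but does not say which jobs succeeded. One way to report that is to group the submitted documents by the status of their job, sketched below. summarizeJobs and the shape of its result are hypothetical; the input shapes mirror the results array built above and the result array of the job list response.

```javascript
// Hypothetical helper: groups submitted documents by the outcome of their OCR job.
// submitted: [{ documentId, jobId }], jobs: entries from the job list response.
function summarizeJobs(submitted, jobs) {
  const byId = new Map(jobs.map(j => [j.id, j]));
  const summary = { finished: [], failed: [], processing: [], missing: [] };
  for (const { documentId, jobId } of submitted) {
    const job = byId.get(jobId);
    if (!job) {
      // Not in the fetched list, for example because it sits on another page.
      summary.missing.push(documentId);
    } else if (job.process_status === 'FINISHED') {
      summary.finished.push(documentId);
    } else if (job.process_status === 'ERROR') {
      summary.failed.push({ documentId, error: job.error_message });
    } else {
      summary.processing.push(documentId);
    }
  }
  return summary;
}

const batchSummary = summarizeJobs(
  [
    { documentId: 'doc-a', jobId: 'job-1' },
    { documentId: 'doc-b', jobId: 'job-2' },
    { documentId: 'doc-c', jobId: 'job-3' }
  ],
  [
    { id: 'job-1', process_status: 'FINISHED', error_message: null },
    { id: 'job-2', process_status: 'ERROR', error_message: 'UNREADABLE_DOCUMENT' }
  ]
);
console.log(batchSummary.finished); // [ 'doc-a' ]
console.log(batchSummary.missing);  // [ 'doc-c' ]
```

Keeping a separate missing bucket makes it visible when a job was not found in the fetched list, instead of silently counting it as complete.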
Error Handling
Common OCR errors and solutions:
| Error | Cause | Solution |
|---|---|---|
INVALID_FORMAT | File is not a valid PDF | Verify file format before upload |
UNREADABLE_DOCUMENT | Document too blurry or damaged | Request a clearer scan |
UNSUPPORTED_FORM | Form type not recognized | Check supported form types |
EXTRACTION_FAILED | OCR could not extract data | Try with higher quality scan |
// maxRetries is the total number of attempts, including the first one.
async function handleOCRWithRetry(userId, pdfBase64, fiscalYear, maxRetries = 2) {
  let attempts = 0;
  while (attempts < maxRetries) {
    try {
      return await processDocumentWithOCR(userId, pdfBase64, fiscalYear);
    } catch (error) {
      attempts++;
      console.error(`OCR attempt ${attempts} failed: ${error.message}`);
      if (attempts >= maxRetries) {
        throw new Error(`OCR failed after ${maxRetries} attempts: ${error.message}`);
      }
      // Wait before retry
      await sleep(10000);
    }
  }
}
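Some failures, such as INVALID_FORMAT, can be caught before anything is uploaded, which avoids a retry that cannot succeed. The sketch below checks for the "%PDF-" signature that PDF files start with and for the 10 MB guideline mentioned under Best Practices. checkPdfBeforeUpload is a hypothetical helper, and reading "10 MB" as 10 * 1024 * 1024 bytes is an assumption.

```javascript
// Assumption: "10 MB" is taken as 10 * 1024 * 1024 bytes and treated as a soft limit.
const MAX_PDF_BYTES = 10 * 1024 * 1024;

// Hypothetical helper: returns a list of problems found; an empty list means none were found.
function checkPdfBeforeUpload(buffer) {
  const problems = [];
  // PDF files begin with the ASCII signature "%PDF-".
  if (buffer.length < 5 || buffer.subarray(0, 5).toString('latin1') !== '%PDF-') {
    problems.push('not a PDF (missing %PDF- header)');
  }
  if (buffer.length > MAX_PDF_BYTES) {
    problems.push('larger than 10 MB');
  }
  return problems;
}

console.log(checkPdfBeforeUpload(Buffer.from('%PDF-1.4\n'))); // []
console.log(checkPdfBeforeUpload(Buffer.from('hello world')));
// [ 'not a PDF (missing %PDF- header)' ]
```

This only inspects the file header and size. It cannot tell whether a scan is legible, so UNREADABLE_DOCUMENT and EXTRACTION_FAILED still have to be handled after processing.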
Best Practices
Document quality: Use high-resolution scans (300 DPI minimum) for best results.
File size: Keep PDF files under 10 MB for optimal processing speed.
Unique identifiers: Use meaningful tax_return_id values to track documents.
Batch wisely: Group related documents and sync once after all are processed.
Monitor status: Implement proper polling with exponential backoff for production.
Handle errors gracefully: Always check error_message when status is ERROR.
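The exponential backoff recommended above can be sketched as a pure delay calculation. backoffDelayMs and its option names are hypothetical. The 5-second base matches the polling interval used in the workflow example, while the doubling factor and the 60-second cap are assumptions to tune for your own workload.

```javascript
// Hypothetical helper: delay before the next status check, growing with each attempt.
// attempt is zero-based: 0 for the first wait, 1 for the second, and so on.
function backoffDelayMs(attempt, { baseMs = 5000, factor = 2, maxMs = 60000 } = {}) {
  return Math.min(maxMs, baseMs * Math.pow(factor, attempt));
}

console.log([0, 1, 2, 3, 4].map(a => backoffDelayMs(a)));
// [ 5000, 10000, 20000, 40000, 60000 ]
```

In a polling loop, the fixed wait would be replaced by a wait of backoffDelayMs(attempt), with attempt incremented on each pass. Capping the delay keeps long-running jobs from being checked too rarely.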
See Also
- OCR Service Provider - Provider configuration details
- Tax Return - Data format reference
- Tax Return Analysis - Computed financial ratios
- Tax Bundles - Complete tax bundle collection guide