OCR Service

This guide shows you how to use the ocr_service provider to extract tax return data from PDF documents using optical character recognition (OCR).

Goal

Enable automated data extraction from tax documents when:

  • You have tax returns in PDF format (scans or exports)
  • The company's tax filings are not yet available from official sources
  • You need to process historical documents not available digitally
  • You want to accelerate data collection from client-provided documents

Use Cases

OCR Service actions

Extract structured financial data from tax return PDFs automatically.

Extract

Digitize documents

  • Process scanned tax bundles
  • Extract data from PDF exports
  • Parse handwritten or typed forms

Accelerate

Speed up onboarding

  • Skip manual data entry
  • Process documents in batch
  • Get structured data immediately

Complement

Fill the gaps

  • Add missing fiscal years
  • Complement INPI public data
  • Process foreign documents

Supported Data

| Data Type | Description | Output |
| --- | --- | --- |
| Tax Return | Tax bundles (2050, 2033, etc.) | Structured financial data |
| Tax Return Analysis | Financial ratios and insights | Computed indicators |

Prerequisites

Before using ocr_service, ensure:

  1. You have the PDF documents to process
  2. Documents are readable (not too blurry or damaged)
  3. The user exists in your system

Configuration

Step 1: Enable the OCR Service provider

PUT /api/v6/providers/ocr_service
{
  "enable": true
}

PUT /api/v6/providers/ocr_service/settings
{
  "auto_connect": false
}

Step 2: Create the data connection

POST /api/v6/users/{userId}/data-connections
{
  "requested_data_types": ["TAX_RETURN"],
  "provider_name": "ocr_service"
}
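
The two configuration steps above can be scripted. The following is a minimal sketch rather than an official client: `buildSetupRequests` and `runSetup` are hypothetical names, and the base URL and authentication headers are assumptions that the caller must supply, since this guide does not specify them.

```javascript
// Hypothetical helper: describes the configuration calls from Steps 1 and 2.
function buildSetupRequests(userId) {
  return [
    { method: 'PUT', path: '/api/v6/providers/ocr_service', body: { enable: true } },
    { method: 'PUT', path: '/api/v6/providers/ocr_service/settings', body: { auto_connect: false } },
    {
      method: 'POST',
      path: `/api/v6/users/${encodeURIComponent(userId)}/data-connections`,
      body: { requested_data_types: ['TAX_RETURN'], provider_name: 'ocr_service' },
    },
  ];
}

// Sends the requests in order and stops at the first non-2xx response.
// baseUrl and headers (including authentication) are assumptions supplied by
// the caller; fetchImpl defaults to the global fetch and can be swapped in tests.
async function runSetup(baseUrl, headers, userId, fetchImpl = fetch) {
  for (const { method, path, body } of buildSetupRequests(userId)) {
    const response = await fetchImpl(`${baseUrl}${path}`, {
      method,
      headers: { 'Content-Type': 'application/json', ...headers },
      body: JSON.stringify(body),
    });
    if (!response.ok) {
      throw new Error(`${method} ${path} failed with HTTP ${response.status}`);
    }
  }
}
```

Keeping the request descriptions separate from the code that sends them makes the sequence easy to inspect or log before anything reaches the API.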

Uploading Documents for OCR

Submit a tax return PDF

Use the following endpoint to upload a tax bundle document for OCR processing:

POST /api/v6/input/users/{userId}/tax-returns/ocr-service
{
  "tax_return_id": "tax-return-2023-001",
  "data": {
    "file": "JVBERi0xLjQKJeLjz9...",
    "closing_date": "2023-12-31",
    "closing_year": "2023",
    "submitted_date": "2024-05-15",
    "type": "C",
    "duration": 12,
    "privacy": "PRIVATE"
  }
}

Request parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| tax_return_id | string | Yes | Your unique identifier for this tax return |
| data.file | string | Yes | Base64-encoded PDF document |
| data.closing_date | date | Yes | Fiscal year end date (YYYY-MM-DD) |
| data.closing_year | string | Yes | Fiscal year (e.g., "2023") |
| data.submitted_date | date | No | Date the return was filed |
| data.type | string | No | Bundle type: C (full), S (simplified), K (consolidated) |
| data.duration | integer | No | Fiscal year duration in months (default: 12) |
| data.privacy | string | No | Visibility setting |
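
As an illustration of the required fields, the sketch below builds the request body from raw PDF bytes in Node.js. `buildOcrUploadPayload` is a hypothetical helper, not part of the API. It assumes the fiscal year is the year of `closing_date`, which matches the example request but may not hold for every filing, and it checks for the `%PDF-` signature so that an obviously wrong file is caught before upload.

```javascript
// Hypothetical helper: builds the upload body from a Buffer of PDF bytes.
function buildOcrUploadPayload(pdfBuffer, { taxReturnId, closingDate, type = 'C', duration = 12 }) {
  // A PDF file starts with the ASCII bytes "%PDF-".
  if (pdfBuffer.subarray(0, 5).toString('latin1') !== '%PDF-') {
    throw new Error('File does not look like a PDF');
  }
  if (!/^\d{4}-\d{2}-\d{2}$/.test(closingDate)) {
    throw new Error('closingDate must use the YYYY-MM-DD format');
  }
  return {
    tax_return_id: taxReturnId,
    data: {
      file: pdfBuffer.toString('base64'),
      closing_date: closingDate,
      // Assumption: the fiscal year equals the year of the closing date.
      closing_year: closingDate.slice(0, 4),
      type,
      duration,
    },
  };
}
```

In practice the buffer would typically come from `fs.readFileSync(path)`; the optional fields `submitted_date` and `privacy` are omitted here and can be added to `data` when needed.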

Response

{
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "tax_return_id": "tax-return-2023-001",
  "process_status": "PENDING",
  "error_message": null,
  "data": {
    "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
    "revenues": 0,
    "net_profit": 0,
    "file_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
    "closing_date": "2023-12-31",
    "closing_year": "2023",
    "submitted_date": "2024-05-15",
    "type": "C",
    "duration": 12,
    "privacy": "PRIVATE",
    "tax_return_values": [],
    "millesime": "2024"
  }
}

Checking OCR Status

List all OCR processing jobs

GET /api/v6/input/users/{userId}/tax-returns

Query parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| page | integer | 1 | Page number |
| per_page | integer | 20 | Items per page |

Response

{
  "total": 3,
  "per_page": 20,
  "current_page": 1,
  "last_page": 1,
  "result": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "tax_return_id": "tax-return-2023-001",
      "process_status": "FINISHED",
      "error_message": null,
      "data": {
        "closing_date": "2023-12-31",
        "closing_year": "2023",
        "type": "C",
        "duration": 12,
        "revenue": 2500000,
        "net_profit": 180000
      },
      "created_at": "2024-01-15T10:30:00Z",
      "updated_at": "2024-01-15T10:35:00Z"
    },
    {
      "id": "660e8400-e29b-41d4-a716-446655440001",
      "tax_return_id": "tax-return-2022-001",
      "process_status": "IN_PROGRESS",
      "error_message": null,
      "data": {
        "closing_date": "2022-12-31",
        "closing_year": "2022"
      },
      "created_at": "2024-01-15T10:32:00Z",
      "updated_at": "2024-01-15T10:32:00Z"
    }
  ]
}
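
Because the listing is paginated, a given job is not guaranteed to be on the first page. The sketch below walks the pages using the `last_page` field from the response until it finds the job. `findOcrJob` and the `getPage` callback are hypothetical: `getPage(page)` stands for whatever code performs the GET request with the `page` query parameter and returns the parsed JSON body.

```javascript
// getPage is a hypothetical caller-supplied function: given a page number, it
// returns (or resolves to) the parsed body of
// GET /api/v6/input/users/{userId}/tax-returns?page=<page>.
async function findOcrJob(getPage, jobId) {
  let page = 1;
  while (true) {
    const body = await getPage(page);
    const job = body.result.find((j) => j.id === jobId);
    if (job) return job;
    // Stop when there is no further page (also covers a missing last_page).
    if (!(page < body.last_page)) return null;
    page += 1;
  }
}
```

The function resolves to `null` when the job is on no page, which lets the caller distinguish "not found" from a job that is still processing.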

Processing statuses

| Status | Description |
| --- | --- |
| PENDING | Document uploaded, waiting to be processed |
| IN_PROGRESS | OCR extraction in progress |
| FINISHED | Processing complete, data available |
| ERROR | Processing failed (check error_message) |
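
One way to act on these statuses is to map each job to the next step in the workflow. The helper below is a sketch with a hypothetical name (`nextStepFor`); it treats any status not listed above as unexpected and throws, on the assumption that silently ignoring an unknown status would hide a problem.

```javascript
// Maps a job object from the listing to the next step in the workflow.
function nextStepFor(job) {
  switch (job.process_status) {
    case 'PENDING':
    case 'IN_PROGRESS':
      return { action: 'wait' };
    case 'FINISHED':
      return { action: 'sync' };
    case 'ERROR':
      return { action: 'fail', reason: job.error_message };
    default:
      throw new Error(`Unknown process_status: ${job.process_status}`);
  }
}
```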

Synchronizing the Data

Once the OCR processing is complete (process_status: "FINISHED"), trigger a synchronization to make the extracted data available through the standard tax return endpoints:

POST /api/v6/users/{userId}/sync
{
  "data_types": ["TAX_RETURN"]
}

Retrieving Processed Data

After synchronization, retrieve the extracted tax returns:

GET /api/v6/users/{userId}/tax-returns

Tax returns processed via OCR appear with provider_name: "ocr_service".

{
  "total": 1,
  "per_page": 20,
  "current_page": 1,
  "last_page": 1,
  "result": [
    {
      "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "revenues": 2500000,
      "net_profit": 180000,
      "file_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "closing_date": "2023-12-31",
      "closing_year": "2023",
      "millesime": "2024",
      "submitted_date": "2024-05-15",
      "type": "C",
      "duration": 12,
      "privacy": "PRIVATE",
      "provider_name": "ocr_service",
      "data_connection_id": "550e8400-e29b-41d4-a716-446655440000",
      "warnings": []
    }
  ]
}

Retrieve a single tax return

GET /api/v6/users/{userId}/tax-returns/{taxReturnId}

{
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "revenues": 2500000,
  "net_profit": 180000,
  "file_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "closing_date": "2023-12-31",
  "closing_year": "2023",
  "millesime": "2024",
  "submitted_date": "2024-05-15",
  "type": "C",
  "duration": 12,
  "privacy": "PRIVATE",
  "provider_name": "ocr_service",
  "data_connection_id": "550e8400-e29b-41d4-a716-446655440000",
  "warnings": [],
  "tax_return_values": [
    { "code": "FL", "values": [2500000, 0, 0, 0] },
    { "code": "HN", "values": [180000, 0, 0, 0] }
  ]
}
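
The `tax_return_values` array pairs a form box code with a list of values. A small lookup helper, sketched below under the hypothetical name `getTaxReturnValue`, makes individual boxes easier to read. This guide does not document what each position in `values` means, so the default of position 0 is an assumption based only on the example response, where the first value of `FL` equals `revenues` and the first value of `HN` equals `net_profit`.

```javascript
// Returns the value stored for a box code, or undefined if the code is absent.
// Assumption: position 0 of "values" is the figure of interest, as in the
// example response; the meaning of the other positions is not documented here.
function getTaxReturnValue(taxReturn, code, position = 0) {
  const entry = (taxReturn.tax_return_values || []).find((v) => v.code === code);
  return entry ? entry.values[position] : undefined;
}

const example = {
  revenues: 2500000,
  net_profit: 180000,
  tax_return_values: [
    { code: 'FL', values: [2500000, 0, 0, 0] },
    { code: 'HN', values: [180000, 0, 0, 0] },
  ],
};

console.log(getTaxReturnValue(example, 'FL')); // 2500000
console.log(getTaxReturnValue(example, 'ZZ')); // undefined
```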

End-to-End Workflow

// qardApi and waitForSyncCompletion are not defined in this guide; they stand
// for your own API client wrapper and sync-waiting helper.
async function processDocumentWithOCR(userId, pdfBase64, fiscalYear) {
  // 1. Ensure OCR Service connection exists
  const connections = await qardApi.getDataConnections(userId);
  const ocrConnection = connections.find(c => c.provider_name === 'ocr_service');

  if (!ocrConnection) {
    await qardApi.createDataConnection(userId, {
      requested_data_types: ['TAX_RETURN'],
      provider_name: 'ocr_service'
    });
  }

  // 2. Upload the document for OCR processing
  //    (closing_date assumes a fiscal year ending on December 31)
  const ocrJob = await qardApi.post(`/input/users/${userId}/tax-returns/ocr-service`, {
    tax_return_id: `tax-return-${fiscalYear}-${Date.now()}`,
    data: {
      file: pdfBase64,
      closing_date: `${fiscalYear}-12-31`,
      closing_year: String(fiscalYear),
      type: 'C',
      duration: 12
    }
  });

  console.log(`OCR job created: ${ocrJob.id}`);

  // 3. Poll for OCR completion
  let status = 'PENDING';
  while (status === 'PENDING' || status === 'IN_PROGRESS') {
    await sleep(5000); // Wait 5 seconds between checks

    const jobs = await qardApi.get(`/input/users/${userId}/tax-returns`);
    const currentJob = jobs.result.find(j => j.id === ocrJob.id);

    // The listing is paginated, so the job may be missing from this page
    if (!currentJob) {
      throw new Error(`OCR job ${ocrJob.id} not found in the job list`);
    }

    status = currentJob.process_status;
    console.log(`OCR status: ${status}`);

    if (status === 'ERROR') {
      throw new Error(`OCR failed: ${currentJob.error_message}`);
    }
  }

  // 4. Sync to integrate extracted data
  await qardApi.sync(userId, {
    data_types: ['TAX_RETURN']
  });

  // 5. Wait for sync completion
  await waitForSyncCompletion(userId);

  // 6. Retrieve the processed tax return
  const taxReturns = await qardApi.getTaxReturns(userId, {
    filter: { closing_year: String(fiscalYear), provider_name: 'ocr_service' }
  });

  return taxReturns[0];
}

// Helper function
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

Batch Processing

Process multiple documents efficiently:

async function batchProcessDocuments(userId, documents) {
  const results = [];

  // Upload all documents
  for (const doc of documents) {
    const job = await qardApi.post(`/input/users/${userId}/tax-returns/ocr-service`, {
      tax_return_id: doc.id,
      data: {
        file: doc.pdfBase64,
        closing_date: doc.closingDate,
        closing_year: doc.closingYear,
        type: doc.type || 'C',
        duration: doc.duration || 12
      }
    });
    results.push({ documentId: doc.id, jobId: job.id });
  }

  // Wait for all to complete.
  // Note: only the first page of the listing is checked, and jobs in ERROR
  // count as complete here, so inspect error_message before using the results.
  let allComplete = false;
  while (!allComplete) {
    await sleep(10000);

    const jobs = await qardApi.get(`/input/users/${userId}/tax-returns`);
    const pendingJobs = jobs.result.filter(j =>
      results.some(r => r.jobId === j.id) &&
      (j.process_status === 'PENDING' || j.process_status === 'IN_PROGRESS')
    );

    allComplete = pendingJobs.length === 0;
    console.log(`${pendingJobs.length} jobs still processing...`);
  }

  // Single sync for all documents
  await qardApi.sync(userId, { data_types: ['TAX_RETURN'] });

  return results;
}
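
The batch loop above stops waiting once no tracked job is PENDING or IN_PROGRESS, which means failed jobs are treated the same as finished ones. A possible complement, sketched here with the hypothetical name `summarizeJobs`, groups the tracked jobs by outcome so that failures and jobs absent from the fetched listing are reported explicitly.

```javascript
// Groups the tracked jobs by outcome. "jobs" is the "result" array from the
// job listing; "jobIds" are the ids returned when the documents were uploaded.
function summarizeJobs(jobs, jobIds) {
  const wanted = new Set(jobIds);
  const seen = new Set();
  const summary = { finished: [], failed: [], pending: [], missing: [] };

  for (const job of jobs) {
    if (!wanted.has(job.id)) continue;
    seen.add(job.id);
    if (job.process_status === 'FINISHED') {
      summary.finished.push(job.id);
    } else if (job.process_status === 'ERROR') {
      summary.failed.push({ id: job.id, error: job.error_message });
    } else {
      summary.pending.push(job.id);
    }
  }

  // Ids that were uploaded but do not appear in the listing that was fetched.
  for (const id of wanted) {
    if (!seen.has(id)) summary.missing.push(id);
  }
  return summary;
}
```

A caller could, for example, run the sync only when `pending` and `missing` are empty and then surface the entries in `failed` to the user.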

Error Handling

Common OCR errors and solutions:

| Error | Cause | Solution |
| --- | --- | --- |
| INVALID_FORMAT | File is not a valid PDF | Verify file format before upload |
| UNREADABLE_DOCUMENT | Document too blurry or damaged | Request a clearer scan |
| UNSUPPORTED_FORM | Form type not recognized | Check supported form types |
| EXTRACTION_FAILED | OCR could not extract data | Try with higher quality scan |

async function handleOCRWithRetry(userId, pdfBase64, fiscalYear, maxRetries = 2) {
  let attempts = 0;

  while (attempts < maxRetries) {
    try {
      return await processDocumentWithOCR(userId, pdfBase64, fiscalYear);
    } catch (error) {
      attempts++;
      console.error(`OCR attempt ${attempts} failed: ${error.message}`);

      if (attempts >= maxRetries) {
        throw new Error(`OCR failed after ${maxRetries} attempts: ${error.message}`);
      }

      // Wait before retry
      await sleep(10000);
    }
  }
}
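
Every solution in the error table involves changing the document, so resubmitting the same file is unlikely to help for those errors. The sketch below maps an error to the suggested solution from the table so it can be shown to whoever supplies the document. `suggestSolution` is a hypothetical helper, and it assumes that `error_message` contains one of the error codes as a substring, which this guide does not confirm.

```javascript
// Solutions copied from the error table above.
const OCR_ERROR_SOLUTIONS = {
  INVALID_FORMAT: 'Verify file format before upload',
  UNREADABLE_DOCUMENT: 'Request a clearer scan',
  UNSUPPORTED_FORM: 'Check supported form types',
  EXTRACTION_FAILED: 'Try with higher quality scan',
};

// Assumption: error_message contains the error code as a substring.
// Returns null when no known code is found (including a null message).
function suggestSolution(errorMessage) {
  const text = String(errorMessage);
  const code = Object.keys(OCR_ERROR_SOLUTIONS).find((c) => text.includes(c));
  return code ? { code, solution: OCR_ERROR_SOLUTIONS[code] } : null;
}
```

A `null` result is a reasonable signal that the failure was something else, such as a network problem, where a retry like the one above may be worthwhile.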

Best Practices

  1. Document quality: Use high-resolution scans (300 DPI minimum) for best results.

  2. File size: Keep PDF files under 10 MB for optimal processing speed.

  3. Unique identifiers: Use meaningful tax_return_id values to track documents.

  4. Batch wisely: Group related documents and sync once after all are processed.

  5. Monitor status: Implement proper polling with exponential backoff for production.

  6. Handle errors gracefully: Always check error_message when status is ERROR.
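
The polling advice in item 5 could look like the following sketch. The names (`backoffDelays`, `pollUntilDone`) and the default timings are illustrative assumptions, not values recommended by the API. `checkStatus` is a hypothetical caller-supplied function that returns the current `process_status` of the job.

```javascript
// Computes the wait before each retry: initialMs, then multiplied by factor
// each time, capped at maxMs.
function backoffDelays({ initialMs = 1000, factor = 2, maxMs = 30000, attempts = 8 } = {}) {
  const delays = [];
  let delay = initialMs;
  for (let i = 0; i < attempts; i += 1) {
    delays.push(Math.min(delay, maxMs));
    delay *= factor;
  }
  return delays;
}

const defaultWait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Polls until the job leaves PENDING / IN_PROGRESS, waiting longer after each
// check. Throws if the job is still running after the last attempt.
async function pollUntilDone(checkStatus, options = {}, wait = defaultWait) {
  for (const delay of backoffDelays(options)) {
    const status = await checkStatus();
    if (status !== 'PENDING' && status !== 'IN_PROGRESS') return status;
    await wait(delay);
  }
  throw new Error('Timed out waiting for the OCR job to complete');
}
```

Unlike a fixed-interval loop, this gives up after a bounded number of checks, so a job that never leaves PENDING cannot block the caller indefinitely.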
