Spire.Office Knowledgebase Page 1

Subscribe to this RSS feed

Knowledgebase (2344)

Children categories

Spire.OfficeJs (6)

View items...

Convert PDF to JSON in C#: Text, Tables, Forms & OCR

2026-06-26 08:56:09 Written by Allen Yang

Convert PDF to JSON in C# — extract text, tables, and form fields to structured JSON

Your application receives a PDF invoice. You need the invoice number, vendor name, and line items — not as text on a page, but as structured JSON your API can consume. That is the real problem behind PDF to JSON conversion.

Unlike CSV or XML, a PDF file has no inherent data structure — no fields, no rows, no schema. Extracting usable JSON requires different approaches depending on what the document actually contains: plain text with key-value patterns, tables with rows and columns, fillable form fields, or scanned images that need OCR.

This article covers all four scenarios with runnable C# code using Spire.PDF for .NET. We build a real invoice-to-JSON converter, handle common table extraction problems like merged cells and missing headers, and package everything into a reusable PdfToJsonConverter class you can drop into any .NET project.

Quick Navigation

What "PDF to JSON" Actually Means
Install Spire.PDF for .NET
Convert PDF Text to JSON in C#
Convert PDF Tables to JSON in C#
Convert PDF Form Fields to JSON
Invoice PDF to JSON: A Real-World Example
Convert Multiple PDFs to JSON in Batch
Build a PDF to JSON Converter in C#
Convert OCR Output to JSON in C#
Performance Considerations
FAQ

1. What "PDF to JSON" Actually Means

There is no built-in "PDF to JSON" conversion in the way you might convert a CSV to JSON. A PDF has no JSON structure. What developers actually need is: extract content from a PDF, then shape that content into a JSON format that matches their use case.

Depending on the PDF type and business requirement, the target JSON falls into one of three categories.

Raw Text JSON

Pull all text from each page and wrap it in a JSON envelope. Works for search indexing, RAG pipelines, and document archival.

{
  "sourceFile": "Contract.pdf",
  "pages": [
    { "pageNumber": 1, "text": "SERVICE AGREEMENT\nBetween Contoso Ltd and..." }
  ]
}

Key-Value JSON

Many PDFs follow a Label: Value pattern — employee records, registration forms, simple invoices. The goal here is to parse those pairs into a flat JSON object:

{
  "name": "John Smith",
  "email": "john@contoso.com",
  "department": "Engineering",
  "employeeId": "EMP-2026-0142"
}

Structured Business JSON

Real business documents have nested data: an invoice has a header, line items, tax breakdowns, and payment terms. The JSON output needs to mirror that structure:

{
  "invoiceNumber": "INV-2026-0042",
  "vendor": "Contoso Ltd",
  "date": "2026-06-15",
  "lineItems": [
    { "description": "Widget A", "quantity": 150, "unitPrice": 24.50, "total": 3675.00 }
  ],
  "subtotal": 3675.00,
  "tax": 294.00,
  "total": 3969.00
}

This distinction matters. When you search for "convert PDF to JSON," you need to decide which output format your application requires. The rest of this article shows how to build each one using Spire.PDF in C#.

2. Install Spire.PDF for .NET

Install via NuGet Package Manager Console:

Install-Package Spire.PDF

Or add to your .csproj:

<PackageReference Include="Spire.PDF" Version="*" />

Include these namespaces in your project:

using Spire.Pdf;
using Spire.Pdf.Texts;
using Spire.Pdf.Utilities;
using Spire.Pdf.Fields;
using Spire.Pdf.Widget;
using System.Text.Json;
using System.Text.Json.Serialization;

Spire.PDF supports .NET Framework, .NET Core, and .NET 6/7/8/9+.

3. Convert PDF Text to JSON in C#

The most common starting point: extract text from a PDF and produce JSON output.

Extract Text from PDF

using Spire.Pdf;
using Spire.Pdf.Texts;
using System.Collections.Generic;

using (PdfDocument pdf = new PdfDocument())
{
    pdf.LoadFromFile("EmployeeRecord.pdf");

    var pages = new List<Dictionary<string, string>>();

    for (int i = 0; i < pdf.Pages.Count; i++)
    {
        PdfPageBase page = pdf.Pages[i];

        PdfTextExtractOptions options = new PdfTextExtractOptions();
        options.IsExtractAllText = true;

        PdfTextExtractor extractor = new PdfTextExtractor(page);
        string pageText = extractor.ExtractText(options);

        pages.Add(new Dictionary<string, string>
        {
            { "pageNumber", (i + 1).ToString() },
            { "text", pageText.Trim() }
        });
    }
}

Parse Key-Value Pairs into JSON

If your PDF follows a Label: Value pattern, parse the extracted text into structured fields:

using System.Text.Json;

var parsedFields = new Dictionary<string, string>();

foreach (var page in pages)
{
    string[] lines = page["text"].Split('\n');
    foreach (string line in lines)
    {
        int colonIndex = line.IndexOf(':');
        if (colonIndex > 0)
        {
            string key = line.Substring(0, colonIndex).Trim();
            string value = line.Substring(colonIndex + 1).Trim();
            parsedFields[key] = value;
        }
    }
}

var jsonOptions = new JsonSerializerOptions
{
    WriteIndented = true,
    PropertyNamingPolicy = JsonNamingPolicy.CamelCase
};

string jsonOutput = JsonSerializer.Serialize(parsedFields, jsonOptions);
File.WriteAllText("EmployeeRecord.json", jsonOutput);

Key API Calls

PdfDocument.LoadFromFile() — opens the PDF file
PdfTextExtractor.ExtractText() — extracts text content from a page
PdfTextExtractOptions.IsExtractAllText — preserves whitespace and formatting

Output

The following example shows the structured JSON generated from the extracted employee record.

{
  "name": "John Smith",
  "email": "john.smith@contoso.com",
  "department": "Engineering",
  "employeeId": "EMP-2026-0142",
  "startDate": "2024-03-15"
}

The following screenshot shows the actual JSON file generated after running the example.

Convert PDF Text to JSON in C#

This approach works well for forms, records, and documents with consistent key-value layouts. For unstructured text, skip the parsing step and serialize the raw pages directly.

If you need a deeper look at PDF text extraction, see our dedicated guide on extracting text from PDFs in C# using Spire.PDF for .NET.

4. Convert PDF Tables to JSON in C#

The previous section focused on extracting plain text from PDFs. While that works well for paragraphs and simple records, many business documents organize their most valuable information in tables, such as invoice line items, sales reports, and financial statements. To preserve rows, columns, and relationships between cells, table data must be extracted differently before it can be converted into structured JSON.

Why Table Extraction Is Different from Text Extraction

Text extraction returns a flat stream of characters in reading order. Although a table may appear perfectly organized on the page, the extracted text often loses its row-and-column structure, making it difficult to identify which values belong together.

To preserve the table layout, you need a dedicated table extraction engine. PdfTableExtractor analyzes the page layout, detects table boundaries, and returns PdfTable objects that you can iterate row by row and cell by cell. Instead of producing a flat string such as:

Widget A 150 $24.50 $3,675.00

it enables you to generate structured JSON like:

{
  "Product": "Widget A",
  "Quantity": "150",
  "Unit Price": "$24.50",
  "Total": "$3,675.00"
}

The following example demonstrates how to extract tables from a PDF and serialize them into JSON.

Extract Tables from PDF

using Spire.Pdf;
using Spire.Pdf.Utilities;
using System.Collections.Generic;

using (PdfDocument pdf = new PdfDocument())
{
    pdf.LoadFromFile("SalesReport.pdf");

    PdfTableExtractor tableExtractor = new PdfTableExtractor(pdf);
    var allTables = new List<List<List<string>>>();

    for (int pageIndex = 0; pageIndex < pdf.Pages.Count; pageIndex++)
    {
        PdfTable[] tables = tableExtractor.ExtractTable(pageIndex);

        if (tables != null && tables.Length > 0)
        {
            foreach (PdfTable table in tables)
            {
                int rowCount = table.GetRowCount();
                int colCount = table.GetColumnCount();
                var tableData = new List<List<string>>();

                for (int row = 0; row < rowCount; row++)
                {
                    var rowData = new List<string>();
                    for (int col = 0; col < colCount; col++)
                    {
                        rowData.Add(table.GetText(row, col).Trim());
                    }
                    tableData.Add(rowData);
                }

                allTables.Add(tableData);
            }
        }
    }
}

Serialize Table Data to JSON

var jsonTables = new List<object>();

foreach (var tableData in allTables)
{
    if (tableData.Count < 2) continue;

    var headers = tableData[0];
    var rows = new List<Dictionary<string, string>>();

    for (int i = 1; i < tableData.Count; i++)
    {
        var rowObj = new Dictionary<string, string>();
        for (int j = 0; j < headers.Count && j < tableData[i].Count; j++)
        {
            rowObj[headers[j]] = tableData[i][j];
        }
        rows.Add(rowObj);
    }

    jsonTables.Add(new
    {
        tableIndex = allTables.IndexOf(tableData) + 1,
        headers = headers,
        data = rows
    });
}

string tableJson = JsonSerializer.Serialize(new
{
    sourceFile = "SalesReport.pdf",
    tables = jsonTables
}, new JsonSerializerOptions { WriteIndented = true });

File.WriteAllText("SalesReport_Tables.json", tableJson);

Key API Calls

PdfTableExtractor(PdfDocument) — initializes the table extraction engine
PdfTableExtractor.ExtractTable(pageIndex) — detects and extracts tables from a page
PdfTable.GetRowCount() / GetColumnCount() — returns table dimensions
PdfTable.GetText(row, col) — reads cell content

Sample JSON Output

The resulting JSON preserves the original table structure by organizing each row into key-value pairs based on the detected column headers.

{
  "sourceFile": "SalesReport.pdf",
  "tables": [
    {
      "tableIndex": 1,
      "headers": ["Product", "Quantity", "Unit Price", "Total"],
      "data": [
        { "Product": "Widget A", "Quantity": "150", "Unit Price": "$24.50", "Total": "$3,675.00" },
        { "Product": "Widget B", "Quantity": "80", "Unit Price": "$39.90", "Total": "$3,192.00" }
      ]
    }
  ]
}

The following screenshot shows the actual JSON file generated after running the example.

Convert PDF Tables to JSON in C#

This approach works well for invoices, reports, and other PDFs with well-defined table structures. For documents containing merged cells, missing headers, or multi-page tables, additional post-processing may be required.

If you need a deeper look at PDF table extraction, see our dedicated guide on extracting tables from PDFs in C# using Spire.PDF for .NET.

Common Table Extraction Problems

Real-world PDF tables are messy. Here are the three problems you will hit most often, and how to handle them.

Problem 1: Missing Headers

Many invoices and reports have tables without explicit header rows. The data starts immediately:

Apple      10    $2.99    $29.90
Orange     5     $1.50    $7.50

When the first row is data rather than headers, assign column names manually based on your known schema:

// Define headers when the PDF table has no header row
string[] defaultHeaders = { "Product", "Quantity", "UnitPrice", "Total" };

var rows = new List<Dictionary<string, string>>();
for (int i = 0; i < tableData.Count; i++)  // Start from 0, not 1
{
    var rowObj = new Dictionary<string, string>();
    for (int j = 0; j < defaultHeaders.Length && j < tableData[i].Count; j++)
    {
        rowObj[defaultHeaders[j]] = tableData[i][j];
    }
    rows.Add(rowObj);
}

Problem 2: Merged Cells

Tables in financial reports often have merged cells for grouping:

Quarter    Revenue     Expenses
Q1         $120,000    $95,000
           $115,000    $88,000
Q2         $140,000    $102,000

The extractor returns empty strings for merged cells. Fill them forward from the last non-empty value:

// Fill merged cells with the previous row's value
for (int col = 0; col < headers.Count; col++)
{
    string lastValue = "";
    for (int row = 1; row < tableData.Count; row++)
    {
        if (col < tableData[row].Count && !string.IsNullOrWhiteSpace(tableData[row][col]))
        {
            lastValue = tableData[row][col];
        }
        else if (col < tableData[row].Count)
        {
            tableData[row][col] = lastValue;
        }
    }
}

Problem 3: Multi-Page Tables

Enterprise reports often have a single table spanning multiple pages, with the header row repeated on each page. Handle this by deduplicating headers during serialization:

var combinedRows = new List<Dictionary<string, string>>();
string[] expectedHeaders = null;

for (int pageIndex = 0; pageIndex < pdf.Pages.Count; pageIndex++)
{
    PdfTable[] tables = tableExtractor.ExtractTable(pageIndex);
    if (tables == null) continue;

    foreach (PdfTable table in tables)
    {
        for (int r = 0; r < table.GetRowCount(); r++)
        {
            var cells = new List<string>();
            for (int c = 0; c < table.GetColumnCount(); c++)
            {
                cells.Add(table.GetText(r, c).Trim());
            }

            // First row of first page becomes the headers
            if (expectedHeaders == null && r == 0)
            {
                expectedHeaders = cells.ToArray();
                continue;
            }

            // Skip repeated header rows on subsequent pages
            if (r == 0 && cells.SequenceEqual(expectedHeaders))
                continue;

            var rowDict = new Dictionary<string, string>();
            for (int c = 0; c < expectedHeaders.Length && c < cells.Count; c++)
            {
                rowDict[expectedHeaders[c]] = cells[c];
            }
            combinedRows.Add(rowDict);
        }
    }
}

5. Convert PDF Form Fields to JSON

Unlike plain text or tables, fillable PDF forms already store data as named fields. Applications, surveys, and registration forms contain field names and values that can be mapped directly to JSON key-value pairs, making form data one of the easiest types of PDF content to serialize.

Read and Export Form Fields

using Spire.Pdf;
using Spire.Pdf.Fields;
using Spire.Pdf.Widget;
using System.Collections.Generic;

using (PdfDocument pdf = new PdfDocument())
{
    pdf.LoadFromFile("RegistrationForm.pdf");

    PdfFormWidget formWidget = pdf.Form as PdfFormWidget;
    var formData = new Dictionary<string, object>();

    if (formWidget != null)
    {
        for (int i = 0; i < formWidget.FieldsWidget.List.Count; i++)
        {
            PdfField field = formWidget.FieldsWidget.List[i] as PdfField;

            if (field is PdfTextBoxFieldWidget textBox)
                formData[textBox.Name] = textBox.Text;
            else if (field is PdfCheckBoxWidgetFieldWidget checkBox)
                formData[checkBox.Name] = checkBox.Checked;
            else if (field is PdfRadioButtonListFieldWidget radioButton)
                formData[radioButton.Name] = radioButton.Value;
            else if (field is PdfComboBoxWidgetFieldWidget comboBox)
                formData[comboBox.Name] = comboBox.SelectedValue;
            else if (field is PdfListBoxWidgetFieldWidget listBox)
            {
                var selectedItems = new List<string>();
                foreach (PdfListWidgetItem item in listBox.Values)
                    selectedItems.Add(item.Value);
                formData[listBox.Name] = selectedItems;
            }
        }
    }

    var formOutput = new
    {
        sourceFile = "RegistrationForm.pdf",
        fieldCount = formData.Count,
        fields = formData
    };

    string json = JsonSerializer.Serialize(formOutput, new JsonSerializerOptions
    {
        WriteIndented = true
    });

    File.WriteAllText("RegistrationForm_Data.json", json);
}

Key API Calls

PdfFormWidget — provides access to the document's interactive form
PdfTextBoxFieldWidget.Text — reads text input values
PdfCheckBoxWidgetFieldWidget.Checked — reads checkbox state
PdfRadioButtonListFieldWidget.Value — reads selected radio button
PdfComboBoxWidgetFieldWidget.SelectedValue — reads combo box selection

Output

The following example shows how the extracted form fields are represented as structured JSON.

{
  "sourceFile": "RegistrationForm.pdf",
  "fieldCount": 6,
  "fields": {
    "FullName": "John Smith",
    "Email": "john.smith@contoso.com",
    "Department": "Sales",
    "AgreeTerms": true,
    "SubscriptionPlan": "Enterprise",
    "Skills": ["C#", "SQL", "Azure"]
  }
}

The following screenshot shows the actual JSON file generated after exporting the form data.

Convert PDF Form Fields to JSON in C#

This approach works well for interactive PDF forms that contain structured fields such as text boxes, check boxes, radio buttons, and drop-down lists. Because each field already has a unique name, the extracted data can be serialized directly into JSON without additional parsing.

If you need a deeper look at importing and exporting PDF form field data in C#, see our dedicated guide on working with PDF form fields using Spire.PDF for .NET.

6. Invoice PDF to JSON: A Real-World Example

Invoice processing is one of the most common business use cases for PDF to JSON conversion. Instead of presenting a full parser implementation, this section demonstrates how the extraction techniques from Sections 3 and 4 come together to solve a real problem.

Target JSON Structure

Before writing any extraction code, define your target schema. For a typical invoice, the JSON output might look like this:

{
  "invoiceNumber": "INV-2026-0042",
  "date": "2026-06-15",
  "vendor": "Contoso Ltd",
  "paymentTerms": "Net 30",
  "lineItems": [
    { "description": "Widget A", "quantity": 150, "unitPrice": 24.50, "total": 3675.00 },
    { "description": "Widget B", "quantity": 80, "unitPrice": 39.90, "total": 3192.00 }
  ],
  "subtotal": 8367.00,
  "tax": 669.36,
  "total": 9036.36
}

Extraction Pattern

Use text extraction (Section 3) to parse header fields via regex, and table extraction (Section 4) to pull line items:

// Parse header fields from extracted text using regex
invoice["invoiceNumber"] = Regex.Match(fullText, @"Invoice Number:\s*(\S+)").Groups[1].Value;
invoice["date"] = Regex.Match(fullText, @"Date:\s*(\S+)").Groups[1].Value;
invoice["vendor"] = Regex.Match(fullText, @"Vendor:\s*(.+)").Groups[1].Value;

// Extract line items from table data (Section 4 pattern)
for (int r = 1; r < table.GetRowCount(); r++)
{
    lineItems.Add(new
    {
        description = table.GetText(r, 0).Trim(),
        quantity = int.Parse(table.GetText(r, 1).Trim()),
        unitPrice = ParseCurrency(table.GetText(r, 2)),
        total = ParseCurrency(table.GetText(r, 3))
    });
}

The implementation combines the text extraction introduced in Section 3 with the table extraction introduced in Section 4. Regex is used only for simple field matching — the core PDF processing relies entirely on Spire.PDF APIs.

Handling Different Invoice Layouts

In production, you rarely deal with a single invoice format:

Fixed template + regex — works when you control the source or process invoices from a known vendor
Template matching — maintain a set of regex patterns, one per vendor
AI-assisted extraction — for unknown or highly variable layouts, combine OCR output with an LLM

Regex-based parsing is fast and reliable for known formats. For a production-ready implementation, extend the PdfToJsonConverter class from Section 8 to build a dedicated invoice parser that reuses the same extraction patterns.

7. Convert Multiple PDFs to JSON in Batch

Production workflows process hundreds or thousands of PDFs at once. This batch processor handles errors gracefully and logs results:

using Spire.Pdf;
using Spire.Pdf.Texts;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

string inputDir = @"C:\PDFs\Invoices";
string outputDir = @"C:\Output\JSON";
Directory.CreateDirectory(outputDir);

string[] pdfFiles = Directory.GetFiles(inputDir, "*.pdf");
var results = new List<object>();

foreach (string pdfPath in pdfFiles)
{
    string fileName = Path.GetFileNameWithoutExtension(pdfPath);
    string outputPath = Path.Combine(outputDir, $"{fileName}.json");

    try
    {
        using (PdfDocument pdf = new PdfDocument())
        {
            pdf.LoadFromFile(pdfPath);
            var pageTexts = new List<string>();

            for (int i = 0; i < pdf.Pages.Count; i++)
            {
                var extractor = new PdfTextExtractor(pdf.Pages[i]);
                var options = new PdfTextExtractOptions { IsExtractAllText = true };
                pageTexts.Add(extractor.ExtractText(options).Trim());
            }

            var doc = new
            {
                sourceFile = Path.GetFileName(pdfPath),
                pageCount = pdf.Pages.Count,
                processedAt = DateTime.UtcNow,
                content = pageTexts
            };

            File.WriteAllText(outputPath, JsonSerializer.Serialize(doc,
                new JsonSerializerOptions { WriteIndented = true }));

            results.Add(new { file = fileName, status = "success" });
        }
    }
    catch (Exception ex)
    {
        results.Add(new { file = fileName, status = "error", error = ex.Message });
    }
}

File.WriteAllText(Path.Combine(outputDir, "_log.json"),
    JsonSerializer.Serialize(results, new JsonSerializerOptions { WriteIndented = true }));

Swap the text-only extraction with the invoice JSON extraction pattern from Section 6 if your batch consists of invoices, or with the PdfToJsonConverter class from Section 8 for general-purpose conversion.

8. Build a PDF to JSON Converter in C#

For production applications, encapsulate all extraction logic into a single class. The PdfToJsonConverter below combines text, table, and form field extraction into one reusable PDF to JSON converter:

using Spire.Pdf;
using Spire.Pdf.Texts;
using Spire.Pdf.Utilities;
using Spire.Pdf.Fields;
using Spire.Pdf.Widget;
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

public class PdfToJsonConverter
{
    private readonly JsonSerializerOptions _jsonOptions = new()
    {
        WriteIndented = true,
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
        DefaultIgnoreCondition = System.Text.Json.Serialization.JsonIgnoreCondition.WhenWritingNull
    };

    public string ConvertToJson(string pdfPath)
    {
        using (PdfDocument pdf = new PdfDocument())
        {
            pdf.LoadFromFile(pdfPath);

            var result = new
            {
                sourceFile = Path.GetFileName(pdfPath),
                processedAt = DateTime.UtcNow,
                text = ExtractText(pdf),
                tables = ExtractTables(pdf),
                formFields = ExtractFormFields(pdf)
            };

            return JsonSerializer.Serialize(result, _jsonOptions);
        }
    }

    public void ConvertAndSave(string pdfPath, string outputPath)
    {
        File.WriteAllText(outputPath, ConvertToJson(pdfPath));
    }

    // Reuses the text extraction technique from Section 3 (PdfTextExtractor + PdfTextExtractOptions)
    private List<PageText> ExtractText(PdfDocument pdf) { return new List<PageText>(); }

    // Reuses the table extraction technique from Section 4 (PdfTableExtractor + ExtractTable)
    private List<TableData> ExtractTables(PdfDocument pdf) { return new List<TableData>(); }

    // Reuses the form field extraction technique from Section 5 (PdfFormWidget + field type checking)
    private Dictionary<string, object> ExtractFormFields(PdfDocument pdf) { return new Dictionary<string, object>(); }
}

public class PageText
{
    public int PageNumber { get; set; }
    public string Text { get; set; }
}

public class TableData
{
    public int PageNumber { get; set; }
    public int RowCount { get; set; }
    public List<List<string>> Rows { get; set; }
}

Usage

var converter = new PdfToJsonConverter();

// Single file
converter.ConvertAndSave("InvoiceReport.pdf", "InvoiceReport.json");

// Use inside an ASP.NET controller
[HttpPost("api/pdf-to-json")]
public IActionResult ConvertPdf(IFormFile file)
{
    var tempPath = Path.GetTempFileName();
    file.CopyTo(new FileStream(tempPath, FileMode.Create));
    var converter = new PdfToJsonConverter();
    string json = converter.ConvertToJson(tempPath);
    return Content(json, "application/json");
}

The helper methods (ExtractText, ExtractTables, ExtractFormFields) reuse the extraction techniques introduced in Sections 3–5. Refer to those sections for the full implementations.

Best Practices for Production Pipelines

When building PDF to JSON conversion into a production system:

Define your JSON schema first. Map each PDF element to a target field before writing extraction code.
Validate extracted data. Currency strings, dates, and IDs should be parsed and verified before serialization.
Handle missing values. Use JsonIgnoreCondition.WhenWritingNull to omit null fields from output.
Include metadata. Always record source file name, page numbers, and extraction timestamp for auditing.
Clean text artifacts. Trim whitespace, normalize line breaks, and handle encoding issues in extracted strings.

9. Convert OCR Output to JSON in C#

Scanned PDFs contain images rather than selectable text, so they must be processed with an OCR engine before they can be converted to JSON. Spire.PDF handles PDF rendering and page processing, while text recognition should be performed by an OCR solution such as Tesseract or Azure AI Vision.

For a complete walkthrough, see How to Extract Text from Scanned PDFs in C#.

Once OCR returns the recognized text, you can parse it using the same techniques shown earlier in this article.

Parse OCR Text into JSON

string recognizedText = ocrEngine.Recognize(imagePath);

// Parse recognized text using the same helper methods demonstrated in previous examples.
var parsedData = ParseRecognizedText(recognizedText);

string json = JsonSerializer.Serialize(parsedData, new JsonSerializerOptions
{
    WriteIndented = true
});

Best Practices

Scan documents at 300 DPI or higher for better OCR accuracy.
Validate important fields such as invoice numbers, dates, and currency values before serialization.
Reuse the parsing patterns introduced earlier in this article to build consistent JSON structures.

10. Performance Considerations

PDF to JSON conversion works fine for a single 5-page document. In production, you are processing hundreds of files with hundreds of pages each. These are the issues you will actually hit.

Large PDFs (100+ Pages)

Avoid loading all page text into a List<string> before serialization. Process and write each page incrementally:

using (var stream = File.Create("output.json"))
using (var writer = new Utf8JsonWriter(stream, new JsonWriterOptions { Indented = true }))
{
    writer.WriteStartObject();
    writer.WriteString("sourceFile", Path.GetFileName(pdfPath));
    writer.WriteStartArray("pages");

    for (int i = 0; i < pdf.Pages.Count; i++)
    {
        var extractor = new PdfTextExtractor(pdf.Pages[i]);
        var options = new PdfTextExtractOptions { IsExtractAllText = true };
        string text = extractor.ExtractText(options).Trim();

        writer.WriteStartObject();
        writer.WriteNumber("pageNumber", i + 1);
        writer.WriteString("text", text);
        writer.WriteEndObject();
    }

    writer.WriteEndArray();
    writer.WriteEndObject();
}

Utf8JsonWriter writes directly to the stream instead of building a string in memory. For a 500-page document, this can cut peak memory usage by 60-70% compared to JsonSerializer.Serialize().

Memory Usage

PdfDocument holds parsed page trees, fonts, and image references in memory. Two rules:

Always wrap PdfDocument in using — it releases unmanaged resources on dispose
Process one document at a time — do not keep multiple PdfDocument instances open simultaneously unless you have the RAM for it

For batch jobs processing 1000+ files, the using pattern inside the loop ensures each document is fully released before the next one loads.

Parallel Processing

Batch conversion is CPU-bound and parallelizes well:

var pdfFiles = Directory.GetFiles(inputDir, "*.pdf");

Parallel.ForEach(pdfFiles,
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    pdfPath =>
{
    string outputPath = Path.Combine(outputDir,
        Path.GetFileNameWithoutExtension(pdfPath) + ".json");
    var converter = new PdfToJsonConverter();
    converter.ConvertAndSave(pdfPath, outputPath);
});

Each thread creates its own PdfToJsonConverter and PdfDocument instance. PdfDocument is not thread-safe — never share a single instance across threads.

When to Use Streaming JSON

Use Utf8JsonWriter over JsonSerializer.Serialize() when:

Output JSON exceeds 50 MB
You are processing PDFs with 200+ pages
Running in a memory-constrained environment (container with 512 MB limit)

For smaller documents, JsonSerializer is simpler and the memory difference is negligible.

11. FAQ

Can I convert PDF to JSON in C# for free?

Spire.PDF for .NET offers a free evaluation version with a page limit. For production use, you can apply for a 30-day free license or purchase a commercial license. The System.Text.Json serializer is built into .NET and free.

Can scanned PDFs be converted to JSON?

Yes, but you need an external OCR engine. Spire.PDF renders PDF pages as images via SaveAsImage(), which you then pass to Tesseract, Azure Computer Vision, or Amazon Textract for text recognition. The recognized text is then parsed and serialized to JSON. See Section 9 for the integration pattern.

Can I convert PDF tables to JSON automatically?

Yes. PdfTableExtractor automatically detects table structures on each page without manual configuration. It handles both properly structured tables (created in Word or Excel) and visual tables (text aligned to look like rows and columns). For multi-page tables or tables without headers, see the handling patterns in Section 4.

Can I batch convert multiple PDFs to JSON?

Yes. Iterate through a directory using Directory.GetFiles(), process each PDF with Spire.PDF extraction APIs, and save individual JSON files. Include error handling so one failed file does not stop the batch. See Section 7 for a complete example.

How can I convert large PDF files to JSON in C#?

Process the PDF page-by-page rather than loading all content into memory at once. For very large files (100+ pages), use Utf8JsonWriter to write JSON incrementally to a stream instead of building the entire output in memory. See Section 10 for the streaming JSON pattern and parallel processing approach.

Can I convert PDF to JSON using an API?

Yes. You can wrap the PdfToJsonConverter class from this article in an ASP.NET Web API endpoint. Accept a PDF upload, run the extraction, and return the JSON response. Spire.PDF works in any .NET hosting environment — ASP.NET Core, Azure Functions, AWS Lambda, or a self-hosted console app. See the ASP.NET controller example in Section 8.

Conclusion

PDF to JSON is not a single operation. Depending on your document, you are solving one of three different problems: wrapping raw text in a JSON envelope, parsing key-value patterns into flat objects, or building structured business JSON from text and table extraction.

This article covered all three, plus the complications that break naive implementations: tables without headers, merged cells, multi-page tables, fillable form fields, varying invoice layouts, batch processing, memory management for large documents, and OCR integration boundaries.

The PdfToJsonConverter class is a starting point you can adapt to your document types. The invoice extraction pattern shown in Section 6 demonstrates how to combine these techniques for real business documents. Both use Spire.PDF for .NET, which handles all PDF reading locally without external dependencies.

To get started:

Install via NuGet: Install-Package Spire.PDF
Apply for a 30-day free license to evaluate without page limits
Explore the Spire.PDF documentation for additional extraction scenarios

Published in Conversion

Tagged under

pdf net Conversion

How to Convert Word to JSON in Python (DOCX to JSON)

2026-06-17 01:26:12 Written by Allen Yang

Converting Word documents to JSON in Python

Converting Word documents to JSON is a common requirement when building automated document processing pipelines, feeding content into AI models, or migrating structured data from DOCX files into databases and APIs. Unlike CSV or XML, JSON provides a flexible, hierarchical format that can represent paragraphs, tables, and nested document structures in a single output.

However, Word files do not have a native JSON export format. A .docx file is a rich-text document composed of sections, paragraphs, styles, and tables—not a structured data source. Converting it to JSON requires deciding how to map that content into a meaningful schema.

This tutorial demonstrates how to convert Word to JSON in Python using Spire.Doc for Python. You will learn three progressively advanced methods: extracting plain paragraph text, converting Word tables to JSON arrays, and preserving the full document structure—including headings, paragraphs, and tables—in a hierarchical JSON output. The examples in this tutorial work with both DOCX and legacy DOC files supported by Spire.Doc.

Quick Navigation

How Is Word Converted into JSON?
Install the Required Library
Method 1 – Convert Word Text to JSON
Method 2 – Convert Word Tables to JSON
Method 3 – Preserve Document Structure in JSON
When to Use Word to JSON Conversion
Limitations and Best Practices
FAQ
Conclusion

1. How Is Word Converted into JSON?

A Word document is a rich-text format organized into sections, paragraphs, and tables—not a structured data format. When you convert Word to JSON, there is no single standard for how the content should be represented. The right schema depends on how the JSON will be used:

Goal	Recommended Schema	Key Characteristics
AI embedding / semantic search	Paragraph array	Flat list of text strings, one per paragraph
Full-text search indexing	Text blocks with metadata	Paragraphs with section index and style info
Database import from tables	Table row objects	Header-keyed dictionaries, one per row
RAG pipeline / knowledge base	Hierarchical structure	Nested sections with headings, paragraphs, and tables
Document archival / interchange	Full document model	Sections, styles, metadata, and all content types

For example, a Word document containing a heading and a paragraph could be represented in JSON as:

{
  "document": [
    {"type": "heading", "level": 1, "text": "Project Overview"},
    {"type": "paragraph", "text": "This report summarizes the quarterly results."}
  ]
}

The three methods in this tutorial correspond directly to these schema choices:

Method 1 produces a paragraph array (AI embedding, search indexing)
Method 2 produces table row objects (database import, data extraction)
Method 3 produces a hierarchical structure (RAG, knowledge base, document understanding)

Choose the method that matches your goal, or combine elements from multiple methods to build a custom schema.

2. Install the Required Library

This tutorial uses Spire.Doc for Python to read and parse DOC/DOCX files. Install it via pip:

pip install spire.doc

Alternatively, you can download Spire.Doc for Python and integrate it manually.

After installation, import the library in your Python script:

from spire.doc import Document, FileFormat
from spire.doc.common import *

Spire.Doc provides APIs to load Word documents, iterate through sections, paragraphs, and tables, and extract text content—everything needed to build a Word-to-JSON pipeline.

3. Method 1 – Convert Word Text to JSON

The simplest way to convert Word to JSON is to extract all paragraph text from the document and store it in a JSON array. This approach works well when you need the full text content without structural metadata—such as for full-text search, AI text embedding, or simple content export.

3.1 Read Paragraphs from a Word Document

Spire.Doc represents a Word document as a collection of Sections, each containing Paragraphs. To extract all text, you iterate through every section and every paragraph within it.

from spire.doc import Document
from spire.doc.common import *

input_file = "ProjectReport.docx"

document = Document()
document.LoadFromFile(input_file)

paragraphs = []
for i in range(document.Sections.Count):
    section = document.Sections.get_Item(i)
    for j in range(section.Paragraphs.Count):
        paragraph = section.Paragraphs.get_Item(j)
        text = paragraph.Text
        if text.strip():
            paragraphs.append(text)

document.Close()

Each paragraph's .Text property returns the plain text content, stripping away formatting. The if text.strip() check filters out empty paragraphs that exist as spacing or layout elements in Word.

3.2 Serialize the Extracted Text to JSON

Assuming the paragraph data extracted in the previous step is stored in the paragraphs list, you can serialize it to JSON and save it to a file as follows:

import json

output_file = "paragraphs.json"

result = {
    "source": input_file,
    "paragraph_count": len(paragraphs),
    "paragraphs": paragraphs
}

with open(output_file, "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2, ensure_ascii=False)

Output Example

The following JSON snippet shows the structure of the generated output file:

{
  "source": "ProjectReport.docx",
  "paragraph_count": 3,
  "paragraphs": [
    "Quarterly Sales Report",
    "This document provides an overview of sales performance across all regions."
  ]
}

Conversion Result

The image below shows the source Word document and the JSON file generated after extracting paragraph text.

Word to JSON conversion result - paragraph extraction

3.3 Explanation

Why iterate through Sections and Paragraphs instead of extracting all text at once? Because Word documents are organized hierarchically. A document contains one or more sections (each with its own page layout), and each section contains paragraphs. Iterating at this level gives you control over which content to include or skip—such as filtering empty paragraphs or limiting extraction to specific sections.

Storing paragraphs as a JSON array is the most straightforward structure. Each element is a string, making the output easy to consume in downstream systems. This approach is well-suited for:

Full-text indexing – feed paragraph text into search engines like Elasticsearch
AI text embedding – convert paragraphs into vector representations for semantic search
Simple content export – extract readable text from Word files without formatting

However, this method loses structural information. Headings, body text, and list items are all treated the same way. If you need to distinguish between them, see Method 3.

If your goal is simply to extract text content from Word documents without converting it to JSON, you may also be interested in our guide on extracting text from Word documents in Python.

4. Method 2 – Convert Word Tables to JSON

In many Word documents—reports, invoices, product lists, configuration tables—the most valuable content lives inside tables, not in paragraphs. Converting Word tables to JSON allows you to extract structured row-and-column data that can be directly loaded into databases, APIs, or data analysis tools.

Why Tables Need Special Handling

Tables in Word are stored as a grid of rows and cells, where each cell contains its own paragraphs. Unlike paragraph text, table data has an inherent two-dimensional structure that maps naturally to JSON objects. The first row often contains column headers, and subsequent rows contain data records.

Extracting Tables from a Word Document

The following code reads all tables from a Word document, uses the first row as column headers, and converts each subsequent row into a JSON object:

import json
from spire.doc import Document
from spire.doc.common import *

input_file = "SalesData.docx"
output_file = "tables.json"

document = Document()
document.LoadFromFile(input_file)

all_tables = []

for i in range(document.Sections.Count):
    section = document.Sections.get_Item(i)
    for t in range(section.Tables.Count):
        table = section.Tables.get_Item(t)
        rows_data = []

        if table.Rows.Count < 2:
            continue

        header_row = table.Rows[0]
        headers = []
        for c in range(header_row.Cells.Count):
            cell_text = header_row.Cells[c].Paragraphs[0].Text.strip()
            headers.append(cell_text)

        for r in range(1, table.Rows.Count):
            row = table.Rows[r]
            row_dict = {}
            for c in range(row.Cells.Count):
                cell_text = row.Cells[c].Paragraphs[0].Text.strip()
                row_dict[headers[c] if c < len(headers) else f"Column_{c}"] = cell_text
            rows_data.append(row_dict)

        all_tables.append({
            "table_index": t,
            "headers": headers,
            "row_count": len(rows_data),
            "rows": rows_data
        })

document.Close()

result = {
    "source": input_file,
    "table_count": len(all_tables),
    "tables": all_tables
}

with open(output_file, "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2, ensure_ascii=False)

Output Example

The following JSON snippet shows the structure of the generated output file, with each table row mapped to a JSON object using the header row as keys:

{
  "source": "SalesData.docx",
  "table_count": 1,
  "tables": [
    {
      "table_index": 0,
      "headers": ["Region", "Product", "Units Sold", "Revenue"],
      "row_count": 3,
      "rows": [
        {"Region": "North", "Product": "Laptop", "Units Sold": "120", "Revenue": "114000"},
        {"Region": "South", "Product": "Laptop", "Units Sold": "80", "Revenue": "76000"}
      ]
    }
  ]
}

Conversion Result

The image below demonstrates how table data from a Word document is converted into structured JSON records.

Word to JSON conversion result - table extraction

Explanation

The code treats the first row as a header row and maps each cell in subsequent rows to the corresponding header key. This produces a JSON array of objects, which is the most common and useful format for tabular data.

Key considerations:

table.Rows.Count < 2 skips tables that have only a header row or are empty
row.Cells[c].Paragraphs[0].Text extracts text from the first paragraph in each cell. For simplicity, the example reads only the first paragraph. If a cell contains multiple paragraphs, iterate through the entire Paragraphs collection and concatenate the results:

cell_text = "\n".join(
    row.Cells[c].Paragraphs[p].Text.strip()
    for p in range(row.Cells[c].Paragraphs.Count)
    if row.Cells[c].Paragraphs[p].Text.strip()
)

headers[c] if c < len(headers) else f"Column_{c}" handles cases where a data row has more cells than the header row

This method is ideal for extracting structured data from reports, invoices, product catalogs, and configuration tables stored in Word documents. The resulting JSON can be directly loaded into databases, used in web APIs, or processed by data analysis tools.

If you need to generate Word documents from structured JSON data, see our tutorial on converting JSON to Word in Python, which covers creating Word content and tables directly from JSON objects and arrays.

5. Method 3 – Preserve Document Structure in JSON

Methods 1 and 2 treat paragraphs and tables as separate, isolated elements. In practice, Word documents have a meaningful hierarchy: headings introduce sections, paragraphs provide detail, and tables present structured data within a specific context.

Preserving this hierarchy in JSON produces output that is far more useful for knowledge base construction, RAG (Retrieval-Augmented Generation) pipelines, and document understanding systems. Instead of a flat list of text, you get a structured representation that maintains the logical flow of the original document.

How to Preserve Headings, Paragraphs, and Tables in a Hierarchical JSON Structure

The approach is to iterate through all child objects in each section's body, determine the type of each object (paragraph or table), and build a structured JSON representation accordingly. For paragraphs, you can detect headings by checking the StyleName property.

import json
from spire.doc import Document
from spire.doc.common import *

input_file = "ProjectReport.docx"
output_file = "structured_output.json"

HEADING_STYLES = {
    "Heading1": 1,
    "Heading2": 2,
    "Heading3": 3,
    "Heading4": 4,
}

def get_heading_level(style_name):
    return HEADING_STYLES.get(style_name, None)

def extract_table_data(table):
    rows_data = []
    if table.Rows.Count < 1:
        return {"headers": [], "rows": []}

    header_row = table.Rows[0]
    headers = []
    for c in range(header_row.Cells.Count):
        headers.append(header_row.Cells[c].Paragraphs[0].Text.strip())

    for r in range(1, table.Rows.Count):
        row = table.Rows[r]
        row_dict = {}
        for c in range(row.Cells.Count):
            cell_text = row.Cells[c].Paragraphs[0].Text.strip()
            row_dict[headers[c] if c < len(headers) else f"Column_{c}"] = cell_text
        rows_data.append(row_dict)

    return {"headers": headers, "rows": rows_data}

document = Document()
document.LoadFromFile(input_file)

sections_data = []

for i in range(document.Sections.Count):
    section = document.Sections.get_Item(i)
    content_items = []

    for j in range(section.Body.ChildObjects.Count):
        obj = section.Body.ChildObjects.get_Item(j)

        if isinstance(obj, Paragraph):
            text = obj.Text.strip()
            if not text:
                continue

            heading_level = get_heading_level(obj.StyleName)
            if heading_level:
                content_items.append({
                    "type": "heading",
                    "level": heading_level,
                    "text": text
                })
            else:
                content_items.append({
                    "type": "paragraph",
                    "text": text
                })

        elif isinstance(obj, Table):
            table_data = extract_table_data(obj)
            content_items.append({
                "type": "table",
                "row_count": len(table_data["rows"]),
                "data": table_data
            })

    sections_data.append({
        "section_index": i,
        "content": content_items
    })

document.Close()

result = {
    "source": input_file,
    "section_count": len(sections_data),
    "sections": sections_data
}

with open(output_file, "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2, ensure_ascii=False)

Output Example

The following JSON snippet shows how headings, paragraphs, and tables are represented in the hierarchical output structure:

{
  "source": "ProjectReport.docx",
  "section_count": 1,
  "sections": [
    {
      "section_index": 0,
      "content": [
        {
          "type": "heading",
          "level": 1,
          "text": "Quarterly Sales Report"
        },
        {
          "type": "paragraph",
          "text": "This report provides an overview of sales performance across all regions."
        },
        {
          "type": "heading",
          "level": 2,
          "text": "Regional Breakdown"
        },
        {
          "type": "table",
          "row_count": 3,
          "data": {
            "headers": ["Region", "Product", "Units Sold", "Revenue"],
            "rows": [
              {"Region": "North", "Product": "Laptop", "Units Sold": "120", "Revenue": "114000"}
            ]
          }
        }
      ]
    }
  ]
}

Conversion Result

The image below illustrates how headings, paragraphs, and tables are preserved in a hierarchical JSON structure.

Word to JSON conversion result - hierarchical structure

Explanation

This method differs from the previous two in a fundamental way: it uses section.Body.ChildObjects to iterate through all content elements in document order, rather than separately iterating paragraphs and tables. This preserves the original sequence and interleaving of headings, paragraphs, and tables.

Key design decisions:

Heading detection via StyleName – Word headings are paragraphs styled with "Heading1", "Heading2", etc. Checking the style name allows you to distinguish headings from body text and record the heading level. Note that the exact heading style names may vary depending on the Word template or language settings (e.g., "Heading 1" with a space, or localized names like "标题 1" in Chinese). To handle these variations, normalize the style name before lookup:

def get_heading_level(style_name):
    normalized = style_name.lower().replace(" ", "")
    heading_map = {"heading1": 1, "heading2": 2, "heading3": 3, "heading4": 4}
    return heading_map.get(normalized, None)

ChildObjects iteration – Unlike section.Paragraphs (which only returns paragraphs) or section.Tables (which only returns tables), ChildObjects returns all elements in their original order. This is essential for preserving the document's logical structure.
Structured JSON output – Each content item includes a type field (heading, paragraph, or table), making it easy for downstream systems to process different content types appropriately.

This approach is particularly valuable for:

RAG and AI pipelines – the heading structure enables chunking documents by section, improving retrieval accuracy
Knowledge base construction – hierarchical JSON maps directly to tree-structured knowledge graphs
Document understanding – preserving the relationship between headings and their associated content allows semantic analysis of document sections

If you need to extract specific content types from Word documents, such as headings, paragraphs, or tables, see our tutorial on reading Word documents in Python, which covers content extraction techniques in more detail.

6. When to Use Word to JSON Conversion

Word to JSON conversion is useful in any scenario where structured data needs to be extracted from Word documents at scale. Common use cases include:

AI and RAG document processing – Convert Word documents into JSON chunks for embedding and retrieval in LLM-based applications. The hierarchical structure from Method 3 enables section-level chunking, which produces better retrieval results than flat text splitting.
Knowledge base construction – Build structured knowledge bases from technical documentation, policy documents, or manuals stored as .docx files.
Batch data extraction – Extract data from hundreds of Word reports, invoices, or forms and load the results into a database or data warehouse.
Contract and resume parsing – Convert legal contracts, HR documents, or resumes into structured JSON for automated analysis and comparison.
API and web application data exchange – Serve Word document content through REST APIs as JSON, enabling web and mobile applications to consume document data without handling .docx files directly.

7. Limitations and Best Practices

Limitations

No standard JSON schema for Word – Unlike CSV or XML, there is no universally accepted format for representing Word content in JSON. The structure you choose must be designed for your specific use case.
Complex formatting is not captured – The methods in this tutorial extract text content and basic structural metadata (heading levels, table data). They do not capture fonts, colors, images, page layout, headers/footers, or footnotes. If your application requires these elements, additional extraction logic is needed.
Merged table cells require special handling – Word tables can contain merged cells (both horizontal and vertical). The simple row-by-row extraction in Method 2 assumes a regular grid. Documents with merged cells may produce unexpected results.
Large documents may need chunked processing – For documents with hundreds of pages or dozens of tables, consider processing sections or tables individually to manage memory usage.

Best Practices

Design your JSON schema before writing code – Decide what you need (text only? headings? tables? full structure?) and choose the appropriate extraction method.
Validate output against sample documents – Word documents vary widely in structure and formatting. Test your conversion logic against representative samples from your actual document set.
Handle encoding explicitly – Always specify encoding="utf-8" when writing JSON files to avoid character encoding issues with non-ASCII text.
Use ensure_ascii=False in json.dump – This preserves Unicode characters in the output rather than escaping them, which is important for documents containing non-English text.

8. FAQ

Can I convert DOCX to JSON in Python?

Yes. Using Spire.Doc for Python, you can load any .docx file, iterate through its sections, paragraphs, and tables, and serialize the extracted content to JSON using Python's built-in json module. This tutorial demonstrates three methods for doing so, from simple text extraction to full structural preservation.

What is the best Word to JSON converter for developers?

For developers who need batch processing, automation, or custom JSON schemas, a Python-based approach using Spire.Doc is more flexible than online converters. Online tools work for one-off conversions but cannot handle large-scale processing, custom output formats, or integration into automated pipelines.

Can I convert Word tables to JSON?

Yes. By iterating through the tables in a Word document and extracting cell text row by row, you can convert table data into a JSON array of objects. Method 2 in this tutorial demonstrates this with header-based key mapping.

Does Word have a native JSON export option?

No. Microsoft Word does not provide a built-in JSON export format. Word files can be saved as DOCX, PDF, HTML, RTF, and plain text, but converting to JSON requires a programmatic approach that reads the document structure and maps it to a JSON schema.

Can I preserve headings and structure when converting Word to JSON?

Yes. By iterating through all child objects in each section's body and checking paragraph style names, you can detect headings, body paragraphs, and tables, then build a hierarchical JSON structure that preserves the document's logical organization. Method 3 in this tutorial provides a complete implementation.

Can I convert Word to JSON online?

Yes, there are online Word to JSON converters that can handle one-off conversions. However, online tools are limited to single-file processing and do not allow customization of the JSON schema. For batch processing, automated pipelines, or custom output structures, a Python-based approach using Spire.Doc is more practical and scalable.

9. Conclusion

In this article, we demonstrated how to convert Word documents to JSON in Python using Spire.Doc for Python. We covered three methods of increasing complexity: extracting paragraph text as a flat JSON array, converting Word tables to structured JSON objects, and preserving the full document hierarchy—including headings, paragraphs, and tables—in a single JSON output.

Each method serves a different purpose. Plain text extraction works for indexing and embedding. Table extraction is ideal for data migration and report parsing. Full structural preservation enables knowledge base construction and RAG pipelines. Choose the approach that matches your requirements, and extend the JSON schema as needed for your specific use case.

Spire.Doc for Python provides comprehensive Word document processing capabilities beyond JSON conversion, including document creation, formatting, mail merge, and format conversion. You can apply for a 30-day free license to evaluate all features.

Published in Conversion

Tagged under

doc Python Conversion

Create CSV File in Java: From Scratch, Arrays & Excel

2026-06-16 02:37:29 Written by Jane Zhao

A developer's guide to create and write CSV files in Java

CSV remains a ubiquitous, lightweight data format in Java development, powering report exports, data migration, and cross-platform data interchange. But despite its apparent simplicity, building production-grade CSV files requires handling special characters, encodings, and strict formatting rules – all of which add unnecessary development and testing overhead.

Spire.XLS for Java streamlines this workflow with a clean, robust API that automatically handles all low-level formatting and encoding details. This guide shows you how to use Java to create CSV files – covering basic CSV generation, structured batch exports, Excel-to-CSV conversion, special character support, and advanced delimiter configuration.

Why Choose Spire.XLS for Java to Create CSV Files
Creating CSV Files from Scratch with Java
Create Structured CSV from Arrays in Java
Create CSV from Excel in Java
Advanced CSV Generation Techniques
- Handle Special Characters
- Custom Delimiters and Encoding
Frequently Asked Questions

Why Choose Spire.XLS for Java to Create CSV Files

Compared to native Java IO, Apache POI, or any other CSV Java library, Spire.XLS for Java offers distinct advantages:

Simplified API: Create and write CSV files in just a few lines of code, with no manual stream operations or low-level formatting work.
Automatic Format Handling: Automatically escapes special characters (commas, double quotes, line breaks) that break standard CSV syntax.
Full Encoding Support: Natively supports UTF-8, UTF-16, GB2312, and other encodings to avoid Chinese and special text garbling.
Dual Format Compatibility: Supports both Excel (XLS/XLSX) and CSV formats, enabling bidirectional conversion between spreadsheets and delimited text.
No Dependencies Bloat: Lightweight library with no third-party dependency conflicts, suitable for Java web, desktop, and microservice projects.

Prerequisites: Install Spire.XLS for Java

To start using Java to write CSV files, you first need to integrate the library into your project. We provide Maven and manual JAR installation methods.

1. Maven Dependency Configuration (Recommended)

Add the following repository and dependency to your project’s pom.xml file:

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependency>
    <groupId>e-iceblue</groupId>
    <artifactId>spire.xls</artifactId>
    <version>16.4.1</version>
</dependency>

2. Manual JAR Installation

For non-Maven projects, download the Spire.XLS for Java JAR file from the official website and add it to your project’s build path.

Create CSV Files from Scratch with Java

This example demonstrates how to create a blank CSV file from scratch, write custom row and column data, and save the file with standard comma delimiters and UTF-8 encoding. This is the most common basic scenario to generate CSV in Java.

import com.spire.xls.*;
import java.nio.charset.Charset;

public class CreateBasicCSV {
    public static void main(String[] args) {
        // Create Workbook instance
        Workbook workbook = new Workbook();

        // Get the first worksheet (index 0)
        Worksheet sheet = workbook.getWorksheets().get(0);

        // Write header row
        sheet.getCellRange("A1").setValue("ID");
        sheet.getCellRange("B1").setValue("Product Name");
        sheet.getCellRange("C1").setValue("Price");
        sheet.getCellRange("D1").setValue("Quantity");
        sheet.getCellRange("E1").setValue("Category");

        // Write data rows
        sheet.getCellRange("A2").setNumberValue(1001);
        sheet.getCellRange("B2").setValue("Wireless Mouse");
        sheet.getCellRange("C2").setNumberValue(29.99);
        sheet.getCellRange("D2").setNumberValue(150);
        sheet.getCellRange("E2").setValue("Electronics");

        sheet.getCellRange("A3").setNumberValue(1002);
        sheet.getCellRange("B3").setValue("Mechanical Keyboard");
        sheet.getCellRange("C3").setNumberValue(89.99);
        sheet.getCellRange("D3").setNumberValue(75);
        sheet.getCellRange("E3").setValue("Electronics");

        sheet.getCellRange("A4").setNumberValue(1003);
        sheet.getCellRange("B4").setValue("Desk Chair");
        sheet.getCellRange("C4").setNumberValue(199.99);
        sheet.getCellRange("D4").setNumberValue(30);
        sheet.getCellRange("E4").setValue("Furniture");

        // Save worksheet as CSV file (comma delimiter + UTF-8 encoding)
        sheet.saveToFile("products.csv", ",", Charset.forName("UTF-8"));

        // Release resources
        workbook.dispose();
    }
}

Key API Methods Explained

setValue(): Writes text or string values into a cell.
setNumberValue(): Writes numeric values (integers and decimals) into a cell.
saveToFile(filename, separator, charset): Exports the worksheet to CSV with specified delimiter and encoding.

Open the generated CSV in Excel:

A basic CSV file created with Spire.XLS for Java

Bonus Tip: Beyond generating CSV files from scratch, Spire.XLS also allows you to read a CSV file in Java, enabling full bidirectional data processing within a single library.

Create Structured CSV from Arrays in Java

For practical development, you usually need to batch write business data (e.g., user lists, order records) to CSV files. This example shows how to create a standardized CSV file with fixed headers and batch structured data from 1D & 2D arrays.

import com.spire.xls.*;
import java.nio.charset.Charset;

public class CreateStructuredCSV {
    public static void main(String[] args) {
        Workbook workbook = new Workbook();
        Worksheet sheet = workbook.getWorksheets().get(0);

        // Define CSV header row
        String[] headers = {"Order ID", "Customer Name", "Order Amount", "Order Date", "Order Status"};
        for (int i = 0; i < headers.length; i++) {
            sheet.getCellRange(1, i + 1).setValue(headers[i]);
        }

        // Batch write order data
        String[][] orderData = {
                {"ORD001", "Tom Brown", "299.99", "2026-06-01", "Completed"},
                {"ORD002", "Lucy Green", "599.50", "2026-06-05", "Pending"},
                {"ORD003", "Mike Wilson", "129.00", "2026-06-08", "Shipped"}
        };

        // Traverse and write batch data
        int rowNum = 2;
        for (String[] rowData : orderData) {
            for (int col = 0; col < rowData.length; col++) {
                sheet.getCellRange(rowNum, col + 1).setValue(rowData[col]);
            } 
            rowNum++;
        }

        // Save structured CSV file
        sheet.saveToFile("Record.csv", ",", Charset.forName("UTF-8"));
        workbook.dispose();
    }
}

Unlike the previous example where we populated each cell individually, this approach loops through arrays or collections. And the same pattern can be easily adapted to List<List<String>> or other dynamic data sources.

Output:

Create a CSV file from 1D & 2D arrays with Java

Create CSV from Excel in Java

When you need to convert Excel to CSV in Java, Spire.XLS makes this process incredibly simple with only a few lines of code.

import com.spire.xls.*;
import java.nio.charset.Charset;

public class ExcelToCSV {
    public static void main(String[] args) {
        // Load Excel file
        Workbook workbook = new Workbook();
        workbook.loadFromFile("sample.xlsx");
        
        // Get the first worksheet
        Worksheet sheet = workbook.getWorksheets().get(0);
        
        // Save worksheet as CSV
        sheet.saveToFile("converted.csv", ",", Charset.forName("UTF-8"));
        
        workbook.dispose();
    }
}

The core logic is straightforward: load the Excel file → target the specific worksheet → call the conversion API to save as CSV.

Excel to CSV Result:

Convert Excel to CSV using Java

The reverse operation – turning a CSV file into an Excel workbook – is equally valuable when you need to apply styles, formulas, or multiple worksheets to raw exported data.

Advanced CSV Generation Techniques

1. Handle Special Characters

When your data contains commas or double quotes, proper escaping is critical. Spire.XLS automatically wraps affected fields in double quotes per RFC 4180 standards, ensuring compatibility with Excel, WPS, and all standard text editors.

Example:

sheet.getCellRange("A1").setValue("Ergonomic, silent design | \"2026 New Model\"");

Create CSV with special characters in Java

2. Custom Delimiters and Encoding

Spire.XLS for Java supports custom delimiters beyond the standard comma, accommodating regional and format-specific requirements.

// Using semicolon as delimiter and UTF-16 as encoding
sheet.saveToFile("european_data.csv", ";", Charset.forName("UTF-16"));

// Using tab as delimiter for TSV files
sheet.saveToFile("tab_separated.txt", "\t", Charset.forName("UTF-8"));

Frequently Asked Questions

Q1. Does Spire.XLS for Java require Microsoft Excel?

No. The library works completely independently without any Office dependencies.

Q2. Can I append data to the end of an existing CSV file?

Spire.XLS loads full CSV content into a worksheet for editing. To append data, load the existing file, locate the last used row, write new records starting from the next row index, then save the file back.

Q3. Can multiple worksheets be exported to a single CSV file?

No. CSV is a plain-text, single-sheet format by definition. Each saveToFile() call exports exactly one worksheet to one CSV file. To export multiple sheets, call the save method separately for each worksheet to output individual CSV files.

Q4. What about licensing?

Spire.XLS for Java offers both commercial and free versions. While the free version carries certain usage limitations, it fully supports fundamental CSV operations and lightweight spreadsheet processing tasks.

Conclusion

Generating CSV files is a routine yet critical task in Java development. The quality and reliability of your CSV output directly impact downstream processes such as reporting, system migration, and data analysis. To ensure error‑free CSV generation, you need a library that handles formatting, encoding, and special characters automatically.

Spire.XLS for Java provides exactly that. By following the step‑by‑step code examples in this article, you can quickly integrate robust CSV generation into your Java projects, improving development efficiency while eliminating common formatting flaws and encoding errors.

For more advanced features (e.g., converting CSV to PDF), explore the Spire.XLS for Java Documentation.

Published in Document Operation

Tagged under

xls java Operation

News Category

Knowledgebase (2344)

Children categories

1. What "PDF to JSON" Actually Means

Raw Text JSON

Key-Value JSON

Structured Business JSON

2. Install Spire.PDF for .NET

3. Convert PDF Text to JSON in C#

Extract Text from PDF

Parse Key-Value Pairs into JSON

Key API Calls

Output

4. Convert PDF Tables to JSON in C#

Why Table Extraction Is Different from Text Extraction

Extract Tables from PDF

Serialize Table Data to JSON

Key API Calls

Sample JSON Output

Common Table Extraction Problems

Problem 1: Missing Headers

Problem 2: Merged Cells

Problem 3: Multi-Page Tables

5. Convert PDF Form Fields to JSON

Read and Export Form Fields

Key API Calls

Output

6. Invoice PDF to JSON: A Real-World Example

Target JSON Structure

Extraction Pattern

Handling Different Invoice Layouts

7. Convert Multiple PDFs to JSON in Batch

8. Build a PDF to JSON Converter in C#

Usage

Best Practices for Production Pipelines

9. Convert OCR Output to JSON in C#

Parse OCR Text into JSON

Best Practices

10. Performance Considerations

Large PDFs (100+ Pages)

Memory Usage

Parallel Processing

When to Use Streaming JSON

11. FAQ

Can I convert PDF to JSON in C# for free?

Can scanned PDFs be converted to JSON?

Can I convert PDF tables to JSON automatically?

Can I batch convert multiple PDFs to JSON?

How can I convert large PDF files to JSON in C#?

Can I convert PDF to JSON using an API?

Conclusion

1. How Is Word Converted into JSON?

2. Install the Required Library

3. Method 1 – Convert Word Text to JSON

3.1 Read Paragraphs from a Word Document

3.2 Serialize the Extracted Text to JSON

Output Example

Conversion Result

3.3 Explanation

4. Method 2 – Convert Word Tables to JSON

Why Tables Need Special Handling

Extracting Tables from a Word Document

Output Example

Conversion Result

Explanation

5. Method 3 – Preserve Document Structure in JSON

How to Preserve Headings, Paragraphs, and Tables in a Hierarchical JSON Structure

Output Example

Conversion Result

Explanation

6. When to Use Word to JSON Conversion

7. Limitations and Best Practices

Limitations

Best Practices

8. FAQ

Can I convert DOCX to JSON in Python?

What is the best Word to JSON converter for developers?

Can I convert Word tables to JSON?

Does Word have a native JSON export option?

Can I preserve headings and structure when converting Word to JSON?