Spire.Office Knowledgebase Page 7

Subscribe to this RSS feed

Knowledgebase (2329)

Children categories

Spire.OfficeJs (5)

View items...

Automate Excel Writing in Python: Professional Reporting Practices

2025-09-18 09:49:49 Written by zaki zou

Tutorial on How to Write Data to Excel Worksheets Using Python Code

Excel remains one of the most widely used tools for organizing, analyzing, and presenting data. From financial reports to operational dashboards, many workflows require exporting data into Excel for better readability and sharing. Instead of manually entering information, automating Excel file writing with Python makes it faster, more reliable, and more scalable.

This tutorial explains how to write data to Excel files with Python, covering structured data insertion, formatting, and exporting. The examples use a Python Excel library that allows programmatic creation and customization of workbooks.

What's Included in This Tutorial:

Setting Up the Environment
Writing Data into Excel Files
Formatting While Writing
Working with Multiple Worksheets
Best Practices
Conclusion
FAQ

Setting Up the Environment

Before writing Excel files in Python, you need a library that supports creating, loading, and saving workbooks programmatically. Spire.XLS for Python provides a complete API for these operations, enabling automated report generation and data processing.

Install the package using pip:

pip install spire.xls

Once installed, you can handle Excel files using three core operations:

Creating a new workbook – initialize a new Excel document with Workbook().
Loading an existing workbook – open an existing Excel file using LoadFromFile().
Saving a workbook – export the workbook to the desired format with SaveToFile(), supporting .xlsx, .xls, CSV, and more.

These operations form the foundation for further data writing, formatting, and multi-sheet management in Python.

Writing Data into Excel Files with Python

In real-world business scenarios, you may need to create new Excel files, update existing reports, or write different types of data—such as text, numbers, dates, and formulas. This section demonstrates how to efficiently write and manage data in Excel files with Python across these common use cases.

Appending Data to an Existing Excel File

When you need to update an existing Excel workbook with new information—such as adding recent sales records, inventory updates, or additional data rows—you can open the file, append the data programmatically, and save it without overwriting existing content:

from spire.xls import Workbook, ExcelVersion

workbook = Workbook()
workbook.LoadFromFile("Sample.xlsx")
sheet = workbook.Worksheets[0]

# Add new rows
sheet.Range["A4"].Value = "Laptop"
sheet.Range["B4"].NumberValue = 5
sheet.Range["C4"].NumberValue = 1200.00

sheet.Range["A5"].Value = "Monitor"
sheet.Range["B5"].NumberValue = 10
sheet.Range["C5"].NumberValue = 300.00

workbook.SaveToFile("output/updated_excel.xlsx", ExcelVersion.Version2016)

Key elements used:

LoadFromFile() – loads an existing Excel file into the workbook object.
Range["CellName"] – references a specific cell in the sheet using its name.
Value / NumberValue – assigns text or numeric data to cells.
SaveToFile() – saves the workbook to a file in the specified Excel format.

This method allows continuous updates to reports while preserving existing content.

Example showing appended data:

Append rows to an existing Excel file in Python

Writing Multiple Rows and Columns to a New Excel File

When dealing with larger datasets, writing multiple rows and columns at once is much more efficient than updating individual cells one by one. This approach not only saves time but also ensures consistent data insertion across the worksheet:

from spire.xls import Workbook, ExcelVersion

# Create a new Excel workbook
workbook = Workbook()
sheet = workbook.Worksheets[0]

orders = [
    ["OrderID", "Customer", "Product", "Quantity", "Price", "Status"],
    [1001, "Alice", "Laptop", 2, 1200.00, "Shipped"],
    [1002, "Bob", "Monitor", 1, 300.00, "Pending"],
    [1003, "Charlie", "Keyboard", 5, 45.00, "Delivered"],
    [1004, "David", "Mouse", 3, 25.00, "Shipped"],
    [1005, "Eva", "Tablet", 1, 450.00, "Pending"]
]

for row_index, row_data in enumerate(orders, start=1):
    for col_index, value in enumerate(row_data, start=1):
        if isinstance(value, (int, float)):
            sheet.Range[row_index, col_index].NumberValue = value
        else:
            sheet.Range[row_index, col_index].Value = value

workbook.SaveToFile("output/orders.xlsx", ExcelVersion.Version2016)

Important elements in this example:

enumerate() – provides row and column indices for looping.
Range[row, col] – references a cell in the worksheet by its row and column indexes.

Batch writing ensures efficiency, especially when exporting database query results or operational reports.

Example showing batch data insertion:

Write multiple rows and columns into Excel using Python

Writing Different Data Types

Excel cells can contain various types of data, such as text, numbers, dates, formulas, and more. Using the correct properties and methods ensures that each type is stored and displayed appropriately, allowing accurate calculations and proper formatting:

from spire.xls import Workbook, ExcelVersion, DateTime, TimeSpan

workbook = Workbook()
sheet = workbook.Worksheets[0]

# Set general value
sheet.Range[2, 2].Text = "General Example"
sheet.Range[2, 3].Value = "General 123"

# Set number value
sheet.Range[3, 2].Text = "Number Example"
sheet.Range[3, 3].NumberValue = 1234.56
sheet.Range[3, 3].NumberFormat = "0.000"

# Set datetime value
sheet.Range[4, 2].Text = "Date Example"
sheet.Range[4, 3].DateTimeValue = DateTime.get_UtcNow()

# Set formula value
sheet.Range[5, 2].Text = "Formula Example"
sheet.Range[5, 5].NumberValue = 1234.56
sheet.Range[5, 6].NumberValue = 6543.21
sheet.Range[5, 3].Formula = "=SUM(E5:F5)"

# Set text
sheet.Range[6, 2].Text = "Text Example"
sheet.Range[6, 3].Text = "Text"

# Set boolean value
sheet.Range[7, 2].Text = "Boolean Example"
sheet.Range[7, 3].BooleanValue = True

sheet.AllocatedRange.AutoFitColumns()

workbook.SaveToFile("output/value_types.xlsx", ExcelVersion.Version2016)

Key functions and properties used:

Value – assigns or retrieves the general value of a cell, suitable for text or mixed content.
NumberValue – specifically handles numeric values in a cell, ensuring proper number formatting and calculations.
DateTimeValue – used to input or obtain date and time values in a cell with correct formatting.
Formula – sets or retrieves the formula expression in a cell to perform dynamic calculations.
BooleanValue – stores or returns a Boolean (True/False) value in a cell.
Text – retrieves the displayed text of a cell, including any applied formatting.

Proper handling of different data types is essential for accurate business calculations and reporting. For more details on supported data types, see the XlsRange API reference.

Example showing mixed data types:

Write text, numbers, dates, and formulas into Excel using Python

Formatting Excel While Writing Data with Python

To make Excel reports clear and professional, it’s important to apply formatting while entering or updating data. This section demonstrates how to enhance readability and presentation by styling cells, setting number formats, and adjusting column widths and row heights as you write data into Excel.

Applying Cell Styles

You can enhance the readability and appearance of your Excel sheet by applying various styles to cells, such as fonts, borders, and background colors:

from spire.xls import Workbook, Color, FontUnderlineType, ExcelVersion

workbook = Workbook()
sheet = workbook.Worksheets[0]
sheet.Range["A1"].Value = "Product"
sheet.Range["B1"].Value = "Category"
sheet.Range["C1"].Value = "Price"
sheet.Range["D1"].Value = "Quantity"
sheet.Range["E1"].Value = "Total"
sheet.Range["A2"].Value = "MacBook Pro"
sheet.Range["B2"].Value = "Laptop"
sheet.Range["C2"].NumberValue = 999.99
sheet.Range["D2"].NumberValue = 1
sheet.Range["E2"].Formula = "=C2*D2"
sheet.Range["A3"].Value = "iPhone 16 Pro"
sheet.Range["B3"].Value = "Smartphone"
sheet.Range["C3"].NumberValue = 899.99
sheet.Range["D3"].NumberValue = 1
sheet.Range["E3"].Formula = "=C3*D3"

# Set header style
header = sheet.Range["A1:E1"]
header.Style.Font.FontName = "Arial"
header.Style.Font.Size = 14.0
header.Style.Font.IsBold = True
header.Style.Font.Underline = FontUnderlineType.Single
header.Style.Interior.Color = Color.get_LightGray()
header.Style.Borders[BordersLineType.EdgeRight].LineStyle = LineStyleType.Medium

Core components demonstrated:

Style.Font – controls font-related settings such as bold, underline, and more (full list of supported properties can be found in the Style.Font API documentation).
FontUnderlineType.Single – applies a single underline.
Interior.Color – fills the cell background with a specified color.
Borders.LineStyle – adds borders around cells.

Styled cells enhance readability and emphasize critical sections.

Setting Number Formats for Excel Cells

Numbers in Excel often require specific display formats to improve readability and presentation. Using CellRange.NumberFormat, you can control how numeric values appear, such as applying currency, percentage, or integer formats:

# Apply number formats
sheet.Range["C2:C3"].NumberFormat = "$#,##0.00"   # Currency format
sheet.Range["D2:D3"].NumberFormat = "0"           # Integer format
sheet.Range["E2:E3"].NumberFormat = "$#,##0.00"

Key highlights:

NumberFormat – enables reading and setting Excel cell number formats, controlling how numbers are displayed while keeping the raw data intact.
Format codes define display rules such as currency symbols, decimal places, or percentage styles, giving you flexibility in presenting numerical data.

With proper number formatting, financial data is easier to interpret and looks more professional. For more details and a full list of format codes, see our dedicated guide on Setting Excel Cell Number Format in Python.

Adjusting Column Widths and Row Heights

Properly adjusting column widths and row heights ensures that all content is clearly visible. You can set them manually or use automatic fitting to match the content:

# Auto-fit column widths and row heights
for col in range(1, 5):
    sheet.AutoFitColumn(col)
for row in range(1, 3):
    sheet.AutoFitRow(row)

# Auto-fit a specific range of cells
#sheet.Range["A1:E3"].AutoFitColumns()
#sheet.Range["A1:E3"].AutoFitRows()

# Set a fixed column width and row height
sheet.Columns[1].Width = 150
sheet.Rows[1].Height = 30

workbook.SaveToFile("output/formatted_excel.xlsx", ExcelVersion.Version2016)

Key highlights:

AutoFitColumn(colIndex) / AutoFitRow(rowIndex) – automatically adjust a single column or row to fit its content.
CellRange.AutoFitColumns() / AutoFitRows() – automatically adjust all columns or rows within a specified cell range.
Columns[colIndex].Width / Rows[rowIndex].Height – manually set a fixed width or height for precise control.

With these options, you can choose between automatic fitting for dynamic data or fixed dimensions for consistent layout, ensuring your Excel worksheets remain both readable and professionally formatted.

Example showing styled and auto-fitted headers:

Apply styles and auto-fit columns in Excel using Python

To explore more advanced techniques for formatting Excel sheets in Python, including fonts, colors, borders, and conditional formatting, check out our dedicated guide on Formatting Excel in Python for detailed instructions.

Managing Multiple Worksheets in Excel with Python

In Excel, organizing data into multiple worksheets helps keep related information separated and easy to manage. For example, you can maintain separate sheets for sales, purchases, inventory, or other categories within the same workbook. This section demonstrates how to create, access, and manage multiple worksheets using Python.

from spire.xls import Workbook, ExcelVersion

workbook = Workbook()

sheet = workbook.Worksheets[0]
sheet.Name = "Sales"

sheet1 = workbook.Worksheets["Sheet2"]
sheet1.Name = "Purchases"

sheet2 = workbook.Worksheets.Add("Inventory")
sheet2.Range["A1"].Value = "ProductID"
sheet2.Range["B1"].Value = "Stock"

workbook.SaveToFile("output/multi_sheet.xlsx", ExcelVersion.Version2016)

Main features highlighted:

Worksheets[Index] – access a worksheet by its position in the workbook (useful for iterating over all sheets or referencing the first/last sheet).
Worksheets["SheetName"] – access a worksheet by its name, which is more readable and reliable if the sheet order might change.
Worksheets.Add("SheetName") – create a new worksheet to organize different categories of data such as departments, sales regions, or product lines.

These methods allow you to structure your Excel file efficiently, keeping related data on separate sheets for clarity and easier management.

Example showing multiple worksheets:

Create and manage multiple worksheets in Excel using Python

Best Practices for Writing Excel Files with Python

When writing Excel files with Python, follow best practices to maintain efficiency, consistency, and usability:

Use descriptive sheet names like “Sales_2024” instead of “Sheet1.”
Batch write large datasets instead of individual cell updates to improve performance.
Apply consistent formatting for headers, totals, and key columns.
Leverage Excel formulas to maintain dynamic calculations.
Validate data types to prevent misinterpretation in charts or formulas.
Choose file formats suited to the audience: .xlsx for modern users, .xls only for legacy compatibility.
Organize worksheets logically, grouping related datasets for easy navigation.

Implementing these practices avoids common pitfalls and produces professional, reusable reports.

Conclusion

Automating Excel writing in Python significantly streamlines reporting. By creating workbooks, writing data efficiently, applying styles, managing worksheets, and handling diverse data types, developers can ensure consistent, accurate, and professional Excel reports. To explore the library further, you can request a free temporary license or try the Free Spire.XLS for Python edition.

Python Excel Writing FAQ

Q1: Can Python write to an existing Excel file?

Yes, Python can load an existing workbook, append or modify data, and save it while preserving all previously entered content.

Q2: How to efficiently handle large datasets in Python?

Batch writing multiple rows and minimizing formatting during data insertion helps maintain high performance even with thousands of rows.

Q3: Can formulas be included in Excel files?

Yes, you can insert formulas, including =SUM() and more complex calculations, to keep your Excel reports dynamic and automatically updated.

Q4: Which Excel formats are supported?

Spire.XLS for Python can save files in .xlsx, .xls, CSV, and even export to PDF, covering most common use cases and compatibility needs.

Published in Document Operation

Tagged under

xls Python Document Operation

Convert CSV to XML in Python (Handle Real-World Cases)

2025-09-17 08:15:39 Written by zaki zou

Python Convert CSV to XML

CSV is one of the most common formats for exchanging tabular data because of its simplicity and wide support across platforms. However, when working with structured applications, configurations, or hierarchical data, XML often becomes the preferred choice due to its ability to represent nested relationships and enforce stricter data validation.

In this guide, we’ll explore how to convert CSV files into XML using Spire.XLS for Python. You’ll learn how to convert CSV into both Excel XML format (SpreadsheetML) and standard XML for general use. We’ll also cover how to clean and preprocess real-world CSV files—handling issues like invalid headers, missing values, special characters, and nested fields—so that your XML output is always valid and structured correctly.

Table of Contents:

Why Convert CSV to XML
Prerequisites
Convert CSV to Excel XML Format
Convert CSV to Standard XML
Handle Real-World CSV Challenges
Wrap Up
FAQs

Why Convert CSV to XML

So why should developers care about converting CSV to XML? Here are some real-world use cases:

Enterprise Data Migration: Many enterprise applications like ERP or CRM systems require XML input for bulk data import.
Configuration & Metadata: XML is often used to store structured metadata, while raw data may come in CSV form.
Interoperability: Some APIs, especially in finance, healthcare, and government sectors, still rely heavily on XML payloads.
Readable Reports: XML can represent hierarchical data, making it more descriptive compared to flat CSV files.
Validation: XML can be validated against schemas (XSD), ensuring data integrity—something CSV cannot provide out-of-the-box.

CSV is great for simplicity . XML is great for structure . By converting between them, you get the best of both worlds.

Prerequisites

Before diving into the code, let’s ensure we have the right tools:

Python 3.7+
Spire.XLS for Python → A professional library that provides powerful Excel manipulation capabilities.
Standard Python Libraries → xml.etree.ElementTree, csv, and re.

You can install Spire.XLS via pip (assuming you have Python and pip installed on your system):

pip install spire.xls

Additionally, make sure you have a CSV file ready to test. A simple one might look like this:

ID,FirstName,LastName,Age,Department,Salary
1,John,Doe,30,Sales,5000
2,Jane,Smith,28,Marketing,4500
3,Mike,Johnson,35,IT,6000
4,Emily,Brown,32,HR,4800

Convert CSV to Excel XML Format in Python

The first approach is to convert CSV into Excel-compatible XML . This format is sometimes called SpreadsheetML , introduced in Excel 2003. It allows CSV data to be transformed into a fully structured XML file that Excel can open directly.

With Spire.XLS, this process is straightforward:

from spire.xls import *

# Create a Workbook
workbook = Workbook()

# Load CSV into the first worksheet
workbook.LoadFromFile("input.csv", ",", 1, 1)

# Save as XML Spreadsheet format
workbook.SaveAsXml("output.xml")

# Dispose resources
workbook.Dispose()

How it Works

LoadFromFile() – Reads the CSV file into the workbook.
SaveAsXml() – Saves it as SpreadsheetML XML format.

This approach is best if your end goal is Excel compatibility rather than pure XML processing.

Output:

Convert CSV to Excel-compatible XML

Convert CSV to Standard XML in Python

Sometimes, you need a custom XML format rather than Excel-compatible XML. For instance, you may need an XML structure like:

<Employees>
  <Employee>
    <ID>1</ID>
    <FirstName>John</FirstName>
    <LastName>Doe</LastName>
    <Department>Sales</Department>
    <Salary>5000</Salary>
  </Employee>
</Employees>

Here’s how you can achieve that:

from spire.xls import *
import xml.etree.ElementTree as ET

# Step 1: Load CSV into workbook
workbook = Workbook()
workbook.LoadFromFile("input.csv", ",", 1, 1)
sheet = workbook.Worksheets[0]

# Step 2: Create root element
root = ET.Element("Root")

# Step 3: Loop through rows and build XML
# Assume first row is header
headers = []
for col in range(1, sheet.Columns.Count + 1):
    cell_value = sheet.Range[1, col].Value
    if not cell_value:
        break
    headers.append(cell_value)

# Add data rows
for row in range(2, sheet.Rows.Count + 1):
    # Stop if row is empty
    if not sheet.Range[row, 1].Value:
        break

    record = ET.SubElement(root, "Record")
    for col, header in enumerate(headers, start=1):
        cell_value = sheet.Range[row, col].Value
        field = ET.SubElement(record, header)
        field.text = cell_value if cell_value else ""

# Step 4: Save to XML file
tree = ET.ElementTree(root)
tree.write("output.xml", encoding="utf-8", xml_declaration=True)

# Cleanup
workbook.Dispose()

Explanation

Step 1: Load the CSV into a worksheet.
Step 2: Create a root <Root> element for the XML tree.
Step 3: Extract headers from the first row and use them as XML tags.
Step 4: Loop through rows and add <Record> elements with child fields.
Step 5: Save the XML to a file.

This approach is flexible because you can customize the XML schema however you like (e.g., change <Record> to <Employee>).

Output:

Convert CSV to Standard XML

You may also like: How to Convert XML to CSV in Python

Handle Real-World CSV Challenges

Converting a “perfect” CSV to XML is easy. But real-world CSVs are rarely perfect. Let’s discuss common issues developers face and how to fix them.

Headers with Invalid XML Names
- Problem: CSV headers like Employee ID or 123Name are invalid XML tags since XML element names cannot have spaces, start with numbers, or include special characters.
- Fix: Clean headers by replacing spaces with underscores (_) or prefixing numeric headers with a safe string.
Empty or Missing Values
- Problem: Real-world CSVs often have missing cells. When converted to XML, missing cells may create malformed structures.
- Fix: Replace blanks with placeholders (NULL, Unknown, or 0).
Special Characters in Data
- Problem: Characters like <, >, & inside CSV values can break XML if not escaped.
- Fix: Always escape XML entities properly (<, >, &).
Nested / Hierarchical Data in Flat CSV
- Problem: Some CSV cells contain multiple values or hierarchical information (e.g., lists of products) within a single field. Converting directly to XML will flatten the data, losing the intended hierarchy.

Example:

OrderID,Customer,Products
1001,John,"Laptop;Mouse;Keyboard"

Flattened XML (bad):

<Products>Laptop;Mouse;Keyboard</Products>

Fix: Detect list-like or nested values and expand them into child XML elements or multiple rows, preserving the structure in XML.

Hierarchical XML (fixed):

<Products>
  <Product>Laptop</Product>
  <Product>Mouse</Product>
  <Product>Keyboard</Product>
</Products>

Automating Fixes with clean_csv

Instead of cleaning data manually, you can automate preprocessing with the following helper function:

import csv
import re

def clean_csv(input_file, output_file, nested_columns=None, nested_delimiter=";"):
    if nested_columns is None:
        nested_columns = []

    cleaned_rows = []

    # Escape special characters for XML
    def escape_xml(text):
        return (text.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace('"', "&quot;")
                    .replace("'", "&apos;"))

    with open(input_file, "r", encoding="utf-8") as infile:
        reader = csv.reader(infile)
        headers = next(reader)

        # Clean headers
        cleaned_headers = []
        for h in headers:
            h = h.strip()
            h = re.sub(r"\s+", "_", h)          # Replace spaces with _
            h = re.sub(r"[^a-zA-Z0-9_]", "", h) # Remove invalid chars
            if re.match(r"^\d", h):             # Cannot start with digit
                h = "Field_" + h
            cleaned_headers.append(h)

        cleaned_rows.append(cleaned_headers)

        # Read all rows first
        raw_rows = []
        for row in reader:
            # Handle empty cells
            row = [cell if cell.strip() != "" else "NULL" for cell in row]
            raw_rows.append(row)

    # Expand nested columns
    if nested_columns:
        expanded_rows = [cleaned_headers]  # include header
        for row in raw_rows:
            row_variants = [row]
            for col_name in nested_columns:
                if col_name not in cleaned_headers:
                    continue
                col_index = cleaned_headers.index(col_name)
                temp = []
                for variant in row_variants:
                    cell_value = variant[col_index]
                    # Only split on the nested delimiter, not on XML special chars
                    if nested_delimiter in cell_value:
                        items = [item.strip() for item in cell_value.split(nested_delimiter)]
                        for item in items:
                            new_variant = variant.copy()
                            new_variant[col_index] = item
                            temp.append(new_variant)
                    else:
                        temp.append(variant)
                row_variants = temp
            expanded_rows.extend(row_variants)
        cleaned_rows = expanded_rows
    else:
        cleaned_rows.extend(raw_rows)

    # Escape special characters after expansion
    final_rows = [cleaned_rows[0]]  # header
    for row in cleaned_rows[1:]:
        final_row = [escape_xml(cell) for cell in row]
        final_rows.append(final_row)

    # Write cleaned CSV
    with open(output_file, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.writer(outfile)
        writer.writerows(final_rows)

    print(f"Cleaned CSV saved to {output_file}")

You can invoke the clean_csv function by passing the input CSV file path, output CSV file path, and optionally the columns to expand for nested values.

# File paths
input_file = r"C:\Users\Administrator\Desktop\input.csv"
output_file = r"C:\Users\Administrator\Desktop\cleaned_output.csv"

# Specify columns that may contain nested values
nested_columns = ["Products"]  # you can add more, e.g., ["Products", "Reviews"]

# Call the clean_csv function
clean_csv(input_file, output_file, nested_columns=nested_columns, nested_delimiter=";")

This function ensures your CSV is safe and clean before converting it to XML.

It helps:

Clean headers (valid XML names)
Handle empty cells
Expand nested values in any specified column(s)
Escape special characters for XML
Write a clean, UTF-8 CSV ready for XML conversion

Wrap Up

Converting CSV to XML in Python with Spire.XLS is both powerful and flexible. If you only need Excel-readable XML, saving directly to SpreadsheetML is the quickest option. If you need custom XML structures for APIs, migrations, or reporting, you can build standard XML using xml.etree.ElementTree.

Real-world CSVs are messy. They may have missing values, invalid headers, special characters, or nested lists. By integrating a preprocessing step (like the clean_csv function), you ensure that your XML output is both valid and meaningful.

Whether you’re preparing data for enterprise integration, creating structured reports, or exchanging information with legacy systems, this workflow gives you the best of both worlds: the simplicity of CSV and the structure of XML.

FAQs

Q1: Can I convert very large CSV files to XML?

Yes, but for very large files, you should use a streaming approach (processing rows one by one) to avoid memory issues.

Q2: Does Spire.XLS support direct conversion to XML without custom code?

Yes. Saving as SpreadsheetML is built-in. But for custom XML, you’ll need to write Python code as shown.

Q3: How do I handle special characters automatically?

Use the escape_xml helper or Python’s built-in xml.sax.saxutils.escape().

Q4: What if my CSV has nested data in multiple columns?

You can pass multiple column names into nested_columns when calling clean_csv.

Q5: Can I validate the generated XML?

Yes. After generating XML, validate it against an XSD schema if your system requires strict formatting.

Get a Free License

To fully experience the capabilities of Spire.XLS for Python without any evaluation limitations, you can request a free 30-day trial license.

Published in Conversion

Tagged under

xls Python Conversion

HTML Parsing in C#: Strings, Files, and URLs

2025-09-16 10:10:02 Written by zaki zou

How to Parse HTML in C#

The need to efficiently parse HTML in C# is a common requirement for many development tasks, from web scraping, data extraction to content automation. While .NET offers built-in tools (e.g., HtmlAgilityPack), Spire.Doc simplifies HTML parsing in C# with its intuitive object model and seamless integration.

This guide explores how to leverage Spire.Doc for .NET to parse HTML, including loading HTML from various sources, navigating document structure, and extracting critical data.

Setting Up Spire.Doc
How Spire.Doc Parses HTML
How to Load and Parse HTML Content
Conclusion
Common Questions

Setting Up Spire.Doc

The easiest way to integrate the C# HTML parser library into your project is via NuGet:

Open your project in Visual Studio.
Right-click the project in the Solution Explorer → Select Manage NuGet Packages.
In the NuGet Package Manager, search for Spire.Doc.
Select the latest stable version and click Install.

Alternatively, download the library directly from the E-iceblue website, extract the ZIP file, and reference Spire.Doc.dll in your project.

How Spire.Doc Parses HTML

Spire.Doc converts HTML into a structured object model, where elements like <p>, <a>, and <table> are mapped to classes you can programmatically access. Key components include:

Document: Acts as the container for parsed HTML content.
Section: Represents a block of content (similar to HTML’s <body> or <div> sections).
Paragraph: Maps to HTML block elements like <p>, <h1>, or <li>.
DocumentObject: Base class for all elements within a Paragraph (images, links, etc.).

This model ensures that HTML structures are preserved and accessible via intuitive C# properties and methods.

How to Load and Parse HTML Content

Spire.Doc supports parsing HTML from strings, local files, or even remote URLs (when combined with HTTP clients). Below are detailed examples for each scenario.

Parse an HTML String in C#

Parse an HTML string (e.g., from a web API or database) into Spire.Doc’s object model for inspection.

using Spire.Doc;
using Spire.Doc.Documents;

namespace ParseHtmlString
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a Document object
            Document doc = new Document();

            // Add a section to act as a container
            Section section = doc.AddSection();
            // Add a paragraph
            Paragraph para = section.AddParagraph();

            // Define HTML content to parse
            string htmlContent = @"
                <h2>Sample HTML String</h2>
                <p>This is a paragraph with <strong>bold text</strong> and a <a href='https://www.e-iceblue.com/'>link</a>.</p>
                <ul>
                    <li>List item 1</li>
                    <li>List item 2</li>
                </ul>
            ";

            // Parse the HTML string into the paragraph
            para.AppendHTML(htmlContent);

            // Print all paragraph text
            Console.WriteLine("Parsed HTML Content:");
            Console.WriteLine("---------------------");

            foreach (Paragraph paragraph in section.Paragraphs)
            {
                Console.WriteLine(paragraph.Text);
            }
        }
    }
}

In this code, the method AppendHTML() automatically converts HTML tags to corresponding Spire.Doc objects (e.g., <h1> → Heading1 style, <ul> → list paragraphs).

Output:

Parse an HTML string using C#

Pro Tip: You can also call the SaveToFile() method to convert the HTML string to Word in C#.

Parse an HTML File in C#

For HTML content stored in a file (e.g., downloaded web pages, static HTML reports), load it via LoadFromFile() and then analyze its structure (e.g., extracting headings, paragraphs).

using Spire.Doc;
using Spire.Doc.Documents;

namespace ParseHtmlFile
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a Document object
            Document doc = new Document();

            // Load an HTML file
            doc.LoadFromFile("sample.html", FileFormat.Html);

            // Traverse sections (HTML body blocks)
            foreach (Section section in doc.Sections)
            {
                Console.WriteLine($"Section {doc.Sections.IndexOf(section) + 1}:");
                Console.WriteLine("---------------------------------");

                // Traverse paragraphs in the section
                foreach (Paragraph para in section.Paragraphs)
                {
                    // Print paragraph text and style (e.g., heading level)
                    string styleName = para.StyleName;
                    Console.WriteLine($"[{styleName}] {para.Text}"+ "\n");
                }
                Console.WriteLine();
            }
        }
    }
}

This C# code example loads a local HTML file and then uses the Paragraph.StyleName and Paragraph.Text properties to extract content along with its styling information.

Output:

Load and parse an HTML file using C#

Spire.Doc’s object model allows you to interact with an HTML file just like you would with a Word document. In addition to extracting text content, you can also extract elements like links, tables from HTML.

Parse a URL in C#

To parse HTML from a web page, combine Spire.Doc with HttpClient to fetch the HTML content first, then parse it.

using Spire.Doc;
using Spire.Doc.Documents;

namespace HtmlUrlParsing
{
    class Program
    {
        // HttpClient instance for web requests
        private static readonly HttpClient httpClient = new HttpClient();

        static async Task Main(string[] args)
        {
            try
            {
                // Fetch HTML from a URL
                string url = "https://www.e-iceblue.com/privacypolicy.html";
                Console.WriteLine($"Fetching HTML from: {url}");
                string htmlContent = await FetchHtmlFromUrl(url);

                // Parse the fetched HTML
                Document doc = new Document();
                Section section = doc.AddSection();
                Paragraph paragraph = section.AddParagraph();
                paragraph.AppendHTML(htmlContent);

                // Extract key information
                Console.WriteLine("\nParsed Content Summary:");
                Console.WriteLine($"Sections: {doc.Sections.Count}");
                Console.WriteLine($"Paragraphs: {section.Paragraphs.Count}");
                Console.WriteLine("-------------------------------------------");

                // Extract all heading paragraphs
                foreach (Paragraph para in section.Paragraphs)
                {
                    if (para.StyleName.StartsWith("Heading"))
                    {
                        string headings = para.Text;
                        Console.WriteLine($"Headings: {headings}");
                    }
                }
                
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Error: {ex.Message}");
            }
        }

        // Helper method to fetch HTML from a URL
        private static async Task<string> FetchHtmlFromUrl(string url)
        {
            // Set a user-agent to avoid being blocked by servers
            httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (Windows NT 10.0; Win64; x64)");

            // Send GET request and return HTML content
            HttpResponseMessage response = await httpClient.GetAsync(url);
            response.EnsureSuccessStatusCode(); // Throw on HTTP errors (4xx, 5xx)
            return await response.Content.ReadAsStringAsync();
        }
    }
}

This C# code combines web scraping (fetching HTML from a URL) with document parsing (using Spire.Doc) to extract structured information (like headings) from web content. It’s useful for scenarios like content analysis or web data extraction.

Output:

Parse HTML from a web URL using C#

Conclusion

Spire.Doc for .NET provides a comprehensive solution for reading HTML in C# applications. Whether you're working with HTML strings, local files, or even web URLs, this library streamlines the process with intuitive APIs and reliable performance. By following the examples outlined in this guide, you can efficiently integrate HTML parsing capabilities into your .NET projects.

To fully experience the capabilities of Spire.Doc for .NET, request a free 30-day trial license here.

Common Questions

Q1: Why use Spire.Doc for HTML parsing instead of HtmlAgilityPack?

A: Spire.Doc and HtmlAgilityPack serve different primary goals, so the choice depends on your needs:

HtmlAgilityPack: A lightweight library only for parsing and manipulating raw HTML (e.g., extracting tags, fixing invalid HTML). It does not handle document formatting or export to Word.
Spire.Doc: Designed for document manipulation first - it parses HTML and maps it directly to structured Word elements (sections, paragraphs, styles like headings/bold). This is critical if you need to:
- Preserve HTML structure in an editable Word file.
- Extract styled content (e.g., identify "Heading 1" vs. "Normal" text).
- Export parsed HTML to RTF, TXT, PDF, etc.

Q2. How do I convert HTML to Text in C#

A: To convert an HTML file to plain text in C#, get its text content via the GetText() method and then write the result to a .txt file.

// Create a Document object
Document doc = new Document();

// Load an HTML file
doc.LoadFromFile("sample.html", FileFormat.Html);

// Get text from HTML
string text = doc.GetText();

// Write to a text file
File.WriteAllText("HTMLText.txt", text);

Q3: Can Spire.Doc handle malformed or incomplete HTML?

A: Spire.Doc has good error tolerance and can handle imperfect HTML to some extent. However, severely malformed HTML might cause parsing issues. For best results, ensure your HTML is well-formed or use HTML sanitization libraries before parsing with Spire.Doc.

Q3: Can I use Spire.Doc in ASP.NET Core applications?

A: Yes, Spire.Doc is fully compatible with ASP.NET Core applications. The installation and usage process is the same as in other .NET applications.

Published in Document Operation

Tagged under

doc net Operation

News Category

Knowledgebase (2329)

Children categories

Setting Up the Environment

Writing Data into Excel Files with Python

Appending Data to an Existing Excel File

Writing Multiple Rows and Columns to a New Excel File

Writing Different Data Types

Formatting Excel While Writing Data with Python

Applying Cell Styles

Setting Number Formats for Excel Cells

Adjusting Column Widths and Row Heights

Managing Multiple Worksheets in Excel with Python

Best Practices for Writing Excel Files with Python

Conclusion

Python Excel Writing FAQ

Q1: Can Python write to an existing Excel file?

Q2: How to efficiently handle large datasets in Python?

Q3: Can formulas be included in Excel files?

Q4: Which Excel formats are supported?

Why Convert CSV to XML

Prerequisites

Convert CSV to Excel XML Format in Python

Convert CSV to Standard XML in Python

Handle Real-World CSV Challenges

Automating Fixes with clean_csv

Wrap Up

FAQs

Q1: Can I convert very large CSV files to XML?

Q2: Does Spire.XLS support direct conversion to XML without custom code?

Q3: How do I handle special characters automatically?

Q4: What if my CSV has nested data in multiple columns?

Q5: Can I validate the generated XML?

Get a Free License

Setting Up Spire.Doc

How Spire.Doc Parses HTML

How to Load and Parse HTML Content

Parse an HTML String in C#

Parse an HTML File in C#

Parse a URL in C#

Conclusion

Common Questions

Q1: Why use Spire.Doc for HTML parsing instead of HtmlAgilityPack?

Q2. How do I convert HTML to Text in C#

Q3: Can Spire.Doc handle malformed or incomplete HTML?

Q3: Can I use Spire.Doc in ASP.NET Core applications?

More...