CSV, JSON, XML, YAML: The Data Transformation Guide

Published April 11, 2026 · 11 min read

Every developer eventually becomes an accidental ETL engineer. A marketing team exports a spreadsheet, a legacy system spits out XML, a config file arrives as YAML, and somebody needs a JSON payload for the frontend. The formats themselves are simple on paper. The bugs live in the edges — unquoted commas, trailing newlines, byte order marks, number precision, mixed date formats, the "true" that becomes a boolean in YAML but a string everywhere else.

This guide walks through what each format actually is at the spec level, the transformations you will most often need, the edge cases that silently corrupt data, and the browser-based converters that handle the common cases without asking you to upload your file to a server.

The five formats, briefly

CSV is a line-oriented, comma-separated text format that predates almost everything else in this list. It is described, mostly, by RFC 4180; "mostly" because CSV has no single authority and every producer has its own quirks. A CSV file is usually a header row (optional under RFC 4180), then data rows, with fields separated by commas and optionally wrapped in double quotes.

JSON is defined by RFC 8259 and the ECMA-404 standard. It is a strict, typed, tree-shaped format with six value kinds: string, number, boolean, null, array, and object. Unlike CSV, there is one JSON and one way to write it. That simplicity is why it won.

XML is specified by the W3C XML 1.0 recommendation. It is a verbose, tree-shaped markup language with elements, attributes, and namespaces. XML is older than JSON and still dominant in enterprise systems, SOAP APIs, document formats (DOCX, ODT), and RSS.

YAML (see YAML 1.2.2 spec) is a human-readable superset of JSON. It allows indentation-based structure, comments, and anchor/alias references. It is popular for configuration (Kubernetes, GitHub Actions, Docker Compose) where humans edit the file directly.

TOML is a newer configuration format (used by Rust's Cargo and Python's pyproject.toml) that trades YAML's flexibility for unambiguous parsing. Every TOML file parses to one obvious data structure, which is a property YAML cannot always promise.

CSV to JSON and back

The most common transformation in the world: take a spreadsheet and turn it into an array of objects. The mental model is straightforward — headers become object keys, rows become objects. In practice, here is what goes wrong.

Commas inside fields. A value like Smith, John in a "name" column needs to be quoted: "Smith, John". Parsers that split on commas without respecting quotes will silently produce garbage. The fix is to use a parser that implements RFC 4180 field-quoting rules. CSV to JSON handles quoted fields correctly; if your spreadsheet exports through Excel, expect Excel's regional quirks to require the occasional manual check.
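The difference is easy to see in a minimal Python sketch that uses only the standard library's csv module. The sample line is made up for illustration.

```python
import csv
import io

# Hypothetical sample: the name field contains a comma, so it is quoted.
line = '"Smith, John",42,Oslo\n'

# Naive splitting ignores the quotes and yields four broken fields.
naive = line.strip().split(",")

# csv.reader honors the double-quote rules, so the name stays one field.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # four fields, quotes still attached
print(parsed)  # three fields, quotes removed
```

The naive split also leaves stray quote characters in the data, which is often how this bug is first noticed downstream.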

Header inference. What if the first row is not a header? Most converters assume it is. Many tools let you disable that and use auto-generated keys (col1, col2). Know which behavior your tool is using before you trust the output.
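As a rough sketch of the two behaviors, the hypothetical helper below (rows_to_objects is not any tool's API) either treats the first row as keys or generates col1, col2, and so on.

```python
import csv
import io

def rows_to_objects(text, has_header=True):
    """Convert CSV text to a list of dicts.

    With has_header=False, keys are generated as col1, col2, ...
    and the first row is kept as data.
    """
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return []
    if has_header:
        keys, data = rows[0], rows[1:]
    else:
        keys = [f"col{i}" for i in range(1, len(rows[0]) + 1)]
        data = rows
    # zip() silently drops extra fields on ragged rows; a stricter
    # version would raise instead.
    return [dict(zip(keys, row)) for row in data]

sample = "name,city\nAda,London\n"
with_header = rows_to_objects(sample)
without_header = rows_to_objects(sample, has_header=False)
```

With the header enabled the sample yields one object; with it disabled the same text yields two, and the original header row becomes ordinary data.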

Type inference. Is "42" a number or a string? CSV has no native types, so every converter has to guess. Most guess badly on postal codes ("01234" becomes 1234 and loses the leading zero), phone numbers, and ID strings. The safest default is "everything is a string until proven otherwise" and then explicit type coercion downstream.
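One way to apply the "strings first, coerce explicitly" rule is sketched below. The column names and the COERCE table are assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical export: zip must keep its leading zero, qty is a real number.
text = "zip,qty\n01234,3\n90210,12\n"

# DictReader yields every value as a string; nothing is guessed.
records = list(csv.DictReader(io.StringIO(text, newline="")))

# Explicit, per-column coercion. Columns not listed here stay strings.
COERCE = {"qty": int}
typed = [
    {key: COERCE.get(key, str)(value) for key, value in row.items()}
    for row in records
]

print(json.dumps(typed))
```

Because zip is never passed through int(), the leading zero survives into the JSON output.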

Round-tripping. Converting CSV to JSON and back should preserve the data. It often does not — because JSON can represent nested objects, arrays of arrays, and typed null, none of which CSV can express. If you need to round-trip, keep the intermediate JSON flat. JSON to CSV will flatten nested structures but you lose fidelity.

XML in the JSON world

You still meet XML when you talk to SOAP services, parse RSS feeds, extract text from DOCX, or deal with legacy enterprise stacks. Converting XML to JSON is not a one-to-one mapping because XML has concepts JSON does not — attributes, namespaces, mixed content, text nodes alongside elements.

The convention most converters follow is a variant of the "BadgerFish" style: attributes become keys prefixed with @, text becomes a #text key, and repeated children become arrays. (BadgerFish proper stores text under $; many tools use #text instead.) XML to JSON applies this convention by default. If you need a different mapping, say ignoring attributes entirely, check whether your tool exposes that as an option.
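The @ and #text mapping can be sketched in a few lines with Python's xml.etree.ElementTree. This is an illustration of the convention, not how any particular converter is implemented; it ignores namespaces and mixed content (text that follows a child element), and element_to_dict is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Map an element to a dict: attributes as @name, text as #text,
    repeated child tags collected into a list."""
    node = {f"@{name}": value for name, value in elem.attrib.items()}
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:
            # Second occurrence of a tag: promote the entry to a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if text:
        if not node:
            return text  # leaf with no attributes: plain string
        node["#text"] = text
    return node

xml_doc = (
    '<feed lang="en">'
    '<item id="1">First</item><item id="2">Second</item>'
    '<title>News</title>'
    '</feed>'
)
root = ET.fromstring(xml_doc)
result = {root.tag: element_to_dict(root)}
```

Note the wart this convention carries: a tag that appears once maps to a single value and a tag that appears twice maps to a list, so consumers have to handle both shapes.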

The reverse direction, JSON to XML, is even messier, because many JSON shapes have no clean XML representation. An array of heterogeneous objects has to be wrapped in a container element, and you have to pick what to call it. If you find yourself repeatedly converting JSON back to XML, that is a sign the XML consumer should accept JSON instead.
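The naming problem shows up immediately in code. In the sketch below, to_xml is a hypothetical helper and "item" is an invented wrapper name, which is exactly the kind of decision the JSON never made for you. It assumes every key is a valid XML element name.

```python
import xml.etree.ElementTree as ET

def to_xml(tag, value, item_tag="item"):
    """Build an element named tag from a JSON-like value.

    Lists have no XML equivalent, so each entry is wrapped in an
    element called item_tag (a name this sketch has to invent).
    """
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(to_xml(key, child, item_tag))
    elif isinstance(value, list):
        for child in value:
            elem.append(to_xml(item_tag, child, item_tag))
    elif isinstance(value, bool):
        # Checked before the generic case so True becomes "true".
        elem.text = "true" if value else "false"
    elif value is not None:
        elem.text = str(value)
    return elem

payload = {"users": [{"name": "Ada"}, {"name": "Linus", "admin": True}]}
xml_text = ET.tostring(to_xml("root", payload), encoding="unicode")
```

Numbers and booleans come out as plain text, so the type information in the JSON is gone unless the consumer has a schema.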

YAML, TOML, and configuration

YAML is where "human-readable" meets "surprisingly error-prone." Three facts catch every new YAML user.

First, whitespace matters. The indentation width is yours to choose, but siblings must line up: an entry indented two spaces where its neighbors use four is a parse error or, worse, a semantically different document. Tabs are forbidden in indentation. If you paste YAML from a webpage, check that the indentation did not come with non-breaking spaces.
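A quick pre-flight check for the two invisible offenders, tabs and non-breaking spaces, might look like the sketch below. indentation_problems is a hypothetical helper, not part of any YAML library, and it does not parse YAML at all.

```python
def indentation_problems(text):
    """Return (line_number, problem) pairs for suspicious leading whitespace."""
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        # Everything before the first non-whitespace character.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append((number, "tab in indentation"))
        if "\u00a0" in indent:
            problems.append((number, "non-breaking space in indentation"))
    return problems

# Line 2 is indented with non-breaking spaces, line 3 with a tab.
sample = "services:\n\u00a0\u00a0web:\n\timage: nginx\n  ports: []\n"
found = indentation_problems(sample)
```

Running this before the real parser turns a confusing parse error into a line number and a cause.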

Second, YAML 1.1 interprets yes, no, on, off, true, false as booleans. YAML 1.2 narrows that to just true and false. Which version does your parser use? Norway, famously, cannot have a country code of NO in a YAML 1.1 file, because NO parses as false. This is known as "the Norway problem" and it still trips teams up.
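The version difference can be illustrated with a toy resolver. This is not a YAML parser: it only looks at a single unquoted token, and the 1.1 set below covers the words named above in their usual capitalizations rather than the full list in the 1.1 type definition.

```python
# Unquoted tokens treated as booleans, per version (illustrative subsets).
YAML_1_1_TRUE = {"yes", "Yes", "YES", "on", "On", "ON", "true", "True", "TRUE"}
YAML_1_1_FALSE = {"no", "No", "NO", "off", "Off", "OFF", "false", "False", "FALSE"}
YAML_1_2_TRUE = {"true", "True", "TRUE"}
YAML_1_2_FALSE = {"false", "False", "FALSE"}

def resolve_plain_scalar(token, version="1.1"):
    """Return a bool if the token is a boolean under the given version's
    rules, otherwise return the token unchanged as a string."""
    if version == "1.1":
        truthy, falsy = YAML_1_1_TRUE, YAML_1_1_FALSE
    else:
        truthy, falsy = YAML_1_2_TRUE, YAML_1_2_FALSE
    if token in truthy:
        return True
    if token in falsy:
        return False
    return token

norway_1_1 = resolve_plain_scalar("NO", version="1.1")
norway_1_2 = resolve_plain_scalar("NO", version="1.2")
```

Under the 1.1 rules the country code becomes a boolean; under 1.2 it stays a string. Quoting the value in the YAML source ("NO") avoids the problem under either version.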

Third, YAML supports anchors and aliases (&name and *name), which let you reference a block defined earlier. This is powerful and brittle — aliases that span files or rely on specific load order cause the kind of bug that only shows up in production. If your config files are getting clever with anchors, consider flattening them.

Convert between JSON and YAML with YAML to JSON and its inverse JSON to YAML. For TOML, TOML to JSON gives you a strict intermediate, and YAML to TOML works when you are migrating config from a CI pipeline to a Rust project.

Edge cases that corrupt data

Byte order marks

A BOM is three bytes at the start of a UTF-8 file (EF BB BF) that some Windows tools add. Most parsers strip it. Some do not, and the first header column ends up named \uFEFFname instead of name. If your JSON keys mysteriously do not match, inspect the raw bytes.
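In Python, the difference comes down to which codec decodes the bytes. The sketch below uses a made-up two-line file; header_of is a hypothetical helper.

```python
import csv
import io

# Bytes as a Windows tool might write them: UTF-8 BOM, then the CSV.
raw = b"\xef\xbb\xbfname,city\r\nAda,London\r\n"

def header_of(data, encoding):
    """Decode the bytes and return the first CSV row."""
    text = data.decode(encoding)
    return next(csv.reader(io.StringIO(text, newline="")))

with_bom = header_of(raw, "utf-8")         # BOM survives as U+FEFF
without_bom = header_of(raw, "utf-8-sig")  # BOM is stripped if present
```

The utf-8-sig codec also accepts files that have no BOM, which makes it a reasonable default when the source of a CSV is unknown.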

Trailing commas

JSON forbids trailing commas. JavaScript allows them. If your JSON was hand-written by a developer who forgot which language they were in, it will not parse. JSON5 is a relaxed dialect that permits them but it is not JSON.
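A strict parser rejects the trailing comma and reports where it gave up. The sketch below uses Python's json module; the exact message and column vary between Python versions, so it only records them.

```python
import json

# Valid as a JavaScript object literal, invalid as JSON.
hand_written = '{"name": "Ada", "tags": ["a", "b",],}'

try:
    json.loads(hand_written)
    error = None
except json.JSONDecodeError as exc:
    # lineno and colno are 1-based positions of the failure.
    error = (exc.lineno, exc.colno, exc.msg)

# The same document without the trailing commas parses fine.
cleaned = json.loads('{"name": "Ada", "tags": ["a", "b"]}')
```

Only the first offending comma is reported; fix it and parse again to find the next one.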

Number precision

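The JSON grammar places no limit on number precision, but RFC 8259 notes that most implementations store numbers as IEEE 754 doubles, which represent integers exactly only up to 2^53 - 1 in magnitude. A 64-bit integer, for instance a Twitter ID, may not fit exactly in a double. Parsers that convert numbers to native doubles, JavaScript's JSON.parse among them, will silently round it. The workaround: emit large IDs as strings.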
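The rounding is easy to reproduce. Python's own json module keeps integers exact, so the sketch below passes parse_int=float to mimic a parser that stores every number as a double. The ID value is made up.

```python
import json

big_id = 9007199254740993  # 2**53 + 1, a hypothetical ID

# A double cannot represent this integer; it rounds to a neighbor.
as_double = float(big_id)
lost = int(as_double) != big_id

doc = '{"id": 9007199254740993}'
exact = json.loads(doc)["id"]                     # Python int, still exact
rounded = json.loads(doc, parse_int=float)["id"]  # double-only behavior

# The workaround from the text: carry the ID as a string.
safe_doc = json.dumps({"id": str(big_id)})
```

No error is raised at any point, which is why this class of bug tends to surface as "record not found" rather than as a parse failure.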

Date formats

CSV, JSON, and XML have no native date type (TOML does, and YAML 1.1 parsers may resolve unquoted timestamps). Everyone agrees to use ISO 8601 strings (2026-04-11T09:30:00Z), and then someone exports from Excel and you get 4/11/26 or 11-Apr-26. Normalize dates at the boundary.
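Normalizing at the boundary can be as simple as trying a short list of known formats. The list below is an assumption about what one particular source emits: it reads 4/11/26 as month/day/year, which is wrong for day-first locales, and %b relies on English month names in the active locale. to_iso_date is a hypothetical helper.

```python
from datetime import datetime

# Formats this sketch accepts, tried in order.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%y", "%d-%b-%y")

def to_iso_date(value):
    """Return YYYY-MM-DD for a recognized date string, else raise ValueError."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")

normalized = [to_iso_date(v) for v in ("2026-04-11", "4/11/26", "11-Apr-26")]
```

Raising on anything unrecognized is deliberate: a guessed date is harder to find later than a failed import.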

Line endings

CSV files from Windows use \r\n, from Unix \n. Most modern parsers handle both, but old ones choke. If an otherwise-valid CSV produces one enormous row, check line endings.
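In Python, the usual way to stay out of trouble is to pass newline="" so the csv module sees the raw line endings and handles them itself. A small sketch with in-memory text (read_rows is a hypothetical helper):

```python
import csv
import io

def read_rows(text):
    # newline="" leaves line endings untranslated so the csv module can
    # treat \r\n and \n as row terminators on its own. The same argument
    # applies when opening a real file with open(path, newline="").
    return list(csv.reader(io.StringIO(text, newline="")))

windows = "id,name\r\n1,Ada\r\n"
unix = "id,name\n1,Ada\n"

rows_windows = read_rows(windows)
rows_unix = read_rows(unix)
```

Both inputs produce the same rows, with no stray carriage return attached to the last field.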

Encoding

Every modern stack assumes UTF-8. Legacy systems still emit Latin-1, Windows-1252, or Shift-JIS. Non-ASCII characters garbled in the output usually means the input was not what the tool thought it was.
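A hedged sketch of a decode-with-fallback step follows. The candidate list is an assumption about the likely sources; UTF-8 goes first because it rejects most non-UTF-8 input, whereas single-byte encodings accept almost anything. decode_bytes is a hypothetical helper.

```python
def decode_bytes(raw, candidates=("utf-8", "cp1252")):
    """Try each encoding in order and return (text, encoding_used)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")

word = "caf\u00e9"               # "cafe" with an acute accent on the e
legacy = word.encode("cp1252")   # one byte for the accented letter
modern = word.encode("utf-8")    # two bytes for the accented letter

from_legacy = decode_bytes(legacy)
from_modern = decode_bytes(modern)

# Decoding UTF-8 bytes with the wrong codec produces classic mojibake.
garbled = modern.decode("cp1252")
```

The garbled value is the two-character sequence that typically replaces an accented letter when UTF-8 input is read as Windows-1252.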

Validation and pretty printing

Before you convert, validate. Invalid inputs produce confusing errors several steps later. JSON Validator will point at the exact offset of a parse error, which is the fastest way to debug an "it worked yesterday" file. YAML Validator does the same for YAML and will catch the indentation mistakes that silently change your document's shape. For XML, XML Formatter indents the document so you can eyeball nesting issues.

Once validated, JSON Formatter pretty-prints the result so that a diff between two versions shows real changes instead of whitespace noise. If you are committing config files to git, store them pretty-printed for exactly this reason.
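The same effect is available from a script. The sketch below re-serializes JSON with fixed indentation; sorting keys is an extra assumption on top of pretty-printing and changes key order, so drop sort_keys if order carries meaning in your files. canonical is a hypothetical helper name.

```python
import json

def canonical(text):
    """Re-serialize JSON with two-space indentation and sorted keys."""
    data = json.loads(text)
    return json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False) + "\n"

# Same data, different whitespace and key order.
compact = '{"b":1,"a":[1,2]}'
spaced = '{ "a": [1, 2],\n  "b": 1 }'

same = canonical(compact) == canonical(spaced)
```

Two files that differ only in whitespace or key order produce identical output, so a diff between them is empty.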

Adjacent tools worth bookmarking

Other conversions developers end up needing often: Base64 Encoder/Decoder for binary payloads embedded in JSON, URL Encoder/Decoder for query strings and form bodies, JWT Decoder when a Bearer token is the data you need to inspect, and Hash Generator for checksumming files before and after conversion to prove nothing changed unintentionally.

Related pillar guide

This cluster sits inside the developer track. For a broader walkthrough of browser tools across frontend, backend, and DevOps, see The Complete Guide to Free Online Tools in 2026.

FAQ

Is it safe to convert sensitive CSV files in the browser?

If the tool runs client-side (FastTool converters do), the file is processed in your browser: no upload, no server processing, no log entries. That said, always verify with the network tab of your browser's dev tools if you are dealing with regulated data.

Why does my JSON have numbers wrapped in quotes?

Because the source format (usually CSV) had no type information and the converter chose the safe default of "emit as string." You can cast the values downstream once you know which columns are really numbers.

What is the fastest format for large data files?

For storage and streaming, nothing in this article wins. Parquet and Arrow are binary columnar formats, and Avro is a binary row-oriented format; all three are typically far faster to read and write at scale than any text format. The formats here are for interchange and human editing, not performance.

Can I convert nested JSON to a flat CSV?

Yes, by picking a flattening convention — usually joining nested keys with dots (user.name). You lose the tree structure, so the output is for human consumption, not round-tripping.
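A minimal sketch of the dot-joining convention, using only the standard library. The records are invented, flatten is a hypothetical helper, and lists are left as they are rather than expanded.

```python
import csv
import io

def flatten(obj, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

records = [
    {"id": 1, "user": {"name": "Ada", "geo": {"city": "London"}}},
    {"id": 2, "user": {"name": "Linus"}},
]
rows = [flatten(record) for record in records]

# Collect every column seen, in first-seen order, since rows can differ.
columns = list(dict.fromkeys(key for row in rows for key in row))

out = io.StringIO(newline="")
writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = out.getvalue()
```

The second record has no city, so its cell is written empty; the CSV cannot say whether the key was missing or the value was an empty string, which is one reason the result does not round-trip.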

Which format should I pick for a new config file?

TOML if a parser exists for your language. YAML if humans will frequently edit it and you accept the footguns. JSON if a machine is the primary author. Avoid XML for new projects unless you have a hard requirement.

Closing thought

Data format conversion looks trivial until the first production file arrives with a BOM, mixed line endings, and a Norwegian country code. The tools in this guide handle the common cases so you can spend your time on the actual edge cases that only your data has.