File Formats

Catalogian supports CSV, TSV, and JSONL — plus gzip-compressed variants of all three. This page covers encoding, delimiters, headers, and edge cases.

Supported formats

FormatExtensionsNotes
CSV.csvComma-separated values with optional quoting
TSV.tsvTab-separated values
JSONL.jsonlOne JSON object per line (newline-delimited JSON)
Gzip CSV.csv.gzGzip-compressed CSV
Gzip TSV.tsv.gzGzip-compressed TSV
Gzip JSONL.jsonl.gzGzip-compressed JSONL

CSV & TSV

Header row

The first row is always treated as the header. Column names from the header become field names in the snapshot. Headers are case-sensitive — SKU and sku are different fields.

Delimiter detection

Catalogian auto-detects the delimiter by sampling the first few rows. Supported delimiters:

  • , — comma (default for .csv)
  • \t — tab (default for .tsv)
  • | — pipe
  • ; — semicolon

You can also set the delimiter explicitly when creating a source via the API by passing the delimiter field.

Quoting & escaping

Standard CSV quoting rules apply. Fields containing commas, newlines, or double quotes should be wrapped in double quotes. Double quotes within a quoted field are escaped by doubling them:

sku,name,description
SKU-001,"Widget, Large","A ""premium"" widget"
SKU-002,Gadget,Simple gadget

Column mismatches

If a row has fewer columns than the header, missing fields are stored as empty strings. If a row has more columns than the header, extra values are ignored. Catalogian records a column_mismatch warning (up to 50 per ingest) but still processes the row.

JSONL

JSONL (JSON Lines) files contain one JSON object per line. Each line must be a valid JSON object. The key field must be a top-level property of each object.

{"sku": "SKU-001", "name": "Widget", "price": 29.99, "in_stock": true}
{"sku": "SKU-002", "name": "Gadget", "price": 14.99, "in_stock": false}
{"sku": "SKU-003", "name": "Doohickey", "price": 9.99, "in_stock": true}

Nested objects: Catalogian flattens nested JSON into dot-notation field names. For example, {"dimensions": {"width": 10}} becomes the field dimensions.width.

Malformed JSON lines generate a malformed_json warning and are skipped. The first 50 warnings per ingest are recorded.

Gzip compression

Append .gz to any supported extension. Catalogian detects gzip encoding from the file extension and decompresses on the fly during streaming ingest. No configuration needed.

Gzip is recommended for files over 10 MB — it reduces transfer time and bandwidth for HTTP sources.

Encoding

Catalogian expects UTF-8 encoding. Files with a UTF-8 BOM (byte order mark) are handled automatically — the BOM is stripped before parsing. Other encodings (Latin-1, Windows-1252) may produce garbled field values.

Size limits

Maximum file size is 500 MB (uncompressed). For large feeds, gzip compression is strongly recommended — a 500 MB CSV typically compresses to under 50 MB.

Learn about source types and how to connect them. Source Types →