Skip to content

[Bug] NewIntXStatistics factories unconditionally set hasDistinctCount=true, causing distinct_count=0 to always appear in Parquet output #806

@gpandolfo

Description

@gpandolfo

Describe the bug, including details regarding any error messages, version, and platform.

metadata/statistics_types.gen.go factories (NewInt32Statistics, NewInt64Statistics, NewFloat32Statistics, NewFloat64Statistics, etc.) initialize hasDistinctCount: true at construction time, even though no distinct count has been computed. The result: every Statistics object emitted by arrow-go's parquet writer carries distinct_count = 0 in the thrift output — both in ColumnChunk.metadata.statistics (footer) and in DataPageHeader.statistics (column-data page header).

The Parquet thrift Statistics.distinct_count field is OPTIONAL. Per spec semantics, the field MUST be absent when the value is unknown — emitting 0 conflates "no distinct values" with "we don't know."

Comparison to parquet-cpp (pyarrow)
parquet-cpp (pyarrow) does NOT exhibit this behavior: distinct_count is set only when actually computed. For the same input table, pyarrow's footer + page headers omit the distinct_count field entirely; arrow-go's emit distinct_count: 0. This is a real cross-implementation byte-asymmetry for any consumer who needs byte-identical Parquet output across language implementations.

Reproducer (minimal Go)
package main

import (
"bytes"
"fmt"
"github.com/apache/arrow-go/v18/arrow"
"github.com/apache/arrow-go/v18/arrow/array"
"github.com/apache/arrow-go/v18/arrow/memory"
"github.com/apache/arrow-go/v18/parquet"
"github.com/apache/arrow-go/v18/parquet/file"
"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

func main() {
pool := memory.NewGoAllocator()
schema := arrow.NewSchema([]arrow.Field{{Name: "id", Type: arrow.PrimitiveTypes.Int64, Nullable: false}}, nil)
builder := array.NewInt64Builder(pool)
builder.AppendValues([]int64{1, 2, 3}, nil)
arr := builder.NewArray()
rec := array.NewRecord(schema, []arrow.Array{arr}, 3)

var buf bytes.Buffer
props := parquet.NewWriterProperties(parquet.WithStats(true), parquet.WithCompression(0)) // Uncompressed for clarity
arrowProps := pqarrow.NewArrowWriterProperties()
w, _ := pqarrow.NewFileWriter(schema, &buf, props, arrowProps)
w.Write(rec)
w.Close()

rdr, _ := file.OpenParquetReader(bytes.NewReader(buf.Bytes()), nil)
md := rdr.MetaData()
cm, _ := md.RowGroup(0).ColumnChunk(0)
stats, _ := cm.Statistics()
fmt.Printf("DistinctCount set? %v (value=%d)\n", stats.HasDistinctCount(), stats.DistinctCount())
// Expected output (after fix): false (value=0)
// Actual output (today):       true  (value=0)

}
Root cause
metadata/statistics_types.gen.go lines 61-62 (NewInt32Statistics), 362-363 (NewInt64Statistics), 663-664, 942-943, 1235-1236 — each factory sets:

statistics: statistics{
descr: descr,
hasNullCount: true,
hasDistinctCount: true,
...
},
The hasDistinctCount: true flag should NOT default to true; it should flip only when a caller explicitly invokes SetDistinctCount(...) with a known value. (hasNullCount: true may need similar review — see "Out of scope" below.)

Proposed fix
In each numeric NewXxxStatistics factory in metadata/statistics_types.gen.go, change hasDistinctCount: true to hasDistinctCount: false. The flag flips to true only via explicit SetDistinctCount() (or IncDistinct() after a real distinct-count calculation).

Backward compatibility
Output bytes change (smaller — distinct_count field removed when unknown).
All readers tolerate the optional-field-absent case (pyarrow already omits, and all readers handle pyarrow output).
arrow-go's own IsSetDistinctCount() bool method already exists and behaves correctly given the fix.
The behavior of consumers that explicitly call SetDistinctCount(N) remains unchanged.
Test
Add a regression test: write a table without computing distinct_count; read back; assert IsSetDistinctCount() == false.
Cross-impl conformance: produce a Parquet file via arrow-go and via parquet-cpp on the same input + settings; assert footer thrift bytes match (currently they don't because of this).
Out of scope (related but separate)
hasNullCount: true at the same construction sites — same factory bug pattern. Whether to fix in the same PR is a design call. arrow-go and pyarrow may align on null_count (it's typically computed during the write); flag for separate review.
pyarrow's missing LogicalType emission knob for INT64 — separate cross-impl byte-identity issue; tracked at the consumer's spec level (sealed-variant pattern).
Adjacent precedent
Similar prior fix accepted upstream: #210 "fix(parquet/metadata): fix default unsigned int statistics" (merged 2024-12-10). Same code area, same kind of factory-default correction.

PRs welcome from our side if maintainers prefer; happy to follow up.

Component(s)

Parquet

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type: bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions