@daniel-adam-tfs
Contributor

Rationale for this change

Aligning the encryption and decryption implementation with PyArrow.

What changes are included in this PR?

TODO

Are these changes tested?

TODO

Are there any user-facing changes?

TODO

@daniel-adam-tfs
Contributor Author

daniel-adam-tfs commented Dec 8, 2025

@zeroshade OK, so I've generated tools/pyarrow_encrypted_uniform.parquet using tools/write_encrypted_parquet.py and tools/arrowgo_encrypted_uniform.parquet using TestEncryptFile. It's very likely that I'm doing something wrong, but let me walk you through what I have.

In the Python code I need an instance of FileEncryptionProperties to pass to write_table, but the only API I was able to find in PyArrow is CryptoFactory.file_encryption_properties, and that requires a KMS. I've created a mock implementation, which just base64-encodes and decodes the input, and used it to generate pyarrow_encrypted_uniform.parquet.
When I try to read this generated file in TestDecryptFile, it panics in the StringKeyIDRetriever.GetKey call.

encryption.StringKeyIDRetriever=["footer_key": "0123456789012345", ]
func (s StringKeyIDRetriever) GetKey(keyMetadata []byte) string {
	k, ok := s[*(*string)(unsafe.Pointer(&keyMetadata))]
	if !ok {
		panic(fmt.Errorf("parquet: key missing for id %s", keyMetadata))
	}
	return k
}
keyMetadata={\"keyMaterialType\":\"PKMT1\",\"internalStorage\":true,\"isFooterKey\":true,\"kmsInstanceID\":\"DEFAULT\",\"kmsInstanceURL\":\"DEFAULT\",\"masterKeyID\":\"footer_key\",\"wrappedDEK\":\"tHPE5PlN58jGE1soVo/arMTVu8C8oezum3vSnNdEcEdIn5ImAcv9rtpfZow=\",\"doubleWrapping\":true,\"keyEncryptionKeyID\":\"7pmHfFBvnjd2Wbf218WOMQ==\",\"wrappedKEK\":\"1eX3O2IHHTkAnuIXbbIQRA==\"}"

The whole keyMetadata value is used as a key in the retriever map, which obviously doesn't work. I'm guessing that I should implement a custom KeyIDRetriever, and json decode the retrieved metadata and return the masterKeyID value?

@daniel-adam-tfs
Contributor Author

The other way around: when I run tools/read_encrypted_parquet.py, which attempts to decrypt the arrow-go generated file arrowgo_encrypted_uniform.parquet, I get this error:

Reading arrowgo_encrypted_uniform.parquet
Traceback (most recent call last):
  File "/daniel-adam-tfs/arrow-go/tools/./read_encrypted_parquet.py", line 27, in <module>
    with pq.ParquetFile(
         ~~~~~~~~~~~~~~^
            input_file,
            ^^^^^^^^^^^
            decryption_properties=decryption_properties) as f:
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.13/site-packages/pyarrow/parquet/core.py", line 328, in __init__
    self.reader.open(
    ~~~~~~~~~~~~~~~~^
        source, use_memory_map=memory_map,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<8 lines>...
        arrow_extensions_enabled=arrow_extensions_enabled,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "pyarrow/_parquet.pyx", line 1656, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Failed to parse key metadata footer_key

It seems that it expects the key metadata to be written in a different format?

@zeroshade
Member

Sorry for the delay here, I was on vacation all last week.

I see what the issue is here. What pyarrow exposes is the high-level API, which uses a KMS to manage the keys by wrapping them. It generates random bytes to use as the key and then wraps and unwraps that key using the wrap_key and unwrap_key functions in the KMS implementation.

If we look at your mock example:

class MockKmsClient(pe.KmsClient):
    def __init__(self, kms_connection_configuration):
        super().__init__()

    def wrap_key(self, key_bytes, master_key_identifier):
        return base64.b64encode(key_bytes)

    def unwrap_key(self, wrapped_key, master_key_identifier):
        return base64.b64decode(wrapped_key)

And compare that with the unit tests from the Arrow repo for pyarrow parquet encryption, I was able to figure out what it is doing. Essentially, the key_bytes are the random bytes that actually got used to encrypt the data. Those bytes are the key that Go will need for decrypting. In your example with this MockKmsClient, you're storing the key bytes directly into the metadata, so the following should work on the Go side for decrypting:

type metadataKeyRetriever struct{}

func (metadataKeyRetriever) GetKey(keyMetadata []byte) string {
    var keyMeta struct {
        WrappedDEK string `json:"wrappedDEK"`
    }

    if err := json.Unmarshal(keyMetadata, &keyMeta); err != nil {
        panic(err)
    }
    // With the mock KMS, wrappedDEK is just the base64-encoded key bytes.
    byts, err := base64.StdEncoding.DecodeString(keyMeta.WrappedDEK)
    if err != nil {
        panic(err)
    }

    return string(byts)
}

Another option might be to manipulate the key_bytes with the master key bytes so that the key itself isn't stored directly in the metadata so easily (or use some external thing). I was able to get the above to work, so think of it like this:

For Python -

  • wrap_key takes the Key ID (master_key_identifier) and the actual random bytes used as the encryption key (key_bytes) and outputs what goes into the key metadata. These key_bytes are what Go needs since we only expose the low-level API currently.
  • unwrap_key takes the result from wrap_key and the Key ID (master_key_identifier) and returns the actual key to use for decrypting. Go would need to output the properly formatted JSON blob that pyarrow is expecting as the key metadata so that the python unwrap_key would be able to somehow determine the actual key_bytes from the metadata.

Does this help?
