Add TestEncryptFile and TestDecryptFile tests #596
base: main
@zeroshade OK, so I've generated tools/pyarrow_encrypted_uniform.parquet using tools/write_encrypted_parquet.py and tools/arrowgo_encrypted_uniform.parquet using TestEncryptFile. Very likely I'm doing something wrong, but let me walk you through what I have.

In the Python code I need an instance of FileEncryptionProperties to pass to write_table, but the only API I was able to find in the PyArrow lib is CryptoFactory.file_encryption_properties, and for that I need a KMS. I've created a mock implementation, which just base64-encodes and decodes the input, and used it to generate pyarrow_encrypted_uniform.parquet.

encryption.StringKeyIDRetriever=["footer_key": "0123456789012345", ]

func (s StringKeyIDRetriever) GetKey(keyMetadata []byte) string {
	k, ok := s[*(*string)(unsafe.Pointer(&keyMetadata))]
	if !ok {
		panic(fmt.Errorf("parquet: key missing for id %s", keyMetadata))
	}
	return k
}

keyMetadata={\"keyMaterialType\":\"PKMT1\",\"internalStorage\":true,\"isFooterKey\":true,\"kmsInstanceID\":\"DEFAULT\",\"kmsInstanceURL\":\"DEFAULT\",\"masterKeyID\":\"footer_key\",\"wrappedDEK\":\"tHPE5PlN58jGE1soVo/arMTVu8C8oezum3vSnNdEcEdIn5ImAcv9rtpfZow=\",\"doubleWrapping\":true,\"keyEncryptionKeyID\":\"7pmHfFBvnjd2Wbf218WOMQ==\",\"wrappedKEK\":\"1eX3O2IHHTkAnuIXbbIQRA==\"}"

The whole keyMetadata value is used as a key in the retriever map, which obviously doesn't work. I'm guessing that I should implement a custom KeyIDRetriever, and json decode the retrieved metadata and return the
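To make the failed lookup concrete, here is a minimal, standard-library-only sketch. It uses the key metadata quoted above (unescaped) and a plain map standing in for the retriever; `keyMaterial`, `parseKeyMaterial`, and `sampleMetadata` are hypothetical names chosen for illustration, not arrow-go API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// keyMaterial mirrors a few fields of the key metadata JSON quoted above.
// The type name is hypothetical; only the JSON field names come from the thread.
type keyMaterial struct {
	MasterKeyID    string `json:"masterKeyID"`
	WrappedDEK     string `json:"wrappedDEK"`
	DoubleWrapping bool   `json:"doubleWrapping"`
}

// parseKeyMaterial decodes key metadata bytes into keyMaterial.
func parseKeyMaterial(keyMetadata []byte) (keyMaterial, error) {
	var km keyMaterial
	if err := json.Unmarshal(keyMetadata, &km); err != nil {
		return keyMaterial{}, fmt.Errorf("key metadata is not JSON: %w", err)
	}
	return km, nil
}

// sampleMetadata is the metadata from the comment above with the escaping removed.
const sampleMetadata = `{"keyMaterialType":"PKMT1","internalStorage":true,"isFooterKey":true,"kmsInstanceID":"DEFAULT","kmsInstanceURL":"DEFAULT","masterKeyID":"footer_key","wrappedDEK":"tHPE5PlN58jGE1soVo/arMTVu8C8oezum3vSnNdEcEdIn5ImAcv9rtpfZow=","doubleWrapping":true,"keyEncryptionKeyID":"7pmHfFBvnjd2Wbf218WOMQ==","wrappedKEK":"1eX3O2IHHTkAnuIXbbIQRA=="}`

func main() {
	// A plain map standing in for the string-keyed retriever.
	keys := map[string]string{"footer_key": "0123456789012345"}

	// Looking up the raw metadata misses, because the map is keyed by "footer_key".
	_, ok := keys[sampleMetadata]
	fmt.Println("lookup by raw metadata:", ok) // prints "lookup by raw metadata: false"

	// After parsing, the masterKeyID field matches the map key.
	km, err := parseKeyMaterial([]byte(sampleMetadata))
	if err != nil {
		panic(err)
	}
	_, ok = keys[km.MasterKeyID]
	fmt.Println("lookup by masterKeyID:", ok) // prints "lookup by masterKeyID: true"
}
```

Finding the map entry does not by itself mean its value is the key the file was encrypted with: the metadata carries a wrapped key of its own in `wrappedDEK`.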
The other way around, when I try to run tools/read_encrypted_parquet.py, which should attempt to decrypt the arrow-go generated file arrowgo_encrypted_uniform.parquet, I get this error:
Reading arrowgo_encrypted_uniform.parquet
Traceback (most recent call last):
File "/daniel-adam-tfs/arrow-go/tools/./read_encrypted_parquet.py", line 27, in <module>
with pq.ParquetFile(
~~~~~~~~~~~~~~^
input_file,
^^^^^^^^^^^
decryption_properties=decryption_properties) as f:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.13/site-packages/pyarrow/parquet/core.py", line 328, in __init__
self.reader.open(
~~~~~~~~~~~~~~~~^
source, use_memory_map=memory_map,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<8 lines>...
arrow_extensions_enabled=arrow_extensions_enabled,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "pyarrow/_parquet.pyx", line 1656, in pyarrow._parquet.ParquetReader.open
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Failed to parse key metadata footer_key
It seems that it expects the key metadata to be written differently?
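The error is consistent with the reader trying to parse the key metadata as JSON key material, like the document PyArrow itself writes, while the arrow-go file stores the plain string footer_key. That reading is an assumption based on the message text, not something confirmed in this thread. A small sketch of the distinction, using only the Go standard library (`isKeyMaterial` is a hypothetical helper name):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// isKeyMaterial reports whether meta parses as a JSON object that carries a
// non-empty "keyMaterialType" field, as the PyArrow-written metadata does.
func isKeyMaterial(meta []byte) bool {
	var probe struct {
		Type string `json:"keyMaterialType"`
	}
	if err := json.Unmarshal(meta, &probe); err != nil {
		return false // plain key IDs such as "footer_key" are not valid JSON
	}
	return probe.Type != ""
}

func main() {
	// Key metadata as the arrow-go generated file stores it.
	fmt.Println(isKeyMaterial([]byte("footer_key"))) // prints "false"

	// Key metadata shaped like the PyArrow-generated file's.
	fmt.Println(isKeyMaterial([]byte(`{"keyMaterialType":"PKMT1","masterKeyID":"footer_key"}`))) // prints "true"
}
```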
Sorry for the delay here, I was on vacation all last week. I see what the issue is here. What is exposed via pyarrow is the high-level API for utilizing a KMS to manage the keys by wrapping them. It will generate random bytes to use for the key and then wrap the key using the

If we look at your mock example:

class MockKmsClient(pe.KmsClient):
    def __init__(self, kms_connection_configuration):
        super().__init__()

    def wrap_key(self, key_bytes, master_key_identifier):
        return base64.b64encode(key_bytes)

    def unwrap_key(self, wrapped_key, master_key_identifier):
        return base64.b64decode(wrapped_key)

Comparing that with the unit tests from the Arrow repo for pyarrow parquet encryption, I was able to figure out what it is doing. Essentially, the

// Requires the encoding/base64 and encoding/json imports.
type metadataKeyRetriever struct{}

// GetKey extracts the wrapped key from the JSON key metadata and base64-decodes it.
func (metadataKeyRetriever) GetKey(keyMetadata []byte) string {
	var keyMeta struct {
		WrappedDEK string `json:"wrappedDEK"`
	}
	if err := json.Unmarshal(keyMetadata, &keyMeta); err != nil {
		panic(err)
	}
	byts, err := base64.StdEncoding.DecodeString(keyMeta.WrappedDEK)
	if err != nil {
		panic(err)
	}
	return string(byts)
}

Another option might be to manipulate the For Python -
Does this help?
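One thing worth checking against the sample metadata in the first comment: it has doubleWrapping set to true, and its wrappedDEK decodes from base64 to 44 bytes, which is not an AES key length but would fit a 12-byte nonce, a 16-byte key, and a 16-byte GCM tag. The sketch below shows how such a value could be unwrapped under the mock KMS. It rests on assumptions that are not confirmed anywhere in this thread: that the wrapped DEK is laid out as nonce, ciphertext, tag, that it is AES-GCM encrypted with the KEK, and that the decoded keyEncryptionKeyID is the AAD. All names (`keyMaterial`, `unwrapDEK`, `wrapForDemo`, `newGCM`) are hypothetical, and the demo only round-trips its own data.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// keyMaterial holds the metadata fields this sketch needs.
type keyMaterial struct {
	WrappedDEK     string `json:"wrappedDEK"`
	DoubleWrapping bool   `json:"doubleWrapping"`
	KEKID          string `json:"keyEncryptionKeyID"`
	WrappedKEK     string `json:"wrappedKEK"`
}

// newGCM builds an AES-GCM AEAD for the given key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// unwrapDEK recovers the data key, treating the KMS as base64 only (the mock).
func unwrapDEK(keyMetadata []byte) ([]byte, error) {
	var km keyMaterial
	if err := json.Unmarshal(keyMetadata, &km); err != nil {
		return nil, err
	}
	wrapped, err := base64.StdEncoding.DecodeString(km.WrappedDEK)
	if err != nil {
		return nil, err
	}
	if !km.DoubleWrapping {
		// Single wrapping: the mock KMS wrap is just base64, decoded above.
		return wrapped, nil
	}
	kek, err := base64.StdEncoding.DecodeString(km.WrappedKEK) // mock KMS unwrap
	if err != nil {
		return nil, err
	}
	aad, err := base64.StdEncoding.DecodeString(km.KEKID) // ASSUMPTION: KEK ID is the AAD
	if err != nil {
		return nil, err
	}
	gcm, err := newGCM(kek)
	if err != nil {
		return nil, err
	}
	if len(wrapped) < gcm.NonceSize()+gcm.Overhead() {
		return nil, errors.New("wrapped DEK too short")
	}
	// ASSUMPTION: layout is nonce || ciphertext || tag.
	nonce, sealed := wrapped[:gcm.NonceSize()], wrapped[gcm.NonceSize():]
	return gcm.Open(nil, nonce, sealed, aad)
}

// wrapForDemo builds double-wrapped metadata in the assumed layout.
// It exists only so the sketch can round-trip its own data.
func wrapForDemo(dek, kek, kekID []byte) ([]byte, error) {
	gcm, err := newGCM(kek)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize()) // all-zero nonce: demo only, never do this in practice
	wrapped := gcm.Seal(nonce, nonce, dek, kekID)
	return json.Marshal(keyMaterial{
		WrappedDEK:     base64.StdEncoding.EncodeToString(wrapped),
		DoubleWrapping: true,
		KEKID:          base64.StdEncoding.EncodeToString(kekID),
		WrappedKEK:     base64.StdEncoding.EncodeToString(kek),
	})
}

func main() {
	meta, err := wrapForDemo([]byte("0123456789012345"), []byte("kekkekkekkekkek!"), []byte("demo-kek-id"))
	if err != nil {
		panic(err)
	}
	dek, err := unwrapDEK(meta)
	if err != nil {
		panic(err)
	}
	fmt.Printf("recovered %d-byte key: %s\n", len(dek), dek)
}
```

If the assumed layout or AAD is wrong for real PyArrow output, `gcm.Open` returns an authentication error rather than a wrong key, so trying this against pyarrow_encrypted_uniform.parquet would show quickly whether the assumptions hold.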
Rationale for this change
Aligning the encryption and decryption implementation with PyArrow.
What changes are included in this PR?
TODO
Are these changes tested?
TODO
Are there any user-facing changes?
TODO