How to encrypt a column in Pandas/Spark dataframe using AWS KMS

I want to encrypt values in one column of my Pandas (or PySpark) dataframe, e.g. to take the column mobno, encrypt it and put the result in the encrypted_value column.
I want to use AWS KMS encryption key. My question is: what is the most elegant way how to achieve this?
I am thinking about using UDF, which will call the boto3's KMS client. Something like:
@udf
def encrypt(plaintext):
    response = kms_client.encrypt(
        KeyId=aws_kms_key_id,
        Plaintext=plaintext
    )
    ciphertext = response['CiphertextBlob']
    return ciphertext
and then applying this udf on the whole column.
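For instance, applying it could look roughly like this (a sketch that assumes the encrypt UDF above, a kms_client built with boto3, and a dataframe df with the mobno column):
from pyspark.sql import functions as F

df = df.withColumn('encrypted_value', encrypt(F.col('mobno')))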
But I am not quite confident this is the right way. This stems from the fact that I am an encryption rookie: first, I don't even know whether this kms_client.encrypt function is meant for encrypting values (from the columns) or for manipulating the keys. Maybe a better way is to obtain the key and then use some Python encryption library (such as hashlib).
I would like some clarification on the encryption process and also a recommendation on the best approach to column encryption.

Since Spark 3.3 you can do AES encryption (and decryption) without a UDF.
aes_encrypt(expr, key[, mode[, padding]]) - Returns an encrypted value of expr using AES in the given mode with the specified padding. Key lengths of 16, 24 and 32 bytes are supported. Supported combinations of (mode, padding) are ('ECB', 'PKCS') and ('GCM', 'NONE'). The default mode is GCM.
Arguments:
expr - The binary value to encrypt.
key - The passphrase to use to encrypt the data.
mode - Specifies which block cipher mode should be used to encrypt messages. Valid modes: ECB, GCM.
padding - Specifies how to pad messages whose length is not a multiple of the block size. Valid values: PKCS, NONE, DEFAULT. The DEFAULT padding means PKCS for ECB and NONE for GCM.
from pyspark.sql import functions as F
df = spark.createDataFrame([('8223344556',)], ['mobno'])
df = df.withColumn('encrypted_value', F.expr("aes_encrypt(mobno, 'your_secret_keyy')"))
df.show()
# +----------+--------------------+
# | mobno| encrypted_value|
# +----------+--------------------+
# |8223344556|[9B 33 DB 9B 5D C...|
# +----------+--------------------+
df.printSchema()
# root
# |-- mobno: string (nullable = true)
# |-- encrypted_value: binary (nullable = true)
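For a round trip, Spark 3.3 also provides aes_decrypt, which reverses this with the same key; a minimal sketch based on the dataframe above:
df = df.withColumn(
    'decrypted_value',
    F.expr("cast(aes_decrypt(encrypted_value, 'your_secret_keyy') as string)")
)
df.select('mobno', 'decrypted_value').show()  # decrypted_value equals the original mobno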

To avoid making many calls to the KMS service from a UDF, use AWS Secrets Manager instead to retrieve your encryption key, and pycrypto (or its maintained fork PyCryptodome) to encrypt the column. The following works:
import json
import boto3
from pyspark.sql.functions import udf, col
from Crypto.Cipher import AES

region_name = "eu-west-1"
secret_name = "my-aes-key"  # placeholder: name of the secret that stores the key
session = boto3.session.Session()
client = session.client(service_name='secretsmanager', region_name=region_name)
get_secret_value_response = client.get_secret_value(SecretId=secret_name)
secret_key = json.loads(get_secret_value_response['SecretString'])  # expected to be a 16/24/32-byte AES key
clear_text_column = 'mobno'

def encrypt(key, text):
    obj = AES.new(key, AES.MODE_CFB, b'This is an IV456')  # the IV must be bytes in Python 3
    return obj.encrypt(text.encode('utf-8'))

def udf_encrypt(key):
    return udf(lambda text: encrypt(key, text))

df.withColumn("encrypted", udf_encrypt(secret_key)(col(clear_text_column))).show()
Or alternatively, using a more efficient Pandas UDF as suggested by @Vektor88 (PySpark 3 syntax):
import pandas as pd
from functools import partial
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import BinaryType

encrypt_with_key = partial(encrypt, secret_key)

@pandas_udf(BinaryType())
def pandas_udf_encrypt(clear_strings: pd.Series) -> pd.Series:
    return clear_strings.apply(encrypt_with_key)

df.withColumn('encrypted', pandas_udf_encrypt(clear_text_column)).show()
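To read the values back, a decrypt counterpart with the same key and hard-coded IV could look like this (a sketch only; reusing a fixed IV is acceptable for a demo, not for production):
def decrypt(key, ciphertext):
    # CFB decryption needs the same key and IV that were used to encrypt
    obj = AES.new(key, AES.MODE_CFB, b'This is an IV456')
    return obj.decrypt(ciphertext).decode('utf-8')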


What is the best way to send Arrow data to the browser?

I have Apache Arrow data on the server (Python) and need to use it in the browser. It appears that Arrow Flight isn't implemented in JS. What are the best options for sending the data to the browser and using it there?
I don't even need it necessarily in Arrow format in the browser. This question hasn't received any responses, so I'm adding some additional criteria for what I'm looking for:
Self-describing: don't want to maintain separate schema definitions
Minimal overhead: For example, an array of float32s should transfer as something compact like a data type indicator, length value and sequence of 4-byte float values
Cross-platform: Able to be easily sent from Python and received and used in the browser in a straightforward way
Surely this is a solved problem? If it is, I've been unable to find a solution. Please help!
Building off of the comments on your original post by David Li, you can implement a non-streaming version of what you want without too much code, using PyArrow on the server side and the Apache Arrow JS bindings on the client. The Arrow IPC format satisfies your requirements because it ships the schema with the data, is space-efficient and zero-copy, and is cross-platform.
Here's a toy example that generates a record batch on the server and receives it on the client:
Server:
from io import BytesIO
from flask import Flask, send_file
from flask_cors import CORS
import pyarrow as pa
app = Flask(__name__)
CORS(app)
@app.get("/data")
def data():
    data = [
        pa.array([1, 2, 3, 4]),
        pa.array(['foo', 'bar', 'baz', None]),
        pa.array([True, None, False, True])
    ]
    batch = pa.record_batch(data, names=['f0', 'f1', 'f2'])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return send_file(BytesIO(sink.getvalue().to_pybytes()), "data.arrow")
Client:
import { tableFromIPC } from "apache-arrow";

const table = await tableFromIPC(fetch(URL));
// Do what you like with your data
Edit: I added a runnable example at https://github.com/amoeba/arrow-python-js-ipc-example.
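If it helps while debugging, the same IPC bytes can also be read back on the Python side; a small sketch reusing the sink from the server code above:
import pyarrow as pa

# Read the stream back to confirm the round trip works
reader = pa.ipc.open_stream(sink.getvalue())
table = reader.read_all()
print(table.to_pydict())
# {'f0': [1, 2, 3, 4], 'f1': ['foo', 'bar', 'baz', None], 'f2': [True, None, False, True]}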

How to convert S3 bucket content(.csv format) into a dataframe in AWS Lambda

I am trying to ingest S3 data (a CSV file) into RDS (MSSQL) through Lambda. Sample code:
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client('s3')
if event:
    file_obj = event["Records"][0]
    bucketname = str(file_obj["s3"]["bucket"]["name"])
    csv_filename = unquote_plus(str(file_obj["s3"]["object"]["key"]))
    print("Filename: ", csv_filename)
    csv_fileObj = s3.get_object(Bucket=bucketname, Key=csv_filename)
    file_content = csv_fileObj["Body"].read().decode("utf-8").split()
I have tried putting my CSV contents into a list, but it didn't work:
results = []
for row in csv.DictReader(file_content):
    results.append(row.values())
print(results)
print(file_content)
return {
    'statusCode': 200,
    'body': json.dumps('S3 file processed')
}
Is there any way I can convert "file_content" into a dataframe in Lambda? I have multiple columns to load.
Later I would follow this approach to load the data into RDS:
import pyodbc
import pandas as pd
# insert data from csv file into dataframe(df).
server = 'yourservername'
database = 'AdventureWorks'
username = 'username'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
# Insert Dataframe into SQL Server:
for index, row in df.iterrows():
    cursor.execute("INSERT INTO HumanResources.DepartmentTest (DepartmentID,Name,GroupName) values(?,?,?)", row.DepartmentID, row.Name, row.GroupName)
cnxn.commit()
cursor.close()
Can anyone suggest how to go about it?
You can use io.BytesIO to get the bytes into memory and then use pandas' read_csv to turn it into a dataframe. Note that there is some strange SSL download limit that leads to issues when downloading more than 2 GB of data; that is why I have used chunking in the code below.
import io
import pandas as pd
obj = s3.get_object(Bucket=bucketname, Key=csv_filename)
# This should prevent the 2GB download limit from a python ssl internal
chunks = (chunk for chunk in obj["Body"].iter_chunks(chunk_size=1024**3))
data = io.BytesIO(b"".join(chunks)) # This keeps everything fully in memory
df = pd.read_csv(data) # here you can provide also some necessary args and kwargs
It appears that your goal is to load the contents of a CSV file from Amazon S3 into SQL Server.
You could do this without using Dataframes (see the sketch after this list):
Loop through the Event Records (multiple can be passed-in)
For each object:
Download the object to /tmp/
Use the Python CSVReader to loop through the contents of the file
Generate INSERT statements to insert the data into the SQL Server table
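A rough sketch of that approach (the table and column names simply mirror the INSERT from the question, and the connection details are placeholders):
import csv
import os
import boto3
import pyodbc
from urllib.parse import unquote_plus

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # placeholder connection string, matching the question's pyodbc example
    cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=yourservername;DATABASE=AdventureWorks;UID=username;PWD=yourpassword')
    cursor = cnxn.cursor()
    for record in event["Records"]:                  # multiple records can be passed in
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        local_path = os.path.join("/tmp", os.path.basename(key))
        s3.download_file(bucket, key, local_path)    # download the object to /tmp/
        with open(local_path, newline="") as f:
            for row in csv.DictReader(f):            # loop through the CSV contents
                cursor.execute(
                    "INSERT INTO HumanResources.DepartmentTest (DepartmentID, Name, GroupName) values (?, ?, ?)",
                    row["DepartmentID"], row["Name"], row["GroupName"])
    cnxn.commit()
    cursor.close()
    return {"statusCode": 200, "body": "S3 file processed"}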
You might also consider using aws-data-wrangler: Pandas on AWS, which is available as a Lambda Layer.
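For example, with aws-data-wrangler the read becomes roughly a one-liner (bucket and key as in the question):
import awswrangler as wr

# Reads the CSV object from S3 straight into a pandas DataFrame
df = wr.s3.read_csv(path=f"s3://{bucketname}/{csv_filename}")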

A hot key <hot-key-name> was detected in our Dataflow pipeline

We have been facing a hot key issue in our Dataflow pipeline (a streaming pipeline doing batch loads into BigQuery -- we are using batch loads to keep costs down):
We are ingesting data into the corresponding tables based on its decoder value. For example, data with the http decoder goes to the http table, and data with the ssl decoder goes to the ssl table.
So the BigQuery ingestion is using dynamic destinations.
The key is the destination table spec for the data.
An example error log:
A hot key
'key: tableSpec: ace-prod-300923:ml_dataset_us_central1.ssl tableDescription: Table for ssl shard: 1'
was detected in step
'insertTableRowsToBigQuery/BatchLoads/SinglePartitionsReshuffle/GroupByKey/ReadStream' with age of '1116.266s'.
This is a symptom of key distribution being skewed.
To fix, please inspect your data and pipeline to ensure that elements are evenly distributed across your key space.
The error is detected in this step: 'insertTableRowsToBigQuery/BatchLoads/SinglePartitionsReshuffle/GroupByKey/ReadStream'
The hot key issue comes from the nature of the data: some decoders have disproportionately many values. And our pipeline is a streaming pipeline.
We have read the document provided by Google but are still not sure how to fix it:
Dataflow Shuffle: our project is already using Streaming Engine.
Rekey: doesn't seem to apply to our case, as the key is the destination table spec. To make the ingestion work, the key has to match an existing table spec in BigQuery.
Combine.PerKey.withHotKeyFanout(): I don't know how to apply this, because the key is generated in the insertTableRowsToBigQuery step, where we use BigQueryIO to write to BigQuery. The key comes from dynamically generating BigQuery table names based on the current window or the current value (Sharding BigQuery output tables).
Attached is the code where the hot key is detected:
toBq.apply("insertTableRowsToBigQuery",
    BigQueryIO
        .<DataIntoBQ>write()
        .to((ValueInSingleWindow<DataIntoBQ> dataIntoBQ) -> {
            try {
                String decoder = dataIntoBQ.getValue().getDecoder(); // getter function
                String destination = String.format("%s:%s.%s",
                        PROJECT_ID, DATASET, decoder);
                if (!listOfProtocols.contains(decoder)) {
                    LOG.error("wrong bigquery table decoder destination: " + decoder);
                }
                return new TableDestination(
                        destination,           // Table spec
                        "Table for " + decoder // Table description
                );
            } catch (Exception e) {
                LOG.error("insertTableRowsToBigQuery error", e);
                return null;
            }
        })
        .withFormatFunction(
            (DataIntoBQ elem) -> new DataIntoBQ().toTableRow(elem)
        )
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withTriggeringFrequency(Duration.standardMinutes(3))
        .withAutoSharding()
        .withCustomGcsTempLocation(ValueProvider.StaticValueProvider.of(options.getGcpTempLocation()))
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
You can still try the rekey strategy. For example, you can apply a transformation before apply("insertTableRowsToBigQuery") such that:
elements with the key "http" become "randomVal_http", where randomVal is a value in a specific range (say from 0 to 9); the width of the range depends on how many splits you want your "http" elements to end up in. For example, if you have 10 million elements with the key "http" and you want them split into 10 groups of roughly 1 million elements each, generate uniform random numbers between 0 and 9.
Apply the same mapping to every element that belongs to a hot key; elements with non-hot keys don't need to be rekeyed.
Now, in your "insertTableRowsToBigQuery", you know how to go from a key like "someVal_http" back to "http": split the string.
That should help.
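The salting itself is not tied to Beam; as a rough language-agnostic illustration (shown in Python, with the hot-key set and fan-out factor as assumptions):
import random

HOT_KEYS = {"http", "ssl"}   # decoders known to be hot (assumption)
FANOUT = 10                  # how many splits per hot key

def add_salt(decoder):
    # "http" -> e.g. "3_http" for hot keys, unchanged otherwise
    if decoder in HOT_KEYS:
        return "%d_%s" % (random.randint(0, FANOUT - 1), decoder)
    return decoder

def remove_salt(salted):
    # "3_http" -> "http"; assumes decoder names contain no underscore
    return salted.split("_", 1)[1] if "_" in salted else salted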
Regarding Combine.PerKey.withHotKeyFanout(), I am not sure how to do this for IO transforms. If it were some intermediate transform, I could have helped.

Why preferred ciphers, macs and compression are written twice while key negotiations?

I am trying to understand the client part of the SSH protocol and am referring to the Paramiko library, a native Python implementation of the SSH protocol. The corresponding code can be found here.
def _send_kex_init(self):
    """
    announce to the other side that we'd like to negotiate keys, and what
    kind of key negotiation we support.
    """
    self.clear_to_send_lock.acquire()
    try:
        self.clear_to_send.clear()
    finally:
        self.clear_to_send_lock.release()
    self.in_kex = True
    if self.server_mode:
        mp_required_prefix = 'diffie-hellman-group-exchange-sha'
        kex_mp = [k for k in self._preferred_kex if k.startswith(mp_required_prefix)]
        if (self._modulus_pack is None) and (len(kex_mp) > 0):
            # can't do group-exchange if we don't have a pack of potential primes
            pkex = [k for k in self.get_security_options().kex
                    if not k.startswith(mp_required_prefix)]
            self.get_security_options().kex = pkex
        available_server_keys = list(filter(list(self.server_key_dict.keys()).__contains__,
                                            self._preferred_keys))
    else:
        available_server_keys = self._preferred_keys
    m = Message()
    m.add_byte(cMSG_KEXINIT)
    m.add_bytes(os.urandom(16))
    m.add_list(self._preferred_kex)
    m.add_list(available_server_keys)
    m.add_list(self._preferred_ciphers)
    m.add_list(self._preferred_ciphers)
    m.add_list(self._preferred_macs)
    m.add_list(self._preferred_macs)
    m.add_list(self._preferred_compression)
    m.add_list(self._preferred_compression)
    m.add_string(bytes())
    m.add_string(bytes())
    m.add_boolean(False)
    m.add_int(0)
    # save a copy for later (needed to compute a hash)
    self.local_kex_init = m.asbytes()
    self._send_message(m)
The SSH protocol allows both parties to implement different sets of algorithms for the incoming and outgoing directions. Or, practically speaking, it allows a party to implement a specific algorithm for one direction only.
Paramiko implements all algorithms for both directions, so it populates both the incoming and the outgoing algorithm lists with the same set.
See the code in the _parse_kex_init method, which parses the packet populated by the code you refer to:
client_encrypt_algo_list = m.get_list()
server_encrypt_algo_list = m.get_list()
...
agreed_local_ciphers = list(
    filter(
        client_encrypt_algo_list.__contains__,
        self._preferred_ciphers,
    )
)
agreed_remote_ciphers = list(
    filter(
        server_encrypt_algo_list.__contains__,
        self._preferred_ciphers,
    )
)
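For context, the negotiated algorithm for each direction is then, per RFC 4253 section 7.1, the first entry in the client's list that also appears in the server's list; a toy illustration (the algorithm names are just examples):
def negotiate(client_prefs, server_prefs):
    # the first client-preferred algorithm that the server also supports wins
    for algo in client_prefs:
        if algo in server_prefs:
            return algo
    raise ValueError('no common algorithm')

print(negotiate(['aes128-ctr', 'aes256-ctr', '3des-cbc'],
                ['aes256-ctr', '3des-cbc']))  # aes128-ctr not offered by the server -> aes256-ctr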

reading from hive table and updating same table in pyspark - using checkpoint

I am using spark version 2.3 and trying to read a hive table in spark as:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# spark session with Hive support (in the pyspark shell this already exists)
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("emp.emptable")
Here I am adding a new column with the current system date to the existing dataframe:
import pyspark.sql.functions as F
newdf = df.withColumn('LOAD_DATE', F.current_date())
and now I am facing an issue when trying to write this dataframe as a hive table:
newdf.write.mode("overwrite").saveAsTable("emp.emptable")
pyspark.sql.utils.AnalysisException: u'Cannot overwrite table emp.emptable that is also being read from;'
so I am checkpointing the dataframe to break the lineage, since I am reading and writing from the same table:
checkpointDir = "/hdfs location/temp/tables/"
spark.sparkContext.setCheckpointDir(checkpointDir)
df = spark.table("emp.emptable").coalesce(1).checkpoint()
newdf = df.withColumn('LOAD_DATE', F.current_date())
newdf.write.mode("overwrite").saveAsTable("emp.emptable")
This way it works fine and the new column is added to the hive table, but I have to delete the checkpoint files every time they get created. Is there a better way to break the lineage and write the same dataframe with the updated column and save it to an hdfs location or as a hive table?
Or is there any way to specify a temp location for the checkpoint directory, which will get deleted after the spark session completes?
As we discussed in this post, setting the below property is the way to go.
spark.conf.set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
That question had a different context: we wanted to retain the checkpointed dataset, so we did not care to add a cleanup solution.
Setting the above property works sometimes (tested with Scala, Java and Python), but it is hard to rely on it. The official documentation says that this property "Controls whether to clean checkpoint files if the reference is out of scope." I don't know what exactly that means, because my understanding is that once the spark session/context is stopped it should clean them up. It would be great if someone could shed light on it.
Regarding
Is there a better way to break the lineage
Check this question: @BiS found some way to cut the lineage using the createDataFrame(RDD, Schema) method. I haven't tested it myself though.
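For reference, that trick would look roughly like this (a sketch only, untested here as noted above):
import pyspark.sql.functions as F

df = spark.table("emp.emptable")
newdf = df.withColumn('LOAD_DATE', F.current_date())
# Rebuilding the DataFrame from its RDD and schema drops the plan that reads emp.emptable
newdf = spark.createDataFrame(newdf.rdd, newdf.schema)
newdf.write.mode("overwrite").saveAsTable("emp.emptable")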
Just FYI, I usually don't rely on the above property and instead delete the checkpointed directory in the code itself, to be on the safe side.
We can get the checkpointed directory like below:
Scala:
//Set directory
scala> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint/")
scala> spark.sparkContext.getCheckpointDir.get
res3: String = hdfs://<name-node:port>/tmp/checkpoint/625034b3-c6f1-4ab2-9524-e48dfde589c3
//It gives String so we can use org.apache.hadoop.fs to delete path
PySpark:
# Set directory
>>> spark.sparkContext.setCheckpointDir('hdfs:///tmp/checkpoint')
>>> t = sc._jsc.sc().getCheckpointDir().get()
>>> t
u'hdfs://<name-node:port>/tmp/checkpoint/dc99b595-f8fa-4a08-a109-23643e2325ca'
# notice the 'u' at the start, which means it returns a unicode object; use str(t)
# Below are the steps to get the hadoop FileSystem object and delete the checkpoint path
>>> fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True
>>> fs.delete(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True
>>> fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
False