Can AWS Glue jobs (for dataframes) automatically detect the schema from an S3 CSV - amazon-s3

I'm trying to apply a specific operation to integer columns only. I have a .csv file in an S3 bucket that contains a mixture of integer and string columns.
I'm aware of Glue crawler, but I wanted to understand if Glue jobs can detect the schema.
I use the following code to load the S3 file:
df = glueContext.create_data_frame.from_options(
    format_options={"quoteChar": '"', "withHeader": True, "separator": ","},
    connection_type="s3",
    format="csv",
    connection_options={"paths": ["s3://autodscleaninjh/catblock.csv"], "recurse": True},
    transformation_ctx="S3bucket_node1",
)
df.printSchema() reports every column as a string, even though I have verified, by running the same code, that the types come through correctly when the schema has been inferred before loading into the Glue job.
Am I missing something in the above code that would make AWS Glue infer the schema automatically, or is it simply not possible, so that I need to run a Glue crawler beforehand?
I would like this code to generalise across different datasets, hence the need to define the schema on the fly.
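For illustration only, here is a minimal sketch of what detecting integer columns on the fly involves, written in plain Python rather than against Glue's API. The function name `integer_columns` and the `sample_rows` parameter are hypothetical, and treating empty cells as missing values is an assumption, not something stated in the question.

```python
import csv
import io


def integer_columns(csv_text, sample_rows=1000):
    """Return the header names whose sampled, non-empty values all parse as int."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    candidates = set(header)   # columns not yet ruled out
    has_value = set()          # columns with at least one non-empty cell
    for index, row in enumerate(reader):
        if index >= sample_rows:
            break
        for name in list(candidates):
            value = (row[name] or "").strip()
            if value == "":
                continue  # assumption: empty cells are missing, not strings
            has_value.add(name)
            try:
                int(value)
            except ValueError:
                candidates.discard(name)
    # Preserve the original column order.
    return [name for name in header if name in candidates and name in has_value]
```

The returned names could then drive whatever integer-only operation is needed. Spark's plain CSV reader also has an `inferSchema` option that performs a comparable sampling pass; whether it fits a particular Glue job is not something the question settles.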

Related

DynamoDB data to S3 in Kinesis Firehose output format

Kinesis Data Firehose writes files into separate partitions in the S3 bucket using a default format that looks like: s3://bucket/prefix/yyyy/MM/dd/HH/file.extension
I have created event streams to dump data from DynamoDB to S3 using Firehose. There is a transformation Lambda in between which converts DDB records into TSV (tab-separated) format.
All of this was added to an existing table that already contains a large amount of data. I need to backfill the existing data from DynamoDB to the S3 bucket while keeping the format consistent with the existing Firehose output.
Solution I tried :
Step 1 : Export the Table to S3 using DDB Export feature. Use Glue crawler to create Data catalog Table.
Step 2 : Used Athena's CREATE TABLE AS SELECT query to imitate the transformation done by the intermediate Lambda, storing the output in an S3 location.
Step 3 : However, Athena CTAS applies a default compression that cannot be turned off. So I wrote a Glue job that reads from the previous table and writes to another S3 location. This job also adds the year/month/day/hour partitions used by Firehose and writes uncompressed tab-separated files to S3.
However, the problem is that Glue creates Hive-style partitions which look like:
s3://bucket/prefix/year=2021/month=02/day=02/. I need to match the Firehose block-style S3 partitions instead.
I am looking for an approach to achieve this. I couldn't find a way to add block-style partitions using Glue. Another approach is to use the AWS CLI S3 mv command to move all this data into separate folders with the correct file names, but that is neither clean nor optimised.
Leaving the solution I ended up implementing here in case it helps anyone.
I created a Lambda and added an S3 event trigger on this bucket. The Lambda moved each file from the Hive-style partitioned S3 folder to the correctly structured block-style S3 folder.
The Lambda used the copy and delete functions of the boto3 S3 client to do this.
It worked like a charm even though I had more than 10^6 output files split across different partitions.
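The key rewrite at the heart of that Lambda could be sketched roughly as follows. The function name `hive_to_firehose_key` and the partition names `year`, `month`, `day` and `hour` are assumptions based on the paths shown in the question; the Lambda's actual code is not given.

```python
import re

# Matches a Hive-style partition segment such as "year=2021/" at the start
# of the key or directly after a "/".
_HIVE_SEGMENT = re.compile(r"(?:(?<=/)|^)(?:year|month|day|hour)=([^/]+)/")


def hive_to_firehose_key(key):
    """Turn 'prefix/year=2021/month=02/...' into 'prefix/2021/02/...'."""
    return _HIVE_SEGMENT.sub(r"\1/", key)
```

Inside the event handler, the new key would then be the destination of the copy, followed by a delete of the original object (for example with the S3 client's `copy_object` and `delete_object` calls).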

How can I load data into snowflake from S3 whilst specifying data types

I'm aware that it's possible to load data from files in S3 (e.g. CSV, Parquet or JSON) into Snowflake by creating an external stage with file format type CSV and then loading it into a table with one column of type VARIANT. But this requires a manual step to cast the data into the correct types and create a view that can be used for analysis.
Is there a way to automate this loading process from S3 so that the table's column data types are either inferred from the CSV file or specified by some other means? (Similar to how a table can be created in Google BigQuery from CSV files in GCS with an inferred table schema.)
As of today, the single Variant column solution you are adopting is the closest you can get with Snowflake out-of-the-box tools to achieve your goal which, as I understand from your question, is to let the loading process infer the source file structure.
In fact, the COPY command needs to know the structure of the expected file that it is going to load data from, through FILE_FORMAT.
More details: https://docs.snowflake.com/en/user-guide/data-load-s3-copy.html#loading-your-data

Grok classifier for parquet

Is it possible to create a grok classifier for Parquet files? If so, where can I find examples?
I'm using the AWS Glue Catalog and I'm trying to create external tables on top of Parquet files. I'd like the classifier to split the files according to one of the columns in the files.
All my files have the column "table" and all records in a file have the same table.
My S3 structure is like this
- s3://my-bucket/my-prefix/table1/...
- s3://my-bucket/my-prefix/table2/...
No, a classifier is not used for conditionally parsing data and routing it to different tables.
You can write a Lambda, ECS task or Glue job (depending on processing time) that takes these files and moves them into per-table folders in an S3 bucket, e.g. s3-data-lake/ingestion/table1, s3-data-lake/ingestion/table2 and so on. Then you can run a crawler over s3-data-lake/ingestion/, which will create all the Glue tables.
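As a rough sketch of the move step, the destination key for each file could be derived like this. The function name `table_folder_key` is hypothetical, the default root is taken from the example folders above, and the table name is assumed to have been read from the file's "table" column elsewhere.

```python
import posixpath


def table_folder_key(source_key, table, root="s3-data-lake/ingestion"):
    """Build the per-table destination key for one source object."""
    filename = posixpath.basename(source_key)
    return posixpath.join(root, table, filename)
```

Whatever performs the move (Lambda, ECS or Glue job) would copy each object to this key, so that the crawler sees one folder per table.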

Retrieving data from s3 bucket in pyspark

I am reading data from an S3 bucket in PySpark. I need to parallelize the read operation and do some transformations on the data, but it's throwing an error. Below is the code.
s3 = boto3.resource('s3',aws_access_key_id=access_key,aws_secret_access_key=secret_key)
bucket = s3.Bucket(bucket)
prefix = 'clickEvent-2017-10-09'
files = bucket.objects.filter(Prefix = prefix)
keys=[k.key for k in files]
pkeys = sc.parallelize(keys)
I have a global variable d which is an empty list, and I am appending the deviceID data to it.
I apply flatMap to the keys:
pkeys.flatMap(map_func)
This is the function:
def map_func(key):
    print "in map func"
    for line in key.get_contents_as_string().splitlines():
        # parse one line of json
        content = json.loads(line)
        d.append(content['deviceID'])
But the above code gives me error.
Can anyone help!
You have two issues that I can see. The first is that you are trying to read data from S3 manually using boto instead of using the direct S3 support built into Spark and Hadoop. It looks like you are trying to read text files containing one JSON record per line. If that is the case, you can just do this in Spark:
df = spark.read.json('s3://my-bucket/path/to/json/files/')
This will create a Spark DataFrame for you by reading in the JSON data with each line as a row. DataFrames require a rigid, pre-defined schema (like a relational database table), which Spark will try to determine by sampling some of your JSON data. Once you have the DataFrame, all you need to do to get your column is select it like this:
df.select('deviceID')
The other issue worth pointing out is you are attempting to use a global variable to store data computed across your spark cluster. It is possible to send data from your driver to all of the executors running on spark workers using either broadcast variables or implicit closures. But there is no way in spark to write to a variable in your driver from an executor! To transfer data from executors back to the driver you need to use spark's Action methods intended for exactly this purpose.
Actions are methods that tell Spark you want a result computed, so it needs to execute the transformations you have told it about. In your case you would probably want one of the following:
- If the results are large: use DataFrame.write to save the results of your transformations back to S3
- If the results are small: use DataFrame.collect() to download them back to your driver and do something with them
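If you do stay with the RDD API, the same principle applies: the mapped function should return its values instead of appending to a driver-side global. A minimal sketch of such a function in plain Python follows; the name `device_ids` is hypothetical and the field name comes from the question's code.

```python
import json


def device_ids(text):
    """Return the deviceID of every non-blank JSON line in text."""
    return [
        json.loads(line)["deviceID"]
        for line in text.splitlines()
        if line.strip()
    ]
```

A function of this shape can be passed to flatMap over file contents, and an action such as collect() then brings the returned values back to the driver.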

Incrementally add data to Parquet tables in S3

I would like to keep a copy of my log data in Parquet on S3 for ad hoc analytics. I mainly work with Parquet through Spark, and that only seems to offer operations to read and write whole tables via SQLContext.parquetFile() and SQLContext.saveAsParquetFile().
Is there any way to add data to an existing Parquet table without writing a whole new copy of it, particularly when it is stored in S3?
I know I can create separate tables for the updates and form the union of the corresponding DataFrames in Spark at query time, but I have my doubts about the scalability of that.
I can use something other than Spark if needed.
The way to append to a parquet file is using SaveMode.Append
`yourDataFrame.write.mode(SaveMode.Append).parquet("/your/file")`
You don't need to union DataFrames after creating them separately. Just supply all the paths related to your query to parquetFile(paths) and get one DataFrame, as the signature for reading Parquet files, sqlContext.parquetFile(paths: String*), suggests.
Under the hood, in newParquetRelation2, all the .parquet files from all the folders you supply, as well as all the _common_metadata and _metadata files, are collected into a single list and treated equally.
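To make that last point concrete, here is a rough plain-Python illustration of gathering the relevant files from several folders into one flat list. It is not Spark's implementation; the function name `list_parquet_inputs` is hypothetical and local directories stand in for S3 prefixes.

```python
from pathlib import Path

# Summary files that sit next to the .parquet part files.
_METADATA_NAMES = ("_metadata", "_common_metadata")


def list_parquet_inputs(*folders):
    """Collect part files and metadata files from every folder into one list."""
    collected = []
    for folder in folders:
        for path in sorted(Path(folder).iterdir()):
            if path.suffix == ".parquet" or path.name in _METADATA_NAMES:
                collected.append(path)
    return collected
```

Seen this way, appending amounts to adding new part files to a folder (or adding a new folder to the list of paths); the existing files are left untouched.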