This is my situation:
I have an application that rotates JSON files into an S3 bucket. I need to convert those files to ORC format so they can be queried from Athena or EMR.
My first attempt was a Lambda function written in Node, but I couldn't find any module for the conversion.
I think it can be done more easily with Glue or EMR, but I can't find a solution.
Any help?
Thanks!
You can use Glue. You will need a Glue Data Catalog table that describes the schema of your data; you can create this automatically with a Glue crawler.
Then create a Glue job. If you follow the Add Job wizard, you can select ORC as the output format in the data targets section of the wizard.
The AWS Glue tutorials step you through something very similar, but converting to Parquet format; follow the same steps with your data, select ORC instead, and it should do what you want.
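If you'd rather skip the wizard, the script it generates essentially just reads the crawler's catalog table and writes the data back out as ORC. A minimal sketch of such a Glue (PySpark) job, where the catalog database/table names and the output path are placeholders:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the JSON data through the Data Catalog table the crawler created
# ("json_db" / "json_table" are placeholder names).
source = glueContext.create_dynamic_frame.from_catalog(
    database="json_db", table_name="json_table"
)

# Write the same data back to S3 as ORC (placeholder output prefix).
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/orc-output/"},
    format="orc",
)

job.commit()
```

Once the job has run, point Athena (or another crawler) at the ORC prefix and you can query it directly.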
Related
I have created Parquet files from SQL data using Python. I am able to read the data on my local machine, so I know the Parquet files are valid. I created a Glue crawler that creates a database from the Parquet files in S3, and the database shows the correct number of records in the Glue dashboard.
When I query that database in Athena it shows "No Results", but does show the column names.
Please see the screenshots for reference (the Glue table properties and the Athena query).
I figured it out. You cannot point to the root of the S3 bucket with the Parquet files sitting directly in that location. Each Parquet file needs to be in its own folder; giving the folder the same name as the file isn't strictly required, but for automation purposes it makes the most sense...
My source is SQL Server and I am using SSIS to export data to S3 buckets, but now my requirement is to send the files in Parquet file format.
Can you give me some clues on how to achieve this?
Thanks,
Ven
For folks stumbling on this answer, Apache Parquet is a project that specifies a columnar file format employed by Hadoop and other Apache projects.
Unless you find a custom component or write some .NET code to do it, you're not going to be able to export data from SQL Server to a Parquet file. KingswaySoft's SSIS Big Data Components might offer one such custom component, but I have no familiarity with it.
If you were exporting to Azure, you'd have two options:
Use the Flexible File Destination component (part of the Azure feature pack), which exports to a Parquet file hosted in Azure Blob or Data Lake Gen2 storage.
Leverage PolyBase, a SQL Server feature. It lets you export to a Parquet file via the external table feature. However, that file has to be hosted in a location mentioned here; unfortunately, S3 isn't an option.
If it were me, I'd move the data to S3 as a CSV file, then use Athena to convert the CSV to Parquet. There is a nifty article here that talks through the Athena piece:
https://www.cloudforecast.io/blog/Athena-to-transform-CSV-to-Parquet/
Net-net, you'll need to spend a little money, get creative, switch to Azure, or do the conversion in AWS.
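For the Athena piece, the conversion itself is a single CREATE TABLE AS SELECT (CTAS). A rough sketch of kicking that off with boto3, assuming the CSV table already exists in the Glue/Athena catalog and that all database, table, and bucket names below are placeholders:

```python
import boto3

athena = boto3.client("athena")

# CTAS: reads the existing CSV table and writes a Parquet copy to S3.
ctas = """
CREATE TABLE sales_parquet
WITH (format = 'PARQUET', external_location = 's3://my-bucket/parquet/sales/') AS
SELECT * FROM sales_csv
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```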
I have an existing Google BigQuery table with about 30 fields. I would like to start automating the addition of data to this table on a regular basis. I have installed the command line tools and they are working correctly.
I'm confused by the proper process for appending data to a table. Do I need to specify the entire schema for the table every time I want to append data? It feels strange to be recreating the schema in an Avro file when the schema already exists on the table.
Can someone please clarify how to do this?
Do I need to specify the entire schema for the table every time
No, you don't need to, as described in the BigQuery official documentation:
Schema auto-detection is not used with Avro files, Parquet files, ORC files, Cloud Firestore export files, or Cloud Datastore export files. When you load these files into BigQuery, the table schema is automatically retrieved from the self-describing source data.
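In practice that means an append is just a load job with no schema argument. With the bq tool you already have installed, that is simply `bq load --source_format=AVRO mydataset.mytable gs://your-bucket/file.avro` (bq load appends by default). The same load through the Python client, with placeholder bucket and table names, looks roughly like this:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # No schema given: BigQuery takes it from the self-describing Avro files.
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/*.avro",   # placeholder GCS path
    "my_project.my_dataset.my_table",  # placeholder table id
    job_config=job_config,
)
load_job.result()  # wait for the load to finish
```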
I have huge CSV files in zipped format in S3. I need just a subset of the columns for machine-learning purposes. How can I extract those columns into EMR and then into Redshift without transferring the whole files?
My idea is to process all the files in EMR, extract the subset, and push the required columns into Redshift, but this is taking a lot of time. Please let me know if there is a more optimized way of handling this data.
Edit: I am trying to automate this pipeline using Kafka. Say a new folder is added to S3; it should be processed in EMR using Spark and stored in Redshift without any manual intervention.
Edit 2: Thanks for the input, everyone. I was able to create a pipeline from S3 to Redshift using PySpark in EMR. Currently, I am trying to integrate Kafka into this pipeline.
I would suggest:
Create an external table in Amazon Athena (an AWS Glue crawler can do this for you) that points to where your data is stored
Use CREATE TABLE AS to select the desired columns and store them in a new table, with the data automatically stored in Amazon S3 (see the sketch after the links below)
Amazon Athena can handle gzip format, but you'll have to check whether this includes zip format.
See:
CREATE TABLE - Amazon Athena
Examples of CTAS Queries - Amazon Athena
Compression Formats - Amazon Athena
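As a rough sketch of the CTAS step, assuming the crawler has already created a table (called raw_csv here) over the gzipped CSVs, and that the column names and S3 paths are placeholders:

```python
import boto3

athena = boto3.client("athena")

# Select only the columns needed for ML and store them as Parquet in S3.
query = """
CREATE TABLE ml_subset
WITH (format = 'PARQUET', external_location = 's3://my-bucket/ml-subset/') AS
SELECT col_a, col_b, col_c
FROM raw_csv
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```

From there, a Redshift COPY ... FORMAT AS PARQUET from the output location can load the result into Redshift.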
If the goal is to materialise a subset of the file columns in a table in Redshift, then one option you have is Redshift Spectrum, which will allow you to define an "external table" over the CSV files in S3.
You can then select the relevant columns from the external tables and insert them into actual Redshift tables.
You'll have an initial cost hit when Spectrum scans the CSV files to query them, which will vary depending on how big the files are, but that's likely to be significantly less than spinning up an EMR cluster to process the data.
Getting Started with Amazon Redshift Spectrum
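A rough sketch of the Spectrum route, submitted through the Redshift Data API with boto3; the cluster, IAM role, Glue database, and table/column names are all placeholders:

```python
import boto3

rsd = boto3.client("redshift-data")

statements = [
    # External schema over the Glue catalog database describing the CSVs in S3.
    """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_schema
    FROM DATA CATALOG DATABASE 'my_glue_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    """,
    # Materialise only the needed columns into a real Redshift table.
    """
    CREATE TABLE ml_subset AS
    SELECT col_a, col_b, col_c
    FROM spectrum_schema.raw_csv
    """,
]

for sql in statements:
    # Fire-and-forget here; in practice poll describe_statement() between calls.
    rsd.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )
```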
I'm using the Java API, trying to load data from Avro files into BigQuery.
When creating external tables, BigQuery automatically detects the schema from the .avro files.
Is there a way to specify a schema/data file in GCS when creating a regular BigQuery table for data to be loaded into?
Thank you in advance.
You could manually create the schema definition with configuration.load.schema; however, the documentation says that:
When you load Avro, Parquet, ORC, Cloud Firestore export data, or Cloud Datastore export data, BigQuery infers the schema from the source data.
It seems the problem was that the table already existed and I did not specify CreateDisposition.CREATE_IF_NEEDED.
You do not need to specify the schema at all, just as with external tables.
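For reference, a minimal sketch of such a load in the Python client (the Java client exposes the same create-disposition setting on its load configuration); the GCS path and table id are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    # Create the table if it doesn't exist; the schema comes from the Avro files.
    create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED,
)

client.load_table_from_uri(
    "gs://my-bucket/data/*.avro",      # placeholder source
    "my_project.my_dataset.my_table",  # placeholder destination
    job_config=job_config,
).result()
```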