The files are stored in S3 by file name only, with no extension.
I created the table with CREATE EXTERNAL TABLE, but Hive doesn't seem to recognize the files as Parquet.
Is there any way to read extension-less Parquet files in Hive?
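In case it helps: Hive picks the file format from the table definition (the STORED AS clause / SerDe), not from the file extension, so extension-less Parquet files should be readable as long as the table is declared STORED AS PARQUET and its location points at the directory containing them. A minimal sketch via PyHive (the host, bucket, and column names here are hypothetical):

```python
# A sketch: declare the table as Parquet and point it at the S3 prefix.
# Hive reads the format from the table metadata, not the file extension.
# Host, bucket, and column names are hypothetical.
from pyhive import hive  # pip install "pyhive[hive]"

conn = hive.connect(host="hive-server.example.com", port=10000)
cur = conn.cursor()
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        id BIGINT,
        payload STRING
    )
    STORED AS PARQUET
    LOCATION 's3a://my-bucket/events/'
""")
cur.execute("SELECT * FROM events LIMIT 10")
print(cur.fetchall())
```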
I was given a .sql file, which is a snapshot of a Postgres database. I have only the file itself, which is quite large, and want to read the tables into pandas or some other container. There is no external database to make a connection to; apparently it's all in the file.
How can I read a file like this using a Python script?
I searched for this and found references to cases where the .sql file contains queries against an existing database, but that is not my case.
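One common workaround, sketched below under the assumption that the dump is a plain-SQL pg_dump and that a local Postgres server with the psql/createdb client tools is available (the file and database names are hypothetical): restore the dump into a scratch database, then read the tables back with pandas.

```python
# A sketch: restore a plain-SQL pg_dump into a scratch local database,
# then pull every table into a pandas DataFrame. Assumes psql/createdb
# are on PATH and a local Postgres server is running; names below are
# hypothetical. (A custom-format dump would need pg_restore instead.)
import subprocess
import pandas as pd
from sqlalchemy import create_engine, inspect

subprocess.run(["createdb", "scratch_db"], check=True)
subprocess.run(["psql", "-d", "scratch_db", "-f", "snapshot.sql"], check=True)

engine = create_engine("postgresql://localhost/scratch_db")
tables = {name: pd.read_sql_table(name, engine)
          for name in inspect(engine).get_table_names()}
print({name: df.shape for name, df in tables.items()})
```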
I have created Parquet files from SQL data using Python. I am able to read the data on my local machine, so I know the Parquet files are valid. I created a Glue crawler that builds a database from the Parquet files in S3, and the database shows the correct number of records in the Glue dashboard.
When I query that database in Athena it shows "No Results", but it does show the column names.
Please see the screenshots for reference: the Glue table properties and the Athena query.
I figured it out. You cannot point at the root of the S3 bucket with the Parquet files sitting in that location. Each Parquet file needs to be in its own folder; I gave the folder the same name as the file. I don't think the matching name is required, but for automation purposes it makes the most sense...
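For illustration, a rough sketch of the layout described above, driven from Python with boto3 (the bucket and file names are hypothetical): each Parquet file is uploaded under its own prefix instead of the bucket root, and the crawler is then pointed at those prefixes.

```python
# A sketch of the S3 layout: each Parquet file goes under its own prefix
# instead of the bucket root. Bucket and file names are hypothetical.
import os
import boto3

s3 = boto3.client("s3")
bucket = "my-data-bucket"

for path in ["orders.parquet", "customers.parquet"]:
    name = os.path.splitext(os.path.basename(path))[0]
    # -> s3://my-data-bucket/orders/orders.parquet, .../customers/customers.parquet
    s3.upload_file(path, bucket, f"{name}/{os.path.basename(path)}")
```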
My source is SQL Server and I am using SSIS to export data to S3 buckets, but now my requirement is to send the files in Parquet format.
Can you guys give some clues on how to achieve this?
Thanks,
Ven
For folks stumbling on this answer, Apache Parquet is a project that specifies a columnar file format employed by Hadoop and other Apache projects.
Unless you find a custom component or write some .NET code to do it, you're not going to be able to export data from SQL Server to a Parquet file. KingswaySoft's SSIS Big Data Components might offer one such custom component, but I have no familiarity with it.
If you were exporting to Azure, you'd have two options:
Use the Flexible File Destination component (part of the Azure feature pack), which exports to a Parquet file hosted in Azure Blob or Data Lake Gen2 storage.
Leverage PolyBase, a SQL Server feature. It lets you export to a Parquet file via the external table feature. However, that file has to be hosted in a location mentioned here. Unfortunately S3 isn't an option.
If it were me, I'd move the data to S3 as a CSV file and then use Athena to convert the CSV file to Parquet. There is a nifty article here that talks through the Athena piece:
https://www.cloudforecast.io/blog/Athena-to-transform-CSV-to-Parquet/
Net-net, you'll need to spend a little money, get creative, switch to Azure, or do the conversion in AWS.
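For the Athena piece, a rough sketch of what the CSV-to-Parquet conversion can look like when driven from Python with boto3, using a CTAS (CREATE TABLE AS SELECT) query; the database, table, and bucket names are hypothetical:

```python
# A sketch of converting a CSV-backed Athena table to Parquet with a
# CREATE TABLE AS SELECT (CTAS) query. Database, table, and bucket names
# are hypothetical.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ctas = """
CREATE TABLE my_db.sales_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/sales-parquet/'
) AS
SELECT * FROM my_db.sales_csv
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```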
I have a big CSV file which is part of my project and is available on the internet. This file needs to be loaded into Hive. Without knowing the structure of the file, how can I create an external table in Hive?
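One way to bootstrap the table definition, sketched under the assumption that the file is reachable over HTTP and that a sample of rows is enough to infer column types (the URL, table name, and S3 location are hypothetical): let pandas infer a schema from a sample and generate the CREATE EXTERNAL TABLE statement from it.

```python
# A sketch: infer column names and types from a sample of the CSV with
# pandas, then emit a Hive CREATE EXTERNAL TABLE statement. The URL,
# table name, and location are hypothetical.
import pandas as pd

sample = pd.read_csv("https://example.com/big_file.csv", nrows=1000)

# Very rough pandas dtype -> Hive type mapping; everything else becomes STRING.
hive_types = {"int64": "BIGINT", "float64": "DOUBLE", "bool": "BOOLEAN"}
cols = ",\n  ".join(
    f"`{col}` {hive_types.get(str(dtype), 'STRING')}"
    for col, dtype in sample.dtypes.items()
)

ddl = f"""CREATE EXTERNAL TABLE big_csv (
  {cols}
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://my-bucket/big_csv/'
TBLPROPERTIES ('skip.header.line.count'='1');"""
print(ddl)
```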
This is my situation:
I have an application that rotates JSON files into an S3 bucket. I need to convert those files to ORC format so they can be queried from Athena or EMR.
My first attempt was a Lambda written in Node, but I didn't find any module for the conversion.
I think it can be done more easily with Glue or EMR, but I can't find a solution.
Any help?
Thanks!
You can use Glue. You will need a Glue Data Catalog table that describes the schema of your data; you can create this automatically with a Glue crawler.
Then create a Glue job. If you follow the Add Job wizard, you can select ORC as the data output format in the data targets section of the wizard.
The AWS Glue tutorials step you through doing something very similar, but converting into Parquet format; if you go through the same steps with your data and select ORC instead, it should do what you want.
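For reference, a Glue ETL script along the lines of what the Add Job wizard generates might look roughly like the sketch below (the database, table name, and output path are hypothetical); the key part is writing the output with format="orc":

```python
# A rough sketch of a Glue ETL job that reads the crawled JSON table from
# the Data Catalog and writes it back to S3 as ORC. Database, table, and
# output path are hypothetical.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the crawler created from the JSON files
source = glue_context.create_dynamic_frame.from_catalog(
    database="json_db", table_name="events_json")

# Write it back out as ORC
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/events-orc/"},
    format="orc",
)

job.commit()
```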