AWS Athena Returning Zero Records from Tables Created from Glue Crawler Database Using Parquet from S3

I created Parquet files from SQL data using Python. I can read the data on my local machine, so I know the Parquet files are valid. I created a Glue crawler that builds a database from the Parquet files in S3, and the Glue dashboard shows the correct number of records for it.
When I query that database in Athena, it shows "No Results" but does show the column names.
Screenshots for reference (images not available): GLUE Table Properties, Athena Query.

I figured it out. You cannot point the crawler at the root of the S3 bucket with the Parquet files sitting directly in that location. Each Parquet file needs to be in its own folder. I gave each folder the same name as its file; I don't think the matching name is required, but for automation purposes it makes the most sense.
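A minimal sketch of that layout, assuming each Parquet file becomes one table: it derives an S3 key that places the file in a folder named after it. The function name `per_table_key` and the sample file names are hypothetical, and the upload itself (for example with an S3 client) is left out.

```python
from pathlib import PurePosixPath


def per_table_key(filename: str) -> str:
    """Return an S3 key that puts the file in a folder named after it.

    Only the file name is used; any local directory part is dropped.
    """
    path = PurePosixPath(filename)
    # stem is the name without the extension, name is the full file name
    return f"{path.stem}/{path.name}"


if __name__ == "__main__":
    # Hypothetical local exports; print the key each would be uploaded under.
    for local in ["orders.parquet", "exports/customers.parquet"]:
        print(per_table_key(local))
```

Pointing the crawler at the bucket (or a common prefix) should then let it treat each folder as a separate table rather than finding loose files at the root.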


Exporting table from Amazon RDS into a csv file using Golang API

I am looking for a way to export SQL query results directly to a CSV file from AWS Lambda. I found a similar question, "Exporting table from Amazon RDS into a csv file", but it will not work with the AWS Golang API.
Specifically, I want to schedule a Lambda function that queries some of the views/tables in RDS (SQL Server) daily and puts the results in an S3 bucket in CSV format. So I want to get the query results as CSV inside the Lambda and then upload them to S3.
I have also found the AWS Data Pipeline service, which copies RDS data to S3 directly, but I am not sure whether I can use it here.
It would be helpful if anyone could suggest the right process and references for implementing it.
You can transfer files between a DB instance running Amazon RDS for SQL Server and an Amazon S3 bucket. By doing this, you can use Amazon S3 with SQL Server features such as BULK INSERT. For example, you can download .csv, .xml, .txt, and other files from Amazon S3 to the DB instance host and import the data from D:\S3\ into the database. All files are stored in D:\S3\ on the DB instance.
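For the Lambda route described in the question, the CSV step itself can be sketched as below. This is an illustration in Python rather than Go (the same shape should carry over to Go's `encoding/csv`), and it assumes the query results are already available as a list of column names plus an iterable of row tuples. The name `rows_to_csv` is hypothetical; running the query and uploading to S3 are not shown.

```python
import csv
import io


def rows_to_csv(columns, rows):
    """Serialize a header and row tuples to CSV text held in memory."""
    buffer = io.StringIO()
    # The csv module quotes fields containing commas or quotes as needed.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical result set; NULLs (None) are written as empty fields.
    sample = [(1, "Widget, large"), (2, None)]
    print(rows_to_csv(["id", "name"], sample), end="")
```

Building the file in memory avoids relying on the Lambda's local disk; the resulting text can be handed to whatever S3 upload call the function uses.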

Zipped Data in S3 that needs to be used for Machine Learning on EMR or Redshift

I have huge CSV files in zipped format in S3 storage. I need just a subset of columns from the data for machine learning purposes. How should I extract those columns into EMR and then into Redshift without transferring the whole files?
My idea is to process all the files in EMR, extract the subset, and push the required columns into Redshift. But this is taking a lot of time. Please let me know if there is a more efficient way of handling this data.
Edit: I am trying to automate this pipeline using Kafka. Let's say a new folder is added to S3; it should be processed in EMR using Spark and stored in Redshift without any manual intervention.
Edit 2: Thanks for the input, guys. I was able to create a pipeline from S3 to Redshift using PySpark in EMR. Currently, I am trying to integrate Kafka into this pipeline.
I would suggest:
Create an external table in Amazon Athena (An AWS Glue crawler can do this for you) that points to where your data is stored
Use CREATE TABLE AS to select the desired columns and store them in a new table (with the data automatically stored in Amazon S3)
Amazon Athena can handle gzip format, but you'll have to check whether this includes zip format.
CREATE TABLE - Amazon Athena
Examples of CTAS Queries - Amazon Athena
Compression Formats - Amazon Athena
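As a rough illustration of the CREATE TABLE AS step suggested above, the sketch below builds an Athena CTAS statement that keeps only the chosen columns. The table names, column names, bucket path, and the choice of Parquet output are all assumptions made for the example, and `build_ctas` is a hypothetical helper; it does no escaping, so it should only be fed trusted identifiers.

```python
def build_ctas(new_table, source_table, columns, output_location):
    """Build an Athena CTAS statement selecting a subset of columns.

    Identifiers are inserted as-is (no escaping or validation).
    """
    column_list = ", ".join(columns)
    return (
        f"CREATE TABLE {new_table}\n"
        f"WITH (format = 'PARQUET', external_location = '{output_location}')\n"
        f"AS SELECT {column_list}\n"
        f"FROM {source_table}"
    )


if __name__ == "__main__":
    # All names below are placeholders, not values from the question.
    print(build_ctas(
        "ml_subset",
        "raw_csv",
        ["user_id", "label"],
        "s3://example-bucket/ml_subset/",
    ))
```

The generated statement can be pasted into the Athena console or submitted through an Athena client; the new table's data lands under the given S3 location.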
If the goal is to materialise a subset of the file columns in a table in Redshift then one option you have is Redshift Spectrum, which will allow you to define an "external table" over the CSV files in S3.
You can then select the relevant columns from the external tables and insert them into actual Redshift tables.
You'll have an initial cost hit when Spectrum scans the CSV files to query them, which will vary depending on how big the files are, but that's likely to be significantly less than spinning up an EMR cluster to process the data.
Getting Started with Amazon Redshift Spectrum

Importing data from AWS Athena to RDS instance

Currently I’m listening to events from AWS Kinesis and writing them to S3. Then I query them using AWS Glue and Athena.
Is there a way to import that data, possibly with some transformation, to an RDS instance?
There are several general approaches to that task.
Read data from an Athena query into a custom ETL script (using a JDBC connection) and load it into the database
Mount the S3 bucket holding the data to a file system (perhaps using s3fs-fuse), read the data using a custom ETL script, and push it to the RDS instance(s)
Download the data to be uploaded to the RDS instance to a filesystem using the AWS CLI or the SDK, process locally, and then push to RDS
As you suggest, use AWS Glue to import the data from Athena to the RDS instance. If you are building an application that is tightly coupled with AWS (and if you are using Kinesis and Athena, you are), then such a solution makes sense.
When connecting Glue to RDS, keep a couple of things in mind (mostly on the networking side):
Ensure that DNS hostnames are enabled in the VPC hosting the target RDS instance
You'll need to set up a self-referencing rule in the security group associated with the target RDS instance
For some examples of code targeting a relational database, see the following tutorials
One approach for Postgres:
Install the S3 extension (aws_s3) in Postgres:
CREATE EXTENSION aws_s3 CASCADE;
Run the query in Athena and find the CSV result file location in S3. The S3 output location is in the Athena settings; you can also inspect the "Download results" button to get the S3 path.
Create your table in Postgres
Import from S3:
SELECT aws_s3.table_import_from_s3(
'newtable', '', '(format csv, header true)',
aws_commons.create_s3_uri('bucketname', 'reports/Unsaved/2021/05/10/aa9f04b0-d082-328g-5c9d-27982d345484.csv', 'us-east-1')
);
If you want to convert empty values to null, you can use this: (format csv, FORCE_NULL (columnname), header true)
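Because Athena reports the result location as a single `s3://` path while `aws_commons.create_s3_uri` takes the bucket and key separately, a small helper can do the split and fill in the import statement. This is only a sketch: `split_s3_uri` and `import_statement` are hypothetical names, the sample path and region are placeholders, and the values are interpolated without escaping.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/parts' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def import_statement(table, s3_uri, region):
    """Build the aws_s3.table_import_from_s3 call for an Athena result file."""
    bucket, key = split_s3_uri(s3_uri)
    return (
        "SELECT aws_s3.table_import_from_s3(\n"
        f"  '{table}', '', '(format csv, header true)',\n"
        f"  aws_commons.create_s3_uri('{bucket}', '{key}', '{region}')\n"
        ");"
    )


if __name__ == "__main__":
    # Placeholder result path; substitute the one shown in Athena.
    print(import_statement("newtable", "s3://bucketname/reports/result.csv", "us-east-1"))
```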

Convert JSON to ORC [AWS]

This is my situation:
I have an application that rotates JSON files to an S3 bucket. I need to convert those files to ORC format so they can be queried from Athena or EMR.
My first attempt was a Lambda function written in Node, but I didn't find any module for the conversion.
I think it can be done more easily with Glue or EMR, but I cannot find a solution.
Any help?
You can use Glue. You will need a Glue Data Catalog table that describes the schema of your data; you can create this automatically with a Glue crawler.
Then create a Glue job. If you follow the Add Job wizard, you can select ORC as the data output format in the data targets section of the wizard.
The AWS Glue tutorials step you through something similar, but converting to Parquet format. If you go through the same steps with your data but select ORC, it should do what you want.

BigQuery - load a datasource in Google big query

I have a MySQL DB in AWS. Can I use that database as a data source in BigQuery?
Currently I upload a CSV to a Google Cloud Storage bucket and load it into BigQuery.
I would like to keep it synchronised by pointing BigQuery directly at the data source rather than loading it every time.
You can create a permanent external table in BigQuery that is connected to Cloud Storage. BigQuery is then just the interface, while the data resides in GCS. The table can be connected to a single CSV file, and you are free to update or overwrite that file. I am not sure whether you can link BigQuery to a directory full of CSV files, or even a tree of directories.
Anyway, have a look here: