AWS Redshift - Format all rows into a json file

Is there any information about formatting a SQL table in Redshift as a JSON file? The SQL syntax I tried doesn't work here; it reports a syntax error.

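When this answer was written, the Amazon Redshift UNLOAD command could only output in fixed-width or delimited (e.g. CSV) format. Newer Redshift releases also let UNLOAD write Parquet and JSON (FORMAT JSON), so check the current UNLOAD documentation first.
Otherwise, you could either convert the output from the UNLOAD command, or use or write a program that calls Redshift via SQL and saves the output in the desired format.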
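The last suggestion (a program that runs the query and saves the result in the desired format) could look roughly like the sketch below. It assumes the rows have already been fetched through some DB-API style cursor; the column names and sample rows are invented for illustration, and rows_to_json_lines is a hypothetical helper, not part of any Redshift tooling.

```python
import json
from datetime import date, datetime
from decimal import Decimal


def _to_jsonable(value):
    """Convert values a SQL driver commonly returns into JSON-friendly ones."""
    if isinstance(value, (date, datetime)):
        return value.isoformat()
    if isinstance(value, Decimal):
        return str(value)
    return value


def rows_to_json_lines(column_names, rows):
    """Return one JSON object per row, newline separated (JSON Lines)."""
    lines = []
    for row in rows:
        record = {name: _to_jsonable(v) for name, v in zip(column_names, row)}
        lines.append(json.dumps(record))
    return "\n".join(lines)


# Hypothetical sample data standing in for cursor.description / cursor.fetchall()
columns = ["id", "name", "created"]
sample_rows = [(1, "alpha", date(2021, 2, 21)), (2, "beta", None)]

output = rows_to_json_lines(columns, sample_rows)
print(output)
```

Writing one object per line keeps the file easy to stream and to load into other tools; wrapping everything in a single JSON array would work just as well for small tables.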

Related

Query JSON file in Presto in S3

I have a file in S3, and Presto running on EMR. I see I can use json_extract to read the JSON.
I am running the following query, but I keep seeing null instead of the correct value.
select json_extract('s3a://random-s3-bucket/analytics/20210221/myjsonfile.json', '$.dateAvailability')
Is my syntax wrong? Any thoughts?

json_extract() operates on JSON scalar values kept in memory. It does not load data from an external location. See the documentation page for usage examples.
In order to query a JSON file using Trino (formerly known as Presto SQL), you need to map it as a table with JSON format, like this:
CREATE TABLE my_table ( .... )
WITH (
format = 'JSON',
external_location = 's3a://random-s3-bucket/analytics/20210221'
);
See the Hive connector documentation for more information.
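The difference between a path and the JSON itself can be illustrated outside Presto. The Python sketch below is an analogy only, not how Presto or Trino is implemented: a hypothetical extract helper parses its first argument as JSON text, so handing it a path string yields nothing, while handing it the document's contents works. The key dateAvailability comes from the question; the sample value is invented.

```python
import json


def extract(json_text, key):
    """Parse json_text as JSON and return a top-level key, or None.

    Mimics the behaviour seen in the question: input that is not
    valid JSON gives None instead of a value.
    """
    try:
        document = json.loads(json_text)
    except json.JSONDecodeError:
        return None
    if isinstance(document, dict):
        return document.get(key)
    return None


# A path is just a string; it is not valid JSON, so nothing can be extracted.
from_path = extract(
    "s3a://random-s3-bucket/analytics/20210221/myjsonfile.json",
    "dateAvailability",
)

# The contents of the file (sample value invented here) are valid JSON.
from_contents = extract('{"dateAvailability": "2021-02-21"}', "dateAvailability")

print(from_path, from_contents)
```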

If you need a tool to help you create the table statement, try this one: https://www.hivetablegenerator.com
From the page:
Easily convert any JSON (even complex Nested ones), CSV, TSV, or Log
sample file to an Apache HiveQL DDL create table statement.

BigQuery fails on parsing dates in M/D/YYYY format from CSV file

Problem
I'm attempting to create a BigQuery table from a CSV file in Google Cloud Storage.
I'm explicitly defining the schema for the load job (below) and set header rows to skip = 1.
Data
$ cat date_formatting_test.csv
id,shipped,name
0,1/10/2019,ryan
1,2/1/2019,blah
2,10/1/2013,asdf
Schema
id:INTEGER,
shipped:DATE,
name:STRING
Error
BigQuery produces the following error:
Error while reading data, error message: Could not parse '1/10/2019' as date for field shipped (position 1) starting at location 17
Questions
I understand that this date isn't in ISO format (2019-01-10), which I'm assuming will work.
However, I'm trying to define a more flexible input configuration whereby BigQuery will correctly load any date that the average American would consider valid.
Is there a way to specify the expected date format(s)?
Is there a separate configuration / setting to allow me to successfully load the provided CSV in with the schema defined as-is?

According to the listed limitations:
When you load CSV or JSON data, values in DATE columns must use
the dash (-) separator and the date must be in the following
format: YYYY-MM-DD (year-month-day).
So this leaves us with 2 options:
Option 1: ETL
Place new CSV files in Google Cloud Storage
That in turn triggers a Google Cloud Function or Google Cloud Composer job to:
Edit the date column in all the CSV files
Save the edited files back to Google Cloud Storage
Load the modified CSV files into Google BigQuery
Option 2: ELT
Load the CSV file as-is to BigQuery (i.e. your schema should be modified to shipped:STRING)
Create a BigQuery view that transforms the shipped field from a string to a recognised date format. Use SELECT id, PARSE_DATE('%m/%d/%Y', shipped) AS shipped, name
Use that view for your analysis
I'm not sure from your description whether this is a one-off job or a recurring one. If it's a one-off, I'd go with Option 2 as it requires the least effort. Option 1 requires a bit more effort and would only be worth it for recurring jobs.
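For Option 1, the "edit the date column" step could be sketched as follows. This is a minimal sketch, assuming month-first dates as in the sample file; rewrite_dates is a hypothetical helper, and the Cloud Storage download and upload around it are left out.

```python
import csv
import io
from datetime import datetime


def rewrite_dates(csv_text, column, input_format="%m/%d/%Y"):
    """Return csv_text with `column` rewritten as YYYY-MM-DD."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # strptime raises ValueError on a value that does not match the format
        parsed = datetime.strptime(row[column], input_format)
        row[column] = parsed.date().isoformat()
        writer.writerow(row)
    return out.getvalue()


# The sample file from the question
source = "id,shipped,name\n0,1/10/2019,ryan\n1,2/1/2019,blah\n2,10/1/2013,asdf\n"
converted = rewrite_dates(source, "shipped")
print(converted)
```

Failing loudly on an unparseable value is deliberate here: silently passing a bad date through would only move the error back into the BigQuery load job.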

Hive ORC File Format

When we create an ORC table in Hive, we can see that the data is compressed and not exactly readable in HDFS. So how is Hive able to convert that compressed data into the readable format that is shown to us when we run a simple select * query on that table?
Thanks for suggestions!!

Hive does this through the ORC SerDe that is specified when the table is created. You have to provide the package name of the SerDe class:
ROW FORMAT ''.
A SerDe deserializes data stored in a particular format into objects that Hive can process, and serializes those objects again to store them back in HDFS.

Hive uses a “SerDe” (Serializer/Deserializer) to do that. When you create a table, you specify the file format; in your case it is ORC, via “STORED AS ORC”. Hive uses the ORC library (a JAR file) internally to convert the data into a readable format. To learn more about Hive internals, search for “Hive SerDe” and you will see how the data is converted to objects and vice versa.
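The idea can be sketched with a toy example. The Python below is only an analogy and has nothing to do with the real ORC format or Hive's classes: a hypothetical ToySerDe stores rows as compressed bytes, which are unreadable when inspected directly, and turns them back into rows on read, which is roughly the role the ORC library plays for a select query.

```python
import json
import zlib


class ToySerDe:
    """Toy stand-in for a SerDe: rows <-> compressed bytes."""

    def serialize(self, rows):
        # Rows (lists of values) become compressed bytes for storage.
        return zlib.compress(json.dumps(rows).encode("utf-8"))

    def deserialize(self, stored):
        # Stored bytes become rows again when the table is queried.
        return json.loads(zlib.decompress(stored).decode("utf-8"))


serde = ToySerDe()
rows = [[1, "ryan"], [2, "blah"]]
stored = serde.serialize(rows)        # what sits on disk: not human readable
restored = serde.deserialize(stored)  # what a query would return
print(stored[:10], restored)
```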

Issues loading CSV into BigQuery table

I'm trying to create a BigQuery table using a pretty simple CSV file I have stored in GCS.
I keep getting the same error over and over again:
Could not parse '1/1/2008' as datetime for field XXX
I've checked that the csv file isn't corrupted, and I've managed to upload everything into one column so the file is readable by BigQuery.
I've added the word NULL to any empty fields thinking consecutive delimiters may be causing the issues but I am still facing the same issue.
I know data, I understand data and CSV files.

BigQuery cannot cast '1/1/2008' as DATETIME; it expects something like '2008-1-1' instead.
So, you can either modify your CSV file or just use STRING for that XXX field and then translate it into DATETIME in your queries, like below:
#standardSQL
SELECT PARSE_DATETIME('%d/%m/%Y', '1/1/2008')
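One thing to watch with the format string is the order of day and month. For '1/1/2008' both orders give the same result, but for other values they do not. The Python sketch below uses datetime.strptime, whose %d, %m and %Y directives mean day, month and year just as in the query above; the sample value is only an illustration of a month-first date.

```python
from datetime import datetime

value = "1/10/2019"

day_first = datetime.strptime(value, "%d/%m/%Y")    # read as 1 October 2019
month_first = datetime.strptime(value, "%m/%d/%Y")  # read as 10 January 2019

print(day_first.date().isoformat())
print(month_first.date().isoformat())
```

If the file uses US-style dates, '%m/%d/%Y' is the format to pass; a wrong order does not always fail, it can quietly produce a different date.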

How to convert sql file to a specific formatted csv file?

Any ideas how I can extract data from a SQL database, put it into a CSV file in a specific format, and push it to an external URL?

Most SQL databases have some sort of export utility that can produce a CSV file. Google "Export " and you should find it. It is not part of the SQL standard, so every product does it differently.
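If no export utility fits the required format, a small script is another route. The sketch below is one possible approach under stated assumptions: an in-memory SQLite database stands in for the real one, the orders table and its rows are invented, and https://example.com/upload is a placeholder URL. The request is only built, not sent.

```python
import csv
import io
import sqlite3
import urllib.request


def query_to_csv(connection, query, delimiter=","):
    """Run query and return the result as CSV text with a header row."""
    cursor = connection.execute(query)
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)
    return out.getvalue()


def build_upload_request(url, csv_text):
    """Prepare (but do not send) an HTTP POST carrying the CSV."""
    return urllib.request.Request(
        url,
        data=csv_text.encode("utf-8"),
        headers={"Content-Type": "text/csv"},
        method="POST",
    )


# In-memory SQLite stands in for the real database (hypothetical table and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "ryan"), (2, "blah")])

csv_text = query_to_csv(conn, "SELECT id, name FROM orders ORDER BY id")
request = build_upload_request("https://example.com/upload", csv_text)  # placeholder URL
# urllib.request.urlopen(request) would actually send it.
print(csv_text)
```

The "specific format" part lives in query_to_csv: the column order comes from the SELECT list, and the delimiter is a parameter.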
