Spark SQL: specify encoding while saving a dataframe as a table

Can we specify the encoding like below?
df.write.format("text").option("encoding", "UTF-8").saveAsTable

You can. And in case you want to create an external table, you can specify the path as well:
df.write.option("path", "/some/path").option("encoding", "UTF-8").saveAsTable("t")

Related

Query JSON file in Presto in S3

I have a file in S3 and Presto running on EMR. I see I can use json_extract to read the JSON.
I am running the following query, but I keep getting null instead of the correct value:
select json_extract('s3a://random-s3-bucket/analytics/20210221/myjsonfile.json', '$.dateAvailability')
Is my syntax wrong? Thoughts?
json_extract() operates on JSON values kept in memory; it does not load data from an external location. See the documentation page for usage examples.
In order to query a JSON file using Trino (formerly known as Presto SQL), you need to map it as a table with the JSON format, like this:
CREATE TABLE my_table ( .... )
WITH (
    format = 'JSON',
    external_location = 's3a://random-s3-bucket/analytics/20210221'
);
See the Hive connector documentation for more information.
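Once the file is mapped, the JSON keys are queried as ordinary columns instead of passing an S3 path to json_extract. A minimal sketch using the Trino Python client, with hypothetical connection details and assuming a dateAvailability column was declared in the CREATE TABLE above:
import trino

# Hypothetical connection details; adjust host, port, user, catalog and schema.
conn = trino.dbapi.connect(host="emr-master-node", port=8889, user="hadoop",
                           catalog="hive", schema="analytics")
cur = conn.cursor()
# The JSON key becomes a regular column (identifiers are case-insensitive).
cur.execute("SELECT dateavailability FROM my_table")
print(cur.fetchall())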
If you need a tool to help you create the table statement, try this one: https://www.hivetablegenerator.com
From the page:
Easily convert any JSON (even complex Nested ones), CSV, TSV, or Log
sample file to an Apache HiveQL DDL create table statement.

How to auto detect schema from file in GCS and load to BigQuery?

I'm trying to load a file from GCS into BigQuery, with the schema auto-generated from the file in GCS. I'm using Apache Airflow to do this. The problem is that when I use schema auto-detection, BigQuery builds the schema from only the first ~100 values.
For example, in my case there is a column, say X, whose values are mostly integers, but some values are strings, so bq load fails with a schema mismatch; in that scenario the data type needs to be changed to STRING.
What I could do is manually create a new table by generating the schema on my own, or I could set max_bad_records to something like 50, but neither seems like a good solution. An ideal solution would look like this:
Try to load the file from GCS into BigQuery; if the table is created successfully in BQ without any data mismatch, I don't need to do anything.
Otherwise, I need to be able to update the schema dynamically and complete the table creation.
Since you cannot change a column's type in BigQuery (see this link):
BigQuery natively supports the following schema modifications:
* Adding columns to a schema definition
* Relaxing a column's mode from REQUIRED to NULLABLE
All other schema modifications are unsupported and require manual workarounds
So as a workaround I suggest:
* Use --max_rows_per_request = 1 in your script.
* Use the one line that best suits your case, with the optimized field types.
This will create the table with the correct schema and a single row, and from there you can load the rest of the data (a sketch follows below).
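A rough sketch of that workaround with the google-cloud-bigquery Python client, using hypothetical project, dataset, bucket, and column names, assuming newline-delimited JSON input and that X is the column forced to STRING:
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"  # hypothetical

# Step 1: create the table from a single hand-picked line, with X declared
# as STRING instead of the auto-detected INTEGER.
schema = [
    bigquery.SchemaField("X", "STRING"),
    bigquery.SchemaField("Y", "INTEGER"),  # remaining columns as appropriate
]
first_load = bigquery.LoadJobConfig(
    schema=schema,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.load_table_from_uri(
    "gs://my-bucket/one_line_sample.json", table_id, job_config=first_load
).result()

# Step 2: append the full file; the existing table schema is reused.
rest_load = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
client.load_table_from_uri(
    "gs://my-bucket/full_file.json", table_id, job_config=rest_load
).result()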

Support for creating table out of limited number of column in Presto

I was playing around with Presto. I uploaded a Parquet file with 10 columns. I want to create a table (external location S3) in the metastore with only 5 columns using presto-cli. It looks like Presto doesn't support this?
Is there any other way to get this working?
That is easily possible if you are using the Parquet or ORC file formats; this is another advantage of keeping metadata separate from the actual data. As mentioned in the comments, fields are resolved by column name rather than by index, so the table definition can declare only the columns you need.
One example:
CREATE TABLE hive.web.request_logs (
    request_time timestamp,
    url varchar,
    ip varchar,
    user_agent varchar
)
WITH (
    format = 'parquet',
    external_location = 's3://my-bucket/data/logs/'
)
Reference:
https://prestodb.github.io/docs/current/connector/hive.html#examples

Writing Avro to BigQuery using Beam

Q1: Say I load Avro-encoded data using the BigQuery load tool. Now I need to write this data to a different table, still in Avro format. I am trying out different partitioning schemes in order to test table performance. How do I write SchemaAndRecord back to BigQuery using Beam? Also, would schema detection work in this case?
Q2: It looks like schema information is lost when the Avro schema type is converted to the BigQuery schema type. For example, both the double and float Avro types are converted to FLOAT in BigQuery. Is this expected?
Q1: If the table already exists and its schema matches the one you're copying from, you should be able to use the CREATE_NEVER CreateDisposition (https://cloud.google.com/dataflow/model/bigquery-io#writing-to-bigquery) and just write the TableRows directly from the output of readTableRows() on the original table. That said, I suggest using BigQuery's table copy command instead.
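For illustration, a minimal sketch of that first option using the Beam Python SDK (the question refers to the Java SDK's SchemaAndRecord; the table names below are hypothetical and the destination table is assumed to already exist with a matching schema):
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     # Read rows (as Python dicts) from the original table.
     | "Read" >> beam.io.ReadFromBigQuery(table="my_project:my_dataset.source_table")
     # Write to the pre-created destination; CREATE_NEVER reuses the existing
     # table and schema instead of re-detecting them.
     | "Write" >> beam.io.WriteToBigQuery(
         "my_project:my_dataset.dest_table",
         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))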
Q2: That's expected; BigQuery does not have a DOUBLE type. You can find more information on the type mapping here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions. Logical types will also be supported soon: https://issuetracker.google.com/issues/35905894.

How to prevent double quotes being escaped when importing data from a text file into a hive table

I have a field, info, of datatype map; if I select this field in the Hive console, the result looks something like this:
{"a":"value1","b":"value2"}
How do I represent this data in a text file so that it is properly represented when I import it into the Hive table? Should my text file have something like this?
a:value1,b:value2
Are you trying to load JSON documents into Hive? There are SerDes available to load and query JSON data in Hive; with a JSON SerDe, your text file would contain one JSON object per line, e.g. {"a":"value1","b":"value2"}.
In your case, "a" and "b" would become the column names (header).