How to achieve a schema change in Parquet format - Hive

Just a design issue we are facing.
I have a hive external table in parquet format with following columns:
describe payments_user

col_name        data_type   comment
amount_hold     int
id              int
transaction_id  string
recipient_id    string
year            string
month           string
day             string

# Partition Information
# col_name      data_type   comment
year            string
month           string
day             string
We receive data on a daily basis and ingest it dynamically into the year, month, and day partitions.
If the schema changes on the source side, where they add a new column and send the batch file in that shape, how can we ingest the data? I know Avro has this capability, but in order to reduce rework, how can this be achieved in Parquet format?
If Avro is the answer, what is the procedure?

What you are looking for is schema evolution. It is supported by Hive, with some limitations compared with Avro.
Schema evolution in parquet format
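For the common case of columns being appended on the source side, a minimal sketch of the Hive DDL (the column name new_source_col is a hypothetical placeholder):

-- Append the new column at the end of the table schema.
-- CASCADE (Hive 1.1+) also propagates the change to the metadata of existing
-- partitions; older Parquet files simply return NULL for the new column.
ALTER TABLE payments_user ADD COLUMNS (new_source_col string) CASCADE;

Parquet schema evolution in Hive is essentially limited to this kind of append-at-the-end change; renaming or reordering columns is where Avro is more flexible.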

Related

Declare datatype for nested json in presto

I'm trying to declare a table that reads JSON files from an S3 bucket. The format of these JSON files is like this:
[{'metadata': [{'commands': '/test.py',
'cwds': '/test',
'service': 'test',
'hosts': 'test123',
'name': 'test.py',
'path': 'test/python3',
'id': '123',
'project': 'test'}]}]
I declared it like this:
CREATE TABLE IF NOT EXISTS schema.test
(
metadata array
)
WITH (format = 'JSON', external_location = 's3://test/');
but when I query the table, the dictionary object is rendered as a list and not as key-value pairs, therefore making it useless. I tried declaring it as ARRAY<MAP<VARCHAR>> and got this error -
mismatched input '>'. Expecting: '(', ',', 'ARRAY'
I also tried array(MAP(varchar)), which didn't work either, with this error -
Unknown type 'array(MAP(varchar))' for column metadata
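The failing attempts suggest the map type is missing its value type: in Presto a map always takes both a key and a value type. A minimal sketch of the declaration under that assumption (untested against this exact bucket):

CREATE TABLE IF NOT EXISTS schema.test
(
metadata array(map(varchar, varchar))
)
WITH (format = 'JSON', external_location = 's3://test/');

-- individual fields can then be pulled out of each map, e.g.:
SELECT element_at(metadata[1], 'service') FROM schema.test;

If the set of fields is fixed, declaring metadata as an array of a row type with the field names from the sample is another option.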

How can I convert a string to JSON in Snowflake?

I have this string {id: evt_1jopsdgqxhp78yqp7pujesee, created: 2021-08-14t16:38:17z} and would like to convert it to JSON. I tried parse_json but got an error; to_variant just converted it to the string "{id: evt_1jopsdgqxhp78yqp7pujesee, created: 2021-08-14t16:38:17z}".
To Gokhan & Simon's point, the original data isn't valid JSON.
If you're 100% (1000%) certain it'll "ALWAYS" come in that shape, you can treat it as a string-parsing exercise and do something like this, but the moment someone changes the format a bit it'll have an issue.
create temporary table abc (str varchar);
insert into abc values ('{id: evt_1jopsdgqxhp78yqp7pujesee, created: 2021-08-14t16:38:17z}');

-- split the string on the commas, then on ': ' to get name/value pairs,
-- and rebuild a double-quoted JSON string that parse_json will accept
select to_json(parse_json(json_str)) as json_json
from (
    select split_part(ltrim(str, '{'), ',', 1) as part_a,
           split_part(rtrim(str, '}'), ',', 2) as part_b,
           split_part(trim(part_a), ': ', 1) as part_a_name,
           split_part(trim(part_a), ': ', 2) as part_a_val,
           split_part(trim(part_b), ': ', 1) as part_b_name,
           split_part(trim(part_b), ': ', 2) as part_b_val,
           '{"'||part_a_name||'":"'||part_a_val||'", "'||part_b_name||'":"'||part_b_val||'"}' as json_str
    from abc
);
which returns valid JSON:
{"created":"2021-08-14t16:38:17z","id":"evt_1jopsdgqxhp78yqp7pujesee"}
Overall this is very fragile, but if you must do it, feel free to.
Your JSON is not valid, as you can verify with any online tool:
https://jsonlint.com/
This is a valid JSON version of your data:
{
"id": "evt_1jopsdgqxhp78yqp7pujesee",
"created": "2021-08-14t16:38:17z"
}
And you can parse it successfully using parse_json:
select parse_json('{ "id": "evt_1jopsdgqxhp78yqp7pujesee", "created": "2021-08-14t16:38:17z"}');
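If rows like this can show up mixed in with valid JSON, TRY_PARSE_JSON is a safer variant: it behaves like parse_json but returns NULL instead of raising an error when the input cannot be parsed. A small sketch against the abc table from above:

select str, try_parse_json(str) as parsed
from abc;  -- parsed is NULL for the invalid single-quote/unquoted-key rows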

How to query json column using databricks sql?

Query is
SELECT *
FROM companies c
where c.urls ->> 'Website' = '';
Here is the companies table.
urls
{'Website': '', 'Twitter': ''}
{'Website': 'www.google.com', 'Twitter': ''}
I'm querying the companies table for rows whose urls column has Website as an empty string. However, this query errors out in Databricks SQL with:
mismatched input '->' expecting {<EOF>, ';'}(line 3, pos 13)
Does anyone know how to query the json column in databricks sql?
Take a look at the following page from the Databricks documentation: Query semi-structured data in SQL.
If the content of the column is JSON as a string, then you can make use of this syntax: <column-name>:<extraction-path>. For example:
select * from companies c
where c.urls:Website = ''
If the content of the column is a struct, then you can use dot syntax instead: <column-name>.<nested-field>:
SELECT *
FROM companies c
where c.urls.Website = '';
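Another option for the string-typed case is the Spark SQL function get_json_object, sketched below; note this assumes the stored text is valid JSON (double-quoted keys), so the single-quoted sample above would need cleaning first:

SELECT *
FROM companies c
WHERE get_json_object(c.urls, '$.Website') = '';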

Load Teradata table from excel using tpt loader

I have developed a Python script that uses the pandas module to write an Excel file.
When I execute print(df1.columns), I get the dtype as 'object'.
Using the same Excel file to load a Teradata table via a TPT script, I get the error below:
FILE_READER[1]: TPT19108 Data Format 'DELIMITED' requires all 'VARCHAR/JSON/JSON BY NAME/CLOB BY NAME/BLOB BY NAME/XML BY NAME/XML/CLOB' schema.
The schema definition used in the TPT script:
DEFINE SCHEMA Teradata__DATA
DESCRIPTION 'SCHEMA OF Teradata data'
(
Issue_Key VARCHAR(255),
Log_Date VARDATE(10) FORMATIN ('YYYY-MM-DD') FORMATOUT ('YYYY-MM-DD'),
User_Name VARCHAR(255),
Time_Spent NUMBER(10,2)
);
Please help in resolving the failure message. The error might be caused by a mismatched data type, or by the delimiter being defined as TAB. Please suggest if any other reason could be causing this failure.
CODE
df = pd.read_excel('Time_Log_Source_2019-05-30.xlsx', sheet_name='Sheet1', dtype=str)
print("Column headings:")
print(df.columns)
df = pd.DataFrame(df,columns=['Issue Key', 'Log Date', 'User', 'Time Spent(Sec)'])
df['Log Date'] = df['Log Date'].str[:10]
df['Time Spent(Sec)'] = df['Time Spent(Sec)'].astype(int)/3600
print(df)
df.to_excel("Time_Log_Source_2019-05-30_output.xlsx")
df1 = pd.read_excel('Time_Log_Source_2019-05-30_output.xlsx', sheet_name='Sheet1',dtype=str)
df1['Issue Key'] = df1['Issue Key'].astype('str')
df1['Log Date'] = df1['Log Date'].astype('str')
df1['User'] = df1['User'].astype('str')
df1['Time Spent(Sec)'] = df1['Time Spent(Sec)'].astype('str')
df1.to_excel("Time_Log_Source_2019-05-30_output.xlsx",startrow=0, startcol=0, index=False)
print(type(df1['Time Spent(Sec)']))
print(df.columns)
print(df1.columns)
Result
Index([u'Issue Key', u'Log Date', u'User', u'Time Spent(Sec)'], dtype='object')
Index([u'Issue Key', u'Log Date', u'User', u'Time Spent(Sec)'], dtype='object')
A TPT Schema describes fields in client-side records, not columns in the database table. You would need to change the schema to say the (input) Time_Spent is VARCHAR.
But TPT does not natively read .xlsx files. Consider using to_csv instead of to_excel.
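Following that advice, a sketch of the adjusted schema (the VARCHAR(20) width for Time_Spent is an assumption; pick whatever safely fits your data):

DEFINE SCHEMA Teradata__DATA
DESCRIPTION 'SCHEMA OF Teradata data'
(
Issue_Key VARCHAR(255),
Log_Date VARDATE(10) FORMATIN ('YYYY-MM-DD') FORMATOUT ('YYYY-MM-DD'),
User_Name VARCHAR(255),
Time_Spent VARCHAR(20) /* declared as character data on the client side; Teradata converts it on insert */
);

On the pandas side, df1.to_csv('Time_Log_Source_2019-05-30_output.txt', sep='\t', index=False) would produce a TAB-delimited file matching a DELIMITED TPT job, instead of the .xlsx that TPT cannot read.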

SQL server query on json string for stats

I have this SQL Server database that holds contest participations. In the Participation table, I have various fields and a special one called ParticipationDetails; it's a varchar(MAX). This field is used to throw in all contest-specific data in JSON format. Example rows:
Id,ParticipationDetails
1,"{'Phone evening': '6546546541', 'Store': 'StoreABC', 'Math': '2', 'Age': '01/01/1951'}"
2,"{'Phone evening': '6546546542', 'Store': 'StoreABC', 'Math': '2', 'Age': '01/01/1952'}"
3,"{'Phone evening': '6546546543', 'Store': 'StoreXYZ', 'Math': '2', 'Age': '01/01/1953'}"
4,"{'Phone evening': '6546546544', 'Store': 'StoreABC', 'Math': '3', 'Age': '01/01/1954'}"
I'm trying to get a query running that will yield this result:
Store, Count
StoreABC, 3
StoreXYZ, 1
I used to run this query:
SELECT TOP (20) ParticipationDetails, COUNT(*) Count FROM Participation GROUP BY ParticipationDetails ORDER BY Count DESC
This works as long as I want unique ParticipationDetails. How can I change this to "sub-query" into my JSON strings? I've gotten to this query, but I'm kind of stuck here:
SELECT 'StoreABC' Store, Count(*) Count FROM Participation WHERE ParticipationDetails LIKE '%StoreABC%'
This query gets me the results I want for a specific store, but I want the store value to be "anything that was put in there".
Thanks for the help!
First of all, I suggest avoiding any JSON handling in T-SQL, since it is not natively supported. If you have an application layer, let it manage that kind of formatted data (e.g. the .NET Framework and non-MS frameworks have JSON serializers available).
However, you can convert your JSON strings using the function described in this link.
You can also write your own query which works with strings. Something like the following one:
SELECT
    T.Store,
    COUNT(*) AS [Count]
FROM
(
    SELECT
        -- the inner STUFF deletes everything up to and including '"Store": "';
        -- the outer STUFF deletes everything from the value's closing quote
        -- (3 characters before '"Math"') to the end, leaving just the store name
        STUFF(
            STUFF(ParticipationDetails, 1, CHARINDEX('"Store"', ParticipationDetails) + 9, ''),
            CHARINDEX('"Math"',
                STUFF(ParticipationDetails, 1, CHARINDEX('"Store"', ParticipationDetails) + 9, '')) - 3,
            LEN(STUFF(ParticipationDetails, 1, CHARINDEX('"Store"', ParticipationDetails) + 9, '')),
            '') AS Store
    FROM
        Participation
) AS T
GROUP BY
    T.Store
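On SQL Server 2016 or later, the built-in JSON functions make this much simpler; a sketch assuming the details are stored as valid (double-quoted) JSON:

SELECT
    JSON_VALUE(ParticipationDetails, '$.Store') AS Store,
    COUNT(*) AS [Count]
FROM Participation
GROUP BY JSON_VALUE(ParticipationDetails, '$.Store')
ORDER BY [Count] DESC;

The single-quoted sample rows above are not valid JSON, so they would need to be stored with double quotes (or cleaned up with REPLACE) first.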