AWS Athena: Error parsing field value and unexpected query results - amazon-s3

I have the following table schema prepared by AWS glue
When I query the table using SELECT * FROM "vietnam-property-develop"."sell" limit 10;, it throws an error:
HIVE_BAD_DATA: Error parsing field value '{"area":"85
m²","date":"14/01/2020","datetime":"2020-01-18
00:42:28.488576+00:00","address":"Quan Hoa - Cầu Giấy","price":"20
Tỷ","cat":"Bán nhà mặt
phố","lon":"105.7976502","avatar":"","id":"24169794","title":"Chính
chủ cần bán nhà mặt phố nguyễn văn huyên Quan Hoa Cầu Giấy, 2 tầng, dt
85m2. LH 0903233723","lat":"21.0376771","room":"0"}' for field 4:
org.openx.data.jsonserde.json.JSONObject cannot be cast to
java.lang.Double
Then I tired to just query the title column by using SELECT title FROM "vietnam-property-develop"."sell" limit 10;
It returns result which I didn't expect. It seems that the query return the whole json files instead of just the title column. And the number of rows is 4 but not 10 no matter how I modify the query.

Related

BigQuery - No matching signature for function TIMESTAMP_SECONDS on JSON_EXTRACT

Im trying to convert a time that i have in secconds that is stored on a JSON value, and it gives me error of No matching signature for function TIMESTAMP_SECONDS, or in the query results:
Invalid timestamp: '1646985600'
This is the query:
SELECT DATE(TIMESTAMP_SECONDS(JSON_EXTRACT_SCALAR(JSON_EXTRACT(data , '$.fechaCierre'), '$._seconds'))) AS fechaCierre
FROM db Limit 10
When Im querying without the DATE(TIMESTAMP_SECONDS()), like this:
SELECT JSON_EXTRACT_SCALAR(JSON_EXTRACT(data , '$.fechaCierre'), '$._seconds') AS fechaCierre
FROM db Limit 10
Im getting correct the seconds like this: 1646985600
Use below instead
SELECT DATE(TIMESTAMP_SECONDS(CAST(JSON_EXTRACT_SCALAR(JSON_EXTRACT(data , '$.fechaCierre'), '$._seconds') AS INT64))) AS fechaCierre
FROM db Limit 10
The problem was that the result of the previous query was a string, and the TIMESTAMP_SECONDS only receive Int64.
So I change to query like this and it worked:
SELECT date(timestamp_seconds(CAST(JSON_EXTRACT_SCALAR(JSON_EXTRACT(data , '$.fechaCierre'), '$._seconds') AS int64))) AS fechaCierre
FROM db Limit 10

How to resolve this sql error of schema_of_json

I need to find out the schema of a given JSON file, I see sql has schema_of_json function
and something like this works flawlessly
> SELECT schema_of_json('[{"col":0}]');
ARRAY<STRUCT<`col`: BIGINT>>
But if I query for my table name, it gives me the following error
>SELECT schema_of_json(Transaction) as json_data from table_name;
Error in SQL statement: AnalysisException: cannot resolve 'schemaofjson(`Transaction`)' due to data type mismatch: The input json should be a string literal and not null; however, got `Transaction`.; line 1 pos 7;
The Transaction is one of the columns in my table and after checking it manually I can attest that it is of String type(json).
The SQL statement has it to give me the schema of the JSON, how to do it?
after looking further into the documentation that it is clear that the word foldable means that of the static one, and a column from a table JSON won't work
for minimal reroducible example here you go:
SELECT schema_of_json(CAST('{ "a": "b" }' AS STRING))
As soon as the cast is introduced in the above statement, the schema_of_json will fail......... It needs a static JSON as it's input

Writing where query using pyspark on SQL table

I'm querying sql table using pyspark.
If I have a sql table which has two column (value, isDelayed) where "value" is of double type and "isDelayed" has value 0 or 1. How to write a query using pyspark aggregation query which gives sum of "value" when "isDelayed" is 1.
I've already tried below code which is giving an error
def __main__(self, data):
delayedData = data.where(col('isDelayed').cast('int')==='1')
groupByIsDelayed = delayedData.agg(sum(total))
return groupByIsDelayed
I'm getting
"Syntax Error: invalid syntax"
on below line
delayedData = data.where(col('isDelayed').cast('int')==='1')
replace data.where(col('isDelayed').cast('int')==='1') with data.where(col('isDelayed').cast('int') == 1)
2 = only (equal operator in python is 2 = sign)
1 without quote (because you compare a int, not a string)
or
data.where("isDelayed=1")

Facing issue while running select query on DB2

I'm querying a DB2 table (STG_TOOL) with 2 columns - T_L_ID - Integer, Name - VARCHAR(20).
SELECT T_L_ID, Name FROM STG_TOOL;
The query returns answer. However, the below query gives error.
SELECT T_L_ID, RTRIM(Name) FROM STG_TOOL;
This query gives error at 78th row.
DB2 Database Error: ERROR [42815] [IBM][DB2] SQL0171N The data type,
length or value of the argument for the parameter in position "1" of
routine "SYSIBM.RTRIM" is incorrect. Parameter name: "". 1 0
The reason identified is that Name in 78th row has a replacement character - '�'.
Even, the same query with a where clause gives us the error.
SELECT T_L_ID, RTRIM(Name) FROM STG_TOOL WHERE T_L_ID = 78;
The sample date on 78th rows is T_L_ID = 1040 & Name = 'test�'
The above mentioned error re-occurs for the above query.
What does the error implies? How can this be handled/solved?
Adding details to the post:
Version: DSN11010 (version 11)
OS: z/OS
Encoding: Unicode
Toad for DB2 is being used for querying. Toad version - 5.5

Issue to generate dynamic xml with XMLElement

I have a query: (show next below):
SELECT XMLElement("PetitionMarreCSV",
XMLAttributes('http://INT010.GPA.Schemas.External.GPA_AllFile' AS "xmlns"),
XMLForest(EVENT as "Event",INSERTDATE as "InsertDate",USER_ID as "USER_ID",FIRSTNAME as "FirstName",NAME as "Name",EMAIL as "Email",LANGUAGE as "Language",COUNTRY as "COUNTRY",USER_PROFILE_PIC as "USER_PROFILE_PIC",USER_PROFILE_VIDEO as "USER_PROFILE_VIDEO",OPTIN as "OPTIN",USER_WISTIA_VIDEO as "USER_WISTIA_VIDEO",USER_ACEPT as "USER_ACEPT",USER_APROVED as "USER_APROVED",STRET as "STRET",CITY as "City",POSTALCODE as "PostalCode",TELEPHONE as "Telephone",USER_INTERESTS as "USER_INTERESTS",USER_VIDEO_VIEWS as "USER_VIDEO_VIEWS",USER_SORT_ORDER as "USER_SORT_ORDER",DALING_VAN_DE_KOPKRACHT as "DALING_VAN_DE_KOPKRACHT",TOEGANG_TOT_DE_GEZONDHEIDSZORG as "TOEGANG_TOT_DE_GEZONDHEIDSZORG",RISICO_OP_EN_BLACK_OUT as "RISICO_OP_EN_BLACK_OUT",GEPLANDE_VEROUDERING as "GEPLANDE_VEROUDERING",VERHOGING_VERZEKERINGSPRIJZEN as "VERHOGING_VERZEKERINGSPRIJZEN",DURE_INTERNET as "DURE_INTERNET",TREINVERTRAGINGEN as "TREINVERTRAGINGEN",TRAGHEID_VAN_JUSTITIE as "TRAGHEID_VAN_JUSTITIE",LAGE_RENDEMENT_SPAREKENING as "LAGE_RENDEMENT_SPAREKENING",ONGEZONDE_JUNKFOD as "ONGEZONDE_JUNKFOD",MINDER_POSTKANTOREN as "MINDER_POSTKANTOREN",OPLICHTERIJ_EN_BEDROG as "OPLICHTERIJ_EN_BEDROG",VERKOPERS_AN_HUIS as "VERKOPERS_AN_HUIS",FILENAME_SOURCE as "FileName_Source")) AS "RESULT"
FROM (SELECT EVENT,INSERTDATE,USER_ID,FIRSTNAME,NAME,EMAIL,LANGUAGE,COUNTRY,USER_PROFILE_PIC,USER_PROFILE_VIDEO,OPTIN,USER_WISTIA_VIDEO,USER_ACEPT,USER_APROVED,STRET,CITY,POSTALCODE,TELEPHONE,USER_INTERESTS,USER_VIDEO_VIEWS,USER_SORT_ORDER,DALING_VAN_DE_KOPKRACHT,TOEGANG_TOT_DE_GEZONDHEIDSZORG,RISICO_OP_EN_BLACK_OUT,GEPLANDE_VEROUDERING,VERHOGING_VERZEKERINGSPRIJZEN,DURE_INTERNET,TREINVERTRAGINGEN,TRAGHEID_VAN_JUSTITIE,LAGE_RENDEMENT_SPAREKENING,ONGEZONDE_JUNKFOD,MINDER_POSTKANTOREN,OPLICHTERIJ_EN_BEDROG,VERKOPERS_AN_HUIS,FILENAME_SOURCE
FROM ops$kli.GPA22_20150130
GROUP BY EVENT,INSERTDATE,USER_ID,FIRSTNAME,NAME,EMAIL,LANGUAGE,COUNTRY,USER_PROFILE_PIC,USER_PROFILE_VIDEO,OPTIN,USER_WISTIA_VIDEO,USER_ACEPT,USER_APROVED,STRET,CITY,POSTALCODE,TELEPHONE,USER_INTERESTS,USER_VIDEO_VIEWS,USER_SORT_ORDER,DALING_VAN_DE_KOPKRACHT,TOEGANG_TOT_DE_GEZONDHEIDSZORG,RISICO_OP_EN_BLACK_OUT,GEPLANDE_VEROUDERING,VERHOGING_VERZEKERINGSPRIJZEN,DURE_INTERNET,TREINVERTRAGINGEN,TRAGHEID_VAN_JUSTITIE,LAGE_RENDEMENT_SPAREKENING,ONGEZONDE_JUNKFOD,MINDER_POSTKANTOREN,OPLICHTERIJ_EN_BEDROG,VERKOPERS_AN_HUIS,FILENAME_SOURCE)
This query create a list of CLOB value in xml format. But, by some reason that I don't know, sometimes retrieve the next error for some CLOB result instead to show the xml value:
Error: XML document must have a top level element.
The strange thing is, if I retrieve (filter the query) only with the row who has the error instead of multiples rows, the value with the error is not anymore.
SELECT XMLElement("PetitionMarreCSV",
XMLAttributes('http://INT010.GPA.Schemas.External.GPA_AllFile' AS "xmlns"),
XMLForest(EVENT as "Event",INSERTDATE as "InsertDate",USER_ID as "USER_ID",FIRSTNAME as "FirstName",NAME as "Name",EMAIL as "Email",LANGUAGE as "Language",COUNTRY as "COUNTRY",USER_PROFILE_PIC as "USER_PROFILE_PIC",USER_PROFILE_VIDEO as "USER_PROFILE_VIDEO",OPTIN as "OPTIN",USER_WISTIA_VIDEO as "USER_WISTIA_VIDEO",USER_ACEPT as "USER_ACEPT",USER_APROVED as "USER_APROVED",STRET as "STRET",CITY as "City",POSTALCODE as "PostalCode",TELEPHONE as "Telephone",USER_INTERESTS as "USER_INTERESTS",USER_VIDEO_VIEWS as "USER_VIDEO_VIEWS",USER_SORT_ORDER as "USER_SORT_ORDER",DALING_VAN_DE_KOPKRACHT as "DALING_VAN_DE_KOPKRACHT",TOEGANG_TOT_DE_GEZONDHEIDSZORG as "TOEGANG_TOT_DE_GEZONDHEIDSZORG",RISICO_OP_EN_BLACK_OUT as "RISICO_OP_EN_BLACK_OUT",GEPLANDE_VEROUDERING as "GEPLANDE_VEROUDERING",VERHOGING_VERZEKERINGSPRIJZEN as "VERHOGING_VERZEKERINGSPRIJZEN",DURE_INTERNET as "DURE_INTERNET",TREINVERTRAGINGEN as "TREINVERTRAGINGEN",TRAGHEID_VAN_JUSTITIE as "TRAGHEID_VAN_JUSTITIE",LAGE_RENDEMENT_SPAREKENING as "LAGE_RENDEMENT_SPAREKENING",ONGEZONDE_JUNKFOD as "ONGEZONDE_JUNKFOD",MINDER_POSTKANTOREN as "MINDER_POSTKANTOREN",OPLICHTERIJ_EN_BEDROG as "OPLICHTERIJ_EN_BEDROG",VERKOPERS_AN_HUIS as "VERKOPERS_AN_HUIS",FILENAME_SOURCE as "FileName_Source")) AS "RESULT"
FROM (SELECT EVENT,INSERTDATE,USER_ID,FIRSTNAME,NAME,EMAIL,LANGUAGE,COUNTRY,USER_PROFILE_PIC,USER_PROFILE_VIDEO,OPTIN,USER_WISTIA_VIDEO,USER_ACEPT,USER_APROVED,STRET,CITY,POSTALCODE,TELEPHONE,USER_INTERESTS,USER_VIDEO_VIEWS,USER_SORT_ORDER,DALING_VAN_DE_KOPKRACHT,TOEGANG_TOT_DE_GEZONDHEIDSZORG,RISICO_OP_EN_BLACK_OUT,GEPLANDE_VEROUDERING,VERHOGING_VERZEKERINGSPRIJZEN,DURE_INTERNET,TREINVERTRAGINGEN,TRAGHEID_VAN_JUSTITIE,LAGE_RENDEMENT_SPAREKENING,ONGEZONDE_JUNKFOD,MINDER_POSTKANTOREN,OPLICHTERIJ_EN_BEDROG,VERKOPERS_AN_HUIS,FILENAME_SOURCE
FROM ops$kli.GPA22_20150130
WHERE user_id = 293
GROUP BY EVENT,INSERTDATE,USER_ID,FIRSTNAME,NAME,EMAIL,LANGUAGE,COUNTRY,USER_PROFILE_PIC,USER_PROFILE_VIDEO,OPTIN,USER_WISTIA_VIDEO,USER_ACEPT,USER_APROVED,STRET,CITY,POSTALCODE,TELEPHONE,USER_INTERESTS,USER_VIDEO_VIEWS,USER_SORT_ORDER,DALING_VAN_DE_KOPKRACHT,TOEGANG_TOT_DE_GEZONDHEIDSZORG,RISICO_OP_EN_BLACK_OUT,GEPLANDE_VEROUDERING,VERHOGING_VERZEKERINGSPRIJZEN,DURE_INTERNET,TREINVERTRAGINGEN,TRAGHEID_VAN_JUSTITIE,LAGE_RENDEMENT_SPAREKENING,ONGEZONDE_JUNKFOD,MINDER_POSTKANTOREN,OPLICHTERIJ_EN_BEDROG,VERKOPERS_AN_HUIS,FILENAME_SOURCE)
Do you know if there is a problem with this function when you try to generate several CLOB columns?