Amazon S3 has a new feature called "Select from", which lets you run simple SQL queries against simple data files such as CSV or JSON. So I thought I'd try it.
I created and uploaded the following CSV to my S3 bucket in Oregon (I consider this file to be extremely simple):
aaa,bbb,ccc
111,111,111
222,222,222
333,333,333
I indicated this was CSV with a header row and issued the following SQL:
select * from s3object s
...which worked as expected, returning:
111,111,111
222,222,222
333,333,333
Then I tried one of the provided sample queries, which failed:
select s._1, s._2 from s3object s
...the error message was "Some headers in the query are missing from the file. Please check the file and try again."
I also tried the following, each time receiving the same error:
select aaa from s3object s
select s.aaa from s3object s
select * from s3object s where aaa = 111
select * from s3object s where s.aaa = 111
select * from s3object s where s._1 = 111
So any time my query references a column, either by name or by number, in either the SELECT or the WHERE clause, I get the "headers in the query are missing" error. The AWS documentation provides no follow-up information on this error.
So my question is, what's wrong? Is there an undocumented requirement about the column headers? Is there an undocumented way to reference columns? Does the "Select From" feature have a bug in it?
I did the following:
Created a file with the contents you show above
Entered S3 Select on the file, and ticked File has header row
Changed no other settings
These queries did NOT work:
select s._1, s._2 from s3object s
select * from s3object s where s._1 = 111
The reason they didn't work is that the file contains headers, so the columns have actual names.
These queries DID work:
select aaa from s3object s
select s.aaa from s3object s
select * from s3object s where aaa = 111 (Gave empty result)
select * from s3object s where s.aaa = 111 (Gave empty result)
When I treated the last two queries as strings, they returned the row as expected:
select * from s3object s where aaa = '111'
select * from s3object s where s.aaa = '111'
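For anyone running the same test outside the console, here is a minimal sketch in Python of how the header setting decides whether columns are referenced by name or by position. It assumes boto3's `select_object_content` call on the S3 client. The bucket and key names are placeholders, and `build_select_request` and `run_select` are hypothetical helper names, not part of any library.

```python
def build_select_request(bucket, key, expression, header_info="USE"):
    """Build keyword arguments for boto3's S3 select_object_content call.

    header_info values for CSV input:
      "USE"    - first line is a header; reference columns by name (s.aaa)
      "IGNORE" - first line is a header but is skipped; reference by position (s._1)
      "NONE"   - first line is data; reference by position (s._1)
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {"CSV": {"FileHeaderInfo": header_info}},
        "OutputSerialization": {"CSV": {}},
    }


def run_select(request):
    """Run the query and return the result as text (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3 installed

    response = boto3.client("s3").select_object_content(**request)
    chunks = [
        event["Records"]["Payload"]
        for event in response["Payload"]
        if "Records" in event
    ]
    return b"".join(chunks).decode("utf-8")


# Placeholder bucket and key; the literal is quoted because CSV values are strings.
by_name = build_select_request(
    "my-bucket", "test.csv", "select * from s3object s where s.aaa = '111'"
)
# Same file, but addressing columns by position requires ignoring the header row.
by_position = build_select_request(
    "my-bucket", "test.csv", "select s._1, s._2 from s3object s", header_info="IGNORE"
)
```

Calling `run_select(by_name)` would then issue the query. The point of the sketch is only that the `s._1` style and the `s.aaa` style belong to different header settings.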
Getting back to this, on a whim I decided to replace this sample file with a new identical example file, and now I do not encounter the problem. In fact, I'm unable to replicate the problem that I originally posted.
I have a few theories: character encoding, end-of-line character, and the possible presence of an extra line in my original file, but I have been unable to re-create the original issue.
I've tried different editors to create the source file, I've tried unix vs windows end of line characters, I've tried extra line on the end, I've tried upper case vs lower case column headers, and I've tried different regions. Everything works now, so I'm completely mystified as to why it did not work in the first place.
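If someone does hit this again, a quick way to check the theories above is to look at the raw bytes of the file before uploading it. The following is a minimal sketch in Python; `describe_csv_bytes` is a hypothetical helper, and nothing here establishes which of these properties (if any) caused the original error.

```python
import codecs


def describe_csv_bytes(data: bytes) -> dict:
    """Report byte-level properties that editors can change without showing it."""
    stripped = data.rstrip(b"\r\n")
    trailing_newlines = data[len(stripped):].count(b"\n")
    return {
        "utf8_bom": data.startswith(codecs.BOM_UTF8),
        "utf16_bom": data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)),
        "crlf_line_endings": b"\r\n" in data,
        "ends_with_newline": data.endswith(b"\n"),
        # One newline terminates the last row; any more are blank lines.
        "trailing_blank_lines": max(trailing_newlines - 1, 0),
        "non_ascii_bytes": any(byte > 127 for byte in data),
        "header_bytes": data.splitlines()[0] if data else b"",
    }


# Example: a file saved with a UTF-8 BOM, Windows line endings and one extra blank line.
sample = b"\xef\xbb\xbfaaa,bbb,ccc\r\n111,111,111\r\n\r\n"
report = describe_csv_bytes(sample)
```

For a real file, pass in the result of `open(path, "rb").read()`. A byte-order mark shows up in `header_bytes` as extra bytes in front of the first column name.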
Life goes on. Thanks to everyone for your efforts.
S3 Select treats everything as a string. The queries
select * from s3object s where cast(aaa as int) = 111
select * from s3object s where cast(s.aaa as int) = 111
should return the expected results, provided the header row option is checked or unchecked appropriately.
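The same string-versus-number behaviour can be seen in plain Python. This is only an analogy, assuming the sample file from the question: it does not call S3 Select, but `csv.DictReader` also hands back every field as a string, so the three comparisons mirror the unquoted, quoted and CAST forms of the query.

```python
import csv
import io

CSV_TEXT = "aaa,bbb,ccc\n111,111,111\n222,222,222\n333,333,333\n"

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# A string compared with a number never matches: empty result.
as_number = [row for row in rows if row["aaa"] == 111]

# A string compared with a string matches the first data row.
as_string = [row for row in rows if row["aaa"] == "111"]

# Casting first, in the spirit of cast(aaa as int) = 111, also matches.
with_cast = [row for row in rows if int(row["aaa"]) == 111]
```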
I have two SQL queries running on parquet-arrow:
`table` has 50 columns
sql1 = `select * from table`, total_data_size = 45GB
sql2 = `select value from table`, total_data_size = 30GB
I added profiling for I/O throughput (yes, I drop the page cache and only watch disk I/O).
I found:
Parquet on HDFS: sql2 is about 1.5 times faster than sql1, which is reasonable.
Parquet on local disk (1MB randread=130MB; 1MB read=250MB): sql1 is about 4 times faster than sql2, which is confusing.
From iostat, I can think of two reasons:
The I/O load is high (about 100~130MB/s, util=90%~100%) when executing sql2, which seems to mean that selecting one column causes more random reads and lowers the I/O throughput.
select * populates more of the page cache, so the hit ratio within the process is high even though I drop the page cache before executing. So for select *, the I/O throughput actually benefits from the cache hit ratio.
Any help is appreciated, thanks!
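A back-of-envelope estimate helps separate the two explanations. The sketch below uses only the figures quoted above and makes loud assumptions: that the disk figures are in MB/s, that select * reads purely sequentially at 250 MB/s, that the single-column query reads purely in random 1MB chunks at 130 MB/s, and that 1 GB = 1000 MB. `scan_seconds` is a hypothetical helper.

```python
def scan_seconds(size_gb, throughput_mb_s):
    """Time to read size_gb from disk at a steady throughput, with no cache hits."""
    return size_gb * 1000 / throughput_mb_s


sql1_sequential = scan_seconds(45, 250)  # select *: 45GB, assumed sequential
sql2_random = scan_seconds(30, 130)      # select value: 30GB, assumed random 1MB reads

# How much slower sql2 would be than sql1 from the disk access pattern alone.
ratio = sql2_random / sql1_sequential
```

Under these assumptions the access pattern alone makes sql2 roughly 1.3 times slower, which is well short of the observed 4 times, so something else (such as the page cache) would have to account for the rest.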
I used cachestat to measure the page-cache hit ratio and found that select * has a higher ratio (50%) than selecting one column (27%), so the I/O throughput of select * is better because of the page cache.
I tried opening the Parquet file with O_DIRECT to confirm this conclusion, but it reports errno: 22, strerror: Invalid argument. I haven't found the root cause of that error, but I think the page-cache hit ratio is the root cause of the I/O throughput difference.
However, why does select * have a higher hit ratio?
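On the errno 22 from O_DIRECT: on Linux, one common cause is that the buffer address, the file offset or the read length is not aligned to the block size the device requires; another is a filesystem that does not support O_DIRECT at all. The post does not say which applies here. The sketch below, in Python, shows one way to do an aligned direct read. It assumes Linux and an alignment of 4096 bytes (the real requirement depends on the device and filesystem); `aligned_window` and `read_direct` are hypothetical helper names.

```python
import mmap
import os

ALIGN = 4096  # assumed alignment; the real requirement depends on device and filesystem


def aligned_window(offset, length, align=ALIGN):
    """Widen [offset, offset + length) so start and size are multiples of align."""
    start = offset - offset % align
    end = -(-(offset + length) // align) * align  # round up to the next boundary
    return start, end - start


def read_direct(path, offset, length, align=ALIGN):
    """Read bytes with O_DIRECT (Linux only), bypassing the page cache."""
    start, size = aligned_window(offset, length, align)
    # Anonymous mmap memory is page-aligned, which satisfies the buffer
    # alignment as long as align does not exceed the page size.
    buf = mmap.mmap(-1, size)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        os.lseek(fd, start, os.SEEK_SET)
        got = os.readv(fd, [buf])  # number of bytes actually read
    finally:
        os.close(fd)
    skip = offset - start
    data = buf[skip:min(skip + length, got)]
    buf.close()
    return data


# The window for a 100-byte read at offset 5000 is the whole second 4096-byte block.
window = aligned_window(5000, 100)
```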
I have written this EXCEPT query in a Databricks notebook to get the records that differ between two Hive tables. (I am trying to get the same result as in MS SQL, i.e. only the differences in the result set.)
select PreqinContactID,PreqinContactName,PreqinPersonTitle,EMail,City
from preqin_7dec.PreqinContact where filename='InvestorContactPD.csv'
except
select CONTACT_ID,NAME,JOB_TITLE,EMAIL,CITY
from preqinct.InvestorContactPD where contact_id in (
select PreqinContactID from preqin_7dec.PreqinContact
where filename='InvestorContactPD.csv')
But the result set returned also contains matching records. The record I have shown above appears in the result set, but when I checked it separately by contact_id it is the same in both tables, so I am not sure why EXCEPT also returns the matching record.
I just want to know how to use EXCEPT, or any other difference-finding command, in a Databricks notebook using SQL.
I want to see nothing in the result set if the source and target data are the same.
EXCEPT works perfectly well in Databricks as this simple test will show:
val df = Seq((3445256, "Avinash Singh", "Chief Manager", "asingh@gmail.com", "Mumbai"))
.toDF("contact_id", "name", "job_title", "email", "city")
// Save the dataframe to a temp view
df.createOrReplaceTempView("tmp")
df.show
The SQL test:
%sql
SELECT *
FROM tmp
EXCEPT
SELECT *
FROM tmp;
This query will yield no results. Is it possible that you have some leading or trailing spaces, for example? String comparisons in Spark are also case-sensitive, so that could be causing your issue. Try a case-insensitive test by applying the LOWER function to all columns.
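To see why whitespace or casing would make identical-looking rows survive an EXCEPT, here is a small sketch in plain Python, treating EXCEPT as a set difference over whole rows. It is an analogy rather than Spark code, and the rows are made up for illustration; `normalize` is a hypothetical helper playing the role of TRIM plus LOWER on every string column.

```python
def normalize(row):
    """Trim and lower-case every string value, leaving other values untouched."""
    return tuple(
        value.strip().lower() if isinstance(value, str) else value
        for value in row
    )


# Made-up rows: the target differs only by casing and a trailing space.
source = {(3445256, "Avinash Singh", "Mumbai")}
target = {(3445256, "avinash singh ", "Mumbai")}

# Comparing the raw values, the row looks different and is returned.
raw_diff = source - target

# After normalizing both sides, the difference is empty.
clean_diff = {normalize(row) for row in source} - {normalize(row) for row in target}
```

If the normalized comparison comes back empty while the raw one does not, the mismatch is in formatting rather than in the data itself.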
I have a following table:
EstimatedCurrentRevenue -- Revenue column value of yesterday
EstimatedPreviousRevenue --- Revenue column value of current day
crmId
OwnerId
PercentageChange.
I am querying two snapshots of similarly structured data in Azure Data Lake and trying to calculate the percentage change in revenue.
Following is my query. I am trying to join on OpportunityId to get the difference between the revenue values:
@opportunityRevenueData = SELECT (((opty.EstimatedCurrentRevenue - optyPrevious.EstimatedPreviousRevenue)*100)/opty.EstimatedCurrentRevenue) AS PercentageRevenueChange, optyPrevious.EstimatedPreviousRevenue,
opty.EstimatedCurrentRevenue, opty.crmId, opty.OwnerId From @opportunityCurrentData AS opty JOIN @opportunityPreviousData AS optyPrevious on opty.OpportunityId == optyPrevious.OpportunityId;
But I get the following error:
E_CSC_USER_SYNTAXERROR: syntax error. Expected one of: AS EXCEPT FROM
GROUP HAVING INTERSECT OPTION ORDER OUTER UNION UNION WHERE ';' ')'
','
at token 'From', line 40
near the ###:
I know the problem is in this expression, but I am not sure how to fix it:
(((opty.EstimatedCurrentRevenue - optyPrevious.EstimatedPreviousRevenue)*100)/opty.EstimatedCurrentRevenue)
Please help, I am completely new to U-SQL.
U-SQL is case-sensitive, with all SQL reserved words in UPPER CASE. So you should capitalise the FROM and ON keywords in your statement, like this:
@opportunityRevenueData =
    SELECT (((opty.EstimatedCurrentRevenue - optyPrevious.EstimatedPreviousRevenue) * 100) / opty.EstimatedCurrentRevenue) AS PercentageRevenueChange,
           optyPrevious.EstimatedPreviousRevenue,
           opty.EstimatedCurrentRevenue,
           opty.crmId,
           opty.OwnerId
    FROM @opportunityCurrentData AS opty
         JOIN
             @opportunityPreviousData AS optyPrevious
         ON opty.OpportunityId == optyPrevious.OpportunityId;
Also, if you are completely new to U-SQL, you should consider working through some tutorials to establish the basics of the language, including case-sensitivity. Start at http://usql.io/.
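Because the compiler's message does not name the casing problem directly, a rough pre-check can help spot it. The following Python sketch flags words that match a reserved word but are not fully upper case. The keyword list is a small, hypothetical subset chosen for illustration, `miscased_keywords` is a made-up helper, and the check is naive: it does not understand string literals, comments or C# expressions, so treat its hits as hints only.

```python
import re

# Hypothetical, incomplete subset of reserved words, for illustration only.
KEYWORDS = {
    "SELECT", "FROM", "JOIN", "ON", "AS", "WHERE", "GROUP", "BY",
    "ORDER", "HAVING", "UNION", "EXCEPT", "INTERSECT", "OUTER",
}

WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def miscased_keywords(script):
    """Return (line, column, word) for each keyword not written in upper case."""
    hits = []
    for line_number, line in enumerate(script.splitlines(), start=1):
        for match in WORD.finditer(line):
            word = match.group(0)
            if word.upper() in KEYWORDS and word != word.upper():
                hits.append((line_number, match.start() + 1, word))
    return hits


# A shortened statement with the same two casing mistakes as the question.
broken = "@r = SELECT a.x From @t AS a JOIN @u AS b on a.id == b.id;"
fixed = "@r = SELECT a.x FROM @t AS a JOIN @u AS b ON a.id == b.id;"

broken_hits = miscased_keywords(broken)
fixed_hits = miscased_keywords(fixed)
```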
This same crazy-sounding error message can occur for (almost?) any U-SQL syntax error. The answer above was clearly correct for the provided code.
However, since many folks will probably get to this page from a search for 'AS EXCEPT FROM GROUP HAVING INTERSECT OPTION ORDER OUTER UNION UNION WHERE', I'd say the best advice for handling these is to look closely at the snippet of your code that the error message has marked with '###'.
For example I got to this page upon getting a syntax error for a long query and it turned out I didn't have a casing issue, but just a malformed query with parens around the wrong thing. Once I looked more closely at where in the snippet the ### symbol was, the error became clear.
Getting the following error when copying an input file into an empty db table. The input file only has 56732 rows, however I am getting an error on row 56733:
continue
* * * * * * * * * *
copy table temptable
(
abc = c(3),
bcao = c(1),
cba = c(10),
test = c(1)nl
)
from 'tempfile'
Executing . . .
E_CO0024 COPY: Unexpected END OF FILE while processing row 56733.
E_CO002A COPY: Copy has been aborted.
Does anyone have any idea why it's trying to process an extra row? I have four other files in exactly the same format with different data, and they process fine.
Have no idea why this is happening...
The most likely cause is that you have some spaces or similar after your final row of data. You have set a new line as a delimiter on test, so the file needs to end with a new line. Delete anything after your data which isn't a blank new line.
As an example, using the code below:
DECLARE GLOBAL TEMPORARY TABLE test (
v int
) ON COMMIT PRESERVE ROWS WITH NORECOVERY;
COPY test (
v = c(5)nl
) FROM 'J:\test.csv';
Will result in an error on line 4 for the following data:
34565
37457
35764
45685
and error on line 5 for this data (punctuation used to show issue, but it is probably a space or tab in your own file):
34565
37457
35764
45685
.
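A quick way to find such stray content before running the COPY is to check the file's bytes directly. The sketch below, in Python, assumes Unix newlines and a fixed record width taken from the COPY format (5 for `c(5)nl` in the example above, or 3 + 1 + 10 + 1 = 15 for the format in the question); `find_bad_rows` is a hypothetical helper, not an Ingres tool.

```python
def find_bad_rows(data: bytes, width: int):
    """Return (row number, problem) pairs for a fixed-width, newline-terminated file."""
    problems = []
    records = data.split(b"\n")
    # Whatever follows the last newline; empty when the file is well formed.
    tail = records.pop()
    for row_number, record in enumerate(records, start=1):
        if len(record) != width:
            problems.append(
                (row_number, f"expected {width} bytes, found {len(record)}")
            )
    if tail:
        problems.append(
            (len(records) + 1, f"{len(tail)} byte(s) after the last newline")
        )
    return problems


# The two data sets from the example above.
no_final_newline = find_bad_rows(b"34565\n37457\n35764\n45685", 5)
stray_character = find_bad_rows(b"34565\n37457\n35764\n45685\n.", 5)
well_formed = find_bad_rows(b"34565\n37457\n35764\n45685\n", 5)
```

For a real file, pass in `open(path, "rb").read()`. The row numbers reported for the two broken samples line up with the rows named in the errors above.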
I'm working on an IBM iSeries v6r1m0 system.
I'm trying to execute a very simple query:
select * from XG.ART where DOS = 998 and (DES like 'ALB%' or DESABR like 'ALB%')
The columns are:
DOS -> numeric (3,0)
DES -> Graphic(80) CCSID 1200
DESABR -> Graphic(25) CCSID 1200
I get :
SQL State : 58004
SQL Code : -901
Message : [SQL0901] SQL System error.
Cause . . . . . : An SQL system error has occurred. The current SQL statement cannot be completed successfully. The error will not prevent other SQL statements from being processed. Previous messages may indicate that there is a problem with the SQL statement and SQL did not correctly diagnose the error. The previous message identifier was CPF4204. Internal error type 3107 has occurred. If precompiling, processing will not continue beyond this statement.
Recovery . . . : See the previous messages to determine if there is a problem with the SQL statement. To view the messages, use the DSPJOBLOG command if running interactively, or the WRKJOB command to view the output of a precompile. An application program receiving this return code may attempt further SQL statements. Correct any errors and try the request again.
If I change DES into REF (graphic(25)), it works...
EDIT :
I ran some tests this afternoon, and the results are very strange:
Just after the creation of the table/indexes, I have no errors.
If I insert some data: error
If I clear the table: error
If I remove an index (see below): it works (with or without data)
!!
The index is :
create index XG.GTFAT_ART_B on XG.ART(
DOS,
DESABR,
ART_ID
)
Edit 2 :
Here is the job log (sorry, it is in French...)
It says:
Function error X'1720' in machine instruction. Internal snapshot ID 01010054
Foo file created in library QTEMP.
*** stuff with the printer
DBOP *** FAILED open. Exception from call to SLIC$
Internal error in the query processor file
Sql system error
I finally contacted IBM.
It was an old bug from v5.
I have installed the latest PTF, and now, it works.
You need to use the GRAPHIC scalar function to convert your character literals on the LIKE predicate.
CREATE TABLE QTEMP/TEST (F1 GRAPHIC(80))
INSERT INTO QTEMP/TEST (F1) VALUES (GRAPHIC('TEST'))
SELECT * FROM QTEMP/TEST WHERE F1 LIKE GRAPHIC('TE%')
I know this guy got his problem fixed with an update. But here is something that worked for me that might work for the next guy here who has the problem.
My problem query had a lot of common table expressions. Most of them did not create tables with a whole lot of records. So if I figured that the maximum number of records a CTE would make was 1000, I added a "Fetch first 9999 rows only" to it. I knew that the CTE couldn't possibly have more rows than that. I guess the query optimizer had less to think about with that added.
If you have that problem and you don't have the option to upgrade or talk to IBM, I hope this helps you.
For other people getting this error: I encountered it on an IBM i Series v7r3 when I tried an UPDATE that retrieves the value to set on a field from an inner SELECT, where multiple results were reduced to one using DISTINCT. I solved the problem by removing DISTINCT and adding FETCH FIRST 1 ROW ONLY at the end of the inner SELECT.
E.g.: changed from
UPDATE MYTABLE AS T1
SET T1.FIELD1 = (
SELECT DISTINCT T2.FIELD5
FROM MYTABLE AS T2
WHERE T1.FIELD2 = T2.FIELD2
AND T1.FIELD3 = T2.FIELD3
)
WHERE T1.FIELD4 = 'XYZ'
to
UPDATE MYTABLE AS T1
SET T1.FIELD1 = (
SELECT T2.FIELD5
FROM MYTABLE AS T2
WHERE T1.FIELD2 = T2.FIELD2
AND T1.FIELD3 = T2.FIELD3
FETCH FIRST 1 ROW ONLY
)
WHERE T1.FIELD4 = 'XYZ'