I am trying to convert SQL code into PySpark SQL.
While selecting columns from a table, the SELECT statement has something like the below:
Select a.`(column1|column2|column3)?+.+`,trim(column c) from Table a;
I would like to understand what the expression
a.`(column1|column2|column3)?+.+`
resolves to and what it actually implies. How do I address this while converting the SQL into PySpark?
That is a way of selecting column names using a regular expression. The possessive optional group (column1|column2|column3)?+ followed by .+ means the pattern matches every column name except column1, column2 and column3, so those three columns are excluded from the result.
It is Spark's equivalent of Hive's Quoted Identifiers. See also Spark's documentation.
Be aware that, to enable this behavior, you first need to run the following command:
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
We have to migrate from Greenplum to Hive. Please help me with the statements below:
1. ('41405'||lpad(hex_to_int(lac),5,'0')||lpad(hex_to_int(ci),5,'0'))
2. lpad(hex_to_int(lac),5,'0')
To convert hex to int, use the conv(str, 16, 10) function.
Newer versions of Hive support the || concatenation operator; in older versions use concat() for concatenation.
The lpad function already exists in Hive.
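Putting the pieces together, a sketch of the translated expressions in Hive (my_table is a hypothetical table name; lac and ci are assumed to hold hex strings):

-- 1. the full concatenated identifier
SELECT concat('41405',
              lpad(conv(lac, 16, 10), 5, '0'),
              lpad(conv(ci, 16, 10), 5, '0'))
FROM my_table;

-- 2. a single padded conversion
SELECT lpad(conv(lac, 16, 10), 5, '0') FROM my_table;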
I am using the string function PATINDEX in SQL to find the index of a particular pattern match in a string. I have a column named volume in a table which contains values such as 0.75L, 1.0L, 0.375L.
When I execute the SQL query below:
select PATINDEX('%L%',VOLUME) from volume_details
the output of the query is 5, 4, 6.
But how can I achieve the same in a Hive query, since PATINDEX is not currently supported?
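For this simple pattern, Hive's built-in instr() (or locate()) should be enough; a sketch against the same volume_details table:

-- returns 5, 4 and 6 for '0.75L', '1.0L' and '0.375L'
SELECT instr(volume, 'L') FROM volume_details;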
In Oracle you can take a sample of a dataset like so:
select * from table sample(10);
What's the equivalent to this in HBase?
No, there is no equivalent to Oracle's SAMPLE syntax in HBase.
I have a requirement to construct a SQL statement with a WHERE clause that is expected to use entries read from a file.
SELECT DISTINCT '/', t_05.puid, ',', t_01.puid, '/', t_01.poriginal_file_name
FROM PWORKSPACEOBJECT t_02, PREF_LIST_0 t_03, PPOM_APPLICATION_OBJECT t_04, PDATASET t_05, PIMANFILE t_01
WHERE t_03.pvalu_0 = t_01.puid
  AND t_02.puid = t_03.puid
  AND t_03.puid = t_04.puid
  AND t_04.puid = t_05.puid
  AND t_02.puid IN ('izeVNXjf44e$yB', 'gWYRvN9044e$yB');
The above is the SQL query. As you can see, the IN clause has two strings (puids) to be considered. In my case, however, the list is about 50k entries long; it comes from Splunk and will be in a text file.
A sample of the text file looks as below:
'gWYRvN9044e$yB',
'DOZVpdOQ44e$yB',
'TlfVpdOQ44e$yB',
'wOWRehUc44e$yB',
'wyeRehUc44e$yB',
'w6URehUc44e$yB',
'wScRehUc44e$yB',
'yzXVNXjf44e$yB',
'guWRvN9044e$yB',
'QiYRehUc44e$yB',
'gycRvN9044e$yB'
I am not a SQL guru, but a quick Google search pointed me to the OPENROWSET construct, which is not available in Oracle.
Can you please suggest some pointers on how to work around this problem?
Thanks,
Pavan.
Consider using an external table, SQL*Loader, or perhaps loading the file into a table in the application layer and querying it normally.
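A sketch of the external table route, assuming the file is called puids.txt and sits in a directory the database server can read (all names here are hypothetical):

CREATE DIRECTORY puid_dir AS '/path/to/splunk/output';

CREATE TABLE puid_ext (
  puid VARCHAR2(30)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY puid_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
    OPTIONALLY ENCLOSED BY "'"
  )
  LOCATION ('puids.txt')
);

-- then replace the literal IN list with a subquery:
-- ... AND t_02.puid IN (SELECT puid FROM puid_ext)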
I would recommend creating a Global Temporary table, adding the rows to that table, and then joining to your temp table.
How to create a temporary table in Oracle
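A minimal sketch of that approach (puid_tmp is a hypothetical name):

CREATE GLOBAL TEMPORARY TABLE puid_tmp (
  puid VARCHAR2(30)
) ON COMMIT PRESERVE ROWS;

-- load the 50k puids (e.g. batch INSERTs from the application layer), then:
-- ... AND t_02.puid IN (SELECT puid FROM puid_tmp)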
Other options:
You could also use pipelined functions:
https://oracle-base.com/articles/misc/pipelined-table-functions
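A bare-bones sketch of that idea, reading the file with UTL_FILE (the PUID_DIR directory and the file name are assumptions):

CREATE OR REPLACE TYPE puid_tab AS TABLE OF VARCHAR2(30);
/
CREATE OR REPLACE FUNCTION puid_rows RETURN puid_tab PIPELINED IS
  f    UTL_FILE.FILE_TYPE;
  line VARCHAR2(200);
BEGIN
  f := UTL_FILE.FOPEN('PUID_DIR', 'puids.txt', 'r');
  LOOP
    BEGIN
      UTL_FILE.GET_LINE(f, line);
    EXCEPTION
      WHEN NO_DATA_FOUND THEN EXIT;  -- end of file
    END;
    -- strip the trailing comma and surrounding quotes from each entry
    PIPE ROW (TRIM(BOTH '''' FROM RTRIM(TRIM(line), ',')));
  END LOOP;
  UTL_FILE.FCLOSE(f);
  RETURN;
END;
/
-- usage: ... AND t_02.puid IN (SELECT column_value FROM TABLE(puid_rows()))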
Or use a WITH ... AS construct to fold the data into the SQL itself. But that would create a very long SQL statement.
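For completeness, a sketch of the WITH route; with 50k entries the UNION ALL list gets unwieldy fast:

WITH puids AS (
  SELECT 'gWYRvN9044e$yB' AS puid FROM dual UNION ALL
  SELECT 'DOZVpdOQ44e$yB' FROM dual
  -- ... one SELECT per entry, 50k times over
)
-- ... AND t_02.puid IN (SELECT puid FROM puids)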