I'm trying to implement a join in Spark SQL using a LIKE condition.
The column I am joining on is called 'revision', and the values look like this:
Table A:
8NXDPVAE
Table B:
[4,8]NXD_V%
Performing the join on SQL Server (A.revision LIKE B.revision) works just fine, but when doing the same in Spark SQL, the join returns no rows (if using an inner join) or null values for Table B (if using an outer join).
This is the query I am running:
val joined = spark.sql("SELECT A.revision, B.revision FROM RAWDATA A LEFT JOIN TPTYPE B ON A.revision LIKE B.revision")
The plan looks like this:
== Physical Plan ==
BroadcastNestedLoopJoin BuildLeft, LeftOuter, revision#15 LIKE revision#282, false
:- BroadcastExchange IdentityBroadcastMode
: +- *Project [revision#15]
: +- *Scan JDBCRelation(RAWDATA) [revision#15] PushedFilters: [EqualTo(bulk_id,2016092419270100198)], ReadSchema: struct<revision>
+- *Scan JDBCRelation(TPTYPE) [revision#282] ReadSchema: struct<revision>
Is it possible to perform a LIKE join like this or am I way off?
You are only a little bit off. Spark SQL and Hive follow the SQL standard convention where the LIKE operator accepts only two special characters:
_ (underscore) - matches any single character.
% (percent) - matches an arbitrary sequence of characters (including the empty sequence).
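For example, staying within these two wildcards, the pattern '_NXD_V%' does match the value from Table A:
spark.sql("SELECT '8NXDPVAE' LIKE '_NXD_V%'").show
+---------------------+
|8NXDPVAE LIKE _NXD_V%|
+---------------------+
|                 true|
+---------------------+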
Square brackets have no special meaning and [4,8] matches only a [4,8] literal:
spark.sql("SELECT '[4,8]' LIKE '[4,8]'").show
+----------------+
|[4,8] LIKE [4,8]|
+----------------+
| true|
+----------------+
To match complex patterns you can use the RLIKE operator, which supports Java regular expressions:
spark.sql("SELECT '8NXDPVAE' RLIKE '^[4,8]NXD.V.*$'").show
+-----------------------------+
|8NXDPVAE RLIKE ^[4,8]NXD.V.*$|
+-----------------------------+
| true|
+-----------------------------+
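So one way to fix the original join is to store Java regex patterns in TPTYPE instead of LIKE patterns ('^[4,8]NXD.V.*$' instead of '[4,8]NXD_V%') and join with RLIKE; a sketch, assuming the patterns are rewritten that way:
// assumes TPTYPE.revision now holds Java regexes such as '^[4,8]NXD.V.*$'
val joined = spark.sql("SELECT A.revision, B.revision FROM RAWDATA A LEFT JOIN TPTYPE B ON A.revision RLIKE B.revision")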
Syntax for like in the Spark Scala API (note that like takes a SQL pattern using _ and %, not a regex; use rlike for regular expressions):
dataframe.filter(col("column_name").like("sql_pattern"))
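A minimal sketch of both variants, assuming a hypothetical DataFrame df with a revision column:
import org.apache.spark.sql.functions.col
df.filter(col("revision").like("%NXD_V%"))          // SQL LIKE pattern (_ and %)
df.filter(col("revision").rlike("^[4,8]NXD.V.*$"))  // Java regular expression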
Related
I am trying to join two tables on a field (FILE_NAME); however, there are a couple of records in just one of the tables in which a timestamp is appended to the end of the file name, before the file extension. I'm not sure how to account for these. I found an Oracle function,
REGEXP_SUBSTR (https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions131.htm), that seems like it may give me what I need, but I have to admit that this is quite advanced for me, and I am not sure how to apply it.
My sample tables are:
FILE_INFO Table:
FILE_NAME | FILE_ID
REGIONS_ACCOUNTED.xlsx | 21
TSM_INSAT.xml | 14
FILE_PARAMETERS Table:
FILE_NAME
TSM_INSAT.xml
REGIONS_ACCOUNTED-08112017.xlsx
From what I can tell, the timestamps are always prefixed with a dash (-), so I originally thought to find the index of the dash and use SUBSTR to concatenate the parts before and after the timestamp, but I can't figure out how to do that in a query, or how to account for date ranges, e.g.:
REGIONS_ACCOUNTED-07102017-07142017.xlsx
At this point, I just have a basic Join:
SELECT a.file_name, b.file_location
FROM reports.file_info a
LEFT OUTER JOIN reports.file_parameters b on (a.file_name = b.file_name);
The SQL above of course excludes those reports with dates/date ranges in the filename. It would be better to use a file_id, I'm sure; however, there is no file_id in the file_parameters.
Any guidance would be greatly appreciated!
You seem to be looking for a match of filenames from FILE_INFO in FILE_PARAMETERS.
SELECT a.file_name,
       b.file_location
FROM reports.file_info a
LEFT OUTER JOIN reports.file_parameters b
  ON b.file_name LIKE '%'
     || SUBSTR(a.file_name, 1, INSTR(a.file_name, '.', -1) - 1)
     || '%'
     || SUBSTR(a.file_name, INSTR(a.file_name, '.', -1))
     || '%';
-- note: Oracle's CONCAT takes only two arguments, so || is used here instead
#Hepc provided the correct answer (in the comments). The modified version to account for date ranges is:
REGEXP_REPLACE(a.file_name, '\-[\d\w\-\_]+\.', '.')
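Put into the join, that expression strips the timestamp portion before comparing; a sketch, assuming (as in the sample data) that the timestamped names live in file_parameters:
SELECT a.file_name, b.file_location
FROM reports.file_info a
LEFT OUTER JOIN reports.file_parameters b
  -- strip '-<timestamp(s)>' immediately before the extension, then compare
  ON REGEXP_REPLACE(b.file_name, '\-[\d\w\-\_]+\.', '.') = a.file_name;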
I have a largeDataFrame (multiple columns and billions of rows) and a smallDataFrame (single column and 10,000 rows).
I'd like to filter out all the rows from the largeDataFrame whose some_identifier column matches one of the values in the smallDataFrame.
Here's an example:
largeDataFrame
some_identifier,first_name
111,bob
123,phil
222,mary
456,sue
smallDataFrame
some_identifier
123
456
desiredOutput
111,bob
222,mary
Here is my ugly solution.
val smallDataFrame2 = smallDataFrame.withColumn("is_bad", lit("bad_row"))
val desiredOutput = largeDataFrame.join(broadcast(smallDataFrame2), Seq("some_identifier"), "left").filter($"is_bad".isNull).drop("is_bad")
Is there a cleaner solution?
You'll need to use a left_anti join in this case.
The left anti join is the opposite of a left semi join: it keeps only the rows of the left table that have no match in the right table on the given key:
largeDataFrame
.join(smallDataFrame, Seq("some_identifier"),"left_anti")
.show
// +---------------+----------+
// |some_identifier|first_name|
// +---------------+----------+
// | 222| mary|
// | 111| bob|
// +---------------+----------+
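The same anti join can also be written in Spark SQL syntax; a sketch, assuming both DataFrames are registered as temp views:
largeDataFrame.createOrReplaceTempView("large_df")
smallDataFrame.createOrReplaceTempView("small_df")
spark.sql("SELECT * FROM large_df L LEFT ANTI JOIN small_df S ON L.some_identifier = S.some_identifier").show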
A version in pure Spark SQL (using PySpark as an example; with small changes the same applies to the Scala API):
def string_to_dataframe(df_name, csv_string):
    rdd = spark.sparkContext.parallelize(csv_string.split("\n"))
    df = spark.read.option('header', 'true').option('inferSchema', 'true').csv(rdd)
    df.createOrReplaceTempView(df_name)  # registerTempTable is deprecated
string_to_dataframe("largeDataFrame", '''some_identifier,first_name
111,bob
123,phil
222,mary
456,sue''')
string_to_dataframe("smallDataFrame", '''some_identifier
123
456
''')
anti_join_df = spark.sql("""
select *
from largeDataFrame L
where NOT EXISTS (
select 1 from smallDataFrame S
WHERE L.some_identifier = S.some_identifier
)
""")
print(anti_join_df.take(10))
anti_join_df.explain()
will output mary and bob, as expected:
[Row(some_identifier=222, first_name='mary'),
Row(some_identifier=111, first_name='bob')]
and the physical execution plan will show that a Sort Merge Join is used:
== Physical Plan ==
SortMergeJoin [some_identifier#252], [some_identifier#264], LeftAnti
:- *(1) Sort [some_identifier#252 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(some_identifier#252, 200)
: +- Scan ExistingRDD[some_identifier#252,first_name#253]
+- *(3) Sort [some_identifier#264 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(some_identifier#264, 200)
+- *(2) Project [some_identifier#264]
+- Scan ExistingRDD[some_identifier#264]
Notice that Sort Merge Join is more efficient for joining/anti-joining data sets that are approximately the same size.
Since you have mentioned that the small dataframe is smaller, we should make sure that the Spark optimizer chooses a Broadcast Hash Join, which will be much more efficient in this scenario.
I will change NOT EXISTS to a NOT IN clause for this:
anti_join_df = spark.sql("""
select *
from largeDataFrame L
where L.some_identifier NOT IN (
select S.some_identifier
from smallDataFrame S
)
""")
anti_join_df.explain()
Let's see what it gives us:
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, LeftAnti, ((some_identifier#302 = some_identifier#314) || isnull((some_identifier#302 = some_identifier#314)))
:- Scan ExistingRDD[some_identifier#302,first_name#303]
+- BroadcastExchange IdentityBroadcastMode
+- Scan ExistingRDD[some_identifier#314]
Notice that the Spark optimizer actually chose a Broadcast Nested Loop Join and not a Broadcast Hash Join. The former is okay here, since we have just two records to exclude from the left side.
Also notice that both execution plans do have LeftAnti, so this is similar to #eliasah's answer, but implemented using pure SQL. Plus it shows that you can have more control over the physical execution plan.
PS. Also keep in mind that if the right dataframe is much smaller than the left-side dataframe but bigger than just a few records, you do want a Broadcast Hash Join, not a Broadcast Nested Loop Join or a Sort Merge Join. If this doesn't happen automatically, you may need to tune spark.sql.autoBroadcastJoinThreshold, which defaults to 10MB (10485760 bytes) and has to be bigger than the size of the "smallDataFrame".
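For example, the threshold can be raised via configuration before running the join (the 100MB below is a placeholder; size it so it comfortably exceeds the small side):
# default is 10485760 bytes (10MB); raise it before the join is planned
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)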
The following query does not work in Postgres 9.4.5.
SELECT * FROM (
SELECT M.NAME, M.VALUE AS V
FROM METRICS AS M, METRICATTRIBUTES AS A
WHERE M.NAME=A.NAME AND A.ISSTRING='FALSE'
) AS S1
WHERE CAST(S1.V AS NUMERIC)<0
I get an error like:
invalid input syntax for type numeric: "astringvalue"
Read on to see why I made the query this complicated.
METRICS is a table of (metric, value) pairs. The values are stored as strings, and some of the values of the VALUE field are, in fact, strings. The METRICATTRIBUTES table identifies those metric names which may have string values. I populated the METRICATTRIBUTES table from an analysis of the METRICS table.
To check, I ran...
SELECT * FROM (
SELECT M.NAME, M.VALUE AS V
FROM METRICS AS M, METRICATTRIBUTES AS A
WHERE M.NAME=A.NAME AND A.ISSTRING='FALSE'
) AS S1
WHERE S1.V LIKE 'a%'
This returns no values (as I would expect). The error seems to be in the execution plan, which looks something like this (sorry, I had to type it in by hand):
1 -> Hash Join
2      Hash Cond: ((M.NAME)::TEXT = (A.NAME)::TEXT)
3      -> Seq Scan on METRICS M
4           Filter: ((VALUE)::NUMERIC < 0::NUMERIC)
5      -> Hash
6           -> Seq Scan on METRICATTRIBUTES A
7                Filter: (NOT ISSTRING)
I am not an expert on this (only 1 week of Postgres experience), but it looks like Postgres is trying to apply the cast (line 4) before it applies the join condition (line 2). By doing this, it will try to apply the cast to invalid string values, which is precisely what I am trying to avoid!
Writing this with an explicit join did not make any difference. Writing it as a single select statement was my first attempt, never expecting this type of problem. That also did not work.
Any ideas?
As you can see from your plan, the table METRICS is being scanned in full (Seq Scan) and filtered with your condition CAST(S1.V AS NUMERIC) < 0; the join does not limit the scope at all.
Obviously, you have some rows that contain non-numeric data in METRICS.VALUE.
Check your table for such rows like this:
SELECT * FROM METRICS
WHERE NOT VALUE ~ '^[0-9.,e]*$'  -- keep rows containing anything besides digits, dots, commas and 'e'
Note that it is difficult to catch all possible combinations with a regular expression, therefore check out this related question: isnumeric() with PostgreSQL
VALUE is not a good name for the column, as it is a reserved word.
Edit: If you're absolutely sure that the joined tables will produce the wanted VALUE-s, then you can use CTEs, which act as an optimization fence in PostgreSQL:
WITH S1 AS (
SELECT M.NAME, M.VALUE AS V
FROM METRICS AS M
JOIN METRICATTRIBUTES AS A USING (NAME)
WHERE A.ISSTRING='FALSE'
)
SELECT *
FROM S1
WHERE CAST(S1.V AS NUMERIC)<0;
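A note for readers on later versions: since PostgreSQL 12 a plain CTE may be inlined by the planner, so to keep the fence you must mark the CTE explicitly:
WITH S1 AS MATERIALIZED (
   SELECT M.NAME, M.VALUE AS V
   FROM METRICS AS M
   JOIN METRICATTRIBUTES AS A USING (NAME)
   WHERE A.ISSTRING = 'FALSE'
)
SELECT *
FROM S1
WHERE CAST(S1.V AS NUMERIC) < 0;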
Good morning my beloved SQL wizards and sorcerers,
I want to substitute across 3 columns of data from 3 tables. Currently I am using the NVL function; however, it is restricted to two arguments.
See below for an example:
SELECT ccc.case_id,
       NVL(ccvl.descr, ccc.char) char_val
FROM case_char ccc, char_value ccvl, lookup_value lval1
WHERE ccvl.descr(+) = ccc.value
  AND ccc.value = lval1.descr (+)
  AND ccc.case_id IN ('123')
case_char table
case_id|char |value
123 |email| work_email
124 |issue| tim_
char_value table
char | descr
work_email | complaint mail
tim_ | timeliness
lookup_value table
descr | descrlong
work_email| xxx#blah.com
Essentially what I am trying to do is: if there is a match between case_char.value and lookup_value.descr, display it; if not, but there is a match between case_char.value and char_value.char, display that.
I am just trying to return the description for 'issue' from the char_value table, but for 'email' I want to return the descrlong from the lookup_value table (all under the same alias 'char_val').
So my question is, how do I achieve this keeping in mind that I want them to appear under the same alias.
Let me know if you require any further information.
Thanks guys
You could nest NVL:
NVL(a, NVL(b, NVL(c, d)))
But even better, use the SQL-standard COALESCE, which does take multiple arguments and also works on non-Oracle systems:
COALESCE(a, b, c, d)
How about using COALESCE:
COALESCE(ccvl.descr, ccc.char)
It is better to use COALESCE(a, b, c, d), for the following reasons:
Nested NVL logic can be achieved with a single COALESCE(a, b, c, d).
Using COALESCE is SQL standard.
COALESCE can give better performance: NVL always evaluates both of its arguments and then checks whether the first is null, whereas COALESCE checks its arguments one by one and returns as soon as it finds a non-null value, so any remaining (possibly expensive) expressions are never evaluated.
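Applied to the original query, a sketch (assuming, from the sample data, that case_char.value joins to char_value.char and to lookup_value.descr, so lookup_value.descrlong wins when present):
SELECT ccc.case_id,
       COALESCE(lval1.descrlong, ccvl.descr, ccc.char) AS char_val
FROM case_char ccc
LEFT JOIN char_value ccvl ON ccvl.char = ccc.value      -- 'issue' -> 'timeliness'
LEFT JOIN lookup_value lval1 ON lval1.descr = ccc.value -- 'email' -> 'xxx#blah.com'
WHERE ccc.case_id IN ('123', '124');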
I would like to perform an aggregation with Slick that executes SQL like the following:
SELECT MIN(a), MAX(a) FROM table_a;
where table_a has an INT column a
In Slick given the table definition:
class A(tag: Tag) extends Table[Int](tag, "table_a") {
def a = column[Int]("a")
def * = a
}
val A = TableQuery[A]
val as = A.map(_.a)
It seems like I have 2 options:
Write something like: Query(as.min, as.max)
Write something like:
as
.groupBy(_ => 1)
.map { case (_, as) => (as.map(identity).min, as.map(identity).max) }
However, the generated SQL is not good in either case. In 1, there are two separate sub-selects generated, which is like writing two separate queries. In 2, the following is generated:
select min(x2."a"), max(x2."a") from "table_a" x2 group by 1
However, this syntax is not correct for Postgres (it groups by the first column value, which is invalid in this case). Indeed AFAIK it is not possible to group by a constant value in Postgres, except by omitting the group by clause.
Is there a way to cause Slick to emit a single query with both aggregates without the GROUP BY?
The syntax error is a bug. I created a ticket: https://github.com/slick/slick/issues/630
The subqueries are a limitation of Slick's SQL compiler currently producing non-optimal code in this case. We are working on improving the situation.
As a workaround, here is a pattern to swap out the generated SQL under the hood and leave everything else intact: https://gist.github.com/cvogt/8054159
I use the following trick in SQL Server, and it seems to work in Postgres:
select min(x2."a"), max(x2."a")
from "table_a" x2
group by (case when x2.a = x2.a then 1 else 1 end);
The use of the column in the GROUP BY expression tricks the planner into thinking that there could be more than one group.