Add a column to a Spark DataFrame, based on a parametric SQL query dependent on values of some fields of the DataFrame - sql

I have several Spark DataFrames (call them table A, table B, etc.).
I want to add a column only to table A, based on the result of a query against one of the other tables. Which table is queried changes from row to row, depending on the value of one of the fields of table A, so the query has to be parametric.
Below I show an example to make the problem clear:
Every table has the column OID and a column TableName with the name of the current table, plus other columns.
This is the fixed query to be performed on Tab A to add new column:
Select $ColumnName from $TableName where OID=$oids
Tab A
| oids|TableName |ColumnName | other fields|New Column: ValueOidDb
================================================================
| 2 | Book | Title | x |result query:harry potter
| 8 | Book | Isbn | y |result query: 556
| 1 | Author | Name | z |result query:Tolkien
| 4 | Category |Description| b |result query: Commedy
Tab Book
| OID |TableName |Title |Isbn |other fields|
================================================================
| 2 | Book |harry potter| 123 | x |
| 8 | Book | hobbit | 556 | y |
| 21 | Book | etc | 8942 | z |
| 5 | Book | etc2 | 984 | b |
Tab Author
| OID |TableName |Name |nationality |other fields|
================================================================
| 5 | Author |J.Rowling | eng | x |
| 2 | Author |Geor. Martin| us | y |
| 1 | Author | Tolkien | eng | z |
| 13 | Author | Dan Brown | us | b |
Tab Category
| OID | TableName |Description |
=====================================
| 12 | Category | Fantasy |
| 4 | Category | Commedy |
| 9 | Category | Thriller |
| 7 | Category | Action |
I tried with this udf
def setValueOid = (oid: Int, tableName: String, tableColumn: String) => {
  try {
    // Look up the requested column of the requested table for this OID.
    sqlContext.sql(s"SELECT $tableColumn FROM $tableName WHERE OID = $oid").first().toString()
  }
  catch {
    case x: java.lang.NullPointerException => "error"
  }
}
sqlContext.udf.register("setValueOid", setValueOid)
val FinalRtxf = sqlContext.sql("SELECT all the column of TAB A ,"
+ " setValueOid(oid, Table,AttributeDatabaseColumn ) as ValueOidDb"
+ " FROM TAB A")
I put the code in a try/catch because otherwise it throws a NullPointerException, but it still doesn't work: it always returns "error".
If I try this function without a sql query by just passing some manual parameters it works perfectly:
val try=setValueOid(8,"BOOK","ISBN")
try: String = [0977326403 ]
I read in the following question that it is not possible to run a query inside a UDF:
Trying to execute a spark sql query from a UDF
So how can I solve my problem? I don't know how to make a parametric join. I tried this:
%sql
Select all attributes TAB A,
FROM TAB A as a
join (Select $AttributeDatabaseColumn ,TableName from $Table where OID=$oid) as b
on a.Table=b.TableName
but it gave me this exception:
org.apache.spark.sql.AnalysisException: cannot recognize input near '$' 'AttributeDatabaseColumn' ',' in select clause; line 3 pos 1 at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:318)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)

One option:
transform each Book, Author, Category to a form:
root
|-- oid: integer (nullable = false)
|-- tableName: string (nullable = true)
|-- properties: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
For example first record in Book:
val book = Seq((2, "Book",
Map("title" -> "harry potter", "Isbn" -> "123", "other field" -> "x")
)).toDF("oid", "tableName", "properties")
+---+---------+---------------------------------------------------------+
|oid|tableName|properties |
+---+---------+---------------------------------------------------------+
|2 |Book |Map(title -> harry potter, Isbn -> 123, other field -> x)|
+---+---------+---------------------------------------------------------+
union Book, Author, Category as properties.
val properties = book.union(author).union(category)
join with base table:
val comb = properties.join(table, Seq("oid", "tableName"))
use case when ... based on tableName to add new column from properties field.
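The steps above can be sketched outside Spark to make the lookup logic concrete. The following is a plain-Python illustration, not Spark code: the sample rows come from the question's tables, the function name add_value_column is hypothetical, and the map keys are assumed to match the ColumnName values of Tab A exactly.

```python
# Plain-Python illustration of the lookup logic (not Spark code).
# Each source table is reshaped into rows of (oid, tableName, properties).
book = [
    (2, "Book", {"Title": "harry potter", "Isbn": "123"}),
    (8, "Book", {"Title": "hobbit", "Isbn": "556"}),
]
author = [
    (1, "Author", {"Name": "Tolkien", "nationality": "eng"}),
]
category = [
    (4, "Category", {"Description": "Commedy"}),
]

# "Union" of the reshaped tables, indexed by the join key (oid, tableName).
properties = {(oid, name): props for oid, name, props in book + author + category}

# Tab A rows: (oids, TableName, ColumnName).
tab_a = [
    (2, "Book", "Title"),
    (8, "Book", "Isbn"),
    (1, "Author", "Name"),
    (4, "Category", "Description"),
]


def add_value_column(rows, props_by_key):
    """Join on (oid, tableName), then pick the property named by ColumnName."""
    out = []
    for oid, table_name, column_name in rows:
        props = props_by_key.get((oid, table_name), {})
        # A missing table row or property yields None instead of raising.
        out.append((oid, table_name, column_name, props.get(column_name)))
    return out


result = add_value_column(tab_a, properties)
for row in result:
    print(row)
```

Because every table is reduced to the same three-column shape, the per-row "which table, which column" decision becomes an ordinary join followed by a map lookup, so no query has to run inside a UDF.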


Postgresql order by out of order

I have a database where I need to retrieve the data in the same order in which it was inserted into the table. The table name is bible. When I type table bible; in psql, it prints the data in the order it was inserted, but when I try to retrieve it with a query, some rows are always out of order, as in the example below:
table bible
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------------------
id | 1
day | 1
book | Genesis
chapter | 1
verse | 1
text | In the beginning God created the heavens and the earth.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=Genesis1.1&key=dc5e2d416f46150bf6ceb21d884b644f
-[ RECORD 2 ]-----------------------------------------------------------------------------------------------------------------------------------------
id | 2
day | 1
book | John
chapter | 1
verse | 1
text | In the beginning was the Word, and the Word was with God, and the Word was God.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=John1.1&key=dc5e2d416f46150bf6ceb21d884b644f
-[ RECORD 3 ]-----------------------------------------------------------------------------------------------------------------------------------------
id | 3
day | 1
book | John
chapter | 1
verse | 2
text | The same was in the beginning with God.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=John1.2&key=dc5e2d416f46150bf6ceb21d884b644f
Everything is in order, but when I query the same data using, for example, select * from bible where day='1', or select * from bible where day='1' order by day, or select * from bible where day='1' order by day, id;, I always get some rows out of order, whether for the selected day (here 1) or any other day.
I have been using Django to interface with the Postgres database, but since I found this problem I have tried querying with plain SQL and still get rows out of order, although they all have unique ids, which I verified with select count(distinct id), count(id) from bible;
-[ RECORD 1 ]------------------------------------------------------------------------------------------------------
id | 1
day | 1
book | Genesis
chapter | 1
verse | 1
text | In the beginning God created the heavens and the earth.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=Genesis1.1&key=dc5e2d416f46150bf6ceb21d884b644f
-[ RECORD 2 ]-----------------------------------------------------------------------------------------------------------------------------------------
id | 10
day | 1
book | Colossians
chapter | 1
verse | 18
text | And he is the head of the body, the church: who is the beginning, the firstborn from the dead; that in all things he might have the preeminence.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=Colossians1.18&key=dc5e2d416f46150bf6ceb21d884b644f
-[ RECORD 3 ]-----------------------------------------------------------------------------------------------------------------------------------------
id | 11
day | 1
book | Genesis
chapter | 1
verse | 2
text | And the earth was waste and void; and darkness was upon the face of the deep: and the Spirit of God moved upon the face of the waters.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=Genesis1.2&key=dc5e2d416f46150bf6ceb21d884b644f
As you can see above, the ids are out of order: 1, 10, 11.
my table
Table "public.bible";
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
---------+------+-----------+----------+---------+----------+--------------+-------------
id | text | | | | extended | |
day | text | | | | extended | |
book | text | | | | extended | |
chapter | text | | | | extended | |
verse | text | | | | extended | |
text | text | | | | extended | |
link | text | | | | extended | |
Access method: heap
The id field is of type text because I used pandas's to_sql() method to populate the bible table. I tried dropping the id column and adding it again as a primary key with ALTER TABLE bible ADD COLUMN id SERIAL PRIMARY KEY; but the data still comes back out of order.
Is there any way I can retrieve the data ordered by id, without having some of the rows totally out of order? Thank you in advance!
Thou shalt cast thy id to integer to order it as number.
SELECT * FROM bible ORDER BY cast(id AS integer);
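The difference between text ordering and numeric ordering can be reproduced in a few lines. This is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Postgres (an assumption made only so the example is self-contained); the table is reduced to two columns and the ids are taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# id is deliberately TEXT, as in the table produced by to_sql().
conn.execute("CREATE TABLE bible (id TEXT, book TEXT)")
conn.executemany(
    "INSERT INTO bible (id, book) VALUES (?, ?)",
    [("1", "Genesis"), ("2", "John"), ("3", "John"),
     ("10", "Colossians"), ("11", "Genesis")],
)

# Text comparison goes character by character, so "10" sorts before "2".
text_order = [row[0] for row in conn.execute("SELECT id FROM bible ORDER BY id")]

# Casting makes the comparison numeric.
numeric_order = [
    row[0]
    for row in conn.execute("SELECT id FROM bible ORDER BY CAST(id AS INTEGER)")
]

print(text_order)     # ['1', '10', '11', '2', '3']
print(numeric_order)  # ['1', '2', '3', '10', '11']
conn.close()
```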
While @jordanvrtanoski is correct, the way to do this in Django is:
>>> Bible.objects.extra(select={'id': 'CAST(id AS INTEGER)'}).order_by('id').values('id')
<QuerySet [{'id': 1}, {'id': 2}, {'id': 3}, {'id': 10}, {'id': 20}]>
Side note: If you want to filter on day as an example, you can do this:
>>> Bible.objects.extra(select={
'id': 'CAST(id AS INTEGER)',
'day': 'CAST(day AS INTEGER)'}
).order_by('id').values('id', 'day').filter(day=2)
<QuerySet [{'id': 2, 'day': 2}, {'id': 10, 'day': 2}, {'id': 11, 'day': 2}, {'id': 20, 'day': 2}]>
Otherwise you get this issue: (notice 1 is followed by 10 and not 2)
>>> Bible.objects.order_by('id').values('id')
<QuerySet [{'id': '1'}, {'id': '10'}, {'id': '2'}, {'id': '20'}, {'id': '3'}]>
I HIGHLY suggest you DO NOT do any of this, and set your tables correctly (have the correct column types and not have everything as text), or your query performance is going to suck.. BIG TIME
Building on the answers of @jordanvrtanoski and @Javier Buzzi, and some searching online: the issue is that the ids are of type TEXT (the same happens with VARCHAR), so they sort as strings. You need to convert the id column to type INTEGER, as follows:
ALTER TABLE bible ALTER COLUMN id TYPE integer USING (id::integer);
Now here is my table
Table "public.bible"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
---------+---------+-----------+----------+-----------------------------------------+----------+--------------+-------------
id | integer | | | nextval('bible_id_seq'::regclass) | plain | |
day | text | | | | extended | |
book | text | | | | extended | |
chapter | text | | | | extended | |
verse | text | | | | extended | |
text | text | | | | extended | |
link | text | | | | extended | |
Indexes:
"lesson_unique_id" UNIQUE CONSTRAINT, btree (id)
Referenced by:
TABLE "notes_note" CONSTRAINT "notes_note_verse_id_5586a4bf_fk" FOREIGN KEY (verse_id) REFERENCES days_lesson(id) DEFERRABLE INITIALLY DEFERRED
Access method: heap
Hope this helps other people, and thank you everyone!

Can SQL REPLACE function as a "find and replace" on both strings and substrings?

I have a database of boxes within boxes. Max nesting depth is 10, so each box could have up to 9 parent or child locations. One field contains the hierarchy of each box - i.e. for box DEF which is inside ABC:
SELECT hierarchy from INVENTORY WHERE boxname = 'DEF' returns "ABC -> DEF".
I now need to allow users to rename boxes. I'm trying to use SQL's REPLACE function to accomplish this, but it can't work on substrings as far as I can tell. I've tried:
update inventory
set hierarchy = replace(hierarchy, 'DEF', 'XYZ')
and this doesn't update the hierarchy to "ABC -> XYZ" like I'd expect
My hope is to use it as a "Ctrl+F find and replace" function, but it seems like it can't do the following:
Find all fields that contain the string, including as a substring.
Replace all occurrences across all fields for a given record.
Does anyone know if either of these are indeed possible?
I'm using TSQL.
sample data as requested:
input:
| name | parent1 | parent2 | ... | hierarchy |
| --- | --- | --- | --- | --- |
| DEF | ABC | | | ABC -> DEF |
| JKL | DEF | ABC | | ABC -> DEF -> JKL |
output:
| name | parent1 | parent2 | ... | hierarchy |
| --- | --- | --- | --- | --- |
| XYZ | ABC | | | ABC -> XYZ |
| JKL | XYZ | ABC | | ABC -> XYZ -> JKL |
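REPLACE does work on substrings: it replaces every occurrence of the search string inside the value, so the UPDATE above should turn "ABC -> DEF" into "ABC -> XYZ". If it doesn't, check that the statement is actually committed and that the stored value matches the search string under the column's collation (for example, letter case). What REPLACE cannot do is search across columns: each column you want changed (name, parent1, parent2, ..., hierarchy) needs its own assignment in the SET clause. Note also that a plain substring replace will rewrite any longer box name that merely contains 'DEF'.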
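A quick experiment can help check how REPLACE treats substrings. The sketch below uses Python's built-in sqlite3 module with SQLite's replace() as a stand-in; that T-SQL's REPLACE behaves the same way on SQL Server is an assumption worth verifying there. The table and values follow the sample data, with the schema cut down to three columns for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (name TEXT, parent1 TEXT, hierarchy TEXT)")
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?)",
    [("DEF", "ABC", "ABC -> DEF"), ("JKL", "DEF", "ABC -> DEF -> JKL")],
)

# One replace() per column: the function works on the value it is given,
# it does not search other columns of the row.
conn.execute(
    "UPDATE inventory SET "
    "name = replace(name, 'DEF', 'XYZ'), "
    "parent1 = replace(parent1, 'DEF', 'XYZ'), "
    "hierarchy = replace(hierarchy, 'DEF', 'XYZ')"
)

rows = conn.execute(
    "SELECT name, parent1, hierarchy FROM inventory ORDER BY hierarchy"
).fetchall()
for row in rows:
    print(row)
conn.close()
```

A substring replace like this would also touch a box named, say, DEFG, so a real rename probably needs the separator included in the search string or a lookup by key instead.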

SQL Query (Display All 'x' Where 'x' Is Not In Table '2' for field 'y' and has 'z' flag)

I need to return all 'contacts' that do not appear in the 'delegate' table for 'event name' but do have flags in the 'contacts' table that can be selected by the user for the search.
I know the query can be broken into 2 parts.
Are they already attending this event (Does their email appear in 'delegates' table with delegates.event field matching 'event' on the user form)
WHERE (
d.Event <> [Forms]![usf_FindCampaignContacts]![FCC_EventName]
Do they match the criteria (Have they got the HR flag in 'contacts' table)
AND (c.[HR-DEL] = [Forms]![usf_FindCampaignContacts]![FCC_HRD] OR IsNull([Forms]![usf_FindCampaignContacts]![FCC_HRD]));
Based on the 2 things that the query is required to do I have written the following code...
SELECT
c.[First Name], c.[Last Name], c.Email, d.Event, c.Suppress, c.[HR-DEL]
FROM tbl_Contacts AS c LEFT JOIN tbl_Delegates AS d ON c.Email = d.Email
WHERE (
d.Event <> [Forms]![usf_FindCampaignContacts]![FCC_EventName]
And
c.Suppress = False
)
AND (c.[HR-DEL] = [Forms]![usf_FindCampaignContacts]![FCC_HRD] OR IsNull([Forms]![usf_FindCampaignContacts]![FCC_HRD]));
[FCC_HRD] refers to the user selected input on the form, I tried to use a <> to remove matching records but I feel this is where the compile error is so I changed these to and/or statements and this part now returns results with the matching flags (Success)
Other issue with attempting to do it this way is even if it worked it would remove anyone who was listed in the delegates/sponsor table. Which is why I added the <> statement for the Event as it only needs to remove them off the list for the named event. Again this works perfectly well (Success)
Final issue is the results are clearly being pulled from the 'delegates' table not the 'contacts' table as both parts above work but only display the results that match criteria in delegates table not from contacts.
Here is the query/table relationships
Here is the user form (This is not the final design)
Below are the 3 tables that are used in the query (2 direct, 1 linked)
Contacts (c)
+----+------------+---------------+-------------------------+--------+----------+
| ID | First Name | Last Name | Email | HR-DEL | Suppress |
+----+------------+---------------+-------------------------+--------+----------+
| 1  | A          | Platt         | a.platt@fake.com        | TRUE   | TRUE     |
| 2  | D          | Farr          | d.farr@fake.com         | TRUE   | FALSE    |
| 3  | Y          | Helle         | y.helle@fake.com        | TRUE   | FALSE    |
| 4  | S          | Oliphant      | soliphant@fake.com      | TRUE   | FALSE    |
| 5  | J          | Bedell-Pearce | jbedell-pearce@fake.com | TRUE   | FALSE    |
| 6  | J          | Walker        | j.walker@fake.com       | FALSE  | FALSE    |
| 7  | S          | Rug           | s.rug@fake.com          | FALSE  | FALSE    |
| 8  | D          | Brown         | d.brown@fake.com        | FALSE  | FALSE    |
| 9  | R          | Cooper        | r.cooper@fake.com       | TRUE   | FALSE    |
| 10 | M          | Morrall       | m.morrall@fake.com      | TRUE   | FALSE    |
+----+------------+---------------+-------------------------+--------+----------+
Delegates (d)
+----+-------------------------+-------+
| ID | Email | Event |
+----+-------------------------+-------+
| 1  | a.platt@fake.com        | 2     |
| 2  | d.farr@fake.com         | 1     |
| 3  | y.helle@fake.com        | 4     |
| 4  | soliphant@fake.com      | 3     |
| 6  | jbedell-pearce@fake.com | 2     |
+----+-------------------------+-------+
Events (not direct but used to check event name drop-down on user form vs event number in delegates)
+----+------------+
| ID | Event Name |
+----+------------+
| 1 | Test 1 |
| 2 | Test 2 |
| 3 | Test 3 |
| 4 | Test 4 |
+----+------------+
Based on form selection and this sample data I need to return the following:
All contacts who are flagged 'HR' TRUE, not suppressed or going to event named 'test 2' (Should be 5 - I always return the names of 'delegates' not going to the event only = 3)
Final results should be:
+----+------------+-----------+--------------------+--------+----------+
| ID | First Name | Last Name | Email | HR-DEL | Suppress |
+----+------------+-----------+--------------------+--------+----------+
| 2  | D          | Farr      | d.farr@fake.com    | TRUE   | FALSE    |
| 3  | Y          | Helle     | y.helle@fake.com   | TRUE   | FALSE    |
| 4  | S          | Oliphant  | soliphant@fake.com | TRUE   | FALSE    |
| 9  | R          | Cooper    | r.cooper@fake.com  | TRUE   | FALSE    |
| 10 | M          | Morrall   | m.morrall@fake.com | TRUE   | FALSE    |
+----+------------+-----------+--------------------+--------+----------+
At the moment it appears to be pulling results from the wrong table (d not c). I attempted to change to OUTER join type but that returned with a FROM syntax error.
If I understand it correctly, basically you want to do this:
SELECT A.foo
FROM A
LEFT JOIN B
ON A.bar = B.bar
WHERE
<complex condition, partly involving B>
This cannot work. By including B in the global WHERE condition, you turn the LEFT JOIN into an INNER JOIN, and so you will only ever get records that match between A and B.
You can either move the filter on B into the JOIN condition:
SELECT A.foo
FROM A
LEFT JOIN B
ON (A.bar = B.bar)
AND (B.bamboozle = 42)
WHERE
A.columns = things
or LEFT JOIN a filtered subquery:
SELECT A.foo
FROM A
LEFT JOIN
(SELECT bar, columns FROM B
WHERE B.bamboozle = 42) AS B1
ON A.bar = B1.bar
WHERE
A.columns = things
So in your query, this is the bamboozle part you will need to move:
d.Event <> [Forms]![usf_FindCampaignContacts]![FCC_EventName]
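To see the effect on the sample data, here is a small sketch using Python's built-in sqlite3 module rather than Access (an assumption made so it can run anywhere; column names are simplified and the event is hard-coded as 2). Note that it goes one step beyond simply moving the condition: the event test in the ON clause is written as an equality, and an IS NULL check in WHERE keeps only contacts with no matching delegate row, which is one common way to express "not attending this event".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (id INTEGER, email TEXT, hr_del INTEGER, suppress INTEGER);
CREATE TABLE delegates (id INTEGER, email TEXT, event INTEGER);
""")
conn.executemany("INSERT INTO contacts VALUES (?, ?, ?, ?)", [
    (1, "a.platt@fake.com", 1, 1), (2, "d.farr@fake.com", 1, 0),
    (3, "y.helle@fake.com", 1, 0), (4, "soliphant@fake.com", 1, 0),
    (5, "jbedell-pearce@fake.com", 1, 0), (6, "j.walker@fake.com", 0, 0),
    (7, "s.rug@fake.com", 0, 0), (8, "d.brown@fake.com", 0, 0),
    (9, "r.cooper@fake.com", 1, 0), (10, "m.morrall@fake.com", 1, 0),
])
conn.executemany("INSERT INTO delegates VALUES (?, ?, ?)", [
    (1, "a.platt@fake.com", 2), (2, "d.farr@fake.com", 1),
    (3, "y.helle@fake.com", 4), (4, "soliphant@fake.com", 3),
    (6, "jbedell-pearce@fake.com", 2),
])

# Filter on d in WHERE: contacts with no delegate row have d.event NULL,
# the comparison is not true for them, and they are dropped.
where_ids = [r[0] for r in conn.execute("""
    SELECT c.id FROM contacts AS c
    LEFT JOIN delegates AS d ON c.email = d.email
    WHERE d.event <> 2 AND c.suppress = 0 AND c.hr_del = 1
    ORDER BY c.id
""")]

# Filter on d in ON, then keep only contacts with no match for event 2.
on_ids = [r[0] for r in conn.execute("""
    SELECT c.id FROM contacts AS c
    LEFT JOIN delegates AS d ON c.email = d.email AND d.event = 2
    WHERE d.email IS NULL AND c.suppress = 0 AND c.hr_del = 1
    ORDER BY c.id
""")]

print(where_ids)  # [2, 3, 4]
print(on_ids)     # [2, 3, 4, 9, 10]
conn.close()
```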

How to select Multiple Rows based on one Column

So I have looked around the internet, and couldn't find anything that could be related to my issue.
This is part of my DB:
ID | English | Pun | SID | Writer |
=======================================================
1 | stuff | stuff | 1 | Full |
2 | stuff | stuff | 1 | Rec. |
3 | stuff | stuff | 2 | Full |
4 | stuff | stuff | 2 | Rec. |
Now how would I get all rows with SID being equal to 1.
Like this
ID | English | Pun | SID | Writer |
=======================================================
1 | stuff | stuff | 1 | Full |
2 | stuff | stuff | 1 | Rec. |
Or when I want to get all rows with SID being equal to 2.
ID | English | Pun | SID | Writer |
=======================================================
3 | stuff | stuff | 2 | Full |
4 | stuff | stuff | 2 | Rec. |
This is my current SQL Query using SQLite:
SELECT * FROM table_name WHERE SID = 1
And I only get the first row, how would I be able to get all of the rows?
Here is my PHP Code:
class GurDB extends SQLite3
{
function __construct()
{
$this->open('gurbani.db3');
}
}
$db = new GurDB();
$mode = $_GET["mode"];
if($mode == "2") {
$shabadnum = $_GET["shabadNo"];
$result = $db->query("SELECT * FROM table_name WHERE SID = $shabadnum");
$array = $result->fetchArray(SQLITE3_ASSOC);
print_r($array);
}
fetchArray() only gives you one row per call, so you need to loop until it returns false. Something like this:
$rows = array();
while ($row = $result->fetchArray(SQLITE3_ASSOC))
{
$rows[] = $row;
}
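The same one-row-per-fetch behaviour shows up in other SQLite bindings. As an analogy only (Python's built-in sqlite3 rather than PHP's SQLite3 class, with an in-memory copy of the sample rows), the sketch below contrasts a single fetch with collecting every row. It also passes the SID as a bound parameter instead of interpolating request input into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_name "
    "(ID INTEGER, English TEXT, Pun TEXT, SID INTEGER, Writer TEXT)"
)
conn.executemany("INSERT INTO table_name VALUES (?, ?, ?, ?, ?)", [
    (1, "stuff", "stuff", 1, "Full"),
    (2, "stuff", "stuff", 1, "Rec."),
    (3, "stuff", "stuff", 2, "Full"),
    (4, "stuff", "stuff", 2, "Rec."),
])

query = "SELECT * FROM table_name WHERE SID = ? ORDER BY ID"

# A single fetch returns only the first matching row.
first_only = conn.execute(query, (1,)).fetchone()

# Iterating the cursor (or calling fetchall) returns every matching row.
all_rows = [row for row in conn.execute(query, (1,))]

print(first_only)
print(all_rows)
conn.close()
```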

DBIx::Class : Resultset order_by based upon existence of a value in the list

I am using DBIx::Class and I have a ResultSet that I would like to re-order. I want to check a particular column, "City", against a fixed list of values ("London", "New York", "Tokyo"). If the city is found in the list, that result should move to the top group of the ResultSet; if it is not found, the result should move to the bottom group.
ORDER BY expr might be what you're looking for.
For example, here a table:
mysql> select * from test;
+----+-----------+
| id | name |
+----+-----------+
| 1 | London |
| 2 | Paris |
| 3 | Tokio |
| 4 | Rome |
| 5 | Amsterdam |
+----+-----------+
Here the special ordering:
mysql> select * from test order by name = 'London' desc,
name = 'Paris' desc,
name = 'Amsterdam' desc;
+----+-----------+
| id | name |
+----+-----------+
| 1 | London |
| 2 | Paris |
| 5 | Amsterdam |
| 3 | Tokio |
| 4 | Rome |
+----+-----------+
Translating this into a ResultSet method:
$schema->resultset('Test')->search(
{},
{order_by => {-desc => q[name in ('London', 'New York', 'Tokyo')] }}
);
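The grouping effect of ordering by a membership test can be checked with a small experiment. This sketch uses Python's built-in sqlite3 module instead of MySQL and DBIx::Class (an assumption for the sake of a self-contained example), with the rows of the test table above and the three preferred names from the MySQL example; the secondary sort on id is added here to make the order within each group deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO test VALUES (?, ?)", [
    (1, "London"), (2, "Paris"), (3, "Tokio"), (4, "Rome"), (5, "Amsterdam"),
])

preferred = ("London", "Paris", "Amsterdam")
placeholders = ", ".join("?" for _ in preferred)

# "name IN (...)" evaluates to 1 for listed cities and 0 otherwise, so
# sorting on it descending puts the listed cities in the top group.
query = (
    f"SELECT id, name FROM test "
    f"ORDER BY name IN ({placeholders}) DESC, id"
)
ordered = [name for _, name in conn.execute(query, preferred)]

print(ordered)  # ['London', 'Paris', 'Amsterdam', 'Tokio', 'Rome']
conn.close()
```

Unlike the chain of separate equality tests in the MySQL example, a single IN test only splits the rows into two groups; the order inside each group comes from whatever follows it in the ORDER BY.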
Something like:
#!/usr/bin/env perl
use strict;
use warnings;
my $what = shift or die;
my @ary = qw(alpha beta gamma);
unshift(@ary,$what) unless ( grep(/$what/,@ary) );
print "@ary\n";
1;
Run as:
./myscript omega