How can I query CSV values and output them as a table? - sql

We deal with a lot of different kinds of CSV files. All the CSV files are slightly different, but they all have a sort of primary key that is constant across all of them. I have a PostgreSQL (11.10) database that stores the data from these CSVs and looks roughly like this:
Files
+--+----+--------+
|id|name|headers |
+--+----+--------+
|1 |file|pk,b,c |
+--+----+--------+
|2 |ifle|pk,b,d,e|
+--+----+--------+
|3 |life|pk,e,f,g|
+--+----+--------+
|................|
Records
+--+------+---------+--------+
|id|fileID|record_pk|record |
+--+------+---------+--------+
|1 |1 |1 |1,2,3 |
+--+------+---------+--------+
|2 |1 |1 |1,5,6 |
+--+------+---------+--------+
|3 |1 |7 |7,8,9 |
+--+------+---------+--------+
|4 |2 |1 |1,f,o,o |
+--+------+---------+--------+
|5 |2 |2 |2,b,a,r |
+--+------+---------+--------+
|6 |3 |1 |1,b,a,z |
+--+------+---------+--------+
|............................|
What I'd like to do is query for a single PK within a single file, and get a queryable table of the results. For example, I may want to get data from file 1 for PK 1 from the above. I want the output to be:
+--+-+-+
|pk|b|c|
+--+-+-+
|1 |2|3|
+--+-+-+
|1 |5|6|
+--+-+-+
For another example, I may want to get data from file 2 for PK 2, and the output would be:
+--+-+-+-+
|pk|b|d|e|
+--+-+-+-+
|2 |b|a|r|
+--+-+-+-+
Is such a query actually possible?
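The awkward part is that the column list differs per file, while a SQL statement normally needs its output columns fixed before it runs. As a rough illustration of the result shape being asked for (not a PostgreSQL query), here is a minimal client-side sketch in Python. The helper name rows_for and the in-memory sample data are assumptions for illustration, and it assumes values never contain quoted or escaped commas.

```python
# Sample data copied from the question; in practice these rows would be
# fetched from the Files and Records tables.
files = {1: "pk,b,c", 2: "pk,b,d,e", 3: "pk,e,f,g"}
records = [
    # (id, fileID, record_pk, record)
    (1, 1, "1", "1,2,3"),
    (2, 1, "1", "1,5,6"),
    (3, 1, "7", "7,8,9"),
    (4, 2, "1", "1,f,o,o"),
    (5, 2, "2", "2,b,a,r"),
    (6, 3, "1", "1,b,a,z"),
]


def rows_for(file_id, pk):
    """Return (headers, rows) for one file and one primary key.

    Hypothetical helper: splits on plain commas, so it assumes no quoted
    or escaped commas inside values.
    """
    headers = files[file_id].split(",")
    rows = [
        dict(zip(headers, record.split(",")))  # pair each header with its value
        for _id, fid, record_pk, record in records
        if fid == file_id and record_pk == pk
    ]
    return headers, rows


headers, rows = rows_for(1, "1")
print(headers)
for row in rows:
    print(row)
```

Each row comes back keyed by that file's own headers, so file 1 and file 2 can have different columns without any schema change.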

Related

Assign Rank to Row based on Alphabetical Order Using Window Functions in PySpark

I'm trying to assign a rank to the rows of a dataframe using a window function over a string column (user_id), based on alphabetical order. So, for example:
user_id | rank_num
-------------------
A |1
A |1
A |1
B |2
A |1
B |2
C |3
B |2
B |2
C |3
I tried using the following lines of code:
user_window = Window().partitionBy('user_id').orderBy('user_id')
data = (data
.withColumn('profile_row_num', dense_rank().over(user_window))
)
But I'm getting something like:
user_id | rank_num
-------------------
A |1
A |1
A |1
B |1
A |1
B |1
C |1
B |1
B |1
C |1
Partitioning by user_id is unnecessary. It puts each user_id into its own partition, and ranking restarts inside every partition, so every row gets a rank of 1. The code below should do what you want:
user_window = Window.orderBy('user_id')
data = data.withColumn('profile_row_num', dense_rank().over(user_window))
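To see why the partition matters without starting Spark, here is a small pure-Python emulation of dense ranking. It is only a sketch of the semantics, not how Spark implements it; the helper name dense_rank_map is made up for this example.

```python
def dense_rank_map(values):
    """Map each distinct value to its 1-based position in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}


user_ids = ["A", "A", "A", "B", "A", "B", "C", "B", "B", "C"]

# One global window ordered by user_id: ranks follow alphabetical order.
global_rank = dense_rank_map(user_ids)
unpartitioned = [global_rank[u] for u in user_ids]

# One window per user_id: each partition holds a single distinct value,
# so ranking restarts and every row gets 1.
partitioned = [
    dense_rank_map([v for v in user_ids if v == u])[u] for u in user_ids
]

print(unpartitioned)
print(partitioned)
```

One caveat: a window with no partitioning makes Spark move all rows to a single partition (it logs a warning about this), so on a large dataframe it may be worth checking whether that is acceptable.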

Select rows with same id but different result in another column

sql: I have a table like this:
+------+------+
|ID |Result|
+------+------+
|1 |A |
+------+------+
|2 |A |
+------+------+
|3 |A |
+------+------+
|1 |B |
+------+------+
|2 |B |
+------+------+
The output should be something like:
Output:
+------+-------+-------+
|ID |Result1|Result2|
+------+-------+-------+
|1 |A |B |
+------+-------+-------+
|2 |A |B |
+------+-------+-------+
|3 |A | |
+------+-------+-------+
How can I do this?
SELECT
Id,
MAX(CASE result WHEN 'A' THEN 'A' ELSE NULL END) AS result1,
MAX(CASE result WHEN 'B' THEN 'B' ELSE NULL END) AS result2
FROM
table1
GROUP BY Id
results
+------+-------+-------+
|Id |Result1|Result2|
+------+-------+-------+
|1 |A |B |
|2 |A |B |
|3 |A |NULL |
+------+-------+-------+
run live demo on SQL fiddle: (http://sqlfiddle.com/#!9/e1081/2)
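If the fiddle link is unavailable, the same conditional-aggregation query can be tried locally. This is a minimal sketch using Python's built-in sqlite3 module with the sample rows from the question; the in-memory database is an assumption for the demo, and the query itself should carry over to other SQL engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, result TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "A"), (1, "B"), (2, "B")],
)

# MAX ignores NULLs, so each CASE picks out the one matching value per id,
# or stays NULL when the id has no row with that result.
pivot_rows = conn.execute(
    """
    SELECT id,
           MAX(CASE result WHEN 'A' THEN 'A' END) AS result1,
           MAX(CASE result WHEN 'B' THEN 'B' END) AS result2
    FROM table1
    GROUP BY id
    ORDER BY id
    """
).fetchall()

print(pivot_rows)
```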
There are a few ways to do it.
None of them are straightforward.
In theory, a simple way would be to create two temporary tables to separate the data: all the "A" results in one table and all the "B" results in another.
Then get the results with a simple query using a JOIN.
If you are allowed to use some scripting in the process then it is simpler; otherwise you need more complex logic in your query. And for your query to always work, you need some rules, such as: the A table always contains more IDs than the B table.
If you post your real example, it is easier to get better answers.
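A rough sketch of that split-and-join idea, run through Python's sqlite3 module. It swaps the temporary tables for two derived tables (subqueries), which is a simplification rather than exactly what is described above, and it relies on the rule that every ID appears among the "A" rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, result TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "A"), (1, "B"), (2, "B")],
)

# The "A" rows drive the result; the LEFT JOIN attaches a "B" row when one
# exists. An id that only has a "B" row would be dropped, which is why the
# approach needs the rule about A covering every id.
joined_rows = conn.execute(
    """
    SELECT a.id, a.result AS result1, b.result AS result2
    FROM (SELECT id, result FROM table1 WHERE result = 'A') AS a
    LEFT JOIN (SELECT id, result FROM table1 WHERE result = 'B') AS b
           ON b.id = a.id
    ORDER BY a.id
    """
).fetchall()

print(joined_rows)
```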
for this reason:
ID Name filename
1001 swapan 4566.jpg
1002 swapan 678.jpg
1003 karim 7688.jpg
1004 tarek 7889.jpg
1005 karim fdhak.jpg
output:
ID Name filename
1001 swapan 4566.jpg 678.jpg
1003 karim 7688.jpg fdhak.jpg
1004 tarek 7889.jpg ...
.. ... ... ...

SQL: How to get all descendants from a table with "many-to-many" related columns

I am still exploring SQL, hence I have an apparently easy question.
I have the following table in SQL Server 2008 R2.
As you see, this is a mapping table where there is a "many-to-many" relation between InstrumentIDs and ProductIDs:
|InstrumentID | ProductID |ID(PK)|
|-------------|-------------|------|
|10 |21 |0 |
|10 |22 |1 |
|11 |22 |2 |
|11 |23 |3 |
|12 |24 |4 |
|12 |25 |5 |
|----------------------------------|
I wish to write a query/SP which takes as an input let's say an InstrumentID and returns all the associated mappings extracted recursively.
For example, let's say that I provide InstrumentID = 10. Then, a satisfying result would be:
|InstrumentID | ProductID |ID(PK)|
|-------------|-------------|------|
|10 |21 |0 |
|10 |22 |1 |
|11 |22 |2 |
|11 |23 |3 |
|----------------------------------|
What is the best way of accomplishing this?
An example would definitely be useful.
I have tried the following query using CTE:
WITH Map
AS (
-- Anchor
SELECT InstrumentID, ProductID
FROM InstrumentProduct
WHERE InstrumentID IN (10)
UNION ALL
-- Recursive
SELECT IP.InstrumentID, IP.ProductID
FROM InstrumentProduct IP
INNER JOIN Map
ON IP.InstrumentID = Map.InstrumentID
OR IP.ProductID = Map.ProductID
)
SELECT *
FROM Map
OPTION (MAXRECURSION 50);
But the query fails with an error: "The statement terminated. The maximum recursion 50 has been exhausted before statement completion."
The rows returned before the error appear correct, but they are duplicated.
Any kind of help and advice is definitely helpful.
Thanks!
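A likely reason the CTE never finishes: the recursive member joins on InstrumentID OR ProductID, so every row already in Map matches itself again on the next level, and with UNION ALL nothing is de-duplicated. What the problem needs is a traversal that remembers what it has already visited. Here is a minimal sketch of that traversal in Python, as an illustration of the logic rather than a T-SQL solution; the function name connected_mappings is made up for this example.

```python
from collections import deque

# (InstrumentID, ProductID, ID) rows copied from the question.
mappings = [
    (10, 21, 0), (10, 22, 1), (11, 22, 2),
    (11, 23, 3), (12, 24, 4), (12, 25, 5),
]


def connected_mappings(start_instrument):
    """Return all mapping rows reachable from one instrument.

    Walks instrument -> products -> other instruments, tracking visited
    instruments and products so that no row is expanded twice.
    """
    seen_instruments = {start_instrument}
    seen_products = set()
    queue = deque([start_instrument])
    result_ids = set()

    while queue:
        instrument = queue.popleft()
        for inst, prod, row_id in mappings:
            if inst != instrument:
                continue
            result_ids.add(row_id)
            if prod in seen_products:
                continue
            seen_products.add(prod)
            # Enqueue every other instrument that shares this product.
            for other_inst, other_prod, _ in mappings:
                if other_prod == prod and other_inst not in seen_instruments:
                    seen_instruments.add(other_inst)
                    queue.append(other_inst)

    return sorted(row for row in mappings if row[2] in result_ids)


for row in connected_mappings(10):
    print(row)
```

The visited sets are what the posted CTE lacks; any SQL version would need some equivalent way to stop revisiting rows.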

How to write an SQL statement to record the fact that Rabbit has one additional homework assigned

I have a table shown below, and I want to record the fact that Rabbit has one additional homework assigned.
id |name |partnerId |totalHW |lateHW |major
-----------------------------------------------
12 |Puma |17 |3 |0 |CS
14 |Rabbit |21 |7 |4 |SE
17 |Mouse |12 |5 |5 |CE
21 |Aardvark |32 |2 |0 |CS
22 |Cougar |24 |4 |2 |SE
24 |Zebra |28 |7 |3 |EE
27 |Orca |14 |2 |1 |CS
32 |Elephant |null |0 |null |S
How do I go about it? I could not work out which column or relation represents Rabbit being assigned one additional homework. How should I write this statement?
I would do it like this:
UPDATE [tablename]
SET totalHW = totalHW + 1
WHERE name = 'Rabbit';
You'll likely come to a solution easily if you change to a normalized database structure. "totalHW" and "lateHW" are calculated fields based on other data, i.e., individual assignments. Instead, create a homework table, likely with a late field and whatever other fields you need, and when you need the total and the number of late assignments, calculate them in SQL. Once you do that, adding more homework and other manipulation of the data becomes very simple.
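A minimal sketch of what that normalized layout might look like, run through Python's sqlite3 module. The table and column names (student, homework, student_id, late) are assumptions for illustration and do not come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE homework (
        id         INTEGER PRIMARY KEY,
        student_id INTEGER REFERENCES student(id),
        late       INTEGER NOT NULL DEFAULT 0  -- 1 if handed in late
    );
    INSERT INTO student VALUES (14, 'Rabbit');
    """
)

# Rabbit starts with 7 assignments, 4 of them late, as in the question.
conn.executemany(
    "INSERT INTO homework (student_id, late) VALUES (14, ?)",
    [(1,)] * 4 + [(0,)] * 3,
)


def totals(connection):
    """Compute (totalHW, lateHW) for Rabbit from the individual rows."""
    return connection.execute(
        "SELECT COUNT(*), COALESCE(SUM(late), 0) "
        "FROM homework WHERE student_id = 14"
    ).fetchone()


before = totals(conn)
# Assigning one more homework is now a plain INSERT; no counter to update.
conn.execute("INSERT INTO homework (student_id, late) VALUES (14, 0)")
after = totals(conn)

print(before, after)
```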

How to compare each row against each other and get the best result?

Suppose I have a table of values and categories:
+--+-----+---+
|ID|value|cat|
+--+-----+---+
|0 |1 |0 |
+--+-----+---+
|1 |3 |0 |
+--+-----+---+
|2 |2 |1 |
+--+-----+---+
|3 |1.2 |1 |
+--+-----+---+
|4 |1 |1 |
+--+-----+---+
And I want to know, for each row, the ID of the row which matches the value most closely and belongs to the same category, and I also want to know the difference.
So for row ID=0 the correct answer would be ID=1, and the difference value would be 2. The correct output would be this:
+--+----------+----------+
|ID|difference|best match|
+--+----------+----------+
|0 |2 |1 |
+--+----------+----------+
|1 |2 |0 |
+--+----------+----------+
|2 |0.8 |3 |
+--+----------+----------+
|3 |0.2 |4 |
+--+----------+----------+
|4 |0.2 |3 |
+--+----------+----------+
I'm just learning about CROSS JOIN and while I'm sure this can be done I don't really know where to start.
You can do this with a self-join, using the ROW_NUMBER() function in conjunction with MIN():
;WITH cte AS (SELECT a.ID aID
,MIN(ABS(a.value - b.value)) diff
,ROW_NUMBER() OVER(PARTITION BY a.ID ORDER BY MIN(ABS(a.value - b.value)))RN
,b.ID bID
FROM Table1 a
JOIN Table1 b
ON a.cat = b.cat
AND a.ID <> b.ID
GROUP BY a.ID,b.ID)
SELECT aID
,diff
,bID Best_Match
FROM cte
WHERE RN = 1
Demo: SQL Fiddle
If you want to return multiple rows in case of a tie, you'd want to use RANK() instead of ROW_NUMBER()
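For checking the expected output by hand, the same "pair every row with every other row in its category, then keep the closest" logic can be sketched in plain Python. Decimal is used so that differences such as 0.2 come out exact. The function name best_matches is made up for this example, and ties are broken by the lower ID, which is an assumption (the query above leaves tie order unspecified).

```python
from decimal import Decimal

# (ID, value, cat) rows copied from the question.
rows = [
    (0, Decimal("1"), 0),
    (1, Decimal("3"), 0),
    (2, Decimal("2"), 1),
    (3, Decimal("1.2"), 1),
    (4, Decimal("1"), 1),
]


def best_matches(table):
    """Return (ID, difference, best_match_ID) for each row."""
    result = []
    for a_id, a_value, a_cat in table:
        # Self-join: every other row in the same category.
        candidates = [
            (abs(a_value - b_value), b_id)
            for b_id, b_value, b_cat in table
            if b_cat == a_cat and b_id != a_id
        ]
        if candidates:
            diff, match_id = min(candidates)  # smallest diff, then lowest ID
            result.append((a_id, diff, match_id))
    return result


for row in best_matches(rows):
    print(row)
```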