How to transpose a PySpark dataframe which has multiple index columns?

I have a dataframe that looks like this:
ID    Company_Id    value      Approve or Reject
1A    3412asd       value-1    Approve
2B    2345tyu       value-2    Approve
3C    9800bvd       value-3    Approve
2B    2345tyu       value-1    Approve
Note that an ID can repeat with a different 'value'. ID and Company_Id are the index columns.
Now I need the output to be:
ID    Company_Id    value-1    value-2    value-3
1A    3412asd       Approve    NULL       NULL
2B    2345tyu       Approve    Approve    NULL
3C    9800bvd       NULL       NULL       Approve

pyspark pivot (note that first must be imported from pyspark.sql.functions):
from pyspark.sql.functions import first

df.groupBy('ID', 'Company_Id').pivot('value').agg(first('Approve or Reject')).show()
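For readers without a Spark session handy, the same reshape can be sketched in pandas (a rough equivalent, not the asker's Spark code; pivot_table with aggfunc='first' plays the role of pivot().agg(first(...))):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["1A", "2B", "3C", "2B"],
    "Company_Id": ["3412asd", "2345tyu", "9800bvd", "2345tyu"],
    "value": ["value-1", "value-2", "value-3", "value-1"],
    "Approve or Reject": ["Approve", "Approve", "Approve", "Approve"],
})

# Spread 'value' into columns, filling cells with the first matching status
out = (df.pivot_table(index=["ID", "Company_Id"], columns="value",
                      values="Approve or Reject", aggfunc="first")
         .reset_index())
print(out)
```

Missing (ID, value) combinations come out as NaN, which Spark would show as null.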

Related

SQL - view only one datetime difference

I have two tables. Let's call the first one A and the other B.
A is:
ID    Doc_ID    Date
1     1a        1-Jan-2020
1     1a        1-Feb-2020
1     1b        1-Mar-2020
2     1a        1-Jan-2020
B is:
ID    Doc2_ID    Date
1     2a         1-Mar-2020
1     2a         1-Apr-2020
2     2b         1-Feb-2020
2     2a         1-Mar-2020
Now, using SQL, I want to create a table which has all the values in Table A plus the difference between each date in Table A and the closest date in Table B. For example, 1-Jan-2020 should be subtracted from 1-Mar-2020, and similarly 1-Feb-2020 should be subtracted from 1-Mar-2020. Can you please help me with it?
I am using the query below in azure databricks:
%sql
SELECT a.ID, a.Doc_ID, DATEDIFF(b.DATE, a.DATE) as day FROM a
LEFT JOIN b
ON a.ID = b.ID
AND a.DATE < b.DATE
But this generates more than one row per record in the results, i.e. it subtracts from all the later dates in Table B that satisfy the join condition (for example, it subtracts 1-Jan-2020 from both 1-Mar-2020 and 1-Apr-2020, and I want it to subtract only from the closest date in Table B, i.e. 1-Mar-2020).
The expected outcome should be:
ID    Doc_ID    day
1     1a        59
1     1a        30
1     1b        0
2     1a        30
The day column for the first two rows was obtained by subtracting the respective dates in Table A from 1-Mar-2020, i.e. the closest value in Table B for ID 1.
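If pandas is available in the notebook (an alternative approach, not the asker's SQL), this "closest later date per ID" lookup is exactly what pd.merge_asof with direction='forward' computes. A sketch with the sample data above; note that datetime arithmetic gives 60, 29, 0 and 31 days rather than the 59, 30, 0 and 30 in the expected table, since February 2020 had 29 days:

```python
import pandas as pd

a = pd.DataFrame({"ID": [1, 1, 1, 2],
                  "Doc_ID": ["1a", "1a", "1b", "1a"],
                  "Date": pd.to_datetime(["2020-01-01", "2020-02-01",
                                          "2020-03-01", "2020-01-01"])})
b = pd.DataFrame({"ID": [1, 1, 2, 2],
                  "Doc2_ID": ["2a", "2a", "2b", "2a"],
                  "Date": pd.to_datetime(["2020-03-01", "2020-04-01",
                                          "2020-02-01", "2020-03-01"])})

# Keep B's date in a separate column so it survives the merge
b["B_Date"] = b["Date"]

# merge_asof requires both frames sorted on the merge key
a = a.sort_values("Date")
b = b.sort_values("Date")

# For each A row, find the nearest B date that is >= A's date, per ID
out = pd.merge_asof(a, b, on="Date", by="ID", direction="forward")
out["day"] = (out["B_Date"] - out["Date"]).dt.days
print(out[["ID", "Doc_ID", "day"]])
```

In pure SQL the same effect needs a ranking step over the join, e.g. ROW_NUMBER() ordered by DATEDIFF(b.DATE, a.DATE), keeping only the first row per A record.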

How can I remove duplicates by using MAX and SUM per group identifier?

I'm creating an open order report using SQL to query data from AWS Redshift.
My current table has duplicates (same order, ln, and subln numbers):
Order    Ln     SubLn    Qty    ShpDt
4166     010    00       3      2021-01-06
4166     010    00       3      2021-01-09
4167     011    00       9      2021-02-01
4167     011    00       9      2021-01-28
4167     011    01       8      2020-12-29
I need to remove duplicates using order, ln, and subln columns as group identifiers. I want to calculate SUM of qty and keep most recent ship date for the order to achieve this result:
Order    Ln     SubLn    TotQty    Shipped
4166     010    00       6         2021-01-09
4167     011    00       18        2021-02-01
4167     011    01       8         2020-12-29
After reading (How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL?) I tried the code below, which only aggregated the fields and did not remove duplicates. What am I missing?
SELECT t1.*, t2.totqty, t2.shipped
FROM table1 AS t1
JOIN (SELECT order, ln, subln, SUM(qty) AS totqty, MAX(shpdt) AS shipped
      FROM table1
      GROUP BY order, ln, subln) AS t2
ON t1.order = t2.order AND t1.ln = t2.ln AND t1.subln = t2.subln
"I need to remove duplicates using order, ln, and subln columns as group identifiers. I want to calculate SUM of qty and keep most recent ship date for the order to achieve this result:"
The dataset returned by that subquery is already unique on those three columns. Run the GROUP BY query by itself; joining it back to table1 is what reintroduces the duplicate rows.
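To see this concretely, here is the bare GROUP BY run on the sample rows, sketched with Python's built-in sqlite3 rather than Redshift (the Order column is renamed ord here because ORDER is a reserved word):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (ord TEXT, ln TEXT, subln TEXT, qty INTEGER, shpdt TEXT)")
con.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?, ?)", [
    ("4166", "010", "00", 3, "2021-01-06"),
    ("4166", "010", "00", 3, "2021-01-09"),
    ("4167", "011", "00", 9, "2021-02-01"),
    ("4167", "011", "00", 9, "2021-01-28"),
    ("4167", "011", "01", 8, "2020-12-29"),
])

# The subquery alone already collapses the duplicates:
# one row per (ord, ln, subln), summed qty, latest ship date
rows = con.execute("""
    SELECT ord, ln, subln, SUM(qty) AS totqty, MAX(shpdt) AS shipped
    FROM table1
    GROUP BY ord, ln, subln
    ORDER BY ord, ln, subln
""").fetchall()
print(rows)
```

MAX on ISO-formatted date strings compares lexicographically, which matches chronological order here.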

Pandas pivot table selecting rows with maximum values

I have a pandas dataframe df:
Id Name CaseId Value
82 A1 case1.01 37.71
1558 A3 case1.01 27.71
82 A1 case1.06 29.54
1558 A3 case1.06 29.54
82 A1 case1.11 12.09
1558 A3 case1.11 32.09
82 A1 case1.16 33.35
1558 A3 case1.16 33.35
For each Id, Name pair I need to select the CaseId with maximum value.
i.e. I am seeking the following output:
Id Name CaseId Value
82 A1 case1.01 37.71
1558 A3 case1.16 33.35
I tried the following:
import numpy as np
import pandas as pd

pd.pivot_table(df, index=['Id', 'Name'], columns=['CaseId'], values=['Value'], aggfunc=[np.max])['amax']
But all it does is give the maximum value for each CaseId as a column, not the result I am seeking above.
sort_values + drop_duplicates
df.sort_values('Value').drop_duplicates(['Id'],keep='last')
Out[93]:
Id Name CaseId Value
7 1558 A3 case1.16 33.35
0 82 A1 case1.01 37.71
Since we posted at the same time, adding one more method:
df.sort_values('Value').groupby('Id').tail(1)
Out[98]:
Id Name CaseId Value
7 1558 A3 case1.16 33.35
0 82 A1 case1.01 37.71
This should work:
df = df.sort_values('Value', ascending=False).drop_duplicates('Id').sort_index()
Output:
Id Name CaseId Value
0 82 A1 case1.01 37.71
7 1558 A3 case1.16 33.35
With nlargest and groupby
pd.concat(d.nlargest(1, ['Value']) for _, d in df.groupby('Name'))
Id Name CaseId Value
0 82 A1 case1.01 37.71
7 1558 A3 case1.16 33.35
Another idea is to create a joint column, take its max, then split it back to two columns:
df['ValueCase'] = list(zip(df['Value'], df['CaseId']))
p = pd.pivot_table(df, index=['Id', 'Name'], values=['ValueCase'], aggfunc='max')
p['Value'], p['CaseId'] = list(zip(*p['ValueCase']))
del p['ValueCase']
Results in:
CaseId Value
Id Name
82 A1 case1.01 37.71
1558 A3 case1.16 33.35
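One more idiom worth mentioning (an addition, not from the answers above): groupby(...)['Value'].idxmax() returns the row label of each group's maximum, and df.loc then picks those whole rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [82, 1558, 82, 1558, 82, 1558, 82, 1558],
    "Name": ["A1", "A3"] * 4,
    "CaseId": ["case1.01", "case1.01", "case1.06", "case1.06",
               "case1.11", "case1.11", "case1.16", "case1.16"],
    "Value": [37.71, 27.71, 29.54, 29.54, 12.09, 32.09, 33.35, 33.35],
})

# idxmax gives the index label of the max Value within each (Id, Name) group;
# loc then selects those full rows, keeping all columns
best = df.loc[df.groupby(["Id", "Name"])["Value"].idxmax()]
print(best)
```

On ties, idxmax keeps the first occurrence within the group.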

use foreign key constraint to ensure valid data entry

I have been cleaning up a database table which was a bit of a nightmare.
I want to ensure that the data being entered going forward is correct, so I read about foreign key constraints.
So below is a table where the data gets entered into.
tblSales
Country Seller Value
52 01 100
86 01 100
102 32 100
32 52 100
52 01 100
I want to ensure the values being entered into the Country and Seller fields come from a certain set of values. There is also a table called tblMap, which is used for reports to give the numbers a name which is easy to read.
tblMap
Factor Code Name
Country 52 US
Country 86 Germany
Country 102 Spain
Country 32 Italy
Seller 01 Bob
Seller 32 Sarah
Seller 52 Jim
So, like I say, I was going to use a foreign key constraint, but I can't create a primary key on the Code field in tblMap because 52 is used for both a country and a seller. I am not able to change the code numbers either.
Am I still able to use a foreign key constraint to ensure any value entered into tblSales exists in tblMap?
Maybe you can replace tblMap with 2 tables, tblMapCountry and tblMapSeller:
tblMapCountry
Code Name
52 US
86 Germany
102 Spain
32 Italy
tblMapSeller
Code Name
01 Bob
32 Sarah
52 Jim
After that you can create:
a FK between tblSales.Country and tblMapCountry
a FK between tblSales.Seller and tblMapSeller
In the end you can build a view tblMap as the union of the 2 tables tblMapCountry and tblMapSeller:
create view `tblMap`
as
select 'Country' as Factor,Code,Name from tblMapCountry
union all
select 'Seller' as Factor,Code,Name from tblMapSeller
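The two-table design can be sanity-checked with any engine that enforces foreign keys. A minimal sketch using Python's built-in sqlite3 (note the PRAGMA: SQLite leaves FK enforcement off by default):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite needs this per connection

con.execute("CREATE TABLE tblMapCountry (Code INTEGER PRIMARY KEY, Name TEXT)")
con.execute("CREATE TABLE tblMapSeller  (Code TEXT PRIMARY KEY, Name TEXT)")
con.execute("""CREATE TABLE tblSales (
    Country INTEGER REFERENCES tblMapCountry(Code),
    Seller  TEXT    REFERENCES tblMapSeller(Code),
    Value   INTEGER)""")

con.executemany("INSERT INTO tblMapCountry VALUES (?, ?)",
                [(52, "US"), (86, "Germany"), (102, "Spain"), (32, "Italy")])
# Seller codes stay TEXT to preserve the leading zero in '01'
con.executemany("INSERT INTO tblMapSeller VALUES (?, ?)",
                [("01", "Bob"), ("32", "Sarah"), ("52", "Jim")])

con.execute("INSERT INTO tblSales VALUES (52, '01', 100)")      # valid: passes both FKs
try:
    con.execute("INSERT INTO tblSales VALUES (99, '01', 100)")  # 99 is not a country
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The same pattern carries over directly to the two FK constraints the answer describes.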

Alphanumeric Sorting in PostgreSQL 9.4

I have a table in a PostgreSQL 9.4 database in which one of the columns contains both integers and letters, in the following format.
1
10
10A
10A1
1A
1A1
1A1A
1B
1C
1C1
2
65
89
The format is: it starts with a number, then a letter, then a number, then a letter, and so on. I want to sort the field like below,
1
1A
1A1
1A1A
1B
1C
1C1
2
10
10A
10A1
65
89
But when sorting, 10 comes before 2. Please suggest a query to obtain the desired result.
Thanks in advance
Try this
SELECT *
FROM table_name
ORDER BY (substring(column_name, '^[0-9]+'))::int                 -- leading digits, cast to integer
       , coalesce(substring(column_name, '[^0-9_].*$'), '')       -- remainder, from the first non-digit
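For comparison, the same idea (split the value into digit and non-digit runs, compare digit runs numerically) sketched in Python; the regex plays the role of the two substring() calls in the ORDER BY:

```python
import re

def natural_key(s):
    # Split "10A1" into runs ["10", "A", "1"], then tag each run:
    # (0, 10) for numeric runs, (1, "A") for text runs. The tags keep
    # ints and strings comparable when keys of different shapes meet.
    return [(0, int(run)) if run.isdigit() else (1, run)
            for run in re.findall(r"\d+|\D+", s)]

vals = ["1", "10", "10A", "10A1", "1A", "1A1", "1A1A",
        "1B", "1C", "1C1", "2", "65", "89"]
print(sorted(vals, key=natural_key))
```

This yields exactly the ordering requested in the question, with 2 before 10.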