Merge Dataframes based on values from different datasets - pandas

I have the following data frames:
print(df)
id_code  turnover  costs
001      100       200
002      100       200
003      100       200
004      100       200
print(df_db)
Description  Code1  Code2  ...  CodeN
Retail       001    002    ...  nan
Wholesale    003    nan    ...  nan
Supply       004    nan    ...  nan
And I would like to create the following final_df, adding a column with the Description from df_db; basically, if the id_code is present in a row of df_db, take that row's Description:
print(final_df)
id_code  turnover  costs  Description
001      100       200    Retail
002      100       200    Retail
003      100       200    Wholesale
004      100       200    Supply
I tried with pd.pivot but it does not produce the desired result. How can I obtain final_df?

Use DataFrame.melt + Series.map if there are no duplicate codes in df_db:
mapper = df_db.melt('Description').set_index('value')['Description']
df['Description'] = df['id_code'].map(mapper)
print(df)
id_code turnover costs Description
0 1 100 200 Retail
1 2 100 200 Retail
2 3 100 200 Wholesale
3 4 100 200 Supply
Detail:
print(mapper)
value
1 Retail
3 Wholesale
4 Supply
2 Retail
5 Wholesale
6 Supply
Name: Description, dtype: object
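As a self-contained sketch of the same idea using the sample frames from the question (the dropna call is an assumption, added to discard the empty CodeN cells before building the mapper):
import pandas as pd
import numpy as np

df = pd.DataFrame({'id_code': ['001', '002', '003', '004'],
                   'turnover': [100, 100, 100, 100],
                   'costs': [200, 200, 200, 200]})
df_db = pd.DataFrame({'Description': ['Retail', 'Wholesale', 'Supply'],
                      'Code1': ['001', '003', '004'],
                      'Code2': ['002', np.nan, np.nan]})

# One row per (Description, code); drop the empty code cells, then map
mapper = (df_db.melt('Description')
               .dropna(subset=['value'])
               .set_index('value')['Description'])
df['Description'] = df['id_code'].map(mapper)
print(df)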

We use melt before merge:
final_df = (df.merge(df_db.melt('Description').drop(columns='variable'),
                     left_on='id_code', right_on='value')
              .drop(columns='value'))
Out[157]:
id_code turnover costs Description
0 1 100 200 Retail
1 2 100 200 Retail
2 3 100 200 Wholesale
3 4 100 200 Supply
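Note that this inner merge keeps only id_code values that appear somewhere in df_db; if unmatched rows should be kept with a NaN Description, a left join can be used instead (a sketch, assuming that behaviour is wanted):
final_df = (df.merge(df_db.melt('Description').drop(columns='variable'),
                     left_on='id_code', right_on='value', how='left')
              .drop(columns='value'))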

Related

How to transfer data from one column to another

I am trying to merge one column into another, but I am not sure whether I should use an insert or an update statement.
The table looks like this:
marketId | Product         | Phone | Value
1        | washing machine | null  | 800
1        | air condition   | null  | 300
1        | refrigerator    | null  | 600
1        | TV              | null  | 500
2        | washing machine | null  | 850
2        | air condition   | null  | 300
2        | refrigerator    | null  | 600
2        | TV              | null  | 500
I want to get a result like this:
marketId | Product         | Value
1        | washing machine | 800
1        | air condition   | 300
1        | refrigerator    | 600
1        | TV              | 500
1        | Phone           | null
2        | washing machine | 850
2        | air condition   | 200
2        | refrigerator    | 550
2        | TV              | 500
2        | Phone           | null
I tried a few things (update/insert statements), but I was unsuccessful. Do you have any other ideas?
Thanks!
You seem to want a union all, with the distinct ensuring a single Phone row per market:
select marketId, Product, Value
from t
union all
select distinct marketId, 'Phone', null
from t;

How to assign equal revenue weight to every location of a company in a table? Google Big Query

I am working on a problem where I have the following table:
company_id | country | total revenue
1          | Russia  | 1200
2          | Croatia | 1200
2          | Italy   | 1200
3          | USA     | 1200
3          | UK      | 1200
3          | Italy   | 1200
There are 3 companies in this table, but company '2' and company '3' have offices in 2 and 3 countries respectively. All companies pay 1200 per month, and because company 2 has 2 offices it shows as if they paid 1200 per month 2 times, and because company 3 has 3 offices it shows as if it paid 1200 per month 3 times. Instead, I would like revenue to be equally distributed based on how many times company_id appears in the table. company_id will only appear more than once for every additional country in which a company is based.
Assuming each company always pays 1,200 per month, my desired output is:
company_id | country | total revenue
1          | Russia  | 600
2          | Croatia | 600
2          | Italy   | 600
3          | USA     | 400
3          | UK      | 400
3          | Italy   | 400
Being new to SQL, I was thinking this could maybe be done with a CASE WHEN statement, but I have only learned to use CASE WHEN to output a string depending on a condition. Here, I am trying to assign an equal revenue weight to each of a company's countries, depending on how many countries the company is based in.
Thank you in advance for your help!
Below is for BigQuery Standard SQL
#standardSQL
SELECT company_id, country,
total_revenue / (COUNT(1) OVER(PARTITION BY company_id)) AS total_revenue
FROM `project.dataset.table`
If applied to the sample data from your question, the output is:
Row company_id country total_revenue
1 1 Russia 1200.0
2 2 Croatia 600.0
3 2 Italy 600.0
4 3 USA 400.0
5 3 UK 400.0
6 3 Italy 400.0

Combine 2 queries into 1 table with user entering the parameters twice for both queries

My project needs me to come up with a query that can compare any 2 months' data side by side, just by keying in the dates of the 2 months.
I have done 2 separate queries that can only handle a single month's data, because I can only enter 1 date per query. I tried to combine these 2 separate queries into 1 single query by selecting the columns from each table, but it gives me blank data.
I will need some help combining the 2 queries into 1 table, as a view, allowing the user to key in the 2 dates they want to get their data from. Below are the 2 query results I can achieve, and also the end result I want to achieve from combining these 2 queries.
The conditions to merge the two tables are that the company is the same for both dates, and so is the item the company bought (if any). If the company did not buy the item in that month, the data should be blank.
Query 1 : User will enter "First month" they want the data from
Inv Number  Company  Date      Item   Price  Quantity  Total
123         ABC      1/1/2018  Table  5      3         15
123         ABC      1/1/2018  Chair  2      4         8
345         XYZ      1/1/2018  Table  5      5         25
345         XYZ      1/1/2018  Chair  2      6         12
Query 2: User will enter "Second Month" they want the data from
Inv Number  Company  Date      Item   Price  Quantity  Total
999         ABC      1/2/2018  Table  4      3         12
999         ABC      1/2/2018  Chair  2      5         10
899         XYZ      1/2/2018  Table  4      3         12
End result : User will be allowed to key in both dates they want the data from
Inv Number  Company  Date      Item   Price  Quantity  Total  Date      Item   Price  Quantity  Total  Inv Number
123         ABC      1/1/2018  Table  5      3         15     1/2/2018  Table  4      3         12     999
123         ABC      1/1/2018  Chair  2      4         8      1/2/2018  Chair  2      5         10     999
345         XYZ      1/1/2018  Table  5      5         25     1/2/2018  Table  4      3         12     899
345         XYZ      1/1/2018  Chair  2      6         12

Splitting Column Headers and Duplicating Row Values in Pandas Dataframe

In the example df below, I'm trying to find a way to split the column headers ('1;2', '4', '5;6') based on the ';' and duplicate the row values into the split columns. (My actual df comes from an imported csv file, so I generally have around 50-80 column headers that need splitting.)
Below is my code with its output:
import pandas as pd
import numpy as np
#
data = np.array([['Market','Product Code','1;2','4','5;6'],
['Total Customers',123,1,500,400],
['Total Customers',123,2,400,320],
['Major Customer 1',123,1,100,220],
['Major Customer 1',123,2,230,230],
['Major Customer 2',123,1,130,30],
['Major Customer 2',123,2,20,10],
['Total Customers',456,1,500,400],
['Total Customers',456,2,400,320],
['Major Customer 1',456,1,100,220],
['Major Customer 1',456,2,230,230],
['Major Customer 2',456,1,130,30],
['Major Customer 2',456,2,20,10]])
df =pd.DataFrame(data)
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))
print (df)
0 Market Product Code 1;2 4 5;6
1 Total Customers 123 1 500 400
2 Total Customers 123 2 400 320
3 Major Customer 1 123 1 100 220
4 Major Customer 1 123 2 230 230
5 Major Customer 2 123 1 130 30
6 Major Customer 2 123 2 20 10
7 Total Customers 456 1 500 400
8 Total Customers 456 2 400 320
9 Major Customer 1 456 1 100 220
10 Major Customer 1 456 2 230 230
11 Major Customer 2 456 1 130 30
12 Major Customer 2 456 2 20 10
Below is my desired output
0 Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10
Ideally I would like to perform such a task at the 'read_csv' level. Any thoughts?
Try reindex with repeat:
s = df.columns.str.split(';')
df = df.reindex(columns=df.columns.repeat(s.str.len()))
df.columns = sum(s.tolist(), [])
df
Out[247]:
Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10
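Here sum(s.tolist(), []) just concatenates the per-header lists into one flat list of new column names; for example, sum([['1', '2'], ['4'], ['5', '6']], []) returns ['1', '2', '4', '5', '6'].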
You can split the columns on ';' and then reconstruct a df:
pd.DataFrame({c: df[t] for t in df.columns for c in t.split(';')})
Out[157]:
1 2 4 5 6 Market Product Code
1 1 1 500 400 400 Total Customers 123
2 2 2 400 320 320 Total Customers 123
3 1 1 100 220 220 Major Customer 1 123
4 2 2 230 230 230 Major Customer 1 123
5 1 1 130 30 30 Major Customer 2 123
6 2 2 20 10 10 Major Customer 2 123
7 1 1 500 400 400 Total Customers 456
8 2 2 400 320 320 Total Customers 456
9 1 1 100 220 220 Major Customer 1 456
10 2 2 230 230 230 Major Customer 1 456
11 1 1 130 30 30 Major Customer 2 456
12 2 2 20 10 10 Major Customer 2 456
Or if you would like to preserve column order:
pd.concat([df[t].to_frame(c) for t in df.columns for c in t.split(';')], axis=1)
Out[167]:
Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10
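On the read_csv point: read_csv itself has no option to split header names while reading, but a minimal sketch (the file name sample.csv is hypothetical, and the CSV is assumed to have the same header row as the sample data) is to read normally and widen the columns immediately afterwards:
import pandas as pd

# Hypothetical file name; the first row of the CSV is assumed to hold the headers
df = pd.read_csv('sample.csv')

# Repeat each column once per ';'-separated name, then flatten the names
parts = df.columns.str.split(';')
df = df.reindex(columns=df.columns.repeat(parts.str.len()))
df.columns = [name for names in parts for name in names]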

Need to create third table based on ID range in first table and line items in second table

The first table contains a range of account IDs for a company.
Table 1 Company
CID Minacct Maxacct
1 100 200
2 300 350
3 500 700
The second table contains line items for balances in accounts. The key to this problem, is that not all accounts are used in an account range from table 1. For ex., Company 1 could have account IDs from 100 to 200, but in actuality only has 100, 105, 118, and 170.
Table 2 Accts
AcctID Balance
100 $5
105 $10
118 $15
170 $20
300 $25
325 $30
350 $35
501 $40
502 $45
503 $50
602 $55
700 $60
Need to create Table 3 that combines Tables 1 and 2, hopefully using a nifty Select and not a slow loop.
CID AcctID Bal
1 100 $5
1 105 $10
1 118 $15
1 170 $20
2 300 $25
2 325 $30
2 350 $35
3 501 $40
3 502 $45
3 503 $50
3 602 $55
3 700 $60
Thanks!!
David
This is a join query, but the join condition is an inequality:
select c.cid, a.*
from accts a
join company c
  on a.acctid between c.minacct and c.maxacct;