Upsert a table in BigQuery with a condition

I have two tables A and B with the key 'place'. A contains 'place' values for the month of December and B contains 'place' values for the month of January. I need to create two columns (first seen and last seen) such that:
if a 'place' in A is also present in B, then first seen = Dec and last seen = Mar;
if a 'place' in A is not present in B, then first seen = Dec and last seen = Dec;
for a new 'place' added in B, first seen = Mar and last seen = Mar.
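The intended logic is easier to see in code. Below is a minimal sketch of it in pandas (illustrative only; the actual question targets BigQuery, where the same result would typically come from a MERGE or a full outer join in SQL). The frame and column names are assumptions, and the month labels follow the question's Dec/Mar wording.
import pandas as pd

# Illustrative pandas stand-ins for the two BigQuery tables (names are assumptions)
A = pd.DataFrame({"place": ["p1", "p2"]})   # December places
B = pd.DataFrame({"place": ["p2", "p3"]})   # the later month's places

merged = A.merge(B, on="place", how="outer", indicator=True)

# left_only  -> in A only:  first seen = Dec, last seen = Dec
# both       -> in A and B: first seen = Dec, last seen = Mar
# right_only -> new in B:   first seen = Mar, last seen = Mar
merged["first_seen"] = merged["_merge"].map({"left_only": "Dec", "both": "Dec", "right_only": "Mar"})
merged["last_seen"] = merged["_merge"].map({"left_only": "Dec", "both": "Mar", "right_only": "Mar"})
print(merged.drop(columns="_merge"))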

Combining multiple dataframe columns into a single time series

I have built a financial model in python where I can enter sales and profit for x years in y scenarios - a base scenario plus however many I add.
Annual figures are uploaded per scenario in my first dataframe (e.g. if x = 5 beginning in 2022 then the base scenario sales column would show figures for 2022, 2023, 2024, 2025 and 2026)
I then use monthly weightings to create a monthly phased sales forecast in a new dataframe, with columns titled Base sales 2022 (figures shown monthly), Base sales 2023, Base sales 2024, etc.
I want to show these figures in a single series, so that I have one time series for base sales from Jan 2022 to Dec 2026 for charting and analysis purposes.
I've managed to get this to work by creating a list and manually adding the name of each column I want to include, but this will not work if I have a different number of scenarios or years, so I'm trying to automate the process and can't find a way to do it.
I don't want to share my main model code, but I have created a mini model below that does a similar thing. It doesn't work as intended: although it generates most of the output I want (three lists are requested: listA0, listA1, listA2), the lists clearly aren't created, as they aren't callable. Also, I really need all the text on a single line rather than split over multiple lines (or perhaps I should use list append for each subsequent item). Any help gratefully received.
Below is the code I have tried:
import pandas as pd

#Create list of scenarios and capture the number for use later
Scenlist=["Bad","Very bad","Terrible"]
Scen_number=3
#Create the list of years under assessment and count the number of years
Years=[2020,2021,2022]
Totyrs=len(Years)
#Create the dataframe dprofit and for example purposes create the columns, all showing two datapoints 10 and 10
dprofit=pd.DataFrame()
a=0
b=0
#This creates column names in the format Bad profit 2020, Bad profit 2021 etc
while a<Scen_number:
    while b<Totyrs:
        dprofit[Scenlist[a]+" profit "+str(Years[b])]=[10,10]
        b=b+1
    b=0
    a=a+1
#Now that the columns have been created print the table
print(dprofit)
#Now create the new table dprofit2 which will be used to capture the three columns (bad, very bad and terrible) for the full time period by listing the years one after another
dprofit2=pd.DataFrame()
#Create the output to recall the columns from dprofit to combine into 3 lists listA0, listA1 and listA2
a=0
b=0
Totyrs=len(Years)
while a<Scen_number:
    while b<Totyrs:
        if b==0:
            print(f"listA{a}=dprofit[{Scenlist[a]} profit {Years[b]}]")
        else:
            print(f"+dprofit[{Scenlist[a]} profit {Years[b]}]")
        b=b+1
    b=0
    a=a+1
print(listA0)
#print(listA0) will not call as NameError: name 'listA0' is not defined. Did you mean: 'list'?
To fix the printing you could set the end param to end=''.
a = 0
b = 0
results = []   # optionally collect the (scenario, year) pairs as the loop runs
while a < Scen_number:
    while b < Totyrs:
        if b == 0:
            print(f"listA{a}=dprofit[{Scenlist[a]} profit {Years[b]}]", end="")
        else:
            print(f"+dprofit[{Scenlist[a]} profit {Years[b]}]", end="")
        results.append([Scenlist[a], Years[b]])
        b = b + 1
    print()
    b = 0
    a = a + 1
Output:
listA0=dprofit[Bad profit 2020]+dprofit[Bad profit 2021]+dprofit[Bad profit 2022]
listA1=dprofit[Very bad profit 2020]+dprofit[Very bad profit 2021]+dprofit[Very bad profit 2022]
listA2=dprofit[Terrible profit 2020]+dprofit[Terrible profit 2021]+dprofit[Terrible profit 2022]
To obtain a list or pd.DataFrame of the columns, you could simply filter() for the required columns. No loop required.
# anchor the pattern so "Bad profit" does not also match the "Very bad profit" columns
listA0 = dprofit.filter(regex="^Bad profit", axis=1)
listA1 = dprofit.filter(regex="^Very bad profit", axis=1)
listA2 = dprofit.filter(regex="^Terrible profit", axis=1)
print(listA1)
Output for listA1:
Very bad profit 2020 Very bad profit 2021 Very bad profit 2022
0 10 10 10
1 10 10 10
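If the goal is really one continuous Jan-to-Dec series per scenario (rather than the element-wise sum that the printed '+' expressions would produce), a small sketch that reuses the Scenlist, Years and dprofit objects from above could concatenate each scenario's yearly columns end to end; the combined dict name below is only illustrative:
combined = {}
for scen in Scenlist:
    # pick this scenario's yearly columns in order and stack them into one series
    cols = [f"{scen} profit {yr}" for yr in Years]
    combined[scen] = pd.concat([dprofit[c] for c in cols], ignore_index=True)
print(combined["Bad"])   # one series covering every year of the Bad scenario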

Expanding group by window to count nunique

I have the following df:
df=pd.DataFrame(data={'month':[1]*4+[2]*4+[3]*4,'customer':[1,2,3,4,1,5,6,7,2,3,10,7]})
I want to create an expanding window to count the number of unique customers at any point.
The output for this df should be:
{1: 4, 2: 7, 3: 8}
because in the first month we had 4 different customers; in the 2nd month, 3 were added (the other one already appeared in the first month); and in the last month only one was added (number 10).
Thanks
You can first drop the duplicated customers (only keep the first ones that appeared) and then cumulatively sum the number of (now unique) customers per month:
counts = df.drop_duplicates("customer").groupby("month").size().cumsum().to_dict()
to get
>>> counts
{1: 4, 2: 7, 3: 8}
Since there are repeated customers, you can drop those repeated customers using
df.drop_duplicates(subset='customer',ignore_index=True,inplace=True)
By default it keeps the first occurrence of each customer number and drops the subsequent occurrences. To count the number of unique customers each month,
df['customer'] = df.groupby('month')['customer'].transform('count')
df = df.drop_duplicates(ignore_index=True)
To roll the window over the customer column, calculate cumulative sum of that column
df['customer'] = df['customer'].cumsum()
It will give the desired output
   month  customer
0      1         4
1      2         7
2      3         8
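Another way to get the same dictionary from the original df in the question, shown purely as an illustrative alternative, is to grow a running set of customers month by month:
seen = set()
counts = {}
for month, grp in df.groupby("month"):
    seen.update(grp["customer"])   # add this month's customers to the running set
    counts[month] = len(seen)      # expanding count of unique customers so far
print(counts)   # {1: 4, 2: 7, 3: 8}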

How to compute the difference in monthly income for the same id

The dataframe below shows the monthly revenue of two shops (shop_id=11, shop_id=15) during the period of a few years:
data = {'shop_id': [11, 15, 15, 15, 11, 11],
        'month': [1, 1, 2, 3, 2, 3],
        'year': [2011, 2015, 2015, 2015, 2014, 2014],
        'revenue': [11000, 5000, 4500, 5500, 10000, 8000]}
df = pd.DataFrame(data)
df = df[['shop_id', 'month', 'year', 'revenue']]
display(df)
You can notice that shop_id=11 has only one entry in 2011 (January) and shop_id=15 has a few entries in 2015 (January, February, March). Nevertheless, it's interesting to note that the first shop has a few more entries in 2014:
I'm trying to optimize a custom function (used along with .apply()) that creates a new feature called diff_revenue: this feature shows the change in revenue from the previous month, for each shop:
I would like to offer some explanation on how some of the values found in diff_revenue were generated:
The value in the first cell is 0 (red) because there is no previous information for shop_id=11;
The 2nd cell is also 0 (orange) for the same reason: there is no previous information for shop_id=15;
The 3rd cell is 500 (green), because the change from this shop's last entry (January 2015) to the current cell's revenue (February 2015) is 500 Trumps.
The 5th cell is 1000 (dark blue), because the change from this shop's last entry (January 2011) to the current cell's revenue (February 2014) was 1000 Trumps.
I'm no expert in Pandas and was wondering if the Pandas gods know a better way. The DataFrame I have to work with is quite large (1M+ observations) and my current approach is too slow. I'm looking for a faster alternative, or maybe something more readable.
You more or less want to use Series.diff on the 'revenue' column, but you need to do a few additional things:
Sort to ensure your DataFrame is in chronological order (can undo this later)
Perform a groupby on 'shop_id' to do group level operations
Take the absolute value, since you don't want to distinguish between positive and negative
In terms of code:
# sort the values so they're in order when we perform a groupby
df = df.sort_values(by=['year', 'month'])
# perform a groupby on 'shop_id' and get the row-wise difference within each group
df['diff_revenue'] = df.groupby('shop_id')['revenue'].diff()
# fill NA as zero (no previous info), take absolute value, convert float -> int
df['diff_revenue'] = df['diff_revenue'].fillna(0).abs().astype('int')
# revert to original order
df = df.sort_index()
The resulting output:
shop_id month year revenue diff_revenue
0 11 1 2011 11000 0
1 15 1 2015 5000 0
2 15 2 2015 4500 500
3 15 3 2015 5500 1000
4 11 2 2014 10000 1000
5 11 3 2014 8000 2000
Edit
A slightly less straightforward solution, but maybe a bit more performant:
# sort the values so they're chronological order by shop_id
df = df.sort_values(by=['shop_id', 'year', 'month'])
# take the row-wise difference ignoring changes in shop_id
df['diff_revenue'] = df['revenue'].diff()
# zero out locations where shop_id changes (no previous info)
df.loc[df['shop_id'] != df['shop_id'].shift(), 'diff_revenue'] = 0
# Take the absolute value, convert float -> int
df['diff_revenue'] = df['diff_revenue'].abs().astype('int')
# revert to original order
df = df.sort_index()
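As a quick sanity check (a sketch, using the sample data from the question), the two variants produce the same diff_revenue values:
import pandas as pd

data = {'shop_id': [11, 15, 15, 15, 11, 11],
        'month': [1, 1, 2, 3, 2, 3],
        'year': [2011, 2015, 2015, 2015, 2014, 2014],
        'revenue': [11000, 5000, 4500, 5500, 10000, 8000]}
df = pd.DataFrame(data)

# groupby-based version
by_group = (df.sort_values(['year', 'month'])
              .groupby('shop_id')['revenue'].diff()
              .fillna(0).abs().astype('int').sort_index())

# shift-based version
s = df.sort_values(['shop_id', 'year', 'month'])
by_shift = s['revenue'].diff()
by_shift[s['shop_id'] != s['shop_id'].shift()] = 0   # zero out rows where shop_id changes
by_shift = by_shift.abs().astype('int').sort_index()

print(by_group.equals(by_shift))   # True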

Pandas Sum Values Based on Unique Values

I am trying to sum the total of unique values in a pandas data frame, but for some reason I am having difficulty getting just one number for each unique value.
A B C D
calmuni 30,000.00 CA 1-3 year paper
calmuni 95,000.00 CA 1-3 year paper
massmuni 25,000.00 MA 1-3 year paper
massmuni 30,000.00 RI 1-3 year paper
massmuni 175,000.00 MA 1-3 year paper
I am trying to sum column B based on the unique values in column A, but my groupby isn't working. I would like one summed line item for each unique value:
orders.groupby('A')['B'].sum()
A
calmuni 30,000.0095,000.00125,000.0020,000.0020,000.00...
massmuni 25,000.0030,000.00175,000.0025,000.0050,000.00..
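The run-together numbers in that output are a sign that column B is stored as text, so sum() concatenates strings instead of adding. A sketch of the usual fix (building on the existing orders frame from the groupby call above) is to strip the thousands separators and convert the column to numeric before grouping:
# column B holds strings like "30,000.00"; drop the commas and convert before summing
orders['B'] = pd.to_numeric(orders['B'].astype(str).str.replace(',', '', regex=False))
print(orders.groupby('A')['B'].sum())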

How do I tally how many times a word appears on a certain row?

I have four sets of data representing a softball schedule. Looks like this:
Day Team 1 Team 2
M A Team B Team
T C Team D Team
....
but four times over. I want to be able to change the schedule and have it automatically tally how many times a team plays on a given day. Ideas?
You would use something like this:
=COUNTIF(2:2,"A Team")
Edit:
You can use a SUMPRODUCT() function with the math operators * and +:
=SUMPRODUCT(($A$2:$A$43=H$1)*(($B$2:$B$43=$G2)+($C$2:$C$43=$G2)))
So how it works:
TRUE/FALSE is a Boolean, which reduces to 1/0 respectively, so using the * and + operators works like AND and OR respectively.
SUMPRODUCT iterates through the range and tests each criterion inside the (). It first tests whether the cell in column A is equal to H1; if so it returns a 1, or a 0 if not. The next part sets up the OR: if the team name is found in the same row, it also returns a 1. 1 * 1 = 1. SUMPRODUCT keeps track of all the 1s and 0s and adds them together, so you get the count.
If there are other columns that contain team names, just add those columns inside the + part.
OK, so let's start by making your table a real table via "Home > Format as Table" and calling it "data". Then you have three columns called data[Day], data[Team 1] and data[Team 2]. For instance this:
Day Team 1 Team 2
Monday A Team B team
Tuesday C Team D Team
Wednesday C Team A Team
Monday B Team C Team
Now comes the ugly part. You need a matrix of 7*10 (days * teams)
(Cell E1)   Team 1   Team 2   Team 3   Team 4   ...
Monday       *1
Tuesday
Wednesday
...
Formula *1 (placed in F2, next to Monday under Team 1):
=SUMPRODUCT((data[Day]=$E2)*((data[Team 1]=F$1)+(data[Team 2]=F$1)))
Now drag down that formula till Sunday and then copy it to the other teams (when I tried dragging it to the other teams, Excel messed up the column names!).
This will automatically fill the matrix and tell you which team plays how often on a specific day.
What does it do? Basically, SUMPRODUCT can not only build products, it can also evaluate Boolean conditions. So if Team A plays on Monday, then for the first schedule row the formula would return (for Team A / Monday):
1*(1+0)
SUMPRODUCT does that for each row of the schedule and then sums up the results.
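For anyone who wants the same day-by-team tally outside Excel, here is a rough pandas analogue of the SUMPRODUCT matrix (purely illustrative; the frame and column names are assumed to match the sample schedule above):
import pandas as pd

sched = pd.DataFrame({'Day': ['Monday', 'Tuesday', 'Wednesday', 'Monday'],
                      'Team 1': ['A Team', 'C Team', 'C Team', 'B Team'],
                      'Team 2': ['B Team', 'D Team', 'A Team', 'C Team']})

# reshape to one row per (day, team) appearance, then cross-tabulate days against teams
long = sched.melt(id_vars='Day', value_name='Team')
print(pd.crosstab(long['Day'], long['Team']))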