Merge two dataframes based on a column that holds split values - pandas

I have two data frames. One of the data frames appears to be as follows:
The products column contains data like 1;3;5.
The other data frame looks like:
I am merging both of the frames:
Merge_Store_Transaction['products'] = Merge_Store_Transaction['products'].str.split(';')
Merge_Store_Transaction = Merge_Store_Transaction.explode('products')
This gives me a result in which the values of all the other columns are duplicated across the exploded rows, which I don't want. Is there a way to divide the profit column by the number of products in each row and repeat that value, or else to fill the extra rows with zero?

I think that once you have this result, you can do something like the following:
# Assumes group_id and date together identify one original (pre-explode) row.
Merge_Store_Transaction["profit"] /= Merge_Store_Transaction.groupby(["group_id", "date"])["profit"].transform("size")
Same thing for the revenue_in_usd column.
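As a minimal sketch of both options from the question, assuming a small made-up frame (the sample values and the names `merged`, `exploded`, `profit_per_product` and `profit_first_only` are hypothetical), the original index that `explode` repeats can be used to count the rows each transaction produced:

```python
import pandas as pd

# Hypothetical stand-in for the merged frame; only "products" and "profit"
# come from the question, the values are made up.
merged = pd.DataFrame({
    "group_id": [1, 2],
    "products": ["1;3;5", "2;4"],
    "profit": [30.0, 10.0],
})

merged["products"] = merged["products"].str.split(";")
exploded = merged.explode("products")

# explode() repeats the original index label on every new row, so grouping
# on the index gives the number of products per original row.
n_products = exploded.groupby(level=0)["profit"].transform("size")
exploded["profit_per_product"] = exploded["profit"] / n_products

# Alternative: keep the full profit on the first row and zero on the rest.
is_first = ~exploded.index.duplicated(keep="first")
exploded["profit_first_only"] = exploded["profit"].where(is_first, 0.0)
```

Either new column sums back to the original total profit, which is a quick way to check that nothing was double counted.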

Related

Compile a count of similar rows in a Pandas Dataframe based on multiple column values

I have two Dataframes, one containing my data read in from a CSV file and another that has the data grouped by all of the columns but the last and reindexed to contain a column for the count of the size of the groups.
df_k1 = pd.read_csv(filename, sep=';')
columns_for_groups = list(df_k1.columns)[:-1]
k1_grouped = df_k1.groupby(columns_for_groups).size().reset_index(name="Count")
I need to create a series such that every row(i) in the series corresponds to row(i) in my original Dataframe but the contents of the series need to be the size of the group that the row belongs to in the grouped Dataframe. I currently have this, and it works for my purposes, but I was wondering if anyone knew of a faster or more elegant solution.
size_by_row = []
for row in df_k1.itertuples():
    for group in k1_grouped.itertuples():
        # Compare the grouping columns only (skip the index and the last column).
        if row[1:-1] == group[1:-1]:
            size_by_row.append(group[-1])
            break
group_size = pd.Series(size_by_row)
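One possible vectorised alternative to the nested loop is `groupby(...).transform("size")`, which returns a value for every original row instead of one per group. The sketch below uses a small made-up frame in place of the CSV; the column names `a`, `b` and `value` are assumptions.

```python
import pandas as pd

# Hypothetical data standing in for the CSV from the question.
df_k1 = pd.DataFrame({
    "a": [1, 1, 2, 1],
    "b": ["x", "x", "y", "z"],
    "value": [0.1, 0.2, 0.3, 0.4],
})

columns_for_groups = list(df_k1.columns)[:-1]
last_column = df_k1.columns[-1]

# transform("size") broadcasts each group's row count back onto its rows,
# so the result is aligned with df_k1 row by row.
group_size = df_k1.groupby(columns_for_groups)[last_column].transform("size")
```

Because the result shares the index of `df_k1`, it can be assigned directly as a new column if that is more convenient than a separate series.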

NaN output when multiplying row and column of dataframe in pandas

I have two data frames. The first one looks like this:
and the second one like so:
I am trying to multiply the values in the number of donors column of the second data frame (96 values) by the values in columns 0-95 of the first row of the first data frame (also 96 values).
Below is the code I have for multiplying the two right now, but as you can see the values are all NaN:
Does anyone know how to fix this?
Your second dataframe has dtype object; you must convert it to float:
df_sls.iloc[0,3:-1].astype(float)
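The multiplication code from the question is not shown, so the following is only a sketch under assumptions: `row` and `donors` are hypothetical stand-ins for the two operands. It shows the float conversion and also a second common source of all-NaN results, namely that pandas aligns two Series on their index labels before multiplying.

```python
import pandas as pd

# Hypothetical operands: an object-dtype row and a numeric column whose
# index labels do not match the row's labels.
row = pd.Series(["1.5", "2.0", "3.0"], index=["0", "1", "2"], dtype=object)
donors = pd.Series([2, 4, 6], index=["a", "b", "c"], name="number of donors")

# Converting to float is not enough when the labels differ: pandas aligns
# on the index and fills every unmatched label with NaN.
aligned = row.astype(float) * donors

# Multiplying the underlying arrays ignores labels and works by position.
positional = row.astype(float).to_numpy() * donors.to_numpy()
```

If the positional product is what you want, wrapping it back into a Series with the index of your choice keeps the labels explicit.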

Filtering a dataframe on combinations of column values

I have a dataframe with two columns named TABLEID and STATID.
Both columns contain a range of different values.
When I filter the dataframe on the values '101PC' and 'ST101', I get 14K records, and when I filter it on '102HT' and 'ST102', I also get 14K records. The issue is that when I try to combine both filters as shown below, I get an empty dataframe. I was expecting 28K records in the result. Any help is much appreciated.
df[df[['TABLEID','STATID']].apply(tuple, axis = 1).isin([('101PC', 'ST101'), ('102HT','ST102')])]
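As a hedged sketch of one other way to express the same pair filter, the two columns can be turned into a `MultiIndex` and tested with `isin`. The sample rows below are made up, and `pairs`, `keys`, `mask` and `result` are hypothetical names.

```python
import pandas as pd

# Hypothetical sample; the real frame has about 14K rows per pair.
df = pd.DataFrame({
    "TABLEID": ["101PC", "102HT", "101PC", "103XX"],
    "STATID": ["ST101", "ST102", "ST102", "ST103"],
})

pairs = [("101PC", "ST101"), ("102HT", "ST102")]

# Build a MultiIndex from the two columns and keep rows whose
# (TABLEID, STATID) combination is one of the wanted pairs.
keys = pd.MultiIndex.from_frame(df[["TABLEID", "STATID"]])
mask = keys.isin(pairs)
result = df[mask]
```

If a filter like this still comes back empty on the real data, comparing `repr()` of a few values against the literals may be worth a try, since stray whitespace or a different dtype would prevent an exact match.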

Can we sort multiple data frames comparing values of each element in column

I have two csv files having some data and I would like to combine and sort data based on one common column:
Here are the data1.csv and data2.csv files:
data3.csv is the output file, in which I need the data to be combined and sorted as below:
How can I achieve this?
Here's what I think you want to do here:
I created two dataframes with simple types; assume the first column plays the role of your timestamp:
import pandas as pd
df1 = pd.DataFrame([[1,1],[2,2], [7,10], [8,15]], columns=['timestamp', 'A'])
df2 = pd.DataFrame([[1,5],[4,7],[6,9], [7,11]], columns=['timestamp', 'B'])
c = df1.merge(df2, how='outer', on='timestamp')
print(c)
The outer merge causes each contributing DataFrame to be fully present in the output even if not matched to the other DataFrame.
The result is that you end up with a DataFrame with a timestamp column and the dependent data from each of the source DataFrames.
Caveats:
You have repeating timestamps in your second sample, which I assume may be due to the fact you do not show enough resolution. You would not want true duplicate records for this merge solution, as we assume timestamps are unique.
I have not repeated the timestamp column here a second time, but it is easy to add in another timestamp column based on whether column A or B is notnull() if you really need to have two timestamp columns. Pandas merge() has an indicator option which would show you the source of the timestamp if you did not want to rely on columns A and B.
In the post you have two output columns named "timestamp". Generally you would not output two columns with the same name, since they are only distinguished by position (or color), which are not properties you should rely upon.
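A minimal sketch of the `indicator` option mentioned in the caveats, reusing the same toy frames and adding an explicit sort on the shared column (the name `combined` is hypothetical):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 1], [2, 2], [7, 10], [8, 15]], columns=["timestamp", "A"])
df2 = pd.DataFrame([[1, 5], [4, 7], [6, 9], [7, 11]], columns=["timestamp", "B"])

# indicator=True adds a "_merge" column saying whether each row came from
# the left frame, the right frame, or both.
combined = (
    df1.merge(df2, how="outer", on="timestamp", indicator=True)
       .sort_values("timestamp")
       .reset_index(drop=True)
)
```

Rows present in only one source keep NaN in the other frame's column, so `_merge` gives a second way to tell where a timestamp came from.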

How to group by and sum several columns?

I have a big dataframe with several columns that contain strings, numbers, etc. I am trying to group by SCENARIO and then sum only the columns between 2020 and 2050. So far I have only managed to sum a single column, as shown below, but I need to replace '2050' with all the columns between 2020 and 2050.
df1 = df.groupby(["SCENARIO"])['2050'].sum().sum(axis=0)
You are creating a subset of the df with only that single column. I can't tell what your dataset looks like from the information provided, but try:
df.groupby(["SCENARIO"]).sum()
This should sum up the rows of every column within each SCENARIO group.
Alternatively, select only the columns you want to sum:
df.groupby(["SCENARIO"])[["column1","column2"]].sum()
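To avoid typing every year by hand, the column list can be built from the column names. This is only a sketch: it assumes the year columns are named with strings such as '2050' (as in the question's code), and the sample frame, `REGION` and `year_cols` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame mixing string columns and year columns.
df = pd.DataFrame({
    "SCENARIO": ["A", "A", "B"],
    "REGION": ["x", "y", "x"],
    "2010": [1, 1, 1],
    "2020": [1, 2, 3],
    "2030": [10, 20, 30],
    "2050": [100, 200, 300],
})

# Keep only columns whose name is a year between 2020 and 2050 inclusive.
year_cols = [c for c in df.columns if c.isdigit() and 2020 <= int(c) <= 2050]

sums = df.groupby("SCENARIO")[year_cols].sum()
```

If the year columns sit next to each other in the frame, a label slice such as `df.loc[:, "2020":"2050"].columns` is another way to collect the same names.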