I am working on a dataset in which I want to attribute the last action of a user to a certain goal. In the process I arrive at the table below.
date | action_id | u_id | goal
2016-01-08 | CUID22 | 586758 | 'Goal#1'
2017-03-04 | CUID45 | 586758 | 'Goal#1'
2018-09-01 | CUID30 | 586758 | 'Goal#1'
How can I remove/replace the first two u_id or goal values while keeping the rows, so that I arrive at the table below?
date | action_id | u_id | goal
2016-01-08 | CUID22 | NaN | NaN
2017-03-04 | CUID45 | NaN | NaN
2018-09-01 | CUID30 | 586758 | 'Goal#1'
I believe you need duplicated:
import numpy as np

cols = ['u_id','goal']
df.loc[df.duplicated(cols, keep='last'), cols] = np.nan
Or:
cols = ['u_id','goal']
df[cols] = df[cols].mask(df.duplicated(cols, keep='last'))
print (df)
         date action_id      u_id      goal
0  2016-01-08    CUID22       NaN       NaN
1  2017-03-04    CUID45       NaN       NaN
2  2018-09-01    CUID30  586758.0  'Goal#1'
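For the sample data, the intermediate boolean mask returned by duplicated marks every row except the last occurrence, which is why only the last row keeps its values (a quick sketch to illustrate):
print (df.duplicated(cols, keep='last'))
0     True
1     True
2    False
dtype: bool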
Related
I have a data frame, let's call it df, with various columns. I am interested in the values for a particular year and want to count the FunderCode column for that year, so I use the following:
df =
ClientI OrderStartDate FunderCode
0 U27 2017-05-22 H2
1 U28 2017-05-22 H2
2 U28 2018-09-27 H3
3 U28 2019-03-19 H4
4 U29 2017-05-22 H2
HCPL2017 = df[df['OrderStartDate'].dt.year == 2017]['FunderCode'].value_counts().reset_index().rename(columns={'index': 'FunderCode', 'FunderCode': 'Count_2017'})
FunderCode Count_2017
0 HCPL2 431
1 HCPL4 188
2 HCPL3 59
3 HCPL1 2
I did the same for the years 2018, 2019, etc.
Then I merge using
pd.merge(pd.merge(HCPL2017, HCPL2018,on = "FunderCode"), HCPL2019,on ="FunderCode")
In the end, I got a merged table
FunderCode Count_2017 Count_2018 Count_2019
3 H1 2 85 207
0 H2 431 591 724
2 H3 59 205 372
1 H4 188 201 282
Is there a way to make the process quicker and get the information from the original data frame in fewer steps? I have many more years, so I am wondering if I can filter all of them and get the counts at once.
You can use pandas pivot_table:
df["year"] = pd.to_datetime(df["OrderStartDate"]).dt.year
(
    pd.pivot_table(df, values="ClientI", index="FunderCode", columns="year", aggfunc="count")
    .add_prefix("Count_")
    .reset_index()
    .rename_axis(None, axis=1)
    #.fillna(0) if you want to replace the NaNs
)
Output:
+----+--------------+--------------+--------------+--------------+
| | FunderCode | Count_2017 | Count_2018 | Count_2019 |
|----+--------------+--------------+--------------+--------------|
| 0 | H2 | 3 | nan | nan |
| 1 | H3 | nan | 1 | nan |
| 2 | H4 | nan | nan | 1 |
+----+--------------+--------------+--------------+--------------+
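If you would rather get 0 instead of NaN for the missing year/code combinations, a similar sketch with pd.crosstab (which fills missing counts with 0 by default) would be:
year = pd.to_datetime(df["OrderStartDate"]).dt.year
pd.crosstab(df["FunderCode"], year).add_prefix("Count_").reset_index().rename_axis(None, axis=1)
For the small sample frame this gives:
  FunderCode  Count_2017  Count_2018  Count_2019
0         H2           3           0           0
1         H3           0           1           0
2         H4           0           0           1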
Suppose that I have a dataframe as follows:
+---------+-------+------------+
| Product | Price | Calculated |
+---------+-------+------------+
| A | 10 | 10 |
| B | 20 | NaN |
| C | 25 | NaN |
| D | 30 | NaN |
+---------+-------+------------+
The above can be created using the code below:
import numpy as np
import pandas as pd

data = {'Product': ['A', 'B', 'C', 'D'],
        'Price': [10, 20, 25, 30],
        'Calculated': [10, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data)
I want to update the column Calculated on the fly. For the 2nd row, Calculated = previous Calculated / previous Price, i.e. Calculated at row 2 is 10/10 = 1.
Now that we have the value for row 2, the Calculated value for row 3 would be 1/20 = 0.05, and so on and so forth.
Expected Output
+---------+-------+------------+
| Product | Price | Calculated |
+---------+-------+------------+
| A | 10 | 10 |
| B | 20 | 1 |
| C | 25 | 0.05 |
| D | 30 | 0.002 |
+---------+-------+------------+
The above can be achieved using loops, but I don't want to use loops; instead I need a vectorized approach to update the column Calculated. How can I achieve that?
You are looking for cumprod with a shift:
# you could also use df['Calculated'].iloc[0] instead of .ffill()
df['Calculated'] = df['Calculated'].ffill()/df.Price.cumprod().shift(fill_value=1)
Output:
Product Price Calculated
0 A 10 10.000
1 B 20 1.000
2 C 25 0.050
3 D 30 0.002
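As a sanity check, a plain loop over the rows reproduces the same numbers (a small sketch for verification only, not the vectorized solution):
calc = [df['Calculated'].iloc[0]]
for price in df['Price'].iloc[:-1]:
    #each value is the previous Calculated divided by the previous Price
    calc.append(calc[-1] / price)
print (calc)
[10.0, 1.0, 0.05, 0.002]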
I have a MySQL table as shown below:
ID | article | price | promo_price | delivery_days | stock | received_on
17591 03D/6H 3082.00 1716.21 30 0 2019-03-20
29315 03D/6H 3082.00 1716.21 26 0 2019-03-24
47796 03D/6H 3082.00 1716.21 24 0 2019-03-25
22016 L1620S 685.00 384.81 0 3 2019-03-20
35043 L1620S 685.00 384.81 0 2 2019-03-24
53731 L1620S 685.00 384.81 0 2 2019-03-25
I created a pivot table to monitor the stock data.
md = df.pivot_table(
    values='stock',
    index=['article', 'price', 'promo_price', 'delivery_days'],
    columns='received_on',
    aggfunc=np.sum)
dates = md.columns.tolist()
dates.sort(reverse=True)
md = md[dates]
This is the result:
+---------------------------------+--------------+--------------+--------------+
| | 2019-03-25 | 2019-03-24 | 2019-03-20 |
|---------------------------------+--------------+--------------+--------------|
| ('03D/6H', 3082.0, 1716.21, 24) | 0 | nan | nan |
| ('03D/6H', 3082.0, 1716.21, 26) | nan | 0 | nan |
| ('03D/6H', 3082.0, 1716.21, 30) | nan | nan | 0 |
| ('L1620S-KD', 685.0, 384.81, 0) | 2 | 2 | 3 |
+---------------------------------+--------------+--------------+--------------+
How do I filter the rows and get the price, promo_price and delivery_days of an article based on the most recent stock received date?
For example, I want the stock info for all the days, but the price, promo_price and delivery_days of only 2019-03-25, as shown below:
+---------------------------------+--------------+--------------+--------------+
| | 2019-03-25 | 2019-03-24 | 2019-03-20 |
|---------------------------------+--------------+--------------+--------------|
| ('03D/6H', 3082.0, 1716.21, 24) | 0 | nan | nan |
| ('L1620S', 685.0, 384.81, 0) | 2 | 2 | 3 |
+---------------------------------+--------------+--------------+--------------+
EDIT:
If there is no change in price, promo_price and delivery_days, I get the result as expected. But if any of those values change, I get multiple rows for the same article.
Article L1620S comes out as expected, but article 03D/6H resulted in three rows.
You can use:
df['received_on'] = pd.to_datetime(df['received_on'])
md = df.pivot_table(
    values='stock',
    index=['article', 'price', 'promo_price', 'delivery_days'],
    columns='received_on',
    aggfunc=np.sum)
#sorting columns in descending order
md = md.sort_index(axis=1, ascending=False)
#remove missing rows in first column
md = md.dropna(subset=[md.columns[0]])
#another solution
#md = md[md.iloc[:, 0].notna()]
print (md)
received_on 2019-03-25 2019-03-24 2019-03-20
article price promo_price delivery_days
03D/6H 3082.0 1716.21 24 0.0 NaN NaN
L1620S 685.0 384.81 0 2.0 2.0 3.0
EDIT: First filter by the first index level and then take the first row by position:
md = md.sort_index(axis=1, ascending=False)
idx = pd.IndexSlice
md1 = md.loc[idx['03D/6H',:,:],:].iloc[[0]]
print (md1)
received_on 2019-03-25 2019-03-24 2019-03-20
article price promo_price delivery_days
03D/6H 3082.0 1716.21 24 0.0 NaN NaN
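If you want this first row for every article at once rather than only for '03D/6H', a sketch using groupby on the article level returns the same rows as the expected table above:
md1_all = md.groupby(level='article').head(1)
print (md1_all)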
Given a dataframe
df=pd.DataFrame([[1,11,0],[1,12,1],[2,21,0],[2,22,1]])
df.columns = ['Key','Value','PivotOn']
pivoted = df.pivot(index='Key',columns='PivotOn',values='Value')
The pivot action will give me columns 0 and 1 from the column 'PivotOn'. But I would like to always pivot onto the values 0, 1 and 2, even if no row has PivotOn = 2 (just produce NaN for it).
I cannot modify the original dataframe, so I'd want something like:
pivoted = df.pivot(index='Key',columns=[0,1,2],values='Value')
where it will always produce 3 columns of 0, 1 and 2 and column 2 is filled with nans.
Assume PivotOn has three unique values 0, 1, 2
df=pd.DataFrame([[1,11,0],[1,12,1],[2,21,0],[2,22,2]])
df.columns = ['Key','Value','PivotOn']
df
+---+-----+-------+---------+
| | Key | Value | PivotOn |
+---+-----+-------+---------+
| 0 | 1 | 11 | 0 |
| 1 | 1 | 12 | 1 |
| 2 | 2 | 21 | 0 |
| 3 | 2 | 22 | 2 |
+---+-----+-------+---------+
And say you need to include columns 2, 3 and 4 (you can also assume that 2 may or may not be present in the original df, so this generalizes).
Then go as follows:
expected = {2, 3, 4}
res = list(expected - set(df.PivotOn.unique()))
if len(res) > 0:
    new_df = pd.DataFrame({'Key': np.nan, 'Value': np.nan, 'PivotOn': res}, index=range(df.shape[0], df.shape[0] + len(res)))
    ndf = pd.concat([df, new_df], sort=False)
    pivoted = ndf.pivot(index='Key', columns='PivotOn', values='Value').dropna(how='all')
else:
    pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
pivoted
+---------+------+------+------+-----+-----+
| PivotOn | 0 | 1 | 2 | 3 | 4 |
+---------+------+------+------+-----+-----+
| Key | | | | | |
| 1.0 | 11.0 | 12.0 | NaN | NaN | NaN |
| 2.0 | 21.0 | NaN | 22.0 | NaN | NaN |
+---------+------+------+------+-----+-----+
You might try this if all you need is a column 2 filled with NaNs when it does not exist in your dataframe:
def no_col_2(df):
    pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
    #check the column's values (not its index) and add the missing column
    if 2 not in df['PivotOn'].values:
        pivoted[2] = np.nan
    return pivoted

pivoted = no_col_2(df)
print(pivoted)
PivotOn 0 1 2
Key
1 11 12 NaN
2 21 22 NaN
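A shorter variant of the same idea, assuming all you need is to guarantee that the columns 0, 1 and 2 exist, is to reindex the pivoted columns; any missing column comes back filled with NaN:
pivoted = df.pivot(index='Key', columns='PivotOn', values='Value').reindex(columns=[0, 1, 2])
print(pivoted)
which prints the same table as above for the original df.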
My text file has tables for each database. Is there any way that pandas can read this file and create a separate dataframe for each database?
Database: ABC
+-----------------------------------------------+----------+------------+
| Tables | Columns | Total Rows |
+-----------------------------------------------+----------+------------+
| ApplicationUpdateBankLog | 13 | 0 |
| ChangeLogTemp | 12 | 1678363 |
| Sheet2$ | 10 | 359 |
| tempAllowApplications | 1 | 9 |
+-----------------------------------------------+----------+------------+
4 rows in set.
Database: XYZ
+--------------------------------------------------+----------+------------+
| Tables | Columns | Total Rows |
+--------------------------------------------------+----------+------------+
| BKP_QualificationDetails_12082014 | 14 | 7959877 |
| BillNotGeneratedCount | 11 | 2312 |
| VVshipBenefit | 19 | 197356 |
| VVBenefit_Bkup29012016 | 19 | 101318 |
+--------------------------------------------------+----------+------------+
4 rows in set.
You can use a dict comprehension to create a dict of DataFrames:
import pandas as pd
from io import StringIO
temp=u"""Database: ABC
+-----------------------------------------------+----------+------------+
| Tables | Columns | Total Rows |
+-----------------------------------------------+----------+------------+
| ApplicationUpdateBankLog | 13 | 0 |
| ChangeLogTemp | 12 | 1678363 |
| Sheet2$ | 10 | 359 |
| tempAllowApplications | 1 | 9 |
+-----------------------------------------------+----------+------------+
4 rows in set.
Database: XYZ
+--------------------------------------------------+----------+------------+
| Tables | Columns | Total Rows |
+--------------------------------------------------+----------+------------+
| BKP_QualificationDetails_12082014 | 14 | 7959877 |
| BillNotGeneratedCount | 11 | 2312 |
| VVshipBenefit | 19 | 197356 |
| VVBenefit_Bkup29012016 | 19 | 101318 |
+--------------------------------------------------+----------+------------+
4 rows in set."""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['a', 'Tables', 'Columns', 'Total Rows'])
#mask values in column a that do not start with 'Database' as NaN, then forward fill
df.a = df.a.where(df.a.str.startswith('Database')).ffill()
#remove rows where NaN in Tables column
df = df.dropna(subset=['Tables'])
#remove all whitespaces, set index for selecting in dict comprehension
df = df.apply(lambda x: x.str.strip()).set_index('a')
#convert to numeric columns, replace NaN, convert to int
df['Columns'] = pd.to_numeric(df['Columns'], errors='coerce').fillna(0).astype(int)
df['Total Rows'] = pd.to_numeric(df['Total Rows'], errors='coerce').fillna(0).astype(int)
#remove rows with value Tables
df = df[df['Tables'] != 'Tables']
print (df)
Tables Columns Total Rows
a
Database: ABC ApplicationUpdateBankLog 13 0
Database: ABC ChangeLogTemp 12 1678363
Database: ABC Sheet2$ 10 359
Database: ABC tempAllowApplications 1 9
Database: XYZ BKP_QualificationDetails_12082014 14 7959877
Database: XYZ BillNotGeneratedCount 11 2312
Database: XYZ VVshipBenefit 19 197356
Database: XYZ VVBenefit_Bkup29012016 19 101318
#select in dict comprehension and reset index to default monotonic index
dfs = {x:df.loc[x].reset_index(drop=True) for x in df.index.unique()}
print (dfs['Database: ABC'])
Tables Columns Total Rows
0 ApplicationUpdateBankLog 13 0
1 ChangeLogTemp 12 1678363
2 Sheet2$ 10 359
3 tempAllowApplications 1 9
print (dfs['Database: XYZ'])
Tables Columns Total Rows
0 BKP_QualificationDetails_12082014 14 7959877
1 BillNotGeneratedCount 11 2312
2 VVshipBenefit 19 197356
3 VVBenefit_Bkup29012016 19 101318
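If the delimiter-based read_csv parsing above feels fragile, an alternative sketch is to split the raw text on 'Database:' and build each DataFrame directly (this reuses the temp string defined above; with a real file you would read its contents into a string first):
dfs = {}
for block in temp.split('Database:')[1:]:
    name, *lines = block.splitlines()
    #keep only the data rows (they start with '|') and skip the header row
    rows = [[c.strip() for c in line.strip().strip('|').split('|')]
            for line in lines
            if line.lstrip().startswith('|') and 'Total Rows' not in line]
    d = pd.DataFrame(rows, columns=['Tables', 'Columns', 'Total Rows'])
    d[['Columns', 'Total Rows']] = d[['Columns', 'Total Rows']].astype(int)
    dfs['Database: ' + name.strip()] = d

print (dfs['Database: ABC'])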