Data Preparation of a given .CSV

test_sales
I would like to get a new dataset (a new CSV) with the following column structure:
1) order_number
2) number_of_orders_past_3_months
3) gross_sum_orders_past_3_months
4) number_of_orders_future_3_months
5) gross_sum_orders_future_3_months
The resulting data frame should be unique by order number.
I have done the following, but I am not sure whether it is right:
import pandas as pd
df1 = pd.read_csv('test_sales')
# to_csv returns None when given a path, so its result is not assigned
df1.to_csv('order_count_sum.csv')
df2 = pd.read_csv('order_count_sum.csv')
df2['order_datetime'] = pd.to_datetime(df2['order_datetime'])
df2['number_of_orders_past_3_months'] = df2.groupby('order_number')['order_number'].value_counts().reset_index(drop=True)
df2['gross_sum_orders_past_3_months'] = df2.groupby(['order_number']).rolling(3, min_periods=1, on='order_datetime')['gross_value'].sum().reset_index(drop=True)
df2.head(20)
the new .csv
How can I get the following columns?
number_of_orders_future_3_months
gross_sum_orders_future_3_months
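
One possible reading of the task, sketched under loud assumptions: the input has exactly one row per order_number, "3 months" means a calendar window measured from each order's own order_datetime, and the order itself is excluded from its own counts. The sample rows are hypothetical; only the column names come from the question. Note that rolling(3) counts three rows, not three months, so this sketch filters on the dates explicitly instead.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
sales = pd.DataFrame({
    "order_number": [1, 2, 3, 4],
    "order_datetime": pd.to_datetime(
        ["2021-01-10", "2021-02-15", "2021-05-01", "2021-09-01"]
    ),
    "gross_value": [10.0, 20.0, 30.0, 40.0],
})

window = pd.DateOffset(months=3)  # assumption: calendar months

records = []
for order in sales.itertuples(index=False):
    t = order.order_datetime
    # Exclude the order itself (assumption) and compare against all other orders.
    others = sales[sales["order_number"] != order.order_number]
    when = others["order_datetime"]
    past = others[(when >= t - window) & (when < t)]
    future = others[(when > t) & (when <= t + window)]
    records.append({
        "order_number": order.order_number,
        "number_of_orders_past_3_months": len(past),
        "gross_sum_orders_past_3_months": past["gross_value"].sum(),
        "number_of_orders_future_3_months": len(future),
        "gross_sum_orders_future_3_months": future["gross_value"].sum(),
    })

# One row per order_number, with the five requested columns.
result = pd.DataFrame(records)
print(result)
```

This loop compares every order with every other order, so it is only meant for small files. If the counts should be per customer, the same filters would also need a customer column, which the question does not show.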

Related

Creating pandas columns with for loop

I have the following dataframe created through the following chunk of code:
df = pd.DataFrame(
    [
        (13412339, '07/03/2022', '08/03/2022', '10/03/2022', 1),
        (13412343, '07/03/2022', '07/03/2022', '09/03/2022', 0),
        (13412489, '07/02/2022', '08/02/2022', '07/03/2022', 0),
    ],
    columns=['task_id', 'start_date', 'end_date', 'end_period', 'status']
)
df = df.astype(dtype={'status' : bool})
df.start_date = pd.to_datetime(df.start_date)
df.end_date = pd.to_datetime(df.end_date)
df.end_period = pd.to_datetime(df.end_period)
What I need to do here is to calculate the difference in days between the start_date and end_date columns if the status column is False, else it should do the same but between start_date and end_period columns.
The code that I have implemented to calculate the days differences between the start_date and end_date columns is as follows:
import datetime as dt
new_frame = pd.DataFrame()
for row in range(df.shape[0]):
    # Extract the row
    extracted_row = df.loc[row,:]
    # Calculate the date difference in days for this row
    diff = extracted_row['end_date'] - extracted_row['start_date']
    diff_days = diff.days
    # Iterate over the date difference and repeat the row for each full day
    for i in range(diff_days+1):
        new_row = extracted_row.copy()
        new_row['date'] = new_row['start_date'] + dt.timedelta(days=i)
        new_row = new_row[['task_id','start_date','end_date',
                           'end_period','date','status']]
        # Append the created row to the new dataframe
        new_frame = new_frame.append(new_row)
# Rearrange columns in the desired order
new_frame = new_frame[['task_id','start_date','end_date','end_period','date','status']]
# Change data types
new_frame = new_frame.astype(dtype={'task_id' : int,'status' : bool})
Then, to make the calculation depend on the status column, I tried the following:
new_frame1 = pd.DataFrame()
new_frame2 = pd.DataFrame()
for row in range(df.shape[0]):
    # In this branch, the status column should be False
    if df['status'] == False:
        # Extract the row
        extracted_row_end = df.loc[row,:]
        # Calculate the date difference in days for this row
        diff1 = extracted_row_end['end_date'] - extracted_row_end['start_date']
        diff_days_end = diff1.days
        # Iterate over the date difference and repeat the row for each full day
        for i in range(diff_days_end+1):
            new_row_end = extracted_row_end.copy()
            new_row_end['date'] = new_row_end['start_date'] + dt.timedelta(days=i)
            new_row_end = new_row_end[['task_id','start_date','end_date',
                                       'end_period','date','status']]
            # Append the created row to the new dataframe
            new_frame1 = new_frame1.append(new_row_end)
        # Rearrange columns in the desired order
        new_frame = new_frame[['task_id','start_date','end_date','end_period','date','status']]
        # Change data types
        new_frame = new_frame.astype(dtype={'task_id' : int,'status' : bool})
    # In this branch, the status column should be True
    else:
        # Extract the row
        extracted_row_period = df.loc[row,:]
        # Calculate the date difference in days for this row
        diff2 = extracted_row_period['end_period'] - extracted_row_period['start_date']
        diff_days_period = diff2.days
        # Iterate over the date difference and repeat the row for each full day
        for i in range(diff_days_period+1):
            new_row_period = extracted_row_end.copy()
            new_row_period['date'] = new_row_period['start_date'] + dt.timedelta(days=i)
            new_row_period = new_row_period[['task_id','start_date','end_date',
                                             'end_period','date','status']]
            # Append the created row to the new dataframe
            new_frame2 = new_frame2.append(new_row_period)
        # Rearrange columns in the desired order
        new_frame = new_frame[['task_id','start_date','end_date','end_period','date','status']]
        # Change data types
        new_frame = new_frame.astype(dtype={'task_id' : int,'status' : bool})
# Merge both dataframes
frames = [new_frame1,new_frame2]
df = pd.concat(frames)
It throws an error as soon as the first for loop starts. This is where I need help: how can I calculate the difference in days between the start_date and end_date columns if the status column is False, and otherwise between the start_date and end_period columns?
The complete error is as follows:

Some parts of your code did not work on my machine, so I just took the initial df from your first cell. Reading what you need, this is what I would do:
import numpy as np
df['dayDiff']=np.where(df['status'],(df['end_period']-df['start_date']).dt.days,(df['end_date']-df['start_date']).dt.days)
df
As you already have booleans in df['status'], I would use that column as the np.where condition: it takes the day difference (df['end_period']-df['start_date']).dt.days where status is True, and the day difference (df['end_date']-df['start_date']).dt.days where status is False.
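
A minimal self-contained sketch of the same np.where selection, for checking the result by hand. The frame name tasks and the two sample rows are hypothetical, and the dates are written in ISO format so that parsing does not depend on a day-first or month-first reading.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row sample with unambiguous ISO dates.
tasks = pd.DataFrame({
    "start_date": pd.to_datetime(["2022-03-07", "2022-03-07"]),
    "end_date": pd.to_datetime(["2022-03-08", "2022-03-07"]),
    "end_period": pd.to_datetime(["2022-03-10", "2022-03-09"]),
    "status": [True, False],
})

# Compute both candidate differences for every row ...
days_to_period = (tasks["end_period"] - tasks["start_date"]).dt.days
days_to_end = (tasks["end_date"] - tasks["start_date"]).dt.days

# ... then pick one per row: end_period when status is True, end_date otherwise.
tasks["dayDiff"] = np.where(tasks["status"], days_to_period, days_to_end)
print(tasks)
```

Strings such as '07/03/2022' are read month-first by pd.to_datetime unless dayfirst=True is passed, so the differences on the original frame depend on which format the data actually uses.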

How can I leave every answer from 'for'

I think my code works well.
The problem is that it does not keep every answer in DataFrame R.
When I print R, only the last answer appears.
What should I do to display every answer?
I want to add each answer as the next column.
import numpy as np
import pandas as pd
DATA = pd.DataFrame()
# Raw string so the backslashes in the Windows path are not treated as escapes
DATA = pd.read_excel(r'C:\gskim\P4DS/Monthly Stock adjClose2.xlsx')
DATA = DATA.set_index("Date")
DATA1 = np.log(DATA/DATA.shift(1))
DATA2 = DATA1.drop(DATA1.index[0])*100
F = pd.DataFrame(index = DATA2.index)
for i in range(0, 276):
    Q = DATA2.iloc[i].dropna()
    W = sorted(abs(Q), reverse = False)
    W_qcut = pd.qcut(W, 5, labels = ['A', 'B', 'C', 'D', 'E'])
    F = Q.groupby(W_qcut).sum()
R = pd.DataFrame(F)
print(R)
The first table is the current result; I want every blank cell of the second table to be filled in as the result:
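
A sketch of one way to keep every iteration's result, assuming the intent is to bin each month's returns into quintiles by absolute size and sum the returns per bin. Each pass stores its result in a dict instead of overwriting F, and the pieces are joined into columns at the end. The random returns table and the names returns and per_month are hypothetical stand-ins for DATA2.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for DATA2: rows are months, columns are stocks.
rng = np.random.default_rng(0)
returns = pd.DataFrame(
    rng.normal(size=(4, 10)),
    index=["2020-01", "2020-02", "2020-03", "2020-04"],
    columns=[f"stock_{k}" for k in range(10)],
)

per_month = {}
for month, row in returns.iterrows():
    q = row.dropna()
    # Quintile label (A..E) for each stock, based on the absolute return.
    bins = pd.qcut(q.abs(), 5, labels=["A", "B", "C", "D", "E"])
    # Store this month's sums under its own key instead of overwriting.
    per_month[month] = q.groupby(bins, observed=False).sum()

# One column per month, one row per quintile label.
R = pd.concat(per_month, axis=1)
print(R)
```

Unlike the loop in the question, this keeps the quintile labels aligned with the stocks by calling qcut on a Series rather than on a sorted list.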

R: How to plot the last row of a dataframe?

This must be very easy, but I cannot get a plot of the last/any row of a dataframe.
A = data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
barplot(A[nrow(A),1:3])
I get the error message:
Error in barplot.default(A[nrow(A), 1:3]) :
'height' must be a vector or a matrix
A solution using ggplot would be very welcome!

I loaded the ggplot2 library and the dataset you gave. I used tail() to get only the last row, then melt() from reshape2 to get the data into the right format, and plotted it with ggplot2.
library(ggplot2)
library(reshape2)
A = data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
A_tail <- tail(A, 1)
tailmelt <- melt(A_tail)
ggplot(data = tailmelt, aes( x = factor(variable), y = value, fill = variable ) ) +
  geom_bar( stat = 'identity' )

Dataframe index rows all 0's

I'm iterating through PDFs to obtain the text entered in the form fields. When I send the rows to a CSV file, only the last row is exported. When I print the results from the DataFrame, all the row indexes are 0. I have tried various solutions from Stack Overflow, but I can't get anything to work: what should be 0, 1, 2, 3, etc. comes out as 0, 0, 0, 0, etc.
Here is what I get when printing results, only the last row exports to csv file:
0
0 1938282828
0
0 1938282828
0
0 22222222
infile = glob.glob('./*.pdf')
for i in infile:
    if i.endswith('.pdf'):
        pdreader = PdfFileReader(open(i,'rb'))
        diction = pdreader.getFormTextFields()
        myfieldvalue2 = str(diction['ID'])
        df = pd.DataFrame([myfieldvalue2])
        print(df)
Thank you for any help!

You are replacing the same dataframe each time:
infile = glob.glob('./*.pdf')
for i in infile:
    if i.endswith('.pdf'):
        pdreader = PdfFileReader(open(i,'rb'))
        diction = pdreader.getFormTextFields()
        myfieldvalue2 = str(diction['ID'])
        df = pd.DataFrame([myfieldvalue2]) # this creates new df each time
        print(df)
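Correct Code:
infile = glob.glob('./*.pdf')
values = []
for i in infile:
    if i.endswith('.pdf'):
        pdreader = PdfFileReader(open(i,'rb'))
        diction = pdreader.getFormTextFields()
        myfieldvalue2 = str(diction['ID'])
        values.append(myfieldvalue2) # collect one value per file
# build the dataframe once, so the index runs 0, 1, 2, ...
# (DataFrame.append was removed in pandas 2.0 and kept the 0 index unless ignore_index=True was passed)
df = pd.DataFrame(values)
print(df)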
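
The collect-then-build pattern can be tried without any PDF files. In this sketch the list forms is a hypothetical stand-in for the dictionaries returned per file, and the column name ID is chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the form fields read from each PDF.
forms = [{"ID": 1938282828}, {"ID": 1938282828}, {"ID": 22222222}]

ids = []
for fields in forms:
    # Collect plain values; no dataframe is created inside the loop.
    ids.append(str(fields["ID"]))

# Build the dataframe once: the default index is 0, 1, 2, ...
df = pd.DataFrame({"ID": ids})

# With no path argument, to_csv returns the CSV text, so every row can be checked.
csv_text = df.to_csv(index=False)
print(df)
print(csv_text)
```

Building the frame once after the loop also avoids copying the whole frame on every iteration.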

Create Dataframe name from 2 strings or variables pandas

I am extracting selected pages from a PDF file and want to name each dataframe after the page it was extracted from:
file = "abc"
selected_pages = ['10','11'] #can be any combination eg ['6','14','20']
for i in selected_pages():
    df{str(i)} = read_pdf(path + file + ".pdf",encoding = 'ISO-8859-1', stream = True,area = [100,10,740,950],pages= (i), index = False)
    print (df{str(i)} )
The idea, as in the example above, is to end up with the dataframes df10 and df11. I have tried "df" + str(i), "df" & str(i) and df{str(i)}, but all of them give the error message: SyntaxError: invalid syntax
Any better way of doing this is most welcome. Thanks.

This is where a dictionary would be a much better option.
Also note the error you have at the start of the loop. selected_pages is a list, so you can't do selected_pages().
file = "abc"
selected_pages = ['10','11'] #can be any combination eg ['6','14','20']
df = {}
for i in selected_pages:
    df[i] = read_pdf(path + file + ".pdf",encoding = 'ISO-8859-1', stream = True, area = [100,10,740,950], pages= (i), index = False)

i = int(i) - 1 # this will bring it to 10
dfB = df[str(i)]
#select row number to drop: 0:4
dfB.drop(dfB.index[0:4],axis =0, inplace = True)
dfB.columns = ['col1','col2','col3','col4','col5']
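
The dictionary approach can be sketched without a PDF reader. Here load_page is a hypothetical stand-in for the read_pdf call, and frames and page_10 are illustrative names; the point is only that the page string becomes the key, so no variable names need to be generated.

```python
import pandas as pd

def load_page(page):
    """Hypothetical stand-in for read_pdf(..., pages=page)."""
    return pd.DataFrame({"page": [page] * 6, "row": range(6)})

selected_pages = ["10", "11"]

# One dataframe per page, keyed by the page string.
frames = {page: load_page(page) for page in selected_pages}

# Look a frame up by its key, then drop its first four rows without mutating the stored frame.
page_10 = frames["10"].drop(frames["10"].index[0:4]).reset_index(drop=True)
print(page_10)
```

Looping over frames.items() then gives every page and its dataframe together, which is hard to do with separately named variables such as df10 and df11.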
