I've split a pandas DataFrame into multiple DataFrames using a for loop, but they come up undefined when I try to work with them

I have a dataframe of accounts that can order many items. There can be multiple rows for each account, one for each item that it ordered. Each line is numbered as a line item and can go from 1 to (in this case) 15. I wanted to split the dataframe into 15 smaller dataframes, each containing the specific line item ordered. I wanted to name each dataframe by the line item number, so I used a for-loop:
# Determine max number of items per account
Max_Items = max(df_orders['Line_Item'])  # In this case, there were 15, but could vary
print('There are as many as ' + str(Max_Items) + ' items ordered at some accounts.')
# Create a list of all possible line item numbers
line_items = df_orders.Line_Item.unique().tolist()
print(line_items)
lines = {}
# Separate each line item into its own dataframe, keyed by the line number
for line_item in line_items:
    df_name = 'df_orders_' + str(line_item)  # the name for the dataframe
    lines[df_name] = df_orders[df_orders['Line_Item'] == line_item].reset_index(drop=True)
    print('This is order dataframe ' + df_name + ':')
    print(lines[df_name].head())
When I run this code, I do see a print out of the .head() for all 15 dataframes. So far, so good.
The problem is when I try to manipulate any of the ones I just created, after the loop has run:
df_orders_1.head()
I get:
NameError: name 'df_orders_1' is not defined
How can I adapt my code to make sure all of those dataframes remain in memory?
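The frames do stay in memory; they live in the lines dict rather than as bare variables, so they are retrieved with dict lookups. A minimal sketch with toy data (the df_orders contents here are hypothetical stand-ins):

```python
import pandas as pd

# Hypothetical stand-in for df_orders
df_orders = pd.DataFrame({
    'Account': ['A', 'A', 'B', 'B'],
    'Line_Item': [1, 2, 1, 2],
})

lines = {}
for line_item in df_orders['Line_Item'].unique().tolist():
    df_name = 'df_orders_' + str(line_item)
    lines[df_name] = df_orders[df_orders['Line_Item'] == line_item].reset_index(drop=True)

# The frames live in the dict, not as module-level names:
print(lines['df_orders_1'].head())

# Equivalent dict built with groupby instead of a manual loop:
lines_alt = {'df_orders_' + str(k): g.reset_index(drop=True)
             for k, g in df_orders.groupby('Line_Item')}
```

Accessing lines['df_orders_1'] works after the loop; a bare df_orders_1 raises NameError because no variable of that name was ever created.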

Related

Compile a count of similar rows in a Pandas Dataframe based on multiple column values

I have two Dataframes, one containing my data read in from a CSV file and another that has the data grouped by all of the columns but the last and reindexed to contain a column for the count of the size of the groups.
df_k1 = pd.read_csv(filename, sep=';')
columns_for_groups = list(df_k1.columns)[:-1]
k1_grouped = df_k1.groupby(columns_for_groups).size().reset_index(name="Count")
I need to create a series such that every row(i) in the series corresponds to row(i) in my original Dataframe but the contents of the series need to be the size of the group that the row belongs to in the grouped Dataframe. I currently have this, and it works for my purposes, but I was wondering if anyone knew of a faster or more elegant solution.
size_by_row = []
for row in df_k1.itertuples():
    for group in k1_grouped.itertuples():
        if row[1:-1] == group[1:-1]:
            size_by_row.append(group[-1])
            break
group_size = pd.Series(size_by_row)
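As a likely faster alternative, pandas can compute the same row-aligned group sizes in a single vectorized call with groupby(...).transform('size'), avoiding the nested Python loops entirely. A sketch with toy data (the column names here are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for df_k1; the last column is excluded from grouping
df_k1 = pd.DataFrame({
    'a': [1, 1, 2, 2, 2],
    'b': ['x', 'x', 'y', 'y', 'z'],
    'value': [10, 20, 30, 40, 50],
})

columns_for_groups = list(df_k1.columns)[:-1]

# One Series, aligned row-for-row with df_k1, holding each row's group size
group_size = df_k1.groupby(columns_for_groups)[columns_for_groups[0]].transform('size')
print(group_size.tolist())  # [2, 2, 2, 2, 1]
```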

Efficient way to map items from one column into new column in Pandas

Let's say I have a Pandas df called df_1 where one of the rows looks like this:
id      rank_url_agg                           url_list
2223    ['gtech.com', 'gm.com', 'ford.com']    ['google.com', 'gtech.com', 'autoblog.com', 'gm.com', 'ford.com']
I want to create a new column called url_list_agg which does the following things for each row:
1. Iterate through the URLs in url_list.
2. If a URL doesn't exist in rank_url_agg in the same row, assign a value of 0.
3. If the URL does exist in rank_url_agg, assign the difference between the length of the rank_url_agg list and the index of that URL in rank_url_agg.
4. Once done iterating through all URLs in url_list, wrap the results into a list.
So at the end, the first row in the new url_list_agg column will become [0,3,0,2,1].
I've tried running the following script (only to test the 1st row and not entire dataframe):
for item in agg_report['url_list'][0]:
    if item in agg_report['rank_url_agg'][0]:
        item = len(agg_report['rank_url_agg'][0]) - agg_report['rank_url_agg'][0].index(item)
    else:
        item = 0
But when I checked agg_report['url_list'][0], it still returned just this list: ['google.com','gtech.com','autoblog.com','gm.com', 'ford.com']. So my code didn't work.
Any advice on how to achieve this goal for every row in the dataframe will be greatly appreciated!
You're not assigning back to the actual dataframe.
def idx(a, b):
    return [len(a) - a.index(x) if x in a else 0 for x in b]

df_1 = df_1.assign(url_list_agg=[*map(idx, df_1.rank_url_agg, df_1.url_list)])
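Reproducing the row from the question as toy data, this sketch shows the assign-based approach producing the expected [0, 3, 0, 2, 1]:

```python
import pandas as pd

# Toy df_1 containing just the row from the question
df_1 = pd.DataFrame({
    'id': [2223],
    'rank_url_agg': [['gtech.com', 'gm.com', 'ford.com']],
    'url_list': [['google.com', 'gtech.com', 'autoblog.com', 'gm.com', 'ford.com']],
})

def idx(a, b):
    # score each url in b: 0 if absent from a, else len(a) minus its position in a
    return [len(a) - a.index(x) if x in a else 0 for x in b]

df_1 = df_1.assign(url_list_agg=[*map(idx, df_1.rank_url_agg, df_1.url_list)])
print(df_1['url_list_agg'][0])  # [0, 3, 0, 2, 1]
```

Unlike rebinding the loop variable item, assign writes the computed lists back into an actual column of the dataframe.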

Pandas: Creating empty dataframe in for loop, appending

I would like to create a ((25520*43),3) pandas Dataframe in a for loop.
I created the dataframe like:
lst=['Region', 'GeneID', 'DistanceValue']
df=pd.DataFrame(index=lst).T
And now I want to fill 'Region', 43 times with 25520 values. Also GeneID and DistanceValue.
This is my for loop for that:
for i in range(43):
    df.DistanceValue = np.sort(distance[i,:])
    df.Region = np.ones(25520) * i
    args = np.argsort(distance[i,:])
    df.GeneID = ids[int(args[i])]
But then my df consists of just (25520, 3), so only the last of the 43 iterations is filled in.
How can I concatenate all iterations 1 to 43 in my df?
I can't reproduce your example but there are couple of corrections you can make:
lst = ['Region', 'GeneID', 'DistanceValue']
df = pd.DataFrame(index=lst).T
region = []
for i in range(43):
    region.append(np.ones(25520) * i)
flat_list = [item for sublist in region for item in sublist]
df.Region = flat_list
First create a new list outside the loop, then append values to it within the loop.
The flat_list consolidates all 43 arrays into one, which you can then assign to the DataFrame. It is always easier to fill DataFrame values outside of a loop.
You can update all 3 columns in the same way.
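Another option is to build one small DataFrame per iteration and pd.concat them at the end, which fills all three columns at once. A sketch with small stand-in sizes and random data in place of the real distance matrix and ids (both hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_regions, n_genes = 3, 5          # stand-ins for 43 and 25520
distance = rng.random((n_regions, n_genes))
ids = np.array(['g0', 'g1', 'g2', 'g3', 'g4'])

frames = []
for i in range(n_regions):
    order = np.argsort(distance[i, :])
    frames.append(pd.DataFrame({
        'Region': np.full(n_genes, i),
        'GeneID': ids[order],                  # gene ids sorted by distance
        'DistanceValue': distance[i, order],   # sorted distances for region i
    }))

# Stack the per-region frames into one ((n_genes * n_regions), 3) DataFrame
df = pd.concat(frames, ignore_index=True)
print(df.shape)  # (15, 3)
```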

Write pandas data to a CSV file if column sums are greater than a specified value

I have a CSV file whose columns are frequency counts of words and whose rows are time periods. I want to sum the total frequencies for each column. Then I want to write the column and row values to a CSV file for columns whose sums are greater than or equal to 30, thus dropping columns whose sums are less than 30.
Just learning python and pandas. I know it is a simple question, but my knowledge is at that level. Your help is most appreciated.
I can read in the CSV file and compute the column sums.
df = pd.read_csv('data.csv')
Excerpt of data file containing 3,874 columns and 100 rows (not shown)
df.sum(axis = 0, skipna = True)
Excerpt of sums for columns
I am stuck on how to write each row of each qualifying column to a CSV file so that the output looks like the original file but no longer has the columns whose sums were less than 30. The layout of the output file would be the same as the input file; the sums themselves would not be included.
Thanks very much for your help.
It's easiest to do this in two steps:
1. Filter the DataFrame to just the columns you want to save
df_to_save = df.loc[:, (df.sum(axis=0, skipna=True) >= 30)]
.loc is for picking rows/columns based either on labels or conditions; the syntax is .loc[rows, columns], so : means "take all the rows", and then the second part is the condition on our columns - I've taken the sum you'd given in your question and set it greater than or equal to 30.
2. Save the filtered DataFrame to CSV
df_to_save.to_csv('path/to/write_file.csv', header=True, index=False)
Just put your filepath in as the first argument. header=True means the header labels from the table will be written back out to the file, and index=False means the numbered row labels Pandas automatically created when you read in the CSV won't be included in the export.
See this answer here: How to delete a column in pandas dataframe based on a condition? . Note, the solution for your question doesn't need isnull() before the sum(), as that is specific to their question for counting NaN values.
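Putting both steps together on a toy frame (the column names and values here are made up):

```python
import io

import pandas as pd

# Toy stand-in for the word-frequency table
df = pd.DataFrame({
    'alpha': [20, 20],   # sum 40 -> kept
    'beta':  [5, 10],    # sum 15 -> dropped
    'gamma': [15, 15],   # sum 30 -> kept (>= 30)
})

# Step 1: keep only columns whose sums reach 30
df_to_save = df.loc[:, df.sum(axis=0, skipna=True) >= 30]

# Step 2: write out with headers, without the numbered row index
buf = io.StringIO()  # stands in for a real file path
df_to_save.to_csv(buf, header=True, index=False)
print(buf.getvalue().splitlines()[0])  # alpha,gamma
```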

Merging DataFrames on a specific column together

I am trying to run my self-created function in a for loop.
Some remarks in advance:
ma_strategy is my function and requires three inputs
ticker_list is a list of strings
result is a pandas DataFrame with 7 columns, and I can call the column 'return_cum' with result['return_cum']. The rows of this column contain floating point numbers.
My intention is the following:
The for loop should iterate over the items in my ticker_list and should save the 'return_cum' columns in a DataFrame. Then the different 'return_cum' columns should be stored together so that at the end I get a DataFrame with all the 'return_cum' columns of my ticker list.
How can I achieve that goal?
My approach is:
for i in ticker_list:
    result = ma_strategy(i, 20, 5)
    x = result['return_cum'].to_frame()
But at this stage I need some help.
If I understood you correctly, this should work:
result_df = pd.DataFrame()
for i in ticker_list:
    result = ma_strategy(i, 20, 5)
    result_df[i + '_return_cum'] = result['return_cum']
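To make the pattern concrete, here is a runnable sketch with a made-up ma_strategy stub standing in for the real function (its internals and the tickers are hypothetical; only the loop-and-assign pattern matters):

```python
import numpy as np
import pandas as pd

# Hypothetical stub: returns a 7-column frame including 'return_cum',
# matching the shape of the result described in the question
def ma_strategy(ticker, fast, slow):
    rng = np.random.default_rng(abs(hash(ticker)) % 1000)
    return pd.DataFrame(rng.random((4, 7)),
                        columns=['open', 'high', 'low', 'close',
                                 'ma_fast', 'ma_slow', 'return_cum'])

ticker_list = ['AAPL', 'MSFT']

# Collect one 'return_cum' column per ticker into a single DataFrame
result_df = pd.DataFrame()
for i in ticker_list:
    result = ma_strategy(i, 20, 5)
    result_df[i + '_return_cum'] = result['return_cum']

print(list(result_df.columns))  # ['AAPL_return_cum', 'MSFT_return_cum']
```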