Groupby for loop to export separate files by group - pandas

I'm trying to group dataframe by 'state' column, run calculations on each group, and export to excel with each file being named for the respective state group. If I print the groups, they look correct, but I can't get the files to show the group data correctly. Currently it creates separate files with correct file names, but each file has the complete data set ignoring the groups.
Source data here: https://docs.google.com/spreadsheets/d/1-wdmIz_-AILcBqzvpwAFGZfXqhq8oDRrYFVVdkjZ10o/edit?usp=sharing
df = pd.read_excel("ranker_test.xlsx", sheet_name='DATA')
grouped = df.groupby('state')
for group in grouped:
df.to_excel('test files/ranking_{}.xlsx'.format(group[0]), index=False)
^This creates the correctly named files, but each file has all states.
df = pd.read_excel("ranker_test.xlsx", sheet_name='DATA')
grouped = df.groupby('state')
for group in grouped:
group.to_frame().to_excel('test files/ranking_{}.xlsx'.format(group[0]), index=False)
^Trying to convert it to a dataframe with group.to_frame().to_excel results in this error: AttributeError: 'tuple' object has no attribute 'to_frame'
How can I convert the groups into dataframes to be stored in each file?

It looks like you've missed a parameter when unpacking the grouped values. The grouped values are a list of touples with the following format (group_index, group_dataframe). So, in order to iterate properly over it, you should do something like this:
df = pd.read_excel("ranker_test.xlsx", sheet_name='DATA')
grouped = df.groupby('state')
for name, group in grouped:
group.to_excel('test files/ranking_{}.xlsx'.format(name), index=False)
Notice the name parameter in the for loop

Related

How do you split All columns in a large pandas data frame?

I have a very large data frame that I want to split ALL of the columns except first two based on a comma delimiter. So I need to logically reference column names in a loop or some other way to split all the columns in one swoop.
In my testing of the split method:
I have been able to explicitly refer to ( i.e. HARD CODE) a single column name (rs145629793) as one of the required parameters and the result was 2 new columns as I wanted.
See python code below
HARDCODED COLUMN NAME --
df[['rs1','rs2']] = df.rs145629793.str.split(",", expand = True)
The problem:
It is not feasible to refer to the actual column names and repeat code.
I then replaced the actual column name rs145629793 with columns[2] in the split method parameter list.
It results in an ERROR
'str has ni str attribute'
You can index columns by position rather than name using iloc. For example, to get the third column:
df.iloc[:, 2]
Thus you can easily loop over the columns you need.
I know what you are asking, but it's still helpful to provide some input data and expected output data. I have included random input data in my code below, so you can just copy and paste this to run, and try to apply it to your dataframe:
import pandas as pd
your_dataframe=pd.DataFrame({'a':['1,2,3', '9,8,7'],
'b':['4,5,6', '6,5,4'],
'c':['7,8,9', '3,2,1']})
import copy
def split_cols(df):
dict_of_df = {}
cols=df.columns.to_list()
for col in cols:
key_name = 'df'+str(col)
dict_of_df[key_name] = copy.deepcopy(df)
var=df[col].str.split(',', expand=True).add_prefix(col)
df=pd.merge(df, var, how='left', left_index=True, right_index=True).drop(col, axis=1)
return df
split_cols(your_dataframe)
Essentially, in this solution you create a list of the columns that you want to loop through. Then you loop through that list and create new dataframes for each column where you run the split() function. Then you merge everything back together on the index. I also:
included a prefix of the column name, so the column names did not have duplicate names and could be more easily identifiable
dropped the old column that we did the split on.
Just import copy and use the split_cols() function that I have created and pass the name of your dataframe.

Pandas dataframe: Splitting single-column data from txt file into multiple columns

I have an obnoxious .txt file that is output from a late 1990's program for an Agilent instrument. I am trying to comma-separate and organize the single column of the text file into multiple columns in a pd dataframe. After some organization, the txt file currently looks like the following: See link here:
Organized Text File
Each row is indexed in a pd dataframe. The code used to reorganize the file and attempt to split into multiple columns follows:
quantData = pd.read_csv(epaTemp, header = None)
trimmed_File = quantData.iloc[16:,]
trimmed_File = trimmed_File.drop([17,18,70,71,72], axis = 0)
print (trimmed_File)
###
splitFile = trimmed_File.apply( lambda x: pd.Series(str(x).split(',')))
print (splitFile)
The split function above did not get applied to all rows present in the txt file. It only split(',')the first row rather than all of them:
0 16 Compound R... 1
dtype: object
I would like this split functionality to apply to all rows in my txt file so I can further organize my data. Thank you for the help.

How to access nested tables in hdf5 with pandas

I want to retrieve a table from an HDF5 file using pandas.
Following several references I found, I have tried to open the file using:
df = pd.read_hdf('data/test.h5', g_name),
where g_name is the path to the object I want to retrieve, i.e. the table TAB1, for instance, MAIN/Basic/Tables/TAB1.
g_name is retrieved as follows:
def get_all(name):
if 'TAB1' in name:
return name
with h5py.File('data/test.h5') as f:
g_name = f.visit(get_all)
print(g_name)
group = f[g_name]
print(type(group))
I have also tried retrieving the object itself, as seen in the above code snippet, but the object type is
How would I convert this to something I can read as a data frame in pandas?
For the first case, I get the following error:
"cannot create a storer if the object is not existing "
I do not understand why it cannot find the object, if the path is the same as retrieved during the search.
I found the following solution:
hf = h5py.File('data/test.h5')
data = hf.get('MAIN/Basic/Tables/TAB1')
result = data[()]
# This last step just converts the table into a pandas df
df = pd.DataFrame(result)

Pandas - Appending data from one Dataframe to

I have a Dataframe (called df) that has list of tickets worked for a given date. I have a script that runs each day where this df gets generated and I would like to have a new master dataframe (lets say df_master) that appends values form df to a new Dataframe. So anytime I view df_master I should be able to see all the tickets worked across multiple days. Also would like to have a new column in df_master that shows date when the row was inserted.
Given below is how df looks like:
1001
1002
1003
1004
I tried to perform concat but it threw an error
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "Series"
Update
df_ticket = tickets['ticket']
df_master = df_ticket
df_master['Date'] = pd.Timestamp('now').normalize()
L = [df_master,tickets]
master_df = pd.concat(L)
master_df.to_csv('file.csv', mode='a', header=False, index=False)
I think you need pass sequence to concat, obviously list is used:
objs : a sequence or mapping of Series, DataFrame, or Panel objects
If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised
L = [s1,s2]
df = pd.concat(L)
And it seems you pass only Series, so raised error:
df = pd.concat(s)
For insert Date column is possible set pd.Timestamp('now').normalize(), for master df I suggest create one file and append each day DataFrame:
df_ticket = tickets[['ticket']]
df_ticket['Date'] = pd.Timestamp('now').normalize()
df_ticket.to_csv('file.csv', mode='a', header=False, index=False)
df_master = pd.read_csv('file.csv', header=None)

Key error: '3' When extracting data from Pandas DataFrame

My code plan is as follows:
1) find csv files in folder using glob and create a list of files
2) covert each csv file into dataframe
3) extract data from a column location and convert into a separate dataframe
4) append the new data into a separate summary csv file
code is as follows:
Result = []
def result(filepath):
files = glob.glob(filepath)
print files
dataframes = [pd.DataFrame.from_csv(f, index_col=None) for f in files]
new_dfb = pd.DataFrame()
for i, df in enumerate(dataframes):
colname = 'Run {}'.format(i+1)
selected_data = df['3'].ix[0:4]
new_dfb[colname] = selected_data
Result.append(new_dfb)
folder = r"C:/Users/Joey/Desktop/tcd/summary.csv"
new_dfb.to_csv(folder)
result("C:/Users/Joey/Desktop/tcd/*.csv")
print Result
The code error is shown below. The issue seems to be with line 36 .. which corresponds to the selected_data = df['3'].ix[0:4].
I show one of my csv files below:
I'm not sure what the problem is with the dataframe constructor?
You're csv snippet is a bit unclear. But as suggested in the comments, read_csv (from_csv in this case) automatically taken the first row as a list of headers. The behaviour you appear to want is the columns to be labelled as 1,2,3 etc. To achieve this you need to have
[pd.DataFrame.from_csv(f, index_col=None,header=None) for f in files]