I have two DataFrames: one containing my data read in from a CSV file, and another that has the data grouped by all of the columns but the last and reindexed to contain a column for the size of each group.
df_k1 = pd.read_csv(filename, sep=';')
columns_for_groups = list(df_k1.columns)[:-1]
k1_grouped = df_k1.groupby(columns_for_groups).size().reset_index(name="Count")
I need to create a series such that every row(i) in the series corresponds to row(i) in my original Dataframe but the contents of the series need to be the size of the group that the row belongs to in the grouped Dataframe. I currently have this, and it works for my purposes, but I was wondering if anyone knew of a faster or more elegant solution.
size_by_row = []
for row in df_k1.itertuples():
    for group in k1_grouped.itertuples():
        if row[1:-1] == group[1:-1]:
            size_by_row.append(group[-1])
            break
group_size = pd.Series(size_by_row)
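For what it's worth, a vectorized alternative (a sketch reusing df_k1 and columns_for_groups from above) is groupby().transform('size'), which broadcasts each group's size back to its member rows, so the result is already aligned with the original DataFrame:
# Select a single column so transform returns a Series aligned with df_k1's index
last_col = df_k1.columns[-1]
group_size = df_k1.groupby(columns_for_groups)[last_col].transform('size')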
I have a big dataframe of about 200k rows and 3 columns (x, y, z). Some rows don't have y and z values and just have an x value. I want to make a new column so that the first set of data with z values is 1, the second is 2, then 3, etc. Or make a MultiIndex in the same format.
I made a new column called "NO." with zero as the initial value. Then I recorded the indices where the new column should get a new value, with the following code:
df = pd.read_fwf(path, header=None, names=['x','y','z'])
df['NO.']=0
index_NO_changed = df.index[df['z'].isnull()]
Then I loop through it and change the number:
for i in range(len(index_NO_changed)-1):
    df['NO.'].iloc[index_NO_changed[i]:index_NO_changed[i+1]] = i + 1
df['NO.'].iloc[index_NO_changed[-1]:] = len(index_NO_changed)
But the problem is that I get a warning:
A value is trying to be set on a copy of a slice from a DataFrame
Is there any better way? Is creating a MultiIndex instead of adding another column easier, considering the size of the dataframe?
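For what it's worth, a vectorized sketch that also sidesteps the chained-assignment warning, assuming (as in the loop above) that each row with a null z starts a new group:
# Each null z marks the start of a new group; a cumulative sum of the null
# flags numbers the groups 1, 2, 3, ... (rows before the first null keep 0,
# matching the loop above)
df['NO.'] = df['z'].isnull().cumsum()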
I'm new to Python. I have multiple dataframes and want to select the rows of each dataframe where a given column contains the value XXX.
Below is my code:
MasterFiles = [Master_Jun22, Master_May22, Master_Apr22, Master_Mar22, Master_Feb22, Master_Jan22,
Master_Dec21, Master_Nov21, Master_Oct21, Master_Sep21, Master_Aug21, Master_Jul21,
Master_Jun21, Master_May21, Master_Apr21]
ColumName = ['product_category']
for d in MasterFiles:
    for c in ColumName:
        d = d.loc[d[c]=='XXX']
It is not working. Please help with this.
Reassigning d inside the loop only rebinds the loop variable; it does not change the dataframes in MasterFiles. You need to gather the output and collect it into a new DataFrame:
MasterFiles = [Master_Jun22, Master_May22, Master_Apr22, Master_Mar22, Master_Feb22, Master_Jan22,
Master_Dec21, Master_Nov21, Master_Oct21, Master_Sep21, Master_Aug21, Master_Jul21,
Master_Jun21, Master_May21, Master_Apr21]
ColumName = ['product_category']
res_df = pd.DataFrame(columns=ColumName)
for d in MasterFiles:
    for c in ColumName:
        # DataFrame.append was deprecated and later removed; build up with concat instead
        res_df = pd.concat([res_df, d.loc[d[c] == 'XXX']])
# the results
res_df.head()
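As a side note, collecting the filtered frames in a list and calling pd.concat once at the end (as the next answer does) usually scales better, because concatenating inside the loop copies the accumulated frame on every iteration.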
I am not sure if I am understanding your question correctly. So, please let me rephrase your question here.
You have 3 tasks,
first is to loop through each pandas data frame,
second is to loop through each column in your ColumName list, and
third is to return the data frame rows that contain the value Surabhi - DCL - Unsecured based on the column name in the ColumName list.
If I am interpreting this correctly, this is how I would work on your issue.
MasterFiles = [Master_Jun22, Master_May22, Master_Apr22, Master_Mar22, Master_Feb22, Master_Jan22,
Master_Dec21, Master_Nov21, Master_Oct21, Master_Sep21, Master_Aug21, Master_Jul21,
Master_Jun21, Master_May21, Master_Apr21]
ColumName = ['product_category']
## list to store the filtered data frames
df_temp = []
for d in MasterFiles:
    for c in ColumName:
        df_temp.append(d.loc[d[c] == 'Surabhi - DCL - Unsecured'])
## Assuming row wise concatenation
## i.e., using same column names to join data
df = pd.concat(df_temp, axis=0, ignore_index=True)
## df is the data frame you need
We have two dataframes exported from Excel. Both have a column called "PN", which was set at export time. "first" and "second" are the variables holding those dataframes. "third" stores a list of coincidences (matches) between the two "PN" columns. The pandas merge method worked without such a list, but since it is now not working, I added the list as well.
gnida = []
for h in first['PN']:
    for u in zip(second['PN'], second['P']):
        if h == u[0]:
            gnida.append(u)
third = pd.DataFrame(gnida)
I need values from the second dataframe to be placed in the rows where a match occurs. If I simply merge:
fourth = first.merge(second)
columns with names not present in the first df are added, but the output is one row of headings with no value rows.
If I merge
fourth = first.merge(third)
I get:
No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False.
If I further specify left_on="PN", I get:
object of type 'NoneType' has no len().
Thus, how can I merge or join the two dataframes so as to use one column of the second dataframe as a key, placing its values in a new column where a match occurs? Thank you.
If you wish to merge by the index, just use fourth = first.join(third).
Otherwise, you need to create a dataframe from third, add the column that you want to merge by, and use:
fourth = first.merge(third,on='name_of_the_column')
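For example, a minimal sketch with made-up data (the frames and values here are hypothetical stand-ins for the Excel exports):
import pandas as pd

first = pd.DataFrame({'PN': ['A1', 'B2', 'C3'], 'qty': [10, 20, 30]})
second = pd.DataFrame({'PN': ['B2', 'C3', 'D4'], 'P': [1.5, 2.5, 3.5]})

# A left merge keeps every row of `first` and fills P where PN matches;
# rows without a match get NaN
fourth = first.merge(second, on='PN', how='left')
print(fourth)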
I have two dataframes:
Dataframe #1
Reads the values. I will only be interested in NodeID and GSE:
sta = pd.read_csv(filename)
Dataframe #2
Reads the file, uses pivot, and gets the following result:
sim = pd.read_csv(headout,index_col=0)
sim['Layer'] = sim.groupby('date').cumcount() + 1
sim['Layer'] = 'L' + sim['Layer'].astype(str)
sim = sim.pivot(index = None , columns = 'Layer').T
This leaves the index with two levels (the header is blank for the first one and 'Layer' for the second), i.e. 1, L1.
What I need help with is:
I cannot find a way to rename that first blank level of the index to 'NodeID'.
I want to name it that so that I can do the lookup and use NodeID in both dataframes, bringing the 'GSE' values from the first dataframe into the second.
I have been googling ways to rename that first column in the second dataframe and cannot seem to find a solution. Any ideas help at this point. I think my pivot function might be wrong...
A picture in the original post shows dataframe #2 before the pivot; the numbers 1-4 are the NodeID. A second image shows the dataframe exported to CSV.
Try
df = df.rename(columns={"Index": "your preferred name"})
If it is your index, then do:
df = df.reset_index()
df = df.rename(columns={"index": "your preferred name"})