How to convert column values of a CSV file to a different structure in pandas?

I have 100 CSV files in a folder. Each of them has a Name column, which identifies the file, and a Z column with values.
import pandas as pd
df = pd.read_csv("ProfileGraph1.csv")
df.head()
Name Z
0 1 -3.687422
1 1 -3.688351
2 1 -3.684376
3 1 -3.691209
4 1 -3.693000
df = pd.read_csv("ProfileGraph2.csv")
df.head()
Name Z
0 2 -3.691955
1 2 -3.694228
2 2 -3.692559
3 2 -3.699092
4 2 -3.698381
df = pd.read_csv("ProfileGraph3.csv")
df.head()
Name Z
0 3 -3.693265
1 3 -3.694765
2 3 -3.693598
3 3 -3.697865
4 3 -3.699872
I would like to go through each file, convert its Z column to a row, and append all of these rows to a single new CSV file. This is the output I built manually:
df = pd.read_csv("filename.csv")
df.head()
Name 1 2 3 4 5
0 1 -3.687422 -3.688351 -3.684376 -3.691209 -3.693000
1 2 -3.691955 -3.694228 -3.692559 -3.699092 -3.698381
2 3 -3.693265 -3.694765 -3.693598 -3.697865 -3.699872

First loop over the list of all files and build one big DataFrame with concat, then reshape it with unstack, using cumcount as a per-Name counter:
import glob
import pandas as pd
files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp) for fp in files]
df = pd.concat(dfs, ignore_index=True)
# cumcount numbers the rows within each Name (0, 1, 2, ...); unstack turns that counter into columns
df = df.set_index(['Name',df.groupby('Name').cumcount()])['Z'].unstack().reset_index()
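To see the reshape in isolation, here is a minimal sketch that swaps the files for two small in-memory frames (the first two rows of ProfileGraph1 and ProfileGraph2 from the question). The names `pos`, `wide` and `csv_text` are only illustrative, and the `+ 1` is an assumption added so the column labels start at 1 as in the desired output.

```python
import pandas as pd

# In-memory stand-ins for two of the CSV files (first two rows of each, copied from the question).
dfs = [
    pd.DataFrame({"Name": [1, 1], "Z": [-3.687422, -3.688351]}),
    pd.DataFrame({"Name": [2, 2], "Z": [-3.691955, -3.694228]}),
]
df = pd.concat(dfs, ignore_index=True)

# Position of each row within its Name group, shifted so it starts at 1 instead of 0.
pos = df.groupby("Name").cumcount() + 1

# One row per Name, one column per position.
wide = df.set_index(["Name", pos])["Z"].unstack().reset_index()

# Without a path, to_csv returns the CSV text; pass a file name to write a file instead.
csv_text = wide.to_csv(index=False)
print(csv_text)
```

If the files have different numbers of rows, the shorter ones should end up padded with NaN after unstack.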

How to change in a dataframe columns to rows based on same name?

I have a dataframe with several columns that share the same names but carry numeric prefixes, from 0. up to 27.
How can I take all the values of 1.name (and so on) and place them under 0.name?
Thank you very much!
Assuming that for all 0<=n<=27 the column names' suffixes are the same, one solution can be:
import pandas as pd
import re
# pattern to extract the column-name suffix; \d+ so that two-digit prefixes (10. to 27.) also match
pattern = re.compile(r'^\d+\.([\w.]+)')
# all distinct column names/fields, in order of first appearance
fields = list(dict.fromkeys(pattern.match(colname).group(1) for colname in df.columns))
# max prefix number, for you 27
n = 27
partitions = []
for i in range(n + 1):
    # column selector for this partition
    columns_for_partition = [f'{i}.{field}' for field in fields]
    # select the partition and rename its columns to the field names (removing the n. prefix)
    partition = df[columns_for_partition].rename(lambda x: x.split('.', 1)[1], axis=1)
    partitions.append(partition)
new_df = pd.concat(partitions)
print(new_df)
With an initial dataframe df
0.name 0.something 1.name 1.something
0 a 1 d 4
1 b 2 e 5
2 c 3 f 6
The resulting dataframe new_df will look like:
name something
0 a 1
1 b 2
2 c 3
0 d 4
1 e 5
2 f 6
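As a self-contained check of the example above, the sketch below rebuilds the small dataframe and derives the prefixes from the column names instead of hard-coding n. The names `prefixes` and `parts` are hypothetical, and it assumes every column follows the `<number>.<field>` pattern.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "0.name": ["a", "b", "c"], "0.something": [1, 2, 3],
    "1.name": ["d", "e", "f"], "1.something": [4, 5, 6],
})

# Split each column name into its numeric prefix and its field name.
pattern = re.compile(r"^(\d+)\.(.+)$")
prefixes = sorted({int(pattern.match(c).group(1)) for c in df.columns})

parts = []
for i in prefixes:
    # Columns that belong to prefix i, with the "i." part stripped from their names.
    cols = [c for c in df.columns if c.startswith(f"{i}.")]
    parts.append(df[cols].rename(columns=lambda c: c.split(".", 1)[1]))

# Stack the partitions on top of each other; the original row labels are kept.
new_df = pd.concat(parts)
print(new_df)
```

Passing `ignore_index=True` to `pd.concat` would give a fresh 0..n-1 index instead of the repeated 0, 1, 2.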

the column in csv that comes from the index of DataFrame does not have a header name

here is a pandas DataFrame
>>> print(df)
A B C
0 0 1 2
1 3 4 5
2 6 7 8
with df.to_csv('df.csv') I got this file
the column in csv that comes from the index of DataFrame does not have a header name. Is it possible to specify a column name with pandas?
Try with rename_axis
df.rename_axis('index').to_csv('df.csv')
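As an alternative sketch, `to_csv` also accepts an `index_label` argument, which names the index column in the file without touching the dataframe. The example calls `to_csv` without a path so that it returns the CSV text and the header can be inspected directly; the label `'index'` is just the one used above.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]], columns=list("ABC"))

# Option 1: name the index axis first, then write.
csv_a = df.rename_axis("index").to_csv()

# Option 2: leave df unchanged and name the index column only in the output.
csv_b = df.to_csv(index_label="index")

print(csv_b.splitlines()[0])
```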

Sort data in Pandas dataframe alphabetically

I have a dataframe where I need to sort the contents of one column (comma separated) alphabetically:
ID Data
1 Mo,Ab,ZZz
2 Ab,Ma,Bt
3 Xe,Aa
4 Xe,Re,Fi,Ab
Output:
ID Data
1 Ab,Mo,ZZz
2 Ab,Bt,Ma
3 Aa,Xe
4 Ab,Fi,Re,Xe
I have tried:
df.sort_values(by='Data')
But this does not work
You can split, sort, and then join back:
df['Data'] = df['Data'].apply(lambda x: ','.join(sorted(x.split(','))))
Or use a list comprehension as an alternative:
df['Data'] = [','.join(sorted(x.split(','))) for x in df['Data']]
print (df)
ID Data
0 1 Ab,Mo,ZZz
1 2 Ab,Bt,Ma
2 3 Aa,Xe
3 4 Ab,Fi,Re,Xe
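One caveat worth a sketch: `sorted` compares strings by code point, so uppercase letters sort before lowercase ones. The sample data in the question happens not to be affected, but with mixed case the result may not look alphabetical. The data below is a hypothetical variation of the question's values, and the column names `Default` and `IgnoreCase` are made up for illustration.

```python
import pandas as pd

# Hypothetical mixed-case variant of the question's data.
df = pd.DataFrame({"ID": [1, 2], "Data": ["mo,Ab,ZZz", "Xe,aa"]})

# Plain sorted(): code-point order, so "ZZz" comes before "mo".
df["Default"] = df["Data"].apply(lambda x: ",".join(sorted(x.split(","))))

# key=str.casefold compares the items case-insensitively but keeps their original spelling.
df["IgnoreCase"] = df["Data"].apply(
    lambda x: ",".join(sorted(x.split(","), key=str.casefold))
)
print(df)
```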
IIUC, you can use get_dummies:
s=df.Data.str.get_dummies(',')
df['n']=s.dot(s.columns+',').str[:-1]
df
Out[216]:
ID Data n
0 1 Mo,Ab,ZZz Ab,Mo,ZZz
1 2 Ab,Ma,Bt Ab,Bt,Ma
2 3 Xe,Aa Aa,Xe
3 4 Xe,Re,Fi,Ab Ab,Fi,Re,Xe
IIUC you can use a list comprehension:
[','.join(sorted(i.split(','))) for i in df['Data']]
#['Ab,Mo,ZZz', 'Ab,Bt,Ma', 'Aa,Xe', 'Ab,Fi,Re,Xe']
Using explode and sort_values:
df["Sorted_Data"] = (
    df["Data"].str.split(",").explode().sort_values().groupby(level=0).agg(','.join)
)
print(df)
ID Data Sorted_Data
0 1 Mo,Ab,ZZz Ab,Mo,ZZz
1 2 Ab,Ma,Bt Ab,Bt,Ma
2 3 Xe,Aa Aa,Xe
3 4 Xe,Re,Fi,Ab Ab,Fi,Re,Xe
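Using row iteration (write back with df.at, because the rows yielded by iterrows can be copies, so assigning to them may not change df):
for index, row in df.iterrows():
    df.at[index, 'Data'] = ','.join(sorted(row['Data'].split(',')))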
In [29]: df
Out[29]:
Data
0 Ab,Mo,ZZz
1 Ab,Bt,Ma
2 Aa,Xe
3 Ab,Fi,Re,Xe

How to make pandas work for cross multiplication

I have 3 data frame:
df1
id,k,a,b,c
1,2,1,5,1
2,3,0,1,0
3,6,1,1,0
4,1,0,5,0
5,1,1,5,0
df2
name,a,b,c
p,4,6,8
q,1,2,3
df3
type,w_ave,vac,yak
n,3,5,6
v,2,1,4
Using pandas and numpy, I want to compute the following output in df1:
id,k,a,b,c,w_ave,vac,yak
1,2,1,5,1,16,15,18
2,3,0,1,0,0,3,6
3,6,1,1,0,5,4,7
4,1,0,5,0,0,11,14
5,1,1,5,0,13,12,15
The conditions are:
The value of the new column is computed as follows (this is not code):
df1["w_ave"][1] = df3["w_ave"]["v"]+ df1["a"][1]*df2["a"]["q"]+df1["b"][1]*df2["b"]["q"]+df1["c"][1]*df2["c"]["q"]
for output["w_ave"][1]= 2 +(1*1)+(5*2)+(1*3)
df3["w_ave"]["v"]=2
df1["a"][1]=1, df2["a"]["q"]=1 ;
df1["b"][1]=5, df2["b"]["q"]=2 ;
df1["c"][1]=1, df2["c"]["q"]=3 ;
Which means:
- A new column is added to df1 for each column name in df3.
- For each row of df1, the values of a, b, and c are multiplied by the same-named q values from df2, summed, and added to the corresponding value of df3.
- Only the columns of df1 whose names match columns of df2 are multiplied. Unmatched columns, like df1[k], are not.
- However, if there is a 0 in df1["a"], the corresponding output will be zero.
I am struggling with this, and it was tough to explain too. I know the attempt below will not work, but I have added it anyway:
import pandas as pd, numpy as np
data1 = "Sample_data1.csv"
data2 = "Sample_data2.csv"
data3 = "Sample_data3.csv"
folder = '~Sample_data/'
df1 =pd.read_csv(folder + data1)
df2 =pd.read_csv(folder + data2)
df3 =pd.read_csv(folder + data3)
df1= df2 * df1
Ok, so this does not fully reproduce your desired output (it only computes w_ave and does not apply the zero rule), but vectorizing the formula you provided:
df2=df2.set_index("name")
df3=df3.set_index("type")
df1["w_ave"] = df3.loc["v", "w_ave"]+ df1["a"].mul(df2.loc["q", "a"])+df1["b"].mul(df2.loc["q", "b"])+df1["c"].mul(df2.loc["q", "c"])
Outputs:
id k a b c w_ave
0 1 2 1 5 1 16
1 2 3 0 1 0 4
2 3 6 1 1 0 5
3 4 1 0 5 0 12
4 5 1 1 5 0 13
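A possible extension of the same idea to all three new columns is sketched below. It rebuilds the three frames from the question inline, and it rests on one assumption read off the expected table rather than stated outright in the question: the "zero when df1["a"] is 0" rule is applied to w_ave only, since vac and yak are nonzero in those rows. The names `shared` and `weighted` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3, 4, 5], "k": [2, 3, 6, 1, 1],
                    "a": [1, 0, 1, 0, 1], "b": [5, 1, 1, 5, 5], "c": [1, 0, 0, 0, 0]})
df2 = pd.DataFrame({"name": ["p", "q"], "a": [4, 1], "b": [6, 2], "c": [8, 3]}).set_index("name")
df3 = pd.DataFrame({"type": ["n", "v"], "w_ave": [3, 2], "vac": [5, 1], "yak": [6, 4]}).set_index("type")

# Columns present in both df1 and df2; only these take part in the multiplication.
shared = [c for c in df1.columns if c in df2.columns]

# Row-wise sum of a*q_a + b*q_b + c*q_c as a single dot product.
weighted = df1[shared].dot(df2.loc["q", shared])

# Add the "v" offset from df3 for each new column.
for col in df3.columns:
    df1[col] = weighted + df3.loc["v", col]

# Assumption: the zero rule applies to w_ave only, as in the expected output.
df1.loc[df1["a"] == 0, "w_ave"] = 0
print(df1)
```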

Adding new column to an existing dataframe at an arbitrary position [duplicate]

Can I insert a column at a specific column index in pandas?
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
This will put column n as the last column of df, but isn't there a way to tell df to put n at the beginning?
See the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.insert.html
Using loc = 0 will insert at the beginning:
df.insert(loc, column, value)
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
df
Out:
B C
0 1 4
1 2 5
2 3 6
idx = 0
new_col = [7, 8, 9] # can be a list, a Series, an array or a scalar
df.insert(loc=idx, column='A', value=new_col)
df
Out:
A B C
0 7 1 4
1 8 2 5
2 9 3 6
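When the target position is known by column name rather than by number, one option is to look the position up first. This sketch uses the dataframe from the question and `Index.get_loc`, which returns an integer position as long as the column name is unique. Note that `insert` modifies df in place and returns None.

```python
import pandas as pd

df = pd.DataFrame({'l': ['a', 'b', 'c', 'd'], 'v': [1, 2, 1, 2]})

# Integer position of column 'v' (assumes the name occurs only once).
loc = df.columns.get_loc('v')

# Insert 'n' at that position, which places it directly before 'v'.
df.insert(loc, 'n', 0)
print(df.columns.tolist())
```

Using `loc + 1` instead would place the new column directly after 'v'.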
If you want a single value for all rows:
df.insert(0,'name_of_column','')
df['name_of_column'] = value
Edit:
You can also:
df.insert(0,'name_of_column',value)
df.insert(loc, column_name, value)
This works as long as no other column has the same name. If a column with the given name already exists in the dataframe, insert raises a ValueError.
To create a new column whose name already exists, pass the optional parameter allow_duplicates=True.
Here is an example:
>>> df = pd.DataFrame({'b': [1, 2], 'c': [3,4]})
>>> df
b c
0 1 3
1 2 4
>>> df.insert(0, 'a', -1)
>>> df
a b c
0 -1 1 3
1 -1 2 4
>>> df.insert(0, 'a', -2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python39\lib\site-packages\pandas\core\frame.py", line 3760, in insert
self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
File "C:\Python39\lib\site-packages\pandas\core\internals\managers.py", line 1191, in insert
raise ValueError(f"cannot insert {item}, already exists")
ValueError: cannot insert a, already exists
>>> df.insert(0, 'a', -2, allow_duplicates = True)
>>> df
a a b c
0 -2 -1 1 3
1 -2 -1 2 4
You could try to extract columns as list, massage this as you want, and reindex your dataframe:
>>> cols = df.columns.tolist()
>>> cols = [cols[-1]]+cols[:-1] # or whatever change you need
>>> df.reindex(columns=cols)
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
EDIT: this can be done in one line, although it looks a bit ugly. A cleaner proposal may come along...
>>> df.reindex(columns=['n']+df.columns[:-1].tolist())
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
Here is a very simple answer to this (only one line).
You can reorder the columns after adding the 'n' column to your df, as follows.
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
l v n
0 a 1 0
1 b 2 0
2 c 1 0
3 d 2 0
# here you can add the below code and it should work.
df = df[list('nlv')]
df
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
However, list('nlv') works only because every column name is a single letter. If your column names are words, pass them as an explicit sequence of names instead:
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
Upper Lower Net Mid Zsore
0 a 1 0 2 2
1 b 2 0 2 2
2 c 1 0 2 2
3 d 2 0 2 2
# here you can add below line and it should work
df = df[list(('Mid','Upper', 'Lower', 'Net','Zsore'))]
df
Mid Upper Lower Net Zsore
0 2 a 1 0 2
1 2 b 2 0 2
2 2 c 1 0 2
3 2 d 2 0 2
A general 4-line routine
You can use the following 4-line routine whenever you want to create a new column and insert it at a specific location loc.
df['new_column'] = ... #new column's definition
col = df.columns.tolist()
col.insert(loc, col.pop()) #loc is the column's index you want to insert into
df = df[col]
In your example, it is simple:
df['n'] = 0
col = df.columns.tolist()
col.insert(0, col.pop())
df = df[col]