Insert column name and move the column data into the row - pandas

I want the data laid out like this, but when I use df.Column[] it replaces the value with the column name.

First question: does the .xls file have column names, or do you need to define them manually?
If it has column names (it seems like it doesn't), you can use header=0. If it doesn't have column names, define a list of names, then pass header=None and names=column_names.
The first case:
df = pd.read_excel('wine-1.xls', header=0)  # use read_excel instead of read_csv
and the second:
column_names = ['wine', 'acidity', ...]  # list of column names
df = pd.read_excel('wine-1.xls', header=None, names=column_names)
Hope that works for you.

How to select only rows containing specific values with multiple data frames in a for loop?

I'm new to Python. I have multiple data frames and want to select rows from each one based on a column that contains the value XXX.
Below is my code:
MasterFiles = [Master_Jun22, Master_May22, Master_Apr22, Master_Mar22, Master_Feb22, Master_Jan22,
Master_Dec21, Master_Nov21, Master_Oct21, Master_Sep21, Master_Aug21, Master_Jul21,
Master_Jun21, Master_May21, Master_Apr21]
ColumName = ['product_category']
for d in MasterFiles:
    for c in ColumName:
        d = d.loc[d[c] == 'XXX']
It is not working; please help with this.
You need to gather the output and concatenate it into a new DataFrame. Note that DataFrame.append was deprecated and then removed in pandas 2.0, so collect the filtered pieces in a list and call pd.concat once:
MasterFiles = [Master_Jun22, Master_May22, Master_Apr22, Master_Mar22, Master_Feb22, Master_Jan22,
               Master_Dec21, Master_Nov21, Master_Oct21, Master_Sep21, Master_Aug21, Master_Jul21,
               Master_Jun21, Master_May21, Master_Apr21]
ColumName = ['product_category']
filtered = []
for d in MasterFiles:
    for c in ColumName:
        filtered.append(d.loc[d[c] == 'XXX'])
res_df = pd.concat(filtered, ignore_index=True)
# the results
res_df.head()
I am not sure if I am understanding your question correctly, so let me rephrase it here.
You have 3 tasks:
first is to loop through each pandas data frame,
second is to loop through each column in your ColumName list, and
third is to return the data frame rows that contain the value Surabhi - DCL - Unsecured in the columns named in the ColumName list.
If I am interpreting this correctly, this is how I would work on your issue.
MasterFiles = [Master_Jun22, Master_May22, Master_Apr22, Master_Mar22, Master_Feb22, Master_Jan22,
               Master_Dec21, Master_Nov21, Master_Oct21, Master_Sep21, Master_Aug21, Master_Jul21,
               Master_Jun21, Master_May21, Master_Apr21]
ColumName = ['product_category']
## list to store data frames filtered by rows
df_temp = []
for d in MasterFiles:
    for c in ColumName:
        df_temp.append(d.loc[d[c] == 'Surabhi - DCL - Unsecured'])
## Assuming row-wise concatenation,
## i.e., using the same column names to join the data
df = pd.concat(df_temp, axis=0, ignore_index=True)
## df is the data frame you need
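As a self-contained sketch of the filter-and-concat pattern above (the two small frames and the 'XXX' value are stand-ins I invented for the real monthly data):

```python
import pandas as pd

# Stand-in monthly frames; in the real code these are Master_Jun22, etc.
Master_Jun22 = pd.DataFrame({'product_category': ['XXX', 'YYY'], 'amount': [1, 2]})
Master_May22 = pd.DataFrame({'product_category': ['ZZZ', 'XXX'], 'amount': [3, 4]})

MasterFiles = [Master_Jun22, Master_May22]
ColumName = ['product_category']

# Collect the matching rows from every frame, then concatenate once.
df_temp = []
for d in MasterFiles:
    for c in ColumName:
        df_temp.append(d.loc[d[c] == 'XXX'])

df = pd.concat(df_temp, axis=0, ignore_index=True)
print(df)  # every remaining row has product_category == 'XXX'
```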

Add column with filename wildcard

I have files that have the pattern
XXXX____________030621_120933_D.csv
YYYY____________030621_120933_E.csv
ZZZZ____________030621_120933_F.csv
I am using glob.glob and a for loop to parse each file into pandas to create data frames, which I will merge at the end. I want to add a column that holds XXXX, YYYY, or ZZZZ in each data frame accordingly.
I can create the column called ID with df['ID'] and want to pick the value from the filenames. What is the easiest way to grab that from the filename when reading the CSV and processing via pd?
If the file names are as you have presented, then use this code:
import glob
import os
import pandas as pd

dir_path = 'path/to/your/directory/'  # path to your directory
file_paths = glob.glob(os.path.join(dir_path, '*.csv'))
frames = []
for file_ in file_paths:
    df = pd.read_csv(file_)
    df['ID'] = os.path.basename(file_).split('_')[0]  # e.g. 'XXXX'
    frames.append(df)
result = pd.concat(frames, ignore_index=True)
Splitting the base name on the first underscore picks out the XXXX/YYYY/ZZZZ prefix, and collecting the frames in a list before a single pd.concat avoids the now-removed DataFrame.append.
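Here is a runnable sketch of the same loop, using temporary CSV files I generate to match the naming pattern from the question:

```python
import glob
import os
import tempfile
import pandas as pd

# Create three small CSVs matching the naming pattern in a temp directory.
dir_path = tempfile.mkdtemp()
for prefix in ['XXXX', 'YYYY', 'ZZZZ']:
    path = os.path.join(dir_path, f'{prefix}____________030621_120933_D.csv')
    pd.DataFrame({'value': [1, 2]}).to_csv(path, index=False)

frames = []
for file_ in glob.glob(os.path.join(dir_path, '*.csv')):
    df = pd.read_csv(file_)
    df['ID'] = os.path.basename(file_).split('_')[0]  # prefix before first underscore
    frames.append(df)

result = pd.concat(frames, ignore_index=True)
print(sorted(result['ID'].unique()))  # ['XXXX', 'YYYY', 'ZZZZ']
```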

how to put first value in one column and remaining into other column?

ROCO2_CLEF_00001.jpg,C3277934,C0002978
ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569
I want to make a pandas data frame from a CSV file.
I want to put the first entry of each row (the filename) into a column named "filenames", and the remaining entries into another column named "class". How do I do that?
In case your file doesn't have a fixed number of commas per row, you could do the following:
import pandas as pd
csv_path = 'test_csv.csv'
raw_data = open(csv_path).readlines()
# clean rows
raw_data = [x.strip().replace("'", "") for x in raw_data]
print(raw_data)
# make split between data
raw_data = [ [x.split(",")[0], ','.join(x.split(",")[1:])] for x in raw_data]
print(raw_data)
# build the pandas Dataframe
column_names = ["filenames", "class"]
temp_df = pd.DataFrame(data=raw_data, columns=column_names)
print(temp_df)
filenames class
0 ROCO2_CLEF_00001.jpg C3277934,C0002978
1 ROCO2_CLEF_00002.jpg C3265939,C0002942,C2357569
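A pandas-native variant of the same idea: split each line on the first comma only via Series.str.split with n=1 (the inline StringIO stands in for your CSV file):

```python
import io
import pandas as pd

# Stand-in for the CSV file from the question.
raw = io.StringIO(
    "ROCO2_CLEF_00001.jpg,C3277934,C0002978\n"
    "ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569\n"
)

# Split each line on the first comma only, keeping the rest intact.
lines = pd.Series(raw.read().splitlines())
df = lines.str.split(',', n=1, expand=True)
df.columns = ['filenames', 'class']
print(df)
```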

How do you split All columns in a large pandas data frame?

I have a very large data frame that I want to split ALL of the columns except first two based on a comma delimiter. So I need to logically reference column names in a loop or some other way to split all the columns in one swoop.
In my testing of the split method:
I have been able to explicitly refer to (i.e. hard-code) a single column name (rs145629793) as one of the required parameters, and the result was 2 new columns, as I wanted.
See the Python code below.
Hard-coded column name:
df[['rs1','rs2']] = df.rs145629793.str.split(",", expand = True)
The problem:
It is not feasible to refer to every actual column name and repeat the code.
I then replaced the actual column name rs145629793 with columns[2] in the split method's parameter list.
It results in an ERROR:
'str' object has no attribute 'str'
You can index columns by position rather than name using iloc. For example, to get the third column:
df.iloc[:, 2]
Thus you can easily loop over the columns you need.
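Building on that, a sketch that loops over every column after the first two and splits each one (the column names and values here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b'],
                   'rs1': ['0,1', '1,0'], 'rs2': ['1,1', '0,0']})

# Snapshot the columns to split before mutating the frame.
cols = list(df.columns[2:])
for col in cols:
    # add_prefix keeps the new column names unique and traceable.
    parts = df[col].str.split(',', expand=True).add_prefix(f'{col}_')
    df = df.join(parts).drop(columns=col)

print(df.columns.tolist())  # ['id', 'name', 'rs1_0', 'rs1_1', 'rs2_0', 'rs2_1']
```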
I know what you are asking, but it's still helpful to provide some input data and expected output data. I have included random input data in my code below, so you can just copy and paste this to run, and try to apply it to your dataframe:
import pandas as pd

your_dataframe = pd.DataFrame({'a': ['1,2,3', '9,8,7'],
                               'b': ['4,5,6', '6,5,4'],
                               'c': ['7,8,9', '3,2,1']})

def split_cols(df):
    cols = df.columns.to_list()
    for col in cols:
        var = df[col].str.split(',', expand=True).add_prefix(col)
        df = pd.merge(df, var, how='left',
                      left_index=True, right_index=True).drop(col, axis=1)
    return df

split_cols(your_dataframe)
Essentially, in this solution you create a list of the columns that you want to loop through. Then you loop through that list, run the split() function on each column, and merge everything back together on the index. I also:
included a prefix of the column name, so the new columns do not have duplicate names and are more easily identifiable;
dropped the old column that we did the split on.
Just call the split_cols() function that I have created and pass it your dataframe.

Format of data in a column in a data frame

I have read a fixed-width file and created a dataframe.
I have a field called claim number which is of length 15. In the data frame I see this field appearing as "1.902431e+14" rather than the full 15-digit claim number.
How can I resolve this so that I can see the entire 15-digit claim number in the data frame?
For example, use the pandas float_format option as follows:
#data claim_number dictionary
dictionary = {'claim_number': [1.902431111111141]}
#specification of claim number format
pd.options.display.float_format = '{:,.15f}'.format
#Create dataframe
df = pd.DataFrame(data=dictionary)
If you want to apply your format specifically to one column only, you can use style.format instead of pd.options.display.float_format as follows:
#data claim_number dictionary
dictionary = {'claim_number_1': [1.902431111111141], 'claim_number_2': ['some_string'], 'claim_number_3': [0.2323]}
#Create dataframe
df = pd.DataFrame(data=dictionary)
#style format single column
df.style.format({'claim_number_1': "{:,.15f}"})
More options on how to use style.format can be found here https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html
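An alternative worth noting, on the assumption that the claim number is an identifier rather than a quantity: reading the column as a string sidesteps float rounding and scientific notation entirely. The width, column name, and sample values below are invented for illustration:

```python
import io
import pandas as pd

# Stand-in for the fixed-width file: one 15-character claim number per line.
fwf = io.StringIO("190243111111114\n190243111111115\n")

# dtype=str keeps the full 15 digits instead of parsing them as floats.
df = pd.read_fwf(fwf, widths=[15], names=['claim_number'], dtype=str)
print(df['claim_number'].iloc[0])  # 190243111111114
```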