Adding file name to column name pandas dataframe - pandas

I have a pandas dataframe created from several csv files. The csv files are all structured the same way, so I have the same column names over and over again. I want the column names to be expanded by the file names (which I have as a list) they come from.
From this I know how to add a count to same name columns and I know how to rename columns. But I fail at bringing the right file name to the right column value.
That should be the relevant part of the code:
for i in range(0,len(file_list)):
data = pd.read_table(file_list[i], encoding='unicode_escape')
df = pd.DataFrame(data)
df = df.drop(droplist,axis=1)
main_dataframe = pd.concat([main_dataframe, df], axis = 1)

You can use a dictionary in concat to generate a MultiIndex:
list_of_files = ['f1.csv', 'f2.csv']
pd.concat({f: pd.read_table(f, encoding='unicode_escape', sep=',')
for f in list_of_files}, axis=1)
example:
# f1.csv
a,b
1,2
3,4
# f2.csv
a,b
5,6
7,8
output:
f1.csv f2.csv
a b a b
0 1 2 5 6
1 3 4 7 8
Alternative using add_prefix in the list comprehension:
pd.concat([pd.read_table(f, encoding='unicode_escape', sep=',')
.add_prefix(f[:-3]) # add prefix without ".csv" extension
for f in list_of_files], axis=1))
output:
f1.a f1.b f2.a f2.b
0 1 2 5 6
1 3 4 7 8

Related

is There any methods to merge multiple dataframes of different templates

There are a total of 4 dataframes (df1 / df2 / df3 / df4),
Each dataframe has a different template, but they all have the same columns.
I want to merges the row of each dataframe based on the same column, but what function should I use? A 'merge' or 'join' function doesn't seem to work, and deleting the rest of the columns after grouping them into a list seems to be too messy.
I want to make attached image
This is an option, you can merge the dataframes and then drop the useless columns from the total dataframe.
df_total = pd.concat([df1, df2, df3, df4], axis=0)
df_total.drop(['Value2', 'Value3'], axis=1)
You can use reduce to get it done too.
from functools import reduce
reduce(lambda left,right: pd.merge(left, right, on=['ID','value1'], how='outer'), [df1,df2,df3,df4])[['ID','value1']]
ID value1
0 a 1
1 b 4
2 c 5
3 f 1
4 g 5
5 h 6
6 i 1

How do I use df.add_suffix to add suffixes to duplicate column names in Pandas?

I have a large dataframe with 400 columns. 200 of the column names are duplicates of the first 200. How can I used df.add_suffix to add a suffix only to the duplicate column names?
Or is there a better way to do it automatically?
Here is my solution, starting with:
df=pd.DataFrame(np.arange(4).reshape(1,-1),columns=['a','b','a','b'])
output
a b a b
0 1 2 3 4
Then I use Lambda function
df.columns += df.columns+np.vectorize(lambda x:'_' if x else '')(df.columns.duplicated())
Output
a b a_ b_
0 0 1 2 3
If you have more than one duplicate then you can loop until there is none left. This works for duplicated indices too, it also keeps the index name.
If I understand your question correct you have each name twice. If so it is possible to ask for duplicated values using df.columns.duplicated(). Then you can create a new list only modifying duplicated values and adding your self definied suffix. This is different from the other posted solution which modifies all entries.
df = pd.DataFrame(data=[[1, 2, 3, 4]], columns=list('aabb'))
my_suffix = 'T'
df.columns = [name if duplicated == False else name + my_suffix for duplicated, name in zip(df.columns.duplicated(), df.columns)]
df
>>>
a aT b bT
0 1 2 3 4
My answer has the disadvantage that the dataframe can have duplicated column names if one name is used three or more times.
You could do:
import pandas as pd
# setup dummy DataFrame with repeated columns
df = pd.DataFrame(data=[[1, 2, 3]], columns=list('aaa'))
# create unique identifier for each repeated column
identifier = df.columns.to_series().groupby(level=0).transform('cumcount')
# rename columns with the new identifiers
df.columns = df.columns.astype('string') + identifier.astype('string')
print(df)
Output
a0 a1 a2
0 1 2 3
If there is only one duplicate column, you could do:
# setup dummy DataFrame with repeated columns
df = pd.DataFrame(data=[[1, 2, 3, 4]], columns=list('aabb'))
# create unique identifier for each repeated column
identifier = df.columns.duplicated().astype(int)
# rename columns with the new identifiers
df.columns = df.columns.astype('string') + identifier.astype(str)
print(df)
Output (for only one duplicate)
a0 a1 b0 b1
0 1 2 3 4
Add numbering suffix starts with '_1' started with the first duplicated column and applicable to columns appearing more than once.
E.g a column name list: [a, b, c, a, b, a] will return [a, b, c, a_1, b_1, a_2]
from collections import Counter
counter = Counter()
empty_list= []
for x in range(df.shape[1]):
counter.update([df.columns[x]])
if counter[df.columns[x]] == 1:
empty_list.append(df.columns[x])
else:
tx = counter[df.columns[x]] -1
empty_list.append(df.columns[x] + '_' + str(tx))
df.columns = empty_list
df.columns

Parsing python list of dates into a pandas DataFrame

need some help/advise how to wrangling dates into a Pandas DataFrame. I have Python list looking like this:
['',
'20180715:1700-20180716:1600',
'20180716:1700-20180717:1600',
'20180717:1700-20180718:1600',
'20180718:1700-20180719:1600',
'20180719:1700-20180720:1600',
'20180721:CLOSED',
'20180722:1700-20180723:1600',
'20180723:1700-20180724:1600',
'20180724:1700-20180725:1600',
'20180725:1700-20180726:1600',
'20180726:1700-20180727:1600',
'20180728:CLOSED']
Is there an easy way to transform this into a Pandas DataFrame with two columns (start time and end time)?
Sample:
L = ['',
'20180715:1700-20180716:1600',
'20180716:1700-20180717:1600',
'20180717:1700-20180718:1600',
'20180718:1700-20180719:1600',
'20180719:1700-20180720:1600',
'20180721:CLOSED',
'20180722:1700-20180723:1600',
'20180723:1700-20180724:1600',
'20180724:1700-20180725:1600',
'20180725:1700-20180726:1600',
'20180726:1700-20180727:1600',
'20180728:CLOSED']
I think best here is use list comprehension with split by separator and filter out values with no splitter:
df = pd.DataFrame([x.split('-') for x in L if '-' in x], columns=['start','end'])
print (df)
start end
0 20180715:1700 20180716:1600
1 20180716:1700 20180717:1600
2 20180717:1700 20180718:1600
3 20180718:1700 20180719:1600
4 20180719:1700 20180720:1600
5 20180722:1700 20180723:1600
6 20180723:1700 20180724:1600
7 20180724:1700 20180725:1600
8 20180725:1700 20180726:1600
9 20180726:1700 20180727:1600
Pandas solution is also possible, especially if need process Series - here is used split and dropna:
s = pd.Series(L)
df = s.str.split('-', expand=True).dropna(subset=[1])
df.columns = ['start','end']
print (df)
start end
1 20180715:1700 20180716:1600
2 20180716:1700 20180717:1600
3 20180717:1700 20180718:1600
4 20180718:1700 20180719:1600
5 20180719:1700 20180720:1600
7 20180722:1700 20180723:1600
8 20180723:1700 20180724:1600
9 20180724:1700 20180725:1600
10 20180725:1700 20180726:1600
11 20180726:1700 20180727:1600

Python Pandas extract columns from .csv

I have a .csv file, which I can read in Pandas. The .csv file looks like the following.
a b c d e
1 4 3 2 5
6 7 8 3 6
...
What I need to achieve is, that I can extract a and b as a column vector and
[c d e] as a matrix. I used Pandas with the following code to read the .csv file:
pd.read_csv('data.csv', sep=',',header=None)
But this will give me a vector like this: [[a,b,c,d,e],[1,4,3,2,5],...]
How can I extract the columns? I heared about df.iloc, but this cannot be used here, since after pd.read_csv there is only one column.
You should be able to do that with:
ds = pd.read_csv('data.csv', sep=',',header=0)
column_a = ds["a"]
matrix = ds[["c","d","e"]]

pd.dataframe.apply() create multiple new columns

I have a bunch of files where I want to open, read the first line, parse it into several expected pieces of information, and then put the filenames and those data as rows in a dataframe. My question concerns the recommended syntax to build the dataframe in a pandanic/pythonic way (the file-opening and parsing I already have figured out).
For a dumbed-down example, the following seems to be the recommended thing to do when you want to create one new column:
df = pd.DataFrame(files, columns=['filename'])
df['first_letter'] = df.apply(lambda x: x['filename'][:1], axis=1)
but I can't, say, do this:
df['first_letter'], df['second_letter'] = df.apply(lambda x: (x['filename'][:1], x['filename'][1:2]), axis=1)
as the apply function creates only one column with tuples in it.
Keep in mind that, in place of the lambda function I will place a function that will open the file and read and parse the first line.
You can put the two values in a Series, and then it will be returned as a dataframe from the apply (where each series is a row in that dataframe). With a dummy example:
In [29]: df = pd.DataFrame(['Aa', 'Bb', 'Cc'], columns=['filenames'])
In [30]: df
Out[30]:
filenames
0 Aa
1 Bb
2 Cc
In [31]: df['filenames'].apply(lambda x : pd.Series([x[0], x[1]]))
Out[31]:
0 1
0 A a
1 B b
2 C c
This you can then assign to two new columns:
In [33]: df[['first', 'second']] = df['filenames'].apply(lambda x : pd.Series([x[0], x[1]]))
In [34]: df
Out[34]:
filenames first second
0 Aa A a
1 Bb B b
2 Cc C c