Reformatting a pandas table when a column contains repeated headers

I have the pandas DataFrame below and I want to reshape it so that ["File Name", "File Start Time", etc.] become the column headers. I can imagine running a loop through the rows looking for strings, but perhaps there is a simpler option?
import pandas as pd
data = pd.read_csv(file_path + 'chb01-summary.txt', skiprows=28, header=None, delimiter=': ')
File source: https://www.physionet.org/pn6/chbmit/chb01/chb01-summary.txt

You can use read_csv and then reshape with unstack:
url = 'https://www.physionet.org/pn6/chbmit/chb01/chb01-summary.txt'
df = pd.read_csv(url, skiprows=28, sep=r':\s+', names=['a', 'b'], engine='python')
print (df.head())
                            a             b
0                   File Name  chb01_01.edf
1             File Start Time      11:42:54
2               File End Time      12:42:54
3  Number of Seizures in File             0
4                   File Name  chb01_02.edf
df = (df.set_index([df['a'].eq('File Name').cumsum(), 'a'])['b']
        .unstack()
        .reset_index(drop=True))
print (df.head())
a File End Time     File Name File Start Time Number of Seizures in File  \
0      12:42:54  chb01_01.edf        11:42:54                          0
1      13:42:57  chb01_02.edf        12:42:57                          0
2      14:43:04  chb01_03.edf        13:43:04                          1
3      15:43:12  chb01_04.edf        14:43:12                          1
4      16:43:19  chb01_05.edf        15:43:19                          0

a Seizure End Time Seizure Start Time
0             None               None
1             None               None
2     3036 seconds       2996 seconds
3     1494 seconds       1467 seconds
4             None               None
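The key step is df['a'].eq('File Name').cumsum(), which produces a counter that increments at every 'File Name' row, so each record block gets its own group id before unstacking. A minimal sketch of just that step:
import pandas as pd

s = pd.Series(['File Name', 'File Start Time', 'File End Time', 'File Name'])
print (s.eq('File Name').cumsum())
0    1
1    1
2    1
3    2
dtype: int64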


Adding file names to column names in a pandas dataframe

I have a pandas dataframe created from several csv files. The csv files are all structured the same way, so I have the same column names over and over again. I want each column name to be prefixed with the name of the file it came from (I have the file names in a list).
From this I know how to add a count to columns with the same name, and I know how to rename columns. But I fail at bringing the right file name to the right column.
This should be the relevant part of the code:
for i in range(0, len(file_list)):
    data = pd.read_table(file_list[i], encoding='unicode_escape')
    df = pd.DataFrame(data)
    df = df.drop(droplist, axis=1)
    main_dataframe = pd.concat([main_dataframe, df], axis=1)
You can use a dictionary in concat to generate a MultiIndex:
list_of_files = ['f1.csv', 'f2.csv']
pd.concat({f: pd.read_table(f, encoding='unicode_escape', sep=',')
           for f in list_of_files}, axis=1)
example:
# f1.csv
a,b
1,2
3,4
# f2.csv
a,b
5,6
7,8
output:
  f1.csv    f2.csv
       a  b      a  b
0      1  2      5  6
1      3  4      7  8
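With the resulting MultiIndex, a single column can then be selected with a tuple key; a small usage sketch (the name combined is hypothetical):
combined = pd.concat({f: pd.read_table(f, encoding='unicode_escape', sep=',')
                      for f in list_of_files}, axis=1)
print (combined[('f1.csv', 'a')])  # the 'a' column that came from f1.csv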
Alternative using add_prefix in the list comprehension:
pd.concat([pd.read_table(f, encoding='unicode_escape', sep=',')
             .add_prefix(f[:-3])  # prefix with the file name minus 'csv'; the dot becomes the separator
           for f in list_of_files], axis=1)
output:
   f1.a  f1.b  f2.a  f2.b
0     1     2     5     6
1     3     4     7     8

Pandas: Newbie question on comparing and (re)calculating fields

What I need to do is compare two fields in each row of a csv file.
The data looks like this:
store;ean;price;retail_price;quantity
001;0888721396226;200;200;2
001;0888721396233;200;159;2
001;2194384654084;299;259;7
001;2194384654091;199.95;199.95;8
If "price" equals "retail_price", the retail_price field must be reduced by a given percentage, e.g. 10%.
So in the example data, the first and last lines should be changed to 180 and 179.955.
I'm completely new to pandas, and after reading the "getting started" part I did not find anything I could build on...
So any help or hint (just point me in the right direction, I will figure it out myself) is appreciated.
Kind regards!
Use Series.eq to compare both values, then numpy.where to multiply retail_price by 0.9 where they match and leave it unchanged otherwise:
import numpy as np

mask = df['price'].eq(df['retail_price'])
df['retail_price'] = np.where(mask, df['retail_price'].mul(0.9), df['retail_price'])
print (df)
   store            ean   price  retail_price  quantity
0      1   888721396226  200.00       180.000         2
1      1   888721396233  200.00       159.000         2
2      1  2194384654084  299.00       259.000         7
3      1  2194384654091  199.95       179.955         8
Or you can use DataFrame.loc to multiply only the matched rows by 0.9:
mask = df['price'].eq(df['retail_price'])
df.loc[mask, 'retail_price'] *= 0.9
# this works like:
df.loc[mask, 'retail_price'] = df.loc[mask, 'retail_price'] * 0.9
EDIT: to filter for the rows that do not match the mask (the False values in mask) use:
df2 = df[~mask].copy()
print (df2)
   store            ean  price  retail_price  quantity
1      1   888721396233  200.0         159.0         2
2      1  2194384654084  299.0         259.0         7
print (mask)
0     True
1    False
2    False
3     True
dtype: bool
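The ~ operator inverts the boolean mask, so df[~mask] keeps exactly the rows where price and retail_price differ; a quick check:
print ((~mask).tolist())
[False, True, True, False]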
This is my code:
import pandas as pd
import numpy as np

# build the multiplier from the static value in the file "prozente.txt"
with open('prozente.txt', 'r') as f:
    prozente = int(f.readline())
mulvalue = 1 - (prozente / 100)

# header=0 so the header line is replaced by the explicit names
df = pd.read_csv('1.csv', sep=';', header=0,
                 names=['store', 'ean', 'price', 'retail_price', 'quantity'])
mask = df['price'].eq(df['retail_price'])
df['retail_price'] = np.where(mask,
                              df['retail_price'].mul(mulvalue).round(2),
                              df['retail_price'])
df2 = df[~mask].copy()
df.to_csv('output.csv', columns=['store', 'ean', 'price', 'retail_price', 'quantity'],
          sep=';', index=False)
print(df)
print(df2)
using this as 1.csv:
store;ean;price;retail_price;quantity
001;0888721396226;200;200;2
001;0888721396233;200;159;2
001;2194384654084;299;259;7
001;2194384654091;199.95;199.95;8
The content of file "prozente.txt" is
25
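With prozente = 25 the multiplier becomes 0.75, so matched rows are reduced by 25%; a quick sanity check of the arithmetic:
prozente = 25
mulvalue = 1 - (prozente / 100)     # 0.75
print(round(200 * mulvalue, 2))     # 150.0
print(round(199.95 * mulvalue, 2))  # 149.96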

Increment or reset counter based on an existing value of a data frame column in Pandas

I have a dataframe imported from a csv file along the lines of the one below:
     Value  Counter
1.       5        0
2.      15        1
3.      15        2
4.      15        3
5.      10        0
6.      15        1
7.      15        1
I want to increment the counter only if the value is 15, otherwise reset it to 0. I tried cumsum but I am stuck on how to reset it back to zero on a non-match.
Here is my code
import pandas as pd
import csv
import numpy as np
dfs = []
df = pd.read_csv("H:/test/test.csv")
df["Counted"] = (df["Value"] == 15).cumsum()
dfs.append(df)
big_frame = pd.concat(dfs, sort=True, ignore_index=False)
big_frame.to_csv('H:/test/List.csv' , index=False)
Thanks for your help
Here's my approach:
s = df.Value.ne(15)
df['Counter'] = (~s).groupby(s.cumsum()).cumsum()
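How it works: s is True on every reset row (Value != 15), so s.cumsum() assigns a group id to each run, and the cumsum of (~s) within each group counts the consecutive 15s, restarting at every reset. A minimal sketch on the sample values (note the last counter comes out as 2, since consecutive 15s keep incrementing; the 1 in the question's sample looks like a typo):
import pandas as pd

df = pd.DataFrame({'Value': [5, 15, 15, 15, 10, 15, 15]})
s = df.Value.ne(15)          # True on reset rows
print (s.cumsum().tolist())  # [1, 1, 1, 1, 2, 2, 2] -> run ids
df['Counter'] = (~s).groupby(s.cumsum()).cumsum()
print (df)
   Value  Counter
0      5        0
1     15        1
2     15        2
3     15        3
4     10        0
5     15        1
6     15        2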

Parsing a Python list of dates into a pandas DataFrame

I need some help/advice on how to wrangle dates into a pandas DataFrame. I have a Python list looking like this:
['',
'20180715:1700-20180716:1600',
'20180716:1700-20180717:1600',
'20180717:1700-20180718:1600',
'20180718:1700-20180719:1600',
'20180719:1700-20180720:1600',
'20180721:CLOSED',
'20180722:1700-20180723:1600',
'20180723:1700-20180724:1600',
'20180724:1700-20180725:1600',
'20180725:1700-20180726:1600',
'20180726:1700-20180727:1600',
'20180728:CLOSED']
Is there an easy way to transform this into a Pandas DataFrame with two columns (start time and end time)?
Sample:
L = ['',
'20180715:1700-20180716:1600',
'20180716:1700-20180717:1600',
'20180717:1700-20180718:1600',
'20180718:1700-20180719:1600',
'20180719:1700-20180720:1600',
'20180721:CLOSED',
'20180722:1700-20180723:1600',
'20180723:1700-20180724:1600',
'20180724:1700-20180725:1600',
'20180725:1700-20180726:1600',
'20180726:1700-20180727:1600',
'20180728:CLOSED']
I think the best approach here is a list comprehension that splits on the separator and filters out values without it (this also drops the empty string and the ':CLOSED' entries):
df = pd.DataFrame([x.split('-') for x in L if '-' in x], columns=['start','end'])
print (df)
           start            end
0  20180715:1700  20180716:1600
1  20180716:1700  20180717:1600
2  20180717:1700  20180718:1600
3  20180718:1700  20180719:1600
4  20180719:1700  20180720:1600
5  20180722:1700  20180723:1600
6  20180723:1700  20180724:1600
7  20180724:1700  20180725:1600
8  20180725:1700  20180726:1600
9  20180726:1700  20180727:1600
A pure pandas solution is also possible, especially if you need to process a Series; here split and dropna are used:
s = pd.Series(L)
df = s.str.split('-', expand=True).dropna(subset=[1])
df.columns = ['start','end']
print (df)
            start            end
1   20180715:1700  20180716:1600
2   20180716:1700  20180717:1600
3   20180717:1700  20180718:1600
4   20180718:1700  20180719:1600
5   20180719:1700  20180720:1600
7   20180722:1700  20180723:1600
8   20180723:1700  20180724:1600
9   20180724:1700  20180725:1600
10  20180725:1700  20180726:1600
11  20180726:1700  20180727:1600
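If actual timestamps are needed rather than strings, both columns can then be parsed with to_datetime; a small follow-up sketch, assuming the format shown above:
df['start'] = pd.to_datetime(df['start'], format='%Y%m%d:%H%M')
df['end'] = pd.to_datetime(df['end'], format='%Y%m%d:%H%M')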

How to shift the column headers in pandas

I have .txt files I'm reading in with pandas and the header line starts with '~A'. I need to ignore the '~A' and have the next header correspond to the data in the first column. Thanks!
You can do this:
import pandas as pd
data = pd.read_csv("./test.txt", names=[ 'A', 'B' ], skiprows=1)
print(data)
and the output for input:
~A, A, B
1, 2
3, 4
is:
c:\Temp\python>python test.py
A B
0 1 2
1 3 4
You have to name the columns yourself but given that your file seems to be malformed I guess it is not that bad.
If your header lines are not the same in all files, then you can just read them in Python:
import pandas as pd

# read first line
with open("./test.txt") as myfile:
    headRow = next(myfile)

# read column names
columns = [x.strip() for x in headRow.split(',')]

# process by pandas
data = pd.read_csv("./test.txt", names=columns[1:], skiprows=1)
print(data)
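If several files need the same treatment, the two steps can be wrapped in a small helper; a sketch, where read_with_shifted_header is a hypothetical name:
import pandas as pd

def read_with_shifted_header(path):
    # take the column names from the first line, dropping the leading '~A' token
    with open(path) as fh:
        head = next(fh)
    cols = [c.strip() for c in head.split(',')]
    return pd.read_csv(path, names=cols[1:], skiprows=1)

frames = [read_with_shifted_header(p) for p in ['./test.txt']]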