Increment or reset a counter based on an existing value of a DataFrame column in pandas

I have a dataframe imported from csv file along the lines of the below:
   Value  Counter
1      5        0
2     15        1
3     15        2
4     15        3
5     10        0
6     15        1
7     15        1
I want to increment the counter only if the value equals 15, and otherwise reset it to 0. I tried cumsum but am stuck on how to reset it back to zero on a non-match.
Here is my code
import pandas as pd

df = pd.read_csv("H:/test/test.csv")
# counts every 15 seen so far, but never resets on a non-match
df["Counted"] = (df["Value"] == 15).cumsum()
df.to_csv('H:/test/List.csv', index=False)
Thanks for your help

Here's my approach:
s = df.Value.ne(15)
df['Counter'] = (~s).groupby(s.cumsum()).cumsum()
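To see why this works: s is True on the rows that should reset the counter, s.cumsum() gives every stretch of 15s its own group id, and the inner cumsum counts the 15s within each group. A self-contained sketch using the sample values from the question:

```python
import pandas as pd

df = pd.DataFrame({'Value': [5, 15, 15, 15, 10, 15, 15]})

# True on rows that should reset the counter (Value != 15)
s = df['Value'].ne(15)

# s.cumsum() labels each run of 15s with a group id;
# the cumsum of ~s then counts the 15s within each group
df['Counter'] = (~s).groupby(s.cumsum()).cumsum()
print(df['Counter'].tolist())  # [0, 1, 2, 3, 0, 1, 2]
```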


Pandas: Newbie question on compare and (re)calculate fields with pandas

What I need to do is to compare 2 fields in a row in a csv-file:
Data looks like this:
store;ean;price;retail_price;quantity
001;0888721396226;200;200;2
001;0888721396233;200;159;2
001;2194384654084;299;259;7
001;2194384654091;199.95;199.95;8
If "price" is equal to "retail_price", the retail_price field must be reduced by a given percentage, e.g. -10%.
So in the example data, the first and last lines should change to 180 and 179.955.
I'm completely new to pandas, and after reading the "getting started" section I did not find anything I could build on...
Any help or hint is appreciated (just point me in the right direction, and I will figure it out myself).
Kind regards!
Use Series.eq to compare the two columns, then multiply retail_price by 0.9 only where they match, via numpy.where:
mask = df['price'].eq(df['retail_price'])
df['retail_price'] = np.where(mask, df['retail_price'].mul(0.9), df['retail_price'])
print (df)
store ean price retail_price quantity
0 1 888721396226 200.00 180.000 2
1 1 888721396233 200.00 159.000 2
2 1 2194384654084 299.00 259.000 7
3 1 2194384654091 199.95 179.955 8
Or you can use DataFrame.loc to multiply only the matched rows by 0.9:
mask = df['price'].eq(df['retail_price'])
df.loc[mask, 'retail_price'] *= 0.9
# equivalent to
df.loc[mask, 'retail_price'] = df.loc[mask, 'retail_price'] * 0.9
EDIT: to select the rows that did not match (where mask is False), use:
df2 = df[~mask].copy()
print (df2)
store ean price retail_price quantity
1 1 888721396233 200.0 159.0 2
2 1 2194384654084 299.0 259.0 7
print (mask)
0 True
1 False
2 False
3 True
dtype: bool
This is my code:
import pandas as pd
import numpy as np

# create multiplier from static value in file "prozente.txt"
with open('prozente.txt', 'r') as f:
    prozente = int(f.readline())
mulvalue = 1 - (prozente / 100)

# header=0 so the original header row is replaced by names, not a data row
df = pd.read_csv('1.csv', sep=';', header=0,
                 names=['store', 'ean', 'price', 'retail_price', 'quantity'])
mask = df['price'].eq(df['retail_price'])
df['retail_price'] = np.where(mask, df['retail_price'].mul(mulvalue).round(2), df['retail_price'])
df2 = df[~mask].copy()
df.to_csv('output.csv', columns=['store', 'ean', 'price', 'retail_price', 'quantity'], sep=';', index=False)
print(df)
print(df2)
using this as 1.csv:
store;ean;price;retail_price;quantity
001;0888721396226;200;200;2
001;0888721396233;200;159;2
001;2194384654084;299;259;7
001;2194384654091;199.95;199.95;8
The content of file "prozente.txt" is
25

Reading CSV files and importing column data as NumPy arrays

I have many csv files, each containing two columns: 'Energy' and 'Count'. My goal is to import the data and keep it as two separate numpy arrays, say X and Y, where X holds all the Energy values and Y all the Count values. The problem is that my csv files have a blank row after each data row, which is causing a lot of trouble. How can I eliminate those lines and save the data as arrays?
Energy Counts
-0.4767 0
-0.4717 0
-0.4667 0
-0.4617 0
-0.4567 0
-0.4517 0
import pandas as pd
import glob
import numpy as np
import os
import matplotlib.pyplot as plt

file_path = "path"  # file path
read_files = glob.glob(os.path.join(file_path, "*.csv"))  # get all csv files

X = []  # create empty list
Y = []  # create empty list
for files in read_files:
    df = pd.read_csv(files, header=[0])
    X.append(df['Energy'])  # store X data
    Y.append(df['Counts'])  # store Y data

X = np.array(X)
Y = np.array(Y)
print(X.shape)
print(Y.shape)
plt.plot(X[50], Y[50])
plt.show()
Ideally, if the data were saved correctly, I would get my plot; but since it is not saving correctly, I am not getting any plot.
Blank lines are skipped by default (skip_blank_lines=True), so these lines won't be read into the dataframe:
df = pd.read_csv(files, header=[0], skip_blank_lines=True)
So your whole program should be something like this (each file has the same column headers in the first line, and the columns are separated by spaces):
...
df = pd.DataFrame()
for file in read_files:
    df = df.append(pd.read_csv(file, sep=r'\s+', skip_blank_lines=True))

df.plot(x='Energy', y='Counts')
plt.show()  # requires matplotlib.pyplot imported as plt
# save both columns in one file
df.to_csv('myXYFile.csv', index=False)
# or two files with one column each
df.Energy.to_csv('myXFile.csv', index=False)
df.Counts.to_csv('myYFile.csv', index=False)
TEST PROGRAM
import pandas as pd
import io

input1 = """Energy Counts
-0.4767 0
-0.4717 0
-0.4667 0
-0.4617 0
-0.4567 0
-0.4517 0
"""
input2 = """Energy Counts
-0.4767 0
-0.4717 0
"""

df = pd.DataFrame()
for input in (input1, input2):
    df = df.append(pd.read_csv(io.StringIO(input), sep=r'\s+', skip_blank_lines=True))
print(df)
TEST OUTPUT:
Energy Counts
0 -0.4767 0
1 -0.4717 0
2 -0.4667 0
3 -0.4617 0
4 -0.4567 0
5 -0.4517 0
0 -0.4767 0
1 -0.4717 0
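Note that DataFrame.append was removed in pandas 2.0; on current versions, the same test program can collect the pieces in a list and concatenate once, which is also faster:

```python
import io
import pandas as pd

input1 = """Energy Counts
-0.4767 0

-0.4717 0
"""
input2 = """Energy Counts
-0.4767 0
"""

# read each input, skipping blank lines (the default), then concat once
frames = [pd.read_csv(io.StringIO(text), sep=r'\s+', skip_blank_lines=True)
          for text in (input1, input2)]
df = pd.concat(frames, ignore_index=True)
print(df)
```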

Counter calling in pandas?

I want to use Counter values inside a pandas DataFrame.
Effort so far:
from __future__ import unicode_literals
import spacy, en_core_web_sm
from collections import Counter
import pandas as pd

nlp = en_core_web_sm.load()
c = Counter(token.pos_ for token in nlp('The cat sat on the mat.'))
sbase = sum(c.values())
for el, cnt in c.items():
    print(el, '{0:2.2f}%'.format((100.0 * cnt) / sbase))
df = pd.DataFrame.from_dict(c, orient='index').reset_index()
print(df)
Current Output:
index 0
0 NOUN 2
1 VERB 1
2 DET 2
3 ADP 1
4 PUNCT 1
Expected Output:
The below inside dataframe:
(u'NOUN', u'28.57%')
(u'VERB', u'14.29%')
(u'DET', u'28.57%')
(u'ADP', u'14.29%')
(u'PUNCT', u'14.29%')
How do I get el and cnt into the data frame?
This is a follow-up to an earlier question where I wanted the POS distribution listed as percentages:
Percentage Count Verb, Noun using Spacy?
My understanding was that I needed to put el and cnt in place of c below:
df = pd.DataFrame.from_dict(c, orient='index').reset_index()
I can only fix your output, since I do not have the original data (note the column is the integer 0 after reset_index, not the string '0'):
(df[0]/df[0].sum()).map("{0:.2%}".format)
Out[827]:
0 28.57%
1 14.29%
2 28.57%
3 14.29%
4 14.29%
Name: 0, dtype: object
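To build the expected (tag, percentage) rows directly, the Counter can be turned into a list of tuples before constructing the frame. A sketch using a hard-coded Counter standing in for the spacy output:

```python
import pandas as pd
from collections import Counter

# stand-in for the Counter produced from the spacy doc
c = Counter({'NOUN': 2, 'VERB': 1, 'DET': 2, 'ADP': 1, 'PUNCT': 1})

sbase = sum(c.values())
# one (el, cnt-as-percentage) tuple per POS tag
df = pd.DataFrame(
    [(el, '{0:.2%}'.format(cnt / sbase)) for el, cnt in c.items()],
    columns=['pos', 'percent'],
)
print(df)
```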

Reformatting pandas table when column contains repeated headers

I have the pandas DataFrame below, and I want to reshape it so that ["File Name", "File Start Time", etc.] become the column headers. I could loop through the rows looking for those strings, but perhaps there is a simpler option?
import pandas as pd
data = pd.read_csv(file_path + 'chb01-summary.txt', skiprows=28, header=None, sep=': ', engine='python')
file source https://www.physionet.org/pn6/chbmit/chb01/chb01-summary.txt
You can use read_csv and reshape by unstack:
url = 'https://www.physionet.org/pn6/chbmit/chb01/chb01-summary.txt'
df = pd.read_csv(url, skiprows=28, sep=r':\s+', names=['a','b'], engine='python')
print (df.head())
a b
0 File Name chb01_01.edf
1 File Start Time 11:42:54
2 File End Time 12:42:54
3 Number of Seizures in File 0
4 File Name chb01_02.edf
df = (df.set_index([df['a'].eq('File Name').cumsum(), 'a'])['b']
        .unstack()
        .reset_index(drop=True))
print (df.head())
a File End Time File Name File Start Time Number of Seizures in File \
0 12:42:54 chb01_01.edf 11:42:54 0
1 13:42:57 chb01_02.edf 12:42:57 0
2 14:43:04 chb01_03.edf 13:43:04 1
3 15:43:12 chb01_04.edf 14:43:12 1
4 16:43:19 chb01_05.edf 15:43:19 0
a Seizure End Time Seizure Start Time
0 None None
1 None None
2 3036 seconds 2996 seconds
3 1494 seconds 1467 seconds
4 None None
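The same reshape can also be written with pivot once the group id is stored as a column. A minimal sketch on inline key/value data mimicking the summary file (the real file has more fields and is read with read_csv as shown in the answer):

```python
import pandas as pd

# key/value rows in the shape produced by parsing the summary file
df = pd.DataFrame({
    'a': ['File Name', 'File Start Time', 'File End Time',
          'Number of Seizures in File'] * 2,
    'b': ['chb01_01.edf', '11:42:54', '12:42:54', '0',
          'chb01_02.edf', '12:42:57', '13:42:57', '0'],
})

# a new group starts at every 'File Name' row
df['g'] = df['a'].eq('File Name').cumsum()

# pivot the key/value pairs into one row per file
out = df.pivot(index='g', columns='a', values='b').reset_index(drop=True)
print(out)
```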

Division between two numbers in a Dataframe

I am trying to calculate a percent change between 2 numbers in one column when a signal from another column is triggered.
The trigger can be found with np.where(), but what I am having trouble with is the percent change. .pct_change does not work: .pct_change(-5) gives 16.03/20.35, and I want it the other way around, 20.35/16.03. See the table below. I tried returning the array from the index in np.where and feeding it to .iloc on the 'Close' column, but it says I can't use that array to get an .iloc position. Can anyone help me solve this problem? Thank you.
IdxNum | Close | Signal (1s)
==============================
0 21.45 0
1 21.41 0
2 21.52 0
3 21.71 0
4 20.8 0
5 20.35 0
6 20.44 0
7 16.99 0
8 17.02 0
9 16.69 0
10 16.03 1<< 26.9% <<< 20.35/16.03-1 (df.Close[5]/df.Close[10]-1)
11 15.67 0
12 15.6 0
You can try this code block (note .ix has been removed from pandas; use .loc instead):
import pandas as pd

# Create DataFrame
df = pd.DataFrame({'IdxNum': range(13),
                   'Close': [21.45, 21.41, 21.52, 21.71, 20.8, 20.35, 20.44,
                             16.99, 17.02, 16.69, 16.03, 15.67, 15.6],
                   'Signal': [0] * 13})
df.loc[10, 'Signal'] = 1

# Create a function that calculates the required diff
def cal_diff(row):
    if row['Signal'] == 1:
        signal_index = int(row['IdxNum'])
        row['diff'] = df.Close[signal_index - 5] / df.Close[signal_index] - 1
    return row

# Create a column and apply that difference
df['diff'] = 0
df = df.apply(cal_diff, axis=1)
In case you don't have an IdxNum column, you can use the index to calculate the difference:
# Create DataFrame
df = pd.DataFrame({
    'Close': [21.45, 21.41, 21.52, 21.71, 20.8, 20.35, 20.44,
              16.99, 17.02, 16.69, 16.03, 15.67, 15.6],
    'Signal': [0] * 13})
df.loc[10, 'Signal'] = 1

# Calculate the required difference
df['diff'] = 0
signal_index = df[df['Signal'] == 1].index[0]
df.loc[signal_index, 'diff'] = df.Close[signal_index - 5] / df.Close[signal_index] - 1
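If you prefer to avoid apply entirely, the same figure can be computed in a vectorised way with shift: the Close from five rows back divided by the current Close is exactly the ratio the question asks for. A sketch on the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Close': [21.45, 21.41, 21.52, 21.71, 20.8, 20.35, 20.44,
              16.99, 17.02, 16.69, 16.03, 15.67, 15.6],
    'Signal': [0] * 13})
df.loc[10, 'Signal'] = 1

# Close five rows back divided by current Close, only where Signal == 1
df['diff'] = np.where(df['Signal'].eq(1),
                      df['Close'].shift(5) / df['Close'] - 1,
                      0)
print(round(df.loc[10, 'diff'], 4))  # 0.2695
```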