Counting Data based on Cover_Type using pandas

I have the following data in an Excel sheet.
I need to count the number of times a given elevation occurs for a given cover_type. For example, elevation=1905 occurs twice for cover_type=6 and once for cover_type=3. I need to do the same for Aspect, Slope, Horizontal_Distance_To_Hydrology, Vertical_Distance_To_Hydrology, Horizontal_Distance_To_Roadways, Hillshade_9am, Hillshade_Noon, Hillshade_3pm, Horizontal_Distance_To_Fire_Points, Soil, and Wilderness_Area.
I will be using the counts to calculate the entropy of each column. I need to execute this formula.

You can do the following:
import pandas as pd
df = pd.read_csv('train_data.csv')
# count the number of rows for each (elevation, cover_type) pair
grouped = df.groupby(['elevation', 'cover_type'], as_index=False, sort=False).size()
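The exact entropy formula from the question isn't shown, so the following is only a sketch assuming the standard Shannon entropy over those pair counts, and assuming the column names match the CSV header:
import numpy as np
import pandas as pd

df = pd.read_csv('train_data.csv')

def column_entropy(df, column, target='cover_type'):
    # count how often each (value, cover_type) pair occurs,
    # e.g. elevation=1905 with cover_type=6 occurs twice
    counts = df.groupby([column, target]).size()
    # turn counts into probabilities and apply H = -sum(p * log2(p))
    probs = counts / counts.sum()
    return -(probs * np.log2(probs)).sum()

print(column_entropy(df, 'elevation'))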

How to calculate rolling.agg('max') utilising a dataframe column as input to my function

I'm working with a kline dataframe. I'm adding Swing_High and Swing_Low columns to my df.
I've run into an error where, during low-volatility periods, my Close == Swing_Low price. This gives me an inf error in another function I have where I compute close / Swing_Low.
To fix this I need to calculate the max/min value based on whether Close == Swing_Low or not. The default rolling period is 10, but if the above is true then increase the rolling period to 15.
Below is how I calculated Swing_High and Swing_Low up to encountering the inf error.
import pandas as pd
df = pd.read_csv('Data/bybit_BTCUSD_15m.csv')
df["Date"] = df["Date"].astype('datetime64[ns]')
# Calculate the swing high and low for a given length
df['Swing_High'] = df['High'].rolling(10).agg('max')
df['Swing_Low'] = df['Low'].rolling(10).agg('min')
I tried the function below, but it gives me a ValueError: The truth value of a Series is ambiguous.
def swing_high(close, high, period1, period2):
    a = high.rolling(period1).agg('max')
    b = high.rolling(period2).agg('max')
    if a != close:
        return a
    else:
        return b

df['Swing_High'] = swing_high(df['Close'], df['High'], 10, 15)
How do I fix this or is there a better way to achieve my desired outcome?
A simple solution for what you're trying to achieve, using the pandas where() function. The basic syntax is:
df['col'] = (value_if_true).where(condition, value_if_false)
df['Swing_High_10'] = df['High'].rolling(10).agg('max')
df['Swing_High_15'] = df['High'].rolling(15).agg('max')
df['Swing_High'] = df['Swing_High_10'].where(df['Swing_High_10'] != df['Close'], df['Swing_High_15'])
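The same pattern should work for the Swing_Low side, which is where the question's inf error came from (a sketch along the same lines):
# default 10-period rolling min, fall back to 15 periods when it equals Close
df['Swing_Low_10'] = df['Low'].rolling(10).agg('min')
df['Swing_Low_15'] = df['Low'].rolling(15).agg('min')
df['Swing_Low'] = df['Swing_Low_10'].where(df['Swing_Low_10'] != df['Close'], df['Swing_Low_15'])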

How can I always choose the last column in a csv table that's updated monthly?

Automating small business reporting from my QuickBooks P&L. I'm trying to get the net income value for the current month from a specific cell in a dataframe, but that cell moves one column to the right every month when I update the CSV file.
For example, for the code below, this month I want the value from Nov[0], but next month I'll want the value from Dec[0], even though that column doesn't exist yet.
Is there a graceful way to always select the second right most column, or is this a stupid way to try and get this information?
import numpy as np
import pandas as pd
nov = -810
dec = 14958
total = 8693
d = {'Jan': [50], 'Feb': [70], 'Total':[120]}
df = pd.DataFrame(data=d)
Sure, you can reference the last or second-to-last row or column.
d = {'Jan': [50], 'Feb': [70], 'Total':[120]}
df = pd.DataFrame(data=d)
x = df.iloc[-1,-2]
This will select the value in the last row for the second-to-last column, in this case 70. :)
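If you'd rather grab the second-rightmost column by name instead of a single cell, here is a small variation on the same idea (not from the original answer, just a sketch using the example frame above):
month_col = df.columns[-2]     # name of the second-rightmost column, here 'Feb'
value = df[month_col].iloc[0]  # first (and only) row of that column
print(month_col, value)        # Feb 70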
If you plan to use the full file, @VincentRupp's answer will get you what you want.
But if you only plan to use the values in the second-rightmost column and you can infer what it will be called, you can tell pd.read_csv that's all you want.
import pandas as pd # 1.5.1
# assuming we want this month's name
# can modify to use some other month
abbreviated_month_name = pd.to_datetime("today").strftime("%b")
df = pd.read_csv("path/to/file.csv", usecols=[abbreviated_month_name])
print(df.iloc[-1, 0])
References
pd.read_csv
strftime cheat-sheet

How to use Pandas vector methods based on rolling custom function that involves entire row and prior data

While it's easy to use the pandas rolling method to apply standard formulas, I find it hard when the calculation involves multiple columns and only a limited number of past rows. The following code illustrates the problem:
import numpy as np
import pandas as pd
# create a dummy dataframe
df = pd.DataFrame({'col1': np.arange(0, 25), 'col2': np.arange(100, 125), 'col3': np.nan})

def func1(shortdf):
    # dummy formula:
    # take the last row of col1, add the sum of col2, multiply by 3.14
    return (shortdf.col1.tail(1).values[0] + shortdf.col2.sum()) * 3.14

for idx, i in df.iterrows():
    if idx > 3:
        # only interested in the last 3 rows before the current position
        df.loc[idx, 'col3'] = func1(df.iloc[idx - 3:idx])
I currently use this iterrows approach, which is, needless to say, extremely slow. Does anyone have a better suggestion?
Option 1
So shift is the solution here. You do have to use rolling for the summation, and then shift that series after the addition and multiplication.
df = pd.DataFrame({'col1':np.arange(0,25),'col2':np.arange(100,125),'col3':np.nan})
ans = ((df['col1'] + df['col2'].rolling(3).sum()) * 3.14).shift(1)
You can check to see that ans is the same as df['col3'] by using ans.eq(df['col3']). Once you see that all but the first few are the same, just change ans to df['col3'] and you should be all set.
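For example, a quick sanity check along those lines (a sketch; exact float equality can be sensitive to summation order in general):
# rows 0-3 are NaN in one or both series, so compare the rest
print(ans.iloc[4:].eq(df['col3'].iloc[4:]).all())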
Option 2
Without additional information about the customized weight function, it is hard to help. However, this option may be a solution as it separates the rolling calculation at the cost of using more memory.
# df['col3'] = ((df['col1'] + df['col2'].rolling(3).sum()) * 3.14).shift(1)
s = df['col2']
stride = pd.DataFrame([s.shift(x).values[::-1][:3] for x in range(len(s))[::-1]])
res = pd.concat([df, stride], axis=1)
# here you can perform your custom weight function
res['final'] = ((res[0] + res[1] + res[2] + res['col1']) * 3.14).shift(1)
stride is adapted from this question, and the result is concatenated column-wise to the original dataframe. In this way each row has all the values needed to compute whatever it is you may need.
res['final'] is identical to option 1's ans.
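If NumPy >= 1.20 is available, the same windows can also be built without the shift loop using sliding_window_view. This is not from the original answers, just an alternative sketch of the same idea:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.arange(0, 25), 'col2': np.arange(100, 125), 'col3': np.nan})

# each row of `windows` holds three consecutive col2 values
windows = np.lib.stride_tricks.sliding_window_view(df['col2'].to_numpy(), 3)
# align the window sums with the last row of each window (positions 2..24)
roll_sum = pd.Series(windows.sum(axis=1), index=df.index[2:])
ans = ((df['col1'] + roll_sum) * 3.14).shift(1)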

Count frequency of multiple words

I used this code
unclassified_df['COUNT'] = unclassified_df.tweet.str.count('mulcair')
to count the number of times mulcair appeared in each row of my pandas dataframe. I am trying to repeat the same for a set of words such as
Liberal = ['lpc','ptlib','justin','trudeau','realchange','liberal', 'liberals', "liberal2015",'lib2015','justin2015', 'trudeau2015', 'lpc2015']
I saw somewhere that I could use collections.Counter(data) and its .most_common(k) method for this; can anyone please help me out?
from collections import Counter
import pandas as pd
# check the frequency of the following words for each row, counting each word at most once per row
Liberal = ['lpc','justin','trudeau','realchange','liberal', 'liberals', "liberal2015", 'lib2015','justin2015', 'trudeau2015', 'lpc2015']
#sample data
data = {'tweet': ['lpc living dream camerama', "jsutingnasndsa dnsadnsadnsa dsalpcdnsa", "but", 'mulcair suggests thereslcp bad lpc blood']}
# the data frame with one column, tweet
df = pd.DataFrame(data,columns=['tweet'])
# no duplicates per row: count rows that contain each word at least once
print([(df.tweet.str.contains(word).sum(), word) for word in Liberal])
# captures all duplicates: count every occurrence of each word across rows
print(pd.Series({w: df.tweet.str.count(w).sum() for w in Liberal}))
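Since the question mentions collections.Counter and most_common(k), a rough equivalent using Counter might look like this (a sketch; it splits tweets on whitespace, so it only counts whole-word matches):
liberal_set = set(Liberal)
counts = Counter(word for tweet in df.tweet for word in tweet.split() if word in liberal_set)
print(counts.most_common(3))  # the 3 most frequent words from the Liberal list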
References:
Contains & match

pandas / numpy arithmetic mean in csv file

I have a CSV file which contains 3000 rows and 5 columns, and more rows are constantly appended to it on a weekly basis.
What I'm trying to do is find the arithmetic mean of the last column for the last 1000 rows, every week. (So when new rows are added weekly, it will just take the average of the most recent 1000 rows.)
How should I construct the pandas or numpy array to achieve this?
df = pd.read_csv("fds.csv", index_col=False, header=0)
df_1 = df['Results']
#How should I write the next line of codes to get the average for the most 1000 rows?
I'm on a different machine than the one my pandas is installed on, so I'm going from memory, but I think what you'll want to do is...
import numpy as np

# your 5th column appears to be named `Results`
df = pd.read_csv("fds.csv", index_col=False, header=0)
last_thousand = df['Results'].tail(1000)
np.mean(last_thousand)
A little bit quicker using mean():
df = pd.read_csv("fds.csv", header = 0)
results = df.tail(1000).mean()
results will contain the mean for each column within the last 1000 rows. If you want more statistics, you can also use describe():
results = df.tail(1000).describe().unstack()
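Since the question asks specifically for the last column, you can also select it by position, so the code keeps working even if the column name changes (a short sketch):
# mean of the last column over the most recent 1000 rows
last_col_mean = df.tail(1000).iloc[:, -1].mean()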
So basically I needed to use the pandas tail function. My code below works.
import numpy

df = pd.read_csv("fds.csv", index_col=False, header=0)
df_1 = df['Results']
numpy.average(df_1.tail(1000))