Python: count the number of periods in a Series

I am getting unexpected results while counting the number of periods in each string of a Series.
I have tried this:
series = pd.Series(['how. are. you. today.', 'i. am. fine.', 'thank. you.'])
count = series.str.count('.')
Expected results are
0 4
1 3
2 2
but instead I get
0 21
1 12
2 11
How do I solve this? Thank you in advance.

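str.count interprets its pattern as a regular expression, and an unescaped '.' matches any character, so you get the length of each string instead. Escape the period (in a raw string) to match it literally:
series = pd.Series(['how. are. you. today.', 'i. am. fine.', 'thank. you.'])
count = series.str.count(r'\.')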
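As a quick check, here is a minimal sketch comparing the unescaped and escaped patterns on the sample data from the question. The variable names any_char, literal and literal_alt are illustrative only; re.escape is shown as an alternative way to build the escaped pattern.

```python
import re

import pandas as pd

series = pd.Series(['how. are. you. today.', 'i. am. fine.', 'thank. you.'])

# Unescaped: '.' is a regex wildcard, so every character is counted.
any_char = series.str.count('.')

# Escaped: only literal periods are counted.
literal = series.str.count(r'\.')

# re.escape builds the same escaped pattern programmatically.
literal_alt = series.str.count(re.escape('.'))

print(any_char.tolist())
print(literal.tolist())
```

The first list reproduces the string lengths from the question, the second the expected period counts.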

Related

Changing column name and its values at the same time

Pandas help!
I have a specific column like this,
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is miles per gallon.
Now I need to rename that 'Mpg' column to 'litre per 100 km' and convert its values to litres per 100 km at the same time. Any help? Thanks in advance.
-Tom
I managed to change the name of the column, but I could not do both at once.
Use pop to return and delete the column in one step, and rdiv to perform the conversion (litres per 100 km = 235.15 / mpg, using a rounded constant; the more precise value for US gallons is 235.214583):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
          df.pop('Mpg').rdiv(235.15))
Output:
litre per 100 km
0 13.063889
1 13.832353
2 12.376316
3 11.197619
4 14.696875
5 15.676667
An alternative to pop is to store the result in another dataframe, which lets you rename and convert in a single step. In the code below, I first reproduce your dataframe, then store the conversion constant and apply it to all entries using the apply method.
df = pd.DataFrame({'Mpg':[18,17,19,21,16,15]})
cc = 235.214583 # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc/x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.

Divide rows in two columns with Pandas

I am using Pandas.
For each row, regardless of the County, I would like to divide "AcresBurned" by "CrewsInvolved".
For each County, I would like to sum the total AcresBurned for that County and divide by the sum of the total CrewsInvolved for that County.
I just started coding and am not able to solve this. Please help. Thank you so much.
Counties AcresBurned CrewsInvolved
1 400 2
2 500 3
3 600 5
1 800 9
2 850 8
This is straightforward with Pandas. You can create a new column for the row-wise division:
df['Acres_per_Crew'] = df['AcresBurned'] / df['CrewsInvolved']
You can use groupby to get the per-county sums of AcresBurned and CrewsInvolved and merge them back into the dataframe:
df_gb = df.groupby('Counties')[['AcresBurned', 'CrewsInvolved']].sum().reset_index()
df_gb.columns = ['Counties', 'AcresBurnedPerCounty', 'CrewsInvolvedPerCounty']
df = df.merge(df_gb, on='Counties')
Once you've done this, you can create another column with a similar arithmetic operation, dividing AcresBurnedPerCounty by CrewsInvolvedPerCounty.
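Putting both steps together, here is a minimal self-contained sketch using the sample data from the question. The output column names AcresPerCrew and AcresPerCrewCounty are my own choices, not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    'Counties': [1, 2, 3, 1, 2],
    'AcresBurned': [400, 500, 600, 800, 850],
    'CrewsInvolved': [2, 3, 5, 9, 8],
})

# Row-wise ratio, regardless of county.
df['AcresPerCrew'] = df['AcresBurned'] / df['CrewsInvolved']

# Per-county totals, then the ratio of the two totals.
totals = df.groupby('Counties')[['AcresBurned', 'CrewsInvolved']].sum()
totals['AcresPerCrewCounty'] = totals['AcresBurned'] / totals['CrewsInvolved']

print(df)
print(totals)
```

Keeping the per-county result in its own small frame avoids repeating the same county ratio on every row; merge it back on 'Counties' only if you need it alongside the row-level data.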

Calculating the difference between values based on their date

I have a dataframe that looks like this, where the "Date" is set as the index
A B C D E
Date
1999-01-01 1 2 3 4 5
1999-01-02 1 2 3 4 5
1999-01-03 1 2 3 4 5
1999-01-04 1 2 3 4 5
I'm trying to compare the percent difference between two pairs of dates. I think I can do the first bit:
start_1 = "1999-01-02"
end_1 = "1999-01-03"
start_2 = "1999-01-03"
end_2 = "1999-01-04"
Obs_1 = df.loc[end_1] / df.loc[start_1] - 1
Obs_2 = df.loc[end_2] / df.loc[start_2] - 1
The output I get from, e.g., Obs_1 looks like this:
A 0.011197
B 0.007933
C 0.012850
D 0.016678
E 0.007330
dtype: float64
I'm looking to build some correlations between Obs_1 and Obs_2. I think I need to create a new dataframe with the labels A-E as one column (or as the index), and then the data series from Obs_1 and Obs_2 as adjacent columns.
But I'm struggling! I can't 'see' what Obs_1 and Obs_2 'are' - have I created a list? A series? How can I tell? What would be the best way of combining the two into a single dataframe...say df_1.
I'm sure the answer is staring me in the face but I'm going mental trying to figure it out...and because I'm not quite sure what Obs_1 and Obs_2 'are', it's hard to search the SO archive to help me.
Thanks in advance
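One possible way to approach this, as a hedged sketch: selecting a single row label with df.loc returns a pandas Series indexed by the column labels, which type() will confirm, and pd.concat can place two such Series side by side. The price values below are made up for illustration and are not the asker's data; the column names Obs_1 and Obs_2 are taken from the question.

```python
import pandas as pd

# Hypothetical data: the values are invented for illustration only.
df = pd.DataFrame(
    {
        "A": [100.0, 101.0, 103.0, 102.0],
        "B": [50.0, 51.0, 50.0, 52.0],
        "C": [10.0, 10.5, 10.4, 10.9],
        "D": [200.0, 198.0, 201.0, 205.0],
        "E": [30.0, 30.3, 30.9, 30.6],
    },
    index=pd.to_datetime(["1999-01-01", "1999-01-02", "1999-01-03", "1999-01-04"]),
)
df.index.name = "Date"

obs_1 = df.loc["1999-01-03"] / df.loc["1999-01-02"] - 1
obs_2 = df.loc["1999-01-04"] / df.loc["1999-01-03"] - 1

# Each observation is a Series whose index is the labels A-E.
print(type(obs_1))

# Place the two Series side by side as columns of one DataFrame.
df_1 = pd.concat([obs_1, obs_2], axis=1, keys=["Obs_1", "Obs_2"])
print(df_1)

# Correlation between the two sets of percent changes.
corr = df_1["Obs_1"].corr(df_1["Obs_2"])
print(corr)
```

The keys argument sets the column names; without it, the columns would be named after the dates used to select each row.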

Mean of consecutive days without selling

I am trying to calculate the mean of Interval without selling of a product.
I thought that a good way to get this is:
Count (Days without selling) / Count (Intervals of consecutive days without selling)
Units Sold
0 1
1 4
2 0
3 0
4 0
5 7
6 0
7 0
8 0
9 0
10 1
11 0
In this example I had:
8 days without selling
3 Intervals of consecutive days without selling
So, 8/3 ≈ 2.7 should be my result.
To count the days with no units sold, I am using this:
(x['Units Sold'] == 0).sum()
However, I haven't figured out a good approach to calculate the 'Intervals of consecutive days without selling' efficiently (considering I will run this on multiple products).
Another approach using nunique
s = df["Units Sold"].eq(0)
d = s.sum()
i = s[s].index.to_series().diff().ne(1).cumsum().nunique()
final = d/i # 2.6666666666666665
Using eq, cumsum and diff
First we use eq(0) and sum to count the number of days on which nothing was sold.
Then we take the cumsum of these zero-sale flags and check whether it changes between rows. A difference of 0 marks a day with sales; when the following day has no sales (the mask below), that day immediately precedes the start of an interval.
days = x['Units Sold'].eq(0).sum()
intervals = x['Units Sold'].eq(0).cumsum().diff().eq(0)
mask = x['Units Sold'].shift(-1).eq(0)
days / (intervals & mask).sum()
Output
2.6666666666666665
You already know how to count the zeros, so try this to find the number of consecutive groups of zeros:
s = df['Units Sold'].eq(0)
(s & ~s.shift(fill_value=False)).sum()
Out[567]: 3
You can use:
df.eq(0).sum()/((df.eq(0)&df.shift().ne(0)).sum())
Output:
Units Sold 2.666667
dtype: float64
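The shift-based idea from the answers above can be wrapped in a small helper, which may be convenient when running over multiple products. This is only a sketch: the function name mean_zero_run_length is hypothetical, and returning NaN when there are no zero-sale days is my own assumption about the desired behaviour.

```python
import pandas as pd

df = pd.DataFrame({"Units Sold": [1, 4, 0, 0, 0, 7, 0, 0, 0, 0, 1, 0]})


def mean_zero_run_length(sold: pd.Series) -> float:
    """Mean length of runs of consecutive zero-sale days (hypothetical helper)."""
    is_zero = sold.eq(0)
    days = is_zero.sum()
    # A run starts where the current day is zero and the previous day is not.
    starts = is_zero & ~is_zero.shift(fill_value=False)
    n_intervals = starts.sum()
    # Assumption: no zero-sale days at all yields NaN rather than an error.
    return days / n_intervals if n_intervals else float("nan")


result = mean_zero_run_length(df["Units Sold"])
print(result)
```

Because fill_value=False treats the row before the first one as a selling day, a run that begins at the very first row is still counted as an interval.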

How to calculate the rolling sum on custom time columns?

The rolling function in Pandas can only calculate rolling statistics according to row counts or date/time columns. But I want to have a discrete time column for calculating rolling sum, something like this:
key time value
A 1 10
A 2 20
A 4 30
A 7 10
B 1 15
B 2 30
B 3 15
I want to first group by key, then calculate, for each row, the rolling sum of value over the rows whose time falls within the previous 3 time units:
key time value output
A 1 10 10
A 2 20 30(10+20)
A 4 30 60(10+20+30)
A 7 10 40(30+10)
B 1 15 15
B 2 30 45
B 3 15 60
I tried this:
grouped = input.groupby("key", as_index=False)
for name, group in grouped:
    group = group.sort_values("time")
    time = list(group["time"])
    value = list(group["value"])
    # calcRollingStat is a custom function that outputs a list of corresponding results
    out = calcRollingStat(time, value, mode="avg")
    group["output"] = out  # out is a list
But then I don't know how to convert grouped back to DataFrame. Pandas tells me that there is no reset_index attribute in grouped.
Is my code the best method to do this? How would you tackle this problem?
Thank you!
I believe you can use GroupBy.apply with a custom function:
def f(group):
    group = group.sort_values("time")
    time = list(group["time"])
    value = list(group["value"])
    # calcRollingStat is a custom function that outputs a list of corresponding results
    group["output"] = calcRollingStat(time, value, mode="avg")
    return group

df = input.groupby("key", as_index=False).apply(f)
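Since calcRollingStat is not shown in the question, here is one possible self-contained sketch of the rolling sum itself. It assumes the window for a row at time t covers all rows of the same key with time in [t - 3, t], which is my reading of the expected output in the question. The function name rolling_sum_by_time and the window parameter are hypothetical, and a plain loop over the groups is used in place of GroupBy.apply.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": ["A", "A", "A", "A", "B", "B", "B"],
    "time": [1, 2, 4, 7, 1, 2, 3],
    "value": [10, 20, 30, 10, 15, 30, 15],
})


def rolling_sum_by_time(group: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """Sum `value` over rows with time in [t - window, t] (assumed window)."""
    group = group.sort_values("time")
    t = group["time"].to_numpy()
    v = group["value"].to_numpy()
    # Prefix sums let each window sum be computed as a difference.
    csum = np.concatenate(([0], np.cumsum(v)))
    left = np.searchsorted(t, t - window, side="left")   # first row inside the window
    right = np.searchsorted(t, t, side="right")          # one past the last row at time t
    return group.assign(output=csum[right] - csum[left])


# Process each key separately, then stitch the pieces back together.
out = pd.concat(
    [rolling_sum_by_time(g) for _, g in df.groupby("key")]
).sort_index()
print(out)
```

Collecting the per-group results with pd.concat yields an ordinary DataFrame, so there is no need to call reset_index on the GroupBy object.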