Grouping nearby data in pandas

Let's say I have the following dataframe:
df = pd.DataFrame({'a':[1,1.1,1.03,3,3.1], 'b':[10,11,12,13,14]})
df
a b
0 1.00 10
1 1.10 11
2 1.03 12
3 3.00 13
4 3.10 14
And I want to group nearby points, e.g.
df.groupby(#SOMETHING).mean():
a b
a
0 1.043333 11.0
1 3.050000 13.5
Now, I could use
#SOMETHING = pd.cut(df.a, np.arange(0, 5, 2), labels=False)
But only if I know the boundaries beforehand. How can I accomplish similar behavior if I don't know where to place the cuts? I.e. I want to group nearby points (with "nearby" being defined as within some epsilon).
I know this isn't trivial, because point x might be near point y, and point y might be near point z, yet point x might be too far from z; so it's ambiguous what to do--this is kind of a k-means problem, but I'm wondering if pandas has any tools built in to make this easy.
Use case: I have several processes that generate data at regular intervals, but they're not quite synced up, so the timestamps are close, but not identical, and I want to aggregate their data.

Based on this answer
df.groupby( (df.a.diff() > 1).cumsum() ).mean()
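For reference, a minimal runnable version of that idea (my assumptions: 'a' is sorted first so that diff() compares neighbours, and a gap larger than 1 starts a new group):
import pandas as pd

df = pd.DataFrame({'a': [1, 1.1, 1.03, 3, 3.1], 'b': [10, 11, 12, 13, 14]})

# Sort so consecutive differences are meaningful, then start a new group
# whenever the gap to the previous value exceeds the chosen epsilon (1 here).
df = df.sort_values('a')
groups = (df.a.diff() > 1).cumsum()
print(df.groupby(groups).mean())
On the example above this reproduces the desired means (1.043333 / 11.0 and 3.050000 / 13.5); note that consecutive small gaps chain together, so the x-near-y-near-z case from the question ends up as one group.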

Related

How can I detect similarity of names in the same columns

Guys I have a dataset like this:
df = pd.DataFrame(data=['John', 'gal britt', 'mona', 'diana', 'molly',
                        'merry', 'mony', 'molla', 'johnathon', 'dina'],
                  columns=['Name'])
df
it gives this output
Name
0 John
1 gal britt
2 mona
3 diana
4 molly
5 merry
6 mony
7 molla
8 johnathon
9 dina
so I imagine that to compare all the names against each other and detect similarity I will use df.merge(df, how="cross")
The thing is, the real data is 40000 rows, and performing this will result in a very big dataset which I don't have the memory for.
Any algorithm or idea would really help, and I'll adjust the logic to my purposes.
I tried working with vaex instead of pandas to handle this huge amount of data, but I still run into the problem of insufficient memory allocation.
In short: I KNOW that this algorithm or way of thinking about such a problem is wrong and inefficient.
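Not a full answer to the memory problem, but a sketch of the pairwise idea without materialising the cross join (difflib's SequenceMatcher and the 0.8 cutoff are my own choices here); it is still quadratic in time, so for 40000 rows you would probably also want a blocking step, e.g. only comparing names that share a first letter:
from difflib import SequenceMatcher
from itertools import combinations
import pandas as pd

df = pd.DataFrame(data=['John', 'gal britt', 'mona', 'diana', 'molly',
                        'merry', 'mony', 'molla', 'johnathon', 'dina'],
                  columns=['Name'])

# Walk over the pairs lazily instead of building an n x n frame in memory,
# keeping only pairs whose similarity clears the (arbitrary) 0.8 threshold.
similar = []
for a, b in combinations(df['Name'], 2):
    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if score >= 0.8:
        similar.append((a, b, round(score, 2)))
print(similar)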

Pandas run function only on subset of whole Dataframe

Let's say I have a DataFrame with 200 values: prices for products. I want to run some operation on this DataFrame, like calculating the average price over the last 10 prices.
The way I understand it, right now pandas will go through every single row and calculate an average for each row, i.e. the first 9 rows will be NaN, then from rows 10-200 it would calculate an average for each row.
My issue is that I need to do a lot of these calculations and performance is an issue. For that reason, I would want to run the average only on, say, the last 10 values (I don't need more) out of all the values, while keeping those values in the DataFrame, i.e. I don't want to get rid of those values or create a new DataFrame.
I essentially just want to do the calculation on less data, so it is faster.
Is something like that possible? Hopefully the question is clear.
Building off Chicodelarose's answer, you can achieve this in a more "pandas-like" syntax.
Defining your df as follows, we get 200 prices in the range [0, 1000):
import numpy as np
import pandas as pd

df = pd.DataFrame((np.random.rand(200) * 1000.).round(decimals=2), columns=["price"])
The bit you're looking for, though, would be the following:
def add10(n: float) -> float:
    """An exceptionally simple function to demonstrate that you can set
    values, too.
    """
    return n + 10

# Assign via .loc with explicit row labels to avoid chained-assignment issues.
df.loc[df.index[-12:], "price"] = df["price"].iloc[-12:].apply(add10)
Of course, you can also use these selections to return something else without setting values, too.
>>> df["price"].iloc[-12:].mean().round(decimals=2)
309.63 # this will, of course, be different as we're using random numbers
The primary justification for this approach lies in the use of pandas tooling. Say you want to operate over a subset of your data with multiple columns; you simply need to adjust your .apply(...) call to include an axis parameter, as follows: .apply(fn, axis=1).
This becomes much more readable the longer you spend in pandas. 🙂
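As a quick illustration of that axis=1 variant, here is a sketch on a hypothetical two-column frame (the qty column and the row-wise lambda are made up for the example):
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "price": (np.random.rand(200) * 1000.).round(decimals=2),
    "qty": np.random.randint(1, 10, size=200),
})

# Row-wise function over only the last 12 rows; each row arrives as a Series.
totals = df2.iloc[-12:].apply(lambda row: row["price"] * row["qty"], axis=1)
print(totals)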
Given a dataframe like the following:
Price
0 197.45
1 59.30
2 131.63
3 127.22
4 35.22
.. ...
195 73.05
196 47.73
197 107.58
198 162.31
199 195.02
[200 rows x 1 columns]
Call the following to obtain the mean over the last n rows of the dataframe:
def mean_over_n_last_rows(df, n, colname):
    return df.iloc[-n:][colname].mean().round(decimals=2)
print(mean_over_n_last_rows(df, 2, "Price"))
Output:
178.67

calculating probability from long series data in python pandas

I have data ranging from 19 to 49. How can I calculate the probability of the data occurring between 25 and 40?
46.58762816
30.50477684
27.4195249
47.98157313
44.55425608
30.21066503
34.27381019
48.19934524
46.82233375
46.05077036
42.63647302
40.11270346
48.04909583
24.18660332
24.47549276
44.45442651
19.24542913
37.44141763
28.41079638
21.69325455
31.32887617
26.26988582
18.19898804
19.01329026
28.33846808
The simplest thing you can do is to use the % of values that fall between 25 and 40.
If s is the pandas.Series you gave us:
In [1]: s.head()
Out[1]:
0 46.587628
1 30.504777
2 27.419525
3 47.981573
4 44.554256
Name: 0, dtype: float64
In [2]: # calculate number of values between 25 and 40 and divide by total count
s.between(25,40).sum()/float(s.count())
Out[2]: 0.3599
Otherwise, it would require trying to find what distribution your data might be following (from the data you gave, which might be just a small sample of your data, it doesn't appear to follow any distribution I know...), then testing whether it actually follows the distribution you think it follows (using the Kolmogorov-Smirnov test or another like it). If it does, you can use that distribution to calculate the probability, etc.
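To make that second route concrete, here is a rough sketch with scipy.stats, assuming (purely for illustration, since the sample doesn't obviously support it) a normal distribution:
import numpy as np
from scipy import stats

s = np.array([46.58762816, 30.50477684, 27.4195249, 47.98157313, 44.55425608,
              30.21066503, 34.27381019, 48.19934524, 46.82233375, 46.05077036,
              42.63647302, 40.11270346, 48.04909583, 24.18660332, 24.47549276,
              44.45442651, 19.24542913, 37.44141763, 28.41079638, 21.69325455,
              31.32887617, 26.26988582, 18.19898804, 19.01329026, 28.33846808])

# Fit a normal distribution and check the fit with Kolmogorov-Smirnov;
# a small p-value would mean the normal assumption should be rejected.
mu, sigma = stats.norm.fit(s)
print(stats.kstest(s, 'norm', args=(mu, sigma)))

# If the fit were acceptable, P(25 <= X <= 40) is the difference of the CDF
# at the two endpoints.
print(stats.norm.cdf(40, mu, sigma) - stats.norm.cdf(25, mu, sigma))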

Pandas: aggregation on multi-level groups

I have a df that looks something like this:
batch group reading temp test block delay
0 9551 Control 340 22.9 1 X 35
1 9551 Control 345 22.9 1 Y 35
I need to group by 'group' and 'block', e.g. my means would look like so:
df.groupby(['block', 'group']).reading.mean().unstack().transpose()
block X Y
group
Control 347.339450 350.427273
Trial 347.790909 350.668182
What would be the best way to call a 2-argument function like scipy.stats.ttest_ind on data sliced this way so I end up with a table of t tests for:
control vs trial in x
control vs trial in y
x vs y in control
x vs y in trial
Do you want to group and aggregate the data before applying the t-test? I think you want to select subsets of the data. Grouping can do that, but masking might get the job done more simply.
Offhand, I'd say you want something like
scipy.stats.ttest_ind(df[(df.group == 'Control') & (df.block == 'X')].reading,
                      df[(df.group == 'Trial') & (df.block == 'X')].reading)
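A possible way to collect all four comparisons into one small table with the same masking idea (the readings helper below is just shorthand I'm introducing, and df is the frame from the question):
import pandas as pd
from scipy import stats

def readings(df, group, block):
    """Return the 'reading' values for one group/block combination."""
    return df[(df.group == group) & (df.block == block)].reading

comparisons = {
    'Control vs Trial in X': (readings(df, 'Control', 'X'), readings(df, 'Trial', 'X')),
    'Control vs Trial in Y': (readings(df, 'Control', 'Y'), readings(df, 'Trial', 'Y')),
    'X vs Y in Control': (readings(df, 'Control', 'X'), readings(df, 'Control', 'Y')),
    'X vs Y in Trial': (readings(df, 'Trial', 'X'), readings(df, 'Trial', 'Y')),
}

rows = []
for name, (a, b) in comparisons.items():
    t, p = stats.ttest_ind(a, b)
    rows.append({'comparison': name, 't': t, 'p': p})
print(pd.DataFrame(rows).set_index('comparison'))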

Identifying graphs in a heap of connected nodes -- what is this called?

I have a SQL table with three columns X, Y, Z. I need to split it into groups in such a way that all records with the same value of X or Y or Z are assigned to the same group. I need to make sure that records with the same value of X or Y or Z are never split across multiple groups.
If you think of records as nodes and values of X, Y, Z as edges, this problem is the same as finding all graphs where the nodes in each graph are connected directly or indirectly via an X, Y, or Z edge, and no graph has an edge in common with another graph (otherwise it would be part of the same graph).
A few years ago I knew what this was called and even remembered the algorithm, but now it escapes me. Please tell me what this problem is called so I can Google for a solution. If you know a good algorithm -- please point me to it. If you have a SQL implementation -- I will marry you :)
Example:
X    Y    Z    BUCKET
---  ---  ---  ------
1    34   56   1
54   43   45   2
1    12   22   1
2    34   11   1
The last row is in bucket 1 because of the value of Y=34 which is the same as of the first row, which is in bucket 1.
It looks less like a graph and more like a simplicial complex.
But if we treat this complex as its skeletal graph (the numbers are treated as vertices, and a row in the table means that those three vertices are pairwise connected by edges), then we may just use any algorithm for finding the connected components of this graph. I'm not sure whether there is a feasible way to do this in SQL, though; perhaps it would be more prudent to use a graph database somehow.
However, for this specific problem there may be some easy solution attainable by means of SQL which I didn't look for.
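For what it's worth, here is a small sketch of the connected-components idea outside of SQL, using a plain union-find over the question's example (keying vertices as (column, value) is my assumption; drop the column prefix if a match across different columns should also merge rows):
import pandas as pd

# The example table from the question; BUCKET is what we want to derive.
df = pd.DataFrame({'X': [1, 54, 1, 2],
                   'Y': [34, 43, 12, 34],
                   'Z': [56, 45, 22, 11]})

parent = {}

def find(k):
    """Find the representative of k's component, with path compression."""
    parent.setdefault(k, k)
    while parent[k] != k:
        parent[k] = parent[parent[k]]
        k = parent[k]
    return k

def union(a, b):
    parent[find(a)] = find(b)

# Each row ties its three (column, value) vertices into one component.
for _, row in df.iterrows():
    union(('X', row.X), ('Y', row.Y))
    union(('X', row.X), ('Z', row.Z))

# Label each row by the root of its X vertex, then renumber roots as buckets.
roots = [find(('X', x)) for x in df.X]
df['BUCKET'] = pd.factorize(roots)[0] + 1
print(df)
This reproduces the buckets 1, 2, 1, 1 from the example above.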
To find how many nodes are in each group x:
select x, count(x)
from mytable
group by x
Or to find the list of distinct x values:
select distinct x from mytable;
Why don't you initially GROUP BY one of the columns (say X), make buckets, then do so for Y and Z, each time merging all the buckets from the previous step if you find new groups.
Repeat the process for X, Y, and Z until the buckets stop changing.
Are you working for LinkedIn or Facebook? :)