Sum of data entry with the given index in pandas dataframe - pandas

I am trying to get the sum of every possible combination of the entries in a pandas dataframe column. To do this I use itertools.combinations to generate all the possible index combinations, then loop over them and sum each one.
Is there any way to do this without the loop?
The following script shows what I want.
import pandas as pd
import itertools as it
A = pd.Series([50, 20, 75], index = list(range(1, 4)))
df = pd.DataFrame({'A': A})
listNew = []
for i in range(1, len(df.A)+1):
    Temp=it.combinations(df.index.values, i)
    for data in Temp:
        listNew.append(data)
print(listNew)
for data in listNew:
    print(df.A[list(data)].sum())
The output of this script is:
[(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
50
20
75
70
125
95
145
thank you in advance.

IIUC, you can use reindex:
# convert your list of tuples to a DataFrame and use stack to flatten it
s=pd.DataFrame([(1,), (2,), (3,), (1, 2),(1, 3),(2, 3), (1, 2, 3)]).stack().to_frame('index')
# then reindex df in that order and pull out the values of A
s['Value']=df.reindex(s['index']).A.values
# sum within each original tuple (level 0 of the index); Series.sum(level=0) was removed in pandas 2.0
s=s.Value.groupby(level=0).sum()
s
Out[796]:
0     50
1     20
2     75
3     70
4    125
5     95
6    145
Name: Value, dtype: int64
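If the goal is to avoid the per-combination loop altogether, one option (not from the answer above, just a minimal sketch) is to enumerate the subsets as rows of a 0/1 matrix and get every sum with a single matrix product. Note that the subsets come out in bitmask order, which is not necessarily the order itertools.combinations produces, and that the matrix has 2**n - 1 rows, so this only makes sense for a small number of entries.

```python
import numpy as np
import pandas as pd

A = pd.Series([50, 20, 75], index=[1, 2, 3])
n = len(A)

# Row k-1 holds the binary digits of k (least significant bit first),
# so the rows enumerate every non-empty subset of the n positions.
masks = (np.arange(1, 2 ** n)[:, None] >> np.arange(n)) & 1

# One matrix product gives the sum of every subset at once.
sums = masks @ A.to_numpy()

# Recover the index labels that belong to each subset.
labels = [tuple(A.index[row.astype(bool)]) for row in masks]

print(sums.tolist())  # [50, 20, 70, 75, 125, 95, 145]
```

Here labels[i] tells you which index entries went into sums[i].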

Related

Pandas - find rows sharing two out the three common values, order-independent, and collect values pairs

Given a dataframe, I am looking for rows where two out of three values are in common, regardless of the columns, hence order, in which they appear. I would like to then collect those common pairs.
Please note that a pair of values can appear in at most two rows, and that a value can appear only once in a row.
I would like to know the most efficient/elegant way to solve this problem in numpy or pandas.
For example, taking as input the dataframe
d = {'col1': [1, 2,5,9], 'col2': [2, 7,1,2],'col3': [3, 3,2,7]}
df = pd.DataFrame(data=d)
   col1  col2  col3
0     1     2     3
1     2     7     3
2     5     1     2
3     9     2     7
I expect as result an array, list, something as
1 2
2 3
2 7
as the values (1,2), (2,3) and (2,7) are present in two rows (first and third, first and second, and second and fourth respectively).
I cannot find a concise solution.
At the moment I have sketched a numpy solution such as
import numpy as np
def func(x):
    rows, columns = x.shape[0], x.shape[1]
    res = []
    # compare every pair of rows and keep intersections with at least two values
    for i in range(0,rows):
        for j in range(i+1, rows):
            aux = np.intersect1d(x[i,:], x[j,:])
            if aux.size>1:
                res.append(aux)
    return res
which outputs
func(df.values)
Out: [array([2, 3]), array([1, 2]), array([2, 7])]
It looks rather cumbersome. How could I get this done with one of those cool numpy/pandas one-liners?
I would suggest using Python's built-in set operations to do most of the heavy lifting, and just applying them with pandas:
import itertools
import pandas as pd
d = {'col1': [1, 2,5,9], 'col2': [2, 7,1,2],'col3': [3, 3,2,7]}
df = pd.DataFrame(data=d)
pairs = df.apply(set, axis=1).apply(lambda x: set(itertools.combinations(x, 2))).explode()
out = set(pairs[pairs.duplicated()])
Output:
{(2, 3), (1, 2), (2, 7)}
Optionally, to get it in list[np.ndarray] format (this needs import numpy as np):
out = list(map(np.array, out))
Similar approach to that of @Chrysophylaxs but in pure Python:
from itertools import combinations
from collections import Counter
c = Counter(s for x in df.to_numpy().tolist() for s in set(combinations(set(x), r=2)))
out = [k for k,v in c.items() if v>1]
# [(2, 3), (1, 2), (2, 7)]
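Both set-based answers rely on every row yielding a given pair in the same order, e.g. (1, 2) and never (2, 1). That holds for these small integers, but set iteration order is not something to depend on in general. A minimal sketch of the same counting idea that sorts each row first, so each pair has one canonical form (the rows are the example data from the question; the variable names are mine):

```python
from collections import Counter
from itertools import combinations

# Rows of the example dataframe, i.e. df.to_numpy().tolist()
rows = [[1, 2, 3], [2, 7, 3], [5, 1, 2], [9, 2, 7]]

# Sorting each row's distinct values means a pair is always emitted as (small, large).
pair_counts = Counter(
    pair
    for row in rows
    for pair in combinations(sorted(set(row)), 2)
)

shared_pairs = sorted(pair for pair, count in pair_counts.items() if count > 1)
print(shared_pairs)  # [(1, 2), (2, 3), (2, 7)]
```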
df=df.assign(col4=df.index)
def function1(ss:pd.Series):
    ss1=ss.value_counts().loc[lambda ss:ss>=2]
    return ss1.index.tolist() if ss1.size>=2 else None
df.merge(df,how='cross',suffixes=('','_2')).query("col4!=col4_2").filter(regex=r'col[^4]', axis=1)\
    .apply(function1,axis=1).dropna().drop_duplicates()
out
1 [2, 3]
2 [1, 2]
7 [2, 7]

Tensorflow dataset of sliding windows keeping track of index

I have a dataframe which contains time series data: for the sake of simplicity, let's say that the index is my "datetime" or just the element that establishes the order of the data. Columns a and b instead are real numbers and I set them equal to the index just to explain my problem.
import pandas as pd
import numpy as np
import tensorflow as tf
data = pd.DataFrame({'a': np.arange(100), 'b': np.arange(100)})
print(data)
Which outputs:
a b
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
.. .. ..
95 95 95
96 96 96
97 97 97
98 98 98
99 99 99
Then, I proceed to create a dataset of sliding windows over the time series dataframe:
data = np.array(data, dtype=np.float32)
ds = tf.keras.utils.timeseries_dataset_from_array(data=data, targets=None,
                                                  sequence_length=6,
                                                  sequence_stride=6,
                                                  sampling_rate=1,
                                                  shuffle=True,
                                                  batch_size=None,
                                                  seed=1)
for i in ds.take(3):
    print(i)
Which outputs:
tf.Tensor(
[[84. 84.]
[85. 85.]
[86. 86.]
[87. 87.]
[88. 88.]
[89. 89.]], shape=(6, 2), dtype=float32)
tf.Tensor(
[[30. 30.]
[31. 31.]
[32. 32.]
[33. 33.]
[34. 34.]
[35. 35.]], shape=(6, 2), dtype=float32)
tf.Tensor(
[[54. 54.]
[55. 55.]
[56. 56.]
[57. 57.]
[58. 58.]
[59. 59.]], shape=(6, 2), dtype=float32)
As you can see, the rows of each matrix are in "datetime" order (sequence_length=6) and the matrices do not overlap (sequence_stride=6). I would like to keep track of the initial index. In other words, I want to be able to extract the matrix with shape=(6, 2) that corresponds to the index values K:K+6. I know I could do this directly from the initial dataframe, but this is just a simplified version of a bigger problem: I am trying to replicate the "Data windowing" section of this TensorFlow tutorial in such a way that I can plot exactly the dates I want, rather than random dates.
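One possible approach, sketched here in plain NumPy rather than with the TensorFlow API, is to carry the row position along as an extra column so that every window still knows where it came from after shuffling. The extra column and all variable names below are my own assumptions, not something from the tutorial. Presumably the same augmented array could be passed as data to timeseries_dataset_from_array and the first column split off afterwards. Storing the index as float32 is only safe for smallish integers, so real timestamps would need a different encoding.

```python
import numpy as np

n_rows, seq_len = 100, 6
values = np.column_stack([np.arange(n_rows), np.arange(n_rows)]).astype(np.float32)

# Hypothetical extra first column holding each row's position in the original data.
with_index = np.column_stack([np.arange(n_rows, dtype=np.float32), values])

# Non-overlapping windows (stride == length); the 4 leftover rows are dropped.
n_windows = n_rows // seq_len
windows = with_index[: n_windows * seq_len].reshape(n_windows, seq_len, -1)

# Shuffle the windows; each one still carries its own index column.
rng = np.random.default_rng(1)
shuffled = windows[rng.permutation(n_windows)]

first = shuffled[0]
start = int(first[0, 0])   # index K of this window's first row
features = first[:, 1:]    # the original (6, 2) values

# Look up the window that starts at a chosen K.
K = 30
starts = windows[:, 0, 0].astype(int)
wanted = windows[starts == K][0][:, 1:]
```

With this layout, start tells you which slice K:K+6 a shuffled window corresponds to, and wanted is the window for a K you pick yourself.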

Create dictionary from pandas dataframe

I have a pandas dataframe with the columns Customer_ID, Feat_ID and Feat_value.
From this I need to create a dictionary where the key is Customer_ID and the value is an array of tuples (feat_id, feat_value).
I am getting close using the to_dict() function on the dataframe.
Thanks
You should first set Customer_ID as the DataFrame index and use df.to_dict with orient='index' to obtain a dict of the form {index -> {column -> value}} (see the documentation). Then you can extract the values of the inner dictionaries with a dict comprehension to obtain the tuples. Note that orient='index' requires the index to be unique, so this only works when each Customer_ID appears in a single row.
df_dict = {key: tuple(value.values())
           for key, value in df.set_index('Customer_ID').to_dict('index').items()}
Use a comprehension:
out = {customer: [tuple(l) for l in subdf.to_dict('split')['data']]
       for customer, subdf in df.groupby('Customer_ID')[['Feat_ID', 'Feat_value']]}
print(out)
# Output
{80: [(123, 0), (124, 0), (125, 0), (126, 0), (127, 0)]}
Input dataframe:
>>> df
   Customer_ID  Feat_ID  Feat_value
0           80      123           0
1           80      124           0
2           80      125           0
3           80      126           0
4           80      127           0
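For comparison, here is a minimal sketch of the same groupby idea that builds the tuples with zip instead of going through to_dict('split'). It assumes the column names shown in the input dataframe above and works when a Customer_ID is repeated across rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Customer_ID': [80, 80, 80, 80, 80],
    'Feat_ID': [123, 124, 125, 126, 127],
    'Feat_value': [0, 0, 0, 0, 0],
})

# One dictionary entry per customer; each value is a list of (Feat_ID, Feat_value) tuples.
out = {
    customer: list(zip(group['Feat_ID'].tolist(), group['Feat_value'].tolist()))
    for customer, group in df.groupby('Customer_ID')
}

print(out)  # {80: [(123, 0), (124, 0), (125, 0), (126, 0), (127, 0)]}
```

Calling .tolist() first keeps the tuple elements as plain Python ints rather than NumPy scalars.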

ValueError: total size of new array must be unchanged (numpy for reshape)

I want to reshape my data vector, but when I run the code
from pandas import read_csv
import numpy as np
#from pandas import Series
#from matplotlib import pyplot
series =read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
A= np.array(series)
B = np.reshape(10,10)
print (B)
I get this error:
result = getattr(asarray(obj), method)(*args, **kwds)
ValueError: total size of new array must be unchanged
my data
Month xxx
1749-01 58
1749-02 62.6
1749-03 70
1749-04 55.7
1749-05 85
1749-06 83.5
1749-07 94.8
1749-08 66.3
1749-09 75.9
1749-10 75.5
1749-11 158.6
1749-12 85.2
1750-01 73.3
.... ....
.... ....
There seem to be two issues with what you are trying to do. The first relates to how you read the data in pandas:
series = read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
print(series)
>>>>Empty DataFrame
Columns: []
Index: [1749-01 58, 1749-02 62.6, 1749-03 70, 1749-04 55.7, 1749-05 85, 1749-06 83.5, 1749-07 94.8, 1749-08 66.3, 1749-09 75.9, 1749-10 75.5, 1749-11 158.6, 1749-12 85.2, 1750-01 73.3]
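This isn't giving you a column of floats in a dataframe with the dates as the index; it is putting each whole line, date and value together, into the index. I think you want to add delimiter=' ' so that the lines are split properly (on pandas 2.0 and later the squeeze argument no longer exists, so drop it and call .squeeze('columns') on the result instead):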
series =read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, delimiter=' ', squeeze=True)
>>>> Month
1749-01-01 58.0
1749-02-01 62.6
1749-03-01 70.0
1749-04-01 55.7
1749-05-01 85.0
1749-06-01 83.5
1749-07-01 94.8
1749-08-01 66.3
1749-09-01 75.9
1749-10-01 75.5
1749-11-01 158.6
1749-12-01 85.2
1750-01-01 73.3
Name: xxx, dtype: float64
This gives you the dates as the index with the 'xxx' value in the column.
Secondly the reshape. The error is quite descriptive in this case. If you want to use numpy.reshape you can't reshape to a layout that has a different number of elements to the original data. For example:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6]) # size 6 array
a.reshape(2, 3)
>>>> [[1, 2, 3],
[4, 5, 6]]
This is fine because the array starts out length 6, and I'm reshaping to 2 x 3, and 2 x 3 = 6.
However, if I try:
a.reshape(10, 10)
>>>> ValueError: cannot reshape array of size 6 into shape (10,10)
I get the error, because I need 10 x 10 = 100 elements to do this reshape, and I only have 6.
Without the complete dataset it's impossible to know for sure, but I think this is the same problem you are having, although you are converting your whole dataframe to a numpy array.
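One more thing worth checking in the question's code: np.reshape(10,10) never receives A at all. np.reshape takes the array as its first argument and the new shape as its second, so that call tries to reshape the scalar 10 (size 1) into shape (10,), which fails no matter how much data is in the file. A minimal sketch, where the 12-element array is only a stand-in for np.array(series):

```python
import numpy as np

# Stand-in for np.array(series); 12 values is an assumption for illustration.
A = np.arange(12, dtype=float)

# The array goes first, the target shape second; 3 * 4 == 12, so this works.
B = np.reshape(A, (3, 4))

# -1 lets NumPy infer that dimension from the array's size.
C = A.reshape(-1, 4)

# Passing only numbers reshapes the scalar 10, not A, and raises ValueError.
message = None
try:
    np.reshape(10, 10)
except ValueError as exc:
    message = str(exc)

print(B.shape, C.shape)
print(message)
```

For a 10 x 10 result the series would need exactly 100 values, e.g. by slicing A[:100] first if there are more.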

numpy, sums of subsets with no iterations [duplicate]

I have a massive data array (500k rows) that looks like:
id value score
1 20 20
1 10 30
1 15 0
2 12 4
2 3 8
2 56 9
3 6 18
...
As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.
I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id
With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.
table_trunc = table[(table == id).any(1)]
score = sum(table_trunc[:,2])
Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?
You can use bincount():
import numpy as np
ids = [1,1,1,2,2,2,3]
data = [20,30,0,4,8,9,18]
print(np.bincount(ids, weights=data))
The output is [ 0. 50. 21. 18.], which means the sum for id==0 is 0, the sum for id==1 is 50, and so on.
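bincount needs the ids to be small non-negative integers, since it allocates one bin for every value up to the largest id. If the real ids are sparse or very large, a common workaround is to map them to 0..n-1 first with np.unique. A minimal sketch, with made-up ids for illustration:

```python
import numpy as np

# Hypothetical non-contiguous ids; the scores are the ones from the answer above.
ids = np.array([10, 10, 10, 25, 25, 25, 40])
scores = np.array([20, 30, 0, 4, 8, 9, 18])

# inverse[i] is the position of ids[i] within unique_ids, i.e. a compact 0..n-1 label.
unique_ids, inverse = np.unique(ids, return_inverse=True)

# Sum the scores per compact label; sums[j] belongs to unique_ids[j].
sums = np.bincount(inverse, weights=scores)

print(unique_ids.tolist())  # [10, 25, 40]
print(sums.tolist())        # [50.0, 21.0, 18.0]
```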
I noticed the numpy tag, but in case you don't mind using pandas (or if you read in the data with it anyway), this task becomes a one-liner:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})
So your dataframe would look like this:
   id  score
0   1     20
1   1     30
2   1      0
3   2      4
4   2      8
5   2      9
6   3     18
Now you can use the functions groupby() and sum():
df.groupby(['id'], sort=False).sum()
which gives you the desired output:
    score
id
1      50
2      21
3      18
By default, groupby sorts the group keys, so I pass sort=False, which might improve speed for huge dataframes.
You can try using boolean operations:
import numpy as np
# these must be numpy arrays: with plain lists, ids == i is just False
ids = np.array([1,1,1,2,2,2,3])
data = np.array([20,30,0,4,8,9,18])
[((ids == i)*data).sum() for i in np.unique(ids)]
This may be a bit more efficient than using np.any, but it will clearly struggle if you have a very large number of unique ids along with a large data table.
If you only need the sum, you probably want to go with bincount. If you also need other grouping operations such as product, mean, std etc., have a look at https://github.com/ml31415/numpy-groupies . It has the fastest Python/numpy grouping operations around; see the speed comparison there.
Your sum operation there would look like:
res = aggregate(id, score)
The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:
import numpy_indexed as npi
npi.group_by(id).sum(score)
You can use a for loop and numba:
import numpy as np
from numba import njit
@njit
def wbcnt(b, w, k):
    # k zero-initialised integer bins, one per possible id
    bins = np.arange(k)
    bins = bins * 0
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins
Using @HYRY's variables:
ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
Then:
wbcnt(ids, data, 4)
array([ 0, 50, 21, 18])
Timing
%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)
1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop
Maybe using itertools.groupby, you can group on the ID and then iterate over the grouped data.
(The data must be sorted according to the group by func, in this case ID)
>>> import itertools
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
...     for y in i:
...         if isinstance(y, int):
...             print(y)
...         else:
...             for p in y:
...                 print('-', p)
Output:
1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)
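The loop above only prints the groups. To actually get the per-ID score sums the question asks for, the same itertools.groupby idea can be written as a dict comprehension. This is a minimal sketch using the sample tuples from this answer, with an explicit sort because groupby only merges adjacent rows that share a key:

```python
from itertools import groupby
from operator import itemgetter

data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]

# groupby only groups consecutive rows, so sort by ID (column 0) first.
data_sorted = sorted(data, key=itemgetter(0))

# Sum the score (column 2) within each ID group.
score_sums = {
    key: sum(row[2] for row in rows)
    for key, rows in groupby(data_sorted, key=itemgetter(0))
}

print(score_sums)  # {1: 50, 2: 4}
```

For 500k rows this pure-Python version will likely be much slower than bincount or pandas, so it is mainly useful when the data is not in an array to begin with.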