Numpy matrix to Pandas DataFrame - pandas

I have a numpy user-item matrix where each row corresponds to a user and each column corresponds to an item.
I want to convert the matrix into a pandas DataFrame like the following:
   user  item  rating
0     1  1907     4.0
1     1  1028     5.0
2     1   608     4.0
3     1  2692     4.0
4     1  1193     5.0
I use the following code to generate a DataFrame:
predictions = pd.DataFrame(data=pred)
predictions = predictions.stack().reset_index(name='rating')
predictions.columns = ['user', 'item', 'rating']
and I obtain a df like this:
   user  item    rating
0     0     0  5.000000
1     0     1  0.000000
2     0     2  0.000000
3     0     3  0.000000
Is there a way in pandas to map each value in the user and item columns to a value stored in a list? A user with value 0 should be mapped to the 1st element of the users list, a user with value 5 to the 6th element, and so on.
I'm trying:
predictions[["user"]].apply(lambda value: users[value])
but I get an IndexError I don't understand, because my users list has size 96:
IndexError: ('index 96 is out of bounds for axis 1 with size 96', 'occurred at index user')

My fault was in this code:
while not session.should_stop():
    predictions = session.run(decoder_op)
    pred = np.vstack((pred, predictions))
I just replaced the last line with:
np.vstack((pred, predictions))
and it works like a charm with:
predictions['user'] = predictions['user'].map(lambda value: users[value])
predictions['item'] = predictions['item'].map(lambda value: items[value])
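For reference, here is a minimal end-to-end sketch of the whole conversion, with hypothetical users and items label lists and a small pred matrix standing in for the real data:
import numpy as np
import pandas as pd

# hypothetical labels and predictions (2 users x 3 items)
users = [1, 2]
items = [1907, 1028, 608]
pred = np.array([[4.0, 5.0, 4.0],
                 [3.0, 2.0, 5.0]])

# wide matrix -> long (user, item, rating) format
predictions = pd.DataFrame(data=pred).stack().reset_index(name='rating')
predictions.columns = ['user', 'item', 'rating']

# map the positional codes back to the real labels
predictions['user'] = predictions['user'].map(lambda value: users[value])
predictions['item'] = predictions['item'].map(lambda value: items[value])
print(predictions)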

Related

Panda key value pair data frame

Can pandas convert key-value data like this to a custom table? Here is a sample of the data:
1675484100 customer=A.1 area=1 height=20 width={10,10} length=1
1675484101 customer=B.1 area=10 height=30 width={20,11} length=2
1675484102 customer=C.1 area=11 height=40 width={30,12} length=3 remarks=call
Generate a table with each key as a header and the associated values as rows; the first field is a timestamp.
I would use a regex to get each key/value pair, then reshape:
data = '''1675484100 customer=A.1 area=1 height=20 width={10,10} length=1
1675484101 customer=B.1 area=10 height=30 width={20,11} length=2
1675484102 customer=C.1 area=11 height=40 width={30,12} length=3 remarks=call'''
df = (pd.Series(data.splitlines()).radd('time=')
        .str.extractall(r'([^\s=]+)=([^\s=]+)')
        .droplevel('match').set_index(0, append=True)[1]
        # unstack keeping order
        .pipe(lambda d: d.unstack()[d.index.get_level_values(-1).unique()])
      )
print(df)
Output:
0        time customer area height    width length remarks
0  1675484100      A.1    1     20  {10,10}      1     NaN
1  1675484101      B.1   10     30  {20,11}      2     NaN
2  1675484102      C.1   11     40  {30,12}      3    call
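To see why the reshaping steps are needed, here is a minimal look at what extractall returns for a single line (radd('time=') simply prefixes each line so the timestamp is captured like any other key=value pair):
import pandas as pd

s = pd.Series(['time=1675484100 customer=A.1 area=1'])
print(s.str.extractall(r'([^\s=]+)=([^\s=]+)'))
#                 0           1
#   match
# 0 0          time  1675484100
#   1      customer         A.1
#   2          area           1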
Assuming that your input is a string named data, you can use this:
L = [{k: v for k, v in (x.split("=") for x in l.split()[1:])}
     for l in data.split("\n") if l.strip()]

df = pd.DataFrame(L)

df.insert(0, "time", [pd.to_datetime(int(x.split()[0]), unit="s")
                      for x in data.split("\n")])
Otherwise, if the data is stored in some sort of text file, add this at the beginning:
with open("file.txt", "r") as f:
    data = f.read()
Output:
print(df)
                 time customer area height    width length remarks
0 2023-02-04 04:15:00      A.1    1     20  {10,10}      1     NaN
1 2023-02-04 04:15:01      B.1   10     30  {20,11}      2     NaN
2 2023-02-04 04:15:02      C.1   11     40  {30,12}      3    call

Pandas Groupby and Apply

I am performing a groupby and apply over a dataframe and it is returning some strange results. I am using pandas 1.3.1.
Here is the code:
ddf = pd.DataFrame({
    "id": [1, 1, 1, 1, 2]
})

def do_something(df):
    return "x"

ddf["title"] = ddf.groupby("id").apply(do_something)
ddf
I would expect every row in the title column to be assigned the value "x", but instead I get this:
   id title
0   1   NaN
1   1     x
2   1     x
3   1   NaN
4   2   NaN
Is this expected?
The result is not strange, it's the right behavior: apply returns one value per group, here for groups 1 and 2, and the group names become the index of the aggregation:
>>> list(ddf.groupby("id"))
[(1,       # the group name (the future index of the grouped df)
     id    # the subset dataframe of group 1
  0   1
  1   1
  2   1
  3   1),
 (2,       # the group name (the future index of the grouped df)
     id    # the subset dataframe of group 2
  4   2)]
Why do you get any values at all? Because some group labels happen to match values of your dataframe's index:
>>> ddf.groupby("id").apply(do_something)
id
1 x
2 x
dtype: object
Now change the id like this:
ddf['id'] += 10
# id
# 0 11
# 1 11
# 2 11
# 3 11
# 4 12
ddf["title"] = ddf.groupby("id").apply(do_something)
# id title
# 0 11 NaN
# 1 11 NaN
# 2 11 NaN
# 3 11 NaN
# 4 12 NaN
Or change the index:
ddf.index += 10
# id
# 10 1
# 11 1
# 12 1
# 13 1
# 14 2
ddf["title"] = ddf.groupby("id").apply(do_something)
# id title
# 10 1 NaN
# 11 1 NaN
# 12 1 NaN
# 13 1 NaN
# 14 2 NaN
Yes, it is expected.
First of all, the apply(do_something) part works like a charm; it is the groupby right before it that causes the problem.
A groupby returns a GroupBy object, which is a little different from a normal dataframe. If you debug and inspect what the groupby returns, you can see that you need some form of summary function to use it (mean, max or sum). If you run one of them as an example, like this:
df = ddf.groupby("id")
df.mean()
it leads to this result:
Empty DataFrame
Columns: []
Index: [1, 2]
After that, do_something is applied to indices 1 and 2 only, and the result is then aligned back into your original df. This is why only indices 1 and 2 get an x.
For now I would recommend leaving out the groupby, since it is not clear why you want to use it here anyway, and having a deeper look at the GroupBy object.
If you need a new column from an aggregate function, use GroupBy.transform; it is necessary to specify the column used for processing after the groupby, here id:
ddf["title"] = ddf.groupby("id")['id'].transform(do_something)
Or assign the new column inside the function:
def do_something(x):
    x['title'] = 'x'
    return x
ddf = ddf.groupby("id").apply(do_something)
The explanation of why it is not working is in the other answers.

Mapping dictionary of lists to a pandas df

I have a dictionary which contains an id and a list of corresponding values for that id.
I am attempting to map this dictionary to a pandas df.
The df contains the same ids to map to, but the items in each list need to be mapped in their order of appearance within the df.
For example:
sample_dict = {0:[0.1,0.4,0.5], 1:[0.2,0.14,0.3], 2:[0.2,0.1,0.4]}
The df looks like: [table: columns id and Player]
The output of mapping the dictionary to the df would look like: [table: the same df with the mapped values in a new column]
Sorry for typing the table out like this; the actual df is very large, and I'm still new to Stack Exchange and pandas.
The end output should just map each list value, in order, to the players as they appear, since the df is sorted by id and then player.
Let us try explode with reindex:
df['new'] = pd.Series(sample_dict).reindex(df.id.unique()).explode().values
df
Out[140]:
   id  Player   new
0   0       1   0.1
1   0       2   0.4
2   0       3   0.5
3   1       1   0.2
4   1       2  0.14
5   1       3   0.3
6   2       1   0.2
7   2       2   0.1
8   2       3   0.4
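Since the question's tables didn't render, here is a hypothetical reconstruction of the input df (matching the output above) that makes the snippet runnable:
import pandas as pd

sample_dict = {0: [0.1, 0.4, 0.5], 1: [0.2, 0.14, 0.3], 2: [0.2, 0.1, 0.4]}

# hypothetical input, sorted by id and then Player
df = pd.DataFrame({'id': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'Player': [1, 2, 3] * 3})

df['new'] = pd.Series(sample_dict).reindex(df.id.unique()).explode().values
print(df)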

Apply noise on non zero elements of data frame

I am struggling a bit with this one.
I have a dataframe, and I want to apply Gaussian noise only to the non-zero elements of the dataframe. A silly way to do this is:
mu, sigma = 0, 0.1
for i in range(df.shape[0]):
    for j in range(df.shape[1]):
        if df.iat[i,j] != 0:
            df.iat[i,j] += np.random.normal(mu,sigma)
The noise must be different for each element; we do not add the same value each time.
And I would be happy if only this worked, but for some reason it does not. Instead, I get this:
[images: the dataframe before and after adding noise]
As you can see in the images, it works well for columns A and C, but not for the others. What is weird is that there is still a change (+/- 1, so far from what one would expect of Gaussian noise...).
I tried to see if this was some decimal-precision problem with df.round(), but nothing came up.
So I am mostly looking for another way to apply my noise, rather than to solve this weird problem. Thank you in advance.
I believe you can generate an array with the same size as the original DataFrame and then add the values by condition with where:
np.random.seed(234)
df = pd.DataFrame(np.random.randint(5, size=(5,5)))
print (df)
   0  1  2  3  4
0  0  4  1  1  3
1  3  0  3  3  2
2  0  2  4  1  3
3  4  0  3  0  2
4  3  1  3  3  1
mu, sigma = 0, 0.1
a = np.random.normal(mu,sigma, size=df.shape)
print (a)
[[ 0.10452115 -0.01051424 -0.13329652 -0.06376671 0.07245456]
[-0.21753186 0.05700441 0.03595196 -0.08154859 0.0076684 ]
[ 0.08368405 0.10390984 0.04692948 0.09711873 -0.06820933]
[-0.07229613 0.03954906 -0.06136678 -0.02328597 -0.22123564]
[-0.04316055 0.05945377 0.13736261 0.07895045 0.03714287]]
df = df.where(df == 0, df.add(a))
print (df)
          0         1         2         3         4
0  0.000000  3.989486  0.866703  0.936233  3.072455
1  2.782468  0.000000  3.035952  2.918451  2.007668
2  0.000000  2.103910  4.046929  1.097119  2.931791
3  3.927704  0.000000  2.938633  0.000000  1.778764
4  2.956839  1.059454  3.137363  3.078950  1.037143
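As a side note, the same result can be had in one line by scaling the noise with a boolean mask, so zeros receive zero noise (a sketch reusing the df, mu and sigma defined above):
# (df != 0) is 0/1 per cell, so the noise is cancelled exactly where df is 0
df = df + np.random.normal(mu, sigma, size=df.shape) * (df != 0)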

Pandas bar plot with continuous x axis

I am trying to make a bar chart in pandas, with two data series coming from a groupby:
data.groupby(['popup','UID']).size().groupby(level=0).value_counts().unstack().transpose().plot(kind='bar', layout=(2,2))
The x axis is not continuous, and only shows values that are in the dataset. In this example, it jumps from 11 to 13.
How can I make it continuous?
**EDIT 2:**
I tried JohnE's data-centric approach, and it works. It creates a new index with no missing values:
temp = data.groupby(['popup','UID']).size().groupby(level=0).value_counts().unstack().transpose()
# note: + 1 because np.arange excludes the stop value, so the max index is kept
temp.reindex(np.arange(temp.index.min(), temp.index.max() + 1)).plot(kind='bar', layout=(2,2))
However, I assume there should be a better approach with histogram instead of bar plot. The best I could do with histograms is:
data.groupby(['popup','UID']).size().groupby(level=0).plot(kind='hist', bins=30, alpha=0.5, layout=(2,2), legend=True)
But I didn't find any option in the hist plot to get the same rendering as the bar plot, without bars overlapping.
**EDIT:** Here is some information to answer the comments.
Data sample:
     INSEE   C1  popup  C3                  date  \
0  75101.0  0.0      0 NaN  2017-05-17T13:20:16Z
0  75101.0  0.0      0 NaN  2017-05-17T14:23:51Z
1  31557.0  0.0      1 NaN  2017-05-17T14:58:27Z

                                    UID
0  ba4bd353-f14d-4bc5-95ba-6a1f5134cc84
0  ba4bd353-f14d-4bc5-95ba-6a1f5134cc84
1  bafe9715-3a07-4d9b-b85c-0bbf658a9115
First groupby result (sample):
data.groupby(['popup','UID']).size().head(3)
popup  UID
0      016d3e7e-1901-4f84-be0e-117988ec57a8    6
       01c15455-29cc-4d1e-8743-638fd0f51602    6
       03fc9eb0-c5fb-4205-91f0-4b74f78a8b96    3
dtype: int64
Second groupby result (sample):
data.groupby(['popup','UID']).size().groupby(level=0).value_counts().head(3)
popup
0      1    46
       3    23
       4    22
dtype: int64
After unstack and transpose:
data.groupby(['popup','UID']).size().groupby(level=0).value_counts().unstack().transpose().head(3)
popup     0     1
1      46.0  38.0
2      21.0  35.0
3      23.0  22.0
There is a solution with a histogram plot, using matplotlib.axes.Axes.hist. It is better to use a histogram than a bar plot for this purpose, as we can choose the number of bins.
import matplotlib.pyplot as plt

# Separate groups by 'popup' and count the number of records for each 'UID'
popup_values = data['popup'].unique()
count_by_popup = [data[data['popup'] == popup_value].groupby(['UID']).size()
                  for popup_value in popup_values]

# Create the histogram, one series per 'popup' value
fig, ax = plt.subplots()
ax.hist(count_by_popup, 20, histtype='bar', label=[str(x) for x in popup_values])
ax.legend()
plt.show()
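If you want one bin per integer count, mirroring the bar-plot rendering, a small variation with explicit shared bin edges should work (a sketch reusing count_by_popup and popup_values from above):
import numpy as np

all_counts = np.concatenate([c.values for c in count_by_popup])
bins = np.arange(all_counts.min(), all_counts.max() + 2)  # edges covering every integer

fig, ax = plt.subplots()
ax.hist(count_by_popup, bins=bins, histtype='bar', label=[str(x) for x in popup_values])
ax.legend()
plt.show()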