accept duplicate keys from pandas dataframe

The following is a snippet of dataframe I have:
     node  group
0      28      1
167    28      2
I want to create a dictionary like structure from the above dataframe
and want to have something like
{28:1}
{28:2}
I tried to create it via
groupDict=groupTest.to_dict(orient='index')
which generates
{0: {'node': 28, 'group': 1}, 167: {'node': 28, 'group': 2}}
which is the standard pandas way of doing it. But how would I generate
{28:1}
{28:2}
Potential solution, as advised in the comments below by anky_91:
df.groupby('node')['group'].agg(list).to_dict()
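On the sample data, that one-liner gathers both group values under the single node key. A minimal runnable sketch (the index values 0 and 167 are taken from the snippet above):

```python
import pandas as pd

# rebuild the snippet from the question
groupTest = pd.DataFrame({'node': [28, 28], 'group': [1, 2]}, index=[0, 167])

# one key per node, with all group values collected into a list
groupDict = groupTest.groupby('node')['group'].agg(list).to_dict()
print(groupDict)  # {28: [1, 2]}
```

Since a Python dict cannot hold duplicate keys, the two rows are merged into one key mapping to a list of values.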

Related

Pandas : make new column after nth row

I have the following table as a data frame.
0
8
990
15
70
85
36
2
43
5
68
61
62
624
65
82
523
98
I want to create a new column after every third row. So the data should look like this.
Thanks in advance.
Looks like your column can be converted into an array (i.e., a list). If that is the case, you can break the values into sub-lists and build a list of lists, then use that list of lists to create a dataframe.
The code might look something like this:
import pandas as pd

listofitems = [...]
## create a new dataframe based on the list index jump
newdf = pd.DataFrame([listofitems[i::3] for i in range(3)])
## transpose into a 3-column dataframe
newdf = newdf.T
For the example given above, 4139055 rows is not big data. If you do have big and complex data, take a look at PySpark, specifically at Spark DataFrames. It is one of the big-data frameworks that helps optimize data transformations over large dataframes.
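With the question's numbers substituted for listofitems, the slicing approach can be run end to end (a sketch; variable names follow the snippet above):

```python
import pandas as pd

listofitems = [0, 8, 990, 15, 70, 85, 36, 2, 43,
               5, 68, 61, 62, 624, 65, 82, 523, 98]

# every third element starting at offsets 0, 1, 2 gives three slices;
# transposing turns each original run of three values into a row
newdf = pd.DataFrame([listofitems[i::3] for i in range(3)]).T
print(newdf.shape)  # (6, 3)
```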
import pandas as pd
import numpy as np

numbers = [0, 8, 990, 15, 70, 85, 36, 2, 43,
           5, 68, 61, 62, 624, 65, 82, 523, 98]
pd.DataFrame(np.reshape(numbers, (6, 3)))

Add new columns to excel file from multiple datasets with Pandas in Google Colab

I'm trying to add some columns to an Excel file after some data, but I'm not having good results: I'm just overwriting what I have. Let me give you some context: I'm reading a CSV, and for each column I'm using a for loop to run value_counts and then create a frame from it. Here is the code for just one column:
import pandas as pd
data= pd.read_csv('responses.csv')
datatoexcel = data['Music'].value_counts().to_frame()
datatoexcel.to_excel('savedataframetocolumns.xlsx') #Name of the file
This works like this ...
And with that code for only one column I have the format that I actually need for excel.
But the problem is when I try to do it with a for loop over all the columns and then "append" the resulting dataframes to Excel using this code:
for columnName in df:
    datasetstoexcel = df.value_counts(columnName).to_frame()
    print(datasetstoexcel)
    # Here is my problem with the following line, the .to_excel
    x.to_excel('quickgraph.xlsx')  # I tried more code lines but I'll leave this one as base
The result that I want to reach is this one:
I'm really close to finish this code, some help here please!
How about this?
Sample data
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [5, 6, 7, 8],
    "col3": [9, 9, 11, 12],
    "col4": [13, 14, 15, 16],
})
Find value counts and add to a list
li = []
for i in range(df.shape[1]):  # iterate over columns (len(df) would count rows)
    value_counts = df.iloc[:, i].value_counts().to_frame().reset_index()
    li.append(value_counts)
concat all the dataframes inside li and write to excel
pd.concat(li, axis=1).to_excel("result.xlsx")
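The snippets above can be combined into one self-contained script (a sketch; note that columns with fewer unique values are padded with NaN when concatenated side by side):

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [5, 6, 7, 8],
    "col3": [9, 9, 11, 12],
    "col4": [13, 14, 15, 16],
})

# one value_counts frame per column, concatenated side by side
li = [df[col].value_counts().to_frame().reset_index() for col in df.columns]
result = pd.concat(li, axis=1)
print(result.shape)  # (4, 8)

# result.to_excel("result.xlsx")  # write when you need the file
```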

pandas groupby and agg getting TypeError

I saw that it is possible to do groupby and then agg, letting pandas produce a new dataframe that groups the old dataframe by the fields you specified and then aggregates the fields you specified with some function (sum in the example below).
However, when I wrote the following:
# initialize list of lists
data = [['tom', 10, 100], ['tom', 15, 200], ['nick', 15, 150], ['juli', 14, 140]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age', 'salary'])
# trying to groupby and agg
grouping_vars = ['Name']
nlg_study_grouped = df(grouping_vars,axis = 0).agg({'Name': sum}).reset_index()
   Name  Age  salary
0   tom   10     100
1   tom   15     200
2  nick   15     150
3  juli   14     140
I am expecting the output to look like this (because it is grouping by Name and then summing the field salary):
   Name  salary
0   tom     300
1  nick     150
2  juli     140
The code works in someone else's example, but my toy example is producing this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-16-6fb9c0ade242> in <module>
1 grouping_vars = ['Name']
2
----> 3 nlg_study_grouped = df(grouping_vars,axis = 0).agg({'Name': sum}).reset_index()
TypeError: 'DataFrame' object is not callable
I wonder if I missed something dumb.
You can try this:
print(df.groupby('Name').sum()['salary'])
To use multiple functions
print(df.groupby(['Name'])['salary']
      .agg([('average', 'mean'), ('total', 'sum'), ('product', 'prod')])
      .reset_index())
If you want to group by multiple columns, then you can try adding multiple column names within groupby list
Ex: df.groupby(['Name','AnotherColumn'])...
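Applied to the question's own variables, the fix is to call .groupby as a method rather than calling the dataframe itself (a minimal sketch):

```python
import pandas as pd

data = [['tom', 10, 100], ['tom', 15, 200], ['nick', 15, 150], ['juli', 14, 140]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'salary'])

grouping_vars = ['Name']
# df(grouping_vars, axis=0) tries to *call* the DataFrame, hence the
# "'DataFrame' object is not callable" TypeError; df.groupby(...) is the method
result = df.groupby(grouping_vars).agg({'salary': 'sum'}).reset_index()
print(result)
```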
Further, you can refer to this question:
Aggregation in Pandas

Is there an easier way to grab a single value from within a Pandas DataFrame with multiindexed columns?

I have a Pandas DataFrame of ML experiment results (from MLFlow). I am trying to access the run_id of a single element in the 0th row and under the "tags" -> "run_id" multi-index in the columns.
The DataFrame is called experiment_results_df. I can access the element with the following command:
experiment_results_df.loc[0,(slice(None),'run_id')].values[0]
I thought I should be able to grab the value itself with a statement like the following:
experiment_results_df.at[0,('tags','run_id')]
# or...
experiment_results_df.loc[0,('tags','run_id')]
But either of those just results in the following rather confusing error (as I'm not setting anything):
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
It's working now, but I'd prefer to use a simpler syntax. And more than that, I want to understand why the other approach isn't working, and if I can modify it. I find multiindexes very frustrating to work with in Pandas compared to regular indexes, but the additional formatting is nice when I print the DF to the console, or display it in a CSV viewer as I currently have 41 columns (and growing).
I don't understand what the problem is:
df = pd.DataFrame({('T', 'A'): {0: 1, 1: 4},
                   ('T', 'B'): {0: 2, 1: 5},
                   ('T', 'C'): {0: 3, 1: 6}})
print(df)
# Output
   T
   A  B  C
0  1  2  3
1  4  5  6
How to extract 1:
>>> df.loc[0, ('T', 'A')]
1
>>> df.at[0, ('T', 'A')]
1
>>> df.loc[0, (slice(None), 'A')][0]
1

Merge list in multiple columns to a single column in pandas

I have a pandas dataframe in the below format:
                0      1           2           3
A.pkl  [121, 122]  [123]  [124, 125]  [126, 127]
The number of columns might be more as well. In the end, I would like to merge all the values in all the columns and write it to a single column.
Result dataframe:
values
A.pkl [121,122,123,124,125,126,127]
I use the below code to generate the first part:
df = pd.DataFrame({
    g: pd.read_pickle(f'{g}')['values'].tolist()
    for g in groups
}).T
I tried using itertools.chain but it doesn't seem to do the trick.
Any suggestions would be appreciated.
Input dataframe:
df = pd.DataFrame({'name': ['aa.pkl'],
                   '0': [["001A000001", "003A0025"]],
                   '1': [["003B000001", "003C000001"]],
                   '2': [["003D000001", "003E000001"]],
                   '3': [["003F000001", "003G000001"]]})
The above dataframe is generated by reading the pickle files.
Actually itertools.chain is one way to go, but you have to do it properly:
from itertools import chain
df.apply(lambda x: list(chain(*x)), axis=1)
output:
A.pkl [121, 122, 123, 124, 125, 126, 127]
dtype: object
As #QuangHoang suggested, you can also use the df.sum(axis=1) trick, but be careful: this only works with lists. If for some reason you have numpy arrays, this will perform the sum per position ([494, 497]).
Input:
df = pd.DataFrame({'0': [[121, 122]],
                   '1': [[123]],
                   '2': [[124, 125]],
                   '3': [[126, 127]]})
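Putting the chain approach together with the numeric input frame from the top of the question gives a runnable sketch:

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({'0': [[121, 122]],
                   '1': [[123]],
                   '2': [[124, 125]],
                   '3': [[126, 127]]}, index=['A.pkl'])

# flatten each row's lists into one list; with the default result_type,
# apply keeps the returned lists as a Series of lists
merged = df.apply(lambda x: list(chain(*x)), axis=1)
print(merged.loc['A.pkl'])  # [121, 122, 123, 124, 125, 126, 127]
```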