Create dictionary from pandas dataframe - pandas

I have a pandas DataFrame with data as such:
[image of the dataframe; sample rows are reproduced in the second answer below]
From this I need to create a dictionary where the key is Customer_ID and the value is an array of tuples (feat_id, feat_value).
I am getting close using the to_dict() function on the DataFrame.
Thanks

You should first set Customer_ID as the DataFrame index and use df.to_dict with orient='index' to obtain a dict in the form {index -> {column -> value}} (see the documentation). Then you can extract the values of the inner dictionaries with a dict comprehension to obtain the tuples.
df_dict = {key: tuple(value.values())
           for key, value in df.set_index('Customer_ID').to_dict('index').items()}
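A minimal sketch of what this produces, on a hypothetical two-customer frame. Note that to_dict('index') raises a ValueError when the index is duplicated, so this assumes each Customer_ID appears only once; for repeated IDs see the groupby answer below.
import pandas as pd

# hypothetical data: one row per customer (the question's real data repeats Customer_ID)
df = pd.DataFrame({'Customer_ID': [80, 81],
                   'Feat_ID': [123, 124],
                   'Feat_value': [0, 1]})

df_dict = {key: tuple(value.values())
           for key, value in df.set_index('Customer_ID').to_dict('index').items()}
print(df_dict)  # {80: (123, 0), 81: (124, 1)}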

Use a comprehension:
out = {customer: [tuple(l) for l in subdf.to_dict('split')['data']]
       for customer, subdf in df.groupby('Customer_ID')[['Feat_ID', 'Feat_value']]}
print(out)
# Output
{80: [(123, 0), (124, 0), (125, 0), (126, 0), (127, 0)]}
Input dataframe:
>>> df
   Customer_ID  Feat_ID  Feat_value
0           80      123           0
1           80      124           0
2           80      125           0
3           80      126           0
4           80      127           0
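For reference, an equivalent sketch using zip instead of to_dict('split'), which may read more directly (same df as above; tolist() keeps the tuple members as plain Python ints):
out = {customer: list(zip(subdf['Feat_ID'].tolist(), subdf['Feat_value'].tolist()))
       for customer, subdf in df.groupby('Customer_ID')}
print(out)  # {80: [(123, 0), (124, 0), (125, 0), (126, 0), (127, 0)]}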

Related

How do I convert multiple columns into a list in ascending order?

I have a dataset of this type:
     0      1       2
   0:0   57:0   166:0
   0:5  57:20  166:27
  0:10  57:8:  166:36
  0:27   57:4  166:45
I want to convert this dataframe into a list in ascending order. The whole data should be checked and the list built from the numbers in ascending order, where the ordering is based on the number before the ':'.
desired output:
list
0
57
166
You can unstack (or stack) to flatten to a Series, then extract the number, convert it to integer, and keep the unique values in order.
For a Python list you can try:
sorted(df.unstack()
         .str.extract(r'(\d+):', expand=False)
         .astype(int)
         .unique()
         .tolist())
output: [0, 57, 166]
As a Series:
out = (df.unstack()
         .str.extract(r'(\d+):', expand=False)
         .astype(int)
         .drop_duplicates()
         .sort_values()
         .reset_index(drop=True)
       )
output:
0      0
1     57
2    166
dtype: int64
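A self-contained sketch with the sample data above, assuming every cell is a 'number:number' string:
import pandas as pd

df = pd.DataFrame({0: ['0:0', '0:5', '0:10', '0:27'],
                   1: ['57:0', '57:20', '57:8:', '57:4'],
                   2: ['166:0', '166:27', '166:36', '166:45']})

# flatten to one Series, grab the digits before the first ':', dedupe, sort
out = sorted(df.unstack()
               .str.extract(r'(\d+):', expand=False)
               .astype(int)
               .unique()
               .tolist())
print(out)  # [0, 57, 166]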

DataFrame Index Created From Columns

I have a dataframe that I am populating with data from Bloomberg using TIA. When I look at df.index I see that the data that I intended to be columns is presented to me as what appears to be a MultiIndex. The output of df.columns is like this:
Index([u'column1', u'column2'])
I have tried various iterations of reset_index but have not been able to remedy this situation.
1) what about the TIA manager causes the dataframe columns to be read in as an index?
2) How can I properly identify these columns as columns instead of a multi-index?
The ultimate problem that I'm trying to fix is that when I try to add this column to df2, the values for that column in df2 come out as NaT, like below:
df2['column3'] = df1['column1']
Produces:
df2
  column1  column2 column3
     1135       32     NaT
     1351       43     NaT
       35       13     NaT
      135       13     NaT
From the comments it appears df1 and df2 have completely different indexes:
In [400]: df1.index
Out[400]: Index(['Jan', 'Feb', 'Mar', 'Apr', 'May'], dtype='object')

In [401]: df2.index
Out[401]: Index(['One', 'Two', 'Three', 'Four', 'Five'], dtype='object')
but we wish to assign values from df1 to df2, preserving order.
Usually, Pandas operations try to automatically align values based on index (and/or column) labels.
In this case, we wish to ignore the labels. To do that, use
df2['columns3'] = df1['column1'].values
df1['column1'].values is a NumPy array. Since it doesn't have an Index, Pandas simply assigns the values in the array into df2['columns3'] in order.
The assignment would behave the same way if the right-hand side were a list or a tuple.
Note that this also relies on len(df1) equaling len(df2).
For example,
import pandas as pd

df1 = pd.DataFrame(
    {"column1": [1135, 1351, 35, 135, 0], "column2": [32, 43, 13, 13, 0]},
    index=[u"Jan", u"Feb", u"Mar", u"Apr", u"May"],
)
df2 = pd.DataFrame(
    {"column1": range(len(df1))}, index=[u"One", u"Two", u"Three", u"Four", u"Five"]
)
df2["columns3"] = df1["column1"].values
print(df2)
yields
       column1  columns3
One          0      1135
Two          1      1351
Three        2        35
Four         3       135
Five         4         0
Alternatively, you could make the two Indexes the same, and then df2["columns3"] = df1["column1"] would produce the same result (but now because the index labels are being aligned):
df1.index = df2.index
df2["columns3"] = df1["column1"]
Another way to make the Indexes match is to reset the index on both DataFrames:
df1 = df1.reset_index()
df2 = df2.reset_index()
df2["columns3"] = df1["column1"]
reset_index moves the old index into a column named 'index' by default (if index.name was None). Integers (starting with 0) are assigned as the new index labels:
In [402]: df1.reset_index()
Out[402]:
  index  column1  column2
0   Jan     1135       32
1   Feb     1351       43
2   Mar       35       13
3   Apr      135       13
4   May        0        0
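If you do not want the old labels kept as an 'index' column at all, pass drop=True (a small sketch with the same frames):
# discard the old labels entirely; both frames get a fresh 0..n-1 RangeIndex
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
df2["columns3"] = df1["column1"]  # now aligns on the shared default index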

splitting a pandas object

I have a column in a dataframe that has values such as 45+2, 98+3, 90+5. I want to split the values so that I only have 45, 98, 90, i.e. drop the '+' symbol and all that follows it. The problem is that pandas holds this data as an object, making string stripping difficult. Any suggestions?
Use Series.str.split and select the first value of each list by indexing:
df = pd.DataFrame({'col': ['45+2', '98+3', '90+5']})
df['new'] = df['col'].str.split('+').str[0]
print(df)

    col new
0  45+2  45
1  98+3  98
2  90+5  90
Or use Series.str.extract to get the first integer from each value:
df['new'] = df['col'].str.extract(r'(\d+)', expand=False)
print(df)

    col new
0  45+2  45
1  98+3  98
2  90+5  90
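Note that with either approach the new column still holds strings (object dtype); if you need numbers, a small conversion sketch (assuming the df from the examples above):
# split route, then cast the digit strings to int
df['new'] = df['col'].str.split('+').str[0].astype(int)

# or the extract route, converted with to_numeric
df['new'] = pd.to_numeric(df['col'].str.extract(r'(\d+)', expand=False))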
You can use a lambda function for this.
df1 = pd.DataFrame(data=['45+2', '98+3', '90+5'], columns=['col'])
print(df1)

    col
0  45+2
1  98+3
2  90+5

Delete the unwanted parts from the strings in the "col" column:
df1['col'] = df1['col'].map(lambda x: x.split('+')[0])
print(df1)

  col
0  45
1  98
2  90

Sum of data entry with the given index in pandas dataframe

I am trying to get the sum of every possible combination of the given data in a pandas dataframe. To do this I use itertools.combinations to get all of the possible combinations, then I sum each of them in a loop.
Is there any way to do this without using the loop?
Please check the following script that I created to show what I want.
import pandas as pd
import itertools as it

A = pd.Series([50, 20, 75], index=list(range(1, 4)))
df = pd.DataFrame({'A': A})

listNew = []
for i in range(1, len(df.A) + 1):
    Temp = it.combinations(df.index.values, i)
    for data in Temp:
        listNew.append(data)
print(listNew)

for data in listNew:
    print(df.A[list(data)].sum())
The output of this script is:
[(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
50
20
75
70
125
95
145
Thank you in advance.
IIUC, you can use reindex:
# convert the list of tuples to a DataFrame and use stack to flatten it
s = pd.DataFrame([(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]).stack().to_frame('index')
# then reindex df in that order using df.A
s['Value'] = df.reindex(s['index']).A.values
# you could use groupby here, but since the index is at hand, sum with level is simpler
s = s.Value.sum(level=0)
s
Out[796]:
0     50
1     20
2     75
3     70
4    125
5     95
6    145
Name: Value, dtype: int64
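Series.sum(level=...) was deprecated in pandas 1.3 and removed in 2.0; a self-contained sketch of the same idea on current pandas, using groupby(level=0) instead:
import pandas as pd

df = pd.DataFrame({'A': pd.Series([50, 20, 75], index=[1, 2, 3])})
combos = [(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]

# flatten the tuples; level 0 of the stacked index remembers which combo each label came from
s = pd.DataFrame(combos).stack().dropna().to_frame('index')
# look up each label in df.A (cast back to int: the NaN padding made the column float)
s['Value'] = df['A'].reindex(s['index'].astype(int)).values
out = s['Value'].groupby(level=0).sum()
print(out.tolist())  # [50, 20, 75, 70, 125, 95, 145]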

How to efficiently remove duplicate rows from a DataFrame

I'm dealing with a very large Data Frame and I'm using pandas to do the analysis.
The data frame is structured as follows
import pandas as pd
df = pd.read_csv("data.csv")
df.head()
   Source  Target  Weight
0       0   25846       1
1       0    1916       1
2   25846       0       1
3       0    4748       1
4       0   16856       1
The issue is that I want to remove all the "duplicates". In the sense that if I already have a row that contains a Source and a Target I do not want this information to be repeated on another row.
For instance, rows number 0 and 2 are "duplicate" in this sense and only one of them should be retained.
A simple way to get rid of all the "duplicates" is
for index, row in df.iterrows():
    df = df[~((df.Source == row.Target) & (df.Target == row.Source))]
However, this approach is horribly slow since my data frame has about 3 million rows. Do you think there's a better way of doing this?
Create two temp columns to save minimum(df.Source, df.Target) and maximum(df.Source, df.Target), then flag duplicated rows with the duplicated() method:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 5, (20, 2)), columns=["Source", "Target"])
df["T1"] = np.minimum(df.Source, df.Target)
df["T2"] = np.maximum(df.Source, df.Target)
df[~df[["T1", "T2"]].duplicated()]
No need (as usual) to use a loop with a dataframe. Use the Series.isin method:
So start with this:
import pandas

df = pandas.DataFrame({
    'src': [0, 0, 25, 0, 0],
    'tgt': [25, 12, 0, 85, 363]
})
print(df)
   src  tgt
0    0   25
1    0   12
2   25    0
3    0   85
4    0  363
Then select all of the rows where src is not in the tgt column (and vice versa):
df[~(df['src'].isin(df['tgt']) & df['tgt'].isin(df['src']))]
   src  tgt
1    0   12
3    0   85
4    0  363
Your Source and Target values appear to be mutually exclusive (i.e. in each row one of them is zero). Why not add them together (e.g. 25846 + 0) to get a unique identifier? You can then delete the unneeded Target column (reducing memory) and drop duplicates. In the event your weights are not the same, it will take the first one by default.
df.Source += df.Target
df.drop('Target', axis=1, inplace=True)
df.drop_duplicates(inplace=True)
>>> df
   Source  Weight
0   25846       1
1    1916       1
3    4748       1
4   16856       1