save a named tuple in all rows of a pandas dataframe

save a named tuple in all rows of a pandas dataframe - pandas

I'm trying to save a named tuple n=NamedTuple(value1='x'=, value2='y') in a row of a pandas dataframe.
The problem is that the named tuple is showing a length of 2 because it has 2 parameters in my case (value1 and value2), so it doesn't fit it into a single cell of the dataframe.
How can I achieve that the named tuple is written into every call of a row of a dataframe?
df['columnd1']=n
an example:
from collections import namedtuple
import pandas as pd
n = namedtuple("test", ['param1', 'param2'])
n1 = n(param1='1', param2='2')
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df['nt'] = n1
print(df)

I don't really understand what you're trying to do, but if you want to put that named tuple in every row of a new column (i.e. like a scalar) then you can't rely on broadcasting but should instead replicate it yourself:
df['nt'] = [n1 for _ in range(df.shape[0])]

Related

i need to return a value from a dataframe cell as a variable not a series

i have the following issue:
when i use .loc funtion it returns a series not a single value with no index.
As i need to do some math operation with the selected cells. the function that i am using is:
import pandas as pd
data = [[82,1], [30, 2], [3.7, 3]]
df = pd.DataFrame(data, columns = ['Ah-Step', 'State'])
df['Ah-Step'].loc[df['State']==2]+ df['Ah-Step'].loc[df['State']==3]

.values[0] will do what OP wants.
Assuming one wants to obtain the value 30, the following will do the work
df.loc[df['State'] == 2, 'Ah-Step'].values[0]
print(df)
[Out]: 30.0
So, in OP's specific case, the operation 30+3.7 could be done as follows
df.loc[df['State'] == 2, 'Ah-Step'].values[0] + df['Ah-Step'].loc[df['State']==3].values[0]
[Out]: 33.7

Map columns with list and return corresponding list

import pandas as pd
pd.DataFrame({"a":["a","b","c"],"d":[1,2,3]})
Given an array ["a","b","c","c"], I want it to use to map col "a", and get output [1,2,3,3] which is from column "d". Is there a short way to do this without iterating the rows?

Use Series.reindex with index by a converted to index by DataFrame.set_index:
a = ["a","b","c","c"]
L = df.set_index('a').reindex(a)['d'].tolist()
print (L)
[1, 2, 3, 3]

Looping through a dictionary of dataframes and counting a column

I am wondering if anyone can help. I have a number of dataframes stored in a dictionary. I simply want to access each of these dataframes and count the values in a column in the column I have 10 letters. In the first dataframe there are 5bs and 5 as. For example the output from the count I would expect to be is a = 5 and b =5. However for each dataframe this count would be different hence I would like to store the output of these counts either into another dictionary or a separate variable.
The dictionary is called Dict and the column name in all the dataframes is called letters. I have tried to do this by accessing the keys in the dictionary but can not get it to work. A section of what I have tried is shown below.
import pandas as pd
for key in Dict:
Count=pd.value_counts(key['letters'])
Count here would ideally change with each new count output to store into a new variable
A simplified example (the actual dataframe sizes are max 5000,63) of the one of the 14 dataframes in the dictionary would be
`d = {'col1': [1, 2,3,4,5,6,7,8,9,10], 'letters': ['a','a','a','b','b','a','b','a','b','b']}
df = pd.DataFrame(data=d)`
The other dataframes are names df2,df3,df4 etc
I hope that makes sense. Any help would be much appreciated.
Thanks

If you want to access both key and values when iterating over a dictionary, you should use the items function.
You could use another dictionary to store the results:
letter_counts = {}
for key, value in Dict.items():
letter_counts[key] = value["letters"].value_counts()
You could also use dictionary comprehension to do this in 1 line:
letter_counts = {key: value["letters"].value_counts() for key, value in Dict.items()}

The easiest thing is probably dictionary comprehension:
d = {'col1': [1, 2,3,4,5,6,7,8,9,10], 'letters': ['a','a','a','b','b','a','b','a','b','b']}
d2 = {'col1': [1, 2,3,4,5,6,7,8,9,10,11], 'letters': ['a','a','a','b','b','a','b','a','b','b','a']}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(d2)
df_dict = {'d': df, 'd2': df2}
new_dict = {k: v['letters'].count() for k,v in df_dict.items()}
# out
{'d': 10, 'd2': 11}

Assert an integer is in list on pandas series

I have a DataFrame with two pandas Series as follow:
value accepted_values
0 1 [1, 2, 3, 4]
1 2 [5, 6, 7, 8]
I would like to efficiently check if the value is in accepted_values using pandas methods.
I already know I can do something like the following, but I'm interested in a faster approach if there is one (took around 27 seconds on 1 million rows DataFrame)
import pandas as pd
df = pd.DataFrame({"value":[1, 2], "accepted_values": [[1,2,3,4], [5, 6, 7, 8]]})
def check_first_in_second(values: pd.Series):
return values[0] in values[1]
are_in_accepted_values = df[["value", "accepted_values"]].apply(
check_first_in_second, axis=1
)
if not are_in_accepted_values.all():
raise AssertionError("Not all value in accepted_values")

I think if create DataFrame with list column you can compare by DataFrame.eq and test if match at least one value per row by DataFrame.any:
df1 = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)
are_in_accepted_values = df1.eq(df["value"]).any(axis=1).all()
Another idea:
are_in_accepted_values = all(v in a for v, a in df[["value", "accepted_values"]].to_numpy())

I found a little optimisation to your second idea. Using a bit more numpy than pandas makes it faster (more than 3x, tested with time.perf_counter()).
values = df["value"].values
accepted_values = df["accepted_values"].values
are_in_accepted_values = all(s in e for s, e in np.column_stack([values, accepted_values]))

How to split a cell which contains nested array in a pandas DataFrame

I have a pandas DataFrame, which contains 610 rows, and every row contains a nested list of coordinate pairs, it looks like that:
[1377778.4800000004, 6682395.377599999] is one coordinate pair.
I want to unnest every row, so instead of one row containing a list of coordinates I will have one row for every coordinate pair, i.e.:
I've tried s.apply(pd.Series).stack() from this question Split nested array values from Pandas Dataframe cell over multiple rows but unfortunately that didn't work.
Please any ideas? Many thanks in advance!

Here my new answer to your problem. I used "reduce" to flatten your nested array and then I used "itertools chain" to turn everything into a 1d list. After that I reshaped the list into a 2d array which allows you to convert it to the dataframe that you need. I tried to be as generic as possible. Please let me know if there are any problems.
#libraries
import operator
from functools import reduce
from itertools import chain
#flatten lists of lists using reduce. Then turn everything into a 1d list using
#itertools chain.
reduced_coordinates = list(chain.from_iterable(reduce(operator.concat,
geometry_list)))
#reshape the coordinates 1d list to a 2d and convert it to a dataframe
df = pd.DataFrame(np.reshape(reduced_coordinates, (-1, 2)))
df.columns = ['X', 'Y']

One thing you can do is use numpy. It allows you to perform a lot of list/ array operations in a fast and efficient way. This includes "unnesting" (reshaping) lists. Then you only have to convert to pandas dataframe.
For example,
import numpy as np
#your list
coordinate_list = [[[1377778.4800000004, 6682395.377599999],[6582395.377599999, 2577778.4800000004], [6582395.377599999, 2577778.4800000004]]]
#convert list to array
coordinate_array = numpy.array(coordinate_list)
#print shape of array
coordinate_array.shape
#reshape array into pairs of
reshaped_array = np.reshape(coordinate_array, (3, 2))
df = pd.DataFrame(reshaped_array)
df.columns = ['X', 'Y']
The output will look like this. Let me know if there is something I am missing.

import pandas as pd
import numpy as np
data = np.arange(500).reshape([250, 2])
cols = ['coord']
new_data = []
for item in data:
new_data.append([item])
df = pd.DataFrame(data=new_data, columns=cols)
print(df.head())
def expand(row):
row['x'] = row.coord[0]
row['y'] = row.coord[1]
return row
df = df.apply(expand, axis=1)
df.drop(columns='coord', inplace=True)
print(df.head())
RESULT
coord
0 [0, 1]
1 [2, 3]
2 [4, 5]
3 [6, 7]
4 [8, 9]
x y
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

save a named tuple in all rows of a pandas dataframe - pandas

I don't really understand what you're trying to do, but if you want to put that named tuple in every row of a new column (i.e. like a scalar) then you can't rely on broadcasting but should instead replicate it yourself: df['nt'] = [n1 for _ in range(df.shape[0])]

Related

i need to return a value from a dataframe cell as a variable not a series

Map columns with list and return corresponding list

Looping through a dictionary of dataframes and counting a column

Assert an integer is in list on pandas series

How to split a cell which contains nested array in a pandas DataFrame

Categories

Resources