I have a table which has internal SKUs in column 0 and then synonyms along that row. The number of synonyms is not constant (ranging from 0 to 7, but it will tend to grow).
I need an efficient function which will allow me to take SKUs from one column in a large table and translate them to the SKU0 value from my other table.
This is my current function, which takes an array of SKUs from one table, searches for them in another, and gives me the first-column value of the row where it finds a synonym.
import sys

def new_array(dfarray1, array1, trans_dic):
    missing_values = set()
    new_array = []
    for value in array1:
        # rows of the dictionary table that contain this value in any column
        pos = trans_dic.eq(str(value)).any(axis=1)
        if len(pos[pos]) > 0:
            new_array.append(trans_dic['sku_0'][pos[pos].index[0]])
        else:
            missing_values.add(str(value))
    if len(missing_values) > 0:
        print("The following values are missing in dictionary. They are in DF called: " + dfarray1)
        print(missing_values)
        sys.exit()
    else:
        return new_array
I'm sure that this is very badly written because it takes my laptop about 3 minutes to go through about 75K values only. Can anyone help me make this faster?
Some questions asked previously:
What types are your function parameters? (can guess pandas, but no way to know for sure)
Yes, I am working with two pandas dataframes.
What does your table even look like?
Dictionary table:

SKU0    Synonym 0    Synonym 1    Synonym 2
foo     bar          bar1
foo1    baar1
foo2    baaar0                    baar2
Values table:

SKU       Value    Value1    value1
foo       3        1         7
baar1     4        5         7
baaar0    5        5         9
Desired table:

SKU     Value    Value1    value1
foo     3        1         7
foo1    4        5         7
foo2    5        5         9
What does the rest of your code that is calling this function look like?
df1.sku = new_array('df1', list(df1.sku), sku_dic)
Given the dictionary dataframe in the format
df_dict = pd.DataFrame({
    "SKU0": ["foo", "foo1", "foo2"],
    "Synonym 0": ["bar", "baar1", "baaar0"],
    "Synonym 1": ["bar1", np.nan, np.nan],
    "Synonym 2": [np.nan, np.nan, "baar2"]
})
and a values dataframe in the format
df_values = pd.DataFrame({
    "SKU": ["foo", "baar1", "baaar0"],
    "Value": [3, 4, 5],
    "Value1": [1, 5, 5],
    "value1": [7, 7, 9]
})
you can get the output you want by first using pd.melt to restructure your dictionary dataframe and then joining it to your values dataframe. Then you can use some extra logic to decide which column to take the final value from and to select the final columns needed in the output.
(
    df_dict
    # converts dict df from wide to long format
    .melt(id_vars=["SKU0"])
    # filters rows where there is no synonym
    .loc[lambda x: x["value"].notna()]
    # join dictionary with values df
    .merge(df_values, how="right", left_on="value", right_on="SKU")
    # get final value by taking the value from column "SKU0" if available, else "SKU"
    .assign(SKU=lambda x: np.where(x["SKU0"].isna(), x["SKU"], x["SKU0"]))
    # select final columns needed in output
    [["SKU", "Value", "Value1", "value1"]]
)
# output
SKU Value Value1 value1
0 foo 3 1 7
1 foo1 4 5 7
2 foo2 5 5 9
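If, as in the original new_array function, you only need the translated SKU column rather than the full join, the same melt idea can be turned into a flat synonym-to-SKU0 mapping and applied with Series.map, replacing the per-value row scan with a single vectorized lookup. A minimal sketch under the df_dict/df_values definitions above (lookup and SKU_translated are my own names, and it assumes each synonym maps to exactly one SKU0):
# long-format lookup: one row per (SKU0, synonym) pair, indexed by synonym
lookup = (
    df_dict
    .melt(id_vars=["SKU0"], value_name="synonym")
    .dropna(subset=["synonym"])
    .set_index("synonym")["SKU0"]
)
# translate in one pass; values that are already a SKU0 (e.g. "foo") fall back to themselves,
# and values found nowhere in the dictionary also pass through unchanged here,
# unlike the original function which aborts
df_values["SKU_translated"] = df_values["SKU"].map(lookup).fillna(df_values["SKU"])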
Related
I have a large dataframe with 400 columns. 200 of the column names are duplicates of the first 200. How can I use df.add_suffix to add a suffix only to the duplicate column names?
Or is there a better way to do it automatically?
Here is my solution, starting with:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape(1, -1), columns=['a', 'b', 'a', 'b'])
Output
   a  b  a  b
0  0  1  2  3
Then I use a lambda function
df.columns += np.vectorize(lambda x: '_' if x else '')(df.columns.duplicated())
Output
a b a_ b_
0 0 1 2 3
If you have more than one duplicate, you can loop until there are none left. This works for duplicated indices too, and it also keeps the index name.
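The loop itself is not spelled out in the answer; a minimal sketch of how it could look, reusing the same np.vectorize suffix trick on a throwaway frame:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(3).reshape(1, -1), columns=['a', 'a', 'a'])
# keep appending '_' to whichever names are still duplicated
while df.columns.duplicated().any():
    df.columns += np.vectorize(lambda dup: '_' if dup else '')(df.columns.duplicated())
print(df.columns.tolist())  # ['a', 'a_', 'a__']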
If I understand your question correctly, you have each name twice. If so, you can check for duplicated values using df.columns.duplicated(). Then you can create a new list, modifying only the duplicated values by adding your self-defined suffix. This is different from the other posted solution, which modifies all entries.
df = pd.DataFrame(data=[[1, 2, 3, 4]], columns=list('aabb'))
my_suffix = 'T'
df.columns = [name if not duplicated else name + my_suffix
              for duplicated, name in zip(df.columns.duplicated(), df.columns)]
df
>>>
a aT b bT
0 1 2 3 4
My answer has the disadvantage that the dataframe can have duplicated column names if one name is used three or more times.
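A quick illustration of that limitation, reusing the comprehension above on a hypothetical frame where 'a' appears three times:
import pandas as pd

my_suffix = 'T'
df = pd.DataFrame(data=[[1, 2, 3]], columns=list('aaa'))
df.columns = [name if not duplicated else name + my_suffix
              for duplicated, name in zip(df.columns.duplicated(), df.columns)]
print(df.columns.tolist())  # ['a', 'aT', 'aT'] -- the suffixed names still collide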
You could do:
import pandas as pd
# setup dummy DataFrame with repeated columns
df = pd.DataFrame(data=[[1, 2, 3]], columns=list('aaa'))
# create unique identifier for each repeated column
identifier = df.columns.to_series().groupby(level=0).transform('cumcount')
# rename columns with the new identifiers
df.columns = df.columns.astype('string') + identifier.astype('string')
print(df)
Output
a0 a1 a2
0 1 2 3
If there is only one duplicate column, you could do:
# setup dummy DataFrame with repeated columns
df = pd.DataFrame(data=[[1, 2, 3, 4]], columns=list('aabb'))
# create unique identifier for each repeated column
identifier = df.columns.duplicated().astype(int)
# rename columns with the new identifiers
df.columns = df.columns.astype('string') + identifier.astype(str)
print(df)
Output (for only one duplicate)
a0 a1 b0 b1
0 1 2 3 4
Add a numbering suffix, starting with '_1', to the first duplicated column and to every later occurrence of a name that appears more than once.
E.g. a column name list [a, b, c, a, b, a] will return [a, b, c, a_1, b_1, a_2]
from collections import Counter

counter = Counter()
empty_list = []
for x in range(df.shape[1]):
    counter.update([df.columns[x]])
    if counter[df.columns[x]] == 1:
        empty_list.append(df.columns[x])
    else:
        tx = counter[df.columns[x]] - 1
        empty_list.append(df.columns[x] + '_' + str(tx))
df.columns = empty_list
df.columns
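To check the [a, b, c, a, b, a] example from above, the same Counter loop can be wrapped in a small helper function (dedupe_columns is my own name, not from the answer):
from collections import Counter
import pandas as pd

def dedupe_columns(columns):
    """Return column names with '_1', '_2', ... appended to repeated names."""
    counter = Counter()
    renamed = []
    for name in columns:
        counter.update([name])
        renamed.append(name if counter[name] == 1 else name + '_' + str(counter[name] - 1))
    return renamed

df = pd.DataFrame([[0, 1, 2, 3, 4, 5]], columns=list('abcaba'))
df.columns = dedupe_columns(df.columns)
print(df.columns.tolist())  # ['a', 'b', 'c', 'a_1', 'b_1', 'a_2']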
I have a dataframe with multiple columns and rows. One column, say 'name', contains names, with the same name used in multiple rows. Other columns, say 'x', 'y', 'z', 'zz', hold values. I want to group by name, get the mean of each column (x, y, z, zz) for each name, and then plot the result on a bar chart.
Using pandas.DataFrame.groupby is an important piece of data wrangling. Let's first make a dummy Pandas data frame.
df = pd.DataFrame({"name": ["John", "Sansa", "Bran", "John", "Sansa", "Bran"],
"x": [2, 3, 4, 5, 6, 7],
"y": [5, -3, 10, 34, 1, 54],
"z": [10.6, 99.9, 546.23, 34.12, 65.04, -74.29]})
>>>
name x y z
0 John 2 5 10.60
1 Sansa 3 -3 99.90
2 Bran 4 10 546.23
3 John 5 34 34.12
4 Sansa 6 1 65.04
5 Bran 7 54 -74.29
We can use the label of the column to group the data (here the label is "name"). Writing the by keyword explicitly can be omitted (cf. df.groupby("name")).
df.groupby(by = "name").mean().plot(kind = "bar")
which gives us a nice bar graph.
Transposing the groupby result using T (as also suggested by anky) yields a different visualization. We can also pass a dictionary as the by parameter to determine the groups; the by parameter can also be a function, a Pandas Series, or an ndarray.
df.groupby(by = {1: "Sansa", 2: "Bran"}).mean().T.plot(kind = "bar")
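One caveat from my side, not part of the original answer: on pandas 2.0 and later, .mean() no longer silently drops the non-numeric name column when you group by the index mapping, so you may need numeric_only=True:
# average only the numeric columns for the two mapped rows
df.groupby(by={1: "Sansa", 2: "Bran"}).mean(numeric_only=True).T.plot(kind="bar")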
Given the following data
I would like to select the rows where num appears in list. In this case it will select rows 1 and 2; row 3 is not selected since 3 can't be found in [4, 5].
The dataframe is below; how should we write the filter query?
cat1 = pd.DataFrame({"num": [1, 2, 3],
                     "list": [[1, 2, 3], [3, 2], [4, 5]]})
One possible solution with list comprehension, zip and in passed to boolean indexing:
df = cat1[[a in b for a, b in zip(cat1.num, cat1.list)]]
Or solution with DataFrame.apply with axis=1 for processing per rows:
df = cat1[cat1.apply(lambda x: x.num in x.list, axis=1)]
Or create DataFrame and test membership:
df = cat1[pd.DataFrame(cat1.list.tolist()).isin(cat1.num).any(axis=1)]
print (df)
num list
0 1 [1, 2, 3]
1 2 [3, 2]
A different solution, if you are using pandas 0.25+, is to use explode():
cat1[cat1['num'].isin(cat1.explode('list').query("num == list").loc[:, 'num'])]
num list
0 1 [1, 2, 3]
1 2 [3, 2]
I have the following pandas dataframe:
data = {'ID': [1, 2, 3], 'Neighbor': [3, 1, 2], 'x': [5, 6, 7]}
Now I want to create a new column 'y' which, for each row, holds the value of x from the row referenced by the Neighbor column (i.e. the row whose ID equals that row's Neighbor value). E.g. for row 0 (ID 1), Neighbor is 3, so 'y' should be 7.
So the resulting dataframe should have the column y = [7, 5, 6].
Can I solve this without using df.apply? (As this is rather time-consuming for my big dataframes.)
I would like to use something like
df.loc[:, 'y'] = df.loc[df.Neighbor.eq(df.ID), 'x']
but this returns NaN.
We can build a dict from your ID and x columns, then map it into your new column.
your_dict_ = dict(zip(df['ID'],df['x']))
print(your_dict_)
{1: 5, 2: 6, 3: 7}
Then we can use .map to fill your new column, using the Neighbor column as the key.
df['Y'] = df['Neighbor'].map(your_dict_)
print(df)
ID Neighbor x Y
0 1 3 5 7
1 2 1 6 5
2 3 2 7 6
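The intermediate dict is not strictly required; the same lookup can be written with set_index, which some may find more direct. A self-contained sketch (the DataFrame construction is mine, built from the data dict in the question):
import pandas as pd

data = {'ID': [1, 2, 3], 'Neighbor': [3, 1, 2], 'x': [5, 6, 7]}
df = pd.DataFrame(data)
# map each Neighbor value through a Series of x values indexed by ID
df['y'] = df['Neighbor'].map(df.set_index('ID')['x'])
print(df)
#    ID  Neighbor  x  y
# 0   1         3  5  7
# 1   2         1  6  5
# 2   3         2  7  6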
Assume we have a data frame in Python Pandas that looks like this:
df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': [u'aball', u'bball', u'cnut', u'fball']})
Or, in table form:
ids vals
aball 1
bball 2
cnut 3
fball 4
How do I filter rows which contain the keyword "ball"? For example, the output should be:
ids vals
aball 1
bball 2
fball 4
In [3]: df[df['ids'].str.contains("ball")]
Out[3]:
ids vals
0 aball 1
1 bball 2
3 fball 4
df[df['ids'].str.contains('ball', na = False)] # valid for (at least) pandas version 0.17.1
Step-by-step explanation (from inner to outer):
df['ids'] selects the ids column of the data frame (technically, the object df['ids'] is of type pandas.Series)
df['ids'].str allows us to apply vectorized string methods (e.g., lower, contains) to the Series
df['ids'].str.contains('ball') checks each element of the Series as to whether the element value has the string 'ball' as a substring. The result is a Series of Booleans indicating True or False about the existence of a 'ball' substring.
df[df['ids'].str.contains('ball')] applies the Boolean 'mask' to the dataframe and returns a view containing appropriate records.
na = False removes NA / NaN values from consideration; otherwise a ValueError may be returned.
>>> mask = df['ids'].str.contains('ball')
>>> mask
0 True
1 True
2 False
3 True
Name: ids, dtype: bool
>>> df[mask]
ids vals
0 aball 1
1 bball 2
3 fball 4
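To make the na=False point above concrete, here is a short, self-contained illustration with a hypothetical frame that contains a missing id (not part of the original example):
import numpy as np
import pandas as pd

df_na = pd.DataFrame({'ids': ['aball', np.nan, 'cnut'], 'vals': [1, 2, 3]})
# df_na[df_na['ids'].str.contains('ball')]           # raises ValueError: the mask contains NaN
df_na[df_na['ids'].str.contains('ball', na=False)]   # keeps only the 'aball' row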
If you want to set the column you filter on as a new index, you could also consider using .filter; if you want to keep it as a separate column, then str.contains is the way to go.
Let's say you have
df = pd.DataFrame({'vals': [1, 2, 3, 4, 5], 'ids': [u'aball', u'bball', u'cnut', u'fball', 'ballxyz']})
ids vals
0 aball 1
1 bball 2
2 cnut 3
3 fball 4
4 ballxyz 5
and your plan is to filter all rows in which ids contains ball AND set ids as new index, you can do
df.set_index('ids').filter(like='ball', axis=0)
which gives
vals
ids
aball 1
bball 2
fball 4
ballxyz 5
But filter also allows you to pass a regex, so you could also filter only those rows where the column entry ends with ball. In this case you use
df.set_index('ids').filter(regex='ball$', axis=0)
vals
ids
aball 1
bball 2
fball 4
Note that now the entry with ballxyz is not included as it starts with ball and does not end with it.
If you want to get all entries that start with ball you can simply use
df.set_index('ids').filter(regex='^ball', axis=0)
yielding
vals
ids
ballxyz 5
The same works with columns; all you then need to change is the axis=0 part. If you filter based on columns, it would be axis=1.
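For completeness, a small sketch of the column case with axis=1 (df_cols is my own throwaway frame):
import pandas as pd

df_cols = pd.DataFrame({'ball_count': [1, 2], 'nut_count': [3, 4], 'other': [5, 6]})
df_cols.filter(like='ball', axis=1)     # keeps only 'ball_count'
df_cols.filter(regex='count$', axis=1)  # keeps 'ball_count' and 'nut_count'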