Conditional join result count in a large dataframe - pandas

I have a data set of about 100m rows, 4gb, containing two lists like these:
Seed
a
r
apple
hair
brush
tree
Phrase
apple tree
hair brush
I want to get the count of unique matched 'Phrase's for each unique 'Seed'. So, for example, the seed 'a' is contained in both 'apple tree' and 'hair brush', so its 'Phrases_matched_count' should be '2'. Matches are just partial matches (i.e. a 'string contains' match; it does not need to be a regex or anything complex).
Seed Phrases_matched_count
a 2
r 2
apple 1
hair 1
brush 1
tree 1
I have been trying to do this using Apache Pig (on a small Amazon EMR cluster) and Python pandas (the data set just about fits in memory), but I can't find a way that avoids either looping through every row for each unique 'Seed', which will take very long, or a cross product of the tables, which will use too much memory.
Any ideas?

This can be done using the built-in str.contains, though I'm not sure how well it scales to a large amount of data.
import pandas as pd

# Test data
seed = pd.Series(['a', 'r', 'apple', 'hair', 'brush', 'tree'])
phrase = pd.Series(['apple tree', 'hair brush'])
# Creating a DataFrame with seeds as index and phrases as columns
df = pd.DataFrame(index=seed, columns=phrase)
# Checking if each seed (row label) is contained in each phrase (column label);
# returning a Series makes apply expand the booleans back into columns
df = df.apply(lambda x: pd.Series(x.index.str.contains(x.name, regex=False), index=x.index), axis=1)
# Getting the result: count of matching phrases per seed
df.sum(axis=1)
# The result
a 2
r 2
apple 1
hair 1
brush 1
tree 1
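The code above materializes the full seed-by-phrase matrix in one go, which is exactly the memory concern raised in the question. As a rough sketch only (assuming the unique seeds fit comfortably in memory; chunk_size and counts are illustrative names, and nothing here has been benchmarked at 100M rows), the same str.contains idea can be applied to the phrases one chunk at a time, accumulating per-seed counts so the full cross product never has to exist at once:

import pandas as pd

uniq_phrases = pd.Series(phrase.unique())
chunk_size = 100_000  # tune to the memory available
counts = pd.Series(0, index=seed.unique(), name='Phrases_matched_count')
for start in range(0, len(uniq_phrases), chunk_size):
    chunk = uniq_phrases.iloc[start:start + chunk_size]
    # boolean block: seeds as index, this chunk of phrases as columns
    block = pd.DataFrame(index=counts.index, columns=list(chunk))
    block = block.apply(lambda x: pd.Series(x.index.str.contains(x.name, regex=False),
                                            index=x.index), axis=1)
    counts += block.sum(axis=1)
print(counts)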

Related

Selecting Rows Based On Specific Condition In Python Pandas Dataframe

So I am new to using Python Pandas dataframes.
I have a dataframe with one column representing customer ids and the other holding flavors and satisfaction scores that looks something like this.
Although each customer should have 6 rows dedicated to them, Customer 1 only has 5. How do I create a new dataframe that will only print out customers who have 6 rows?
I tried doing: df['Customer No'].value_counts() == 6 but it is not working.
Here is one way to do it. If you post the data as code (preferably) or as text, I would be able to share the result.
# create a temporary column 'c' by grouping on Customer No and assigning
# the group count to it using transform; finally, use loc to select rows
# where the count equals 6
(df.loc[df.assign(
    c=df.groupby(['Customer No'])['Customer No']
       .transform('count'))['c'].eq(6)]
)
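A shorter alternative sketch, building on the asker's value_counts attempt (it assumes the same df; counts and out are just illustrative names):

# keep only customers that appear exactly 6 times
counts = df['Customer No'].value_counts()
out = df[df['Customer No'].isin(counts[counts == 6].index)]

# or, usually slower on large frames but very readable
out = df.groupby('Customer No').filter(lambda g: len(g) == 6)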

Trouble understanding how the indices of a series are determined

So I have a huge data frame from which I am reading a single column, and I need to choose 100 unique values from that column. I think what I did resulted in 100 unique values, but I'm confused about the indexing of the resulting series. I looked at the indices of the data frame and they did not correspond to the values at the same indices of the series. I would like them to correspond, that is, I want the indices of the resulting series to be the same as the indices of the data frame from which I am reading the column. Would someone be able to explain how the resulting indices were determined here?
The indices of the sample do not correspond to the indices that exist in the DataFrame. This is due to the following fact:
When doing CSsq.unique() you are in fact getting back an np.ndarray (check the docs here). An array does not have any indices. But you are passing this to the pd.Series constructor, and as a result a new Series is created, which does have an index (running from 0 up to n-1, where n is the size of the Series). This, of course, has nothing to do with the DataFrame indices, because you first isolated the unique values.
See the example below for a hypothetical Series called s:
s
0 100
1 100
2 100
3 200
4 250
5 300
6 300
Let's isolate the unique occurrences:
s.unique()
# [100, 200, 250, 300]
And now let's feed this to the pd.Series constructor:
pd.Series(s.unique())
0 100
1 200
2 250
3 300
As you can see this Series was generated from an array and its indices have nothing to do with the initial indices!
Now, if you take a random sample out of this Series, you'll get values with indices that correspond to this new Series object!
If you'd like to get a sample with indices that are derived from the DataFrame try something like this:
CSsq.drop_duplicates().sample(100)
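For completeness, here is what that looks like on the hypothetical s from above: drop_duplicates keeps the first occurrence of each value and preserves the original index, unlike unique() followed by the Series constructor.

s.drop_duplicates()
0 100
3 200
4 250
5 300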

Pandas series replace value ignoring case but only if exact match

As the title says, I'm looking for a solution to replace exact string matches in a series, ignoring case.
ls = {'CAT':'abc','DOG' : 'def','POT':'ety'}
d = pd.DataFrame({'Data': ['cat','dog','pot','Truncate','HotDog','ShuPot'],'Result':['abc','def','ety','Truncate','HotDog','ShuPot']})
d
In the above code, ls holds the key-value pairs, where the key is the existing value in a dataframe column and the value is the value to replace it with.
The issue is that the service passing the dictionary always supplies the keys in upper case, whereas the dataframe may hold the values in lower case.
The expected output is stored in the 'Result' column.
I tried including re.ignore = True, but that changes the last 2 values. I also tried the following code, but it is not working as expected: it keeps converting values to upper case from one iteration to the next.
for k, v in ls.items():
    print(k, v)
    d['Data'] = d['Data'].astype(str).str.upper().replace({k: v})
    print(d)
I'd appreciate any help.
Create a mapping series from the given dictionary, transform its index to lower case, then use Series.map to map the values in the Data column to the values in mappings, and finally use Series.fillna to fill the missing values in the mapped series:
mappings = pd.Series(ls)
mappings.index = mappings.index.str.lower()
d['Result'] = d['Data'].str.lower().map(mappings).fillna(d['Data'])
# print(d)
Data Result
0 cat abc
1 dog def
2 pot ety
3 Truncate Truncate
4 HotDog HotDog
5 ShuPot ShuPot
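An alternative sketch with the same ls and d (lower_map is just an illustrative name): normalize the dictionary keys once instead of upper-casing the column, and replace only whole-cell matches, leaving everything else untouched.

lower_map = {k.lower(): v for k, v in ls.items()}
d['Result'] = d['Data'].map(lambda x: lower_map.get(str(x).lower(), x))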

Sum pandas columns, excluding some rows based on other column values

I'm attempting to determine the number of widget failures from a test population.
Each widget can fail in 0, 1, or multiple ways. I'd like to calculate the number of failures for each failure mode, but once a widget is known to have failed, it should be excluded from later sums. In other words, the failure modes are known and ordered. If a widget fails via mode 1 and mode 3, I don't care about mode 3: I just want to count mode 1.
I have a dataframe with one row per item, and one column per failure mode. If the widget fails in that mode, the column value is 1, else it is 0.
d = {"item_1":
{"failure_1":0, "failure_2":0},
"item_2":
{"failure_1":1, "failure_2":0},
"item_3":
{"failure_1":0, "failure_2":1},
"item_4":
{"failure_1":1, "failure_2":1}}
df = pd.DataFrame(d).T
display(df)
Output:
failure_1 failure_2
item_1 0 0
item_2 1 0
item_3 0 1
item_4 1 1
If I just want to sum the columns, that's easy: df.sum(). And if I want to calculate percentage failures, easy too: df.sum()/len(df). But this counts widgets that fail in multiple ways, multiple times. For the problem stated, the best I can come up with is this:
# create an empty df to store the results
df2 = pd.DataFrame(columns=["total_failures"])
for col in df.columns:
    # create a row, named after the column, and assign it the value of the sum
    df2.loc[col] = df[col].sum()
    # drop the rows of df where this failure mode equals 1
    df = df.loc[df[col] != 1]
display(df2)
Output:
total_failures
failure_1 2
failure_2 1
This requires creating another dataframe (that's fine), but it also requires iterating over the existing dataframe's columns and deleting a few of its rows at a time. If the dataframe takes a while to generate, or is needed for future calculations, this is not workable. I can deal with iterating over the columns.
Is there a way to do this without deleting the original df, or making a temporary copy? (Not workable with large data sets.)
You can do a cumsum on axis=1 and, wherever the value is greater than 1, mask it as 0, then take the sum:
out = df.mask(df.cumsum(axis=1).gt(1), 0).sum().to_frame('total_failures')
print(out)
total_failures
failure_1 2
failure_2 1
This way the original df is retained too.
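An alternative sketch on the same df (first_fail and alt are illustrative names): take each failed item's first failure mode with idxmax and count those, which also leaves the original df untouched.

# first failure mode for each item that failed at all
first_fail = df[df.any(axis=1)].idxmax(axis=1)
alt = (first_fail.value_counts()
       .reindex(df.columns, fill_value=0)
       .to_frame('total_failures'))
print(alt)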

Creating a CategoricalDtype from an int column in Dask

dask.__version__ = 2.5.0
I have a table with many columns containing uint16 codes in the range 0,...,n, and a bunch of lookup tables containing the mappings from these 'codes' to their 'categories'.
My question: is there a way to make these integer columns 'categorical' without parsing the data or first replacing the codes with the categories?
Ideally, I want Dask to keep the values as they are, accept them as category codes, and accept the categories I tell it belong to those codes.
import numpy as np
import pandas as pd
import dask.dataframe as dd

dfp = pd.DataFrame({'c01': np.random.choice(np.arange(3), size=10), 'v02': np.random.randn(10)})
dfd = dd.from_pandas(dfp, npartitions=2)
mdt = pd.CategoricalDtype(list('abc'), ordered=True)
dfd.c01 = dfd.c01.map_partitions(lambda s: pd.Categorical.from_codes(s, dtype=mdt), meta='category')
dfd.dtypes
The above does not work; the dtype is 'O' (it seems to have replaced the ints with strings). I can subsequently do the following (which seems to do the trick):
dfd.c01 = dfd.c01.astype('category')
But that seems inefficient for big data sets.
Any pointers are much appreciated.
Some context: I have a big dataset (>500M rows) with many columns containing a limited number of strings. The perfect use case for dtype categorical. The data gets extracted from a Teradata DW using Parallel Transporter, meaning it produces a delimited UTF-8 file. To make this process faster, I categorize the data on the Teradata side, and I just need to create the dtype category from the codes on the Dask side of the fence.
As long as you have an upper bound on the largest integer, which you call n (equal to 3 here), the following will work.
In [33]: dfd.c01.astype('category').cat.set_categories(np.arange(len(mdt.categories))).cat.rename_categories(list(mdt.categories))
Out[33]:
Dask Series Structure:
npartitions=2
0 category[known]
5 ...
9 ...
Name: c01, dtype: category
Dask Name: cat, 10 tasks
Which, when computed, gives the following:
Out[34]:
0 b
1 b
2 c
3 c
4 a
5 c
6 a
7 a
8 a
9 a
Name: c01, dtype: category
Categories (3, object): [a, b, c]
The basic idea is to make an intermediate Categorical whose categories are the codes (0, 1, ..., n) and then move from those numerical categories to the actual ones (a, b, c).
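The same idea spelled out step by step (a sketch using the dfd and mdt from the question; tmp is just an intermediate name):

tmp = dfd.c01.astype('category')                              # categories are the codes 0..n-1
tmp = tmp.cat.set_categories(np.arange(len(mdt.categories)))  # make every code a known category
dfd['c01'] = tmp.cat.rename_categories(list(mdt.categories))  # swap the numeric codes for a, b, c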
We have an open issue for making this nicer https://github.com/dask/dask/issues/2829