One-to-one mapping of data in a row in pandas

I have a dataset that looks like this
And I want the output of this data frame to look like this. So it's a kind of one-to-one mapping of row values. Assume option1 and option2 have the same number of comma-separated values.
Please let me know how I can achieve this.

You can use Python's built-in zip() function and the explode() method from pandas like this:
df["option1"] = df["option1"].str.split(",")
df["option2"] = df["option2"].str.split(",")
df["option3"] = df["option3"]*max(df["option1"].str.len().max(), df["option2"].str.len().max())
new_df = pd.DataFrame(df.apply(lambda x: list(zip(x.iloc[0], x.iloc[1], x.iloc[2])), axis=1).explode().to_list(), columns=df.columns)
new_df
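As an alternative sketch, recent pandas versions (1.3 or later) can explode several list columns at once, which avoids the zip() step. The sample values below are assumptions, since the original table is not shown, and the approach relies on option1 and option2 holding the same number of items in each row.

```python
import pandas as pd

# Hypothetical data: the question's table is not available, so these
# values are made up purely for illustration.
df = pd.DataFrame({
    "option1": ["a,b,c", "d,e"],
    "option2": ["1,2,3", "4,5"],
    "option3": ["x", "y"],
})

# Split the comma-separated strings into lists.
df["option1"] = df["option1"].str.split(",")
df["option2"] = df["option2"].str.split(",")

# Explode both list columns together (pandas 1.3+); the scalar in
# option3 is repeated for every row produced from the same source row.
new_df = df.explode(["option1", "option2"], ignore_index=True)
print(new_df)
```

If the two lists in a row differ in length, explode() raises an error instead of silently misaligning the values.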

Flatten and rename multi-index agg columns

I have some Pandas / cudf code that aggregates a particular column using two aggregate methods, and then renames the multi-index columns to flattened columns.
df = (
    some_df
    .groupby(["some_dimension"])
    .agg({"some_metric" : ["sum", "max"]})
    .reset_index()
    .rename(columns={"some_dimension" : "some_dimension__id", ("some_metric", "sum") : "some_metric_sum", ("some_metric", "max") : "some_metric_max"})
)
This works great in cudf, but does not work in Pandas 0.25 -- the hierarchy is not flattened out.
Is there a similar approach using Pandas? I like the cudf tuple syntax and how they just implicitly flatten the columns. Hoping to find a similarly easy way to do it in Pandas.
Thanks.
Pandas 0.25.0+ supports groupby aggregation with relabeling (also called named aggregation).
Here is a stab at your code:
df = (
    some_df
    .groupby(["some_dimension"])
    .agg(some_metric_sum=("some_metric", "sum"),
         some_metric_max=("some_metric", "max"))
    .reset_index()
    .rename(columns={"some_dimension": "some_dimension__id"})
)
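If the dict-style agg call from the question has to stay, another option is to flatten the resulting MultiIndex columns by hand. The following is a minimal sketch with made-up input values; only the column names come from the question.

```python
import pandas as pd

# Hypothetical input: column names follow the question, values are invented.
some_df = pd.DataFrame({
    "some_dimension": ["a", "a", "b"],
    "some_metric": [1, 2, 5],
})

out = some_df.groupby("some_dimension").agg({"some_metric": ["sum", "max"]})

# The columns are now a two-level MultiIndex of (column, aggregation) tuples;
# join each tuple into a single flat name such as "some_metric_sum".
out.columns = ["_".join(col) for col in out.columns]

out = out.reset_index().rename(columns={"some_dimension": "some_dimension__id"})
print(out.columns.tolist())
```

Flattening before reset_index() keeps the join simple, because at that point every column tuple has two non-empty parts.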

Sample Pandas dataframe based on values in column

I have a large dataframe that I want to sample based on the values in the target column, which is binary: 0/1
I want to extract an equal number of rows that have 0's and 1's in the "target" column. I was thinking of using the pandas sampling function, but I am not sure how to specify that I want an equal number of samples from both classes based on the target column.
I was thinking of using something like this:
df.sample(n=10000, weights='target', random_state=1)
Not sure how to edit it to get 10k records with 5k 1's and 5k 0's in the target column. Any help is appreciated!
You can group the data by target and then sample each group:
df = pd.DataFrame({'col':np.random.randn(12000), 'target':np.random.randint(low = 0, high = 2, size=12000)})
new_df = df.groupby('target').apply(lambda x: x.sample(n=5000)).reset_index(drop = True)
new_df.target.value_counts()
1 5000
0 5000
Edit: Use DataFrameGroupBy.sample
In pandas 1.1.0+ you get similar results using DataFrameGroupBy.sample:
new_df = df.groupby('target').sample(n=5000)
You can use the DataFrameGroupBy.sample method as follows:
sample_df = df.groupby("target").sample(n=5000, random_state=1)
Also found this to be a good method:
df['weights'] = np.where(df['target'] == 1, .5, .5)
sample_df = df.sample(frac=.1, random_state=111, weights='weights')
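Change the value of frac depending on the percentage of data you want back from the original dataframe. Note that because both classes receive the same weight (.5) here, this is equivalent to unweighted sampling, so an exact 50/50 split is not guaranteed.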
You can also sample each class separately with df0.sample(n=5000) and df1.sample(n=5000) and then combine the two results into a dfsample dataframe. You can create df0 and df1 by filtering df on the target column with boolean indexing (df.filter() selects by label, not by value). If you provide sample data I can help you construct that logic.
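The manual approach can be sketched as follows. The data is synthetic and the names df0, df1 and dfsample simply follow the wording above; the class sizes are fixed at 6000 each so that 5000 rows per class are always available.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration: 6000 rows of each class, shuffled.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "col": rng.standard_normal(12000),
    "target": rng.permutation(np.repeat([0, 1], 6000)),
})

# Split by class with boolean indexing.
df0 = df[df["target"] == 0]
df1 = df[df["target"] == 1]

# Sample 5000 rows from each class and stack the results.
dfsample = pd.concat(
    [df0.sample(n=5000, random_state=1), df1.sample(n=5000, random_state=1)],
    ignore_index=True,
)
print(dfsample["target"].value_counts())
```

sample() raises an error if a class has fewer than n rows, unless replace=True is passed.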

Preferred pandas code for selecting all rows and a subset of columns

Suppose that you have a pandas DataFrame named df with columns ['a','b','c','d','e'] and you want to create a new DataFrame newdf with columns 'b' and 'd'. There are two possible ways to do this:
newdf = df[['b','d']]
or
newdf = df.loc[:,['b','d']]
The first is using the indexing operator. The second is using .loc. Is there a reason to prefer one over the other?
Thanks to @coldspeed, it seems that newdf = df.loc[:,['b','d']] is preferred, to avoid the dreaded SettingWithCopyWarning.

Reassigning pandas column in place from a slice of another dataframe

So I learned from this answer to this question that in pandas 0.20.3 and above reassigning the values of a dataframe works without giving the SettingWithCopyWarning in many ways as follows:
df = pd.DataFrame(np.ones((5,6)),columns=['one','two','three',
'four','five','six'])
df.one *=5
df.two = df.two*5
df.three = df.three.multiply(5)
df['four'] = df['four']*5
df.loc[:, 'five'] *=5
df.iloc[:, 5] = df.iloc[:, 5]*5
HOWEVER
If I were to take a part of that dataframe like this for example:
df1 = df[(df.index>1)&(df.index<5)]
And then try one of the above methods for reassigning a column like so:
df1.one *=5
THEN I will get the SettingWithCopyWarning.
So is this a bug or am I just missing something about the way pandas expects for this kind of operation to work?
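A minimal sketch of two common ways to make the intent explicit, assuming the goal is either an independent subset or an update of the original frame (the name mask is introduced here for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((5, 6)),
                  columns=["one", "two", "three", "four", "five", "six"])
mask = (df.index > 1) & (df.index < 5)

# Option 1: take an explicit copy of the selected rows, then modify the copy.
df1 = df[mask].copy()
df1["one"] *= 5

# Option 2: update the selected rows of the original frame in a single step.
df.loc[mask, "two"] *= 5

print(df1["one"].tolist(), df["one"].tolist(), df["two"].tolist())
```

With the explicit copy, pandas no longer has to guess whether the assignment was meant to reach back into df.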

Extracting value and creating new column out of it

I would like to extract a certain section of a URL stored in a column of a pandas DataFrame and make it a new column. This code
ref = df['REFERRERURL']
ref.str.findall("\\d\\d\\/(.*?)(;|\\?)",flags=re.IGNORECASE)
returns a Series with tuples in it. How can I take out only one part of each tuple before the Series is created, so I can simply turn that into a column? Sample data for referrerurl is
http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....
In this example I am interested in creating a column that only has 'someproduct_step2' in it.
Thanks,
In [25]: df = DataFrame([['http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....']],columns=['A'])
In [26]: df['A'].str.findall("\\d\\d\\/(.*?)(;|\\?)",flags=re.IGNORECASE).apply(lambda x: Series(x[0][0],index=['first']))
Out[26]:
first
0 someproduct_step2
In pandas 0.11.1, here is a neat way of doing this as well:
In [34]: df.replace({ 'A' : r"http:.+\d\d\/(.*?)(;|\?).*$"}, { 'A' : r'\1'} ,regex=True)
Out[34]:
A
0 someproduct_step2
This also worked
import re
def extract(x):
    # Return the first captured group of the first match, or None if nothing matches.
    res = re.findall("\\d\\d\\/(.*?)(;|\\?)", x)
    if res:
        return res[0][0]
session['RU_2'] = session['REFERRERURL'].apply(extract)
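In current pandas versions, Series.str.extract returns the captured group directly, so no tuple unpacking is needed. A minimal sketch using the sample URL from the question; the second group is made non-capturing so that only the wanted part is returned:

```python
import pandas as pd

df = pd.DataFrame({
    "REFERRERURL": [
        "http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....",
    ]
})

# Two digits and a slash, then capture lazily up to the first ';' or '?'.
pattern = r"\d\d/(.*?)(?:;|\?)"

# expand=False returns a Series when the pattern has one capture group.
df["RU_2"] = df["REFERRERURL"].str.extract(pattern, expand=False)
print(df["RU_2"].tolist())
```

Rows that do not match the pattern get NaN in the new column.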