Extracting value and creating new column out of it - pandas

I would like to extract a certain section of a URL stored in a column of a pandas DataFrame and make it a new column. This
ref = df['REFERRERURL']
ref.str.findall("\\d\\d\\/(.*?)(;|\\?)",flags=re.IGNORECASE)
returns a Series with tuples in it. How can I take out only one part of each tuple before the Series is created, so I can simply turn that into a column? Sample data for REFERRERURL is
http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....
In this example I am interested in creating a column that only has 'someproduct_step2' in it.
Thanks,

In [25]: df = DataFrame([['http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....']],columns=['A'])
In [26]: df['A'].str.findall("\\d\\d\\/(.*?)(;|\\?)",flags=re.IGNORECASE).apply(lambda x: Series(x[0][0],index=['first']))
Out[26]:
first
0 someproduct_step2
In pandas 0.11.1 there is also a neat way of doing this:
In [34]: df.replace({ 'A' : "http:.+\d\d\/(.*?)(;|\\?).*$"}, { 'A' : r'\1'} ,regex=True)
Out[34]:
A
0 someproduct_step2

This also worked:
def extract(x):
    res = re.findall("\\d\\d\\/(.*?)(;|\\?)",x)
    if res: return res[0][0]
session['RU_2'] = session['REFERRERURL'].apply(extract)
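The same extraction can also be sketched with Series.str.extract, which returns the captured group directly instead of a list of tuples. This is only an illustration: the second URL and the column name RU_2 on this small frame are made up, and the terminator is written as a non-capturing group so that only one group is captured.

```python
import pandas as pd

# Hypothetical sample data; the first URL is the one from the question.
df = pd.DataFrame({
    "REFERRERURL": [
        "http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....",
        "http://wap.blah.com/home",  # no two-digit id segment, so no match
    ]
})

# (?:;|\?) is non-capturing, so the pattern has a single capture group
# and expand=False yields a Series rather than a DataFrame.
df["RU_2"] = df["REFERRERURL"].str.extract(r"\d\d/(.*?)(?:;|\?)", expand=False)
print(df["RU_2"])
```

Rows that do not match the pattern end up as NaN rather than raising an error.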

Related

One to One mapping of data in the row in pandas

I have a dataset that looks like this
And I want the output data frame to look like this. So it's a kind of one-to-one mapping of row values. Assume option1 and option2 have the same number of comma-separated values.
Please let me know how I can achieve this.
You can use the zip() function from the Python standard library and the explode() method of a pandas Series like this:
df["option1"] = df["option1"].str.split(",")
df["option2"] = df["option2"].str.split(",")
df["option3"] = df["option3"]*max(df["option1"].str.len().max(), df["option2"].str.len().max())
new_df = pd.DataFrame(df.apply(lambda x: list(zip(x[0], x[1], x[2])), axis=1).explode().to_list(), columns=df.columns)
new_df
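On pandas 1.3 or later, DataFrame.explode also accepts a list of columns, which gives a shorter route to the same pairing. The sketch below uses made-up data, since the original tables were posted as images, and it assumes that option1 and option2 hold the same number of comma-separated items in every row and that option3 is a scalar to be repeated.

```python
import pandas as pd

# Hypothetical data standing in for the question's screenshots.
df = pd.DataFrame({
    "option1": ["a,b,c", "d,e"],
    "option2": ["1,2,3", "4,5"],
    "option3": ["x", "y"],
})

# Split both columns into lists, then explode them together so the
# i-th item of option1 stays paired with the i-th item of option2.
out = df.assign(
    option1=df["option1"].str.split(","),
    option2=df["option2"].str.split(","),
).explode(["option1", "option2"], ignore_index=True)

print(out)
```

Exploding several columns at once raises an error if the lists in a row differ in length, which acts as a cheap check on the one-to-one assumption.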

How to deal with nested data in pandas dataframe via "for loop"?

I have nested data in a pandas DataFrame and I want to flatten the "names" column using the pd.DataFrame() function. When I attempt to flatten it with a for loop, it produces 5 separate DataFrames, which I do not expect; I want a single DataFrame with all values listed. I have already tried the concat and append methods, but they did not get me any further. Any help or comment is welcome, thanks so much. Here is my for loop:
x=df['names'].iloc[0:4]
name_data = pd.DataFrame(x)
data_row=[]
for data in x:
    data_row =pd.DataFrame(data)
    st.write(data_row)
If I understand correctly, you want to concatenate the 5 tables in the example images above into a single table and show the resulting table in Streamlit.
All you have to do is:
Change data_row =pd.DataFrame(data) to data_row += [pd.DataFrame(data)].
After the for loop has finished, concatenate all the DataFrames in data_row into one DataFrame with data_row = pd.concat(data_row).
Then show the resulting table in Streamlit with st.write(data_row).
Here is an example that tackles your problem.
df = pd.DataFrame({
    'names': [[{'name':'a'},{'name':'b'}], [{'name':'c'}]]
})
x=df['names'].iloc[0:2]
data_row = []
for data in x:
    data_row += [pd.DataFrame(data)]
data_row = pd.concat(data_row)
st.write(data_row)
Or you can build one list of dictionaries and create the DataFrame from it, as in the example below:
data_row = []
for data in x:
    data_row += data
data_row = pd.DataFrame(data_row)
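For reference, the loop-and-concat idea can be condensed into a single expression. This is a sketch on the same toy data as the answer, without the Streamlit call, and it assumes every cell of "names" holds a list of dictionaries with the same keys.

```python
import pandas as pd

df = pd.DataFrame({
    "names": [[{"name": "a"}, {"name": "b"}], [{"name": "c"}]]
})

# Build one small DataFrame per cell, then stack them into a single table.
# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 0.
flat = pd.concat(
    [pd.DataFrame(cell) for cell in df["names"]],
    ignore_index=True,
)
print(flat)
```

In a Streamlit app the result would then be passed to st.write(flat) once, after the concatenation, rather than inside the loop.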

Empty cells when using an apply function

So I am trying to calculate a value for a new column from one column or another, based on which one has data available. This is the code I have right now. It doesn't seem to notice when there is no data present and always goes to the "else" statement. My DataFrame is an imported Excel file. Thanks for any advice!
def create_sulfide_col(row):
    if row["Sulphate-S(HCL Leachable)_%S"] is None:
        val = row["Total-S_%S"] - row["Sulphate-S(HCL Leachable)_%S"]
    else:
        val = ["Total-S_%S"]- df["Sulphate-S_%S"]
    return val
df["Sulphide-S(calc)-C_%S"] = df.apply(lambda row: create_sulfide_col(row), axis='columns')
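This can be done using numpy.where. Empty Excel cells are read in as NaN rather than None, so test for them with isna()/notna() and use the HCL-leachable value only where it is present:
import numpy as np
df['newcol'] = np.where(df["Sulphate-S(HCL Leachable)_%S"].notna(),
                        df["Total-S_%S"] - df["Sulphate-S(HCL Leachable)_%S"],
                        df["Total-S_%S"] - df["Sulphate-S_%S"])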
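To illustrate why the `is None` check never fires, here is a minimal sketch with made-up numbers. It assumes the intent is to subtract the HCL-leachable sulphate where it was measured and to fall back to Sulphate-S otherwise; the values are purely illustrative.

```python
import numpy as np
import pandas as pd

HCL = "Sulphate-S(HCL Leachable)_%S"
SUL = "Sulphate-S_%S"
TOT = "Total-S_%S"

# Hypothetical values; NaN stands in for an empty Excel cell.
df = pd.DataFrame({
    TOT: [2.0, 3.0, 1.5],
    HCL: [0.5, np.nan, np.nan],
    SUL: [0.4, 1.0, np.nan],
})

# A missing cell is NaN, not None, so an identity check against None fails.
missing = df.loc[1, HCL]
print(missing is None, pd.isna(missing))

# Take the HCL value where present, otherwise Sulphate-S, then subtract.
sulphate = df[HCL].fillna(df[SUL])
df["Sulphide-S(calc)-C_%S"] = df[TOT] - sulphate
print(df)
```

Rows where both sulphate columns are empty stay NaN in the result, which is usually preferable to silently treating the missing value as zero.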

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply the value_counts method to a DataFrame based on columns selected dynamically in a Streamlit app.
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
    all_columns = df.columns.tolist()
    selected_columns = st.multiselect("Select", all_columns)
    new_df = df[selected_columns]
    st.dataframe(new_df)
The above lets me select columns and displays the data for the selected columns. I would like to know how I could apply the value_counts/groupby methods to this output in the Streamlit app.
If I try the following:
st.table(new_df.value_counts())
I get the error below:
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to a dataframe. When you pass a single column name in [] to a dataframe, you get back a pandas.Series object (which has the value_counts method). But when you pass a list of columns, you get back a pandas.DataFrame, which has no value_counts method in pandas versions before 1.1.
Can you try st.table(new_df[col_name].value_counts()), where col_name is a single column name?
I think the error is because value_counts() is applicable to a Series and not to a DataFrame.
You can try converting the value_counts() output to a DataFrame.
If you want to apply it to a single column:
def value_counts_df(df, col):
    """
    Returns pd.value_counts() as a DataFrame
    Parameters
    ----------
    df : Pandas Dataframe
        Dataframe on which to run value_counts(), must have column `col`.
    col : str
        Name of column in `df` for which to generate counts
    Returns
    -------
    Pandas Dataframe
        Returned dataframe will have a single column named "count" which contains the value_counts()
        for each unique value of df[col]. The index name of this dataframe is `col`.
    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a':[1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df
val_count_single = value_counts_df(new_df, selected_col)
If you want to apply it to all object columns in the dataframe:
def valueCountDF(df, object_cols):
    c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
    p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
                                                        dropna=False)).T.stack() * 100).round(2)
    cp = pd.concat([c,p], axis=1, keys=["Count", "Percentage %"])
    return cp
val_count_df_cols = valueCountDF(df, selected_columns)
Finally, you can use st.table or st.dataframe to show the resulting dataframe in your Streamlit app.
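As a compact illustration of the two ideas above, the sketch below counts values per selected column and also counts each combination of the selected columns with groupby. The data and the contents of selected_columns are made up, and the Streamlit calls are left out so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical data and selection; in the app these would come from
# the uploaded DataFrame and st.multiselect.
df = pd.DataFrame({
    "city": ["A", "B", "A", "A"],
    "tier": ["x", "y", "x", "y"],
})
selected_columns = ["city", "tier"]

# One counts table per selected column (a Series has value_counts).
per_column = {
    col: df[col].value_counts().rename("count").to_frame()
    for col in selected_columns
}

# Counts of each distinct combination of the selected columns.
combo = df.groupby(selected_columns).size().reset_index(name="count")

print(per_column["city"])
print(combo)
```

Each of these results is a DataFrame, so it can be handed to st.table or st.dataframe as is.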

Sample Pandas dataframe based on values in column

I have a large dataframe that I want to sample based on the values in the target column, which is binary: 0/1
I want to extract an equal number of rows that have 0's and 1's in the "target" column. I was thinking of using the pandas sampling function but am not sure how to specify the equal number of samples I want from both classes based on the target column.
I was thinking of using something like this:
df.sample(n=10000, weights='target', random_state=1)
Not sure how to edit it to get 10k records with 5k 1's and 5k 0's in the target column. Any help is appreciated!
You can group the data by target and then sample:
df = pd.DataFrame({'col':np.random.randn(12000), 'target':np.random.randint(low = 0, high = 2, size=12000)})
new_df = df.groupby('target').apply(lambda x: x.sample(n=5000)).reset_index(drop = True)
new_df.target.value_counts()
1 5000
0 5000
Edit: Use DataFrameGroupBy.sample
You get similar results using DataFrameGroupBy.sample (available since pandas 1.1):
new_df = df.groupby('target').sample(n=5000)
You can use the DataFrameGroupBy.sample method as follows:
sample_df = df.groupby("target").sample(n=5000, random_state=1)
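Weighted sampling is another option. Equal weights for both classes are the same as uniform sampling, so weight each row by the inverse of its class size instead; the result is only approximately balanced:
df['weights'] = 1 / df.groupby('target')['target'].transform('count')
sample_df = df.sample(frac=.1, random_state=111, weights='weights')
Change the value of frac depending on the percentage of the original dataframe you want back.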
Alternatively, you can run df0.sample(n=5000) and df1.sample(n=5000) and then combine df0 and df1 into a dfsample dataframe. You can create df0 and df1 by boolean indexing on the target column, for example df[df['target'] == 0]. If you provide sample data I can help you construct that logic.
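The split-sample-combine approach can be sketched as follows. The data is synthetic and deliberately imbalanced (700 zeros, 300 ones), and n_per_class is a made-up name; with the real data it would be 5000, provided each class has at least that many rows.

```python
import numpy as np
import pandas as pd

# Synthetic, imbalanced data standing in for the real dataframe.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "col": rng.standard_normal(1000),
    "target": np.repeat([0, 1], [700, 300]),
})

n_per_class = 200  # hypothetical; must not exceed the smaller class size

# Split by class with boolean indexing, then sample each part separately.
df0 = df[df["target"] == 0].sample(n=n_per_class, random_state=1)
df1 = df[df["target"] == 1].sample(n=n_per_class, random_state=1)

# Combine and shuffle so the classes are not stored in two blocks.
dfsample = (
    pd.concat([df0, df1])
    .sample(frac=1, random_state=1)
    .reset_index(drop=True)
)
print(dfsample["target"].value_counts())
```

Unlike weighted sampling, this guarantees exactly n_per_class rows from each class.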