Dataframe Split query - dataframe

I have a dataframe column in which every row contains text like the following:
Symbol(id=15351, ticker=VXX, market=US, currency=USD, type=EQUITY,tick_size=0.010000, lot_size=100, contract_size=0, rate=None)
I am trying to extract only the value after ticker=, which here is VXX.
I tried:
df['symbolcolumn'] = df['symbolcolumn'].str.split(',market', expand=True)
But this does not extract just the ticker symbol.
I am looking for df['symbolcolumn'] to contain VXX.
Can you advise me, please?

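OK, I managed to do it with:
df['symbol'] = df['symbol'].astype(str)
# keep only the part before ', market'; .str[0] yields a single column
df['symbol'] = df['symbol'].str.split(', market').str[0]
# the ticker is whatever follows the last '=' in the remaining text
df['symbol'] = df['symbol'].apply(lambda x: x.split("=")[-1])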
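A shorter alternative, offered as a sketch rather than the asker's method: Series.str.extract can pull the value out in one step with a regular expression. The pattern and the new column name "ticker" are assumptions for illustration; the pattern assumes the ticker never contains a comma or a closing parenthesis.

```python
import pandas as pd

df = pd.DataFrame({
    "symbol": [
        "Symbol(id=15351, ticker=VXX, market=US, currency=USD, type=EQUITY,"
        "tick_size=0.010000, lot_size=100, contract_size=0, rate=None)"
    ]
})

# Capture everything after "ticker=" up to the next comma or closing paren.
# expand=False returns a Series because there is only one capture group.
df["ticker"] = df["symbol"].astype(str).str.extract(r"ticker=([^,)]+)", expand=False)
print(df["ticker"].tolist())  # ['VXX']
```

Rows that do not contain "ticker=" end up as NaN instead of raising an error.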

Related

Dataframe loc with multiple string value conditions

Hi, given this dataframe, is it possible to fetch the Number value that matches certain conditions using df.loc? This is what I came up with so far:
if df.loc[(df["Tags"]=="Brunei") & (df["Type"]=="Host"),"Number"]:
I want the output to be 1. Is this the correct way to do it?
You're on the right track, but you need to append ".values[0]" to the .loc expression to extract the single value from the pandas Series it returns.
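import pandas as pd
df = pd.DataFrame({
    'Tags': ['Brunei', 'China'],
    'Type': ['Host', 'Address'],
    'Number': [1, 1192]
})
print(df)
# .loc with a boolean mask returns a Series, even when only one row matches
series = df.loc[(df["Tags"]=="Brunei") & (df["Type"]=="Host"),"Number"]
print(type(series))
# .values[0] pulls the first (here, the only) element out as a scalar
value = df.loc[(df["Tags"]=="Brunei") & (df["Type"]=="Host"),"Number"].values[0]
print(type(value))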
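One caveat worth sketching: indexing element 0 fails with an IndexError when no row matches, and writing `if series:` on a Series raises a ValueError because its truth value is ambiguous. The helper below, `lookup_number`, is a hypothetical name introduced only for this example; it checks for an empty result before indexing.

```python
import pandas as pd

df = pd.DataFrame({
    "Tags": ["Brunei", "China"],
    "Type": ["Host", "Address"],
    "Number": [1, 1192],
})

def lookup_number(frame, tag, type_):
    """Return the Number of the first row matching tag and type_, or None."""
    matches = frame.loc[(frame["Tags"] == tag) & (frame["Type"] == type_), "Number"]
    # An empty Series has no element 0, so check before indexing.
    if matches.empty:
        return None
    return matches.iloc[0]

print(lookup_number(df, "Brunei", "Host"))     # 1
print(lookup_number(df, "Brunei", "Address"))  # None
```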

KeyError on existing columns in my pandas dataframe after transposing, with dtype='object'

I'm having trouble doing basic arithmetic on columns in pandas. After transposing, the dtype of my columns is 'object', and I get a KeyError when I try to combine one column with another.
After transposing I added this line:
dg.columns = dg.columns.astype(str)
The columns don't seem to react to it. Does anyone know how to solve this?
My full code:
dg = pd.read_csv("file.csv", encoding="latin-1", header = None)
dg = dg.T
dg.columns= dg.iloc[0]
dg = dg.reindex(dg.index.drop(0))
dg.index.name = 'Date'
dg = dg.fillna(0)
dg.drop(dg.columns.difference(['Category','revenue','Result']), 1, inplace=True)
dg.columns = dg.columns.astype(str)
print (dg.columns)
dg['revenue','Result'] = pd.to_numeric(dg['revenue','Result'], errors='coerce')
dg['cost'] = dg['revenue']* - dg['Result']
dg = dg.groupby('Category','revenue','Result','cost').agg(sum).reset_index()
print (dg[:5])
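One likely cause, sketched under assumptions: dg['revenue','Result'] passes the tuple ('revenue', 'Result') as a single key, so pandas looks for one column with that name and raises a KeyError. Selecting several columns needs a list, as in dg[['revenue', 'Result']]. The example below uses a small in-memory CSV as a hypothetical stand-in for file.csv, converts each column with pd.to_numeric separately, and assumes the final step was meant to group by Category and sum the numeric columns.

```python
import io
import pandas as pd

# Hypothetical stand-in for file.csv: one row per field, one column per date.
raw = io.StringIO(
    "Date,2020-01,2020-02,2020-03\n"
    "Category,a,b,a\n"
    "revenue,10,20,30\n"
    "Result,2,3,4\n"
)
dg = pd.read_csv(raw, header=None).T

# Promote the first row to column names, then drop it from the data.
dg.columns = dg.iloc[0].astype(str)
dg = dg.drop(index=0).fillna(0)

# Select several columns with a list, not a comma-separated key.
dg = dg[['Category', 'revenue', 'Result']].copy()

# pd.to_numeric works on one Series at a time.
for col in ['revenue', 'Result']:
    dg[col] = pd.to_numeric(dg[col], errors='coerce')

# Same expression as in the question: revenue times negated Result.
dg['cost'] = dg['revenue'] * -dg['Result']

# Assumption: group by Category only and sum the numeric columns.
out = dg.groupby('Category', as_index=False)[['revenue', 'Result', 'cost']].sum()
print(out)
```

Grouping by every column at once, as in the original last step, would leave nothing to aggregate, which is why this sketch groups by Category alone.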

Pandas get row if column is a substring of string

I can do the following if I want to extract rows whose column "A" contains the substring "hello".
df[df['A'].str.contains("hello")]
How can I select rows whose column value is a substring of another word? e.g.
df["hello".contains(df['A'].str)]
Here's an example dataframe
df = pd.DataFrame.from_dict({"A":["hel"]})
df["hello".contains(df['A'].str)]
IIUC, you could apply str.find:
import pandas as pd
df = pd.DataFrame(['hell', 'world', 'hello'], columns=['A'])
res = df[df['A'].apply("hello".find).ne(-1)]
print(res)
Output
       A
0   hell
2  hello
Alternatively, use __contains__:
res = df[df['A'].apply("hello".__contains__)]
print(res)
Output
       A
0   hell
2  hello
Or simply:
res = df[df['A'].apply(lambda x: x in "hello")]
print(res)
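All three variants above raise a TypeError if column A contains missing values, because None or NaN cannot be tested with `in` against a string. A minimal NaN-safe sketch, assuming non-string entries should simply be excluded:

```python
import pandas as pd

df = pd.DataFrame({"A": ["hell", "world", None, "hello"]})

# Missing values are not strings, so guard with isinstance before using `in`.
mask = df["A"].apply(lambda x: isinstance(x, str) and x in "hello")
res = df[mask]
print(res["A"].tolist())  # ['hell', 'hello']
```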

Find column whose name contains a specific string

I have a dataframe with column names, and I want to find the one that contains a certain string, but does not exactly match it. I'm searching for 'spike' in column names like 'spike-2', 'hey spike', 'spiked-in' (the 'spike' part is always continuous).
I want the column name to be returned as a string or a variable, so I access the column later with df['name'] or df[name] as normal. I've tried to find ways to do this, to no avail. Any tips?
Just iterate over DataFrame.columns. In this example you end up with a list of the column names that match:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)
Output:
['spike-2', 'hey spke', 'spiked-in', 'no']
['spike-2', 'spiked-in']
Explanation:
df.columns returns an Index of the column names, which you can iterate over like a list.
[col for col in df.columns if 'spike' in col] iterates over df.columns with the variable col and adds it to the resulting list if col contains 'spike'. This syntax is a list comprehension.
If you only want the resulting data set with the columns that match you can do this:
df2 = df.filter(regex='spike')
print(df2)
Output:
   spike-2  spiked-in
0        1          7
1        2          8
2        3          9
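The question asks for a single column name as a string. If only the first match is needed, one possible sketch uses next() with a default; the variable name first_spike_col is made up for this example.

```python
import pandas as pd

data = {'spike-2': [1, 2, 3], 'hey spke': [4, 5, 6], 'spiked-in': [7, 8, 9], 'no': [10, 11, 12]}
df = pd.DataFrame(data)

# next() returns the first matching name, or None when nothing matches.
first_spike_col = next((col for col in df.columns if 'spike' in col), None)
print(first_spike_col)  # spike-2

# The name is a plain string, so it can be used to index the frame as usual.
if first_spike_col is not None:
    print(df[first_spike_col].tolist())  # [1, 2, 3]
```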
This answer uses the DataFrame.filter method to do this without list comprehension:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6]}
df = pd.DataFrame(data)
print(df.filter(like='spike').columns)
This prints an Index containing only 'spike-2'. You can also use a regex, as some people suggested in the comments above:
print(df.filter(regex='spike|spke').columns)
Will output both columns: ['spike-2', 'hey spke']
You can also use df.columns[df.columns.str.contains(pat = 'spike')]
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
colNames = df.columns[df.columns.str.contains(pat = 'spike')]
print(colNames)
This will output the column names: 'spike-2', 'spiked-in'
More about pandas.Series.str.contains.
# select columns containing 'spike'
df.filter(like='spike', axis=1)
You can also select by exact name (items=) or by regular expression (regex=). Refer to: pandas.DataFrame.filter
df.loc[:,df.columns.str.contains("spike")]
Another solution that returns a subset of the df with the desired columns:
df[df.columns[df.columns.str.contains("spike|spke")]]
You also can use this code:
spike_cols =[x for x in df.columns[df.columns.str.contains('spike')]]
Getting name and subsetting based on Start, Contains, and Ends:
# from: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
# from: https://cmdlinetips.com/2019/04/how-to-select-columns-using-prefix-suffix-of-column-names-in-pandas/
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html
import pandas as pd
data = {'spike_starts': [1,2,3], 'ends_spike_starts': [4,5,6], 'ends_spike': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
print("\n")
print("----------------------------------------")
colNames_contains = df.columns[df.columns.str.contains(pat = 'spike')].tolist()
print("Contains")
print(colNames_contains)
print("\n")
print("----------------------------------------")
colNames_starts = df.columns[df.columns.str.contains(pat = '^spike')].tolist()
print("Starts")
print(colNames_starts)
print("\n")
print("----------------------------------------")
colNames_ends = df.columns[df.columns.str.contains(pat = 'spike$')].tolist()
print("Ends")
print(colNames_ends)
print("\n")
print("----------------------------------------")
df_subset_start = df.filter(regex='^spike',axis=1)
print("Starts")
print(df_subset_start)
print("\n")
print("----------------------------------------")
df_subset_contains = df.filter(regex='spike',axis=1)
print("Contains")
print(df_subset_contains)
print("\n")
print("----------------------------------------")
df_subset_ends = df.filter(regex='spike$',axis=1)
print("Ends")
print(df_subset_ends)
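For plain prefix and suffix checks, the string accessor also offers startswith and endswith, which avoid regex anchors altogether. A short sketch on the same data; passing regex=False to contains treats the pattern as a literal substring, which matters if the search text includes regex metacharacters.

```python
import pandas as pd

data = {'spike_starts': [1, 2, 3], 'ends_spike_starts': [4, 5, 6], 'ends_spike': [7, 8, 9], 'not': [10, 11, 12]}
df = pd.DataFrame(data)

# Prefix and suffix matches without regular expressions.
starts = df.columns[df.columns.str.startswith('spike')].tolist()
ends = df.columns[df.columns.str.endswith('spike')].tolist()

# Literal substring match; the pattern is not interpreted as a regex.
contains = df.columns[df.columns.str.contains('spike', regex=False)].tolist()

print(starts)    # ['spike_starts']
print(ends)      # ['ends_spike']
print(contains)  # ['spike_starts', 'ends_spike_starts', 'ends_spike']
```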

Pandas DataFrame expand existing dataset to finer timestamp

I am trying to make this piece of code faster; it fails when expanding ~120K rows to ~1.7M.
Essentially, I am trying to convert each date-stamped entry into 14 rows, one for each day (DOW) from T-14 up to PayPeriodEndingDate.
Does anyone have a better suggestion than using itertuples for this loop?
Thanks!!
df_Final = pd.DataFrame()
for row in merge4.itertuples():
    listX = []
    listX.append(row)
    df = pd.DataFrame(listX*14)
    df = df.reset_index().drop('Index',axis=1)
    df['Hours'] = df['Hours']/14
    df['AmountPaid'] = df['AmountPaid']/14
    df['PayPeriodEnding'] = np.arange(df.loc[:,'PayPeriodEnding'][0] - np.timedelta64(14,'D'), df.loc[:,'PayPeriodEnding'][0], dtype='datetime64[D]')
    frames = [df_Final,df]
    df_Final = pd.concat(frames,axis=0)
df_Final
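A possible vectorized sketch, under stated assumptions: repeat every row 14 times in one operation with Index.repeat, then compute the per-day dates from an offset array, so no per-row DataFrame or repeated pd.concat is needed. The sample merge4 below is a hypothetical stand-in containing only the three columns the loop touches, and the name expanded is made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for merge4 with the columns used in the question.
merge4 = pd.DataFrame({
    "PayPeriodEnding": pd.to_datetime(["2020-01-15", "2020-01-29"]),
    "Hours": [70.0, 84.0],
    "AmountPaid": [1400.0, 2800.0],
})

n_days = 14

# Repeat each row 14 times in a single step.
expanded = merge4.loc[merge4.index.repeat(n_days)].reset_index(drop=True)

# Spread the totals evenly over the 14 days.
expanded["Hours"] = expanded["Hours"] / n_days
expanded["AmountPaid"] = expanded["AmountPaid"] / n_days

# Day offsets -14, -13, ..., -1 for each original row. This mirrors
# np.arange(end - 14 days, end) in the loop, which excludes the end date.
offsets = np.tile(np.arange(-n_days, 0), len(merge4))
expanded["PayPeriodEnding"] = expanded["PayPeriodEnding"] + pd.to_timedelta(offsets, unit="D")

print(len(expanded))  # 28
```

The main cost in the loop version is calling pd.concat once per row, which copies the growing frame every time; building the result in one pass avoids that.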