Iterate over df columns to get rows with text - pandas

I have a dataframe whose columns are mostly blank, but a few of the rows contain strings, and those are the only rows I want to see. I have tried the code below, but I don't know how to select only the strings in the columns and append them to get a new dataframe containing just the rows with strings.
columns = list(df)
for i in columns:
    df1 = df[df[i] == ]
Can someone please help?

df[df['column_name'].notna()]
should do the trick
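Since the question asks for rows that do contain text, here is a minimal sketch using a hypothetical frame. It treats empty strings as missing values and keeps every row where at least one column holds text:

```python
import pandas as pd

# Hypothetical frame: columns are mostly blank, a few rows hold strings.
df = pd.DataFrame({
    "a": ["", "hello", "", "world"],
    "b": ["", "", "x", ""],
})

# Treat empty strings as missing, then drop rows that are blank everywhere.
rows_with_text = df.replace("", pd.NA).dropna(how="all")
```

Note that `notna()` only works as-is if the blanks really are NaN; if they are empty strings (as is common after `read_csv` or Excel imports), the `replace("", pd.NA)` step above is needed first.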

Related

How to filter on panda dataframe for missing values?

I have a data frame with multiple columns, and some of them have missing values.
I would like to filter so that I can return a dataframe that has missing values in one or two specific columns.
Can anyone help me figure out how to do that?
Given a dataframe df with a column "A":
df_missing = df[df['A'].isnull()]
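A small sketch with hypothetical data, covering both the single-column case above and the "one or two specific columns" case from the question (using `any(axis=1)` to flag rows missing in either column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                   "B": [np.nan, np.nan, 6.0]})

# Rows where the single column A is missing
missing_a = df[df["A"].isnull()]

# Rows where either A or B is missing
missing_any = df[df[["A", "B"]].isnull().any(axis=1)]
```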

Filteration on dataframe column value with combination of values

I have a dataframe with 2 columns named TABLEID and STATID.
There are different values in both columns.
When I filter the dataframe on the values '101PC' and 'ST101', it gives me 14K records, and when I filter on '102HT' and 'ST102', it also gives me 14K records. The issue is that when I try to combine both filters as below, I get a blank dataframe. I was expecting 28K records in the resultant dataframe. Any help is much appreciated.
df[df[['TABLEID','STATID']].apply(tuple, axis = 1).isin([('101PC', 'ST101'), ('102HT','ST102')])]
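The blank result comes from ANDing the two filters: no single row can have TABLEID equal to both '101PC' and '102HT' at once. The pair-wise `isin` above is effectively an OR over the allowed combinations. A runnable sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "TABLEID": ["101PC", "102HT", "101PC", "103XX"],
    "STATID":  ["ST101", "ST102", "ST102", "ST103"],
})

# Keep rows whose (TABLEID, STATID) pair matches either combination.
pairs = [("101PC", "ST101"), ("102HT", "ST102")]
mask = df[["TABLEID", "STATID"]].apply(tuple, axis=1).isin(pairs)
result = df[mask]
```

Note that row 2 above is excluded even though '101PC' and 'ST102' each appear in the allowed values, because that particular pairing is not in the list.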

Merging Pandas Dataframes with unequal rows

I have two dataframes, dframe and dframep. dframe has 301497 rows and dframep has 6080 rows. I want to merge the two so that when dframep is added to dframe, the new dataframe has NaNs where dframep does not have any values for that date. I have tried this:
dfall = dframe.merge(dframep, on=['ID','date','instrument','close'], how='outer')
The two merge together, but the result is 307577 rows, e.g. for the dates that are not in dframep there are no NaNs.
I'm pulling my hair out, so any help would be appreciated. I'm guessing it has something to do with indexing and selecting the columns correctly.
Thanks
I can't replicate your problem (nor fully understand it from your description), but try something like this:
dfall = pd.merge(dframe, dframep, how='left', on=['ID', 'date', 'instrument', 'close'])
This will keep the rows of dframe and bring in the matching info from dframep.
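A miniature sketch of the left-merge behaviour, with hypothetical data and key columns. One caveat: merging on a value column such as 'close' means rows only match when that value is identical in both frames, which may be why the original outer merge produced extra rows instead of NaNs; merging on the identifying keys only may be closer to the intent:

```python
import pandas as pd

# Hypothetical miniature versions of the two frames.
dframe = pd.DataFrame({"ID": [1, 2, 3],
                       "date": ["2020-01-01", "2020-01-02", "2020-01-03"]})
dframep = pd.DataFrame({"ID": [1, 2],
                        "date": ["2020-01-01", "2020-01-02"],
                        "extra": [10.0, 20.0]})

# Left merge keeps every row of dframe; unmatched rows get NaN in `extra`.
dfall = dframe.merge(dframep, on=["ID", "date"], how="left")
```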

Convert Series to Dataframe where series index is Dataframe column names

I am selecting row by row as follows:
for i in range(num_rows):
    row = df.iloc[i]
As a result I get a Series object, where row.index.values contains the names of the df columns.
But instead I wanted a dataframe with only one row, keeping the dataframe's columns in place.
When I do row.to_frame(), instead of a 1x85 dataframe (1 row, 85 cols) I get an 85x1 dataframe whose index contains the column names, and whose columns
output
Int64Index([0], dtype='int64').
But all I want is the original dataframe's columns with only one row. How do I do it?
Or, how do I convert the row.index values to column values and change the 85x1 shape to 1x85?
You just need to add .T:
row.to_frame().T
Alternatively, change your for loop by adding [], so iloc returns a one-row DataFrame directly:
for i in range(num_rows):
    row = df.iloc[[i]]
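Both options side by side, as a small sketch with a hypothetical two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# iloc with a scalar returns a Series; to_frame().T turns it back
# into a one-row DataFrame with the original columns.
row = df.iloc[0].to_frame().T

# iloc with a list returns a one-row DataFrame directly, no transpose needed.
row2 = df.iloc[[0]]
```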

How to select different row in df as the column or delete the first few rows including the column?

I'm using read_csv to make a df, but the csv includes some garbage rows before the actual column names; the actual column names are located in, say, the 5th row of the csv.
Here's the thing: I don't know how many garbage rows there are in advance, and I can only call read_csv once, so I can't use "header" or "skiprows" in read_csv.
So my question is how to select a different row as the columns of the df, or just delete the first n rows including the current columns? If I use "df.iloc[3:]", the old columns are still there.
Thanks for your help.
If you know your column names start in row 5 as in your example, you can do:
df.columns = df.iloc[4]
df = df.iloc[5:]
EDIT: Updated so that it also resets the index and does not include an index name:
df.columns = df.iloc[4].values
df = df.iloc[5:].reset_index(drop=True)
If the number of garbage rows is known, for example the first 3 rows (index 0, 1, 2), you can use iloc to get all the remaining actual data rows:
df = df.iloc[3:]
If the number of garbage rows is not known, you must first search for the index of the first actual data row, then use it to slice off the garbage:
df = df.iloc[n:]  # n = index of the first actual data row
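Putting the pieces together, a small sketch with a hypothetical frame where the real header sits in row index 4, as in the question:

```python
import pandas as pd

# Hypothetical frame: 4 garbage rows, then the header row, then the data.
df = pd.DataFrame([["x", "y"]] * 4 + [["colA", "colB"], ["1", "2"], ["3", "4"]])

df.columns = df.iloc[4].values           # promote row 4 to column names
df = df.iloc[5:].reset_index(drop=True)  # drop everything above the data
```

Note that values read this way stay as strings (object dtype); a follow-up `df.astype(...)` or `pd.to_numeric` may be needed for numeric columns.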