Dropping DataFrame rows that do not contain specific values in a column results in an empty df - pandas

I want to drop the rows in housing_plot_end that do not contain one of the 5 values specified in the plot_test_data['housing price median'] dataframe object.
'Dł. geograficzna' is the name of the column and it translates to 'longitude' in English, but I left it as it was because maybe the space between these two words is causing a problem?
But I am receiving an empty df with:
values_to_save = [plot_test_data['Dł. geograficzna']]
housing_plot_end = housing_plot_end[~housing_plot_end['Dł. geograficzna'].isin(values_to_save) == False]
The column in plot_test_data consists of 5 numerical values, thus 5 rows:
-121.46
-117.23
-119.04
-117.13
-118.7
Meanwhile, housing_plot_end has tens of thousands of rows, and I need to drop every row that does not contain one of these specific values in the column housing_plot_end['Dł. geograficzna'].
I don't know what to do.
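One likely cause, offered as a sketch rather than a confirmed diagnosis: wrapping the Series in a list makes isin compare each value against the Series object itself instead of its 5 numbers, so nothing matches. The `~` and `== False` also cancel each other out, which hides the intent. Passing the values directly and dropping both negations should keep only the matching rows. The sample frames and the `price` column below are made up for illustration; only the column name and the 5 values come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the large frame (the real one has tens of thousands of rows).
housing_plot_end = pd.DataFrame({
    "Dł. geograficzna": [-121.46, -122.23, -117.23, -118.70, -120.00],
    "price": [1, 2, 3, 4, 5],  # assumed extra column, not from the question
})

# The 5 values listed in the question.
plot_test_data = pd.DataFrame({
    "Dł. geograficzna": [-121.46, -117.23, -119.04, -117.13, -118.7],
})

# Pass a flat list of values (or the Series itself), not a list containing the Series.
values_to_save = plot_test_data["Dł. geograficzna"].tolist()

# Keep rows whose value IS in the list; no ~ and no == False needed.
filtered = housing_plot_end[housing_plot_end["Dł. geograficzna"].isin(values_to_save)]
print(filtered)
```

Because isin tests floats for exact equality, values that differ only by rounding will not match; rounding both columns to the same precision first may help in that case.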

Related

Compile a count of similar rows in a Pandas Dataframe based on multiple column values

I have two Dataframes, one containing my data read in from a CSV file and another that has the data grouped by all of the columns but the last and reindexed to contain a column for the count of the size of the groups.
df_k1 = pd.read_csv(filename, sep=';')
columns_for_groups = list(df_k1.columns)[:-1]
k1_grouped = df_k1.groupby(columns_for_groups).size().reset_index(name="Count")
I need to create a series such that every row(i) in the series corresponds to row(i) in my original Dataframe but the contents of the series need to be the size of the group that the row belongs to in the grouped Dataframe. I currently have this, and it works for my purposes, but I was wondering if anyone knew of a faster or more elegant solution.
size_by_row = []
for row in df_k1.itertuples():
    for group in k1_grouped.itertuples():
        if row[1:-1] == group[1:-1]:
            size_by_row.append(group[-1])
            break
group_size = pd.Series(size_by_row)
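One possible faster alternative to the nested loops, assuming the goal is just the group size per original row: groupby followed by transform("size") broadcasts each group's size back onto the rows of the original frame, so the separate grouped frame is not needed for this step. The sample data below is invented for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for the CSV contents.
df_k1 = pd.DataFrame({
    "a": [1, 1, 2, 1],
    "b": ["x", "x", "y", "z"],
    "value": [10, 20, 30, 40],
})

# Group by every column except the last, as in the question.
columns_for_groups = list(df_k1.columns)[:-1]
last_column = df_k1.columns[-1]

# transform("size") returns one entry per original row, aligned with df_k1's index.
group_size = (
    df_k1.groupby(columns_for_groups)[last_column]
    .transform("size")
    .rename("Count")
)
print(group_size.tolist())
```

Note that rows with NaN in a grouping column are excluded from groups by default, so they would get NaN here rather than a count.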

Pyspark dynamic column selection from dataframe

I have a dataframe with multiple columns such as t_orno, t_pono, t_sqnb, t_pric, and so on (it's a table with many columns).
The 2nd dataframe contains the names of certain columns from the 1st dataframe, e.g.
columnname
t_pono
t_pric
:
:
I need to select only those columns from the 1st dataframe whose names are present in the 2nd. In the example above, that is t_pono and t_pric.
How can this be done?
Let's say you have the following columns (which can be obtained using df.columns, which returns a list):
df1_cols = ["t_orno", "t_pono", "t_sqnb", "t_pric"]
df2_cols = ["columnname", "t_pono", "t_pric"]
To get only those columns from the first dataframe that are present in the second one, you can do set intersection (and I cast it to a list, so it can be used to select data):
list(set(df1_cols).intersection(df2_cols))
And we get the result:
["t_pono", "t_pric"]
To put it all together and select only those columns:
select_columns = list(set(df1_cols).intersection(df2_cols))
new_df = df1.select(*select_columns)
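One caveat with the set-based approach is that a set does not guarantee order, so the selected columns may come back in a different order than in the first dataframe. A small order-preserving variant, shown here on plain Python lists only, could look like the sketch below. In the question's actual setup the names live in the rows of the second dataframe, so `df2_cols` would first have to be collected from its `columnname` column; that step is assumed and not shown.

```python
# Column names of the first dataframe (what df1.columns would return).
df1_cols = ["t_orno", "t_pono", "t_sqnb", "t_pric"]

# Names taken from the second dataframe, as in the answer above.
df2_cols = ["columnname", "t_pono", "t_pric"]

# Use a set only for fast membership tests; iterate over df1_cols to keep its order.
wanted = set(df2_cols)
select_columns = [c for c in df1_cols if c in wanted]
print(select_columns)
```

The resulting list can be passed to df1.select(*select_columns) exactly as before.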

Joining all elements in an array in a dataframe column with another dataframe

Let's say pcPartsInfoDf has the columns
pcPartCode:integer
pcPartName:string
And df has the array column
pcPartCodeList:array
|-- element:integer
The pcPartCodeList in df has a list of codes for each row that match with pcPartCode values in pcPartsInfoDf, but only pcPartsInfoDf has the names of the parts.
I'm trying to join the two dataframes so that we get a new column that is an array of strings for all the pc part names for a row, corresponding to the array of ints, pcPartCodeList. I tried doing this with the code below, but this only adds at most 1 part since pcPartName is typed as a string and only holds 1 value.
df
  .join(pcPartsInfoDf, expr("array_contains(pcPartCodeList, pcPartCode)"))
  .select(df("*"), pcPartsInfoDf("pcPartName"))
How could I collect all the pcPartName values corresponding to a pcPartCodeList for a row, and put them in an array of strings in that row?

pandas df: replace values with np.NaN if character count do not match across multiple columns

I'm currently stuck on something I hope to find an answer for in this forum:
I have a df with multiple columns containing URLs. My index column contains URLs as well.
AIM: I'd like to replace df values across all columns with np.NaN if the number of "/" (count()) in the index is not equal to the number of "/" (count()) in the value of each of the other columns.
E.x.
First, you need one column to compare against.
counts = df['id_url'].str.count('/')
Then evaluate every column at once. A DataFrame has no .str accessor, so apply the count column by column and compare row-wise:
mask = df.apply(lambda col: col.str.count('/')).eq(counts, axis=0)
Next, we want the rows where all the counts are equal.
mask = mask.all(axis=1)
Now that we have a mask of the rows where every value matches, we can use the not operator (~) to select the rows where at least one column does not match.
df.loc[~mask, :] = np.nan  # replaces every value in the row with np.nan
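The question asks for individual values to be replaced, and for the comparison to be made against the index rather than a column, whereas the approach above blanks whole rows. A minimal per-cell sketch, assuming all columns hold strings and using made-up URLs and column names, might use DataFrame.where:

```python
import pandas as pd

# Hypothetical data: URL-like strings in the index and in every column.
df = pd.DataFrame(
    {"url_a": ["a/b/c", "x/y"], "url_b": ["a/b", "x/y"]},
    index=["i/j/k", "p/q"],
)

# Number of "/" in each index label, as a Series aligned with the rows.
index_counts = pd.Series(df.index.str.count("/").to_numpy(), index=df.index)

# Number of "/" in every cell, computed column by column.
cell_counts = df.apply(lambda col: col.str.count("/"))

# Keep a cell only where its count equals the count of its row's index label;
# every other cell becomes NaN.
result = df.where(cell_counts.eq(index_counts, axis=0))
print(result)
```

Here only the cell "a/b" (one slash against two in its index label) is replaced; the rest of the row is left alone.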

How to select different row in df as the column or delete the first few rows including the column?

I'm using read_csv to make a df, but the csv includes some garbage rows before the actual columns; the actual columns are located in, say, the 5th row of the csv.
Here's the thing: I don't know in advance how many garbage rows there are and I can only call read_csv once, so I can't use "head" or "skiprows" in read_csv.
So my question is: how do I select a different row as the columns of the df, or just delete the first n rows including the columns? If I use "df.iloc[3:0]" the columns are still there.
Thanks for your help.
EDIT: Updated so that it also resets the index and does not include an index name:
df.columns = df.iloc[4].values
df = df.iloc[5:].reset_index(drop=True)
If you know your column names start in row 5 as in your example, you can do:
df.columns = df.iloc[4]
df = df.iloc[5:]
If the number of garbage rows is known, you can use iloc. For example, if the first 3 rows (index 0, 1, 2) are garbage, the following code keeps all the remaining rows of actual data:
df = df.iloc[3:]
If the number of garbage rows is not known, you must first search for the index of the first row of actual data. Once you have found it, you can use it to keep all the remaining data rows:
df = df.iloc[n:]
where n is the index of the first row of actual data.
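When the number of garbage rows is unknown, one way to find n is to look for a value that is known to appear in the real header row. The sketch below assumes the frame was loaded without a header (so every CSV line is a data row) and that the first real column is called "id"; both the sample data and that name are hypothetical.

```python
import pandas as pd

# Hypothetical frame as it might look after reading the CSV with header=None.
raw = pd.DataFrame([
    ["report", None, None],
    ["generated 2020", None, None],
    ["id", "name", "price"],   # the real header row
    ["1", "a", "10"],
    ["2", "b", "20"],
])

# Locate the first row whose first cell equals the known header name (assumed "id").
is_header = raw.iloc[:, 0].eq("id")
if not is_header.any():
    raise ValueError("header row not found")
n = int(is_header.to_numpy().argmax())  # position of the first True

# Promote that row to column names and keep only the rows after it.
clean = raw.iloc[n + 1:].reset_index(drop=True)
clean.columns = raw.iloc[n].to_numpy()
print(clean)
```

All values stay strings after this step, so numeric columns may still need converting, for example with pd.to_numeric.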