Pandas apply to a range of columns

Given the following dataframe, I would like to add a fifth column that contains a list of column headers when a certain condition is met in a row, but only for a range of dynamically selected columns (i.e. a subset of the dataframe).
| North | South | East | West |
|-------|-------|------|------|
| 8 | 1 | 8 | 6 |
| 4 | 4 | 8 | 4 |
| 1 | 1 | 1 | 2 |
| 7 | 3 | 7 | 8 |
For instance, given that the inner two columns ('South', 'East') are selected and that column headers are to be returned when the row contains the value of one (1), the expected output would look like this:
| Headers       |
|---------------|
| [South]       |
|               |
| [South, East] |
|               |
The following one-liner manages to return column headers for the entire dataframe.
df['Headers'] = df.apply(lambda x: df.columns[x==1].tolist(),axis=1)
I tried adding the dynamic column range condition using iloc, but to no avail. What am I missing?
For reference, these are my two failed attempts (N1 and N2 being column range variables here):
df['Headers'] = df.iloc[N1:N2].apply(lambda x: df.columns[x==1].tolist(),axis=1)
df['Headers'] = df.apply(lambda x: df.iloc[N1:N2].columns[x==1].tolist(),axis=1)

This works:
import pandas as pd

df = pd.DataFrame({'North': [8, 4, 1, 7], 'South': [1, 4, 1, 3],
                   'East': [8, 8, 1, 7], 'West': [6, 4, 2, 8]})

# Reshape to long format, keeping the original row index
df1 = df.melt(ignore_index=False)

# Keep only the selected columns ('South', 'East') where the value is 1
condition1 = df1['variable'] == 'South'
condition2 = df1['variable'] == 'East'
condition3 = df1['value'] == 1
df1 = df1.loc[(condition1 | condition2) & condition3]

# Collect the matching column names per row and join them back onto df
# (the joined column is named 'variable'; rename it if a 'Headers' label is wanted)
df1 = df1.groupby(df1.index)['variable'].apply(list)
df = df.join(df1)
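
To answer the original question more directly: the lambda receives a row Series, so the boolean mask has to be built from the row's selected slice rather than from the full df.columns. A minimal sketch of that approach (N1 and N2 are the same positional range variables as in the question):

cols = df.columns[N1:N2]  # dynamically selected range, e.g. df.columns[1:3] -> Index(['South', 'East'])
df['Headers'] = df[cols].apply(lambda x: cols[x == 1].tolist(), axis=1)

This keeps the spirit of the original one-liner while restricting both the values tested and the headers returned to the selected column range.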

Related

PySpark - drop rows with duplicate values with no column order

I have a PySpark DataFrame with two columns like so:
+-------------------+--------------------------+
| right | left |
+-------------------+--------------------------+
| 1 | 2 |
| 2 | 3 |
| 2 | 1 |
| 3 | 2 |
| 1 | 1 |
+-------------------+--------------------------+
I want to drop duplicates, but without regard to the order of the columns.
For example, a row that contains (1,2) and a row that contains (2,1) are duplicates.
The resultant Dataframe would look like this:
+-------------------+--------------------------+
| right | left |
+-------------------+--------------------------+
| 1 | 2 |
| 2 | 3 |
| 1 | 1 |
+-------------------+--------------------------+
The regular drop_duplicates method doesn't work in this case. Does anyone have ideas on how to do this cleanly and efficiently?
from pyspark.sql.functions import array, array_sort, col

(df1.withColumn('x', array_sort(array(col('left'), col('right'))))  # create a sorted array column from left and right
    .dropDuplicates(['x'])  # use the sorted array to drop duplicates
    .drop('x')              # drop the helper column
).show()
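
For reference, a minimal way to reproduce the sample data (the name df1 and the SparkSession setup are my assumptions, not part of the original post); running the snippet above against it keeps one row per unordered pair, i.e. (1, 2), (2, 3) and (1, 1):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, 2), (2, 3), (2, 1), (3, 2), (1, 1)],
                            ['right', 'left'])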

Get pandas row names if value in pandas row is present

I am trying to convert a one-hot encoded dataframe into a 2-D frame.
Is there any way I can iterate over rows and columns and fill the values having a 1 with the column name?
problem dataframe:
+------------------+-----+-----+
| sentence | lor | sor |
+------------------+-----+-----+
| sam lived here | 0 | 1 |
+------------------+-----+-----+
| drack lived here | 1 | 0 |
+------------------+-----+-----+
Solution dataframe:
+------------------+------+
| sentence | tags |
+------------------+------+
| sam lived here | sor |
+------------------+------+
| drack lived here | lor |
+------------------+------+
For each column, you can segregate the rows that have a 1 in it; for those rows, replace the value 1 with the column's name and rename the column to tags:
lor_df = df.loc[df["lor"].eq(1), ["lor"]].rename(columns={"lor": "tags"}).replace(1, "lor")
sor_df = df.loc[df["sor"].eq(1), ["sor"]].rename(columns={"sor": "tags"}).replace(1, "sor")
After this, concatenate the individual results using pandas.concat, followed by dropping the columns which aren't required.
df["tags"] = pd.concat([lor_df, sor_df], sort=False)
df.drop(columns=["lor", "sor"], inplace=True)
To ensure unique values we can use pandas.DataFrame.drop_duplicates
df.drop_duplicates(inplace=True)
print(df)
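
For reference, a self-contained reproduction (the DataFrame construction below is mine, not part of the original post). A shorter alternative, not used in the answer above but equivalent for strictly one-hot rows, is idxmax, which returns the label of the column holding the largest value in each row:

import pandas as pd

df = pd.DataFrame({
    "sentence": ["sam lived here", "drack lived here"],
    "lor": [0, 1],
    "sor": [1, 0],
})

# Alternative: per row, take the label of the column that holds the 1
df["tags"] = df[["lor", "sor"]].idxmax(axis=1)
df = df.drop(columns=["lor", "sor"])
print(df)  # rows: ('sam lived here', 'sor') and ('drack lived here', 'lor')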

How to add headers to data selected from a bigger data frame?

I'm learning pandas and I have a DataFrame (from CSV) that I need to filter. The original DataFrame looks like this:
+----------+-----------+-------------+
| Header1 | Header2 | Header3 |
| Value 1 | A | B |
| Value 1 | A | B |
| Value 2 | C | D |
| Value 1 | A | B |
| Value 3 | B | E |
| Value 3 | B | E |
| Value 2 | C | D |
+----------+-----------+-------------+
Then, I select the new data with this code:
dataframe.header1.value_counts()
output:
Value 1 -- 3
Value 2 -- 2
Value 3 -- 2
dtype: int64
So, I need to add headers to this selection and output something like this
Values Count
Value 1 -- 3
Value 2 -- 2
Value 3 -- 2
pd.Series.value_counts returns a Series, where the Index is all unique values in the Series calling the method. reset_index is what you want to make it a DataFrame, and we can use the rename methods to get the column labels correct.
(df.Header1.value_counts()
   .rename('Count')        # Series name becomes the column label for the counts
   .rename_axis('Values')  # Index name becomes the column label for the unique values
   .reset_index()          # Series -> DataFrame
)
#     Values  Count
# 0  Value 1      3
# 1  Value 2      2
# 2  Value 3      2
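
For reference, a quick reproduction of the example (the DataFrame construction below is mine, not from the original post):

import pandas as pd

df = pd.DataFrame({
    'Header1': ['Value 1', 'Value 1', 'Value 2', 'Value 1', 'Value 3', 'Value 3', 'Value 2'],
    'Header2': ['A', 'A', 'C', 'A', 'B', 'B', 'C'],
    'Header3': ['B', 'B', 'D', 'B', 'E', 'E', 'D'],
})

counts = (df.Header1.value_counts()
            .rename('Count')
            .rename_axis('Values')
            .reset_index())
print(counts)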

Combine column x to n in OpenRefine

I have a table with an unknown number of columns, and I need to combine all columns after a certain point. Consider the following:
| A | B | C | D | E |
|----|----|---|---|---|
| 24 | 25 | 7 | | |
| 12 | 3 | 4 | | |
| 5 | 5 | 5 | 5 | |
Columns A-C are known, and the information in them is correct. But columns D to N (an unknown number of columns starting with D) need to be combined, as they are all parts of the same string. How can I combine an unknown number of columns in OpenRefine?
As some columns may have empty cells (the string may be of various lengths), I also need to disregard empty cells.
There is a two-step approach to this that should work for you.
From the first column you want to merge (column D in this case), choose Transpose->Transpose cells across columns into rows.
You will be asked to set some options. You'll want to choose 'From Column' D and 'To Column' N. Then choose to transpose into One Column, assign a name to that column, and make sure the option to 'Ignore Blank Cells' is checked (it should be checked by default). Then click Transpose.
You'll get the values that were previously in columns D-N appearing in rows, e.g.:
| A | B | C | D | E | F |
|----|----|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 |
Transposes to:
| A | B | C | new |
|----|----|---|-----|
| 1 | 2 | 3 | 4 |
| | | | 5 |
| | | | 6 |
You can then use the dropdown menu from the head of the 'new' column to choose
Edit cells->Join multi-value cells
You'll be asked what character you want to use to separate the values in the joined cell. In your use case you can probably delete the joining character and combine the cells without any separator. This will give you:
| A | B | C | new |
|----|----|---|-----|
| 1 | 2 | 3 | 456 |

Calculate the difference between two non-adjacent columns, based on a "match column" using Excel VBA

I'm looking for the most efficient way to compare two sets of two columns, thus:
Set 1:
| A     | B  | C |
|-------|----|---|
| 11_22 | 10 |   |
| 33_44 | 20 |   |
| 55_66 | 30 |   |
| 77_88 | 40 |   |
| 99_00 | 50 |   |
Set 2:
| J     | K  |
|-------|----|
| 33_44 | 19 |
| 99_00 | 47 |
| 77_88 | 40 |
For each match between columns A and J (in this case 33_44, 99_00, and 77_88), column C should display the difference between the adjacent cells in B and K, respectively, with the full amount from column B if no match exists in J:
| A     | B  | C  |
|-------|----|----|
| 11_22 | 10 | 10 |
| 33_44 | 20 | 1  |
| 55_66 | 30 | 30 |
| 77_88 | 40 | 0  |
| 99_00 | 50 | 3  |
I'm thinking of creating two multi-dimensional arrays containing the values in the ranges (A, B) and (J, K) and comparing them with a nested loop, but I am not sure how to get the result back into column C when a match occurs. Creating a third "result array" and outputting that on a fresh sheet would work too.
It is possible to do a lot with ADO, for example: Excel VBA to match and line up rows