Combine column x to n in OpenRefine - data-manipulation

I have a table with an unknown number of columns, and I need to combine all columns after a certain point. Consider the following:
| A | B | C | D | E |
|----|----|---|---|---|
| 24 | 25 | 7 | | |
| 12 | 3 | 4 | | |
| 5 | 5 | 5 | 5 | |
Columns A-C are known, and the information in them correct. But column D to N (an unknown number of columns starting with D) needs to be combined as they are all parts of the same string. How can I combine an unknown number of columns in OpenRefine?
As some columns may have empty cells (the string may be of various lengths) I also need to disregard empty cells.

There is a two step approach to this that should work for you.
From the first column you want to merge (Col D in this case) choose Transpose->Transpose cells across columns into rows
You will be asked to set some options. You'll want to choose 'From Column' D and 'To Column' N. Then choose to transpose into One Column, assign a name to that column, make sure the option to 'Ignore Blank Cells' is checked (should be checked by default. Then click Transpose.
You'll get the values that were previously in cols D-N appearing in rows. e.g.
| A | B | C | D | E | F |
|----|----|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 |
Transposes to:
| A | B | C | new |
|----|----|---|-----|
| 1 | 2 | 3 | 4 |
| | | | 5 |
| | | | 6 |
You can then use the dropdown menu from the head of the 'new' column to choose
Edit cells->Join multi-value cells
You'll be asked what character you want to use to separate the characters in the joined cell. Probably in your use case you can delete the joining character and combine the cells without any joining characters. This will give you:
| A | B | C | new |
|----|----|---|-----|
| 1 | 2 | 3 | 456 |

Related

PySpark - drop rows with duplicate values with no column order

I have a PySpark DataFrame with two columns like so:
+-------------------+--------------------------+
| right | left |
+-------------------+--------------------------+
| 1 | 2 |
| 2 | 3 |
| 2 | 1 |
| 3 | 2 |
| 1 | 1 |
+-------------------+--------------------------+
I want to drop duplicates but with no respect to the order of the columns.
For example, a row that contains (1,2) and a row that contains (2,1) are duplicates.
The resultant Dataframe would look like this:
+-------------------+--------------------------+
| right | left |
+-------------------+--------------------------+
| 1 | 2 |
| 2 | 3 |
| 1 | 1 |
+-------------------+--------------------------+
The regular drop_duplicates method doesn't work in this case, anyone has any ideas how to do this cleanly and efficiently?
(df1.withColumn('x',array_sort(array(col('left'), col('right'))))#create sorted array column of columns left and right
.dropDuplicates(['x'])#Use column created to drop duplicates
.drop('x')#drop unwanted column
).show()

Pandas apply to a range of columns

Given the following dataframe, I would like to add a fifth column that contains a list of column headers when a certain condition is met on a row, but only for a range of dynamically selected columns (ie subset of the dataframe)
| North | South | East | West |
|-------|-------|------|------|
| 8 | 1 | 8 | 6 |
| 4 | 4 | 8 | 4 |
| 1 | 1 | 1 | 2 |
| 7 | 3 | 7 | 8 |
For instance, given that the inner two columns ('South', 'East') are selected and that column headers are to be returned when the row contains the value of one (1), the expected output would look like this:
Headers
|---------------|
| [South] |
| |
| [South, East] |
| |
The following one liner manages to return column headers for the entire dataframe.
df['Headers'] = df.apply(lambda x: df.columns[x==1].tolist(),axis=1)
I tried adding the dynamic column range condition by using iloc but to no avail. What am I missing?
For reference, these are my two failed attempts (N1 and N2 being column range variables here)
df['Headers'] = df.iloc[N1:N2].apply(lambda x: df.columns[x==1].tolist(),axis=1)
df['Headers'] = df.apply(lambda x: df.iloc[N1:N2].columns[x==1].tolist(),axis=1)
This works:
df=pd.DataFrame({'North':[8,4,1,7],'South':[1,4,1,3],'East':[8,8,1,7],\
'West':[6,4,2,8]})
df1=df.melt(ignore_index=False)
condition1=df1['variable']=='South'
condition2=df1['variable']=='East'
condition3=df1['value']==1
df1=df1.loc[(condition1|condition2)&condition3]
df1=df1.groupby(df1.index)['variable'].apply(list)
df=df.join(df1)

Excel VBA to transpose set of rows if value exists in another column

I'm trying to find a way via VB script that will transpose rows from column A into a new sheet but only if there is a value in column B for rows that contain numbers. I have a sheet with ~75K rows on it that I need to do this for, and I tried creating pivot tables which allowed me to get the data into its current format but I need the data to be in columns.
The tricky part of this is that in column A, I only need to look at the rows that are all numbers and not the other rows that have text.
I created a sample sheet to view, where the sample data is in the SOURCE tab and what I want the data to look like in the TRANSPOSED tab.
https://docs.google.com/spreadsheets/d/1ujbaouZFqiPw0DbO78PCnz25OY2ugF1HtUqMg_J7KeI/edit?usp=sharing
Any help would be appreciated.
UPDATE and Answer:
I modified my approach and went back to the original source data which was not part of a pivot table and was able to use a simple match formula between the 2 data sources. So, my original data looked like this:
+----------------+---------+--------+--------------+
| Gtin | Brand | Name | TaxonomyText |
+----------------+---------+--------+--------------+
| 00030085075605 | brand 1 | name 1 | cat1 |
| 00041100015112 | brand 2 | name 2 | cat2 |
| 00041100015099 | brand 3 | name 3 | cat3 |
| 00030085075608 | brand 4 | name 4 | cat4 |
+----------------+---------+--------+--------------+
I had another sheet containing the data I needed to match to in this format:
+----------------+---------+
| Gtin | Brand |
+----------------+---------+
| 00030085075605 | brand 1 |
| 00041100015112 | brand 2 |
| 00041100015098 | brand 3 |
| 00030085075608 | brand 4 |
+----------------+---------+
I created a new column in my source sheet and used a if error match formula:
=IFERROR(IF(MATCH(A14,data_to_match!$A:$A,0),"yes",),"no")
Then copied this formula down for every row, about 75K rows which very quickly added a yes or a no.
+----------------+---------+---------+--------+--------------+
| Gtin | matched | Brand | Name | TaxonomyText |
+----------------+---------+---------+--------+--------------+
| 00030085075605 | yes | brand 1 | name 1 | cat1 |
| 00041100015112 | yes | brand 2 | name 2 | cat2 |
| 00041100015098 | no | brand 3 | name 3 | cat3 |
| 00030085075608 | yes | brand 4 | name 4 | cat4 |
+----------------+---------+---------+--------+--------------+
The final step was to just filter for Yes values and I had all the data that I needed.
My mistake was going to a pivot table first which put the data in a very funky format causing me to have to do a transpose, which wasn't really necessary. Hopefully this can help others....

Grand Total value doesn't match with Top N Filtered values in SSRS

I have a report in reporting services. In this report, I am displaying the Top N values. But my Grand Total is displaying the sum of all the values.
Right now I am getting something like this.Here N = 2
+-------+------+-------------+
| Area |ID | Count |
+-------+------+-------------+
| - A | | 4 |
| | a1 | 1 |
| | b1 | 1 |
| | c1 | 1 |
| | d1 | 1 |
| | | |
| - B | | 3 |
| | a2 | 1 |
| | b2 | 1 |
| | c2 | 1 |
| | | |
|Grand | | 10 |
|Total | | |
+-------+------+-------------+
The correct Grand Total should be 7 instead of 10. A and B are toggle items(You can expand and contract)
How can I display the correct Grand Total using Top N filter?
I also want to use the filter in the report and not in the SQL query.
You should use the filter on the Dataset. Filtering the report object itself only turns off the items (rows, for example) visibility. The item / row itself will still be part of the group and will be used for calculations.
I found a way to solve my question. As Ido said I worked on the dataset. I am using Analysis Cube. So in this cube I created a Named Set Calculation.
In this set I used the TopCount() function. It filters out the TOP N values where N can be integer according to your choice.
So the final Named Set in this case is :-
TopCount([Dim Area].[Area].[Area], 2, ([Measures].[Count]))
This will give you Grand total of Top N filtered values.

Calculate the difference between two non-adjacent columns, based on a "match column" using Excel VBA

I'm looking for the most efficient way to compare two sets of two columns, thus:
Set 1:
A | B | C |
11_22 | 10 | |
33_44 | 20 | |
55_66 | 30 | |
77_88 | 40 | |
99_00 | 50 | |
Set 2:
J | K |
33_44 | 19 |
99_00 | 47 |
77_88 | 40 |
For each match between column A and J, column C should display the difference between the adjacent cells
(in this case 33_44, 99_00, and 77_88) in B and K, respectively, with the full
amount in column B if no match exists in J
A | B | C
11_22 | 10 | 10
33_44 | 20 | 1
55_66 | 30 | 30
77_88 | 40 | 0
99_00 | 50 | 3
I'm thinking of creating two multi-dimensional arrays containing values
in the ranges (A, B) and (J, K), with a nested loop, but am not sure how to
get the result back into column C when a match occurs. Creating a third "result array" and outputting that on a fresh sheet would work too.
It is possible to do a lot with ADO, for example: Excel VBA to match and line up rows