OpenRefine: Remove row if specific cell in this row is empty

The input for OpenRefine is a CSV file containing data like this:
phy,205.4,,,Unterwasserakustik. Sonar-Technik,,
phy,205.6,,,Lärm. Lärmbekämpfung. Schallschutz. Filter (vgl.a.san 525),,
phy,205.9,,,Sonstiges,,
,,,,,,
,,Wärme. Statistische Physik (Temperaturstrahlung s. phy 495),,,,
,220,,Gesamtgebiet,,,
I would like to remove all rows where the second column (the numeric code) is empty.
In OpenRefine I created a Facet → Customized facets → Facet by blank on the second column. In the facet panel that appears on the left, I clicked true (197 false, 2 true, which is correct). Then I went to All → Edit rows → Remove all matching rows. Instead of removing only the two rows, OpenRefine removed 143 rows and no data is shown anymore.
What has happened? And how can I remove only the two rows with an empty second column?
It might be connected to the row counter in the All column: from the first row where the "phy" entry in the first column is missing, there is no row count anymore.
1. phy 205.4 ...
2. phy 205.6 ...
3. phy 205.9 ...
Wärme...
220 ...
The 220 row has no value in the "phy" column, so it gets no row count and is incorrectly ignored.

It looks like you may be operating in "record mode" as opposed to "row mode." In record mode, rows with an empty first column are folded into the preceding record (which is why your row counter stops once "phy" is missing), so removing the matching records deletes every row belonging to those records, not just the two blank-coded rows. If the facet says 2 true and you selected true, you should only see two rows displayed on the screen when you go to do your delete. If you see more than that, try selecting row mode before removing the matching rows.
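For comparison outside OpenRefine, the same cleanup is a few lines of pandas. This is only a sketch under assumptions: the CSV is saved as systematik.csv (a made-up name) and has no header row.

import pandas as pd

# Read the CSV without a header; columns get the integer labels 0, 1, 2, ...
df = pd.read_csv("systematik.csv", header=None, dtype=str)

# Keep only rows whose second column (label 1) has a non-blank value
df = df[df[1].notna() & (df[1].str.strip() != "")]

df.to_csv("systematik-clean.csv", header=False, index=False)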

Related

Removing more than 2 duplicates from a CSV file

I have found the following script to remove duplicates:
awk -F, '!x[$7]++' 'business-records.csv' > 'business-records-deduped.csv'
It keeps only the first record for each distinct value of the 7th column (!x[$7]++ is true only while the count is still zero) and deletes all further duplicates. Instead of deleting every duplicate and keeping only the first record, it would be amazing if it could keep the first 2 or 3 records and remove the rest: basically allowing the original and one duplicate, but deleting the entire row of anything beyond that.
How can I adjust it so that it keeps the original record and the first duplicate, and deletes the entire rows of any duplicates after the first?
You can use awk like this:
awk -F, '++x[$7] <= 2' business-records.csv > business-records-deduped.csv
This keeps the first two records for each 7th-column value and deletes any further dupes, as you desire.
I propose the following minimal amendment of your code:
awk -F, '2>x[$7]++' business-records.csv > business-records-deduped.csv
Explanation: ++ here is the post-increment operator, so the order of execution might be somewhat counter-intuitive:
x[$7] gets the value from array x for the key given by the content of the 7th field, assumed 0 if not yet present.
2> is the test that decides about printing: if this condition holds, the line is printed.
++ then increases the value inside array x, so the next time you encounter the same 7th-field content the value will be bigger by 1.
Observe that the sole thing altered is the test: for non-negative integers, ! is true for zero and false for values above 0, whereas 2> is true for 0 and 1.
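If awk is unfamiliar, the same keep-the-first-two-occurrences logic can be sketched in Python (assuming a simple comma-separated file whose fields contain no quoted commas):

from collections import defaultdict

seen = defaultdict(int)  # occurrence count per 7th-column value

with open("business-records.csv") as src, \
     open("business-records-deduped.csv", "w") as dst:
    for line in src:
        key = line.rstrip("\n").split(",")[6]  # 7th field, awk's $7
        seen[key] += 1
        if seen[key] <= 2:  # keep only the first two occurrences
            dst.write(line)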

Clean output of a pandas data extraction, deleting the unnamed index column

I have a dataset from which I have extracted a row based on a condition on the column 'Description'.
I extracted the row with the condition below:
ATL_ID=airport_codes[airport_codes['Description'].str.contains('Hartsfield-Jackson Atlanta ')]
It successfully finds the row. Now I need to extract the value under 'Code'. I use this code:
ATL_ID.loc[:,'Code']
and output is:
373 10397
Name: Code, dtype: int64
I don't want anything else in the output except 10397. 373 is the row index and the rest is additional description which I don't want. How can I get just the one number for 'Code'?
Thanks
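One common way to reduce that one-row Series to a bare scalar is positional indexing. A minimal sketch, assuming ATL_ID is the one-row frame from above:

# .iloc[0] takes the first (and only) matching row positionally,
# returning the bare value instead of a labeled Series
code = ATL_ID.loc[:, 'Code'].iloc[0]
print(code)  # 10397

ATL_ID['Code'].item() works as well, and has the nice property of raising an error if the filter ever matches more than one row.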

Delete duplicate number in text cell (Teradata database) [duplicate]

I have columns in which coordinates are presented in text format, each set of coordinates in one cell: all coordinates of a set sit in a single table cell, as text. I have more than 1000 cells, and each contains more than 100 coordinates.
For example:
23.453411011874813 41.74245395132344, 23.453972640029299 41.74214208390741, 23.453977029220994 41.741827739090233, 23.454523642352295 41.741515869012523, 23.441100249526403 41.741203996333724, 23.441661846243466 41.740892121053918,
23.456223434003668 41.74058024317317, 23.441661846243466 41.740892121053918
In the case of repeating coordinates, I need to delete the last of them (the second occurrence of 23.441661846243466 41.740892121053918 at the end of the example) and delete the coordinate located between the two occurrences (23.456223434003668 41.74058024317317).
Please tell me how this can be done?
Thanks a lot!
OLAP functions will be your friend:
- ROW_NUMBER() will identify the 2nd, 3rd, ... occurrences
- with COUNT() as an OLAP function you can identify the doubled ones
- with CASE and some MAX ... ROWS PRECEDING you can tag the rows between the 1st and 2nd occurrence
Two crucial questions you have to answer for a concrete solution:
- by which criteria are your rows ordered? (I guess a not-shown column with timestamps...)
- what happens if a coordinate occurs 3 times (or even more)? Delete everything between 1st and last, just between 1st and 2nd, or always between odd and even occurrences?
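The question asks for Teradata SQL, but the intended transformation is easier to see as a small Python sketch first (assuming one cell is a comma-separated string of coordinate pairs, and that only the simple first/second-occurrence case from the example needs handling):

def drop_second_occurrence_span(cell):
    # split the cell into individual "lat lon" coordinate pairs
    coords = [c.strip() for c in cell.split(",")]
    first = {}  # coordinate -> index of its first occurrence
    out = list(coords)
    for i, c in enumerate(coords):
        if c in first:
            # drop the repeat and everything strictly between the
            # first and second occurrence
            for j in range(first[c] + 1, i + 1):
                out[j] = None
        else:
            first[c] = i
    return ", ".join(c for c in out if c is not None)

Applied to the example cell, this keeps the first six coordinates and drops the final two, matching the desired output.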

How to (1) condense into one row after a certain number of rows; (2) how to assign field names

Using Pentaho PDI 8.3.
After REST calls with quite complex data structures, I was able to extract the data with a row for each data element in a REST result. E.g.:
DataCenterClusterAbstract
1
UK1
Datacenter (auto generated)
Company
29
0
39
15
DATAUPDATEJOB
2016-04-09T21:34:31.18
DataCenterClusterAbstract
2
UK1_Murex
Datacenter (auto generated)
Company
0
0
0
0
DATAUPDATEJOB
2016-04-09T21:34:31.18
DataCenterClusterAbstract
3
UK1_UNIX
Notice that each record's data elements are spread out into separate rows. I would like to condense each record into one row in Pentaho. Is this possible? And can I assign field names?
Row flattener
The Row Flattener step condenses the repeating data elements, which arrive as consecutive rows, into the columns of a single row:
(1) Add a Row Flattener step.
(2) Assign field names for the rows coming in: for each data attribute arriving as a row, specify a field name.
(3) In the Table output step, use space as the separator.
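Outside PDI, the flattening logic itself is simple. A minimal Python sketch, assuming the stream repeats in fixed blocks (11 lines per record, as in the sample above) and using made-up field names:

FIELDS = [  # hypothetical names, one per line of a record block
    "type", "id", "name", "datacenter", "company",
    "metric1", "metric2", "metric3", "metric4",
    "job", "updated_at",
]

def flatten(lines):
    # group the flat stream of values into one dict per record,
    # ignoring a trailing partial block
    for start in range(0, len(lines), len(FIELDS)):
        block = lines[start:start + len(FIELDS)]
        if len(block) == len(FIELDS):
            yield dict(zip(FIELDS, block))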

Exclude data having a specific value [substring within a string] using Pentaho

I have a column "Number_field" (Excel sheet). It has values as shown below.
Test_Number Number_field
1 0011 10 00A34 PS
2 0011 10 00A34 PS
3 0010 01 00A30 PS
4 0010 01 00A30 PS
5 0010 01 00A35 PS
6 0010 01 00A35 PS
Now, from these I need to remove the rows which contain "0A34" or "0A35". How can I achieve this? I tried the "filter" option, but I cannot search for a substring within a string with it. Please help.
You can simply do this in two steps, as follows: mark the rows with a User Defined Java Expression step, then split them off with a Filter rows step.
Use the User Defined Java Expression step with the following parameters:
Java expression: (Number_field.indexOf("0A34") != -1 || Number_field.indexOf("0A35") != -1) ? "Remove" : "Ok"
Value type: String
New field: is_row_to_remove
and a Filter rows step with these parameters:
The condition: is_row_to_remove = Remove (String)
Send 'true' data to step: Dummy (do nothing) step
Send 'false' data to step: Your next step
Flow explanation:
User Defined Java Expression: the Java expression finds 0A34 or 0A35 and marks such a row with the value Remove in the new field is_row_to_remove.
Filter rows: the step splits the record stream according to the value of is_row_to_remove. If the value is Remove, the row continues to the Dummy step and is dropped; otherwise it continues to your next step.
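For comparison, the same mark-and-filter logic as a small Python sketch (assuming each row arrives as a dict with a Number_field key):

def is_row_to_remove(row):
    # mirrors the Java expression: flag rows containing 0A34 or 0A35
    nf = row["Number_field"]
    return "0A34" in nf or "0A35" in nf

rows = [
    {"Test_Number": 1, "Number_field": "0011 10 00A34 PS"},
    {"Test_Number": 3, "Number_field": "0010 01 00A30 PS"},
]
kept = [r for r in rows if not is_row_to_remove(r)]
print(kept)  # only the 00A30 row survives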
If you want to do this in Excel itself, you can use the formula below and filter on it to remove the records from your sheet.
Add the formula below and drag it down to all your records. Then create a filter on this new formula column and remove the records marked REMOVE.
=IF(OR(IFERROR( SEARCH("A34",B2), 0),IFERROR( SEARCH("A35",B2), 0)), "REMOVE", "KEEP")
Hope this will help you.