Add consecutive numbers to collection titles in OpenRefine

I am working on migrating a digital collection to another system that requires unique titles for each object. Is there a way in OpenRefine to add consecutive numbers in a column?
As an example:
Data currently reads as:
View of Broad Street, 1999-01-01;
View of Broad Street, 1999-01-01;
View of Broad Street, 1999-01-01
I would like to automate adding the numbering sequence and have it read:
View of Broad Street, 1999-01-01 (1);
View of Broad Street, 1999-01-01 (2);
View of Broad Street, 1999-01-01 (3)

In OpenRefine we have a concept called records. You can use them to add counters to groups of data as described in the answers to OpenRefine: Fill down with increasing counter.
Here are the necessary steps, assuming the data is already in a column and sorted (see column "Original" in the table below):
1) Move the column to the beginning.
2) Blank down the column to generate records (see column "Records").
3) Use the expression row.record.cells[row.columnNames[0]][0].value + " (" + (1 + row.index - row.record.fromRowIndex) + ")" to add a new column based on the first column, or to transform the first column (see column "Transformed").
Original                          Records                           Transformed
--------------------------------  --------------------------------  ------------------------------------
View of Broad Street, 1999-01-01  View of Broad Street, 1999-01-01  View of Broad Street, 1999-01-01 (1)
View of Broad Street, 1999-01-01                                    View of Broad Street, 1999-01-01 (2)
View of Broad Street, 1999-01-01                                    View of Broad Street, 1999-01-01 (3)
View of Broad Street, 1999-01-02  View of Broad Street, 1999-01-02  View of Broad Street, 1999-01-02 (1)
View of Broad Street, 1999-01-02                                    View of Broad Street, 1999-01-02 (2)
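For readers who want to check the expected result outside OpenRefine, here is a minimal Python sketch of the same numbering logic. It is not what OpenRefine does internally; the function name number_titles is made up for illustration, and it assumes, like the steps above, that identical titles sit on consecutive rows.

```python
from itertools import groupby


def number_titles(titles):
    """Append " (n)" to each title, restarting n at 1 for every run of equal titles."""
    numbered = []
    # groupby only groups *consecutive* equal items, which mirrors the
    # "data is already sorted" assumption behind the records approach.
    for title, run in groupby(titles):
        for position, _ in enumerate(run, start=1):
            numbered.append(f"{title} ({position})")
    return numbered


if __name__ == "__main__":
    sample = [
        "View of Broad Street, 1999-01-01",
        "View of Broad Street, 1999-01-01",
        "View of Broad Street, 1999-01-01",
        "View of Broad Street, 1999-01-02",
        "View of Broad Street, 1999-01-02",
    ]
    for line in number_titles(sample):
        print(line)
```

If the titles are not sorted, equal titles in different places would each start again at (1), so sort first.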

Best way to add info/description to my items?

I made a geo game a while back where the player has to guess an item from an image (what I call an item is a SQL row basically) for example the bot sends the flag of the Netherlands, you have to type "Netherlands" to win.
Items can be the flag of a country, a capital city, a french department...
I made an info tab where it would basically give info about an item (ie region, former name, capital city, etc).
What I would like to do is properly save this information. I don't really know if I should store this in files like JSON because I would also like to give stats (Win rate per region, amount of games played per region, etc...).
Also, these elements are not fixed because some items have regions, capital cities or whatever and some don't.
Item examples :
(For a flag)
Column        Attribute
------------  ----------------------------------------------------
ID            1
Name          United Kingdom
Former name   United Kingdom of Great Britain and Northern Ireland
Code          GB
Continent     Europe
Subregion     Northern Europe
Capital city  London
...
(For a U.S. State)
Column        Attribute
------------  ---------
ID            1
Name          Arizona
Capital city  Phoenix
Largest city  Phoenix
...
Neither solution (adding every attribute as a column, or storing JSON) is the proper way.
I think the best design is to have a key-value table:
CREATE TABLE tableName (ID INT, [Key] SYSNAME, [Value] NVARCHAR(4000)) -- the type of [Value] was not given; NVARCHAR(4000) is an assumption
And data will look like:
ID  Key           Value
--  ------------  ----------------------------------------------------
1   Name          Arizona
1   Capital City  Phoenix
1   Largest City  Phoenix
2   Name          United Kingdom
2   Former name   United Kingdom of Great Britain and Northern Ireland
Most valuable benefit: no extra storage is spent on columns that would be NULL in a large number of rows.
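To make the key-value idea concrete, here is a minimal sketch using Python's built-in sqlite3 module instead of SQL Server. The table name item_attribute, the column names and the helper functions are assumptions made for this illustration only.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory key-value table holding the sample rows shown above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE item_attribute ("
        " item_id INTEGER, key TEXT, value TEXT,"
        " PRIMARY KEY (item_id, key))"
    )
    rows = [
        (1, "Name", "Arizona"),
        (1, "Capital City", "Phoenix"),
        (1, "Largest City", "Phoenix"),
        (2, "Name", "United Kingdom"),
        (2, "Former name", "United Kingdom of Great Britain and Northern Ireland"),
    ]
    conn.executemany("INSERT INTO item_attribute VALUES (?, ?, ?)", rows)
    return conn


def get_attributes(conn, item_id):
    """Return every stored attribute of one item as a dict; absent keys are simply missing."""
    cursor = conn.execute(
        "SELECT key, value FROM item_attribute WHERE item_id = ?", (item_id,)
    )
    return dict(cursor.fetchall())


if __name__ == "__main__":
    db = build_demo_db()
    print(get_attributes(db, 1))
    print(get_attributes(db, 2))
```

An item without a capital city just has no "Capital City" row, so nothing is stored for it.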

ms word index only subentry

I have a Word document where I've marked various entries for the Index. The entries are like this:
Inland Empire
David Shaver
John Jameson
JM Granny
Justin Flatterer
Mary Martinson
Palouse Poppies
Sara Talk
Eddie Haskell
I've marked each Organization as a Main Entry and each Person as a SubEntry.
I need TWO indices.
1) List of companies with all their people (similar to as shown, above).
2) List of ONLY the people. No companies.
How can I generate an index that shows only the Subentries?

Compare two addresses which are not in standard format

I have to compare addresses from two tables and get the Id if the address matches.
Each table has three columns: Houseno, street, state.
The addresses are not in a standard format in either table. There are approx. 50,000 rows that I need to scan through.
Street types appear in many variants: Ave., Avenue, Ave ., Str, Street, ST., Lane, Ln., Place, PL, Cir, CIRCLE,
in any combination with a dot, comma, space or hyphen.
I was thinking of combining all three columns. What would be the best way to do it in SQL or PL/SQL? For example:
table1
HNO    STR           State
-----  ------------  -----
12     6th Ave       NY
10     3rd Aven      SD
12-11  Fouth St      NJ
11     sixth Lane    NY
A23    Main Parkway  NY
A-21   124 th Str.   VA
table2
id  HNO   STR           state
--  ----  ------------  -----
1   12    6 Ave.        NY
13  10    3 Avenue      SD
15  1121  Fouth Street  NJ
33  23    9th Lane      NY
24  X23   Main Cir.     NY
34  A1    124th Street  VA
There is no simple way to achieve what you want. There is expensive software (search for "address standardization software") that can do this, but it is rarely 100% automatic.
What this type of software does is to take the data, use complex heuristics to try to figure out the "official" address and then return that (sometimes with the confidence that the result is correct, sometimes a list of results sorted by confidence).
For a small percentage of the data, the software will simply not work and you'll have to fix that yourself.
Oracle has a built in package UTL_Match which has an edit_distance function (based on the Levenshtein algorithm, this is a measure of how many changes you would need to make to make one string the same as another). More info about this Package / Function can be found here: http://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm
You would need to make some decisions around whether to compare each column or concatenate and then compare and what a reasonable threshold is. For example, you may want to do a manual check on any with an edit distance of less than 8 on the concatenated values.
Let me know if you want any help with the syntax, the edit_distance function just takes 2 varchar2 args (the strings you want to compare) and returns a number.
This is not a perfect solution in that if you set the threshold high you will have a lot of manual checking to do to discard some, and if you set it too low you will miss some matches, but it may be about the best if you want a relatively simple solution.
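To show what the edit-distance approach looks like in practice, here is a minimal pure-Python sketch of the Levenshtein distance together with a threshold filter. It illustrates the idea only and is not Oracle's UTL_MATCH implementation; the function names and the default threshold are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                       # delete char_a
                    current[j - 1] + 1,                    # insert char_b
                    previous[j - 1] + substitution_cost,   # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def candidate_matches(left, right, threshold=8):
    """Return (left_address, right_address, distance) for pairs below the threshold.

    Every left address is compared with every right address, so this is only a
    sketch of the idea; on 50,000 rows you would narrow the candidates first,
    for example by state.
    """
    matches = []
    for address_a in left:
        for address_b in right:
            distance = edit_distance(address_a.lower(), address_b.lower())
            if distance < threshold:
                matches.append((address_a, address_b, distance))
    return matches


if __name__ == "__main__":
    table1 = ["12 6th Ave NY", "10 3rd Aven SD"]
    table2 = ["12 6 Ave. NY", "10 3 Avenue SD", "X23 Main Cir. NY"]
    for row in candidate_matches(table1, table2, threshold=5):
        print(row)
```

Pairs that pass the threshold are candidates for the manual check described above, not confirmed matches.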
The way we did this for one of our applications was to use a third-party address normalization API (e.g. Pitney Bowes), normalize each address (an address being the combination of street address, city, state and zip) and create a T-SQL hash for that address. For the address to compare, do the same thing and compare the two hashes; if they are equal, we have a match.
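The normalize-then-hash idea can be sketched as follows. This is not the third-party API mentioned above: a tiny, hypothetical abbreviation table (SUFFIXES) stands in for the real normalization service, and SHA-256 from Python's hashlib stands in for the T-SQL hash.

```python
import hashlib
import re

# Hypothetical and deliberately incomplete; a real normalizer knows far more variants.
SUFFIXES = {
    "ave": "avenue", "aven": "avenue",
    "st": "street", "str": "street",
    "ln": "lane", "pl": "place", "cir": "circle",
}
ORDINAL = re.compile(r"^(\d+)(st|nd|rd|th)$")


def normalize_address(house_no, street, state):
    """Lower-case, strip punctuation, expand suffixes and drop ordinal endings."""
    house = re.sub(r"[^0-9a-z]", "", house_no.lower())
    tokens = []
    for token in re.sub(r"[^0-9a-z]+", " ", street.lower()).split():
        match = ORDINAL.match(token)
        if match:
            token = match.group(1)            # "6th" -> "6"
        elif token in {"nd", "rd", "th"} and tokens and tokens[-1].isdigit():
            continue                          # "124 th" -> "124"
        tokens.append(SUFFIXES.get(token, token))
    return " ".join([house, *tokens, state.strip().lower()])


def address_key(house_no, street, state):
    """Hash of the normalized address; equal keys mean the normalized forms are equal."""
    normalized = normalize_address(house_no, street, state)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(normalize_address("12", "6th Ave", "NY"))
    print(address_key("12", "6th Ave", "NY") == address_key("12", "6 Ave.", "NY"))
```

Hashes only match when the normalized strings are identical, so this catches formatting differences but not typos.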
You can write a cursor that first groups the rows whose house number and city are equal.
In a loop, you can split each row into tokens with INSTR and SUBSTR, using CHR(32) (the space character) as the separator.
After that, you can compare the substrings, treating a bare number as equal to its ordinal (6 = 6th) and an abbreviation as equal to the full word (street = str).
Good luck!

Joining multiple fields between the same tables

I have a table called 'Resources' that looks like this:
Country City Street Headcount
UK Halifax High Street 20
United Kingdom Oxford High Street 30
Canada Halifax North St 40
Because of the nature of the location fields, I need to map them to a single 'Address' field, and so I also have the following table called 'Addresses':
Country City Street Address
UK Halifax High Street High Street, Halifax, UK
Canada Halifax North St North Street, Halifax, Canada
United Kingdom Oxford High Street High Street, Oxford, UK
(In reality the Address field does add information rather than just combining what is already there.)
I am currently using the following SQL to produce the query:
SELECT Resources.Country, Resources.City, Resources.Street, Addresses.Address,
Resources.Headcount
FROM Resources
INNER JOIN Addresses ON Resources.Country = Addresses.Country
AND Resources.City = Addresses.City
AND Resources.Street = Addresses.Street
This works for me, but I have not seen people use this many ANDs in a single join elsewhere, so I don't know if it is a bad idea. (This is a simplified version; I may need up to 8 ANDs in a single join in another case.) Is this the best way to approach the problem, or is there a better solution?
Thanks
Joining on multiple columns is fine. You don't have to "fear" this.
As for "a better way": I would suggest creating some table variables, putting some data in them, and posting that T-SQL (DDL and DML) here. Then you can get some possible alternatives. At present, the "is there a better way" part of your question is vague.
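As a starting point for such a reproducible example, here is a minimal sketch that loads the sample rows from the question into an in-memory SQLite database through Python's sqlite3 module and runs the same three-column join. SQLite stands in for the asker's database, and the composite primary key on Addresses is an assumption added for illustration.

```python
import sqlite3


def run_join():
    """Build the two sample tables and return the rows of the three-column join."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Resources (Country TEXT, City TEXT, Street TEXT, Headcount INTEGER);
        CREATE TABLE Addresses (Country TEXT, City TEXT, Street TEXT, Address TEXT,
                                PRIMARY KEY (Country, City, Street));
        INSERT INTO Resources VALUES
            ('UK', 'Halifax', 'High Street', 20),
            ('United Kingdom', 'Oxford', 'High Street', 30),
            ('Canada', 'Halifax', 'North St', 40);
        INSERT INTO Addresses VALUES
            ('UK', 'Halifax', 'High Street', 'High Street, Halifax, UK'),
            ('Canada', 'Halifax', 'North St', 'North Street, Halifax, Canada'),
            ('United Kingdom', 'Oxford', 'High Street', 'High Street, Oxford, UK');
        """
    )
    # All three columns together identify the address, so all three appear in ON.
    return conn.execute(
        """
        SELECT r.Country, r.City, r.Street, a.Address, r.Headcount
        FROM Resources AS r
        INNER JOIN Addresses AS a
            ON r.Country = a.Country
           AND r.City = a.City
           AND r.Street = a.Street
        ORDER BY r.Headcount
        """
    ).fetchall()


if __name__ == "__main__":
    for row in run_join():
        print(row)
```

Joining on City alone would pair the UK Halifax with the Canadian one, which is why the extra AND conditions are needed here.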

Is there a publicly available list of the US States in machine readable form?

Where can I find a list of the US States in a form for importing into my database?
SQL would be ideal, otherwise CSV or some other flat file format is fine.
Edit: Complete with the two letter state codes
I needed this a few weeks ago and put it on my blog as SQL and Tab Delimited. The data was sourced from wikipedia in early January so should be up to date.
US States: http://www.john.geek.nz/index.php/2009/01/sql-tips-list-of-us-states/
I use the Worlds Simplest Code Generator if I need to add columns or remove some of the fields - http://secretgeek.net/wscg.asp
I've also done Countries of the world and International Dialling Codes too.
Countries: http://www.john.geek.nz/index.php/2009/01/sql-tips-list-of-countries/
IDC's: http://www.john.geek.nz/index.php/2009/01/sql-tips-list-of-international-dialling-codes-idcs/
Edit: New: Towns and cities of New Zealand
Depending on why you need the states, it is worth keeping in mind that there are more than 50 valid state codes. For someone deployed outside the USA, it is annoying to come across websites that do not allow address entry with perfectly valid state codes like AE and AP. A better resource would be USPS.
Cut and paste these into Notepad and then import them. It should be easy enough; there are only 50 after all:
Alabama
Alaska
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
Vermont
Virginia
Washington
West Virginia
Wisconsin
Wyoming
Out of interest: as there are only 50 and they rarely change, couldn't you just manually create such a list from a source and put it on a public webspace?
In response to @cspoe7's astute observation, here is a query with all valid states and their abbreviations according to USPS. I have them sorted here by category (official US states, District of Columbia, US territories, military "states") and then alphabetically.
INSERT INTO State (Name, Abbreviation)
VALUES
('Alabama','AL'), -- States
('Alaska','AK'),
('Arizona','AZ'),
('Arkansas','AR'),
('California','CA'),
('Colorado','CO'),
('Connecticut','CT'),
('Delaware','DE'),
('Florida','FL'),
('Georgia','GA'),
('Hawaii','HI'),
('Idaho','ID'),
('Illinois','IL'),
('Indiana','IN'),
('Iowa','IA'),
('Kansas','KS'),
('Kentucky','KY'),
('Louisiana','LA'),
('Maine','ME'),
('Maryland','MD'),
('Massachusetts','MA'),
('Michigan','MI'),
('Minnesota','MN'),
('Mississippi','MS'),
('Missouri','MO'),
('Montana','MT'),
('Nebraska','NE'),
('Nevada','NV'),
('New Hampshire','NH'),
('New Jersey','NJ'),
('New Mexico','NM'),
('New York','NY'),
('North Carolina','NC'),
('North Dakota','ND'),
('Ohio','OH'),
('Oklahoma','OK'),
('Oregon','OR'),
('Pennsylvania','PA'),
('Rhode Island','RI'),
('South Carolina','SC'),
('South Dakota','SD'),
('Tennessee','TN'),
('Texas','TX'),
('Utah','UT'),
('Vermont','VT'),
('Virginia','VA'),
('Washington','WA'),
('West Virginia','WV'),
('Wisconsin','WI'),
('Wyoming','WY'),
('District of Columbia','DC'),
('American Samoa','AS'), -- Territories
('Federated States of Micronesia','FM'),
('Marshall Islands','MH'),
('Northern Mariana Islands','MP'),
('Palau','PW'),
('Puerto Rico','PR'),
('Virgin Islands','VI'),
('Armed Forces Africa','AE'), -- Armed Forces
('Armed Forces Americas','AA'),
('Armed Forces Canada','AE'),
('Armed Forces Europe','AE'),
('Armed Forces Middle East','AE'),
('Armed Forces Pacific','AP')
If you need to memorize them, let Wakko help you :)
You can download a lot of lists on http://www.freebase.com/ .
http://www.geonames.org/export/
The GeoNames geographical database is available for download free of charge under a creative commons attribution license. It contains over eight million geographical names and consists of 6.5 million unique features whereof 2.2 million populated places and 1.8 million alternate names. All features are categorized into one out of nine feature classes and further subcategorized into one out of 645 feature codes.
The data is accessible free of charge through a number of webservices and a daily database export.
You could use google sets to make a list of all states as well as lists of more or less anything.
If you only need a SQL Server script for the 50 states plus the District of Columbia and Puerto Rico (52 rows), you can use the following query:
INSERT INTO
States ( StateName )
VALUES
( 'Alabama'),
( 'Alaska'),
( 'Arizona'),
( 'Arkansas'),
( 'California'),
( 'Colorado'),
( 'Connecticut'),
( 'Delaware'),
( 'District of Columbia'),
( 'Florida'),
( 'Georgia'),
( 'Hawaii'),
( 'Idaho'),
( 'Illinois'),
( 'Indiana'),
( 'Iowa'),
( 'Kansas'),
( 'Kentucky'),
( 'Louisiana'),
( 'Maine'),
( 'Maryland'),
( 'Massachusetts'),
( 'Michigan'),
( 'Minnesota'),
( 'Mississippi'),
( 'Missouri'),
( 'Montana'),
( 'Nebraska'),
( 'Nevada'),
( 'New Hampshire'),
( 'New Jersey'),
( 'New Mexico'),
( 'New York'),
( 'North Carolina'),
( 'North Dakota'),
( 'Ohio'),
( 'Oklahoma'),
( 'Oregon'),
( 'Pennsylvania'),
( 'Puerto Rico'),
( 'Rhode Island'),
( 'South Carolina'),
( 'South Dakota'),
( 'Tennessee'),
( 'Texas'),
( 'Utah'),
( 'Vermont'),
( 'Virginia'),
( 'Washington'),
( 'West Virginia'),
( 'Wisconsin'),
( 'Wyoming');
I'm just gonna put this list of the United States in bash/Linux (pipe-delimited, lowercase, no spaces) format here so I can save someone some time:
alabama|alaska|arizona|arkansas|california|colorado|connecticut|delaware|florida|georgia|hawaii|idaho|illinois|indiana|iowa|kansas|kentucky|louisiana|maine|maryland|massachusetts|michigan|minnesota|mississippi|missouri|montana|nebraska|nevada|newhampshire|newjersey|newmexico|newyork|northcarolina|northdakota|ohio|oklahoma|oregon|pennsylvania|rhodeisland|southcarolina|southdakota|tennessee|texas|utah|vermont|virginia|washington|westvirginia|wisconsin|wyoming
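If you are not working in a shell, the same pipe-delimited string can be used from Python, for example as a lookup set. This is a minimal sketch; the names STATE_PATTERN, STATES and is_state are made up for illustration, and the input is squeezed into the same lowercase, no-spaces form before the lookup.

```python
STATE_PATTERN = (
    "alabama|alaska|arizona|arkansas|california|colorado|connecticut|delaware|"
    "florida|georgia|hawaii|idaho|illinois|indiana|iowa|kansas|kentucky|"
    "louisiana|maine|maryland|massachusetts|michigan|minnesota|mississippi|"
    "missouri|montana|nebraska|nevada|newhampshire|newjersey|newmexico|newyork|"
    "northcarolina|northdakota|ohio|oklahoma|oregon|pennsylvania|rhodeisland|"
    "southcarolina|southdakota|tennessee|texas|utah|vermont|virginia|washington|"
    "westvirginia|wisconsin|wyoming"
)

# Splitting on "|" turns the pattern into a set for exact-match lookups.
STATES = frozenset(STATE_PATTERN.split("|"))


def is_state(name):
    """True if name is one of the 50 states, ignoring case and whitespace."""
    return "".join(name.lower().split()) in STATES


if __name__ == "__main__":
    print(len(STATES))
    print(is_state("New Hampshire"))
    print(is_state("Puerto Rico"))
```

This list holds only the 50 states, so the District of Columbia, the territories and the military codes mentioned earlier are not recognised.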