I have a Word document where I've marked various entries for the Index. The entries are like this:
Inland Empire
David Shaver
John Jameson
JM Granny
Justin Flatterer
Mary Martinson
Palouse Poppies
Sara Talk
Eddie Haskell
I've marked each Organization as a Main Entry and each Person as a SubEntry.
I need TWO indices.
1) A list of companies with all their people (similar to the list shown above).
2) List of ONLY the people. No companies.
How can I generate an index that shows only the Subentries?
Hello swarm intelligence!
I have the following use case: For every movie that is requested by a user, I create a number of tags for that specific movie, derived from several sources (actors, plot, etc.).
I will use this data for association mining.
The problem: If I use the movies for rows and the tags for columns, the tags will easily exceed the technical limit of 3000 columns (there are even more actors than that, and then plot keywords, etc.).
Is there any way I can organize this data so that I can then use it for (quick) association mining?
Thanks a lot
Don't put tags in columns. Instead create a separate table, named something like movie_tags with two columns, movie_id and tag. Put each tag in a separate row of that table.
This is known as "normalizing" your data. Here's a nice walkthrough with an example very similar to yours.
Edit: Let's say you have a catalog of movies about the Italian Mafia in New York City in the 20th century. Let's say the movies are
1 Godfather
2 Goodfellas
3 Godfather II
Then your movie_tags table might contain these rows.
1 Gangsters
2 Gangsters
3 Gangsters
1 Francis Ford Coppola
3 Francis Ford Coppola
2 Martin Scorsese
Pro tip: If you find yourself thinking about putting lots of data items with the same meaning in their own columns, you probably need to normalize the data and add appropriate tables.
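To illustrate why the one-row-per-tag layout suits association mining, here is a minimal Python sketch. It assumes the movie_tags rows from the example have already been read into memory as (movie_id, tag) pairs; the function names tags_by_movie and pair_support are made up for this illustration, and a real mining run would more likely use a dedicated library or a SQL self-join.

```python
from collections import defaultdict
from itertools import combinations

# Long-format rows (movie_id, tag), mirroring the movie_tags example above.
movie_tags = [
    (1, "Gangsters"),
    (2, "Gangsters"),
    (3, "Gangsters"),
    (1, "Francis Ford Coppola"),
    (3, "Francis Ford Coppola"),
    (2, "Martin Scorsese"),
]


def tags_by_movie(rows):
    """Group tags into one set per movie (one 'basket' per movie)."""
    grouped = defaultdict(set)
    for movie_id, tag in rows:
        grouped[movie_id].add(tag)
    return grouped


def pair_support(rows):
    """Count in how many movies each pair of tags occurs together."""
    counts = defaultdict(int)
    for tags in tags_by_movie(rows).values():
        for pair in combinations(sorted(tags), 2):
            counts[pair] += 1
    return dict(counts)


support = pair_support(movie_tags)
for pair, count in sorted(support.items()):
    print(pair, count)
```

The number of distinct tags no longer matters for the table layout: a new tag is just another row, never another column.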
Is there a SQL/regex approach or some advanced function with which we can extract human names from a column that has around 2 million rows? Something like NLTK.
Below is my sample. From it I want to extract only the human names, i.e. filter out the companies (marked with **). I have 2 million rows like these, with real companies and human names mixed together.
KAREN STRAUSS
KASEY NEMELKA
KATHLEEN MCMAHON
KATHRYN HOCKADAY
KATHRYN HOLAHAN
KATIE NELSON
**KATHERINE KACENA CONSULTING**
KATHY ATKINS
KATRINA GRANT
KATY DYER
KATY G TACKES
**KAUFFMAN S TRANSPORT LLC**
KATHERINE MAGPANTAY
KATHERINE VENTURA
KATHRYN RUANO
JORGE DANIEL MUSCIA
JOSE MANUEL ROSALES SANTEROS
JOSE MANUEL VILAS CARR
JOSEPH H WILNER
This is too long for a comment. Human names are too variable. After all, is "John Deere" the name of a company, the name of a person, or both?
You can construct special-purpose logic for your data. It will take time to develop, but it could start with something like this:
regexp(lower(name), '\s(consult|llc)')
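The same keyword heuristic can be tried outside the database. The following is only a sketch in Python, using the pattern from the expression above; the helper name looks_like_company is invented here, and the keyword list would need to grow as you inspect your own data. It does not solve the "John Deere" ambiguity.

```python
import re

# Same pattern as the SQL expression above: whitespace followed by a
# company-like keyword. Extend the alternation as new keywords turn up.
COMPANY_HINT = re.compile(r"\s(consult|llc)")


def looks_like_company(name):
    """Return True if the lower-cased name contains a company keyword."""
    return COMPANY_HINT.search(name.lower()) is not None


names = [
    "KATIE NELSON",
    "KATHERINE KACENA CONSULTING",
    "KATHY ATKINS",
    "KAUFFMAN S TRANSPORT LLC",
    "JOSEPH H WILNER",
]

people = [n for n in names if not looks_like_company(n)]
companies = [n for n in names if looks_like_company(n)]
print(people)
print(companies)
```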
How would one go about telling a CAML query to sort the results in a thoroughly custom order?
For instance, for a given field:
-- when equal to 'Chestnut' at the top,
-- then equal to 'Zebra' next,
-- then equaling 'House'?
Finally, within those groupings, sort on a second condition (such as 'Name'), normally ascending.
So this
ID Owns Name
————————————————————
1 Zebra Sue
2 House Jim
3 Chestnut Sid
4 House Ken
5 Zebra Bob
6 Chestnut Lou
becomes
ID Owns Name
————————————————————
6 Chestnut Lou
3 Chestnut Sid
5 Zebra Bob
1 Zebra Sue
2 House Jim
4 House Ken
In SQL, this can be done with Case/When. But in CAML? Not so much!
To my knowledge, CAML does not have such a sort operator. A workaround is to add a calculated column of number datatype to the list, with this formula:
=IF(Owns="Chestnut",0,IF(Owns="Zebra",1,IF(Owns="House",3,999)))
You can then order on the calculated column, which translates the custom sort order into numbers. Another solution is to create a second list that holds the items to own, with a second column that contains their sort order. You can link the two lists and order by the sort order from the item list. The benefit is that changing the sort order is as easy as editing the respective list items.
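The idea behind both workarounds is the same: translate each value into a numeric rank, then sort by rank first and by Name second. Here is a small Python sketch of that logic using the rows from the question. It is not CAML and not SharePoint API code, only an illustration; the names RANK and sort_key are made up for this example.

```python
# Rank values mirror the calculated-column formula above; anything not
# listed falls back to 999 and therefore sorts last.
RANK = {"Chestnut": 0, "Zebra": 1, "House": 3}

rows = [
    {"ID": 1, "Owns": "Zebra", "Name": "Sue"},
    {"ID": 2, "Owns": "House", "Name": "Jim"},
    {"ID": 3, "Owns": "Chestnut", "Name": "Sid"},
    {"ID": 4, "Owns": "House", "Name": "Ken"},
    {"ID": 5, "Owns": "Zebra", "Name": "Bob"},
    {"ID": 6, "Owns": "Chestnut", "Name": "Lou"},
]


def sort_key(row):
    """Primary key: custom rank of Owns; secondary key: Name ascending."""
    return (RANK.get(row["Owns"], 999), row["Name"])


ordered = sorted(rows, key=sort_key)
ordered_ids = [row["ID"] for row in ordered]
print(ordered_ids)
```

In the second-list variant, RANK plays the role of the lookup list: changing the order means editing its values, not the sorting logic.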
I have a spreadsheet for payroll that is populated from a separate spreadsheet. Occasionally, one of our workers will get a promotion. That promotion shows on the timesheets: e.g., Smith, Adam Position becomes Smith, Adam Promotion.
This data is then populated into a pivot table where Smith, Adam Position and Smith, Adam Promotion show in separate cells. Currently, we are manually adding the two data sets so that payroll gets a single number instead of multiple numbers. I would like to simplify this task. I am using Excel 2003, so some more advanced functions don't work.
Any suggestions and help would be greatly appreciated. Thanks in advance.
Ideally, you'd use a different field (a unique identifier) to identify Smith, Adam (e.g., an employee ID number), but if that's not available, then you could take the following approach:
(Suppose that "Smith, Adam Position" is in A1.)
You could add an additional column that extracts the last name, the comma, and then whatever the next word is. For example, from
Smith, Adam Analyst
you would get Smith, Adam. Unfortunately, this means that if you have something like
Jones, Mary Ellen Consultant
you would end up with Jones, Mary. If you think you can live with that, this solution could work. The way you would extract that would be with the following formula:
=SUBSTITUTE(LEFT(SUBSTITUTE(A1,", ",",",1),FIND(" ",SUBSTITUTE(A1,", ",",",1))-1),",",", ",1)
And then build your pivot table on that field.
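For readers who want to check the extraction rule outside Excel, here is a hedged Python sketch of the same idea: keep the last name, the comma, and the first word after it, then total per extracted name. The function names and the hours in rows are invented for illustration and do not come from the question.

```python
from collections import defaultdict


def employee_key(label):
    """Return 'Last, First' from a label such as 'Smith, Adam Analyst'."""
    last, sep, rest = label.partition(", ")
    words = rest.split()
    if not sep or not words:
        # No ", " or nothing after it: leave the label as it is.
        return label.strip()
    return f"{last}, {words[0]}"


def total_by_employee(rows):
    """Sum amounts per extracted key, like a pivot on the helper column."""
    totals = defaultdict(float)
    for label, amount in rows:
        totals[employee_key(label)] += amount
    return dict(totals)


# Hypothetical timesheet rows: (label, hours).
rows = [
    ("Smith, Adam Position", 24.0),
    ("Smith, Adam Promotion", 16.0),
    ("Jones, Mary Ellen Consultant", 40.0),
]

totals = total_by_employee(rows)
print(totals)
```

As with the formula, "Jones, Mary Ellen Consultant" is reduced to "Jones, Mary", so two-word first names remain a known limitation.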
Our organization is currently in the process of building a new data warehouse. We are able to use some techniques borrowed from the DW community, such as ETL processing to conform data, de-normalized dimensions in the Kimball style, etc. Overall, data warehousing is still fairly new to our organization, but we are learning the concepts as we go along.
The problem: We have multiple sources of data, with often conflicting sources of facts. For example, we have a Master Person Index, where we use a score-based matching algorithm during ETL to match an inbound person to an existing person, so even if the inbound record doesn't exactly match, we can score based on other things like zip code radius.
Here's the question: What is the standard way to handle multiple versions of a fact from two or more sources?
I understand that one of the main ideas of a data warehouse is to keep a running history of any fact, which we are doing. That works well when a record is maintained by one inbound source: we keep the history of that fact over time. The problem occurs when two different sources, each updating on a daily basis, report two different facts. For example, source A says the name is Mary Smith and source B says the name is Mary Jane, and each one sends its value again every day. Based on the matching algorithm we're confident it's the same person, but because of our history-style table, the name flips back and forth every day, since each load is read as a "change" relative to the other source's value.
An example table:
first_name | last_name | source | last_updated
-----------+-----------+--------+---------------
Mary       | Smith     | A      | 5/2/12 1:00am
Mary       | Jane      | B      | 5/2/12 2:00am
Mary       | Smith     | A      | 5/3/12 1:00am
Mary       | Jane      | B      | 5/3/12 2:00am
Mary       | Smith     | A      | 5/4/12 1:00am
Mary       | Jane      | B      | 5/4/12 2:00am
...
Have one table that stores your external data:
id | first_name | last_name | source | external_unique_id | import_date
----+------------+-----------+--------+--------------------+-------------
1 | Mary | Smith | A | abcdefg123 | 5/2/12 1:00am
2 | Mary | Jane | B | 1234567abc | 5/2/12 2:00am
Then have a second table that contains your cleaned data:
id | first_name | last_name
----+------------+-----------
1 | Mary | Jane-Smith (or whatever)
Then have a mapping table between the two.
local_person_id | foreign_person_id
-----------------+-------------------
1 | 1
1 | 2
Or something broadly similar.
The objective is to load the facts from your source once, and keep them.
Then use your fuzzy logic to relate them to master records somewhere. You only need to do that when new facts are loaded or old facts are changed.
You still have to choose which last_name to use, but that choice can be almost arbitrary in the absence of determining data. For example: pick the last name from whichever fact was loaded most recently.
You can still quickly and simply relate the master to the child facts, to their sources, and to their corresponding data. But you have a unified entity in your warehouse to hang these external facts on.
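A minimal in-memory sketch of this layout, in Python, may make the flow clearer. It assumes each source record is identified by (source, external_unique_id), uses plain dictionaries where a warehouse would use the three tables above, and applies the "most recently loaded fact wins" rule as one possible choice. The function names load, link and master_name are hypothetical.

```python
from datetime import datetime

staging = {}     # (source, external_unique_id) -> last changed record
person_map = {}  # local_person_id -> set of staging keys


def load(source, external_id, first_name, last_name, import_date):
    """Store a source record; return False if that source sent no change."""
    key = (source, external_id)
    old = staging.get(key)
    if old is not None and (old["first_name"], old["last_name"]) == (first_name, last_name):
        return False  # same value as before from this source: not a change
    staging[key] = {
        "first_name": first_name,
        "last_name": last_name,
        "import_date": import_date,
    }
    return True


def link(local_person_id, source, external_id):
    """Record that a source record belongs to a master person."""
    person_map.setdefault(local_person_id, set()).add((source, external_id))


def master_name(local_person_id):
    """Pick the name from the most recently changed linked record."""
    records = [staging[key] for key in person_map[local_person_id]]
    newest = max(records, key=lambda record: record["import_date"])
    return newest["first_name"], newest["last_name"]


changed = [
    load("A", "abcdefg123", "Mary", "Smith", datetime(2012, 5, 2, 1, 0)),
    load("B", "1234567abc", "Mary", "Jane", datetime(2012, 5, 2, 2, 0)),
    load("A", "abcdefg123", "Mary", "Smith", datetime(2012, 5, 3, 1, 0)),
    load("B", "1234567abc", "Mary", "Jane", datetime(2012, 5, 3, 2, 0)),
]
link(1, "A", "abcdefg123")
link(1, "B", "1234567abc")
print(changed, master_name(1))
```

Because each source is compared only with its own previous value, the daily re-sends from A and B are not recorded as changes, and the master record stops flipping.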
One thing about terminology: what you've listed are "Attributes", not "Facts". A fact is a measure that you take on a set of dimensional attributes (for example, an order that this "person" places, or the dollar value of this customer's recent order). In this case, you have multiple sources of dimensional attributes, each one considered the "same".
@Dems's method is one way (and a good one) to keep your cleaned data separate from your staging / operational data set.
Another, if you need to have access to both data sets in reporting, while still keeping a "clean" version, would be to put all the attributes on your person/customer dimension:
FIRST_NAME
LAST_NAME
SOURCE1_FIRST_NAME
SOURCE1_LAST_NAME
SOURCE2_FIRST_NAME
SOURCE2_LAST_NAME
For reports on measures where the user community is expecting to see the name from Source 2, you can use the source2 attribute. For people expecting source 1, use that. For people looking for the results of the processing which "conforms" the name, use the main attribute.
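As a rough illustration of this second approach, the dimension row can be modelled as a record that carries both the conformed and the per-source names, with each report picking the attribute it expects. This is a Python sketch only; the class PersonDim, the view labels, and the conformed value "Jane-Smith" (borrowed from the example in the other answer) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class PersonDim:
    """One row of a person dimension that keeps every source's name."""
    first_name: str          # conformed value
    last_name: str           # conformed value
    source1_first_name: str
    source1_last_name: str
    source2_first_name: str
    source2_last_name: str

    def display_name(self, view="conformed"):
        """Return the name for a given audience: conformed, source1, source2."""
        if view == "source1":
            return f"{self.source1_first_name} {self.source1_last_name}"
        if view == "source2":
            return f"{self.source2_first_name} {self.source2_last_name}"
        return f"{self.first_name} {self.last_name}"


person = PersonDim("Mary", "Jane-Smith", "Mary", "Smith", "Mary", "Jane")
print(person.display_name())
print(person.display_name("source1"))
print(person.display_name("source2"))
```

The trade-off is that every additional source adds two more columns to the dimension.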