This question, if anything, more has to do with resolving my OCD when working on these text files, but I feel like a picture helps describe better what I intend to do than anything I can describe:
http://i.stack.imgur.com/iDmi5.png
What I'm looking to accomplish is lining up the values (with tabs, not spaces) all on the same column (in this case, column 53), but I don't know how to go about it effectively.
Rushing out to work at the moment so if I need to provide more information to help, let me know and I'll post back when I'm home. Thanks!
If, for example, you have the following text:
aaa aaa
a
You can hit tab after a on the second line, and after a couple of hits, it should get you to the column in which aaa begins in the row above.
Related
I have around 300k unstructured data as below screen.I'm trying to use Google refine or OpenRefine to make this correct. However, I'm unable to find a proper way to do this. I'm new to this tool. Anyone's help would be greatly appreciated.Also, this tool is quite slow to process 300k records. If I am trying out something its taking lots of time to process and give an output.
OR Please suggest any other opensource tools and techniques do this?
As Owen said in comments, your question is probably too broad and cannot receive acceptable answer. We can just provide you with a general procedure to follow.
In Open Refine, you'll need to create a column based on the messy column and apply transformations to delete unwanted characters. You'll have to use regular expressions. But for that, it's necessary to be able to identify patterns. It's not clear to me why the "ST" of "Nat.secu ST." is important, but not the "US" in "Massy Intertech US". Not even the "36" in "Plowk 36" (Google doesn't know this word, so I'm not sure is an organisation name).
On the basis of your fifteen lines, however, we seem to distinguish some clear patterns. For example, it looks like you'll have to remove the tokens (character suites without spaces) at the end of the string that contain a #. For that, the GREL formula in Open Refine could look like this:
value.trim().replace(/\b\w+#\w+\b$/,'')
Here is a screencast if it's not clear to you.
But sometimes a company name may contain a #, in which case you will need to create more complex rules. For example, remove the token only if the string contains more than two words.
if(value.split(' ').length() > 2, value.replace(/\b\w+#\w+\b$/, ''), value)
And so on for the other patterns that you'll find (for example, any number sequence at the end that contains more than 4 numbers and one - between them)
Feel free to check out the Open Refine documentation in case of doubt.
Below is simplified example of my data. As you can see – there are just two rows here
So I run below and suddenly getting unexpected result
What I expected was something like:
Why am I getting wrong result?
Moreover, when I run below – I am getting only one row. Why second row with id=1 is not showing??
Is there BigQuery bug or what?
Disclaimer: I was asked exactly this type of question few times offline (outside of StackOverflow) and recently saw very same question on SO (I can't understand this BigQuery magic. find string with LIKE) but unfortunately it was deleted so I decided to Post this on my own
The reason for GROUP BY not grouping those two rows is that str field in those rows are actually different. Unfortunately, BigQuery Web UI collapses spaces in result panel when it is in Table mode. To see real/original values you can switch to JSON mode, as below
Same reason is for unexpected result for use of LIKE
As of how to deal with this? It depends!
For example you can kind of normalize your strings by suppressing spaces by yourself as it is shown below
P.S. In our internal tools – we just fixed the issue with suppressed spaces and just simply show all spaces:
it's sounds tricky and it is!
I have an example. I am trying to fill out the red values automatically
As you can see, column A and column B make a tuple and these tuples need to get incremented numbers.
While coding I feel like I am running out of functions to do this. I would be glad for hints, about even how to approach this problem the best!
Thank you!
Use COUNTIFS()
Put this in C1:
=COUNTIFS($A$1:A1,A1,$B$1:B1,B1)
Then copy down
For some background information, I'm creating an application that searches against a couple of indexed tables to retrieve some records. It isn't overtly complex to the point of say Google, but it's good enough for the purpose it serves, barring this strange issue.
I'm using the Contains() function, and it's going very well, except when the search contains strings of numbers. Now, I'm only passing in a string -- nowhere numerical datatypes being passed in -- only characters. We're searching against a collection of emails, each appended with a custom ID when shot off from a workflow. So while testing, we decided to search via number strings.
In our test, we isolated a number 0042600006, which belongs to one and only one email subject. However, when using our query we are getting results for 0042600001, 0042600002, etc. The query is this as follows (with some generic columns standing in):
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '0042600006')
We've tried every possible combination: '0042600006*', '"0042600006"' and '"0042600006*"'.
I think it's just a limitation of the function, but I thought this would probably be the best place for answers. Thanks in advance.
Asked this same question recently. Please see the insightful answer someone left me here
Essentially what this user says to do is to turn off the noise words (Microsoft has included integers 0-9 as noise in the Full Text Search). Hope you can use this awesome tool with integers as I now am!
try to add language 1033 as an additional parameter. that worked with my solution.
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '0042600006', language 1033)
try using
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '%0042600006%')
I have recently just started working with Lucene (specifically, Lucene.Net) and have successfully created several indicies and have no problem with any of them. Previously having worked with Endeca, I find that Lucene is lightweight, powerful, and has a much lower learning curve (due mostly to a concise API).
However, I have one specific index/query situation which I am having problems wrapping my head around. What I have is a person directory. People can be searched for in this application, with the goal of returning both exact and approximate matches. Right now, in the index I concatenate the "FirstName" and "LastName" into a single field called "FullName", adding a space between the two. So FirstName:Jon with LastName:Smith yield FullName:Jon Smith. I do anticipate the possibility of middle names and possibly suffix, but that is not important at the moment.
I would like to do the equivalent of a fuzzy search on the name, so someone searching for "John Smith" would still get back "Jon Smith". I had thought about a multisearch, however, this becomes more involved if his name was actually "Jon Del Carmen" or "Jon Paul Del Carmen". I have nothing in what the user types in to delineate the first name or last name pieces.
The only thought that I have is that I could replace spaces in the concatenated value with a character that would not be discarded. If I did this when I built the document for the index and also when I parsed the query, I could treat it as one larger word, right? Is there another way to do this that would work for both simple names ("Jon Smith") and also more complex names ("Jon Paul Del Carmen")?
Any advice would truly be appreciated. Thanks in advance!
Edit: Additional detail follows.
In Luke, I put in the following query:
FullName:jonn smith~
It is being parsed as:
FullName:jonn CreatedOn:smith~0.5
With an Explanation of:
BooleanQuery:boost=1.0000
clauses=2, maxClauses=1024
Clause 0: SHOULD
TermQuery:boost=1.0000
Term: field='FullName' text='jonn'
Cluase 1: SHOULD
FuzzyQuery: boost=1.0000
prefixLen=0, minSimilarity=0.5000
org.apache.lucene.search.FuzzyTermEnum: diff=-1.0000
FilteredTermEnum: Exception null
"CreatedOn" is another Field in the index. I tried putting quotes around the term "jonn smith", but it then treats it like a phrasequery, instead. I am sure that the problem is that I am just not doing something right, but being so green at all of this, I am not sure what that something truly is.
My problem was with how I was building the index. What I ended up doing was making sure that it was not tokenizing the FullName, and the query started returning the correct results. The Explain results from above were due to an ID10T error on my part and is now returning correctly.