Splitting text file data into columns - SQL

Hello all, I have a text file that contains thousands of lines/records in the following format:
6242S10TH AVENUE KWANOBUHLE Y 6242
The spacing between these words is also inconsistent, so now I just want to split this data into three separate columns in order to make a table.

If you can assume the following:
There are at least three spaces between columns
A column value never contains three or more consecutive spaces
There are no empty column values
then you can split each line on three spaces and trim the values. C# example:
// Split on a run of three spaces; RemoveEmptyEntries absorbs any extra
// separator spaces, and Trim() cleans up what is left (requires using System.Linq;).
string[] columns =
    line.Split(new string[] { "   " }, StringSplitOptions.RemoveEmptyEntries)
        .Select(s => s.Trim())
        .ToArray();

The TextFieldParser class documented here will also be helpful: http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser(v=vs.100).aspx

Related

R code for matching multiple strings in two columns and returning into a third separated by a comma

I have two dataframes. The first df includes columns b & c, which hold multiple strings separated by a comma. The second has three columns: one that includes all strings in column b, a second that includes all strings in c, and a third that is the resulting string I want to use.
x <- data.frame("uuid" = 1:2, "first" = c("jeff,fred,amy","tina,cat,dog"), "job" = c("bank teller,short cook, sky diver, no job, unknown job","bank clerk,short pet, ocean diver, hot job, rad job"))
x1 <- data.frame("meta" = c("ace", "king", "queen", "jack", 10, 9, 8,7,6,5,4,3), "first" = c("jeff","jeff","fred","amy","tina","cat","dog","fred","amy","tina","cat","dog"), "job" = c("bank teller","short cook", "sky diver", "no job", "unknown job","bank clerk","short pet", "ocean diver", "hot job", "rad job","bank teller","short cook"))
The result would be
result <- data.frame("uuid" = 1:2, "combined" = c("ace,king,queen,jack","5,9,8"))
Thank you in advance!
I tried to beat my head against the wall and it didn't help
Edit: this is the first half of the puzzle, but it does not search for and then concatenate the strings together in a cell; it only returns the first match found rather than all matches.
Is there a way to exactly match a string in one column with couple of strings in another column in R?

Compare two comma separated columns

I want to compare the two columns actual_data and pipeline_data based on the source column, because every source has a different format.
I am trying to build the result column based on a comparison between actual_data and pipeline_data.
I am new to pandas and am looking for a way to implement this.
df['result'] = np.where(
    df['pipeline_data'].str.len() == df['actual_data'].str.len(), 'Match',
    np.where(df['pipeline_data'].str.len() > df['actual_data'].str.len(),
             'Length greater than actual_data',
             'Length shorter than actual_data'))
The code above should do what you want.
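For illustration, here is that same expression applied to a tiny invented frame (just the two compared columns; the real data also has a source column):

import numpy as np
import pandas as pd

# Invented sample data for demonstration only
df = pd.DataFrame({
    'actual_data':   ['a,b,c', 'x,y', 'p,q,r,s'],
    'pipeline_data': ['a,b,c', 'x,y,z', 'p,q'],
})

df['result'] = np.where(
    df['pipeline_data'].str.len() == df['actual_data'].str.len(), 'Match',
    np.where(df['pipeline_data'].str.len() > df['actual_data'].str.len(),
             'Length greater than actual_data',
             'Length shorter than actual_data'))

print(df['result'].tolist())
# ['Match', 'Length greater than actual_data', 'Length shorter than actual_data']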

Select rows which contain numeric substrings in Pandas

I need to delete rows from a dataframe in which a particular column contains strings with numeric substrings. See the shaded column of my dataframe.
Rows with values like 0E as a prefix, or 21 (any two-digit number) as a suffix, or 24A (any two-digit number plus a letter) as a suffix should be deleted.
Any suggestions?
Thanks in advance.
You can use boolean indexing with a str.contains() regex:
^0E - starts with 0E
\d{2}$ - ends with 2 digits
\d{2}[A-Z]$ - ends with 2 digits and 1 capital letter
col = ... # target column
mask = df[col].str.contains(r'^0E|\d{2}$|\d{2}[A-Z]$')
df = df.loc[~mask]
@tdy gave a good answer, but only one place needs to be modified, if I understand it correctly.
For values that end with two digits, or two digits followed by a capital letter, the regex should be:
.*\d{2}[A-Z]?$
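For illustration, a small invented example that combines both answers into one pattern (the column name code and its values are made up):

import pandas as pd

# Invented sample values for demonstration only
df = pd.DataFrame({'code': ['0EABC', 'KEEP1X', 'ROW21', 'ROW24A', 'SAFE']})

# Starts with 0E, or ends with two digits optionally followed by a capital letter
mask = df['code'].str.contains(r'^0E|\d{2}[A-Z]?$')
print(df.loc[~mask, 'code'].tolist())
# ['KEEP1X', 'SAFE']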

Remove the last punctuation in list of numbers in Python

I have a variable of numbers and letters and want code that removes the apostrophes between each number/letter, keeping only the first and last apostrophe for the variable. The desired output is shown below.
numbers = 'V7780T103', '494368103', '003654100', '26210C104'
output should be
numbers = 'V7780T103, 494368103, 003654100, 26210C104'
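Assuming numbers really is a tuple of strings as shown, one way to get that single quoted string is str.join; a minimal sketch:

# numbers is a tuple of four strings, as in the question
numbers = 'V7780T103', '494368103', '003654100', '26210C104'

# Join them with ', ' so only one pair of quotes surrounds the whole value
output = ', '.join(numbers)
print(output)   # V7780T103, 494368103, 003654100, 26210C104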

Concatenate values of a string column and a long column in a pandas dataframe

I have a pandas data frame which doesn't have a meaningful index yet (just the default 1, 2, 3, ... index).
Columns 'store' and 'style' are strings; columns 'color' and 'size' are long ints.
None of them is unique by itself, but their concatenation is unique.
I want to concatenate them to produce an index, but
df2['store']+df2['style']+str(df2['color'])+str(df2['size'])
or
df2['store']+df2['style']+df2['color'].to_string()+df2['size'].to_string()
neither works. I think str() takes the whole column, forces it to become one string, and concatenates that, which results in weird symbols. And merges don't work correctly.
What's the correct way to concatenate a string column and a long column?
This should be:
df2['store'] + df2['style'] + df2['color'].astype(str) + df2['size'].astype(str)
Explanation: str(df2['size']) makes a single string representation of the full column (one string, comparable to what you see if you print the series), while .astype(str) converts each value of the series to a string.
to_string() gives the same result as str() (but takes optional parameters to control the output).
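For illustration, a tiny invented frame (column names from the question, values made up) showing the concatenation used as an index:

import pandas as pd

# Invented sample data: two string columns and two integer columns
df2 = pd.DataFrame({
    'store': ['A1', 'A1'],
    'style': ['slim', 'wide'],
    'color': [100200, 100200],
    'size':  [32, 34],
})

key = df2['store'] + df2['style'] + df2['color'].astype(str) + df2['size'].astype(str)
df2 = df2.set_index(key)
print(df2.index.tolist())
# ['A1slim10020032', 'A1wide10020034']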