I need to read input from a file. Say the file is named test.dat and looks like this:
keyword1 123456a
keyword2 small hard sour
keyword2 midsize firm bland
keyword2 large hard sour
keyword1
2234567
keyword1 3234567
keyword1 4234567
keyword2
small
soft
sour
keyword1 123456a
keyword2 midsize hard bland
keyword1 123456A
keyword2 large firm sweet
keyword1 AAAAAAA
keyword2 midsize hard bland
keyword1 BBBBBBA
keyword2 large firm sweet
I need to detect keyword1 and read the seven-character string that follows it, then detect keyword2 and read the next three strings, and so on. As you can see, there can be any amount of white space (including line breaks) before and after any string in the file.
I am new to Ada. Please help.
Take a look at GNAT.AWK. It is likely one of the simplest ways to do this parsing automatically. Something like:
procedure On_Keyword1 is
begin
   Put_Line ("Field1 = " & GNAT.AWK.Field (2));
end On_Keyword1;
procedure On_Keyword2 is
begin
   Put_Line ("Word1=" & GNAT.AWK.Field (2)
             & " Word2=" & GNAT.AWK.Field (3)
             & " Word3=" & GNAT.AWK.Field (4));
end On_Keyword2;
--  A string pattern must match field 1 exactly, so use the same case as the file
GNAT.AWK.Register (1, "keyword1", On_Keyword1'Access);
GNAT.AWK.Register (1, "keyword2", On_Keyword2'Access);
I have a CSV file with data on groupings of cereal brands in an experiment. There is one row for each subject (~2000 subjects), and each row has a variable number of brands that the subject liked (text strings separated by commas).
srno Brands
1 GMI 'TOTAL' WG ORIG,POS H-C GRAPE NUTS ORIG,POST GREAT GRAINS CRUNCHY PCN
2 GMI TINY TST STB,GMI HONEY NUT CHEERIOS REG,GMI TINY TST BB,GMI APPLE CN TOAST CRUNCH
3 QKR SQUARES CN
I want to read the data into a data frame so that the brands in each row become one element of a list.
My goal is to do a text mining analysis to explore similarities (i.e. brands that occur together).
I found a similar question asked a few years ago, "Text file to list in R", but I was not able to adapt the solution.
Managed to work this out!
I read in the CSV file with the stringsAsFactors=FALSE option (this is important):
x = read.csv("datafile.csv", stringsAsFactors=FALSE)
# the strings of brand names are read into the column str_brand
# the following statement then turns the strings in str_brand into a list
# note: the individual brands are separated by commas in the csv file
brands_list <- stringr::str_split(x$str_brand, pattern = ",")
Excuse the basic nature of this question, but I have searched for hours for an answer and everything I found seems to overcomplicate what I need.
I have a dataframe like the following:
id food_item_1 food_item_2 food_item_3
1 nuts bread coffee
2 potatoes coffee cake
3 fish beer coffee
4 bread coffee coffee
What I want to do is search all the 'food_item_*' columns (in this case there are 3) and get back the single most common value across all of them, e.g. 'coffee'.
Could someone please recommend the best way to do this?
Many thanks
md
Use DataFrame.filter to pick the columns, reshape with DataFrame.stack, then use Series.mode and select the first value by position with Series.iat:
a = df.filter(like='food_item_').stack().mode().iat[0]
print(a)
coffee
Another option is Series.value_counts, taking the first value of the index:
a = df.filter(like='food_item_').stack().value_counts().index[0]
You can also melt your columns and use value_counts:
print(df.melt(id_vars="id", value_vars=df.columns[1:])["value"].value_counts())
coffee 5
bread 2
nuts 1
potatoes 1
cake 1
beer 1
fish 1
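To make the approaches above easy to try, here is a minimal self-contained sketch that rebuilds the sample dataframe from the question and applies the stack-based variants. The variable names items, most_common and counts are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "food_item_1": ["nuts", "potatoes", "fish", "bread"],
    "food_item_2": ["bread", "coffee", "beer", "coffee"],
    "food_item_3": ["coffee", "cake", "coffee", "coffee"],
})

# Keep only the food_item_* columns and stack them into one long Series
items = df.filter(like="food_item_").stack()

# mode() returns every value tied for most frequent; take the first one
most_common = items.mode().iat[0]

# value_counts() gives the full frequency table, sorted by count
counts = items.value_counts()

print(most_common)
print(counts)
```

If several values are tied for first place, mode() returns all of them, so it may be worth inspecting the whole result rather than only the first element.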
I have a dataframe that looks like this.
code description col3 col4
123456 nice shoes size4 something something
123456 nice shoes size5 something something
567890 boots size 1 something something
567890 boots size 2 something something
567890 boots size 3 something something
234567 baby overall 2yrs something. something
234567 baby overall 3-4yrs something something
456778 shirt m Something. Something
456778 shirt l something Something
456778 shirt xl Something Something
I would like to shorten 'description' to the substring that is common to all rows sharing the same 'code', and then drop duplicates.
code description col3 col4
123456 nice shoes something something
567890 boots something something
234567 baby overall something something
456778 shirt Something Something
I suspect I need to groupby and maybe apply a function, but I can't get my head round this.
I found a function that might help, but it only takes 2 strings, whereas my data may have 5 rows with the same code:
from difflib import SequenceMatcher
string1 = "apple pie available"
string2 = "come have some apple pies"
def extract_common(string1, string2):
    match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))
    print(match) # -> Match(a=0, b=15, size=9)
    print(string1[match.a: match.a + match.size]) # -> apple pie
    print(string2[match.b: match.b + match.size]) # -> apple pie
    return string1[match.a: match.a + match.size]
Appreciate any help rendered.
You need pandas 0.25 or later to use explode:
import pandas as pd
mask=(df.groupby('code')['code'].transform('size')>1)
df1=df[mask]
df2=df[~mask]
s=df1.groupby('code',sort=False)['description'].apply(lambda x: ' '.join(x).split(' ')).explode()
s_not_duplicates=s.to_frame()[s.map(s.value_counts()>1)].drop_duplicates().groupby(level=0)['description'].apply(lambda x: ' '.join(x))
description_not_duplicates=pd.concat([s_not_duplicates,df2.description])
print(description_not_duplicates)
123456 nice shoes
234567 baby overall
456778 shirt
567890 boots size
Name: description, dtype: object
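As a possible alternative to the answer above (not the answerer's method), the following sketch takes the longest run of leading words shared by every description within a code. The helper common_word_prefix is a hypothetical name, and only two of the sample codes are reproduced. It assumes the part to keep is always at the start of the description.

```python
from itertools import takewhile

import pandas as pd


def common_word_prefix(descriptions):
    """Return the leading words shared by every description (hypothetical helper)."""
    word_lists = [d.split() for d in descriptions]
    # zip(*word_lists) yields the 1st words of all rows, then the 2nd words, ...
    # takewhile stops at the first position where the rows disagree
    shared = takewhile(lambda words: len(set(words)) == 1, zip(*word_lists))
    return " ".join(words[0] for words in shared)


# Subset of the sample data from the question
df = pd.DataFrame({
    "code": [123456, 123456, 567890, 567890, 567890],
    "description": [
        "nice shoes size4", "nice shoes size5",
        "boots size 1", "boots size 2", "boots size 3",
    ],
})

result = df.groupby("code", sort=False)["description"].agg(common_word_prefix)
print(result)
```

Like the answer above, this keeps "boots size" for code 567890, because "size" is shared by all three rows. Getting plain "boots" would need an extra rule that the question does not specify.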
I have a class of variables. To keep it simple I will use car stock as an example. So in my class I have variables for say:-
Car Manufacturer
Car Colour
Car Quantity
and my class looks something like this.. (but potential for hundreds of lines)
BMW, Black, 2
Mercedes, White, 1
Honda, Green, 3
BMW, Red, 1
I need to create a list that merges the manufacturer and the quantity but separates the colours e.g. new list should look something like:-
BMW, 3
Black
Red
Mercedes, 1
White
Honda, 3
Green
Can someone help explain the best way to go about this please?
Assuming that you don't hate one-liners and have some idea of how LINQ works, the following expression will return an array of Tuple(Of String, Integer) where Item1 will have the name of manufacturer and Item2 will have the total number of their cars:
YourCarsList.GroupBy(Function(r) r.Manufacturer).Select(Function(r) New Tuple(Of String, Integer)(r.Key, r.Sum(Function(r2) r2.Quantity))).ToArray()
Note that the output you have shown in the question is console-ish and cannot be returned in that form by a function or expression. If you really want to just print this as text output, you could extend the above expression like this:
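Dim output As New System.Text.StringBuilder()
Dim Res = YourCarsList.GroupBy(Function(c) c.Manufacturer)
For Each Grp In Res
    ' Header line: manufacturer and total quantity
    output.AppendLine(Grp.Key & ", " & Grp.Sum(Function(c) c.Quantity).ToString())
    ' One line per distinct colour, sorted alphabetically
    For Each Clr In Grp.Select(Function(c) c.Colour).Distinct().OrderBy(Function(s) s)
        output.AppendLine(Clr)
    Next
Next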
Sorry for posting something that's probably obvious, but I don't have much database experience. Any help would be greatly appreciated - but remember, I'm a beginner :-)
I have a table like this:
Table.fruit
ID type Xcoordinate Ycoordinate Taste Fruitiness
1 Apple 3 3 Good 1,5
2 Orange 5 4 Bad 2,9
3 Apple 7 77 Medium 1,4
4 Banana 4 69 Bad 9,5
5 Pear 9 15 Medium 0,1
6 Apple 3 38 Good -5,8
7 Apple 1 4 Good 3
8 Banana 15 99 Bad 6,8
9 Pear 298 18789 Medium 10,01
… … … … … …
1000 Apple 1344 1388 Bad 5
… … … … … …
1958 Banana 759 1239 Good 1
1959 Banana 3 4 Medium 5,2
I need:
A table that gives me
The n (eg.: n=5) closest points to EACH point in the original table, including distance
Table.5nearest (please note that the distances are fake). So the resulting table has ID1, ID2 and distance between ID1 and ID2 (can't post images yet, unfortunately).
ID.Fruit1 ID.Fruit2 Distance
1 1959 1
1 7 2
1 2 2
1 5 30
1 14 50
2 1959 1
2 1 2
… … …
1000 1958 400
1000 Xxx Xxx
… … …
How can I do this (ideally with SQL/database management) or in ArcGis or similar? Any ideas?
Unfortunately, my table contains 15000 datasets, so the resulting table will have 75000 datasets if I choose n=5.
Any suggestions GREATLY appreciated.
EDIT:
Thank you very much for your comments and suggestions so far. Let me expand on it a little:
The first proposed method is sort of a brute-force scan of the whole table rendering huge filesizes or, likely, crashes, correct?
Now, the fruit is just a dummy, the real table contains a fix ID, nominal attributes ("fruit types" etc), X and Y spatial columns (in Gauss-Krueger) and some numeric attributes.
Now, I guess there is a way to code a "bounding box" into this, so that the distance calculation is done only between my point in question (let's say 1) and every other point within a square of a certain edge length. I can imagine (remotely) coding or querying for that, but how do I get the script to do that for EVERY point in my ID column? The way I understand it, this should either create a "subtable" for each record/point in my "Table.Fruit", containing all points within the square around that record/point with a distance field added, or one big new table ("Table.5nearest"). I hope this makes some kind of sense. Any ideas? Thanks again.
To get all the distances between all fruit is fairly straightforward. In Access SQL (although you may need to add parentheses everywhere to get it to work :P):
select fruit1.id,
       fruit2.id,
       sqr(((fruit2.xcoordinate - fruit1.xcoordinate)^2) + ((fruit2.ycoordinate - fruit1.ycoordinate)^2)) as distance
from fruit as fruit1
inner join fruit as fruit2
on fruit2.id <> fruit1.id
order by distance;
I don't know if Access has the necessary sophistication to limit this to the "top n" records for each fruit; so this query, on your recordset, will return 225 million records (or, more likely, crash while trying)!
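Outside the database, one way to avoid materialising every pair is to keep only the n smallest distances per point while scanning. The sketch below is a brute-force illustration in Python, not an Access solution. It uses the first nine rows of the sample table, and the function name n_nearest is hypothetical. It still computes every pairwise distance, so for 15000 points it may be slow, but the output holds only n rows per point.

```python
import heapq
import math

# (id, x, y) rows taken from the first lines of the sample table
fruit = [
    (1, 3, 3), (2, 5, 4), (3, 7, 77), (4, 4, 69), (5, 9, 15),
    (6, 3, 38), (7, 1, 4), (8, 15, 99), (9, 298, 18789),
]


def n_nearest(points, n):
    """Return (id1, id2, distance) rows for the n closest neighbours of each point."""
    rows = []
    for id1, x1, y1 in points:
        # Lazily generate (distance, id) pairs to every other point
        candidates = (
            (math.hypot(x2 - x1, y2 - y1), id2)
            for id2, x2, y2 in points
            if id2 != id1
        )
        # nsmallest keeps at most n candidates in memory at a time
        for dist, id2 in heapq.nsmallest(n, candidates):
            rows.append((id1, id2, dist))
    return rows


nearest = n_nearest(fruit, n=2)
for id1, id2, dist in nearest:
    print(id1, id2, round(dist, 2))
```

A bounding-box pre-filter or a spatial index, as suggested in the question's edit, would cut down the number of distance computations; this sketch leaves that out for brevity.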
Thank you for your comments so far; in the meantime, I have gone for a pre-fabricated solution, an add-in for ArcGis called Hawth's Tools. This really works like a breeze to find the n closest neighbors to any point feature with an x and y value. So I hope it can help someone with similar problems and questions.
However, it leaves me with a more database-related issue now. Do you have an idea how I can get any DBMS (preferably Access), to give me a list of all my combinations? That is, if I have a point feature with 15000 fruits arranged in space, how do I get all "pure banana neighborhoods" (apple, lemon, etc.) and all other combinations?
Cheers and best wishes.