Only print strings starting with a given letter / Python - startswith

Print cities from the visited_cities list in alphabetical order using .sort()
Only print cities whose names start with "Q" or earlier (meaning A to Q)
visited_cities = ["New York", "Shanghai", "Munich", "Toyko", "Dubai", "Mexico City", "São Paulo", "Hyderabad"]
The .sort() was easy to do, but I don't know how to figure out the second part of the problem.

You could do it with regular expressions and filtering, like:
import re
regex=re.compile('[A-Q]{1}.*')
cities = list(filter(lambda city: re.match(regex, city), visited_cities))
print(*cities, sep='\n')
The regex matches any city whose name starts with an uppercase letter from A to Q.
There is an even easier solution that uses the Unicode code point of the first character. Look at the built-in function ord:
for city in visited_cities:
    first_character = city[0]
    # keep the city only if its first letter lies between 'A' and 'Q'
    if ord('A') <= ord(first_character) <= ord('Q'):
        print(city)
The Unicode code points of the uppercase letters are consecutive: A is 65, B is 66 ... Q is 81 ... Z is 90. So if you want to print only the cities starting with a letter from A to Q, check that the code point of the first character is between 65 and 81. Note that this check is case-sensitive: lowercase letters start at 97 and would not pass.
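Putting both parts of the question together, here is a minimal sketch that sorts first and then filters on the first letter. The helper name cities_up_to and its last_letter parameter are made up for this example, and upper-casing the first character is an assumption about how lowercase names should be treated.

```python
visited_cities = ["New York", "Shanghai", "Munich", "Toyko", "Dubai",
                  "Mexico City", "São Paulo", "Hyderabad"]


def cities_up_to(cities, last_letter="Q"):
    """Return the cities, sorted, whose first letter is between A and last_letter."""
    # sorted() returns a new list, so the original list is left untouched
    return [city for city in sorted(cities)
            if "A" <= city[0].upper() <= last_letter]


for city in cities_up_to(visited_cities):
    print(city)
```

Comparing single characters with <= works for the same reason the ord version does: Python compares strings by code point.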

Related

R code for matching multiple strings in two columns and returning into a third separated by a comma

I have two data frames. The first df includes columns b and c, each holding multiple strings separated by commas. The second has three columns: one that includes all the strings in column b, one that includes all the strings in column c, and a third with the resulting string I want to use.
x <- data.frame("uuid" = 1:2, "first" = c("jeff,fred,amy","tina,cat,dog"), "job" = c("bank teller,short cook, sky diver, no job, unknown job","bank clerk,short pet, ocean diver, hot job, rad job"))
x1 <- data.frame("meta" = c("ace", "king", "queen", "jack", 10, 9, 8,7,6,5,4,3), "first" = c("jeff","jeff","fred","amy","tina","cat","dog","fred","amy","tina","cat","dog"), "job" = c("bank teller","short cook", "sky diver", "no job", "unknown job","bank clerk","short pet", "ocean diver", "hot job", "rad job","bank teller","short cook"))
The result would be
result <- data.frame("uuid" = 1:2, "combined" = c("ace,king,queen,jack","5,9,8"))
Thank you in advance!
I tried to beat my head against the wall and it didn't help
Edit- This is the first half of the puzzle BUT it does not search for and then concat the strings together in a cell, only returns the first match found rather than all matches.
Is there a way to exactly match a string in one column with couple of strings in another column in R?

Extract words from the text in Pyspark Dataframe

I have dataframe:
d = [{'text': 'They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.', 'begin_end': [111, 120]},
{'text': 'Mom called dad, and when he came home, he took moms car and drove to the store', 'begin_end': [20,31]}]
s = spark.createDataFrame(d)
+----------+----------------------------------------------------------------------------------------------------------------------------+
|begin_end |text |
+----------+----------------------------------------------------------------------------------------------------------------------------+
|[111, 120]|They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.|
|[20, 31] |Mom called dad, and when he came home, he took moms car and drove to the store |
+----------+----------------------------------------------------------------------------------------------------------------------------+
I needed to extract the words from the text column using the begin_end column array, like text[111:120+1]. In pandas, this could be done via zip:
df['new_col'] = [s[a:b+1] for s, (a,b) in zip(df['text'], df['begin_end'])]
result:
    begin_end       new_col
0  [111, 120]     jumps bad
1    [20, 31]  when he came
How can I rewrite this zip logic in PySpark to get new_col? Do I need to write a UDF for this?
You can do so by using substr in an expression. It expects the string you want to take the substring from, a starting position (1-based) and the length of the substring. An expression is needed because the substring function from pyspark.sql.functions doesn't take a column as the starting position or length.
from pyspark.sql import functions as F
s.withColumn('new_col', F.expr("substr(text, begin_end[0] + 1, begin_end[1] - begin_end[0] + 1)")).show()
+----------+--------------------+------------+
| begin_end| text| new_col|
+----------+--------------------+------------+
|[111, 120]|They say that all...| jumps bad|
| [20, 31]|Mom called dad, a...|when he came|
+----------+--------------------+------------+
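The index arithmetic in that expression can be checked in plain Python, without Spark. This is only a sketch: substr_args and sql_substr are hypothetical helpers written for this example, and sql_substr is a stand-in that mimics a 1-based substr(text, start, length), not Spark's own function.

```python
def substr_args(begin, end):
    """Convert a 0-based inclusive [begin, end] pair to (1-based start, length)."""
    return begin + 1, end - begin + 1


def sql_substr(text, start, length):
    """Plain-Python stand-in for a 1-based substr(text, start, length)."""
    return text[start - 1:start - 1 + length]


text = "Mom called dad, and when he came home, he took moms car and drove to the store"
begin, end = 20, 31

start, length = substr_args(begin, end)
print(start, length)                    # 21 12
print(sql_substr(text, start, length))  # when he came
print(text[begin:end + 1])              # when he came (the pandas-style slice)
```

The + 1 on the start converts from 0-based to 1-based indexing, and the + 1 on the length makes the end index inclusive.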

generating palindromes with John the Ripper

How can I configure John the Ripper to generate only mangled (Jumbo) palindromes from a word-list to crack a password hash?
(I've googled it but only found "how to avoid palindromes")
In john/john.conf, append the following rules at the end (this example produces 9- and 10-letter palindromes from 5-letter words):
# End of john.conf file.
# Keep this comment, and blank line above it, to make sure a john-local.conf
# that does not end with \n is properly loaded.
[List.Rules:palindromes]
f
f D5
Then run john with your wordlist plus the newly created "palindromes" rules:
$ john --wordlist=wordlist.lst --rules:palindromes hashfile.hash
Rule f simply appends a reflection of the current word from the wordlist to itself, e.g. P4ss! -> P4ss!!ss4P
Rule f D5 reflects the word and then deletes the character at position 5 (positions are counted from 0, so this is the sixth character), e.g. P4ss! -> P4ss!ss4P
I haven't found a way to "delete the middle character", so for now the rule has to be adjusted to the required palindrome length, e.g. f D4 for a length of 7, f D6 for a length of 11, etc.
Edit: Possible solution for variable length (not tested yet):
f
Mr[6
M = Memorize current word, r = Reverse the entire word , [ = Delete first character, 6 = Prepend the word saved to memory to current word
With this approach the palindromes could additionally be "turned inside out" (word from wordlist at the end of the resulting palindrome instead of at beginning)
f
Mr[6
Mr]4
M = Memorize current word, r = Reverse the entire word , ] = Delete last character, 4 = Append the word saved to memory to current word
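To see what these rules do to a word, the string operations can be mimicked in Python. This is only a sketch of the effects described above, not John the Ripper's rule engine, and the function names are invented for this example.

```python
def rule_f(word):
    """f: reflect, i.e. append the reversed word to the word itself."""
    return word + word[::-1]


def rule_f_delete(word, n):
    """f DN: reflect, then delete the character at 0-based position n."""
    reflected = rule_f(word)
    return reflected[:n] + reflected[n + 1:]


def rule_mr_prepend(word):
    """M r [ 6: memorize, reverse, drop the first character, prepend the memorized word."""
    memory = word
    return memory + word[::-1][1:]


def rule_mr_append(word):
    """M r ] 4: memorize, reverse, drop the last character, append the memorized word."""
    memory = word
    return word[::-1][:-1] + memory


print(rule_f("P4ss!"))            # P4ss!!ss4P
print(rule_f_delete("P4ss!", 5))  # P4ss!ss4P
print(rule_mr_prepend("P4ss!"))   # P4ss!ss4P
print(rule_mr_append("P4ss!"))    # !ss4P4ss!
```

Unlike f DN, the two M-based variants do not depend on the word length, which is why they would work for variable-length input.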

Regex - isolating string from larger word

The following regex within DB2 SQL works pretty well to get extra elements out of an address (i.e. not the street name or number). Limiting myself to two cases (UNIT or GATE) to keep my example simple, where HAD1 is the field containing the first line of a street address:
select HAD1,
regexp_substr(HAD1,'(UNITS?|GATES?)\s[0-9A-Z]{1,}')
from ECH
where regexp_like(HAD1,'(UNIT|GATE)')
and length(trim(HAD1)) > 12
I get this:
Ship To Address Line 1           REGEXP_SUBSTR
UNIT 4, 117 MONTGOMORIE RD       UNIT 4
END OF WAINUI RD, HIGHGATE       -
UNIT 3, 37 TE ROTO DRIVE         UNIT 3
GATE 6 52 MAHIA ROAD             GATE 6
UNIT B 11 LANGSTONE LANE         UNIT B
ASHBURTON FITTINGS GATE 2        GATE 2
GOODS: PLACEMAKERS - WESTGATE    -
UNIT 3, 37 TE ROTO DRIVE         UNIT 3
ASHBURTON FITTINGS GATE 2        GATE 2
SH 8A TARRAS-LUGGATE HIGHWAY     GATE HIGHWAY
Which is very encouraging. It correctly didn't pick up HIGHGATE or WESTGATE because they weren't followed by a space then something else.
But it did pick up LUGGATE (last line), which I don't want. So I'd like to require that UNIT or GATE is not preceded by any other character, i.e. that it starts a word.
As you may guess I'm an absolute beginner with regex, so thank you for your patience.
Edit
Now I have my most excellent regex like so:
\b(GATE|LEVEL|DOOR|UNITS?)\s[\dA-Z]{1,}
Using it over a larger data set I notice the occasional unwanted match where, for instance, GATE is followed by an ordinary English word:
THE THIRD GATE ON THE LEFT = GATE ON
The gates, levels, doors and units that I'm looking for will always be followed by one of the following: (a) A number of up to 6 digits (b) One letter (c) A number and one letter, possibly with a dash
Examples:
UNIT 7A
GATE 6
GATE 31113
UNIT B
LEVEL B2
LEVEL 2B
UNIT D06
So my follow-up question is: can I limit the number of letters in the second part of the expression to 0 or 1, but allow up to six digits?
I've played around with the numbers in curly brackets but they seem to affect only how many characters are returned rather than how many characters must be present.
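One possible pattern for the rules listed above, sketched with Python's re module rather than DB2 (so the exact syntax DB2 accepts is an assumption to verify). The idea is to spell out the two allowed shapes, digits with an optional letter or one letter with optional digits, and to close with \b so that a longer word such as ON cannot match. The names PATTERN and extract are made up for this example.

```python
import re

PATTERN = re.compile(
    r"\b(?:GATE|LEVEL|DOOR|UNITS?)\s"
    r"(?:\d{1,6}(?:-?[A-Z])?"   # up to six digits, optionally a (dashed) letter
    r"|[A-Z](?:-?\d{1,6})?)"    # or one letter, optionally (dashed) digits
    r"\b"                       # the match must end at a word boundary
)


def extract(address):
    """Return the first GATE/LEVEL/DOOR/UNIT element found, or None."""
    match = PATTERN.search(address)
    return match.group(0) if match else None


for address in ["UNIT 7A", "GATE 31113", "UNIT B 11 LANGSTONE LANE",
                "LEVEL B2", "UNIT D06", "THE THIRD GATE ON THE LEFT",
                "SH 8A TARRAS-LUGGATE HIGHWAY"]:
    print(address, "=>", extract(address))
```

The curly-bracket counts only limit how much one piece can consume; it is the trailing \b that forces the whole second part to fit the allowed shape.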

Determining What Line Does in Awk

I'm a very new beginner to awk. I'm reading over a simple loop statement where by using the split() command I have defined the 'a' array before the beginning of the loop and the 'b' array in each iteration of the loop.
Can someone help me with the statement below? I put it in to perspective since I know what the splits and for loop are doing.
split($2,a,":");
for(i=1;i<=length(a);i++){
    split(a[i],b," ")
    #I don't know what the statement below this line does.
    #It appears to be creating a multidimensional thing?
    x[b[1]]=b[2]
}
It looks like a one-dimensional array. Say you had a text file with one line like this:
1|age 10:fname john:lname smith|12345
Assuming a delimiter of pipe symbol |, your $2 is going to be age 10:fname john:lname smith.
Splitting that on the colon : gives 3 items: age 10, fname john and lname smith.
The for loop goes through these 3 items. It takes the first item, age 10,
and splits it on the space: b[1] is now age, b[2] is now 10.
Array element x["age"] is set to 10.
Similarly, x["fname"] is set to john and x["lname"] is set to smith.
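For comparison, here is a rough Python equivalent of the same steps, assuming the pipe-delimited sample line above. It is only an illustration of the logic, not a translation of awk semantics; the variable names mirror the awk ones.

```python
line = "1|age 10:fname john:lname smith|12345"

fields = line.split("|")            # roughly awk's $1, $2, $3 with -F'|'
x = {}                              # a plain dict, like awk's associative array

for item in fields[1].split(":"):   # like split($2, a, ":")
    b = item.split(" ")             # like split(a[i], b, " ")
    x[b[0]] = b[1]                  # awk arrays start at 1, Python lists at 0

print(x)  # {'age': '10', 'fname': 'john', 'lname': 'smith'}
```

The result is one flat mapping from key to value, which is all that x[b[1]]=b[2] builds up.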
x[b[1]]=b[2]
It's not creating a multidimensional array.
x is an array. The statement sets the element of x whose key is b[1] to the value b[2].