I am trying to extract data from a data frame in Pandas and merge the results into one string (or .txt file)
Data Frame:
   NUM LETTER
0    4      Z
1    5      U
2    6      A
3    7      P
4    1      B
5    4      P
6    5      L
7    6      T
8    7      V
9    1      E
Script so far:
import pandas as pd

data = pd.read_csv("TEST.csv")
fdata = data[data["LETTER"].str.contains("A|E|L|P")]
ffdata = fdata.LETTER.to_string()
print(ffdata)
Running the script on TEST.csv gives me this result:
  LETTER
2      A
3      P
5      P
6      L
9      E
Next, I want to join the values from the filtered cells into one string:
--> "APPLE", optionally saving it as a .txt file to use later.
How do I proceed from here? I was thinking about iterating over the data frame and using join, but I have no idea how to implement this. Any clues?
Based on @Frodnar's answer, this is the code that works:
data = pd.read_csv("TEST.csv")
fdata = data[data["LETTER"].str.contains("A|E|L|P")]
ffdata = ''.join(fdata.LETTER.to_list())
print(ffdata)
which gives the output 'APPLE'
Thank you for your help!!
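For the optional ".txt" part of the question, here is a minimal sketch. It builds the sample table in memory instead of reading TEST.csv, uses isin() in place of the regular expression (a swapped-in alternative, not the code from the answer), and writes to a placeholder file name in the temporary directory; all of those are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Sample data mirroring the table in the question (stands in for TEST.csv).
data = pd.DataFrame({
    "NUM": [4, 5, 6, 7, 1, 4, 5, 6, 7, 1],
    "LETTER": list("ZUAPBPLTVE"),
})

# isin() matches whole cell values, so no regular expression is needed.
wanted = data[data["LETTER"].isin(["A", "E", "L", "P"])]

# Concatenate the remaining letters in row order.
word = "".join(wanted["LETTER"])

# Save the string for later use; the file name is only a placeholder.
out_path = os.path.join(tempfile.gettempdir(), "letters.txt")
with open(out_path, "w", encoding="utf-8") as fh:
    fh.write(word)

print(word)
```

Reading the file back with open(out_path).read() returns the same string.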
I am trying to find an algorithm that generates the full list of possible combinations of x given numbers.
Example: possible combinations of 3 numbers (a, b, c):
a, b, c, a+b, a+c, b+c, a+b+c
Many thanks in advance for your help!
Treat the binary representation of the numbers from 0 to 2^x-1 as set membership. E.g., for ABC:
0 = 000 = {}
1 = 001 = {C}
2 = 010 = {B}
3 = 011 = {B,C}
4 = 100 = {A}
etc...
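The bit-membership idea above can be sketched in Python as follows. The function name power_set is made up for this illustration, and the leftmost bit is mapped to the first item so that the output lines up with the ABC table.

```python
def power_set(items):
    """Return every subset of items, one per integer in 0 .. 2**n - 1."""
    n = len(items)
    subsets = []
    for mask in range(2 ** n):
        # Bit (n - 1 - i) of mask decides whether items[i] is included,
        # so the leftmost bit corresponds to the first item.
        subset = [items[i] for i in range(n) if mask & (1 << (n - 1 - i))]
        subsets.append(subset)
    return subsets


subsets = power_set(["A", "B", "C"])
for number, subset in enumerate(subsets):
    print(number, format(number, "03b"), subset)
```

Dropping the first entry (the empty set) leaves the 2^x - 1 non-empty combinations the question asks for.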
Do you mean generating every possible sum of the numbers?
Start with a set containing only 0: s = {0}
For each number a, b, c:
duplicate the existing set s, add the number to each element of the duplicate, then add the results back to s.
Example:
s = {0}
for a:
duplicate s, s' = {0}
add a to each of s', s' = {a}
add s' back to s, s = {0,a}
for b:
duplicate s, s' = {0,a}
add b to each of s' = {b,a+b}
add s' back to s, s= {0,a,b,a+b}
for c:
duplicate s, s' = {0,a,b,a+b}
add c to each of s' = {c,a+c,b+c,a+b+c}
add s' to s, s = {0,a,b,a+b,c,a+c,b+c,a+b+c}
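A short Python sketch of the duplicate-and-add procedure traced above. The name subset_sums is hypothetical; because a Python set is used, equal sums collapse into one entry, and 0 stands for the empty combination.

```python
def subset_sums(numbers):
    """Return the set of sums over all subsets of numbers (0 = empty subset)."""
    sums = {0}
    for x in numbers:
        # Duplicate the current set with x added to every element ...
        shifted = {s + x for s in sums}
        # ... and merge the duplicate back into the original.
        sums |= shifted
    return sums


print(sorted(subset_sums([1, 2, 4])))
```

With a = 1, b = 2, c = 4 every subset has a distinct sum, so the result has 2^3 = 8 entries.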
I have the following dataset:
import pandas as pd
jsonDF = pd.DataFrame({'DOCUMENT_ID':[263403828328665088,264142543883739136], 'MESSAGE':['#Zuora wants to help #Network4Good with Hurric...','#ztrip please help spread the good word on hel...']})
DOCUMENT_ID MESSAGE
0 263403828328665088 #Zuora wants to help #Network4Good with Hurric...
1 264142543883739136 #ztrip please help spread the good word on hel...
I am trying to reshape my data in the form of
docID wordID count
0 1 118 1
1 1 285 1
2 1 1229 1
3 1 1688 1
4 1 2068 1
I used the following:
r = []
for i in jsonDF['MESSAGE']:
    for j in sortedValues(wordsplit(i)):
        r.append(j)
IDCount_Re = pd.DataFrame(r)
IDCount_Re[:5]
which gives me the following result:
0 17
1 help 2
2 wants 1
3 hurricane 1
4 relief 1
5 text 1
6 sandy 1
7 donate 1
8 6
9 please 1
I can get the word counts,
but I have no idea how to append DOCUMENT_ID to the above dataframe.
The following functions were used to split the words:
import itertools
import re
from collections import Counter, OrderedDict

from nltk.corpus import stopwords

def wordsplit(wordlist):
    j = wordlist
    j = re.sub(r'\d+', '', j)
    j = re.sub('RT', '', j)
    j = re.sub('http', '', j)
    j = re.sub(r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", j)
    j = j.lower()
    j = j.strip()
    if not j in stopwords.words('english'):
        yield j

def wordSplitCount(wordlist):
    '''merges a list into string, splits it, removes stop words and
    then counts the occurrences returning an ordered dictionary'''
    #stopwords=set(stopwords.words('english'))
    string1 = ''.join(list(itertools.chain(filter(None, wordlist))))
    cnt = Counter()
    j = []
    for i in string1.split(" "):
        i = re.sub(r'&', ' ', i.lower())
        if i not in stopwords.words('english'):
            cnt[i] += 1
    return OrderedDict(cnt)

def sortedValues(wordlist):
    '''creates a list of (word, count) pairs with counts descending'''
    d = wordSplitCount(wordlist)
    return sorted(d.items(), key=lambda t: t[1], reverse=True)
UPDATE: SOLUTION HERE:
string split and assign unique ids to Pandas DataFrame
'DOCUMENT_ID' is one of the two fields in each row of jsonDF. Your current code doesn't access it because it directly works on jsonDF['MESSAGE'].
Here is some non-working pseudocode - something like:
for _, row in jsonDF.iterrows():
    doc_id, msg = row
    words = [word for word in wordsplit(msg)][0].split()  # hack
    wordcounts = Counter(words).most_common()  # sort by decreasing frequency
Then do a pd.concat(pd.DataFrame({'DOCUMENT_ID': doc_id, ...
and get the 'wordId' and 'count' fields from wordcounts.
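As a rough, standard-library-only sketch of that idea: the messages below are shortened stand-ins, STOP_WORDS is a tiny hypothetical replacement for NLTK's English stop words, tokenize is a simplified stand-in for wordsplit, and word ids are simply assigned in order of first appearance. None of these come from the question.

```python
import re
from collections import Counter

# Hypothetical stand-ins: shortened messages and a tiny stop-word list
# (the question uses NLTK's English stop words instead).
docs = {
    263403828328665088: "#Zuora wants to help #Network4Good with hurricane relief",
    264142543883739136: "#ztrip please help spread the good word on relief",
}
STOP_WORDS = {"to", "with", "the", "on"}


def tokenize(message):
    """Lower-case a message, drop hashtags and non-letters, remove stop words."""
    text = re.sub(r"#\w+", " ", message.lower())
    return [w for w in re.findall(r"[a-z]+", text) if w not in STOP_WORDS]


word_ids = {}  # word -> integer id, assigned in order of first appearance
rows = []      # (docID, wordID, count) triples

for doc_id, message in docs.items():
    # Keep the document id next to every word count taken from its message.
    for word, count in Counter(tokenize(message)).most_common():
        word_id = word_ids.setdefault(word, len(word_ids) + 1)
        rows.append((doc_id, word_id, count))

for row in rows:
    print(row)
```

The triples can then be turned into a frame with pd.DataFrame(rows, columns=["docID", "wordID", "count"]).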
I have a data set that contains 3 columns and 15565 observations. One of the columns has several words in the same row.
What I am looking to do is extract a particular word from each row and put it in a new column (I will have 4 columns in total).
The problem is that the words I am looking for are not the same in every row, and they are not always in the same position.
Here is an extract of my dataset:
x y z
-----------------------------------------------------------------------
1 T 3C00652722 (T558799A)
2 T NA >> MSP: T0578836A & 3C03024632
3 T T0579010A, 3C03051500, EAET03051496
4 U T0023231A > MSP: T0577506A & 3C02808556
8 U (T561041A C72/59460)>POPMigr.T576447A,C72/221816*3C00721502
I am looking to extract all the words that start with 3C and are 10 characters long, and then put them in a new column so it looks like this:
x y z Ref
----------------------------------------------------------------
1 T 3C00652722 (T558799A) 3C00652722
2 T NA >> MSP: T0578836A & 3C03024632 3C03024632
3 T T0579010A, 3C03051500, EAET03051496 3C03051500
4 U T0023231A > MSP: T0577506A & 3C02808556 3C02808556
8 U >POPMigr.T576447A,C72/221816*3C00721502 3C00721502
I have tried using the Contains, Like and substring methods, but they do not give me the results I am looking for: they find the rows that have the 3C number but do not extract it, and just copy the whole cell into the Ref column.
SQL Server doesn't have good string functions, but this should suffice if you only want to extract one value per row:
select t.*,
       left(stuff(col,
                  1,
                  -- number of characters to delete: everything before the match
                  patindex('%3C[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', col) - 1,
                  ''
                 ), 10) as Ref
from t ;
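To make the pattern itself easier to check outside the database, here is the same "3C followed by eight digits" rule as a small Python sketch. This is not SQL Server code; the name extract_ref is made up, and the sample strings are the z values from the question.

```python
import re

# "3C" followed by exactly eight digits: ten characters in total.
REF_PATTERN = re.compile(r"3C\d{8}")


def extract_ref(text):
    """Return the first 3C reference found in text, or None if there is none."""
    match = REF_PATTERN.search(text)
    return match.group(0) if match else None


rows = [
    "3C00652722 (T558799A)",
    "NA >> MSP: T0578836A & 3C03024632",
    "T0579010A, 3C03051500, EAET03051496",
    "T0023231A > MSP: T0577506A & 3C02808556",
    "(T561041A C72/59460)>POPMigr.T576447A,C72/221816*3C00721502",
]
refs = [extract_ref(row) for row in rows]
for row, ref in zip(rows, refs):
    print(ref, "<-", row)
```

Like the query, this returns only the first match per row.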
I am currently writing a program to solve a brain teaser.
How it works:
Using each of the digits 1-9 only once, make the four corners and each diagonal sum to 26.
Hint: make the middle 7.
My code basically starts at "111111111" and counts up, checking each time whether the number meets the requirements.
Code:
Public Class Main
    Dim count As Integer
    Dim split() As Char
    Dim done As Boolean
    Dim attempts As Integer

    Private Sub IncreaseOne()
        If count < 999999999 Then
            count += 1
        Else
            done = True
        End If
        If CStr(count).Contains("0") Then
            count = CStr(count).Replace("0", "1")
        End If
    End Sub

    Private Sub Reset()
        count = 111111111
        attempts = 0
    End Sub

    Private Sub IntToLbl()
        split = CStr(count).ToCharArray
        lbl1.Text = split(0)
        lbl2.Text = split(1)
        lbl3.Text = split(2)
        lbl4.Text = split(3)
        lbl5.Text = split(4)
        lbl6.Text = split(5)
        lbl7.Text = split(6)
        lbl8.Text = split(7)
        lbl9.Text = split(8)
        lblAttempts.Text = "Attempt: " & attempts
    End Sub

    Private Sub Check()
        attempts += 1
        ' Corners are indices 0, 1, 7 and 8; the two diagonals share index 4.
        If split(0) + split(1) + split(7) + split(8) = 26 And split(0) + split(2) + split(4) + split(6) + split(8) = 26 And split(1) + split(3) + split(4) + split(5) + split(7) = 26 Then
            If CStr(count).Contains("1") And CStr(count).Contains("2") And CStr(count).Contains("3") And CStr(count).Contains("4") _
            And CStr(count).Contains("5") And CStr(count).Contains("6") And CStr(count).Contains("7") And CStr(count).Contains("8") _
            And CStr(count).Contains("9") Then
                ListBox1.Items.Add("A" & attempts & ": " & CStr(count))
            End If
        End If
    End Sub

    Private Sub Act()
        While done = False
            IncreaseOne()
            IntToLbl()
            Check()
        End While
        tr.Abort()
    End Sub

    Dim suspended As Boolean = False
    Dim tr As New System.Threading.Thread(AddressOf Act)

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles btnSolve.Click
        If suspended = True Then
            tr.Resume()
            suspended = False
        Else
            If tr.IsAlive = False Then
                Reset()
                tr.Start()
                CheckForIllegalCrossThreadCalls = False
            Else
                Dim Reply = MsgBox("Thread is running! Stop the thread?", MsgBoxStyle.YesNo, "Error!")
                If Reply = MsgBoxResult.Yes Then
                    tr.Suspend()
                    suspended = True
                End If
            End If
        End If
    End Sub

    Private Sub Main_FormClosing(sender As Object, e As FormClosingEventArgs) Handles Me.FormClosing
        tr.Abort()
    End Sub

    Private Sub tr2_Tick(sender As Object, e As EventArgs) Handles tr2.Tick
        IncreaseOne()
        IntToLbl()
        Check()
    End Sub
End Class
Before using a thread, you should 1) reduce your algorithm's complexity and 2) improve its efficiency.
1) For the complexity: since each digit can appear only once, you have 9! = 362,880 tests to do, roughly 1,000 times fewer than the 9^9 = 387,420,489 zero-free candidates that a full scan visits.
I guess that already at that point you'll be real-time on most computers, but for some combinations you can also stop before testing all sub-combinations (e.g. if the first diagonal does not sum to 26, there is no need to test permutations of the other items). This cuts the number of tests even further.
Another way to reduce the case count is to use symmetry. Here, rotations by one or two steps and horizontal or vertical flips do not affect the result, which makes another x16 cut in the test count.
2) For the efficiency, using arrays of integers instead of strings will give you a huge speed boost.
I made a jsfiddle (so, in JavaScript) that tests only the 9! permutations and uses arrays; it gives the result instantly, so I did not look further into early stopping or symmetry.
One solution is, for instance: 3,2,7,5,9,6,1,4,8
which makes:
3       6
  2   1
    7
  4   5
8       9
which seems to be ok.
fiddle is here : http://jsfiddle.net/HRdyf/2/
The figures are coded this way: the first 5 figures are the first diagonal
(its central item has index 2), and the 4 others are the second diagonal without
its central item.
(There might be more efficient ways to encode the array that allow, as explained
earlier, stopping some tests early.)
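The permutation search described above can be sketched in Python (not the JavaScript of the fiddle). It assumes the same encoding: positions 0-4 hold the first diagonal with the centre at index 2, and positions 5-8 hold the second diagonal without the centre. The name solve and the early exit on the first diagonal are choices made for this sketch.

```python
from itertools import permutations

TARGET = 26


def solve():
    """Return every arrangement of 1..9 whose corners and diagonals sum to 26."""
    solutions = []
    for p in permutations(range(1, 10)):
        # p[0:5] is the first diagonal (centre at index 2).
        if sum(p[0:5]) != TARGET:
            continue  # cheap early exit before the other checks
        # p[5:9] is the second diagonal without the shared centre p[2].
        second_diagonal = sum(p[5:9]) + p[2]
        # The corners are the two ends of each diagonal.
        corners = p[0] + p[4] + p[5] + p[8]
        if second_diagonal == TARGET and corners == TARGET:
            solutions.append(p)
    return solutions


solutions = solve()
print(len(solutions))
print((3, 2, 7, 5, 9, 6, 1, 4, 8) in solutions)
```

Only 9! tuples of integers are examined, so no threading is needed to get an answer quickly.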
Remark: we can find all solutions with maths:
Let's call c1, c2, c3, c4 the four corners, m the central point, and d11, d12, d21, d22 the
remaining points of the two diagonals.
then
1) c1 + c2 + c3 + c4 = 26
2) c1 + d11 + m + d12 + c3 = 26
3) c2 + d21 + m + d22 + c4 = 26
4) all points are different and in the 1..9 range.
5) (from 4) : sum of all points = 45 (sum from 1 to 9 )
6) from 5) and 1) --> d11 + d12 + m + d21 + d22 = 45 - 26 = 19
(inner points total = total - corner total)
7 ) now adding 2) and 3) and using 1) and 6) we have 19 + 26 + m = 26 + 26
So --->>> m=7
8) considering 1) and 4) and 7), we see that we cannot reach 26 with four integers
different from 7 without using both 9 and 8 (the max we can reach without 7
and 9 is 8+6+5+4 = 23, and the max reached without 7 and 8 is 9+6+5+4 = 24)
So --> two corners have 9 and 8 as values.
9) With 8), 1), 7), and 4) : the two other corners can only be 6,3 or 5,4
(if r1 and r2 are the not 9 or 8 corners, we have r1+ r2 = 9 )
At this point : center is 7, and corners are either [4,5,8,9] or [3,6,8,9] (and permutations)
For [4,5,8,9] - > remains [1,2,3,6] (sum = 12)
For [3,6,8,9] - > remains [1,2,4,5] (sum = 12)
We cannot have 9 and 8 on same diagonal, since 8 + d11 + 7 + d12 + 9 = 26 makes d11 + d12 = 2 which is not
possible considering 4)
Let's consider the corners = [4,5,8,9] case, and look at the far end of the diagonal starting with 9. It may be
4 or 5.
4 : 9 + d11 + 7 + d12 + 4 = 26 --> d11 + d12 = 6 --> no solution, since no two of the remaining [1,2,3,6] sum to 6.
5 : 9 + d11 + 7 + d12 + 5 = 26 --> d11 + d12 = 5 --> (2,3) is the only solution for d11 and d12 --> (1,6) remains for d21 and d22 (8 + 1 + 7 + 6 + 4 = 26).
Now the corners = [3,6,8,9] case; again consider the far end of the diagonal starting with 9. It may be 6 or 3.
3 : d11 + d12 = 7 --> (5,2) is the only solution (4,3 and 6,1 cannot work since 3 and 6 are in use) --> (1,4) remains for d21 and d22.
6 : d11 + d12 = 4 --> no solution ((3,1) uses 3, which is already a corner).
---> so the diagonal starting with 9 can only end with 5 or 3.
deduction ---> the diagonal starting with 8 ends with 4 (when the other one ends with 5) or with 6
(when the other one ends with 3).
How many solutions?
4 possibilities for where the 9 goes, then 2 choices for the end of the 9 diagonal (5 or 3), then 2 choices for the 8 diagonal (starting upstairs or downstairs), then 4 possibilities for ordering d11, d12 and d21, d22: [2,3] + [1,6] if 5 ends the 9 diagonal and [5,2] + [1,4] if 3 ends it.
4 * 2 * 2 * 4 makes 64 solutions.
This is one of those problems that require some analysis (pencil, paper and subtraction) before coding. Since at least one of the diagonals must contain 9, the possibilities for that sequence (diagonal) are few. The next number in that sequence can only be 8, 7, or 6, each of which allows only a few possibilities.
9 8 6 2 1
9 8 5 3 1
9 8 4 3 2
9 7 6 3 1 remaining 2 4 5 8 = 19
9 7 5 4 1 remaining 2 3 6 8 = 19
9 7 5 3 2 remaining 1 4 6 8 = 19 *
9 6 5 4 2
(I may have missed some???)
Once those few sequences are known then the sum of the remaining numbers plus one of the numbers from a sequence must equal 26.
edit: a little more pencil / paper work shows that of those only the sequences with 7's in the center work.
edit: John Wein on the MSDN site came up with this math.
the sum of all possible numbers (1-9) = 45
diag1val (26) + diag2val (26) - sum = center square value -> 52 - 45 = 7
sum - cornervals - centerval = values of the 4 radial boxes -> 45 - 26 - 7 = 12
12 can only be some combo of 1,2,3,6 or 1,2,4,5
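The hand-made list of candidate diagonals can be double-checked with a few lines of Python. This is only a verification sketch; the function name and its parameters are made up for the illustration.

```python
from itertools import combinations


def diagonal_candidates(target=26, size=5, must_contain=9):
    """List digit sets of the given size that contain must_contain and sum to target."""
    result = []
    # Iterating from 9 down to 1 yields each set in descending order.
    for combo in combinations(range(9, 0, -1), size):
        if must_contain in combo and sum(combo) == target:
            result.append(combo)
    return result


candidates = diagonal_candidates()
with_seven = [c for c in candidates if 7 in c]

for combo in candidates:
    print(*combo)
print(len(candidates), "candidates,", len(with_seven), "of them contain 7")
```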
If you have a simple "try all possibilities" approach, then parallelizing the code could definitely make it faster. And it would be easy, as no changing data needs to be shared between threads.