How to return different rows from the same table on single query as a list - sql

I have a table with a boolean column that is used to filter some responses. I need to return a response as a tuple of the form {claimed, unclaimed} (imagine the table is called winnings).
While working on it, I first ran two separate queries to return the claimed and then the unclaimed rows and constructed the response manually. Then I switched to returning all rows without checking the boolean column and splitting them outside of the query. Now I'm wondering whether I can run a single query on the same table and return claimed and unclaimed rows as separate results, mainly in the hope that it performs better. I've tried doing it with joins, but that returns a list of two-item tuples like:
[{claimed, unclaimed}, {claimed, unclaimed}]...
While I want:
{claimed, unclaimed}
# OR
[{claimed, unclaimed}]
At most; no more tuples than that. Note that I'm not writing the raw queries myself but using a library, so excuse me if the terminology is not right.
This is the last query I ran:
SELECT w0."claimed", w1."claimed"
FROM "winnings" AS w0
INNER JOIN "winnings" AS w1 ON TRUE
WHERE (w0."claimed" AND NOT (w1."claimed"))
LIMIT 10;
EDIT: More details.
When I run the query from above this is the result I get:
=> SELECT w0."claimed", w1."claimed" FROM "winnings" AS w0 INNER JOIN "winnings" AS w1 ON TRUE WHERE (w0."claimed" AND NOT (w1."claimed")) LIMIT 10;
claimed | claimed
---------+---------
t | f
t | f
t | f
t | f
t | f
t | f
t | f
t | f
t | f
t | f
(10 rows)
This is converted to the following in Elixir, which is the language I'm using:
[
true: false,
true: false,
true: false,
true: false,
true: false,
true: false,
true: false,
true: false,
true: false,
true: false
]
This is a keyword list, which internally is a list of tuples like [{true, false}, {true, false}]. I want: [{[true, true], [false, false]}]
That is, I want two lists, each with its respective rows: only claimed rows in one and only unclaimed rows in the other.
I don't really mind the output type as long as it includes the two lists with their rows as described.

To get the first column from a list of rows, you can use Enum.map/2 to get the first element of each tuple:
Enum.map(rows, &elem(&1, 0))
If you're a newcomer to Elixir, the & syntax may be a bit confusing. That code is shorthand for
Enum.map(rows, fn field -> elem(field, 0) end)
You could make that into a function that does that for all your columns like this:
def columnize(rows = [first_row | _]) when is_tuple(first_row) do
  for column <- 1..tuple_size(first_row), do: Enum.map(rows, &elem(&1, column - 1))
end
def columnize([]) do
  []
end
The = [first_row | _] pattern binds the first tuple in the list to first_row and guarantees the argument is a list with at least one element; the when is_tuple(first_row) guard ensures that at least the first row is a tuple.
If you'd like to know more about the for, it's a comprehension.
Note that this function assumes all rows are tuples with the same number of columns (which should be true for the results of a query, but may not be in other cases), and it will raise an ArgumentError if the first row is wider than any other row. It will also cause errors if other elements of the list are not tuples.
This feels very unidiomatic though. Is this maybe an XY problem?

Related

Adding column value for a list of indexes

I have a list of indexes and am trying to populate a column 'Type' for these rows only.
What I tried to do:
index_list={1,5,9,10,13}
df.loc[index_list,'Type']=="gain+loss"
Output:
1 False
5 False
9 False
10 False
13 False
But the output just gives the list with all False instead of populating these rows.
Thanks for any advice.
You need to use a single equals sign instead of a double one. In Python, as in most programming languages, == is the comparison operator. In your case you need the assignment operator =.
So the following code will do what you want (index_list is written as a list here, because recent pandas versions do not accept a set as a .loc indexer):
index_list=[1,5,9,10,13]
df.loc[index_list,'Type']="gain+loss"
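To see the difference between the two operators side by side, here is a minimal sketch on a made-up frame (the value column and the 15 empty rows are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical data: 15 rows with an empty 'Type' column.
df = pd.DataFrame({"value": range(15), "Type": [""] * 15})

index_list = [1, 5, 9, 10, 13]  # a list, not a set

# '==' only compares: it returns a boolean Series and leaves df unchanged.
comparison = df.loc[index_list, "Type"] == "gain+loss"

# '=' assigns the value to the selected rows in place.
df.loc[index_list, "Type"] = "gain+loss"
```

After the assignment, only the five selected rows hold "gain+loss"; the comparison result computed beforehand is all False, which matches the output shown in the question.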

How can I optimize my for loop in order to be able to run it on a 320000 lines DataFrame table?

I think I have a problem with computation time.
I want to run this code on a DataFrame of 320 000 lines, 6 columns:
index_data = data["clubid"].index.tolist()
for i in index_data:
    for j in index_data:
        if data["clubid"][i] == data["clubid"][j]:
            if data["win_bool"][i] == 1:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 1
                ):
                    NW_tot[i] += 1
            else:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 0
                ):
                    NL_tot[i] += 1
The objective is to determine, for each match, the number of wins and the number of losses up to and including that match, for every clubid.
The problem is that I don't get an error, but I never obtain any results either.
When I tried with a smaller DataFrame (data[0:1000]), I got a result in 13 seconds. This is why I think it's a computation time problem.
I also tried to first use groupby("clubid") and then run my for loop within every group, but I got lost.
Something else bothers me: I have at least two rows with the exact same date/hour, because there are at least two identical dates for one match. Because of this I can't use the date as the index.
Could you help me with these issues, please?
As I pointed out in the comment above, I think you can simply sum the vector of win_bool by group. If the dates are sorted this should be equivalent to your loop, correct?
import pandas as pd
dat = pd.DataFrame({
    "win_bool": [0,0,1,0,1,1,1,0,1,1,1,1,1,1,0],
    "clubid": [1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
    "date": [1,2,1,2,3,4,5,1,2,1,2,3,4,5,6],
    "othercol": ["a","b","b","b","b","b","b","b","b","b","b","b","b","b","b"]
})
temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")
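If what's needed is the running count at each match rather than one total per club, a cumulative sum within each group is one option. This is only a sketch, assuming the rows can be sorted by date within each club and that duplicate dates have already been dropped; the small frame below is made up, and it uses a date column where the question has startdate:

```python
import pandas as pd

# Hypothetical sample: two clubs, one row per match, unique dates per club.
dat = pd.DataFrame({
    "win_bool": [0, 1, 1, 0, 1, 1, 0],
    "clubid":   [1, 1, 1, 1, 2, 2, 2],
    "date":     [1, 2, 3, 4, 1, 2, 3],
})

# Order matches chronologically within each club.
dat = dat.sort_values(["clubid", "date"])

# Running number of wins up to and including each match, per club.
dat["NW_tot"] = dat.groupby("clubid")["win_bool"].cumsum()

# Running number of matches per club minus running wins gives running losses.
dat["NL_tot"] = dat.groupby("clubid").cumcount() + 1 - dat["NW_tot"]
```

Unlike the loop in the question, this stores both running counts on every row, and rows that share a date would be counted in sort order rather than together, so the results only match the loop when dates are unique within a club.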

iteration in spark sql dataframe , getting 1st row value in first iteration and second row value in next iteration and so on

Below is the query that returns the date and distance where the distance is <= 10 km:
var s=spark.sql("select date,distance from table_new where distance <=10km")
s.show()
this will give the output like
12/05/2018 | 5
13/05/2018 | 8
14/05/2018 | 18
15/05/2018 | 15
16/05/2018 | 23
---------- | --
I want to take the first row of the DataFrame s and store its date value in a variable v in the first iteration.
In the next iteration it should pick the second row, and the corresponding date value should replace the old value of v,
and so on.
I think you should look at Spark "Window Functions". You may find here what you need.
The "bad" way to do this would be to collect the DataFrame using df.collect(), which returns a list of Rows that you can iterate over manually with a loop. This is bad because it brings all the data into your driver.
The better way would be to use foreach():
df.foreach(lambda x: <<your code here>>)
foreach() takes a function as its argument and applies it to each row of the DataFrame without bringing all the data into the driver. But you can't use a simple local variable v inside the lambda function when overwriting is involved. You can use Spark accumulators for such a case.
E.g., if I want to sum all the values in the 2nd column (PySpark syntax, to match the lambda above):
counter = sc.accumulator(0)
df.foreach(lambda row: counter.add(row[1]))

convert Int64Index to Int

I'm iterating through a dataframe (called hdf) and applying changes on a row by row basis. hdf is sorted by group_id and assigned a 1 through n rank on some criteria.
# Groupby function creates subset dataframes (a dataframe per distinct group_id).
grouped = hdf.groupby('group_id')
# Iterate through each subdataframe.
for name, group in grouped:
    # This grabs the top index for each subdataframe
    index1 = group[group['group_rank']==1].index
    # If criteria1 == 0, flag all rows for removal
    if(max(group['criteria1']) == 0):
        for x in range(rank1, rank1 + max(group['group_rank'])):
            hdf.loc[x,'remove_row'] = 1
I'm getting the following error:
TypeError: int() argument must be a string or a number, not 'Int64Index'
I get the same error when I try to cast rank1 explicitly:
rank1 = int(group[group['auction_rank']==1].index)
Can someone explain what is happening and provide an alternative?
The answer to your specific question is that index1 is an Int64Index (basically a list), even if it has one element. To get that one element, you can use index1[0].
But there are better ways of accomplishing your goal. If you want to remove all of the rows in the "bad" groups, you can use filter:
hdf = hdf.groupby('group_id').filter(lambda group: group['criteria1'].max() != 0)
If you only want to remove certain rows within matching groups, you can write a function and then use apply:
def filter_group(group):
    if group['criteria1'].max() != 0:
        return group
    else:
        return group.loc[other criteria here]
hdf = hdf.groupby('group_id').apply(filter_group)
(If you really like your current way of doing things, you should know that loc will accept an index, not just an integer, so you could also do hdf.loc[group.index, 'remove_row'] = 1).
Call tolist() on the Int64Index object. Then the list can be iterated over as int values.
Simply add [0] to ensure you get the first value from the index:
rank1 = int(group[group['auction_rank']==1].index[0])
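If the flag column itself is what's wanted, the explicit loop can probably be avoided altogether. A minimal sketch, assuming the goal is to set remove_row to 1 on every row of a group whose maximum criteria1 is 0; the sample frame is made up for illustration:

```python
import pandas as pd

# Hypothetical sample with three groups.
hdf = pd.DataFrame({
    "group_id":   [1, 1, 1, 2, 2, 3],
    "group_rank": [1, 2, 3, 1, 2, 1],
    "criteria1":  [0, 0, 0, 0, 5, 2],
})

# Broadcast each group's maximum of criteria1 back onto its own rows.
group_max = hdf.groupby("group_id")["criteria1"].transform("max")

# Flag every row that belongs to a group whose maximum is 0.
hdf["remove_row"] = (group_max == 0).astype(int)
```

One difference from the loop in the question: rows that are not flagged get an explicit 0 here instead of being left unset.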

Apply function with pandas dataframe - POS tagger computation time

I'm very confused on the apply function for pandas. I have a big dataframe where one column is a column of strings. I'm then using a function to count part-of-speech occurrences. I'm just not sure the way of setting up my apply statement or my function.
def noun_count(row):
    x = tagger(df['string'][row].split())
    # array flattening and filtering out all but nouns, then summing them
    return num
So basically I have a function similar to the above where I use a POS tagger on a column that outputs a single number (number of nouns). I may possibly rewrite it to output multiple numbers for different parts of speech, but I can't wrap my head around apply.
I'm pretty sure I don't really have either part arranged correctly. For instance, I can run noun_count[row] and get the correct value for any index but I can't figure out how to make it work with apply how I have it set up. Basically I don't know how to pass the row value to the function within the apply statement.
df['num_nouns'] = df.apply(noun_count(??),1)
Sorry this question is all over the place. So what can I do to get a simple result like
string num_nouns
0 'cat' 1
1 'two cats' 1
EDIT:
So I've managed to get something working by using list comprehension (someone posted an answer, but they've deleted it).
df['string'].apply(lambda row: noun_count(row),1)
which required an adjustment to my function:
from collections import Counter
def tagger_nouns(x):
    list_of_lists = st.tag(x.split())
    flattened = [y for z in list_of_lists for y in z]
    Parts_of_speech = [row[1] for row in flattened]
    c = Counter(Parts_of_speech)
    nouns = c['NN']+c['NNS']+c['NNP']+c['NNPS']
    return nouns
I'm using the Stanford tagger, but I have a big problem with computation time, and I'm using the left 3 words model. I'm noticing that it's calling the .jar file again and again (java keeps opening and closing in the task manager) and maybe that's unavoidable, but it's really taking far too long to run. Any way I can speed it up?
I don't know what 'tagger' is but here's a simple example with a word count that ought to work more or less the same way:
f = lambda x: len(x.split())
df['num_words'] = df['string'].apply(f)
string num_words
0 'cat' 1
1 'two cats' 2
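On the computation time: calling the tagger once per row is likely what starts Java again and again, so tagging the whole column in a single batch call and counting afterwards may help. The sketch below only shows that pattern. toy_tag_sents and TOY_LEXICON are hypothetical stand-ins so the example runs without Java; if st is NLTK's Stanford tagger wrapper (an assumption, the question doesn't say), its tag_sents method takes a list of token lists in the same way.

```python
from collections import Counter

import pandas as pd

NOUN_TAGS = ("NN", "NNS", "NNP", "NNPS")

# Hypothetical stand-in lexicon; a real tagger would supply the tags.
TOY_LEXICON = {"cat": "NN", "cats": "NNS", "dog": "NN", "two": "CD"}


def toy_tag_sents(sentences):
    """Stand-in batch tagger: one call tags every tokenized sentence."""
    return [[(word, TOY_LEXICON.get(word, "XX")) for word in sent]
            for sent in sentences]


def count_nouns(tagged_sentence):
    """Count the noun tags in one list of (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_sentence)
    return sum(counts[tag] for tag in NOUN_TAGS)


df = pd.DataFrame({"string": ["cat", "two cats", "two cats chase dog"]})

# Tokenize every row, tag them all in ONE call, then count per row.
tokenized = [s.split() for s in df["string"]]
tagged = toy_tag_sents(tokenized)
df["num_nouns"] = [count_nouns(sent) for sent in tagged]
```

Returning the whole Counter from count_nouns instead of a single number would be one way to get several part-of-speech columns out of the same pass.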