How to use the agg function without dropping other columns in a dataframe?

Suppose I have a dataframe like below:
+-----+------+-------+--------+
| key | Time | value | value2 |
+-----+------+-------+--------+
| 1   | 1    | 1     | 1      |
| 1   | 2    | 2     | 2      |
| 1   | 4    | 3     | 3      |
| 2   | 2    | 4     | 4      |
| 2   | 3    | 5     | 5      |
+-----+------+-------+--------+
For each key, I want to select the row with the smallest Time. For this case, this is the dataframe I want:
+------+-----+-------+--------+
| Time | key | value | value2 |
+------+-----+-------+--------+
| 1    | 1   | 1     | 1      |
| 2    | 2   | 4     | 4      |
+------+-----+-------+--------+
I tried using groupBy and agg, but these operations drop the value columns. Is there a way to keep the value and value2 columns?

You can use struct to create a tuple containing the Time column and everything you want to keep (the s column in the code below). Because min on a struct compares its fields from left to right, putting Time first makes min('s) pick the row with the smallest Time for each key. Then, you can use s.* to unfold the struct and retrieve your columns.
val result = df
  .withColumn("s", struct('Time, 'value, 'value2))
  .groupBy("key")
  .agg(min('s) as "s")
  .select("key", "s.*")
// to order the columns the way you want:
val new_result = result.select("Time", "key", "value", "value2")
// or (note the column is named "Time", not "time"):
val new_result = result.select("Time", df.columns.filter(_ != "Time") : _*)

Related

Generate new values in dataframe based on other data

I am trying to calculate an additional column in a results dataframe, based on a filter operation on another dataframe that does not match it in size.
So, I have my source dataframe source_df:
| id | date       |
|----|------------|
| 1  | 2100-01-01 |
| 2  | 2021-12-12 |
| 3  | 2018-09-01 |
| 4  | 2100-01-01 |
and the target dataframe target_df. The dataframe lengths and the number of ids do not necessarily match:
| id  |
|-----|
| 1   |
| 2   |
| 3   |
| 4   |
| 5   |
| ... |
| 100 |
I actually want to find out which dates lie more than 30 days in the past.
To do so, I created a query
query = (pd.datetime.today() - pd.to_datetime(source_df["date"], errors="coerce")).dt.days > 30
ids = source_df[query]["id"]
--> ids = [2,3]
My intention is to calculate a column "date_in_past" that contains the values 0 and 1. If the date difference is greater than 30 days, a 1 is inserted, 0 otherwise.
The target_df should look like:
| id | date_in_past |
|----|--------------|
| 1  | 0            |
| 2  | 1            |
| 3  | 1            |
| 4  | 0            |
| 5  | 0            |
Both indices and df lengths do not match.
I tried to create a lambda function
map_query = lambda x: 1 if x in ids.values else 0
When I try to pass the target frame, map_query(target_df["id"]) throws a ValueError: "lengths must match to compare".
How can I assign the new column "date_in_past" having the calculated values based on the source dataframe?
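A minimal sketch of one way to build the column (not from the original thread, so treat the details as assumptions), using Series.isin so the two frames do not need matching lengths or indices; pd.Timestamp.today() is used instead of the deprecated pd.datetime.today():

import pandas as pd

# ids of rows whose date lies more than 30 days in the past
query = (pd.Timestamp.today() - pd.to_datetime(source_df["date"], errors="coerce")).dt.days > 30
ids = source_df.loc[query, "id"]

# isin checks membership element-wise, so the lengths do not have to match
target_df["date_in_past"] = target_df["id"].isin(ids).astype(int)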

Selecting rows of a table in which some special values exist

The first column of a table contains some Ids, and the values in the other columns are the numbers corresponding to those Ids. Given some special numbers, we want to select the rows in which these special numbers appear among the numbers corresponding to the Id. For example, suppose we have the following table and the special numbers are 2,5. We want to select the rows in which 2,5 are among the columns other than Id:
| Id | corresponded numbers
|----|----------------------
| 1 | 2 | 3 | 5 |
| 2 | 1 | 5 |
| 3 | 1 | 2 | 4 | 5 | 7 |
| 4 | 3 | 5 | 6 |
Therefore, we want to have the following table as the result:
| Id | corresponded numbers
|----|----------------------
| 1 | 2 | 3 | 5 |
| 3 | 1 | 2 | 4 | 5 | 7 |
Could you please suggest a function in Excel or a query in SQL to do the above selection?
SELECT id,
       [corresponded numbers]
FROM TableName
WHERE charIndex('2', [corresponded numbers]) > 0
  AND charIndex('5', [corresponded numbers]) > 0

Cumulative count of unique values in pandas

I would like to cumulatively count unique values from a column in a pandas frame by week. For example, imagine that I have data like this:
df = pd.DataFrame({'user_id':[1,1,1,2,2,2],'week':[1,1,2,1,2,2],'module_id':['A','B','A','A','B','C']})
+---+---------+------+-----------+
| | user_id | week | module_id |
+---+---------+------+-----------+
| 0 | 1 | 1 | A |
| 1 | 1 | 1 | B |
| 2 | 1 | 2 | A |
| 3 | 2 | 1 | A |
| 4 | 2 | 2 | B |
| 5 | 2 | 2 | C |
+---+---------+------+-----------+
What I want is a running count of the number of unique module_ids up to each week, i.e. something like this:
+---+---------+------+-------------------------+
| | user_id | week | cumulative_module_count |
+---+---------+------+-------------------------+
| 0 | 1 | 1 | 2 |
| 1 | 1 | 2 | 2 |
| 2 | 2 | 1 | 1 |
| 3 | 2 | 2 | 3 |
+---+---------+------+-------------------------+
It is straightforward to do this as a loop, for example this works:
running_tally = {}
result = {}
for index, row in df.iterrows():
if row['user_id'] not in running_tally:
running_tally[row['user_id']] = set()
result[row['user_id']] = {}
running_tally[row['user_id']].add(row['module_id'])
result[row['user_id']][row['week']] = len(running_tally[row['user_id']])
print(result)
{1: {1: 2, 2: 2}, 2: {1: 1, 2: 3}}
But my real data frame is enormous and so I would like a vectorised algorithm instead of a loop.
There's a similar sounding question here, but looking at the accepted answer (here) the original poster does not want uniqueness across dates cumulatively, as I do.
How would I do this vectorised in pandas?
The idea is to create lists per group by both columns, then use np.cumsum to build cumulative lists, and finally convert the values to sets and take their length:
import numpy as np

df1 = (df.groupby(['user_id','week'])['module_id']
         .apply(list)
         .groupby(level=0)
         .apply(np.cumsum)
         .apply(lambda x: len(set(x)))
         .reset_index(name='cumulative_module_count'))
print(df1)
user_id week cumulative_module_count
0 1 1 2
1 1 2 2
2 2 1 1
3 2 2 3
Jezrael's answer can be slightly improved by using pipe instead of apply(list), which should be faster, and then using np.unique instead of the trick with np.cumsum:
df1 = (df.groupby(['user_id', 'week']).pipe(lambda x: x.apply(np.unique))
         .groupby('user_id')
         .apply(np.cumsum)
         .apply(np.sum)
         .apply(lambda x: len(set(x)))
         .rename('cumulated_module_count')
         .reset_index(drop=False))
print(df1)
user_id week cumulated_module_count
0 1 1 2
1 1 2 2
2 2 1 1
3 2 2 3
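If the list-building approach is slow on a very large frame, one further sketch (not from the original answers, so the exact steps are an assumption) is to mark the first occurrence of each (user_id, module_id) pair and take a grouped cumulative sum:

import pandas as pd

df = df.sort_values(['user_id', 'week'])

# True the first time a user sees a module, False on repeats
first_seen = ~df.duplicated(['user_id', 'module_id'])

# running count of distinct modules per user, row by row
running = first_seen.astype(int).groupby(df['user_id']).cumsum()

# keep the largest running value per (user_id, week)
out = (running.groupby([df['user_id'], df['week']])
              .max()
              .reset_index(name='cumulative_module_count'))
print(out)

On the sample frame this reproduces the expected output above, and it avoids building intermediate Python lists.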

SQL moving aggregate SUM without partial results

Assume I have this schema (tested on PostgreSQL), where the 'Scorelines' relation contains results of sport matches (kickoff is a TIMESTAMP, but replaced by INT for readability).
SQLFiddle here: http://sqlfiddle.com/#!12/52475/3
CREATE TABLE Scorelines (
    team     TEXT,
    kickoff  INT,
    scored   INT,
    conceded INT
);
Now I want to produce another column 'three_matches_scored' that contains the sum of the points scored over the 3 preceding games (determined by kickoff) of the same team. I have this:
SELECT team, kickoff, scored, conceded, SUM(scored) OVER three_matches AS three_matches_scored
FROM Scorelines
WINDOW three_matches AS
    (PARTITION BY team ORDER BY kickoff
     ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING)
ORDER BY kickoff;
This works beautifully so far, except that I get values starting from the second game. Example:
| TEAM | KICKOFF | SCORED | CONCEDED | THREE_MATCHES_SCORED |
|------|---------|--------|----------|----------------------|
| A | 1 | 1 | 0 | (null) |
| B | 2 | 1 | 1 | (null) |
| A | 3 | 1 | 1 | 1 |
| A | 4 | 3 | 0 | 2 |
| B | 4 | 1 | 4 | 1 |
| A | 6 | 0 | 2 | 5 |
| B | 6 | 4 | 2 | 2 |
| B | 8 | 1 | 2 | 6 |
| B | 10 | 1 | 1 | 6 |
| A | 11 | 2 | 1 | 4 |
I want the column 'three_matches_scored' to be (null) for a team's first 3 games, because there are not yet 3 results to sum up. How can I achieve this?
I'd prefer a simple, understandable solution; performance is not critical for this particular case.
My only idea right now is to define a stored function SUM3 that returns (null) when it has fewer than 3 values to add up. But I have never defined a function in SQL and can't seem to figure it out.
You can use a CASE statement to null out the rows where there are fewer than 3 preceding games:
SELECT team, kickoff, scored, conceded,
       CASE WHEN COUNT(scored) OVER three_matches = 3
            THEN SUM(scored) OVER three_matches
            ELSE NULL
       END AS three_matches_scored
FROM Scorelines
WINDOW three_matches AS
    (PARTITION BY team ORDER BY kickoff
     ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING)
ORDER BY kickoff;
Output:
team | kickoff | scored | conceded | three_matches_scored
------+---------+--------+----------+----------------------
A | 1 | 1 | 0 |
B | 2 | 1 | 1 |
A | 3 | 1 | 1 |
A | 4 | 3 | 0 |
B | 4 | 1 | 4 |
A | 6 | 0 | 2 | 5
B | 6 | 4 | 2 |
B | 8 | 1 | 2 | 6
B | 10 | 1 | 1 | 6
A | 11 | 2 | 1 | 4
(10 rows)
See harmic's answer above.
(My first solution, kept just for reference.)
Solution with a user-defined aggregate:
CREATE TYPE intermediate_sum AS (
    sum   INT,
    count INT
);

CREATE FUNCTION sum_sfunc(intermediate_sum, INTEGER) RETURNS intermediate_sum AS
    $$ SELECT $2 + $1.sum AS sum, $1.count - 1 AS count $$ LANGUAGE SQL;

CREATE FUNCTION sum_ffunc(intermediate_sum) RETURNS INTEGER AS
    $$ SELECT (CASE WHEN $1.count > 1 THEN null
                    WHEN $1.count = 0 THEN $1.sum
               END)
    $$ LANGUAGE SQL;

CREATE AGGREGATE sum3(INTEGER) (
    sfunc = sum_sfunc,
    finalfunc = sum_ffunc,
    stype = intermediate_sum,
    initcond = '(0,3)'
);
The aggregate SUM3 wants at least 3 values, otherwise it returns (null). One can define other aggregates like SUM4 by changing the initcond, for example to '(0,4)'.

How to query the following scenario to count the number of users in Django?

I have a table in the database called fileupload_share.
+----+----------+----------+----------------+----------------------------------+
| id | users_id | files_id | shared_user_id | shared_date                      |
+----+----------+----------+----------------+----------------------------------+
| 3  | 1        | 1        | 2              | 2013-01-31 14:27:06.523908+00:00 |
| 2  | 1        | 1        | 2              | 2013-01-31 14:25:37.760192+00:00 |
| 4  | 1        | 3        | 2              | 2013-01-31 14:46:01.089560+00:00 |
| 5  | 1        | 1        | 3              | 2013-01-31 14:50:54.917337+00:00 |
+----+----------+----------+----------------+----------------------------------+
I want to count the number of distinct shared_user_id values for a given files_id.
For example, I want to find how many users the file with id 1 is shared with. The answer is 2 users (shared_user_id 2 and 3). How can I find that in Django?
file_id = 2  # here is your file_id variable
(fileupload_share.objects.filter(file_id=file_id)
 .order_by('shared_user_id').distinct('shared_user_id').count())
As the comments below say, this example doesn't work on MySQL, because distinct on a specific field is not supported there.
However, you can try danihp's method:
file_id = 2  # here is your file_id variable
(fileupload_share.objects.filter(file_id=file_id)
 .values_list('shared_user_id', flat=True).distinct().count())
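Another possible sketch (not from the original answers, so the model and field names are assumptions carried over from the code above) is Django's Count aggregate with distinct=True, which translates to COUNT(DISTINCT ...) and also works on MySQL:

from django.db.models import Count

file_id = 1  # here is your file_id variable
result = (fileupload_share.objects
          .filter(file_id=file_id)
          .aggregate(shared_users=Count('shared_user_id', distinct=True)))
# for the sample data above, result == {'shared_users': 2}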