Generate new values in dataframe based on other data - pandas

I am trying to calculate an additional column in a results dataframe based on a filter operation from another dataframe that does not match in size.
So, I have my source dataframe source_df:
| id | date       |
|----|------------|
| 1  | 2100-01-01 |
| 2  | 2021-12-12 |
| 3  | 2018-09-01 |
| 4  | 2100-01-01 |
and the target dataframe target_df. The lengths of the two dataframes and their sets of ids do not necessarily match:
| id  |
|-----|
| 1   |
| 2   |
| 3   |
| 4   |
| 5   |
| ... |
| 100 |
I actually want to find out which dates lie more than 30 days in the past.
To do so, I created a query
query = (pd.datetime.today() - pd.to_datetime(source_df["date"], errors="coerce")).dt.days > 30
ids = source_df[query]["id"]
--> ids = [2,3]
My intention is to calculate a column "date_in_past" that contains the values 0 and 1: if the date difference is greater than 30 days, a 1 is inserted, otherwise a 0.
The target_df should look like:
| id | date_in_past |
|----|--------------|
| 1  | 0            |
| 2  | 1            |
| 3  | 1            |
| 4  | 0            |
| 5  | 0            |
Again, the indices and lengths of the two dataframes do not match.
I tried to create a lambda function
map_query = lambda x: 1 if x in ids.values else 0
When I try to pass the target column, map_query(target_df["id"]), a ValueError is thrown: "lengths must match to compare".
How can I assign the new column "date_in_past" having the calculated values based on the source dataframe?
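A minimal sketch of one way to do this, using the source_df and target_df above (pd.Timestamp.today() replaces the deprecated pd.datetime.today()):

import pandas as pd

# flag dates lying more than 30 days in the past, as in the question
query = (pd.Timestamp.today() - pd.to_datetime(source_df["date"], errors="coerce")).dt.days > 30
ids = source_df.loc[query, "id"]

# isin checks each target id against the matching ids element-wise,
# so the two dataframes may have different lengths and indices
target_df["date_in_past"] = target_df["id"].isin(ids).astype(int)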

Related

How to use the agg function without dropping other columns in a dataframe?

Suppose I have a dataframe like below:
+-------------+----------+-----+------+
| key | Time |value|value2|
+-------------+----------+-----+------+
| 1 | 1 | 1 | 1 |
| 1 | 2 | 2 | 2 |
| 1 | 4 | 3 | 3 |
| 2 | 2 | 4 | 4 |
| 2 | 3 | 5 | 5 |
+-------------+----------+-----+------+
For each key, I want to keep the row with the smallest Time. For this case, this is the dataframe I want:
+----------+----------+-----+------+
| Time | key |value|value2|
+----------+----------+-----+------+
| 1 | 1 | 1 | 1 |
| 2 | 2 | 4 | 4 |
+----------+----------+-----+------+
I tried using groupBy and agg, but these operations drop the value columns. Is there a way to keep the value and value2 columns?
You can use struct to create a tuple containing the time column and everything you want to keep (the s column in the code below). Then, you can use s.* to unfold the struct and retrieve your columns.
val result = df
  .withColumn("s", struct('Time, 'value, 'value2))
  .groupBy("key")
  .agg(min('s) as "s")
  .select("key", "s.*")

// to order the columns the way you want:
val new_result = result.select("Time", "key", "value", "value2")
// or (note the filter must match the column name "Time" exactly)
val new_result = result.select("Time", df.columns.filter(_ != "Time"): _*)

Selecting rows of a table in which some special values exist

The first column of a table contains some Ids, and the values in the other columns are the numbers corresponding to those Ids. Given some special numbers, we want to select the rows in which these special numbers appear among the numbers corresponding to the Id. For example, suppose we have the following table and the special numbers are 2 and 5. We want to select the rows in which both 2 and 5 appear in the columns other than Id:
| Id | corresponded numbers |
|----|----------------------|
| 1  | 2, 3, 5              |
| 2  | 1, 5                 |
| 3  | 1, 2, 4, 5, 7        |
| 4  | 3, 5, 6              |
Therefore, we want to have the following table as the result:
| Id | corresponded numbers |
|----|----------------------|
| 1  | 2, 3, 5              |
| 3  | 1, 2, 4, 5, 7        |
Could you please suggest a function in Excel or a query in SQL to do the above selection?
SELECT id,
       [corresponded numbers]
FROM   TableName
WHERE  charIndex('2', [corresponded numbers]) > 0
  AND  charIndex('5', [corresponded numbers]) > 0
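One caveat with this approach: charIndex does plain substring matching, so charIndex('5', ...) would also match values such as 15 or 25. If such values can occur, the stored list needs delimiters (or has to be split) before the comparison.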

Clean Data Using SQL - Take Column Difference

I have data in SQL as follows:
Actual Table
+-------------+--------+------+
| Id | Weight | Type |
+-------------+--------+------+
| 00011223344 | 35 | A |
| 00011223344 | 10 | A |
| 12311223344 | 100 | B |
| 00034343434 | 25 | A |
| 00034343434 | 25 | A |
| 99934343434 | 200 | C |
| 88855667788 | 100 | D |
+-------------+--------+------+
Column ID always has a length of 11 and is of data type varchar. I need to create the columns Actual ID and Actual Weight from the table above.
Actual ID is derived from column ID. If an ID starts with 000, then we need to find an ID that does not start with 000 but whose remaining characters (i.e. the 8 characters from the right) are the same; that matched ID is the Actual ID. For example, looking at the first 3 IDs, the first 2 start with 000, and the ID that does not start with 000 but shares the same last 8 characters is in the 3rd row, i.e. 12311223344. Therefore, in the derived column Actual ID, the first 2 rows get 12311223344.
Actual Weight depends on the two columns ID and Weight. Using the grouping described above, for any ID that does not start with 000 but has matching entries that do start with 000, the Weight has to be recalculated: take the Weight of the non-000 ID and subtract the sum of the Weights of the matching IDs that start with 000.
For example, in the first 3 rows, the 3rd row has an ID starting with 123, and rows 1 and 2 share its last 8 digits but start with 000. For the rows starting with 000, Actual Weight is simply the Weight, but for the row starting with 123, Actual Weight is 100 - (35 + 10) = 55.
I am looking for a query that can create these 2 derived columns without creating any other table/view.
Desired Output
+-------------+-------------+--------+---------------+------+
| Id | Actual ID | Weight | Actual Weight | Type |
+-------------+-------------+--------+---------------+------+
| 00011223344 | 12311223344 | 35 | 35 | A |
| 00011223344 | 12311223344 | 10 | 10 | A |
| 12311223344 | 12311223344 | 100 | 55 | B |
| 00034343434 | 99934343434 | 25 | 25 | A |
| 00034343434 | 99934343434 | 25 | 25 | A |
| 99934343434 | 99934343434 | 200 | 150 | C |
| 88855667788 | 88855667788 | 100 | 100 | D |
+-------------+-------------+--------+---------------+------+
Hmmmm . . . If I'm following this:
select t.*,
       (case when id like '000%' then weight
             else weight - sum(case when id like '000%' then weight else 0 end)
                               over (partition by actual_id)
        end) as actual_weight
from (select t.*,
             max(id) over (partition by stuff(id, 1, 3, '')) as actual_id
      from t
     ) t;
Here is a db<>fiddle.
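To unpack the query a little: stuff(id, 1, 3, '') removes the first three characters, so the window is partitioned by the last 8 characters of the ID, and max(id) picks the ID in each partition that does not start with 000 (when one exists), since '000...' sorts before any other prefix. The outer case then subtracts the summed 000 weights only for that non-000 row.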

Cumulative count of unique values in pandas

I would like to cumulatively count unique values from a column in a pandas frame by week. For example, imagine that I have data like this:
df = pd.DataFrame({'user_id':[1,1,1,2,2,2],'week':[1,1,2,1,2,2],'module_id':['A','B','A','A','B','C']})
+---+---------+------+-----------+
| | user_id | week | module_id |
+---+---------+------+-----------+
| 0 | 1 | 1 | A |
| 1 | 1 | 1 | B |
| 2 | 1 | 2 | A |
| 3 | 2 | 1 | A |
| 4 | 2 | 2 | B |
| 5 | 2 | 2 | C |
+---+---------+------+-----------+
What I want is a running count of the number of unique module_ids up to each week, i.e. something like this:
+---+---------+------+-------------------------+
| | user_id | week | cumulative_module_count |
+---+---------+------+-------------------------+
| 0 | 1 | 1 | 2 |
| 1 | 1 | 2 | 2 |
| 2 | 2 | 1 | 1 |
| 3 | 2 | 2 | 3 |
+---+---------+------+-------------------------+
It is straightforward to do this as a loop, for example this works:
running_tally = {}
result = {}
for index, row in df.iterrows():
    if row['user_id'] not in running_tally:
        running_tally[row['user_id']] = set()
        result[row['user_id']] = {}
    running_tally[row['user_id']].add(row['module_id'])
    result[row['user_id']][row['week']] = len(running_tally[row['user_id']])
print(result)
{1: {1: 2, 2: 2}, 2: {1: 1, 2: 3}}
But my real data frame is enormous and so I would like a vectorised algorithm instead of a loop.
There's a similar sounding question here, but looking at the accepted answer (here) the original poster does not want uniqueness across dates cumulatively, as I do.
How would I do this vectorised in pandas?
The idea is to create a list of module_ids per (user_id, week) group, then use np.cumsum to concatenate the lists cumulatively within each user, and finally convert each value to a set and take its length:
df1 = (df.groupby(['user_id','week'])['module_id']
         .apply(list)
         .groupby(level=0)
         .apply(np.cumsum)
         .apply(lambda x: len(set(x)))
         .reset_index(name='cumulative_module_count'))
print(df1)
   user_id  week  cumulative_module_count
0        1     1                        2
1        1     2                        2
2        2     1                        1
3        2     2                        3
Jezrael's answer can be slightly improved by using pipe instead of apply(list), which should be faster, and then using np.unique instead of the trick with np.cumsum:
df1 = (df.groupby(['user_id', 'week']).pipe(lambda x: x.apply(np.unique))
         .groupby('user_id')
         .apply(np.cumsum)
         .apply(np.sum)
         .apply(lambda x: len(set(x)))
         .rename('cumulated_module_count')
         .reset_index(drop=False))
print(df1)
   user_id  week  cumulated_module_count
0        1     1                       2
1        1     2                       2
2        2     1                       1
3        2     2                       3
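For very large frames, a rough alternative sketch (assuming rows are ordered by week within each user, as in the example) avoids building per-group lists entirely: mark the first occurrence of each (user_id, module_id) pair and cumulatively sum those marks per user:

import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 2],
                   'week': [1, 1, 2, 1, 2, 2],
                   'module_id': ['A', 'B', 'A', 'A', 'B', 'C']})

# True only the first time a (user_id, module_id) pair appears
first_seen = ~df.duplicated(['user_id', 'module_id'])

out = (first_seen.groupby([df['user_id'], df['week']]).sum()   # newly seen modules per week
                 .groupby(level='user_id').cumsum()            # running total per user
                 .reset_index(name='cumulative_module_count'))
print(out)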

How to query the following scenario to count the number of users in Django?

I have a table in the database called fileupload_share.
+----+----------+----------+----------------+----------------------------------+
| id | users_id | files_id | shared_user_id | shared_date |
+----+----------+----------+----------------+----------------------------------+
| 3 | 1 | 1 | 2 | 2013-01-31 14:27:06.523908+00:00 |
| 2 | 1 | 1 | 2 | 2013-01-31 14:25:37.760192+00:00 |
| 4 | 1 | 3 | 2 | 2013-01-31 14:46:01.089560+00:00 |
| 5 | 1 | 1 | 3 | 2013-01-31 14:50:54.917337+00:00 |
I want to count the number of distinct shared_user_id values for a given file.
For example, I want to find out how many users the file with id 1 is shared with; the answer is 2 users (shared_user_id). How can I do that in Django?
file_id = 2  # here is your file_id variable
(fileupload_share.objects.filter(file_id=file_id)
    .order_by('shared_user_id')
    .distinct('shared_user_id')
    .count())
As the comments below say, this example doesn't work on MySQL, because calling distinct() with a field name is only supported on PostgreSQL.
However you can try danihp's method:
file_id = 2  # here is your file_id variable
(fileupload_share.objects.filter(file_id=file_id)
    .values_list('shared_user_id', flat=True)
    .distinct()
    .count())
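Another sketch of the same count, assuming the model and field names used in the answers above, is a single aggregate with a distinct Count, which also works on MySQL:

from django.db.models import Count

file_id = 1  # the file whose shared users we want to count
num_users = fileupload_share.objects.filter(file_id=file_id).aggregate(
    n=Count('shared_user_id', distinct=True)
)['n']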