Create a query that groups by multiple categories? - sql

I have these:
colnames(w)
[1] "user_id" "install_date" "app_version" "user_session_id"
[5] "event_timestamp" "app_page" "time_seconds"
I want to get the mean session time per app_page (there are 3 overall). Since there are 3 app versions, I would like to plot all 3 pages with the average app time spent on them, per version.
This is what I did:
df=sqldf('select app_version,app,round(avg(time_seconds),0)
as time_app from w group by app_version')
df
which gives this:
app_version app_page time_app
1 v1 build 1019
2 v2 learn 910
3 v3 learn 966
but it doesn't look correct.
If I try this, though:
df1=sqldf('select app_version,app,round(avg(time_seconds),0) as time_app from w group by app')
df1
app_version app_page time_app
1 v2 build 1001
2 v2 draw 727
3 v2 learn 982
I think it's correct, but it lumps all the versions together instead of giving each one separately.
Trying to plot it.
sw <- ggplot(data=df1, aes(x=app, y=time_app)) +
  geom_bar(stat="identity") + facet_grid(app_version ~ .)
sw
How can I change the SQL query so that it gives the proper result, and thus the plot shows each version with the average time of every app page?

plot would provide each version with the avg time of every app page
This sounds like an aggregation along two dimensions:
select app, app_version, round(avg(time_seconds), 0) as time_app
from w
group by app, app_version
order by app, app_version;
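Outside sqldf, the same two-dimensional aggregation can be sketched in pandas for comparison; the small w data frame below is invented purely for illustration, since the real data isn't shown:
import pandas as pd

# invented sample with the same columns as w (values are illustrative only)
w = pd.DataFrame({
    "app_version":  ["v1", "v1", "v2", "v2", "v3", "v3"],
    "app_page":     ["build", "learn", "build", "draw", "learn", "learn"],
    "time_seconds": [1019, 950, 1001, 727, 982, 966],
})

# group along both dimensions, then take the rounded mean, mirroring the SQL above
time_app = (
    w.groupby(["app_page", "app_version"], as_index=False)["time_seconds"]
     .mean()
     .round(0)
     .rename(columns={"time_seconds": "time_app"})
)
print(time_app)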


PySpark Grouping and Aggregating based on A Different Column?

I'm working on a problem where I have a dataset in the following format (real data replaced for example purposes):
session  activity        timestamp
1        enter_store     2022-03-01 23:25:11
1        pay_at_cashier  2022-03-01 23:31:10
1        exit_store      2022-03-01 23:55:01
2        enter_store     2022-03-02 07:15:00
2        pay_at_cashier  2022-03-02 07:24:00
2        exit_store      2022-03-02 07:35:55
3        enter_store     2022-03-05 11:07:01
3        exit_store      2022-03-05 11:22:51
I would like to be able to compute counting statistics for these events based on the pattern observed within each session. For example, based on the table above, the count of each pattern observed would be as follows:
{
'enter_store -> pay_at_cashier -> exit_store': 2,
'enter_store -> exit_store': 1
}
I'm trying to do this in PySpark, but I'm having some trouble figuring out the most efficient way to do this kind of pattern matching where some steps are missing. The real problem involves a much larger dataset of ~15M+ events like this.
I've tried filtering the entire DF for unique sessions where 'enter_store' is observed, and then filtering that DF for unique sessions where 'pay_at_cashier' is observed. That works fine; the only issue is that I'm having trouble thinking of ways to count sessions like 3, where there is only a starting step and a final step but no middle step.
Obviously one way to do this brute-force would be to iterate over each session and assign it a pattern and increment a counter, but I'm looking for more efficient and scalable ways to do this.
Would appreciate any suggestions or insights.
For Spark 2.4+, you could do
import pyspark.sql.functions as F

df = (df
    .withColumn("flow", F.expr("sort_array(collect_list(struct(timestamp, activity)) over (partition by session))"))
    .withColumn("flow", F.expr("concat_ws(' -> ', transform(flow, v -> v.activity))"))
    .groupBy("flow").agg(F.countDistinct("session").alias("total_session"))
)
df.show(truncate=False)
# +-------------------------------------------+-------------+
# |flow |total_session|
# +-------------------------------------------+-------------+
# |enter_store -> pay_at_cashier -> exit_store|2 |
# |enter_store -> exit_store |1 |
# +-------------------------------------------+-------------+
The first withColumn collects the (timestamp, activity) pairs for each session into an array and sorts it by timestamp (make sure timestamp really is a timestamp type). The second withColumn keeps only the activity values from that array with the transform function and joins them into a single string with concat_ws. Finally, grouping by that flow string and counting distinct sessions gives the number of sessions per pattern.
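If you would rather avoid the window expression, a roughly equivalent sketch (column names as in the question; flows is just a new name for the result) groups by session first and sorts the collected structs with array_sort:
import pyspark.sql.functions as F

# one row per session: collect (timestamp, activity) structs, sorted by timestamp
# (timestamp is the first struct field, so array_sort orders by it), then build the flow string
flows = (df
    .groupBy("session")
    .agg(F.array_sort(F.collect_list(F.struct("timestamp", "activity"))).alias("flow"))
    .withColumn("flow", F.expr("concat_ws(' -> ', transform(flow, v -> v.activity))"))
    .groupBy("flow")
    .agg(F.count("*").alias("total_session"))
)
flows.show(truncate=False)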

How to update record in R - problem with sqldf

I would like to change some records in my table. I think the easiest way is to use sqldf and UPDATE. But when I use it I get a warning (the table b isn't empty):
c<-sqldf("UPDATE b
SET l_all = ''
where id='12293' ")
# In result_fetch(res@ptr, n = n) :
#   SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
Can you help me change the chosen records in the easiest way?
The query worked but there are several possible problems:
The message is a spurious warning, not an error, caused by backward-incompatible changes to RSQLite. You can ignore the warning or use the sqldf2 workaround here: https://github.com/ggrothendieck/sqldf/issues/40
The SQL UPDATE command does not return anything, so one would not expect the command shown in the question to return anything. To get the updated values back, ask for them.
1) Using the built-in BOD data frame, defining sqldf2 from (1), and taking into account (2), we have:
sqldf2(c("update BOD set demand = 0 where Time = 1", "select * from BOD"))
giving:
Time demand
1 1 0.0
2 2 10.3
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8
2) Another approach is to use a select, giving the same result:
sqldf("select Time, iif(Time == 1, 0, demand) demand from BOD")

Sum duplicate bigrams in dataframe

I currently have a data frame that contains values such as:
Bigram Frequency
0 (ice, cream) 23
1 (cream, sandwich) 21
2 (google, android) 19
3 (galaxy, nexus) 14
4 (android, google) 12
There are values in there that I want to merge, like (google, android) and (android, google); there are others like (ice, cream) and (cream, sandwich), but that's a different problem.
In order to sum up the duplicates I tried to do this:
def remove_duplicates(ngrams):
    return {" ".join(sorted(key.split(" "))): ngrams[key] for key in ngrams}

freq_all_tw_pos_bg['Word'] = freq_all_tw_pos_bg['Word'].apply(remove_duplicates)
I looked around and found similar exercises marked as correct answers, but when I try to do the same I get:
TypeError: tuple indices must be integers or slices, not str
That makes sense, but when I tried converting the tuples to strings it shuffled the bigrams in a weird way, so I wonder: am I missing something that should be easier?
EDIT:
The input is the first table I show: a list of bigrams, some of which are repeated because the words in them are reversed, i.e. (google, android) vs (android, google).
I want the same output (a dataframe with the bigrams), but with the frequencies of the reversed bigrams summed up. If I take the list from above and process it, it should output:
Bigram Frequency
0 (ice, cream) 23
1 (cream, sandwich) 21
2 (google, android) 31
3 (galaxy, nexus) 14
4 (apple, iPhone) 6
Notice how it "merged" (google, android) and (android, google) and also summed up the frequencies.
If there are tuples, use sorted and convert back to tuples:
freq_all_tw_pos_bg['Bigram'] = freq_all_tw_pos_bg['Bigram'].apply(lambda x:tuple(sorted(x)))
print (freq_all_tw_pos_bg)
Bigram Frequency
0 (cream, ice) 23
1 (cream, sandwich) 21
2 (android, google) 31
3 (galaxy, nexus) 14
4 (apple, iPhone) 6
And then aggregate with sum:
df = freq_all_tw_pos_bg.groupby('Bigram', as_index=False)['Frequency'].sum()
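Putting both steps together, here is a minimal self-contained sketch with made-up data matching the example above (the real freq_all_tw_pos_bg would come from your own pipeline):
import pandas as pd

# made-up data: (android, google) is a reversed duplicate of (google, android)
freq_all_tw_pos_bg = pd.DataFrame({
    "Bigram": [("ice", "cream"), ("cream", "sandwich"), ("google", "android"),
               ("galaxy", "nexus"), ("android", "google"), ("apple", "iPhone")],
    "Frequency": [23, 21, 19, 14, 12, 6],
})

# normalise each bigram so reversed duplicates compare equal
freq_all_tw_pos_bg["Bigram"] = freq_all_tw_pos_bg["Bigram"].apply(lambda x: tuple(sorted(x)))

# sum the frequencies of the now-identical bigrams
df = freq_all_tw_pos_bg.groupby("Bigram", as_index=False)["Frequency"].sum()
print(df)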

Grouping nearby data in pandas

Let's say I have the following dataframe:
df = pd.DataFrame({'a':[1,1.1,1.03,3,3.1], 'b':[10,11,12,13,14]})
df
a b
0 1.00 10
1 1.10 11
2 1.03 12
3 3.00 13
4 3.10 14
And I want to group nearby points, e.g.
df.groupby(#SOMETHING).mean():
a b
a
0 1.043333 11.0
1 3.050000 13.5
Now, I could use
#SOMETHING = pd.cut(df.a, np.arange(0, 5, 2), labels=False)
But only if I know the boundaries beforehand. How can I accomplish similar behavior if I don't know where to place the cuts? I.e. I want to group nearby points (with "nearby" defined as within some epsilon).
I know this isn't trivial, because point x might be near point y, and point y might be near point z, but point x might be too far from z; so it's ambiguous what to do. This is kind of a k-means problem, but I'm wondering if pandas has any tools built in to make this easy.
Use case: I have several processes that generate data on regular intervals, but they're not quite synced up, so the timestamps are close, but not identical, and I want to aggregate their data.
Based on this answer
df.groupby( (df.a.diff() > 1).cumsum() ).mean()
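A minimal sketch of that approach, spelling out its assumptions: df must be sorted by a, and the threshold (1 here) plays the role of the epsilon that separates groups:
import pandas as pd

df = pd.DataFrame({'a': [1, 1.1, 1.03, 3, 3.1], 'b': [10, 11, 12, 13, 14]})

# sort by the grouping column so consecutive differences are meaningful
df = df.sort_values('a')

# start a new group whenever the gap to the previous point exceeds the threshold
eps = 1
groups = (df.a.diff() > eps).cumsum()
print(df.groupby(groups).mean())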

Pandas: aggregation on multi-level groups

I have a df that looks something like this:
batch group reading temp test block delay
0 9551 Control 340 22.9 1 X 35
1 9551 Control 345 22.9 1 Y 35
I need to group by 'group' and 'block', e.g. my means would look like so:
df.groupby(['block', 'group']).reading.mean().unstack().transpose()
block X Y
group
Control 347.339450 350.427273
Trial 347.790909 350.668182
What would be the best way to call a two-argument function like scipy.stats.ttest_ind on data sliced this way, so I end up with a table of t-tests for
control vs trial in x
control vs trial in y
x vs y in control
x vs y in trial
Do you want to group and aggregate the data before applying the t-test? I think you want to select subsets of the data. Grouping can do that, but masking might get the job done more simply.
Offhand, I'd say you want something like
scipy.stats.ttest_ind(df[(df.group == 'Control') & (df.block == 'X')].reading,
                      df[(df.group == 'Trial') & (df.block == 'X')].reading)
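A small sketch extending the same masking idea to all four comparisons listed in the question (the cell helper is just an illustrative name, assuming df holds the group, block and reading columns shown above):
import scipy.stats

# helper (hypothetical name): pull out the reading values for one (group, block) cell
def cell(df, group, block):
    return df[(df.group == group) & (df.block == block)].reading

comparisons = {
    "control vs trial in X": (cell(df, "Control", "X"), cell(df, "Trial", "X")),
    "control vs trial in Y": (cell(df, "Control", "Y"), cell(df, "Trial", "Y")),
    "X vs Y in control":     (cell(df, "Control", "X"), cell(df, "Control", "Y")),
    "X vs Y in trial":       (cell(df, "Trial", "X"), cell(df, "Trial", "Y")),
}

for name, (left, right) in comparisons.items():
    t, p = scipy.stats.ttest_ind(left, right)
    print(f"{name}: t={t:.3f}, p={p:.3f}")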