Why do we consider time complexity as O(Logn) if a matrix is powered by a constant value? - time-complexity

I am trying to think about it, and I don't get it.
If the matrix size is 1x1, then raising it to the power n is O(n).
If the matrix size is 2x2, then squaring it goes like this:
|a b|   |a b|   |aa + bc   ab + bd|
|c d| X |c d| = |ca + dc   cb + dd|
That is 8 multiplications and 4 additions, so one 2x2 matrix multiplication costs 12 scalar operations overall.
I don't see the connection to O(log n)?
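For what it's worth, the O(log n) usually refers to the number of matrix multiplications when the power is computed by exponentiation by squaring: for a fixed k x k matrix each multiplication costs a constant number of scalar operations (the 12 you counted for 2x2), and only about log2(n) such multiplications are needed instead of n-1. A minimal sketch in Python (plain lists; the helper names are mine):

def mat_mult(A, B):
    # multiply two k x k matrices; O(k^3) scalar ops, a constant for fixed k
    k = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(k)]
            for i in range(k)]

def mat_pow(A, n):
    # compute A**n by repeated squaring: O(log n) matrix multiplications
    k = len(A)
    result = [[1 if i == j else 0 for j in range(k)] for i in range(k)]  # identity
    while n > 0:
        if n & 1:                       # current bit of n is set
            result = mat_mult(result, A)
        A = mat_mult(A, A)              # A, A^2, A^4, A^8, ...
        n >>= 1
    return result

# mat_pow([[1, 1], [1, 0]], 1000) does ~10 squarings rather than 999 multiplications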

Related

How to Pivot multiple columns in pyspark similar to pandas

I want to perform an operation in PySpark similar to what is possible in pandas.
My dataframe is:
(index) | Year | win_loss_date       | Deal      | L2 GFCID Name       | L2 GFCID | GFCID   | GFCID Name          | Client Priority | Location      | Deal Location | Revenue      | Deal Conclusion | New/Rebid
0       | 2021 | 2021-03-08 00:00:00 | 1-2JZONGU | TEST GFCID CREATION | P-1-P1DO | P-1-P5O | TEST GFCID CREATION | None            | UNITED STATES | UNITED STATES | 4567.0000000 | Won             | New
In pandas, the code to pivot is:
df = pd.pivot_table(deal_df_pandas,
                    index=['GFCID', 'GFCID Name', 'Client Priority'],
                    columns=['New/Rebid', 'Year', 'Deal Conclusion'],
                    aggfunc={'Deal': 'count',
                             'Revenue': 'sum',
                             'Location': lambda x: set(x),
                             'Deal Location': lambda x: set(x)}).reset_index()
columns=['New/Rebid', 'Year', 'Deal Conclusion'] (these are the columns being pivoted)
Output I get and expected:
GFCID GFCID Name Client Priority Deal Revenue
New/Rebid New Rebid New Rebid
Year 2020 2021 2020 2021 2020 2021 2020 2021
Deal Conclusion Lost Won Lost Won Lost Won Lost Won Lost Won Lost Won Lost Won Lost Won
0 0000000752 ARAMARK SERVICES INC Bronze NaN 1.0 1.0 2.0 NaN NaN NaN NaN NaN 1600000.0000000 20.0000000 20000.0000000 NaN NaN NaN NaN
What I want is to convert the above code to PySpark. What I am trying is not working:
from pyspark.sql import functions as F

df_pivot2 = (df_d1
    .groupby('GFCID', 'GFCID Name', 'Client Priority')
    .pivot('New/Rebid')
    .agg(F.first('Year'), F.first('Deal Conclusion'), F.count('Deal'), F.sum('Revenue')))
because this operation is not possible in PySpark:
(df_d1
    .groupby('GFCID', 'GFCID Name', 'Client Priority')
    .pivot('New/Rebid', 'Year', 'Deal Conclusion'))  # --error
You can concatenate the multiple columns into a single column, which can then be used within pivot.
Consider the following example:
data_sdf.show()
# +---+-----+--------+--------+
# | id|state| time|expected|
# +---+-----+--------+--------+
# | 1| A|20220722| 1|
# | 1| A|20220723| 1|
# | 1| B|20220724| 2|
# | 2| B|20220722| 1|
# | 2| C|20220723| 2|
# | 2| B|20220724| 3|
# +---+-----+--------+--------+
from pyspark.sql import functions as func

data_sdf. \
    withColumn('pivot_col', func.concat_ws('_', 'state', 'time')). \
    groupBy('id'). \
    pivot('pivot_col'). \
    agg(func.sum('expected')). \
    fillna(0). \
    show()
# +---+----------+----------+----------+----------+----------+
# | id|A_20220722|A_20220723|B_20220722|B_20220724|C_20220723|
# +---+----------+----------+----------+----------+----------+
# | 1| 1| 1| 0| 2| 0|
# | 2| 0| 0| 1| 3| 2|
# +---+----------+----------+----------+----------+----------+
The input dataframe had 2 fields, state and time, that were to be pivoted. They were concatenated with a '_' delimiter and used within pivot. After that, you can use multiple aggregations within the agg, per your requirements.
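Applied to the dataframe in the question, the same idea might look like the sketch below (column names are taken from the pandas snippet, collect_set stands in for the lambda x: set(x) aggregations, and this is untested):

from pyspark.sql import functions as F

df_pivot2 = (df_d1
    .withColumn('pivot_col', F.concat_ws('_', 'New/Rebid', 'Year', 'Deal Conclusion'))
    .groupby('GFCID', 'GFCID Name', 'Client Priority')
    .pivot('pivot_col')
    .agg(F.count('Deal').alias('Deal'),
         F.sum('Revenue').alias('Revenue'),
         F.collect_set('Location').alias('Location'),
         F.collect_set('Deal Location').alias('Deal_Location')))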

Can Spark SQL refer to the first row of the previous window / group?

I have a kind of event stream that looks like this:
Time UserId SessionId EventType EventData
1    2      A         Load      /a ...
2    1      B         Impressn  X ...
3    2      A         Impressn  Y ...
4    1      B         Load      /b ...
5    2      A         Load      /info ...
6    1      B         Load      /about ...
7    2      A         Impressn  Z ...
In practice users can have many sessions over larger time windows, and there's also a click event type, but keeping this simple here: I'm trying to see which (page view) loads lead to the next load, and also what impressions happened, in aggregate.
So, without SQL, I've loaded this, grouped by user, sequenced by time, and for each session marked each row with the previous load's info (if any), with a
val outDS = logDataset.groupByKey(_.UserId)
.flatMapGroups((_, iter) => gather(iter))
where gather sorts the iter by time (which might be redundant, as the input is sorted by time), then iterates over the sequence, sets lastLoadData to null at each new session, adds lastLoadData to each row, and updates lastLoadData to the data of the current row if the row is a Load type, producing something like:
Time UserId SessionId EventType EventData  LastLoadData
1    2      A         Load      /a ...     null
2    1      B         Impressn  X ...      null
3    2      A         Impressn  Y ...      /a ...
4    1      B         Load      /b ...     null
5    2      A         Load      /info ...  /a ...
6    1      B         Load      /about ... /b ...
7    2      A         Impressn  Z ...      /info ...
This allows me to then aggregate which (page view) loads lead to which other loads, or, for each (page) load, what the top 5 Impressn events are.
outDS.createOrReplaceTempView(tempTable)
val journeyPageViews = sparkSession.sql(
s"""SELECT lastLoadData, EventData,
| count(distinct UserId) as users,
| count(distinct SessionId) as sessions
|FROM ${tempTable}
|WHERE EventType='Load'
|GROUP BY lastLoadData, EventData""".stripMargin)
But I get the feeling that adding a lastLoadData column could be done using Spark SQL windows too; however, I'm hung up on two parts of that:
If I make a window over UserId+SessionId ordered by time, how do I have it apply to all events but look back only at the previous Load event? (E.g. Impressn would get a new lastLoadData column assigned to this window's previous Load EventData.)
If I somehow make a new window per session's Load event (also not sure how), the Load event at the start of the window (presumably "first") should get the lastLoadData of the previous window's "first", so that's probably not the right way to do it either.
You can mask the rows that are not Load with null using case when, and get LastLoadData using last with ignoreNulls set to true:
logDataset.createOrReplaceTempView("table")
val logDataset2 = spark.sql("""
select
*,
last(case when EventType = 'Load' then EventData end, true)
over (partition by UserId, SessionId
order by Time
rows between unbounded preceding and 1 preceding) LastLoadData
from table
order by time
""")
logDataset2.show
+----+------+---------+---------+----------+------------+
|Time|UserId|SessionId|EventType| EventData|LastLoadData|
+----+------+---------+---------+----------+------------+
| 1| 2| A| Load| /a ...| null|
| 2| 1| B| Impressn| X ...| null|
| 3| 2| A| Impressn| Y ...| /a ...|
| 4| 1| B| Load| /b ...| null|
| 5| 2| A| Load| /info ...| /a ...|
| 6| 1| B| Load|/about ...| /b ...|
| 7| 2| A| Impressn| Z ...| /info ...|
+----+------+---------+---------+----------+------------+
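For reference, the same logic can also be expressed through the DataFrame API; a sketch in PySpark (the Scala version is analogous, and the variable names are just reused from the question):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = (Window.partitionBy('UserId', 'SessionId')
           .orderBy('Time')
           .rowsBetween(Window.unboundedPreceding, -1))  # all rows strictly before the current one

logDataset2 = logDataset.withColumn(
    'LastLoadData',
    F.last(F.when(F.col('EventType') == 'Load', F.col('EventData')),
           ignorenulls=True).over(w))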

Transpose wide dataframe to long dataframe

I have a data frame that looks like:
Region, 2000Q1, 2000Q2, 2000Q3, ...
A, 1,2,3,...
I want to transpose this wide table to a long table by 'Region'. So the final product will look like:
Region, Time, Value
A, 2000Q1,1
A, 2000Q2, 2
A, 2000Q3, 3
A, 2000Q4, 4
....
The original table has a very wide array of columns, but the aggregation level is always Region and the remaining columns are to be transposed.
Do you know an easy way or function to do this?
Try the arrays_zip function, then explode the array.
Example:
from pyspark.sql.functions import *

df = spark.createDataFrame([('A',1,2,3)], ['Region','2000q1','2000q2','2000q3'])

cols = [c for c in df.columns if c != 'Region']   # the value columns to unpivot
col_name = '|'.join(cols)                         # '2000q1|2000q2|2000q3'

df.withColumn("cc", explode(arrays_zip(array(*cols), split(lit(col_name), "\\|")))).\
    select("Region", "cc.*").\
    toDF(*['Region','Value','Time']).\
    show()
#+------+-----+------+
#|Region|Value| Time|
#+------+-----+------+
#| A| 1|2000q1|
#| A| 2|2000q2|
#| A| 3|2000q3|
#+------+-----+------+
Similar but improved for the column calculation.
cols = df.columns
cols.remove('Region')
import pyspark.sql.functions as f
df.withColumn('array', f.explode(f.arrays_zip(f.array(*map(lambda x: f.lit(x), cols)), f.array(*cols), ))) \
.select('Region', 'array.*') \
.toDF('Region', 'Time', 'Value') \
.show(30, False)
+------+------+-----+
|Region|Time |Value|
+------+------+-----+
|A |2000Q1|1 |
|A |2000Q2|2 |
|A |2000Q3|3 |
|A |2000Q4|4 |
|A |2000Q5|5 |
+------+------+-----+
p.s. Don't accept this as an answer :)
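As a side note (not part of either answer above), Spark's stack expression is another common way to unpivot, without arrays_zip; a sketch assuming the same df and cols as defined above:

# build "stack(n, '2000q1', `2000q1`, '2000q2', `2000q2`, ...)" dynamically
stack_expr = "stack({}, {}) as (Time, Value)".format(
    len(cols),
    ", ".join("'{0}', `{0}`".format(c) for c in cols))

df.selectExpr("Region", stack_expr).show()
# expected output (roughly):
# +------+------+-----+
# |Region|  Time|Value|
# +------+------+-----+
# |     A|2000q1|    1|
# |     A|2000q2|    2|
# |     A|2000q3|    3|
# +------+------+-----+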

Appropriate idea or SQL to obtain the result set

Fig 1
TxnId   | TxnTypeId | BranchId | TxnNumber | LocalAmount  | ItemName
--------|-----------|----------|-----------|--------------|---------
1777486 | 101       | 1099     | 1804908   | 65.20000000  | A
1777486 | 101       | 1099     | 1804908   | 324.50000000 | B
1777486 | 101       | 1099     | 1804908   | 97.20000000  | C
1777486 | 101       | 1099     | 1804908   | 310.00000000 | D
1777486 | 101       | 1099     | 1804908   | 48.90000000  | E
Fig 2
TxnId |TxnTypeId |BankId |Number |Check |Bank |Cash |Wallet
--------|-----------|-------|--------|-------|------|------|------
1777486 |101 |1099 |1804908 | 48.9 | 310 |389.7 |97.2
Fig 3 (Expected Output)
TxnId   |BankId |ItemName |Amount |Wallet |Bank |Check |Cash
--------|-------|---------|-------|-------|-----|------|------
1777486 |1099   |A        |65.2   |0      |0    |0     |65.2
1777486 |1099   |B        |324.5  |0      |0    |0     |324.5
1777486 |1099   |C        |97.2   |97.2   |0    |0     |0
1777486 |1099   |D        |48.9   |0      |0    |48.9  |0
1777486 |1099   |E        |310    |0      |310  |0     |0
I have two different result sets that are obtained from different queries,
Fig 1 and Fig 2.
The result I want is shown in Fig 3.
Currently I do not have a flag to identify the payment mode used for each item in the transaction; I only have the flags for the complete transaction.
Fig 4
IndividualTxnPaymentDetailId| IndividualTxnId |PaymentAmount |PaymentMode
---------------------------:|:-----------------:|:-------------:|:--------------
2106163 | 1777486 |389.70000000 | Cash
2106164 | 1777486 |97.20000000 | Wallet
2106165 | 1777486 |310.00000000 | Bank
2106166 | 1777486 |48.90000000 | Check
This means that if two or more items are purchased using one payment mode, I do not have a proper way of identifying the payment made for each item.
Items A and B were purchased using Cash as the payment mode, with amounts 65.2 and 324.5; the total Cash paid is 389.7.
Item C was purchased using Wallet as the payment mode, with amount 97.2; the total Wallet amount is 97.2.
Fig 5
TxnId |LocalAmount |ItemName
--------|--------------:|:------------
1777486 |65.20000000 | A
1777486 |324.50000000 | B
1777486 |97.20000000 | C
1777486 |310.00000000 | D
1777486 |48.90000000 | E
The queries by which I generated the results in Fig 4 and Fig 5:
select IndividualTxnPaymentDetailId, IndividualTxnId, PaymentAmount, cc.choicecode as PaymentMode
from dbo.IndividualTxnPaymentDetail it
inner join configchoice cc on cc.configchoiceid= it.configpaymentmodeid
where IndividualTxnId = 1777486
select IndividualTxnId as TxnId, LocalAmount, CurrencyName from dbo.IndividualTxnFCYDetail where IndividualTxnId = 1777486
This is the expression written to identify the amount paid through Bank. Similarly, I wanted to get the amounts for all the payment modes, but I could not obtain them properly.
CASE
WHEN tpm.Bank - SUM(txn.LocalAmount) OVER (PARTITION BY txn.BranchId, txn.TxnNumber ORDER BY CAST(txn.ItemName AS varchar(300))) + txn.LocalAmount < 0 THEN 0
WHEN tpm.Bank - SUM(txn.LocalAmount) OVER (PARTITION BY txn.BranchId, txn.TxnNumber ORDER BY CAST(txn.ItemName AS varchar(300))) + txn.LocalAmount > txn.LocalAmount THEN txn.LocalAmount
WHEN tpm.Bank - SUM(txn.LocalAmount) OVER (PARTITION BY txn.BranchId, txn.TxnNumber ORDER BY CAST(txn.ItemName AS varchar(300))) + txn.LocalAmount > tpm.Bank THEN tpm.Bank
ELSE tpm.Bank - SUM(txn.LocalAmount) OVER (PARTITION BY txn.BranchId, txn.TxnNumber ORDER BY CAST(txn.ItemName AS varchar(300))) + txn.LocalAmount
END AS Bank,
Can you help me with an idea, or with some SQL, to get the result set as in Fig 3?
Updated Question - Updated Response
I read your updated question and I'm afraid the problem still stands. Neither of those queries is summing the data; they are just pulling the same already-summed numbers. You would either need to get at the numbers prior to the aggregation happening, or to have some column in your IndividualTxnPaymentDetail table that ties each row to its counterpart rows in the other table (presumably through a cross table, as in Row 1 : ItemName A, Row 1 : ItemName B, Row 2 : ItemName C, etc.).
If these are simply impossible, then perhaps you're approaching this the wrong way, or, to put it better, perhaps you are being asked to do something that doesn't make sense, and provably so. If there is no direct relationship between these activities in the data, there's not much you can be expected to do. What's more, it may indicate that your organization doesn't 'think' about them that way.
These two tables seem to be payments and liabilities. Perhaps consider an approach where each payment goes toward whatever the oldest outstanding balance is, and payments are matched to the items in Fig 4 that way. Add a column to the details table to store the payment toward that item. Rather than a simple Paid/Unpaid Boolean, I would store the amount of payment that has been applied toward each item, or the amount still owed on each item; that way you can handle partially applied payments. As payments come in, apply them. You would likely want a similar column in the payments table too, to measure the amount of each payment that you have applied; that way you can handle over-payments and know the status of things such as pending receipts in the case that payments aren't applied immediately.
I hope this helps.
Fundamental Flaw
Your question is looking to take aggregated data (in your example, the Fig 2 Cash total of 389.7) and tease out which numbers were totaled to get the sum. You can do it here since 3 of the 4 numbers in Fig 2 are unique, one-to-one matches with numbers in Fig 1, meaning the remaining ones have to belong to each other. But imagine 100s of numbers, many or most of them sums (i.e. not one-to-one matches like most of these). Or imagine an example as simple as yours except the numbers aren't so unique (e.g. Fig 1 = (10, 10, 10, 10, 20) and Fig 2 = (10, 20, 20, 10): it is not possible to say which ones are which). There need only be two possible combinations that could be responsible for a particular sum for the results to become ambiguous.
The weakness is in Fig 2. Do you have any control over that data source? Can you grab the numbers upstream, before they are totaled?
Sorry for the negative conclusion but...
I hope this helps.
The Continuing Saga
Comment: [A version of this] report has already been made ...[but] I cannot contact the person who actually wrote that thing.
Perhaps he was also asked to do something that didn't make sense but did it anyway. The math simply doesn't work. He may have written something that finds as many one-to-one matches as it can and then sort of rolls the dice on the rest of it. He may have done something like the following:
Find and eliminate all the one-to-one matches.
Take any total and subtract any item amount from it to see if the remainder matches any remaining item amount(s); if so, arbitrarily pick one and eliminate all three numbers.
Repeat this until all combinations have been tested.
But you are still potentially left with unmatched numbers, so you next need to test for sums of three numbers: arbitrarily subtract any two item amounts from any of the remaining totals, and so on and so on, followed by testing for sums of four items, and so on.
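A rough Python sketch of that kind of matching, purely to illustrate the process described above (the amounts are the ones from Fig 1 and Fig 4; the function and variable names are mine, and ambiguity is resolved arbitrarily, exactly as noted):

from itertools import combinations

items = [65.2, 324.5, 97.2, 310.0, 48.9]   # Fig 1 line amounts
payments = [389.7, 97.2, 310.0, 48.9]      # Fig 4 payment totals

def match_payments(items, payments, max_group=4, tol=0.01):
    # try one-to-one matches first, then pairs, triples, ...; arbitrary when several fit
    remaining = list(items)
    assignment = {}
    for size in range(1, max_group + 1):
        for idx, total in enumerate(payments):
            if idx in assignment:
                continue
            for combo in combinations(remaining, size):
                if abs(sum(combo) - total) <= tol:
                    assignment[idx] = combo
                    for amount in combo:
                        remaining.remove(amount)
                    break
    return assignment, remaining

assignment, unmatched = match_payments(items, payments)
# e.g. 97.2 -> (97.2,); 310.0 -> (310.0,); 48.9 -> (48.9,); 389.7 -> (65.2, 324.5)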
I think part of what you're looking for is buried in here:
http://www.itprotoday.com/software-development/algorithms-still-matter
It calls it 'order fulfillment', where you go through transactions, combining them until you reach a given total.
I think the solution will be in multiple parts, including cursors etc.
I'm not convinced you would be able to understand or implement any solution posted. Also, I maintain that there are cases where there are ambiguous solutions.
Lastly I see you have asked 16 questions and not marked a single one as answered.

Spark-SQL Window functions on Dataframe - Finding first timestamp in a group

I have the below dataframe (say, UserData).
uid region timestamp
a 1 1
a 1 2
a 1 3
a 1 4
a 2 5
a 2 6
a 2 7
a 3 8
a 4 9
a 4 10
a 4 11
a 4 12
a 1 13
a 1 14
a 3 15
a 3 16
a 5 17
a 5 18
a 5 19
a 5 20
This data is nothing but a user (uid) travelling across different regions (region) at different times (timestamp). Presently, timestamp is shown as 'int' for simplicity. Note that the above dataframe will not necessarily be in increasing order of timestamp. Also, there may be some rows in between from different users. I have shown the dataframe for a single user only, in monotonically increasing order of timestamp, for simplicity.
My goal is to find how much time user 'a' spent in each region, and in what order. So my final expected output looks like:
uid region regionTimeStart regionTimeEnd
a 1 1 5
a 2 5 8
a 3 8 9
a 4 9 13
a 1 13 15
a 3 15 17
a 5 17 20
Based on my findings, Spark SQL Window functions can be used for this purpose.
I have tried the below:
val w = Window
.partitionBy("region")
.partitionBy("uid")
.orderBy("timestamp")
val resultDF = UserData.select(
UserData("uid"), UserData("timestamp"),
UserData("region"), rank().over(w).as("Rank"))
But from here onwards, I am not sure how to get the regionTimeStart and regionTimeEnd columns. The regionTimeEnd column is nothing but the 'lead' of regionTimeStart, except for the last entry in the group.
I see that aggregate operations have 'first' and 'last' functions, but for that I need to group data by ('uid','region'), which spoils the monotonically increasing order of the path traversed; i.e. at times 13 and 14 the user has come back to region '1', and I want that retained instead of clubbing it with the initial visit to region '1' at time 1.
It would be very helpful if anyone can guide me. I am new to Spark and have a better understanding of the Scala Spark APIs than of the Python/Java Spark APIs.
Window functions are indeed useful, although your approach can work only if you assume that the user visits a given region only once. Also, the window definition you use is incorrect: multiple calls to partitionBy simply return new objects with different window definitions. If you want to partition by multiple columns you should pass them in a single call (.partitionBy("region", "uid")).
Let's start by marking continuous visits to each region:
import org.apache.spark.sql.functions.{lag, sum, not}
import org.apache.spark.sql.expressions.Window
// assumes spark.implicits._ is in scope for the $"..." column syntax

val w = Window.partitionBy($"uid").orderBy($"timestamp")

// flag rows where the region differs from the previous row for this user
val change = (not(lag($"region", 1).over(w) <=> $"region")).cast("int")
// the running sum of the flag gives an index for each continuous visit
val ind = sum(change).over(w)
val dfWithInd = df.withColumn("ind", ind)
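For the sample data, that running sum produces a group index per continuous visit, roughly like this (worked out by hand, for illustration):

uid region timestamps ind
a   1      1-4        1
a   2      5-7        2
a   3      8          3
a   4      9-12       4
a   1      13-14      5
a   3      15-16      6
a   5      17-20      7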
Next we simply aggregate over the groups and find the leads:
import org.apache.spark.sql.functions.{lead, coalesce, min, max}
val regionTimeEnd = coalesce(lead($"timestamp", 1).over(w), $"max_")
val result = dfWithInd
.groupBy($"uid", $"region", $"ind")
.agg(min($"timestamp").alias("timestamp"), max($"timestamp").alias("max_"))
.drop("ind")
.withColumn("regionTimeEnd", regionTimeEnd)
.withColumnRenamed("timestamp", "regionTimeStart")
.drop("max_")
result.show
// +---+------+---------------+-------------+
// |uid|region|regionTimeStart|regionTimeEnd|
// +---+------+---------------+-------------+
// | a| 1| 1| 5|
// | a| 2| 5| 8|
// | a| 3| 8| 9|
// | a| 4| 9| 13|
// | a| 1| 13| 15|
// | a| 3| 15| 17|
// | a| 5| 17| 20|
// +---+------+---------------+-------------+