Vowpal Wabbit: How to use --interactions while using --keep and --ignore - data-science

I create a namespace for every feature I have so a line in my vw format file looks like:
1 |A A:0.00649797025326 |B B_09e6fb92eadea53ec206e05e4aea5654 |C C_2171791366 |D D_830995 |E E_2082213077 |F F_830995 |G G_283.0 |H H_Firefox |I I_MacOS |J J_1 |K K_100 |L L_27079 |M M_1403.0 |N N_-1.0 |O O_-1.0 |P P_0 |Q Q_JA |R R_1403.0 |S S_-1.0 |T T_-1.0 |U U_JA |V V_desktop |W W_JP |X X_--
After choosing the most important features to work with (6 out of 23 features), I want to choose cross features out of the remaining features in order to improve my AUC. So for that I tried using --keep per feature for the 6 features and then using --interactions for the new examined features, but the AUC never changes, i guess because the features are not in the kept namespaces? whats the right approach here?
the command line I tried to use:
vw train.tsv -f model.vw --loss_function logistic --hash all -l 0.01 --power_t 0.0 --noconstant --ignore 'A' --keep 'L' --keep 'P' --keep 'U' --keep 'K' --keep 'F' --keep 'W' --interactions 'BC'
which didn't change my metrics at all. Only when using the following command line:
vw train.tsv -f model.vw --loss_function logistic --hash all -l 0.01 --power_t 0.0 --noconstant --ignore 'A' --keep 'L' --keep 'P' --keep 'U' --keep 'K' --keep 'F' --keep 'W' --keep 'B' --keep 'C' --interactions 'BC'
it improved the metrics.
But then because the features are in the kept namespaces, I'm not sure the metrics are improved because of the cross features and not because of adding more features.
Any suggestions?
My approach is to run with the cross features combinations until there is no more improvement in the AUC.

Related

Getting start and end indices of string in Pandas

I have a df that looks like this:
|Index|Value|Anomaly|
---------------------
|0 |4 | |
|1 |2 |Anomaly|
|2 |1 |Anomaly|
|3 |2 | |
|4 |6 |Anomaly|
I want to get the start and end indices of the consecutive anomaly counts so in this case, it will be [[1,2],[4]]
I understand I have to use .shift and .cumsum but I am lost and I hope someone would be able to enlighten me.
Get consecutive groups taking the cumsum of the Boolean Series that checks where the value is not 'Anomoly'. Use where so that we only only take the 'Anomoly' rows. Then we can loop over the groups and grab the indices.
m = df['Anomaly'].ne('Anomaly')
[[idx[0], idx[-1]] if len(idx) > 1 else [idx[0]]
for idx in df.groupby(m.cumsum().where(~m)).groups.values()]
#[[1, 2], [4]]
Or if you want to use a much longer groupby you can get the first and last index, then drop duplicates (to deal with streaks of only 1) and get it into a list of lists. This is much slower though
(df.reset_index().groupby(m.cumsum().where(~m))['index'].agg(['first', 'last'])
.stack()
.drop_duplicates()
.groupby(level=0).agg(list)
.tolist())
#[[1, 2], [4]]

SQL Output Rows as columns

I have a table that tests an item and stores any faliures similar to:
Item|Test|FailureValue
1 |1a |"ZZZZZZ"
1 |1b | 123456
2 |1a |"MMMMMM"
2 |1c | 111111
1 |1d |"AAAAAA"
Is there a way in SQL to essential pivot these and have the failure values be output to individual columns? I know that I can already use STUFF to achieve what I want for the Test field but I would like the results as individual columns if possible.
I'm hoping to achieve something like:
Item|Tests |FailureValue1|FailureValue2|FailureValue3|Failure......
1 |1a,1b |"ZZZZZZ" |123456 |NULL |NULL ......
2 |1a,1b |"MMMMMM" |111111 |"AAAAAA" |NULL ......
Kind regards
Matt

Appropriate idea or SQL to obtain the results set

Fig 1
TxnId | TxnTypeId |BranchId |TxnNumber |LocalAmount |ItemName
--------------------------------|-----------|---------------|----------
1777486 | 101 |1099 |1804908 |65.20000000 |A
1777486 | 101 |1099 |1804908 |324.50000000 |B
1777486 | 101 |1099 |1804908 |97.20000000 |C
1777486 | 101 |1099 |1804908 |310.00000000 |D
1777486 | 101 |1099 |1804908 |48.90000000 |E
Fig 2
TxnId |TxnTypeId |BankId |Number |Check |Bank |Cash |Wallet
--------|-----------|-------|--------|-------|------|------|------
1777486 |101 |1099 |1804908 | 48.9 | 310 |389.7 |97.2
Fig 3 (Expected Output)
TxnId |BankId |ItemName |Amount |Wallet |Bank |Check |Cash
--------|-------|-----------|-------|-------|-------|-------|-------
1777486 |1099 |A |65.2 |0 0 |0 |0 |65.2
1777486 |1099 |B |324.5 |0 0 |0 |0 |324.5
1777486 |1099 |C |97.2 |97.2 |0 |0 |0
1777486 |1099 |D |48.9 |0 |0 |48.9 |0
1777486 |1099 |E |310 |0 |310 |0 |0
I have two different result set that is obtained from the different query.
Fig 1 and Fig 2.
The Result i wanted is like shown in fig 3.
Currently i do not have the flag to identify the payment mode use for each transaction(each item). I have the flag for only the complete transaction.
Fig 4
IndividualTxnPaymentDetailId| IndividualTxnId |PaymentAmount |PaymentMode
---------------------------:|:-----------------:|:-------------:|:--------------
2106163 | 1777486 |389.70000000 | Cash
2106164 | 1777486 |97.20000000 | Wallet
2106165 | 1777486 |310.00000000 | Bank
2106166 | 1777486 |48.90000000 | Check
Means if two item or more is purchased using one payment mode i do not have the proper way of identifying the payment done for each item.
Item A and B is purchased using cash as payment mode with the amount 65.2 and 324.5. Total Cash paid is 389.7
Item C is purchased using Wallet as payment mode with amount 97.2. Total Wallet amount is 97.2.
Fig 5
TxnId |LocalAmount |ItemName
--------|--------------:|:------------
1777486 |65.20000000 | A
1777486 |324.50000000 | B
1777486 |97.20000000 | C
1777486 |310.00000000 | D
1777486 |48.90000000 | E
Query by which i generated the result in Fig 4 and Fig 5
select IndividualTxnPaymentDetailId, IndividualTxnId, PaymentAmount, cc.choicecode as PaymentMode
from dbo.IndividualTxnPaymentDetail it
inner join configchoice cc on cc.configchoiceid= it.configpaymentmodeid
where IndividualTxnId = 1777486
select IndividualTxnId as TxnId, LocalAmount, CurrencyName from dbo.IndividualTxnFCYDetail where IndividualTxnId = 1777486
This is the query written to identify the transaction made through Bank. Similarly i wanted to get the transaction on all the payment mode. But could not obtain the transaction properly.
CASE
WHEN tpm.Bank - SUM(txn.LocalAmount) OVER (PARTITION BY txn.BranchId, txn.TxnNumber ORDER BY CAST(txn.ItemName AS varchar(300))) + txn.LocalAmount < 0 THEN 0
WHEN tpm.Bank - SUM(txn.LocalAmount) OVER (PARTITION BY txn.BranchId, txn.TxnNumber ORDER BY CAST(txn.ItemName AS varchar(300))) + txn.LocalAmount > txn.LocalAmount THEN txn.LocalAmount
WHEN tpm.Bank - SUM(txn.LocalAmount) OVER (PARTITION BY txn.BranchId, txn.TxnNumber ORDER BY CAST(txn.ItemName AS varchar(300))) + txn.LocalAmount > tpm.Bank THEN tpm.Bank
ELSE tpm.Bank - SUM(txn.LocalAmount) OVER (PARTITION BY txn.BranchId, txn.TxnNumber ORDER BY CAST(txn.ItemName AS varchar(300))) + txn.LocalAmount
END AS Bank,
Can you help me to get the idea or with some sql to get the result set as in fig 3.
Updated Question - Updated Responce
I read your updated question and I'm afraid the problem still stands. Neither of those queries are summing the data - they are just pulling the same already summed numbers. You would either need to get at the numbers prior to the aggregation happening -or- to have some column in your IndividualTxnPaymentDetail table that ties each row to its counterpart rows in the other table (presumably through a cross table as in - Row 1 : ItemName A, Row 1 : ItemName B, Row 2 : ItemName C, etc).
If these are simply impossible, then perhaps your approaching this the wrong way, or to put it better, perhaps you are being asked to do something that doesn't make sense - and provable so. If there is no direct relationship between these activities in the data there's not much you can be expected to do. What's more it may indicate that your organization doesn't 'think' about them that way.
These two tables seem be payments and liabilities. Perhaps consider an approach where each payment goes toward what ever the oldest outstanding balance is and are matched to the items in Fig 4 that way. Add a column to the details table to store payment toward that item. Rather than a simple Paid/Unpaid Boolean, I would store the amount of payment that has been applied toward each item or the amount still owed on each item; that way you can handle partially applied payments. As payments come in, apply them. You would likely want a similar column in the payments table too to measure the amount of each payment that you have applied; that way you can handle over-payments, and be able to know the status of things such as pending receipts in the case that payments aren't applied immediately.
I hope this helps.
Fundamental Flaw
Your question is looking to take aggregated data (in your example, the Fig 2 Cash total of 389.7) and tease out what numbers were totaled to get the sum. You can do it here since 3 of the 4 numbers in Fig 2 are unique, one-to-one matches with numbers in Fig 1 - meaning the remaining ones have to belong to each other. But imagine 100s of numbers, many or most of them sums (i.e. not one-to-one matches like most of these). Or imagine an example as simple as yours except the numbers aren't so unique (e.g. Fig 1 = (10, 10, 10, 10, 20) and Fig 2 = (10, 20, 20, 10) - it is not possible to say which ones are which) and there only needs to be two possible combinations that could be responsible for a particular sum for the results to become ambiguous.
The weakness is in Fig 2. Do you have any control over that data source? Can grab the numbers up-stream before they are totaled?
Sorry for the negative conclusion but...
I hope this helps.
The Continuing Saga
Comment: [A version of this] report has already been made ...[but] I cannot contact the person who actually wrote that thing.
Perhaps he was also asked to do something that didn't make sense but did it anyway. The math simply doesn't work. He may have written something that finds as many one-to-one matches as it can and then sort of rolls the dice on the rest of it. He may have done something like the following:
Find and eliminate all the one-to-one matches.
Take any total and subtract any item amount from it to see if it
matches any remaining item amounts(s), if so, arbitrarily pick one,
eliminate all three numbers.
Repeat this until all combinations have been tested.
But you are still potentially left with unmatched numbers, so you next need to test for sums of three numbers by:
Arbitrarily subtract any two item amounts from any of the remaining
totals.
and so on and so on, followed by testing for sums of four items and so on.
I think part of what you're looking for is buried in here:
http://www.itprotoday.com/software-development/algorithms-still-matter
it calls it 'order fulfillment' where you go through transactions, combining them until you reach a given total
I think the solution will be in multiple parts, including cursors etc.
I'm not convinced you would be able to understand or implement any solution posted. Also, I maintain that there are cases where there are ambiguous solutions.
Lastly I see you have asked 16 questions and not marked a single one as answered.

find the closest time between two tables in spark

I am using pyspark and I have two dataframes like this:
user time bus
A 2016/07/18 12:00:00 1
B 2016/07/19 12:00:00 2
C 2016/07/20 12:00:00 3
bus time stop
1 2016/07/18 11:59:40 sA
1 2016/07/18 11:59:50 sB
1 2016/07/18 12:00:05 sC
2 2016/07/19 11:59:40 sB
2 2016/07/19 12:00:10 sC
3 2016/07/20 11:59:55 sD
3 2016/07/20 12:00:10 sE
Now I want to know at which stop the user reports according to the bus number and the closest time in the second table.
For example, in table 1, user A reports at 2016/07/18 12:00:00 and he is on bus No.1, and according to the second table, there are three records of bus No.1, but the closest time is 2016/07/18 12:00:05(the third record), so the user is in sC now.
The desired output should be like this:
user time bus stop
A 2016/07/18 12:00:00 1 sC
B 2016/07/19 12:00:00 2 sC
C 2016/07/20 12:00:00 3 sD
I have transferred the time into timestamp so that the only problem is to find the closest timestamp where the bus number is eqaul.
Because I'm not familiar with sql right now, I tried to use map function to find the closest time and its stop, which means I have to use sqlContext.sql in the map function, and spark dosen't seem to allow this:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
So how can I write a sql query to get the right output?
This can be done using window functions.
from pyspark.sql.window import Window
from pyspark.sql import Row, functions as W
def tm(str):
return datetime.strptime(str, "%Y/%m/%d %H:%M:%S")
#setup data
userTime = [ Row(user="A",time=tm("2016/07/18 12:00:00"),bus = 1) ]
userTime.append(Row(user="B",time=tm("2016/07/19 12:00:00"),bus = 2))
userTime.append(Row(user="C",time=tm("2016/07/20 12:00:00"),bus = 3))
busTime = [ Row(bus=1,time=tm("2016/07/18 11:59:40"),stop = "sA") ]
busTime.append(Row(bus=1,time=tm("2016/07/18 11:59:50"),stop = "sB"))
busTime.append(Row(bus=1,time=tm("2016/07/18 12:00:05"),stop = "sC"))
busTime.append(Row(bus=2,time=tm("2016/07/19 11:59:40"),stop = "sB"))
busTime.append(Row(bus=2,time=tm("2016/07/19 12:00:10"),stop = "sC"))
busTime.append(Row(bus=3,time=tm("2016/07/20 11:59:55"),stop = "sD"))
busTime.append(Row(bus=3,time=tm("2016/07/20 12:00:10"),stop = "sE"))
#create RDD
userDf = sc.parallelize(userTime).toDF().alias("usertime")
busDf = sc.parallelize(busTime).toDF().alias("bustime")
joinedDF = userDf.join(busDf,col("usertime.bus") == col("bustime.bus"),"inner").select(
userDf.user,
userDf.time.alias("user_time"),
busDf.bus,
busDf.time.alias("bus_time"),
busDf.stop)
additional_cols = joinedDF.withColumn("bus_time_diff", abs(unix_timestamp(col("bus_time")) - unix_timestamp(col("user_time"))))
partDf = additional_cols.select("user","user_time","bus","bus_time","stop","bus_time_diff", W.rowNumber().over(Window.partitionBy("user","bus").orderBy("bus_time_diff") ).alias("rank") ).filter(col("rank") == 1)
additional_cols.show(20,False)
partDf.show(20,False)
Output:
+----+---------------------+---+---------------------+----+-------------+
|user|user_time |bus|bus_time |stop|bus_time_diff|
+----+---------------------+---+---------------------+----+-------------+
|A |2016-07-18 12:00:00.0|1 |2016-07-18 11:59:40.0|sA |20 |
|A |2016-07-18 12:00:00.0|1 |2016-07-18 11:59:50.0|sB |10 |
|A |2016-07-18 12:00:00.0|1 |2016-07-18 12:00:05.0|sC |5 |
|B |2016-07-19 12:00:00.0|2 |2016-07-19 11:59:40.0|sB |20 |
|B |2016-07-19 12:00:00.0|2 |2016-07-19 12:00:10.0|sC |10 |
|C |2016-07-20 12:00:00.0|3 |2016-07-20 11:59:55.0|sD |5 |
|C |2016-07-20 12:00:00.0|3 |2016-07-20 12:00:10.0|sE |10 |
+----+---------------------+---+---------------------+----+-------------+
+----+---------------------+---+---------------------+----+-------------+----+
|user|user_time |bus|bus_time |stop|bus_time_diff|rank|
+----+---------------------+---+---------------------+----+-------------+----+
|A |2016-07-18 12:00:00.0|1 |2016-07-18 12:00:05.0|sC |5 |1 |
|B |2016-07-19 12:00:00.0|2 |2016-07-19 12:00:10.0|sC |10 |1 |
|C |2016-07-20 12:00:00.0|3 |2016-07-20 11:59:55.0|sD |5 |1 |
+----+---------------------+---+---------------------+----+-------------+----+

How should I design my table

I need to create a table for operating on data, which has been provided to me like this:
col1 col2 col3
1 < 3 50%
2 < 5 50%
3 < 10 50%
1 5>RC >=3 25%
2 10>RC >=5 25%
3 20>RC >=10 25%
1 >=5 0%
2 >=10 0%
3 >=20 0%
A user of the system would be passing a number, which is present in col2 and col1 above. Let's say that the user passed 7 for col2 and 1 for col1. Business requirement is that I should return the user the following row
1 >=5 0%
Roughly speaking, it means that I checked the value in col2, and noticed that it is >=5, which my input data fits.
I was thinking of splitting col2 across two columns - one for storing the number and the other for operator. Something like this:
col1 col2 col3 col4
1 3 50% <
2 5 50% <
3 10 50% <
1 5>RC >=3 25%
2 10>RC >=5 25%
3 20>RC >=10 25%
1 5 0% >=
2 10 0% >=
3 20 0% >=
Thsi way, I will be able to write queries for addressing queries based on data in first and last three columns (though I have not running queries right now, I just did dry run). What I am not able to figure out so far is - How to address the data in rows 4,5,6? You can ignore the RC part in those rows, as I can certainly do away with it, as I am concerned with the numeric range for my queries.
I tried splitting the data for rows 4,5,6 in 2 rows each, something like:
1 3 25% >=
1 5 25% <
2 5 25% >=
2 10 25% <
3 10 25% >=
3 20 25% <
But, I see an imminent issue here, when it comes to retrieving the data. Let's say that user paased col2 = 7 AND col1 = 1. Now, I should have got only one row,that is row number 7 in the first table in my question, but I am also getting an additional row (1st row in last table, where I was splitting data for BETWEEN conditions)
Can anyone suggest me a better approach for storing this data so that my requierment can be achieved?
SQLFiddle demo: http://www.sqlfiddle.com/#!4/d2d90/7
I suggest, that you should just split col2 in two columns - lower and higher bound, replacing not existing bound, for example, with NULL. It will look something like this:
+----+-------+-------+----+
|col1|col2_lb|col2_hb|col3|
+----+-------+-------+----+
|1 |NULL |3 |50% |
+----+-------+-------+----+
|2 |NULL |5 |50% |
+----+-------+-------+----+
|3 |NULL |10 |50% |
+----+-------+-------+----+
|1 |3 |5 |25% |
+----+-------+-------+----+
|... |... |... |... |
+----+-------+-------+----+
|1 |5 |NULL |0% |
+----+-------+-------+----+
Using this structure, you'll be able to find needed row with simple query:
SELECT *
FROM T_TABLE t
WHERE t.col1 = :VAL1
AND NVL(t.col2_lb,:VAL2) <= :VAL2
AND NVL(t.col2_hb,:VAL2+1) > :VAL2