Evaluate SQL records in a For Each loop taking into account previous steps through loop - sql

I have written a script in PowerShell that searches through a relational database using SELECT TOP statements to pick records which are suitable to match with items in an input text file. For the sake of simplicity I will not include all of the conditions that need to be met, just the ones I am having trouble with:
Each item in the input file has a corresponding requirement for resource x and resource y.
Input File:
Record 1 2x 1y
Record 2 1x 1y
In the database each record is similar
Database:
Record 1 4x 3y
Record 2 1x 2y
What my script does is loop through each item in my input file and searches through the database to find a record that has sufficient amount of resource x and resource y. The script does this and outputs a file which basically matches records of the input file to suitable records of the database.
However, it doesn't work properly: as it steps through each item in the loop, it doesn't take into account whether the previous item(s) in the loop have already been matched to records (and used up resources). For example:
The script evaluates input file record 1 (2x 1y) and matches it to record 1 in the DB (4x 3y). Now when the script goes to the next item in the input file (1x 1y) it evaluates record 1 in the DB as still having 4x and 3y, despite it having been matched previously in the loop; its resources should now be treated as 2x 2y (4x-2x, 3y-1y).
How can I accomplish this? In the end the script could be evaluating 200 input records at a time against a database with 70,000 records. The answer doesn't have to be in PowerShell; I'm just having a hard time thinking of a conceptual answer to this problem.

Here's a PowerShell example using randomly generated CSVs.
Input table format:
RecordName  ResourceX  ResourceY  Match
Record 0    8          0
Record 1    2          5
Record 2    5          9
Processing:
$cResources = Import-Csv resources-before.csv
$cResourcesNeeded = Import-Csv needed-infile.csv
foreach ($needed in $cResourcesNeeded) {
    foreach ($supply in $cResources) {
        # Import-Csv returns strings, so cast to [int] to compare numerically.
        if (([int]$needed.ResourceX -le [int]$supply.ResourceX) -and `
            ([int]$needed.ResourceY -le [int]$supply.ResourceY)) {
            # Match found.
            $needed.Match = $supply.RecordName
            $supply.Match += $needed.RecordName
            # Update the supply record so later iterations see the reduced amounts.
            $supply.ResourceX = [int]$supply.ResourceX - [int]$needed.ResourceX
            $supply.ResourceY = [int]$supply.ResourceY - [int]$needed.ResourceY
            # Back to outer loop.
            break
        }
    }
}
$cResources | Export-Csv -NoTypeInformation resources-after.csv
$cResourcesNeeded | Export-Csv -NoTypeInformation needed-outfile.csv
Of course this is just a very basic example. I don't know what other requirements you have so feel free to elaborate further (i.e. update the question with your actual code) if you need something more specific.
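Since you mentioned the answer doesn't have to be in PowerShell, here is the same idea restated language-agnostically as a rough Python sketch (made-up field names, not your actual schema): keep an in-memory copy of the candidate records and subtract resources the moment a match is made, so later iterations see the reduced amounts.
db_rows = [{'name': 'Record 1', 'x': 4, 'y': 3},
           {'name': 'Record 2', 'x': 1, 'y': 2}]
requests = [{'name': 'Input 1', 'x': 2, 'y': 1},
            {'name': 'Input 2', 'x': 1, 'y': 1}]
matches = []
for req in requests:
    for row in db_rows:
        if row['x'] >= req['x'] and row['y'] >= req['y']:
            row['x'] -= req['x']   # consume the resources immediately,
            row['y'] -= req['y']   # so later requests see the reduced amounts
            matches.append((req['name'], row['name']))
            break
print(matches)  # [('Input 1', 'Record 1'), ('Input 2', 'Record 1')]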

Related

Record Duplication in BigQuery while Running a DataFlow Job

I'm running an hourly Dataflow job that reads records from a source table, processes them, and writes them to a target table. Since some of the records may repeat in the source table, we create a hash value based on the record fields of interest, append it to the source table records as they are read (in memory), and filter out records whose hashes are already stored in the target table (the hash value is stored in the target table). This way we aim to avoid duplications across different jobs (triggered at different times). In order to avoid duplication within the same job, we use the Apache Beam GroupByKey method, where the key is the hash value, and pick only the first element in each group. However, the duplication in BigQuery still persists. My only hunch is that maybe, due to multiple workers handling the same job, they might be out of sync and process the same data, but since I'm using pipelines all the way, this assumption sounds unreasonable (at least to me...). Does any of you have an idea why the problem still persists?
Here's the job which creates the duplication:
with beam.Pipeline(options=options) as p:
    # read fields of interest from the source table
    records = p | 'Read Records from BigQuery' >> beam.io.Read(
        beam.io.ReadFromBigQuery(query=read_from_source_query, use_standard_sql=True))
    # step 1 - filter already existing records
    # read existing hashes from the target table
    hashes = p | 'read existing hashes from the target table' >> \
        beam.io.Read(beam.io.ReadFromBigQuery(
            query=select_hash_value_from_target_table,
            use_standard_sql=True)) | \
        'Get vals' >> beam.Map(lambda hash: hash['HashValue'])
    # add hash value to each record and filter out the ones which already exist in the target table
    hashed_records = (
        records
        | 'Add Hash Column in Memory to Each source table Record' >> beam.Map(lambda record: add_hash_field(record))
        | 'Filter Existing Hashes' >> beam.Filter(
            lambda record, hashes: record['HashValue'] not in hashes,
            hashes=beam.pvalue.AsIter(hashes))
    )
    # step 2 - filter duplicated hashes created on the same job
    key_val_records = (
        hashed_records | 'Create a Key Value Pair' >> beam.Map(lambda record: (record['HashValue'], record))
    )
    # combine elements with the same key and get only one of them
    unique_hashed_records = (
        key_val_records | 'Combine the Same Hashes' >> beam.GroupByKey()
        | 'Get First Element in Collection' >> beam.Map(lambda element: element[1][0])
    )
    records_to_store = unique_hashed_records | 'Create Records to Store' >> beam.ParDo(CreateTargetTableRecord(gal_options))
    records_to_store | 'Write to target table' >> beam.io.WriteToBigQuery(
        target_table)
As the code above suggests, I expected to have no duplicates in the target table, but I'm still getting duplicate records.
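A stripped-down sketch of the GroupByKey-then-take-first dedup step described above (in-memory data instead of BigQuery, illustrative names only) looks like this, and can be run locally with the DirectRunner to sanity-check the pattern in isolation:
import apache_beam as beam
records = [
    {'HashValue': 'a', 'Payload': 1},
    {'HashValue': 'a', 'Payload': 2},  # duplicate hash within the same job
    {'HashValue': 'b', 'Payload': 3},
]
with beam.Pipeline() as p:
    (p
     | 'Create' >> beam.Create(records)
     | 'Key by Hash' >> beam.Map(lambda record: (record['HashValue'], record))
     | 'Group by Hash' >> beam.GroupByKey()
     | 'Take First' >> beam.Map(lambda kv: list(kv[1])[0])
     | 'Print' >> beam.Map(print))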

LINQ - Select rows based on whether their sum meets a condition

I’ve run into a problem, as I cannot get a proper working LINQ statement here.
Suppose I have a DataTable with x rows and I have to work with the sum of the Quantity column. Then I have a condition, RequestedQuantity = 20. I need to find rows whose quantities sum exactly to RequestedQuantity, but only where a combination of exactly 3 rows equals it.
+-----+----------+
| Bin | Quantity |
+-----+----------+
| 1 | 10 |
| 2 | 5 |
| 3 | 5 |
| 4 | 10 |
| 5 | 15 |
+-----+----------+
I can’t seem to figure out the proper LINQ syntax to get this to work. My starting point is this:
From row In StorageBins.AsEnumerable.GroupBy( _
    Convert.ToDouble(Function (x) x("Quantity"), cultureInfo)).Sum( _
    Function (y) Convert.ToDouble(y("Quantity"), cultureInfo) = _
    Double.Parse(RequestedQuantity, cultureInfo))
Initially, I am just trying to get any rows that are equal to my condition. My end-goal, however, is getting any three rows that exactly sum up to my Requested quantity.
I’m not an expert in LINQ, unfortunately. I hope some of you might be!
Maybe I'm missing something, but this actually seems like a pretty complicated problem. Pick any 3 records, but only 3, that add up to exactly 20. How many rows are there in the database? This could get to be quite a few potential combinations pretty quickly. And what do you do after you get the 3? Do you have to go back through recursively and group up the other records as well? Or do you just need the first set of 3 that add up to 20?
Assuming you just need the first 3, I would do something like this:
Get the first record that is less than 20. Remove it from your input list and put it into your target set.
Then get the first record that is less than 20 minus the first value, i.e. if the first value was a '5', get records that are less than 15 (20 minus 5). This ensures you 'leave room' for the third value. Remove it from the original list and put it into your target set.
Then get the first record that is exactly 20 minus number one minus number two. Remove it from the input list and put it into the target set.
Now you would have to do this iteratively, with backtracking. If there is no value that meets the third criterion, release the second value from your target set and put it back in your input list, then go back to step 2 and pick the next record that matches it (and ideally one not equal to the previous value). And if you exhaust all of the candidates for step 2, go back to step 1, pick the next value there, and start the whole thing over again (a rough sketch follows below).
Unless I'm misunderstanding your requirement...
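For illustration, here is a minimal brute-force version of the same goal (find the first 3 rows that sum exactly to the target), sketched in Python rather than VB/LINQ and using the sample bins above as hypothetical (Bin, Quantity) pairs; the backtracking described above is essentially an optimization over trying every combination:
from itertools import combinations
rows = [(1, 10), (2, 5), (3, 5), (4, 10), (5, 15)]   # (Bin, Quantity)
requested = 20
def first_triple(rows, target):
    # Try every combination of exactly 3 rows; return the first exact match.
    for combo in combinations(rows, 3):
        if sum(q for _, q in combo) == target:
            return combo
    return None   # no 3 rows add up exactly to the target
print(first_triple(rows, requested))   # ((1, 10), (2, 5), (3, 5))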

Why can't I read all of the values in the matrix in scilab?

I am trying to read a CSV file and my code is as follows:
param=csvRead("C:\Users\USER\Dropbox\VOA-BK code\assets\Iris.csv",",","%i",'double',[],[],[1 2 3 4]); //reads number of clusters and features
data=csvRead("C:\Users\USER\Dropbox\VOA-BK code\assets\Iris.csv",",","%f",'double',[],[],[3 1 19 4]); //reads the values
numft=param(1,1);//save number of features
numcl=param(2,1);//save number of clusters
data_pts=0;
data_pts = max(size(data, "r"));//checks how many number of rows
disp(data(numft-3:data_pts,:));//print all data points (I added -3 otherwise it displays only 15 rows)
disp(numft);//print features
disp(data_pts);//print number of data points
disp(param);
endfunction
Below are the values that I am trying to read:
features,4,,
clusters,3,,
5.1,3.5,1.4,0.2
4.9,3,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5,3.6,1.4,0.2
7,3.2,4.7,1.4
6.4,3.2,4.5,1.5
6.9,3.1,4.9,1.5
5.5,2.3,4,1.3
6.5,2.8,4.6,1.5
5.7,2.8,4.5,1.3
6.3,3.3,6,2.5
5.8,2.7,5.1,1.9
7.1,3,5.9,2.1
6.3,2.9,5.6,1.8
6.5,3,5.8,2.2
7.6,3,6.6,2.1
I do not know why the code only displays 15 rows instead of 17. The only time it displays the correct matrix is when I put -3 in numft, but with that, the number of columns would be 1. I am so confused. Is there a better way to read the values?
In the csvRead call in the first line of your script the boundaries of the region to read are incorrect. The range argument is given as [firstRow firstColumn lastRow lastColumn], so to read the two parameter values (4 and 3) from rows 1-2 of column 2 it should be corrected like this:
param=csvRead("C:\Users\USER\Dropbox\VOA-BK code\assets\Iris.csv",",","%i",'double',[],[],[1 2 2 2]);

Generating variable observations for one id to be observation for new variable of another id

I have a data set that allows linking friends (i.e. observing peer groups), so one can observe the characteristics of an individual's friends. What I have is an 8-digit identifier, id, each id's friend ids (up to 10 friends), and then many characteristic variables.
I want to take an individual and create variables that are the foreign-born status of each friend.
I already have an indicator for each person that is 1 if foreign born. Below is a small example, for just one friend. Notice, MF1 means male friend 1 and then MF1id is the id number for male friend 1. The respondents could list up to 5 male friends and 5 female friends.
So, I need Stata to look at MF1id and then match it down the id column, then look over to f_born for that matched id, and finally input the value of f_born there back up to the original id under MF1f_born.
edit: I did a poor job of explaining the data structure. I have a cross section, so 1 observation per unique id. Row 1 is the first 8-digit id number with all the variables following along the row. The repeated id numbers occur because the friend ids listed for each person (mf1id, for example) also appear in the id column. I hope that is a bit clearer.
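Conceptually (sketched here in Python rather than Stata, with made-up variable names), the operation is just a keyed lookup: build a map from id to f_born, then map each person's mf1id through it:
people = [{'id': 1, 'mf1id': 2, 'f_born': 1},
          {'id': 2, 'mf1id': 3, 'f_born': 0},
          {'id': 3, 'mf1id': 1, 'f_born': 1}]
f_born_by_id = {p['id']: p['f_born'] for p in people}   # id -> foreign-born flag
for p in people:
    # Friend's foreign-born status; None if the friend's id is not in the data.
    p['mf1f_born'] = f_born_by_id.get(p['mf1id'])
print(people)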
Kevin Crow wrote vlookup, which makes this sort of thing pretty easy:
use http://www.ats.ucla.edu/stat/stata/faq/dyads, clear
drop team y
rename (rater ratee) (id mf1_id)
bys id: gen f_born = mod(id,2)==1
net install vlookup
vlookup mf1_id, gen(mf1f_born) key(id) value(f_born)
So, Dimitriy's suggestion of vlookup is perfect, except it will not work for me. After trying vlookup with my data set, the UCLA data that Dimitriy used for his example, and a toy data set I created, vlookup always failed at the point where the program attempts to save a temp file to my temp folder. Below is the program for vlookup. Notice it sets a tempfile, manipulates the data, and then saves the file.
*! version 1.0.0 KHC 16oct2003
program define vlookup, sortpreserve
    version 8.0
    syntax varname, Generate(name) Key(varname) Value(varname)
    qui {
        tempvar g k
        egen `k' = group(`key')
        egen `g' = group(`key' `value')
        local k = `k'[_N]
        local g = `g'[_N]
        if `k' != `g' {
            di in red "`value' is not unique within `key';"
            di in red /*
            */ "there are multiple observations with different `value'" /*
            */ " within `key'."
            exit 9
        }
        preserve
        tempvar g _merge
        tempfile file
        sort `key'
        by `key' : keep if _n == 1
        keep `key' `value'
        sort `key'
        rename `key' `varlist'
        rename `value' `generate'
        save `file', replace
        restore
        sort `varlist'
        joinby `varlist' using `file', unmatched(master) _merge(`_merge')
        drop `_merge'
    }
end
exit
For some reason, Stata gave me an error, "invalid file," at the save `file', replace point. I have a restricted data set with requirements to point all my Stata temp files to a very specific folder that has an erasure program sweeping it every so often. I don't know why this would create a problem, but maybe it does; I really don't know. Regardless, I tweaked the vlookup program and it appears to do what I need now.
clear all
set more off
capture log close
input aid mf1aid fborn
1 2 1
2 1 1
3 5 0
4 2 0
5 1 0
6 4 0
7 6 1
8 2 .
9 1 0
10 8 1
end
program define justlinkit, sortpreserve
    syntax varname, Generate(name) Key(varname) Value(name)
    qui {
        preserve
        tempvar g _merge
        sort `key'
        by `key' : keep if _n == 1
        keep `key' `value'
        sort `key'
        rename `key' `varlist'
        rename `value' `generate'
        save "Z:\Jonathan\created data sets\justlinkit program\fchara.dta", replace
        restore
        sort `varlist'
        joinby `varlist' using "Z:\Jonathan\created data sets\justlinkit program\fchara.dta", unmatched(master) _merge(`_merge')
        drop `_merge'
    }
end
// set trace on
justlinkit mf1aid, gen(mf1_fborn) key(aid) value(fborn)
sort aid
list
Well, this fixed my problem. Thanks to all who responded I would not have figured this out without you.

store matrix data in SQLite for fast retrieval in R

I have 48 matrices of dimensions 1,000 rows and 300,000 columns where each column has a respective ID, and each row is a measurement at one time point. Each of the 48 matrices is of the same dimension and their column IDs are all the same.
The way I have the matrices stored now is as RData objects and also as text files. I guess for SQL I'd have to transpose and store by ID, in which case the matrix would be of dimensions 300,000 rows and 1,000 columns.
I guess if I transpose it, a small version of the data would look like this:
id1 1.5 3.4 10 8.6 .... 10 (with 1,000 columns and 300,000 rows now)
I want to store them in a way such that I can use R to retrieve a few of the rows (~ 5 to 100 each time).
The general strategy I have in mind is as follows:
(1) Create a database in sqlite3 using R that I will use to store the matrices (in different tables)
For file 1 to 48 (each file is of dim 1,000 rows and 300,000 columns):
(2) Read in file into R
(3) Store the file as a matrix in R
(4) Transpose the matrix (now it is of dimensions 300,000 rows and 1,000 columns). Each row is now the unique id in the table in sqlite.
(5) Dump/write the matrix into the sqlite3 database created in (1) (dump it into a new table probably?)
Steps 1-5 are to create the DB.
Next, I need step 6 to read-in the database:
(6) Read some rows (at most 100 or so at a time) into R as a (sub)matrix.
A simple example code doing steps 1-6 would be best.
Some Thoughts:
I have used SQL before, but it was mostly to store tabular data where each column had a name. In this case each column is just one point of the data matrix; I guess I could just name them col1 ... to col1000, or are there better tricks?
If I look at: http://sandymuspratt.blogspot.com/2012/11/r-and-sqlite-part-1.html they show this example:
dbSendQuery(conn = db,
"CREATE TABLE School
(SchID INTEGER,
Location TEXT,
Authority TEXT,
SchSize TEXT)")
But in my case this would look like:
dbSendQuery(conn = db,
"CREATE TABLE mymatrixdata
(myid TEXT,
col1 float,
col2 float,
.... etc.....
col1000 float)")
I.e., I would have to type in col1 to ... col1000 manually, which doesn't sound very smart. This is where I am mostly stuck. Some code snippet would help me.
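For what it's worth, the column list does not have to be typed by hand; it can be generated as a string at run time. A rough sketch (in Python here purely to show the string-building; the equivalent can be done in R with paste):
# Build "col1 float, col2 float, ..., col1000 float" programmatically.
cols = ", ".join("col%d float" % i for i in range(1, 1001))
create_sql = "CREATE TABLE mymatrixdata (myid TEXT, %s)" % cols
print(create_sql[:80] + "...")   # inspect the start of the generated statement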
Then, I need to dump the text files into the SQLite database? Again, unsure how to do this from R.
Seems I could do something like this:
setwd(<directory where to save the database>)
db <- dbConnect(SQLite(), dbname="myDBname")
mymatrix.df = read.table(<full name to my text file containing one of the matrices>)
mymatrix = as.matrix(mymatrix.df)
Here I now need the code for how to dump this into the database...
Finally,
How do I quickly retrieve the values (without having to read the entire matrices each time) for some of the rows (by ID) using R?
From the tutorial it'd look like this:
sqldf("SELECT id1,id2,id30 FROM mymatrixdata", dbname = "Test2.sqlite")
But there the id1, id2, id30 are hardcoded in the code and I need to obtain them dynamically. I.e., sometimes I may want id1, id2, id10, id100; and another time I may want id80, id90, id250000, etc.
Something like this would be more appropriate for my needs:
cols.i.want = c("id1","id2","id30")
sqldf("SELECT cols.i.want FROM mymatrixdata", dbname = "Test2.sqlite")
Again, unsure how to proceed here. Code snippets would also help.
A simple example would help me a lot here, no need to code the whole 48 files, etc. just a simple example would be great!
Note: I am using a Linux server, SQLite 3 and R 2.13 (I could update it as well).
In the comments the poster explained that it is only necessary to retrieve specific rows, not columns:
library(RSQLite)
m <- matrix(1:24, 6, dimnames = list(LETTERS[1:6], NULL)) # test matrix
con <- dbConnect(SQLite()) # could add dbname= arg. Here use in-memory so not needed.
dbWriteTable(con, "m", as.data.frame(m)) # write
dbGetQuery(con, "create unique index mi on m(row_names)")
# retrieve submatrix back as m2
m2.df <- dbGetQuery(con, "select * from m where row_names in ('A', 'C')
order by row_names")
m2 <- as.matrix(m2.df[-1])
rownames(m2) <- m2.df$row_names
Note that relational databases are set based and the order that the rows are stored in is not guaranteed. We have used order by row_names to get out a specific order. If that is not good enough then add a column giving the row index: 1, 2, 3, ... .
REVISED based on comments.
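To make the row selection dynamic, the IN (...) clause can be built from a runtime vector of ids rather than hardcoded. A minimal sketch of that idea (shown with Python's sqlite3 module purely to illustrate the query construction; the same string-building works with RSQLite/sqldf):
import sqlite3
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE m (row_names TEXT PRIMARY KEY, c1 REAL, c2 REAL)")
con.executemany("INSERT INTO m VALUES (?, ?, ?)",
                [('id1', 1.0, 2.0), ('id2', 3.0, 4.0), ('id30', 5.0, 6.0)])
rows_i_want = ['id1', 'id30']                    # decided at run time
placeholders = ','.join('?' * len(rows_i_want))  # "?,?"
query = ("SELECT * FROM m WHERE row_names IN (%s) ORDER BY row_names"
         % placeholders)
print(con.execute(query, rows_i_want).fetchall())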