How to output data with Hive-style directory structure in Scalding?

We are using Scalding to do ETL and generate the output as a Hive table with partitions. Consequently, we want the directory names for partitions to be something like "state=CA" for example. We are using TemplatedTsv as follows:
pipe
  // some other ETL
  .map('STATE -> 'hdfs_state) { state: Int => "State=" + state }
  .groupBy('hdfs_state) { _.pass }
  .write(TemplatedTsv(baseOutputPath, "%s", 'hdfs_state,
    writeHeader = false,
    sinkMode = SinkMode.UPDATE,
    fields = ('all except 'hdfs_state)))
We adapted the code sample from How to bucket outputs in Scalding.
Here are the two issues we have:
except can't be resolved by IntelliJ. Am I missing some imports? Also, we don't want to list all the fields explicitly in the fields = (...) argument, since they are derived from the code inside the groupBy statement; if we listed them explicitly, they could easily get out of sync.
This approach feels too hacky, since we are creating an extra column just so the directory names can be consumed by Hive/HCatalog. What would be the right way to accomplish this?
Many thanks!

Sorry, the previous example was pseudocode. Below I will give a small piece of code with example input data.
Please note that this only works with Scalding version 0.12.0 or above.
Let's imagine we have the input below, which defines some purchase data:
user1 1384034400 6 75
user1 1384038000 6 175
user2 1383984000 48 3
user3 1383958800 48 281
user3 1384027200 9 7
user3 1384027200 9 11
user4 1383955200 37 705
user4 1383955200 37 15
user4 1383969600 36 41
user4 1383969600 36 21
It is tab separated, and the 3rd column is a state number. Here we have integers, but you can easily adapt it for string-based states.
This code will read the input and put it into 'State=stateid' output folder buckets.
import com.twitter.scalding._
import cascading.tap.SinkMode

class TemplatedTsvExample(args: Args) extends Job(args) {

  val purchasesPath = args("purchases")
  val outputPath = args("output")

  // defines both input & output schema; you can also make a separate one for each
  val ioSchema = ('USERID, 'TIMESTAMP, 'STATE, 'PURCHASE)

  val Purchases =
    Tsv(purchasesPath, ioSchema)
      .read
      .map('STATE -> 'STATENAME) { state: Int => "State=" + state } // here you can make the necessary changes
      .groupBy('STATENAME) { _.pass } // this is optional
      .write(TemplatedTsv(outputPath, "%s", 'STATENAME, false, SinkMode.REPLACE, ioSchema))
}
I hope this is helpful. Please ask me if anything is not clear.
You can find full code here.

Related

SQL Query to return which columns have different values given two rows

I have one table like this:
id  status       time  days  ...
1   optimal      60    21
2   optimal      50    21
3   no solution  60    30
4   optimal      21    31
5   no solution  34    12
...
There are many more rows and columns.
I need to make a query that returns which columns have different values, given two IDs.
To rephrase: I'll provide two IDs, for example 1 and 5, and I need to know whether these two rows have any columns with different values. In this case, the result should be something like:
id  status       time  days
1   optimal      60    21
5   no solution  34    12
If I provide IDs 1 and 2, for example, the result should be:
id  time
1   60
2   50
The output format doesn't need to be exactly like this; it only needs to show clearly which columns are different and what their values are.
I can tell you off the bat that processing this data in a programming language will greatly help you in terms of simplicity and readability for this type of solution, but here is a thread showing how it can be done in SQL, with a rough sketch of one way below:
Compare two rows and identify columns whose values are different
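As an illustration only, here is a minimal sketch in SQL Server syntax. It assumes the table is named t, hard-codes the three comparable columns, and casts everything to text for the comparison; NULL handling is left out, and the result comes back in long form (one row per differing column per id):
-- t and the hard-coded column list are assumptions based on the sample data.
SELECT a.id, v.col_name, v.col_value
FROM t AS a
CROSS APPLY (VALUES
        ('status', CAST(a.status AS varchar(100))),
        ('time',   CAST(a.[time] AS varchar(100))),
        ('days',   CAST(a.days   AS varchar(100)))
    ) AS v(col_name, col_value)
WHERE a.id IN (1, 5)
  AND EXISTS (
        SELECT 1
        FROM t AS b
        CROSS APPLY (VALUES
                ('status', CAST(b.status AS varchar(100))),
                ('time',   CAST(b.[time] AS varchar(100))),
                ('days',   CAST(b.days   AS varchar(100)))
            ) AS w(col_name, col_value)
        WHERE b.id IN (1, 5)
          AND b.id <> a.id
          AND w.col_name = v.col_name
          AND w.col_value <> v.col_value
      );
For IDs 1 and 2 this would return just the two time rows. Generalizing past a handful of columns usually means building the VALUES list dynamically, which is where doing it in application code starts to look attractive.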
If you are looking for a solution in R, here is mine:
df <- read.csv(file = "sf.csv", header = TRUE)

diff.eval <- function(first.id, second.id, eval.df) {
  res <- eval.df[c(first.id, second.id), ]
  cols <- colnames(eval.df)
  for (col in cols) {
    if (res[1, col] == res[2, col]) {
      res[, col] <- NULL
    }
  }
  return(res)
}

print(diff.eval(1, 5, df))
print(diff.eval(1, 2, df))
You just need to create a data frame out of the table. I created a .csv locally for convenience and imported the data into a data frame.

Read multiple files, create a data frame and add a new column containing the name of each file in R

I am new to the dplyr package, and I have been trying to read multiple files in R and then create a data frame by binding all the rows, while including the name of each file as a new column. This new column is the corresponding date, which is not included in the data.
My list of files (for example):
01012019.aps
02012019.aps
I would like to have my final dataframe like this:
x  y  file      date
1  4  01012019  01-01-2019
2  5  01012019  01-01-2019
3  6  02012019  02-01-2019
4  7  02012019  02-01-2019
I've been trying this:
path_aps <- "C:/Users/.../.../APS"
files_aps <- list.files(path_aps, pattern = "*.aps")

data_aps <- files_aps %>%
  map(~ read.table(file.path(path_aps, .), sep = "\t")) %>%
  map(~ mutate(filename = files_aps, .)) %>%
  reduce(gtools::smartbind)
But I am getting this error:
Error: Column filename must be length 288 (the number of rows) or one, not 61
I understand that the list of files in files_aps has 61 elements, since that is the number of files in my directory, and that 288 is the number of rows in each .aps file; however, I haven't been able to make it assign the corresponding filename to the rows of each .aps file. I've been reading multiple answers to similar questions, but I am still not getting the expected result.
I've solved it with the help of this other answer and I've got this:
data_aps <- list.files(path_aps, pattern = "*.aps", full.names = TRUE) %>%
  map_df(function(x) read.table(x, sep = "\t") %>%
    mutate(filename = gsub(".aps", "", basename(x))))
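The desired output above also had a date column parsed from the file name, which the snippet doesn't add. A small hedged follow-up sketch, assuming the file names really encode the date as ddmmyyyy (e.g. 01012019 is 1 January 2019):
# Hedged sketch: derive the date column from the filename column created above.
library(dplyr)

data_aps <- data_aps %>%
  mutate(date = as.Date(filename, format = "%d%m%Y"))

# If the dd-mm-yyyy text shown in the desired output is needed rather than a
# Date object, format it back into a string:
data_aps <- data_aps %>%
  mutate(date = format(date, "%d-%m-%Y"))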

Using a table made from input file Lua

I have a text file with contents like this
Jack 17
Will 16
Jordan 15
Elsie 16
You get the idea, it's a list of people's names with their ages.
I have a program that reads the file in. Like so:
file = io.open("ages.txt")
for line in file:lines() do
  local name, age = line:match("(%a+) (%d+)")
  print(age) -- Not exactly what I want
end
file:close()
print(age) gives me the ages of all people, without names. It runs for everyone, as expected, since it's within the loop. (As an aside, why does it not work outside the loop? It gives me nil there.)
What I want to do is load it into a table, so that if I want to know Jack's age, I can write print(Jack.age) and it will give me 17. How can this program be constructed to support this functionality?
Perhaps you are looking for something like this to build a table in the loop (and to answer your aside: age is declared local inside the loop body, so it goes out of scope when the loop ends, which is why it is nil there):
file = io.open("ages.txt")
names = {}
for line in file:lines() do
  local n, a = line:match("(%a+) (%d+)")
  names[n] = {age = a} -- age is captured as a string; wrap it in tonumber(a) if you need a number
end
file:close()
Here is a sample interaction:
> print(names.Will.age)
16
> print(names.Jordan.age)
15
> print(names.Elsie.age)
16
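And if you literally want print(Jack.age) to work, without going through the names table, one option is to create one global variable per name. This is only a sketch; filling the global environment like this is usually discouraged because of possible name clashes:
file = io.open("ages.txt")
for line in file:lines() do
  local n, a = line:match("(%a+) (%d+)")
  _G[n] = {age = tonumber(a)} -- creates a global named after each person, e.g. Jack
end
file:close()

print(Jack.age) --> 17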

Generating variable observations for one id to be observation for new variable of another id

I have a data set that allows linking friends (i.e. observing peer groups), and thereby one can observe the characteristics of an individual's friends. What I have is an 8-digit identifier, id, each id's friend ids (up to 10 friends), and then many characteristic variables.
I want to take an individual and create variables that record the foreign-born status of each friend.
I already have an indicator for each person that is 1 if foreign born. Below is a small example, for just one friend. Notice that MF1 means male friend 1, and MF1id is the id number for male friend 1. The respondents could list up to 5 male friends and 5 female friends.
So, I need Stata to look at MF1id and then match it down the id column, then look over to f_born for that matched id, and finally input the value of f_born there back up to the original id under MF1f_born.
Edit: I did a poor job of explaining the data structure. I have a cross-section, so there is 1 observation per unique id. Row 1 is the first 8-digit id number with all the variables following along the row. The repeated id numbers are shared between the friend ids listed for each person (mf1id, for example) and the id column. I hope that is a bit clearer.
Kevin Crow wrote vlookup, which makes this sort of thing pretty easy:
use http://www.ats.ucla.edu/stat/stata/faq/dyads, clear
drop team y
rename (rater ratee) (id mf1_id)
bys id: gen f_born = mod(id,2)==1
net install vlookup
vlookup mf1_id, gen(mf1f_born) key(id) value(f_born)
So, Dimitriy's suggestion of vlookup is perfect, except that it will not work for me. After trying vlookup with my data set, the UCLA data that Dimitriy used for his example, and a toy data set I created, vlookup always failed at the point where the program attempts to save a temp file to my temp folder. Below is the program for vlookup. Notice that it sets a tempfile, manipulates the data, and then saves the file.
*! version 1.0.0 KHC 16oct2003
program define vlookup, sortpreserve
    version 8.0
    syntax varname, Generate(name) Key(varname) Value(varname)
    qui {
        tempvar g k
        egen `k' = group(`key')
        egen `g' = group(`key' `value')
        local k = `k'[_N]
        local g = `g'[_N]
        if `k' != `g' {
            di in red "`value' is unique within `key';"
            di in red /*
                */ "there are multiple observations with different `value'" /*
                */ " within `key'."
            exit 9
        }
        preserve
        tempvar g _merge
        tempfile file
        sort `key'
        by `key' : keep if _n == 1
        keep `key' `value'
        sort `key'
        rename `key' `varlist'
        rename `value' `generate'
        save `file', replace
        restore
        sort `varlist'
        joinby `varlist' using `file', unmatched(master) _merge(`_merge')
        drop `_merge'
    }
end
exit
For some reason, Stata gave me an error, "invalid file," at the save `file', replace point. I have a restricted data set with requirements to point all my Stata temp files to a very specific folder that has an erasure program sweeping it every so often. I don't know why this would create a problem, but maybe it does; I really don't know. Regardless, I tweaked the vlookup program and it appears to do what I need now.
clear all
set more off
capture log close

input aid mf1aid fborn
 1  2  1
 2  1  1
 3  5  0
 4  2  0
 5  1  0
 6  4  0
 7  6  1
 8  2  .
 9  1  0
10  8  1
end

program define justlinkit, sortpreserve
    syntax varname, Generate(name) Key(varname) Value(name)
    qui {
        preserve
        tempvar g _merge
        sort `key'
        by `key' : keep if _n == 1
        keep `key' `value'
        sort `key'
        rename `key' `varlist'
        rename `value' `generate'
        save "Z:\Jonathan\created data sets\justlinkit program\fchara.dta", replace
        restore
        sort `varlist'
        joinby `varlist' using "Z:\Jonathan\created data sets\justlinkit program\fchara.dta", unmatched(master) _merge(`_merge')
        drop `_merge'
    }
end

// set trace on
justlinkit mf1aid, gen(mf1_fborn) key(aid) value(fborn)

sort aid
list
Well, this fixed my problem. Thanks to all who responded; I would not have figured this out without you.

Reading sparse columns from a CSV

I get a CSV that I need to read into a SQL table. Right now it's manually uploaded through a web application, but I want to move this into SQL Server. Rather than port my import script straight across into a script task in SSIS, I wanted to check whether there was a better way to do it.
The issue with this particular CSV is that the first few columns are known, and have appropriate headers. However, after that group, the rest of the columns are sparsely populated and might not even have headers.
Example:
Col1,Col2,Col3,,,,,,
value1,value2,value3,,value4
value1,value2,value3,value4,value5
value1,value2,value3,,value4,value5
value1,value2,value3,,,value4
What makes this tolerable is that everything after Col3 can get concatenated together. The script checks each row for these trailing columns and puts them together into a "misc" column. It has to do this in a bit of a blind method because there is no way of knowing ahead of time how many of these columns will be out there.
Is there a way to do this with SSIS tools, or should I just port my existing import script to an SSIS script task?
Another option outside of SSIS is using BULK INSERT with format files.
Format files allow you to describe the format of the incoming data.
For example:
9.0
4
1 SQLCHAR 0 100 "," 1 Header1 SQL_Latin1_General_CP1_CI_AS
2 SQLCHAR 0 100 "," 2 Header2 SQL_Latin1_General_CP1_CI_AS
3 SQLCHAR 0 100 "," 3 Header3 SQL_Latin1_General_CP1_CI_AS
4 SQLCHAR 0 100 "\r\n" 4 Misc SQL_Latin1_General_CP1_CI_AS
Bulk Insert>> http://msdn.microsoft.com/en-us/library/ms188365.aspx
Format Files >> http://msdn.microsoft.com/en-us/library/ms178129.aspx
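As a rough sketch of how the pieces would fit together (the table name and file paths below are made up for illustration, and the format file is assumed to be the one above saved as input.fmt):
-- Hypothetical table and paths; adjust to your environment. The target table
-- is assumed to have four varchar columns matching the format file.
BULK INSERT dbo.ImportTarget
FROM 'C:\data\input.csv'
WITH (
    FORMATFILE = 'C:\data\input.fmt',
    FIRSTROW = 2  -- skip the header row
);
Because the last field in the format file is terminated by the line ending, everything after Col3, commas included, lands in the Misc column, which matches the concatenation behaviour described in the question.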
Step 0. My test file with an additional line
Col1,Col2,Col3,,,,,,
value1,value2,value3,,value4
value1,value2,value3,value4,value5
value1,value2,value3,,value4,value5
value1,value2,value3,,,value4
ends,with,comma,,,value4,
Drag a DFT (Data Flow Task) onto the Control Flow surface.
Inside the DFT, on the data flow surface, drag in a Flat File Source.
Let it map by itself to start with, and check "Column names in the first data row".
You will see Col1, Col2, and Col3, which are your known fields.
You will also see Column 3 through Column 8. These are the columns that need to be lumped into one Misc column.
Go to the Advanced section of the Flat File Connection Manager Editor.
Rename Column 3 to Misc and set its field size to 4000.
Note: for anything longer than that, you would need to use the Text data type.
That will pose some challenges, so be ready for fun ;-)
Delete Columns 4 through 8.
Now add a Script Component.
Input Columns - select only the Misc field. Usage Type: ReadWrite
Code:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    string sMisc = Row.Misc;
    string sManipulated = string.Empty;
    string temp = string.Empty;

    string[] values = sMisc.Split(',');
    foreach (string value in values)
    {
        temp = value;
        if (temp.Trim().Equals(string.Empty))
        {
            temp = "NA";
        }
        sManipulated = string.Format("{0},{1}", sManipulated, temp);
    }

    Row.Misc = sManipulated.Substring(1);
}
Destination: nothing different from usual.
Hope I have understood your problem and that the solution works for you.