Hive SerDe while loading data into a single column

I created an external table with the SerDe properties below:
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"')
When I try to load the data, the value below, which should be loaded into a single column, is instead split across the next columns as shown.
Data that should be in a single column:
"{""t"":""xxxxxxxxx"",""mi_u"":""anon-xxxxxxxxxxx"",""mi_cid"":""xxxx"",""page_title"":""120 pages: Skip line ruling, 1/2\"" writing space, dotted midline, 1/4\"" black cover. Journals, Paperback | xxxxxxxx""}"
Data as it is actually loaded:
Column a:
{""t"":""xxxxxxxxx"",""mi_u"":""anon-xxxxxxxxxxx"",""mi_cid"":""xxxx"",""page_title"":""120 pages: Skip line ruling, 1/2"" writing space
Column b:
dotted midline
Column c:
1/4"" black cover. Journals, Paperback | xxxxxxxx""}
I want the entire value to be in column a.
Please let me know how this SerDe can be modified to accommodate the data in a single column.
Thanks!

Related

Using PySpark to convert a column from string to timestamp

I have a pyspark dataframe with 2 columns (Violation_Time, Time_First_Observed) which are captured as strings. A sample of the data is below, where it is captured as HHmm with "A" or "P" representing am or pm. Also, the data has errors where some entries exceed 24 hours.
Violation_Time Time_First_Observed
0830A 1600P
1450P 0720A
1630P 2540P
0900A 0100A
I would like to use pyspark to remove the "A" and "P" from both columns and subsequently convert the data (e.g., 0800, 1930, etc.) into a timestamp for analysis purposes. I have tried to do this for the "Violation_Time" column and to create a new column "timestamp" to store this (see code below). However, I can't seem to get it to work. Any form of help is appreciated, thank you.
sparkdf3.withColumn('timestamp',F.to_timestamp("Violation_Time", "HH"))
sparkdf3.select(['Violation_Time','timestamp']).show()
You can use the following:
sparkdf3 = sparkdf3.withColumn('timestamp', func.split(func.to_timestamp('Violation_Time', 'HHmm'), ' ').getItem(1))
sparkdf3.select(['Violation_Time','timestamp']).show()
Explanation
sparkdf3.withColumn('timestamp',
    func.split(
        func.to_timestamp('Violation_Time', 'HHmm'),  # convert to a timestamp (datetime) value
        ' '
    ).getItem(1)  # split the timestamp on the space and take the second item, i.e. the time part
)
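The snippet above leaves the trailing "A"/"P" in Violation_Time. A hedged sketch of the fuller step the question asks for (my own addition, not the posted answer) strips the letter with regexp_replace before parsing; the intermediate column name time_digits is arbitrary. The sample hours already appear to be 24-hour, so no AM/PM shift is applied, and invalid entries such as 2540 should simply come back as NULL under Spark's default, non-ANSI settings.
from pyspark.sql import functions as F

sparkdf3 = (
    sparkdf3
    .withColumn('time_digits', F.regexp_replace('Violation_Time', '[AP]$', ''))  # "0830A" -> "0830"
    .withColumn('timestamp', F.to_timestamp('time_digits', 'HHmm'))              # parse HHmm; bad rows become NULL
)
sparkdf3.select('Violation_Time', 'time_digits', 'timestamp').show()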

Split one file into multiple files using a Pig script

I have a pipe-delimited text file, say abc.txt, which has a different number of columns in different records. The number of columns in a record can be 100, 80, 70, or 60. I need to split abc.txt based on the 3rd column's value: if the third column has the value "A", that record should go to A.txt; if "B", to B.txt. I need to write a Pig script for this.
abc = LOAD 'abc.txt' using PigStorage('|');
Assuming you have the 3rd column in all the records, SPLIT using the positional notation. It starts from 0, so the third column will be $2.
SPLIT abc into a_records if $2 == 'A', b_records if $2 == 'B';
Then store the results; note that STORE takes an output directory path, not a filename.
STORE a_records into 'A_DIR' using PigStorage('|');
STORE b_records into 'B_DIR' using PigStorage('|');

Matching several combinations of columns in a table

I am reading a table whose values all have to be validated before we process it further. The valid values are stored in another table that we match our main table against. The validation criterion is to match several columns as follows:
Table 1 (the main data we read in)
Name --- Unit --- Age --- Address --- Nationality
The above shows the column names that we are reading from the table; the other table contains the valid values for these columns. When we look for valid values in our main table, we have to consider a combination of columns in the main data table, for example Name --- Unit --- Age. If all the values in a particular row for that column combination match the other table, then we keep the row; otherwise we delete it.
How do I address this with NumPy?
Thanks
You can just loop through the rows. An easy, simple way would be:
dummy_df = table_df.copy()  ## make a copy of your table; since we are deleting rows we want to keep the original df intact
relevant_columns = ['age','name','sex',...]  ## define the relevant columns, in case either dataframe has columns you don't want to compare on
for indx in dummy_df.index:
    ## drop the row if its values for relevant_columns fully match some row of main_df
    if ((np.array(dummy_df.loc[indx][relevant_columns]) == main_df[relevant_columns].values).sum(1) == len(relevant_columns)).sum() > 0:
        dummy_df = dummy_df.drop(indx)
PS: I am assuming the data is in pandas DataFrame format.
Hope it helps :)
PS2: if the headers/columns have different names it won't work.
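A vectorized alternative, just as a sketch: assuming the main table and the valid-values table are pandas DataFrames (here hypothetically named data_df and valid_df) sharing the column names in relevant_columns, an inner merge keeps only the rows of the main table whose column combination appears in the valid-values table, which is the keep/delete rule the question describes.
# data_df (the main table) and valid_df (the valid-values table) are assumed
# to be already-loaded pandas DataFrames; both names are hypothetical
relevant_columns = ['Name', 'Unit', 'Age']  # the column combination to validate on
valid_keys = valid_df[relevant_columns].drop_duplicates()  # avoid multiplying rows on repeated combinations
kept_df = data_df.merge(valid_keys, on=relevant_columns, how='inner')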

Extract column metadata from rows

I'm trying to read a csv file where metadata is placed on rows 4 to 7:
Source,"EUROSTAT","INE","BDP"
Magnitude,,,
Unit ,"Percent","Percent","Unit"
Series,"Rate","PIB","Growth"
The relevant data starts on row 10. This CSV format will always be fixed, with this sort of data and this row layout:
Date,Series1,Series2,Series3
30-09-2014,13.1,1.1,5.55
30-06-2014,13.9,0.9,5.63
31-03-2014,15.1,1.0,5.57
31-12-2013,15.3,1.6,5.55
30-09-2013,15.5,-1.0,5.66
30-06-2013,16.4,-2.1,5.65
What I've done is skip the first rows, read the data from row 10 onwards, and define the column metadata myself. However, the rows then carry neither the Unit nor the Magnitude that was defined in the metadata header. In SSIS I do an Unpivot of the columns into an Indicator column, so I have the Date, Series, Value style of table.
My question here is: how can I make a table of the format Date, Series, Value, Magnitude, Unit, where the Series, Magnitude, and Unit are read from the first 10 rows?
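The question is about SSIS, but the reshaping itself is easy to sketch. Purely as an illustration, here is a pandas version; the file name input.csv, the exact skiprows offsets, and the use of melt are my assumptions based on the layout shown above.
import pandas as pd

# metadata occupies rows 4-7, so skip the first 3 rows and read 4 rows
meta = pd.read_csv('input.csv', skiprows=3, nrows=4, header=None, index_col=0)
meta.index = meta.index.str.strip()               # 'Source', 'Magnitude', 'Unit', 'Series'
meta.columns = ['Series1', 'Series2', 'Series3']  # line the metadata up with the data columns

# the Date,Series1,Series2,Series3 header sits on row 10; the data follows it
data = pd.read_csv('input.csv', skiprows=9)

# unpivot to Date, Series, Value and attach Unit and Magnitude per series
long = data.melt(id_vars='Date', var_name='Series', value_name='Value')
long['Unit'] = long['Series'].map(meta.loc['Unit'])
long['Magnitude'] = long['Series'].map(meta.loc['Magnitude'])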

Apache Spark SQL query optimization and saving result values?

I have large data in text files (1,000,000 lines). Each line has 128 columns.
Here each line is a feature and each column is a dimension.
I have converted the txt files to JSON format and am able to run SQL queries on the JSON files using Spark.
Now I am trying to build a k-d tree with this large data.
My steps:
1) Calculate the variance of each column, pick the column with the maximum variance and make it the key of the first node, with the mean of that column as the value of the node.
2) Based on the first node's value, split the data into 2 parts and repeat the process until you reach a stopping point.
My sample code:
import sqlContext._
val people = sqlContext.jsonFile("siftoutput/")
people.printSchema()
people.registerTempTable("people")
val output = sqlContext.sql("SELECT * From people")
The people table has 128 columns.
My questions:
1) How do I save the result values of a query into a list?
2) How do I calculate the variance of a column?
3) I will be running multiple queries on the same data. Does Spark have any way to optimize this?
4) How do I save the output as key-value pairs in a text file?
Please help.
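The snippet above uses the Spark 1.x Scala API. As a rough illustration of the four points, here is a PySpark sketch (my own addition, not from the post): it assumes the modern SparkSession API, the same siftoutput/ JSON directory, and a hypothetical column name c1.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kdtree-prep").getOrCreate()
people = spark.read.json("siftoutput/")   # same JSON directory as in the question
people.cache()                            # 3) cache so repeated queries on the same data don't re-read it
people.createOrReplaceTempView("people")

# 1) collect() turns a query result into a local Python list of Row objects
rows = spark.sql("SELECT * FROM people").collect()

# 2) variance (and mean) of a single, hypothetical column "c1"
stats = people.select(F.variance("c1").alias("var_c1"), F.mean("c1").alias("mean_c1")).first()

# 4) write key/value pairs (column name -> variance) to a text file
var_row = people.select([F.variance(c).alias(c) for c in people.columns]).first()
pairs = spark.sparkContext.parallelize(list(var_row.asDict().items()))
pairs.map(lambda kv: "{0}\t{1}".format(kv[0], kv[1])).saveAsTextFile("variances_kv_txt")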