Extract column metadata from rows - SQL

I'm trying to read a CSV file where metadata is placed on rows 4 to 7:
Source,"EUROSTAT","INE","BDP"
Magnitude,,,
Unit ,"Percent","Percent","Unit"
Series,"Rate","PIB","Growth"
The relevant data starts on row 10. This CSV format is always fixed, with this sort of data and this row layout:
Date,Series1,Series2,Series3
30-09-2014,13.1,1.1,5.55
30-06-2014,13.9,0.9,5.63
31-03-2014,15.1,1.0,5.57
31-12-2013,15.3,1.6,5.55
30-09-2013,15.5,-1.0,5.66
30-06-2013,16.4,-2.1,5.65
What I've done is skip the first rows, read the data from row 10 onward, and define the column metadata myself. However, the resulting rows don't carry the Unit or the Magnitude defined in the metadata header. In SSIS I unpivot the columns into an Indicator column, so I end up with a Date, Series, Value style of table.
My question is: how can I produce a table of the form Date, Series, Value, Magnitude, Unit, where Series, Magnitude, and Unit are read from the first 10 rows?
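The question is about SSIS, but the shaping logic can be sketched in pandas, assuming the fixed layout described above (metadata on rows 4-7, header on row 9, data from row 10). The inline `csv_text` string stands in for the real file:

```python
import io
import pandas as pd

# Inline stand-in for the real file: rows 1-3 and 8 are irrelevant,
# metadata sits on rows 4-7, the header on row 9, data from row 10.
csv_text = """r1,,,
r2,,,
r3,,,
Source,"EUROSTAT","INE","BDP"
Magnitude,,,
Unit ,"Percent","Percent","Unit"
Series,"Rate","PIB","Growth"
r8,,,
Date,Series1,Series2,Series3
30-09-2014,13.1,1.1,5.55
30-06-2014,13.9,0.9,5.63
"""

# Metadata block: rows 4-7; the first cell names the attribute.
meta = pd.read_csv(io.StringIO(csv_text), skiprows=3, nrows=4, header=None, index_col=0)
meta.index = meta.index.str.strip()   # "Unit " -> "Unit"

# Data block: header on row 9, values from row 10 onward.
data = pd.read_csv(io.StringIO(csv_text), skiprows=8)

# Unpivot the SeriesN columns to rows (the SSIS Unpivot step).
long = data.melt(id_vars="Date", var_name="column", value_name="Value")

# One lookup row per SeriesN column, carrying its metadata.
cols = list(data.columns[1:])
lookup = pd.DataFrame({
    "column": cols,
    "Series": meta.loc["Series"].values[: len(cols)],
    "Magnitude": meta.loc["Magnitude"].values[: len(cols)],
    "Unit": meta.loc["Unit"].values[: len(cols)],
})

# Join metadata onto the unpivoted rows and keep the requested shape.
result = long.merge(lookup, on="column").drop(columns="column")
result = result[["Date", "Series", "Value", "Magnitude", "Unit"]]
```

Here the Series names come from the metadata row ("Rate", "PIB", "Growth") rather than the Series1/2/3 headers; Magnitude is empty in the sample, so it comes through as NaN.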


How can I create a new dataframe column based on the top five values in Groupby?

How do I create a new data frame with only the top 5 values in the groupby?
Or: how do I delete the values that are not in the top 5 of the groupby?
I want to create a new dataframe with these values; I want to keep all the old columns in the new dataframe and just delete the currencies that are not in the top five.
Ethereum 607475.601104
bitcoin 588236.080000
bitcoin-cash 288909.117573
litecoin 92405.990000
bitcoin-sv 59480.781151
I've been looking around for ways to delete values from columns, but I can only find ways to delete whole columns.
Each cryptocurrency also has several instances, because the data spans a time period, so there are hundreds of Ethereum value records at different dates.
I want the new dataframe to keep all the same columns; I just want the currency column restricted to the top 5 values from the groupby ordering, with any currencies outside the top 5 deleted:
Ethereum 607475.601104
bitcoin 588236.080000
bitcoin-cash 288909.117573
litecoin 92405.990000
bitcoin-sv 59480.781151
Thank you
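A minimal pandas sketch, assuming the frame has `currency` and `value` columns (hypothetical names) and that "top 5" means the largest summed `value` per currency:

```python
import pandas as pd

# Toy stand-in for the time-series data: several rows per currency.
df = pd.DataFrame({
    "currency": ["Ethereum", "bitcoin", "dogecoin", "Ethereum", "litecoin",
                 "bitcoin-cash", "bitcoin-sv", "dogecoin", "bitcoin"],
    "value": [300000.0, 400000.0, 10.0, 307475.6, 92405.99,
              288909.12, 59480.78, 5.0, 188236.08],
})

# Rank currencies by their summed value and keep the names of the top 5.
top5 = df.groupby("currency")["value"].sum().nlargest(5).index

# Keep every original column and row, dropping only rows whose
# currency is not in the top 5.
filtered = df[df["currency"].isin(top5)]
```

`isin` filters rows, not columns, so all original columns (and all date-level records of the surviving currencies) are preserved.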

Repeating a sex variable, matched by ID, from one data frame in another data frame with multiple observations per ID

I'm trying to move a sex variable from a data frame with a single observation per ID to another data frame with the same IDs but multiple observations per ID. The number of observations per ID varies (i.e., one ID may have 15 and another 24).
Also, some IDs are missing from the first data frame but present in the second. I don't mind if R drops the unmatched extra IDs. My objective is to get the sex variable into the data frame with multiple observations. Thank you!
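In R this is a join on ID, e.g. `merge(df_multi, df_single[, c("ID", "sex")], by = "ID")` or `dplyr::left_join`. The same join, sketched in pandas with hypothetical column names `ID` and `sex`:

```python
import pandas as pd

# One row per ID, carrying sex.
single = pd.DataFrame({"ID": [1, 2, 3], "sex": ["F", "M", "F"]})

# Many rows per ID; ID 4 has no match in `single`.
multi = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 4],
                      "measure": [10, 11, 20, 21, 22, 40]})

# An inner join repeats sex for every observation of an ID and
# drops IDs (like 4) that are missing from `single`.
merged = multi.merge(single[["ID", "sex"]], on="ID", how="inner")
```

Using `how="left"` instead would keep the unmatched IDs with a missing sex value, if dropping them is not wanted.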

Return rows that don't have a specific length in pandas

I'm cleaning my dataset and have cleaned most of it, but I'm stuck on some rows that don't have the specific length the column requires.
The column (order_id) must have 16 characters, and the column type is object. I don't know how to extract all rows that don't have the exact required number of characters, or how to remove those rows.
Thank you.
For more information, here is an image of the column.
In Excel I can just filter the column and show only values that have 16 characters.
I want to do the same in pandas: return only the rows that contain 16 characters, and drop all rows with more or fewer than 16 characters.
I suppose you want to keep all rows that match the pattern [0-9A-F]{16}:
df = df[df['order_id'].str.contains(r'^[0-9A-F]{16}$', na=False)]
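If the IDs aren't hex strings and only the length matters, an alternative is to filter on `str.len()`. A sketch with a made-up frame:

```python
import pandas as pd

# Toy stand-in: two valid 16-character IDs, one too short, one too long.
df = pd.DataFrame({"order_id": ["ABCDEF0123456789", "SHORT",
                                "0123456789ABCDEF0", "FEDCBA9876543210"]})

# Keep only rows whose order_id is exactly 16 characters long.
df = df[df["order_id"].str.len() == 16]
```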

Compare 2 Datasets and get differences in Spark-Scala

I have a Dataset from an audit table that is 24 hours old (ds1) and the current day's changes as of now (ds2).
How do I get the differences in values (individual cells) into a third Dataset, diffDs?
If the two Datasets have the same schema you can do
val delta = ds2.except(ds1)
which, per the docs, "Returns a new Dataset containing rows in this Dataset but not in another Dataset."
This will be the delta between the newest and the oldest records.
If the schemas of the Datasets are different then, in my opinion, all of ds2 is the difference.
But, as you say in the comment, this returns the difference at the level of the entire Row.
I think that, to extract the cell-level differences, you need to do something like this:
val diff = ds1.union(ds2).distinct() // contains all distinct records
diff.rdd.keyBy(r => r(key_index_here)).groupByKey().mapValues(a_simple_function_that_compute_column_difference)
(Note that keyBy takes a function of the Row, and groupByKey takes no arguments; the comparison function is applied afterwards with mapValues.) Now you have to write a function that computes the difference, in terms of cells, over a sequence of Rows grouped by a key.
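The shape of that comparison function can be sketched in plain Python, leaving the Spark types aside. Rows are modelled as tuples whose first element is the key (all names here are illustrative):

```python
from collections import defaultdict

def cell_differences(rows):
    """Group rows by key, then report which columns differ within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)  # key by the first column
    diffs = {}
    for key, group in groups.items():
        if len(group) < 2:
            continue  # key only on one side: the whole row is new or removed
        old, new = group[0], group[1]
        # Collect column index -> (old value, new value) for changed cells.
        changed = {i: (old[i], new[i])
                   for i in range(1, len(old)) if old[i] != new[i]}
        if changed:
            diffs[key] = changed
    return diffs

rows = [
    ("id1", "a", 10),   # from ds1
    ("id1", "a", 12),   # from ds2: column 2 changed
    ("id2", "b", 20),   # only in ds1
]
```

In the Spark version, the grouping is what groupByKey does and this function becomes the body passed to mapValues.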

Apache Spark SQL query optimization and saving result values?

I have a large dataset in text files (1,000,000 lines). Each line has 128 columns.
Here each line is a feature and each column is a dimension.
I have converted the txt files to JSON format and am able to run SQL queries on the JSON files using Spark.
Now I am trying to build a k-d tree with this large dataset.
My steps:
1) Calculate the variance of each column, pick the column with maximum variance, and make it the key of the first node, with the mean of that column as the node's value.
2) Based on the first node's value, split the data into 2 parts and repeat the process until you reach a stopping point.
My sample code:
import sqlContext._
val people = sqlContext.jsonFile("siftoutput/")
people.printSchema()
people.registerTempTable("people")
val output = sqlContext.sql("SELECT * From people")
The people table has 128 columns.
My questions:
1) How do I save the result values of a query into a list?
2) How do I calculate the variance of a column?
3) I will be running multiple queries on the same data. Does Spark have any way to optimize this?
4) How do I save the output as key-value pairs in a text file?
Please help.
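For question 2, the variance-and-split step of the k-d tree construction can be sketched in plain Python (in Spark SQL the same statistics could be computed with an aggregate query). The 128-column data is stubbed here with 3 columns:

```python
def column_stats(rows):
    """Return (means, variances) per column for equal-length numeric rows."""
    n = len(rows)
    ncols = len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(ncols)]
    # Population variance: mean of squared deviations from the column mean.
    variances = [sum((r[c] - means[c]) ** 2 for r in rows) / n
                 for c in range(ncols)]
    return means, variances

def pick_split(rows):
    """Step 1: split on the max-variance column, at that column's mean."""
    means, variances = column_stats(rows)
    col = max(range(len(variances)), key=variances.__getitem__)
    return col, means[col]

rows = [(1.0, 10.0, 5.0), (1.0, 20.0, 5.5), (1.0, 30.0, 4.5)]
col, split = pick_split(rows)   # column 1 varies most; split at its mean
```

Step 2 would then partition `rows` into those with `r[col] <= split` and the rest, and recurse on each half.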