Export Data from SQL to CSV

I'm using Entity Framework to access a SQL Server instance and return data. The data needs to be formatted into a tab-delimited file, and I then want to compress it to return to the user.
I can do the select, then iterate over the EF objects and format all the data into one big string, but this takes forever (I'm returning about 800k rows). The query itself is quite fast; it's just the creation of the CSV file in memory that is killing it.
I found a post that describes how to use sqlcmd to do this directly as an export (but to CSV) from SQL, which seems very promising, but I'm unclear how to pass -E and the other parameters to ExecuteSqlCommand()... or whether it is even meant for this.
I tried to do something like this:
var test = context.Database.ExecuteSqlCommand("select Chromosome c,
StartLocation sl, Endlocation el, GeneName gn from Gencode where c = chr1",
"-E", "-Q", new SqlParameter("-s", "\t"));
But of course that didn't work...
Any suggestions as to how to go about this? I'm using EF 6.1 if that matters.

An alternative option using a simple, manual method:
F5 (run the query) --> store the result to a file --> keep the file name
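If this needs to happen from code rather than by hand, note that -E, -Q and -s are command-line switches for the sqlcmd executable itself, not parameters that ExecuteSqlCommand() understands. Since the query is fast and the bottleneck is building the CSV in memory, one option is to stream the rows straight from a data reader into a compressed, tab-delimited file. The following is only a minimal sketch, assuming a plain SqlConnection used alongside EF, a hypothetical connectionString variable, and the column names from the question:

using System.Data.SqlClient;
using System.IO;
using System.IO.Compression;

// Stream rows from a SqlDataReader into a gzipped, tab-delimited file,
// avoiding EF entity materialization and the single large in-memory string.
using (var conn = new SqlConnection(connectionString)) // connectionString is a placeholder
using (var cmd = new SqlCommand(
    "select Chromosome, StartLocation, EndLocation, GeneName from Gencode where Chromosome = @c", conn))
{
    cmd.Parameters.AddWithValue("@c", "chr1");
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    using (var file = File.Create("gencode.tsv.gz"))
    using (var gzip = new GZipStream(file, CompressionMode.Compress))
    using (var writer = new StreamWriter(gzip))
    {
        while (reader.Read())
        {
            // one tab-delimited line per row, written directly to the compressed stream
            writer.WriteLine(string.Join("\t",
                reader["Chromosome"], reader["StartLocation"],
                reader["EndLocation"], reader["GeneName"]));
        }
    }
}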

How to run SPARQL queries in R (WBG Topical Taxonomy) without parallelization

I am an R user and I am interested in using the World Bank Group (WBG) Topical Taxonomy through SPARQL queries.
This can be done directly against the API at https://vocabulary.worldbank.org/PoolParty/sparql/taxonomy, but it can also be done through R by using the function load.rdf (to load the taxonomy.rdf RDF/XML file downloaded from https://vocabulary.worldbank.org/) and then sparql.rdf to perform the query. These functions are available in the "rrdf" package.
These are the three lines of code:
taxonomy_file <- load.rdf("taxonomy.rdf")
query <- "SELECT DISTINCT ?nodeOnPath WHERE {http://vocabulary.worldbank.org/taxonomy/435 http://www.w3.org/2004/02/skos/core#narrower* ?nodeOnPath}"
result_query_1 <- sparql.rdf(taxonomy_file,query)
What I obtain from result_query_1 is exactly the same as what I get through the API.
However, the load.rdf function uses all the cores available on my computer, not just one. The function somehow parallelizes the load task over all the cores on my machine, and I do not want that. I haven't found any option in that function to force serial (single-core) usage.
Therefore, I am trying to find other solutions. For instance, I have tried rdf_parse and rdf_query from the "rdflib" package, but without any encouraging result. These are the lines of code I have used:
taxonomy_file <- rdf_parse("taxonomy.rdf")
query <- "SELECT DISTINCT ?nodeOnPath WHERE {http://vocabulary.worldbank.org/taxonomy/435 http://www.w3.org/2004/02/skos/core#narrower* ?nodeOnPath}"
result_query_2 <- rdf_query(taxonomy_file , query = query)
Is there any other function that performs this task? The objective of my work is to run several queries simultaneously using foreach.
Thank you very much for any suggestions you can provide.
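Since the stated goal is to run several queries simultaneously with foreach, here is a minimal sketch of one way to do that with rdflib and doParallel. This is only an illustration, not part of the original question: the number of workers and the query list are assumptions, and each worker parses its own copy of the taxonomy because rdf objects wrap C-level pointers that cannot be shared across processes.

library(rdflib)
library(foreach)
library(doParallel)

# queries to run in parallel; the single query here mirrors the one above
queries <- c(
  "SELECT DISTINCT ?nodeOnPath WHERE {<http://vocabulary.worldbank.org/taxonomy/435> <http://www.w3.org/2004/02/skos/core#narrower>* ?nodeOnPath}"
)

cl <- makeCluster(2)    # number of worker processes is an assumption
registerDoParallel(cl)

results <- foreach(q = queries, .packages = "rdflib") %dopar% {
  taxonomy_file <- rdf_parse("taxonomy.rdf")  # parse inside the worker
  rdf_query(taxonomy_file, query = q)
}

stopCluster(cl)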

Is there a way to execute a text Gremlin query with PartitionStrategy

I'm looking for an implementation to run a text query, e.g. "g.V().limit(1).toList()", while using the PartitionStrategy in Apache TinkerPop.
I'm attempting to build a REST interface to run queries against selected graph partitions only. I know how to run a raw query using Client, but I'm looking for an implementation where I can create a multi-tenant graph (https://tinkerpop.apache.org/docs/current/reference/#partitionstrategy) and query only selected tenants using a raw text query instead of a GLV. I'm able to query only selected partitions using gremlin-python, but I couldn't find any reference implementation for running a text query against a single tenant.
Here is the tenant query implementation:
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.strategies import PartitionStrategy

connection = DriverRemoteConnection('ws://megamind-ws:8182/gremlin', 'g')
g = traversal().withRemote(connection)
partition = PartitionStrategy(partition_key="partition_key",
                              write_partition="tenant_a",
                              read_partitions=["tenant_a"])
partitioned_g = g.withStrategies(partition)
x = partitioned_g.V().limit(1).next()  # query runs against the selected partition only
Here is how I execute a raw query on the entire graph, but I'm looking for an implementation to run text-based queries against only the selected partitions.
from gremlin_python.driver import client

client = client.Client('ws://megamind-ws:8182/gremlin', 'g')
results = client.submitAsync("g.V().limit(1).toList()").result().one()  # runs on the entire graph
print(results)
client.close()
Any suggestions appreciated. TIA
It depends on how the backend store handles text mode queries, but for the query itself, essentially you just need to use the Groovy/Java style formulation. This will work with GremlinServer and Amazon Neptune. For other backends you will need to make sure that this syntax is supported. So from Python you would use something like:
client.submit('''
    g.withStrategies(new PartitionStrategy(partitionKey: "_partition",
                                           writePartition: "b",
                                           readPartitions: ["b"])).V().count()
''')
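Putting that together with the Client from the question, a hedged end-to-end sketch could look like the following; the endpoint and the "_partition" key are carried over from the snippets above, and the tenant name is just an example:

from gremlin_python.driver import client

c = client.Client('ws://megamind-ws:8182/gremlin', 'g')

# submit the Groovy-style text query; only vertices in tenant_a should be visible
result_set = c.submit('''
    g.withStrategies(new PartitionStrategy(partitionKey: "_partition",
                                           writePartition: "tenant_a",
                                           readPartitions: ["tenant_a"])).V().limit(1).toList()
''')
print(result_set.all().result())
c.close()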

Can't figure out how to insert keys and values of nested JSON data into SQL rows with NiFi

I'm working on a personal project and am very new (learning as I go) to JSON, NiFi, SQL, etc., so forgive any confusing language used here or a potentially really obvious solution. I can clarify as needed.
I need to take the JSON output from a website's API call and insert it into a table in my local MariaDB server. The issue is that the JSON data is nested, and two of the key pieces of data that I need to insert are used as variable keys rather than as values, so I don't know how to extract them and put them in the database table. Essentially, I think I need to identify the different pieces of the JSON expression and insert them as values, but I'm clueless how to do so.
I've played around with the EvaluateJsonPath, SplitJson, and FlattenJson processors in particular, but I can't make it work. All I can ever do is get the result of the whole expression rather than each piece of it.
{"5381":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":74.0,"tm_def_snp":63.0,"temperature":58.0,"st_snp":8.0,"punts":4.0,"punt_yds":178.0,"punt_lng":55.0,"punt_in_20":1.0,"punt_avg":44.5,"humidity":47.0,"gp":1.0,"gms_active":1.0},
"1023":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":82.0,"tm_def_snp":56.0,"temperature":74.0,"off_snp":82.0,"humidity":66.0,"gs":1.0,"gp":1.0,"gms_active":1.0},
"5300":{"wind_speed":17.0,"tm_st_snp":27.0,"tm_off_snp":80.0,"tm_def_snp":64.0,"temperature":64.0,"st_snp":21.0,"pts_std":9.0,"pts_ppr":9.0,"pts_half_ppr":9.0,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl":4.0,"idp_sack":1.0,"idp_qb_hit":2.0,"humidity":100.0,"gp":1.0,"gms_active":1.0,"def_snp":23.0},
"608":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":53.0,"tm_def_snp":79.0,"temperature":88.0,"st_snp":4.0,"pts_std":5.5,"pts_ppr":5.5,"pts_half_ppr":5.5,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl_ast":1.0,"idp_tkl":5.0,"humidity":78.0,"gs":1.0,"gp":1.0,"gms_active":1.0,"def_snp":56.0},
"3396":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":60.0,"tm_def_snp":70.0,"temperature":63.0,"st_snp":19.0,"off_snp":13.0,"humidity":100.0,"gp":1.0,"gms_active":1.0}}
This is a snapshot of an output with a couple thousand lines. Each of the numeric keys you see above (5381, 1023, 5300, etc.) is a player ID for the stats that follow it. I have a table set up with three columns: Player ID, Stat ID, and Stat Value. For example, I need that first snippet to be inserted into my table like this:
Player ID    Stat ID       Stat Value
5381         wind_speed    4.0
5381         tm_st_snp     26.0
5381         tm_off_snp    74.0
And so on, for each piece of data. But I don't know how to have NiFi select the right pieces of data to insert in the right columns.
I believe it's possible to use a Jolt transform (the JoltTransformJSON processor) to convert your JSON into this format:
[
  {"playerId":"5381", "statId":"wind_speed", "statValue": 0.123},
  {"playerId":"5381", "statId":"tm_st_snp", "statValue": 0.456},
  ...
]
and then use PutDatabaseRecord with a JSON record reader.
Another approach is to use the ExecuteGroovyScript processor.
Add a new property to it named SQL.mydb and link it to your DBCP controller service.
Then use the following script as the Script Body property:
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder

def ff = session.get()
if (!ff) return

// read flow file content and parse it
def body = ff.read().withReader("UTF-8") { reader ->
    new JsonSlurper().parse(reader)
}

def results = []

// use the defined SQL connection to create a batch insert
SQL.mydb.withTransaction {
    def cmd = 'insert into mytable(playerId, statId, statValue) values(?,?,?)'
    results = SQL.mydb.withBatch(100, cmd) { statement ->
        // run through all keys/subkeys in the flow file body
        body.each { pid, keys ->
            keys.each { k, v ->
                statement.addBatch(pid, k, v)
            }
        }
    }
}

// write results as the new flow file content
ff.write("UTF-8") { writer ->
    new JsonBuilder(results).writeTo(writer)
}

// transfer to success
REL_SUCCESS << ff

Not all data read in from SQL database to R

I am creating a Shiny app and have been using data from Microsoft SQL Server Management Studio by creating my table with the query below, saving it as a CSV, and reading it in with
alldata<-read.csv(file1$datapath, fileEncoding = "UTF-8-BOM")
with the above code in my server function and the below code in my ui function
fileInput("file1", "Choose CSV File", accept=".csv")
Using this code, I have been able to manipulate all of the data (creating tables and plots) from the CSV successfully. I wanted to try obtaining the data directly from the SQL server when my app loads instead of going into SQL, executing the query, saving the data, and then loading it into my app. I tried the code below, and it sort of works. For example, the variable CODE has 30 levels, all of which are represented and able to be manipulated when I read the data in from the CSV, but only 23 are represented and able to be manipulated when I run the code below. Is there a specific reason this may be happening? I tried running the SQL code along with the code that makes my data tables in base R instead of Shiny, to see if I could spot something specific not being read in correctly, but it all works perfectly when I read it in line by line.
library(RODBCext)
dbhandle<-odbcDriverConnect('driver={SQL Server}; server=myserver.com;
database=mydb; trusted_connection=true')
query<-"SELECT CAST(r.DATE_COMPLETED AS DATE)[DATE]
, res.CODE
, r.TYPE
, r.LOCATION
, res.OPERATION
, res.UNIT
FROM
mydb.RECORD r
LEFT OUTER JOIN mydb.RESULT res
ON r.AMSN = res.AMSN
and r.unit = res.unit
where r.STATUS = 'C'
and res.CODE like '%ABC-%'"
auditdata<-sqlExecute(channel=dbhandle, query=query, fetch=TRUE, stringsAsFactors=FALSE)
odbcClose(dbhandle)
*I only want the complete data set loaded once per Shiny session, so I currently have this outside of the server function in my server.R file.
Try adding believeNRows=FALSE to your sqlExecute call. RODBC doesn't return the full result set without this parameter.
auditdata<-sqlExecute(channel=dbhandle, query=query, fetch=TRUE, stringsAsFactors=FALSE, believeNRows=FALSE)
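If that works, an alternative (an untested sketch based on the same RODBC option) is to set believeNRows when the connection is opened, so every query made through that handle ignores the driver's row-count estimate:

dbhandle <- odbcDriverConnect(
  'driver={SQL Server}; server=myserver.com; database=mydb; trusted_connection=true',
  believeNRows = FALSE
)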

How to evenly distribute data in Apache Pig output files?

I've got a Pig Latin script that takes in some XML, uses the XPath UDF to pull out some fields, and then stores the resulting fields:
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
store results into '$output';
Note that we're using pig-0.12.0 on our cluster, so I ripped the XPath/XMLLoader classes out of pig-0.14.0 and put them in my own jar so that I could use them in 0.12.
The above script works fine and produces the data that I'm looking for. However, it generates over 1,900 part files with only a few MB in each file. I learned about the default_parallel option, so I set that to 128 to try to get 128 part files. I ended up having to add a step that forces a reduce phase to achieve this. My script now looks like:
set default_parallel 128;
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
forced_reduce = FOREACH (GROUP results BY RANDOM()) GENERATE FLATTEN(results);
store forced_reduce into '$output';
Again, this produces the expected data, and I now get 128 part files. My problem now is that the data is not evenly distributed among the part files. Some have 8 GB, others have 100 MB. I should have expected this when grouping by RANDOM() :).
My question is: what would be the preferred way to limit the number of part files yet still have them evenly sized? I'm new to Pig/Pig Latin and assume I'm going about this in completely the wrong way.
p.s. the reason I care about the number of part files is that I'd like to process the output with Spark, and our Spark cluster seems to do a lot better with a smaller number of files.
I'm still looking for a way to do this directly from the Pig script, but for now my "solution" is to repartition the data within the Spark process that works on the output of the Pig script, using the RDD.coalesce function to rebalance the data.
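For reference, a minimal sketch of that Spark-side rebalancing in PySpark (the paths and partition count are placeholders, and sc is assumed to be an existing SparkContext):

# read the Pig output and rewrite it as 128 similarly sized partitions
rdd = sc.textFile("hdfs:///path/to/pig/output")        # placeholder path
balanced = rdd.coalesce(128, shuffle=True)             # shuffle=True forces a full shuffle so records are spread across partitions
balanced.saveAsTextFile("hdfs:///path/to/rebalanced")  # placeholder path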
From the first code snippet, I am assuming it is a map-only job, since you are not using any aggregates.
Instead of using reducers, set the property pig.maxCombinedSplitSize:
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
store results into '$output';
exec;
set pig.maxCombinedSplitSize 1000000000; -- 1 GB (size given in bytes)
x = load '$output' using PigStorage();
store x into '$output2' using PigStorage();
pig.maxCombinedSplitSize - setting this property makes sure each mapper reads around 1 GB of data, and the above code works as an identity mapper job, which helps you write the data out in 1 GB part-file chunks.