Is there a way to store PyTable columns in a specific order? - pytables

It seems that the PyTables columns are ordered alphabetically whether a dictionary or a class is used for the schema definition in the call to createTable(). My need is to establish a specific order and then use numpy.genfromtxt() to read and store my data from text. The columns in my text file are not in alphabetical order, unlike those of the PyTable.
For example, assuming text file is named mydata.txt and is organized as follows:
time(row1) bVar(row1) dVar(row1) aVar(row1) cVar(row1)
time(row2) bVar(row2) dVar(row2) aVar(row2) cVar(row2)
...
time(rowN) bVar(rowN) dVar(rowN) aVar(rowN) cVar(rowN)
So, the desire is to create a table whose columns follow this order,
and then use numpy.genfromtxt to populate the table.
# Column and Table definition with desired order
class parmDev(tables.IsDescription):
    time = tables.Float64Col()
    bVar = tables.Float64Col()
    dVar = tables.Float64Col()
    aVar = tables.Float64Col()
    cVar = tables.Float64Col()
#...
mytab = h5file.createTable(group, tabName, parmDev)  # h5file is an open tables.File
data = numpy.genfromtxt('mydata.txt')
mytab.append(data)
This approach is desirable because the code is straightforward and very fast. But the PyTable columns always end up ordered alphabetically, even though the appended data follows the desired order. Am I missing something basic here? Is there a way to have the order of the table columns follow the class definition order instead of being alphabetical?

Yes, you can define an order in tables in several different ways. The easiest one is to use the pos parameter for each column. See the docs for the Col class:
http://pytables.github.io/usersguide/libref/declarative_classes.html#the-col-class-and-its-descendants
For your example, it will look like:
class parmDev(tables.IsDescription):
    time = tables.Float64Col(pos=0)
    bVar = tables.Float64Col(pos=1)
    dVar = tables.Float64Col(pos=2)
    aVar = tables.Float64Col(pos=3)
    cVar = tables.Float64Col(pos=4)
Hope this helps
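Putting the answer together with the question's workflow, here is a minimal sketch. It uses the modern snake_case PyTables API (open_file/create_table, equivalent to the older createTable spelling), and the file, group, and table names as well as the genfromtxt dtype are assumptions for illustration:
import numpy as np
import tables

# Description with an explicit column order via pos
class ParmDev(tables.IsDescription):
    time = tables.Float64Col(pos=0)
    bVar = tables.Float64Col(pos=1)
    dVar = tables.Float64Col(pos=2)
    aVar = tables.Float64Col(pos=3)
    cVar = tables.Float64Col(pos=4)

h5file = tables.open_file('mydata.h5', mode='w')            # hypothetical output file
group = h5file.create_group('/', 'run1', 'Example group')   # hypothetical group
mytab = h5file.create_table(group, 'parms', ParmDev)

# Read the text file into a structured array whose field order matches the table
dtype = np.dtype([('time', 'f8'), ('bVar', 'f8'), ('dVar', 'f8'),
                  ('aVar', 'f8'), ('cVar', 'f8')])
data = np.genfromtxt('mydata.txt', dtype=dtype)

mytab.append(data)
h5file.close()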

Related

How to retrieve the list of dynamic nested keys of BigQuery nested records

My ELT tool imports my data into BigQuery and automatically generates/extends the schema for dynamic nested keys (under properties).
How can I get the list of nested keys of a repeated record, so that, for example, I can group by a property when those items have said property non-null?
I have tried
select column_name
from my_schema.INFORMATION_SCHEMA.COLUMNS
where table_name = 'my_table'
But it only lists first-level keys.
As a first step, I want a SQL query that returns
message
user_id
seeker
liker_id
rateable_id
rateable_type
from_organization
likeable_type
company
existing_attempt
...
My real goal, though, is to group/count my data based on a non-null value of a 2nd-level nested property, properties.filters.[filter_type].
The schema may evolve when our application adds more filters, so this needs to be generated dynamically; I can't just hard-code the list of nested keys.
Note: this is very similar to this question How to extract all the keys in a JSON object with BigQuery but in my case my data is already in a schema and it's not a JSON object.
EDIT:
Suppose I have a list of such records with nested properties. How do I write a SQL query that adds a field "enabled_filters" which aggregates, for each item, the list of properties for which said property is not null?
Example input (properties.x are dynamic and not known by the programmer)

search_id   properties.filters.school   properties.filters.type
---------   -------------------------   -----------------------
1           MIT                         master
2           Princetown                  null
3           null                        master

Example output

search_id   enabled_filters
---------   ------------------
1           ["school", "type"]
2           ["school"]
3           ["type"]
Have you looked at COLUMN_FIELD_PATHS? It should give you the paths for all columns.
select field_path from my_schema.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS where table_name = '<table>'
[https://cloud.google.com/bigquery/docs/information-schema-column-field-paths]
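If you need that list programmatically, so the set of second-level keys under properties.filters stays in sync with the schema, a rough sketch with the google-cloud-bigquery Python client could look like this (the dataset and table names are the same placeholders as above, and default credentials are assumed):
from google.cloud import bigquery

client = bigquery.Client()  # assumes default project/credentials

# field_path contains dotted paths such as properties.filters.school
sql = """
    SELECT field_path
    FROM my_schema.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
    WHERE table_name = 'my_table'
"""
paths = [row.field_path for row in client.query(sql).result()]

# keep only the second-level keys under properties.filters
prefix = 'properties.filters.'
filter_keys = [p[len(prefix):] for p in paths
               if p.startswith(prefix) and '.' not in p[len(prefix):]]
print(filter_keys)  # e.g. ['school', 'type', ...]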
The field properties is not nested with arrays, only with structures, so a UDF in JavaScript that parses this field should work fast enough.
CREATE TEMP FUNCTION jsonObjectKeys(input STRING, shownull BOOL, fullname BOOL)
RETURNS Array<String>
LANGUAGE js AS """
function test(input, old) {
  var out = [];
  for (let x in input) {
    let te = input[x];
    out = out.concat(te == null ? (shownull ? [x + '==null'] : [])
                     : typeof te == 'object' ? test(te, old + x + '.')
                     : [fullname ? old + x : x]);
  }
  return out;
}
return test(JSON.parse(input), "");
""";
with tbl as (
  select struct(1 as alpha, struct(2 as x, 3 as y, [1,2,3] as z) as B) A
  from unnest(generate_array(1, 10*1))
  union all
  select struct(null, struct(null, 1, [999]))
)
select *,
  TO_JSON_STRING(A) as string_output,
  jsonObjectKeys(TO_JSON_STRING(A), true, false) as output1,
  jsonObjectKeys(TO_JSON_STRING(A), false, true) as output2,
  concat('["', array_to_string(jsonObjectKeys(TO_JSON_STRING(A), false, true), '","'), '"]') as output_string,
  jsonObjectKeys(TO_JSON_STRING(A.B), false, true) as output3
from tbl

how to dynamically build a select list from an API payload using PyPika

I have a JSON API payload containing a table name and a column list. How can I build a SELECT query from it using pypika?
So far I have been able to use a string column list, but I am not able to do advanced querying using functions, analytics, etc.
from pypika import Table, Query, functions as fn

def generate_sql(tablename, collist):
    table = Table(tablename)
    columns = [str(table) + '.' + each for each in collist]
    q = Query.from_(table).select(*columns)
    return q.get_sql(quote_char=None)

tablename = 'customers'
collist = ['id', 'fname', 'fn.Sum(revenue)']
print(generate_sql(tablename, collist))  #1

table = Table(tablename)
q = Query.from_(table).select(table.id, table.fname, fn.Sum(table.revenue))
print(q.get_sql(quote_char=None))  #2
#1 outputs
SELECT "customers".id,"customers".fname,"customers".fn.Sum(revenue) FROM customers
#2 outputs correctly
SELECT id,fname,SUM(revenue) FROM customers
You should not be trying to assemble the query in a string by yourself, that defeats the whole purpose of pypika.
In your case, where the table name and the columns come as text in a JSON object, you can use * to unpack those values from the collist and use the obj[key] syntax to get a table attribute by name from a string.
q = Query.from_(table).select(*(table[col] for col in collist))
# SELECT id,fname,fn.Sum(revenue) FROM customers
Hmm... that doesn't quite work for the fn.Sum(revenue). The goal is to get SUM(revenue).
This can get much more complicated from this point. If you are only sending column names that you know to belong to that table, the above solution is enough.
But if you have complex SQL expressions that reference SQL functions or even different tables, I suggest you rethink the decision to send that as JSON: you might end up building something as complex as pypika itself, like a custom parser. Your better option there would be to change the format of your JSON response object (see the structured-payload sketch at the end of this answer).
If, however, you know you only need to support a very limited set of capabilities, parsing the strings could be feasible. For example, you can assume the following constraints:
all column names refer to only one table, no joins or alias
all functions will be prefixed by fn.
no fancy stuff like window functions, distinct, count(*)...
Then you can do something like:
from pypika import Table, Query, functions as fn
import re

tablename = 'customers'
collist = ['id', 'fname', 'fn.Sum(revenue / 2)', 'revenue % fn.Count(id)']

def parsed(cols):
    pattern = r'(?:\bfn\.[a-zA-Z]\w*)|([a-zA-Z]\w*)'
    subst = lambda m: f"{'' if m.group().startswith('fn.') else 'table.'}{m.group()}"
    yield from (re.sub(pattern, subst, col) for col in cols)

table = Table(tablename)
env = dict(table=table, fn=fn)
q = Query.from_(table).select(*(eval(col, env) for col in parsed(collist)))
print(q.get_sql(quote_char=None))
Output:
SELECT id,fname,SUM(revenue/2),MOD(revenue,COUNT(id)) FROM customers
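Alternatively, following the suggestion above to change the format of the payload, here is a minimal sketch of a more structured column list. The 'col'/'func' keys are purely hypothetical (nothing pypika defines), just one possible shape for the JSON:
from pypika import Table, Query, functions as fn

tablename = 'customers'
# each entry names a column and, optionally, a pypika function to wrap it in
collist = [
    {'col': 'id'},
    {'col': 'fname'},
    {'col': 'revenue', 'func': 'Sum'},
]

table = Table(tablename)

def to_term(spec):
    term = table[spec['col']]
    if 'func' in spec:
        term = getattr(fn, spec['func'])(term)  # e.g. fn.Sum(table.revenue)
    return term

q = Query.from_(table).select(*(to_term(spec) for spec in collist))
print(q.get_sql(quote_char=None))
# SELECT id,fname,SUM(revenue) FROM customers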

how to refer to values based on column names in python

I am trying to extract and read the data from a SQL query.
Below is the sample data from SQL developer:
target_name    expected_instances  environment  system_name       hostname
---------------------------------------------------------------------------------
ORAUAT_host1   1                   UAT          ORAUAT_host1_sys  host1.sample.net
ORAUAT_host2   1                   UAT          ORAUAT_host1_sys  host2.sample.net
Normally I pass the system_name to the query (which has a bind variable for system_name) and get the data as a list, but not the column names.
Is there a way in Python to retrieve the data along with the column names, and to reference values by column name, e.g. target_name[0] giving the value ORAUAT_host1? Please suggest. Thanks.
If what you want is to get the column names from the table you are querying, you can do something like this.
My example writes the results to a CSV file:
import csv
import cx_Oracle

db = cx_Oracle.connect('user/pass@host:1521/service_name')
SQL = "select * from dual"
print(SQL)
cursor = db.cursor()
f = open(r"C:\dual.csv", "w")
writer = csv.writer(f, lineterminator="\n", quoting=csv.QUOTE_NONNUMERIC)
cursor.execute(SQL)
# this takes the column names
col_names = [row[0] for row in cursor.description]
writer.writerow(col_names)
for row in cursor:
    writer.writerow(row)
f.close()
The way to get the column names is the description attribute of the cursor object:
Cursor.description
This read-only attribute is a sequence of 7-item sequences. Each of
these sequences contains information describing one result column:
(name, type, display_size, internal_size, precision, scale, null_ok).
This attribute will be None for operations that do not return rows or
if the cursor has not had an operation invoked via the execute()
method yet.
The type will be one of the database type constants defined at the
module level.
https://cx-oracle.readthedocs.io/en/latest/api_manual/cursor.html#
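To reference values by column name in your own code (rather than writing a CSV), a small sketch: it assumes the same cx_Oracle cursor as above, and the table name and bind variable are placeholders matching the data in the question.
# build dictionaries keyed by column name using cursor.description
cursor.execute(
    "select target_name, expected_instances, environment, system_name, hostname "
    "from my_targets where system_name = :sysname",   # hypothetical table name
    sysname='ORAUAT_host1_sys')
columns = [col[0] for col in cursor.description]
rows = [dict(zip(columns, r)) for r in cursor.fetchall()]

# Oracle reports unquoted column names in upper case
print(rows[0]['TARGET_NAME'])   # e.g. ORAUAT_host1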

Converting xml node string to strip out nodes

I have a table that has a column called RAW DATA of type NVARCHAR(MAX), which is a dump from a web service. Here is a sample of one data row:
<CourtRecordEventCaseHist>
<eventDate>2008-02-11T06:00:00Z</eventDate>
<eventDate_TZ>-0600</eventDate_TZ>
<histSeqNo>4</histSeqNo>
<countyNo>1</countyNo>
<caseNo>xxxxxx</caseNo>
<eventType>WCCS</eventType>
<descr>Warrant/Capias/Commitment served</descr>
<tag/>
<ctofcNameL/>
<ctofcNameF/>
<ctofcNameM/>
<ctofcSuffix/>
<sealCtofcNameL/>
<sealCtofcNameF/>
<sealCtofcNameM/>
<sealCtofcSuffix/>
<sealCtofcTypeCodeDescr/>
<courtRptrNameL/>
<courtRptrNameF/>
<courtRptrNameM/>
<courtRptrSuffix/>
<dktTxt>Signature bond set</dktTxt>
<eventAmt>0.00</eventAmt>
<isMoneyEnabled>false</isMoneyEnabled>
<courtRecordEventPartyList>
<partyNameF>Name</partyNameF>
<partyNameM>A.</partyNameM>
<partyNameL>xxxx</partyNameL>
<partySuffix/>
<isAddrSealed>false</isAddrSealed>
<isSeal>false</isSeal>
</courtRecordEventPartyList>
</CourtRecordEventCaseHist>
It was supposed to go into a table, with the node names representing the column names. The table it's going to has already been created; I just need to extract the data from this row into the table. I have hundreds of thousands of records like this. I was going to copy to an XML file, then import, but there is so much data that I would rather do the work within the DB.
Any ideas?
First, create the table with all the required columns.
Then, use your favorite scripting language to load the table! Mine being Groovy, here is what I'd do:
def sql = Sql.newInstance(/* SQL connection here */)
sql.eachRow("select RAW_DATA from TABLE_NAME") { row ->
    String xmlData = row."RAW_DATA"
    def root = new XmlSlurper().parseText(xmlData)
    def date = root.eventDate
    def histSeqNo = root.histSeqNo
    // Pull out all the data and insert into the new table!
}
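A roughly equivalent sketch in Python, in case that is closer to hand, using pyodbc and the standard-library XML parser; the connection string is a placeholder and the table/column names are taken from this thread:
import pyodbc
import xml.etree.ElementTree as ET

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes')
cur = conn.cursor()

for (raw_data,) in cur.execute("SELECT RawData FROM dbo.CaseHistoryRawData_10"):
    root = ET.fromstring(raw_data)
    case_no = root.findtext('caseNo')
    county_no = root.findtext('countyNo')
    event_date = root.findtext('eventDate')
    # ...pull out the remaining elements and insert them into the target table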
I did find an answer to this; I'm sure there is more than one way of doing it, but this is what I got to work. Thanks for everyone's help.
SELECT
    pref.value('(caseNo/text())[1]', 'varchar(20)') as CaseNumber,
    pref.value('(countyNo/text())[1]', 'int') as CountyNumber
FROM
    dbo.CaseHistoryRawData_10
    CROSS APPLY RawData.nodes('//CourtRecordEventCaseHist') AS CourtRec(pref)

Merging result from 2 columns with same name and not over-writing one

I have a simple MySQL query like:
SELECT *
FROM `content_category` CC , `content_item` CI
WHERE CI.content_id = '" . (int)$contentId . "'
AND CI.category_id = CC.category_id
AND CI.active = 1
Both tables have a column called configuration, one of which gets overwritten in the query, i.e. only content_item.configuration is returned in the result.
Short of explicitly naming and aliasing the columns like
SELECT CC.configuration as `category_configuration`,
CC.category_id as `.....
is there a way of selecting ALL data, i.e. * from both tables, and resolving those duplicate column names in a non-destructive way?
You don't need to alias ALL the columns, just the conflicting one:
SELECT *, CC.configuration as cc_conf, CI.configuration as ci_conf
FROM `content_category` CC, `content_item` CI
WHERE CI.content_id = '" . (int)$contentId . "'
  AND CI.category_id = CC.category_id
  AND CI.active = 1
This demonstrates one of the many reasons why always using the * wildcard is not a good practice. All the columns are returned in the result set, but if you access them via an associative array or via object properties in your host language (e.g. PHP or Ruby), you can naturally only have one of the columns associated with each key or object property.
Solutions:
Fetch them all and reference the columns by ordinal position (see the sketch after this list).
Stop using the wildcard for one table or the other, and give column aliases.
Rename your columns to be distinct.
Define a VIEW with the column aliasing spelled out, and query from the view.
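For the first option, a short sketch in Python (PyMySQL here, with placeholder connection details) showing why the associative form collides and how ordinal access avoids it:
import pymysql

content_id = 42  # example value
conn = pymysql.connect(host='localhost', user='me', password='secret', database='mydb')
cur = conn.cursor()  # the default cursor returns plain tuples, so nothing is overwritten

cur.execute("""
    SELECT *
    FROM content_category CC, content_item CI
    WHERE CI.content_id = %s
      AND CI.category_id = CC.category_id
      AND CI.active = 1
""", (content_id,))

# cursor.description lists one entry per selected column, duplicates included,
# so you can map names to ordinal positions yourself
names = [d[0] for d in cur.description]
row = cur.fetchone()
config_positions = [i for i, n in enumerate(names) if n == 'configuration']
# content_category's columns come first in the SELECT *, then content_item's
category_conf, item_conf = (row[i] for i in config_positions)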