How to pass the options JSON in the "Create Project" POST request of the OpenRefine REST API? - openrefine

I am currently trying to upload Excel tables (as .xls) to the OpenRefine (or OntoRefine) module of Ontotext's GraphDB. Since I had problems uploading the xls, I decided to first convert the xls file to a csv file and then upload that instead. Unfortunately, OpenRefine does not always recognize the file as CSV automatically, so all the data in each row ends up in a single column. E.g.:
--------------------------------------------------
| Col1, Col2, Col3, Col4 |
--------------------------------------------------
| Row11, Row12, Row13, Row14 |
--------------------------------------------------
| Row21, Row22, Row23, Row24 |
--------------------------------------------------
Instead of:
--------------------------------------------------
| Col1 | Col2 | Col3 | Col4 |
--------------------------------------------------
| Row11 | Row12 | Row13 | Row14 |
--------------------------------------------------
| Row21 | Row22 | Row23 | Row24 |
--------------------------------------------------
With the POST request
POST /command/core/create-project-from-upload
a file format can be passed in the 'format' parameter and a JSON object containing the delimiter in the 'options' parameter. However, this does not work either, and the official OpenRefine documentation (https://github.com/OpenRefine/OpenRefine/wiki/OpenRefine-API) does not give any hints about the syntax of the 'options' JSON.
My current code looks like this:
import os
import xlrd
import csv
import requests
import re

xls_file_name_ext = os.path.basename('excel_file.xls')
# create the filename with path to the new csv file (path + name stays the same)
csv_file_path = os.path.dirname(xls_file_name_ext) + '/' + os.path.splitext(xls_file_name_ext)[0] + '.csv'

# remove all commas in the xls file
xls_wb = xlrd.open_workbook(xls_file_name_ext)
xls_sheet = xls_wb.sheet_by_index(0)
for col in range(xls_sheet.ncols):
    for row in range(xls_sheet.nrows):
        _new_cell_val = str(xls_sheet.cell(row, col).value).replace(",", " ")
        xls_sheet._cell_values[row][col] = _new_cell_val

# write to csv
with open(csv_file_path, 'w', newline='', encoding='utf-8') as csv_file:
    c_w = csv.writer(csv_file, delimiter=',')
    for row in range(xls_sheet.nrows):
        c_w.writerow(xls_sheet.row_values(row))

ontorefine_server = 'http://localhost:7200/orefine'
# filename of csv as project name in OntoRefine
onterefine_project_name = os.path.splitext(os.path.basename(csv_file_path))[0]
# the required parameters for the post request
ontorefine_data = {
    "project-name": onterefine_project_name,
    "format": "text/line-based/*sv",
    "options": {
        "separator": ","
    }
}
ontorefine_file = {'project-file': open(csv_file_path, "rb")}
# execute the post request
ontorefine_response = requests.post(
    ontorefine_server + '/command/core/create-project-from-upload', data=ontorefine_data, files=ontorefine_file
)
I assume that I am passing the POST request parameters incorrectly.

It all depends on your input data, of course, but the formatting looks OK. Here's what OntoRefine does "behind the curtains" if you try to import from the UI. You can see the same payload for yourself by intercepting your network traffic:
{
    "format": "text/line-based/*sv",
    "options": {
        "project-name": "Your-project-here",
        "separator": ","
    }
}
Judging from that, it looks like the project-name location is the only difference. Here is a curl command which does the same:
curl 'http://localhost:7200/orefine/command/core/importing-controller?controller=core%2Fdefault-importing-controller&jobID=1&subCommand=create-project' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' --data 'format=text%2Fline-based%2F*sv&options=%7B%22separator%22%3A%22%2C%22%2C%22projectName%22%3A%22Your-project-name%22%7D'
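For completeness, here is a minimal sketch of how the question's Python call could be reshaped to match that payload. It keeps the create-project-from-upload endpoint from the question, moves the project name into the options JSON, and serializes options with json.dumps so it is sent as a single form field; the file name and project name are placeholders:
import json
import requests

ontorefine_server = 'http://localhost:7200/orefine'
csv_file_path = 'excel_file.csv'  # the converted CSV from the question

# Project name goes inside the options JSON, and the whole options object is
# serialized to a string so that requests sends it as one form field.
options = {"projectName": "excel_file", "separator": ","}

response = requests.post(
    ontorefine_server + '/command/core/create-project-from-upload',
    data={"format": "text/line-based/*sv", "options": json.dumps(options)},
    files={'project-file': open(csv_file_path, "rb")},
)
print(response.status_code)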

Related

Reading an EXCEL file from bucket into Bigquery

I'm trying to load this data from my bucket into BigQuery, but it's complaining.
My file is an Excel file.
| ID | A | B | C | D | E | F | Value1 | Value2 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 333344 | ALG | A | RETAIL | OTHER | YIPP | Jun 2019 | 123 | 4 |
| 34563 | ALG | A | NON-RETAIL | OTHER | TATS | Mar 2019 | 124 | 0 |
| 7777777777 | - | E | RETAIL | NASAL | KHPO | Jul 2020 | 1,448 | 0 |
| 7777777777 | - | E | RETAIL | SEVERE ASTHMA | PZIFER | Oct 2019 | 1,493 | 162 |
From Python I load the file as follows:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

table_id = "project.dataset.my_table"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField('ID', 'STRING'),
        bigquery.SchemaField('A', 'STRING'),
        bigquery.SchemaField('B', 'STRING'),
        bigquery.SchemaField('C', 'STRING'),
        bigquery.SchemaField('D', 'STRING'),
        bigquery.SchemaField('E', 'STRING'),
        bigquery.SchemaField('F', 'STRING'),
        bigquery.SchemaField('Value1', 'STRING'),
        bigquery.SchemaField('Value2', 'STRING'),
    ],
    skip_leading_rows=1
)

uri = "gs://bucket/folder/file.xlsx"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Wait for the job to complete.

table = client.get_table(table_id)
print("Loaded {} rows to table {}".format(table.num_rows, table_id))
I am getting the following error, and it's complaining about a line that isn't even there.
BadRequest: 400 Error while reading data, error message: CSV table references column position 8, but line starting at position:660 contains only 1 columns.
I thought the problem was the data types, since I had originally declared ID, Value1, and Value2 as INTEGER and F as TIMESTAMP, so now I'm trying everything as STRING, and I still get the error.
The file has only 4 rows in this test I'm doing.
Excel files are not supported by BigQuery.
A few workaround solutions:
Upload a CSV version of your file into your bucket (a simple bq load command will do),
Read the Excel file with pandas in your Python script and insert the rows into BQ with the to_gbq() function (see the sketch after this list),
Upload your Excel file to your Google Drive, make a spreadsheet version out of it, and create an external table linked to that spreadsheet.
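A minimal sketch of the second workaround, assuming pandas plus the pandas-gbq package, with placeholder project, dataset, and table names:
import pandas as pd

# Read the Excel sheet as strings; reading straight from GCS would additionally
# need gcsfs installed and a gs:// path instead of a local copy.
df = pd.read_excel("file.xlsx", dtype=str)

# Insert the rows into BigQuery via pandas-gbq; "project", "dataset.my_table"
# and if_exists="append" are placeholders/assumptions to adjust.
df.to_gbq("dataset.my_table", project_id="project", if_exists="append")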
Try specifying field_delimiter in the LoadJobConfig
Your input file looks like a TSV.
So you need to set the field delimiter to '\t', like this:
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField('ID', 'STRING'),
        bigquery.SchemaField('A', 'STRING'),
        bigquery.SchemaField('B', 'STRING'),
        bigquery.SchemaField('C', 'STRING'),
        bigquery.SchemaField('D', 'STRING'),
        bigquery.SchemaField('E', 'STRING'),
        bigquery.SchemaField('F', 'STRING'),
        bigquery.SchemaField('Value1', 'STRING'),
        bigquery.SchemaField('Value2', 'STRING'),
    ],
    skip_leading_rows=1,
    field_delimiter='\t'
)

Loading files with dynamically generated columns

I need to create an SSIS project that loads daily batches of 150 files into a SQL Server database. Each batch always contains the same 150 files, and each file in the batch has a unique name. Each file can be either a full or an incremental type; incremental files have one more column than the full files. Each batch contains a control file that states whether a file is full or incremental. See an example of each file type below:
Full File
| SID | Name | DateOfBirth |
|:---: | :----: | :-----------: |
| 1 | Samuel | 20/05/1964 |
| 2 | Dave | 06/03/1986 |
| 3 | John | 15/09/2001 |
Incremental File
| SID | Name | DateOfBirth | DeleteRow |
|:---: | :----: | :-----------: | :----------: |
| 2 | | | 1 |
| 4 | Abil | 19/11/1993 | 0 |
| 5 | Zainab | 26/02/2006 | 0 |
I want to avoid creating 2 packages (full and incremental) for each file.
Is there a way to dynamically generate the column list in each source/destination component based on the file type in the control file? For example, when the file type is incremental, the column list should include the extra column (DeleteRow).
Let's assume my ControlFile.xlsx is:
| Col1 | Col2 |
| :--- | :--- |
| File1.xlsx | Full |
| file2.xlsx | Incremental |
Flow:
1. Create a DFT in which ControlFile.xlsx is captured in an object variable. Source: Control connection, Destination: Recordset Destination. In this "read control file" DFT you read ControlFile.xlsx and save it through the Recordset Destination into the RecordOutput variable.
2. Pass this object variable into a ForEach Loop container. The ResultSet variable should capture Col2 of ControlFile.xlsx. The ForEach Loop container uses the Foreach ADO Enumerator.
3. Create a Sequence Container just as a starting point, then add 2 DFTs, one for the full load and one for the incremental load. Use precedence constraints ("Full" for the full load, "Incremental" for the incremental load) to decide which DFT runs.
4. Inside each DFT, use an Excel source to an OLE DB destination.
5. Use a FilePath variable for the connection property of the full-load and incremental-load Excel connections to make them dynamic.
It's a bit hard to explain all the steps, but the flow is simple. One thing to notice: the incremental files have an additional column, so you'll need a Derived Column transformation in your full load for correct mapping.
Also, make sure the DelayValidation property is set to true.
I can think of two solutions.
1) Have a script task at the beginning of the package that checks whether this is an incremental load or a full load. If it is a full load, have it loop through all the files and add a "DeleteRow" column filled with zeros to every file. Then you can use the same column list for both (a rough sketch of this idea follows the list).
2) Use Biml to dynamically generate your package at run time based on the available metadata.
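SSIS script tasks are usually written in C# or VB.NET, but purely as an illustration of option 1, here is a rough standalone sketch in Python/pandas that pads full-load files with the missing DeleteRow column. The control-file layout follows the example above, and all file names are placeholders:
import pandas as pd

# Sketch only: read the control file (Col1 = file name, Col2 = load type)
# and add a DeleteRow column of zeros to every full-load file so that full
# and incremental files end up with the same column list.
control = pd.read_excel("ControlFile.xlsx")   # requires openpyxl for .xlsx

for _, entry in control.iterrows():
    if str(entry["Col2"]).strip().lower() == "full":
        df = pd.read_excel(entry["Col1"])
        if "DeleteRow" not in df.columns:
            df["DeleteRow"] = 0               # zeros, matching the incremental layout
            df.to_excel(entry["Col1"], index=False)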

Prestashop - import csv files of products in different languages : feature value not translated

I want to import CSV files of products in 2 different languages into Prestashop 1.6.
I have 2 CSV files, one for each language.
Everything is fine when I import the CSV file for the first language.
When I import the CSV file for the second language, the feature values are not understood by Prestashop as translations of the first language's feature values, but are added as new feature values.
They are added as new feature values because I use the Multiple Features module (http://addons.prestashop.com/en/search-filters-prestashop-modules/6356-multiple-features-assign-your-features-as-you-want.html).
Without this module, the second CSV import updates the feature value for both languages.
How can I make Prestashop understand that it is a translation, not a new value of a feature?
Thanks!
I found a solution by updating the database directly.
- I imported all my products for the main language using Prestashop's CSV import.
- Feature values are stored in the ps_feature_value_lang table, which has 3 columns: id_feature_value | id_lang | value
- In my case, French is ps_feature_value_lang.id_lang = 1 and English is ps_feature_value_lang.id_lang = 2.
- Before making any changes, the data in ps_feature_value_lang looks like this:
id_feature_value | id_lang | value
1 | 1 | my value in french
1 | 2 | my value in english
- I created a table (myTableOfFeatureValueIWantToImport) with 2 columns, feature_value_FR / feature_value_EN, and filled it with data.
- Because I don't know the IDs (id_feature_value) of my feature values (Prestashop created these IDs during the CSV import for my first language), I loop over the data of myTableOfFeatureValueIWantToImport, and each time ps_feature_value_lang.id_lang == 2 and ps_feature_value_lang.value == "value I want to translate", I update ps_feature_value_lang.value with the translated feature value.
$select = $connection->query("SELECT * FROM myTableOfFeatureValueIWantToImport GROUP BY feature_value_FR");
$select->setFetchMode(PDO::FETCH_OBJ);
while ($data = $select->fetch())
{
    $valFR = $data->feature_value_FR;
    $valEN = $data->feature_value_EN;
    $req = $connection->prepare('UPDATE ps_feature_value_lang
        SET ps_feature_value_lang.value = :valEN
        WHERE ps_feature_value_lang.id_lang = 2
        AND ps_feature_value_lang.value = :valFR
    ');
    $req->execute(array(
        'valEN' => $valEN,
        'valFR' => $valFR
    ));
}
done :D

How do I write output to a text file without overriding the existing contents of the file?

I have developed this function, which writes data from Erlang to a .txt file:
exporttxt() ->
    F1 = "1",
    F2 = "afif",
    F3 = "kaled",
    file:write_file("test.txt", [io_lib:format("~p\t~p\t~p~n", [F1, F2, F3])]).
After running this function, test.txt contains these values:
"1" "afif" "kaled"
but when I change F1, F2 and F3 in the function exporttxt() to:
F1 ="2"
F2 ="ahmed"
F3 = "alagi"
then test.txt contains just these values:
"2" "ahmed" "alagi"
and I want test.txt to contain:
"1" "afif" "kaled"
"2" "ahmed" "alagi"
The problem is that each execution of the function writes the new data and deletes the old data in test.txt.
How can I write new data to test.txt without overwriting existing data?
Use file:write_file/3 for this:
The third argument is Modes. The list of possible modes is read | write | append | exclusive | raw | binary | {delayed_write, Size, Delay} | delayed_write | {read_ahead, Size} | read_ahead | compressed | {encoding, Encoding}. The append mode is what you need.
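For example, a small sketch of the suggested fix applied to the question's function (passing the values as arguments is an added assumption; the key change is the [append] mode):
exporttxt(F1, F2, F3) ->
    %% Open test.txt in append mode so each call adds a new line
    %% instead of overwriting the previous contents.
    file:write_file("test.txt",
                    io_lib:format("~p\t~p\t~p~n", [F1, F2, F3]),
                    [append]).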

SQL query engine for text files on Linux?

We use grep, cut, sort, uniq, and join at the command line all the time to do data analysis. They work great, although there are shortcomings. For example, you have to give column numbers to each tool. We often have wide files (many columns) and a column header that gives column names. In fact, our files look a lot like SQL tables. I'm sure there is a driver (ODBC?) that will operate on delimited text files, and some query engine that will use that driver, so we could just use SQL queries on our text files. Since doing analysis is usually ad hoc, it would have to be minimal setup to query new files (just use the files I specify in this directory) rather than declaring particular tables in some config.
Practically speaking, what's the easiest? That is, the SQL engine and driver that is easiest to set up and use to apply against text files?
David Malcolm wrote a little tool named "squeal" (formerly "show"), which allows you to use SQL-like command-line syntax to parse text files of various formats, including CSV.
An example on squeal's home page:
$ squeal "count(*)", source from /var/log/messages* group by source order by "count(*)" desc
count(*)|source |
--------+--------------------+
1633 |kernel |
1324 |NetworkManager |
98 |ntpd |
70 |avahi-daemon |
63 |dhclient |
48 |setroubleshoot |
39 |dnsmasq |
29 |nm-system-settings |
27 |bluetoothd |
14 |/usr/sbin/gpm |
13 |acpid |
10 |init |
9 |pcscd |
9 |pulseaudio |
6 |gnome-keyring-ask |
6 |gnome-keyring-daemon|
6 |gnome-session |
6 |rsyslogd |
5 |rpc.statd |
4 |vpnc |
3 |gdm-session-worker |
2 |auditd |
2 |console-kit-daemon |
2 |libvirtd |
2 |rpcbind |
1 |nm-dispatcher.action|
1 |restorecond |
q - Run SQL directly on CSV or TSV files:
https://github.com/harelba/q
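A quick illustration of q against a hypothetical CSV file (-H treats the first row as a header with column names, -d sets the input delimiter; see the project's README for the full option list):
q -H -d "," "SELECT source, COUNT(*) FROM ./messages.csv GROUP BY source"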
Riffing off someone else's suggestion, here is a Python script for sqlite3. A little verbose, but it works.
I don't like having to completely copy the file to drop the header line, but I don't know how else to convince sqlite3's .import to skip it. I could create INSERT statements, but that seems just as bad if not worse.
Sample invocation:
$ sql.py --file foo --sql "select count(*) from data"
The code:
#!/usr/bin/env python
"""Run a SQL statement on a text file"""
import os
import sys
import getopt
import tempfile
import re

class Usage(Exception):
    def __init__(self, msg):
        self.msg = msg

def runCmd(cmd):
    if os.system(cmd):
        print "Error running " + cmd
        sys.exit(1)
        # TODO(dan): Return actual exit code

def usage():
    print >>sys.stderr, "Usage: sql.py --file file --sql sql"

def main(argv=None):
    if argv is None:
        argv = sys.argv
    try:
        try:
            opts, args = getopt.getopt(argv[1:], "h",
                                       ["help", "file=", "sql="])
        except getopt.error, msg:
            raise Usage(msg)
    except Usage, err:
        print >>sys.stderr, err.msg
        print >>sys.stderr, "for help use --help"
        return 2

    filename = None
    sql = None
    for o, a in opts:
        if o in ("-h", "--help"):
            usage()
            return 0
        elif o in ("--file"):
            filename = a
        elif o in ("--sql"):
            sql = a
        else:
            print "Found unexpected option " + o

    if not filename:
        print >>sys.stderr, "Must give --file"
        sys.exit(1)
    if not sql:
        print >>sys.stderr, "Must give --sql"
        sys.exit(1)

    # Get the first line of the file to make a CREATE statement
    #
    # Copy the rest of the lines into a new file (datafile) so that
    # sqlite3 can import data without header. If sqlite3 could skip
    # the first line with .import, this copy would be unnecessary.
    foo = open(filename)
    datafile = tempfile.NamedTemporaryFile()
    first = True
    for line in foo.readlines():
        if first:
            headers = line.rstrip().split()
            first = False
        else:
            print >>datafile, line,
    datafile.flush()
    #print datafile.name
    #runCmd("cat %s" % datafile.name)

    # Create columns with NUMERIC affinity so that if they are numbers,
    # SQL queries will treat them as such.
    create_statement = "CREATE TABLE data (" + ",".join(
        map(lambda x: "`%s` NUMERIC" % x, headers)) + ");"

    cmdfile = tempfile.NamedTemporaryFile()
    #print cmdfile.name
    print >>cmdfile, create_statement
    print >>cmdfile, ".separator ' '"
    print >>cmdfile, ".import '" + datafile.name + "' data"
    print >>cmdfile, sql + ";"
    cmdfile.flush()
    #runCmd("cat %s" % cmdfile.name)
    runCmd("cat %s | sqlite3" % cmdfile.name)

if __name__ == "__main__":
    sys.exit(main())
Maybe write a script that creates an SQLite instance (possibly in memory), imports your data from a file/stdin (accepting your data's format), runs a query, then exits?
Depending on the amount of data, performance could be acceptable.
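Here is a minimal sketch of that idea using Python's built-in sqlite3 and csv modules; the file path, comma delimiter, and the assumption that the first row is a header are placeholders to adapt:
import csv
import sqlite3
import sys

# Usage (hypothetical): python query.py data.csv "SELECT col1, COUNT(*) FROM data GROUP BY col1"
path, query = sys.argv[1], sys.argv[2]

with open(path, newline="") as f:
    rows = list(csv.reader(f))            # comma-delimited; first row = header
header, data = rows[0], rows[1:]

con = sqlite3.connect(":memory:")         # throwaway in-memory database
con.execute("CREATE TABLE data (%s)" % ", ".join('"%s"' % c for c in header))
con.executemany("INSERT INTO data VALUES (%s)" % ", ".join("?" * len(header)),
                data)

for row in con.execute(query):
    print(row)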
MySQL has a CSV storage engine that might do what you need, if your files are CSV files.
Otherwise, you can use mysqlimport to import text files into MySQL. You could create a wrapper around mysqlimport, which figures out columns etc. and creates the necessary table.
You might also be able to use DBD::AnyData, a Perl module which lets you access text files like a database.
That said, it sounds a lot like you should really look at using a database. Is it really easier keeping table-oriented data in text files?
I have used Microsoft LogParser to query CSV files several times... and it serves the purpose. It was surprising to see such a useful tool from Microsoft, and a free one at that!