How to enrich places with GeoNames IDs - SPARQL

I have a list of places which I would like to enrich with IDs from GeoNames.
Since GeoNames IDs are by default embedded in Wikidata, I chose to go directly via SPARQL using the Wikidata endpoint.
My workflow:
I imported the Excel file into OpenRefine and created a new project.
In OpenRefine I created my graph, then downloaded it as RDF/XML. Here is a snapshot:
<rdf:Description rdf:about="http://localhost:3333/0">
<rdfs:label>Aïre</rdfs:label>
<crm:P1_is_identified_by>5A1CE163-105F-4BAF 8BF9</crm:P1_is_identified_by>
</rdf:Description>
I then imported the RDF file into my local GraphDB and ran this federated query:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT *
WHERE {
  ?place <http://purl.org/NET/cidoc-crm/core#P1_is_identified_by> ?value ;
         rdfs:label ?label_geo .
  SERVICE <https://query.wikidata.org/sparql> {
    ?value wdt:P31/wdt:P279* wd:Q515 ;
           rdfs:label ?label ;
           wdt:P1566 ?id_value .
  }
}
LIMIT 10
No results.
The output should be something like this:
|-----------------------|------------------|---------------|
| Oggetto | Place | GeonamesID |
|-----------------------|------------------|---------------|
|5A1CE163-105F-4BAF 8BF9| Aïre |11048419 |
|-----------------------|------------------|---------------|
Suggestions?
Thanks a lot.

I solved the problem by querying the GeoNames API directly from a client-side script.
Here is my pipeline:
I created an Excel sheet with a list of place names.
I built a Python script that uses the values from the Excel sheet as query parameters and saves the output to a .txt file, e.g. Aïre,https://www.geonames.org/11048419
import pandas as pd
import requests
import csv

url = 'http://api.geonames.org/searchJSON'

# Change the read_excel parameters according to the Excel sheet layout.
df = pd.read_excel('grp.xlsx', sheet_name='Foglio14', usecols="A")

with open("data14.txt", "a", newline="") as myfile:
    writer = csv.writer(myfile)
    for place_name in df.place_name:
        # Replace the username parameter with your GeoNames API username.
        params = {'username': "XXXXXXXX",
                  'name_equals': place_name,
                  'maxRows': "1"}
        response = requests.get(url, params=params)
        for item in response.json()["geonames"]:
            # Write one row per result: place name, GeoNames URL.
            writer.writerow([item["name"],
                             "https://www.geonames.org/{}".format(item["geonameId"])])
I copied the output from the .txt file into column B of the Excel sheet.
I then split the output values into two columns, e.g.
|---------------------|-----------------------------------|
| ColA | ColB |
|---------------------|-----------------------------------|
| Aïre | https://www.geonames.org/11048419 |
|---------------------|-----------------------------------|
Since there is no 1:1 correspondence between the place names and the obtained results, I aligned the values:
In the Excel sheet I created a new empty column B.
In column B I wrote the formula =IF(ISNA(MATCH(A1;C:C;0));"";INDEX(C:C;MATCH(A1;C:C;0))) and filled it down to the end of the list.
Then I created a new empty column C.
In column C I wrote the formula =IFERROR(INDEX($E:$E;MATCH($B1;$D:$D;0));"") and filled it down to the end of the list.
Here is the final result:
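As an aside, the same alignment step could probably be done in pandas instead of Excel formulas. A minimal sketch, assuming the original names sit in a place_name column and the script's output is read into a frame with name and geonames_url columns (these column names are assumptions, not part of the original workflow):

import pandas as pd

# Hypothetical column names: 'place_name' in the original sheet,
# 'name' / 'geonames_url' for the GeoNames output written by the script.
places = pd.read_excel('grp.xlsx', sheet_name='Foglio14', usecols="A")
results = pd.read_csv('data14.txt', names=['name', 'geonames_url'])

# A left merge keeps every original place and leaves the URL empty when
# GeoNames returned no match (or a differently spelled one).
aligned = places.merge(results, left_on='place_name', right_on='name', how='left')
aligned.to_excel('aligned.xlsx', index=False)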


How can I read and parse files with varying spaces as delimiters?

I need help solving this problem:
I have a directory full of .txt files that look like this:
file1.no
file2.no
file3.no
And every file has the following structure (I only care for the first two "columns" in the .txt):
#POS SEQ SCORE QQ-INTERVAL STD MSA DATA
#The alpha parameter 0.75858
#The likelihood of the data given alpha and the tree is:
#LL=-4797.62
1 M 0.3821 [0.01331,0.5465] 0.4421 7/7
2 E 0.4508 [0.05393,0.6788] 0.5331 7/7
3 L 0.5334 [0.05393,0.6788] 0.6279 7/7
4 G 0.5339 [0.05393,0.6788] 0.624 7/7
And I want to parse all of them into one DataFrame, while also converting the columns into lists for each row (i.e., the first column should be converted into a string like this: ["MELG"]).
But now I am running into two issues:
How to read the different files and append all of them to a single DataFrame, while also making a single column out of all the rows inside said files.
How to parse these files, given that the spaces between the columns vary for almost all of them.
My output should look like this:
|File |SEQ |SCORE|
| --- | ---| --- |
|File1|MELG|0.3821,0.4508,0.5334,0.5339|
|File2|AAHG|0.5412,1,2345,0.0241,0.5901|
|File3|LLKM|0.9812,0,2145,0.4142,0.4921|
So, the first column for the first file (file1.no), the one with single letters, is now in a list, in a row with all the information from that file, and the DataFrame has one row for each file.
Any help is welcome, thanks in advance.
Here is example code (in Julia) that should work for you:
using DataFrames

function parsefile(filename)
    l = readlines(filename)
    # Drop the comment/header lines that start with '#'
    filter!(x -> !startswith(x, "#"), l)
    # split without a delimiter handles any run of whitespace
    sl = split.(l)
    return (File=filename,
            SEQ=join(getindex.(sl, 2)),
            SCORE=parse.(Float64, getindex.(sl, 3)))
end

df = DataFrame()
foreach(fn -> push!(df, parsefile(fn)), ["file$i.no" for i in 1:3])
The result will be in the df data frame.
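If you would rather stay in Python/pandas, here is a rough equivalent sketch (untested; it assumes the same file names and that SEQ and SCORE are the second and third whitespace-separated fields):

import pandas as pd

def parse_file(filename):
    seq, scores = [], []
    with open(filename) as f:
        for line in f:
            # Skip the comment/header lines and blanks
            if line.startswith("#") or not line.strip():
                continue
            # split() with no argument collapses any run of spaces or tabs
            fields = line.split()
            seq.append(fields[1])
            scores.append(float(fields[2]))
    return {"File": filename, "SEQ": "".join(seq), "SCORE": scores}

df = pd.DataFrame([parse_file("file{}.no".format(i)) for i in range(1, 4)])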

How can I import data from Excel to Postgres - many-to-many relationship

I'm creating a web application and I encountered a problem with importing data into a table in a Postgres database.
I have an Excel sheet with id_b and id_cat (book IDs and category IDs); books have several categories and categories can be assigned to many books. The Excel sheet looks like this:
(screenshot of the Excel data)
It has 30 000 records.
I have a problem with how to import it into the database (Postgres). The table for this data has two columns:
id_b and id_cat. I wanted to export this data to CSV in such a way that each book is paired with one category identifier per row (e.g., the book with identifier 1 should appear 9 times because it has 9 categories assigned to it, and so on), but I can't do it easily. It should look like this:
(screenshot of the desired output)
Does anyone know any way to get data in this form?
Your Excel sheet format has a large number of columns, which also depends on the number of categories, and SQL isn't well adapted to that.
The simplest option would be to:
Export your Excel data as CSV.
Use a Python script to read it with the csv module and output a COPY-friendly tab-delimited format.
Load this into the database (or INSERT directly from the Python script); see the loading sketch at the end.
Something like this...
import csv

with open('bookcat.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        if row:
            id = row[0].strip()
            categories = row[1:]
            for cat in categories:
                cat = cat.strip()
                if cat:
                    # One tab-delimited line per (book, category) pair
                    print("%s\t%s" % (id, cat))
CSV output version:
import csv

with open('bookcat.csv') as csvfile, open("out.csv", "w", newline="") as outfile:
    reader = csv.reader(csvfile)
    writer = csv.writer(outfile)
    for row in reader:
        if row:
            id = row[0].strip()
            categories = row[1:]
            for cat in categories:
                cat = cat.strip()
                if cat:
                    # One CSV row per (book, category) pair
                    writer.writerow((id, cat))
If you need a specific CSV format, check the docs of the csv module.
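To actually load the generated file into Postgres, one possibility is to stream it in with COPY via psycopg2. This is only a sketch: the table name book_category and the connection parameters are assumptions, not something from the question.

import psycopg2

# Placeholder connection parameters; adjust to your own database.
conn = psycopg2.connect(dbname="mydb", user="me", password="secret")
with conn, conn.cursor() as cur, open("out.csv") as f:
    # Assumes a table book_category(id_b integer, id_cat integer) already exists
    # and that out.csv was produced by the CSV-writer version above.
    cur.copy_expert("COPY book_category (id_b, id_cat) FROM STDIN WITH CSV", f)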

BeautifulSoup/lxml - Extract href and convert line breaks to CSV columns?

I have a large number of copies of this page saved locally and am working to extract the content and put it in a CSV. I have two questions, and over two full days I've tried so many solutions that it would be difficult to list them here.
Here's the page hosted online for reference: source page
and code:
import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/my/path/to/local/files/filename.html"), "lxml")
table = soup.find('table', attrs={"class": "report_column"})
headers = [header.text for header in table.find_all('th')]
rows = []
for row in table.find_all('tr'):
    rows.append([val.text.encode('utf8') for val in row.find_all('td')])
url = []
with open('output_file.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["", "License Num", "Status", "Type/ Dup.", "Expir. Date", "Primary Owner and Premises Addr.", "Mailing Address", "Action", "Conditions", "Escrow", "District Code", "Geo Code"])
    writer.writerows(row for row in rows if row)
The first question: I need to associate each row of the CSV with the date, which is shown at the top of the page but is also available in the href of all the column headers. How would I extract that href and either join it in or somehow add it as an independent column in the CSV?
The second question: when getting columns with multiple line breaks (like the Primary Owner and Mailing Address columns), I'm getting cell content that is one long string. Could you give me any tips on how I could delineate the line breaks with pipes or, ideally, put them in unique columns, e.g. Owner1, Owner2, Owner3, Owner4, one for each of the (up to 4) lines in the cell?
Thanks for any help!
Desired Output:
Right now I'm getting this in the Primary Owner column:
DBA:MECCA DELICATESSEN RAWLINGS, LINDA MAE 215 RESERVATION RDMARINA, CA 93933
And ideally I could get four columns, one for each line (delineated by a <br> in the table):
col0 July 12, 2017 (date from page header)
col6 DBA:MECCA DELICATESSEN RAWLINGS
col7 LINDA MAE
col8 215 RESERVATION RD
col9 MARINA, CA 93933
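A possible direction for both points (an untested sketch, assuming the date link sits in a header cell and the <br> tags survive parsing): read the href from the first header link, and use get_text with a separator so the line breaks inside a cell become pipes that can later be split into separate columns.

# Untested sketch; the table variable comes from the code above.
first_link = table.find('th').find('a')
report_href = first_link['href'] if first_link is not None else ""

rows = []
for row in table.find_all('tr'):
    # separator="|" keeps each <br>-separated line as its own pipe-delimited piece
    cells = [td.get_text(separator="|", strip=True) for td in row.find_all('td')]
    if cells:
        rows.append([report_href] + cells)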

Prestashop - importing CSV files of products in different languages: feature value not translated

I want to import CSV files of products in 2 different languages into Prestashop 1.6.
I have 2 CSV files, one for each language.
Everything is fine when I import the CSV file of the first language.
When I import the CSV file of the second language, the feature values are not understood by Prestashop as translations of the feature values of the first language, but are added as new feature values.
They are added as new feature values because I use the Multiple Features module (http://addons.prestashop.com/en/search-filters-prestashop-modules/6356-multiple-features-assign-your-features-as-you-want.html).
Without this module, the second CSV import updates the feature value of both languages.
How can I make Prestashop understand that it is a translation, not a new value of a feature?
Thanks!
I found a solution by updating the database directly.
- I imported all my products using the CSV import in Prestashop for the main language.
- Feature values are stored in the ps_feature_value_lang table, which has 3 columns: id_feature_value | id_lang | value
- In my case, French is ps_feature_value_lang.id_lang = 1 and English is ps_feature_value_lang.id_lang = 2
- Before any change, the data in ps_feature_value_lang looks like this:
id_feature_value | id_lang | value
1 | 1 | my value in french
1 | 2 | my value in english
- I created a table (myTableOfFeatureValueIWantToImport) with 2 columns: feature_value_FR / feature_value_EN. I filled this table with data.
- Because I don't know the IDs (id_feature_value) of my feature values (Prestashop created these IDs during the import of the CSV file of my first language), I loop over the data of myTableOfFeatureValueIWantToImport, and each time ps_feature_value_lang.id_lang == 2 and ps_feature_value_lang.value == "value I want to translate", I update ps_feature_value_lang.value with the translated feature value.
$select = $connection->query("SELECT * FROM myTableOfFeatureValueIWantToImport GROUP BY feature_value_FR");
$select->setFetchMode(PDO::FETCH_OBJ);
while ($data = $select->fetch())
{
    $valFR = $data->feature_value_FR;
    $valEN = $data->feature_value_EN;
    // Overwrite the untranslated value on the English rows (id_lang = 2) only
    $req = $connection->prepare('UPDATE ps_feature_value_lang
        SET ps_feature_value_lang.value = :valEN
        WHERE ps_feature_value_lang.id_lang = 2
        AND ps_feature_value_lang.value = :valFR
    ');
    $req->execute(array(
        'valEN' => $valEN,
        'valFR' => $valFR
    ));
}
done :D

Apache Spark SQL query optimization and saving result values?

I have a large amount of data in text files (1,000,000 lines). Each line has 128 columns.
Here each line is a feature and each column is a dimension.
I have converted the txt files to JSON format and am able to run SQL queries on the JSON file using Spark.
Now I am trying to build a k-d tree with this large data set.
My steps:
1) Calculate the variance of each column, pick the column with maximum variance and make it the key of the first node, with the mean of the column as the value of the node.
2) Based on the first node value, split the data into 2 parts and repeat the process until a stopping point is reached.
My sample code:
import sqlContext._
val people = sqlContext.jsonFile("siftoutput/")
people.printSchema()
people.registerTempTable("people")
val output = sqlContext.sql("SELECT * From people")
The people table has 128 columns.
My questions:
1) How do I save the result values of a query into a list?
2) How do I calculate the variance of a column?
3) I will be running multiple queries on the same data. Does Spark have any way to optimize this?
4) How do I save the output as key-value pairs in a text file?
Please help.
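For illustration only, here is a rough PySpark sketch of points 2 and 3 (the column name c1 is an assumption, and the plain-SQL variance formula is just one way to do it; the Scala API is analogous): the variance can be computed from basic aggregates, and cache() keeps the parsed table in memory across repeated queries.

from pyspark.sql import SQLContext  # assumes an existing SparkContext named sc

sqlContext = SQLContext(sc)
people = sqlContext.read.json("siftoutput/")
people.registerTempTable("people")
people.cache()  # reuse the parsed data across repeated queries

# variance(x) = E[x^2] - E[x]^2, written with basic SQL aggregates
row = sqlContext.sql(
    "SELECT AVG(c1 * c1) - AVG(c1) * AVG(c1) AS var_c1 FROM people").first()
variance_c1 = row.var_c1

# collect() brings a query result back to the driver as a Python list of Rows
values = [r.c1 for r in sqlContext.sql("SELECT c1 FROM people").collect()]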