How can I import data from Excel to Postgres - many-to-many relationship - SQL

I'm creating a web application and I ran into a problem importing data into a table in a Postgres database.
I have an Excel file with id_b and id_cat (book IDs and category IDs). Books have several categories and categories can be assigned to many books. The Excel sheet looks like this:
[screenshot: excel data]
It has 30,000 records.
I have a problem with how to import it into the database (Postgres). The table for this data has two columns:
id_b and id_cat. I wanted to export this data to CSV so that each row assigns one category identifier to one book (e.g., the book with identifier 1 should appear 9 times because it has 9 categories assigned to it, and so on), but I can't do it easily. It should look like this:
[screenshot: correct data]
Does anyone know any way to get data in this form?

Your Excel sheet has a large number of columns, which also depends on the number of categories, and SQL isn't well suited to that layout.
The simplest option would be to:
Export your Excel data as CSV.
Use a Python script to read it with the csv module and output a COPY-friendly tab-delimited format.
Load this into the database (or INSERT directly from the Python script); a loading sketch follows the scripts below.
Something like this:
import csv
with open('bookcat.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        if row:
            # first column is the book id, the remaining columns are its category ids
            id = row[0].strip()
            categories = row[1:]
            for cat in categories:
                cat = cat.strip()
                if cat:
                    # one tab-delimited (book, category) pair per line
                    print("%s\t%s" % (id, cat))
CSV output version:
import csv
with open('bookcat.csv') as csvfile, open("out.csv", "w") as outfile:
    reader = csv.reader(csvfile)
    writer = csv.writer(outfile)
    for row in reader:
        if row:
            id = row[0].strip()
            categories = row[1:]
            for cat in categories:
                cat = cat.strip()
                if cat:
                    writer.writerow((id, cat))
If you need a specific CSV format, check the csv module docs.
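For the loading step, a minimal sketch using psycopg2's COPY support could look like this. The connection string, the table name book_category(id_b, id_cat) and the file name bookcat.tsv are placeholders I've assumed, not details from the question; adjust them to your schema.
import psycopg2

# Hypothetical connection string and table name -- adjust to your setup.
conn = psycopg2.connect("dbname=mydb user=myuser")
with conn, conn.cursor() as cur, open('bookcat.tsv') as f:
    # COPY ... FROM STDIN reads tab-delimited rows by default,
    # which matches the output of the script above.
    cur.copy_expert("COPY book_category (id_b, id_cat) FROM STDIN", f)
Alternatively, the same tab-delimited file can be loaded from the psql client with \copy book_category (id_b, id_cat) FROM 'bookcat.tsv', which avoids writing any loading code at all.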

Related

Export BigQuery results to GCS based on the value of a column

I've been using the following code to export BQ results to GCS.
export_query = f"""
EXPORT DATA
  OPTIONS(
    uri='{uri}',
    format=format,
    overwrite=true,
    compression='GZIP')
AS {query}"""
client.query(export_query, project=project).result()
Now I need to split a large table based on the value of a particular column, and then export each part separately, i.e.
query_k = """
SELECT * FROM table WHERE col = k
"""
for k = 1,2,3...N
I can do it by running N queries, but that seems very slow and consumes a lot of resources. I am wondering if it is possible to accomplish the task with one single query.
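For reference, the per-value loop described above ("running N queries") might look like the following sketch. It simply repeats the EXPORT DATA statement once per value of col; the project, bucket path, table and column names, and N are placeholder assumptions, not values from the question.
from google.cloud import bigquery

# All names below (project, bucket path, table, column, N) are placeholders.
client = bigquery.Client(project="my-project")
N = 3  # number of distinct values of `col`
for k in range(1, N + 1):
    export_query = f"""
    EXPORT DATA
      OPTIONS(
        uri='gs://my-bucket/col_{k}/*.csv.gz',
        format='CSV',
        overwrite=true,
        compression='GZIP')
    AS SELECT * FROM `my-project.my_dataset.my_table` WHERE col = {k}"""
    client.query(export_query).result()  # runs one export job per value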

split a large csv file according to some NaN values

There is a huge file, almost 10 million rows, which I would like to split based on the values of a given column: one set represents internal measurements (inside a house) and the other is the external data (outside). This code takes way too long to split; any ideas?
fext = open('external.csv', 'a')
fint = open('internal.csv', 'a')
for df in pd.read_csv('todo.csv', parse_dates=['Measured At'],
                      low_memory=False, chunksize=500000):
    dfExt = df[df['Temperatura Exterior'].notnull()]
    dfInt = df[df['Temperatura Exterior'].isnull()]
    dfExt.to_csv(fext, header=False)
    dfInt.to_csv(fint, header=False)
fext.close(); fint.close()

Reformat wide Excel table to more SQL-friendly structure

I have a very wide Excel sheet, from Column A - DIE (about 2500 columns wide), of survey data. Each column is a question, and each row is a response. I'm trying to upload the data to SQL and convert it to a more SQL-friendly format using the UNPIVOT function, but I can't even get it loaded into SQL because it exceeds the 1024-column limit.
Basically, I have an Excel sheet that looks like this:
But I want to convert it to look like this:
What options do I have to make this change, either in Excel (prior to upload) or SQL (while circumventing the 1024 column limit)?
I have had to do this quite a bit. My solution was to write a Python script that un-crosstabs a CSV file (typically exported from Excel), creating another CSV file. The Python code is here: https://pypi.python.org/pypi/un-xtab/ and the documentation is here: http://pythonhosted.org/un-xtab/. I've never run it on a file with 2500 columns, but I don't know why it wouldn't work.
R has a very specific function call in one of its libraries. You can also connect to a database and read and write data with R. I would suggest downloading R and RStudio.
Here is a working script to get you started that does what you need:
Sample data:
df <- data.frame(id = c(1,2,3), question_1 = c(1,0,1), question_2 = c(2,0,2))
df
Input table:
id question_1 question_2
1 1 1 2
2 2 0 0
3 3 1 2
Code to transpose the data:
df2 <- gather(df, key = id, value = values, -id)  # gather every column except id
df2
Output:
id id values
1 1 question_1 1
2 2 question_1 0
3 3 question_1 1
4 1 question_2 2
5 2 question_2 0
6 3 question_2 2
Some helper functions for you to import and export the csv data:
# Install and load the necessary libraries
install.packages(c('tidyr','readr'))
library(tidyr)
library(readr)
# to read a csv file
df <- read_csv('[some directory][some filename].csv')
# To output the csv file
write.csv(df2, '[some directory]data.csv', row.names = FALSE)
Thanks for all the help. I ended up using Python due to limitations in both SQL (over 1024 columns wide) and Excel (well over 1 million rows in the output). I borrowed the concepts from rd_nielson's code, but that was a bit more complicated than I needed. In case it's helpful to anyone else, this is the code I used. It outputs a csv file with 3 columns and 14 million rows that I can upload to SQL.
import csv
with open('Responses.csv') as f:
    reader = csv.reader(f)
    headers = next(reader)  # capture current field headers
    newHeaders = ['ResponseID', 'Question', 'Response']  # establish new header names
    with open('PythonOut.csv', 'w') as outputfile:
        writer = csv.writer(outputfile, dialect='excel', lineterminator='\n')
        writer.writerow(newHeaders)  # write new headers to output
        QuestionHeaders = headers[1:len(headers)]  # slice the question headers from original header list
        for row in reader:
            questionCount = 0  # start counter to loop through each question (column) for every response (row)
            while questionCount <= len(QuestionHeaders) - 1:
                newRow = [row[0], QuestionHeaders[questionCount], row[questionCount + 1]]
                writer.writerow(newRow)
                questionCount += 1

Prestashop - import csv files of products in different languages : feature value not translated

I want to import CSV files of products in 2 different languages into Prestashop 1.6.
I have 2 CSV files, one for each language.
Everything is fine when I import the CSV file of the first language.
When I import the CSV file of the second language, the feature values are not understood by Prestashop as translations of the first language's feature values; they are added as new feature values.
They are added as new feature values because I use the Multiple Features module (http://addons.prestashop.com/en/search-filters-prestashop-modules/6356-multiple-features-assign-your-features-as-you-want.html).
Without this module, the second CSV import updates the feature value for both languages.
How can I make Prestashop understand that it's a translation, not a new value of a feature?
Thanks!
I found a solution by updating the database directly.
- I imported all my products using CSV import in Prestashop for the main language.
- Feature values are stored in the ps_feature_value_lang table, which has 3 columns: id_feature_value | id_lang | value
- In my case, French is ps_feature_value_lang.id_lang = 1 and English is ps_feature_value_lang.id_lang = 2.
- Before making any changes, the data in ps_feature_value_lang looks like this:
id_feature_value | id_lang | value
1 | 1 | my value in french
1 | 2 | my value in english
- I created a table (myTableOfFeatureValueIWantToImport) with 2 columns: feature_value_FR / feature_value_EN. I filled this table with data.
- Because I don't know the IDs (id_feature_value) of my feature values (Prestashop created these IDs during the CSV import of my first language), I loop over the data of myTableOfFeatureValueIWantToImport and, each time ps_feature_value_lang.id_lang == 2 and ps_feature_value_lang.value == "value I want to translate", I update ps_feature_value_lang.value with the translated feature value.
$select = $connection->query("SELECT * FROM myTableOfFeatureValueIWantToImport GROUP BY feature_value_FR");
$select->setFetchMode(PDO::FETCH_OBJ);
while ($data = $select->fetch())
{
    $valFR = $data->feature_value_FR;
    $valEN = $data->feature_value_EN;
    $req = $connection->prepare('UPDATE ps_feature_value_lang
        SET ps_feature_value_lang.value = :valEN
        WHERE ps_feature_value_lang.id_lang = 2
        AND ps_feature_value_lang.value = :valFR
    ');
    $req->execute(array(
        'valEN' => $valEN,
        'valFR' => $valFR
    ));
}
done :D

store matrix data in SQLite for fast retrieval in R

I have 48 matrices of dimensions 1,000 rows and 300,000 columns where each column has a respective ID, and each row is a measurement at one time point. Each of the 48 matrices is of the same dimension and their column IDs are all the same.
The way I have the matrices stored now is as RData objects and also as text files. I guess for SQL I'd have to transpose and store by ID, and in such case now the matrix would be of dimensions 300,000 rows and 1,000 columns.
I guess if I transpose it a small version of the data would look like this:
id1 1.5 3.4 10 8.6 .... 10 (with 1,000 columns and 300,000 rows now)
I want to store them in a way such that I can use R to retrieve a few of the rows (~ 5 to 100 each time).
The general strategy I have in mind is as follows:
(1) Create a database in sqlite3 using R that I will use to store the matrices (in different tables)
For file 1 to 48 (each file is of dim 1,000 rows and 300,000 columns):
(2) Read in file into R
(3) Store the file as a matrix in R
(4) Transpose the matrix (now it's of dimensions 300,000 rows and 1,000 columns). Each row is now keyed by the unique id in the sqlite table.
(5) Dump/write the matrix into the sqlite3 database created in (1) (dump it into a new table probably?)
Steps 1-5 are to create the DB.
Next, I need step 6 to read from the database:
(6) Read some rows (at most 100 or so at a time) into R as a (sub)matrix.
A simple example code doing steps 1-6 would be best.
Some Thoughts:
I have used SQL before, but it was mostly to store tabular data where each column had a name. In this case each column is just one point of the data matrix; I guess I could just name them col1 ... col1000, or are there better tricks?
If I look at: http://sandymuspratt.blogspot.com/2012/11/r-and-sqlite-part-1.html they show this example:
dbSendQuery(conn = db,
"CREATE TABLE School
(SchID INTEGER,
Location TEXT,
Authority TEXT,
SchSize TEXT)")
But in my case this would look like:
dbSendQuery(conn = db,
"CREATE TABLE mymatrixdata
(myid TEXT,
col1 float,
col2 float,
.... etc.....
col1000 float)")
I.e., I would have to type in col1 ... col1000 manually, which doesn't sound very smart. This is where I am mostly stuck. A code snippet would help me.
Then, I need to dump the text files into the SQLite database? Again, unsure how to do this from R.
Seems I could do something like this:
setwd(<directory where to save the database>)
db <- dbConnect(SQLite(), dbname="myDBname")
mymatrix.df = read.table(<full name to my text file containing one of the matrices>)
mymatrix = as.matrix(mymatrix.df)
Here I need to know the code for how to dump this into the database...
Finally,
How can I quickly retrieve the values (without having to read the entire matrices each time) for some of the rows (by ID) using R?
From the tutorial it'd look like this:
sqldf("SELECT id1,id2,id30 FROM mymatrixdata", dbname = "Test2.sqlite")
But there id1, id2, id30 are hardcoded in the code, and I need to obtain them dynamically. I.e., sometimes I may want id1, id2, id10, id100; and another time I may want id80, id90, id250000, etc.
Something like this would be more appropriate for my needs:
cols.i.want = c("id1","id2","id30")
sqldf("SELECT cols.i.want FROM mymatrixdata", dbname = "Test2.sqlite")
Again, unsure how to proceed here. Code snippets would also help.
A simple example would help me a lot here, no need to code the whole 48 files, etc. just a simple example would be great!
Note: I am using a Linux server, SQLite 3 and R 2.13 (I could update it as well).
In the comments the poster explained that it is only necessary to retrieve specific rows, not columns:
library(RSQLite)
m <- matrix(1:24, 6, dimnames = list(LETTERS[1:6], NULL)) # test matrix
con <- dbConnect(SQLite()) # could add dbname= arg. Here use in-memory so not needed.
dbWriteTable(con, "m", as.data.frame(m)) # write
dbGetQuery(con, "create unique index mi on m(row_names)")
# retrieve submatrix back as m2
m2.df <- dbGetQuery(con, "select * from m where row_names in ('A', 'C')
                          order by row_names")
m2 <- as.matrix(m2.df[-1])
rownames(m2) <- m2.df$row_names
Note that relational databases are set-based and the order in which rows are returned is not guaranteed. We have used order by row_names to get a specific order. If that is not good enough, then add a column giving the row index: 1, 2, 3, ....
REVISED based on comments.