Reformat wide Excel table to more SQL-friendly structure - sql

I have a very wide Excel sheet, from Column A - DIE (about 2500 columns wide), of survey data. Each column is a question, and each row is a response. I'm trying to upload the data to SQL and convert it to a more SQL-friendly format using the UNPIVOT function, but I can't even get it loaded into SQL because it exceeds the 1024-column limit.
Basically, I have an Excel sheet that looks like this:
But I want to convert it to look like this:
What options do I have to make this change, either in Excel (prior to upload) or SQL (while circumventing the 1024 column limit)?

I have had to do this quite a bit. My solution was to write a Python script that would un-crosstab a CSV file (typically exported from Excel), creating another CSV file. The Python code is here: https://pypi.python.org/pypi/un-xtab/ and the documentation is here: http://pythonhosted.org/un-xtab/. I've never run it on a file with 2500 columns, but don't know why it wouldn't work.

R has a function for exactly this in one of its libraries (gather, from tidyr). You can also connect to a database from R and read and write data there. I would suggest downloading R and RStudio.
Here is a working script to get you started that does what you need:
Sample data:
df <- data.frame(id = c(1,2,3), question_1 = c(1,0,1), question_2 = c(2,0,2))
df
Input table:
  id question_1 question_2
1  1          1          2
2  2          0          0
3  3          1          2
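Code to transpose the data (gather comes from tidyr, loaded below; -id keeps id as the identifier column instead of gathering it):
df2 <- gather(df, key = question, value = values, -id)
df2
Output:
  id   question values
1  1 question_1      1
2  2 question_1      0
3  3 question_1      1
4  1 question_2      2
5  2 question_2      0
6  3 question_2      2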
Some helper functions for you to import and export the csv data:
# Install and load the necessary libraries
install.packages(c('tidyr','readr'))
library(tidyr)
library(readr)
# to read a csv file
df <- read_csv('[some directory][some filename].csv')
# To output the csv file
write.csv(df2, '[some directory]data.csv', row.names = FALSE)
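For readers working in Python rather than R, the same wide-to-long reshape can be sketched with pandas. This is only an illustration on the sample data above; the column names "question" and "values" are my own choice, and I have not tried it on a 2500-column file.

```python
import pandas as pd

# Same sample data as the R example above.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "question_1": [1, 0, 1],
    "question_2": [2, 0, 2],
})

# Keep "id" as the identifier; every other column becomes
# one (question, values) pair per row.
long_df = df.melt(id_vars="id", var_name="question", value_name="values")
print(long_df)
```

The result has one row per id and question, which is the shape UNPIVOT would have produced.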

Thanks for all the help. I ended up using Python due to limitations in both SQL (over 1024 columns wide) and Excel (well over 1 million rows in the output). I borrowed the concepts from rd_nielson's code, but that was a bit more complicated than I needed. In case it's helpful to anyone else, this is the code I used. It outputs a csv file with 3 columns and 14 million rows that I can upload to SQL.
import csv
with open('Responses.csv') as f:
    reader = csv.reader(f)
    headers = next(reader) # capture current field headers
    newHeaders = ['ResponseID','Question','Response'] # establish new header names
    with open('PythonOut.csv','w') as outputfile:
        writer=csv.writer(outputfile, dialect='excel', lineterminator='\n')
        writer.writerow(newHeaders) # write new headers to output
        QuestionHeaders = headers[1:len(headers)] # Slice the question headers from original header list
        for row in reader:
            questionCount = 0 # start counter to loop through each question (column) for every response (row)
            while questionCount <= len(QuestionHeaders) - 1:
                newRow = [row[0], QuestionHeaders[questionCount], row[questionCount + 1]]
                writer.writerow(newRow)
                questionCount += 1

Related

Read multiple files, create a data frame and add a new column containing the name of each file in R

I am new to the dplyr package and I have been trying to read multiple files in R and then create a data frame by binding all the rows, while including the name of each file as a new column. This new column is the corresponding date, which is not included in the data.
My list of files (for example):
01012019.aps
02012019.aps
I would like to have my final dataframe like this:
x y file date
1 4 01012019 01-01-2019
2 5 01012019 01-01-2019
3 6 02012019 02-01-2019
4 7 02012019 02-01-2019
I've been trying this:
path_aps <- "C:/Users/.../.../APS"
files_aps <- list.files(path_aps, pattern = "*.aps")
data_aps <- files_aps %>%
  map(~ read.table(file.path(path_aps, .), sep = "\t")) %>%
  map(~ mutate(filename = files_aps, .)) %>%
  reduce(gtools::smartbind)
But I am getting this error:
Error: Column filename must be length 288 (the number of rows) or one, not 61
I understand that the list in files_aps has 61 elements, as this is the number of files in my directory, and that 288 is the number of rows in each .aps file; however, I haven't been able to repeat each file name to the extent of its own .aps file. I've been reading multiple answers to similar questions, but I am still not getting the expected result.
I've solved it with the help of this other answer and I've got this:
data_aps <- list.files(path_aps, pattern = "*.aps", full.names = TRUE) %>%
  map_df(function(x) read.table(x, sep = "\t") %>%
    # strip only a literal ".aps" at the end of the file name
    mutate(filename = sub("\\.aps$", "", basename(x))))
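The same idea (tag each file's rows before binding them) can be sketched in Python with pandas. This is a rough equivalent, not the code from the answer above: the function name read_aps_folder is made up, the files are assumed to be tab-separated with no header row, and the date column is built simply by inserting hyphens into the 8-character file name, as in the desired output.

```python
from pathlib import Path

import pandas as pd


def read_aps_folder(folder):
    """Read every .aps file in `folder`, tagging each row with its file name and date."""
    frames = []
    for path in sorted(Path(folder).glob("*.aps")):
        # Assumption: tab-separated values, no header row.
        frame = pd.read_csv(path, sep="\t", header=None)
        stem = path.stem  # file name without ".aps", e.g. "01012019"
        frame["file"] = stem  # a scalar is repeated for every row of this file
        # Assumption: the name is 8 digits; insert hyphens as 2-2-4.
        frame["date"] = f"{stem[:2]}-{stem[2:4]}-{stem[4:]}"
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

Because the file name is assigned inside the loop, each frame only ever sees its own name, which avoids the length mismatch (288 rows vs. 61 names) from the question.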

Pandas: Reading big CSV with variable timestamp

I have one log file per day on an HTTP server in my LAN; each file grows to about 3 MB per day.
Every 15 seconds new values are written to that file. It has a timestamp column. It also has many columns I don't need; I only need about 5 of them.
Pandas should "monitor" that file by reading only the records that are new. Say the last execution read up to 2018-02-05 00:00:04.467; then the filter for the next run should be (>2018-02-05 00:00:04.467), and at the end of that run the last timestamp read becomes the filter for the following run, and so on...
I'm new to pandas and haven't found any similar thread for this.
I guess the CSV is written line by line, so instead of reading the whole file and filtering, you could keep the number of rows already read in a variable rows. On the next run, call read_csv with the optional argument skiprows set to range(1, rows + 1) to skip those rows (row 0, the header, is kept), and then increment the counter with rows += len(df).
If data.csv is
a,b,c
1,2,3
4,5,6
7,8,9
3,2,1
6,5,4
and rows = 2 (i.e., the last time the file was read it had 2 rows) then
df = pd.read_csv("data.csv", usecols=["a", "c"], skiprows=range(1, rows + 1))
would be the dataframe
a c
0 7 9
1 3 1
2 6 4
and you would increment rows
rows += len(df) # rows now equals 5, so 5 rows would be skipped in the next run
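Put together, one polling step could look like the sketch below. The helper name read_new_rows is made up for illustration, and an in-memory string stands in for the log file; in practice you would pass the file path or URL instead.

```python
import io

import pandas as pd


def read_new_rows(source, rows_seen, columns):
    """Return the rows not read yet, plus the updated count of rows seen."""
    # Skip data rows 1..rows_seen; row 0 (the header line) is always kept.
    df = pd.read_csv(source, usecols=columns, skiprows=range(1, rows_seen + 1))
    return df, rows_seen + len(df)


# Same data.csv content as in the example above.
csv_text = "a,b,c\n1,2,3\n4,5,6\n7,8,9\n3,2,1\n6,5,4\n"

# Pretend the previous run had already consumed 2 rows.
new_rows, rows = read_new_rows(io.StringIO(csv_text), rows_seen=2, columns=["a", "c"])
print(new_rows)
print(rows)
```

Note that rows has to be persisted between runs (for example in a small text file) if the script is restarted each time, and that this approach counts rows rather than comparing timestamps.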

I want to convert the 2nd column (ISBN) into column headers, with the rating for each user: a 2-dimensional matrix

"User-ID";"ISBN";"Book-Rating"
"276725";"034545104X";"0"
"276726";"0155061224";"5"
"276727";"0446520802";"0"
Output would be like:
"034545104X";"0155061224";"0446520802"
"276725" "0"
"276726" "5"
"276727" "0"
Solved it with the script below in R.
# load the readr library
library(readr)
# read the CSV file with ';' as delimiter
BX_Book_Ratings <- read_delim("C:/Users/panch/Desktop/Lambton/Term_2_Fall_2017/2017F-T2 BDM 2013 - Data Collection Methods/project_03/dataset/BX-Book-Ratings.csv", ";", escape_double = FALSE, trim_ws = TRUE)
# read_delim keeps the original headers (User-ID, Book-Rating); rename them to match the names used below
names(BX_Book_Ratings) <- c("User_ID", "ISBN", "Book_Rating")
# view the data
View(BX_Book_Ratings)
# load the reshape2 library
library(reshape2)
# take only the first few rows of the dataset
sample_data <- head(BX_Book_Ratings, 30)
# generate the matrix: one row per user, one column per ISBN
d <- dcast(sample_data, User_ID ~ ISBN, value.var = "Book_Rating")
# replace NA with 0
d[is.na(d)] <- 0
# load the gridExtra library to display data in table format
library(gridExtra)
# display the table
grid.table(d)
# store the output in CSV format
write.csv(d,"C:/Users/panch/Desktop/Lambton/Term_2_Fall_2017/2017F-T2 BDM 2013 - Data Collection Methods/project_03/dataset/output.csv")
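For comparison, a similar user-by-ISBN matrix can be sketched in Python with pandas. This is not the R solution above, only an illustration on the three sample rows from the question, read from an in-memory string. It uses pivot, which assumes each (User-ID, ISBN) pair appears at most once.

```python
import io

import pandas as pd

# The three sample rows from the question.
sample = io.StringIO(
    '"User-ID";"ISBN";"Book-Rating"\n'
    '"276725";"034545104X";"0"\n'
    '"276726";"0155061224";"5"\n'
    '"276727";"0446520802";"0"\n'
)

# dtype=str keeps leading zeros in the ISBNs; ratings are converted explicitly.
ratings = pd.read_csv(sample, sep=";", dtype=str)
ratings["Book-Rating"] = ratings["Book-Rating"].astype(int)

# One row per user, one column per ISBN; missing combinations become 0.
matrix = (
    ratings.pivot(index="User-ID", columns="ISBN", values="Book-Rating")
    .fillna(0)
    .astype(int)
)
print(matrix)
```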

Organizing data (pandas dataframe)

I have data in the following form:
product/productId B000EVS4TY
1 product/title Arrowhead Mills Cookie Mix, Chocolate Chip, 1...
2 product/price unknown
3 review/userId A2SRVDDDOQ8QJL
4 review/profileName MJ23447
5 review/helpfulness 2/4
6 review/score 4.0
7 review/time 1206576000
8 review/summary Delicious cookie mix
9 review/text I thought it was funny that I bought this pro...
10 product/productId B0000DF3IX
11 product/title Paprika Hungarian Sweet
12 product/price unknown
13 review/userId A244MHL2UN2EYL
14 review/profileName P. J. Whiting "book cook"
15 review/helpfulness 0/0
16 review/score 5.0
17 review/time 1127088000
I want to convert it to a dataframe such that the entries in the 1st column
product/productId
product/title
product/price
review/userId
review/profileName
review/helpfulness
review/score
review/time
review/summary
review/text
are the column headers with the values arranged corresponding to each header in the table.
I still had a tiny doubt about your file, but since both my suggestions are quite similar, I will try to address both the scenarios you might have.
In case your file doesn't actually have the line numbers inside of it, this should do it:
import pandas as pd
filepath = "./untitled.txt" # you need to change this to your file path
column_separator = r"\s{3,}" # we'll use a regex (raw string); I explain some caveats of this below...
# engine='python' suppresses a warning by pandas
# header=None is so that all lines are considered 'data'
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None)
df = df.set_index(0) # this takes column '0' and uses it as the dataframe index
df = df.T # this makes the data look like you were asking (goes from multiple rows+1column to multiple columns+1 row)
df = df.reset_index(drop=True) # this is just so the first row starts at index '0' instead of '1'
# you could just do the last 3 lines with:
# df = df.set_index(0).T.reset_index(drop=True)
If you do have line numbers, then we just need to make a few small adjustments:
import pandas as pd
filepath = "./untitled1.txt"
column_separator = r"\s{3,}"
# index_col=0 uses the line numbers as the index, so the keys end up in column '1'
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None, index_col=0)
df = df.set_index(1).T.reset_index(drop=True) # I did all the 3 steps in 1 line, for brevity
In this last case, I would advise changing the file so that every line has a line number (in the example you provided, the numbering starts at the second line). This might be an option in how headers are handled when exporting the data from whatever tool you are using.
Regarding the regex, the caveat is that "\s{3,}" treats any block of 3 or more consecutive whitespace characters as the column separator. The problem is that we then depend on the data to find the columns. For instance, if any value happens to contain 3 consecutive spaces, pandas will raise an exception, since that line will have one more column than the others. One solution could be to increase the count to some other 'appropriate' number, but then we still depend on the data (for instance, with more than 3, in your example, "review/text" would have enough spaces for the two columns to be identified).
edit after realising what you meant by "stacked"
Whatever "line-number scenario" you have, you'll need to make sure you always have the same number of columns for all registers and reshape the continuous dataframe with something similar to this:
number_of_columns = 10 # you'll need to make sure all "registers" do have the same number of columns otherwise this will break
new_shape = (-1, number_of_columns) # this tuple will mean "whatever number of lines", by 10 columns
final_df = pd.DataFrame(data=df.values.reshape(new_shape),
                        columns=df.columns.tolist()[:number_of_columns]) # the first 10 names are the headers
Again, take notice of making sure that all lines have the same number of columns (for instance, a file with just the data you provided, assuming 10 columns, wouldn't work). Also, this solution assumes all columns will have the same name.
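To make the reshape step concrete, here is a small self-contained sketch. It uses only three fields per record, taken from the sample in the question, and an in-memory string in place of the file. The names pairs, fields_per_record and the "key"/"value" column labels are mine, and it reshapes the value column directly instead of transposing first, so treat it as a variation on the approach above rather than the same code.

```python
import io

import pandas as pd

# Two stacked records, three fields each, separated by 3+ spaces.
raw = io.StringIO(
    "product/productId    B000EVS4TY\n"
    "review/score    4.0\n"
    "review/time    1206576000\n"
    "product/productId    B0000DF3IX\n"
    "review/score    5.0\n"
    "review/time    1127088000\n"
)

fields_per_record = 3  # every record must have exactly this many lines

# dtype=str keeps every value as text so mixed fields are not coerced.
pairs = pd.read_csv(raw, sep=r"\s{3,}", engine="python",
                    header=None, names=["key", "value"], dtype=str)

# The first record supplies the column headers.
headers = pairs["key"].iloc[:fields_per_record].tolist()

# "Whatever number of rows" by fields_per_record columns.
values = pairs["value"].to_numpy().reshape(-1, fields_per_record)

final_df = pd.DataFrame(values, columns=headers)
print(final_df)
```

If any record is missing a field, reshape raises a ValueError (or silently misaligns later records if the total still divides evenly), so it is worth checking that len(pairs) is a multiple of fields_per_record first.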

Read only n-th column of a text file which has no header with R and sqldf

I have a problem similar to this question:
selecting every Nth column in using SQLDF or read.csv.sql
I want to read some columns of large files (a table of 150 rows and >500,000 columns, space separated, filled with numeric data, and only a 32-bit system is available). The file has no header, so the code in the thread above didn't work, and I decided to write a new post.
Do you have an idea how to solve this problem?
I thought about something like the following, but any solution with fread or read.table is also OK:
MyConnection <- file("path/file.txt")
df<-sqldf("select column 1 100 1000 235612 from MyConnection",file.format = list(header=F,sep=" "))
You can use substr to specify the start and end position of the columns you want to read in if they are fixed width:
x <- tempfile()
cat("12345", "67890", "09876", "54321", sep = "\n", file = x)
myfile <- file(x)
sqldf("select substr(V1, 1, 1) var1, substr(V1, 3, 5) var2 from myfile")
#   var1 var2
# 1    1  345
# 2    6  890
# 3    9   76
# 4    5  321
See this blog post for some more examples. The "select" statement can easily be constructed with paste if you know the details about the column starting positions and widths.
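If Python is an option, selecting columns by position from a headerless file can also be sketched with pandas. This swaps in a different technique from the sqldf/substr answer above, and the helper name read_columns is made up. I have only tried it on a tiny example, not on a 500,000-column file or a 32-bit system.

```python
import io

import pandas as pd


def read_columns(source, positions):
    """Read only the given 1-based column positions from a headerless, space-separated file."""
    zero_based = [p - 1 for p in positions]  # pandas counts columns from 0
    return pd.read_csv(source, sep=" ", header=None, usecols=zero_based)


# Tiny stand-in for the real file: 2 rows, 6 columns.
demo = io.StringIO("1 2 3 4 5 6\n7 8 9 10 11 12\n")

subset = read_columns(demo, [1, 4, 6])
print(subset)
```

The resulting columns are labelled with their zero-based positions (0, 3 and 5 here), since the file has no header to take names from.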