Read only n-th column of a text file which has no header with R and sqldf - sql

I have a similiar problem like this question:
selecting every Nth column in using SQLDF or read.csv.sql
I want to read some columns of large files (table of 150rows, >500,000 columns, space separated, filled with numeric data and only a 32 bit system available). This file has no header, therefore the code in the thread above didn't work and I decided to write a new post.
Do you have an idea to solve this problem?
I thought about something like that, but any results with fread or read.table are also ok:
MyConnection <- file("path/file.txt")
df<-sqldf("select column 1 100 1000 235612 from MyConnection",file.format = list(header=F,sep=" "))

You can use substr to specify the start and end position of the columns you want to read in if they are fixed width:
x <- tempfile()
cat("12345", "67890", "09876", "54321", sep = "\n", file = x)
myfile <- file(x)
sqldf("select substr(V1, 1, 1) var1, substr(V1, 3, 5) var2 from myfile")
# var1 var2
# 1 1 345
# 2 6 890
# 3 9 76
# 4 5 321
See this blog post for some more examples. The "select" statement can easily be constructed with paste if you know the details about the column starting positions and widths.

Related

Read multiple files, create a data frame and add a new column containing the name of each file in R

I am new using dplyr package and I have been trying to read multiple files in R and then create a data frame by binding all the rows, but including the name of each file as a new column. This new column is the corresponding date which is not included in the data.
My list of files (for example):
01012019.aps
02012019.aps
I would like to have my final dataframe like this:
x y file date
1 4 01012019 01-01-2019
2 5 01012019 01-01-2019
3 6 02012019 02-01-2019
4 7 02012019 02-01-2019
I've been trying this:
path_aps<- "C:/Users/.../.../APS"
files_aps <- list.files(path_aps, pattern = "*.aps")
data_aps <- files_aps %>%
map(~ read.table(file.path(path_aps, .), sep = "\t")) %>%
map(~ mutate(filename = files_aps, .))%>%
reduce(gtools::smartbind)
But I am getting this error:
Error: Column filename must be length 288 (the number of rows) or one, not 61
I understand that the list of files in files_aps has 61 elements as this is the number of files that I have in my directory and 288 is the number of rows of each .aps file; however, I haven't been able to make it work to the extend of each .aps file. I've been reading multiple answers to similar questions but still I am not getting the expected result.
I've solved it with the help of this other answer and I've got this:
data_aps <- list.files(path_aps, pattern = "*.aps", full.names = TRUE) %>%
map_df(function(x) read.table(x, sep = "\t") %>%
mutate(filename=gsub(".aps","", basename(x))))

Sqldf in R - error with first column names

Whenever I use read.csv.sql I cannot select from the first column with and any output from the code places an unusual character (A(tilde)-..) at the begging of the first column's name.
So suppose I create a df.csv file in in Excel that looks something like this
df = data.frame(
a = 1,
b = 2,
c = 3,
d = 4)
Then if I use sqldf to query the csv which is in my working directory I get the following error:
> read.csv.sql("df.csv", sql = "select * from file where a == 1")
Error in result_create(conn#ptr, statement) : no such column: a
If I query a different column than the first, I get a result but with the output of the unusual characters as seen below
df <- read.csv.sql("df.csv", sql = "select * from file where b == 2")
View(df)
Any idea how to prevent these characters from being added to the first column name?
The problem is presumably that you have a file that is larger than R can handle and so only want to read a subset of rows into R and specifying the condition to filter it by involves referring to the first column whose name is messed up so you can't use it.
Here are two alternative approaches. The first one involves a bit more code but has the advantage that it is 100% R. The second one is only one statement and also uses R but additionally makes use an of an external utility.
1) skip header Read the file in skipping over the header. That will cause the columns to be labelled V1, V2, etc. and use V1 in the condition.
# write out a test file - BOD is a data frame that comes with R
write.csv(BOD, "BOD.csv", row.names = FALSE, quote = FALSE)
# read file skipping over header
DF <- read.csv.sql("BOD.csv", "select * from file where V1 < 3",
skip = 1, header = FALSE)
# read in header, assign it to DF and fix first column
hdr <- read.csv.sql("BOD.csv", "select * from file limit 0")
names(DF) <- names(hdr)
names(DF)[1] <- "TIME" # suppose we want TIME instead of Time
DF
## TIME demand
## 1 1 8.3
## 2 2 10.3
2) filter Another way to proceed is to use the filter= argument. Here we assume we know that the end of the column name is ime but there are other characters prior to that that we don't know. This assumes that sed is available and on your path. If you are on Windows install Rtools to get sed. The quoting might need to be changed depending on your shell.
When trying this on Windows I noticed that sed from Rtools changed the line endings so below we specified eol= to ensure correct processing. You may not need that.
DF <- read.csv.sql("BOD.csv", "select * from file where TIME < 3",
filter = 'sed -e "1s/.*ime,/TIME,/"' , eol = "\n")
DF
## TIME demand
## 1 1 8.3
## 2 2 10.3
So I figured it out by reading through the above comments.
I'm on a Windows 10 machine using Excel for Office 365. The special characters will go away by changing how I saved the file from a "CSV UTF-8 (Comma Delimited)" to just "CSV (Comma delimited)".

Extracting a word from string from n rows and append that word as a new col in SQL Server

I have got a data set that contains 3 columns and has 15565 observations. one of the columns has got several words in the same row.
What I am looking to do is to extract a particular word from each row and append it to a new column (i will have 4 cols in total)
The problem is that the word that i am looking for are not the same and they are not always on the same position.
Here is an extract of my DS:
x y z
-----------------------------------------------------------------------
1 T 3C00652722 (T558799A)
2 T NA >> MSP: T0578836A & 3C03024632
3 T T0579010A, 3C03051500, EAET03051496
4 U T0023231A > MSP: T0577506A & 3C02808556
8 U (T561041A C72/59460)>POPMigr.T576447A,C72/221816*3C00721502
I am looking to extract all the words that start with 3Cand are 10 characters long and then append the to a new col so it looks like this:
x y z Ref
----------------------------------------------------------------
1 T 3C00652722 (T558799A) 3C00652722
2 T NA >> MSP: T0578836A & 3C03024632 3C03024632
3 T T0579010A, 3C03051500, EAET03051496 3C03051500
4 U T0023231A > MSP: T0577506A & 3C02808556 3C02808556
8 U >POPMigr.T576447A,C72/221816*3C00721502 3C00721502
I have tried using the Contains, Like and substring methods but it does not give me the results i am looking for as it basically finds the rows that have the 3C number but does not extract it, it just copies the whole cell and pastes is on the Ref column.
SQL Server doesn't have good string functions, but this should suffice if you only want to extract one value per row:
select t.*,
left(stuff(col,
1,
patindex('%3C[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', col),
''
), 10)
from t ;

Organizing data (pandas dataframe)

I have a data in the following form:
product/productId B000EVS4TY
1 product/title Arrowhead Mills Cookie Mix, Chocolate Chip, 1...
2 product/price unknown
3 review/userId A2SRVDDDOQ8QJL
4 review/profileName MJ23447
5 review/helpfulness 2/4
6 review/score 4.0
7 review/time 1206576000
8 review/summary Delicious cookie mix
9 review/text I thought it was funny that I bought this pro...
10 product/productId B0000DF3IX
11 product/title Paprika Hungarian Sweet
12 product/price unknown
13 review/userId A244MHL2UN2EYL
14 review/profileName P. J. Whiting "book cook"
15 review/helpfulness 0/0
16 review/score 5.0
17 review/time 1127088000
I want to convert it to a dataframe such that the entries in the 1st column
product/productId
product/title
product/price
review/userId
review/profileName
review/helpfulness
review/score
review/time
review/summary
review/text
are the column headers with the values arranged corresponding to each header in the table.
I still had a tiny doubt about your file, but since both my suggestions are quite similar, I will try to address both the scenarios you might have.
In case your file doesn't actually have the line numbers inside of it, this should do it:
filepath = "./untitled.txt" # you need to change this to your file path
column_separator="\s{3,}" # we'll use a regex, I explain some caveats of this below...
# engine='python' surpresses a warning by pandas
# header=None is that so all lines are considered 'data'
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None)
df = df.set_index(0) # this takes column '0' and uses it as the dataframe index
df = df.T # this makes the data look like you were asking (goes from multiple rows+1column to multiple columns+1 row)
df = df.reset_index(drop=True) # this is just so the first row starts at index '0' instead of '1'
# you could just do the last 3 lines with:
# df = df.set_index(0).T.reset_index(drop=True)
If you do have line numbers, then we just need to do some little adjustments
filepath = "./untitled1.txt"
column_separator="\s{3,}"
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None, index_col=0)
df.set_index(1).T.reset_index(drop=True) #I did all the 3 steps in 1 line, for brevity
In this last case, I would advise you change it in order to have line numbers in all of them (in the example you provided, the numbering starts at the second line, this might be an option about how you handle headers when exporting the data in whatever tool you might be using
Regarding the regex, the caveat is that "\s{3,}" looks for any block of 3 consecutive whitespaces or more to determine the column separator. The problem here is that we'll depend a bit on the data to find the columns. For instance, if in any of the values just so happens to appear 3 consecutive spaces, pandas will raise an exception, since the line will have one more column than the others. One solution to this could be increasing it to any other 'appropriate' number, but then we still depend on the data (for instance, with more than 3, in your example, "review/text" would have enough spaces for the two columns to be identified)
edit after realising what you meant by "stacked"
Whatever "line-number scenario" you have, you'll need to make sure you always have the same number of columns for all registers and reshape the continuous dataframe with something similar to this:
number_of_columns = 10 # you'll need to make sure all "registers" do have the same number of columns otherwise this will break
new_shape = (-1,number_of_columns) # this tuple will mean "whatever number of lines", by 10 columns
final_df = pd.DataFrame(data = df.values.reshape(new_shape)
,columns=df.columns.tolist()[:-10])
Again, take notice of making sure that all lines have the same number of columns (for instance, a file with just the data you provided, assuming 10 columns, wouldn't work). Also, this solution assumes all columns will have the same name.

Reformat wide Excel table to more SQL-friendly structure

I have a very wide Excel sheet, from Column A - DIE (about 2500 columns wide), of survey data. Each column is a question, and each row is a response. I'm trying to upload the data to SQL and convert it to a more SQL-friendly format using the UNPIVOT function, but I can't even get it loaded into SQL because it exceeds the 1024-column limit.
Basically, I have an Excel sheet that looks like this:
But I want to convert it to look like this:
What options do I have to make this change, either in Excel (prior to upload) or SQL (while circumventing the 1024 column limit)?
I have had to do this quite a bit. My solution was to write a Python script that would un-crosstab a CSV file (typically exported from Excel), creating another CSV file. The Python code is here: https://pypi.python.org/pypi/un-xtab/ and the documentation is here: http://pythonhosted.org/un-xtab/. I've never run it on a file with 2500 columns, but don't know why it wouldn't work.
R has a very specific function call in one of it's libraries. You can also connect, read, and write data with R into a database. Would suggest downloading R and Rstudio.
Here is a working script to get you started that does what you need:
Sample data:
df <- data.frame(id = c(1,2,3), question_1 = c(1,0,1), question_2 = c(2,0,2))
df
Input table:
id question_1 question_2
1 1 1 2
2 2 0 0
3 3 1 2
Code to transpose the data:
df2 <- gather(df, key = id, value = values)
df2
Output:
id id values
1 1 question_1 1
2 2 question_1 0
3 3 question_1 1
4 1 question_2 2
5 2 question_2 0
6 3 question_2 2
Some helper functions for you to import and export the csv data:
# Install and load the necessary libraries
install.packages(c('tidyr','readr'))
library(tidyr)
library(readr)
# to read a csv file
df <- read_csv('[some directory][some filename].csv')
# To output the csv file
write.csv(df2, '[some directory]data.csv', row.names = FALSE)
Thanks for all the help. I ended up using Python due to limitations in both SQL (over 1024 columns wide) and Excel (well over 1 million rows in the output). I borrowed the concepts from rd_nielson's code, but that was a bit more complicated than I needed. In case it's helpful to anyone else, this is the code I used. It outputs a csv file with 3 columns and 14 million rows that I can upload to SQL.
import csv
with open('Responses.csv') as f:
reader = csv.reader(f)
headers = next(reader) # capture current field headers
newHeaders = ['ResponseID','Question','Response'] # establish new header names
with open('PythonOut.csv','w') as outputfile:
writer=csv.writer(outputfile, dialect='excel', lineterminator='\n')
writer.writerow(newHeaders) # write new headers to output
QuestionHeaders = headers[1:len(headers)] # Slice the question headers from original header list
for row in reader:
questionCount = 0 # start counter to loop through each question (column) for every response (row)
while questionCount <= len(QuestionHeaders) - 1:
newRow = [row[0], QuestionHeaders[questionCount], row[questionCount + 1]]
writer.writerow(newRow)
questionCount += 1