How can I delete a specific line (e.g. line 102,206,973) from a 30 GB csv file?

What method can I use to delete a specific line from a csv/txt file that is too big to load into memory and edit manually?
Background
My question is really about an indirect solution to a problem I have with importing CSV files into SQL databases.
I have a series of 10-30 GB CSV files that I want to import into an SQLite table from within R (they are too large to load into R as whole data frames). I am using the 'RSQLite' package for this.
A couple of them fail because one of the lines is badly formatted. The import is then cancelled, and R returns the line number that caused the process to fail.
The error given is:
./csvfilename line 102206973 expected 9 columns of data but found 3)
So I know exactly the line which causes the error.
I see 2 potential 'indirect' solutions which I was hoping someone could help me with.
(i) Deleting the line causing the error in 20+ GB files, e.g. line 102,206,973 in the example above.
I am not concerned with 'losing' the data in line 102,206,973 by skipping or deleting it. However, I have tried and failed to find a way to open the csv file and remove that line.
(ii) Using SQLite directly (or anything else?) to import the csv in a way that allows bad lines or errors to be skipped.
Although probably not directly relevant to the solution, here is the R code used:
db <- dbConnect(SQLite(), dbname=name_of_table)
dbWriteTable(conn = db, name ="currentdata", value = csvfilename, row.names = FALSE, header = TRUE)
Thanks!

To delete a specific line you can use sed:
sed -e '102206973d' your_file
If you want the change to be made in place, do
sed -i.bak -e '102206973d' your_file
This will create a backup named your_file.bak, and your_file will have the specified line removed.
Example
$ cat a
1
2
3
4
5
$ sed -i.bak -e '3d' a
$ cat a
1
2
4
5
$ cat a.bak
1
2
3
4
5
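Note that sed -i still rewrites the whole 30 GB file behind the scenes, so it needs enough free disk space and will take a while. If you'd rather look at the offending line first and then produce a cleaned copy with an explicit stream, something along these lines should work (a sketch; the output filename is arbitrary):
sed -n '102206973p' your_file                      # print only the offending line to see what is wrong with it
awk 'NR != 102206973' your_file > your_file.fixed  # copy everything except that line
The cleaned copy can then be imported as usual, or the awk output can be piped straight into whatever loads the CSV so it never has to be written to disk.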

Related

How to clean bad data from huge csv file

So I have a huge csv file (assume 5 GB) and I want to insert the data into a table, but it returns an error that the length of the data is not the same.
I found that some rows have more columns than I want.
For example, the correct data has 8 columns but some rows have 9 (it can be human/system error).
I want to take only the 8 columns of data, but because the file is so huge, I cannot do it manually or by parsing it in Python.
Any recommendation of a way to do it?
I am using Linux, so any Linux command is also welcome.
In SQL I am using the COPY ... FROM ... CSV HEADER; command to import the csv into the table.
You can use awk for this purpose. Assuming your field delimiter is a comma (,), this command does the work:
awk -F\, 'NF==8 {print}' input_file >output_file
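If you also want to keep the rejected rows somewhere for later inspection, a small variation of the same idea should work (a sketch; the output filenames are placeholders):
awk -F, 'NF==8 {print > "good.csv"; next} {print > "bad.csv"}' input_file
Keep in mind that a plain field count treats every comma as a separator, so rows with commas inside quoted values can be miscounted by either version.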
A quick and dirty PHP solution as a single command line:
php -r '$f=fopen("a.csv","rb"); $g=fopen("b.csv","wb"); while ( $r=fgetcsv($f) ) { $r = array_slice($r,0,8); fputcsv($g,$r); }'
It reads a.csv and writes b.csv, keeping only the first 8 columns of each row.
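If the unwanted columns are always the trailing ones and no values contain embedded commas, cut gives much the same result without PHP (a sketch):
cut -d, -f1-8 a.csv > b.csv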

Troubleshooting BCP and Format File Errors

First off, sorry for the long post. I wanted to be thorough with my examples/data, and the bulk of this post is just that.
I inherited a Bulk Import Process using a format file (.fmt) at my new job. This process was created by the guy that worked here before me, and it is my job to learn this process (and fix it now). I have limited knowledge of this stuff, but I have done some research. After a few weeks, I haven't really gotten anywhere. Here is what I am working with...
--BCP Command to import data from C:\Desktop\20180629_2377167_PR_NP.txt to table LA_Temp.dbo.ProvReg
bcp LA_Temp.dbo.ProvReg IN C:\Desktop\20180629_2377167_PR_NP.txt -f C:\Desktop\PROVREG.FMT -T -S SERVERNAME -k -m 1000000
--Table structure from which the format file is created:
SELECT [NPI]
,[D1]
,[EntityType]
,[D2]
,[ReplaceNPI]
,[D3]
,[ProvName]
,[D4]
,[MailAddr1]
,[D5]
,[MailAddr2]
,[D6]
,[MailCity]
,[D7]
,[MailState]
,[D8]
,[MailZip]
,[D9]
,[MailCountry]
,[D10]
,[MailPhone]
,[D11]
,[MailFax]
,[D12]
,[LocAddr1]
,[D13]
,[LocAddr2]
,[D14]
,[LocCity]
,[D15]
,[LocState]
,[D16]
,[LocZip]
,[D17]
,[LocCountry]
,[D18]
,[LocPhone]
,[D19]
,[LocFax]
,[D20]
,[Taxonomy1]
,[D21]
,[Taxonomy2]
,[D22]
,[Taxonomy3]
,[D23]
,[OtherProvID]
,[D24]
,[OtherProvIDType]
,[D25]
,[ProvEnumDate]
,[D26]
,[LastUpdate]
,[D27]
,[DeactivateRC]
,[D28]
,[DeactivateDate]
,[D29]
,[ReactivateDate]
,[D30]
,[Gender]
,[D31]
,[License]
,[D32]
,[LicenseState]
,[D33]
,[AuthorizedContact]
,[D34]
,[ContactTitle]
,[D35]
,[ContactPhone]
,[D36]
,[PanelOpen]
,[D37]
,[Language1]
,[D38]
,[Language2]
,[D39]
,[Language3]
,[D40]
,[Language4]
,[D41]
,[Language5]
,[D42]
,[AgeRestrict]
,[D43]
,[PCPMax]
,[D44]
,[PCPActual]
,[D45]
,[PCPAll]
,[D46]
,[EnrollInd]
,[D47]
,[EnrollDate]
,[D48]
,[FamilyOnly]
,[D49]
,[SubSpec1]
,[D50]
,[SubSpec2]
,[D51]
,[SubSpec3]
,[D52]
,[ContractName]
,[D53]
,[ContractBegin]
,[D54]
,[ContractEnd]
,[D55]
,[Parish1]
,[D56]
,[Parish2]
,[D57]
,[Parish3]
,[D58]
,[Parish4]
,[D59]
,[Parish5]
,[D60]
,[Parish6]
,[D61]
,[Parish7]
,[D62]
,[Parish8]
,[D63]
,[Parish9]
,[D64]
,[Parish10]
,[D65]
,[Parish11]
,[D66]
,[Parish12]
,[D67]
,[Parish13]
,[D68]
,[Parish14]
,[D69]
,[Parish15]
,[D70]
,[PCPInd]
,[D71]
,[DisplayOnline]
,[D72]
,[ExpAgeRestrict]
,[D73]
,[Suffix]
,[D74]
,[Title]
,[D75]
,[PrescriberInd]
,[Spaces]
,[End]
FROM [LA_Temp].[dbo].[ProvReg]
--Example Text File Data (this is one line)
9999999999 ^0^ ^ ^3800 HMA BLVD STE 305 ^ ^METAIRIE ^LA^70006 ^ ^5048729679^ ^3800 HMA BLVD ^ ^METAIRIE ^LA^70006 ^ ^9999999999^ ^207Q00000X^ ^ ^0000000^2001^ ^00000000^ ^00000000^00000000^F^ ^LA^ ^ ^ ^N^1^0^0^0^0^2^00000^00000^00000^ ^ ^ ^ ^ ^ ^000000000000000000000000000000^00000000^00000000^26^00^00^00^00^00^00^00^00^00^00^00^00^00^00^0^0^Accept patients of age 000-000^ ^MD ^ ^
--Format file
11.0
153
1 SQLCHAR 0 40 "\t" 1 NPI SQL_Latin1_General_Pref_CP1_CI_AS
2 SQLCHAR 0 2 "\t" 2 D1 SQL_Latin1_General_Pref_CP1_CI_AS
3 SQLCHAR 0 2 "\t" 3 EntityType
...all the way to...
153 SQLCHAR 0 2 "\r\n" 153 End
I have changed the directories, server name, and some of the text file data to maintain security; however, it is very similar.
Here is the problem I am encountering:
With the "\t" used in the format file I just created from the SQL table, I get the error: [Microsoft][SQL Server Native Client 11.0]Unexpected EOF encountered in BCP data-file.
If I change this to just "" or "^" (as I 'think' it should be, since the text file uses a caret delimiter), the rows begin to copy but then fail with the error
[Microsoft][SQL Server Native Client 11.0]String data, right truncation SQLState = 22001, NativeError = 0. BCP copy in failed.
If anyone can please point me in the right direction here for troubleshooting this issue, or if you see anything out of place, please let me know. As I mentioned, I have been at this for some time, and can use any suggestions I can get. Unfortunately, there is no one at my company I can ask about this.
Try adding the -e option to your bcp command. This will give you an error file in which BCP will write some sample lines from the file that it had problems with. It is very helpful for troubleshooting the type of error you are getting now (and you are correct to change your delimiter in the format file).
The "string data, right truncation" error means just what it says, but the truncation can occur for a number of reasons. The destination table's columns may not be large enough to hold the data contained between the defined field delimiters. Delimiters may also appear inside your data, tricking the bcp utility into thinking a column ended before it was meant to (less likely with the delimiter you are using, but you never know; I always prefer fixed width if possible). And, of course, the source of the data may well have written you a file that contradicts whatever agreed-upon spec led you to define your destination as you have.
The error is accurate; the trick is finding where. Use the -e option to have BCP capture the problematic lines:
BCP table_dest IN "C:\FILE.TXT" -S SVR -T -f"C:\FORMAT_FILE.txt" -e"C:\ERROR_FILE.txt"
The "error_file.txt" will include line numbers and will include a sample of lines that it couldn't handle. Just copy and past to find in the file youare trying to load to see for yourself.
Strongly suggest using a more advanced text editing tool. Do not use windows notepad or wordpad. Use something like notepad++ or ultraedit to inspect ascii text files.
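Once the error file points you at a bad row, it can help to print the per-field lengths of that row and compare them against the destination column sizes. A sketch, assuming a Unix-like shell is available (Git Bash, WSL, etc.) and using a made-up line number taken from the -e error file:
awk -F'^' 'NR == 12345 { for (i = 1; i <= NF; i++) printf "field %d: %d chars\n", i, length($i) }' 20180629_2377167_PR_NP.txt
Any field that comes out longer than its destination column is a likely source of the right-truncation error.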

Linux batch rename

I am trying to batch rename multiple files and so far I am pretty close to what I am trying to achieve. I have some files called "website.txt", "website1.txt", "website2.txt", "website3.txt" and I am trying to rename only the files that have a number associated with them (so excluding "website.txt"). My first attempt is as follows (I'm using -n for testing):
rename -n 's/website/website_edit/' *txt
Result:
rename(website1.txt, website_edit1.txt)
rename(website2.txt, website_edit2.txt)
rename(website3.txt, website_edit3.txt)
rename(website.txt, website_edit.txt)
As you can see, this almost works, but it renames the "website.txt" file as well, which I don't want. So to try to exclude it I did this:
rename -n 's/website\w/website_edit/' *txt
Result:
rename(website1.txt, website_edit.txt)
rename(website2.txt, website_edit.txt)
rename(website3.txt, website_edit.txt)
This time it did exclude "website.txt" from the list, but it also removed the numbers from the end of the new names. I have tried messing around with other regular expressions as well, but to no avail.
Try this :
rename -n 's/website(\d+)/website_edit$1/' *txt
Here (\d+) captures at least one digit after website, and $1 inserts that captured group back after website_edit, so the numeric suffix is preserved.
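With the sample filenames from the question, the -n dry run of that command should produce something like:
rename(website1.txt, website_edit1.txt)
rename(website2.txt, website_edit2.txt)
rename(website3.txt, website_edit3.txt)
website.txt is left out because it has no digits for (\d+) to match.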

Appending the datetime to the end of every line in a 600 million row file

I have a 680 million row (19 GB) file that I need the datetime appended onto every line. I get this file every night and I have to add the time that I processed it to the end of each line. I have tried many ways to do this, including sed/awk and loading it into a SQL database with the last column defaulted to the current timestamp.
I was wondering if there is a fast way to do this? My fastest way so far takes two hours and that is just not fast enough given the urgency of the information in this file. It is a flat CSV file.
edit1:
Here's what I've done so far:
awk -v date="$(date +"%Y-%m-%d %r")" '{ print $0","date}' lrn.ae.txt > testoutput.txt
Time = 117 minutes
perl -ne 'chomp; printf "%s.pdf\n", $_' EXPORT.txt > testoutput.txt
Time = 135 minutes
mysql load data local infile '/tmp/input.txt' into table testoutput
Time = 211 minutes
You don't specify if the timestamps have to be different for each of the lines. Would a "start of processing" time be enough?
If so, a simple solution is to use the paste command, with a pre-generated file of timestamps, exactly the same length as the file you're processing. Then just paste the whole thing together. Also, if the whole process is I/O bound, as others are speculating, then maybe running this on a box with an SSD drive would help speed up the process.
I just tried it locally on a 6 million row file (roughly 1% of yours), and it's able to do it in less than one second on a MacBook Pro with an SSD drive.
~> date; time paste file1.txt timestamps.txt > final.txt; date
Mon Jun 5 10:57:49 MDT 2017
real 0m0.944s
user 0m0.680s
sys 0m0.222s
Mon Jun 5 10:57:49 MDT 2017
I'm going to now try a ~500 million row file, and see how that fares.
Updated:
OK, the results are in. paste is blazing fast compared to your solutions: it took just over 90 seconds total to process the whole thing, 600M rows of simple data.
~> wc -l huge.txt
600000000 huge.txt
~> wc -l hugetimestamps.txt
600000000 hugetimestamps.txt
~> date; time paste huge.txt hugetimestamps.txt > final.txt; date
Mon Jun 5 11:09:11 MDT 2017
real 1m35.652s
user 1m8.352s
sys 0m22.643s
Mon Jun 5 11:10:47 MDT 2017
You still need to prepare the timestamps file ahead of time, but that's a trivial bash loop. I created mine in less than one minute.
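For example, one way to generate such a timestamps file (a sketch, not necessarily how it was built here; the line count matches this test):
yes "$(date +"%Y-%m-%d %r")" | head -n 600000000 > hugetimestamps.txt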
A solution that simplifies mjuarez' helpful approach:
yes "$(date +"%Y-%m-%d %r")" | paste -d',' file - | head -n "$(wc -l < file)" > out-file
Note that, as with the approach in the linked answer, you must know the number of input lines in advance - here I'm using wc -l to count them, but if the number is fixed, simply use that fixed number.
yes keeps repeating its argument indefinitely, each on its own output line, until it is terminated.
paste -d',' file - pastes a corresponding pair of lines from file and stdin (-) on a single output line, separated with ,
Since yes produces "endless" output, head -n "$(wc -l < file)" ensures that processing stops once all input lines have been processed.
The use of a pipeline acts as a memory throttle, so running out of memory shouldn't be a concern.
Another alternative to test is
$ date +"%Y-%m-%d %r" > timestamp
$ join -t, -j9999 file timestamp | cut -d, -f2-
Alternatively, the timestamp can be generated in place with process substitution: <(date +"%Y-%m-%d %r")
join creates a cross product of the first and second files using the non-existent field (9999); since the second file is only one line, this effectively appends it to every line of the first file. The cut is needed to get rid of the empty key field generated by join.
If you want to add the same (current) datetime to each row in the file, you might as well leave the file as it is, and put the datetime in the filename instead. Depending on the use later, the software that processes the file could then first get the datetime from the filename.
To put the same datetime at the end of each row, some simple code could be written:
Make a string containing a separator and the datetime.
Read the lines from the file, append the above string and write back to a new file.
This way a conversion from datetime to string is only done once, and converting the file should not take much longer than copying the file on disk.
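A minimal shell sketch of that idea, assuming a comma separator (the datetime string is built once, before any line is touched):
ts=",$(date +"%Y-%m-%d %r")"
sed "s/\$/$ts/" input.txt > output.txt   # append the precomputed suffix to every line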

Bigquery error (ASCII 0) encountered for external table and when loading table

I'm getting this error
"Error: Error detected while parsing row starting at position: 4824. Error: Bad character (ASCII 0) encountered."
The data is not compressed.
My external table points to multiple CSV files, and one of them contains a couple of lines with that character. In my table definition I added "MaxBadRecords", but that had no effect. I also get the same problem when loading the data in a regular table.
I know I could use DataFlow or even try to fix the CSVs, but is there an alternative that does not involve writing a parser, and is hopefully just as easy and efficient?
is there an alternative that does not involve writing a parser, and is hopefully just as easy and efficient?
Try the below in the Google Cloud SDK Shell (using the tr utility):
gsutil cp gs://bucket/badfile.csv - | tr -d '\000' | gsutil cp - gs://bucket/fixedfile.csv
This will
Read your "bad" file
Remove ASCII 0
Save "fixed" file into new file
After you have the new file, just make sure your table points to that fixed one.
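Since the external table points to multiple CSV files, the same cleanup can be run over all of them in a loop (a sketch; the bucket path and the _fixed naming are placeholders):
for f in $(gsutil ls gs://bucket/*.csv); do   # assumes the object names contain no spaces
  gsutil cp "$f" - | tr -d '\000' | gsutil cp - "${f%.csv}_fixed.csv"
done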
Sometimes a stray NUL byte appears in the file.
What can help is replacing it with a space:
tr '\0' ' ' < file1 > file2
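To check whether a file actually contains NUL bytes before and after cleaning it, counting them is a quick sanity test (a sketch, not part of the original answer):
tr -cd '\000' < file1 | wc -c   # prints how many NUL bytes the file contains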
You can clean the file using an external tool like Python or PowerShell. There is no way to load a file containing ASCII 0 into BigQuery.
This is a script that can clean the file with Python:
import os
import shutil
from tempfile import mkstemp

def replace_chars(file_path, original_string, new_string):
    # Create temp file
    fh, abs_path = mkstemp()
    with os.fdopen(fh, 'w', encoding='utf-8') as new_file:
        with open(file_path, encoding='utf-8', errors='replace') as old_file:
            print("\nCurrent line: \t")
            i = 0
            for line in old_file:
                print(i, end="\r", flush=True)
                i = i + 1
                line = line.replace(original_string, new_string)
                new_file.write(line)
    # Copy the file permissions from the old file to the new file
    shutil.copymode(file_path, abs_path)
    # Remove the original file
    os.remove(file_path)
    # Move the new file into place
    shutil.move(abs_path, file_path)
The same but for PowerShell:
(Get-Content "C:\Source.DAT") -replace "`0", " " | Set-Content "C:\Destination.DAT"