Write Chinese words in a CSV file using Python 2.7

I am trying to write Chinese words like 花花公子昊天鞋类专营店 to a CSV file in Python, but I am not able to do it. I tried the solution given here ("issues with writing Chinese to csv file in Python"). Any help will be appreciated.

The unicodecsv module helps with that (you can install it with pip):
import unicodecsv

with open("test.csv", "wb") as f:  # binary mode for Python 2's csv machinery
    w = unicodecsv.writer(f)
    w.writerow((u"花花公子昊天鞋类专营店", 78.10))
The resulting CSV file opens successfully in OpenOffice.
You can also read it back in Python:
r = unicodecsv.reader(open("test.csv", "rb"))
for row in r:
    print row[0], row[1]
And when run, it should print:
(user@motoom) ~/Prj/python $ python chinesecsv.py
花花公子昊天鞋类专营店 78.1
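If you'd rather not install a third-party module, Python 2.7's standard csv module can also write UTF-8, as long as you encode each unicode value yourself first — a minimal sketch of that approach:

# -*- coding: utf-8 -*-
import csv

with open("test.csv", "wb") as f:  # binary mode for Python 2's csv module
    w = csv.writer(f)
    row = (u"花花公子昊天鞋类专营店", 78.10)
    # the Python 2 csv module handles bytes only, so encode unicode cells to UTF-8
    w.writerow([v.encode("utf-8") if isinstance(v, unicode) else v for v in row])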

Related

Extract an attribute in GPKG

I am trying to extract rivers from OSM. I downloaded the waterway GPKG, which I believe has over 21 million entries (see link) and a file size of 19.9 GB.
I have tried using the Split Vector Layer tool in QGIS, but it would crash.
I was thinking of using GDAL's ogr2ogr, but I am having trouble constructing the command line.
I first isolated the MultiLineString features with the following command:
ogr2ogr -f gpkg water.gpkg waterway_EPSG4326.gpkg waterway_EPSG4326_line -nlt linestring
ogrinfo water.gpkg
INFO: Open of `water.gpkg' using driver `GPKG' successful.
1: waterway_EPSG4326_line (Line String)
I then tried the command below, but it is not working:
ogr2ogr -f GPKG SELECT * FROM waterway_EPSG4326_line - where waterway="river" river.gpkg water.gpkg
Please let me know what is missing, or if there is an easier way to perform the task. I tried opening the file with the R sf package, but it would not load even after a long time.
Thanks
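One likely fix (a sketch, untested on this dataset): in ogr2ogr the destination dataset comes before the source, and the attribute filter belongs in a -where (or -sql) option rather than as bare SQL on the command line:

ogr2ogr -f GPKG river.gpkg water.gpkg -where "waterway='river'" waterway_EPSG4326_line

The equivalent -sql form replaces the -where option and the layer name with -sql "SELECT * FROM waterway_EPSG4326_line WHERE waterway='river'".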

How can I set up a for loop to read a list of URLs and scrape the data with the Python Pandas library

This is my first Python script. I'm currently trying to scrape data embedded in HTML tables from multiple URLs, which are listed in a file called url-list.txt.
I've been able to successfully scrape the data I need from a single page using Python's Pandas library, but I'm having the worst time trying to write a simple for loop that loads each URL from the url-list.txt file so I can scrape the data from the remaining 100 or so URLs.
Here is what I've got so far. You can see my attempt at the for loop commented out. Any help would be greatly appreciated.
import bs4 as bs
import urllib.request
import pandas as pd
#url_list = "/home/awephuck/url-list.txt"
#for x in urls:
dfs = pd.read_html('http://example.com/dir1/file.html')
for df in dfs:
    print(df)
You just have to make a list of the URLs in your text file and then loop over them, i.e.:
with open('file.txt', 'r') as text:
    links = text.read().splitlines()
for url in links:
    # whatever you need to do
This assumes that each url is on its own line.
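Putting that together with the pandas code from the question — a minimal sketch, assuming url-list.txt holds one full URL per line:

import pandas as pd

# read url-list.txt (one URL per line) and scrape every page
with open('url-list.txt', 'r') as text:
    links = text.read().splitlines()

for url in links:
    dfs = pd.read_html(url)  # returns a list of DataFrames, one per HTML table
    for df in dfs:
        print(df)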
I hate Python and ended up getting pretty hacky with it, but it works.
#!/bin/bash
for i in `cat url-list.txt`; do
  rm -f pyget.py
  echo "import bs4 as bs" >> pyget.py
  echo "import urllib.request" >> pyget.py
  echo "import pandas as pd" >> pyget.py
  echo "dfs = pd.read_html('$i')" >> pyget.py
  echo "for df in dfs:" >> pyget.py
  echo "    print(df)" >> pyget.py
  python3 pyget.py >> clientdata.txt
done
Data scraped with pd.read_html is returned as a list, not a DataFrame.
All of the page's tables are stored in that list, with the first table at index 0.
Create the DataFrame from a list element, then loop over its index:
import pandas as pd

data = pd.read_html('https://www.fdic.gov/bank/individual/failed/banklist.html')
type(data)    # data type of the scraped result: list
df = data[0]  # convert the first table into a DataFrame
type(df)
for i in df.index:  # loop using df.index
    print(df.iloc[i])

BigQuery error (ASCII 0) encountered for external table and when loading table

I'm getting this error:
"Error: Error detected while parsing row starting at position: 4824. Error: Bad character (ASCII 0) encountered."
The data is not compressed.
My external table points to multiple CSV files, and one of them contains a couple of lines with that character. In my table definition I added "maxBadRecords", but that had no effect. I also get the same problem when loading the data into a regular table.
I know I could use Dataflow or even try to fix the CSVs, but is there an alternative that does not involve writing a parser, and that is hopefully just as easy and efficient?
is there an alternative that does not involve writing a parser, and that is hopefully just as easy and efficient?
Try the following in the Google Cloud SDK Shell (it uses the tr utility):
gsutil cp gs://bucket/badfile.csv - | tr -d '\000' | gsutil cp - gs://bucket/fixedfile.csv
This will:
Read your "bad" file
Remove the ASCII 0 characters
Save the "fixed" data into a new file
After you have the new file, just make sure your table points to the fixed one.
Sometimes a stray final byte appears in a file.
What could help is replacing it with:
tr '\0' ' ' < file1 > file2
You can clean the file using an external tool like Python or PowerShell. There is no way to load a file containing ASCII 0 into BigQuery.
This is a script that can clean the file with Python:
import os
import shutil
from tempfile import mkstemp

def replace_chars(file_path, original_string, new_string):
    # Create a temp file
    fh, abs_path = mkstemp()
    with os.fdopen(fh, 'w', encoding='utf-8') as new_file:
        with open(file_path, encoding='utf-8', errors='replace') as old_file:
            print("\nCurrent line: \t")
            i = 0
            for line in old_file:
                print(i, end="\r", flush=True)
                i = i + 1
                line = line.replace(original_string, new_string)
                new_file.write(line)
    # Copy the file permissions from the old file to the new file
    shutil.copymode(file_path, abs_path)
    # Remove the original file
    os.remove(file_path)
    # Move the new file into place
    shutil.move(abs_path, file_path)
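For example, to blank out the NUL characters in place (a usage sketch; the file name is hypothetical):

replace_chars("source.csv", "\x00", " ")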
The same but for PowerShell:
(Get-Content "C:\Source.DAT") -replace "`0", " " | Set-Content "C:\Destination.DAT"

How to load lists in pelicanconf.py from an external file

There are different lists available in pelicanconf.py such as
SOCIAL = (('Facebook','www.facebook.com'),)
LINKS =
etc.
I want to manage this content and create my own lists by loading these values from an external file which can be edited independently. I tried importing the data as a text file using Python, but it doesn't work. Is there any other way?
What exactly did not work? Can you provide code?
You can execute arbitrary python code in your pelicanconf.py.
Example for a very simple CSV reader:
# in pelicanconf.py
def fn_to_list(fn):
    with open(fn, 'r') as res:
        return tuple(map(lambda line: tuple(line.rstrip('\n').split(';')), res.readlines()))

print(fn_to_list("data"))
CSV file data:
A;1
B;2
C;3
D;4
E;5
F;6
Together, this yields the following when running pelican:
# ...
((u'A', u'1'), (u'B', u'2'), (u'C', u'3'), (u'D', u'4'), (u'E', u'5'), (u'F', u'6'))
# ...
Instead of printing you can also assign this list to a variable, say LINKS.
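For instance, to populate SOCIAL from a semicolon-separated file (a sketch; the file name social.txt is hypothetical):

# in pelicanconf.py -- social.txt holds lines like "Facebook;www.facebook.com"
SOCIAL = fn_to_list("social.txt")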

File line splitting in Jython

I am trying to read a file and populate the values into a DB with the help of Jython in ODI.
For this, I read the file line by line and split each line on the ',' characters present.
Now I have a line such as:
4JGBB8GB5AA557812,,Miss,Maria,Cruz,,"266 Faller Drive Apt. B",New Milford,NJ,07646,2015054604,2015054604,20091029,51133,,,N,LESSEE,"MERCEDES-BENZ USA, LLC",N,N
"MERCEDES-BENZ USA, LLC" this field has , within the double quotes due to which it gets split into two fields whereas it should only be considered one. Can someone please tell me how should i avoid this.
fields = valueList.split(',')
I use this for splitting, where valueList is the individual line read from the file.
You can use the csv module, which takes care of quotes:
line = '4JGBB8GB5AA557812,,Miss,Maria,Cruz,,"266 Faller Drive Apt. B",New Milford,NJ,07646,2015054604,2015054604,20091029,51133,,,N,LESSEE,"MERCEDES-BENZ USA, LLC",N,N'
import StringIO
import csv
f = StringIO.StringIO(line)
reader = csv.reader(f, delimiter=',')
for row in reader:
    print('\n'.join(row))
result:
...
266 Faller Drive Apt. B
...
LESSEE
MERCEDES-BENZ USA, LLC
...
My example uses StringIO because the test line is a string in the code; you can simply pass an opened file handle as f instead.
You will find more examples at "Module of the Month": http://pymotw.com/2/csv/index.html#module-csv
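Applied to the original use case, you can read the whole file instead of splitting each line manually — a sketch, assuming a hypothetical input file data.csv:

import csv

with open('data.csv', 'rb') as f:  # Jython 2.x / Python 2: binary mode for csv
    reader = csv.reader(f, delimiter=',')
    for fields in reader:
        # each row arrives already split, with quoted commas preserved
        print(fields)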