OpenStreetMap, Osmium, List, Pandas DataFrame, Python - pandas

I have a problem that I do not understand.
I want to read some nodes out of a *.osm file with osmium in Python.
https://docs.osmcode.org/pyosmium/latest/intro.html
The example shows that I can append all the node information I want to save to a list.
I understand that I cannot store the node object itself, because it only exists within the scope of the callback. But why can I not store the same data that I put in the list in a pandas DataFrame?
When I try that, I get the message "RuntimeError: Node callback keeps reference to OSM object. This is not allowed.", which is also mentioned in the examples.
What is different in that case?
And how can I speed up the reading process?
Greetings.

Related

I have this file in matlab, how do I open it with pandas dataframe

I have this file https://mega.nz/file/9tMlnALK#usFDanDAD6qp6TTZU9bFEoP6FPmeNPhMAa75Q5jgE7w, which is stored in .mat format and contains many variables. I have code that works with a pandas DataFrame, so I need to access one variable (a matrix called F; ideally I want to access some indexes of the matrix, not the entire matrix) in order to use the code with it.
I was trying to use loadmat but, as I said, I don't know how to access just the matrix and not the other data. How can I do it?
You can try the code below:
Convert mat

Dask unable to write to parquet with concatenated data

I am trying to do the following:
Read in a .dat file with pandas, convert it to a Dask dataframe, concatenate it with another Dask dataframe that I read in from a parquet file, and then write the result to a new parquet file. I do the following:
import dask.dataframe as dd
import pandas as pd
# Raw strings keep "\t" in the Windows-style paths from being read as a tab.
hist_pth = r"\path\to\hist_file"
hist_file = dd.read_parquet(hist_pth)
pth = r"\path\to\file"
daily_file = pd.read_csv(pth, sep="|", encoding="latin")
# Align the daily data's dtypes with the historical schema.
daily_file = daily_file.astype(hist_file.dtypes.to_dict(), errors="ignore")
dask_daily_file = dd.from_pandas(daily_file, npartitions=1)
combined_file = dd.concat([dask_daily_file, hist_file])
output_path = r"\path\to\output"
combined_file.to_parquet(output_path)
The combined_file.to_parquet(output_path) always starts and then stops / or doesn't work correctly. In a jupyter notebook when I do this I get a kernel fail error. When I do it in a python script the script completes but the whole combined file isn't written (I know because of the size - the CSV is 140MB and the parquet file is around 1GB - the output of to_parquet is only 20MB).
Some context: this is for an ETL process, and with the amount of data we're adding daily I'm soon going to run out of memory on the historical and combined datasets, so I'm trying to migrate the process from pure pandas to Dask to handle the larger-than-memory data I will soon have. The current data, daily + historical, still fits in memory but just barely (I already make use of categoricals; these are stored in the parquet file and I then copy that schema to the new file).
I also noticed that after the dd.concat([dask_daily_file, hist_file]) that I am unable to call .compute() even on simple tasks without it crashing the same way it does when writing to parquet. For example, on the original, pre-concatenated data, I can call hist_file["Value"].div(100).compute() and get the expected value but the same method on combined_file crashes. Even just combined_file.compute() to turn it into a pandas df crashes. I have tried repartitioning as well with no luck.
I was able to do these exact operations, just in pandas, without issue. But again, I'm going to be running out of memory soon which is why I am moving to dask.
Is this something dask isn't able to handle? If it can handle it, am I processing it correctly? Specifically, it seems like the concat is causing issues. Any help appreciated!
UPDATE
After playing around more I ended up with the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'categories'
There is an existing GitHub issue that seems like it could be related to this. I asked there and am waiting for confirmation.
As a workaround, I converted all categorical columns to strings/objects, tried again, and ended up with
ArrowTypeError: ("Expected a bytes object, got a 'int' object", 'Conversion failed for column Account with type object')
When I check that column df["Account"].dtype it returns dtype('O') so I think I have the correct dtype already. The values in this column are mainly numbers but there are some records with just letters.
Is there a way to resolve this?
I got this error in pandas after concatenating DataFrames and saving the result to Parquet format:
data = pd.concat([df_1, df_2, df_3], axis=0, ignore_index=True)
data.to_parquet(filename)
It was apparently caused by columns whose rows contained different data types, either int or float. Forcing them to the same data type before saving makes the error go away:
cols = ["first affected col", "second affected col", ..]
data[cols] = data[cols].apply(pd.to_numeric, errors='coerce', axis=1)

Reading Fortran binary file in Python

I'm having trouble reading an unformatted F77 binary file in Python.
I've tried the scipy.io.FortranFile method and the numpy.fromfile method, both to no avail. I have also read the file in IDL, which works, so I have a benchmark for what the data should look like. I'm hoping that someone can point out a silly mistake on my part -- there's nothing better than having an idiot moment and then washing your hands of it...
The data, bcube1, have dimensions 101x101x101x3, and is r*8 type. There are 3090903 entries in total. They are written using the following statement (not my code, copied from source).
open (unit=21, file=bendnm, status='new'
. ,form='unformatted')
write (21) bcube1
close (unit=21)
I can successfully read it in IDL using the following (also not my code, copied from colleague):
bcube=dblarr(101,101,101,3)
openr,lun,'bcube.0000000',/get_lun,/f77_unformatted,/swap_if_little_endian
readu,lun,bcube
free_lun,lun
The returned data (bcube) is double precision, with dimensions 101x101x101x3, so the header information for the file is aware of its dimensions (not flattened).
Now I try to get the same effect using Python, but no luck. I've tried the following methods.
In [30]: f = scipy.io.FortranFile('bcube.0000000', header_dtype='uint32')
In [31]: b = f.read_record(dtype='float64')
which returns the error Size obtained (3092150529) is not a multiple of the dtypes given (8). Changing the dtype changes the size obtained but it remains indivisible by 8.
Alternatively, using fromfile results in no errors but returns one more value than is in the array (a footer perhaps?), and the individual array values are wildly wrong (they should all be of order unity).
In [38]: f = np.fromfile('bcube.0000000')
In [39]: f.shape
Out[39]: (3090904,)
In [42]: f
Out[42]: array([ -3.09179121e-030, 4.97284231e-020, -1.06514594e+299, ...,
8.97359707e-029, 6.79921640e-316, -1.79102266e-037])
I've tried using byteswap to see if this makes the floating point values more reasonable but it does not.
It seems to me that the np.fromfile method is very close to working but there must be something wrong with the way it's reading the header information. Can anyone suggest how I can figure out what should be in the header file that allows IDL to know about the array dimensions and datatype? Is there a way to pass header information to fromfile so that it knows how to treat the leading entry?
I played around with it a bit, and I think I have an idea.
How Fortran stores unformatted data is not standardized, so you have to experiment a little, but you need three pieces of information:
1. The format of the data. You suggest that it is 64-bit reals, or 'f8' in Python.
2. The type of the header. That is an unsigned integer, but you need its length in bytes. If unsure, try 4.
   The header usually stores the length of the record in bytes, and is repeated at the end.
   Then again, it is not standardized, so no guarantees.
3. The endianness, little or big.
   Technically for both header and values, but I assume they're the same.
   NumPy defaults to the machine's native byte order, which is little-endian on most hardware, so if that were the correct setting for your data, I think you would have already solved it.
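As a quick sanity check on the endianness guess, the size reported in the FortranFile error from the question can be compared with the expected record length. The sketch below uses only the standard-library struct module; the variable names are illustrative, and the array shape and error value are taken from the question.

```python
import struct

# Expected record length: 101*101*101*3 float64 values, 8 bytes each.
n_values = 101 * 101 * 101 * 3
record_bytes = n_values * 8

# Encode that length as a big-endian 4-byte header, then misread it as little-endian.
header = struct.pack(">I", record_bytes)
misread = struct.unpack("<I", header)[0]

print(n_values)      # 3090903
print(record_bytes)  # 24727224
print(misread)       # 3092150529, the "Size obtained" in the error message

# A 4-byte header and trailer around the data also account for the one extra
# value that np.fromfile returned when reading the whole file as float64.
file_bytes = 4 + record_bytes + 4
print(file_bytes // 8)  # 3090904
```

If these numbers line up, a big-endian file with 4-byte record markers is a plausible reading of the symptoms, though it is still worth confirming against the IDL output.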
When you open the file with scipy.io.FortranFile, you need to give the data type of the header. So if the data is stored big-endian, and you have a 4-byte unsigned integer header, you need this:
from scipy.io import FortranFile
ff = FortranFile('data.dat', 'r', '>u4')
When you read the data, you need the data type of the values. Again, assuming big-endian, you want type >f8:
vals = ff.read_reals('>f8')
Look here for a description of the syntax of the data type.
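To make the record layout concrete, here is a minimal sketch that writes one big-endian record by hand and reads it back using only the standard library. It assumes the common layout described above (a 4-byte length marker, the data, then the marker repeated), which is not guaranteed for every compiler. The helper names write_record and read_record are hypothetical, not part of SciPy or NumPy.

```python
import io
import struct

def write_record(stream, values):
    """Write one Fortran-style sequential record of big-endian float64 values."""
    payload = struct.pack(f">{len(values)}d", *values)
    marker = struct.pack(">I", len(payload))  # record length in bytes
    stream.write(marker + payload + marker)

def read_record(stream):
    """Read one record, checking that the leading and trailing markers agree."""
    (length,) = struct.unpack(">I", stream.read(4))
    payload = stream.read(length)
    (trailer,) = struct.unpack(">I", stream.read(4))
    if trailer != length:
        raise ValueError("record markers disagree; wrong header size or endianness?")
    return list(struct.unpack(f">{length // 8}d", payload))

# Round trip through an in-memory buffer standing in for the file.
buffer = io.BytesIO()
write_record(buffer, [1.0, 0.5, -2.25])
buffer.seek(0)
values = read_record(buffer)
print(values)  # [1.0, 0.5, -2.25]
```

Checking the trailing marker against the leading one is a cheap way to detect a wrong guess about header size or byte order before trusting the values.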
If you have control over the program that writes the data, I strongly suggest you write them with stream access (access='stream' in the open statement), which omits the record markers and can be read more easily by Python.
Fortran has record demarcations which are poorly documented, even in binary files.
So every record written to an unformatted file, as in:
integer*4 Test1
real*4 Matrix(3,3)
open(78,form='unformatted')
write(78) Test1
write(78) Matrix
close(78)
ends up padded on both sides by an np.int32 value. (I've seen references saying that this value is the record length, but haven't verified that personally.)
The above could be read in Python via numpy as:
import numpy as np
# P1..P4 are the 4-byte record markers surrounding each written record.
datum = np.dtype([('P1',np.int32),('Test1',np.int32),('P2',np.int32),('P3',np.int32),('MatrixT',(np.float32,(3,3))),('P4',np.int32)])
with open(file_location,'rb') as input_file:
    data = np.fromfile(input_file,datum)
Which should fully populate the data array with the individual data sets of the format above. Do note that numpy expects data to be packed in C format (row major) while Fortran format data is column major. For square matrix shapes like that above, this means getting the data out of the matrix requires a transpose as well, before using. For non-square matrices, you will need to reshape and transpose:
Matrix = np.transpose(data[0]['MatrixT'])
Transposing your 4-D data structure is going to need to be done carefully. You might look into SciPy for automated ways to do so; the SciPy package seems to have Fortran-related utilities which I have not fully explored.
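For the 4-D case, it may help to spell out how column-major (Fortran) order maps indices to positions in the flat record: the first index varies fastest. The sketch below is plain Python with a small made-up shape, and fortran_offset is a hypothetical helper rather than a library function. In NumPy, this is the layout that reshape with order='F' assumes.

```python
from itertools import product

def fortran_offset(index, shape):
    """Flat position of a multi-dimensional index in column-major order."""
    offset, stride = 0, 1
    for i, n in zip(index, shape):
        offset += i * stride
        stride *= n
    return offset

shape = (2, 3, 4)  # small illustrative shape, not the 101x101x101x3 array

# Simulate what Fortran writes: iterate with the first index varying fastest.
flat = [
    (i, j, k)
    for k, j, i in product(range(shape[2]), range(shape[1]), range(shape[0]))
]

# Every element sits where the offset formula says it should.
assert all(flat[fortran_offset(idx, shape)] == idx for idx in flat)

print(fortran_offset((1, 0, 0), shape))  # 1
print(fortran_offset((0, 1, 0), shape))  # 2
print(fortran_offset((0, 0, 1), shape))  # 6
```

The same formula extends to four indices, so the last element of a 101x101x101x3 array lands at flat position 3090902.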

How to write to a file in Go

I have seen How to read/write from/to file using golang? and http://golang.org/pkg/os/#File.Write but could not find an answer.
Is there a way to write an array of float/int directly to a file, or do I have to convert it to bytes/a string first? Thanks.
You can use the functions in the encoding/binary package for this purpose.
As far as writing an entire array at once goes, there are no functions for this. You will have to iterate the array and write each element individually. Ideally, you should prefix these elements with a single integer, denoting the length of the array.
If you want a higher level solution, you can try the encoding/gob package:
Package gob manages streams of gobs - binary values exchanged between an Encoder (transmitter) and a Decoder (receiver). A typical use is transporting arguments and results of remote procedure calls (RPCs) such as those provided by package "rpc".

How to strip a text file into a single line, and then split that into a relevant list in python?

I'm a noob right now with pygame and I was wondering how to load a text file, then strip that into a single line. I believe that I would need to use the .rstrip('\n') function on my variable with the opened text file. But now, how do I turn this into a list? If I intentionally used two colons (::) to separate my relevant pieces of information in the text file, how do I make it into a list with each list element being the contents between two sets of ::? The purpose is to create save files in a menu GUI when closed, so is there a simpler way to save and open the contents of variables from one instance of the program to the next?
>>> "foo::bar::baz".split("::")
['foo', 'bar', 'baz']
If you just want to save structured data, however, you might want to look at either the pickle or json libraries. Both of them give ways to dump Python objects to files and then load them back out again.
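As a minimal sketch of the json route, the snippet below saves a dictionary to a file and loads it back. The file name and the keys in save_data are invented for illustration; a real save file would hold whatever variables the program needs.

```python
import json
import os
import tempfile

save_data = {"player": "foo", "level": 3, "items": ["bar", "baz"]}

# Write to a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "save.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(save_data, handle)
    with open(path, encoding="utf-8") as handle:
        loaded = json.load(handle)

print(loaded == save_data)  # True
```

Compared with a hand-rolled "::" format, json keeps numbers and lists as numbers and lists, so nothing has to be split or converted after loading.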