Loading a pandas table from a weird txt file - pandas

I'm working with a package called qiime2 that generates a txt file I want to load into pandas as a table. An example of the txt format is the following:
The format is quite quirky. I want to load as a pandas table the part from line 10 down to a specified line, with the first column in each line ("L1S105", "L1S140", ...) being the index, columns named 1 to 31, and all the values fitting in place (e.g., -0.4034.. would be at ["L1S105", 1]).
I tried to load the whole file as a pandas table and then manipulate it, but got an error. After setting on_bad_lines="skip", not much of the table was left.
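Without the exact file in front of me, here is a minimal sketch of how this could be done with read_csv; the file name, the skiprows/nrows values, and the whitespace delimiter are assumptions to adjust to the real layout:
import pandas as pd

# A sketch under these assumptions: the data block is whitespace-delimited,
# starts on line 10 of the file, and spans a known number of rows (nrows).
df = pd.read_csv(
    "feature-table.txt",   # hypothetical file name
    sep=r"\s+",            # whitespace-delimited; switch to sep="\t" if it is tab-separated
    header=None,           # the block itself carries no header row
    skiprows=9,            # skip lines 1-9 so reading starts at line 10
    nrows=20,              # stop after the block of interest (adjust to the real extent)
    index_col=0,           # first field ("L1S105", "L1S140", ...) becomes the index
)
# With header=None and index_col=0, the remaining columns are labelled 1..31 automatically.
print(df.at["L1S105", 1])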

Related

Pandas to_csv adds new rows when data has special characters

My data has multiple columns including a text column
id text date
8950026 Make your EMI payments only through ABC 01-04-2021 07:43:54
8950969 Pay from your Bank Account linked on XXL \r 01-04-2021 02:16:48
8953627 Do not share it with anyone. -\r 01-04-2021 08:04:57
I used pandas to_csv to export my data. That works well for my first row, but for the next 2 rows it creates a new line, moves the date to the next line, and adds to the total rows. Basically my output csv will have 5 rows instead of 3.
df_in.to_csv("data.csv", index = False)
What is the best way to handle the special character "\r" here? I tried converting the text variable to string in pandas (its dtype is object now), but that doesn't help. I can try to remove all \r at the end of the text in my dataframe before exporting, but is there a way to modify to_csv to export this in the right format?
**** EDIT****
This question below is similar, and I can solve the problem by replacing all instances of \r in my dataframe, but how can this be solved without replacing? Does to_csv have options to handle these?
Pandas to_csv with escape characters and other junk causing return to next line
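For reference, this is the replacing route mentioned in the edit, sketched on a small reproduction of the example data; the column names match the table above:
import pandas as pd

# Small reproduction of the example data from the question.
df_in = pd.DataFrame({
    "id": [8950026, 8950969, 8953627],
    "text": [
        "Make your EMI payments only through ABC",
        "Pay from your Bank Account linked on XXL \r",
        "Do not share it with anyone. -\r",
    ],
    "date": ["01-04-2021 07:43:54", "01-04-2021 02:16:48", "01-04-2021 08:04:57"],
})

# Strip embedded carriage returns so each record stays on one physical line.
df_in["text"] = df_in["text"].str.replace("\r", "", regex=False).str.strip()
df_in.to_csv("data.csv", index=False)
to_csv should already quote fields containing \r or \n under its default QUOTE_MINIMAL setting, but spreadsheet viewers such as Excel may still split the row on the embedded \r, which is why stripping it before export tends to be the pragmatic fix.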

Julia CSV skipping rows

I have a csv file as shown below. I basically want to add the last two rows (24 & 25) to a dataframe. Unfortunately, with the program generating this file (NetLogo), it's not possible to export it as an xlsx file, so using the xlsx package gives me an error.
I am wondering how to skip rows and get a dataframe. I've tried this piece of code, but it gives me a 2x1 DataFrame with X and 0 as values (basically column A, rows 24-25). What I am after is rows 24-25 and columns A to AC.
using DataFrames
using CSV
df = CSV.File(
    joinpath("D:/ABM/Simulation Runs/Output Files/run_1.csv"),
    skipto = 24
)
You can use the following (you could do the same with CSV.File if you do not want a DataFrame as a sink):
CSV.read("run_1.csv", DataFrame, header=24, limit=1, threaded=false)
Explanation:
header: the line in which the header is stored
limit: the number of rows of data to read (omit it if below the header you only have data)
threaded: use this to ensure that limit is respected exactly (in general, CSV.jl might use multiple threads to read your data and try to read more than asked)

Issue in pyspark dataframe after toPandas csv

I have a data frame in pyspark.sql.dataframe.DataFrame; I converted it into a pandas dataframe and then saved it as a csv file. In the csv, when opened, I discovered that the columns which have empty values in a field become \"\".
When I go back to the spark dataframe.toPandas() and check one of these column values, I see an empty string with a blank space.
dfpandas.colX[2] gives this result: ' '.
I used this kind of csv saving.
df_sparksql.repartition(1).write.format('com.databricks.spark.csv').save(
    "/data/rep//CLT_20200729csv",
    header='true',
)
I also used this kind of saving method, but it led to a memory outage.
df = df_per_mix.toPandas()
df.to_csv("/data/rep//CLT_20200729.csv",sep=";", index=False)
What is the issue, and how can I remove the blank space that gets converted to \"\"?
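One possible way to deal with this, assuming the blank cells really hold whitespace-only strings as dfpandas.colX[2] suggests, is to trim them to proper nulls on the Spark side before writing, so they come out as empty fields instead of quoted blanks. A sketch (colX stands in for each affected column):
from pyspark.sql import functions as F

# Assumption: df_sparksql is the Spark dataframe from the question and colX is one
# of the affected string columns; repeat (or loop over df_sparksql.columns) as needed.
cleaned = df_sparksql.withColumn(
    "colX",
    F.when(F.trim(F.col("colX")) == "", F.lit(None)).otherwise(F.col("colX")),
)
cleaned.repartition(1).write.format('com.databricks.spark.csv').save(
    "/data/rep//CLT_20200729csv",
    header='true',
)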

Converting a massive JSON file to CSV

I have a JSON file which is 48MB (collection of tweets I data mined). I need to convert the JSON file to CSV so I can import it into a SQL database and cleanse it.
I've tried every JSON to CSV converter but they all come back with the same result of "file exceeds limits" / the file is too large. Is there a good method of converting such a massive JSON file to CSV within a short period of time?
Thank you!
A 48MB json file is pretty small. You should be able to load the data into memory using something like this:
import json

with open('data.json') as data_file:
    data = json.load(data_file)
Depending on how you wrote the json file, data may be a list containing many dictionaries. Try running:
type(data)
If the type is a list, then iterate over each element and inspect it. For instance:
for row in data:
    print(type(row))
    # print(row.keys())
If row is a dict instance, then inspect the keys and, within the loop, start building up what each row of the CSV should contain; then you can either use pandas, the csv module, or just open a file and write line by line with commas yourself.
So maybe something like:
import json

with open('data.json') as data_file:
    data = json.load(data_file)

with open('some_file.txt', 'w') as f:
    for row in data:
        user = row['username']
        text = row['tweet_text']
        created = row['timestamp']
        joined = ",".join([user, text, created])
        f.write(joined + "\n")  # newline so each tweet ends up on its own line
You may still run into issues with unicode characters, commas within your data, etc., but this is a general guide.
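As a partial hedge against the comma and quoting issues just mentioned, the same loop can lean on the standard csv module, which handles quoting for you; username, tweet_text, and timestamp are the same assumed keys as above:
import csv
import json

with open('data.json') as data_file:
    data = json.load(data_file)

with open('some_file.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(["username", "tweet_text", "timestamp"])  # header row
    for row in data:
        # csv.writer quotes fields as needed, so commas or quotes inside the tweet text are safe
        writer.writerow([row['username'], row['tweet_text'], row['timestamp']])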

Loading huge csv file using COPY

I am loading a CSV file using COPY.
COPY cts FROM 'C:\...\cts.csv' using DELIMITERS',';
However, this error comes out:
ERROR: invalid input syntax for type double precision: ""
CONTEXT: COPY testdata, line 7, column latitude: ""
How to fix it please?
Looks like your CSV isn't quite formatted correctly. "" isn't a number, and numbers don't need to be quoted in CSV.
I find it's usually easier in PostgreSQL to create a staging import table with all text columns, and import CSVs to there first. Then do a cleanup query to put the CSV data into the real table.
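If it helps, here is a rough sketch of that staging-table route driven from Python with psycopg2; the connection string, the staging and target column lists, and everything apart from the latitude column (which comes from the error message) are hypothetical placeholders:
import psycopg2

# Hypothetical connection string and schema; only "latitude" is known from the error message.
conn = psycopg2.connect("dbname=mydb user=postgres")
cur = conn.cursor()

# 1. Staging table with all-text columns, so empty strings load without type errors.
cur.execute("CREATE TABLE cts_staging (name text, latitude text, longitude text)")

# 2. Bulk-load the raw CSV into the staging table.
with open("cts.csv") as f:
    cur.copy_expert("COPY cts_staging FROM STDIN WITH (FORMAT csv)", f)

# 3. Cleanup query: turn empty strings into NULL and cast to the real types.
cur.execute("""
    INSERT INTO cts (name, latitude, longitude)
    SELECT name,
           NULLIF(latitude, '')::double precision,
           NULLIF(longitude, '')::double precision
    FROM cts_staging
""")
conn.commit()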