Loading/analyzing a bunch of text files in Pandas/SQL

I have a few thousand files of text and would like to analyze them for trends/word patterns, etc. I am familiar with both Pandas and SQL but am not sure how to "load" all these files into a table/system such that I can run code on them. Any advice?

If all the text files share the same columns, you can use something like this:
import pandas as pd
import glob

path = r'C:/location_rawdata_files'  # use the path where you stored all the txt's
all_files = glob.glob(path + "/*.txt")

lst = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None)  # read each file into its own DataFrame
    lst.append(df)

df = pd.concat(lst, axis=0, ignore_index=True)  # stack them into one DataFrame
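Since the question mentions SQL as well: once everything is in one DataFrame, you can push it into a local SQLite database and query it with plain SQL. A minimal sketch, assuming the df built above and placeholder database/table names:
import sqlite3
import pandas as pd

# assumes df is the concatenated DataFrame from the snippet above
conn = sqlite3.connect('text_data.db')  # placeholder database file
df.to_sql('documents', conn, if_exists='replace', index=False)

# any SQL query now works against the loaded text data
sample = pd.read_sql('SELECT * FROM documents LIMIT 5', conn)
print(sample)
conn.close()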

Related

Returning all the column names as lists from multiple Parquet Files in Python

I have more than 100 Parquet files in a folder. I am not sure if all the files are having same feature name(column name). I want to write some python codes, through pandas which could read all the file in directory and return the name of columns with file name as prefix.
I tried 'for loop', but not sure how to structure the query. Being a beginner I could not write looped script.
import glob
import pandas as pd

path = r'C:\Users\NewFOlder1\NewFOlder\Folder'
all_files = glob.glob(path + '\*.gzip')

col = []
for paths in all_files:
    df = pd.read_parquet(paths)
    col.append(df.columns)
print(col)
IIUC, use pandas.concat with pandas.DataFrame.columns:
import glob
import pandas as pd

path = r'C:\Users\NewFOlder1\NewFOlder\Folder'
all_files = glob.glob(path + '\*.gzip')

list_dfs = []
for paths in all_files:
    df = pd.read_parquet(paths)
    list_dfs.append(df)

col_names = pd.concat(list_dfs).columns.tolist()
Can you try this:
import glob
import pandas as pd

path = r'C:\Users\NewFOlder1\NewFOlder\Folder'
all_files = glob.glob(path + '\*.gzip')

col = []
for paths in all_files:
    df = pd.read_parquet(paths)
    # tag every column name with the file path it came from
    col.append(list(df.columns + '_' + paths))
print(col)
If the filenames are like this: "abcd.parquet" (if not, please provide a sample file name), you can try something like this to find the differences:
replaced_cols = [i.split("_", 1)[0] for i in col]
differences = []
for i in col:
    val = i.split("_", 1)[0]
    if val not in replaced_cols:
        differences.append(i)
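Another way to answer the original question directly (column names keyed by file name, so schema differences stand out) is to collect the columns into a dict. A sketch along these lines, reusing the same folder as above:
import glob
import os
import pandas as pd

path = r'C:\Users\NewFOlder1\NewFOlder\Folder'
all_files = glob.glob(path + r'\*.gzip')

# map each file name to its list of column names
cols_by_file = {os.path.basename(p): pd.read_parquet(p).columns.tolist()
                for p in all_files}

# flag files whose columns differ from the first file's columns
reference = next(iter(cols_by_file.values()), [])
mismatched = {name: cols for name, cols in cols_by_file.items() if cols != reference}
print(mismatched)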

Combining CSV of different shapes into one CSV

I have CSVs with different numbers of rows and columns. I would like to create one large CSV where all the CSV data are stacked directly on top of each other, aligned by the first column. I tried the script below with limited success; b, which is an empty array, does not hold the data from the previous loops.
from os import walk
import sys
import numpy as np

filenames = []
dirpath = []
filtered = []
original = []
f = []
b = np.empty([2, 2])

for (dirpath, dirnames, filenames) in walk("C:\\Users\\dkim1\\Python Scripts\\output"):
    f.extend(dirnames)
print(f)

for names in f:
    print(names)
    df = np.genfromtxt('C:\\Users\\dkim1\\Python Scripts\\output\\' + names + '\\replies.csv', dtype=None, delimiter=',', skip_header=1, names=True)
    b = np.column_stack(df)
print(b)
Have you tried pd.concat()?
import os
import pandas as pd

# just used a single dir for example simplicity, rather than os.walk()
root_dir = "your directory path here"
file_names = os.listdir(root_dir)

cat_list = []
for names in file_names:
    df = pd.read_csv(os.path.join(root_dir, names), delimiter=',', header=None)
    cat_list.append(df)

concatted_df = pd.concat(cat_list)
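If you want to keep the original per-subfolder layout and let pandas line the pieces up on their shared columns, a sketch along these lines should work (assuming each subdirectory really does contain a replies.csv, as in the question):
import os
import pandas as pd

root_dir = r"C:\Users\dkim1\Python Scripts\output"

frames = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    if "replies.csv" in filenames:
        frames.append(pd.read_csv(os.path.join(dirpath, "replies.csv")))

# concat aligns on column names; columns missing from a file are filled with NaN
combined = pd.concat(frames, ignore_index=True, sort=False)
combined.to_csv(os.path.join(root_dir, "combined.csv"), index=False)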

Concatenate CSVs into XLSX files based on filename (pandas)

I have a bunch of CSVs with names '<3-letter-string> YYYY.csv'. There are four different versions of <3-letter-string>, and I want to sort the csvs into four xlsxs, each identified by that three letter string.
My code:
import pandas as pd
import os
full_df = pd.DataFrame()
for filename in os.listdir('C:/Users/XXXXXX/ZZZZZZ'):
    if filename.endswith(".csv"):
        print(filename)
        df = pd.read_csv(filename, skiprows=1, names=['ID','Units Sold','Retail Dollars'])
        df['Year'] = filename[-8:-4]
        full_df = pd.concat([full_df, df])
        full_df.to_excel(filename[0:3] + '.xlsx', index=False)
This makes four different xlsxs, which is what I want, but they're all a mixture of the different csvs.
How do I tell pandas to group them into four separate xlsxs according to the filename? My initial thought is to include filename slicing in the penultimate line and create four different concatenated full_df dataframes to write separately, but I'm not sure how.
import pandas as pd
import os

def Get_Yo_Fantasy_Hennnnnyyyyy():
    full_df = pd.DataFrame()
    for filename in os.listdir("path"):
        if filename.endswith(".csv"):
            print(filename)
            df = pd.read_csv(
                filename,
                skiprows=1,
                names=["ID", "Units Sold", "Retail Dollars"])
            df["Year"] = filename[-8:-4]
            df["Type"] = filename[0:3]
            full_df = pd.concat([full_df, df])
    for i in list(full_df.Type.unique()):
        full_df[full_df.Type.str.contains(i)].to_excel(
            "{}".format(i) + ".xlsx", index=False)

Get_Yo_Fantasy_Hennnnnyyyyy()
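An alternative sketch that avoids the extra Type column and the str.contains filtering: group the frames by the three-letter prefix while reading, then write one workbook per prefix ("path" is the same placeholder directory as above):
import os
import pandas as pd

src_dir = "path"  # placeholder directory, same as above

frames_by_prefix = {}
for filename in os.listdir(src_dir):
    if filename.endswith(".csv"):
        df = pd.read_csv(os.path.join(src_dir, filename), skiprows=1,
                         names=["ID", "Units Sold", "Retail Dollars"])
        df["Year"] = filename[-8:-4]
        frames_by_prefix.setdefault(filename[:3], []).append(df)

# one xlsx per three-letter prefix
for prefix, frames in frames_by_prefix.items():
    pd.concat(frames, ignore_index=True).to_excel(prefix + ".xlsx", index=False)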

Import a growing list() of csv files only to append after importing [duplicate]

This question already has answers here:
Import multiple CSV files into pandas and concatenate into one DataFrame
(20 answers)
Closed 3 years ago.
So I am building a dataset from a growing set of CSVs. Rather than adding a new df# = pd.read_csv(filename, index...) line for each file, I would prefer to create a function that reads the list of CSVs and appends them as they are imported. Any recommendations? The code below is what I currently have.
import glob
import pandas as pd

files = glob.glob('*.csv')
files

alg1_2018_2019 = pd.read_csv('alg1_2018_2019.csv', index_col=False)
alg1_2017_2018 = pd.read_csv('alg1_2017_2018.csv', index_col=False)
geometry_2018_2019 = pd.read_csv('geometry_2018_2019.csv', index_col=False)
geom_8_2017_2018 = pd.read_csv('geom_8_2017_2018.csv', index_col=False)
alg2_2016_2017 = pd.read_csv('alg2_2016_2017.csv', index_col=False)
alg1_2016_2017 = pd.read_csv('alg1_2016_2017.csv', index_col=False)
geom_2016_2017 = pd.read_csv('geom_2016_2017.csv', index_col=False)
geom_2015_2016 = pd.read_csv('geom_2015_2016.csv', index_col=False)
alg2_2015_2016 = pd.read_csv('alg2_2015_2016.csv', index_col=False)
alg1_part2_2015_2016 = pd.read_csv('alg1_part2_2015_2016.csv', index_col=False)
I'm using the following function:
import pandas as pd
from pathlib import Path

def glob_filemask(filemask):
    """
    Allows to "glob" files using file masks with a full path.

    Usage:
        for file in glob_filemask("/path/to/file_*.txt"):
            # process file here
    or:
        files = list(glob_filemask("/path/to/file_*.txt"))

    :param filemask: wildcards can be used only in the last part
                     (file name or extension), but NOT in the directory part
    :return: Pathlib glob generator, for all matching files

    Example:
        glob_filemask("/root/subdir/data_*.csv") -
            will return a Pathlib glob generator for all matching files
        glob_filemask("/root/subdir/single_file.csv") -
            will return a Pathlib glob generator for a single file
    """
    p = Path(filemask)
    try:
        if p.is_file():
            return [p]  # a single, existing file
    except OSError:
        pass
    # otherwise treat the last path component as a glob pattern
    return p.parent.glob(p.name)
Usage:
df = pd.concat([pd.read_csv(f) for f in glob_filemask("/path/to/file_*.csv")],
               ignore_index=True)
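For the common case covered by the linked duplicate (all files sitting in one folder, no wildcards in the directory part), plain glob plus a single concat is usually enough. A minimal sketch:
import glob
import pandas as pd

# read every csv in the current folder and stack them into one DataFrame
files = sorted(glob.glob('*.csv'))
combined = pd.concat((pd.read_csv(f, index_col=False) for f in files),
                     ignore_index=True)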

How to specify column type (I need string) using the pandas.to_csv method in Python?

import pandas as pd

data = {'x': ['011', '012', '013'], 'y': ['022', '033', '041']}
Df = pd.DataFrame(data=data, dtype=str)
Df.to_csv("path/to/save.csv")
The result I've obtained loses the leading zeros (011 shows up as 11).
To achieve such a result, it is easier to export directly to an xlsx file, even without setting the dtype of the DataFrame.
import pandas as pd

data = {'x': ['011', '012', '013'], 'y': ['022', '033', '041']}
Df = pd.DataFrame(data=data)

# the values are written as text, so the leading zeros survive in Excel
with pd.ExcelWriter('path/to/save.xlsx') as writer:
    Df.to_excel(writer, sheet_name="Sheet1")
I have also tried some other methods, like prepending an apostrophe or quoting all fields with ", but it had no effect.
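If the CSV only needs to round-trip through pandas with the strings intact (rather than look right in Excel), a sketch that quotes on write and forces string dtype on read:
import csv
import pandas as pd

data = {'x': ['011', '012', '013'], 'y': ['022', '033', '041']}
df = pd.DataFrame(data)

# quote every field on the way out; the zeros are plain text in the file
df.to_csv('save.csv', index=False, quoting=csv.QUOTE_ALL)

# force string dtype on the way back in so pandas does not parse '011' as 11
restored = pd.read_csv('save.csv', dtype=str)
print(restored['x'].tolist())  # ['011', '012', '013']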