Python Pandas Series.any() with == comparison

Am I using pandas correctly? I am trying to loop through files and find whether any value in a series matches a string:
import pandas as pd
path = "user/Desktop/New Folder"
for file in path:
    df = pd.read_excel(file)
    if df[Series].any() == "string value":
        do_something()

Please check if this addresses your problem:
if (df['your column'] == "string value").any():
    do_something()
I think you should also fix your file iteration; please check this: https://www.newbedev.com/python/howto/how-to-iterate-over-files-in-a-given-directory/
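A minimal, self-contained sketch of the suggested check, using an in-memory frame instead of Excel files (the column name "your column" and the match value are placeholders from the answer):

```python
import pandas as pd

# Stand-in for one file's data.
df = pd.DataFrame({"your column": ["foo", "string value", "bar"]})

# Boolean Series marking matching rows; .any() is True if at least one row matches.
matches = df["your column"] == "string value"
if matches.any():
    print("found")  # stand-in for do_something()
```

The key point is that the comparison goes inside the parentheses, so `.any()` reduces a boolean Series to a single True/False.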

Related

Spark read multiple csv into one dataframe - error with path

I'm trying to read all the CSVs under an HDFS directory into a dataframe, but I got an error that says it's "not a valid DFS filename". Could someone help point out what I did wrong? I tried without the hdfs:// part as well, but it says the path could not be found. Many thanks.
val filelist = "hdfs://path/to/file/file1.csv,hdfs://path/to/file/file2.csv "
val df = spark.read.csv(filelist)
val df = spark.read.csv(filelist:_*)
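The likely problem is that `filelist` is a single comma-joined string (with a trailing space), while `spark.read.csv` expects each path as a separate argument or list element. A sketch of the fix in PySpark terms, assuming an active SparkSession named `spark` (the Spark call itself is shown for illustration only):

```python
# One comma-joined string is treated as a single (invalid) filename.
filelist = "hdfs://path/to/file/file1.csv,hdfs://path/to/file/file2.csv "

# Split into individual paths and strip stray whitespace.
paths = [p.strip() for p in filelist.split(",") if p.strip()]

# df = spark.read.csv(paths)  # each element is now a valid DFS filename
print(paths)
```

In Scala the equivalent would be splitting the string and expanding it as varargs, e.g. `spark.read.csv(filelist.split(",").map(_.trim): _*)`.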

Pandas - No Quote Character saved to file

I am having a difficult time trying to get any "quote" character to print out using the to_csv function in Pandas.
import pandas as pd

final = pd.DataFrame(dataset.loc[::])
final.to_csv(r'c:\temp\temp2.dat', doublequote=True, mode='w',
             sep='\x14', quotechar='\xFE', index=False)
print(final)
I have tried various options without success and am not sure what I am missing. Wondering if anyone can point me in the right direction. Thank you in advance.
Finally! It appears the documentation has changed or is not up to date on this: adding the option quoting=1 cures the issue (quoting=1 is the numeric value of csv.QUOTE_ALL).
The complete command is:
import pandas as pd

final = pd.DataFrame(dataset.loc[::])
final.to_csv(r'c:\temp\temp2.dat', index=False, doublequote=True,
             sep='\x14', quoting=1, quotechar='\xFE')
print(final)
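A quick way to see the effect of quoting=1 without writing to disk is to write to an in-memory buffer (the tiny frame below is illustrative):

```python
import csv
import io

import pandas as pd

df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
buf = io.StringIO()

# quoting=csv.QUOTE_ALL (== 1) wraps every field in the chosen quotechar.
df.to_csv(buf, index=False, sep='\x14', quoting=csv.QUOTE_ALL, quotechar='\xfe')
print(buf.getvalue())
```

Every field in the output, including the header, ends up wrapped in the `\xFE` character.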

Exporting Multiple log files data to single Excel using Pandas

How do I export multiple dataframes to a single Excel file? I'm not talking about merging or combining; I just want a specific slice of lines from multiple log files compiled into a single Excel sheet. I already wrote some code but I am stuck:
import pandas as pd
import glob
import os
from openpyxl.workbook import Workbook

file_path = "C:/Users/HP/Desktop/Pandas/MISC/Log Source"
read_files = glob.glob(os.path.join(file_path, "*.log"))
for files in read_files:
    logs = pd.read_csv(files, header=None).loc[540:1060, :]
    print(logs)
    logs.to_excel("LBS.xlsx")
When I do this, I only get data from the first log.
Appreciate your recommendations. Thanks!
You are saving logs, the variable in your for loop that changes on each iteration, so each write overwrites the previous file. What you want is to build a list of dataframes, combine them all, and then save that to Excel.
import pandas as pd
import glob
import os

file_path = "C:/Users/HP/Desktop/Pandas/MISC/Log Source"
read_files = glob.glob(os.path.join(file_path, "*.log"))

dfs = []
for file in read_files:
    log = pd.read_csv(file, header=None).loc[540:1060, :]
    dfs.append(log)

logs = pd.concat(dfs)
logs.to_excel("LBS.xlsx")
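The collect-then-concat pattern can be sketched with in-memory frames standing in for the parsed log files (the values below are illustrative):

```python
import pandas as pd

dfs = []
for values in ([1, 2], [3, 4]):          # stand-ins for each log's rows
    dfs.append(pd.DataFrame({"col": values}))

# Single combine and single write after the loop, not inside it.
logs = pd.concat(dfs, ignore_index=True)
# logs.to_excel("LBS.xlsx")
print(len(logs))
```

Because the write happens once, after the loop, no iteration's data is overwritten.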

Pandas checking for False value in Dataframe

I have a csv file that has something like:
col_a, col_b,isactive
a,b,true
c,d,false
I am trying to write all the rows that have true to one file and the ones that are false to another. I am trying to understand how I can check for the "false" boolean flag in the dataframe.
# this works
df1 = pd.read_csv(all_users)
df_isActive = df1[df1['isActive']]
df_isActive.to_csv('onlyactive.csv')

# this doesn't work
df_isNotActive = df1[df1['isActive' == 'False']]
I am trying to figure out how to select the "not active" rows of the dataframe.
Thanks,
Try this:
df_isNotActive = (df1[~df1['isActive']])
A quick way
df_isNotActive=df1.drop(df_isActive.index)
Can also try:
m = df1['isActive'] == False
df1[m]
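The mask-and-negate approach from the answers can be sketched end to end with an in-memory stand-in for the CSV, where isActive has already been parsed as a boolean column:

```python
import pandas as pd

df1 = pd.DataFrame({"col_a": ["a", "c"], "col_b": ["b", "d"],
                    "isActive": [True, False]})

df_isActive = df1[df1["isActive"]]       # rows where the flag is True
df_isNotActive = df1[~df1["isActive"]]   # ~ negates the boolean mask
print(df_isNotActive["col_a"].tolist())
```

Note that the original attempt, `df1['isActive' == 'False']`, compares the two string literals first (giving False) and then tries to use that as a column label, which is why it fails.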

Getting wildcard from input files when not used in output files

I have a snakemake rule aggregating several result files into a single file, per study. To make it a bit more understandable: I have two roles ['big','small'] that each produce data for 5 studies ['a','b','c','d','e'], and each study produces 3 output files, one per phenotype ['xxx','yyy','zzz']. What I want is a rule to aggregate the phenotype results from each study into a single summary file per study (merging the phenotypes into a single table). In the merge_results rule I give the rule a list of files (per study and role), aggregate these using a pandas dataframe, and then write the result out as a single file.
In the process of merging the results I need the 'pheno' variable from the input file being iterated over. Since pheno is not needed in the aggregated output file, it is not provided in output, and as a consequence it is also not available in the wildcards object. To get hold of the pheno I parse the filename, but this all feels very hacky and I suspect there is something here I have not understood properly. Is there a better way to grab wildcards from input files that are not used in output files?
runstudy = ['a','b','c','d','e']
runpheno = ['xxx','yyy','zzz']
runrole = ['big','small']

rule all:
    input:
        expand(os.path.join(output, '{role}-additive', '{study}', '{study}-summary-merge.txt'), role=runrole, study=runstudy)

rule merge_results:
    input:
        expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
    output:
        os.path.join(output, '{role}', '{study}', '{study}-summary-merge.txt')
    run:
        import pandas as pd
        import os

        # Iterate over input files, read into pandas df
        tmplist = []
        for f in input:
            data = pd.read_csv(f, sep='\t')
            # getting the pheno from the input file and adding it to the data frame
            pheno = os.path.split(f)[1].split('.')[0]
            data['pheno'] = pheno
            tmplist.append(data)
        resmerged = pd.concat(tmplist)
        resmerged.to_csv(output[0], sep='\t')
You are doing it the right way!
In your line:
expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
you have to understand that role and study are wildcards; pheno is not a wildcard and is set by the second argument of the expand function.
To get the phenotype in your for loop, you can either parse the file name as you are doing, or directly reconstruct the file name, since you know the different values pheno takes and you can access the wildcards:
run:
    import pandas as pd
    import os

    # Iterate over phenotypes, read into pandas df
    tmplist = []
    for pheno in runpheno:
        # conflicting variable name 'output' between a global variable and the
        # rule variable here; renamed the global var to outputDir for example
        file = os.path.join(outputDir, wildcards.role, wildcards.study, pheno, pheno + '.summary')
        data = pd.read_csv(file, sep='\t')
        data['pheno'] = pheno
        tmplist.append(data)
    resmerged = pd.concat(tmplist)
    resmerged.to_csv(output[0], sep='\t')
I don't know if this is better than parsing the file name as you were doing, though. I wanted to show that you can access wildcards in the code. Either way, you are defining the input and output correctly.
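For reference, the filename-parsing approach discussed above boils down to taking the stem of the input file's basename (the path below is an illustrative example matching the rule's input pattern):

```python
import os

f = "output/big/a/xxx/xxx.summary"

# basename -> "xxx.summary"; stem before the first dot -> "xxx"
pheno = os.path.split(f)[1].split(".")[0]
print(pheno)
```

Both approaches recover the same value; the reconstruction version just avoids string parsing by rebuilding the path from `wildcards` and the known pheno list.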