I have a CSV file that looks something like:
col_a,col_b,isActive
a,b,true
c,d,false
I am trying to write all the rows that are true to one file and the ones that are false to another. I am trying to understand how I can check for the false boolean flag in the DataFrame.
import pandas as pd
# this works
df1 = pd.read_csv(all_users)
df_isActive = df1[df1['isActive']]
df_isActive.to_csv('onlyactive.csv')
# this below doesn't work
df_isNotActive = df1[df1['isActive' == 'False']]
I am trying to figure out how to select the rows that are not active in a DataFrame.
Thanks,
Try this:
df_isNotActive = df1[~df1['isActive']]
A quick way:
df_isNotActive = df1.drop(df_isActive.index)
You can also try:
m = df1['isActive'] == False
df1[m]
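Putting it together, a minimal end-to-end sketch (using 'all_users.csv' as a stand-in for the all_users path):
import pandas as pd

# 'all_users.csv' is a stand-in path for this sketch
df1 = pd.read_csv('all_users.csv')

# Rows where isActive is True
df1[df1['isActive']].to_csv('onlyactive.csv', index=False)

# Rows where isActive is False, selected with ~ (the logical NOT operator)
df1[~df1['isActive']].to_csv('notactive.csv', index=False)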
I am having a difficult time trying to get any quote character to print out using the to_csv function in pandas.
import pandas as pd
final = pd.DataFrame(dataset.loc[::])
final.to_csv(r'c:\temp\temp2.dat', doublequote=True, mode='w',
sep='\x14', quotechar='\xFE', index=False)
print (final)
I have tried various options without success; I am not sure what I am missing. Wondering if anyone can point me in the right direction. Thank you in advance.
Finally! It appears the documentation has changed or is not updated on this. Adding the option quoting=1 cures the issue. Apparently, quoting=csv.QUOTE_ALL no longer works.
The complete command is:
import pandas as pd
final = pd.DataFrame(dataset.loc[::])
final.to_csv(r'c:\temp\temp2.dat', index=False, doublequote=True,
             sep='\x14', quoting=1, quotechar='\xFE')
print (final)
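For reference, quoting=1 requests quote-all behavior (the same numeric value the csv module uses for QUOTE_ALL). A self-contained sketch with a small stand-in DataFrame, since dataset is not shown:
import pandas as pd

# Small stand-in DataFrame for this sketch
final = pd.DataFrame({'a': ['x', 'y'], 'b': [1, 2]})

# quoting=1: every field is wrapped in the quotechar
final.to_csv(r'c:\temp\temp2.dat', index=False, doublequote=True,
             sep='\x14', quoting=1, quotechar='\xFE')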
Am I using pandas correctly? I am trying to loop through files and find whether any value in a Series matches a string:
import pandas as pd
path = user/Desktop/New Folder
for file in path:
    df = pd.read_excel(file)
    if df[Series].any() == "string value"
        do_something()
Please check if this addresses your problem:
if (df['your column'] == "string value").any():
    do_something()
I think you should also fix your file iteration; please check this: https://www.newbedev.com/python/howto/how-to-iterate-over-files-in-a-given-directory/
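Putting both fixes together, a minimal sketch (the folder path and column name below are stand-ins):
import os
import pandas as pd

folder = r'C:\Users\user\Desktop\New Folder'  # stand-in path

for name in os.listdir(folder):
    if not name.endswith('.xlsx'):
        continue
    df = pd.read_excel(os.path.join(folder, name))
    # True if any value in the column equals the target string
    if (df['your column'] == "string value").any():
        do_something()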
I am trying to save a Spark DataFrame to a CSV file, and I want all the fields in double quotes, but they are not being generated. Could you help me with how to do this?
Example:
Source_System|Date|Market_Volume|Volume_Units|Market_Value|Value_Currency|Sales_Channel|Competitor_Name
IMS|20080628|183.0|16470.0|165653.256349|AUD|AUSTRALIA HOSPITAL|PFIZER
Desirable Output:
Source_System|Date|Market_Volume|Volume_Units|Market_Value|Value_Currency|Sales_Channel|Competitor_Name
"IMS"|"20080628"|"183.0"|"16470.0"|"165653.256349"|"AUD"|"AUSTRALIA HOSPITAL"|"PFIZER"
Code I am running:
df4.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite') \
    .option("quoteAll", 'True').save(Output_Path_ASPAC, quote='', sep='|',
                                     header='True', nullValue=None)
You can just use df.write.csv with quoteAll set to True:
df4.repartition(1).write.csv(Output_Path_ASPAC, quote='"', header=True,
quoteAll=True, sep='|', mode='overwrite')
Which produces, with your example data:
"Source_System"|"Date"|"Market_Volume"|"Volume_Units"|"Market_Value"|"Value_Currency"|"Sales_Channel"|"Competitor_Name"
"IMS"|"20080628"|"183.0"|"16470.0"|"165653.256349"|"AUD"|"AUSTRALIA HOSPITAL"|"PFIZER"
I have a snakemake rule aggregating several result files into a single file, per study. So, to make it a bit more understandable: I have two roles ['big','small'] that each produce data for 5 studies ['a','b','c','d','e'], and each study produces 3 output files, one per phenotype ['xxx','yyy','zzz']. What I want is a rule that aggregates the phenotype results from each study into a single summary file per study (merging the phenotypes into a single table). In the merge_results rule I give the rule a list of files (per study and role), aggregate these using a pandas DataFrame, and then spit out the result as a single file.
In the process of merging the results I need the 'pheno' variable from the input file being iterated over. Since pheno is not needed in the aggregated output file, it is not provided in output, and as a consequence it is also not available in the wildcards object. To get hold of the pheno I parse the filename to grab it; however, this all feels very hacky, and I suspect there is something here I have not understood properly. Is there a better way to grab wildcards from input files that are not used in output files?
runstudy = ['a','b','c','d','e']
runpheno = ['xxx','yyy','zzz']
runrole = ['big','small']

rule all:
    input:
        expand(os.path.join(output, '{role}-additive', '{study}', '{study}-summary-merge.txt'), role=runrole, study=runstudy)

rule merge_results:
    input:
        expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
    output:
        os.path.join(output, '{role}', '{study}', '{study}-summary-merge.txt')
    run:
        import pandas as pd
        import os
        # Iterate over input files, read into pandas df
        tmplist = []
        for f in input:
            data = pd.read_csv(f, sep='\t')
            # getting the pheno from the input file and adding it to the data frame
            pheno = os.path.split(f)[1].split('.')[0]
            data['pheno'] = pheno
            tmplist.append(data)
        resmerged = pd.concat(tmplist)
        resmerged.to_csv(output[0], sep='\t')
You are doing it the right way!
In your line:
expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
you have to understand that role and study are wildcards. pheno is not a wildcard and is set by the second argument of the expand function.
In order to get the phenotype in your for loop, you can either parse the file name like you are doing, or directly reconstruct the file name, since you know the different values that pheno takes and you can access the wildcards:
run:
    import pandas as pd
    import os
    # Iterate over phenotypes, read into pandas df
    tmplist = []
    for pheno in runpheno:
        # conflicting variable name 'output' between a global variable and the
        # rule variable here; the global var is renamed outputDir, for example
        file = os.path.join(outputDir, wildcards.role, wildcards.study, pheno, pheno + '.summary')
        data = pd.read_csv(file, sep='\t')
        data['pheno'] = pheno
        tmplist.append(data)
    resmerged = pd.concat(tmplist)
    resmerged.to_csv(output[0], sep='\t')
I don't know if this is better than parsing the file name like you were doing though. I wanted to show that you can access wildcards in the code. Either way, you are defining the input and output correctly.
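Another option is to pair each input file with its phenotype directly: expand() generates the input files in the order of runpheno, so the two lists line up. A minimal sketch of the run block under that assumption:
run:
    import pandas as pd
    # expand() preserves the order of runpheno, so zip pairs each file
    # with the phenotype it was generated from
    tmplist = []
    for pheno, f in zip(runpheno, input):
        data = pd.read_csv(f, sep='\t')
        data['pheno'] = pheno
        tmplist.append(data)
    pd.concat(tmplist).to_csv(output[0], sep='\t')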
I have the following code, and when I export my data to an Excel file the sorting I used does not persist:
df.sort_values(['ID1','ID2'],ascending=True).groupby('ID1').
df.to_excel(r'C:\Users\user\Desktop\DOCUMENT.xlsx', index=None, header=True)
Could you explain to me why this does not work, thank you.
You need to reassign the result to the variable you export to Excel, or use the inplace argument. Note that with inplace=True, sort_values returns None, so nothing can be chained after it.
e.g.
df.sort_values(['ID1','ID2'], ascending=True, inplace=True)
or
df = df.sort_values(['ID1','ID2'], ascending=True)
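A minimal end-to-end sketch, using a small stand-in DataFrame and the output path from the question:
import pandas as pd

# Stand-in data for this sketch
df = pd.DataFrame({'ID1': [2, 1, 1], 'ID2': [1, 2, 1], 'val': ['x', 'y', 'z']})

# Reassign so the sorted result is what gets exported
df = df.sort_values(['ID1', 'ID2'], ascending=True)
df.to_excel(r'C:\Users\user\Desktop\DOCUMENT.xlsx', index=False, header=True)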