I have a csv file with 300 columns. Out of these 300 columns, I need only 3, so I defined a schema for them. But when I map the schema to the DataFrame, it shows only 3 columns and incorrectly maps my schema onto the first 3 columns of the file. It does not match the csv column names with my schema StructFields. Please advise.
from pyspark.sql.types import *

dfschema = StructType([
    StructField("Call Number", IntegerType(), True),
    StructField("Incident Number", IntegerType(), True),
    StructField("Entry DtTm", DateType(), True)
])

df = spark.read.format("csv")\
    .option("header", "true")\
    .schema(dfschema)\
    .load("/FileStore/*/*")

df.show(5)
This is actually the expected behaviour of Spark's CSV reader.
If the columns in the csv file do not match the supplied schema, Spark treats each row as a corrupt record. The easiest way to see this is to add another column called _corrupt_record, with type string, to the schema. You will see that all rows end up stored in that column.
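For illustration, here is a rough sketch of such a diagnostic read. It uses the same three fields as in the question plus the extra _corrupt_record field; debug_schema and df_debug are just names chosen here:

from pyspark.sql.types import *

# hypothetical diagnostic schema: the question's three fields plus _corrupt_record
debug_schema = StructType([
    StructField("Call Number", IntegerType(), True),
    StructField("Incident Number", IntegerType(), True),
    StructField("Entry DtTm", DateType(), True),
    StructField("_corrupt_record", StringType(), True)  # rows that do not fit the schema land here
])

df_debug = spark.read.format("csv") \
    .option("header", "true") \
    .schema(debug_schema) \
    .load("/FileStore/*/*")

# the original line of each unparsable row shows up in _corrupt_record
df_debug.show(5, truncate=False)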
The easiest way to get the correct columns is to read the csv file without a schema (or, if feasible, with the complete schema) and then select the required columns. There is no performance penalty for reading the whole csv file, because (unlike with formats such as Parquet) Spark cannot read selected columns from a csv; the file is always read completely.
# read the csv file without inferring the schema
df = spark.read.option("header", "true").option("inferSchema", False).csv(<...>)

# all columns will now be of type string
df.printSchema()

# select the required columns and cast them to the appropriate type
df2 = df.selectExpr("cast(`Call Number` as int)", "cast(`Incident Number` as int)", ....)

# only the required columns, with the correct types, are contained in df2
df2.printSchema()
I have a data frame DF1 with n columns. Among those n columns are two columns called (storeid, Date). I am trying to take a subset of DF1 with those two columns, grouping by storeid and taking max(Date), as below:
df2 = (df1.select(['storeid', 'Date'])
          .withColumn("Date", col("Date").cast("timestamp"))
          .groupBy("storeid").agg(max("Date")))
After that I rename the column max(Date), since writing to S3 will not allow brackets:
df2=df2.withColumnRenamed('max(Date)', 'max_date')
Now I am trying to write this to an S3 bucket, and it is creating a parquet file for each row; for example, if df2 has 10 rows it creates 10 parquet files instead of 1 parquet file.
df2.write.mode("overwrite").parquet('s3://path')
Can anyone please help me with this? I need df2 to be written as a single parquet file, with all the data in it in a table format, instead of many files.
If the DataFrame is not too big, you can try to repartition it to a single partition before writing:
df2.repartition(1).write.mode("overwrite").parquet('s3://path')
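For reference, here is a sketch of the whole pipeline with the steps from the question put together. The alias on the aggregate is just an alternative to the separate withColumnRenamed call, and df1 is assumed to exist as in the question:

from pyspark.sql.functions import col, max as spark_max

# select the two columns, cast Date, group by storeid and take the latest Date
df2 = (df1.select("storeid", "Date")
          .withColumn("Date", col("Date").cast("timestamp"))
          .groupBy("storeid")
          .agg(spark_max("Date").alias("max_date")))  # alias avoids the max(Date) column name

# a single partition produces a single part-*.parquet file under the target path
df2.repartition(1).write.mode("overwrite").parquet("s3://path")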
I have two csv files having some data and I would like to combine and sort data based on one common column:
Here are the data1.csv and data2.csv files:
data3.csv is the output file where I need the data to be combined and sorted, as below:
How can I achieve this?
Here's what I think you want to do here:
I created two dataframes with simple types; assume the first column is like your timestamp:
import pandas as pd

df1 = pd.DataFrame([[1, 1], [2, 2], [7, 10], [8, 15]], columns=['timestamp', 'A'])
df2 = pd.DataFrame([[1, 5], [4, 7], [6, 9], [7, 11]], columns=['timestamp', 'B'])

c = df1.merge(df2, how='outer', on='timestamp')
print(c)
The outer merge causes each contributing DataFrame to be fully present in the output even if not matched to the other DataFrame.
The result is that you end up with a DataFrame with a timestamp column and the dependent data from each of the source DataFrames.
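Since the desired data3.csv is sorted on the common column, a sort on the merged result should give the final ordering; a small addition to the code above, reusing c:

# sort the combined frame on the shared column to get the data3.csv ordering
c = c.sort_values('timestamp').reset_index(drop=True)
print(c)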
Caveats:
You have repeating timestamps in your second sample, which I assume may be due to the fact you do not show enough resolution. You would not want true duplicate records for this merge solution, as we assume timestamps are unique.
I have not repeated the timestamp column here a second time, but it is easy to add in another timestamp column based on whether column A or B is notnull() if you really need to have two timestamp columns. Pandas merge() has an indicator option which would show you the source of the timestamp if you did not want to rely on columns A and B.
In the post you have two output columns named "timestamp". Generally you would not output two columns with the same name, since they would only be distinguished by position (or colour), which are not properties you should rely upon.
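As mentioned in the caveats, merge() has an indicator option that records where each row came from; a minimal sketch, reusing df1 and df2 from above:

# the _merge column shows whether a row came from df1 only, df2 only, or both
flagged = df1.merge(df2, how='outer', on='timestamp', indicator=True)
print(flagged)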
I have a very large data frame in which I want to split ALL of the columns except the first two on a comma delimiter. So I need to reference the column names programmatically, in a loop or some other way, to split all the columns in one swoop.
In my testing of the split method:
I have been able to explicitly refer to (i.e. HARD CODE) a single column name (rs145629793) as one of the required parameters, and the result was the 2 new columns I wanted.
See the Python code below (hardcoded column name):
df[['rs1', 'rs2']] = df.rs145629793.str.split(",", expand=True)
The problem:
It is not feasible to refer to each actual column name and repeat the code.
I then replaced the actual column name rs145629793 with columns[2] in the split method's parameter list.
It results in an ERROR:
'str' object has no attribute 'str'
You can index columns by position rather than name using iloc. For example, to get the third column:
df.iloc[:, 2]
Thus you can easily loop over the columns you need.
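A rough sketch of what that loop could look like, assuming (as in the rs145629793 example) every value splits into exactly two parts; the _1/_2 suffixes are made-up names for illustration:

# split every column except the first two, addressing them by position
# (the original comma-joined columns are left in place here)
n_original = df.shape[1]
for i in range(2, n_original):
    name = df.columns[i]
    df[[name + '_1', name + '_2']] = df.iloc[:, i].str.split(',', expand=True)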
I know what you are asking, but it's still helpful to provide some input data and expected output data. I have included random input data in my code below, so you can just copy and paste this to run, and try to apply it to your dataframe:
import pandas as pd

your_dataframe = pd.DataFrame({'a': ['1,2,3', '9,8,7'],
                               'b': ['4,5,6', '6,5,4'],
                               'c': ['7,8,9', '3,2,1']})
import copy

def split_cols(df):
    dict_of_df = {}
    cols = df.columns.to_list()
    for col in cols:
        # keep a deep copy of the frame at each step
        key_name = 'df' + str(col)
        dict_of_df[key_name] = copy.deepcopy(df)
        # split the column on commas and prefix the new columns with the original name
        var = df[col].str.split(',', expand=True).add_prefix(col)
        # merge the split columns back on the index and drop the original column
        df = pd.merge(df, var, how='left', left_index=True, right_index=True).drop(col, axis=1)
    return df

split_cols(your_dataframe)
Essentially, in this solution you create a list of the columns that you want to loop through. Then you loop through that list and create new dataframes for each column where you run the split() function. Then you merge everything back together on the index. I also:
included a prefix of the column name, so the column names did not have duplicate names and could be more easily identifiable
dropped the old column that we did the split on.
Just import copy and use the split_cols() function that I have created and pass the name of your dataframe.
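Since the question only wants to split everything except the first two columns, one possible tweak (untested, shown here as a separate hypothetical function) is to loop over a slice of the columns:

def split_cols_from_third(df):
    # same idea as split_cols above, but leave the first two columns untouched
    for col in df.columns[2:]:
        var = df[col].str.split(',', expand=True).add_prefix(col)
        df = pd.merge(df, var, how='left', left_index=True, right_index=True).drop(col, axis=1)
    return df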
I have a list of csv files that I am trying to concatenate using Pandas.
Given below is a sample view of the csv file:
Note: Column 4 stores the latitude
Column 5 stores the longitude
store-001,store_name,building_no_060,23.4324,43.3532,2018-10-01 10:00:00,city_1,state_1
store-002,store_name,building_no_532,12.4345,45.6743,2018-10-01 12:00:00,city_2,state_1
store-003,store_name,building_no_536,54.3453,23.3444,2018-07-01 04:00:00,city_3,state_1
store-004,store_name,building_no_004,22.4643,56.3322,2018-04-01 07:00:00,city_2,state_3
store-005,store_name,building_no_453,76.3434,55.4345,2018-10-02 16:00:00,city_4,state_2
store-006,store_name,building_no_456,35.3455,54.3334,2018-10-05 10:00:00,city_6,state_2
When I try to concat multiple csv files in the above format, I see the latitude and longitude columns are saved first, down the first column (A2 - A30), and they are followed by the other columns, all in row 1.
Given below is the way I am performing the concat:
import glob
import os
import pandas as pd

masterlist = glob.glob('path')  # <<-- this is the path where all the csv files are stored
df_v1 = [pd.read_csv(fp, sep=',', error_bad_lines=False).assign(FileName=os.path.basename(fp)) for fp in masterlist]  # <<-- this also includes the file name in the csv file
df = pd.concat(df_v1, ignore_index=True)
df.to_csv('path', index=False)  # <<-- this stores the final concatenated csv file
Could anyone guide me on why the concatenation is not working properly? Thanks.
I was experimenting with HDF and it seems pretty great because my data is not normalized and it contains a lot of text. I love being able to query when I read data into pandas.
loc2 = r'C:\\Users\Documents\\'
(my dataframe with data is called 'export')
from pandas import HDFStore

hdf = HDFStore(loc2 + 'consolidated.h5')
hdf.put('raw', export, format='table', complib='blosc', complevel=9, data_columns=True, append=True)
It has 21 columns and about 12 million rows so far, and I will add about 1 million rows per month.
1 date column [I convert this to datetime64]
2 datetime columns (one is populated for every row, the other is null about 70% of the time) [I convert these to datetime64]
9 text columns [I convert these to categorical, which saves a ton of room]
1 float column
8 integer columns; 3 of these can reach a max of maybe a couple of hundred, and the other 5 can only be 1 or 0
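For reference, a rough sketch of the conversions described above; the column names (date_col, event_dt, text_col, flag_1) are made up, and only the dtypes come from the description:

import pandas as pd

# assumes 'export' is the raw dataframe read from the daily csv
export['date_col'] = pd.to_datetime(export['date_col'])      # the date column
export['event_dt'] = pd.to_datetime(export['event_dt'])      # one of the 2 datetime columns
export['text_col'] = export['text_col'].astype('category')   # each of the 9 text columns
export['flag_1'] = export['flag_1'].astype('int8')           # the 0/1 columns fit in int8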
I made a nice small h5 table and it was perfect until I tried to append more data to it (literally just one day of data since I am receiving daily raw .csv files). I received errors which showed that the dtypes were not matching up for each column although I used the same exact ipython notebook.
Is my hdf.put code correct? If I have append=True, does that mean it will create the file if it does not exist, but append the data if it does exist? I will basically be appending to this file every day.
For columns which only contain 1 or 0, should I specify a dtype like int8 or int16 - will this save space, or should I keep it at int64? It looks like some of my columns are randomly float64 (although they have no decimals) and int64. I guess I need to specify the dtype for each column individually. Any tips?
I have no idea what blosc compression is. Is that the most efficient one to use? Any recommendations here? This file is mainly used to quickly read data into a dataframe to join to other .csv files which Tableau is connected to