I'm trying to join two tables in Java Spark, and one of the tables contains duplicate columns. The problem is that the duplicate columns are renamed with trailing numbers, so the dropDuplicates() function doesn't work.
Here is the code:
Dataset<Row> data = spark.read().format("csv").option("header", "true").option("inferSchema", "true")
.load(path);
data.dropDuplicates();
The problem is that the duplicate columns in the table are already renamed with trailing numbers, so no duplicates are removed.
What is the right way to handle it?
I'm using spark-sql_2.11-2.3.0
Group all column names by their base name (the name with the trailing numbers stripped) and then take only one (arbitrary) column from each group. This list of column names can then be used to select the columns before the join.
import java.util.Arrays;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

String[] allFieldNames = data.schema().fieldNames();
String[] selected = Stream.of(allFieldNames)
        .collect(Collectors.toMap(s -> s.replaceAll("[0-9]*$", ""), Function.identity(), (a, b) -> b)).values()
        .toArray(new String[0]);
Dataset<Row> dfWithUniqueCols = data.select(selected[0], Arrays.copyOfRange(selected, 1, selected.length));
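The grouping idea is easy to check outside Spark. A minimal sketch in plain Python, using hypothetical column names of the kind Spark produces when it renames duplicates (name, name1, name2, ...):

```python
import re

# Hypothetical column names with duplicates renamed via trailing numbers
all_field_names = ["id", "name", "name1", "value", "value2"]

# Group by the base name (trailing digits stripped), keeping one column per group;
# later entries overwrite earlier ones, mirroring the (a, b) -> b merge function above
unique = {}
for c in all_field_names:
    base = re.sub(r"[0-9]*$", "", c)  # "name1" -> "name"
    unique[base] = c

selected = list(unique.values())
print(selected)
```

Each base name survives exactly once, so selecting `selected` before the join removes the renamed duplicates.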
Related
Hi, I have a string in a BigQuery column like this:
cancellation_amount: 602000
after_cancellation_transaction_amount: 144500
refund_time: '2022-07-31T06:05:55.215203Z'
cancellation_amount: 144500
after_cancellation_transaction_amount: 0
refund_time: '2022-08-01T01:22:45.94919Z'
I'm already using this logic to get cancellation_amount:
regexp_extract(file,r'.*cancellation_amount:\s*([^\n\r]*)')
but the output is only the amount 602000; I need the outputs 602000 and 144500 to become different columns.
Appreciate the help.
If the lines in the input (which will eventually become columns) are fixed, you can use multiple regexp_extract calls to get all the values.
SELECT
  regexp_extract(file, r'cancellation_amount:\s*([^\n\r]*)') AS cancellation_amount,
  regexp_extract(file, r'after_cancellation_transaction_amount:\s*([^\n\r]*)') AS after_cancellation_transaction_amount
FROM table_name
One issue I found with your regex expression is that .*cancellation_amount won't match after_cancellation_transaction_amount.
There is also a function called regexp_extract_all, which returns all the matches as an array that you can later explode into columns; but if you have a finite set of fields, separating them out into different columns would be easier.
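To illustrate the extract-all idea, here is a Python sketch with re.findall, which, like regexp_extract_all, returns every match rather than just the first (this only demonstrates the regex behavior, not BigQuery itself):

```python
import re

file = """cancellation_amount: 602000
after_cancellation_transaction_amount: 144500
refund_time: '2022-07-31T06:05:55.215203Z'
cancellation_amount: 144500
after_cancellation_transaction_amount: 0
refund_time: '2022-08-01T01:22:45.94919Z'"""

# Anchoring at start-of-line (via MULTILINE) keeps cancellation_amount from
# matching inside after_cancellation_transaction_amount
amounts = re.findall(r"^cancellation_amount:\s*([^\n\r]*)", file, flags=re.MULTILINE)
print(amounts)
```

Both values come back, which is what you would then spread across columns.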
I have a large Pandas DataFrame with >100 columns and I would like to select all columns where the substring einkst_l appears in the column name.
In addition, I want to select the two columns name and year.
So far, I could only create two new data frames:
e = 'einkst_l'
df_1 = df.filter(like = e, axis=1).reset_index(drop=True)
df_2 = df.filter(items = ['name', 'year'], axis=1).reset_index(drop=True)
I would like to select all the columns in one shot, but unfortunately 'like' and 'items' cannot be combined in one statement.
How can I select name + year + all columns containing the specified substring all at once?
This is fuzzier, but you could just use a regex match:
df[df.columns[df.columns.str.contains('einkst_l|name|year')]]
You could also use ^ and $ to anchor the pattern so that name and year are matched exactly.
Try it without filter, using the str accessor. Replace:
like with contains
items with isin
out = df[df.columns[df.columns.str.contains('einkst_l')
| df.columns.isin(['name', 'year'])]]
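The union of the two conditions can be sketched without pandas, on a plain list of hypothetical column names, to show exactly what the combined boolean mask keeps:

```python
# Hypothetical column names standing in for df.columns
columns = ["name", "year", "einkst_l_1", "einkst_l_2", "other"]

# Union of the two conditions: substring match OR membership in the fixed set
selected = [c for c in columns if "einkst_l" in c or c in {"name", "year"}]
print(selected)
```

A column survives if either condition holds, which is the "name + year + all substring matches" selection the question asks for.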
You might try chaining two filter calls, but note that the second filter only selects from the columns the first one kept, so chaining intersects the selections rather than combining them. To get the union, concatenate the two filtered frames instead:
pd.concat([df.filter(like=e, axis=1), df.filter(items=['name', 'year'], axis=1)], axis=1)
I have a very large pyspark dataframe in which I need to select a lot of columns (which is why I want to use a for loop instead of writing out each column name). Most of those columns need to be cast to DoubleType(), except for one column that I need to keep as StringType() (column "ID").
When I select all the columns that I need to cast to DoubleType(), I use this code (it works):
df_num2 = df_num1.select([col(c).cast(DoubleType()) for c in num_columns])
How can I also select my column "ID" which is a StringType() ?
List concatenation in Python:
df_num2 = df_num1.select(["id"] + [col(c).cast(DoubleType()) for c in num_columns])
# OR
df_num2 = df_num1.select(["id", *(col(c).cast(DoubleType()) for c in num_columns)])
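Both spellings rely on ordinary Python list building, so they can be sketched without Spark; a plain string transform stands in for col(c).cast(DoubleType()) here, purely for illustration:

```python
num_columns = ["a", "b"]

# Stand-in for col(c).cast(DoubleType()) — just builds a label per column
casted = [f"cast({c} as double)" for c in num_columns]

cols1 = ["id"] + casted                                          # list concatenation
cols2 = ["id", *(f"cast({c} as double)" for c in num_columns)]   # unpacking a generator

print(cols1)
print(cols1 == cols2)
```

Either form prepends the untouched "id" entry to the per-column expressions, which is exactly what select then receives.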
The query I start out with has 40,000 lines of empty rows, which stems from a problem with the original spreadsheet from which it was taken.
Using CF16 server
I would like to do a Query of Queries on a variably named 'key column'.
In my query:
var keyColumn = "Permit No."
var newQuery = "select * from source where (cast('#keyColumn#' as varchar) <> '')";
Note: the casting comes from this suggestion
I still get all those empty fields in there.
But when I use "City" as the keyColumn, it works. How do the values in both those columns differ when they both say [empty string] on the query dump?
Is it a problem with column names? What kind of data are in those cells?
where ( cast('Permit No.' as varchar) <> '' )
The problem is the SQL, not the values. By enclosing the column name in quotes, you are actually comparing the literal string "P-e-r-m-i-t N-o-.", not the values inside that column. Since the string "Permit No." can never equal an empty string, the comparison always returns true. That is why the resulting query still includes all rows.
Unless it was fixed in ColdFusion 2016, QoQs do not support column names containing invalid characters like spaces. One workaround is to use the "columnNames" attribute to specify valid column names when reading the spreadsheet. Failing that, another option is to take advantage of the fact that query columns are arrays and duplicate the data under a valid column name: queryAddColumn(yourQuery, "PermitNo", yourQuery["Permit No."]). (The latter option is less ideal because it may require copying the underlying data internally.)
I found a weird problem with a MySQL SELECT statement that has IN in the WHERE clause.
I am trying this query:
SELECT ads.*
FROM advertisement_urls ads
WHERE ad_pool_id = 5
AND status = 1
AND ads.id = 23
AND 3 NOT IN (hide_from_publishers)
ORDER BY rank desc
In above SQL hide_from_publishers is a column of advertisement_urls table, with values as comma separated integers, e.g. 4,2 or 2,7,3 etc.
As a result, if hide_from_publishers contains the two example values above, the query should return only the record for "4,2", but it returns both records.
Now, if I change the value of hide_from_publishers for the second record to 3,2,7 and run the query again, it returns a single record, which is the correct output.
If, instead of hide_from_publishers, I use the values directly, i.e. (2,7,3), it does recognize them and returns a single record.
Any thoughts about this strange problem or am I doing something wrong?
There is a difference between the tuple (1, 2, 3) and the string "1, 2, 3". The former is three values, the latter is a single string value that just happens to look like three values to human eyes. As far as the DBMS is concerned, it's still a single value.
If you want more than one value associated with a record, you shouldn't be storing it as a comma-separated value within a single field, you should store it in another table and join it. That way the data remains structured and you can use it as part of a query.
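The tuple-versus-string distinction is easy to see in Python terms (illustrative only; MySQL behaves analogously when IN is handed a single string column):

```python
values = (2, 7, 3)     # three separate values
as_string = "2,7,3"    # one string value that merely looks like three values

# Membership against three values succeeds:
print(3 in values)

# The DBMS sees the column as ONE value, so "3 IN (column)" compares 3
# against the whole string — and an integer never equals "2,7,3":
print(3 in (as_string,))
```

This is why `3 NOT IN (hide_from_publishers)` is always true whenever the column holds more than a single bare number.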
You need to treat the comma-delimited hide_from_publishers column as a string. You can use the LOCATE function to determine if your value exists in the string.
Note that I've added leading and trailing commas to both strings so that a search for "3" doesn't accidentally match "13".
select ads.*
from advertisement_urls ads
where ad_pool_id = 5
and status = 1
and ads.id = 23
and locate(',3,', concat(',', hide_from_publishers, ',')) = 0
order by rank desc
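The comma-padding trick can be verified in Python, where a substring search stands in for LOCATE:

```python
def contains_id(csv_values: str, wanted: int) -> bool:
    # Pad both strings with commas so a search for "3" cannot match inside "13"
    return f",{wanted}," in f",{csv_values},"

print(contains_id("2,7,3", 3))   # True: ",3," occurs in ",2,7,3,"
print(contains_id("13,2", 3))    # False: ",3," does not occur in ",13,2,"
```

Without the padding, searching for the bare "3" would incorrectly match "13".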
You need to split the string of values into separate values. See this SO question...
Can Mysql Split a column?
As well as the supplied example...
http://blog.fedecarg.com/2009/02/22/mysql-split-string-function/
Here is another SO question:
MySQL query finding values in a comma separated string
And the suggested solution:
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_find-in-set