Select a list of set columns AND other columns depending on their name/regex (Spark SQL Java)


I have a list of columns that I need to select. I have the field names for each column in this list, so it is easy to select.

public Column[] getSelectColumns()
{
        return new Column[]{
                col("name"),
                col("value"),
                col("date")
        };
}

final Dataset<Row> testDf = df.select(getSelectColumns());

However, I want to combine this with other columns where I do not have their exact field names. These other columns only share a similar pattern in their names, such as weather_id, house_id, person_id. They all end in "_id". Each row may or may not have these "_id" columns. The "_id" columns are dynamic so I cannot hardcode them in like I do in getSelectColumns().

Am I able to select columns based on the "_id" pattern (like using regex)? And if I can, how do I combine that with my regular select? So that my end result Dataset<Row> has all the columns I need.


There are 2 answers

Rajat Mishra

Spark has a colRegex function which can be used to select columns based on a regex.

Something like this:

df.select(df.colRegex("`^.*name*`")).show();
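
Applied to the "_id" pattern from the question, the same call would look roughly like this (untested sketch; note that the regex goes inside backticks):

// Selects every column whose name ends in "_id".
df.select(df.colRegex("`.*_id`")).show();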

Medzila

You can use Dataset#colRegex(String)

df.colRegex("`.*_id`")

This should work. I haven't tested it, but you can change your code to:

Dataset<Row> df = ...;
Column[] cols = new Column[]{
        functions.col("name"),
        functions.col("value"),
        functions.col("date"),
        df.colRegex("`.*_id`")
};
final Dataset<Row> testDf = df.select(cols);

Alternatively, you could use df.columns() to get all the column names, apply a plain Java regex to keep only the columns you need, and select those columns from there.
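
A rough sketch of that alternative in Java (untested; it reuses the getSelectColumns() helper from the question and keeps any column whose name ends in "_id"):

import static org.apache.spark.sql.functions.col;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Start from the fixed columns from the question...
List<Column> cols = new ArrayList<>(Arrays.asList(getSelectColumns()));

// ...then add every column of the Dataset whose name ends in "_id".
for (String name : df.columns()) {
    if (name.endsWith("_id")) {          // or name.matches(".*_id") for a real regex
        cols.add(col(name));
    }
}

final Dataset<Row> testDf = df.select(cols.toArray(new Column[0]));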

See more:

https://spark.apache.org/docs/3.0.2/api/java/org/apache/spark/sql/Dataset.html#colRegex-java.lang.String-