rlike() function in pyspark is not working properly


I am trying to use rlike() to validate a money amount. The value may contain a dollar sign ($), a comma (,), a decimal point (.), and digits before and after the decimal point; there can also be a negative sign before or after the $ sign. Below is the regex I came up with: ^\$?\-?[0-9]*\,?[0-9]*\.?[0-9]*$. It finds the matches when I test it at https://regex101.com/

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

# Tag rows whose AmountRecovered fails the pattern and append them to df.
df = df.unionAll(
    cdf.withColumn("ErrorMessage", lit("Invalid Amount Recovered"))
       .filter(~col("AmountRecovered").rlike(r'^\$?\-?[0-9]*\,?[0-9]*\.?[0-9]*$'))
).distinct()
display(df)

I also tried replacing ~ with == False, like this:

df = df.unionAll(
    cdf.withColumn("ErrorMessage", lit("Invalid Amount Recovered"))
       .filter(col("AmountRecovered").rlike(r'^\$?\-?[0-9]*\,?[0-9]*\.?[0-9]*$') == False)
).distinct()

It is not working either.


There is 1 answer

Answered by Benji

I noticed two things wrong with your regular expression: it doesn't match a - before a $ (for an input like -$5.00) and it doesn't let you have multiple commas (for an input like $500,000,000,000).
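To make those two gaps concrete, here is a minimal sketch that runs the pattern from the question against a few inputs. It uses Python's built-in re module as a stand-in for Spark's rlike(), which uses Java regular expressions; for the simple constructs in this pattern the two are assumed to behave the same. The helper name matches_original and the sample values are made up for illustration.

```python
import re

# Pattern copied from the question's rlike() call.
ORIGINAL = r'^\$?\-?[0-9]*\,?[0-9]*\.?[0-9]*$'


def matches_original(value: str) -> bool:
    """Return True if the question's pattern accepts the value (hypothetical helper)."""
    return re.match(ORIGINAL, value) is not None


# Illustrative samples, not taken from the asker's data.
for sample in ["$1,234.56", "$-5.00", "-$5.00", "$500,000,000,000"]:
    print(f"{sample!r}: {matches_original(sample)}")
```

The first two samples are accepted, while -$5.00 and $500,000,000,000 are rejected, so with the negated filter those rows would be flagged as invalid.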

I also simplified the expression a bit by removing unnecessary \'s and replacing [0-9] with \d.

Here's a tweaked pattern that should match your criteria better:

^-?\$?-?(\d*,)*\d*\.?\d*$

You can see it in action here: https://regexr.com/6om9n
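If you want to check the tweaked pattern locally before putting it into rlike(), something like the following sketch may help. It again uses Python's re module rather than Spark, on the assumption that the pattern behaves the same in both engines; re.ASCII is set so that \d only matches 0-9. The helper name is_valid_amount and the sample values are hypothetical.

```python
import re

# Tweaked pattern from the answer; re.ASCII restricts \d to the digits 0-9.
TWEAKED = re.compile(r'^-?\$?-?(\d*,)*\d*\.?\d*$', re.ASCII)


def is_valid_amount(value: str) -> bool:
    """Return True if the tweaked pattern accepts the value (hypothetical helper)."""
    return TWEAKED.match(value) is not None


# Illustrative samples, not taken from the asker's data.
for sample in ["-$5.00", "$-5.00", "$500,000,000,000", "1234.56", "12a", "$5.0.0", ""]:
    print(f"{sample!r}: {is_valid_amount(sample)}")
```

Note that every part of the pattern is optional, so the empty string (and a bare $ or -) also matches. If those should count as invalid, you would need an extra check, for example requiring at least one digit.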