Understanding the constraint in Spark's StringIndexer: why must inputCols and outputCols be different?

31 views Asked by jsn At 25 September 2023 at 19:47

I'm building a data preprocessing pipeline with Apache Spark and have run into a restriction in the StringIndexer transformer. I use StringIndexer to convert categorical columns into numeric indices. When I set its inputCols and outputCols parameters, Spark rejects any configuration in which the same column name appears in both, so a column cannot be indexed in place. Other transformers, such as Imputer, do permit this.
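To make the setup concrete, here is a minimal sketch in PySpark (assuming Spark 3.x, where StringIndexer accepts inputCols and outputCols). The helper names `suffixed` and `index_in_place`, the `_idx` suffix, and the example column names are hypothetical and only illustrate one possible workaround: write to temporary output columns, then drop the originals and rename. It is not a statement of how Spark intends this to be handled.

```python
def suffixed(cols, suffix="_idx"):
    """Return one temporary output name per input column (hypothetical helper)."""
    return [f"{c}{suffix}" for c in cols]


def index_in_place(df, cols, suffix="_idx"):
    """Index `cols` with StringIndexer and store the result under the original names.

    Sketch only: `df` is assumed to be a pyspark.sql.DataFrame that
    contains every column listed in `cols`.
    """
    # Imported here so this snippet can be loaded without PySpark installed.
    from pyspark.ml.feature import StringIndexer

    tmp_cols = suffixed(cols, suffix)

    # The configuration described in the question would be
    #   StringIndexer(inputCols=cols, outputCols=cols)
    # which Spark is reported to reject, so distinct names are used instead.
    indexer = StringIndexer(inputCols=cols, outputCols=tmp_cols)
    indexed = indexer.fit(df).transform(df)

    # Drop the original string columns, then rename the indexed columns back.
    indexed = indexed.drop(*cols)
    for tmp, original in zip(tmp_cols, cols):
        indexed = indexed.withColumnRenamed(tmp, original)
    return indexed
```

With a DataFrame `df` that has string columns `color` and `size`, `index_in_place(df, ["color", "size"])` would return a DataFrame in which both columns hold numeric indices. Note that the indexed columns end up at the end of the column list, and that the fitted model is discarded, so this form suits ad hoc preprocessing better than a reusable Pipeline stage.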

I'm curious to understand the rationale behind this distinction in behavior. Is there a specific technical or design decision underlying this constraint?

Your insights into this matter would be greatly appreciated. Thank you!
