pyspark.sql.functions.collect_set
- pyspark.sql.functions.collect_set(col)
Aggregate function: Collects the values from a column into a set, eliminating duplicates, and returns this set of objects.
New in version 1.6.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
- col: Column or str
The target column on which the function is computed.
- Returns
Column
A new Column object representing a set of collected values, duplicates excluded.
Notes
This function is non-deterministic as the order of collected results depends on the order of the rows, which may be non-deterministic after any shuffle operations.
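The ordering caveat above can be illustrated with a small plain-Python model. This is only a sketch: `collect_set_model` is a hypothetical helper, not part of PySpark, and it assumes elements are kept in arrival order, which is not necessarily how Spark builds the result internally. It shows that two row orders can yield the same elements in a different order, and that sorting gives a stable value to compare.

```python
def collect_set_model(rows):
    """Hypothetical stand-in for collect_set: keep first occurrences in arrival order."""
    seen = set()
    result = []
    for value in rows:
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


# The same three rows arriving in two different orders, as could happen after a shuffle.
order_a = collect_set_model([1, 2, 2])
order_b = collect_set_model([2, 1, 2])

# The elements agree even though the element order does not.
same_elements = set(order_a) == set(order_b)
# Sorting (the role sort_array plays in the examples) makes the results comparable.
same_sorted = sorted(order_a) == sorted(order_b)

print(order_a, order_b, same_elements, same_sorted)
```

For this reason the examples wrap `collect_set` in `sort_array` before showing the result.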
Examples
Example 1: Collect values from a DataFrame and sort the result in ascending order
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1,), (2,), (2,)], ('value',))
>>> df.select(sf.sort_array(sf.collect_set('value')).alias('sorted_set')).show()
+----------+
|sorted_set|
+----------+
|    [1, 2]|
+----------+
Example 2: Collect values from a DataFrame and sort the result in descending order
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
>>> df.select(sf.sort_array(sf.collect_set('age'), asc=False).alias('sorted_set')).show()
+----------+
|sorted_set|
+----------+
|    [5, 2]|
+----------+
Example 3: Collect values from a DataFrame with multiple columns and sort the result
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1, "John"), (2, "John"), (3, "Ana")], ("id", "name"))
>>> df = df.groupBy("name").agg(sf.sort_array(sf.collect_set('id')).alias('sorted_set'))
>>> df.orderBy(sf.desc("name")).show()
+----+----------+
|name|sorted_set|
+----+----------+
|John|    [1, 2]|
| Ana|       [3]|
+----+----------+
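As a rough cross-check of Example 3, the grouped aggregation can be mirrored in plain Python without a Spark session. This is a sketch of the semantics only, not of how Spark executes the query; the names `rows`, `ids_by_name`, and `result` are illustrative.

```python
from collections import defaultdict

# Same input rows as Example 3: (id, name).
rows = [(1, "John"), (2, "John"), (3, "Ana")]

# Group ids by name; a set collapses duplicate ids within a group.
ids_by_name = defaultdict(set)
for id_, name in rows:
    ids_by_name[name].add(id_)

# sorted(ids) plays the role of sort_array; sorting names in reverse
# mirrors orderBy(desc("name")).
result = [
    (name, sorted(ids))
    for name, ids in sorted(ids_by_name.items(), key=lambda kv: kv[0], reverse=True)
]

print(result)
```

The printed pairs correspond to the `name` and `sorted_set` columns in the table above.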