pyspark.pandas.sql

pyspark.pandas.sql(query, index_col=None, args=None, **kwargs)

Execute a SQL query and return the result as a pandas-on-Spark DataFrame.

This function acts as a standard Python string formatter that understands the following variable types:

  • pandas-on-Spark DataFrame

  • pandas-on-Spark Series

  • pandas DataFrame

  • pandas Series

  • string

The method can also bind named parameters to SQL literals from args.
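
For the string type above, the value is rendered as a quoted SQL string literal, so it is treated as a data value rather than raw SQL text. A minimal sketch, assuming pyspark.pandas is imported as ps and a Spark session is running:

>>> name = "Steven"
>>> ps.sql("SELECT {name} AS name", name=name)
     name
0  Steven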

Note

pandas-on-Spark DataFrame is not supported for Spark Connect.

Parameters
query : str

The SQL query.

index_col : str or list of str, optional

Column names to be used in Spark to represent pandas-on-Spark’s index. The index name in pandas-on-Spark is ignored. By default, the index is always lost.

Note

If you want to preserve the index, explicitly use DataFrame.reset_index() and pass it to the SQL statement with the index_col parameter.

For example,

>>> psdf = ps.DataFrame({"A": [1, 2, 3], "B":[4, 5, 6]}, index=['a', 'b', 'c'])
>>> new_psdf = psdf.reset_index()
>>> ps.sql("SELECT * FROM {new_psdf}", index_col="index", new_psdf=new_psdf)
       A  B
index
a      1  4
b      2  5
c      3  6

For MultiIndex,

>>> psdf = ps.DataFrame(
...     {"A": [1, 2, 3], "B": [4, 5, 6]},
...     index=pd.MultiIndex.from_tuples(
...         [("a", "b"), ("c", "d"), ("e", "f")], names=["index1", "index2"]
...     ),
... )
>>> new_psdf = psdf.reset_index()
>>> ps.sql(
...     "SELECT * FROM {new_psdf}", index_col=["index1", "index2"], new_psdf=new_psdf)
               A  B
index1 index2
a      b       1  4
c      d       2  5
e      f       3  6

Also note that the index name(s) should match the existing column name(s).

args : dict or list

A dictionary of parameter names to Python objects, or a list of Python objects, that can be converted to SQL literal expressions. See Supported Data Types for the supported value types in Python. For example: dictionary keys “rank”, “name”, “birthdate”; dictionary values 1, “Steven”, datetime.date(2023, 4, 2). A value can also be a Column of a literal or of collection constructor functions such as map(), array(), struct(); in that case it is taken as is.

New in version 3.4.0.

Changed in version 3.5.0: Added positional parameters.
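
For instance, a minimal sketch binding the example values above as named parameters (assuming a running Spark session):

>>> import datetime
>>> ps.sql(
...     "SELECT :rank AS rank, :name AS name, :birthdate AS birthdate",
...     args={"rank": 1, "name": "Steven", "birthdate": datetime.date(2023, 4, 2)})
   rank    name   birthdate
0     1  Steven  2023-04-02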

kwargs

Other variables that the user wants to set that can be referenced in the query.

Returns
pandas-on-Spark DataFrame

Examples

Calling a built-in SQL function.

>>> ps.sql("SELECT * FROM range(10) where id > 7")
   id
0   8
1   9
>>> ps.sql("SELECT * FROM range(10) WHERE id > {bound1} AND id < {bound2}", bound1=7, bound2=9)
   id
0   8
>>> mydf = ps.range(10)
>>> x = tuple(range(4))
>>> ps.sql("SELECT {ser} FROM {mydf} WHERE id IN {x}", ser=mydf.id, mydf=mydf, x=x)
   id
0   0
1   1
2   2
3   3

Mixing pandas-on-Spark and pandas DataFrames in a join operation. Note that the index is dropped.

>>> ps.sql('''
...   SELECT m1.a, m2.b
...   FROM {table1} m1 INNER JOIN {table2} m2
...   ON m1.key = m2.key
...   ORDER BY m1.a, m2.b''',
...   table1=ps.DataFrame({"a": [1,2], "key": ["a", "b"]}),
...   table2=pd.DataFrame({"b": [3,4,5], "key": ["a", "b", "b"]}))
   a  b
0  1  3
1  2  4
2  2  5

Also, it is possible to query using a Series.

>>> psdf = ps.DataFrame({"A": [1, 2, 3], "B":[4, 5, 6]}, index=['a', 'b', 'c'])
>>> ps.sql("SELECT {mydf.A} FROM {mydf}", mydf=psdf)
   A
0  1
1  2
2  3

Named parameters, marked with the : prefix in the query, can be substituted by SQL literals from args.

>>> ps.sql("SELECT * FROM range(10) WHERE id > :bound1", args={"bound1":7})
   id
0   8
1   9
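
String-formatted variables and named parameters can be combined in one query; a minimal sketch:

>>> mydf = ps.range(10)
>>> ps.sql("SELECT * FROM {mydf} WHERE id > :bound1", mydf=mydf, args={"bound1": 7})
   id
0   8
1   9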

Positional parameters, marked by ? in the query, can likewise be substituted by SQL literals from args.

>>> ps.sql("SELECT * FROM range(10) WHERE id > ?", args=[7])
   id
0   8
1   9
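
As noted in the args description, a value can also be a Column of a literal or collection constructor, which is taken as is; a minimal sketch, assuming pyspark.sql.functions is imported as sf:

>>> from pyspark.sql import functions as sf
>>> ps.sql("SELECT element_at(:m, 'a') AS a",
...        args={"m": sf.create_map(sf.lit("a"), sf.lit(1))})
   a
0  1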