pyspark.SparkContext.addFile

SparkContext.addFile(path, recursive=False)

Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

To access the file in Spark jobs, use SparkFiles.get() with the filename to find its download location.
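Note that SparkFiles.get() takes just the file name, not the original path. A minimal sketch, assuming an existing SparkContext named sc and a hypothetical local file /tmp/lookup.txt:

>>> from pyspark import SparkFiles
>>> sc.addFile("/tmp/lookup.txt")              # hypothetical local path
>>> local_copy = SparkFiles.get("lookup.txt")  # resolve by file name only

The same call works inside tasks running on the executors, as the Examples section below shows.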

A directory can be given if the recursive option is set to True. Currently directories are only supported for Hadoop-supported filesystems.
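A minimal sketch of adding a directory, assuming an existing SparkContext named sc and a hypothetical HDFS directory hdfs:///data/lookups:

>>> import os
>>> from pyspark import SparkFiles
>>> sc.addFile("hdfs:///data/lookups", recursive=True)  # hypothetical HDFS directory
>>> # The downloaded directory is resolved by its base name, and files inside
>>> # it are reached with ordinary path joins:
>>> lookups_dir = SparkFiles.get("lookups")
>>> countries_path = os.path.join(lookups_dir, "countries.txt")  # hypothetical file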

New in version 0.7.0.

Parameters
path : str

can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get() to find its download location.

recursive : bool, default False

whether to recursively add files in the input directory

Notes

A path can be added only once. Subsequent additions of the same path are ignored.

Examples

>>> import os
>>> import tempfile
>>> from pyspark import SparkFiles
>>> with tempfile.TemporaryDirectory(prefix="addFile") as d:
...     path1 = os.path.join(d, "test1.txt")
...     with open(path1, "w") as f:
...         _ = f.write("100")
...
...     path2 = os.path.join(d, "test2.txt")
...     with open(path2, "w") as f:
...         _ = f.write("200")
...
...     sc.addFile(path1)
...     file_list1 = sorted(sc.listFiles)
...
...     sc.addFile(path2)
...     file_list2 = sorted(sc.listFiles)
...
...     # adding path2 a second time is ignored
...     sc.addFile(path2)
...     file_list3 = sorted(sc.listFiles)
...
...     def func(iterator):
...         with open(SparkFiles.get("test1.txt")) as f:
...             mul = int(f.readline())
...             return [x * mul for x in iterator]
...
...     collected = sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()
>>> file_list1
['file:/.../test1.txt']
>>> file_list2
['file:/.../test1.txt', 'file:/.../test2.txt']
>>> file_list3
['file:/.../test1.txt', 'file:/.../test2.txt']
>>> collected
[100, 200, 300, 400]