pyspark.SparkContext.addFile#
- SparkContext.addFile(path, recursive=False)[source]#
Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
To access the file in Spark jobs, use
SparkFiles.get()
with the filename to find its download location. A directory can be given if the recursive option is set to True. Currently directories are only supported for Hadoop-supported filesystems.
New in version 0.7.0.
- Parameters
- path : str
can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use
SparkFiles.get()
to find its download location.
- recursive : bool, default False
whether to recursively add files in the input directory
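The examples below only distribute single files; the recursive directory case described above can be sketched as follows. This is a minimal sketch, assuming an active SparkContext named sc (as in the examples); building the directory tree uses only the standard library, and the addFile call itself is shown commented out because it requires a running Spark cluster.

```python
import os
import tempfile

# Build a small directory tree to distribute (stdlib only).
d = tempfile.mkdtemp(prefix="addFileDir")
os.makedirs(os.path.join(d, "sub"), exist_ok=True)
for name in (os.path.join("sub", "a.txt"), "b.txt"):
    with open(os.path.join(d, name), "w") as f:
        f.write("data")

# Distribute the whole directory: recursive=True is required, and
# directories are currently only supported for Hadoop-supported
# filesystems. `sc` is assumed to be an active SparkContext.
# sc.addFile(d, recursive=True)
```

Inside a task, each file would then be located with SparkFiles.get() by its name, just as with a single added file.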
Notes
A path can be added only once. Subsequent additions of the same path are ignored.
Examples
>>> import os
>>> import tempfile
>>> from pyspark import SparkFiles
>>> with tempfile.TemporaryDirectory(prefix="addFile") as d:
...     path1 = os.path.join(d, "test1.txt")
...     with open(path1, "w") as f:
...         _ = f.write("100")
...
...     path2 = os.path.join(d, "test2.txt")
...     with open(path2, "w") as f:
...         _ = f.write("200")
...
...     sc.addFile(path1)
...     file_list1 = sorted(sc.listFiles)
...
...     sc.addFile(path2)
...     file_list2 = sorted(sc.listFiles)
...
...     # add path2 twice, this addition will be ignored
...     sc.addFile(path2)
...     file_list3 = sorted(sc.listFiles)
...
...     def func(iterator):
...         with open(SparkFiles.get("test1.txt")) as f:
...             mul = int(f.readline())
...         return [x * mul for x in iterator]
...
...     collected = sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()
>>> file_list1
['file:/.../test1.txt']
>>> file_list2
['file:/.../test1.txt', 'file:/.../test2.txt']
>>> file_list3
['file:/.../test1.txt', 'file:/.../test2.txt']
>>> collected
[100, 200, 300, 400]