pyspark.SparkContext.addFile#
- SparkContext.addFile(path, recursive=False)[source]#
Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
To access the file in Spark jobs, use
SparkFiles.get()
with the filename to find its download location. A directory can be given if the recursive option is set to True. Currently directories are only supported for Hadoop-supported filesystems.
New in version 0.7.0.
- Parameters
- path : str
can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use
SparkFiles.get()
to find its download location.
- recursive : bool, default False
whether to recursively add files in the input directory
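The examples below only distribute single files; the recursive directory case described above can be sketched as follows. This is a minimal sketch, assuming an active SparkContext named sc (as in the examples); building the directory tree uses only the standard library, and the addFile call itself is shown commented out because it requires a running Spark cluster.

```python
import os
import tempfile

# Build a small directory tree to distribute (stdlib only).
d = tempfile.mkdtemp(prefix="addFileDir")
os.makedirs(os.path.join(d, "sub"), exist_ok=True)
for name in (os.path.join("sub", "a.txt"), "b.txt"):
    with open(os.path.join(d, name), "w") as f:
        f.write("data")

# Distribute the whole directory: recursive=True is required, and
# directories are currently only supported for Hadoop-supported
# filesystems. `sc` is assumed to be an active SparkContext.
# sc.addFile(d, recursive=True)
```

Inside a task, each file would then be located with SparkFiles.get() by its name, just as with a single added file.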
Notes
A path can be added only once. Subsequent additions of the same path are ignored.
Examples
>>> import os
>>> import tempfile
>>> from pyspark import SparkFiles
>>> with tempfile.TemporaryDirectory(prefix="addFile") as d:
...     path1 = os.path.join(d, "test1.txt")
...     with open(path1, "w") as f:
...         _ = f.write("100")
...
...     path2 = os.path.join(d, "test2.txt")
...     with open(path2, "w") as f:
...         _ = f.write("200")
...
...     sc.addFile(path1)
...     file_list1 = sorted(sc.listFiles)
...
...     sc.addFile(path2)
...     file_list2 = sorted(sc.listFiles)
...
...     # add path2 twice, this addition will be ignored
...     sc.addFile(path2)
...     file_list3 = sorted(sc.listFiles)
...
...     def func(iterator):
...         with open(SparkFiles.get("test1.txt")) as f:
...             mul = int(f.readline())
...         return [x * mul for x in iterator]
...
...     collected = sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()
>>> file_list1
['file:/.../test1.txt']
>>> file_list2
['file:/.../test1.txt', 'file:/.../test2.txt']
>>> file_list3
['file:/.../test1.txt', 'file:/.../test2.txt']
>>> collected
[100, 200, 300, 400]