Renaming column names of a DataFrame in Spark Scala

Question 1

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. So far I have come up with the following code, which only replaces a single column name.

for( i <- 0 to origCols.length - 1) {
  df.withColumnRenamed(
    df.columns(i), 
    df.columns(i).toLowerCase
  );
}

Question 2

If the structure is flat:

val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
//  |-- _1: long (nullable = false)
//  |-- _2: string (nullable = true)
//  |-- _3: string (nullable = true)
//  |-- _4: double (nullable = false)

the simplest thing you can do is use the toDF method:

val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)

dfRenamed.printSchema
// root
// |-- id: long (nullable = false)
// |-- x1: string (nullable = true)
// |-- x2: string (nullable = true)
// |-- x3: double (nullable = false)

If you want to rename individual columns, you can use either select with alias:

df.select($"_1".alias("x1"))

which can easily be generalized to multiple columns:

val lookup = Map("_1" -> "foo", "_3" -> "bar")

df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)

or withColumnRenamed:

df.withColumnRenamed("_1", "x1")

which can be combined with foldLeft to rename multiple columns:

lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))
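
Applied to the original question (lower-casing every column name), the loop there has no visible effect because withColumnRenamed returns a new DataFrame and the result is discarded on each iteration. The following is a minimal sketch of two ways to keep the result. The local SparkSession, the sample data, and the column names are assumptions made for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("rename-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data with mixed-case column names.
val df = Seq((1L, "a", 3.0)).toDF("Id", "UserName", "SCORE")

// Option 1: rebuild the DataFrame with every name lower-cased at once.
val viaToDF = df.toDF(df.columns.map(_.toLowerCase): _*)

// Option 2: thread the DataFrame through foldLeft so each rename is kept.
val viaFold = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumnRenamed(c, c.toLowerCase)
}

println(viaToDF.columns.mkString(", "))
println(viaFold.columns.mkString(", "))
```

For a flat schema the toDF variant is usually the shorter of the two; the foldLeft variant is handy when only some columns need renaming.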

With nested structures (structs), one possible option is renaming by selecting a whole structure:

val nested = spark.read.json(sc.parallelize(Seq(
    """{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))

nested.printSchema
// root
//  |-- foobar: struct (nullable = true)
//  |    |-- foo: struct (nullable = true)
//  |    |    |-- bar: struct (nullable = true)
//  |    |    |    |-- first: double (nullable = true)
//  |    |    |    |-- second: double (nullable = true)
//  |-- id: long (nullable = true)

@transient val foobarRenamed = struct(
  struct(
    struct(
      $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.second".as("y")
    ).alias("point")
  ).alias("location")
).alias("record")

nested.select(foobarRenamed, $"id").printSchema
// root
//  |-- record: struct (nullable = false)
//  |    |-- location: struct (nullable = false)
//  |    |    |-- point: struct (nullable = false)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)
//  |-- id: long (nullable = true)

Note that this may affect the nullability metadata. Another possibility is renaming by casting:

nested.select($"foobar".cast(
  "struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

or:

import org.apache.spark.sql.types._

nested.select($"foobar".cast(
  StructType(Seq(
    StructField("location", StructType(Seq(
      StructField("point", StructType(Seq(
        StructField("x", DoubleType), StructField("y", DoubleType)))))))))
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

Question 3

For those of you interested in the PySpark version (it is actually the same in Scala - see the comment below):

    merchants_df_renamed = merchants_df.toDF(
        'merchant_id', 'category', 'subcategory', 'merchant')

    merchants_df_renamed.printSchema()

Result:

root
 |-- merchant_id: integer (nullable = true)
 |-- category: string (nullable = true)
 |-- subcategory: string (nullable = true)
 |-- merchant: string (nullable = true)

Question 4

def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame =
{
  t.select( t.columns.map { c => t.col(c).as( p + c + s) } : _* )
}

In case it isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns that share the same name, and you want to join them but still be able to tell the columns apart in the resulting table. It sure would be nice if there were a similar way to do this in "normal" SQL.
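
The join use case described above can be sketched roughly as follows. The helper is repeated so the snippet is self-contained; the two tables, their contents, and the prefixes u_ and o_ are hypothetical, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("alias-sketch")
  .getOrCreate()
import spark.implicits._

// Same helper as in the answer, repeated here for completeness.
def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame =
  t.select(t.columns.map(c => t.col(c).as(p + c + s)): _*)

// Hypothetical tables that share the column names "id" and "name".
val users  = Seq((1L, "alice"), (2L, "bob")).toDF("id", "name")
val owners = Seq((1L, "acme")).toDF("id", "name")

val u = aliasAllColumns(users, p = "u_")
val o = aliasAllColumns(owners, p = "o_")

// After prefixing, every column in the join result has a unique name.
val joined = u.join(o, $"u_id" === $"o_id")
println(joined.columns.mkString(", "))
```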

Question 5

Suppose the DataFrame df has 3 columns id1, name1, price1 and you want to rename them to id2, name2, price2

val list = List("id2", "name2", "price2")
import spark.implicits._
val df2 = df.toDF(list:_*)
df2.columns.foreach(println)

I have found this approach useful in many cases.
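
One caveat worth keeping in mind: toDF assigns names by position and, as far as I know, expects exactly one name per existing column. The sketch below, which assumes a local SparkSession and made-up sample data, wraps both calls in Try to show the difference without aborting the session.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("todf-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical data matching the column names used in this answer.
val df = Seq((1L, "pen", 2.5)).toDF("id1", "name1", "price1")

// Three names for three columns: the rename succeeds.
val ok = Try(df.toDF("id2", "name2", "price2"))

// Two names for three columns: the call is rejected rather than
// renaming only the first two columns.
val tooFew = Try(df.toDF("id2", "name2"))

println(ok.isSuccess)
println(tooFew.isFailure)
```

To rename only a subset of columns, withColumnRenamed or a select with aliases is the safer choice.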

Question 6

Joining two tables without renaming the join key

// method 1: create a new DF
day1 = day1.toDF(day1.columns.map(x => if (x.equals(key)) x else s"${x}_d1"): _*)

// method 2: use withColumnRenamed
for ((x, y) <- day1.columns.filter(!_.equals(key)).map(x => (x, s"${x}_d1"))) {
    day1 = day1.withColumnRenamed(x, y)
}

Works!