koperagen (Collaborator) commented Aug 28, 2025

This PR implements the basic idea of a new workflow for imported schemas:

  1. The KSP processor finds "import schema" declarations (now classes, not Kotlin files), matches each declaration to a reader, reads the data, and writes the schemas to a directory. It can be triggered as a regular Gradle task whenever needed.
  2. The compiler plugin handles all codegen.

The goal for now is an MVP for internal testing.

How this approach is different:

  1. A schema declaration is now a class, not a file annotation, which gives a more convenient syntax for declarations.
  2. Schemas are stored under version control, which is more transparent than a hidden directory of generated code.

What's new:
An approach to custom formats and generic schema preprocessing
We explore the idea of a service loader in the KSP processor. Anyone can provide a SchemaReader implementation in another module of their project and generate schemas for arbitrary data.
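
The reader-matching step could look roughly like the sketch below. Everything in it is a stand-in: `SimpleSchemaReader`, `Table`, `CsvHeaderReader`, `TsvHeaderReader`, `findReader`, the `"tsv"` qualifier, and the `"default"` value of `DEFAULT_QUALIFIER` are hypothetical names that only mirror the shape of the `accepts`/`read` functions in this PR's diff. The hard-coded reader list replaces the service-loader discovery described above.

```kotlin
import java.io.File

// Hypothetical stand-in for DataFrame<*>: only the column names.
data class Table(val columns: List<String>)

// Mirrors the shape of SchemaReader from the diff; not the real interface.
interface SimpleSchemaReader {
    // By default a reader only handles declarations without a custom qualifier.
    fun accepts(path: String, qualifier: String): Boolean = qualifier == DEFAULT_QUALIFIER

    fun read(path: String): Table

    companion object {
        const val DEFAULT_QUALIFIER = "default" // assumed value
    }
}

// Handles unqualified declarations that point at .csv files.
class CsvHeaderReader : SimpleSchemaReader {
    override fun accepts(path: String, qualifier: String): Boolean =
        qualifier == SimpleSchemaReader.DEFAULT_QUALIFIER && path.endsWith(".csv")

    override fun read(path: String): Table =
        Table(File(path).useLines { it.first() }.split(","))
}

// Selected only when a declaration explicitly asks for the "tsv" qualifier.
class TsvHeaderReader : SimpleSchemaReader {
    override fun accepts(path: String, qualifier: String): Boolean = qualifier == "tsv"

    override fun read(path: String): Table =
        Table(File(path).useLines { it.first() }.split("\t"))
}

// The first reader that accepts the (path, qualifier) pair wins.
fun findReader(readers: List<SimpleSchemaReader>, path: String, qualifier: String): SimpleSchemaReader? =
    readers.firstOrNull { it.accepts(path, qualifier) }

fun main() {
    val readers = listOf(CsvHeaderReader(), TsvHeaderReader())
    val csv = File.createTempFile("people", ".csv").apply {
        writeText("id,name\n1,Alice\n")
        deleteOnExit()
    }
    val reader = findReader(readers, csv.path, SimpleSchemaReader.DEFAULT_QUALIFIER)
    println(reader?.read(csv.path))
}
```

In this sketch the qualifier lets two readers coexist for the same kind of path: the annotation's qualifier decides which one is picked.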

The KSP plugin now has two new parameters:

  1. dataframe.experimentalImportSchema: disables the old annotation processing for DataSchema and ImportDataSchema so that they don't conflict with the compiler plugin

  2. dataframe.importedSchemasOutput: the output directory where JSON files are generated; it serves as the input of the compiler plugin

ksp {
  arg("dataframe.experimentalImportSchema", "true") 
  arg("dataframe.importedSchemasOutput", path)
}

Example setup:

plugins {
    kotlin("jvm") version "2.3.255-SNAPSHOT"
    kotlin("plugin.dataframe") version "2.3.255-SNAPSHOT"
    id("com.google.devtools.ksp") version "2.2.0-2.0.2"
}

repositories {
    mavenLocal()
    maven("https://packages.jetbrains.team/maven/p/kt/dev/")
    mavenCentral()
}

dependencies {
    val version = "1.0.0-dev"
    implementation("org.jetbrains.kotlinx:dataframe-core:$version")
    implementation("org.jetbrains.kotlinx:dataframe-json:$version")
    implementation("org.jetbrains.kotlinx:dataframe-csv:$version")
    ksp("org.jetbrains.kotlinx.dataframe:symbol-processor-all:$version")

    // Module with custom readers
    ksp(project(":reader"))
    implementation(project(":reader"))
    testImplementation(kotlin("test"))
}

tasks.test {
    useJUnitPlatform()
}

val schemasDir = layout.projectDirectory.dir("src/schemas")
ksp {
    arg("dataframe.importedSchemasOutput", schemasDir.asFile.absolutePath)
    arg("dataframe.experimentalImportSchema", "true")
    arg("dataframe.resolutionDir", layout.projectDirectory.asFile.absolutePath)
}

kotlin {
    jvmToolchain(11)
    compilerOptions.freeCompilerArgs.addAll(
        "-P", "plugin:org.jetbrains.kotlin.dataframe:schemasPath=${schemasDir.asFile}"
    )
}

@@ -43,6 +44,9 @@ public annotation class ImportDataSchema(
val enableExperimentalOpenApi: Boolean = false,
)

@Target(AnnotationTarget.CLASS)
public annotation class DataSchemaSource(val source: String, val qualifier: String = SchemaReader.DEFAULT_QUALIFIER)
Collaborator

I can guess what source does, but qualifier is unclear to me. Some comments would be nice, even though it's just a proof of concept.

Collaborator

I vaguely remember from your demo that this allowed making distinctions of some kind, but I can't recall exactly without a small example.

Collaborator

Maybe some KDocs here as well?

@@ -59,6 +59,36 @@ public interface SupportedDataFrameFormat : SupportedFormat {
public fun readDataFrame(file: File, header: List<String> = emptyList()): DataFrame<*>
}

/**
* User-facing API implemented by a companion object of an imported schema [org.jetbrains.kotlinx.dataframe.annotations.DataSchemaSource]
Collaborator

*the companion object


/**
* Handler of classes annotated with [org.jetbrains.kotlinx.dataframe.annotations.DataSchemaSource].
* Implementations must have a single zero-argument constructor
Collaborator

they could also be object singletons maybe, since they have no state

Collaborator Author

Good idea! It first needs to be adjusted in the compiler plugin.
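
One way a loader could accept both forms, sketched with plain Java reflection. The helper `instantiate` and the sample types `DemoReader`, `ClassReader`, and `ObjectReader` are hypothetical and not part of this PR. The sketch relies on the Kotlin compiler exposing an `object` declaration on the JVM through a static `INSTANCE` field.

```kotlin
import java.lang.reflect.Modifier

// Hypothetical stand-in for SchemaReader.
interface DemoReader {
    fun name(): String
}

class ClassReader : DemoReader {
    override fun name() = "class"
}

object ObjectReader : DemoReader {
    override fun name() = "object"
}

// Returns the singleton for a Kotlin object, otherwise calls the zero-argument constructor.
fun <T : Any> instantiate(clazz: Class<T>): T {
    val instanceField = clazz.declaredFields.firstOrNull {
        it.name == "INSTANCE" && Modifier.isStatic(it.modifiers) && it.type == clazz
    }
    return if (instanceField != null) {
        clazz.cast(instanceField.get(null))
    } else {
        clazz.getDeclaredConstructor().newInstance()
    }
}

fun main() {
    println(instantiate(ClassReader::class.java).name())
    println(instantiate(ObjectReader::class.java).name())
}
```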


public fun accepts(path: String, qualifier: String): Boolean = qualifier == DEFAULT_QUALIFIER

public fun read(path: String): DataFrame<*>
Collaborator

we do need a way to pass extra arguments in the future. There are many ways to do this but we can figure that out later :)


public fun read(path: String): DataFrame<*>

public fun default(path: String): DataFrame<*> = read(path)
Collaborator

I'd still rename this to readDefault or readSource, something more imperative.


/**
* Serializes data schema into a human-readable JSON format.
* Input of compiler plugin for "imported data schema" feature
Collaborator

Please add a tiny sample :) similar to what Nikita did in serialization_format.md. It helps to see that this builds.

* Input of compiler plugin for "imported data schema" feature
*/
fun DataFrameSchema.toJsonString(
json: Json = Json { prettyPrint = true },
Collaborator

could be extracted to a private const

val configuration = DataFrameConfiguration(
resolutionDir = environment.options["dataframe.resolutionDir"],
importedSchemasOutput = environment.options[DATAFRAME_IMPORTED_SCHEMAS_OUTPUT],
experimentalImportSchema = environment.options["dataframe.experimentalImportSchema"].equals(
Collaborator

Probably more readable like:

environment.options["dataframe.experimentalImportSchema"]
                .equals("true", ignoreCase = true)

This also satisfies KtLint :)
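
For reference, the standard library's `String?.equals(other, ignoreCase)` is defined on a nullable receiver, so a missing option simply evaluates to false. A minimal sketch, with a plain map standing in for `environment.options` and a hypothetical helper `isEnabled`:

```kotlin
// Hypothetical helper: true only when the option is present and equals "true" in any letter case.
fun isEnabled(options: Map<String, String>, key: String): Boolean =
    options[key].equals("true", ignoreCase = true)

fun main() {
    val key = "dataframe.experimentalImportSchema"
    println(isEnabled(mapOf(key to "TRUE"), key))  // true
    println(isEnabled(mapOf(key to "false"), key)) // false
    println(isEnabled(emptyMap(), key))            // false: a null receiver never equals "true"
}
```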
