Ideas about Big Data / Common Datasets? Base close to the 100000 record limit

Hi there,

here’s a request for any ideas about how so solve the following problem - any input appreciated.

I need (want) to publish all the adresses of our city as a common dataset, to make address lookups available to all base developers. The data is pushed into Seatable via the API throught ETL processes from our GeoInfo Servers.

The problem: The set of unique addresses (Street + House number) comprises about 98600 records, just below the 100000 record limit for single bases, with a few to spare in the future.

A few questions - maybe someone has any input:

  1. Does it make sense to publish such a large common data set (in terms of Sync time, performance)?
  2. The data is very redundant - every street is an extra text field in every record
    • However, if I put all street names into a separate table, and link them to the address records, I need to add another table with 3500 street names, exceeding the 100000 record / base limit of seatable
  3. Using Big Data excludes using common datasets, so no solution there?

Anyone out there with a clever solution?

If not: before I bite the bullet and publish < 100000 highly redundant records as a common dataset, I’d rather know from question 1 that it’s worth the effort, and does not break my system.

Thanks everyone!

Hello! Based on your description of managing a large city address dataset (~98,600 records), here is an analysis and some suggestions for your SeaTable setup:

1. Performance and Syncing

Publishing a dataset of nearly 100,000 records as a Common Dataset is technically possible, but you should expect some performance overhead:

  • Synchronization Time: The initial import and subsequent manual or periodic syncs (especially the “Force Sync” push) will take longer than with smaller sets.

  • Memory Usage: SeaTable loads the data into the browser’s memory. A 100k-row table with multiple columns will increase the memory footprint of the base, potentially making the user interface feel less responsive on lower-end hardware.

  • Recommendation: If performance becomes an issue, consider creating filtered views (e.g., by district or zip code) and publishing those as separate, smaller Common Datasets for specific use cases.

2. Data Redundancy vs. Normalization

In SeaTable, the 100,000-record limit applies to the entire base (the sum of rows across all tables).

  • The Problem with Normalization: If you move 3,500 street names to a second table and link them, your total record count would be 98,600 + 3,500 = 102,100, which exceeds the limit.

  • The Solution: In this specific case, redundancy is actually the better path. Storing the street name as a text field in each record is more efficient for the row quota than creating links to a separate table. SeaTable is designed to handle large amounts of text data efficiently within a single table.

3. Big Data and Common Datasets

You are correct that Big Data storage (the Big Data backend) currently excludes the use of Common Datasets.

  • Big Data is designed for archiving or querying millions of records that don’t need the real-time collaborative features of the standard table.

  • If you move this data to Big Data, other bases will not be able to “sync” to it as a Common Dataset.

Clever Alternatives / Summary

  • Stay Redundant: Keep everything in one table using text fields to stay under the 100k limit.

  • API-First Approach: If base developers only need to lookup addresses rather than having the full list in their own base, consider having them use a Data Processing script or a Custom Plugin that queries your “Master Address Base” via API on demand. This bypasses the need for syncing 100k rows into every base.

  • Verify Your Version: If you are on an Enterprise subscription, you may have access to higher row limits or more frequent synchronization options.

In the future

A better dtable-server to support more data

We are working on a new golang based dtable-server to support more data in a base, like 200K to 300K. But because it is still a memory based data storage, it cannot store too many records.

This is an on-going task.

An updated way to use big data storage

Records can be stored in big data storage and can be used directly by other bases without sync.

This is a long term task. We have not worked it on yet.

@daniel.pan 1000 (or 100.000 :slight_smile:) Thanks for this extensive answer, and the outlook into the future!

A lot of this supports my findings, so based on that confirmation, I think I’ll explore 2 different ways:

  1. Trying to separate the common dataset (going back to redundant data) into logical chunks as separate bases (e.g. city quarters/areas), if my internal customers can go along with it.
    • However, sharing your considerations about size, speed and system load, and seeing that we are awfully close to the 100K limit, which I do not wish to raise, even in our enterprise installation, I’ll keep that option as a Plan B.
  2. The “Data Processing Script” / “Custom Plugin” sounds very attractive, so I’d like to follow that path if time allows it. The “on demand” option caught my eye, so I’d like to learn more about it.

The reason behind this: The address data actually comes from our GeoInfo system, which also has a “Location Finder” API to support on-demand address research. Therefore, the hope would be that somewhere in Seatable’s GUI (Base Backend or App), I could place a call to an external API, and write the resulting data into the Seatable base.

Is that possible? How? Is there any documentation? Is that normal scripting, triggered by automations?

Thanks again,

Arndt

Calling external APIs and writing the results back into SeaTable is a very common use case.

How does it work?

The “on-demand” experience you’re looking for is typically achieved in one of two ways:

  1. Button Column (Manual On-Demand): You can add a “Button” type column to your table and set its action to “Run Script.” When a user clicks the button in a specific row, it triggers your script. The script can then take data from that row (like a search term), query your GeoInfo API, and update the row with the result.
  2. Automations (Automatic): You can set up an automation that triggers “When a record is added” or “When a column value changes.” This allows the “Location Finder” logic to run automatically whenever someone enters a partial address.

Technical Implementation

We recommend using Python scripts for this, as the requests library is pre-installed in SeaTable’s environment, making API calls very straightforward.

Here is a simplified example of how a Python script would look:

from seatable_api import Base, context
import requests

# The 'context' object automatically provides credentials when run within SeaTable
base = Base(context.api_token, context.server_url)
base.auth()

# 1. Get data from the row where the button was clicked
row = context.current_row
address_query = row.get('Address Search')

# 2. Call your external GeoInfo API
api_url = f"https://your-geoinfo-system.com/api/search?q={address_query}"
response = requests.get(api_url)
data = response.json()

# 3. Write the result back to SeaTable
base.update_row(context.current_table, row['_id'], {
    'Street': data.get('street_name'),
    'Postal Code': data.get('zip'),
    'Coordinates': f"{data.get('lat')}, {data.get('lng')}"
})

The disadvantage of Python script approach is that the user cannot filter addresses when they are typing.

In version 6.2, you will also be able to add a custom HTML page in SeaTable App (or generate it by AI) to implement a complex UI effect.

@daniel.pan Thanks for the update and explanations!

The most exciting development seems to be the possible development in 6.2. - looking forward to it!

You mentioned it yourself: Both button action and on-change automations have the disadvantage that they happen after the user entered the incomplete data, and not as a as-you-type lookup during data entry. In addition, the button action seems to work in the backend, but not in App forms.

At least I can offer two options now:

  • Treat this as a background service, where incomplete address entries are verfied and corrected after the data entry.
  • Provide an on-demand “verify (and correct) address” button in table views. In the backend for sure. For Apps, I need to test whether the result of the python script (calling the Location Finder API) will update in the table immediately, or only after a page refresh.

The second button option seems to work in Apps as well, so I’ll go with that one.