Hi my dears, I have an issue at work where we have to work with about 150 million product data points. We are using SQL Server because it was available in-house for development. However, with various tables growing beyond 10 million rows the server becomes quite slow and wait/buffer time climbs above 7000 ms/sec, which is tearing down our whole setup of microservices that continuously read, write and delete from those tables. All the Stack Overflow answers boil down to: it's complex, go read a 2000-page book.

The thing is, my queries are not that complex. They simply go through the whole table to identify duplicates, which are then not processed further, because the processing takes time (and that's what we thought would be the bottleneck). But the time saved by not processing duplicates now seems to be less than the time it takes to compare each batch against the SQL table. The other culprit is that our server runs on an HDD, which at roughly 150 MB/s read and write is probably at its limit.

The question is: is there a wizard move to bypass any of these restrictions, or is a change in the setup and algorithm inevitable?

edit: I know my question seems broad, but as I am new to database architecture I welcome any input and discussion, since the topic is a lifetime of know-how in itself. Thanks for every bit of feedback.

  • pinball_wizard@lemmy.zip

    When detecting duplicates gets expensive, the secret is to process them anyway, but in a way that de-duplicates the result of processing them.

    Usually, that means writing the next processing step into a (new) table whose primary key contains every detail that could make a record a duplicate.

    Then, as all the records are processed, just let it overwrite that same record with each duplicate.

    The resulting table is a list of keys containing no duplicates.

    (Tip: This can be a good process to run overnight.)

    (Tip: be sure the job also marks each original record as processed/deduped, so the overnight job only ever has to look at new unprocessed/un-deduped records.)

    Then, we drive all future processing steps from that new de-duplicated table, joining back to only whichever of the duplicate records was processed last for the other record details. (Since they’re duplicates anyway, we don’t care which one wins, as long as only one does.)

    This tends to result in a first single pass through the full data set to create the de-duplicated list, and then a second pass through the de-duplicated list for all remaining steps. So roughly 2n processing time.

    (But the first n can be a long running background job, and the second n can be optimized by indexes supporting the needs of each future processing step.)
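
    A minimal T-SQL sketch of that shape, with made-up names (RawProducts, a Processed flag, Title/Description as the duplicate-defining details) and a hash of those details as the primary key so wide text columns stay within index key limits:

    ```sql
    -- All names are invented for illustration (RawProducts, Title, Description, Processed).
    -- The key is a hash of every detail that defines a duplicate.
    CREATE TABLE DedupedProducts (
        DupHash     BINARY(32) NOT NULL PRIMARY KEY,
        SourceId    BIGINT     NOT NULL,              -- whichever duplicate "won" last
        ProcessedAt DATETIME2  NOT NULL DEFAULT SYSUTCDATETIME()
    );

    -- Overnight job: upsert every unprocessed record; duplicates collapse onto the same key.
    MERGE DedupedProducts AS t
    USING (
        SELECT DupHash, MAX(Id) AS Id                 -- collapse duplicates inside the batch too
        FROM (
            SELECT Id,
                   HASHBYTES('SHA2_256', CONCAT(Title, N'|', Description)) AS DupHash
            FROM RawProducts
            WHERE Processed = 0
        ) AS x
        GROUP BY DupHash
    ) AS s
    ON t.DupHash = s.DupHash
    WHEN MATCHED THEN
        UPDATE SET SourceId = s.Id, ProcessedAt = SYSUTCDATETIME()
    WHEN NOT MATCHED THEN
        INSERT (DupHash, SourceId) VALUES (s.DupHash, s.Id);

    -- Mark the originals so the next run only has to look at new records.
    -- (If rows keep arriving while this runs, claim the batch of Ids first instead.)
    UPDATE RawProducts SET Processed = 1 WHERE Processed = 0;
    ```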

  • normalexit@lemmy.world

    “They simply go through the whole table”… that’s the problem. A full table scan should be avoided at all costs.

    Learn: how to run and read an explain plan, indexes, keys, constraints, and query optimization (broadly you want to locate individual records as quickly as possible by using the most selective criteria).

    You also need to learn basic schema design and to familiarize yourself with normalization.

    Avoid processing huge result sets in your application. The database is good at answering questions about data it contains. It isn’t just a big bucket to throw data into to retrieve later.
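
    As a small illustration (names are hypothetical), instead of pulling rows into the application to check for duplicates, let the database answer the question and back it with an index so the check is a seek rather than a scan:

    ```sql
    -- Hypothetical schema: FinalProducts holds everything already processed.
    CREATE UNIQUE INDEX IX_FinalProducts_SourceId ON FinalProducts (SourceId);

    -- "Has this product been processed already?" becomes a single index seek.
    DECLARE @SourceId BIGINT = 12345;

    SELECT CASE WHEN EXISTS (
               SELECT 1 FROM FinalProducts WHERE SourceId = @SourceId
           ) THEN 1 ELSE 0 END AS AlreadyProcessed;
    ```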

    • PoisonedPrisonPanda@discuss.tchncs.deOP

      broadly you want to locate individual records as quickly as possible by using the most selective criteria

      What can be more selective than ID = "XXX"? Yet the whole table still has to be scanned until XXX is found?

      … and to familiarize yourself with normalization.

      Based on a quick review of normalization, I doubt it helps me, as we don't have those kinds of relationships in the data. We "simply" have many products with certain parameters (title, description, etc.); based on those we process each product and store it, with additional output, in a table. However, to avoid processing products that were already processed, we want to dismiss any product in the processing pipeline that is already stored in the "final" table.

      It isn’t just a big bucket to throw data into to retrieve later.

      That's probably the biggest enlightenment I've had since we started working with a database.

      Anyway, I appreciate your input, so thank you for this.

  • solrize@lemmy.world

    What? Problems like this usually come down to some missing indexes. Can you view the query plan for your slow queries? See how long they are taking? IDK about SQL Server but usually there is a command called something like EXPLAIN ANALYZE, which breaks down a query into the different parts of its execution plan, executes it, and measures how long each part takes. If you see something like “FULL TABLE SCAN” taking a long time, that can usually be fixed with an index.
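
    In SQL Server the rough equivalent looks something like this (I haven't tuned this for your setup; the query and names are just placeholders):

    ```sql
    -- Option 1: show the estimated plan without executing the statement.
    SET SHOWPLAN_ALL ON;
    GO
    SELECT Id FROM Products WHERE Title = N'example';  -- placeholder for your slow query
    GO
    SET SHOWPLAN_ALL OFF;
    GO

    -- Option 2: execute it and measure; look for "Table Scan" and large logical-read counts.
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;
    SELECT Id FROM Products WHERE Title = N'example';  -- placeholder for your slow query
    ```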

    If this doesn’t make any sense to you, ask if there are any database gurus at your company, or book a few hours with a consultant. If you go the paid consultant route, say you want someone good at SQL Server query optimization.

    By the way I think some people in this thread are overestimating the complexity of this type of problem or are maybe unintentionally spreading FUD. I'm not a DB guru, but I would say that by now I'm somewhat clueful, and I got that way mostly by reading the SQLite docs, including the implementation manuals, over a few evenings. That's probably a few hundred pages, but not 2000 or anything like that.

    First question: how many separate tables does your DB have? If less than say 20, you are probably in simple territory.

    Also, look at your slowest queries. They likely say SELECT something FROM this JOIN that JOIN otherthing bla bla bla. How many different JOINs are in that query? If just one, you probably need an index; if two or three, it might take a bit of head scratching; and if 4 or more, something is possibly wrong with your schema or how the queries are written and you have to straighten that out.

    Basically from having seen this type of thing many times before, there is about a 50% chance that it can be solved with very little effort, by adding indexes based on studying the slow query executions.

    • PoisonedPrisonPanda@discuss.tchncs.deOP

      First question: how many separate tables does your DB have? If less than say 20, you are probably in simple territory.

      Currently about ~50, but around 30 of them are the result of splitting the data by a common column like "country". In the beginning I assumed this amounts to the same thing as partitioning one large table?

      Also, look at your slowest queries

      The individual queries don't take long because of the query per se, but because of the limitation of the HDD: SQL Server reads as much as possible from disk to go through a table, and with multiple connections all querying multiple tables at once this leads to server overload. While I now see the issue with our approach, I hope that migrating from SQL Server to PostgreSQL on modern hardware, plus refactoring our approach in general, will give us a boost.

      They likely say SELECT something FROM this JOIN that JOIN otherthing bla bla bla. How many different JOINs are in that query?

      Actually no JOINs. The most "complex" query is an INSERT INTO with a WHERE NOT EXISTS condition.
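
      In simplified form (placeholder names, the real columns differ) it's roughly this shape, and the NOT EXISTS probe only stays cheap if the column it checks is indexed:

      ```sql
      -- Simplified shape of the statement (placeholder names).
      INSERT INTO ProcessedProducts (SourceId, Title, Result)
      SELECT s.SourceId, s.Title, s.Result
      FROM StagingBatch AS s
      WHERE NOT EXISTS (
          SELECT 1 FROM ProcessedProducts AS p WHERE p.SourceId = s.SourceId
      );

      -- Without an index on ProcessedProducts(SourceId) that check scans the whole table
      -- for every batch; with a unique index it becomes an index seek per row.
      CREATE UNIQUE INDEX IX_ProcessedProducts_SourceId ON ProcessedProducts (SourceId);
      ```

      (SQL Server also supports unique indexes created WITH (IGNORE_DUP_KEY = ON), which silently skip duplicate rows on insert, so the NOT EXISTS check could in principle be dropped entirely.)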

      But thank you for your advice. I will incorporate the tips in our new design approach.

      • solrize@lemmy.world

        You really have to see what the db is doing to understand where the bottlenecks are, i.e. find the query plans. It's OK if it's just single selects. Look for stuff like table scans that shouldn't happen. How many queries per second are there? Remember that SSDs have only been a common thing for maybe 10 years. Before that it was HDDs everywhere, and people still ran systems with very high throughput. They had much less RAM then than now, too.
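
        On SQL Server, one quick way to get that overview is the plan-cache DMVs (numbers reset when the cache is cleared or the server restarts), roughly:

        ```sql
        -- Heaviest cached statements by logical reads (a rough proxy for disk pressure).
        SELECT TOP (10)
            qs.execution_count,
            qs.total_logical_reads,
            qs.total_elapsed_time / 1000 AS total_elapsed_ms,
            st.text AS batch_text
        FROM sys.dm_exec_query_stats AS qs
        CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
        ORDER BY qs.total_logical_reads DESC;
        ```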

  • wwb4itcgas@lemm.ee

    To paraquote H. L. Mencken: For every problem, there is a solution that’s cheap, fast, easy to implement – and wrong.

    Silver bullets and magic wands don't really exist, I'm afraid. There are ample reasons why DBAs are well-paid people.

    There are basically three options: either increase the hardware capabilities so they can handle the amount of data you want to deal with, decrease the amount of data so that the hardware you've got can handle it at the level of performance you want, or… live with the status quo.

    If throwing more hardware at the issue was an option, I presume you would just have done so. As for how to viably decrease the amount of data in your active set, well, that’s hard to say without knowledge of the data and what you want to do with it. Is it a historical dataset or time series? If so, do you need to integrate the entire series back until the dawn of time, or can you narrow the focus to a recent time window and shunt old data off to cold storage? Is all the data per sample required at all times, or can details that are only seldom needed be split off into separate detail tables that can be stored on separate physical drives at least?
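
    As a sketch of the cold-storage idea (all names and the cutoff are invented; whether any of this applies depends entirely on your data):

    ```sql
    -- Move rows older than two years to an archive table on cheaper storage,
    -- then remove them from the hot table. In practice you'd run this in batches.
    DECLARE @cutoff DATETIME2 = DATEADD(YEAR, -2, SYSUTCDATETIME());

    BEGIN TRANSACTION;

    INSERT INTO ProductsArchive (Id, Title, Description, UpdatedAt)
    SELECT Id, Title, Description, UpdatedAt
    FROM Products
    WHERE UpdatedAt < @cutoff;

    DELETE FROM Products
    WHERE UpdatedAt < @cutoff;

    COMMIT;
    ```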

    • PoisonedPrisonPanda@discuss.tchncs.deOP

      To paraquote H. L. Mencken: For every problem, there is a solution that’s cheap, fast, easy to implement – and wrong.

      This can be the new slogan of our development. :')

      I have convinced management to switch to a modern server. In addition we hope refactoring our approach (no random reads, no dedupe processes for a whole table, etc.) will lead us somewhere.

      As for how to viably decrease the amount of data in your active set, well, that’s hard to say without knowledge of the data and what you want to do with it. Is it a historical dataset or time series?

      Actually, no. We are adding a layer of product processing to an already-in-production system that handles multiple millions of products on a daily basis. Since we not only have to process the new/updated products but also have to catch up on processing the historical (older) products, it's a massive amount of products. We thought that since order is not important we would use a random approach to catch up. But I see now that this is a major bottleneck in our design.

      If so, do you need to integrate the entire series back until the dawn of time, or can you narrow the focus to a recent time window and shunt old data off to cold storage?

      So no, no narrowing.

      Is all the data per sample required at all times, or can details that are only seldom needed be split off into separate detail tables that can be stored on separate physical drives at least?

      Also no, IMO. Since we don't want a product to be processed twice, we want to ensure deduplication, and that requires knowledge of all already-processed products. Hence comparing against the whole table every time.

      • wwb4itcgas@lemm.ee

        Sorry for taking so long to get back to you on this, but I’m not always on Lemmy. There’s always more code to be written - you know how it is, I’m sure.

        Given the constraints you outline, one other avenue of attack could be to consider the time-sensitivity of product updates and their relative priority. If it's acceptable for product updates to lag somewhat, you can at least perform them at a lower rate over a longer time, reducing the hardware load at any given moment. If the periodic updates are made to the same per-product values, you could even get smart and replace queued updates that haven't been performed yet when they're superseded by a subsequent change before they're actually committed, further reducing load.
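
        A sketch of that coalescing idea (invented names): keep one pending row per product, and let a newer queued change simply replace an older one that hasn't been applied yet.

        ```sql
        -- One pending update per product; a later change overwrites an unapplied earlier one.
        CREATE TABLE PendingProductUpdates (
            ProductId BIGINT        NOT NULL PRIMARY KEY,
            NewTitle  NVARCHAR(400) NULL,
            NewPrice  DECIMAL(18,2) NULL,
            QueuedAt  DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME()
        );

        -- Enqueue (or supersede) an update for one product.
        DECLARE @ProductId BIGINT = 42,
                @NewTitle  NVARCHAR(400) = N'Example title',
                @NewPrice  DECIMAL(18,2) = 9.99;

        MERGE PendingProductUpdates AS t
        USING (SELECT @ProductId AS ProductId, @NewTitle AS NewTitle, @NewPrice AS NewPrice) AS s
        ON t.ProductId = s.ProductId
        WHEN MATCHED THEN
            UPDATE SET NewTitle = s.NewTitle, NewPrice = s.NewPrice, QueuedAt = SYSUTCDATETIME()
        WHEN NOT MATCHED THEN
            INSERT (ProductId, NewTitle, NewPrice) VALUES (s.ProductId, s.NewTitle, s.NewPrice);
        ```

        A background worker can then drain that table at whatever rate the hardware tolerates.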