Data Cleaning


Data cleansing is one of the most important procedures for improving your model's performance.

Most times when we collect data, we must do data cleaning to ensure that the data is as close to perfect as it can be. Data cleaning can involve many assessments. For example, let's say a survey questionnaire was put online and data was collected via a website.

Why do logs keep on growing?

A Shorter Answer:
You probably either have a long-running transaction (index maintenance? a big batch delete or update?) or you are in the "default" recovery model of Full (more below on what "default" means here) and have not taken a log backup (or aren't taking them frequently enough).
There could be other reasons, but these are the most common. This answer dives into those two most common reasons, gives you some background on the why and how behind them, and explores some other reasons as well.
A Longer Answer:

What Scenarios can Cause the Log to Keep Growing?

There are many reasons, but usually these reasons are of the following two patterns:

1.) Not Understanding Recovery Models

(Being in Full Recovery Model and not taking log backups – this is the most common reason; the vast majority of those experiencing this issue are in this situation.)
This answer is not a deep dive into SQL Server recovery models, but the topic of recovery models is critical to this problem. In SQL Server, there are three recovery models – Full, Bulk-Logged, and Simple. We'll ignore Bulk-Logged for now; it is something of a hybrid model, and most people who use it are there for a reason and already understand recovery models. The two we care about – and the confusion between them, which causes the majority of these cases – are Simple and Full.
Before we talk about Recovery Models – Let’s talk about recovery in general.
1.    The transaction log file is there for crash/restart recovery – for the rolling forward of work that was done (redo) before a crash or restart, and the rolling back of work that was started but not finished (undo). It is the job of the transaction log to see that a transaction was started but never finished (rolled back, or a crash/restart happened before the transaction committed). In that situation, it is the log's job to say "Hey, this never really finished; let's roll it back" during recovery. It is also the log's job to see that you did finish something and that your client application was told it was finished (even if it hadn't yet hardened to your data file), and to say "Hey, this really happened; let's roll it forward, let's make it like the application thinks it is" after a restart. There is more, but that is the main purpose.
2.    The other purpose of a transaction log file is to give us the ability to recover to a point in time after an "oops" in a database, or to guarantee a recovery point in the event of a hardware failure involving the data and/or log files of a database. If the transaction log contains the records of transactions that have been started and finished, SQL Server can and does use this information to get a database back to where it was before an issue happened. But that isn't always an available option for us. For that to work, we have to have our database in the right recovery model, and we have to take log backups.
Onto the recovery models:
Simple Recovery Model – With the above introduction, it is easiest to talk about the Simple Recovery model first. In this model, you are telling SQL Server: "I am fine with you using your transaction log file for crash and restart recovery (you really have no choice there – look up the ACID properties and that should make sense quickly), but once you no longer need it for that crash/restart recovery purpose, go ahead and reuse the log file."
SQL Server listens to this request in Simple Recovery and keeps only the information it needs for crash/restart recovery. Once SQL Server is sure it can recover because the data is hardened to the data file (more or less), the hardened data is no longer necessary in the log and is marked for truncation – which means its space gets reused.
Full Recovery Model – With Full Recovery, you are telling SQL Server that you want to be able to recover to a specific point in time, as long as your log file is available, or to a specific point in time that is covered by a log backup. In this case, when SQL Server reaches the point where it would be safe to truncate the log file in the Simple Recovery model, it will not do that. Instead, it lets the log file continue to grow, and will allow it to keep growing until you take a log backup (or run out of space on your log file drive) under normal circumstances.
There are rules and exceptions here. We'll talk about long-running transactions below, but one to keep in mind for Full mode: if you just switch into the Full Recovery model but never take an initial full backup, SQL Server will not honor your request to be in Full Recovery. Your transaction log will continue to operate as it did in Simple until you switch to the Full Recovery model AND take your first full backup.
So, that’s the most common reason for uncontrolled log growth: Being in Full Recovery mode without having any log backups. This happens all the time to people.
(Why does it happen all the time? Because each new database gets its initial recovery model setting by looking at the model database. Model’s initial recovery model setting is always Full Recovery Model – until and unless someone changes that. So you could say the “default Recovery Model” is Full. Many people are not aware of this and have their databases running in Full Recovery Model with no log backups, and therefore a transaction log file much larger than necessary. This is why it is important to change defaults when they don’t work for your organization and its needs)
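A quick way to check which recovery model each database is using – and to change the default inherited from model, if Full doesn't fit your needs. (The SalesDB database name below is purely illustrative.)

```sql
-- Check the recovery model of every database
SELECT [name], recovery_model_desc
FROM sys.databases;

-- Change the default for future databases by changing model
ALTER DATABASE [model] SET RECOVERY SIMPLE;

-- Or switch an existing database (here a hypothetical SalesDB)
ALTER DATABASE [SalesDB] SET RECOVERY FULL;
```

Remember that switching to Full only takes real effect once you take that first full backup.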
You can also get yourself in trouble here by not taking log backups frequently enough. Taking one log backup a day may sound fine – it makes a restore require fewer restore commands – but keeping in mind the discussion above, that log file will continue to grow and grow until you take log backups. You need to consider your log backup frequency with two things in mind:
1.    Recovery Needs – This should hopefully come first. In the event that the drive housing your transaction log goes bad, or you get serious corruption that affects your log backup, how much data can be lost? If that number is no more than 10-15 minutes, then you need to be taking a log backup every 10-15 minutes; end of discussion.
2.    Log Growth – If your organization is fine losing more data because of the ability to easily recreate that day, you may be fine taking log backups much less frequently than every 15 minutes. Maybe your organization is fine with every 4 hours. But you have to look at how many transactions you generate in 4 hours. Will allowing the log to keep growing for those four hours make too large a log file? Will it make your log backups take too long?
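Once a database has its first full backup, the log backup itself is a one-liner; a minimal sketch, where the database name and backup paths are illustrative:

```sql
-- One-time full backup establishes the log backup chain
BACKUP DATABASE [SalesDB]
TO DISK = N'D:\Backups\SalesDB_full.bak';

-- Log backup, typically scheduled (e.g. every 10-15 minutes via SQL Agent)
BACKUP LOG [SalesDB]
TO DISK = N'D:\Backups\SalesDB_log.trn';
```

Each log backup allows the inactive portion of the log to be truncated and reused, which is what keeps the file from growing without bound.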

2.) Long Running Transactions

(“My recovery model is fine! The log is still growing!”)
This can also be a cause of uncontrolled and unrestrained log growth, no matter the recovery model. It often comes up as: “But I’m in Simple Recovery Model – why is my log still growing?!”
The reason here is simple: if SQL Server is using the transaction log for recovery purposes as described above, it has to be able to see back to the start of a transaction. If you have a transaction that takes a long time or makes a lot of changes, the log cannot truncate on checkpoint any of the changes that are still in open transactions, or that started after the open transaction began.
This means that a big delete – deleting millions of rows in one DELETE statement – is one transaction, and the log cannot do any truncating until that whole delete is done. In the Full Recovery model this delete is fully logged, and that can be a lot of log records. The same goes for index optimization work during maintenance windows. It also means that poor transaction management – not watching for and closing open transactions – can really hurt you and your log file.
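To check whether an open transaction is what is pinning your log, DBCC OPENTRAN reports the oldest active transaction in the current database, and the transaction DMVs show active transactions across the instance:

```sql
-- Show the oldest active transaction (if any) in the current database
DBCC OPENTRAN;

-- Or inspect active transactions instance-wide
SELECT st.session_id,
       dt.transaction_id,
       dt.database_transaction_begin_time
FROM sys.dm_tran_database_transactions AS dt
JOIN sys.dm_tran_session_transactions AS st
  ON dt.transaction_id = st.transaction_id;
```

If a session shows a begin time hours in the past, that transaction is what is preventing log truncation.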
You can save yourself here by:
  • Properly sizing your log file to account for the worst-case scenario – like your maintenance or known large operations. When you grow your log file, you should look to this guidance (and the two links she sends you to) by Kimberly Tripp. Right-sizing is critical here.
  • Watching your usage of transactions. Don’t start a transaction in your application server, start having long conversations with SQL Server, and risk leaving one open too long.
  • Watching the implied transactions in your DML statements. UPDATE TableName SET Col1 = 'New Value' is a transaction. I didn’t put a BEGIN TRAN there and I don’t have to; it is one transaction that automatically commits when done. So if you are doing operations on large numbers of rows, consider batching those operations into more manageable chunks and giving the log time to recover. Or consider the right size to deal with that, or perhaps look into changing recovery models during a bulk load window.
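Batching a big delete can be sketched like this; the table name, filter, and batch size are all illustrative:

```sql
-- Delete in chunks so the log can truncate between batches
DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (5000) FROM dbo.BigTable
    WHERE CreatedDate < '2010-01-01';

    SET @rows = @@ROWCOUNT;

    -- In Simple recovery a checkpoint lets the log truncate;
    -- in Full recovery, your scheduled log backups do that job instead
    CHECKPOINT;
END
```

Each iteration is its own short transaction, so the log never has to hold millions of row deletions at once.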

What about Log Shipping?

“I’m using log shipping, so my log backups are automated… Why am I still seeing transaction log growth?”
Log shipping is just what it sounds like – you are shipping your transaction log backups to another server for DR purposes. The process is fairly simple – after the initialization – a job to back up the log on one server, a job to copy that log backup, and a job to restore it without recovery (either NORECOVERY or STANDBY) on the destination server. There are also some jobs to monitor and alert if things don’t go as you have them planned.
In some cases, you may only want to do the log shipping restore once a day or every 3rd day or once a week. That is fine. But if you make this change on all of the jobs (including the log backup and copy jobs) that means you are waiting all that time to take a log backup. That means you will have a lot of log growth – because you are in full recovery mode without log backups, and it also likely means a large log file to copy across. You should only modify the restore job’s schedule and let the log backups and copies happen on a more frequent basis, otherwise you will suffer from the first issue described in this answer.

Getting Info on Your Cause

There are reasons other than these two, but they are the most common. Regardless of the cause, there is a way you can analyze the reason for this unexplained log growth/lack of truncation and see what it is:
By querying the sys.databases catalog view, you can see information describing the reason your log file may be waiting on truncation/reuse. There is a column called log_reuse_wait with a lookup ID of the reason code, and a log_reuse_wait_desc column with a description of the wait reason. From the referenced Books Online article, here are the majority of the reasons (the ones you are likely to see and the ones we can explain; the missing ones are either out of use or for internal use), with a few notes about each wait:
  • 0 = Nothing – What it sounds like; the log shouldn’t be waiting.
  • 1 = Checkpoint – Waiting for a checkpoint to occur. This should happen and you should be fine – but there are some cases to look for here for later answers or edits.
  • 2 = Log backup – You are waiting for a log backup to occur. Either you have them scheduled and it will happen soon, or you have the first problem described here and you now know how to fix it.
  • 3 = Active backup or restore – A backup or restore operation is running on the database.
  • 4 = Active transaction – There is an active transaction that needs to complete (either way – ROLLBACK or COMMIT) before the log can be truncated. This is the second reason described in this answer.
  • 5 = Database mirroring – Either a mirror is getting behind, or under some latency in a high-performance mirroring situation, or mirroring is paused for some reason.
  • 6 = Replication – There can be issues with replication that cause this – like a log reader agent not running, a database thinking it is marked for replication when it no longer is, and various other reasons. You can also see this reason when it is perfectly normal, because you are looking at just the right time, just as transactions are being consumed by the log reader.
  • 7 = Database snapshot creation – You are creating a database snapshot; you’ll see this if you look at just the right moment as a snapshot is being created.
  • 8 = Log scan – I have yet to encounter an issue with this running along forever. If you look long enough and frequently enough you can see it happen, but it shouldn’t be a cause of excessive transaction log growth that I’ve seen.
  • 9 = An AlwaysOn Availability Groups secondary replica is applying transaction log records of this database to a corresponding secondary database – About the clearest description yet.
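Putting that together, a quick diagnostic query against sys.databases:

```sql
-- Why is each database's log waiting to truncate/reuse?
SELECT [name],
       recovery_model_desc,
       log_reuse_wait,
       log_reuse_wait_desc
FROM sys.databases;
```

Databases in Full Recovery with a LOG_BACKUP wait are the classic signature of the first problem described above.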
Originally written by Mike Walsh

Filtered Index with variable

A filtered index is a great way to improve your SQL performance and save a vast amount of server resources when processing a big data query.
Whenever you have a filtered index, you may find that it’s not used when its columns are compared to a variable. The thing is, SQL Server will always build an execution plan that works for any values of those variables (including values that don’t match the index’s filter).
Thus SQL Server will not use the filtered index.
There’s an easy way to fix that: adding the filtering clauses to your statement.
For instance, if you have an index on column ID with a filter predicate (say, ID > 0), instead of using a clause like this: WHERE ID = @pID, use a clause like this: WHERE ID = @pID AND ID > 0. SQL Server will then know it can use the filter.
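A minimal sketch of the pattern; the table, index name, and filter predicate are all illustrative:

```sql
-- A filtered index covering only positive IDs
CREATE NONCLUSTERED INDEX IX_Orders_ID_Filtered
ON dbo.Orders (ID)
WHERE ID > 0;

DECLARE @pID INT = 42;

-- May ignore the filtered index: the plan must be valid for any @pID
SELECT * FROM dbo.Orders WHERE ID = @pID;

-- Repeating the filter predicate lets the optimizer match the index
SELECT * FROM dbo.Orders WHERE ID = @pID AND ID > 0;
```

Adding OPTION (RECOMPILE) to the statement is another way to let the optimizer compile a plan for the variable's actual value, at the cost of a compile on every execution.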

How to maintain a history of all data modifications made to a table, including the type of modification and modified values?

Change data capture

Change data capture (CDC) is designed to capture insert, update, and delete activity applied to SQL Server tables, and to make the details of the changes available in an easily consumed relational format. The change tables used by change data capture contain columns that mirror the column structure of a tracked source table, along with the metadata needed to understand the changes that have occurred.

CDC minimizes the resources required for ETL (extract, transform, load) processes because it only deals with data changes.

Change data capture is available only on the Enterprise, Developer, and Evaluation editions of SQL Server.

How to check if you have enabled Change data capture on your database?

The following code will check if you have enabled change data capture on your database:
SELECT [name], database_id, is_cdc_enabled  
FROM sys.databases   
WHERE [name] = 'AdventureWorks'

Enabling Change data capture on database level

The following code will enable Change data capture on database level:

EXEC sys.sp_cdc_enable_db 

After enabling change data capture at the database level, you will notice that a new schema, 'cdc', was created.

Also the following tables are created under cdc schema:

  • cdc.captured_columns – Returns the list of captured columns.
  • cdc.change_tables – Returns the list of all tables that are enabled for capture.
  • cdc.ddl_history – Contains the history of all DDL changes since change data capture was enabled.
  • cdc.index_columns – Contains the indexes associated with the change tables.
  • cdc.lsn_time_mapping – Maps LSN numbers (more on these later) to times.
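As a preview of that LSN-to-time mapping, the built-in CDC functions can translate an LSN into a readable timestamp:

```sql
-- Current maximum captured LSN, and the time it maps to
SELECT sys.fn_cdc_get_max_lsn() AS max_lsn,
       sys.fn_cdc_map_lsn_to_time(sys.fn_cdc_get_max_lsn()) AS max_lsn_time;
```

These functions are available once CDC is enabled on the database.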

Enabling Change Data Capture on Database Tables

First, we can check whether CDC has been enabled for a table with the following query:
SELECT [name], is_tracked_by_cdc  
FROM sys.tables 
We can enable change data capture on a table with the following stored procedure:

EXEC sys.sp_cdc_enable_table 
@source_schema = N'Person', 
@source_name   = N'Contact', 
@role_name     = NULL 

Before executing the stored procedure, make sure that SQL Server Agent is running, since enabling CDC on tables causes the creation of two agent jobs:
  • cdc.AdventureWorks_capture 
  • cdc.AdventureWorks_cleanup 
After everything completes successfully, you will notice a new table, cdc.Person_Contact_CT, where all changes will be tracked.

In that table two columns are really important, __$operation and __$update_mask.

Column __$operation contains a value which corresponds to the DML operation.

This is the list of values and their meanings:
  • 1 = Delete statement
  • 2 = Insert statement
  • 3 = Value before update statement
  • 4 = Value after update statement

The column __$update_mask shows, via a bitmap, which columns were updated in the DML operation specified by __$operation. If the operation was a DELETE or INSERT, all columns are affected, so the mask has all of its bits set to 1. Each bit in the mask corresponds to one captured column.
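Once changes have been made to the tracked table, they can be read back through the table-valued function that CDC generates for the capture instance (the Person_Contact instance name follows from the example above):

```sql
-- Read all captured changes for the Person.Contact capture instance
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('Person_Contact');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

SELECT __$operation, __$update_mask, *
FROM cdc.fn_cdc_get_all_changes_Person_Contact(@from_lsn, @to_lsn, N'all');
```

Each row carries the __$operation and __$update_mask values described above alongside the captured column values.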